diff options
-rw-r--r-- | Contributing.md | 86 | ||||
-rw-r--r-- | README.rst | 4 | ||||
-rw-r--r-- | doc/docs/lexerdevelopment.rst | 16 | ||||
-rw-r--r-- | doc/languages.rst | 3 |
4 files changed, 100 insertions, 9 deletions
diff --git a/Contributing.md b/Contributing.md index ac17dd68..0d831f0b 100644 --- a/Contributing.md +++ b/Contributing.md @@ -26,7 +26,7 @@ Contribution checklist [a new formatter](https://pygments.org/docs/formatterdevelopment/) or [a new filter](https://pygments.org/docs/filterdevelopment/) -* Make sure to add a test for your new functionality, and where applicable, +* Make sure to add a test for your new functionality, and where applicable, write documentation. * When writing rules, try to merge simple rules. For instance, combine: @@ -53,6 +53,90 @@ Contribution checklist instead of matching ``@first@`` and ``@second@``. You can use ``@.*?@`` in this case to stop early. The ``?`` tries to match _as few times_ as possible. +* Beware of so-called "catastrophic backtracking". As a first example, consider + the regular expression ``(A+)*C``. This is equivalent to ``A*B`` regarding + what it matches, but *non*-matches will take very long. This is because + of the way the regular expression engine works. Suppose you feed it 50 + 'A's, and a 'C' at the end. It first matches the 'A's greedily in ``A+``, + but finds that it cannot match the end since 'B' is not the same as 'C'. + Then it backtracks, removing one 'A' from the first ``A+`` and trying to + match the rest as another ``(A+)*``. This fails again, so it backtracks + further left in the input string, etc. In effect, it tries all combinations + + ``` + (AAAAAAAAAAAAAAAAA) + (AAAAAAAAAAAAAAAA)(A) + (AAAAAAAAAAAAAAA)(AA) + (AAAAAAAAAAAAAAA)(A)(A) + (AAAAAAAAAAAAAA)(AAA) + (AAAAAAAAAAAAAA)(AA)(A) + ... + ``` + + Thus, the matching has exponential complexity. In a lexer, the + effect is that Pygments will seemingly hang when parsing invalid + input. + + ```python + >>> import re + >>> re.match('(A+)*B', 'A'*50 + 'C') # hangs + ``` + + As a more subtle and real-life example, here is a badly written + regular expression to match strings: + + ```python + r'"(\\?.)*?"' + ``` + + If the ending quote is missing, the regular expression engine will + find that it cannot match at the end, and try to backtrack with less + matches in the ``*?``. When it finds a backslash, as it has already + tried the possibility ``\\.``, it tries ``.`` (recognizing it as a + simple character without meaning), which leads to the same + exponential backtracking problem if there are lots of backslashes in + the (invalid) input string. A good way to write this would be + ``r'"([^\\]|\\.)*?"'``, where the inner group can only match in one + way. Better yet is to use a dedicated state, which not only + sidesteps the issue without headaches, but allows you to highlight + string escapes. + + ```python + 'root': [ + ..., + (r'"', String, 'string'), + ... + ], + 'string': [ + (r'\\.', String.Escape), + (r'"', String, '#pop'), + (r'[^\\"]+', String), + ] + ``` + +* When writing rules for patterns such as comments or strings, match as many + characters as possible in each token. This is an example of what not to + do: + + ```python + 'comment': [ + (r'\*/', Comment.Multiline, '#pop'), + (r'.', Comment.Multiline), + ] + ``` + + This generates one token per character in the comment, which slows + down the lexing process, and also makes the raw token output (and in + particular the test output) hard to read. Do this instead: + + ```python + 'comment': [ + (r'\*/', Comment.Multiline, '#pop'), + (r'[^*]+', Comment.Multiline), + (r'\*', Comment.Multiline), + ] + ``` + * Don't add imports of your lexer anywhere in the codebase. (In case you're curious about ``compiled.py`` -- this file exists for backwards compatibility reasons.) @@ -30,6 +30,10 @@ Continuous testing runs on GitHub workflows: .. image:: https://github.com/pygments/pygments/workflows/Pygments/badge.svg :target: https://github.com/pygments/pygments/actions?query=workflow%3APygments +Contribution guidelines are found in Contributing.md_. + +.. _Contributing.md: https://github.com/pygments/pygments/blob/master/Contributing.md + The authors ----------- diff --git a/doc/docs/lexerdevelopment.rst b/doc/docs/lexerdevelopment.rst index 720a804b..b883d3e6 100644 --- a/doc/docs/lexerdevelopment.rst +++ b/doc/docs/lexerdevelopment.rst @@ -291,7 +291,7 @@ and single-line with ``//`` until end of line):: (r'/', Text) ], 'comment': [ - (r'[^*/]', Comment.Multiline), + (r'[^*/]+', Comment.Multiline), (r'/\*', Comment.Multiline, '#push'), (r'\*/', Comment.Multiline, '#pop'), (r'[*/]', Comment.Multiline) @@ -346,7 +346,7 @@ There are a few more things you can do with states: ... ], 'directive': [ - (r'[^>]*', Comment.Directive), + (r'[^>]+', Comment.Directive), (r'>', Comment, '#pop'), ], 'comment': [ @@ -376,20 +376,20 @@ There are a few more things you can do with states: class ExampleLexer(RegexLexer): tokens = { 'comments': [ - (r'/\*.*?\*/', Comment), + (r'(?s)/\*.*?\*/', Comment), (r'//.*?\n', Comment), ], 'root': [ include('comments'), - (r'(function )(\w+)( {)', - bygroups(Keyword, Name, Keyword), 'function'), - (r'.', Text), + (r'(function)( )(\w+)( )({)', + bygroups(Keyword, Whitespace, Name, Whitespace, Punctuation), 'function'), + (r'.*\n', Text), ], 'function': [ (r'[^}/]+', Text), include('comments'), (r'/', Text), - (r'\}', Keyword, '#pop'), + (r'\}', Punctuation, '#pop'), ] } @@ -455,7 +455,7 @@ defined in the parent and child class are merged. For example:: ('[a-z]+', Name), (r'/\*', Comment, 'comment'), ('"', String, 'string'), - ('\s+', Text), + (r'\s+', Text), ], 'string': [ ('[^"]+', String), diff --git a/doc/languages.rst b/doc/languages.rst index 95d1d899..460ea676 100644 --- a/doc/languages.rst +++ b/doc/languages.rst @@ -367,6 +367,9 @@ prompt. Well, why not write your own? Contributing to Pygments is easy and fun. Take a look at the :doc:`docs on lexer development <docs/lexerdevelopment>`. Pull requests are welcome on `GitHub <https://github.com/pygments/pygments>`_. +Be sure to read the `contributing guidelines`_. + +.. _contributing guidelines: https://github.com/pygments/pygments/blob/master/Contributing.md Note: the languages listed here are supported in the development version. The latest release may lack a few of them. |