Add notes to Contributing.md about common mistakes in lexers (#2075)

author: Jean Abou-Samra <jean@abou-samra.fr> 2022-03-02 06:41:47 +0100
committer: GitHub <noreply@github.com> 2022-03-02 06:41:47 +0100
commit: 4752b7d9af7d47ffc993edb48ebe222bfba919b9 (patch)
tree: ea642cda9bf499aa768d4b8486f0fada5e3c942f
parent: 8d2e31d1966f42808595bccc3384062c2942fc74 (diff)
download: pygments-git-4752b7d9af7d47ffc993edb48ebe222bfba919b9.tar.gz
4 files changed, 100 insertions, 9 deletions
diff --git a/Contributing.md b/Contributing.md
index ac17dd68..0d831f0b 100644
--- a/Contributing.md
+++ b/Contributing.md
@@ -26,7 +26,7 @@ Contribution checklist
   [a new formatter](https://pygments.org/docs/formatterdevelopment/) or
   [a new filter](https://pygments.org/docs/filterdevelopment/)
 
-* Make sure to add a test for your new functionality, and where applicable, 
+* Make sure to add a test for your new functionality, and where applicable,
   write documentation.
 
 * When writing rules, try to merge simple rules. For instance, combine:
@@ -53,6 +53,90 @@ Contribution checklist
   instead of matching ``@first@`` and ``@second@``. You can use ``@.*?@`` in
   this case to stop early. The ``?`` tries to match _as few times_ as possible.
 
+* Beware of so-called "catastrophic backtracking".  As a first example, consider
+  the regular expression ``(A+)*C``.  This is equivalent to ``A*B`` regarding
+  what it matches, but *non*-matches will take very long.  This is because
+  of the way the regular expression engine works.  Suppose you feed it 50
+  'A's, and a 'C' at the end.  It first matches the 'A's greedily in ``A+``,
+  but finds that it cannot match the end since 'B' is not the same as 'C'.
+  Then it backtracks, removing one 'A' from the first ``A+`` and trying to
+  match the rest as another ``(A+)*``.  This fails again, so it backtracks
+  further left in the input string, etc.  In effect, it tries all combinations
+
+  ```
+  (AAAAAAAAAAAAAAAAA)
+  (AAAAAAAAAAAAAAAA)(A)
+  (AAAAAAAAAAAAAAA)(AA)
+  (AAAAAAAAAAAAAAA)(A)(A)
+  (AAAAAAAAAAAAAA)(AAA)
+  (AAAAAAAAAAAAAA)(AA)(A)
+  ...
+  ```
+
+  Thus, the matching has exponential complexity.  In a lexer, the
+  effect is that Pygments will seemingly hang when parsing invalid
+  input.
+
+  ```python
+  >>> import re
+  >>> re.match('(A+)*B', 'A'*50 + 'C') # hangs
+  ```
+
+  As a more subtle and real-life example, here is a badly written
+  regular expression to match strings:
+
+  ```python
+  r'"(\\?.)*?"'
+  ```
+
+  If the ending quote is missing, the regular expression engine will
+  find that it cannot match at the end, and try to backtrack with less
+  matches in the ``*?``.  When it finds a backslash, as it has already
+  tried the possibility ``\\.``, it tries ``.`` (recognizing it as a
+  simple character without meaning), which leads to the same
+  exponential backtracking problem if there are lots of backslashes in
+  the (invalid) input string.  A good way to write this would be
+  ``r'"([^\\]|\\.)*?"'``, where the inner group can only match in one
+  way.  Better yet is to use a dedicated state, which not only
+  sidesteps the issue without headaches, but allows you to highlight
+  string escapes.
+
+  ```python
+  'root': [
+      ...,
+      (r'"', String, 'string'),
+      ...
+  ],
+  'string': [
+      (r'\\.', String.Escape),
+      (r'"', String, '#pop'),
+      (r'[^\\"]+', String),
+  ]
+  ```
+
+* When writing rules for patterns such as comments or strings, match as many
+  characters as possible in each token.  This is an example of what not to
+  do:
+
+  ```python
+  'comment': [
+      (r'\*/', Comment.Multiline, '#pop'),
+      (r'.', Comment.Multiline),
+  ]
+  ```
+
+  This generates one token per character in the comment, which slows
+  down the lexing process, and also makes the raw token output (and in
+  particular the test output) hard to read.  Do this instead:
+
+  ```python
+  'comment': [
+      (r'\*/', Comment.Multiline, '#pop'),
+      (r'[^*]+', Comment.Multiline),
+      (r'\*', Comment.Multiline),
+  ]
+  ```
+
 * Don't add imports of your lexer anywhere in the codebase. (In case you're
   curious about ``compiled.py`` -- this file exists for backwards compatibility
   reasons.)
diff --git a/README.rst b/README.rst
index 77b2d565..9974cbbe 100644
--- a/README.rst
+++ b/README.rst
@@ -30,6 +30,10 @@ Continuous testing runs on GitHub workflows:
 .. image:: https://github.com/pygments/pygments/workflows/Pygments/badge.svg
    :target: https://github.com/pygments/pygments/actions?query=workflow%3APygments
 
+Contribution guidelines are found in Contributing.md_.
+
+.. _Contributing.md: https://github.com/pygments/pygments/blob/master/Contributing.md
+
 The authors
 -----------
 
diff --git a/doc/docs/lexerdevelopment.rst b/doc/docs/lexerdevelopment.rst
index 720a804b..b883d3e6 100644
--- a/doc/docs/lexerdevelopment.rst
+++ b/doc/docs/lexerdevelopment.rst
@@ -291,7 +291,7 @@ and single-line with ``//`` until end of line)::
                 (r'/', Text)
             ],
             'comment': [
-                (r'[^*/]', Comment.Multiline),
+                (r'[^*/]+', Comment.Multiline),
                 (r'/\*', Comment.Multiline, '#push'),
                 (r'\*/', Comment.Multiline, '#pop'),
                 (r'[*/]', Comment.Multiline)
@@ -346,7 +346,7 @@ There are a few more things you can do with states:
               ...
           ],
           'directive': [
-              (r'[^>]*', Comment.Directive),
+              (r'[^>]+', Comment.Directive),
               (r'>', Comment, '#pop'),
           ],
           'comment': [
@@ -376,20 +376,20 @@ There are a few more things you can do with states:
       class ExampleLexer(RegexLexer):
           tokens = {
               'comments': [
-                  (r'/\*.*?\*/', Comment),
+                  (r'(?s)/\*.*?\*/', Comment),
                   (r'//.*?\n', Comment),
               ],
               'root': [
                   include('comments'),
-                  (r'(function )(\w+)( {)',
-                   bygroups(Keyword, Name, Keyword), 'function'),
-                  (r'.', Text),
+                  (r'(function)( )(\w+)( )({)',
+                   bygroups(Keyword, Whitespace, Name, Whitespace, Punctuation), 'function'),
+                  (r'.*\n', Text),
               ],
               'function': [
                   (r'[^}/]+', Text),
                   include('comments'),
                   (r'/', Text),
-                  (r'\}', Keyword, '#pop'),
+                  (r'\}', Punctuation, '#pop'),
               ]
           }
 
@@ -455,7 +455,7 @@ defined in the parent and child class are merged.  For example::
                   ('[a-z]+', Name),
                   (r'/\*', Comment, 'comment'),
                   ('"', String, 'string'),
-                  ('\s+', Text),
+                  (r'\s+', Text),
               ],
               'string': [
                   ('[^"]+', String),
diff --git a/doc/languages.rst b/doc/languages.rst
index 95d1d899..460ea676 100644
--- a/doc/languages.rst
+++ b/doc/languages.rst
@@ -367,6 +367,9 @@ prompt.
 Well, why not write your own? Contributing to Pygments is easy and fun.  Take a
 look at the :doc:`docs on lexer development <docs/lexerdevelopment>`.  Pull
 requests are welcome on `GitHub <https://github.com/pygments/pygments>`_.
+Be sure to read the `contributing guidelines`_.
+
+.. _contributing guidelines: https://github.com/pygments/pygments/blob/master/Contributing.md
 
 Note: the languages listed here are supported in the development version. The
 latest release may lack a few of them.
author	Jean Abou-Samra <jean@abou-samra.fr>	2022-03-02 06:41:47 +0100
committer	GitHub <noreply@github.com>	2022-03-02 06:41:47 +0100
commit	4752b7d9af7d47ffc993edb48ebe222bfba919b9 (patch)
tree	ea642cda9bf499aa768d4b8486f0fada5e3c942f
parent	8d2e31d1966f42808595bccc3384062c2942fc74 (diff)
download	pygments-git-4752b7d9af7d47ffc993edb48ebe222bfba919b9.tar.gz