summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorJean Abou-Samra <jean@abou-samra.fr>2022-03-02 06:41:47 +0100
committerGitHub <noreply@github.com>2022-03-02 06:41:47 +0100
commit4752b7d9af7d47ffc993edb48ebe222bfba919b9 (patch)
treeea642cda9bf499aa768d4b8486f0fada5e3c942f
parent8d2e31d1966f42808595bccc3384062c2942fc74 (diff)
downloadpygments-git-4752b7d9af7d47ffc993edb48ebe222bfba919b9.tar.gz
Add notes to Contributing.md about common mistakes in lexers (#2075)
-rw-r--r--Contributing.md86
-rw-r--r--README.rst4
-rw-r--r--doc/docs/lexerdevelopment.rst16
-rw-r--r--doc/languages.rst3
4 files changed, 100 insertions, 9 deletions
diff --git a/Contributing.md b/Contributing.md
index ac17dd68..0d831f0b 100644
--- a/Contributing.md
+++ b/Contributing.md
@@ -26,7 +26,7 @@ Contribution checklist
[a new formatter](https://pygments.org/docs/formatterdevelopment/) or
[a new filter](https://pygments.org/docs/filterdevelopment/)
-* Make sure to add a test for your new functionality, and where applicable,
+* Make sure to add a test for your new functionality, and where applicable,
write documentation.
* When writing rules, try to merge simple rules. For instance, combine:
@@ -53,6 +53,90 @@ Contribution checklist
instead of matching ``@first@`` and ``@second@``. You can use ``@.*?@`` in
this case to stop early. The ``?`` tries to match _as few times_ as possible.
+* Beware of so-called "catastrophic backtracking". As a first example, consider
+ the regular expression ``(A+)*C``. This is equivalent to ``A*B`` regarding
+ what it matches, but *non*-matches will take very long. This is because
+ of the way the regular expression engine works. Suppose you feed it 50
+ 'A's, and a 'C' at the end. It first matches the 'A's greedily in ``A+``,
+ but finds that it cannot match the end since 'B' is not the same as 'C'.
+ Then it backtracks, removing one 'A' from the first ``A+`` and trying to
+ match the rest as another ``(A+)*``. This fails again, so it backtracks
+ further left in the input string, etc. In effect, it tries all combinations
+
+ ```
+ (AAAAAAAAAAAAAAAAA)
+ (AAAAAAAAAAAAAAAA)(A)
+ (AAAAAAAAAAAAAAA)(AA)
+ (AAAAAAAAAAAAAAA)(A)(A)
+ (AAAAAAAAAAAAAA)(AAA)
+ (AAAAAAAAAAAAAA)(AA)(A)
+ ...
+ ```
+
+ Thus, the matching has exponential complexity. In a lexer, the
+ effect is that Pygments will seemingly hang when parsing invalid
+ input.
+
+ ```python
+ >>> import re
+ >>> re.match('(A+)*B', 'A'*50 + 'C') # hangs
+ ```
+
+ As a more subtle and real-life example, here is a badly written
+ regular expression to match strings:
+
+ ```python
+ r'"(\\?.)*?"'
+ ```
+
+ If the ending quote is missing, the regular expression engine will
+ find that it cannot match at the end, and try to backtrack with less
+ matches in the ``*?``. When it finds a backslash, as it has already
+ tried the possibility ``\\.``, it tries ``.`` (recognizing it as a
+ simple character without meaning), which leads to the same
+ exponential backtracking problem if there are lots of backslashes in
+ the (invalid) input string. A good way to write this would be
+ ``r'"([^\\]|\\.)*?"'``, where the inner group can only match in one
+ way. Better yet is to use a dedicated state, which not only
+ sidesteps the issue without headaches, but allows you to highlight
+ string escapes.
+
+ ```python
+ 'root': [
+ ...,
+ (r'"', String, 'string'),
+ ...
+ ],
+ 'string': [
+ (r'\\.', String.Escape),
+ (r'"', String, '#pop'),
+ (r'[^\\"]+', String),
+ ]
+ ```
+
+* When writing rules for patterns such as comments or strings, match as many
+ characters as possible in each token. This is an example of what not to
+ do:
+
+ ```python
+ 'comment': [
+ (r'\*/', Comment.Multiline, '#pop'),
+ (r'.', Comment.Multiline),
+ ]
+ ```
+
+ This generates one token per character in the comment, which slows
+ down the lexing process, and also makes the raw token output (and in
+ particular the test output) hard to read. Do this instead:
+
+ ```python
+ 'comment': [
+ (r'\*/', Comment.Multiline, '#pop'),
+ (r'[^*]+', Comment.Multiline),
+ (r'\*', Comment.Multiline),
+ ]
+ ```
+
* Don't add imports of your lexer anywhere in the codebase. (In case you're
curious about ``compiled.py`` -- this file exists for backwards compatibility
reasons.)
diff --git a/README.rst b/README.rst
index 77b2d565..9974cbbe 100644
--- a/README.rst
+++ b/README.rst
@@ -30,6 +30,10 @@ Continuous testing runs on GitHub workflows:
.. image:: https://github.com/pygments/pygments/workflows/Pygments/badge.svg
:target: https://github.com/pygments/pygments/actions?query=workflow%3APygments
+Contribution guidelines are found in Contributing.md_.
+
+.. _Contributing.md: https://github.com/pygments/pygments/blob/master/Contributing.md
+
The authors
-----------
diff --git a/doc/docs/lexerdevelopment.rst b/doc/docs/lexerdevelopment.rst
index 720a804b..b883d3e6 100644
--- a/doc/docs/lexerdevelopment.rst
+++ b/doc/docs/lexerdevelopment.rst
@@ -291,7 +291,7 @@ and single-line with ``//`` until end of line)::
(r'/', Text)
],
'comment': [
- (r'[^*/]', Comment.Multiline),
+ (r'[^*/]+', Comment.Multiline),
(r'/\*', Comment.Multiline, '#push'),
(r'\*/', Comment.Multiline, '#pop'),
(r'[*/]', Comment.Multiline)
@@ -346,7 +346,7 @@ There are a few more things you can do with states:
...
],
'directive': [
- (r'[^>]*', Comment.Directive),
+ (r'[^>]+', Comment.Directive),
(r'>', Comment, '#pop'),
],
'comment': [
@@ -376,20 +376,20 @@ There are a few more things you can do with states:
class ExampleLexer(RegexLexer):
tokens = {
'comments': [
- (r'/\*.*?\*/', Comment),
+ (r'(?s)/\*.*?\*/', Comment),
(r'//.*?\n', Comment),
],
'root': [
include('comments'),
- (r'(function )(\w+)( {)',
- bygroups(Keyword, Name, Keyword), 'function'),
- (r'.', Text),
+ (r'(function)( )(\w+)( )({)',
+ bygroups(Keyword, Whitespace, Name, Whitespace, Punctuation), 'function'),
+ (r'.*\n', Text),
],
'function': [
(r'[^}/]+', Text),
include('comments'),
(r'/', Text),
- (r'\}', Keyword, '#pop'),
+ (r'\}', Punctuation, '#pop'),
]
}
@@ -455,7 +455,7 @@ defined in the parent and child class are merged. For example::
('[a-z]+', Name),
(r'/\*', Comment, 'comment'),
('"', String, 'string'),
- ('\s+', Text),
+ (r'\s+', Text),
],
'string': [
('[^"]+', String),
diff --git a/doc/languages.rst b/doc/languages.rst
index 95d1d899..460ea676 100644
--- a/doc/languages.rst
+++ b/doc/languages.rst
@@ -367,6 +367,9 @@ prompt.
Well, why not write your own? Contributing to Pygments is easy and fun. Take a
look at the :doc:`docs on lexer development <docs/lexerdevelopment>`. Pull
requests are welcome on `GitHub <https://github.com/pygments/pygments>`_.
+Be sure to read the `contributing guidelines`_.
+
+.. _contributing guidelines: https://github.com/pygments/pygments/blob/master/Contributing.md
Note: the languages listed here are supported in the development version. The
latest release may lack a few of them.