Licensing
=========

The code is distributed under the BSD 2-clause license. Contributors making pull
requests must agree that they are able and willing to put their contributions
under that license.

Goals & non-goals of Pygments
=============================

Python support
--------------

Pygments supports all Python versions that are supported upstream, as listed in
the [Python Developer's Guide](https://devguide.python.org/versions/).
Additionally, the default Python versions of the latest stable releases of
RHEL, Ubuntu LTS, and Debian are supported, even if they are officially EOL.
Supporting other end-of-life versions is a non-goal of Pygments.

Validation
----------

Pygments does not attempt to validate its input. Accepting code that is not
legal for a given language is fine if it simplifies the codebase and does not
result in surprising behavior. For instance, in C89, accepting `//`-style
comments would be fine because de facto all compilers supported them, and
having a separate lexer for that case would not be worth it.

Contribution checklist
======================

* Check the documentation for how to write
  [a new lexer](https://pygments.org/docs/lexerdevelopment/),
  [a new formatter](https://pygments.org/docs/formatterdevelopment/) or
  [a new filter](https://pygments.org/docs/filterdevelopment/)

* Make sure to add a test for your new functionality, and where applicable,
  write documentation.

* When writing rules, try to merge simple rules. For instance, combine:

  ```python
  _PUNCTUATION = [
    (r"\(", token.Punctuation),
    (r"\)", token.Punctuation),
    (r"\[", token.Punctuation),
    (r"\]", token.Punctuation),
    ("{", token.Punctuation),
    ("}", token.Punctuation),
  ]
  ```

  into:

  ```python
  (r"[\(\)\[\]{}]", token.Punctuation)
  ```

* Be careful with ``.*``. This matches greedily as much as it can. For instance,
  a rule like ``@.*@`` will match the whole string ``@first@ second @third@``,
  instead of matching ``@first@`` and ``@third@``. You can use ``@.*?@`` in
  this case to stop early. The ``?`` tries to match _as few times_ as possible.
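
  This difference can be checked directly with Python's ``re`` module (a
  minimal illustration, not Pygments-specific):

  ```python
  import re

  text = "@first@ second @third@"

  # Greedy: .* consumes as much as possible, so the match runs from
  # the first @ all the way to the last one.
  print(re.findall(r"@.*@", text))   # ['@first@ second @third@']

  # Non-greedy: .*? stops at the earliest closing @, so only the
  # delimited parts match.
  print(re.findall(r"@.*?@", text))  # ['@first@', '@third@']
  ```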

* Beware of so-called "catastrophic backtracking".  As a first example, consider
  the regular expression ``(A+)*B``.  This is equivalent to ``A*B`` regarding
  what it matches, but *non*-matches will take very long.  This is because
  of the way the regular expression engine works.  Suppose you feed it 50
  'A's, and a 'C' at the end.  It first matches the 'A's greedily in ``A+``,
  but finds that it cannot match the end since 'B' is not the same as 'C'.
  Then it backtracks, removing one 'A' from the first ``A+`` and trying to
  match the rest as another ``(A+)*``.  This fails again, so it backtracks
  further left in the input string, etc.  In effect, it tries all combinations

  ```
  (AAAAAAAAAAAAAAAAA)
  (AAAAAAAAAAAAAAAA)(A)
  (AAAAAAAAAAAAAAA)(AA)
  (AAAAAAAAAAAAAAA)(A)(A)
  (AAAAAAAAAAAAAA)(AAA)
  (AAAAAAAAAAAAAA)(AA)(A)
  ...
  ```

  Thus, the matching has exponential complexity.  In a lexer, the
  effect is that Pygments will seemingly hang when parsing invalid
  input.

  ```python
  >>> import re
  >>> re.match('(A+)*B', 'A'*50 + 'C') # hangs
  ```

  As a more subtle and real-life example, here is a badly written
  regular expression to match strings:

  ```python
  r'"(\\?.)*?"'
  ```

  If the ending quote is missing, the regular expression engine will
  find that it cannot match at the end, and try to backtrack with less
  matches in the ``*?``.  When it finds a backslash, as it has already
  tried the possibility ``\\.``, it tries ``.`` (recognizing it as a
  simple character without meaning), which leads to the same
  exponential backtracking problem if there are lots of backslashes in
  the (invalid) input string.  A good way to write this would be
  ``r'"([^\\]|\\.)*?"'``, where the inner group can only match in one
  way.  Better yet is to use a dedicated state, which not only
  sidesteps the issue without headaches, but allows you to highlight
  string escapes.

  ```python
  'root': [
      ...,
      (r'"', String, 'string'),
      ...
  ],
  'string': [
      (r'\\.', String.Escape),
      (r'"', String, '#pop'),
      (r'[^\\"]+', String),
  ]
  ```

* When writing rules for patterns such as comments or strings, match as many
  characters as possible in each token.  This is an example of what not to
  do:

  ```python
  'comment': [
      (r'\*/', Comment.Multiline, '#pop'),
      (r'.', Comment.Multiline),
  ]
  ```

  This generates one token per character in the comment, which slows
  down the lexing process, and also makes the raw token output (and in
  particular the test output) hard to read.  Do this instead:

  ```python
  'comment': [
      (r'\*/', Comment.Multiline, '#pop'),
      (r'[^*]+', Comment.Multiline),
      (r'\*', Comment.Multiline),
  ]
  ```
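
  The difference is easy to measure with a toy matching loop over plain ``re``
  patterns (a hypothetical sketch of first-match-wins rule application;
  ``count_tokens`` is not part of Pygments):

  ```python
  import re

  def count_tokens(rules, text):
      """Count the tokens a simple first-match-wins rule loop emits.

      Hypothetical stand-in for a lexer loop, used only to compare how
      many tokens each rule set produces for the same input.
      """
      pos, count = 0, 0
      while pos < len(text):
          for pattern in rules:
              m = re.compile(pattern).match(text, pos)
              if m:
                  pos = m.end()
                  count += 1
                  break
          else:
              raise ValueError(f"no rule matches at position {pos}")
      return count

  comment = "a fairly long multiline comment body */"

  # One token per character, plus one for the closing */:
  print(count_tokens([r'\*/', r'.'], comment))             # 38
  # Long runs merged into a single token:
  print(count_tokens([r'\*/', r'[^*]+', r'\*'], comment))  # 2
  ```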

* Don't add imports of your lexer anywhere in the codebase. (In case you're
  curious about ``compiled.py`` -- this file exists for backwards compatibility
  reasons.)

* Use the standard importing convention: ``from pygments.token import Punctuation``

* For test cases that assert on the tokens produced by a lexer, use these tools:

  * You can use the ``testcase`` formatter to produce a piece of code that
    can be pasted into a unittest file:
    ``python -m pygments -l lua -f testcase <<< "local a = 5"``

  * Most snippets should instead be put as a sample file under
    ``tests/snippets/<lexer_alias>/*.txt``. These files are automatically
    picked up as individual tests, asserting that the input produces the
    expected tokens.

    To add a new test, create a file with just your code snippet under a
    subdirectory based on your lexer's main alias. Then run
    ``tox -- --update-goldens <filename.txt>`` to auto-populate the
    currently expected tokens. Check that they look good and check in the file.

    Also run the same command whenever you need to update the test if the
    actual produced tokens change (assuming the change is expected).
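
    For orientation, a snippet file after ``--update-goldens`` roughly follows
    this layout (the token column below is illustrative, not the exact output
    of any particular lexer):

    ```
    ---input---
    local a = 5

    ---tokens---
    'local'       Keyword.Declaration
    ' '           Text.Whitespace
    'a'           Name
    ...
    ```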

  * Large test files should go in ``tests/examplefiles``.  This works
    similarly to ``snippets``, but the token output is stored in a separate
    file.  Output can also be regenerated with ``--update-goldens``.