Licensing
=========

The code is distributed under the BSD 2-clause license. Contributors making pull
requests must agree that they are able and willing to put their contributions
under that license.

Goals & non-goals of Pygments
=============================

Python support
--------------

Pygments supports all Python versions that are themselves still supported, as
per the [Python Developer's Guide](https://devguide.python.org/versions/).
Additionally, the default Python versions of the latest stable releases of
RHEL, Ubuntu LTS, and Debian are supported, even if they're officially EOL.
Supporting other end-of-life versions is a non-goal of Pygments.

Validation
----------

Pygments does not attempt to validate its input. Accepting code that is not
legal for a given language is acceptable if it simplifies the codebase and does
not result in surprising behavior. For instance, in C89, accepting ``//``-style
comments would be fine because de facto all compilers supported them, and
having a separate lexer for that case would not be worth it.
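As a minimal sketch of what that means in practice (a hypothetical rule table,
not actual Pygments source), a C-family lexer can simply carry the C99 comment
rule unconditionally:

```python
from pygments.token import Comment

# Hypothetical excerpt of a C-family lexer's rule table: the C99-style
# single-line comment is accepted even for C89 input, since compilers
# supported it in practice and rejecting it would buy us nothing.
tokens = {
    'root': [
        (r'//.*?$', Comment.Single),
        # ... remaining rules for the language ...
    ],
}
```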
Contribution checklist
======================

* Check the documentation for how to write
  [a new lexer](https://pygments.org/docs/lexerdevelopment/),
  [a new formatter](https://pygments.org/docs/formatterdevelopment/) or
  [a new filter](https://pygments.org/docs/filterdevelopment/).

* Make sure to add a test for your new functionality, and where applicable,
  write documentation.

* When writing rules, try to merge simple rules. For instance, combine:

  ```python
  _PUNCTUATION = [
      (r"\(", token.Punctuation),
      (r"\)", token.Punctuation),
      (r"\[", token.Punctuation),
      (r"\]", token.Punctuation),
      ("{", token.Punctuation),
      ("}", token.Punctuation),
  ]
  ```

  into:

  ```python
  (r"[\(\)\[\]{}]", token.Punctuation)
  ```

* Be careful with ``.*``. This matches greedily as much as it can. For instance,
  a rule like ``@.*@`` will match the whole string ``@first@ second @third@``,
  instead of matching ``@first@`` and ``@third@``. You can use ``@.*?@`` in
  this case to stop early. The ``?`` tries to match _as few times_ as possible.
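  A quick demonstration with the standard ``re`` module (illustrative only,
  not an actual lexer rule):

  ```python
  import re

  s = '@first@ second @third@'

  # Greedy '.*' runs to the last '@' it can reach, swallowing the whole string.
  print(re.findall(r'@.*@', s))   # ['@first@ second @third@']

  # Non-greedy '.*?' stops at the earliest possible closing '@'.
  print(re.findall(r'@.*?@', s))  # ['@first@', '@third@']
  ```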
* Beware of so-called "catastrophic backtracking". As a first example, consider
  the regular expression ``(A+)*B``. This is equivalent to ``A*B`` regarding
  what it matches, but *non*-matches will take very long. This is because
  of the way the regular expression engine works. Suppose you feed it 50
  'A's, and a 'C' at the end. It first matches the 'A's greedily in ``A+``,
  but finds that it cannot match the end since 'B' is not the same as 'C'.
  Then it backtracks, removing one 'A' from the first ``A+`` and trying to
  match the rest as another ``(A+)*``. This fails again, so it backtracks
  further left in the input string, etc. In effect, it tries all combinations

  ```
  (AAAAAAAAAAAAAAAAA)
  (AAAAAAAAAAAAAAAA)(A)
  (AAAAAAAAAAAAAAA)(AA)
  (AAAAAAAAAAAAAAA)(A)(A)
  (AAAAAAAAAAAAAA)(AAA)
  (AAAAAAAAAAAAAA)(AA)(A)
  ...
  ```

  Thus, the matching has exponential complexity. In a lexer, the
  effect is that Pygments will seemingly hang when parsing invalid
  input.

  ```python
  >>> import re
  >>> re.match('(A+)*B', 'A'*50 + 'C')  # hangs
  ```

  As a more subtle and real-life example, here is a badly written
  regular expression to match strings:

  ```python
  r'"(\\?.)*?"'
  ```

  If the ending quote is missing, the regular expression engine will
  find that it cannot match at the end, and will try to backtrack with fewer
  matches in the ``*?``. When it finds a backslash, as it has already
  tried the possibility ``\\.``, it tries ``.`` (recognizing it as a
  simple character without meaning), which leads to the same
  exponential backtracking problem if there are lots of backslashes in
  the (invalid) input string. A good way to write this would be
  ``r'"([^\\]|\\.)*?"'``, where the inner group can only match in one
  way. Better yet is to use a dedicated state, which not only
  sidesteps the issue without headaches, but also allows you to highlight
  string escapes:

  ```python
  'root': [
      ...,
      (r'"', String, 'string'),
      ...
  ],
  'string': [
      (r'\\.', String.Escape),
      (r'"', String, '#pop'),
      (r'[^\\"]+', String),
  ]
  ```

* When writing rules for patterns such as comments or strings, match as many
  characters as possible in each token. This is an example of what not to
  do:

  ```python
  'comment': [
      (r'\*/', Comment.Multiline, '#pop'),
      (r'.', Comment.Multiline),
  ]
  ```

  This generates one token per character in the comment, which slows
  down the lexing process and also makes the raw token output (and in
  particular the test output) hard to read. Do this instead:

  ```python
  'comment': [
      (r'\*/', Comment.Multiline, '#pop'),
      (r'[^*]+', Comment.Multiline),
      (r'\*', Comment.Multiline),
  ]
  ```

* Don't add imports of your lexer anywhere in the codebase. (In case you're
  curious about ``compiled.py`` -- this file exists for backwards compatibility
  reasons.)

* Use the standard importing convention: ``from pygments.token import Punctuation``.

* For test cases that assert on the tokens produced by a lexer, use tools:

  * You can use the ``testcase`` formatter to produce a piece of code that
    can be pasted into a unittest file:
    ``python -m pygments -l lua -f testcase <<< "local a = 5"``

  * Most snippets should instead be put as a sample file under
    ``tests/snippets/<lexer_alias>/*.txt``. These files are automatically
    picked up as individual tests, asserting that the input produces the
    expected tokens.

    To add a new test, create a file with just your code snippet under a
    subdirectory based on your lexer's main alias. Then run
    ``tox -- --update-goldens <filename.txt>`` to auto-populate the
    currently expected tokens. Check that they look good and check the file
    in. (An illustrative sketch of the resulting file format follows this
    list.)

    Also run the same command whenever you need to update the test if the
    actual produced tokens change (assuming the change is expected).

  * Large test files should go in ``tests/examplefiles``. This works
    similarly to ``snippets``, but the token output is stored in a separate
    file. Output can also be regenerated with ``--update-goldens``.
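To give a rough idea of the snippets format: after ``--update-goldens`` has
run, a file such as ``tests/snippets/lua/local_assignment.txt`` (a hypothetical
name) contains the original input plus the golden token output, along these
lines. The exact token types below are made up for illustration; the tool
generates the real ones.

```
---input---
local a = 5

---tokens---
'local'       Keyword.Declaration
' '           Text.Whitespace
'a'           Name
' '           Text.Whitespace
'='           Operator
' '           Text.Whitespace
'5'           Number.Integer
'\n'          Text.Whitespace
```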