From 8f417de6a9f3dff6d86a02c9faa365d6828a1046 Mon Sep 17 00:00:00 2001
From: Jean Abou-Samra
Date: Thu, 23 Feb 2023 16:53:56 +0100
Subject: Devolve Contributing.md (#2352)

* Devolve Contributing.md

Move the content to the docs and website so it is displayed on
pygments.org, to make it easier to find.

- Regex dos and don'ts go to lexerdevelopment.rst
- The rest goes to a new file contributing.rst
- That makes the explanation of how lexers are tested in
  lexerdevelopment.rst redundant, so remove it. The bit on how to add a
  lexer goes to contributing.rst.
---
 doc/docs/contributing.rst     | 115 +++++++++++++++++++++++++++
 doc/docs/index.rst            |   3 +-
 doc/docs/lexerdevelopment.rst | 177 ++++++++++++++++++++++++++----------------
 doc/index.rst                 |   7 +-
 4 files changed, 230 insertions(+), 72 deletions(-)
 create mode 100644 doc/docs/contributing.rst
(limited to 'doc')

diff --git a/doc/docs/contributing.rst b/doc/docs/contributing.rst
new file mode 100644
index 00000000..79d2bc96
--- /dev/null
+++ b/doc/docs/contributing.rst
@@ -0,0 +1,115 @@
+========================
+Contributing to Pygments
+========================
+
+Thanks for your interest in contributing! Please read the following
+guidelines.
+
+
+Licensing
+=========
+
+The code is distributed under the BSD 2-clause license. Contributors making pull
+requests must agree that they are able and willing to put their contributions
+under that license.
+
+
+General contribution checklist
+==============================
+
+* Check the documentation for how to write
+  :doc:`a new lexer <lexerdevelopment>`,
+  :doc:`a new formatter <formatterdevelopment>`,
+  :doc:`a new style <styledevelopment>` or
+  :doc:`a new filter <filterdevelopment>`.
+  If adding a lexer, please make sure you have
+  read :ref:`lexer-pitfalls`.
+
+* Run the test suite with ``tox``, and ensure it passes.
+
+* Make sure to add a test for your new functionality, and where applicable,
+  write documentation. See below on how to test lexers.
+
+* Use the standard importing convention: ``from pygments.token import Punctuation``
+
+
+How to add a lexer
+==================
+
+To add a lexer, you have to perform the following steps:
+
+* Select a matching module under ``pygments/lexers``, or create a new
+  module for your lexer class.
+
+  .. note::
+
+     We encourage you to put your lexer class into its own module, unless it's a
+     very small derivative of an already existing lexer.
+
+* Next, make sure the lexer is known outside the module. All modules
+  in the ``pygments.lexers`` package specify ``__all__``. For example,
+  ``esoteric.py`` sets::
+
+      __all__ = ['BrainfuckLexer', 'BefungeLexer', ...]
+
+  Add the name of your lexer class to this list (or create the list if your lexer
+  is the only class in the module).
+
+* Finally, the lexer can be made publicly known by rebuilding the lexer mapping.
+
+  .. code-block:: console
+
+      $ tox -e mapfiles
+
+
+How lexers are tested
+=====================
+
+To add a new lexer test, create a file with just your code snippet
+under ``tests/snippets/<lexer_alias>/``. Then run
+``tox -- --update-goldens <filename.txt>`` to auto-populate the
+currently expected tokens. Check that they look good and check in the
+file.
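+
+For illustration, here is roughly what a snippet file looks like once
+``--update-goldens`` has filled in the expected tokens (the input and
+the token types here are hypothetical; the exact output depends on
+your lexer)::
+
+    ---input---
+    set x = 1
+
+    ---tokens---
+    'set'         Token.Keyword
+    ' '           Token.Text.Whitespace
+    'x'           Token.Name
+    ' '           Token.Text.Whitespace
+    '='           Token.Operator
+    ' '           Token.Text.Whitespace
+    '1'           Token.Literal.Number
+    '\n'          Token.Text.Whitespace
+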
+Lexer tests are run with ``tox``, like all other tests. While
+working on a lexer, you can also run only the tests for that lexer
+with ``tox -- tests/snippets/language-name/`` and/or
+``tox -- tests/examplefiles/language-name/``.
+
+Running the test suite with ``tox`` will run lexers on the test
+inputs, and check that the output matches the expected tokens. If you
+are improving a lexer, it is normal that the token output changes. To
+update the expected token output for the tests, again use
+``tox -- --update-goldens <filename.txt>``. Review the changes and
+check that they are as intended, then commit them along with your
+proposed code change.
+
+Large test files should go in ``tests/examplefiles``. This works
+similarly to ``snippets``, but the token output is stored in a separate
+file. Output can also be regenerated with ``--update-goldens``.
+
+
+Goals & non-goals of Pygments
+=============================
+
+Python support
+--------------
+
+Pygments supports all Python versions that are still maintained, as per
+the `Python Developer's Guide <devguide_>`_. Additionally, the default
+Python versions of the latest stable releases of RHEL, Ubuntu LTS, and
+Debian are supported, even if they're officially EOL. Supporting other
+end-of-life versions is a non-goal of Pygments.
+
+.. _devguide: https://devguide.python.org/versions/
+
+
+Validation
+----------
+
+Pygments does not attempt to validate the input. Accepting code that
+is not legal for a given language is acceptable if it simplifies the
+codebase and does not result in surprising behavior. For instance, in
+C89, accepting ``//``-style comments would be fine, because de facto all
+compilers supported them, and having a separate lexer for strict C89
+would not be worth it.
diff --git a/doc/docs/index.rst b/doc/docs/index.rst
index d35fe6f0..930467c4 100644
--- a/doc/docs/index.rst
+++ b/doc/docs/index.rst
@@ -30,7 +30,7 @@ Pygments documentation
    api
    terminal-sessions
 
-**Hacking for Pygments**
+**Hacking with Pygments**
 
 .. toctree::
    :maxdepth: 1
@@ -56,6 +56,7 @@ Pygments documentation
 .. toctree::
    :maxdepth: 1
 
+   contributing
    changelog
    authors
    security
diff --git a/doc/docs/lexerdevelopment.rst b/doc/docs/lexerdevelopment.rst
index 29bdd6ca..3b720dc6 100644
--- a/doc/docs/lexerdevelopment.rst
+++ b/doc/docs/lexerdevelopment.rst
@@ -85,8 +85,8 @@ If no rule matches at the current position, the current char is emitted as an
 one.
 
 
-Adding and testing a new lexer
-==============================
+Using a lexer
+=============
 
 The easiest way to use a new lexer is to use Pygments' support for loading
 the lexer from a file relative to your current directory.
@@ -141,70 +141,8 @@ to display the token types assigned to each part of your input file:
 
 Hover over each token to see the token type displayed as a tooltip.
 
-To prepare your new lexer for inclusion in the Pygments distribution, so that it
-will be found when passing filenames or lexer aliases from the command line, you
-have to perform the following steps.
-
-First, change to the current directory containing the Pygments source code. You
-will need to have either an unpacked source tarball, or (preferably) a copy
-cloned from GitHub.
-
-.. code-block:: console
-
-   $ cd pygments
-
-Select a matching module under ``pygments/lexers``, or create a new module for
-your lexer class.
-
-.. note::
-
-   We encourage you to put your lexer class into its own module, unless it's a
-   very small derivative of an already existing lexer.
-
-Next, make sure the lexer is known from outside of the module. All modules in
-the ``pygments.lexers`` package specify ``__all__``. For example,
-``esoteric.py`` sets::
-
-    __all__ = ['BrainfuckLexer', 'BefungeLexer', ...]
-
-Add the name of your lexer class to this list (or create the list if your lexer
-is the only class in the module).
-
-Finally the lexer can be made publicly known by rebuilding the lexer mapping.
-In the root directory of the source (where the ``tox.ini`` file is located), run:
-
-.. code-block:: console
-
-   $ tox -e mapfiles
-
-To test the new lexer, store an example file in
-``tests/examplefiles/<alias>``. For example, to test your
-``DiffLexer``, add a ``tests/examplefiles/diff/example.diff`` containing a
-sample diff output. To (re)generate the lexer output which the file is checked
-against, use the command ``tox -- tests/examplefiles/diff --update-goldens``.
-
-Now you can use ``python -m pygments`` from the current root of the checkout to
-render your example to HTML:
-
-.. code-block:: console
-
-   $ python -m pygments -O full -f html -o /tmp/example.html tests/examplefiles/diff/example.diff
-
-Note that this explicitly calls the ``pygments`` module in the current
-directory. This ensures your modifications are used. Otherwise a possibly
-already installed, unmodified version without your new lexer would have been
-called from the system search path (``$PATH``).
-
-To view the result, open ``/tmp/example.html`` in your browser.
-
-Once the example renders as expected, you should run the complete test suite:
-
-.. code-block:: console
-
-   $ tox
-
-It also tests that your lexer fulfills the lexer API and certain invariants,
-such as that the concatenation of all token text is the same as the input text.
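+You can also load a custom lexer file programmatically. Here is a minimal
+sketch, assuming your lexer lives in a file ``mylexer.py`` (a hypothetical
+name) in a class called ``CustomLexer``, the name Pygments looks for by
+default::
+
+    from pygments import highlight
+    from pygments.formatters import HtmlFormatter
+    from pygments.lexers import load_lexer_from_file
+
+    # Load the lexer class from the file and instantiate it, then
+    # highlight some input with it.
+    lexer = load_lexer_from_file('mylexer.py', lexername='CustomLexer')
+    print(highlight('some input code', lexer, HtmlFormatter()))
+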
+If your lexer would be useful to other people, we would love it if you
+contributed it to Pygments. See :doc:`contributing` for advice.
 
 
 Regex Flags
@@ -746,3 +684,110 @@ pseudo keywords::
             yield index, token, value
 
 The `PhpLexer` and `LuaLexer` use this method to resolve builtin functions.
+
+
+
+.. _lexer-pitfalls:
+
+Common pitfalls and best practices
+==================================
+
+Regular expressions are ubiquitous in Pygments lexers. We have
+written this section to warn about a few common mistakes you might make
+when using them. There are also some tips on making your lexers easier
+to read and review. Please read this section if you want to
+contribute a new lexer, but you might find it useful in any case.
+
+
+* When writing rules, try to merge simple rules. For instance, combine::
+
+      (r"\(", token.Punctuation),
+      (r"\)", token.Punctuation),
+      (r"\[", token.Punctuation),
+      (r"\]", token.Punctuation),
+      ("{", token.Punctuation),
+      ("}", token.Punctuation),
+
+  into::
+
+      (r"[\(\)\[\]{}]", token.Punctuation)
+
+* Be careful with ``.*``. This matches greedily as much as it can. For instance,
+  a rule like ``@.*@`` will match the whole string ``@first@ second @third@``,
+  instead of matching ``@first@`` and ``@third@``. You can use ``@.*?@`` in
+  this case to stop early. The ``?`` tries to match *as few times* as possible.
+
+* Beware of so-called "catastrophic backtracking". As a first example, consider
+  the regular expression ``(A+)*B``. This is equivalent to ``A*B`` regarding
+  what it matches, but *non*-matches will take very long. This is because
+  of the way the regular expression engine works. Suppose you feed it 50
+  'A's, and a 'C' at the end. It first matches the 'A's greedily in ``A+``,
+  but finds that it cannot match the end since 'B' is not the same as 'C'.
+  Then it backtracks, removing one 'A' from the first ``A+`` and trying to
+  match the rest as another ``(A+)*``. This fails again, so it backtracks
+  further left in the input string, etc. In effect, it tries all combinations
+
+  .. code-block:: text
+
+      (AAAAAAAAAAAAAAAAA)
+      (AAAAAAAAAAAAAAAA)(A)
+      (AAAAAAAAAAAAAAA)(AA)
+      (AAAAAAAAAAAAAAA)(A)(A)
+      (AAAAAAAAAAAAAA)(AAA)
+      (AAAAAAAAAAAAAA)(AA)(A)
+      ...
+
+  Thus, the matching has exponential complexity. In a lexer, the
+  effect is that Pygments will seemingly hang when parsing invalid
+  input. ::
+
+      >>> import re
+      >>> re.match('(A+)*B', 'A'*50 + 'C')  # hangs
+
+  As a more subtle and real-life example, here is a badly written
+  regular expression to match strings::
+
+      r'"(\\?.)*?"'
+
+  If the ending quote is missing, the regular expression engine will
+  find that it cannot match at the end, and try to backtrack with fewer
+  matches in the ``*?``. When it finds a backslash, as it has already
+  tried the possibility ``\\.``, it tries ``.`` (recognizing it as a
+  simple character without meaning), which leads to the same
+  exponential backtracking problem if there are lots of backslashes in
+  the (invalid) input string. A good way to write this would be
+  ``r'"([^\\]|\\.)*?"'``, where the inner group can only match in one
+  way. Better yet is to use a dedicated state, which not only
+  sidesteps the issue without headaches, but allows you to highlight
+  string escapes. ::
+
+      'root': [
+          ...,
+          (r'"', String, 'string'),
+          ...
+      ],
+      'string': [
+          (r'\\.', String.Escape),
+          (r'"', String, '#pop'),
+          (r'[^\\"]+', String),
+      ]
+
+* When writing rules for patterns such as comments or strings, match as many
+  characters as possible in each token. This is an example of what *not* to
+  do::
+
+      'comment': [
+          (r'\*/', Comment.Multiline, '#pop'),
+          (r'.', Comment.Multiline),
+      ]
+
+  This generates one token per character in the comment, which slows
+  down the lexing process, and also makes the raw token output (and in
+  particular the test output) hard to read. Do this instead::
+
+      'comment': [
+          (r'\*/', Comment.Multiline, '#pop'),
+          (r'[^*]+', Comment.Multiline),
+          (r'\*', Comment.Multiline),
+      ]
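+
+To see how the recommended dedicated-state pattern fits into a complete
+lexer, here is a minimal, hypothetical example (``MiniLexer`` is made up
+for illustration and is not an actual Pygments lexer)::
+
+    from pygments.lexer import RegexLexer
+    from pygments.token import Comment, String, Text, Whitespace
+
+    class MiniLexer(RegexLexer):
+        """Toy lexer with dedicated states for strings and comments."""
+        name = 'Mini'
+        aliases = ['mini']
+
+        tokens = {
+            'root': [
+                (r'\s+', Whitespace),
+                (r'"', String, 'string'),
+                (r'/\*', Comment.Multiline, 'comment'),
+                (r'[^\s"/]+', Text),
+                (r'/', Text),
+            ],
+            # Dedicated string state: escapes are highlighted, and the
+            # patterns can never backtrack catastrophically.
+            'string': [
+                (r'\\.', String.Escape),
+                (r'"', String, '#pop'),
+                (r'[^\\"]+', String),
+            ],
+            # Comment text is matched in long runs, not char by char.
+            'comment': [
+                (r'\*/', Comment.Multiline, '#pop'),
+                (r'[^*]+', Comment.Multiline),
+                (r'\*', Comment.Multiline),
+            ],
+        }
+
+    for token, value in MiniLexer().get_tokens(r'"a\"b" /* note */'):
+        print(token, repr(value))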
diff --git a/doc/index.rst b/doc/index.rst
index dbd15968..cf91c5ff 100644
--- a/doc/index.rst
+++ b/doc/index.rst
@@ -16,16 +16,13 @@ source code. Highlights are:
 Read more in the :doc:`FAQ list <faq>` or the :doc:`documentation <docs/index>`,
 or `download the latest release <https://pypi.python.org/pypi/Pygments>`_.
 
-.. _contribute:
-
 Contribute
 ----------
 
 Like every open-source project, we are always looking for volunteers to help us
 with programming. Python knowledge is required, but don't fear: Python is a very
-clear and easy to learn language.
-
-Development takes place on `GitHub <https://github.com/pygments/pygments>`_.
+clear and easy to learn language. Read our :doc:`contribution guidelines
+<docs/contributing>` for more information.
 
 If you found a bug, just open a ticket in the GitHub tracker. Be sure to log in
 to be notified when the issue is fixed -- development is not fast-paced as
--
cgit v1.2.1