author     Jean Abou-Samra <jean@abou-samra.fr>    2023-02-23 16:53:56 +0100
committer  GitHub <noreply@github.com>             2023-02-23 16:53:56 +0100
commit     8f417de6a9f3dff6d86a02c9faa365d6828a1046 (patch)
tree       f09096f1fe0525e30b7ac654671937009db3cdbd /doc
parent     1bcb869b6b9bfc6c40689570cf395b4dd55541cb (diff)
Devolve Contributing.md (#2352)
* Devolve Contributing.md

  Move the content to the docs and website so it is displayed on
  pygments.org, to make it easier to find.

  - Regex dos and don'ts go to lexerdevelopment.rst
  - The rest goes to a new file contributing.rst
  - That makes the explanation of how lexers are tested in
    lexerdevelopment.rst redundant, so remove it. The bit on how to
    add a lexer goes to contributing.rst.
Diffstat (limited to 'doc')
-rw-r--r--  doc/docs/contributing.rst      115
-rw-r--r--  doc/docs/index.rst               3
-rw-r--r--  doc/docs/lexerdevelopment.rst  177
-rw-r--r--  doc/index.rst                    7
4 files changed, 230 insertions, 72 deletions
diff --git a/doc/docs/contributing.rst b/doc/docs/contributing.rst
new file mode 100644
index 00000000..79d2bc96
--- /dev/null
+++ b/doc/docs/contributing.rst
@@ -0,0 +1,115 @@
+========================
+Contributing to Pygments
+========================
+
+Thanks for your interest in contributing! Please read the following
+guidelines.
+
+
+Licensing
+=========
+
+The code is distributed under the BSD 2-clause license. Contributors making pull
+requests must agree that they are able and willing to put their contributions
+under that license.
+
+
+General contribution checklist
+==============================
+
+* Check the documentation for how to write
+ :doc:`a new lexer <lexerdevelopment>`,
+ :doc:`a new formatter <formatterdevelopment>`,
+ :doc:`a new style <styledevelopment>` or
+ :doc:`a new filter <filterdevelopment>`.
+ If adding a lexer, please make sure you have
+ read :ref:`lexer-pitfalls`.
+
+* Run the test suite with ``tox``, and ensure it passes.
+
+* Make sure to add a test for your new functionality, and where applicable,
+ write documentation. See below on how to test lexers.
+
+* Use the standard importing convention: ``from pygments.token import Punctuation``
+
+
+How to add a lexer
+==================
+
+To add a lexer, you have to perform the following steps:
+
+* Select a matching module under ``pygments/lexers``, or create a new
+ module for your lexer class.
+
+  .. note::
+
+     We encourage you to put your lexer class into its own module, unless it's a
+     very small derivative of an already existing lexer.
+
+* Next, make sure the lexer is known from outside the module. All modules
+  in the ``pygments.lexers`` package specify ``__all__``. For example,
+  ``esoteric.py`` sets::
+
+      __all__ = ['BrainfuckLexer', 'BefungeLexer', ...]
+
+  Add the name of your lexer class to this list (or create the list if your lexer
+  is the only class in the module).
+
+* Finally, the lexer can be made publicly known by rebuilding the lexer mapping:
+
+  .. code-block:: console
+
+     $ tox -e mapfiles
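+
+Putting these steps together, a brand-new lexer module might look roughly
+like the following sketch. The module name ``foo.py``, the ``FooLexer``
+class, and its token rules are made-up placeholders, not an existing
+Pygments lexer; see :doc:`lexerdevelopment` for what the rules can do::
+
+    """
+    pygments.lexers.foo
+    ~~~~~~~~~~~~~~~~~~~
+
+    Lexer for the hypothetical Foo language.
+    """
+
+    from pygments.lexer import RegexLexer, words
+    from pygments.token import Comment, Keyword, Name, Whitespace
+
+    # Listing the class here makes it visible from outside the module.
+    __all__ = ['FooLexer']
+
+
+    class FooLexer(RegexLexer):
+        """For hypothetical Foo source code."""
+
+        # These attributes feed the lexer mapping rebuilt with
+        # ``tox -e mapfiles``.
+        name = 'Foo'
+        aliases = ['foo']
+        filenames = ['*.foo']
+
+        tokens = {
+            'root': [
+                (r'\s+', Whitespace),
+                (r'#.*$', Comment.Single),
+                (words(('if', 'else', 'while'), suffix=r'\b'), Keyword),
+                (r'\w+', Name),
+            ],
+        }
+
+Real lexers usually set a few more attributes, such as ``url`` and
+``mimetypes``; the existing modules under ``pygments/lexers`` are good
+references.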
+
+
+How lexers are tested
+=====================
+
+To add a new lexer test, create a file with just your code snippet
+under ``tests/snippets/<lexer_alias>/``. Then run
+``tox -- --update-goldens <filename.txt>`` to auto-populate the
+currently expected tokens. Check that they look good and check in the
+file.
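+
+For illustration, a snippet file (say, a hypothetical
+``tests/snippets/foo/example.txt``) contains an ``---input---`` section
+that you write; the ``--update-goldens`` run then fills in the
+``---tokens---`` section. The result looks roughly like this, with the
+exact token output depending on your lexer:
+
+.. code-block:: text
+
+    ---input---
+    if x
+
+    ---tokens---
+    'if'          Keyword
+    ' '           Text.Whitespace
+    'x'           Name
+    '\n'          Text.Whitespace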
+
+Lexer tests are run with ``tox``, like all other tests. While
+working on a lexer, you can also run only the tests for that lexer
+with ``tox -- tests/snippets/language-name/`` and/or
+``tox -- tests/examplefiles/language-name/``.
+
+Running the test suite with ``tox`` will run lexers on the test
+inputs, and check that the output matches the expected tokens. If you
+are improving a lexer, it is normal that the token output changes. To
+update the expected token output for the tests, again use
+``tox -- --update-goldens <filename.txt>``. Review the changes and
+check that they are as intended, then commit them along with your
+proposed code change.
+
+Large test files should go in ``tests/examplefiles``. This works
+similarly to ``snippets``, but the token output is stored in a separate
+file. Output can also be regenerated with ``--update-goldens``.
+
+
+Goals & non-goals of Pygments
+=============================
+
+Python support
+--------------
+
+Pygments supports all Python versions that are themselves still
+supported, as per the `Python Developer's Guide <devguide_>`_.
+Additionally, the default Python versions of the latest stable
+releases of RHEL, Ubuntu LTS, and Debian are supported, even if they
+are officially EOL. Supporting other end-of-life versions is a
+non-goal of Pygments.
+
+.. _devguide: https://devguide.python.org/versions/
+
+
+Validation
+----------
+
+Pygments does not attempt to validate the input. Accepting code that
+is not legal for a given language is fine if it simplifies the
+codebase and does not result in surprising behavior. For instance,
+accepting ``//`` comments in C89 is fine, because virtually all
+compilers supported them de facto, and a separate lexer just to
+reject them would not be worth it.
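+
+For illustration, this policy means a C89 lexer can simply keep a rule
+along these lines for ``//`` comments rather than growing a separate,
+stricter lexer (a hypothetical rule, not necessarily what the actual C
+lexer uses)::
+
+    (r'//.*$', Comment.Single),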
diff --git a/doc/docs/index.rst b/doc/docs/index.rst
index d35fe6f0..930467c4 100644
--- a/doc/docs/index.rst
+++ b/doc/docs/index.rst
@@ -30,7 +30,7 @@ Pygments documentation
api
terminal-sessions
-**Hacking for Pygments**
+**Hacking with Pygments**
.. toctree::
:maxdepth: 1
@@ -56,6 +56,7 @@ Pygments documentation
.. toctree::
:maxdepth: 1
+ contributing
changelog
authors
security
diff --git a/doc/docs/lexerdevelopment.rst b/doc/docs/lexerdevelopment.rst
index 29bdd6ca..3b720dc6 100644
--- a/doc/docs/lexerdevelopment.rst
+++ b/doc/docs/lexerdevelopment.rst
@@ -85,8 +85,8 @@ If no rule matches at the current position, the current char is emitted as an
one.
-Adding and testing a new lexer
-==============================
+Using a lexer
+=============
The easiest way to use a new lexer is to use Pygments' support for loading
the lexer from a file relative to your current directory.
@@ -141,70 +141,8 @@ to display the token types assigned to each part of your input file:
Hover over each token to see the token type displayed as a tooltip.
-To prepare your new lexer for inclusion in the Pygments distribution, so that it
-will be found when passing filenames or lexer aliases from the command line, you
-have to perform the following steps.
-
-First, change to the current directory containing the Pygments source code. You
-will need to have either an unpacked source tarball, or (preferably) a copy
-cloned from GitHub.
-
-.. code-block:: console
-
- $ cd pygments
-
-Select a matching module under ``pygments/lexers``, or create a new module for
-your lexer class.
-
-.. note::
-
- We encourage you to put your lexer class into its own module, unless it's a
- very small derivative of an already existing lexer.
-
-Next, make sure the lexer is known from outside of the module. All modules in
-the ``pygments.lexers`` package specify ``__all__``. For example,
-``esoteric.py`` sets::
-
- __all__ = ['BrainfuckLexer', 'BefungeLexer', ...]
-
-Add the name of your lexer class to this list (or create the list if your lexer
-is the only class in the module).
-
-Finally the lexer can be made publicly known by rebuilding the lexer mapping.
-In the root directory of the source (where the ``tox.ini`` file is located), run:
-
-.. code-block:: console
-
- $ tox -e mapfiles
-
-To test the new lexer, store an example file in
-``tests/examplefiles/<alias>``. For example, to test your
-``DiffLexer``, add a ``tests/examplefiles/diff/example.diff`` containing a
-sample diff output. To (re)generate the lexer output which the file is checked
-against, use the command ``tox -- tests/examplefiles/diff --update-goldens``.
-
-Now you can use ``python -m pygments`` from the current root of the checkout to
-render your example to HTML:
-
-.. code-block:: console
-
- $ python -m pygments -O full -f html -o /tmp/example.html tests/examplefiles/diff/example.diff
-
-Note that this explicitly calls the ``pygments`` module in the current
-directory. This ensures your modifications are used. Otherwise a possibly
-already installed, unmodified version without your new lexer would have been
-called from the system search path (``$PATH``).
-
-To view the result, open ``/tmp/example.html`` in your browser.
-
-Once the example renders as expected, you should run the complete test suite:
-
-.. code-block:: console
-
- $ tox
-
-It also tests that your lexer fulfills the lexer API and certain invariants,
-such as that the concatenation of all token text is the same as the input text.
+If your lexer would be useful to other people, we would love for you
+to contribute it to Pygments. See :doc:`contributing` for advice.
Regex Flags
@@ -746,3 +684,110 @@ pseudo keywords::
yield index, token, value
The `PhpLexer` and `LuaLexer` use this method to resolve builtin functions.
+
+
+
+.. _lexer-pitfalls:
+
+Common pitfalls and best practices
+==================================
+
+Regular expressions are ubiquitous in Pygments lexers. We have
+written this section to warn about a few common mistakes you might
+make when using them. There are also some tips on making your lexers
+easier to read and review. Please read this section if you want to
+contribute a new lexer; you may find it useful in any case.
+
+
+* When writing rules, try to merge simple rules. For instance, combine::
+
+      (r"\(", token.Punctuation),
+      (r"\)", token.Punctuation),
+      (r"\[", token.Punctuation),
+      (r"\]", token.Punctuation),
+      ("{", token.Punctuation),
+      ("}", token.Punctuation),
+
+  into::
+
+      (r"[\(\)\[\]{}]", token.Punctuation)
+
+
+* Be careful with ``.*``. This matches greedily as much as it can. For
+  instance, a rule like ``@.*@`` will match the whole string
+  ``@first@ second @third@``, instead of matching ``@first@`` and
+  ``@third@``. You can use ``@.*?@`` in this case to stop early; the
+  ``?`` makes the ``*`` match as few characters as possible.
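+
+  A quick way to see the difference in the interactive interpreter
+  (just a sketch using the standard library's ``re`` module)::
+
+      >>> import re
+      >>> re.findall(r'@.*@', '@first@ second @third@')
+      ['@first@ second @third@']
+      >>> re.findall(r'@.*?@', '@first@ second @third@')
+      ['@first@', '@third@']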
+
+* Beware of so-called "catastrophic backtracking". As a first example,
+  consider the regular expression ``(A+)*B``. This is equivalent to ``A*B``
+  regarding what it matches, but *non*-matches will take a very long time.
+  This is because of the way the regular expression engine works. Suppose
+  you feed it 50 'A's, and a 'C' at the end. It first matches the 'A's
+  greedily in ``A+``, but finds that it cannot match the end since 'B' is
+  not the same as 'C'. Then it backtracks, removing one 'A' from the first
+  ``A+`` and trying to match the rest as another ``(A+)*``. This fails
+  again, so it backtracks further left in the input string, and so on. In
+  effect, it tries all combinations:
+  .. code-block:: text
+
+     (AAAAAAAAAAAAAAAAA)
+     (AAAAAAAAAAAAAAAA)(A)
+     (AAAAAAAAAAAAAAA)(AA)
+     (AAAAAAAAAAAAAAA)(A)(A)
+     (AAAAAAAAAAAAAA)(AAA)
+     (AAAAAAAAAAAAAA)(AA)(A)
+     ...
+
+  Thus, the matching has exponential complexity. In a lexer, the
+  effect is that Pygments will seemingly hang when parsing invalid
+  input. ::
+
+      >>> import re
+      >>> re.match('(A+)*B', 'A'*50 + 'C')  # hangs
+
+  As a more subtle and real-life example, here is a badly written
+  regular expression to match strings::
+
+      r'"(\\?.)*?"'
+
+  If the ending quote is missing, the regular expression engine will
+  find that it cannot match at the end, and will try to backtrack with
+  fewer matches in the ``*?``. When it finds a backslash, since it has
+  already tried the possibility ``\\.``, it tries ``.`` (recognizing
+  the backslash as a simple character without special meaning), which
+  leads to the same exponential backtracking problem if there are lots
+  of backslashes in the (invalid) input string. A better way to write
+  this would be ``r'"([^\\]|\\.)*?"'``, where the inner group can only
+  match in one way. Better yet, use a dedicated state; this not only
+  sidesteps the issue without headaches, but also lets you highlight
+  string escapes. ::
+
+      'root': [
+          ...,
+          (r'"', String, 'string'),
+          ...
+      ],
+      'string': [
+          (r'\\.', String.Escape),
+          (r'"', String, '#pop'),
+          (r'[^\\"]+', String),
+      ]
+
+* When writing rules for patterns such as comments or strings, match as many
+  characters as possible in each token. This is an example of what *not* to
+  do::
+
+      'comment': [
+          (r'\*/', Comment.Multiline, '#pop'),
+          (r'.', Comment.Multiline),
+      ]
+
+  This generates one token per character in the comment, which slows
+  down the lexing process, and also makes the raw token output (and in
+  particular the test output) hard to read. Do this instead::
+
+      'comment': [
+          (r'\*/', Comment.Multiline, '#pop'),
+          (r'[^*]+', Comment.Multiline),
+          (r'\*', Comment.Multiline),
+      ]
diff --git a/doc/index.rst b/doc/index.rst
index dbd15968..cf91c5ff 100644
--- a/doc/index.rst
+++ b/doc/index.rst
@@ -16,16 +16,13 @@ source code. Highlights are:
Read more in the :doc:`FAQ list <faq>` or the :doc:`documentation <docs/index>`,
or `download the latest release <https://pypi.python.org/pypi/Pygments>`_.
-.. _contribute:
-
Contribute
----------
Like every open-source project, we are always looking for volunteers to help us
with programming. Python knowledge is required, but don't fear: Python is a very
-clear and easy to learn language.
-
-Development takes place on `GitHub <https://github.com/pygments/pygments>`_.
+clear and easy to learn language. Read our :doc:`contribution guidelines
+<docs/contributing>` for more information.
If you found a bug, just open a ticket in the GitHub tracker. Be sure to log
in to be notified when the issue is fixed -- development is not fast-paced as