From 8f417de6a9f3dff6d86a02c9faa365d6828a1046 Mon Sep 17 00:00:00 2001
From: Jean Abou-Samra
Date: Thu, 23 Feb 2023 16:53:56 +0100
Subject: Devolve Contributing.md (#2352)

* Devolve Contributing.md

Move the content to the docs and website so it is displayed on
pygments.org, to make it easier to find.

- Regex dos and don'ts go to lexerdevelopment.rst
- The rest goes to a new file contributing.rst
- That makes the explanation of how lexers are tested in
  lexerdevelopment.rst redundant, so remove it. The bit on how to add a
  lexer goes to contributing.rst.
---
 doc/docs/contributing.rst     | 115 +++++++++++++++++++++++++++
 doc/docs/index.rst            |   3 +-
 doc/docs/lexerdevelopment.rst | 177 ++++++++++++++++++++++++++----------------
 doc/index.rst                 |   7 +-
 4 files changed, 230 insertions(+), 72 deletions(-)
 create mode 100644 doc/docs/contributing.rst
(limited to 'doc')

diff --git a/doc/docs/contributing.rst b/doc/docs/contributing.rst
new file mode 100644
index 00000000..79d2bc96
--- /dev/null
+++ b/doc/docs/contributing.rst
@@ -0,0 +1,115 @@
+========================
+Contributing to Pygments
+========================
+
+Thanks for your interest in contributing! Please read the following
+guidelines.
+
+
+Licensing
+=========
+
+The code is distributed under the BSD 2-clause license. Contributors making pull
+requests must agree that they are able and willing to put their contributions
+under that license.
+
+
+General contribution checklist
+==============================
+
+* Check the documentation for how to write
+  :doc:`a new lexer <lexerdevelopment>`,
+  :doc:`a new formatter <formatterdevelopment>`,
+  :doc:`a new style <styledevelopment>` or
+  :doc:`a new filter <filterdevelopment>`.
+  If adding a lexer, please make sure you have
+  read :ref:`lexer-pitfalls`.
+
+* Run the test suite with ``tox``, and ensure it passes.
+
+* Make sure to add a test for your new functionality, and where applicable,
+  write documentation. See below on how to test lexers.
+
+* Use the standard importing convention: ``from pygments.token import Punctuation``
+
+
+How to add a lexer
+==================
+
+To add a lexer, you have to perform the following steps:
+
+* Select a matching module under ``pygments/lexers``, or create a new
+  module for your lexer class.
+
+  .. note::
+
+     We encourage you to put your lexer class into its own module, unless it's a
+     very small derivative of an already existing lexer.
+
+* Next, make sure the lexer is known outside the module. All modules
+  in the ``pygments.lexers`` package specify ``__all__``. For example,
+  ``esoteric.py`` sets::
+
+      __all__ = ['BrainfuckLexer', 'BefungeLexer', ...]
+
+  Add the name of your lexer class to this list (or create the list if your lexer
+  is the only class in the module).
+
+* Finally, the lexer can be made publicly known by rebuilding the lexer mapping.
+
+  .. code-block:: console
+
+      $ tox -e mapfiles
+
+
+How lexers are tested
+=====================
+
+To add a new lexer test, create a file with just your code snippet
+under ``tests/snippets/<lexer_alias>/``. Then run
+``tox -- --update-goldens <filename.txt>`` to auto-populate the
+currently expected tokens. Check that they look good and check in the
+file.
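+
+For illustration, here is roughly what a snippet file looks like once
+``--update-goldens`` has filled in the expected tokens (the input and
+the token types here are hypothetical; the exact output depends on
+your lexer)::
+
+    ---input---
+    set x = 1
+
+    ---tokens---
+    'set'         Token.Keyword
+    ' '           Token.Text.Whitespace
+    'x'           Token.Name
+    ' '           Token.Text.Whitespace
+    '='           Token.Operator
+    ' '           Token.Text.Whitespace
+    '1'           Token.Literal.Number
+    '\n'          Token.Text.Whitespace
+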
+Lexer tests are run with ``tox``, like all other tests. While
+working on a lexer, you can also run only the tests for that lexer
+with ``tox -- tests/snippets/language-name/`` and/or
+``tox -- tests/examplefiles/language-name/``.
+
+Running the test suite with ``tox`` will run lexers on the test
+inputs, and check that the output matches the expected tokens. If you
+are improving a lexer, it is normal that the token output changes. To
+update the expected token output for the tests, again use
+``tox -- --update-goldens <filename.txt>``. Review the changes and
+check that they are as intended, then commit them along with your
+proposed code change.
+
+Large test files should go in ``tests/examplefiles``. This works
+similarly to ``snippets``, but the token output is stored in a separate
+file. Output can also be regenerated with ``--update-goldens``.
+
+
+Goals & non-goals of Pygments
+=============================
+
+Python support
+--------------
+
+Pygments supports all Python versions that are still maintained, as per
+the `Python Developer's Guide <devguide_>`_. Additionally, the default
+Python versions of the latest stable releases of RHEL, Ubuntu LTS, and
+Debian are supported, even if they're officially EOL. Supporting other
+end-of-life versions is a non-goal of Pygments.
+
+.. _devguide: https://devguide.python.org/versions/
+
+
+Validation
+----------
+
+Pygments does not attempt to validate the input. Accepting code that
+is not legal for a given language is acceptable if it simplifies the
+codebase and does not result in surprising behavior. For instance, in
+C89, accepting ``//``-style comments would be fine, because de facto all
+compilers supported them, and having a separate lexer for strict C89
+would not be worth it.
diff --git a/doc/docs/index.rst b/doc/docs/index.rst
index d35fe6f0..930467c4 100644
--- a/doc/docs/index.rst
+++ b/doc/docs/index.rst
@@ -30,7 +30,7 @@ Pygments documentation
    api
    terminal-sessions
 
-**Hacking for Pygments**
+**Hacking with Pygments**
 
 .. toctree::
    :maxdepth: 1
@@ -56,6 +56,7 @@ Pygments documentation
 .. toctree::
    :maxdepth: 1
 
+   contributing
    changelog
    authors
    security
diff --git a/doc/docs/lexerdevelopment.rst b/doc/docs/lexerdevelopment.rst
index 29bdd6ca..3b720dc6 100644
--- a/doc/docs/lexerdevelopment.rst
+++ b/doc/docs/lexerdevelopment.rst
@@ -85,8 +85,8 @@ If no rule matches at the current position, the current char is emitted as an
 one.
 
 
-Adding and testing a new lexer
-==============================
+Using a lexer
+=============
 
 The easiest way to use a new lexer is to use Pygments' support for loading
 the lexer from a file relative to your current directory.
@@ -141,70 +141,8 @@ to display the token types assigned to each part of your input file:
 
 Hover over each token to see the token type displayed as a tooltip.
 
-To prepare your new lexer for inclusion in the Pygments distribution, so that it
-will be found when passing filenames or lexer aliases from the command line, you
-have to perform the following steps.
-
-First, change to the current directory containing the Pygments source code. You
-will need to have either an unpacked source tarball, or (preferably) a copy
-cloned from GitHub.
-
-.. code-block:: console
-
-   $ cd pygments
-
-Select a matching module under ``pygments/lexers``, or create a new module for
-your lexer class.
-
-.. note::
-
-   We encourage you to put your lexer class into its own module, unless it's a
-   very small derivative of an already existing lexer.
-
-Next, make sure the lexer is known from outside of the module. All modules in
-the ``pygments.lexers`` package specify ``__all__``. For example,
-``esoteric.py`` sets::
-
-    __all__ = ['BrainfuckLexer', 'BefungeLexer', ...]
-
-Add the name of your lexer class to this list (or create the list if your lexer
-is the only class in the module).
-
-Finally the lexer can be made publicly known by rebuilding the lexer mapping.
-In the root directory of the source (where the ``tox.ini`` file is located), run:
-
-.. code-block:: console
-
-   $ tox -e mapfiles
-
-To test the new lexer, store an example file in
-``tests/examplefiles/<alias>``. For example, to test your
-``DiffLexer``, add a ``tests/examplefiles/diff/example.diff`` containing a
-sample diff output. To (re)generate the lexer output which the file is checked
-against, use the command ``tox -- tests/examplefiles/diff --update-goldens``.
-
-Now you can use ``python -m pygments`` from the current root of the checkout to
-render your example to HTML:
-
-.. code-block:: console
-
-   $ python -m pygments -O full -f html -o /tmp/example.html tests/examplefiles/diff/example.diff
-
-Note that this explicitly calls the ``pygments`` module in the current
-directory. This ensures your modifications are used. Otherwise a possibly
-already installed, unmodified version without your new lexer would have been
-called from the system search path (``$PATH``).
-
-To view the result, open ``/tmp/example.html`` in your browser.
-
-Once the example renders as expected, you should run the complete test suite:
-
-.. code-block:: console
-
-   $ tox
-
-It also tests that your lexer fulfills the lexer API and certain invariants,
-such as that the concatenation of all token text is the same as the input text.
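+You can also load a custom lexer file programmatically. Here is a minimal
+sketch, assuming your lexer lives in a file ``mylexer.py`` (a hypothetical
+name) in a class called ``CustomLexer``, the name Pygments looks for by
+default::
+
+    from pygments import highlight
+    from pygments.formatters import HtmlFormatter
+    from pygments.lexers import load_lexer_from_file
+
+    # Load the lexer class from the file and instantiate it, then
+    # highlight some input with it.
+    lexer = load_lexer_from_file('mylexer.py', lexername='CustomLexer')
+    print(highlight('some input code', lexer, HtmlFormatter()))
+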
+If your lexer would be useful to other people, we would love it if you
+contributed it to Pygments. See :doc:`contributing` for advice.
 
 
 Regex Flags
@@ -746,3 +684,110 @@ pseudo keywords::
             yield index, token, value
 
 The `PhpLexer` and `LuaLexer` use this method to resolve builtin functions.
+
+
+
+.. _lexer-pitfalls:
+
+Common pitfalls and best practices
+==================================
+
+Regular expressions are ubiquitous in Pygments lexers. We have
+written this section to warn about a few common mistakes you might make
+when using them. There are also some tips on making your lexers easier
+to read and review. Please read this section if you want to
+contribute a new lexer, but you might find it useful in any case.
+
+
+* When writing rules, try to merge simple rules. For instance, combine::
+
+      (r"\(", token.Punctuation),
+      (r"\)", token.Punctuation),
+      (r"\[", token.Punctuation),
+      (r"\]", token.Punctuation),
+      ("{", token.Punctuation),
+      ("}", token.Punctuation),
+
+  into::
+
+      (r"[\(\)\[\]{}]", token.Punctuation)
+
+* Be careful with ``.*``. This matches greedily as much as it can. For instance,
+  a rule like ``@.*@`` will match the whole string ``@first@ second @third@``,
+  instead of matching ``@first@`` and ``@third@``. You can use ``@.*?@`` in
+  this case to stop early. The ``?`` tries to match *as few times* as possible.
+
+* Beware of so-called "catastrophic backtracking". As a first example, consider
+  the regular expression ``(A+)*B``. This is equivalent to ``A*B`` regarding
+  what it matches, but *non*-matches will take very long. This is because
+  of the way the regular expression engine works. Suppose you feed it 50
+  'A's, and a 'C' at the end. It first matches the 'A's greedily in ``A+``,
+  but finds that it cannot match the end since 'B' is not the same as 'C'.
+  Then it backtracks, removing one 'A' from the first ``A+`` and trying to
+  match the rest as another ``(A+)*``. This fails again, so it backtracks
+  further left in the input string, etc. In effect, it tries all combinations
+
+  .. code-block:: text
+
+      (AAAAAAAAAAAAAAAAA)
+      (AAAAAAAAAAAAAAAA)(A)
+      (AAAAAAAAAAAAAAA)(AA)
+      (AAAAAAAAAAAAAAA)(A)(A)
+      (AAAAAAAAAAAAAA)(AAA)
+      (AAAAAAAAAAAAAA)(AA)(A)
+      ...
+
+  Thus, the matching has exponential complexity. In a lexer, the
+  effect is that Pygments will seemingly hang when parsing invalid
+  input. ::
+
+      >>> import re
+      >>> re.match('(A+)*B', 'A'*50 + 'C')  # hangs
+
+  As a more subtle and real-life example, here is a badly written
+  regular expression to match strings::
+
+      r'"(\\?.)*?"'
+
+  If the ending quote is missing, the regular expression engine will
+  find that it cannot match at the end, and try to backtrack with fewer
+  matches in the ``*?``. When it finds a backslash, as it has already
+  tried the possibility ``\\.``, it tries ``.`` (recognizing it as a
+  simple character without meaning), which leads to the same
+  exponential backtracking problem if there are lots of backslashes in
+  the (invalid) input string. A good way to write this would be
+  ``r'"([^\\]|\\.)*?"'``, where the inner group can only match in one
+  way. Better yet is to use a dedicated state, which not only
+  sidesteps the issue without headaches, but allows you to highlight
+  string escapes. ::
+
+      'root': [
+          ...,
+          (r'"', String, 'string'),
+          ...
+      ],
+      'string': [
+          (r'\\.', String.Escape),
+          (r'"', String, '#pop'),
+          (r'[^\\"]+', String),
+      ]
+
+* When writing rules for patterns such as comments or strings, match as many
+  characters as possible in each token. This is an example of what *not* to
+  do::
+
+      'comment': [
+          (r'\*/', Comment.Multiline, '#pop'),
+          (r'.', Comment.Multiline),
+      ]
+
+  This generates one token per character in the comment, which slows
+  down the lexing process, and also makes the raw token output (and in
+  particular the test output) hard to read. Do this instead::
+
+      'comment': [
+          (r'\*/', Comment.Multiline, '#pop'),
+          (r'[^*]+', Comment.Multiline),
+          (r'\*', Comment.Multiline),
+      ]
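+
+To see how the recommended dedicated-state pattern fits into a complete
+lexer, here is a minimal, hypothetical example (``MiniLexer`` is made up
+for illustration and is not an actual Pygments lexer)::
+
+    from pygments.lexer import RegexLexer
+    from pygments.token import Comment, String, Text, Whitespace
+
+    class MiniLexer(RegexLexer):
+        """Toy lexer with dedicated states for strings and comments."""
+        name = 'Mini'
+        aliases = ['mini']
+
+        tokens = {
+            'root': [
+                (r'\s+', Whitespace),
+                (r'"', String, 'string'),
+                (r'/\*', Comment.Multiline, 'comment'),
+                (r'[^\s"/]+', Text),
+                (r'/', Text),
+            ],
+            # Dedicated string state: escapes are highlighted, and the
+            # patterns can never backtrack catastrophically.
+            'string': [
+                (r'\\.', String.Escape),
+                (r'"', String, '#pop'),
+                (r'[^\\"]+', String),
+            ],
+            # Comment text is matched in long runs, not char by char.
+            'comment': [
+                (r'\*/', Comment.Multiline, '#pop'),
+                (r'[^*]+', Comment.Multiline),
+                (r'\*', Comment.Multiline),
+            ],
+        }
+
+    for token, value in MiniLexer().get_tokens(r'"a\"b" /* note */'):
+        print(token, repr(value))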
diff --git a/doc/index.rst b/doc/index.rst
index dbd15968..cf91c5ff 100644
--- a/doc/index.rst
+++ b/doc/index.rst
@@ -16,16 +16,13 @@ source code. Highlights are:
 Read more in the :doc:`FAQ list <faq>` or the :doc:`documentation <docs/index>`,
 or `download the latest release <https://pypi.python.org/pypi/Pygments>`_.
 
-.. _contribute:
-
 Contribute
 ----------
 
 Like every open-source project, we are always looking for volunteers to help us
 with programming. Python knowledge is required, but don't fear: Python is a very
-clear and easy to learn language.
-
-Development takes place on `GitHub <https://github.com/pygments/pygments>`_.
+clear and easy to learn language. Read our :doc:`contribution guidelines
+<docs/contributing>` for more information.
 
 If you found a bug, just open a ticket in the GitHub tracker. Be sure to log in
 to be notified when the issue is fixed -- development is not fast-paced as
--
cgit v1.2.1