diff options
Diffstat (limited to 'Doc/library/re.rst')
-rw-r--r-- | Doc/library/re.rst | 64 |
1 files changed, 43 insertions, 21 deletions
diff --git a/Doc/library/re.rst b/Doc/library/re.rst index 26f2a3824e..dd790e7efe 100644 --- a/Doc/library/re.rst +++ b/Doc/library/re.rst @@ -242,21 +242,32 @@ The special characters are: ``(?P<name>...)`` Similar to regular parentheses, but the substring matched by the group is - accessible within the rest of the regular expression via the symbolic group - name *name*. Group names must be valid Python identifiers, and each group - name must be defined only once within a regular expression. A symbolic group - is also a numbered group, just as if the group were not named. So the group - named ``id`` in the example below can also be referenced as the numbered group - ``1``. - - For example, if the pattern is ``(?P<id>[a-zA-Z_]\w*)``, the group can be - referenced by its name in arguments to methods of match objects, such as - ``m.group('id')`` or ``m.end('id')``, and also by name in the regular - expression itself (using ``(?P=id)``) and replacement text given to - ``.sub()`` (using ``\g<id>``). + accessible via the symbolic group name *name*. Group names must be valid + Python identifiers, and each group name must be defined only once within a + regular expression. A symbolic group is also a numbered group, just as if + the group were not named. + + Named groups can be referenced in three contexts. If the pattern is + ``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either + single or double quotes): + + +---------------------------------------+----------------------------------+ + | Context of reference to group "quote" | Ways to reference it | + +=======================================+==================================+ + | in the same pattern itself | * ``(?P=quote)`` (as shown) | + | | * ``\1`` | + +---------------------------------------+----------------------------------+ + | when processing match object ``m`` | * ``m.group('quote')`` | + | | * ``m.end('quote')`` (etc.) | + +---------------------------------------+----------------------------------+ + | in a string passed to the ``repl`` | * ``\g<quote>`` | + | argument of ``re.sub()`` | * ``\g<1>`` | + | | * ``\1`` | + +---------------------------------------+----------------------------------+ ``(?P=name)`` - Matches whatever text was matched by the earlier group named *name*. + A backreference to a named group; it matches whatever text was matched by the + earlier group named *name*. ``(?#...)`` A comment; the contents of the parentheses are simply ignored. @@ -306,7 +317,7 @@ The special characters are: optional and can be omitted. For example, ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email matching pattern, which will match with ``'<user@host.com>'`` as well as ``'user@host.com'``, but - not with ``'<user@host.com'`` nor ``'user@host.com>'`` . + not with ``'<user@host.com'`` nor ``'user@host.com>'``. The special sequences consist of ``'\'`` and a character from the list below. @@ -316,7 +327,7 @@ the second character. For example, ``\$`` matches the character ``'$'``. ``\number`` Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``, - but not ``'the end'`` (note the space after the group). This special sequence + but not ``'thethe'`` (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value *number*. Inside the @@ -414,17 +425,24 @@ Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser:: \a \b \f \n - \r \t \v \x - \\ + \r \t \u \U + \v \x \\ (Note that ``\b`` is used to represent word boundaries, and means "backspace" only inside character classes.) +``'\u'`` and ``'\U'`` escape sequences are only recognized in Unicode +patterns. In bytes patterns they are not treated specially. + Octal escapes are included in a limited form. If the first digit is a 0, or if there are three octal digits, it is considered an octal escape. Otherwise, it is a group reference. As for string literals, octal escapes are always at most three digits in length. +.. versionchanged:: 3.3 + The ``'\u'`` and ``'\U'`` escape sequences have been added. + + .. _contents-of-module-re: @@ -660,7 +678,8 @@ form. when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns ``'-a-b-c-'``. - In addition to character escapes and backreferences as described above, + In string-type *repl* arguments, in addition to the character escapes and + backreferences described above, ``\g<name>`` will use the substring matched by the group named ``name``, as defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous @@ -684,9 +703,12 @@ form. .. function:: escape(string) - Return *string* with all non-alphanumerics backslashed; this is useful if you - want to match an arbitrary literal string that may have regular expression - metacharacters in it. + Escape all the characters in pattern except ASCII letters, numbers and ``'_'``. + This is useful if you want to match an arbitrary literal string that may + have regular expression metacharacters in it. + + .. versionchanged:: 3.3 + The ``'_'`` character is no longer escaped. .. function:: purge() |