Diffstat (limited to 'Doc/library/re.rst')
-rw-r--r-- Doc/library/re.rst | 64
1 files changed, 43 insertions, 21 deletions
diff --git a/Doc/library/re.rst b/Doc/library/re.rst
index 26f2a3824e..dd790e7efe 100644
--- a/Doc/library/re.rst
+++ b/Doc/library/re.rst
@@ -242,21 +242,32 @@ The special characters are:
``(?P<name>...)``
Similar to regular parentheses, but the substring matched by the group is
- accessible within the rest of the regular expression via the symbolic group
- name *name*. Group names must be valid Python identifiers, and each group
- name must be defined only once within a regular expression. A symbolic group
- is also a numbered group, just as if the group were not named. So the group
- named ``id`` in the example below can also be referenced as the numbered group
- ``1``.
-
- For example, if the pattern is ``(?P<id>[a-zA-Z_]\w*)``, the group can be
- referenced by its name in arguments to methods of match objects, such as
- ``m.group('id')`` or ``m.end('id')``, and also by name in the regular
- expression itself (using ``(?P=id)``) and replacement text given to
- ``.sub()`` (using ``\g<id>``).
+ accessible via the symbolic group name *name*. Group names must be valid
+ Python identifiers, and each group name must be defined only once within a
+ regular expression. A symbolic group is also a numbered group, just as if
+ the group were not named.
+
+ Named groups can be referenced in three contexts. If the pattern is
+ ``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
+ single or double quotes):
+
+ +---------------------------------------+----------------------------------+
+ | Context of reference to group "quote" | Ways to reference it |
+ +=======================================+==================================+
+ | in the same pattern itself | * ``(?P=quote)`` (as shown) |
+ | | * ``\1`` |
+ +---------------------------------------+----------------------------------+
+ | when processing match object ``m`` | * ``m.group('quote')`` |
+ | | * ``m.end('quote')`` (etc.) |
+ +---------------------------------------+----------------------------------+
+ | in a string passed to the ``repl`` | * ``\g<quote>`` |
+ | argument of ``re.sub()`` | * ``\g<1>`` |
+ | | * ``\1`` |
+ +---------------------------------------+----------------------------------+
``(?P=name)``
- Matches whatever text was matched by the earlier group named *name*.
+ A backreference to a named group; it matches whatever text was matched by the
+ earlier group named *name*.
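
The three reference contexts in the table above can be tried with a short sketch. The sample text and the variable names (``pattern``, ``text``, ``bracketed``) are illustrative assumptions and are not part of the patch:

```python
import re

# The pattern from the table: a quote character, lazily matched text,
# then the same quote character again via a named backreference.
pattern = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")
text = """say "hello" or 'bye'"""

# Context 1: inside the pattern, (?P=quote) and \1 refer to the same group.
numbered = re.compile(r"""(['"]).*?\1""")
same_match = pattern.search(text).group() == numbered.search(text).group()

# Context 2: on the match object, by name or by number.
m = pattern.search(text)
by_name = m.group('quote')
by_number = m.group(1)
end_of_quote = m.end('quote')

# Context 3: in the repl string of sub(), three equivalent spellings.
bracketed = pattern.sub(r'[\g<quote>]', text)
bracketed_g1 = pattern.sub(r'[\g<1>]', text)
bracketed_1 = pattern.sub(r'[\1]', text)
```

All three ``sub()`` calls should produce the same string, since ``\g<quote>``, ``\g<1>`` and ``\1`` name the same group here.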
``(?#...)``
A comment; the contents of the parentheses are simply ignored.
@@ -306,7 +317,7 @@ The special characters are:
optional and can be omitted. For example,
``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email matching pattern, which
will match with ``'<user@host.com>'`` as well as ``'user@host.com'``, but
- not with ``'<user@host.com'`` nor ``'user@host.com>'`` .
+ not with ``'<user@host.com'`` nor ``'user@host.com>'``.
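
As a quick check of the conditional pattern quoted above, the following sketch runs it against the four strings from the text. The names ``email`` and ``results`` are assumptions made for this example:

```python
import re

# (?(1)>|$) requires '>' only if group 1 (the optional '<') took part in the match;
# otherwise the address must run to the end of the string.
email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

candidates = [
    '<user@host.com>',   # both brackets present
    'user@host.com',     # no brackets
    '<user@host.com',    # opening bracket only
    'user@host.com>',    # closing bracket only
]

# match() anchors at the start of each candidate string.
results = {c: email.match(c) is not None for c in candidates}
```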
The special sequences consist of ``'\'`` and a character from the list below.
@@ -316,7 +327,7 @@ the second character. For example, ``\$`` matches the character ``'$'``.
``\number``
Matches the contents of the group of the same number. Groups are numbered
starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
- but not ``'the end'`` (note the space after the group). This special sequence
+ but not ``'thethe'`` (note the space after the group). This special sequence
can only be used to match one of the first 99 groups. If the first digit of
*number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
a group match, but as the character with octal value *number*. Inside the
@@ -414,17 +425,24 @@ Most of the standard escapes supported by Python string literals are also
accepted by the regular expression parser::
\a \b \f \n
- \r \t \v \x
- \\
+ \r \t \u \U
+ \v \x \\
(Note that ``\b`` is used to represent word boundaries, and means "backspace"
only inside character classes.)
+``'\u'`` and ``'\U'`` escape sequences are only recognized in Unicode
+patterns. In bytes patterns they are not treated specially.
+
Octal escapes are included in a limited form. If the first digit is a 0, or if
there are three octal digits, it is considered an octal escape. Otherwise, it is
a group reference. As for string literals, octal escapes are always at most
three digits in length.
+.. versionchanged:: 3.3
+ The ``'\u'`` and ``'\U'`` escape sequences have been added.
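
A minimal sketch of the new escapes in a Unicode (``str``) pattern, assuming Python 3.3 or later. The sample strings are illustrative. Bytes patterns are deliberately left out, since their handling of ``\u`` may differ between Python versions:

```python
import re

# Raw strings keep the backslash, so the regex parser (not the string
# literal) interprets \u00e9 and \U0001F600.
accent = re.search(r'\u00e9', 'caf\u00e9')
emoji = re.search(r'\U0001F600', 'smile: \U0001F600')

accent_text = accent.group()   # the single character U+00E9
emoji_start = emoji.start()    # index of U+1F600 in the sample string
```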
+
+
.. _contents-of-module-re:
@@ -660,7 +678,8 @@ form.
when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
``'-a-b-c-'``.
- In addition to character escapes and backreferences as described above,
+ In string-type *repl* arguments, in addition to the character escapes and
+ backreferences described above,
``\g<name>`` will use the substring matched by the group named ``name``, as
defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
@@ -684,9 +703,12 @@ form.
.. function:: escape(string)
- Return *string* with all non-alphanumerics backslashed; this is useful if you
- want to match an arbitrary literal string that may have regular expression
- metacharacters in it.
+ Escape all the characters in *string* except ASCII letters, digits and ``'_'``.
+ This is useful if you want to match an arbitrary literal string that may
+ have regular expression metacharacters in it.
+
+ .. versionchanged:: 3.3
+ The ``'_'`` character is no longer escaped.
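
The use case described for ``escape()`` can be sketched as follows. The literal and the surrounding text are made up for this example, and the exact set of characters that gets a backslash may vary between Python versions, so only the ``'_'`` behaviour is checked:

```python
import re

literal = 'price_usd (approx.)'
text = 'The price_usd (approx.) is 5'

# Used as a raw pattern, the parentheses form a group and '.' matches any
# character, so the literal text is not found.
unescaped = re.search(literal, text)

# After escaping, the metacharacters match themselves.
escaped_pattern = re.escape(literal)
escaped = re.search(escaped_pattern, text)

# On Python 3.3 and later the underscore is left alone.
underscore_escaped = '\\_' in escaped_pattern
```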
.. function:: purge()