=head1 NAME
perlre - Perl regular expressions
=head1 DESCRIPTION
This page describes the syntax of regular expressions in Perl.
If you haven't used regular expressions before, a quick-start
introduction is available in L<perlrequick>, and a longer tutorial
introduction is available in L<perlretut>.
For reference on how regular expressions are used in matching
operations, plus various examples of the same, see discussions of
C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like Operators">.
=head2 Modifiers
Matching operations can have various modifiers. Modifiers
that relate to the interpretation of the regular expression inside
are listed below. Modifiers that alter the way a regular expression
is used by Perl are detailed in L<perlop/"Regexp Quote-Like Operators"> and
L<perlop/"Gory details of parsing quoted constructs">.
=over 4
=item m
Treat string as multiple lines. That is, change "^" and "$" from matching
the start or end of line only at the left and right ends of the string to
matching them anywhere within the string.
=item s
Treat string as single line. That is, change "." to match any character
whatsoever, even a newline, which normally it would not match.
Used together, as C</ms>, they let the "." match any character whatsoever,
while still allowing "^" and "$" to match, respectively, just after
and just before newlines within the string.
=item i
Do case-insensitive pattern matching.
If locale matching rules are in effect, the case map is taken from the
current locale for code points less than 256, and from Unicode rules for
larger code points. However, matches that would cross the Unicode
rules/non-Unicode rules boundary (ords 255/256) will not succeed. See
L<perllocale>.
There are a number of Unicode characters that match multiple characters
under C</i>. For example, C<LATIN SMALL LIGATURE FI>
should match the sequence C<fi>. Perl is not
currently able to do this when the multiple characters are in the pattern and
are split between groupings, or when one or more are quantified. Thus

 "\N{LATIN SMALL LIGATURE FI}" =~ /fi/i;          # Matches
 "\N{LATIN SMALL LIGATURE FI}" =~ /[fi][fi]/i;    # Doesn't match!
 "\N{LATIN SMALL LIGATURE FI}" =~ /fi*/i;         # Doesn't match!

 # The below doesn't match, and it isn't clear what $1 and $2 would
 # be even if it did!!
 "\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i;      # Doesn't match!

Perl doesn't match multiple characters in an inverted bracketed
character class, which otherwise could be highly confusing. See
L<perlrecharclass/Negation>.
Another bug involves character classes that match both a sequence of
multiple characters, and an initial sub-string of that sequence. For
example,

 /[s\xDF]/i

should match both a single and a double "s", since C<\xDF> (on ASCII
platforms) matches "ss". However, this bug
(L<[perl #89774]|https://rt.perl.org/rt3/Ticket/Display.html?id=89774>)
causes it to only match a single "s", even if the final larger match
fails, and matching the double "ss" would have succeeded.
Also, Perl matching doesn't fully conform to the current Unicode C</i>
recommendations, which ask that the matching be made upon the NFD
(Normalization Form Decomposed) of the text. However, Unicode is
in the process of reconsidering and revising their recommendations.
=item x
Extend your pattern's legibility by permitting whitespace and comments.
Details in L</"/x">.
=item p
Preserve the string matched such that ${^PREMATCH}, ${^MATCH}, and
${^POSTMATCH} are available for use after matching.
=item g and c
Global matching, and keep the Current position after failed matching.
Unlike i, m, s and x, these two flags affect the way the regex is used
rather than the regex itself. See
L<perlretut/"Using regular expressions in Perl"> for further explanation
of the g and c modifiers.
=item a, d, l and u
These modifiers, all new in 5.14, affect which character-set semantics
(Unicode, etc.) are used, as described below in
L</Character set modifiers>.
=back
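
As a rough illustration of the C</m>, C</s>, C</p>, and C</gc>
descriptions above, the following sketch may help. The sample strings and
variable names are invented for this example, and C</p> assumes Perl 5.10
or later.

```perl
use strict;
use warnings;

# Sample data, made up for this illustration.
my $text = "alpha one\nbeta two\n";

# Without /m, "^" matches only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;

# With /m, "^" also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;

# Without /s, "." will not match the newline between the two lines.
my $dot_plain  = $text =~ /alpha.*beta/  ? 1 : 0;

# With /s, "." matches any character, newline included.
my $dot_single = $text =~ /alpha.*beta/s ? 1 : 0;

# With /ms, "." crosses the newline and the second "^" still matches
# just after it.
my $both = $text =~ /^alpha.*^beta/ms ? 1 : 0;

# /p makes the pieces around the match available afterwards.
my ($pre, $match, $post);
if ("hello world" =~ /o w/p) {
    ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
}

# /g in scalar context remembers where the last match ended.
my $input  = "abc123";
my $first  = $input =~ /[a-z]+/gc;   # succeeds, position is now 3
my $second = $input =~ /[a-z]+/gc;   # fails, but /c keeps the position
my $pos_kept  = pos($input);
my $third  = $input =~ /[a-z]+/g;    # fails, and without /c the position is reset
my $pos_reset = pos($input);
```

Here C<@plain> should hold only C<alpha>, while C<@multi> should hold both
C<alpha> and C<beta>; C<$pos_kept> should be 3 and C<$pos_reset> undefined.
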
Regular expression modifiers are usually written in documentation
as e.g., "the C</x> modifier", even though the delimiter
in question might not really be a slash. The modifiers C</imsxadlup>
may also be embedded within the regular expression itself using
the C<(?...)> construct; see L</Extended Patterns> below.
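
A minimal sketch of the difference between a trailing modifier and an
embedded one, using C</i> as the example. The variable names are
illustrative only, and the scoped C<(?i:...)> form shown last is one
variant of the embedded construct.

```perl
use strict;
use warnings;

# Trailing modifier: applies to the whole pattern.
my $trailing = qr/perl/i;

# Embedded modifier: (?i) switches on case-insensitivity inside the pattern.
my $embedded = qr/(?i)perl/;

# Scoped form: only the group is case-insensitive; " rocks" stays
# case-sensitive.
my $scoped = qr/(?i:perl) rocks/;

my $t  = "PERL"       =~ $trailing ? 1 : 0;
my $e  = "PERL"       =~ $embedded ? 1 : 0;
my $s1 = "PERL rocks" =~ $scoped   ? 1 : 0;
my $s2 = "PERL ROCKS" =~ $scoped   ? 1 : 0;
```

The first three matches should succeed; the last should fail because
C<ROCKS> lies outside the case-insensitive group.
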
=head3 /x
C</x> tells
the regular expression parser to ignore most whitespace that is neither
backslashed nor within a character class. You can use this to break up
your regular expression into (slightly) more readable parts. The C<#>
character is also treated as a metacharacter introducing a comment,
just as in ordinary Perl code. This also means that if you want real
whitespace or C<#> characters in the pattern (outside a character
class, where they are unaffected by C</x>), then you'll either have to
escape them (using backslashes or C<\Q...\E>) or encode them using octal,
hex, or C<\N{}> escapes. Taken together, these features go a long way towards
making Perl's regular expressions more readable. Note that you have to
be careful not to include the pattern delimiter in the comment--perl has
no way of knowing you did not intend to close the pattern early. See
the C-comment deletion code in L<perlop>. Also note that anything inside
a C<\Q...\E> stays unaffected by C</x>. And note that C</x> doesn't affect
space interpretation within a single multi-character construct. For
example in C<\x{...}>, regardless of the C</x> modifier, there can be no
spaces. The same applies to a L<quantifier|/Quantifiers> such as C<{3}> or
C<{5,}>. Similarly, C<(?:...)> can't have a space between the C<?> and C<:>,
but can between the C<(> and C<?>. Within any delimiters for such a
construct, allowed spaces are not affected by C</x>, and depend on the
construct. For example, C<\x{...}> can't have spaces because hexadecimal
numbers don't have spaces in them. But, Unicode properties can have spaces, so
in C<\p{...}> there can be spaces that follow the Unicode rules, for which see
L<perluniprops/Properties accessible through \p{} and \P{}>.
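
The whitespace and comment rules above can be sketched as follows. The
date pattern and the test strings are invented for illustration; braces
are used as the C<qr> delimiters so that the comments cannot contain the
delimiter by accident.

```perl
use strict;
use warnings;

# A date pattern spread over several lines, with comments, thanks to /x.
my $date_re = qr{
    ^
    (\d{4})  -   # year
    (\d{2})  -   # month
    (\d{2})      # day
    \z
}x;

my @parts = "2011-05-14" =~ $date_re;

# Under /x a bare space in the pattern is ignored, so this pattern
# matches "foobar" and not "foo bar".
my $ignored  = "foobar"  =~ /^ foo bar \z/x ? 1 : 0;
my $no_space = "foo bar" =~ /^ foo bar \z/x ? 1 : 0;

# A space inside a character class, or a backslashed space, is literal.
my $class    = "foo bar" =~ /^ foo [ ] bar \z/x ? 1 : 0;
my $escaped  = "foo bar" =~ /^ foo \  bar \z/x  ? 1 : 0;

# A literal "#" must be escaped, or it would start a comment.
my $hash     = "a#b" =~ /^ a \# b \z/x ? 1 : 0;
```
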
=head3 Character set modifiers
C</d>, C</u>, C</a>, and C</l>, available starting in 5.14, are called
the character set modifiers; they affect the character set semantics
used for the regular expression.
The C</d>, C</u>, and C</l> modifiers are not likely to be of much use
to you, and so you need not worry about them very much. They exist for
Perl's internal use, so that complex regular expression data structures
can be automatically serialized and later exactly reconstituted,
including all their nuances. But, since Perl can't keep a secret, and
there may be rare instances where they are useful, they are documented
here.
The C</a> modifier, on the other hand, may be useful. Its purpose is to
allow code that is to work mostly on ASCII data to not have to concern
itself with Unicode.
Briefly, C</l> sets the character set to that of whatever B<L>ocale is in
effect at the time of the execution of the pattern match.

C</u> sets the character set to B<U>nicode.

C</a> also sets the character set to Unicode, BUT adds several
restrictions for B<A>SCII-safe matching.

C</d> is the old, problematic, pre-5.14 B<D>efault character set
behavior. Its only use is to force that old behavior.
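
As a small sketch of how C</u> and C</a> can differ in practice, consider
C<\d> and C<\w>, two of the constructs that C</a> restricts to ASCII. The
sample characters and variable names are chosen only for illustration.

```perl
use 5.014;      # the /a and /u modifiers need Perl 5.14 or later
use warnings;

my $arabic_three = "\x{0663}";   # ARABIC-INDIC DIGIT THREE
my $e_acute      = "\xE9";       # LATIN SMALL LETTER E WITH ACUTE

# Under /u, \d and \w follow Unicode rules.
my $u_digit = $arabic_three =~ /^\d\z/u ? 1 : 0;
my $u_word  = $e_acute      =~ /^\w\z/u ? 1 : 0;

# Under /a, \d and \w match ASCII characters only.
my $a_digit = $arabic_three =~ /^\d\z/a ? 1 : 0;
my $a_word  = $e_acute      =~ /^\w\z/a ? 1 : 0;

# ASCII input behaves the same under both.
my $a_ascii = "3" =~ /^\d\z/a ? 1 : 0;
```

Both C</u> matches should succeed, both non-ASCII C</a> matches should
fail, and the plain ASCII digit should match under C</a>.
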
At any given time, exactly one of these modifiers is in effect. Their
existence allows Perl to keep the originally compiled behavior of a
regular expression, regardless of what rules are in effect when it is
actually executed. And if it is interpolated into a larger regex, the
original's rules continue to apply to it, and only it.
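
The point about interpolation can be sketched like this: a pattern
compiled with C</a> keeps its ASCII-only rules even when it is placed
inside a larger pattern compiled with C</u>. The patterns and test strings
are hypothetical examples.

```perl
use 5.014;      # the /a and /u modifiers need Perl 5.14 or later
use warnings;

# Compiled with /a: \d means ASCII digits only.
my $ascii_digits = qr/\d+/a;

# The outer pattern uses /u, but the interpolated part keeps its /a rules.
my $mixed = qr/^${ascii_digits}-\d+\z/u;

# ASCII digits first, then a Unicode digit matched by the outer \d.
my $ok  = "12-\x{0663}" =~ $mixed ? 1 : 0;

# A Unicode digit in the first position is rejected by the inner /a part.
my $bad = "\x{0663}-12" =~ $mixed ? 1 : 0;
```
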
The C</l> and C</u> modifiers are automatically selected for
regular expressions compiled within the scope of various pragmas,
and we recommend that in general, you use those pragmas instead of
specifying these modifiers explicitly. For one thing, the modifiers
affect only pattern matching, and do not extend to even any replacement
done, whereas using the pragmas gives consistent results for all
appropriate operations within their scopes. For example,

 s/foo/\Ubar/il

will match "foo" using the locale's rules for case-insensitive matching,
but the C</l> does not affect how the C<\U> operates. Most likely you
want both of them to use locale rules. To do this, instead compile the
regular expression within the scope of C<use locale>