-rw-r--r--  pod/perl5120delta.pod     9
-rw-r--r--  pod/perlre.pod            3
-rw-r--r--  pod/perlrebackslash.pod  95
-rw-r--r--  pod/perlrecharclass.pod  43
-rw-r--r--  pod/perlreref.pod         6
5 files changed, 104 insertions, 52 deletions
diff --git a/pod/perl5120delta.pod b/pod/perl5120delta.pod
index 47304ff55f..a5fac1430a 100644
--- a/pod/perl5120delta.pod
+++ b/pod/perl5120delta.pod
@@ -238,9 +238,12 @@ for some or all operations. (Yuval Kogman)
A new regex escape has been added, C<\N>. It will match any character that
is not a newline, independently from the presence or absence of the single
line match modifier C</s>. It is not usable within a character class.
-(If C<\N> is followed by an opening brace and
-by a letter, perl will still assume that a Unicode character name is
-coming, so compatibility is preserved.) (Rafael Garcia-Suarez).
+C<\N{3}> means to match 3 non-newlines; C<\N{5,}> means to match at least 5.
+C<\N{NAME}> still means the character or sequence named C<NAME>, but C<NAME> no
+longer can be things like C<3>, or C<5,>.
+Compatibility with Unicode names is preserved, as none look like these, but it
+has been possible to create custom names that do look like them, and those will
+no longer work. (Rafael Garcia-Suarez)
This will break a L<custom charnames translator|charnames/CUSTOM TRANSLATORS>
which allows numbers for character names, as C<\N{3}> will now mean to match 3
diff --git a/pod/perlre.pod b/pod/perlre.pod
index 8189b29b32..f82d196e26 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -301,6 +301,9 @@ B<Note:> C<\R> has no special meaning inside of a character class;
use C<\v> instead (vertical whitespace).
X<\R>
+Note that C<\N> has two meanings. When of the form C<\N{NAME}>, it matches the
+character whose name is C<NAME>. Otherwise it matches any character but C<\n>.
+
The POSIX character class syntax
X<character class>
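
As a reading aid for the perlre.pod hunk above, here is a minimal sketch of the two meanings of C<\N>, plus the quantified form described in the perl5120delta.pod hunk. It assumes perl 5.12.0 or later; the sample strings and variable names are illustrative and are not part of the commit.

```perl
use strict;
use warnings;
use charnames ':full';    # loads the Unicode names used by \N{NAME}

# Bare \N matches any single character except "\n".
my @non_newlines = "ab\ncd" =~ /(\N)/g;

# \N{3} is \N with a quantifier: three consecutive non-newlines.
my $run_in_abc_d = ("abc\nd" =~ /\N{3}/) ? 1 : 0;
my $run_in_ab_cd = ("ab\ncd" =~ /\N{3}/) ? 1 : 0;

# \N{NAME} is a named character, here U+0041.
my $named = ("A" =~ /^\N{LATIN CAPITAL LETTER A}$/) ? 1 : 0;

print "non-newlines: @non_newlines\n";
print "three in a row (abc, newline, d): $run_in_abc_d\n";
print "three in a row (ab, newline, cd): $run_in_ab_cd\n";
print "named character matched: $named\n";
```

The second string has no run of three non-newline characters because the newline splits it into two runs of two, so only the first C<\N{3}> test succeeds.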
diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod
index 3d3a76f58e..271f4b3059 100644
--- a/pod/perlrebackslash.pod
+++ b/pod/perlrebackslash.pod
@@ -25,14 +25,13 @@ it either takes away the special meaning of the character following it
or it is the start of a backslash or escape sequence.
The rules determining what it is are quite simple: if the character
-following the backslash is a punctuation (non-word) character (that is,
-anything that is not a letter, digit or underscore), then the backslash
-just takes away the special meaning (if any) of the character following
-it.
-
-If the character following the backslash is a letter or a digit, then the
-sequence may be special; if so, it's listed below. A few letters have not
-been used yet, and escaping them with a backslash is safe for now, but a
+following the backslash is an ASCII punctuation (non-word) character (that is,
+anything that is not a letter, digit or underscore), then the backslash just
+takes away the special meaning (if any) of the character following it.
+
+If the character following the backslash is an ASCII letter or an ASCII digit,
+then the sequence may be special; if so, it's listed below. A few letters have
+not been used yet, and escaping them with a backslash is safe for now, but a
future version of Perl may assign a special meaning to it. However, if you
have warnings turned on, Perl will issue a warning if you use such a sequence.
[1].
@@ -61,48 +60,51 @@ quoted constructs>.
=head2 All the sequences and escapes
+Those not usable within a bracketed character class (like C<[\da-z]>) are marked
+as C<Not in [].>
+
\000 Octal escape sequence.
- \1 Absolute backreference.
+ \1 Absolute backreference. Not in [].
\a Alarm or bell.
- \A Beginning of string.
- \b Word/non-word boundary. (Backspace in a char class).
- \B Not a word/non-word boundary.
+ \A Beginning of string. Not in [].
+ \b Word/non-word boundary. (Backspace in []).
+ \B Not a word/non-word boundary. Not in [].
\cX Control-X (X can be any ASCII character).
- \C Single octet, even under UTF-8.
+ \C Single octet, even under UTF-8. Not in [].
\d Character class for digits.
\D Character class for non-digits.
\e Escape character.
- \E Turn off \Q, \L and \U processing.
+ \E Turn off \Q, \L and \U processing. Not in [].
\f Form feed.
- \g{}, \g1 Named, absolute or relative backreference.
- \G Pos assertion.
+ \g{}, \g1 Named, absolute or relative backreference. Not in [].
+ \G Pos assertion. Not in [].
\h Character class for horizontal white space.
\H Character class for non horizontal white space.
- \k{}, \k<>, \k'' Named backreference.
- \K Keep the stuff left of \K.
- \l Lowercase next character.
- \L Lowercase till \E.
+ \k{}, \k<>, \k'' Named backreference. Not in [].
+ \K Keep the stuff left of \K. Not in [].
+ \l Lowercase next character. Not in [].
+ \L Lowercase till \E. Not in [].
\n (Logical) newline character.
- \N Any character but newline.
+ \N Any character but newline. Not in [].
\N{} Named (Unicode) character.
\p{}, \pP Character with the given Unicode property.
\P{}, \PP Character without the given Unicode property.
- \Q Quotemeta till \E.
+ \Q Quotemeta till \E. Not in [].
\r Return character.
- \R Generic new line.
+ \R Generic new line. Not in [].
\s Character class for white space.
\S Character class for non white space.
\t Tab character.
- \u Titlecase next character.
- \U Uppercase till \E.
+ \u Titlecase next character. Not in [].
+ \U Uppercase till \E. Not in [].
\v Character class for vertical white space.
\V Character class for non vertical white space.
\w Character class for word characters.
\W Character class for non-word characters.
\x{}, \x00 Hexadecimal escape sequence.
- \X Unicode "extended grapheme cluster".
- \z End of string.
- \Z End of string.
+ \X Unicode "extended grapheme cluster". Not in [].
+ \z End of string. Not in [].
+ \Z End of string. Not in [].
=head2 Character Escapes
@@ -156,15 +158,20 @@ Mnemonic: I<c>ontrol character.
=head3 Named characters
-All Unicode characters have a Unicode name, and characters in various scripts
-have names as well. It is even possible to give your own names to characters.
-You can use a character by name by using the C<\N{}> construct; the name of
-the character goes between the curly braces. You do have to C<use charnames>
-to load the names of the characters, otherwise Perl will complain you use
-a name it doesn't know about. For more details, see L<charnames>.
+All Unicode characters have a Unicode name. It is even possible to give your
+own names to characters, even to short sequences of characters. You can use a
+character by name by using the C<\N{}> construct; the name of the character
+goes between the curly braces. You do have to C<use charnames> to load the
+Unicode names of the characters, otherwise Perl will complain. (If you instead
+have your own names, a C<use> statement will be required for your translator.)
+For more details, see L<charnames>.
Mnemonic: I<N>amed character.
+Note that a character that is expressed as a named character is considered
+as a character without special meaning by the regex engine, and will match
+"as is".
+
=head4 Example
use charnames ':full'; # Loads the Unicode names.
@@ -177,7 +184,8 @@ Mnemonic: I<N>amed character.
Octal escapes consist of a backslash followed by two or three octal digits
matching the code point of the character you want to use. This allows for
-512 characters (C<\00> up to C<\777>) that can be expressed this way.
+512 characters (C<\00> up to C<\777>) that can be expressed this way (but
+anything above C<\377> is deprecated).
Enough in pre-Unicode days, but most Unicode characters cannot be escaped
this way.
@@ -329,7 +337,7 @@ absolutely, relatively, and by name.
A backslash sequence that starts with a backslash and is followed by a
number is an absolute reference (but be aware of the caveat mentioned above).
-If the number is I<N>, it refers to the Nth set of parenthesis - whatever
+If the number is I<N>, it refers to the Nth set of parentheses - whatever
has been matched by that set of parenthesis has to be matched by the C<\N>
as well.
@@ -379,7 +387,7 @@ written as C<\k{name}>, C<< \k<name> >> or C<\k'name'>.
Note that C<\g{}> has the potential to be ambiguous, as it could be a named
reference, or an absolute or relative reference (if its argument is numeric).
-However, names are not allowed to start with digits, nor are allowed to
+However, names are not allowed to start with digits, nor are they allowed to
contain a hyphen, so there is no ambiguity.
=head4 Examples
@@ -490,6 +498,17 @@ instead of C<s/(PAT1) PAT2/${1}REPL/x> or C<s/(?<=PAT1) PAT2/REPL/x>.
Mnemonic: I<K>eep.
+=item \N
+
+This is new in perl 5.12.0. It matches any character that is not a newline.
+It is a short-hand for writing C<[^\n]>, and is identical to the C<.>
+metasymbol, except under the C</s> flag, which changes the meaning of C<.>, but
+not C<\N>.
+
+Note that C<\N{...}> can mean a L<named character|/Named characters>.
+
+Mnemonic: Complement of I<\n>.
+
=item \R
C<\R> matches a I<generic newline>, that is, anything that is considered
@@ -512,7 +531,7 @@ This matches a Unicode I<extended grapheme cluster>.
C<\X> matches quite well what normal (non-Unicode-programmer) usage
would consider a single character. As an example, consider a G with some sort
of diacritic mark, such as an arrow. There is no such single character in
-Unicode, but one can be composed using a G followed by a Unicode "COMBINING
+Unicode, but one can be composed by using a G followed by a Unicode "COMBINING
UPWARDS ARROW BELOW", and would be displayed by Unicode-aware software as if it
were a single character.
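
The C<\N> item added to perlrebackslash.pod in this file's diff says that C</s> changes the meaning of C<.> but not of C<\N>. A small sketch of that difference follows, assuming perl 5.12.0 or later; the test string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $s = "a\nb";    # one newline between two letters

# Without /s, neither . nor \N matches the newline.
my $dot_plain = ($s =~ /^a.b$/)   ? 1 : 0;
my $n_plain   = ($s =~ /^a\Nb$/)  ? 1 : 0;

# With /s, the dot starts matching newline; \N still refuses it.
my $dot_s     = ($s =~ /^a.b$/s)  ? 1 : 0;
my $n_s       = ($s =~ /^a\Nb$/s) ? 1 : 0;

# [^\n] is the long-hand spelling of \N and behaves the same under /s.
my $class_s   = ($s =~ /^a[^\n]b$/s) ? 1 : 0;

print "dot: plain=$dot_plain /s=$dot_s\n";
print "\\N:  plain=$n_plain /s=$n_s\n";
print "[^\\n] under /s: $class_s\n";
```

Only the C</s> dot match succeeds, which is why C<\N> is useful when a pattern is compiled with C</s> but one position must still stay on the current line.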
diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod
index 0b5b89a44d..55b178e3e7 100644
--- a/pod/perlrecharclass.pod
+++ b/pod/perlrecharclass.pod
@@ -45,7 +45,7 @@ constitute a character class. That is, they will match a single
character, if that character belongs to a specific set of characters
(defined by the sequence). A backslashed sequence is a sequence of
characters starting with a backslash. Not all backslashed sequences
-are character class; for a full list, see L<perlrebackslash>.
+are character classes; for a full list, see L<perlrebackslash>.
Here's a list of the backslashed sequences, which are discussed in
more detail below.
@@ -116,8 +116,12 @@ that is not considered horizontal white space.
C<\N>, like the dot, will match any character that is not a newline. The
difference is that C<\N> will not be influenced by the single line C</s>
regular expression modifier. (Note that, since C<\N{}> is also used for
-Unicode named characters, if C<\N> is followed by an opening brace and
-by a letter, perl will assume that a Unicode character name is coming.)
+named characters, if C<\N> is followed by an opening brace and something that
+is not a quantifier, perl will assume that a character name is coming. For
+example, C<\N{3}> means to match 3 non-newlines; C<\N{5,}> means to match 5 or
+more non-newlines, but C<\N{4F}> is not a legal quantifier, and will cause
+perl to look for a character named C<4F> (and won't find it unless custom names
+have been defined.)
C<\v> will match any character that is considered vertical white space;
this includes the carriage return and line feed characters (newline).
@@ -265,7 +269,7 @@ Examples:
=head3 Special Characters Inside a Bracketed Character Class
Most characters that are meta characters in regular expressions (that
-is, characters that carry a special meaning like C<*> or C<(>) lose
+is, characters that carry a special meaning like C<.>, C<*>, or C<(>) lose
their special meaning and can be used inside a character class without
the need to escape them. For instance, C<[()]> matches either an opening
parenthesis, or a closing parenthesis, and the parens inside the character
@@ -282,6 +286,22 @@ that does not have either two word characters or two non-word characters
on either side, inside a bracketed character class, C<\b> matches a
backspace character.
+The sequences
+C<\a>,
+C<\c>,
+C<\e>,
+C<\f>,
+C<\n>,
+C<\N{NAME}>,
+C<\r>,
+C<\t>,
+and
+C<\x>
+are also special and have the same meanings as they do outside a bracketed character
+class.
+
+Also, a backslash followed by digits is considered an octal number.
+
A C<[> is not special inside a character class, unless it's the start
of a POSIX character class (see below). It normally does not need escaping.
@@ -362,11 +382,16 @@ Examples:
=head3 Backslash Sequences
-You can put a backslash sequence character class inside a bracketed character
-class, and it will act just as if you put all the characters matched by
-the backslash sequence inside the character class. For instance,
-C<[a-f\d]> will match any digit, or any of the lowercase letters between
-'a' and 'f' inclusive.
+You can put any backslash sequence character class (with one exception listed
+in the next paragraph) inside a bracketed character class, and it will act just
+as if you put all the characters matched by the backslash sequence inside the
+character class. For instance, C<[a-f\d]> will match any digit, or any of the
+lowercase letters between 'a' and 'f' inclusive.
+
+C<\N> within a bracketed character class must be of the form C<\N{NAME}> for
+the same reason that a dot C<.> inside a bracketed character class loses its
+special meaning: it matches nearly anything, which generally isn't what you
+want to happen.
Examples:
diff --git a/pod/perlreref.pod b/pod/perlreref.pod
index f7d01b8e9b..266868d249 100644
--- a/pod/perlreref.pod
+++ b/pod/perlreref.pod
@@ -113,7 +113,7 @@ This one works differently from normal strings:
[f-j-] Dash escaped or at start or end means 'dash'
[^f-j] Caret indicates "match any character _except_ these"
-The following sequences work within or without a character class.
+The following sequences (except C<\N>) work within or without a character class.
The first six are locale aware, all are Unicode aware. See L<perllocale>
and L<perlunicode> for details.
@@ -125,7 +125,9 @@ and L<perlunicode> for details.
\S A non-whitespace character
\h An horizontal white space
\H A non horizontal white space
- \N A non newline (when not followed by a '{'; it's like . without /s)
+ \N A non newline (when not followed by '{NAME}'; not valid in a
+ character class; equivalent to [^\n]; it's like . without /s
+ modifier)
\v A vertical white space
\V A non vertical white space
\R A generic newline (?>\v|\x0D\x0A)
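
The perlrecharclass.pod and perlreref.pod hunks state that a bare C<\N> is not valid inside a bracketed character class, while C<\N{NAME}> is. The sketch below illustrates this under the assumption of perl 5.12.0 or later; the invalid pattern is built from a string at run time only so that the compile failure can be trapped with C<eval>, and all names are illustrative.

```perl
use strict;
use warnings;
use charnames ':full';    # loads the Unicode names used by \N{NAME}

# Inside [], \N{NAME} keeps its named-character meaning.
my $named_in_class = ("b" =~ /^[\N{LATIN SMALL LETTER B}\d]$/) ? 1 : 0;

# A bare \N cannot appear in []; [^\n] expresses the same set.
my @lines = "one\ntwo" =~ /([^\n]+)/g;

# Compiling '[\N]' is expected to be a fatal regex error, so do it
# at run time inside eval and record whether it succeeded.
my $bad_pattern = '[\N]';
my $compiled    = eval { qr/$bad_pattern/; 1 } ? 1 : 0;

print "named character in class: $named_in_class\n";
print "lines: @lines\n";
print "bare \\N in class compiled: $compiled\n";
```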