-rw-r--r-- | pod/perl5120delta.pod   |  9
-rw-r--r-- | pod/perlre.pod          |  3
-rw-r--r-- | pod/perlrebackslash.pod | 95
-rw-r--r-- | pod/perlrecharclass.pod | 43
-rw-r--r-- | pod/perlreref.pod       |  6
5 files changed, 104 insertions, 52 deletions
diff --git a/pod/perl5120delta.pod b/pod/perl5120delta.pod
index 47304ff55f..a5fac1430a 100644
--- a/pod/perl5120delta.pod
+++ b/pod/perl5120delta.pod
@@ -238,9 +238,12 @@ for some or all operations. (Yuval Kogman)
 A new regex escape has been added, C<\N>. It will match any character that is
 not a newline, independently from the presence or absence of the single line
 match modifier C</s>. It is not usable within a character class.
-(If C<\N> is followed by an opening brace and
-by a letter, perl will still assume that a Unicode character name is
-coming, so compatibility is preserved.) (Rafael Garcia-Suarez).
+C<\N{3}> means to match 3 non-newlines; C<\N{5,}> means to match at least 5.
+C<\N{NAME}> still means the character or sequence named C<NAME>, but C<NAME> no
+longer can be things like C<3>, or C<5,>.
+Compatibility with Unicode names is preserved, as none look like these, but it
+has been possible to create custom names that do look like them, and those will
+no longer work. (Rafael Garcia-Suarez)

 This will break a L<custom charnames translator|charnames/CUSTOM TRANSLATORS>
 which allows numbers for character names, as C<\N{3}> will now mean to match 3
diff --git a/pod/perlre.pod b/pod/perlre.pod
index 8189b29b32..f82d196e26 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -301,6 +301,9 @@ B<Note:> C<\R> has no special meaning inside of a character class;
 use C<\v> instead (vertical whitespace).
 X<\R>

+Note that C<\N> has two meanings. When of the form C<\N{NAME}>, it matches the
+character whose name is C<NAME>. Otherwise it matches any character but C<\n>.
+
 The POSIX character class syntax
 X<character class>

diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod
index 3d3a76f58e..271f4b3059 100644
--- a/pod/perlrebackslash.pod
+++ b/pod/perlrebackslash.pod
@@ -25,14 +25,13 @@ it either takes away the special meaning of the character following it
 or it is the start of a backslash or escape sequence.

 The rules determining what it is are quite simple: if the character
-following the backslash is a punctuation (non-word) character (that is,
-anything that is not a letter, digit or underscore), then the backslash
-just takes away the special meaning (if any) of the character following
-it.
-
-If the character following the backslash is a letter or a digit, then the
-sequence may be special; if so, it's listed below. A few letters have not
-been used yet, and escaping them with a backslash is safe for now, but a
+following the backslash is an ASCII punctuation (non-word) character (that is,
+anything that is not a letter, digit or underscore), then the backslash just
+takes away the special meaning (if any) of the character following it.
+
+If the character following the backslash is an ASCII letter or an ASCII digit,
+then the sequence may be special; if so, it's listed below. A few letters have
+not been used yet, and escaping them with a backslash is safe for now, but a
 future version of Perl may assign a special meaning to it. However, if you
 have warnings turned on, Perl will issue a warning if you use such a sequence.
 [1].
@@ -61,48 +60,51 @@ quoted constructs>.

 =head2 All the sequences and escapes

+Those not usable within a bracketed character class (like C<[\da-z]>) are marked
+as C<Not in [].>
+
  \000              Octal escape sequence.
- \1                Absolute backreference.
+ \1                Absolute backreference.  Not in [].
  \a                Alarm or bell.
- \A                Beginning of string.
- \b                Word/non-word boundary. (Backspace in a char class).
- \B                Not a word/non-word boundary.
+ \A                Beginning of string.  Not in [].
+ \b                Word/non-word boundary. (Backspace in []).
+ \B                Not a word/non-word boundary.  Not in [].
  \cX               Control-X (X can be any ASCII character).
- \C                Single octet, even under UTF-8.
+ \C                Single octet, even under UTF-8.  Not in [].
  \d                Character class for digits.
  \D                Character class for non-digits.
  \e                Escape character.
- \E                Turn off \Q, \L and \U processing.
+ \E                Turn off \Q, \L and \U processing.  Not in [].
  \f                Form feed.
- \g{}, \g1         Named, absolute or relative backreference.
- \G                Pos assertion.
+ \g{}, \g1         Named, absolute or relative backreference.  Not in [].
+ \G                Pos assertion.  Not in [].
  \h                Character class for horizontal white space.
  \H                Character class for non horizontal white space.
- \k{}, \k<>, \k''  Named backreference.
- \K                Keep the stuff left of \K.
- \l                Lowercase next character.
- \L                Lowercase till \E.
+ \k{}, \k<>, \k''  Named backreference.  Not in [].
+ \K                Keep the stuff left of \K.  Not in [].
+ \l                Lowercase next character.  Not in [].
+ \L                Lowercase till \E.  Not in [].
  \n                (Logical) newline character.
- \N                Any character but newline.
+ \N                Any character but newline.  Not in [].
  \N{}              Named (Unicode) character.
  \p{}, \pP         Character with the given Unicode property.
  \P{}, \PP         Character without the given Unicode property.
- \Q                Quotemeta till \E.
+ \Q                Quotemeta till \E.  Not in [].
  \r                Return character.
- \R                Generic new line.
+ \R                Generic new line.  Not in [].
  \s                Character class for white space.
  \S                Character class for non white space.
  \t                Tab character.
- \u                Titlecase next character.
- \U                Uppercase till \E.
+ \u                Titlecase next character.  Not in [].
+ \U                Uppercase till \E.  Not in [].
  \v                Character class for vertical white space.
  \V                Character class for non vertical white space.
  \w                Character class for word characters.
  \W                Character class for non-word characters.
  \x{}, \x00        Hexadecimal escape sequence.
- \X                Unicode "extended grapheme cluster".
- \z                End of string.
- \Z                End of string.
+ \X                Unicode "extended grapheme cluster".  Not in [].
+ \z                End of string.  Not in [].
+ \Z                End of string.  Not in [].

 =head2 Character Escapes

@@ -156,15 +158,20 @@ Mnemonic: I<c>ontrol character.

 =head3 Named characters

-All Unicode characters have a Unicode name, and characters in various scripts
-have names as well. It is even possible to give your own names to characters.
-You can use a character by name by using the C<\N{}> construct; the name of
-the character goes between the curly braces. You do have to C<use charnames>
-to load the names of the characters, otherwise Perl will complain you use
-a name it doesn't know about. For more details, see L<charnames>.
+All Unicode characters have a Unicode name. It is even possible to give your
+own names to characters, even to short sequences of characters. You can use a
+character by name by using the C<\N{}> construct; the name of the character
+goes between the curly braces. You do have to C<use charnames> to load the
+Unicode names of the characters, otherwise Perl will complain. (If you instead
+have your own names, a C<use> statement will be required for your translator.)
+For more details, see L<charnames>.

 Mnemonic: I<N>amed character.

+Note that a character that is expressed as a named character is considered
+as a character without special meaning by the regex engine, and will match
+"as is".
+
 =head4 Example

  use charnames ':full'; # Loads the Unicode names.
@@ -177,7 +184,8 @@ Mnemonic: I<N>amed character.

 Octal escapes consist of a backslash followed by two or three octal digits
 matching the code point of the character you want to use. This allows for
-512 characters (C<\00> up to C<\777>) that can be expressed this way.
+512 characters (C<\00> up to C<\777>) that can be expressed this way (but
+anything above C<\377> is deprecated).
 Enough in pre-Unicode days, but most Unicode characters cannot be escaped
 this way.

@@ -329,7 +337,7 @@ absolutely, relatively, and by name.

 A backslash sequence that starts with a backslash and is followed by a
 number is an absolute reference (but be aware of the caveat mentioned above).
-If the number is I<N>, it refers to the Nth set of parenthesis - whatever
+If the number is I<N>, it refers to the Nth set of parentheses - whatever
 has been matched by that set of parenthesis has to be matched by the C<\N>
 as well.

@@ -379,7 +387,7 @@ written as C<\k{name}>, C<< \k<name> >> or C<\k'name'>.

 Note that C<\g{}> has the potential to be ambiguous, as it could be a named
 reference, or an absolute or relative reference (if its argument is numeric).
-However, names are not allowed to start with digits, nor are allowed to
+However, names are not allowed to start with digits, nor are they allowed to
 contain a hyphen, so there is no ambiguity.

 =head4 Examples
@@ -490,6 +498,17 @@ instead of C<s/(PAT1) PAT2/${1}REPL/x> or C<s/(?<=PAT1) PAT2/REPL/x>.

 Mnemonic: I<K>eep.

+=item \N
+
+This is new in perl 5.12.0. It matches any character that is not a newline.
+It is a short-hand for writing C<[^\n]>, and is identical to the C<.>
+metasymbol, except under the C</s> flag, which changes the meaning of C<.>, but
+not C<\N>.
+
+Note that C<\N{...}> can mean a L<named character|/Named characters>.
+
+Mnemonic: Complement of I<\n>.
+
 =item \R

 C<\R> matches a I<generic newline>, that is, anything that is considered
@@ -512,7 +531,7 @@ This matches a Unicode I<extended grapheme cluster>.
 C<\X> matches quite well what normal (non-Unicode-programmer) usage
 would consider a single character. As an example, consider a G with some sort
 of diacritic mark, such as an arrow. There is no such single character in
-Unicode, but one can be composed using a G followed by a Unicode "COMBINING
+Unicode, but one can be composed by using a G followed by a Unicode "COMBINING
 UPWARDS ARROW BELOW", and would be displayed by Unicode-aware software as if it
 were a single character.

diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod
index 0b5b89a44d..55b178e3e7 100644
--- a/pod/perlrecharclass.pod
+++ b/pod/perlrecharclass.pod
@@ -45,7 +45,7 @@ constitute a character class. That is, they will match a single
 character, if that character belongs to a specific set of characters
 (defined by the sequence). A backslashed sequence is a sequence of
 characters starting with a backslash. Not all backslashed sequences
-are character class; for a full list, see L<perlrebackslash>.
+are character classes; for a full list, see L<perlrebackslash>.

 Here's a list of the backslashed sequences, which are discussed in
 more detail below.
@@ -116,8 +116,12 @@ that is not considered horizontal white space.
 C<\N>, like the dot, will match any character that is not a newline. The
 difference is that C<\N> will not be influenced by the single line C</s>
 regular expression modifier. (Note that, since C<\N{}> is also used for
-Unicode named characters, if C<\N> is followed by an opening brace and
-by a letter, perl will assume that a Unicode character name is coming.)
+named characters, if C<\N> is followed by an opening brace and something that
+is not a quantifier, perl will assume that a character name is coming. For
+example, C<\N{3}> means to match 3 non-newlines; C<\N{5,}> means to match 5 or
+more non-newlines, but C<\N{4F}> is not a legal quantifier, and will cause
+perl to look for a character named C<4F> (and won't find it unless custom names
+have been defined.)

 C<\v> will match any character that is considered vertical white space;
 this includes the carriage return and line feed characters (newline).
@@ -265,7 +269,7 @@ Examples:
 =head3 Special Characters Inside a Bracketed Character Class

 Most characters that are meta characters in regular expressions (that
-is, characters that carry a special meaning like C<*> or C<(>) lose
+is, characters that carry a special meaning like C<.>, C<*>, or C<(>) lose
 their special meaning and can be used inside a character class without
 the need to escape them. For instance, C<[()]> matches either an opening
 parenthesis, or a closing parenthesis, and the parens inside the character
@@ -282,6 +286,22 @@ that does not have either two word characters or two non-word characters
 on either side, inside a bracketed character class, C<\b> matches a
 backspace character.

+The sequences
+C<\a>,
+C<\c>,
+C<\e>,
+C<\f>,
+C<\n>,
+C<\N{NAME}>,
+C<\r>,
+C<\t>,
+and
+C<\x>
+are also special and have the same meanings as they do outside a bracketed character
+class.
+
+Also, a backslash followed by digits is considered an octal number.
+
 A C<[> is not special inside a character class, unless it's the start
 of a POSIX character class (see below). It normally does not need escaping.

@@ -362,11 +382,16 @@ Examples:

 =head3 Backslash Sequences

-You can put a backslash sequence character class inside a bracketed character
-class, and it will act just as if you put all the characters matched by
-the backslash sequence inside the character class. For instance,
-C<[a-f\d]> will match any digit, or any of the lowercase letters between
-'a' and 'f' inclusive.
+You can put any backslash sequence character class (with one exception listed
+in the next paragraph) inside a bracketed character class, and it will act just
+as if you put all the characters matched by the backslash sequence inside the
+character class. For instance, C<[a-f\d]> will match any digit, or any of the
+lowercase letters between 'a' and 'f' inclusive.
+
+C<\N> within a bracketed character class must be of the form C<\N{NAME}> for
+the same reason that a dot C<.> inside a bracketed character class loses its
+special meaning: it matches nearly anything, which generally isn't what you
+want to happen.

 Examples:

diff --git a/pod/perlreref.pod b/pod/perlreref.pod
index f7d01b8e9b..266868d249 100644
--- a/pod/perlreref.pod
+++ b/pod/perlreref.pod
@@ -113,7 +113,7 @@ This one works differently from normal strings:
    [f-j-]  Dash escaped or at start or end means 'dash'
    [^f-j]  Caret indicates "match any character _except_ these"

-The following sequences work within or without a character class.
+The following sequences (except C<\N>) work within or without a character class.
 The first six are locale aware, all are Unicode aware. See L<perllocale>
 and L<perlunicode> for details.
@@ -125,7 +125,9 @@ and L<perlunicode> for details.
    \S  A non-whitespace character
    \h  An horizontal white space
    \H  A non horizontal white space
-   \N  A non newline (when not followed by a '{'; it's like . without /s)
+   \N  A non newline (when not followed by '{NAME}'; not valid in a
+       character class; equivalent to [^\n]; it's like . without /s
+       modifier)
    \v  A vertical white space
    \V  A non vertical white space
    \R  A generic newline (?>\v|\x0D\x0A)
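
The behaviour documented in this patch can be checked with a short script. The following is only an illustrative sketch, not part of the commit: the variable names and sample strings are made up for the example, and it assumes perl 5.12 or later. It exercises the three forms of C<\N> described above: plain C<\N>, the quantified C<\N{3}>, and the named-character form C<\N{NAME}>.

```perl
use strict;
use warnings;
use charnames ':full';    # Loads the Unicode names (needed before perl 5.16).

my $text = "ab\ncd";      # hypothetical sample input

# Under /s the dot also matches a newline, so .+ runs across both lines.
my ($dot) = $text =~ /(.+)/s;

# \N never matches a newline, whether or not /s is given.
my ($non_nl) = $text =~ /(\N+)/s;

# \N{3} is \N with a quantifier: exactly three non-newline characters.
my $three_ok  = "abc"    =~ /^\N{3}$/ ? 1 : 0;
my $three_bad = "abcdef" =~ /^\N{3}$/ ? 1 : 0;

# \N{NAME} is a named character and matches that character "as is".
my $named = "a" =~ /^\N{LATIN SMALL LETTER A}$/ ? 1 : 0;

print length($dot), " ", $non_nl, " $three_ok $three_bad $named\n";
```

With the sample input, the capture made with the dot spans the newline while the capture made with C<\N> stops before it, which is the difference the C</s> notes in the patch describe.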