author | Karl Williamson <public@khwilliamson.com> | 2011-04-10 18:05:52 -0600 |
---|---|---|
committer | Karl Williamson <public@khwilliamson.com> | 2011-04-12 19:26:25 -0600 |
commit | ed7efc79ab6ea9f03d275ec3a285b8416f9c9bfa (patch) | |
tree | 2d3a8f7527cad505d8405223e97620594ffe07af /pod/perlre.pod | |
parent | 90803c373a7a4fcf2fa1a1029a44cb4d1cdf11e2 (diff) | |
perlre.pod: Update for 5.14
Diffstat (limited to 'pod/perlre.pod')
-rw-r--r-- | pod/perlre.pod | 299 |
1 files changed, 225 insertions, 74 deletions
diff --git a/pod/perlre.pod b/pod/perlre.pod
index 387c820621..fa7f3ecaf5 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -53,12 +53,34 @@ Do case-insensitive pattern matching.
 
 If C<use locale> is in effect, the case map is taken from the current
 locale for code points less than 255, and from Unicode rules for larger
-code points. See L<perllocale>.
+code points. However, matches that would cross the Unicode
+rules/non-Unicode rules boundary (ords 255/256) will not succeed. See
+L<perllocale>.
+
+There are a number of Unicode characters that match multiple characters
+under C</i>. For example, C<LATIN SMALL LIGATURE FI>
+should match the sequence C<fi>. Perl is not
+currently able to do this when the multiple characters are in the pattern and
+are split between groupings, or when one or more are quantified. Thus
+
+    "\N{LATIN SMALL LIGATURE FI}" =~ /fi/i;          # Matches
+    "\N{LATIN SMALL LIGATURE FI}" =~ /[fi][fi]/i;    # Doesn't match!
+    "\N{LATIN SMALL LIGATURE FI}" =~ /fi*/i;         # Doesn't match!
+
+    # The below doesn't match, and it isn't clear what $1 and $2 would
+    # be even if it did!!
+    "\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i;      # Doesn't match!
+
+Also, this matching doesn't fully conform to the current Unicode
+recommendations, which ask that the matching be made upon the NFD
+(Normalization Form Decomposed) of the text. However, Unicode is
+in the process of reconsidering and revising their recommendations.
 
 =item x
 X</x>
 
 Extend your pattern's legibility by permitting whitespace and comments.
+Details in L</"/x">
 
 =item p
 X</p> X<regex, preserve> X<regexp, preserve>
@@ -79,18 +101,21 @@ of the g and c modifiers.
 X</a> X</d> X</l> X</u>
 
 These modifiers, new in 5.14, affect which character-set semantics
-(Unicode, ASCII, etc.) are used, as described below.
+(Unicode, ASCII, etc.) are used, as described below in
+L</Character set modifiers>.
 
 =back
 
 These are usually written as "the C</x> modifier", even though the delimiter
 in question might not really be a slash. The modifiers C</imsxadlup>
 may also be embedded within the regular expression itself using
-the C<(?...)> construct.
+the C<(?...)> construct, see L</Extended Patterns> below.
 
 The C</x>, C</l>, C</u>, C</a> and C</d> modifiers need a little more
 explanation.
 
+=head3 /x
+
 C</x> tells
 the regular expression parser to ignore most whitespace that is neither
 backslashed nor within a character class. You can use this to break up
@@ -118,80 +143,211 @@ in C<\p{...}> there can be spaces that follow the Unicode rules, for which see
 L<perluniprops/Properties accessible through \p{} and \P{}>.
 X</x>
 
-C</l> means to use a locale (see L<perllocale>) when pattern matching.
-The locale used will be the one in effect at the time of execution of
-the pattern match. This may not be the same as the compilation-time
-locale, and can differ from one match to another if there is an
-intervening call of the
+=head3 Character set modifiers
+
+C</d>, C</u>, C</a>, and C</l>, available starting in 5.14, are called
+the character set modifiers; they affect the character set semantics
+used for the regular expression.
+
+At any given time, exactly one of these modifiers is in effect. Once
+compiled, the behavior doesn't change regardless of what rules are in
+effect when the regular expression is executed. And if a regular
+expression is interpolated into a larger one, the original's rules
+continue to apply to it, and only it.
+
+=head4 /l
+
+means to use the current locale's rules (see L<perllocale>) when pattern
+matching. For example, C<\w> will match the "word" characters of that
+locale, and C<"/i"> case-insensitive matching will match according to
+the locale's case folding rules. The locale used will be the one in
+effect at the time of execution of the pattern match. This may not be
+the same as the compilation-time locale, and can differ from one match
+to another if there is an intervening call of the
 L<setlocale() function|perllocale/The setlocale function>.
-This modifier is automatically set if the regular expression is compiled
-within the scope of a C<"use locale"> pragma.
-Perl only allows single-byte locales. This means that code points above
-255 are treated as Unicode no matter what locale is in effect.
-Under Unicode rules, there are a few case-insensitive matches that cross the
-255/256 boundary. These are disallowed. For example,
-0xFF does not caselessly match the character at 0x178, LATIN CAPITAL
-LETTER Y WITH DIAERESIS, because 0xFF may not be LATIN SMALL LETTER Y
-in the current locale, and Perl has no way of knowing if that character
-even exists in the locale, much less what code point it is.
+
+Perl only supports single-byte locales. This means that code points
+above 255 are treated as Unicode no matter what locale is in effect.
+Under Unicode rules, there are a few case-insensitive matches that cross
+the 255/256 boundary. These are disallowed under C</l>. For example,
+0xFF does not caselessly match the character at 0x178, C<LATIN CAPITAL
+LETTER Y WITH DIAERESIS>, because 0xFF may not be C<LATIN SMALL LETTER Y
+WITH DIAERESIS> in the current locale, and Perl has no way of knowing if
+that character even exists in the locale, much less what code point it
+is.
+
+This modifier may be specified to be the default by C<use locale>, but
+see L</Which character set modifier is in effect?>.
 X</l>
 
-C</u> means to use Unicode semantics when pattern matching. It is
-automatically set if the regular expression is encoded in utf8 internally,
-or is compiled within the scope of a
-L<C<"use feature 'unicode_strings">|feature> pragma (and isn't also in
-the scope of the L<C<"use locale">|locale> or the L<C<"use bytes">|bytes>
-pragma). On ASCII platforms, the code points between 128 and 255 take on their
+=head4 /u
+
+means to use Unicode rules when pattern matching. On ASCII platforms,
+this means that the code points between 128 and 255 take on their
 Latin-1 (ISO-8859-1) meanings (which are the same as Unicode's), whereas
 in strict ASCII their meanings are undefined. Thus the platform
-effectively becomes a Unicode platform. The ASCII characters remain as
-ASCII characters (since ASCII is a subset of Latin-1 and Unicode). For
-example, when this option is not on, on a non-utf8 string, C<"\w">
-matches precisely C<[A-Za-z0-9_]>. When the option is on, it matches
-not just those, but all the Latin-1 word characters (such as an "n" with
-a tilde). On EBCDIC platforms, which already are equivalent to Latin-1,
-this modifier changes behavior only when the C<"/i"> modifier is also
-specified, and affects only two characters, giving them full Unicode
-semantics: the C<MICRO SIGN> will match the Greek capital and
-small letters C<MU>; otherwise not; and the C<LATIN CAPITAL LETTER SHARP
-S> will match any of C<SS>, C<Ss>, C<sS>, and C<ss>, otherwise not.
-(This last case is buggy, however.)
+effectively becomes a Unicode platform, hence, for example, C<\w> will
+match any of the more than 100_000 word characters in Unicode.
+
+Unlike most locales, which are specific to a language and country pair,
+Unicode classifies all the characters that are letters I<somewhere> as
+C<\w>. For example, your locale might not think that C<LATIN SMALL
+LETTER ETH> is a letter (unless you happen to speak Icelandic), but
+Unicode does. Similarly, all the characters that are decimal digits
+somewhere in the world will match C<\d>; this is hundreds, not 10,
+possible matches. And some of those digits look like some of the 10
+ASCII digits, but mean a different number, so a human could easily think
+a number is a different quantity than it really is. For example,
+C<BENGALI DIGIT FOUR> (U+09EA) looks very much like an
+C<ASCII DIGIT EIGHT> (U+0038). And, C<\d+>, may match strings of digits
+that are a mixture from different writing systems, creating a security
+issue. L<Unicode::UCD/num()> can be used to sort this out.
+
+Also, case-insensitive matching works on the full set of Unicode
+characters. The C<KELVIN SIGN>, for example matches the letters "k" and
+"K"; and C<LATIN SMALL LIGATURE FF> matches the sequence "ff", which,
+if you're not prepared, might make it look like a hexadecimal constant,
+presenting another potential security issue. See
+L<http://unicode.org/reports/tr36> for a detailed discussion of Unicode
+security issues.
+
+On EBCDIC platforms, which already are equivalent to Latin-1 (at least
+the ones that Perl handles), this modifier changes behavior only when
+the C<"/i"> modifier is also specified, and it turns out it affects only
+two characters, giving them full Unicode semantics: the C<MICRO SIGN>
+will match the Greek capital and small letters C<MU>; otherwise not; and
+the C<LATIN CAPITAL LETTER SHARP S> will match any of C<SS>, C<Ss>,
+C<sS>, and C<ss>, otherwise not.
+
+This modifier may be specified to be the default by C<use feature
+'unicode_strings>, but see
+L</Which character set modifier is in effect?>.
 X</u>
 
-C</a> is the same as C</u>, except that C<\d>, C<\s>, C<\w>, and the
+=head4 /a
+
+is the same as C</u>, except that C<\d>, C<\s>, C<\w>, and the
 Posix character classes are restricted to matching in the ASCII range
 only. That is, with this modifier, C<\d> always means precisely the
 digits C<"0"> to C<"9">; C<\s> means the five characters C<[ \f\n\r\t]>;
 C<\w> means the 63 characters C<[A-Za-z0-9_]>; and likewise, all the
 Posix classes such as C<[[:print:]]> match only the appropriate
-ASCII-range characters. As you would expect, this modifier causes, for
-example, C<\D> to mean the same thing as C<[^0-9]>; in fact, all
-non-ASCII characters match C<\D>, C<\S>, and C<\W>. C<\b> still means
-to match at the boundary between C<\w> and C<\W>, using the C<"a">
-definitions of them (similarly for C<\B>). Otherwise, C<"a"> behaves
-like the C<"u"> modifier, in that case-insensitive matching uses Unicode
-semantics; for example, "k" will match the Unicode C<\N{KELVIN SIGN}>
-under C</i> matching, and code points in the Latin1 range, above ASCII
-will have Unicode semantics when it comes to case-insensitive matching.
-But writing two in "a"'s in a row will increase its effect, causing the
-Kelvin sign and all other non-ASCII characters not to match any ASCII
-character under C</i> matching.
+ASCII-range characters.
+
+This modifier is useful for people who only incidentally use Unicode.
+With it, one can write C<\d> with confidence that it will only match
+ASCII characters, and should the need arise to match beyond ASCII, you
+can use C<\p{Digit}>, or C<\p{Word}> for C<\w>. There are similar
+C<\p{...}> constructs that can match white space and Posix classes
+beyond ASCII. See L<perlrecharclass>.
+
+As you would expect, this modifier causes, for example, C<\D> to mean
+the same thing as C<[^0-9]>; in fact, all non-ASCII characters match
+C<\D>, C<\S>, and C<\W>. C<\b> still means to match at the boundary
+between C<\w> and C<\W>, using the C</a> definitions of them (similarly
+for C<\B>).
+
+Otherwise, C</a> behaves like the C</u> modifier, in that
+case-insensitive matching uses Unicode semantics; for example, "k" will
+match the Unicode C<\N{KELVIN SIGN}> under C</i> matching, and code
+points in the Latin1 range, above ASCII will have Unicode rules when it
+comes to case-insensitive matching.
+
+To forbid ASCII/non-ASCII matches (like "k" with C<\N{KELVIN SIGN}>),
+specify the "a" twice, for example C</aai> or C</aia>
+
+To reiterate, this modifier provides protection for applications that
+don't wish to be exposed to all of Unicode. Specifying it twice
+gives added protection.
+
+This modifier may be specified to be the default by C<use re '/a'>
+or C<use re '/aa'>, but see
+L</Which character set modifier is in effect?>.
 X</a>
+X</aa>
+
+=head4 /d
+
+This modifier means to use the "Default" native rules of the platform
+except when there is cause to use Unicode rules instead, as follows:
+
+=over 4
+
+=item 1
+
+the target string is encoded in UTF-8; or
+
+=item 2
+
+the pattern is encoded in UTF-8; or
+
+=item 3
+
+the pattern explicitly mentions a code point that is above 255 (say by
+C<\x{100}>); or
+
+=item 4
 
-C</d> means to use the traditional Perl pattern-matching behavior.
-This is dualistic (hence the name C</d>, which also could stand for
-"depends"). When this is in effect, Perl matches according to the
-platform's native character set rules unless there is something that
-indicates to use Unicode rules. If either the target string or the
-pattern itself is encoded in UTF-8, Unicode rules are used. Also, if
-the pattern contains Unicode-only features, such as code points above
-255, C<\p()> Unicode properties or C<\N{}> Unicode names, Unicode rules
-will be used. It is automatically selected by default if the regular
-expression is compiled neither within the scope of a C<"use locale">
-pragma nor a <C<"use feature 'unicode_strings"> pragma.
-This behavior causes a number of glitches, see
-L<perlunicode/The "Unicode Bug">.
-X</d>
+the pattern uses a Unicode name (C<\N{...}>); or
+
+=item 5
+
+the pattern uses a Unicode property (C<\p{...}>)
+
+=back
+
+Another mnemonic for this modifier is "Depends", as the rules actually
+used depend on various things, and as a result you can get unexpected
+results. See L<perlunicode/The "Unicode Bug">.
+
+On ASCII platforms, the native rules are ASCII, and on EBCDIC platforms
+(at least the ones that Perl handles), they are Latin-1.
+
+Here are some examples of how that works on an ASCII platform:
+
+    $str = "\xDF";        # $str is not in UTF-8 format.
+    $str =~ /^\w/;        # No match, as $str isn't in UTF-8 format.
+    $str .= "\x{0e0b}";   # Now $str is in UTF-8 format.
+    $str =~ /^\w/;        # Match! $str is now in UTF-8 format.
+    chop $str;
+    $str =~ /^\w/;        # Still a match! $str remains in UTF-8 format.
+
+=head4 Which character set modifier is in effect?
+
+Which of these modifiers is in effect at any given point in a regular
+expression depends on a fairly complex set of interactions. As
+explained below in L</Extended Patterns> it is possible to explicitly
+specify modifiers that apply only to portions of a regular expression.
+The innermost always has priority over any outer ones, and one applying
+to the whole expression has priority over any default settings that are
+described in the next few paragraphs.
+
+The C<L<use re 'E<sol>foo'|re/'E<sol>flags' mode">> pragma can be used to set
+default modifiers (including these) for regular expressions compiled
+within its scope. This pragma has precedence over the other pragmas
+that change the defaults, as listed below.
+
+Otherwise, C<L<use locale|perllocale>> sets the default modifier to C</l>;
+and C<L<use feature 'unicode_strings|feature>> or
+C<L<use 5.012|perlfunc/use VERSION>> (or higher) set the default to
+C</u> when not in the same scope as either C<L<use locale|perllocale>>
+or C<L<use bytes|bytes>> .
+
+If none of the above apply, for backwards compatibility reasons, the
+C</d> modifier is the one in effect by default. As this can lead to
+unexpected results, it is best to specify which other rule set should be
+used.
+
+=head4 Character set modifier behavior prior to Perl 5.14
+
+Prior to 5.14, there were no explicit modifiers, but C</l> was implied
+for regexes compiled within the scope of C<use locale>, and C</d> was
+implied otherwise. However, interpolating a regex into a larger regex
+would ignore the original compilation in favor of whatever was in effect
+at the time of the second compilation. There were a number of
+inconsistencies (bugs) with the C</d> modifier, where Unicode rules
+would be used when inappropriate, and vice versa. C<\p{}> did not imply
+Unicode rules, and neither did all occurrences of C<\N{}>, until 5.12.
 
 =head2 Regular Expressions
 
@@ -549,7 +705,7 @@ digits padded with leading zeros, since a leading zero implies an octal
 constant.
 
 The C<\I<digit>> notation also works in certain circumstances outside
-the pattern. See L</Warning on \1 Instead of $1> below for details.)
+the pattern. See L</Warning on \1 Instead of $1> below for details.
 
 Examples:
 
@@ -733,7 +889,8 @@ But a minus sign is not legal with it.
 Note that the C<a>, C<d>, C<l>, C<p>, and C<u> modifiers are special in
 that they can only be enabled, not disabled, and the C<a>, C<d>, C<l>, and
 C<u> modifiers are mutually exclusive: specifying one de-specifies the
-others, and a maximum of one may appear in the construct. Thus, for
+others, and a maximum of one (or two C<a>'s) may appear in the
+construct. Thus, for
 example, C<(?-p)> will warn when compiled under C<use warnings>;
 C<(?-d:...)> and C<(?dl:...)> are fatal errors.
 
@@ -2253,17 +2410,11 @@ Subroutine call to a named capture group. Equivalent to C<< (?&NAME) >>.
 
 =head1 BUGS
 
-There are numerous problems with case-insensitive matching of characters
-outside the ASCII range, especially with those whose folds are multiple
-characters, such as ligatures like C<LATIN SMALL LIGATURE FF>.
-
-In a bracketed character class with case-insensitive matching, ranges only work
-for ASCII characters. For example,
-C<m/[\N{CYRILLIC CAPITAL LETTER A}-\N{CYRILLIC CAPITAL LETTER YA}]/i>
-doesn't match all the Russian upper and lower case letters.
-
 Many regular expression constructs don't work on EBCDIC platforms.
 
+There are a number of issues with regard to case-insensitive matching
+in Unicode rules. See C<i> under L</Modifiers> above.
+
 This document varies from difficult to understand to completely
 and utterly opaque. The wandering prose riddled with jargon is
 hard to fathom in several places.
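
The character set modifier behavior documented by this patch can be exercised with a short script. What follows is a minimal sketch, not part of the commit, assuming Perl 5.14 or later on an ASCII platform; the variable names and the `%result` hash are illustrative only. It checks the `/a` versus `/aa` case-folding difference, the ASCII restriction of `\d` under `/a`, and the `/d` versus `/u` treatment of a non-UTF-8 `"\xDF"`.

```perl
use strict;
use warnings;
use charnames ':full';    # \N{NAME} needs this before Perl 5.16

# Illustrative inputs (names are assumptions, not from the patch).
my $kelvin       = "\N{KELVIN SIGN}";    # U+212A
my $bengali_four = "\x{09EA}";           # BENGALI DIGIT FOUR
my $sharp_s      = "\xDF";               # not UTF-8 encoded internally

my %result = (
    # A single /a keeps Unicode case folding: "k" matches KELVIN SIGN.
    'kelvin /ai'  => ( $kelvin =~ /k/ai  ? 1 : 0 ),
    # Doubling the "a" forbids ASCII/non-ASCII case-insensitive matches.
    'kelvin /aai' => ( $kelvin =~ /k/aai ? 1 : 0 ),
    # \d matches any Unicode decimal digit under /u ...
    'digit /u'    => ( $bengali_four =~ /\d/u ? 1 : 0 ),
    # ... but only ASCII 0-9 under /a.
    'digit /a'    => ( $bengali_four =~ /\d/a ? 1 : 0 ),
    # Under /d a non-UTF-8 string gets native (ASCII) rules ...
    'word /d'     => ( $sharp_s =~ /^\w/d ? 1 : 0 ),
    # ... while /u applies Unicode rules regardless of encoding.
    'word /u'     => ( $sharp_s =~ /^\w/u ? 1 : 0 ),
);

printf "%-12s %s\n", $_, $result{$_} ? 'match' : 'no match'
    for sort keys %result;
```

Spelling the modifier out on each pattern, as done here, sidesteps the default-selection rules described under "Which character set modifier is in effect?".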