author | Father Chrysostomos <sprout@cpan.org> | 2011-02-20 14:17:19 -0800 |
---|---|---|
committer | Father Chrysostomos <sprout@cpan.org> | 2011-02-20 14:17:39 -0800 |
commit | b6fa137b2c9fea23677e2533d7f563bb9f49db0a | |
tree | 8d2be20b1f1393617c50bb359f4bfaeb4cd5ff49 /pod | |
parent | d7a0d798cfc5c243cb147b827ff253e49f9bd57f | |
Update docs for postfix /dual flags
Diffstat (limited to 'pod')
-rw-r--r-- | pod/perldiag.pod | 9 |
-rw-r--r-- | pod/perlop.pod | 23 |
-rw-r--r-- | pod/perlre.pod | 177 |
3 files changed, 107 insertions, 102 deletions
diff --git a/pod/perldiag.pod b/pod/perldiag.pod index 2d9a9ac541..442106450f 100644 --- a/pod/perldiag.pod +++ b/pod/perldiag.pod @@ -2008,11 +2008,10 @@ Further error messages would likely be uninformative. (D syntax) You had a word that isn't a regex modifier immediately following a -pattern without an intervening space or you used one of the regex -modifiers ("a", "d", "l", and "u") that in 5.14 are disallowed as -suffixes. In that case, use the infix form, like C</(?a:...)/>. In the -other case, add white space between the pattern and following word. -As an example of the latter, the two constructs: +pattern without an intervening space. If you are trying to use the C</le> +flags on a substitution, use C</el> instead. Otherwise, add white space +between the pattern and following word to eliminate the warning. As an +example of the latter, the two constructs: $a =~ m/$foo/sand $bar $a =~ m/$foo/s and $bar diff --git a/pod/perlop.pod b/pod/perlop.pod index 79f608bf20..3b7bc0df94 100644 --- a/pod/perlop.pod +++ b/pod/perlop.pod @@ -1267,7 +1267,7 @@ matching and related activities. =over 8 -=item qr/STRING/msixpo +=item qr/STRING/msixpodual X<qr> X</i> X</m> X</o> X</s> X</x> X</p> This operator quotes (and possibly compiles) its I<STRING> as a regular @@ -1327,28 +1327,29 @@ Options (specified by the following modifiers) are: p When matching preserve a copy of the matched string so that ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} will be defined. o Compile pattern only once. + l Use the locale + u Use Unicode semantics + a Use ASCII for \d, \s, \w + d Use Unicode or native charset, as in 5.12 and earlier If a precompiled pattern is embedded in a larger pattern then the effect of 'msixp' will be propagated appropriately. The effect of the 'o' modifier has is not propagated, being restricted to those patterns explicitly using it. 
-Several other modifiers to control the character set semantics were -added for 5.14 that, unlike the ones listed above, may not be used -after the final pattern delimiter, but only following a C<"(?"> inside -the regular expression. (It is planned in 5.16 to make them usable in -the suffix position.) These are B<C<"a">>, B<C<"d">>, B<C<"l">>, and -B<C<"u">>. They are documented in L<perlre/Extended Patterns>. +The last four modifiers listed above, added in Perl 5.14, +control the character set semantics. They are documented in +L<perlre/Modifiers>. See L<perlre> for additional information on valid syntax for STRING, and for a detailed look at the semantics of regular expressions. -=item m/PATTERN/msixpogc +=item m/PATTERN/msixpodualgc X<m> X<operator, match> X<regexp, options> X<regexp> X<regex, options> X<regex> X</m> X</s> X</i> X</x> X</p> X</o> X</g> X</c> -=item /PATTERN/msixpogc +=item /PATTERN/msixpodualgc Searches a string for a pattern match, and in scalar context returns true if it succeeds, false if it fails. If no string is specified @@ -1384,7 +1385,7 @@ the trailing delimiter. This avoids expensive run-time recompilations, and is useful when the value you are interpolating won't change over the life of the script. However, mentioning C</o> constitutes a promise that you won't change the variables in the pattern. If you change them, -Perl won't even notice. See also L<"qr/STRING/msixpo">. +Perl won't even notice. See also L<"qr/STRING/msixpodual">. =item The empty pattern // @@ -1561,7 +1562,7 @@ but the resulting C<?PATTERN?> syntax is deprecated, will warn on usage and may be removed from a future stable release of Perl without further notice. 
-=item s/PATTERN/REPLACEMENT/msixpogcer +=item s/PATTERN/REPLACEMENT/msixpodualgcer X<substitute> X<substitution> X<replace> X<regexp, replace> X<regexp, substitute> X</m> X</s> X</i> X</x> X</p> X</o> X</g> X</c> X</e> X</r> diff --git a/pod/perlre.pod b/pod/perlre.pod index 31e881703c..0f5b436e08 100644 --- a/pod/perlre.pod +++ b/pod/perlre.pod @@ -75,19 +75,23 @@ rather than the regex itself. See L<perlretut/"Using regular expressions in Perl"> for further explanation of the g and c modifiers. +=item a, d, l and u +X</a> X</d> X</l> X</u> + +These modifiers, new in 5.14, affect which character-set semantics +(Unicode, ASCII, etc.) are used, as described below. + =back These are usually written as "the C</x> modifier", even though the delimiter -in question might not really be a slash. The modifiers C</imsx> +in question might not really be a slash. The modifiers C</imsxadlup> may also be embedded within the regular expression itself using -the C<(?...)> construct. Also are new (in 5.14) character set semantics -modifiers B<C<<"a">>, B<C<"d">>, B<C<"l">> and B<C<"u">>, which, in 5.14 -only, must be used embedded in the regular expression, and not after the -trailing delimiter. All this is discussed below in -L</Extended Patterns>. -X</a> X</d> X</l> X</u> +the C<(?...)> construct. + +The C</x>, C</l>, C</u>, C</a> and C</d> modifiers need a little more +explanation. -The C</x> modifier itself needs a little more explanation. It tells +C</x> tells the regular expression parser to ignore most whitespace that is neither backslashed nor within a character class. You can use this to break up your regular expression into (slightly) more readable parts. The C<#> @@ -114,6 +118,81 @@ in C<\p{...}> there can be spaces that follow the Unicode rules, for which see L<perluniprops/Properties accessible through \p{} and \P{}>. X</x> +C</l> means to use a locale (see L<perllocale>) when pattern matching. 
+The locale used will be the one in effect at the time of execution of +the pattern match. This may not be the same as the compilation-time +locale, and can differ from one match to another if there is an +intervening call of the +L<setlocale() function|perllocale/The setlocale function>. +This modifier is automatically set if the regular expression is compiled +within the scope of a C<"use locale"> pragma. +Perl only allows single-byte locales. This means that code points above +255 are treated as Unicode no matter what locale is in effect. +Under Unicode rules, there are a few case-insensitive matches that cross the +boundary 255/256 boundary. These are disallowed. For example, +0xFF does not caselessly match the character at 0x178, LATIN CAPITAL +LETTER Y WITH DIAERESIS, because 0xFF may not be LATIN SMALL LETTER Y +in the current locale, and Perl has no way of knowing if that character +even exists in the locale, much less what code point it is. +X</l> + +C</u> means to use Unicode semantics when pattern matching. It is +automatically set if the regular expression is encoded in utf8, or is +compiled within the scope of a +L<C<"use feature 'unicode_strings">|feature> pragma (and isn't also in +the scope of L<C<"use locale">|locale> nor L<C<"use bytes">|bytes> +pragmas). On ASCII platforms, the code points between 128 and 255 take on their +Latin-1 (ISO-8859-1) meanings (which are the same as Unicode's), whereas +in strict ASCII their meanings are undefined. Thus the platform +effectively becomes a Unicode platform. The ASCII characters remain as +ASCII characters (since ASCII is a subset of Latin-1 and Unicode). For +example, when this option is not on, on a non-utf8 string, C<"\w"> +matches precisely C<[A-Za-z0-9_]>. When the option is on, it matches +not just those, but all the Latin-1 word characters (such as an "n" with +a tilde). 
On EBCDIC platforms, which already are equivalent to Latin-1, +this modifier changes behavior only when the C<"/i"> modifier is also +specified, and affects only two characters, giving them full Unicode +semantics: the C<MICRO SIGN> will match the Greek capital and +small letters C<MU>; otherwise not; and the C<LATIN CAPITAL LETTER SHARP +S> will match any of C<SS>, C<Ss>, C<sS>, and C<ss>, otherwise not. +(This last case is buggy, however.) +X</u> + +C</a> is the same as C</u>, except that C<\d>, C<\s>, C<\w>, and the +Posix character classes are restricted to matching in the ASCII range +only. That is, with this modifier, C<\d> always means precisely the +digits C<"0"> to C<"9">; C<\s> means the five characters C<[ \f\n\r\t]>; +C<\w> means the 63 characters C<[A-Za-z0-9_]>; and likewise, all the +Posix classes such as C<[[:print:]]> match only the appropriate +ASCII-range characters. As you would expect, this modifier causes, for +example, C<\D> to mean the same thing as C<[^0-9]>; in fact, all +non-ASCII characters match C<\D>, C<\S>, and C<\W>. C<\b> still means +to match at the boundary between C<\w> and C<\W>, using the C<"a"> +definitions of them (similarly for C<\B>). Otherwise, C<"a"> behaves +like the C<"u"> modifier, in that case-insensitive matching uses Unicode +semantics; for example, "k" will match the Unicode C<\N{KELVIN SIGN}> +under C</i> matching, and code points in the Latin1 range, above ASCII +will have Unicode semantics when it comes to case-insensitive matching. +But writing two in "a"'s in a row will increase its effect, causing the +Kelvin sign and all other non-ASCII characters to not match any ASCII +character under C</i> matching. +X</a> + +C</d> means to use the traditional Perl pattern-matching behavior. +This is dualistic (hence the name C</d>, which also could stand for +"depends"). 
When this is in effect, Perl matches according to the +platform's native character set rules unless there is something that +indicates to use Unicode rules. If either the target string or the +pattern itself is encoded in UTF-8, Unicode rules are used. Also, if +the pattern contains Unicode-only features, such as code points above +255, C<\p()> Unicode properties or C<\N{}> Unicode names, Unicode rules +will be used. It is automatically selected by default if the regular +expression is compiled neither within the scope of a C<"use locale"> +pragma nor a <C<"use feature 'unicode_strings"> pragma. +This behavior causes a number of glitches, see +L<perlunicode/The "Unicode Bug">. +X</d> + =head2 Regular Expressions =head3 Metacharacters @@ -646,86 +725,12 @@ after the C<"?"> is a shorthand equivalent to C<d-imsx>. Flags (except C<"d">) may follow the caret to override it. But a minus sign is not legal with it. -Also, starting in Perl 5.14, are modifiers C<"a">, C<"d">, C<"l">, and -C<"u">, which for 5.14 may not be used as suffix modifiers. - -C<"l"> means to use a locale (see L<perllocale>) when pattern matching. -The locale used will be the one in effect at the time of execution of -the pattern match. This may not be the same as the compilation-time -locale, and can differ from one match to another if there is an -intervening call of the -L<setlocale() function|perllocale/The setlocale function>. -This modifier is automatically set if the regular expression is compiled -within the scope of a C<"use locale"> pragma. -Perl only allows single-byte locales. This means that code points above -255 are treated as Unicode no matter what locale is in effect. -Under Unicode rules, there are a few case-insensitive matches that cross the -boundary 255/256 boundary. These are disallowed. 
For example, -0xFF does not caselessly match the character at 0x178, LATIN CAPITAL -LETTER Y WITH DIAERESIS, because 0xFF may not be LATIN SMALL LETTER Y -in the current locale, and Perl has no way of knowing if that character -even exists in the locale, much less what code point it is. - -C<"u"> means to use Unicode semantics when pattern matching. It is -automatically set if the regular expression is encoded in utf8, or is -compiled within the scope of a -L<C<"use feature 'unicode_strings">|feature> pragma (and isn't also in -the scope of L<C<"use locale">|locale> nor L<C<"use bytes">|bytes> -pragmas. On ASCII platforms, the code points between 128 and 255 take on their -Latin-1 (ISO-8859-1) meanings (which are the same as Unicode's), whereas -in strict ASCII their meanings are undefined. Thus the platform -effectively becomes a Unicode platform. The ASCII characters remain as -ASCII characters (since ASCII is a subset of Latin-1 and Unicode). For -example, when this option is not on, on a non-utf8 string, C<"\w"> -matches precisely C<[A-Za-z0-9_]>. When the option is on, it matches -not just those, but all the Latin-1 word characters (such as an "n" with -a tilde). On EBCDIC platforms, which already are equivalent to Latin-1, -this modifier changes behavior only when the C<"/i"> modifier is also -specified, and affects only two characters, giving them full Unicode -semantics: the C<MICRO SIGN> will match the Greek capital and -small letters C<MU>; otherwise not; and the C<LATIN CAPITAL LETTER SHARP -S> will match any of C<SS>, C<Ss>, C<sS>, and C<ss>, otherwise not. -(This last case is buggy, however.) - -C<"a"> is the same as C<"u">, except that C<\d>, C<\s>, C<\w>, and the -Posix character classes are restricted to matching in the ASCII range -only. 
That is, with this modifier, C<\d> always means precisely the -digits C<"0"> to C<"9">; C<\s> means the five characters C<[ \f\n\r\t]>; -C<\w> means the 63 characters C<[A-Za-z0-9_]>; and likewise, all the -Posix classes such as C<[[:print:]]> match only the appropriate -ASCII-range characters. As you would expect, this modifier causes, for -example, C<\D> to mean the same thing as C<[^0-9]>; in fact, all -non-ASCII characters match C<\D>, C<\S>, and C<\W>. C<\b> still means -to match at the boundary between C<\w> and C<\W>, using the C<"a"> -definitions of them (similarly for C<\B>). Otherwise, C<"a"> behaves -like the C<"u"> modifier, in that case-insensitive matching uses Unicode -semantics; for example, "k" will match the Unicode C<\N{KELVIN SIGN}> -under C</i> matching, and code points in the Latin1 range, above ASCII -will have Unicode semantics when it comes to case-insensitive matching. -But writing two in "a"'s in a row will increase its effect, causing the -Kelvin sign and all other non-ASCII characters to not match any ASCII -character under C</i> matching. - -C<"d"> means to use the traditional Perl pattern matching behavior. -This is dualistic (hence the name C<"d">, which also could stand for -"depends"). When this is in effect, Perl matches according to the -platform's native character set rules unless there is something that -indicates to use Unicode rules. If either the target string or the -pattern itself is encoded in UTF-8, Unicode rules are used. Also, if -the pattern contains Unicode-only features, such as code points above -255, C<\p()> Unicode properties or C<\N{}> Unicode names, Unicode rules -will be used. It is automatically selected by default if the regular -expression is compiled neither within the scope of a C<"use locale"> -pragma nor a <C<"use feature 'unicode_strings"> pragma. -This behavior causes a number of glitches, see -L<perlunicode/The "Unicode Bug">. 
- Note that the C<a>, C<d>, C<l>, C<p>, and C<u> modifiers are special in that they can only be enabled, not disabled, and the C<a>, C<d>, C<l>, and C<u> modifiers are mutually exclusive: specifying one de-specifies the others, and a maximum of one may appear in the construct. Thus, for -example, C<(?-p)>, C<(?-d:...)>, and C<(?dl:...)> will warn when -compiled under C<use warnings>. +example, C<(?-p)>, will warn when compiled under C<use warnings>; +C<(?-d:...)> and C<(?dl:...)> are fatal errors. Note also that the C<p> modifier is special in that its presence anywhere in a pattern has a global effect. @@ -1018,7 +1023,7 @@ For reasons of security, this construct is forbidden if the regular expression involves run-time interpolation of variables, unless the perilous C<use re 'eval'> pragma has been used (see L<re>), or the variables contain results of C<qr//> operator (see -L<perlop/"qr/STRINGE<sol>msixpo">). +L<perlop/"qr/STRINGE<sol>msixpodual">). This restriction is due to the wide-spread and remarkably convenient custom of using run-time determined strings as patterns. For example: @@ -1090,7 +1095,7 @@ For reasons of security, this construct is forbidden if the regular expression involves run-time interpolation of variables, unless the perilous C<use re 'eval'> pragma has been used (see L<re>), or the variables contain results of C<qr//> operator (see -L<perlop/"qrE<sol>STRINGE<sol>msixpo">). +L<perlop/"qrE<sol>STRINGE<sol>msixpodual">). In perl 5.12.x and earlier, because the regex engine was not re-entrant, delayed code could not safely invoke the regex engine either directly with |
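The perlre.pod hunks above describe what C</d>, C</u>, C</a> and C</aa> do once they are accepted after the closing delimiter. The following is a minimal sketch of those descriptions, not part of the commit. It assumes Perl 5.14 or later on an ASCII platform, with neither C<use locale> nor C<use feature 'unicode_strings'> in scope. The helper C<matches> and the sample strings are hypothetical names chosen for illustration.

```perl
# Requires Perl 5.14 or later; assumes an ASCII (non-EBCDIC) platform
# with no "use locale" or "use feature 'unicode_strings'" in scope.
use strict;
use warnings;

# Hypothetical helper: returns 1 if $str matches the compiled pattern $re, else 0.
sub matches {
    my ($str, $re) = @_;
    return $str =~ $re ? 1 : 0;
}

my $e_acute = "\xE9";       # LATIN SMALL LETTER E WITH ACUTE, held as one native byte
my $arabic3 = "\x{663}";    # ARABIC-INDIC DIGIT THREE (code point above 255)
my $kelvin  = "\x{212A}";   # KELVIN SIGN

my %got = (
    # \w on a non-UTF-8 string: native rules under /d, Unicode under /u,
    # ASCII-only under /a.
    word_d      => matches($e_acute, qr/\w/d),
    word_u      => matches($e_acute, qr/\w/u),
    word_a      => matches($e_acute, qr/\w/a),

    # \d against a non-ASCII digit: the default rules switch to Unicode
    # because the target holds a code point above 255; /a keeps \d to 0-9.
    digit_plain => matches($arabic3, qr/\d/),
    digit_a     => matches($arabic3, qr/\d/a),

    # Case-insensitive matching: /a still folds with Unicode rules,
    # /aa forbids ASCII/non-ASCII matches.
    kelvin_i    => matches($kelvin, qr/k/i),
    kelvin_ia   => matches($kelvin, qr/k/ia),
    kelvin_iaa  => matches($kelvin, qr/k/iaa),
);

printf "%-12s %d\n", $_, $got{$_} for sort keys %got;
```

If the documentation in the diff holds, C<word_u>, C<digit_plain>, C<kelvin_i> and C<kelvin_ia> report a match and the other four do not. The patterns are compiled with C<qr//> so each one carries its own character-set modifier regardless of where it is later used.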