author | Karl Williamson <khw@cpan.org> | 2020-02-05 13:32:26 -0700 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2020-02-12 16:25:53 -0700 |
commit | 673c254b34746289019db8836016c81eb38e5bf0 (patch) | |
tree | a137acc65278cc7a8f7c3c03024c21e10940c0b6 /pod | |
parent | ff5ebe043d728d8813248fe7b3a58935b1116e6a (diff) | |
download | perl-673c254b34746289019db8836016c81eb38e5bf0.tar.gz |
Add qr/\p{Name=...}/
This accomplishes the same thing as \N{...}, but only for regex
patterns, using loose matching and only the official Unicode names.
This commit includes a comparison of the two approaches, added to
perlunicode. But the real reason to do this is as a way station to
being able to specify wild card lookup on the name property, coming in a
later commit.
I chose to include neither user-defined aliases nor :short character
names at this time. I thought that there might be unforeseen
consequences of using them. It's better to relax a requirement later
than to try to tighten it after the fact.
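As an illustration of the equivalence described in the message (not part of the commit itself), here is a minimal sketch assuming Perl 5.32 or later. The variable names are made up for the example; the patterns are the ones that appear in the perluniintro hunk of this diff.

```perl
use v5.32;    # \p{Name=...} needs Perl 5.32 or later
use warnings;

# U+263A WHITE SMILING FACE, the character used in the perluniintro hunk.
my $smiley = "\N{U+263A}";

# Longstanding form, which needs the name spelled exactly.
my $via_N = $smiley =~ /\N{WHITE SMILING FACE}/ ? 1 : 0;

# New form: official Unicode name, loosely matched (case and spaces ignored).
my $via_p     = $smiley =~ /\p{Name=WHITE SMILING FACE}/ ? 1 : 0;
my $via_loose = $smiley =~ /\p{Name=whitesmilingface}/   ? 1 : 0;

# A different character should not match the same name.
my $other = "A" =~ /\p{Name=WHITE SMILING FACE}/ ? 1 : 0;

say "N=$via_N p=$via_p loose=$via_loose other=$other";
```

The three flags for the smiley should agree, which is the sense in which the two syntaxes "accomplish the same thing" for official names.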
Diffstat (limited to 'pod')
-rw-r--r-- | pod/perldelta.pod | 7 |
-rw-r--r-- | pod/perlre.pod | 2 |
-rw-r--r-- | pod/perlretut.pod | 11 |
-rw-r--r-- | pod/perlunicode.pod | 77 |
-rw-r--r-- | pod/perlunicook.pod | 7 |
-rw-r--r-- | pod/perluniintro.pod | 5 |
6 files changed, 99 insertions, 10 deletions
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index 562473ed87..07c5c73190 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -47,6 +47,13 @@ that aren't part of the strict UCD (Unicode character database). These
 two are used for examining inputs for security purposes. Details on
 their usage is at L<https://www.unicode.org/reports/tr39/proposed.html>.
 
+=head2 It is now possible to write C<qr/\p{Name=...}/>, or C<\p{Na=...}>
+
+The Unicode Name property is now accessible in regular expression
+patterns using the above syntaxes, as an alternative to C<\N{...}>.
+A comparison of the two methods is given in
+L<perlunicode/Comparison of \N{...} and \p{name=...}>.
+
 =head1 Security
 
 XXX Any security-related notices go here. In particular, any security
diff --git a/pod/perlre.pod b/pod/perlre.pod
index 68e18c950d..8c0d2049e0 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -465,7 +465,7 @@ Use of C</x> means that if you want real whitespace or C<"#"> characters
 in the pattern (outside a bracketed character class, which is unaffected
 by C</x>), then you'll either have to escape them (using backslashes or
 C<\Q...\E>) or encode them using octal,
-hex, or C<\N{}> escapes.
+hex, or C<\N{}> or C<\p{name=...}> escapes.
 
 It is ineffective to try to continue a comment onto the next line by
 escaping the C<\n> with a backslash or C<\Q>.
diff --git a/pod/perlretut.pod b/pod/perlretut.pod
index 78050a1556..72e23d0d3b 100644
--- a/pod/perlretut.pod
+++ b/pod/perlretut.pod
@@ -2005,6 +2005,17 @@ Consortium, L<https://www.unicode.org/charts/charindex.html>; explanatory
 material with links to other resources at
 L<https://www.unicode.org/standard/where>.
 
+Starting in Perl v5.32, an alternative to C<\N{...}> for full names is
+available, and that is to say
+
+    /\p{Name=greek small letter sigma}/
+
+The casing of the character name is irrelevant when used in C<\p{}>, as
+are most spaces, underscores and hyphens. (A few outlier characters
+cause problems with ignoring all of them always. The details (which you
+can look up when you get more proficient, and if ever needed) are in
+L<https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2>).
+
 The answer to requirement 2) is that a regexp (mostly)
 uses Unicode characters. The "mostly" is for messy backward
 compatibility reasons, but starting in Perl 5.14, any regexp compiled in
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index ed20878eac..5a7938d5e6 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -460,15 +460,20 @@ Upper/lower case differences in property names and values are irrelevant;
 thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
 Similarly, you can add or subtract underscores anywhere in the middle of a
 word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space
-is irrelevant adjacent to non-word characters, such as the braces and the equals
-or colon separators, so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are
-equivalent to these as well. In fact, white space and even
-hyphens can usually be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is
-equivalent. All this is called "loose-matching" by Unicode. The few places
-where stricter matching is used is in the middle of numbers, and in the Perl
-extension properties that begin or end with an underscore. Stricter matching
-cares about white space (except adjacent to non-word characters),
-hyphens, and non-interior underscores.
+is generally irrelevant adjacent to non-word characters, such as the
+braces and the equals or colon separators, so C<\p{ Upper }> and
+C<\p{ Upper_case : Y }> are equivalent to these as well. In fact, white
+space and even hyphens can usually be added or deleted anywhere. So
+even C<\p{ Up-per case = Yes}> is equivalent. All this is called
+"loose-matching" by Unicode. The "name" property has some restrictions
+on this due to a few outlier names. Full details are given in
+L<https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2>.
+
+The few places where stricter matching is
+used is in the middle of numbers, the "name" property, and in the Perl
+extension properties that begin or end with an underscore. Stricter
+matching cares about white space (except adjacent to non-word
+characters), hyphens, and non-interior underscores.
 
 You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
 (C<^>) between the first brace and the property name: C<\p{^Tamil}> is
@@ -922,6 +927,60 @@ L<perlrecharclass/POSIX Character Classes>.
 
 =back
 
+=head2 Comparison of C<\N{...}> and C<\p{name=...}>
+
+Starting in Perl 5.32, you can specify a character by its name in
+regular expression patterns using C<\p{name=...}>. This is in addition
+to the longstanding method of using C<\N{...}>. The following
+summarizes the differences between these two:
+
+                       \N{...}         \p{Name=...}
+ can interpolate    only with eval     yes            [1]
+ custom names       yes                no             [2]
+ name aliases       yes                yes            [3]
+ named sequences    yes                not yet        [4]
+ name value parsing exact              Unicode loose  [5]
+
+=over
+
+=item [1]
+
+The ability to interpolate means you can do something like
+
+    qr/\p{na=latin capital letter $which}/
+
+and specify C<$which> elsewhere.
+
+=item [2]
+
+You can create your own names for characters, and override official
+ones when using C<\N{...}>. See L<charnames/CUSTOM ALIASES>.
+
+=item [3]
+
+Some characters have multiple names (synonyms).
+
+=item [4]
+
+Some particular sequences of characters are given a single name, in
+addition to their individual ones.
+
+It is planned to add support for named sequences to the C<\p{...}> form
+before 5.32; in the meantime, an accurate but not fully informative
+message is generated if use of one of these is attempted.
+
+=item [5]
+
+Exact name value matching means you have to specify case, hyphens,
+underscores, and spaces precisely in the name you want. Loose matching
+follows the Unicode rules
+L<https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2>,
+where these are mostly irrelevant. Except for a few outlier character
+names, these are the same rules as are already used for any other
+C<\p{...}> property.
+
+=back
+
 =head2 Wildcards in Property Values
 
 Starting in Perl 5.30, it is possible to do do something like this:
diff --git a/pod/perlunicook.pod b/pod/perlunicook.pod
index eb395f795e..c709e0fc73 100644
--- a/pod/perlunicook.pod
+++ b/pod/perlunicook.pod
@@ -152,6 +152,13 @@ that is, it disregards case, whitespace, and underscores:
 
     "\N{euro sign}" # :loose (from v5.16)
 
+Starting in v5.32, you can also use
+
+    qr/\p{name=euro sign}/
+
+to get official Unicode named characters in regular expressions. Loose
+matching is always done for these.
+
 =head2 ℞ 9: Unicode named sequences
 
 These look just like character names but return multiple codepoints.
diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index fb799a4c73..14e8c513f0 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -267,6 +267,11 @@ Similarly, they can be used in regular expression literals
     $smiley =~ /\N{WHITE SMILING FACE}/;
     $smiley =~ /\N{U+263a}/;
 
+or, starting in v5.32:
+
+    $smiley =~ /\p{Name=WHITE SMILING FACE}/;
+    $smiley =~ /\p{Name=whitesmilingface}/;
+
 At run-time you can use:
 
     use charnames ();