author | Karl Williamson <khw@cpan.org> | 2020-02-05 13:32:26 -0700 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2020-02-12 16:25:53 -0700 |
commit | 673c254b34746289019db8836016c81eb38e5bf0 (patch) | |
tree | a137acc65278cc7a8f7c3c03024c21e10940c0b6 /pod/perlunicode.pod | |
parent | ff5ebe043d728d8813248fe7b3a58935b1116e6a (diff) | |
Add qr/\p{Name=...}/
This accomplishes the same thing as \N{...}, but only for regex
patterns, using loose matching and only the official Unicode names.
This commit includes a comparison of the two approaches, added to
perlunicode. But the real reason to do this is as a way station to
being able to specify wild card lookup on the name property, coming in a
later commit.
I chose not to include user-defined aliases or :short character names
at this time, because using them might have unforeseen consequences.
It's better to relax a requirement later than to try to restrict it.
Diffstat (limited to 'pod/perlunicode.pod')
-rw-r--r-- | pod/perlunicode.pod | 77 |
1 files changed, 68 insertions, 9 deletions
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index ed20878eac..5a7938d5e6 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -460,15 +460,20 @@ Upper/lower case differences in property names and values are irrelevant;
 thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
 Similarly, you can add or subtract underscores anywhere in the middle of a
 word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space
-is irrelevant adjacent to non-word characters, such as the braces and the equals
-or colon separators, so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are
-equivalent to these as well. In fact, white space and even
-hyphens can usually be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is
-equivalent. All this is called "loose-matching" by Unicode. The few places
-where stricter matching is used is in the middle of numbers, and in the Perl
-extension properties that begin or end with an underscore. Stricter matching
-cares about white space (except adjacent to non-word characters),
-hyphens, and non-interior underscores.
+is generally irrelevant adjacent to non-word characters, such as the
+braces and the equals or colon separators, so C<\p{ Upper }> and
+C<\p{ Upper_case : Y }> are equivalent to these as well. In fact, white
+space and even hyphens can usually be added or deleted anywhere. So
+even C<\p{ Up-per case = Yes}> is equivalent. All this is called
+"loose-matching" by Unicode. The "name" property has some restrictions
+on this due to a few outlier names. Full details are given in
+L<https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2>.
+
+The few places where stricter matching is
+used is in the middle of numbers, the "name" property, and in the Perl
+extension properties that begin or end with an underscore. Stricter
+matching cares about white space (except adjacent to non-word
+characters), hyphens, and non-interior underscores.
 
 You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
 (C<^>) between the first brace and the property name: C<\p{^Tamil}> is
@@ -922,6 +927,60 @@ L<perlrecharclass/POSIX Character Classes>.
 
 =back
 
+=head2 Comparison of C<\N{...}> and C<\p{name=...}>
+
+Starting in Perl 5.32, you can specify a character by its name in
+regular expression patterns using C<\p{name=...}>. This is in addition
+to the longstanding method of using C<\N{...}>. The following
+summarizes the differences between these two:
+
+                        \N{...}           \p{Name=...}
+ can interpolate     only with eval    yes            [1]
+ custom names        yes               no             [2]
+ name aliases        yes               yes            [3]
+ named sequences     yes               not yet        [4]
+ name value parsing  exact             Unicode loose  [5]
+
+=over
+
+=item [1]
+
+The ability to interpolate means you can do something like
+
+ qr/\p{na=latin capital letter $which}/
+
+and specify C<$which> elsewhere.
+
+=item [2]
+
+You can create your own names for characters, and override official
+ones when using C<\N{...}>. See L<charnames/CUSTOM ALIASES>.
+
+=item [3]
+
+Some characters have multiple names (synonyms).
+
+=item [4]
+
+Some particular sequences of characters are given a single name, in
+addition to their individual ones.
+
+It is planned to add support for named sequences to the C<\p{...}> form
+before 5.32; in the meantime, an accurate but not fully informative
+message is generated if use of one of these is attempted.
+
+=item [5]
+
+Exact name value matching means you have to specify case, hyphens,
+underscores, and spaces precisely in the name you want. Loose matching
+follows the Unicode rules
+L<https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2>,
+where these are mostly irrelevant. Except for a few outlier character
+names, these are the same rules as are already used for any other
+C<\p{...}> property.
+
+=back
+
 =head2 Wildcards in Property Values
 
 Starting in Perl 5.30, it is possible to do do something like this:
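
For readers of this commit, the comparison documented in the new section can be tried with a minimal sketch like the one below. It is not part of the patch. It assumes Perl 5.32 or later, and the variable names (`$exact`, `$loose`, `$interp`, `$which`) and the chosen character names are illustrative only.

```perl
use v5.32;    # enables strict and say; \p{name=...} is documented for Perl 5.32+
use warnings;

# \N{...}: per the new documentation, the name is parsed exactly.
my $exact = qr/\N{LATIN CAPITAL LETTER A}/;

# \p{name=...}: loose matching, so a lowercase spelling is accepted here.
my $loose = qr/\p{name=latin capital letter a}/;

# Interpolation into \p{na=...}, as in footnote [1] of the patch.
# $which is a hypothetical variable standing in for a value set elsewhere.
my $which  = 'b';
my $interp = qr/\p{na=latin capital letter $which}/;

say 'A' =~ $exact  ? '\N{...} form matches "A"'         : '\N{...} form does not match "A"';
say 'A' =~ $loose  ? '\p{name=...} form matches "A"'    : '\p{name=...} form does not match "A"';
say 'B' =~ $interp ? 'interpolated form matches "B"'    : 'interpolated form does not match "B"';
```

The sketch sticks to case differences, since the patch notes that the "name" property has some restrictions on loose matching because of a few outlier names.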