author: Karl Williamson <khw@cpan.org> 2020-02-05 13:32:26 -0700
committer: Karl Williamson <khw@cpan.org> 2020-02-12 16:25:53 -0700
commit: 673c254b34746289019db8836016c81eb38e5bf0
tree: a137acc65278cc7a8f7c3c03024c21e10940c0b6 /pod/perlunicode.pod
parent: ff5ebe043d728d8813248fe7b3a58935b1116e6a
download: perl-673c254b34746289019db8836016c81eb38e5bf0.tar.gz
Add qr/\p{Name=...}/
This accomplishes the same thing as \N{...}, but only for regex patterns, using loose matching and only the official Unicode names.

This commit includes a comparison of the two approaches, added to perlunicode.

But the real reason to do this is as a way station to being able to specify wildcard lookup on the name property, coming in a later commit.

I chose not to include user-defined aliases or :short character names at this time. I thought that there might be unforeseen consequences of using them. It's better to relax a requirement later than to try to restrict it.
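The difference between the two forms can be sketched roughly as follows. This snippet is not part of the commit; it is a minimal illustration that assumes Perl 5.32 or later. The variable `$which` mirrors the example the commit adds to perlunicode, while `$exact`, `$loose`, and `$interp` are names made up for this sketch.

```perl
use v5.32;      # \p{Name=...} in patterns needs Perl 5.32 or later
use warnings;

# \N{...} requires the exact character name.
my $exact = qr/\N{LATIN CAPITAL LETTER A}/;

# \p{Name=...} uses Unicode loose matching, so case is relaxed here.
my $loose = qr/\p{Name=latin capital letter a}/;

# \p{na=...} can interpolate a variable; "na" is the short form of "Name".
my $which  = 'B';    # illustrative value, set "elsewhere"
my $interp = qr/\p{na=latin capital letter $which}/;

# Report whether each pattern matches the character it names.
say 'exact:  ', ('A' =~ $exact  ? 'match' : 'no match');
say 'loose:  ', ('A' =~ $loose  ? 'match' : 'no match');
say 'interp: ', ('B' =~ $interp ? 'match' : 'no match');
```

Per the commit message, user-defined aliases and :short names are deliberately left out of the \p{Name=...} form, so code that relies on those still needs \N{...}.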
Diffstat (limited to 'pod/perlunicode.pod')
-rw-r--r-- pod/perlunicode.pod 77
1 files changed, 68 insertions, 9 deletions
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index ed20878eac..5a7938d5e6 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -460,15 +460,20 @@ Upper/lower case differences in property names and values are irrelevant;
thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
Similarly, you can add or subtract underscores anywhere in the middle of a
word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space
-is irrelevant adjacent to non-word characters, such as the braces and the equals
-or colon separators, so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are
-equivalent to these as well. In fact, white space and even
-hyphens can usually be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is
-equivalent. All this is called "loose-matching" by Unicode. The few places
-where stricter matching is used is in the middle of numbers, and in the Perl
-extension properties that begin or end with an underscore. Stricter matching
-cares about white space (except adjacent to non-word characters),
-hyphens, and non-interior underscores.
+is generally irrelevant adjacent to non-word characters, such as the
+braces and the equals or colon separators, so C<\p{ Upper }> and
+C<\p{ Upper_case : Y }> are equivalent to these as well. In fact, white
+space and even hyphens can usually be added or deleted anywhere. So
+even C<\p{ Up-per case = Yes}> is equivalent. All this is called
+"loose-matching" by Unicode. The "name" property has some restrictions
+on this due to a few outlier names. Full details are given in
+L<https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2>.
+
+The few places where stricter matching is used are in the middle of
+numbers, in the "name" property, and in the Perl extension properties
+that begin or end with an underscore. Stricter
+matching cares about white space (except adjacent to non-word
+characters), hyphens, and non-interior underscores.
You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(C<^>) between the first brace and the property name: C<\p{^Tamil}> is
@@ -922,6 +927,60 @@ L<perlrecharclass/POSIX Character Classes>.
=back
+=head2 Comparison of C<\N{...}> and C<\p{name=...}>
+
+Starting in Perl 5.32, you can specify a character by its name in
+regular expression patterns using C<\p{name=...}>. This is in addition
+to the longstanding method of using C<\N{...}>. The following
+summarizes the differences between these two:
+
+                        \N{...}           \p{Name=...}
+ can interpolate        only with eval    yes              [1]
+ custom names           yes               no               [2]
+ name aliases           yes               yes              [3]
+ named sequences        yes               not yet          [4]
+ name value parsing     exact             Unicode loose    [5]
+
+=over
+
+=item [1]
+
+The ability to interpolate means you can do something like
+
+ qr/\p{na=latin capital letter $which}/
+
+and specify C<$which> elsewhere.
+
+=item [2]
+
+You can create your own names for characters, and override official
+ones when using C<\N{...}>. See L<charnames/CUSTOM ALIASES>.
+
+=item [3]
+
+Some characters have multiple names (synonyms).
+
+=item [4]
+
+Some particular sequences of characters are given a single name, in
+addition to their individual ones.
+
+It is planned to add support for named sequences to the C<\p{...}> form
+before 5.32; in the meantime, an accurate but not fully informative
+message is generated if use of one of these is attempted.
+
+=item [5]
+
+Exact name value matching means you have to specify case, hyphens,
+underscores, and spaces precisely in the name you want. Loose matching
+follows the Unicode rules
+L<https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2>,
+where these are mostly irrelevant. Except for a few outlier character
+names, these are the same rules as are already used for any other
+C<\p{...}> property.
+
+=back
+
=head2 Wildcards in Property Values
Starting in Perl 5.30, it is possible to do something like this: