Add qr/\p{Name=...}/

This accomplishes the same thing as \N{...}, but only for regex patterns, using loose matching and only the official Unicode names.

This commit includes a comparison of the two approaches, added to perlunicode.

But the real reason to do this is as a way station to being able to specify wild card lookup on the name property, coming in a later commit.

I chose to not include user-defined aliases nor :short character names at this time. I thought that there might be unforeseen consequences of using them. It's better to later relax a requirement than to try to restrict it.
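As an illustration only (not part of the patch), the following sketch shows the behavior described above on a perl that already has this feature, assumed here to be 5.32 or later. The variable names are made up for the example; the interpolated pattern is the one used in the perlunicode text added by this commit.

```perl
use v5.32;      # assumption: \p{Name=...} is available from Perl 5.32 on
use warnings;

# \N{...} needs the name spelled exactly, and is resolved at compile time.
my $cap_a = "\N{LATIN CAPITAL LETTER A}";

# \p{Name=...} matches the same character, but only inside a pattern.
my $exact = $cap_a =~ /\A\p{Name=LATIN CAPITAL LETTER A}\z/ ? 1 : 0;

# Loose matching: the case of the name does not matter, and the
# property name may be abbreviated to "na".
my $loose = $cap_a =~ /\A\p{na=latin capital letter a}\z/ ? 1 : 0;

# Unlike \N{...}, the name can be built by ordinary interpolation.
my $which  = 'b';    # hypothetical value, chosen for the example
my $re     = qr/\p{na=latin capital letter $which}/;
my $interp = "B" =~ $re ? 1 : 0;

say "exact=$exact loose=$loose interp=$interp";
```

Each of the three flags is expected to be true if the feature behaves as the commit message and the new documentation describe. Custom aliases and :short names are deliberately not exercised, since this commit does not support them in \p{Name=...}.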
author: Karl Williamson <khw@cpan.org> 2020-02-05 13:32:26 -0700
committer: Karl Williamson <khw@cpan.org> 2020-02-12 16:25:53 -0700
commit: 673c254b34746289019db8836016c81eb38e5bf0 (patch)
tree: a137acc65278cc7a8f7c3c03024c21e10940c0b6 /pod/perlunicode.pod
parent: ff5ebe043d728d8813248fe7b3a58935b1116e6a (diff)
download: perl-673c254b34746289019db8836016c81eb38e5bf0.tar.gz
1 files changed, 68 insertions, 9 deletions
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index ed20878eac..5a7938d5e6 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -460,15 +460,20 @@ Upper/lower case differences in property names and values are irrelevant;
 thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
 Similarly, you can add or subtract underscores anywhere in the middle of a
 word, so that these are also equivalent to C<\p{U_p_p_e_r}>.  And white space
-is irrelevant adjacent to non-word characters, such as the braces and the equals
-or colon separators, so C<\p{   Upper  }> and C<\p{ Upper_case : Y }> are
-equivalent to these as well.  In fact, white space and even
-hyphens can usually be added or deleted anywhere.  So even C<\p{ Up-per case = Yes}> is
-equivalent.  All this is called "loose-matching" by Unicode.  The few places
-where stricter matching is used is in the middle of numbers, and in the Perl
-extension properties that begin or end with an underscore.  Stricter matching
-cares about white space (except adjacent to non-word characters),
-hyphens, and non-interior underscores.
+is generally irrelevant adjacent to non-word characters, such as the
+braces and the equals or colon separators, so C<\p{   Upper  }> and
+C<\p{ Upper_case : Y }> are equivalent to these as well.  In fact, white
+space and even hyphens can usually be added or deleted anywhere.  So
+even C<\p{ Up-per case = Yes}> is equivalent.  All this is called
+"loose-matching" by Unicode.  The "name" property has some restrictions
+on this due to a few outlier names.  Full details are given in
+L<https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2>.
+
+The few places where stricter matching is
+used is in the middle of numbers, the "name" property, and in the Perl
+extension properties that begin or end with an underscore.  Stricter
+matching cares about white space (except adjacent to non-word
+characters), hyphens, and non-interior underscores.
 
 You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
 (C<^>) between the first brace and the property name: C<\p{^Tamil}> is
@@ -922,6 +927,60 @@ L<perlrecharclass/POSIX Character Classes>.
 
 =back
 
+=head2 Comparison of C<\N{...}> and C<\p{name=...}>
+
+Starting in Perl 5.32, you can specify a character by its name in
+regular expression patterns using C<\p{name=...}>.  This is in addition
+to the longstanding method of using C<\N{...}>.  The following
+summarizes the differences between these two:
+
+                       \N{...}       \p{Name=...}
+ can interpolate    only with eval       yes            [1]
+ custom names            yes             no             [2]
+ name aliases            yes             yes            [3]
+ named sequences         yes           not yet          [4]
+ name value parsing     exact       Unicode loose       [5]
+
+=over
+
+=item [1]
+
+The ability to interpolate means you can do something like
+
+ qr/\p{na=latin capital letter $which}/
+
+and specify C<$which> elsewhere.
+
+=item [2]
+
+You can create your own names for characters, and override official
+ones when using C<\N{...}>.  See L<charnames/CUSTOM ALIASES>.
+
+=item [3]
+
+Some characters have multiple names (synonyms).
+
+=item [4]
+
+Some particular sequences of characters are given a single name, in
+addition to their individual ones.
+
+It is planned to add support for named sequences to the C<\p{...}> form
+before 5.32; in the meantime, an accurate but not fully informative
+message is generated if use of one of these is attempted.
+
+=item [5]
+
+Exact name value matching means you have to specify case, hyphens,
+underscores, and spaces precisely in the name you want.  Loose matching
+follows the Unicode rules
+L<https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2>,
+where these are mostly irrelevant.  Except for a few outlier character
+names, these are the same rules as are already used for any other
+C<\p{...}> property.
+
+=back
+
 =head2 Wildcards in Property Values
 
 Starting in Perl 5.30, it is possible to do do something like this: