Add qr/\p{Name=...}/

This accomplishes the same thing as \N{...}, but only for regex patterns, using loose matching and only the official Unicode names.

This commit includes a comparison of the two approaches, added to perlunicode.

But the real reason to do this is as a way station to being able to specify wild card lookup on the name property, coming in a later commit.

I chose to not include user-defined aliases nor :short character names at this time. I thought that there might be unforeseen consequences of using them. It's better to later relax a requirement than to try to restrict it.
author: Karl Williamson <khw@cpan.org> 2020-02-05 13:32:26 -0700
committer: Karl Williamson <khw@cpan.org> 2020-02-12 16:25:53 -0700
commit: 673c254b34746289019db8836016c81eb38e5bf0 (patch)
tree: a137acc65278cc7a8f7c3c03024c21e10940c0b6 /pod
parent: ff5ebe043d728d8813248fe7b3a58935b1116e6a (diff)
download: perl-673c254b34746289019db8836016c81eb38e5bf0.tar.gz
6 files changed, 99 insertions, 10 deletions
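
As a quick illustration of what the commit message describes, the sketch below matches the same character with \N{...} and with the new \p{Name=...} form, including loosely spelled variants. It assumes a perl of v5.32 or later; the variable names are illustrative only and do not come from the patch.

```perl
use strict;
use warnings;
use v5.32;    # \p{Name=...} is documented as available starting in v5.32

# U+263A WHITE SMILING FACE, the character used in the perluniintro hunk.
my $smiley = "\x{263A}";

# Longstanding form: the name must be spelled exactly.
my $by_N = $smiley =~ /\N{WHITE SMILING FACE}/ ? 1 : 0;

# New form: official Unicode name, loosely matched.
my $by_p       = $smiley =~ /\p{Name=WHITE SMILING FACE}/ ? 1 : 0;
my $by_p_loose = $smiley =~ /\p{Name=whitesmilingface}/   ? 1 : 0;

# "Na" is the short property name mentioned in the perldelta entry.
my $by_p_short = $smiley =~ /\p{Na=white smiling face}/ ? 1 : 0;

say "\\N{...}:               $by_N";
say "\\p{Name=...}:          $by_p";
say "\\p{Name=...} (loose):  $by_p_loose";
say "\\p{Na=...}:            $by_p_short";
```

On a perl that supports the feature, all four forms are expected to report a match.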
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index 562473ed87..07c5c73190 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -47,6 +47,13 @@ that aren't part of the strict UCD (Unicode character database).  These
 two are used for examining inputs for security purposes.  Details on
 their usage is at L<https://www.unicode.org/reports/tr39/proposed.html>.
 
+=head2 It is now possible to write C<qr/\p{Name=...}/>, or C<\p{Na=...}>
+
+The Unicode Name property is now accessible in regular expression
+patterns using the above syntaxes, as an alternative to C<\N{...}>.
+A comparison of the two methods is given in
+L<perlunicode/Comparison of \N{...} and \p{name=...}>.
+
 =head1 Security
 
 XXX Any security-related notices go here.  In particular, any security
diff --git a/pod/perlre.pod b/pod/perlre.pod
index 68e18c950d..8c0d2049e0 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -465,7 +465,7 @@ Use of C</x> means that if you want real
 whitespace or C<"#"> characters in the pattern (outside a bracketed character
 class, which is unaffected by C</x>), then you'll either have to
 escape them (using backslashes or C<\Q...\E>) or encode them using octal,
-hex, or C<\N{}> escapes.
+hex, or C<\N{}> or C<\p{name=...}> escapes.
 It is ineffective to try to continue a comment onto the next line by
 escaping the C<\n> with a backslash or C<\Q>.
 
diff --git a/pod/perlretut.pod b/pod/perlretut.pod
index 78050a1556..72e23d0d3b 100644
--- a/pod/perlretut.pod
+++ b/pod/perlretut.pod
@@ -2005,6 +2005,17 @@ Consortium, L<https://www.unicode.org/charts/charindex.html>; explanatory
 material with links to other resources at
 L<https://www.unicode.org/standard/where>.
 
+Starting in Perl v5.32, an alternative to C<\N{...}> for full names is
+available, and that is to say
+
+ /\p{Name=greek small letter sigma}/
+
+The casing of the character name is irrelevant when used in C<\p{}>, as
+are most spaces, underscores and hyphens.  (A few outlier characters
+cause problems with ignoring all of them always.  The details (which you
+can look up when you get more proficient, and if ever needed) are in
+L<https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2>).
+
 The answer to requirement 2) is that a regexp (mostly)
 uses Unicode characters.  The "mostly" is for messy backward
 compatibility reasons, but starting in Perl 5.14, any regexp compiled in
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index ed20878eac..5a7938d5e6 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -460,15 +460,20 @@ Upper/lower case differences in property names and values are irrelevant;
 thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
 Similarly, you can add or subtract underscores anywhere in the middle of a
 word, so that these are also equivalent to C<\p{U_p_p_e_r}>.  And white space
-is irrelevant adjacent to non-word characters, such as the braces and the equals
-or colon separators, so C<\p{   Upper  }> and C<\p{ Upper_case : Y }> are
-equivalent to these as well.  In fact, white space and even
-hyphens can usually be added or deleted anywhere.  So even C<\p{ Up-per case = Yes}> is
-equivalent.  All this is called "loose-matching" by Unicode.  The few places
-where stricter matching is used is in the middle of numbers, and in the Perl
-extension properties that begin or end with an underscore.  Stricter matching
-cares about white space (except adjacent to non-word characters),
-hyphens, and non-interior underscores.
+is generally irrelevant adjacent to non-word characters, such as the
+braces and the equals or colon separators, so C<\p{   Upper  }> and
+C<\p{ Upper_case : Y }> are equivalent to these as well.  In fact, white
+space and even hyphens can usually be added or deleted anywhere.  So
+even C<\p{ Up-per case = Yes}> is equivalent.  All this is called
+"loose-matching" by Unicode.  The "name" property has some restrictions
+on this due to a few outlier names.  Full details are given in
+L<https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2>.
+
+The few places where stricter matching is
+used is in the middle of numbers, the "name" property, and in the Perl
+extension properties that begin or end with an underscore.  Stricter
+matching cares about white space (except adjacent to non-word
+characters), hyphens, and non-interior underscores.
 
 You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
 (C<^>) between the first brace and the property name: C<\p{^Tamil}> is
@@ -922,6 +927,60 @@ L<perlrecharclass/POSIX Character Classes>.
 
 =back
 
+=head2 Comparison of C<\N{...}> and C<\p{name=...}>
+
+Starting in Perl 5.32, you can specify a character by its name in
+regular expression patterns using C<\p{name=...}>.  This is in addition
+to the longstanding method of using C<\N{...}>.  The following
+summarizes the differences between these two:
+
+                       \N{...}       \p{Name=...}
+ can interpolate    only with eval       yes            [1]
+ custom names            yes             no             [2]
+ name aliases            yes             yes            [3]
+ named sequences         yes           not yet          [4]
+ name value parsing     exact       Unicode loose       [5]
+
+=over
+
+=item [1]
+
+The ability to interpolate means you can do something like
+
+ qr/\p{na=latin capital letter $which}/
+
+and specify C<$which> elsewhere.
+
+=item [2]
+
+You can create your own names for characters, and override official
+ones when using C<\N{...}>.  See L<charnames/CUSTOM ALIASES>.
+
+=item [3]
+
+Some characters have multiple names (synonyms).
+
+=item [4]
+
+Some particular sequences of characters are given a single name, in
+addition to their individual ones.
+
+It is planned to add support for named sequences to the C<\p{...}> form
+before 5.32; in the meantime, an accurate but not fully informative
+message is generated if use of one of these is attempted.
+
+=item [5]
+
+Exact name value matching means you have to specify case, hyphens,
+underscores, and spaces precisely in the name you want.  Loose matching
+follows the Unicode rules
+L<https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2>,
+where these are mostly irrelevant.  Except for a few outlier character
+names, these are the same rules as are already used for any other
+C<\p{...}> property.
+
+=back
+
 =head2 Wildcards in Property Values
 
 Starting in Perl 5.30, it is possible to do do something like this:
diff --git a/pod/perlunicook.pod b/pod/perlunicook.pod
index eb395f795e..c709e0fc73 100644
--- a/pod/perlunicook.pod
+++ b/pod/perlunicook.pod
@@ -152,6 +152,13 @@ that is, it disregards case, whitespace, and underscores:
 
  "\N{euro sign}"                        # :loose (from v5.16)
 
+Starting in v5.32, you can also use
+
+ qr/\p{name=euro sign}/
+
+to get official Unicode named characters in regular expressions.  Loose
+matching is always done for these.
+
 =head2 ℞ 9: Unicode named sequences
 
 These look just like character names but return multiple codepoints.
diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index fb799a4c73..14e8c513f0 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -267,6 +267,11 @@ Similarly, they can be used in regular expression literals
  $smiley =~ /\N{WHITE SMILING FACE}/;
  $smiley =~ /\N{U+263a}/;
 
+or, starting in v5.32:
+
+ $smiley =~ /\p{Name=WHITE SMILING FACE}/;
+ $smiley =~ /\p{Name=whitesmilingface}/;
+
 At run-time you can use:
 
  use charnames ();
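
Item [1] of the perlunicode comparison in the diff above says that \p{name=...} can interpolate a variable into the name. A minimal sketch of that, again assuming perl v5.32 or later; the helper latin_capital_re is hypothetical and not part of the patch.

```perl
use strict;
use warnings;
use v5.32;    # \p{name=...} is documented as available starting in v5.32

# Hypothetical helper: build a pattern for one Latin capital letter,
# chosen at run time by interpolating $which into the character name.
sub latin_capital_re {
    my ($which) = @_;
    return qr/\p{na=latin capital letter $which}/;
}

my $re_a = latin_capital_re('A');

my $a_matches = 'A' =~ $re_a ? 1 : 0;    # LATIN CAPITAL LETTER A
my $b_matches = 'B' =~ $re_a ? 1 : 0;    # a different character

say "A matches: $a_matches";
say "B matches: $b_matches";
```

Per the comparison table, the equivalent with \N{...} would need a string eval, because \N{...} names are resolved when the pattern is compiled rather than after interpolation.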