author: Karl Williamson <khw@cpan.org> 2020-02-05 13:32:26 -0700
committer: Karl Williamson <khw@cpan.org> 2020-02-12 16:25:53 -0700
commit: 673c254b34746289019db8836016c81eb38e5bf0
tree: a137acc65278cc7a8f7c3c03024c21e10940c0b6 /pod
parent: ff5ebe043d728d8813248fe7b3a58935b1116e6a
Add qr/\p{Name=...}/

This accomplishes the same thing as \N{...}, but only for regex patterns, using loose matching and only the official Unicode names.

This commit includes a comparison of the two approaches, added to perlunicode.

But the real reason to do this is as a way station to being able to specify wild card lookup on the name property, coming in a later commit.

I chose to not include user-defined aliases nor :short character names at this time. I thought that there might be unforeseen consequences of using them. It's better to later relax a requirement than to try to restrict it.

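
For illustration only, the behavior described in this message could be exercised roughly as in the sketch below. It assumes Perl 5.32 or later and reuses the names that appear in the documentation changes of this commit; the variable names (`$sigma`, `$smiley`, `$which`, `$capital_re` and the `*_ok` flags) are hypothetical and are not part of the patch.

```perl
use v5.32;    # \p{Name=...} is only available from Perl 5.32 onwards
use warnings;

# Build the subject characters with \N{...}, the longstanding mechanism.
my $sigma  = "\N{GREEK SMALL LETTER SIGMA}";
my $smiley = "\N{WHITE SMILING FACE}";

# Loose matching: case and most spaces are irrelevant inside \p{Name=...}.
my $lowercase_ok = $sigma  =~ /\p{Name=greek small letter sigma}/ ? 1 : 0;
my $squashed_ok  = $smiley =~ /\p{Name=whitesmilingface}/         ? 1 : 0;

# The name is ordinary pattern text, so it can be interpolated.
# $which is a hypothetical variable used only for this illustration.
my $which      = "A";
my $capital_re = qr/\p{na=latin capital letter $which}/;
my $interp_ok  = "A" =~ $capital_re ? 1 : 0;
my $other_ok   = "B" =~ $capital_re ? 1 : 0;    # a different character

say "lowercase name matched:    $lowercase_ok";
say "squashed name matched:     $squashed_ok";
say "interpolated name matched: $interp_ok";
say "other letter matched:      $other_ok";
```

The flags are printed as 0 or 1 rather than printing the characters themselves, which avoids needing an output encoding layer for this small check.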
Diffstat (limited to 'pod')
-rw-r--r--  pod/perldelta.pod     7
-rw-r--r--  pod/perlre.pod        2
-rw-r--r--  pod/perlretut.pod    11
-rw-r--r--  pod/perlunicode.pod  77
-rw-r--r--  pod/perlunicook.pod   7
-rw-r--r--  pod/perluniintro.pod  5
6 files changed, 99 insertions, 10 deletions
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index 562473ed87..07c5c73190 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -47,6 +47,13 @@ that aren't part of the strict UCD (Unicode character database). These
two are used for examining inputs for security purposes. Details on
their usage is at L<https://www.unicode.org/reports/tr39/proposed.html>.
+=head2 It is now possible to write C<qr/\p{Name=...}/>, or C<\p{Na=...}>
+
+The Unicode Name property is now accessible in regular expression
+patterns using the above syntaxes, as an alternative to C<\N{...}>.
+A comparison of the two methods is given in
+L<perlunicode/Comparison of \N{...} and \p{name=...}>.
+
=head1 Security
XXX Any security-related notices go here. In particular, any security
diff --git a/pod/perlre.pod b/pod/perlre.pod
index 68e18c950d..8c0d2049e0 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -465,7 +465,7 @@ Use of C</x> means that if you want real
whitespace or C<"#"> characters in the pattern (outside a bracketed character
class, which is unaffected by C</x>), then you'll either have to
escape them (using backslashes or C<\Q...\E>) or encode them using octal,
-hex, or C<\N{}> escapes.
+hex, or C<\N{}> or C<\p{name=...}> escapes.
It is ineffective to try to continue a comment onto the next line by
escaping the C<\n> with a backslash or C<\Q>.
diff --git a/pod/perlretut.pod b/pod/perlretut.pod
index 78050a1556..72e23d0d3b 100644
--- a/pod/perlretut.pod
+++ b/pod/perlretut.pod
@@ -2005,6 +2005,17 @@ Consortium, L<https://www.unicode.org/charts/charindex.html>; explanatory
material with links to other resources at
L<https://www.unicode.org/standard/where>.
+Starting in Perl v5.32, an alternative to C<\N{...}> for full names is
+available, and that is to say
+
+ /\p{Name=greek small letter sigma}/
+
+The casing of the character name is irrelevant when used in C<\p{}>, as
+are most spaces, underscores and hyphens. (A few outlier characters
+cause problems with ignoring all of them always. The details (which you
+can look up when you get more proficient, and if ever needed) are in
+L<https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2>).
+
The answer to requirement 2) is that a regexp (mostly)
uses Unicode characters. The "mostly" is for messy backward
compatibility reasons, but starting in Perl 5.14, any regexp compiled in
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index ed20878eac..5a7938d5e6 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -460,15 +460,20 @@ Upper/lower case differences in property names and values are irrelevant;
thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
Similarly, you can add or subtract underscores anywhere in the middle of a
word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space
-is irrelevant adjacent to non-word characters, such as the braces and the equals
-or colon separators, so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are
-equivalent to these as well. In fact, white space and even
-hyphens can usually be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is
-equivalent. All this is called "loose-matching" by Unicode. The few places
-where stricter matching is used is in the middle of numbers, and in the Perl
-extension properties that begin or end with an underscore. Stricter matching
-cares about white space (except adjacent to non-word characters),
-hyphens, and non-interior underscores.
+is generally irrelevant adjacent to non-word characters, such as the
+braces and the equals or colon separators, so C<\p{ Upper }> and
+C<\p{ Upper_case : Y }> are equivalent to these as well. In fact, white
+space and even hyphens can usually be added or deleted anywhere. So
+even C<\p{ Up-per case = Yes}> is equivalent. All this is called
+"loose-matching" by Unicode. The "name" property has some restrictions
+on this due to a few outlier names. Full details are given in
+L<https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2>.
+
+The few places where stricter matching is
+used is in the middle of numbers, the "name" property, and in the Perl
+extension properties that begin or end with an underscore. Stricter
+matching cares about white space (except adjacent to non-word
+characters), hyphens, and non-interior underscores.
You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(C<^>) between the first brace and the property name: C<\p{^Tamil}> is
@@ -922,6 +927,60 @@ L<perlrecharclass/POSIX Character Classes>.
=back
+=head2 Comparison of C<\N{...}> and C<\p{name=...}>
+
+Starting in Perl 5.32, you can specify a character by its name in
+regular expression patterns using C<\p{name=...}>. This is in addition
+to the longstanding method of using C<\N{...}>. The following
+summarizes the differences between these two:
+
+                     \N{...}           \p{Name=...}
+ can interpolate     only with eval    yes            [1]
+ custom names        yes               no             [2]
+ name aliases        yes               yes            [3]
+ named sequences     yes               not yet        [4]
+ name value parsing  exact             Unicode loose  [5]
+
+=over
+
+=item [1]
+
+The ability to interpolate means you can do something like
+
+ qr/\p{na=latin capital letter $which}/
+
+and specify C<$which> elsewhere.
+
+=item [2]
+
+You can create your own names for characters, and override official
+ones when using C<\N{...}>. See L<charnames/CUSTOM ALIASES>.
+
+=item [3]
+
+Some characters have multiple names (synonyms).
+
+=item [4]
+
+Some particular sequences of characters are given a single name, in
+addition to their individual ones.
+
+It is planned to add support for named sequences to the C<\p{...}> form
+before 5.32; in the meantime, an accurate but not fully informative
+message is generated if use of one of these is attempted.
+
+=item [5]
+
+Exact name value matching means you have to specify case, hyphens,
+underscores, and spaces precisely in the name you want. Loose matching
+follows the Unicode rules
+L<https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2>,
+where these are mostly irrelevant. Except for a few outlier character
+names, these are the same rules as are already used for any other
+C<\p{...}> property.
+
+=back
+
=head2 Wildcards in Property Values
Starting in Perl 5.30, it is possible to do do something like this:
diff --git a/pod/perlunicook.pod b/pod/perlunicook.pod
index eb395f795e..c709e0fc73 100644
--- a/pod/perlunicook.pod
+++ b/pod/perlunicook.pod
@@ -152,6 +152,13 @@ that is, it disregards case, whitespace, and underscores:
"\N{euro sign}" # :loose (from v5.16)
+Starting in v5.32, you can also use
+
+ qr/\p{name=euro sign}/
+
+to get official Unicode named characters in regular expressions. Loose
+matching is always done for these.
+
=head2 ℞ 9: Unicode named sequences
These look just like character names but return multiple codepoints.
diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index fb799a4c73..14e8c513f0 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -267,6 +267,11 @@ Similarly, they can be used in regular expression literals
$smiley =~ /\N{WHITE SMILING FACE}/;
$smiley =~ /\N{U+263a}/;
+or, starting in v5.32:
+
+ $smiley =~ /\p{Name=WHITE SMILING FACE}/;
+ $smiley =~ /\p{Name=whitesmilingface}/;
+
At run-time you can use:
use charnames ();