author Karl Williamson <khw@cpan.org> 2019-03-11 17:16:34 -0600
committer Karl Williamson <khw@cpan.org> 2019-03-12 12:06:26 -0600
commit 1532347b696561120241d1e6221c028acedff019 (patch)
tree b182ef965a2abd4f7fbfbed562cb3bb9e6a8eb50 /pod/perlunicode.pod
parent 2cd613ec5fcf3b5c85fd2752b5871f18b4d33773 (diff)
download perl-1532347b696561120241d1e6221c028acedff019.tar.gz
Add Unicode property wildcards
Diffstat (limited to 'pod/perlunicode.pod')
-rw-r--r-- pod/perlunicode.pod 146
1 files changed, 144 insertions, 2 deletions
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 955893f690..8f09a18fca 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -921,6 +921,145 @@ L<perlrecharclass/POSIX Character Classes>.
=back
+=head2 Wildcards in Property Values
+
+Starting in Perl 5.30, it is possible to do something like this:
+
+ qr!\p{numeric_value=/\A[0-5]\z/}!
+
+or, by abbreviating and adding C</x>,
+
+ qr! \p{nv= /(?x) \A [0-5] \z / }!
+
+This matches all code points whose numeric value is one of 0, 1, 2, 3,
+4, or 5. This particular example could instead have been written as
+
+ qr! \A [ \p{nv=0}\p{nv=1}\p{nv=2}\p{nv=3}\p{nv=4}\p{nv=5} ] \z !xx
+
+in earlier perls, so in this case this feature just makes things easier
+and shorter to write. If we hadn't included the C<\A> and C<\z>, these
+would have matched things like C<1E<sol>2> because that contains a 1 (as
+well as a 2). As written, it matches things like subscripts that have
+these numeric values. If we only wanted the decimal digits with those
+numeric values, we could say,
+
+ qr! (?[ \d & \p{nv=/[0-5]/} ]) !x
+
+The C<\d> gets rid of needing to anchor the pattern, since it forces the
+result to only match C<[0-9]>, and the C<[0-5]> further restricts it.
+
+The text in the above examples enclosed between the C<"E<sol>">
+characters can be just about any regular expression. It is independent
+of the main pattern, so doesn't share any capturing groups, I<etc>. The
+delimiters for it must be ASCII punctuation, but it may NOT be
+delimited by C<"{">, nor C<"}"> nor contain a literal C<"}">, as that
+delimits the end of the enclosing C<\p{}>. Like any pattern, certain
+other delimiters are terminated by their mirror images. These are
+C<"(">, C<"[">, and C<"E<lt>">. If the delimiter is any of C<"-">,
+C<"_">, C<"+">, or C<"\">, or is the same delimiter as is used for the
+enclosing pattern, it must be preceded by a backslash escape, both
+fore and aft.
+
+Beware of using C<"$"> to match the end of the string. It can too
+easily be interpreted as the start of a punctuation variable, like
+C<$/>.
+
+No modifiers may follow the final delimiter. Instead, use
+L<perlre/(?adlupimnsx-imnsx)> and/or
+L<perlre/(?adluimnsx-imnsx:pattern)> to specify modifiers.
+
+This feature is not available when the left-hand side is prefixed by
+C<Is_>, nor for any form that is marked as "Discouraged" in
+L<perluniprops/Discouraged>.
+
+Perl wraps your pattern with C<(?iaa: ... )>. This is because nothing
+outside ASCII can match the Unicode property values available in this
+release, and they should match caselessly. If your pattern has a syntax
+error, this wrapping will be shown in the error message, even though you
+didn't specify it yourself. This could be confusing if you don't know
+about this.
+
+This experimental feature has been added to begin to implement
+L<https://www.unicode.org/reports/tr18/#Wildcard_Properties>. Using it
+will raise a (default-on) warning in the
+C<experimental::uniprop_wildcards> category. We reserve the right to
+change its operation as we gain experience.
+
+Your subpattern can be just about anything, but for it to have some
+utility, it should match when called with either or both of
+a) the full name of the property value with underscores (and/or spaces
+in the Block property) and some things uppercase; or b) the property
+value in all lowercase with spaces and underscores squeezed out. For
+example,
+
+ qr!\p{Blk=/Old I.*/}!
+ qr!\p{Blk=/oldi.*/}!
+
+would match the same things.
+
+A warning is issued if none of the legal values for a property are
+matched by your pattern. It's likely that a future release will raise a
+warning if your pattern ends up causing every possible code point to
+match.
+
+Another example that shows that within C<\p{...}>, C</x> isn't needed to
+have spaces:
+
+ qr!\p{scx= /Hebrew|Greek/ }!
+
+To be safe, we should have anchored the above example, to prevent
+matches for something like C<Hebrew_Braille>, but there aren't
+any script names like that.
+
+There are certain properties that it doesn't currently work with. These
+are:
+
+ Bidi Mirroring Glyph
+ Bidi Paired Bracket
+ Case Folding
+ Decomposition Mapping
+ Equivalent Unified Ideograph
+ Name
+ Name Alias
+ Lowercase Mapping
+ NFKC Case Fold
+ Titlecase Mapping
+ Uppercase Mapping
+
+Nor is the C<@I<unicode_property>@> form implemented.
+
+Here's a complete example of matching IPv4 internet protocol addresses
+in any (single) script:
+
+ no warnings 'experimental::script_run';
+ no warnings 'experimental::regex_sets';
+ no warnings 'experimental::uniprop_wildcards';
+
+ # Can match a substring, so this intermediate regex needs to have
+ # context or anchoring in its final use. Using nt=de yields decimal
+ # digits. When specifying a subset of these, we must include \d to
+ # prevent things like U+00B2 SUPERSCRIPT TWO from matching
+ my $zero_through_255 =
+ qr/ \b (*sr: # All from same script
+ (?[ \p{nv=0} & \d ])* # Optional leading zeros
+ ( # Then one of:
+ \d{1,2} # 0 - 99
+ | (?[ \p{nv=1} & \d ]) \d{2} # 100 - 199
+ | (?[ \p{nv=2} & \d ])
+ ( (?[ \p{nv=:[0-4]:} & \d ]) \d # 200 - 249
+ | (?[ \p{nv=5} & \d ])
+ (?[ \p{nv=:[0-5]:} & \d ]) # 250 - 255
+ )
+ )
+ )
+ \b
+ /x;
+
+ my $ipv4 = qr/ \A (*sr: $zero_through_255
+ (?: [.] $zero_through_255 ) {3}
+ )
+ \z
+ /x;
=head2 User-Defined Character Properties
@@ -1220,7 +1359,7 @@ C<U+10FFFF> but also beyond C<U+10FFFF>
RL2.3 Default Word Boundaries - Done [11]
RL2.4 Default Case Conversion - Done
RL2.5 Name Properties - Done
- RL2.6 Wildcard Properties - Missing
+ RL2.6 Wildcards in Property Values - Partial [12]
RL2.7 Full Properties - Done
=over 4
@@ -1239,6 +1378,9 @@ Perl has C<\X> and C<\b{gcb}> but we don't have a "Grapheme Cluster Mode".
=item [11] see
L<UAX#29 "Unicode Text Segmentation"|http://www.unicode.org/reports/tr29>,
+=item [12] see
+L</Wildcards in Property Values> above.
+
=back
=head3 Level 3 - Tailored Support
@@ -1272,7 +1414,7 @@ portion.
Perl has user-defined properties (L</"User-Defined Character
Properties">) to look at single code points in ways beyond Unicode, and
it might be possible, though probably not very clean, to use code blocks
-and things like C<(?(DEFINE)...)> (see L<perlre> to do more specialized
+and things like C<(?(DEFINE)...)> (see L<perlre>) to do more specialized
matching.
=back