author | Karl Williamson <khw@cpan.org> | 2019-03-11 17:16:34 -0600 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2019-03-12 12:06:26 -0600 |
commit | 1532347b696561120241d1e6221c028acedff019 (patch) | |
tree | b182ef965a2abd4f7fbfbed562cb3bb9e6a8eb50 /pod/perlunicode.pod | |
parent | 2cd613ec5fcf3b5c85fd2752b5871f18b4d33773 (diff) | |
download | perl-1532347b696561120241d1e6221c028acedff019.tar.gz |
Add Unicode property wildcards
Diffstat (limited to 'pod/perlunicode.pod')
-rw-r--r-- | pod/perlunicode.pod | 146 |
1 files changed, 144 insertions, 2 deletions
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 955893f690..8f09a18fca 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -921,6 +921,145 @@ L<perlrecharclass/POSIX Character Classes>.
 =back
+=head2 Wildcards in Property Values
+
+Starting in Perl 5.30, it is possible to do something like this:
+
+ qr!\p{numeric_value=/\A[0-5]\z/}!
+
+or, by abbreviating and adding C</x>,
+
+ qr! \p{nv= /(?x) \A [0-5] \z / }!
+
+This matches all code points whose numeric value is one of 0, 1, 2, 3,
+4, or 5. This particular example could instead have been written as
+
+ qr! \A [ \p{nv=0}\p{nv=1}\p{nv=2}\p{nv=3}\p{nv=4}\p{nv=5} ] \z !xx
+
+in earlier perls, so in this case this feature just makes things easier
+and shorter to write. If we hadn't included the C<\A> and C<\z>, these
+would have matched things like C<1E<sol>2> because that contains a 1 (as
+well as a 2). As written, it matches things like subscripts that have
+these numeric values. If we only wanted the decimal digits with those
+numeric values, we could say,
+
+ qr! (?[ \d & \p{nv=/[0-5]/} ]) !x
+
+The C<\d> gets rid of needing to anchor the pattern, since it forces the
+result to only match C<[0-9]>, and the C<[0-5]> further restricts it.
+
+The text in the above examples enclosed between the C<"E<sol>">
+characters can be just about any regular expression. It is independent
+of the main pattern, so doesn't share any capturing groups, I<etc>. The
+delimiters for it must be ASCII punctuation, but it may NOT be
+delimited by C<"{">, nor C<"}"> nor contain a literal C<"}">, as that
+delimits the end of the enclosing C<\p{}>. Like any pattern, certain
+other delimiters are terminated by their mirror images. These are
+C<"(">, C<"[">, and C<"E<lt>">. If the delimiter is any of C<"-">,
+C<"_">, C<"+">, or C<"\">, or is the same delimiter as is used for the
+enclosing pattern, it must be preceded by a backslash escape, both
+fore and aft.
+
+Beware of using C<"$"> to match the end of the string. It
+can too easily be interpreted as being a punctuation variable, like
+C<$/>.
+
+No modifiers may follow the final delimiter. Instead, use
+L<perlre/(?adlupimnsx-imnsx)> and/or
+L<perlre/(?adluimnsx-imnsx:pattern)> to specify modifiers.
+
+This feature is not available when the left-hand side is prefixed by
+C<Is_>, nor for any form that is marked as "Discouraged" in
+L<perluniprops/Discouraged>.
+
+Perl wraps your pattern with C<(?iaa: ... )>. This is because nothing
+outside ASCII can match the Unicode property values available in this
+release, and they should match caselessly. If your pattern has a syntax
+error, this wrapping will be shown in the error message, even though you
+didn't specify it yourself. This could be confusing if you don't know
+about this.
+
+This experimental feature has been added to begin to implement
+L<https://www.unicode.org/reports/tr18/#Wildcard_Properties>. Using it
+will raise a (default-on) warning in the
+C<experimental::uniprop_wildcards> category. We reserve the right to
+change its operation as we gain experience.
+
+Your subpattern can be just about anything, but for it to have some
+utility, it should match when called with either or both of
+a) the full name of the property value with underscores (and/or spaces
+in the Block property) and some things uppercase; or b) the property
+value in all lowercase with spaces and underscores squeezed out. For
+example,
+
+ qr!\p{Blk=/Old I.*/}!
+ qr!\p{Blk=/oldi.*/}!
+
+would match the same things.
+
+A warning is issued if none of the legal values for a property are
+matched by your pattern. It's likely that a future release will raise a
+warning if your pattern ends up causing every possible code point to
+match.
+
+Another example that shows that within C<\p{...}>, C</x> isn't needed to
+have spaces:
+
+ qr!\p{scx= /Hebrew|Greek/ }!
+
+To be safe, we should have anchored the above example, to prevent
+matches for something like C<Hebrew_Braille>, but there aren't
+any script names like that.
+
+There are certain properties that it doesn't currently work with. These
+are:
+
+ Bidi Mirroring Glyph
+ Bidi Paired Bracket
+ Case Folding
+ Decomposition Mapping
+ Equivalent Unified Ideograph
+ Name
+ Name Alias
+ Lowercase Mapping
+ NFKC Case Fold
+ Titlecase Mapping
+ Uppercase Mapping
+
+Nor is the C<@I<unicode_property>@> form implemented.
+
+Here's a complete example of matching IPv4 internet protocol addresses
+in any (single) script:
+
+ no warnings 'experimental::script_run';
+ no warnings 'experimental::regex_sets';
+ no warnings 'experimental::uniprop_wildcards';
+
+ # Can match a substring, so this intermediate regex needs to have
+ # context or anchoring in its final use. Using nt=de yields decimal
+ # digits. When specifying a subset of these, we must include \d to
+ # prevent things like U+00B2 SUPERSCRIPT TWO from matching
+ my $zero_through_255 =
+   qr/ \b (*sr:                                   # All from same script
+           (?[ \p{nv=0} & \d ])*                  # Optional leading zeros
+           (                                      # Then one of:
+               \d{1,2}                            #   0 - 99
+             | (?[ \p{nv=1} & \d ]) \d{2}         #   100 - 199
+             | (?[ \p{nv=2} & \d ])
+               (   (?[ \p{nv=:[0-4]:} & \d ]) \d  #   200 - 249
+                 | (?[ \p{nv=5} & \d ])
+                   (?[ \p{nv=:[0-5]:} & \d ])     #   250 - 255
+               )
+           )
+         )
+       \b
+     /x;
+
+ my $ipv4 = qr/ \A (*sr: $zero_through_255
+                    (?: [.] $zero_through_255 ) {3}
+                )
+             \z
+           /x;
 =head2 User-Defined Character Properties
@@ -1220,7 +1359,7 @@ C<U+10FFFF> but also beyond C<U+10FFFF>
  RL2.3 Default Word Boundaries - Done [11]
  RL2.4 Default Case Conversion - Done
  RL2.5 Name Properties - Done
- RL2.6 Wildcard Properties - Missing
+ RL2.6 Wildcards in Property Values - Partial [12]
  RL2.7 Full Properties - Done
 =over 4
@@ -1239,6 +1378,9 @@ Perl has C<\X> and C<\b{gcb}> but we don't have a "Grapheme Cluster Mode".
 =item [11] see
 L<UAX#29 "Unicode Text Segmentation"|http://www.unicode.org/reports/tr29>,
+=item [12] see
+L</Wildcards in Property Values> above.
+
 =back
 =head3 Level 3 - Tailored Support
@@ -1272,7 +1414,7 @@ portion.
 Perl has user-defined properties (L</"User-Defined Character
 Properties">) to look at single code points in ways beyond Unicode, and
 it might be possible, though probably not very clean, to use code blocks
-and things like C<(?(DEFINE)...)> (see L<perlre> to do more specialized
+and things like C<(?(DEFINE)...)> (see L<perlre>) to do more specialized
 matching.
 =back
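The anchoring behaviour described in the new "Wildcards in Property Values" section can be tried with a minimal sketch like the one below. It assumes Perl 5.30 or later, where the feature and the `experimental::uniprop_wildcards` warning category exist. The helper name `has_small_numeric_value` is hypothetical and is not part of the patch. The pattern is the anchored `nv` wildcard from the documentation, additionally anchored on the outside so that exactly one character is tested.

```perl
use strict;
use warnings;
use v5.30;    # property wildcards were added in Perl 5.30

# The feature is experimental and warns by default.
no warnings 'experimental::uniprop_wildcards';

# Hypothetical helper (not part of the patch): returns 1 if the string is
# a single character whose Unicode numeric value is 0 through 5, else 0.
sub has_small_numeric_value {
    my ($char) = @_;

    # The inner \A and \z anchor the wildcard against the property value,
    # so a value such as "1/2" is not accepted just because it contains a 1.
    return $char =~ qr!\A\p{nv=/\A[0-5]\z/}\z! ? 1 : 0;
}

# Try an ASCII digit in range, one out of range, SUBSCRIPT TWO (U+2082),
# and VULGAR FRACTION ONE HALF (U+00BD).
for my $char ("3", "7", "\x{2082}", "\x{BD}") {
    printf "U+%04X: %s\n", ord($char),
        has_small_numeric_value($char) ? "match" : "no match";
}
```

Intersecting with `\d` inside `(?[ ... ])`, as the patch's own examples do, would narrow this further to decimal digits only.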