author | Karl Williamson <khw@cpan.org> | 2019-03-14 11:48:11 -0600 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2019-03-14 12:18:01 -0600 |
commit | f4e61fc03836484ea88518e8bf04cc1b32a6a1a0 (patch) | |
tree | 54a697a00fe9ed00a15d86abb46a359b95f7407e /pod/perlre.pod | |
parent | bfa9f5ee70ce509f0e66dcff9e9fda131ea8a133 (diff) | |
download | perl-f4e61fc03836484ea88518e8bf04cc1b32a6a1a0.tar.gz |
Any Common digit set can match in any script
This fixes a design flaw in script runs that in 5.30 effectively
prevented digits from the Common script except the ASCII [0-9] from
being in any meaningful script run.
Diffstat (limited to 'pod/perlre.pod')
-rw-r--r-- | pod/perlre.pod | 19 |
1 files changed, 8 insertions, 11 deletions
diff --git a/pod/perlre.pod b/pod/perlre.pod
index 209cac7f8d..4898f94d9f 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -2550,15 +2550,12 @@ Katakana and Hiragana are commonly mixed together in practice, along
 with some Chinese characters, and hence are treated as being in a single
 script run by Perl.

-The rules used for matching decimal digits are somewhat different. Many
+The rules used for matching decimal digits are slightly stricter. Many
 scripts have their own sets of digits equivalent to the Western C<0>
 through C<9> ones. A few, such as Arabic, have more than one set. For
 a string to be considered a script run, all digits in it must come from
-the same set, as determined by the first digit encountered. The ASCII
-C<[0-9]> are accepted as being in any script, even those that have their
-own set. This is because these are often used in commerce even in such
-scripts. But any mixing of the ASCII and other digits will cause the
-sequence to not be a script run, failing the match. As an example,
+the same set of ten, as determined by the first digit encountered.
+As an example,

  qr/(*script_run: \d+ \b )/x

@@ -2579,11 +2576,11 @@ accent of some type. These are considered to be in the script of the
 master character, and so never cause a script run to not match.

 The other one is "Common". This consists of mostly punctuation, emoji,
-and characters used in mathematics and music, and the ASCII digits C<0>
-through C<9>. These characters can appear intermixed in text in many of
-the world's scripts. These also don't cause a script run to not match,
-except any ASCII digits encountered have to obey the decimal digit rules
-described above.
+and characters used in mathematics and music, the ASCII digits C<0>
+through C<9>, and full-width forms of these digits. These characters
+can appear intermixed in text in many of the world's scripts. These
+also don't cause a script run to not match. But like other scripts, all
+digits in a run must come from the same set of 10.

 This construct is non-capturing. You can add parentheses to I<pattern>
 to capture, if desired. You will have to do this if you plan to use
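
The digit rule stated in the new text (all digits in a run must come from the same set of ten, as determined by the first digit encountered) can be illustrated outside Perl. The following is a minimal Python sketch, not Perl's implementation: it checks only the digit-set rule and ignores every other script-run requirement. The helper names `digit_set_zero` and `digits_from_one_set` are hypothetical, and the sketch assumes that each Unicode decimal digit set occupies ten consecutive code points, so a digit's set can be identified by the code point of its zero.

```python
import unicodedata


def digit_set_zero(ch):
    """Return the code point of the zero in ch's decimal digit set.

    Returns None if ch is not a decimal digit. Assumes each set of ten
    decimal digits is contiguous, so (code point - digit value) is the
    code point of that set's zero.
    """
    value = unicodedata.decimal(ch, None)
    if value is None:
        return None
    return ord(ch) - value


def digits_from_one_set(text):
    """Check only the digit rule: every decimal digit in text must belong
    to the same set of ten as the first digit encountered."""
    first_zero = None
    for ch in text:
        zero = digit_set_zero(ch)
        if zero is None:
            continue  # non-digits are outside the scope of this sketch
        if first_zero is None:
            first_zero = zero  # the first digit fixes the set
        elif zero != first_zero:
            return False
    return True


if __name__ == "__main__":
    samples = {
        "ASCII only": "2019",
        "Arabic-Indic only": "\u0662\u0660\u0661\u0669",
        "full-width only": "\uff12\uff10\uff11\uff19",
        "ASCII mixed with full-width": "20\uff11\uff19",
        "two Arabic sets mixed": "\u0662\u06f0",
    }
    for label, text in samples.items():
        print(f"{label}: {digits_from_one_set(text)}")
```

Under this rule, ASCII digits and their full-width forms each count as a separate set of ten, so a string mixing the two fails the check even though both sets belong to the Common script.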