path: root/pod/perlre.pod
author Karl Williamson <khw@cpan.org> 2019-03-14 11:48:11 -0600
committer Karl Williamson <khw@cpan.org> 2019-03-14 12:18:01 -0600
commit f4e61fc03836484ea88518e8bf04cc1b32a6a1a0 (patch)
tree 54a697a00fe9ed00a15d86abb46a359b95f7407e /pod/perlre.pod
parent bfa9f5ee70ce509f0e66dcff9e9fda131ea8a133 (diff)
Any Common digit set can match in any script
This fixes a design flaw in script runs that in 5.30 effectively prevented digits from the Common script except the ASCII [0-9] from being in any meaningful script run.
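A minimal sketch of the behavior described above, assuming a perl that includes this change (5.30 or later). The helper name is_digit_run and the sample strings are illustrative and not part of the patch; no expected output is shown because the results depend on the perl version in use.

```perl
use v5.30;
use warnings;
# Script runs were still flagged experimental before 5.32.
no warnings 'experimental::script_run';

# Hypothetical helper: true if the whole string is digits forming one script run.
sub is_digit_run {
    my ($str) = @_;
    return $str =~ /\A(*script_run:\d+)\z/ ? 1 : 0;
}

my @samples = (
    [ 'ASCII digits only',           "123" ],
    [ 'full-width digits only',      "\x{FF11}\x{FF12}\x{FF13}" ],
    [ 'ASCII + full-width digits',   "1\x{FF12}" ],
    [ 'ASCII + Arabic-Indic digits', "1\x{0663}" ],
);

for my $sample (@samples) {
    my ($label, $str) = @$sample;
    printf "%-28s %s\n", $label, is_digit_run($str) ? 'script run' : 'not a script run';
}
```

With this change applied, the full-width-only string should be accepted as a script run, while both mixed strings should be rejected because their digits come from different sets of ten.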
Diffstat (limited to 'pod/perlre.pod')
-rw-r--r-- pod/perlre.pod 19
1 files changed, 8 insertions, 11 deletions
diff --git a/pod/perlre.pod b/pod/perlre.pod
index 209cac7f8d..4898f94d9f 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -2550,15 +2550,12 @@ Katakana and Hiragana are commonly mixed together in practice, along
with some Chinese characters, and hence are treated as being in a single
script run by Perl.
-The rules used for matching decimal digits are somewhat different. Many
+The rules used for matching decimal digits are slightly stricter. Many
scripts have their own sets of digits equivalent to the Western C<0>
through C<9> ones. A few, such as Arabic, have more than one set. For
a string to be considered a script run, all digits in it must come from
-the same set, as determined by the first digit encountered. The ASCII
-C<[0-9]> are accepted as being in any script, even those that have their
-own set. This is because these are often used in commerce even in such
-scripts. But any mixing of the ASCII and other digits will cause the
-sequence to not be a script run, failing the match. As an example,
+the same set of ten, as determined by the first digit encountered.
+As an example,
qr/(*script_run: \d+ \b )/x
@@ -2579,11 +2576,11 @@ accent of some type. These are considered to be in the script of the
master character, and so never cause a script run to not match.
The other one is "Common". This consists of mostly punctuation, emoji,
-and characters used in mathematics and music, and the ASCII digits C<0>
-through C<9>. These characters can appear intermixed in text in many of
-the world's scripts. These also don't cause a script run to not match,
-except any ASCII digits encountered have to obey the decimal digit rules
-described above.
+and characters used in mathematics and music, the ASCII digits C<0>
+through C<9>, and full-width forms of these digits. These characters
+can appear intermixed in text in many of the world's scripts. These
+also don't cause a script run to not match. But like other scripts, all
+digits in a run must come from the same set of 10.
This construct is non-capturing. You can add parentheses to I<pattern>
to capture, if desired. You will have to do this if you plan to use