author Karl Williamson <khw@cpan.org> 2019-03-14 11:48:11 -0600
committer Karl Williamson <khw@cpan.org> 2019-03-14 12:18:01 -0600
commit f4e61fc03836484ea88518e8bf04cc1b32a6a1a0 (patch)
tree 54a697a00fe9ed00a15d86abb46a359b95f7407e
parent bfa9f5ee70ce509f0e66dcff9e9fda131ea8a133 (diff)
download perl-f4e61fc03836484ea88518e8bf04cc1b32a6a1a0.tar.gz
Any Common digit set can match in any script
This fixes a design flaw in script runs that in 5.30 effectively prevented digits from the Common script except the ASCII [0-9] from being in any meaningful script run.
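
The digit rule this commit relies on (every digit in a run must come from the same set of ten, fixed by the first digit seen) can be illustrated with a small standalone sketch. This is not the code in regexec.c: the table `digit_zeros` and the helpers `zero_of_digit` and `digits_from_one_set` are hypothetical names made up for illustration, the table lists only a handful of digit sets, and the script check that `Perl_isSCRIPT_RUN` also performs is left out entirely. Only the variable names `zero_of_run` and `zero_of_char` are borrowed from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, deliberately incomplete table: the code point of DIGIT ZERO
 * for a few sets of ten consecutive decimal digits. */
static const uint32_t digit_zeros[] = {
    0x0030,  /* ASCII '0'..'9' (Common) */
    0x0660,  /* ARABIC-INDIC DIGIT ZERO..NINE */
    0x06F0,  /* EXTENDED ARABIC-INDIC DIGIT ZERO..NINE */
    0xFF10,  /* FULLWIDTH DIGIT ZERO..NINE (Common) */
    0x1D7CE  /* MATHEMATICAL BOLD DIGIT ZERO..NINE (Common) */
};

/* Return the zero of the digit set containing cp, or 0 if cp is not a digit
 * in the table above. */
static uint32_t zero_of_digit(uint32_t cp)
{
    size_t i;

    for (i = 0; i < sizeof digit_zeros / sizeof digit_zeros[0]; i++) {
        if (cp >= digit_zeros[i] && cp <= digit_zeros[i] + 9) {
            return digit_zeros[i];
        }
    }
    return 0;
}

/* Return true if every digit in cps[0..len-1] comes from the same set of ten
 * as the first digit encountered.  Non-digits are ignored here; the real
 * function additionally checks the script of every character. */
static bool digits_from_one_set(const uint32_t *cps, size_t len)
{
    uint32_t zero_of_run = 0;   /* 0 means no digit seen yet */
    size_t i;

    for (i = 0; i < len; i++) {
        uint32_t zero_of_char = zero_of_digit(cps[i]);

        if (zero_of_char == 0) {
            continue;                       /* not a digit */
        }
        if (zero_of_run == 0) {
            zero_of_run = zero_of_char;     /* first digit fixes the set */
        }
        else if (zero_of_char != zero_of_run) {
            return false;                   /* digits from two sets */
        }
    }
    return true;
}
```

Under this sketch, the sequence `A`, U+FF10, U+FF19, `B` passes because both digits are fullwidth, while `A`, `1`, U+FF10 fails because it mixes ASCII and fullwidth digits. What the commit changes is the other half of the test: a run whose digits all come from one non-ASCII Common set is no longer rejected merely because the letters around it belong to another script.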
-rw-r--r-- pod/perldelta.pod 19
-rw-r--r-- pod/perlre.pod 19
-rw-r--r-- regexec.c 39
-rw-r--r-- t/re/script_run.t 19
4 files changed, 56 insertions, 40 deletions
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index 06ae872679..68f4ba9fac 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -91,6 +91,20 @@ It croaks if it would otherwise return a UTF-8 string that contains
malformed UTF-8. This protects agains potential security threats. This
is considered a bug fix as well ([perl #131642]).
+=head2 Any set of digits in the Common script are legal in a script run
+of another script
+
+There are several sets of digits in the Common script. C<[0-9]> is the
+most familiar. But there are also C<[\x{FF10}-\x{FF19}]> (FULLWIDTH
+DIGIT ZERO - FULLWIDTH DIGIT NINE), and several sets for use in
+mathematical notation, such as the MATHEMATICAL DOUBLE-STRUCK DIGITs.
+Any of these sets should be able to appear in script runs of, say,
+Greek. But the design of 5.30 overlooked all but the ASCII digits
+C<[0-9]>, so the design was flawed. This has been fixed, so is both a
+bug fix and an incompatibility. [perl #133547]
+
+All digits in a run still have to come from the same set of ten digits.
+
=head1 Deprecations
XXX Any deprecated features, syntax, modules etc. should be listed here.
@@ -430,6 +444,11 @@ C<pack()> no longer can return malformed UTF-8. It croaks if it would
otherwise return a UTF-8 string that contains malformed UTF-8. This
protects agains potential security threats. [perl #131642]
+=item *
+
+See L</Any set of digits in the Common script are legal in a script run
+of another script>.
+
=back
=head1 Known Problems
diff --git a/pod/perlre.pod b/pod/perlre.pod
index 209cac7f8d..4898f94d9f 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -2550,15 +2550,12 @@ Katakana and Hiragana are commonly mixed together in practice, along
with some Chinese characters, and hence are treated as being in a single
script run by Perl.
-The rules used for matching decimal digits are somewhat different. Many
+The rules used for matching decimal digits are slightly stricter. Many
scripts have their own sets of digits equivalent to the Western C<0>
through C<9> ones. A few, such as Arabic, have more than one set. For
a string to be considered a script run, all digits in it must come from
-the same set, as determined by the first digit encountered. The ASCII
-C<[0-9]> are accepted as being in any script, even those that have their
-own set. This is because these are often used in commerce even in such
-scripts. But any mixing of the ASCII and other digits will cause the
-sequence to not be a script run, failing the match. As an example,
+the same set of ten, as determined by the first digit encountered.
+As an example,
qr/(*script_run: \d+ \b )/x
@@ -2579,11 +2576,11 @@ accent of some type. These are considered to be in the script of the
master character, and so never cause a script run to not match.
The other one is "Common". This consists of mostly punctuation, emoji,
-and characters used in mathematics and music, and the ASCII digits C<0>
-through C<9>. These characters can appear intermixed in text in many of
-the world's scripts. These also don't cause a script run to not match,
-except any ASCII digits encountered have to obey the decimal digit rules
-described above.
+and characters used in mathematics and music, the ASCII digits C<0>
+through C<9>, and full-width forms of these digits. These characters
+can appear intermixed in text in many of the world's scripts. These
+also don't cause a script run to not match. But like other scripts, all
+digits in a run must come from the same set of 10.
This construct is non-capturing. You can add parentheses to I<pattern>
to capture, if desired. You will have to do this if you plan to use
diff --git a/regexec.c b/regexec.c
index 64a65462b5..dff221a99c 100644
--- a/regexec.c
+++ b/regexec.c
@@ -10252,11 +10252,13 @@ Additionally all decimal digits must come from the same consecutive sequence of
For example, if all the characters in the sequence are Greek, or Common, or
Inherited, this function will return TRUE, provided any decimal digits in it
-are the ASCII digits "0".."9". For scripts (unlike Greek) that have their own
-digits defined this will accept either digits from that set or from 0..9, but
-not a combination of the two. Some scripts, such as Arabic, have more than one
-set of digits. All digits must come from the same set for this function to
-return TRUE.
+are from the same block of digits in Common. (These are the ASCII digits
+"0".."9" and additionally a block for full width forms of these, and several
+others used in mathematical notation.) For scripts (unlike Greek) that have
+their own digits defined this will accept either digits from that set or from
+one of the Common digit sets, but not a combination of the two. Some scripts,
+such as Arabic, have more than one set of digits. All digits must come from
+the same set for this function to return TRUE.
C<*ret_script>, if C<ret_script> is not NULL, will on return of TRUE
contain the script found, using the C<SCX_enum> typedef. Its value will be
@@ -10359,10 +10361,9 @@ Perl_isSCRIPT_RUN(pTHX_ const U8 * s, const U8 * send, const bool utf8_target)
UV cp;
/* The code allows all scripts to use the ASCII digits. This is
- * because they are used in commerce even in scripts that have their
- * own set. Hence any ASCII ones found are ok, unless and until a
- * digit from another set has already been encountered. (The other
- * digit ranges in Common are not similarly blessed) */
+ * because they are in the Common script. Hence any ASCII ones found
+ * are ok, unless and until a digit from another set has already been
+ * encountered. digit ranges in Common are not similarly blessed) */
if (UNLIKELY(isDIGIT(*s))) {
if (UNLIKELY(script_of_run == SCX_Unknown)) {
retval = FALSE;
@@ -10456,19 +10457,11 @@ Perl_isSCRIPT_RUN(pTHX_ const U8 * s, const U8 * send, const bool utf8_target)
/* If the run so far is Common, and the new character isn't, change the
* run's script to that of this character */
if (script_of_run == SCX_Common && script_of_char != SCX_Common) {
-
- /* But Common contains several sets of digits. Only the '0' set
- * can be part of another script. */
- if (zero_of_run && zero_of_run != '0') {
- retval = FALSE;
- break;
- }
-
script_of_run = script_of_char;
}
- /* Now we can see if the script of the character is the same as that of
- * the run */
+ /* Now we can see if the script of the new character is the same as
+ * that of the run */
if (LIKELY(script_of_char == script_of_run)) {
/* By far the most common case */
goto scripts_match;
@@ -10668,14 +10661,6 @@ Perl_isSCRIPT_RUN(pTHX_ const U8 * s, const U8 * send, const bool utf8_target)
break;
}
}
- else if (script_of_char == SCX_Common && script_of_run != SCX_Common) {
-
- /* Here, the script run isn't Common, but the current digit is in
- * Common, and isn't '0'-'9' (those were handled earlier). Only
- * '0'-'9' are acceptable in non-Common scripts. */
- retval = FALSE;
- break;
- }
else { /* Otherwise we now have a zero for this run */
zero_of_run = zero_of_char;
}
diff --git a/t/re/script_run.t b/t/re/script_run.t
index 035a9104aa..19d4e10e53 100644
--- a/t/re/script_run.t
+++ b/t/re/script_run.t
@@ -51,8 +51,8 @@ foreach my $type ('script_run', 'sr', 'atomic_script_run', 'asr') {
unlike("\N{HEBREW LETTER ALEF}\N{HEBREW LETTER TAV}\N{MODIFIER LETTER SMALL Y}", $script_run, "Hebrew then Latin isn't a script run");
like("9876543210\N{DESERET SMALL LETTER WU}", $script_run, "0-9 are the digits for Deseret");
like("\N{DESERET SMALL LETTER WU}9876543210", $script_run, "Also when they aren't in the initial position");
- unlike("\N{DESERET SMALL LETTER WU}\N{FULLWIDTH DIGIT FIVE}", $script_run, "Fullwidth digits aren't the digits for Deseret");
- unlike("\N{FULLWIDTH DIGIT SIX}\N{DESERET SMALL LETTER LONG I}", $script_run, "... likewise if the digits come first");
+ like("\N{DESERET SMALL LETTER WU}\N{FULLWIDTH DIGIT FIVE}", $script_run, "Fullwidth digits may be digits for Deseret");
+ like("\N{FULLWIDTH DIGIT SIX}\N{DESERET SMALL LETTER LONG I}", $script_run, "... likewise if the digits come first");
like("1234567890\N{ARABIC LETTER ALEF}", $script_run, "[0-9] work for Arabic");
unlike("1234567890\N{ARABIC LETTER ALEF}\N{ARABIC-INDIC DIGIT FOUR}\N{ARABIC-INDIC DIGIT FIVE}", $script_run, "... but not in combination with real ARABIC digits");
@@ -104,4 +104,19 @@ foreach my $type ('script_run', 'sr', 'atomic_script_run', 'asr') {
like("\x{3041}12\x{3041}", qr/^(*sr:.{4})/,
"Script without own zero works with ASCII digits");
+ like("A\x{ff10}\x{ff19}B", qr/^(*sr:.{4})/,
+ "Non-ASCII Common digits work with Latin"); # perl #133547
+ like("A\x{ff10}BC", qr/^(*sr:.{4})/,
+ "Non-ASCII Common digits work with Latin"); # perl #133547
+ like("A\x{1d7ce}\x{1d7cf}B", qr/^(*sr:.{4})/,
+ "Non-ASCII Common digits work with Latin"); # perl #133547
+ like("A\x{1d7ce}BC", qr/^(*sr:.{4})/,
+ "Non-ASCII Common digits work with Latin"); # perl #133547
+ like("\x{1d7ce}\x{1d7cf}AB", qr/^(*sr:.{4})/,
+ "Non-ASCII Common digits work with Latin"); # perl #133547
+ like("α\x{1d7ce}βγ", qr/^(*sr:.{4})/,
+ "Non-ASCII Common digits work with Greek"); # perl #133547
+ like("\x{1d7ce}αβγ", qr/^(*sr:.{4})/,
+ "Non-ASCII Common digits work with Greek"); # perl #133547
+
done_testing();