perlre: Add some clarifying script run documentation

author: Karl Williamson <khw@cpan.org> 2018-08-16 16:48:05 -0600
committer: Karl Williamson <khw@cpan.org> 2018-08-16 16:54:39 -0600
commit: 4a1d964056983f26f5646fdb7aadb4b5e7b5235f (patch)
tree: 008483198ac51805f364cd402a0decdeb90b5d0f
parent: 8350b2740abc0cad147113487148473a9e19034b (diff)
download: perl-4a1d964056983f26f5646fdb7aadb4b5e7b5235f.tar.gz
1 files changed, 54 insertions, 4 deletions
diff --git a/pod/perlre.pod b/pod/perlre.pod
index ce557edf4d..375507a6e8 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -2517,8 +2517,9 @@ backtracking occurs until something all in the same script is found that
 matches, or all possibilities are exhausted.  This can cause a lot of
 backtracking, but generally, only malicious input will result in this,
 though the slow down could cause a denial of service attack.  If your
-needs permit, it is best to make the pattern atomic.  This is so likely
-to be what you want, that instead of writing this:
+needs permit, it is best to make the pattern atomic to cut down on the
+amount of backtracking.  This is so likely to be what you want, that
+instead of writing this:
 
  (*script_run:(?>pattern))
 
@@ -2532,7 +2533,10 @@ you can write either of these:
 In Taiwan, Japan, and Korea, it is common for text to have a mixture of
 characters from their native scripts and base Chinese.  Perl follows
 Unicode's UTS 39 (L<http://unicode.org/reports/tr39/>) Unicode Security
-Mechanisms in allowing such mixtures.
+Mechanisms in allowing such mixtures.  For example, the Japanese scripts
+Katakana and Hiragana are commonly mixed together in practice, along
+with some Chinese characters, and hence are treated as being in a single
+script run by Perl.
 
 The rules used for matching decimal digits are somewhat different.  Many
 scripts have their own sets of digits equivalent to the Western C<0>
@@ -2578,7 +2582,53 @@ This feature is experimental, and the exact syntax and details of
 operation are subject to change; using it yields a warning in the
 C<experimental::script_run> category.
 
-The C<Script_Extensions> property is used as the basis for this feature.
+The C<Script_Extensions> property as modified by UTS 39
+(L<http://unicode.org/reports/tr39/>) is used as the basis for this
+feature.
+
+To summarize,
+
+=over 4
+
+=item *
+
+All length 0 or length 1 sequences are script runs.
+
+=item *
+
+A longer sequence is a script run if and only if B<all> of the following
+conditions are met:
+
+Z<>
+
+=over
+
+=item 1
+
+No code point in the sequence has the C<Script_Extensions> property of
+C<Unknown>.
+
+This currently means that all code points in the sequence have been
+assigned by Unicode to be characters that are neither private use nor
+surrogate code points.
+
+=item 2
+
+All characters in the sequence come from the Common script and/or the
+Inherited script and/or a single other script.
+
+The script of a character is determined by the C<Script_Extensions>
+property as modified by UTS 39 (L<http://unicode.org/reports/tr39/>), as
+described above.
+
+=item 3
+
+All decimal digits in the sequence come from the same block of 10
+consecutive digits.
+
+=back
+
+=back
 
 =head2 Special Backtracking Control Verbs
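
Conditions 1 and 3 of the summary added by this patch can be sketched in a few lines of Python. This is an illustration only, not Perl's implementation, and the function names are hypothetical. It assumes that the general categories Cn, Co and Cs (unassigned, private use, surrogate) reported by Python's unicodedata module approximate a Script_Extensions value of Unknown, and it relies on Unicode assigning decimal digits in contiguous runs of ten. Condition 2 is left out because the Python standard library does not expose Script_Extensions.

```python
import unicodedata

# Assumption: these general categories stand in for Script_Extensions=Unknown.
UNKNOWN_CATEGORIES = {"Cn", "Co", "Cs"}


def has_no_unknown_code_points(text: str) -> bool:
    """Condition 1 (approximation): no unassigned, private use, or surrogate code point."""
    return all(unicodedata.category(ch) not in UNKNOWN_CATEGORIES for ch in text)


def digits_share_one_block(text: str) -> bool:
    """Condition 3: every decimal digit comes from the same run of 10 consecutive digits."""
    zeros = set()
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-decimal characters
        if value is not None:
            # Code point of the digit zero that starts this character's run.
            zeros.add(ord(ch) - value)
    return len(zeros) <= 1
```

For instance, the string "12" followed by ARABIC-INDIC DIGIT THREE (U+0663) fails the digit check because its digits come from two different runs, while a string with no digits at all passes trivially.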