author | Karl Williamson <khw@cpan.org> | 2018-08-16 16:48:05 -0600 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2018-08-16 16:54:39 -0600 |
commit | 4a1d964056983f26f5646fdb7aadb4b5e7b5235f (patch) | |
tree | 008483198ac51805f364cd402a0decdeb90b5d0f | |
parent | 8350b2740abc0cad147113487148473a9e19034b (diff) | |
download | perl-4a1d964056983f26f5646fdb7aadb4b5e7b5235f.tar.gz |
perlre: Add some clarifying script run documentation
-rw-r--r-- | pod/perlre.pod | 58 |
1 files changed, 54 insertions, 4 deletions
diff --git a/pod/perlre.pod b/pod/perlre.pod
index ce557edf4d..375507a6e8 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -2517,8 +2517,9 @@ backtracking occurs until something all in the same script is found that
 matches, or all possibilities are exhausted. This can cause a lot of
 backtracking, but generally, only malicious input will result in this,
 though the slow down could cause a denial of service attack. If your
-needs permit, it is best to make the pattern atomic. This is so likely
-to be what you want, that instead of writing this:
+needs permit, it is best to make the pattern atomic to cut down on the
+amount of backtracking. This is so likely to be what you want, that
+instead of writing this:
 
  (*script_run:(?>pattern))
 
@@ -2532,7 +2533,10 @@ you can write either of these:
 In Taiwan, Japan, and Korea, it is common for text to have a mixture of
 characters from their native scripts and base Chinese. Perl follows
 Unicode's UTS 39 (L<http://unicode.org/reports/tr39/>) Unicode Security
-Mechanisms in allowing such mixtures.
+Mechanisms in allowing such mixtures. For example, the Japanese scripts
+Katakana and Hiragana are commonly mixed together in practice, along
+with some Chinese characters, and hence are treated as being in a single
+script run by Perl.
 
 The rules used for matching decimal digits are somewhat different. Many
 scripts have their own sets of digits equivalent to the Western C<0>
@@ -2578,7 +2582,53 @@ This feature is experimental, and the exact syntax and details of
 operation are subject to change; using it yields a warning in the
 C<experimental::script_run> category.
 
-The C<Script_Extensions> property is used as the basis for this feature.
+The C<Script_Extensions> property as modified by UTS 39
+(L<http://unicode.org/reports/tr39/>) is used as the basis for this
+feature.
+
+To summarize,
+
+=over 4
+
+=item *
+
+All length 0 or length 1 sequences are script runs.
+
+=item *
+
+A longer sequence is a script run if and only if B<all> of the following
+conditions are met:
+
+Z<>
+
+=over
+
+=item 1
+
+No code point in the sequence has the C<Script_Extension> property of
+C<Unknown>.
+
+This currently means that all code points the sequence have been
+assigned by Unicode to be characters that aren't private use nor
+surrogate code points.
+
+=item 2
+
+All characters in the sequence come from the Common script and/or the
+Inherited script and/or a single other script.
+
+The script of a character is determined by the C<Script_Extensions>
+property as modified by UTS 39 (L<http://unicode.org/reports/tr39/>), as
+described above.
+
+=item 3
+
+All decimal digits in the sequence come from the same block of 10
+consecutive digits.
+
+=back
+
+=back
 
 =head2 Special Backtracking Control Verbs
 
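
The rules summarized in the patch can be tried out with a short script. What follows is a minimal sketch, not part of the commit: the helper name `is_script_run` and the sample strings are made up for illustration, and it assumes Perl 5.28 or later, where `(*asr:...)` (the atomic script run form) is available. Going by the summary, the Latin/Cyrillic mixture should fail condition 2, the string mixing ASCII and Arabic-Indic digits should fail condition 3, and the Hiragana/Katakana/Han mixture should be accepted.

```perl
use v5.28;                               # (*sr:...) and (*asr:...) need Perl 5.28 or later
use warnings;
no warnings 'experimental::script_run';  # the warning category named in the documentation

# Hypothetical helper (not part of Perl): true if the entire string consists
# of word characters that together form a single script run.
sub is_script_run {
    my ($str) = @_;
    # The atomic form avoids backtracking into \w+ when the run check fails.
    return $str =~ /\A(*asr:\w+)\z/ ? 1 : 0;
}

# Illustrative inputs only; labels describe what each string contains.
my @samples = (
    [ 'Latin only'                  => "paypal" ],
    [ 'Latin with Cyrillic a'       => "p\x{430}ypal" ],
    [ 'Hiragana + Katakana + Han'   => "\x{3042}\x{30A2}\x{6F22}" ],
    [ 'ASCII digits'                => "2018" ],
    [ 'ASCII + Arabic-Indic digits' => "2\x{660}18" ],
);

for my $sample (@samples) {
    my ($label, $str) = @$sample;
    printf "%-30s %s\n", $label,
        is_script_run($str) ? 'script run' : 'not a script run';
}
```

Only the labels are printed, so the script does not depend on the terminal's encoding. Anchoring with `\A` and `\z` makes the check apply to the whole string; without the anchors the regex engine would be free to find a shorter substring that is a script run.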