From 4a1d964056983f26f5646fdb7aadb4b5e7b5235f Mon Sep 17 00:00:00 2001
From: Karl Williamson
Date: Thu, 16 Aug 2018 16:48:05 -0600
Subject: perlre: Add some clarifying script run documentation

---
 pod/perlre.pod | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 54 insertions(+), 4 deletions(-)

(limited to 'pod')

diff --git a/pod/perlre.pod b/pod/perlre.pod
index ce557edf4d..375507a6e8 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -2517,8 +2517,9 @@ backtracking occurs until something all in the same script is found that
 matches, or all possibilities are exhausted. This can cause a lot of
 backtracking, but generally, only malicious input will result in this,
 though the slow down could cause a denial of service attack. If your
-needs permit, it is best to make the pattern atomic. This is so likely
-to be what you want, that instead of writing this:
+needs permit, it is best to make the pattern atomic to cut down on the
+amount of backtracking. This is so likely to be what you want, that
+instead of writing this:
 
  (*script_run:(?>pattern))
 
@@ -2532,7 +2533,10 @@ you can write either of these:
 In Taiwan, Japan, and Korea, it is common for text to have a mixture of
 characters from their native scripts and base Chinese. Perl follows
 Unicode's UTS 39 (L) Unicode Security
-Mechanisms in allowing such mixtures.
+Mechanisms in allowing such mixtures. For example, the Japanese scripts
+Katakana and Hiragana are commonly mixed together in practice, along
+with some Chinese characters, and hence are treated as being in a single
+script run by Perl.
 
 The rules used for matching decimal digits are somewhat different. Many
 scripts have their own sets of digits equivalent to the Western C<0>
@@ -2578,7 +2582,53 @@ This feature is experimental, and the exact syntax and details of
 operation are subject to change; using it yields a warning in the
 C category.
 
-The C property is used as the basis for this feature.
+The C property as modified by UTS 39
+(L) is used as the basis for this
+feature.
+
+To summarize,
+
+=over 4
+
+=item *
+
+All length 0 or length 1 sequences are script runs.
+
+=item *
+
+A longer sequence is a script run if and only if B of the following
+conditions are met:
+
+Z<>
+
+=over
+
+=item 1
+
+No code point in the sequence has the C property of
+C.
+
+This currently means that all code points the sequence have been
+assigned by Unicode to be characters that aren't private use nor
+surrogate code points.
+
+=item 2
+
+All characters in the sequence come from the Common script and/or the
+Inherited script and/or a single other script.
+
+The script of a character is determined by the C
+property as modified by UTS 39 (L), as
+described above.
+
+=item 3
+
+All decimal digits in the sequence come from the same block of 10
+consecutive digits.
+
+=back
+
+=back
 
 =head2 Special Backtracking Control Verbs
 
-- 
cgit v1.2.1
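
The conditions summarized in the patch can be tried out with a short script. What follows is only a sketch, not part of the commit: it assumes a perl of 5.28 or later (where the construct is available), and the sample strings and labels are hypothetical, picked to touch the single-script rule, the Japanese mixture mentioned in the second hunk, and the decimal digit rule.

```perl
use v5.28;    # assumption: script runs need Perl 5.28 or later; also enables strict and say

# Silences the warning mentioned in the patch; assumed to be harmless on
# perls where the feature is no longer experimental.
no warnings 'experimental::script_run';

# Hypothetical samples, for illustration only.
my @samples = (
    [ 'Latin only'                    => "paypal" ],
    [ 'Latin plus Cyrillic U+0430'    => "p\x{430}ypal" ],
    [ 'Katakana, Hiragana, Han'       => "\x{30AB}\x{304B}\x{6F22}" ],
    [ 'ASCII and Arabic-Indic digits' => "1\x{661}" ],
);

for my $sample (@samples) {
    my ($label, $text) = @$sample;

    # The \A and \z anchors force the whole string to be matched, so the
    # pattern can only succeed if the entire string is one script run.
    my $verdict = $text =~ /\A(*script_run:\w+)\z/ ? 'script run' : 'not a script run';

    say "$label: $verdict";
}
```

Going by the rules in the patch, one would expect the Latin-only and Japanese samples to be reported as script runs and the other two not, but the actual result depends on the Unicode data shipped with the perl that runs it. Because the anchors leave no shorter match to fall back on, the backtracking concern raised in the first hunk does not come into play here.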