Add script_run regex feature

As explained in the docs, this helps detect spoofing attacks.
author: Karl Williamson <khw@cpan.org> 2017-12-23 23:15:37 -0700
committer: Karl Williamson <khw@cpan.org> 2017-12-24 17:20:45 -0700
commit: 034602eb0483306842c572ced4105398f6c38595 (patch)
tree: 4c92777d3695c39b3d181bb6d3b1393a43364556 /pod/perlre.pod
parent: 07093db4de0c6dc92b6c401b33af53b861e41ea2 (diff)
download: perl-034602eb0483306842c572ced4105398f6c38595.tar.gz
1 files changed, 90 insertions, 0 deletions
diff --git a/pod/perlre.pod b/pod/perlre.pod
index f902ea9ceb..7a1d405c79 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -706,6 +706,10 @@ the pattern uses a Unicode break (C<\b{...}> or C<\B{...}>); or
 
 the pattern uses L</C<(?[ ])>>
 
+=item 8
+
+the pattern uses L<C<(+script_run: ...)>|/Script Runs>
+
 =back
 
 Another mnemonic for this modifier is "Depends", as the rules actually
@@ -2412,6 +2416,92 @@ whether they match is considered relevant.  For an example
 where side-effects of lookahead I<might> have influenced the
 following match, see L</C<< (?>pattern) >>>.
 
+=head2 Script Runs
+
+A script run is basically a sequence of characters, all from the same
+Unicode script (see L<perlunicode/Scripts>), such as Latin or Greek.  In
+most places a single word would never be written in multiple scripts,
+unless it is a spoofing attack.  An infamous example is
+
+ paypal.com
+
+Those letters could all be Latin (as in the example just above), or they
+could be all Cyrillic (except for the dot), or they could be a mixture
+of the two.  In the case of an internet address the C<.com> would be in
+Latin, and any Cyrillic letters would cause it to be a mixture, not a
+script run.  Someone clicking on such a link would be directed not to
+the real PayPal website, but to a look-alike one crafted by an attacker
+to gather sensitive information from the person.
+
+Starting in Perl 5.28, it is now easy to detect strings that aren't
+script runs.  Simply enclose just about any pattern like this:
+
+ (+script_run:pattern)
+
+What happens is that after I<pattern> succeeds in matching, it is
+subjected to the additional criterion that every character in it must be
+from the same script (see exceptions below).  If this isn't true,
+backtracking occurs until something all in the same script is found that
+matches, or all possibilities are exhausted.  This can cause a lot of
+backtracking.  Generally only malicious input triggers it, but the
+resulting slowdown could enable a denial-of-service attack.  If your
+needs permit, make the pattern atomic to cut down on the backtracking:
+
+ (+script_run:(?>pattern))
+
+(See L</C<(?E<gt>pattern)>>.)
+
+In Taiwan, Japan, and Korea, it is common for text to have a mixture of
+characters from their native scripts and base Chinese.  Perl follows
+Unicode's UTS 39 (L<http://unicode.org/reports/tr39/>) Unicode Security
+Mechanisms in allowing such mixtures.
+
+The rules used for matching decimal digits are somewhat different.  Many
+scripts have their own sets of digits equivalent to the Western C<0>
+through C<9> ones.  A few, such as Arabic, have more than one set.  For
+a string to be considered a script run, all digits in it must come from
+the same set, as determined by the first digit encountered. The ASCII
+C<[0-9]> are accepted as being in any script, even those that have their
+own set.  This is because these are often used in commerce even in such
+scripts.  But any mixing of the ASCII and other digits will cause the
+sequence to not be a script run, failing the match.  As an example,
+
+ qr/(+script_run: \d+ \b )/x
+
+guarantees that the digits matched will all be from the same set of 10.
+You won't get a look-alike digit from a different script that has a
+different value than what it appears to be.
+
+Unicode has three pseudo scripts that are handled specially.
+
+"Unknown" is applied to code points whose meaning has yet to be
+determined.  Perl currently treats any single-character string
+consisting of one of these code points as a script run.  But any string
+longer than one code point that contains one of them is not considered
+a script run.
+
+"Inherited" is applied to characters that modify another, such as an
+accent of some type.  These are considered to be in the script of the
+master character, and so never cause a script run to not match.
+
+The other one is "Common".  This consists mostly of punctuation, emoji,
+and characters used in mathematics and music, and the ASCII digits C<0>
+through C<9>.  These characters can appear intermixed in text in many of
+the world's scripts.  These also don't cause a script run to not match,
+except any ASCII digits encountered have to obey the decimal digit rules
+described above.
+
+This construct is non-capturing.  You can add parentheses to I<pattern>
+to capture, if desired.  You will have to do this if you plan to use
+L</(*ACCEPT) (*ACCEPT:arg)> and not have it bypass the script run
+checking.
+
+This feature is experimental, and the exact syntax and details of
+operation are subject to change; using it yields a warning in the
+C<experimental::script_run> category.
+
+The C<Script_Extensions> property is used as the basis for this feature.
+
 =head2 Special Backtracking Control Verbs
 
 These special patterns are generally of the form C<(*I<VERB>:I<ARG>)>. Unless
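
The decimal-digit rule described in the Script Runs section added by this patch can be illustrated outside of Perl. The following is a minimal Python sketch, not Perl's implementation. The function names are hypothetical, and it relies on the assumption that each set of ten decimal digits occupies ten consecutive code points, so subtracting a digit's value from its code point identifies the set it belongs to. It checks only the digit rule, not the script of the other characters.

```python
import unicodedata


def digit_set_zero(ch):
    """Return the code point of the zero in ch's decimal-digit set.

    Returns None when ch is not a decimal digit. (Hypothetical helper,
    not part of Perl or of the patch.)
    """
    value = unicodedata.decimal(ch, None)
    if value is None:
        return None
    # Assumption: the ten digits of a set are contiguous, so the zero
    # of the set sits exactly `value` code points below ch.
    return ord(ch) - value


def digits_from_one_set(text):
    """True if every decimal digit in text comes from the same set of ten."""
    zeros = set()
    for ch in text:
        zero = digit_set_zero(ch)
        if zero is not None:
            zeros.add(zero)
    # No digits at all, or digits from a single set, satisfy the rule.
    return len(zeros) <= 1
```

Under this sketch, a string of ASCII digits passes, a string of Arabic-Indic digits passes, and a string mixing the two fails, which mirrors the rule that any mixing of ASCII and other digits prevents a script run.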