author Karl Williamson <khw@cpan.org> 2017-12-23 23:15:37 -0700
committer Karl Williamson <khw@cpan.org> 2017-12-24 17:20:45 -0700
commit 034602eb0483306842c572ced4105398f6c38595 (patch)
tree 4c92777d3695c39b3d181bb6d3b1393a43364556 /pod/perlre.pod
parent 07093db4de0c6dc92b6c401b33af53b861e41ea2 (diff)
Add script_run regex feature
As explained in the docs, this helps detect spoofing attacks.
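
For readers who want a feel for the problem before reading the patch, here is a minimal sketch of a hand-rolled mixed-script check. It is not part of this commit and does not use the new construct. The helper name scripts_in and the three-script list are assumptions made for illustration, and it tests the plain Script property rather than the Script_Extensions property the feature is based on, so it is much cruder than what the patch documents.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the commit: precompiled tests for a
# small hand-picked set of scripts.
my %script_re = (
    Latin    => qr/\p{Script=Latin}/,
    Cyrillic => qr/\p{Script=Cyrillic}/,
    Greek    => qr/\p{Script=Greek}/,
);

# Return the names (sorted) of the listed scripts that occur in $str.
sub scripts_in {
    my ($str) = @_;
    return grep { $str =~ $script_re{$_} } sort keys %script_re;
}

# "paypal" in pure Latin, then with U+0430 CYRILLIC SMALL LETTER A
# substituted for each Latin "a".
for my $host ("paypal", "p\x{430}yp\x{430}l") {
    my @scripts = scripts_in($host);
    printf "%d script(s): %s\n", scalar @scripts, join(", ", @scripts);
}
# prints:
#   1 script(s): Latin
#   2 script(s): Cyrillic, Latin
```

A name that reports more than one script is a candidate spoof. The sketch ignores the digit rules and the Common, Inherited, and Unknown pseudo scripts that the documentation below describes.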
Diffstat (limited to 'pod/perlre.pod')
-rw-r--r-- pod/perlre.pod 90
1 files changed, 90 insertions, 0 deletions
diff --git a/pod/perlre.pod b/pod/perlre.pod
index f902ea9ceb..7a1d405c79 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -706,6 +706,10 @@ the pattern uses a Unicode break (C<\b{...}> or C<\B{...}>); or
the pattern uses L</C<(?[ ])>>
+=item 8
+
+the pattern uses L<C<(+script_run: ...)>|/Script Runs>
+
=back
Another mnemonic for this modifier is "Depends", as the rules actually
@@ -2412,6 +2416,92 @@ whether they match is considered relevant. For an example
where side-effects of lookahead I<might> have influenced the
following match, see L</C<< (?>pattern) >>>.
+=head2 Script Runs
+
+A script run is basically a sequence of characters, all from the same
+Unicode script (see L<perlunicode/Scripts>), such as Latin or Greek. In
+most places a single word would never be written in multiple scripts,
+unless it is a spoofing attack. An infamous example is
+
+ paypal.com
+
+Those letters could all be Latin (as in the example just above), or they
+could all be Cyrillic (except for the dot), or they could be a mixture
+of the two. In an internet address the C<.com> would be in Latin, so
+any Cyrillic letters would make the name a mixture, not a script run.
+Someone clicking on such a link would not be directed to the real
+PayPal website, but to a look-alike site that an attacker has crafted
+to gather sensitive information from that person.
+
+Starting in Perl 5.28, it is easy to detect strings that aren't
+script runs. Simply enclose just about any pattern like this:
+
+ (+script_run:pattern)
+
+After I<pattern> succeeds in matching, the matched text is subjected to
+the additional criterion that every character in it must be from the
+same script (see exceptions below). If this isn't true, backtracking
+occurs until a match that is all in the same script is found, or all
+possibilities are exhausted. This can cause a lot of backtracking.
+Generally only malicious input will trigger it, but the resulting
+slowdown could amount to a denial of service attack. If your needs
+permit, it is best to make the pattern atomic:
+
+ (+script_run:(?>pattern))
+
+(See L</C<(?E<gt>pattern)>>.)
+
+In Taiwan, Japan, and Korea, it is common for text to have a mixture of
+characters from their native scripts and base Chinese. Perl follows
+Unicode's UTS 39 (L<http://unicode.org/reports/tr39/>) Unicode Security
+Mechanisms in allowing such mixtures.
+
+The rules used for matching decimal digits are somewhat different. Many
+scripts have their own sets of digits equivalent to the Western C<0>
+through C<9> ones. A few, such as Arabic, have more than one set. For
+a string to be considered a script run, all digits in it must come from
+the same set, as determined by the first digit encountered. The ASCII
+C<[0-9]> are accepted as being in any script, even those that have their
+own set. This is because these are often used in commerce even in such
+scripts. But any mixing of the ASCII and other digits will cause the
+sequence to not be a script run, failing the match. As an example,
+
+ qr/(+script_run: \d+ \b )/x
+
+guarantees that the digits matched will all be from the same set of 10.
+You won't get a look-alike digit from a different script that has a
+different value than what it appears to be.
+
+Unicode has three pseudo scripts that are handled specially.
+
+"Unknown" is applied to code points whose meaning has yet to be
+determined. Perl currently will match as a script run any
+single-character string consisting of one of these code points. But any
+string longer than one code point that contains one of these will not
+be considered a script run.
+
+"Inherited" is applied to characters that modify another, such as an
+accent of some type. These are considered to be in the script of the
+base character, and so never cause a script run to not match.
+
+The other one is "Common". This consists mostly of punctuation, emoji,
+characters used in mathematics and music, and the ASCII digits C<0>
+through C<9>. These characters can appear intermixed in text in many of
+the world's scripts. These also don't cause a script run to not match,
+except that any ASCII digits encountered have to obey the decimal digit
+rules described above.
+
+This construct is non-capturing. You can add parentheses to I<pattern>
+to capture, if desired. You will have to do this if you plan to use
+L</(*ACCEPT) (*ACCEPT:arg)> and not have it bypass the script run
+checking.
+
+This feature is experimental, and the exact syntax and details of
+operation are subject to change; using it yields a warning in the
+C<experimental::script_run> category.
+
+The C<Script_Extensions> property is used as the basis for this feature.
+
=head2 Special Backtracking Control Verbs
These special patterns are generally of the form C<(*I<VERB>:I<ARG>)>. Unless