author | Karl Williamson <khw@cpan.org> | 2017-12-23 23:15:37 -0700 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2017-12-24 17:20:45 -0700 |
commit | 034602eb0483306842c572ced4105398f6c38595 (patch) | |
tree | 4c92777d3695c39b3d181bb6d3b1393a43364556 /pod/perlre.pod | |
parent | 07093db4de0c6dc92b6c401b33af53b861e41ea2 (diff) | |
download | perl-034602eb0483306842c572ced4105398f6c38595.tar.gz |
Add script_run regex feature
As explained in the docs, this helps detect spoofing attacks.
Diffstat (limited to 'pod/perlre.pod')
-rw-r--r-- | pod/perlre.pod | 90 |
1 files changed, 90 insertions, 0 deletions
diff --git a/pod/perlre.pod b/pod/perlre.pod
index f902ea9ceb..7a1d405c79 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -706,6 +706,10 @@ the pattern uses a Unicode break (C<\b{...}> or C<\B{...}>); or
 
 the pattern uses L</C<(?[ ])>>
 
+=item 8
+
+the pattern uses L<C<(+script_run: ...)>|/Script Runs>
+
 =back
 
 Another mnemonic for this modifier is "Depends", as the rules actually
@@ -2412,6 +2416,92 @@ whether they match is considered relevant.
 For an example where side-effects of lookahead I<might> have influenced the
 following match, see L</C<< (?>pattern) >>>.
 
+=head2 Script Runs
+
+A script run is basically a sequence of characters, all from the same
+Unicode script (see L<perlunicode/Scripts>), such as Latin or Greek. In
+most places a single word would never be written in multiple scripts,
+unless it is a spoofing attack. An infamous example, is
+
+ paypal.com
+
+Those letters could all be Latin (as in the example just above), or they
+could be all Cyrillic (except for the dot), or they could be a mixture
+of the two. In the case of an internet address the C<.com> would be in
+Latin, And any Cyrillic ones would cause it to be a mixture, not a
+script run. Someone clicking on such a link would not be directed to
+the real Paypal website, but an attacker would craft a look-alike one to
+attempt to gather sensitive information from the person.
+
+Starting in Perl 5.28, it is now easy to detect strings that aren't
+script runs. Simply enclose just about any pattern like this:
+
+ (+script_run:pattern)
+
+What happens is that after I<pattern> succeeds in matching, it is
+subjected to the additional criterion that every character in it must be
+from the same script (see exceptions below). If this isn't true,
+backtracking occurs until something all in the same script is found that
+matches, or all possibilities are exhausted. This can cause a lot of
+backtracking, but generally, only malicious input will result in this,
+though the slow down could cause a denial of service attack. If your
+needs permit, it is best to make the pattern atomic.
+
+ (+script_run:(?>pattern))
+
+(See L</C<(?E<gt>pattern)>>.)
+
+In Taiwan, Japan, and Korea, it is common for text to have a mixture of
+characters from their native scripts and base Chinese. Perl follows
+Unicode's UTS 39 (L<http://unicode.org/reports/tr39/>) Unicode Security
+Mechanisms in allowing such mixtures.
+
+The rules used for matching decimal digits are somewhat different. Many
+scripts have their own sets of digits equivalent to the Western C<0>
+through C<9> ones. A few, such as Arabic, have more than one set. For
+a string to be considered a script run, all digits in it must come from
+the same set, as determined by the first digit encountered. The ASCII
+C<[0-9]> are accepted as being in any script, even those that have their
+own set. This is because these are often used in commerce even in such
+scripts. But any mixing of the ASCII and other digits will cause the
+sequence to not be a script run, failing the match. As an example,
+
+ qr/(?script_run: \d+ \b )/x
+
+guarantees that the digits matched will all be from the same set of 10.
+You won't get a look-alike digit from a different script that has a
+different value than what it appears to be.
+
+Unicode has three pseudo scripts that are handled specially.
+
+"Unknown" is applied to code points whose meaning has yet to be
+determined. Perl currently will match as a script run, any single
+character string consisting of one of these code points. But any string
+longer than one code point containing one of these will not be
+considered a script run.
+
+"Inherited" is applied to characters that modify another, such as an
+accent of some type. These are considered to be in the script of the
+master character, and so never cause a script run to not match.
+
+The other one is "Common". This consists of mostly punctuation, emoji,
+and characters used in mathematics and music, and the ASCII digits C<0>
+through C<9>. These characters can appear intermixed in text in many of
+the world's scripts. These also don't cause a script run to not match,
+except any ASCII digits encountered have to obey the decimal digit rules
+described above.
+
+This construct is non-capturing. You can add parentheses to I<pattern>
+to capture, if desired. You will have to do this if you plan to use
+L</(*ACCEPT) (*ACCEPT:arg)> and not have it bypass the script run
+checking.
+
+This feature is experimental, and the exact syntax and details of
+operation are subject to change; using it yields a warning in the
+C<experimental::script_run> category.
+
+The C<Script_Extensions> property is used as the basis for this feature.
+
 =head2 Special Backtracking Control Verbs
 
 These special patterns are generally of the form C<(*I<VERB>:I<ARG>)>. Unless
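
A minimal sketch of the spoofing check described in the added "Script Runs" section follows. It is an illustration, not part of the commit. It assumes a released perl 5.28 or later, where the construct is spelled C<(*script_run:...)> rather than the C<(+script_run:...)> form shown in this patch (the patch itself notes that the syntax is subject to change). The helper name C<is_script_run> and the sample strings are hypothetical.

```perl
use 5.028;
use warnings;
# The feature warns as experimental on 5.28 and 5.30; silence that there only.
no if $] < 5.032, warnings => 'experimental::script_run';

# Hypothetical helper (not part of perl): returns 1 if $str consists of one
# or more word characters that together form a single script run, else 0.
sub is_script_run {
    my ($str) = @_;
    # (?>...) makes the inner match atomic, as the documentation recommends,
    # so a failed script-run check cannot trigger repeated backtracking.
    return $str =~ /\A(*script_run:(?>\w+))\z/ ? 1 : 0;
}

my @samples = (
    [ 'all Latin'             => "paypal" ],
    [ 'Latin with Cyrillic a' => "p\x{430}yp\x{430}l" ],   # U+0430 replaces "a"
    [ 'all Cyrillic'          => "\x{440}\x{430}\x{443}\x{440}\x{430}" ],
);

for my $sample (@samples) {
    my ($label, $str) = @$sample;
    printf "%-22s %s\n", $label,
        is_script_run($str) ? 'script run' : 'mixed scripts';
}
```

The mixed string looks identical to the Latin one when rendered, which is why the check is made on the characters' scripts rather than on their appearance.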