author Karl Williamson <public@khwilliamson.com> 2014-02-18 12:59:26 -0700
committer Karl Williamson <public@khwilliamson.com> 2014-02-19 14:31:38 -0700
commit 63baef57e83f77e202ae14ef902a6615cf69c8a2
tree e9ab630270d9070ec495be7115931cb6a3551939 /pod/perllocale.pod
parent fdf73a7f7fb994c00e17a01f146018fcb3c47ffb
Make taint checking regex compile time instead of runtime
See discussion at https://rt.perl.org/Ticket/Display.html?id=120675

There are several unresolved items in this discussion, but we did agree that tainting should be dependent only on the regex pattern, and not the particular input string being matched against:

"The bottom line is we are moving to the policy that tainting is based on the operation being in locale, without regard to the particular operand's contents passed this time to the operation. This means simpler core code and more consistent tainting results. And it lessens the likelihood that there are paths in the core that should taint but don't"

This commit does the minimal work to change regex pattern matching to determine tainting at pattern compilation time. Simply put, if a pattern contains a regnode whose match/not match depends on the run-time locale, any attempt to match against that pattern will taint, regardless of the actual target string or runtime locale in effect. Given this change, there are optimizations that can be made to avoid runtime work, but these are deferred until later.

Note that just because a regular expression is compiled under locale doesn't mean that the generated pattern will be tainted. It depends on the actual pattern. For example, the pattern /(.)/ doesn't taint because it will match exactly one character of the input, regardless of locale settings.
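
As a rough illustration of the policy described above (this is not code from the commit), the following Perl sketch checks whether the first capture of a match is tainted for two patterns compiled under use locale. The helper name first_capture_tainted is hypothetical. The sketch assumes a Perl that includes this change and must be run with taint mode on (perl -T), because Scalar::Util's tainted() reports false whenever taint mode is off.

```perl
use strict;
use warnings;
use locale;                      # patterns below are compiled under locale rules
use Scalar::Util qw(tainted);

# Hypothetical helper: returns 1 if the first capture group of
# $str =~ $re is tainted, 0 otherwise (including when there is no match).
sub first_capture_tainted {
    my ($re, $str) = @_;
    my ($capture) = $str =~ $re;
    return tainted($capture) ? 1 : 0;
}

# tainted() can only report taint when the interpreter runs in taint mode.
warn "taint mode is off; run with 'perl -T' to see tainting\n"
    unless ${^TAINT};

my $input = "abc";               # a literal, so not tainted to begin with

# \w is locale-dependent, so under the policy above its capture is
# expected to be tainted regardless of the target string.
print '(\w) capture tainted: ',
    first_capture_tainted(qr/(\w)/, $input), "\n";

# . matches exactly one character whatever the locale, so its capture
# is expected to stay untainted.
print '(.)  capture tainted: ',
    first_capture_tainted(qr/(.)/, $input), "\n";
```

The input is deliberately an untainted literal, so any taint reported comes from the pattern rather than from the data being matched.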
Diffstat (limited to 'pod/perllocale.pod')
-rw-r--r-- pod/perllocale.pod 23
1 files changed, 15 insertions, 8 deletions
diff --git a/pod/perllocale.pod b/pod/perllocale.pod
index 62a2d8b91d..c7546f8b26 100644
--- a/pod/perllocale.pod
+++ b/pod/perllocale.pod
@@ -1011,16 +1011,23 @@ Scalar true/false result never tainted.
All subpatterns, either delivered as a list-context result or as C<$1>
I<etc>., are tainted if C<use locale> (but not
S<C<use locale ':not_characters'>>) is in effect, and the subpattern
-regular expression is matched case-insensitively (C</i>) or contains a
-locale-dependent construct. These constructs include C<\w>
-(to match an alphanumeric character), C<\W> (non-alphanumeric
-character), C<\s> (whitespace character), C<\S> (non whitespace
-character), and the POSIX character classes, such as C<[:alpha:]> (see
-L<perlrecharclass/POSIX Character Classes>).
+regular expression contains a locale-dependent construct. These
+constructs include C<\w> (to match an alphanumeric character), C<\W>
+(non-alphanumeric character), C<\b> and C<\B> (word-boundary and
+non-boundary, which depend on what C<\w> and C<\W> match), C<\s>
+(whitespace character), C<\S> (non whitespace character), C<\d> and
+C<\D> (digits and non-digits), and the POSIX character classes, such as
+C<[:alpha:]> (see L<perlrecharclass/POSIX Character Classes>).
+
+Tainting is also likely if the pattern is to be matched
+case-insensitively (via C</i>). The exception is if all the code points
+to be matched this way are above 255 and do not have folds under Unicode
+rules to below 256. Tainting is not done for these because Perl
+only uses Unicode rules for such code points, and those rules are the
+same no matter what the current locale.
+
The matched-pattern variables, C<$&>, C<$`> (pre-match), C<$'>
(post-match), and C<$+> (last match) also are tainted.
-(Note that currently there are some bugs where not everything that
-should be tainted gets tainted in all circumstances.)
=item *