From af302e7fa58415c2d8454c8cbef7bccd8b504257 Mon Sep 17 00:00:00 2001
From: Karl Williamson
Date: Wed, 16 Mar 2011 12:19:42 -0600
Subject: RT #85964: bleadperl breaks CGI-FormBuilder

The introduction of the /l regex modifier makes it possible for a
regular expression to have subportions that match under locale and
other portions that don't.  I (khw) failed to see all the implications
of that in the optimizer.  Unfortunately, things didn't start surfacing
until late in the development cycle.

The optimizer is structured so that a new blank node is initialized to
match anything, and the state is set to AND, so that the first real
node that comes along is ANDed with it, the result being that node.
(Just as an AND of all 1's with some bit pattern yields that bit
pattern.)  Then the mode is switched to OR, so that subsequent nodes
that could be start nodes are OR'd in (see the footnote marked *
below).  This design leads to some issues, like at the XXX line added
by this commit, which looks to be a work-around for the deficiencies of
the design.

Commit cf34198ebe3dd876d67c10caa9acf491ad2a0c51, which led to this
ticket, changed things to include LOCALE as part of the initialization,
so that /l could be on and off in various parts of the regex.  I tried
to just revert that (plus associated parameter changes), and found that
the changes made to the AND and OR logic that fixed other problems
really depended on that commit.  Perhaps those could be worked around,
but that is not the way forward.

This commit works around things in a different way.  What happened in
the earlier commit was that the synthetic start class (SSC) is, under
some circumstances, generated as matching locale even if there is no
locale matching in the regex.  (This could not happen if the design
were as described in the footnote.)  This shouldn't matter except for
potential performance issues, as it would just yield false positives.
However, it turns out there is code in the optimizer that assumes that
locale and non-locale are never mixed, and so does not do the right
thing.

This patch is aimed at safety.  If the SSC is marked as locale, it sets
the bits for things like \w as if the SSC could also end up being for
non-locale.  This can generate false positives for true locale matches
but shouldn't introduce actual optimizer errors, since it only adds to
what the SSC can match and doesn't make any restrictions.

* I don't see the reason for this design.  It seems easier to me to
start with the initial state set to all 0's, and then OR in the first
node, yielding exactly that first node; then you don't have to switch
modes.  You still have to deal with AND cases, as for example in
0-length lookaheads, but things are made easier.
---
 t/re/re_tests | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

(limited to 't/re')

diff --git a/t/re/re_tests b/t/re/re_tests
index 0f19ae21d1..b3815298bb 100644
--- a/t/re/re_tests
+++ b/t/re/re_tests
@@ -1498,16 +1498,16 @@ abc\N{def	-	c	-	\\N{NAME} must be resolved by the lexer
 (?{})[\x{100}]	\x{100}	y	$&	\x{100}
 
 # RT #85964
-^m?(\S)(.*)\1$	aba	Ty	$1	a
+^m?(\S)(.*)\1$	aba	y	$1	a
 ^m?(\S)(.*)\1$	\tb\t	n	-	-
-^m?(\s)(.*)\1$	\tb\t	Ty	$1	\t
+^m?(\s)(.*)\1$	\tb\t	y	$1	\t
 ^m?(\s)(.*)\1$	aba	n	-	-
-^m?(\W)(.*)\1$	:b:	Ty	$1	:
+^m?(\W)(.*)\1$	:b:	y	$1	:
 ^m?(\W)(.*)\1$	aba	n	-	-
-^m?(\w)(.*)\1$	aba	Ty	$1	a
+^m?(\w)(.*)\1$	aba	y	$1	a
 ^m?(\w)(.*)\1$	:b:	n	-	-
-^m?(\D)(.*)\1$	aba	Ty	$1	a
+^m?(\D)(.*)\1$	aba	y	$1	a
 ^m?(\D)(.*)\1$	5b5	n	-	-
-^m?(\d)(.*)\1$	5b5	Ty	$1	5
+^m?(\d)(.*)\1$	5b5	y	$1	5
 ^m?(\d)(.*)\1$	aba	n	-	-
 # vim: softtabstop=0 noexpandtab
-- 
cgit v1.2.1
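
The AND-then-OR scheme described in the commit message, the all-0's alternative from its footnote, and the "only add bits" safety idea can be sketched roughly as below. This is a toy model, not the code in perl's optimizer: the start class is assumed to be a plain 256-bit bitmap stored in a Python integer, and every name (`class_bits`, `start_class_and_then_or`, `start_class_or_only`, `widen_for_safety`) is hypothetical.

```python
# Toy model only: a start class is a 256-bit bitmap held in an int.
# All names here are hypothetical; none come from perl's source.
ALL_BITS = (1 << 256) - 1


def class_bits(chars):
    """Return a bitmap with one bit set per code point in chars."""
    bits = 0
    for ch in chars:
        bits |= 1 << ord(ch)
    return bits


def start_class_and_then_or(nodes):
    """Scheme from the commit message: start with all 1's in AND mode,
    AND in the first node, then switch to OR for the remaining nodes."""
    ssc = ALL_BITS
    and_mode = True
    for node in nodes:
        if and_mode:
            ssc &= node       # all 1's AND node == node
            and_mode = False  # later start nodes are OR'd in
        else:
            ssc |= node
    return ssc


def start_class_or_only(nodes):
    """Footnote's alternative: start with all 0's and OR in every node."""
    ssc = 0
    for node in nodes:
        ssc |= node
    return ssc


def widen_for_safety(ssc, marked_locale, non_locale_bits):
    """Assumed shape of the safety idea: if the class is marked as locale,
    also OR in the non-locale bits.  OR can only add to the set."""
    if marked_locale:
        ssc |= non_locale_bits
    return ssc


# Possible start nodes for a pattern like ^m?(\S)...: 'm' or any non-space.
m_node = class_bits("m")
non_space = ALL_BITS & ~class_bits(" \t\n\r\f\v")
print(start_class_and_then_or([m_node, non_space])
      == start_class_or_only([m_node, non_space]))
```

In this model the two schemes agree whenever at least one node is seen; they differ only for an empty node list, where the first leaves "match anything" and the second leaves "match nothing". Because `widen_for_safety` only ORs bits in, its result is always a superset of its input, which is why the widening can cost false positives but cannot exclude a real match.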