diff options
author | Karl Williamson <public@khwilliamson.com> | 2011-12-23 20:11:22 -0700 |
---|---|---|
committer | Karl Williamson <public@khwilliamson.com> | 2012-01-19 11:58:19 -0700 |
commit | bb9144850c8033592c0187ea712691a97703385e (patch) | |
tree | 52c1796d71c7ec32663a02436927ce9d8a05e90f | |
parent | 86d6fcadb912bd04e5bb511a8188871eb12e4274 (diff) | |
download | perl-bb9144850c8033592c0187ea712691a97703385e.tar.gz |
regex: Fix some tricky fold problems
As described in the comments, this changes the design of handling the
Unicode tricky fold characters to not generate a node for each possible
sequence but to get them to work within EXACTFish nodes.
The previous design(s) all used a node to handle these, which suffers
from the downfall that it precludes legitimate matches that would cross
the node boundary.
The new design is described in the comments.
-rw-r--r-- | regcomp.c | 398 | ||||
-rw-r--r-- | regcomp.h | 1 | ||||
-rw-r--r-- | t/re/re_tests | 6 |
3 files changed, 172 insertions, 233 deletions
@@ -2505,6 +2505,101 @@ S_make_trie_failtable(pTHX_ RExC_state_t *pRExC_state, regnode *source, regnode }}); +/* The below joins as many adjacent EXACTish nodes as possible into a single + * one, and looks for problematic sequences of characters whose folds vs. + * non-folds have sufficiently different lengths, that the optimizer would be + * fooled into rejecting legitimate matches of them, and the trie construction + * code can't cope with them. The joining is only done if: + * 1) there is room in the current conglomerated node to entirely contain the + * next one. + * 2) they are the exact same node type + * + * The adjacent nodes actually may be separated by NOTHING kind nodes, and + * these get optimized out + * + * If there are problematic code sequences, *min_change is set to the delta + * that the minimum size of the node can be off from its actual size. And, the + * node type of the result is changed to reflect that it contains these + * sequences. + * + * This is as good a place as any to discuss the design of handling these + * problematic sequences. It's been wrong in Perl for a very long time. There + * are three code points in Unicode whose folded lengths differ so much from + * the un-folded lengths that it causes problems for the optimizer and trie + * construction. Why only these are problematic, and not others where lengths + * also differ is something I (khw) do not understand. New versions of Unicode + * might add more such code points. Hopefully the logic in fold_grind.t that + * figures out what to test (in part by veriying that each size-combination + * gets tested) will catch any that do come along, so they can be added to the + * special handling below. The chances of this are actually rather small, as + * most, if not all, of the world's scripts that have casefolding have already + * been encoded by Unicode. Also, a number of Unicode's decisions were made to + * allow compatibility with pre-existing standards, and almost all of those + * have already been dealt with. These would otherwise be the most likely + * candidates for generating further tricky sequences. In other words, Unicode + * by itself is unlikely to add new ones unless it is for compatibility with + * pre-existing standards. + * + * The previous designs for dealing with these involved assigning a special + * node for them. This approach doesn't work, as evidenced by this example: + * "\xDFs" =~ /s\xDF/ui + * Both these fold to "sss", but if the pattern is parsed to create a node of + * that would match just the \xDF, it won't be able to handle the case where a + * successful match would have to cross the node's boundary. The new approach + * that hopefully generally solves the problem generates an EXACTFU_SS node + * that is "sss". + * + * There are a number of components to the approach (a lot of work for just + * three code points!): + * 1) This routine examines each EXACTFish node that could contain the + * problematic sequences, and if found, returns in *min_change the total + * delta between the actual length of the string and one that could match + * it. This delta is used by the caller to adjust the min length of the + * match, and the delta between min and max, so that the optimizer doesn't + * reject these possibilities based on size constraints + * 2) These sequences are not currently correctly handled by the trie code + * either, so it changes the joined node type to ops that are not handled + * by trie's, those new ops being EXACTFU_SS and EXACTFU_NO_TRIE. + * 3) This is sufficient for the two Greek sequences (described below), but + * the one involving the Sharp s (\xDF) needs more. The node type + * EXACTFU_SS is used for an EXACTFU node that contains at least one "ss" + * sequence in it. For non-UTF-8 patterns and strings, this is the only + * case where there is a possible fold length change. That means that a + * regular EXACTFU node without UTF-8 involvement doesn't have to concern + * itself with length changes, and so can be processed faster. regexec.c + * takes advantage of this. Generally, an EXACTFish node that is in UTF-8 + * is pre-folded by regcomp.c. This saves effort in regex matching. + * However, probably mostly for historical reasons, the pre-folding isn't + * done for non-UTF8 patterns. The fold possibilities for these are quite + * simple, except for the sharp s. All the ones that don't involve a + * UTF-8 target string are members of a fold-pair, and arrays are set up + * for all of them that quickly find the other member of the pair. It + * might actually be faster to pre-fold these, but it isn't currently + * done, except for the sharp s. Code elsewhere in this file makes sure + * that it gets folded to 'ss', even if the pattern isn't UTF-8. This + * avoids the issues describe in the next item. + * 4) A problem remains for the sharp s in EXACTF nodes. Whether it matches + * 'ss' or not is not knowable at compile time. It will match iff the + * target string is in UTF-8, unlike the EXACTFU nodes, where it always + * matches; and the EXACTFL and EXACTFA nodes where it never does. Thus + * it can't be folded to "ss" at compile time, unlike EXACTFU does as + * described in item 3). An assumption that the optimizer part of + * regexec.c (probably unwittingly) makes is that a character in the + * pattern corresponds to at most a single character in the target string. + * (And I do mean character, and not byte here, unlike other parts of the + * documentation that have never been updated to account for multibyte + * Unicode.) This assumption is wrong only in this case, as all other + * cases are either 1-1 folds when no UTF-8 is involved; or is true by + * virtue of having this file pre-fold UTF-8 patterns. I'm + * reluctant to try to change this assumption, so instead the code punts. + * Elsewhere in this file, each EXACTF node is examined for the sharp s. + * If found, a flag is set that later causes the optimizer in this file to + * not set values for the floating and fixed string lengths, and thus + * avoid the optimizer code in regexec.c that makes this invalid + * assumption. Thus, there is no optimization based on string lengths for + * EXACTF nodes that contain the sharp s. This only happens for /id rules + * (which means the pattern isn't in UTF-8). + */ #define JOIN_EXACT(scan,min_change,flags) \ if (PL_regkind[OP(scan)] == EXACT) \ @@ -2531,7 +2626,7 @@ S_join_exact(pTHX_ RExC_state_t *pRExC_state, regnode *scan, IV *min_change, U32 PERL_UNUSED_ARG(val); #endif DEBUG_PEEP("join",scan,depth); - + /* Look through the subsequent nodes in the chain. Skip NOTHING, merge * EXACT ones that are mergeable to the current one. */ while (n @@ -2660,6 +2755,55 @@ S_join_exact(pTHX_ RExC_state_t *pRExC_state, regnode *scan, IV *min_change, U32 || (t[-1] == U3B0_2nd_byte && t[-2] == U3B0_first_byte)) { *min_change -= 4; + + /* This can't currently be handled by tries, so change the + * node type to indicate this. */ + if (OP(scan) == EXACTFU) { + OP(scan) = EXACTFU_NO_TRIE; + } + } + } + } + + /* The third problematic sequence is 'ss', which can match just the + * single byte LATIN SMALL LETTER SHARP S, and it can do it in both + * non- and UTF-8. Code elsewhere in this file makes sure, however, + * that the sharp s gets folded to 'ss' under Unicode rules even if not + * UTF-8. */ + if (STR_LEN(scan) >= 2 + && (OP(scan) == EXACTFU + || OP(scan) == EXACTFU_NO_TRIE /* The code above could have + set to this node type */ + || OP(scan) == EXACTF)) + { + /* The string will be folded to 'ss' if it's in UTF-8, but it could + * include capital 'S' instead of lower case when not UTF-8. We + * could have different code to handle the two cases, but this is + * not necessary since both S and s are invariants under UTF-8; and + * not worth it, especially because we can use just one test for + * either 'S' or 's' each * time through the loop (plus a mask). + * Ths is because on both EBCDIC and ASCII machines, 'S' and 's' + * differ by a single bit. On ASCII they are 32 apart; on EBCDIC, + * they are 64. This uses an exclusive 'or' to find that bit and + * then inverts it to form a mask, with just a single 0, in the bit + * position where 'S' and 's' differ. */ + const char S_or_s_mask = ~ ('S' ^ 's'); + const char s_masked = 's' & S_or_s_mask; + + for (s = s0; s < s_end - 1; s++) { + if (((*s & S_or_s_mask) == s_masked) + && ((*(s+1) & S_or_s_mask) == s_masked)) + { + s++; + *min_change -= 1; + + /* EXACTFU_SS also isn't trie'able, so don't have to + * preserve EXACTFU_NO_TRIE. EXACTF is also not trie'able, + * and because we essentially punt the optimizations in its + * case, we don't need to indicate that it has an ss */ + if (OP(scan) == EXACTFU || OP(scan) == EXACTFU_NO_TRIE) { + OP(scan) = EXACTFU_SS; + } } } } @@ -3296,6 +3440,11 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp, l = utf8_length(s, s + l); uc = utf8_to_uvchr(s, NULL); } + else if (OP(scan) == EXACTF) { + if (memchr(STRING(scan), LATIN_SMALL_LETTER_SHARP_S, l)) { + RExC_seen |= REG_SEEN_EXACTF_SHARP_S; + } + } min += l + min_change; if (min < 0) { min = 0; @@ -5122,9 +5271,10 @@ reStudy: { I32 t,ml; - if (SvCUR(data.longest_fixed) /* ok to leave SvCUR */ - && data.offset_fixed == data.offset_float_min - && SvCUR(data.longest_fixed) == SvCUR(data.longest_float)) + if ((RExC_seen & REG_SEEN_EXACTF_SHARP_S) + || (SvCUR(data.longest_fixed) /* ok to leave SvCUR */ + && data.offset_fixed == data.offset_float_min + && SvCUR(data.longest_fixed) == SvCUR(data.longest_float))) goto remove_float; /* As in (a)+. */ /* copy the information about the longest float from the reg_scan_data @@ -5167,10 +5317,11 @@ reStudy: Be careful. */ longest_fixed_length = CHR_SVLEN(data.longest_fixed); - if (longest_fixed_length - || (data.flags & SF_FIX_BEFORE_EOL /* Cannot have SEOL and MULTI */ - && (!(data.flags & SF_FIX_BEFORE_MEOL) - || (RExC_flags & RXf_PMf_MULTILINE)))) + if (! (RExC_seen & REG_SEEN_EXACTF_SHARP_S) + && (longest_fixed_length + || (data.flags & SF_FIX_BEFORE_EOL /* Cannot have SEOL and MULTI */ + && (!(data.flags & SF_FIX_BEFORE_MEOL) + || (RExC_flags & RXf_PMf_MULTILINE)))) ) { I32 t,ml; @@ -9058,15 +9209,6 @@ tryagain: RExC_parse++; defchar: { - typedef enum { - generic_char = 0, - char_s, - upsilon_1, - upsilon_2, - iota_1, - iota_2, - } char_state; - char_state latest_char_state = generic_char; register STRLEN len; register UV ender; register char *p; @@ -9076,6 +9218,10 @@ tryagain: regnode * orig_emit; U8 node_type; + /* Is this a LATIN LOWER CASE SHARP S in an EXACTFU node? If so, + * it is folded to 'ss' even if not utf8 */ + bool is_exactfu_sharp_s; + ender = 0; orig_emit = RExC_emit; /* Save the original output node position in case we need to output a different node @@ -9304,219 +9450,11 @@ tryagain: break; } /* End of switch on the literal */ - /* Certain characters are problematic because their folded - * length is so different from their original length that it - * isn't handleable by the optimizer. They are therefore not - * placed in an EXACTish node; and are here handled specially. - * (Even if the optimizer handled LATIN_SMALL_LETTER_SHARP_S, - * putting it in a special node keeps regexec from having to - * deal with a non-utf8 multi-char fold */ - if (FOLD - && (ender > 255 || (! MORE_ASCII_RESTRICTED && ! LOC))) - { - /* We look for either side of the fold. For example \xDF - * folds to 'ss'. We look for both the single character - * \xDF and the sequence 'ss'. When we find something that - * could be one of those, we stop and flush whatever we - * have output so far into the EXACTish node that was being - * built. Then restore the input pointer to what it was. - * regatom will return that EXACT node, and will be called - * again, positioned so the first character is the one in - * question, which we return in a different node type. - * The multi-char folds are a sequence, so the occurrence - * of the first character in that sequence doesn't - * necessarily mean that what follows is the rest of the - * sequence. We keep track of that with a state machine, - * with the state being set to the latest character - * processed before the current one. Most characters will - * set the state to 0, but if one occurs that is part of a - * potential tricky fold sequence, the state is set to that - * character, and the next loop iteration sees if the state - * should progress towards the final folded-from character, - * or if it was a false alarm. If it turns out to be a - * false alarm, the character(s) will be output in a new - * EXACTish node, and join_exact() will later combine them. - * In the case of the 'ss' sequence, which is more common - * and more easily checked, some look-ahead is done to - * save time by ruling-out some false alarms */ - switch (ender) { - default: - latest_char_state = generic_char; - break; - case 's': - case 'S': - case 0x17F: /* LATIN SMALL LETTER LONG S */ - if (AT_LEAST_UNI_SEMANTICS) { - if (latest_char_state == char_s) { /* 'ss' */ - ender = LATIN_SMALL_LETTER_SHARP_S; - goto do_tricky; - } - else if (p < RExC_end) { - - /* Look-ahead at the next character. If it - * is also an s, we handle as a sharp s - * tricky regnode. */ - if (*p == 's' || *p == 'S') { - - /* But first flush anything in the - * EXACTish buffer */ - if (len != 0) { - p = oldp; - goto loopdone; - } - p++; /* Account for swallowing this - 's' up */ - ender = LATIN_SMALL_LETTER_SHARP_S; - goto do_tricky; - } - /* Here, the next character is not a - * literal 's', but still could - * evaluate to one if part of a \o{}, - * \x or \OCTAL-DIGIT. The minimum - * length required for that is 4, eg - * \x53 or \123 */ - else if (*p == '\\' - && p < RExC_end - 4 - && (isDIGIT(*(p + 1)) - || *(p + 1) == 'x' - || *(p + 1) == 'o' )) - { - - /* Here, it could be an 's', too much - * bother to figure it out here. Flush - * the buffer if any; when come back - * here, set the state so know that the - * previous char was an 's' */ - if (len != 0) { - latest_char_state = generic_char; - p = oldp; - goto loopdone; - } - latest_char_state = char_s; - break; - } - } - } - - /* Here, can't be an 'ss' sequence, or at least not - * one that could fold to/from the sharp ss */ - latest_char_state = generic_char; - break; - case 0x03C5: /* First char in upsilon series */ - case 0x03A5: /* Also capital UPSILON, which folds to - 03C5, and hence exhibits the same - problem */ - if (p < RExC_end - 4) { /* Need >= 4 bytes left */ - latest_char_state = upsilon_1; - if (len != 0) { - p = oldp; - goto loopdone; - } - } - else { - latest_char_state = generic_char; - } - break; - case 0x03B9: /* First char in iota series */ - case 0x0399: /* Also capital IOTA */ - case 0x1FBE: /* GREEK PROSGEGRAMMENI folds to 3B9 */ - case 0x0345: /* COMBINING GREEK YPOGEGRAMMENI folds - to 3B9 */ - if (p < RExC_end - 4) { - latest_char_state = iota_1; - if (len != 0) { - p = oldp; - goto loopdone; - } - } - else { - latest_char_state = generic_char; - } - break; - case 0x0308: - if (latest_char_state == upsilon_1) { - latest_char_state = upsilon_2; - } - else if (latest_char_state == iota_1) { - latest_char_state = iota_2; - } - else { - latest_char_state = generic_char; - } - break; - case 0x301: - if (latest_char_state == upsilon_2) { - ender = GREEK_SMALL_LETTER_UPSILON_WITH_DIALYTIKA_AND_TONOS; - goto do_tricky; - } - else if (latest_char_state == iota_2) { - ender = GREEK_SMALL_LETTER_IOTA_WITH_DIALYTIKA_AND_TONOS; - goto do_tricky; - } - latest_char_state = generic_char; - break; - - /* These are the tricky fold characters. Flush any - * buffer first. (When adding to this list, also should - * add them to fold_grind.t to make sure get tested) */ - case GREEK_SMALL_LETTER_UPSILON_WITH_DIALYTIKA_AND_TONOS: - case GREEK_SMALL_LETTER_IOTA_WITH_DIALYTIKA_AND_TONOS: - case LATIN_SMALL_LETTER_SHARP_S: - case LATIN_CAPITAL_LETTER_SHARP_S: - case 0x1FD3: /* GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA */ - case 0x1FE3: /* GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA */ - if (len != 0) { - p = oldp; - goto loopdone; - } - /* FALL THROUGH */ - do_tricky: { - char* const oldregxend = RExC_end; - U8 tmpbuf[UTF8_MAXBYTES+1]; - - /* Here, we know we need to generate a special - * regnode, and 'ender' contains the tricky - * character. What's done is to pretend it's in a - * [bracketed] class, and let the code that deals - * with those handle it, as that code has all the - * intelligence necessary. First save the current - * parse state, get rid of the already allocated - * but empty EXACT node that the ANYOFV node will - * replace, and point the parse to a buffer which - * we fill with the character we want the regclass - * code to think is being parsed */ - RExC_emit = orig_emit; - RExC_parse = (char *) tmpbuf; - if (UTF) { - U8 *d = uvchr_to_utf8(tmpbuf, ender); - *d = '\0'; - RExC_end = (char *) d; - } - else { /* ender above 255 already excluded */ - tmpbuf[0] = (U8) ender; - tmpbuf[1] = '\0'; - RExC_end = RExC_parse + 1; - } - - ret = regclass(pRExC_state,depth+1); - - /* Here, have parsed the buffer. Reset the parse to - * the actual input, and return */ - RExC_end = oldregxend; - RExC_parse = p - 1; - - Set_Node_Offset(ret, RExC_parse); - Set_Node_Cur_Length(ret); - nextchar(pRExC_state); - *flagp |= HASWIDTH|SIMPLE; - return ret; - } - } - } - + is_exactfu_sharp_s = (node_type == EXACTFU + && ender == LATIN_SMALL_LETTER_SHARP_S); if ( RExC_flags & RXf_PMf_EXTENDED) p = regwhite( pRExC_state, p ); - if (UTF && FOLD) { + if ((UTF && FOLD) || is_exactfu_sharp_s) { /* Prime the casefolded buffer. Locale rules, which apply * only to code points < 256, aren't known until execution, * so for them, just output the original character using @@ -9579,7 +9517,7 @@ tryagain: if (p < RExC_end && ISMULT2(p)) { /* Back off on ?+*. */ if (len) p = oldp; - else if (UTF) { + else if (UTF || is_exactfu_sharp_s) { if (FOLD) { /* Emit all the Unicode characters. */ STRLEN numlen; @@ -9615,7 +9553,7 @@ tryagain: } break; } - if (UTF) { + if (UTF || is_exactfu_sharp_s) { if (FOLD) { /* Emit all the Unicode characters. */ STRLEN numlen; @@ -494,6 +494,7 @@ struct regnode_charclass_class { #define REG_SEEN_VERBARG 0x00000080 #define REG_SEEN_CUTGROUP 0x00000100 #define REG_SEEN_RUN_ON_COMMENT 0x00000200 +#define REG_SEEN_EXACTF_SHARP_S 0x00000400 START_EXTERN_C diff --git a/t/re/re_tests b/t/re/re_tests index 2077843574..66ca57276a 100644 --- a/t/re/re_tests +++ b/t/re/re_tests @@ -1560,8 +1560,8 @@ abc\N{def - c - \\N{NAME} must be resolved by the lexer # Was matching 'ss' only and failing the entire match, not seeing the # alternative that would succeed -/s\xDF/ui \xDFs Ty $& \xDFs -/sst/ui s\N{LATIN SMALL LIGATURE ST} Ty $& s\N{LATIN SMALL LIGATURE ST} -/sst/ui s\N{LATIN SMALL LIGATURE LONG S T} Ty $& s\N{LATIN SMALL LIGATURE LONG S T} +/s\xDF/ui \xDFs y $& \xDFs +/sst/ui s\N{LATIN SMALL LIGATURE ST} y $& s\N{LATIN SMALL LIGATURE ST} +/sst/ui s\N{LATIN SMALL LIGATURE LONG S T} y $& s\N{LATIN SMALL LIGATURE LONG S T} # vim: softtabstop=0 noexpandtab |