author | Karl Williamson <khw@cpan.org> | 2018-12-19 11:21:28 -0700 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2018-12-26 12:50:37 -0700 |
commit | 0ea669f4e37ccfbcd5ad708ca625ec17bf22e5b3 (patch) | |
tree | 8c8766a5d7fd7dc7cde698caa828e97d16a272c0 /regcomp.sym | |
parent | 627a7895564679975632d9b637b27e9c09d3d985 (diff) | |
download | perl-0ea669f4e37ccfbcd5ad708ca625ec17bf22e5b3.tar.gz |
Collapse regnode EXACTFU_SS into EXACTFUP
EXACTFUP was created by the previous commit to handle a problematic case
in which not all the code points in an EXACTFU node are /i foldable at
compile time. Doing so will allow a future commit to use the pre-folded
EXACTFU nodes (done in a prior commit), saving execution time for the
common case. The only problematic code point is the MICRO SIGN. Most
patterns don't use this character.
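
As a rough illustration of why MICRO SIGN stands out, the sketch below uses Python's `str.casefold()` as a stand-in for full Unicode case folding. This is not Perl core code, and the helper name is hypothetical. It assumes the difficulty is that the fold of MICRO SIGN lies above the Latin-1 range, so a non-UTF-8 node cannot store the folded form.

```python
# Illustration only: Python's casefold() stands in for Unicode full case
# folding; none of this is taken from the Perl core.
MICRO_SIGN = "\u00b5"

# MICRO SIGN folds to GREEK SMALL LETTER MU (U+03BC), which is above 0xFF.
folded = MICRO_SIGN.casefold()

def latin1_chars_folding_outside_latin1():
    """Hypothetical helper: list Latin-1 characters whose full case fold
    contains at least one code point above 0xFF."""
    result = []
    for cp in range(0x100):
        ch = chr(cp)
        if any(ord(c) > 0xFF for c in ch.casefold()):
            result.append(ch)
    return result

print(hex(ord(folded)))
print(latin1_chars_folding_outside_latin1())
```

Under these assumptions, MICRO SIGN is the only character the helper reports, which is consistent with the statement above that it is the only problematic code point.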
EXACTFU_SS is problematic in a different way. It contains the sequence
'ss', which is what LATIN SMALL LETTER SHARP S folds to, but everything in
it can be pre-folded (unless it also contains a MICRO SIGN). The reason
this is problematic is that it is the only non-UTF-8 node where the
length in folding can change. To process it at runtime, the more
general fold equivalence function is used that is capable of handling
length disparities, but is slower than the functions otherwise used for
non-UTF-8.
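
The length disparity can be sketched as follows. Both functions are hypothetical stand-ins written in Python for illustration. They are not Perl's foldEQ routines, and `str.casefold()` is assumed to approximate the /iu folding rules. The first compares position by position and so requires equal lengths. The second compares the fully folded strings and tolerates a fold that changes length.

```python
# Illustration only: hypothetical helpers, not Perl's fold-equivalence code.

def fold_eq_same_length(a: str, b: str) -> bool:
    """Position-by-position comparison; only valid when folding a
    character never changes the length of the string."""
    if len(a) != len(b):
        return False
    return all(x.casefold() == y.casefold() for x, y in zip(a, b))

def fold_eq_general(a: str, b: str) -> bool:
    """Compare the complete folds, so one character may correspond to
    several (SHARP S folds to 'ss')."""
    return a.casefold() == b.casefold()

pattern = "strasse"       # contains the 'ss' sequence
target = "stra\u00dfe"    # contains LATIN SMALL LETTER SHARP S

print(fold_eq_same_length(pattern, target))  # lengths differ (7 vs 6)
print(fold_eq_general(pattern, target))
```

The simpler function rejects the pair outright because the lengths differ, while the general one accepts it. That is the trade-off described above: handling 'ss' requires the more general and slower comparison.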
What I've chosen to do for now is to make a single node type for all the
problematic cases (which at this time means just the two aforementioned
ones). If we didn't do this, we'd have to add a third node type for
patterns that contain both 'ss' and MICRO SIGN. The alternative would be
to artificially split the pattern so the two are never in the same node,
but that can cause bugs in handling multi-character folds. If more
special handling is found to be needed, there'd be a combinatorial
explosion of additional node types to handle all possible combinations.
What this effectively means is that the slower, more general foldEQ
function is used for portions of patterns containing the MICRO SIGN when
the pattern isn't in UTF-8, even though there is no inherent reason to
do so for non-UTF-8 strings that don't also contain the 'ss' sequence.
Diffstat (limited to 'regcomp.sym')
-rw-r--r-- | regcomp.sym | 1 |
1 files changed, 0 insertions, 1 deletions
diff --git a/regcomp.sym b/regcomp.sym
index bdbe059cc5..8033a138d2 100644
--- a/regcomp.sym
+++ b/regcomp.sym
@@ -107,7 +107,6 @@ EXACTFAA EXACT, str ; Match this string using /iaa rules (w/len) (stri
 # End of important relative ordering.
-EXACTFU_SS EXACT, str ; Match this string using /iu rules (w/len); (string not UTF-8, only portions guaranteed to be folded; folded length > unfolded).
 EXACTFUP EXACT, str ; Match this string using /iu rules (w/len); (string not UTF-8, not guaranteed to be folded; and its Problematic).
 # In order for a non-UTF-8 EXACTFAA to think the pattern is pre-folded when
 # matching a UTF-8 target string, there would have to be something like an