author: Yves Orton <demerphq@gmail.com> 2012-02-19 21:32:05 +0100
committer: Yves Orton <demerphq@gmail.com> 2012-03-03 13:27:28 +0100
commit: fab2782b37b5570d7f8f8065fd7d18621117ed49
tree: 5c7ef3c0c69d4998c89e2469998092747584f68b /regcomp.sym
parent: 2f137bbd018b7f86a6a557d3552cbb7a760bb43a
rework how the trie logic handles the newer EXACT nodetypes
This cleans up, simplifies, and extends how the trie logic interacts with the new node types. This change ultimately makes the EXACTFU, EXACTFU_SS, and EXACTFU_NO_TRIE (renamed to EXACTFU_TRICKYFOLD) nodetypes work properly with the trie engine regardless of whether the string is utf8 or latin1.

This patch depends on the following:

    EXACT              => utf8 or "binary" text
    EXACTFU            => either pre-folded utf8, or latin1 that has to be folded as though it was utf8
    EXACTFU_SS         => special case of EXACTFU to handle \xDF/ss (affects latin1 treatment)
    EXACTFU_TRICKYFOLD => special case of EXACTFU to handle tricky non-latin1 fold rules
    EXACTF             => "old style fold logic" untriable nodetype
    EXACTFA            => (currently) untriable nodetype
    EXACTFL            => (currently) untriable nodetype

See the comments in regcomp.sym for these fold types.

This patch involves a number of distinct but related parts. Starting from compilation:

* Simplify how we detect a triable sequence given the new nodetypes. This also probably fixed some "bugs" in how we detected certain sequences, like /||foo|bar/.

* Simplify how we read EXACTFU nodes under utf8 by removing the now redundant folding logic (EXACTFU nodes under utf8 are prefolded). Also extend this logic to handle latin1 patterns properly (in conjunction with other changes).

* Part of the problems associated with EXACTFU_SS and EXACTFU_TRICKYFOLD have to do with how the trie logic interacts with the minlen logic. This change handles both by pessimising the minlen when encountering these nodetypes. One observation is that the minlen logic is basically broken, and works only because it conflates bytes and codepoints in such a way that we more or less always get a value small enough that things work out anyway. Fixing that properly is the job of another patch.

* Part of the problem of doing folding under unicode rules is that there are a lot of foldings possible, some with strange rules. This means that the bitmap logic does not work correctly in all cases, as we currently do not have any way to populate it properly. So this patch disables the bitmap entirely when folding is involved until that is fixed.

The end result of this is: we can TRIE/AHOCORASICK any sequence of EXACT, or EXACTFU (ish) nodes, regardless of utf8 or not, but we disable the bitmap when folding.

A note for follow-up relating to this patch: the way EXACTFU_XXX nodes are currently dealt with, we won't build the "maximal" trie because of their presence, instead creating a "jumptrie" consisting of either a leading EXACTFU node followed by an EXACTFU_XXX node, or vice versa. We should eventually address that.
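
The rules described in the message can be modelled in a few lines. The following is a minimal Python sketch, not perl's C implementation: the names `TRIE_FAMILY`, `trie_family`, `plan_trie` and `fold_lengths` are hypothetical and exist only for illustration. It assumes that all EXACTFU-ish nodetypes can be treated as one family (as the "end result" sentence suggests) and ignores the "jumptrie" limitation noted in the follow-up. The second half uses Python's full case folding to show why a minlen taken from the pattern text has to be pessimised for the \xDF/ss and tricky-fold cases.

```python
# Illustrative model only; perl implements this logic in C (regcomp.c).
# The mapping follows the nodetype table in the commit message.
TRIE_FAMILY = {
    "EXACT": "EXACT",
    "EXACTFU": "EXACTFU",
    "EXACTFU_SS": "EXACTFU",
    "EXACTFU_TRICKYFOLD": "EXACTFU",
    # EXACTF, EXACTFA and EXACTFL are left out: (currently) untriable.
}


def trie_family(nodetype):
    """Return the trie family of a nodetype, or None if it is untriable."""
    return TRIE_FAMILY.get(nodetype)


def plan_trie(nodetypes):
    """Hypothetical helper: decide whether a run of alternatives can share
    one trie, and whether the start-byte bitmap may be used for it."""
    families = {trie_family(t) for t in nodetypes}
    if None in families or len(families) != 1:
        return None  # empty, mixed, or contains an untriable nodetype
    family = families.pop()
    # Per the commit message, the bitmap is disabled whenever folding
    # is involved, so only plain EXACT tries keep it.
    return {"family": family, "use_bitmap": family == "EXACT"}


def fold_lengths(s):
    """Return (codepoints before folding, codepoints after full folding)."""
    return len(s), len(s.casefold())


if __name__ == "__main__":
    print(plan_trie(["EXACT", "EXACT"]))
    print(plan_trie(["EXACTFU", "EXACTFU_SS"]))
    print(plan_trie(["EXACT", "EXACTFL"]))
    # \xDF folds to "ss": a one-character subject can match a
    # two-character folded pattern, so a minlen of 2 would be too large.
    print(fold_lengths("\xdf"))
    # U+0390 folds to three codepoints, an example of a "tricky" fold.
    print(fold_lengths("\u0390"))
```

Because `"\xdf".casefold()` equals `"ss"`, a minimum length derived from the literal `ss` would wrongly reject a one-character subject, which is the kind of case the patch sidesteps by pessimising minlen.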
Diffstat (limited to 'regcomp.sym')
-rw-r--r-- regcomp.sym 4
1 files changed, 2 insertions, 2 deletions
diff --git a/regcomp.sym b/regcomp.sym
index f0ee701991..e6be6ac292 100644
--- a/regcomp.sym
+++ b/regcomp.sym
@@ -92,14 +92,14 @@ BRANCH BRANCH, node 0 V ; Match this alternative, or the next...
# not used
BACK BACK, no 0 V ; Match "", "next" ptr points backward.
-#*Literals
+#*Literals - NOTE the relative ordering of these types is important do not change it
EXACT EXACT, str ; Match this string (preceded by length).
EXACTF EXACT, str ; Match this non-UTF-8 string (not guaranteed to be folded) using /id rules (w/len).
EXACTFL EXACT, str ; Match this string (not guaranteed to be folded) using /il rules (w/len).
EXACTFU EXACT, str ; Match this string (folded iff in UTF-8, length in folding doesn't change if not in UTF-8) using /iu rules (w/len).
EXACTFU_SS EXACT, str ; Match this string (folded iff in UTF-8, length in folding may change even if not in UTF-8) using /iu rules (w/len).
-EXACTFU_NO_TRIE EXACT, str ; Match this folded UTF-8 string using /iu rules, but don't generate a trie for it
+EXACTFU_TRICKYFOLD EXACT, str ; Match this folded UTF-8 string using /iu rules, but don't generate a trie for it
EXACTFA EXACT, str ; Match this string (not guaranteed to be folded) using /iaa rules (w/len).
#*Do nothing types