diff options
author | Yves Orton <demerphq@gmail.com> | 2012-02-19 21:32:05 +0100 |
---|---|---|
committer | Yves Orton <demerphq@gmail.com> | 2012-03-03 13:27:28 +0100 |
commit | fab2782b37b5570d7f8f8065fd7d18621117ed49 (patch) | |
tree | 5c7ef3c0c69d4998c89e2469998092747584f68b /proto.h | |
parent | 2f137bbd018b7f86a6a557d3552cbb7a760bb43a (diff) | |
download | perl-fab2782b37b5570d7f8f8065fd7d18621117ed49.tar.gz |
rework how the trie logic handles the newer EXACT nodetypes
This cleans up and simplifies and extends how the trie
logic interacts with the new node types. This change ultimately
makes the EXACTFU, EXACTFU_SS, EXACTFU_NO_TRIE (renamed to
EXACTFU_TRICKYFOLD) work properly with the trie engine regardless
of whether the string is utf8 or latin1.
This patch depends on the following:
EXACT => utf8 or "binary" text
EXACTFU => either pre-folded utf8, or latin1 that has to be folded as though it was utf8
EXACTFU_SS => special case of EXACTFU to handle \xDF/ss (affects latin1 treatment)
EXACTFU_TRICKYFOLD => special case of EXACTFU to handle tricky non-latin1 fold rules
EXACTF => "old style fold logic" untriable nodetype
EXACTFA => (currently) untriable nodetype
EXACTFL => (currently) untriable nodetype
See the comments in regcomp.sym for these fold types.
This patch involves a number of distinct, but related parts. Starting
from compilation:
* Simplify how we detect a triable sequence given the new nodetypes,
this also probably fixed some "bugs" in how we detected certain
sequences, like /||foo|bar/.
* Simplify how we read EXACTFU nodes under utf8 by removing the now
redundant folding logic (EXACTFU nodes under utf8 are prefolded).
Also extend this logic to handle latin1 patterns properly (in
conjunction with other changes)
* Part of the problems associated with EXACTFU_SS and EXACTFU_TRICKYFOLD
have to do with how the trie logic interacts with the minlen logic.
This change handles both by pessimising the minlen when encounting
these nodetypes. One observation is that the minlen logic is basically
broken, and works only because it conflates bytes and codepoints in
such a way that we more or less always get a value small enough that things work out
anyway. Fixing that is properly is the job of another patch.
* Part of the problem of doing folding under unicode rules is that
there are a lot of foldings possible, some with strange rules. This
means that the bitmap logic does not work correctly in all cases,
as we currently do not have any way to populate it properly.
So this patch disables the bitmap entirely when folding is involved
until that is fixed.
The end result of this is: we can TRIE/AHOCORASICK any sequence of
EXACT, or EXACTFU (ish) nodes, regardless of utf8 or not, but we disable
the bitmap when folding.
A note for follow up relating to this patch is that the way EXACTFU_XXX
nodes are currently dealt with we wont build the "maximal" trie because
of their presence, instead creating a "jumptrie" consisting of either a
leading EXACTFU node followed by a EXACTFU_XXX node, or vice versa. We
should eventually address that.
Diffstat (limited to 'proto.h')
-rw-r--r-- | proto.h | 14 |
1 files changed, 8 insertions, 6 deletions
@@ -7114,12 +7114,6 @@ STATIC bool S_isa_lookup(pTHX_ HV *stash, const char * const name, STRLEN len, U #endif #if defined(PERL_IN_UTF8_C) -PERL_CALLCONV UV Perl__to_fold_latin1(pTHX_ const U8 c, U8 *p, STRLEN *lenp, const bool flags) - __attribute__nonnull__(pTHX_2) - __attribute__nonnull__(pTHX_3); -#define PERL_ARGS_ASSERT__TO_FOLD_LATIN1 \ - assert(p); assert(lenp) - STATIC UV S_check_locale_boundary_crossing(pTHX_ const U8* const p, const UV result, U8* const ustrp, STRLEN *lenp) __attribute__warn_unused_result__ __attribute__nonnull__(pTHX_1) @@ -7166,6 +7160,14 @@ PERL_CALLCONV UV Perl__to_upper_title_latin1(pTHX_ const U8 c, U8 *p, STRLEN *le assert(p); assert(lenp) #endif +#if defined(PERL_IN_UTF8_C) || defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) +PERL_CALLCONV UV Perl__to_fold_latin1(pTHX_ const U8 c, U8 *p, STRLEN *lenp, const bool flags) + __attribute__nonnull__(pTHX_2) + __attribute__nonnull__(pTHX_3); +#define PERL_ARGS_ASSERT__TO_FOLD_LATIN1 \ + assert(p); assert(lenp) + +#endif #if defined(PERL_IN_UTIL_C) STATIC bool S_ckwarn_common(pTHX_ U32 w); STATIC const COP* S_closest_cop(pTHX_ const COP *cop, const OP *o) |