path: root/regnodes.h
Commit message | Author | Age | Files | Lines
* clear savestack on (?{...}) failure and backtrack | David Mitchell | 2017-02-14 | 1 | -79/+85

  RT #126697

  In a regex, after executing a (?{...}) code block, if we fail and backtrack over the code block, we're supposed to unwind the savestack, so that, for example, any local()s within the code block are undone.

  It turns out that a backtracking state isn't pushed for (?{...}), only for postponed evals (i.e. (??{...})). This means that it relies on one of the earlier backtracking states to clear the savestack on its behalf. This can't always be relied upon, and the ticket above contains code where this falls down; in particular:

      'ABC' =~ m{
          \A
          (?:
              (?: AB | A | BC )
              (?{
                  local $count = $count + 1;
                  print "! count=$count; ; pos=${\pos}\n";
              })
          )*
          \z
      }x

  Here we end up relying on TRIE_next to do the cleaning up, but TRIE_next doesn't, since there's nothing it would be responsible for that needs cleaning up.

  The solution to this is to push a backtrack state for every (?{...}) as well as every (??{...}). The sole job of that state is to do a LEAVE_SCOPE(ST.lastcp).

  The existing backtrack state EVAL_AB has been renamed EVAL_postponed_AB to make it clear it's only used on postponed /(??{A})B/ regexes, and a new state has been added, EVAL_B, which is only called when backtracking after failing something in the B in /(?{...})B/.
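The unwinding that the new backtrack state performs can be pictured with a toy save stack. This is only a sketch, not perl's implementation: `toy_local`, `toy_mark`, `toy_leave_scope`, and the fixed-size array are hypothetical names invented for illustration, with `toy_leave_scope(mark)` standing in for the role the commit gives to LEAVE_SCOPE(ST.lastcp).

```c
#include <stddef.h>

/* Toy model only: each entry remembers where a value lived and what it was. */
struct toy_save { int *addr; int old; };

static struct toy_save toy_stack[64];
static size_t toy_top = 0;

/* Like local(): record the old value, then assign the new one.
 * Returns 0 on success, -1 if the toy stack is full. */
static int toy_local(int *addr, int new_value)
{
    if (toy_top >= sizeof toy_stack / sizeof toy_stack[0])
        return -1;
    toy_stack[toy_top].addr = addr;
    toy_stack[toy_top].old  = *addr;
    toy_top++;
    *addr = new_value;
    return 0;
}

/* Remember the current stack height before running a code block. */
static size_t toy_mark(void)
{
    return toy_top;
}

/* Undo everything saved since `mark`, newest first. */
static void toy_leave_scope(size_t mark)
{
    while (toy_top > mark) {
        toy_top--;
        *toy_stack[toy_top].addr = toy_stack[toy_top].old;
    }
}
```

In this model, a backtrack state that stores the value of `toy_mark()` taken before the code block ran can restore every localized value by itself, without depending on an earlier state to do so.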
* Unify GOSTART and GOSUB | Yves Orton | 2016-03-06 | 1 | -40/+36

  GOSTART is a special case of GOSUB; by unifying them we can remove a lot of offset twiddling and other special casing, at pretty much no cost.

  GOSUB has 2 arguments, ARG() and ARG2L(), which are interpreted as a U32 and an I32 respectively. ARG() holds the "parno" we will recurse into. ARG2L() holds a signed offset to the relevant start node for the recursion.

  Prior to this patch the argument to GOSUB would always be >=, and unlike other parts of our logic we would not use 0 to represent "start/end" of pattern, as GOSTART would be used for "recurse to beginning of pattern". After this patch we use 0 to represent "start/end", and a lot of complexity "goes away" along with GOSTART regops.
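A minimal sketch of the two-argument layout described above. The struct and function names are hypothetical, and the sketch assumes the signed offset is measured from the GOSUB node itself, which the commit message does not state.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two GOSUB arguments. */
struct toy_gosub {
    uint32_t parno;   /* plays the role of ARG(): group to recurse into */
    int32_t  offset;  /* plays the role of ARG2L(): signed offset to the start node */
};

/* Index of the node to jump to, given the index of the GOSUB node.
 * Assumption: the offset is relative to the GOSUB node. */
static int64_t toy_gosub_target(int64_t gosub_index, struct toy_gosub g)
{
    return gosub_index + g.offset;
}

/* After unification, "recurse to the start of the pattern" is parno == 0. */
static int toy_is_whole_pattern(struct toy_gosub g)
{
    return g.parno == 0;
}
```

With 0 reserved for "start/end", no separate opcode is needed for whole-pattern recursion; the same node type covers both cases.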
* Cleanup, document, and restructure regen/regcomp.pl | Yves Orton | 2015-10-05 | 1 | -1/+3

  We clean up the parsing code, replacing our set of arrays of properties with an array of hashes of properties, with utility subs registering new items, etc.

  We also split up the output code into a set of subs, one sub per output "blob" (generally a var definition), so that we have some visibility of the higher level structure of our output code.

  With this patch visibility of the structure of what we generate emerges from the nest of here docs. :-)

  Note this change does not (greatly) alter regcomp.sym or perldebguts.pod; it merely cleans up and, generally speaking, modernizes and, most importantly, documents the code.
* fix perl #126186 make all verbs allow an optional arg | Yves Orton | 2015-10-05 | 1 | -4/+4

  In perl #126186 it was pointed out we had started allowing name arguments for verbs where we did not document them to be supported, albeit in an inconsistent way.

  The previous patch cleaned up some of the cause of this, but it seems better to just generally allow the existing verbs to all support a mark name argument.

  So this patch reverses the effect of the previous patch, and makes all verbs, FAIL, ACCEPT, etc, allow an optional argument, and set REGERROR/REGMARK appropriately as well.
* Add ANYOFD regex node | Karl Williamson | 2015-08-24 | 1 | -154/+159

  This is like an ANYOF node, but just for when /d is in effect. It will be used in future commits.
* perldebguts: Add clarification | Karl Williamson | 2015-08-24 | 1 | -1/+1
* remove deprecated /\C/ RE character class | David Mitchell | 2015-06-19 | 1 | -163/+157

  This horrible thing broke encapsulation and was as buggy as a very buggy thing. It's been officially deprecated since 5.20.0 and now it can finally die die die!!!!
* regcomp.sym: Update \b descriptions | Karl Williamson | 2015-03-18 | 1 | -7/+7
* Add qr/\b{gcb}/ | Karl Williamson | 2015-02-19 | 1 | -8/+8

  A function implements seeing if the space between any two characters is a grapheme cluster break. After I wrote this, I realized that an array lookup might be a better implementation, but the deadline for v5.22 was too close to change it. I did see that my gcc optimized it down to an array lookup.

  This makes the implementation of \X go from being complicated to trivial.
* Reserve a bit for the 're strict' subpragma. | Karl Williamson | 2015-01-13 | 1 | -3/+3

  This is another step in the process.
* Add regex nodes for locale | Karl Williamson | 2014-12-29 | 1 | -148/+163

  These will be used in a future commit to distinguish between /l patterns vs non-/l.
* Create bit for /n. | Karl Williamson | 2014-12-28 | 1 | -7/+7
* Eliminate unused BACK regnode | Aaron Crane | 2014-09-29 | 1 | -132/+127
* Make space for /xx flag | Karl Williamson | 2014-09-29 | 1 | -7/+7

  This doesn't actually use the flag yet. We no longer have to make version-dependent changes to ext/Devel-Peek/t/Peek.t, (it being in /ext) so this doesn't
* Up regex flags limit for (??{}) | Karl Williamson | 2014-09-29 | 1 | -1/+1

  Previously the regex pattern compilation flags needed for this construct would fit into an 8-bit byte. This conveniently fits into the flags structure element of a regnode. There are changes coming that require more than 8 bits, so in preparation, this commit adds an argument to the node that implements (??{}) (31-bits usable for flags), and moves the storage to that.
* regcomp.sym: ANYOF nodes have an argument | Karl Williamson | 2014-09-29 | 1 | -1/+1

  Plus a bitmap, but they always have an argument besides, contrary to what was specified here. Future commits rely on this, whereas heretofore this error was harmless.
* regexp.h: Remove unused bit placeholders | Karl Williamson | 2014-09-29 | 1 | -7/+7

  We do not need a placeholder for unused flag bits. And removing them makes the generated regnodes.h more accurate as to what bits are available.
* regexp.h: Move regex flag bit positions. | Karl Williamson | 2014-09-29 | 1 | -6/+6

  This moves three bits to create a block of unused bits at the beginning. The first bit had to be moved to make space for other uses that are coming in future commits. This breaks binary compatibility, so might as well move the other two bits so that all the unused bits are consolidated at the beginning.

  This pool of unused bits is the boundary between the bits that are common to op.h and regexp.h (and in op_reg_common.h) and those that are separate. It's best to have all the unused bits there, so when we need to use one, it can be taken from either side, as needed, without us being trapped into having an available bit, but of the wrong kind.
* Eliminate the duplicative regops BOL and EOL | Yves Orton | 2014-09-17 | 1 | -207/+198

  See also perl5porters thread titled: "Perl MBOLism in regex engine"

  In the perl 5.000 release (a0d0e21ea6ea90a22318550944fe6cb09ae10cda) the BOL regop was split into two behaviours MBOL and SBOL, with SBOL and BOL behaving identically. Similarly the EOL regop was split into two behaviors SEOL and MEOL, with EOL and SEOL behaving identically.

  This then resulted in various duplicative code related to flags and case statements in various parts of the regex engine.

  It appears that perhaps BOL and EOL were kept because they are the type ("regkind") for SBOL/MBOL and SEOL/MEOL/EOS. Reworking regcomp.pl to handle aliases for the type data so that SBOL/MBOL are of type BOL, even though BOL == SBOL seems to cover that case without adding to the confusion.

  This means two regops, a regstate, and an internal regex flag can be removed (and used for other things), and various logic relating to them can be removed.

  For the uninitiated, SBOL is /^/ and /\A/ (with or without /m) and MBOL is /^/m. (I consider it a fail we have no way to say MBOL without the /m modifier.)

  Similarly SEOL is /$/ and MEOL is /$/m (there is also a /\z/ which is EOS "end of string" with or without the /m).
* Fix for Coverity perl5 CID 29034: Out-of-bounds read (OVERRUN) | Jarkko Hietaniemi | 2014-04-30 | 1 | -0/+8

  overrun-local: Overrunning array PL_reg_intflags_name of 14 8-byte elements at element index 31 (byte offset 248) using index bit (which evaluates to 31).

  Needed compile-time limits for the PL_reg_intflags_name so that the bit loop doesn't waltz off past the array. Could not use C_ARRAY_LENGTH because the size of name array is not visible during compile time (only const char*[] is), so modified regcomp.pl to generate the size, made it visible only under DEBUGGING.

  Did extflags analogously even though its size is currently exactly 32 already. The sizeof(flags)*8 is extra paranoia for ILP64.
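The kind of bounded bit loop this fix calls for might look like the following sketch. The table contents and the `TOY_` names are hypothetical (the real table and its size are generated by regcomp.pl), and the sketch uses CHAR_BIT where the commit message writes `sizeof(flags)*8`.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name table, deliberately shorter than 32 entries. */
static const char *const toy_flag_names[] = { "FLAG_A", "FLAG_B", "FLAG_C" };
#define TOY_FLAG_NAMES_SIZE (sizeof toy_flag_names / sizeof toy_flag_names[0])

/* Count the set bits that have a name, never indexing past the table
 * and never shifting past the width of the flags word. */
static size_t toy_count_named_flags(uint32_t flags)
{
    size_t limit = sizeof(flags) * CHAR_BIT;
    size_t bit, named = 0;

    if (limit > TOY_FLAG_NAMES_SIZE)
        limit = TOY_FLAG_NAMES_SIZE;

    for (bit = 0; bit < limit; bit++) {
        if ((flags & ((uint32_t)1 << bit)) && toy_flag_names[bit])
            named++;
    }
    return named;
}
```

Here the array is visible, so its length can be computed in place; the commit describes the case where only a `const char*[]` declaration is visible, which is why the size had to be generated separately.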
* Change 'semantics' to 'rules' | Karl Williamson | 2014-02-20 | 1 | -12/+12

  The term 'semantics' in documentation when applied to character sets is changed to 'rules' as being a shorter less-jargony synonym in this case. This was discussed several releases ago, but I didn't get around to it.
* Revert "Free up bit for regex ANYOF nodes" | Karl Williamson | 2014-02-15 | 1 | -155/+150

  This reverts commit 34fdef848b1687b91892ba55e9e0c3430e0770f6, and adds comments referring to it, in case it is ever needed.
* Free up bit for regex ANYOF nodes | Karl Williamson | 2014-02-15 | 1 | -150/+155

  This commit frees up a bit by using an extra regnode to pass the information to the regex engine instead of the flag. I originally thought that if this was needed, it should be the ANYOF_ABOVE_LATIN1_ALL bit, as that might speed some things up. But if we need to do this again by adding another node to get another bit, we want one that is mutually exclusive of the first one we did, for otherwise we start having to make 3 nodes instead of two to get the combinations:

      1 0
      0 1
      1 1

  This combinatorial problem is avoided by using bits that are mutually exclusive, which the ABOVE_LATIN1_ALL isn't, but the one freed by this commit, ANYOF_NON_UTF8_NON_ASCII_ALL, is only set under /d matching, and there are other bits that are set only under /l, so if we need to do this again, we should use one of those.

  I wrote this code when I thought I really needed a bit. But since then, I have figured out a better way to get the bit needed now. But I don't want to lose this code to posterity, so this commit is being made long enough to get the commit number, then it will be reverted, adding comments referring to the commit number, so that it can easily be reconstructed when necessary.
* Add RXf_UNBOUNDED_QUANTIFIER and regexp->maxlen | Yves Orton | 2014-02-03 | 1 | -1/+1

  The flag tells us that a pattern may match an infinitely long string.

  The new member in the regexp struct tells us how long the string might be.

  With these two items we can implement regexp based $/.
* Move the RXf_ANCH flags to intflags as PREGf_ANCH_xxx and add RXf_IS_ANCHORED as a replacement | Yves Orton | 2014-01-31 | 1 | -8/+12

  The only requirement outside of the regex engine is to identify that there is an anchor involved at all. So we move the 4 anchor flags to intflags and replace it with a single aggregate flag RXf_IS_ANCHORED in extflags.

  This frees up another 3 bits in extflags.
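The trade described here, four detailed bits kept internally and one summary bit exposed, can be sketched as follows. All names and bit positions below are hypothetical; the real flags are defined in perl's headers and are not reproduced here.

```c
#include <stdint.h>

/* Hypothetical internal anchor bits (stand-ins for PREGf_ANCH_xxx). */
#define TOY_INT_ANCH_A    (1u << 0)
#define TOY_INT_ANCH_B    (1u << 1)
#define TOY_INT_ANCH_C    (1u << 2)
#define TOY_INT_ANCH_D    (1u << 3)
#define TOY_INT_ANCH_MASK (TOY_INT_ANCH_A | TOY_INT_ANCH_B | \
                           TOY_INT_ANCH_C | TOY_INT_ANCH_D)

/* Hypothetical external summary bit (stand-in for RXf_IS_ANCHORED). */
#define TOY_EXT_IS_ANCHORED (1u << 0)

struct toy_regexp {
    uint32_t extflags;  /* visible outside the engine */
    uint32_t intflags;  /* private to the engine */
};

/* Record a specific anchor internally and keep the summary bit in sync. */
static void toy_set_anchor(struct toy_regexp *rx, uint32_t anchor_bit)
{
    rx->intflags |= (anchor_bit & TOY_INT_ANCH_MASK);
    if (rx->intflags & TOY_INT_ANCH_MASK)
        rx->extflags |= TOY_EXT_IS_ANCHORED;
}

/* Code outside the engine only ever asks this question. */
static int toy_is_anchored(const struct toy_regexp *rx)
{
    return (rx->extflags & TOY_EXT_IS_ANCHORED) != 0;
}
```

Four external bits become one, which matches the net saving of three bits stated in the commit message.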
* move RXf_GPOS_SEEN and RXf_GPOS_FLOAT to intflags | Yves Orton | 2014-01-31 | 1 | -6/+8

  This required removing the RXf_GPOS_CHECK mask as it uses one flag that will stay in extflags for now (RXf_ANCH_GPOS), and one flag that moves to intflags (RXf_GPOS_SEEN). This mask is strange however, as you can't have RXf_ANCH_GPOS without having RXf_GPOS_SEEN so I don't know why we test both. Further investigation required.
* Rename RXf_CANY_SEEN to PREGf_CANY_SEEN and move from extflags to intflags | Yves Orton | 2014-01-31 | 1 | -2/+4
* Use bit instead of node for regex SSC | Karl Williamson | 2014-01-22 | 1 | -155/+150

  The flag bits in regular expression ANYOF nodes are perennially in short supply. However there are still plenty of regex nodes possible. So one solution to needing to pass more information is to create a node that encapsulates what is needed. That is what commit 9aa1e39f96ac28f6ce5d814d9a1eccf1464aba4a did to tell regexec.c that a particular ANYOF node is for the synthetic start class (SSC).

  However this solution introduces other issues. If you have to express two things, then you need a regnode for A, a regnode for B, a regnode for both A and B, and another regnode for neither A nor B. With three things, you need 8 regnodes to express all possible combinations. This becomes unwieldy to write code for. The number of combinations goes way down if some of them are mutually exclusive.

  At the time of that commit, I thought that a SSC need not ever warn if matching against an above-Unicode code point. I was wrong, and that has been corrected earlier in the 5.19 series.

  But it finally came to me how to tell regexec that an ANYOF node is for the SSC without taking up a flag bit and without requiring a regnode type. The 'next_off' field in a regnode tells the engine the offset in the regex program to the node it's supposed to go to after processing this one. Since the SSC stands alone, its 'next_off' field is unused, and we can put anything we want in it. That, however, is not true of other ANYOF regnodes. But it turns out that there are certain values that will never be legitimate in the 'next_off' field in these, and so this commit uses one of those to signal that this ANYOF field is an SSC. regnodes come in various sizes, and the offset is in terms of how many of the smallest ones are there to the next node to look at. Since ANYOF nodes are large, the offset is always > 1, and so this commit uses 1 to indicate an SSC.
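The sentinel trick can be sketched with a toy node header. The struct layout, the `TOY_ANYOF` opcode value, and the helper names are assumptions made for illustration; only the idea of storing 1 in `next_off` comes from the commit message.

```c
#include <stdint.h>

/* Toy node header; the field widths are assumptions, not perl's layout. */
struct toy_regnode {
    uint8_t  flags;
    uint8_t  type;
    uint16_t next_off;  /* distance to the next node, in smallest-node units */
};

#define TOY_ANYOF      1  /* hypothetical opcode number */
#define TOY_SSC_MARKER 1  /* never a legitimate next_off for a large ANYOF node */

/* A stand-alone SSC never follows next_off, so the field can carry the marker. */
static void toy_mark_as_ssc(struct toy_regnode *n)
{
    n->next_off = TOY_SSC_MARKER;
}

/* An ANYOF node is an SSC exactly when next_off holds the marker value. */
static int toy_is_ssc(const struct toy_regnode *n)
{
    return n->type == TOY_ANYOF && n->next_off == TOY_SSC_MARKER;
}
```

The design choice is that no flag bit and no extra node type is spent: the information rides in a field whose normal range of values leaves a gap.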
* Convert regnode to a flag for [...] | Karl Williamson | 2013-12-31 | 1 | -157/+152

  Prior to this commit, there were 3 types of ANYOF nodes; now there are two: regular, and one for the synthetic start class (ssc). This commit converted the third type dealing with warning about matching \p{} against non-Unicode code points, into using the spare flag bit for ANYOF nodes.

  This allows this bit to apply to ssc ANYOF nodes, whereas previously it couldn't. There is a bug in which the warning isn't raised if the match is rejected by the optimizer, because of this inability. This bug will be fixed in a later commit.

  Another option would have been to create a new node-type which was an ANYOF_SSC_WARN_SUPER node. But this adds extra complications to things; and we have a spare bit that we might as well use. The comments give better possibilities for freeing up 2 bits should they be needed.
* Allow trie use for /iaa matching | Karl Williamson | 2013-08-29 | 1 | -111/+116

  This adds code so that tries can be formed under /iaa, which formerly weren't handled. A problem occurs when the string contains the LATIN SMALL LETTER SHARP S when the regex pattern is not UTF-8 encoded. I tried several ways to get this to work easily, but ended up deciding it was too hard, so in this one situation, a new regnode is created to prevent the trie code from even trying to turn it into a trie.
* Remove newly unnecessary regnode, code | Karl Williamson | 2013-08-29 | 1 | -116/+111

  The previous commit fixed things up so that this work-around regnode doesn't have to exist; nor the work around for the EXACTFU_SS regnode.
* Fix and add tests for *PRUNE/*THEN plus leading non-greedy + | Yves Orton | 2013-06-22 | 1 | -6/+6

  "aaabc" should match /a+?(*THEN)bc/ with "abc".
* Show intflags as well as extflags | Yves Orton | 2013-06-22 | 1 | -0/+15
* rework split() special case interaction with regex engine | Yves Orton | 2013-03-27 | 1 | -5/+5

  This patch resolves several issues at once. The parts are sufficiently interconnected that it is hard to break it down into smaller commits. The tickets open for these issues are:

      RT #94490  - split and constant folding
      RT #116086 - split "\x20" doesn't work as documented

  It additionally corrects some issues with cached regexes that were exposed by the split changes (and applied to them).

  It effectively reverts 5255171e6cd0accee6f76ea2980e32b3b5b8e171 and cccd1425414e6518c1fc8b7bcaccfb119320c513.

  Prior to this patch the special RXf_SKIPWHITE behavior of

      split(" ", $thing)

  was only available if Perl could resolve the first argument to split at compile time, meaning under various arcane situations.

  This manifested as oddities like

      my $delim = $cond ? " " : qr/\s+/;
      split $delim, $string;

  and

      split $cond ? " " : qr/\s+/, $string

  not behaving the same as:

      ($cond ? split(" ", $string) : split(/\s+/, $string))

  which isn't very convenient.

  This patch changes this by adding a new flag to the op_pmflags, PMf_SPLIT, which enables pp_regcomp() to know whether it was called as part of split, which allows the RXf_SPLIT to be passed into run time regex compilation. We also preserve the original flags so pattern caching works properly, by adding a new property to the regexp structure, "compflags", and related macros for accessing it. We preserve the original flags passed into the compilation process, so we can compare when we are trying to decide if we need to recompile.

  Note that this is essentially the opposite fix from the one applied originally to fix #94490 in 5255171e6cd0accee6f76ea2980e32b3b5b8e171. The reverted patch was meant to make:

      split( 0 || " ", $thing )            #1

  consistent with

      my $x=0; split( $x || " ", $thing )  #2

  and not with

      split( " ", $thing )                 #3

  This was reverted because it broke C<split("\x{20}", $thing)>, and because one might argue that it is not that #1 does the wrong thing, but rather that the behavior of #2 is wrong. In other words we might expect that all three should behave the same as #3, and that instead of "fixing" the behavior of #1 to be like #2, we should really fix the behavior of #2 to behave like #3. (Which is what we did.)

  Also, it doesn't make sense to move the special case detection logic further from the regex engine. We really want the regex engine to decide this stuff itself, otherwise split " ", ... wouldn't work properly with an alternate engine. (Imagine we add a special regexp meta pattern that behaves the same as " " does in a split /.../. For instance we might make split /(*SPLITWHITE)/ trigger the same behavior as split " ".)

  The other major change as result of this patch is it effectively reverts commit cccd1425414e6518c1fc8b7bcaccfb119320c513, which was intended to get rid of RXf_SPLIT and RXf_SKIPWHITE, and free up bits in the regex flags structure.

  But we don't want to get rid of these vars, and it turns out that RXf_SEEN_LOOKBEHIND is used only in the same situation as the new RXf_MODIFIES_VARS. So I have renamed RXf_SEEN_LOOKBEHIND to RXf_NO_INPLACE_SUBST, and then instead of using two vars we use only the one. Which in turn allows RXf_SPLIT and RXf_SKIPWHITE to have their bits back.
* Improve how regcomp.pl handles multibits | Yves Orton | 2013-03-27 | 1 | -3/+3

  In preparation for future changes.
* Add new regnode for synthetic start class | Karl Williamson | 2012-12-28 | 1 | -150/+155

  This creates a regnode specifically for the synthetic start class, which is a type of ANYOF node. The flag bit previously used to denote this is removed. This paves the way for this bit to be freed up, but first the other use of this bit must also be removed, which will be done in the next commit.

  There are now three ANYOF-type regnodes. This one should be called only in one place in regexec.c. The other special one is ANYOF_WARN_SUPER. A synthetic start class node should not do any warning, so there is no issue of having something need to be both types.
* Free up regex ANYOF bit. | Karl Williamson | 2012-12-28 | 1 | -150/+155

  This uses a regnode type, of which we have many available, to free up a bit in the ANYOF regnode flag field, of which we have none, and are trying to have the same bit do double duty. This will enable us to remove some of that double duty in the next commit.
* Regenerate the regnode table in perldebguts.pod automatically | Father Chrysostomos | 2012-12-22 | 1 | -2/+2
* Consolidate some regex OPS | Karl Williamson | 2012-12-22 | 1 | -293/+150

  The regular expression operation POSIXA works on any of the (currently) 16 posix classes (like \w and [:graph:]) under the regex modifier /a. This commit creates similar operations for the other modifiers: POSIXL (for /l), POSIXD (for /d), POSIXU (for /u), plus their complements. It causes these ops to be generated instead of the ALNUM, DIGIT, HORIZWS, SPACE, and VERTWS ops, as well as all their variants. The net saving is 22 regnode types.

  The reason to do this is for maintenance. As of this commit, there are now 22 fewer node types for which code has to be maintained. The code for each variant was essentially the same logic, but on different operands. It would be easy to make a change to one copy and forget to make the corresponding change in the others. Indeed, this patch fixes [perl #114272] in which one copy was out of sync with others.

  This patch actually reduces the number of separate code paths to 5: POSIXA, NPOSIXA, POSIXL, POSIXD, and POSIXU. The complements of the last 3 use the same code path as their non-complemented version, except that a variable is initialized differently. The code then XORs this variable with its result to do the complementing or not. Further, the POSIXD branch now just checks if the target string being matched is UTF-8 or not, and then jumps to either the POSIXU or POSIXA code respectively. So, there are effectively only 4 cases that are coded: POSIXA, NPOSIXA, POSIXL, and POSIXU. (POSIXA doesn't have to worry about UTF-8, while NPOSIXA does, hence these for efficiency are coded separately.)

  Removing all this code saves memory. The output of the Linux size command shows that the perl executable was shrunk by 33K bytes on my platform compiled under -O0 (.7%) and by 18K bytes (1.3%) under -O2.

  The reason this patch was doable was previous work in numbering the POSIX classes, so that they could be indexed in arrays and bit positions. This is a large patch; I didn't see how to break it into smaller components.

  I chose to make this code more efficient as opposed to saving even more memory. Thus there is a separate loop that is jumped to after we know we have to load a swash; this just saves having to test if the swash is loaded each time through the loop. I avoid loading the swash until absolutely necessary. In places in the previous version of this code, the swash was loaded when the input was UTF-8, even if it wasn't yet needed (and might never be if the input didn't contain anything above Latin1); apparently to avoid the extra test per iteration.

  The Perl test suite runs slightly faster on my platform with this patch under -O0, and the speeds are indistinguishable under -O2. This is in spite of these new POSIX regops being unknown to the regex optimizer (this will be addressed in future commits), and extra machine instructions being required for each character (the xor, and some shifting and masking). I expect this is a result of better caching, and not loading swashes unless absolutely necessary.
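The XOR-based complementing described above can be sketched in a few lines. The class enum and function names are hypothetical, and the sketch uses the C library's `isdigit`/`isspace` in place of perl's own class tables; it shows only how one code path can serve a class and its complement.

```c
#include <ctype.h>

/* Hypothetical numbering of two classes; the real engine numbers all of them. */
enum toy_class { TOY_CLASS_DIGIT, TOY_CLASS_SPACE };

/* Membership test, normalised to 0 or 1 so that XOR behaves as expected. */
static int toy_in_class(enum toy_class cls, unsigned char c)
{
    switch (cls) {
    case TOY_CLASS_DIGIT: return isdigit(c) != 0;
    case TOY_CLASS_SPACE: return isspace(c) != 0;
    }
    return 0;
}

/* One code path for both POSIXx-like and NPOSIXx-like nodes:
 * to_complement is 0 for the plain node and 1 for its complement. */
static int toy_node_matches(enum toy_class cls, unsigned char c, int to_complement)
{
    return to_complement ^ toy_in_class(cls, c);
}
```

The only per-node difference is how `to_complement` is initialized, which is the point the commit message makes about the complemented ops sharing code with their plain versions.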
* regcomp.sym: Change regkind for NPOSIX regnodes | Karl Williamson | 2012-11-19 | 1 | -4/+4

  It turns out that it is more convenient for the complement of a node to have a regkind that is also the complement of a node. This creates slight inconveniences that are included in this patch, but will help further patches.
* regex: Remove old code that tried to handle multi-char folds | Karl Williamson | 2012-10-14 | 1 | -212/+207

  A recent commit has changed the algorithm used to handle multi-character folding in bracketed character classes. The old code is no longer needed.
* RXf_MODIFIES_VARS | Father Chrysostomos | 2012-10-11 | 1 | -2/+2

  regcomp.c sets this new flag whenever regops that could modify $REGMARK or $REGERROR have been seen. pp_subst will use this to tell whether it should repeatedly stringify the replacement.
* Define RXf_SPLIT and RXf_SKIPWHITE as 0 | Father Chrysostomos | 2012-10-11 | 1 | -3/+3

  They are no longer used in core, and we need room for more flags.

  The only CPAN modules that use them check whether RXf_SPLIT is set (which no longer happens) before setting RXf_SKIPWHITE (which is ignored).
* regcomp.sym: Add new node types POSIXA and NPOSIXA | Karl Williamson | 2012-07-24 | 1 | -141/+182

  These will be used to handle things like /[[:word:]]/a. This patch doesn't add the code to actually use these. That will be done in a future patch.

  Placeholders POSIXD, POSIXL, and POSIXU are also added for future use.
* regcomp.c: Simplify some node calculations | Karl Williamson | 2012-06-29 | 1 | -148/+158

  For the node types that have differing versions depending on the character set regex modifiers, /d, /l, /u, /a, and /aa, we can use the enum values as offsets from the base node number to derive the correct one. This eliminates a number of tests.

  Because there is no DIGITU node type, I added placeholders for it (and NDIGITU) to avoid some special casing of it (more important in future commits). We currently have many available node types, so can afford to waste these two.
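A sketch of deriving a node number by adding the modifier's enum value to a base node. The enum names and the numeric value 40 are hypothetical, and /aa is left out for brevity; the only requirement illustrated is that the charset enum and the node numbers share the same relative ordering.

```c
/* Hypothetical charset enum, ordered to match the node layout below. */
enum toy_charset { TOY_CS_D = 0, TOY_CS_L, TOY_CS_U, TOY_CS_A };

/* Hypothetical node numbers: the variants directly follow the base node,
 * in the same order as the charset enum. */
enum toy_node {
    TOY_BOUND = 40,  /* base node, used under /d */
    TOY_BOUNDL,      /* /l */
    TOY_BOUNDU,      /* /u */
    TOY_BOUNDA       /* /a */
};

/* No chain of tests is needed: the modifier value is the offset. */
static enum toy_node toy_bound_for(enum toy_charset cs)
{
    return (enum toy_node)(TOY_BOUND + cs);
}
```

This only works if every family has a node in every slot, which is why the commit adds placeholder node types where a variant would otherwise be missing.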
* regcomp.sym: Reorder a couple of nodes | Karl Williamson | 2012-06-29 | 1 | -9/+9

  This causes all the nodes that depend on the regex modifier, BOUND, BOUNDL, etc. to have the same relative ordering. This will enable a future commit to simplify generation of the correct node.
* regcomp.sym: Fix out-dated description | Karl Williamson | 2012-03-03 | 1 | -1/+1

  As a result of commit fab2782b37b5570d7f8f8065fd7d18621117ed49 the description is no longer valid. This node type is trieable.
* rework how the trie logic handles the newer EXACT nodetypes | Yves Orton | 2012-03-03 | 1 | -5/+5

  This cleans up, simplifies, and extends how the trie logic interacts with the new node types. This change ultimately makes the EXACTFU, EXACTFU_SS, EXACTFU_NO_TRIE (renamed to EXACTFU_TRICKYFOLD) work properly with the trie engine regardless of whether the string is utf8 or latin1.

  This patch depends on the following:

      EXACT              => utf8 or "binary" text
      EXACTFU            => either pre-folded utf8, or latin1 that has to be folded as though it was utf8
      EXACTFU_SS         => special case of EXACTFU to handle \xDF/ss (affects latin1 treatment)
      EXACTFU_TRICKYFOLD => special case of EXACTFU to handle tricky non-latin1 fold rules
      EXACTF             => "old style fold logic" untriable nodetype
      EXACTFA            => (currently) untriable nodetype
      EXACTFL            => (currently) untriable nodetype

  See the comments in regcomp.sym for these fold types.

  This patch involves a number of distinct, but related parts. Starting from compilation:

  * Simplify how we detect a triable sequence given the new nodetypes, this also probably fixed some "bugs" in how we detected certain sequences, like /||foo|bar/.

  * Simplify how we read EXACTFU nodes under utf8 by removing the now redundant folding logic (EXACTFU nodes under utf8 are prefolded). Also extend this logic to handle latin1 patterns properly (in conjunction with other changes).

  * Part of the problems associated with EXACTFU_SS and EXACTFU_TRICKYFOLD have to do with how the trie logic interacts with the minlen logic. This change handles both by pessimising the minlen when encountering these nodetypes. One observation is that the minlen logic is basically broken, and works only because it conflates bytes and codepoints in such a way that we more or less always get a value small enough that things work out anyway. Fixing that properly is the job of another patch.

  * Part of the problem of doing folding under unicode rules is that there are a lot of foldings possible, some with strange rules. This means that the bitmap logic does not work correctly in all cases, as we currently do not have any way to populate it properly. So this patch disables the bitmap entirely when folding is involved until that is fixed.

  The end result of this is: we can TRIE/AHOCORASICK any sequence of EXACT, or EXACTFU (ish) nodes, regardless of utf8 or not, but we disable the bitmap when folding.

  A note for follow up relating to this patch is that the way EXACTFU_XXX nodes are currently dealt with we won't build the "maximal" trie because of their presence, instead creating a "jumptrie" consisting of either a leading EXACTFU node followed by a EXACTFU_XXX node, or vice versa. We should eventually address that.
* regex: Remove FOLDCHAR regnode type | Karl Williamson | 2012-01-19 | 1 | -11/+6

  This node type hasn't been used since 5.14.0. Instead an ANYOFV node was generated where formerly a FOLDCHAR node would have been used. The ANYOFV was used because it already existed and was up-to-date, whereas FOLDCHAR would have needed some bug fixes to adapt it, even though it would be faster in execution than ANYOFV; so the code for it was retained in case it was needed.

  However, both these solutions were defective, and a previous commit has changed things to a different type of solution entirely. Thus FOLDCHAR is obsolescent and can be removed, though the code in it was used as a base for some of the new solutions.
* regex: Add new node type EXACTFU_NO_TRIE | Karl Williamson | 2012-01-19 | 1 | -124/+129

  This new node is like EXACTFU but is not currently trie'able. This adds handling for it in regexec.c, but it is not currently generated; this commit is preparing for future commits.