path: root/regcomp.h
Commit message | Author | Age | Files | Lines
* propagate 'use re eval' into return from (??{}) | David Mitchell | 2012-06-13 | 1 | -0/+1
| | | | | | | | | (??{}) returns a string which needs to be put through the regex compiler, and which may also contain (?{...}) - so any 'use re eval' in scope needs to be propagated into the inner environment. Achieve this by adding a new private flag - PREGf_USE_RE_EVAL - to the regex to indicate the use is in scope, and modify how the call to compile the inner pattern is done, to allow the use state to be passed in.
* eliminate OP_4tree type | David Mitchell | 2012-06-13 | 1 | -3/+0
| | | | This was an alias to OP, and formerly used by the old re_eval mechanism
* eliminate REG_SEEN_EVAL | David Mitchell | 2012-06-13 | 1 | -1/+1
| | | | | This flag was set during pattern compilation if a (?{}) was encountered; but is redundant now that we have pRExC_state->num_code_blocks.
* Fix up runtime regex codeblocks. | David Mitchell | 2012-06-13 | 1 | -3/+0
| The previous commits in this branch have brought literal code blocks into the New World Order; now do the same for runtime blocks, i.e. those needing "use re 'eval'".
|
| The main user-visible changes from this commit are that:
|
| * the code is now fully parsed, rather than needing balanced {}'s; i.e. this now works:
|       my $code = q[ (?{ $a = '{' }) ];
|       use re 'eval';
|       /$code/
|
| * warnings and errors are now reported as coming from "(eval NNN)" rather than "(re_eval NNN)" (although see the next commit for some fixups to that). Indeed, the string "re_eval" has been expunged from the source and documentation.
|
| The big internal difference is that the sv_compile_2op() and sv_compile_2op_is_broken() functions are no longer used, and will be removed shortly.
|
| It works by the regex compiler detecting the presence of run-time code blocks, and feeding the whole pattern string back into the parser (where the run-time blocks are now seen as compile-time), then extracting out any compiled code blocks and adding them to the mix.
|
| For example, in the following:
|       $c = '(?{"runtime"})d';
|       use re 'eval';
|       /a(?{"literal"})\b'$c/
|
| At the point the regex compiler is called, the perl parser will already have compiled the literal code block and presented it to the regex engine. The engine examines the pattern string, sees two '(?{', but only one accounted for by the parser, and so constructs a short string to be evalled: based on the pattern, but with literal code-blocks blanked out, and \ and ' escaped. In the above example, the pattern string is
|       a(?{"literal"})\b'(?{"runtime"})d
| and we call eval_sv() with an SV containing the text
|       qr'a \\b\'(?{"runtime"})d'
| The returned qr will contain the new code-block (and associated CV and pad) which can be extracted and added to the list of compiled code blocks of the original pattern.
|
| Note that with this scheme, the requirement for "use re 'eval'" is easily determined, and no longer requires all the pp_regcreset / PL_reginterp_cnt machinery, which will be removed shortly.
|
| Two subtleties of this scheme are that normally, \\ isn't collapsed into \ for literal regexes (unlike literal strings), and hints aren't inherited when using eval_sv(). We get round both of these by adding and setting a new flag, PL_reg_state.re_reparsing, which indicates that we are refeeding a pattern into the perl parser.
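The string-building step described above (blank out the literal code blocks, escape \ and ', wrap in qr'...') can be sketched in C roughly as follows. This is only an illustration, not the code from regcomp.c: the struct `literal_block`, the function `build_reparse_string`, and the choice to blank each byte of a literal block with one space (so that offsets are preserved) are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record of one literal code block that the parser has
 * already compiled: byte offsets [start, end) of its "(?{...})" text. */
struct literal_block {
    size_t start;
    size_t end;
};

/* Build qr'...' from the pattern: bytes inside a literal block become
 * spaces (an assumption of this sketch), and every backslash or single
 * quote outside those blocks gets a backslash in front of it.
 * Blocks must be sorted and non-overlapping. Caller frees the result. */
char *build_reparse_string(const char *pat,
                           const struct literal_block *blocks,
                           size_t nblocks)
{
    size_t len = strlen(pat);
    /* Worst case every byte is escaped; 5 covers "qr'", "'" and NUL. */
    char *out = malloc(2 * len + 5);
    size_t o = 0, b = 0;

    if (out == NULL)
        return NULL;
    out[o++] = 'q';
    out[o++] = 'r';
    out[o++] = '\'';
    for (size_t i = 0; i < len; i++) {
        /* Skip past blocks that end at or before the current offset. */
        while (b < nblocks && i >= blocks[b].end)
            b++;
        if (b < nblocks && i >= blocks[b].start) {
            out[o++] = ' ';          /* inside a literal block: blank it */
            continue;
        }
        if (pat[i] == '\\' || pat[i] == '\'')
            out[o++] = '\\';         /* escape for the single-quoted qr */
        out[o++] = pat[i];
    }
    out[o++] = '\'';
    out[o] = '\0';
    return out;
}
```

With the pattern from the commit message and a single block covering offsets 1 to 15, the run-time block `(?{"runtime"})` survives untouched while the literal one is replaced by blanks, so the re-parse only ever compiles code the parser has not seen yet.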
* add op_comp field to regexp_engine API | David Mitchell | 2012-06-13 | 1 | -1/+2
| | | | | | | | | | | | | | | | | | | | Perl's internal function for compiling regexes that knows about code blocks, Perl_re_op_compile, isn't part of the engine API. However, the way that regcomp.c is dual-lifed as ext/re/re_comp.c with debugging compiled in, means that Perl_re_op_compile is also compiled as my_re_op_compile. These days the mechanism to choose whether to call the main functions or the debugging my_* functions when 'use re debug' is in scope, is the re engine API jump table. Ergo, to ensure that my_re_op_compile gets called appropriately, this method needs adding to the jump table. So, I've added it, but documented as 'for perl internal use only, set to null in your engine'. I've also updated current_re_engine() to always return a pointer to a jump table, even if we're using the internal engine (formerly it returned null). This then allows us to use the simple condition (eng->op_comp) to determine whether the current engine supports code blocks.
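The dispatch idea in this commit (a jump table that is never null, with an optional `op_comp` slot) might look something like the following. Every name here (`toy_engine`, `toy_regex`, `toy_compile`, and so on) is a hypothetical stand-in invented for the sketch; the real regexp_engine structure has different members and signatures.

```c
#include <stddef.h>

/* Hypothetical compiled-regex stand-in. */
typedef struct toy_regex {
    int has_code_blocks;
} toy_regex;

/* Hypothetical jump table: op_comp is NULL for engines that do not
 * understand pre-compiled code blocks. */
typedef struct toy_engine {
    toy_regex *(*comp)(const char *pattern);
    toy_regex *(*op_comp)(const char *pattern, int nblocks);
} toy_engine;

/* One static result object keeps the sketch free of allocation. */
static toy_regex toy_result;

toy_regex *toy_comp(const char *pattern)
{
    (void)pattern;
    toy_result.has_code_blocks = 0;
    return &toy_result;
}

toy_regex *toy_op_comp(const char *pattern, int nblocks)
{
    (void)pattern;
    toy_result.has_code_blocks = nblocks > 0;
    return &toy_result;
}

const toy_engine toy_core_engine   = { toy_comp, toy_op_comp };
const toy_engine toy_plugin_engine = { toy_comp, NULL };

/* Always returns a table, never NULL, mirroring the change described
 * for current_re_engine(). */
const toy_engine *toy_current_engine(void)
{
    return &toy_core_engine;
}

/* The caller only needs to test eng->op_comp. */
toy_regex *toy_compile(const toy_engine *eng, const char *pattern, int nblocks)
{
    if (eng->op_comp)
        return eng->op_comp(pattern, nblocks);
    return eng->comp(pattern);   /* code blocks are dropped */
}
```

Because the table pointer is never null, the "does this engine support code blocks?" question collapses to a single member test rather than a null check followed by a feature check.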
* preserve code blocks in interpolated qr//s | David Mitchell | 2012-06-13 | 1 | -0/+1
| This now works:
|
|       { my $x = 1; $r = qr/(??{$x})/ }
|       my $x = 2;
|       print "ok\n" if "1" =~ /^$r$/;
|
| When a qr// is interpolated into another pattern, the pattern is still recompiled using the stringified qr, but now the pre-compiled code blocks from the qr are reused rather than being re-compiled, so it behaves like a closure.
|
| Note that this makes some tests in regexp_qr_embed_thr.t fail, due to a pre-existing threads bug, which can be summarised as:
|
|       use threads;
|       my $s = threads->new(sub { return sub { $::x = 1} })->join;
|       $s->();
|       print "\$::x=[$::x]\n";
|
| which prints undef, not 1, since the *::x is cloned into the child thread, then cloned back into the parent as part of the CV (linked from the pad) being returned in the join. The cloning/join code isn't clever enough to realise that the back-cloned *::x is the same as the original *::x, so the main thread ends up with two copies.
|
| This manifests itself in the re tests as
|
|       my $re = threads->new( sub { qr/(?{$::x = 1 })/ })->join();
|
| where, since the returned qr// is now a closure, it suffers from the same glob duplication in the parent. So I've disabled 4 re_tests tests under threads for now.
* in re_op_compile(), keep code_blocks for qr// | David Mitchell | 2012-06-13 | 1 | -0/+2
| | | | | | | | code_blocks is a temporary list of start/end indices and pointers to DO blocks, that is used during the regexp compilation. Change it so that in the qr// case, this structure is preserved (attached to regexp_internal), so that in a forthcoming commit it will be available for use when interpolating a qr within another pattern.
* make qr/(?{})/ behave with closures | David Mitchell | 2012-06-13 | 1 | -0/+1
| With this commit, qr// with a literal (compile-time) code block will Do the Right Thing as regards closures and the scope of lexical vars; in particular, the following now creates three regexes that match 0, 1 and 2:
|
|       for my $i (0..2) {
|           push @r, qr/^(??{$i})$/;
|       }
|       "1" =~ $r[1]; # matches
|
| Previously, $i would be evaluated as undef in all 3 patterns.
|
| This is achieved by wrapping the compilation of the pattern within a new anonymous CV, which is then attached to the pattern. At run-time pp_qr() clones the CV as well as copying the REGEXP; and when the code block is executed, it does so using the pad of the cloned CV. Which makes everything come out all right in the wash. The CV is stored in a new field of the REGEXP, called qr_anoncv.
|
| Note that run-time qr//s are still not fixed, e.g. qr/$foo(?{...})/; nor is it yet fixed where the qr// is embedded within another pattern: continuing with the code example from above,
|
|       my $i = 99;
|       "1" =~ $r[1];       # bare qr: matches: correct!
|       "X99" =~ /X$r[1]/;  # embedded qr: matches: whoops, it's still seeing the wrong $i
* Mostly complete fix for literal /(?{..})/ blocks | David Mitchell | 2012-06-13 | 1 | -0/+1
| Change the way that code blocks in patterns are parsed and executed, especially as regards lexical and scoping behaviour. (Note that this fix only applies to literal code blocks appearing within patterns: run-time patterns, and literals within qr//, are still done the old broken way for now).
|
| This change means that for literal /(?{..})/ and /(??{..})/:
|
| * the code block is now fully parsed in the same pass as the surrounding code, which means that the compiler no longer just does a simplistic count of balancing {} to find the limits of the code block; i.e. stuff like /(?{ $x = "{" })/ now works (in the same way that subscripts in double quoted strings always have: "$a{'{'}" )
|
| * Error and warning messages will now appear to emanate from the main body rather than an re_eval; e.g. the output from
|       #!/usr/bin/perl
|       /(?{ warn "boo" })/
|   has changed from
|       boo at (re_eval 1) line 1.
|   to
|       boo at /tmp/p line 2.
|
| * scope and closures now behave as you might expect; for example
|       for my $x (qw(a b c)) { "" =~ /(?{ print $x })/ }
|   now prints "abc" rather than ""
|
| * with recursion, it now finds the lexical within the appropriate depth of pad: this code now prints "012" rather than "000":
|       sub recurse {
|           my ($n) = @_;
|           return if $n > 2;
|           "" =~ /^(?{print $n})/;
|           recurse($n+1);
|       }
|       recurse(0);
|
| * an earlier fix that stopped 'my' declarations within code blocks causing crashes, required the accumulating of two SAVECOMPPADs on the stack for each iteration of the code block; this is no longer needed;
|
| * UNITCHECK blocks within literal code blocks are now run as part of the main body of code (run-time code blocks still trigger an immediate call to the UNITCHECK block though)
|
| This is all achieved by building upon the efforts of the commits which led up to this; those altered the parser to parse literal code blocks directly, but up until now those code blocks were discarded by Perl_pmruntime and the block re-compiled using the original re_eval mechanism. As of this commit, for the non-qr and non-runtime variants, those code blocks are no longer thrown away. Instead:
|
| * the LISTOP generated by the parser, which contains all the code blocks plus OP_CONSTs that collectively make up the literal pattern, is now stored in a new field in PMOPs, called op_code_list. For example in /A(?{BLOCK})C/, the listop stored in op_code_list looks like
|       LIST
|           PUSHMARK
|           CONST['A']
|           NULL/special (aka a DO block)
|               BLOCK
|           CONST['(?{BLOCK})']
|           CONST['C']
|
| * each of the code blocks has its last op set to null and is individually run through the peephole optimiser, so each one becomes a little self-contained block of code, rather than a list of blocks that run into each other;
|
| * then in re_op_compile(), we concatenate the list of CONSTs to produce a string to be compiled, but at the same time we note any DO blocks and note the start and end positions of the corresponding CONST['(?{BLOCK})'];
|
| * (if the current regex engine isn't the built-in perl one, then we just throw away the code blocks and pass the concatenated string to the engine)
|
| * then during regex compilation, whenever we encounter a '(?{', we see if it matches the index of one of the pre-compiled blocks, and if so, we store a pointer to that block in an 'l' data slot, and use the end index to skip over the text of the code body. Conversely, if the index doesn't match, then we know that it's a run-time pattern and (for now), compile it in the old way.
|
| * During execution, when an EVAL op is encountered, if data->what is 'l', then we just use the pad that was in effect when the pattern was called; i.e. we use the current pad slot of the currently executing CV that the pattern is embedded within.
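The index-matching step in the list above can be illustrated with a small C sketch. It is only a rough model under stated assumptions: `toy_code_block`, `toy_find_block` and `toy_scan_pattern` are hypothetical names, `end` is assumed to be the offset just past the closing `)` of the block text, and the scan simply counts how many `(?{` occurrences are literal (already compiled) versus run-time.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical entry in the table of pre-compiled blocks: offsets of the
 * "(?{...})" text within the concatenated pattern, plus the compiled op. */
struct toy_code_block {
    size_t start;
    size_t end;          /* assumed: one past the closing ')' */
    const void *block;   /* stand-in for the pointer to the DO block */
};

/* Return the block whose text starts exactly at pos, or NULL. */
const struct toy_code_block *
toy_find_block(const struct toy_code_block *blocks, size_t n, size_t pos)
{
    for (size_t i = 0; i < n; i++)
        if (blocks[i].start == pos)
            return &blocks[i];
    return NULL;
}

/* Walk the pattern. A "(?{" at a known start index is a literal block,
 * and its body is skipped using the end index; any other "(?{" is
 * treated as a run-time block. */
void toy_scan_pattern(const char *pat,
                      const struct toy_code_block *blocks, size_t n,
                      size_t *literal, size_t *runtime)
{
    size_t len = strlen(pat);
    size_t i = 0;

    *literal = 0;
    *runtime = 0;
    while (i < len) {
        if (strncmp(pat + i, "(?{", 3) == 0) {
            const struct toy_code_block *cb = toy_find_block(blocks, n, i);
            if (cb != NULL && cb->end > i) {
                (*literal)++;
                i = cb->end;     /* skip the text of the code body */
                continue;
            }
            (*runtime)++;
        }
        i++;
    }
}
```

Skipping by `end` rather than by counting braces is what lets a literal block contain an unbalanced `{` without confusing the scan.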
* update the editor hints for spaces, not tabs | Ricardo Signes | 2012-05-29 | 1 | -2/+2
| | | | | This updates the editor hints in our files for Emacs and vim to request that tabs be inserted as spaces.
* regex: Fix some tricky fold problems | Karl Williamson | 2012-01-19 | 1 | -0/+1
| | | | | | | | | | | | As described in the comments, this changes the design of handling the Unicode tricky fold characters to not generate a node for each possible sequence but to get them to work within EXACTFish nodes. The previous design(s) all used a node to handle these, which suffers from the downfall that it precludes legitimate matches that would cross the node boundary. The new design is described in the comments.
* Comment additions, typos, white-space. | Karl Williamson | 2012-01-13 | 1 | -0/+2
| | | | And the reordering for clarity of one test
* Change __attribute__unused__ to PERL_UNUSED_DECL | Karl Williamson | 2011-11-09 | 1 | -1/+1
| | | | The latter is the Perl standard way of making this declaration
* use __attribute__unused__ to silence -Wunused-but-set-variable | Robin Barker | 2011-05-19 | 1 | -1/+2
|
* regcomp.h: Add comment | Karl Williamson | 2011-03-19 | 1 | -1/+1
|
* regcomp.h: Add ANYOF_CLASS_SETALL() | Karl Williamson | 2011-03-19 | 1 | -0/+2
| | | | | This macro sets all the bits of the class (for \w, etc) for use during initialization
* regex: Fix locale regression | Karl Williamson | 2011-03-18 | 1 | -31/+18
| | | | | | | | | | | | | | | | | | Things like \S have not been accessible to the synthetic start class under locale matching rules. They have been placed there, but the start class didn't know they were there. This patch sets ANYOF_CLASS in initializing the synthetic start class so that downstream code knows it is a charclass_class, and removes the code that partially allowed this bit to be shared, and which isn't needed in 5.14, and more thought would have to go into doing it than was reflected in the code. I can't come up with a test case that would verify that this works, because of general locale testing issues, except that I looked at a dump of the generated regex synthetic start class; but the dump isn't the same thing as the real behavior, and using one is also subject to breakage if the regex code changes in the slightest.
* regcomp.h: #define of ANYOF flags immune from inversion | Karl Williamson | 2011-03-08 | 1 | -0/+10
|
* regex: /l in combo with others in syn start class | Karl Williamson | 2011-03-08 | 1 | -12/+10
| | | | | | | | | Now that regexes can be combinations of different charset modifiers, a synthetic start class can match locale and non-locale both. locale should generally match only things in the bitmap for code points < 256. But a synthetic start class with a non-locale component can match such code points. This patch makes an exception for synthetic nodes that will be resolved if it passes and is matched again for real.
* regcomp.c: Move #defines to be in bit order | Karl Williamson | 2011-03-08 | 1 | -5/+5
|
* regex: Remove obsolete code | Karl Williamson | 2011-02-28 | 1 | -19/+0
| | | | | | | This code has been rendered obsolete in 5.14 by using a different mechanism altogether. This functionality is now provided at run-time, user-selectable, via the /u and /d regex modifiers. This code was for compile-time selection of which to use.
* bleadperl breaks RCLAMP/Text-Glob | Karl Williamson | 2011-02-25 | 1 | -6/+13
| | | | | | | | This was from commit f424400810b6af341e96230836690da51c37b812 which came from needing a bit in an already-full flags field, and my faulty analysis that two bits could be shared. I found another mechanism to free up another bit, and now can separate these shared bits again.
* Free up bit in ANYOF flags | Karl Williamson | 2011-02-25 | 1 | -7/+16
| | | | | | | | | | | | | | | | This is the foundation for fixing the regression RT #82610. My analysis was wrong that two bits could be shared, at least not without further work. This changes to use a different mechanism to pass needed information to regexec.c so that another bit can be freed up and, in a later commit, the two bits can become unshared again. The bit that is freed up is ANYOF_UTF8, which basically said there is something that is matched outside the ANYOF bitmap, and requires the target string to be in utf8. This changes things so the existence of something besides the bitmap indicates this, and so no flag is needed. The flag bit ANYOF_NONBITMAP_NON_UTF8 remains to indicate that there is something that should be matched outside the bitmap even if the target string isn't in utf8.
* regcomp.h: Remove obsolete define | Karl Williamson | 2011-02-24 | 1 | -3/+0
|
* regex: Convert regnode FLAGS fields to charset enum | Karl Williamson | 2011-01-16 | 1 | -2/+3
| | | | | | | | | The FLAGS fields of certain regnodes were encoded with USE_UNI if unicode semantics were to be used. This patch instead sets them to the character set used, one of the possibilities of which is to use unicode semantics. This shortens the code somewhat, and always puts the character set in the flags field, which can allow use of switch statements on it for efficiency, especially as new values are added.
* Fix \xa0 matching both [\s] [\S], et.al. | Karl Williamson | 2011-01-16 | 1 | -0/+4
| | | | | | | | | | This bug stemmed from Latin1 characters not matching any (non-complemented) character class in /d semantics when the target string is not utf8, but having unicode semantics when it is. The solution here is to add a special flag. There were several tests that relied on the broken behavior, specifically they tested that \xff isn't a printable word character even in utf8. I changed the deparse test to instead use a non-printable code point, and I changed the ones in re_tests to be TODOs, and will change them back using /a when that is shortly added.
* regcomp: Share two bits in ANYOF flags | Karl Williamson | 2011-01-16 | 1 | -8/+17
| | | | | | It turns out that the INVERT and EOS flags can actually share the same bit, as explained in the comments, freeing up a bit for other uses. No code changes need be made.
* regex: Some Comment clarifications | Karl Williamson | 2011-01-13 | 1 | -2/+8
|
* Fix typos (spelling errors) in Perl sources. | (Peter J. Acklam) (via RT) | 2011-01-07 | 1 | -4/+4
| # New Ticket Created by (Peter J. Acklam)
| # Please include the string: [perl #81904]
| # in the subject line of all future correspondence about this issue.
| # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=81904 >
|
| Signed-off-by: Abigail <abigail@abigail.be>
* Change name of regex intrnl macro to new meaning | Karl Williamson | 2010-12-20 | 1 | -3/+12
| | | | | | | | | ANYOF_FOLD is now used only under fewer conditions. Otherwise the bitmap of characters 0-255 is fully calculated with the folds, and the flag is not set. One condition is under locale, where the folds aren't known at compile time; the other is for things accessible through a swash. By changing the name to its new meaning, certain optimizations become more obvious.
* Change regexes to debug dump non-ASCII as hex. | Karl Williamson | 2010-12-19 | 1 | -3/+3
| | | | | | instead of the less familiar octal for larger values. Perhaps they should actually print the actual character, but this is far easier than the previous to understand.
* regcomp: Allow freeing up bit in ANYOF flags | Karl Williamson | 2010-12-11 | 1 | -7/+23
| | | | | | The flags field is fully used, and until the ANYOF node is split later in development, the CLASS bit will need to be freed up to give the space for other uses. This patch allows for this to easily be toggled.
* regcomp.h: Restore separate bit for LOC class | Karl Williamson | 2010-11-22 | 1 | -2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | This commit partially reverts cefafd73018b048fa66d2b22250431112141955a which unconditionally used a bitmap for classes like \w in ANYOF nodes used in locales. Unfortunately, I forgot to unconditionally allocate that space, so things were getting corrupted. It is scary that that did not show up in my testing, but locales are hard to test. It showed up in a workspace without DEBUGGING. This commit now causes the bitmap to be used only when necessary, at the expense of using a precious bit in the flags field to indicate that it is being used. However, as events have turned out since that commit, that flags bit isn't as precious as I thought. It looks like we will have to split the ANYOF node into two similar nodes, one of which is variable length, as there are bugs due to the optimizer thinking it is of length 1, when in fact it doesn't currently have to be. That split should allow more bits to be freed. I'm retaining for now some ancillary code that was to help improve the efficiency when that bit was removed; just in case we have to redo this. But if we do, we have to unconditionally allocate the space we think we are using. Signed-off-by: David Golden <dagolden@cpan.org>
* Split ANYOF_NONBITMAP into two components | Karl Williamson | 2010-11-22 | 1 | -1/+7
| | | | | | | | | | | ANYOF_NONBITMAP means that the node can match things that aren't in its bitmap. Some things can match only when the target string is in utf8, and some things can match even if it isn't. If the target string is not in utf8, and we know that the only possible match is when it is in utf8, we know it can't match, and avoid a fruitless, expensive swash load. This change also fixes a number of problems shown in t/re/grind_fold.t that I will deliver soon.
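The short-circuit this commit describes can be modelled with two flag bits. The names below (`TOY_NONBITMAP_UTF8_ONLY`, `TOY_NONBITMAP_NON_UTF8`, `toy_must_consult_swash`) are hypothetical and the bit values arbitrary; the sketch only shows the decision, under the assumption that the code point was not already found in the node's bitmap.

```c
#include <stdbool.h>

/* Hypothetical flag bits standing in for the two components. */
enum {
    TOY_NONBITMAP_UTF8_ONLY = 0x01, /* extra matches exist, only for utf8 targets */
    TOY_NONBITMAP_NON_UTF8  = 0x02  /* extra matches exist even for non-utf8 targets */
};

/* Decide whether the (slow) out-of-bitmap lookup is worth doing. */
bool toy_must_consult_swash(unsigned flags, bool target_is_utf8)
{
    if (flags & TOY_NONBITMAP_NON_UTF8)
        return true;                 /* could match regardless of encoding */
    if (flags & TOY_NONBITMAP_UTF8_ONLY)
        return target_is_utf8;       /* a non-utf8 target cannot match */
    return false;                    /* nothing outside the bitmap at all */
}
```

The saving comes from the middle case: when the only out-of-bitmap matches require a utf8 target and the target is not utf8, the function answers "no" without loading anything.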
* regcomp.h: Add comment | Karl Williamson | 2010-11-22 | 1 | -1/+2
|
* regcomp.h: Renumber ANYOF_EOS bit | Karl Williamson | 2010-11-22 | 1 | -3/+3
| | | | | This is in preparation for adding a new bit which for debugging ease ought to be adjacent to another one.
* rename ANYOF_UNICODE to ANYOF_NONBITMAP | Karl Williamson | 2010-11-22 | 1 | -1/+4
| | | | | | | | I am about to hone the meaning of this to mean that there is something outside the bitmap that is matchable by the node, and the new name reflects that more accurately. I am not retaining the old name because I'm about to remove it from the flags field to save a bit and avoid masking operations, and any code that would be using it would break at that point.
* regex free up bit in ANYOF node | Karl Williamson | 2010-11-22 | 1 | -1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch causes all locale ANYOF nodes to have a class bitmap (4 bytes) even if they don't have a class (such as \w, \d, [:posix:]). This frees up a bit in the flags field that was used to signal if the node had the bitmap. I intend to use it instead to signal that loading a swash, which is slow, can be bypassed. Thus this is a time/space tradeoff, applicable to not just locale nodes: adding a word to the locale nodes saves time for all nodes. I added the ANYOF_CLASS_TEST_ANY_SET() macro to determine quickly if there are actually any classes in the node. Minimal code was changed, so this can be easily reversed if another bit frees up. Another possibility is to share with the ANYOF_EOS bit instead, as this is used just in the optimizer's start class, and only in regcomp.c. But this requires more careful coding. Another possibility is to add a byte (hence likely at least 4 because of alignment issues) to store extra flags. And still another possibility is to add just the byte for the start class, which would not need to affect other ANYOF nodes, since the EOS bit is not used outside regcomp.c. But various routines in regcomp assume that the start class and other ANYOF nodes are interchangeable, so this option would require more code changes.
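A quick "are any class bits set?" test over a fixed 4-byte class bitmap, as described for ANYOF_CLASS_TEST_ANY_SET(), could be written along these lines. The struct layout and the `toy_` names are assumptions for illustration; the real macro operates on the regnode's own fields.

```c
#include <stdbool.h>
#include <string.h>

#define TOY_CLASS_BYTES 4   /* the commit message gives 4 bytes */

/* Hypothetical node fragment that always carries the class bitmap. */
struct toy_anyof_class {
    unsigned char classflags[TOY_CLASS_BYTES];
};

void toy_class_zero(struct toy_anyof_class *n)
{
    memset(n->classflags, 0, TOY_CLASS_BYTES);
}

/* Set one class bit; bit must be below TOY_CLASS_BYTES * 8. */
void toy_class_set(struct toy_anyof_class *n, unsigned bit)
{
    n->classflags[bit >> 3] |= (unsigned char)(1u << (bit & 7));
}

/* True if the node actually uses any class, without needing a
 * separate "has a class bitmap" flag bit. */
bool toy_class_test_any_set(const struct toy_anyof_class *n)
{
    return (n->classflags[0] | n->classflags[1]
          | n->classflags[2] | n->classflags[3]) != 0;
}
```

This is the time/space trade the message describes: the bitmap is always present, so whether a class is in use is answered by looking at the bytes themselves instead of spending a flag bit on it.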
* regcomp.h: Add comment | Karl Williamson | 2010-11-22 | 1 | -1/+1
|
* regcomp.h: Reorder statements for clarity | Karl Williamson | 2010-11-22 | 1 | -4/+6
| | | | Reorder #defines of bits so they are in numerical order
* regcomp.h: Remove unused #define | Karl Williamson | 2010-10-31 | 1 | -3/+0
| | | | | | | | | | | ANYOF_RUNTIME() is no longer used, so can be removed. I had long tried to figure out what the purpose of this was, and discovered it really had none. I think it must have had something to do with locales at one time. But locales don't do well with utf8, and I don't know how to make it better. In any event this wasn't actually accomplishing anything.
* regcomp.h: Clean up some comments | Karl Williamson | 2010-10-31 | 1 | -10/+10
|
* ANYOF_LARGE is now the same as ANYOF_CLASS | Karl Williamson | 2010-10-31 | 1 | -4/+2
| | | | | These two #defines now mean the same thing. Free up bit used by ANYOF_LARGE
* regcomp.c: Get rid of compiler warning. | Karl Williamson | 2010-10-21 | 1 | -1/+1
| | | | | | | | This patch should remove a compiler warning that is currently only showing up in one compiler. It declares a debug-only variable to be volatile, so should silence the warning that it is getting clobbered. Since this variable is only used for debugging purposes, when DEBUGGING is defined, performance should not be an issue.
* Subject: [perl #58182] partial: Add uni \s,\w matching | Karl Williamson | 2010-10-15 | 1 | -0/+3
| | | | | | | | | | | | | | | | | | | This commit causes regex sequences \b, \s, and \w (and complements) to match in the latin1 range in the scope of feature 'unicode_strings' or with the /u regex modifier. It uses the previously unused flags field in the respective regnodes to indicate the type of matching, and in regexec.c, uses that to decide which of the handy.h macros to use, native or Latin1. I chose this for now rather than create new nodes for each type of match. An earlier version of this patch did that, and in every case the switch case: statements were adjacent, offering no performance advantage. If regexec were modified to use in-line functions or more macros for various short sections of it, then it would be faster to have new nodes rather than using the flags field. But, using that field simplified things, as this change flies under the radar in a number of places where it would not if separate nodes were used.
* Subject: regcomp.h: Add macro to get regnode flags | Karl Williamson | 2010-10-15 | 1 | -0/+2
|
* PATCH: regex longjmp flaws | Karl Williamson | 2010-09-15 | 1 | -1/+3
| | | | | | | | The netbsd - 5.0.2 compiler pointed out that the recent changes to add longjmps to speed up some regex compilations can result in clobbering a few values. These depend on the compiled code, and so didn't show up in other compiler's warnings. This patch reinitializes them after a longjmp.
* Properly free paren_name_list with its regexp. | Nicholas Clark | 2010-05-29 | 1 | -0/+1
| | | | | Previously the AV paren_name_list would "leak" until global destruction. This was only an issue under -DDEBUGGING. Fixes RT #73438.
* Generate PL_simple[] and PL_varies[] with regcomp.pl, rather than hard-coding. | Nicholas Clark | 2010-05-27 | 1 | -31/+0
| | | | | | Add a new flags column to regcomp.sym, with V if the node type is in PL_varies, S if it is in PL_simple, and . if a placeholder is needed because subsequent optional columns are present.
* tries: don't allocate memory at runtime | David Mitchell | 2010-05-03 | 1 | -3/+11
| This is an indirect fix for [perl #74484] Regex causing exponential runtime+mem usage
|
| The trie runtime code was doing more SAVETMPS than FREETMPS and was thus growing a large tmps stack on heavy backtracking. Rather than fixing this directly, I rewrote part of the trie code so that it no longer needs to allocate memory in S_regmatch (it still does in find_byclass()).
|
| The basic issue is that multiple branches in the trie may trigger an accept state; for example:
|
|       "abcd" =~ /xyz|abcd.*X|ab.*Y|/
|
| here, words (branches) 2 and 3 are accept states. The original approach was, at run time, to create a list of accepted word numbers and the character positions of the end of each of those words. Then run the rest of the pattern for each word in the list in turn (in word index order). This requires memory for the list to be allocated and freed.
|
| The new approach involves creating extra info at compile time; in particular, for each word, a pointer to the previous accepted word (if any) in the state tree. For example for the above pattern, part of the state tree may be
|
|         a    b    c    d
|       1 -> 2 -> 3 -> 4 -> 5
|                 (#3)      (#2)
|
| (e.g. at state 1, if the next char is 'a', we transition to state 2). Here, state 3 is an accept state with word #3, and 5 is an accept state with word #2. So we build a table indexed by word number, which has wordinfo[2] = 3, wordinfo[3] = 0, thus building the word chain 2->3->0.
|
| At run time we run the trie to completion, and remember the word associated with the longest accept state (word #2 above). Then by following back the chain of .prev fields, we can produce a list of all accepting words. We then iteratively find the smallest-numbered (ie LH-most) word in the chain, and run with it. On failure and backtrack, we find the next-smallest and so on.
|
| Since we are no longer recording the end-position of each word in the string, we have to recalculate this for each backtrack. We initially record the end-position of the shortest accepting word, and given that we know the length of each word, we can calculate the new position each time as an offset from that first word. Depending on unicode and folding, that calculation can be cheap or expensive.
|
| This algorithm is optimised for the typical case where there are a small number (<= 2) of accepting states.
|
| This patch creates a new compile-time array, trie->wordinfo[], indexed by word number, which contains relevant info about each word. This also supersedes the old trie->newword[] array, whose function of recording "overspills" of multiple words per accept state, is now handled as part of the wordinfo[].prev chain.
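The chain-walking scheme described above needs no run-time allocation, which a small C model makes concrete. The struct `toy_wordinfo` and both functions are hypothetical names for this sketch, and the end-position helper assumes the cheap case only (one byte per character, no case folding).

```c
#include <stddef.h>

/* Hypothetical per-word record built at compile time. Word numbers
 * start at 1; 0 terminates a chain. */
struct toy_wordinfo {
    unsigned prev;   /* previous accepting word in the state tree, or 0 */
    unsigned len;    /* length of the word in characters */
};

/* Starting from `head` (the word of the longest accept state), follow
 * the .prev chain and return the smallest word number greater than
 * `after`. Pass after = 0 for the first candidate; a return of 0 means
 * the chain is exhausted. */
unsigned toy_next_accept_word(const struct toy_wordinfo *wordinfo,
                              unsigned head, unsigned after)
{
    unsigned best = 0;

    for (unsigned w = head; w != 0; w = wordinfo[w].prev) {
        if (w > after && (best == 0 || w < best))
            best = w;
    }
    return best;
}

/* Recompute where word w ends, as an offset from the end position of
 * the shortest accepting word. Assumes one byte per character. */
size_t toy_word_end(size_t shortest_end, unsigned shortest_len,
                    const struct toy_wordinfo *wordinfo, unsigned w)
{
    return shortest_end + (wordinfo[w].len - shortest_len);
}
```

For the example in the message (chain 2->3->0, word #2 being "abcd" and word #3 being "ab"), the first call yields word 2, the retry after backtracking yields word 3, and a further retry yields 0. Each retry rescans the chain, which is acceptable when, as the message says, the chain typically holds two words or fewer.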