diff options
author | David Mitchell <davem@iabyn.com> | 2010-05-03 13:57:58 +0100 |
---|---|---|
committer | David Mitchell <davem@iabyn.com> | 2010-05-03 14:29:54 +0100 |
commit | 2e64971a6530d2645969bc489f564bfd3ce64993 (patch) | |
tree | 8b5ab3ff672755f1ca62bd040b2240dbb5719934 /regexp.h | |
parent | a23e6e2012eae03dbd049a058d24b0ce29357c76 (diff) | |
download | perl-2e64971a6530d2645969bc489f564bfd3ce64993.tar.gz |
tries: don't allocate memory at runtime
This is an indirect fix for
[perl #74484] Regex causing exponential runtime+mem usage
The trie runtime code was doing more SAVETMPS than FREETMPS and was thus
growing a large tmps stack on heavy backtracking. Rather than fixing this
directly, I rewrote part of the trie code so that it no longer needs to
allocate memory in S_regmatch (it still does in find_byclass()).
The basic issue is that multiple branches in the trie may trigger an
accept state; for example:
"abcd" =~ /xyz/abcd.*X|ab.*Y|/
here, words (branches) 2 and 3 are accept states. The original approach
was, at run time, to create a list of accepted word numbers and the
character positions of the end of each of those words. Then run the rest
of the pattern for each word in the list in turn (in word index order).
This requires memory for the list to be allocated and freed.
The new approach involves creating extra info at compile time; in
particular, for each word, a pointer to the previous accepted word (if
any) in the state tree. For example for the above pattern, part of the
state tree may be
q b c d
1 -> 2 -> 3 -> 4 -> 5
(#3) (#2)
(e.g. at state 1, if the next char is 'a', we transition to state 2).
Here, state 3 is an accept state with word #3, and 5 is an accept state
with word #2. So we build a table indexed by word number, which has
wordinfo[2] = 3, wordinfo[3] = 0, thus building the word chain 2->3->0.
At run time we run the trie to completion, and remember the word
associated with the longest accept state (word #2 above). Then by following
back the chain of .prev fields, we can produce a list of all accepting
words. We then iteratively find the smallest-numbered (ie LH-most) word in
the chain, and run with it. On failure and backtrack, we find the
next-smallest and so on.
Since we are no longer recording the end-position of each word in the
string, we have to recalculate this for each backtrack. We initially
record the end-position of the shortest accepting word, and given that we
know the length of each word, we can calculate the new position each time
as an offset from that first word. Depending on unicode and folding, that
calculation can be cheap or expensive.
This algorithm is optimised for the typical case where there are a small
number (<= 2) accepting states.
This patch creates a new compile-time array, trie->wordinfo[], indexed by
word number, which contains relevant info about each word. This also
supersedes the old trie->newword[] array, whose function of recording
"overspills" of multiple words per accept state, is now handled as part of
the wordinfo[].prev chain.
Diffstat (limited to 'regexp.h')
-rw-r--r-- | regexp.h | 15 |
1 files changed, 6 insertions, 9 deletions
@@ -490,13 +490,6 @@ and check for NULL. #define FBMrf_MULTILINE 1 -/* an accepting state/position*/ -struct _reg_trie_accepted { - U8 *endpos; - U16 wordnum; -}; -typedef struct _reg_trie_accepted reg_trie_accepted; - /* some basic information about the current match that is created by * Perl_regexec_flags and then passed to regtry(), regmatch() etc */ @@ -557,11 +550,15 @@ typedef struct regmatch_state { U32 lastparen; CHECKPOINT cp; - reg_trie_accepted *accept_buff; /* accepting states we have seen */ - U32 accepted; /* how many accepting states we have seen */ + U32 accepted; /* how many accepting states left */ U16 *jump; /* positive offsets from me */ regnode *B; /* node following the trie */ regnode *me; /* Which node am I - needed for jump tries*/ + U8 *firstpos;/* pos in string of first trie match */ + U32 firstchars;/* len in chars of firstpos from start */ + U16 nextword;/* next word to try */ + U16 topword; /* longest accepted word */ + bool longfold;/* saw a fold with a 1->n char mapping */ } trie; /* special types - these members are used to store state for special |