| Commit message | Author | Age | Files | Lines |
|
| |
As noted in the comments of the code, "a" =~ /[A]/i doesn't work currently
(except that regcomp.c knows about the ASCII characters and corrects for
it, but not always; it misses cases like "a" =~ /\p{Upper}/i). This
patch catches all those.
It works by computing a list of all characters that (singly) fold to
another one, and then checking each of those. The maximum length of
the list is 3 in the current Unicode standard.
I believe that a better long-term solution is to do this at compile
rather than execution time, by generating a closure of everything
matched. But this can't be done now because the data structure would
need to be extensively revamped to list all non-byte characters, and
user-defined \p{} matches are not known at compile-time.
This patch also doesn't handle the multi-char folds. There is a separate
ticket for those.
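A minimal sketch, in C, of the "list everything that folds together, then check each" idea described above. The table `fold_groups` and the functions `class_matches_fold` and `only_upper_a` are hypothetical names for illustration and are not the code in regexec.c; the table holds only three groups, whereas real data would come from Unicode's CaseFolding.txt.

```c
#include <stddef.h>

/* Hypothetical miniature table: each row lists code points that share a
 * single-character fold. Unused slots are 0. */
#define GROUP_MAX 4

static const unsigned long fold_groups[][GROUP_MAX] = {
    { 0x41,  0x61,  0,      0     },  /* A, a */
    { 0x4B,  0x6B,  0x212A, 0     },  /* K, k, KELVIN SIGN */
    { 0x398, 0x3B8, 0x3D1,  0x3F4 },  /* capital theta, theta, theta symbol,
                                         capital theta symbol */
};

/* A character class is modelled here as a predicate on code points. */
typedef int (*class_fn)(unsigned long cp);

/* Returns 1 if cp, or any character that shares a single-character fold
 * with cp, is a member of the class; otherwise returns 0. */
static int class_matches_fold(class_fn in_class, unsigned long cp)
{
    size_t g, i, j;

    if (in_class(cp))
        return 1;

    for (g = 0; g < sizeof fold_groups / sizeof fold_groups[0]; g++) {
        for (i = 0; i < GROUP_MAX; i++) {
            if (fold_groups[g][i] != cp)
                continue;
            /* cp belongs to this group: try every other member. */
            for (j = 0; j < GROUP_MAX; j++) {
                unsigned long other = fold_groups[g][j];
                if (other != 0 && other != cp && in_class(other))
                    return 1;
            }
            return 0;
        }
    }
    return 0;
}

/* Example class containing only 'A', like the pattern /[A]/. */
static int only_upper_a(unsigned long cp)
{
    return cp == 0x41;
}
```

With this sketch, `class_matches_fold(only_upper_a, 0x61)` succeeds because 'a' shares a fold with 'A', while 'b' still fails.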
|
| |
|
|
|
|
|
|
|
| |
As pointed out by Karl Williamson.
Apparently the wrong grouping only introduced an inefficiency, not any bugs, so
the tests still passed.
|
|
|
|
|
|
| |
gcc said:
regexec.c: In function ‘S_reginclass’:
regexec.c:6305: warning: suggest parentheses around ‘&&’ within ‘||’
|
|
|
|
|
|
| |
When the jump code was added, the case of an empty string followed
by a jump wasn't accounted for. One could argue that it should not happen;
however, that is a matter for a different commit.
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
The previous re-ordering of this function makes it clear that this test
doesn't do anything. It is testing the charclass bitmap, but that was
already done in the re-ordered block from a previous commit, so if it
didn't succeed there, it won't succeed here.
In fact, trying to understand why this code was here was what led me to
figure out that it wasn't needed, and that things could be sped up by doing
the reordering.
|
| |
This patch simply moves the block of code that does the bitmap tests in
front of the block of code that deals with potential things not in the
bit map. The reason to do this is that it is faster to find things in
the bitmap than to have to create a utf8 swash.
The patch also adds some comments. The first block no longer has to
test whether there has been a match, but the second block now does, so the
if statements for those two blocks are adjusted accordingly.
The proof that this doesn't break anything stems from the fact that the
routine never stops early. If there wasn't a match in the first block
of code, it would execute the second block. Thus swapping the order
doesn't affect the outcome. The side effect of the block that used to come
first is reading in the swash. This side effect won't happen if that block no
longer gets executed because the other block matched. And thus an error could
be introduced if there were coding errors elsewhere that didn't
initialize the swash before using it. But that doesn't appear to be the
case, as all tests pass.
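The ordering argument above can be sketched in C as follows. This is only an illustration under stated assumptions: `char_class`, `class_add_byte`, `class_contains`, and `slow_is_greek_alpha` are hypothetical names, and the counter `slow_calls` stands in for the cost of building a swash.

```c
/* Hypothetical character class: a 256-bit bitmap for byte-sized code
 * points plus an optional expensive fallback lookup. */
typedef struct {
    unsigned char bitmap[32];
    int (*slow_lookup)(unsigned long cp);
} char_class;

static int slow_calls = 0;  /* counts how often the expensive path runs */

/* Stand-in for a swash lookup; matches GREEK SMALL LETTER ALPHA only. */
static int slow_is_greek_alpha(unsigned long cp)
{
    slow_calls++;
    return cp == 0x3B1;
}

static void class_add_byte(char_class *cc, unsigned char c)
{
    cc->bitmap[c >> 3] |= (unsigned char)(1u << (c & 7));
}

static int class_contains(const char_class *cc, unsigned long cp)
{
    int match = 0;

    /* Cheap test first: no need to ask whether we already matched. */
    if (cp < 256)
        match = (cc->bitmap[cp >> 3] >> (cp & 7)) & 1;

    /* Expensive test second: skipped when the bitmap already matched. */
    if (!match && cc->slow_lookup)
        match = cc->slow_lookup(cp);

    return match;
}
```

Because neither block returns early on failure, swapping their order changes only how often the expensive lookup runs, not the result.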
|
|
|
|
|
| |
The previous changes have made it clear that this test never was useful,
so remove it.
|
|
|
|
|
|
| |
reginclass assumes that a match is always at least one character long. Make
that explicit, and now that we have that length always saved, don't
recalculate it.
|
|
|
|
| |
Several other variables in the routine have the previous name
|
|
|
|
|
| |
The call to reginclass is guaranteed by constness to not change
locinput, so if the match is going to fail anyway, don't waste time calling it.
|
| |
|
| |
|
|
|
|
|
| |
Now that reginclass is guaranteed to return the match length upon
success, the caller need not do it again.
|
|
|
|
| |
This also allows for less special case testing
|
| |
|
| |
|
| |
This is a continuation of [perl #78464]. It fixes it also for the /i
flag. After this, a character should match itself in the regrepeat
function, even if one is in utf8 and the other isn't, for both /i and
not.
The solution is to move the code for handling /i into the non-i
structure so that the decisions about utf8 are all in one place. When
the string is in utf8, it uses the utf8-fold function.
This has the added effect of fixing a few cases where a utf8 string did
not match a fold in a non-utf8 pattern. I haven't added tests for
these, as it only fixes a few cases where this is a problem, and I'm
working on a comprehensive solution to the problem, accompanied by
extensive tests.
|
|
|
|
|
|
|
|
|
|
| |
Some regex patterns don't match a character with itself when the target
string is in utf8 and the pattern isn't, and the character is variant
under utf8. (This means only Latin1-range characters in the pattern are
affected.)
The solution is to test for this case and use the utf8 representation of
the pattern character for the comparison.
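A minimal sketch, in C, of comparing a non-utf8 pattern character against a utf8 target by first converting the pattern character to its utf8 representation. The function names `latin1_to_utf8` and `match_char` are hypothetical and are not the names used in regexec.c.

```c
#include <stddef.h>
#include <string.h>

/* Encode one Latin1 byte as UTF-8. Returns the number of bytes written:
 * 1 for invariant characters, 2 for the variant ones (0x80..0xFF). */
static size_t latin1_to_utf8(unsigned char c, unsigned char out[2])
{
    if (c < 0x80) {
        out[0] = c;
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (c >> 6));
    out[1] = (unsigned char)(0x80 | (c & 0x3F));
    return 2;
}

/* Compare one pattern character (stored as a single byte) against the
 * start of the target. Returns the number of target bytes consumed on a
 * match, or 0 if there is no match. */
static size_t match_char(unsigned char pat_char,
                         const unsigned char *target, size_t target_len,
                         int target_is_utf8)
{
    unsigned char enc[2];
    size_t n;

    if (!target_is_utf8)
        return (target_len >= 1 && target[0] == pat_char) ? 1 : 0;

    /* Target is utf8: compare against the utf8 form of the character. */
    n = latin1_to_utf8(pat_char, enc);
    if (target_len >= n && memcmp(target, enc, n) == 0)
        return n;
    return 0;
}
```

For example, the pattern byte 0xE9 matches the two target bytes 0xC3 0xA9 when the target is flagged as utf8, and the single byte 0xE9 when it is not.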
|
| |
This commit causes regex sequences \b, \s, and \w (and complements) to
match in the latin1 range in the scope of feature 'unicode_strings' or
with the /u regex modifier.
It uses the previously unused flags field in the respective regnodes to
indicate the type of matching, and in regexec.c, uses that to decide
which of the handy.h macros to use, native or Latin1.
I chose this for now rather than create new nodes for each type of
match. An earlier version of this patch did that, and in every case the
switch case: statements were adjacent, offering no performance
advantage. If regexec were modified to use in-line functions or more
macros for various short sections of it, then it would be faster to have
new nodes rather than using the flags field. But, using that field
simplified things, as this change flies under the radar in a number of
places where it would not if separate nodes were used.
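The flags-field approach can be sketched in C like this. It is an illustration only: the `node` struct, the flag values `MATCH_NATIVE` and `MATCH_LATIN1`, and the helper functions are hypothetical and do not reproduce the real regnode layout or the handy.h macros. "Native" is assumed here to mean ASCII-only word characters.

```c
/* Hypothetical values for the node's flags field. */
enum { MATCH_NATIVE = 0, MATCH_LATIN1 = 1 };

typedef struct {
    unsigned char type;   /* which construct, e.g. \w */
    unsigned char flags;  /* which matching rules to apply */
} node;

/* Word character under ASCII-only rules. */
static int is_word_ascii(unsigned char c)
{
    return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z')
        || (c >= 'a' && c <= 'z') || c == '_';
}

/* Word character under Unicode rules, restricted to the Latin1 range:
 * adds U+00AA, U+00B5, U+00BA and the letters in U+00C0..U+00FF
 * (everything there except the multiplication and division signs). */
static int is_word_latin1(unsigned char c)
{
    return is_word_ascii(c)
        || c == 0xAA || c == 0xB5 || c == 0xBA
        || (c >= 0xC0 && c != 0xD7 && c != 0xF7);
}

/* One node type, with the flags field selecting the semantics. */
static int node_matches_word(const node *n, unsigned char c)
{
    return (n->flags == MATCH_LATIN1) ? is_word_latin1(c)
                                      : is_word_ascii(c);
}
```

The same node type serves both cases, so code that only inspects the node type does not need to know about the new behaviour.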
|
|
|
|
|
|
| |
Certain multi-line macros had their continuation backslashes way out.
One line of each is longer than 80 chars, but no point in making all
the lines that long.
|
|
|
|
|
|
|
| |
Add macros that will have unicode semantics; these share much code in
common with ones that don't. So factor out that common code.
These might be good candidates for inline functions.
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
| |
The commit message for v5.13.4-47-gd1c771f contained some misleading language
which I only noticed after I pushed. This change puts the comment in the
code and hopefully clarifies things properly.
In simple words: VERBS should *never* be included in the JUMPABLE condition.
|
|
| |
JUMPABLE nodes can be ignored during certain phases of regex execution,
including ones where backtracking is affected. This change disables this
behaviour so that the VERBS can perform their desired results.
Committer has taken the liberty of modifying the patch so that all
VERBS are jumped, thus making the JUMPABLE expression a little simpler.
I have left Bram's change to JUMPABLE intact, but inside of a comment
for now.
See discussion in thread for [perl #71942] *COMMIT bypasses optimisation
for further details.
http://rt.perl.org/rt3/Ticket/Display.html?id=71942
There appears to be room for further optimisation here
by moving the JUMPABLE logic to regex-compile time. Currently
it is arguable that the "optimisation" this patch seeks to avoid
is actually not an optimisation at all, as it happens OVER AND OVER
during execution of a match, thus the extra effort might actually
outweigh the benefit, especially on large strings.
|
| |
s+=UTF8SKIP(s) to move to the next char
Most of the regex code that does the two types of increment is wrapped up in macros.
Unfortunately these macros aren't suitable in this case because we use goto to jump
inside the loop under some situations, and since this is a one-off case I figured the
modest C&P associated was better than creating a new macro just for this case.
There is still a possible bug here marked by an XXX, which will need to be fixed
once I find out the correct way to simulate strptr--. Additionally I haven't found
a test case that actually exposes this form of the bug.
Moral of the story, utf8 makes string scanning awkward... And slow...
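A minimal sketch, in C, of the two kinds of increment mentioned above: one byte at a time for non-utf8 strings, one whole character at a time for utf8 strings. The function `utf8_skip` is a hypothetical stand-in for what UTF8SKIP provides, and `count_chars` exists only to exercise it.

```c
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence introduced by lead byte b.
 * Continuation or invalid lead bytes advance by one byte. */
static size_t utf8_skip(unsigned char b)
{
    if (b < 0x80)           return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Walk a string character by character and count the steps taken. */
static size_t count_chars(const unsigned char *s, size_t len, int is_utf8)
{
    const unsigned char *end = s + len;
    size_t n = 0;

    while (s < end) {
        size_t step = is_utf8 ? utf8_skip(*s) : 1;
        /* Never step past the end of a truncated sequence. */
        if (step > (size_t)(end - s))
            step = (size_t)(end - s);
        s += step;
        n++;
    }
    return n;
}
```

Moving forward is a table lookup on the lead byte; moving backward (the strptr-- case) has no such shortcut, since the scanner has to back up over continuation bytes to find the previous lead byte.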
|
|
|
|
|
|
| |
Strictly speaking charid would only get used uninitialized in debugging
output with a trie state machine that has no entries, but initialise it
to zero just to keep everyone happy.
|
| |
It seems that if a regex check string is determined to be "not useful"
it is permanently disabled. However, it appears that when doing this
we don't necessarily reset any flags that are related to it.
Worse, the behaviour is not deterministic, so it is quite possible that
a given program may experience this bug "randomly" based on what strings
it was matching. Thus it may be difficult to reproduce.
Resetting the RXc_ANCH_MBOL when we know that it is implicit
(PREGf_IMPLICIT) seems to fix /this/ particular example. But it wouldn't
surprise me to discover that other "random" bugs we encounter can be
traced back to this behaviour.
This fixes RT #75878 which is derived from ActiveState Bug #87173.
http://bugs.activestate.com/show_bug.cgi?id=87173
http://rt.perl.org/rt3/Ticket/Display.html?id=75878
|
|
|
|
|
|
|
|
|
| |
do_utf8 is changed to utf8_target
UTF is changed to UTF_PATTERN
This will help me keep track of the fact that there are four possible
combinations of these, and that ! do_utf8 doesn't necessarily mean don't
do utf8.
|
| |
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
| |
these errors were introduced in my trie-allocation patch,
2e64971a6530d2645969bc489f564bfd3ce64993
|
| |
This is an indirect fix for
[perl #74484] Regex causing exponential runtime+mem usage
The trie runtime code was doing more SAVETMPS than FREETMPS and was thus
growing a large tmps stack on heavy backtracking. Rather than fixing this
directly, I rewrote part of the trie code so that it no longer needs to
allocate memory in S_regmatch (it still does in find_byclass()).
The basic issue is that multiple branches in the trie may trigger an
accept state; for example:
"abcd" =~ /xyz|abcd.*X|ab.*Y/
here, words (branches) 2 and 3 are accept states. The original approach
was, at run time, to create a list of accepted word numbers and the
character positions of the end of each of those words. Then run the rest
of the pattern for each word in the list in turn (in word index order).
This requires memory for the list to be allocated and freed.
The new approach involves creating extra info at compile time; in
particular, for each word, a pointer to the previous accepted word (if
any) in the state tree. For example for the above pattern, part of the
state tree may be
     a    b    c    d
  1 -> 2 -> 3 -> 4 -> 5
           (#3)      (#2)
(e.g. at state 1, if the next char is 'a', we transition to state 2).
Here, state 3 is an accept state with word #3, and 5 is an accept state
with word #2. So we build a table indexed by word number, which has
wordinfo[2] = 3, wordinfo[3] = 0, thus building the word chain 2->3->0.
At run time we run the trie to completion, and remember the word
associated with the longest accept state (word #2 above). Then by following
back the chain of .prev fields, we can produce a list of all accepting
words. We then iteratively find the smallest-numbered (ie LH-most) word in
the chain, and run with it. On failure and backtrack, we find the
next-smallest and so on.
Since we are no longer recording the end-position of each word in the
string, we have to recalculate this for each backtrack. We initially
record the end-position of the shortest accepting word, and given that we
know the length of each word, we can calculate the new position each time
as an offset from that first word. Depending on unicode and folding, that
calculation can be cheap or expensive.
This algorithm is optimised for the typical case where there are a small
number (<= 2) of accepting states.
This patch creates a new compile-time array, trie->wordinfo[], indexed by
word number, which contains relevant info about each word. This also
supersedes the old trie->newword[] array, whose function of recording
"overspills" of multiple words per accept state, is now handled as part of
the wordinfo[].prev chain.
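The chain walk described above can be sketched in C as follows. This is an illustration under assumptions, not the real trie code: the struct `word_info` and the functions `next_word`, `shortest_word`, and `word_end` are hypothetical names, and the end-position arithmetic assumes one byte per character with no case folding (the cheap case).

```c
#include <stddef.h>

/* Hypothetical per-word info, indexed by word number (0 is unused). */
typedef struct {
    unsigned prev;  /* previous accepting word on the path, 0 if none */
    unsigned len;   /* length of the word in characters */
} word_info;

/* Starting from the word of the longest accept state (head), follow the
 * .prev chain and return the smallest word number greater than `after`.
 * Returns 0 when the chain has no such word left. */
static unsigned next_word(const word_info *wi, unsigned head, unsigned after)
{
    unsigned best = 0, w;

    for (w = head; w != 0; w = wi[w].prev)
        if (w > after && (best == 0 || w < best))
            best = w;
    return best;
}

/* The last word on the chain is the shortest accepting word. */
static unsigned shortest_word(const word_info *wi, unsigned head)
{
    unsigned w = head;

    while (wi[w].prev != 0)
        w = wi[w].prev;
    return w;
}

/* End position of word w, computed as an offset from the recorded end
 * position of the shortest accepting word. */
static size_t word_end(const word_info *wi, unsigned w,
                       unsigned shortest, size_t shortest_end)
{
    return shortest_end + (wi[w].len - wi[shortest].len);
}
```

For the example pattern, words 1 to 3 have lengths 3, 4 and 2, with wordinfo[2].prev = 3. Calling `next_word` with `after` set to 0, then 2, then 3 yields word 2, then word 3, then 0, which is the order in which the branches are tried on backtracking.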
|
| |
|
|
|
|
| |
This makes the other 26 (or 58) bits available for save data.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
bool b = (bool)some_int
doesn't necessarily do what you think. In some builds, bool is defined as
char, and that cast's behaviour is thus undefined. So this line in mg.c:
const bool was_temp = (bool)SvTEMP(sv);
was actually setting was_temp to false even when the SVs_TEMP flag was set.
Fix this by replacing all the (bool) casts with a new cBOOL() cast macro
that (hopefully) does the right thing.
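A minimal sketch, in C, of why a plain cast can lose the flag and how a normalising macro avoids it. The names `byte_bool`, `FLAG_TEMP`, and `TO_BOOL` are hypothetical; `TO_BOOL` only illustrates the kind of macro described and is not claimed to be the exact definition of cBOOL(). An unsigned byte type is assumed so that the narrowing is well defined in the example.

```c
/* Assumption: a build where the boolean type is a single byte. */
typedef unsigned char byte_bool;

/* Hypothetical flag bit that lies above the low 8 bits. */
#define FLAG_TEMP 0x0800u

/* Normalise to 0 or 1 before narrowing to the byte-sized type. */
#define TO_BOOL(x) ((x) ? (byte_bool)1 : (byte_bool)0)

/* Plain cast: only the low byte of the masked value survives. */
static byte_bool narrowed(unsigned flags)
{
    return (byte_bool)(flags & FLAG_TEMP);
}

/* Normalising conversion: any non-zero value becomes 1. */
static byte_bool normalised(unsigned flags)
{
    return TO_BOOL(flags & FLAG_TEMP);
}
```

With the flag set, `narrowed` returns 0 because the set bit does not fit in a byte, whereas `normalised` returns 1.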
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Solaris, VMS, and Win32 all failed to build after this change. In C99's
description of:
do statement while ( expression ) ;
the trailing semicolon does not appear to be optional. And at least
three compilers from three vendors agree.
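A minimal sketch, in C, of the usual way a statement-like macro is written so that the semicolon is always present: the macro body ends at `while (0)` and the caller supplies the terminating semicolon. `SWAP_INT` and `sort2` are hypothetical names used only for illustration.

```c
/* The macro deliberately has no trailing semicolon: the semicolon
 * written at the call site completes the do-while statement. */
#define SWAP_INT(a, b) \
    do { int tmp_ = (a); (a) = (b); (b) = tmp_; } while (0)

/* Orders two ints in place. Returns 1 if they were swapped, else 0. */
static int sort2(int *x, int *y)
{
    if (*x > *y)
        SWAP_INT(*x, *y);   /* the semicolon here is required */
    else
        return 0;           /* already in order */
    return 1;
}
```

Written this way the macro behaves as a single statement, so it can sit between `if` and `else` without braces.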
|
| |
|
| |
|
| |
Also revert 8902bb05b18c9858efa90229ca1ee42b17277554 as it merely
masked one symptom of the deeper problems.
Also fixes RT #69973, which was a segfault which was exposed by
8902bb05, see the ticket for further details.
http://rt.perl.org/rt3//Public/Bug/Display.html?id=69973
At the core of this is the problem that in unicode matching a bunch
of code points have case folding rules beyond just A-Z/a-z. Since
the case folding rules are decided at runtime by the string, we can't
use the same TRIE tables for both unicode/non-unicode matching.
Until this is reconciled or some other solution is found, case insensitive
matching only gets the TRIE optimisation when the pattern is unicode.
From CaseFolding.txt:
00B5; C; 03BC; # MICRO SIGN
00C0; C; 00E0; # LATIN CAPITAL LETTER A WITH GRAVE
00C1; C; 00E1; # LATIN CAPITAL LETTER A WITH ACUTE
00C2; C; 00E2; # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
00C3; C; 00E3; # LATIN CAPITAL LETTER A WITH TILDE
00C4; C; 00E4; # LATIN CAPITAL LETTER A WITH DIAERESIS
00C5; C; 00E5; # LATIN CAPITAL LETTER A WITH RING ABOVE
00C6; C; 00E6; # LATIN CAPITAL LETTER AE
00C7; C; 00E7; # LATIN CAPITAL LETTER C WITH CEDILLA
00C8; C; 00E8; # LATIN CAPITAL LETTER E WITH GRAVE
00C9; C; 00E9; # LATIN CAPITAL LETTER E WITH ACUTE
00CA; C; 00EA; # LATIN CAPITAL LETTER E WITH CIRCUMFLEX
00CB; C; 00EB; # LATIN CAPITAL LETTER E WITH DIAERESIS
00CC; C; 00EC; # LATIN CAPITAL LETTER I WITH GRAVE
00CD; C; 00ED; # LATIN CAPITAL LETTER I WITH ACUTE
00CE; C; 00EE; # LATIN CAPITAL LETTER I WITH CIRCUMFLEX
00CF; C; 00EF; # LATIN CAPITAL LETTER I WITH DIAERESIS
00D0; C; 00F0; # LATIN CAPITAL LETTER ETH
00D1; C; 00F1; # LATIN CAPITAL LETTER N WITH TILDE
00D2; C; 00F2; # LATIN CAPITAL LETTER O WITH GRAVE
00D3; C; 00F3; # LATIN CAPITAL LETTER O WITH ACUTE
00D4; C; 00F4; # LATIN CAPITAL LETTER O WITH CIRCUMFLEX
00D5; C; 00F5; # LATIN CAPITAL LETTER O WITH TILDE
00D6; C; 00F6; # LATIN CAPITAL LETTER O WITH DIAERESIS
00D8; C; 00F8; # LATIN CAPITAL LETTER O WITH STROKE
00D9; C; 00F9; # LATIN CAPITAL LETTER U WITH GRAVE
00DA; C; 00FA; # LATIN CAPITAL LETTER U WITH ACUTE
00DB; C; 00FB; # LATIN CAPITAL LETTER U WITH CIRCUMFLEX
00DC; C; 00FC; # LATIN CAPITAL LETTER U WITH DIAERESIS
00DD; C; 00FD; # LATIN CAPITAL LETTER Y WITH ACUTE
00DE; C; 00FE; # LATIN CAPITAL LETTER THORN
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
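The difference between the two sets of folding rules can be sketched in C using only the entries listed above. The functions `fold_ascii` and `fold_simple_latin1` are hypothetical names; the sketch covers simple (status C) folds only and leaves the multi-character fold of U+00DF untouched.

```c
/* Fold using A-Z/a-z rules only (non-unicode matching). */
static unsigned long fold_ascii(unsigned long cp)
{
    return (cp >= 0x41 && cp <= 0x5A) ? cp + 0x20 : cp;
}

/* Fold using the simple folds from the CaseFolding.txt excerpt above
 * for code points below U+0100, falling back to the ASCII rules. */
static unsigned long fold_simple_latin1(unsigned long cp)
{
    /* U+00C0..U+00DE fold to the code point 0x20 higher, except U+00D7
     * (MULTIPLICATION SIGN), which has no entry in the list. */
    if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7)
        return cp + 0x20;
    /* MICRO SIGN folds to GREEK SMALL LETTER MU, outside Latin1. */
    if (cp == 0xB5)
        return 0x3BC;
    return fold_ascii(cp);
}
```

The two functions disagree on U+00C0, for example, so a table built with one set of rules gives wrong transitions under the other.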
|
|
|
|
|
|
|
|
| |
It uses reg_temp_copy to copy the REGEXP onto the destination SV without
needing to copy the underlying pattern structure. This means changing
the prototype of reg_temp_copy, so it can copy onto a passed-in SV, but
it isn't API (and probably shouldn't be exported) so I don't think this
is a problem.
|
| |
|