| Commit message (Collapse) | Author | Age | Files | Lines |
|
| |
It seems that if a regex check string is determined to be "not useful"
it is permanently disabled. However, it appears that when doing this
we don't necessarily reset any flags that are related to it.
Worse, the behaviour is not deterministic, so it is quite possible that
a given program may experience this bug "randomly" based on what strings
it was matching. Thus it may be difficult to reproduce.
Resetting the RXc_ANCH_MBOL when we know that it is implicit
(PREGf_IMPLICIT) seems to fix /this/ particular example. But it wouldn't
surprise me to discover that other "random" bugs we encounter can be
traced back to this behaviour.
This fixes RT #75878 which is derived from ActiveState Bug #87173.
http://bugs.activestate.com/show_bug.cgi?id=87173
http://rt.perl.org/rt3/Ticket/Display.html?id=75878
|
|
|
|
|
|
|
|
|
| |
do_utf8 is changed to utf8_target
UTF is changed to UTF_PATTERN
This will help me keep track of the fact that there are four possible
combinations of these, and that ! do_utf8 doesn't necessarily mean don't
do utf8.
|
| |
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
| |
these errors were introduced in my trie-allocation patch,
2e64971a6530d2645969bc489f564bfd3ce64993
|
| |
This is an indirect fix for
[perl #74484] Regex causing exponential runtime+mem usage
The trie runtime code was doing more SAVETMPS than FREETMPS and was thus
growing a large tmps stack on heavy backtracking. Rather than fixing this
directly, I rewrote part of the trie code so that it no longer needs to
allocate memory in S_regmatch (it still does in find_byclass()).
The basic issue is that multiple branches in the trie may trigger an
accept state; for example:
"abcd" =~ /xyz|abcd.*X|ab.*Y/
here, words (branches) 2 and 3 are accept states. The original approach
was, at run time, to create a list of accepted word numbers and the
character positions of the end of each of those words. Then run the rest
of the pattern for each word in the list in turn (in word index order).
This requires memory for the list to be allocated and freed.
The new approach involves creating extra info at compile time; in
particular, for each word, a pointer to the previous accepted word (if
any) in the state tree. For example for the above pattern, part of the
state tree may be
  a    b    c    d
1 -> 2 -> 3 -> 4 -> 5
         (#3)      (#2)
(e.g. at state 1, if the next char is 'a', we transition to state 2).
Here, state 3 is an accept state with word #3, and 5 is an accept state
with word #2. So we build a table indexed by word number, which has
wordinfo[2].prev = 3, wordinfo[3].prev = 0, thus building the word chain 2->3->0.
At run time we run the trie to completion, and remember the word
associated with the longest accept state (word #2 above). Then by following
back the chain of .prev fields, we can produce a list of all accepting
words. We then iteratively find the smallest-numbered (i.e. LH-most) word in
the chain, and run with it. On failure and backtrack, we find the
next-smallest and so on.
Since we are no longer recording the end-position of each word in the
string, we have to recalculate this for each backtrack. We initially
record the end-position of the shortest accepting word, and given that we
know the length of each word, we can calculate the new position each time
as an offset from that first word. Depending on unicode and folding, that
calculation can be cheap or expensive.
This algorithm is optimised for the typical case where there are a small
number (<= 2) of accepting states.
This patch creates a new compile-time array, trie->wordinfo[], indexed by
word number, which contains relevant info about each word. This also
supersedes the old trie->newword[] array, whose function of recording
"overspills" of multiple words per accept state, is now handled as part of
the wordinfo[].prev chain.
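The chain walk described above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the code in regexec.c: the `wordinfo_t` struct, the helpers `next_word()` and `word_end()`, and the `example_wordinfo` table are hypothetical names, and the position arithmetic assumes one byte per character with no folding (the cheap case mentioned above).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one trie->wordinfo[] entry. */
typedef struct {
    unsigned prev;  /* previous (shorter) accepting word in the chain, 0 = none */
    unsigned len;   /* length of the word in characters */
} wordinfo_t;

/* Table for /xyz|abcd.*X|ab.*Y/: word 1 = "xyz", word 2 = "abcd", word 3 = "ab".
 * Index 0 is unused so that 0 can terminate a chain. */
static const wordinfo_t example_wordinfo[4] = {
    {0, 0}, {0, 3}, {3, 4}, {0, 2}
};

/* Return the smallest word number in the chain starting at `longest` that is
 * strictly greater than `after` (pass 0 for the first candidate).
 * Returns 0 once the chain has no further candidates. */
static unsigned next_word(const wordinfo_t *wordinfo, unsigned longest,
                          unsigned after)
{
    unsigned best = 0;
    assert(wordinfo != NULL);
    for (unsigned w = longest; w != 0; w = wordinfo[w].prev) {
        if (w > after && (best == 0 || w < best))
            best = w;
    }
    return best;
}

/* End position of word `w`, derived from the recorded end position of the
 * shortest accepting word. Assumes one byte per character, no folding. */
static size_t word_end(const wordinfo_t *wordinfo, size_t shortest_end,
                       unsigned shortest, unsigned w)
{
    return shortest_end + (wordinfo[w].len - wordinfo[shortest].len);
}
```

For the "abcd" example, the trie finishes with word 2 as the longest accept state, so the first candidate is word 2, and on backtracking the next candidate is word 3. No list is allocated at run time; each backtrack simply rescans the short chain.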
|
| |
|
|
|
|
| |
This makes the other 26 (or 58) bits available for save data.
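As a rough illustration of the arithmetic, the sketch below packs a small type tag and a payload into one 32-bit word. It assumes the tag occupies the low 6 bits (since 32 - 26 = 64 - 58 = 6); the macro and function names are hypothetical and are not taken from the Perl source.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the type tag takes the low 6 bits, leaving 26 bits of payload. */
#define SAVE_TYPE_BITS 6u
#define SAVE_TYPE_MASK ((UINT32_C(1) << SAVE_TYPE_BITS) - 1u)

/* Pack a type tag into the low bits and a payload into the remaining bits. */
static uint32_t pack_save(uint32_t type, uint32_t data)
{
    assert(type <= SAVE_TYPE_MASK);
    assert(data < (UINT32_C(1) << (32u - SAVE_TYPE_BITS)));
    return (data << SAVE_TYPE_BITS) | type;
}

/* Recover the type tag from a packed word. */
static uint32_t save_type(uint32_t packed)
{
    return packed & SAVE_TYPE_MASK;
}

/* Recover the payload from a packed word. */
static uint32_t save_data(uint32_t packed)
{
    return packed >> SAVE_TYPE_BITS;
}
```

On a 64-bit word the same layout would leave 58 bits for the payload.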
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
bool b = (bool)some_int
doesn't necessarily do what you think. In some builds, bool is defined as
char, and that cast's behaviour is thus undefined. So this line in mg.c:
const bool was_temp = (bool)SvTEMP(sv);
was actually setting was_temp to false even when the SVs_TEMP flag was set.
Fix this by replacing all the (bool) casts with a new cBOOL() cast macro
that (hopefully) does the right thing.
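A minimal sketch of the failure mode, assuming a build where bool is a plain char. The names `char_bool`, `FLAG_TEMP` and `TO_BOOL` are hypothetical stand-ins (the flag value is illustrative, not Perl's real SVs_TEMP), and the narrowing result shown is what common compilers produce rather than something the C standard guarantees for a signed char.

```c
#include <assert.h>

/* Stand-in for builds where bool is a typedef for char (hypothetical name). */
typedef char char_bool;

/* Illustrative flag that lives above the low 8 bits. */
#define FLAG_TEMP 0x00080000UL

/* Truth-normalising cast in the spirit of cBOOL(): test first, then
 * yield exactly 0 or 1. */
#define TO_BOOL(x) ((x) ? (char_bool)1 : (char_bool)0)

/* Narrowing keeps only the low bits on typical platforms, so the flag is lost. */
static char_bool naive_cast(unsigned long flags)
{
    return (char_bool)(flags & FLAG_TEMP);
}

/* Testing the value before narrowing preserves its truth. */
static char_bool safe_cast(unsigned long flags)
{
    return TO_BOOL(flags & FLAG_TEMP);
}

/* Small self-check of the well-defined half of the comparison. */
void check_safe_cast(void)
{
    assert(safe_cast(FLAG_TEMP) == 1);
    assert(safe_cast(0UL) == 0);
}
```

With the flag set, `naive_cast()` typically returns 0 while `safe_cast()` returns 1, which matches the was_temp symptom described above.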
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Solaris, VMS, and Win32 all failed to build after this change. In C99's
description of:
do statement while ( expression ) ;
the trailing semicolon does not appear to be optional. And at least
three compilers from three vendors agree.
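For reference, a small sketch of where the semicolon belongs. It is not the code that broke the build; `SWAP_INT`, `count_down()` and `swapped_first()` are hypothetical examples. The multi-statement macro deliberately omits the semicolon so that the caller supplies it, whereas a do statement written out directly must carry its own.

```c
/* Multi-statement macro: the caller's ';' completes the do statement. */
#define SWAP_INT(a, b) do { int tmp_ = (a); (a) = (b); (b) = tmp_; } while (0)

/* A directly written do statement: the trailing ';' is part of the grammar. */
static int count_down(int n)
{
    int steps = 0;
    do {
        n--;
        steps++;
    } while (n > 0);
    return steps;
}

/* The macro behaves as a single statement inside an unbraced if/else. */
static int swapped_first(int x, int y)
{
    if (x != y)
        SWAP_INT(x, y);
    else
        x = -1;
    return x;
}
```

If the macro ended in `while (0);` itself, the `else` in `swapped_first()` would no longer attach to the `if`, and if a do statement is emitted with no semicolon at all, conforming compilers reject it.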
|
| |
|
| |
|
| |
Also revert 8902bb05b18c9858efa90229ca1ee42b17277554 as it merely
masked one symptom of the deeper problems.
Also fixes RT #69973, which was a segfault which was exposed by
8902bb05, see the ticket for further details.
http://rt.perl.org/rt3//Public/Bug/Display.html?id=69973
At the core of this is the problem that in unicode matching a bunch
of code points have case folding rules beyond just A-Z/a-z. Since
the case folding rules are decided at runtime by the string, we can't
use the same TRIE tables for both unicode/non-unicode matching.
Until this is reconciled or some other solution is found, case insensitive
matching only gets the TRIE optimisation when the pattern is unicode.
From CaseFolding.txt:
00B5; C; 03BC; # MICRO SIGN
00C0; C; 00E0; # LATIN CAPITAL LETTER A WITH GRAVE
00C1; C; 00E1; # LATIN CAPITAL LETTER A WITH ACUTE
00C2; C; 00E2; # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
00C3; C; 00E3; # LATIN CAPITAL LETTER A WITH TILDE
00C4; C; 00E4; # LATIN CAPITAL LETTER A WITH DIAERESIS
00C5; C; 00E5; # LATIN CAPITAL LETTER A WITH RING ABOVE
00C6; C; 00E6; # LATIN CAPITAL LETTER AE
00C7; C; 00E7; # LATIN CAPITAL LETTER C WITH CEDILLA
00C8; C; 00E8; # LATIN CAPITAL LETTER E WITH GRAVE
00C9; C; 00E9; # LATIN CAPITAL LETTER E WITH ACUTE
00CA; C; 00EA; # LATIN CAPITAL LETTER E WITH CIRCUMFLEX
00CB; C; 00EB; # LATIN CAPITAL LETTER E WITH DIAERESIS
00CC; C; 00EC; # LATIN CAPITAL LETTER I WITH GRAVE
00CD; C; 00ED; # LATIN CAPITAL LETTER I WITH ACUTE
00CE; C; 00EE; # LATIN CAPITAL LETTER I WITH CIRCUMFLEX
00CF; C; 00EF; # LATIN CAPITAL LETTER I WITH DIAERESIS
00D0; C; 00F0; # LATIN CAPITAL LETTER ETH
00D1; C; 00F1; # LATIN CAPITAL LETTER N WITH TILDE
00D2; C; 00F2; # LATIN CAPITAL LETTER O WITH GRAVE
00D3; C; 00F3; # LATIN CAPITAL LETTER O WITH ACUTE
00D4; C; 00F4; # LATIN CAPITAL LETTER O WITH CIRCUMFLEX
00D5; C; 00F5; # LATIN CAPITAL LETTER O WITH TILDE
00D6; C; 00F6; # LATIN CAPITAL LETTER O WITH DIAERESIS
00D8; C; 00F8; # LATIN CAPITAL LETTER O WITH STROKE
00D9; C; 00F9; # LATIN CAPITAL LETTER U WITH GRAVE
00DA; C; 00FA; # LATIN CAPITAL LETTER U WITH ACUTE
00DB; C; 00FB; # LATIN CAPITAL LETTER U WITH CIRCUMFLEX
00DC; C; 00FC; # LATIN CAPITAL LETTER U WITH DIAERESIS
00DD; C; 00FD; # LATIN CAPITAL LETTER Y WITH ACUTE
00DE; C; 00FE; # LATIN CAPITAL LETTER THORN
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
|
|
|
|
|
|
|
|
| |
It uses reg_temp_copy to copy the REGEXP onto the destination SV without
needing to copy the underlying pattern structure. This means changing
the prototype of reg_temp_copy, so it can copy onto a passed-in SV, but
it isn't API (and probably shouldn't be exported) so I don't think this
is a problem.
|
| |
|
|
|
|
| |
this is a precursor step to fixing the re/pat_special_cc.t failures.
|
|
|
|
|
|
|
| |
This is a first step towards macroizing the special CC handler logic so
it is easier to maintain them; for instance, interesting optimisations are
being used in one, but not all, even though the logic is sharable. By
moving the BOUND logic out of the way the code repetition is much clearer.
|
| |
|
| |
|
| |
|
|
| |
http://rt.perl.org/rt3/Ticket/Display.html?id=69056
In perl 5.8 we get this:
$ perl -Mre=debug -le '$_="foo"; s/(.)\G//g; print'
Freeing REx: `","'
Compiling REx `(.)\G'
size 7 Got 60 bytes for offset annotations.
first at 3
1: OPEN1(3)
3: REG_ANY(4)
4: CLOSE1(6)
6: GPOS(7)
7: END(0)
GPOS minlen 1
Offsets: [7]
1[1] 0[0] 2[1] 3[1] 0[0] 4[2] 6[0]
Matching REx `(.)\G' against `foo'
Setting an EVAL scope, savestack=3
0 <> <foo> | 1: OPEN1
0 <> <foo> | 3: REG_ANY
1 <f> <oo> | 4: CLOSE1
1 <f> <oo> | 6: GPOS
failed...
Setting an EVAL scope, savestack=3
1 <f> <oo> | 1: OPEN1
1 <f> <oo> | 3: REG_ANY
2 <fo> <o> | 4: CLOSE1
2 <fo> <o> | 6: GPOS
failed...
Setting an EVAL scope, savestack=3
2 <fo> <o> | 1: OPEN1
2 <fo> <o> | 3: REG_ANY
3 <foo> <> | 4: CLOSE1
3 <foo> <> | 6: GPOS
failed...
Setting an EVAL scope, savestack=3
3 <foo> <> | 1: OPEN1
3 <foo> <> | 3: REG_ANY
failed...
Match failed
foo
Freeing REx: `"(.)\\G"'
In perl 5.10 we get this:
$ perl -Mre=debug -le '$_="foo"; s/(.)\G//g; print'
Compiling REx "(.)\G"
Final program:
1: OPEN1 (3)
3: REG_ANY (4)
4: CLOSE1 (6)
6: GPOS (7)
7: END (0)
anchored(GPOS) GPOS:1 minlen 1
Matching REx "(.)\G" against "foo"
-1 <> <%0foo> | 1:OPEN1(3)
-1 <> <%0foo> | 3:REG_ANY(4)
0 <> <foo> | 4:CLOSE1(6)
0 <> <foo> | 6:GPOS(7)
0 <> <foo> | 7:END(0)
Match successful!
Segmentation fault
With this patch we get:
$ ./perl -Ilib -Mre=debug -le '$_="foo"; s/(.)\G//g; print'
Compiling REx "(.)\G"
Final program:
1: OPEN1 (3)
3: REG_ANY (4)
4: CLOSE1 (6)
6: GPOS (7)
7: END (0)
anchored(GPOS) GPOS:1 minlen 1
Matching REx "(.)\G" against "foo"
Match failed
foo
Freeing REx: "(.)\G"
Which seems to me to be a net improvement.
|
|
|
|
|
|
|
|
|
|
|
| |
Commit c74340f9 added backreferences as well as the idea of a ->swap
regex pointer to keep track of the match offsets in case of backtracking.
The problem is that when Perl re-enters the regex engine to handle
utf8::SWASHNEW, the ->swap is not saved/restored/cleared so any capture
from the utf8 (Perl) code could inadvertently modify the regex match
data that caused the utf8 swash to get built.
This change should close out RT #60508
|
|
|
|
|
|
|
|
|
|
|
|
| |
If the op inside of a (?{ }) construct is another regex, the two
regexen end up corrupting each others' end-of-string markers,
resulting in various pathologies including access violations,
stack corruptions, and memory use growing without bound.
The change here is intended to be a relatively safe, cheap way to
prevent memory errors and makes no attempt to save and restore
other aspects of regex state; i.e., general purpose reentrancy
for the regex engine is still a TODO.
|
|
|
|
|
|
|
|
| |
This looks to be a simple oversight. All tests pass here.
Hugo
Signed-off-by: H.Merijn Brand <h.m.brand@xs4all.nl>
|
| |
|
| |
PL_reglastparen and PL_reglastcloseparen are pointers that are set to &rex->lastparen and &rex->lastcloseparen.
In case END the rex variable is modified, but PL_reglastparen and PL_reglastcloseparen are not.
Some parts of the code access PL_reglastparen while other parts use rex->lastparen.
This patch corrects this and adds 3 assertions.
I'm currently unable to prove (with a test case) that the code in case EVAL_ab is really necessary.
Logically speaking it is necessary, but I do not know whether it can cause test failures.
Also in the patch are missing regressions between 5.8 -> 5.10 and 5.10 -> 5.11 (and a test script that contains these regressions).
Message-ID: <rt-3.6.HEAD-4802-1236806863-900.56194-15-0@perl.org>
[Includes message and patch edits by committer.]
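The pointer/field mismatch described in this commit can be illustrated with a pared-down sketch. Everything here is hypothetical: `rx_t`, `g_lastparen`, `g_lastcloseparen`, `switch_regex()` and `globals_in_sync()` only stand in for the real regex struct and the PL_reglastparen / PL_reglastcloseparen globals, and the real patch may keep them in sync differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical regex struct reduced to the two fields under discussion. */
typedef struct {
    unsigned lastparen;
    unsigned lastcloseparen;
} rx_t;

/* Stand-ins for PL_reglastparen / PL_reglastcloseparen. */
static unsigned *g_lastparen;
static unsigned *g_lastcloseparen;

/* Make `rex` current and re-point the globals at its fields in one step,
 * so the two access paths cannot drift apart. */
static rx_t *switch_regex(rx_t *rex)
{
    assert(rex != NULL);
    g_lastparen = &rex->lastparen;
    g_lastcloseparen = &rex->lastcloseparen;
    return rex;
}

/* The invariant an assertion could check: both globals alias the
 * fields of the regex that is currently in use. */
static int globals_in_sync(const rx_t *rex)
{
    return g_lastparen == &rex->lastparen
        && g_lastcloseparen == &rex->lastcloseparen;
}
```

Changing the current regex with a bare assignment leaves the globals pointing at the old struct, so writes through `g_lastparen` and reads of `rex->lastparen` disagree. Routing every switch through one helper, and asserting the invariant where both paths are used, is one way to catch that.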
|
| |
|
|
|
|
|
|
|
| |
(reminder)
Date: 17 Nov 2007 16:29:29 +0100
Message-ID: <87r6iohova.fsf@biokovo-amd64.herceg.de>
|
| |
|
| |
|
|
|
|
|
|
|
|
|
| |
As Nicholas already noted in a FIXME, assigning to DEFSV should
use GvSV instead of GvSVn. This change ensures that, at least
under -DPERL_CORE, DEFSV cannot be assigned to and introduces
a DEFSV_set macro to allow setting DEFSV.
This fixes #53038: map leaks memory.
p4raw-id: //depot/perl@34776
|
|
|
|
|
|
|
| |
MUTABLE_SV() check. Use SvPVX_const() instead of SvPVX()
where only a const SV* is available. Also fix two falsely
consted pointers in Perl_sv_2pv_flags().
p4raw-id: //depot/perl@34770
|
|
|
|
|
|
|
|
|
|
|
| |
* Make ANYOF output from regprop easier to read by adding ][ in between the unicode representation and the "ascii" one
* Make it possible to make tests in re_tests todo.
* add a todo test for a complementary character class match that should fail (perl #60156)
* Also add a comment explaining a previous commit (relating to perl #60344)
p4raw-id: //depot/perl@34755
|
|
|
|
|
|
|
|
|
|
| |
5.10
During the de-recursivization it looks like Dave M forgot to reset the 'logical'
flag after using it, which in turn causes UNLESSM/IFTHEN, when used after a LOGICAL operator, to
be incorrectly interpreted. This change resets the logical flag each time it is stored
in ST.logical.
p4raw-id: //depot/perl@34746
|
|
|
|
|
|
| |
Message-ID: <25940.1225611819@chthon>
Date: Sun, 02 Nov 2008 01:43:39 -0600
p4raw-id: //depot/perl@34698
|
|
|
|
|
| |
erroneous const in dump.c.
p4raw-id: //depot/perl@34675
|
|
|
| |
p4raw-id: //depot/perl@34650
|
|
|
| |
p4raw-id: //depot/perl@34629
|
|
|
| |
p4raw-id: //depot/perl@34585
|
|
|
|
|
|
|
|
|
|
|
|
| |
ability to create landmines that will explode under someone in the
future when they upgrade their compiler to one with better
optimisation. We've already done this at least twice.
(Yes, some of the assertions are after code that would already have
SEGVd because it already dereferences a pointer, but they are put in
to make it easier to automate checking that each and every case is
covered.)
Add a tool, checkARGS_ASSERT.pl, to check that every case is covered.
p4raw-id: //depot/perl@33291
|
|
|
|
|
|
| |
indirectly via RVs. This saves memory, and removes 1 level of pointer
indirection.
p4raw-id: //depot/perl@32950
|
|
|
|
|
| |
ithreads, so don't waste time doing it there.
p4raw-id: //depot/perl@32939
|
|
|
|
|
|
| |
rather than put in lots of hacks to work round not reference counting
it.
p4raw-id: //depot/perl@32938
|
|
|
| |
p4raw-id: //depot/perl@32913
|
|
|
|
|
|
| |
reference count only needs "doubling" when the scalar is pushed onto
PL_regex_padav for the second time.
p4raw-id: //depot/perl@32899
|