path: root/regexec.c
Commit message [Author, Age, Files, Lines]
* when disabling regex implicit check string we must reset anchored flag [Yves Orton, 2010-06-24, 1 file, -0/+5]
  It seems that if a regex check string is determined to be "not useful" it is permanently disabled. However, it appears that when doing this we don't necessarily reset any flags that are related to it. Worse, the behaviour is not deterministic, so it is quite possible that a given program may experience this bug "randomly" based on what strings it was matching. Thus it may be difficult to reproduce.
  Resetting the RXc_ANCH_MBOL when we know that it is implicit (PREGf_IMPLICIT) seems to fix /this/ particular example. But it wouldn't surprise me to discover that other "random" bugs we encounter can be traced back to this behaviour.
  This fixes RT #75878 which is derived from ActiveState Bug #87173.
  http://bugs.activestate.com/show_bug.cgi?id=87173
  http://rt.perl.org/rt3/Ticket/Display.html?id=75878
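The idea in this commit (drop a flag that only made sense while the implicit check string was active) can be sketched roughly as follows. This is a minimal illustration, not the actual patch: the struct, the flag values and the `sketch_*` names are all hypothetical stand-ins.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values, for illustration only; not Perl's real ones. */
#define SKETCH_ANCH_MBOL 0x01u  /* pattern is treated as anchored at line starts */
#define SKETCH_IMPLICIT  0x02u  /* that anchor was added implicitly */

/* Hypothetical, heavily reduced regex record. */
struct sketch_regex {
    uint32_t extflags;      /* externally visible flags */
    uint32_t intflags;      /* internal flags */
    const char *check_str;  /* check string, NULL once disabled */
};

/* Disable the check string and clear the anchoring flag that depended
 * on it, but only when the anchor was implicit. */
static void sketch_disable_check(struct sketch_regex *rx)
{
    rx->check_str = NULL;
    if (rx->intflags & SKETCH_IMPLICIT)
        rx->extflags &= ~SKETCH_ANCH_MBOL;
}
```

An explicitly anchored pattern keeps its flag in this sketch; only the implicit case is reset.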
* regexec.c: change names of two vars for clarity [Karl Williamson, 2010-06-07, 1 file, -169/+169]
  do_utf8 is changed to utf8_target
  UTF is changed to UTF_PATTERN
  This will help me keep track of the fact that there are four possible combinations of these, and that ! do_utf8 doesn't necessarily mean don't do utf8.
* reduce size of regmatch_state.u.curlyx by 2 words [David Mitchell, 2010-06-06, 1 file, -13/+14]
* micro-optimise a bit of trie code [David Mitchell, 2010-06-06, 1 file, -14/+13]
* Change regexec.c to use new foldEQ functions [Karl Williamson, 2010-06-05, 1 file, -12/+12]
* Clarify that count is bytes not unicode characters [Karl Williamson, 2010-05-29, 1 file, -1/+1]
* Fix build warnings introduced by v5.13.0-139-ge0fa7e2 [Jerry D. Hedden, 2010-05-24, 1 file, -1/+2]
* fix a couple of var types [David Mitchell, 2010-05-03, 1 file, -3/+3]
  these errors were introduced in my trie-allocation patch, 2e64971a6530d2645969bc489f564bfd3ce64993
* tries: don't allocate memory at runtime [David Mitchell, 2010-05-03, 1 file, -178/+209]
  This is an indirect fix for [perl #74484] Regex causing exponential runtime+mem usage
  The trie runtime code was doing more SAVETMPS than FREETMPS and was thus growing a large tmps stack on heavy backtracking. Rather than fixing this directly, I rewrote part of the trie code so that it no longer needs to allocate memory in S_regmatch (it still does in find_byclass()).
  The basic issue is that multiple branches in the trie may trigger an accept state; for example:
      "abcd" =~ /xyz|abcd.*X|ab.*Y/
  here, words (branches) 2 and 3 are accept states. The original approach was, at run time, to create a list of accepted word numbers and the character positions of the end of each of those words. Then run the rest of the pattern for each word in the list in turn (in word index order). This requires memory for the list to be allocated and freed.
  The new approach involves creating extra info at compile time; in particular, for each word, a pointer to the previous accepted word (if any) in the state tree. For example for the above pattern, part of the state tree may be
          a    b    c    d
        1 -> 2 -> 3 -> 4 -> 5
                 (#3)      (#2)
  (e.g. at state 1, if the next char is 'a', we transition to state 2). Here, state 3 is an accept state with word #3, and 5 is an accept state with word #2. So we build a table indexed by word number, which has wordinfo[2] = 3, wordinfo[3] = 0, thus building the word chain 2->3->0.
  At run time we run the trie to completion, and remember the word associated with the longest accept state (word #2 above). Then by following back the chain of .prev fields, we can produce a list of all accepting words. We then iteratively find the smallest-numbered (ie LH-most) word in the chain, and run with it. On failure and backtrack, we find the next-smallest and so on.
  Since we are no longer recording the end-position of each word in the string, we have to recalculate this for each backtrack. We initially record the end-position of the shortest accepting word, and given that we know the length of each word, we can calculate the new position each time as an offset from that first word. Depending on unicode and folding, that calculation can be cheap or expensive.
  This algorithm is optimised for the typical case where there are a small number (<= 2) of accepting states.
  This patch creates a new compile-time array, trie->wordinfo[], indexed by word number, which contains relevant info about each word. This also supersedes the old trie->newword[] array, whose function of recording "overspills" of multiple words per accept state is now handled as part of the wordinfo[].prev chain.
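The chain walk described in this commit message might look something like the sketch below. Only the `.prev` link and the word numbering come from the message; the struct layout and the `sketch_*` names are assumptions made for illustration, and the real trie code carries more per-word data (such as lengths) than is shown here.

```c
#include <stddef.h>

/* Hypothetical per-word record; only the .prev link is described in
 * the commit message. Index 0 is unused so that 0 can mean "none". */
struct sketch_wordinfo {
    unsigned prev;  /* previous accepting word on the same path, 0 = none */
};

/* Starting from the word of the longest accept state, follow the .prev
 * chain and return the smallest word number greater than `after`.
 * Pass after = 0 on the first call; a result of 0 means no word is left. */
static unsigned sketch_next_word(const struct sketch_wordinfo *wordinfo,
                                 unsigned longest, unsigned after)
{
    unsigned best = 0;
    for (unsigned w = longest; w != 0; w = wordinfo[w].prev) {
        if (w > after && (best == 0 || w < best))
            best = w;
    }
    return best;
}
```

With the example chain 2->3->0, the first call yields word 2, a retry after backtracking yields word 3, and a further retry yields 0. No list is allocated at run time; each retry rescans the chain, which is cheap when only one or two words accept.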
* For SAVEt_REGCONTEXT, store the number of save stack entries used with the type. [Nicholas Clark, 2010-05-02, 1 file, -7/+11]
* On the save stack, store the save type as the bottom 6 bits of a UV. [Nicholas Clark, 2010-05-01, 1 file, -2/+2]
  This makes the other 26 (or 58) bits available for save data.
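As a rough illustration of the packing, the sketch below keeps a type in the bottom 6 bits of a 64-bit word and uses the remaining 58 bits for data. It uses `uint64_t` as a stand-in for a 64-bit UV; the macro and function names are hypothetical and do not reflect Perl's actual save-stack macros.

```c
#include <stdint.h>

#define SKETCH_TYPE_BITS 6
#define SKETCH_TYPE_MASK ((UINT64_C(1) << SKETCH_TYPE_BITS) - 1)

/* Pack a save type (0..63) into the low 6 bits and data into the rest. */
static uint64_t sketch_pack(uint64_t data, unsigned type)
{
    return (data << SKETCH_TYPE_BITS) | (type & SKETCH_TYPE_MASK);
}

/* Recover the type from the low 6 bits. */
static unsigned sketch_type(uint64_t packed)
{
    return (unsigned)(packed & SKETCH_TYPE_MASK);
}

/* Recover the data from the upper 58 bits. */
static uint64_t sketch_data(uint64_t packed)
{
    return packed >> SKETCH_TYPE_BITS;
}
```

One packed word can then carry, for example, both a type tag and a count of stack entries, which is the kind of use the SAVEt_REGCONTEXT commit describes.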
* Untangle REGCP_FRAME_ELEMS from REGCP_OTHER_ELEMS. [Nicholas Clark, 2010-05-01, 1 file, -10/+11]
* use cBOOL for bool casts [David Mitchell, 2010-04-15, 1 file, -13/+13]
  bool b = (bool)some_int doesn't necessarily do what you think. In some builds, bool is defined as char, and that cast's behaviour is thus undefined. So this line in mg.c:
      const bool was_temp = (bool)SvTEMP(sv);
  was actually setting was_temp to false even when the SVs_TEMP flag was set.
  Fix this by replacing all the (bool) casts with a new cBOOL() cast macro that (hopefully) does the right thing.
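The failure mode can be reproduced in a small sketch. Here `sketch_bool` is an `unsigned char` stand-in for a build where bool is a char type, so the narrowing simply keeps the low byte; the flag value and all `sketch_*` names are hypothetical, and `SKETCH_CBOOL` is only written in the spirit of cBOOL(), not copied from Perl's headers.

```c
/* Stand-in for a build where bool is a typedef for a char type. */
typedef unsigned char sketch_bool;

/* Hypothetical flag that lives above the low 8 bits. */
#define SKETCH_FLAG_TEMP 0x00080000UL

/* Normalising macro: any non-zero value becomes 1, zero stays 0. */
#define SKETCH_CBOOL(x) ((x) ? (sketch_bool)1 : (sketch_bool)0)

/* Plain cast: narrows to one byte, so a flag in a high bit is lost. */
static sketch_bool sketch_plain_cast(unsigned long flags)
{
    return (sketch_bool)(flags & SKETCH_FLAG_TEMP);
}

/* Normalised: tests for non-zero before narrowing. */
static sketch_bool sketch_normalised(unsigned long flags)
{
    return SKETCH_CBOOL(flags & SKETCH_FLAG_TEMP);
}
```

With the flag set, the plain cast yields 0 on a platform with 8-bit chars while the normalised version yields 1.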
* Allow U+0FFFF in regex [Karl Williamson, 2009-12-20, 1 file, -2/+4]
* Fix compile failure introduced in 37e2e78edfe0a224b8a615820f46db879584f523. [Craig A. Berry, 2009-12-13, 1 file, -9/+9]
  Solaris, VMS, and Win32 all failed to build after this change. In C99's description of:
      do statement while ( expression ) ;
  the trailing semicolon does not appear to be optional. And at least three compilers from three vendors agree.
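The commit message does not show the offending code, so the following is only an assumed illustration of where the rule usually bites: a multi-statement macro wrapped in `do { } while (0)`. The macro body stops at `while (0)` so that the semicolon written at the call site completes the statement. `SKETCH_SWAP_INT` and `sketch_max_first` are hypothetical names.

```c
/* Hypothetical multi-statement macro. It deliberately ends at
 * while (0) with no semicolon; the caller supplies the ';' that the
 * do-while grammar requires. */
#define SKETCH_SWAP_INT(a, b) \
    do {                      \
        int sketch_tmp = (a); \
        (a) = (b);            \
        (b) = sketch_tmp;     \
    } while (0)

/* Put the larger value first; return 1 if a swap happened, else 0. */
static int sketch_max_first(int *x, int *y)
{
    if (*x < *y)
        SKETCH_SWAP_INT(*x, *y);  /* the ';' here is mandatory */
    else
        return 0;
    return 1;
}
```

Because the expansion plus the caller's semicolon forms exactly one statement, the macro is safe inside an unbraced if/else.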
* qr/\X/ expansion [Karl Williamson, 2009-12-05, 1 file, -15/+229]
* SvREFCNT_dec already checks if the SV is non-NULL (continued) [Vincent Pit, 2009-11-08, 1 file, -2/+2]
* disable non-unicode case insensitive trie matching [Yves Orton, 2009-10-25, 1 file, -7/+2]
  Also revert 8902bb05b18c9858efa90229ca1ee42b17277554 as it merely masked one symptom of the deeper problems.
  Also fixes RT #69973, which was a segfault which was exposed by 8902bb05, see the ticket for further details.
  http://rt.perl.org/rt3//Public/Bug/Display.html?id=69973
  At the core of this is the problem that in unicode matching a bunch of code points have case folding rules beyond just A-Z/a-z. Since the case folding rules are decided at runtime by the string, we can't use the same TRIE tables for both unicode/non-unicode matching. Until this is reconciled or some other solution is found, case insensitive matching only gets the TRIE optimisation when the pattern is unicode.
  From CaseFolding.txt:
      00B5; C; 03BC; # MICRO SIGN
      00C0; C; 00E0; # LATIN CAPITAL LETTER A WITH GRAVE
      00C1; C; 00E1; # LATIN CAPITAL LETTER A WITH ACUTE
      00C2; C; 00E2; # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
      00C3; C; 00E3; # LATIN CAPITAL LETTER A WITH TILDE
      00C4; C; 00E4; # LATIN CAPITAL LETTER A WITH DIAERESIS
      00C5; C; 00E5; # LATIN CAPITAL LETTER A WITH RING ABOVE
      00C6; C; 00E6; # LATIN CAPITAL LETTER AE
      00C7; C; 00E7; # LATIN CAPITAL LETTER C WITH CEDILLA
      00C8; C; 00E8; # LATIN CAPITAL LETTER E WITH GRAVE
      00C9; C; 00E9; # LATIN CAPITAL LETTER E WITH ACUTE
      00CA; C; 00EA; # LATIN CAPITAL LETTER E WITH CIRCUMFLEX
      00CB; C; 00EB; # LATIN CAPITAL LETTER E WITH DIAERESIS
      00CC; C; 00EC; # LATIN CAPITAL LETTER I WITH GRAVE
      00CD; C; 00ED; # LATIN CAPITAL LETTER I WITH ACUTE
      00CE; C; 00EE; # LATIN CAPITAL LETTER I WITH CIRCUMFLEX
      00CF; C; 00EF; # LATIN CAPITAL LETTER I WITH DIAERESIS
      00D0; C; 00F0; # LATIN CAPITAL LETTER ETH
      00D1; C; 00F1; # LATIN CAPITAL LETTER N WITH TILDE
      00D2; C; 00F2; # LATIN CAPITAL LETTER O WITH GRAVE
      00D3; C; 00F3; # LATIN CAPITAL LETTER O WITH ACUTE
      00D4; C; 00F4; # LATIN CAPITAL LETTER O WITH CIRCUMFLEX
      00D5; C; 00F5; # LATIN CAPITAL LETTER O WITH TILDE
      00D6; C; 00F6; # LATIN CAPITAL LETTER O WITH DIAERESIS
      00D8; C; 00F8; # LATIN CAPITAL LETTER O WITH STROKE
      00D9; C; 00F9; # LATIN CAPITAL LETTER U WITH GRAVE
      00DA; C; 00FA; # LATIN CAPITAL LETTER U WITH ACUTE
      00DB; C; 00FB; # LATIN CAPITAL LETTER U WITH CIRCUMFLEX
      00DC; C; 00FC; # LATIN CAPITAL LETTER U WITH DIAERESIS
      00DD; C; 00FD; # LATIN CAPITAL LETTER Y WITH ACUTE
      00DE; C; 00FE; # LATIN CAPITAL LETTER THORN
      00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
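To make the mismatch concrete, the sketch below contrasts an A-Z only fold with a fold that also applies the simple (type C) entries from the CaseFolding.txt excerpt quoted in this commit. Both functions are hypothetical helpers written for illustration, not Perl's folding code, and the multi-character fold of U+00DF is left out.

```c
/* Fold with ASCII-only rules: just A-Z to a-z. */
static unsigned sketch_fold_ascii(unsigned c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Fold with the simple (type C) entries from the excerpt as well.
 * U+00D7 has no entry in the excerpt and is left unchanged;
 * U+00DF has a full (type F) fold to two characters and is not handled. */
static unsigned sketch_fold_unicode(unsigned c)
{
    if (c >= 'A' && c <= 'Z')
        return c + ('a' - 'A');
    if (c == 0x00B5)
        return 0x03BC;
    if (c >= 0x00C0 && c <= 0x00DE && c != 0x00D7)
        return c + 0x20;
    return c;
}
```

The same input, 0xC0, folds to itself under the first rule set and to 0xE0 under the second, so a trie table built with one set of rules gives wrong transitions when the target string calls for the other.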
* RT#69616: regexp SVs lose regexpness in assignment [Ben Morrow, 2009-10-22, 1 file, -1/+1]
  It uses reg_temp_copy to copy the REGEXP onto the destination SV without needing to copy the underlying pattern structure. This means changing the prototype of reg_temp_copy, so it can copy onto a passed-in SV, but it isn't API (and probably shouldn't be exported) so I don't think this is a problem.
* somewhat fix failing regex tests. but break lots of other stuff at the same time [Yves Orton, 2009-10-19, 1 file, -18/+51]
* refactor the special CC code in reg_try() [Yves Orton, 2009-10-16, 1 file, -140/+75]
  this is a precursor step to fixing the re/pat_special_cc.t failures.
* in regexec.c move the BOUND logic out of the way of the special CC logic [Yves Orton, 2009-10-05, 1 file, -41/+41]
  This is a first step towards macroizing the special CC handler logic so it is easier to maintain them; for instance, interesting optimisations are being used in one, but not all, even though the logic is sharable. By moving the BOUND logic out of the way the code repetition is much clearer.
* refactor some common logic in regexec.c [Yves Orton, 2009-10-04, 1 file, -10/+4]
* fix format warnings from regexec.c [Robin Barker, 2009-09-22, 1 file, -3/+3]
* add more positive gofs GPOS tests and fix some bugs too [Yves Orton, 2009-09-10, 1 file, -4/+16]
* Fix RT69056 - positive GPOS leads to segv on first match [Yves Orton, 2009-09-09, 1 file, -1/+4]
  http://rt.perl.org/rt3/Ticket/Display.html?id=69056
  In perl 5.8 we get this:
      $ perl -Mre=debug -le '$_="foo"; s/(.)\G//g; print'
      Freeing REx: `","'
      Compiling REx `(.)\G'
      size 7 Got 60 bytes for offset annotations.
      first at 3
         1: OPEN1(3)
         3: REG_ANY(4)
         4: CLOSE1(6)
         6: GPOS(7)
         7: END(0)
      GPOS minlen 1
      Offsets: [7]
          1[1] 0[0] 2[1] 3[1] 0[0] 4[2] 6[0]
      Matching REx `(.)\G' against `foo'
        Setting an EVAL scope, savestack=3
         0 <> <foo> | 1: OPEN1
         0 <> <foo> | 3: REG_ANY
         1 <f> <oo> | 4: CLOSE1
         1 <f> <oo> | 6: GPOS
                      failed...
        Setting an EVAL scope, savestack=3
         1 <f> <oo> | 1: OPEN1
         1 <f> <oo> | 3: REG_ANY
         2 <fo> <o> | 4: CLOSE1
         2 <fo> <o> | 6: GPOS
                      failed...
        Setting an EVAL scope, savestack=3
         2 <fo> <o> | 1: OPEN1
         2 <fo> <o> | 3: REG_ANY
         3 <foo> <> | 4: CLOSE1
         3 <foo> <> | 6: GPOS
                      failed...
        Setting an EVAL scope, savestack=3
         3 <foo> <> | 1: OPEN1
         3 <foo> <> | 3: REG_ANY
                      failed...
      Match failed
      foo
      Freeing REx: `"(.)\\G"'
  In perl 5.10 we get this:
      $ perl -Mre=debug -le '$_="foo"; s/(.)\G//g; print'
      Compiling REx "(.)\G"
      Final program:
         1: OPEN1 (3)
         3: REG_ANY (4)
         4: CLOSE1 (6)
         6: GPOS (7)
         7: END (0)
      anchored(GPOS) GPOS:1 minlen 1
      Matching REx "(.)\G" against "foo"
        -1 <> <%0foo> | 1:OPEN1(3)
        -1 <> <%0foo> | 3:REG_ANY(4)
         0 <> <foo> | 4:CLOSE1(6)
         0 <> <foo> | 6:GPOS(7)
         0 <> <foo> | 7:END(0)
      Match successful!
      Segmentation fault
  With this patch we get:
      $ ./perl -Ilib -Mre=debug -le '$_="foo"; s/(.)\G//g; print'
      Compiling REx "(.)\G"
      Final program:
         1: OPEN1 (3)
         3: REG_ANY (4)
         4: CLOSE1 (6)
         6: GPOS (7)
         7: END (0)
      anchored(GPOS) GPOS:1 minlen 1
      Matching REx "(.)\G" against "foo"
      Match failed
      foo
      Freeing REx: "(.)\G"
  Which seems to me to be a net improvement.
* much better swap logic to support reentrancy and fix assert failure [George Greer, 2009-07-26, 1 file, -29/+17]
  Commit c74340f9 added backreferences as well as the idea of a ->swap regex pointer to keep track of the match offsets in case of backtracking. The problem is that when Perl re-enters the regex engine to handle utf8::SWASHNEW, the ->swap is not saved/restored/cleared so any capture from the utf8 (Perl) code could inadvertently modify the regex match data that caused the utf8 swash to get built.
  This change should close out RT #60508
* Save and restore PL_regeol for op inside of regex (RT #66110) [Craig A. Berry, 2009-07-25, 1 file, -0/+2]
  If the op inside of a (?{ }) construct is another regex, the two regexen end up corrupting each others' end-of-string markers, resulting in various pathologies including access violations, stack corruptions, and memory use growing without bound.
  The change here is intended to be a relatively safe, cheap way to prevent memory errors and makes no attempt to save and restore other aspects of regex state; i.e., general purpose reentrancy for the regex engine is still a TODO.
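The save-and-restore pattern this commit describes can be sketched in isolation. The global below is a hypothetical stand-in for an engine-wide end-of-string marker, and `sketch_inner_match` merely simulates a nested match overwriting it; none of these names exist in Perl.

```c
#include <stddef.h>

/* Hypothetical stand-in for an engine-wide end-of-string marker. */
static const char *sketch_regeol = NULL;

/* Simulates a nested match, which points the shared marker at the
 * end of its own subject string. */
static void sketch_inner_match(const char *s, size_t len)
{
    sketch_regeol = s + len;
}

/* Run a nested match on behalf of an outer one, preserving the
 * outer match's marker across the call. */
static void sketch_run_nested(const char *inner, size_t len)
{
    const char *saved_eol = sketch_regeol;  /* save */
    sketch_inner_match(inner, len);
    sketch_regeol = saved_eol;              /* restore */
}
```

As the commit notes for the real change, this protects one piece of state only; anything else the nested match touches would need the same treatment.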
* Regex fails when string is too long [hv@crypt.org, 2009-07-06, 1 file, -2/+3]
  This looks to be a simple oversight. All tests pass here.
  Hugo
  Signed-off-by: H.Merijn Brand <h.m.brand@xs4all.nl>
* fix [RT #60034]. An equivalent fix was already in 5.8.9 as change 34580. [David Mitchell, 2009-03-22, 1 file, -2/+5]
* Fix #56194 Regex: (((??{1 + $^N}))) behaves differently in 5.10.0 than in blead [Bram, 2009-03-12, 1 file, -1/+15]
  PL_reglastparen and PL_reglastcloseparen are pointers that are set to &rex->lastparen and &rex->lastcloseparen. In case END the rex var is modified but PL_reglastparen and PL_reglastcloseparen are not. Some parts of the code access PL_reglastparen while other parts use rex->lastparen. This patch corrects this and adds 3 assertions.
  I'm currently unable to prove (with a test case) that the code in case EVAL_ab is really necessary... Logically speaking it is necessary but I do not know if it can cause test failures.
  Also in the patch are missing regressions between 5.8 -> 5.10 and 5.10 -> 5.11. (and a test script that contains these regressions)
  Message-ID: <rt-3.6.HEAD-4802-1236806863-900.56194-15-0@perl.org>
  [Includes message and patch edits by committer.]
* Fix memory leak [Karl, 2009-01-26, 1 file, -0/+3]
* Another regexp failure with utf8-flagged string and byte-flagged pattern (reminder) [Slaven Rezic, 2009-01-04, 1 file, -2/+6]
  Date: 17 Nov 2007 16:29:29 +0100
  Message-ID: <87r6iohova.fsf@biokovo-amd64.herceg.de>
* Fix malformed utf8 in regexec.c [Karl, 2008-12-28, 1 file, -6/+12]
* fix bug #57042 - preserve $^R across TRIE matches [Yves Orton, 2008-12-27, 1 file, -1/+4]
* Assigning to DEFSV leaks if PL_defgv's gp_sv isn't set. [Marcus Holland-Moritz, 2008-11-08, 1 file, -1/+1]
  As Nicholas already noted in a FIXME, assigning to DEFSV should use GvSV instead of GvSVn. This change ensures that, at least under -DPERL_CORE, DEFSV cannot be assigned to and introduces a DEFSV_set macro to allow setting DEFSV. This fixes #53038: map leaks memory.
  p4raw-id: //depot/perl@34776
* Revert SvPVX() to allow lvalue usage, but also add a MUTABLE_SV() check. [Marcus Holland-Moritz, 2008-11-07, 1 file, -1/+1]
  Use SvPVX_const() instead of SvPVX() where only a const SV* is available. Also fix two falsely consted pointers in Perl_sv_2pv_flags().
  p4raw-id: //depot/perl@34770
* Various changes to regex diagnostics and testing [Yves Orton, 2008-11-06, 1 file, -1/+2]
  - Make ANYOF output from regprop easier to read by adding ][ in between the unicode representation and the "ascii" one
  - Make it possible to make tests in re_tests todo.
  - add a todo test for a complementary character class match that should fail (perl #60156)
  - Also add a comment explaining a previous commit (relating to perl #60344)
  p4raw-id: //depot/perl@34755
* Resolve perl #60344: Regex lookbehind failure after an (if)then|else in perl 5.10 [Yves Orton, 2008-11-06, 1 file, -0/+1]
  During the de-recursivization it looks like Dave M forgot to reset the 'logical' flag after using it, which in turn causes UNLESSM/IFTHEN when used after a LOGICAL operator to be incorrectly interpreted. This change resets the logical flag after each time it is stored in ST.logical.
  p4raw-id: //depot/perl@34746
* PATCH: Large omnibus patch to clean up the JRRT quotes [Tom Christiansen, 2008-11-02, 1 file, -1/+5]
  Message-ID: <25940.1225611819@chthon>
  Date: Sun, 02 Nov 2008 01:43:39 -0600
  p4raw-id: //depot/perl@34698
* Eliminate (SV *) casts from the rest of *.c, picking up one (further) erroneous const in dump.c. [Nicholas Clark, 2008-10-30, 1 file, -8/+8]
  p4raw-id: //depot/perl@34675
* Eliminate (AV *) casts in *.c. [Nicholas Clark, 2008-10-29, 1 file, -3/+3]
  p4raw-id: //depot/perl@34650
* Every remaining (HV *) cast in *.c [Nicholas Clark, 2008-10-28, 1 file, -2/+2]
  p4raw-id: //depot/perl@34629
* Update copyright years. [Nicholas Clark, 2008-10-25, 1 file, -1/+2]
  p4raw-id: //depot/perl@34585
* assert() that every NN argument is not NULL. [Nicholas Clark, 2008-02-12, 1 file, -15/+49]
  Otherwise we have the ability to create landmines that will explode under someone in the future when they upgrade their compiler to one with better optimisation. We've already done this at least twice. (Yes, some of the assertions are after code that would already have SEGVd because it already dereferences a pointer, but they are put in to make it easier to automate checking that each and every case is covered.)
  Add a tool, checkARGS_ASSERT.pl, to check that every case is covered.
  p4raw-id: //depot/perl@33291
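One way to picture the approach is a per-function assertion macro placed at the top of each function that takes a must-not-be-NULL pointer. The sketch below is an assumption about the general shape only; the macro and function names are hypothetical and are not taken from Perl's generated headers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-function macro listing the non-NULL arguments.
 * Keeping one macro per function makes it easy for a script to check
 * that every such function actually uses its macro. */
#define SKETCH_ARGS_ASSERT_COUNT_CHAR assert(s)

/* Count occurrences of c in the NUL-terminated string s (s must not be NULL). */
static size_t sketch_count_char(const char *s, char c)
{
    SKETCH_ARGS_ASSERT_COUNT_CHAR;  /* fails fast in debugging builds */
    size_t n = 0;
    for (; *s; s++)
        if (*s == c)
            n++;
    return n;
}
```

In builds with NDEBUG defined the assertion compiles away, so the check costs nothing in production while still catching a NULL argument during development.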
* REGEXPs are now stored directly in PL_regex_padav, rather than indirectly via RVs. [Nicholas Clark, 2008-01-11, 1 file, -1/+1]
  This saves memory, and removes 1 level of pointer indirection.
  p4raw-id: //depot/perl@32950
* It seems that you don't need to reference count PL_reg_curpm without ithreads, so don't waste time doing it there. [Nicholas Clark, 2008-01-10, 1 file, -0/+5]
  p4raw-id: //depot/perl@32939
* The correct solution is to reference count the regexp in PL_reg_curpm, rather than put in lots of hacks to work round not reference counting it. [Nicholas Clark, 2008-01-10, 1 file, -3/+6]
  p4raw-id: //depot/perl@32938
* Change 32899 missed the other double-reference count. [Nicholas Clark, 2008-01-09, 1 file, -1/+1]
  p4raw-id: //depot/perl@32913
* Correct a long-standing ithreads reference counting anomaly - the reference count only needs "doubling" when the scalar is pushed onto PL_regex_padav for the second time. [Nicholas Clark, 2008-01-08, 1 file, -1/+1]
  p4raw-id: //depot/perl@32899