path: root/regexec.c
Commit message (Author, Age; Files, Lines)
* regexec.c: Don't give up on fold matching early (Karl Williamson, 2010-11-07; 1 file, -0/+45)
  As noted in the comments of the code, "a" =~ /[A]/i doesn't work currently (except that regcomp.c knows about the ASCII characters and corrects for it, but not always; for example, not in cases like "a" =~ /\p{Upper}/i). This patch catches all those. It works by computing a list of all characters that (singly) fold to another one, and then checking each of those. The maximum length of the list is 3 in the current Unicode standard.

  I believe that a better long-term solution is to do this at compile rather than execution time, by generating a closure of everything matched. But this can't be done now because the data structure would need to be extensively revamped to list all non-byte characters, and user-defined \p{} matches are not known at compile-time. This patch also doesn't handle the multi-char folds; there is a separate ticket for those.
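The "list of everything that folds together" idea can be sketched in a few lines of C. This is only an illustration under stated assumptions: the table `fold_sets`, the predicate `is_upper_ascii`, and the function `matches_under_fold` are hypothetical names, not code from regexec.c, and the table holds just two sample fold sets.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mini fold table: each row lists code points that share one
 * simple case fold, padded with 0. Three is the longest set needed here. */
#define MAX_FOLD_SET 3
static const uint32_t fold_sets[][MAX_FOLD_SET] = {
    { 0x4B, 0x6B, 0x212A },   /* K, k, KELVIN SIGN */
    { 0x41, 0x61, 0 },        /* A, a */
};

typedef int (*class_pred)(uint32_t cp);

/* Sample class: stands in for something like \p{Upper}, ASCII only. */
static int is_upper_ascii(uint32_t cp) { return cp >= 'A' && cp <= 'Z'; }

/* Return 1 if cp, or any character in the same fold set, is in the class. */
int matches_under_fold(uint32_t cp, class_pred in_class)
{
    size_t i, j;
    if (in_class(cp))
        return 1;
    for (i = 0; i < sizeof fold_sets / sizeof fold_sets[0]; i++) {
        int member = 0;
        for (j = 0; j < MAX_FOLD_SET; j++)
            if (fold_sets[i][j] != 0 && fold_sets[i][j] == cp)
                member = 1;
        if (!member)
            continue;
        /* cp belongs to this set: try every other member against the class. */
        for (j = 0; j < MAX_FOLD_SET; j++)
            if (fold_sets[i][j] != 0 && in_class(fold_sets[i][j]))
                return 1;
    }
    return 0;
}
```

With this sketch, 'a' matches the upper-case class because its fold set contains 'A', which is the behaviour the commit describes for /i.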
* regexec.c: change variable name to its purpose (Karl Williamson, 2010-11-07; 1 file, -4/+6)
* Correct grouping in S_reginclass (Florian Ragwitz, 2010-11-07; 1 file, -1/+1)
  As pointed out by Karl Williamson. Apparently the wrong grouping only introduced an inefficiency, not any bugs, so the tests still passed.
* Fix a compiler warning (Florian Ragwitz, 2010-11-06; 1 file, -1/+1)
  gcc said:
    regexec.c: In function ‘S_reginclass’:
    regexec.c:6305: warning: suggest parentheses around ‘&&’ within ‘||’
* fix the trie part of rt-78356 (Yves Orton, 2010-11-03; 1 file, -1/+2)
  When the jump code was added, the case of an empty string followed by a jump wasn't accounted for. One could argue it should not happen; however, that is a matter for a different commit.
* Fix RT-70998: qq{\x{30ab}} =~ /\xab|\xa9/ produces warnings [compile] (Yves Orton, 2010-11-02; 1 file, -4/+10)
* reginclass: Remove redundant test (Karl Williamson, 2010-10-31; 1 file, -6/+3)
  The previous re-ordering of this function makes it clear that this test doesn't do anything. It is testing the charclass bitmap, but that was already done in the re-ordered block from a previous commit, so if it didn't succeed there, it won't succeed here. In fact, trying to understand why this code was here was what led me to figure out that it wasn't needed, and that things could be sped up by doing the reordering.
* reginclass: Reorder fastest first (Karl Williamson, 2010-10-31; 1 file, -51/+55)
  This patch simply moves the block of code that does the bitmap tests in front of the block of code that deals with potential things not in the bitmap. The reason to do this is that it is faster to find things in the bitmap than to have to create a utf8 swash.

  The patch also adds some comments. The first block doesn't have to test if there has been a match, and the second block does, so the if statements for those two blocks are adjusted accordingly.

  The proof that this doesn't break anything stems from the fact that the routine never stops early. If there wasn't a match in the first block of code, it would execute the second block. Thus swapping the order doesn't affect the outcome.

  The side effects of the formerly first block are reading in the swash. These side effects won't happen if it no longer gets executed, because the other block matched. And thus an error could be introduced if there were coding errors elsewhere that didn't initialize the swash before using it. But that doesn't appear to be the case, as all tests pass.
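The ordering argument can be illustrated with a small sketch: test the cheap bitmap first and only fall through to the expensive lookup when the bitmap did not match. The struct `char_class`, the helper `class_add_byte`, and the counting callback `counting_lookup` are hypothetical stand-ins; the real routine consults a swash rather than a function pointer.

```c
#include <stdint.h>

typedef struct {
    uint8_t bitmap[32];               /* one bit per byte value 0..255 */
    int (*slow_lookup)(uint32_t cp);  /* hypothetical stand-in for the swash lookup; may be NULL */
} char_class;

/* Counts how often the slow path runs, so the saving is observable. */
static int slow_calls = 0;
static int counting_lookup(uint32_t cp)
{
    slow_calls++;
    return cp == 0x3B1;               /* pretend only GREEK SMALL LETTER ALPHA is listed */
}

void class_add_byte(char_class *cc, uint8_t c)
{
    cc->bitmap[c >> 3] |= (uint8_t)(1u << (c & 7));
}

int class_matches(const char_class *cc, uint32_t cp)
{
    int match = 0;

    /* Fast block first: a single bit test, no table construction. */
    if (cp < 256)
        match = (cc->bitmap[cp >> 3] >> (cp & 7)) & 1;

    /* Slow block only when the fast block did not already match. */
    if (!match && cc->slow_lookup)
        match = cc->slow_lookup(cp);

    return match;
}
```

Because neither block returns early on failure, the result is the same in either order; only the number of slow lookups changes.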
* reginclass: Remove unnecessary test (Karl Williamson, 2010-10-31; 1 file, -1/+1)
  The previous changes have made it clear that this test never was useful, so remove it.
* reginclass: Make explicit the length assumptions (Karl Williamson, 2010-10-31; 1 file, -7/+9)
  reginclass assumes that it can always match at least one character. Make that explicit, and now that we have that length always saved, don't recalculate it.
* reginclass: Rename variable for clarity (Karl Williamson, 2010-10-31; 1 file, -4/+4)
  Several other variables in the routine have the previous name.
* regexec.c: reorder statements for speed (Karl Williamson, 2010-10-31; 1 file, -4/+4)
  The call to reginclass is guaranteed by constness to not change locinput, so if the match is going to fail anyway, don't waste time calling it.
* regexec.c: Add clarifying comment (Karl Williamson, 2010-10-31; 1 file, -1/+5)
* reginclass: add some consts to prototype (Karl Williamson, 2010-10-31; 1 file, -1/+1)
* regexec.c: Remove redundant line. (Karl Williamson, 2010-10-31; 1 file, -1/+1)
  Now that reginclass is guaranteed to return the match length upon success, the caller need not do it again.
* reginclass: Return matched length even if not utf8 (Karl Williamson, 2010-10-31; 1 file, -10/+18)
  This also allows for less special case testing.
* reginclass: Change variable name for clarity. (Karl Williamson, 2010-10-31; 1 file, -7/+9)
* regexec.c: Document existing reginclass behavior (Karl Williamson, 2010-10-31; 1 file, -1/+7)
* regexec.c: utf8 doesn't match /i nonutf8 self (Karl Williamson, 2010-10-21; 1 file, -30/+79)
  This is a continuation of [perl #78464]. It fixes it also for the /i flag. After this, a character should match itself in the regrepeat function, even if one is in utf8 and the other isn't, for both /i and not.

  The solution is to move the code for handling /i into the non-i structure so that the decisions about utf8 are all in one place. When the string is in utf8, it uses the utf8-fold function.

  This has the added effect of fixing a few cases where a utf8 string did not match a fold in a non-utf8 pattern. I haven't added tests for these, as it only fixes a few cases where this is a problem, and I'm working on a comprehensive solution to the problem, accompanied by extensive tests.
* regexec.c: utf8 doesn't match non-utf8 self (Karl Williamson, 2010-10-21; 1 file, -3/+37)
  Some regex patterns don't match a character with itself when the target string is in utf8 and the pattern isn't, and the character is variant under utf8. (This means only Latin1-range characters in the pattern are affected.) The solution is to test for this case and use the utf8 representation of the pattern character for the comparison.
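To see why a Latin1-range character fails to match "itself", it helps to look at the bytes. The sketch below assumes an ASCII-based platform (not EBCDIC); `latin1_to_utf8` and `utf8_target_starts_with` are hypothetical helper names, not functions from regexec.c.

```c
#include <stddef.h>
#include <string.h>

/* Encode one Latin1 character (0x00..0xFF) as UTF-8; returns the byte count.
 * Characters below 0x80 are invariant; 0x80..0xFF become two bytes. */
size_t latin1_to_utf8(unsigned char c, unsigned char out[2])
{
    if (c < 0x80) {
        out[0] = c;
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (c >> 6));
    out[1] = (unsigned char)(0x80 | (c & 0x3F));
    return 2;
}

/* Compare a non-utf8 pattern character against the start of a utf8 target
 * by first converting the pattern character to its utf8 representation. */
int utf8_target_starts_with(const unsigned char *s, size_t len, unsigned char pat)
{
    unsigned char enc[2];
    size_t n = latin1_to_utf8(pat, enc);
    return len >= n && memcmp(s, enc, n) == 0;
}
```

For example, the pattern byte 0xE9 (é) never equals the first target byte 0xC3, but it does match once it is compared as the two-byte sequence 0xC3 0xA9.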
* Subject: [perl #58182] partial: Add uni \s,\w matching (Karl Williamson, 2010-10-15; 1 file, -29/+73)
  This commit causes regex sequences \b, \s, and \w (and complements) to match in the latin1 range in the scope of feature 'unicode_strings' or with the /u regex modifier.

  It uses the previously unused flags field in the respective regnodes to indicate the type of matching, and in regexec.c, uses that to decide which of the handy.h macros to use, native or Latin1.

  I chose this for now rather than create new nodes for each type of match. An earlier version of this patch did that, and in every case the switch case: statements were adjacent, offering no performance advantage. If regexec were modified to use in-line functions or more macros for various short sections of it, then it would be faster to have new nodes rather than using the flags field. But using that field simplified things, as this change flies under the radar in a number of places where it would not if separate nodes were used.
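The "one node, a flags field selects the semantics" design might look roughly like the following. Everything named here is an assumption for illustration: the enum values, `is_word_ascii`, `is_word_latin1`, and `node_matches_word` are not the handy.h macros, and "native" is simplified to mean ASCII-only rules.

```c
/* Hypothetical values for the regnode flags field. */
enum word_semantics { WORD_NATIVE = 0, WORD_UNICODE = 1 };

/* Native rules, simplified here to ASCII-only \w. */
static int is_word_ascii(unsigned char c)
{
    return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z')
        || (c >= 'a' && c <= 'z') || c == '_';
}

/* Unicode rules restricted to the Latin1 range: adds the letters
 * U+00AA, U+00B5, U+00BA and U+00C0..U+00FF except the two
 * arithmetic signs U+00D7 and U+00F7. */
static int is_word_latin1(unsigned char c)
{
    return is_word_ascii(c) || c == 0xAA || c == 0xB5 || c == 0xBA
        || (c >= 0xC0 && c != 0xD7 && c != 0xF7);
}

/* One node type; the flags value picks which predicate applies. */
int node_matches_word(unsigned char flags, unsigned char c)
{
    return flags == WORD_UNICODE ? is_word_latin1(c) : is_word_ascii(c);
}
```

The alternative the commit mentions, a separate node per semantics, would replace the conditional with distinct switch cases.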
* Subject: [PATCH] regexec.c: make macros fit 80 cols (Karl Williamson, 2010-10-15; 1 file, -68/+68)
  Certain multi-line macros had their continuation backslashes way out. One line of each is longer than 80 chars, but there is no point in making all the lines that long.
* Subject: [PATCH] regexec.c: add and refactor macros (Karl Williamson, 2010-10-15; 1 file, -5/+27)
  Add macros that will have unicode semantics; these share much code in common with ones that don't. So factor out that common code. These might be good candidates for inline functions.
* Fix typo spotted by avar++ (Florian Ragwitz, 2010-09-30; 1 file, -1/+1)
* Document why we're not using the save stack (Florian Ragwitz, 2010-09-30; 1 file, -0/+14)
* Localize PL_reg_state during re_evals (Florian Ragwitz, 2010-09-29; 1 file, -1/+6)
* eliminate unneeded code, and explain why the code was not needed (Yves Orton, 2010-08-26; 1 file, -2/+9)
  The commit message for v5.13.4-47-gd1c771f contained some misleading language which I only noticed after I pushed. This change puts the comment in the code and hopefully clarifies things properly. In simple words: VERBS should *never* be included in the JUMPABLE condition.
* VERB nodes in the regex engine should NOT be marked as JUMPABLE. (Bram, 2010-08-26; 1 file, -1/+2)
  JUMPABLE nodes can be ignored during certain phases of regex execution, including ones where backtracking is affected. This change disables this behaviour so that the VERBS can perform their desired results.

  Committer has taken the liberty of modifying the patch so that all VERBS are jumped, thus making the JUMPABLE expression a little simpler. I have left Bram's change to JUMPABLE intact, but inside of a comment for now.

  See discussion in thread for [perl #71942] *COMMIT bypasses optimisation for further details. http://rt.perl.org/rt3/Ticket/Display.html?id=71942

  There appears to be room for further optimisation here by moving the JUMPABLE logic to regex-compile time. Currently it is arguable that the "optimisation" this patch seeks to avoid is actually not an optimisation at all, as it happens OVER AND OVER during execution of a match; thus the extra effort might actually outweigh the benefit, especially on large strings.
* fix rt75680 - when working with utf8 strings one must always use s+=UTF8SKIP(s) to move to the next char (Yves Orton, 2010-08-23; 1 file, -21/+56)
  Most of the regex code where we do the two types of increments is wrapped up in macros. Unfortunately these macros aren't suitable in this case because we use goto to jump inside the loop under some situations, and since this is a one-off case I figured the modest C&P associated was better than creating a new macro just for this case.

  There is still a possible bug here marked by an XXX, which will need to be fixed once I find out the correct way to simulate strptr--. Additionally I haven't found a test case that actually exposes this form of the bug.

  Moral of the story, utf8 makes string scanning awkward... And slow...
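The rule "advance by the character's byte length, not by one" can be sketched as below. The function `utf8_skip` is a hypothetical stand-in written for this example, not Perl's UTF8SKIP macro, and it assumes standard UTF-8 lead bytes of up to four bytes.

```c
#include <stddef.h>

/* Byte length of the UTF-8 sequence introduced by this lead byte.
 * Continuation or invalid bytes count as 1 so the scan always advances. */
static size_t utf8_skip(unsigned char lead)
{
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Count characters in [s, end) by stepping a whole character at a time. */
size_t utf8_count_chars(const unsigned char *s, const unsigned char *end)
{
    size_t n = 0;
    while (s < end) {
        size_t step = utf8_skip(*s);
        if (step > (size_t)(end - s))   /* truncated sequence: stop at end */
            step = (size_t)(end - s);
        s += step;
        n++;
    }
    return n;
}
```

Stepping with s++ instead would land in the middle of a multi-byte character, which is the class of bug the commit describes.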
* fix 'might be used uninitialized' in S_regmatch (David Mitchell, 2010-08-18; 1 file, -1/+1)
  Strictly speaking charid would only get used uninitialized in debugging output with a trie state machine that has no entries, but initialise it to zero just to keep everyone happy.
* when disabling regex implicit check string we must reset anchored flag (Yves Orton, 2010-06-24; 1 file, -0/+5)
  It seems that if a regex check string is determined to be "not useful" it is permanently disabled. However, it appears that when doing this we don't necessarily reset any flags that are related to it. Worse, the behaviour is not deterministic, so it is quite possible that a given program may experience this bug "randomly" based on what strings it was matching. Thus it may be difficult to reproduce.

  Resetting the RXc_ANCH_MBOL when we know that it is implicit (PREGf_IMPLICIT) seems to fix /this/ particular example. But it wouldn't surprise me to discover that other "random" bugs we encounter can be traced back to this behaviour.

  This fixes RT #75878 which is derived from ActiveState Bug #87173.
    http://bugs.activestate.com/show_bug.cgi?id=87173
    http://rt.perl.org/rt3/Ticket/Display.html?id=75878
* regexec.c: change names of two vars for clarity (Karl Williamson, 2010-06-07; 1 file, -169/+169)
    do_utf8 is changed to utf8_target
    UTF is changed to UTF_PATTERN
  This will help me keep track of the fact that there are four possible combinations of these, and that ! do_utf8 doesn't necessarily mean don't do utf8.
* reduce size of regmatch_state.u.curlyx by 2 words (David Mitchell, 2010-06-06; 1 file, -13/+14)
* micro-optimise a bit of trie code (David Mitchell, 2010-06-06; 1 file, -14/+13)
* Change regexec.c to use new foldEQ functions (Karl Williamson, 2010-06-05; 1 file, -12/+12)
* Clarify that count is bytes not unicode characters (Karl Williamson, 2010-05-29; 1 file, -1/+1)
* Fix build warnings introduced by v5.13.0-139-ge0fa7e2 (Jerry D. Hedden, 2010-05-24; 1 file, -1/+2)
* fix a couple of var types (David Mitchell, 2010-05-03; 1 file, -3/+3)
  These errors were introduced in my trie-allocation patch, 2e64971a6530d2645969bc489f564bfd3ce64993
* tries: don't allocate memory at runtime (David Mitchell, 2010-05-03; 1 file, -178/+209)
  This is an indirect fix for [perl #74484] Regex causing exponential runtime+mem usage

  The trie runtime code was doing more SAVETMPS than FREETMPS and was thus growing a large tmps stack on heavy backtracking. Rather than fixing this directly, I rewrote part of the trie code so that it no longer needs to allocate memory in S_regmatch (it still does in find_byclass()).

  The basic issue is that multiple branches in the trie may trigger an accept state; for example:

    "abcd" =~ /xyz|abcd.*X|ab.*Y/

  here, words (branches) 2 and 3 are accept states. The original approach was, at run time, to create a list of accepted word numbers and the character positions of the end of each of those words. Then run the rest of the pattern for each word in the list in turn (in word index order). This requires memory for the list to be allocated and freed.

  The new approach involves creating extra info at compile time; in particular, for each word, a pointer to the previous accepted word (if any) in the state tree. For example for the above pattern, part of the state tree may be

         a    b    c    d
      1 -> 2 -> 3 -> 4 -> 5
               (#3)      (#2)

  (e.g. at state 1, if the next char is 'a', we transition to state 2). Here, state 3 is an accept state with word #3, and 5 is an accept state with word #2. So we build a table indexed by word number, which has wordinfo[2] = 3, wordinfo[3] = 0, thus building the word chain 2->3->0.

  At run time we run the trie to completion, and remember the word associated with the longest accept state (word #2 above). Then by following back the chain of .prev fields, we can produce a list of all accepting words. We then iteratively find the smallest-numbered (i.e. LH-most) word in the chain, and run with it. On failure and backtrack, we find the next-smallest and so on.

  Since we are no longer recording the end-position of each word in the string, we have to recalculate this for each backtrack. We initially record the end-position of the shortest accepting word, and given that we know the length of each word, we can calculate the new position each time as an offset from that first word. Depending on unicode and folding, that calculation can be cheap or expensive.

  This algorithm is optimised for the typical case where there are a small number (<= 2) of accepting states.

  This patch creates a new compile-time array, trie->wordinfo[], indexed by word number, which contains relevant info about each word. This also supersedes the old trie->newword[] array, whose function of recording "overspills" of multiple words per accept state is now handled as part of the wordinfo[].prev chain.
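The chain walk described above, finding the smallest-numbered accepting word not yet tried, can be sketched without any run-time allocation. The struct `word_info` and the function `next_word_to_try` are hypothetical names for this illustration and carry only the `prev` link, not the other per-word data the commit mentions.

```c
/* Hypothetical per-word record, indexed by word number (1-based).
 * prev is the previous accepting word on the same path; 0 ends the chain. */
typedef struct {
    unsigned prev;
} word_info;

/* Walk the chain starting at the word of the longest accept state and
 * return the smallest word number greater than last_tried, or 0 if the
 * chain holds no further candidates. Pass last_tried = 0 for the first try. */
unsigned next_word_to_try(const word_info *wordinfo, unsigned longest,
                          unsigned last_tried)
{
    unsigned best = 0;
    unsigned w;
    for (w = longest; w != 0; w = wordinfo[w].prev) {
        if (w > last_tried && (best == 0 || w < best))
            best = w;
    }
    return best;
}
```

With the chain 2->3->0 from the example, successive calls yield word 2, then word 3, then 0. Each call rescans the chain, which is cheap when the chain is short, matching the stated assumption of two or fewer accepting states.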
* For SAVEt_REGCONTEXT, store the number of save stack entries used with the type. (Nicholas Clark, 2010-05-02; 1 file, -7/+11)
* On the save stack, store the save type as the bottom 6 bits of a UV. (Nicholas Clark, 2010-05-01; 1 file, -2/+2)
  This makes the other 26 (or 58) bits available for save data.
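The bit arithmetic is simple to sketch: with a 32-bit word, 6 bits of type leave 32 - 6 = 26 bits of data (and 64 - 6 = 58 with a 64-bit word). The names `uv32`, `SAVE_TYPE_BITS`, `save_pack`, `save_type`, and `save_data` are assumptions for this example, not Perl's own macros, and a fixed 32-bit type stands in for UV.

```c
#include <stdint.h>

typedef uint32_t uv32;                       /* 32-bit stand-in for UV */

#define SAVE_TYPE_BITS 6u
#define SAVE_TYPE_MASK ((1u << SAVE_TYPE_BITS) - 1u)   /* 0x3F: up to 64 types */

/* Put the type in the bottom 6 bits and the data in the remaining 26. */
uv32 save_pack(uv32 data, unsigned type)
{
    return (data << SAVE_TYPE_BITS) | (type & SAVE_TYPE_MASK);
}

unsigned save_type(uv32 packed) { return packed & SAVE_TYPE_MASK; }
uv32     save_data(uv32 packed) { return packed >> SAVE_TYPE_BITS; }
```

Data wider than 26 bits would be silently truncated by the shift in this sketch, so a caller would have to keep values within that range.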
* Untangle REGCP_FRAME_ELEMS from REGCP_OTHER_ELEMS. (Nicholas Clark, 2010-05-01; 1 file, -10/+11)
* use cBOOL for bool casts (David Mitchell, 2010-04-15; 1 file, -13/+13)
      bool b = (bool)some_int
  doesn't necessarily do what you think. In some builds, bool is defined as char, and that cast's behaviour is thus undefined. So this line in mg.c:
      const bool was_temp = (bool)SvTEMP(sv);
  was actually setting was_temp to false even when the SVs_TEMP flag was set.

  Fix this by replacing all the (bool) casts with a new cBOOL() cast macro that (hopefully) does the right thing.
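A small sketch shows how a char-sized "bool" can lose a flag bit. It assumes 8-bit chars and uses an unsigned char so the narrowing is well defined; `char_bool`, `FLAG_TEMP`, and `TO_BOOL` are hypothetical names chosen for this example, and the flag value is not Perl's SVs_TEMP.

```c
typedef unsigned char char_bool;   /* stand-in for a build where bool is char */

#define FLAG_TEMP 0x0800u          /* hypothetical flag bit above the low 8 bits */

/* Test for non-zero first, then produce exactly 0 or 1. */
#define TO_BOOL(x) ((x) ? (char_bool)1 : (char_bool)0)

/* Plain cast: keeps only the low 8 bits, so the flag bit is dropped. */
char_bool flag_by_cast(unsigned flags)
{
    return (char_bool)(flags & FLAG_TEMP);
}

/* Truth test: any non-zero value becomes 1. */
char_bool flag_by_test(unsigned flags)
{
    return TO_BOOL(flags & FLAG_TEMP);
}
```

With the flag set, the cast version yields 0 while the truth-test version yields 1, which mirrors the was_temp symptom described in the commit.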
* Allow U+0FFFF in regex (Karl Williamson, 2009-12-20; 1 file, -2/+4)
* Fix compile failure introduced in 37e2e78edfe0a224b8a615820f46db879584f523. (Craig A. Berry, 2009-12-13; 1 file, -9/+9)
  Solaris, VMS, and Win32 all failed to build after this change. In C99's description of:
      do statement while ( expression ) ;
  the trailing semicolon does not appear to be optional. And at least three compilers from three vendors agree.
* qr/\X/ expansion (Karl Williamson, 2009-12-05; 1 file, -15/+229)
* SvREFCNT_dec already checks if the SV is non-NULL (continued) (Vincent Pit, 2009-11-08; 1 file, -2/+2)
* disable non-unicode case insensitive trie matching (Yves Orton, 2009-10-25; 1 file, -7/+2)
  Also revert 8902bb05b18c9858efa90229ca1ee42b17277554 as it merely masked one symptom of the deeper problems. Also fixes RT #69973, which was a segfault which was exposed by 8902bb05; see the ticket for further details. http://rt.perl.org/rt3//Public/Bug/Display.html?id=69973

  At the core of this is the problem that in unicode matching a bunch of code points have case folding rules beyond just A-Z/a-z. Since the case folding rules are decided at runtime by the string, we can't use the same TRIE tables for both unicode/non-unicode matching. Until this is reconciled or some other solution is found, case insensitive matching only gets the TRIE optimisation when the pattern is unicode.

  From CaseFolding.txt:
    00B5; C; 03BC; # MICRO SIGN
    00C0; C; 00E0; # LATIN CAPITAL LETTER A WITH GRAVE
    00C1; C; 00E1; # LATIN CAPITAL LETTER A WITH ACUTE
    00C2; C; 00E2; # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
    00C3; C; 00E3; # LATIN CAPITAL LETTER A WITH TILDE
    00C4; C; 00E4; # LATIN CAPITAL LETTER A WITH DIAERESIS
    00C5; C; 00E5; # LATIN CAPITAL LETTER A WITH RING ABOVE
    00C6; C; 00E6; # LATIN CAPITAL LETTER AE
    00C7; C; 00E7; # LATIN CAPITAL LETTER C WITH CEDILLA
    00C8; C; 00E8; # LATIN CAPITAL LETTER E WITH GRAVE
    00C9; C; 00E9; # LATIN CAPITAL LETTER E WITH ACUTE
    00CA; C; 00EA; # LATIN CAPITAL LETTER E WITH CIRCUMFLEX
    00CB; C; 00EB; # LATIN CAPITAL LETTER E WITH DIAERESIS
    00CC; C; 00EC; # LATIN CAPITAL LETTER I WITH GRAVE
    00CD; C; 00ED; # LATIN CAPITAL LETTER I WITH ACUTE
    00CE; C; 00EE; # LATIN CAPITAL LETTER I WITH CIRCUMFLEX
    00CF; C; 00EF; # LATIN CAPITAL LETTER I WITH DIAERESIS
    00D0; C; 00F0; # LATIN CAPITAL LETTER ETH
    00D1; C; 00F1; # LATIN CAPITAL LETTER N WITH TILDE
    00D2; C; 00F2; # LATIN CAPITAL LETTER O WITH GRAVE
    00D3; C; 00F3; # LATIN CAPITAL LETTER O WITH ACUTE
    00D4; C; 00F4; # LATIN CAPITAL LETTER O WITH CIRCUMFLEX
    00D5; C; 00F5; # LATIN CAPITAL LETTER O WITH TILDE
    00D6; C; 00F6; # LATIN CAPITAL LETTER O WITH DIAERESIS
    00D8; C; 00F8; # LATIN CAPITAL LETTER O WITH STROKE
    00D9; C; 00F9; # LATIN CAPITAL LETTER U WITH GRAVE
    00DA; C; 00FA; # LATIN CAPITAL LETTER U WITH ACUTE
    00DB; C; 00FB; # LATIN CAPITAL LETTER U WITH CIRCUMFLEX
    00DC; C; 00FC; # LATIN CAPITAL LETTER U WITH DIAERESIS
    00DD; C; 00FD; # LATIN CAPITAL LETTER Y WITH ACUTE
    00DE; C; 00FE; # LATIN CAPITAL LETTER THORN
    00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
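The difference between the two sets of folding rules can be made concrete with a short sketch built only from the table rows quoted above. The function names `fold_ascii_only` and `fold_simple_latin1` are hypothetical, and the second function covers just the single-character (C) folds; the full (F) fold of U+00DF to "ss" is deliberately left out.

```c
#include <stdint.h>

/* Non-unicode rules as described in the commit: only A-Z fold. */
uint32_t fold_ascii_only(uint32_t cp)
{
    return (cp >= 'A' && cp <= 'Z') ? cp + 0x20 : cp;
}

/* Unicode simple folds for code points up to U+00FF, per the quoted rows. */
uint32_t fold_simple_latin1(uint32_t cp)
{
    if (cp >= 'A' && cp <= 'Z')
        return cp + 0x20;
    if (cp == 0xB5)                      /* MICRO SIGN folds outside Latin1 */
        return 0x3BC;
    if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7)
        return cp + 0x20;                /* 00C0..00DE except MULTIPLICATION SIGN */
    return cp;                           /* 00DF has only a multi-char fold */
}
```

A trie built with one of these functions gives different transitions from a trie built with the other for the same pattern character, which is why one table cannot serve both kinds of matching.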
* RT#69616: regexp SVs lose regexpness in assignment (Ben Morrow, 2009-10-22; 1 file, -1/+1)
  It uses reg_temp_copy to copy the REGEXP onto the destination SV without needing to copy the underlying pattern structure. This means changing the prototype of reg_temp_copy, so it can copy onto a passed-in SV, but it isn't API (and probably shouldn't be exported) so I don't think this is a problem.
* somewhat fix failing regex tests. but break lots of other stuff at the same time (Yves Orton, 2009-10-19; 1 file, -18/+51)