path: root/regexec.c
Columns: commit message, author, age, files, lines
* temp commit for smokes (smoke-me/khw-tricky)Karl Williamson2011-12-221-26/+22
|
* regexec.c: white space onlyKarl Williamson2011-12-221-1/+1
|
* regexec.c: EXACTF nodes can never be UTFKarl Williamson2011-12-221-4/+9
| | | | | | | By definition a regex pattern that is in UTF-8 uses Unicode matching rules, and EXACTF is non-Unicode (unless the target string is UTF-8). Therefore an EXACTF node will never be generated for a UTF-8 pattern, and there is no need to test for it being so.
* regexec.c: Bypass unneeded stepKarl Williamson2011-11-111-2/+2
| | | | | We don't have to convert from utf8 to code point to fold; instead we can call the function that starts from utf8.
* regexec.c: Stop looking for match even soonerKarl Williamson2011-11-091-3/+3
| | | | | | | | This revises commit e067297c376fbbb5a0dc8428c65d922f11e1f4c6 slightly so that we round up to get the search stopping point. We aren't matching partial characters, so if we were to match 3+1/3 characters, we really have to match 4 characters.
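The rounding described in this commit message amounts to a ceiling division. The following is a minimal sketch in C, not the code from regexec.c; the function name `min_target_chars` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: the fewest target characters that could match a
 * pattern needing `pattern_chars` characters, when one target character
 * can fold to at most `max_ratio` pattern characters. The division rounds
 * up, because a partial character cannot be matched. */
static size_t min_target_chars(size_t pattern_chars, size_t max_ratio)
{
    assert(max_ratio > 0);
    return (pattern_chars + max_ratio - 1) / max_ratio;
}
```

With a ratio of 3, a 10-character pattern gives 3+1/3, which this rounds up to 4, as in the message's example.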
* regexec.c: revise commentKarl Williamson2011-11-091-4/+6
|
* regexec.c: typo in commentKarl Williamson2011-11-091-1/+1
|
* Change __attribute__unused__ to PERL_UNUSED_DECLKarl Williamson2011-11-091-1/+1
| | | | The latter is the Perl standard way of making this declaration
* PATCH: [perl #101710] Regression with /i, latin1 chars.Karl Williamson2011-11-011-1/+1
| | | | | | The root cause of this bug is that it was assuming that a string was in utf8 when it wasn't, and so was thinking that a byte was a starter byte that wasn't, so was skipping ahead based on that starter byte.
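The failure mode in this message can be illustrated with a small C sketch. It assumes standard UTF-8 (sequences of up to 4 bytes) rather than Perl's extended encoding, and the function names are hypothetical rather than taken from regexec.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Sequence length implied by a UTF-8 start byte (standard UTF-8 only).
 * Continuation bytes and invalid start bytes fall back to 1. */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Hypothetical helper: how many bytes to advance past the character that
 * starts with byte `b`. Only a UTF-8 target may use the start-byte length;
 * in a Latin-1 target every character is exactly one byte. */
static size_t char_advance(unsigned char b, bool target_is_utf8)
{
    return target_is_utf8 ? utf8_seq_len(b) : 1;
}
```

The Latin-1 byte 0xE9 looks like the start of a 3-byte UTF-8 sequence, so treating a non-UTF-8 string as UTF-8 would skip two characters that should have been examined.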
* regexec.c: Add another place to not re-foldKarl Williamson2011-10-171-1/+1
| | | | This adds regrepeat to the places that, as of the recent commits, no longer keep re-folding.
* regexec.c: Another place to not re-foldKarl Williamson2011-10-171-2/+2
| | | | | A recent commit caused regexec.c to not keep calculating the folds in one circumstance. This one adds the case in regmatch
* regexec.c: Less work in /i matchingKarl Williamson2011-10-171-2/+4
| | | | | | | | | | | If you watch an execution trace of regexec /i, often you will see it folding the same thing over and over, as it backtracks or searches ahead. regcomp.c has now been changed to always fold UTF-8 encoded EXACTF and EXACTFU nodes. This allows these to not be re-folded each time. This commit does it just for find_by_class(). Other commits will expand this technique for other cases.
* regexec.c: Stop looking for match soonerKarl Williamson2011-10-171-2/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | This is a partial reversion of commit 7c1b9f38fcbfdb3a9e1766e02bcb991d1a5452d9 which went unnecessarily far in fixing the problem. After studying the situation some more, I see more clearly what was going on. The point is that if you have only 2 characters left in the string, but the pattern requires 3 to work, it's guaranteed to fail, so pointless, and unnecessary work, to try. So don't begin a match trial at a position when there are fewer than the minimum number of characters necessary. That is what the code before that commit did. However it neglected the fact that it is possible for a single character to match multiple ones, so there is not a 1:1 ratio. This new commit assumes the worst possible ratio to calculate how far into a string is the furthest a successful match could start. This is going to in most cases still look too far, but it is much better than always going up to the final character, as the previous patch did. The maximum ratio is guaranteed by Unicode to be 3:1, but when the target isn't in UTF-8, the max is 2:1, determined simply by inspection of the defined folds. And actually, currently, the single case where it isn't 1:1 doesn't come up here, because regcomp.c guarantees that that match doesn't generate one of these EXACTFish nodes. However, I expect that to change for 5.16, and so am preparing for that case by making it 2:1.
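The bound described in this message can be sketched in C as follows. This is an illustration only: `viable_start_count` and its parameters are hypothetical names, lengths are counted in characters, and the division rounds up as the later "Stop looking for match even sooner" commit describes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: the number of start positions (0 .. result-1,
 * counted in characters) at which a match trial could still succeed.
 * `max_ratio` is the worst-case number of pattern characters that a single
 * target character can match: 1 if folds were always 1:1, 2 for non-UTF-8
 * targets and 3 for UTF-8 targets according to the commit message. */
static size_t viable_start_count(size_t target_chars, size_t pattern_chars,
                                 size_t max_ratio)
{
    size_t min_chars;

    assert(max_ratio > 0);
    /* Fewest target characters that could cover the whole pattern. */
    min_chars = (pattern_chars + max_ratio - 1) / max_ratio;
    if (target_chars < min_chars)
        return 0;               /* too short: no position can succeed */
    return target_chars - min_chars + 1;
}
```

With a 1:1 ratio, a 3-character pattern against a 2-character target yields 0 viable positions, which is the "guaranteed to fail" case in the message. With a 3:1 ratio the same pattern can still start at every position of a 10-character target, which is why the worst-case bound often looks further than strictly necessary.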
* regexec.c: Add commentKarl Williamson2011-10-171-0/+5
|
* regexec.c: omit goto for the common caseKarl Williamson2011-10-171-13/+13
| | | | | | | The structure of this code is that initial setup is done and then gotos or fall-through used to join for the main logic. This commit just moves a block, without logic changes, so that the more common case has a fall-through instead of a goto.
* regexec.c: Fix "\x{FB01}\x{FB00}" =~ /ff/iKarl Williamson2011-10-131-1/+2
| | | | | | | | | | | Only the first character of the string was being checked when scanning for the beginning position of the pattern match. This was so wrong, it looks like it has to be a regression. I experimented a little and did not find any. I believe (but am not certain) that a multi-char fold has to be involved. The handling of these was so broken before 5.14 that there very well may not be a regression.
* regexec.c: Add commentsKarl Williamson2011-10-131-0/+4
|
* regexec.c: Avoid hard-coded utf8 tests for EBCDICKarl Williamson2011-10-011-1/+7
| | | | | | | | When a swash is loaded, generally it is checked for sanity with an assert(). The strings used are hard-coded utf8 strings, which will be different in EBCDIC, and hence will fail. I haven't figured out a simple way to get compile-time utf8 vs utfebcdic strings, but we can just skip the check in EBCDIC builds
* regexec.c: Add assertion checkKarl Williamson2011-10-011-1/+1
| | | | | This makes sure before there is a segfault that the is_() functions actually have the side effect that this expects.
* RT #96354: \h \H \v and \V didn't check for EOLDavid Mitchell2011-08-051-0/+4
| | | | | | The HORIZWS and similar regexp ops didn't check that the end of the string had been reached; therefore they would blithely compare against the \0 at the end of the string, or beyond.
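A minimal C sketch of the missing check follows. It is a simplified model, not the regexec.c code: it handles only ASCII space and tab (the real \h covers more code points), and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* ASCII-only stand-in for the horizontal whitespace test. */
static bool is_horiz_ws_ascii(unsigned char c)
{
    return c == ' ' || c == '\t';
}

/* Hypothetical matcher for \h at byte offset `pos` of a `len`-byte buffer. */
static bool match_h(const char *s, size_t len, size_t pos)
{
    if (pos >= len)
        return false;           /* end of string: nothing left to match */
    return is_horiz_ws_ascii((unsigned char)s[pos]);
}

/* Hypothetical matcher for \H. Without the bounds check, the trailing \0
 * would be compared and, not being whitespace, would wrongly match. */
static bool match_H(const char *s, size_t len, size_t pos)
{
    if (pos >= len)
        return false;
    return !is_horiz_ws_ascii((unsigned char)s[pos]);
}
```

The negated form shows why the check matters most: at the end of "ab", an unchecked \H would succeed against the terminating \0.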
* /aa and \b fail under some utf8 stringsKarl Williamson2011-07-301-1/+4
| | | | | This was due to my failure to realize that this 'if' needed to be updated when the /aa modifier was added.
* Panic with \b and /aaKarl Williamson2011-07-301-0/+1
| | | | | This was due to my oversight in not fixing this switch statement to accommodate /aa when it was added.
* re_eval: clear lexicals in the right padDavid Mitchell2011-07-161-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | (?{...}) deliberately doesn't introduce a new scope (so that the effects of local() can accumulate across multiple calls to the code). This also means that the SAVEt_CLEARSVs pushed onto the save stack by lexical declarations (i.e. (?{ my $x; ... })) also accumulate, and are only processed en masse at the end, on exit from the regex. Currently they are usually processed in the wrong pad (the caller of the pattern, rather than the pads of the individual code block(s)), leading to random misbehaviour and SEGVs. Hence the long-standing advice to avoid lexical declarations within re_evals. We fix this by wrapping a pair of SAVECOMPPADs around each call to a code block. Eventually the save stack will be a long accumulation of SAVEt_CLEARSV's interspersed with SAVEt_COMPPAD's, that when popped en masse should unwind in the right order with the right pad at the right time. The price to pay for this is two extra additions to the save stack (which accumulate) for each code call. A few TODO tests in reg_eval_scope.t now pass, so I'm probably doing the right thing ;-)
* regexec.c: Nits in commentsKarl Williamson2011-07-071-7/+8
|
* For shorter strings, store C<study>'s data as U8s or U16s, instead of U32s.Nicholas Clark2011-07-011-1/+11
| | | | | | | The assumption is that most studied strings are fairly short, hence the pain of the extra code is worth it, given the memory savings. 80 character string: 336 bytes as U8, down from 1344 as U32. 800 character string: 2112 bytes as U16, down from 4224 as U32.
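The figures quoted in this message are consistent with a table of 256 entries plus one entry per character of the string, multiplied by the entry width. The C sketch below reproduces that arithmetic. The layout is inferred from the numbers, and the width thresholds in `study_entry_width` are an assumption, not taken from the commit.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed rule for picking the entry width in bytes: the all-ones value of
 * each type is kept free as the "no more" marker, so positions must stay
 * below it. The exact thresholds here are hypothetical. */
static size_t study_entry_width(size_t len)
{
    if (len < 0xFF)   return 1;     /* U8  */
    if (len < 0xFFFF) return 2;     /* U16 */
    return 4;                       /* U32 */
}

/* Bytes needed for a string of `len` characters with `width`-byte entries:
 * 256 per-octet entries plus one entry per character (inferred layout). */
static size_t study_bytes(size_t len, size_t width)
{
    return (256 + len) * width;
}
```

For 80 characters this gives (256 + 80) * 1 = 336 bytes as U8 against 1344 as U32, and for 800 characters (256 + 800) * 2 = 2112 bytes as U16 against 4224 as U32.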
* Store C<study>'s data as U32s, instead of I32s.Nicholas Clark2011-07-011-1/+1
| | | | The "no more" condition is now represented as ~0, instead of -1.
* Store C<study>'s data in mg_ptr instead of interpreter variables.Nicholas Clark2011-07-011-1/+6
| | | | | This allows more than one C<study> to be active at the same time. It eliminates PL_screamfirst, PL_lastscream, PL_maxscream.
* Change PL_screamnext to store absolute positions.Nicholas Clark2011-07-011-1/+1
| | | | | | | | | | PL_screamnext gives the position of the next occurrence of the current octet. Previously it stored this as an offset from the current position, with -pos stored for "no more", so that the calculated new offset would be zero, allowing a zero/non-zero loop exit test in Perl_screaminstr(). Now it stores absolute position, with -1 for "no more". Also codify -1 as the "not present" value for PL_screamfirst, instead of any negative value.
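The absolute-position scheme can be modelled in a few lines of C. This is a simplified sketch and not the Perl implementation: the arrays `first` and `next` stand in for PL_screamfirst and PL_screamnext, and the function names and the `NO_MORE` macro are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_MORE (-1)    /* marker for "not present" and "no more" */

/* Fill first[c] with the position of the first occurrence of octet c, and
 * next[pos] with the absolute position of the next occurrence of the octet
 * found at pos. Scanning backwards keeps each chain in ascending order. */
static void build_chains(const unsigned char *s, int32_t len,
                         int32_t first[256], int32_t *next)
{
    int32_t pos;
    int i;

    for (i = 0; i < 256; i++)
        first[i] = NO_MORE;
    for (pos = len - 1; pos >= 0; pos--) {
        next[pos] = first[s[pos]];
        first[s[pos]] = pos;
    }
}

/* Walk one chain; the loop ends on the explicit marker rather than on a
 * computed offset of zero. */
static int32_t count_occurrences(unsigned char c, const int32_t first[256],
                                 const int32_t *next)
{
    int32_t n = 0;
    int32_t pos;

    for (pos = first[c]; pos != NO_MORE; pos = next[pos])
        n++;
    return n;
}
```

For the string "banana", the chain for 'a' is 1, 3, 5 and then the marker. Storing absolute positions means each step is a plain lookup with no addition to the current position.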
* The regex engine can't assume that SvSCREAM() remains set on its target.Nicholas Clark2011-06-301-3/+3
| | | | | | | | Callers to the engine set REXEC_SCREAM in the flags when the target scalar is studied, and the engine should use the study data. It's possible for embedded code blocks to cause the target scalar to stop being studied. Hence the engine needs to check for this, instead of simply assuming that the study data is present and valid to read. This resolves #92696.
* regexec.c: Remove unnecessary special handling for \xDFKarl Williamson2011-06-111-6/+5
| | | | | regcomp.c has been changed, so the case that this handled no longer comes up.
* Use SvTAIL() instead of BmFLAGS(). The core no longer uses BmFLAGS().Nicholas Clark2011-06-111-10/+9
|
* use __attribute__unused__ to silence -Wunused-but-set-variableRobin Barker2011-05-191-2/+10
|
* Assertion fails in multi-char regex matchKarl Williamson2011-05-181-4/+6
| | | | | | | | | In "s\N{U+DF}" =~ /\x{00DF}/i, the LHS folds to 'sss', the RHS to 'ss'. The bug occurs when the RHS tries to match the first two s's, but that splits the LHS \xDF character, which Perl doesn't know how to handle, and the assertion got triggered. (This is similar to [perl #72998].) The solution adopted here is to disallow a partial character match, as #72998 did as well.
* PATCH: [perl #87908] \W is its complement sometimesKarl Williamson2011-04-061-1/+1
| | | | | | | | | | A missing '!' turned \W into \w in some code execution paths and utf8 data. This patch fixes that. It does not include tests at the moment, since I don't have time just now to examine why the existing tests didn't catch this, when it looks like they are set up to, and there have been several BBC tickets lately that I'm hopeful this may fix and head off other ones.
* regexec.c: fix some compiler warningsDavid Mitchell2011-03-261-2/+2
|
* regexec.c: Rmv special code no longer neededKarl Williamson2011-03-201-14/+3
| | | | The trickiness has been resolved elsewhere.
* regexec.c: Update commentKarl Williamson2011-03-191-25/+13
|
* regexec.c: execute inappropriately skipped codeKarl Williamson2011-03-191-5/+6
| | | | | | The comment said that there was no use doing this if lenp was NULL, but there is, as it sees if there is a match or not and sets the appropriate variable.
* regexec.c: Chg var. name for clarityKarl Williamson2011-03-191-5/+5
|
* Stop hang in regexKarl Williamson2011-03-191-28/+35
| | | | | | | The algorithm for mapping multi-char fold matches back to the source in processing ANYOF nodes was defective. This caused the regex engine to hang on certain character combinations. I've also added an assert to stop instead of loop.
* Fix RT #84294 /((\w+)(?{print $2})){2,2}/ problemYves Orton2011-03-121-2/+5
| | | | | | | | | | | | | | | | When we are doing a CURLYX/WHILEM loop and the min iterations is larger than zero we were not saving the buffer state before each iteration. This meant that partial matches would end up with strange buffer pointers, with the start *after* the end point. In more detail WHILEM has four exits, three of which as far as I could tell would do a regcppush/regcppop in their state transitions, only one, WHILEM_A_pre which is entered when (n < min) would not. And it is this state that we repeatedly enter when performing A the min number of times. When I made the logic similar to the handling of ( n < max ), the bug went away, and as far as I can tell nothing else broke. Review by Dave Mitchell required before release.
* regexec.c: Use equivalent macro instead of codeKarl Williamson2011-03-101-13/+1
| | | | | Recent simplification of this code left it to be the equivalent of an existing macro
* regexec.c: Add assert() to detect inconsistent ANYOFKarl Williamson2011-03-101-0/+2
| | | | | | | There have been various segfaults apparently due to trying to access the swash (and allies) portion of an ANYOF which doesn't have that. This doesn't show up on all platforms. The assert() should detect this and help debugging
* regexec.c: Fix precedenceKarl Williamson2011-03-101-5/+6
| | | | | | | | | | | | | Commit ac51e94be5daabecdeb0ed734f3ccc059b7b77e3 didn't do what it purported, because it omitted parentheses that were necessary to change the natural precedence. It's strange that it passed all tests on my machine, and failed so miserably elsewhere that it was quickly reverted by commit 63c0bfd59562325fed2ac5a90088ed40960ac2ad. This reinstates it with the correct precedence. The next commit will add an assert() so that the underlying issue will be detected on all platforms
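The kind of mistake described in this message can be shown with a generic C example. The actual expression from the commit is not reproduced in this log, so the three conditions below are hypothetical placeholders.

```c
#include <assert.h>
#include <stdbool.h>

/* How C parses "a && b || c" when no parentheses are written:
 * && binds more tightly than ||, so the grouping is (a && b) || c. */
static bool natural_precedence(bool a, bool b, bool c)
{
    return (a && b) || c;
}

/* The grouping that needs explicit parentheses: a && (b || c). */
static bool intended_grouping(bool a, bool b, bool c)
{
    return a && (b || c);
}
```

The two forms differ whenever `a` is false and `c` is true: the unparenthesized form is true and the parenthesized one is false. A difference like that could pass tests on one build and fail on another if the inputs that expose it only arise in some configurations.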
* Revert "regexec.c: don't try accessing non-bitmap if doesn't exist"David Mitchell2011-03-101-6/+5
| | | | | | | This reverts commit ac51e94be5daabecdeb0ed734f3ccc059b7b77e3. This commit made many of the re/*.t tests fail, on my build at least. Haven't looked at why, just reverting it for the moment.
* regexec.c: don't try accessing non-bitmap if doesn't existKarl Williamson2011-03-091-5/+6
| | | | | | | | | | | | ANYOF_NONBITMAP is supposed to be set iff there is something outside the bitmap to try matching in an ANYOF node. Due to slight changes in the meaning of this, the code has been trying to access this if ANYOF_NONBITMAP_NON_UTF8 is set without ANYOF_NONBITMAP being set, which means it was trying to access something that doesn't exist. I'm hopeful, based on a stack trace sent to me that this is the cause of [perl #85478], but can't reproduce that easily. But the logic is clearly wrong.
* regex: /l in combo with others in syn start classKarl Williamson2011-03-081-3/+8
| | | | | | | | | Now that regexes can be combinations of different charset modifiers, a synthetic start class can match locale and non-locale both. locale should generally match only things in the bitmap for code points < 256. But a synthetic start class with a non-locale component can match such code points. This patch makes an exception for synthetic nodes that will be resolved if it passes and is matched again for real.
* regexec.c: Remove '#if 0' codeKarl Williamson2011-03-021-70/+0
| | | | | This code was retained for a while until it was clear that the replacement code worked.
* regex: Remove obsolete codeKarl Williamson2011-02-281-52/+23
| | | | | | | This code has been rendered obsolete in 5.14 by using a different mechanism altogether. This functionality is now provided at run-time, user-selectable, via the /u and /d regex modifiers. This code was for compile-time selection of which to use.
* regexec.c: remove no longer needed codeKarl Williamson2011-02-281-5/+1
| | | | | The code dealing with the sharp ss is now handled by the ANYOFV node, and shouldn't appear here.