path: root/regexec.c
Commit messageAuthorAgeFilesLines
* regex: Remove FOLDCHAR regnode typeKarl Williamson2012-01-191-21/+0
| | | | | | | | | | | | | | This node type hasn't been used since 5.14.0. Instead an ANYOFV node was generated where formerly a FOLDCHAR node would have been used. The ANYOFV was used because it already existed and was up-to-date, whereas FOLDCHAR would have needed some bug fixes to adapt it, even though it would be faster in execution than ANYOFV; so the code for it was retained in case it was needed. However, both these solutions were defective, and a previous commit has changed things to a different type of solution entirely. Thus FOLDCHAR is obsolescent and can be removed, though the code in it was used as a base for some of the new solutions.
* regex: Add new node type EXACTFU_NO_TRIEKarl Williamson2012-01-191-2/+8
| | | | | | This new node is like EXACTFU but is not currently trie'able. This adds handling for it in regexec.c, but it is not currently generated; this commit prepares for future commits.
* regex: Add new node type EXACTFU_SSKarl Williamson2012-01-191-5/+16
| | | | | | | | | | This node will be used, for a non-UTF8 pattern and string, to flag the case where something could be matched that is of a different length. The only instance where this can happen is that LATIN SMALL LETTER SHARP S can match the sequences "ss", "Ss", "sS", or "SS", hence the name. This node is not currently generated; this commit prepares for future commits.
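The length mismatch described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: `is_ss_sequence` and `sharp_s_match_len` are hypothetical names, not functions from regexec.c, and the target is assumed to be a Latin-1 (non-UTF8) byte string.

```c
#include <stddef.h>

/* Hypothetical helper (not from regexec.c): returns 1 if the two bytes at s
 * are "ss" in any case combination, i.e. a sequence that LATIN SMALL LETTER
 * SHARP S (0xDF in Latin-1) can match case-insensitively. */
static int is_ss_sequence(const unsigned char *s, size_t len)
{
    if (len < 2)
        return 0;
    return (s[0] == 's' || s[0] == 'S') && (s[1] == 's' || s[1] == 'S');
}

/* Hypothetical helper: number of target bytes that the single pattern
 * character 0xDF consumes at s, or 0 if it does not match there. */
static size_t sharp_s_match_len(const unsigned char *s, size_t len)
{
    if (len >= 1 && s[0] == 0xDF)
        return 1;               /* matches itself: same length */
    if (is_ss_sequence(s, len))
        return 2;               /* matches "ss": different length */
    return 0;
}
```

One pattern character can consume either one or two target bytes, which is the situation a dedicated node type would let the engine recognize up front.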
* regexec.c: white space onlyKarl Williamson2012-01-191-1/+1
|
* regexec.c: EXACTF nodes can never be UTFKarl Williamson2012-01-191-4/+9
| | | | | | | By definition a regex pattern that is in UTF-8 uses Unicode matching rules, and EXACTF is non-Unicode (unless the target string is UTF-8). Therefore an EXACTF node will never be generated for a UTF-8 pattern, and there is no need to test for it being so.
* Provide as much diagnostic information as possible in "panic: ..." messages.Nicholas Clark2012-01-161-1/+2
| | | | | | | | | | | | | | | The convention is that when the interpreter dies with an internal error, the message starts "panic: ". Historically, many panic messages had been terse fixed strings, which means that the out-of-range values that triggered the panic are lost. Now we try to report these values, as such panics may not be repeatable, and the original error message may be the only diagnostic we get when we try to find the cause. We can't report diagnostics when the panic message is generated by something other than croak(), as we don't have *printf-style format strings. Don't attempt to report values in panics related to *printf buffer overflows, as attempting to format the values to strings may repeat or compound the original error.
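As a rough illustration of the convention described above, a panic message can carry the out-of-range value and the permitted range instead of being a fixed string. The function name `format_panic` and the message wording are assumptions made for this sketch; the real code goes through croak() with its own format strings.

```c
#include <stdio.h>

/* Hypothetical formatter (not the interpreter's croak()): builds a
 * "panic: " message that records the offending value and the range it was
 * expected to fall in. Returns what snprintf returns. */
static int format_panic(char *buf, size_t size, const char *what,
                        long value, long min, long max)
{
    return snprintf(buf, size,
                    "panic: %s out of range (%ld, expected %ld..%ld)",
                    what, value, min, max);
}
```

For example, calling it with `"offset"`, 42, 0 and 10 produces a message that names the bad value, which may be the only diagnostic available if the panic cannot be reproduced.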
* regexec.c: Use shared swash in bracketed character classesKarl Williamson2012-01-131-1/+1
| | | | | | | | | | | | | | | | | This takes advantage of an earlier commit to use a swash that may be shared across multiple character class instances. That means that if a match in another class has to look up a value, that same value is automatically available without further lookup to all character classes that share the swash. This means that the lookup result only needs to be cached once for all instances in the thread, saving time and memory. Note that currently the only swashes that are shared are those that consist solely of a single Unicode property definition. Some sort of checksum would have to be computed if this were to be extended to custom classes. But what this does is cause sharing for all Unicode properties that aren't in bracketed classes (as they are implemented as a bracketed class with a single element), as well as the few cases where someone explicitly writes [\p{foo}] without anything else in the class.
* regexec.c: Allow for returning shared swashKarl Williamson2012-01-131-5/+12
| | | | | | | | | | | | This changes the function that returns the swash associated with a bracketed character class so that it returns the original swash and not a copy. The function is renamed and made accessible only from within regexec.c, and a new wrapper function with the original name is created that just calls the other one and returns a copy of the swash. Thus, all access from outside regexec.c will use a copy which if overwritten will not harm others; while the option exists from within regexec.c to use a shared version.
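The split between an internal accessor that hands back the shared object and a public wrapper that hands back a copy might look roughly like this. `table_t`, `get_table_shared`, `get_table` and `free_table` are hypothetical stand-ins chosen for the sketch; the actual swash is a Perl SV and the real function names differ.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a swash: a block of cached lookup results. */
typedef struct {
    size_t len;
    unsigned char *bits;
} table_t;

/* Internal accessor (file-local): returns the original, possibly shared,
 * table so that cached lookups benefit every user of it. */
static table_t *get_table_shared(table_t *owner)
{
    return owner;
}

/* Public wrapper: returns a private copy, so outside callers that
 * overwrite it cannot harm others. Returns NULL on allocation failure. */
table_t *get_table(table_t *owner)
{
    table_t *shared = get_table_shared(owner);
    table_t *copy = malloc(sizeof *copy);
    if (!copy)
        return NULL;
    copy->len = shared->len;
    copy->bits = malloc(shared->len ? shared->len : 1);
    if (!copy->bits) {
        free(copy);
        return NULL;
    }
    if (shared->len)
        memcpy(copy->bits, shared->bits, shared->len);
    return copy;
}

/* Releases a copy obtained from get_table(). */
void free_table(table_t *t)
{
    if (t) {
        free(t->bits);
        free(t);
    }
}
```

The copy costs an allocation per outside call, which is the price of keeping the shared version safe from modification.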
* regexec.c: Prepare for inversion lists in ANYOF nodesKarl Williamson2012-01-131-6/+52
| | | | | | Future commits will start passing inversion lists to regexec.c from the compilation phase. This commit causes regexec.c to accept them, trace them for debug output, and pass them along to utf8.c
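For readers unfamiliar with the data structure, here is a minimal sketch of a membership test on an inversion list, assuming the common convention that the list is a sorted array of code points in which each element toggles membership and element 0 begins the first range in the set. The function name `invlist_contains` is hypothetical, and Perl's internal representation may carry extra header data not modeled here.

```c
#include <stddef.h>

/* Hypothetical lookup: returns 1 if cp is in the set described by the
 * inversion list. list[0] starts a range that is in the set, list[1]
 * starts a range that is not, and so on, alternating. */
static int invlist_contains(const unsigned long *list, size_t n,
                            unsigned long cp)
{
    size_t lo = 0, hi = n;

    /* Binary search for the number of elements that are <= cp. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (list[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* An odd count means the last toggle switched membership on. */
    return (lo & 1) != 0;
}
```

With the list {0x41, 0x5B, 0x61, 0x7B}, which encodes A-Z and a-z, the letters test as members and 0x5B (the character after Z) does not.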
* regexec.c: Add some comments to regclass_swash()Karl Williamson2012-01-131-1/+16
|
* regexec.c: Remove unnecessary intermediate valuesKarl Williamson2012-01-131-7/+6
|
* diag_listed_as galoreFather Chrysostomos2011-12-281-4/+5
| | | | | In two instances, I actually modified the code to avoid %s for a constant string, as it should be faster that way.
* regexec.c: Bypass unneeded stepKarl Williamson2011-11-111-2/+2
| | | | | We don't have to convert from utf8 to code point to fold; instead can call the function that starts from utf8
* regexec.c: Stop looking for match even soonerKarl Williamson2011-11-091-3/+3
| | | | | | | | This revises commit e067297c376fbbb5a0dc8428c65d922f11e1f4c6 slightly so that we round up to get the search stopping point. We aren't matching partial characters, so if we were to match 3+1/3 characters, we really have to match 4 characters.
* regexec.c: revise commentKarl Williamson2011-11-091-4/+6
|
* regexec.c: typo in commentKarl Williamson2011-11-091-1/+1
|
* Change __attribute_unused__ to PERL_UNUSED_DECLKarl Williamson2011-11-091-1/+1
| | | | The latter is the Perl standard way of making this declaration
* PATCH: [perl #101710] Regression with /i, latin1 chars.Karl Williamson2011-11-011-1/+1
| | | | | | The root cause of this bug is that it was assuming that a string was in utf8 when it wasn't, and so was thinking that a byte was a starter byte that wasn't, so was skipping ahead based on that starter byte.
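The class of bug described above can be illustrated with a small sketch: a byte such as 0xE9 is a complete character in a Latin-1 string, but looks like the start of a three-byte sequence if the string is wrongly treated as UTF-8. The names `utf8_seq_len` and `char_advance` are hypothetical, and the length table follows standard UTF-8 rather than Perl's extended encoding.

```c
#include <stddef.h>

/* Length of a standard UTF-8 sequence judged from its first byte.
 * Continuation bytes and invalid lead bytes are treated as length 1. */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80)
        return 1;
    if ((b & 0xE0) == 0xC0)
        return 2;
    if ((b & 0xF0) == 0xE0)
        return 3;
    if ((b & 0xF8) == 0xF0)
        return 4;
    return 1;
}

/* Hypothetical helper: how far to advance past the character at s.
 * Only consult the start byte when the string really is UTF-8. */
static size_t char_advance(const unsigned char *s, int is_utf8)
{
    return is_utf8 ? utf8_seq_len(*s) : 1;
}
```

Skipping three bytes for a Latin-1 0xE9 would jump over two characters that should have been examined, which matches the symptom of skipping ahead based on a byte that was not really a starter byte.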
* regexec.c: Add another place to not re-foldKarl Williamson2011-10-171-1/+1
| | | | This adds regrepeat to the places that, as of the recent commits, no longer keep re-folding
* regexec.c: Another place to not re-foldKarl Williamson2011-10-171-2/+2
| | | | | A recent commit caused regexec.c to not keep calculating the folds in one circumstance. This one adds the case in regmatch
* regexec.c: Less work in /i matchingKarl Williamson2011-10-171-2/+4
| | | | | | | | | | | If you watch an execution trace of regexec /i, often you will see it folding the same thing over and over, as it backtracks or searches ahead. regcomp.c has now been changed to always fold UTF-8 encoded EXACTF and EXACTFU nodes. This allows these to not be re-folded each time. This commit does it just for find_by_class(). Other commits will expand this technique for other cases.
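The idea of folding the pattern once and reusing the result can be sketched as follows. This is a simplification that assumes ASCII-only folding via `tolower`; the function names `fold_ascii` and `match_prefolded` are hypothetical and do not correspond to regexec.c routines.

```c
#include <ctype.h>
#include <stddef.h>

/* Fold the pattern once, up front (ASCII-only simplification). */
static void fold_ascii(char *dst, const char *src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = (char)tolower((unsigned char)src[i]);
}

/* Compare an already-folded pattern against the start of a target.
 * Only the target is folded here; the pattern is used as stored, so
 * repeated trials at different positions never re-fold it. */
static int match_prefolded(const char *folded_pat, size_t plen,
                           const char *target, size_t tlen)
{
    if (tlen < plen)
        return 0;
    for (size_t i = 0; i < plen; i++) {
        if (tolower((unsigned char)target[i]) != (unsigned char)folded_pat[i])
            return 0;
    }
    return 1;
}
```

A caller would run `fold_ascii` once when the pattern is prepared and then call `match_prefolded` at every candidate position while searching or backtracking.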
* regexec.c: Stop looking for match soonerKarl Williamson2011-10-171-2/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | This is a partial reversion of commit 7c1b9f38fcbfdb3a9e1766e02bcb991d1a5452d9 which went unnecessarily far in fixing the problem. After studying the situation some more, I see more clearly what was going on. The point is that if you have only 2 characters left in the string, but the pattern requires 3 to work, it's guaranteed to fail, so it is pointless, and unnecessary work, to try. So don't begin a match trial at a position when there are fewer than the minimum number of characters necessary. That is what the code before that commit did. However it neglected the fact that it is possible for a single character to match multiple ones, so there is not a 1:1 ratio. This new commit assumes the worst possible ratio to calculate how far into a string is the furthest a successful match could start. This is going to in most cases still look too far, but it is much better than always going up to the final character, as the previous patch did. The maximum ratio is guaranteed by Unicode to be 3:1, but when the target isn't in UTF-8, the max is 2:1, determined simply by inspection of the defined folds. And actually, currently, the single case where it isn't 1:1 doesn't come up here, because regcomp.c guarantees that that match doesn't generate one of these EXACTFish nodes. However, I expect that to change for 5.16, and so am preparing for that case by making it 2:1.
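The stopping-point arithmetic can be sketched as below, using the 3:1 and 2:1 worst-case ratios given in the message and the round-up behaviour that the later "Stop looking for match even sooner" revision describes. The function names and the character-count interface are assumptions for illustration; the real code works on byte pointers inside the engine.

```c
#include <stddef.h>

/* Fewest target characters that could possibly match a pattern needing
 * pattern_chars folded characters, assuming the worst-case expansion
 * ratio (3:1 for a UTF-8 target, 2:1 otherwise). Rounds up, because a
 * partial character cannot be matched. */
static size_t min_target_chars(size_t pattern_chars, int target_is_utf8)
{
    size_t ratio = target_is_utf8 ? 3 : 2;
    return (pattern_chars + ratio - 1) / ratio;
}

/* Index of the last start position worth trying in a target of
 * target_chars characters, or -1 if the target is too short for any
 * match trial to succeed. */
static long last_start(size_t target_chars, size_t pattern_chars,
                       int target_is_utf8)
{
    size_t need = min_target_chars(pattern_chars, target_is_utf8);
    if (need > target_chars)
        return -1;
    return (long)(target_chars - need);
}
```

For a pattern of 10 folded characters and a UTF-8 target, 10/3 is 3+1/3, so at least 4 target characters are required and trials can stop 4 characters before the end.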
* regexec.c: Add commentKarl Williamson2011-10-171-0/+5
|
* regexec.c: omit goto for the common caseKarl Williamson2011-10-171-13/+13
| | | | | | | The structure of this code is that initial setup is done and then gotos or fall-through used to join for the main logic. This commit just moves a block, without logic changes, so that the more common case has a fall-through instead of a goto.
* regexec.c: Fix "\x{FB01}\x{FB00}" =~ /ff/iKarl Williamson2011-10-131-1/+2
| | | | | | | | | | | Only the first character of the string was being checked when scanning for the beginning position of the pattern match. This was so wrong, it looks like it has to be a regression. I experimented a little and did not find any. I believe (but am not certain) that a multi-char fold has to be involved. The handling of these was so broken before 5.14 that there very well may not be a regression.
* regexec.c: Add commentsKarl Williamson2011-10-131-0/+4
|
* regexec.c: Avoid hard-coded utf8 tests for EBCDICKarl Williamson2011-10-011-1/+7
| | | | | | | | When a swash is loaded, generally it is checked for sanity with an assert(). The strings used are hard-coded utf8 strings, which will be different in EBCDIC, and hence will fail. I haven't figured out a simple way to get compile-time utf8 vs utfebcdic strings, but we can just skip the check in EBCDIC builds
* regexec.c: Add assertion checkKarl Williamson2011-10-011-1/+1
| | | | | This makes sure before there is a segfault that the is_() functions actually have the side effect that this expects.
* RT #96354: \h \H \v and \V didn't check for EOLDavid Mitchell2011-08-051-0/+4
| | | | | | The HORIZWS and similar regexp ops didn't check that the end of the string had been reached; therefore they would blithely compare against the \0 at the end of the string, or beyond.
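A minimal sketch of the missing bounds check follows. It is simplified to treat only space and tab as horizontal whitespace, whereas the real \h class covers more characters, and the function names are hypothetical.

```c
#include <stddef.h>

/* Simplified horizontal-whitespace test (space and tab only). */
static int is_horiz_ws_byte(unsigned char c)
{
    return c == ' ' || c == '\t';
}

/* \h-like match at s: fails at end of string instead of reading *end. */
static int match_horizws(const unsigned char *s, const unsigned char *end)
{
    if (s >= end)
        return 0;
    return is_horiz_ws_byte(*s);
}

/* \H-like match at s: also has to fail at end of string; without the
 * bounds check the trailing \0 would count as "not whitespace". */
static int match_not_horizws(const unsigned char *s, const unsigned char *end)
{
    if (s >= end)
        return 0;
    return !is_horiz_ws_byte(*s);
}
```

The negated form is the more dangerous one: comparing against the terminating \0 makes \H appear to match one character past the end of the string.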
* /aa and \b fail under some utf8 stringsKarl Williamson2011-07-301-1/+4
| | | | | This was due to my failure to realize that this 'if' needed to be updated when the /aa modifier was added.
* Panic with \b and /aaKarl Williamson2011-07-301-0/+1
| | | | | This was due to my oversight in not fixing this switch statement to accommodate /aa when it was added.
* re_eval: clear lexicals in the right padDavid Mitchell2011-07-161-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | (?{...}) deliberately doesn't introduce a new scope (so that the effects of local() can accumulate across multiple calls to the code). This also means that the SAVEt_CLEARSVs pushed onto the save stack by lexical declarations (i.e. (?{ my $x; ... })) also accumulate, and are only processed en masse at the end, on exit from the regex. Currently they are usually processed in the wrong pad (the caller of the pattern, rather than the pads of the individual code block(s)), leading to random misbehaviour and SEGVs. Hence the long-standing advice to avoid lexical declarations within re_evals. We fix this by wrapping a pair of SAVECOMPPADs around each call to a code block. Eventually the save stack will be a long accumulation of SAVEt_CLEARSV's interspersed with SAVEt_COMPPAD's, that when popped en masse should unwind in the right order with the right pad at the right time. The price to pay for this is two extra additions to the save stack (which accumulate) for each code call. A few TODO tests in reg_eval_scope.t now pass, so I'm probably doing the right thing ;-)
* regexec.c: Nits in commentsKarl Williamson2011-07-071-7/+8
|
* For shorter strings, store C<study>'s data as U8s or U16s, instead of U32s.Nicholas Clark2011-07-011-1/+11
| | | | | | | The assumption is that most studied strings are fairly short, hence the pain of the extra code is worth it, given the memory savings.
| | | | | | | 80 character string: 336 bytes as U8, down from 1344 as U32
| | | | | | | 800 character string: 2112 bytes as U16, down from 4224 as U32
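The figures quoted above are consistent with a layout of 256 per-octet slots plus one slot per character of the string, each slot being 1, 2 or 4 bytes wide. The sketch below reproduces that arithmetic; the layout is inferred from the numbers rather than taken from the source, and the width thresholds and the names `study_width` and `study_bytes` are assumptions.

```c
#include <stddef.h>

/* Assumed width selection: pick the narrowest slot that can hold every
 * position in the string while leaving the all-ones value free to mean
 * "no more". The exact cut-offs here are an assumption. */
static size_t study_width(size_t len)
{
    if (len < 0xFF)
        return 1;   /* U8 */
    if (len < 0xFFFF)
        return 2;   /* U16 */
    return 4;       /* U32 */
}

/* Assumed layout: 256 slots (one per octet value) plus one slot per
 * character, each `width` bytes. */
static size_t study_bytes(size_t len, size_t width)
{
    return (256 + len) * width;
}
```

Under these assumptions an 80-character string needs (256 + 80) * 1 = 336 bytes instead of 1344, and an 800-character string needs (256 + 800) * 2 = 2112 bytes instead of 4224, matching the numbers in the message.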
* Store C<study>'s data as U32s, instead of I32s.Nicholas Clark2011-07-011-1/+1
| | | | The "no more" condition is now represented as ~0, instead of -1.
* Store C<study>'s data in mg_ptr instead of interpreter variables.Nicholas Clark2011-07-011-1/+6
| | | | | This allows more than one C<study> to be active at the same time. It eliminates PL_screamfirst, PL_lastscream, PL_maxscream.
* Change PL_screamnext to store absolute positions.Nicholas Clark2011-07-011-1/+1
| | | | | | | | | | PL_screamnext gives the position of the next occurrence of the current octet. Previously it stored this as an offset from the current position, with -pos stored for "no more", so that the calculated new offset would be zero, allowing a zero/non-zero loop exit test in Perl_screaminstr(). Now it stores absolute position, with -1 for "no more". Also codify -1 as the "not present" value for PL_screamfirst, instead of any negative value.
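A table of this shape, with absolute positions and -1 for "no more" or "not present", can be built in one backwards pass. This is a sketch only: `build_study_tables` is a hypothetical name, and the arrays stand in for PL_screamfirst and PL_screamnext rather than reproducing their actual types.

```c
#include <stddef.h>

/* first[c] receives the position of the first occurrence of octet c, or
 * -1 if c is not present. next[i] receives the absolute position of the
 * next occurrence of the octet at position i, or -1 if there is none.
 * next must have room for len entries. */
static void build_study_tables(const unsigned char *s, size_t len,
                               long first[256], long *next)
{
    for (size_t c = 0; c < 256; c++)
        first[c] = -1;

    /* Walk backwards so that first[] always holds the nearest later
     * occurrence when the current position is linked to it. */
    for (size_t i = len; i-- > 0; ) {
        next[i] = first[s[i]];
        first[s[i]] = (long)i;
    }
}
```

For the string "abca", `first['a']` is 0, `next[0]` is 3, and `next[3]` is -1, so a scan for 'a' visits positions 0 and 3 and then stops on the sentinel.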
* The regex engine can't assume that SvSCREAM() remains set on its target.Nicholas Clark2011-06-301-3/+3
| | | | | | | | Callers to the engine set REXEC_SCREAM in the flags when the target scalar is studied, and the engine should use the study data. It's possible for embedded code blocks to cause the target scalar to stop being studied. Hence the engine needs to check for this, instead of simply assuming that the study data is present and valid to read. This resolves #92696.
* regexec.c: Remove unnecessary special handling for \xDFKarl Williamson2011-06-111-6/+5
| | | | | regcomp.c has been changed, so the case that this handled no longer comes up.
* Use SvTAIL() instead of BmFLAGS(). The core no longer uses BmFLAGS().Nicholas Clark2011-06-111-10/+9
|
* use __attribute__unused__ to silence -Wunused-but-set-variableRobin Barker2011-05-191-2/+10
|
* Assertion fails in multi-char regex matchKarl Williamson2011-05-181-4/+6
| | | | | | | | | | In "s\N{U+DF}" =~ /\x{00DF}/i, the LHS folds to 'sss', the RHS to 'ss'. The bug occurs when the RHS tries to match the first two s's, but that splits the LHS \xDF character, which Perl doesn't know how to handle, and the assertion got triggered. (This is similar to [perl #72998].) The solution adopted here is to disallow a partial character match, as #72998 did as well.
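One way to picture the "no partial character" rule is to track how many folded characters each source character expands to and accept a match only if it ends on a source character boundary. The function `ends_on_char_boundary` and the `fold_len` array are hypothetical devices for this sketch, not the mechanism used in regexec.c.

```c
#include <stddef.h>

/* fold_len[i] is the number of folded characters that source character i
 * expands to (for example 1 for 's' and 2 for U+DF). Returns 1 if a match
 * consuming `matched` folded characters from the start of the string ends
 * exactly on a source character boundary, 0 if it would split one. */
static int ends_on_char_boundary(const size_t *fold_len, size_t nchars,
                                 size_t matched)
{
    size_t consumed = 0;

    for (size_t i = 0; i < nchars && consumed < matched; i++)
        consumed += fold_len[i];

    return consumed == matched;
}
```

For the source string "s" followed by U+DF the expansion lengths are {1, 2}. A two-character match of 'ss' would end in the middle of U+DF's fold, so it is rejected, while matches of one or three folded characters end on a boundary.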
* PATCH: [perl #87908] \W is its complement sometimesKarl Williamson2011-04-061-1/+1
| | | | | | | | | | A missing '!' turned \W into \w in some code execution paths and utf8 data. This patch fixes that. It does not include tests at the moment, since I don't have time just now to examine why the existing tests didn't catch this, when it looks like they are set up to. There have also been several BBC tickets lately that I'm hopeful this may fix, heading off other ones.
* regexec.c: fix some compiler warningsDavid Mitchell2011-03-261-2/+2
|
* regexec.c: Rmv special code no longer neededKarl Williamson2011-03-201-14/+3
| | | | The trickiness has been resolved elsewhere
* regexec.c: Update commentKarl Williamson2011-03-191-25/+13
|
* regexec.c: execute inappropriately skipped codeKarl Williamson2011-03-191-5/+6
| | | | | | The comment said that there was no use doing this if lenp was NULL, but there is, as it sees if there is a match or not and sets the appropriate variable.
* regexec.c: Chg var. name for clarityKarl Williamson2011-03-191-5/+5
|
* Stop hang in regexKarl Williamson2011-03-191-28/+35
| | | | | | | The algorithm for mapping multi-char fold matches back to the source in processing ANYOF nodes was defective. This caused the regex engine to hang on certain character combinations. I've also added an assert to stop instead of loop.
* Fix RT #84294 /((\w+)(?{print $2})){2,2}/ problemYves Orton2011-03-121-2/+5
| | | | | | | | | | | | | | | | When we are doing a CURLYX/WHILEM loop and the min iterations is larger than zero we were not saving the buffer state before each iteration. This means that partial matches would end up with strange buffer pointers, with the start *after* the end point. In more detail WHILEM has four exits, three of which as far as I could tell would do a regcppush/regcppop in their state transitions, only one, WHILEM_A_pre which is entered when (n < min) would not. And it is this state that we repeatedly enter when performing A the min number of times. When I made the logic similar to the handling of ( n < max ), the bug went away, and as far as I can tell nothing else broke. Review by Dave Mitchell required before release.