path: root/regexec.c
Commit message | Author | Age | Files | Lines
* Fix RT #84294 /((\w+)(?{print $2})){2,2}/ problem | Yves Orton | 2011-03-12 | 1 | -2/+5
    When we are doing a CURLYX/WHILEM loop and the min iterations is larger than zero, we were not saving the buffer state before each iteration. This meant that partial matches would end up with strange buffer pointers, with the start *after* the end point.
    In more detail, WHILEM has four exits, three of which, as far as I could tell, would do a regcppush/regcppop in their state transitions; only one, WHILEM_A_pre, which is entered when (n < min), would not. And it is this state that we repeatedly enter when performing A the min number of times. When I made the logic similar to the handling of (n < max), the bug went away, and as far as I can tell nothing else broke.
    Review by Dave Mitchell required before release.
* regexec.c: Use equivalent macro instead of code | Karl Williamson | 2011-03-10 | 1 | -13/+1
    Recent simplification of this code left it to be the equivalent of an existing macro.
* regexec.c: Add assert() to detect inconsistent ANYOF | Karl Williamson | 2011-03-10 | 1 | -0/+2
    There have been various segfaults apparently due to trying to access the swash (and allies) portion of an ANYOF which doesn't have that. This doesn't show up on all platforms. The assert() should detect this and help debugging.
* regexec.c: Fix precedence | Karl Williamson | 2011-03-10 | 1 | -5/+6
    Commit ac51e94be5daabecdeb0ed734f3ccc059b7b77e3 didn't do what it purported, because it omitted parentheses that were necessary to change the natural precedence. It's strange that it passed all tests on my machine, and failed so miserably elsewhere that it was quickly reverted by commit 63c0bfd59562325fed2ac5a90088ed40960ac2ad. This reinstates it with the correct precedence. The next commit will add an assert() so that the underlying issue will be detected on all platforms.
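The kind of precedence slip described in the commit above can be sketched in C. This is an illustration only, not the code from regexec.c: the flag values and both function names are hypothetical, and the exact condition in the real commit may differ. In C, `&&` binds more tightly than `||`, so a test written without parentheses can group differently from what was intended.

```c
#include <stdbool.h>

/* Hypothetical flag values for illustration; not Perl's real definitions. */
#define FLAG_NONBITMAP          0x01u
#define FLAG_NONBITMAP_NON_UTF8 0x02u

/* How the unparenthesized test
 *     flags & FLAG_NONBITMAP && utf8_target || flags & FLAG_NONBITMAP_NON_UTF8
 * groups under C's natural precedence (shown here with explicit parentheses). */
bool groups_naturally(unsigned flags, bool utf8_target)
{
    return ((flags & FLAG_NONBITMAP) && utf8_target)
        || (flags & FLAG_NONBITMAP_NON_UTF8);
}

/* Assumed intent: only look outside the bitmap when something is there,
 * and then only if the target is utf8 or the non-utf8 flag is set. */
bool groups_as_intended(unsigned flags, bool utf8_target)
{
    return (flags & FLAG_NONBITMAP)
        && (utf8_target || (flags & FLAG_NONBITMAP_NON_UTF8));
}
```

With only `FLAG_NONBITMAP_NON_UTF8` set and a non-utf8 target, the first function says "go look outside the bitmap" even though nothing is there, while the second does not. That is the sort of inconsistent state an assert() can catch on every platform.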
* Revert "regexec.c: don't try accessing non-bitmap if doesn't exist" | David Mitchell | 2011-03-10 | 1 | -6/+5
    This reverts commit ac51e94be5daabecdeb0ed734f3ccc059b7b77e3. This commit made many of the re/*.t tests fail, on my build at least. Haven't looked at why, just reverting it for the moment.
* regexec.c: don't try accessing non-bitmap if doesn't exist | Karl Williamson | 2011-03-09 | 1 | -5/+6
    ANYOF_NONBITMAP is supposed to be set iff there is something outside the bitmap to try matching in an ANYOF node. Due to slight changes in the meaning of this, the code has been trying to access this if ANYOF_NONBITMAP_NON_UTF8 is set without ANYOF_NONBITMAP being set, which means it was trying to access something that doesn't exist. I'm hopeful, based on a stack trace sent to me, that this is the cause of [perl #85478], but I can't reproduce that easily. But the logic is clearly wrong.
* regex: /l in combo with others in syn start class | Karl Williamson | 2011-03-08 | 1 | -3/+8
    Now that regexes can be combinations of different charset modifiers, a synthetic start class can match both locale and non-locale. Locale should generally match only things in the bitmap for code points < 256. But a synthetic start class with a non-locale component can match such code points. This patch makes an exception for synthetic nodes, which will be resolved if the match passes and is then matched again for real.
* regexec.c: Remove '#if 0' code | Karl Williamson | 2011-03-02 | 1 | -70/+0
    This code was retained for a while until it was clear that the replacement code worked.
* regex: Remove obsolete code | Karl Williamson | 2011-02-28 | 1 | -52/+23
    This code has been rendered obsolete in 5.14 by using a different mechanism altogether. This functionality is now provided at run-time, user-selectable, via the /u and /d regex modifiers. This code was for compile-time selection of which to use.
* regexec.c: remove no longer needed code | Karl Williamson | 2011-02-28 | 1 | -5/+1
    The sharp s (LATIN SMALL LETTER SHARP S) is now handled by the ANYOFV node, so code dealing with it shouldn't appear here.
* regexec.c: Array declared 1 too small | Karl Williamson | 2011-02-26 | 1 | -1/+1
    The bounds of this array were being exceeded, causing smoke failures on NetBSD.
* Free up bit in ANYOF flags | Karl Williamson | 2011-02-25 | 1 | -1/+1
    This is the foundation for fixing the regression RT #82610. My analysis was wrong that two bits could be shared, at least not without further work. This changes to use a different mechanism to pass needed information to regexec.c so that another bit can be freed up and, in a later commit, the two bits can become unshared again.
    The bit that is freed up is ANYOF_UTF8, which basically said there is something that is matched outside the ANYOF bitmap, and requires the target string to be in utf8. This changes things so the existence of something besides the bitmap indicates this, and so no flag is needed. The flag bit ANYOF_NONBITMAP_NON_UTF8 remains to indicate that there is something that should be matched outside the bitmap even if the target string isn't in utf8.
* regexec.c: Fix utf8 e.g. [\s] under locale | Karl Williamson | 2011-02-19 | 1 | -2/+7
    Locale rules were handled improperly for utf8-encoded strings in bracketed character classes under locale. This fixes that.
* Fix locale caseless matching and utf8 | Karl Williamson | 2011-02-19 | 1 | -10/+7
    As explained in the doc changes of this patch, under /l, caseless matching of code points less than 256 now uses locale rules regardless of the utf8ness of the pattern or string. It now matches the behavior of things like \w in this regard.
* regexec.c: Change flag bit from digit to mnemonic | Karl Williamson | 2011-02-19 | 1 | -1/+1
* document how tainting works with pattern matching | David Mitchell | 2011-02-16 | 1 | -1/+1
* fix C++ builds and make the comment on initializers clearer | Tony Cook | 2011-02-16 | 1 | -2/+4
* regexec.c: Silence netbsd compiler warning | Karl Williamson | 2011-02-15 | 1 | -2/+2
* Add /aa regex modifier | Karl Williamson | 2011-02-14 | 1 | -2/+39
    Tests for \N{} with this option will be added later.
* regexec.c: Handle sharp s in middle of backref | Karl Williamson | 2011-02-14 | 1 | -4/+1
    This code handled some cases of the LATIN SMALL LETTER SHARP S at the beginning of a back ref, but not in the middle. To do it easily, just call the function that handles our full Unicode folding.
* regexec.c: Remove no longer used code | Karl Williamson | 2011-02-14 | 1 | -141/+0
    A recent commit #ifdef'd this out.
* regexec.c: Convert to foldEQ_utf8_flags() | Karl Williamson | 2011-02-14 | 1 | -6/+21
* regexec.c: Rmv unused macro | Karl Williamson | 2011-02-14 | 1 | -24/+0
    A recent commit stopped calling this.
* change comment | Karl Williamson | 2011-02-14 | 1 | -1/+1
* regexec.c: refactor find-by-class EXACTish code | Karl Williamson | 2011-02-14 | 1 | -14/+112
    This code was way out-of-date, using upper and lower case instead of fold-case.
* regexec.c: Fix comment | Karl Williamson | 2011-02-14 | 1 | -2/+1
* regexec.c: Rmv wrong comment | Karl Williamson | 2011-02-14 | 1 | -1/+0
* regex: Add comments | Karl Williamson | 2011-02-14 | 1 | -3/+3
* regexec.c: Give context for ANYOFV call | Karl Williamson | 2011-02-14 | 1 | -4/+3
    This converts one case where ANYOFV is now usable to allow it to match more than one character.
* regexec.c: Give context for ANYOFV call | Karl Williamson | 2011-02-14 | 1 | -4/+8
    This converts one case where ANYOFV is now usable to allow it to match more than one character.
* regexec.c: Remove folding now done in regcomp | Karl Williamson | 2011-02-14 | 1 | -11/+4
* Move ANYOF folding from regexec to regcomp | Karl Williamson | 2011-02-02 | 1 | -0/+2
    This is for security as well as performance. It allows Unicode properties to not be matched case sensitively. As a result the swash inversion hash is converted from having utf8 keys to numeric, code point, keys. It also for the first time fixes the bug where /i doesn't work for a code point that has a multi-character fold and is not at the end of a range in a bracketed character class.
* regexec.c: Remove break statements from macros | Karl Williamson | 2011-01-18 | 1 | -6/+3
    This is so future coders won't be tempted to rely on them.
* regexec.c: Don't rely on break stmts in macros | Karl Williamson | 2011-01-18 | 1 | -0/+24
    It is safer and clearer to have the break statement in each case statement at the source level.
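The two commits above move `break` out of macros and into the `switch` itself. A minimal C sketch of that style follows. The opcode names, the `MATCH_CLASS` macro and the `classify` function are hypothetical and are not the CCC_TRY macros from regexec.c.

```c
/* Hypothetical opcodes, for illustration only. */
enum op { OP_DIGIT, OP_SPACE, OP_OTHER };

/* The macro only records the test result. It deliberately contains no
 * `break`, so control flow stays visible in the switch below. */
#define MATCH_CLASS(test) matched = (test)

int classify(enum op op, int ch)
{
    int matched = 0;

    switch (op) {
    case OP_DIGIT:
        MATCH_CLASS(ch >= '0' && ch <= '9');
        break;                      /* break written at the source level */
    case OP_SPACE:
        MATCH_CLASS(ch == ' ' || ch == '\t');
        break;
    default:
        break;
    }
    return matched;
}
```

A reader can see at a glance that no case falls through, without having to open the macro definition to find a hidden `break`.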
* regexec.c: Fix /a complements | Karl Williamson | 2011-01-18 | 1 | -3/+21
    This showed up only on some systems in the current test suite, but processing, e.g., \D has to care about the target string being utf8.
* regexec.c: Fix so will compile on Windows | Karl Williamson | 2011-01-17 | 1 | -6/+8
    Commit cfaf538b6276c6a8ef80ff6c66e106c6a4f1caaa introduced changes that cause this to not compile on Windows. The compiler there, unlike gcc, did not accept empty macro parameters. This just creates a placeholder macro that expands to nothing to give the preprocessor something to grab onto.
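The placeholder technique described in the commit above can be sketched as follows. The macro and function names here are hypothetical, chosen for illustration rather than taken from regexec.c. Instead of passing an empty argument, the caller passes a macro whose expansion is empty, so every compiler sees a non-empty argument token.

```c
/* Expands to nothing; gives the preprocessor something to grab onto. */
#define PLACEHOLDER

/* Hypothetical generator macro: `negate` is pasted in front of the test.
 * Passing `!` yields the complemented check; passing PLACEHOLDER yields
 * the plain one, without ever writing an empty macro argument. */
#define DEFINE_CHECK(name, negate)                      \
    int name(int c)                                     \
    {                                                   \
        return negate (c >= '0' && c <= '9');           \
    }

DEFINE_CHECK(is_digit, PLACEHOLDER)
DEFINE_CHECK(is_not_digit, !)
```

Because `negate` is not used with `#` or `##`, the argument is fully macro-expanded before substitution, so `PLACEHOLDER` disappears and `is_digit` is compiled as a plain range test.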
* Add /a regex modifier | Karl Williamson | 2011-01-17 | 1 | -25/+168
    This restricts certain constructs, like \w, to matching in the ASCII range only.
* regex: Use BOUNDU regnodes | Karl Williamson | 2011-01-16 | 1 | -98/+108
    This refactors one area in regexec.c to use BOUNDU and NBOUNDU, for efficiency and to make it easier to add the future BOUNDA.
* regexec.c: Remove unnecessary cBOOLs | Karl Williamson | 2011-01-16 | 1 | -5/+5
    These functions already return a boolean.
* regexec.c: Use FLAGS field instead of OP for BOUND node | Karl Williamson | 2011-01-16 | 1 | -8/+8
    This makes the equivalent code in BOUND and NBOUND identical so that it can be factored out, and makes optimization easier, as the FLAGS field is already required in the vicinity.
* regexec.c: Convert two !=0's to cBOOL | Karl Williamson | 2011-01-16 | 1 | -4/+4
* regexec.c: change variable name to reflect its purpose | Karl Williamson | 2011-01-16 | 1 | -1/+1
* regexec.c: Change '1' to bool TRUE for clarity. | Karl Williamson | 2011-01-16 | 1 | -2/+2
    The function is supposed to take a bool.
* regexec.c: refactor and comment the CCC_TRY macros | Karl Williamson | 2011-01-16 | 1 | -55/+92
    These are refactored to be more compact, and I think clearer.
* regex: Separate nodes for Unicode semantics \s \w | Karl Williamson | 2011-01-16 | 1 | -115/+140
    This patch converts the Unicode semantics of \s, \w and their complements from using the flags field of their nodes to using separate nodes. This gains some efficiency, especially useful in tight loops and backtracking of regexec.c, and prepares the way for easily adding other semantic variations, such as /a.
    It refactors the CCC_TRY... macros. I tried to break this piece up into smaller chunks, but found it much easier to get to this in one step. Further patches will do some more refactoring of these.
    As part of the CCC_TRY macro refactoring, the lines that include the test if (! nextchr) are changed to just look for the end-of-string by position instead of it being NUL. In locales, it could be (however unlikely) that NUL is a real alphabetic, digit, or space character.
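The last point in the commit above, detecting end-of-string by position rather than by a NUL byte, can be illustrated with a small C sketch. Both functions are hypothetical and only show the difference between the two tests; they are not taken from regexec.c.

```c
#include <stddef.h>

/* Old-style test: stops at the first NUL byte, treating it as end-of-string
 * even when it lies inside the buffer. */
size_t count_until_nul(const char *s, const char *end)
{
    size_t n = 0;
    while (s < end && *s != '\0') {
        s++;
        n++;
    }
    return n;
}

/* Position-based test: only the `end` pointer decides where the string
 * stops, so an embedded NUL is counted like any other character. */
size_t count_until_end(const char *s, const char *end)
{
    size_t n = 0;
    while (s < end) {
        s++;
        n++;
    }
    return n;
}
```

For a three-byte buffer holding `'a'`, NUL, `'b'`, the first function stops after one character while the second sees all three, which is the behavior wanted when NUL may be a legitimate character in the target string.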
* regexec.c: Replace duplicated code by its macro | Karl Williamson | 2011-01-16 | 1 | -12/+2
    Replace two instances of code that is the same as that given by an already existing macro.
* regex: Convert regnode FLAGS fields to charset enum | Karl Williamson | 2011-01-16 | 1 | -16/+16
    The FLAGS fields of certain regnodes were encoded with USE_UNI if unicode semantics were to be used. This patch instead sets them to the character set used, one of the possibilities of which is to use unicode semantics. This shortens the code somewhat, and always puts the character set in the flags field, which can allow use of switch statements on it for efficiency, especially as new values are added.
* Fix \xa0 matching both [\s] [\S], et.al. | Karl Williamson | 2011-01-16 | 1 | -0/+6
    This bug stemmed from Latin1 characters not matching any (non-complemented) character class in /d semantics when the target string is not utf8, but having unicode semantics when it is. The solution here is to add a special flag. There were several tests that relied on the broken behavior; specifically, they tested that \xff isn't a printable word character even in utf8. I changed the deparse test to instead use a non-printable code point, and I changed the ones in re_tests to be TODOs, and will change them back using /a when that is shortly added.
* regexec.c: Remove no longer needed goto | Karl Williamson | 2011-01-14 | 1 | -4/+2
* regexec.c: Add to comment | Karl Williamson | 2011-01-14 | 1 | -1/+5