path: root/regcomp.c
Commit message | Author | Age | Files | Lines
* regcomp.c: Remove #if 0 code | Karl Williamson | 2011-03-02 | 1 | -98/+0
    This code is obsolete, as new code has been written to do folding; now that smokes are all passing with that new code, there is no point in retaining the old.
* regex: Remove obsolete code | Karl Williamson | 2011-02-28 | 1 | -24/+2
    This code has been rendered obsolete in 5.14 by using a different mechanism altogether. This functionality is now provided at run-time, user-selectable, via the /u and /d regex modifiers. This code was for compile-time selection of which to use.
* regcomp.c: white space only | Karl Williamson | 2011-02-28 | 1 | -145/+149
    A previous commit collapsed nested blocks. This outdents the nested part.
* regcomp.c: collapse two blocks | Karl Williamson | 2011-02-28 | 1 | -6/+3
    An earlier commit removed code so that these two blocks can be written as one.
* regcomp.c: Remove temporary code | Karl Williamson | 2011-02-28 | 1 | -9/+0
    This code was inserted to make sure no tests failed in the intermediate commits leading up to d50a4f90cab527593b2dd218f71b66a6be555490, and should have been removed in that commit, but I forgot to.
* Handle [folds] of 0-255 without swashes | Karl Williamson | 2011-02-27 | 1 | -13/+110
    Commit 56ca34cada940c7f6aae9a59da266e541530041e had the side effect of causing regular expressions with things like [a-z], or even just [k], to go out to disk to read tables to create swashes, because it knew that some of those characters matched outside the bitmap (and due to l1_char_class_tab.h it knew which ones had those matches), but it didn't know what the characters were that participated in those folds.

    This patch hard-codes the Unicode 6.0 rules into regcomp.c for the code points 0-255, so that the very slow utf8_heavy is not invoked on them. (Code points above 255 will continue to invoke it.)

    It would, of course, be better if these rules could be regen'd into regcomp.c, as there is a risk that the standard will change, and the code will not. But I don't think that has ever happened; in other words, I think that the rules haven't changed so far since Day 1 of Unicode. (That would not be the case if we were doing simple case folding, as the capital sharp ss which folds to U+00DF was added later.) And the Standard is getting more stable in this area. I believe one of their stability policies now forbids them from adding something that simply folds to one of the characters that already has a fold, such as M and m. Ligatures are frowned on, and I doubt that new ones would be encoded, so that leaves a new Unicode character that folds to a Latin-1 plus some sort of mark. For those, this code is a no-op, so those aren't a problem either.
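To illustrate what "hard-coding the rules for 0-255" can look like, here is a minimal C sketch. It is not the table that went into regcomp.c: the function name `sketch_folds_above_latin1` is hypothetical, and only a few well-known Unicode case-folding pairs are listed, so it should be read as an example of the approach and not as a complete rule set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not Perl's code): for a code point in 0-255, write the
 * code points above 255 that match it caselessly under Unicode rules into
 * out[] and return how many were written.  Only a few well-known
 * single-character cases are listed; the full rules cover more. */
size_t sketch_folds_above_latin1(unsigned int cp, unsigned long out[2])
{
    switch (cp) {
    case 'K': case 'k':         /* U+212A KELVIN SIGN folds to 'k' */
        out[0] = 0x212A;
        return 1;
    case 'S': case 's':         /* U+017F LATIN SMALL LETTER LONG S folds to 's' */
        out[0] = 0x017F;
        return 1;
    case 0xC5: case 0xE5:       /* U+212B ANGSTROM SIGN folds to U+00E5 */
        out[0] = 0x212B;
        return 1;
    case 0xFF:                  /* U+0178 (capital Y with diaeresis) folds to U+00FF */
        out[0] = 0x0178;
        return 1;
    case 0xB5:                  /* MICRO SIGN folds to U+03BC, shared with U+039C */
        out[0] = 0x039C;
        out[1] = 0x03BC;
        return 2;
    default:                    /* no partner above 255 in this partial list */
        return 0;
    }
}
```

With a lookup of this kind, a class such as [k] can learn that it must also match U+212A without reading any table from disk.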
* regcomp.c: Add deprecation macro with extra param | Karl Williamson | 2011-02-27 | 1 | -0/+7
* regcomp.c: More prep for bitmap/nonbitmap folds | Karl Williamson | 2011-02-27 | 1 | -1/+32
    This sets things up in preparation for a future commit that will move calculating all folds involving characters in the bit map.
* regcomp.c: Place marker for 2nd inversion list | Karl Williamson | 2011-02-27 | 1 | -19/+28
    The set_regclass_bit functions will be adding to a new inversion list. This declares that list and passes it to them.
* Change to use new add_cp_to_invlist() | Karl Williamson | 2011-02-27 | 1 | -1/+1
* regcomp.c: Add parameters to fcns | Karl Williamson | 2011-02-27 | 1 | -23/+23
    A pointer to the list of multi-char folds in an ANYOF node is now passed to the routines that set the bit map. This is in preparation for those routines to add to the list.
* regcomp.c: Convert old-style to inversion list | Karl Williamson | 2011-02-27 | 1 | -5/+5
    The code that handles a false range in a [character class] hadn't been converted to use inversion lists.
* regcomp.c: Add fcn add_cp_to_invlist() | Karl Williamson | 2011-02-27 | 1 | -0/+5
    This is just an inline shorthand when a single code point is all that is needed. A macro could have been used instead, but this just seemed nicer.
* regcomp.c: Move code to common place | Karl Williamson | 2011-02-27 | 1 | -3/+3
    This is part of the refactoring of the code that sets the alternate array for multi-char folds. Changing the node type to ANYOFV can be done at the last second, in pass 2, as it doesn't change any sizing, etc.
* regcomp.c: Factor code into a function. | Karl Williamson | 2011-02-27 | 1 | -6/+19
    A future commit uses this same code, so put it into a common place.
* regcomp.c: Remove no longer necessary tests | Karl Williamson | 2011-02-27 | 1 | -6/+0
    A previous commit changed add_range_to_invlist() to do the creation that these lines did.
* regcomp.c: accept NULL as inversion list param | Karl Williamson | 2011-02-27 | 1 | -5/+12
    Change the function add_range_to_invlist() to accept NULL as the inversion list, in which case it creates it. Callers commonly had to create the list if it didn't exist before calling this function, so this puts that creation code in one place.
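The "accept NULL and create on demand" pattern can be sketched in C as follows. This is a simplified stand-in and not Perl's implementation: the type `sketch_invlist` and the functions `sketch_add_range` and `sketch_contains` are hypothetical names, and the sketch assumes `end` is below `ULONG_MAX` so that a half-open upper bound can be stored.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical inversion list: sorted boundaries where b[0] starts an "in"
 * run, b[1] starts an "out" run, and so on, so {'a', 'g'} means a..f. */
typedef struct {
    size_t len, cap;
    unsigned long *b;
} sketch_invlist;

static sketch_invlist *sketch_new(void)
{
    sketch_invlist *l = malloc(sizeof *l);
    if (!l) abort();
    l->len = 0; l->cap = 0; l->b = NULL;
    return l;
}

static void sketch_push(sketch_invlist *l, unsigned long v)
{
    if (l->len == l->cap) {
        l->cap = l->cap ? l->cap * 2 : 8;
        l->b = realloc(l->b, l->cap * sizeof *l->b);
        if (!l->b) abort();
    }
    l->b[l->len++] = v;
}

/* Add start..end (inclusive) and return the resulting list.  Passing NULL
 * creates the list, so callers need no "create if missing" check. */
sketch_invlist *sketch_add_range(sketch_invlist *l,
                                 unsigned long start, unsigned long end)
{
    sketch_invlist *out = sketch_new();
    unsigned long lo = start, hi = end + 1;   /* half-open [lo, hi) */
    size_t i, n = l ? l->len : 0;
    int placed = 0;

    for (i = 0; i + 1 < n; i += 2) {
        unsigned long s = l->b[i], e = l->b[i + 1];
        if (e < lo) {                         /* entirely before the new range */
            sketch_push(out, s); sketch_push(out, e);
        } else if (s > hi) {                  /* entirely after the new range */
            if (!placed) { sketch_push(out, lo); sketch_push(out, hi); placed = 1; }
            sketch_push(out, s); sketch_push(out, e);
        } else {                              /* overlaps or touches: absorb it */
            if (s < lo) lo = s;
            if (e > hi) hi = e;
        }
    }
    if (!placed) { sketch_push(out, lo); sketch_push(out, hi); }
    if (l) { free(l->b); free(l); }
    return out;
}

/* A code point is in the set when an odd number of boundaries are <= it. */
int sketch_contains(const sketch_invlist *l, unsigned long cp)
{
    size_t i, n = 0;
    if (!l) return 0;
    for (i = 0; i < l->len && l->b[i] <= cp; i++) n++;
    return (int)(n & 1);
}
```

A caller can then start from `sketch_invlist *l = NULL;` and write `l = sketch_add_range(l, 'a', 'c');` without any separate creation step.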
* [perl #84746] Accessing $2 causes the interpreter to crash | Father Chrysostomos | 2011-02-25 | 1 | -7/+1
    Actually, it doesn’t. The original test case was:

        #!/usr/bin/perl
        my $rx = qr'\$ (?| {(.+?)} | (.+?); | (.+?)(\s) )'x;
        my $test = '/home/$USERNAME ';
        die unless $test =~ $rx;
        print "1: $1\n";
        print "2: $2\n" if defined $2;

    This crashes even if I put an ‘exit’ right after the pattern match.

    What’s happening is that regcomp miscounts the number of capturing parenthesis pairs (cf. [perl #59734]), so the execution of the regular expression causes a buffer overflow which overwrites the op_sibling field of the regcreset op, causing a crash when the op is freed. (The exact failure may differ between builds, platforms, etc., of course.)

    S_reg in regcomp.c keeps a count of the parenthesised groups in a (?|...) construct, which it updates after each branch, if that branch has more captures than any previous branch. But it was not updating the count after the last branch. So this bug would occur if the last branch had more capturing parentheses than any previous branch.

    Commit ee91d26, which fixed bug #59734, only solved the problem when there was just one branch (by updating the count before the loop that deals with subsequent branches was entered).

    This commit changes the code at the end of S_reg to take into account that RExC_npar (the current paren count) might have been increased by the last branch.

    Since the loop to deal with subsequent branches resets the count *before* each branch, the code that commit ee91d26 added is no longer necessary, so this commit removes it.
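The counting rule described above can be sketched in C. This is a simplified model and not the code in S_reg: the function names are hypothetical, each branch is reduced to the number of capture groups it opens, and the "before" function only models the missed update after the last branch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the fixed behaviour: every branch of (?|...)
 * restarts numbering at npar_before, and the group ends with the largest
 * count seen in any branch, including the last one. */
int sketch_npar_fixed(int npar_before, const int *captures, size_t nbranches)
{
    int max_npar = npar_before;
    size_t i;
    for (i = 0; i < nbranches; i++) {
        int npar = npar_before + captures[i];   /* count reset, then branch parsed */
        if (npar > max_npar)
            max_npar = npar;                    /* also runs for the last branch */
    }
    return max_npar;
}

/* Hypothetical model of the bug: the maximum is never updated for the
 * final branch, so its extra captures go uncounted. */
int sketch_npar_buggy(int npar_before, const int *captures, size_t nbranches)
{
    int max_npar = npar_before;
    size_t i;
    for (i = 0; i + 1 < nbranches; i++) {       /* stops before the last branch */
        int npar = npar_before + captures[i];
        if (npar > max_npar)
            max_npar = npar;
    }
    return max_npar;
}
```

For the test case in the commit message the three branches open 1, 1 and 2 capture groups. The fixed model reports 2 groups and the buggy model reports 1, which is the undercount that led to the buffer overflow at match time.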
* Free up bit in ANYOF flags | Karl Williamson | 2011-02-25 | 1 | -22/+37
    This is the foundation for fixing the regression RT #82610. My analysis was wrong that two bits could be shared, at least not without further work. This changes to use a different mechanism to pass needed information to regexec.c so that another bit can be freed up and, in a later commit, the two bits can become unshared again.

    The bit that is freed up is ANYOF_UTF8, which basically said there is something that is matched outside the ANYOF bitmap, and requires the target string to be in utf8. This changes things so the existence of something besides the bitmap indicates this, and so no flag is needed. The flag bit ANYOF_NONBITMAP_NON_UTF8 remains to indicate that there is something that should be matched outside the bitmap even if the target string isn't in utf8.
* regcomp.c: ANYOF node handle range to UV_MAX | Karl Williamson | 2011-02-25 | 1 | -2/+11
* regcomp.c: Move inversion list conversion code | Karl Williamson | 2011-02-25 | 1 | -29/+32
    This just moves the code to later in the subroutine, in preparation for future commits.
* regcomp.c: Use more precise ANYOF flag | Karl Williamson | 2011-02-25 | 1 | -1/+1
    As the comment above the changed line says, \p doesn't have to match only utf8, but it sets the flag that is two bits, meaning UTF8. Set just the one flag.
* regcomp.c: Add comment | Karl Williamson | 2011-02-25 | 1 | -1/+2
* Two functions shouldn't be declared inline | Karl Williamson | 2011-02-21 | 1 | -2/+2
* Revert "regcomp: Add warning if tries to use \p in locale." | Karl Williamson | 2011-02-19 | 1 | -8/+1
    This reverts commit fb2e24cdda774d9e9c28f1cd0356bba9070894c7.

    This turned out to be contentious, and is past the date for contentious changes.
* Fix locale caseless matching and utf8 | Karl Williamson | 2011-02-19 | 1 | -16/+66
    As explained in the doc changes of this patch, under /l, caseless matching of code points less than 256 now uses locale rules regardless of the utf8ness of the pattern or string. They now match the behavior of things like \w, in this regard.
* regcomp.c: no sharp ss tricky fold under locale | Karl Williamson | 2011-02-19 | 1 | -2/+4
* regcomp.c: Fix some comments | Karl Williamson | 2011-02-19 | 1 | -13/+11
* regcomp.c: Silence win32 compiler warnings | Karl Williamson | 2011-02-15 | 1 | -2/+2
* Add /aa regex modifier | Karl Williamson | 2011-02-14 | 1 | -35/+142
    Tests for \N{} with this option will be added later.
* regcomp.c: Add cast. | Karl Williamson | 2011-02-14 | 1 | -1/+1
    I found this through gdb. Sign extension was happening.
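The commit message does not say which expression needed the cast, so the following C sketch only illustrates the general class of bug: widening a signed character value without first casting it to an unsigned type. The function names are hypothetical, and the expected values assume 8-bit chars.

```c
#include <assert.h>
#include <limits.h>

/* Widening a negative signed char directly: the value is sign-extended, so
 * the byte 0xDF (stored as -33) becomes a huge unsigned number. */
unsigned long sketch_widen_naive(signed char c)
{
    return (unsigned long)c;
}

/* Casting to unsigned char first keeps only the low 8 bits, which is
 * normally what code handling bytes or code points 0-255 wants. */
unsigned long sketch_widen_cast(signed char c)
{
    return (unsigned long)(unsigned char)c;
}
```

A value like this is easy to spot in a debugger because the widened variable shows all of its high bits set.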
* regcomp.c: Handle more cases of tricky fold chars | Karl Williamson | 2011-02-14 | 1 | -0/+61
    Certain characters are not placed in EXACTish nodes because of problems mostly with the optimizer. However, not all notations that generated those characters were caught. This catches all but those in \N{} constructs, which are coming later.

    This does not use FOLDCHAR, which doesn't know the difference between /d and /u; instead it uses ANYOFV, which does handle those cases already, at the expense of larger (in storage) regexes for these few characters. If this were deemed a problem, there would be some work involved in adding FOLDCHARU, and fixing the code where it doesn't work properly now.
* regex: Add comments | Karl Williamson | 2011-02-14 | 1 | -2/+4
* regcomp.c: Add comment | Karl Williamson | 2011-02-14 | 1 | -0/+2
* regcomp.c: simplify conditional | Karl Williamson | 2011-02-14 | 1 | -9/+5
    A previous commit removed some things, so this block can be rearranged.
* Add comments | Karl Williamson | 2011-02-14 | 1 | -0/+5
* regcomp.c: Remove special handling for U+00DF | Karl Williamson | 2011-02-14 | 1 | -26/+0
    The code elsewhere is now better equipped to handle this.
* regcomp.c: tell regexec more about multi-char folds | Karl Williamson | 2011-02-14 | 1 | -2/+24
    A multi-char fold that matches in the Latin1 range needs to have that fact communicated to regexec.
* regcomp.c: Synthetic start class should include ord >255 folds | Karl Williamson | 2011-02-14 | 1 | -0/+26
    Some characters above 255 fold to the < 256 range. These need to be in the synthetic start class so the optimizer won't reject them. This is temporary code which creates false positives, to be replaced by more precise matching later.
* regcomp.c: Be more precise about ANYOF matching flag | Karl Williamson | 2011-02-14 | 1 | -1/+1
    There are two flags for matching outside the ANYOF bitmap. Instead of setting both, set the corresponding one.
* regcomp.c: Put two static functions in embed.fnc | Karl Williamson | 2011-02-14 | 1 | -22/+26
* Update comment | Karl Williamson | 2011-02-14 | 1 | -4/+2
* regex: Deprecate \b{ and \B{ | Karl Williamson | 2011-02-12 | 1 | -0/+6
    This allows future use by Perl of these.
* regcomp.c: include { in unrecognized \q{ warning | Karl Williamson | 2011-02-12 | 1 | -2/+7
    The warning message about unrecognized regex escapes being passed through is changed to include any literal '{' following the 2-char escape. For example, "\q{" will include the { in the message as part of the escape.
* Fix up \cX for 5.14 | Karl Williamson | 2011-02-09 | 1 | -2/+2
    Throughout 5.13 there was temporary code to deprecate and forbid certain values of X following a \c in qq strings. This patch fixes this to the final 5.14 semantics. These are:

    1) a utf8 non-ASCII character will croak. This is the same behavior as pre-5.13, but it gives a correct error message, rather than the malformed utf8 message previously.
    2) \c{ and \cX where X is above ASCII will generate a deprecated message. The intent is to remove these capabilities in 5.16. The original agreement was to croak on above ASCII, but that does violate our stability policy, so I'm deprecating it instead.
    3) A non-deprecated warning is generated for all other \cX; this is the same as throughout the 5.13 series.

    I did not have the tuits to use \c{} as I had planned in 5.14, but \N{} can be used instead.
* regcomp: Add/subtract consts to match embed.fnc | Karl Williamson | 2011-02-06 | 1 | -1/+1
* Silence compile warnings before uni tables built | Karl Williamson | 2011-02-06 | 1 | -7/+17
    The recent move of Unicode folding to the compilation phase caused spurious warnings during the miniperl build phase of Perl itself before the Unicode tables get built. Before the tables are built, Perl is unable to know about the Unicode semantics (it has ASCII/Latin1 hard-coded in), but was still trying to access the tables. Now, it checks and if the tables aren't present uses just the hard-coded ASCII/Latin1 semantics.
* Two Safefree() changes to make -DPERL_POISON builds work again. | George Greer | 2011-02-06 | 1 | -1/+2
    The poison exposes a failure in t/op/magic:

        panic: corrupt saved stack index at - line 6.
        FAILED at test 7
* Don't redefine Perl API functions in ext/re. | Craig A. Berry | 2011-02-05 | 1 | -0/+4
* Move ANYOF folding from regexec to regcomp | Karl Williamson | 2011-02-02 | 1 | -25/+205
    This is for security as well as performance. It allows Unicode properties to not be matched case sensitively. As a result the swash inversion hash is converted from having utf8 keys to numeric, code point, keys.

    It also for the first time fixes the bug where /i doesn't work when a code point that is not at the end of a range in a bracketed character class has a multi-character fold.