path: root/regcomp.c
Columns: commit message, author, age, files, lines
* regcomp.c: Silence win32 compiler warningsKarl Williamson2011-02-151-2/+2
|
* Add /aa regex modifierKarl Williamson2011-02-141-35/+142
| | | | Tests for \N{} with this option will be added later.
* regcomp.c: Add cast.Karl Williamson2011-02-141-1/+1
| | | | I found this through gdb. Sign extension was happening.
* regcomp.c: Handle more cases of tricky fold charsKarl Williamson2011-02-141-0/+61
| | | | | | | | | | | | | Certain characters are not placed in EXACTish nodes because of problems mostly with the optimizer. However, not all notations that generated those characters were caught. This catches all but those in \N{} constructs; support for those is coming later. This does not use FOLDCHAR, which doesn't know the difference between /d and /u; instead it uses ANYOFV, which does handle those cases already, at the expense of larger (in storage) regexes for these few characters. If this were deemed a problem, there would be some work involved in adding FOLDCHARU, and fixing the code where it doesn't work properly now.
* regex: Add commentsKarl Williamson2011-02-141-2/+4
|
* regcomp.c: Add commentKarl Williamson2011-02-141-0/+2
|
* regcomp.c: simplify conditionalKarl Williamson2011-02-141-9/+5
| | | | A previous commit removed some things, so this block can be rearranged
* Add commentsKarl Williamson2011-02-141-0/+5
|
* regcomp.c: Remove special handling for U+00DFKarl Williamson2011-02-141-26/+0
| | | | The code elsewhere is now better equipped to handle this.
* regcomp.c: tell regexec more about multi-char foldsKarl Williamson2011-02-141-2/+24
| | | | | A multi-char fold that matches in the Latin1 range needs to have that fact communicated to regexec.
* regcomp.c: Synthetic start class should include ord >255 foldsKarl Williamson2011-02-141-0/+26
| | | | | | | | Some characters above 255 fold to the < 256 range. These need to be in the synthetic start class so the optimizer won't reject them. This is temporary code which creates false positives, to be replaced by more precise matching later.
* regcomp.c: Be more precise about ANYOF matching flagKarl Williamson2011-02-141-1/+1
| | | | | There are two flags for matching outside the ANYOF bitmap. Instead of setting both, set the corresponding one.
* regcomp.c: Put two static functions in embed.fncKarl Williamson2011-02-141-22/+26
|
* Update commentKarl Williamson2011-02-141-4/+2
|
* regex: Deprecate \b{ and \B{Karl Williamson2011-02-121-0/+6
| | | | This allows future use by Perl of these
* regcomp.c: include { in unrecognized \q{ warningKarl Williamson2011-02-121-2/+7
| | | | | | The warning message about regex unrecognized escapes passed through is changed to include any literal '{' following the 2-char escape. e.g., "\q{" will include the { in the message as part of the escape.
* Fix up \cX for 5.14Karl Williamson2011-02-091-2/+2
| | | | | | | | | | | | | | | | | | | | | | | Throughout 5.13 there was temporary code to deprecate and forbid certain values of X following a \c in qq strings. This patch fixes this to the final 5.14 semantics. These are: 1) a utf8 non-ASCII character will croak. This is the same behavior as pre-5.13, but it gives a correct error message, rather than the malformed utf8 message previously. 2) \c{ and \cX where X is above ASCII will generate a deprecated message. The intent is to remove these capabilities in 5.16. The original agreement was to croak on above ASCII, but that does violate our stability policy, so I'm deprecating it instead. 3) A non-deprecated warning is generated for all other \cX; this is the same as throughout the 5.13 series. I did not have the tuits to use \c{} as I had planned in 5.14, but \N{} can be used instead.
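As background for the \cX rules above, here is a minimal C sketch of the conventional ASCII mapping for control-character escapes: uppercase X, then flip bit 0x40. The helper name `ctrl_char` is hypothetical and this is not perl's actual implementation; it only illustrates the arithmetic.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical helper (not from the perl source): map the X of "\cX" to a
 * control character using the usual ASCII convention of uppercasing X and
 * then toggling bit 0x40. */
unsigned char ctrl_char(unsigned char x)
{
    assert(x < 128);                          /* only ASCII input makes sense here */
    return (unsigned char)(toupper(x) ^ 64);  /* 'A' (0x41) -> 0x01, '?' -> 0x7F */
}
```

Under this mapping `ctrl_char('{')` yields `';'`, a printable character rather than a control, which suggests why \c{ is singled out in the message above.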
* regcomp: Add/subtract consts to match embed.fncKarl Williamson2011-02-061-1/+1
|
* Silence compile warnings before uni tables builtKarl Williamson2011-02-061-7/+17
| | | | | | | | | | The recent move of Unicode folding to the compilation phase caused spurious warnings during the miniperl build phase of Perl itself before the Unicode tables get built. Before the tables are built, Perl is unable to know about the Unicode semantics (it has ASCII/Latin1 hard-coded in), but was still trying to access the tables. Now, it checks and if the tables aren't present uses just the hard-coded ASCII/Latin1 semantics.
* Two Safefree() changes to make -DPERL_POISON builds work again.George Greer2011-02-061-1/+2
| | | | | | | The poison exposes a failure in t/op/magic: panic: corrupt saved stack index at - line 6. FAILED at test 7
* Don't redefine Perl API functions in ext/re.Craig A. Berry2011-02-051-0/+4
|
* Move ANYOF folding from regexec to regcompKarl Williamson2011-02-021-25/+205
| | | | | | | | | | This is for security as well as performance. It allows Unicode properties to not be matched case sensitively. As a result the swash inversion hash is converted from having utf8 keys to numeric, code point, keys. It also fixes, for the first time, the bug where /i doesn't work when a code point that is not at the end of a range in a bracketed character class has a multi-character fold.
* Add initial inversion list objectKarl Williamson2011-02-021-0/+588
| | | | | | | | | | | | | | | Going forward the intent is to convert from swashes to the better-suited inversion list data structure. This adds rudimentary inversion lists that have only the functionality needed for 5.14. As a result, they are as much as possible static to one file. What's necessary for 5.14 is enough to allow folding of ANYOF nodes to be moved from regexec to regcomp. Why they are needed for that is to generate as compact as possible class definitions; otherwise, very long linear lists might be generated. (They still may be, but that's inherent in the problem domain; this generates as compact as possible, combining overlapping ranges, etc.) The only two non-trivial methods in this object are from published algorithms.
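For readers unfamiliar with the data structure, an inversion list is a sorted array of code points in which each element marks where membership flips: even indices start a range that is in the set, odd indices start a range that is not. The following is a minimal sketch of a membership test, assuming a plain `uint32_t` array; the function name `invlist_contains` is hypothetical and this is not the code added by the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: return 1 if cp is in the set described by the
 * inversion list `list` (sorted ascending, `len` elements), else 0.
 * list[0] starts the first included range, list[1] the first excluded
 * range, and so on alternately. */
int invlist_contains(const uint32_t *list, size_t len, uint32_t cp)
{
    size_t lo = 0, hi = len;

    assert(list != NULL || len == 0);

    /* Binary search for the number of elements that are <= cp. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (list[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* An odd count means the last flip at or before cp turned the set on. */
    return (int)(lo & 1);
}
```

With the list `{0x30, 0x3A, 0x41, 0x5B}` this describes the digits 0-9 and the letters A-Z in four array elements, however long the ranges are, which is the compactness the message refers to.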
* regcomp.c: Generate different property for /i matchingKarl Williamson2011-02-021-2/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch causes regcomp.c to generate a different property name under /i than not. utf8_heavy.pl will later resolve whether this is to match the same under /i or not, based on the data structure generated by mktables. This is part of moving non-locale folding into regcomp from regexec. The reasons are primarily security, but this had been planned at some point anyway for performance. It was not until a 5.13.X build that fixed the regexec code that the case-insensitive matching mostly worked. With that change, things like /\p{ASCII_Hex_Digit}+/i would match non-ASCII characters, such as LATIN SMALL LIGATURE FF, and almost certainly that would not be the expectation of the coder. The Unicode Standard is silent on the matter, but as of this writing, it appears that they will act to recommend against caseless matching of properties; I get the sense that they would never have thought someone would think to do it, but Perl has. I ran some experiments, and actually very few properties have differences under caseless matching anyway. I have submitted a proposal to them that says that, but suggests that certain properties can be grandfathered-in. Perl users have come to expect that /\p{Uppercase}/i would match lower case letters, and have written bug reports that it doesn't, until 5.13.X fixed them, but in addition added the unintended wrinkle from the example above. The design is for mktables to generate tables for /i matching for the few properties that have differences, and to create a hash mapping the standard table to the /i table, which is read by utf8_heavy.pl. regcomp.c munges the names of all properties under /i to be __foo_i. The two initial underscores make sure there is no conflict with existing single underscore initial tables. utf8_heavy strips these off, and computes the table as normal from the remaining unmunged name. 
At the last moment, it looks up that name in the list of those that have /i tables, and substitutes if found. This completely hides all this from the swash mechanism and regexec.c. This can't be completely hidden from user-defined properties. Now, a boolean will be passed to those subroutines indicating if /i is in effect or not. They are free to ignore it, but they can return a different set of code points depending on its value. They will be called once for each type, and the results cached by the normal swash mechanism, which thinks these are two different properties.
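The name munging described above (foo becomes __foo_i, and the decoration is later stripped again) can be sketched in C as follows. The function names `munge_for_fold` and `unmunge` are hypothetical; in perl the stripping side lives in utf8_heavy.pl, so this is only an illustration of the naming scheme, not of the real code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: write "__<name>_i" into out.
 * Returns 1 on success, 0 if the result does not fit in outsize bytes. */
int munge_for_fold(const char *name, char *out, size_t outsize)
{
    int n;

    assert(name != NULL && out != NULL);
    n = snprintf(out, outsize, "__%s_i", name);
    return n >= 0 && (size_t)n < outsize;
}

/* Hypothetical sketch: strip a leading "__" and trailing "_i".
 * Returns 1 and fills out on success, 0 if munged is not decorated
 * that way or the result does not fit. */
int unmunge(const char *munged, char *out, size_t outsize)
{
    size_t len;

    assert(munged != NULL && out != NULL);
    len = strlen(munged);
    if (len < 4 || strncmp(munged, "__", 2) != 0
        || strcmp(munged + len - 2, "_i") != 0)
        return 0;
    if (len - 4 + 1 > outsize)
        return 0;
    memcpy(out, munged + 2, len - 4);
    out[len - 4] = '\0';
    return 1;
}
```

Because the munged and unmunged names are distinct strings, any cache keyed by name treats them as two separate properties, which matches the behaviour the message describes for the swash mechanism.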
* regcomp.c: cl_and() fixKarl Williamson2011-02-021-3/+7
| | | | | | If ANDing two nodes together and they both have UNICODE_ALL set, the result should also. I don't have a test case for this, but the bug is exposed by some commits soon to come in a test case in pat_advanced.t for cl_and.
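The rule stated above, that the AND of two classes keeps UNICODE_ALL only when both inputs have it, can be sketched as below. The struct layout, the flag value, and the name `class_and` are assumptions made for illustration; they are not the definitions used in regcomp.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FLAG_UNICODE_ALL 0x01u   /* hypothetical flag value */

/* Hypothetical class node: a 256-bit bitmap for code points 0-255 plus
 * a flags word describing what is matched above the bitmap. */
typedef struct {
    uint8_t  bitmap[32];
    unsigned flags;
} char_class;

/* AND src into dst: intersect the bitmaps, and keep FLAG_UNICODE_ALL
 * only if both operands had it set. */
void class_and(char_class *dst, const char_class *src)
{
    size_t i;

    assert(dst != NULL && src != NULL);
    for (i = 0; i < sizeof dst->bitmap; i++)
        dst->bitmap[i] &= src->bitmap[i];
    if (!(src->flags & FLAG_UNICODE_ALL))
        dst->flags &= ~FLAG_UNICODE_ALL;
}
```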
* regcomp: Add warning if tries to use \p in locale.Karl Williamson2011-01-271-1/+8
| | | | | \p implies Unicode matching rules, which are likely going to be different than the locale's.
* regex: \p{} in pattern implies Unicode semanticsKarl Williamson2011-01-271-6/+23
| | | | Now, a Unicode property match specified in the pattern will indicate that the pattern is meant for matching according to Unicode rules
* fix harmless invalid read in Perl_re_compile()David Mitchell2011-01-241-5/+5
| | | | | | | | | | | | | | | [perl #2460] described a case where electric fence reported an invalid read. This could be reproduced under valgrind with blead and -e'/x/', but only on a non-debugging build. This was because it was checking for certain pairs of nodes (e.g. BOL + END) and wasn't allowing for EXACT nodes, which have the string at the next node position when using a naive NEXTOPER(first). In the non-debugging build, the nodes aren't initialised to zero, and a 1-char EXACT node isn't long enough to spill into the type field of the "next node". Fix this by only using NEXTOPER(first) when we know the first node is kosher.
* regcomp.c: Refactor macro so works on EBCDIC, clarityKarl Williamson2011-01-201-22/+23
| | | | | | | | | | _C_C_T was being passed both a test and the value to put the test on, whereas the value is actually set up inside the macro, which is not a clean interface, and would not work on EBCDIC, as the loop bounds are for ASCII. By just passing the test name, the macro can generate code that will work also on EBCDIC. Unfortunately, the same can't happen for the NOLOC version of the macro, as it needs to look at more than a single byte, so needs an address.
* regcomp: Disallow multi-char folds in lookbehindKarl Williamson2011-01-181-7/+17
| | | | | | | | | | | The addition of the ANYOFV regnode to treat multi-char folds in a bracketed character class has exposed a bug, in which those classes have long been able to be varying length (due to the multi-char fold), but the compiler wasn't aware of it. Now it is, and hence won't allow those which have multi-char folds to be part of a lookbehind pattern, which requires a constant length. This patch disallows multi-char folds in a lookbehind bracketed character class.
* Add /a regex modifierKarl Williamson2011-01-171-12/+61
| | | | | This restricts certain constructs, like \w, to matching in the ASCII range only.
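To make the effect of restricting \w to ASCII concrete, the sketch below contrasts an ASCII-only word-character test with one that also accepts the Latin-1 letters that Unicode rules treat as word characters. Both function names are hypothetical, and the second covers only code points below 256; neither is taken from the perl source.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical: word character under ASCII-only rules, i.e. [A-Za-z0-9_]. */
int is_word_ascii(unsigned int cp)
{
    return cp < 128 && (isalnum((int)cp) != 0 || cp == '_');
}

/* Hypothetical: word character under Unicode rules, limited to the
 * Latin-1 range. Above ASCII these are U+00AA, U+00B5, U+00BA and
 * U+00C0..U+00FF except the multiplication and division signs. */
int is_word_latin1(unsigned int cp)
{
    assert(cp < 256);
    if (cp < 128)
        return is_word_ascii(cp);
    return cp == 0xAA || cp == 0xB5 || cp == 0xBA
        || (cp >= 0xC0 && cp != 0xD7 && cp != 0xF7);
}
```

For example, U+00E9 (e with acute) passes `is_word_latin1` but not `is_word_ascii`, which is the kind of difference the modifier is meant to control.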
* regcomp.c: Convert \d \D to a switch{}Karl Williamson2011-01-171-8/+22
|
* regcomp.c: Clarify commentKarl Williamson2011-01-161-1/+1
|
* regex: Use BOUNDU regnodesKarl Williamson2011-01-161-8/+26
| | | | | This refactors one area in regexec.c to use BOUNDU, NBOUNDU for efficiency, and easier adding of the future BOUNDA.
* regex: Separate nodes for Unicode semantics \s \wKarl Williamson2011-01-161-28/+66
| | | | | | | | | | | | | | | | | This patch converts the Unicode semantics of \s, \w and their complements from using the flags field of their nodes to using separate nodes. This gains some efficiency, especially useful in tight loops and backtracking of regexec.c, and prepares the way for easily adding other semantic variations, such as /a. It refactors the CCC_TRY... macros. I tried to break this piece up into smaller chunks, but found it much easier to get to this in one step. Further patches will do some more refactoring of these. As part of the CCC_TRY macro refactoring, the lines that include the test if (! nextchr) are changed to just look for the end-of-string by position instead of it being NUL. In locales, it could be (however unlikely), that NUL is a real alphabetic, digit, or space character.
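The last point, detecting end-of-string by position rather than by a NUL byte, can be illustrated with two small scanners. Everything here is a hypothetical sketch: `class_pred`, `accepts_any_byte`, and the two `run_len_*` functions are invented names, and `accepts_any_byte` stands in for a locale in which NUL happens to belong to the class.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical predicate type: does this byte belong to the class? */
typedef int (*class_pred)(unsigned char);

/* Stand-in for a locale whose class includes every byte, NUL too. */
int accepts_any_byte(unsigned char c)
{
    (void)c;
    return 1;
}

/* Length of the leading run of class members, stopping at the first NUL.
 * A NUL that is itself a class member wrongly ends the run. */
size_t run_len_by_nul(const char *s, class_pred in_class)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != '\0' && in_class((unsigned char)s[n]))
        n++;
    return n;
}

/* Same, but the end of the string is given by position, so a NUL byte
 * is tested against the class like any other byte. */
size_t run_len_by_pos(const char *s, size_t len, class_pred in_class)
{
    size_t n = 0;

    assert(s != NULL || len == 0);
    while (n < len && in_class((unsigned char)s[n]))
        n++;
    return n;
}
```

On the five bytes `a b NUL c d`, the position-based scanner reports a run of 5 while the NUL-based one stops at 2.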
* regcomp.c: add missing code for optimizer for \WKarl Williamson2011-01-161-3/+13
| | | | | | | | | | | The code here was asymmetrical. It did not account for Unicode semantics when ORing \W. For \w, \s, and \S it does. This patch changes the code to be symmetrical. I spent a couple hours trying to come up with a test, but could not get this area of the code to execute, which may explain why there has not been a field report of it. It may be that it is unreachable; there has been other code in the routine that wasn't.
* regcomp.c: remove unreached codeKarl Williamson2011-01-161-44/+0
| | | | | | This code can never be reached, as the switch statement switches on the regkind of the op, not the op itself; and the kind of all the locale regnodes is the base regnode itself. For example regkind[ALNUML] is ALNUM.
* regex: Convert regnode FLAGS fields to charset enumKarl Williamson2011-01-161-13/+13
| | | | | | | | | The FLAGS fields of certain regnodes were encoded with USE_UNI if unicode semantics were to be used. This patch instead sets them to the character set used, one of the possibilities of which is to use unicode semantics. This shortens the code somewhat, and always puts the character set in the flags field, which can allow use of switch statements on it for efficiency, especially as new values are added.
* Change name of /d to DEPENDSKarl Williamson2011-01-161-2/+2
| | | | | | | I much prefer David Golden's name for /d whose meaning 'depends' on circumstances, instead of 'dual' meaning it could be one or another. Change it before this gets out in a stable release, and we're stuck with the old name.
* Use multi-bit field for regex character setKarl Williamson2011-01-161-23/+43
| | | | | | | | | | | | | The /d, /l, and /u regex modifiers are mutually exclusive. This patch changes the field that stores the character set to use more than one bit with an enum determining which one. This data structure more closely follows the semantics of their being mutually exclusive, and conserves bits as well, and is better expandable. A small API is added to set and query the bit field. This patch is not .xs source backwards compatible. A handful of cpan programs are affected.
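A multi-bit field with a small set/query API, as described above, might look like the following sketch. The enum names, the shift, and the mask are assumptions chosen for illustration and do not reflect the actual macros or bit positions in perl's headers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical character-set values; they are mutually exclusive, so
 * one small integer field represents them instead of one bit each. */
typedef enum {
    CHARSET_DEPENDS = 0,   /* /d */
    CHARSET_LOCALE  = 1,   /* /l */
    CHARSET_UNICODE = 2    /* /u */
} charset_t;

#define CHARSET_SHIFT 5u                       /* assumed field position */
#define CHARSET_MASK  (0x3u << CHARSET_SHIFT)  /* two bits: room for four values */

/* Return flags with the character-set field replaced by cs. */
uint32_t set_charset(uint32_t flags, charset_t cs)
{
    assert((unsigned)cs <= 3u);
    return (flags & ~CHARSET_MASK) | ((uint32_t)cs << CHARSET_SHIFT);
}

/* Extract the character set stored in flags. */
charset_t get_charset(uint32_t flags)
{
    return (charset_t)((flags & CHARSET_MASK) >> CHARSET_SHIFT);
}
```

Two bits hold up to four values, so a later addition fits without claiming another flag bit, and callers can `switch` on `get_charset(flags)` instead of testing individual bits.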
* White space, comment only changeKarl Williamson2011-01-161-50/+51
|
* Fix \xa0 matching both [\s] [\S], et.al.Karl Williamson2011-01-161-5/+15
| | | | | | | | | | | | This bug stemmed from Latin1 characters not matching any (non-complemented) character class in /d semantics when the target string is not utf8; but having unicode semantics when it is. The solution here is to add a special flag. There were several tests that relied on the broken behavior, specifically they tested that \xff isn't a printable word character even in utf8. I changed the deparse test to instead use a non-printable code point, and I changed the ones in re_tests to be TODOs, and will change them back using /a when that is shortly added.
* regcomp: Share two bits in ANYOF flagsKarl Williamson2011-01-161-0/+6
| | | | | | It turns out that the INVERT and EOS flags can actually share the same bit, as explained in the comments, freeing up a bit for other uses. No code changes need be made.
* regex: Use ANYOFVKarl Williamson2011-01-131-22/+55
| | | | | | | | | | | | | | | | | | | | This patch restructures the regex ANYOF code to generate ANYOFV nodes instead when there is a possibility that it could match more than one character. Note that this doesn't affect the optimizer, as it essentially ignores things that fit into this category. (But it means that the optimizer will no longer reject these when it shouldn't have.) The handling of the LATIN SHARP s is modified to correspond with this new node type. The initial handling of ANYOFV is placed in regexec.c. More analysis will come on that. But there was significant change to the part that handles matching multi-char strings. This has long been buggy, with it previously comparing a folded-version on one side with a non-folded version on the other. This patch fixes about 60% of the problems that my undelivered test suite gives for multi-char folds. But there are still 17K test failures left, so I'm still not delivering that. The TODOs that this fixes will be cleaned up in a later commit
* regex: Some Comment clarificationsKarl Williamson2011-01-131-3/+7
|
* Fix typos (spelling errors) in Perl sources.Peter J. Acklam) (via RT2011-01-071-19/+19
| | | | | | | | | # New Ticket Created by (Peter J. Acklam) # Please include the string: [perl #81904] # in the subject line of all future correspondence about this issue. # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=81904 > Signed-off-by: Abigail <abigail@abigail.be>
* Change name of regex intrnl macro to new meaningKarl Williamson2010-12-201-17/+17
| | | | | | | | | | ANYOF_FOLD is now used under fewer conditions. Otherwise the bitmap of characters 0-255 is fully calculated with the folds, and the flag is not set. One condition is under locale, where the folds aren't known at compile time; the other is for things accessible through a swash. By changing the name to its new meaning, certain optimizations become more obvious.
* Change regexes to debug dump non-ASCII as hex.Karl Williamson2010-12-191-0/+1
| | | | | | instead of the less familiar octal for larger values. Perhaps they should actually print the actual character, but this is far easier to understand than the previous output.
* Silence some data truncation compiler warningsJan Dubois2010-12-161-3/+3
|
* regcomp.c: Optimize [cC] char class to EXACTFKarl Williamson2010-12-151-20/+46
| | | | | | | | | | | A two character character class where the two elements are the folds of each other can be optimized into an EXACTF regnode. This should not change the speed of execution appreciably, except that EXACTF regnodes are candidates for further optimization by combining with adjacent nodes of the same type. This commit brings the optimization level up to somewhat better than when I started mucking around with ANYOF nodes.
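The condition for this optimization, a class of exactly two members that are case folds of each other, can be sketched for ASCII as below. The function names are hypothetical, only ASCII folding via `tolower` is considered, and the real check in regcomp.c operates on regnodes rather than a byte array.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical: are a and b distinct ASCII characters that fold to the
 * same lowercase letter, like 'c' and 'C'? */
int is_ascii_fold_pair(unsigned char a, unsigned char b)
{
    return a < 128 && b < 128 && a != b && tolower(a) == tolower(b);
}

/* Hypothetical: if the class has exactly two members that are folds of
 * each other, store the folded character in *folded and return 1, so
 * the caller could emit a case-insensitive literal instead of a class.
 * Otherwise return 0 and leave *folded untouched. */
int class_as_folded_literal(const unsigned char *members, size_t n,
                            unsigned char *folded)
{
    assert(members != NULL && folded != NULL);
    if (n != 2 || !is_ascii_fold_pair(members[0], members[1]))
        return 0;
    *folded = (unsigned char)tolower(members[0]);
    return 1;
}
```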