path: root/regcomp.c
Commit message | Author | Age | Files | Lines
* Use multi-bit field for regex character set | Karl Williamson | 2011-01-16 | 1 | -23/+43
  The /d, /l, and /u regex modifiers are mutually exclusive. This patch changes the field that stores the character set to use more than one bit, with an enum determining which one is in effect. This data structure more closely follows the semantics of their being mutually exclusive, conserves bits, and is more easily expanded. A small API is added to set and query the bit field. This patch is not .xs source backwards compatible; a handful of CPAN programs are affected.
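The idea of storing mutually exclusive modifiers in one multi-bit field can be sketched as below. This is only an illustration: the enum, the shift value, and the accessor names are hypothetical and are not the identifiers used in the Perl source. Three one-bit flags would need three bits, while a two-bit field holds the same three states and leaves a fourth value free.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; not the identifiers used in perl itself. */
typedef enum {
    CHARSET_DEPENDS = 0,  /* /d */
    CHARSET_LOCALE  = 1,  /* /l */
    CHARSET_UNICODE = 2   /* /u */
    /* value 3 remains free for a future modifier */
} regex_charset;

/* Assumed position of the two-bit field inside a 32-bit flags word. */
#define CHARSET_SHIFT 5u
#define CHARSET_MASK  (UINT32_C(3) << CHARSET_SHIFT)

/* Query the character set stored in the flags word. */
static inline regex_charset get_charset(uint32_t flags)
{
    return (regex_charset)((flags & CHARSET_MASK) >> CHARSET_SHIFT);
}

/* Return flags with the character set replaced; other bits are untouched. */
static inline uint32_t set_charset(uint32_t flags, regex_charset cs)
{
    assert((unsigned)cs <= 3u);
    return (flags & ~CHARSET_MASK) | ((uint32_t)cs << CHARSET_SHIFT);
}
```

Because callers go through the accessors, two character sets can never be recorded at once, which is not guaranteed when each modifier has its own flag bit.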
* White space, comment only change | Karl Williamson | 2011-01-16 | 1 | -50/+51
* Fix \xa0 matching both [\s] [\S], et al. | Karl Williamson | 2011-01-16 | 1 | -5/+15
  This bug stemmed from Latin1 characters not matching any (non-complemented) character class in /d semantics when the target string is not utf8, but having Unicode semantics when it is. The solution here is to add a special flag. There were several tests that relied on the broken behavior; specifically, they tested that \xff isn't a printable word character even in utf8. I changed the deparse test to instead use a non-printable code point, and I changed the ones in re_tests to be TODOs, and will change them back using /a when that is shortly added.
* regcomp: Share two bits in ANYOF flags | Karl Williamson | 2011-01-16 | 1 | -0/+6
  It turns out that the INVERT and EOS flags can actually share the same bit, as explained in the comments, freeing up a bit for other uses. No code changes need be made.
* regex: Use ANYOFV | Karl Williamson | 2011-01-13 | 1 | -22/+55
  This patch restructures the regex ANYOF code to generate ANYOFV nodes instead when there is a possibility that it could match more than one character. Note that this doesn't affect the optimizer, as it essentially ignores things that fit into this category. (But it means that the optimizer will no longer reject these when it shouldn't have.) The handling of the LATIN SHARP s is modified to correspond with this new node type. The initial handling of ANYOFV is placed in regexec.c. More analysis will come on that. But there was significant change to the part that handles matching multi-char strings. This has long been buggy, with it previously comparing a folded version on one side with a non-folded version on the other. This patch fixes about 60% of the problems that my undelivered test suite gives for multi-char folds. But there are still 17K test failures left, so I'm still not delivering that. The TODOs that this fixes will be cleaned up in a later commit.
* regex: Some Comment clarifications | Karl Williamson | 2011-01-13 | 1 | -3/+7
* Fix typos (spelling errors) in Perl sources. | Peter J. Acklam (via RT) | 2011-01-07 | 1 | -19/+19
  # New Ticket Created by (Peter J. Acklam)
  # Please include the string: [perl #81904]
  # in the subject line of all future correspondence about this issue.
  # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=81904 >
  Signed-off-by: Abigail <abigail@abigail.be>
* Change name of regex internal macro to new meaning | Karl Williamson | 2010-12-20 | 1 | -17/+17
  ANYOF_FOLD is now used under fewer conditions. Otherwise the bitmap of characters 0-255 is fully calculated with the folds, and the flag is not set. One condition is under locale, where the folds aren't known at compile time; the other is for things accessible through a swash. By changing the name to its new meaning, certain optimizations become more obvious.
* Change regexes to debug dump non-ASCII as hex. | Karl Williamson | 2010-12-19 | 1 | -0/+1
  instead of the less familiar octal for larger values. Perhaps they should actually print the actual character, but this is far easier to understand than the previous output.
* Silence some data truncation compiler warnings | Jan Dubois | 2010-12-16 | 1 | -3/+3
* regcomp.c: Optimize [cC] char class to EXACTF | Karl Williamson | 2010-12-15 | 1 | -20/+46
  A two-character character class where the two elements are the folds of each other can be optimized into an EXACTF regnode. This should not change the speed of execution appreciably, except that EXACTF regnodes are candidates for further optimization by combining with adjacent nodes of the same type. This commit brings the optimization level up to somewhat better than when I started mucking around with ANYOF nodes.
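The check behind this kind of optimization might look roughly like the following. It is a simplified sketch that assumes an ASCII platform and ASCII-only case folds; the function name and the 32-byte bitmap layout are hypothetical, and the real code also has to consider Latin1 and multi-character folds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: if the 256-bit class bitmap holds exactly two
 * characters that are ASCII case folds of each other (such as [cC]),
 * return the lowercase member; otherwise return -1.  A caller could then
 * emit a single case-insensitive literal node instead of a class node. */
static int two_char_fold_class(const unsigned char bm[32])
{
    int first = -1, second = -1, count = 0;

    assert(bm != NULL);
    for (int c = 0; c < 256; c++) {
        if ((bm[c >> 3] >> (c & 7)) & 1) {
            if (count == 0)
                first = c;
            else if (count == 1)
                second = c;
            count++;
        }
    }
    if (count != 2)
        return -1;

    /* The scan is ascending, so first < second.  In ASCII the uppercase
     * letter comes first and its lowercase fold is 0x20 higher. */
    if (first >= 'A' && first <= 'Z' && second == first + 0x20)
        return second;
    return -1;
}
```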
* regcomp.c: More work on ANYOF_CLASS | Karl Williamson | 2010-12-15 | 1 | -3/+4
  I overlooked two cases in a previous commit where it would be advisable to make changes in case the ANYOF_CLASS bit needs to be combined with ANYOF_LOCALE.
* regcomp.c: Fix VC6 compiler warnings | Karl Williamson | 2010-12-14 | 1 | -14/+14
  The number passed here is never larger than 255.
* blead breaks Attribute::Constant | Karl Williamson | 2010-12-12 | 1 | -2/+2
  The problem is that I confused FOLD with ANYOF_FOLD, and as a result, emitted a locale regnode, which is tainted. Any tests that required non-tainting started failing.
* regcomp.c: Clean up optimization for 1-char [] | Karl Williamson | 2010-12-11 | 1 | -16/+47
  A single-character character class can be optimized into an EXACT node. The changes elsewhere allow this to no longer be constrained to ASCII-only when the pattern isn't UTF-8. Also, the optimization shouldn't have happened for FOLDED characters, as explained in the comments, when they participate in multi-char folds; so that is removed. Also, a locale node with folded characters can be optimized.
* regcomp: Allow freeing up bit in ANYOF flags | Karl Williamson | 2010-12-11 | 1 | -5/+17
  The flags field is fully used, and until the ANYOF node is split later in development, the CLASS bit will need to be freed up to give the space for other uses. This patch allows for this to easily be toggled.
* regcomp.c: Move [] inversion optimization | Karl Williamson | 2010-12-11 | 1 | -13/+14
  The optimization to do inversion at compile time is moved to earlier. This doesn't help today, but it may someday when we start keeping better track of Unicode characters, and is the more logical place for it.
* regcomp.c: When inverting a [], adjust bit count | Karl Williamson | 2010-12-11 | 1 | -0/+3
  When one complements every bit, the count of those that are set should be complemented as well.
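As a rough illustration of keeping the count in step with the bitmap, here is a minimal sketch. The function name and the 32-byte, 256-bit layout are assumptions made for the example rather than the actual regcomp.c code.

```c
#include <assert.h>
#include <stddef.h>

#define BITMAP_BYTES 32                  /* one bit per code point 0..255 */
#define BITMAP_BITS  (BITMAP_BYTES * 8)

/* Hypothetical helper: complement every bit of the bitmap in place and
 * return the new number of set bits.  set_count must be the number of
 * bits that were set before the call. */
static unsigned invert_bitmap(unsigned char bitmap[BITMAP_BYTES],
                              unsigned set_count)
{
    assert(set_count <= BITMAP_BITS);
    for (size_t i = 0; i < BITMAP_BYTES; i++)
        bitmap[i] ^= 0xFF;               /* flip all eight bits of the byte */
    /* Every bit that was set is now clear and vice versa. */
    return BITMAP_BITS - set_count;
}
```

Deriving the new count from the old one avoids a second pass over the bitmap just to recount the bits.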
* Subject: [PATCH] regcomp.c: adjust flag | Karl Williamson | 2010-12-11 | 1 | -1/+1
  When something matches above Latin1, it should have the ANYOF_UTF8 bit set.
* regcomp.c: Change constants for clarity. | Karl Williamson | 2010-12-11 | 1 | -1/+1
  Oddly, it is clearer to use 0xFF as an exclusive-or target instead of an unrelated #define that happens to have that value.
* regcomp.c: fix indent | Karl Williamson | 2010-12-11 | 1 | -1/+1
* regcomp.c: remove no longer needed test | Karl Williamson | 2010-12-11 | 1 | -7/+6
  optimize_invert is no longer needed given the changes already made, as now if there is something not in the bitmap, a flag will be set, and the optimization doesn't take place unless the only flag is inversion. And the bitmap is now set up completely for anything that doesn't have to be deferred to runtime, and such deferrals are marked with other flags.
* regcomp.c: isASCII doesn't match outside ANYOF bitmap | Karl Williamson | 2010-12-11 | 1 | -5/+3
  So there is no need to tell regexec that it does, and two other statements can then be combined.
* regcomp.c: Fix compiler warning | Karl Williamson | 2010-12-11 | 1 | -1/+1
  One smoke is warning about truncated results. This should fix that. It may be that other compilers will now complain, and we'll need to add casts, but I'm waiting to see.
* regcomp.c: Remove no longer necessary loop | Karl Williamson | 2010-12-11 | 1 | -16/+6
  Recent changes to this cause the bitmap to be populated where possible with all the folding taken into consideration. Therefore, the FOLD flag isn't necessary except when that wasn't possible, and the loop that went through looking for folds is also not necessary, as the bitmap is now completely populated before it gets to where the loop was.
* regcomp.c: remove unnecessary counting | Karl Williamson | 2010-12-11 | 1 | -4/+1
  'stored' now contains the number of 1 bits in the ANYOF node, and no longer needs to be set arbitrarily. Part of this is because there is now a flag if there is any match outside the bitmap, which prohibits optimization if so.
* regcomp.c: clarify comment | Karl Williamson | 2010-12-11 | 1 | -2/+1
* regcomp.c: Add locale for \d | Karl Williamson | 2010-12-07 | 1 | -2/+10
  The DIGITL and NDIGITL regnodes were not being generated; instead regular DIGIT and NDIGIT regnodes were generated even under locale. This means that probably no one has ever used Perl on a locale that changed the digits.
* regcomp.c: Revert to using regcomp.sym order | Karl Williamson | 2010-12-07 | 1 | -1/+1
  Now that the new nodes are grouped properly, we can use the fact that the named backreferences all come after all the numbered backreferences, as had been there before.
* regcomp.c: Fix longjmp-related warnings | Karl Williamson | 2010-12-06 | 1 | -17/+21
  This patch should get rid of the compiler warnings recently introduced. Another way to handle the pm_flags warning is to declare it to be volatile, but apparently not all compilers that perl uses have that, so I chose the longer way of introducing a new variable that isn't changed within the jumpable area. The others were fixed by not initializing them before the jumpable area.
* regcomp.c: small efficiency, portability fix | Karl Williamson | 2010-12-04 | 1 | -6/+3
  The code had hard-coded into it the ASCII platform values for possible start bytes. There are macros that do that portably with no branches.
* regcomp.c: small efficiency improvement | Karl Williamson | 2010-12-04 | 1 | -4/+1
  The inline function repeats the test removed here.
* regcomp.c: small efficiency gain | Karl Williamson | 2010-12-04 | 1 | -19/+8
  The 7-bit test operations always fail on non-ASCII characters, so there is no need to test each such character individually. The loops that do that and then set a bit for each character can therefore stop at 127 instead of 255 (the bitmaps are initialized to all zeros). For EBCDIC, the same applies, except that we have to map those 7-bit characters to the 8-bit EBCDIC range. This creates an extra array lookup for each EBCDIC character, but half as many times through the loop. For the complement of the 7-bit operations, we know that they will all be set for the non-ASCII characters. Therefore, we don't need to test; we can just unconditionally set those bits. It would not be a good idea to just do a memset on that range under /i, as each character that gets chosen may have its fold added as well and that has to be looked up individually.
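The loop structure described above can be sketched as follows for an ASCII platform with no case folding. All names here are hypothetical, the predicate is a stand-in for the 7-bit class tests, and the EBCDIC mapping and the /i fold lookups mentioned in the message are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

#define BITMAP_BYTES 32   /* one bit per code point 0..255 */

static void bitmap_set(unsigned char *bm, unsigned c)
{
    assert(c < 256);
    bm[c >> 3] |= (unsigned char)(1u << (c & 7));
}

static int bitmap_test(const unsigned char *bm, unsigned c)
{
    return (bm[c >> 3] >> (c & 7)) & 1;
}

/* Stand-in for a 7-bit (ASCII-only) class test. */
static int is_ascii_digit(unsigned c)
{
    return c >= '0' && c <= '9';
}

/* Positive class: nothing above 127 can match, so the loop stops there.
 * The bitmap is assumed to start out all zeros. */
static void add_ascii_class(unsigned char *bm, int (*pred)(unsigned))
{
    for (unsigned c = 0; c < 128; c++)
        if (pred(c))
            bitmap_set(bm, c);
}

/* Complemented class: 0..127 are tested, 128..255 always match, so
 * those bits are set without calling the predicate at all. */
static void add_ascii_class_complement(unsigned char *bm,
                                       int (*pred)(unsigned))
{
    for (unsigned c = 0; c < 128; c++)
        if (!pred(c))
            bitmap_set(bm, c);
    for (unsigned c = 128; c < 256; c++)
        bitmap_set(bm, c);
}
```

The upper range is filled with a per-character loop rather than a block fill so that a fold lookup could be added inside it, which is the concern the message raises about /i.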
* regcomp, regexec: Use mnemonic character names | Karl Williamson | 2010-12-04 | 1 | -13/+21
  This patch replaces hex ordinals by macros containing the character names, for clarity and portability to EBCDIC.
* regcomp.c: Move code out of longjmp area | Karl Williamson | 2010-12-04 | 1 | -6/+6
  This code should be done before the setjmp to avoid the longjmp clobbering it.
* make empty string regexp stringify to the same thing regardless of unicode flags | Yves Orton | 2010-12-04 | 1 | -5/+4
* make the jump point a little more obvious in a comment | Yves Orton | 2010-12-04 | 1 | -0/+1
* regcomp.c: Generate REFFU and NREFFU | Karl Williamson | 2010-12-01 | 1 | -8/+26
  This causes the new nodes that denote Unicode semantics in backreferences to be generated when appropriate. Because the addition of these nodes was at the end of the node list, the arithmetic relation that previously was valid no longer is.
* regcomp.c: Use latin1 folding in synthetic start class | Karl Williamson | 2010-12-01 | 1 | -15/+19
  This is because the pattern may not specify unicode semantics, but if the target matching string is in utf8, then unicode semantics may be needed nonetheless. So to avoid the regexec optimizer rejecting the match, we need to allow for a possible false positive.
* regcomp.c: utf8 pattern defaults to Unicode semantics | Karl Williamson | 2010-12-01 | 1 | -1/+21
  A utf8 pattern should force unicode semantics unless otherwise overridden. This means that the 'd' regex modifier means Unicode semantics as well.
* regcomp.c: teach tries about EXACTFU | Karl Williamson | 2010-12-01 | 1 | -7/+7
* regcomp.c: typo in comment | Karl Williamson | 2010-12-01 | 1 | -1/+1
* regcomp.c: Remove duplicate statement | Karl Williamson | 2010-12-01 | 1 | -1/+0
  The flags this statement cleared are cleared as part of the macro called just before it.
* [perl #79152] super-linear cache can prevent a valid match | Nick Cleaton | 2010-11-30 | 1 | -7/+10
  The super-linear cache in regexec.c can prevent a valid match from being detected. For example:
      print "yay\n" if 'xayxay' =~ /(q1|.)*(q2|.)*(x(a|bc)*y){2,}/;
  This should match, but it doesn't because the cache fails to distinguish between matching the final xay to x(a|bc)*y as the first instance of the {2,} and matching it in the same position as the second instance. This seems to do the trick.
* regcomp.c: Handle EXACTFU nodes in optimizer | Karl Williamson | 2010-11-28 | 1 | -4/+26
  This patch also changes the optimizer to include the other member of a fold pair in the bitmap. Thus if 'b' is set under /i, so will 'B', and vice versa.
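Setting both members of a fold pair might be sketched like this. The helper names are hypothetical and the fold is restricted to ASCII letters for brevity; the real optimizer also has to handle Latin1 and Unicode folds.

```c
#include <assert.h>

/* Hypothetical helper: return the other-case form of an ASCII letter,
 * or the character itself if it has no ASCII case partner. */
static unsigned ascii_fold_partner(unsigned c)
{
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
        return c ^ 0x20u;   /* ASCII upper and lower case differ in one bit */
    return c;
}

/* Set c in the 256-bit bitmap together with its fold partner, so that
 * a case-insensitive pattern containing 'b' also admits 'B'. */
static void bitmap_set_with_fold(unsigned char bm[32], unsigned c)
{
    assert(c < 256);
    unsigned partner = ascii_fold_partner(c);
    bm[c >> 3]       |= (unsigned char)(1u << (c & 7));
    bm[partner >> 3] |= (unsigned char)(1u << (partner & 7));
}

static int bitmap_has(const unsigned char bm[32], unsigned c)
{
    return (bm[c >> 3] >> (c & 7)) & 1;
}
```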
* regcomp.c: Use hex instead of octal for debug ords | Karl Williamson | 2010-11-28 | 1 | -2/+8
  The ordinals in the debugging output have been printed in octal, which is OK for the low controls, but for above Latin1 the standard is hex, so this changes them all to correspond. If desired, the low controls could be changed back to octal.
* Fix debug output | Karl Williamson | 2010-11-28 | 1 | -1/+1
  The 'outside bitmap' message isn't orthogonal to the others; it is independent.
* regcomp.c: Typo in comment | Karl Williamson | 2010-11-28 | 1 | -2/+2
* regcomp.c: Generate EXACTFU nodes | Karl Williamson | 2010-11-28 | 1 | -5/+19
* regcomp.c: remove unnecessary tests | Karl Williamson | 2010-11-28 | 1 | -1/+1
  The tests in the else are unnecessary as they comprise everything else but what the 'if' says.