path: root/regcomp.c
Commit message | Author | Age | Files | Lines
* In Perl_regdupe_internal() eliminate npar, which is assigned to but never used.Nicholas Clark2011-05-181-2/+1
| | | | It has been unused since 28d8d7f41ab202dd restructured the regexp dup code.
* In S_regpiece(), only declare parse_start if conditional code needs it.Nicholas Clark2011-05-181-0/+6
| | | | | | gcc 4.6.0 warns about variables that are set but never read, and unless RE_TRACK_PATTERN_OFFSETS is defined, parse_start is never read. So avoid declaring or setting it if it's not actually going to be used later.
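The pattern described above can be sketched as follows. This is a minimal illustration, not the code from S_regpiece(): the macro `TRACK_OFFSETS` is a hypothetical stand-in for RE_TRACK_PATTERN_OFFSETS, and `count_stars` and `span_out` are invented names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for RE_TRACK_PATTERN_OFFSETS; uncomment to enable. */
/* #define TRACK_OFFSETS */

/* Counts leading '*' characters. The start position is only declared and
 * saved when the tracking code that reads it is compiled in. */
size_t count_stars(const char *p, size_t *span_out)
{
#ifdef TRACK_OFFSETS
    const char *parse_start = p;   /* exists only when it will be read */
#endif
    size_t n = 0;

    assert(p != NULL);
    while (*p == '*') {
        p++;
        n++;
    }

#ifdef TRACK_OFFSETS
    if (span_out)
        *span_out = (size_t)(p - parse_start);
#else
    (void)span_out;                /* unused when tracking is compiled out */
#endif
    return n;
}
```

With the macro undefined, no variable is set without being read, so a set-but-unused warning has nothing to report.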
* regcomp.c: change comment wording, from TomCKarl Williamson2011-05-181-3/+3
|
* regcomp.c: White space onlyKarl Williamson2011-05-031-4/+4
| | | | | A previous commit added an 'if' around this code. This now indents the block properly.
* PATCH: [perl #89750]: Unicode regex negated case-insensitivityKarl Williamson2011-05-031-1/+21
| | | | | | | | | | | | | This patch causes inverted [bracketed] character classes to not handle multi-character folds. The reason is that these can lead to very counter-intuitive results (see bug discussion). In an inverted character class, only single-char folds are now generated. However the fold for \xDF=>ss is hard-coded in, and it was too much trouble sending flags to the sub-sub routine that does this, so another check is done at the point of storing the list of multi-char folds. Since \xDF doesn't have a single char fold, this works.
* regcomp: Improve error message for (?-d:...)Karl Williamson2011-04-121-6/+11
| | | | | As agreed, this improvement is going into 5.14. A customized message is output, instead of a generic one.
* PATCH: [perl #86972]: Tweak error messagesKarl Williamson2011-04-121-3/+3
| | | | | | | An earlier commit in this series changed some error messages. I realized that it did not make sense really to use "/a" for the regex modifier, when the message was for the infix form "(?a:", so this just removes the slash.
* PATCH: partial [perl #86972]: Allow /(?aia)/Karl Williamson2011-04-101-13/+35
| | | | | | This allows a second regex 'a' modifier in the infix form to not have to be contiguous with the first, and improves the message if there are extra modifiers.
* PATCH: [perl #87812] BBC breaks Pod::Coverage::TrustPodKarl Williamson2011-04-091-9/+5
| | | | | | | | | | | | | | | This patch completes the fixing of this problem. The problem is that the failing .t set @INC to exclude lib, and hence couldn't find utf8.pm, which 5.14 was requiring in places where it previously didn't. This patch finishes the job of not requiring utf8.pm in so many places as were inadvertently added in 5.14. Commit 3ad98780b4bded02c371c83a668dc8f323e57718 started the job. This patch changes regcomp.c to not set ANYOF_NONBITMAP_NON_UTF8 where it inappropriately was. I don't know what I was thinking when I originally did what this changes. In order to match outside the bitmap, these characters all must match something that requires utf8, such as a LIGATURE FI.
* regcomp.c: Shun ANYOF_NONBITMAP_NON_UTF8Karl Williamson2011-04-091-16/+39
| | | | | | | | | | As noted in the comments in the code for this commit, this flag has higher consequences than others when it is inappropriately set in the synthetic start class. This patch causes it to not be set unless there is some path in the regex engine that needs it, but once set it is never cleared. This results in a different set of false positives than currently, but the current set can have this set even if there is no path in the engine that needs it.
* PATCH: [perl #87810] BBC Text::MultiMarkdownKarl Williamson2011-04-071-1/+1
| | | | | | | | A statement should have been outside a block but was inside it. The indentation was correct, and despite reading the code a number of times I still missed it. I'm having trouble distilling down the failure scenario into a simple test case, and am short on tuits right now, so a test will be committed later.
* regcomp.c: Silence smoke win32 compiler warningKarl Williamson2011-03-221-2/+2
|
* silence "unused variable" compiler warningKarl Williamson2011-03-211-3/+0
|
* regcomp.c: enums are tighter in C++Karl Williamson2011-03-211-9/+10
| | | | | I was using 0 for a generic non-interesting character, which works in C but not C++.
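A small sketch of the kind of change this implies. The enum and its enumerators are hypothetical, since the log does not show the real type. C converts an integer constant to an enum type implicitly, whereas C++ requires an enumerator or an explicit cast, so a named zero-valued enumerator compiles in both languages.

```c
#include <assert.h>

/* Hypothetical enum; the actual type in regcomp.c is not shown in the log. */
typedef enum {
    CHAR_UNINTERESTING = 0,   /* named value used instead of a bare 0 */
    CHAR_MULTI_FOLD
} char_kind;

/* Classifies a Latin-1 byte. Only \xDF is treated as special here, as an
 * illustration of a character with a multi-character fold. */
char_kind classify(unsigned char c)
{
    /* "char_kind kind = 0;" is accepted by C but rejected by a C++ compiler */
    char_kind kind = CHAR_UNINTERESTING;

    if (c == 0xDF)
        kind = CHAR_MULTI_FOLD;
    return kind;
}
```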
* regcomp.c: Remove FOLDCHAR generationKarl Williamson2011-03-201-36/+0
| | | | | | | | | | | | | | | | | | | | | ANYOFV handles multi-char folds in ANYOF nodes, and it turns out it is a superset of what FOLDCHAR does, which never got fully implemented in regexec.c, whereas ANYOFV is. FOLDCHAR may be the better way to go in the long-term, as it takes less space and is faster, but this gives us the functionality today, with no extra work. FOLDCHAR had been generated only when the character in question is a literal in the input stream, and wasn't touched for the probably more common use of \N{} or \x, which were fixed from not doing anything special to using ANYOFV earlier in the 5.13 series, and it turns out that the code that does it all is in a part of the code that gets executed anyway, so that simply removing the special FOLDCHAR code causes execution to drop down to this code. I'm thinking at the moment that for 5.16, ANYOFV should be removed in favor of branches, using the technique of recursion that has recently been added to \N{}. That would enable easier trie generation and simplify things in regexec and the optimizer.
* regcomp.c: Use mnemonic instead of hex valueKarl Williamson2011-03-201-2/+2
|
* regcomp.c: Handle inverse tricky foldsKarl Williamson2011-03-201-45/+195
| | | | | | The tricky folds have only worked in one direction. This handles the other: when the code sees something that a tricky fold folds to, it converts that to the tricky fold op.
* regcomp.c: Move opening block braceKarl Williamson2011-03-201-6/+7
| | | | | This is in preparation for future commits. The declarations don't depend on the two code lines.
* regcomp.c: Add special case for a Unicode charKarl Williamson2011-03-201-0/+12
| | | | This failed under some circumstances
* reg_namedseq: Restructure so doesn't duplicate codeKarl Williamson2011-03-201-148/+36
| | | | | | | | | | | | | | | This routine now calls reg() recursively after converting the parse to something the rest of the code understands. This eliminates duplicated code, and allows for uniform treatment of code points, as things were getting out of sync. It also eliminates the restriction on how many characters a named sequence can expand to. toke now converts its input (which is in Unicode terms) to native on EBCDIC platforms, so the rest of the code can continue to ignore that. The restriction on the number of characters in a named sequence is hereby removed, because reg() handles that.
* Add depth parameter to reg_namedseqKarl Williamson2011-03-201-4/+4
|
* regcomp.c: Add element to structureKarl Williamson2011-03-201-0/+2
|
* regcomp.c: RT#77414. Initialize flagKarl Williamson2011-03-191-1/+9
| | | | | | | | | | | | | As indicated in the comments, this flag needs to be initialized to 1 or the optimizer loses the fact that something could match a character that isn't in utf8 and whose bitmap bit isn't set. This happens, for example, with Unicode properties. Thus this fixes #77414. That ticket had been closed recently because it went away due to another patch that caused the optimizer to be bypassed in the cases tested for. But when that patch was reverted, and cleaned-up, this bug came back. Now, I believe I have found the root cause.
* regcomp.c: /l uses the \w, etc. classesKarl Williamson2011-03-191-1/+4
| | | | | | | | For non-locale, \d, etc. are compiled in with the actual code points they match, so the class portion of the synthetic start class node is irrelevant, and should be initialized to zero to avoid confusion. But for locale it is highly relevant, and should be initialized to all ones, to indicate matching anything.
* regcomp.c: Optimizer could lose some infoKarl Williamson2011-03-191-2/+9
| | | | | | | When ORing two nodes together for the synthetic start class, and one matches outside the 256-char bitmap, we currently don't know what it matches. In some cases it could be some or all of those 256 characters. If so, we have to assume it's all of them.
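The conservative OR described above might look roughly like this. It is a simplified sketch with invented names (`char_class`, `matches_outside_bitmap`, `class_or`), not the real cl_or(); it treats every node flagged as matching outside the bitmap as the "could be any of the 256" case.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define BITMAP_BYTES 32   /* 256 bits, one per code point below 256 */

/* Hypothetical, much reduced model of a character-class node. */
typedef struct {
    unsigned char bitmap[BITMAP_BYTES];
    bool matches_outside_bitmap;  /* stand-in for the non-bitmap flag */
} char_class;

void class_add(char_class *c, unsigned char ch)
{
    c->bitmap[ch >> 3] |= (unsigned char)(1u << (ch & 7));
}

bool class_has(const char_class *c, unsigned char ch)
{
    return (c->bitmap[ch >> 3] >> (ch & 7)) & 1;
}

/* ORs 'other' into 'acc'. */
void class_or(char_class *acc, const char_class *other)
{
    int i;

    if (other->matches_outside_bitmap) {
        /* What 'other' matches is unknown here, so assume it could be
         * any of the 256 bitmap characters: a false positive is safe,
         * a missed character is not. */
        memset(acc->bitmap, 0xFF, BITMAP_BYTES);
        acc->matches_outside_bitmap = true;
        return;
    }
    for (i = 0; i < BITMAP_BYTES; i++)
        acc->bitmap[i] |= other->bitmap[i];
}
```

Over-approximating only costs the optimizer some precision, because the start class is a pre-filter and the real match is still checked afterwards.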
* regcomp.c: Move statement down.Karl Williamson2011-03-191-3/+3
| | | | | This is in prep for another commit which needs the flags to be untouched for some tests.
* regcomp.c: Reorder if to silence valgrindKarl Williamson2011-03-181-2/+4
| | | | | It is better to test that a pointer is in bounds before dereferencing it even though in this case it doesn't lead to an actual error.
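As an illustration of the ordering (the function below is a made-up example, not the code touched by this commit), the bounds test goes on the left of `&&` so that the dereference is skipped once the pointer reaches the end:

```c
#include <assert.h>
#include <stddef.h>

/* Returns the number of leading spaces in buf[0..len). */
size_t count_leading_spaces(const char *buf, size_t len)
{
    const char *p = buf;
    const char *end = buf + len;

    assert(buf != NULL);
    /* Written as "*p == ' ' && p < end", the byte at 'end' would be read
     * before the bounds test could stop the loop. */
    while (p < end && *p == ' ')
        p++;
    return (size_t)(p - buf);
}
```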
* regex: Fix locale regressionKarl Williamson2011-03-181-19/+10
| | | | | | | | | | | | | | | | | | Things like \S have not been accessible to the synthetic start class under locale matching rules. They have been placed there, but the start class didn't know they were there. This patch sets ANYOF_CLASS in initializing the synthetic start class so that downstream code knows it is a charclass_class, and removes the code that partially allowed this bit to be shared, which isn't needed in 5.14; more thought would have to go into doing it than was reflected in the code. I can't come up with a test case that would verify that this works, because of general locale testing issues, except that I looked at a dump of the generated regex synthetic start class, but the dump isn't the same thing as the real behavior, and using one is also subject to breakage if the regex code changes in the slightest.
* regcomp.c: Avoid locale in optimizer unless necessaryKarl Williamson2011-03-171-22/+30
| | | | | | | | | | | | | | | | | | | | | | | This is further work along the lines in RT #85964 and commit af302e7fa58415c2d8454c8cbef7bccd8b504257. It reverts, for the most part, commits aa19b56b2f07e9eabf57540f00d312d8093e9d28 (Remove unused parameter) and c613755a4b4fc8e64a77639d47d7e208fee68edc (/l in synthetic start class). Those commits caused the synthetic start class to often be marked as matching under locale rules, even if there was no part of the regular expression that used locale. This led to RT #85964, which made apparent that there were a number of assumptions in the optimizer about locale that were no longer necessarily true. This new commit changes things so that locale has to be somewhere in the regex in order to get the synthetic start class to include /l. In other words, this limits the effect of those commits to regular expressions which have /l; we go back to the old way of doing things for non-locale regexes. This limits any bugs that may have been introduced by the addition of /l (and being able to match only sub-parts of a regex under locale) to the relatively uncommon regexes which actually use it. There are a number of bugs that have surfaced for the locale rules regexes that have gone unreported; and some say locale rules regexes should be deprecated.
* Revert "regcomp.c: Rmv unused parameter"Karl Williamson2011-03-171-11/+11
| | | | | This reverts commit c45df5a16bb5a26a06275cc63f2c3e6b1d708184. The parameter is about to be put back in.
* regcomp.c: Add flag for /l occurring anywhereKarl Williamson2011-03-171-3/+12
| | | | | If any part of a pattern has /l, this flag will get set; for future use.
* regcomp.c: Move commentKarl Williamson2011-03-171-2/+2
|
* regcomp.c: Omitted hard-coded case mappingKarl Williamson2011-03-161-0/+2
| | | | | The code has hard-coded the possible case mappings for the code points < 256. This one was omitted.
* regcomp.c: Restore ptr correctlyKarl Williamson2011-03-161-1/+1
| | | | | oldp contains the pointer that we want to get to. Use that instead of a possibly invalid assumption about length
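The log does not show the affected code, so the following is only a hypothetical parser illustrating the idea: rewind to a saved pointer rather than subtracting a length that is assumed to have been consumed. The function `try_hex` and its grammar are invented for this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Tries to consume an optional "0x" prefix followed by at least one hex
 * digit. Returns the position after the number, or the original position
 * if nothing valid was found. */
const char *try_hex(const char *p)
{
    const char *oldp = p;          /* the position we want to get back to */
    const char *digits;

    assert(p != NULL);
    if (p[0] == '0' && p[1] == 'x')
        p += 2;                    /* the prefix may or may not be present */

    digits = p;
    while (isxdigit((unsigned char)*p))
        p++;

    if (p == digits)
        return oldp;               /* not "p - 2": that assumes a prefix was read */
    return p;
}
```

Returning `oldp` stays correct however many bytes were consumed before the failure was detected.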
* regcomp.c: comment and white-space-only changeKarl Williamson2011-03-161-25/+24
|
* RT #85964: bleadperl breaks CGI-FormBuilderKarl Williamson2011-03-161-9/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The introduction of the l regex modifier introduces the possibility that a regular expression can have subportions that match under locale and other portions that don't. I (khw) failed to see all the implications of that in the optimizer. Unfortunately, things didn't start surfacing until late in the development cycle. The optimizer is structured so that a new blank node is initialized to match anything, and the state is set to AND, so that the first real node that comes along is supposed to be ANDed together; with the result being that node. (Like an AND of all 1's with some bit pattern yields that bit pattern.) Then the mode is switched to OR, so subsequent nodes that could be the start ones are or'd in. *(see footnote below). This design leads to some issues, like at the XXX line added by this commit, which looks to be a work-around for the deficiencies of the design. Commit cf34198ebe3dd876d67c10caa9acf491ad2a0c51 that led to this ticket changed things to include LOCALE as part of the initialization, so that the l could be on and off in various parts of the regex. I tried to just revert that (plus associated parameter changes), and found that the changes made to the AND and OR logic that fixed other problems really depended on that commit. Perhaps those could be worked around, but it is not the forward direction. This commit works around things in a different way. What happened in the earlier commit was that the synthetic start class (SSC) is, under some circumstances, getting generated as matching locale even if there is no locale matching in the regex. (This could not happen if the design were as described in the footnote.) This shouldn't matter except for potentially performance issues, as this would just be false positives. 
However, it turns out there is code in the optimizer that assumes that locale and non-locale are never mixed; and so does not do the right thing. This patch is aimed at safety. If the SSC is marked as locale, it sets the bits for things like \w as if the SSC could also end up being for non-locale. This can generate false positives for true locale matches but shouldn't introduce actual optimizer errors, since it only adds to what the SSC can match and doesn't make any restrictions. * I don't see why this design; it seems to me easier to start with the initial state set to all 0's, and then the first node gets OR'd in, yielding exactly that first node; then you don't have to switch; you still have to deal with AND cases, as for example in 0 length lookaheads, but things are made easier.
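The two designs contrasted above can be compared on a toy model. This sketch uses a single 8-bit mask as the "class" and invented function names; it ignores flags, locale, and the genuine AND cases (such as zero-length lookaheads) that the real optimizer must handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Design described in the commit message: start with all ones, AND in the
 * first real node, then switch to OR for the remaining candidates. */
uint8_t ssc_and_then_or(const uint8_t *nodes, size_t n)
{
    uint8_t ssc = 0xFF;            /* blank node: matches anything */
    int and_mode = 1;
    size_t i;

    for (i = 0; i < n; i++) {
        if (and_mode) {
            ssc &= nodes[i];       /* all ones AND x yields x */
            and_mode = 0;          /* switch modes after the first node */
        } else {
            ssc |= nodes[i];
        }
    }
    return ssc;
}

/* Alternative from the footnote: start with all zeros and only OR. */
uint8_t ssc_or_only(const uint8_t *nodes, size_t n)
{
    uint8_t ssc = 0;               /* blank node: matches nothing yet */
    size_t i;

    for (i = 0; i < n; i++)
        ssc |= nodes[i];           /* zero OR x yields x, no mode switch */
    return ssc;
}
```

In this toy model both give the same result once at least one node has been seen. They differ only in what a still-blank class means: "anything" in the first design, "nothing" in the second.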
* regcomp.c: white space onlyKarl Williamson2011-03-161-5/+5
|
* regcomp.c: \D and \d should work under localeKarl Williamson2011-03-161-0/+3
| | | | | | A number of earlier commits have fixed various places where the code assumed that digits did not move under locale. This adds another two, bringing the code here in line with the other sequences like \w
* regcomp.c: no bitmap means no bitmapKarl Williamson2011-03-161-0/+1
| | | | | | | | The line before this line indicates that there is no bitmap, but it didn't clear this flag that says that there may be. This was likely a contributory bug to what ac51e94be5daabecdeb0ed734f3ccc059b7b77e3 tried to fix, and was eventually fixed in 6f8d7d0df3e3141d61246e6b0a3db12ab1fd7f92.
* regcomp.c: Add commentKarl Williamson2011-03-161-1/+1
|
* regcomp.c: utf8 pattern implies uni rulesKarl Williamson2011-03-141-1/+3
| | | | | | This fixes a regression introduced with charset regex modifiers. A utf8 pattern without a charset is supposed to mean unicode semantics. But it didn't until this patch.
* regcomp.c: /a should handle /\xdf/i same as /uKarl Williamson2011-03-121-4/+4
| | | | | | | /a and /u should match identically case-insensitively, but they didn't. Nor was /a being tested because it was thought that they handled things identically, and the tests were already taking too long. So this adds some tests as well.
* regcomp.c: call regclass_swash() only if non-emptyKarl Williamson2011-03-101-1/+1
| | | | | We can tell if there is something outside the bitmap and so can short circuit calling this function if there isn't.
* regcomp.c: Rmv unused parameterKarl Williamson2011-03-081-11/+11
| | | | This silences a compiler warning
* regcomp.c: Rmv unused parameterKarl Williamson2011-03-081-8/+8
| | | | This silences a compiler warning
* regcomp.c: Rmv unused parameterKarl Williamson2011-03-081-10/+10
| | | | This silences a compiler warning
* PATCH: [perl #85528], add initializationKarl Williamson2011-03-081-0/+1
| | | | | | | Commit 137165a601b852a9679983cdfe8d35be29f0939c omitted required initialization for the synthetic start class. Adding it exposed other bugs in cl_and() and cl_or(), which have been fixed by a previous commit.
* regcomp.c: revamp cl_and() and cl_or()Karl Williamson2011-03-081-39/+94
| | | | | | These two routines have not kept pace with the changes in the ANYOF flags. And, I believe there were issues even before them. I did a systematic re-thinking of what their behaviors should be.
* regex: /l in combo with others in syn start classKarl Williamson2011-03-081-4/+2
| | | | | | | | | Now that regexes can be combinations of different charset modifiers, a synthetic start class can match locale and non-locale both. locale should generally match only things in the bitmap for code points < 256. But a synthetic start class with a non-locale component can match such code points. This patch makes an exception for synthetic nodes that will be resolved if it passes and is matched again for real.
* regcomp.c: UTF /l should not use triesKarl Williamson2011-03-081-1/+4
| | | | | | It's unclear if tries will work under /l. I haven't seen any failures, but there have been under /d. As a precaution, until more testing is done, disable tries under anything but /u and UTF.