path: root/regcomp.c
Commit message | Author | Age | Files | Lines
* regcomp.c: don't include INTERN.h | David Mitchell | 2019-02-19 | 1 | -4/+0
  This file only needs including by globals.c; it was being included in regcomp.c too as the declarations in regcomp.h aren't included by perl.h and thus don't get pulled into globals.c. This was a confusing and hacky workaround. Instead, this commit causes globals.c to #include regcomp.h directly. After this commit, only globals.c #includes INTERN.h.
* add dVAR's for PERL_GLOBAL_STRUCT_PRIVATE builds | David Mitchell | 2019-02-19 | 1 | -0/+16
  The perl build option -DPERL_GLOBAL_STRUCT_PRIVATE had bit-rotted due to lack of smoking. The main fix is to just add 'dVAR;' to any functions which have a pTHX arg. It's a NOOP on normal builds.
* regcomp.c: Don't interate a loop needlessly | Karl Williamson | 2019-02-16 | 1 | -2/+3
  While single stepping in gdb, I noticed that this loop kept executing, when it need not.
* PATCH: [perl #133770] null pointer dereference in S_regclass() | Karl Williamson | 2019-02-16 | 1 | -3/+5
  The failing case can be reduced to qr/\x{100}[\x{3030}\x{1fb2}/ (it only happens on UTF-8 patterns). The bottom line is that it was assuming that there was at least one character that folded to 1fb2 besides itself, even though the function call said there weren't any such. The solution is to pay attention to the function return value. I incorporated Hugo's++ patch as part of this one.
  However, the original test case should never have gotten this far. The parser is getting passed garbage, and instead of croaking, it is somehow interpreting it as valid and calling the regex compiler. I will file a ticket about that.
* PATCH: [perl #133767] Assertion failure | Karl Williamson | 2019-02-16 | 1 | -2/+7
  The problem here is that a syntax error occurs and hence certain things don't get done, but processing continues, as the error isn't checked for until after the return of the function that found it. The failing assertion is checking that those certain things actually did get done. There appear to be good reasons to defer the raising of the error until then, so the simplest way to fix this is to generalize the code so that the failing assertion doesn't happen.
* Use STATIC_ASSERT_STMT for checking compile-time invariants | Dagfinn Ilmari Mannsåker | 2019-02-15 | 1 | -3/+2
  Better to have the build fail if they're wrong than relying on the code path being hit at runtime in a DEBUGGING build.
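The idea of failing the build rather than waiting for a runtime assertion can be sketched in plain C11. This is only an illustration: it uses the standard `_Static_assert` keyword rather than Perl's STATIC_ASSERT_STMT macro, and `struct node`, `NODE_PAYLOAD_BYTES` and `node_capacity` are hypothetical names that do not appear in regcomp.c.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical node layout, for illustration only. */
#define NODE_PAYLOAD_BYTES 32

struct node {
    unsigned char type;
    unsigned char len;                         /* payload length, one byte */
    unsigned char payload[NODE_PAYLOAD_BYTES];
};

/* File-scope check: compilation stops if the invariant is false. */
_Static_assert(NODE_PAYLOAD_BYTES <= UCHAR_MAX,
               "payload length must fit in the one-byte len field");

size_t node_capacity(void)
{
    /* Block-scope check, placed next to the code that relies on it. */
    _Static_assert(sizeof(((struct node *)0)->payload) == NODE_PAYLOAD_BYTES,
                   "payload array size must match NODE_PAYLOAD_BYTES");
    return NODE_PAYLOAD_BYTES;
}
```

Because both checks are evaluated by the compiler, they cost nothing at runtime and cannot be skipped by an untested code path.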
* Remove relics of regex swash use | Karl Williamson | 2019-02-14 | 1 | -167/+32
  This removes the most obvious and easy things that are no longer needed since regexes no longer use swashes at all. tr/// continues, for the time being, to use swashes, so not all swash handling is removable now. But tr/// doesn't use inversion lists, and so a bunch of code is ripped out here. Other code could have been, but I did only the relatively easy stuff. The rest can be ripped out all at once when tr/// stops using swashes.
* Use mnemonics for array indices | Karl Williamson | 2019-02-14 | 1 | -15/+26
  The element at, say, [0] is a particular thing. This commit changes to use a mnemonic instead of [0], for clarity.
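A minimal sketch of the technique, assuming nothing about the real array layout: the enumerator names `SLOT_INVLIST` and `SLOT_ONLY_UTF8` and the helper `slot_get` are invented for illustration and are not the mnemonics the commit introduced.

```c
#include <stddef.h>

/* Hypothetical slot names; the real mnemonics live in the Perl source. */
enum data_slot {
    SLOT_INVLIST   = 0,   /* what used to be written as [0] */
    SLOT_ONLY_UTF8 = 1,   /* what used to be written as [1] */
    SLOT_COUNT
};

/* Fetch one entry by name. An unused entry is simply NULL. */
const char *slot_get(const char *const slots[], enum data_slot which)
{
    return slots[which];
}
```

A call such as `slot_get(slots, SLOT_INVLIST)` documents itself, whereas `slots[0]` forces the reader to look up what position 0 holds.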
* regcomp.c: Arrays no longer need PL_sv_undef placeholders | Karl Williamson | 2019-02-14 | 1 | -29/+22
  An empty entry is now just NULL.
* regcomp.c: Simplify args passing for ANYOF nodes | Karl Williamson | 2019-02-14 | 1 | -91/+40
  A swash is no longer used, so we can remove some elements from the array of data that gets stored with the compiled pattern for use in runtime matching. This is the first step in more simplifications.
  Since a swash isn't used, this change also requires regexec.c to change to use a straight inversion list lookup. This has the salutary effect of eliminating a conversion between code point and UTF-8.
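For readers unfamiliar with the data structure, here is a generic sketch of an inversion list lookup. It is not Perl's implementation: `invlist_contains` is a hypothetical name, and the list is assumed to be a sorted array in which even-indexed elements start a range that is in the set and odd-indexed elements start a range that is not.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if cp is in the set described by the inversion list.
 * list[0] starts an included range, list[1] starts an excluded range,
 * list[2] starts an included range, and so on. */
int invlist_contains(const uint32_t *list, size_t n, uint32_t cp)
{
    size_t lo = 0, hi = n;

    /* Binary search for the number of elements that are <= cp. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (list[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* An odd count means the last boundary crossed opened an included range. */
    return (lo & 1) != 0;
}
```

The lookup works directly on the code point, which is consistent with the commit's remark that no conversion between code point and UTF-8 is needed.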
* regcomp.c: Add some potential code that's #ifdef'd out | Karl Williamson | 2019-02-14 | 1 | -0/+52
  This is in case we ever need it. This checks for portability in the code points specified in user-defined properties. Previously there was a check, but I couldn't get a warning to trigger unless there was also overflow. So that means the pattern compile failed due to the overflow, and the portability warning was superfluous. But, one can have non-portable code points without overflow; just the old method didn't properly detect them. If we do ever need to detect and report on them, the code is mostly written and in this commit.
* Move \p{user-defined} to core from utf8_heavy.pl | Karl Williamson | 2019-02-14 | 1 | -231/+890
  This large commit moves the handling of user-defined properties to C code. This should speed it up, but the main reason to do this is to stop using swashes in this case, leaving only tr/// using them. Once that too is converted, all swash handling can be ripped out of perl.
  Doing this in perl has caused some nasty interactions that will now be fixed automatically. The change is not entirely transparent, however (besides speed and the possibility of removing these interactions). perldelta in this commit details these.
* Add global hash to handle \p{user-defined} | Karl Williamson | 2019-02-14 | 1 | -0/+25
  A global hash has to be specially handled. The keys can't be shared, and all the SVs stored into it must be in its thread. This commit adds the hash, and initialization, and macros for context change, but doesn't use them. The code to deal with this is entirely confined to regcomp.c.
* regcomp.c: Add/reword some comments/white-space | Karl Williamson | 2019-02-14 | 1 | -41/+39
* regcomp.c: Change variable name | Karl Williamson | 2019-02-14 | 1 | -13/+13
  The new name more closely corresponds with its use.
* regcomp.c: White-space only | Karl Williamson | 2019-02-05 | 1 | -4/+4
  Indent a block of code newly formed by the previous commit.
* Add Turkish locale handling to /i pattern matching | Karl Williamson | 2019-02-05 | 1 | -3/+46
  Previous commits in this series have changed uc(), lc(), fc(), etc. to know how to handle Turkish UTF-8 locales. This commit extends this to /i regular expression pattern matching.
* regcomp.c: Under /l any < 256 char can match any other | Karl Williamson | 2019-02-05 | 1 | -3/+14
  The code knew this, but it was adding the ASCII alphabetics to the list of things that matched in UTF-8 locales. This is unnecessary, as we've long had the infrastructure elsewhere to handle all potential mappings from a Latin1 code point to other Latin1, so we can just rely on it. And it created complexities for future commits in this series.
  The MICRO SIGN is the exception, as it folds to non-Latin1 in UTF-8 locales, and this is the place where the structure exists to handle that.
* regen/mk_invlists.pl: Create new inversion list | Karl Williamson | 2019-02-05 | 1 | -0/+1
  This will be used in a future commit.
* regcomp.c: Clarify comment | Karl Williamson | 2019-02-05 | 1 | -1/+2
* regcomp.c: Fix recent optimization of [...] bug | Karl Williamson | 2019-02-04 | 1 | -1/+1
  This bug was introduced in b2296192536090829ba6d2cb367456f4e346dcc6 in 5.29.7. Using /il should not result in looking for a [:posix:] class that matches the code points given.
* PATCH: [perl #133756] Failure to match properly | Karl Williamson | 2019-01-10 | 1 | -5/+10
  This was caused by a counting error.

  An EXACTFish regnode has a finite length it can hold for the string being matched. If that length is exceeded, a 2nd node is used for the next segment of the string, for as many regnodes as are needed.

  A problem occurs if a regnode ends with one of the 22 characters in Unicode 11 that occur in non-final positions of a multi-character fold. The design of the pattern matching engine doesn't allow matches across regnodes. Consider, for example, if a node ended in the letter 'f' and the next node begins with the letter 'i'. That sequence should match, under /i, the ligature "fi" (U+FB01). But it wouldn't because the pattern splits them across nodes.

  The solution I adopted was to forbid a node to end with one of those 22 characters if there is another string node that follows it. This is not foolproof; for example, if the entire node consisted of only these characters, one would have to split it at some position. (In that case, we just take as much of the string as will fit.) But for real life applications, it is good enough.

  What happens if a node ends with one of the 22 is that the node is shortened so that those are instead placed at the beginning of the following node. When the code encounters this situation, it backs off until it finds a character that isn't a non-final fold one, and closes the node with that one.

  A /i node is filled with the fold of the input, for several reasons. The most obvious is that it saves time: you can skip folding the pattern at runtime. But there are reasons based on the design of the optimizer as well, which I won't go into here, but are documented in regcomp.c.

  When we back out the final characters in a node, we also have to back out the corresponding unfolded characters in the input, so that those can be (folded) into the following node. Since the number of characters in the fold may not be the same as unfolded, there is not an easily discernible correspondence between the input and the folded output. That means that generally, what has to be done is that the input is reparsed from the beginning of the node, but the permitted length has been shortened (we know precisely how much to shorten it to) so that it will end with something other than the 22. But the code saves the previous input character's position (for other reasons), so if we only have to back up one character, we can just use that and not have to reparse.

  This bug was that the code thought a two character backup was really a one character one, and did not reparse the node, creating an off-by-one error, and a character was simply omitted in the pattern (that should have started the following node). And the input had two of the 22 characters adjacent to each other in just the right positions that the node was split.

  The bisect showed that when the node size was changed the bug went away, at least for this particular input string. But a different, longer, string would have triggered the bug, and this commit fixes that.

  This bug is actually very unlikely to occur in most real world applications. That is because other changes in the regex compiler have caused nodes to be split so that things that don't participate in folds at all are separated out into EXACT nodes. (The reason for that is it allows the optimizer things to grab on to under /i that it wouldn't otherwise have known about.) That means that anything like this string would never cause the bug to happen because blanks and commas, etc. would be in separate nodes, and so no node would ever get large enough to fill the 238 available byte slots in a node (235 on EBCDIC). Only a long string without punctuation would trigger it. I have artificially constructed such a string in the tests added by this commit.

  One of the 22 characters is 't', so long strings of DNA "ACTG" could trigger this bug. I find it somewhat amusing that this is something like a DNA transcription error, which occurs in nature at very low rates, but selection, it is believed, will make sure the error rate is above zero.
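The back-off rule described above can be sketched in a few lines of C. This is a simplified illustration, not the code in regcomp.c: it works on already-folded single-byte text, so it sidesteps the reparsing problem entirely; `next_node_len` and `is_nonfinal_fold` are hypothetical names; and only 'f' and 't' stand in for the full set of 22 characters.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for "one of the 22 characters"; only two are modelled here. */
static int is_nonfinal_fold(char c)
{
    return c == 'f' || c == 't';
}

/* Length of the next node to cut from s[0..len), given a node capacity. */
size_t next_node_len(const char *s, size_t len, size_t max)
{
    size_t take = max;

    assert(max > 0);

    /* The final node has nothing after it, so any ending is acceptable. */
    if (len <= max)
        return len;

    /* Back off until the node ends in a character that cannot be the
     * non-final part of a multi-character fold. */
    while (take > 0 && is_nonfinal_fold(s[take - 1]))
        take--;

    /* The node consists only of such characters: take as much as fits. */
    if (take == 0)
        return max;

    return take;
}
```

With a capacity of 4, the input "acgtacgt" yields a first node of length 3 ("acg"), so the 't' starts the following node. The bug fixed by this commit concerned the bookkeeping that accompanies this back-off when two characters have to be given back, which the sketch does not model.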
* regcomp.c: Fix wrong comment | Karl Williamson | 2019-01-10 | 1 | -3/+3
* regcomp.c: Rmv null function calls | Karl Williamson | 2019-01-01 | 1 | -14/+12
  These were relics from the removal of the sizing pass. I did a global substitute, and missed that these cases promptly took the inverse function of the function I just added. In other words, if g() is the inverse of f(), then g(f(x)) is always x, and we can omit both g() and f().
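As a concrete illustration of why such a pair of calls is a no-op, consider two hypothetical helpers that convert between a node pointer and its offset from a base. The commit message does not say which functions were involved, so `node_t`, `node_to_offset` and `offset_to_node` are invented for this sketch.

```c
#include <stddef.h>

typedef struct { int type; } node_t;   /* hypothetical node type */

/* f(): pointer to offset from the start of the program. */
ptrdiff_t node_to_offset(const node_t *base, const node_t *p)
{
    return p - base;
}

/* g(): offset back to pointer; the inverse of node_to_offset(). */
const node_t *offset_to_node(const node_t *base, ptrdiff_t off)
{
    return base + off;
}
```

Here `offset_to_node(base, node_to_offset(base, p))` always yields `p` again, so writing plain `p` is equivalent and cheaper.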
* Revert "regen/mk_invlists.pl: Fix bug when 2 ident tables" | Karl Williamson | 2018-12-31 | 1 | -2/+4
  This reverts commit 7e9b4fe4d85e9b669993bf96a7e33ffff3197e20, with additional changes to get things to compile.
  It turns out I was wrong about the underlying cause that commit addressed, and it is easier to just use the existing constants that get generated.
* regcomp.c: Refactor \b{} parsing code | Karl Williamson | 2018-12-27 | 1 | -18/+26
  This just moves things around so that the information is kept in local variables and the regnode not created until all that info has been completely determined.
  I believe it is clearer to read, but the impetus came from the fact that prior to this commit, use of \b{} always restarted the parse unnecessarily because the order of things made it appear that a real /d op had appeared, whereas it was just the one currently being constructed.
* regcomp.c: White-space only | Karl Williamson | 2018-12-27 | 1 | -8/+10
  Indent the block added by the previous commit.
* regcomp.c: Avoid reading out-of-bounds memory | Karl Williamson | 2018-12-27 | 1 | -0/+2
  Recent commit c316b824875fdd5ce52338f301fb0255d843dfec introduced an out-of-bounds memory read. This commit fixes it. An ANYOFH regnode doesn't have a bitmap, so don't try to read it.
* Change length-1 ASCII fold pairs to ANYOFM regnodes | Karl Williamson | 2018-12-26 | 1 | -0/+38
  A node that matches only 'A' and 'a', for example, can be turned into an ANYOFM node, which is faster to execute. This is done after joining of adjacent EXACTFish nodes, as longer nodes are better than shorter ones, including because they lessen the number of bugs with multi-char folds not matching because of node boundaries. But if a length 1 node remains, ANYOFM is better.
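To see why a pair such as 'A' and 'a' suits a mask-based node, note that the two bytes differ in a single bit. The sketch below assumes that ANYOFM amounts to a mask-and-compare test on one byte; the function names `mask_match` and `matches_A_or_a` are illustrative and do not come from the Perl source.

```c
#include <stdint.h>

/* Matches when c, with the "don't care" bits masked off, equals base. */
int mask_match(uint8_t c, uint8_t mask, uint8_t base)
{
    return (c & mask) == base;
}

/* 'A' is 0x41 and 'a' is 0x61: they differ only in bit 0x20.
 * Clearing that bit (mask 0xDF) maps both to 0x41. */
int matches_A_or_a(uint8_t c)
{
    return mask_match(c, 0xDF, 0x41);
}
```

One AND and one comparison replace a fold of the input character followed by a string comparison, which is where the speed advantage would come from under this assumption.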
* regcomp.c: White-space only | Karl Williamson | 2018-12-26 | 1 | -35/+37
  Indent newly formed block in previous commit.
* Add new regnode: ANYOFH, without a bitmap | Karl Williamson | 2018-12-26 | 1 | -7/+27
  This commit adds a regnode for the case where nothing in the bit map has matches. This allows the bitmap to be omitted, saving 32 bytes of otherwise wasted space per node. Many non-Latin Unicode properties have this characteristic. Further, since this node applies only to code points above 255, which are representable only in UTF-8, we can trivially fail a match where the target string isn't in UTF-8. Time savings also accrue from skipping the bitmap look-up. When swashes are removed, even more time will be saved.
* Revamp qr/[...]/ optimizations | Karl Williamson | 2018-12-26 | 1 | -130/+536
  This commit extensively changes the optimizations for ANYOF regnodes that represent bracketed character classes.
  The removal of the regex compilation pass now makes these feasible and desirable. Compilation now tries hard to optimize an ANYOF node into something smaller and/or faster when feasible.
  Now, qr/[X]/ for any single character or POSIX class X, and any modifiers like /d, /i, etc., should be the same as qr/X/ for the same modifiers, unless that would require the pattern to be upgraded from non-UTF-8 to UTF-8; the upgrade is still done when skipping it could introduce bugs.
  These changes fix some issues with multi-character /i folding.
* regcomp.c: Rename a variable | Karl Williamson | 2018-12-26 | 1 | -4/+4
  This is to distinguish it from another similar variable being introduced in the next commit.
* regcomp.c: White-space, comments only | Karl Williamson | 2018-12-26 | 1 | -28/+31
* regcomp.c: Refactor 3 variables into flags of 1 | Karl Williamson | 2018-12-26 | 1 | -14/+24
  This makes the code easier to read, as it summarizes the purposes of the three, and makes it easy to find which reason it is.
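The general shape of such a refactoring is sketched below. The commit message does not name the three variables or their meanings, so `REASON_A`, `REASON_B`, `REASON_C` and both helper functions are placeholders chosen purely for illustration.

```c
/* Placeholder reasons; the real flag names are in the Perl source. */
#define REASON_A (1u << 0)
#define REASON_B (1u << 1)
#define REASON_C (1u << 2)

/* Fold three separate booleans into one flags word. */
unsigned combine_reasons(int a, int b, int c)
{
    unsigned flags = 0;

    if (a) flags |= REASON_A;
    if (b) flags |= REASON_B;
    if (c) flags |= REASON_C;
    return flags;
}

/* "Did anything apply?" becomes a single test on the flags word. */
int any_reason(unsigned flags)
{
    return flags != 0;
}
```

A test such as `flags & REASON_B` then states which reason is being checked, while `any_reason(flags)` replaces a three-way `||` across unrelated variables.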
* regcomp.c: White space only | Karl Williamson | 2018-12-26 | 1 | -27/+27
  Indent after the previous commit created a new outer loop.
* regcomp.c: Refactor looking for POSIX optimizations | Karl Williamson | 2018-12-26 | 1 | -23/+25
  Instead of repeating the code, slightly modified, this uses a loop. This is in preparation for a future commit where a third instance would have been required.
* regcomp.c: Rename a variable | Karl Williamson | 2018-12-26 | 1 | -4/+4
  The new name more accurately expresses the usage, as what gets generated may not actually be an ANYOFD.
* regcomp.c: Remove no longer used static function | Karl Williamson | 2018-12-26 | 1 | -163/+0
* regcomp.c: Remove remaining use of static function | Karl Williamson | 2018-12-26 | 1 | -5/+16
  Commit 13dfc48da5322166f9d64a7349e3434c070ead88 removed one of two uses of this function. This removes the remaining one. Commits in between removed the need for most of the guts of the function.
* regcomp.c: Consolidate common code | Karl Williamson | 2018-12-26 | 1 | -8/+5
  These flags can be set in one place, rather than in multiple ones.
* regcomp.c: Simplify ANYOFM node generation | Karl Williamson | 2018-12-26 | 1 | -19/+14
  This refactors the code somewhat. When we discover a deal-breaker code point we can just break out of the loop (using a goto) instead of setting a flag, continuing, and later testing it.
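The control-flow change can be illustrated with a small, self-contained example. The scenario is an assumption made for the sketch: a list of code point ranges is scanned, and any code point above 0xFF counts as the deal-breaker. `struct range` and `ranges_fit_in_byte` are hypothetical names.

```c
#include <stddef.h>

struct range { unsigned lo, hi; };   /* inclusive code point range */

/* Returns 1 if every code point in every range is at most 0xFF. */
int ranges_fit_in_byte(const struct range *r, size_t n)
{
    size_t i;
    unsigned cp;

    for (i = 0; i < n; i++) {
        for (cp = r[i].lo; cp <= r[i].hi; cp++) {
            /* Leave both loops at once instead of setting a flag,
             * finishing the iteration, and testing the flag later. */
            if (cp > 0xFF)
                goto deal_breaker;
        }
    }
    return 1;

  deal_breaker:
    return 0;
}
```

In C a plain `break` leaves only the innermost loop, which is why a `goto` is the usual way to abandon nested loops as soon as the answer is known.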
* regcomp.c: Don't zap larger scope variables | Karl Williamson | 2018-12-26 | 1 | -5/+6
  It doesn't matter currently, but it's best to declare more limited scope variables for doing limited scope work, rather than using the more global variable, which someday might want to be used later, outside the block that zapped it, and would lead to a surprise.
* Remove ASCII/NASCII regnodes | Karl Williamson | 2018-12-26 | 1 | -28/+4
  The ANYOFM/NANYOFM regnodes are generalizations of these. They have more masks and shifts than the removed nodes, but not more branches, so are effectively the same speed. Remove the ASCII/NASCII nodes in favor of having less code to maintain.
* regcomp.c: Prefer ANYOF/NANYOFM regnodes | Karl Williamson | 2018-12-26 | 1 | -149/+152
  These two regnodes are faster than regular /[[:posix:]]/ ones, and some of the latter are equivalent to some of the former. So try the faster optimizations first.
  This commit just swaps the two blocks of code, and outdents appropriately.
* regcomp.c: Refactor some /[foo]/ code | Karl Williamson | 2018-12-26 | 1 | -48/+50
  This refactors the code that sees about optimizing bracketed character classes into something else, so that the creating of the other regnode is done closer to its determination, and only the really common code actually is done in the common place, moved to the end of the function. This removes the need for some 'elses' and 'ifs'.
* regcomp.c: Simplify handling of EXACTFish nodes with 's' at edge | Karl Williamson | 2018-12-26 | 1 | -179/+130
  Commit 8a100c918ec81926c0536594df8ee1fcccb171da created node types for handling an 's' at the leading edge, at the trailing edge, and at both edges for nodes under /di that there is nothing else in that would prevent them from being EXACTFU nodes. If two of these get joined, it could create an 'ss' sequence which can't be an EXACTFU node, for U+DF would match them unconditionally. Instead, under /di it should match if and only if the target string is UTF-8 encoded.
  I realized later that having three types becomes harder to deal with when adding yet more node types, so this commit turns the three into just one node type, indicating that at least one edge of the node is an 's'.
  It also simplifies the parsing of the pattern and determining which node to use.
* regexec.c: Avoid unnecessary folding | Karl Williamson | 2018-12-26 | 1 | -1/+1
  Previous commits caused the pattern under /i to be folded as much as possible. This commit takes advantage of this by not folding when we know it already has been folded.
* Collapse regnode EXACTFU_SS into EXACTFUP | Karl Williamson | 2018-12-26 | 1 | -32/+55
  EXACTFUP was created by the previous commit to handle a problematic case in which not all the code points in an EXACTFU node are /i foldable at compile time. Doing so will allow a future commit to use the pre-folded EXACTFU nodes (done in a prior commit), saving execution time for the common case. The only problematic code point is the MICRO SIGN. Most patterns don't use this character.

  EXACTFU_SS is problematic in a different way. It contains the sequence 'ss' which is folded to by LATIN SMALL LETTER SHARP S, but everything in it can be pre-folded (unless it also contains a MICRO SIGN). The reason this is problematic is that it is the only non-UTF-8 node where the length in folding can change. To process it at runtime, the more general fold equivalence function is used that is capable of handling length disparities, but is slower than the functions otherwise used for non-UTF-8.

  What I've chosen to do for now is to make a single node type for all the problematic cases (which at this time means just the two aforementioned ones). If we didn't do this, we'd have to add a third node type for patterns that contain both 'ss' and MICRO. Or artificially split the pattern so the two never were in the same node, but we can't do that because it can cause bugs in handling multi-character folds. If more special handling is found to be needed, there'd be a combinatorial explosion of additional node types to handle all possible combinations.

  What this effectively means is that the slower, more general foldEQ function is used for portions of patterns containing the MICRO sign when the pattern isn't in UTF-8, even though there is no inherent reason to do so for non-UTF-8 strings that don't also contain the 'ss' sequence.
* Add regnode EXACTFUP, for problematic | Karl Williamson | 2018-12-26 | 1 | -7/+13
  If a non-UTF-8 pattern contains a MICRO SIGN, this special node is now created. This character is the only one not needing UTF-8 to represent, but its fold does need UTF-8, which causes some issues, so it has to be specially handled. When matching against a non-UTF-8 target string, the pattern is effectively folded, but not if the target is UTF-8.
  By creating this node, we can remove the special handling required for the nodes that don't have a MICRO SIGN, in a future commit.