summaryrefslogtreecommitdiff
path: root/regcomp.c
Commit message (Collapse)AuthorAgeFilesLines
...
* regcomp.c: Generate EXACTFU_SS only for non-UTF8Karl Williamson2018-12-261-23/+13
| | | | | | | | | | | | | It turns out that now, the regular methods for handling multi-character folds work for the ones involving LATIN SMALL LETTER SHARP S when the pattern is in UTF-8. So the special code for handling this case can be removed, and a regular EXACTFU node is generated. This has the advantage of being trie-able, and requiring fewer operations at run time, as the pattern is pre-folded at compile time, and doesn't have to be re-folded during each backtracking at run-time. This means that the EXACTFU_SS node type will only be generated for non-UTF-8 patterns, and the handling of it is unchanged in these cases.
* regcomp.c: Avoid duplicate workKarl Williamson2018-12-261-9/+11
| | | | | | By setting a variable, we can place some subsequent code into an else, avoiding duplication of work. (Perhaps the compiler optimizers already figure this out.)
* regcomp.c: Rationalize use of two variablesKarl Williamson2018-12-261-16/+15
| | | | | | | | | | | | | | | | | | I didn't fully understand the use of two variables that deal with /u in regular expression pattern compilation when I commited 1a26cbcb0358334c3eb1941a919ffdb57eb4fe7e. RExC_uni_semantics is used to say that there is something in the pattern that indicates it is supposed to use /u if /d would otherwise be selected. Once set, it cannot be cleared. UNI_SEMANTICS means if we are operating under /u. Code forbids /d to be selected if RExC_uni_semantics is set, but we could select /l, /a, or /aa, so we don't have to be in /u (and hence UNI_SEMANTICS) if RExC_uni_semantics is set. We just can't be in /d. This commit adds comments, and fixes things to match the above description.
* regcomp.c, regexec.c: Rename some related variablesKarl Williamson2018-12-251-26/+26
| | | | The new names are shorter and more meaningful.
* regcomp.c: Shorten variable nameKarl Williamson2018-12-251-27/+27
| | | | | | Change 'has_upper_latin1_only_utf8_matches' to 'upper_latin1_only_utf8_matches ', as the initial 'has_' is unnecessary and somewhat misleading in spots
* Change name of PL_NonL1NonFinalFoldKarl Williamson2018-12-251-3/+2
| | | | | The inversion list this refers to now includes the Latin 1 range, so the name was misleading.
* Move 2 property defns to mktablesKarl Williamson2018-12-251-2/+2
| | | | | | | | | These 2 Unicode-like property definitions used internally by the regular expression compiler are moved by this commit from regen/mk_invlists.pl to lib/unicore/mktables. By placing all these in the same place, maintainers only have to learn one bit of code, instead of two.
* regcomp.c: Avoid a NULL dereferenceKarl Williamson2018-12-251-4/+21
| | | | | | | This refactors the code so that it doesn't refer to an object before it makes sure it exists and isn't empty. This hasn't been a problem in the past, but a future commit will call this subroutine with parameters that expose this bug.
* Change name of PL_utf8_foldable variableKarl Williamson2018-12-251-5/+5
| | | | | | This variable's name was out-of-date and misleading. It is the name of an inversion list that contains all the code points in the current version of Unicode that participate in any way in a /i type of fold.
* regcomp.c: qr/[\xFF]/di doesn't have runtime dependenciesKarl Williamson2018-12-251-1/+1
| | | | | | | | | | | | Prior to this commit, a class containing U+FF, LATIN SMALL LETTER Y WITH DIAERESIS, generated an ANYOFD regnode because it thought that what matched depended on the UTF-8ness of the target string. But it doesn't. No bugs were introduced because when ANYOFD is encountered the code looks at some flags to determine what sorts of dependencies to further look for, and the flags remained clear. But ANYOFD is less desirable than plain ANYOF, because it adds extra branches to execute. Tests for this fix will be added in a future commit.
* regcomp.c: Add #ifdefKarl Williamson2018-12-251-0/+2
| | | | | | If perl is compiled on an early enough Unicode release, these won't be defined, but since they're only used in deprecated functions, we can just #ifdef them away.
* regcomp.c: Fix commentKarl Williamson2018-12-251-1/+1
| | | | This comment was out-of-date
* regcomp.c: Tighten embedded patterns in regex setsKarl Williamson2018-12-161-11/+8
| | | | | | | In the (?[ ... ]) regex sets features, one can embed another compiled regex set pattern. Such compiled patterns always have a flag of '^', which we weren't looking for prior to this commit. That meant that uncompiled patterns would be mistaken for compiled ones.
* regcomp.c: Allow more EXACTFish nodes to be trieableKarl Williamson2018-12-071-25/+165
| | | | | | | | | | | | | | | | | | The previous two commits fixed bugs where it would be possible during optimization to join two EXACTFish nodes together, and the result would not work properly with LATIN SMALL LETTER SHARP S. But by doing so, the commits caused all non-UTF-8 EXACTFU nodes that begin or end with [Ss] from being trieable. This commit changes things so that the only the ones that are non-trieable are the ones that, when joined, have the sequence [Ss][Ss] in them. To do so, I created three new node types that indicate if the node begins with [Ss] or ends with them, or both. These preclude having to examine the node contents at joining to determine this. And since there are plenty of node types available, it seemed the best choice. But other options would be available should we run out of nodes. Examining the first and final characters of a node is not expensive, for example.
* regcomp.c: Make sure /di nodes begining in 's' are EXACTFKarl Williamson2018-12-071-4/+35
| | | | | | | | | | | This is defensive coding. The previous commit changed things so under /di a node ending in [Ss] doesn't get made an EXACTFU. This commit does the same for nodes that begin with [Ss]. This isn't actually necessary as one needs two EXACTFU nodes in a row for the problem to occur, and the previous commit appears to remove the possibility for the first node being an EXACTFU. But I'm leery of relying on this. So this commit makes sure that a node beginning with 'S' or 's' under /di remains EXACTF
* regcomp.c: Make sure /di nodes ending in 's' are EXACTFKarl Williamson2018-12-071-8/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prior to this commit only nodes that filled fully were guaranteed not to be upgraded to EXACTFU. EXACTF nodes are used when /d rules are to be used unless the string being matched against is in UTF-8. EXACTFU nodes are used when the /u rules are to be used always. If the node contains only characters whose interpretation is the same under /d and /u, both node types behave identically, and so either can be used. EXACTFU nodes are preferred because they are trie-able, and otherwise faster at runtime due to not having to check the UTF-8ness of the target string. The pattern compilation uses an EXACTFU node when possible, over an EXACTF node. The sequences 'ss', 'SS', 'Ss', 'sS' are very tricky, for several reasons. For this commit, the trickiness lies in the fact that they are the only sequences where both the pattern and target string aren't in UTF-8, and what matches under /ui rules differs from what matches under /di. (There are various single characters that match differently, but this is the only sequence whose individual components match the same, but the sequence as a whole matches differently.) The code has long taken special care for these sequences, but overlooked two fairly obscure cases where it matters. This commit addresses one of those cases; the next commit, the other. Because these are sequences, it might be possible for it to be split across two EXACTFish nodes, so that the first 's' or 'S' is the last thing in the first node, and the second 's' or 'S' is the first thing in the second. Should these nodes get joined during optimization, they form the sequence. The code has long recognized this possibility if the first node gets filled up, necessitating a split, and it doesn't make the first node EXACTFU if it ends in 's' or 'S'. But we don't fill those nodes totally completely, and optimization can join two together. Future commits in the pipeline will join nodes in more cases during optimization, and so, we need to not create an EXACTFU for trailing 's' or 'S' always, not just if the first node fills up. This commit moves the code that accomplishes this so it always gets executed at node end of /di nodes, instead of previously only getting executed when the node fills.
* regcomp.c: Simplify a bit of codeKarl Williamson2018-12-071-3/+3
| | | | | By using a macro with a slightly different API, we don't have to mess with the parse pointer.
* regcomp.c: Can join certain EXACTish node typesKarl Williamson2018-12-071-1/+17
| | | | | | | | | The optimization phase of regular expression pattern compilation looks for adjacent EXACTish nodes and joins them if they are the same flavor of EXACT. Commits a9f8c7ac75c364c3e05305718f38c5f8ccd935d8 and f6b4b99d2e584fbcd85eeed475eea10b87858e54 introduced two new nodes that are so close to existing flavors that they are joinable with their respective flavor. This commit does that.
* regcomp.c: Move clause of while() conditional into loopKarl Williamson2018-12-071-3/+7
| | | | | This is in preparation for making the conditional more complicated than can be easily done in the condition.
* regcomp.c: Add assertionKarl Williamson2018-12-071-0/+3
|
* Remove one use of static functionKarl Williamson2018-12-071-15/+17
| | | | | | | | | | | Previous commits in 5.29 have removed all but two calls to this function, and the remaining ones take radically different paths in it, with very little common code. It simplifies things if we expand each call to the code that gets evaluated. This commit does one call. Changing the other usage is deferred until a later commit. The need for an #ifdef is removed by adding a flag and setting it in an existing #ifdef.
* regen/mk_invlists.pl: Add new tableKarl Williamson2018-12-071-0/+1
| | | | | | | This table contains all the code points that are in any multi-character fold (not the folded-from character, but what that character folds to). It will be used in a future commit.
* regcomp.c: Use simpler variable name as long as possibleKarl Williamson2018-12-061-7/+8
| | | | | | This just extends the use of a variable name a little longer, as it's easier to read than the nested macro calls that eventually have to be used.
* regcomp.c: Prefer one of similarly named varsKarl Williamson2018-12-061-3/+5
| | | | | | These two variables are similarly named, but have slightly different purposes. Comment out one of them, and convert to using the other consistently
* Use consistent spelling in qr// dumpingKarl Williamson2018-12-061-3/+3
| | | | | | | Under -Dr (or use re 'Debug') the compiled regex engine program is displayed. I noticed that it used two different spellings for 'infinity'. This commit changes so only one is used, the one that has been in the field the longest.
* regcomp.c: Clarify commentKarl Williamson2018-12-061-2/+4
|
* Revert "regcomp.c: Use a weird value in a place where ignored"Karl Williamson2018-12-021-2/+1
| | | | | | | This reverts commit 5a4bc50b395654abde21718cbce2e296049f470d. It turns out I was wrong, and the value isn't ignored. The commit caused the pattern to be optimized differently, but still be valid, so no tests failed.
* PATCH: [perl #133423]Karl Williamson2018-11-291-1/+0
|
* regcomp.c: White-space onlyKarl Williamson2018-11-271-4/+4
| | | | Vertical align continuation chars in a macro
* Add regnode EXACTFU_ONLY8Karl Williamson2018-11-271-4/+30
| | | | | | | | | | | | This is a regnode that otherwise would be an EXACTFU except that it contains a code point that requires UTF-8 to match, including all the possible folds involving it. Hence if the target string isn't UTF-8, we know it can't possibly match, without needing to try. For completeness, there could also be an EXACTFAA_ONLY8 and an EXACTFL_ONLY8 created, but I think these are unlikely to actually appear in the wild, since using /aa is mainly about ASCII, and /l mostly will involve characters that don't require UTF-8.
* Add regnode EXACT_ONLY8Karl Williamson2018-11-271-21/+50
| | | | | | | This is a regnode that otherwise would be an EXACT except that it contains a code point that requires UTF-8 to represent. Hence if the target string isn't UTF-8, we know it can't possibly match, without needing to try.
* regcomp.c: Use common code instead of duplicating itKarl Williamson2018-11-271-8/+1
| | | | | The common code is about to get more complicated, so use it instead of a copy.
* handle /(?(?{code}))/ mixed compile-and runtimeDavid Mitchell2018-11-271-5/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Where a runtime pattern contains both compile-time and run-time code blocks, e.g.: $re = '(?{ RRR })'; / $re X(?{ CCC })Y/ The compile-time code-block CCC is parsed at the same time as the surrounding text. The runtime code RRR is parsed at runtime by constructing a fake pattern and re-parsing it, but with any compile-time code-blocks blanked out (so they don't get compiled twice). The compiled regex is then thrown away, but any optrees just created for the runtime code blocks are kept. For example at runtime, the re-parsed pattern looks like: / (?{ RRR }) X__________Y/ Unfortunately this was failing for the conditional pattern, e.g. / $re X(?(?{ CCC }))Y/ which was getting blanked as / (?{ RRR }) X(?_______)Y/ which isn't valid syntax. This commit blanks (?{...}) into (?=====) instead which is always legal.
* regcomp.c: Clarify commentKarl Williamson2018-11-261-1/+2
|
* regcomp.c: Initialize a variable more conservativelyKarl Williamson2018-11-261-2/+2
| | | | | It doesn't matter currently, but thes variable shouldn't be TRUE unless /i is in effect.
* regcomp.c: Use a weird value in a place where ignoredKarl Williamson2018-11-261-1/+2
| | | | | | This way, it doesn't confuse that it is legal, and should it stop being ignored in the called function, it will show up as a problem much sooner.
* regcomp.c: Consolidate duplicated code into 1 placeKarl Williamson2018-11-261-13/+7
|
* regcomp.c: Use better method for setting debug offsetsKarl Williamson2018-11-261-5/+2
| | | | | | | | This was changing the parse pointer around some code and restoring it afterwards. The purpose must have been to get the debug offsets correct. But there is a better way to do that, which doesn't take up cycles on a non-Debugging build, and that is to set the offsets directly.
* regcomp.c: Remove another sizing pass relictKarl Williamson2018-11-261-3/+0
| | | | | We no longer play with the emit ptr in this function, so no need to save and restore it.
* regcomp.c: Rmv malformed assert()Karl Williamson2018-11-201-3/+0
| | | | | | | | | | This had a plain '=' instead of an '==', thus setting the value in the process of asserting it, hence would always be true. But changing it to '==' causes the assertion to fail in various cases. I need to rethink this, so in the meantime am simply removing it. Spotted by Dave Mitchell++
* Add regnode NANYOFMKarl Williamson2018-11-171-22/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This matches when the existing node ANYOFM would not match; i.e., they are complements. I almost didn't create this node, but it turns out to significantly speed up various classes of matches. For example qr/[^g]/, both /i and not, turn into this node; and something like (("a" x $large_number) . "b") =~ /[^a]/ goes through the string a word at a time, instead of previously byte-by-byte. Benchmarks are at the end of this mesage. This node gets generated when complementing any single ASCII character and when complementing any ASCII case pair, like /[^Gg]/. It never gets generated if the class includes a character that isn't ASCII (actually UTF-8 invariant, which matters only on EBCDIC platforms). The details of when this node gets constructed are complicated. It happens when the bit patterns of the characters in the class happen to have certain very particular characteristics, depending on the vagaries of the character set. [BbCc] will do so, but [AaBb] does not. [^01] does, but not [^12]. Precisely, look at all the bit patterns of the characters in the set, and count the total number of differing bits, calling it 'n'. If and only if the number of characters is 2**n, this node gets generated. As an example, on both ASCII and EBCDIC, the last 4 bits of '0' are 0000; of '1' are 0001; of '2' are 0010; and of '3' are 0011. The other 4 bits are the same for each of these 4 digits. That means that only 2 bits differ among the 4 characters, and 2**2==4, so the NANYOFM node will get generated. Similarly, 8=1000 and 0=0000 differ only in one bit so 2**1==2, and so [^08] will generate this node. We could consider in the future, an extension where, if the input doesn't work to generate this node, that we construct the closure of that input to generate this node, which would have false positives that would have to be tested for. The speedup of this node is so significant that that could still be faster than what we have today. The benchmarks are for a 64-bit word. 32-bits would not be as good. Key: Ir Instruction read Dr Data read Dw Data write COND conditional branches IND indirect branches The numbers (except for the final column) represent raw counts per loop iteration. The higher the number in the final column, the faster. (('a' x 1) . 'b') =~ /[^a]/ blead nanyof Ratio % -------- -------- -------- Ir 2782.0 2648.0 105.1 Dr 845.0 799.0 105.8 Dw 531.0 500.0 106.2 COND 431.0 419.0 102.9 IND 22.0 22.0 100.0 (('a' x 10) . 'b') =~ /[^a]/ blead nanyof Ratio % -------- -------- -------- Ir 3358.0 2671.0 125.7 Dr 998.0 801.0 124.6 Dw 630.0 500.0 126.0 COND 503.0 424.0 118.6 IND 22.0 22.0 100.0 (('a' x 100) . 'b') =~ /[^a]/ blead nanyof Ratio % -------- -------- -------- Ir 9118.0 2773.0 328.8 Dr 2528.0 814.0 310.6 Dw 1620.0 500.0 324.0 COND 1223.0 450.0 271.8 IND 22.0 22.0 100.0 (('a' x 1000) . 'b') =~ /[^a]/ blead nanyof Ratio % -------- -------- -------- Ir 66718.0 3650.0 1827.9 Dr 17828.0 923.0 1931.5 Dw 11520.0 500.0 2304.0 COND 8423.0 668.0 1260.9 IND 22.0 22.0 100.0 (('a' x 10000) . 'b') =~ /[^a]/ blead nanyof Ratio % -------- -------- -------- Ir 642718.0 12650.0 5080.8 Dr 170828.0 2048.0 8341.2 Dw 110520.0 500.0 22104.0 COND 80423.0 2918.0 2756.1 IND 22.0 22.0 100.0 (('a' x 100000) . 'b') =~ /[^a]/ blead nanyof Ratio % -------- -------- -------- Ir Inf 102654.8 6237.1 Dr Inf 13299.3 12788.9 Dw Inf 500.9 219708.7 COND 800424.1 25419.1 3148.9 IND 22.0 22.0 100.0
* regcomp.c: Simplify early failure returnsKarl Williamson2018-11-161-19/+12
| | | | | Previous commits have removed the need for certain macros and generality in returning from functions early. Correspondingly simplify
* regcomp.c: Remove no longer used parameter, and refactorKarl Williamson2018-11-161-40/+30
| | | | | | | This static function no longer is called with a non-NULL final parameter. That means it no longer returns a list, and its name is hereby changed to reflect that. It also means the function can be refactored and made simpler.
* regcomp.c: Remove now always NULL parameterKarl Williamson2018-11-161-14/+6
| | | | | This parameter is always NULL. No need to have it in this static function
* regcomp.c: Don't restart parse for /d to /u if no need toKarl Williamson2018-11-161-5/+23
| | | | | | This commit keeps track of if there are any operations encountered which differ under /d from /u. If we switch to /u and haven't so far found anything which differs, there's no need to reparse
* regcomp.c: Don't restart parse for /d to /u if reparsing anywayKarl Williamson2018-11-161-0/+4
| | | | | | | | | Prior to this commit, if the rules changed from /d to /u, the parse was immediately restarted. This commit changes that so that it doesn't do this if it is known that the parse will be redone anyway, but a full parse needs to done first in order to count the parentheses. Doing this can avoid the need for an almost full extra reparse.
* regcomp.c: Don't restart parse now if doing so laterKarl Williamson2018-11-161-2/+6
| | | | | | | | | | Prior to this commit, if it became apparent that long branches were going to be needed, the parse was immediately restarted. This commit changes that so that it doesn't do this if it is known that the parse will be redone anyway, but a full parse needs to done first in order to count the parentheses. This can avoid an almost complete reparse in some situations.
* regcomp.c: Swap 'if' branches for readabilityKarl Williamson2018-11-161-4/+4
| | | | It's easier to understand if the simplest case is first in the code.
* regcomp.c: Refactor constructing EXACTish nodesKarl Williamson2018-11-161-125/+73
| | | | | | | | | | The previous commits have allowed us to refactor this to eliminate redundancies. Previously, the same logic was done separately for UTF-8 and non-UTF-8 patterns. This refactors so the logic is done once. The details differ for UTF-8 and non-UTF-8. So that's where the differences lie, in the details without having to duplicate the logic.
* regcomp.c: Fix up parsing \N{} in a stringKarl Williamson2018-11-161-0/+18
| | | | | | | | \N{} changes /d to /u, and may require reparsing of the current node to fix wrong assumptions. This commit is necessary for the simplifications achieved in the next commit.