summaryrefslogtreecommitdiff
path: root/regcomp.sym
Commit message (Collapse)AuthorAgeFilesLines
* regcomp.sym: Add new regnode type for (?[])Karl Williamson2020-02-191-0/+2
| | | | | | | | | | | | | | | | | | This new regnode is used to handle interpolated already-compiled regex sets inside outer regex sets. If it isn't present, it will mean that what appears to be a nested, interpolated set really isn't. I created a new regnode structure to hold a pointer. This has to be temporary as pointers can be invalidated. I thought of just having a regnode without a pointer as a marker, and using a parallel array to store the data, rather than creating a whole new regnode structure for just pointers, but parallel data structures can get out of sync, so this seemed best. This commit just sets up the regnode; a future commit will actually use it.
* regcomp.sym: Add commentKarl Williamson2019-11-221-0/+10
|
* regcomp.sym: Simplify a couple regnode defnsKarl Williamson2019-11-221-2/+2
| | | | | Under the given circumstances, these work precisely like others that already have good descriptions.
* PATCH: gh #17319 SegfaultKarl Williamson2019-11-221-1/+1
| | | | | | It turns out that one isn't supposed to fill in the offset to the next regnode at node creation time. And this node is like EXACTish, so the string stuff isn't accounted for in its regcomp.sym definition
* Add ANYOFHs regnodeKarl Williamson2019-11-201-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | This node is like ANYOFHb, but is used when more than one leading byte is the same in all the matched code points. ANYOFHb is used to avoid having to convert from UTF-8 to code point for something that won't match. It checks that the first byte in the UTF-8 encoded target is the desired one, thus ruling out most of the possible code points. But for higher code points that require longer UTF-8 sequences, many many non-matching code points pass this filter. Its almost 200K that it is ineffective for for code points above 0xFFFF. This commit creates a new node type that addresses this problem. Instead of a single byte, it stores as many leading bytes that are the same for all code points that match the class. For many classes, that will cut down the number of possible false positives by a huge amount before having to convert to code point to make the final determination. This regnode adds a UTF-8 string at the end. It is still much smaller, even in the rare worst case, than a plain ANYOF node because the maximum string length, 15 bytes, is still shorter than the 32-byte bitmap that is present in a plain ANYOF. Most of the time the added string will instead be at most 4 bytes.
* regcomp.sym: Add missing commasKarl Williamson2019-11-191-2/+2
| | | | I don't think lack of these affects anything, but they were inconsistent
* Add ANYOFRb regnodeKarl Williamson2019-11-171-1/+3
| | | | | | | | | | | | | | | | | | | | | | This is like the ANYOFR regnode added in the previous commit, but all code points in the range it matches are known to have the same first UTF-8 start byte. That means it can't match UTF-8 invariant characters, like ASCII, because the "start" byte is different on each one, so it could only match a range of 1, and the compiler wouldn't generate this node for that; instead using an EXACT. Pattern matching can rule out most code points by looking at the first character of their UTF-8 representation, before having to convert from UTF-8. On ASCII this rules out all but 64 2-byte UTF-8 characters from this simple comparison. 3-byte it's up to 4096, and 4-byte, 2**18, so the test is less effective for higher code points. I believe that most UTF-8 patterns that otherwise would compile to ANYOFR will instead compile to this, as I can't envision real life applications wanting to match large single ranges. Even the 2048 surrogates all have the same first byte.
* Add ANYOFR regnodeKarl Williamson2019-11-171-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | This matches a single range of code points. It is both faster and smaller than other ANYOF-type nodes, requiring, after set-up, a single subtraction and conditional branch. The vast majority of Unicode properties match a single range (though most of the properties likely to be used in real world applications have more than a single range). But things like [ij] are a single range, and those are quite commonly encountered. This new regnode matches them more efficiently than a bitmap would, and doesn't require the space for one either. The flags field is used to store the minimum matchable start byte for UTF-8 strings, and is ignored for non-UTF-8 targets. This, like ANYOFH nodes which have a similar mechanism, allows for quick weeding out of many possible matches without having to convert the UTF-8 to its corresponding code point. This regnode packs the 32 bit argument with 20 bits for the minimum code point the node matches, and 12 bits for the maximum range. If the input is a value outside these, it simply won't compile to this regnode, instead going to one of the ANYOFH flavors. ANYOFR is sufficient to match all of Unicode except for the final (private use) 65K plane.
* regcomp.sym: Add detail to some node descriptionsKarl Williamson2019-11-141-10/+14
| | | | | | | Having this enabled me to more quickly understand what's going on. A trailing period is removed from some long descriptions to make them slightly shorter.
* Change the names of some regnodesKarl Williamson2019-10-291-3/+3
| | | | The new name is shorter and I believe, clearer.
* regex: Add LEXACT_ONLY8 node typeKarl Williamson2019-09-291-0/+1
| | | | | This is like LEXACT, but it is known that only strings encoded in UTF-8 will match it, so don't even have to try if that condition isn't met.
* Add regnode LEXACT, for long stringsKarl Williamson2019-09-291-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit adds a new regnode for strings that don't fit in a regular one, and adds a structure for that regnode to use. Actually using them is deferred to the next commit. This new regnode structure is needed because the previous structure only allows for an 8 bit length field, 255 max bytes. This commit puts the length instead in a new field, the same place single-argument regnodes put their argument. Hence this long string is an extra 32 bits of overhead, but at no string length is this node ever bigger than the combination of the smaller nodes it replaces. I also considered simply combining the original 8 bit length field (which is now unused) with the first byte of the string field to get a 16 bit length, and have the actual string be offset by 1. But I rejected that because it would mean the string would usually not be aligned, slowing down memory accesses. This new LEXACT regnode can hold up to what 1024 regular EXACT ones hold, using 4K fewer overhead bytes to do so. That means it can handle strings containing 262000 bytes. The comments give ideas for expanding that should it become necessary or desirable. Besides the space advantage, any hardware acceleration in memcmp can be done in much bigger chunks, and otherwise the memcmp inner loop (often written in assembly) will run many more times in a row, and our outer loop that calls it, correspondingly fewer.
* regcomp.sym Update and improve descriptions of some nodesKarl Williamson2019-09-271-7/+7
| | | | | | | | | EXACTFU nodes always now fold their strings; the information here had not been updated to reflect that change. And the descriptions of several EXACTish nodes are now changed to be slightly shorter and to remove mention of the string length, which is problematic, and is covered in the description for EXACT
* regen/regcomp.pl, regcomp.sym: CommentsKarl Williamson2019-09-271-4/+19
| | | | | | | | | I spent some time in this code trying to understand some things, and as a result I'm commenting previously undocumented features. The comments about what an entry in regcomp.sym should look like are moved to that file, rather than the file that reads it. The former is most often touched, and they had gotten out-of-sync in the latter. Things now make more sense to me, and hopefully anyone using this in the future.
* regcomp.sym: Fix commentKarl Williamson2019-09-151-1/+1
| | | | | The length of an EXACTish node is the same bits as the FLAGS field in other nodes; it doesn't "precede the length", as previously claimed.
* Add ANYOFHr regnodeKarl Williamson2019-06-261-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit adds a new regnode, ANYOFHr, like ANYOFH, but it also has a loose upper bound for the first UTF-8 byte matchable by the node. (The 'r' stands for 'range'). It would be nice to have a tight upper bound, but to do so requires 4 more bits than are available without changing the node arguments types, and hence increasing the node size. Having a loose bound is better than no bound, and comes essentially free, by using two unused bits in the current ANYOFH node, and requiring only a few extra, pipeline-able, mask, etc instructions at run time, no extra conditionals. Any ANYOFH nodes that would benefit from having an upper bound will instead be compiled into this node type. Its use is based on the following observations. There are 64 possible start bytes, so the full range can be expressed in 6 bits. This means that the flags field in ANYOFH nodes containing the start byte has two extra bits that can be used for something else. An ANYOFH node only happens when there is no matching code point in the bit map, so the smallest code point that could be is 256. The start byte for that is C4, so there are actually only 60 possible start bytes. (perl can be compiled with a larger bit map in which case the minimum start byte would be even higher.) A second observation is that knowing the highest start byte is above F0 is no better than knowing it's F0. This is because the highest code point whose start byte is F0 is U+3FFFF, and all code points above that that are currently allocated are all very specialized and rarely encountered. And there's no likelihood of that changing anytime soon as there's plenty of unallocated space below that. So if an ANYOFH node's highest start byte is F0 or above, there's no advantage to knowing what the actual max possible start byte is, so leave it as ANYOFH,. That means the highest start byte we care about in ANYOFHr is EF. That cuts the number of start bytes we care about down to 43, still 6 bits required to represent them, but it allows for the following scheme: Populate the flags field by subtracting C0 from the lowest start byte and shift left 2 bits. That leaves the the bottom two bits unused. We use them as follows, where x is the start byte of the lowest code point in the node: bits ---- 11 The upper limit of the range can be as much as (EF - x) / 8 10 The upper limit of the range can be as much as (EF - x) / 4 01 The upper limit of the range can be as much as (EF - x) / 2 00 The upper limit of the range can be as much as EF That partitions the loose upper bound into 4 possible ranges, with it being tighter the closer it is to the strict lower bound. This makes the loose upper bound more meaningful when there is most to gain by having one. Some examples of what the various upper bounds would be for all the possibilities of these two bits are: Upper bound given the 2 bits Low bound 11 10 01 00 --------- -- -- -- -- C4 C9 CE D9 EF D0 D3 D7 DF EF E0 E1 E3 E7 EF Start bytes of E0 and above represent many more code points each than lower ones, as they are 3 byte sequences instead of two. This scheme provides tighter bounds for them, which is also a point in its favor. Thus we have provided a loose upper bound using two otherwise unused bits. An alternate scheme could have had the intervals all the same, but this provides a tighter bound when it makes the most sense to. For EBCDIC the range is is from C8 to F4, Tests will be added in a later commit
* regex: Add lower bound to ANYOFH nodes UTF-8 byteKarl Williamson2019-06-261-1/+1
| | | | | | | | | | | This commit adds a lower bound for the first UTF-8 byte matchable by an ANYOFH node. The flags field is otherwise unused, and using it for this purpose allows code to rule out match possibilities without having to convert from UTF-8 to code point. It might be better to do the inverse instead, to have the field be an upper bound. The reason is that the conversion is cheap for smaller numbers. The commit following mostly addresses this.
* Use inRANGE for seeing if node is an ANYOFH typeKarl Williamson2019-06-261-0/+3
| | | | | This is easier to read, especially when a third type is added a few commits ahead.
* regcomp.sym: Change regnode descriptionKarl Williamson2019-06-251-1/+1
| | | | Simplify the description for ANYOFb
* Split ANYOFH regnode into two typesKarl Williamson2019-05-311-1/+2
| | | | | | | | | ANYOFHb will be for nodes where all the matching code points share the frst UTF-8 byte. ANYOFH will be for all others. Neither of these has a bitmap. I noticed that we can omit some execution conditionals by splitting the nodes.
* regcomp.sym: Fix typo in commentKarl Williamson2019-05-241-1/+1
|
* regnodes.h: Change some regnodes' namesKarl Williamson2019-05-241-6/+6
| | | | | These were misleading, as elsewhere a leading 'N' in the name means the complement. Instead move the N to the end of the name
* Add common UTF-8 first byte to ANYOFH regnodesKarl Williamson2019-03-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | An ANYOFH regnode is generated instead of a plain ANYOF one when nothing it can match is in the bitmap used in ANYOF nodes. It is therefore smaller as the 4 word (or more) bitmap is omitted. This means that for it to match a target string, that string must be UTF-8 (since the bitmap is for at least the lowest 256 code points). And only in rare circumstances are there any flags associated with it in the regnode flags field. This commit changes things so that the flags field in an ANYOFH node is repurposed to be the first UTF-8 encoded byte of every code point matched by the class if there is a common byte for all of them; or 0 if some have different first bytes. (That means that those rare cases where the flags field isn't otherwise empty can no longer be ANYOFH nodes.) The purpose of this is so that a future commit can take advantage of this, and more quickly scan the target string for places that this node can match. A problem with ANYOF nodes is that they are code point oriented (U32 or U64), and the target string is UTF-8, so conversion has to be done. By having the partial conversion compiled in, we can look for that at runtime instead of having to look at every character in the scan.
* Implement variable length lookbehind in regex patternsKarl Williamson2019-03-181-2/+2
| | | | See [perl #132367].
* regnodes.h, perldebguts: Shorten some descriptionsKarl Williamson2019-03-141-10/+10
|
* regcomp.sym: Note specialized use of 'flags' in 2 OPsKarl Williamson2018-12-301-2/+2
|
* Add new regnode: ANYOFH, without a bitmapKarl Williamson2018-12-261-0/+1
| | | | | | | | | | | This commit adds a regnode for the case where nothing in the bit map has matches. This allows the bitmap to be omitted, saving 32 bytes of otherwise wasted space per node. Many non-Latin Unicode properties have this characteristic. Further, since this node applies only to code points above 255, which are representable only in UTF-8, we can trivially fail a match where the target string isn't in UTF-8. Time savings also accrue from skipping the bitmap look-up. When swashes are removed, even more time will be saved.
* Remove ASCII/NASCII regnodesKarl Williamson2018-12-261-3/+0
| | | | | | | The ANYOFM/NANYOFM regnodes are generalizations of these. They have more masks and shifts than the removed nodes, but not more branches, so are effectively the same speed. Remove the ASCII/NASCII nodes in favor of having less code to maintain.
* regcomp.c: Simplify handling of EXACTFish nodes with 's' at edgeKarl Williamson2018-12-261-5/+1
| | | | | | | | | | | | | | | | | | Commit 8a100c918ec81926c0536594df8ee1fcccb171da created node types for handling an 's' at the leading edge, at the trailing edge, and at both edges for nodes under /di that there is nothing else in that would prevent them from being EXACTFU nodes. If two of these get joined, it could create an 'ss' sequence which can't be an EXACTFU node, for U+DF would match them unconditionally. Instead, under /di it should match if and only if the target string is UTF-8 encoded. I realized later that having three types becomes harder to deal with when adding yet more node types, so this commit turns the three into just one node type, indicating that at least one edge of the node is an 's'. It also simplifies the parsing of the pattern and determining which node to use.
* Collapse regnode EXACTFU_SS into EXACTFUPKarl Williamson2018-12-261-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | EXACTFUP was created by the previous commit to handle a problematic case in which not all the code points in an EXACTFU node are /i foldable at compile time. Doing so will allow a future commit to use the pre-folded EXACTFU nodes (done in a prior commit), saving execution time for the common case. The only problematic code point is the MICRO SIGN. Most patterns don't use this character. EXACTFU_SS is problematic in a different way. It contains the sequence 'ss' which is folded to by LATIN SMALL LETTER SHARP S, but everything in it can be pre-folded (unless it also contains a MICRO SIGN). The reason this is problematic is that it is the only non-UTF-8 node where the length in folding can change. To process it at runtime, the more general fold equivalence function is used that is capable of handling length disparities, but is slower than the functions otherwise used for non-UTF-8. What I've chosen to do for now is to make a single node type for all the problematic cases (which at this time means just the two aforementioned ones). If we didn't do this, we'd have to add a third node type for patterns that contain both 'ss' and MICRO. Or artificially split the pattern so the two never were in the same node, but we can't do that because it can cause bugs in handling multi-character folds. If more special handling is found to be needed, there'd be a combinatorial explosion of additional node types to handle all possible combinations. What this effectively means is that the slower, more general foldEQ function is used for portions of patterns containing the MICRO sign when the pattern isn't in UTF-8, even though there is no inherent reason to do so for non-UTF-8 strings that don't also contain the 'ss' sequence.
* Add regnode EXACTFUP, for problematicKarl Williamson2018-12-261-1/+1
| | | | | | | | | | If a non-UTF-8 pattern contains a MICRO SIGN, this special node is now created. This character is the only one not needing UTF-8 to represent, but its fold does need UTF-8, which causes some issues, so it has to be specially handled. When matching against a non-UTF-8 target string, the pattern is effectively folded, but not if the target is UTF-8. By creating this node, we can remove the special handling required for the nodes that don't have a MICRO SIGN, in a future commit.
* regexec.c: Most /iaa nodes are now pre-foldedKarl Williamson2018-12-261-0/+7
| | | | | | | | | | | | | | | | | | | | | | So, we don't have to re-fold them. Previous commits have caused any EXACTFAA nodes to be pre-folded, and we now have the infrastructure in regexec.c to take advantage of this, including in non-UTF-8 patterns. This commit changes to do this. The only non-pre-folded EXACTFAA nodes are those that are not UTF-8, but the target string is. The reason is that the MICRO SIGN folds to something representable only in UTF-8, so if you have both non-UTF-8, it effectively is folded, and if you have the pattern in UTF-8, it gets folded to the proper character. In order for non-UTF-8 /iaa nodes to always be fully folded, there would need to be a separate node for ones that contain the MICRO SIGN, and then only that one wouldn't be considered folded when the target is UTF-8. I don't think it's worth it, as the only gain would be in matching a non-UTF-8 /iaa node against a UTF-8 target string. I suspect /iaa will be used mostly in non-UTF8 target strings. Comments have been added to point this out in case someone thinks it should be implemented.
* regcomp.c: Generate EXACTFU_SS only for non-UTF8Karl Williamson2018-12-261-1/+1
| | | | | | | | | | | | | It turns out that now, the regular methods for handling multi-character folds work for the ones involving LATIN SMALL LETTER SHARP S when the pattern is in UTF-8. So the special code for handling this case can be removed, and a regular EXACTFU node is generated. This has the advantage of being trie-able, and requiring fewer operations at run time, as the pattern is pre-folded at compile time, and doesn't have to be re-folded during each backtracking at run-time. This means that the EXACTFU_SS node type will only be generated for non-UTF-8 patterns, and the handling of it is unchanged in these cases.
* regcomp.c: Allow more EXACTFish nodes to be trieableKarl Williamson2018-12-071-0/+6
| | | | | | | | | | | | | | | | | | The previous two commits fixed bugs where it would be possible during optimization to join two EXACTFish nodes together, and the result would not work properly with LATIN SMALL LETTER SHARP S. But by doing so, the commits caused all non-UTF-8 EXACTFU nodes that begin or end with [Ss] from being trieable. This commit changes things so that the only the ones that are non-trieable are the ones that, when joined, have the sequence [Ss][Ss] in them. To do so, I created three new node types that indicate if the node begins with [Ss] or ends with them, or both. These preclude having to examine the node contents at joining to determine this. And since there are plenty of node types available, it seemed the best choice. But other options would be available should we run out of nodes. Examining the first and final characters of a node is not expensive, for example.
* regcomp.sym: Clarify descriptions of EXACTish regnodesKarl Williamson2018-12-061-9/+10
|
* Add regnode EXACTFU_ONLY8Karl Williamson2018-11-271-0/+3
| | | | | | | | | | | | This is a regnode that otherwise would be an EXACTFU except that it contains a code point that requires UTF-8 to match, including all the possible folds involving it. Hence if the target string isn't UTF-8, we know it can't possibly match, without needing to try. For completeness, there could also be an EXACTFAA_ONLY8 and an EXACTFL_ONLY8 created, but I think these are unlikely to actually appear in the wild, since using /aa is mainly about ASCII, and /l mostly will involve characters that don't require UTF-8.
* Add regnode EXACT_ONLY8Karl Williamson2018-11-271-0/+2
| | | | | | | This is a regnode that otherwise would be an EXACT except that it contains a code point that requires UTF-8 to represent. Hence if the target string isn't UTF-8, we know it can't possibly match, without needing to try.
* Add regnode NANYOFMKarl Williamson2018-11-171-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This matches when the existing node ANYOFM would not match; i.e., they are complements. I almost didn't create this node, but it turns out to significantly speed up various classes of matches. For example qr/[^g]/, both /i and not, turn into this node; and something like (("a" x $large_number) . "b") =~ /[^a]/ goes through the string a word at a time, instead of previously byte-by-byte. Benchmarks are at the end of this mesage. This node gets generated when complementing any single ASCII character and when complementing any ASCII case pair, like /[^Gg]/. It never gets generated if the class includes a character that isn't ASCII (actually UTF-8 invariant, which matters only on EBCDIC platforms). The details of when this node gets constructed are complicated. It happens when the bit patterns of the characters in the class happen to have certain very particular characteristics, depending on the vagaries of the character set. [BbCc] will do so, but [AaBb] does not. [^01] does, but not [^12]. Precisely, look at all the bit patterns of the characters in the set, and count the total number of differing bits, calling it 'n'. If and only if the number of characters is 2**n, this node gets generated. As an example, on both ASCII and EBCDIC, the last 4 bits of '0' are 0000; of '1' are 0001; of '2' are 0010; and of '3' are 0011. The other 4 bits are the same for each of these 4 digits. That means that only 2 bits differ among the 4 characters, and 2**2==4, so the NANYOFM node will get generated. Similarly, 8=1000 and 0=0000 differ only in one bit so 2**1==2, and so [^08] will generate this node. We could consider in the future, an extension where, if the input doesn't work to generate this node, that we construct the closure of that input to generate this node, which would have false positives that would have to be tested for. The speedup of this node is so significant that that could still be faster than what we have today. The benchmarks are for a 64-bit word. 32-bits would not be as good. Key: Ir Instruction read Dr Data read Dw Data write COND conditional branches IND indirect branches The numbers (except for the final column) represent raw counts per loop iteration. The higher the number in the final column, the faster. (('a' x 1) . 'b') =~ /[^a]/ blead nanyof Ratio % -------- -------- -------- Ir 2782.0 2648.0 105.1 Dr 845.0 799.0 105.8 Dw 531.0 500.0 106.2 COND 431.0 419.0 102.9 IND 22.0 22.0 100.0 (('a' x 10) . 'b') =~ /[^a]/ blead nanyof Ratio % -------- -------- -------- Ir 3358.0 2671.0 125.7 Dr 998.0 801.0 124.6 Dw 630.0 500.0 126.0 COND 503.0 424.0 118.6 IND 22.0 22.0 100.0 (('a' x 100) . 'b') =~ /[^a]/ blead nanyof Ratio % -------- -------- -------- Ir 9118.0 2773.0 328.8 Dr 2528.0 814.0 310.6 Dw 1620.0 500.0 324.0 COND 1223.0 450.0 271.8 IND 22.0 22.0 100.0 (('a' x 1000) . 'b') =~ /[^a]/ blead nanyof Ratio % -------- -------- -------- Ir 66718.0 3650.0 1827.9 Dr 17828.0 923.0 1931.5 Dw 11520.0 500.0 2304.0 COND 8423.0 668.0 1260.9 IND 22.0 22.0 100.0 (('a' x 10000) . 'b') =~ /[^a]/ blead nanyof Ratio % -------- -------- -------- Ir 642718.0 12650.0 5080.8 Dr 170828.0 2048.0 8341.2 Dw 110520.0 500.0 22104.0 COND 80423.0 2918.0 2756.1 IND 22.0 22.0 100.0 (('a' x 100000) . 'b') =~ /[^a]/ blead nanyof Ratio % -------- -------- -------- Ir Inf 102654.8 6237.1 Dr Inf 13299.3 12788.9 Dw Inf 500.9 219708.7 COND 800424.1 25419.1 3148.9 IND 22.0 22.0 100.0
* regcomp.sym: longj field is a booleanKarl Williamson2018-11-161-4/+6
| | | | | | | | | | | | | | | The comments could lead one to thinking one could specify any of the argument fields that nodes can have. But in fact, the value is a boolean, 0 meaning to use the normal offset field of all regnodes; and 1 meaning to use the ARG field that some regnodes have. If a regnode had more than just the one argument field, the one that corresponds to that would be used. This commit enforces that, and changes regcomp.sym to not use '2', which is misleading. It clarifies the comments about this and what '.' means in the flags field
* regcomp.sym: Add lengths for ANYOF nodesKarl Williamson2018-10-201-4/+4
| | | | | | This changes regcomp.sym to generate the correct lengths for ANYOF nodes, which means they don't have to be special cased in regcomp.c, leading to simplification
* regcomp.sym: Add node type ANYOF_POSIXLKarl Williamson2018-10-201-0/+1
| | | | | | This is like ANYOFL, but has runtime matches of /[[:posix:]]/ in it, which requires extra space. Adding this will allow a future commit to simplify handling for ANYOF nodes.
* S_regmatch(): combine CURLY_B_min/_known statesDavid Mitchell2018-08-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are currently two similar backtracking states for simple non-greedy pattern repeats: CURLY_B_min CURLY_B_min_known the latter is a variant of the former for when the character which must follow the repeat is known, e.g. /(...)*?X.../, which allows quick skipping to the next viable position. The code for the two cases: case CURLY_B_min_fail: case CURLY_B_min_known_fail: share a lot of similarities. This commit merges the two states into a single CURLY_B_min state, with an associated single CURLY_B_min_fail fail state. That one code block can handle both types, with a single if (ST.c1 == CHRTEST_VOID) ... test to choose between the two variant parts of the code. This makes the code smaller and more maintainable, at the cost of one extra test per backtrack.
* Spelling correction for consistency with pod/perldebguts.pod.James E Keenan2018-04-081-1/+1
|
* Change name of regnode for clarityKarl Williamson2018-02-161-2/+2
| | | | | | | The EXACTFA nodes are in fact not generated by /a, but by /aa. Change the name to EXACTFAA to correspond. I found myself getting confused by this.
* recomp.sym: Add ANYOFM regnodeKarl Williamson2018-01-301-0/+1
| | | | | This uses a mask instead of a bitmap, and is restricted to representing invariant characters under UTF-8 that meet particular bit patterns.
* regcomp.sym: Add regnodes for [[:ascii:]]Karl Williamson2017-12-291-0/+3
| | | | These will be used in a future commit
* regcomp.sym: Add nodes for script runsKarl Williamson2017-12-241-0/+2
| | | | To be used in the implementation thereof.
* regcomp.sym: Clarify regnode commentKarl Williamson2017-12-161-1/+1
|
* clear savestack on (?{...}) failure and backtrackDavid Mitchell2017-02-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | RT #126697 In a regex, after executing a (?{...}) code block, if we fail and backtrack over the codeblock, we're supposed to unwind the savestack, so that for any example any local()s within the code block are undone. It turns out that a backtracking state isn't pushed for (?{...}), only for postponed evals ( i.e. (??{...})). This means that it relies on one of the earlier backtracking states to clear the savestack on its behalf. This can't always be relied upon, and the ticket above contains code where this falls down; in particular: 'ABC' =~ m{ \A (?: (?: AB | A | BC ) (?{ local $count = $count + 1; print "! count=$count; ; pos=${\pos}\n"; }) )* \z }x Here we end up relying on TRIE_next to do the cleaning up, but TRIE_next doesn't, since there's nothing it would be responsible for that needs cleaning up. The solution to this is to push a backtrack state for every (?{...}) as well as every (??{...}). The sole job of that state is to do a LEAVE_SCOPE(ST.lastcp). The existing backtrack state EVAL_AB has been renamed EVAL_postponed_AB to make it clear it's only used on postponed /(??{A})B/ regexes, and a new state has been added, EVAL_B, which is only called when backtracking after failing something in the B in /(?{...})B/.
* Unify GOSTART and GOSUBYves Orton2016-03-061-1/+0
| | | | | | | | | | | | | | | | GOSTART is a special case of GOSUB, we can remove a lot of offset twiddling, and other special casing by unifying them, at pretty much no cost. GOSUB has 2 arguments, ARG() and ARG2L(), which are interpreted as a U32 and an I32 respectively. ARG() holds the "parno" we will recurse into. ARG2L() holds a signed offset to the relevant start node for the recursion. Prior to this patch the argument to GOSUB would always be >=, and unlike other parts of our logic we would not use 0 to represent "start/end" of pattern, as GOSTART would be used for "recurse to beginning of pattern", after this patch we use 0 to represent "start/end", and a lot of complexity "goes away" along with GOSTART regops.