path: root/regcomp.sym
Commit message (Author, Age, Files, Lines)
* Add ANYOFHr regnode (Karl Williamson, 2019-06-26, 1 file, -0/+1)

    This commit adds a new regnode, ANYOFHr, like ANYOFH, but it also has a loose upper bound for the first UTF-8 byte matchable by the node. (The 'r' stands for 'range'.) It would be nice to have a tight upper bound, but doing so requires 4 more bits than are available without changing the node argument types, and hence increasing the node size. Having a loose bound is better than no bound, and comes essentially free, by using two unused bits in the current ANYOFH node, and requiring only a few extra, pipeline-able, mask, etc. instructions at run time, with no extra conditionals. Any ANYOFH nodes that would benefit from having an upper bound will instead be compiled into this node type.

    Its use is based on the following observations.

    There are 64 possible start bytes, so the full range can be expressed in 6 bits. This means that the flags field in ANYOFH nodes containing the start byte has two extra bits that can be used for something else.

    An ANYOFH node only happens when there is no matching code point in the bit map, so the smallest code point that could match is 256. The start byte for that is C4, so there are actually only 60 possible start bytes.

    (perl can be compiled with a larger bit map, in which case the minimum start byte would be even higher.)

    A second observation is that knowing the highest start byte is above F0 is no better than knowing it's F0. This is because the highest code point whose start byte is F0 is U+3FFFF, and the currently allocated code points above that are all very specialized and rarely encountered. And there's no likelihood of that changing anytime soon, as there's plenty of unallocated space below that. So if an ANYOFH node's highest start byte is F0 or above, there's no advantage to knowing what the actual maximum possible start byte is, so leave it as ANYOFH.

    That means the highest start byte we care about in ANYOFHr is EF. That cuts the number of start bytes we care about down to 43, still 6 bits required to represent them, but it allows for the following scheme:

    Populate the flags field by subtracting C0 from the lowest start byte and shifting left 2 bits. That leaves the bottom two bits unused. We use them as follows, where x is the start byte of the lowest code point in the node:

        bits
        ----
         11   The upper limit of the range can be as much as (EF - x) / 8
         10   The upper limit of the range can be as much as (EF - x) / 4
         01   The upper limit of the range can be as much as (EF - x) / 2
         00   The upper limit of the range can be as much as EF

    (As the examples below show, each quotient is the distance above x, so the bound itself is x plus that quotient.)

    That partitions the loose upper bound into 4 possible ranges, with it being tighter the closer it is to the strict lower bound. This makes the loose upper bound more meaningful when there is most to gain by having one.

    Some examples of what the various upper bounds would be for all the possibilities of these two bits are:

                     Upper bound given the 2 bits
        Low bound     11    10    01    00
        ---------     --    --    --    --
           C4         C9    CE    D9    EF
           D0         D3    D7    DF    EF
           E0         E1    E3    E7    EF

    Start bytes of E0 and above represent many more code points each than lower ones, as they are 3 byte sequences instead of two. This scheme provides tighter bounds for them, which is also a point in its favor.

    Thus we have provided a loose upper bound using two otherwise unused bits. An alternate scheme could have had the intervals all the same, but this provides a tighter bound when it makes the most sense to.

    For EBCDIC the range is from C8 to F4.

    Tests will be added in a later commit.
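The flags layout described above can be sketched in a few lines. This is only an illustration in Python, not code from the Perl source: the function names are hypothetical, the bound formula (x plus the shifted distance to EF) is inferred from the example table in the commit message, and `choose_bits` is an assumed way a compiler might pick the selector.

```python
# Illustrative sketch only; names are hypothetical, not from regcomp.c.
MAX_START = 0xEF  # highest start byte the ANYOFHr scheme cares about
BASE = 0xC0       # value subtracted from the lowest start byte before shifting


def pack_flags(low_start, bits):
    """Pack the lowest start byte and the 2-bit range selector into one byte."""
    assert BASE <= low_start <= MAX_START and 0 <= bits <= 3
    return ((low_start - BASE) << 2) | bits


def unpack_bounds(flags):
    """Decode (lowest start byte, loose upper bound) from the flags byte."""
    low = (flags >> 2) + BASE
    bits = flags & 0b11
    # bits 00 -> whole distance to EF, 01 -> half, 10 -> quarter, 11 -> eighth
    upper = low + ((MAX_START - low) >> bits)
    return low, upper


def choose_bits(low_start, high_start):
    """Assumed policy: tightest selector whose bound still covers high_start."""
    for bits in (3, 2, 1, 0):
        if low_start + ((MAX_START - low_start) >> bits) >= high_start:
            return bits
    raise ValueError("highest start byte is above EF; node would stay ANYOFH")
```

Decoding needs only a shift, a mask, a subtraction and an addition, which matches the commit's point that no extra conditionals are required at run time.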
* regex: Add lower bound to ANYOFH nodes UTF-8 byte (Karl Williamson, 2019-06-26, 1 file, -1/+1)

    This commit adds a lower bound for the first UTF-8 byte matchable by an ANYOFH node. The flags field is otherwise unused, and using it for this purpose allows code to rule out match possibilities without having to convert from UTF-8 to code point.

    It might be better to do the inverse instead, to have the field be an upper bound. The reason is that the conversion is cheap for smaller numbers. The commit following mostly addresses this.
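A minimal sketch of why a lower bound on the first byte helps, assuming nothing about the real regexec.c code: because UTF-8 preserves code point order, the first byte of the smallest code point in the class is a lower bound for the first byte of every member. The helper names below are hypothetical.

```python
# Illustrative sketch only; helper names are hypothetical.
def first_utf8_byte(code_point):
    """First byte of the UTF-8 encoding of a (non-surrogate) code point."""
    return chr(code_point).encode("utf-8")[0]


def lower_bound_byte(code_points):
    """UTF-8 sorts like code points, so the minimum gives the lowest first byte."""
    return first_utf8_byte(min(code_points))


def cannot_match(target_first_byte, lower_bound):
    """True if the character starting with this byte can be rejected undecoded."""
    return target_first_byte < lower_bound


# Example class: the Cyrillic block U+0400..U+04FF.
cyrillic_bound = lower_bound_byte(range(0x400, 0x500))
```

With this bound, a target character such as "é" (first byte C3) is rejected by a single byte comparison, while "Ж" (first byte D0) still has to be examined further.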
* Use inRANGE for seeing if node is an ANYOFH type (Karl Williamson, 2019-06-26, 1 file, -0/+3)

    This is easier to read, especially when a third type is added a few commits ahead.
* regcomp.sym: Change regnode description (Karl Williamson, 2019-06-25, 1 file, -1/+1)

    Simplify the description for ANYOFb
* Split ANYOFH regnode into two types (Karl Williamson, 2019-05-31, 1 file, -1/+2)

    ANYOFHb will be for nodes where all the matching code points share the first UTF-8 byte. ANYOFH will be for all others. Neither of these has a bitmap.

    I noticed that we can omit some execution conditionals by splitting the nodes.
* regcomp.sym: Fix typo in comment (Karl Williamson, 2019-05-24, 1 file, -1/+1)
* regnodes.h: Change some regnodes' names (Karl Williamson, 2019-05-24, 1 file, -6/+6)

    These were misleading, as elsewhere a leading 'N' in the name means the complement. Instead, move the N to the end of the name.
* Add common UTF-8 first byte to ANYOFH regnodes (Karl Williamson, 2019-03-20, 1 file, -1/+1)

    An ANYOFH regnode is generated instead of a plain ANYOF one when nothing it can match is in the bitmap used in ANYOF nodes. It is therefore smaller, as the 4 word (or more) bitmap is omitted. This means that for it to match a target string, that string must be UTF-8 (since the bitmap is for at least the lowest 256 code points). And only in rare circumstances are there any flags associated with it in the regnode flags field.

    This commit changes things so that the flags field in an ANYOFH node is repurposed to be the first UTF-8 encoded byte of every code point matched by the class if there is a common byte for all of them; or 0 if some have different first bytes. (That means that those rare cases where the flags field isn't otherwise empty can no longer be ANYOFH nodes.)

    The purpose of this is so that a future commit can take advantage of this, and more quickly scan the target string for places that this node can match. A problem with ANYOF nodes is that they are code point oriented (U32 or U64), and the target string is UTF-8, so conversion has to be done. By having the partial conversion compiled in, we can look for that at runtime instead of having to look at every character in the scan.
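As a rough illustration of the idea (not the actual implementation, and with hypothetical function names), the common first byte can be computed at compile time and then used to find candidate positions in the UTF-8 target with plain byte comparisons:

```python
# Illustrative sketch only; function names are hypothetical.
def common_first_byte(code_points):
    """Shared first UTF-8 byte of all code points, or 0 if they differ."""
    firsts = {chr(cp).encode("utf-8")[0] for cp in code_points}
    return firsts.pop() if len(firsts) == 1 else 0


def candidate_positions(target, first_byte):
    """Byte offsets in a UTF-8 target where a matching character could start."""
    return [i for i, byte in enumerate(target) if byte == first_byte]


# U+0410..U+042F (Cyrillic capitals) all start with byte D0.
capitals_byte = common_first_byte(range(0x410, 0x430))
positions = candidate_positions("abc Жx".encode("utf-8"), capitals_byte)
```

Returning 0 for "no common byte" is safe as a sentinel here because 0 can never be the first byte of a code point above 255.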
* Implement variable length lookbehind in regex patterns (Karl Williamson, 2019-03-18, 1 file, -2/+2)

    See [perl #132367].
* regnodes.h, perldebguts: Shorten some descriptions (Karl Williamson, 2019-03-14, 1 file, -10/+10)
* regcomp.sym: Note specialized use of 'flags' in 2 OPs (Karl Williamson, 2018-12-30, 1 file, -2/+2)
* Add new regnode: ANYOFH, without a bitmap (Karl Williamson, 2018-12-26, 1 file, -0/+1)

    This commit adds a regnode for the case where nothing in the bit map has matches. This allows the bitmap to be omitted, saving 32 bytes of otherwise wasted space per node. Many non-Latin Unicode properties have this characteristic. Further, since this node applies only to code points above 255, which are representable only in UTF-8, we can trivially fail a match where the target string isn't in UTF-8. Time savings also accrue from skipping the bitmap look-up. When swashes are removed, even more time will be saved.
* Remove ASCII/NASCII regnodes (Karl Williamson, 2018-12-26, 1 file, -3/+0)

    The ANYOFM/NANYOFM regnodes are generalizations of these. They have more masks and shifts than the removed nodes, but not more branches, so are effectively the same speed. Remove the ASCII/NASCII nodes in favor of having less code to maintain.
* regcomp.c: Simplify handling of EXACTFish nodes with 's' at edge (Karl Williamson, 2018-12-26, 1 file, -5/+1)

    Commit 8a100c918ec81926c0536594df8ee1fcccb171da created node types for handling an 's' at the leading edge, at the trailing edge, and at both edges, for nodes under /di in which nothing else would prevent them from being EXACTFU nodes. If two of these get joined, it could create an 'ss' sequence which can't be an EXACTFU node, for U+DF would match them unconditionally. Instead, under /di it should match if and only if the target string is UTF-8 encoded.

    I realized later that having three types becomes harder to deal with when adding yet more node types, so this commit turns the three into just one node type, indicating that at least one edge of the node is an 's'.

    It also simplifies the parsing of the pattern and determining which node to use.
* Collapse regnode EXACTFU_SS into EXACTFUP (Karl Williamson, 2018-12-26, 1 file, -1/+0)

    EXACTFUP was created by the previous commit to handle a problematic case in which not all the code points in an EXACTFU node are /i foldable at compile time. Doing so will allow a future commit to use the pre-folded EXACTFU nodes (done in a prior commit), saving execution time for the common case. The only problematic code point is the MICRO SIGN. Most patterns don't use this character.

    EXACTFU_SS is problematic in a different way. It contains the sequence 'ss' which is folded to by LATIN SMALL LETTER SHARP S, but everything in it can be pre-folded (unless it also contains a MICRO SIGN). The reason this is problematic is that it is the only non-UTF-8 node where the length in folding can change. To process it at runtime, the more general fold equivalence function is used that is capable of handling length disparities, but is slower than the functions otherwise used for non-UTF-8.

    What I've chosen to do for now is to make a single node type for all the problematic cases (which at this time means just the two aforementioned ones). If we didn't do this, we'd have to add a third node type for patterns that contain both 'ss' and MICRO. Or artificially split the pattern so the two never were in the same node, but we can't do that because it can cause bugs in handling multi-character folds. If more special handling is found to be needed, there'd be a combinatorial explosion of additional node types to handle all possible combinations.

    What this effectively means is that the slower, more general foldEQ function is used for portions of patterns containing the MICRO sign when the pattern isn't in UTF-8, even though there is no inherent reason to do so for non-UTF-8 strings that don't also contain the 'ss' sequence.
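The two problematic cases can be observed with any implementation of Unicode full case folding. The sketch below uses Python's `str.casefold` as a stand-in for Perl's folding (an assumption made for illustration; it is not the regex engine's code) and scans the code points 0..255, i.e. those representable without UTF-8:

```python
# Illustrative sketch only; str.casefold stands in for Perl's /i folding.
MICRO_SIGN = "\u00b5"
SHARP_S = "\u00df"


def fold_needs_utf8(ch):
    """True if the full case fold of ch contains a code point above 255."""
    return any(ord(c) > 255 for c in ch.casefold())


def fold_changes_length(ch):
    """True if the full case fold of ch is not a single character."""
    return len(ch.casefold()) != 1


# Scan everything representable in a non-UTF-8 (one byte per character) string.
needs_utf8 = [cp for cp in range(256) if fold_needs_utf8(chr(cp))]
changes_length = [cp for cp in range(256) if fold_changes_length(chr(cp))]
```

Under these folding rules the first list contains only U+00B5 (which folds to U+03BC, GREEK SMALL LETTER MU) and the second only U+00DF (which folds to "ss"), matching the two cases the commit message singles out.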
* Add regnode EXACTFUP, for problematic (Karl Williamson, 2018-12-26, 1 file, -1/+1)

    If a non-UTF-8 pattern contains a MICRO SIGN, this special node is now created. This character is the only one not needing UTF-8 to represent, but its fold does need UTF-8, which causes some issues, so it has to be specially handled. When matching against a non-UTF-8 target string, the pattern is effectively folded, but not if the target is UTF-8.

    By creating this node, we can remove the special handling required for the nodes that don't have a MICRO SIGN, in a future commit.
* regexec.c: Most /iaa nodes are now pre-folded (Karl Williamson, 2018-12-26, 1 file, -0/+7)

    So, we don't have to re-fold them.

    Previous commits have caused any EXACTFAA nodes to be pre-folded, and we now have the infrastructure in regexec.c to take advantage of this, including in non-UTF-8 patterns. This commit changes to do this.

    The only non-pre-folded EXACTFAA nodes are those that are not UTF-8, but the target string is. The reason is that the MICRO SIGN folds to something representable only in UTF-8, so if you have both non-UTF-8, it effectively is folded, and if you have the pattern in UTF-8, it gets folded to the proper character.

    In order for non-UTF-8 /iaa nodes to always be fully folded, there would need to be a separate node for ones that contain the MICRO SIGN, and then only that one wouldn't be considered folded when the target is UTF-8. I don't think it's worth it, as the only gain would be in matching a non-UTF-8 /iaa node against a UTF-8 target string. I suspect /iaa will be used mostly in non-UTF8 target strings. Comments have been added to point this out in case someone thinks it should be implemented.
* regcomp.c: Generate EXACTFU_SS only for non-UTF8 (Karl Williamson, 2018-12-26, 1 file, -1/+1)

    It turns out that now, the regular methods for handling multi-character folds work for the ones involving LATIN SMALL LETTER SHARP S when the pattern is in UTF-8. So the special code for handling this case can be removed, and a regular EXACTFU node is generated. This has the advantage of being trie-able, and requiring fewer operations at run time, as the pattern is pre-folded at compile time, and doesn't have to be re-folded during each backtracking at run-time.

    This means that the EXACTFU_SS node type will only be generated for non-UTF-8 patterns, and the handling of it is unchanged in these cases.
* regcomp.c: Allow more EXACTFish nodes to be trieable (Karl Williamson, 2018-12-07, 1 file, -0/+6)

    The previous two commits fixed bugs where it would be possible during optimization to join two EXACTFish nodes together, and the result would not work properly with LATIN SMALL LETTER SHARP S. But by doing so, the commits prevented all non-UTF-8 EXACTFU nodes that begin or end with [Ss] from being trieable.

    This commit changes things so that the only ones that are non-trieable are the ones that, when joined, have the sequence [Ss][Ss] in them. To do so, I created three new node types that indicate if the node begins with [Ss] or ends with them, or both. These preclude having to examine the node contents at joining to determine this. And since there are plenty of node types available, it seemed the best choice. But other options would be available should we run out of nodes. Examining the first and final characters of a node is not expensive, for example.
* regcomp.sym: Clarify descriptions of EXACTish regnodes (Karl Williamson, 2018-12-06, 1 file, -9/+10)
* Add regnode EXACTFU_ONLY8 (Karl Williamson, 2018-11-27, 1 file, -0/+3)

    This is a regnode that otherwise would be an EXACTFU except that it contains a code point that requires UTF-8 to match, including all the possible folds involving it. Hence if the target string isn't UTF-8, we know it can't possibly match, without needing to try.

    For completeness, there could also be an EXACTFAA_ONLY8 and an EXACTFL_ONLY8 created, but I think these are unlikely to actually appear in the wild, since using /aa is mainly about ASCII, and /l mostly will involve characters that don't require UTF-8.
* Add regnode EXACT_ONLY8 (Karl Williamson, 2018-11-27, 1 file, -0/+2)

    This is a regnode that otherwise would be an EXACT except that it contains a code point that requires UTF-8 to represent. Hence if the target string isn't UTF-8, we know it can't possibly match, without needing to try.
* Add regnode NANYOFM (Karl Williamson, 2018-11-17, 1 file, -0/+1)

    This matches when the existing node ANYOFM would not match; i.e., they are complements.

    I almost didn't create this node, but it turns out to significantly speed up various classes of matches. For example qr/[^g]/, both /i and not, turn into this node; and something like

        (("a" x $large_number) . "b") =~ /[^a]/

    goes through the string a word at a time, instead of previously byte-by-byte. Benchmarks are at the end of this message.

    This node gets generated when complementing any single ASCII character and when complementing any ASCII case pair, like /[^Gg]/. It never gets generated if the class includes a character that isn't ASCII (actually UTF-8 invariant, which matters only on EBCDIC platforms).

    The details of when this node gets constructed are complicated. It happens when the bit patterns of the characters in the class happen to have certain very particular characteristics, depending on the vagaries of the character set. [BbCc] will do so, but [AaBb] does not. [^01] does, but not [^12]. Precisely, look at all the bit patterns of the characters in the set, and count the total number of differing bits, calling it 'n'. If and only if the number of characters is 2**n, this node gets generated. As an example, on both ASCII and EBCDIC, the last 4 bits of '0' are 0000; of '1' are 0001; of '2' are 0010; and of '3' are 0011. The other 4 bits are the same for each of these 4 digits. That means that only 2 bits differ among the 4 characters, and 2**2==4, so the NANYOFM node will get generated. Similarly, 8=1000 and 0=0000 differ only in one bit so 2**1==2, and so [^08] will generate this node.

    We could consider in the future an extension where, if the input doesn't work to generate this node, we construct the closure of that input to generate this node, which would have false positives that would have to be tested for. The speedup of this node is so significant that that could still be faster than what we have today.

    The benchmarks are for a 64-bit word. 32-bits would not be as good.

    Key:
        Ir    Instruction read
        Dr    Data read
        Dw    Data write
        COND  conditional branches
        IND   indirect branches

    The numbers (except for the final column) represent raw counts per loop iteration. The higher the number in the final column, the faster.

    (('a' x 1) . 'b') =~ /[^a]/

                 blead   nanyof  Ratio %
              -------- -------- --------
          Ir    2782.0   2648.0    105.1
          Dr     845.0    799.0    105.8
          Dw     531.0    500.0    106.2
        COND     431.0    419.0    102.9
         IND      22.0     22.0    100.0

    (('a' x 10) . 'b') =~ /[^a]/

                 blead   nanyof  Ratio %
              -------- -------- --------
          Ir    3358.0   2671.0    125.7
          Dr     998.0    801.0    124.6
          Dw     630.0    500.0    126.0
        COND     503.0    424.0    118.6
         IND      22.0     22.0    100.0

    (('a' x 100) . 'b') =~ /[^a]/

                 blead   nanyof  Ratio %
              -------- -------- --------
          Ir    9118.0   2773.0    328.8
          Dr    2528.0    814.0    310.6
          Dw    1620.0    500.0    324.0
        COND    1223.0    450.0    271.8
         IND      22.0     22.0    100.0

    (('a' x 1000) . 'b') =~ /[^a]/

                 blead   nanyof  Ratio %
              -------- -------- --------
          Ir   66718.0   3650.0   1827.9
          Dr   17828.0    923.0   1931.5
          Dw   11520.0    500.0   2304.0
        COND    8423.0    668.0   1260.9
         IND      22.0     22.0    100.0

    (('a' x 10000) . 'b') =~ /[^a]/

                 blead   nanyof  Ratio %
              -------- -------- --------
          Ir  642718.0  12650.0   5080.8
          Dr  170828.0   2048.0   8341.2
          Dw  110520.0    500.0  22104.0
        COND   80423.0   2918.0   2756.1
         IND      22.0     22.0    100.0

    (('a' x 100000) . 'b') =~ /[^a]/

                 blead   nanyof  Ratio %
              -------- -------- --------
          Ir       Inf 102654.8   6237.1
          Dr       Inf  13299.3  12788.9
          Dw       Inf    500.9 219708.7
        COND  800424.1  25419.1   3148.9
         IND      22.0     22.0    100.0
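The "2**n" rule stated in this message can be checked mechanically. The following is a small sketch under the assumption of an ASCII platform; the function names are hypothetical and the code only tests the stated condition, it does not reproduce how regcomp.c builds the node.

```python
# Illustrative sketch only; function names are hypothetical (ASCII assumed).
def differing_bits(chars):
    """Bit mask of every bit position on which the characters disagree."""
    codes = [ord(c) for c in chars]
    all_or = 0
    all_and = 0xFF
    for code in codes:
        all_or |= code
        all_and &= code
    # A bit differs if it is set in some character but not in all of them.
    return all_or ^ all_and


def fits_mask_node(chars):
    """True if the class has exactly 2**n members, n = number of differing bits."""
    members = set(chars)
    n = bin(differing_bits(members)).count("1")
    return len(members) == 1 << n
```

When the condition holds, the class is exactly the set of bytes that agree with its members on all the non-differing bits, which is why a single mask-and-compare can replace a bitmap look-up.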
* regcomp.sym: longj field is a boolean (Karl Williamson, 2018-11-16, 1 file, -4/+6)

    The comments could lead one to think one could specify any of the argument fields that nodes can have. But in fact, the value is a boolean, 0 meaning to use the normal offset field of all regnodes; and 1 meaning to use the ARG field that some regnodes have. If a regnode had more than just the one argument field, the one that corresponds to that would be used.

    This commit enforces that, and changes regcomp.sym to not use '2', which is misleading.

    It clarifies the comments about this and what '.' means in the flags field.
* regcomp.sym: Add lengths for ANYOF nodes (Karl Williamson, 2018-10-20, 1 file, -4/+4)

    This changes regcomp.sym to generate the correct lengths for ANYOF nodes, which means they don't have to be special cased in regcomp.c, leading to simplification.
* regcomp.sym: Add node type ANYOF_POSIXL (Karl Williamson, 2018-10-20, 1 file, -0/+1)

    This is like ANYOFL, but has runtime matches of /[[:posix:]]/ in it, which requires extra space. Adding this will allow a future commit to simplify handling for ANYOF nodes.
* S_regmatch(): combine CURLY_B_min/_known states (David Mitchell, 2018-08-26, 1 file, -1/+1)

    There are currently two similar backtracking states for simple non-greedy pattern repeats:

        CURLY_B_min
        CURLY_B_min_known

    the latter is a variant of the former for when the character which must follow the repeat is known, e.g. /(...)*?X.../, which allows quick skipping to the next viable position.

    The code for the two cases:

        case CURLY_B_min_fail:
        case CURLY_B_min_known_fail:

    share a lot of similarities. This commit merges the two states into a single CURLY_B_min state, with an associated single CURLY_B_min_fail fail state.

    That one code block can handle both types, with a single

        if (ST.c1 == CHRTEST_VOID) ...

    test to choose between the two variant parts of the code. This makes the code smaller and more maintainable, at the cost of one extra test per backtrack.
* Spelling correction for consistency with pod/perldebguts.pod. (James E Keenan, 2018-04-08, 1 file, -1/+1)
* Change name of regnode for clarity (Karl Williamson, 2018-02-16, 1 file, -2/+2)

    The EXACTFA nodes are in fact not generated by /a, but by /aa. Change the name to EXACTFAA to correspond. I found myself getting confused by this.
* regcomp.sym: Add ANYOFM regnode (Karl Williamson, 2018-01-30, 1 file, -0/+1)

    This uses a mask instead of a bitmap, and is restricted to representing invariant characters under UTF-8 that meet particular bit patterns.
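To make "a mask instead of a bitmap" concrete, here is a hedged sketch of mask-based matching. The field names, the helper names, and the way the mask is derived are assumptions for illustration; the commit message does not describe the node's actual layout.

```python
# Illustrative sketch only; names and derivation are assumptions.
def build_mask(chars):
    """Return (mask, value) so that (byte & mask) == value selects the class."""
    codes = [ord(c) for c in chars]
    differing = 0
    for code in codes:
        differing |= code ^ codes[0]
    mask = ~differing & 0xFF  # keep only the bits every member agrees on
    return mask, codes[0] & mask


def mask_matches(byte, mask, value):
    """One AND and one compare per byte; no table look-up is needed."""
    return (byte & mask) == value


mask, value = build_mask("Gg")
matched = {b for b in range(256) if mask_matches(b, mask, value)}
```

For the case pair "Gg" the only differing bit is 0x20, so the mask is 0xDF and exactly the two bytes 0x47 and 0x67 pass the test.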
* regcomp.sym: Add regnodes for [[:ascii:]] (Karl Williamson, 2017-12-29, 1 file, -0/+3)

    These will be used in a future commit.
* regcomp.sym: Add nodes for script runs (Karl Williamson, 2017-12-24, 1 file, -0/+2)

    To be used in the implementation thereof.
* regcomp.sym: Clarify regnode comment (Karl Williamson, 2017-12-16, 1 file, -1/+1)
* clear savestack on (?{...}) failure and backtrack (David Mitchell, 2017-02-14, 1 file, -1/+1)

    RT #126697

    In a regex, after executing a (?{...}) code block, if we fail and backtrack over the codeblock, we're supposed to unwind the savestack, so that, for example, any local()s within the code block are undone.

    It turns out that a backtracking state isn't pushed for (?{...}), only for postponed evals (i.e. (??{...})). This means that it relies on one of the earlier backtracking states to clear the savestack on its behalf. This can't always be relied upon, and the ticket above contains code where this falls down; in particular:

        'ABC' =~ m{
            \A
            (?:
                (?: AB | A | BC )
                (?{
                    local $count = $count + 1;
                    print "! count=$count; ; pos=${\pos}\n";
                })
            )*
            \z
        }x

    Here we end up relying on TRIE_next to do the cleaning up, but TRIE_next doesn't, since there's nothing it would be responsible for that needs cleaning up.

    The solution to this is to push a backtrack state for every (?{...}) as well as every (??{...}). The sole job of that state is to do a LEAVE_SCOPE(ST.lastcp).

    The existing backtrack state EVAL_AB has been renamed EVAL_postponed_AB to make it clear it's only used on postponed /(??{A})B/ regexes, and a new state has been added, EVAL_B, which is only called when backtracking after failing something in the B in /(?{...})B/.
* Unify GOSTART and GOSUB (Yves Orton, 2016-03-06, 1 file, -1/+0)

    GOSTART is a special case of GOSUB, we can remove a lot of offset twiddling, and other special casing by unifying them, at pretty much no cost.

    GOSUB has 2 arguments, ARG() and ARG2L(), which are interpreted as a U32 and an I32 respectively. ARG() holds the "parno" we will recurse into. ARG2L() holds a signed offset to the relevant start node for the recursion.

    Prior to this patch the argument to GOSUB would always be >=, and unlike other parts of our logic we would not use 0 to represent "start/end" of pattern, as GOSTART would be used for "recurse to beginning of pattern", after this patch we use 0 to represent "start/end", and a lot of complexity "goes away" along with GOSTART regops.
* fix perl #126186 make all verbs allow an optional arg (Yves Orton, 2015-10-05, 1 file, -3/+2)

    In perl #126186 it was pointed out we had started allowing name arguments for verbs where we did not document them to be supported, albeit in an inconsistent way. The previous patch cleaned up some of the cause of this, but it seems better to just generally allow the existing verbs to all support a mark name argument.

    So this patch reverses the effect of the previous patch, and makes all verbs, FAIL, ACCEPT, etc, allow an optional argument, and set REGERROR/REGMARK appropriately as well.
* Add ANYOFD regex node (Karl Williamson, 2015-08-24, 1 file, -0/+1)

    This is like an ANYOF node, but just for when /d is in effect. It will be used in future commits.
* perldebguts: Add clarification (Karl Williamson, 2015-08-24, 1 file, -1/+1)
* remove deprecated /\C/ RE character class (David Mitchell, 2015-06-19, 1 file, -1/+0)

    This horrible thing broke encapsulation and was as buggy as a very buggy thing. It's been officially deprecated since 5.20.0 and now it can finally die die die!!!!
* regcomp.sym: Update \b descriptions (Karl Williamson, 2015-03-18, 1 file, -7/+7)
* Add qr/\b{gcb}/ (Karl Williamson, 2015-02-19, 1 file, -8/+8)

    A function implements seeing if the space between any two characters is a grapheme cluster break. After I wrote this, I realized that an array lookup might be a better implementation, but the deadline for v5.22 was too close to change it. I did see that my gcc optimized it down to an array lookup.

    This makes the implementation of \X go from being complicated to trivial.
* Add regex nodes for locale (Karl Williamson, 2014-12-29, 1 file, -0/+3)

    These will be used in a future commit to distinguish between /l patterns vs non-/l.
* Nits in comments (Karl Williamson, 2014-12-29, 1 file, -0/+3)
* Eliminate unused BACK regnode (Aaron Crane, 2014-09-29, 1 file, -8/+1)
* Up regex flags limit for (??{}) (Karl Williamson, 2014-09-29, 1 file, -1/+1)

    Previously the regex pattern compilation flags needed for this construct would fit into an 8-bit byte. This conveniently fits into the flags structure element of a regnode. There are changes coming that require more than 8 bits, so in preparation, this commit adds an argument to the node that implements (??{}) (31-bits usable for flags), and moves the storage to that.
* regcomp.sym: ANYOF nodes have an argument (Karl Williamson, 2014-09-29, 1 file, -1/+1)

    Plus a bitmap, but they always have an argument besides, contrary to what was specified here. Future commits rely on this, whereas heretofore this error was harmless.
* Perl RT #122761 - split /\A/ should not behave like split /^/m (Yves Orton, 2014-09-17, 1 file, -0/+1)

    Long ago a weird special case was hacked into split so that it treated C<split /^/> as if it was C<split /^/m>. At the time this was done by letting the split PP code inspect the pattern, and IFF it matched "^\0" the special behavior was enabled (which also bypasses using the regex engine for matching.)

    Later on when we added pluggable regex engines and when we encountered various counter-intuitive behaviors related to split we changed how this worked so that the regex engine would set flags appropriate for split to use. This meant that regex plugins using totally different regex syntax could still enable the optimisation.

    At the same time I modified how we detected this pattern type by looking at the *compiled* regops, and not the raw pattern. This had the side effect of making things like C< split /(?:)^/ > also enable the optimisation.

    Unfortunately this did not play nicely with the fact that /^/ produces an SBOL node, as does /\A/, but we definitely don't want C<split /\A/> to behave like C<split /^/m>. In fact C<split /\A/> should behave like a noop (which means there is room for a future optimisation here if someone cares to implement it.)

    In the discussion attached to the ticket I propose what I consider to be a better fix: default split patterns to be compiled with the /m modifier enabled. This patch does NOT do this. It is instead the "simple" patch. This means that C<split /^/> behaves like C<split /^/m> but C<split /^x/> does NOT behave like C<split /^x/m>, which I consider to be a bug which I will fix in a future patch.
* Eliminate the duplicative regops BOL and EOL (Yves Orton, 2014-09-17, 1 file, -15/+19)

    See also perl5porters thread titled: "Perl MBOLism in regex engine"

    In the perl 5.000 release (a0d0e21ea6ea90a22318550944fe6cb09ae10cda) the BOL regop was split into two behaviours MBOL and SBOL, with SBOL and BOL behaving identically. Similarly the EOL regop was split into two behaviors SEOL and MEOL, with EOL and SEOL behaving identically. This then resulted in various duplicative code related to flags and case statements in various parts of the regex engine.

    It appears that perhaps BOL and EOL were kept because they are the type ("regkind") for SBOL/MBOL and SEOL/MEOL/EOS. Reworking regcomp.pl to handle aliases for the type data so that SBOL/MBOL are of type BOL, even though BOL == SBOL seems to cover that case without adding to the confusion.

    This means two regops, a regstate, and an internal regex flag can be removed (and used for other things), and various logic relating to them can be removed.

    For the uninitiated, SBOL is /^/ and /\A/ (with or without /m) and MBOL is /^/m. (I consider it a fail we have no way to say MBOL without the /m modifier). Similarly SEOL is /$/ and MEOL is /$/m (there is also a /\z/ which is EOS "end of string" with or without the /m).
* Change 'semantics' to 'rules' (Karl Williamson, 2014-02-20, 1 file, -12/+12)

    The term 'semantics' in documentation when applied to character sets is changed to 'rules' as being a shorter less-jargony synonym in this case. This was discussed several releases ago, but I didn't get around to it.
* Revert "Free up bit for regex ANYOF nodes" (Karl Williamson, 2014-02-15, 1 file, -1/+0)

    This reverts commit 34fdef848b1687b91892ba55e9e0c3430e0770f6, and adds comments referring to it, in case it is ever needed.