path: root/regcomp.sym
Commit message | Author | Age | Files | Lines
* regcomp.c - extend REF to hold the paren it needs to regcppush | Yves Orton | 2023-03-13 | 1 | -10/+10
  This way we can avoid pushing every buffer; we only need to push the nestroot of the ref.
* regex engine - simplify regnode structures and make them consistent | Yves Orton | 2023-03-13 | 1 | -8/+8
  This eliminates the regnode_2L data structure, and merges it with the older regnode_2 data structure. At the same time it makes each "arg" property of the various regnode types that have one be consistently structured as an anonymous union like this: union { U32 arg1u; I32 arg1i; struct { U16 arg1a; U16 arg1b; }; };

  We then expose four macros for accessing each slot: ARG1u(), ARG1i(), ARG1a() and ARG1b(). Code then explicitly designates which one it wants. The old logic used ARG() to access a U32 arg1, and ARG1() to access an I32 arg1, which was confusing to say the least. The regnode_2L structure had a U32 arg1 and an I32 arg2, and the regnode_2 data structure had two I32 args. With the new set of macros we use the regnode_2 for both, and use the appropriate macros to show whether we want signed or unsigned values.

  This also renames the regnode_4 to regnode_3. The 3 stands for "three 32-bit args". However, as each slot can also store two U16s, a regnode_3 can hold up to 6 U16s, or 3 I32s, or a combination. For instance the CURLY style nodes use regnode_3 to store 4 values: ARG1i() for min count, ARG2i() for max count, and ARG3a() and ARG3b() for parens before and inside the quantifier.

  It also changes the function reganode() to reg1node() and changes reg2Lanode() to reg2node(). The 2L thing was just confusing.
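The overlapping views of one 32-bit slot can be illustrated outside of C. The following is only a sketch using Python's `ctypes`, not the Perl source: the class names `Arg1Slot` and `_Halves` and the helper `view` are hypothetical, and only the field names mirror the ones in the commit message.

```python
import ctypes


class _Halves(ctypes.Structure):
    # Two 16-bit halves sharing storage with the 32-bit views.
    _fields_ = [("arg1a", ctypes.c_uint16), ("arg1b", ctypes.c_uint16)]


class Arg1Slot(ctypes.Union):
    # Hypothetical stand-in for one "arg" slot of a regnode.
    _anonymous_ = ("halves",)
    _fields_ = [
        ("arg1u", ctypes.c_uint32),  # unsigned view, what ARG1u() would read
        ("arg1i", ctypes.c_int32),   # signed view, what ARG1i() would read
        ("halves", _Halves),         # arg1a / arg1b, reachable directly
    ]


def view(raw_u32):
    """Store an unsigned value and return (signed view, half a, half b)."""
    slot = Arg1Slot()
    slot.arg1u = raw_u32
    return slot.arg1i, slot.arg1a, slot.arg1b


print(ctypes.sizeof(Arg1Slot))  # the union still occupies a single 32-bit slot
print(view(0xFFFFFFFF))         # all bits set reads as -1 through the signed view
```

The point of the design is that the accessor name states the interpretation, so the same storage is never silently read with the wrong signedness.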
* regexec.c - make REF into a backtracking state | Yves Orton | 2023-03-13 | 1 | -0/+1
  This way we can do the required paren restoration only when it is in use. When we match a REF type node which is potentially a reference to an unclosed paren we push the match context information, currently for "everything", but in a future patch we can teach it to be more efficient by adding a new parameter to the REF regop to track which parens it should save.

  This converts the backtracking changes from the previous commit, so that it is run only when specifically enabled via the define RE_PESSIMISTIC_PARENS which is by default 0. We don't make the new fields in the struct conditional as the stack frames are large and our changes don't make any real difference and it keeps things simpler to not have conditional members, especially since some of the structures have to line up with each other.

  If enabling RE_PESSIMISTIC_PARENS fixes a backtracking bug then it means something is sensitive to us not necessarily restoring the parens properly on failure. We make some assumptions that the paren state after a failing state will be corrected by a future successful state, or that the state of the parens is irrelevant as we will fail anyway. This can be made not true by EVAL, backrefs, and potentially some other scenarios. Thus I have left this inefficient logic in place but guarded by the flag.
* regexec.c - teach BRANCH and BRANCHJ nodes to reset capture buffers | Yves Orton | 2023-03-13 | 1 | -2/+2
  In /((a)(b)|(a))+/ we should not end up with $2 and $4 being set at the same time. When a branch fails it should reset any capture buffers that might be touched by its branch.

  We change BRANCH and BRANCHJ to store the number of parens before the branch, and the number of parens after the branch was completed. When a BRANCH operation fails, we clear the buffers it contains before we continue on.

  It is a bit more complex than it should be because we have BRANCHJ and BRANCH. (One of these days we should merge them together.)

  This is also made somewhat more complex because TRIE nodes are actually branches, and may need to track capture buffers also, at two levels: the overall TRIE op, and for jump tries especially, where we emulate the behavior of branches. So we have to do the same clearing logic if a trie branch fails as well.
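As a toy model of the clearing step (not the regexec.c implementation), a failed branch that recorded "parens before" and "parens after" can unset exactly the groups that lie between those two counts. The function name and the dictionary representation of capture buffers are assumptions made for illustration.

```python
def clear_branch_captures(captures, parens_before, parens_after):
    """Unset the capture groups opened inside a failed branch.

    Groups are numbered from 1; the branch owns the groups numbered
    parens_before + 1 through parens_after inclusive.
    """
    for group in range(parens_before + 1, parens_after + 1):
        captures[group] = None
    return captures


# For /((a)(b)|(a))+/ the first alternative owns groups 2 and 3
# (1 paren before it, 3 after it); the second owns group 4.
state = {1: "ab", 2: "a", 3: "b", 4: None}
clear_branch_captures(state, parens_before=1, parens_after=3)
print(state)
```

With this in place, a later success of the second alternative sets group 4 while groups 2 and 3 stay unset, which is the behavior the commit message asks for.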
* regcomp.c - track parens related to CURLYX and CURLYM | Yves Orton | 2023-03-13 | 1 | -4/+4
  This was originally a patch which made somewhat drastic changes to how we represent capture buffers, which Dave M and I are still discussing offline and which has a larger impact than is acceptable to address at the current time. As such I have reverted the controversial parts of this patch for now, while keeping most of it intact even if in some cases the changes are unused except for debugging purposes.

  This patch still contains valuable changes, for instance teaching CURLYX and CURLYM about how many parens there are before the curly[1] (which will be useful in follow-up patches even if strictly speaking they are not directly used yet), tests and other cleanups. Also this patch is sufficiently large that reverting it out would have a large effect on the patches that were made on top of it.

  Thus keeping most of this patch while eliminating the controversial parts of it for now seemed the best approach, especially as some of the changes it introduces and the follow-up patches based on it are very useful in cleaning up the structures we use to represent regops.

  [1] Curly is the regexp internals term for quantifiers, named after x{min,max} "curly brace" quantifiers.
* regcomp.c - decompose into smaller files | Yves Orton | 2022-12-09 | 1 | -5/+5
  This splits a bunch of the subcomponents of the regex engine into smaller files:

    regcomp_debug.c
    regcomp_internal.h
    regcomp_invlist.c
    regcomp_study.c
    regcomp_trie.c

  The only real change, besides to the build machine to achieve the split, is to also add some new defines which can be used in embed.fnc to control exports without having to enumerate /every/ regex engine file. For instance all of regcomp*.c defines PERL_IN_REGCOMP_ANY, and this is used in embed.fnc to manage exports.
* regex engine - cleanup internal tabs and ws (use -w to ignore) | Yves Orton | 2022-12-09 | 1 | -5/+5
  Having internal tabs causes confusion in diffs and reviews. In the following patch I will move a lot of code around, creating new files and they will all be whitespace clean: no trailing whitespace, tabs expanded to the next tabstop properly, and no trailing empty lines at the bottom of the file.

  This patch prepares for that split, and future splits and changes to the regex engine by precleaning the main regex engine files with the same rules. It should show no changes under '-w'.
* regen/regcomp.pl - add PL_regargvaries | Yves Orton | 2022-08-03 | 1 | -1/+1
* regex: Add optimizing regnode | Karl Williamson | 2022-07-12 | 1 | -0/+3
  It turns out that any character class whose UTF-8 representation is two bytes long, and where all elements share the same first byte, can be represented by a compact, fast regnode designed for the purpose.

  This commit adds that regnode, ANYOFHbbm. ANYOFHb already exists for classes where all elements have the same first byte, and this just changes the two-byte ones to use a bitmap instead of an inversion list.

  The advantages of this are that no conversion to code point is required (the continuation byte is just looked up in the bitmap) and no inversion list is needed. The inversion list would occupy more space, from 4 to 34 extra 64-bit words, plus an AV and SV, depending on what elements the class matches.

  Many characters in the Latin, Greek, Cyrillic, Hebrew, Arabic, and several other (lesser-known) scripts are of this form.

  It would be possible to extend this technique to larger bitmaps, but this commit is a start.
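The lookup described above can be sketched in a few lines. This is an illustration of the technique only: the function names are hypothetical, and the 64-bit bitmap indexed by the low six bits of the continuation byte is an assumption about one reasonable layout, not a statement of how the node is laid out in Perl.

```python
def build_two_byte_class(chars):
    """Return (shared first byte, 64-bit bitmap over continuation bytes)."""
    encoded = [ch.encode("utf-8") for ch in chars]
    if any(len(seq) != 2 for seq in encoded):
        raise ValueError("every element must be a two-byte UTF-8 sequence")
    first_bytes = {seq[0] for seq in encoded}
    if len(first_bytes) != 1:
        raise ValueError("elements do not share a first byte")
    bitmap = 0
    for seq in encoded:
        bitmap |= 1 << (seq[1] & 0x3F)  # six payload bits of the continuation
    return first_bytes.pop(), bitmap


def matches_at(target, pos, first_byte, bitmap):
    """Test the UTF-8 bytes at pos without decoding to a code point."""
    if pos + 1 >= len(target) or target[pos] != first_byte:
        return False
    continuation = target[pos + 1]
    if continuation & 0xC0 != 0x80:
        return False
    return bool((bitmap >> (continuation & 0x3F)) & 1)


first, bits = build_two_byte_class(["α", "β", "γ"])
print(hex(first), matches_at("β".encode("utf-8"), 0, first, bits))
```

A match is one byte comparison plus one bit test, which is why no inversion list or code point conversion is needed.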
* Create new regnode type ANYOFH | Karl Williamson | 2022-07-10 | 1 | -4/+4
  This previously was lumped in with plain ANYOF. A future commit will be easier if this is separated out, and doing so leads to some simplifications, and avoids having to know all the OPs in this type.
* regcomp.sym: Add commentary | Karl Williamson | 2022-07-07 | 1 | -6/+6
* Revert "regex: Add POSIXA1R node" | Karl Williamson | 2022-07-01 | 1 | -5/+0
  This reverts commit d62feba66bf43f35d092bb026694f927e9f94d38, as explained in its commit message. It adds some comments to point out that the commit exists, for the curious.
* regex: Add POSIXA1R node | Karl Williamson | 2022-07-01 | 1 | -0/+5
  Several of the POSIXA classes are a single range on ASCII platforms, and [:digit:] is a single range on both ASCII and EBCDIC. This regnode was designed to replace the POSIXA regnode for such classes to get a bit of performance by not needing to do an array lookup. Instead it encodes some bits in the flags field that with shifting and masking get the right values for the single range's bounds for any such node.

  However, performance tests conducted by Sergey Aleynikov showed this was actually slower than what it intended to replace. Rather than completely drop this work, I'm adding it to blead, and immediately reverting it, so that should parts of it ever become useful, it would be available.

  A few tests fail; those are skipped for the purposes of this commit so that it doesn't interfere with bisecting. The code also isn't completely commented.

  One could add a regnode for each posix class it was decided should have the expected performance boost. But regnodes are a finite resource, and the boost is probably not large enough to justify doing so.
* regcomp.sym: Comment why no ANYOFRs node exists | Karl Williamson | 2022-05-31 | 1 | -2/+14
* regex engine: Issue #19168 - Fix variable length lookbehind matches | Yves Orton | 2022-02-23 | 1 | -0/+3
  We were not validating, when (?<=a|ab) matched, that the right hand side of the match lined up with the position of the assertion. The same applied to (?<!a|ab) and related patterns, e.g. (*positive_lookbehind:). Note these problems do NOT affect lookahead.

  Part of the difficulty here was that the SUCCEED node was serving too many purposes, necessitating a new regop LOOKBEHIND_END.

  Includes more tests for various lookahead or lookbehind cases.
* regnodes.h: Add two convenience bit masks | Karl Williamson | 2020-10-16 | 1 | -2/+3
  These categorize the many types of EXACT nodes, so that code can refer to a particular subset of such nodes without having to list all of them out. This simplifies some 'if' statements, and makes updating things easier.
* regcomp.sym: Make adjacent opcodes for 2 similar regnodes | Karl Williamson | 2020-10-16 | 1 | -1/+2
  These are often tested together. By making them adjacent we can use inRANGE.
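The benefit of adjacency is that two equality tests collapse into one range test, which can itself be done with a single unsigned comparison. The sketch below emulates that comparison in Python; `in_range` is a hypothetical helper written for illustration rather than Perl's inRANGE macro, and the opcode numbers are made up.

```python
def in_range(value, low, high):
    """Range test as one subtraction and one unsigned 32-bit comparison.

    Values below `low` wrap around to a huge unsigned number, so a single
    comparison rejects both ends of the range.
    """
    return ((value - low) & 0xFFFFFFFF) <= (high - low)


# Hypothetical opcode numbers for two regnodes that are usually tested together.
NODE_A, NODE_B = 41, 42

for opcode in (40, 41, 42, 43):
    print(opcode, in_range(opcode, NODE_A, NODE_B))
```

If the two opcodes were not adjacent, the same check would need two comparisons or would wrongly accept whatever opcode sat between them.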
* regcomp.sym: Update node comments | Karl Williamson | 2020-10-16 | 1 | -3/+3
* regcomp.sym: Reorder some entries | Karl Williamson | 2020-10-02 | 1 | -18/+20
  These are mostly used in regexec.c in three functions. Two of the functions use less than half the available ones, as case labels in a switch() statement. By moving all the ones used by those functions to be nearly contiguous at the beginning, compilers can generate smaller jump tables for the switch().
* regcomp.sym: Add new regnode type for (?[]) | Karl Williamson | 2020-02-19 | 1 | -0/+2
  This new regnode is used to handle interpolated already-compiled regex sets inside outer regex sets. If it isn't present, it will mean that what appears to be a nested, interpolated set really isn't.

  I created a new regnode structure to hold a pointer. This has to be temporary as pointers can be invalidated. I thought of just having a regnode without a pointer as a marker, and using a parallel array to store the data, rather than creating a whole new regnode structure for just pointers, but parallel data structures can get out of sync, so this seemed best.

  This commit just sets up the regnode; a future commit will actually use it.
* regcomp.sym: Add comment | Karl Williamson | 2019-11-22 | 1 | -0/+10
* regcomp.sym: Simplify a couple regnode defns | Karl Williamson | 2019-11-22 | 1 | -2/+2
  Under the given circumstances, these work precisely like others that already have good descriptions.
* PATCH: gh #17319 Segfault | Karl Williamson | 2019-11-22 | 1 | -1/+1
  It turns out that one isn't supposed to fill in the offset to the next regnode at node creation time. And this node is like EXACTish, so the string stuff isn't accounted for in its regcomp.sym definition.
* Add ANYOFHs regnode | Karl Williamson | 2019-11-20 | 1 | -0/+1
  This node is like ANYOFHb, but is used when more than one leading byte is the same in all the matched code points.

  ANYOFHb is used to avoid having to convert from UTF-8 to code point for something that won't match. It checks that the first byte in the UTF-8 encoded target is the desired one, thus ruling out most of the possible code points.

  But for higher code points that require longer UTF-8 sequences, many, many non-matching code points pass this filter. For code points above 0xFFFF, it is ineffective for almost 200K of them.

  This commit creates a new node type that addresses this problem. Instead of a single byte, it stores as many leading bytes as are the same for all code points that match the class. For many classes, that will cut down the number of possible false positives by a huge amount before having to convert to code point to make the final determination.

  This regnode adds a UTF-8 string at the end. It is still much smaller, even in the rare worst case, than a plain ANYOF node because the maximum string length, 15 bytes, is still shorter than the 32-byte bitmap that is present in a plain ANYOF. Most of the time the added string will instead be at most 4 bytes.
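The stored string is simply the longest byte prefix shared by the UTF-8 encodings of every matched code point. A minimal sketch, assuming hypothetical helper names and leaving aside how the compiler actually derives the prefix from an inversion list:

```python
def common_leading_bytes(code_points):
    """Longest byte prefix shared by the UTF-8 forms of all code points."""
    encoded = [chr(cp).encode("utf-8") for cp in code_points]
    prefix = encoded[0]
    for seq in encoded[1:]:
        shared = 0
        while shared < min(len(prefix), len(seq)) and prefix[shared] == seq[shared]:
            shared += 1
        prefix = prefix[:shared]
    return prefix


def may_match(target, pos, prefix):
    """Cheap pre-filter: reject positions whose bytes do not start with prefix."""
    return target.startswith(prefix, pos)


# A class covering U+1F600 through U+1F64F shares two leading bytes.
prefix = common_leading_bytes(range(0x1F600, 0x1F650))
print(prefix.hex(), may_match("\U0001F601".encode("utf-8"), 0, prefix))
```

Only positions that pass `may_match` need the full conversion to a code point, so a longer shared prefix means fewer false positives reach the expensive step.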
* regcomp.sym: Add missing commas | Karl Williamson | 2019-11-19 | 1 | -2/+2
  I don't think lack of these affects anything, but they were inconsistent.
* Add ANYOFRb regnode | Karl Williamson | 2019-11-17 | 1 | -1/+3
  This is like the ANYOFR regnode added in the previous commit, but all code points in the range it matches are known to have the same first UTF-8 start byte. That means it can't match UTF-8 invariant characters, like ASCII, because the "start" byte is different on each one, so it could only match a range of 1, and the compiler wouldn't generate this node for that, instead using an EXACT.

  Pattern matching can rule out most code points by looking at the first byte of their UTF-8 representation, before having to convert from UTF-8.

  On ASCII platforms this simple comparison rules out all but 64 of the 2-byte UTF-8 characters. For 3-byte characters it leaves up to 4096, and for 4-byte characters 2**18, so the test is less effective for higher code points.

  I believe that most UTF-8 patterns that otherwise would compile to ANYOFR will instead compile to this, as I can't envision real life applications wanting to match large single ranges. Even the 2048 surrogates all have the same first byte.
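The figures quoted above follow from the six payload bits carried by each UTF-8 continuation byte. A short worked check (the function name is hypothetical; `surrogatepass` is used only so that Python will encode surrogate code points for the demonstration):

```python
def code_points_per_start_byte(sequence_length):
    """How many code points share one start byte for a given UTF-8 length.

    Every continuation byte contributes 6 payload bits, so a start byte is
    followed by 2 ** (6 * number_of_continuation_bytes) possible tails.
    """
    return 1 << (6 * (sequence_length - 1))


for length in (2, 3, 4):
    print(length, code_points_per_start_byte(length))

# The surrogates U+D800..U+DFFF, all 2048 of them, share a single start byte.
surrogate_start_bytes = {
    chr(cp).encode("utf-8", "surrogatepass")[0] for cp in range(0xD800, 0xE000)
}
print(sorted(hex(b) for b in surrogate_start_bytes))
```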
* Add ANYOFR regnode | Karl Williamson | 2019-11-17 | 1 | -0/+1
  This matches a single range of code points. It is both faster and smaller than other ANYOF-type nodes, requiring, after set-up, a single subtraction and conditional branch.

  The vast majority of Unicode properties match a single range (though most of the properties likely to be used in real world applications have more than a single range). But things like [ij] are a single range, and those are quite commonly encountered. This new regnode matches them more efficiently than a bitmap would, and doesn't require the space for one either.

  The flags field is used to store the minimum matchable start byte for UTF-8 strings, and is ignored for non-UTF-8 targets. This, like ANYOFH nodes which have a similar mechanism, allows for quick weeding out of many possible matches without having to convert the UTF-8 to its corresponding code point.

  This regnode packs the 32 bit argument with 20 bits for the minimum code point the node matches, and 12 bits for the maximum range. If the input is a value outside these, it simply won't compile to this regnode, instead going to one of the ANYOFH flavors.

  ANYOFR is sufficient to match all of Unicode except for the final (private use) 65K plane.
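The 20 + 12 bit packing and the one-subtraction match can be sketched as follows. Which end of the 32-bit word holds which field is an assumption here (base in the low 20 bits, range width in the high 12), and the function names are hypothetical.

```python
BASE_BITS = 20   # minimum code point the node matches
DELTA_BITS = 12  # width of the range above that minimum


def pack_range(low, high):
    """Pack [low, high] into 32 bits, or return None if it does not fit."""
    delta = high - low
    if delta < 0 or low >= (1 << BASE_BITS) or delta >= (1 << DELTA_BITS):
        return None  # the compiler would fall back to another node type
    return (delta << BASE_BITS) | low


def matches(packed, code_point):
    """Single subtraction plus one unsigned comparison."""
    base = packed & ((1 << BASE_BITS) - 1)
    delta = packed >> BASE_BITS
    return ((code_point - base) & 0xFFFFFFFF) <= delta


ij = pack_range(ord("i"), ord("j"))
print([ch for ch in "hijk" if matches(ij, ord(ch))])
```

A 20-bit base tops out at U+FFFFF, which is why a range starting in the last plane (U+100000 and above) cannot be represented this way.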
* regcomp.sym: Add detail to some node descriptions | Karl Williamson | 2019-11-14 | 1 | -10/+14
  Having this enabled me to more quickly understand what's going on. A trailing period is removed from some long descriptions to make them slightly shorter.
* Change the names of some regnodes | Karl Williamson | 2019-10-29 | 1 | -3/+3
  The new name is shorter and, I believe, clearer.
* regex: Add LEXACT_ONLY8 node type | Karl Williamson | 2019-09-29 | 1 | -0/+1
  This is like LEXACT, but it is known that only strings encoded in UTF-8 will match it, so we don't even have to try if that condition isn't met.
* Add regnode LEXACT, for long strings | Karl Williamson | 2019-09-29 | 1 | -0/+4
  This commit adds a new regnode for strings that don't fit in a regular one, and adds a structure for that regnode to use. Actually using them is deferred to the next commit.

  This new regnode structure is needed because the previous structure only allows for an 8 bit length field, 255 max bytes. This commit puts the length instead in a new field, the same place single-argument regnodes put their argument. Hence this long string is an extra 32 bits of overhead, but at no string length is this node ever bigger than the combination of the smaller nodes it replaces.

  I also considered simply combining the original 8 bit length field (which is now unused) with the first byte of the string field to get a 16 bit length, and have the actual string be offset by 1. But I rejected that because it would mean the string would usually not be aligned, slowing down memory accesses.

  This new LEXACT regnode can hold up to what 1024 regular EXACT ones hold, using 4K fewer overhead bytes to do so. That means it can handle strings containing 262000 bytes. The comments give ideas for expanding that should it become necessary or desirable.

  Besides the space advantage, any hardware acceleration in memcmp can be done in much bigger chunks, and otherwise the memcmp inner loop (often written in assembly) will run many more times in a row, and our outer loop that calls it, correspondingly fewer.
* regcomp.sym Update and improve descriptions of some nodes | Karl Williamson | 2019-09-27 | 1 | -7/+7
  EXACTFU nodes always now fold their strings; the information here had not been updated to reflect that change. And the descriptions of several EXACTish nodes are now changed to be slightly shorter and to remove mention of the string length, which is problematic, and is covered in the description for EXACT.
* regen/regcomp.pl, regcomp.sym: Comments | Karl Williamson | 2019-09-27 | 1 | -4/+19
  I spent some time in this code trying to understand some things, and as a result I'm commenting previously undocumented features. The comments about what an entry in regcomp.sym should look like are moved to that file, rather than the file that reads it. The former is most often touched, and they had gotten out-of-sync in the latter. Things now make more sense to me, and hopefully anyone using this in the future.
* regcomp.sym: Fix comment | Karl Williamson | 2019-09-15 | 1 | -1/+1
  The length of an EXACTish node is the same bits as the FLAGS field in other nodes; it doesn't "precede the length", as previously claimed.
* Add ANYOFHr regnode | Karl Williamson | 2019-06-26 | 1 | -0/+1
  This commit adds a new regnode, ANYOFHr, like ANYOFH, but it also has a loose upper bound for the first UTF-8 byte matchable by the node. (The 'r' stands for 'range'.) It would be nice to have a tight upper bound, but to do so requires 4 more bits than are available without changing the node argument types, and hence increasing the node size. Having a loose bound is better than no bound, and comes essentially free, by using two unused bits in the current ANYOFH node, and requiring only a few extra, pipeline-able, mask, etc. instructions at run time, no extra conditionals. Any ANYOFH nodes that would benefit from having an upper bound will instead be compiled into this node type.

  Its use is based on the following observations.

  There are 64 possible start bytes, so the full range can be expressed in 6 bits. This means that the flags field in ANYOFH nodes containing the start byte has two extra bits that can be used for something else.

  An ANYOFH node only happens when there is no matching code point in the bit map, so the smallest code point that could be is 256. The start byte for that is C4, so there are actually only 60 possible start bytes. (perl can be compiled with a larger bit map in which case the minimum start byte would be even higher.)

  A second observation is that knowing the highest start byte is above F0 is no better than knowing it's F0. This is because the highest code point whose start byte is F0 is U+3FFFF, and all code points above that that are currently allocated are all very specialized and rarely encountered. And there's no likelihood of that changing anytime soon as there's plenty of unallocated space below that. So if an ANYOFH node's highest start byte is F0 or above, there's no advantage to knowing what the actual max possible start byte is, so leave it as ANYOFH.

  That means the highest start byte we care about in ANYOFHr is EF. That cuts the number of start bytes we care about down to 43, still 6 bits required to represent them, but it allows for the following scheme:

  Populate the flags field by subtracting C0 from the lowest start byte and shift left 2 bits. That leaves the bottom two bits unused. We use them as follows, where x is the start byte of the lowest code point in the node:

    bits
    ----
     11   The upper limit of the range can be as much as (EF - x) / 8
     10   The upper limit of the range can be as much as (EF - x) / 4
     01   The upper limit of the range can be as much as (EF - x) / 2
     00   The upper limit of the range can be as much as  EF

  That partitions the loose upper bound into 4 possible ranges, with it being tighter the closer it is to the strict lower bound. This makes the loose upper bound more meaningful when there is most to gain by having one.

  Some examples of what the various upper bounds would be for all the possibilities of these two bits are:

                 Upper bound given the 2 bits
    Low bound       11    10    01    00
    ---------       --    --    --    --
       C4           C9    CE    D9    EF
       D0           D3    D7    DF    EF
       E0           E1    E3    E7    EF

  Start bytes of E0 and above represent many more code points each than lower ones, as they are 3 byte sequences instead of two. This scheme provides tighter bounds for them, which is also a point in its favor.

  Thus we have provided a loose upper bound using two otherwise unused bits. An alternate scheme could have had the intervals all the same, but this provides a tighter bound when it makes the most sense to.

  For EBCDIC the range is from C8 to F4.

  Tests will be added in a later commit.
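The example upper bounds in this message can be reproduced with a few lines of arithmetic. The sketch below reads each table entry as the low bound x plus (EF - x) divided by 8, 4, 2 or 1, with integer division; that reading is inferred from the example values, and the function names are hypothetical.

```python
HIGHEST_USEFUL_START = 0xEF


def pack_flags(low_start, selector):
    """(low start byte - C0) shifted left 2 bits, with the 2-bit selector below."""
    return ((low_start - 0xC0) << 2) | selector


def unpack_flags(flags):
    """Recover (low start byte, selector) from the flags byte."""
    return (flags >> 2) + 0xC0, flags & 0b11


def loose_upper_bound(low_start, selector):
    """Selector 0b11 divides the distance to EF by 8, 0b10 by 4, 0b01 by 2."""
    return low_start + ((HIGHEST_USEFUL_START - low_start) >> selector)


for low in (0xC4, 0xD0, 0xE0):
    row = [loose_upper_bound(low, sel) for sel in (0b11, 0b10, 0b01, 0b00)]
    print(f"{low:02X}", " ".join(f"{bound:02X}" for bound in row))
```

Because the divisor is a power of two, decoding the bound at run time is a subtraction, a shift and an addition, with no extra conditional.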
* regex: Add lower bound to ANYOFH nodes UTF-8 byte | Karl Williamson | 2019-06-26 | 1 | -1/+1
  This commit adds a lower bound for the first UTF-8 byte matchable by an ANYOFH node. The flags field is otherwise unused, and using it for this purpose allows code to rule out match possibilities without having to convert from UTF-8 to code point.

  It might be better to do the inverse instead, to have the field be an upper bound. The reason is that the conversion is cheap for smaller numbers. The commit following mostly addresses this.
* Use inRANGE for seeing if node is an ANYOFH type | Karl Williamson | 2019-06-26 | 1 | -0/+3
  This is easier to read, especially when a third type is added a few commits ahead.
* regcomp.sym: Change regnode description | Karl Williamson | 2019-06-25 | 1 | -1/+1
  Simplify the description for ANYOFb.
* Split ANYOFH regnode into two types | Karl Williamson | 2019-05-31 | 1 | -1/+2
  ANYOFHb will be for nodes where all the matching code points share the first UTF-8 byte. ANYOFH will be for all others. Neither of these has a bitmap.

  I noticed that we can omit some execution conditionals by splitting the nodes.
* regcomp.sym: Fix typo in comment | Karl Williamson | 2019-05-24 | 1 | -1/+1
* regnodes.h: Change some regnodes' names | Karl Williamson | 2019-05-24 | 1 | -6/+6
  These were misleading, as elsewhere a leading 'N' in the name means the complement. Instead move the N to the end of the name.
* Add common UTF-8 first byte to ANYOFH regnodes | Karl Williamson | 2019-03-20 | 1 | -1/+1
  An ANYOFH regnode is generated instead of a plain ANYOF one when nothing it can match is in the bitmap used in ANYOF nodes. It is therefore smaller as the 4 word (or more) bitmap is omitted. This means that for it to match a target string, that string must be UTF-8 (since the bitmap is for at least the lowest 256 code points). And only in rare circumstances are there any flags associated with it in the regnode flags field.

  This commit changes things so that the flags field in an ANYOFH node is repurposed to be the first UTF-8 encoded byte of every code point matched by the class if there is a common byte for all of them; or 0 if some have different first bytes. (That means that those rare cases where the flags field isn't otherwise empty can no longer be ANYOFH nodes.)

  The purpose of this is so that a future commit can take advantage of this, and more quickly scan the target string for places that this node can match. A problem with ANYOF nodes is that they are code point oriented (U32 or U64), and the target string is UTF-8, so conversion has to be done. By having the partial conversion compiled in, we can look for that at runtime instead of having to look at every character in the scan.
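A small model of the repurposed flags field and of the scan it enables: the flag is the shared first byte, or 0 when there is none, and a non-zero flag lets the scanner pick candidate positions by byte value alone. Both function names are hypothetical and the scan is deliberately naive.

```python
def anyofh_first_byte_flag(code_points):
    """Shared first UTF-8 byte of all code points, or 0 if they differ."""
    first_bytes = {chr(cp).encode("utf-8")[0] for cp in code_points}
    return first_bytes.pop() if len(first_bytes) == 1 else 0


def candidate_positions(target, flag):
    """Byte offsets worth a full check.

    With a non-zero flag only offsets holding that exact byte qualify;
    with flag 0 every character start (non-continuation byte) does.
    """
    if flag:
        return [i for i, byte in enumerate(target) if byte == flag]
    return [i for i, byte in enumerate(target) if byte & 0xC0 != 0x80]


flag = anyofh_first_byte_flag([0x3B1, 0x3B2])  # GREEK SMALL ALPHA and BETA
print(hex(flag), candidate_positions("xαy".encode("utf-8"), flag))
```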
* Implement variable length lookbehind in regex patterns | Karl Williamson | 2019-03-18 | 1 | -2/+2
  See [perl #132367].
* regnodes.h, perldebguts: Shorten some descriptions | Karl Williamson | 2019-03-14 | 1 | -10/+10
* regcomp.sym: Note specialized use of 'flags' in 2 OPs | Karl Williamson | 2018-12-30 | 1 | -2/+2
* Add new regnode: ANYOFH, without a bitmap | Karl Williamson | 2018-12-26 | 1 | -0/+1
  This commit adds a regnode for the case where nothing in the bit map has matches. This allows the bitmap to be omitted, saving 32 bytes of otherwise wasted space per node. Many non-Latin Unicode properties have this characteristic. Further, since this node applies only to code points above 255, which are representable only in UTF-8, we can trivially fail a match where the target string isn't in UTF-8. Time savings also accrue from skipping the bitmap look-up. When swashes are removed, even more time will be saved.
* Remove ASCII/NASCII regnodes | Karl Williamson | 2018-12-26 | 1 | -3/+0
  The ANYOFM/NANYOFM regnodes are generalizations of these. They have more masks and shifts than the removed nodes, but not more branches, so are effectively the same speed. Remove the ASCII/NASCII nodes in favor of having less code to maintain.
* regcomp.c: Simplify handling of EXACTFish nodes with 's' at edge | Karl Williamson | 2018-12-26 | 1 | -5/+1
  Commit 8a100c918ec81926c0536594df8ee1fcccb171da created node types for handling an 's' at the leading edge, at the trailing edge, and at both edges, for nodes under /di in which nothing else would prevent them from being EXACTFU nodes. If two of these get joined, it could create an 'ss' sequence which can't be an EXACTFU node, for U+DF would match them unconditionally. Instead, under /di it should match if and only if the target string is UTF-8 encoded.

  I realized later that having three types becomes harder to deal with when adding yet more node types, so this commit turns the three into just one node type, indicating that at least one edge of the node is an 's'.

  It also simplifies the parsing of the pattern and determining which node to use.
* Collapse regnode EXACTFU_SS into EXACTFUP | Karl Williamson | 2018-12-26 | 1 | -1/+0
  EXACTFUP was created by the previous commit to handle a problematic case in which not all the code points in an EXACTFU node are /i foldable at compile time. Doing so will allow a future commit to use the pre-folded EXACTFU nodes (done in a prior commit), saving execution time for the common case. The only problematic code point is the MICRO SIGN. Most patterns don't use this character.

  EXACTFU_SS is problematic in a different way. It contains the sequence 'ss' which is folded to by LATIN SMALL LETTER SHARP S, but everything in it can be pre-folded (unless it also contains a MICRO SIGN). The reason this is problematic is that it is the only non-UTF-8 node where the length in folding can change. To process it at runtime, the more general fold equivalence function is used that is capable of handling length disparities, but is slower than the functions otherwise used for non-UTF-8.

  What I've chosen to do for now is to make a single node type for all the problematic cases (which at this time means just the two aforementioned ones). If we didn't do this, we'd have to add a third node type for patterns that contain both 'ss' and MICRO. Or artificially split the pattern so the two never were in the same node, but we can't do that because it can cause bugs in handling multi-character folds. If more special handling is found to be needed, there'd be a combinatorial explosion of additional node types to handle all possible combinations.

  What this effectively means is that the slower, more general foldEQ function is used for portions of patterns containing the MICRO sign when the pattern isn't in UTF-8, even though there is no inherent reason to do so for non-UTF-8 strings that don't also contain the 'ss' sequence.
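The two problematic folds can be observed with Python's `str.casefold`, which applies Unicode full case folding. It is used here only as a stand-in for Perl's own fold tables, and `fits_without_utf8` is a hypothetical helper meaning "every character is below U+0100".

```python
MICRO_SIGN = "\u00b5"  # representable without UTF-8 (code point below 256)
SHARP_S = "\u00df"     # LATIN SMALL LETTER SHARP S


def fits_without_utf8(text):
    """True if every character could live in a native 8-bit string."""
    return all(ord(ch) < 0x100 for ch in text)


# MICRO SIGN itself fits in 8 bits, but its fold (GREEK SMALL LETTER MU) does not.
folded_micro = MICRO_SIGN.casefold()
print(f"U+{ord(folded_micro):04X}", fits_without_utf8(MICRO_SIGN), fits_without_utf8(folded_micro))

# SHARP S folds to two characters, so the length changes under folding.
folded_sharp_s = SHARP_S.casefold()
print(folded_sharp_s, len(SHARP_S), len(folded_sharp_s))
```

These are the two properties the message singles out: a fold that leaves the 8-bit range, and a fold that changes length.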
* Add regnode EXACTFUP, for problematic | Karl Williamson | 2018-12-26 | 1 | -1/+1
  If a non-UTF-8 pattern contains a MICRO SIGN, this special node is now created. This is the only character that does not itself need UTF-8 to represent but whose fold does, which causes some issues, so it has to be specially handled. When matching against a non-UTF-8 target string, the pattern is effectively folded, but not if the target is UTF-8. By creating this node, we can remove the special handling required for the nodes that don't have a MICRO SIGN, in a future commit.