summaryrefslogtreecommitdiff
path: root/regcomp.sym
diff options
context:
space:
mode:
authorKarl Williamson <khw@cpan.org>2019-06-05 11:15:00 -0600
committerKarl Williamson <khw@cpan.org>2019-06-26 09:01:27 -0600
commit3146c00a633e9cbed741e10146662fbcedfdb8d3 (patch)
treea9f070cfb500d886efd36fad4aec107f4fca88ab /regcomp.sym
parent29a889ef8a5621dae70b129c9b5db9e83e1087f9 (diff)
downloadperl-3146c00a633e9cbed741e10146662fbcedfdb8d3.tar.gz
Add ANYOFHr regnode
This commit adds a new regnode, ANYOFHr, like ANYOFH, but it also has a loose upper bound for the first UTF-8 byte matchable by the node. (The 'r' stands for 'range'). It would be nice to have a tight upper bound, but to do so requires 4 more bits than are available without changing the node arguments types, and hence increasing the node size. Having a loose bound is better than no bound, and comes essentially free, by using two unused bits in the current ANYOFH node, and requiring only a few extra, pipeline-able, mask, etc instructions at run time, no extra conditionals. Any ANYOFH nodes that would benefit from having an upper bound will instead be compiled into this node type. Its use is based on the following observations. There are 64 possible start bytes, so the full range can be expressed in 6 bits. This means that the flags field in ANYOFH nodes containing the start byte has two extra bits that can be used for something else. An ANYOFH node only happens when there is no matching code point in the bit map, so the smallest code point that could be is 256. The start byte for that is C4, so there are actually only 60 possible start bytes. (perl can be compiled with a larger bit map in which case the minimum start byte would be even higher.) A second observation is that knowing the highest start byte is above F0 is no better than knowing it's F0. This is because the highest code point whose start byte is F0 is U+3FFFF, and all code points above that that are currently allocated are all very specialized and rarely encountered. And there's no likelihood of that changing anytime soon as there's plenty of unallocated space below that. So if an ANYOFH node's highest start byte is F0 or above, there's no advantage to knowing what the actual max possible start byte is, so leave it as ANYOFH,. That means the highest start byte we care about in ANYOFHr is EF. That cuts the number of start bytes we care about down to 43, still 6 bits required to represent them, but it allows for the following scheme: Populate the flags field by subtracting C0 from the lowest start byte and shift left 2 bits. That leaves the the bottom two bits unused. We use them as follows, where x is the start byte of the lowest code point in the node: bits ---- 11 The upper limit of the range can be as much as (EF - x) / 8 10 The upper limit of the range can be as much as (EF - x) / 4 01 The upper limit of the range can be as much as (EF - x) / 2 00 The upper limit of the range can be as much as EF That partitions the loose upper bound into 4 possible ranges, with it being tighter the closer it is to the strict lower bound. This makes the loose upper bound more meaningful when there is most to gain by having one. Some examples of what the various upper bounds would be for all the possibilities of these two bits are: Upper bound given the 2 bits Low bound 11 10 01 00 --------- -- -- -- -- C4 C9 CE D9 EF D0 D3 D7 DF EF E0 E1 E3 E7 EF Start bytes of E0 and above represent many more code points each than lower ones, as they are 3 byte sequences instead of two. This scheme provides tighter bounds for them, which is also a point in its favor. Thus we have provided a loose upper bound using two otherwise unused bits. An alternate scheme could have had the intervals all the same, but this provides a tighter bound when it makes the most sense to. For EBCDIC the range is is from C8 to F4, Tests will be added in a later commit
Diffstat (limited to 'regcomp.sym')
-rw-r--r--regcomp.sym1
1 files changed, 1 insertions, 0 deletions
diff --git a/regcomp.sym b/regcomp.sym
index 9e2c6d3aea..522293c9c2 100644
--- a/regcomp.sym
+++ b/regcomp.sym
@@ -66,6 +66,7 @@ ANYOFPOSIXL ANYOF, sv charclass_posixl S ; Like ANYOFL, but matches [[:p
# Must be sequential
ANYOFH ANYOF, sv 1 S ; Like ANYOF, but only has "High" matches, none in the bitmap; the flags field contains the lowest matchable UTF-8 start byte
ANYOFHb ANYOF, sv 1 S ; Like ANYOFH, but all matches share the same UTF-8 start byte, given in the flags field
+ANYOFHr ANYOF, sv 1 S ; Like ANYOFH, but the flags field contains packed bounds for all matchable UTF-8 start bytes.
ANYOFM ANYOFM byte 1 S ; Like ANYOF, but matches an invariant byte as determined by the mask and arg
NANYOFM ANYOFM byte 1 S ; complement of ANYOFM