summaryrefslogtreecommitdiff
path: root/regcomp.h
diff options
context:
space:
mode:
authorKarl Williamson <khw@cpan.org>2019-06-05 11:15:00 -0600
committerKarl Williamson <khw@cpan.org>2019-06-26 09:01:27 -0600
commit3146c00a633e9cbed741e10146662fbcedfdb8d3 (patch)
treea9f070cfb500d886efd36fad4aec107f4fca88ab /regcomp.h
parent29a889ef8a5621dae70b129c9b5db9e83e1087f9 (diff)
downloadperl-3146c00a633e9cbed741e10146662fbcedfdb8d3.tar.gz
Add ANYOFHr regnode
This commit adds a new regnode, ANYOFHr, like ANYOFH, but it also has a loose upper bound for the first UTF-8 byte matchable by the node. (The 'r' stands for 'range'). It would be nice to have a tight upper bound, but to do so requires 4 more bits than are available without changing the node arguments types, and hence increasing the node size. Having a loose bound is better than no bound, and comes essentially free, by using two unused bits in the current ANYOFH node, and requiring only a few extra, pipeline-able, mask, etc instructions at run time, no extra conditionals. Any ANYOFH nodes that would benefit from having an upper bound will instead be compiled into this node type. Its use is based on the following observations. There are 64 possible start bytes, so the full range can be expressed in 6 bits. This means that the flags field in ANYOFH nodes containing the start byte has two extra bits that can be used for something else. An ANYOFH node only happens when there is no matching code point in the bit map, so the smallest code point that could be is 256. The start byte for that is C4, so there are actually only 60 possible start bytes. (perl can be compiled with a larger bit map in which case the minimum start byte would be even higher.) A second observation is that knowing the highest start byte is above F0 is no better than knowing it's F0. This is because the highest code point whose start byte is F0 is U+3FFFF, and all code points above that that are currently allocated are all very specialized and rarely encountered. And there's no likelihood of that changing anytime soon as there's plenty of unallocated space below that. So if an ANYOFH node's highest start byte is F0 or above, there's no advantage to knowing what the actual max possible start byte is, so leave it as ANYOFH,. That means the highest start byte we care about in ANYOFHr is EF. That cuts the number of start bytes we care about down to 43, still 6 bits required to represent them, but it allows for the following scheme: Populate the flags field by subtracting C0 from the lowest start byte and shift left 2 bits. That leaves the the bottom two bits unused. We use them as follows, where x is the start byte of the lowest code point in the node: bits ---- 11 The upper limit of the range can be as much as (EF - x) / 8 10 The upper limit of the range can be as much as (EF - x) / 4 01 The upper limit of the range can be as much as (EF - x) / 2 00 The upper limit of the range can be as much as EF That partitions the loose upper bound into 4 possible ranges, with it being tighter the closer it is to the strict lower bound. This makes the loose upper bound more meaningful when there is most to gain by having one. Some examples of what the various upper bounds would be for all the possibilities of these two bits are: Upper bound given the 2 bits Low bound 11 10 01 00 --------- -- -- -- -- C4 C9 CE D9 EF D0 D3 D7 DF EF E0 E1 E3 E7 EF Start bytes of E0 and above represent many more code points each than lower ones, as they are 3 byte sequences instead of two. This scheme provides tighter bounds for them, which is also a point in its favor. Thus we have provided a loose upper bound using two otherwise unused bits. An alternate scheme could have had the intervals all the same, but this provides a tighter bound when it makes the most sense to. For EBCDIC the range is is from C8 to F4, Tests will be added in a later commit
Diffstat (limited to 'regcomp.h')
-rw-r--r--regcomp.h24
1 files changed, 24 insertions, 0 deletions
diff --git a/regcomp.h b/regcomp.h
index 5002e2b38d..ac61408239 100644
--- a/regcomp.h
+++ b/regcomp.h
@@ -1126,6 +1126,30 @@ typedef enum {
WB_BOUND
} bound_type;
+/* This unpacks the FLAGS field of ANYOFHx nodes. The value it contains
+ * gives the strict lower bound for the UTF-8 start byte of any code point
+ * matchable by the node, and a loose upper bound as well.
+ *
+ * The low bound is stored in the upper 6 bits, plus 0xC0.
+ * The loose upper bound is determined from the lowest 2 bits and the low bound
+ * (called x) as follows:
+ *
+ * 11 The upper limit of the range can be as much as (EF - x) / 8
+ * 10 The upper limit of the range can be as much as (EF - x) / 4
+ * 01 The upper limit of the range can be as much as (EF - x) / 2
+ * 00 The upper limit of the range can be as much as EF
+ *
+ * For motivation of this design, see the commit message */
+#ifdef EBCDIC
+# define MAX_ANYOF_HRx_BYTE 0xF4
+#else
+# define MAX_ANYOF_HRx_BYTE 0xEF
+#endif
+#define LOWEST_ANYOF_HRx_BYTE(b) (((b) >> 2) + 0xC0)
+#define HIGHEST_ANYOF_HRx_BYTE(b) \
+ (LOWEST_ANYOF_HRx_BYTE(b) \
+ + ((MAX_ANYOF_HRx_BYTE - LOWEST_ANYOF_HRx_BYTE(b)) >> ((b) & 3)))
+
#endif /* PERL_REGCOMP_H_ */
/*