author: Karl Williamson <khw@cpan.org> 2019-09-21 09:51:52 -0600
committer: Karl Williamson <khw@cpan.org> 2019-11-17 21:20:07 -0700
commit: 2d5613be2139e3ec2e5cf6a54ecbae6ba8b3a1e0
tree: 1859d23b3dfab71ab0416d4e0d7746b4db5ac8a1 /regcomp.sym
parent: 13fcf6522466471a1b1c5fc2d760dd5367fd8940
Add ANYOFRb regnode
This is like the ANYOFR regnode added in the previous commit, but all code points in the range it matches are known to share the same UTF-8 start byte. That means it can't match UTF-8 invariant characters, such as ASCII, because each of those has a different "start" byte, so the node could only match a range of 1, and the compiler wouldn't generate this node for that, using an EXACT instead.

Pattern matching can rule out most code points by looking at the first byte of their UTF-8 representation, before having to convert from UTF-8. On ASCII platforms, this simple comparison rules out all but 64 of the 2-byte UTF-8 characters. For 3-byte characters, up to 4096 remain, and for 4-byte ones, 2**18, so the test is less effective for higher code points.

I believe that most UTF-8 patterns that would otherwise compile to ANYOFR will instead compile to this, as I can't envision real-life applications wanting to match large single ranges. Even the 2048 surrogates all have the same first byte.
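The start-byte reasoning in the message can be illustrated with a small C sketch. It is not the code in regexec.c. The function names are hypothetical, it assumes an ASCII platform and code points no higher than 0x10FFFF, and it encodes surrogates with the same generalized rule as any other 3-byte code point.

```c
#include <stdbool.h>
#include <stdint.h>

/* First byte of the UTF-8 encoding of cp (hypothetical helper; assumes
 * an ASCII platform and cp <= 0x10FFFF). */
static uint8_t utf8_start_byte(uint32_t cp)
{
    if (cp < 0x80)    return (uint8_t) cp;                  /* invariant */
    if (cp < 0x800)   return (uint8_t) (0xC0 | (cp >> 6));  /* 2 bytes */
    if (cp < 0x10000) return (uint8_t) (0xE0 | (cp >> 12)); /* 3 bytes */
    return (uint8_t) (0xF0 | (cp >> 18));                   /* 4 bytes */
}

/* Upper bound on how many code points can follow a given start byte:
 * each continuation byte contributes 6 bits. */
static uint32_t code_points_per_start_byte(uint8_t start)
{
    if (start < 0x80)                  return 1;
    if (start >= 0xC0 && start < 0xE0) return 1u << 6;   /* 64 */
    if (start >= 0xE0 && start < 0xF0) return 1u << 12;  /* 4096 */
    if (start >= 0xF0 && start < 0xF8) return 1u << 18;  /* 2**18 */
    return 0;  /* continuation byte, or outside this sketch */
}

/* The start byte never decreases as cp grows, so a range shares one
 * start byte exactly when its two ends do. Invariants are excluded
 * because each one is its own start byte. */
static bool range_shares_start_byte(uint32_t lo, uint32_t hi)
{
    return lo >= 0x80 && utf8_start_byte(lo) == utf8_start_byte(hi);
}

/* The cheap rejection test: compare one byte before decoding anything. */
static bool start_byte_matches(const uint8_t *s, uint8_t expected_start)
{
    return s[0] == expected_start;
}
```

Under these assumptions, the surrogate range 0xD800 to 0xDFFF qualifies because both ends encode with start byte 0xED, while an ASCII range such as 0x41 to 0x5A does not.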
Diffstat (limited to 'regcomp.sym')
-rw-r--r-- regcomp.sym 4
1 files changed, 3 insertions, 1 deletions
diff --git a/regcomp.sym b/regcomp.sym
index b664fc8f07..6ca83aeef1 100644
--- a/regcomp.sym
+++ b/regcomp.sym
@@ -82,7 +82,9 @@ ANYOFPOSIXL ANYOF, sv charclass_posixl S ; Like ANYOFL, but matches [[:p
ANYOFH ANYOF, sv 1 S ; Like ANYOF, but only has "High" matches, none in the bitmap; the flags field contains the lowest matchable UTF-8 start byte
ANYOFHb ANYOF, sv 1 S ; Like ANYOFH, but all matches share the same UTF-8 start byte, given in the flags field
ANYOFHr ANYOF, sv 1 S ; Like ANYOFH, but the flags field contains packed bounds for all matchable UTF-8 start bytes.
-ANYOFR ANYOFR, packed 1 S ; Matches any character in the range given by its packed args: upper 12 bits is the max delta from the base lower 20; the flags field contains the lowest matchable UTF-8 start byte
+ANYOFR ANYOFR, packed 1 S ; Matches any character in the range given by its packed args: upper 12 bits is the max delta from the base lower 20; the flags field contains the lowest matchable UTF-8 start byte
+ANYOFRb ANYOFR, packed 1 S ; Like ANYOFR, but all matches share the same UTF-8 start byte, given in the flags field
+# There is no ANYOFRr because khw doesn't think there are likely to be real-world cases where such a large range is used.
ANYOFM ANYOFM byte 1 S ; Like ANYOF, but matches an invariant byte as determined by the mask and arg
NANYOFM ANYOFM byte 1 S ; complement of ANYOFM
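
The ANYOFR comment in the hunk above describes a packed argument: the lower 20 bits hold the base code point and the upper 12 bits hold the maximum delta from that base. A minimal C sketch of that layout follows, assuming a 32-bit argument. The macro and function names are hypothetical and are not the ones used in the Perl source.

```c
#include <stdbool.h>
#include <stdint.h>

#define BASE_BITS 20u                        /* hypothetical name */
#define BASE_MASK ((1u << BASE_BITS) - 1u)   /* 0xFFFFF */

/* Pack a range [base, base + delta] into one 32-bit argument.
 * base must fit in 20 bits and delta in 12 bits (at most 4095). */
static uint32_t pack_range(uint32_t base, uint32_t delta)
{
    return (delta << BASE_BITS) | (base & BASE_MASK);
}

static uint32_t range_base(uint32_t packed)
{
    return packed & BASE_MASK;
}

static uint32_t range_delta(uint32_t packed)
{
    return packed >> BASE_BITS;
}

/* Single-comparison range test: if cp is below the base, the unsigned
 * subtraction wraps to a huge value and the comparison fails. */
static bool in_packed_range(uint32_t packed, uint32_t cp)
{
    return cp - range_base(packed) <= range_delta(packed);
}
```

With this layout, `pack_range(0xD800, 0x7FF)` would describe the 2048 surrogates mentioned in the commit message, and `in_packed_range` accepts 0xD800 through 0xDFFF while rejecting 0xD7FF and 0xE000.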