Add ANYOFRb regnode

This is like the ANYOFR regnode added in the previous commit, but all code points in the range it matches are known to share the same UTF-8 start byte. That means it can't match UTF-8 invariant characters, such as ASCII, because each of those has a different "start" byte. Such a node could match only a range of one code point, and the compiler wouldn't generate this node for that; it would use an EXACT instead.

Pattern matching can rule out most code points by looking at the first byte of their UTF-8 representation, before having to convert from UTF-8. On ASCII platforms, this simple comparison rules out all but 64 of the 2-byte UTF-8 characters. For 3-byte characters, up to 4096 candidates remain, and for 4-byte characters, 2**18, so the test is less effective for higher code points.

I believe that most UTF-8 patterns that would otherwise compile to ANYOFR will instead compile to this, as I can't envision real-life applications wanting to match large single ranges. Even the 2048 surrogates all have the same first byte.
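The start-byte test described above can be illustrated with a minimal sketch. This is not the code from regexec.c: the `range_node` struct and every function name below are hypothetical, the decoder assumes well-formed input on an ASCII platform, and the real node stores its range as packed arguments rather than as separate fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First byte of the UTF-8 encoding of cp (ASCII platforms, no EBCDIC). */
static uint8_t utf8_start_byte(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return (uint8_t) cp;                  /* invariant */
    if (cp < 0x800)   return (uint8_t) (0xC0 | (cp >> 6));  /* 2 bytes */
    if (cp < 0x10000) return (uint8_t) (0xE0 | (cp >> 12)); /* 3 bytes */
    return (uint8_t) (0xF0 | (cp >> 18));                   /* 4 bytes */
}

/* Each continuation byte carries 6 bits, so one start byte of a
 * len-byte sequence covers at most 64**(len - 1) code points. */
static uint32_t code_points_per_start_byte(unsigned len)
{
    assert(len >= 1 && len <= 4);
    return (uint32_t) 1 << (6 * (len - 1));
}

/* Decode one UTF-8 sequence; assumes the input is well formed. */
static uint32_t decode_utf8(const uint8_t *s)
{
    if (s[0] < 0x80) return s[0];
    if (s[0] < 0xE0) return ((uint32_t) (s[0] & 0x1F) << 6) | (s[1] & 0x3F);
    if (s[0] < 0xF0) return ((uint32_t) (s[0] & 0x0F) << 12)
                          | ((uint32_t) (s[1] & 0x3F) << 6) | (s[2] & 0x3F);
    return ((uint32_t) (s[0] & 0x07) << 18) | ((uint32_t) (s[1] & 0x3F) << 12)
         | ((uint32_t) (s[2] & 0x3F) << 6)  | (s[3] & 0x3F);
}

/* Hypothetical stand-in for an ANYOFRb-style node. */
typedef struct {
    uint32_t lo, hi;      /* inclusive code point range */
    uint8_t  start_byte;  /* shared by every code point in lo..hi */
} range_node;

/* The start byte never decreases as the code point grows, so comparing
 * the two endpoints is enough to know the whole range shares it. */
static int range_shares_start_byte(uint32_t lo, uint32_t hi)
{
    return utf8_start_byte(lo) == utf8_start_byte(hi);
}

/* Reject on the first byte alone; decode only when it matches. */
static int match_range_b(const range_node *n, const uint8_t *s)
{
    if (s[0] != n->start_byte) return 0;
    uint32_t cp = decode_utf8(s);
    return cp >= n->lo && cp <= n->hi;
}
```

For instance, a node for U+03B1..U+03B9 has start byte 0xCE: the bytes CF 80 (U+03C0) are rejected without decoding, whereas CE BA (U+03BA) passes the byte test and is rejected only after conversion.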
author: Karl Williamson <khw@cpan.org> 2019-09-21 09:51:52 -0600
committer: Karl Williamson <khw@cpan.org> 2019-11-17 21:20:07 -0700
commit: 2d5613be2139e3ec2e5cf6a54ecbae6ba8b3a1e0 (patch)
tree: 1859d23b3dfab71ab0416d4e0d7746b4db5ac8a1 /regcomp.sym
parent: 13fcf6522466471a1b1c5fc2d760dd5367fd8940 (diff)
download: perl-2d5613be2139e3ec2e5cf6a54ecbae6ba8b3a1e0.tar.gz
1 files changed, 3 insertions, 1 deletions
diff --git a/regcomp.sym b/regcomp.sym
index b664fc8f07..6ca83aeef1 100644
--- a/regcomp.sym
+++ b/regcomp.sym
@@ -82,7 +82,9 @@ ANYOFPOSIXL ANYOF,      sv charclass_posixl S    ; Like ANYOFL, but matches [[:p
 ANYOFH      ANYOF,      sv 1 S    ; Like ANYOF, but only has "High" matches, none in the bitmap; the flags field contains the lowest matchable UTF-8 start byte
 ANYOFHb     ANYOF,      sv 1 S    ; Like ANYOFH, but all matches share the same UTF-8 start byte, given in the flags field
 ANYOFHr     ANYOF,      sv 1 S    ; Like ANYOFH, but the flags field contains packed bounds for all matchable UTF-8 start bytes.
-ANYOFR      ANYOFR,     packed 1  S ; Matches any character in the range given by its packed args: upper 12 bits is the max delta from the base lower 20; the flags field contains the lowest matchable UTF-8 start byte
+ANYOFR      ANYOFR,     packed 1  S  ; Matches any character in the range given by its packed args: upper 12 bits is the max delta from the base lower 20; the flags field contains the lowest matchable UTF-8 start byte
+ANYOFRb     ANYOFR,     packed 1  S ; Like ANYOFR, but all matches share the same UTF-8 start byte, given in the flags field
+# There is no ANYOFRr because khw doesn't think there are likely to be real-world cases where such a large range is used.
 
 ANYOFM      ANYOFM      byte 1 S  ; Like ANYOF, but matches an invariant byte as determined by the mask and arg
 NANYOFM     ANYOFM      byte 1 S  ; complement of ANYOFM