Add ANYOFHs regnode

This node is like ANYOFHb, but is used when more than one leading byte is the same in all the matched code points.

ANYOFHb is used to avoid having to convert from UTF-8 to code point for something that won't match. It checks that the first byte in the UTF-8 encoded target is the desired one, thus ruling out most of the possible code points.

But for higher code points that require longer UTF-8 sequences, a great many non-matching code points pass this filter. For code points above 0xFFFF, the filter is ineffective for almost 200K of them.

This commit creates a new node type that addresses this problem. Instead of a single byte, it stores as many leading bytes as are the same for all code points that match the class. For many classes, that cuts down the number of possible false positives by a huge amount before the target has to be converted to a code point to make the final determination.

This regnode adds a UTF-8 string at the end. Even in the rare worst case it is still much smaller than a plain ANYOF node, because the maximum string length, 15 bytes, is shorter than the 32-byte bitmap that is present in a plain ANYOF. Most of the time the added string will be at most 4 bytes.
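
The filtering idea described in this message can be illustrated with a minimal sketch. It is not the code from perl's regex engine: `struct prefix_filter` and `may_match` are hypothetical names invented here, and the 15-byte capacity is simply the maximum string length quoted above. The sketch compares the stored leading bytes against the start of the target and only reports a possible match when all of them agree, which is the point at which a real matcher would go on to decode the code point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the data an ANYOFHs-like node carries;
 * this is NOT perl's actual regnode layout. */
struct prefix_filter {
    unsigned char len;        /* number of shared leading bytes */
    unsigned char bytes[15];  /* the shared leading UTF-8 bytes */
};

/* Return true if the UTF-8 text in [s, end) starts with every stored
 * leading byte, i.e. the expensive decode-and-look-up step is still
 * needed. Return false if the text can be rejected without decoding. */
static bool may_match(const struct prefix_filter *f,
                      const unsigned char *s, const unsigned char *end)
{
    assert(f->len <= sizeof f->bytes);

    /* Too short to contain the whole prefix: cannot match. */
    if ((size_t)(end - s) < f->len)
        return false;

    return memcmp(s, f->bytes, f->len) == 0;
}
```

With the three stored bytes F0 9F 98, for example, the encoding of U+1F600 (F0 9F 98 80) passes the filter while U+1F400 (F0 9F 90 80) is rejected, even though both share the start byte F0 that a single-byte check would accept.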
author: Karl Williamson <khw@cpan.org> 2019-11-19 19:15:38 -0700
committer: Karl Williamson <khw@cpan.org> 2019-11-20 14:09:21 -0700
commit: 34924db0919c191e271602c82cb2de7784fc63a4 (patch)
tree: 420a27c457c5c44f6089f07fc2657813531d4920 /regcomp.sym
parent: 21c3fd9dd0a7a389c901af03acc1907666ee1870 (diff)
download: perl-34924db0919c191e271602c82cb2de7784fc63a4.tar.gz
1 files changed, 1 insertions, 0 deletions
diff --git a/regcomp.sym b/regcomp.sym
index 251006a245..2f4018d62d 100644
--- a/regcomp.sym
+++ b/regcomp.sym
@@ -82,6 +82,7 @@ ANYOFPOSIXL ANYOF,      sv charclass_posixl S    ; Like ANYOFL, but matches [[:p
 ANYOFH      ANYOF,      sv 1 S    ; Like ANYOF, but only has "High" matches, none in the bitmap; the flags field contains the lowest matchable UTF-8 start byte
 ANYOFHb     ANYOF,      sv 1 S    ; Like ANYOFH, but all matches share the same UTF-8 start byte, given in the flags field
 ANYOFHr     ANYOF,      sv 1 S    ; Like ANYOFH, but the flags field contains packed bounds for all matchable UTF-8 start bytes.
+ANYOFHs     ANYOF,      sv anyofhs S ; Like ANYOFHb, but has a string field that gives the leading matchable UTF-8 bytes; flags field is len
 ANYOFR      ANYOFR,     packed 1  S  ; Matches any character in the range given by its packed args: upper 12 bits is the max delta from the base lower 20; the flags field contains the lowest matchable UTF-8 start byte
 ANYOFRb     ANYOFR,     packed 1  S ; Like ANYOFR, but all matches share the same UTF-8 start byte, given in the flags field
 # There is no ANYOFRr because khw doesn't think there are likely to be real-world cases where such a large range is used.
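
The ANYOFHs entry above records the leading matchable UTF-8 bytes in a string field, with their count in the flags field. One way such a prefix could be derived is sketched below. This is an assumption for illustration, not the code in regcomp.c: `utf8_encode` and `common_leading_bytes` are hypothetical helpers, and only standard UTF-8 up to 0x10FFFF is handled. Because standard UTF-8 preserves code point order bytewise, the bytes shared by the encodings of the lowest and highest code points of a class are shared by every code point in between.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode cp (assumed <= 0x10FFFF) as standard UTF-8.
 * Returns the number of bytes written to out. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Copy into prefix the leading bytes common to the encodings of lo
 * and hi, and return how many there are (0 if even the start bytes
 * differ). */
static size_t common_leading_bytes(uint32_t lo, uint32_t hi,
                                   unsigned char prefix[4])
{
    unsigned char a[4], b[4];
    size_t na, nb, n, i = 0;

    assert(lo <= hi && hi <= 0x10FFFF);

    na = utf8_encode(lo, a);
    nb = utf8_encode(hi, b);
    n = na < nb ? na : nb;

    while (i < n && a[i] == b[i]) {
        prefix[i] = a[i];
        i++;
    }
    return i;
}
```

For the range U+1F600 to U+1F63F this yields the three bytes F0 9F 98, whereas extending the range to U+1F64F leaves only F0 9F in common, so the length of the usable prefix depends on how tightly the class's code points cluster.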