Add regnode NANYOFM

This matches when the existing node ANYOFM would not match; i.e., they are complements. I almost didn't create this node, but it turns out to significantly speed up various classes of matches. For example qr/[^g]/, both /i and not, turn into this node; and something like (("a" x $large_number) . "b") =~ /[^a]/ goes through the string a word at a time, instead of previously byte-by-byte. Benchmarks are at the end of this mesage. This node gets generated when complementing any single ASCII character and when complementing any ASCII case pair, like /[^Gg]/. It never gets generated if the class includes a character that isn't ASCII (actually UTF-8 invariant, which matters only on EBCDIC platforms). The details of when this node gets constructed are complicated. It happens when the bit patterns of the characters in the class happen to have certain very particular characteristics, depending on the vagaries of the character set. [BbCc] will do so, but [AaBb] does not. [^01] does, but not [^12]. Precisely, look at all the bit patterns of the characters in the set, and count the total number of differing bits, calling it 'n'. If and only if the number of characters is 2**n, this node gets generated. As an example, on both ASCII and EBCDIC, the last 4 bits of '0' are 0000; of '1' are 0001; of '2' are 0010; and of '3' are 0011. The other 4 bits are the same for each of these 4 digits. That means that only 2 bits differ among the 4 characters, and 2**2==4, so the NANYOFM node will get generated. Similarly, 8=1000 and 0=0000 differ only in one bit so 2**1==2, and so [^08] will generate this node. We could consider in the future, an extension where, if the input doesn't work to generate this node, that we construct the closure of that input to generate this node, which would have false positives that would have to be tested for. The speedup of this node is so significant that that could still be faster than what we have today. The benchmarks are for a 64-bit word. 32-bits would not be as good. Key: Ir Instruction read Dr Data read Dw Data write COND conditional branches IND indirect branches The numbers (except for the final column) represent raw counts per loop iteration. The higher the number in the final column, the faster. (('a' x 1) . 'b') =~ /[^a]/ blead nanyof Ratio % -------- -------- -------- Ir 2782.0 2648.0 105.1 Dr 845.0 799.0 105.8 Dw 531.0 500.0 106.2 COND 431.0 419.0 102.9 IND 22.0 22.0 100.0 (('a' x 10) . 'b') =~ /[^a]/ blead nanyof Ratio % -------- -------- -------- Ir 3358.0 2671.0 125.7 Dr 998.0 801.0 124.6 Dw 630.0 500.0 126.0 COND 503.0 424.0 118.6 IND 22.0 22.0 100.0 (('a' x 100) . 'b') =~ /[^a]/ blead nanyof Ratio % -------- -------- -------- Ir 9118.0 2773.0 328.8 Dr 2528.0 814.0 310.6 Dw 1620.0 500.0 324.0 COND 1223.0 450.0 271.8 IND 22.0 22.0 100.0 (('a' x 1000) . 'b') =~ /[^a]/ blead nanyof Ratio % -------- -------- -------- Ir 66718.0 3650.0 1827.9 Dr 17828.0 923.0 1931.5 Dw 11520.0 500.0 2304.0 COND 8423.0 668.0 1260.9 IND 22.0 22.0 100.0 (('a' x 10000) . 'b') =~ /[^a]/ blead nanyof Ratio % -------- -------- -------- Ir 642718.0 12650.0 5080.8 Dr 170828.0 2048.0 8341.2 Dw 110520.0 500.0 22104.0 COND 80423.0 2918.0 2756.1 IND 22.0 22.0 100.0 (('a' x 100000) . 'b') =~ /[^a]/ blead nanyof Ratio % -------- -------- -------- Ir Inf 102654.8 6237.1 Dr Inf 13299.3 12788.9 Dw Inf 500.9 219708.7 COND 800424.1 25419.1 3148.9 IND 22.0 22.0 100.0
author: Karl Williamson <khw@cpan.org> 2018-11-15 10:57:24 -0700
committer: Karl Williamson <khw@cpan.org> 2018-11-17 10:02:11 -0700
commit: 3db0bcccf4d9617c94822609f33314fe0df2df3b (patch)
tree: ea58525c991f6388368871ff597fb1ff9a1b268b /regcomp.sym
parent: 1d8aafa0548097ffe407cad078f79b4d56fd7dca (diff)
download: perl-3db0bcccf4d9617c94822609f33314fe0df2df3b.tar.gz
1 files changed, 1 insertions, 0 deletions
diff --git a/regcomp.sym b/regcomp.sym
index 6305bfbbcf..02bd9986b5 100644
--- a/regcomp.sym
+++ b/regcomp.sym
@@ -63,6 +63,7 @@ ANYOFD      ANYOF,      sv charclass S    ; Like ANYOF, but /d is in effect
 ANYOFL      ANYOF,      sv charclass S    ; Like ANYOF, but /l is in effect
 ANYOFPOSIXL ANYOF,      sv charclass_posixl S    ; Like ANYOFL, but matches [[:posix:]] classes
 ANYOFM      ANYOFM      byte 1 S  ; Like ANYOF, but matches an invariant byte as determined by the mask and arg
+NANYOFM     ANYOFM      byte 1 S  ; complement of ANYOFM
 
 #* POSIX Character Classes:
 # Order of the below is important.  See ordering comment above.
author	Karl Williamson <khw@cpan.org>	2018-11-15 10:57:24 -0700
committer	Karl Williamson <khw@cpan.org>	2018-11-17 10:02:11 -0700
commit	3db0bcccf4d9617c94822609f33314fe0df2df3b (patch)
tree	ea58525c991f6388368871ff597fb1ff9a1b268b /regcomp.sym
parent	1d8aafa0548097ffe407cad078f79b4d56fd7dca (diff)
download	perl-3db0bcccf4d9617c94822609f33314fe0df2df3b.tar.gz