diff options
author | Karl Williamson <khw@cpan.org> | 2018-11-15 10:57:24 -0700 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2018-11-17 10:02:11 -0700 |
commit | 3db0bcccf4d9617c94822609f33314fe0df2df3b (patch) | |
tree | ea58525c991f6388368871ff597fb1ff9a1b268b /regcomp.sym | |
parent | 1d8aafa0548097ffe407cad078f79b4d56fd7dca (diff) | |
download | perl-3db0bcccf4d9617c94822609f33314fe0df2df3b.tar.gz |
Add regnode NANYOFM
This matches when the existing node ANYOFM would not match; i.e., they
are complements.
I almost didn't create this node, but it turns out to significantly
speed up various classes of matches. For example qr/[^g]/, both /i and
not, turn into this node; and something like
(("a" x $large_number) . "b") =~ /[^a]/
goes through the string a word at a time, instead of previously
byte-by-byte. Benchmarks are at the end of this mesage.
This node gets generated when complementing any single ASCII character
and when complementing any ASCII case pair, like /[^Gg]/. It never gets
generated if the class includes a character that isn't ASCII (actually
UTF-8 invariant, which matters only on EBCDIC platforms).
The details of when this node gets constructed are complicated. It
happens when the bit patterns of the characters in the class happen to
have certain very particular characteristics, depending on the vagaries
of the character set. [BbCc] will do so, but [AaBb] does not. [^01]
does, but not [^12]. Precisely, look at all the bit patterns of the
characters in the set, and count the total number of differing bits,
calling it 'n'. If and only if the number of characters is 2**n, this
node gets generated. As an example, on both ASCII and EBCDIC, the last
4 bits of '0' are 0000; of '1' are 0001; of '2' are 0010; and of '3' are
0011. The other 4 bits are the same for each of these 4 digits. That
means that only 2 bits differ among the 4 characters, and 2**2==4, so
the NANYOFM node will get generated. Similarly, 8=1000 and 0=0000
differ only in one bit so 2**1==2, and so [^08] will generate this node.
We could consider in the future, an extension where, if the input
doesn't work to generate this node, that we construct the closure of
that input to generate this node, which would have false positives that
would have to be tested for. The speedup of this node is so significant
that that could still be faster than what we have today.
The benchmarks are for a 64-bit word. 32-bits would not be as good.
Key:
Ir Instruction read
Dr Data read
Dw Data write
COND conditional branches
IND indirect branches
The numbers (except for the final column) represent raw counts per loop
iteration. The higher the number in the final column, the faster.
(('a' x 1) . 'b') =~ /[^a]/
blead nanyof Ratio %
-------- -------- --------
Ir 2782.0 2648.0 105.1
Dr 845.0 799.0 105.8
Dw 531.0 500.0 106.2
COND 431.0 419.0 102.9
IND 22.0 22.0 100.0
(('a' x 10) . 'b') =~ /[^a]/
blead nanyof Ratio %
-------- -------- --------
Ir 3358.0 2671.0 125.7
Dr 998.0 801.0 124.6
Dw 630.0 500.0 126.0
COND 503.0 424.0 118.6
IND 22.0 22.0 100.0
(('a' x 100) . 'b') =~ /[^a]/
blead nanyof Ratio %
-------- -------- --------
Ir 9118.0 2773.0 328.8
Dr 2528.0 814.0 310.6
Dw 1620.0 500.0 324.0
COND 1223.0 450.0 271.8
IND 22.0 22.0 100.0
(('a' x 1000) . 'b') =~ /[^a]/
blead nanyof Ratio %
-------- -------- --------
Ir 66718.0 3650.0 1827.9
Dr 17828.0 923.0 1931.5
Dw 11520.0 500.0 2304.0
COND 8423.0 668.0 1260.9
IND 22.0 22.0 100.0
(('a' x 10000) . 'b') =~ /[^a]/
blead nanyof Ratio %
-------- -------- --------
Ir 642718.0 12650.0 5080.8
Dr 170828.0 2048.0 8341.2
Dw 110520.0 500.0 22104.0
COND 80423.0 2918.0 2756.1
IND 22.0 22.0 100.0
(('a' x 100000) . 'b') =~ /[^a]/
blead nanyof Ratio %
-------- -------- --------
Ir Inf 102654.8 6237.1
Dr Inf 13299.3 12788.9
Dw Inf 500.9 219708.7
COND 800424.1 25419.1 3148.9
IND 22.0 22.0 100.0
Diffstat (limited to 'regcomp.sym')
-rw-r--r-- | regcomp.sym | 1 |
1 files changed, 1 insertions, 0 deletions
diff --git a/regcomp.sym b/regcomp.sym index 6305bfbbcf..02bd9986b5 100644 --- a/regcomp.sym +++ b/regcomp.sym @@ -63,6 +63,7 @@ ANYOFD ANYOF, sv charclass S ; Like ANYOF, but /d is in effect ANYOFL ANYOF, sv charclass S ; Like ANYOF, but /l is in effect ANYOFPOSIXL ANYOF, sv charclass_posixl S ; Like ANYOFL, but matches [[:posix:]] classes ANYOFM ANYOFM byte 1 S ; Like ANYOF, but matches an invariant byte as determined by the mask and arg +NANYOFM ANYOFM byte 1 S ; complement of ANYOFM #* POSIX Character Classes: # Order of the below is important. See ordering comment above. |