author | Karl Williamson <khw@cpan.org> | 2019-09-21 09:51:52 -0600 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2019-11-17 21:20:07 -0700 |
commit | 2d5613be2139e3ec2e5cf6a54ecbae6ba8b3a1e0 (patch) | |
tree | 1859d23b3dfab71ab0416d4e0d7746b4db5ac8a1 /regcomp.sym | |
parent | 13fcf6522466471a1b1c5fc2d760dd5367fd8940 (diff) | |
Add ANYOFRb regnode
This is like the ANYOFR regnode added in the previous commit, but all
code points in the range it matches are known to share the same UTF-8
start byte. That means it can't match UTF-8 invariant characters, like
ASCII: each of those has a different "start" byte, so the node could
only match a range of 1, and the compiler wouldn't generate this node
for that, using an EXACT instead.
Pattern matching can rule out most code points by looking at the first
byte of their UTF-8 representation, before having to convert from
UTF-8.
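The start-byte test described above can be sketched as follows. This is a minimal illustration in Python, not the regex engine's actual C code; the function name `matches_range_same_start_byte` and its signature are hypothetical, and the sketch assumes the caller only passes ranges whose code points all share one start byte.

```python
def matches_range_same_start_byte(buf: bytes, lo: int, hi: int) -> bool:
    """Return True if buf begins with the UTF-8 encoding of a code point in lo..hi.

    Hypothetical helper: assumes every code point in lo..hi encodes with the
    same first UTF-8 byte, which is the property the ANYOFRb node relies on.
    """
    encoded_lo = chr(lo).encode("utf-8")
    start_byte = encoded_lo[0]
    if chr(hi).encode("utf-8")[0] != start_byte:
        raise ValueError("range does not share a single start byte")

    # Cheap rejection: a single byte comparison, no decoding needed.
    if not buf or buf[0] != start_byte:
        return False

    # Only the few remaining candidates pay for a real decode.
    try:
        code_point = ord(buf[:len(encoded_lo)].decode("utf-8"))
    except UnicodeDecodeError:
        return False
    return lo <= code_point <= hi
```

With the range U+0391..U+03A9 (start byte 0xCE), an input beginning with `A` is rejected by the byte comparison alone, while U+03B1 passes the byte comparison and is only rejected after decoding.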
On ASCII platforms, this simple comparison rules out all but 64 2-byte
UTF-8 characters. For 3-byte characters it's up to 4096, and for 4-byte
ones, 2**18, so the test is less effective for higher code points.
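The figures above follow from each UTF-8 continuation byte carrying 6 payload bits. They can be checked by brute force; the sketch below counts, for every start byte, how many code points encode with it. The helper names are made up for this illustration, and `surrogatepass` is used only so that Python will encode the surrogates too.

```python
from collections import Counter

def start_byte_counts() -> Counter:
    """Count code points per (encoded length, first byte) pair."""
    counts = Counter()
    for cp in range(0x80, 0x110000):  # skip the single-byte invariants
        encoded = chr(cp).encode("utf-8", "surrogatepass")
        counts[(len(encoded), encoded[0])] += 1
    return counts

def max_candidates(counts: Counter, n_bytes: int) -> int:
    """Largest number of code points sharing one start byte at this length."""
    return max(n for (length, _), n in counts.items() if length == n_bytes)

if __name__ == "__main__":
    counts = start_byte_counts()
    for n_bytes in (2, 3, 4):
        print(n_bytes, max_candidates(counts, n_bytes))
```

The maxima are upper bounds: some start bytes (0xE0, 0xF0, and 0xF4, for example) cover fewer code points than the others of the same length.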
I believe that most UTF-8 patterns that otherwise would compile to
ANYOFR will instead compile to this, as I can't envision real life
applications wanting to match large single ranges. Even the 2048
surrogates all have the same first byte.
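Whether a given range qualifies can be checked from its endpoints alone, because UTF-8 preserves code point order bytewise: if the lowest and highest code points share a start byte, so does everything between them. The sketch below is illustrative only; `shares_start_byte` is a hypothetical name, not a function in the Perl source.

```python
def first_utf8_byte(cp: int) -> int:
    # "surrogatepass" lets Python encode U+D800..U+DFFF for this check.
    return chr(cp).encode("utf-8", "surrogatepass")[0]

def shares_start_byte(lo: int, hi: int) -> bool:
    """True if every code point in lo..hi has the same UTF-8 start byte.

    Checking the endpoints suffices because the first byte never
    decreases as the code point increases.
    """
    return first_utf8_byte(lo) == first_utf8_byte(hi)

if __name__ == "__main__":
    print(shares_start_byte(0xD800, 0xDFFF))  # all 2048 surrogates
    print(shares_start_byte(0x41, 0x5A))      # A..Z, invariant characters
```

The surrogate range qualifies (every one starts with 0xED), while any multi-character ASCII range does not, since each invariant character is its own start byte.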
Diffstat (limited to 'regcomp.sym')
-rw-r--r-- | regcomp.sym | 4 |
1 files changed, 3 insertions, 1 deletions
diff --git a/regcomp.sym b/regcomp.sym
index b664fc8f07..6ca83aeef1 100644
--- a/regcomp.sym
+++ b/regcomp.sym
@@ -82,7 +82,9 @@ ANYOFPOSIXL ANYOF, sv charclass_posixl S ; Like ANYOFL, but matches [[:p
 ANYOFH ANYOF, sv 1 S ; Like ANYOF, but only has "High" matches, none in the bitmap; the flags field contains the lowest matchable UTF-8 start byte
 ANYOFHb ANYOF, sv 1 S ; Like ANYOFH, but all matches share the same UTF-8 start byte, given in the flags field
 ANYOFHr ANYOF, sv 1 S ; Like ANYOFH, but the flags field contains packed bounds for all matchable UTF-8 start bytes.
-ANYOFR ANYOFR, packed 1 S ; Matches any character in the range given by its packed args: upper 12 bits is the max delta from the base lower 20; the flags field contains the lowest matchable UTF-8 start byte
+ANYOFR ANYOFR, packed 1 S ; Matches any character in the range given by its packed args: upper 12 bits is the max delta from the base lower 20; the flags field contains the lowest matchable UTF-8 start byte
+ANYOFRb ANYOFR, packed 1 S ; Like ANYOFR, but all matches share the same UTF-8 start byte, given in the flags field
+# There is no ANYOFRr because khw doesn't think there are likely to be real-world cases where such a large range is used.

 ANYOFM ANYOFM byte 1 S ; Like ANYOF, but matches an invariant byte as determined by the mask and arg
 NANYOFM ANYOFM byte 1 S ; complement of ANYOFM
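The ANYOFR comment in the diff describes the packed argument as a 20-bit base in the low bits and a 12-bit maximum delta in the high bits. A minimal sketch of that layout, assuming a 32-bit argument; the names `pack_range` and `unpack_range` are invented for this example and are not the macros used in the Perl source:

```python
BASE_BITS = 20   # lower 20 bits: lowest code point in the range
DELTA_BITS = 12  # upper 12 bits: distance from the base to the highest code point

def pack_range(lo: int, hi: int) -> int:
    """Pack an inclusive code point range lo..hi into one 32-bit value."""
    delta = hi - lo
    if not 0 <= lo < (1 << BASE_BITS):
        raise ValueError("base does not fit in 20 bits")
    if not 0 <= delta < (1 << DELTA_BITS):
        raise ValueError("delta does not fit in 12 bits")
    return (delta << BASE_BITS) | lo

def unpack_range(arg: int) -> tuple[int, int]:
    """Recover (lo, hi) from a value produced by pack_range."""
    base = arg & ((1 << BASE_BITS) - 1)
    delta = arg >> BASE_BITS
    return base, base + delta

if __name__ == "__main__":
    packed = pack_range(0xD800, 0xDFFF)
    print(hex(packed), [hex(x) for x in unpack_range(packed)])
```

Under this layout a single node can describe a range of at most 4096 code points, which comfortably covers the 2048 surrogates.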