author | Karl Williamson <khw@cpan.org> | 2022-07-05 09:41:12 -0600 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2022-07-12 05:14:35 -0600 |
commit | 4c8c99df3c8fddf745e5971c2f6b56d2027f2be8 (patch) | |
tree | bf66d4668df70961fb5a208021bb6b1874417b51 /regcomp.sym | |
parent | 4bc69097efa904296afd72d1c6ca0d95f3fbb028 (diff) | |
download | perl-4c8c99df3c8fddf745e5971c2f6b56d2027f2be8.tar.gz |
regex: Add optimizing regnode
It turns out that any character class whose members all have UTF-8
representations that are two bytes long and share the same first byte
can be represented by a compact, fast regnode designed for the purpose.
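As a rough illustration of that condition, the sketch below checks whether a set of code points qualifies. It assumes standard UTF-8 on an ASCII platform, where two-byte sequences cover U+0080 through U+07FF and the start byte is determined by the code point's bits above the low six. The helper name `shares_two_byte_start` is hypothetical and is not part of Perl's source.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not Perl's code): returns true when every code
 * point in cps[] encodes to exactly two UTF-8 bytes and all of those
 * encodings begin with the same start byte.  Assumes standard UTF-8,
 * not UTF-EBCDIC. */
static bool shares_two_byte_start(const uint32_t *cps, size_t n)
{
    if (n == 0)
        return false;

    for (size_t i = 0; i < n; i++) {
        /* Two-byte UTF-8 covers U+0080..U+07FF. */
        if (cps[i] < 0x80 || cps[i] > 0x7FF)
            return false;

        /* The start byte is 0xC0 | (cp >> 6), so equal high bits mean
         * an equal start byte. */
        if ((cps[i] >> 6) != (cps[0] >> 6))
            return false;
    }
    return true;
}
```

For example, U+03B1 and U+03BF both encode with start byte 0xCE, so a class holding only those would qualify, while adding U+03C0 (start byte 0xCF) would not.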
This commit adds that regnode, ANYOFHbbm. ANYOFHb already exists for
classes where all elements have the same first byte, and this just
changes the two-byte ones to use a bitmap instead of an inversion list.
The advantages of this are that no conversion to code point is required
(the continuation byte is just looked up in the bitmap) and no inversion
list is needed. The inversion list would occupy more space, from 4 to
34 extra 64-bit words, plus an AV and SV, depending on what elements the
class matches.
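The lookup described above can be sketched as follows. This is a minimal illustration, not Perl's actual node layout or matching code: the `bbm_sketch` struct and the `bbm_add` and `bbm_match` functions are hypothetical names, and the sketch assumes standard UTF-8, where a continuation byte carries six payload bits, so a 64-bit bitmap covers every possible second byte.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the node; Perl's real structure differs. */
typedef struct {
    uint8_t  first_byte;  /* UTF-8 start byte shared by all members    */
    uint64_t bitmap;      /* bit i set => continuation 0x80 + i matches */
} bbm_sketch;

/* Record that the continuation byte `cont` (0x80..0xBF) is in the class. */
static void bbm_add(bbm_sketch *node, uint8_t cont)
{
    node->bitmap |= (uint64_t)1 << (cont & 0x3F);
}

/* Test the two-byte sequence at s against the node.  No code point is
 * computed: the second byte's low six bits index the bitmap directly. */
static bool bbm_match(const bbm_sketch *node, const uint8_t *s, size_t len)
{
    if (len < 2 || s[0] != node->first_byte)
        return false;
    if ((s[1] & 0xC0) != 0x80)      /* not a continuation byte */
        return false;
    return (node->bitmap >> (s[1] & 0x3F)) & 1;
}
```

In this sketch the whole node is one byte plus one 64-bit word, and matching needs only a byte comparison, a mask, and a shift.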
Many characters in the Latin, Greek, Cyrillic, Hebrew, Arabic, and
several other (lesser-known) scripts are of this form.
It would be possible to extend this technique to larger bitmaps, but
this commit is a start.
Diffstat (limited to 'regcomp.sym')
-rw-r--r-- | regcomp.sym | 3 |
1 files changed, 3 insertions, 0 deletions
diff --git a/regcomp.sym b/regcomp.sym
index 516c489dca..829882ecc2 100644
--- a/regcomp.sym
+++ b/regcomp.sym
@@ -109,6 +109,9 @@ ANYOFRb ANYOFR, packed 1 S ; Like ANYOFR, but all matches share the sam
 # different or else this wouldn't be a range.) So we might as well displense
 # with the comparisons that ANYOFRs would do, and go directly to do the
 # conversion .
+
+ANYOFHbbm ANYOFHbbm none bbm S ; Like ANYOFHb, but only for 2-byte UTF-8 characters; uses a bitmap to match the continuation byte
+
 ANYOFM ANYOFM, byte 1 S ; Like ANYOF, but matches an invariant byte as determined by the mask and arg
 NANYOFM ANYOFM, byte 1 S ; complement of ANYOFM