author: Karl Williamson <khw@cpan.org> 2022-07-05 09:41:12 -0600
committer: Karl Williamson <khw@cpan.org> 2022-07-12 05:14:35 -0600
commit: 4c8c99df3c8fddf745e5971c2f6b56d2027f2be8
tree: bf66d4668df70961fb5a208021bb6b1874417b51 /regcomp.sym
parent: 4bc69097efa904296afd72d1c6ca0d95f3fbb028
regex: Add optimizing regnode

It turns out that any character class whose elements all have a two-byte UTF-8 representation and share the same first byte can be represented by a compact, fast regnode designed for the purpose. This commit adds that regnode, ANYOFHbbm.

ANYOFHb already exists for classes where all elements have the same first byte; this commit changes the two-byte ones to use a bitmap instead of an inversion list. The advantages are that no conversion to code point is required (the continuation byte is just looked up in the bitmap) and no inversion list is needed. The inversion list would occupy more space: from 4 to 34 extra 64-bit words, plus an AV and SV, depending on what elements the class matches.

Many characters in the Latin, Greek, Cyrillic, Hebrew, Arabic, and several other (lesser-known) scripts are of this form. It would be possible to extend this technique to larger bitmaps, but this commit is a start.

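
The bitmap lookup described above can be illustrated with a minimal standalone sketch. This is not the code from the Perl core: the `bbm_class` struct, the `bbm_add` and `bbm_match` helpers, and the choice of a single 64-bit word for the bitmap are all assumptions made for illustration. The sketch relies only on the fact that a UTF-8 continuation byte lies in 0x80..0xBF, so 64 bits are enough to cover every possible second byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout (not Perl's actual regnode): one shared lead byte
 * plus one bit per possible continuation byte (0x80..0xBF). */
typedef struct {
    uint8_t  first_byte;  /* shared UTF-8 lead byte, 0xC2..0xDF */
    uint64_t bitmap;      /* bit i set => continuation byte 0x80 + i matches */
} bbm_class;

/* Add a code point from U+0080..U+07FF (the two-byte UTF-8 range).
 * Returns 1 on success, 0 if its lead byte differs from the one already
 * stored, in which case this representation cannot hold the class. */
static int bbm_add(bbm_class *c, uint32_t cp)
{
    assert(cp >= 0x80 && cp <= 0x7FF);
    uint8_t lead = (uint8_t)(0xC0 | (cp >> 6));
    if (c->bitmap == 0)
        c->first_byte = lead;          /* first element fixes the lead byte */
    else if (c->first_byte != lead)
        return 0;
    c->bitmap |= (uint64_t)1 << (cp & 0x3F);
    return 1;
}

/* Match the UTF-8 character at s without decoding it to a code point:
 * compare the lead byte, then test one bit for the continuation byte. */
static int bbm_match(const bbm_class *c, const uint8_t *s, size_t len)
{
    if (len < 2 || s[0] != c->first_byte)
        return 0;
    if ((s[1] & 0xC0) != 0x80)         /* not a continuation byte */
        return 0;
    return (int)((c->bitmap >> (s[1] & 0x3F)) & 1);
}
```

For example, a class built from U+03B1 and U+03B2 (both encoded with lead byte 0xCE) matches the byte sequence CE B1 with one comparison and one bit test, while a code point with a different lead byte, such as U+0430 (D0 B0), cannot be added to the same class in this sketch.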
Diffstat (limited to 'regcomp.sym')
-rw-r--r-- regcomp.sym 3
1 files changed, 3 insertions, 0 deletions
diff --git a/regcomp.sym b/regcomp.sym
index 516c489dca..829882ecc2 100644
--- a/regcomp.sym
+++ b/regcomp.sym
@@ -109,6 +109,9 @@ ANYOFRb ANYOFR, packed 1 S ; Like ANYOFR, but all matches share the sam
# different or else this wouldn't be a range.) So we might as well displense
# with the comparisons that ANYOFRs would do, and go directly to do the
# conversion .
+
+ANYOFHbbm ANYOFHbbm none bbm S ; Like ANYOFHb, but only for 2-byte UTF-8 characters; uses a bitmap to match the continuation byte
+
ANYOFM ANYOFM, byte 1 S ; Like ANYOF, but matches an invariant byte as determined by the mask and arg
NANYOFM ANYOFM, byte 1 S ; complement of ANYOFM