From 4c8c99df3c8fddf745e5971c2f6b56d2027f2be8 Mon Sep 17 00:00:00 2001
From: Karl Williamson
Date: Tue, 5 Jul 2022 09:41:12 -0600
Subject: regex: Add optimizing regnode

It turns out that any character class whose UTF-8 representation is two
bytes long, and where all elements share the same first byte, can be
represented by a compact, fast regnode designed for the purpose.

This commit adds that regnode, ANYOFHbbm.  ANYOFHb already exists for
classes where all elements have the same first byte, and this just
changes the two-byte ones to use a bitmap instead of an inversion list.

The advantages of this are that no conversion to code point is required
(the continuation byte is just looked up in the bitmap) and no inversion
list is needed.  The inversion list would occupy more space, from 4 to
34 extra 64-bit words, plus an AV and SV, depending on what elements the
class matches.

Many characters in the Latin, Greek, Cyrillic, Hebrew, Arabic, and
several other (lesser-known) scripts are of this form.

It would be possible to extend this technique to larger bitmaps, but
this commit is a start.
---
 regcomp.sym | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/regcomp.sym b/regcomp.sym
index 516c489dca..829882ecc2 100644
--- a/regcomp.sym
+++ b/regcomp.sym
@@ -109,6 +109,9 @@ ANYOFRb ANYOFR, packed 1 S ; Like ANYOFR, but all matches share the sam
 # different or else this wouldn't be a range.) So we might as well displense
 # with the comparisons that ANYOFRs would do, and go directly to do the
 # conversion .
+
+ANYOFHbbm ANYOFHbbm none bbm S ; Like ANYOFHb, but only for 2-byte UTF-8 characters; uses a bitmap to match the continuation byte
+
 ANYOFM ANYOFM, byte 1 S ; Like ANYOF, but matches an invariant byte as determined by the mask and arg
 NANYOFM ANYOFM, byte 1 S ; complement of ANYOFM
--
cgit v1.2.1
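
The bitmap lookup described in the commit message can be illustrated with a small standalone sketch. This is not the code Perl uses: the `bbm_node` type, its field layout, and the `bbm_init`, `bbm_add`, and `bbm_match` helpers are hypothetical names invented here, and the sketch assumes an ASCII platform where a two-byte UTF-8 sequence has a start byte in 0xC2..0xDF and a continuation byte in 0x80..0xBF, so a 64-bit bitmap covers every possible continuation byte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node layout for illustration; not Perl's regnode struct. */
typedef struct {
    uint8_t  first_byte; /* shared UTF-8 start byte, 0xC2..0xDF */
    uint64_t bitmap;     /* bit i set => continuation byte 0x80 + i matches */
} bbm_node;

/* Start an empty class whose members all begin with first_byte. */
static void bbm_init(bbm_node *n, uint8_t first_byte)
{
    assert(first_byte >= 0xC2 && first_byte <= 0xDF);
    n->first_byte = first_byte;
    n->bitmap = 0;
}

/* Add code point cp to the class.  Returns false if cp is not a
 * two-byte UTF-8 character or does not share the node's first byte. */
static bool bbm_add(bbm_node *n, uint32_t cp)
{
    if (cp < 0x80 || cp > 0x7FF)
        return false;
    if ((uint8_t)(0xC0 | (cp >> 6)) != n->first_byte)
        return false;
    n->bitmap |= (uint64_t)1 << (cp & 0x3F); /* low 6 bits = continuation payload */
    return true;
}

/* Match the UTF-8 bytes at s directly: compare the first byte, then look
 * the continuation byte up in the bitmap.  No code point is decoded. */
static bool bbm_match(const bbm_node *n, const uint8_t *s, size_t len)
{
    if (len < 2 || s[0] != n->first_byte)
        return false;
    if ((s[1] & 0xC0) != 0x80) /* not a continuation byte */
        return false;
    return ((n->bitmap >> (s[1] & 0x3F)) & 1) != 0;
}
```

For instance, a class holding only GREEK SMALL LETTER ALPHA (U+03B1, bytes 0xCE 0xB1) would be set up with `bbm_init(&n, 0xCE)` followed by `bbm_add(&n, 0x3B1)`. Matching then costs one byte comparison and one bit test, which is the saving over an inversion list that the commit message describes.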