From 4c8c99df3c8fddf745e5971c2f6b56d2027f2be8 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Tue, 5 Jul 2022 09:41:12 -0600
Subject: regex: Add optimizing regnode

It turns out that any character class whose elements all have two-byte
UTF-8 representations that share the same first byte can be represented
by a compact, fast regnode designed for the purpose.

This commit adds that regnode, ANYOFHbbm.  ANYOFHb already exists for
classes where all elements have the same first byte, and this just
changes the two-byte ones to use a bitmap instead of an inversion list.

The advantages of this are that no conversion to code point is required
(the continuation byte is just looked up in the bitmap) and no inversion
list is needed.  The inversion list would occupy more space, from 4 to
34 extra 64-bit words, plus an AV and SV, depending on what elements the
class matches.
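
As a rough illustration of the idea (not the code in this commit), the
lookup could be sketched as below.  The struct layout and the names
bbm_class, bbm_add and bbm_match are hypothetical, and the sketch assumes
standard UTF-8 (not UTF-EBCDIC), where a continuation byte carries 6 bits
of information, so a 64-bit bitmap covers every possible continuation
byte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, for illustration only: one shared lead byte plus
 * one bit per possible continuation byte (0x80..0xBF). */
typedef struct {
    uint8_t  first_byte;  /* lead byte shared by every member */
    uint64_t bitmap;      /* bit i set => continuation byte 0x80+i matches */
} bbm_class;

/* Add code point cp to the class.  Returns false if cp is not a two-byte
 * UTF-8 character, or if its lead byte differs from the one already
 * stored, in which case this representation cannot be used. */
static bool bbm_add(bbm_class *c, uint32_t cp)
{
    assert(c != NULL);
    if (cp < 0x80 || cp > 0x7FF)
        return false;
    uint8_t lead = (uint8_t)(0xC0 | (cp >> 6));
    if (c->bitmap != 0 && lead != c->first_byte)
        return false;
    c->first_byte = lead;
    c->bitmap |= (uint64_t)1 << (cp & 0x3F);
    return true;
}

/* Match the character starting at s against the class.  No code point is
 * reconstructed: the lead byte is compared directly and the low 6 bits of
 * the continuation byte index the bitmap. */
static bool bbm_match(const bbm_class *c, const uint8_t *s, size_t len)
{
    assert(c != NULL && s != NULL);
    if (len < 2 || s[0] != c->first_byte)
        return false;
    if ((s[1] & 0xC0) != 0x80)  /* not a continuation byte */
        return false;
    return (c->bitmap >> (s[1] & 0x3F)) & 1;
}
```

For example, a class built from U+03B1 and U+03B2 stores the lead byte
0xCE and sets bits 0x31 and 0x32; matching the bytes CE B1 then takes one
byte comparison and one bit test.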

Many characters in the Latin, Greek, Cyrillic, Hebrew, Arabic, and
several other (lesser-known) scripts are of this form.

It would be possible to extend this technique to larger bitmaps, but
this commit is a start.
---
 regcomp.sym | 3 +++
 1 file changed, 3 insertions(+)

(limited to 'regcomp.sym')

diff --git a/regcomp.sym b/regcomp.sym
index 516c489dca..829882ecc2 100644
--- a/regcomp.sym
+++ b/regcomp.sym
@@ -109,6 +109,9 @@ ANYOFRb     ANYOFR,     packed 1  S ; Like ANYOFR, but all matches share the sam
 # different or else this wouldn't be a range.)  So we might as well displense
 # with the comparisons that ANYOFRs would do, and go directly to do the
 # conversion .
+
+ANYOFHbbm   ANYOFHbbm   none bbm S ; Like ANYOFHb, but only for 2-byte UTF-8 characters; uses a bitmap to match the continuation byte
+
 ANYOFM      ANYOFM,     byte 1 S  ; Like ANYOF, but matches an invariant byte as determined by the mask and arg
 NANYOFM     ANYOFM,     byte 1 S  ; complement of ANYOFM
 
-- 
cgit v1.2.1