author Karl Williamson <khw@cpan.org> 2019-03-20 11:47:15 -0600
committer Karl Williamson <khw@cpan.org> 2019-03-20 12:12:44 -0600
commit 765e6ecf32a570694dcff91c1c72f98306a9390e
tree 6f2b8a6cd5bd5c4c24a80459d052d1c4f348853d /regnodes.h
parent 80e7c5414423d633f11ec93a7990915e97489502
Add common UTF-8 first byte to ANYOFH regnodes
An ANYOFH regnode is generated instead of a plain ANYOF one when nothing it can match is in the bitmap used in ANYOF nodes. It is therefore smaller, as the 4 word (or more) bitmap is omitted. For such a node to match a target string, that string must be UTF-8, since the bitmap covers at least the lowest 256 code points. Only in rare circumstances are there any flags associated with it in the regnode flags field.

This commit repurposes the flags field in an ANYOFH node to hold the first UTF-8 encoded byte of every code point matched by the class, if they all share a common first byte; or 0 if some have different first bytes. (That means that those rare cases where the flags field isn't otherwise empty can no longer be ANYOFH nodes.)

The purpose is to let a future commit scan the target string more quickly for places where this node can match. A problem with ANYOF nodes is that they are code point oriented (U32 or U64), while the target string is UTF-8, so conversion has to be done. By having the partial conversion compiled in, we can look for that byte at runtime instead of having to look at every character in the scan.
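The idea described in the message can be illustrated with a minimal standalone sketch. This is not code from the Perl source: the helper names utf8_first_byte, common_first_byte and next_candidate are hypothetical, and the sketch assumes standard UTF-8 for code points up to 0x10FFFF (Perl's extended UTF-8 and EBCDIC platforms are ignored). It shows how a shared first byte could be derived from a set of code points, and how a scan could use it to skip ahead with memchr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: first byte of the standard UTF-8 encoding of cp
 * (assumes cp <= 0x10FFFF). */
static uint8_t utf8_first_byte(uint32_t cp)
{
    if (cp < 0x80)    return (uint8_t) cp;
    if (cp < 0x800)   return (uint8_t) (0xC0 | (cp >> 6));
    if (cp < 0x10000) return (uint8_t) (0xE0 | (cp >> 12));
    return (uint8_t) (0xF0 | (cp >> 18));
}

/* Hypothetical helper: the first byte shared by every code point in cps,
 * or 0 if the set is empty or the first bytes differ.  0 is safe as the
 * "no common byte" value here because the code points of interest are
 * all above 255, so none of them encodes with a first byte of 0. */
static uint8_t common_first_byte(const uint32_t *cps, size_t n)
{
    if (n == 0) return 0;

    uint8_t first = utf8_first_byte(cps[0]);
    for (size_t i = 1; i < n; i++) {
        if (utf8_first_byte(cps[i]) != first) return 0;
    }
    return first;
}

/* Hypothetical helper: next position in [s, end) where a match could
 * start.  With a non-zero flags byte the scan jumps straight to the next
 * occurrence of that byte; with 0 there is no shortcut and the current
 * position is returned unchanged.  Returns NULL if no candidate exists. */
static const uint8_t *next_candidate(const uint8_t *s, const uint8_t *end,
                                     uint8_t flags)
{
    assert(s <= end);
    if (flags == 0) return s;
    return memchr(s, flags, (size_t) (end - s));
}
```

A caller would still have to run the full class match at each candidate position; the byte only rules out positions that cannot possibly match.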
Diffstat (limited to 'regnodes.h')
-rw-r--r-- regnodes.h 2
1 files changed, 1 insertions, 1 deletions
diff --git a/regnodes.h b/regnodes.h
index 803938ac48..ba691a2c18 100644
--- a/regnodes.h
+++ b/regnodes.h
@@ -33,7 +33,7 @@
#define ANYOFD 19 /* 0x13 Like ANYOF, but /d is in effect */
#define ANYOFL 20 /* 0x14 Like ANYOF, but /l is in effect */
#define ANYOFPOSIXL 21 /* 0x15 Like ANYOFL, but matches [[:posix:]] classes */
-#define ANYOFH 22 /* 0x16 Like ANYOF, but only has "High" matches, none in the bitmap */
+#define ANYOFH 22 /* 0x16 Like ANYOF, but only has "High" matches, none in the bitmap; non-zero flags "f" means "f" is the first UTF-8 byte shared in common by all code points matched */
#define ANYOFM 23 /* 0x17 Like ANYOF, but matches an invariant byte as determined by the mask and arg */
#define NANYOFM 24 /* 0x18 complement of ANYOFM */
#define POSIXD 25 /* 0x19 Some [[:class:]] under /d; the FLAGS field gives which one */