Make isC9_STRICT_UTF8_CHAR() an inline DFA

This replaces a complicated trie with a DFA, which should cut down the number of conditionals encountered when parsing many code points.
author: Karl Williamson <khw@cpan.org> 2018-07-01 16:00:41 -0600
committer: Karl Williamson <khw@cpan.org> 2018-07-05 14:47:19 -0600
commit: c5bfbb64f98c2c7e8055565dd018e0a2a8565c10 (patch)
tree: 480e811b72524783daf394b67ec17dc426538540 /inline.h
parent: 67049a5ffa8b7757041edb8f972a0a74fbe5d63d (diff)
download: perl-c5bfbb64f98c2c7e8055565dd018e0a2a8565c10.tar.gz
1 files changed, 57 insertions, 0 deletions
diff --git a/inline.h b/inline.h
index e5fe5f5b97..33cbde8451 100644
--- a/inline.h
+++ b/inline.h
@@ -1156,6 +1156,63 @@ S_isSTRICT_UTF8_CHAR(const U8 * const s0, const U8 * const e)
 
 /*
 
+=for apidoc Am|STRLEN|isC9_STRICT_UTF8_CHAR|const U8 *s|const U8 *e
+
+Evaluates to non-zero if the first few bytes of the string starting at C<s> and
+looking no further than S<C<e - 1>> are well-formed UTF-8 that represents some
+Unicode non-surrogate code point; otherwise it evaluates to 0.  If non-zero,
+the value gives how many bytes starting at C<s> comprise the code point's
+representation.  Any bytes remaining before C<e>, but beyond the ones needed to
+form the first code point in C<s>, are not examined.
+
+The largest acceptable code point is the Unicode maximum 0x10FFFF.  This
+differs from C<L</isSTRICT_UTF8_CHAR>> only in that it accepts non-character
+code points.  This corresponds to
+L<Unicode Corrigendum #9|http://www.unicode.org/versions/corrigendum9.html>,
+which said that non-character code points are merely discouraged rather than
+completely forbidden in open interchange.  See
+L<perlunicode/Noncharacter code points>.
+
+Use C<L</isUTF8_CHAR>> to check for Perl's extended UTF-8; and
+C<L</isUTF8_CHAR_flags>> for a more customized definition.
+
+Use C<L</is_c9strict_utf8_string>>, C<L</is_c9strict_utf8_string_loc>>, and
+C<L</is_c9strict_utf8_string_loclen>> to check entire strings.
+
+=cut
+
+This uses an adaptation of the tables and algorithm given in
+http://bjoern.hoehrmann.de/utf-8/decoder/dfa/, which provides comprehensive
+documentation of the original version.  A copyright notice for the original
+version is given at the beginning of this file.  The Perl adaptation is
+documented at the definition of C9_utf8_dfa_tab[].
+
+*/
+
+PERL_STATIC_INLINE Size_t
+S_isC9_STRICT_UTF8_CHAR(const U8 * const s0, const U8 * const e)
+{
+    const U8 * s = s0;
+    UV state = 0;
+
+    PERL_ARGS_ASSERT_ISC9_STRICT_UTF8_CHAR;
+
+    while (s < e && LIKELY(state != 1)) {
+        state = C9_utf8_dfa_tab[256 + state + C9_utf8_dfa_tab[*s]];
+
+        if (state != 0) {
+            s++;
+            continue;
+        }
+
+        return s - s0 + 1;
+    }
+
+    return 0;
+}
+
+/*
+
 =for apidoc is_strict_utf8_string_loc
 
 Like C<L</is_strict_utf8_string>> but stores the location of the failure (in the
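
The table-driven loop added by this patch can be illustrated with a minimal, self-contained sketch. It is not Perl's implementation: C9_utf8_dfa_tab[] is not part of this diff, so the tables byte_class[] and next_state[][] and the function utf8_char_len() below are hypothetical names written for illustration only. The patch indexes a single array as 256 + state + class, which suggests it stores states as row offsets. The sketch keeps two separate arrays for readability. Like the patched function, it accepts non-character code points and rejects surrogates, overlongs, and anything above 0x10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

enum { ACCEPT = 0, REJECT = 1 };

/* Hypothetical byte classes (assumed, not Perl's):
 * 0 ASCII, 1 80..8F, 2 90..9F, 3 A0..BF, 4 C2..DF, 5 E0,
 * 6 E1..EC/EE/EF, 7 ED, 8 F0, 9 F1..F3, 10 F4, 11 never valid. */
static const uint8_t byte_class[256] = {
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,          /* 00..0F */
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,          /* 10..1F */
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,          /* 20..2F */
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,          /* 30..3F */
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,          /* 40..4F */
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,          /* 50..5F */
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,          /* 60..6F */
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,          /* 70..7F */
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,          /* 80..8F */
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,          /* 90..9F */
    3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,          /* A0..AF */
    3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,          /* B0..BF */
    11,11,4,4,4,4,4,4,4,4,4,4,4,4,4,4,        /* C0..CF */
    4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,          /* D0..DF */
    5,6,6,6,6,6,6,6,6,6,6,6,6,7,6,6,          /* E0..EF */
    8,9,9,9,10,11,11,11,11,11,11,11,11,11,11,11 /* F0..FF */
};

/* next_state[state][class]; any pair not allowed leads to REJECT. */
static const uint8_t next_state[9][12] = {
    /* class:              0  1  2  3  4  5  6  7  8  9 10 11 */
    /* 0 start/accept  */ {0, 1, 1, 1, 2, 4, 3, 5, 7, 6, 8, 1},
    /* 1 reject        */ {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1},
    /* 2 need 1 more   */ {1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1},
    /* 3 need 2 more   */ {1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1},
    /* 4 after E0      */ {1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1},
    /* 5 after ED      */ {1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1},
    /* 6 after F1..F3  */ {1, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1},
    /* 7 after F0      */ {1, 1, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1},
    /* 8 after F4      */ {1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}
};

/* Returns the byte length of the first code point in [s0, e),
 * or 0 if those bytes are malformed or the input ends too early. */
static size_t
utf8_char_len(const uint8_t *s0, const uint8_t *e)
{
    const uint8_t *s = s0;
    unsigned state = ACCEPT;

    while (s < e) {
        /* One table lookup per byte replaces the trie's branching. */
        state = next_state[state][byte_class[*s]];
        s++;
        if (state == ACCEPT)
            return (size_t)(s - s0);
        if (state == REJECT)
            break;
    }
    return 0;
}
```

Under these assumed tables, the three bytes EF BF BF (the non-character U+FFFF) yield 3, while ED A0 80 (the surrogate U+D800) and a sequence cut short by e both yield 0.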