author | Karl Williamson <khw@cpan.org> | 2018-07-01 16:00:41 -0600 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2018-07-05 14:47:19 -0600 |
commit | c5bfbb64f98c2c7e8055565dd018e0a2a8565c10 (patch) | |
tree | 480e811b72524783daf394b67ec17dc426538540 /inline.h | |
parent | 67049a5ffa8b7757041edb8f972a0a74fbe5d63d (diff) | |
download | perl-c5bfbb64f98c2c7e8055565dd018e0a2a8565c10.tar.gz |
Make isC9_STRICT_UTF8_CHAR() an inline dfa
This replaces a complicated trie with a dfa. This should cut down the
number of conditionals encountered in parsing many code points.
Diffstat (limited to 'inline.h')
-rw-r--r-- | inline.h | 57 |
1 files changed, 57 insertions, 0 deletions
@@ -1156,6 +1156,63 @@ S_isSTRICT_UTF8_CHAR(const U8 * const s0, const U8 * const e)
 /*
+=for apidoc Am|STRLEN|isC9_STRICT_UTF8_CHAR|const U8 *s|const U8 *e
+
+Evaluates to non-zero if the first few bytes of the string starting at C<s> and
+looking no further than S<C<e - 1>> are well-formed UTF-8 that represents some
+Unicode non-surrogate code point; otherwise it evaluates to 0. If non-zero,
+the value gives how many bytes starting at C<s> comprise the code point's
+representation. Any bytes remaining before C<e>, but beyond the ones needed to
+form the first code point in C<s>, are not examined.
+
+The largest acceptable code point is the Unicode maximum 0x10FFFF. This
+differs from C<L</isSTRICT_UTF8_CHAR>> only in that it accepts non-character
+code points. This corresponds to
+L<Unicode Corrigendum #9|http://www.unicode.org/versions/corrigendum9.html>.
+which said that non-character code points are merely discouraged rather than
+completely forbidden in open interchange. See
+L<perlunicode/Noncharacter code points>.
+
+Use C<L</isUTF8_CHAR>> to check for Perl's extended UTF-8; and
+C<L</isUTF8_CHAR_flags>> for a more customized definition.
+
+Use C<L</is_c9strict_utf8_string>>, C<L</is_c9strict_utf8_string_loc>>, and
+C<L</is_c9strict_utf8_string_loclen>> to check entire strings.
+
+=cut
+
+This uses an adaptation of the tables and algorithm given in
+http://bjoern.hoehrmann.de/utf-8/decoder/dfa/, which provides comprehensive
+documentation of the original version. A copyright notice for the original
+version is given at the beginning of this file. The Perl adapation is
+documented at the definition of C9_utf8_dfa_tab[].
+
+*/
+
+PERL_STATIC_INLINE Size_t
+S_isC9_STRICT_UTF8_CHAR(const U8 * const s0, const U8 * const e)
+{
+    const U8 * s = s0;
+    UV state = 0;
+
+    PERL_ARGS_ASSERT_ISC9_STRICT_UTF8_CHAR;
+
+    while (s < e && LIKELY(state != 1)) {
+        state = C9_utf8_dfa_tab[256 + state + C9_utf8_dfa_tab[*s]];
+
+        if (state != 0) {
+            s++;
+            continue;
+        }
+
+        return s - s0 + 1;
+    }
+
+    return 0;
+}
+
+/*
+
 =for apidoc is_strict_utf8_string_loc
 Like C<L</is_strict_utf8_string>> but stores the location of the failure (in the
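The patch relies on C9_utf8_dfa_tab[], which is defined elsewhere and not shown in this hunk. As a rough, standalone illustration of the same lookup shape (a 256-entry byte-class map followed by rows of transitions, with state 0 meaning "accept" and state 1 meaning "reject"), here is a minimal sketch in plain C. Everything in it is an assumption made for illustration: the names `tab`, `class_of`, `build_table`, and `c9_char_len`, the choice of 12 byte classes, and the row layout are hypothetical and are not Perl's actual table. The classes follow the well-formed UTF-8 byte ranges from the Unicode Standard, so surrogates and anything above 0x10FFFF are rejected while non-character code points are accepted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte classes (hypothetical layout, not Perl's). */
enum { CL_ASCII, CL_80_8F, CL_90_9F, CL_A0_BF, CL_ILL, CL_LEAD2,
       CL_E0, CL_LEAD3, CL_ED, CL_F0, CL_LEAD4, CL_F4, NCLASS };

/* States. Apart from REJECT, each state is pre-multiplied by NCLASS so
 * that "256 + state + class" indexes its transition row directly. */
enum { ACCEPT = 0, REJECT = 1,
       NEED1 = 1 * NCLASS, NEED2 = 2 * NCLASS, NEED3 = 3 * NCLASS,
       AFTER_E0 = 4 * NCLASS, AFTER_ED = 5 * NCLASS,
       AFTER_F0 = 6 * NCLASS, AFTER_F4 = 7 * NCLASS,
       NROWS = 8 };

static uint8_t tab[256 + NROWS * NCLASS];

static uint8_t class_of(unsigned b)
{
    if (b <= 0x7F) return CL_ASCII;
    if (b <= 0x8F) return CL_80_8F;
    if (b <= 0x9F) return CL_90_9F;
    if (b <= 0xBF) return CL_A0_BF;
    if (b <= 0xC1) return CL_ILL;      /* C0, C1: always overlong */
    if (b <= 0xDF) return CL_LEAD2;
    if (b == 0xE0) return CL_E0;       /* next byte must be A0..BF */
    if (b == 0xED) return CL_ED;       /* next byte must be 80..9F (no surrogates) */
    if (b <= 0xEF) return CL_LEAD3;
    if (b == 0xF0) return CL_F0;       /* next byte must be 90..BF */
    if (b <= 0xF3) return CL_LEAD4;
    if (b == 0xF4) return CL_F4;       /* next byte must be 80..8F (max 0x10FFFF) */
    return CL_ILL;                     /* F5..FF */
}

/* Fill in the class map and transition rows; call once before use. */
static void build_table(void)
{
    unsigned i;
    uint8_t *row;

    for (i = 0; i < 256; i++) tab[i] = class_of(i);
    for (i = 0; i < NROWS * NCLASS; i++) tab[256 + i] = REJECT;

    row = &tab[256 + ACCEPT];          /* start state: expect a lead byte */
    row[CL_ASCII] = ACCEPT;   row[CL_LEAD2] = NEED1;
    row[CL_E0]    = AFTER_E0; row[CL_LEAD3] = NEED2;
    row[CL_ED]    = AFTER_ED; row[CL_F0]    = AFTER_F0;
    row[CL_LEAD4] = NEED3;    row[CL_F4]    = AFTER_F4;

    row = &tab[256 + NEED1];
    row[CL_80_8F] = row[CL_90_9F] = row[CL_A0_BF] = ACCEPT;
    row = &tab[256 + NEED2];
    row[CL_80_8F] = row[CL_90_9F] = row[CL_A0_BF] = NEED1;
    row = &tab[256 + NEED3];
    row[CL_80_8F] = row[CL_90_9F] = row[CL_A0_BF] = NEED2;

    tab[256 + AFTER_E0 + CL_A0_BF] = NEED1;
    tab[256 + AFTER_ED + CL_80_8F] = NEED1;
    tab[256 + AFTER_ED + CL_90_9F] = NEED1;
    tab[256 + AFTER_F0 + CL_90_9F] = NEED2;
    tab[256 + AFTER_F0 + CL_A0_BF] = NEED2;
    tab[256 + AFTER_F4 + CL_80_8F] = NEED2;
}

/* Length in bytes of the first code point in [s0, e), or 0 if the bytes
 * are malformed or the sequence is cut off before e. Same loop shape as
 * the function in the patch: one table lookup per input byte. */
static size_t c9_char_len(const uint8_t *s0, const uint8_t *e)
{
    const uint8_t *s = s0;
    unsigned state = ACCEPT;

    assert(s0 <= e);

    while (s < e && state != REJECT) {
        state = tab[256 + state + tab[*s]];
        if (state != ACCEPT) {
            s++;
            continue;
        }
        return (size_t)(s - s0) + 1;
    }
    return 0;
}
```

After a single call to `build_table()`, `c9_char_len()` can be called on any byte range. Building the table at run time keeps the sketch easy to check by eye; a production version would more plausibly ship a precomputed constant array, as the patch does.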