Create isUTF8_CHAR() macro and use it

This macro inlines the code that determines whether a character is well-formed UTF-8 for code points below a certain value, and falls back to a slower function for larger ones. On ASCII platforms, it inlines for code points well beyond all legal Unicode code points. On EBCDIC, it currently does so for code points up to 0x3FFF. That limit could be increased, but our porting tests do the regen every time to make sure everything is ok, and making it larger slows that down.

On ASCII this is worked around by normally commenting out the code that generates this info, and instead including in utf8.h a version that did get generated. This is static information and won't change. (This could be done for EBCDIC too, but I chose not to at this time, as each code page has a different macro generated, and it gets ugly getting all of them into utf8.h.)

Using this macro allowed several functions in utf8.c to be simplified.
author: Karl Williamson <khw@cpan.org> 2014-05-05 20:43:47 -0600
committer: Karl Williamson <khw@cpan.org> 2014-05-31 11:42:40 -0600
commit: 6302f837102d66f532a1c151f7299abbef3a15dd (patch)
tree: 242fe154607368b270e65e29f81f8aed4214259c /inline.h
parent: d9f92374c5f4b19ed46c29c6710922b80429de59 (diff)
download: perl-6302f837102d66f532a1c151f7299abbef3a15dd.tar.gz
1 files changed, 12 insertions, 16 deletions
diff --git a/inline.h b/inline.h
index 34d9b3b866..fff7499f01 100644
--- a/inline.h
+++ b/inline.h
@@ -239,24 +239,19 @@ S_isALNUM_lazy(pTHX_ const char* p)
 }
 
 /*
-Tests if the first C<len> bytes of string C<s> form a valid UTF-8
-character.  Note that an INVARIANT (i.e. ASCII on non-EBCDIC) character is a
-valid UTF-8 character.  The number of bytes in the UTF-8 character
-will be returned if it is valid, otherwise 0.
-
-This is the "slow" version as opposed to the "fast" version which is
-the "unrolled" IS_UTF8_CHAR().  E.g. for t/uni/class.t the speed
-difference is a factor of 2 to 3.  For lengths (UTF8SKIP(s)) of four
-or less you should use the IS_UTF8_CHAR(), for lengths of five or more
-you should use the _slow().  In practice this means that the _slow()
-will be used very rarely, since the maximum Unicode code point (as of
-Unicode 4.1) is U+10FFFF, which encodes in UTF-8 to four bytes.  Only
-the "Perl extended UTF-8" (e.g, the infamous 'v-strings') will encode into
-five bytes or more.
+A helper function for the macro isUTF8_CHAR(), which should be used instead of
+this function.  The macro will handle smaller code points directly saving time,
+using this function as a fall-back for higher code points.
+Tests if the first bytes of string C<s> form a valid UTF-8 character.  0 is
+returned if the bytes starting at C<s> up to but not including C<e> do not form a
+complete well-formed UTF-8 character; otherwise the number of bytes in the
+character is returned.
+Note that an INVARIANT (i.e. ASCII on non-EBCDIC) character is a valid UTF-8
+character.
 
 =cut */
 PERL_STATIC_INLINE STRLEN
-S__is_utf8_char_slow(const U8 *s, const STRLEN len)
+S__is_utf8_char_slow(const U8 *s, const U8 *e)
 {
     dTHX;   /* The function called below requires thread context */
 
@@ -264,7 +259,8 @@ S__is_utf8_char_slow(const U8 *s, const STRLEN len)
 
     PERL_ARGS_ASSERT__IS_UTF8_CHAR_SLOW;
 
-    utf8n_to_uvchr(s, len, &actual_len, UTF8_CHECK_ONLY);
+    assert(e >= s);
+    utf8n_to_uvchr(s, e - s, &actual_len, UTF8_CHECK_ONLY);
 
     return (actual_len == (STRLEN) -1) ? 0 : actual_len;
 }
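
The split described in the commit message (inline checks for small code points, a function call for the rest) can be illustrated with a minimal, standalone sketch. This is not the generated macro from utf8.h: the names SKETCH_isUTF8_CHAR and sketch_is_utf8_char_slow are hypothetical, the inline part covers only one- and two-byte characters on an ASCII platform, and the fall-back validates strict standard UTF-8 (up to U+10FFFF, no surrogates) instead of calling utf8n_to_uvchr(), so it is stricter than Perl's extended UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for S__is_utf8_char_slow(): validates one character
 * in [s, e) and returns its length in bytes, or 0 if the bytes do not form
 * a complete, well-formed character.  Assumption: strict standard UTF-8
 * only, which is narrower than what Perl itself accepts. */
static size_t
sketch_is_utf8_char_slow(const unsigned char *s, const unsigned char *e)
{
    size_t len, i;
    unsigned long cp;

    assert(e >= s);
    if (s == e) return 0;

    /* Classify the start byte and extract its payload bits. */
    if (s[0] < 0x80) return 1;
    else if (s[0] >= 0xC2 && s[0] <= 0xDF) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0)        { len = 3; cp = s[0] & 0x0F; }
    else if (s[0] >= 0xF0 && s[0] <= 0xF4) { len = 4; cp = s[0] & 0x07; }
    else return 0;      /* stray continuation byte or illegal start byte */

    if ((size_t)(e - s) < len) return 0;        /* truncated character */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;    /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong encodings, surrogates, and values above U+10FFFF. */
    if (len == 3 && (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF))) return 0;
    if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return 0;
    return len;
}

/* Hypothetical fast-path macro: handles one- and two-byte characters inline
 * and defers everything else to the function above.  Like the real macro,
 * it takes a start pointer and an end pointer rather than a length. */
#define SKETCH_isUTF8_CHAR(s, e)                                        \
    ( (e) <= (s)      ? 0                                               \
    : (s)[0] < 0x80   ? 1                                               \
    : ((e) - (s) >= 2 && (s)[0] >= 0xC2 && (s)[0] <= 0xDF               \
       && ((s)[1] & 0xC0) == 0x80) ? 2                                  \
    : sketch_is_utf8_char_slow((s), (e)) )
```

A caller would pass the current position and the end of the buffer, for example SKETCH_isUTF8_CHAR(p, buf + buflen), and advance p by the returned length when it is non-zero. Note that the sketch evaluates its arguments more than once, so they should be free of side effects.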