author Karl Williamson <khw@cpan.org> 2016-08-26 16:29:54 -0600
committer Karl Williamson <khw@cpan.org> 2016-08-31 20:32:37 -0600
commit 35f8c9bd0ff4f298f8bc09ae9848a14a9667a95a (patch)
tree 2d3a60c10c7a5b9e8f379eb4389f44e8c64a1195 /utf8.h
parent eda91ad75c71796ed6c5d3da7850b2fd7566c2a2 (diff)
Move isUTF8_CHAR helper function, and reimplement it

The macro isUTF8_CHAR calls a helper function for code points higher than it can handle. That function had been an inlined wrapper around utf8n_to_uvchr().

The function has been rewritten to not call utf8n_to_uvchr(), so it is now too big to be effectively inlined. Instead, it implements a faster method of checking the validity of the UTF-8 without having to decode it. It just checks for valid syntax and now knows where the few discontinuities are in UTF-8 where overlongs can occur, and uses a string compare to verify that overflow won't occur.

As a result this is now a pure function.

This also causes a previously generated deprecation warning to no longer be generated, because in printing UTF-8, it no longer has to be converted to internal form. I could add a check for that, but I think it's best not to. If you manipulated what is getting printed in any way, the deprecation message will already have been raised.

This commit also fleshes out the documentation of isUTF8_CHAR.
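The approach described in the message (check the syntax, reject overlongs at the few discontinuities, and use a byte-string compare to rule out overflow) can be sketched roughly as below. This is not the code from the commit: the function name sketch_utf8_char_len and the 32-bit code point limit are assumptions made for illustration, and EBCDIC is ignored entirely.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT Perl's _is_utf8_char_slow(). Returns the number of
 * bytes in the well-formed (Perl-extended) UTF-8 character starting at s,
 * looking no further than e - 1, or 0 if there is none. Assumes, purely for
 * illustration, that code points must fit in 32 bits. Nothing is decoded. */
static size_t sketch_utf8_char_len(const unsigned char *s,
                                   const unsigned char *e)
{
    /* Encoding of 0xFFFFFFFF, the highest code point this sketch accepts. */
    static const unsigned char max32[7] =
        { 0xFE, 0x83, 0xBF, 0xBF, 0xBF, 0xBF, 0xBF };
    size_t len, i;

    if (s >= e) return 0;

    /* Syntax, step 1: the start byte determines the expected length. */
    if (s[0] < 0x80) return 1;          /* ASCII (invariant) */
    else if (s[0] < 0xC2) return 0;     /* stray continuation, or C0/C1 overlong */
    else if (s[0] < 0xE0) len = 2;
    else if (s[0] < 0xF0) len = 3;
    else if (s[0] < 0xF8) len = 4;
    else if (s[0] < 0xFC) len = 5;      /* beyond here: extended UTF-8 */
    else if (s[0] < 0xFE) len = 6;
    else if (s[0] == 0xFE) len = 7;
    else return 0;                      /* 0xFF always exceeds 32 bits */

    /* Do not read past the end of the buffer. */
    if ((size_t)(e - s) < len) return 0;

    /* Syntax, step 2: every following byte must be a continuation byte. */
    for (i = 1; i < len; i++)
        if ((s[i] & 0xC0) != 0x80) return 0;

    /* Overlongs can only start at a few start bytes; in each case the
     * second byte alone tells whether a shorter encoding exists. */
    if (s[0] == 0xE0 && s[1] < 0xA0) return 0;
    if (s[0] == 0xF0 && s[1] < 0x90) return 0;
    if (s[0] == 0xF8 && s[1] < 0x88) return 0;
    if (s[0] == 0xFC && s[1] < 0x84) return 0;
    if (s[0] == 0xFE && s[1] < 0x82) return 0;

    /* Overflow: byte order matches code point order for sequences of the
     * same length, so a plain byte compare against the maximum suffices. */
    if (len == 7 && memcmp(s, max32, sizeof max32) > 0) return 0;

    return len;
}
```

Like the documented macro, the sketch accepts code points above 0x10FFFF and raises no warnings; a caller checking a whole buffer would advance by the returned length and stop at the first 0.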
Diffstat (limited to 'utf8.h')
-rw-r--r-- utf8.h 25
1 files changed, 18 insertions, 7 deletions
diff --git a/utf8.h b/utf8.h
index 29b052cd64..b940eaf3ee 100644
--- a/utf8.h
+++ b/utf8.h
@@ -826,13 +826,24 @@ case any call to string overloading updates the internal UTF-8 encoding flag.
=for apidoc Am|STRLEN|isUTF8_CHAR|const U8 *s|const U8 *e
-Returns the number of bytes beginning at C<s> which form a legal UTF-8 (or
-UTF-EBCDIC) encoded character, looking no further than S<C<e - s>> bytes into
-C<s>. Returns 0 if the sequence starting at C<s> through S<C<e - 1>> is not
-well-formed UTF-8.
+Evaluates to non-zero if the first few bytes of the string starting at C<s> and
+looking no further than S<C<e - 1>> are well-formed UTF-8 that represents some
+code point; otherwise it evaluates to 0. If non-zero, the value gives how many
+many bytes starting at C<s> comprise the code point's representation.
-Note that an INVARIANT character (i.e. ASCII on non-EBCDIC
-machines) is a valid UTF-8 character.
+The code point can be any that will fit in a UV on this machine, using Perl's
+extension to official UTF-8 to represent those higher than the Unicode maximum
+of 0x10FFFF. That means that this macro is used to efficiently decide if the
+next few bytes in C<s> is legal UTF-8 for a single character. Use
+L</is_utf8_string>(), L</is_utf8_string_loclen>(), and
+L</is_utf8_string_loc>() to check entire strings.
+
+Note that it is deprecated to use code points higher than what will fit in an
+IV. This macro does not raise any warnings for such code points, treating them
+as valid.
+
+Note also that a UTF-8 INVARIANT character (i.e. ASCII on non-EBCDIC machines)
+is a valid UTF-8 character.
=cut
*/
@@ -845,7 +856,7 @@ machines) is a valid UTF-8 character.
? 0 \
: LIKELY(IS_UTF8_CHAR_FAST(UTF8SKIP(s))) \
? is_UTF8_CHAR_utf8_no_length_checks(s) \
- : _is_utf8_char_slow(s, e))
+ : _is_utf8_char_slow(s, UTF8SKIP(s)))
#define is_utf8_char_buf(buf, buf_end) isUTF8_CHAR(buf, buf_end)