Avoid some conditionals in is...UTF8_CHAR()

These three functions, which determine whether the next bytes of a string form a UTF-8 character (constrained in three different ways), share essentially the same short loop. One of the conditions in the while(), state != 1, is always true on the first iteration. Moving that test into the loop body avoids it in the common case where the loop executes just once, which is when the input is a UTF-8 invariant character (ASCII on ASCII platforms).

If the functions were constrained to require that the first byte pointed to by the input exist, the while() could become a do {} while(). Calling one of these functions would then cost no more conditionals than first checking whether the next character is invariant and calling the function only if it is not. There would also be fewer conditionals for characters of 2 or more bytes.
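The restructuring described above can be sketched outside of perl as follows. This is a minimal illustration, not the code in inline.h: toy_step() is a hypothetical stand-in for the PL_*_utf8_dfa_tab lookups, and it is deliberately loose (it does not reject overlongs or surrogates). The names len_old(), len_new(), len_do_while() and check_agree() are likewise invented for this sketch. States 0 and 1 mean accept and reject, as in the diff.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy transition function standing in for the table lookups.
 * 0 = accept, 1 = reject; 2, 3, 4 = one, two, three continuation bytes
 * still needed. */
unsigned toy_step(unsigned state, unsigned char byte)
{
    int is_cont = (byte & 0xC0) == 0x80;

    switch (state) {
    case 0:
        if (byte < 0x80) return 0;
        if (byte >= 0xC2 && byte <= 0xDF) return 2;
        if (byte >= 0xE0 && byte <= 0xEF) return 3;
        if (byte >= 0xF0 && byte <= 0xF4) return 4;
        return 1;
    case 2: return is_cont ? 0 : 1;
    case 3: return is_cont ? 2 : 1;
    case 4: return is_cont ? 3 : 1;
    default: return 1;
    }
}

/* Old shape: 'state != 1' is tested before every iteration, including the
 * first, where it cannot fail. */
size_t len_old(const unsigned char *s0, const unsigned char *e)
{
    const unsigned char *s = s0;
    unsigned state = 0;

    while (s < e && state != 1) {
        state = toy_step(state, *s);
        if (state != 0) {
            s++;
            continue;
        }
        return (size_t)(s - s0 + 1);
    }
    return 0;
}

/* New shape: the reject test runs only after the accept test has failed,
 * so a one-byte character never reaches it. */
size_t len_new(const unsigned char *s0, const unsigned char *e)
{
    const unsigned char *s = s0;
    unsigned state = 0;

    while (s < e) {
        state = toy_step(state, *s);
        s++;
        if (state == 0) return (size_t)(s - s0);
        if (state == 1) break;
    }
    return 0;
}

/* Hypothetical do {} while () form; only valid when s0 < e. */
size_t len_do_while(const unsigned char *s0, const unsigned char *e)
{
    const unsigned char *s = s0;
    unsigned state = 0;

    do {
        state = toy_step(state, *s);
        s++;
        if (state == 0) return (size_t)(s - s0);
        if (state == 1) break;
    } while (s < e);
    return 0;
}

/* Asserts that the three loop shapes return the same length. */
void check_agree(const unsigned char *s0, size_t n)
{
    assert(len_old(s0, s0 + n) == len_new(s0, s0 + n));
    if (n > 0) assert(len_new(s0, s0 + n) == len_do_while(s0, s0 + n));
}
```

For a one-byte input, len_new() evaluates s < e once and state == 0 once before returning, whereas len_old() also evaluates state != 1 on entry. The do {} while () form additionally drops the initial s < e test, at the cost of requiring the caller to guarantee at least one byte.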
author: Karl Williamson <khw@cpan.org> 2021-05-30 16:50:26 -0600
committer: Nicholas Clark <nick@ccl4.org> 2021-06-28 04:18:40 -0600
commit: ffea7477df4e15160658e87a1a6cf24280f24e29 (patch)
tree: 1711759712846e9ba5cb946a9a250f9cebf2533d /inline.h
parent: 5678ce7977e423bda6d0f627b4c94d2cd5b9a213 (diff)
download: perl-ffea7477df4e15160658e87a1a6cf24280f24e29.tar.gz
1 files changed, 26 insertions, 17 deletions
diff --git a/inline.h b/inline.h
index 045f880666..65b2d9df0c 100644
--- a/inline.h
+++ b/inline.h
@@ -1127,16 +1127,19 @@ Perl_isUTF8_CHAR(const U8 * const s0, const U8 * const e)
      * on 32-bit ASCII platforms where it trivially is an error).  Call a
      * helper function for the other platforms. */
 
-    while (s < e && LIKELY(state != 1)) {
-        state = PL_extended_utf8_dfa_tab[256
+    while (s < e) {
+        state = PL_extended_utf8_dfa_tab[  256
                                          + state
                                          + PL_extended_utf8_dfa_tab[*s]];
-        if (state != 0) {
-            s++;
-            continue;
+        s++;
+
+        if (state == 0) {
+            return s - s0;
         }
 
-        return s - s0 + 1;
+        if (UNLIKELY(state == 1)) {
+            break;
+        }
     }
 
 #if defined(UV_IS_QUAD) || defined(EBCDIC)
@@ -1195,15 +1198,19 @@ Perl_isSTRICT_UTF8_CHAR(const U8 * const s0, const U8 * const e)
 
     PERL_ARGS_ASSERT_ISSTRICT_UTF8_CHAR;
 
-    while (s < e && LIKELY(state != 1)) {
-        state = PL_strict_utf8_dfa_tab[256 + state + PL_strict_utf8_dfa_tab[*s]];
+    while (s < e) {
+        state = PL_strict_utf8_dfa_tab[  256
+                                       + state
+                                       + PL_strict_utf8_dfa_tab[*s]];
+        s++;
 
-        if (state != 0) {
-            s++;
-            continue;
+        if (state == 0) {
+            return s - s0;
         }
 
-        return s - s0 + 1;
+        if (UNLIKELY(state == 1)) {
+            break;
+        }
     }
 
 #ifndef EBCDIC
@@ -1261,15 +1268,17 @@ Perl_isC9_STRICT_UTF8_CHAR(const U8 * const s0, const U8 * const e)
 
     PERL_ARGS_ASSERT_ISC9_STRICT_UTF8_CHAR;
 
-    while (s < e && LIKELY(state != 1)) {
+    while (s < e) {
         state = PL_c9_utf8_dfa_tab[256 + state + PL_c9_utf8_dfa_tab[*s]];
+        s++;
 
-        if (state != 0) {
-            s++;
-            continue;
+        if (state == 0) {
+            return s - s0;
         }
 
-        return s - s0 + 1;
+        if (UNLIKELY(state == 1)) {
+            break;
+        }
     }
 
     return 0;