Use per-word calcs in utf8_length()

This commit changes utf8_length to read the input a word at a time. The current method of looking per character is retained for shorter strings. The per-word method yields significant time savings for very long strings and typical inputs.

The timings vary depending on the average number of bytes per character in the input. If all our characters were 13 bytes, this commit would always be a loser, as we would be processing per 8 (or 4 on 32-bit platforms) instead of 13. But we don't care about performance for non-Unicode code points, and the maximum legal Unicode code point occupies 4 UTF-8 bytes, which means the change is a wash on 32-bit platforms, but a real gain on 64-bit ones. And, except for emoji, most text in modern languages is at most 3 bytes per character, with a significant number of single-byte characters (e.g., for punctuation) even in non-Latin scripts.

For very long strings we would expect to use 1/8 the conditionals if the input is entirely ASCII, 1/4 if entirely 2-byte UTF-8, and 1/2 if entirely 4-byte. (For 32-bit systems, the savings are approximately half this.) Because of set-up and tear-down complications, these values are limits that are approached the longer the string is (which is where it matters most).

The per-word method kicks in for input strings 96 bytes and longer. This value was based on some eyeballing of cachegrind output, and could be tweaked, but the differences in time spent on strings this short are tiny.

This function does a half-hearted job of checking for UTF-8 validity; it doesn't do extra work, but it makes sure that the lengths implied by the start bytes it sees all add up. (It doesn't check that the bytes in between are all continuation bytes.) In order to preserve this checking, the new version has to stop per-word looking a word earlier than it otherwise would have.

There are complications, as it has to process per-byte to get to a word boundary before reading per-word.
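The per-word idea described above can be sketched as follows. This is an illustration only, not the code from this commit: the names utf8_length_sketch and count_continuations_word are hypothetical, a 64-bit word is assumed, and the input is assumed to be well-formed UTF-8. The sketch counts continuation bytes (those of the form 10xxxxxx) and subtracts them from the byte length. It leaves out both the 96-byte threshold and the start-byte length checking mentioned in the commit message.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: count the bytes of the form 10xxxxxx in one word. */
static size_t count_continuations_word(uint64_t w)
{
    const uint64_t ones = UINT64_C(0x0101010101010101);

    /* The low bit of each byte lane is set iff that byte has
     * bit 7 set and bit 6 clear, i.e. it is a continuation byte. */
    uint64_t is_cont = (w >> 7) & (~w >> 6) & ones;

    /* Multiplying by 'ones' sums the eight lanes into the top byte. */
    return (size_t)((is_cont * ones) >> 56);
}

/* Sketch: number of characters in [s, e), assuming well-formed UTF-8. */
static size_t utf8_length_sketch(const unsigned char *s,
                                 const unsigned char *e)
{
    size_t bytes = (size_t)(e - s);
    size_t continuations = 0;

    /* Go per-byte until s sits on a word boundary. */
    while (s < e && ((uintptr_t)s % sizeof(uint64_t)) != 0) {
        continuations += ((*s & 0xC0) == 0x80);
        s++;
    }

    /* Go per-word over the aligned middle of the string. */
    while ((size_t)(e - s) >= sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, s, sizeof w);   /* avoids aliasing problems */
        continuations += count_continuations_word(w);
        s += sizeof w;
    }

    /* Finish the remaining tail per-byte. */
    while (s < e) {
        continuations += ((*s & 0xC0) == 0x80);
        s++;
    }

    return bytes - continuations;
}
```

Because the helper only counts byte lanes, the result does not depend on the platform's byte order. The per-word loop executes one loop test per 8 input bytes, which is where the expected 1/8, 1/4 and 1/2 ratios for 1-, 2- and 4-byte characters come from.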
Here are benchmarks for 2-byte characters, using the best and worst case scenarios. (All benchmarks are for a 64-bit platform.)

Key:
    Ir   Instruction read
    Dr   Data read
    Dw   Data write
    COND conditional branches
    IND  indirect branches

The numbers represent relative counts per loop iteration, compared to blead at 100.0%. Higher is better: for example, using half as many instructions gives 200%, while using twice as many gives 50%.

Best case 2-byte scenario:
    string length 48 characters; 2 bytes per character; 0 bytes after word boundary

          blead   patch
         ------ -------
      Ir 100.00  123.09
      Dr 100.00  130.18
      Dw 100.00  111.44
    COND 100.00  128.63
     IND 100.00  100.00

Worst case 2-byte scenario:
    string length 48 characters; 2 bytes per character; 7 bytes after word boundary

          blead   patch
         ------ -------
      Ir 100.00  122.46
      Dr 100.00  129.52
      Dw 100.00  111.07
    COND 100.00  127.65
     IND 100.00  100.00

Very long strings run an order of magnitude fewer instructions than blead. Here are worst case scenarios (7 bytes after word boundary).

    string length 10000000 characters; 1 byte per character

          blead   patch
         ------ -------
      Ir 100.00  814.53
      Dr 100.00 1069.58
      Dw 100.00 3296.55
    COND 100.00 1575.83
     IND 100.00  100.00

    string length 5000000 characters; 2 bytes per character

          blead   patch
         ------ -------
      Ir 100.00  408.86
      Dr 100.00  536.32
      Dw 100.00 1698.31
    COND 100.00  788.72
     IND 100.00  100.00

    string length 3333333 characters; 3 bytes per character

          blead   patch
         ------ -------
      Ir 100.00  273.64
      Dr 100.00  358.56
      Dw 100.00 1165.55
    COND 100.00  526.35
     IND 100.00  100.00

    string length 2500000 characters; 4 bytes per character

          blead   patch
         ------ -------
      Ir 100.00  206.03
      Dr 100.00  269.68
      Dw 100.00  899.17
    COND 100.00  395.17
     IND 100.00  100.00

author: Karl Williamson <khw@cpan.org> 2019-03-15 14:37:13 -0600
committer: Yves Orton <demerphq@gmail.com> 2023-02-08 11:55:21 +0800
commit: 71d63d0dc1fcf23d28f488655c105c0dfefbd254 (patch)
tree: 9198cea4867f11830d3d483e0a44b7186dc7c68b /proto.h
parent: 9d3a26c75c777cb4dc1382d59de52a9b69a38f14 (diff)
download: perl-71d63d0dc1fcf23d28f488655c105c0dfefbd254.tar.gz
1 files changed, 2 insertions, 2 deletions
diff --git a/proto.h b/proto.h
index ea56c4d604..2de342bc85 100644
--- a/proto.h
+++ b/proto.h
@@ -5095,10 +5095,10 @@ Perl_utf16_to_utf8_reversed(pTHX_ U8 *p, U8 *d, Size_t bytelen, Size_t *newlen);
         assert(p); assert(d); assert(newlen)
 
 PERL_CALLCONV STRLEN
-Perl_utf8_length(pTHX_ const U8 *s, const U8 *e)
+Perl_utf8_length(pTHX_ const U8 *s0, const U8 *e)
         __attribute__warn_unused_result__;
 #define PERL_ARGS_ASSERT_UTF8_LENGTH            \
-        assert(s); assert(e)
+        assert(s0); assert(e)
 
 PERL_CALLCONV U8 *
 Perl_utf8_to_bytes(pTHX_ U8 *s, STRLEN *lenp);