author | Karl Williamson <khw@cpan.org> | 2019-03-15 14:37:13 -0600 |
---|---|---|
committer | Yves Orton <demerphq@gmail.com> | 2023-02-08 11:55:21 +0800 |
commit | 71d63d0dc1fcf23d28f488655c105c0dfefbd254 (patch) | |
tree | 9198cea4867f11830d3d483e0a44b7186dc7c68b /README.linux | |
parent | 9d3a26c75c777cb4dc1382d59de52a9b69a38f14 (diff) | |
Use per-word calcs in utf8_length()
This commit changes utf8_length to read the input a word at a time.
The current method of looking per character is retained for shorter
strings. The per-word method yields significant time savings for very
long strings and typical inputs.
The timings vary depending on the average number of bytes per character
in the input. If all our characters were 13 bytes, this commit would
always be a loser, as we would be processing per 8 (or 4 on 32-bit
platforms) instead of 13. But we don't care about performance for
non-Unicode code points, and the maximum legal Unicode code point
occupies 4 UTF-8 bytes, which means that is a wash on 32-bit platforms,
but a real gain on 64-bit ones. And, except for emoji, most text in
modern languages is 3 byte max, with a significant amount of single byte
characters (e.g., for punctuation) even in non-Latin scripts.
For very long strings we would expect to use 1/8 the conditionals if the
input is entirely ASCII; 1/4 if entirely 2-byte UTF-8, and 1/2 if
entirely 4-byte. (For 32-bit systems, the savings is approximately half
this.) Because of set-up and tear-down complications, these values are
limits that are approached the longer the string is (which is where it
matters most).
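As a rough illustration of where the savings in conditionals come from, the following C sketch counts characters one 64-bit word at a time. It counts UTF-8 continuation bytes (those of the form 10xxxxxx) with bit operations, so the loop condition is tested once per 8 bytes instead of once per character. The function names are hypothetical, and this is a common bit-counting technique assumed for illustration, not necessarily the code in this commit.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count the UTF-8 continuation bytes (10xxxxxx) in one 64-bit word.
 * A byte is a continuation iff bit 7 is set and bit 6 is clear. */
static unsigned continuations_in_word(uint64_t w)
{
    const uint64_t low_bits = UINT64_C(0x0101010101010101);
    /* Bit 0 of each byte becomes 1 only for continuation bytes. */
    uint64_t marks = (w >> 7) & (~w >> 6) & low_bits;
    /* The multiply sums the eight 0/1 marks into the top byte. */
    return (unsigned)((marks * low_bits) >> 56);
}

/* Count the characters that start in the whole 8-byte words of s.
 * Any trailing bytes past the last whole word are ignored here.
 * Every byte that is not a continuation starts a character. */
static size_t count_chars_wordwise(const unsigned char *s, size_t len)
{
    size_t whole = len - (len % 8);
    size_t continuations = 0;
    for (size_t i = 0; i < whole; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, sizeof w);   /* alignment-safe load */
        continuations += continuations_in_word(w);
    }
    return whole - continuations;
}
```

With this sketch, eight ASCII bytes count as 8 characters, while eight bytes holding four 2-byte characters count as 4.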
The per-word method kicks in for input strings 96 bytes and longer.
This value was based on eyeballing some cachegrind output, and could be
tweaked, but the differences in time spent on strings this short are
tiny.
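One way such a cutoff might be wired in is sketched below in C. Short strings go straight to a per-byte loop. Longer ones walk per byte up to a word boundary, then read whole 64-bit words, then finish per byte. The names `WORD_THRESHOLD`, `is_start`, and `count_chars_hybrid` are hypothetical, the sketch assumes a 64-bit word, and it does no validity checking, so it should be read as an outline of the structure rather than the function in this commit.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WORD_THRESHOLD 96   /* cutoff quoted in the commit message */

/* 1 if b starts a character, 0 if it is a continuation byte (10xxxxxx). */
static size_t is_start(unsigned char b)
{
    return (b & 0xC0) != 0x80;
}

static size_t count_chars_hybrid(const unsigned char *s, size_t len)
{
    const unsigned char *p = s;
    const unsigned char *end = s + len;
    size_t chars = 0;

    if (len >= WORD_THRESHOLD) {
        const uint64_t ones = UINT64_C(0x0101010101010101);

        /* Head: go byte by byte until p sits on a word boundary. */
        while (p < end && ((uintptr_t)p % sizeof(uint64_t)) != 0)
            chars += is_start(*p++);

        /* Body: one loop test per 8 bytes. */
        while ((size_t)(end - p) >= sizeof(uint64_t)) {
            uint64_t w;
            memcpy(&w, p, sizeof w);
            /* Mark continuation bytes, then sum the marks. */
            uint64_t cont = (w >> 7) & (~w >> 6) & ones;
            chars += 8 - (size_t)((cont * ones) >> 56);
            p += sizeof w;
        }
    }

    /* Tail, or the whole string when it is short: byte by byte. */
    while (p < end)
        chars += is_start(*p++);

    return chars;
}
```

The result should not depend on where the string starts relative to a word boundary; only the split between the per-byte and per-word parts changes.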
This function does a half-hearted job of checking for UTF-8 validity; it
doesn't do extra work, but it makes sure that the lengths implied by the
start bytes it sees all add up. (It doesn't check that the bytes in
between are all continuation bytes.) In order to preserve this
checking, the new version has to stop per-word looking a word earlier
than it otherwise would have.
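The kind of check described above can be sketched in C as hopping from start byte to start byte and confirming that the hops land exactly on the end of the buffer. This sketch assumes standard UTF-8 sequence lengths of 1 to 4 bytes and does not model Perl's longer extended sequences. The names `implied_len` and `count_chars_hopping` are hypothetical, and treating a byte that cannot start a character as a failure is a choice made here that may differ from the real function.

```c
#include <stddef.h>

/* Sequence length implied by a start byte, for standard UTF-8 only.
 * Returns 0 for a byte that cannot start a character. */
static size_t implied_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if (b >= 0xC0 && b < 0xE0) return 2;
    if (b >= 0xE0 && b < 0xF0) return 3;
    if (b >= 0xF0 && b < 0xF8) return 4;
    return 0;
}

/* Hop from start byte to start byte, counting characters.
 * Returns -1 if the implied lengths do not add up to exactly len.
 * The bytes between start bytes are NOT inspected. */
static long count_chars_hopping(const unsigned char *s, size_t len)
{
    size_t i = 0;
    long chars = 0;
    while (i < len) {
        size_t n = implied_len(s[i]);
        if (n == 0 || n > len - i)
            return -1;
        i += n;
        chars++;
    }
    return chars;
}
```

A string that ends in the middle of a multi-byte character is rejected, but a 2-byte start byte followed by an ASCII letter is accepted, since the second byte is never looked at.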
There are complications, as it has to process per-byte to get to a word
boundary before reading per-word. Here are benchmarks for 2-byte
characters using the best and worst case scenarios. (All benchmarks are
for a 64-bit platform.)
Key:
    Ir      Instruction read
    Dr      Data read
    Dw      Data write
    COND    conditional branches
    IND     indirect branches
The numbers represent relative counts per loop iteration, compared to
blead at 100.0%.
Higher is better: for example, using half as many instructions gives 200%,
while using twice as many gives 50%.
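The scoring rule just described amounts to dividing the baseline count by the patched count and scaling to 100. A minimal C sketch follows; the function name `relative_score` is hypothetical and the counts in the usage note are made-up round numbers, not measurements.

```c
/* Relative score as described in the key: baseline (blead) count over
 * patched count, scaled so that the baseline itself scores 100. */
static double relative_score(double blead_count, double patch_count)
{
    return 100.0 * blead_count / patch_count;
}
```

For instance, a baseline of 1000 events against a patched count of 500 gives 200, and against a patched count of 2000 gives 50.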
Best case 2-byte scenario:
    string length 48 characters; 2 bytes per character;
    0 bytes after word boundary
             blead    patch
            ------  -------
    Ir      100.00   123.09
    Dr      100.00   130.18
    Dw      100.00   111.44
    COND    100.00   128.63
    IND     100.00   100.00
Worst case 2-byte scenario:
    string length 48 characters; 2 bytes per character;
    7 bytes after word boundary
             blead    patch
            ------  -------
    Ir      100.00   122.46
    Dr      100.00   129.52
    Dw      100.00   111.07
    COND    100.00   127.65
    IND     100.00   100.00
Very long strings run far fewer instructions than blead, up to roughly
8 times fewer for all-ASCII input. Here are worst case scenarios
(7 bytes after word boundary).
string length 10000000 characters; 1 byte per character
             blead    patch
            ------  -------
    Ir      100.00   814.53
    Dr      100.00  1069.58
    Dw      100.00  3296.55
    COND    100.00  1575.83
    IND     100.00   100.00
string length 5000000 characters; 2 bytes per character
             blead    patch
            ------  -------
    Ir      100.00   408.86
    Dr      100.00   536.32
    Dw      100.00  1698.31
    COND    100.00   788.72
    IND     100.00   100.00
string length 3333333 characters; 3 bytes per character
             blead    patch
            ------  -------
    Ir      100.00   273.64
    Dr      100.00   358.56
    Dw      100.00  1165.55
    COND    100.00   526.35
    IND     100.00   100.00
string length 2500000 characters; 4 bytes per character
             blead    patch
            ------  -------
    Ir      100.00   206.03
    Dr      100.00   269.68
    Dw      100.00   899.17
    COND    100.00   395.17
    IND     100.00   100.00