| author | Karl Williamson <khw@cpan.org> | 2018-03-23 13:43:56 -0600 |
|---|---|---|
| committer | Karl Williamson <khw@cpan.org> | 2018-03-26 16:26:54 -0600 |
| commit | 8946fcd98c63bdc848cec00a1c72aaf232d932a1 (patch) | |
| tree | d121082565c788d8cb4876b5d867ba701e3c4662 /embedvar.h | |
| parent | 608bdd1e9ade8e9ca6d2312c64b2b1c0a653eadc (diff) | |
| download | perl-8946fcd98c63bdc848cec00a1c72aaf232d932a1.tar.gz | |
Move UTF-8 case changing data into core
Prior to this commit, if a program wanted to compute the case-change of
a character above 0xFF, the C code would switch to Perl, load
lib/utf8_heavy.pl, read another file from disk, and then create a hash.
Future references would use the hash, but the start-up cost is quite
large. There are five case-change types: uc, lc, tc, fc, and simple fc.
Only the first one encountered requires loading utf8_heavy.pl, but each
requires switching to utf8_heavy and reading the appropriate file from
disk.
This commit changes these functions to use compiled-in C data structures
(inversion maps) to represent the data. Looking something up now
requires a binary search instead of a hash lookup.
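The sketch below is illustrative only, not Perl's actual internal
representation: the names (invmap_lookup, starts, deltas) and the
delta-based value layout are assumptions. It shows the general idea of
an inversion-map lookup, i.e. a sorted array of range start points plus
a parallel array describing what each range maps to, searched with a
binary search.

/* Minimal inversion-map lookup sketch.  starts[] holds the first code
 * point of each range, in sorted order; deltas[] holds the offset that
 * every code point in the corresponding range maps by (0 meaning "maps
 * to itself").  Real Unicode case maps need a richer value type, e.g.
 * for multi-character mappings; this only shows the search step. */
#include <stdint.h>
#include <stddef.h>

typedef uint32_t codepoint_t;

static codepoint_t
invmap_lookup(codepoint_t cp, const codepoint_t *starts,
              const int32_t *deltas, size_t n)
{
    size_t lo = 0, hi = n;          /* find last range with start <= cp */
    while (lo + 1 < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (starts[mid] <= cp)
            lo = mid;
        else
            hi = mid;
    }
    return (codepoint_t) ((int64_t) cp + deltas[lo]);
}

int main(void)
{
    /* Toy uppercase map: 'a'-'z' map down by 32, everything else maps
     * to itself.  A real map for a Unicode property has ~1000+ ranges. */
    static const codepoint_t starts[] = { 0x00, 'a', 'z' + 1 };
    static const int32_t     deltas[] = {    0, -32,       0 };

    return invmap_lookup('q', starts, deltas, 3) == 'Q' ? 0 : 1;
}

The point of representing the data this way is that such arrays can be
generated at Perl build time and compiled in, so no disk read or hash
construction is needed at runtime.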
An individual hash lookup tends to be faster than a binary search, but
the differences are small at small sizes. I did some benchmarking some
years ago (see the commit message of
87367d5f9dc9bbf7db1a6cf87820cea76571bf1a), and the results were that
for fewer than 512 entries, the binary search was just as fast as a
hash, if not faster. I've now done more benchmarks on blead, using the
tool benchmark.pl, which wasn't available back then. The results below
indicate that the differences are minimal up through 2047 entries,
which all Unicode properties are well within.
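As a rough, illustrative calculation (an assumption-free bound, but not
part of the original benchmark data): a binary search over n sorted
entries needs at most floor(log2(n)) + 1 comparisons, so anything up to
2047 entries, including the > 1600-entry property used below, is found
in at most 11 probes. A tiny standalone C check of that bound:

#include <stdio.h>

/* Worst-case probe count for a binary search over n sorted entries:
 * the smallest k such that 2^k > n, i.e. floor(log2(n)) + 1. */
static unsigned max_probes(unsigned n)
{
    unsigned k = 0;
    while ((1u << k) <= n)
        k++;
    return k;
}

int main(void)
{
    printf("1600 entries: %u probes\n", max_probes(1600)); /* 11 */
    printf("2047 entries: %u probes\n", max_probes(2047)); /* 11 */
    return 0;
}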
A hash, PL_foldclosures, is still constructed at runtime for the case of
regular expression /i matching, and this could be generated at Perl
compile time, as a further enhancement for later. But reading a file
from disk is no longer required to do this.
======================= benchmarking results =======================
Key:
Ir Instruction read
Dr Data read
Dw Data write
COND conditional branches
IND indirect branches
_m branch predict miss
_m1 level 1 cache miss
_mm last cache (e.g. L3) miss
- indeterminate percentage (e.g. 1/0)
The numbers represent raw counts per loop iteration.
"\x{10000}" =~ qr/\p{CWKCF}/"
              swash    invlist   Ratio %
              fetch     search
             ------    -------   -------
    Ir       2259.0     2264.0      99.8
    Dr        665.0      664.0     100.2
    Dw        406.0      404.0     100.5
    COND      406.0      405.0     100.2
    IND        17.0       15.0     113.3
    COND_m      8.0        8.0     100.0
    IND_m       4.0        4.0     100.0
    Ir_m1       8.9       17.0      52.4
    Dr_m1       4.5        3.4     132.4
    Dw_m1       1.9        1.2     158.3
    Ir_mm       0.0        0.0     100.0
    Dr_mm       0.0        0.0     100.0
    Dw_mm       0.0        0.0     100.0
These were constructed by using the file whose contents are below,
which uses the Unicode property that currently has the largest number
of entries in its inversion list, > 1600. The test was run on blead
compiled with -O2, no debugging, no threads. Then the cut-off boundary
for when we use a hash vs. an inversion list was changed from 512 to
2047, and the test run again. This yields the difference between a hash
fetch and an inversion-list binary search.
===================== The benchmark file is below ===============
no warnings 'once';
my @benchmarks;
push @benchmarks, 'swash' => {
    desc  => '"\x{10000}" =~ qr/\p{CWKCF}/"',
    setup => 'no warnings "once"; my $re = qr/\p{CWKCF}/; my $a = "\x{10000}";',
    code  => '$a =~ $re;',
};
\@benchmarks;
Diffstat (limited to 'embedvar.h')
-rw-r--r-- | embedvar.h | 1 |
1 file changed, 1 insertion, 0 deletions
diff --git a/embedvar.h b/embedvar.h
index a3f7fb3dd3..4e39a94bcf 100644
--- a/embedvar.h
+++ b/embedvar.h
@@ -343,6 +343,7 @@
 #define PL_utf8_swash_ptrs (vTHX->Iutf8_swash_ptrs)
 #define PL_utf8_tofold (vTHX->Iutf8_tofold)
 #define PL_utf8_tolower (vTHX->Iutf8_tolower)
+#define PL_utf8_tosimplefold (vTHX->Iutf8_tosimplefold)
 #define PL_utf8_totitle (vTHX->Iutf8_totitle)
 #define PL_utf8_toupper (vTHX->Iutf8_toupper)
 #define PL_utf8cache (vTHX->Iutf8cache)