author | Karl Williamson <khw@cpan.org> | 2015-12-03 13:12:51 -0700 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2015-12-09 23:43:21 -0700 |
commit | 36eaa8111efe6b0ebe974f6b26ed667c1206dc9f (patch) | |
tree | 111d349b4deb4ee6cd3cb2d8b030107a4fd616e7 /unicode_constants.h | |
parent | 5af9bc9750ba392c2a4adfdc3ced4b0b301f656a (diff) | |
download | perl-36eaa8111efe6b0ebe974f6b26ed667c1206dc9f.tar.gz |
Skip casing for some non-cased scripts
Characters whose upper, lower, title, or fold case differ from the
character itself amount to just 1.5% of the assigned Unicode characters,
and this percentage falls with each new Unicode release, as almost all
cased scripts have already been encoded. But a lot of code is written
assuming a cased language, such as calling uc() or lcfirst(), or doing
qr//i. When such code is run on a non-cased language, the work expended
in doing the casing is wasted. And casing is expensive. But finding
out if a character is cased or not is nearly as expensive, so one might
as well just do the casing.
However, the Unicode code space is organized so that there are some long
stretches of contiguous code points that aren't cased. By adding tests
to see if the input code point is in just a few of these ranges, we can
quickly rule casing out for most of the non-cased scripts that are of
commercial use today, at essentially no expense to handling the more
common cased scripts. Testing for just 3 ranges in Plane 0 of Unicode
(where most of the code points in common use today reside) allows us to
skip doing casing for more than 82% of code points in the plane,
including the following languages: Arabic, Chinese, Hebrew, Japanese,
Korean, Thai, and the major scripts of India. No longer is a swash
generated when trying to case one of these, so runtime memory usage is
decreased.
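As an illustration of the range tests described above, here is a minimal sketch in C. The boundary values are assumptions, chosen only to be consistent with the figures in this message (three Plane 0 ranges covering a little over 82% of the plane). They are not quoted from the message, so consult the code in this commit for the real ones. The names cp_cases_to_self and plane0_fraction_skipped are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMED boundaries, for illustration only.  Each range is [FIRST, LIMIT). */
#define UNCASED_A_FIRST 0x0590u   /* start of the Hebrew block */
#define UNCASED_A_LIMIT 0x10A0u
#define UNCASED_B_FIRST 0x2D30u
#define UNCASED_B_LIMIT 0xA640u
#define UNCASED_C_FIRST 0xAC00u   /* start of the Hangul Syllables block */
#define UNCASED_C_LIMIT 0xFB00u

/* Returns true when cp lies in one of the assumed uncased ranges, meaning
 * every case mapping of cp is cp itself and no table lookup is needed.
 * Code points below the first range (Latin, Greek, Cyrillic, ...) pay for
 * only one comparison before taking the normal casing path. */
bool cp_cases_to_self(uint32_t cp)
{
    if (cp < UNCASED_A_FIRST) return false;
    if (cp < UNCASED_A_LIMIT) return true;
    if (cp < UNCASED_B_FIRST) return false;
    if (cp < UNCASED_B_LIMIT) return true;
    if (cp < UNCASED_C_FIRST) return false;
    return cp < UNCASED_C_LIMIT;
}

/* Fraction of the 65536 Plane 0 code points that the tests above let the
 * caller skip, under the assumed boundaries. */
double plane0_fraction_skipped(void)
{
    uint32_t covered = 0;
    for (uint32_t cp = 0; cp <= 0xFFFFu; cp++) {
        if (cp_cases_to_self(cp)) {
            covered++;
        }
    }
    return covered / 65536.0;
}
```

A caller would test cp_cases_to_self() first and return the input unchanged when it is true, falling back to the full case lookup otherwise. Ordering the comparisons from low to high keeps the cost for the common cased scripts at the start of the plane to a single failed test.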
(It should be noted that some of these languages have characters
scattered in other areas, because the original allocation for them
turned out to be not large enough. When changing the case of these
other characters, the lookups won't be skipped. But that original
allocation included all or nearly all the characters in current common
use, so these other characters are comparatively rare.)
The comments in the code indicate some candidate non-cased ranges that I
chose not to treat specially at this time. The next commit will address
planes above Plane 0.
When this command is run on a perl compiled with -O2 and without DEBUGGING:
blead Porting/bench.pl --perlargs="-Ilib -X" --benchfile=plane0_casing_perf /path_to_prior_perl=before_this_commit /path_to_new_perl=after
and file 'plane0_casing_perf' contains
[
    'string::casing::greek' => {
        desc  => 'should be no change',
        setup => 'my $a = "\x{3B1}"',   # GREEK SMALL LETTER ALPHA
        code  => 'uc($a)'
    },
    'string::casing::hebrew' => {
        desc  => 'yes swash vs no swash',
        setup => 'my $a = "\x{5D0}"',   # HEBREW LETTER ALEF
        code  => 'uc($a)'
    },
    'string::casing::cjk' => {
        desc  => 'yes swash vs no swash',
        setup => 'my $a = "\x{4E01}"',  # CJK UNIFIED IDEOGRAPH-4E01
        code  => 'uc($a)'
    },
    'string::casing::korean' => {
        desc  => 'yes swash vs no swash',
        setup => 'my $a = "\x{AC00}"',  # HANGUL SYLLABLE GA
        code  => 'uc($a)'
    },
];
These are the results:
The numbers represent raw counts per loop iteration.
string::casing::cjk
yes swash vs no swash
       before_this_commit    after
       ------------------ --------
    Ir              931.0    300.0
    Dr              217.0     93.0
    Dw               94.0     45.0
  COND              129.0     48.0
   IND                7.0      4.0
COND_m                1.5      0.0
 IND_m                4.0      2.0
 Ir_m1                0.1      0.0
 Dr_m1                0.0      0.0
 Dw_m1                0.0      0.0
 Ir_mm                0.0      0.0
 Dr_mm                0.0      0.0
 Dw_mm                0.0      0.0
string::casing::greek
should be no change
       before_this_commit    after
       ------------------ --------
    Ir              946.0    920.0
    Dr              218.0    220.0
    Dw              100.0    100.0
  COND              127.0    121.0
   IND                6.0      8.0
COND_m                0.5      1.3
 IND_m                2.0      2.0
 Ir_m1                0.1      0.0
 Dr_m1                0.0      0.0
 Dw_m1                0.0      0.0
 Ir_mm                0.0      0.0
 Dr_mm                0.0      0.0
 Dw_mm                0.0      0.0
string::casing::hebrew
yes swash vs no swash
       before_this_commit    after
       ------------------ --------
    Ir              928.0    290.0
    Dr              224.0     92.0
    Dw              100.0     45.0
  COND              129.0     46.0
   IND                6.0      4.0
COND_m                0.5      0.0
 IND_m                2.0      2.0
 Ir_m1                0.1      0.0
 Dr_m1                0.0      0.0
 Dw_m1                0.0      0.0
 Ir_mm                0.0      0.0
 Dr_mm                0.0      0.0
 Dw_mm                0.0      0.0
string::casing::korean
yes swash vs no swash
       before_this_commit    after
       ------------------ --------
    Ir              953.0    307.6
    Dr              224.0     93.0
    Dw              100.0     45.0
  COND              131.0     50.9
   IND                7.0      4.0
COND_m                1.5      0.0
 IND_m                4.0      2.0
 Ir_m1                0.1      0.0
 Dr_m1                0.0      0.0
 Dw_m1                0.0      0.0
 Ir_mm                0.0      0.0
 Dr_mm                0.0      0.0
 Dw_mm                0.0      0.0
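To put the Ir rows above on a common scale, the following sketch in C computes the relative reduction for each benchmark. The Ir values are copied from the tables. The names ir_sample, ir_samples and percent_reduction are hypothetical and are not part of Porting/bench.pl.

```c
/* One Ir measurement pair taken from the result tables. */
struct ir_sample {
    const char *name;
    double before;   /* Ir under before_this_commit */
    double after;    /* Ir under after */
};

const struct ir_sample ir_samples[] = {
    { "string::casing::cjk",    931.0, 300.0 },
    { "string::casing::greek",  946.0, 920.0 },
    { "string::casing::hebrew", 928.0, 290.0 },
    { "string::casing::korean", 953.0, 307.6 },
};

/* Percentage by which the count fell, relative to the 'before' value. */
double percent_reduction(double before, double after)
{
    return 100.0 * (before - after) / before;
}
```

Applied to each entry, percent_reduction() shows Ir falling by roughly two thirds for the three non-cased benchmarks and by under 3% for Greek, the one benchmark marked 'should be no change'.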