author: Karl Williamson <khw@cpan.org> 2015-12-03 13:12:51 -0700
committer: Karl Williamson <khw@cpan.org> 2015-12-09 23:43:21 -0700
commit: 36eaa8111efe6b0ebe974f6b26ed667c1206dc9f
tree: 111d349b4deb4ee6cd3cb2d8b030107a4fd616e7 /unicode_constants.h
parent: 5af9bc9750ba392c2a4adfdc3ced4b0b301f656a
Skip casing for some non-cased scripts
Characters whose upper, lower, title, or fold case differ from the character itself amount to just 1.5% of the assigned Unicode characters, and this percentage falls with each new Unicode release, as almost all cased scripts have already been encoded. But a lot of code is written assuming a cased language, such as calling uc() or lcfirst(), or doing qr//i. When such code is run on a non-cased language, the work expended in doing the casing is wasted. And casing is expensive. But finding out whether a character is cased is nearly as expensive, so one might as well just do the casing.

However, the Unicode code space is organized so that there are some long stretches of contiguous code points that aren't cased. By adding tests to see if the input code point is in just a few of these ranges, we can quickly rule casing out for most of the non-cased scripts that are of commercial use today, at essentially no expense to handling the more common cased scripts.

Testing for just 3 ranges in Plane 0 of Unicode (where most of the code points in common use today reside) allows us to skip doing casing for more than 82% of code points in the plane, including the following languages: Arabic, Chinese, Hebrew, Japanese, Korean, Thai, and the major scripts of India. A swash is no longer generated when trying to case one of these, so runtime memory usage is decreased.

(It should be noted that some of these languages have characters scattered in other areas, because the original allocation for them turned out not to be large enough. When changing the case of these other characters, the lookups won't be skipped. But that original allocation included all or nearly all the characters in current common use, so these other characters are comparatively rare.)

The comments in the code indicate some candidate non-cased ranges that I chose not to treat specially at this time. The next commit will address planes above Plane 0.
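The range test described above can be sketched in C roughly as follows. This is a minimal illustration, not the code from this commit: the struct, the function names, and in particular the three range boundaries are assumptions chosen here so that the ranges contain no cased characters; the actual ranges and their rationale are in the comments of the changed source.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive range of code points. Hypothetical type for this sketch. */
typedef struct {
    uint32_t first;
    uint32_t last;
} cp_range;

/* ASSUMPTION: illustrative boundaries only, not copied from the commit.
 * Each range is a contiguous stretch of Plane 0 with no cased characters. */
static const cp_range uncased_ranges[] = {
    { 0x0590, 0x109F },  /* Hebrew, Arabic, Indic scripts, Thai, ... Myanmar */
    { 0x2D30, 0xA63F },  /* Tifinagh ... CJK, kana, Yi, Vai */
    { 0xAC00, 0xD7FF },  /* Hangul syllables and following jamo */
};

/* Returns true if cp lies in one of the ranges above, meaning every case
 * mapping of cp is cp itself and the expensive lookup can be skipped. */
static bool in_known_uncased_range(uint32_t cp)
{
    size_t n = sizeof uncased_ranges / sizeof uncased_ranges[0];
    for (size_t i = 0; i < n; i++) {
        if (cp >= uncased_ranges[i].first && cp <= uncased_ranges[i].last) {
            return true;
        }
    }
    return false;
}

/* Wraps a slow, table-driven case mapping (passed in by the caller) with
 * the cheap range test. slow_map is a stand-in for the real lookup. */
static uint32_t map_case_fast(uint32_t cp, uint32_t (*slow_map)(uint32_t))
{
    if (in_known_uncased_range(cp)) {
        return cp;            /* cases to itself; no table is consulted */
    }
    return slow_map(cp);      /* cased or unknown: do the full lookup */
}
```

With these assumed ranges, U+05D0, U+4E01, and U+AC00 take the early return, while U+03B1 (GREEK SMALL LETTER ALPHA) still goes through the full lookup. The cost added to the cased path is at most a handful of integer comparisons.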
When this command is run on a perl compiled with -O2, no DEBUGGING:

    blead Porting/bench.pl --perlargs="-Ilib -X" --benchfile=plane0_casing_perf /path_to_prior_perl=before_this_commit /path_to_new_perl=after

and file 'plane0_casing_perf' contains

    [
        'string::casing::greek' => {
            desc    => 'should be no change',
            setup   => 'my $a = "\x{3B1}"', # GREEK SMALL LETTER ALPHA
            code    => 'uc($a)'
        },
        'string::casing::hebrew' => {
            desc    => 'yes swash vs no swash',
            setup   => 'my $a = "\x{5D0}"', # HEBREW LETTER ALEF
            code    => 'uc($a)'
        },
        'string::casing::cjk' => {
            desc    => 'yes swash vs no swash',
            setup   => 'my $a = "\x{4E01}"',
            code    => 'uc($a)'
        },
        'string::casing::korean' => {
            desc    => 'yes swash vs no swash',
            setup   => 'my $a = "\x{AC00}"',
            code    => 'uc($a)'
        },
    ];

These are the results:

The numbers represent raw counts per loop iteration.

string::casing::cjk
yes swash vs no swash

       before_this_commit    after
       ------------------ --------
    Ir              931.0    300.0
    Dr              217.0     93.0
    Dw               94.0     45.0
  COND              129.0     48.0
   IND                7.0      4.0
COND_m                1.5      0.0
 IND_m                4.0      2.0
 Ir_m1                0.1      0.0
 Dr_m1                0.0      0.0
 Dw_m1                0.0      0.0
 Ir_mm                0.0      0.0
 Dr_mm                0.0      0.0
 Dw_mm                0.0      0.0

string::casing::greek
should be no change

       before_this_commit    after
       ------------------ --------
    Ir              946.0    920.0
    Dr              218.0    220.0
    Dw              100.0    100.0
  COND              127.0    121.0
   IND                6.0      8.0
COND_m                0.5      1.3
 IND_m                2.0      2.0
 Ir_m1                0.1      0.0
 Dr_m1                0.0      0.0
 Dw_m1                0.0      0.0
 Ir_mm                0.0      0.0
 Dr_mm                0.0      0.0
 Dw_mm                0.0      0.0

string::casing::hebrew
yes swash vs no swash

       before_this_commit    after
       ------------------ --------
    Ir              928.0    290.0
    Dr              224.0     92.0
    Dw              100.0     45.0
  COND              129.0     46.0
   IND                6.0      4.0
COND_m                0.5      0.0
 IND_m                2.0      2.0
 Ir_m1                0.1      0.0
 Dr_m1                0.0      0.0
 Dw_m1                0.0      0.0
 Ir_mm                0.0      0.0
 Dr_mm                0.0      0.0
 Dw_mm                0.0      0.0

string::casing::korean
yes swash vs no swash

       before_this_commit    after
       ------------------ --------
    Ir              953.0    307.6
    Dr              224.0     93.0
    Dw              100.0     45.0
  COND              131.0     50.9
   IND                7.0      4.0
COND_m                1.5      0.0
 IND_m                4.0      2.0
 Ir_m1                0.1      0.0
 Dr_m1                0.0      0.0
 Dw_m1                0.0      0.0
 Ir_mm                0.0      0.0
 Dr_mm                0.0      0.0
 Dw_mm                0.0      0.0
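To read the raw counts above in relative terms, the drop between the two columns can be computed with a small helper like the one below. The function name is introduced here for illustration and is not part of bench.pl; the input values are the Ir (instruction read) counts reported above.

```c
/* Percentage by which `after` is lower than `before`.
 * Hypothetical helper for reading the benchmark table; assumes before > 0. */
static double percent_reduction(double before, double after)
{
    return (before - after) / before * 100.0;
}

/* Ir counts taken from the results above, as { before, after }. */
static const double ir_cjk[2]    = { 931.0, 300.0 };
static const double ir_greek[2]  = { 946.0, 920.0 };
static const double ir_hebrew[2] = { 928.0, 290.0 };
static const double ir_korean[2] = { 953.0, 307.6 };
```

For example, `percent_reduction(ir_hebrew[0], ir_hebrew[1])` gives 68.75, and the CJK and Korean cases come out at a little under 68%, while the Greek control case changes by less than 3%.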
Diffstat (limited to 'unicode_constants.h')
0 files changed, 0 insertions, 0 deletions