summaryrefslogtreecommitdiff
path: root/utf8.h
diff options
context:
space:
mode:
authorKarl Williamson <public@khwilliamson.com>2012-08-25 14:49:47 -0600
committerKarl Williamson <public@khwilliamson.com>2012-08-25 23:21:30 -0600
commit87367d5f9dc9bbf7db1a6cf87820cea76571bf1a (patch)
tree0065ddd270e63dc5f72f6596aa1a74d0218b5f1d /utf8.h
parent5ab9d2ef14d72dec0d833310911032a980efcf09 (diff)
downloadperl-87367d5f9dc9bbf7db1a6cf87820cea76571bf1a.tar.gz
utf8.c: Prefer binsearch over swash hash for small swashes
A binary swash is a hash of bitmaps used to cache the results of looking up if a code point matches a Unicode property or regex bracketed character class. An inversion list is a data structure that also holds information about what code points match a Unicode property or character class. It is implemented as an SV* to a sorted C array, and hence can be searched using a binary search. This patch converts to using a binary search of an inversion list instead of a hash look-up for inversion lists that are no more than 512 elements (9 iterations of the search loop). That number can be easily adjusted, if necessary. Theoretically, a hash is faster than a binary lookup over a very long period. So this may negatively impact long-running servers. But in the short run, where most programs reside, the binary search is significantly faster. A swash is filled as necessary as time goes on, caching each new distinct code point it is called with. If it is called with many, many such code points, its performance can degrade as collisions increase. A binary search does not have that drawback. However, most real-world scenarios do not have a program being called with huge numbers of distinct code points. Mostly, the program will be called with code points from just one or a few of the world's scripts, so will remain sparse. The bitmaps in a swash are each 64 bits long (except for ASCII, where it is 128). That means that when the swash is populated, a lookup of a single code point that hasn't been checked before will have to lookup the 63 adjoining code points as well, increasing its startup overhead. Of course, if one of those 63 code points is later accessed, no extra populate happens. This is a typical case where a languages code points are all near each other. The bottom line, though, is in the short term, this patch speeds up the processing of \X regex matching about 35-40%, with modern Korean (which has uniquely complicated \X processing) closer to 40%, and other scripts closer to 35%. The 512 boundary means that over 90% of the official Unicode properties are handled using binary search. I settled on that number by experimenting with several properties besides \X and with various powers-of-2 limits. Until I got that high, performance kept improving when the property went from being a swash to a binary search. \X improved even up to 2048, which encompasses 100% of the official Unicode properties. The implementation changes so that an inversion list instead of a swash is returned by swash_init() when the input flags allows it to do so, for all inversion lists shorter than the compiled in constant of 512 (actually <= 512). The other functions that access swashes have added intelligence to deal with an object of either type. Should someone in CPAN be using the public swash_init() interface, they will not see any difference, as the option to get an inversion list is not available to them.
Diffstat (limited to 'utf8.h')
-rw-r--r--utf8.h1
1 files changed, 1 insertions, 0 deletions
diff --git a/utf8.h b/utf8.h
index 2307f585f3..11b5acac1b 100644
--- a/utf8.h
+++ b/utf8.h
@@ -25,6 +25,7 @@
/* For _core_swash_init(), internal core use only */
#define _CORE_SWASH_INIT_USER_DEFINED_PROPERTY 0x1
#define _CORE_SWASH_INIT_RETURN_IF_UNDEF 0x2
+#define _CORE_SWASH_INIT_ACCEPT_INVLIST 0x4
#define to_uni_fold(c, p, lenp) _to_uni_fold_flags(c, p, lenp, FOLD_FLAGS_FULL)
#define to_utf8_fold(c, p, lenp) _to_utf8_fold_flags(c, p, lenp, \