author Karl Williamson <public@khwilliamson.com> 2014-03-14 14:23:21 -0600
committer Karl Williamson <public@khwilliamson.com> 2014-03-14 14:58:06 -0600
commit eb0925341cc65ce6ce57503ec0ab97cdad39dc98
tree 36dd7cb7408cb8c93e8d08f1ff48635b0149767f /utf8.c
parent 31f658ea7702fc70c35f013cc9d18909fd7589c9
mktables: Inline short tables
mktables generates tables of Unicode properties. These are stored in files to be loaded on demand, because the memory cost of having all of them loaded would be excessive, and many are rarely used. Hashes created in Heavy.pl, which is read in by utf8_heavy.pl, map each Unicode property name to the file that contains its definition.

It turns out that nearly half of current Unicode properties are just a single consecutive range of code points, and their definitions are representable almost as compactly as the names of the files that contain them. This commit changes mktables so that the tables for single-range properties are not written out to disk; instead, a special syntax in Heavy.pl indicates this and gives their definitions.

This does not increase the memory usage of Heavy.pl appreciably, as the definitions replace the file names that are already there, but it lowers the number of files generated by mktables from 908 (in Unicode 6.3) to 507. These files are probably each a disk block, so the disk savings are not large. But reading in any of these properties is much faster: once utf8_heavy.pl is loaded, no further disk access is needed to get any of them. Most of these properties are obscure, but not all. The Line and Paragraph separators, for example, are quite commonly used.

Further, utf8_heavy.pl caches the files it has read in into hashes. This is not necessary for these properties, as they are already in memory, so total memory usage goes down if a program uses any of them, but again, since they are small, that amount is not large. The major gain is not having to read these files from disk at run time.

Tables that match no code points at all are also represented using this mechanism. Previously, they were expressed as the complement of \p{All}, which matches everything possible.
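
As an illustration only, the following standalone C sketch shows why a table with no code points needs its own branch when a count-prefixed list of numbers is parsed, which is the case the utf8.c hunk below handles. It is not Perl's implementation: the `inv_list` type and the `inv_list_parse`, `inv_list_contains`, and `inv_list_free` functions are hypothetical names invented for this example. The input text format ("COUNT N1 N2 ...") is also an assumption, based only on the comment in the diff that the first number is a count of the rest. A single-range property such as the Line and Paragraph separators (U+2028 to U+2029) is assumed to be written as "2 8232 8234", and an empty table as "0".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for an inversion list: a sorted array in which
 * even-indexed entries start a matching range and odd-indexed entries
 * start the following non-matching range. */
typedef struct {
    size_t len;
    unsigned long *elems;   /* NULL when len == 0 */
} inv_list;

/* Parse "COUNT N1 N2 ..." (decimal, whitespace separated).
 * Returns false on malformed input or allocation failure. */
bool inv_list_parse(const char *text, inv_list *out)
{
    char *after;
    unsigned long count;
    size_t i;

    assert(text != NULL && out != NULL);
    out->len = 0;
    out->elems = NULL;

    /* The first number is a count of the rest. */
    count = strtoul(text, &after, 10);
    if (after == text)
        return false;               /* no count present */

    /* A count of zero means the table matches nothing: return an
     * empty list without trying to read a 0th element. */
    if (count == 0)
        return true;

    out->elems = malloc(count * sizeof *out->elems);
    if (out->elems == NULL)
        return false;

    for (i = 0; i < count; i++) {
        const char *p = after;
        out->elems[i] = strtoul(p, &after, 10);
        if (after == p) {           /* fewer numbers than promised */
            free(out->elems);
            out->elems = NULL;
            return false;
        }
    }
    out->len = count;
    return true;
}

/* A code point matches when an odd number of elements are <= it. */
bool inv_list_contains(const inv_list *list, unsigned long cp)
{
    size_t i = 0;
    while (i < list->len && list->elems[i] <= cp)
        i++;
    return (i & 1) != 0;
}

void inv_list_free(inv_list *list)
{
    free(list->elems);
    list->elems = NULL;
    list->len = 0;
}
```

Under these assumptions, parsing "0" yields a list that contains no code point, while parsing "2 8232 8234" yields a list that contains U+2028 and U+2029 but not U+2027 or U+202A.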
Diffstat (limited to 'utf8.c')
-rw-r--r-- utf8.c 5
1 files changed, 5 insertions, 0 deletions
diff --git a/utf8.c b/utf8.c
index 727c12541a..7a30a63c71 100644
--- a/utf8.c
+++ b/utf8.c
@@ -3719,6 +3719,10 @@ Perl__swash_to_invlist(pTHX_ SV* const swash)
/* The first number is a count of the rest */
l++;
elements = Strtoul((char *)l, &after_strtol, 10);
+ if (elements == 0) {
+ invlist = _new_invlist(0);
+ }
+ else {
l = (U8 *) after_strtol;
/* Get the 0th element, which is needed to setup the inversion list */
@@ -3735,6 +3739,7 @@ Perl__swash_to_invlist(pTHX_ SV* const swash)
*other_elements_ptr++ = (UV) Strtoul((char *)l, &after_strtol, 10);
l = (U8 *) after_strtol;
}
+ }
}
else {