author | Karl Williamson <public@khwilliamson.com> | 2014-03-14 14:23:21 -0600 |
---|---|---|
committer | Karl Williamson <public@khwilliamson.com> | 2014-03-14 14:58:06 -0600 |
commit | eb0925341cc65ce6ce57503ec0ab97cdad39dc98 (patch) | |
tree | 36dd7cb7408cb8c93e8d08f1ff48635b0149767f /utf8.c | |
parent | 31f658ea7702fc70c35f013cc9d18909fd7589c9 (diff) | |
download | perl-eb0925341cc65ce6ce57503ec0ab97cdad39dc98.tar.gz |
mktables: Inline short tables
mktables generates tables of Unicode properties. These are stored in
files to be loaded on-demand. This is because the memory cost of having
all of them loaded would be excessive, and many are rarely used. Hashes
created in Heavy.pl, which is read in by utf8_heavy.pl, map each
Unicode property name to the file which contains its definition.
It turns out that nearly half of current Unicode properties are just a
single consecutive range of code points, and their definitions are
representable almost as compactly as the names of the files that contain
them.
This commit changes mktables so that the tables for single-range
properties are not written out to disk, but instead a special syntax is
used in Heavy.pl to indicate this and what their definitions are.
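As a rough illustration of the idea, a lookup entry can be thought of as holding either a file name or the definition itself. The C sketch below is not the syntax mktables actually writes into Heavy.pl (this message does not spell it out); `prop_entry`, `def_kind`, the helper functions, the `Hypothetical_*` entries, and the file name are all made up for the example. The Line and Paragraph separators are the single code points U+2028 and U+2029.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: how a property's definition is stored. */
typedef enum { DEF_IN_FILE, DEF_INLINE_RANGE, DEF_INLINE_EMPTY } def_kind;

typedef struct {
    const char   *name;   /* property name used as the lookup key */
    def_kind      kind;
    const char   *file;   /* only meaningful for DEF_IN_FILE */
    unsigned long first;  /* inclusive range, only for DEF_INLINE_RANGE */
    unsigned long last;
} prop_entry;

/* Illustrative data; the file name is invented. */
static const prop_entry example_entries[] = {
    { "Line_Separator",      DEF_INLINE_RANGE, NULL, 0x2028, 0x2028 },
    { "Paragraph_Separator", DEF_INLINE_RANGE, NULL, 0x2029, 0x2029 },
    { "Hypothetical_Empty",  DEF_INLINE_EMPTY, NULL, 0, 0 },
    { "Hypothetical_Large",  DEF_IN_FILE, "hypothetical/Large.pl", 0, 0 },
};

/* True when resolving the property would require reading a file. */
static bool needs_file_read(const prop_entry *e)
{
    assert(e != NULL);
    return e->kind == DEF_IN_FILE;
}

/* Answer membership for an inlined definition without touching disk. */
static bool inline_contains(const prop_entry *e, unsigned long cp)
{
    assert(e != NULL && e->kind != DEF_IN_FILE);
    return e->kind == DEF_INLINE_RANGE && cp >= e->first && cp <= e->last;
}
```

In this sketch only the `DEF_IN_FILE` entry would ever cause a file to be opened; the other three are answered from the table alone.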
This does not increase the memory usage of Heavy.pl appreciably, as the
definitions replace the file names that are already there, but it lowers
the number of files generated by mktables from 908 (in Unicode 6.3) to
507. These files are probably each a disk block, so the disk savings is
not large. But it means that reading in any of these properties is much
faster, as once utf8_heavy gets loaded, no further disk access is needed
to get any of these properties. Most of these properties are obscure,
but not all. The Line and Paragraph separators, for example, are quite
commonly used.
Further, utf8_heavy.pl caches the files it has read in into hashes.
This is not necessary for these, as they are already in memory, so the
total memory usage goes down if a program uses any of these, but again,
since these are small, that amount is not large. The major gain is not
having to read these files from disk at run time.
Tables that match no code points at all are also represented using this
mechanism. Previously, they were expressed as the complements of
\p{All}, which matches everything possible.
Diffstat (limited to 'utf8.c')
-rw-r--r-- | utf8.c | 5 |
1 files changed, 5 insertions, 0 deletions
@@ -3719,6 +3719,10 @@ Perl__swash_to_invlist(pTHX_ SV* const swash)
         /* The first number is a count of the rest */
         l++;
         elements = Strtoul((char *)l, &after_strtol, 10);
+        if (elements == 0) {
+            invlist = _new_invlist(0);
+        }
+        else {
         l = (U8 *) after_strtol;
         /* Get the 0th element, which is needed to setup the inversion list */
@@ -3735,6 +3739,7 @@ Perl__swash_to_invlist(pTHX_ SV* const swash)
             *other_elements_ptr++ = (UV) Strtoul((char *)l, &after_strtol, 10);
             l = (U8 *) after_strtol;
         }
+        }
     }
     else {
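The shape of this change can be sketched outside of Perl's internals. The standalone C below assumes a text format of a decimal count followed by that many decimal elements, and uses plain `strtoul` and `malloc` rather than Perl's `Strtoul` and `_new_invlist`; `inv_list`, `parse_inv_list`, and `inv_list_contains` are hypothetical names, not functions in utf8.c. The point it tries to show is the early exit: a count of zero yields an empty list, and no 0th element is read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for an inversion list: sorted code points where
 * even indices begin a matched range and odd indices end it (exclusive). */
typedef struct {
    size_t         len;
    unsigned long *elems;  /* NULL when len == 0 */
} inv_list;

/* Parse "count E0 E1 ..." (whitespace separated). Returns false on
 * malformed input or allocation failure. */
static bool parse_inv_list(const char *text, inv_list *out)
{
    char *after;
    assert(text != NULL && out != NULL);

    unsigned long count = strtoul(text, &after, 10);
    if (after == text)
        return false;               /* no count present */

    out->len = 0;
    out->elems = NULL;
    if (count == 0)
        return true;                /* empty list: there is no 0th element */

    out->elems = malloc(count * sizeof *out->elems);
    if (out->elems == NULL)
        return false;

    for (size_t i = 0; i < count; i++) {
        const char *p = after;
        out->elems[i] = strtoul(p, &after, 10);
        if (after == p) {           /* fewer elements than promised */
            free(out->elems);
            out->elems = NULL;
            return false;
        }
    }
    out->len = count;
    return true;
}

/* A code point matches when an odd number of elements are <= it. */
static bool inv_list_contains(const inv_list *l, unsigned long cp)
{
    size_t i = 0;
    assert(l != NULL);
    while (i < l->len && l->elems[i] <= cp)
        i++;
    return (i % 2) == 1;
}
```

With this representation an empty list matches no code point at all, which is why a table that matches nothing needs only the zero count.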