mktables: Inline short tables

mktables generates tables of Unicode properties. These are stored in files to be loaded on demand, because the memory cost of having all of them loaded would be excessive and many are rarely used. Hashes in Heavy.pl, which is read in by utf8_heavy.pl, map each Unicode property name to the file that contains its definition.

It turns out that nearly half of current Unicode properties are just a single consecutive range of code points, and their definitions are representable almost as compactly as the names of the files that contain them. This commit changes mktables so that the tables for single-range properties are not written out to disk. Instead, a special syntax is used in Heavy.pl to indicate this and to give their definitions.

This does not increase the memory usage of Heavy.pl appreciably, as the definitions replace the file names that are already there, but it lowers the number of files generated by mktables from 908 (in Unicode 6.3) to 507. These files are probably each a single disk block, so the disk savings are not large. But reading in any of these properties becomes much faster: once utf8_heavy.pl is loaded, no further disk access is needed to get any of them. Most of these properties are obscure, but not all. The Line and Paragraph separators, for example, are quite commonly used.

Further, utf8_heavy.pl caches the files it has read in into hashes. This is not necessary for these properties, as they are already in memory, so total memory usage goes down if a program uses any of them. Again, since these tables are small, that amount is not large. The major gain is not having to read these files from disk at run time.

Tables that match no code points at all are also represented using this mechanism. Previously, they were expressed as the complement of \p{All}, which matches everything possible.
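To illustrate why a single-range property is so cheap to store inline, here is a minimal sketch of an inversion list lookup. It is not the Perl implementation: the function name `invlist_contains` and the array `line_separator` are hypothetical, and the special syntax that Heavy.pl actually uses is not shown here. The sketch assumes only the general inversion-list convention that each element marks a code point at which membership flips.

```c
#include <assert.h>
#include <stddef.h>

/* An inversion list is a sorted array of code points at which membership
 * flips: list[0] starts the first matching range, list[1] is the first
 * code point after it, and so on.  An empty list matches nothing. */

/* Hypothetical helper: return 1 if cp is in the set described by
 * list[0..len-1], else 0. */
static int invlist_contains(const unsigned long *list, size_t len,
                            unsigned long cp)
{
    size_t lo = 0, hi = len;

    assert(len == 0 || list != NULL);

    /* Binary search for the number of elements that are <= cp. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (list[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* An odd count means cp falls inside a matching range. */
    return (int)(lo & 1);
}

/* The Line Separator category is the single code point U+2028, so the
 * whole table is two numbers: the range start and one past its end. */
static const unsigned long line_separator[] = { 0x2028, 0x2029 };
```

A table that matches nothing is simply a list of length zero, which is the case the utf8.c change in this commit has to accept.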
author: Karl Williamson <public@khwilliamson.com> 2014-03-14 14:23:21 -0600
committer: Karl Williamson <public@khwilliamson.com> 2014-03-14 14:58:06 -0600
commit: eb0925341cc65ce6ce57503ec0ab97cdad39dc98 (patch)
tree: 36dd7cb7408cb8c93e8d08f1ff48635b0149767f /utf8.c
parent: 31f658ea7702fc70c35f013cc9d18909fd7589c9 (diff)
download: perl-eb0925341cc65ce6ce57503ec0ab97cdad39dc98.tar.gz
1 files changed, 5 insertions, 0 deletions
diff --git a/utf8.c b/utf8.c
index 727c12541a..7a30a63c71 100644
--- a/utf8.c
+++ b/utf8.c
@@ -3719,6 +3719,10 @@ Perl__swash_to_invlist(pTHX_ SV* const swash)
         /* The first number is a count of the rest */
         l++;
         elements = Strtoul((char *)l, &after_strtol, 10);
+        if (elements == 0) {
+            invlist = _new_invlist(0);
+        }
+        else {
         l = (U8 *) after_strtol;
 
         /* Get the 0th element, which is needed to setup the inversion list */
@@ -3735,6 +3739,7 @@ Perl__swash_to_invlist(pTHX_ SV* const swash)
             *other_elements_ptr++ = (UV) Strtoul((char *)l, &after_strtol, 10);
             l = (U8 *) after_strtol;
         }
+        }
     }
     else {
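The shape of the change above can be sketched in isolation: read a leading count, and if it is zero, produce an empty list without trying to read any elements. This is a standalone illustration under assumptions, not code from utf8.c. The helper `parse_counted_list` and its whitespace-separated text format are hypothetical, and it uses the standard `strtoul` rather than Perl's `Strtoul` wrapper.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper, not part of the Perl API.  Parses text such as
 * "2\n8232\n8233": a decimal count followed by that many decimal elements.
 * Returns a malloc'd array (caller frees) and stores the count in
 * *count_out.  For a count of zero it returns NULL with *count_out == 0,
 * meaning "matches nothing".  An allocation failure is reported the same
 * way; the sketch does not distinguish the two or validate the input. */
static unsigned long *parse_counted_list(const char *text, size_t *count_out)
{
    char *after;
    unsigned long count;
    unsigned long *elements;
    unsigned long i;

    assert(text != NULL && count_out != NULL);

    /* The first number is a count of the rest. */
    count = strtoul(text, &after, 10);

    *count_out = 0;
    if (count == 0)
        return NULL;    /* empty list: nothing further to read */

    elements = malloc(count * sizeof *elements);
    if (elements == NULL)
        return NULL;

    /* strtoul skips leading whitespace, including the newline separators. */
    for (i = 0; i < count; i++)
        elements[i] = strtoul(after, &after, 10);

    *count_out = (size_t)count;
    return elements;
}
```

Handling the zero count before touching the element loop mirrors the patch, which creates an empty inversion list up front and wraps the existing element-reading code in the else branch.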