mktables: Inline short tables

mktables generates tables of Unicode properties. These are stored in files that are loaded on demand, because the memory cost of having all of them loaded would be excessive, and many are rarely used. Hashes created in Heavy.pl, which is read in by utf8_heavy.pl, map each Unicode property name to the file that contains its definition.

It turns out that nearly half of current Unicode properties are just a single consecutive range of code points, and their definitions are representable almost as compactly as the names of the files that contain them. This commit changes mktables so that the tables for single-range properties are not written out to disk; instead, a special syntax in Heavy.pl indicates that a table is inlined and gives its definition.

This does not increase the memory usage of Heavy.pl appreciably, as the definitions replace the file names that are already there, but it lowers the number of files generated by mktables from 908 (in Unicode 6.3) to 507. These files probably each occupy a single disk block, so the disk savings are not large. But it means that reading in any of these properties is much faster, as once utf8_heavy.pl gets loaded, no further disk access is needed to get any of them. Most of these properties are obscure, but not all. The Line and Paragraph separators, for example, are quite commonly used.

Further, utf8_heavy.pl caches the files it has read in into hashes. This is not necessary for the inlined tables, as they are already in memory, so the total memory usage goes down if a program uses any of them; but again, since these tables are small, that amount is not large. The major gain is not having to read these files from disk at run time.

Tables that match no code points at all are also represented using this mechanism. Previously, they were expressed as the complements of \p{All}, which matches everything possible.
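
The lookup described above can be illustrated with a minimal sketch. It assumes only what the accompanying test changes show: a '!' in an entry marks inversion, a leading '#/' directory marks an inlined definition, and semicolons stand for newlines. The function name decode_entry, the returned hash layout, and the sample entry values are hypothetical and are not taken from Heavy.pl or utf8_heavy.pl.

```perl
use strict;
use warnings;

# Hypothetical helper: classify one value from a property-to-file hash.
# This mirrors the logic in the UCD.t changes; it is not the code that
# utf8_heavy.pl itself uses.
sub decode_entry {
    my ($entry) = @_;

    # A '!' in the entry means the table is to be inverted.
    my $invert = ($entry =~ s/!//) ? 1 : 0;

    # A directory of '#' marks an inlined definition, with semicolons
    # standing in for newlines.  No file needs to be read.
    if ($entry =~ s!^#/!!) {
        (my $contents = $entry) =~ s/;/\n/g;
        return { invert => $invert, inline => 1, contents => $contents };
    }

    # Otherwise the entry names a file that must be read from disk.
    return { invert => $invert, inline => 0, path => "unicore/lib/$entry.pl" };
}

# Sample values are made up for illustration.
my $inlined = decode_entry('#/8232;8234');
my $on_disk = decode_entry('!Foo/Bar');
```

With these made-up inputs, $inlined->{contents} holds the two numbers on separate lines, while $on_disk carries a path to load and has its invert flag set.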
author: Karl Williamson <public@khwilliamson.com> 2014-03-14 14:23:21 -0600
committer: Karl Williamson <public@khwilliamson.com> 2014-03-14 14:58:06 -0600
commit: eb0925341cc65ce6ce57503ec0ab97cdad39dc98 (patch)
tree: 36dd7cb7408cb8c93e8d08f1ff48635b0149767f /lib/Unicode
parent: 31f658ea7702fc70c35f013cc9d18909fd7589c9 (diff)
download: perl-eb0925341cc65ce6ce57503ec0ab97cdad39dc98.tar.gz
1 files changed, 24 insertions, 8 deletions
diff --git a/lib/Unicode/UCD.t b/lib/Unicode/UCD.t
index 5ed15c61d8..c6b50fdca3 100644
--- a/lib/Unicode/UCD.t
+++ b/lib/Unicode/UCD.t
@@ -1051,9 +1051,19 @@ foreach my $set_of_tables (\%utf8::stricter_to_file_of, \%utf8::loose_to_file_of
         }
         $tested_invlist{$file} = dclone \@tested;
 
-        # A leading '!' in the file name means that it is to be inverted.
-        my $invert = $file =~ s/^!//;
-        my $official = do "unicore/lib/$file.pl";
+        # A '!' in the file name means that it is to be inverted.
+        my $invert = $file =~ s/!//;
+        my $official;
+
+        # If the file's directory is '#', it is a special case where the
+        # contents are in-lined with semi-colons meaning new-lines, instead of
+        # it being an actual file to read.
+        if ($file =~ s!^#/!!) {
+            $official = $file =~ s/;/\n/gr;
+        }
+        else {
+            $official = do "unicore/lib/$file.pl";
+        }
 
         # Get rid of any trailing space and comments in the file.
         $official =~ s/\s*(#.*)?$//mg;
@@ -1475,13 +1485,19 @@ foreach my $prop (sort(keys %props), sort keys %legacy_props) {
             # property comes along without these characteristics
             if (!defined $base_file) {
                 $base_file = $utf8::loose_to_file_of{$proxy_prop};
-                $is_binary = ($base_file =~ s/^!//) ? -1 : 1;
-                $base_file = "lib/$base_file";
+                $is_binary = ($base_file =~ s/!//) ? -1 : 1;
+                $base_file = "lib/$base_file" unless $base_file =~ m!^#/!;
             }
 
-            # Read in the file
-            $file = "unicore/$base_file.pl";
-            $official = do $file;
+            # Read in the file.  If the file's directory is '#', it is a
+            # special case where the contents are in-lined with semi-colons
+            # meaning new-lines, instead of it being an actual file to read.
+            if ($base_file =~ s!^#/!!) {
+                $official = $base_file =~ s/;/\n/gr;
+            }
+            else {
+                $official = do "unicore/$base_file.pl";
+            }
 
             # Get rid of any trailing space and comments in the file.
             $official =~ s/\s*(#.*)?$//mg;