author Karl Williamson <public@khwilliamson.com> 2014-03-14 14:23:21 -0600
committer Karl Williamson <public@khwilliamson.com> 2014-03-14 14:58:06 -0600
commit eb0925341cc65ce6ce57503ec0ab97cdad39dc98 (patch)
tree 36dd7cb7408cb8c93e8d08f1ff48635b0149767f /lib/Unicode
parent 31f658ea7702fc70c35f013cc9d18909fd7589c9 (diff)
download perl-eb0925341cc65ce6ce57503ec0ab97cdad39dc98.tar.gz
mktables: Inline short tables
mktables generates tables of Unicode properties. These are stored in files to be loaded on demand, because the memory cost of having all of them loaded would be excessive, and many are rarely used. Hashes created in Heavy.pl, which is read in by utf8_heavy.pl, map each Unicode property name to the file that contains its definition.

It turns out that nearly half of current Unicode properties are just a single consecutive range of code points, and their definitions are representable almost as compactly as the names of the files that contain them. This commit changes mktables so that the tables for single-range properties are not written out to disk; instead, a special syntax is used in Heavy.pl to indicate this and to give their definitions.

This does not increase the memory usage of Heavy.pl appreciably, as the definitions replace the file names that are already there, but it lowers the number of files generated by mktables from 908 (in Unicode 6.3) to 507. These files are probably each a disk block, so the disk savings are not large. But it means that reading in any of these properties is much faster: once utf8_heavy.pl is loaded, no further disk access is needed to get any of them. Most of these properties are obscure, but not all. The Line and Paragraph separators, for example, are quite commonly used.

Further, utf8_heavy.pl caches the files it has read into hashes. This is not necessary for these tables, as they are already in memory, so total memory usage goes down if a program uses any of them, but again, since they are small, that amount is not large. The major gain is not having to read these files from disk at run time.

Tables that match no code points at all are also represented using this mechanism. Previously, they were expressed as the complement of \p{All}, which matches everything possible.
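
As an illustration of the decision described in the commit message, here is a minimal sketch and not the mktables code. The function name `table_location`, the sample file name, and the payload layout (a `V<count>` line followed by the start and one-past-the-end code points) are all assumptions made for this example. Only the `#/` prefix and the use of semicolons in place of newlines come from the diff.

```perl
use strict;
use warnings;

# Hypothetical sketch, NOT the mktables implementation.
# $ranges is a reference to a list of [start, end] pairs (inclusive).
# $file_name is the name the table would have if written to disk.
sub table_location {
    my ($ranges, $file_name) = @_;

    # Two or more ranges: keep the table in its own file.
    return $file_name if @$ranges > 1;

    # Empty table: nothing matches (assumed payload: an empty list).
    return '#/V0' if @$ranges == 0;

    # Single range: inline it, with ';' standing in for newlines.
    # The "V2;start;end+1" layout is an assumption for illustration.
    my ($start, $end) = @{ $ranges->[0] };
    return sprintf '#/V2;%d;%d', $start, $end + 1;
}
```

Under these assumptions, `table_location([[0x2028, 0x2028]], 'Zl/Y')` would yield an inline entry rather than a file name, so looking up that property would need no further disk access once the hash is loaded.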
Diffstat (limited to 'lib/Unicode')
-rw-r--r-- lib/Unicode/UCD.t 32
1 files changed, 24 insertions, 8 deletions
diff --git a/lib/Unicode/UCD.t b/lib/Unicode/UCD.t
index 5ed15c61d8..c6b50fdca3 100644
--- a/lib/Unicode/UCD.t
+++ b/lib/Unicode/UCD.t
@@ -1051,9 +1051,19 @@ foreach my $set_of_tables (\%utf8::stricter_to_file_of, \%utf8::loose_to_file_of
}
$tested_invlist{$file} = dclone \@tested;
- # A leading '!' in the file name means that it is to be inverted.
- my $invert = $file =~ s/^!//;
- my $official = do "unicore/lib/$file.pl";
+ # A '!' in the file name means that it is to be inverted.
+ my $invert = $file =~ s/!//;
+ my $official;
+
+ # If the file's directory is '#', it is a special case where the
+ # contents are in-lined with semi-colons meaning new-lines, instead of
+ # it being an actual file to read.
+ if ($file =~ s!^#/!!) {
+ $official = $file =~ s/;/\n/gr;
+ }
+ else {
+ $official = do "unicore/lib/$file.pl";
+ }
# Get rid of any trailing space and comments in the file.
$official =~ s/\s*(#.*)?$//mg;
@@ -1475,13 +1485,19 @@ foreach my $prop (sort(keys %props), sort keys %legacy_props) {
# property comes along without these characteristics
if (!defined $base_file) {
$base_file = $utf8::loose_to_file_of{$proxy_prop};
- $is_binary = ($base_file =~ s/^!//) ? -1 : 1;
- $base_file = "lib/$base_file";
+ $is_binary = ($base_file =~ s/!//) ? -1 : 1;
+ $base_file = "lib/$base_file" unless $base_file =~ m!^#/!;
}
- # Read in the file
- $file = "unicore/$base_file.pl";
- $official = do $file;
+ # Read in the file. If the file's directory is '#', it is a
+ # special case where the contents are in-lined with semi-colons
+ # meaning new-lines, instead of it being an actual file to read.
+ if ($base_file =~ s!^#/!!) {
+ $official = $base_file =~ s/;/\n/gr;
+ }
+ else {
+ $official = do "unicore/$base_file.pl";
+ }
# Get rid of any trailing space and comments in the file.
$official =~ s/\s*(#.*)?$//mg;
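
Both hunks apply the same decoding steps to a hash value: strip a `!` (inversion), check for the `#/` pseudo-directory, and either turn semicolons into newlines or fall back to a file path. The sketch below gathers those steps into one helper to make the logic easier to follow. The helper name `decode_table_entry`, the returned hash layout, and the sample entries in the usage note are hypothetical and do not appear in the commit. It needs Perl 5.14 or later for the `/r` substitution flag, which the diff also uses.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the steps in the hunks above;
# its name and return structure are assumptions for illustration.
sub decode_table_entry {
    my ($entry) = @_;

    # A '!' in the entry means the table is to be inverted.
    my $invert = ($entry =~ s/!//) ? 1 : 0;

    # A '#/' "directory" marks an inlined definition, in which
    # semicolons stand for newlines. Nothing is read from disk.
    if ($entry =~ s{^#/}{}) {
        my $text = $entry =~ s/;/\n/gr;
        return { invert => $invert, inline => 1, text => $text };
    }

    # Otherwise the entry names a real file under unicore/lib,
    # which the caller would load with `do`.
    return { invert => $invert, inline => 0, path => "unicore/lib/$entry.pl" };
}
```

For example, `decode_table_entry('#/V2;8232;8233')` returns the inlined text with newlines restored, while `decode_table_entry('!Alpha/Y')` returns a path plus an inversion flag. Both sample values are made up for this sketch.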