Prepare for Unicode 9.0

The major code changes needed to support Unicode 9.0 are changes to the boundary (break) rules, for things like \b{lb} and \b{wb}. regen/mk_invlists.pl creates two-dimensional arrays for all these properties. To see if a given point in the target string is a break or not, regexec.c looks up the entry in the property's table whose row corresponds to the code point before the potential break, and whose column corresponds to the one after. Mostly this completely determines the answer, but in some cases extra context is required. The array entry indicates this, and there has to be specially crafted code in regexec.c to handle each such possibility.

When a new release comes along, mk_invlists.pl has to be changed to handle any new or changed rules, and regexec.c has to be changed to handle any changes to the custom code. Unfortunately this is not a mature area of the Standard, and changes are fairly common in new releases. In part, this is because new types of code points come along, which need new rules. Sometimes it is because Unicode realized that the previous version didn't work as well as it could.

An example of the latter is that Unicode now realizes that Regional Indicator (RI) characters come in pairs, and that one should be able to break between pairs, but not within a pair. Previous versions treated any run of them as unbreakable. (Regional Indicators are a fairly recent type that was added to the Standard in 6.0, and things are still getting shaken out.) The other main changes to these rules also involve a fairly new type of character, emoji. We can expect further changes to these in the next Unicode releases.

\b{gcb} now depends on context for the first time (in rarely encountered cases, like RIs), so the function had to be changed from a simple table look-up to be more like the functions handling the other break properties.

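
The look-up scheme described above can be sketched roughly as follows. This is only an illustration, not the code in regexec.c: the break classes, the action constants, the table contents, and the is_break() helper are all hypothetical, and the table is far smaller than the ones mk_invlists.pl generates. It shows a table entry that answers directly in most cases, plus one entry that defers to custom code implementing the "break between RI pairs, not within a pair" rule.

```perl
use strict;
use warnings;

# Hypothetical break classes (not the real GCB/WB enum values)
use constant { OTHER => 0, EXTEND => 1, RI => 2 };

# Hypothetical table actions; RI_CONTEXT means "the table alone can't decide"
use constant { NO_BREAK => 0, BREAK => 1, RI_CONTEXT => 2 };

# Row: class of the code point before the candidate position;
# column: class of the one after.
my @break_table = (
    #  OTHER   EXTEND    RI
    [ BREAK, NO_BREAK, BREAK      ],    # OTHER
    [ BREAK, NO_BREAK, BREAK      ],    # EXTEND
    [ BREAK, NO_BREAK, RI_CONTEXT ],    # RI
);

# $classes is an array ref of break classes, one per code point; $pos is the
# index of the code point just after the candidate break (so $pos >= 1).
sub is_break {
    my ($classes, $pos) = @_;

    my $action = $break_table[ $classes->[$pos - 1] ][ $classes->[$pos] ];
    return $action if $action != RI_CONTEXT;

    # Extra context: count the RIs immediately before the position.  An
    # even count means the preceding RIs form complete pairs, so a break is
    # allowed here; an odd count means we are inside a pair.
    my $count = 0;
    for (my $i = $pos - 1; $i >= 0 && $classes->[$i] == RI; $i--) {
        $count++;
    }
    return $count % 2 == 0 ? BREAK : NO_BREAK;
}
```

With a run of four RIs, this sketch allows a break only at the middle position, which is the new pairing behavior; the older behavior would correspond to putting NO_BREAK in the RI/RI cell and needing no custom code.
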
Some years ago I revamped mktables in part to try to make it require as few manual interventions as possible when upgrading to a new version of Unicode. For example, a new data file in a release requires telling mktables about it, but as long as it follows the format of existing recent files, nothing else need be done to get whatever properties it describes to be included.

Some of the changes to mktables involved guessing, from existing limited data, what the underlying paradigm for that data was. The problem with that is there may not have been a paradigm, just something Unicode did ad hoc, which can change at will; or I didn't understand their unstated thinking, and guessed wrong.

Besides the boundary rule changes, the only change that the existing mktables couldn't cope with was the addition of the Tangut script, whose character names include the code point, as names like CJK UNIFIED IDEOGRAPH-3400 always have. The paradigm for this wasn't clear, since CJK was the only script that had this characteristic, and so I hard-coded it into mktables. The way Tangut is structured may show that there is a paradigm emerging (but we only have two examples, and there may not be a paradigm at all), and so I have guessed one, and changed mktables to assume this guessed paradigm. If other scripts like this come along, and I have guessed correctly, mktables will cope with them automatically without manual intervention.
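
As a minimal sketch of the guessed paradigm, the following stand-alone Perl derives a character name from the range-start entry in UnicodeData.txt, in the same spirit as the lib/charnames.t change in this commit. The helper names base_name_for_range() and algorithmic_name() are hypothetical and appear in neither mktables nor the test file; the sketch also assumes the caller has already checked that the range's general category is Lo and that it is not a Hangul range.

```perl
use strict;
use warnings;

# Given the name field of a range-start line in UnicodeData.txt, such as
# "<Tangut Ideograph, First>", return the base name shared by every code
# point in the range.  (Hypothetical helper; assumes category Lo, not Hangul.)
sub base_name_for_range {
    my ($range_name) = @_;

    # All the CJK ranges share one special-cased base name
    return 'CJK UNIFIED IDEOGRAPH' if $range_name =~ /^<CJK/;

    # Otherwise guess: strip the angle bracket and the ", First>" suffix,
    # then upper-case what is left
    (my $name = $range_name) =~ s/^<//;
    $name =~ s/,.*//;
    return uc $name;
}

# The full name is the base name, a hyphen, and the code point in hex
sub algorithmic_name {
    my ($range_name, $code_point) = @_;
    return sprintf "%s-%04X", base_name_for_range($range_name), $code_point;
}

print algorithmic_name('<Tangut Ideograph, First>', 0x17000), "\n";
print algorithmic_name('<CJK Ideograph Extension A, First>', 0x3400), "\n";
```

The two calls print TANGUT IDEOGRAPH-17000 and CJK UNIFIED IDEOGRAPH-3400. The sketch passes the base name through %s rather than interpolating it into the format string, which is a choice made here for safety and not something the commit itself does.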
author: Karl Williamson <khw@cpan.org> 2016-06-16 11:59:24 -0600
committer: Karl Williamson <khw@cpan.org> 2016-06-21 18:10:38 -0600
commit: b0e24409fd3623db353286c203d33b56e622bae6 (patch)
tree: bfba4fb7ba71ee2c200933eb6ac850448345e30d /lib
parent: 6295dc14882a54531ce4542f1d80fa8ae7b4f8f0 (diff)
download: perl-b0e24409fd3623db353286c203d33b56e622bae6.tar.gz
3 files changed, 56 insertions, 12 deletions
diff --git a/lib/Unicode/UCD.pm b/lib/Unicode/UCD.pm
index f48e4ca493..276e9f5d7e 100644
--- a/lib/Unicode/UCD.pm
+++ b/lib/Unicode/UCD.pm
@@ -5,7 +5,7 @@ use warnings;
 no warnings 'surrogate';    # surrogates can be inputs to this
 use charnames ();
 
-our $VERSION = '0.65';
+our $VERSION = '0.66';
 
 require Exporter;
 
@@ -98,6 +98,9 @@ Unicode::UCD - Unicode character database
     use Unicode::UCD 'search_invlist';
     my $index = search_invlist(\@invlist, $code_point);
 
+    # The following function should be used only internally in
+    # implementations of the Unicode Normalization Algorithm, and there
+    # are better choices than it.
     use Unicode::UCD 'compexcl';
     my $compexcl = compexcl($codepoint);
 
@@ -1200,6 +1203,12 @@ sub bidi_types {
 
 =head2 B<compexcl()>
 
+WARNING: Unicode discourages the use of this function or any of the
+alternative mechanisms listed in this section (the documention of
+C<compexcl()>), except internally in implementations of the Unicode
+Normalization Algorithm.  You should be using L<Unicode::Normalize> directly
+instead of these.  Using these will likely lead to half-baked results.
+
     use Unicode::UCD 'compexcl';
 
     my $compexcl = compexcl(0x09dc);
@@ -3044,6 +3053,8 @@ L<Unicode::Normalize::NFD()|Unicode::Normalize>.
 
 Note that the mapping is the one that is specified in the Unicode data files,
 and to get the final decomposition, it may need to be applied recursively.
+Unicode in fact discourages use of this property except internally in
+implementations of the Unicode Normalization Algorithm.
 
 The fourth (index [3]) element (C<$default>) in the list returned for this
 format is 0.
diff --git a/lib/charnames.t b/lib/charnames.t
index cd87350bfe..9a5400c87a 100644
--- a/lib/charnames.t
+++ b/lib/charnames.t
@@ -1009,7 +1009,7 @@ is("\N{U+1D0C5}", "\N{BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS}", 'V
         die "Can't open ../../lib/unicore/UnicodeData.txt: $!";
     while (<$fh>) {
         chomp;
-        my ($code, $name, undef, undef, undef, undef, undef, undef, undef, undef, $u1name) = split ";";
+        my ($code, $name, $category, undef, undef, undef, undef, undef, undef, undef, $u1name) = split ";";
         my $decimal = utf8::unicode_to_native(hex $code);
         $code = sprintf("%04X", $decimal) unless $::IS_ASCII;
 
@@ -1042,12 +1042,26 @@ is("\N{U+1D0C5}", "\N{BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS}", 'V
             /^(.*?);/;
             my $end_decimal = hex $1;
 
-            # Only the CJK (and the Hangul which are instead dealt with below)
-            # ones have names, and they all have the code point as part of the
-            # name, which we can construct
-            if ($name =~ /^<CJK/) {
+            # Only the ones whose category is a letter currently have names,
+            # and of those the Hangul Syllables are dealt with below
+            if ( $category eq 'Lo' && $name !~ /^Hangul/i) {
+
+                # The CJK ones all get translated to a particular form; we
+                # just capitalize any others in the hopes that Unicode will
+                # use the correct term in any future ones it might add.
+                if ($name =~ /^<CJK/) {
+                    $name = "CJK UNIFIED IDEOGRAPH";
+                }
+                else {
+                    $name =~ s/<//;
+                    $name =~ s/,.*//;
+                    $name = uc($name);
+                }
+
+                # They all have the code point as part of the name, which we
+                # can construct
                 for my $i ($decimal .. $end_decimal) {
-                    $names[$i] = sprintf "CJK UNIFIED IDEOGRAPH-%04X", $i;
+                    $names[$i] = sprintf "$name-%04X", $i;
                     my $block = $i >> $block_size_bits;
                     $algorithmic_names_count[$block]++;
                 }
diff --git a/lib/unicore/mktables b/lib/unicore/mktables
index 7b25ba7f36..e8a28314af 100644
--- a/lib/unicore/mktables
+++ b/lib/unicore/mktables
@@ -45,7 +45,7 @@ sub NON_ASCII_PLATFORM { ord("A") != 65 }
 # expected, a warning will be generated.  If an older version is being
 # compiled, any bounds tests that fail in the generated test file (-maketest
 # option) will be marked as TODO.
-my $version_of_mk_invlist_bounds = v8.0.0;
+my $version_of_mk_invlist_bounds = v9.0.0;
 
 ##########################################################################
 #
@@ -11741,7 +11741,16 @@ END
                                           . $CMD_DELIM
                                           . $fields[$CHARNAME];
             }
-            elsif ($fields[$CHARNAME] =~ /^CJK/) {
+            elsif ($fields[$CATEGORY] eq 'Lo') {    # Is a letter
+
+                # All the CJK ranges like this have the name given as a
+                # special case in the next code line.  And for the others, we
+                # hope that Unicode continues to use the correct name in
+                # future releases, so we don't have to make further special
+                # cases.
+                my $name = ($fields[$CHARNAME] =~ /^CJK/)
+                           ? 'CJK UNIFIED IDEOGRAPH'
+                           : uc $fields[$CHARNAME];
 
                 # The name for these contains the code point itself, and all
                 # are defined to have the same base name, regardless of what
@@ -11753,7 +11762,7 @@ END
                                            . '='
                                            . $CP_IN_NAME
                                            . $CMD_DELIM
-                                           . 'CJK UNIFIED IDEOGRAPH';
+                                           . $name;
 
             }
             elsif ($fields[$CATEGORY] eq 'Co'
@@ -19193,7 +19202,8 @@ my @input_file_objects = (
                           . 'incorporated into the Unicode data base',
                    ),
     Input_file->new('StandardizedVariants.html', v3.2.0,
-                    Skip => 'Provides a visual display of the standard '
+                    Skip => 'Obsoleted as of Unicode 9.0, but previously '
+                          . 'provided a visual display of the standard '
                           . 'variant sequences derived from '
                           . 'F<StandardizedVariants.txt>.',
                         # I don't know why the html came earlier than the
@@ -19407,6 +19417,12 @@ my @input_file_objects = (
                     Property => 'Indic_Positional_Category',
                     Has_Missings_Defaults => $NOT_IGNORED,
                    ),
+    Input_file->new('TangutSources.txt', v9.0.0,
+                    Skip => 'Specifies source mappings for Tangut ideographs'
+                          . ' and components. This data file also includes'
+                          . ' informative radical-stroke values that are used'
+                          . ' internally by Unicode',
+                   ),
 );
 
 # End of all the preliminaries.
@@ -19871,7 +19887,10 @@ if (defined &locales_enabled) {
 }
 
 # Eval'd so can run on versions earlier than the property is available in
-my $WB_Extend_or_Format_re = eval 'qr/[\p{WB=Extend}\p{WB=Format}]/';
+my $WB_Extend_or_Format_re = eval 'qr/[\p{WB=Extend}\p{WB=Format}\p{WB=ZWJ}]/';
+if (! defined $WB_Extend_or_Format_re) {
+    $WB_Extend_or_Format_re = eval 'qr/[\p{WB=Extend}\p{WB=Format}]/';
+}
 
 sub _test_break($$) {
     # Test various break property matches.  The 2nd parameter gives the