author | Karl Williamson <khw@cpan.org> | 2015-07-04 12:01:18 -0600 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2015-07-28 22:15:56 -0600 |
commit | d963b40d95190da0752832d8df913f2b294844bf (patch) | |
tree | c2f485f10d8d9fba1e0cee21ef06b59e8532ab7d /lib/Unicode | |
parent | 8981bf2b79483d0f338ee86fb79b9e1872cf96ed (diff) | |
Properly handle the Unicode kIICore property

This property is not included in the standard Perl distribution, but it
is normative in the Unicode Unihan database, and perl can be compiled to
include it. The property is currently unique in that it operates much
like Perl's notion of string truthfulness: non-empty values are
considered true. That is, \p{kIICore} matches all characters that have
a non-empty value for this property. The actual values also carry
meaning that may need to be examined in some circumstances; they can
be retrieved via Unicode::UCD::prop_invmap().

Unicode 7.0 changed this property without my noticing, and went in a very
different direction with it than I anticipated. In addition, the perl
interpreter would loop when trying to deal with it under some
circumstances.

This property is true for all 'core' Chinese/Japanese/Korean characters
that every CJK implementation should strive to handle, i.e., the
minimally acceptable set. The values now specify a precedence as
their first letter, A, B, or C (I suppose this means one could implement
just the A-level ones first). The remaining letters in each value
encode the standards that were used as the source for the character.
In previous versions of the Standard, every non-null value was the
string "2.1".
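
The commit message says the property values can be retrieved via Unicode::UCD::prop_invmap(), and the documentation change in this commit says that an "is" prefix selects the binary Y/N form. The following is a minimal sketch of how a caller might use that, not code from the commit. The helper prop_value_at() and the code point U+4E00 are hypothetical, chosen only for illustration, and the "iskIICore" spelling is one assumed form of the "is" prefix. It also assumes a perl whose Unicode::UCD exports search_invlist(), and that the property's map entries are plain strings. On a perl built without Unihan data, prop_invmap() returns an empty list for kIICore, which the sketch reports rather than treating as an error.

```perl
use strict;
use warnings;
use Unicode::UCD qw(prop_invmap search_invlist);

# Hypothetical helper: look up the value of $prop at code point $cp through
# the inversion map.  Returns undef when this perl does not know the
# property (prop_invmap() returns an empty list in that case).
sub prop_value_at {
    my ($prop, $cp) = @_;
    my ($ranges, $values) = prop_invmap($prop);
    return undef unless defined $ranges;

    # search_invlist() gives the index of the range that contains $cp.
    my $i = search_invlist($ranges, $cp);
    return defined $i ? $values->[$i] : undef;
}

my $cp = 0x4E00;    # a CJK unified ideograph, chosen only for illustration

# Per the documentation change, the plain name should yield the full value
# and the "is"-prefixed name the binary Y/N form.
for my $name ("kIICore", "iskIICore") {
    my $value = prop_value_at($name, $cp);
    printf "%-10s U+%04X => %s\n", $name, $cp,
        defined $value ? "'$value'" : "(property not known to this perl)";
}
```

The same helper works for properties that are in the standard distribution and have simple string values, such as Script, which is a convenient way to try it without rebuilding perl.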
Diffstat (limited to 'lib/Unicode')
-rw-r--r-- | lib/Unicode/UCD.pm | 8 |
-rw-r--r-- | lib/Unicode/UCD.t | 8 |
2 files changed, 13 insertions, 3 deletions
diff --git a/lib/Unicode/UCD.pm b/lib/Unicode/UCD.pm
index aef2e3096f..bf4ba53c3a 100644
--- a/lib/Unicode/UCD.pm
+++ b/lib/Unicode/UCD.pm
@@ -2655,9 +2655,11 @@ or even better, C<"Gc=LC">).

 Many Unicode properties have more than one name (or alias). C<prop_invmap>
 understands all of these, including Perl extensions to them. Ambiguities are
-resolved as described above for L</prop_aliases()>. The Perl internal
-property "Perl_Decimal_Digit, described below, is also accepted. An empty
-list is returned if the property name is unknown.
+resolved as described above for L</prop_aliases()> (except if a property has
+both a complete mapping, and a binary C<Y>/C<N> mapping, then specifying the
+property name prefixed by C<"is"> causes the binary one to be returned). The
+Perl internal property "Perl_Decimal_Digit, described below, is also accepted.
+An empty list is returned if the property name is unknown.
 See L<perluniprops/Properties accessible through Unicode::UCD> for the
 properties acceptable as inputs to this function.

diff --git a/lib/Unicode/UCD.t b/lib/Unicode/UCD.t
index 43c16eec6c..1bf036cbf6 100644
--- a/lib/Unicode/UCD.t
+++ b/lib/Unicode/UCD.t
@@ -1675,6 +1675,14 @@ foreach my $prop (sort(keys %props), sort keys %legacy_props) {
         # normalized version.
         $name = &utf8::_loose_name(lc $name);

+        # In the case of a combination property, both a map table and a match
+        # table are generated. For all the tests except prop_invmap(), this is
+        # irrelevant, but for prop_invmap, having an 'is' prefix forces it to
+        # return the match table; otherwise the map. We thus need to distinguish
+        # between the two forms. The property name is what has this information.
+        $name = &utf8::_loose_name(lc $prop)
+            if exists $Unicode::UCD::combination_property{$name};
+
         # Add in the characters that are supposed to be ignored to test loose
         # matching, which the tested function applies to all properties
         $display_prop = "$extra_chars$prop" unless $display_prop;