author Karl Williamson <khw@cpan.org> 2015-07-04 12:01:18 -0600
committer Karl Williamson <khw@cpan.org> 2015-07-28 22:15:56 -0600
commit d963b40d95190da0752832d8df913f2b294844bf
tree c2f485f10d8d9fba1e0cee21ef06b59e8532ab7d /lib/Unicode
parent 8981bf2b79483d0f338ee86fb79b9e1872cf96ed
Properly handle the Unicode kIICore property
This property is not included in the standard Perl distribution, but it is normative in the Unicode Unihan database, and perl can be compiled to include it.

This property is currently unique in that it operates much like Perl's notion of string truthfulness: non-empty values are considered true. That is, \p{kIICore} matches all characters that have a non-empty value for this property, and the actual values carry meaning that may need to be examined in some circumstances. These can be retrieved via Unicode::UCD::prop_invmap().

Unicode 7.0 changed this property without my noticing, and went in a very different direction with it than I anticipated, and the perl interpreter would loop when trying to deal with it under some circumstances.

This property is true for all 'core' Chinese/Japanese/Korean characters that every implementation of CJK things should strive to handle, i.e., the minimally acceptable set. The values now specify a precedence as their first letter, A, B, or C (I suppose this means one could implement just the A level ones first). The remaining letters in each value encode the standards which were used as the source for the character. In previous versions of the Standard, every non-null value was the string "2.1".
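The "non-empty means true" behavior described in this message can be illustrated with a minimal sketch. It is not the code from this commit. The helper names (range_index, value_of, is_core, priority) and the tiny demo inversion map are hypothetical, and the demo values are made up for illustration rather than taken from Unihan. The final loop calls the real Unicode::UCD::prop_invmap(), which is expected to return an empty list on a perl that was not compiled with kIICore.

```perl
use strict;
use warnings;
use Unicode::UCD qw(prop_invmap);

# Hypothetical helper: index of the range that contains $cp.
# Assumes the inversion list is sorted and begins with 0.
sub range_index {
    my ($list, $cp) = @_;
    my ($lo, $hi) = (0, $#$list);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi + 1) / 2);
        if ($list->[$mid] <= $cp) { $lo = $mid }
        else                      { $hi = $mid - 1 }
    }
    return $lo;
}

# Hypothetical helper: the mapped value for $cp.
sub value_of {
    my ($list, $map, $cp) = @_;
    return $map->[ range_index($list, $cp) ];
}

# Binary view: a code point matches when its value is non-empty.
sub is_core {
    my $value = value_of(@_);
    return (defined $value && length $value) ? 1 : 0;
}

# Precedence letter (A, B, or C) that leads the value, if any.
sub priority {
    my $value = value_of(@_);
    return (defined $value && $value =~ /^([ABC])/) ? $1 : undef;
}

# Made-up demo data, NOT real Unihan values: each list entry starts a
# range that extends up to the next entry.
my @demo_list = (0, 0x4E00, 0x4E01, 0x4E03, 0x4E04);
my @demo_map  = ("", "AGT", "", "C", "");

printf "U+%04X core=%d\n", $_, is_core(\@demo_list, \@demo_map, $_)
    for 0x4E00, 0x4E02, 0x4E03;

# With the real API: per the POD change in this commit, the plain name
# asks for the full mapping and the "is"-prefixed name for the binary one.
for my $name ("kIICore", "iskIICore") {
    my ($list, $map, $format, $default) = prop_invmap($name);
    if ($list) {
        print "$name: format '$format', ", scalar(@$list), " ranges\n";
    }
    else {
        print "$name: not available in this perl\n";
    }
}
```

The two views share one table: is_core() only asks whether a value is non-empty, while priority() inspects the value itself, which is why both a match table and a map table are useful for such a property.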
Diffstat (limited to 'lib/Unicode')
-rw-r--r-- lib/Unicode/UCD.pm | 8
-rw-r--r-- lib/Unicode/UCD.t | 8
2 files changed, 13 insertions, 3 deletions
diff --git a/lib/Unicode/UCD.pm b/lib/Unicode/UCD.pm
index aef2e3096f..bf4ba53c3a 100644
--- a/lib/Unicode/UCD.pm
+++ b/lib/Unicode/UCD.pm
@@ -2655,9 +2655,11 @@ or even better, C<"Gc=LC">).
Many Unicode properties have more than one name (or alias). C<prop_invmap>
understands all of these, including Perl extensions to them. Ambiguities are
-resolved as described above for L</prop_aliases()>. The Perl internal
-property "Perl_Decimal_Digit, described below, is also accepted. An empty
-list is returned if the property name is unknown.
+resolved as described above for L</prop_aliases()> (except if a property has
+both a complete mapping, and a binary C<Y>/C<N> mapping, then specifying the
+property name prefixed by C<"is"> causes the binary one to be returned). The
+Perl internal property "Perl_Decimal_Digit, described below, is also accepted.
+An empty list is returned if the property name is unknown.
See L<perluniprops/Properties accessible through Unicode::UCD> for the
properties acceptable as inputs to this function.
diff --git a/lib/Unicode/UCD.t b/lib/Unicode/UCD.t
index 43c16eec6c..1bf036cbf6 100644
--- a/lib/Unicode/UCD.t
+++ b/lib/Unicode/UCD.t
@@ -1675,6 +1675,14 @@ foreach my $prop (sort(keys %props), sort keys %legacy_props) {
# normalized version.
$name = &utf8::_loose_name(lc $name);
+ # In the case of a combination property, both a map table and a match
+ # table are generated. For all the tests except prop_invmap(), this is
+ # irrelevant, but for prop_invmap, having an 'is' prefix forces it to
+ # return the match table; otherwise the map. We thus need to distinguish
+ # between the two forms. The property name is what has this information.
+ $name = &utf8::_loose_name(lc $prop)
+ if exists $Unicode::UCD::combination_property{$name};
+
# Add in the characters that are supposed to be ignored to test loose
# matching, which the tested function applies to all properties
$display_prop = "$extra_chars$prop" unless $display_prop;