author | Karl Williamson <khw@cpan.org> | 2015-02-14 09:34:39 -0700 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2015-02-18 12:51:34 -0700 |
commit | 91e7847033aefac6cf5b500a5c72811c8d6b8fbc (patch) | |
tree | d2b1242b00e5f8ee11d717a89698f8e3bda751b7 /lib/Unicode/UCD.pm | |
parent | fc1bb3f2dcaaac9f305b27acb4800babdc8a06f3 (diff) | |
Unicode::UCD: Pod corrections, clarifications
Diffstat (limited to 'lib/Unicode/UCD.pm')
-rw-r--r-- | lib/Unicode/UCD.pm | 61 |
1 files changed, 35 insertions, 26 deletions
diff --git a/lib/Unicode/UCD.pm b/lib/Unicode/UCD.pm
index 7e2ccf6a09..5d3115dcfe 100644
--- a/lib/Unicode/UCD.pm
+++ b/lib/Unicode/UCD.pm
@@ -111,7 +111,8 @@ Character Database.

 Some of the functions are called with a I<code point argument>, which is either
 a decimal or a hexadecimal scalar designating a code point in the platform's
-native character set (extended to Unicode), or C<U+> followed by hexadecimals
+native character set (extended to Unicode), or a string containing C<U+>
+followed by hexadecimals
 designating a Unicode code point. A leading 0 will force a hexadecimal
 interpretation, as will a hexadecimal digit that isn't a decimal digit.

@@ -120,7 +121,7 @@ Examples:
     223     # Decimal 223 in native character set
     0223    # Hexadecimal 223, native (= 547 decimal)
     0xDF    # Hexadecimal DF, native (= 223 decimal
-    U+DF    # Hexadecimal DF, in Unicode's character set
+    'U+DF'  # Hexadecimal DF, in Unicode's character set
                               (= LATIN SMALL LETTER SHARP S)

 Note that the largest code point in Unicode is U+10FFFF.
@@ -284,30 +285,30 @@ As of Unicode 6.0, this is always empty.

 =item B<upper>

-is empty if there is no single code point uppercase mapping for I<code>
-(its uppercase mapping is itself);
-otherwise it is that mapping expressed as at least four hexdigits.
-(L</casespec()> should be used in addition to B<charinfo()>
-for case mappings when the calling program can cope with multiple code point
-mappings.)
+is, if non-empty, the uppercase mapping for I<code> expressed as at least four
+hexdigits. This indicates that the full uppercase mapping is a single
+character, and is identical to the simple (single-character only) mapping.
+When this field is empty, it means that the simple uppercase mapping is
+I<code> itself; you'll need some other means, (like
+L</casespec()> to get the full mapping.

 =item B<lower>

-is empty if there is no single code point lowercase mapping for I<code>
-(its lowercase mapping is itself);
-otherwise it is that mapping expressed as at least four hexdigits.
-(L</casespec()> should be used in addition to B<charinfo()>
-for case mappings when the calling program can cope with multiple code point
-mappings.)
+is, if non-empty, the lowercase mapping for I<code> expressed as at least four
+hexdigits. This indicates that the full lowercase mapping is a single
+character, and is identical to the simple (single-character only) mapping.
+When this field is empty, it means that the simple lowercase mapping is
+I<code> itself; you'll need some other means, (like
+L</casespec()> to get the full mapping.

 =item B<title>

-is empty if there is no single code point titlecase mapping for I<code>
-(its titlecase mapping is itself);
-otherwise it is that mapping expressed as at least four hexdigits.
-(L</casespec()> should be used in addition to B<charinfo()>
-for case mappings when the calling program can cope with multiple code point
-mappings.)
+is, if non-empty, the titlecase mapping for I<code> expressed as at least four
+hexdigits. This indicates that the full titlecase mapping is a single
+character, and is identical to the simple (single-character only) mapping.
+When this field is empty, it means that the simple titlecase mapping is
+I<code> itself; you'll need some other means, (like
+L</casespec()> to get the full mapping.

 =item B<block>

@@ -2070,10 +2071,10 @@ are only a few dozen possible General Categories.

 You can use L</prop_values()> to find out if a given property is one which has a
 restricted set of values, and if so, what those values are. But usually
-each value actually has several synonyms. For example, in binary properties,
-I<truth> can be represented by any of the strings "Y", "Yes", "T", or "True";
-and the General Category "Punctuation" by that string, or "Punct", or simply
-"P".
+each value actually has several synonyms. For example, in Unicode binary
+properties, I<truth> can be represented by any of the strings "Y", "Yes", "T",
+or "True"; and the General Category "Punctuation" by that string, or "Punct",
+or simply "P".

 Like property names, there is typically at least a short name for each such
 property-value, and a long name. If you know any name of the property-value
@@ -2097,7 +2098,7 @@ C<undef>.

 If called with a property that doesn't have synonyms for its values, it returns
 the input value, possibly normalized with capitalization and
-underscores.
+underscores, but not necessarily checking that the input value is valid.

 For the block property, new-style block names are returned (see
 L</Old-style versus new-style block names>).
@@ -2890,6 +2891,14 @@ Use L</casefold()> for these.
 C<prop_invmap> does not know about any user-defined properties, and will
 return C<undef> if called with one of those.

+The returned values for the Perl extension properties, such as C<Any> and
+C<Greek> are somewhat misleading. The values are either C<"Y"> or C<"N>".
+All Unicode properties are bipartite, so you can actually use the C<"Y"> or
+C<"N>" in a Perl regular rexpression for these, like C<qr/\p{ID_Start=Y/}> or
+C<qr/\p{Upper=N/}>. But the Perl extensions aren't specified this way, only
+like C</qr/\p{Any}>, I<etc>. You can't actually use the C<"Y"> and C<"N>" in
+them.
+
 =cut

 # User-defined properties could be handled with some changes to utf8_heavy.pl;
@@ -3794,7 +3803,7 @@ as C<Basic Latin>, C<Latin 1 Supplement>, C<Latin Extended-A>, and
 C<Latin Extended-B>. On the other hand, the Latin script does not contain
 all the characters of the C<Basic Latin> block (also known as ASCII): it
 includes only the letters, and not, for example, the digits
-or the punctuation.
+nor the punctuation.

 For blocks see L<http://www.unicode.org/Public/UNIDATA/Blocks.txt>