author     Karl Williamson <khw@khw-desktop.(none)>    2009-12-24 22:54:58 -0700
committer  Abigail <abigail@abigail.be>                2009-12-25 10:07:41 +0100
commit     e1b711dac329baf9cf4ea3e4628e6c713e24b342 (patch)
tree       b12ce1b41c2d6c0582296ddad541efd2ae3f71e2 /pod/perlunicode.pod
parent     27bca3226281a592aed848b7e68ea50f27381dac (diff)
download   perl-e1b711dac329baf9cf4ea3e4628e6c713e24b342.tar.gz
Update .pods
Signed-off-by: Abigail <abigail@abigail.be>
Diffstat (limited to 'pod/perlunicode.pod')
-rw-r--r--   pod/perlunicode.pod   299
1 file changed, 181 insertions, 118 deletions
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 26a7af059a..6807e7067e 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -80,18 +80,17 @@ be made without additional information from the user, Perl decides in
favor of compatibility and chooses to use byte semantics.
Under byte semantics, when C<use locale> is in effect, Perl uses the
-semantics associated with the current locale. Absent a C<use locale>, Perl
-currently uses US-ASCII (or Basic Latin in Unicode terminology) byte semantics,
-meaning that characters whose ordinal numbers are in the range 128 - 255 are
-undefined except for their ordinal numbers. This means that none have case
-(upper and lower), nor are any a member of character classes, like C<[:alpha:]>
-or C<\w>.
-(But all do belong to the C<\W> class or the Perl regular expression extension
-C<[:^alpha:]>.)
+semantics associated with the current locale. Absent a C<use locale>, and
+absent a C<use feature 'unicode_strings'> pragma, Perl currently uses US-ASCII
+(or Basic Latin in Unicode terminology) byte semantics, meaning that characters
+whose ordinal numbers are in the range 128 - 255 are undefined except for their
+ordinal numbers. This means that none have case (upper and lower), nor are any
+a member of character classes, like C<[:alpha:]> or C<\w>. (But all do belong
+to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.)
This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
-none of the program's inputs were marked as being as source of Unicode
+none of the program's inputs were marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.
@@ -99,6 +98,11 @@ or from literals and constants in the source text.
The C<bytes> pragma will always, regardless of platform, force byte
semantics in a particular lexical scope. See L<bytes>.
+The C<use feature 'unicode_strings'> pragma is intended to always, regardless
+of platform, force Unicode semantics in a particular lexical scope. In
+release 5.12, it is partially implemented, applying only to case changes.
+See L</The "Unicode Bug"> below.
+
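+For instance, a minimal sketch of the case-changing part that is implemented
+(C<\xe9> is LATIN SMALL LETTER E WITH ACUTE; output is shown as an escape):
+
+    use feature 'unicode_strings';
+    my $s = "\xe9";
+    print uc $s;    # "\xc9" under Unicode semantics; unchanged without the feature
+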
The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
@@ -112,7 +116,9 @@ input data comes from a Unicode source--for example, if a character
encoding layer is added to a filehandle or a literal Unicode
string constant appears in a program--character semantics apply.
Otherwise, byte semantics are in effect. The C<bytes> pragma should
-be used to force byte semantics on Unicode data.
+be used to force byte semantics on Unicode data, and the C<use feature
+'unicode_strings'> pragma to force Unicode semantics on byte data (though in
+5.12 it isn't fully implemented).
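+
+A small sketch of the byte-semantics side (the byte count assumes the internal
+UTF-8 representation used on non-EBCDIC platforms):
+
+    my $smiley = "\x{263A}";                       # one character, WHITE SMILING FACE
+    print length($smiley), "\n";                   # 1, character semantics
+    { use bytes; print length($smiley), "\n"; }    # 3, the bytes of the internal encoding
+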
If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will have
@@ -178,12 +184,10 @@ ideograph, for instance.
Named Unicode properties, scripts, and block ranges may be used like
character classes via the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".
-
See L</"Unicode Character Properties"> for more details.
You can define your own character properties and use them
in the regular expression with the C<\p{}> or C<\P{}> construct.
-
See L</"User-Defined Character Properties"> for more details.
=item *
@@ -206,7 +210,8 @@ Case translation operators use the Unicode case translation tables
when character input is provided. Note that C<uc()>, or C<\U> in
interpolated strings, translates to uppercase, while C<ucfirst>,
or C<\u> in interpolated strings, translates to titlecase in languages
-that make the distinction.
+that make the distinction (which is equivalent to uppercase in languages
+without the distinction).
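+
+A small sketch of the difference, using the digraph DZ, which has distinct
+uppercase and titlecase forms:
+
+    printf "%vX\n", uc "\x{1F3}";        # 1F1, LATIN CAPITAL LETTER DZ
+    printf "%vX\n", ucfirst "\x{1F3}";   # 1F2, the titlecase Dz form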
=item *
@@ -250,30 +255,8 @@ complement B<and> the full character-wide bit complement.
=item *
-lc(), uc(), lcfirst(), and ucfirst() work for the following cases:
-
-=over 8
-
-=item *
-
-the case mapping is from a single Unicode character to another
-single Unicode character, or
-
-=item *
-
-the case mapping is from a single Unicode character to more
-than one Unicode character.
-
-=back
-
-Things to do with locales (Lithuanian, Turkish, Azeri) do B<not> work
-since Perl does not understand the concept of Unicode locales.
-
-See the Unicode Technical Report #21, Case Mappings, for more details.
-
-But you can also define your own mappings to be used in the lc(),
+You can define your own mappings to be used in lc(),
lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
-
See L</"User-Defined Case Mappings"> for more details.
=back
@@ -297,16 +280,16 @@ For instance, C<\p{Uppercase}> matches any character with the Unicode
General_Category of "L" (letter) property. Brackets are not
required for single letter properties, so C<\p{L}> is equivalent to C<\pL>.
-More formally, C<\p{Uppercase}> matches any character whose Uppercase property
-value is True, and C<\P{Uppercase}> matches any character whose Uppercase
-property value is False, and they could have been written as
+More formally, C<\p{Uppercase}> matches any character whose Unicode Uppercase
+property value is True, and C<\P{Uppercase}> matches any character whose
+Uppercase property value is False; they could have been written as
+C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively.
This formality is needed when properties are not binary, that is if they can
take on more values than just True and False. For example, the Bidi_Class (see
L</"Bidirectional Character Types"> below), can take on a number of different
values, such as Left, Right, Whitespace, and others. To match these, one needs
-to specify the property name (Bidi_Class), and the value being matched with
+to specify the property name (Bidi_Class), and the value being matched against
(Left, Right, etc.). This is done, as in the examples above, by having the two
components separated by an equal sign (or interchangeably, a colon), like
C<\p{Bidi_Class: Left}>.
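For instance, a brief sketch using only forms described in this section:

    "A" =~ /\p{Uppercase=True}/;     # compound form of the binary \p{Uppercase}
    "A" =~ /\p{Bidi_Class: L}/;      # "A" is a left-to-right character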
@@ -327,7 +310,7 @@ C<\p{Uppercase}> equivalently as C<\p{Upper}>. Also, there are typically
various synonyms for the values the property can take. For binary properties,
"True" has 3 synonyms: "T", "Yes", and "Y"; and "False" correspondingly has "F",
"No", and "N". But be careful. A short form of a value for one property may
-not mean the same thing as the same name for another. Thus, for the
+not mean the same thing as the same short form for another. Thus, for the
General_Category property, "L" means "Letter", but for the Bidi_Class property,
"L" means "Left". A complete list of properties and synonyms is in
L<perluniprops>.
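A short sketch of the synonyms described above:

    "A" =~ /\p{Upper}/;           # short synonym for \p{Uppercase}
    "A" =~ /\p{Uppercase: Y}/;    # "Y" is a synonym for the value "True"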
@@ -456,9 +439,10 @@ written right to left.
=head3 B<Scripts>
-The world's languages are written in a number of scripts. This sentence is
-written in Latin, while Russian is written in Cyrllic, and Greek is written in,
-well, Greek; Japanese mainly in Hiragana or Katakana. There are many more.
+The world's languages are written in a number of scripts. This sentence
+(unless you're reading it in translation) is written in Latin, while Russian is
+written in Cyrillic, and Greek is written in, well, Greek; Japanese is written
+mainly in Hiragana or Katakana. There are many more.
The Unicode Script property gives what script a given character is in,
and can be matched with the compound form like C<\p{Script=Hebrew}> (short:
@@ -670,8 +654,8 @@ the first character in ucfirst()), and C<ToUpper> (for uc(), and the
rest of the characters in ucfirst()).
The string returned by the subroutines needs to be two hexadecimal numbers
-separated by two tabulators: the source code point and the destination code
-point. For example:
+separated by two tabulators: the two numbers being, respectively, the source
+code point and the destination code point. For example:
sub ToUpper {
return <<END;
@@ -726,17 +710,14 @@ Level 1 - Basic Unicode Support
[1] \x{...}
[2] \p{...} \P{...}
- [3] supports not only minimal list (general category, scripts,
- Alphabetic, Lowercase, Uppercase, WhiteSpace,
- NoncharacterCodePoint, DefaultIgnorableCodePoint, Any,
- ASCII, Assigned), but also bidirectional types, blocks, etc.
- (see "Unicode Character Properties")
+ [3] supports not only minimal list, but all Unicode character
+ properties (see L</Unicode Character Properties>)
[4] \d \D \s \S \w \W \X [:prop:] [:^prop:]
[5] can use regular expression look-ahead [a] or
user-defined character properties [b] to emulate set operations
[6] \b \B
- [7] note that Perl does Full case-folding in matching, not Simple:
- for example U+1F88 is equivalent to U+1F00 U+03B9,
+ [7] note that Perl does Full case-folding in matching (but with bugs),
+ not Simple: for example U+1F88 is equivalent to U+1F00 U+03B9,
not with 1F80. This difference matters mainly for certain Greek
capital letters with certain modifiers: the Full case-folding
decomposes the letter, while the Simple case-folding would map
@@ -789,9 +770,7 @@ Level 2 - Extended Unicode Support
[10] see UAX#15 "Unicode Normalization Forms"
[11] have Unicode::Normalize but not integrated to regexes
- [12] have \X but at this level . should equal that
- [13] UAX#29 "Text Boundaries" considers CRLF and Hangul syllable
- clusters as a single grapheme cluster.
+ [12] have \X but we don't have a "Grapheme Cluster Mode"
[14] see UAX#29, Word Boundaries
[15] see UAX#21 "Case Mappings"
[16] have \N{...} but neither compute names of CJK Ideographs
@@ -843,26 +822,24 @@ transparent.
The following table is from Unicode 3.2.
- Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
+ Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
- U+0000..U+007F 00..7F
- U+0080..U+07FF C2..DF 80..BF
- U+0800..U+0FFF E0 A0..BF 80..BF
+ U+0000..U+007F 00..7F
+ U+0080..U+07FF * C2..DF 80..BF
+ U+0800..U+0FFF E0 * A0..BF 80..BF
U+1000..U+CFFF E1..EC 80..BF 80..BF
U+D000..U+D7FF ED 80..9F 80..BF
- U+D800..U+DFFF ******* ill-formed *******
+ U+D800..U+DFFF +++++++ utf16 surrogates, not legal utf8 +++++++
U+E000..U+FFFF EE..EF 80..BF 80..BF
- U+10000..U+3FFFF F0 90..BF 80..BF 80..BF
- U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
- U+100000..U+10FFFF F4 80..8F 80..BF 80..BF
-
-Note the C<A0..BF> in C<U+0800..U+0FFF>, the C<80..9F> in
-C<U+D000...U+D7FF>, the C<90..B>F in C<U+10000..U+3FFFF>, and the
-C<80...8F> in C<U+100000..U+10FFFF>. The "gaps" are caused by legal
-UTF-8 avoiding non-shortest encodings: it is technically possible to
-UTF-8-encode a single code point in different ways, but that is
-explicitly forbidden, and the shortest possible encoding should always
-be used. So that's what Perl does.
+ U+10000..U+3FFFF F0 * 90..BF 80..BF 80..BF
+ U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
+ U+100000..U+10FFFF F4 80..8F 80..BF 80..BF
+
+Note the gaps, marked by '*', before several of the byte entries above. These
+are caused by legal UTF-8 avoiding non-shortest encodings: it is technically
+possible to UTF-8-encode a single code point in different ways, but that is
+explicitly forbidden, and the shortest possible encoding should always be used
+(and that is what Perl does).
Another way to look at it is via bits:
@@ -874,7 +851,7 @@ Another way to look at it is via bits:
00000dddccccccbbbbbbaaaaaa 11110ddd 10cccccc 10bbbbbb 10aaaaaa
As you can see, the continuation bytes all begin with C<10>, and the
-leading bits of the start byte tell how many bytes the are in the
+leading bits of the start byte tell how many bytes there are in the
encoded character.
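To see the bit pattern concretely (a sketch; C<Encode> is a core module):

    use Encode qw(encode);
    # U+263A WHITE SMILING FACE => 11100010 10011000 10111010
    printf "%02X %02X %02X\n",
        unpack "C*", encode("UTF-8", chr 0x263A);     # E2 98 BA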
=item *
@@ -909,7 +886,7 @@ and the decoding is
$uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
If you try to generate surrogates (for example by using chr()), you
-will get a warning if warnings are turned on, because those code
+will get a warning, if warnings are turned on, because those code
points are not valid for a Unicode character.
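A worked example of the decoding formula, using the surrogate pair for
U+1D11E MUSICAL SYMBOL G CLEF:

    my ($hi, $lo) = (0xD834, 0xDD1E);
    printf "U+%X\n",
        0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);   # U+1D11E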
Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
@@ -933,7 +910,9 @@ The way this trick works is that the character with the code point
C<U+FFFE> is guaranteed not to be a valid Unicode character, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be C<U+FFFE>, represented in big-endian
-format".
+format". (Actually, C<U+FFFE> is legal for use by your program, even for
+input/output, but better not use it if you need a BOM. But it is "illegal for
+interchange", so that an unsuspecting program won't get confused.)
=item *
@@ -964,6 +943,9 @@ transport or storage is not eight-bit safe. Defined by RFC 2152.
=head2 Security Implications of Unicode
+Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
+Also, note the following:
+
=over 4
=item *
@@ -976,7 +958,7 @@ from one input Unicode character. Strictly speaking, the shortest
possible sequence of UTF-8 bytes should be generated,
because otherwise there is potential for an input buffer overflow at
the receiving end of a UTF-8 connection. Perl always generates the
-shortest length UTF-8, and with warnings on Perl will warn about
+shortest length UTF-8, and with warnings on, Perl will warn about
non-shortest length UTF-8 along with other malformations, such as the
surrogates, which are not real Unicode code points.
@@ -1053,12 +1035,13 @@ as Unicode (UTF-8), there still are many places where Unicode (in some
encoding or another) could be given as arguments or received as
results, or both, but it is not.
-The following are such interfaces. For all of these interfaces Perl
+The following are such interfaces. Also, see L</The "Unicode Bug">.
+For all of these interfaces, Perl
currently (as of 5.8.3) simply assumes byte strings both as arguments
and results, or UTF-8 strings if the C<encoding> pragma has been used.
One reason why Perl does not attempt to resolve the role of Unicode in
-this cases is that the answers are highly dependent on the operating
+these cases is that the answers are highly dependent on the operating
system and the file system(s). For example, whether filenames can be
in Unicode, and in exactly what kind of encoding, is not exactly a
portable concept. Similarly for the qx and system: how well will the
@@ -1093,10 +1076,93 @@ readdir, readlink
=back
+=head2 The "Unicode Bug"
+
+The term "Unicode bug" has been applied to an inconsistency with the
+Unicode characters whose code points are in the Latin-1 Supplement block, that
+is, between 128 and 255. Without a locale specified, unlike all other
+characters or code points, these characters behave very differently under
+byte semantics than under character semantics.
+
+In character semantics they are interpreted as Unicode code points, which means
+they have the same semantics as Latin-1 (ISO-8859-1).
+
+In byte semantics, they are considered to be unassigned characters, meaning
+that the only semantics they have is their ordinal numbers, and that they are
+not members of various character classes. None are considered to match C<\w>
+for example, but all match C<\W>. (On EBCDIC platforms, the behavior may
+be different from this, depending on the underlying C language library
+functions.)
+
+This behavior is known to affect these areas:
+
+=over 4
+
+=item *
+
+Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>,
+and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in regular expression
+substitutions.
+
+=item *
+
+Using caseless (C</i>) regular expression matching
+
+=item *
+
+Matching a number of properties in regular expressions, such as C<\w>
+
+=item *
+
+User-defined case change mappings. You can create a C<ToUpper()> function, for
+example, which overrides Perl's built-in case mappings. The scalar must be
+encoded in utf8 for your function to actually be invoked.
+
+=back
+
+This behavior can lead to unexpected results in which a string's semantics
+suddenly change if a code point above 255 is appended to it or removed from
+it, switching the string from byte to character semantics or vice versa. As
+an example, consider the following program and its output:
+
+ $ perl -le'
+ $s1 = "\xC2";
+ $s2 = "\x{2660}";
+ for ($s1, $s2, $s1.$s2) {
+ print /\w/ || 0;
+ }
+ '
+ 0
+ 0
+ 1
+
+If there's no C<\w> in C<$s1> or in C<$s2>, why does their concatenation have
+one?
+
+This anomaly stems from Perl's attempt to not disturb older programs that
+didn't use Unicode, and hence had no semantics for characters outside of the
+ASCII range (except in a locale), along with Perl's desire to add Unicode
+support seamlessly. The result wasn't seamless: these characters were
+orphaned.
+
+Work is being done to correct this, but only some of it was complete in time
+for the 5.12 release. What has been finished is the important part of the case
+changing component. Due to concerns, and some evidence, that older code might
+have come to rely on the existing behavior, the new behavior must be explicitly
+enabled by the feature C<unicode_strings> in the L<feature> pragma, even though
+no new syntax is involved.
+
+See L<perlfunc/lc> for details on how this pragma works in combination with
+various others for casing. Even though the pragma only affects casing
+operations in the 5.12 release, the plan is for it to affect all of the
+problematic behaviors in later releases: you can't opt into one fix without
+eventually getting them all.
+
+In the meantime, a workaround is to always call utf8::upgrade($string), or to
+use the standard modules L<Encode> or L<charnames>.
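+
+A sketch of the workaround, reusing C<$s1> from the example above:
+
+    my $s1 = "\xC2";
+    utf8::upgrade($s1);        # now treated as the character U+00C2
+    print $s1 =~ /\w/ || 0;    # prints 1; before the upgrade it printed 0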
+
=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
-Sometimes (see L</"When Unicode Does Not Happen">) there are
-situations where you simply need to force a byte
+Sometimes (see L</"When Unicode Does Not Happen"> or L</The "Unicode Bug">)
+there are situations where you simply need to force a byte
string into UTF-8, or vice versa. The low-level calls
utf8::upgrade($bytestring) and utf8::downgrade($utf8string[, FAIL_OK]) are
the answers.
@@ -1104,6 +1170,9 @@ the answers.
Note that utf8::downgrade() can fail if the string contains characters
that don't fit into a byte.
+Calling either function on a string that already is in the desired state is a
+no-op.
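+
+For example, a sketch of the C<FAIL_OK> form mentioned above:
+
+    my $s = "\x{100}";               # cannot be represented in a single byte
+    utf8::downgrade($s, 1)           # FAIL_OK: return false instead of dying
+        or warn "string not downgradable to bytes";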
+
=head2 Using Unicode in XS
If you want to handle Perl Unicode in XS extensions, you may find the
@@ -1210,6 +1279,24 @@ comparisons you can just use C<memEQ()> and C<memNE()> as usual.
For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
in the Perl source code distribution.
+=head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only)
+
+Perl by default comes with the latest supported Unicode version built in, but
+you can change it to use any earlier one.
+
+Download the files in the version of Unicode that you want from the Unicode web
+site (L<http://www.unicode.org>). These should replace the existing files in
+C<$Config{privlib}>/F<unicore>. (C<%Config> is available from the Config
+module.) Follow the instructions in F<README.perl> in that directory to change
+some of their names, and then run F<make>.
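+
+To see where that directory is on your system (a minimal sketch using the
+Config module mentioned above):
+
+    use Config;
+    print "$Config{privlib}/unicore\n";   # home of the Unicode data files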
+
+It is even possible to download them to a different directory, and then change
+F<utf8_heavy.pl> in the directory C<$Config{privlib}> to point to the new
+directory.  Or you could make a copy of that directory before making the
+change, and use C<@INC> or the C<-I> run-time flag to switch between versions
+at will (though, because of caching, not in the middle of a process).  All this
+is beyond the scope of these instructions.
+
=head1 BUGS
=head2 Interaction with Locales
@@ -1221,27 +1308,17 @@ use characters above that range when mapped into Unicode. Perl's
Unicode support will also tend to run slower. Use of locales with
Unicode is discouraged.
-=head2 Problems with characters whose ordinal numbers are in the range 128 - 255 with no Locale specified
+=head2 Problems with characters in the C<Latin-1 Supplement> range
-Without a locale specified, unlike all other characters or code points,
-these characters have very different semantics in byte semantics versus
-character semantics.
-In character semantics they are interpreted as Unicode code points, which means
-they are viewed as Latin-1 (ISO-8859-1).
-In byte semantics, they are considered to be unassigned characters,
-meaning that the only semantics they have is their
-ordinal numbers, and that they are not members of various character classes.
-None are considered to match C<\w> for example, but all match C<\W>.
-Besides these class matches,
-the known operations that this affects are those that change the case,
-regular expression matching while ignoring case,
-and B<quotemeta()>.
-This can lead to unexpected results in which a string's semantics suddenly
-change if a code point above 255 is appended to or removed from it,
-which changes the string's semantics from byte to character or vice versa.
-This behavior is scheduled to change in version 5.12, but in the meantime,
-a workaround is to always call utf8::upgrade($string), or to use the
-standard modules L<Encode> or L<charnames>.
+See L</The "Unicode Bug">
+
+=head2 Problems with case-insensitive regular expression matching
+
+There are problems with case-insensitive matches, including those involving
+character classes (enclosed in [square brackets]), characters whose fold
+is to multiple characters (such as the single character C<LATIN SMALL LIGATURE
+FFL>, which matches case-insensitively with the 3-character string C<ffl>), and
+characters in the C<Latin-1 Supplement>.
=head2 Interaction with Extensions
@@ -1323,7 +1400,10 @@ be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are 268 Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<\d>).
-=head2 Possible problems on EBCDIC platforms
+=head2 Problems on EBCDIC platforms
+
+There are a number of known problems with Perl on EBCDIC platforms. If you
+want to use Perl there, send email to perlbug@perl.org.
In earlier versions, when byte and character data were concatenated,
the new string was sometimes created by
@@ -1440,23 +1520,6 @@ the UTF8 flag:
=back
-=head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only)
-
-Perl by default comes with the latest supported Unicode version built in, but
-you can change to use any earlier one.
-
-Download the files in the version of Unicode that you want from the Unicode web
-site L<http://www.unicode.org>). These should replace the existing files in
-C<\$Config{privlib}>/F<unicore>. (C<\%Config> is available from the Config
-module.) Follow the instructions in F<README.perl> in that directory to change
-some of their names, and then run F<make>.
-
-It is even possible to download them to a different directory, and then change
-F<utf8_heavy.pl> in the directory C<\$Config{privlib}> to point to the new
-directory, or maybe make a copy of that directory before making the change, and
-using C<@INC> or the C<-I> run-time flag to switch between versions at will,
-but all this is beyond the scope of these instructions.
-
=head1 SEE ALSO
L<perlunitut>, L<perluniintro>, L<perluniprops>, L<Encode>, L<open>, L<utf8>, L<bytes>,