author | Audrey Tang <cpan@audreyt.org> | 2003-12-10 04:39:16 +0800 |
---|---|---|
committer | Rafael Garcia-Suarez <rgarciasuarez@gmail.com> | 2003-12-09 21:33:22 +0000 |
commit | 990e18f721a7d2ee48d50ea4262bd5d109e9f89c (patch) | |
tree | 4334f1521da4188b6bebdae304b622e37030f8ca /pod/perlunicode.pod | |
parent | 4e2344ada78d8742c0023d545c1baed6597bae39 (diff) | |
download | perl-990e18f721a7d2ee48d50ea4262bd5d109e9f89c.tar.gz |
Implicit upgrading docs
Message-ID: <20031209123915.GA1454@not.autrijus.org>
p4raw-id: //depot/perl@21873
Diffstat (limited to 'pod/perlunicode.pod')
-rw-r--r-- | pod/perlunicode.pod | 27 |
1 files changed, 21 insertions, 6 deletions
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 190247aea7..b6d00d1660 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -42,6 +42,21 @@ is needed.> See L<utf8>.
 You can also use the C<encoding> pragma to change the default encoding
 of the data in your script; see L<encoding>.
 
+=item C<use encoding> needed to upgrade non-Latin-1 byte strings
+
+By default, there is a fundamental asymmetry in Perl's unicode model:
+implicit upgrading from byte strings to Unicode strings assumes that
+they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
+downgraded with UTF-8 encoding. This happens because the first 256
+codepoints in Unicode happens to agree with Latin-1.
+
+If you wish to interpret byte strings as UTF-8 instead, use the
+C<encoding> pragma:
+
+    use encoding 'utf8';
+
+See L</"Byte and Character Semantics"> for more details.
+
 =back
 
 =head2 Byte and Character Semantics
@@ -86,12 +101,12 @@ Otherwise, byte semantics are in effect. The C<bytes> pragma should be
 used to force byte semantics on Unicode data.
 
 If strings operating under byte semantics and strings with Unicode
-character data are concatenated, the new string will be upgraded to
-I<ISO 8859-1 (Latin-1)>, even if the old Unicode string used EBCDIC.
-This translation is done without regard to the system's native 8-bit
-encoding, so to change this for systems with non-Latin-1 and
-non-EBCDIC native encodings use the C<encoding> pragma. See
-L<encoding>.
+character data are concatenated, the new string will be created by
+decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
+old Unicode string used EBCDIC. This translation is done without
+regard to the system's native 8-bit encoding. To change this for
+systems with non-Latin-1 and non-EBCDIC native encodings, use the
+C<encoding> pragma. See L<encoding>.
 
 Under character semantics, many operations that formerly operated on
 bytes now operate on characters. A character in Perl is
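The behaviour the patch documents can be observed with a short script. This is only a sketch, not part of the commit: it assumes an ASCII (non-EBCDIC) platform, the variable names are made up for illustration, and it uses an explicit Encode::decode call instead of the C<encoding> pragma the patch recommends, so that the decoding step is visible in the code.

```perl
use strict;
use warnings;
use Encode qw(decode);

# A byte string: "\xE9" is one byte, which is e-acute when read as Latin-1.
my $bytes = "caf\xE9";

# A character string holding a code point above 0xFF.
my $chars = "\x{263A}";

# Concatenation implicitly upgrades $bytes, decoding it as Latin-1.
my $joined = $bytes . $chars;
printf "length: %d\n", length($joined);                   # length: 5
printf "4th char: U+%04X\n", ord(substr($joined, 3, 1));  # 4th char: U+00E9

# The same text stored as UTF-8 bytes is still decoded as Latin-1,
# so the two bytes C3 A9 become two separate characters.
my $utf8_bytes = "caf\xC3\xA9";
my $mojibake   = $utf8_bytes . $chars;
printf "implicit: %d chars\n", length($mojibake);         # implicit: 6 chars

# Decoding explicitly as UTF-8 before concatenating gives the intended text.
my $decoded = decode('UTF-8', $utf8_bytes) . $chars;
printf "explicit: %d chars\n", length($decoded);          # explicit: 5 chars
```

Only numbers are printed, so the script emits no "Wide character" warnings; printing the strings themselves would need an output layer such as binmode(STDOUT, ':encoding(UTF-8)').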