path: root/pod/perlunicode.pod
author: Juerd Waalboer <#####@juerd.nl> 2007-03-04 17:00:19 +0100
committer: H.Merijn Brand <h.m.brand@xs4all.nl> 2007-03-07 13:23:23 +0000
commit: 2575c402a8f9be55f848bdfb219afbf912c50ac1 (patch)
tree: c21a19c42deaa2dba098c38d74338a7c01328c28 /pod/perlunicode.pod
parent: 2a6a970fa1b36c99c83fd3fdd48253c1b567db9b (diff)
download: perl-2575c402a8f9be55f848bdfb219afbf912c50ac1.tar.gz
Re: [PATCH] (Re: [PATCH] unicode/utf8 pod)
Message-ID: <20070304150019.GN4723@c4.convolution.nl> p4raw-id: //depot/perl@30493
Diffstat (limited to 'pod/perlunicode.pod')
-rw-r--r-- pod/perlunicode.pod 62
1 files changed, 25 insertions, 37 deletions
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 1a49f04687..c913047099 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -10,6 +10,10 @@ Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.
+People who want to learn to use Unicode in Perl should probably read
+L<the Perl Unicode tutorial|perlunitut> before reading this reference
+document.
+
=over 4
=item Input and Output Layers
@@ -20,15 +24,15 @@ the ":utf8" layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See L<open>.
-To indicate that Perl source itself is using a particular encoding,
-see L<encoding>.
+To indicate that Perl source itself is in UTF-8, use C<use utf8;>.
=item Regular Expressions
The regular expression compiler produces polymorphic opcodes. That is,
the pattern adapts to the data and automatically switches to the Unicode
-character scheme when presented with Unicode data--or instead uses
-a traditional byte scheme when presented with byte data.
+character scheme when presented with data that is internally encoded in
+UTF-8 -- or instead uses a traditional byte scheme when presented with
+byte data.
=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts
@@ -39,9 +43,6 @@ ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines. B<These are the only times when an explicit C<use utf8>
is needed.> See L<utf8>.
-You can also use the C<encoding> pragma to change the default encoding
-of the data in your script; see L<encoding>.
-
=item BOM-marked scripts and UTF-16 scripts autodetected
If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE,
@@ -58,11 +59,6 @@ they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
codepoints in Unicode happen to agree with Latin-1.
-If you wish to interpret byte strings as UTF-8 instead, use the
-C<encoding> pragma:
-
- use encoding 'utf8';
-
See L</"Byte and Character Semantics"> for more details.
=back
@@ -112,9 +108,7 @@ If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will be created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC. This translation is done without
-regard to the system's native 8-bit encoding. To change this for
-systems with non-Latin-1 and non-EBCDIC native encodings, use the
-C<encoding> pragma. See L<encoding>.
+regard to the system's native 8-bit encoding.
Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
@@ -134,17 +128,16 @@ Character semantics have the following effects:
Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.
-If you use a Unicode editor to edit your program, Unicode characters
-may occur directly within the literal strings in one of the various
-Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized
-as such and converted to Perl's internal representation only if the
-appropriate L<encoding> is specified.
+If you use a Unicode editor to edit your program, Unicode characters may
+occur directly within the literal strings in UTF-8 encoding, or UTF-16.
+(The former requires a BOM or C<use utf8>, the latter requires a BOM.)
-Unicode characters can also be added to a string by using the
-C<\x{...}> notation. The Unicode code for the desired character, in
-hexadecimal, should be placed in the braces. For instance, a smiley
-face is C<\x{263A}>. This encoding scheme only works for characters
-with a code of 0x100 or above.
+Unicode characters can also be added to a string by using the C<\x{...}>
+notation. The Unicode code for the desired character, in hexadecimal,
+should be placed in the braces. For instance, a smiley face is
+C<\x{263A}>. This encoding scheme works for all characters, but
+for characters under 0x100, note that Perl may use an 8-bit encoding
+internally, for optimization and/or backward compatibility.
Additionally, if you
@@ -163,8 +156,7 @@ names.
=item *
Regular expressions match characters instead of bytes. "." matches
-a character instead of a byte. The C<\C> pattern is provided to force
-a match a single byte--a C<char> in C, hence C<\C>.
+a character instead of a byte.
=item *
@@ -173,17 +165,13 @@ bytes and match against the character properties specified in the
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.
-(However, and as a limitation of the current implementation, using
-C<\w> or C<\W> I<inside> a C<[...]> character class will still match
-with byte semantics.)
-
=item *
Named Unicode properties, scripts, and block ranges may be used like
character classes via the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".
-See L</"Unicode Character Properties"> for more details.
+See L</"Unicode Character Properties"> for more details.
You can define your own character properties and use them
in the regular expression with the C<\p{}> or C<\P{}> construct.
@@ -1441,7 +1429,7 @@ Unicode is discouraged.
=head2 Interaction with Extensions
When Perl exchanges data with an extension, the extension should be
-able to understand the UTF-8 flag and act accordingly. If the
+able to understand the UTF8 flag and act accordingly. If the
extension doesn't know about the flag, it's likely that the extension
will return incorrectly-flagged data.
@@ -1544,7 +1532,7 @@ A scalar that is going to be passed to some extension
Be it Compress::Zlib, Apache::Request or any extension that has no
mention of Unicode in the manpage, you need to make sure that the
-UTF-8 flag is stripped off. Note that at the time of this writing
+UTF8 flag is stripped off. Note that at the time of this writing
(October 2002) the mentioned modules are not UTF-8-aware. Please
check the documentation to verify if this is still true.
@@ -1558,7 +1546,7 @@ check the documentation to verify if this is still true.
A scalar we got back from an extension
If you believe the scalar comes back as UTF-8, you will most likely
-want the UTF-8 flag restored:
+want the UTF8 flag restored:
if ($] > 5.007) {
require Encode;
@@ -1620,7 +1608,7 @@ A large scalar that you know can only contain ASCII
Scalars that contain only ASCII and are marked as UTF-8 are sometimes
a drag to your program. If you recognize such a situation, just remove
-the UTF-8 flag:
+the UTF8 flag:
utf8::downgrade($val) if $] > 5.007;
@@ -1628,7 +1616,7 @@ the UTF-8 flag:
=head1 SEE ALSO
-L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>,
+L<perlunitut>, L<perluniintro>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlvar/"${^UNICODE}">
=cut
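
Reader's note, not part of the patch: the behaviour described by the patched text can be checked with a short script. This is a minimal sketch and assumes a reasonably modern perl (5.8 or later) on an ASCII-based platform. The variable names are illustrative and do not come from the patch. The UTF8 flag of the C<\x{E9}> string is printed but not asserted, because the patched text only says Perl "may" use an 8-bit encoding internally.

```perl
use strict;
use warnings;

# \x{...} with a code point above 0xFF: one character, stored as UTF-8 internally.
my $smiley = "\x{263A}";

# \x{...} also works below 0x100; Perl may keep such a string in an 8-bit form.
my $e_acute = "\x{E9}";

# A byte string holding the two UTF-8 octets of U+00E9; nothing has decoded it.
my $bytes = "\xC3\xA9";

# Concatenating with a character string decodes the byte string as Latin-1,
# so the two octets become two characters, not a single e-acute.
my $joined = $smiley . $bytes;

# Decoding the octets as UTF-8 must be requested explicitly.
my $decoded = $bytes;
utf8::decode($decoded);

# Removing the UTF8 flag from an ASCII-only scalar leaves its value unchanged.
my $ascii = "abc";
utf8::upgrade($ascii);      # flag on
utf8::downgrade($ascii);    # flag off again

# Under character semantics \w matches a Japanese ideograph (here U+65E5).
my $ideograph = "\x{65E5}";
my $is_word   = $ideograph =~ /^\w$/ ? 1 : 0;

printf "smiley:  length=%d flag=%d\n", length($smiley),  utf8::is_utf8($smiley)  ? 1 : 0;
printf "e-acute: length=%d flag=%d\n", length($e_acute), utf8::is_utf8($e_acute) ? 1 : 0;
printf "joined:  length=%d\n", length($joined);
printf "decoded: length=%d ord=0x%X\n", length($decoded), ord($decoded);
printf "ascii:   flag=%d value=%s\n", utf8::is_utf8($ascii) ? 1 : 0, $ascii;
printf "ideograph matches \\w: %d\n", $is_word;
```

The script prints only lengths, flags and code points, so it needs no output layer and triggers no "Wide character" warnings.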