author | Juerd Waalboer <#####@juerd.nl> | 2007-03-04 17:00:19 +0100 |
---|---|---|
committer | H.Merijn Brand <h.m.brand@xs4all.nl> | 2007-03-07 13:23:23 +0000 |
commit | 2575c402a8f9be55f848bdfb219afbf912c50ac1 (patch) | |
tree | c21a19c42deaa2dba098c38d74338a7c01328c28 /pod/perlunicode.pod | |
parent | 2a6a970fa1b36c99c83fd3fdd48253c1b567db9b (diff) | |
download | perl-2575c402a8f9be55f848bdfb219afbf912c50ac1.tar.gz |
Re: [PATCH] (Re: [PATCH] unicode/utf8 pod)
Message-ID: <20070304150019.GN4723@c4.convolution.nl>
p4raw-id: //depot/perl@30493
Diffstat (limited to 'pod/perlunicode.pod')
-rw-r--r-- | pod/perlunicode.pod | 62 |
1 files changed, 25 insertions, 37 deletions
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 1a49f04687..c913047099 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -10,6 +10,10 @@ Unicode support is an extensive requirement. While Perl does not
 implement the Unicode standard or the accompanying technical reports
 from cover to cover, Perl does support many Unicode features.
 
+People who want to learn to use Unicode in Perl, should probably read
+L<the Perl Unicode tutorial|perlunitut> before reading this reference
+document.
+
 =over 4
 
 =item Input and Output Layers
@@ -20,15 +24,15 @@ the ":utf8" layer. Other encodings can be converted to Perl's
 encoding on input or from Perl's encoding on output by use of the
 ":encoding(...)" layer. See L<open>.
 
-To indicate that Perl source itself is using a particular encoding,
-see L<encoding>.
+To indicate that Perl source itself is in UTF-8, use C<use utf8;>.
 
 =item Regular Expressions
 
 The regular expression compiler produces polymorphic opcodes. That is,
 the pattern adapts to the data and automatically switches to the Unicode
-character scheme when presented with Unicode data--or instead uses
-a traditional byte scheme when presented with byte data.
+character scheme when presented with data that is internally encoded in
+UTF-8 -- or instead uses a traditional byte scheme when presented with
+byte data.
 
 =item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts
 
@@ -39,9 +43,6 @@ ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
 machines. B<These are the only times when an explicit C<use utf8>
 is needed.> See L<utf8>.
 
-You can also use the C<encoding> pragma to change the default encoding
-of the data in your script; see L<encoding>.
-
 =item BOM-marked scripts and UTF-16 scripts autodetected
 
 If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF16-BE,
@@ -58,11 +59,6 @@ they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
 downgraded with UTF-8 encoding. This happens because the first 256
 codepoints in Unicode happens to agree with Latin-1.
 
-If you wish to interpret byte strings as UTF-8 instead, use the
-C<encoding> pragma:
-
-    use encoding 'utf8';
-
 See L</"Byte and Character Semantics"> for more details.
 
 =back
@@ -112,9 +108,7 @@ If strings operating under byte semantics and strings with Unicode
 character data are concatenated, the new string will be created by
 decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
 old Unicode string used EBCDIC. This translation is done without
-regard to the system's native 8-bit encoding. To change this for
-systems with non-Latin-1 and non-EBCDIC native encodings, use the
-C<encoding> pragma. See L<encoding>.
+regard to the system's native 8-bit encoding.
 
 Under character semantics, many operations that formerly operated on
 bytes now operate on characters. A character in Perl is
@@ -134,17 +128,16 @@ Character semantics have the following effects:
 Strings--including hash keys--and regular expression patterns may
 contain characters that have an ordinal value larger than 255.
 
-If you use a Unicode editor to edit your program, Unicode characters
-may occur directly within the literal strings in one of the various
-Unicode encodings (UTF-8, UTF-EBCDIC, UCS-2, etc.), but will be recognized
-as such and converted to Perl's internal representation only if the
-appropriate L<encoding> is specified.
+If you use a Unicode editor to edit your program, Unicode characters may
+occur directly within the literal strings in UTF-8 encoding, or UTF-16.
+(The former requires a BOM or C<use utf8>, the latter requires a BOM.)
 
-Unicode characters can also be added to a string by using the
-C<\x{...}> notation. The Unicode code for the desired character, in
-hexadecimal, should be placed in the braces. For instance, a smiley
-face is C<\x{263A}>. This encoding scheme only works for characters
-with a code of 0x100 or above.
+Unicode characters can also be added to a string by using the C<\x{...}>
+notation. The Unicode code for the desired character, in hexadecimal,
+should be placed in the braces. For instance, a smiley face is
+C<\x{263A}>. This encoding scheme only works for all characters, but
+for characters under 0x100, note that Perl may use an 8 bit encoding
+internally, for optimization and/or backward compatibility.
 
 Additionally, if you
 
@@ -163,8 +156,7 @@ names.
 =item *
 
 Regular expressions match characters instead of bytes. "." matches
-a character instead of a byte. The C<\C> pattern is provided to force
-a match a single byte--a C<char> in C, hence C<\C>.
+a character instead of a byte.
 
 =item *
 
@@ -173,17 +165,13 @@ bytes and match against the character properties specified in the
 Unicode properties database. C<\w> can be used to match a Japanese
 ideograph, for instance.
 
-(However, and as a limitation of the current implementation, using
-C<\w> or C<\W> I<inside> a C<[...]> character class will still match
-with byte semantics.)
-
 =item *
 
 Named Unicode properties, scripts, and block ranges may be used like
 character classes via the C<\p{}> "matches property" construct and
 the C<\P{}> negation, "doesn't match property".
 
-See L</"Unicode Character Properties"> for more details.
+See L</"Unicode Character Properties"> for more details.
 
 You can define your own character properties and use them
 in the regular expression with the C<\p{}> or C<\P{}> construct.
@@ -1441,7 +1429,7 @@ Unicode is discouraged.
 =head2 Interaction with Extensions
 
 When Perl exchanges data with an extension, the extension should be
-able to understand the UTF-8 flag and act accordingly. If the
+able to understand the UTF8 flag and act accordingly. If the
 extension doesn't know about the flag, it's likely that the extension
 will return incorrectly-flagged data.
 
@@ -1544,7 +1532,7 @@ A scalar that is going to be passed to some extension
 
 Be it Compress::Zlib, Apache::Request or any extension that has no
 mention of Unicode in the manpage, you need to make sure that the
-UTF-8 flag is stripped off. Note that at the time of this writing
+UTF8 flag is stripped off. Note that at the time of this writing
 (October 2002) the mentioned modules are not UTF-8-aware. Please
 check the documentation to verify if this is still true.
 
@@ -1558,7 +1546,7 @@ check the documentation to verify if this is still true.
 A scalar we got back from an extension
 
 If you believe the scalar comes back as UTF-8, you will most likely
-want the UTF-8 flag restored:
+want the UTF8 flag restored:
 
   if ($] > 5.007) {
     require Encode;
@@ -1620,7 +1608,7 @@ A large scalar that you know can only contain ASCII
 
 Scalars that contain only ASCII and are marked as UTF-8 are sometimes
 a drag to your program. If you recognize such a situation, just remove
-the UTF-8 flag:
+the UTF8 flag:
 
   utf8::downgrade($val) if $] > 5.007;
 
@@ -1628,7 +1616,7 @@ the UTF-8 flag:
 
 =head1 SEE ALSO
 
-L<perluniintro>, L<encoding>, L<Encode>, L<open>, L<utf8>, L<bytes>,
+L<perlunitut>, L<perluniintro>, L<Encode>, L<open>, L<utf8>, L<bytes>,
 L<perlretut>, L<perlvar/"${^UNICODE}">
 
 =cut