author | Juerd Waalboer <#####@juerd.nl> | 2007-03-04 17:00:19 +0100 |
---|---|---|
committer | H.Merijn Brand <h.m.brand@xs4all.nl> | 2007-03-07 13:23:23 +0000 |
commit | 2575c402a8f9be55f848bdfb219afbf912c50ac1 (patch) | |
tree | c21a19c42deaa2dba098c38d74338a7c01328c28 /ext/Encode | |
parent | 2a6a970fa1b36c99c83fd3fdd48253c1b567db9b (diff) | |
download | perl-2575c402a8f9be55f848bdfb219afbf912c50ac1.tar.gz |
Re: [PATCH] (Re: [PATCH] unicode/utf8 pod)
Message-ID: <20070304150019.GN4723@c4.convolution.nl>
p4raw-id: //depot/perl@30493
Diffstat (limited to 'ext/Encode')
-rw-r--r-- | ext/Encode/Encode.pm | 52 |
-rw-r--r-- | ext/Encode/encoding.pm | 8 |
2 files changed, 35 insertions, 25 deletions
diff --git a/ext/Encode/Encode.pm b/ext/Encode/Encode.pm
index ae047553bd..bdfa695723 100644
--- a/ext/Encode/Encode.pm
+++ b/ext/Encode/Encode.pm
@@ -406,10 +406,10 @@ iso-8859-1 (also known as Latin1),
 $octets = encode("iso-8859-1", $string);

 B<CAVEAT>: When you run C<$octets = encode("utf8", $string)>, then $octets
-B<may not be equal to> $string. Though they both contain the same data, the utf8 flag
-for $octets is B<always> off. When you encode anything, utf8 flag of
+B<may not be equal to> $string. Though they both contain the same data, the UTF8 flag
+for $octets is B<always> off. When you encode anything, UTF8 flag of
 the result is always off, even when it contains completely valid utf8
-string. See L</"The UTF-8 flag"> below.
+string. See L</"The UTF8 flag"> below.

 If the $string is C<undef> then C<undef> is returned.

@@ -427,8 +427,8 @@ For example, to convert ISO-8859-1 data to a string in Perl's internal format:

 B<CAVEAT>: When you run C<$string = decode("utf8", $octets)>, then $string
 B<may not be equal to> $octets. Though they both contain the same data,
-the utf8 flag for $string is on unless $octets entirely consists of
-ASCII data (or EBCDIC on EBCDIC machines). See L</"The UTF-8 flag">
+the UTF8 flag for $string is on unless $octets entirely consists of
+ASCII data (or EBCDIC on EBCDIC machines). See L</"The UTF8 flag">
 below.

 If the $string is C<undef> then C<undef> is returned.
@@ -458,11 +458,11 @@ B<CAVEAT>: The following operations look the same but are not quite so;
 $data = decode("iso-8859-1", $data); #2

 Both #1 and #2 make $data consist of a completely valid UTF-8 string
-but only #2 turns utf8 flag on. #1 is equivalent to
+but only #2 turns UTF8 flag on. #1 is equivalent to

 $data = encode("utf8", decode("iso-8859-1", $data));

-See L</"The UTF-8 flag"> below.
+See L</"The UTF8 flag"> below.

 =item $octets = encode_utf8($string);

@@ -684,13 +684,13 @@ arguments are taken as aliases for I<$object>.

 See L<Encode::Encoding> for more details.

-=head1 The UTF-8 flag
+=head1 The UTF8 flag

-Before the introduction of utf8 support in perl, The C<eq> operator
+Before the introduction of Unicode support in perl, The C<eq> operator
 just compared the strings represented by two scalars. Beginning with
-perl 5.8, C<eq> compares two strings with simultaneous consideration
-of I<the utf8 flag>. To explain why we made it so, I will quote page
-402 of C<Programming Perl, 3rd ed.>
+perl 5.8, C<eq> compares two strings with simultaneous consideration of
+I<the UTF8 flag>. To explain why we made it so, I will quote page 402 of
+C<Programming Perl, 3rd ed.>

 =over 2

@@ -719,27 +719,27 @@ byte-oriented Perl and a character-oriented Perl.
 Back when C<Programming Perl, 3rd ed.> was written, not even Perl 5.6.0
 was born and many features documented in the book remained
 unimplemented for a long time. Perl 5.8 corrected this and the introduction
-of the UTF-8 flag is one of them. You can think of this perl notion as of a
-byte-oriented mode (utf8 flag off) and a character-oriented mode (utf8
+of the UTF8 flag is one of them. You can think of this perl notion as of a
+byte-oriented mode (UTF8 flag off) and a character-oriented mode (UTF8
 flag on).

-Here is how Encode takes care of the utf8 flag.
+Here is how Encode takes care of the UTF8 flag.

 =over 2

 =item *

-When you encode, the resulting utf8 flag is always off.
+When you encode, the resulting UTF8 flag is always off.

 =item *

-When you decode, the resulting utf8 flag is on unless you can
+When you decode, the resulting UTF8 flag is on unless you can
 unambiguously represent data. Here is the definition of
 dis-ambiguity.

 After C<$utf8 = decode('foo', $octet);>,

- When $octet is... The utf8 flag in $utf8 is
+ When $octet is... The UTF8 flag in $utf8 is
 ---------------------------------------------
 In ASCII only (or EBCDIC only) OFF
 In ISO-8859-1 ON
@@ -750,7 +750,7 @@ As you see, there is one exception, In ASCII. That way you can assume
 Goal #1. And with Encode Goal #2 is assumed but you still have to be
 careful in such cases mentioned in B<CAVEAT> paragraphs.

-This utf8 flag is not visible in perl scripts, exactly for the same
+This UTF8 flag is not visible in perl scripts, exactly for the same
 reason you cannot (or you I<don't have to>) see if a scalar contains a
 string, integer, or floating point number. But you can still peek
 and poke these if you will. See the section below.
@@ -766,7 +766,7 @@ implementation. As such, they are efficient but may change.

 =item is_utf8(STRING [, CHECK])

-[INTERNAL] Tests whether the UTF-8 flag is turned on in the STRING.
+[INTERNAL] Tests whether the UTF8 flag is turned on in the STRING.
 If CHECK is true, also checks the data in STRING for being well-formed
 UTF-8. Returns true if successful, false otherwise.

@@ -774,22 +774,22 @@ As of perl 5.8.1, L<utf8> also has utf8::is_utf8().

 =item _utf8_on(STRING)

-[INTERNAL] Turns on the UTF-8 flag in STRING. The data in STRING is
+[INTERNAL] Turns on the UTF8 flag in STRING. The data in STRING is
 B<not> checked for being well-formed UTF-8. Do not use unless you
 B<know> that the STRING is well-formed UTF-8. Returns the previous
-state of the UTF-8 flag (so please don't treat the return value as
+state of the UTF8 flag (so please don't treat the return value as
 indicating success or failure), or C<undef> if STRING is not a string.

 =item _utf8_off(STRING)

-[INTERNAL] Turns off the UTF-8 flag in STRING. Do not use frivolously.
-Returns the previous state of the UTF-8 flag (so please don't treat the
+[INTERNAL] Turns off the UTF8 flag in STRING. Do not use frivolously.
+Returns the previous state of the UTF8 flag (so please don't treat the
 return value as indicating success or failure), or C<undef> if STRING is
 not a string.

 =back

-=head1 UTF-8 vs. utf8
+=head1 UTF-8 vs. utf8 vs. UTF8

 ....We now view strings not as sequences of bytes, but as sequences
 of numbers in the range 0 .. 2**32-1 (or in the case of 64-bit
@@ -836,6 +836,8 @@ goes "liberal"
 find_encoding("utf_8")->name # ditto. "_" are treated as "-"
 find_encoding("UTF8")->name # is 'utf8'.

+The UTF8 flag is internally called UTF8, without a hyphen. It indicates
+whether a string is internally encoded as utf8, also without a hypen.

 =head1 SEE ALSO

diff --git a/ext/Encode/encoding.pm b/ext/Encode/encoding.pm
index eb84e481f1..1f418e3a22 100644
--- a/ext/Encode/encoding.pm
+++ b/ext/Encode/encoding.pm
@@ -307,6 +307,14 @@ Will print C<2>, because C<$string> is upgraded as UTF-8. Without
 C<use encoding 'utf8';>, it will print C<4> instead, since C<$string>
 is three octets when interpreted as Latin-1.

+=head2 Side effects
+
+If the C<encoding> pragma is in scope then the lengths returned are
+calculated from the length of C<$/> in Unicode characters, which is not
+always the same as the length of C<$/> in the native encoding.
+
+This pragma affects utf8::upgrade, but not utf8::downgrade.
+
 =head1 FEATURES THAT REQUIRE 5.8.1

 Some of the features offered by this pragma requires perl 5.8.1. Most