author: Juerd Waalboer <#####@juerd.nl> 2007-03-04 17:00:19 +0100
committer: H.Merijn Brand <h.m.brand@xs4all.nl> 2007-03-07 13:23:23 +0000
commit: 2575c402a8f9be55f848bdfb219afbf912c50ac1
tree: c21a19c42deaa2dba098c38d74338a7c01328c28 /ext/Encode
parent: 2a6a970fa1b36c99c83fd3fdd48253c1b567db9b
Re: [PATCH] (Re: [PATCH] unicode/utf8 pod)
Message-ID: <20070304150019.GN4723@c4.convolution.nl>
p4raw-id: //depot/perl@30493
Diffstat (limited to 'ext/Encode')
-rw-r--r-- ext/Encode/Encode.pm   52
-rw-r--r-- ext/Encode/encoding.pm  8
2 files changed, 35 insertions, 25 deletions
diff --git a/ext/Encode/Encode.pm b/ext/Encode/Encode.pm
index ae047553bd..bdfa695723 100644
--- a/ext/Encode/Encode.pm
+++ b/ext/Encode/Encode.pm
@@ -406,10 +406,10 @@ iso-8859-1 (also known as Latin1),
$octets = encode("iso-8859-1", $string);
B<CAVEAT>: When you run C<$octets = encode("utf8", $string)>, then $octets
-B<may not be equal to> $string. Though they both contain the same data, the utf8 flag
-for $octets is B<always> off. When you encode anything, utf8 flag of
+B<may not be equal to> $string. Though they both contain the same data, the UTF8 flag
+for $octets is B<always> off. When you encode anything, UTF8 flag of
the result is always off, even when it contains completely valid utf8
-string. See L</"The UTF-8 flag"> below.
+string. See L</"The UTF8 flag"> below.
If the $string is C<undef> then C<undef> is returned.
@@ -427,8 +427,8 @@ For example, to convert ISO-8859-1 data to a string in Perl's internal format:
B<CAVEAT>: When you run C<$string = decode("utf8", $octets)>, then $string
B<may not be equal to> $octets. Though they both contain the same data,
-the utf8 flag for $string is on unless $octets entirely consists of
-ASCII data (or EBCDIC on EBCDIC machines). See L</"The UTF-8 flag">
+the UTF8 flag for $string is on unless $octets entirely consists of
+ASCII data (or EBCDIC on EBCDIC machines). See L</"The UTF8 flag">
below.
If the $string is C<undef> then C<undef> is returned.
@@ -458,11 +458,11 @@ B<CAVEAT>: The following operations look the same but are not quite so;
$data = decode("iso-8859-1", $data); #2
Both #1 and #2 make $data consist of a completely valid UTF-8 string
-but only #2 turns utf8 flag on. #1 is equivalent to
+but only #2 turns UTF8 flag on. #1 is equivalent to
$data = encode("utf8", decode("iso-8859-1", $data));
-See L</"The UTF-8 flag"> below.
+See L</"The UTF8 flag"> below.
=item $octets = encode_utf8($string);
@@ -684,13 +684,13 @@ arguments are taken as aliases for I<$object>.
See L<Encode::Encoding> for more details.
-=head1 The UTF-8 flag
+=head1 The UTF8 flag
-Before the introduction of utf8 support in perl, The C<eq> operator
+Before the introduction of Unicode support in perl, The C<eq> operator
just compared the strings represented by two scalars. Beginning with
-perl 5.8, C<eq> compares two strings with simultaneous consideration
-of I<the utf8 flag>. To explain why we made it so, I will quote page
-402 of C<Programming Perl, 3rd ed.>
+perl 5.8, C<eq> compares two strings with simultaneous consideration of
+I<the UTF8 flag>. To explain why we made it so, I will quote page 402 of
+C<Programming Perl, 3rd ed.>
=over 2
@@ -719,27 +719,27 @@ byte-oriented Perl and a character-oriented Perl.
Back when C<Programming Perl, 3rd ed.> was written, not even Perl 5.6.0
was born and many features documented in the book remained
unimplemented for a long time. Perl 5.8 corrected this and the introduction
-of the UTF-8 flag is one of them. You can think of this perl notion as of a
-byte-oriented mode (utf8 flag off) and a character-oriented mode (utf8
+of the UTF8 flag is one of them. You can think of this perl notion as of a
+byte-oriented mode (UTF8 flag off) and a character-oriented mode (UTF8
flag on).
-Here is how Encode takes care of the utf8 flag.
+Here is how Encode takes care of the UTF8 flag.
=over 2
=item *
-When you encode, the resulting utf8 flag is always off.
+When you encode, the resulting UTF8 flag is always off.
=item *
-When you decode, the resulting utf8 flag is on unless you can
+When you decode, the resulting UTF8 flag is on unless you can
unambiguously represent data. Here is the definition of
dis-ambiguity.
After C<$utf8 = decode('foo', $octet);>,
- When $octet is... The utf8 flag in $utf8 is
+ When $octet is... The UTF8 flag in $utf8 is
---------------------------------------------
In ASCII only (or EBCDIC only) OFF
In ISO-8859-1 ON
@@ -750,7 +750,7 @@ As you see, there is one exception, In ASCII. That way you can assume
Goal #1. And with Encode Goal #2 is assumed but you still have to be
careful in such cases mentioned in B<CAVEAT> paragraphs.
-This utf8 flag is not visible in perl scripts, exactly for the same
+This UTF8 flag is not visible in perl scripts, exactly for the same
reason you cannot (or you I<don't have to>) see if a scalar contains a
string, integer, or floating point number. But you can still peek
and poke these if you will. See the section below.
@@ -766,7 +766,7 @@ implementation. As such, they are efficient but may change.
=item is_utf8(STRING [, CHECK])
-[INTERNAL] Tests whether the UTF-8 flag is turned on in the STRING.
+[INTERNAL] Tests whether the UTF8 flag is turned on in the STRING.
If CHECK is true, also checks the data in STRING for being well-formed
UTF-8. Returns true if successful, false otherwise.
@@ -774,22 +774,22 @@ As of perl 5.8.1, L<utf8> also has utf8::is_utf8().
=item _utf8_on(STRING)
-[INTERNAL] Turns on the UTF-8 flag in STRING. The data in STRING is
+[INTERNAL] Turns on the UTF8 flag in STRING. The data in STRING is
B<not> checked for being well-formed UTF-8. Do not use unless you
B<know> that the STRING is well-formed UTF-8. Returns the previous
-state of the UTF-8 flag (so please don't treat the return value as
+state of the UTF8 flag (so please don't treat the return value as
indicating success or failure), or C<undef> if STRING is not a string.
=item _utf8_off(STRING)
-[INTERNAL] Turns off the UTF-8 flag in STRING. Do not use frivolously.
-Returns the previous state of the UTF-8 flag (so please don't treat the
+[INTERNAL] Turns off the UTF8 flag in STRING. Do not use frivolously.
+Returns the previous state of the UTF8 flag (so please don't treat the
return value as indicating success or failure), or C<undef> if STRING is
not a string.
=back
-=head1 UTF-8 vs. utf8
+=head1 UTF-8 vs. utf8 vs. UTF8
....We now view strings not as sequences of bytes, but as sequences
of numbers in the range 0 .. 2**32-1 (or in the case of 64-bit
@@ -836,6 +836,8 @@ goes "liberal"
find_encoding("utf_8")->name # ditto. "_" are treated as "-"
find_encoding("UTF8")->name # is 'utf8'.
+The UTF8 flag is internally called UTF8, without a hyphen. It indicates
+whether a string is internally encoded as utf8, also without a hypen.
=head1 SEE ALSO
diff --git a/ext/Encode/encoding.pm b/ext/Encode/encoding.pm
index eb84e481f1..1f418e3a22 100644
--- a/ext/Encode/encoding.pm
+++ b/ext/Encode/encoding.pm
@@ -307,6 +307,14 @@ Will print C<2>, because C<$string> is upgraded as UTF-8. Without
C<use encoding 'utf8';>, it will print C<4> instead, since C<$string>
is three octets when interpreted as Latin-1.
+=head2 Side effects
+
+If the C<encoding> pragma is in scope then the lengths returned are
+calculated from the length of C<$/> in Unicode characters, which is not
+always the same as the length of C<$/> in the native encoding.
+
+This pragma affects utf8::upgrade, but not utf8::downgrade.
+
=head1 FEATURES THAT REQUIRE 5.8.1
Some of the features offered by this pragma requires perl 5.8.1. Most