Re: [PATCH] (Re: [PATCH] unicode/utf8 pod)

Message-ID: <20070304150019.GN4723@c4.convolution.nl>
p4raw-id: //depot/perl@30493
author: Juerd Waalboer <#####@juerd.nl> 2007-03-04 17:00:19 +0100
committer: H.Merijn Brand <h.m.brand@xs4all.nl> 2007-03-07 13:23:23 +0000
commit: 2575c402a8f9be55f848bdfb219afbf912c50ac1 (patch)
tree: c21a19c42deaa2dba098c38d74338a7c01328c28 /ext/Encode
parent: 2a6a970fa1b36c99c83fd3fdd48253c1b567db9b (diff)
download: perl-2575c402a8f9be55f848bdfb219afbf912c50ac1.tar.gz
2 files changed, 35 insertions, 25 deletions
diff --git a/ext/Encode/Encode.pm b/ext/Encode/Encode.pm
index ae047553bd..bdfa695723 100644
--- a/ext/Encode/Encode.pm
+++ b/ext/Encode/Encode.pm
@@ -406,10 +406,10 @@ iso-8859-1 (also known as Latin1),
   $octets = encode("iso-8859-1", $string);
 
 B<CAVEAT>: When you run C<$octets = encode("utf8", $string)>, then $octets
-B<may not be equal to> $string.  Though they both contain the same data, the utf8 flag
-for $octets is B<always> off.  When you encode anything, utf8 flag of
+B<may not be equal to> $string.  Though they both contain the same data, the UTF8 flag
+for $octets is B<always> off.  When you encode anything, UTF8 flag of
 the result is always off, even when it contains completely valid utf8
-string. See L</"The UTF-8 flag"> below.
+string. See L</"The UTF8 flag"> below.
 
 If the $string is C<undef> then C<undef> is returned.
 
@@ -427,8 +427,8 @@ For example, to convert ISO-8859-1 data to a string in Perl's internal format:
 
 B<CAVEAT>: When you run C<$string = decode("utf8", $octets)>, then $string
 B<may not be equal to> $octets.  Though they both contain the same data,
-the utf8 flag for $string is on unless $octets entirely consists of
-ASCII data (or EBCDIC on EBCDIC machines).  See L</"The UTF-8 flag">
+the UTF8 flag for $string is on unless $octets entirely consists of
+ASCII data (or EBCDIC on EBCDIC machines).  See L</"The UTF8 flag">
 below.
 
 If the $string is C<undef> then C<undef> is returned.
@@ -458,11 +458,11 @@ B<CAVEAT>: The following operations look the same but are not quite so;
   $data = decode("iso-8859-1", $data);  #2
 
 Both #1 and #2 make $data consist of a completely valid UTF-8 string
-but only #2 turns utf8 flag on.  #1 is equivalent to
+but only #2 turns UTF8 flag on.  #1 is equivalent to
 
   $data = encode("utf8", decode("iso-8859-1", $data));
 
-See L</"The UTF-8 flag"> below.
+See L</"The UTF8 flag"> below.
 
 =item $octets = encode_utf8($string);
 
@@ -684,13 +684,13 @@ arguments are taken as aliases for I<$object>.
 
 See L<Encode::Encoding> for more details.
 
-=head1 The UTF-8 flag
+=head1 The UTF8 flag
 
-Before the introduction of utf8 support in perl, The C<eq> operator
+Before the introduction of Unicode support in perl, The C<eq> operator
 just compared the strings represented by two scalars. Beginning with
-perl 5.8, C<eq> compares two strings with simultaneous consideration
-of I<the utf8 flag>. To explain why we made it so, I will quote page
-402 of C<Programming Perl, 3rd ed.>
+perl 5.8, C<eq> compares two strings with simultaneous consideration of
+I<the UTF8 flag>. To explain why we made it so, I will quote page 402 of
+C<Programming Perl, 3rd ed.>
 
 =over 2
 
@@ -719,27 +719,27 @@ byte-oriented Perl and a character-oriented Perl.
 Back when C<Programming Perl, 3rd ed.> was written, not even Perl 5.6.0
 was born and many features documented in the book remained
 unimplemented for a long time.  Perl 5.8 corrected this and the introduction
-of the UTF-8 flag is one of them.  You can think of this perl notion as of a
-byte-oriented mode (utf8 flag off) and a character-oriented mode (utf8
+of the UTF8 flag is one of them.  You can think of this perl notion as of a
+byte-oriented mode (UTF8 flag off) and a character-oriented mode (UTF8
 flag on).
 
-Here is how Encode takes care of the utf8 flag.
+Here is how Encode takes care of the UTF8 flag.
 
 =over 2
 
 =item *
 
-When you encode, the resulting utf8 flag is always off.
+When you encode, the resulting UTF8 flag is always off.
 
 =item *
 
-When you decode, the resulting utf8 flag is on unless you can
+When you decode, the resulting UTF8 flag is on unless you can
 unambiguously represent data.  Here is the definition of
 dis-ambiguity.
 
 After C<$utf8 = decode('foo', $octet);>,
 
-  When $octet is...   The utf8 flag in $utf8 is
+  When $octet is...   The UTF8 flag in $utf8 is
   ---------------------------------------------
   In ASCII only (or EBCDIC only)            OFF
   In ISO-8859-1                              ON
@@ -750,7 +750,7 @@ As you see, there is one exception, In ASCII.  That way you can assume
 Goal #1.  And with Encode Goal #2 is assumed but you still have to be
 careful in such cases mentioned in B<CAVEAT> paragraphs.
 
-This utf8 flag is not visible in perl scripts, exactly for the same
+This UTF8 flag is not visible in perl scripts, exactly for the same
 reason you cannot (or you I<don't have to>) see if a scalar contains a
 string, integer, or floating point number.   But you can still peek
 and poke these if you will.  See the section below.
@@ -766,7 +766,7 @@ implementation.  As such, they are efficient but may change.
 
 =item is_utf8(STRING [, CHECK])
 
-[INTERNAL] Tests whether the UTF-8 flag is turned on in the STRING.
+[INTERNAL] Tests whether the UTF8 flag is turned on in the STRING.
 If CHECK is true, also checks the data in STRING for being well-formed
 UTF-8.  Returns true if successful, false otherwise.
 
@@ -774,22 +774,22 @@ As of perl 5.8.1, L<utf8> also has utf8::is_utf8().
 
 =item _utf8_on(STRING)
 
-[INTERNAL] Turns on the UTF-8 flag in STRING.  The data in STRING is
+[INTERNAL] Turns on the UTF8 flag in STRING.  The data in STRING is
 B<not> checked for being well-formed UTF-8.  Do not use unless you
 B<know> that the STRING is well-formed UTF-8.  Returns the previous
-state of the UTF-8 flag (so please don't treat the return value as
+state of the UTF8 flag (so please don't treat the return value as
 indicating success or failure), or C<undef> if STRING is not a string.
 
 =item _utf8_off(STRING)
 
-[INTERNAL] Turns off the UTF-8 flag in STRING.  Do not use frivolously.
-Returns the previous state of the UTF-8 flag (so please don't treat the
+[INTERNAL] Turns off the UTF8 flag in STRING.  Do not use frivolously.
+Returns the previous state of the UTF8 flag (so please don't treat the
 return value as indicating success or failure), or C<undef> if STRING is
 not a string.
 
 =back
 
-=head1 UTF-8 vs. utf8
+=head1 UTF-8 vs. utf8 vs. UTF8
 
   ....We now view strings not as sequences of bytes, but as sequences
   of numbers in the range 0 .. 2**32-1 (or in the case of 64-bit
@@ -836,6 +836,8 @@ goes "liberal"
   find_encoding("utf_8")->name  # ditto. "_" are treated as "-"
   find_encoding("UTF8")->name  # is 'utf8'.
 
+The UTF8 flag is internally called UTF8, without a hyphen. It indicates
+whether a string is internally encoded as utf8, also without a hyphen.
 
 =head1 SEE ALSO
 
diff --git a/ext/Encode/encoding.pm b/ext/Encode/encoding.pm
index eb84e481f1..1f418e3a22 100644
--- a/ext/Encode/encoding.pm
+++ b/ext/Encode/encoding.pm
@@ -307,6 +307,14 @@ Will print C<2>, because C<$string> is upgraded as UTF-8.  Without
 C<use encoding 'utf8';>, it will print C<4> instead, since C<$string>
 is three octets when interpreted as Latin-1.
 
+=head2 Side effects
+
+If the C<encoding> pragma is in scope then the lengths returned are
+calculated from the length of C<$/> in Unicode characters, which is not
+always the same as the length of C<$/> in the native encoding.
+
+This pragma affects utf8::upgrade, but not utf8::downgrade.
+
 =head1 FEATURES THAT REQUIRE 5.8.1
 
 Some of the features offered by this pragma requires perl 5.8.1.  Most
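The CAVEAT paragraphs touched by this patch say that encode() always returns octets with the UTF8 flag off, and that decoded and encoded data may not compare equal even though they hold "the same data". The following is only a minimal sketch of that behaviour and is not part of the patch. It assumes a perl with a reasonably recent Encode, and the sample value "caf\xE9" and the variable names are chosen purely for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode is_utf8);

# "caf\xE9" is four ISO-8859-1 octets (illustrative sample data).
# Decoding turns them into a four-character string.
my $string = decode("iso-8859-1", "caf\xE9");

# Encoding turns the characters back into octets, this time as UTF-8.
# Per the documentation above, the UTF8 flag of the result is off.
my $octets = encode("utf8", $string);

printf "string: length %d, UTF8 flag %s\n",
    length($string), is_utf8($string) ? "on" : "off";
printf "octets: length %d, UTF8 flag %s\n",
    length($octets), is_utf8($octets) ? "on" : "off";

# eq compares the scalars as sequences of characters, so the
# four-character string and the five-octet result are not equal.
print "string ", ($string eq $octets ? "eq" : "ne"), " octets\n";
```

The decoded string should report a length of 4 and the encoded octets a length of 5, with the flag off for the octets. is_utf8() is marked [INTERNAL] in the documentation and is used here only to peek at the flag, not as something ordinary code should rely on.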