Encoding neutral unpack

Message-Id: <d0fi6i$k06$1@post.home.lunix>
p4raw-id: //depot/perl@24010
author: Ton Hospel <perl5-porters@ton.iguana.be> 2005-03-06 18:29:38 +0000
committer: Rafael Garcia-Suarez <rgarciasuarez@gmail.com> 2005-03-08 17:53:50 +0000
commit: f337b084e4f053c4222a0b9a773a9e12c0232e6d (patch)
tree: 1292203ca74046d2df21ce05bb8f8289ea14bc8d /pod/perluniintro.pod
parent: c478aefb95db58c5f937ab7c70bba552d23df9b2 (diff)
download: perl-f337b084e4f053c4222a0b9a773a9e12c0232e6d.tar.gz
1 files changed, 22 insertions, 15 deletions
diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 81efd6bba4..b0d5859065 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -247,12 +247,12 @@ constants: you cannot use variables in them.  if you want similar
 run-time functionality, use C<chr()> and C<charnames::vianame()>.
 
 If you want to force the result to Unicode characters, use the special
-C<"U0"> prefix.  It consumes no arguments but forces the result to be
-in Unicode characters, instead of bytes.
+C<"U0"> prefix.  It consumes no arguments but causes the following bytes
+to be interpreted as the UTF-8 encoding of Unicode characters:
 
-   my $chars = pack("U0C*", 0x80, 0x42);
+   my $chars = pack("U0W*", 0x80, 0x42);
 
-Likewise, you can force the result to be bytes by using the special
+Likewise, you can stop such UTF-8 interpretation by using the special 
 C<"C0"> prefix.
 
 =head2 Handling Unicode
@@ -452,7 +452,7 @@ displayed as C<\x..>, and the rest of the characters as themselves:
                chr($_) =~ /[[:cntrl:]]/ ?  # else if control character ...
                sprintf("\\x%02X", $_) :    # \x..
                quotemeta(chr($_))          # else quoted or as themselves
-         } unpack("U*", $_[0]));           # unpack Unicode characters
+         } unpack("W*", $_[0]));           # unpack Unicode characters
    }
 
 For example,
@@ -492,11 +492,12 @@ explicitly-defined I/O layers). But if you must, there are two
 ways of looking behind the scenes.
 
 One way of peeking inside the internal encoding of Unicode characters
-is to use C<unpack("C*", ...> to get the bytes or C<unpack("H*", ...)>
-to display the bytes:
+is to use C<unpack("C*", ...> to get the bytes of whatever the string
+encoding happens to be, or C<unpack("U0..", ...)> to get the bytes of the
+UTF-8 encoding:
 
     # this prints  c4 80  for the UTF-8 bytes 0xc4 0x80
-    print join(" ", unpack("H*", pack("U", 0x100))), "\n";
+    print join(" ", unpack("U0(H2)*", pack("U", 0x100))), "\n";
 
 Yet another way would be to use the Devel::Peek module:
 
@@ -675,15 +676,17 @@ For example,
         # invalid
     }
 
-For UTF-8 only, you can use:
+Or use C<unpack> to try decoding it:
 
     use warnings;
-    @chars = unpack("U0U*", $string_of_bytes_that_I_think_is_utf8);
+    @chars = unpack("C0U*", $string_of_bytes_that_I_think_is_utf8);
 
 If invalid, a C<Malformed UTF-8 character (byte 0x##) in unpack>
-warning is produced. The "U0" means "expect strictly UTF-8 encoded
-Unicode".  Without that the C<unpack("U*", ...)> would accept also
-data like C<chr(0xFF>), similarly to the C<pack> as we saw earlier.
+warning is produced. The "C0" means 
+"process the string character per character".  Without that the 
+C<unpack("U*", ...)> would work in C<U0> mode (the default if the format 
+string starts with C<U>) and it would return the bytes making up the UTF-8 
+encoding of the target string, something that will always work.
 
 =item *
 
@@ -725,8 +728,8 @@ Back to converting data.  If you have (or want) data in your system's
 native 8-bit encoding (e.g. Latin-1, EBCDIC, etc.), you can use
 pack/unpack to convert to/from Unicode.
 
-    $native_string  = pack("C*", unpack("U*", $Unicode_string));
-    $Unicode_string = pack("U*", unpack("C*", $native_string));
+    $native_string  = pack("W*", unpack("U*", $Unicode_string));
+    $Unicode_string = pack("U*", unpack("W*", $native_string));
 
 If you have a sequence of bytes you B<know> is valid UTF-8,
 but Perl doesn't know it yet, you can make Perl a believer, too:
@@ -734,6 +737,10 @@ but Perl doesn't know it yet, you can make Perl a believer, too:
     use Encode 'decode_utf8';
     $Unicode = decode_utf8($bytes);
 
+or:
+
+    $Unicode = pack("U0a*", $bytes);
+   
 You can convert well-formed UTF-8 to a sequence of bytes, but if
 you just want to convert random binary data into UTF-8, you can't.
 B<Any random collection of bytes isn't well-formed UTF-8>.  You can
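
As a rough illustration of the behaviour the patched documentation describes, the sketch below runs the three unpack forms side by side. It assumes a perl that includes this change (5.10 or later, where the C<W> format exists); the variable names are invented for this example and do not appear in the patch.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# One character, U+0100, as produced by the example in the patched text.
my $string = pack("U", 0x100);

# "W*" returns character values, whatever the internal encoding is.
my @chars = unpack("W*", $string);

# "U0" switches unpack to the UTF-8 bytes of the string; (H2)* shows
# each byte as two hex digits, as in the patched perluniintro example.
my @utf8_hex = unpack("U0(H2)*", $string);

# "C0U*" walks a byte string and decodes it as UTF-8.
my $bytes   = encode_utf8($string);
my @decoded = unpack("C0U*", $bytes);

print "chars:   @chars\n";      # 256
print "bytes:   @utf8_hex\n";   # c4 80
print "decoded: @decoded\n";    # 256
```

With C<use warnings> in effect, feeding malformed bytes to the C<C0U*> call should produce the C<Malformed UTF-8 character> warning mentioned in the patched text rather than a silent result.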