author: Juerd Waalboer <#####@juerd.nl> 2007-03-04 17:00:19 +0100
committer: H.Merijn Brand <h.m.brand@xs4all.nl> 2007-03-07 13:23:23 +0000
commit: 2575c402a8f9be55f848bdfb219afbf912c50ac1 (patch)
tree: c21a19c42deaa2dba098c38d74338a7c01328c28 /pod/perlpacktut.pod
parent: 2a6a970fa1b36c99c83fd3fdd48253c1b567db9b (diff)
Re: [PATCH] (Re: [PATCH] unicode/utf8 pod)
Message-ID: <20070304150019.GN4723@c4.convolution.nl>
p4raw-id: //depot/perl@30493
Diffstat (limited to 'pod/perlpacktut.pod')
-rw-r--r-- pod/perlpacktut.pod 24
1 files changed, 18 insertions, 6 deletions
diff --git a/pod/perlpacktut.pod b/pod/perlpacktut.pod
index 1cb127e0b9..d907b1805c 100644
--- a/pod/perlpacktut.pod
+++ b/pod/perlpacktut.pod
@@ -633,24 +633,36 @@ The UTF-8 encoding avoids this by storing the most common (from a western
point of view) characters in a single byte while encoding the rarer
ones in three or more bytes.
-So what has this got to do with C<pack>? Well, if you want to convert
-between a Unicode number and its UTF-8 representation you can do so by
-using template code C<U>. As an example, let's produce the UTF-8
-representation of the Euro currency symbol (code number 0x20AC):
+Perl uses UTF-8, internally, for most Unicode strings.
+
+So what has this got to do with C<pack>? Well, if you want to compose a
+Unicode string (that is internally encoded as UTF-8), you can do so by
+using template code C<U>. As an example, let's produce the Euro currency
+symbol (code number 0x20AC):
$UTF8{Euro} = pack( 'U', 0x20AC );
+ # Equivalent to: $UTF8{Euro} = "\x{20ac}";
-Inspecting C<$UTF8{Euro}> shows that it contains 3 bytes: "\xe2\x82\xac". The
-round trip can be completed with C<unpack>:
+Inspecting C<$UTF8{Euro}> shows that it contains 3 bytes:
+"\xe2\x82\xac". However, it contains only 1 character, number 0x20AC.
+The round trip can be completed with C<unpack>:
$Unicode{Euro} = unpack( 'U', $UTF8{Euro} );
+Unpacking using the C<U> template code also works on UTF-8 encoded byte
+strings.
+
Usually you'll want to pack or unpack UTF-8 strings:
# pack and unpack the Hebrew alphabet
my $alefbet = pack( 'U*', 0x05d0..0x05ea );
my @hebrew = unpack( 'U*', $utf );
+Please note: in the general case, you're better off using
+Encode::decode_utf8 to decode a UTF-8 encoded byte string to a Perl
+unicode string, and Encode::encode_utf8 to encode a Perl unicode string
+to UTF-8 bytes. These functions provide means of handling invalid byte
+sequences and generally have a friendlier interface.
=head2 Another Portable Binary Encoding
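
The behaviour the patched text describes can be checked with a short script. This is only a sketch, assuming a reasonably modern Perl with the core Encode module; the variable names are illustrative and not part of the patch. It unpacks $alefbet rather than $utf, because $utf is never defined in the quoted example.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Compose a one-character Unicode string from its code number.
my $euro = pack( 'U', 0x20AC );

# Explicitly encode it to UTF-8 bytes to look at the byte form.
my $bytes = encode_utf8($euro);

# One character, three bytes.
printf "chars=%d bytes=%d hex=%s\n",
    length($euro), length($bytes), unpack( 'H*', $bytes );
# prints: chars=1 bytes=3 hex=e282ac

# Decode the bytes back to a character string and recover the code number.
my $back = decode_utf8($bytes);
printf "code=0x%04X\n", unpack( 'U', $back );
# prints: code=0x20AC

# Pack and unpack the Hebrew alphabet block used in the tutorial.
my $alefbet = pack( 'U*', 0x05d0 .. 0x05ea );
my @hebrew  = unpack( 'U*', $alefbet );
printf "letters=%d\n", scalar @hebrew;
# prints: letters=27
```

Going through Encode for the byte form, instead of inspecting Perl's internal buffer, follows the closing advice of the patch and keeps the character count and the byte count clearly separate.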