author | Juerd Waalboer <#####@juerd.nl> | 2007-03-04 17:00:19 +0100 |
---|---|---|
committer | H.Merijn Brand <h.m.brand@xs4all.nl> | 2007-03-07 13:23:23 +0000 |
commit | 2575c402a8f9be55f848bdfb219afbf912c50ac1 (patch) | |
tree | c21a19c42deaa2dba098c38d74338a7c01328c28 /pod/perlpacktut.pod | |
parent | 2a6a970fa1b36c99c83fd3fdd48253c1b567db9b (diff) | |
Re: [PATCH] (Re: [PATCH] unicode/utf8 pod)
Message-ID: <20070304150019.GN4723@c4.convolution.nl>
p4raw-id: //depot/perl@30493
Diffstat (limited to 'pod/perlpacktut.pod')
-rw-r--r-- | pod/perlpacktut.pod | 24 |
1 files changed, 18 insertions, 6 deletions
diff --git a/pod/perlpacktut.pod b/pod/perlpacktut.pod
index 1cb127e0b9..d907b1805c 100644
--- a/pod/perlpacktut.pod
+++ b/pod/perlpacktut.pod
@@ -633,24 +633,36 @@ The UTF-8 encoding avoids this by storing the most common (from a western
 point of view) characters in a single byte while encoding the rarer
 ones in three or more bytes.
 
-So what has this got to do with C<pack>? Well, if you want to convert
-between a Unicode number and its UTF-8 representation you can do so by
-using template code C<U>. As an example, let's produce the UTF-8
-representation of the Euro currency symbol (code number 0x20AC):
+Perl uses UTF-8, internally, for most Unicode strings.
+
+So what has this got to do with C<pack>? Well, if you want to compose a
+Unicode string (that is internally encoded as UTF-8), you can do so by
+using template code C<U>. As an example, let's produce the Euro currency
+symbol (code number 0x20AC):
 
    $UTF8{Euro} = pack( 'U', 0x20AC );
+   # Equivalent to: $UTF8{Euro} = "\x{20ac}";
 
-Inspecting C<$UTF8{Euro}> shows that it contains 3 bytes: "\xe2\x82\xac". The
-round trip can be completed with C<unpack>:
+Inspecting C<$UTF8{Euro}> shows that it contains 3 bytes:
+"\xe2\x82\xac". However, it contains only 1 character, number 0x20AC.
+The round trip can be completed with C<unpack>:
 
    $Unicode{Euro} = unpack( 'U', $UTF8{Euro} );
 
+Unpacking using the C<U> template code also works on UTF-8 encoded byte
+strings.
+
 Usually you'll want to pack or unpack UTF-8 strings:
 
    # pack and unpack the Hebrew alphabet
    my $alefbet = pack( 'U*', 0x05d0..0x05ea );
    my @hebrew = unpack( 'U*', $utf );
 
+Please note: in the general case, you're better off using
+Encode::decode_utf8 to decode a UTF-8 encoded byte string to a Perl
+unicode string, and Encode::encode_utf8 to encode a Perl unicode string
+to UTF-8 bytes. These functions provide means of handling invalid byte
+sequences and generally have a friendlier interface.
 
 =head2 Another Portable Binary Encoding
 