From f1aa04aa98e25c473fc39871251cb722556cefa9 Mon Sep 17 00:00:00 2001
From: Rafael Garcia-Suarez
Date: Mon, 21 Mar 2005 10:12:01 +0000
Subject: perldelta suggestions on (un)?pack by Ton Hospel

p4raw-id: //depot/perl@24051
---
 pod/perl592delta.pod | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/pod/perl592delta.pod b/pod/perl592delta.pod
index fc251a872a..ea113798ac 100644
--- a/pod/perl592delta.pod
+++ b/pod/perl592delta.pod
@@ -13,9 +13,12 @@ differences between 5.8.0 and 5.9.1.
 =head2 Packing and UTF-8 strings
 
 The semantics of pack() and unpack() regarding UTF-8-encoded data has been
-clarified. B Notably, code that
-uses C to see through the encoding of string will now
-simply return $string.
+changed. Processing is now by default character per character instead of
+byte per byte on the underlying encoding. Notably, code that used things
+like C<pack("a*", $string)> to see through the encoding of string will now
+simply get back the original $string. Packed strings can also get upgraded
+during processing when you store upgraded characters. You can get the old
+behaviour by using C<use bytes>.
 
 To be consistent with pack(), the C<C0> in unpack() templates indicates that
 the data is to be processed in character mode, i.e. character by
@@ -26,14 +29,16 @@ by byte basis. This is reversed with regard to perl 5.8.X.
 Moreover, C<C0> and C<U0> can also be used in pack() templates to specify
 respectively character and byte modes.
 
-C<C0> and C<U0> in the middle of a pack format now switch to the specified
-encoding mode, honoring parens grouping. Previously, parens were ignored.
+C<C0> and C<U0> in the middle of a pack or unpack format now switch to the
+specified encoding mode, honoring parens grouping. Previously, parens were
+ignored.
 
 Also, there is a new pack() character format, C<W>, which is intended to
-replace the old C<C>. C<C> is kept for unsigned chars coded on eight bits.
-C<W> represents unsigned character values, which can be greater than 255.
-It is therefore more robust when dealing with potentially UTF-8-encoded
-data (as C<C> will wrap values outside the range 0..255).
+replace the old C<C>. C<C> is kept for unsigned chars coded as bytes in
+the strings internal representation. C<W> represents unsigned (logical)
+character values, which can be greater than 255. It is therefore more
+robust when dealing with potentially UTF-8-encoded data (as C<C> will wrap
+values outside the range 0..255, and not respect the string encoding).
 
 In practice, that means that pack formats are now encoding-neutral, except
 C<C>.
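
The behaviour described in the patched text can be checked with a short script. What follows is only a sketch, not part of the commit: it assumes a perl of version 5.10 or later (where these semantics were released), and the variable names and sample values are made up for illustration.

```perl
use strict;
use warnings;

# Illustrative sample (an assumption, not from the patch): a string that
# contains a character above 255, so perl stores it UTF-8-encoded internally.
my $string = "caf\x{e9} \x{263A}";

# pack "a*" now works character by character, so it no longer exposes the
# underlying encoding: the original string comes back unchanged.
my $copy = pack("a*", $string);
print "a* round-trips: ", ($copy eq $string ? "yes" : "no"), "\n";

# W stores a logical character value, which may be greater than 255.
my $wide = pack("W", 0x263A);
printf "W: length %d, ord 0x%X\n", length($wide), ord($wide);

# C is limited to a single byte, so 300 wraps to 300 % 256 = 44.
# The wrap normally emits a warning in the 'pack' category; silence it here.
my $byte = do { no warnings 'pack'; pack("C", 300) };
printf "C: length %d, ord %d\n", length($byte), ord($byte);
```

Code that needs values above 255 to survive intact would use W; C remains the choice when a single byte is genuinely wanted.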