1 files changed, 65 insertions, 54 deletions
diff --git a/pod/perlfunc.pod b/pod/perlfunc.pod
index dc23b210fa..c15185e680 100644
--- a/pod/perlfunc.pod
+++ b/pod/perlfunc.pod
@@ -2054,7 +2054,7 @@ addresses returned by the corresponding system library call.  In the
 Internet domain, each address is four bytes long and you can unpack it
 by saying something like:
 
-    ($a,$b,$c,$d) = unpack('C4',$addr[0]);
+    ($a,$b,$c,$d) = unpack('W4',$addr[0]);
 
 The Socket library makes this slightly easier:
 
@@ -3296,7 +3296,8 @@ Takes a LIST of values and converts it into a string using the rules
 given by the TEMPLATE.  The resulting string is the concatenation of
 the converted values.  Typically, each converted value looks
 like its machine-level representation.  For example, on 32-bit machines
-a converted integer may be represented by a sequence of 4 bytes.
+an integer may be represented by a sequence of 4 bytes, which will be
+converted to a sequence of 4 characters.
 
 The TEMPLATE is a sequence of characters that give the order and type
 of values, as follows:
@@ -3311,7 +3312,9 @@ of values, as follows:
     H	A hex string (high nybble first).
 
     c	A signed char (8-bit) value.
-    C	An unsigned char value.  Only does bytes.  See U for Unicode.
+    C	An unsigned C char (octet) even under Unicode. Should normally not
+        be used. See U and W instead.
+    W	An unsigned char value (can be greater than 255).
 
     s	A signed short (16-bit) value.
     S	An unsigned short value.
@@ -3414,71 +3417,72 @@ byte (so the packed result will be one longer than the byte C<length>
 of the item).
 
 The repeat count for C<u> is interpreted as the maximal number of bytes
-to encode per line of output, with 0 and 1 replaced by 45.
+to encode per line of output, with 0, 1 and 2 replaced by 45. The repeat
+count should not be more than 65.
 
 =item *
 
 The C<a>, C<A>, and C<Z> types gobble just one value, but pack it as a
 string of length count, padding with nulls or spaces as necessary.  When
 unpacking, C<A> strips trailing spaces and nulls, C<Z> strips everything
-after the first null, and C<a> returns data verbatim.  When packing,
-C<a>, and C<Z> are equivalent.
+after the first null, and C<a> returns data verbatim.
 
 If the value-to-pack is too long, it is truncated.  If too long and an
 explicit count is provided, C<Z> packs only C<$count-1> bytes, followed
-by a null byte.  Thus C<Z> always packs a trailing null byte under
-all circumstances.
+by a null byte.  Thus C<Z> always packs a trailing null (except when the
+count is 0).
 
 =item *
 
 Likewise, the C<b> and C<B> fields pack a string that many bits long.
-Each byte of the input field of pack() generates 1 bit of the result.
+Each character of the input field of pack() generates 1 bit of the result.
 Each result bit is based on the least-significant bit of the corresponding
-input byte, i.e., on C<ord($byte)%2>.  In particular, bytes C<"0"> and
-C<"1"> generate bits 0 and 1, as do bytes C<"\0"> and C<"\1">.
+input character, i.e., on C<ord($char)%2>.  In particular, characters C<"0">
+and C<"1"> generate bits 0 and 1, as do characters C<"\0"> and C<"\1">.
 
 Starting from the beginning of the input string of pack(), each 8-tuple
-of bytes is converted to 1 byte of output.  With format C<b>
-the first byte of the 8-tuple determines the least-significant bit of a
-byte, and with format C<B> it determines the most-significant bit of
-a byte.
+of characters is converted to 1 character of output.  With format C<b>
+the first character of the 8-tuple determines the least-significant bit of a
+character, and with format C<B> it determines the most-significant bit of
+a character.
 
 If the length of the input string is not exactly divisible by 8, the
-remainder is packed as if the input string were padded by null bytes
+remainder is packed as if the input string were padded by null characters
 at the end.  Similarly, during unpack()ing the "extra" bits are ignored.
 
-If the input string of pack() is longer than needed, extra bytes are ignored.
-A C<*> for the repeat count of pack() means to use all the bytes of
-the input field.  On unpack()ing the bits are converted to a string
-of C<"0">s and C<"1">s.
+If the input string of pack() is longer than needed, extra characters are
+ignored. A C<*> for the repeat count of pack() means to use all the
+characters of the input field.  On unpack()ing the bits are converted to a
+string of C<"0">s and C<"1">s.
 
 =item *
 
 The C<h> and C<H> fields pack a string that many nybbles (4-bit groups,
 representable as hexadecimal digits, 0-9a-f) long.
 
-Each byte of the input field of pack() generates 4 bits of the result.
-For non-alphabetical bytes the result is based on the 4 least-significant
-bits of the input byte, i.e., on C<ord($byte)%16>.  In particular,
-bytes C<"0"> and C<"1"> generate nybbles 0 and 1, as do bytes
-C<"\0"> and C<"\1">.  For bytes C<"a".."f"> and C<"A".."F"> the result
+Each character of the input field of pack() generates 4 bits of the result.
+For non-alphabetical characters the result is based on the 4 least-significant
+bits of the input character, i.e., on C<ord($char)%16>.  In particular,
+characters C<"0"> and C<"1"> generate nybbles 0 and 1, as do characters
+C<"\0"> and C<"\1">.  For characters C<"a".."f"> and C<"A".."F"> the result
 is compatible with the usual hexadecimal digits, so that C<"a"> and
-C<"A"> both generate the nybble C<0xa==10>.  The result for bytes
+C<"A"> both generate the nybble C<0xa==10>.  The result for characters
 C<"g".."z"> and C<"G".."Z"> is not well-defined.
 
 Starting from the beginning of the input string of pack(), each pair
-of bytes is converted to 1 byte of output.  With format C<h> the
-first byte of the pair determines the least-significant nybble of the
-output byte, and with format C<H> it determines the most-significant
+of characters is converted to 1 character of output.  With format C<h> the
+first character of the pair determines the least-significant nybble of the
+output character, and with format C<H> it determines the most-significant
 nybble.
 
 If the length of the input string is not even, it behaves as if padded
-by a null byte at the end.  Similarly, during unpack()ing the "extra"
+by a null character at the end.  Similarly, during unpack()ing the "extra"
 nybbles are ignored.
 
-If the input string of pack() is longer than needed, extra bytes are ignored.
-A C<*> for the repeat count of pack() means to use all the bytes of
-the input field.  On unpack()ing the bits are converted to a string
+If the input string of pack() is longer than needed, extra characters are
+ignored.
+A C<*> for the repeat count of pack() means to use all the characters of
+the input field.  On unpack()ing the nybbles are converted to a string
 of hexadecimal digits.
 
 =item *
@@ -3512,7 +3516,7 @@ I<length-item>, but if you put in the '*' it will be ignored. For all other
 codes, C<unpack> applies the length value to the next item, which must not
 have a repeat count.
 
-    unpack 'C/a', "\04Gurusamy";        gives 'Guru'
+    unpack 'W/a', "\04Gurusamy";        gives 'Guru'
     unpack 'a3/A* A*', '007 Bond  J ';  gives (' Bond','J')
     pack 'n/a* w/a*','hello,','world';  gives "\000\006hello,\005world"
 
@@ -3581,7 +3585,7 @@ Some systems may have even weirder byte orders such as
 You can see your system's preference with
 
  	print join(" ", map { sprintf "%#02x", $_ }
-                            unpack("C*",pack("L",0x12345678))), "\n";
+                            unpack("W*",pack("L",0x12345678))), "\n";
 
 The byteorder on the platform where Perl was built is also available
 via L<Config>:
@@ -3649,21 +3653,21 @@ will not in general equal $foo).
 
 =item *
 
-If the pattern begins with a C<U>, the resulting string will be
-treated as UTF-8-encoded Unicode. You can force UTF-8 encoding on in a
-string with an initial C<U0>, and the bytes that follow will be
-interpreted as Unicode characters. If you don't want this to happen,
-you can begin your pattern with C<C0> (or anything else) to force Perl
-not to UTF-8 encode your string, and then follow this with a C<U*>
-somewhere in your pattern.
+Pack and unpack can operate in two modes: character mode (C<C0> mode), where
+the packed string is processed per character, and UTF-8 mode (C<U0> mode),
+where the packed string is processed in its UTF-8-encoded Unicode form on
+a byte-by-byte basis. Character mode is the default unless the format string
+starts with a C<U>. You can switch modes at any moment with an explicit
+C<C0> or C<U0> in the format. A mode is in effect until the next mode switch
+or until the end of the ()-group in which it was entered.
 
 =item *
 
 You must yourself do any alignment or padding by inserting for example
 enough C<'x'>es while packing.  There is no way to pack() and unpack()
-could know where the bytes are going to or coming from.  Therefore
+could know where the characters are going to or coming from.  Therefore
 C<pack> (and C<unpack>) handle their output and input as flat
-sequences of bytes.
+sequences of characters.
 
 =item *
 
@@ -3681,9 +3685,9 @@ is the string "\0a\0\0bc".
 
 C<x> and C<X> accept C<!> modifier.  In this case they act as
 alignment commands: they jump forward/back to the closest position
-aligned at a multiple of C<count> bytes.  For example, to pack() or
+aligned at a multiple of C<count> characters. For example, to pack() or
 unpack() C's C<struct {char c; double d; char cc[2]}> one may need to
-use the template C<C x![d] d C[2]>; this assumes that doubles must be
+use the template C<W x![d] d W[2]>; this assumes that doubles must be
 aligned on the double's size.
 
 For alignment commands C<count> of 0 is equivalent to C<count> of 1;
@@ -3713,20 +3717,27 @@ to pack() than actually given, extra arguments are ignored.
 
 Examples:
 
-    $foo = pack("CCCC",65,66,67,68);
+    $foo = pack("WWWW",65,66,67,68);
     # foo eq "ABCD"
-    $foo = pack("C4",65,66,67,68);
+    $foo = pack("W4",65,66,67,68);
     # same thing
+    $foo = pack("W4",0x24b6,0x24b7,0x24b8,0x24b9);
+    # same thing with Unicode circled letters.
     $foo = pack("U4",0x24b6,0x24b7,0x24b8,0x24b9);
-    # same thing with Unicode circled letters
+    # same thing with Unicode circled letters. You don't get the UTF-8
+    # bytes because the U at the start of the format caused a switch to
+    # U0-mode, so the UTF-8 bytes get joined into characters
+    $foo = pack("C0U4",0x24b6,0x24b7,0x24b8,0x24b9);
+    # foo eq "\xe2\x92\xb6\xe2\x92\xb7\xe2\x92\xb8\xe2\x92\xb9"
+    # This is the UTF-8 encoding of the string in the previous example
 
     $foo = pack("ccxxcc",65,66,67,68);
     # foo eq "AB\0\0CD"
 
-    # note: the above examples featuring "C" and "c" are true
+    # note: the above examples featuring "W" and "c" are true
     # only on ASCII and ASCII-derived systems such as ISO Latin 1
     # and UTF-8.  In EBCDIC the first example would be
-    # $foo = pack("CCCC",193,194,195,196);
+    # $foo = pack("WWWW",193,194,195,196);
 
     $foo = pack("s2",1,2);
     # "\1\0\2\0" on little-endian
@@ -6242,7 +6253,7 @@ If EXPR is omitted, unpacks the C<$_> string.
 
 The string is broken into chunks described by the TEMPLATE.  Each chunk
 is converted separately to a value.  Typically, either the string is a result
-of C<pack>, or the bytes of the string represent a C structure of some
+of C<pack>, or the characters of the string represent a C structure of some
 kind.
 
 The TEMPLATE has the same format as in the C<pack> function.
@@ -6255,7 +6266,7 @@ Here's a subroutine that does substring:
 
 and then there's
 
-    sub ordinal { unpack("c",$_[0]); } # same as ord()
+    sub ordinal { unpack("W",$_[0]); } # same as ord()
 
 In addition to fields allowed in pack(), you may prefix a field with
 a %<number> to indicate that
@@ -6269,7 +6280,7 @@ computes the same number as the System V sum program:
 
     $checksum = do {
 	local $/;  # slurp!
-	unpack("%32C*",<>) % 65535;
+	unpack("%32W*",<>) % 65535;
     };
 
 The following efficiently counts the number of set bits in a bit vector:
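
Note on the patch above: the following is a minimal sketch, not part of the patch, that exercises the C<W>, C<U> and C<C0> behaviour the new documentation describes. It assumes a perl that already includes this change (the C<W> format must exist); the variable names are illustrative only.

```perl
use strict;
use warnings;

# Code points of the circled letters used in the documentation examples.
my @circled = (0x24b6, 0x24b7, 0x24b8, 0x24b9);

# W packs each value as one character, even when it is above 255.
my $w = pack("W4", @circled);

# A leading U switches to U0 mode, so the UTF-8 bytes are joined back
# into characters and the result should match the W example.
my $u = pack("U4", @circled);

# C0 keeps character mode, so each UTF-8 byte stays a separate character.
my $bytes = pack("C0U4", @circled);

printf "W4:   %d characters\n", length $w;
printf "U4:   %d characters, %s W4\n",
    length $u, ($u eq $w ? "equal to" : "differs from");
printf "C0U4: %d characters\n", length $bytes;

# unpack with W returns the code point of the first character, like ord().
my $first = unpack("W", $w);
printf "first code point: 0x%04x\n", $first;
```

Comparing the three lengths is a quick way to check which mode a template ends up in before relying on it for binary data.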