Add perlpacktut from Wolfgang Laun; regen toc.

p4raw-id: //depot/perl@13341
author: Jarkko Hietaniemi <jhi@iki.fi> 2001-11-28 14:00:12 +0000
committer: Jarkko Hietaniemi <jhi@iki.fi> 2001-11-28 14:00:12 +0000
commit: 34babc168865edbd6963528152c3c3b68680b55f (patch)
tree: 7c115a5843147fddd36cf90e7d95fdaf58f00f70 /pod/perlpacktut.pod
parent: d9ae6319b04989bd3d538f5691a909e211d38fbb (diff)
download: perl-34babc168865edbd6963528152c3c3b68680b55f.tar.gz
1 files changed, 1008 insertions, 0 deletions
diff --git a/pod/perlpacktut.pod b/pod/perlpacktut.pod
new file mode 100644
index 0000000000..921fdc5a07
--- /dev/null
+++ b/pod/perlpacktut.pod
@@ -0,0 +1,1008 @@
+=head1 NAME
+
+perlpacktut - tutorial on C<pack> and C<unpack>
+
+=head1 DESCRIPTION
+
+C<pack> and C<unpack> are two functions for transforming data according
+to a user-defined template, between the guarded way Perl stores values
+and some well-defined  representation as might be required in the 
+environment of a Perl program. Unfortunately, they're also two of 
+the most misunderstood and most often overlooked functions that Perl
+provides. This tutorial will demystify them for you.
+
+
+=head1 The Basic Principle
+
+Most programming languages don't shelter the memory where variables are
+stored. In C, for instance, you can take the address of some variable,
+and the C<sizeof> operator tells you how many bytes are allocated to
+the variable. Using the address and the size, you may access the storage
+to your heart's content.
+
+In Perl, you just can't access memory at random, but the structural and
+representational conversion provided by C<pack> and C<unpack> is an
+excellent alternative. The C<pack> function converts values to a byte
+sequence containing representations according to a given specification,
+the so-called "template" argument. C<unpack> is the reverse process,
+deriving some values from the contents of a string of bytes. (Be cautioned,
+however, that not all that has been packed together can be neatly unpacked - 
+a very common experience as seasoned travellers are likely to confirm.)
+
+Why, you may ask, would you need a chunk of memory containing some values
+in binary representation? One good reason is input and output accessing
+some file, a device, or a network connection, whereby this binary
+representation is either forced on you or will give you some benefit
+in processing. Another cause is passing data to some system call that
+is not available as a Perl function: C<syscall> requires you to provide
+parameters stored in the way it happens in a C program. Even text processing 
+(as shown in the next section) may be simplified with judicious usage 
+of these two functions.
+
+To see how (un)packing works, we'll start with a simple template
+code where the conversion is in low gear: between the contents of a byte
+sequence and a string of hexadecimal digits. Let's use C<unpack>, since
+this is likely to remind you of a dump program, or some desparate last
+message unfortunate programs are wont to throw at you before they expire
+into the wild blue yonder. Assuming that the variable C<$mem> holds a 
+sequence of bytes that we'd like to inspect without assuming anything 
+about its meaning, we can write
+
+   my( $hex ) = unpack( 'H*', $mem );
+   print "$hex\n";
+
+whereupon we might see something like this, with each pair of hex digits
+corresponding to a byte:
+
+   41204d414e204120504c414e20412043414e414c2050414e414d41
+
+What was in this chunk of memory? Numbers, charactes, or a mixture of
+both? Assuming that we're on a computer where ASCII (or some similar)
+encoding is used: hexadecimal values in the range C<0x40> - C<0x5A>
+indicate an uppercase letter, and <0x20> encodes a space. So we might
+assume it is a piece of text, which some are able to read like a tabloid;
+but others will have to get hold of an ASCII table and relive that
+firstgrader feeling. Not caring too much about which way to read this,
+we note that C<unpack> with the template code C<H> converts the contents
+of a sequence of bytes into the customary hexadecimal notation. Since
+"a sequence of" is a pretty vague indication of quantity, C<H> has been
+defined to convert just a single hexadecimal digit unless it is followed
+by a repeat count. An asterisk for the repeat count means to use whatever
+remains.
+
+The inverse operation - packing byte contents from a string of hexadecimal
+digits - is just as easily written. For instance:
+
+   my $s = pack( 'H2' x 10, map { "3$_" } ( 0..9 ) );
+   print "$s\n";
+
+Since we feed a list of ten 2-digit hexadecimal strings to C<pack>, the
+pack template should contain ten pack codes. If this is run on a computer
+with ASCII character coding, it will print C<0123456789>.
+
+
+=head1 Packing Text
+
+Let's suppose you've got to read in a data file like this:
+
+    Date      |Description                | Income|Expenditure
+    01/24/2001 Ahmed's Camel Emporium                  1147.99
+    01/28/2001 Flea spray                                24.99
+    01/29/2001 Camel rides to tourists      235.00
+
+How do we do it? You might think first to use C<split>; however, since
+C<split> collapses blank fields, you'll never know whether a record was
+income or expenditure. Oops. Well, you could always use C<substr>:
+
+    while (<>) { 
+        my $date   = substr($_,  0, 11);
+        my $desc   = substr($_, 12, 27);
+        my $income = substr($_, 40,  7);
+        my $expend = substr($_, 52,  7);
+        ...
+    }
+
+It's not really a barrel of laughs, is it? In fact, it's worse than it
+may seem; the eagle-eyed may notice that the first field should only be
+10 characters wide, and the error has propagated right through the other
+numbers - which we've had to count by hand. So it's error-prone as well
+as horribly unfriendly.
+
+Or maybe we could use regular expressions:
+    
+    while (<>) { 
+        my($date, $desc, $income, $expend) = 
+            m|(\d\d/\d\d/\d{4}) (.{27}) (.{7})(.*)|;
+        ...
+    }
+
+Urgh. Well, it's a bit better, but - well, would you want to maintain
+that?
+
+Hey, isn't Perl supposed to make this sort of thing easy? Well, it does,
+if you use the right tools. C<pack> and C<unpack> are designed to help
+you out when dealing with fixed-width data like the above. Let's have a
+look at a solution with C<unpack>:
+
+    while (<>) { 
+        my($date, $desc, $income, $expend) = unpack("A10xA27xA7A*", $_);
+        ...
+    }
+
+That looks a bit nicer; but we've got to take apart that weird template.
+Where did I pull that out of? 
+
+OK, let's have a look at some of our data again; in fact, we'll include
+the headers, and a handy ruler so we can keep track of where we are.
+
+             1         2         3         4         5        
+    1234567890123456789012345678901234567890123456789012345678
+    Date      |Description                | Income|Expenditure
+    01/28/2001 Flea spray                                24.99
+    01/29/2001 Camel rides to tourists      235.00
+
+>From this, we can see that the date column stretches from column 1 to
+column 10 - ten characters wide. The C<pack>-ese for "character" is
+C<A>, and ten of them are C<A10>. So if we just wanted to extract the
+dates, we could say this:
+
+    my($date) = unpack("A10", $_);
+
+OK, what's next? Between the date and the description is a blank column;
+we want to skip over that. The C<x> template means "skip forward", so we
+want one of those. Next, we have another batch of characters, from 12 to
+38. That's 27 more characters, hence C<A27>. (Don't make the fencepost
+error - there are 27 characters between 12 and 38, not 26. Count 'em!)
+
+Now we skip another character and pick up the next 7 characters:
+
+    my($date,$description,$income) = unpack("A10xA27xA7", $_);
+
+Now comes the clever bit. Lines in our ledger which are just income and
+not expenditure might end at column 46. Hence, we don't want to tell our
+C<unpack> pattern that we B<need> to find another 12 characters; we'll
+just say "if there's anything left, take it". As you might guess from
+regular expressions, that's what the C<*> means: "use everything
+remaining".
+
+=over 3
+
+=item *
+
+Be warned, though, that unlike regular expressions, if the C<unpack>
+template doesn't match the incoming data, Perl will scream and die.
+
+=back
+
+
+Hence, putting it all together:
+
+    my($date,$description,$income,$expend) = unpack("A10xA27xA7A*", $_);
+
+Now, that's our data parsed. I suppose what we might want to do now is
+total up our income and expenditure, and add another line to the end of
+our ledger - in the same format - saying how much we've brought in and
+how much we've spent:
+
+    while (<>) {
+        my($date, $desc, $income, $expend) = unpack("A10xA27xA7xA*", $_);
+        $tot_income += $income;
+        $tot_expend += $expend;
+    }
+
+    $tot_income = sprintf("%.2f", $tot_income); # Get them into 
+    $tot_expend = sprintf("%.2f", $tot_expend); # "financial" format
+
+    $date = POSIX::strftime("%m/%d/%Y", localtime); 
+
+    # OK, let's go:
+
+    print pack("A10xA27xA7xA*", $date, "Totals", $tot_income, $tot_expend);
+
+Oh, hmm. That didn't quite work. Let's see what happened:
+
+    01/24/2001 Ahmed's Camel Emporium                   1147.99
+    01/28/2001 Flea spray                                 24.99
+    01/29/2001 Camel rides to tourists     1235.00
+    03/23/2001Totals                     1235.001172.98
+
+OK, it's a start, but what happened to the spaces? We put C<x>, didn't
+we? Shouldn't it skip forward? Let's look at what L<perlfunc/pack> says:
+
+    x   A null byte.
+
+Urgh. No wonder. There's a big difference between "a null byte",
+character zero, and "a space", character 32. Perl's put something
+between the date and the description - but unfortunately, we can't see
+it! 
+
+What we actually need to do is expand the width of the fields. The C<A>
+format pads any non-existent characters with spaces, so we can use the
+additional spaces to line up our fields, like this:
+
+    print pack("A11 A28 A8 A*", $date, "Totals", $tot_income, $tot_expend);
+
+(Note that you can put spaces in the template to make it more readable,
+but they don't translate to spaces in the output.) Here's what we got
+this time:
+
+    01/24/2001 Ahmed's Camel Emporium                   1147.99
+    01/28/2001 Flea spray                                 24.99
+    01/29/2001 Camel rides to tourists     1235.00
+    03/23/2001 Totals                      1235.00 1172.98
+
+That's a bit better, but we still have that last column which needs to
+be moved further over. There's an easy way to fix this up:
+unfortunately, we can't get C<pack> to right-justify our fields, but we
+can get C<sprintf> to do it:
+
+    $tot_income = sprintf("%.2f", $tot_income); 
+    $tot_expend = sprintf("%12.2f", $tot_expend);
+    $date = POSIX::strftime("%m/%d/%Y", localtime); 
+    print pack("A11 A28 A8 A*", $date, "Totals", $tot_income, $tot_expend);
+
+This time we get the right answer:
+
+    01/28/2001 Flea spray                                 24.99
+    01/29/2001 Camel rides to tourists     1235.00
+    03/23/2001 Totals                      1235.00      1172.98
+
+So that's how we consume and produce fixed-width data. Let's recap what
+we've seen of C<pack> and C<unpack> so far:
+
+=over 3
+
+=item *
+
+Use C<pack> to go from several pieces of data to one fixed-width
+version; use C<unpack> to turn a fixed-width-format string into several
+pieces of data. 
+
+=item *
+
+The pack format C<A> means "any character"; if you're C<pack>ing and
+you've run out of things to pack, C<pack> will fill the rest up with
+spaces.
+
+=item *
+
+C<x> means "skip a byte" when C<unpack>ing; when C<pack>ing, it means
+"introduce a null byte" - that's probably not what you mean if you're
+dealing with plain text.
+
+=item *
+
+You can follow the formats with numbers to say how many characters
+should be affected by that format: C<A12> means "take 12 characters";
+C<x6> means "skip 6 bytes" or "character 0, 6 times".
+
+=item *
+
+Instead of a number, you can use C<*> to mean "consume everything else
+left". 
+
+B<Warning>: when packing multiple pieces of data, C<*> only means
+"consume all of the current piece of data". That's to say
+
+    pack("A*A*", $one, $two)
+
+packs all of C<$one> into the first C<A*> and then all of C<$two> into
+the second. This is a general principle: each format character
+corresponds to one piece of data to be C<pack>ed.
+
+=back
+
+
+
+=head1 Packing Numbers
+
+So much for textual data. Let's get onto the meaty stuff that C<pack>
+and C<unpack> are best at: handling binary formats for numbers. There is,
+of course, not just one binary format  - life would be too simple - but
+Perl will do all the finicky labor for you.
+
+
+=head2 Integers
+
+Packing and unpacking numbers implies conversion to and from some
+I<specific> binary representation. Leaving floating point numbers
+aside for the moment, the salient properties of any such representation
+are:
+
+=over 4
+
+=item *
+
+the number of bytes used for storing the integer,
+
+=item *
+
+whether the contents are interpreted as a signed or unsigned number,
+
+=item *
+
+the byte ordering: whether the first byte is the least or most
+significant byte (or: little-endian or big-endian, respectively).
+
+=back
+
+So, for instance, to pack 20302 to a signed 16 bit integer in your
+computer's representation you write
+
+   my $ps = pack( 's', 20302 );
+
+Again, the result is a string, now containing 2 bytes. If you print 
+this string (which is, generally, not recommended) you might see
+C<ON> or C<NO> (depending on your system's byte ordering) - or something
+entirely different if your computer doesn't use ASCII character encoding.
+Unpacking C<$ps> with the same template returns the original integer value:
+
+   my( $s ) = unpack( 's', $ps );
+
+This is true for all numeric template codes. But don't expect miracles:
+if the packed value exceeds the alotted byte capacity, high order bits
+are silently discarded, and unpack certainly won't be able to pull them
+back out of some magic hat. And, when you pack using a signed template
+code such as C<s>, an excess value may result in the sign bit
+getting set, and unpacking this will smartly return a negative value.
+
+16 bits won't get you too far with integers, but there is C<l> and C<L>
+for signed and unsigned 32-bit integers. And if this is not enough and
+your system supports 64 bit integers you can push the limits much closer
+to infinity with pack codes C<q> and C<Q>. A notable exception is provided
+by pack codes C<i> and C<I> for signed and unsigned integers of the 
+"local custom" variety: Such an integer will take up as many bytes as
+a local C compiler returns for C<sizeof(int)>, but it'll use I<at least>
+32 bits.
+
+Each of the integer pack codes C<sSlLqQ> results in a fixed number of bytes,
+no matter where you execute your program. This may be useful for some 
+applications, but it does not provide for a portable way to pass data 
+structures between Perl and C programs (bound to happen when you call 
+XS extensions or the Perl function C<syscall>), or when you read or
+write binary files. What you'll need in this case are template codes that
+depend on what your local C compiler compiles when you code C<short> or
+C<unsigned long>, for instance. These codes and their corresponding
+byte lengths are shown in the table below.  Since the C standard leaves
+much leeway with respect to the relative sizes of these data types, actual
+values may vary, and that's why the values are given as expressions in
+C and Perl. (If you'd like to use values from C<%Config> in your program
+you have to import it with C<use Config>.)
+
+   signed unsigned  byte length in C   byte length in Perl       
+   C<s!>  C<S!>     sizeof(short)      $Config{shortsize}
+   C<i!>  C<I!>     sizeof(int)        $Config{intsize}
+   C<l!>  C<L!>     sizeof(long)       $Config{longsize}
+   C<q!>  C<Q!>     sizeof(longlong)   $Config{longlongsize}
+
+The C<i!> and C<I!> codes aren't different from C<i> and C<I>; they are
+tolerated for completeness' sake.
+
+
+=head2 Unpacking a Stack Frame
+
+Requesting a particular byte ordering may be necessary when you work with
+binary data coming from some specific architecture while your program could
+run on a totally different system. As an example, assume you have 24 bytes
+containing a stack frame as it happens on an Intel 8086:
+
+      +---------+        +----+----+               +---------+
+ TOS: |   IP    |  TOS+4:| FL | FH | FLAGS  TOS+14:|   SI    |
+      +---------+        +----+----+               +---------+
+      |   CS    |        | AL | AH | AX            |   DI    |
+      +---------+        +----+----+               +---------+
+                         | BL | BH | BX            |   BP    |
+                         +----+----+               +---------+
+                         | CL | CH | CX            |   DS    |
+                         +----+----+               +---------+
+                         | DL | DH | DX            |   ES    |
+                         +----+----+               +---------+
+
+First, we note that this time-honored 16-bit CPU uses little-endian order,
+and that's why the low order byte is stored at the lower address. To
+unpack such a (signed) short we'll have to use code C<v>. A repeat
+count unpacks all 12 shorts:
+
+   my( $ip, $cs, $flags, $ax, $bx, $cd, $dx, $si, $di, $bp, $ds, $es ) =
+     unpack( 'v12', $frame );
+
+Alternatively, we could have used C<C> to unpack the individually
+accessible byte registers FL, FH, AL, AH, etc.:
+
+   my( $fl, $fh, $al, $ah, $bl, $bh, $cl, $ch, $dl, $dh ) =
+     unpack( 'C10', substr( $frame, 4, 10 ) );
+
+It would be nice if we could do this in one fell swoop: unpack a short,
+back up a little, and then unpack 2 bytes. Since Perl I<is> nice, it
+proffers the template code C<X> to back up one byte. Putting this all
+together, we may now write:
+
+   my( $ip, $cs,
+       $flags,$fl,$fh,
+       $ax,$al,$ah, $bx,$bl,$bh, $cx,$cl,$ch, $dx,$dl,$dh, 
+       $si, $di, $bp, $ds, $es ) =
+   unpack( 'v2' . ('vXXCC' x 5) . 'v5', $frame );
+
+We've taken some pains to get construct the template so that it matches
+the contents of our frame buffer. Otherwise we'd either get undefined values,
+or C<unpack> could not unpack all. If C<pack> runs out of items, it will
+supply null strings.
+
+
+=head2 How to Eat an Egg on a Net
+
+The pack code for big-endian (high order byte at the lowest address) is
+C<n> for 16 bit and C<N> for 32 bit integers. You use these codes
+if you know that your data comes from a compliant architecture, but,
+surprisingly enough, you should also use these pack codes if you
+exchange binary data, across the network, with some system that you
+know next to nothing about. The simple reason is that this
+order has been chosen as the I<network order>, and all standard-fearing
+programs ought to follow this convention. (This is, of course, a stern
+backing for one of the Lilliputian parties and may well influence the
+political development there.) So, if the protocol expects you to send
+a message by sending the length first, followed by just so many bytes,
+you could write:
+
+   my $buf = pack( 'N', length( $msg ) ) . $msg;
+
+or even:
+
+   my $buf = pack( 'NA*', length( $msg ), $msg );
+
+and pass C<$buf> to your send routine. Some protocols demand that the
+count should include the length of the count itself: then just add 4
+to the data length. (But make sure to read L<"Lengths and Widths"> before
+you really code this!)
+
+
+
+=head2 Floating point Numbers
+
+For packing floating point numbers you have the choice between the
+pack codes C<f> and C<d> which pack into (or unpack from) single-precision or
+double-precision representation as it is provided by your system. (There
+is no such thing as a network representation for reals, so if you want
+to send your real numbers across computer boundaries, you'd better stick
+to ASCII representation, unless you're absolutely sure what's on the other
+end of the line.)
+
+
+
+=head1 Exotic Templates
+
+
+=head2 Bit Strings
+
+Bits are the atoms in the memory world. Access to individual bits may
+have to be used either as a last resort or because it is the most
+convenient way to handle your data. Bit string (un)packing converts
+between strings containing a series of C<0> and C<1> characters and
+a sequence of bytes each containing a group of 8 bits. This is almost
+as simple as it sounds, except that there are two ways the contents of
+a byte may be written as a bit string. Let's have a look at an annotated
+byte:
+
+     7 6 5 4 3 2 1 0
+   +-----------------+
+   | 1 0 0 0 1 1 0 0 |
+   +-----------------+
+    MSB           LSB
+
+It's egg-eating all over again: Some think that as a bit string this should
+be written "10001100" i.e. beginning with the most significant bit, others
+insist on "00110001". Well, Perl isn't biased, so that's why we have two bit
+string codes:
+
+   $byte = pack( 'B8', '10001100' ); # start with MSB
+   $byte = pack( 'b8', '00110001' ); # start with LSB
+
+It is not possible to pack or unpack bit fields - just integral bytes.
+C<pack> always starts at the next byte boundary and "rounds up" to the
+next multiple of 8 by adding zero bits as required. (If you do want bit
+fields, there is L<perlfunc/vec>. Or you could implement bit field 
+handling at the character string level, using split, substr, and
+concatenation on unpacked bit strings.)
+
+To illustrate unpacking for bit strings, we'll decompose a simple
+status register (a "-" stands for a "reserved" bit):
+
+   +-----------------+-----------------+
+   | S Z - A - P - C | - - - - O D I T |
+   +-----------------+-----------------+
+    MSB           LSB MSB           LSB
+
+Converting these two bytes to a string can be done with the unpack 
+template C<'b16'>. To obtain the individual bit values from the bit
+string we use C<split> with the "empty" separator pattern which splits
+into individual characters. Bit values from the "reserved" positions are
+simply assigned to C<undef>, a convenient notation for "I don't care where
+this goes".
+
+   ($carry, undef, $parity, undef, $auxcarry, undef, $sign,
+    $trace, $interrupt, $direction, $overflow) =
+      split( '', unpack( 'b16', $status ) );
+
+We could have used an unpack template C<'b12'> just as well, since the
+last 4 bits can be ignored anyway. 
+
+
+=head2 Uuencoding
+
+Another odd-man-out in the template alphabet is C<u>, which packs an
+"uuencoded string". ("uu" is short for Unix-to-Unix.) Chances are that
+you won't ever need this encoding technique which was invented to overcome
+the shortcomings of old-fashioned transmission mediums that do not support
+other than simple ASCII data. The essential recipe is simple: Take three 
+bytes, or 24 bits. Split them into 4 six-packs, adding a space (0x20) to 
+each. Repeat until all of the data is blended. Fold groups of 4 bytes into 
+lines no longer than 60 and garnish them in front with the original byte count 
+(incremented by 0x20) and a C<"\n"> at the end. - The C<pack> chef will
+prepare this for you, a la minute, when you select pack code C<u> on the menu:
+
+   my $uubuf = pack( 'u', $bindat );
+
+A repeat count after C<u> sets the number of bytes to put into an
+uuencoded line, which is the maximum of 45 by default, but could be
+set to some (smaller) integer multiple of three. C<unpack> simply ignores
+the repeat count.
+
+
+=head2 Doing Sums
+
+An even stranger template code is C<%>E<lt>I<number>E<gt>. First, because 
+it's used as a prefix to some other template code. Second, because it
+cannot be used in C<pack> at all, and third, in C<unpack>, doesn't return the
+data as defined by the template code it precedes. Instead it'll give you an
+integer of I<number> bits that is computed from the data value by 
+doing sums. For numeric unpack codes, no big feat is achieved:
+
+    my $buf = pack( 'iii', 100, 20, 3 );
+    print unpack( '%32i3', $buf ), "\n";  # prints 123
+
+For string values, C<%> returns the sum of the byte values saving
+you the trouble of a sum loop with C<substr> and C<ord>:
+
+    print unpack( '%32A*', "\x01\x10" ), "\n";  # prints 17
+
+Although the C<%> code is documented as returning a "checksum":
+don't put your trust in such values! Even when applied to a small number
+of bytes, they won't guarantee a noticeable Hamming distance.
+
+In connection with C<b> or C<B>, C<%> simply adds bits, and this can be put
+to good use to count set bits efficiently:
+
+    my $bitcount = unpack( '%32b*', $mask );
+
+And an even parity bit can be determined like this:
+
+    my $evenparity = unpack( '%1b*', $mask );
+
+
+=head2  Unicode
+
+Unicode is a character set that can represent most characters in most of
+the world's languages, providing room for over one million different
+characters. Unicode 3.1 specifies 94,140 characters: The Basic Latin
+characters are assigned to the numbers 0 - 127. The Latin-1 Supplement with
+characters that are used in several European languages is in the next
+range, up to 255. After some more Latin extensions we find the character
+sets from languages using non-roman alphabets, interspersed with a
+variety of symbol sets such as currency symbols, Zapf Dingbats or Braille.
+(You might want to visit L<www.unicode.org> for a look at some of
+them - my personal favourites are Telugu and Kannada.)
+
+The Unicode character sets associates characters with integers. Encoding
+these numbers in an equal number of bytes would more than double the
+requirements for storing texts written in latin alphabets.
+The UTF-8 encoding avoids this by storing the most common (from a western
+point of view) characters in a single byte while encoding the rarer
+ones in three or more bytes.
+
+So what has this got to do with C<pack>? Well, if you want to convert
+between a Unicode number and its UTF-8 representation you can do so by
+using template code C<U>. As an example, let's produce the UTF-8
+representation of the Euro currency symbol (code number 0x20AC):
+
+   $UTF8{Euro} = pack( 'U', 0x20AC );
+
+Inspecting C<$UTF8{Euro}> shows that it contains 3 bytes: "\xe2\x82\xac". The
+round trip can be completed with C<unpack>:
+
+   $Unicode{Euro} = unpack( 'U', $UTF8{Euro} );
+
+Usually you'll want to pack or unpack UTF-8 strings:
+
+   # pack and unpack the Hebrew alphabet
+   my $alefbet = pack( 'U*', 0x05d0..0x05ea );
+   my @hebrew = unpack( 'U*', $utf );
+
+
+
+=head1 Lengths and Widths
+
+=head2 String Lengths
+
+In the previous section we've seen a network message that was constructed
+by prefixing the binary message length to the actual message. You'll find
+that packing a length followed by so many bytes of data is a 
+frequently used recipe since appending a null byte won't work
+if a null byte may be part of the data. - Here is an example where both
+techniques are used: after two null terminated strings with source and
+destination address, a Short Message (to a mobile phone) is sent after
+a length byte:
+
+   my $msg = pack( 'Z*Z*CA*', $src, $dst, length( $sm ), $sm );
+
+Unpacking this message can be done with the same template:
+
+   ( $src, $dst, $len, $sm ) = unpack( 'Z*Z*CA*', $msg );
+
+There's a subtle trap lurking in the offings: Adding another field after
+the Short Message (in variable C<$sm>) is all right when packing, but this
+cannot be unpacked naively:
+
+   # pack a message
+   my $msg = pack( 'Z*Z*CA*C', $src, $dst, length( $sm ), $sm, $prio );
+   
+   # unpack fails - $prio remains undefined!
+   ( $src, $dst, $len, $sm, $prio ) = unpack( 'Z*Z*CA*C', $msg );
+
+The pack code C<A*> gobbles up all remaining bytes, and C<$prio> remains
+undefined! Before we let disappointment dampen the morale: Perl's got
+the trump card to make this trick too, just a little further up the sleeve.
+Watch this:
+
+   # pack a message: ASCIIZ, ASCIIZ, length/string, byte
+   my $msg = pack( 'Z* Z* C/A* C', $src, $dst, $sm, $prio );
+
+   # unpack
+   ( $src, $dst, $sm, $prio ) = unpack( 'Z* Z* C/A* C', $msg );
+
+Combining two pack codes with a slash (C</>) associates them with a single
+value from the argument list. In C<pack>, the length of the argument is
+taken and packed according to the first code while the argument itself
+is added after being converted with the template code after the slash.
+This saves us the trouble of inserting the C<length> call, but it is 
+in C<unpack> where we really score: The value of the length byte marks the
+end of the string to be taken from the buffer. Since this combination
+doesn't make sense execpt when the second pack code isn't C<a*>, C<A*>
+or C<Z*>, Perl won't let you.
+
+The pack code preceding C</> may be anything that's fit to represent a
+number: All the numeric binary pack codes, and even text codes such as
+C<A4> or C<Z*>:
+
+   # pack/unpack a string preceded by its length in ASCII
+   my $buf = pack( 'A4/A*', "Humpty-Dumpty" );
+   # unpack $buf: '13  Humpty-Dumpty'
+   my $txt = unpack( 'A4/A*', $buf );
+
+
+=head2 Dynamic Templates
+
+So far, we've seen literals used as templates. If the list of pack
+items doesn't have fixed length, an expression constructing the
+template has to be used. Here's an example:
+To store named string values in a way that can be conveniently parsed
+by a C program, we create a sequence of names and null terminated ASCII
+strings, with C<=> between the name and the value, followed by an
+additional delimiting null byte. Here's how:
+
+   my $env = pack( 'A*A*Z*' x keys( %Env ) . 'C',
+                   map{ ( $_, '=', $Env{$_} ) } keys( %Env ), 0 );
+
+For the reverse operation, we'll have to determine the number of items
+in the buffer before we can let C<unpack> rip it apart:
+
+   my $n = ( $env =~ s/\0/\0/g - 1 );
+   my %env = map { split( '=', $_ ) } unpack( 'Z*' x $n, $env );
+
+The substitution counts the null bytes. The C<unpack> call returns a
+list of name-value pairs each of which is taken apart in the C<map>
+block.
+
+
+=head2 Another Portable Binary Encoding
+
+The pack code C<w> has been added to support a portable binary data
+encoding scheme that goes way beyond simple integers. (Details can
+be found at L<Casbah.org>, the Scarab project.)  A BER (Binary Encoded
+Representation) compressed unsigned integer stores base 128
+digits, most significant digit first, with as few digits as possible.
+Bit eight (the high bit) is set on each byte except the last. There
+is no size limit to BER encoding, but Perl won't go to extremes.
+
+   my $berbuf = pack( 'w*', 1, 128, 128+1, 128*128+127 );
+
+A hex dump of C<$berbuf>, with spaces inserted at the right places,
+shows 01 8100 8101 81807F. Since the last byte is always less than
+128, C<unpack> knows where to stop.
+
+
+=head1 Packing and Unpacking C Structures
+
+In previous sections we have seen how to pack numbers and character
+strings. If it were not for a couple of snags we could conclude this
+section right away with the terse remark that C structures don't
+contain anything else, and therefore you already know all there is to it.
+Sorry, no: read on, please.
+
+=head2 The Alignment Pit
+
+In the consideration of speed against memory requirements the balance
+has been tilted in favor of faster execution. This has influenced the
+way C compilers allocate memory for structures: On architectures
+where a 16-bit or 32-bit operand can be moved faster between places in
+memory, or to or from a CPU register, if it is aligned at an even or 
+multiple-of-four or even at a multiple-of eight address, a C compiler
+will give you this speed benefit by stuffing extra bytes into structures.
+If you don't cross the C shoreline this is not likely to cause you any
+grief (although you should care when you design large data structures).
+
+To see how this affects C<pack> and C<unpack>, we'll compare these two
+C structures:
+
+   typedef struct {
+     char     c1;
+     short    s;
+     char     c2;
+     long     l;
+   } gappy_t;
+
+   typedef struct {
+     long     l;
+     short    s;
+     char     c1;
+     char     c2;
+   } dense_t;
+
+Typically, a C compiler allocates 12 bytes to a C<gappy_t> variable, but
+requires only 8 bytes for a C<dense_t>. After investigating this further,
+we can draw memory maps, showing where the extra 4 bytes are hidden:
+
+   0           +4          +8          +12
+   +--+--+--+--+--+--+--+--+--+--+--+--+
+   |c1|xx|  s  |c2|xx|xx|xx|     l     |    xx = fill byte
+   +--+--+--+--+--+--+--+--+--+--+--+--+
+   gappy_t
+
+   0           +4          +8
+   +--+--+--+--+--+--+--+--+
+   |     l     |  h  |c1|c2|
+   +--+--+--+--+--+--+--+--+
+   dense_t
+
+And that's where the first quirk strikes: C<pack> and C<unpack>
+templates have to be stuffed with C<x> codes to get those extra fill bytes.
+
+The natural question: "Why can't Perl compensate for the gaps?" warrants
+an answer. One good reason is that C compilers might provide (non-ANSI)
+extensions permitting all sorts of fancy control over the way structures
+are aligned, even at the level of an individual structure field. And, if
+this were not enough, there is an insidious thing called C<union> where
+the amount of fill bytes cannot be derived from the alignment of the next
+item alone.
+
+OK, so let's bite the bullet. Here's one way to get the alignment right
+by inserting template codes C<x>, which don't take a corresponding item 
+from the list:
+
+  my $gappy = pack( 'cxs cxxx l!', $c1, $s, $c2, $l );
+
+Note the C<!> after C<l>: We want to make sure that we pack a long
+integer as it is compiled by our C compiler.
+
+Counting bytes and watching alignments in lengthy structures is bound to 
+be a drag. Isn't there a way we can create the template with a simple
+program? Here's a C program that does the trick:
+
+   #include <stdio.h>
+   #include <stddef.h>
+
+   typedef struct {
+     char     fc1;
+     short    fs;
+     char     fc2;
+     long     fl;
+   } gappy_t;
+
+   #define Pt(struct,field,tchar) \
+     printf( "@%d%s ", offsetof(struct,field), # tchar );
+
+   int main(){
+     Pt( gappy_t, fc1, c  );
+     Pt( gappy_t, fs,  s! );
+     Pt( gappy_t, fc2, c  );
+     Pt( gappy_t, fl,  l! );
+     printf( "\n" );
+   }
+
+The output line can be used as a template in a C<pack> or C<unpack> call:
+
+  my $gappy = pack( '@0c @2s! @4c @8l!', $c1, $s, $c2, $l );
+
+Gee, yet another template code - as if we hadn't plenty. But 
+C<@> saves our day by enabling us to specify the offset from the beginning
+of the pack buffer to the next item: This is just the value
+the C<offsetof> macro (defined in C<E<lt>stddef.hE<gt>>) returns when
+given a C<struct> type and one of its field names ("member-designator" in 
+C standardese).
+
+
+=head2 Alignment, Take 2
+
+I'm afraid that we're not quite through with the alignment catch yet. The
+hydra raises another ugly head when you pack arrays of structures:
+
+   typedef struct {
+     short    count;
+     char     glyph;
+   } cell_t;
+
+   typedef cell_t buffer_t[BUFLEN];
+
+Where's the catch? Padding is neither required before the first field C<count>,
+nor between this and the next field C<glyph>, so why can't we simply pack
+like this:
+
+   # something goes wrong here:
+   pack( 's!a' x @buffer,
+         map{ ( $_->{count}, $_->{glyph} ) } @buffer );
+
+This packs C<3*@buffer> bytes, but it turns out that the size of 
+C<buffer_t> is four times C<BUFLEN>! The moral of the story is that
+the required alignment of a structure or array is propagated to the
+next higher level where we have to consider padding I<at the end>
+of each component as well. Thus the correct template is:
+
+   pack( 's!ax' x @buffer,
+         map{ ( $_->{count}, $_->{glyph} ) } @buffer );
+
+
+
+=head2 Pointers for How to Use Them
+
+The title of this section indicates the second problem you may run into
+sooner or later when you pack C structures. If the function you intend
+to call expects a, say, C<void *> value, you I<cannot> simply take
+a reference to a Perl variable. (Although that value certainly is a
+memory address, it's not the address where the variable's contents are
+stored.)
+
+Template code C<P> promises to pack a "pointer to a fixed length string".
+Isn't this what we want? Let's try:
+
+    # allocate some storage and pack a pointer to it
+    my $memory = "\x00" x $size;
+    my $memptr = pack( 'P', $memory );
+
+But wait: doesn't C<pack> just return a sequence of bytes? How can we pass this
+string of bytes to some C code expecting a pointer which is, after all,
+nothing but a number? The answer is simple: We have to obtain the numeric
+address from the bytes returned by C<pack>.
+
+    my $ptr = unpack( 'L!', $memptr );
+
+Obviously this assumes that it is possible to typecast a pointer
+to an unsigned long and vice versa, which frequently works but should not
+be taken as a universal law. - Now that we have this pointer the next question
+is: How can we put it to good use? We need a call to some C function
+where a pointer is expected. The read(2) system call comes to mind:
+
+    ssize_t read(int fd, void *buf, size_t count);
+
+After reading L<perlfunc> explaining how to use C<syscall> we can write
+this Perl function copying a file to standard output:
+
+    require 'syscall.ph';
+    sub cat($){
+        my $path = shift();
+        my $size = -s $path;
+        my $memory = "\x00" x $size;  # allocate some memory
+        my $ptr = unpack( 'L', pack( 'P', $memory ) );
+        open( F, $path ) || die( "$path: cannot open ($!)\n" );
+        my $fd = fileno(F);
+        my $res = syscall( &SYS_read, fileno(F), $ptr, $size );
+        print $memory;
+        close( F );
+    }
+
+This is neither a specimen of simplicity nor a paragon of portability but
+it illustrates the point: We are able to sneak behind the scenes and
+access Perl's otherwise well-guarded memory! (Important note: Perl's
+C<syscall> does I<not> require you to construct pointers in this roundabout
+way. You simply pass a string variable, and Perl forwards the address.) 
+
+How does C<unpack> with C<P> work? Imagine some pointer in the buffer
+about to be unpacked: If it isn't the null pointer (which will smartly
+produce the C<undef> value) we have a start address - but then what?
+Perl has no way of knowing how long this "fixed length string" is, so
+it's up to you to specify the actual size as an explicit length after C<P>.
+
+   my $mem = "abcdefghijklmn";
+   print unpack( 'P5', pack( 'P', $mem ) ); # prints "abcde"
+
+As a consequence, C<pack> ignores any number or C<*> after C<P>.
+
+
+Now that we have seen C<P> at work, we might as well give C<p> a whirl.
+Why do we need a second template code for packing pointers at all? The 
+answer lies behind the simple fact that an C<unpack> with C<p> promises
+a null-terminated string starting at the address taken from the buffer,
+and that implies a length for the data item to be returned:
+
+   my $buf = pack( 'p', "abc\x00efhijklmn" );
+   print unpack( 'p', $buf );    # prints "abc"
+
+
+
+Albeit this is apt to be confusing: As a consequence of the length being
+implied by the string's length, a number after pack code C<p> is a repeat
+count, not a length as after C<P>. 
+
+
+Using C<pack(..., $x)> with C<P> or C<p> to get the address where C<$x> is
+actually stored must be used with circumspection. Perl's internal machinery
+considers the relation between a variable and that address as its very own 
+private matter and doesn't really care that we have obtained a copy. Therefore:
+
+=over 4
+
+=item * 
+
+Do not use C<pack> with C<p> or C<P> to obtain the address of variable
+that's bound to go out of scope (and thereby freeing its memory) before you
+are done with using the memory at that address.
+
+=item * 
+
+Be very careful with Perl operations that change the value of the
+variable. Appending something to the variable, for instance, might require
+reallocation of its storage, leaving you with a pointer into no-man's land.
+
+=item * 
+
+Don't think that you can get the address of a Perl variable
+when it is stored as an integer or double number! C<pack('P', $x)> will
+force the variable's internal representation to string, just as if you
+had written something like C<$x .= ''>.
+
+=back
+
+It's safe, however, to P- or p-pack a string literal, because Perl simply
+allocates an anonymous variable.
+
+
+
+=head1 Pack Recipes
+
+Here are a collection of (possibly) useful canned recipes for C<pack>
+and C<unpack>:
+
+    # Convert IP address for socket functions
+    pack( "C4", split /\./, "123.4.5.6" ); 
+
+    # Count the bits in a chunk of memory (e.g. a select vector)
+    unpack( '%32b*', $mask );
+
+    # Determine the endianness of your system
+    $is_little_endian = unpack( 'c', pack( 's', 1 ) );
+    $is_big_endian = unpack( 'xc', pack( 's', 1 ) );
+
+    # Determine the number of bits in a native integer
+    $bits = unpack( '%32I!', ~0 );
+
+    # Prepare argument for the nanosleep system call
+    my $timespec = pack( 'L!L!', $secs, $nanosecs );
+
+    # A simple memory dump
+    my $i;
+    map { ++$i % 16 ? "$_ " : "$_\n" }
+    unpack( 'H2' x length( $mem ), $mem );
+
+
+=head1 Authors
+
+Simon Cozens and Wolfgang Laun.
+
author	Jarkko Hietaniemi <jhi@iki.fi>	2001-11-28 14:00:12 +0000
committer	Jarkko Hietaniemi <jhi@iki.fi>	2001-11-28 14:00:12 +0000
commit	34babc168865edbd6963528152c3c3b68680b55f (patch)
tree	7c115a5843147fddd36cf90e7d95fdaf58f00f70 /pod/perlpacktut.pod
parent	d9ae6319b04989bd3d538f5691a909e211d38fbb (diff)
download	perl-34babc168865edbd6963528152c3c3b68680b55f.tar.gz