Diffstat (limited to 'pod/perldsc.pod')
-rw-r--r-- pod/perldsc.pod 348
1 files changed, 348 insertions, 0 deletions
diff --git a/pod/perldsc.pod b/pod/perldsc.pod
new file mode 100644
index 0000000000..1d51af8ab3
--- /dev/null
+++ b/pod/perldsc.pod
@@ -0,0 +1,348 @@
+=head1 TITLE
+
+perldsc - Manipulating Complex Data Structures in Perl
+
+=head1 INTRODUCTION
+
+The single feature most sorely lacking in the Perl programming language
+prior to its 5.0 release was complex data structures. Even without direct
+language support, some valiant programmers did manage to emulate them, but
+it was hard work and not for the faint of heart. You could occasionally
+get away with the C<$m{$LoL,$b}> notation borrowed from I<awk> in which the
+keys are actually more like a single concatenated string C<"$LoL$b">, but
+traversal and sorting were difficult. More desperate programmers even
+hacked Perl's internal symbol table directly, a strategy that proved hard
+to develop and maintain--to put it mildly.
+
+The 5.0 release of Perl let us have complex data structures. You
+may now write something like this and, all of a sudden, you have an array
+with three dimensions!
+
+ for $x (1 .. 10) {
+ for $y (1 .. 10) {
+ for $z (1 .. 10) {
+ $LoL[$x][$y][$z] =
+ $x ** $y + $z;
+ }
+ }
+ }
+
+Alas, however simple this may appear, underneath it's a much more
+elaborate construct than meets the eye!
+
+How do you print it out? Why can't you just say C<print @LoL>? How do
+you sort it? How can you pass it to a function or get one of these back
+from a function? Is it an object? Can you save it to disk to read
+back later? How do you access whole rows or columns of that matrix? Do
+all the values have to be numeric?
+
+As you see, it's quite easy to become confused. While some small portion
+of the blame for this can be attributed to the reference-based
+implementation, it's really more due to a lack of existing documentation with
+examples designed for the beginner.
+
+This document is meant to be a detailed but understandable treatment of
+the many different sorts of data structures you might want to develop. It should
+also serve as a cookbook of examples. That way, when you need to create one of these
+complex data structures, you can just pinch, pilfer, or purloin
+a drop-in example from here.
+
+Let's look at each of these possible constructs in detail. There are separate
+documents on each of the following:
+
+=over 5
+
+=item * arrays of arrays
+
+=item * hashes of arrays
+
+=item * arrays of hashes
+
+=item * hashes of hashes
+
+=item * more elaborate constructs
+
+=item * recursive and self-referential data structures
+
+=item * objects
+
+=back
+
+But for now, let's look at some of the general issues common to all
+of these types of data structures.
+
+=head1 REFERENCES
+
+The most important thing to understand about all data structures in Perl
+--including multidimensional arrays--is that even though they might
+appear otherwise, Perl C<@ARRAY>s and C<%HASH>es are all internally
+one-dimensional. They can only hold scalar values (meaning a string,
+number, or a reference). They cannot directly contain other arrays or
+hashes, but instead contain I<references> to other arrays or hashes.
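+
+As a minimal sketch of that point, the following compares listing two
+arrays directly (which flattens them) with storing references to them.
+The names C<@odds>, C<@evens>, C<@flat>, and C<@nested> are made up for
+this illustration.
+

```perl
use strict;
use warnings;

# Two ordinary arrays (hypothetical sample data).
my @odds  = (1, 3, 5);
my @evens = (2, 4, 6);

# Listing the arrays directly flattens them into one six-element list.
my @flat = (@odds, @evens);

# Storing references keeps them as two separate rows.
my @nested = (\@odds, \@evens);

print scalar(@flat), "\n";      # number of elements in the flat array
print scalar(@nested), "\n";    # number of references in the nested array
print ref($nested[1]), "\n";    # each element is an ARRAY reference
print $nested[1][2], "\n";      # third element of the second row
```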
+
+You can't use a reference to an array or hash in quite the same way that
+you would a real array or hash. For C or C++ programmers unused to distinguishing
+between arrays and pointers to the same, this can be confusing. If so,
+just think of it as the difference between a structure and a pointer to a
+structure.
+
+You can (and should) read more about references in the perlref(1) man
+page. Briefly, references are rather like pointers that know what they
+point to. (Objects are also a kind of reference, but we won't be needing
+them right away--if ever.) This means that when you have something that
+looks to you like an access to a two-or-more-dimensional array and/or hash,
+what's really going on is that the base type is
+merely a one-dimensional entity that contains references to the next
+level. It's just that you can I<use> it as though it were a
+two-dimensional one. This is also how multidimensional arrays built
+from arrays of pointers work in C.
+
+ $list[7][12] # array of arrays
+ $list[7]{string} # array of hashes
+ $hash{string}[7] # hash of arrays
+ $hash{string}{'another string'} # hash of hashes
+
+Now, because the top level only contains references, if you try to print
+out your array with a simple print() function, you'll get something
+that doesn't look very nice, like this:
+
+ @LoL = ( [2, 3], [4, 5, 7], [0] );
+ print $LoL[1][2];
+ 7
+ print @LoL;
+ ARRAY(0x83c38)ARRAY(0x8b194)ARRAY(0x8b1d0)
+
+
+That's because Perl doesn't (ever) implicitly dereference your variables.
+If you want to get at the thing a reference is referring to, then you have
+to do this yourself using either prefix typing indicators, like
+C<${$blah}>, C<@{$blah}>, C<@{$blah[$i]}>, or else postfix pointer arrows,
+like C<$a-E<gt>[3]>, C<$h-E<gt>{fred}>, or even C<$ob-E<gt>method()-E<gt>[3]>.
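+
+One way to print the rows readably is to dereference each one yourself.
+This is only a sketch; the loop variable C<$row> and the helper array
+C<@lines> are names chosen for the example.
+

```perl
use strict;
use warnings;

my @LoL = ( [2, 3], [4, 5, 7], [0] );

# Dereference each row explicitly; "@$row" interpolates the row's
# elements separated by spaces.
my @lines;
for my $row (@LoL) {
    push @lines, "@$row";
}
print "$_\n" for @lines;

# The arrow form and the prefix form reach the same element.
print $LoL[1]->[2], "\n";
print ${ $LoL[1] }[2], "\n";
```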
+
+=head1 COMMON MISTAKES
+
+The two most common mistakes made in constructing something like
+an array of arrays are either accidentally counting the number of
+elements or else taking a reference to the same memory location
+repeatedly. Here's the case where you just get the count instead
+of a nested array:
+
+ for $i (1..10) {
+ @list = somefunc($i);
+ $LoL[$i] = @list; # WRONG!
+ }
+
+That's just the simple case of assigning an array to a scalar and getting
+its element count. If that's what you really and truly want, then you
+might do well to consider being a tad more explicit about it, like this:
+
+ for $i (1..10) {
+ @list = somefunc($i);
+ $counts[$i] = scalar @list;
+ }
+
+Here's the case of taking a reference to the same memory location
+again and again:
+
+ for $i (1..10) {
+ @list = somefunc($i);
+ $LoL[$i] = \@list; # WRONG!
+ }
+
+So, just what's the big problem with that? It looks right, doesn't it?
+After all, I just told you that you need an array of references, so by
+golly, you've made me one!
+
+Unfortunately, while this is true, it's still broken. All the references
+in @LoL refer to the I<very same place>, and they will therefore all hold
+whatever was last in @list! It's similar to the problem demonstrated in
+the following C program:
+
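+ #include <stdio.h>
+ #include <pwd.h>
+
+ int main(void) {
+ /* getpwnam() may return a pointer to the same static buffer each call */
+ struct passwd *rp, *dp;
+ rp = getpwnam("root");
+ dp = getpwnam("daemon");
+
+ printf("daemon name is %s\nroot name is %s\n",
+ dp->pw_name, rp->pw_name);
+ return 0;
+ }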
+
+Which will print
+
+ daemon name is daemon
+ root name is daemon
+
+The problem is that both C<rp> and C<dp> are pointers to the same location
+in memory! In C, you'd have to remember to malloc() yourself some new
+memory. In Perl, you'll want to use the array constructor C<[]> or the
+hash constructor C<{}> instead. Here's the right way to do the preceding
+broken code fragments:
+
+ for $i (1..10) {
+ @list = somefunc($i);
+ $LoL[$i] = [ @list ];
+ }
+
+The square brackets make a reference to a new array with a I<copy>
+of what's in @list at the time of the assignment. This is what
+you want.
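+
+To see the difference side by side, here is a small sketch that fills
+one array with C<\@list> and another with C<[ @list ]>. The body of
+somefunc() is a hypothetical stand-in, and C<@broken> and C<@fixed> are
+names invented for the comparison.
+

```perl
use strict;
use warnings;

# Hypothetical stand-in for somefunc(): returns $i copies of $i.
sub somefunc { my ($i) = @_; return ($i) x $i; }

our @list;              # one shared package variable, as in the text
my (@broken, @fixed);

for my $i (1 .. 3) {
    @list = somefunc($i);
    $broken[$i] = \@list;       # every slot points at the same @list
    $fixed[$i]  = [ @list ];    # each slot gets its own copy
}

print "broken: @{ $broken[1] }\n";   # shows the last contents of @list
print "fixed:  @{ $fixed[1] }\n";    # shows what row 1 held at the time
```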
+
+Note that this will produce something similar, but it's
+much harder to read:
+
+ for $i (1..10) {
+ @list = 0 .. $i;
+ @{$LoL[$i]} = @list;
+ }
+
+Is it the same? Well, maybe so--and maybe not. The subtle difference
+is that when you assign something in square brackets, you know for sure
+it's always a brand new reference with a new I<copy> of the data.
+Something else could be going on in this new case with the C<@{$LoL[$i]}>
+dereference on the left-hand-side of the assignment. It all depends on
+whether C<$LoL[$i]> had been undefined to start with, or whether it
+already contained a reference. If you had already populated @LoL with
+references, as in
+
+ $LoL[3] = \@another_list;
+
+then the assignment with the indirection on the left-hand-side would
+use the existing reference that was already there:
+
+ @{$LoL[3]} = @list;
+
+Of course, this I<would> have the "interesting" effect of clobbering
+@another_list. (Have you ever noticed how when a programmer says
+something is "interesting", that rather than meaning "intriguing",
+they're disturbingly more apt to mean that it's "annoying",
+"difficult", or both? :-)
+
+So just remember to always use the array or hash constructors with C<[]>
+or C<{}>, and you'll be fine, although it's not always optimally
+efficient.
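+
+The clobbering described above can be reproduced with a few lines. This
+is a sketch with made-up data; slot 4 is included only to contrast the
+case where the element starts out undefined.
+

```perl
use strict;
use warnings;

my @another_list = ('a', 'b', 'c');
my @list         = (1, 2);
my @LoL;

$LoL[3] = \@another_list;   # slot 3 already holds a reference
@{ $LoL[3] } = @list;       # assigns through it, overwriting @another_list
@{ $LoL[4] } = @list;       # slot 4 was undefined: a new array springs up

print "@another_list\n";    # now holds the contents of @list
print "@{ $LoL[4] }\n";
```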
+
+Surprisingly, the following dangerous-looking construct will
+actually work out fine:
+
+ for $i (1..10) {
+ my @list = somefunc($i);
+ $LoL[$i] = \@list;
+ }
+
+That's because my() is more of a run-time statement than it is a
+compile-time declaration I<per se>. This means that the my() variable is
+remade afresh each time through the loop. So even though it I<looks> as
+though you stored the same variable reference each time, you actually did
+not! This is a subtle distinction that can produce more efficient code at
+the risk of misleading all but the most experienced of programmers. So I
+usually advise against teaching it to beginners. In fact, except for
+passing arguments to functions, I seldom like to see the gimme-a-reference
+operator (backslash) used much at all in code. Instead, I advise
+beginners that they (and most of the rest of us) should try to use the
+much more easily understood constructors C<[]> and C<{}> instead of
+relying upon lexical (or dynamic) scoping and hidden reference-counting to
+do the right thing behind the scenes.
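+
+A quick way to convince yourself that my() really does hand out a fresh
+array each time is to compare the stored references. The somefunc()
+below is again a hypothetical stand-in.
+

```perl
use strict;
use warnings;

# Hypothetical stand-in for somefunc(): returns the list 1 .. $i.
sub somefunc { my ($i) = @_; return (1 .. $i); }

my @LoL;
for my $i (1 .. 3) {
    my @list = somefunc($i);    # a fresh @list on every iteration
    $LoL[$i] = \@list;          # so each stored reference is distinct
}

print "@{ $LoL[1] }\n";
print "@{ $LoL[3] }\n";
print( ($LoL[1] != $LoL[3]) ? "distinct\n" : "shared\n" );
```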
+
+In summary:
+
+ $LoL[$i] = [ @list ]; # usually best
+ $LoL[$i] = \@list; # perilous; just how my() was that list?
+ @{ $LoL[$i] } = @list; # way too tricky for most programmers
+
+
+=head1 CAVEAT ON PRECEDENCE
+
+Speaking of things like C<@{$LoL[$i]}>, the following are actually the
+same thing:
+
+ $listref->[2][2] # clear
+ $$listref[2][2] # confusing
+
+That's because Perl's precedence rules on its five prefix dereferencers
+(which look like someone swearing: C<$ @ * % &>) make them bind more
+tightly than the postfix subscripting brackets or braces! This will no
+doubt come as a great shock to the C or C++ programmer, who is quite
+accustomed to using C<*a[i]> to mean what's pointed to by the I<i'th>
+element of C<a>. That is, they first take the subscript, and only then
+dereference the thing at that subscript. That's fine in C, but this isn't C.
+
+The seemingly equivalent construct in Perl, C<$$listref[$i]>, first does
+the deref of C<$listref>, making it take C<$listref> as a reference to an
+array, and then dereferences that, and finally tells you the I<i'th> value
+of the array pointed to by C<$listref>. If you wanted the C notion, you'd have to
+write C<${$LoL[$i]}> to force the C<$LoL[$i]> to get evaluated first
+before the leading C<$> dereferencer.
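+
+Both readings can be checked in a short sketch. The sample data and the
+scalars C<$x> and C<$y> are assumptions made for the illustration; here
+C<@LoL> holds references to scalars so that C<${$LoL[$i]}> has something
+to dereference.
+

```perl
use strict;
use warnings;

my $listref = [ [1, 2, 3], [4, 5, 6], [7, 8, 9] ];

# Both forms dereference $listref first, then apply the subscripts.
print $listref->[2][2], "\n";
print $$listref[2][2], "\n";

# The "C notion": subscript @LoL first, then dereference that element.
my ($x, $y) = (10, 20);
my @LoL = (\$x, \$y);
print ${ $LoL[1] }, "\n";
```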
+
+=head1 WHY YOU SHOULD ALWAYS C<use strict>
+
+If this is starting to sound scarier than it's worth, relax. Perl has
+some features to help you avoid its most common pitfalls. The best
+way to avoid getting confused is to start every program like this:
+
+ #!/usr/bin/perl -w
+ use strict;
+
+This way, you'll be forced to declare all your variables with my(), and
+accidental "symbolic dereferencing" will be disallowed. Therefore if you'd done
+this:
+
+ my $listref = [
+ [ "fred", "barney", "pebbles", "bambam", "dino", ],
+ [ "homer", "bart", "marge", "maggie", ],
+ [ "george", "jane", "alroy", "judy", ],
+ ];
+
+ print $listref[2][2];
+
+The compiler would immediately flag that as an error I<at compile time>,
+because you were accidentally accessing C<@listref>, an undeclared
+variable, and it would thereby remind you to instead write:
+
+ print $listref->[2][2]
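+
+If you'd like to watch the complaint without aborting your script, one
+option is to compile the mistaken line inside a string eval and look at
+C<$@>. This is only a sketch: the shortened data and the names C<$ok>
+and C<$error> are invented for the example, and the exact wording of
+the message varies between Perl versions.
+

```perl
use strict;
use warnings;

# Compile the mistaken snippet at run time so the error can be inspected
# instead of stopping this script.
my $ok = eval q{
    my $listref = [ [ "fred", "barney" ], [ "homer", "bart" ] ];
    my $name = $listref[1][1];    # mistaken: this refers to @listref
    1;
};
my $error = $@;

print "compiled cleanly\n" if $ok;
print "strict complained: $error" unless $ok;

# The arrow form compiles and runs.
my $listref = [ [ "fred", "barney" ], [ "homer", "bart" ] ];
print $listref->[1][1], "\n";
```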
+
+=head1 DEBUGGING
+
+The standard Perl debugger in 5.001 doesn't do a very nice job of
+printing out complex data structures. However, the perl5db that
+Ilya Zakharevich E<lt>F<ilya@math.ohio-state.edu>E<gt>
+wrote, which is accessible at
+
+ ftp://ftp.perl.com/pub/perl/ext/perl5db-kit-0.9.tar.gz
+
+has several new features, including command line editing as well
+as the C<X> command to dump out complex data structures. For example,
+given a $LoL assigned the same structure as $listref above, here's the debugger output:
+
+ DB<1> X $LoL
+ $LoL = ARRAY(0x13b5a0)
+ 0 ARRAY(0x1f0a24)
+ 0 'fred'
+ 1 'barney'
+ 2 'pebbles'
+ 3 'bambam'
+ 4 'dino'
+ 1 ARRAY(0x13b558)
+ 0 'homer'
+ 1 'bart'
+ 2 'marge'
+ 3 'maggie'
+ 2 ARRAY(0x13b540)
+ 0 'george'
+ 1 'jane'
+ 2 'alroy'
+ 3 'judy'
+
+There's also a lower-case B<x> command which is nearly the same.
+
+=head1 SEE ALSO
+
+perlref(1), perldata(1)
+
+=head1 AUTHOR
+
+Tom Christiansen E<lt>F<tchrist@perl.com>E<gt>
+
+Last update:
+Sat Oct 7 22:41:09 MDT 1995
+