Diffstat (limited to 'pod/perldsc.pod')
-rw-r--r-- | pod/perldsc.pod | 348 |
1 files changed, 348 insertions, 0 deletions
diff --git a/pod/perldsc.pod b/pod/perldsc.pod
new file mode 100644
index 0000000000..1d51af8ab3
--- /dev/null
+++ b/pod/perldsc.pod
@@ -0,0 +1,348 @@
+=head1 TITLE
+
+perldsc - Manipulating Complex Data Structures in Perl
+
+=head1 INTRODUCTION
+
+The single feature most sorely lacking in the Perl programming language
+prior to its 5.0 release was complex data structures. Even without direct
+language support, some valiant programmers did manage to emulate them, but
+it was hard work and not for the faint of heart. You could occasionally
+get away with the C<$m{$LoL,$b}> notation borrowed from I<awk> in which the
+keys are actually more like a single concatenated string C<"$LoL$b">, but
+traversal and sorting were difficult. More desperate programmers even
+hacked Perl's internal symbol table directly, a strategy that proved hard
+to develop and maintain--to put it mildly.
+
+The 5.0 release of Perl let us have complex data structures. You
+may now write something like this and all of a sudden, you'd have an array
+with three dimensions!
+
+    for $x (1 .. 10) {
+        for $y (1 .. 10) {
+            for $z (1 .. 10) {
+                $LoL[$x][$y][$z] =
+                    $x ** $y + $z;
+            }
+        }
+    }
+
+Alas, however simple this may appear, underneath it's a much more
+elaborate construct than meets the eye!
+
+How do you print it out? Why can't you just say C<print @LoL>? How do
+you sort it? How can you pass it to a function or get one of these back
+from a function? Is it an object? Can you save it to disk to read
+back later? How do you access whole rows or columns of that matrix? Do
+all the values have to be numeric?
+
+As you see, it's quite easy to become confused. While some small portion
+of the blame for this can be attributed to the reference-based
+implementation, it's really more due to a lack of existing documentation with
+examples designed for the beginner.
+
+This document is meant to be a detailed but understandable treatment of
+the many different sorts of data structures you might want to develop. It should
+also serve as a cookbook of examples. That way, when you need to create one of these
+complex data structures, you can just pinch, pilfer, or purloin
+a drop-in example from here.
+
+Let's look at each of these possible constructs in detail. There are separate
+documents on each of the following:
+
+=over 5
+
+=item * arrays of arrays
+
+=item * hashes of arrays
+
+=item * arrays of hashes
+
+=item * hashes of hashes
+
+=item * more elaborate constructs
+
+=item * recursive and self-referential data structures
+
+=item * objects
+
+=back
+
+But for now, let's look at some of the general issues common to all
+of these types of data structures.
+
+=head1 REFERENCES
+
+The most important thing to understand about all data structures in Perl
+--including multidimensional arrays--is that even though they might
+appear otherwise, Perl C<@ARRAY>s and C<%HASH>es are all internally
+one-dimensional. They can only hold scalar values (meaning a string,
+number, or a reference). They cannot directly contain other arrays or
+hashes, but instead contain I<references> to other arrays or hashes.
+
+You can't use a reference to an array or hash in quite the same way that
+you would a real array or hash. For C or C++ programmers unused to distinguishing
+between arrays and pointers to the same, this can be confusing. If so,
+just think of it as the difference between a structure and a pointer to a
+structure.
+
+You can (and should) read more about references in the perlref(1) man
+page. Briefly, references are rather like pointers that know what they
+point to. (Objects are also a kind of reference, but we won't be needing
+them right away--if ever.) That means that when you have something that
+looks to you like an access to a two-or-more-dimensional array and/or hash,
+what's really going on is that in all these cases, the base type is
+merely a one-dimensional entity that contains references to the next
+level. It's just that you can I<use> it as though it were a
+two-dimensional one. This is much like the way a C array of
+pointers to arrays works.
+
+    $list[7][12]                    # array of arrays
+    $list[7]{string}                # array of hashes
+    $hash{string}[7]                # hash of arrays
+    $hash{string}{'another string'} # hash of hashes
+
+Now, because the top level only contains references, if you try to print
+out your array with a simple print() function, you'll get something
+that doesn't look very nice, like this:
+
+    @LoL = ( [2, 3], [4, 5, 7], [0] );
+    print $LoL[1][2];
+  7
+    print @LoL;
+  ARRAY(0x83c38)ARRAY(0x8b194)ARRAY(0x8b1d0)
+
+
+That's because Perl doesn't (ever) implicitly dereference your variables.
+If you want to get at the thing a reference is referring to, then you have
+to do this yourself using either prefix typing indicators, like
+C<${$blah}>, C<@{$blah}>, C<@{$blah[$i]}>, or else postfix pointer arrows,
+like C<$a-E<gt>[3]>, C<$h-E<gt>{fred}>, or even C<$ob-E<gt>method()-E<gt>[3]>.
+
+=head1 COMMON MISTAKES
+
+The two most common mistakes made in constructing something like
+an array of arrays are either accidentally counting the number of
+elements or else taking a reference to the same memory location
+repeatedly. Here's the case where you just get the count instead
+of a nested array:
+
+    for $i (1..10) {
+        @list = somefunc($i);
+        $LoL[$i] = @list;       # WRONG!
+    }
+
+That's just the simple case of assigning a list to a scalar and getting
+its element count. If that's what you really and truly want, then you
+might do well to consider being a tad more explicit about it, like this:
+
+    for $i (1..10) {
+        @list = somefunc($i);
+        $counts[$i] = scalar @list;
+    }
+
+Here's the case of taking a reference to the same memory location
+again and again:
+
+    for $i (1..10) {
+        @list = somefunc($i);
+        $LoL[$i] = \@list;      # WRONG!
+    }
+
+So, just what's the big problem with that? It looks right, doesn't it?
+After all, I just told you that you need an array of references, so by
+golly, you've made me one!
+
+Unfortunately, while this is true, it's still broken. All the references
+in @LoL refer to the I<very same place>, and they will therefore all hold
+whatever was last in @list! It's similar to the problem demonstrated in
+the following C program:
+
+    #include <pwd.h>
+    main() {
+        struct passwd *getpwnam(), *rp, *dp;
+        rp = getpwnam("root");
+        dp = getpwnam("daemon");
+
+        printf("daemon name is %s\nroot name is %s\n",
+                dp->pw_name, rp->pw_name);
+    }
+
+Which will print
+
+    daemon name is daemon
+    root name is daemon
+
+The problem is that both C<rp> and C<dp> are pointers to the same location
+in memory! In C, you'd have to remember to malloc() yourself some new
+memory. In Perl, you'll want to use the array constructor C<[]> or the
+hash constructor C<{}> instead. Here's the right way to do the preceding
+broken code fragments:
+
+    for $i (1..10) {
+        @list = somefunc($i);
+        $LoL[$i] = [ @list ];
+    }
+
+The square brackets make a reference to a new array with a I<copy>
+of what's in @list at the time of the assignment. This is what
+you want.
+
+Note that this will produce something similar, but it's
+much harder to read:
+
+    for $i (1..10) {
+        @list = 0 .. $i;
+        @{$LoL[$i]} = @list;
+    }
+
+Is it the same? Well, maybe so--and maybe not. The subtle difference
+is that when you assign something in square brackets, you know for sure
+it's always a brand new reference with a new I<copy> of the data.
+Something else could be going on in this new case with the C<@{$LoL[$i]}>
+dereference on the left-hand-side of the assignment. It all depends on
+whether C<$LoL[$i]> had been undefined to start with, or whether it
+already contained a reference. If you had already populated @LoL with
+references, as in
+
+    $LoL[3] = \@another_list;
+
+then the assignment with the indirection on the left-hand-side would
+use the existing reference that was already there:
+
+    @{$LoL[3]} = @list;
+
+Of course, this I<would> have the "interesting" effect of clobbering
+@another_list. (Have you ever noticed how when a programmer says
+something is "interesting", that rather than meaning "intriguing",
+they're disturbingly more apt to mean that it's "annoying",
+"difficult", or both? :-)
+
+So just remember to always use the array or hash constructors with C<[]>
+or C<{}>, and you'll be fine, although it's not always optimally
+efficient.
+
+Surprisingly, the following dangerous-looking construct will
+actually work out fine:
+
+    for $i (1..10) {
+        my @list = somefunc($i);
+        $LoL[$i] = \@list;
+    }
+
+That's because my() is more of a run-time statement than it is a
+compile-time declaration I<per se>. This means that the my() variable is
+remade afresh each time through the loop. So even though it I<looks> as
+though you stored the same variable reference each time, you actually did
+not! This is a subtle distinction that can produce more efficient code at
+the risk of misleading all but the most experienced of programmers. So I
+usually advise against teaching it to beginners. In fact, except for
+passing arguments to functions, I seldom like to see the gimme-a-reference
+operator (backslash) used much at all in code. Instead, I advise
+beginners that they (and most of the rest of us) should try to use the
+much more easily understood constructors C<[]> and C<{}> instead of
+relying upon lexical (or dynamic) scoping and hidden reference-counting to
+do the right thing behind the scenes.
+
+In summary:
+
+    $LoL[$i] = [ @list ];       # usually best
+    $LoL[$i] = \@list;          # perilous; just how my() was that list?
+    @{ $LoL[$i] } = @list;      # way too tricky for most programmers
+
+
+=head1 CAVEAT ON PRECEDENCE
+
+Speaking of things like C<@{$LoL[$i]}>, the following are actually the
+same thing:
+
+    $listref->[2][2]    # clear
+    $$listref[2][2]     # confusing
+
+That's because Perl's precedence rules on its five prefix dereferencers
+(which look like someone swearing: C<$ @ * % &>) make them bind more
+tightly than the postfix subscripting brackets or braces! This will no
+doubt come as a great shock to the C or C++ programmer, who is quite
+accustomed to using C<*a[i]> to mean what's pointed to by the I<i'th>
+element of C<a>. That is, they first take the subscript, and only then
+dereference the thing at that subscript. That's fine in C, but this isn't C.
+
+The seemingly equivalent construct in Perl, C<$$listref[$i]> first does
+the deref of C<$listref>, making it take $listref as a reference to an
+array, and then dereference that, and finally tell you the I<i'th> value
+of the array pointed to by $listref. If you wanted the C notion, you'd have to
+write C<${$LoL[$i]}> to force the C<$LoL[$i]> to get evaluated first
+before the leading C<$> dereferencer.
+
+=head1 WHY YOU SHOULD ALWAYS C<use strict>
+
+If this is starting to sound scarier than it's worth, relax. Perl has
+some features to help you avoid its most common pitfalls. The best
+way to avoid getting confused is to start every program like this:
+
+    #!/usr/bin/perl -w
+    use strict;
+
+This way, you'll be forced to declare all your variables with my(), and
+accidental "symbolic dereferencing" will be disallowed. Therefore if you'd done
+this:
+
+    my $listref = [
+        [ "fred", "barney", "pebbles", "bambam", "dino", ],
+        [ "homer", "bart", "marge", "maggie", ],
+        [ "george", "jane", "elroy", "judy", ],
+    ];
+
+    print $listref[2][2];
+
+The compiler would immediately flag that as an error I<at compile time>,
+because you were accidentally accessing C<@listref>, an undeclared
+variable, and it would thereby remind you to instead write:
+
+    print $listref->[2][2];
+
+=head1 DEBUGGING
+
+The standard Perl debugger in 5.001 doesn't do a very nice job of
+printing out complex data structures. However, the perl5db that
+Ilya Zakharevich E<lt>F<ilya@math.ohio-state.edu>E<gt>
+wrote, which is accessible at
+
+    ftp://ftp.perl.com/pub/perl/ext/perl5db-kit-0.9.tar.gz
+
+has several new features, including command line editing as well
+as the C<X> command to dump out complex data structures. For example,
+given the nested list assigned above (held here in $LoL), here's the debugger output:
+
+    DB<1> X $LoL
+    $LoL = ARRAY(0x13b5a0)
+       0  ARRAY(0x1f0a24)
+          0  'fred'
+          1  'barney'
+          2  'pebbles'
+          3  'bambam'
+          4  'dino'
+       1  ARRAY(0x13b558)
+          0  'homer'
+          1  'bart'
+          2  'marge'
+          3  'maggie'
+       2  ARRAY(0x13b540)
+          0  'george'
+          1  'jane'
+          2  'elroy'
+          3  'judy'
+
+There's also a lower-case B<x> command which is nearly the same.
+
+=head1 SEE ALSO
+
+perlref(1), perldata(1)
+
+=head1 AUTHOR
+
+Tom Christiansen E<lt>F<tchrist@perl.com>E<gt>
+
+Last update:
+Sat Oct 7 22:41:09 MDT 1995
+