=head1 Perl Slurp Ease

=head2 Introduction

One of the common Perl idioms is processing text files line by line:

    while( <FH> ) {
        do something with $_
    }

This idiom has several variants, but the key point is that it reads in
only one line from the file in each loop iteration. This has several
advantages, including limiting memory use to one line, the ability to
handle any size file (including data piped in via STDIN), and it is
easily taught to and understood by Perl newbies. In fact newbies are the
ones who do silly things like this:

    while( <FH> ) {
        push @lines, $_ ;
    }

    foreach ( @lines ) {
        do something with $_
    }

Line by line processing is fine, but it isn't the only way to deal with
reading files. The other common style is reading the entire file into a
scalar or array, and that is commonly known as slurping. Now, slurping has
somewhat of a poor reputation, and this article is an attempt at
rehabilitating it. Slurping files has advantages and limitations, and is
not something you should just do when line by line processing is fine.
It is best when you need the entire file in memory for processing all at
once. Slurping with in-memory processing can be faster and lead to
simpler code than line by line processing if done properly.

The biggest issue to watch for with slurping is file size. Slurping very
large files or unknown amounts of data from STDIN can be disastrous to
your memory usage and cause swap disk thrashing. You can slurp STDIN if
you know that you can handle the maximum size input without
detrimentally affecting your memory usage. So I advocate slurping only
disk files, and only when you know their size is reasonable and you have
a real reason to process the file as a whole. Note that reasonable size
these days is larger than in the bad old days of limited RAM. Slurping in a
megabyte is not an issue on most systems. But most of the
files I tend to slurp in are much smaller than that. Typical files that
work well with slurping are configuration files, (mini-)language scripts,
some data (especially binary) files, and other files of known sizes
which need fast processing.

Another major win for slurping over line by line processing is speed. Perl's
I/O system (like many others) is slow. Calling C<< <> >> for each line
requires a check for the end of line, checks for EOF, copying a line,
munging the internal handle structure, etc. That is plenty of work for each
line read in. Slurping, if done correctly, will usually involve only
one I/O call and no extra data copying. The same is true for writing
files to disk, and we will cover that as well (even though the term
slurping traditionally describes a read operation, I use the term ``slurp''
for the concept of doing I/O on an entire file in one operation).

Finally, when you have slurped the entire file into memory, you can do
operations on the data that are not possible or easily done with line by
line processing. These include global search/replace (without regard for
newlines), grabbing all matches with one call of C<//g>, complex parsing
(which in many cases must ignore newlines), processing *ML (where line
endings are just white space) and performing complex transformations
such as template expansion.
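As a quick taste of that last point, here is a minimal sketch that grabs
every C-style comment from a source file in one statement. It assumes the
C<read_file()> subroutine developed later in this article (any slurp that
returns the whole file in one scalar would do), and a hypothetical file
C<foo.c>:

    my $code = read_file( 'foo.c' ) ;

    # /s lets . cross newlines, /g grabs every match in one pass
    my @comments = $code =~ m{ /\* (.*?) \*/ }xsg ;

Doing the same thing line by line would mean hand-rolling a small state
machine to track comments that span lines.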
=head2 Global Operations

Here are some simple global operations that can be done quickly and
easily on an entire file that has been slurped in. They could also be
done with line by line processing, but that would be slower and require
more code.

A common problem is reading in a file with key/value pairs. There are
modules which do this, but who needs them for simple formats? Just slurp
in the file and do a single parse to grab all the key/value pairs.

    my $text = read_file( $file ) ;
    my %config = $text =~ /^(\w+)=(.+)$/mg ;

That matches a key which starts a line (anywhere inside the string
because of the C</m> modifier), the '=' char and the text to the end of the
line (again, C</m> makes that work). In fact the ending C<$> is not even needed
since C<.> will not normally match a newline. Since the key and value are
grabbed and the C<m//> is in list context with the C</g> modifier, it will
grab all key/value pairs and return them. The C<%config> hash will be
assigned this list and now you have the file fully parsed into a hash.

Various projects I have worked on needed some simple templating and I
wasn't in the mood to use a full module (please, no flames about your
favorite template module :-). So I rolled my own by slurping in the
template file, setting up a template hash and doing this one line:

    $text =~ s/<%(.+?)%>/$template{$1}/g ;

That only works if the entire file was slurped in. With a little
extra work it can handle chunks of text to be expanded:

    $text =~ s/<%(\w+)_START%>(.+?)<%\1_END%>/ template($1, $2)/sge ;

Just supply a C<template> sub to expand the text between the markers and
you have yourself a simple system with minimal code. Note that this will
work and grab over multiple lines due to the C</s> modifier. This is
something that is much trickier with line by line processing.

Note that this is a very simple templating system, and it can't directly
handle nested tags and other complex features. But even if you use one
of the myriad of template modules on the CPAN, you will gain by having
speedier ways to read and write files.

Slurping a file into an array also offers some useful advantages. One
simple example is reading in a flat database where each record has
fields separated by a character such as C<:>:

    my @pw_fields = map [ split /:/ ], read_file( '/etc/passwd' ) ;

Random access to any line of the slurped file is another advantage. Also
a line index could be built to speed up searching the array of lines.

=head2 Traditional Slurping

Perl has always supported slurping files with minimal code. Slurping of
a file to a list of lines is trivial, just call the C<< <> >> operator
in a list context:

    my @lines = <FH> ;

and slurping to a scalar isn't much more work. Just set the built in
variable C<$/> (the input record separator) to the undefined value and read
in the file with C<< <> >>:

    {
        local( $/, *FH ) ;
        open( FH, $file ) or die "sudden flaming death\n" ;
        $text = <FH> ;
    }

Notice the use of C<local()>. It sets C<$/> to C<undef> for you, and when
the scope exits it will revert C<$/> back to its previous value (most
likely "\n").

Here is a Perl idiom that allows the C<$text> variable to be declared,
with no need for a tightly nested block.
The C<do> block will
execute C<< <FH> >> in a scalar context and slurp in the file named by
C<$file>:

    local( *FH ) ;
    open( FH, $file ) or die "sudden flaming death\n" ;
    my $text = do { local( $/ ) ; <FH> } ;

Both of those slurps used localized filehandles to be compatible with
Perl 5.005. Here they are with the lexical autovivified handles of 5.6.0:

    {
        local( $/ ) ;
        open( my $fh, $file ) or die "sudden flaming death\n" ;
        $text = <$fh> ;
    }

    open( my $fh, $file ) or die "sudden flaming death\n" ;
    my $text = do { local( $/ ) ; <$fh> } ;

And this is a variant of that idiom that removes the need for the open
call:

    my $text = do { local( @ARGV, $/ ) = $file ; <> } ;

The filename in C<$file> is assigned to a localized C<@ARGV> and the
null filehandle is used, which reads the data from the files in C<@ARGV>.

Instead of assigning to a scalar, all the above slurps can assign to an
array and it will get the file, but split into lines (using C<$/> as the
end of line marker).

There is one common variant of those slurps which is very slow and not
good code. You see it around, and it is almost always cargo cult code:

    my $text = join( '', <FH> ) ;

That needlessly splits the input file into lines (C<join> provides a
list context to C<< <FH> >>) and then joins up those lines again. The
original coder of this idiom obviously never read I<perlvar> and learned
how to use C<$/> to allow scalar slurping.

=head2 Write Slurping

While reading in entire files at one time is common, writing out entire
files is also done. We call it ``slurping'' when we read in files, but
there is no commonly accepted term for the write operation. I asked some
Perl colleagues and got two interesting nominations. Peter Scott said to
call it ``burping'' (rhymes with ``slurping'' and suggests movement in
the opposite direction). Others suggested ``spewing'', which has a
stronger visual image :-). Tell me your favorite or suggest your own. I
will use both in this section so you can see how they work for you.

Spewing a file is a much simpler operation than slurping. You don't have
context issues to worry about and there is no efficiency problem with
returning a buffer. Here is a simple burp subroutine:

    sub burp {
        my $file_name = shift ;
        open( my $fh, ">$file_name" ) ||
            die "can't create $file_name $!" ;
        print $fh @_ ;
    }

Note that it doesn't copy the input text but passes C<@_> directly to
C<print>. We will look at faster variations of that later on.

=head2 Slurp on the CPAN

As you would expect, there are modules on the CPAN that will slurp files
for you. The two I found are called Slurp.pm (by Rob Casey - ROBAU on
CPAN) and File::Slurp.pm (by David Muir Sharnoff - MUIR on CPAN).

Here is the code from Slurp.pm:

    sub slurp {
        local( $/, @ARGV ) = ( wantarray ? $/ : undef, @_ );
        return <ARGV>;
    }

    sub to_array {
        my @array = slurp( @_ );
        return wantarray ? @array : \@array;
    }

    sub to_scalar {
        my $scalar = slurp( @_ );
        return $scalar;
    }

The subroutine C<slurp()> uses the magic undefined value of C<$/> and
the magic file handle C<ARGV> to support slurping into a scalar or
array. It also provides two wrapper subs that allow the caller to
control the context of the slurp. The C<to_array()> subroutine will
return the list of slurped lines or an anonymous array of them according
to its caller's context by checking C<wantarray>. It has 'slurp' in
C<@EXPORT> and all three subroutines in C<@EXPORT_OK>.
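To make those calling conventions concrete, here is a sketch of how code
using Slurp.pm might look. Since only C<slurp> is in C<@EXPORT>, the two
wrappers must be imported explicitly; C<$file> is assumed to hold a file
name:

    use Slurp qw( slurp to_array to_scalar ) ;

    my $text  = slurp( $file ) ;    # scalar context: whole file as one string
    my @lines = slurp( $file ) ;    # list context: one line per element

    my $lines_ref = to_array( $file ) ;     # scalar context: anonymous array of lines
    my $string    = to_scalar( $file ) ;    # always the whole file as one string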
<Footnote: Slurp.pm is poorly named and it shouldn't be in the top level
namespace.>

The original File::Slurp.pm has this code:

    sub read_file
    {
        my ($file) = @_;

        local($/) = wantarray ? $/ : undef;
        local(*F);
        my $r;
        my (@r);

        open(F, "<$file") || croak "open $file: $!";
        @r = <F>;
        close(F) || croak "close $file: $!";

        return $r[0] unless wantarray;
        return @r;
    }

This module provides several subroutines including C<read_file()> (more
on the others later). C<read_file()> behaves similarly to
C<Slurp::slurp()> in that it will slurp a list of lines or a single
scalar depending on the caller's context. It also uses the magic
undefined value of C<$/> for scalar slurping, but it uses an explicit
open call rather than a localized C<@ARGV> as the other module
did. Also it doesn't provide a way to get an anonymous array of the
lines, but that can easily be rectified by calling it inside an anonymous
array constructor C<[]>.

Both of these modules make it easier for Perl coders to slurp in
files. They both use the magic C<$/> to slurp in scalar mode and the
natural behavior of C<< <> >> in list context to slurp as lines. But
neither is optimized for speed, nor can they handle C<binmode()> to
support binary or Unicode files. See below for more on slurp features
and speedups.

=head2 Slurping API Design

The slurp modules on CPAN have a very simple API and don't support
C<binmode()>. This section will cover various API design issues such as
efficient return by reference, C<binmode()> and calling variations.

Let's start with the call variations. Slurped files can be returned in
four formats: as a single scalar, as a reference to a scalar, as a list
of lines or as an anonymous array of lines. But the caller can only
provide two contexts: scalar or list. So we have to either provide an
API with more than one subroutine (as Slurp.pm did) or just provide one
subroutine which only returns a scalar or a list (not an anonymous
array) as File::Slurp does.

I have used my own C<read_file()> subroutine for years and it has the
same API as File::Slurp: a single subroutine that returns a scalar or a
list of lines depending on context. But I recognize the interest of
those that want an anonymous array for line slurping. For one thing, it
is easier to pass around to other subs and for another, it eliminates
the extra copying of the lines via C<return>. So my module provides only
one slurp subroutine that returns the file data based on context and any
format options passed in. There is no need for a specific
slurp-in-as-a-scalar or list subroutine as the general C<read_file()>
sub will do that by default in the appropriate context. If you want
C<read_file()> to return a scalar reference or anonymous array of lines,
you can request those formats with options. You can even pass in a
reference to a scalar (e.g. a previously allocated buffer) and have that
filled with the slurped data (and that is one of the fastest slurp
modes; see the benchmark section for more on that). If you want to
slurp a scalar into an array, just select the desired array element and
that will provide scalar context to the C<read_file()> subroutine.

The next area to cover is what to name the slurp sub. I will go with
C<read_file()>. It is descriptive, keeps compatibility with the existing
File::Slurp API, and doesn't use the 'slurp' nickname (though that
nickname is in the module name).
Also I decided to keep the File::Slurp
namespace, which was graciously handed over to me by its previous owner,
David Muir.

Another critical area when designing APIs is how to pass in
arguments. The C<read_file()> subroutine takes one required argument,
which is the file name. To support C<binmode()> we need another optional
argument. A third optional argument is needed to support returning a
slurped scalar by reference. My first thought was to design the API with
three positional arguments - file name, buffer reference and binmode. But if
you want to set the binmode and not pass in a buffer reference, you have
to fill the second argument with C<undef> and that is ugly. So I decided
to make the filename argument positional and the other two named. The
subroutine starts off like this:

    sub read_file {

        my( $file_name, %args ) = @_ ;

        my $buf ;
        my $buf_ref = $args{'buf_ref'} || \$buf ;

The other sub (C<read_file_lines()>) will only take an optional binmode
(so you can read files with binary delimiters). It doesn't need a buffer
reference argument since it can return an anonymous array if it is called
in a scalar context. So this subroutine could use positional arguments,
but to keep its API similar to the API of C<read_file()>, it will also
use pass by name for the optional arguments. This also means that new
optional arguments can be added later without breaking any legacy
code. A bonus of keeping the API the same for both subs will be seen
in how the two subs are optimized to work together.

Write slurping (or spewing or burping :-)) needs to have its API
designed as well. The biggest issue is that we need to support not only
optional arguments but also a list of data to be written. Perl
6 will be able to handle that with optional named arguments and a final
slurp argument. Since this is Perl 5, we have to do it using some
cleverness. The first argument is the file name, and it will be
positional as with the C<read_file> subroutine. But how can we pass in
the optional arguments and also a list of data? The solution lies in the
fact that the data list should never contain a reference.
Burping/spewing works only on plain data. So if the next argument is a
hash reference, we can assume it contains the optional arguments and
the rest of the arguments are the data list. So the C<write_file()>
subroutine will start off like this:

    sub write_file {

        my $file_name = shift ;

        my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;

Whether or not optional arguments are passed in, we leave the data list
in C<@_> to minimize any more copying. You call C<write_file()> like this:

    write_file( 'foo', { binmode => ':raw' }, @data ) ;
    write_file( 'junk', { append => 1 }, @more_junk ) ;
    write_file( 'bar', @spew ) ;

=head2 Fast Slurping

Somewhere along the line, I learned about a way to slurp files faster
than by setting C<$/> to C<undef>. The method is very simple: you do a single
C<read> call with the size of the file (which the C<-s> operator provides).
This bypasses the I/O loop inside Perl that checks for EOF and does all
sorts of processing. I then decided to experiment, and found that
C<sysread> is even faster, as you would expect. C<sysread> bypasses all of
Perl's stdio and reads the file from the kernel buffers directly into a
Perl scalar. This is why the slurp code in File::Slurp uses
C<sysopen/sysread/syswrite>. All the rest of the code is just to support
the various options and data passing techniques.
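To make the technique concrete, here is a stripped-down sketch of a
sysread slurp. This is an illustration, not the module code: it does a
single C<sysread> for clarity, while the full C<read_file()> shown later
loops to handle short reads:

    use Fcntl ;

    sub sysread_slurp {

        my $file_name = shift ;

        sysopen( my $fh, $file_name, O_RDONLY )
            or die "can't open $file_name: $!" ;

        # -s gives the file size, so one sysread can grab the whole
        # file, bypassing Perl's line-oriented I/O layer entirely
        my $size = -s $fh ;

        defined sysread( $fh, my $buf, $size )
            or die "can't read $file_name: $!" ;

        return $buf ;
    }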
=head2 Benchmarks

Benchmarks can be enlightening, informative, frustrating and
deceiving. It would make no sense to create a new and more complex slurp
module unless it also gained significantly in speed. So I created a
benchmark script which compares various slurp methods with differing
file sizes and calling contexts. This script can be run from the main
directory of the tarball like this:

    perl -Ilib extras/slurp_bench.pl

If you pass in an argument on the command line, it will be passed to
C<timethese()> and it will control the duration. It defaults to -2, which
makes each benchmark run to at least 2 seconds of CPU time.

The following numbers are from a run I did on my 300MHz Sparc. You will
most likely get much faster counts on your boxes, but the relative speeds
shouldn't change by much. If you see major differences on your
benchmarks, please send me the results and your Perl and OS
versions. Also you can play with the benchmark script and add more slurp
variations or data files.

The rest of this section will be discussing the results of the
benchmarks. You can refer to extras/slurp_bench.pl to see the code for
the individual benchmarks. If the benchmark name starts with cpan_, it
is either from Slurp.pm or File::Slurp.pm. Those starting with new_ are
from the new File::Slurp.pm. Those that start with file_contents_ are
from a client's code base. The rest are variations I created to
highlight certain aspects of the benchmarks.

The short and long file data is made like this:

    my @lines = ( 'abc' x 30 . "\n") x 100 ;
    my $text = join( '', @lines ) ;

    @lines = ( 'abc' x 40 . "\n") x 1000 ;
    $text = join( '', @lines ) ;

So the short file is 9,100 bytes and the long file is 121,000
bytes.

=head3 Scalar Slurp of Short File

    file_contents        651/s
    file_contents_no_OO  828/s
    cpan_read_file      1866/s
    cpan_slurp          1934/s
    read_file           2079/s
    new                 2270/s
    new_buf_ref         2403/s
    new_scalar_ref      2415/s
    sysread_file        2572/s

=head3 Scalar Slurp of Long File

    file_contents_no_OO 82.9/s
    file_contents       85.4/s
    cpan_read_file       250/s
    cpan_slurp           257/s
    read_file            323/s
    new                  468/s
    sysread_file         489/s
    new_scalar_ref       766/s
    new_buf_ref          767/s

The primary inference you can draw from the numbers above is that
when slurping a file into a scalar, the longer the file, the more time
you save by returning the result via a scalar reference. The time for
the extra buffer copy can add up. The new module came out on top overall,
except for the very simple sysread_file entry, which was added to
highlight how little overhead the more flexible new module carries.
The file_contents entries are always the worst since they do a
list slurp and then a join, which is a classic newbie, cargo-culted
style that is extremely slow. Also the OO code in file_contents slows
it down even more (I added the file_contents_no_OO entry to show this).
The two CPAN modules are decent with small files, but they are laggards
compared to the new module when the file gets much larger.
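If you want to reproduce a scalar slurp entry on your own box, its core
looks something like this sketch. The entry names are my own here, the
C<buf_ref> option follows the C<read_file()> code later in this article,
and C<short_file.txt> is a hypothetical test file; see
extras/slurp_bench.pl for the real entries:

    use Benchmark qw( timethese ) ;
    use File::Slurp ;

    my $file = 'short_file.txt' ;
    my $buf ;

    # -2 runs each entry for at least 2 CPU seconds
    timethese( -2, {

        plain_slurp => sub { my $text = read_file( $file ) },
        by_buf_ref  => sub { read_file( $file, buf_ref => \$buf ) },
        join_lines  => sub { open my $fh, $file ;
                             my $text = join '', <$fh> },
    } ) ;

The join_lines entry is the slow cargo cult idiom, included for
comparison.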
=head3 List Slurp of Short File

    cpan_read_file           589/s
    cpan_slurp_to_array      620/s
    read_file                824/s
    new_array_ref            824/s
    sysread_file             828/s
    new                      829/s
    new_in_anon_array        833/s
    cpan_slurp_to_array_ref  836/s

=head3 List Slurp of Long File

    cpan_read_file          62.4/s
    cpan_slurp_to_array     62.7/s
    read_file               92.9/s
    sysread_file            94.8/s
    new_array_ref           95.5/s
    new                     96.2/s
    cpan_slurp_to_array_ref 96.3/s
    new_in_anon_array       97.2/s

This is perhaps the most interesting result of this benchmark. Five
different entries have effectively tied for the lead. The logical
conclusion is that splitting the input into lines is the bounding
operation, no matter how the file gets slurped. This is the only
benchmark where the new module isn't the clear winner (in the long file
entries - it is no worse than a close second in the short file
entries).

Note: In the benchmark information for all the spew entries, the extra
number at the end of each line is how many wallclock seconds the whole
entry took. The benchmarks were run for at least 2 CPU seconds per
entry. The unusually large wallclock times will be discussed below.

=head3 Scalar Spew of Short File

    cpan_write_file 1035/s 38
    print_file      1055/s 41
    syswrite_file   1135/s 44
    new             1519/s  2
    print_join_file 1766/s  2
    new_ref         1900/s  2
    syswrite_file2  2138/s  2

=head3 Scalar Spew of Long File

    cpan_write_file 164/s 20
    print_file      211/s 26
    syswrite_file   236/s 25
    print_join_file 277/s  2
    new             295/s  2
    syswrite_file2  428/s  2
    new_ref         608/s  2

In the scalar spew entries, the new module API wins when it is passed a
reference to the scalar buffer. The C<syswrite_file2> entry beats it
with the shorter file due to its simpler code. The old CPAN module is
the slowest due to its extra copying of the data and its use of C<print>.

=head3 List Spew of Short File

    cpan_write_file  794/s 29
    syswrite_file   1000/s 38
    print_file      1013/s 42
    new             1399/s  2
    print_join_file 1557/s  2

=head3 List Spew of Long File

    cpan_write_file 112/s 12
    print_file      179/s 21
    syswrite_file   181/s 19
    print_join_file 205/s  2
    new             228/s  2

Again, the simple C<print_join_file> entry beats the new module when
spewing a short list of lines to a file. But it loses to the new module
when the file size gets longer. The old CPAN module lags behind the
others since it first makes an extra copy of the lines and then calls
C<print> on the output list, which is much slower than passing C<print>
a single scalar generated by join. The C<print_file> entry
shows the advantage of directly printing C<@_>, and the
C<print_join_file> entry adds the join optimization.

Now about those long wallclock times. If you look carefully at the
benchmark code of all the spew entries, you will find that some always
write to new files and some overwrite existing files. When I asked David
Muir why the old File::Slurp module had an C<overwrite> subroutine, he
answered that by overwriting a file, you always guarantee something
readable is in the file. If you create a new file, there is a moment
when the new file is created but has no data in it. I feel this is not a
good enough answer. Even when overwriting, you can write a shorter file
than the existing file, and then you have to truncate the file to the new
size. There is a small race window there where another process can slurp
in the file with the new data followed by leftover junk from the
previous version of the file.
This reinforces the point that the only
way to ensure consistent file data is the proper use of file locks.

But what about those long times? Well, it is all about the difference
between creating files and overwriting existing ones. The former has to
allocate new inodes (or the equivalent on other file systems) and the
latter can reuse the existing inode. This means the overwrite will save on
disk seeks as well as on CPU time. In fact when running this benchmark,
I could hear my disk going crazy allocating inodes during the spew
operations. This speedup in both CPU and wallclock time is why the new module
always does overwriting when spewing files. It also does the proper
truncate (and this is checked in the tests by spewing shorter files
after longer ones had previously been written). The C<overwrite>
subroutine is just a typeglob alias to C<write_file> and is there for
backwards compatibility with the old File::Slurp module.

=head3 Benchmark Conclusion

Other than a few cases where a simpler entry beat it out, the new
File::Slurp module is either the speed leader or among the leaders. Its
special APIs for passing buffers by reference prove to be very useful
speedups. Also it uses all the other optimizations, including using
C<sysread/syswrite> and joining output lines. I expect many projects
that extensively use slurping will notice the speed improvements,
especially if they rewrite their code to take advantage of the new API
features. Even if they don't touch their code and use the simple API,
they will get a significant speedup.

=head2 Error Handling

Slurp subroutines are subject to conditions such as not being able to
open the file, or I/O errors. How these errors are handled, and what the
caller will see, are important aspects of the design of an API. The
classic error handling for slurping has been to call C<die()> or even
better, C<croak()>. But sometimes you want the slurp to either
C<warn()>/C<carp()> or allow your code to handle the error. Sure, this
can be done by wrapping the slurp in an C<eval> block to catch a fatal
error, but not everyone wants all that extra code. So I have added
another option to all the subroutines which selects the error
handling. If the 'err_mode' option is 'croak' (which is the
default), the called subroutine will croak. If set to 'carp', then carp
will be called. Set to any other string (use 'quiet' when you want to
be explicit) and no error handler is called. Then the caller can use the
error status from the call.

C<write_file()> doesn't use the return value for data, so it can return a
false status value in-band to mark an error. C<read_file()> does use its
return value for data, but we can still make it pass back the error
status. A successful read in any scalar mode will return either a
defined data string or a reference to a scalar or array. So a bare
C<return> would work here. But if you slurp in lines by calling it in a
list context, a bare C<return> will return an empty list, which is the
same value it would get from an existing but empty file. So now,
C<read_file()> will do something I normally strongly advocate against,
i.e., returning an explicit C<undef> value. In scalar context this
still returns an error, and in list context, the first returned value
will be C<undef>, and that is not legal data for the first element.
So
the list context also gets an error status it can detect:

    my @lines = read_file( $file_name, err_mode => 'quiet' ) ;
    your_handle_error( "$file_name can't be read\n" ) unless
            @lines && defined $lines[0] ;

=head2 File::FastSlurp

Here are the C<read_file()> and C<write_file()> subroutines of the new
module in full:

    use Carp ;
    use Fcntl ;

    sub read_file {

        my( $file_name, %args ) = @_ ;

        my $buf ;
        my $buf_ref = $args{'buf_ref'} || \$buf ;

        my $mode = O_RDONLY ;
        $mode |= O_BINARY if $args{'binmode'} ;

        local( *FH ) ;
        sysopen( FH, $file_name, $mode ) or do {
            carp "Can't open $file_name: $!" ;
            return ;
        } ;

        my $size_left = -s FH ;

        while( $size_left > 0 ) {

            my $read_cnt = sysread( FH, ${$buf_ref},
                    $size_left, length ${$buf_ref} ) ;

            unless( $read_cnt ) {

                carp "read error in file $file_name: $!" ;
                last ;
            }

            $size_left -= $read_cnt ;
        }

        # handle void context (return scalar by buffer reference)

        return unless defined wantarray ;

        # handle list context (split into lines, keeping the separator)

        return split m|(?<=$/)|, ${$buf_ref} if wantarray ;

        # handle scalar context

        return ${$buf_ref} ;
    }

    sub write_file {

        my $file_name = shift ;

        my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;
        my $buf = join '', @_ ;

        my $mode = O_WRONLY | O_CREAT ;
        $mode |= O_BINARY if $args->{'binmode'} ;
        $mode |= O_APPEND if $args->{'append'} ;

        local( *FH ) ;
        sysopen( FH, $file_name, $mode ) or do {
            carp "Can't open $file_name: $!" ;
            return ;
        } ;

        my $size_left = length( $buf ) ;
        my $offset = 0 ;

        while( $size_left > 0 ) {

            my $write_cnt = syswrite( FH, $buf,
                    $size_left, $offset ) ;

            unless( $write_cnt ) {

                carp "write error in file $file_name: $!" ;
                last ;
            }

            $size_left -= $write_cnt ;
            $offset += $write_cnt ;
        }

        # truncate any leftover data from a longer previous file
        truncate( FH, $offset ) unless $args->{'append'} ;

        return ;
    }

=head2 Slurping in Perl 6

As usual with Perl 6, much of the work in this article will be put to
pasture. Perl 6 will allow you to set a 'slurp' property on file handles,
and when you read from such a handle, the file is slurped. List and
scalar context will still be supported so you can slurp into lines or a
scalar. I would expect that support for slurping in Perl 6 will be
optimized and bypass the stdio subsystem, since it can use the slurp
property to trigger a call to special code. Otherwise some enterprising
individual will just create a File::FastSlurp module for Perl 6. The
code in the Perl 5 module could easily be modified to Perl 6 syntax and
semantics. Any volunteers?

=head2 In Summary

We have compared classic line by line processing with munging a whole
file in memory. Slurping files can speed up your programs and simplify
your code if done properly. You must still be careful not to slurp
humongous files (logs, DNA sequences, etc.) or STDIN where you don't
know how much data you will read in. But slurping megabyte sized files
is not a major issue on today's systems with the typical amount of RAM
installed. When Perl was first being used in depth (Perl 4), slurping
was limited by the smaller RAM size of 10 years ago. Now, you should be
able to slurp almost any reasonably sized file, whether it contains
configuration, source code, data, etc.

=head2 Acknowledgements