Don't copy all of the match string buffer

When a pattern matches, and that pattern contains captures (or $`, $&, $' or /p are present), a copy is made of the whole original string, so that $1 et al continue to hold the correct value even if the original string is subsequently modified. This can have severe performance penalties; for example, this code causes a 1Mb buffer to be allocated, copied and freed a million times: $&; $x = 'x' x 1_000_000; 1 while $x =~ /(.)/g; This commit changes this so that, where possible, only the needed substring of the original string is copied: in the above case, only a 1-byte buffer is copied each time. Also, it now reuses or reallocs the buffer, rather than freeing and mallocing each time. Now that PL_sawampersand is a 3-bit flag indicating separately whether $`, $& and $' have been seen, they each contribute only their own individual penalty; which ones have been seen will limit the extent to which we can avoid copying the whole buffer. Note that the above code *without* the $& is not currently slow, but only because the copying is artificially disabled to avoid the performance hit. The next but one commit will remove that hack, meaning that it will still be fast, but will now be correct in the presence of a modified original string. We achieve this by by adding suboffset and subcoffset fields to the existing subbeg and sublen fields of a regex, to indicate how many bytes and characters have been skipped from the logical start of the string till the physical start of the buffer. To avoid copying stuff at the end, we just reduce sublen. For example, in this: "abcdefgh" =~ /(c)d/ subbeg points to a malloced buffer containing "c\0"; sublen == 1, and suboffset == 2 (as does subcoffset). while if $& has been seen, subbeg points to a malloced buffer containing "cd\0"; sublen == 2, and suboffset == 2. If in addition $' has been seen, then subbeg points to a malloced buffer containing "cdefgh\0"; sublen == 6, and suboffset == 2. The regex engine won't do this by default; there are two new flag bits, REXEC_COPY_SKIP_PRE and REXEC_COPY_SKIP_POST, which in conjunction with REXEC_COPY_STR, request that the engine skip the start or end of the buffer (it will still copy in the presence of the relevant $`, $&, $', /p). Only pp_match has been enhanced to use these extra flags; substitution can't easily benefit, since the usual action of s///g is to copy the whole string first time round, then perform subsequent matching iterations against the copy, without further copying. So you still need to copy most of the buffer.
author: David Mitchell <davem@iabyn.com> 2012-07-26 16:04:09 +0100
committer: David Mitchell <davem@iabyn.com> 2012-09-08 15:42:06 +0100
commit: 6502e08109cd003b2cdf39bc94ef35e52203240b (patch)
tree: ae4071332e6a7fd61354d33941476643066d5f56 /pod
parent: 2c7b5d7698f52b86acffe19a7ec15e85c99337fe (diff)
download: perl-6502e08109cd003b2cdf39bc94ef35e52203240b.tar.gz
1 files changed, 19 insertions, 3 deletions
diff --git a/pod/perlreapi.pod b/pod/perlreapi.pod
index ec072189fd..1ccc6d869f 100644
--- a/pod/perlreapi.pod
+++ b/pod/perlreapi.pod
@@ -555,6 +555,8 @@ values.
         char *subbeg;  /* saved or original string so \digit works forever. */
         SV_SAVED_COPY  /* If non-NULL, SV which is COW from original */
         I32 sublen;    /* Length of string pointed by subbeg */
+	I32 suboffset;	/* byte offset of subbeg from logical start of str */
+	I32 subcoffset;	/* suboffset equiv, but in chars (for @-/@+) */
 
         /* Information about the match that isn't often used */
         I32 prelen;           /* length of precomp */
@@ -695,9 +697,23 @@ occur at a floating offset from the start of the pattern. Used to do
 Fast-Boyer-Moore searches on the string to find out if its worth using
 the regex engine at all, and if so where in the string to search.
 
-=head2 C<subbeg> C<sublen> C<saved_copy>
-
-Used during execution phase for managing search and replace patterns.
+=head2 C<subbeg> C<sublen> C<saved_copy> C<suboffset> C<subcoffset>
+
+Used during the execution phase for managing search and replace patterns,
+and for providing the text for C<$&>, C<$1> etc. C<subbeg> points to a
+buffer (either the original string, or a copy in the case of
+C<RX_MATCH_COPIED(rx)>), and C<sublen> is the length of the buffer.  The
+C<RX_OFFS> start and end indices index into this buffer.
+
+In the presence of the C<REXEC_COPY_STR> flag, but with the addition of
+the C<REXEC_COPY_SKIP_PRE> or C<REXEC_COPY_SKIP_POST> flags, an engine
+can choose not to copy the full buffer (although it must still do so in
+the presence of C<RXf_PMf_KEEPCOPY> or the relevant bits being set in
+C<PL_sawampersand>). In this case, it may set C<suboffset> to indicate the
+number of bytes from the logical start of the buffer to the physical start
+(i.e. C<subbeg>). It should also set C<subcoffset>, the number of
+characters in the offset. The latter is needed to support C<@-> and C<@+>
+which work in characters, not bytes.
 
 =head2 C<wrapped> C<wraplen>
author	David Mitchell <davem@iabyn.com>	2012-07-26 16:04:09 +0100
committer	David Mitchell <davem@iabyn.com>	2012-09-08 15:42:06 +0100
commit	6502e08109cd003b2cdf39bc94ef35e52203240b (patch)
tree	ae4071332e6a7fd61354d33941476643066d5f56 /pod
parent	2c7b5d7698f52b86acffe19a7ec15e85c99337fe (diff)
download	perl-6502e08109cd003b2cdf39bc94ef35e52203240b.tar.gz