path: root/regexp.h
Commit message | Author | Age | Files | Lines
* avoid double-freeing regex code blocks | David Mitchell | 2017-02-01 | 1 | -1/+1
| RT #130650 heap-use-after-free in S_free_codeblocks
| When compiling qr/(?{...})/, a reg_code_blocks structure is allocated and various SVs are attached to it. Initially this is set to be freed via a destructor on the savestack, in case of early dying. Later the structure is attached to the compiling regex, and a boolean flag in the structure, 'attached', is set to true to show that the destructor no longer needs to free the struct.
| However, it is possible to get three orders of destruction:
| 1) allocate, push destructor, die early
| 2) allocate, push destructor, attach to regex, die
| 3) allocate, push destructor, attach to regex, succeed
| In 2, the regex is freed (via the savestack) before the destructor is called. In 3, the destructor is called, then later the regex is freed.
| It turns out perl can't currently handle case 2: qr'(?{})\6'
| Fix this by turning the 'attached' boolean field into an integer refcount, then keep a count of whether the struct is referenced from the savestack and/or the regex. Since it normally has a value of 1 or 2, it's similar to a boolean flag, but crucially it no longer just indicates that the regex has a pointer to it ('attached'), but that at least one of the savestack and regex have a pointer to it. So order of freeing no longer matters.
| I also updated S_free_codeblocks() so that it nulls out SV pointers in the reg_code_blocks struct before freeing them. This is generally good practice to avoid double frees, although it is probably not needed at the moment.
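The refcount idea in this commit can be sketched in C as follows. This is a minimal illustration only, not the code from regcomp.c: the struct `code_blocks`, the functions `cb_alloc`, `cb_attach`, `cb_release` and `cb_live`, and the demo counter are all hypothetical names introduced here.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the real reg_code_blocks struct. */
struct code_blocks {
    int  refcnt;  /* number of owners: savestack and/or regex */
    int *body;    /* stands in for the array of code blocks */
};

static int live_structs = 0;  /* demo-only counter of unfreed structs */

/* Allocate with one reference, held by the savestack destructor. */
struct code_blocks *cb_alloc(void) {
    struct code_blocks *cb = malloc(sizeof *cb);
    if (!cb) abort();
    cb->refcnt = 1;
    cb->body = NULL;
    live_structs++;
    return cb;
}

/* The regex takes a second reference when the struct is attached. */
void cb_attach(struct code_blocks *cb) { cb->refcnt++; }

/* Called by either owner; only the last release frees the struct. */
void cb_release(struct code_blocks *cb) {
    assert(cb->refcnt > 0);
    if (--cb->refcnt > 0)
        return;
    free(cb->body);
    cb->body = NULL;  /* null out before freeing the container */
    free(cb);
    live_structs--;
}

int cb_live(void) { return live_structs; }
```

Because both owners call the same release function, it does not matter whether the regex or the savestack destructor runs first; whichever comes last frees the struct.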
* Use cBOOL() instead of ? TRUE : FALSE | Dagfinn Ilmari Mannsåker | 2017-01-25 | 1 | -1/+1
| Except under cpan/ and dist/
* better handle freeing of code blocks in /(?{...})/ | David Mitchell | 2017-01-24 | 1 | -0/+8
| [perl #129140] attempting double-free
| This fixes some leaks and double frees in regexes which contain code blocks.
| During compilation, an array of struct reg_code_block's is malloced. Initially this is just attached to the RExC_state_t struct local var in Perl_re_op_compile(). Later it may be attached to a pattern. The difficulty is ensuring that the array is free()d (and the ref counts contained within decremented) should compilation croak early, while avoiding double frees once the array has been attached to a regex.
| The current mechanism of making the array the PVX of an SV is a bit flaky, as the array can be realloced(), and code can be re-entered when utf8 is detected mid-compilation.
| This commit changes the array into separately malloced head and body. The body contains the actual array, and can be realloced. The head contains a pointer to the array, plus size and an 'attached' boolean. This indicates whether the struct has been attached to a regex, and is effectively a 1-bit ref count.
| Whenever a head is allocated, SAVEDESTRUCTOR_X() is used to call S_free_codeblocks() to free the head and body on scope exit. This function skips the freeing if 'attached' is true, and this flag is set only at the point where the head gets attached to the regex.
| In one way this complicates the code, since the num_code_blocks field is now not always available (it's only there if a head has been allocated), but mainly it simplifies, since all the book-keeping is now done in the two new static functions S_alloc_code_blocks() and S_free_codeblocks()
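A rough C sketch of the head/body split described above, under the assumption that the body is a plain array of ints. The names `cb_head`, `head_alloc`, `head_grow` and `head_scope_exit` are hypothetical and do not come from the perl source.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical head: stable address, points at a reallocatable body. */
struct cb_head {
    int   *body;      /* the actual array; may move on realloc */
    size_t count;     /* number of elements in body */
    int    attached;  /* set once a regex owns the head */
};

struct cb_head *head_alloc(size_t n) {
    struct cb_head *h = malloc(sizeof *h);
    if (!h) abort();
    h->body = calloc(n ? n : 1, sizeof *h->body);
    if (!h->body) abort();
    h->count = n;
    h->attached = 0;
    return h;
}

/* Growing the body never changes the head's address. */
void head_grow(struct cb_head *h, size_t n) {
    int *nb = realloc(h->body, (n ? n : 1) * sizeof *nb);
    if (!nb) abort();
    h->body = nb;
    h->count = n;
}

/* Scope-exit destructor: returns 1 if it freed, 0 if it skipped. */
int head_scope_exit(struct cb_head *h) {
    if (h->attached)
        return 0;  /* the regex now owns it */
    free(h->body);
    free(h);
    return 1;
}
```

Anything that holds a pointer to the head stays valid across a realloc of the body, which is the property the PVX-of-an-SV approach lacked.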
* Add /xx regex pattern modifier | Karl Williamson | 2017-01-13 | 1 | -3/+11
| This was first proposed in the thread starting at http://www.nntp.perl.org/group/perl.perl5.porters/2014/09/msg219394.html
* make OP_SPLIT a PMOP, and eliminate OP_PUSHRE | David Mitchell | 2016-10-04 | 1 | -3/+2
| Most ops that execute a regex, such as match and subst, are of type PMOP. A PMOP allows the actual regex to be attached directly to that op, due to its extra fields.
| OP_SPLIT is different; it is just a plain LISTOP, but it always has an OP_PUSHRE as its first child, which *is* a PMOP and which has the regex attached.
| At runtime, pp_pushre()'s only job is to push itself (i.e. the current PL_op) onto the stack. Later pp_split() pops this to get access to the regex it wants to execute.
| This is a bit unpleasant, because we're pushing an OP* onto the stack, which is supposed to be an array of SV*'s. As a bit of a hack, on DEBUGGING builds we push a PVLV with the PL_op address embedded instead, but this still isn't very satisfactory.
| Now that regexes are first-class SVs, we could push a REGEXP onto the stack rather than PL_op. However, there is an optimisation of @array = split which eliminates the assign and embeds the array's GV/padix directly in the PUSHRE op. So split still needs access to that op. But the pushre op will always be splitop->op_first anyway, so one possibility is to just skip executing the pushre altogether, and make pp_split just directly access op_first instead to get the regex and @array info.
| But if we're doing that, then why not just go the full hog and make OP_SPLIT into a PMOP, and eliminate the OP_PUSHRE op entirely: with the data that was spread across the two ops now combined into just the one split op.
| That is exactly what this commit does.
| For a simple compile-time pattern like split(/foo/, $s, 1), the optree looks like:
|     before:
|         <@> split[t2] lK
|            </> pushre(/"foo"/) s/RTIME
|            <0> padsv[$s:1,2] s
|            <$> const(IV 1) s
|     after:
|         </> split(/"foo"/)[t2] lK/RTIME
|            <0> padsv[$s:1,2] s
|            <$> const[IV 1] s
| while for a run-time expression like split(/$pat/, $s, 1),
|     before:
|         <@> split[t3] lK
|            </> pushre() sK/RTIME
|               <|> regcomp(other->8) sK
|                  <0> padsv[$pat:2,3] s
|            <0> padsv[$s:1,3] s
|            <$> const(IV 1)s
|     after:
|         </> split()[t3] lK/RTIME
|            <|> regcomp(other->8) sK
|               <0> padsv[$pat:2,3] s
|            <0> padsv[$s:1,3] s
|            <$> const[IV 1] s
| This makes the code faster and simpler.
| At the same time, two new private flags have been added for OP_SPLIT - OPpSPLIT_ASSIGN and OPpSPLIT_LEX - which make it explicit that the assign op has been optimised away, and if so, whether the array is lexical.
| Also, deparsing of split has been improved, to the extent that perl TEST -deparse op/split.t now passes.
| Also, a couple of panic messages in pp_split() have been replaced with asserts().
* Make deprecated qr//xx fatal | Karl Williamson | 2016-05-09 | 1 | -7/+0
| This has been deprecated since v5.22
* fix "bad match" issue reported in perl #127705 | Yves Orton | 2016-03-15 | 1 | -12/+1
| In 24be310237a0f8f19cfdb71de1b068b4ce9572a0 I reworked how we stored the close_paren info in the regexp match state structure. Unfortunately I missed a subtle aspect of the logic which meant that in certain cases we were relying on close_paren being true to avoid comparing it against a false ARG value for things like CURLYX, which meant that sometimes we would exit a stack frame prematurely.
| This patch fixes that logic and makes it more clear (via macros) what is going on.
* add consistency with other union members | Yves Orton | 2016-03-15 | 1 | -1/+1
| In most cases the curlyx member is the first thing after the yes state member, but eval was reversed. While debugging perl #127705 I switched them to see what would happen, which changed the bug, and ultimately revealed the cause of the problem. So I am going to leave them in the "consistent" order.
* fix Perl #126182, out of memory due to infinite pattern recursion | Yves Orton | 2016-03-06 | 1 | -1/+8
| The way we tracked if pattern recursion was infinite did not work properly. A pattern like "a"=~/(.(?2))((?<=(?=(?1)).))/ would loop forever, slowly eating up all available RAM as it added pattern recursion stack frames.
| This patch changes the rules for recursion so that recursively entering a given pattern "subroutine" twice from the same position fails the match. This means that where previously we might have seen a fatal exception we will now simply fail. This means that "aaabbb"=~/a(?R)?b/ succeeds with $& equal to "aaabbb".
* Unify GOSTART and GOSUB | Yves Orton | 2016-03-06 | 1 | -4/+10
| GOSTART is a special case of GOSUB, we can remove a lot of offset twiddling, and other special casing by unifying them, at pretty much no cost.
| GOSUB has 2 arguments, ARG() and ARG2L(), which are interpreted as a U32 and an I32 respectively. ARG() holds the "parno" we will recurse into. ARG2L() holds a signed offset to the relevant start node for the recursion.
| Prior to this patch the argument to GOSUB would always be >=, and unlike other parts of our logic we would not use 0 to represent "start/end" of pattern, as GOSTART would be used for "recurse to beginning of pattern", after this patch we use 0 to represent "start/end", and a lot of complexity "goes away" along with GOSTART regops.
* first step cleaning up regexp recursion "return" logic | Yves Orton | 2016-03-06 | 1 | -1/+8
| When we do a GOSUB/GOSTART we need to know where the recursion "returns" to the previous position in the pattern. Currently this is done by comparing cur_eval->u.eval.close_paren to the argument of CLOSE parens, and within END blocks (for GOSTART).
| However, there is a problem. The state machinery for GOSUB/GOSTART is shared with EVAL ( implementing /(??{ ... })/ ), which also sets cur_eval->u.eval.close_paren to 0. This for instance breaks /(?(R)YES|NO)/ by making it return true within a /(??{ ... })/ when arguably it shouldn't. It also is confusing.
| So we change the meaning of close_paren so that 0 means "from EVAL", and that otherwise the value is "+1", so 1 means GOSTART (trigger for END), and 2 means "CLOSE1", etc.
| In order to make this transparent this patch adds the EVAL_CLOSE_PAREN_IS( cur_eval, EXPR ) macro which does the right thing:
|     ( cur_eval && cur_eval->u.eval.close_paren && ( ( cur_eval->u.eval.close_paren - 1 ) == EXPR ) )
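The "+1" encoding can be tried out with a simplified C sketch. The macro body follows the expression quoted in the commit message, but the flat `eval_state` struct, the macro name `CLOSE_PAREN_IS` and the helper `returns_at` are hypothetical simplifications (the real field lives at cur_eval->u.eval.close_paren).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flattened stand-in for the regmatch eval state. */
struct eval_state {
    unsigned close_paren;  /* 0 = from EVAL, n+1 = return at paren n */
};

/* Same shape as EVAL_CLOSE_PAREN_IS, on the simplified struct. */
#define CLOSE_PAREN_IS(st, expr) \
    ((st) && (st)->close_paren && (((st)->close_paren - 1) == (expr)))

/* Does this frame return when paren number parno closes?
 * parno 0 stands for "start/end of the whole pattern". */
int returns_at(const struct eval_state *st, unsigned parno) {
    return CLOSE_PAREN_IS(st, parno) ? 1 : 0;
}
```

With this encoding a frame created by EVAL (close_paren == 0) never matches any paren number, including 0, which is what removes the ambiguity described above.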
* Various pods: Add C<> around many typed-as-is things | Karl Williamson | 2015-09-03 | 1 | -1/+1
| Removes 'the' in front of parameter names in some instances.
* remove redundant PERL_EXPORT_C and PERL_XS_EXPORT_C macros | Daniel Dragan | 2015-06-03 | 1 | -1/+1
| These 2 macros were created for the Symbian port in commit "Symbian port of Perl" to replace a direct "extern" token. I guess the author was unaware of PERL_CALLCONV. PERL_CALLCONV is the official macro to use. PERL_XS_EXPORT_C and PERL_EXPORT_C have no usage on cpan grep except for modules with direct copies of core headers.
| A defect of using PERL_EXPORT_C and PERL_XS_EXPORT_C instead of PERL_CALLCONV is that win32/win32.h has no knowledge of the 2 macros and doesn't set them, and os/os2ish.h doesn't either. On Win32, since the unix defaults are used instead of the Win32 specific "__declspec(dllimport)" token, XS modules use indirect function stubs in each XS module placed by the CC to call into perl5**.dll instead of directly calling the core C functions. I observed this in XS-Typemap's DLL.
| To simplify the API, and to decrease the number of macros needing to be implemented to support each platform, just remove the 2 macros.
| Since perl.h's fallback defaults for PERL_CALLCONV are very late in perl.h, they need to be moved up before function declarations start in perlio.h (perlio.h is included from iperlsys.h).
| win32iop.h contains the "PerlIO" and "SV" tokens, so perlio.h must be included before win32iop.h is. Including perlio.h so early in win32.h causes PERL_CALLCONV to not be defined since the Win32 platform uses the fallback in perl.h, since win32.h doesn't always define PERL_CALLCONV and sometimes relies on the fallback. Since win32iop.h contains a lot of declarations, it belongs with other declarations such as those in proto.h so move it from win32.h to perl.h.
| The "free" token in struct regexp_engine conflicts with win32iop's "#define free win32_free" so rename that member.
* Replace common Emacs file-local variables with dir-locals | Dagfinn Ilmari Mannsåker | 2015-03-22 | 1 | -6/+0
| An empty cpan/.dir-locals.el stops Emacs using the core defaults for code imported from CPAN.
| Committer's work:
| To keep t/porting/cmp_version.t and t/porting/utils.t happy, $VERSION needed to be incremented in many files, including throughout dist/PathTools.
| perldelta entry for module updates.
| Add two Emacs control files to MANIFEST; re-sort MANIFEST.
| For: RT #124119.
* Reserve a bit for the 're strict' subpragma. | Karl Williamson | 2015-01-13 | 1 | -1/+1
| This is another step in the process
* Support for nocapture regexp flag /n | Matthew Horsfall (alh) | 2014-12-28 | 1 | -5/+8
| This flag will prevent () from capturing and filling in $1, $2, etc...
| Named captures will still work though, and if used will cause $1, $2, etc... to be filled in *only* within named groups.
| The motivation behind this is to allow the common construct of:
|     /(?:b|c)a(?:t|n)/
| To be rewritten more cleanly as:
|     /(b|c)a(t|n)/n
| When you want grouping but no memory penalty on captures.
| You can also use ?n inside of a () directly to avoid capturing, and ?-n inside of a () to negate its effects if you want to capture.
* Create bit for /n. | Karl Williamson | 2014-12-28 | 1 | -1/+1
|
* [perl #122911] regexp.h: Rmv VOL from op_comp sig | Father Chrysostomos | 2014-10-06 | 1 | -1/+1
| It is no longer needed as of 1067df30ae9.
* Suppress some Solaris warnings | Karl Williamson | 2014-09-29 | 1 | -15/+16
| We get an integer overflow message when we left shift a 1 into the highest bit of a word. This changes the 1's into 1U's to indicate unsigned. This is done for all the flag bits in the affected word, as they could get reordered by someone in the future, unintentionally reintroducing this problem.
* Deprecate multiple "x" in "/xx" | Karl Williamson | 2014-09-29 | 1 | -5/+12
| It is planned for a future Perl release to have /xx mean something different from just /x. To prepare for this, this commit raises a deprecation warning if someone currently has this usage. A grep of CPAN did not turn up any instances of this, but this is to be safe anyway.
| The added code is more general than actually needed, in case we want to do this for another flag.
* Make space for /xx flag | Karl Williamson | 2014-09-29 | 1 | -2/+2
| This doesn't actually use the flag yet. We no longer have to make version-dependent changes to ext/Devel-Peek/t/Peek.t, (it being in /ext) so this doesn't
* regexp.h: Comment shared-pool free bits scheme | Karl Williamson | 2014-09-29 | 1 | -3/+39
|
* regexp.h: Make tentative division of free-bit space | Karl Williamson | 2014-09-29 | 1 | -20/+18
| This sets a #define to point in the middle of the free-space, so that bits at either end can be added without having to adjust many other defines.
* regexp.h: Define flag bit directly, not indirectly | Karl Williamson | 2014-09-29 | 1 | -8/+5
| This #defined a symbol then did a compile time check that it was the same as another symbol. This commit simply defines it as the other symbol directly, and moves it to above the other definitions, which it no longer is part of. This prepares for the next commit.
* regexp.h: Remove unused bit placeholders | Karl Williamson | 2014-09-29 | 1 | -6/+1
| We do not need a placeholder for unused flag bits. And removing them makes the generated regnodes.h more accurate as to what bits are available.
* regexp.h: Move regex flag bit positions. | Karl Williamson | 2014-09-29 | 1 | -5/+6
| This moves three bits to create a block of unused bits at the beginning. The first bit had to be moved to make space for other uses that are coming in future commits. This breaks binary compatibility, so might as well move the other two bits so that all the unused bits are consolidated at the beginning.
| This pool of unused bits is the boundary between the bits that are common to op.h and regexp.h (and in op_reg_common.h) and those that are separate. It's best to have all the unused bits there, so when we need to use one, it can be taken from either side, as needed, without us being trapped into having an available bit, but of the wrong kind.
* Some low-hanging -Wunreachable-code fruits. | Jarkko Hietaniemi | 2014-06-15 | 1 | -31/+0
| - after return/croak/die/exit, return/break are pointless (break is not a terminator/separator, it's a goto)
| - after goto, another goto (!) is pointless
| - in some cases (usually function ends) introduce explicit NOT_REACHED to make the noreturn nature clearer (do not do this everywhere, though, since that would mean adding NOT_REACHED after every croak)
| - for the added NOT_REACHED also add /* NOTREACHED */ since NOT_REACHED is for gcc (and VC), while the comment is for linters
| - declaring variables in switch blocks is just too fragile: it kind of works for narrowing the scope (which is nice), but breaks the moment there are initializations for the variables (the initializations will be skipped since the flow will bypass the start of the block); in some easy cases simply hoist the declarations out of the block and move them earlier
| Note 1: Since after this patch the core is not yet -Wunreachable-code clean, not enabling that via cflags.SH, one needs to -Accflags=... it.
| Note 2: At least with the older gcc 4.4.7 there are far too many "unreachable code" warnings, which seem to go away with gcc 4.8, maybe better flow control analysis. Therefore, the warning should eventually be enabled only for modernish gccs (what about clang and Intel cc?)
* Revert "Some low-hanging -Wunreachable-code fruits." | Jarkko Hietaniemi | 2014-06-13 | 1 | -2/+2
| This reverts commit 8c2b19724d117cecfa186d044abdbf766372c679.
| I don't understand - smoke-me came back happy with three separate reports... oh well, some other time.
* Some low-hanging -Wunreachable-code fruits. | Jarkko Hietaniemi | 2014-06-13 | 1 | -2/+2
| - after croak/die/exit (or return), break (or return!) are pointless (break is not a terminator/separator, it's a promise of a jump)
| - after goto, another goto (!) is pointless
| - in some cases (usually function ends) introduce explicit NOT_REACHED to make the noreturn nature clearer (do not do this everywhere, though, since that would mean adding NOT_REACHED after every croak)
| - for the added NOT_REACHED also add /* NOTREACHED */ since NOT_REACHED is for gcc (and VC), while the comment is for linters
| - declaring variables in switch blocks is just too fragile: it kind of works for narrowing the scope (which is nice), but breaks the moment there are initializations for the variables (they will be skipped!); in some easy cases simply hoist the declarations out of the block and move them earlier
| There are still a few places left.
* Undo 63b558ddd980cd36bcbd8a7465a3412e886ba75e. | Jarkko Hietaniemi | 2014-05-29 | 1 | -1/+1
| (For some odd reason assert() cannot be found and Jenkins becomes apoplectic.)
* Use NOT_REACHED for the impossible case. | Jarkko Hietaniemi | 2014-05-29 | 1 | -1/+1
| The default case really is impossible because all the valid enums values are already covered in the switch. The NOT_REACHED; is for the compiler (from perl.h), the /* NOTREACHED */ is for static analyzers.
* [perl #121854] use re 'taint' regression | David Mitchell | 2014-05-13 | 1 | -2/+1
| Commit v5.19.8-533-g63baef5 changed the handling of locale-dependent regexes so that the pattern was considered tainted at compile-time, rather than determining it each time at run-time whenever it executed a locale-dependent node. Unfortunately due to the conflating of two flags, RXf_TAINTED and RXf_TAINTED_SEEN, it had the side effect of permanently marking a pattern as tainted once it had had a single tainted result.
| E.g.
|     use re qw(taint);
|     use Scalar::Util qw(tainted);
|     for ($^X, "abc") {
|         /(.*)/ or die;
|         print "not " unless tainted("$1");
|         print "tainted\n";
|     };
| which from 5.19.9 onwards output:
|     tainted
|     tainted
| but with this commit (and with 5.19.8 and earlier), it now outputs:
|     tainted
|     not tainted
| The RXf_TAINTED flag indicates that the pattern itself is tainted, e.g.
|     $r = qr/$tainted_value/
| while the RXf_TAINTED_SEEN flag means that the results of the last match are tainted, e.g.
|     use re 'taint';
|     $tainted =~ /(.*)/; # $1 is tainted
| Pre 63baef5, the code used to look like:
|     at run-time:
|         turn off RXf_TAINTED_SEEN;
|         while (nodes to execute) {
|             switch(node) {
|             case BOUNDL: /* and other locale-specific ops */
|                 turn on RXf_TAINTED_SEEN;
|                 ...;
|             }
|         }
|         if (tainted || RXf_TAINTED)
|             turn on RXf_TAINTED_SEEN;
| 63baef5 changed it to:
|     at compile-time:
|         if (pattern has locale ops)
|             turn on RXf_TAINTED_SEEN;
|     at run-time:
|         while (nodes to execute) { ... }
|         if (tainted || RXf_TAINTED)
|             turn on RXf_TAINTED_SEEN;
| This commit changes it to:
|     at compile-time:
|         if (pattern has locale ops)
|             turn on RXf_TAINTED;
|     at run-time:
|         turn off RXf_TAINTED_SEEN;
|         while (nodes to execute) { ... }
|         if (tainted || RXf_TAINTED)
|             turn on RXf_TAINTED_SEEN;
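The final pseudocode in this commit message can be turned into a small C sketch. The flag values, the `pattern` struct, and the `compile_pattern` and `run_match` functions are hypothetical; only the separation of "pattern is tainted" from "last result was tainted" is taken from the text.

```c
#include <assert.h>

#define TAINTED      (1U << 0)  /* the pattern itself is tainted */
#define TAINTED_SEEN (1U << 1)  /* results of the last match are tainted */

struct pattern { unsigned flags; };

/* Compile time: locale ops taint the pattern, not the results. */
void compile_pattern(struct pattern *p, int has_locale_ops, int pattern_tainted) {
    p->flags = 0;
    if (has_locale_ops || pattern_tainted)
        p->flags |= TAINTED;
}

/* Run time: clear the per-match flag first, then recompute it.
 * Returns 1 if the captures of this match would be tainted. */
int run_match(struct pattern *p, int input_tainted) {
    p->flags &= ~TAINTED_SEEN;
    /* ... the regex nodes would execute here ... */
    if (input_tainted || (p->flags & TAINTED))
        p->flags |= TAINTED_SEEN;
    return (p->flags & TAINTED_SEEN) != 0;
}
```

Matching a tainted input and then an untainted one against the same untainted pattern gives "tainted" followed by "not tainted", mirroring the corrected output in the example above.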
* regex substrs: record index of check substr | David Mitchell | 2014-02-07 | 1 | -0/+1
| Currently prog->substrs->data[] is a 3 element array of structures. Elements 0 and 1 record the longest anchored and floating substrings, while element 2 ('check'), is a copy of the longest of 0 and 1.
| Record in a new field, prog->substrs->check_ix, the index of which element was copied. (Eventually I intend to remove the copy altogether.)
| Also for the anchored substr, set max_offset equal to min offset. Previously it was left as zero and ignored, although if copied to check, the check copy of max *was* set equal to min. Having this always set will allow us to make the code simpler.
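A simplified C sketch of the check_ix bookkeeping follows. The struct layouts are cut down to the fields mentioned in the message, the function `choose_check` is a hypothetical name, and the tie-breaking rule (prefer the anchored substring when lengths are equal) is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for reg_substr_datum; offsets are in characters. */
struct substr_datum {
    size_t min_offset;
    size_t max_offset;
    size_t len;
};

/* data[0] = anchored, data[1] = floating, data[2] = check (a copy). */
struct substr_data {
    struct substr_datum data[3];
    int check_ix;  /* which of data[0] / data[1] was copied to data[2] */
};

void choose_check(struct substr_data *s) {
    /* An anchored substring sits at a fixed offset, so max == min. */
    s->data[0].max_offset = s->data[0].min_offset;
    /* Record which element is the longer one, then copy it. */
    s->check_ix = (s->data[1].len > s->data[0].len) ? 1 : 0;
    s->data[2] = s->data[s->check_ix];
}
```

Once the index is recorded, code that needs the check substring could read `data[check_ix]` directly, which is what would let the copy be dropped later.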
* regexp.h: document the fields of reg_substr_datum | David Mitchell | 2014-02-07 | 1 | -3/+3
| In particular, specify that the various offset fields are char rather than byte counts.
* Avoid compiler warnings by consistently using #ifdef instead of plain #if | Brian Fraser | 2014-02-05 | 1 | -1/+1
|
* Add RXf_UNBOUNDED_QUANTIFIER and regexp->maxlen | Yves Orton | 2014-02-03 | 1 | -1/+2
| The flag tells us that a pattern may match an infinitely long string.
| The new member in the regexp struct tells us how long the string might be.
| With these two items we can implement regexp based $/
* Move the RXf_ANCH flags to intflags as PREGf_ANCH_xxx and add RXf_IS_ANCHORED as a replacement | Yves Orton | 2014-01-31 | 1 | -8/+5
| The only requirement outside of the regex engine is to identify that there is an anchor involved at all. So we move the 4 anchor flags to intflags and replace it with a single aggregate flag RXf_IS_ANCHORED in extflags.
| This frees up another 3 bits in extflags.
* rename RXf_UNUSED flags to match their BASE_SHIFT offset | Yves Orton | 2014-01-31 | 1 | -4/+4
| So they stay stable as I move other flags from extflags to intflags
* move RXf_GPOS_SEEN and RXf_GPOS_FLOAT to intflags | Yves Orton | 2014-01-31 | 1 | -5/+4
| This required removing the RXf_GPOS_CHECK mask as it uses one flag that will stay in extflags for now (RXf_ANCH_GPOS), and one flag that moves to intflags (RXf_GPOS_SEEN). This mask is strange however, as you can't have RXf_ANCH_GPOS without having RXf_GPOS_SEEN so I don't know why we test both. Further investigation required.
* Rename RXf_CANY_SEEN to PREGf_CANY_SEEN and move from extflags to intflags | Yves Orton | 2014-01-31 | 1 | -2/+2
|
* move RXf_NOSCAN from extflags to intflags as PREGf_NOSCAN | Yves Orton | 2014-01-31 | 1 | -1/+1
| Includes some improvements to how we dump regexps so that when a regexp is for the standard perl engine we also show the intflags for the engine
* perlapi: Consistent spaces after dots | Father Chrysostomos | 2013-12-29 | 1 | -1/+1
| plus some typo fixes. I probably changed some things in perlintern, too.
* Use SSize_t/STRLEN in more places in regexp code | Father Chrysostomos | 2013-08-25 | 1 | -10/+10
| As part of getting the regexp engine to handle long strings, this commit changes any variables, parameters and struct members that hold lengths of the string being matched against (or parts thereof) to use SSize_t or STRLEN instead of [IU]32.
| To avoid having to change any logic, I kept the signedness the same.
| I did not change anything that affects the length of the regular expression itself, so regexps are still practically limited to I32_MAX. Changing that would involve changing the size of regnodes, which would be a lot more involved.
| These changes should fix bugs, but are very hard to test. In most cases, I don’t know the regexp engine well enough to come up with test cases that test the paths in question with long strings. In other cases I don’t have a box with enough memory to test the fix.
* Stop substr re optimisation from rejecting long strs | Father Chrysostomos | 2013-08-25 | 1 | -2/+2
| Using I32 for the fields that record information about the location of a fixed string that must be found for a regular expression to match can result in match failures, because I32 is not large enough to store offsets >= 2**31.
| SSize_t is appropriate, since it is 64 bits on 64-bit platforms and 32 bits on 32-bit platforms.
| This commit changes enough instances of I32 to SSize_t to get the added test passing and suppress compiler warnings. A later commit will change many more.
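The size limit behind this change can be illustrated with a short C sketch, using `int64_t` as a stand-in for SSize_t on a 64-bit platform and `int32_t` limits as a stand-in for I32. The function `fits_in_i32` is a hypothetical helper, not part of perl.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if a string offset can be stored in a 32-bit signed field
 * without loss, 0 otherwise. Offsets of 2**31 and above do not fit. */
int fits_in_i32(int64_t offset) {
    return offset >= INT32_MIN && offset <= INT32_MAX;
}
```

An offset of exactly 2**31 is the first value that fails this check, which matches the threshold named in the commit message.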
* Make $' work past the 2**31 threshold | Father Chrysostomos | 2013-08-25 | 1 | -1/+1
|
* [perl #116907] Allow //g matching past 2**31 threshold | Father Chrysostomos | 2013-08-25 | 1 | -3/+4
| Change the internal fields for storing positions so that //g in scalar context can move past the 2**31 character threshold. Before this commit, the numbers would wrap, resulting in assertion failures.
| The changes in this commit are only enough to get the added test passing. Stay tuned for more.
* Stop pos() from being confused by changing utf8ness | Father Chrysostomos | 2013-08-25 | 1 | -0/+1
| The value of pos() is stored as a byte offset. If it is stored on a tied variable or a reference (or glob), then the stringification could change, resulting in pos() now pointing to a different character offset or pointing to the middle of a character:
|     $ ./perl -Ilib -le '$x = bless [], chr 256; pos $x=1; bless $x, a; print pos $x'
|     2
|     $ ./perl -Ilib -le '$x = bless [], chr 256; pos $x=1; bless $x, "\x{1000}"; print pos $x'
|     Malformed UTF-8 character (unexpected end of string) in match position at -e line 1.
|     0
| So pos() should be stored as a character offset.
| The regular expression engine expects byte offsets always, so allow it to store bytes when possible (a pure non-magical string) but use characters otherwise.
| This does result in more complexity than I should like, but the alternative (always storing a character offset) would slow down regular expressions, which is a big no-no.
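The byte-versus-character distinction can be sketched in C with two small conversion helpers for UTF-8 data. Both function names are hypothetical, and the sketch assumes well-formed UTF-8 input; it is not how perl itself converts offsets.

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters that start before byte_off. A byte is a
 * continuation byte when its top two bits are 10. */
size_t utf8_byte_to_char_offset(const unsigned char *s, size_t byte_off) {
    size_t chars = 0;
    for (size_t i = 0; i < byte_off; i++) {
        if ((s[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}

/* Find the byte offset at which character number char_off starts,
 * stopping at len if the string is shorter. */
size_t utf8_char_to_byte_offset(const unsigned char *s, size_t len,
                                size_t char_off) {
    size_t i = 0;
    while (i < len && char_off > 0) {
        i++;  /* step over the lead byte */
        while (i < len && (s[i] & 0xC0) == 0x80)
            i++;  /* step over continuation bytes */
        char_off--;
    }
    return i;
}
```

A character offset stays meaningful when the underlying bytes change encoding, whereas a stored byte offset can end up in the middle of a multi-byte character, as in the second shell example above.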
* improve regexec_flags() API documentation | David Mitchell | 2013-08-13 | 1 | -11/+16
| In the API, rename the 'screamer' arg to be 'sv' instead; update the description of the function's args; improve the documentation of the REXEC_* flags for the 'flags' arg.
* s/.(?=.\G)/X/g: refuse to go backwards | David Mitchell | 2013-07-28 | 1 | -0/+3
| On something like:
|     $_ = "123456789";
|     pos = 6;
|     s/.(?=.\G)/X/g;
| each iteration could in theory start with pos one character to the left of the previous position, and with the substitution replacing bits that it has already replaced. Since that way madness lies, ban any attempt by s/// to substitute to the left of a previous position.
| To implement this, add a new flag to regexec(), REXEC_FAIL_ON_UNDERFLOW. This tells regexec() to return failure even if the match itself succeeded, but where the start of $& is before the passed stringarg point.
| This change caused one existing test to fail (which was added about a year ago):
|     $_="abcdef";
|     s/bc|(.)\G(.)/$1 ? "[$1-$2]" : "XX"/ge;
|     print; # used to print "aXX[c-d][d-e][e-f]"; now prints "aXXdef"
| I think that that test relies on ambiguous behaviour, and that my change makes things saner.
| Note that s/// with \G is generally very under-tested.
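The rule that REXEC_FAIL_ON_UNDERFLOW adds can be sketched as a single check in C. The constant `FAIL_ON_UNDERFLOW` and its value, and the function `accept_match`, are hypothetical; the real flag's numeric value is not given in the commit message.

```c
#include <assert.h>
#include <stddef.h>

#define FAIL_ON_UNDERFLOW 0x1u  /* hypothetical value, for illustration */

/* Decide whether a successful match should be reported as success.
 * match_start is where $& begins; stringarg is the point the caller
 * asked matching to start from. Returns 1 to accept, 0 to fail. */
int accept_match(size_t match_start, size_t stringarg, unsigned flags) {
    if ((flags & FAIL_ON_UNDERFLOW) && match_start < stringarg)
        return 0;  /* match begins to the left of the start point */
    return 1;
}
```

Without the flag a match that starts before stringarg is still accepted, so only callers that opt in (s///, per the message) see the stricter behaviour.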
* regexec: handle \G ourself, rather than in callers | David Mitchell | 2013-07-28 | 1 | -0/+3
| Normally a /g match starts its processing at the previous pos() (or at char 0 if pos is not set); however in the case of something like /abc\G/ we actually need to start 3 characters before pos.
| This has been handled by the *callers* of regexec() subtracting prog->gofs from the stringarg arg before calling it, or by setting stringarg to strbeg for floating, such as /\w+\G/.
| This is clearly wrong: the callers of regexec() shouldn't need to worry about the details of getting \G right: move this code into regexec() itself.
| (Note that although this commit passes all tests, it quite possibly isn't logically correct. It will get fixed up further during the next few commits)