summaryrefslogtreecommitdiff
path: root/regexec.c
Commit message (Collapse)AuthorAgeFilesLines
* Avoid redundant text -in -Dr outputKarl Williamson2014-08-211-2/+2
| | | | | | | | | | There is an optimization in regcomp.c which saves memory when used, but which caused -Dr output to display the same code points twice. I didn't add a test to ext/re/t/regop.t because this only happens for a Unicode property, under certain circumstances, and that triggers a lot of other regex patterns to be compiled which are subject to change, so the test would frequently have to be updated. This only affects debugging output anyway.
* Move _get_regclass_nonbitmap_data() to regcomp.cKarl Williamson2014-08-211-112/+0
| | | | | | | | | | | | | This function, though mostly used by regexec.c and previously contained therein, relies on the particular data structure in regcomp.c, so it makes sense to move it adjacent to the code that generates the structure. Future commits will add function calls to this function that are (currently) only in regcomp.c, and so will tie it tighter to that file. This just moves the code with surrounding #ifdef, adds a couple of blank lines, removes some trailing white space, and adjusts the comments. No other changes were made.
* regex: Use #define for number of bits in ANYOFKarl Williamson2014-08-211-1/+1
| | | | | | | ANYOF nodes (for bracketed character classes) currently are for code points 0-255. This is the first step in the eventual making that size configurable. This also renames a static function, as the domain may not necessarily be 'latin1'
* Move return false out of switch default.Jarkko Hietaniemi2014-07-291-8/+2
|
* NOTREACHED goes at/in the unreachable, not after it.Jarkko Hietaniemi2014-07-281-47/+90
|
* Negatives as UVs: sign-extension intentional, add cast.Jarkko Hietaniemi2014-07-281-2/+2
|
* wrap op_sibling field access in OP_SIBLING* macrosDavid Mitchell2014-07-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | Remove (almost all) direct access to the op_sibling field of OP structs, and use these three new macros instead: OP_SIBLING(o); OP_HAS_SIBLING(o); OP_SIBLING_set(o, new_value); OP_HAS_SIBLING is intended to be a slightly more efficient version of OP_SIBLING when only boolean context is needed. For now these three macros are just defined in the obvious way: #define OP_SIBLING(o) (0 + (o)->op_sibling) #define OP_HAS_SIBLING(o) (cBOOL((o)->op_sibling)) #define OP_SIBLING_set(o, sib) ((o)->op_sibling = (sib)) but abstracting them out will allow us shortly to make the last pointer in an op_sibling chain point back to the parent rather than being null, with a new flag indicating whether this is the last op. Perl_ck_fun() still has a couple of direct uses of op_sibling, since it takes the field's address, which is not covered by these macros.
* Assert before deref due to possible NULL.Jarkko Hietaniemi2014-06-301-0/+1
| | | | Coverity perl5 CID 68587.
* Guard cur_curlyx at least with an assert.Jarkko Hietaniemi2014-06-271-3/+6
| | | | Keep on keeping Coverity happy (Coverity perl5 CID 45352).
* regexec.c: Move some macro definitions aroundKarl Williamson2014-06-261-38/+38
| | | | This puts them in a more logical order
* PATCH: [perl #122090] Non-word-boundary doesn't match EOSKarl Williamson2014-06-261-2/+12
| | | | | | | | | | | | | | | | | | | The root cause of this is that the loop is a "< strend" instead of a "<=". However, the macro that is called to form the loop is used elsewhere where "<" is appropriate. So I opted to just repeat the small salient portions of the loop right after it so it gets executed one final time in the final position. This code: if ((!prog->minlen && tmp) && (reginfo->intuit || regtry(reginfo, &s))) \ goto got_it; had the effect previously of causing \b to look at the position just after the string, but not \B. By doing it the other way, this is no longer needed, except I don't understand the prog->minlen portion. It isn't used anywhere else in the tests which see if one should goto got_it, nor do any test suite tests fail by removing it. I think it is a relic, but if a bisect should end up blaming this commit, I'd start with that.
* regexec.c: Comments and white-space onlyKarl Williamson2014-06-261-87/+126
|
* regexec.c: Exchange 2 lines of code and add commentsKarl Williamson2014-06-261-2/+11
| | | | | This reorders things to avoid unnecessary work, as the now-first line may jump, making the assignment that used to be unconditional irrelevant
* regexec.c: More cleaning of FBC macro/code interfaceKarl Williamson2014-06-261-35/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The definition of \w is now compiled into the Perl core. This allows the complicated swash_fetch function call to be replaced by isWORDCHAR_utf8, which takes a single parameter, so the interface can be simplified. [1]. This macro will execute faster on Latin1-range inputs, as it doesn't do a swash_fetch on them, but slower on other code points due to function call overhead, and some currently in-place error checking that wasn't done previously. This overhead could be removed by using inline functions, and perhaps a different interface for known non-malformed input (though I'm actually not sure the input is known to be well-formed in this case). These macros still depend on and modify outside variables. That could be cleaned up by adding additional parameters to them, but I'm not going to do it now. I don't like these kinds of code-generating macros, and have been tempted to rewrite these as inline functions, but it's not a trivial task to do. [1] I hadn't realized it before, but the interface could have been cleaned up instead by introducting a macro that makes it look like a single parameter is used uniformly to existing macros, looking like #define FBC_BOUND_SWASH_FETCH(s) \ cBOOL(swash_fetch(PL_utf8_swash_ptrs[_CC_WORDCHAR], s, utf8_target)) But it seems better to me to use isWORDCHAR_utf8 as it is faster for Western European languages, and can be made nearly the same speed as the alternative if experience tells us that this is a slow spot that should be sped up.
* regexec.c: Clean up macro/code interface slightlyKarl Williamson2014-06-261-17/+16
| | | | | | | | | These code-generating macros are a pain to maintain. These have side effects and use variables outside the macros. This commit cleans up the interface slightly by separating the operand of a function from the function, so the code outside doesn't have to know as much about the macro as before. The next commit will fix up the remaining parameter like this.
* regexec.c: Change names of 4 macrosKarl Williamson2014-06-261-14/+14
| | | | | | The names were misleading; they were not for general use, but only for this FBC area of the code, and they no longer have to do with having to load utf8 or not.
* regexec.c: Change MiXeD cAsE formal macro parametersKarl Williamson2014-06-261-16/+16
| | | | | | | I find these a pain to type and read, and there is no reason for them. As formal parameters there is no possibility of conflict with the code outside them. And I'm about to do some editing of these, so this change will make that easier.
* Remove or downgrade unnecessary dVAR.Jarkko Hietaniemi2014-06-251-15/+0
| | | | | | | | You need to configure with g++ *and* -Accflags=-DPERL_GLOBAL_STRUCT or -Accflags=-DPERL_GLOBAL_STRUCT_PRIVATE to see any difference. (g++ does not do the "post-annotation" form of "unused".) The version code has some of these issues, reported upstream.
* Revert "/* NOTREACHED */ belongs *before* the unreachable."Jarkko Hietaniemi2014-06-191-86/+43
| | | | | | This reverts commit 148f39b7de6eae9ddd59e0b0aff691d6abea7aca. (Still needs more work, but wanted to see how well this passed with Jenkins.)
* /* NOTREACHED */ belongs *before* the unreachable.Jarkko Hietaniemi2014-06-191-43/+86
| | | | | | Definitely not *after* it. It marks the start of the unreachable, not the first unrechable line. And if they are in that order, it looks better to linebreak after the lint hint.
* Do not printf U32 and I32 (both) as %i.Jarkko Hietaniemi2014-06-161-2/+3
| | | | (Detected in HP-UX.)
* Some low-hanging -Wunreachable-code fruits.Jarkko Hietaniemi2014-06-151-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | - after return/croak/die/exit, return/break are pointless (break is not a terminator/separator, it's a goto) - after goto, another goto (!) is pointless - in some cases (usually function ends) introduce explicit NOT_REACHED to make the noreturn nature clearer (do not do this everywhere, though, since that would mean adding NOT_REACHED after every croak) - for the added NOT_REACHED also add /* NOTREACHED */ since NOT_REACHED is for gcc (and VC), while the comment is for linters - declaring variables in switch blocks is just too fragile: it kind of works for narrowing the scope (which is nice), but breaks the moment there are initializations for the variables (the initializations will be skipped since the flow will bypass the start of the block); in some easy cases simply hoist the declarations out of the block and move them earlier Note 1: Since after this patch the core is not yet -Wunreachable-code clean, not enabling that via cflags.SH, one needs to -Accflags=... it. Note 2: At least with the older gcc 4.4.7 there are far too many "unreachable code" warnings, which seem to go away with gcc 4.8, maybe better flow control analysis. Therefore, the warning should eventually be enabled only for modernish gccs (what about clang and Intel cc?)
* Revert "Some low-hanging -Wunreachable-code fruits."Jarkko Hietaniemi2014-06-131-0/+3
| | | | | | | This reverts commit 8c2b19724d117cecfa186d044abdbf766372c679. I don't understand - smoke-me came back happy with three separate reports... oh well, some other time.
* Some low-hanging -Wunreachable-code fruits.Jarkko Hietaniemi2014-06-131-3/+0
| | | | | | | | | | | | | | | | | | - after croak/die/exit (or return), break (or return!) are pointless (break is not a terminator/separator, it's a promise of a jump) - after goto, another goto (!) is pointless - in some cases (usually function ends) introduce explicit NOT_REACHED to make the noreturn nature clearer (do not do this everywhere, though, since that would mean adding NOT_REACHED after every croak) - for the added NOT_REACHED also add /* NOTREACHED */ since NOT_REACHED is for gcc (and VC), while the comment is for linters - declaring variables in switch blocks is just too fragile: it kind of works for narrowing the scope (which is nice), but breaks the moment there are initializations for the variables (they will be skipped!); in some easy cases simply hoist the declarations out of the block and move them earlier There are still a few places left.
* Silence several -Wunused-parameter warnings about my_perlBrian Fraser2014-06-131-1/+1
| | | | | | | | This meant sprinkling some PERL_UNUSED_CONTEXT invocations, as well as stopping some functions from getting my_perl in the first place; all of the functions in the latter category are internal (S_ prefix and s or i in embed.fnc), so this should be both safe and economical.
* Change newSVpvn("…", …) to newSVpvs("…")Dagfinn Ilmari Mannsåker2014-06-121-1/+1
| | | | | The dual-life dists affected use Devel::PPPort, so can safely use newSVpvs() even though it wasn't added until Perl v5.8.9.
* regexec.c: Eliminate a malloc/freeKarl Williamson2014-05-301-5/+4
| | | | This uses an C automatic variable instead of a malloc and free.
* regcomp.c, regexec.c: Move common code to a functionKarl Williamson2014-05-301-8/+1
| | | | There are other cases where this functionality will be needed as well.
* regexec.c: Fix some EBCDIC problemsKarl Williamson2014-05-301-6/+4
| | | | | | We were testing for UTF-8 invariant, when we should have been testing for ASCII. This is a problem only on EBCDIC platforms, where they mean two different sets of code points.
* Unify the "fall-through" lint annotation.Jarkko Hietaniemi2014-05-291-17/+17
| | | | | | | Used by linters (static checkers), and also good for human readers. Even though "FALL THROUGH" seems to be the most common, e.g BSD lint manual only knows "FALLTHROUGH" (or "FALLTHRU").
* Insert asserts to paths suspected by Coverity.Jarkko Hietaniemi2014-05-291-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Coverity suspects paths like this: (1) p = foo(); if (!p) p = bar(); baz(p->q); since it cannot prove that p is always non-NULL at the dereference. (2) Or simply something like: mg = mg_find(...); foo(mg->blah); Since mg_find() can fail returning NULL. Adding assert() calls before the dereferences. Testing with -DDEBUGGING. Hopefully there are regular smokes doing the same. [perl #121894] Fix for Coverity perl5 CIDs 28950,28952..28955,28964,28967,28970..28795,49921: CID ...: Dereference after null check (FORWARD_NULL) var_deref_op: Dereferencing null pointer p->q (TODO: Coverity perl5 CIDs 28962,28968,28969: the same issue, but would conflict with already in-flight changes, prepare catch-up patch later.) --- dist/Data-Dumper/Dumper.xs | 1 + dump.c | 1 + ext/Devel-Peek/Peek.xs | 1 + ext/XS-APItest/APItest.xs | 1 + mg.c | 2 ++ mro.c | 4 +++- op.c | 4 ++++ perlio.c | 1 + pp_hot.c | 5 +++-- regcomp.c | 3 +++ regexec.c | 5 ++++- universal.c | 2 +- util.c | 4 +++- 13 files changed, 28 insertions(+), 6 deletions(-) diff --git a/dist/Data-Dumper/Dumper.xs b/dist/Data-Dumper/Dumper.xs index 12c4ebd..23e0cf4 100644 --- a/dist/Data-Dumper/Dumper.xs +++ b/dist/Data-Dumper/Dumper.xs @@ -641,6 +641,7 @@ DD_dump(pTHX_ SV *val, const char *name, STRLEN namelen, SV *retval, HV *seenhv, else { sv_pattern = val; } + assert(sv_pattern); rval = SvPV(sv_pattern, rlen); rend = rval+rlen; slash = rval; diff --git a/dump.c b/dump.c index 59be3e0..c2d72fd 100644 --- a/dump.c +++ b/dump.c @@ -471,6 +471,7 @@ Perl_sv_peek(pTHX_ SV *sv) finish: while (unref--) sv_catpv(t, ")"); + /* XXX when is sv ever NULL? */ if (TAINTING_get && SvTAINTED(sv)) sv_catpv(t, " [tainted]"); return SvPV_nolen(t); diff --git a/ext/Devel-Peek/Peek.xs b/ext/Devel-Peek/Peek.xs index 679efa5..b20fa94 100644 --- a/ext/Devel-Peek/Peek.xs +++ b/ext/Devel-Peek/Peek.xs @@ -450,6 +450,7 @@ PPCODE: BOOT: { CV * const cv = get_cvn_flags("Devel::Peek::Dump", 17, 0); + assert(cv); cv_set_call_checker(cv, S_ck_dump, (SV *)cv); XopENTRY_set(&my_xop, xop_name, "Dump"); diff --git a/ext/XS-APItest/APItest.xs b/ext/XS-APItest/APItest.xs index a51924d..18ba381 100644 --- a/ext/XS-APItest/APItest.xs +++ b/ext/XS-APItest/APItest.xs @@ -2096,6 +2096,7 @@ newCONSTSUB(stash, name, flags, sv) break; } EXTEND(SP, 2); + assert(mycv); PUSHs( CvCONST(mycv) ? &PL_sv_yes : &PL_sv_no ); PUSHs((SV*)CvGV(mycv)); diff --git a/mg.c b/mg.c index 76912bd..7f3339a 100644 --- a/mg.c +++ b/mg.c @@ -1675,6 +1675,7 @@ Perl_magic_clearisa(pTHX_ SV *sv, MAGIC *mg) same function. */ mg = mg_find(mg->mg_obj, PERL_MAGIC_isa); + assert(mg); if (SvTYPE(mg->mg_obj) == SVt_PVAV) { /* multiple stashes */ SV **svp = AvARRAY((AV *)mg->mg_obj); I32 items = AvFILLp((AV *)mg->mg_obj) + 1; @@ -3437,6 +3438,7 @@ Perl_magic_copycallchecker(pTHX_ SV *sv, MAGIC *mg, SV *nsv, sv_magic(nsv, &PL_sv_undef, mg->mg_type, NULL, 0); nmg = mg_find(nsv, mg->mg_type); + assert(nmg); if (nmg->mg_flags & MGf_REFCOUNTED) SvREFCNT_dec(nmg->mg_obj); nmg->mg_ptr = mg->mg_ptr; nmg->mg_obj = SvREFCNT_inc_simple(mg->mg_obj); diff --git a/mro.c b/mro.c index 1b37ca7..ccf4bf4 100644 --- a/mro.c +++ b/mro.c @@ -638,12 +638,14 @@ Perl_mro_isa_changed_in(pTHX_ HV* stash) hv_storehek(mroisarev, namehek, &PL_sv_yes); } - if((SV *)isa != &PL_sv_undef) + if ((SV *)isa != &PL_sv_undef) { + assert(namehek); mro_clean_isarev( isa, HEK_KEY(namehek), HEK_LEN(namehek), HvMROMETA(revstash)->isa, HEK_HASH(namehek), HEK_UTF8(namehek) ); + } } } } diff --git a/op.c b/op.c index 796cb03..79621ce 100644 --- a/op.c +++ b/op.c @@ -2907,6 +2907,7 @@ S_my_kid(pTHX_ OP *o, OP *attrs, OP **imopsp) S_cant_declare(aTHX_ o); } else if (attrs) { GV * const gv = cGVOPx_gv(cUNOPo->op_first); + assert(PL_parser); PL_parser->in_my = FALSE; PL_parser->in_my_stash = NULL; apply_attrs(GvSTASH(gv), @@ -2929,6 +2930,7 @@ S_my_kid(pTHX_ OP *o, OP *attrs, OP **imopsp) else if (attrs && type != OP_PUSHMARK) { HV *stash; + assert(PL_parser); PL_parser->in_my = FALSE; PL_parser->in_my_stash = NULL; @@ -10229,6 +10231,7 @@ Perl_ck_split(pTHX_ OP *o) op_append_elem(OP_SPLIT, o, newDEFSVOP()); kid = kid->op_sibling; + assert(kid); scalar(kid); if (!kid->op_sibling) @@ -10902,6 +10905,7 @@ Perl_cv_set_call_checker(pTHX_ CV *cv, Perl_call_checker ckfun, SV *ckobj) MAGIC *callmg; sv_magic((SV*)cv, &PL_sv_undef, PERL_MAGIC_checkcall, NULL, 0); callmg = mg_find((SV*)cv, PERL_MAGIC_checkcall); + assert(callmg); if (callmg->mg_flags & MGf_REFCOUNTED) { SvREFCNT_dec(callmg->mg_obj); callmg->mg_flags &= ~MGf_REFCOUNTED; diff --git a/perlio.c b/perlio.c index d4c43d0..e6ff9e4 100644 --- a/perlio.c +++ b/perlio.c @@ -2225,6 +2225,7 @@ PerlIOBase_dup(pTHX_ PerlIO *f, PerlIO *o, CLONE_PARAMS *param, int flags) PerlIO_funcs * const self = PerlIOBase(o)->tab; SV *arg = NULL; char buf[8]; + assert(self); PerlIO_debug("PerlIOBase_dup %s f=%p o=%p param=%p\n", self ? self->name : "(Null)", (void*)f, (void*)o, (void*)param); diff --git a/pp_hot.c b/pp_hot.c index 2cccc48..9c9d1e9 100644 --- a/pp_hot.c +++ b/pp_hot.c @@ -3078,6 +3078,7 @@ S_method_common(pTHX_ SV* meth, U32* hashp) const HE* const he = hv_fetch_ent(stash, meth, 0, *hashp); if (he) { gv = MUTABLE_GV(HeVAL(he)); + assert(stash); if (isGV(gv) && GvCV(gv) && (!GvCVGEN(gv) || GvCVGEN(gv) == (PL_sub_generation + HvMROMETA(stash)->cache_gen))) @@ -3085,9 +3086,9 @@ S_method_common(pTHX_ SV* meth, U32* hashp) } } + assert(stash || packsv); gv = gv_fetchmethod_sv_flags(stash ? stash : MUTABLE_HV(packsv), - meth, GV_AUTOLOAD | GV_CROAK); - + meth, GV_AUTOLOAD | GV_CROAK); assert(gv); return isGV(gv) ? MUTABLE_SV(GvCV(gv)) : MUTABLE_SV(gv); diff --git a/regcomp.c b/regcomp.c index eaee604..3d49827 100644 --- a/regcomp.c +++ b/regcomp.c @@ -2007,6 +2007,7 @@ S_make_trie(pTHX_ RExC_state_t *pRExC_state, regnode *startbranch, }); re_trie_maxbuff = get_sv(RE_TRIE_MAXBUF_NAME, 1); + assert(re_trie_maxbuff); if (!SvIOK(re_trie_maxbuff)) { sv_setiv(re_trie_maxbuff, RE_TRIE_MAXBUF_INIT); } @@ -14920,6 +14921,7 @@ S_set_ANYOF_arg(pTHX_ RExC_state_t* const pRExC_state, av_store(av, 0, (runtime_defns) ? SvREFCNT_inc(runtime_defns) : &PL_sv_undef); if (swash) { + assert(cp_list); av_store(av, 1, swash); SvREFCNT_dec_NN(cp_list); } @@ -16609,6 +16611,7 @@ S_dumpuntil(pTHX_ const regexp *r, const regnode *start, const regnode *node, last= plast; while (PL_regkind[op] != END && (!last || node < last)) { + assert(node); /* While that wasn't END last time... */ NODE_ALIGN(node); op = OP(node); diff --git a/regexec.c b/regexec.c index 362390b..ffab4f2 100644 --- a/regexec.c +++ b/regexec.c @@ -7010,6 +7010,8 @@ no_silent: sv_commit = &PL_sv_yes; sv_yes_mark = &PL_sv_no; } + assert(sv_err); + assert(sv_mrk); sv_setsv(sv_err, sv_commit); sv_setsv(sv_mrk, sv_yes_mark); } @@ -7620,6 +7622,7 @@ Perl__get_regclass_nonbitmap_data(pTHX_ const regexp *prog, *only_utf8_locale_ptr = ary[2]; } else { + assert(only_utf8_locale_ptr); *only_utf8_locale_ptr = NULL; } @@ -7641,7 +7644,7 @@ Perl__get_regclass_nonbitmap_data(pTHX_ const regexp *prog, } else if (doinit && ((si && si != &PL_sv_undef) || (invlist && invlist != &PL_sv_undef))) { - + assert(si); sw = _core_swash_init("utf8", /* the utf8 package */ "", /* nameless */ si, diff --git a/universal.c b/universal.c index a29696d..65e02df 100644 --- a/universal.c +++ b/universal.c @@ -67,7 +67,7 @@ S_isa_lookup(pTHX_ HV *stash, const char * const name, STRLEN len, U32 flags) if (our_stash) { HEK *canon_name = HvENAME_HEK(our_stash); if (!canon_name) canon_name = HvNAME_HEK(our_stash); - + assert(canon_name); if (hv_common(isa, NULL, HEK_KEY(canon_name), HEK_LEN(canon_name), HEK_FLAGS(canon_name), HV_FETCH_ISEXISTS, NULL, HEK_HASH(canon_name))) { diff --git a/util.c b/util.c index b90abe5..cd0afb6 100644 --- a/util.c +++ b/util.c @@ -850,15 +850,17 @@ Perl_fbm_instr(pTHX_ unsigned char *big, unsigned char *bigend, SV *littlestr, U { const MAGIC *const mg = mg_find(littlestr, PERL_MAGIC_bm); - const unsigned char * const table = (const unsigned char *) mg->mg_ptr; const unsigned char *oldlittle; + assert(mg); + --littlelen; /* Last char found by table lookup */ s = big + littlelen; little += littlelen; /* last char */ oldlittle = little; if (s < bigend) { + const unsigned char * const table = (const unsigned char *) mg->mg_ptr; I32 tmp; top2: -- 1.8.5.2 (Apple Git-48)
* Off-by-one in PL_fold_locale use.Jarkko Hietaniemi2014-05-281-4/+4
| | | | | | | | | | | | | | | | Fix for Coverity perl5 CID 29033: Out-of-bounds read (OVERRUN) overrun-local: Overrunning array PL_fold_locale of 256 bytes at byte offset 256 using index c1 (which evaluates to 256). - the "c1 > 256" was off-by-one, it needed to be "c1 > 255", it could have caused the PL_fold_locale to be accessed one past the end, at offset 256, but we have dodged the bullet thanks to the regex engine optimizing the bad case away before we hit it (analysis by Karl Williamson): regexec.c - comment fixes (pointed out by Karl Williamson): regexec.c - add tests to nail down the behaviour of fold matching for the last of Latin-1 (0xFF, lowercase which curiously does not have uppercase within Latin-1). and the first pure Unicode: t/re/pat.t
* [perl #121854] use re 'taint' regressionDavid Mitchell2014-05-131-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit v5.19.8-533-g63baef5 changed the handling of locale-dependent regexes so that the pattern was considered tainted at compile-time, rather than determining it each time at run-time whenever it executed a locale-dependent node. Unfortunately due to the conflating of two flags, RXf_TAINTED and RXf_TAINTED_SEEN, it had the side effect of permanently marking a pattern as tainted once it had had a single tainted result. E.g. use re qw(taint); use Scalar::Util qw(tainted); for ($^X, "abc") { /(.*)/ or die; print "not " unless tainted("$1"); print "tainted\n"; }; which from 5.19.9 onwards output: tainted tainted but with this commit (and with 5.19.8 and earlier), it now outputs: tainted not tainted The RXf_TAINTED flag indicates that the pattern itself is tainted, e.g. $r = qr/$tainted_value/ while the RXf_TAINTED_SEEN flag means that the results of the last match are tainted, e.g. use re 'tainted'; $tainted =~ /(.*)/; # $1 is tainted Pre 63baef5, the code used to look like: at run-time: turn off RXf_TAINTED_SEEN; while (nodes to execute) { switch(node) { case BOUNDL: /* and other locale-specific ops */ turn on RXf_TAINTED_SEEN; ...; } } if (tainted || RXf_TAINTED) turn on RXf_TAINTED_SEEN; 63baef5 changed it to: at compile-time: if (pattern has locale ops) turn on RXf_TAINTED_SEEN; at run-time: while (nodes to execute) { ... } if (tainted || RXf_TAINTED) turn on RXf_TAINTED_SEEN; This commit changes it to: at compile-time; if (pattern has locale ops) turn on RXf_TAINTED; at run-time: turn off RXf_TAINTED_SEEN; while (nodes to execute) { ... } if (tainted || RXf_TAINTED) turn on RXf_TAINTED_SEEN;
* [perl #121484] /m causing false negativeDavid Mitchell2014-03-241-6/+9
| | | | | | | | | | | My recent commit d0d4464849e2b30aee8 in re_intuit_start() reduced the scope of a 'skip if multiline' check, so that certain optimisations weren't being unnecessarily skipped. Unfortunately it didn't reduce the scope enough, so a vital slen-- was being skipped in the SvTAIL-but-don't-fail case. This commit just moves the !multiline test further down, and updates the commentary and condition formatting a bit.
* re_intuit_start(): move comments abut IMPLICITDavid Mitchell2014-03-191-9/+9
| | | | | After the previous reworking of the PREGf_IMPLICIT tests, move commentary about PREGf_IMPLICIT closer to where its now relevant.
* re_intuit_start(): don't unset MBOL on uselessnessDavid Mitchell2014-03-191-9/+0
| | | | | | | | | | | | | | When BmUSEFUL() for a pattern goes negative, we unset ~RXf_USE_INTUIT. A bug fix in 2010 (c941595168) made this part of the code also remove the RXf_ANCH_MBOL at the same time, if PREGf_IMPLICIT was set. However, this was really just working round a bug in regexec_flags(). Once intuit was disabled, regexec_flags() would then start taking the buggy code path. This buggy code path was separately fixed in 2012 by 21eede782, so there's no longer any need to remove this flag. So don't.
* re_intuit_start(): change definition of ml_anchDavid Mitchell2014-03-191-8/+5
| | | | | | | | | | | | | | | | | | | | The ml_anch variable is supposed to indicate that a multi-line match is possible, i.e. that the regex is anchored to a \n. Currently we just do ml_anch = (prog->intflags & PREGf_ANCH_MBOL); However, MBOL is also set on /.*..../; the two cases are distinguished by adding the PREGf_IMPLICIT flag too. So at the moment we have lots of tests along the lines of if (ml_anch && !(prog->intflags & PREGf_IMPLICIT)) Simplify this by adding the IMPLICIT condition when initially calculating ml_anch, so there's no need to keep testing for it later. This also means that ml_anch actually means what it says now.
* re_intuit_start(): check for IMPLICIT in abs anchDavid Mitchell2014-03-191-14/+3
| | | | | | | | | | | | | | | | | | | | | | The block of code that deals with absolute anchors looks like: if (...) { if (.... && !PREGf_IMPLICIT) ...; if (FOO) { assert(!PREGf_IMPLICIT); .... } } where the constraints of FOO imply PREGf_IMPLICIT, as shown by the assertion. Simplify this to: if (... && !PREGf_IMPLICIT) { if (....) ...; if (FOO) { .... } }
* re_intuit_start(): check for IMPLICIT in stclassDavid Mitchell2014-03-191-3/+4
| | | | | | | | | | | | | | In the stclass block of code, there are various checks for anchored-ness. Add !(prog->intflags & PREGf_IMPLICIT) to these conditions, since from the perspective of these tests, we can only quit early etc if these are real rather than fake anchors. As it happens, this currently makes no logical difference, since an PREGf_IMPLICIT pattern starts with .*, and so more or less by definition can't have an stclass. So this commit is really only for logical completeness, and to allow us to change the definition of ml_anch shortly.
* re_intuit_start(): use better limit on anch floatDavid Mitchell2014-03-191-9/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The 'check' block of code has special handling for if the pattern is anchored; in this case, it allows us to set an upper bound on where a floating substring might match; e.g. in /^..\d?abc/ the floating string can't be more than 3 chars from the absolute beginning of the string. Similarly with /..\G\d?abc/, it can't be more than 3 chars from the start position (assuming that position been calculated correctly by the caller). However, the current code used rx_origin as the base for the offset calculation, rather than strbeg/strpos as appropriate. This meant that a) the first time round the loop, if strpos > strbeg, then the upper bound would be set higher than needed; b) if we ever go back to restart: with an incremented rx_origin, then the upper limit is recalculated with more wasted slack at the latter end. This commit changes the limit calculation, which reduces the following from second to milliseconds: $s = "abcdefg" x 1_000_000; $s =~ /^XX\d{1,10}cde/ for 1..100; It also adds a quick test to skip hopping when the result is likely to leave end_point unchanged, and adds an explicit test for !PREGf_IMPLICIT. This latter test isn't strictly necessary, as if PREGf_IMPLICIT were set, it implies that the pattern starts with '.*', which implies that prog->check_offset_max == SSize_t_MAX, which is already tested for. However, it makes the overall condition more comprehensible, and makes it more robust in the face of future changes.
* re_intuit_start(): do 'not at start' check on BOLDavid Mitchell2014-03-191-6/+25
| | | | | | | | | | | | | | | | | | | | The quick reject test that says "if a pattern has an absolute anchor, then immediately reject if strpos != strbeg", currently skips that test if PREGf_ANCH_GPOS is present. Instead, skip unless BOL or SBOL is present. This means that something like /^abc\G/ will still do the quick reject test. I can't think of any example that will actually give a measurable performance boost, and this is a fairly unlikely scenario, but the code is more logical this way, and makes it more robust against future changes (it's less likely to suddenly skip the quick test where it used to do it). I've also updated the commentary on why we don't do a quick /\G/ test akin to the /^/ test, and added some more info about why we test for the PREGf_IMPLICIT flag. And I've added an assert about PREGf_IMPLICIT too.
* re_intuit_start(): reduce scope of /^...$/m testDavid Mitchell2014-03-191-5/+2
| | | | | | | | | | | | | | | | | | | | | Intuit has a quick reject test for a fixed pattern that is anchored at both ends. For example, with the pattern /^abcd$/, only the exact strings "abcd" or "abcd\n" will match; anything else, and the match immediately fails. A fix for [perl #115242] correctly made intuit skip the test in the presence of //m, since in this case the $ doesn't necessarily correspond to the end of the string. However, the fix was too wide in scope; it caused //m patterns to skip searching for a known string anchored just at the start, as well as one anchored at both ends. With this commit, the following code now runs in a few milliseconds rather than a few seconds on my machine: $s = "abcdefg" x 1_000_000; $s =~ /(?-m:^)abcX?fg/m for 1..100;
* re_intuit_start(): change !ml_anch debugging msgDavid Mitchell2014-03-191-2/+1
| | | | | | | | | | | When MBOL (/^.../m) matching is skipped, the debugging output looks like: Starting position does not contradict /^/m... which sounds a bit like the \n test *was* done and passed, rather than the test being skipped. Change the message to: (multiline anchor test skipped)
* re_intuit_start(): don't set ml_anch on BOLDavid Mitchell2014-03-191-2/+1
| | | | | | | | | | | | | | | | re_intuit_start() decided that a pattern was capable of being anchored after *any* \n in the string for a //m pattern that contains a BOL (rather than an MBOL). This can happen by embedding one regex in another for example. This is an incorrect assumption, and means that intuit() might try against every \n position in the string rather than just trying at the beginning. With this commit, the following code on my machine reduces in execution time from 7000ms to 5ms: my $r = qr/^abcd/; my $s = "abcd-xyz\n" x 500_000; $s =~ /$r\d{1,2}xyz/m for 1..200;
* re_intuit_start(): re-comment head of functionDavid Mitchell2014-03-161-71/+70
| | | | | The blurb above this function is a mess and contains lots of obsolete stuff. Mostly rewrite it. Also sprinkle a few extra comments elsewhere.
* re_intuit_start(): indent rest of check blockDavid Mitchell2014-03-161-25/+25
| | | | | Most of the 'check' substring matching code is in its own scope; move the last few bits into that scope too. This is purely cosmetic.
* re_intuit_start(): add some general commentsDavid Mitchell2014-03-161-5/+29
|
* re_intuit_start(): update comments in BmUSEFUL blkDavid Mitchell2014-03-161-5/+6
|
* re_intuit_start(): update comments in stclass codeDavid Mitchell2014-03-161-12/+20
| | | | | | Improve the description at the start of this code block, and move some of the comments closer to where they apply (and explain why they're probably wrong)
* re_intuit_start(): improve main terminating condDavid Mitchell2014-03-161-3/+23
| | | | | | | | | | | | | | | | | | | | | | At the end of the failure block in the stclass code, it decides whether rx_origin has been advanced too much for the match to ever succeed, and if so, quits. That test has a comment next to it about perhaps testing for this earlier. It also does a byte rather than chars calculation. This commit does the following. It adds a proper chars-based termination test near the beginning of the 'check' code block. This test is essentially free, since at that point we've just done the necessary HOPs. It also calculates start_point using end_point rather the strend as the limit, so avoiding a bit of unnecessary hopping when we're about to fail anyway. It introduces the new HOPMAYBE3 macro wrapper for this purpose. It keeps the old bytes-based test at the end in stclass, with a note that the simple bytes calculation errs on the side of re-entering the check block where the accurate chars-based test is now done.