| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
| |
There is an optimization in regcomp.c which saves memory when used, but
which caused -Dr output to display the same code points twice.
I didn't add a test to ext/re/t/regop.t because this only happens for a
Unicode property, under certain circumstances, and that triggers a lot
of other regex patterns to be compiled which are subject to change, so
the test would frequently have to be updated. This only affects
debugging output anyway.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This function, though mostly used by regexec.c and previously contained
therein, relies on the particular data structure in regcomp.c, so it
makes sense to move it adjacent to the code that generates the
structure. Future commits will add function calls to this function that
are (currently) only in regcomp.c, and so will tie it tighter to that
file.
This just moves the code with surrounding #ifdef, adds a couple of blank
lines, removes some trailing white space, and adjusts the comments. No
other changes were made.
|
|
|
|
|
|
|
| |
ANYOF nodes (for bracketed character classes) currently are for code
points 0-255. This is the first step in the eventual making that size
configurable. This also renames a static function, as the domain may
not necessarily be 'latin1'
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Remove (almost all) direct access to the op_sibling field of OP structs,
and use these three new macros instead:
OP_SIBLING(o);
OP_HAS_SIBLING(o);
OP_SIBLING_set(o, new_value);
OP_HAS_SIBLING is intended to be a slightly more efficient version of
OP_SIBLING when only boolean context is needed.
For now these three macros are just defined in the obvious way:
#define OP_SIBLING(o) (0 + (o)->op_sibling)
#define OP_HAS_SIBLING(o) (cBOOL((o)->op_sibling))
#define OP_SIBLING_set(o, sib) ((o)->op_sibling = (sib))
but abstracting them out will allow us shortly to make the last pointer in
an op_sibling chain point back to the parent rather than being null, with
a new flag indicating whether this is the last op.
Perl_ck_fun() still has a couple of direct uses of op_sibling, since it
takes the field's address, which is not covered by these macros.
|
|
|
|
| |
Coverity perl5 CID 68587.
|
|
|
|
| |
Keep on keeping Coverity happy (Coverity perl5 CID 45352).
|
|
|
|
| |
This puts them in a more logical order
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The root cause of this is that the loop is a "< strend" instead of a
"<=". However, the macro that is called to form the loop is used
elsewhere where "<" is appropriate. So I opted to just repeat the
small salient portions of the loop right after it so it gets executed
one final time in the final position.
This code:
if ((!prog->minlen && tmp) && (reginfo->intuit || regtry(reginfo, &s))) \
goto got_it;
had the effect previously of causing \b to look at the position just
after the string, but not \B. By doing it the other way, this is no
longer needed, except I don't understand the prog->minlen portion. It
isn't used anywhere else in the tests which see if one should goto
got_it, nor do any test suite tests fail by removing it. I think it is
a relic, but if a bisect should end up blaming this commit, I'd start
with that.
|
| |
|
|
|
|
|
| |
This reorders things to avoid unnecessary work, as the now-first line
may jump, making the assignment that used to be unconditional irrelevant
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The definition of \w is now compiled into the Perl core. This allows
the complicated swash_fetch function call to be replaced by
isWORDCHAR_utf8, which takes a single parameter, so the interface can be
simplified. [1].
This macro will execute faster on Latin1-range inputs, as it doesn't do
a swash_fetch on them, but slower on other code points due to function
call overhead, and some currently in-place error checking that wasn't
done previously. This overhead could be removed by using inline
functions, and perhaps a different interface for known non-malformed
input (though I'm actually not sure the input is known to be well-formed
in this case).
These macros still depend on and modify outside variables. That could
be cleaned up by adding additional parameters to them, but I'm not going
to do it now. I don't like these kinds of code-generating macros, and
have been tempted to rewrite these as inline functions, but it's not a
trivial task to do.
[1] I hadn't realized it before, but the interface could have been
cleaned up instead by introducting a macro that makes it look like a
single parameter is used uniformly to existing macros, looking like
#define FBC_BOUND_SWASH_FETCH(s) \
cBOOL(swash_fetch(PL_utf8_swash_ptrs[_CC_WORDCHAR], s, utf8_target))
But it seems better to me to use isWORDCHAR_utf8 as it is faster for
Western European languages, and can be made nearly the same speed as the
alternative if experience tells us that this is a slow spot that should
be sped up.
|
|
|
|
|
|
|
|
|
| |
These code-generating macros are a pain to maintain. These have side
effects and use variables outside the macros. This commit cleans up the
interface slightly by separating the operand of a function from the
function, so the code outside doesn't have to know as much about the
macro as before. The next commit will fix up the remaining parameter
like this.
|
|
|
|
|
|
| |
The names were misleading; they were not for general use, but only for
this FBC area of the code, and they no longer have to do with having to
load utf8 or not.
|
|
|
|
|
|
|
| |
I find these a pain to type and read, and there is no reason for them.
As formal parameters there is no possibility of conflict with the code
outside them. And I'm about to do some editing of these, so this change
will make that easier.
|
|
|
|
|
|
|
|
| |
You need to configure with g++ *and* -Accflags=-DPERL_GLOBAL_STRUCT
or -Accflags=-DPERL_GLOBAL_STRUCT_PRIVATE to see any difference.
(g++ does not do the "post-annotation" form of "unused".)
The version code has some of these issues, reported upstream.
|
|
|
|
|
|
| |
This reverts commit 148f39b7de6eae9ddd59e0b0aff691d6abea7aca.
(Still needs more work, but wanted to see how well this passed with Jenkins.)
|
|
|
|
|
|
| |
Definitely not *after* it. It marks the start of the unreachable,
not the first unrechable line. And if they are in that order,
it looks better to linebreak after the lint hint.
|
|
|
|
| |
(Detected in HP-UX.)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- after return/croak/die/exit, return/break are pointless
(break is not a terminator/separator, it's a goto)
- after goto, another goto (!) is pointless
- in some cases (usually function ends) introduce explicit NOT_REACHED
to make the noreturn nature clearer (do not do this everywhere, though,
since that would mean adding NOT_REACHED after every croak)
- for the added NOT_REACHED also add /* NOTREACHED */ since
NOT_REACHED is for gcc (and VC), while the comment is for linters
- declaring variables in switch blocks is just too fragile:
it kind of works for narrowing the scope (which is nice),
but breaks the moment there are initializations for the variables
(the initializations will be skipped since the flow will bypass
the start of the block); in some easy cases simply hoist the declarations
out of the block and move them earlier
Note 1: Since after this patch the core is not yet -Wunreachable-code
clean, not enabling that via cflags.SH, one needs to -Accflags=... it.
Note 2: At least with the older gcc 4.4.7 there are far too many
"unreachable code" warnings, which seem to go away with gcc 4.8,
maybe better flow control analysis. Therefore, the warning should
eventually be enabled only for modernish gccs (what about clang and
Intel cc?)
|
|
|
|
|
|
|
| |
This reverts commit 8c2b19724d117cecfa186d044abdbf766372c679.
I don't understand - smoke-me came back happy with three
separate reports... oh well, some other time.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- after croak/die/exit (or return), break (or return!) are pointless
(break is not a terminator/separator, it's a promise of a jump)
- after goto, another goto (!) is pointless
- in some cases (usually function ends) introduce explicit NOT_REACHED
to make the noreturn nature clearer (do not do this everywhere, though,
since that would mean adding NOT_REACHED after every croak)
- for the added NOT_REACHED also add /* NOTREACHED */ since
NOT_REACHED is for gcc (and VC), while the comment is for linters
- declaring variables in switch blocks is just too fragile:
it kind of works for narrowing the scope (which is nice),
but breaks the moment there are initializations for the variables
(they will be skipped!); in some easy cases simply hoist the declarations
out of the block and move them earlier
There are still a few places left.
|
|
|
|
|
|
|
|
| |
This meant sprinkling some PERL_UNUSED_CONTEXT invocations,
as well as stopping some functions from getting my_perl in
the first place; all of the functions in the latter category
are internal (S_ prefix and s or i in embed.fnc), so
this should be both safe and economical.
|
|
|
|
|
| |
The dual-life dists affected use Devel::PPPort, so can safely use
newSVpvs() even though it wasn't added until Perl v5.8.9.
|
|
|
|
| |
This uses an C automatic variable instead of a malloc and free.
|
|
|
|
| |
There are other cases where this functionality will be needed as well.
|
|
|
|
|
|
| |
We were testing for UTF-8 invariant, when we should have been testing
for ASCII. This is a problem only on EBCDIC platforms, where they mean
two different sets of code points.
|
|
|
|
|
|
|
| |
Used by linters (static checkers), and also good for human readers.
Even though "FALL THROUGH" seems to be the most common, e.g BSD lint
manual only knows "FALLTHROUGH" (or "FALLTHRU").
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Coverity suspects paths like this:
(1)
p = foo();
if (!p) p = bar();
baz(p->q);
since it cannot prove that p is always non-NULL at the dereference.
(2) Or simply something like:
mg = mg_find(...);
foo(mg->blah);
Since mg_find() can fail returning NULL.
Adding assert() calls before the dereferences. Testing with -DDEBUGGING.
Hopefully there are regular smokes doing the same.
[perl #121894]
Fix for Coverity perl5 CIDs 28950,28952..28955,28964,28967,28970..28795,49921:
CID ...: Dereference after null check (FORWARD_NULL)
var_deref_op: Dereferencing null pointer p->q
(TODO: Coverity perl5 CIDs 28962,28968,28969: the same issue, but would
conflict with already in-flight changes, prepare catch-up patch later.)
---
dist/Data-Dumper/Dumper.xs | 1 +
dump.c | 1 +
ext/Devel-Peek/Peek.xs | 1 +
ext/XS-APItest/APItest.xs | 1 +
mg.c | 2 ++
mro.c | 4 +++-
op.c | 4 ++++
perlio.c | 1 +
pp_hot.c | 5 +++--
regcomp.c | 3 +++
regexec.c | 5 ++++-
universal.c | 2 +-
util.c | 4 +++-
13 files changed, 28 insertions(+), 6 deletions(-)
diff --git a/dist/Data-Dumper/Dumper.xs b/dist/Data-Dumper/Dumper.xs
index 12c4ebd..23e0cf4 100644
--- a/dist/Data-Dumper/Dumper.xs
+++ b/dist/Data-Dumper/Dumper.xs
@@ -641,6 +641,7 @@ DD_dump(pTHX_ SV *val, const char *name, STRLEN namelen, SV *retval, HV *seenhv,
else {
sv_pattern = val;
}
+ assert(sv_pattern);
rval = SvPV(sv_pattern, rlen);
rend = rval+rlen;
slash = rval;
diff --git a/dump.c b/dump.c
index 59be3e0..c2d72fd 100644
--- a/dump.c
+++ b/dump.c
@@ -471,6 +471,7 @@ Perl_sv_peek(pTHX_ SV *sv)
finish:
while (unref--)
sv_catpv(t, ")");
+ /* XXX when is sv ever NULL? */
if (TAINTING_get && SvTAINTED(sv))
sv_catpv(t, " [tainted]");
return SvPV_nolen(t);
diff --git a/ext/Devel-Peek/Peek.xs b/ext/Devel-Peek/Peek.xs
index 679efa5..b20fa94 100644
--- a/ext/Devel-Peek/Peek.xs
+++ b/ext/Devel-Peek/Peek.xs
@@ -450,6 +450,7 @@ PPCODE:
BOOT:
{
CV * const cv = get_cvn_flags("Devel::Peek::Dump", 17, 0);
+ assert(cv);
cv_set_call_checker(cv, S_ck_dump, (SV *)cv);
XopENTRY_set(&my_xop, xop_name, "Dump");
diff --git a/ext/XS-APItest/APItest.xs b/ext/XS-APItest/APItest.xs
index a51924d..18ba381 100644
--- a/ext/XS-APItest/APItest.xs
+++ b/ext/XS-APItest/APItest.xs
@@ -2096,6 +2096,7 @@ newCONSTSUB(stash, name, flags, sv)
break;
}
EXTEND(SP, 2);
+ assert(mycv);
PUSHs( CvCONST(mycv) ? &PL_sv_yes : &PL_sv_no );
PUSHs((SV*)CvGV(mycv));
diff --git a/mg.c b/mg.c
index 76912bd..7f3339a 100644
--- a/mg.c
+++ b/mg.c
@@ -1675,6 +1675,7 @@ Perl_magic_clearisa(pTHX_ SV *sv, MAGIC *mg)
same function. */
mg = mg_find(mg->mg_obj, PERL_MAGIC_isa);
+ assert(mg);
if (SvTYPE(mg->mg_obj) == SVt_PVAV) { /* multiple stashes */
SV **svp = AvARRAY((AV *)mg->mg_obj);
I32 items = AvFILLp((AV *)mg->mg_obj) + 1;
@@ -3437,6 +3438,7 @@ Perl_magic_copycallchecker(pTHX_ SV *sv, MAGIC *mg, SV *nsv,
sv_magic(nsv, &PL_sv_undef, mg->mg_type, NULL, 0);
nmg = mg_find(nsv, mg->mg_type);
+ assert(nmg);
if (nmg->mg_flags & MGf_REFCOUNTED) SvREFCNT_dec(nmg->mg_obj);
nmg->mg_ptr = mg->mg_ptr;
nmg->mg_obj = SvREFCNT_inc_simple(mg->mg_obj);
diff --git a/mro.c b/mro.c
index 1b37ca7..ccf4bf4 100644
--- a/mro.c
+++ b/mro.c
@@ -638,12 +638,14 @@ Perl_mro_isa_changed_in(pTHX_ HV* stash)
hv_storehek(mroisarev, namehek, &PL_sv_yes);
}
- if((SV *)isa != &PL_sv_undef)
+ if ((SV *)isa != &PL_sv_undef) {
+ assert(namehek);
mro_clean_isarev(
isa, HEK_KEY(namehek), HEK_LEN(namehek),
HvMROMETA(revstash)->isa, HEK_HASH(namehek),
HEK_UTF8(namehek)
);
+ }
}
}
}
diff --git a/op.c b/op.c
index 796cb03..79621ce 100644
--- a/op.c
+++ b/op.c
@@ -2907,6 +2907,7 @@ S_my_kid(pTHX_ OP *o, OP *attrs, OP **imopsp)
S_cant_declare(aTHX_ o);
} else if (attrs) {
GV * const gv = cGVOPx_gv(cUNOPo->op_first);
+ assert(PL_parser);
PL_parser->in_my = FALSE;
PL_parser->in_my_stash = NULL;
apply_attrs(GvSTASH(gv),
@@ -2929,6 +2930,7 @@ S_my_kid(pTHX_ OP *o, OP *attrs, OP **imopsp)
else if (attrs && type != OP_PUSHMARK) {
HV *stash;
+ assert(PL_parser);
PL_parser->in_my = FALSE;
PL_parser->in_my_stash = NULL;
@@ -10229,6 +10231,7 @@ Perl_ck_split(pTHX_ OP *o)
op_append_elem(OP_SPLIT, o, newDEFSVOP());
kid = kid->op_sibling;
+ assert(kid);
scalar(kid);
if (!kid->op_sibling)
@@ -10902,6 +10905,7 @@ Perl_cv_set_call_checker(pTHX_ CV *cv, Perl_call_checker ckfun, SV *ckobj)
MAGIC *callmg;
sv_magic((SV*)cv, &PL_sv_undef, PERL_MAGIC_checkcall, NULL, 0);
callmg = mg_find((SV*)cv, PERL_MAGIC_checkcall);
+ assert(callmg);
if (callmg->mg_flags & MGf_REFCOUNTED) {
SvREFCNT_dec(callmg->mg_obj);
callmg->mg_flags &= ~MGf_REFCOUNTED;
diff --git a/perlio.c b/perlio.c
index d4c43d0..e6ff9e4 100644
--- a/perlio.c
+++ b/perlio.c
@@ -2225,6 +2225,7 @@ PerlIOBase_dup(pTHX_ PerlIO *f, PerlIO *o, CLONE_PARAMS *param, int flags)
PerlIO_funcs * const self = PerlIOBase(o)->tab;
SV *arg = NULL;
char buf[8];
+ assert(self);
PerlIO_debug("PerlIOBase_dup %s f=%p o=%p param=%p\n",
self ? self->name : "(Null)",
(void*)f, (void*)o, (void*)param);
diff --git a/pp_hot.c b/pp_hot.c
index 2cccc48..9c9d1e9 100644
--- a/pp_hot.c
+++ b/pp_hot.c
@@ -3078,6 +3078,7 @@ S_method_common(pTHX_ SV* meth, U32* hashp)
const HE* const he = hv_fetch_ent(stash, meth, 0, *hashp);
if (he) {
gv = MUTABLE_GV(HeVAL(he));
+ assert(stash);
if (isGV(gv) && GvCV(gv) &&
(!GvCVGEN(gv) || GvCVGEN(gv)
== (PL_sub_generation + HvMROMETA(stash)->cache_gen)))
@@ -3085,9 +3086,9 @@ S_method_common(pTHX_ SV* meth, U32* hashp)
}
}
+ assert(stash || packsv);
gv = gv_fetchmethod_sv_flags(stash ? stash : MUTABLE_HV(packsv),
- meth, GV_AUTOLOAD | GV_CROAK);
-
+ meth, GV_AUTOLOAD | GV_CROAK);
assert(gv);
return isGV(gv) ? MUTABLE_SV(GvCV(gv)) : MUTABLE_SV(gv);
diff --git a/regcomp.c b/regcomp.c
index eaee604..3d49827 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -2007,6 +2007,7 @@ S_make_trie(pTHX_ RExC_state_t *pRExC_state, regnode *startbranch,
});
re_trie_maxbuff = get_sv(RE_TRIE_MAXBUF_NAME, 1);
+ assert(re_trie_maxbuff);
if (!SvIOK(re_trie_maxbuff)) {
sv_setiv(re_trie_maxbuff, RE_TRIE_MAXBUF_INIT);
}
@@ -14920,6 +14921,7 @@ S_set_ANYOF_arg(pTHX_ RExC_state_t* const pRExC_state,
av_store(av, 0, (runtime_defns)
? SvREFCNT_inc(runtime_defns) : &PL_sv_undef);
if (swash) {
+ assert(cp_list);
av_store(av, 1, swash);
SvREFCNT_dec_NN(cp_list);
}
@@ -16609,6 +16611,7 @@ S_dumpuntil(pTHX_ const regexp *r, const regnode *start, const regnode *node,
last= plast;
while (PL_regkind[op] != END && (!last || node < last)) {
+ assert(node);
/* While that wasn't END last time... */
NODE_ALIGN(node);
op = OP(node);
diff --git a/regexec.c b/regexec.c
index 362390b..ffab4f2 100644
--- a/regexec.c
+++ b/regexec.c
@@ -7010,6 +7010,8 @@ no_silent:
sv_commit = &PL_sv_yes;
sv_yes_mark = &PL_sv_no;
}
+ assert(sv_err);
+ assert(sv_mrk);
sv_setsv(sv_err, sv_commit);
sv_setsv(sv_mrk, sv_yes_mark);
}
@@ -7620,6 +7622,7 @@ Perl__get_regclass_nonbitmap_data(pTHX_ const regexp *prog,
*only_utf8_locale_ptr = ary[2];
}
else {
+ assert(only_utf8_locale_ptr);
*only_utf8_locale_ptr = NULL;
}
@@ -7641,7 +7644,7 @@ Perl__get_regclass_nonbitmap_data(pTHX_ const regexp *prog,
}
else if (doinit && ((si && si != &PL_sv_undef)
|| (invlist && invlist != &PL_sv_undef))) {
-
+ assert(si);
sw = _core_swash_init("utf8", /* the utf8 package */
"", /* nameless */
si,
diff --git a/universal.c b/universal.c
index a29696d..65e02df 100644
--- a/universal.c
+++ b/universal.c
@@ -67,7 +67,7 @@ S_isa_lookup(pTHX_ HV *stash, const char * const name, STRLEN len, U32 flags)
if (our_stash) {
HEK *canon_name = HvENAME_HEK(our_stash);
if (!canon_name) canon_name = HvNAME_HEK(our_stash);
-
+ assert(canon_name);
if (hv_common(isa, NULL, HEK_KEY(canon_name), HEK_LEN(canon_name),
HEK_FLAGS(canon_name),
HV_FETCH_ISEXISTS, NULL, HEK_HASH(canon_name))) {
diff --git a/util.c b/util.c
index b90abe5..cd0afb6 100644
--- a/util.c
+++ b/util.c
@@ -850,15 +850,17 @@ Perl_fbm_instr(pTHX_ unsigned char *big, unsigned char *bigend, SV *littlestr, U
{
const MAGIC *const mg = mg_find(littlestr, PERL_MAGIC_bm);
- const unsigned char * const table = (const unsigned char *) mg->mg_ptr;
const unsigned char *oldlittle;
+ assert(mg);
+
--littlelen; /* Last char found by table lookup */
s = big + littlelen;
little += littlelen; /* last char */
oldlittle = little;
if (s < bigend) {
+ const unsigned char * const table = (const unsigned char *) mg->mg_ptr;
I32 tmp;
top2:
--
1.8.5.2 (Apple Git-48)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Fix for Coverity perl5 CID 29033: Out-of-bounds read
(OVERRUN) overrun-local: Overrunning array PL_fold_locale of 256 bytes at
byte offset 256 using index c1 (which evaluates to 256).
- the "c1 > 256" was off-by-one, it needed to be "c1 > 255",
it could have caused the PL_fold_locale to be accessed one past the end,
at offset 256, but we have dodged the bullet thanks to the regex engine
optimizing the bad case away before we hit it (analysis by Karl Williamson):
regexec.c
- comment fixes (pointed out by Karl Williamson): regexec.c
- add tests to nail down the behaviour of fold matching
for the last of Latin-1 (0xFF, lowercase which curiously does not have
uppercase within Latin-1). and the first pure Unicode: t/re/pat.t
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit v5.19.8-533-g63baef5 changed the handling of locale-dependent
regexes so that the pattern was considered tainted at compile-time, rather
than determining it each time at run-time whenever it executed a
locale-dependent node. Unfortunately due to the conflating of two flags,
RXf_TAINTED and RXf_TAINTED_SEEN, it had the side effect of permanently
marking a pattern as tainted once it had had a single tainted result.
E.g.
use re qw(taint);
use Scalar::Util qw(tainted);
for ($^X, "abc") {
/(.*)/ or die;
print "not " unless tainted("$1"); print "tainted\n";
};
which from 5.19.9 onwards output:
tainted
tainted
but with this commit (and with 5.19.8 and earlier), it now outputs:
tainted
not tainted
The RXf_TAINTED flag indicates that the pattern itself is tainted, e.g.
$r = qr/$tainted_value/
while the RXf_TAINTED_SEEN flag means that the results of the last match
are tainted, e.g.
use re 'tainted';
$tainted =~ /(.*)/;
# $1 is tainted
Pre 63baef5, the code used to look like:
at run-time:
turn off RXf_TAINTED_SEEN;
while (nodes to execute) {
switch(node) {
case
BOUNDL: /* and other locale-specific ops */
turn on RXf_TAINTED_SEEN;
...;
}
}
if (tainted || RXf_TAINTED)
turn on RXf_TAINTED_SEEN;
63baef5 changed it to:
at compile-time:
if (pattern has locale ops)
turn on RXf_TAINTED_SEEN;
at run-time:
while (nodes to execute) {
...
}
if (tainted || RXf_TAINTED)
turn on RXf_TAINTED_SEEN;
This commit changes it to:
at compile-time;
if (pattern has locale ops)
turn on RXf_TAINTED;
at run-time:
turn off RXf_TAINTED_SEEN;
while (nodes to execute) {
...
}
if (tainted || RXf_TAINTED)
turn on RXf_TAINTED_SEEN;
|
|
|
|
|
|
|
|
|
|
|
| |
My recent commit d0d4464849e2b30aee8 in re_intuit_start() reduced the
scope of a 'skip if multiline' check, so that certain optimisations
weren't being unnecessarily skipped. Unfortunately it didn't reduce the
scope enough, so a vital slen-- was being skipped in the
SvTAIL-but-don't-fail case.
This commit just moves the !multiline test further down, and updates the
commentary and condition formatting a bit.
|
|
|
|
|
| |
After the previous reworking of the PREGf_IMPLICIT tests, move commentary
about PREGf_IMPLICIT closer to where its now relevant.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When BmUSEFUL() for a pattern goes negative, we unset ~RXf_USE_INTUIT.
A bug fix in 2010 (c941595168) made this part of the code also remove
the RXf_ANCH_MBOL at the same time, if PREGf_IMPLICIT was set.
However, this was really just working round a bug in regexec_flags().
Once intuit was disabled, regexec_flags() would then start taking the
buggy code path.
This buggy code path was separately fixed in 2012 by 21eede782, so there's
no longer any need to remove this flag. So don't.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The ml_anch variable is supposed to indicate that a multi-line match is
possible, i.e. that the regex is anchored to a \n.
Currently we just do
ml_anch = (prog->intflags & PREGf_ANCH_MBOL);
However, MBOL is also set on /.*..../; the two cases are distinguished by
adding the PREGf_IMPLICIT flag too.
So at the moment we have lots of tests along the lines of
if (ml_anch && !(prog->intflags & PREGf_IMPLICIT))
Simplify this by adding the IMPLICIT condition when initially calculating
ml_anch, so there's no need to keep testing for it later. This also means
that ml_anch actually means what it says now.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The block of code that deals with absolute anchors looks like:
if (...) {
if (.... && !PREGf_IMPLICIT) ...;
if (FOO) {
assert(!PREGf_IMPLICIT);
....
}
}
where the constraints of FOO imply PREGf_IMPLICIT, as shown by the
assertion. Simplify this to:
if (... && !PREGf_IMPLICIT) {
if (....) ...;
if (FOO) {
....
}
}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In the stclass block of code, there are various checks for anchored-ness.
Add !(prog->intflags & PREGf_IMPLICIT) to these conditions, since
from the perspective of these tests, we can only quit early etc if these
are real rather than fake anchors.
As it happens, this currently makes no logical difference, since an
PREGf_IMPLICIT pattern starts with .*, and so more or less by definition
can't have an stclass.
So this commit is really only for logical completeness, and to allow us to
change the definition of ml_anch shortly.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The 'check' block of code has special handling for if the pattern is
anchored; in this case, it allows us to set an upper bound on where a
floating substring might match; e.g. in
/^..\d?abc/
the floating string can't be more than 3 chars from the absolute beginning
of the string. Similarly with /..\G\d?abc/, it can't be more than 3 chars
from the start position (assuming that position been calculated correctly
by the caller).
However, the current code used rx_origin as the base for the offset
calculation, rather than strbeg/strpos as appropriate. This meant that
a) the first time round the loop, if strpos > strbeg, then the upper bound
would be set higher than needed;
b) if we ever go back to restart: with an incremented rx_origin, then the
upper limit is recalculated with more wasted slack at the latter end.
This commit changes the limit calculation, which reduces the following
from second to milliseconds:
$s = "abcdefg" x 1_000_000;
$s =~ /^XX\d{1,10}cde/ for 1..100;
It also adds a quick test to skip hopping when the result is likely to
leave end_point unchanged, and adds an explicit test for !PREGf_IMPLICIT.
This latter test isn't strictly necessary, as if PREGf_IMPLICIT were set,
it implies that the pattern starts with '.*', which implies that
prog->check_offset_max == SSize_t_MAX, which is already tested for.
However, it makes the overall condition more comprehensible, and makes it
more robust in the face of future changes.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The quick reject test that says "if a pattern has an absolute anchor,
then immediately reject if strpos != strbeg", currently skips that test
if PREGf_ANCH_GPOS is present. Instead, skip unless BOL or SBOL is present.
This means that something like /^abc\G/ will still do the quick reject
test.
I can't think of any example that will actually give a measurable
performance boost, and this is a fairly unlikely scenario, but the code is
more logical this way, and makes it more robust against future changes
(it's less likely to suddenly skip the quick test where it used to do it).
I've also updated the commentary on why we don't do a quick /\G/ test
akin to the /^/ test, and added some more info about why we test for the
PREGf_IMPLICIT flag.
And I've added an assert about PREGf_IMPLICIT too.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Intuit has a quick reject test for a fixed pattern that is anchored at
both ends. For example, with the pattern /^abcd$/, only the exact strings
"abcd" or "abcd\n" will match; anything else, and the match immediately
fails.
A fix for [perl #115242] correctly made intuit skip the test in the
presence of //m, since in this case the $ doesn't necessarily correspond
to the end of the string.
However, the fix was too wide in scope; it caused //m patterns to skip
searching for a known string anchored just at the start, as well as one
anchored at both ends.
With this commit, the following code now runs in a few milliseconds rather
than a few seconds on my machine:
$s = "abcdefg" x 1_000_000;
$s =~ /(?-m:^)abcX?fg/m for 1..100;
|
|
|
|
|
|
|
|
|
|
|
| |
When MBOL (/^.../m) matching is skipped, the debugging output looks like:
Starting position does not contradict /^/m...
which sounds a bit like the \n test *was* done and passed, rather than
the test being skipped. Change the message to:
(multiline anchor test skipped)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
re_intuit_start() decided that a pattern was capable of being anchored
after *any* \n in the string for a //m pattern that contains a BOL
(rather than an MBOL). This can happen by embedding one regex in another
for example.
This is an incorrect assumption, and means that intuit() might try
against every \n position in the string rather than just trying at the
beginning. With this commit, the following code on my machine reduces in
execution time from 7000ms to 5ms:
my $r = qr/^abcd/;
my $s = "abcd-xyz\n" x 500_000;
$s =~ /$r\d{1,2}xyz/m for 1..200;
|
|
|
|
|
| |
The blurb above this function is a mess and contains lots of obsolete
stuff. Mostly rewrite it. Also sprinkle a few extra comments elsewhere.
|
|
|
|
|
| |
Most of the 'check' substring matching code is in its own scope;
move the last few bits into that scope too. This is purely cosmetic.
|
| |
|
| |
|
|
|
|
|
|
| |
Improve the description at the start of this code block, and move
some of the comments closer to where they apply (and explain why they're
probably wrong)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
At the end of the failure block in the stclass code, it decides whether
rx_origin has been advanced too much for the match to ever succeed, and if
so, quits.
That test has a comment next to it about perhaps testing for this earlier.
It also does a byte rather than chars calculation.
This commit does the following.
It adds a proper chars-based termination test near the beginning of the
'check' code block. This test is essentially free, since at that point
we've just done the necessary HOPs. It also calculates start_point using
end_point rather the strend as the limit, so avoiding a bit of unnecessary
hopping when we're about to fail anyway. It introduces the new HOPMAYBE3
macro wrapper for this purpose.
It keeps the old bytes-based test at the end in stclass, with a note that
the simple bytes calculation errs on the side of re-entering the check
block where the accurate chars-based test is now done.
|