| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[perl #129140] attempting double-free
This fixes some leaks and double frees in regexes which contain code
blocks.
During compilation, an array of struct reg_code_block's is malloced.
Initially this is just attached to the RExC_state_t struct local var in
Perl_re_op_compile(). Later it may be attached to a pattern. The difficulty
is ensuring that the array is free()d (and the ref counts contained within
decremented) should compilation croak early, while avoiding double frees
once the array has been attached to a regex.
The current mechanism of making the array the PVX of an SV is a bit flaky,
as the array can be realloced(), and code can be re-entered when utf8 is
detected mid-compilation.
This commit changes the array into separately malloced head and body.
The body contains the actual array, and can be realloced. The head
contains a pointer to the array, plus size and an 'attached' boolean.
This indicates whether the struct has been attached to a regex, and is
effectively a 1-bit ref count.
Whenever a head is allocated, SAVEDESTRUCTOR_X() is used to call
S_free_codeblocks() to free the head and body on scope exit. This function
skips the freeing if 'attached' is true, and this flag is set only at the
point where the head gets attached to the regex.
In one way this complicates the code, since the num_code_blocks field is now
not always available (it's only there if a head has been allocated), but
mainly it simplifies, since all the book-keeping is now done in the two
new static functions S_alloc_code_blocks() and S_free_codeblocks().
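As a rough illustration of the head/body split and the 'attached' flag described above, here is a minimal standalone sketch. The struct layout and the names alloc_code_blocks() and free_code_blocks() are hypothetical stand-ins, not the real S_alloc_code_blocks() / S_free_codeblocks(), and the scope-exit hook (SAVEDESTRUCTOR_X) is replaced by an explicit call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct reg_code_block. */
struct code_block { int start, end; };

/* The head: small, never moves, and records who owns the body. */
struct code_blocks_head {
    struct code_block *body;  /* the actual array; may be realloc()ed */
    size_t count;             /* number of entries in body */
    bool attached;            /* true once a regex owns head and body */
};

/* Allocate a head plus a zeroed body of n entries; NULL on failure. */
static struct code_blocks_head *alloc_code_blocks(size_t n)
{
    struct code_blocks_head *h = malloc(sizeof *h);
    if (!h)
        return NULL;
    h->body = calloc(n ? n : 1, sizeof *h->body);
    if (!h->body) {
        free(h);
        return NULL;
    }
    h->count = n;
    h->attached = false;
    return h;
}

/* Cleanup meant to run on scope exit. It skips the free when ownership
 * has moved to a regex, which is what prevents the double free.
 * Returns true if it actually freed the head and body. */
static bool free_code_blocks(struct code_blocks_head *h)
{
    if (!h || h->attached)
        return false;
    assert(h->body != NULL);
    free(h->body);
    free(h);
    return true;
}
```

In this sketch the flag is set in exactly one place (where the head is handed to the regex), so an early exit before that point frees everything and an exit after it frees nothing.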
|
|
|
|
|
|
|
|
| |
This indents code and reflows the comments to account for the enclosing
block added by the previous commit.
At the same time, it makes some other miscellaneous white space changes,
and adds and revises other comments.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Clang has taken it upon itself to warn when an equality is wrapped in
double parentheses, e.g.
((foo == bar))
Which is a bit dumb, as any code along the lines of
#define isBAR (foo == BAR)
if (isBAR) {}
will trigger the warning.
This commit shuts clang up by putting in a harmless cast:
#define isBAR cBOOL(foo == BAR)
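A small standalone sketch of the same idea follows. MY_CBOOL is a hypothetical stand-in for cBOOL (the real definition lives in the core headers and may differ), and IS_BAR takes an argument here only to keep the example self-contained.

```c
#include <stdbool.h>

/* Hypothetical stand-in for Perl's cBOOL(). */
#define MY_CBOOL(x) ((x) ? (bool)1 : (bool)0)

enum { BAR = 7 };

/* Wrapping the comparison in the cast means `if (IS_BAR(v))` no longer
 * expands to a bare equality inside two pairs of parentheses. */
#define IS_BAR(v) MY_CBOOL((v) == BAR)

static int classify(int v)
{
    if (IS_BAR(v))
        return 1;
    return 0;
}
```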
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The way we tracked whether pattern recursion was infinite did not work
properly. A pattern like
"a"=~/(.(?2))((?<=(?=(?1)).))/
would loop forever, slowly eating up all available RAM as it added
pattern recursion stack frames.
This patch changes the rules for recursion so that recursively
entering a given pattern "subroutine" twice from the same position
fails the match. This means that where previously we might have
seen a fatal exception we will now simply fail. This means that
"aaabbb"=~/a(?R)?b/
succeeds with $& equal to "aaabbb".
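The rule can be illustrated with a toy matcher. This is only a sketch under simplifying assumptions, not the engine's implementation: the two patterns are hard-coded, there is a single "subroutine" (the whole pattern), and the in_progress table and function names are invented for the example.

```c
#include <string.h>

#define MAX_POS 64  /* toy limit on subject length */

/* in_progress[p] is nonzero while a recursion entered at offset p is
 * still active. */
static unsigned char in_progress[MAX_POS + 1];

/* Toy matcher for a(?R)?b anchored at pos.
 * Returns the end offset of the match, or -1 on failure. */
static int match_nested(const char *s, size_t pos)
{
    int end = -1;

    if (pos > MAX_POS || pos > strlen(s))
        return -1;
    if (in_progress[pos])
        return -1;          /* same subroutine, same position: fail */
    in_progress[pos] = 1;

    if (s[pos] == 'a') {
        int r = match_nested(s, pos + 1);   /* (?R)? taken */
        if (r >= 0 && s[r] == 'b')
            end = r + 1;
        else if (s[pos + 1] == 'b')         /* (?R)? skipped */
            end = (int)pos + 2;
    }

    in_progress[pos] = 0;
    return end;
}

/* Toy matcher for the left-recursive (?R)?a anchored at pos. Without
 * the guard this would recurse forever at the same offset. */
static int match_left_recursive(const char *s, size_t pos)
{
    int end = -1;
    int r;

    if (pos > MAX_POS || pos > strlen(s))
        return -1;
    if (in_progress[pos])
        return -1;
    in_progress[pos] = 1;

    r = match_left_recursive(s, pos);   /* rejected by the guard */
    if (r >= 0 && s[r] == 'a')
        end = r + 1;
    if (end < 0 && s[pos] == 'a')       /* fall back to the literal */
        end = (int)pos + 1;

    in_progress[pos] = 0;
    return end;
}
```

Each nested call in match_nested() starts one character further on, so the guard never fires there; in match_left_recursive() the nested call starts at the same offset and simply fails instead of looping.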
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
GOSTART is a special case of GOSUB; we can remove a lot of offset twiddling
and other special casing by unifying them, at pretty much no cost.
GOSUB has 2 arguments, ARG() and ARG2L(), which are interpreted as
a U32 and an I32 respectively. ARG() holds the "parno" we will recurse
into. ARG2L() holds a signed offset to the relevant start node for the
recursion.
Prior to this patch the argument to GOSUB would always be >= 1, and unlike
other parts of our logic we would not use 0 to represent "start/end" of
pattern, as GOSTART would be used for "recurse to beginning of pattern".
After this patch we use 0 to represent "start/end", and a lot of
complexity "goes away" along with GOSTART regops.
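A simplified picture of the unified node, as a sketch only: the struct below is hypothetical and much smaller than a real regnode, and it assumes the signed offset is counted in nodes relative to the GOSUB node itself.

```c
#include <stdint.h>

/* Hypothetical, simplified GOSUB node. */
struct gosub_node {
    uint32_t parno;   /* role of ARG(): group to recurse into; 0 now
                         means the whole pattern (formerly GOSTART) */
    int32_t  offset;  /* role of ARG2L(): signed distance to the node
                         where the recursion starts */
};

/* Index of the node that a GOSUB located at index `self` jumps to. */
static long gosub_target(const struct gosub_node *n, long self)
{
    return self + n->offset;
}

/* True when the node recurses into the whole pattern. */
static int gosub_is_whole_pattern(const struct gosub_node *n)
{
    return n->parno == 0;
}
```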
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When displaying debugging info, say under -Dr, the regex engine elides
data in order to keep the output from getting too long. For example,
the number of code points in all of Unicode matched by \w is quite
large, and so when displaying a pattern that matches this, only the
first several of them are printed, and the rest are truncated,
represented by "...".
Sometimes, one wants to see more than what the
compiled-into-the-engine-max shows. This commit creates code to read
this environment variable to override the default max lengths. This
changes the lengths for everything to the input number, even if they
have different compiled maximums in the absence of this variable.
I'm not currently documenting this variable, as I don't think it works
properly under threads, and we may want to alter the behavior in various
ways as a result of gaining experience with using it.
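The general shape of such an override might look like the sketch below. The helper names are invented, and the variable name is passed in as a parameter because it is not given in this message; the parsing rules (reject empty, non-numeric, or zero values) are assumptions, not the engine's behaviour.

```c
#include <errno.h>
#include <stdlib.h>

/* Parse a positive length override from `text`. Falls back to
 * `compiled_default` when the text is missing or malformed. */
static size_t parse_max_len(const char *text, size_t compiled_default)
{
    char *end;
    unsigned long v;

    if (!text || text[0] < '0' || text[0] > '9')
        return compiled_default;
    errno = 0;
    v = strtoul(text, &end, 10);
    if (errno != 0 || *end != '\0' || v == 0)
        return compiled_default;
    return (size_t)v;
}

/* Look the override up in the environment. One number replaces every
 * compiled-in maximum, whatever its individual default was. */
static size_t debug_max_len(const char *var_name, size_t compiled_default)
{
    return parse_max_len(getenv(var_name), compiled_default);
}
```

Reading the environment on every call is simple but not thread-safe in general, which is in line with the caveat about threads above.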
|
|
|
|
| |
So, it's better to not have a mask to include the unused ones.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently most pp_leavefoo subs have something along the lines of
POPBLOCK(cx);
POPFOO(cx);
where POPBLOCK does cxstack_ix-- and sets cx to point to the top CX stack
entry. It then restores a bunch of PL_ vars saved in the CX struct.
Then POPFOO does any type-specific restoration, e.g. POPSUB decrements the
ref count of the cv that was just executed.
However, this is logically the wrong order. When we *enter* a scope, we do
PUSHBLOCK;
PUSHFOO;
so undoing the PUSHBLOCK should be the last thing we do. As it happens,
it doesn't really make any difference to the running, which is why we've
never fixed it before.
Reordering it has two advantages.
First, it allows the steps for scope exit to be the exact logical reverse
of scope entry, which makes understanding what's going on and debugging
easier.
Second, it allows us to make the code cleaner.
This commit also removes the cxstack_ix-- and setting cx steps from
POPBLOCK; now we already expect cx to be set (which it usually already is)
and we do the cxstack_ix-- ourselves. This also means we can remove a
whole bunch of cxstack_ix++'s that were added immediately after the
POPBLOCK in order to prevent the context being inadvertently overwritten
before we've finished using it.
So in full,
POPBLOCK(cx);
POPFOO(cx);
is now implemented as:
cx = &cxstack[cxstack_ix];
... other stuff done with cx ...
POPFOO(cx);
POPBLOCK(cx);
cxstack_ix--;
Finally, this commit also tweaks PL_curcop in pp_leaveeval, since
otherwise PL_curcop could temporarily be NULL when debugging code is
called in the presence of 'use re Debug'. It also stops the debugging code
crashing if PL_curcop is still NULL.
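The reordered exit sequence can be mimicked with a toy context stack. This is a sketch with invented names and a single saved variable; the real PUSHBLOCK/POPBLOCK macros save and restore considerably more state.

```c
#include <assert.h>

enum { CX_MAX = 8 };

struct context {
    int saved_global;  /* what the PUSHBLOCK-like step saves */
    int type_state;    /* what the PUSHFOO-like step records */
};

static struct context cxstack[CX_MAX];
static int cxstack_ix = -1;
static int some_global = 0;

/* Scope entry: block-level save first, then type-specific setup. */
static struct context *push_scope(int type_state)
{
    struct context *cx;

    assert(cxstack_ix + 1 < CX_MAX);
    cx = &cxstack[++cxstack_ix];
    cx->saved_global = some_global;  /* PUSHBLOCK-like */
    cx->type_state = type_state;     /* PUSHFOO-like */
    return cx;
}

/* Scope exit: the exact reverse. cx stays valid throughout because the
 * index is decremented only as the very last step. */
static int pop_scope(void)
{
    struct context *cx;
    int type_state;

    assert(cxstack_ix >= 0);
    cx = &cxstack[cxstack_ix];
    type_state = cx->type_state;     /* POPFOO-like */
    cx->type_state = 0;
    some_global = cx->saved_global;  /* POPBLOCK-like */
    cxstack_ix--;
    return type_state;
}
```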
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adds the final Unicode boundary type previously missing from core
Perl: the LineBreak one. This feature is already available in the
Unicode::LineBreak module, but I've been told that there are portability
and some other issues with that module. What's added here is a
light-weight version that is lacking the customizable features of the
module.
This implements the default Line Breaking algorithm, but with the
customizations that Unicode is expecting everybody to add, as their
test file tests for them. In other words, this passes Unicode's fairly
extensive furnished tests, but wouldn't if it didn't include certain
customizations specified by Unicode beyond the basic algorithm.
The implementation uses a look-up table of the characters surrounding a
boundary to see if it is a suitable place to break a line. In a few
cases, context needs to be taken into account, so there is code in
addition to the lookup table to handle those.
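The combination of a pair table plus a little context code could be sketched as follows. The four classes and the table contents are a toy invented for illustration and are far smaller than the real Line_Break property; the one context rule shown (no break after an opening punctuation mark, even across spaces) is modelled loosely on the default algorithm.

```c
#include <stddef.h>

enum lb_class { LB_ALPHA, LB_SPACE, LB_OPEN, LB_HYPHEN, LB_NCLASSES };
enum lb_action { NO_BREAK, BREAK_OK, NEEDS_CONTEXT };

/* lb_table[before][after]: what to do at the boundary between two
 * adjacent characters, given only their classes. */
static const unsigned char lb_table[LB_NCLASSES][LB_NCLASSES] = {
    /*             ALPHA          SPACE     OPEN           HYPHEN */
    /* ALPHA  */ { NO_BREAK,      NO_BREAK, NO_BREAK,      NO_BREAK },
    /* SPACE  */ { NEEDS_CONTEXT, NO_BREAK, NEEDS_CONTEXT, NEEDS_CONTEXT },
    /* OPEN   */ { NO_BREAK,      NO_BREAK, NO_BREAK,      NO_BREAK },
    /* HYPHEN */ { BREAK_OK,      NO_BREAK, BREAK_OK,      NO_BREAK },
};

/* May a line break occur between cls[i-1] and cls[i]? */
static int can_break(const enum lb_class *cls, size_t n, size_t i)
{
    size_t j;

    if (i == 0 || i >= n)
        return 0;

    switch (lb_table[cls[i - 1]][cls[i]]) {
    case BREAK_OK:
        return 1;
    case NO_BREAK:
        return 0;
    default:
        break;  /* the table alone cannot decide */
    }

    /* Context case: skip back over the run of spaces and refuse to
     * break if that run was preceded by an opening punctuation mark. */
    j = i - 1;
    while (j > 0 && cls[j] == LB_SPACE)
        j--;
    return cls[j] != LB_OPEN;
}
```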
This should meet the needs for line breaking of many applications,
without having to load the module.
The algorithm is somewhat independent of the Unicode version, just like
the other boundary types. Only if new rules are added, or existing ones
modified is there need to go in and change this code. Otherwise,
running regen/mk_invlists.pl should be sufficient when a new Unicode
release is done to keep it up-to-date, again like the other Unicode
boundary types.
|
|
|
|
| |
Better to not have this clutter.
|
|
|
|
|
|
|
|
|
|
|
| |
The mask removed here was to make sure that right shifting didn't
propagate the sign bit, but is unnecessary as the value shifted is
unsigned. And confining things to a U8 with that mask assumes that the
bit vector being operated on has 256 elements max. This isn't
necessarily true these days, as one can change ANYOF_BITMAP_SIZE.
In fact changing that number was failing until this commit.
It also adds white space to make it easier to read.
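To see why an 8-bit mask breaks larger bitmaps, consider this sketch. The names and the size of 512 are hypothetical (BITMAP_BITS merely stands in for a raised ANYOF_BITMAP_SIZE), and the macros in regcomp.h are not reproduced here.

```c
#define BITMAP_BITS 512u  /* hypothetical size larger than 256 */

struct bitmap {
    unsigned char bytes[BITMAP_BITS / 8];
};

/* Set the bit for code point c, ignoring out-of-range values. */
static void bitmap_set(struct bitmap *b, unsigned int c)
{
    if (c < BITMAP_BITS)
        b->bytes[c >> 3] |= (unsigned char)(1u << (c & 7u));
}

/* Test the bit for code point c. Because c is unsigned, c >> 3 cannot
 * propagate a sign bit, so no mask is needed. Truncating c to 8 bits
 * first would make 300 alias to 44 (300 & 0xFF). */
static int bitmap_test(const struct bitmap *b, unsigned int c)
{
    return c < BITMAP_BITS && ((b->bytes[c >> 3] >> (c & 7u)) & 1u);
}
```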
|
|
|
|
|
| |
Instead of having this code repeated in several places, call
the more base macro from the others.
|
|
|
|
|
|
|
|
|
|
|
| |
I've long been confronted with trying to do things to create a spare bit
to use. I thought it easier now, while it's fresh in my mind, to free
up one for future use, rather than re-learn things when it next becomes
necessary. It would have been a different story if the freed bit had
required a performance penalty.
This commit also updates the comments about how to create even more
spare bits should it become necessary.
|
|
|
|
| |
Some of the names are expanded slightly and not shortened
|
| |
|
| |
|
|
|
|
|
|
| |
This commit sets a flag at pattern compilation time to indicate if
a rare case is present that requires special handling, so that that
handling can be avoided unless necessary.
|
|
|
|
|
|
| |
This changes the spare bit to be adjacent to the LOC_FOLD bit, in
preparation for the next commit, which will use that bit for a
LOC_FOLD-related use.
|
|
|
|
|
|
|
|
| |
This is done by combining 2 mutually exclusive bits into one. I hadn't
seen this possibility before because the name of one of them misled me.
It also misled me into turning on that flag unnecessarily, and into
missing opportunities to not have to create a swash at runtime. This
commit corrects those things as well.
|
| |
|
|
|
|
|
| |
This places the flag bits of like-type flags adjacent for convenience in
reading the code. It also improves the commentary about their purposes.
|
|
|
|
|
|
| |
Since commit a0bd1a30d379f2625c307657d63fc50173d7a56d, a synthetic start
class node can be just an ANYOF-type node. I don't think this causes a
bug, just misses a potential optimisation.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously use of this under /l regex rules was a compile time error.
Now it works like \b{wb} and \b{sb}, which compile under locale rules
and always work like Unicode says they should. A UTF-8 locale implies
Unicode rules, and the goal is for it to work seamlessly with the rest
of perl. This construct was the only one I am aware of that didn't work
seamlessly (not counting OS interfaces) under UTF-8 LC_CTYPE locales.
For all three of these constructs, use with a non-UTF-8 runtime locale
raises a warning, and Unicode rules are used anyway.
UTF-8 locale collation still has problems, but this is low priority to
fix, as it's a lot of work, and if one really cares, one should be using
Unicode::Collate.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The ANYOF_FLAGS bits are all used up, but a future commit wants one.
This commit frees up a bit by sharing two of the existing
comparatively-rarely-used ones. One bit is used only under /d matching
rules, while the other is used only when not under /d. Only the latter
bit is used in synthetic start classes. The previous commit introduced
an ANYOFD node type corresponding to /d. An SSC never is this type.
Thus, the bits have mutually exclusive meanings, and we can use the node
type to distinguish between the two meanings of the combined bit.
An alternative implementation would have been to use the
ANYOF_HAS_NONBITMAP_NON_UTF8_MATCHES non-/d bit instead of the one
chosen. But this is used more frequently, so the disambiguation would
have been exercised more frequently, slowing execution down ever so
slightly; more importantly, this one required fewer code changes, by a
slight amount.
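The disambiguation by node type might be pictured like this. It is a sketch only: the bit position, the struct, and the two accessor names are invented, and the actual meanings of the two shared flags are not spelled out here.

```c
/* Two node types: NODE_ANYOFD corresponds to /d matching rules. */
enum node_type { NODE_ANYOF, NODE_ANYOFD };

#define SHARED_FLAG 0x40u  /* hypothetical bit position */

struct cc_node {
    enum node_type type;
    unsigned char flags;
};

/* Meaning of the shared bit in a /d node. */
static int has_d_only_property(const struct cc_node *n)
{
    return n->type == NODE_ANYOFD && (n->flags & SHARED_FLAG) != 0;
}

/* Meaning of the same bit in any other node, including a synthetic
 * start class, which is never of the /d type. */
static int has_non_d_property(const struct cc_node *n)
{
    return n->type != NODE_ANYOFD && (n->flags & SHARED_FLAG) != 0;
}
```

The extra type comparison is the cost of sharing, which is why the less frequently used flags are the better candidates for it.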
|
|
|
|
|
|
| |
This horrible thing broke encapsulation and was as buggy as a very buggy
thing. It's been officially deprecated since 5.20.0 and now it can finally
die die die!!!!
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
An empty cpan/.dir-locals.el stops Emacs using the core defaults for
code imported from CPAN.
Committer's work:
To keep t/porting/cmp_version.t and t/porting/utils.t happy, $VERSION needed
to be incremented in many files, including throughout dist/PathTools.
perldelta entry for module updates.
Add two Emacs control files to MANIFEST; re-sort MANIFEST.
For: RT #124119.
|
|
|
|
|
|
|
| |
This was experimentally introduced in 5.18, and no issues were raised,
except that it got us to thinking and spurred us to stop allowing $^X,
where 'X' is a non-printable control character, and that change caused
some issues.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
A function determines whether the space between any two characters is
a grapheme cluster break. After I wrote this, I realized that an array
lookup might be a better implementation, but the deadline for v5.22 was
too close to change it. I did see that my gcc optimized it down to
an array lookup.
This makes the implementation of \X go from being complicated to
trivial.
|
|
|
|
| |
Extracted from patch submitted by Lajos Veres in RT #123693.
|
| |
|
|
|
|
|
| |
These will be used in a future commit to distinguish between /l patterns
vs non-/l.
|
| |
|
|
|
|
|
|
|
| |
This adds a function to allocate a regnode with 2 32-bit arguments, and
uses it, rather than the ad-hoc code that did the same thing previously.
This is in preparation for this code being used in a 2nd place in a
future commit.
|
| |
|
|
|
|
| |
These internal definitions are no longer used.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add a new re debug mode for outputting stuff useful for testing.
In this case we count the number of times that we go through
study_chunk. With a51d618a we should do 5 times (or less) when
we traverse the test pattern. Without a51d618a we recurse 11
times. In the case of RT #122283 we would do gazillions of
recursions, so many I never let it run to completion.
/
(?(DEFINE)(?<foo>foo))
(?(DEFINE)(?<bar>(?&foo)bar))
(?(DEFINE)(?<baz>(?&bar)baz))
(?(DEFINE)(?<bop>(?&baz)bop))
/x
I say "or less" because you could argue that since these defines are
never called, we should not actually recurse at all, and should maybe
just compile this as a simple empty pattern.
|
|
|
|
|
|
|
|
|
|
|
|
| |
In 075abff3 Andy Lester set the flags field of regops
to default to 0xde. I find this really weird, and possibly dangerous,
as it seems to me reasonable to assume a new regop would have this
field set to 0, so that later on code can set it to something else
if necessary. (Which is what I wanted to do.)
Since nothing breaks if I set it to 0x0 and I find that to be a much
more natural default than 0xde (the prefix of 0xdeadbeef), I am
changing this to set it to 0.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
See also perl5porters thread titled: "Perl MBOLism in regex engine"
In the perl 5.000 release (a0d0e21ea6ea90a22318550944fe6cb09ae10cda)
the BOL regop was split into two behaviours MBOL and SBOL, with SBOL
and BOL behaving identically. Similarly the EOL regop was split into
two behaviors SEOL and MEOL, with EOL and SEOL behaving identically.
This then resulted in various duplicative code related to flags and
case statements in various parts of the regex engine.
It appears that perhaps BOL and EOL were kept because they are the
type ("regkind") for SBOL/MBOL and SEOL/MEOL/EOS. Reworking regcomp.pl
to handle aliases for the type data so that SBOL/MBOL are of type
BOL, even though BOL == SBOL seems to cover that case without adding
to the confusion.
This means two regops, a regstate, and an internal regex flag can
be removed (and used for other things), and various logic relating
to them can be removed.
For the uninitiated, SBOL is /^/ and /\A/ (with or without /m) and
MBOL is /^/m. (I consider it a fail we have no way to say MBOL without
the /m modifier). Similarly SEOL is /$/ and MEOL is /$/m (there is
also a /\z/ which is EOS "end of string" with or without the /m).
|
| |
|
|
|
|
|
|
|
|
| |
This commit allows Perl to be compiled with a bitmap size that is larger
than 256. This bitmap is used to directly look up whether a character
matches or not, without having to do a binary search or hash lookup. It
might improve the performance for some installations that have a lot of
use of scripts that are above the Latin1 range.
|
|
|
|
|
|
|
|
|
| |
These are renamed to be more clear as to their actual meanings. I know
other people have been confused by their former names.
Some of the name changes will become more important as future commits
will allow the bitmap in a bracketed character class to be a different
size.
|
|
|
|
| |
This is an internal header, so can change names within it.
|
|
|
|
|
|
| |
This prevents a signed result if this macro ever gets used in a U8.
The ANYOF_BITMAP_TEST macro must now be cast or it would generate warnings
when compiled with -DPERL_BOOL_AS_CHAR
|
|
|
|
| |
Too many negations led to this.
|
|
|
|
|
|
|
| |
ANYOF nodes (for bracketed character classes) currently are for code
points 0-255. This is the first step toward eventually making that size
configurable. This also renames a static function, as the domain may
not necessarily be 'latin1'.
|
|
|
|
|
|
| |
This just sets the ptr field in the Synthetic Start Class that will be
passed to regexec.c to NULL, and clarifies the comments in regcomp.h. See
the thread starting at http://markmail.org/message/2txwaqnjco6zodeo
|