| Commit message (Collapse) | Author | Age | Files | Lines |
| |
|
|
|
|
| |
Comment change suggestions from @hvds in PR #18835.
|
|
|
|
|
|
|
|
| |
My attempt to insulate the year-old commits, finally pushed as
77a6d54c0deb1165b37dcf11c21cd334ae2579bb and
403d7eb3e4320188571cf61b9dab62ff10799f49, from the leading tab removal
failed miserably.
I spent a bunch of time sorting it all out, and this is the result.
|
| |
|
| |
|
|
|
|
|
| |
*ACCEPT already avoids this (because it is "ENDLIKE"), but gets a
related fix to stop scanning for start class.
|
| |
|
|
|
|
|
|
|
| |
This is a rebasing by @khw of part of GH #18792, which I needed to get
in now to proceed with other commits.
It also strips trailing white space from the affected files.
|
|
|
|
|
| |
S_regclass() is unwieldy. This commit splits it into two nearly equal
size parts. More could be done.
|
|
|
|
|
|
|
|
| |
The expression we're about to add to data->pos_delta in this part of
study_chunk() can be either positive or negative; however, while we applied
an overflow check to avoid exceeding OPTIMIZE_INFTY, we were happily
subtracting from it when the expression was negative, making it no longer
infinite.
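The fix described above can be sketched as a saturating add in which infinity absorbs every adjustment, negative ones included. This is only an illustration: `sketch_len`, `SKETCH_INFTY` and `sketch_add_delta()` are hypothetical names, and `LONG_MAX` merely stands in for whatever value OPTIMIZE_INFTY really has.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-ins; the real type and the real OPTIMIZE_INFTY differ. */
typedef long sketch_len;
#define SKETCH_INFTY LONG_MAX

/* Add a signed adjustment to a non-negative delta that may be "infinite". */
static sketch_len
sketch_add_delta(sketch_len delta, sketch_len adjust)
{
    assert(delta >= 0);

    /* Once infinite, always infinite: never subtract from the sentinel. */
    if (delta == SKETCH_INFTY)
        return SKETCH_INFTY;

    /* A positive adjustment that would overflow saturates to infinity. */
    if (adjust > 0 && delta > SKETCH_INFTY - adjust)
        return SKETCH_INFTY;

    /* Safe: a negative adjustment cannot overflow a non-negative delta. */
    return delta + adjust;
}
```

The first check is the part the old code was missing, according to the message: without it, a negative adjustment turns the sentinel into an ordinary large number.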
|
|
|
|
| |
delta and pos_delta may hold OPTIMIZE_INFTY to represent infinity.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This fixes GH #18604. There was a path through the code where a
particular SV did not get its reference count decremented.
I did an audit of the function and came up with several other
possibilities that are included in this commit.
Further, there would be leaks for some instances of finding syntax
errors in the input pattern, or when warnings are fatalized. Those
would require mortalizing some SVs, but that is beyond the scope of this
commit.
|
|
|
|
|
|
|
|
|
| |
Otherwise a strict linker will fail to build the extension due
to a multiply defined symbol. We used to do this but it was
removed in e513125ac7bdea1f for unknown reasons. The same
commit also defined some macros inside the function that are used
both inside and outside it, so put them where they can be seen
regardless of whether we are defining the function itself.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 122af31004 acted on the wrong assumption that NEXTOPER() and
regnext() were equivalent, and in fixing a valgrind complaint tried
to simplify code for detecting specific patterns for split() that
merited special-case handling by making them all use regnext().
As a result, the special case /\s+/ was no longer correctly detected,
resulting in a degree of pessimisation.
This commit fixes that, and avoids reading via the calculated 'next'
pointer except for the ops we need (in which cases we know it'll point
to another regop) - for the EXACT case (which we don't need), valgrind
was correctly pointing out that it points to potentially uninitialized
data.
|
| |
|
|
|
|
|
| |
This was the consensus in
http://nntp.perl.org/group/perl.perl5.porters/258489
|
|
|
|
|
|
| |
Not all three synonyms were documented.
This also fixes up related comments in regcomp.c to correspond.
|
|
|
|
|
| |
By changing a bool into a pointer, we can avoid some work and prepare
for a future commit.
|
|
|
|
|
|
|
|
| |
This moves the finding of the matching '}' for \g{ to earlier, and
creates a temporary to point to the current position in the parse. This
makes it easier to deal with backtracking; we haven't advanced the main
parse pointer, so don't have to remember how far we advanced. This will
prove advantageous in a future commit.
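A minimal sketch of the idea above, assuming a simplified grammar in which the braces hold only decimal digits (the real \g{...} syntax accepts more). The closing brace is located first, a temporary cursor does the scanning, and the caller's parse pointer moves only on success. `sketch_braced_number()` is a hypothetical name, not a function in regcomp.c.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Try to parse "{digits}" at *parse. On success, store the value in *out,
 * advance *parse past the '}' and return 1. On any failure return 0 and
 * leave *parse untouched, so the caller has nothing to undo. */
static int
sketch_braced_number(const char **parse, long *out)
{
    const char *s;        /* temporary cursor; the main pointer stays put */
    const char *close;
    long n = 0;

    assert(parse && *parse && out);
    s = *parse;

    if (*s != '{')
        return 0;
    close = strchr(s, '}');       /* find the matching brace up front */
    if (!close)
        return 0;
    s++;
    if (s == close)               /* empty braces */
        return 0;

    for (; s < close; s++) {
        if (*s < '0' || *s > '9')
            return 0;
        if (n > (LONG_MAX - 9) / 10)   /* refuse values that would overflow */
            return 0;
        n = n * 10 + (*s - '0');
    }

    *out = n;
    *parse = close + 1;           /* commit the advance only now */
    return 1;
}
```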
|
|
|
|
| |
This is considered better practice.
|
|
|
|
|
|
|
| |
Rather than know how far we have advanced in parsing when we have to
back up, use the already-existing checkpoint position. This results in
slightly more maintainable code that a future commit will take advantage
of.
|
|
|
|
|
|
|
|
| |
This change has been planned for a long time, bringing Perl into parity
with similar languages, but it took many deprecation cycles to be able
to reach the point where it could safely go in.
This fixes GH #18264.
|
|
|
|
|
|
| |
Prior to this commit, the error message for a curly quantifier that had an
error in the bounds pointed to the left brace. Now the error message points
to the first bound that has a problem.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit copies portions of new_regcurly(), which has been around
since 5.28, into plain regcurly(), as a baby step in preparation for
converting entirely to the new one. These functions are used for
parsing {m,n} quantifiers. Future commits will add capabilities not
available using the old version.
The commit adds an optional parameter, to return to the caller
information it gleans during parsing.
regpiece() is changed by this commit to use this information, instead of
itself reparsing the input. Part of the reason for this commit is that
changes are planned soon to what is legal syntax. With this commit in
place, those changes only have to be done once.
This commit also extracts into a function the calculation of the
quantifier bounds. This allows the logic for that to be done in one
place instead of two.
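To illustrate the shape of such an interface, here is a sketch of a {m,n} recognizer with an optional out-parameter, so that a caller can reuse the parsed bounds instead of reparsing. The names (`sketch_regcurly()`, `struct sketch_bounds`), the use of -1 for "no upper bound", and the exact set of accepted forms ({m}, {m,} and {m,n}) are assumptions for illustration, not the signatures or rules in regcomp.c.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

struct sketch_bounds {
    long min;
    long max;   /* -1 means "no upper bound" in this sketch */
};

/* Parse one or more decimal digits; return the position after them,
 * or NULL if there are none or the value would overflow. */
static const char *
sketch_digits(const char *s, long *val)
{
    const char *start = s;
    long n = 0;

    assert(s && val);
    while (*s >= '0' && *s <= '9') {
        if (n > (LONG_MAX - 9) / 10)
            return NULL;
        n = n * 10 + (*s++ - '0');
    }
    if (s == start)
        return NULL;
    *val = n;
    return s;
}

/* Return 1 if s starts with a well-formed {m}, {m,} or {m,n}.
 * If result is non-NULL, also hand the parsed bounds back to the caller. */
static int
sketch_regcurly(const char *s, struct sketch_bounds *result)
{
    long min, max;

    if (*s != '{')
        return 0;
    s = sketch_digits(s + 1, &min);
    if (!s)
        return 0;

    if (*s == '}') {
        max = min;                 /* {m} */
    }
    else if (*s == ',') {
        s++;
        if (*s == '}') {
            max = -1;              /* {m,} */
        }
        else {
            s = sketch_digits(s, &max);
            if (!s || *s != '}')
                return 0;          /* {m,n} must close right after n */
        }
    }
    else {
        return 0;
    }

    if (result) {
        result->min = min;
        result->max = max;
    }
    return 1;
}
```

Passing NULL keeps the function usable as a plain yes/no test, while a caller such as regpiece() would pass a struct and skip its own reparse.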
|
|
|
|
|
|
| |
The new names are more understandable to me. This also adds a second
parameter to one macro, that is unused until the next commit in the
series.
|
|
|
|
|
|
|
|
|
|
|
| |
This just detabifies to get rid of the mixed tab/space indentation.
Applying consistent indentation and dealing with other tabs are another issue.
Done with `expand -i`.
* vutil.* left alone, it's part of version.
* Left regen managed files alone for now.
|
|
|
|
|
|
| |
This makes the linker have to decide (or guess) which of the
identically-named symbols to include. The VMS linker refuses
and throws a multiply-defined symbol error.
|
| |
|
|
|
|
|
|
| |
The names were intended to force people to not use them outside their
intended scopes. But by restricting those scopes in the first place, we
don't need such unwieldy names.
|
|
|
|
|
|
| |
This function is called only at compile time; experience has shown that
compile-time operations are not time-critical. And future commits will
lengthen it, making it not practically inlinable anyway.
|
|
|
|
|
|
|
|
| |
Many of the files in perl are for one thing only, and hence their
embedded documentation will be for that one thing. By creating a hash
here of them, those files don't have to worry about what section that
documentation goes under, and so it can be completely changed without
affecting them.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This also simplifies the flagging for these assertions: since this
error is now the only thing using in_lookhead and in_lookbehind, they
can be combined into a single in_lookaround.
Rather than conditional increment/decrement as we recurse into S_reg
I simply save the value of in_lookaround and restore it before
returning. Some unsuccessful or restart paths don't do the restore,
but they either result in a croak(), or a restart which reinitialises
in_lookaround anyway.
Also added tests to ensure that all the different zero-width assertions
with content trigger the error.
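The save-and-restore approach described above can be sketched with a toy recursive parser. Everything here is hypothetical: the toy grammar uses '[' ... ']' for a lookaround group and '(' ... ')' for an ordinary group, and `sketch_parse()` counts the 'x' characters that fall inside any lookaround.

```c
#include <assert.h>

/* Hypothetical, pared-down parser state. */
struct sketch_state {
    int in_lookaround;   /* nonzero while inside a lookaround group */
};

/* Parse up to the closer of the current group, leaving *p on it.
 * Returns how many 'x' characters were seen inside a lookaround. */
static int
sketch_parse(struct sketch_state *st, const char **p, int is_lookaround)
{
    int saved;
    int flagged = 0;

    assert(st && p && *p);
    saved = st->in_lookaround;          /* save on entry */
    if (is_lookaround)
        st->in_lookaround = 1;

    while (**p && **p != ')' && **p != ']') {
        char c = *(*p)++;
        if (c == '(' || c == '[') {
            flagged += sketch_parse(st, p, c == '[');
            if (**p)
                (*p)++;                 /* step over the group's closer */
        }
        else if (c == 'x' && st->in_lookaround) {
            flagged++;
        }
    }

    st->in_lookaround = saved;          /* restore before returning */
    return flagged;
}
```

Because each level restores the value it found, there is no increment/decrement pairing to keep balanced: an ordinary group nested inside a lookaround inherits the flag, and it is cleared again once the outermost lookaround returns.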
|
|
|
|
|
|
|
|
|
|
|
|
| |
This was an assertion failure in regexec.c under rare circumstances. A
reduction of the fuzzed test case is now in pat_advanced.t.
The root cause of this was that the pattern being compiled was encoded in
UTF-8 and 'use locale' was in effect, equivalent to the /l charset, and
then the charset was reset inside the pattern, to /d. But /d in a UTF-8
pattern is illegal, hence the later assertion failure.
The solution is to reset instead to /u when the pattern is UTF-8.
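Reduced to its core, the rule is a one-line decision. The enum and function below are hypothetical names chosen for this sketch and do not correspond to the identifiers used in regcomp.c.

```c
#include <assert.h>

/* Hypothetical charset tags standing in for the /d, /l and /u modifiers. */
enum sketch_charset { SK_DEPENDS, SK_LOCALE, SK_UNICODE };

/* Pick the charset to fall back to when the pattern resets it.
 * /d is not legal for a UTF-8 pattern, so such a pattern resets to /u. */
static enum sketch_charset
sketch_reset_charset(int pattern_is_utf8)
{
    assert(pattern_is_utf8 == 0 || pattern_is_utf8 == 1);
    return pattern_is_utf8 ? SK_UNICODE : SK_DEPENDS;
}
```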
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Generally we have to wait until runtime to do folding for regnodes that
are locale dependent, because we don't know what the locale at runtime
will be, and hence what the folds will be.
But UTF-8 locales all have the same folding behavior, no matter what the
locale is, with the exception of two fold pairs in Turkish. (Lithuanian
too, but Perl doesn't support that language's special folding rules.)
UTF-8 is the only locale type that Perl supports that can represent code
points above 255. Therefore we do know at compile time what the
above-255 folds are (again excepting the two in Turkish), and so we can
do the folding then. But only if both the components are above 255.
There are a few folds that cross the 255/256 boundary, and they must be
deferred.
However, there are two instances where there are three characters that
fold together in which two of them are above 255, and the third isn't.
That the two high ones are equivalent under /i is known at compile time,
and so that equivalence can be stated then.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Prior to this commit, the generated macros for dealing with multi-char
folds in UTF-8 strings only recognized completely folded strings. This
commit changes that to add the uppercase for characters in the Latin1
range. Hopefully an example will clarify.
The fold for U+0130: LATIN CAPITAL LETTER I WITH DOT ABOVE is 'i'
followed by U+0307: COMBINING DOT ABOVE. But since we are doing /i
matching, an 'I' followed by U+307 should also match. This commit
changes the macros to know this. Before this, if the fold were entirely
ASCII, the macros would know all the possible combinations. This commit
extends that to all code points < 256. (Since there are no folds to the
upper Latin1 range, that really means all code points below 128. But
making it general means it wouldn't have to be revised if a fold were
ever added to the upper half range.)
The reason to make this change is that it makes some future code less
complicated. And it adds very little complexity to the generated
macros; less than the code it will save. I originally thought it would
be more complex than it now turns out to be. Much of that is because
the infrastructure has advanced since that decision.
I couldn't find any current places that this change will allow to be
simplified. There could be if the macros were extended to do this on
all code points, not just the low ones. I tried that, but the generated
macros were at least 400 lines longer than before. That does add
significant complexity, so I backed that out.
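For the U+0130 example above, a hand-written check might look like the following. It is only a sketch of what such a test does: the generated macros are not shown in this log and will look different, and `sketch_is_fold_of_U0130()` is a hypothetical name. The one fixed fact used is that U+0307 is encoded in UTF-8 as the bytes 0xCC 0x87.

```c
#include <assert.h>

/* Does the UTF-8 string [s, e) start with a sequence that matches the
 * multi-char fold of U+0130 under /i? Accepts the fully folded form
 * ('i' + U+0307) and also the uppercase ASCII first character ('I'). */
static int
sketch_is_fold_of_U0130(const unsigned char *s, const unsigned char *e)
{
    assert(s && e && s <= e);
    return (e - s) >= 3
        && (s[0] == 'i' || s[0] == 'I')
        && s[1] == 0xCC
        && s[2] == 0x87;
}
```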
|
|
|
|
|
| |
This was a case statement of every type of EXACTish node. Instead,
there is a simple way to see if something is EXACTish.
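The message does not say what the simple way is. One plausible shape, shown here purely as an assumption, is a table that maps each opcode to the base opcode of its family, so that a single lookup replaces the case statement. All names and opcodes below are hypothetical.

```c
#include <assert.h>

/* Hypothetical opcode list; the real regnode table is far larger. */
enum sketch_op {
    SK_END, SK_EXACT, SK_EXACTF, SK_EXACTFU, SK_ANYOF, SK_STAR,
    SK_OP_COUNT
};

/* Maps each opcode to the base opcode of its family. */
static const unsigned char sketch_kind[SK_OP_COUNT] = {
    SK_END,     /* SK_END     */
    SK_EXACT,   /* SK_EXACT   */
    SK_EXACT,   /* SK_EXACTF  */
    SK_EXACT,   /* SK_EXACTFU */
    SK_ANYOF,   /* SK_ANYOF   */
    SK_STAR     /* SK_STAR    */
};

/* One table lookup instead of a case label per EXACTish opcode. */
static int
sketch_is_exactish(enum sketch_op op)
{
    assert((unsigned)op < SK_OP_COUNT);
    return sketch_kind[op] == SK_EXACT;
}
```

With this layout a new EXACTish opcode needs only a table entry, not another case label at every call site.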
|
|
|
|
|
| |
This commit uses the new macros from the previous commit to simplify some
code.
|
|
|
|
|
|
| |
The previous commit made the opcodes for two regops adjacent, so that we
can refer to them by a single range. This commit takes advantage of
that change.
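A sketch of the range technique, with hypothetical opcode names since the message does not say which two regops are involved. The compile-time check documents the adjacency that the range test relies on.

```c
#include <assert.h>

/* Hypothetical opcodes; SK_STAR and SK_PLUS are deliberately adjacent. */
enum sketch_op { SK_BRANCH, SK_STAR, SK_PLUS, SK_CURLY };

/* Compile-time guard: the array size is negative, and compilation fails,
 * if someone reorders the enum so the two opcodes stop being adjacent. */
typedef char sketch_star_plus_adjacent[(SK_PLUS == SK_STAR + 1) ? 1 : -1];

static int
sketch_in_range(int v, int lo, int hi)
{
    assert(lo <= hi);
    return v >= lo && v <= hi;
}

/* One range test instead of two equality tests. */
static int
sketch_is_star_or_plus(enum sketch_op op)
{
    return sketch_in_range(op, SK_STAR, SK_PLUS);
}
```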
|
| |
|
|
|
|
| |
This is reserved for length-1 constructs.
|
|
|
|
|
|
| |
It's a bit clearer to test the 0 case before the 1 case, and by
doing so it becomes visually easier to compare and contrast the two
cases.
|
|
|
|
|
|
|
|
|
|
|
| |
This changes an error branch to be goto'd out of the mainline code. The
large chunk being in the middle obscured the commonality of the slightly
different non-error cases.
The branch is moved to the bottom of the routine, and croaks, so there
is no return.
This is a modification to a suggestion by Hugo van der Sanden.
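The restructuring can be sketched as follows. The function and its messages are hypothetical, and where the real code croaks (and so never returns), this sketch returns -1 so that it stays testable.

```c
#include <assert.h>

/* Describe a quantifier's lower bound. The error handling is goto'd out
 * of the mainline so the non-error cases sit side by side. */
static int
sketch_describe(int min, int max, const char **what)
{
    assert(what);

    if (min > max)
        goto bad_bounds;

    /* Mainline: the 0 case first, then the rest, easy to compare. */
    if (min == 0)
        *what = "may match empty";
    else
        *what = "needs at least one";
    return 0;

  bad_bounds:
    /* Out-of-line error branch at the bottom of the routine. */
    *what = "min exceeds max";
    return -1;
}
```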
|
|
|
|
|
| |
Change indentation to correspond with new blocks formed by the previous
commit
|
|
|
|
| |
This makes the code easier to understand, I think.
|
| |
|
|
|
|
|
| |
I think this makes the commonalities of the * and + quantifiers
clearer.
|
|
|
|
| |
There is a common place where these three occurrences can be placed.
|
|
|
|
|
|
| |
The name was misleading. There are other things being done here. And
previous restructuring led to a goto immediately prior to where it went
to.
|