| Commit message (Collapse) | Author | Age | Files | Lines |
| |
|
|
|
|
| |
The code confusing uses type and kind as synonyms. Lets end that bad habit
|
|
|
|
|
| |
Now that REGNODE_AFTER() can handle all cases it makes sense
to remove the dynamic() suffix.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It is really easy to get confused about the difference between
NEXTOPER() and regnext() of a regnode. The two concepts are related,
similar, but importantly distinct. NEXTOPER() is also defined in such a
way that it is easy to abuse and misunderstand and encourages producing
code that is fragile to larger change, effectively "baking in"
assumptions to the code that are difficult to discover by searching.
Changing the type and storage requirements of a regnode may break things
in subtle and hard to debug ways.
An example of how NEXTOPER() is problematic is that this:
NEXTOPER(NEXTOPER(branch)) does not mean "find the second node after the
branch node", it means "jump forward by a regnode which happens to be
two regnodes large". In other words NEXTOPER is just a fancy way of
writing "node+1".
This patch replaces NEXTOPER() with three new macros:
REGNODE_AFTER_dynamic(node)
REGNODE_AFTER_opcode(node,op)
REGNODE_AFTER_type(node,tregnode_OPNAME)
The first is the most generic case, it jumps forward by the size of the
node, and determines that size by consulting OP(node). The second is
where you have already extracted OP(node), and the third is where you
know the actual structure that you want to jump forward by. Every
regnode type has a corresponding type, which is known at compile time,
so using the third will produce the most efficient code. However in many
cases the code operates on one of several types, whose size may be the
same now, but may change in the future, in which case one of the other
forms is preferred. The run time logic in regexec.c should probably
only use the REGNODE_AFTER_type() interface.
Note that there is also a REGNODE_BEFORE() which replaces PREVOPER(),
which is used in a specific piece of legacy logic but should not be
used otherwise. It is not safe to go backwards from an arbitrary node,
we simply have no way to know how large the previous node is and thus
where it starts.
This patch includes some logic that validates assumptions during DEBUG
mode which should catch errors from resizing regnodes.
After this patch changing the size of an existing regnode should be
relatively safe and errors related to sizing should trigger assertion
fails.
This patch includes changes to perlreguts.pod to explain this stuff
better.
|
|
|
|
|
|
| |
This was the result of using the wrong case in a switch()
Thanks to @bram-perl for the idea of the test case reduction
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There is a codepath where we do not set up folder and fold_array, and
rely on later codepaths not triggering uninitialized access. Because of
the nature of the switch and etc, it is hard for an analyzer or even a
human to be 100% certain it is all set up right in all cases where it
is used.
Also the variable initialization inside of the switch is a bit awkward.
So this patch moves the declarations up, inits the vars to zero and then
asserts before we use any of these vars under DEBUGGING to ensure we
didn't forget.
Example warning:
re_exec.c:7121:44: warning: ‘fold_array’ may be used uninitialized in this function [-Wmaybe-uninitialized]
7121 | && UCHARAT(s) != fold_array[nextbyte])
This was noticed when building with:
./Configure -d -Doptimize=-g -Dusedevel -DDEBUGGING \
-Accflags='-fsanitize=address -fsanitize=undefined \
-ggdb3' -Aldflags='-Wl,--no-as-needed -lasan -lubsan' \
-Dcc=ccache\ gcc -Dld=gcc
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It turns out that any character class whose UTF-8 representation is two
bytes long, and where all elements share the same first byte can be
represented by a compact, fast regnode designed for the purpose.
This commit adds that regnode, ANYOFHbbm. ANYOFHb already exists for
classes where all elements have the same first byte, and this just
changes the two-byte ones to use a bitmap instead of an inversion list.
The advantages of this are that no conversion to code point is required
(the continuation byte is just looked up in the bitmap) and no inversion
list is needed. The inversion list would occupy more space, from 4 to
34 extra 64-bit words, plus an AV and SV, depending on what elements the
class matches.
Many characters in the Latin, Greek, Cyrillic, Greek, Hebrew, Arabic,
and several other (lesser-known) scripts are of this form.
It would be possible to extend this technique to larger bitmaps, but
this commit is a start.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A bracketed character class in a pattern is generally represented by
some form of ANYOF node, with matches of characters in the Latin1 range
handled by a bitmap, and an inversion list for higher code point
matches. But some patterns only have low matches, and some only high,
and some match everything that is high.
This commit refactors a little so that the distinction between nothing
high matches vs everything high matches is done through the same
technique. Previously one was indicated by a flag, and the other by a
special value in the node's structure. Now there are two special
values, and the flag is freed up for a potential future use. In the
past the meaning of the flags has had to be overloaded go accommodate
all the needs. freeing of a flag means
This all allows for some slight simplicfications.
|
|
|
|
| |
Fix up after previous commit
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In ANYOF nodes (generated for qr/[]/), there is a bitmap component, and
possibly a non-bitmap component. It turns out that a single flag can be
used to indicate the existence of the latter. When looked at this way,
the name of the flag becomes simpler, and incorporates the meaning of
another bit, which was previously shared with yet another meaning. Thus
that other meaning can become an unshared bit.
This allows for some simplification, and being able to handle the
uncommon Turkish locale with fewer main-line conditionals being executed
at runtime.
|
|
|
|
| |
No sense testing each iteration when it isn't going to change.
|
|
|
|
|
|
|
|
|
| |
These long names are designed to remind the coder that they have
multiple meanings. But move the reminder text to the end, as it
obscures the purposes.
And some have two halves for the separate meanings; change the names so
the halves are split by two underscores to visually emphasize this.
|
|
|
|
|
| |
Within this block we know that *p must be one of [Ii]. If it's not 'i',
it must be 'I' without needing to test
|
|
|
|
| |
It's more understandable
|
|
|
|
|
|
| |
C reserves leading underscores for system use; this commit fixes
_CHECK_AND_WARN_PROBLEMATIC_LOCALE to be
CHECK_AND_WARN_PROBLEMATIC_LOCALE_
|
|
|
|
|
|
| |
On some configurations we're getting comparison of signed vs unsigned.
This fixes GH #19915
|
|
|
|
|
| |
Solaris complains that these could be negative, though in practice they
aren't.
|
|
|
|
|
|
|
| |
This reverts commit d62feba66bf43f35d092bb026694f927e9f94d38.
As explained in its commit message. It adds some comments to point out
that the commit exists, for the curious.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Several of the POSIXA classes are a single range on ASCII platforms, and
[:digit:] is a single range on both ASCII and EBCDIC. This regnode was
designed to replace the POSIXA regnode for such classes to get a bit of
performance by not needing to do an array lookup. Instead it encodes
some bits in the flags field that with shifting and masking get the
right values for the single range's bounds for any such node.
However, performance tests conducted by Sergey Aleynikov showed this was
actually slower than what it intended to replace. Rather than
completely drop this work, I'm adding it to blead, and immediately
reverting it, so that should parts of it ever become useful, it would be
available.
A few tests fail; those are skipped for the purposes of this commit so
that it doesn't interfere with bisecting.
The code also isn't completely commented.
One could add a regnode for each posix class it was decided should have
the expected performance boost. But regnodes are a finite resource, and
the boost is probably not large enough to justify doing so.
|
|
|
|
|
| |
This reorders the conditionals in an 'if' where all must be true to
succeed, so the least likely one is done first, and adds some UNLIKELY
|
|
|
|
|
|
|
|
|
|
|
|
| |
I believe the '!!' is somewhat obscure; I for one didn't know about it
for years of programming C, and it was buggy in one compiler, which is why
cBOOL was created, I believe. And it is graphically dense, and
generally harder to read than the cBOOL() construct.
This commit dates from before we moved to C99 where we can simply cast
to (bool), and cBOOL() has been rewritten to do that. But the vast
majority of code uses cBOOL(), and this commit brings the remainder of
the core .[ch] files into uniformity.
|
|
|
|
|
| |
It seems clearer to me to have the panic at the end of the routine
instead of as the default: of a switch().
|
|
|
|
|
|
|
|
|
|
| |
These case statements in a switch all had the same prelude for checking
if the locale is UTF-8 and handling that case separately. A few commits
ago created macros closer to the base level. This commit factors out
the common UTF-8 handling, and then puts the lower lever things in the
switch(). Perhaps the C optimizer would have been smart enough to do
this too, but we might as well do it ourselves, now that it is
convenient.
|
|
|
|
|
|
|
| |
C reserves symbols beginning with underscores for its own use. This
commit moves the underscore so it is trailing, which is legal. The
symbols changed here are many of the ones in handy.h that have
significant uses outside it.
|
|
|
|
|
|
|
| |
C reserves symbols beginning with underscores for its own use. This
commit moves the underscore so it is trailing, which is legal. The
symbols changed here are most of the ones in handy.h that have few uses
outside it.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A long standing bug in Perl that has gone undetected is that the array
is global that is created when changing locales and tells fc() and qr//i
matching what the folds are in the new locale.
What this means is that any program only has one set of fold definitions
that apply to all threads within it, even if we claim that the locales
are thread-safe on the given platform. One possibility for this going
undetected so long is that no one is using locales on multi-threaded
systems much. Another possibility is that modern UTF-8 locales have the
same set of folds as any other one.
It is a simple matter to make the fold array per-thread instead of
per-process, and that solves the problem transparently to other code.
I discovered this stress-testing locale handling under threads. That
test will be added in a future commit.
In order to keep from having a dTHX inside foldEQ_locale, it has to have
a pTHX_ parameter. This means that the other functions that function
pointer variables get assigned to point to have to have an identical
signature, which means adding pTHX_ to functions that don't require it.
The bodies of all these are known to the compiler, since they are all
in inline.h or in the same .c file as where they are called. Hence the
compiler can optimize out the unused parameter.
Two calls of STR_WITH_LEN also have to be changed because of C
preprocessor limitations; perhaps there is another way to do it that I'm
unfamiliar with.
|
|
|
|
|
|
| |
Perl defaults to the bitmap for ANYOF nodes being for the lowest 256
characters, but it is possible to compile the bitmap to be up to size
2**16. Doing so, prior to this commit, broke Turkish locale handling.
|
|
|
|
|
|
|
| |
The regnodes in the ANYOFH series by definition don't have a bitmap used
in the other ANYOF-type nodes. Instead they have an inversion list. We
can avoid the overhead of calling the general function that looks first
in the bitmap.
|
|
|
|
| |
Easier to read
|
|
|
|
|
| |
The newer name is not misleading; the old name has been retained only
for backcompat; this removes its final use in the core.
|
|
|
|
|
|
|
|
| |
This names a flag bit whose meaning depends on context. The previous
name contained both meanings, and was simply too long to be
comprehensible. This commit splits it into two names, with the suffix
'_shared'. It's not necessary to know precisely what the other meaning
is when reading the code, only that it has another meaning.
|
|
|
|
| |
The old name had a negative in it, and its easier to grok positives
|
|
|
|
| |
The results of this are no longer used
|
|
|
|
|
| |
I fixed this locally as a result of Hugo van der Sanden's comment in
#19636, but neglected to commit it.
|
|
|
|
|
|
| |
There is no recursion to exceed limits of.
This fixes #8369
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
GH Issue #19484 reported that
print "ABDE" =~ /(A (A|B(*ACCEPT)|C)+ D)(E)/x ? "yes: <$1-$2>" : "no";
does not output the expected 'AB-B', and instead does not match.
Removing the + quantifier behaves as expected.
This patch is 4/4 of the patches to fix this problem: SUCCEED and
LOOKBEHIND_END regops are type 'END' which have a next_off of 0. This
was causing regnext() to return null inside of the loop iterator for the
logic in ACCEPT which closes any open capture buffers thus terminating
the loop prematurely and preventing some of the capture buffers from
being properly closed. SUCCEED is used to end a variety of structures,
including lookahead IFMATCH and UNLESSM, SUSPEND, and CURLYM, and
LOOKBEHIND_END serves the same purpose for lookbehind IFMATCH and
UNLESSM. Thus this patch fixes the original bug but also fixes a variety
of other cases involving ACCEPT.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
GH Issue #19484 reported that
print "ABDE" =~ /(A (A|B(*ACCEPT)|C)+ D)(E)/x ? "yes: <$1-$2>" : "no";
does not output the expected 'AB-B', and instead does not match.
Removing the + quantifier behaves as expected.
This patch is 3/4 and fixes part of the problem: the CURLYM optimization
was not terminating its loop properly when it contained an ACCEPT. This
patch adds a new variable 'is_accepted' which is used to ensure that the
CURLYM optimization stops after an ACCEPT regop is executed.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
GH Issue #19484 reported that
print "ABDE" =~ /(A (A|B(*ACCEPT)|C)+ D)(E)/x ? "yes: <$1-$2>" : "no";
does not output the expected 'AB-B', and instead does not match.
Removing the + quantifier behaves as expected.
This patch is 2/4 and fixes part of the problem: lastopen was not
being set properly inside of the CURLYM optimization. lastopen is used
by the ACCEPT logic to know which parens need to be closed.
|
|
|
|
|
|
| |
After 271c3af797, early bailout from the inner one of a pair of nested
lookbehinds would leave the desired match_end pointing at the wrong
place, so the outer lookbehind could give the wrong answer.
|
|
|
|
| |
Failing to check for max iterations caused an assertion failure.
|
|
|
|
|
|
|
|
|
|
|
|
| |
When a function outside of sv.c creates a SV via newSV(0):
* There is a call to Perl_newSV
* A SV head is uprooted and its flags set
* A runtime check is made to effectively see if 0 > 0
* The new SV* is returned
Replacing newSV(0) with newSV_type(SVt_NULL) should be more efficient,
because (assuming there are SV heads to uproot), the only step is:
* A SV head is uprooted and its flags set
|
|
|
|
|
|
|
|
|
|
|
|
| |
We were not validating that when (?<=a|ab) matched that the right hand
side of the match lined up with the position of the assertion. Similar
for (?<!a|ab) and related patterns, eg, (*positive_lookbehind:).
Note these problems do NOT affect lookahead.
Part of the difficulty here was that the SUCCEED node was serving too
many purposes, necessitating a new regop LOOKBEHIND_END.
Includes more tests for various lookahead or lookbehind cases.
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
Many of the cases in the main switch() statement in regrepeat() begin
with
if (utf8_target)
or similar. These branches can be saved, and the code easier to
understand by including the utf8ness in the case statements, as this
commit does; taking advantage of the changes in the previous commit.
|
| |
|
|
|
|
| |
This is in preparation for a somewhat different use to be added.
|
|
|
|
|
|
| |
Before this commit, the code looped through a bitmap looking for a set
bit. Now that we have a fast way to find where a set bit is, use it,
and avoid the fruitless iterations.
|