| Commit message | Author | Age | Files | Lines |
| |
|
| |
|
|
|
|
|
|
|
| |
By definition a regex pattern that is in UTF-8 uses Unicode matching
rules, and EXACTF is non-Unicode (unless the target string is UTF-8).
Therefore an EXACTF node will never be generated for a UTF-8 pattern,
and there is no need to test for it being so.
|
|
|
|
|
| |
We don't have to convert from utf8 to code point to fold; instead we
can call the function that starts from utf8
|
|
|
|
|
|
|
|
| |
This revises commit e067297c376fbbb5a0dc8428c65d922f11e1f4c6
slightly so that we round up to get the search stopping point.
We aren't matching partial characters, so if we were to match 3+1/3
characters, we really have to match 4 characters.
|
| |
|
| |
|
|
|
|
| |
The latter is the Perl standard way of making this declaration
|
|
|
|
|
|
| |
The root cause of this bug is that the code assumed a string was in
utf8 when it wasn't, and so treated a byte as a starter byte when it
wasn't one, skipping ahead based on that supposed starter byte.
|
|
|
|
| |
This extends the recent commits so that regrepeat() also avoids re-folding.
|
|
|
|
|
| |
A recent commit allowed regexec.c to stop recalculating the folds in
one circumstance. This one adds the same case in regmatch().
|
|
|
|
|
|
|
|
|
|
|
| |
If you watch an execution trace of regexec /i, often you will see it
folding the same thing over and over as it backtracks or searches
ahead. regcomp.c has now been changed to always fold UTF-8 encoded
EXACTF and EXACTFU nodes. This allows these to not be re-folded each
time.
This commit does it just for find_byclass(). Other commits will expand
this technique to other cases.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is a partial reversion of commit
7c1b9f38fcbfdb3a9e1766e02bcb991d1a5452d9
which went unnecessarily far in fixing the problem.
After studying the situation some more, I see more clearly what was
going on. The point is that if you have only 2 characters left in the
string, but the pattern requires 3 to work, it's guaranteed to fail, so
pointless, and unnecessary work, to try. So don't begin a match trial
at a position when there are fewer than the minimum number of characters
necessary. That is what the code before that commit did. However it
neglected the fact that it is possible for a single character to match
multiple ones, so there is not a 1:1 ratio. This new commit assumes the
worst possible ratio to calculate how far into a string is the furthest
a successful match could start. In most cases this will still look
too far, but it is much better than always going up to the final
character, as the previous patch did.
The maximum ratio is guaranteed by Unicode to be 3:1, but when the
target isn't in UTF-8, the max is 2:1, determined simply by inspection
of the defined folds. And actually, currently, the single case where it
isn't 1:1 doesn't come up here, because regcomp.c guarantees that that
match doesn't generate one of these EXACTFish nodes. However, I expect
that to change for 5.16, and so am preparing for that case by making it
2:1.
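The cutoff arithmetic described above can be sketched as follows; this is an illustrative stand-alone helper, not the code regexec.c actually uses, and the function names are hypothetical:

```c
#include <assert.h>

/* Hypothetical sketch: how many target characters a match needs at
 * minimum, given that a single target character may fold to at most
 * max_ratio pattern characters (3 for UTF-8 targets, 2 otherwise).
 * Round up, because partial characters cannot match: needing to match
 * "3 + 1/3" characters really means matching 4. */
static long
min_target_chars(long min_pattern_chars, int max_ratio)
{
    /* ceiling division */
    return (min_pattern_chars + max_ratio - 1) / max_ratio;
}

/* The furthest character offset into the target at which a successful
 * match could still begin; trial starts past this point are guaranteed
 * to fail, so the engine need not try them. */
static long
latest_start(long target_chars, long min_pattern_chars, int max_ratio)
{
    return target_chars - min_target_chars(min_pattern_chars, max_ratio);
}
```

With a 10-character target and a pattern that must match 10 characters, a 3:1 ratio means only 4 target characters might suffice, so trial positions up to offset 6 must still be attempted.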
|
| |
|
|
|
|
|
|
|
| |
The structure of this code is that initial setup is done and then gotos
or fall-through used to join for the main logic. This commit just moves
a block, without logic changes, so that the more common case has a
fall-through instead of a goto.
|
|
|
|
|
|
|
|
|
|
|
| |
Only the first character of the string was being checked when scanning
for the beginning position of the pattern match.
This was so wrong, it looks like it has to be a regression. I
experimented a little and did not find any. I believe (but am not
certain) that a multi-char fold has to be involved. The handling of
these was so broken before 5.14 that there very well may not be a
regression.
|
| |
|
|
|
|
|
|
|
|
| |
When a swash is loaded, generally it is checked for sanity with an
assert(). The strings used are hard-coded utf8 strings, which will be
different in EBCDIC, and hence will fail. I haven't figured out a
simple way to get compile-time utf8 vs utfebcdic strings, but we can
just skip the check in EBCDIC builds
|
|
|
|
|
| |
This makes sure, before a segfault can happen, that the is_() functions
actually have the side effect that this code expects.
|
|
|
|
|
|
| |
The HORIZWS and similar regexp ops didn't check that the end of the string
had been reached; therefore they would blithely compare against the \0 at
the end of the string, or beyond.
|
|
|
|
|
| |
This was due to my failure to realize that this 'if' needed to
be updated when the /aa modifier was added.
|
|
|
|
|
| |
This was due to my oversight in not fixing this switch statement
to accommodate /aa when it was added.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
(?{...}) deliberately doesn't introduce a new scope (so that the effects of
local() can accumulate across multiple calls to the code). This also means
that the SAVEt_CLEARSVs pushed onto the save stack by lexical declarations
(i.e. (?{ my $x; ... })) also accumulate, and are only processed en masse at
the end, on exit from the regex. Currently they are usually processed in
the wrong pad (the caller of the pattern, rather than the pads of the
individual code block(s)), leading to random misbehaviour and SEGVs.
Hence the long-standing advice to avoid lexical declarations within
re_evals.
We fix this by wrapping a pair of SAVECOMPPADs around each call to a code
block. Eventually the save stack will be a long accumulation of
SAVEt_CLEARSVs interspersed with SAVEt_COMPPADs, which when popped
en masse should unwind in the right order with the right pad at the right
time.
The price to pay for this is two extra additions to the save stack (which
accumulate) for each code call.
A few TODO tests in reg_eval_scope.t now pass, so I'm probably doing the
right thing ;-)
|
| |
|
|
|
|
|
|
|
| |
The assumption is that most studied strings are fairly short, hence the pain
of the extra code is worth it, given the memory savings.
80 character string, 336 bytes as U8, down from 1344 as U32
800 character string, 2112 bytes as U16, down from 4224 as U32
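The savings can be checked with a little arithmetic: the study data needs one "first occurrence" slot per possible octet (256) plus one "next occurrence" slot per character of the string, so narrowing the slot type shrinks both tables. A hypothetical sketch of the size calculation (not the perl source; the widest value of each narrow type is reserved for the ~0 sentinel, hence the strict comparisons):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes used by the study tables for a string of
 * `len` octets, choosing the narrowest unsigned type that can hold all
 * positions plus the all-ones "no more" sentinel. */
static size_t
study_bytes(size_t len)
{
    size_t slots = 256 + len;   /* 256 first-occurrence + len next-occurrence */
    size_t width = len < 0xFF   ? sizeof(uint8_t)
                 : len < 0xFFFF ? sizeof(uint16_t)
                 :                sizeof(uint32_t);
    return slots * width;
}
```

For an 80-character string this gives (256 + 80) * 1 = 336 bytes as U8, versus (256 + 80) * 4 = 1344 as U32; for 800 characters, (256 + 800) * 2 = 2112 as U16, versus 4224 as U32 — matching the figures above.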
|
|
|
|
| |
The "no more" condition is now represented as ~0, instead of -1.
|
|
|
|
|
| |
This allows more than one C<study> to be active at the same time.
It eliminates PL_screamfirst, PL_lastscream, PL_maxscream.
|
|
|
|
|
|
|
|
|
|
| |
PL_screamnext gives the position of the next occurrence of the current octet.
Previously it stored this as an offset from the current position, with -pos
stored for "no more", so that the calculated new offset would be zero,
allowing a zero/non-zero loop exit test in Perl_screaminstr().
Now it stores absolute position, with -1 for "no more". Also codify -1 as the
"not present" value for PL_screamfirst, instead of any negative value.
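The scheme can be sketched in isolation as follows; this is an illustrative reimplementation under the representation described above (absolute positions, -1 sentinel), not the actual perl code, and the names are hypothetical:

```c
#include <string.h>

#define SCREAM_NONE (-1)  /* shared "not present" / "no more" sentinel */

/* Hypothetical sketch: first[b] is the absolute position of the first
 * occurrence of octet b in s (or -1 if absent); next[i] is the absolute
 * position of the next occurrence of the octet found at position i
 * (or -1 if there is no later occurrence). */
static void
build_scream(const char *s, int len, int first[256], int *next)
{
    int last[256];              /* most recent position of each octet */
    int i, b;
    for (b = 0; b < 256; b++)
        first[b] = last[b] = SCREAM_NONE;
    for (i = 0; i < len; i++) {
        b = (unsigned char)s[i];
        next[i] = SCREAM_NONE;
        if (first[b] == SCREAM_NONE)
            first[b] = i;       /* first time we've seen this octet */
        else
            next[last[b]] = i;  /* chain from its previous occurrence */
        last[b] = i;
    }
}
```

Walking the occurrences of an octet is then `for (i = first[b]; i != SCREAM_NONE; i = next[i])`, with a single sentinel test rather than the old zero/non-zero offset trick.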
|
|
|
|
|
|
|
|
| |
Callers to the engine set REXEC_SCREAM in the flags when the target scalar is
studied, and the engine should use the study data. It's possible for embedded
code blocks to cause the target scalar to stop being studied. Hence the engine
needs to check for this, instead of simply assuming that the study data is
present and valid to read. This resolves #92696.
|
|
|
|
|
| |
regcomp.c has been changed, so the case that this handled no longer
comes up.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
| |
In '"s\N{U+DF}" =~ /\x{00DF}/i', the LHS folds to 'sss', the RHS to 'ss'.
The bug occurs when the RHS tries to match the first two 's's, but that
splits the LHS \xDF character, which Perl doesn't know how to handle,
and the assertion got triggered. (This is similar to [perl #72998].)
The solution adopted here is to disallow a partial character match,
as #72998 did as well.
|
|
|
|
|
|
|
|
|
|
| |
A missing '!' turned \W into \w in some code execution paths with utf8 data.
This patch fixes that.
It does not include tests at the moment, since I don't have time
just now to examine why the existing tests didn't catch this, when
it looks like they are set up to, and there have been several BBC tickets
lately that I'm hopeful this may fix and head off other ones.
|
| |
|
|
|
|
| |
The trickiness has been resolved elsewhere
|
| |
|
|
|
|
|
|
| |
The comment said that there was no use doing this if lenp was NULL,
but there is, as it sees whether there is a match or not and sets the
appropriate variable.
|
| |
|
|
|
|
|
|
|
| |
The algorithm for mapping multi-char fold matches back to the source in
processing ANYOF nodes was defective. This caused the regex engine to
hang on certain character combinations. I've also added an assert to
stop instead of looping.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When we are doing a CURLYX/WHILEM loop and the min iterations is
larger than zero we were not saving the buffer state before each
iteration. This meant that partial matches would end up with strange
buffer pointers, with the start *after* the end point.
In more detail WHILEM has four exits, three of which as far as I could
tell would do a regcppush/regcppop in their state transitions, only one,
WHILEM_A_pre which is entered when (n < min) would not. And it is this state
that we repeatedly enter when performing A the min number of times.
When I made the logic similar to the handling of ( n < max ), the bug
went away, and as far as I can tell nothing else broke.
Review by Dave Mitchell required before release.
|
|
|
|
|
| |
Recent simplification of this code left it to be the equivalent
of an existing macro
|
|
|
|
|
|
|
| |
There have been various segfaults apparently due to trying to access
the swash (and allies) portion of an ANYOF which doesn't have that.
This doesn't show up on all platforms. The assert() should detect
this and help debugging
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit ac51e94be5daabecdeb0ed734f3ccc059b7b77e3 didn't
do what it purported, because it omitted parentheses that
were necessary to change the natural precedence. It's strange that
it passed all tests on my machine, and failed so miserably elsewhere
that it was quickly reverted by commit
63c0bfd59562325fed2ac5a90088ed40960ac2ad.
This reinstates it with the correct precedence. The next commit
will add an assert() so that the underlying issue will be detected
on all platforms
|
|
|
|
|
|
|
| |
This reverts commit ac51e94be5daabecdeb0ed734f3ccc059b7b77e3.
This commit made many of the re/*.t tests fail, on my build at least.
Haven't looked at why, just reverting it for the moment.
|
|
|
|
|
|
|
|
|
|
|
|
| |
ANYOF_NONBITMAP is supposed to be set iff there is something outside
the bitmap to try matching in an ANYOF node. Due to slight changes in
the meaning of this, the code has been trying to access this
if ANYOF_NONBITMAP_NON_UTF8 is set without ANYOF_NONBITMAP being set,
which means it was trying to access something that doesn't exist.
I'm hopeful, based on a stack trace sent to me, that this is the cause
of [perl #85478], but can't reproduce that easily. But the logic
is clearly wrong.
|
|
|
|
|
|
|
|
|
| |
Now that regexes can combine different charset modifiers, a synthetic
start class can match both locale and non-locale things. Locale matching
should generally match only things in the bitmap for code points < 256,
but a synthetic start class with a non-locale component can match such
code points. This patch makes an exception for synthetic nodes, which
will be resolved when, if the start class passes, the match is done
again for real.
|
|
|
|
|
| |
This code was retained for a while until it was clear that the replacement
code worked.
|
|
|
|
|
|
|
| |
This code has been rendered obsolete in 5.14 by using a different
mechanism altogether. This functionality is now provided at run-time,
user-selectable, via the /u and /d regex modifiers. This code was
for compile-time selection of which to use.
|
|
|
|
|
| |
The code dealing with the sharp ss is now handled by the ANYOFV node,
and shouldn't appear here.
|