| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When we are doing a CURLYX/WHILEM loop and the min iterations is
larger than zero we were not saving the buffer state before each
iteration. This meant that partial matches would end up with strange
buffer pointers, with the start *after* the end point.
In more detail, WHILEM has four exits. As far as I could tell, three of
them do a regcppush/regcppop in their state transitions; only one,
WHILEM_A_pre, which is entered when (n < min), did not. And it is this state
that we repeatedly enter when performing A the min number of times.
When I made the logic similar to the handling of ( n < max ), the bug
went away, and as far as I can tell nothing else broke.
Review by Dave Mitchell required before release.
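The save-and-restore idea described above can be sketched in isolation. This is a minimal illustration only, not the regexec.c code: the `captures` struct, `try_iteration` and `guarded_iteration` are hypothetical names, and a plain struct copy stands in for what regcppush/regcppop do in the real engine.

```c
#include <assert.h>

#define NGROUPS 2

/* Hypothetical capture state: one start/end offset pair per group. */
typedef struct {
    int start[NGROUPS];
    int end[NGROUPS];
} captures;

/* Hypothetical stand-in for one failed iteration of A: it moves the start
 * of group 1 forward, then fails before it can set the matching end. */
static int try_iteration(captures *c, int pos)
{
    c->start[1] = pos;
    return 0; /* the iteration fails */
}

/* The same iteration, with the state saved first and restored on failure. */
static int guarded_iteration(captures *c, int pos)
{
    captures saved = *c;   /* plays the role of a "push" */
    if (try_iteration(c, pos))
        return 1;
    *c = saved;            /* plays the role of a "pop" */
    return 0;
}
```

Calling `try_iteration` directly on a state whose group 1 spans 2..5 leaves a start offset after the end offset, which is the symptom described above; `guarded_iteration` leaves the state as it found it.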
|
|
|
|
|
| |
Recent simplification of this code left it to be the equivalent
of an existing macro
|
|
|
|
|
|
|
| |
There have been various segfaults apparently due to trying to access
the swash (and allies) portion of an ANYOF node that doesn't have one.
This doesn't show up on all platforms. The assert() should detect
this and help debugging.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit ac51e94be5daabecdeb0ed734f3ccc059b7b77e3 didn't
do what it purported, because it omitted parentheses that
were necessary to change the natural precedence. It's strange that
it passed all tests on my machine, and failed so miserably elsewhere
that it was quickly reverted by commit
63c0bfd59562325fed2ac5a90088ed40960ac2ad.
This reinstates it with the correct precedence. The next commit
will add an assert() so that the underlying issue will be detected
on all platforms.
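The kind of precedence slip described above is easy to reproduce in a few lines. The expression in the original commit is not shown here, so this is only a hypothetical illustration with made-up flag names: in C, unary `!` binds tighter than binary `&`, so dropping the parentheses changes what is tested.

```c
#include <assert.h>

/* Hypothetical flag bits, for illustration only. */
#define FLAG_HAS_EXTRA 0x01
#define FLAG_OTHER     0x02

/* Intended test: "the HAS_EXTRA bit is not set". */
static int lacks_extra_correct(unsigned flags)
{
    return !(flags & FLAG_HAS_EXTRA);
}

/* How the compiler parses `!flags & FLAG_HAS_EXTRA`: the negation is
 * applied to the whole flags word first, then masked. */
static int lacks_extra_wrong(unsigned flags)
{
    return (!flags) & FLAG_HAS_EXTRA;
}
```

The two functions agree when `flags` is zero, which is one way such a bug can pass tests on one build and fail on another: they only diverge once some unrelated bit is set.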
|
|
|
|
|
|
|
| |
This reverts commit ac51e94be5daabecdeb0ed734f3ccc059b7b77e3.
This commit made many of the re/*.t tests fail, on my build at least.
Haven't looked at why, just reverting it for the moment.
|
|
|
|
|
|
|
|
|
|
|
|
| |
ANYOF_NONBITMAP is supposed to be set iff there is something outside
the bitmap to try matching in an ANYOF node. Due to slight changes in
the meaning of this, the code has been trying to access this
if ANYOF_NONBITMAP_NON_UTF8 is set without ANYOF_NONBITMAP being set,
which means it was trying to access something that doesn't exist.
I'm hopeful, based on a stack trace sent to me that this is the cause
of [perl #85478], but can't reproduce that easily. But the logic
is clearly wrong.
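The rule described above (only touch the out-of-bitmap data when the flag that announces it is set) can be sketched as follows. The struct, the flag names and the function are hypothetical stand-ins, not the real ANYOF node layout; the assert() mirrors the debugging aid mentioned elsewhere in this log.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flags modelled on the description above. */
#define NODE_HAS_EXTRA      0x01 /* something exists outside the bitmap */
#define NODE_EXTRA_NON_UTF8 0x02 /* it also applies to non-utf8 targets */

typedef struct {
    unsigned flags;
    const unsigned *extra; /* valid only when NODE_HAS_EXTRA is set */
    size_t n_extra;
} class_node;

static int matches_outside_bitmap(const class_node *n, unsigned cp,
                                  int target_is_utf8)
{
    size_t i;

    /* Check the "exists" flag first; never look at n->extra without it. */
    if (!(n->flags & NODE_HAS_EXTRA))
        return 0;
    if (!target_is_utf8 && !(n->flags & NODE_EXTRA_NON_UTF8))
        return 0;

    assert(n->extra != NULL); /* catches an inconsistent node on any platform */
    for (i = 0; i < n->n_extra; i++)
        if (n->extra[i] == cp)
            return 1;
    return 0;
}
```

A node with only `NODE_EXTRA_NON_UTF8` set and no extra data returns 0 here instead of dereferencing a pointer that was never filled in.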
|
|
|
|
|
|
|
|
|
| |
Now that regexes can be combinations of different charset modifiers,
a synthetic start class can match both locale and non-locale. Locale
should generally match only things in the bitmap for code points < 256,
but a synthetic start class with a non-locale component can match such
code points. This patch makes an exception for synthetic nodes; if the
start class passes, the outcome is resolved when the pattern is matched
again for real.
|
|
|
|
|
| |
This code was retained for a while until it was clear that the replacement
code worked.
|
|
|
|
|
|
|
| |
This code has been rendered obsolete in 5.14 by using a different
mechanism altogether. This functionality is now provided at run-time,
user-selectable, via the /u and /d regex modifiers. This code was
for compile-time selection of which to use.
|
|
|
|
|
| |
The code dealing with the sharp ss is now handled by the ANYOFV node,
and shouldn't appear here.
|
|
|
|
|
| |
The bounds of this array were being exceeded causing smoke failures on
netbsd
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is the foundation for fixing the regression RT #82610. My analysis
was wrong that two bits could be shared, at least not without further
work. This commit switches to a different mechanism for passing the needed
information to regexec.c so that another bit can be freed up and, in a
later commit, the two bits can become unshared again.
The bit that is freed up is ANYOF_UTF8, which basically said there is
something that is matched outside the ANYOF bitmap, and requires the
target string to be in utf8. This changes things so the existence of
something besides the bitmap indicates this, and so no flag is needed.
The flag bit ANYOF_NONBITMAP_NON_UTF8 remains to indicate that there is
something that should be matched outside the bitmap even if the target
string isn't in utf8.
|
|
|
|
|
| |
Locale rules were handled improperly for utf8-encoded strings in
bracketed character classes under locale. This fixes that.
|
|
|
|
|
|
|
| |
As explained in the doc changes of this patch, under /l, caseless
matching of code points less than 256 now uses locale rules regardless
of the utf8ness of the pattern or string. It now matches the behavior
of things like \w in this regard.
|
| |
|
| |
|
| |
|
| |
|
|
|
|
| |
Tests for \N{} with this option will be added later.
|
|
|
|
|
|
| |
This code handled some cases of the LATIN SMALL LETTER SHARP S at
the beginning of a back ref, but not in the middle. To do it easily,
just call the function that handles our full Unicode folding.
|
|
|
|
| |
A recent commit #ifdef'd this out
|
| |
|
|
|
|
| |
A recent commit stopped calling this
|
| |
|
|
|
|
| |
This code is way out-of-date, using upper and lower case instead of fold-case.
|
| |
|
| |
|
| |
|
|
|
|
|
| |
This converts one case where ANYOFV is now usable to allow it to match
more than one character.
|
|
|
|
|
| |
This converts one case where ANYOFV is now usable to allow it to match
more than one character.
|
| |
|
|
|
|
|
|
|
|
|
|
| |
This is for security as well as performance. It allows Unicode properties to
not be matched case sensitively. As a result the swash inversion hash is
converted from having utf8 keys to having numeric code point keys.
It also, for the first time, fixes the bug where /i doesn't work for a code
point that is not at the end of a range in a bracketed character class and
has a multi-character fold.
|
|
|
|
| |
This is so future coders won't be tempted to rely on them.
|
|
|
|
|
| |
It is safer and clearer to have the break statement in each case statement at
the source level.
|
|
|
|
|
| |
This showed up only on some systems in the current test suite, but processing
e.g. \D has to care about the target string being utf8.
|
|
|
|
|
|
|
| |
Commit cfaf538b6276c6a8ef80ff6c66e106c6a4f1caaa introduced changes that cause
this to not compile on Windows. The compiler there, unlike gcc, does not
accept empty macro parameters. This just creates a placeholder macro that
expands to nothing to give the preprocessor something to grab onto.
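A minimal sketch of the placeholder technique, with hypothetical macro names (`CHECK` and the two wrappers are invented for illustration and are not the macros touched by the commit): instead of leaving a macro argument empty, pass a macro that expands to nothing.

```c
#include <assert.h>

/* Expands to nothing; gives the preprocessor a non-empty argument. */
#define PLACEHOLDER

/* Hypothetical macro whose first parameter is an optional prefix operator. */
#define CHECK(prefix_op, value, limit) (prefix_op ((value) < (limit)))

static int below(int v, int limit)
{
    /* Writing CHECK(, v, limit) is what some compilers reject. */
    return CHECK(PLACEHOLDER, v, limit);
}

static int not_below(int v, int limit)
{
    return CHECK(!, v, limit);
}
```

Because macro arguments are expanded before substitution, `CHECK(PLACEHOLDER, v, limit)` ends up as the bare comparison, exactly as an empty argument would where empty arguments are accepted.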
|
|
|
|
|
| |
This restricts certain constructs, like \w, to matching in the ASCII range
only.
|
|
|
|
|
| |
This refactors one area in regexec.c to use BOUNDU, NBOUNDU for
efficiency, and to make adding the future BOUNDA easier.
|
|
|
|
| |
These functions already return a boolean.
|
|
|
|
|
|
| |
This makes the equivalent code in BOUND and NBOUND identical so it can
be factored out, and makes optimization easier, as the FLAGS field is
already required in the vicinity.
|
| |
|
| |
|
|
|
|
| |
The function is supposed to take a bool.
|
|
|
|
| |
These are refactored to be more compact, and I think clearer.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch converts the Unicode semantics of \s, \w and their complements
to use separate nodes instead of the flags field of their nodes.
nodes. This gains some efficiency, especially useful in tight loops and
backtracking of regexec.c, and prepares the way for easily adding other
semantic variations, such as /a.
It refactors the CCC_TRY... macros. I tried to break this piece up into
smaller chunks, but found it much easier to get to this in one step.
Further patches will do some more refactoring of these.
As part of the CCC_TRY macro refactoring, the lines that include the
test if (! nextchr) are changed to just look for the end-of-string by
position instead of it being NUL. In locales, it could be (however
unlikely) that NUL is a real alphabetic, digit, or space character.
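The difference between the two end-of-string tests can be shown with a small sketch. The classifier below is a hypothetical stand-in that deliberately treats byte 0 as a class member, to model the unlikely locale mentioned above; none of these names come from regexec.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classifier in which NUL is a member of the class. */
static int in_class(unsigned char c)
{
    return c == 0 || (c >= 'a' && c <= 'z');
}

/* Old style: a NUL byte is taken to mean "end of string". */
static size_t span_nul_terminated(const unsigned char *s,
                                  const unsigned char *end)
{
    const unsigned char *p = s;
    while (p < end && *p && in_class(*p))
        p++;
    return (size_t)(p - s);
}

/* New style: only the position decides where the string ends. */
static size_t span_by_position(const unsigned char *s,
                               const unsigned char *end)
{
    const unsigned char *p = s;
    while (p < end && in_class(*p))
        p++;
    return (size_t)(p - s);
}
```

On the five bytes `'a','b',0,'c','!'` the NUL-based scan stops after two characters, while the position-based scan continues through the embedded NUL and stops only at `'!'`.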
|
|
|
|
|
| |
Replace two instances of code that is the same as that given by an already
existing macro.
|
|
|
|
|
|
|
|
|
| |
The FLAGS fields of certain regnodes were encoded with USE_UNI if
unicode semantics were to be used. This patch instead sets them to the
character set used, one of the possibilities of which is to use unicode
semantics. This shortens the code somewhat, and always puts the
character set in the flags field, which can allow use of switch
statements on it for efficiency, especially as new values are added.
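As a rough sketch of why storing the character set itself helps, the function below switches on it directly. The enum, its member names and the helpers are hypothetical, the sketch only covers code points below 256, and the "depends" rule (Unicode semantics only when the target is utf8) is an assumption drawn from how /d is described elsewhere in this log.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical character-set values that a FLAGS field might hold. */
typedef enum { CHARSET_DEPENDS, CHARSET_LOCALE, CHARSET_UNICODE } charset;

static int ascii_word(unsigned cp)
{
    return (cp >= '0' && cp <= '9') || (cp >= 'A' && cp <= 'Z')
        || (cp >= 'a' && cp <= 'z') || cp == '_';
}

/* Word characters in the Latin-1 upper half under Unicode rules. */
static int latin1_word(unsigned cp)
{
    return cp == 0xAA || cp == 0xB5 || cp == 0xBA
        || (cp >= 0xC0 && cp <= 0xD6) || (cp >= 0xD8 && cp <= 0xF6)
        || (cp >= 0xF8 && cp <= 0xFF);
}

/* Sketch only: handles code points below 256. */
static int is_word_char(unsigned cp, charset cs, int target_is_utf8)
{
    if (cp > 0xFF)
        return 0;
    switch (cs) {
    case CHARSET_UNICODE:
        return ascii_word(cp) || latin1_word(cp);
    case CHARSET_LOCALE:
        return isalnum((int)cp) || cp == '_';
    case CHARSET_DEPENDS:
        return ascii_word(cp) || (target_is_utf8 && latin1_word(cp));
    }
    return 0;
}
```

Adding a further character set then means adding one enum value and one case, rather than another boolean test threaded through the existing code.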
|
|
|
|
|
|
|
|
|
|
|
|
| |
This bug stemmed from Latin1 characters not matching any (non-complemented)
character class in /d semantics when the target string is not utf8, but having
Unicode semantics when it is. The solution here is to add a special flag.
There were several tests that relied on the broken behavior, specifically they
tested that \xff isn't a printable word character even in utf8. I changed the
deparse test to instead use a non-printable code point, and I changed the ones
in re_tests to be TODOs, and will change them back using /a once that is
added shortly.
|
| |
|
| |
|