| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
| |
It has been unused since 28d8d7f41ab202dd restructured the regexp dup code.
|
|
|
|
|
|
| |
gcc 4.6.0 warns about variables that are set but never read, and unless
RE_TRACK_PATTERN_OFFSETS is defined, parse_start is never read. So avoid
declaring or setting it if it's not actually going to be used later.
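
As an illustration of the pattern described above (not the actual regcomp.c change), here is a minimal sketch in which a variable is declared and set only when the option that reads it is defined. DEMO_TRACK_OFFSETS, demo_last_start and demo_skip_digits are hypothetical names standing in for the real ones.

```c
#include <assert.h>
#include <stddef.h>

/* DEMO_TRACK_OFFSETS is a hypothetical stand-in for a build option such as
 * RE_TRACK_PATTERN_OFFSETS.  Leave it undefined for the "no tracking" build. */
/* #define DEMO_TRACK_OFFSETS */

#ifdef DEMO_TRACK_OFFSETS
static size_t demo_last_start;      /* offset recorded by the tracking build */
#endif

/* Skips ASCII digits in buf starting at pos and returns the new position. */
static size_t demo_skip_digits(const char *buf, size_t pos)
{
#ifdef DEMO_TRACK_OFFSETS
    const size_t parse_start = pos; /* declared only when it will be read */
#endif

    assert(buf != NULL);
    while (buf[pos] >= '0' && buf[pos] <= '9')
        pos++;

#ifdef DEMO_TRACK_OFFSETS
    demo_last_start = parse_start;  /* the only reader of parse_start */
#endif
    return pos;
}
```

With the option undefined, the compiler never sees parse_start, so there is nothing for a set-but-unused warning to report.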
|
| |
|
|
|
|
|
| |
A previous commit added an 'if' around this code. This now indents
the block properly.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch causes inverted [bracketed] character classes to not handle
multi-character folds. The reason is that these can lead to very
counter-intuitive results (see bug discussion).
In an inverted character class, only single-char folds are now
generated. However the fold for \xDF=>ss is hard-coded in,
and it was too much trouble sending flags to the sub-sub routine that
does this, so another check is done at the point of storing the list of
multi-char folds. Since \xDF doesn't have a single char fold, this
works.
|
|
|
|
|
| |
As agreed, this improvement is going into 5.14. A customized
message is output, instead of a generic one.
|
|
|
|
|
|
|
| |
An earlier commit in this series changed some error messages.
I realized that it did not make sense really to use "/a" for the regex
modifier, when the message was for the infix form "(?a:", so this
just removes the slash.
|
|
|
|
|
|
| |
This allows a second regex 'a' modifier in the infix form to not have to
be contiguous with the first, and improves the message if there are extra
modifiers.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch completes the fixing of this problem. The problem is that
the failing .t set @INC to exclude lib, and hence couldn't find utf8.pm,
which 5.14 was requiring in places where it previously didn't. This
patch finishes the job of no longer requiring utf8.pm in the many places
where that requirement was inadvertently added in 5.14. Commit
3ad98780b4bded02c371c83a668dc8f323e57718 started the job.
This patch changes regcomp.c to not set ANYOF_NONBITMAP_NON_UTF8 where
it inappropriately was. I don't know what I was thinking when I
originally did what this changes. In order to match outside the bitmap,
these characters all must match something that requires utf8, such as a
LIGATURE FI.
|
|
|
|
|
|
|
|
|
|
| |
As noted in the comments in the code for this commit, this flag has
higher consequences than others when it is inappropriately set in the
synthetic start class. This patch causes it to not be set unless there
is some path in the regex engine that needs it, but once set it is never
cleared. This results in a different set of false positives than
currently, but at present the flag can be set even if there is no
path in the engine that needs it.
|
|
|
|
|
|
|
|
|
|
| |
A statement should have been outside a block but was inside it.
The indentation was correct, and despite reading the code a number of
times I still missed it.
I'm having trouble distilling the failure scenario down into
a simple test case, and am short on tuits right now, so a test
will be committed later.
|
| |
|
| |
|
|
|
|
|
| |
I was using 0 for a generic non-interesting character, which works in C
but not C++.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
ANYOFV handles multi-char folds in ANYOF nodes, and it turns
out it is a superset of what FOLDCHAR does. FOLDCHAR never got fully
implemented in regexec.c, whereas ANYOFV is. FOLDCHAR may be the better
way to go in the long-term, as it takes less space and is faster, but
this gives us the functionality today, with no extra work.
FOLDCHAR had been generated only when the character in question is a
literal in the input stream, and wasn't touched for the probably more
common use of \N{} or \x, which were fixed from not doing anything
special to using ANYOFV earlier in the 5.13 series, and it turns out
that the code that does it all is in a part of the code that gets
executed anyway, so that simply removing the special FOLDCHAR code
causes execution to drop down to this code.
I'm thinking at the moment that for 5.16, ANYOFV should be removed in
favor of branches, using the technique of recursion that has recently
been added to \N{}. That would enable easier trie generation and
simplify things in regexec and the optimizer.
|
| |
|
|
|
|
|
|
| |
The tricky folds have only worked in one direction. This handles the
other: when it sees something that the tricky fold folds to, it converts
that to the tricky fold op.
|
|
|
|
|
| |
This is in preparation for future commits. The declarations don't
depend on the two code lines.
|
|
|
|
| |
This failed under some circumstances
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This routine now calls reg() recursively after converting the parse
to something the rest of the code understands. This eliminates
duplicated code, and allows for uniform treatment of code points, as
things were getting out of sync. It also eliminates the restriction on
how many characters a named sequence can expand to.
toke now converts its input (which is in Unicode terms) to native on
EBCDIC platforms, so the rest of the code can continue to ignore
that.
The restriction on the number of characters in a named
sequence is hereby removed, because reg() handles that.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As indicated in the comments, this flag needs to be initialized to
1 or the optimizer loses the fact that something could match a
character that isn't in utf8 and whose bitmap bit isn't set. This
happens, for example, with Unicode properties.
Thus this fixes #77414. That ticket had been closed recently because
it went away due to another patch that caused the optimizer to be
bypassed in the cases tested for. But when that patch was reverted,
and cleaned-up, this bug came back. Now, I believe I have found the
root cause.
|
|
|
|
|
|
|
|
| |
For non-locale, \d, etc. are compiled in with the actual code points they
match, so the class portion of the synthetic start class node is
irrelevant, and should be initialized to zero to avoid confusion. But for
locale it is highly relevant, and should be initialized to all ones, to
indicate matching anything.
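
The initialization rule above can be sketched roughly as follows. This is a toy model, not the regcomp.c structure: demo_start_class, its field sizes, and the choice to start the bitmap as all ones are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define DEMO_SSC_BITMAP_BYTES 32  /* 256 bits, one per code point below 256 */
#define DEMO_SSC_CLASS_BYTES   4  /* hypothetical size of the class portion */

typedef struct {
    unsigned char bitmap[DEMO_SSC_BITMAP_BYTES];
    unsigned char classes[DEMO_SSC_CLASS_BYTES]; /* \d, \w, ... under locale */
    bool locale;
} demo_start_class;

static void demo_start_class_init(demo_start_class *ssc, bool locale)
{
    assert(ssc != NULL);
    /* Assumption: a fresh start class matches every code point below 256. */
    memset(ssc->bitmap, 0xFF, sizeof ssc->bitmap);
    /* Non-locale: classes are already expanded into the bitmap, so zero them.
     * Locale: membership is only known at run time, so allow every class. */
    memset(ssc->classes, locale ? 0xFF : 0x00, sizeof ssc->classes);
    ssc->locale = locale;
}
```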
|
|
|
|
|
|
|
| |
When ORing two nodes together for the synthetic start class, and one
matches outside the 256-char bitmap, we currently don't know what it
matches. In some cases it could be some or all of those 256 characters.
If so, we have to assume it's all of them.
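
A minimal sketch of that conservative OR, assuming a toy node with a 256-bit bitmap and a single "matches outside the bitmap" flag. The names demo_class, demo_class_or and demo_class_test are hypothetical and do not correspond to the real cl_or() code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_BITMAP_BYTES 32   /* 256 bits, one per code point below 256 */

typedef struct {
    unsigned char bitmap[DEMO_BITMAP_BYTES];
    bool matches_outside_bitmap;   /* may match things the bitmap does not list */
} demo_class;

/* ORs other into acc.  If other can match outside its bitmap we do not know
 * what it matches, so we assume it could be any of the 256 characters. */
static void demo_class_or(demo_class *acc, const demo_class *other)
{
    size_t i;

    assert(acc != NULL && other != NULL);
    if (other->matches_outside_bitmap) {
        memset(acc->bitmap, 0xFF, sizeof acc->bitmap);
        acc->matches_outside_bitmap = true;
        return;
    }
    for (i = 0; i < DEMO_BITMAP_BYTES; i++)
        acc->bitmap[i] |= other->bitmap[i];
}

/* Returns true if code point ch (below 256) is set in the bitmap. */
static bool demo_class_test(const demo_class *c, unsigned char ch)
{
    return (c->bitmap[ch >> 3] >> (ch & 7)) & 1;
}
```

Setting every bit can only produce false positives in the start class; it never rules out a character that could really match.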
|
|
|
|
|
| |
This is in prep for another commit which needs the flags to be
untouched for some tests.
|
|
|
|
|
| |
It is better to test that a pointer is in bounds before dereferencing it
even though in this case it doesn't lead to an actual error.
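
The ordering in question can be shown with a small sketch. The function below is hypothetical, not the code touched by the commit; it only illustrates putting the bounds test before the dereference.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the leading blanks in the half-open range [p, end). */
static size_t demo_count_leading_blanks(const char *p, const char *end)
{
    size_t n = 0;

    assert(p != NULL && end != NULL && p <= end);
    /* Bounds test first: && short-circuits, so *p is never read at end. */
    while (p < end && *p == ' ') {
        p++;
        n++;
    }
    return n;
}
```

Writing the condition as `*p == ' ' && p < end` would read one element past the range whenever the whole buffer is blanks.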
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Things like \S have not been accessible to the synthetic start class
under locale matching rules. They have been placed there, but the
start class didn't know they were there.
This patch sets ANYOF_CLASS in initializing the synthetic start class
so that downstream code knows it is a charclass_class, and removes
the code that partially allowed this bit to be shared. That code isn't
needed in 5.14, and sharing the bit would take more thought than
was reflected in the code.
I can't come up with a test case that would verify that this works,
because of general locale testing issues, other than looking at a dump of
the generated regex synthetic start class; but the dump isn't the same
thing as the real behavior, and using one is also subject to breakage if
the regex code changes in the slightest.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is further work along the lines in RT #85964 and commit
af302e7fa58415c2d8454c8cbef7bccd8b504257. It reverts, for the most
part, commits aa19b56b2f07e9eabf57540f00d312d8093e9d28 (Remove unused
parameter) and c613755a4b4fc8e64a77639d47d7e208fee68edc (/l in synthetic
start class).
Those commits caused the synthetic start class to often be marked as
matching under locale rules, even if there was no part of the regular
expression that used locale. This led to RT #85964, which made apparent
that there were a number of assumptions in the optimizer about locale
that were no longer necessarily true. This new commit changes things so
that locale has to be somewhere in the regex in order to get the
synthetic start class to include /l. In other words, this restricts the
effect of those commits to regular expressions which have /l -- we go
back to the old way of doing things for non-locale regexes. This limits
any bugs that may have been introduced by the addition of /l (and being
able to match only sub-parts of a regex under locale) to the relatively
uncommon regexes which actually use it. There are a number of bugs
that have surfaced for the locale rules regexes that have gone
unreported; and some say locale rules regexes should be deprecated.
|
|
|
|
|
| |
This reverts commit c45df5a16bb5a26a06275cc63f2c3e6b1d708184.
The parameter is about to be put back in.
|
|
|
|
|
| |
If any part of a pattern has /l, this flag will get set; for future
use.
|
| |
|
|
|
|
|
| |
The code has hard-coded the possible case mappings for the code points
< 256. This one was omitted.
|
|
|
|
|
| |
oldp contains the pointer that we want to get to. Use that instead
of a possibly invalid assumption about length.
|
| |
|
| |
The introduction of the l regex modifier introduces the possibility that
a regular expression can have subportions that match under locale and
other portions that don't. I (khw) failed to see all the implications
of that in the optimizer. Unfortunately, things didn't start surfacing
until late in the development cycle.
The optimizer is structured so that a new blank node is initialized to
match anything, and the state is set to AND, so that the first real node
that comes along is supposed to be ANDed together; with the result being
that node. (Like an AND of all 1's with some bit pattern yields that
bit pattern.) Then the mode is switched to OR, so subsequent nodes that
could be the start ones are or'd in. *(see footnote below).
This design leads to some issues, like at the XXX line added by this
commit, which looks to be a work-around for the deficiencies of the
design.
Commit cf34198ebe3dd876d67c10caa9acf491ad2a0c51 that led to this ticket
changed things to include LOCALE as part of the initialization, so that
the l could be on and off in various parts of the regex. I tried to
just revert that (plus associated parameter changes), and found that the
changes made to the AND and OR logic that fixed other problems really
depended on that commit. Perhaps those could be worked around, but it
is not the forward direction.
This commit works around things in a different way. What happened in
the earlier commit was that the synthetic start class (SSC) is, under
some circumstances, getting generated as matching locale even if there
is no locale matching in the regex. (This could not happen if the
design were as described in the footnote.) This shouldn't matter except
for potential performance issues, as this would just be false
positives. However, it turns out there is code in the optimizer that
assumes that locale and non-locale are never mixed; and so does not do
the right thing.
This patch is aimed at safety. If the SSC is marked as locale, it sets
the bits for things like \w as if the SSC could also end up being for
non-locale. This can generate false positives for true locale matches
but shouldn't introduce actual optimizer errors, since it only adds to
what the SSC can match and doesn't make any restrictions.
* I don't see why this design was chosen; it seems to me easier to start with the
initial state set to all 0's, and then the first node gets OR'd in,
yielding exactly that first node; then you don't have to switch; you
still have to deal with AND cases, as for example in 0 length
lookaheads, but things are made easier.
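
To make the two initialization schemes discussed in this message concrete, here is a toy sketch using a 32-bit word in place of the real node. demo_ssc, demo_ssc_add and demo_or_only are hypothetical names; the real optimizer state is considerably richer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { DEMO_AND, DEMO_OR } demo_mode;

typedef struct {
    uint32_t  bits;   /* toy stand-in for "what the start class can match" */
    demo_mode mode;
} demo_ssc;

/* Scheme described in the message: start as "match anything" in AND mode. */
static void demo_ssc_init(demo_ssc *s)
{
    assert(s != NULL);
    s->bits = UINT32_MAX;
    s->mode = DEMO_AND;
}

/* The first node is ANDed in (all ones AND x yields x), then the mode
 * switches so that later candidate start nodes are ORed in. */
static void demo_ssc_add(demo_ssc *s, uint32_t node_bits)
{
    assert(s != NULL);
    if (s->mode == DEMO_AND) {
        s->bits &= node_bits;
        s->mode = DEMO_OR;
    } else {
        s->bits |= node_bits;
    }
}

/* Alternative from the footnote: start from all zeros and only ever OR,
 * so no mode switch is needed (zero OR x yields x). */
static uint32_t demo_or_only(const uint32_t *nodes, size_t n)
{
    uint32_t acc = 0;
    size_t i;

    for (i = 0; i < n; i++)
        acc |= nodes[i];
    return acc;
}
```

Both schemes give the same result for a plain sequence of alternative start nodes. The sketch leaves out the AND cases, such as zero-length lookaheads, that the footnote says still have to be handled.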
|
| |
|
|
|
|
|
|
| |
A number of earlier commits have fixed various places where the code
assumed that digits did not move under locale. This adds another two,
bringing the code here in line with the other sequences like \w
|
|
|
|
|
|
|
|
| |
The line before this line indicates that there is no bitmap, but it
didn't clear this flag that says that there may be. This was likely
a contributory bug to what ac51e94be5daabecdeb0ed734f3ccc059b7b77e3
tried to fix, and was eventually fixed in
6f8d7d0df3e3141d61246e6b0a3db12ab1fd7f92.
|
| |
|
|
|
|
|
|
| |
This fixes a regression introduced with charset regex modifiers. A utf8
pattern without a charset is supposed to mean unicode semantics. But
it didn't until this patch.
|
|
|
|
|
|
|
| |
/a and /u should match identically case-insensitively, but they didn't.
Nor was /a being tested because it was thought that they handled things
identically, and the tests were already taking too long. So this adds
some tests as well.
|
|
|
|
|
| |
We can tell if there is something outside the bitmap and so
can short circuit calling this function if there isn't.
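
A rough sketch of that short circuit, assuming a toy node that records whether anything outside the bitmap can match. demo_node, demo_slow_lookup and the call counter are hypothetical and exist only to make the saved call visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    unsigned char bitmap[32];     /* code points 0..255 */
    bool has_outside_bitmap;      /* true if anything above 255 can match */
    const unsigned long *extra;   /* hypothetical list of higher code points */
    size_t extra_count;
} demo_node;

static unsigned demo_slow_calls;  /* counts calls to the expensive path */

/* Stand-in for the expensive out-of-bitmap lookup. */
static bool demo_slow_lookup(const demo_node *n, unsigned long cp)
{
    size_t i;

    demo_slow_calls++;
    for (i = 0; i < n->extra_count; i++)
        if (n->extra[i] == cp)
            return true;
    return false;
}

static bool demo_node_matches(const demo_node *n, unsigned long cp)
{
    assert(n != NULL);
    if (cp < 256)
        return (n->bitmap[cp >> 3] >> (cp & 7)) & 1;
    if (!n->has_outside_bitmap)
        return false;             /* short circuit: skip the expensive call */
    return demo_slow_lookup(n, cp);
}
```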
|
|
|
|
| |
This silences a compiler warning
|
|
|
|
| |
This silences a compiler warning
|
|
|
|
| |
This silences a compiler warning
|
|
|
|
|
|
|
| |
Commit 137165a601b852a9679983cdfe8d35be29f0939c omitted
required initialization for the synthetic start class. Adding it
exposed other bugs in cl_and() and cl_or(), which have been fixed
by a previous commit.
|
|
|
|
|
|
| |
These two routines have not kept pace with the changes in the ANYOF
flags. And, I believe there were issues even before them. I did a
systematic re-thinking of what their behaviors should be.
|
|
|
|
|
|
|
|
|
| |
Now that regexes can be combinations of different charset modifiers,
a synthetic start class can match locale and non-locale both. locale
should generally match only things in the bitmap for code points < 256.
But a synthetic start class with a non-locale component can match such
code points. This patch makes an exception for synthetic nodes; any false
positive there is resolved when the match is attempted again for real.
|
|
|
|
|
|
| |
It's unclear if tries will work under /l. I haven't seen any failures,
but there have been under /d. As a precaution, until more testing is
done, disable tries under anything but /u and UTF.
|