| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
| |
This file only needs including by globals.c; it was being included
in regcomp.c too as the declarations in regcomp.h aren't included by
perl.h and thus don't get pulled into globals.c. This was a confusing
and hacky workaround.
Instead, this commit causes globals.c to #include regcomp.h directly.
After this commit, only globals.c #includes INTERN.h.
|
|
|
|
|
|
| |
The perl build option -DPERL_GLOBAL_STRUCT_PRIVATE had bit-rotted
due to lack of smoking. The main fix is to just add 'dVAR;' to any
functions which have a pTHX arg. It's a NOOP on normal builds.
|
|
|
|
|
| |
While single stepping in gdb, I noticed that this loop kept executing,
when it need not.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The failing case can be reduced to
qr/\x{100}[\x{3030}\x{1fb2}/
(It only happens on UTF-8 patterns).
The bottom line is that it was assuming that there was at least one
character that folded to 1fb2 besides itself, even though the function
call said there weren't any such. The solution is to pay attention to
the function return value.
I incorporated Hugo's++ patch as part of this one.
However, the original test case should never have gotten this far. The
parser is getting passed garbage, and instead of croaking, it is somehow
interpreting it as valid and calling the regex compiler. I will file a
ticket about that.
|
|
|
|
|
|
|
|
|
|
|
| |
The problem here is that a syntax error occurs and hence certain things
don't get done, but processing continues, as the error isn't checked for
until after the return of the function that found it. The failing
assertion is checking that those certain things actually did get done.
There appear to be good reasons to defer the raising of the error until
then, so the simplest way to fix this is to generalize the code so that
the failing assertion doesn't happen.
|
|
|
|
|
| |
Better to have the build fail if they're wrong than relying on the
code path being hit at runtime in a DEBUGGING build.
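As a minimal sketch of the difference between a build-time check and a runtime assertion, the following uses C11's `_Static_assert`. The enum, table, and function names are hypothetical and are not taken from the perl source, which may well use its own assertion macro.

```c
#include <stddef.h>

/* Hypothetical enum and table, for illustration only. */
enum node_kind { KIND_A, KIND_B, KIND_C, KIND_COUNT };

static const char *const kind_names[] = { "A", "B", "C" };

/* Checked by the compiler: the build fails if the table and the enum
 * ever get out of step, even if no test reaches this code at runtime. */
_Static_assert(sizeof kind_names / sizeof kind_names[0] == KIND_COUNT,
               "kind_names must have one entry per node_kind");

/* Returns the name for k, or NULL if k is out of range. */
const char *kind_name(enum node_kind k)
{
    return ((unsigned) k < KIND_COUNT) ? kind_names[k] : NULL;
}
```

A runtime assert() placed inside kind_name() would catch the same mismatch only when that function is called in a build with assertions enabled.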
|
|
|
|
|
|
|
|
|
|
|
| |
This removes the most obvious and easy things that are no longer needed
since regexes no longer use swashes at all.
tr/// continues, for the time being, to use swashes, so not all swash
handling is removable now. But tr/// doesn't use inversion lists, and
so a bunch of code is ripped out here. Other code could have been, but
I did only the relatively easy stuff. The rest can be ripped out all at
once when tr/// stops using swashes.
|
|
|
|
|
| |
The element at, say, [0] is a particular thing. This commit changes the
code to use a mnemonic instead of [0], for clarity.
|
|
|
|
| |
An empty entry is now just NULL.
|
|
|
|
|
|
|
|
|
|
| |
A swash is no longer used, so we can remove some elements from the array
of data that gets stored with the compiled pattern for use in runtime
matching. This is the first step in more simplifications.
Since a swash isn't used, this change also requires regexec.c to change
to use a straight inversion list lookup. This has the salutary effect
of eliminating a conversion between code point and UTF-8.
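The following is a sketch of what a straight inversion list lookup can look like, assuming the usual definition of an inversion list: a sorted array in which even-indexed elements start ranges that are in the set and odd-indexed elements start ranges that are not. The function and array names are hypothetical; this is not the code in regexec.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Sample inversion list for [A-Za-z]: in from 0x41, out from 0x5B,
 * in from 0x61, out from 0x7B. */
const uint32_t ascii_alpha[4] = { 0x41, 0x5B, 0x61, 0x7B };

/* Returns 1 if cp is in the set described by list[0..len-1], else 0. */
int invlist_lookup(const uint32_t *list, size_t len, uint32_t cp)
{
    size_t lo = 0, hi = len;

    /* Binary search for the number of elements that are <= cp. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (list[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* lo == 0 means cp precedes the first range. Otherwise the range
     * containing cp starts at index lo - 1; even indices are "in". */
    return lo > 0 && (lo - 1) % 2 == 0;
}
```

The lookup works directly on the code point, so no UTF-8 encoding step is needed before the search.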
|
|
|
|
|
|
|
|
|
|
|
| |
This is in case we ever need it. This checks for portability in the
code points specified in user-defined properties. Previously there was
a check, but I couldn't get a warning to trigger unless there was also
overflow. So that means the pattern compile failed due to the overflow,
and the portability warning was superfluous. But one can have
non-portable code points without overflow; the old method just didn't
properly detect them. If we do ever need to detect and report on them,
the code is mostly written and in this commit.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This large commit moves the handling of user-defined properties to C
code. This should speed it up, but the main reason to do this is to
stop using swashes in this case, leaving only tr/// using them. Once
that too is converted, all swash handling can be ripped out of perl.
Doing this in perl has caused some nasty interactions that will now be
fixed automatically.
The change is not entirely transparent, however (besides speed and the
possibility of removing these interactions). perldelta in this commit
details these.
|
|
|
|
|
|
|
| |
A global hash has to be specially handled. The keys can't be shared,
and all the SVs stored into it must be in its thread. This commit adds
the hash, and initialization, and macros for context change, but doesn't
use them. The code to deal with this is entirely confined to regcomp.c.
|
| |
|
|
|
|
| |
The new name more closely corresponds with its use.
|
|
|
|
| |
Indent a block of code newly formed by the previous commit
|
|
|
|
|
|
| |
Previous commits in this series have changed uc(), lc(), fc(), etc. to
know how to handle Turkish UTF-8 locales. This commit extends this to
/i regular expression pattern matching.
|
|
|
|
|
|
|
|
|
|
|
|
| |
The code knew this, but it was adding the ASCII alphabetics to the list
of things that matched in UTF-8 locales. This is unnecessary, as we've
long had the infrastructure elsewhere to handle all potential mappings
from a Latin1 code point to other Latin1, so we can just rely on it.
And it created complexities for future commits in this series.
The MICRO SIGN is the exception, as it folds to non-Latin1 in UTF-8
locales, and this is the place where the structure exists to handle
that.
|
|
|
|
| |
This will be used in a future commit.
|
| |
|
|
|
|
|
|
| |
This bug was introduced in b2296192536090829ba6d2cb367456f4e346dcc6
in 5.29.7. Using /il should not result in looking for a [:posix:] class
that matches the code points given.
|
| |
This was caused by a counting error.
An EXACTFish regnode has a finite length it can hold for the string
being matched. If that length is exceeded, a 2nd node is used for the
next segment of the string, for as many regnodes as are needed.
A problem occurs if a regnode ends with one of the 22 characters in
Unicode 11 that occur in non-final positions of a multi-character fold.
The design of the pattern matching engine doesn't allow matches across
regnodes. Consider, for example, a node that ends in the letter 'f'
when the next node begins with the letter 'i'. That sequence should match,
under /i, the ligature "fi" (U+FB01). But it wouldn't because the
pattern splits them across nodes. The solution I adopted was to forbid
a node to end with one of those 22 characters if there is another string
node that follows it. This is not foolproof: for example, if the
entire node consisted of only these characters, one would have to split
it at some position. (In that case, we just take as much of the string as
will fit.) But for real-life applications, it is good enough.
What happens if a node ends with one of the 22, is that the node is
shortened so that those are instead placed at the beginning of the
following node. When the code encounters this situation, it backs off
until it finds a character that isn't a non-final fold one, and closes
the node with that one.
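The backing-off rule can be sketched roughly as below. This is a simplified illustration, not the regcomp.c code: the function names are hypothetical, the predicate covers only two of the 22 characters ('f' and 't', both mentioned in this message), and it works on bytes of already-folded text without the reparsing described later.

```c
#include <stddef.h>

/* Illustrative subset only; the real set has 22 members. */
static int is_nonfinal_fold(char c)
{
    return c == 'f' || c == 't';
}

/* Returns how many bytes of s (of length len) go into a node that can
 * hold at most max bytes. The rest would start the following node. */
size_t node_split_len(const char *s, size_t len, size_t max)
{
    size_t n;

    if (len <= max)
        return len;          /* everything fits; no following node */

    /* Back off until the node no longer ends in a non-final fold
     * character. This may need to back up more than one character. */
    n = max;
    while (n > 0 && is_nonfinal_fold(s[n - 1]))
        n--;

    /* If the node held only such characters, take as much as fits. */
    return n > 0 ? n : max;
}
```

For the input "abcftx" with room for 5 bytes, the node is closed after "abc", and "ftx" moves to the next node; that is a two-character backup.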
A /i node is filled with the fold of the input, for several reasons.
The most obvious is that it saves time: you can skip folding the pattern
at runtime. But there are reasons based on the design of the optimizer
as well, which I won't go into here, but are documented in regcomp.c.
When we back out the final characters in a node, we also have to back
out the corresponding unfolded characters in the input, so that those
can be (folded) into the following node. Since the number of characters
in the fold may not be the same as unfolded, there is not an easily
discernible correspondence between the input and the folded output.
That means that generally, what has to be done is that the input is
reparsed from the beginning of the node, but the permitted length has
been shortened (we know precisely how much to shorten it to) so that it
will end with something other than the 22. But, the code saves the
previous input character's position (for other reasons), so if we only
have to back up one character, we can just use that and not have to
reparse.
The bug was that the code thought a two-character backup was really a
one-character one, and did not reparse the node, creating an off-by-one
error, and a character was simply omitted in the pattern (that should
have started the following node). And the input had two of the 22
characters adjacent to each other in just the right positions that the
node was split. The bisect showed that when the node size was changed
the bug went away, at least for this particular input string. But a
different, longer, string would have triggered the bug, and this commit
fixes that.
This bug is actually very unlikely to occur in most real world
applications. That is because other changes in the regex compiler have
caused nodes to be split so that things that don't participate in folds
at all are separated out into EXACT nodes. (The reason for that is it
allows the optimizer things to grab on to under /i that it wouldn't
otherwise have known about.) That means that anything like this string
would never cause the bug to happen because blanks and commas, etc.
would be in separate nodes, and so no node would ever get large enough
to fill the 238 available byte slots in a node (235 on EBCDIC). Only a
long string without punctuation would trigger it. I have artificially
constructed such a string in the tests added by this commit.
One of the 22 characters is 't', so long strings of DNA "ACTG" could
trigger this bug. I find it somewhat amusing that this is something
like a DNA transcription error, which occurs in nature at very low
rates, but selection, it is believed, will make sure the error rate is
above zero.
|
| |
|
|
|
|
|
|
|
|
| |
These were relics from the removal of the sizing pass. I did a global
substitute, and missed that these cases promptly took the inverse
function of the function I just added. In other words, if g() is the
inverse of f(), then g(f(x)) is always x, and we can omit both g() and
f().
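A small sketch of the identity involved. The conversion pair here (pointer to offset and back) is an assumption chosen for illustration; the commit message does not say which pair of functions was involved, and all names below are hypothetical.

```c
#include <stddef.h>

typedef struct { int payload; } node;   /* hypothetical node type */

node program[4];                        /* hypothetical node array */

/* f(): converts a node pointer to an offset from the array start. */
static size_t node_to_offset(const node *base, const node *p)
{
    return (size_t) (p - base);
}

/* g(): the inverse of f(), converts an offset back to a pointer. */
static const node *offset_to_node(const node *base, size_t off)
{
    return base + off;
}

/* What a mechanical global substitute can leave behind: g(f(p)). */
const node *redundant(const node *base, const node *p)
{
    return offset_to_node(base, node_to_offset(base, p));
}

/* The same result with both calls removed. */
const node *simplified(const node *base, const node *p)
{
    (void) base;
    return p;
}
```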
|
|
|
|
|
|
|
|
|
| |
This reverts commit 7e9b4fe4d85e9b669993bf96a7e33ffff3197e20, with
additional changes to get things to compile.
It turns out I was wrong about the underlying cause that commit
addressed, and it is easier to just use the existing constants that get
generated.
|
|
|
|
|
|
|
|
|
|
| |
This just moves things around so that the information is kept in local
variables and the regnode not created until all that info has been
completely determined. I believe it is clearer to read, but the impetus
came from the fact that prior to this commit, use of \b{} always
restarted the parse unnecessarily because the order of things made it
appear that a real /d op had appeared, whereas it was just the one
currently being constructed.
|
|
|
|
| |
Indent the block added by the previous commit
|
|
|
|
|
|
| |
Recent commit c316b824875fdd5ce52338f301fb0255d843dfec introduced an
out-of-bounds memory read. This commit fixes it. An ANYOFH regnode
doesn't have a bitmap, so don't try to read it.
|
|
|
|
|
|
|
|
|
|
| |
A node that matches only 'A' and 'a', for example, can be turned into an
ANYOFM node, which is faster to execute. This is done after joining of
adjacent EXACTFish nodes, as longer nodes are better than shorter ones,
including because they lessen the number of bugs with multi-char folds
not matching because of node boundaries.
But if a length 1 node remains, ANYOFM is better.
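To illustrate why a node matching only 'A' and 'a' suits a mask-based test, here is a sketch that assumes the node matches every byte c for which (c & mask) == value, and assumes ASCII. The struct and function names are hypothetical and are not the regnode layout.

```c
#include <stdint.h>

/* Matches every byte c for which (c & mask) == value. */
typedef struct {
    uint8_t mask;
    uint8_t value;
} masked_match;

/* 'A' is 0x41 and 'a' is 0x61; they differ only in bit 0x20, so
 * masking that bit off (mask 0xDF) leaves 0x41 for both. */
const masked_match match_A_or_a = { 0xDF, 0x41 };

/* A single AND and a single comparison decide the match. */
int masked_matches(masked_match m, uint8_t c)
{
    return (c & m.mask) == m.value;
}
```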
|
|
|
|
| |
Indent newly formed block in previous commit
|
|
|
|
|
|
|
|
|
|
|
| |
This commit adds a regnode for the case where nothing in the bit map has
matches. This allows the bitmap to be omitted, saving 32 bytes of
otherwise wasted space per node. Many non-Latin Unicode properties have
this characteristic. Further, since this node applies only to code
points above 255, which are representable only in UTF-8, we can
trivially fail a match where the target string isn't in UTF-8. Time
savings also accrue from skipping the bitmap look-up. When swashes are
removed, even more time will be saved.
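A rough sketch of the shortcut described above, assuming the node's set is stored as inclusive ranges of code points that all lie above 255. The types and names are hypothetical, and the real node does not store its set this way. The 32 bytes saved correspond to a 256-bit bitmap, one bit per code point from 0 to 255.

```c
#include <stddef.h>
#include <stdint.h>

/* Inclusive range of code points; all are assumed to be above 255. */
typedef struct {
    uint32_t first;
    uint32_t last;
} cp_range;

/* Sample set: the Greek and Coptic block, U+0370 to U+03FF. */
const cp_range greek[1] = { { 0x370, 0x3FF } };

/* Returns 1 if cp matches. No bitmap is consulted at all. */
int high_only_matches(const cp_range *ranges, size_t n,
                      int target_is_utf8, uint32_t cp)
{
    size_t i;

    /* A non-UTF-8 target holds only code points 0..255, none of
     * which can be in the set, so fail immediately. */
    if (!target_is_utf8 || cp <= 255)
        return 0;

    for (i = 0; i < n; i++) {
        if (cp >= ranges[i].first && cp <= ranges[i].last)
            return 1;
    }
    return 0;
}
```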
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit extensively changes the optimizations for ANYOF regnodes
that represent bracketed character classes.
The removal of the regex compilation pass now makes these feasible and
desirable. Compilation now tries hard to optimize an ANYOF node into
something smaller and/or faster when feasible.
Now, qr/[X]/ for any single character or POSIX class X, and any
modifiers like /d, /i, etc, should be the same as qr/X/ for the same
modifiers, except when that would require the pattern to be upgraded from
non-UTF-8 to UTF-8; even then it is done if not doing so could introduce bugs.
These changes fix some issues with multi-character /i folding.
|
|
|
|
|
| |
This is to distinguish it from another similar variable being introduced
in the next commit.
|
| |
|
|
|
|
|
| |
This makes the code easier to read, as it summarizes the purposes of the
three, and makes it easy to find which reason it is.
|
|
|
|
| |
Indent after the previous commit created a new outer loop
|
|
|
|
|
|
| |
Instead of repeating the code, slightly modified, this uses a loop.
This is in preparation for a future commit where a third instance would
have been required.
|
|
|
|
|
| |
The new name more accurately expresses the usage, as what gets generated
may not actually be an ANYOFD.
|
| |
|
|
|
|
|
|
| |
Commit 13dfc48da5322166f9d64a7349e3434c070ead88 removed one of two uses
of this function. This removes the remaining one. Commits in between
removed the need for most of the guts of the function.
|
|
|
|
| |
These flags can be set in one place, rather than in multiple ones.
|
|
|
|
|
|
| |
This refactors the code somewhat. When we discover a deal-breaker code
point we can just break out of the loop (using a goto) instead of
setting a flag, continuing, and later testing it.
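The shape of that refactoring, as a sketch. The loop body and the notion of a "deal-breaker" (here, any code point above 255) are placeholders chosen for illustration, and the names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

const uint32_t sample_ok[3]  = { 0x61, 0x41, 0xE9 };
const uint32_t sample_bad[3] = { 0x61, 0x100, 0x41 };

/* Returns the lowest code point if all of them fit in a byte, or -1
 * as soon as a deal-breaker is seen. An empty list yields 255. */
int lowest_byte_or_fail(const uint32_t *cps, size_t n)
{
    uint32_t min = 255;
    size_t i;

    for (i = 0; i < n; i++) {
        if (cps[i] > 255)
            goto deal_breaker;   /* no flag to set and test later */
        if (cps[i] < min)
            min = cps[i];
    }
    return (int) min;

  deal_breaker:
    return -1;
}
```

The flag-based version would keep looping over the remaining code points and then test the flag after the loop, doing work whose result is discarded.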
|
|
|
|
|
|
|
| |
It doesn't matter currently, but it's best to declare more limited scope
variables for doing limited scope work, rather than using the more
global variable, which someday might want to be used later, outside the
block that zapped it, and would lead to a surprise.
|
|
|
|
|
|
|
| |
The ANYOFM/NANYOFM regnodes are generalizations of these. They have
more masks and shifts than the removed nodes, but not more branches, so
are effectively the same speed. Remove the ASCII/NASCII nodes in favor
of having less code to maintain.
|
|
|
|
|
|
|
|
|
| |
These two regnodes are faster than regular /[[:posix:]]/ ones, and some
of the latter are equivalent to some of the former. So try the faster
optimizations first.
This commit just swaps the two blocks of code, and outdents
appropriately.
|
|
|
|
|
|
|
|
| |
This refactors the code that sees about optimizing bracketed character
classes into something else, so that the creating of the other regnode
is done closer to its determination, and only the really common code
actually is done in the common place, moved to the end of the function.
This removes the need for some 'elses' and 'ifs'.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 8a100c918ec81926c0536594df8ee1fcccb171da created node types for
handling an 's' at the leading edge, at the trailing edge, and at both
edges for nodes under /di that contain nothing else that would
prevent them from being EXACTFU nodes. If two of these get joined, it
could create an 'ss' sequence which can't be an EXACTFU node, for U+DF
would match them unconditionally. Instead, under /di it should match
if and only if the target string is UTF-8 encoded.
I realized later that having three types becomes harder to deal with
when adding yet more node types, so this commit turns the three into
just one node type, indicating that at least one edge of the node is an
's'.
It also simplifies the parsing of the pattern and determining which node
to use.
|
|
|
|
|
|
| |
Previous commits caused the pattern under /i to be folded as much as
possible. This commit takes advantage of this by not folding when we
know it already has been folded.
|
|
| |
EXACTFUP was created by the previous commit to handle a problematic case
in which not all the code points in an EXACTFU node are /i foldable at
compile time. Doing so will allow a future commit to use the pre-folded
EXACTFU nodes (done in a prior commit), saving execution time for the
common case. The only problematic code point is the MICRO SIGN. Most
patterns don't use this character.
EXACTFU_SS is problematic in a different way. It contains the sequence
'ss' which is folded to by LATIN SMALL LETTER SHARP S, but everything in
it can be pre-folded (unless it also contains a MICRO SIGN). The reason
this is problematic is that it is the only non-UTF-8 node where the
length in folding can change. To process it at runtime, the more
general fold equivalence function is used that is capable of handling
length disparities, but is slower than the functions otherwise used for
non-UTF-8.
What I've chosen to do for now is to make a single node type for all the
problematic cases (which at this time means just the two aforementioned
ones). If we didn't do this, we'd have to add a third node type for
patterns that contain both 'ss' and MICRO. Or artificially split the
pattern so the two never were in the same node, but we can't do that
because it can cause bugs in handling multi-character folds. If more
special handling is found to be needed, there'd be a combinatorial
explosion of additional node types to handle all possible combinations.
What this effectively means is that the slower, more general foldEQ
function is used for portions of patterns containing the MICRO sign when
the pattern isn't in UTF-8, even though there is no inherent reason to
do so for non-UTF-8 strings that don't also contain the 'ss' sequence.
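The decision described above can be sketched as a single predicate over a non-UTF-8 (Latin-1) pattern segment: the problematic node type is chosen if the segment contains either the sequence 'ss' or the MICRO SIGN (byte 0xB5, whose fold is U+03BC). The function name is hypothetical and this is a simplification of what the compiler does.

```c
#include <stddef.h>

#define MICRO_SIGN 0xB5   /* U+00B5 in Latin-1 */

/* Returns 1 if the folded, non-UTF-8 segment needs the slower, more
 * general node type; 0 if the ordinary pre-folded node will do. */
int needs_problematic_node(const char *folded, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        /* Case 1: the fold needs UTF-8 to represent. */
        if ((unsigned char) folded[i] == MICRO_SIGN)
            return 1;

        /* Case 2: 'ss' is what LATIN SMALL LETTER SHARP S folds to,
         * so the length can change in folding. */
        if (folded[i] == 's' && i + 1 < len && folded[i + 1] == 's')
            return 1;
    }
    return 0;
}
```

Using one predicate for both cases mirrors the choice of one node type: a segment containing both 'ss' and the MICRO SIGN needs no third kind of node.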
|
|
|
|
|
|
|
|
|
|
| |
If a non-UTF-8 pattern contains a MICRO SIGN, this special node is now
created. This character is the only one not needing UTF-8 to represent,
but its fold does need UTF-8, which causes some issues, so it has to be
specially handled. When matching against a non-UTF-8 target string, the
pattern is effectively folded, but not if the target is UTF-8. By
creating this node, we can remove the special handling required for the
nodes that don't have a MICRO SIGN, in a future commit.
|