| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
| |
As described in the comments, this changes the design of handling the
Unicode tricky fold characters to not generate a node for each possible
sequence but to get them to work within EXACTFish nodes.
The previous design(s) all used a node to handle these, which has the
drawback that it precludes legitimate matches that would cross
the node boundary.
The new design is described in the comments.
|
|
|
|
| |
And the reordering for clarity of one test
|
|
|
|
| |
The latter is the Perl standard way of making this declaration
|
| |
|
| |
|
|
|
|
|
| |
This macro sets all the bits of the class (for \w, etc) for use during
initialization
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Things like \S have not been accessible to the synthetic start class
under locale matching rules. They have been placed there, but the
start class didn't know they were there.
This patch sets ANYOF_CLASS when initializing the synthetic start class
so that downstream code knows it is a charclass_class. It also removes
the code that partially allowed this bit to be shared; that code isn't
needed in 5.14, and sharing the bit would take more thought than
was reflected in the code.
I can't come up with a test case that would verify that this works,
because of general locale testing issues. I did look at a dump of
the generated regex synthetic start class, but the dump isn't the same
thing as the real behavior, and a test based on one would break if
the regex code changes in the slightest.
|
| |
|
|
|
|
|
|
|
|
|
| |
Now that regexes can be combinations of different charset modifiers,
a synthetic start class can match locale and non-locale both. locale
should generally match only things in the bitmap for code points < 256.
But a synthetic start class with a non-locale component can match such
code points. This patch makes an exception for synthetic nodes, because
a match that passes the start class is then matched again for real,
which resolves any false positive.
|
| |
|
|
|
|
|
|
|
| |
This code has been rendered obsolete in 5.14 by using a different
mechanism altogether. This functionality is now provided at run-time,
user-selectable, via the /u and /d regex modifiers. This code was
for compile-time selection of which to use.
|
|
|
|
|
|
|
|
| |
This was from commit f424400810b6af341e96230836690da51c37b812
which came from needing a bit in an already-full flags field,
and my faulty analysis that two bits could be shared. I found another
mechanism to free up another bit, and now can separate these shared
bits again.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is the foundation for fixing the regression RT #82610. My analysis
was wrong that two bits could be shared, at least not without further
work. This changes to use a different mechanism to pass needed
information to regexec.c so that another bit can be freed up and, in a
later commit, the two bits can become unshared again.
The bit that is freed up is ANYOF_UTF8, which basically said there is
something that is matched outside the ANYOF bitmap, and requires the
target string to be in utf8. This changes things so the existence of
something besides the bitmap indicates this, and so no flag is needed.
The flag bit ANYOF_NONBITMAP_NON_UTF8 remains to indicate that there is
something that should be matched outside the bitmap even if the target
string isn't in utf8.
|
| |
|
|
|
|
|
|
|
|
|
| |
The FLAGS fields of certain regnodes were encoded with USE_UNI if
unicode semantics were to be used. This patch instead sets them to the
character set used, one of the possibilities of which is to use unicode
semantics. This shortens the code somewhat, and always puts the
character set in the flags field, which can allow use of switch
statements on it for efficiency, especially as new values are added.
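
A minimal sketch of the idea, not the actual regcomp.h code: the enum names, the struct layout and the simplified rules below are all assumptions made for illustration. It shows how storing the character set itself in the flags field lets the matcher dispatch with a single switch.

```c
#include <stdbool.h>

/* Hypothetical names; the real identifiers in regcomp.h may differ. */
typedef enum {
    CHARSET_DEPENDS,  /* legacy rules: result depends on the string's utf8ness */
    CHARSET_LOCALE,   /* rules come from the current locale */
    CHARSET_UNICODE   /* Unicode rules always apply */
} Charset;

typedef struct {
    unsigned char flags;  /* holds a Charset value rather than one USE_UNI bit */
} RegNode;

/* Report whether a node applies Unicode rules to a target string.
 * Simplified: locale nodes are treated as never using Unicode rules. */
bool uses_unicode_rules(const RegNode *node, bool target_is_utf8)
{
    switch ((Charset)node->flags) {
    case CHARSET_UNICODE:
        return true;
    case CHARSET_LOCALE:
        return false;
    case CHARSET_DEPENDS:
        return target_is_utf8;
    }
    return false;  /* unknown value: fall back to non-Unicode rules */
}
```

Adding a new character set in this arrangement means adding one enum value and one case, rather than finding another free bit.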
|
|
|
|
|
|
|
|
|
|
|
|
| |
This bug stemmed from Latin1 characters not matching any (non-complemented)
character class in /d semantics when the target string is not utf8, but having
unicode semantics when it is. The solution here is to add a special flag.
There were several tests that relied on the broken behavior, specifically they
tested that \xff isn't a printable word character even in utf8. I changed the
deparse test to instead use a non-printable code point, and I changed the ones
in re_tests to be TODOs, and will change them back using /a when that is
shortly added.
|
|
|
|
|
|
| |
It turns out that the INVERT and EOS flags can actually share the same bit, as
explained in the comments, freeing up a bit for other uses. No code changes
need be made.
|
| |
|
|
|
|
|
|
|
|
|
| |
# New Ticket Created by (Peter J. Acklam)
# Please include the string: [perl #81904]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=81904 >
Signed-off-by: Abigail <abigail@abigail.be>
|
|
|
|
|
|
|
|
|
|
|
| |
ANYOF_FOLD is now used under fewer conditions. Otherwise the
bitmap of characters 0-255 is fully calculated with the folds, and the
flag is not set. One condition is under locale, where the folds aren't
known at compile time; the other is for things accessible through a
swash.
By changing the name to its new meaning, certain optimizations become more
obvious.
|
|
|
|
|
|
| |
instead of the less familiar octal for larger values. Perhaps they
should actually print the actual character, but this is far easier to
understand than the previous output.
|
|
|
|
|
|
| |
The flags field is fully used, and until the ANYOF node is split later
in development, the CLASS bit will need to be freed up to give the space
for other uses. This patch allows for this to easily be toggled.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit partially reverts cefafd73018b048fa66d2b22250431112141955a
which unconditionally used a bitmap for classes like \w in ANYOF nodes
used in locales. Unfortunately, I forgot to unconditionally allocate
that space, so things were getting corrupted. It is scary that that did
not show up in my testing, but locales are hard to test. It showed up
in a workspace without DEBUGGING.
This commit now causes the bitmap to be used only when necessary, at the
expense of using a precious bit in the flags field to indicate that it
is being used. However, as events have turned out since that commit,
that flags bit isn't as precious as I thought. It looks like we will
have to split the ANYOF node into two similar nodes, one of which is
variable length, as there are bugs due to the optimizer thinking it is
of length 1, when in fact it doesn't currently have to be. That split
should allow more bits to be freed.
I'm retaining for now some ancillary code that was to help improve the
efficiency when that bit was removed; just in case we have to redo this.
But if we do, we have to unconditionally allocate the space we think we
are using.
Signed-off-by: David Golden <dagolden@cpan.org>
|
|
|
|
|
|
|
|
|
|
|
| |
ANYOF_NONBITMAP means that the node can match things that aren't in its
bitmap. Some things can match only when the target string is in utf8,
and some things can match even if it isn't. If the target string is not
in utf8, and we know that the only possible match is when it is in utf8,
we know it can't match, and avoid a fruitless, expensive swash load.
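
The check described above might look roughly like the following. This is only a sketch: the flag values, the struct and the function name are hypothetical, and the real test in regexec.c is interleaved with other logic.

```c
#include <stdbool.h>

/* Hypothetical flag values and node layout, for illustration only. */
#define NODE_NONBITMAP          0x01  /* something can match outside the bitmap */
#define NODE_NONBITMAP_NON_UTF8 0x02  /* ...even when the target is not utf8 */

typedef struct {
    unsigned char flags;
} AnyofNode;

/* Decide whether the slow swash lookup is worth attempting at all,
 * for a character that was not found in the node's bitmap. */
bool needs_swash_lookup(const AnyofNode *n, bool target_is_utf8)
{
    if (!(n->flags & NODE_NONBITMAP))
        return false;  /* the bitmap is the whole story */
    if (target_is_utf8)
        return true;   /* anything outside the bitmap may match */
    /* Non-utf8 target: only look further if the node says it can help. */
    return (n->flags & NODE_NONBITMAP_NON_UTF8) != 0;
}
```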
This change also fixes a number of problems shown in t/re/grind_fold.t
that I will deliver soon.
|
| |
|
|
|
|
|
| |
This is in preparation for adding a new bit which for debugging ease
ought to be adjacent to another one.
|
|
|
|
|
|
|
|
|
|
| |
I am about to hone the meaning of this to mean that there is something
outside the bitmap that is matchable by the node, and the new name
reflects that more accurately.
I am not retaining the old name because I'm about to remove it from the
flags field to save a bit and avoid masking operations, and any code
that would be using it would break at that point.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch causes all locale ANYOF nodes to have a class bitmap (4
bytes) even if they don't have a class (such as \w, \d, [:posix:]).
This frees up a bit in the flags field that was used to signal if the
node had the bitmap. I intend to use it instead to signal that loading
a swash, which is slow, can be bypassed. Thus this is a time/space
tradeoff, applicable to not just locale nodes: adding a word to the
locale nodes saves time for all nodes.
I added the ANYOF_CLASS_TEST_ANY_SET() macro to determine quickly if
there are actually any classes in the node.
Minimal code was changed, so this can be easily reversed if another bit
frees up.
Another possibility is to share with the ANYOF_EOS bit instead, as this
is used just in the optimizer's start class, and only in regcomp.c. But
this requires more careful coding.
Another possibility is to add a byte (hence likely at least 4 because of
alignment issues) to store extra flags.
And still another possibility is to add just the byte for the start
class, which would not need to affect other ANYOF nodes, since the EOS
bit is not used outside regcomp.c. But various routines in regcomp
assume that the start class and other ANYOF nodes are interchangeable,
so this option would require more code changes.
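
As an illustration of the "are any classes set?" test mentioned above, here is a sketch written as functions rather than a macro. The struct layout and names are assumptions; only the 4-byte size of the class bitmap comes from the commit message.

```c
#include <stdbool.h>

#define CLASS_BITMAP_BYTES 4  /* size given in the commit message */

/* Hypothetical layout: one bit per class such as \w, \d or a POSIX class. */
typedef struct {
    unsigned char classflags[CLASS_BITMAP_BYTES];
} LocaleAnyof;

/* Mark class number `bit` (0..31) as present in the node. */
void class_set(LocaleAnyof *n, unsigned bit)
{
    n->classflags[bit >> 3] |= (unsigned char)(1u << (bit & 7));
}

/* True if at least one class bit is set; OR-ing the bytes together
 * avoids testing each bit individually. */
bool class_any_set(const LocaleAnyof *n)
{
    unsigned char acc = 0;
    for (int i = 0; i < CLASS_BITMAP_BYTES; i++)
        acc |= n->classflags[i];
    return acc != 0;
}
```

Because every locale node now carries the bitmap, a test of this kind replaces the flag bit that previously said whether the bitmap existed.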
|
| |
|
|
|
|
| |
Reorder #defines of bits so they are in numerical order
|
|
|
|
|
|
|
|
|
|
|
| |
ANYOF_RUNTIME() is no longer used, so can be removed.
I had long tried to figure out what the purpose of this was, and
discovered it really had none.
I think it must have had something to do with locales at one time. But
locales don't do well with utf8, and I don't know how to make it better.
In any event this wasn't actually accomplishing anything.
|
| |
|
|
|
|
|
| |
These two #defines now mean the same thing. Free up bit used by
ANYOF_LARGE
|
|
|
|
|
|
|
|
| |
This patch should remove a compiler warning that is currently only
showing up in one compiler. It declares a debug-only variable to be
volatile, so should silence the warning that it is getting clobbered.
Since this variable is only used for debugging purposes, when DEBUGGING
is defined, performance should not be an issue.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit causes regex sequences \b, \s, and \w (and complements) to
match in the latin1 range in the scope of feature 'unicode_strings' or
with the /u regex modifier.
It uses the previously unused flags field in the respective regnodes to
indicate the type of matching, and in regexec.c, uses that to decide
which of the handy.h macros to use, native or Latin1.
I chose this for now rather than create new nodes for each type of
match. An earlier version of this patch did that, and in every case the
switch case: statements were adjacent, offering no performance
advantage. If regexec were modified to use in-line functions or more
macros for various short sections of it, then it would be faster to have
new nodes rather than using the flags field. But, using that field
simplified things, as this change flies under the radar in a number of
places where it would not if separate nodes were used.
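
The following sketch shows the flags-field approach for \w on code points below 256. The enum and function names are hypothetical (the real code uses handy.h macros), "native" is taken here to mean ASCII-only, and an ASCII platform is assumed.

```c
#include <stdbool.h>

/* Hypothetical values stored in the regnode's flags field. */
typedef enum { MATCH_NATIVE, MATCH_LATIN1 } MatchType;

/* ASCII-only word character test. */
bool is_word_ascii(unsigned char c)
{
    return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z')
        || (c >= 'a' && c <= 'z') || c == '_';
}

/* Word character test using Unicode rules for the Latin1 range. */
bool is_word_latin1(unsigned char c)
{
    if (c < 0x80)
        return is_word_ascii(c);
    /* Letters above 0x7F: U+AA, U+B5, U+BA, and U+C0..U+FF
     * except the multiplication (U+D7) and division (U+F7) signs. */
    return c == 0xAA || c == 0xB5 || c == 0xBA
        || (c >= 0xC0 && c != 0xD7 && c != 0xF7);
}

/* One node type; the flags field picks which rules apply. */
bool node_matches_word(MatchType flags, unsigned char c)
{
    switch (flags) {
    case MATCH_LATIN1: return is_word_latin1(c);
    case MATCH_NATIVE: return is_word_ascii(c);
    }
    return false;
}
```

With this arrangement, 0xE9 (e with acute accent) matches \w only when the node's flags select Latin1 rules.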
|
| |
|
|
|
|
|
|
|
|
| |
The netbsd - 5.0.2 compiler pointed out that the recent changes to add
longjmps to speed up some regex compilations can result in clobbering a
few values. These depend on the compiled code, and so didn't show up in
other compilers' warnings. This patch reinitializes them after a
longjmp.
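
For readers unfamiliar with the hazard, here is a small standalone sketch; it is not the regcomp.c code, and all names in it are made up. In C, a non-volatile local that is modified between setjmp() and longjmp() has an indeterminate value afterwards, so it must either be declared volatile or be reinitialized once setjmp() returns.

```c
#include <setjmp.h>

static jmp_buf restart_point;

/* Hypothetical stand-in for a compile pass that bails out on the first try. */
static void compile_pass(int attempt)
{
    if (attempt == 0)
        longjmp(restart_point, 1);  /* ask the caller to start over */
}

/* Returns restarts * 10 + depth, so the caller can inspect both values. */
int compile_with_restart(void)
{
    volatile int restarts = 0;  /* volatile: must keep its value across longjmp */
    int depth;                  /* plain local: indeterminate after longjmp if modified */

    if (setjmp(restart_point) != 0)
        restarts++;             /* we arrived here through longjmp */

    depth = 0;                  /* reinitialized on every path, including after longjmp */
    depth++;
    compile_pass(restarts);
    return restarts * 10 + depth;
}
```

Reinitializing `depth` after the jump is the approach the commit message describes; `volatile` is the alternative for values that must survive it.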
|
|
|
|
|
| |
Previously the AV paren_name_list would "leak" until global destruction.
This was only an issue under -DDEBUGGING. Fixes RT #73438.
|
|
|
|
|
|
| |
Add a new flags column to regcomp.sym, with V if the node type is in PL_varies,
S if it is in PL_simple, and . if a placeholder is needed because subsequent
optional columns are present.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is an indirect fix for
[perl #74484] Regex causing exponential runtime+mem usage
The trie runtime code was doing more SAVETMPS than FREETMPS and was thus
growing a large tmps stack on heavy backtracking. Rather than fixing this
directly, I rewrote part of the trie code so that it no longer needs to
allocate memory in S_regmatch (it still does in find_byclass()).
The basic issue is that multiple branches in the trie may trigger an
accept state; for example:
"abcd" =~ /xyz|abcd.*X|ab.*Y|/
here, words (branches) 2 and 3 are accept states. The original approach
was, at run time, to create a list of accepted word numbers and the
character positions of the end of each of those words. Then run the rest
of the pattern for each word in the list in turn (in word index order).
This requires memory for the list to be allocated and freed.
The new approach involves creating extra info at compile time; in
particular, for each word, a pointer to the previous accepted word (if
any) in the state tree. For example for the above pattern, part of the
state tree may be
      a    b    c    d
    1 -> 2 -> 3 -> 4 -> 5
             (#3)      (#2)
(e.g. at state 1, if the next char is 'a', we transition to state 2).
Here, state 3 is an accept state with word #3, and 5 is an accept state
with word #2. So we build a table indexed by word number, which has
wordinfo[2] = 3, wordinfo[3] = 0, thus building the word chain 2->3->0.
At run time we run the trie to completion, and remember the word
associated with the longest accept state (word #2 above). Then by following
back the chain of .prev fields, we can produce a list of all accepting
words. We then iteratively find the smallest-numbered (ie LH-most) word in
the chain, and run with it. On failure and backtrack, we find the
next-smallest and so on.
Since we are no longer recording the end-position of each word in the
string, we have to recalculate this for each backtrack. We initially
record the end-position of the shortest accepting word, and given that we
know the length of each word, we can calculate the new position each time
as an offset from that first word. Depending on unicode and folding, that
calculation can be cheap or expensive.
This algorithm is optimised for the typical case where there are a small
number (<= 2) of accepting states.
This patch creates a new compile-time array, trie->wordinfo[], indexed by
word number, which contains relevant info about each word. This also
supersedes the old trie->newword[] array, whose function of recording
"overspills" of multiple words per accept state, is now handled as part of
the wordinfo[].prev chain.
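
The chain walk described above can be sketched as follows. This is an illustration under assumptions, not the regexec.c implementation: the struct holds only a `prev` field, word 0 means "no word", and the function name is invented.

```c
/* Hypothetical cut-down version of the per-word compile-time info. */
typedef struct {
    unsigned prev;  /* previous accepting word in the state tree, 0 if none */
} WordInfo;

/* Walk the .prev chain starting at `head` (the word of the longest accept
 * state) and return the smallest word number greater than `after`.
 * Returns 0 when every word in the chain has already been tried. */
unsigned next_word(const WordInfo *info, unsigned head, unsigned after)
{
    unsigned best = 0;
    for (unsigned w = head; w != 0; w = info[w].prev) {
        if (w > after && (best == 0 || w < best))
            best = w;
    }
    return best;
}
```

For the example pattern, the chain is 2->3->0 and the head is word 2. The first call, with `after` set to 0, yields word 2. On backtracking, a call with `after` set to 2 yields word 3, and the next call yields 0, meaning no candidates remain. No list is allocated at run time; each backtrack simply rewalks the chain, which is cheap when the chain is short.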
|
|
|
|
|
|
|
|
| |
revert ba9ac1759cb6e7a5e6883c85edd0b450061b5ccb
Changing the semantics of \w \s and \d breaks too much
and Jesse wants to do a rollout. This disables the new
semantics until we can get all the details worked out.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
class matching
This also alters which Unicode properties the POSIX character
classes and the Perl "special" character classes, like \w and \d, map
to. At the same time it allows a number of tests for POSIX character
class behaviour to be switched from todo to non todo. Legacy testing
is still available by changing the define and setting the
PERL_TEST_LEGACY_POSIX_CC value to true.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
the regex engine)
Perlbug #60156 and #49302 (and probably others) resolve down to the problem
that the definition of \s and \w and \d and the POSIX charclasses are different
for unicode strings and for non-unicode strings. This broke the character class
logic in the regex engine. The easiest fix to make the character class logic sane
again is to define new properties which do match.
This change creates new property classes that can be used instead of the
traditional ones (it does not change the previously defined ones). If the
define in regcomp.h:
#define PERL_LEGACY_UNICODE_CHARCLASS_MAPPINGS 1
is changed to 0, then the new mappings will be used. This will fix a bunch
of bugs that are reported as TODO items in the new reg_posixcc.t test file.
p4raw-id: //depot/perl@34769
|
|
|
|
|
|
|
|
|
|
|
| |
* Make ANYOF output from regprop easier to read by adding ][ in between the unicode representation and the "ascii" one
* Make it possible to make tests in re_tests todo.
* add a todo test for a complementary character class match that should fail (perl #60156)
* Also add a comment explaining a previous commit (relating to perl #60344)
p4raw-id: //depot/perl@34755
|
|
|
|
|
|
| |
and regexp reference counting is via the regular SV reference counting.
This was not as easy as it looks.
p4raw-id: //depot/perl@32804
|
|
|
|
|
|
|
| |
regcomp.c and regexec.c RXp_* where necessary] so that in future we
can maintain source compatibility when we add an extra level of
dereferencing.
p4raw-id: //depot/perl@32802
|
|
|
| |
p4raw-id: //depot/perl@32793
|
|
|
| |
p4raw-id: //depot/perl@32237
|