| Commit message | Author | Age | Files | Lines |
| |
Mostly in comments and docs, but some in diagnostic messages and one
case of 'or die die'.
|
| |
The regex_sets feature cannot yet handle named sequences possibly
returned by \p{name=...}. I forgot to check for this possibility, which
led to a null pointer dereference. Also, the called function was
returning success when it should have failed in this circumstance.
This fixes #17732.
|
| |
It turns out that the SV returned by re_intuit_string() may be freed by
future calls to re_intuit_start(). Thus, the caller doesn't get clear
title to the returned SV. (This wasn't documented until the
commit immediately prior to this one.)
Cope with this situation by making a mortalized copy. This commit also
changes to use the copy's PV directly, simplifying some 'if' statements.
re_intuit_string() is effectively in the API, as it is an element in the
regex engine structure, callable by anyone. It should not be returning
a tenuous SV. That returned scalar should not be freed before the pattern
it is for is freed. It is too late in the development cycle to change
this, so this workaround is presented instead for 5.32.
This fixes #17734.
|
| |
The dubious '((*ACCEPT)0)*' construct resulted in is_inf being false on
the one hand, but on the other in pos_delta being set to OPTIMIZE_INFTY.
|
| |
Commits 3bc2a7809d and bdb91f3f96 used the existence of a frame to
decide when it was unsafe to mutate the regexp program, due to having
recursed for a GOSUB. However the frame recursion mechanism is also
used for SUSPEND.
Refine it further to avoid mutation only when within a GOSUB by saving
a new boolean in the frame structure, and using that to derive a "mutate_ok"
flag.
|
| |
gh16947: the outer frame may be in the middle of looking at the part
of the program we would rewrite. Let the outer frame deal with it.
|
| |
What this code was doing in the presence of an illegally large (or
small) relative capturing group number was to set it to about the
furthest away from zero it could get, and silently carry on, whereupon
it likely overflowed a few lines down.
Instead die immediately with a proper message.
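The fix described above, resolving a relative group reference with a bounds check and dying rather than clamping, can be sketched as follows. This is a hedged Python illustration of the C logic, not the engine's actual code; the helper name `resolve_group` is hypothetical.

```python
def resolve_group(n_groups_so_far, rel):
    """Resolve a relative backreference such as \\g{-1} to an absolute
    group number; die on out-of-range values instead of clamping."""
    if rel >= 0:
        raise ValueError("relative reference must be negative")
    abs_idx = n_groups_so_far + rel + 1  # \g{-1} is the most recent group
    if abs_idx < 1:
        # die immediately with a proper message, rather than carrying on
        raise ValueError("Reference to nonexistent group")
    return abs_idx
```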
|
| |
This was resulting in C undefined behavior reported by asan. Check that
the operation won't overflow before doing it; then die instead of going
ahead anyway.
This fixes #17593.
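The check-before-operate pattern described here can be sketched in Python mimicking C 32-bit signed arithmetic. This is an illustrative sketch, not the engine's code; the function name and the digit-accumulation context are assumptions.

```python
INT_MAX = 2**31 - 1  # mimic a C 32-bit signed int

def parse_number(s):
    """Accumulate decimal digits, verifying *before* each step that
    n * 10 + d cannot exceed INT_MAX.  The C code must not perform the
    overflowing operation at all, since signed overflow is undefined
    behavior; it dies instead of going ahead anyway."""
    n = 0
    for ch in s:
        d = ord(ch) - ord('0')
        if not 0 <= d <= 9:
            raise ValueError("not a digit")
        if n > (INT_MAX - d) // 10:
            raise OverflowError("number too large")
        n = n * 10 + d
    return n
```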
|
| |
As spotted by Hugo van der Sanden, a backwards reference to a capturing
group doesn't need to look forward instead. If we don't have a
capturing group we already should have, looking forward won't magically
produce it.
|
| |
Spotted by Dagfinn Ilmari Mannsåker
|
| |
Having both ss and \xdf in a string caused the node type to be changed
back to a wrong one.
This fixes #17486.
|
| |
The algorithm for dealing with Unicode property wildcards is to wrap the
user-supplied pattern with /miaa. We don't want the user to be able to
override the /m and /aa parts. Modifiers that are only specifiable as a
modifier in a qr or similar op (like /gc) can't be included in things
like (?gc). These normally incur a warning that they are ignored, but
the texts of those warnings are misleading when using wildcards, so I
chose to just make them illegal. Of course that could be changed to
having custom useful warning texts, but I didn't think it was worth it.
I also chose to forbid recursion of using nested \p{}, just from fear
that it might lead to issues down the road, and it really isn't useful
for this limited universe of strings to match against. Because
wildcards currently can't handle '}' inside them, only the single-letter
forms \p, \P are valid anyway.
Similarly, I forbid the '*' quantifier to make it harder for the
constructed subpattern to take forever to make any progress and decide
to halt. Again, using it would be overkill on the universe of possible
match strings.
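The wrap-and-restrict approach described above can be sketched with Python's re module. This is a loose analogue, not the Perl implementation: Python has no /aa, so only a fixed (?m:...) wrapper is shown, and the rejection checks and the function name are assumptions for illustration.

```python
import re

def compile_wildcard(user_pat):
    """Wrap a user-supplied wildcard subpattern in fixed modifiers the
    user cannot override, after rejecting forbidden constructs."""
    if re.search(r'\\[pP]', user_pat):
        # nested \p{} is forbidden rather than warned about
        raise ValueError(r"nested \p{} not allowed in wildcard subpatterns")
    if '*' in user_pat:
        # '*' is forbidden to limit patterns that never make progress
        raise ValueError("'*' quantifier not allowed in wildcard subpatterns")
    # fixed modifiers wrap the user pattern (Perl uses /miaa)
    return re.compile(f"(?m:{user_pat})")
```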
|
| |
A set operations expression can contain a previously-compiled one
interpolated in. Prior to this commit, some heuristics were employed
to verify it actually was such a thing, and not a sort of look-alike
that wasn't necessarily valid. The heuristics actually forbade legal
ones. I don't know of any illegal ones that were let through, but it is
certainly possible. Also, the error/warning messages referred to the
heuristics, and were unhelpful at best.
The technique used instead in this commit is to return a regop only used
by this feature for any nested compilations. This guarantees that the
caller can determine if the result is valid, and what that result is
without having to do any heuristics or inspecting any flags. The
error/warning messages are changed to reflect this, and I believe are
now helpful.
This fixes the bugs in #16779
https://github.com/Perl/perl5/issues/16779#issuecomment-563987618
|
| |
This accomplishes the same thing as \N{...}, but only for regex
patterns, using loose matching and only the official Unicode names.
This commit includes a comparison of the two approaches, added to
perlunicode. But the real reason to do this is as a way station to
being able to specify wild card lookup on the name property, coming in a
later commit.
I chose to not include user-defined aliases nor :short character names
at this time. I thought that there might be unforeseen consequences of
using them. It's better to relax a requirement later than to have to
tighten one after the fact.
|
| |
It wasn't intended to be part of the recursion logic, and doesn't get
decremented again (GH 17490).
|
| |
This changes warning messages for too short \0 octal constants to use
the function introduced in the previous commit. This function ensures a
consistent and clear warning message, which is slightly different from
the one this commit replaces. I know of no CPAN code which depends on
this warning's wording.
|
| |
These tests are meant to generate warnings that the affected code is not
portable between ASCII and EBCDIC systems. But the check was being too
picky.
Code points above 255 are the same on both systems, so the warning
shouldn't be generated for those.
|
| |
All the other messages raised when a construct is expecting a
terminating '}' but none is found include the '}' in the message. '\o{'
did not. Since these diagnostics are getting revised anyway, and I
didn't find any CPAN modules relying on the wording, this commit makes
the messages consistent by adding the '}' to the \o message.
|
| |
This commit causes these functions to allow a caller to request any
messages generated to be returned to the caller, instead of always being
handled within these functions. The messages are somewhat changed from
previously to be clearer. I did not find any code in CPAN that relied
on the previous message text.
As with the previous commit for grok_bslash_c, there are two reasons to
do this, repeated here:
1) In pattern compilation this brings these messages into conformity
with the other ones that get generated in pattern compilation, where
there is a particular syntax, including marking the exact position in
the parse where the problem occurred.
2) These could generate truncated messages due to the (mostly)
single-pass nature of pattern compilation that is now in effect. It
keeps track of where during a parse a message has been output, and
won't output it again if a second parsing pass turns out to be
necessary. Prior to this commit, it had to assume that a message
from one of these functions did get output, and this caused some
out-of-bounds reads when a subparse (using a constructed pattern) was
executed. The possibility of those went away in commit 5d894ca5213,
which guarantees it won't try to read outside bounds, but that may
still mean it is outputting text from the wrong parse, giving
meaningless results. This commit should stop that possibility.
|
| |
This commit causes this function to allow a caller to request any
messages generated to be returned to the caller, instead of always being
handled within this function.
As with the previous commit for grok_bslash_c, there are two reasons to
do this, repeated here:
1) In pattern compilation this brings these messages into conformity
with the other ones that get generated in pattern compilation, where
there is a particular syntax, including marking the exact position in
the parse where the problem occurred.
2) The messages could be truncated due to the (mostly) single-pass
nature of pattern compilation that is now in effect. It keeps track
of where during a parse a message has been output, and won't output
it again if a second parsing pass turns out to be necessary. Prior
to this commit, it had to assume that a message from one of these
functions did get output, and this caused some out-of-bounds reads
when a subparse (using a constructed pattern) was executed. The
possibility of those went away in commit 5d894ca5213, which
guarantees it won't try to read outside bounds, but that may still
mean it is outputting text from the wrong parse, giving meaningless
results. This commit should stop that possibility.
|
| |
These are ok, perhaps accidentally, but shouldn't accidentally become
not ok.
|
| |
This appears to be a difference in how shells run. On many of H. Merijn
Brand's boxes, a test is failing. By avoiding the shell by using
fresh_perl instead of runperl, it succeeds there, without breaking
elsewhere.
|
| |
We weren't handling NOTHING regops that were not followed
by a trieable type in the trie code.
|
| |
This turned out to be because there are two versions of the property
name being parsed: 1) the original input; and 2) a canonicalized one
with characters squeezed out that are usually optional, such as spaces,
dashes and, here, underscores.
The code was conflating the two names, and moving along the squeezed
name based on counts from the unsqueezed one, hence going too far in the
buffer.
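The squeezed-vs-original distinction can be illustrated with a small Python sketch. This is a hedged analogue of the loose-matching canonicalization, not Perl's code; the helper names are hypothetical. The point of the bug is that the two strings have different lengths, so an offset valid in one is not valid in the other.

```python
def squeeze(name):
    """Canonicalize a name the way loose matching does: drop characters
    that are usually optional (spaces, dashes, underscores)."""
    return name.replace(' ', '').replace('-', '').replace('_', '').lower()

def loose_eq(a, b):
    # compare the canonicalized forms; the parser must keep separate
    # cursors for the original and squeezed buffers
    return squeeze(a) == squeeze(b)
```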
|
| |
This commit:
commit 0cd59ee9ca0f0af3c0c172ecc27bb3f02da6db08
Author: Karl Williamson <khw@cpan.org>
AuthorDate: Fri Sep 6 10:23:26 2019 -0600
Commit: Karl Williamson <khw@cpan.org>
CommitDate: Mon Nov 11 21:05:13 2019 -0700
t/re/regexp.t: Only convert to EBCDIC once
Some tests get added as we go along, and those added tests have already
been converted to EBCDIC if necessary. Don't reconvert, which messes
things up.
caused a huge slowdown in regex tests. The most noticeable on my
platform was regexp_qr_embed_thr.t which doubled in wall clock time
spent.
It turns out that it was because a function was now always being called,
and that does nothing on ASCII platforms besides return its argument,
which then was copied over the argument.
This new commit causes the function to be a constant { 1; } on ASCII
platforms, so should be completely optimized out, returning the time
spent in that .t to 5.30 levels.
|
| |
This was caused by a character being counted as both the first delimiter
of a pattern, and the final one, which led to the pattern's length being
negative, which was turned into a very large unsigned number.
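The wraparound described above can be demonstrated in Python by emulating what a C size_t does with a negative value. The function names are illustrative, not the engine's.

```python
SIZE_MAX = 2**64 - 1

def c_size_t(n):
    """What a 64-bit C size_t ends up holding when assigned this value."""
    return n % (1 << 64)

def pattern_length(start, end):
    # counting one character as both the opening and closing delimiter
    # makes end < start; the "negative" length wraps to a huge number
    return c_size_t(end - start)
```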
|
| |
Like GH #17367, this was caused by a failure to check that we aren't at
the end of the buffer after advancing the ptr to it.
|
| |
This is a bug in grok_infnan() in which in one place it failed to check
that it was reading within bounds.
|
| |
And rearrange so it is easier to see the correct value.
|
| |
These are fuzzer-generated, and don't translate well to EBCDIC.
|
| |
The experimental feature that allows wildcard subpatterns in finding
Unicode properties is supposed to allow only ASCII punctuation for
delimiters. But if you preceded the delimiter by a backslash, the
check was skipped. This commit fixes that.
It may be that we will eventually want to loosen the restriction and
allow a wider range of delimiters. But until we have valid use-cases
that would push us in that direction, I don't want to get into
supporting stuff that we might later regret, such as invisible
characters for delimiters. This feature is not really required for
programs to work, so I don't view it as necessary to be as general as
possible.
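The delimiter restriction, applied whether or not the delimiter was backslash-escaped, can be sketched in Python. This is an illustrative analogue; the function name and the use of string.punctuation (which is exactly the ASCII punctuation set) are assumptions.

```python
import string

def check_wildcard_delimiter(ch, escaped=False):
    """Enforce the ASCII-punctuation restriction on wildcard delimiters.
    The check must apply even when the delimiter was preceded by a
    backslash (the 'escaped' case was the one previously skipped)."""
    if ch not in string.punctuation:
        raise ValueError("wildcard delimiter must be ASCII punctuation")
    return ch
```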
|
| |
Prior to this patch, they only sometimes overrode.
|
| |
Any skip inside this loop was skipping over the entire loop,
not just the current iteration; from the conditions tested in
the loop this seemed incorrect (besides messing up the test count).
|
| |
It turns out that one isn't supposed to fill in the offset to the next
regnode at node creation time. And this node is like EXACTish, so the
string stuff isn't accounted for in its regcomp.sym definition
|
|
| |
Previously we were ignoring this possibility. Suppose a pattern being
compiled under /il contains 'SS', and that it so happens that a regnode
becomes filled with the first 'S', so that the next regnode would begin
with the second one. If at runtime, the locale is UTF-8, the pattern
should match a LATIN SHARP S. Until this commit, it wouldn't.
The commit just extends the current mechanism used in this situation (of
a filled regnode) for non-/l patterns.
If the locale isn't a UTF-8 one, the 'SS' sequence shouldn't match the
SHARP S, and it won't, but we have to construct the node so that it can
handle the UTF-8 case.
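Why 'SS' split across two regnodes must still be able to match one character: full Unicode case folding is multi-character, with LATIN SMALL LETTER SHARP S folding to the two characters 'ss'. A small Python demonstration (Python's str.casefold implements full case folding):

```python
def fold_match(a, b):
    """Compare two strings under full Unicode case folding, where a
    single sharp s matches the two-character sequence 'ss'."""
    return a.casefold() == b.casefold()
```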
|
| |
This node is like ANYOFHb, but is used when more than one leading byte
is the same in all the matched code points.
ANYOFHb is used to avoid having to convert from UTF-8 to code point for
something that won't match. It checks that the first byte in the UTF-8
encoded target is the desired one, thus ruling out most of the possible
code points.
But for higher code points that require longer UTF-8 sequences, many
non-matching code points pass this filter; for code points above 0xFFFF
it is ineffective for almost 200K of them.
This commit creates a new node type that addresses this problem.
Instead of a single byte, it stores as many leading bytes as are the
same for all code points that match the class. For many classes, that
will cut down the number of possible false positives by a huge amount
before having to convert to code point to make the final determination.
This regnode adds a UTF-8 string at the end. It is still much smaller,
even in the rare worst case, than a plain ANYOF node because the maximum
string length, 15 bytes, is still shorter than the 32-byte bitmap that
is present in a plain ANYOF. Most of the time the added string will
instead be at most 4 bytes.
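The shared-leading-bytes idea can be sketched in Python: compute the common UTF-8 prefix of a class at compile time, then use a cheap byte-prefix test as a filter before decoding the target to a code point. Function names here are illustrative, not the engine's.

```python
def common_utf8_prefix(code_points):
    """Longest byte prefix shared by the UTF-8 encodings of all the
    code points the class matches (computed once, at compile time)."""
    encs = [chr(cp).encode("utf-8") for cp in code_points]
    prefix = encs[0]
    for e in encs[1:]:
        i = 0
        while i < min(len(prefix), len(e)) and prefix[i] == e[i]:
            i += 1
        prefix = prefix[:i]
    return prefix

def might_match(prefix, target_bytes):
    # cheap filter: rule out most candidates before any UTF-8 decoding
    return target_bytes.startswith(prefix)
```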
|
| |
This is like the ANYOFR regnode added in the previous commit, but all
code points in the range it matches are known to have the same first
UTF-8 start byte. That means it can't match UTF-8 invariant characters,
like ASCII, because the "start" byte is different on each one, so it
could only match a range of 1, and the compiler wouldn't generate this
node for that, instead using an EXACT node.
Pattern matching can rule out most code points by looking at the first
character of their UTF-8 representation, before having to convert from
UTF-8.
On ASCII platforms this simple comparison narrows 2-byte UTF-8 down to
at most 64 possible characters; for 3-byte it's up to 4096, and for
4-byte, 2**18, so the test is less effective for higher code points.
I believe that most UTF-8 patterns that otherwise would compile to
ANYOFR will instead compile to this, as I can't envision real life
applications wanting to match large single ranges. Even the 2048
surrogates all have the same first byte.
|
| |
This matches a single range of code points. It is both faster and
smaller than other ANYOF-type nodes, requiring, after set-up, a single
subtraction and conditional branch.
The vast majority of Unicode properties match a single range (though
most of the properties likely to be used in real world applications have
more than a single range). But things like [ij] are a single range, and
those are quite commonly encountered. This new regnode matches them more
efficiently than a bitmap would, and doesn't require the space for one
either.
The flags field is used to store the minimum matchable start byte for
UTF-8 strings, and is ignored for non-UTF-8 targets. This, like ANYOFH
nodes which have a similar mechanism, allows for quick weeding out of
many possible matches without having to convert the UTF-8 to its
corresponding code point.
This regnode packs the 32-bit argument with 20 bits for the minimum code
point the node matches, and 12 bits for the length of the range. If the
input doesn't fit within these limits, it simply won't compile to this
regnode, instead going to one of the ANYOFH flavors.
ANYOFR is sufficient to match all of Unicode except for the final
(private use) 65K plane.
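The packing and the match test can be sketched as follows. The exact bit layout (low 20 bits for the minimum, high 12 for the range length) is an assumption consistent with the description above, not necessarily the engine's; in C the match is one unsigned subtraction and a compare.

```python
def pack_anyofr(lo, hi):
    """Pack an inclusive code point range into one 32-bit argument, or
    return None if it doesn't fit (an ANYOFH flavor would be used)."""
    length = hi - lo
    if lo >= (1 << 20) or length >= (1 << 12):
        return None
    return (length << 20) | lo

def anyofr_match(arg, cp):
    lo = arg & ((1 << 20) - 1)
    length = arg >> 20
    # one subtraction and one comparison (unsigned wraparound in C
    # handles cp < lo; Python needs the explicit lower bound)
    return 0 <= cp - lo <= length
```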
|
| |
This makes sure a non-folding above-Latin1 character is tested.
|
| |
ANYOFH nodes (that match code points above 255) are smaller than regular
ANYOF nodes because they don't have a 256-bit bitmap. But the
disadvantage of them over EXACT nodes is that the characters encountered
must first be converted from UTF-8 to code point to see if they match
the ANYOFH. (The difference is less clearcut with /i, because typically,
currently, the UTF-8 must be converted to code point anyway in order to
fold them.) But an EXACTFish node occupies less space and requires no
inversion list lookup, as it has no inversion list data attached to it.
Also there is a bug in using ANYOFH under /l, as wide character warnings
should be emitted if the locale isn't a UTF-8 one.
The reason this change hasn't been made before (by me anyway) is that
the old way avoided upgrading the pattern to UTF-8. But having thought
about this for a long time, to match this node, the target string must
be in UTF-8 anyway, and having a UTF8ness mismatch slows down pattern
matching, as things have to be continually converted, and reconverted
after backtracking.
|
| |
Previously we had infinity minus 1, but infinity should be beyond the
range, and the highest isn't infinity - 1, but the highest legal code
point.
|
| |
These are covered by the single code point tests.
|
| |
One shouldn't be able to specify an infinite code point. The tests have
the conceit that one can specify a range's upper limit as infinity, but
that is just shorthand for the range being unbounded.
|
| |
When a test fails, an extra test is run to output debugging info; this
will cause the planned number of tests to be wrong, which will output an
extra, confusing message. This adds an explanation that the number is
expected to be wrong, hence not to worry.
|