| Commit message (Collapse) | Author | Age | Files | Lines |
... | |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
See also perl5porters thread titled: "Perl MBOLism in regex engine"
In the perl 5.000 release (a0d0e21ea6ea90a22318550944fe6cb09ae10cda)
the BOL regop was split into two behaviours MBOL and SBOL, with SBOL
and BOL behaving identically. Similarly the EOL regop was split into
two behaviors SEOL and MEOL, with EOL and SEOL behaving identically.
This then resulted in various duplicative code related to flags and
case statements in various parts of the regex engine.
It appears that perhaps BOL and EOL were kept because they are the
type ("regkind") for SBOL/MBOL and SEOL/MEOL/EOS. Reworking regcomp.pl
to handle aliases for the type data so that SBOL/MBOL are of type
BOL, even though BOL == SBOL seems to cover that case without adding
to the confusion.
This means two regops, a regstate, and an internal regex flag can
be removed (and used for other things), and various logic relating
to them can be removed.
For the uninitiated, SBOL is /^/ and /\A/ (with or without /m) and
MBOL is /^/m. (I consider it a fail we have no way to say MBOL without
the /m modifier). Similarly SEOL is /$/ and MEOL is /$/m (there is
also a /\z/ which is EOS "end of string" with or without the /m).
|
|
|
|
|
|
|
|
|
|
|
|
| |
overrun-local: Overrunning array PL_reg_intflags name of 14 8-byte elements at element index 31 (byte offset 248) using index bit (which evaluates to 31).
Needed compile-time limits for the PL_reg_intflags_name so that the
bit loop doesn't waltz off past the array. Could not use C_ARRAY_LENGTH
because the size of name array is not visible during compile time
(only const char*[] is), so modified regcomp.pl to generate the size,
made it visible only under DEBUGGING. Did extflags analogously
even though its size currently exactly 32 already. The sizeof(flags)*8
is extra paranoia for ILP64.
|
|
|
|
|
|
| |
The term 'semantics' in documentation when applied to character sets is
changed to 'rules' as being a shorter less-jargony synonym in this case.
This was discussed several releases ago, but I didn't get around to it.
|
|
|
|
|
| |
This reverts commit 34fdef848b1687b91892ba55e9e0c3430e0770f6, and
adds comments referring to it, in case it is ever needed.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit frees up a bit by using an extra regnode to pass the
information to the regex engine instead of the flag. I originally
thought that if this was needed, it should be the ANYOF_ABOVE_LATIN1_ALL
bit, as that might speed some things up. But if we need to do this
again by adding another node to get another bit, we want one that is
mutually exclusive of the first one we did, For otherwise we start
having to make 3 nodes instead of two to get the combinations:
1 0
0 1
1 1
This combinatorial problem is avoided by using bits that are mutually
exclusive, which the ABOVE_LATIN1_ALL isn't, but the one freed by this
commit ANYOF_NON_UTF8_NON_ASCII_ALL is only set under /d matching, and
there are other bits that are set only under /l, so if we need to do
this again, we should use one of those.
I wrote this code when I thought I really needed a bit. But since, I
have figured out a better way to get the bit needed now. But I don't
want to lose this code to posterity, so this commit is being made long
enough to get the commit number, then it will be reverted, adding
comments referring to the commit number, so that it can easily be
reconstructed when necessary.
|
|
|
|
|
|
|
|
|
| |
The flag tells us that a pattern may match an infinitely long string.
The new member in the regexp struct tells us how long the string might
be.
With these two items we can implement regexp based $/
|
|
|
|
|
|
|
|
|
|
| |
RXf_IS_ANCHORED as a replacement
The only requirement outside of the regex engine is to identify that there is
an anchor involved at all. So we move the 4 anchor flags to intflags and replace
it with a single aggregate flag RXf_IS_ANCHORED in extflags.
This frees up another 3 bits in extflags.
|
|
|
|
|
|
|
|
| |
This required removing the RXf_GPOS_CHECK mask as it uses one flag
that will stay in extflags for now (RXf_ANCH_GPOS), and one flag that
moves to intflags (RXf_GPOS_SEEN). This mask is strange however, as
you cant have RXf_ANCH_GPOS without having RXf_GPOS_SEEN so I dont
know why we test both. Further investigation required.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The flag bits in regular expression ANYOF nodes are perennially in short
supply. However there are still plenty of regex nodes possible. So one
solution to needing to pass more information is to create a node that
encapsulates what is needed. That is what commit
9aa1e39f96ac28f6ce5d814d9a1eccf1464aba4a did to tell regexec.c that a
particular ANYOF node is for the synthetic start class (SSC). However
this solution introduces other issues. If you have to express two
things, then you need a regnode for A, a regnode for B, a regnode for
both A and B, and another regnode for both not A nor B; With three
things, you need 8 regnodes to express all possible combinations. This
becomes unwieldy to write code for. The number of combinations goes way
down if some of them are mutually exclusive. At the time of that
commit, I thought that a SSC need not ever warn if matching against an
above-Unicode code point. I was wrong, and that has been corrected
earlier in the 5.19 series.
But it finally came to me how to tell regexec that an ANYOF node is
for the SSC without taking up a flag bit and without requiring a regnode
type. The 'next_off' field in a regnode tells the engine the offeset in
the regex program to the node it's supposed to go to after processing
this one. Since the SSC stands alone, its 'next_off' field is unused,
and we can put anything we want in it. That, however, is not true of
other ANYOF regnodes. But it turns out that there are certain values
that will never be legitimate in the 'next_off' field in these, and so
this commit uses one of those to signal that this ANYOF field is an SSC.
regnodes come in various sizes, and the offset is in terms of how many
of the smallest ones are there to the next node to look at. Since ANYOF
nodes are large, the offset is always > 1, and so this commit uses 1 to
indicate an SSC.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Prior to this commit, there were 3 types of ANYOF nodes; now there are
two: regular, and one for the synthetic start class (ssc). This commit
converted the third type dealing with warning about matching \p{}
against non-Unicode code points, into using the spare flag bit for ANYOF
nodes.
This allows this bit to apply to ssc ANYOF nodes, whereas previously it
couldn't. There is a bug in which the warning isn't raised if the match
is rejected by the optimizer, because of this inability. This bug will
be fixed in a later commit.
Another option would have been to create a new node-type which was an
ANYOF_SSC_WARN_SUPER node. But this adds extra complications to things;
and we have a spare bit that we might as well use. The comments give
better possibilities for freeing up 2 bits should they be needed.
|
|
|
|
|
|
|
|
|
| |
This adds code so that tries can be formed under /iaa, which formerly
weren't handled. A problem occurs when the string contains the LATIN
SMALL LETTER SHARP S when the regex pattern is not UTF-8 encoded. I
tried several ways to get this to work easily, but ended up deciding it
was too hard, to in this one situation, a new regnode is created to
prevent the trie code from even trying to turn it into a trie.
|
|
|
|
|
| |
The previous commit fixed things up so that this work-around regnode
doesn't have to exist; nor the work around for the EXACTFU_SS regnode
|
|
|
|
| |
"aaabc" should match /a+?(*THEN)bc/ with "abc".
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch resolves several issues at once. The parts are
sufficiently interconnected that it is hard to break it down
into smaller commits. The tickets open for these issues are:
RT #94490 - split and constant folding
RT #116086 - split "\x20" doesn't work as documented
It additionally corrects some issues with cached regexes that
were exposed by the split changes (and applied to them).
It effectively reverts 5255171e6cd0accee6f76ea2980e32b3b5b8e171
and cccd1425414e6518c1fc8b7bcaccfb119320c513.
Prior to this patch the special RXf_SKIPWHITE behavior of
split(" ", $thing)
was only available if Perl could resolve the first argument to
split at compile time, meaning under various arcane situations.
This manifested as oddities like
my $delim = $cond ? " " : qr/\s+/;
split $delim, $string;
and
split $cond ? " ", qr/\s+/, $string
not behaving the same as:
($cond ? split(" ", $string) : split(/\s+/, $string))
which isn't very convenient.
This patch changes this by adding a new flag to the op_pmflags,
PMf_SPLIT which enables pp_regcomp() to know whether it was called
as part of split, which allows the RXf_SPLIT to be passed into run
time regex compilation. We also preserve the original flags so
pattern caching works properly, by adding a new property to the
regexp structure, "compflags", and related macros for accessing it.
We preserve the original flags passed into the compilation process,
so we can compare when we are trying to decide if we need to
recompile.
Note that this essentially the opposite fix from the one applied
originally to fix #94490 in 5255171e6cd0accee6f76ea2980e32b3b5b8e171.
The reverted patch was meant to make:
split( 0 || " ", $thing ) #1
consistent with
my $x=0; split( $x || " ", $thing ) #2
and not with
split( " ", $thing ) #3
This was reverted because it broke C<split("\x{20}", $thing)>, and
because one might argue that is not that #1 does the wrong thing,
but rather that the behavior of #2 that is wrong. In other words
we might expect that all three should behave the same as #3, and
that instead of "fixing" the behavior of #1 to be like #2, we should
really fix the behavior of #2 to behave like #3. (Which is what we did.)
Also, it doesn't make sense to move the special case detection logic
further from the regex engine. We really want the regex engine to decide
this stuff itself, otherwise split " ", ... wouldn't work properly with
an alternate engine. (Imagine we add a special regexp meta pattern that behaves
the same as " " does in a split /.../. For instance we might make
split /(*SPLITWHITE)/ trigger the same behavior as split " ".
The other major change as result of this patch is it effectively
reverts commit cccd1425414e6518c1fc8b7bcaccfb119320c513, which
was intended to get rid of RXf_SPLIT and RXf_SKIPWHITE, which
and free up bits in the regex flags structure.
But we dont want to get rid of these vars, and it turns out that
RXf_SEEN_LOOKBEHIND is used only in the same situation as the new
RXf_MODIFIES_VARS. So I have renamed RXf_SEEN_LOOKBEHIND to
RXf_NO_INPLACE_SUBST, and then instead of using two vars we use
only the one. Which in turn allows RXf_SPLIT and RXf_SKIPWHITE to
have their bits back.
|
|
|
|
| |
In preparation for future changes.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This creates a regnode specifically for the synthetic start class, which
is a type of ANYOF node. The flag bit previously used to denote this is
removed. This paves the way for this bit to be freed up, but first the
other use of this bit must also be removed, which will be done in the
next commit.
There are now three ANYOF-type regnodes. This one should be called only
in one place in regexec.c. The other special one is ANYOF_WARN_SUPER.
A synthetic start class node should not do any warning, so there is no
issue of having something need to be both types.
|
|
|
|
|
|
|
| |
This uses a regnode type, of which we have many available, to free up
a bit in the ANYOF regnode flag field, of which we have none, and are
trying to have the same bit do double duty. This will enable us to
remove some of that double duty in the next commit.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The regular rexpression operation POSIXA works on any of the (currently)
16 posix classes (like \w and [:graph:]) under the regex modifier /a.
This commit creates similar operations for the other modifiers: POSIXL
(for /l), POSIXD (for /d), POSIXU (for /u), plus their complements.
It causes these ops to be generated instead of the ALNUM, DIGIT,
HORIZWS, SPACE, and VERTWS ops, as well as all their variants. The net
saving is 22 regnode types.
The reason to do this is for maintenance. As of this commit, there are
now 22 fewer node types for which code has to be maintained. The code
for each variant was essentially the same logic, but on different
operands. It would be easy to make a change to one copy and forget to
make the corresponding change in the others. Indeed, this patch fixes
[perl #114272] in which one copy was out of sync with others.
This patch actually reduces the number of separate code paths to 5:
POSIXA, NPOSIXA, POSIXL, POSIXD, and POSIXU. The complements of the
last 3 use the same code path as their non-complemented version, except
that a variable is initialized differently. The code then XORs this
variable with its result to do the complementing or not. Further, the
POSIXD branch now just checks if the target string being matched is
UTF-8 or not, and then jumps to either the POSIXU or POSIXA code
respectively. So, there are effectively only 4 cases that are coded:
POSIXA, NPOSIXA, POSIXL, and POSIXU. (POSIXA doesn't have to worry
about UTF-8, while NPOSIXA does, hence these for efficiency are coded
separately.)
Removing all this code saves memory. The output of the Linux size
command shows that the perl executable was shrunk by 33K bytes on my
platform compiled under -O0 (.7%) and by 18K bytes (1.3%) under -O2.
The reason this patch was doable was previous work in numbering the
POSIX classes, so that they could be indexed in arrays and bit
positions. This is a large patch; I didn't see how to break it into
smaller components.
I chose to make this code more efficient as opposed to saving even more
memory. Thus there is a separate loop that is jumped to after we know
we have to load a swash; this just saves having to test if the swash is
loaded each time through the loop. I avoid loading the swash until
absolutely necessary. In places in the previous version of this code,
the swash was loaded when the input was UTF-8, even if it wasn't yet
needed (and might never be if the input didn't contain anything above
Latin1); apparently to avoid the extra test per iteration.
The Perl test suite runs slightly faster on my platform with this patch
under -O0, and the speeds are indistinguishable under -O2. This is in
spite of these new POSIX regops being unknown to the regex optimizer
(this will be addressed in future commits), and extra machine
instructions being required for each character (the xor, and some
shifting and masking). I expect this is a result of better caching, and
not loading swashes unless absolutely necessary.
|
|
|
|
|
|
|
| |
It turns out that it is more convenient for the complement of a node to
have a regkind that is also the complement of a node. This creates
slight inconveniences that are included in this patch, but will help
further patches.
|
|
|
|
|
|
| |
A recent commit has changed the algorithm used to handle multi-character
folding in bracketed character classes. The old code is no longer
needed.
|
|
|
|
|
|
| |
regcomp.c sets this new flag whenever regops that could modify
$REGMARK or $REGERROR have been seen. pp_subst will use this
to tell whether it should repeatedly stringify the replacement.
|
|
|
|
|
|
|
|
| |
They are on longer used in core, and we need room for more flags.
The only CPAN modules that use them check whether RXf_SPLIT is set
(which no longer happens) before setting RXf_SKIPWHITE (which is
ignored).
|
|
|
|
|
|
|
|
|
| |
These will be used to handle things like /[[:word:]]/a. This patch
doesn't add the code to actually use these. That will be done in a
future patch.
Also, placeholders POSIXD, POSIXL, and POSIXU are also added for future
use.
|
|
|
|
|
|
|
|
|
|
|
|
| |
For the node types that have differing versions depending on the
character set regex modifiers, /d, /l, /u, /a, and /aa, we can use the
enum values as offsets from the base node number to derive the correct
one. This eliminates a number of tests.
Because there is no DIGITU node type, I added placeholders for it (and
NDIGITU) to avoid some special casing of it (more important in future
commits). We currently have many available node types, so can afford to
waste these two.
|
|
|
|
|
|
| |
This causes all the nodes that depend on the regex modifier, BOUND,
BOUNDL, etc. to have the same relative ordering. This will enable a
future commit to simplify generation of the correct node.
|
|
|
|
|
| |
As a result of commit fab2782b37b5570d7f8f8065fd7d18621117ed49
the description is no longer valid. This node type is trieable.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This cleans up and simplifies and extends how the trie
logic interacts with the new node types. This change ultimately
makes the EXACTFU, EXACTFU_SS, EXACTFU_NO_TRIE (renamed to
EXACTFU_TRICKYFOLD) work properly with the trie engine regardless
of whether the string is utf8 or latin1.
This patch depends on the following:
EXACT => utf8 or "binary" text
EXACTFU => either pre-folded utf8, or latin1 that has to be folded as though it was utf8
EXACTFU_SS => special case of EXACTFU to handle \xDF/ss (affects latin1 treatment)
EXACTFU_TRICKYFOLD => special case of EXACTFU to handle tricky non-latin1 fold rules
EXACTF => "old style fold logic" untriable nodetype
EXACTFA => (currently) untriable nodetype
EXACTFL => (currently) untriable nodetype
See the comments in regcomp.sym for these fold types.
This patch involves a number of distinct, but related parts. Starting
from compilation:
* Simplify how we detect a triable sequence given the new nodetypes,
this also probably fixed some "bugs" in how we detected certain
sequences, like /||foo|bar/.
* Simplify how we read EXACTFU nodes under utf8 by removing the now
redundant folding logic (EXACTFU nodes under utf8 are prefolded).
Also extend this logic to handle latin1 patterns properly (in
conjunction with other changes)
* Part of the problems associated with EXACTFU_SS and EXACTFU_TRICKYFOLD
have to do with how the trie logic interacts with the minlen logic.
This change handles both by pessimising the minlen when encounting
these nodetypes. One observation is that the minlen logic is basically
broken, and works only because it conflates bytes and codepoints in
such a way that we more or less always get a value small enough that things work out
anyway. Fixing that is properly is the job of another patch.
* Part of the problem of doing folding under unicode rules is that
there are a lot of foldings possible, some with strange rules. This
means that the bitmap logic does not work correctly in all cases,
as we currently do not have any way to populate it properly.
So this patch disables the bitmap entirely when folding is involved
until that is fixed.
The end result of this is: we can TRIE/AHOCORASICK any sequence of
EXACT, or EXACTFU (ish) nodes, regardless of utf8 or not, but we disable
the bitmap when folding.
A note for follow up relating to this patch is that the way EXACTFU_XXX
nodes are currently dealt with we wont build the "maximal" trie because
of their presence, instead creating a "jumptrie" consisting of either a
leading EXACTFU node followed by a EXACTFU_XXX node, or vice versa. We
should eventually address that.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This node type hasn't been used since 5.14.0. Instead an ANYOFV node
was generated where formerly a FOLDCHAR node would have been used. The
ANYOFV was used because it already existed and was up-to-date, whereas
FOLDCHAR would have needed some bug fixes to adapt it, even though it
would be faster in execution than ANYOFV; so the code for it was
retained in case it was needed.
However, both these solutions were defective, and a previous commit has
changed things to a different type of solution entirely. Thus FOLDCHAR
is obsolescent and can be removed, though the code in it was used as a
base for some of the new solutions.
|
|
|
|
|
|
| |
This new node is like EXACTFU but is not currently trie'able. This adds
handling for it in regexec.c, but it is not currently generated; this
commit is preparing for future commits
|
|
|
|
|
|
|
|
|
|
| |
This node will be used to distinguish between the case in a non-UTF8
pattern and string where something could be matched that is of different
lengths. The only instance where this can happen is the LATIN SMALL
LETTER SHARP S can match the sequences "ss", "Ss", "sS", or "SS", hence
the name.
This node is not currently generated; this prepares for future commits
|
| |
|
| |
|
|
|
|
| |
These are not used yet.
|
|
|
|
| |
It is not used yet.
|
|
|
|
|
| |
This changes the bits to add a new charset type for /aa, and other bookkeeping
for it.
|
|
|
|
|
|
|
|
|
| |
Previously all the scripts in regen/ had code to generate header comments
(buffer-read-only, "do not edit this file", and optionally regeneration
script, regeneration data, copyright years and filename).
This change results in some minor reformatting of header blocks, and
standardises the copyright line as "Larry Wall and others".
|
|
|
|
| |
These aren't used yet.
|
|
|
|
|
|
| |
These are unused because there is no difference between Unicode
semantics and non for digits. That is there are no digit characters in
the 128-255 range.
|
|
|
|
|
|
| |
This will make for somewhat more efficient execution, as won't have to
test the regnode type multiple times, at the expense of slightly bigger
code space.
|
|
|
|
|
| |
These nodes aren't actually used yet, but allow the splitting out of
Unicode semantics for \w, \s, and complements
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The /d, /l, and /u regex modifiers are mutually exclusive. This patch
changes the field that stores the character set to use more than one bit
with an enum determining which one. This data structure more
closely follows the semantics of their being mutually exclusive, and
conserves bits as well, and is better expandable.
A small API is added to set and query the bit field.
This patch is not .xs source backwards compatible. A handful of cpan
programs are affected.
|
|
|
|
|
|
|
|
|
|
|
|
| |
This node is like a straight ANYOF node to match [bracketed character classes],
but can match multiple characters; in particular it can match a multi-char
fold.
When multi-char Unicode folding was added to Perl, it was overlooked that the
ANYOF node is supposed to match exactly one character, hence there have been
bugs ever since. Adding a specialized node that can match multiple chars,
these can be fixed more easily. I tried at first to make ANYOF match multiple
chars, but this causes Perl to not be able to fully compile.
|
| |
|
|
|
|
|
| |
These were missing that they were simple (matching exactly 1 character)
and have 0 regnode arguments
|
|
|
|
|
|
| |
The recently added regnodes are moved to their respective equivalence
classes, and the named backreferences are moved to just after the
numbered backreferences
|
|
|
|
|
|
|
| |
These will be used for matching capture buffers case-insensitively using
Unicode semantics.
make regen will regenerate the delivered regnodes.h
|