| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
I believe this will fix the remaining alignment problems recently being
shown on gcc on HP-UX, It works on the procura machine.
regnodes should not have stricter alignment than required by U32, for
reasons given in the comments this commit adds to the beginning of
regcomp.h. Commit 31f05a37 added a new ANYOF regnode struct with a
pointer field. This requires stricter alignment on some 64-bit platforms,
and hence doesn't work on those platforms.
This commit removes that regnode struct type, and instead stores the
pointer it used via a more indirect, but already existing mechanism
that stores other data..
The function that returns that other data is enlarged to return this new
field as well. It now needs to be called from regcomp.c, so the
previous commit had renamed and made it accessible from there. The
"public" function that wraps this one is unchanged. (I put "public" in
quotes here, because I don't think anyone outside core is or should be
using it, but since it has been publicly available for a long time, I'm
treating the API as unchangeable. regcomp.c called this public function
before this commit, but needs the additional data returned by the inner
one).
|
|
|
|
|
|
|
|
| |
Instead of doing the calculation of how many bytes a 256 bitmap
occupies, let the compiler do it. I believe we are not too far away
from having the ability to allow applications to recompile Perl to
increase the bitmap size trading speed for memory. ICU has an 8192
bitmap last time I checked.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For the last several releases, the fact that an ANYOF node could match
something outside its bitmap has been passed to regexec.c by having its
ARG field not be -1 (appropriately cast). A bit was set if the match
could occur even if the target string was not UTF-8 encoded. This
design was used to save a bit, as previously there was a bit also for it
matching UTF-8 strings.
That design is no longer tenable, as a future commit will have a third
(independent) reason for something to match outside the bitmap, This
commits uses the current spare bit flag to indicate if the match can
only occur if the target string is UTF-8.
|
|
|
|
|
| |
This is obsolete and is a partial copy of the up-to-date comment below
it.
|
|
|
|
| |
The ANYOF_LOC bit was removed from final use in the previous commit.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This flag no longer adds any useful information and can be removed. An
ANYOF node that depends on locale either matches a POSIX class like /d,
or matches case insensitively, or both. There are flags for both these
cases, and to see if something matches locale, one merely needs to see
if either flag is set.
Not having to keep track of this extra flag simplifies things, and will
allow it to be removed. There was a time when this flag was shared with
one of the remaining locale ones, and there was relict code that allowed
that sharing to be reinstated, and which this commit also removes.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The ANYOF_POSIXL flag is needed in general for ANYOF nodes to indicate
if the struct contains an extra U32 element used to hold the list of
POSIX classes (like \w and [:punct:]) whose matches depend on the locale
in effect at the time of runtime pattern matching.
But the SSC always contains this U32, and so doesn't need to use the
flag. Instead, if there aren't any such classes, the U32 will be zero.
Removing keeping track of this flag during the assembly of the SSC
simplifies things. At the completion of this process, this flag is
set if the U32 is non-zero to pass that information on to regexec.c so
that it doesn't have to special case things.
|
|
|
|
|
| |
This reverts commit 34fdef848b1687b91892ba55e9e0c3430e0770f6, and
adds comments referring to it, in case it is ever needed.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit frees up a bit by using an extra regnode to pass the
information to the regex engine instead of the flag. I originally
thought that if this was needed, it should be the ANYOF_ABOVE_LATIN1_ALL
bit, as that might speed some things up. But if we need to do this
again by adding another node to get another bit, we want one that is
mutually exclusive of the first one we did, For otherwise we start
having to make 3 nodes instead of two to get the combinations:
1 0
0 1
1 1
This combinatorial problem is avoided by using bits that are mutually
exclusive, which the ABOVE_LATIN1_ALL isn't, but the one freed by this
commit ANYOF_NON_UTF8_NON_ASCII_ALL is only set under /d matching, and
there are other bits that are set only under /l, so if we need to do
this again, we should use one of those.
I wrote this code when I thought I really needed a bit. But since, I
have figured out a better way to get the bit needed now. But I don't
want to lose this code to posterity, so this commit is being made long
enough to get the commit number, then it will be reverted, adding
comments referring to the commit number, so that it can easily be
reconstructed when necessary.
|
|
|
|
| |
I misread the code when I added these comments
|
|
|
|
|
|
|
|
|
| |
This macro defines two flag bits:
#define PREGf_ANCH_SINGLE (PREGf_ANCH_SBOL|PREGf_ANCH_GPOS)
but is only used twice in core (and not on CPAN),
don't really add any value, but increases cognitive complexity.
|
|
|
|
|
|
|
|
|
| |
The flag tells us that a pattern may match an infinitely long string.
The new member in the regexp struct tells us how long the string might
be.
With these two items we can implement regexp based $/
|
|
|
|
| |
convention
|
|
|
|
|
|
|
|
|
|
| |
RXf_IS_ANCHORED as a replacement
The only requirement outside of the regex engine is to identify that there is
an anchor involved at all. So we move the 4 anchor flags to intflags and replace
it with a single aggregate flag RXf_IS_ANCHORED in extflags.
This frees up another 3 bits in extflags.
|
|
|
|
|
|
|
|
| |
This required removing the RXf_GPOS_CHECK mask as it uses one flag
that will stay in extflags for now (RXf_ANCH_GPOS), and one flag that
moves to intflags (RXf_GPOS_SEEN). This mask is strange however, as
you cant have RXf_ANCH_GPOS without having RXf_GPOS_SEEN so I dont
know why we test both. Further investigation required.
|
| |
|
|
|
|
|
| |
Includes some improvements to how we dump regexps so that when a regexp
is for the standard perl engine we also show the intflags for the engine
|
|
|
|
|
| |
The meaning of these was expanded two commits ago, so update the name to
reflect this, to prevent future confusion
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This large (sorry, I couldn't figure out how to meaningfully split it
up) commit causes Perl to fully support LC_CTYPE operations (case
changing, character classification) in UTF-8 locales.
As a side effect it resolves [perl #56820].
The basics are easy, but there were a lot of details, and one
troublesome edge case discussed below.
What essentially happens is that when the locale is changed to a UTF-8
one, a global variable is set TRUE (FALSE when changed to a non-UTF-8
locale). Within the scope of 'use locale', this variable is checked,
and if TRUE, the code that Perl uses for non-locale behavior is used
instead of the code for locale behavior. Since Perl's internal
representation is UTF-8, we get UTF-8 behavior for a UTF-8 locale.
More work had to be done for regular expressions. There are three
cases.
1) The character classes \w, [[:punct:]] needed no extra work, as
the changes fall out from the base work.
2) Strings that are to be matched case-insensitively. These form
EXACTFL regops (nodes). Notice that if such a string contains only
characters above-Latin1 that match only themselves, that the node can be
downgraded to an EXACT-only node, which presents better optimization
possibilities, as we now have a fixed string known at compile time to be
required to be in the target string to match. Similarly if all
characters in the string match only other above-Latin1 characters
case-insensitively, the node can be downgraded to a regular EXACTFU node
(match, folding, using Unicode, not locale, rules). The code changes
for this could be done without accepting UTF-8 locales fully, but there
were edge cases which needed to be handled differently if I stopped
there, so I continued on.
In an EXACTFL node, all such characters are now folded at compile time
(just as before this commit), while the other characters whose folds are
locale-dependent are left unfolded. This means that they have to be
folded at execution time based on the locale in effect at the moment.
Again, this isn't a change from before. The difference is that now some
of the folds that need to be done at execution time (in regexec) are
potentially multi-char. Some of the code in regexec was trivial to
extend to account for this because of existing infrastructure, but the
part dealing with regex quantifiers, had to have more work.
Also the code that joins EXACTish nodes together had to be expanded to
account for the possibility of multi-character folds within locale
handling. This was fairly easy, because it already has infrastructure
to handle these under somewhat different circumstances.
3) In bracketed character classes, represented by ANYOF nodes, a new
inversion list was created giving the characters that should be matched
by this node when the runtime locale is UTF-8. The list is ignored
except under that circumstance. To do this, I created a new ANYOF type
which has an extra SV for the inversion list.
The edge case that caused the most difficulty is folding involving the
MICRO SIGN, U+00B5. It folds to the GREEK SMALL LETTER MU, as does the
GREEK CAPITAL LETTER MU. The MICRO SIGN is the only 0-255 range
character that folds to outside that range. The issue is that it
doesn't naturally fall out that it will match the CAP MU. If we let the
CAP MU fold to the samll mu at compile time (which it can because both
are above-Latin1 and so the fold is the same no matter what locale is in
effect), it could appear that the regnode can be downgraded away from
EXACTFL to EXACTFU, but doing so would cause the MICRO SIGN to not case
insensitvely match the CAP MU. This could be special cased in regcomp
and regexec, but I wanted to avoid that. Instead the mktables tables
are set up to include the CAP MU as a character whose presence forbids
the downgrading, so the special casing is in mktables, and not in the C
code.
|
|
|
|
|
| |
This is a clearer name; is used internally only in regcomp.c and
regexec.c
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The flag bits in regular expression ANYOF nodes are perennially in short
supply. However there are still plenty of regex nodes possible. So one
solution to needing to pass more information is to create a node that
encapsulates what is needed. That is what commit
9aa1e39f96ac28f6ce5d814d9a1eccf1464aba4a did to tell regexec.c that a
particular ANYOF node is for the synthetic start class (SSC). However
this solution introduces other issues. If you have to express two
things, then you need a regnode for A, a regnode for B, a regnode for
both A and B, and another regnode for both not A nor B; With three
things, you need 8 regnodes to express all possible combinations. This
becomes unwieldy to write code for. The number of combinations goes way
down if some of them are mutually exclusive. At the time of that
commit, I thought that a SSC need not ever warn if matching against an
above-Unicode code point. I was wrong, and that has been corrected
earlier in the 5.19 series.
But it finally came to me how to tell regexec that an ANYOF node is
for the SSC without taking up a flag bit and without requiring a regnode
type. The 'next_off' field in a regnode tells the engine the offeset in
the regex program to the node it's supposed to go to after processing
this one. Since the SSC stands alone, its 'next_off' field is unused,
and we can put anything we want in it. That, however, is not true of
other ANYOF regnodes. But it turns out that there are certain values
that will never be legitimate in the 'next_off' field in these, and so
this commit uses one of those to signal that this ANYOF field is an SSC.
regnodes come in various sizes, and the offset is in terms of how many
of the smallest ones are there to the next node to look at. Since ANYOF
nodes are large, the offset is always > 1, and so this commit uses 1 to
indicate an SSC.
|
|
|
|
|
|
| |
There are no logic changes. The previous commit changed the numbers for
some of the bits. This commit re-arranges things so that the #defines
are again in numerical order.
|
|
|
|
|
|
|
|
|
|
|
| |
The ANYOF_INVERT flag is used in every single pattern match of
[bracketed character classes]. With backtracking, this can be a huge
number. All the other flags' uses pale by comparison. I noticed that
by making it the lowest bit, we don't have to use CBOOL, as the only
possibilities are 0 and 1. cBOOL hopefully will be optimized away, but
not always. This commit reorders some of the flag bits to make this one
the lowest, and adds a compile check to make sure it isn't inadvertently
changed.
|
|
|
|
|
|
|
| |
A warning is supposed to be raised under some conditions when matching
an above-Unicode code point against a Unicode property. Prior to this
patch, if the synthetic start class excluded the code point, the warning
would be skipped, even though it was attempted to be matched.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Prior to this commit, there were 3 types of ANYOF nodes; now there are
two: regular, and one for the synthetic start class (ssc). This commit
converted the third type dealing with warning about matching \p{}
against non-Unicode code points, into using the spare flag bit for ANYOF
nodes.
This allows this bit to apply to ssc ANYOF nodes, whereas previously it
couldn't. There is a bug in which the warning isn't raised if the match
is rejected by the optimizer, because of this inability. This bug will
be fixed in a later commit.
Another option would have been to create a new node-type which was an
ANYOF_SSC_WARN_SUPER node. But this adds extra complications to things;
and we have a spare bit that we might as well use. The comments give
better possibilities for freeing up 2 bits should they be needed.
|
|
|
|
|
|
|
|
|
|
|
|
| |
The syntethic start class regnode (SSC) and a bracketed character class
node share much of the same data structure, including a flags field,
and some of the same flag bits within it. Currently, only
locale-related flags (under /l rules) are the same between the two
during construction of the SSC. But a future commit will introduce
another common flag. This commit creates an extra #define for use where
we want the common flags, while retaining the existing one for use where
we want the locale flags. The new #define is just a copy of the
existing one, to be changed in the future commit.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since we can only recurse into a given paren (or the entire pattern)
once, we know that the maximum recursion depth is the number of parens
in the pattern (plus one for "whole pattern"). This means we can
preallocate one large bitmap, and then use different chunks of it
for each level. That avoids SAVEFREEPV costs for each bitmap, which
are likely short anyway. (One could imagine an optimization where a
flag somewhere lets us use the RExC_study_chunk_recursed pointer
as a bitmap, so we dont have to allocate all when we have less than
32 parens.)
This removes the "recursed" argument from study_chunk() and replaces
it with a "recursive_depth" argument which counts how deep we
are in the bitmap "stack".
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 899d20b99829f8ecdc14e1351b533bc62a354dea was used to free up a
bit in a flags field that had run out of bits at the time. Further work
has made that unnecessary, and this commit moves it back to the flags
field, which even after this commit has a spare bit (which is intended
to be used in a future commit).
Doing so makes this bit "just one of the guys", so can be operated on
en-masse with the others. This allows a little code to be removed, and
the knowledge of this flag mostly confined to lower level subroutines.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Until this commit, the regular expression optimizer has essentially
punted on above-Latin1 code points. Under some circumstances, they
would be taken into account, more or less, but often, the generated
synthetic start class would end up matching all above-Latin1 code
points. With the advent of inversion lists, it becomes feasible to
actually fully handle such code points, as inversion lists are a
convenient way to express arbitrary lists of code points and take their
union, intersection, etc. This commit changes the optimizer to use
inversion lists for operating on the code points the synthetic start
class can match.
I don't much understand the overall operation of the optimizer. I'm
told that previous porters found that perturbing it caused unexpected
behaviors. I had promised to get this change in 5.18, but didn't. I'm
trying to get it in early enough into the 5.20 preliminary series that
any problems will surface before 5.20 ships.
This commit doesn't change the macro level logic, but does significantly
change various micro level things. Thus the 'and' and 'or' subroutines
have been rewritten to use inversion lists. I'm pretty confident that
they do what their names suggest. I re-derived the equations for what
these operations should do, getting the same results in some cases, but
extending others where the previous code mostly punted. The derivations
are given in comments in the respective routines.
Some of the code is greatly simplified, as it no longer has to treat
above-Latin1 specially.
It is now feasible for /i matching of above-Latin1 code points to know
explicitly the folds that should be in the synthetic start class. But
more prepatory work needs to be done before putting that into place.
...
|
|
|
|
|
|
|
|
|
| |
This commit adds some functions that are currently unused, but will be
used in a future commit. This commit is essentially to make the
differences smaller in that commit, as 'diff' is getting confused and
not outputting the logical differences. The functions are added in a
block at the beginning of the file to avoid the 'diff' issues. A later
white-space only commit will move them to more appropriate positions.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In pass 1 of compiling regular expressions, the needed size is
calculated. There is space allocated for a scratch node that can be
used for the things that the real one will hold in pass 2. It is valid
only while working on the current node, and gets overwritten in the next
node.
Until this commit, this scratch space was sized only for the smallest
node type, meaning that larger types could not use it for scratch. Now
it is sized to be the largest non EXACTish node.
We could make it an array of 256 + overhead bytes instead to be able to
hold the EXACTish nodes, but I don't see a need for that now.
|
|
|
|
|
|
| |
ANYOF_UNICODE_ALL doesn't mean every Unicode code point. It means those
above the Latin1 range. Rename it, while retaining the old one for back
compat.
|
|
|
|
|
|
|
|
|
| |
This commit finishes (at least for now) removing some of the overloading
of the term class. A 'regnode_charclass_class' node contains space for
storing the posix classes it matches that are never defined until the
moment of matching because they are subject to the current run-time
locale. This commit creates a typedef 'regnode_charclass_posixl'
synonym that doesn't re-use the term 'class' for two different purposes.
|
|
|
|
|
| |
Not doing so can cause problems, so it is standard procedure to
parenthesize all parameters within a macro definition.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This continues the process started two commits ago of removing some of
the overloading of the term 'class'.
In this case, this commit adds some #defines referring to the portions
of the regnode associated with bracketed character classes, the ANYOF
node. Specifically those portions that deal with the Posix character
classes, like \w and [:punct:] under /l (locale) matching are renamed
substituting POSIXL for CLASS. POSIXL is already used for POSIX-related
things under /l. I remember being terribly confused when I started
reading this code about this. One had a class within a class. This
should clarify things somewhat.
The old names are retained in case files outside the core #include and
use it (there are a few such in cpan).
|
|
|
|
| |
This moves it to be adjacent to similar #defines
|
|
|
|
|
|
|
|
|
|
|
|
| |
As part of extending the regular expression optimizer to properly handle
above Latin1 code points, I need an inversion list to contain which code
points the synthetic start class (ssc) matches.
The ssc currently is the same as a locale-aware ANYOF node, which uses
the struct of a regular ANYOF node, plus some extra fields at the end.
This commit creates a new typedef for ssc use, which is the locale-aware
ANYOF node, plus an extra SV* at the end to hold the inversion list.
|
|
|
|
|
| |
Sometimes SIZE_ONLY isn't really clear as to what is going, on. This
adds PASS1 and PASS2 for such instances.
|
|
|
|
|
| |
Remove obsolete comment, typos in others, plus reflow one block to fit
into 79 columns
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When I added support for possessive modifiers it was possible to
build perl so that they could be combined even if it made no sense
to do so.
Later on in relation to Perl #118375 I got confused and thought
they were undocumented but legal.
So to prevent further confusion, and since nobody has every mentioned
it since they were added, I am removing the unusued conditional
build logic, and clearly documenting why they aren't allowed.
|
|
|
|
|
|
|
| |
This global (per-interpreter) var is just used during regex compilation as
a placeholder to point RExC_emit at during the first (non-emitting) pass,
to indicate to not to emit anything. There's no need for it to be a global
var: just add it as an extra field in the RExC_state_t struct instead.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit 595598ee1f247e72e06e4cfbe0f98406015df5cc.
The netbsd - 5.0.2 compiler pointed out that the recent changes to add
longjmps to speed up some regex compilations can result in clobbering a
few values. These depend on the compiled code, and so didn't show up in
other compiler's warnings. This patch reinitializes them after a
longjmp.
[With a lot of hand editing in regcomp.c, to propagate the changes through
subsequent commits.]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
/[[:upper:]]/i and /[[:lower:]]/i should match the Unicode property
\p{Cased}. This commit introduces a pseudo-Posix class, internally named
'cased', to represent this. This class isn't specifiable by the user,
except through using either /[[:upper:]]/i or /[[:lower:]]/i. Debug
output will say ':cased:'.
The regex parsing either of :lower: or :upper: will change them into
:cased:, where already existing logic can handle this, just like any
other class.
This commit fixes the regression introduced in
3018b823898645e44b8c37c70ac5c6302b031381, and that these have never
worked under 'use locale'. The next commit will un-TODO the tests for
these things.
|
|
|
|
|
|
|
|
| |
Until recently, these were needed to be (or it made sense to be) in
numerical value of what the rhs of each #define evaluates to. But now,
they are all initialized to something else, and the numerical value is
not even apparent. Alphabetical order gives a logical ordering to help
a reader find things.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This frees up a flag bit for ANYOF regnodes. The freed bit is currently
not needed for other uses; I decided to make the change now, while how
to do it was fresh in my mind. There are fewer shifts and masks as a
result, as well.
This commit moves the information this bit contains to the otherwise
unused 'next_off' field in the synthetic start class. This paradigm
could be used to pass information to the regex matching code for just
the synthetic start class, but the current bit is used just during
compilation.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This creates a regnode specifically for the synthetic start class, which
is a type of ANYOF node. The flag bit previously used to denote this is
removed. This paves the way for this bit to be freed up, but first the
other use of this bit must also be removed, which will be done in the
next commit.
There are now three ANYOF-type regnodes. This one should be called only
in one place in regexec.c. The other special one is ANYOF_WARN_SUPER.
A synthetic start class node should not do any warning, so there is no
issue of having something need to be both types.
|
|
|
|
| |
No code changes
|
|
|
|
|
|
| |
This essentially reverts 8b27d3db700fc2fce268e3d78e221a16ccaca2e8
and causes ANYOF nodes that are in locale but don't match things like \w
to have a smaller node size.
|
|
|
|
|
|
|
| |
This uses a regnode type, of which we have many available, to free up
a bit in the ANYOF regnode flag field, of which we have none, and are
trying to have the same bit do double duty. This will enable us to
remove some of that double duty in the next commit.
|
|
|
|
|
|
|
|
|
|
|
| |
The ANYOF_CLASS flag is used in ANYOF nodes (for [bracketed] and the
synthetic start class) only when matching something like \w, [:punct:]
etc., under /l (locale). It should not be set unless /l is specified.
However, it was always getting set for the synthetic start class. This
commit fixes that. The previous code was masking errors in which it was
being tested for unnecessarily, and for much of the 5.17 series, the
synthetic start class was always set to test for locale, which was a
waste of cpu when no locale was specified.
|