| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
| |
The flag tells us that a pattern may match an infinitely long string.
The new member in the regexp struct tells us how long the string might
be.
With these two items we can implement a regexp-based $/.
|
|
|
|
| |
convention
|
|
|
|
|
|
|
|
|
|
| |
RXf_IS_ANCHORED as a replacement
The only requirement outside of the regex engine is to identify that there is
an anchor involved at all. So we move the 4 anchor flags to intflags and replace
it with a single aggregate flag RXf_IS_ANCHORED in extflags.
This frees up another 3 bits in extflags.
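The arrangement described above can be pictured with a minimal sketch: four fine-grained anchor bits sit in an engine-private word, and a single public bit is derived from them. Every name here (`DEMO_INT_ANCH_A` through `DEMO_INT_ANCH_D`, `DEMO_EXT_IS_ANCHORED`, `demo_regexp`) and every bit value is hypothetical; none of them are the real definitions from the Perl headers.

```c
#include <stdint.h>

/* Hypothetical engine-private anchor bits, standing in for the four
 * anchor flags that move to intflags. Values are illustrative only. */
#define DEMO_INT_ANCH_A    (1u << 0)
#define DEMO_INT_ANCH_B    (1u << 1)
#define DEMO_INT_ANCH_C    (1u << 2)
#define DEMO_INT_ANCH_D    (1u << 3)
#define DEMO_INT_ANCH_MASK (DEMO_INT_ANCH_A | DEMO_INT_ANCH_B | \
                            DEMO_INT_ANCH_C | DEMO_INT_ANCH_D)

/* Hypothetical public aggregate bit: "some anchor is involved". */
#define DEMO_EXT_IS_ANCHORED (1u << 0)

typedef struct {
    uint32_t extflags;  /* visible outside the engine */
    uint32_t intflags;  /* private to the engine */
} demo_regexp;

/* Record one or more anchor kinds and keep the aggregate bit in sync. */
static void demo_set_anchor(demo_regexp *rx, uint32_t anchor_bits)
{
    rx->intflags |= (anchor_bits & DEMO_INT_ANCH_MASK);
    if (rx->intflags & DEMO_INT_ANCH_MASK)
        rx->extflags |= DEMO_EXT_IS_ANCHORED;
}

/* Code outside the engine only needs this one test. */
static int demo_is_anchored(const demo_regexp *rx)
{
    return (rx->extflags & DEMO_EXT_IS_ANCHORED) != 0;
}
```

In this sketch the public word spends one bit where it previously would have spent four, which matches the three bits the commit says are freed.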
|
|
|
|
|
|
|
|
| |
This required removing the RXf_GPOS_CHECK mask as it uses one flag
that will stay in extflags for now (RXf_ANCH_GPOS), and one flag that
moves to intflags (RXf_GPOS_SEEN). This mask is strange however, as
you can't have RXf_ANCH_GPOS without having RXf_GPOS_SEEN so I don't
know why we test both. Further investigation required.
|
| |
|
|
|
|
|
| |
Includes some improvements to how we dump regexps so that when a regexp
is for the standard perl engine we also show the intflags for the engine
|
|
|
|
|
| |
The meaning of these was expanded two commits ago, so update the name to
reflect this, to prevent future confusion
|
| |
This large (sorry, I couldn't figure out how to meaningfully split it
up) commit causes Perl to fully support LC_CTYPE operations (case
changing, character classification) in UTF-8 locales.
As a side effect it resolves [perl #56820].
The basics are easy, but there were a lot of details, and one
troublesome edge case discussed below.
What essentially happens is that when the locale is changed to a UTF-8
one, a global variable is set TRUE (FALSE when changed to a non-UTF-8
locale). Within the scope of 'use locale', this variable is checked,
and if TRUE, the code that Perl uses for non-locale behavior is used
instead of the code for locale behavior. Since Perl's internal
representation is UTF-8, we get UTF-8 behavior for a UTF-8 locale.
More work had to be done for regular expressions. There are three
cases.
1) The character classes \w, [[:punct:]] needed no extra work, as
the changes fall out from the base work.
2) Strings that are to be matched case-insensitively. These form
EXACTFL regops (nodes). Notice that if such a string contains only
characters above-Latin1 that match only themselves, the node can be
downgraded to an EXACT-only node, which presents better optimization
possibilities, as we now have a fixed string known at compile time to be
required to be in the target string to match. Similarly if all
characters in the string match only other above-Latin1 characters
case-insensitively, the node can be downgraded to a regular EXACTFU node
(match, folding, using Unicode, not locale, rules). The code changes
for this could be done without accepting UTF-8 locales fully, but there
were edge cases which needed to be handled differently if I stopped
there, so I continued on.
In an EXACTFL node, all such characters are now folded at compile time
(just as before this commit), while the other characters whose folds are
locale-dependent are left unfolded. This means that they have to be
folded at execution time based on the locale in effect at the moment.
Again, this isn't a change from before. The difference is that now some
of the folds that need to be done at execution time (in regexec) are
potentially multi-char. Some of the code in regexec was trivial to
extend to account for this because of existing infrastructure, but the
part dealing with regex quantifiers had to have more work.
Also the code that joins EXACTish nodes together had to be expanded to
account for the possibility of multi-character folds within locale
handling. This was fairly easy, because it already has infrastructure
to handle these under somewhat different circumstances.
3) In bracketed character classes, represented by ANYOF nodes, a new
inversion list was created giving the characters that should be matched
by this node when the runtime locale is UTF-8. The list is ignored
except under that circumstance. To do this, I created a new ANYOF type
which has an extra SV for the inversion list.
The edge case that caused the most difficulty is folding involving the
MICRO SIGN, U+00B5. It folds to the GREEK SMALL LETTER MU, as does the
GREEK CAPITAL LETTER MU. The MICRO SIGN is the only 0-255 range
character that folds to outside that range. The issue is that it
doesn't naturally fall out that it will match the CAP MU. If we let the
CAP MU fold to the small mu at compile time (which it can because both
are above-Latin1 and so the fold is the same no matter what locale is in
effect), it could appear that the regnode can be downgraded away from
EXACTFL to EXACTFU, but doing so would cause the MICRO SIGN to not case
insensitively match the CAP MU. This could be special cased in regcomp
and regexec, but I wanted to avoid that. Instead the mktables tables
are set up to include the CAP MU as a character whose presence forbids
the downgrading, so the special casing is in mktables, and not in the C
code.
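The MICRO SIGN edge case rests on two Unicode simple case folds, which the following minimal sketch makes concrete. The function names `demo_fold` and `demo_fold_eq` are hypothetical, and the table covers only the three code points discussed; it is not how mktables or the regex engine store folds.

```c
#include <stdint.h>

/* Simple case fold restricted to the three code points in question.
 * Every other code point is returned unchanged, which is NOT correct
 * in general; it just keeps the sketch small. */
static uint32_t demo_fold(uint32_t cp)
{
    switch (cp) {
    case 0x00B5:        /* MICRO SIGN (in the 0-255 range) */
    case 0x039C:        /* GREEK CAPITAL LETTER MU */
        return 0x03BC;  /* GREEK SMALL LETTER MU */
    default:
        return cp;
    }
}

/* Two code points match case-insensitively when their folds agree. */
static int demo_fold_eq(uint32_t a, uint32_t b)
{
    return demo_fold(a) == demo_fold(b);
}
```

Because the MICRO SIGN folds out of the 0-255 range and lands on the same target as the CAP MU, a node holding only the CAP MU looks purely above-Latin1 yet still has a partner in the locale-dependent range, which is why its presence has to forbid the downgrade.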
|
|
|
|
|
| |
This is a clearer name; is used internally only in regcomp.c and
regexec.c
|
| |
The flag bits in regular expression ANYOF nodes are perennially in short
supply. However there are still plenty of regex nodes possible. So one
solution to needing to pass more information is to create a node that
encapsulates what is needed. That is what commit
9aa1e39f96ac28f6ce5d814d9a1eccf1464aba4a did to tell regexec.c that a
particular ANYOF node is for the synthetic start class (SSC). However
this solution introduces other issues. If you have to express two
things, then you need a regnode for A, a regnode for B, a regnode for
both A and B, and another regnode for neither A nor B. With three
things, you need 8 regnodes to express all possible combinations. This
becomes unwieldy to write code for. The number of combinations goes way
down if some of them are mutually exclusive. At the time of that
commit, I thought that a SSC need not ever warn if matching against an
above-Unicode code point. I was wrong, and that has been corrected
earlier in the 5.19 series.
But it finally came to me how to tell regexec that an ANYOF node is
for the SSC without taking up a flag bit and without requiring a regnode
type. The 'next_off' field in a regnode tells the engine the offset in
the regex program to the node it's supposed to go to after processing
this one. Since the SSC stands alone, its 'next_off' field is unused,
and we can put anything we want in it. That, however, is not true of
other ANYOF regnodes. But it turns out that there are certain values
that will never be legitimate in the 'next_off' field in these, and so
this commit uses one of those to signal that this ANYOF node is an SSC.
regnodes come in various sizes, and the offset is in terms of how many
of the smallest ones are there to the next node to look at. Since ANYOF
nodes are large, the offset is always > 1, and so this commit uses 1 to
indicate an SSC.
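The sentinel trick can be sketched roughly as follows. The struct layout, the `DEMO_ANYOF` type code and the helper names are hypothetical stand-ins; only the idea (an offset of 1 can never be a real link out of a large ANYOF node, so it can mark the SSC) comes from the text above.

```c
#include <stdint.h>

/* Hypothetical regnode header. next_off counts how many of the
 * smallest-sized nodes lie between this node and the next one. */
typedef struct {
    uint8_t  flags;
    uint8_t  type;
    uint16_t next_off;
} demo_regnode;

#define DEMO_ANYOF        1  /* hypothetical type code */
#define DEMO_SSC_NEXT_OFF 1  /* never a legitimate offset for ANYOF */

/* The SSC stands alone, so its next_off is free to carry the marker. */
static void demo_mark_ssc(demo_regnode *n)
{
    n->type = DEMO_ANYOF;
    n->next_off = DEMO_SSC_NEXT_OFF;
}

/* No flag bit and no extra regnode type is needed for this test. */
static int demo_is_ssc(const demo_regnode *n)
{
    return n->type == DEMO_ANYOF && n->next_off == DEMO_SSC_NEXT_OFF;
}
```

An ordinary ANYOF node occupies more than one of the smallest-sized slots, so any real link it carries is greater than 1 and `demo_is_ssc` cannot misfire on it.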
|
|
|
|
|
|
| |
There are no logic changes. The previous commit changed the numbers for
some of the bits. This commit re-arranges things so that the #defines
are again in numerical order.
|
|
|
|
|
|
|
|
|
|
|
| |
The ANYOF_INVERT flag is used in every single pattern match of
[bracketed character classes]. With backtracking, this can be a huge
number. All the other flags' uses pale by comparison. I noticed that
by making it the lowest bit, we don't have to use cBOOL, as the only
possibilities are 0 and 1. cBOOL hopefully will be optimized away, but
not always. This commit reorders some of the flag bits to make this one
the lowest, and adds a compile check to make sure it isn't inadvertently
changed.
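A minimal sketch of why the lowest bit helps, under the assumption that the match result is already 0 or 1. The macro names, the second flag and the array-size compile check are illustrative; they are not the definitions or the check used in the Perl sources.

```c
#include <stdint.h>

#define DEMO_ANYOF_INVERT 0x01u  /* hypothetical: the lowest bit */
#define DEMO_ANYOF_OTHER  0x02u  /* hypothetical unrelated flag */

/* Compile-time guard: the array size is negative, and compilation
 * fails, if the flag is ever moved off bit 0. */
typedef char demo_invert_must_be_bit0[(DEMO_ANYOF_INVERT == 1) ? 1 : -1];

/* in_class must be 0 or 1. Because the masked flag is also 0 or 1,
 * a plain XOR yields the final answer with no boolean conversion. */
static int demo_class_matches(uint8_t flags, int in_class)
{
    return (int)(flags & DEMO_ANYOF_INVERT) ^ in_class;
}
```

Had the flag been, say, 0x40, the masked value would be 0 or 64 and would first need normalizing to 0 or 1, which is the conversion the commit avoids on this hot path.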
|
|
|
|
|
|
|
| |
A warning is supposed to be raised under some conditions when matching
an above-Unicode code point against a Unicode property. Prior to this
patch, if the synthetic start class excluded the code point, the warning
would be skipped, even though a match was attempted.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Prior to this commit, there were 3 types of ANYOF nodes; now there are
two: regular, and one for the synthetic start class (ssc). This commit
converted the third type dealing with warning about matching \p{}
against non-Unicode code points, into using the spare flag bit for ANYOF
nodes.
This allows this bit to apply to ssc ANYOF nodes, whereas previously it
couldn't. There is a bug in which the warning isn't raised if the match
is rejected by the optimizer, because of this inability. This bug will
be fixed in a later commit.
Another option would have been to create a new node-type which was an
ANYOF_SSC_WARN_SUPER node. But this adds extra complications to things;
and we have a spare bit that we might as well use. The comments give
better possibilities for freeing up 2 bits should they be needed.
|
|
|
|
|
|
|
|
|
|
|
|
| |
The synthetic start class regnode (SSC) and a bracketed character class
node share much of the same data structure, including a flags field,
and some of the same flag bits within it. Currently, only
locale-related flags (under /l rules) are the same between the two
during construction of the SSC. But a future commit will introduce
another common flag. This commit creates an extra #define for use where
we want the common flags, while retaining the existing one for use where
we want the locale flags. The new #define is just a copy of the
existing one, to be changed in the future commit.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since we can only recurse into a given paren (or the entire pattern)
once, we know that the maximum recursion depth is the number of parens
in the pattern (plus one for "whole pattern"). This means we can
preallocate one large bitmap, and then use different chunks of it
for each level. That avoids SAVEFREEPV costs for each bitmap, which
are likely short anyway. (One could imagine an optimization where a
flag somewhere lets us use the RExC_study_chunk_recursed pointer
as a bitmap, so we don't have to allocate at all when we have fewer than
32 parens.)
This removes the "recursed" argument from study_chunk() and replaces
it with a "recursive_depth" argument which counts how deep we
are in the bitmap "stack".
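The preallocated bitmap "stack" might look something like this sketch. The struct and function names are hypothetical, plain `calloc` stands in for whatever allocator the real code uses, and nothing here claims to mirror how study_chunk() actually indexes or copies its chunks.

```c
#include <stdint.h>
#include <stdlib.h>

/* One allocation sliced into equal chunks, one chunk per depth. */
typedef struct {
    uint8_t *bits;
    size_t   bytes_per_level;  /* enough bits for every paren */
    size_t   levels;           /* maximum recursion depth */
} demo_recursed_stack;

/* Paren 0 denotes the whole pattern, so there are nparens + 1 targets
 * and therefore at most nparens + 1 levels of recursion. */
static int demo_stack_init(demo_recursed_stack *s, size_t nparens)
{
    s->levels = nparens + 1;
    s->bytes_per_level = (nparens + 1 + 7) / 8;
    s->bits = calloc(s->levels, s->bytes_per_level);
    return s->bits != NULL;
}

/* Mark that we recursed into `paren` at `depth`; returns whether the
 * bit was already set at that depth. */
static int demo_test_and_set(demo_recursed_stack *s, size_t depth,
                             size_t paren)
{
    uint8_t *chunk = s->bits + depth * s->bytes_per_level;
    uint8_t  mask  = (uint8_t)(1u << (paren % 8));
    int      was   = (chunk[paren / 8] & mask) != 0;

    chunk[paren / 8] |= mask;
    return was;
}

static void demo_stack_free(demo_recursed_stack *s)
{
    free(s->bits);
    s->bits = NULL;
}
```

The point of the design is that each level's chunk is found by arithmetic on `depth` alone, so no per-level allocation or cleanup registration is needed.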
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 899d20b99829f8ecdc14e1351b533bc62a354dea was used to free up a
bit in a flags field that had run out of bits at the time. Further work
has made that unnecessary, and this commit moves it back to the flags
field, which even after this commit has a spare bit (which is intended
to be used in a future commit).
Doing so makes this bit "just one of the guys", so it can be operated on
en masse with the others. This allows a little code to be removed, and
the knowledge of this flag to be mostly confined to lower level subroutines.
|
| |
Until this commit, the regular expression optimizer has essentially
punted on above-Latin1 code points. Under some circumstances, they
would be taken into account, more or less, but often, the generated
synthetic start class would end up matching all above-Latin1 code
points. With the advent of inversion lists, it becomes feasible to
actually fully handle such code points, as inversion lists are a
convenient way to express arbitrary lists of code points and take their
union, intersection, etc. This commit changes the optimizer to use
inversion lists for operating on the code points the synthetic start
class can match.
I don't much understand the overall operation of the optimizer. I'm
told that previous porters found that perturbing it caused unexpected
behaviors. I had promised to get this change in 5.18, but didn't. I'm
trying to get it in early enough into the 5.20 preliminary series that
any problems will surface before 5.20 ships.
This commit doesn't change the macro level logic, but does significantly
change various micro level things. Thus the 'and' and 'or' subroutines
have been rewritten to use inversion lists. I'm pretty confident that
they do what their names suggest. I re-derived the equations for what
these operations should do, getting the same results in some cases, but
extending others where the previous code mostly punted. The derivations
are given in comments in the respective routines.
Some of the code is greatly simplified, as it no longer has to treat
above-Latin1 specially.
It is now feasible for /i matching of above-Latin1 code points to know
explicitly the folds that should be in the synthetic start class. But
more preparatory work needs to be done before putting that into place.
...
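For readers unfamiliar with inversion lists, here is a minimal, generic sketch of the data structure and of a union ('or') over two of them. It is a textbook-style illustration with hypothetical names (`demo_invlist`, `demo_invlist_contains`, `demo_invlist_union`); it is not Perl's implementation and says nothing about the SSC-specific equations the commit derives.

```c
#include <stddef.h>
#include <stdint.h>

/* A sorted array of code points. Element 0 starts a range that is in
 * the set, element 1 starts the following range that is not, and so
 * on. An odd length means the final range extends to infinity. */
typedef struct {
    const uint32_t *v;
    size_t          len;
} demo_invlist;

/* Membership: count the elements <= cp by binary search. An odd count
 * means cp lies inside an "in" range. */
static int demo_invlist_contains(const demo_invlist *l, uint32_t cp)
{
    size_t lo = 0, hi = l->len;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (l->v[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo & 1) != 0;
}

/* Union by merging both lists while tracking how many inputs are
 * currently "in". `out` needs room for a->len + b->len elements.
 * Returns the number of elements written. */
static size_t demo_invlist_union(const demo_invlist *a,
                                 const demo_invlist *b, uint32_t *out)
{
    size_t i = 0, j = 0, n = 0;
    int depth = 0;

    while (i < a->len || j < b->len) {
        uint32_t cp;
        int entering;

        /* On ties, take a range start before a range end so that
         * touching ranges merge into one. */
        if (j >= b->len ||
            (i < a->len && (a->v[i] < b->v[j] ||
                            (a->v[i] == b->v[j] && (i & 1) == 0)))) {
            cp = a->v[i];
            entering = (i & 1) == 0;
            i++;
        } else {
            cp = b->v[j];
            entering = (j & 1) == 0;
            j++;
        }

        if (entering) {
            if (depth++ == 0)
                out[n++] = cp;  /* the union becomes "in" here */
        } else {
            if (--depth == 0)
                out[n++] = cp;  /* the union becomes "out" here */
        }
    }
    return n;
}
```

An intersection follows the same merge, emitting boundaries when the depth reaches or leaves 2 instead of 1, which is part of what makes this representation convenient for arbitrary code point sets.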
|
|
|
|
|
|
|
|
|
| |
This commit adds some functions that are currently unused, but will be
used in a future commit. This commit is essentially to make the
differences smaller in that commit, as 'diff' is getting confused and
not outputting the logical differences. The functions are added in a
block at the beginning of the file to avoid the 'diff' issues. A later
white-space only commit will move them to more appropriate positions.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In pass 1 of compiling regular expressions, the needed size is
calculated. There is space allocated for a scratch node that can be
used for the things that the real one will hold in pass 2. It is valid
only while working on the current node, and gets overwritten in the next
node.
Until this commit, this scratch space was sized only for the smallest
node type, meaning that larger types could not use it for scratch. Now
it is sized for the largest non-EXACTish node.
We could make it an array of 256 + overhead bytes instead to be able to
hold the EXACTish nodes, but I don't see a need for that now.
|
|
|
|
|
|
| |
ANYOF_UNICODE_ALL doesn't mean every Unicode code point. It means those
above the Latin1 range. Rename it, while retaining the old one for back
compat.
|
|
|
|
|
|
|
|
|
| |
This commit finishes (at least for now) removing some of the overloading
of the term 'class'. A 'regnode_charclass_class' node contains space for
storing the posix classes it matches that are never defined until the
moment of matching because they are subject to the current run-time
locale. This commit creates a typedef 'regnode_charclass_posixl'
synonym that doesn't re-use the term 'class' for two different purposes.
|
|
|
|
|
| |
Not doing so can cause problems, so it is standard procedure to
parenthesize all parameters within a macro definition.
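A minimal illustration of the kind of problem meant here, using two hypothetical macros that differ only in whether the parameter is parenthesized:

```c
/* Unparenthesized parameter: the argument's operators can bind to
 * the macro body in unintended ways. */
#define DEMO_DOUBLE_BAD(x)  (x * 2)

/* Parenthesized parameter: the argument is evaluated as a unit. */
#define DEMO_DOUBLE_GOOD(x) ((x) * 2)

/* Expands to (1 + 2 * 2), which is 5 rather than the intended 6. */
static int demo_bad(void)
{
    return DEMO_DOUBLE_BAD(1 + 2);
}

/* Expands to ((1 + 2) * 2), which is 6. */
static int demo_good(void)
{
    return DEMO_DOUBLE_GOOD(1 + 2);
}
```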
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This continues the process started two commits ago of removing some of
the overloading of the term 'class'.
In this case, this commit adds some #defines referring to the portions
of the regnode associated with bracketed character classes, the ANYOF
node. Specifically those portions that deal with the Posix character
classes, like \w and [:punct:] under /l (locale) matching are renamed
substituting POSIXL for CLASS. POSIXL is already used for POSIX-related
things under /l. I remember being terribly confused when I started
reading this code about this. One had a class within a class. This
should clarify things somewhat.
The old names are retained in case files outside the core #include and
use them (there are a few such on CPAN).
|
|
|
|
| |
This moves it to be adjacent to similar #defines
|
|
|
|
|
|
|
|
|
|
|
|
| |
As part of extending the regular expression optimizer to properly handle
above Latin1 code points, I need an inversion list to contain which code
points the synthetic start class (ssc) matches.
The ssc currently is the same as a locale-aware ANYOF node, which uses
the struct of a regular ANYOF node, plus some extra fields at the end.
This commit creates a new typedef for ssc use, which is the locale-aware
ANYOF node, plus an extra SV* at the end to hold the inversion list.
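The layering of the three node shapes can be sketched as below. This is only an illustration: the type and field names are hypothetical, `void *` stands in for the `SV*`, and nesting each struct as the first member of the next is an assumption made to keep the sketch short (the real typedefs may simply repeat the leading fields).

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a regular ANYOF node. */
typedef struct {
    uint8_t  flags;
    uint8_t  type;
    uint16_t next_off;
    uint8_t  bitmap[32];  /* one bit per code point 0..255 */
} demo_charclass;

/* Stand-in for the locale-aware variant: same start, extra field. */
typedef struct {
    demo_charclass base;
    uint32_t       classflags;
} demo_charclass_posixl;

/* Stand-in for the SSC: the locale-aware node plus one more pointer
 * at the end for the inversion list. */
typedef struct {
    demo_charclass_posixl base;
    void                 *invlist;
} demo_ssc;

/* Because the common part comes first, code that only understands the
 * regular node can still be handed an SSC. */
static demo_charclass *demo_ssc_as_charclass(demo_ssc *ssc)
{
    return &ssc->base.base;
}
```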
|
|
|
|
|
| |
Sometimes SIZE_ONLY isn't really clear as to what is going on. This
adds PASS1 and PASS2 for such instances.
|
|
|
|
|
| |
Remove obsolete comment, typos in others, plus reflow one block to fit
into 79 columns
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When I added support for possessive modifiers it was possible to
build perl so that they could be combined even if it made no sense
to do so.
Later on in relation to Perl #118375 I got confused and thought
they were undocumented but legal.
So to prevent further confusion, and since nobody has ever mentioned
it since they were added, I am removing the unused conditional
build logic, and clearly documenting why they aren't allowed.
|
|
|
|
|
|
|
| |
This global (per-interpreter) var is just used during regex compilation as
a placeholder to point RExC_emit at during the first (non-emitting) pass,
to indicate not to emit anything. There's no need for it to be a global
var: just add it as an extra field in the RExC_state_t struct instead.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit 595598ee1f247e72e06e4cfbe0f98406015df5cc.
The netbsd - 5.0.2 compiler pointed out that the recent changes to add
longjmps to speed up some regex compilations can result in clobbering a
few values. These depend on the compiled code, and so didn't show up in
other compilers' warnings. This patch reinitializes them after a
longjmp.
[With a lot of hand editing in regcomp.c, to propagate the changes through
subsequent commits.]
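The general hazard can be sketched as follows: a non-volatile local that is modified between `setjmp` and `longjmp` has an indeterminate value after the jump, so the safe pattern is to reinitialize such locals on the restart path. The function names and the "fast path" scenario are hypothetical and do not reproduce the regcomp.c code.

```c
#include <setjmp.h>

static jmp_buf demo_restart;

/* Pretend parser: on the fast path it gives up and asks the caller
 * to start again by jumping back. */
static void demo_parse(int use_fast_path)
{
    if (use_fast_path)
        longjmp(demo_restart, 1);
}

static int demo_compile(void)
{
    /* volatile: must keep its value across the longjmp. */
    volatile int use_fast_path = 1;
    /* Plain local: modified after setjmp, so its value cannot be
     * trusted once longjmp has returned control here. */
    int emitted;

    if (setjmp(demo_restart) != 0)
        use_fast_path = 0;  /* second time through: slow path */

    /* Reinitialize on both the first entry and the restart path
     * instead of relying on whatever survived the jump. */
    emitted = 0;

    emitted += 1;
    demo_parse(use_fast_path);
    emitted += 1;
    return emitted;
}
```

Whether a given compiler actually clobbers such a variable depends on the code it generates, which fits the observation above that only one compiler warned about it.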
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
/[[:upper:]]/i and /[[:lower:]]/i should match the Unicode property
\p{Cased}. This commit introduces a pseudo-Posix class, internally named
'cased', to represent this. This class isn't specifiable by the user,
except through using either /[[:upper:]]/i or /[[:lower:]]/i. Debug
output will say ':cased:'.
The regex parsing either of :lower: or :upper: will change them into
:cased:, where already existing logic can handle this, just like any
other class.
This commit fixes the regression introduced in
3018b823898645e44b8c37c70ac5c6302b031381, and the fact that these have never
worked under 'use locale'. The next commit will un-TODO the tests for
these things.
|
|
|
|
|
|
|
|
| |
Until recently, these needed to be (or it made sense for them to be) in
numerical order of what the rhs of each #define evaluates to. But now,
they are all initialized to something else, and the numerical value is
not even apparent. Alphabetical order gives a logical ordering to help
a reader find things.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This frees up a flag bit for ANYOF regnodes. The freed bit is currently
not needed for other uses; I decided to make the change now, while how
to do it was fresh in my mind. There are fewer shifts and masks as a
result, as well.
This commit moves the information this bit contains to the otherwise
unused 'next_off' field in the synthetic start class. This paradigm
could be used to pass information to the regex matching code for just
the synthetic start class, but the current bit is used just during
compilation.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This creates a regnode specifically for the synthetic start class, which
is a type of ANYOF node. The flag bit previously used to denote this is
removed. This paves the way for this bit to be freed up, but first the
other use of this bit must also be removed, which will be done in the
next commit.
There are now three ANYOF-type regnodes. This one should be called only
in one place in regexec.c. The other special one is ANYOF_WARN_SUPER.
A synthetic start class node should not do any warning, so there is no
issue of having something need to be both types.
|
|
|
|
| |
No code changes
|
|
|
|
|
|
| |
This essentially reverts 8b27d3db700fc2fce268e3d78e221a16ccaca2e8
and causes ANYOF nodes that are in locale but don't match things like \w
to have a smaller node size.
|
|
|
|
|
|
|
| |
This uses a regnode type, of which we have many available, to free up
a bit in the ANYOF regnode flag field, of which we have none, and are
trying to have the same bit do double duty. This will enable us to
remove some of that double duty in the next commit.
|
|
|
|
|
|
|
|
|
|
|
| |
The ANYOF_CLASS flag is used in ANYOF nodes (for [bracketed] and the
synthetic start class) only when matching something like \w, [:punct:]
etc., under /l (locale). It should not be set unless /l is specified.
However, it was always getting set for the synthetic start class. This
commit fixes that. The previous code was masking errors in which it was
being tested for unnecessarily, and for much of the 5.17 series, the
synthetic start class was always set to test for locale, which was a
waste of cpu when no locale was specified.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Perl has had an undocumented macro isALNUMC() for a long time. I want
to document it, but the name is very obscure. Neither Yves nor I are
sure what it is. My best guess is "C's alnum". It corresponds to
/[[:alnum:]]/, and so its best name would be isALNUM(). But that is the
name long given to what matches \w. A new synonym, isWORDCHAR(), has
been in place for several releases for that, but the old isALNUM()
should remain for backwards compatibility.
I don't think that the name isALNUMC() should be published, as it is too
close to isALNUM(). I finally came to the conclusion that
isALPHANUMERIC() is the best name; it describes its purpose clearly; the
disadvantage is its long length. I doubt that it will get much use, but
we need something, I think, that we can publish to accomplish this
functionality.
This commit also converts core uses of isALNUMC to isALPHANUMERIC. (I
intended to do that separately, but made a mistake in rebasing, and
combined the two patches; and it seemed like not a big enough problem to
separate them out again.)
|
|
|
|
|
| |
PERL_UNUSED_DECL doesn't do anything under g++, so doing this silences
some g++ warnings.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
I presume that the reason this bitmap was expressed in bytes was that
the macros for dealing with that were already readily available and
familiar, and because it could easily be grown.
However, it's extremely unlikely that we would ever need more bits.
This bit map is for the Posix character classes, and no one is making
more of them. There is currently one spare bit available, and if we
don't back out of the \s and [:space:] unification, a second will become
available in 5.20 or 5.22.
Using a single word is more efficient, so this changes to use that.
Should we ever need more bits, we can change back.
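The efficiency argument can be seen by putting the two representations side by side. This sketch assumes the "single word" is 32 bits wide and uses hypothetical helper names; it does not reproduce the actual macros.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_NCLASSES 32u  /* assumed capacity of the bitmap */

/* Byte-array form: each access computes a byte index and a bit
 * index, and "is anything set?" needs a loop. */
static void demo_bytes_set(uint8_t map[DEMO_NCLASSES / 8], unsigned c)
{
    map[c >> 3] |= (uint8_t)(1u << (c & 7u));
}

static int demo_bytes_any(const uint8_t map[DEMO_NCLASSES / 8])
{
    size_t i;

    for (i = 0; i < DEMO_NCLASSES / 8; i++)
        if (map[i])
            return 1;
    return 0;
}

/* Single-word form: one shift per access, one compare for "any". */
static void demo_word_set(uint32_t *map, unsigned c)
{
    *map |= (uint32_t)1 << c;
}

static int demo_word_test(uint32_t map, unsigned c)
{
    return (int)((map >> c) & 1u);
}

static int demo_word_any(uint32_t map)
{
    return map != 0;
}
```

The byte-array form grows easily, which the text suggests was its original attraction; the word form trades that away for simpler operations.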
|
|
|
|
|
|
| |
This revises how these #defines are set up so that the order can change
(as will be done in a later commit), and the only dependencies are on
VERTWS and the max one from handy.h.
|
|
|
|
|
|
|
| |
This will be used in future commits to allow \v and \V to be treated
consistently with other character classes. (Doing the same for \h isn't
necessary, as it matches identically to [:blank:] in the entire Unicode
range.)
|
|
|
|
|
|
|
|
|
| |
ANYOF_MAX is used as the upper boundary in loops. If we keep it larger
than necessary, the loop does extraneous iterations.
The #defines that come after ANYOF_MAX are moved down to start with it.
This is useful in a later commit that will create an entry in
l1_char_class_tab.h for vertical white space determination.
|
|
|
|
|
| |
ANYOF_MAX is used for two different purposes; this separates them and
creates a separate #define for one of them.
|
| |
Since the xpvlv and regexp structs conflict, we have to find somewhere
else to put the regexp struct.
I was going to sneak it in SvPVX, allocating a buffer large
enough to fit the regexp struct followed by the string, and have
SvPVX - sizeof(regexp) point to the struct. But that would make all
regexp flag-checking macros fatter, and those are used in hot code.
So I came up with another method. Regexp stringification is not
speed-critical. So we can move the regexp stringification out of
re->sv_u and put it in the regexp struct. Then the regexp struct
itself can be pointed to by re->sv_u. So SVt_REGEXPs will have
re->sv_any and re->sv_u pointing to the same spot. PVLVs can then
have sv->sv_any point to the xpvlv body as usual, but have sv->sv_u
point to a regexp struct. All regexp member access can go through
sv_u instead of sv_any, which will be no slower than before.
Regular expressions will no longer be SvPOK, so we give sv_2pv special
logic for regexps. We don’t need to make the regexp struct
larger, as SvLEN is currently always 0 iff mother_re is set. So we
can replace the SvLEN field with the pv.
SvFAKE is never used without SvPOK or SvSCREAM also set. So we can
use that to identify regexps.
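The two-pointer arrangement can be pictured with a much-simplified sketch. All types, fields and helpers here are hypothetical stand-ins (a real SV head has more fields, and its second slot is a union); the sketch only shows why regexp member access can always go through the second pointer.

```c
#include <stddef.h>

/* Stand-in for the regexp struct, which now also carries the
 * stringified form of the pattern. */
typedef struct {
    const char *wrapped;
    size_t      wraplen;
    unsigned    extflags;
} demo_regexp;

/* Stand-in for the PVLV body. */
typedef struct {
    size_t targoff;
    size_t targlen;
} demo_xpvlv;

/* Stand-in for an SV head with its two pointer slots. */
typedef struct {
    void        *sv_any;   /* the body */
    demo_regexp *sv_u_rx;  /* the regexp struct */
} demo_sv;

/* All regexp member access goes through the second slot. */
#define DEMO_RX(sv) ((sv)->sv_u_rx)

/* Plain regexp: both slots point to the same spot. */
static void demo_make_regexp_sv(demo_sv *sv, demo_regexp *rx)
{
    sv->sv_any = rx;
    sv->sv_u_rx = rx;
}

/* PVLV holding a regexp: the body is the lvalue struct as usual,
 * while the second slot still finds the regexp. */
static void demo_make_lv_regexp_sv(demo_sv *sv, demo_xpvlv *lv,
                                   demo_regexp *rx)
{
    sv->sv_any = lv;
    sv->sv_u_rx = rx;
}

static unsigned demo_rx_extflags(const demo_sv *sv)
{
    return DEMO_RX(sv)->extflags;
}
```

In both cases the flag lookup is a single pointer dereference, which is consistent with the claim that access through sv_u is no slower than before.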
|
|
|
|
|
| |
This macro is now only used under locale; its other use has now been
removed. Change the name to reflect its only use.
|
|
|
|
| |
ALNUM (meaning \w) is too close to ALNUMC ([[:alnum:]]) for comfort
|
|
|
|
|
| |
This synchronizes the ANYOF_FOO usages to the isFOO() usages. Future
commits will take advantage of this relationship.
|