| Commit message (Collapse) | Author | Age | Files | Lines |
| |
When I added support for possessive modifiers it was possible to
build perl so that they could be combined even if it made no sense
to do so.
Later on in relation to Perl #118375 I got confused and thought
they were undocumented but legal.
So to prevent further confusion, and since nobody has ever mentioned
it since they were added, I am removing the unused conditional
build logic, and clearly documenting why they aren't allowed.
| |
This global (per-interpreter) var is used only during regex compilation, as
a placeholder to point RExC_emit at during the first (non-emitting) pass,
to indicate that nothing should be emitted. There's no need for it to be a
global var: just add it as an extra field in the RExC_state_t struct instead.
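A minimal sketch of the idea, assuming heavily simplified stand-ins for the real structures (node_t, compile_state_t and all function names here are hypothetical, not the regcomp.c definitions): the placeholder lives inside the per-compilation state, and the sizing pass is recognised by the emit pointer aiming at it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for regnode and RExC_state_t. */
typedef struct { unsigned char type; } node_t;

typedef struct {
    node_t *emit;        /* where the next node would be written */
    node_t  emit_dummy;  /* per-compilation placeholder, replaces the global */
    size_t  size;        /* number of nodes counted during the sizing pass */
} compile_state_t;

/* Pass 1: point emit at the placeholder so nothing is really written. */
void begin_sizing_pass(compile_state_t *st)
{
    st->emit = &st->emit_dummy;
    st->size = 0;
}

/* Pass 2: point emit at a real buffer holding at least st->size nodes. */
void begin_emit_pass(compile_state_t *st, node_t *buf)
{
    assert(buf != NULL);
    st->emit = buf;
}

int is_sizing_pass(const compile_state_t *st)
{
    return st->emit == &st->emit_dummy;
}

/* Count the node in pass 1, write it in pass 2. */
void emit_node(compile_state_t *st, unsigned char type)
{
    if (is_sizing_pass(st)) {
        st->size++;
        return;
    }
    st->emit->type = type;
    st->emit++;
}
```

Because the placeholder is a struct member, two compilations in progress at once cannot confuse each other's "not emitting" state.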
| |
This reverts commit 595598ee1f247e72e06e4cfbe0f98406015df5cc.
The netbsd-5.0.2 compiler pointed out that the recent changes to add
longjmps to speed up some regex compilations can result in clobbering a
few values. These depend on the compiled code, and so didn't show up in
other compilers' warnings. This patch reinitializes them after a
longjmp.
[With a lot of hand editing in regcomp.c, to propagate the changes through
subsequent commits.]
| |
/[[:upper:]]/i and /[[:lower:]]/i should match the Unicode property
\p{Cased}. This commit introduces a pseudo-Posix class, internally named
'cased', to represent this. This class isn't specifiable by the user,
except through using either /[[:upper:]]/i or /[[:lower:]]/i. Debug
output will say ':cased:'.
The regex parsing either of :lower: or :upper: will change them into
:cased:, where already existing logic can handle this, just like any
other class.
This commit fixes the regression introduced in
3018b823898645e44b8c37c70ac5c6302b031381, and also the fact that these have
never worked under 'use locale'. The next commit will un-TODO the tests for
these things.
| |
Until recently, these needed to be (or it made sense for them to be)
ordered by the numerical value that the rhs of each #define evaluates to.
But now, they are all initialized to something else, and the numerical
value is not even apparent. Alphabetical order gives a logical ordering to
help a reader find things.
| |
This frees up a flag bit for ANYOF regnodes. The freed bit is currently
not needed for other uses; I decided to make the change now, while how
to do it was fresh in my mind. There are fewer shifts and masks as a
result, as well.
This commit moves the information this bit contains to the otherwise
unused 'next_off' field in the synthetic start class. This paradigm
could be used to pass information to the regex matching code for just
the synthetic start class, but the current bit is used just during
compilation.
| |
This creates a regnode specifically for the synthetic start class, which
is a type of ANYOF node. The flag bit previously used to denote this is
removed. This paves the way for this bit to be freed up, but first the
other use of this bit must also be removed, which will be done in the
next commit.
There are now three ANYOF-type regnodes. This one should be called only
in one place in regexec.c. The other special one is ANYOF_WARN_SUPER.
A synthetic start class node should not do any warning, so there is no
issue of having something need to be both types.
| |
No code changes
| |
This essentially reverts 8b27d3db700fc2fce268e3d78e221a16ccaca2e8
and causes ANYOF nodes that are in locale but don't match things like \w
to have a smaller node size.
| |
This uses a regnode type, of which we have many available, to free up
a bit in the ANYOF regnode flag field, of which we have none, and are
trying to have the same bit do double duty. This will enable us to
remove some of that double duty in the next commit.
| |
The ANYOF_CLASS flag is used in ANYOF nodes (for [bracketed] and the
synthetic start class) only when matching something like \w, [:punct:]
etc., under /l (locale). It should not be set unless /l is specified.
However, it was always getting set for the synthetic start class. This
commit fixes that. The previous code was masking errors in which it was
being tested for unnecessarily, and for much of the 5.17 series, the
synthetic start class was always set to test for locale, which was a
waste of cpu when no locale was specified.
| |
Perl has had an undocumented macro isALNUMC() for a long time. I want
to document it, but the name is very obscure. Neither Yves nor I are
sure what it is. My best guess is "C's alnum". It corresponds to
/[[:alnum:]]/, and so its best name would be isALNUM(). But that is the
name long given to what matches \w. A new synonym, isWORDCHAR(), has
been in place for several releases for that, but the old isALNUM()
should remain for backwards compatibility.
I don't think that the name isALNUMC() should be published, as it is too
close to isALNUM(). I finally came to the conclusion that
isALPHANUMERIC() is the best name; it describes its purpose clearly; the
disadvantage is its long length. I doubt that it will get much use, but
we need something, I think, that we can publish to accomplish this
functionality.
This commit also converts core uses of isALNUMC to isALPHANUMERIC. (I
intended to do that separately, but made a mistake in rebasing, and
combined the two patches; it did not seem like a big enough problem to
separate them out again.)
| |
PERL_UNUSED_DECL doesn't do anything under g++, so doing this silences
some g++ warnings.
| |
I presume that the reason this bitmap was expressed in bytes was that
the macros for dealing with that were already readily available and
familiar, and because it could easily be grown.
However, it's extremely unlikely that we would ever need more bits.
This bit map is for the Posix character classes, and no one is making
more of them. There is currently one spare bit available, and if we
don't back out of the \s and [:space:] unification, a second will become
available in 5.20 or 5.22.
Using a single word is more efficient, so this changes to use that.
Should we ever need more bits, we can change back.
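As a rough illustration of the single-word form (the type and function names below are hypothetical, not the macros in regcomp.h), each Posix class gets one bit of a 32-bit word, and "is any class set" becomes one comparison rather than a loop over bytes:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container: one bit per Posix class, at most 32 classes. */
typedef struct { uint32_t classflags; } class_bits_t;

void class_set(class_bits_t *c, unsigned n)
{
    assert(n < 32);
    c->classflags |= (uint32_t)1 << n;
}

void class_clear(class_bits_t *c, unsigned n)
{
    assert(n < 32);
    c->classflags &= ~((uint32_t)1 << n);
}

int class_test(const class_bits_t *c, unsigned n)
{
    assert(n < 32);
    return (int)((c->classflags >> n) & 1u);
}

/* With a byte array this would need a loop; with a word it is one test. */
int class_any(const class_bits_t *c)
{
    return c->classflags != 0;
}
```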
| |
This revises how these #defines are set up so that the order can change
(as will be done in a later commit), and the only dependencies are on
VERTWS and the max one from handy.h.
| |
This will be used in future commits to allow \v and \V to be treated
consistently with other character classes. (Doing the same for \h isn't
necessary, as it matches identically to [:blank:] in the entire Unicode
range.)
| |
ANYOF_MAX is used as the upper boundary in loops. If we keep it larger
than necessary, the loop does extraneous iterations.
The #defines that come after ANYOF_MAX are moved down to start with it.
This is useful in a later commit that will create an entry in
l1_char_class_tab.h for vertical white space determination.
| |
ANYOF_MAX is used for two different purposes; this separates them and
creates a separate #define for one of them.
| |
Since the xpvlv and regexp structs conflict, we have to find somewhere
else to put the regexp struct.
I was going to sneak it in SvPVX, allocating a buffer large
enough to fit the regexp struct followed by the string, and have
SvPVX - sizeof(regexp) point to the struct. But that would make all
regexp flag-checking macros fatter, and those are used in hot code.
So I came up with another method. Regexp stringification is not
speed-critical. So we can move the regexp stringification out of
re->sv_u and put it in the regexp struct. Then the regexp struct
itself can be pointed to by re->sv_u. So SVt_REGEXPs will have
re->sv_any and re->sv_u pointing to the same spot. PVLVs can then
have sv->sv_any point to the xpvlv body as usual, but have sv->sv_u
point to a regexp struct. All regexp member access can go through
sv_u instead of sv_any, which will be no slower than before.
Regular expressions will no longer be SvPOK, so we give sv_2pv special
logic for regexps. We don’t need to make the regexp struct
larger, as SvLEN is currently always 0 iff mother_re is set. So we
can replace the SvLEN field with the pv.
SvFAKE is never used without SvPOK or SvSCREAM also set. So we can
use that to identify regexps.
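The pointer layout described above can be pictured with a toy model. This is only a sketch under simplifying assumptions: sv_head, rx_body, lv_body and the helper functions are hypothetical reductions, not the real definitions from sv.h or regexp.h.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced bodies: the real structs have many more fields. */
typedef struct { unsigned flags; const char *wrapped; } rx_body;
typedef struct { size_t targoff; size_t targlen; } lv_body;

/* Toy SV head: one body pointer plus a union slot. */
typedef struct {
    void *sv_any;
    union { char *svu_pv; rx_body *svu_rx; } sv_u;
} sv_head;

/* Plain regexp: sv_any and sv_u point to the same spot. */
void init_regexp(sv_head *sv, rx_body *rx)
{
    sv->sv_any = rx;
    sv->sv_u.svu_rx = rx;
}

/* Lvalue holding a regexp: sv_any is the lvalue body, sv_u the regexp. */
void init_lvalue_regexp(sv_head *sv, lv_body *lv, rx_body *rx)
{
    sv->sv_any = lv;
    sv->sv_u.svu_rx = rx;
}

/* All regexp member access goes through sv_u, whichever kind of SV it is. */
rx_body *rx_of(const sv_head *sv)
{
    assert(sv->sv_u.svu_rx != NULL);
    return sv->sv_u.svu_rx;
}
```

In this model the accessor costs a single pointer load in both cases, which matches the claim that going through sv_u is no slower than going through sv_any.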
| |
This macro is now only used under locale; its other use has now been
removed. Change the name to reflect its only use.
| |
ALNUM (meaning \w) is too close to ALNUMC ([[:alnum:]]) for comfort
| |
This synchronizes the ANYOF_FOO usages to the isFOO() usages. Future
commits will take advantage of this relationship.
| |
The ANYOF_foo character class #defines really form an enum, with the
property that the regular one is n, and its complement is n+1. So
we can replace the tests in each 'case:' of the switch with a single test
afterwards.
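A small sketch of the n / n+1 trick, using hypothetical CLS_* constants and the C library's ctype functions in place of Perl's own tables: the switch only decides the positive class, and the low bit of the class number inverts the result once at the end.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical class numbers: each even value n is a class, n+1 its complement. */
enum {
    CLS_WORD = 0, CLS_NWORD,
    CLS_DIGIT,    CLS_NDIGIT,
    CLS_SPACE,    CLS_NSPACE,
    CLS_MAX
};

int class_matches(int classnum, int ch)
{
    int hit;

    assert(classnum >= 0 && classnum < CLS_MAX);

    /* Clear the low bit so a class and its complement share one case. */
    switch (classnum & ~1) {
    case CLS_WORD:
        hit = isalnum((unsigned char)ch) || ch == '_';
        break;
    case CLS_DIGIT:
        hit = isdigit((unsigned char)ch) != 0;
        break;
    default: /* CLS_SPACE */
        hit = isspace((unsigned char)ch) != 0;
        break;
    }

    /* The single test afterwards: odd numbers are the complements. */
    if (classnum & 1)
        hit = !hit;
    return hit;
}
```

This relies on every positive class having an even number, which is an assumption of the sketch.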
| |
This warning was being generated inappropriately during some internal
operations, such as parsing a program; spotted by Tom Christiansen.
The solution is to move the check for this situation out of the common
code, and into the code where just \p{} and \P{} are handled.
As mentioned in the commit's perldelta, there remains a bug
[perl #114148], where no warning gets generated when it should.
| |
There have been two flavors of ANYOF nodes under /l (locale) (for
bracketed character classes). If a class didn't try to match things
like [:word:], it was smaller by 4 bytes than one that did.
A flag bit was used to indicate which size it was. By making all such
nodes the larger size, whether needed or not, that bit can be freed to
be used for other purposes.
This only affects ANYOF nodes compiled under locale rules. The hope is
to eventually get rid of these nodes anyway, by taking the suggestion of
Yves Orton to compile regular expressions using the current locale, and
automatically recompile the next time they are used after the locale
changes.
This commit is somewhat experimental, done early in the development
cycle to see if there are any breakages. There are other ways to free
up a bit, as explained in the comments. Best would be to split off
nodes that match everything outside Latin1, freeing up the
ANYOF_UNICODE_ALL bit. However, there currently would need to be two
flavors of this, one also for ANYOFV. I'm currently working to
eliminate the need for ANYOFV nodes (which aren't sufficient,
[perl #89774]), so it's easiest to wait for this work to be done before
doing the split, after which we can revert this change in order to gain
back the space, but in the meantime, this will have had the opportunity
to smoke out issues that I would like to know about.
| |
(??{}) returns a string which needs to be put through the regex compiler,
and which may also contain (?{...}) - so any 'use re eval' in scope needs
to be propagated into the inner environment. Achieve this by adding a new
private flag - PREGf_USE_RE_EVAL - to the regex to indicate the use is in
scope, and modify how the call to compile the inner pattern is done,
to allow the use state to be passed in.
| |
This was an alias to OP, and formerly used by the old re_eval mechanism
| |
This flag was set during pattern compilation if a (?{}) was encountered;
but is redundant now that we have pRExC_state->num_code_blocks.
| |
The previous commits in this branch have brought literal code blocks
into the New World Order; now do the same for runtime blocks, i.e. those
needing "use re 'eval'".
The main user-visible changes from this commit are that:
* the code is now fully parsed, rather than needing balanced {}'s; i.e.
this now works:
    my $code = q[ (?{ $a = '{' }) ];
    use re 'eval';
    /$code/
* warnings and errors are now reported as coming from "(eval NNN)" rather
than "(re_eval NNN)" (although see the next commit for some fixups to
that). Indeed, the string "re_eval" has been expunged from the source
and documentation.
The big internal difference is that the sv_compile_2op() and
sv_compile_2op_is_broken() functions are no longer used, and will be
removed shortly.
It works by the regex compiler detecting the presence of run-time code
blocks, and feeding the whole pattern string back into the parser (where
the run-time blocks are now seen as compile-time), then extracting out
any compiled code blocks and adding them to the mix.
For example, in the following:
    $c = '(?{"runtime"})d';
    use re 'eval';
    /a(?{"literal"})\b'$c/
At the point the regex compiler is called, the perl parser will already
have compiled the literal code block and presented it to the regex engine.
The engine examines the pattern string, sees two '(?{', but only one
accounted for by the parser, and so constructs a short string to be
evalled: based on the pattern, but with literal code-blocks blanked out,
and \ and ' escaped. In the above example, the pattern string is
    a(?{"literal"})\b'(?{"runtime"})d
and we call eval_sv() with an SV containing the text
    qr'a              \\b\'(?{"runtime"})d'
The returned qr will contain the new code-block (and associated CV and
pad) which can be extracted and added to the list of compiled code blocks
of the original pattern.
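The string construction described above can be sketched as follows. This is an illustration only, not the code in regcomp.c: block_span and build_reparse_string are hypothetical names, and the sketch assumes the literal-block spans are sorted, non-overlapping, half-open [start, end) byte ranges.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record of where an already-compiled literal block sits. */
typedef struct { size_t start; size_t end; } block_span;

/*
 * Build "qr'...'" from a pattern: bytes inside literal code blocks become
 * spaces, and every backslash or single quote elsewhere gets a backslash
 * in front of it. Returns a malloc'd string (caller frees) or NULL.
 */
char *build_reparse_string(const char *pat, const block_span *lit, size_t nlit)
{
    size_t len, i, b = 0;
    char *out, *p;

    assert(pat != NULL);
    len = strlen(pat);

    /* Worst case: every byte escaped, plus "qr'", closing quote and NUL. */
    out = malloc(2 * len + 5);
    if (out == NULL)
        return NULL;

    p = out;
    *p++ = 'q'; *p++ = 'r'; *p++ = '\'';
    for (i = 0; i < len; i++) {
        while (b < nlit && i >= lit[b].end)
            b++;                        /* move past spans already handled */
        if (b < nlit && i >= lit[b].start) {
            *p++ = ' ';                 /* blank out the literal block */
            continue;
        }
        if (pat[i] == '\\' || pat[i] == '\'')
            *p++ = '\\';                /* escape \ and ' for qr'...' */
        *p++ = pat[i];
    }
    *p++ = '\'';
    *p = '\0';
    return out;
}
```

Blanking rather than deleting keeps every remaining byte at its original offset (before escaping), so positions in the reparsed text can still be related back to the original pattern.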
Note that with this scheme, the requirement for "use re 'eval'" is easily
determined, and no longer requires all the pp_regcreset / PL_reginterp_cnt
machinery, which will be removed shortly.
Two subtleties of this scheme are that normally, \\ isn't collapsed into \
for literal regexes (unlike literal strings), and hints aren't inherited
when using eval_sv(). We get round both of these by adding and setting a
new flag, PL_reg_state.re_reparsing, which indicates that we are refeeding
a pattern into the perl parser.
| |
Perl's internal function for compiling regexes that knows about code
blocks, Perl_re_op_compile, isn't part of the engine API. However, the
way that regcomp.c is dual-lifed as ext/re/re_comp.c with debugging
compiled in, means that Perl_re_op_compile is also compiled as
my_re_op_compile. These days the mechanism to choose whether to call
the main functions or the debugging my_* functions when 'use re debug' is
in scope, is the re engine API jump table. Ergo, to ensure that
my_re_op_compile gets called appropriately, this method needs adding to
the jump table.
So, I've added it, but documented as 'for perl internal use only, set to
null in your engine'.
I've also updated current_re_engine() to always return a pointer to a jump
table, even if we're using the internal engine (formerly it returned
null). This then allows us to use the simple condition (eng->op_comp)
to determine whether the current engine supports code blocks.
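A toy version of that arrangement, assuming a jump table reduced to two slots (the struct, the function names and the signatures below are hypothetical simplifications, not the real regexp_engine definition): the lookup never returns null, and a null op_comp slot marks an engine that cannot handle code blocks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-slot jump table; the real one has many more entries. */
typedef struct engine {
    int (*comp)(const char *pattern);
    int (*op_comp)(const char *pattern, int n_code_blocks); /* may be NULL */
} engine;

int core_comp(const char *pattern)             { (void)pattern; return 0; }
int core_op_comp(const char *pattern, int n)   { (void)pattern; (void)n; return 0; }
int plugin_comp(const char *pattern)           { (void)pattern; return 0; }

const engine core_engine   = { core_comp, core_op_comp };
const engine plugin_engine = { plugin_comp, NULL };  /* "set to null in your engine" */

/* Always returns a table: the built-in one when nothing overrides it. */
const engine *current_engine(const engine *overridden)
{
    return overridden != NULL ? overridden : &core_engine;
}

/* The simple condition: a non-null op_comp means code blocks are supported. */
int supports_code_blocks(const engine *eng)
{
    assert(eng != NULL);
    return eng->op_comp != NULL;
}
```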
| |
This now works:
    { my $x = 1; $r = qr/(??{$x})/ }
    my $x = 2;
    print "ok\n" if "1" =~ /^$r$/;
When a qr// is interpolated into another pattern, the pattern is still
recompiled using the stringified qr, but now the pre-compiled code blocks
from the qr are reused rather than being re-compiled, so it behaves like a
closure.
Note that this makes some tests in regexp_qr_embed_thr.t fail, due to a
pre-existing threads bug, which can be summarised as:
    use threads;
    my $s = threads->new(sub { return sub { $::x = 1} })->join;
    $s->();
    print "\$::x=[$::x]\n";
which prints undef, not 1, since the *::x is cloned into the child thread,
then cloned back into the parent as part of the CV (linked from the pad)
being returned in the join. The cloning/join code isn't clever enough
to realise that the back-cloned *::x is the same as the original *::x, so
the main thread ends up with two copies.
This manifests itself in the re tests as
    my $re = threads->new( sub { qr/(?{$::x = 1 })/ })->join();
where, since the returned qr// is now a closure, it suffers from the same
glob duplication in the parent.
So I've disabled 4 re_tests tests under threads for now.
| |
code_blocks is a temporary list of start/end indices and pointers to DO
blocks, that is used during the regexp compilation. Change it so that in
the qr// case, this structure is preserved (attached to regexp_internal),
so that in a forthcoming commit it will be available for use when
interpolating a qr within another pattern.
| |
With this commit, qr// with a literal (compile-time) code block
will Do the Right Thing as regards closures and the scope of lexical vars;
in particular, the following now creates three regexes that match 0, 1
and 2:
    for my $i (0..2) {
        push @r, qr/^(??{$i})$/;
    }
    "1" =~ $r[1]; # matches
Previously, $i would be evaluated as undef in all 3 patterns.
This is achieved by wrapping the compilation of the pattern within a
new anonymous CV, which is then attached to the pattern. At run-time
pp_qr() clones the CV as well as copying the REGEXP; and when the code
block is executed, it does so using the pad of the cloned CV.
Which makes everything come out all right in the wash.
The CV is stored in a new field of the REGEXP, called qr_anoncv.
Note that run-time qr//s are still not fixed, e.g. qr/$foo(?{...})/;
nor is it yet fixed where the qr// is embedded within another pattern:
continuing with the code example from above,
    my $i = 99;
    "1" =~ $r[1];      # bare qr: matches: correct!
    "X99" =~ /X$r[1]/; # embedded qr: matches: whoops, it's still
                       # seeing the wrong $i
| |
Change the way that code blocks in patterns are parsed and executed,
especially as regards lexical and scoping behaviour.
(Note that this fix only applies to literal code blocks appearing within
patterns: run-time patterns, and literals within qr//, are still done the
old broken way for now).
This change means that for literal /(?{..})/ and /(??{..})/:
* the code block is now fully parsed in the same pass as the surrounding
code, which means that the compiler no longer just does a simplistic
count of balancing {} to find the limits of the code block;
i.e. stuff like /(?{ $x = "{" })/ now works (in the same way
that subscripts in double quoted strings always have: "$a{'{'}" )
* Error and warning messages will now appear to emanate from the main body
rather than an re_eval; e.g. the output from
    #!/usr/bin/perl
    /(?{ warn "boo" })/
has changed from
    boo at (re_eval 1) line 1.
to
    boo at /tmp/p line 2.
* scope and closures now behave as you might expect; for example
    for my $x (qw(a b c)) { "" =~ /(?{ print $x })/ }
now prints "abc" rather than ""
* with recursion, it now finds the lexical within the appropriate depth
of pad: this code now prints "012" rather than "000":
    sub recurse {
        my ($n) = @_;
        return if $n > 2;
        "" =~ /^(?{print $n})/;
        recurse($n+1);
    }
    recurse(0);
* an earlier fix that stopped 'my' declarations within code blocks causing
crashes, required the accumulating of two SAVECOMPPADs on the stack for
each iteration of the code block; this is no longer needed;
* UNITCHECK blocks within literal code blocks are now run as part of the
main body of code (run-time code blocks still trigger an immediate
call to the UNITCHECK block though)
This is all achieved by building upon the efforts of the commits which led
up to this; those altered the parser to parse literal code blocks
directly, but up until now those code blocks were discarded by
Perl_pmruntime and the block re-compiled using the original re_eval
mechanism. As of this commit, for the non-qr and non-runtime variants,
those code blocks are no longer thrown away. Instead:
* the LISTOP generated by the parser, which contains all the code
blocks plus OP_CONSTs that collectively make up the literal pattern,
is now stored in a new field in PMOPs, called op_code_list. For example
in /A(?{BLOCK})C/, the listop stored in op_code_list looks like
    LIST
        PUSHMARK
        CONST['A']
        NULL/special (aka a DO block)
            BLOCK
        CONST['(?{BLOCK})']
        CONST['C']
* each of the code blocks has its last op set to null and is individually
run through the peephole optimiser, so each one becomes a little
self-contained block of code, rather than a list of blocks that run into
each other;
* then in re_op_compile(), we concatenate the list of CONSTs to produce a
string to be compiled, but at the same time we note any DO blocks and
note the start and end positions of the corresponding CONST['(?{BLOCK})'];
* (if the current regex engine isn't the built-in perl one, then we just
throw away the code blocks and pass the concatenated string to the engine)
* then during regex compilation, whenever we encounter a '(?{', we see if
it matches the index of one of the pre-compiled blocks, and if so, we
store a pointer to that block in an 'l' data slot, and use the end index
to skip over the text of the code body. Conversely, if the index doesn't
match, then we know that it's a run-time pattern and (for now), compile
it in the old way.
* During execution, when an EVAL op is encountered, if data->what is 'l',
then we just use the pad that was in effect when the pattern was called;
i.e. we use the current pad slot of the currently executing CV that the
pattern is embedded within.
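The index-matching step in the list above can be sketched like this. It is a simplified illustration, not the regcomp.c logic: code_block, find_block and scan_pattern are hypothetical names, end is taken to be the index just past the closing ')', and only "(?{" (not "(??{") is looked for.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record for one pre-compiled block. */
typedef struct {
    size_t start;       /* index of the opening '(' in the pattern string */
    size_t end;         /* index just past the closing ')' */
    const void *block;  /* stands in for the pointer to the compiled ops */
} code_block;

/* Return the pre-compiled block starting exactly at pos, or NULL. */
const code_block *find_block(const code_block *blocks, size_t n, size_t pos)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (blocks[i].start == pos)
            return &blocks[i];
    return NULL;
}

/*
 * Walk the pattern. A "(?{" whose index matches a pre-compiled block is
 * counted and its text skipped using the end index; any other "(?{" is
 * counted as a run-time block.
 */
void scan_pattern(const char *pat, const code_block *blocks, size_t n,
                  size_t *precompiled, size_t *runtime)
{
    size_t i = 0, len;

    assert(pat != NULL);
    len = strlen(pat);
    *precompiled = 0;
    *runtime = 0;

    while (i < len) {
        if (strncmp(pat + i, "(?{", 3) == 0) {
            const code_block *cb = find_block(blocks, n, i);
            if (cb != NULL) {
                assert(cb->end > i);
                (*precompiled)++;
                i = cb->end;    /* skip the code body without scanning it */
                continue;
            }
            (*runtime)++;
        }
        i++;
    }
}
```

Skipping by end index is what makes a body such as `(?{ "(?{" })` safe: the inner "(?{" is never examined, so no brace counting is needed.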
| |
This updates the editor hints in our files for Emacs and vim to request
that tabs be inserted as spaces.
| |
As described in the comments, this changes the design of handling the
Unicode tricky fold characters to not generate a node for each possible
sequence but to get them to work within EXACTFish nodes.
The previous design(s) all used a node to handle these, which suffers
from the downfall that it precludes legitimate matches that would cross
the node boundary.
The new design is described in the comments.
| |
And the reordering for clarity of one test
| |
The latter is the Perl standard way of making this declaration
| |
This macro sets all the bits of the class (for \w, etc) for use during
initialization
| |
Things like \S have not been accessible to the synthetic start class
under locale matching rules. They have been placed there, but the
start class didn't know they were there.
This patch sets ANYOF_CLASS in initializing the synthetic start class
so that downstream code knows it is a charclass_class, and removes
the code that partially allowed this bit to be shared. That code isn't
needed in 5.14, and sharing the bit properly would take more thought than
was reflected in the code.
I can't come up with a test case that would verify that this works,
because of general locale testing issues, other than by looking at a dump of
the generated regex synthetic start class. But the dump isn't the same
thing as the real behavior, and relying on one is also subject to breakage if
the regex code changes in the slightest.
| |
Now that regexes can be combinations of different charset modifiers,
a synthetic start class can match both locale and non-locale. Locale
should generally match only things in the bitmap for code points < 256.
But a synthetic start class with a non-locale component can match such
code points. This patch makes an exception for synthetic nodes; the
outcome will be resolved if the start class passes and the match is
tried again for real.
| |
This code has been rendered obsolete in 5.14 by using a different
mechanism altogether. This functionality is now provided at run-time,
user-selectable, via the /u and /d regex modifiers. This code was
for compile-time selection of which to use.
| |
This was from commit f424400810b6af341e96230836690da51c37b812
which came from needing a bit in an already-full flags field,
and my faulty analysis that two bits could be shared. I found another
mechanism to free up another bit, and now can separate these shared
bits again.
| |
This is the foundation for fixing the regression RT #82610. My analysis
was wrong that two bits could be shared, at least not without further
work. This changes to use a different mechanism to pass needed
information to regexec.c so that another bit can be freed up and, in a
later commit, the two bits can become unshared again.
The bit that is freed up is ANYOF_UTF8, which basically said there is
something that is matched outside the ANYOF bitmap, and requires the
target string to be in utf8. This changes things so the existence of
something besides the bitmap indicates this, and so no flag is needed.
The flag bit ANYOF_NONBITMAP_NON_UTF8 remains to indicate that there is
something that should be matched outside the bitmap even if the target
string isn't in utf8.