| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
| |
embed.fnc declared it as "U32 depth", while it was defined as "const U32
depth".
|
|
|
|
|
|
|
|
| |
This should fix the smoke failures on threaded builds.
It also renames re_indentfo, which was a terrible name in the first
place. Now that I have had to strip the Perl_ prefixes from
these subs with a perl -i -pe, I took the opportunity to rename
it to re_exec_indent, which self-documents much better.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This introduces three new subs:
Perl_re_printf() which is a wrapper for
PerlIO_printf( Perl_debug_log, ... ),
which cuts down on clutter in the code. Arguably this could be moved
to util.c and renamed something like PerlIO_debugf() and then we could
declutter all the statements that write to the Perl_debug_log
filehandle. But that is a bit too ambitious for me right now, so
I leave this as a regex engine only sub for now.
Perl_re_indentf() which is a wrapper for Perl_re_printf(),
which adds an indent argument and automatically indents the
line appropriately, and is used in regcomp.c for trace diagnostics
during compilation.
Perl_re_indentfo() which is similar to Perl_re_indentf() but
is used in regexec.c, and adds a specific prefix to each indented
line to account for the fact that during execution we normally have
string position information on the left.
The end result of this patch is that a lot of clutter in the debugging
statements in the regex engine is reduced, exposing what is actually
going on. It should also now be easier to add new diagnostics which
"do the right thing".
Over time the debugging trace output in regexec has become
very cluttered and confusing. This patch cleans much of it up:
if something happens at a given recursion depth, it is output
at the right depth, and formats have been changed to not have
leading spaces so you can actually see the indentation properly.
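The idea of an indenting printf wrapper can be sketched in plain C. This is only an illustration under stated assumptions, not the code from the commit: the name sketch_indentf, the buffer-based interface, and the choice of two spaces per depth level are all hypothetical, and the real subs write to the Perl_debug_log filehandle rather than to a buffer.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical stand-in for an indenting debug printf: writes depth*2
 * spaces followed by the formatted message into buf. Returns the length
 * the full line needs (excluding the NUL), or -1 on a formatting error. */
int sketch_indentf(char *buf, size_t size, unsigned depth,
                   const char *fmt, ...)
{
    va_list ap;
    int indent, body;

    assert(fmt != NULL);

    /* "%*s" with an empty string pads out to the requested width. */
    indent = snprintf(buf, size, "%*s", (int)(depth * 2), "");
    if (indent < 0)
        return -1;

    va_start(ap, fmt);
    if ((size_t)indent < size)
        body = vsnprintf(buf + indent, size - (size_t)indent, fmt, ap);
    else
        body = vsnprintf(NULL, 0, fmt, ap);   /* only measure the rest */
    va_end(ap);

    return body < 0 ? -1 : indent + body;
}
```

Because the caller passes the depth instead of baking spaces into each format string, every trace line at a given recursion depth lines up without any per-call bookkeeping.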
|
| |
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
| |
A future commit will make more sense if these names are changed. This
reindents some code so that it doesn't overflow 79 columns.
|
|
|
|
|
| |
I found myself using this function, forgetting that it zapped one of the
parameters, so change the name so that can't be forgotten.
|
|
|
|
|
|
|
|
|
|
|
| |
This is at least a partial patch for [perl #127392], cutting the maximum
memory used on my box from around 8600kB to 7800kB. For [perl #127568],
which has been merged into #127392, the savings are even larger, about
37%.
Previously a large number of large mortal SVs could be created while
compiling a single regex pattern, and their accumulated memory quickly
added up. This changes things to not use so many mortals.
|
|
|
|
|
|
| |
I don't know of any cases where this happens, but in working on the next
commit I triggered a problem with shrinking an inversion list so much
that the required 0 UV at the beginning was freed.
|
|
|
|
|
|
|
|
|
|
|
|
| |
This revamps the handling of -Dr for bracketed character classes. There
were bugs introduced earlier in 5.23, and this consolidates the handling
of /d classes so that the interactions can be better considered. It
tries inverting the portion that is in the bitmap range to see if the
output is shorter, and clearer that way. And it always makes the
above-bitmap code points show as not-inverted, as that is clearer.
I ran out of time before the freeze, so I had to not invert in some
cases.
|
|
|
|
|
|
|
| |
This parameter will be used in a future commit. It changes the output
format of this function that displays the contents of an inversion list
so that it won't have to be parsed later, simplifying the code at that
time.
|
|
|
|
|
|
|
| |
This function was used outside the file it contains, but was only
defined (by #ifdef's) for those few internal core files for which it was
needed. Now all those uses have gone, save for the one file. Better to
make it static so no one can circumvent those #ifdef's.
|
|
|
|
|
|
|
|
|
|
| |
grok_bslash_x() is so large that no compiler will inline it. Move it to
dquote.c from dq_inline.c. Conversely, move form_octal_warning() to
dq_inline.c. It is so tiny that the function call overhead is scarcely
smaller than the function body.
This also moves things in embed.fnc so all these functions are not
visible outside the few files they are supposed to be used in.
|
|
|
|
|
| |
This is the one remaining empty {} that was accepted under the
experimental 'use re "strict"'.
|
|
|
|
|
|
| |
This takes code that was duplicated and makes it into a single static
inline function, so that maintenance tasks don't have to be done on both
copies.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A problem with bracketed character classes, qr/[foo]/, is that there is
very little structure about them, so almost anything is legal, and so
typos just silently compile into something unintended. One of the
possible components is a posix character class. There are 14 of them,
and they have a very restricted structure, which is easy to get slightly
wrong, so that instead of the intended posix class being compiled,
something else silently is created. This commit causes the regex
compiler to look for slightly misspelled posix character classes and to
raise a warning when found. It does not change the results of the
compilation.
To do this, it introduces fuzzy parsing into the regex compiler, using
the Damerau-Levenshtein algorithm to find out how many single character
edits it would take to transform the input into one of the 14 classes.
If it is 1 or 2 off, it considers the input to have been intended to be
that class and raises the warning. If more edits would be needed, it
remains silent.
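As a rough illustration of the distance check described above, here is a minimal sketch in C. It uses the restricted (optimal string alignment) form of Damerau-Levenshtein, which may differ from the variant the regex compiler actually uses; the names sketch_edit_distance, sketch_looks_like_class and SKETCH_MAX_LEN are hypothetical, and the real code also has to cope with blanks, '^' placement and other details not modelled here.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_MAX_LEN 31   /* hypothetical cap; class names are short */

static size_t sketch_min3(size_t a, size_t b, size_t c)
{
    size_t m = a < b ? a : b;
    return m < c ? m : c;
}

/* Restricted Damerau-Levenshtein distance: counts insertions, deletions,
 * substitutions and transpositions of two adjacent characters. */
size_t sketch_edit_distance(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t d[SKETCH_MAX_LEN + 1][SKETCH_MAX_LEN + 1];
    size_t i, j;

    assert(la <= SKETCH_MAX_LEN && lb <= SKETCH_MAX_LEN);

    for (i = 0; i <= la; i++) d[i][0] = i;   /* delete everything */
    for (j = 0; j <= lb; j++) d[0][j] = j;   /* insert everything */

    for (i = 1; i <= la; i++) {
        for (j = 1; j <= lb; j++) {
            size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;

            d[i][j] = sketch_min3(d[i - 1][j] + 1,          /* deletion */
                                  d[i][j - 1] + 1,          /* insertion */
                                  d[i - 1][j - 1] + cost);  /* substitution */

            /* adjacent transposition, e.g. "hp" versus "ph" */
            if (i > 1 && j > 1
                && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1]
                && d[i - 2][j - 2] + 1 < d[i][j])
            {
                d[i][j] = d[i - 2][j - 2] + 1;
            }
        }
    }
    return d[la][lb];
}

/* Returns 1 when the input is 1 or 2 edits away from the class name,
 * i.e. close enough that a warning would be raised. */
int sketch_looks_like_class(const char *input, const char *class_name)
{
    size_t dist = sketch_edit_distance(input, class_name);
    return dist >= 1 && dist <= 2;
}
```

With this sketch, "alhpa" is one transposition away from "alpha" and would trigger the warning, while an unrelated word such as "xyz" is too far from any class name and stays silent.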
This is a heuristic: someone could have made enough typos that it
concludes no class was intended when one was. Conversely, it could raise
a warning when no class was intended, though warnings only happen when
the input very closely resembles one of the 14 legal posix classes.
The algorithm can be tweaked if experience indicates it should. But the
bottom line is that many more cases of unintended results will now be
warned about.
Things like having blanks in the construct and having the '^' before the
colon are recognized as being intended posix classes (given that the
actual names are close to one of the 14), and raise warnings. Again
this commit does not change what gets compiled. This found a bug in
autodoc.pl which was fixed a few commits ago.
The [. .] and [= =] POSIX constructs cause perl to croak that they are
unimplemented. This commit improves the parsing of these two, and fixes
some false positives. See
http://nntp.perl.org/group/perl.perl5.porters/230975
The new code combines two functions in regcomp.c into one new one.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This will be used in a future commit.
This code is taken from CPAN Text::Levenshtein::Damerau::XS with the
author's knowledge. There have been white-space changes to make it
conform better to perl's core coding standards, and declaration changes
to make it more portable, such as using UV instead of 'unsigned int',
and PERL_STATIC_INLINE instead of a less portable form, but the logic is
unchanged. One variable was changed to signed from unsigned to avoid a
warning message from some compilers.
The author and I will decide later about keeping the cpan module and
this code in sync. It changes very rarely.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The value of gimme stored in the context stack is U8.
Make all other uses in the main core consistent with this.
My primary motivation on this was that the new function cx_pushblock(),
which I gave a 'U8 gimme' parameter, was generating warnings where callers
were passing I32 gimme vars to it. Rather than play whack-a-mole, it
seemed simpler to just uniformly use U8 everywhere.
Porting/bench.pl shows a consistent reduction of about 2 instructions on
the loop and sub benchmarks, so this change isn't harming performance.
|
|
|
|
| |
Replace CX_PUSHGIVEN() with cx_pushgiven() etc.
|
|
|
|
| |
Replace CX_PUSHLOOP_FOR() with cx_pushloop_for() etc.
|
|
|
|
|
|
| |
Replace CX_PUSHEVAL() with cx_pusheval() etc.
No functional changes.
|
|
|
|
|
|
| |
Replace CX_PUSHFORMAT() with cx_pushformat() etc.
No functional changes.
|
|
|
|
|
|
| |
Replace CX_PUSHSUB() with cx_pushsub() etc.
No functional changes.
|
|
|
|
|
|
| |
Replace CX_PUSHBLOCK() with cx_pushblock() etc.
No functional changes.
|
|
|
|
|
|
|
|
| |
By making SAVETMPS have its own dedicated save type, it avoids having to
push the address of PL_tmps_floor onto the save stack each time.
By also giving it a dedicated save function, the function can do
the PL_tmps_floor = PL_tmps_ix step too, making the binary slightly more
compact.
|
|
|
|
|
|
| |
Rather than doing cx->blk_eval.retop = NULL in PUSHEVAL, then relying on
the caller to subsequently change it to something more useful, make it an
arg to PUSHEVAL.
|
|
|
|
|
|
|
|
|
|
| |
Make the remaining callers of S_leave_common() use leave_adjust_stacks()
instead, then delete this static function.
This brings the benefits of freeing TEMPS on all scope exits that
has already been introduced on sub exits; uses the optimised code for
creating mortal copies; and finally unifies all the different 'process
return args on scope exit' implementations into a single function.
|
|
|
|
|
|
|
| |
It was using S_leave_common(), but that's shortly to be removed. It also
required adding an extra arg to leave_adjust_stacks() to indicate where to
shift the return args to. This will also be needed for when we replace the
remaining uses of S_leave_common() with leave_adjust_stacks().
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently S_leavesub_adjust_stacks() is just used by pp_leavesub.
Rename it to Perl_leave_adjust_stacks(), extend its functionality
slightly, then make pp_leavesublv() use it too.
This means that lvalue sub exit gains the benefit of FREETMPS being done,
and (where mortal copying needs doing) the optimised copying code.
It also means there is now one less version of the "process args on scope
exit" code.
pp_leavesublv() still does a scan of its return args looking for things to
croak() on, but leaves everything else to leave_adjust_stacks().
leave_adjust_stacks() is intended shortly to be used in place of
S_leave_common() too, thus unifying all args-on-scope-exit code.
The changes to leave_adjust_stacks() in this commit (apart from the
renaming and doc changes) are:
* a new arg to indicate what condition to use to decide whether to
pass or copy the arg;
* a new branch to mortalise and ref count bump an arg
|
|
|
|
|
| |
This makes it a bit more obvious what niche in the "eval" ecosystem
that it occupies.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently one of the args to S_leave_common() is supposed to be the
current stack pointer; it returns an updated sp. Instead make it get/set
PL_stack_sp directly.
e.g. in the caller, replace
    dSP;
    SP = S_leave_common(..., SP, ...);
    PUTBACK;
with
    S_leave_common(..., ...);
and in S_leave_common(), make it initially get PL_stack_sp, and before
returning, update PL_stack_sp.
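The difference between the two calling conventions can be shown with a toy stack. This is a sketch under assumptions, not the interpreter's code: sketch_stack and sketch_stack_sp stand in for the argument stack and PL_stack_sp, the functions are hypothetical, and the only "processing" done is to keep the top-most value, roughly what a scalar-context exit would do.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the argument stack and PL_stack_sp (hypothetical).
 * Slot 0 is an unused sentinel; sp points at the top-most used slot. */
int  sketch_stack[16];
int *sketch_stack_sp = sketch_stack;

/* Old style: the caller passes its local sp and must store the result. */
int *sketch_leave_old(int *base, int *sp)
{
    assert(base != NULL && sp >= base);
    if (sp > base) {
        base[1] = *sp;          /* keep only the top-most value */
        return base + 1;
    }
    return base;
}

/* New style: the function reads and updates the global pointer itself,
 * so the caller needs no local copy and no write-back step. */
void sketch_leave_new(int *base)
{
    int *sp = sketch_stack_sp;  /* get the pointer on entry */

    assert(base != NULL && sp >= base);
    if (sp > base) {
        base[1] = *sp;
        sp = base + 1;
    }
    else {
        sp = base;
    }
    sketch_stack_sp = sp;       /* set it again before returning */
}
```

The new style removes three lines of boilerplate from every caller and leaves only one place where the global pointer can get out of sync with the local copy.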
|
|
|
|
|
| |
Since it searches the context stack for the next GIVEN *or* FOR LOOP
context, make the name better express its purpose.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This function implements the less commonly used branch in the POPSUB()
macro that clears @_ in place, or abandons it and creates a new array
in pad slot 0 of the function (the common branch is where @_ hasn't been
reified, and so can be cleared simply by setting fill to -1).
By moving this out to a separate function we can avoid repeating the same
code everywhere the POPSUB macro is used; but since it's only used
in the less frequent cases, the extra overhead of a function call doesn't
matter.
It has a currently unused arg, 'abandon', which will be used shortly.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This changes the handling of Grapheme Cluster Breaks to be entirely via
a lookup table generated by regen/mk_invlists.pl.
This is easier to maintain and follow, as the generation of the table
follows the text of Unicode's UAX29 precisely, and loops can be used to
set every class up instead of having to name each explicitly, so it will
be easier to add new rules. And the runtime switch statement is
replaced by a single line.
My gcc compiler optimized the previous version to an array lookup, but
this commit does it for not so clever compilers.
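To make the table-driven approach concrete, here is a small sketch in C. It is not the generated table: it models only five of the Grapheme_Cluster_Break classes and only rules GB3, GB4, GB5, GB9 and GB999 of UAX #29, and all names (sketch_gcb_table, sketch_gcb_init, sketch_is_gcb_break) are hypothetical. The real table is produced by regen/mk_invlists.pl at build time rather than filled in at run time.

```c
#include <assert.h>

/* A tiny, hypothetical subset of the Grapheme_Cluster_Break classes. */
enum sketch_gcb {
    GCB_OTHER, GCB_CR, GCB_LF, GCB_CONTROL, GCB_EXTEND, GCB_COUNT
};

/* table[before][after] is 1 if a break is allowed between the two. */
static unsigned char sketch_gcb_table[GCB_COUNT][GCB_COUNT];

/* Fill the table with loops, applying the lowest-priority rule first so
 * that higher-priority rules overwrite it. */
void sketch_gcb_init(void)
{
    int i, j;

    /* GB999: otherwise, break everywhere. */
    for (i = 0; i < GCB_COUNT; i++)
        for (j = 0; j < GCB_COUNT; j++)
            sketch_gcb_table[i][j] = 1;

    /* GB9: do not break before Extend. */
    for (i = 0; i < GCB_COUNT; i++)
        sketch_gcb_table[i][GCB_EXTEND] = 0;

    /* GB4 and GB5: break after and before CR, LF and Control. */
    for (i = 0; i < GCB_COUNT; i++) {
        sketch_gcb_table[GCB_CR][i]      = 1;
        sketch_gcb_table[GCB_LF][i]      = 1;
        sketch_gcb_table[GCB_CONTROL][i] = 1;
        sketch_gcb_table[i][GCB_CR]      = 1;
        sketch_gcb_table[i][GCB_LF]      = 1;
        sketch_gcb_table[i][GCB_CONTROL] = 1;
    }

    /* GB3: do not break between CR and LF. */
    sketch_gcb_table[GCB_CR][GCB_LF] = 0;
}

/* The runtime decision is a single array lookup instead of a switch. */
int sketch_is_gcb_break(enum sketch_gcb before, enum sketch_gcb after)
{
    assert((unsigned)before < GCB_COUNT && (unsigned)after < GCB_COUNT);
    return sketch_gcb_table[before][after];
}
```

Adding a new rule in this scheme means adding one loop or one assignment to the table builder; the lookup function never changes.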
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adds the final Unicode boundary type previously missing from core
Perl: the LineBreak one. This feature is already available in the
Unicode::LineBreak module, but I've been told that there are portability
and some other issues with that module. What's added here is a
light-weight version that is lacking the customizable features of the
module.
This implements the default Line Breaking algorithm, but with the
customizations that Unicode is expecting everybody to add, as their
test file tests for them. In other words, this passes Unicode's fairly
extensive furnished tests, but wouldn't if it didn't include certain
customizations specified by Unicode beyond the basic algorithm.
The implementation uses a look-up table of the characters surrounding a
boundary to see if it is a suitable place to break a line. In a few
cases, context needs to be taken into account, so there is code in
addition to the lookup table to handle those.
This should meet the needs for line breaking of many applications,
without having to load the module.
The algorithm is somewhat independent of the Unicode version, just like
the other boundary types. Only if new rules are added, or existing ones
modified is there need to go in and change this code. Otherwise,
running regen/mk_invlists.pl should be sufficient when a new Unicode
release is done to keep it up-to-date, again like the other Unicode
boundary types.
|
|
|
|
| |
This will allow new behavior, needed in a future commit.
|
|
|
|
| |
This is just acting on the TODO comment.
|
|
|
|
| |
See http://nntp.perl.org/group/perl.perl5.porters/233287
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Bracketed character classes generally generate an ANYOF-type regnode,
which consists of a bitmap for the lower code points, and an inversion
list or swash to handle ones not in the bitmap. They take up more
memory than other regnode types. There are already some optimizations
that use a smaller and/or faster regnode instead. For example, some
people prefer not to use a backslash to escape metacharacters, instead
writing something like /abc[.]def/. This has for some time generated
the same thing as /abc\.def/ does, namely a single EXACT node, which is
both smaller and faster than an ANYOF node in the middle of two EXACT
nodes.
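The two-part structure of such a node (a bitmap for low code points, an inversion list for the rest) can be sketched as follows. This is an illustration only: struct sketch_class and its functions are hypothetical, the bitmap is assumed to cover code points 0 to 255, and the sketch does not model details of the real inversion lists such as the required leading 0 element or swashes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BITMAP_SIZE 256   /* assumed bitmap range */

struct sketch_class {
    uint8_t bitmap[SKETCH_BITMAP_SIZE / 8];
    /* Sorted code points. An element at an even index starts a range
     * that is in the set; an element at an odd index starts a range
     * that is not. */
    const uint32_t *invlist;
    size_t invlist_len;
};

static int sketch_invlist_contains(const uint32_t *list, size_t len,
                                   uint32_t cp)
{
    /* Binary search for the number of elements that are <= cp. */
    size_t lo = 0, hi = len;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (list[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* An odd count means the last boundary crossed opened a range. */
    return (lo & 1) != 0;
}

int sketch_class_matches(const struct sketch_class *c, uint32_t cp)
{
    assert(c != NULL);
    if (cp < SKETCH_BITMAP_SIZE)
        return (c->bitmap[cp >> 3] >> (cp & 7)) & 1;   /* one bit test */
    return sketch_invlist_contains(c->invlist, c->invlist_len, cp);
}
```

The bitmap path is a single shift and mask, while the inversion list path costs a binary search, which is consistent with the speed difference the commit reports for code points above the bitmap.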
This commit adds some optimizations that hadn't been done previously.
Now things like /[\p{Word}]/ will optimize to \w, for example. I had
not done this before, because my tests had shown very little performance
difference, but I had added most of the code to regcomp.c so it wouldn't
get lost, #ifdef'd out.
It turns out that I hadn't tested on code points above the bitmap, which
with this commit have a small, but appreciable speed up in matching, so
this commit enables and finishes that code.
Prior to this commit, things like /[[:word:]]/ were optimized to \w, but
things like /[_[:word:]]/ were not. This commit fixes that.
If the following command is run on a perl compiled with -O2 and no
DEBUGGING:
blead Porting/bench.pl --raw --benchfile=charclass_perf --perlargs=-Ilib /path_to_prior_perl="before this commit" /path_to_this_perl=after
and the file 'charclass_perf' contains
    [
        'regex::charclass::ascii' => {
            desc => 'charclass, ascii range',
            setup => 'my $a = qr/[\p{Word}]/',
            code => '"A" =~ $a'
        },
        'regex::charclass::upper_latin1' => {
            desc => 'charclass, upper latin1 range',
            setup => 'my $a = qr/[\p{Word}]/',
            code => '"\x{e0}" =~ $a'
        },
        'regex::charclass::above_latin1' => {
            desc => 'charclass, above latin1 range',
            setup => 'my $a = qr/[\p{Word}]/',
            code => '"\x{100}" =~ $a'
        },
        'regex::charclass::high_Unicode' => {
            desc => 'charclass, high Unicode code point',
            setup => 'my $a = qr/[\p{Word}]/',
            code => '"\x{10FFFF}" =~ $a'
        },
    ];
the following results are obtained:
The numbers represent raw counts per loop iteration.
regex::charclass::above_latin1
charclass, above latin1 range
       before this commit    after
       ------------------ --------
Ir                 3344.0   2888.0
Dr                  971.0    855.0
Dw                  604.0    541.0
COND                575.0    504.0
IND                  25.0     25.0
COND_m               11.0     10.7
IND_m                10.0     10.0
Ir_m1                 8.9      6.0
Dr_m1                 3.0      3.2
Dw_m1                 1.5      1.4
Ir_mm                 0.0      0.0
Dr_mm                 0.0      0.0
Dw_mm                 0.0      0.0
regex::charclass::ascii
charclass, ascii range
       before this commit    after
       ------------------ --------
Ir                 2661.0   2649.0
Dr                  798.0    795.0
Dw                  516.0    517.0
COND                467.0    465.0
IND                  23.0     23.0
COND_m               10.0      8.8
IND_m                10.0     10.0
Ir_m1                 7.9      0.0
Dr_m1                 2.9      3.1
Dw_m1                 1.3      1.3
Ir_mm                 0.0      0.0
Dr_mm                 0.0      0.0
Dw_mm                 0.0      0.0
regex::charclass::high_Unicode
charclass, high Unicode code point
       before this commit    after
       ------------------ --------
Ir                 3344.0   2888.0
Dr                  971.0    855.0
Dw                  604.0    541.0
COND                575.0    504.0
IND                  25.0     25.0
COND_m               11.0     10.7
IND_m                10.0     10.0
Ir_m1                 8.9      6.0
Dr_m1                 3.0      3.2
Dw_m1                 1.5      1.4
Ir_mm                 0.0      0.0
Dr_mm                 0.0      0.0
Dw_mm                 0.0      0.0
regex::charclass::upper_latin1
charclass, upper latin1 range
       before this commit    after
       ------------------ --------
Ir                 2661.0   2651.0
Dr                  798.0    796.0
Dw                  516.0    517.0
COND                467.0    466.0
IND                  23.0     23.0
COND_m               11.0      8.8
IND_m                10.0     10.0
Ir_m1                 7.9      0.0
Dr_m1                 2.9      3.3
Dw_m1                 1.5      1.2
Ir_mm                 0.0      0.0
Dr_mm                 0.0      0.0
Dw_mm                 0.0      0.0
|
|
|
|
|
|
|
|
|
|
| |
In half the calls to to_utf8_case(), the code point being looked up is
known. It is thrown away because the API doesn't pass it, and then
recalculated first thing in to_utf8_case.
Fix this by making a new static function which adds the code point to
the parameter list, and change all calls to use this, leaving the
existing to_utf8_case() as just a wrapper for the new function.
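The wrapper pattern described here might look roughly like the sketch below. It is not the real to_utf8_case(): the function names are hypothetical, only one- and two-byte UTF-8 sequences are handled, and the "case mapping" is a toy uppercase rule for ASCII and Latin-1 letters instead of a table lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode a 1- or 2-byte UTF-8 sequence (code points below 0x800 only). */
static uint32_t sketch_decode(const unsigned char *p)
{
    if (p[0] < 0x80)
        return p[0];
    return ((uint32_t)(p[0] & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);
}

/* Inner function: the caller supplies the already-known code point along
 * with the original bytes, so nothing is decoded twice. Writes the
 * uppercased character as UTF-8 into out and returns its length. */
static size_t sketch_upper_cp(uint32_t cp, const unsigned char *p,
                              unsigned char *out)
{
    uint32_t up = cp;

    if (cp >= 'a' && cp <= 'z')
        up = cp - 0x20;
    else if (cp >= 0xE0 && cp <= 0xFE && cp != 0xF7)
        up = cp - 0x20;             /* toy Latin-1 rule */

    if (up == cp) {                 /* unchanged: reuse the input bytes */
        size_t n = p[0] < 0x80 ? 1 : 2;
        memcpy(out, p, n);
        return n;
    }
    if (up < 0x80) {
        out[0] = (unsigned char)up;
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (up >> 6));
    out[1] = (unsigned char)(0x80 | (up & 0x3F));
    return 2;
}

/* Existing-style entry point, now just a thin wrapper: it decodes the
 * code point once and hands it to the inner function. */
size_t sketch_upper(const unsigned char *p, unsigned char *out)
{
    assert(p != NULL && out != NULL);
    return sketch_upper_cp(sketch_decode(p), p, out);
}
```

Callers that already hold the code point call the inner function directly and skip the decode; everyone else keeps using the wrapper with an unchanged signature.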
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
By "true" I mean that they have prototypes, but no bodies.
So don't declare their prototypes under PERL_NO_INLINE_FUNCTIONS.
After some studying of http://www.greenend.org.uk/rjk/tech/inline.html
it seems like Perl is trying to implement the "simple portable" model.
But the functions listed as failing during porting/extrefs.t in Tru64:
they are neither fish nor fowl. Their prototypes are listed in
proto.h as PERL_STATIC_INLINE (which in Tru64 is "static inline"),
but since the test is built with -DPERL_NO_INLINE_FUNCTIONS,
the function bodies (which would be in inline.h) are not visible.
So they end up being body-less static inline prototypes, which is,
I believe, somewhat of an oxymoron.
The "complicated portable" model might be a more worthwhile longer
term goal: in that, there is no "static inline", and there would be
a new source file, say, inline.c. Now with the "simple portable",
the bodies might end up being compiled multiple times, multiple copies
ending up in different object files, depending on how smart the
compiler/linker is.
Another move could be that maybe there should be no prototypes at all
for inlineables, because having those is kind of beside the point. How
well that would work across different compilers is unknown.
Yet another move, perhaps the simplest one, would be to move these
particular functions away from inline.h. But this would be just
dodging the larger problems discussed above.
|
|
|
|
| |
With one ugly cast inside the reg_recode() call.
|
|
|
|
| |
This only returns TRUE or FALSE; no need for a wider return value.
|
|
|
|
|
|
|
| |
regpatws() is only used in one place, and it is dangerous to retain as a
named entity. This is because wherever white space is to be skipped,
(#...) comments are to be as well, so the function that does both things
should be called instead of this one.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Sometimes we want to move to the next non-ignored character in the
input. The nextchar() function does that (but buggily in UTF-8).
And sometimes we are already at the next character, but if it is one
that should be ignored, we want to move to the first one that isn't.
This commit creates a function to do the second task by extracting the
code in nextchar() to it, and making nextchar() a lightweight wrapper
around it, and hence likely to be optimized out by the compiler.
This is a step in the direction of fixing the UTF-8 problems with
nextchar(), and fixing some other bugs. The new function has added
generality which won't be used until a later commit.
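The split between the two tasks can be sketched like this. Everything here is a simplified assumption: the names sketch_skip_ignored and sketch_nextchar are hypothetical, "ignored" text is taken to mean (?#...) comments plus, under an /x-like flag, white space and # comments, and the code is byte-oriented, so it shares the UTF-8 limitation the commit message mentions.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

struct sketch_parser {
    const char *parse;   /* current position in the pattern */
    const char *end;     /* one past the last byte */
    int x_mode;          /* non-zero when /x-style skipping is on */
};

/* Second task: starting AT p, return the first position that is not
 * ignorable. Returns p itself if nothing needs skipping. */
const char *sketch_skip_ignored(const char *p, const char *end, int x_mode)
{
    for (;;) {
        if (end - p >= 3 && p[0] == '(' && p[1] == '?' && p[2] == '#') {
            while (p < end && *p != ')')    /* (?#...) comment */
                p++;
            if (p < end)
                p++;                        /* step over the ')' */
            continue;
        }
        if (x_mode && p < end) {
            if (isspace((unsigned char)*p)) {
                p++;
                continue;
            }
            if (*p == '#') {                /* comment to end of line */
                while (p < end && *p != '\n')
                    p++;
                continue;
            }
        }
        return p;
    }
}

/* First task: move past the current byte, then past anything ignorable.
 * It is only a thin wrapper around the function above. */
void sketch_nextchar(struct sketch_parser *s)
{
    assert(s != NULL);
    if (s->parse < s->end)
        s->parse++;
    s->parse = sketch_skip_ignored(s->parse, s->end, s->x_mode);
}
```

A caller that is already positioned on possibly ignorable text calls the skipping function directly; a caller that has just consumed a character calls the wrapper.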
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
nextchar() advances the parse to the next byte beyond any ignorable
bytes, returning the parse pointer before the advancement.
I find this confusing, as
    foo = nextchar();
reads as if foo should point to the next character, instead of the
character where the parse already is. This functionality is hard for
a reader to grok, even if the name weren't misleading, as the place the
variable gets set in the source is far away from the call. It's clearer
to say
    foo = current;
    nextchar();
This has confused others as well, as in one place several commits have
been required to get it so it works properly, and games have been played
to back up the parse if it turns out it shouldn't have been advanced,
whereas it's better to check first, then advance if it is the right
thing to do. Ready-Fire-Aim is not a best practice.
This commit makes nextchar() return void, and changes the few places
where the en-passant value was used.
The new scheme is still buggy, as nextchar() only advances a single
byte, which may be the wrong thing to do when the pattern is UTF-8
encoded. More work is needed to be in a position to fix this. We have
only gotten away with this so far because apparently no one is using
non-ASCII white space under /x, and our meta characters are all ASCII,
and there are likely other things that reposition things to a character
boundary before problems have arisen.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Reorder the body of Perl_sv_backoff slightly to make it more tail-call
friendly, and change its signature from returning an int (always 0) to
void.
sv_backoff has only 1.5 function calls in it: there is a memcpy of a U32 *
for alignment reasons (I won't discuss U32_ALIGNMENT_REQUIRED) inside of
SvOOK_offset, and the explicit Move()/memmove. GCC and clang often inline
memcpy/memmove when the length is a constant and is small. Sometimes
a CC might also do unaligned memory reads if OS/CPU allows it
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20130513/174807.html
so I'll assume memcpy by short constant isn't a func call for discussion.
By moving SvFLAGS modification before the one and only func call, and
changing the return type to void, there is no code to execute after the
Move func call so the CC, if it wants (OS/ABI/CPU, specifically I am
thinking about x86-64) can tailcall jump to memmove. Also var sv can be
stored in a cheaper vol reg since it is not saved around any func calls
(SvFLAGS set was moved) assuming the memcpy by short constant was inlined.
The machine code size of Perl_sv_backoff with VC 2003 -O1 was 0x6d bytes
before and is 0x61 after. The .text section of perl523.dll was 0xD2743
bytes long before and is 0xD2733 after. VC perl does not inline memcpys
by default.
In commit a0d0e21ea6 "perl 5.000" the return 0 was added. The int ret type
is from day 1 of sv_backoff function existing/day 1 of SV *s
from commit 79072805bf "perl 5.0 alpha 2". str_backoff didn't exist AFAIK,
only str_grow would retake the memory at the start of the block. Since
sv_backoff is usually used in a "&& func()" macro (SvOOK_off), it needed a
non void ret type, a simple ", 0" in the macro fixes that. All CCs optimize
and remove "if(0)" machine instructions so the ", 0" is optimized away in
the perl binary.
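The two ideas in this message, doing the flag update before the only call so the call can be the last thing executed, and using ", 0" so a void function still works inside a "&&" expression, can be sketched as follows. The struct, the function and the macro are hypothetical stand-ins, not the real SV layout or SvOOK_off, and whether a compiler actually emits a tail call depends on the platform and optimisation settings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a string buffer with a start offset. */
struct sketch_buf {
    char   storage[32];
    size_t offset;       /* bytes skipped at the front of storage */
    size_t len;          /* bytes in use after the offset */
    int    has_offset;   /* stand-in for the OOK flag */
};

/* Returns void, and all bookkeeping is done before the one function
 * call, so nothing remains to execute after memmove returns. That makes
 * the call eligible for tail-call optimisation. */
void sketch_backoff(struct sketch_buf *b)
{
    size_t off = b->offset;

    assert(off + b->len <= sizeof b->storage);
    b->has_offset = 0;
    b->offset = 0;
    memmove(b->storage, b->storage + off, b->len);
}

/* The right-hand side of && must have a value; the comma operator
 * supplies one even though sketch_backoff() returns void. */
#define SKETCH_OFFSET_OFF(b) \
    ((void)((b)->has_offset && (sketch_backoff(b), 0)))
```

The trailing 0 is a constant whose value is discarded, so an optimising compiler has nothing to emit for it.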
|