| Commit message | Author | Age | Files | Lines |
|
|
|
| |
A few clarifications
|
|
|
|
|
|
|
|
|
| |
The first comment listed an item as a TODO that was recommended by
Unicode. That recommendation is being rescinded in Unicode 12.0 based
on a ticket I filed against Unicode, which in turn was based on feedback
from Asmus Freitag.
The second comment was obsoleted by later code changes.
|
|
|
|
|
|
| |
Commit 4c83fb55d7096a1d0e6a7a8e25d20b186be3281d added a macro for
clarity. I have since realized that it is even clearer to spell things
as this commit now does.
|
|
|
|
|
|
|
|
|
|
|
|
| |
All scripts can have the ASCII digits for their numbers. Scripts with
their own digits can alternatively use those. Only one of these two
sets can be used in a script run. The decision as to which set to use
must be deferred until the first digit is encountered, as otherwise we
don't know which set will be used. Prior to this commit, the decision
was being made prematurely in some cases. As a result of this change,
the non-ASCII digits in the Common script need to be special-cased, and
different criteria are used to decide whether we need to look up if a
character is a digit.
|
|
|
|
|
| |
The new name is clearer as to its meaning, more so after the next
commit.
|
|
|
|
| |
(and tweak the debugging output of CLOSE_CAPTURE())
|
|
|
|
|
|
| |
Every use of the CLOSE_CAPTURE() macro is followed by the setting of
lastparen and lastcloseparen, so include these actions in the macro
itself.
|
|
|
|
|
| |
This macro includes debugging output, so by using it rather than
setting rex->offs[paren].start/end directly, you get better debugging.
|
|
|
|
|
|
| |
Make its index and start+end values into parameters. This will shortly
allow its use in other places, bringing consistent code and debug logging
to the whole of S_regmatch().
|
|
|
|
|
|
|
|
| |
Move this macro to earlier in the file to be with the other functions
and macros which deal with setting and restoring captures.
No changes (functional or textual) apart from the physical moving of the
13 lines.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The (?n) mechanism allows you to 'gosub' to a subpattern delineated by
capture n. For 1-char-width repeats, such as a+, \w*?, or (\d)*, the
code currently checks whether it's in a gosub each time it attempts
to start executing the B part of A*B, regardless of whether the A is
in a capture.
This commit moves the GOSUB check to within the capture-only variant
(CURLYN), which then directly just looks for one instance of A and
returns. This moves the check away from more frequently called code
paths.
|
|
| |
There are currently two similar backtracking states for simple
non-greedy pattern repeats:
    CURLY_B_min
    CURLY_B_min_known
The latter is a variant of the former for when the character which must
follow the repeat is known, e.g. /(...)*?X.../, which allows quick
skipping to the next viable position.
The code for the two cases:
    case CURLY_B_min_fail:
    case CURLY_B_min_known_fail:
shares a lot of similarities. This commit merges the two states into a
single CURLY_B_min state, with an associated single CURLY_B_min_fail
fail state.
That one code block can handle both types, with a single
    if (ST.c1 == CHRTEST_VOID) ...
test to choose between the two variant parts of the code.
This makes the code smaller and more maintainable, at the cost of one
extra test per backtrack.
|
|
|
|
|
|
|
|
| |
This does not have a ticket, but was pointed out in
http://nntp.perl.org/group/perl.perl5.porters/251870
The logic for deciding whether we needed to check if a character is a
digit was flawed.
|
|
|
|
|
| |
This commit #defines a macro and uses it, which makes the code easier to
understand.
|
|
|
|
|
| |
The expression as a whole is unlikely to be true, not just the portion
that was marked so.
|
|
|
|
|
|
|
|
|
| |
Unicode 11 has some new data files needed for it, and some changes in
the boundary rules that need to be accounted for. This does all that
can be done without causing tests to fail. The LB algorithm has
changed, and tests would fail if we included the code changes needed for
that change in this commit. Instead those few lines will come as part
of the Unicode 11.0 commit.
|
|
|
|
|
| |
There's no reason to use the function name, and by using the macro, this
code is insulated against future changes.
|
|
|
|
|
| |
The second argument to this macro is a pointer to the end, as opposed to
a length.
|
|
|
|
|
| |
It makes more sense for this code to be in the function called, rather
than separated out.
|
|
|
|
|
|
|
|
|
|
| |
Setting the pointer to NULL after freeing signals to the code in later
iterations that it has been freed already.
No test is added because it could become outdated (not testing what it
was designed to test) with a new Unicode version changing the underlying
data. This bug was discovered by testing on Unicode 7.0, and the data
changed so that there was not a problem by Unicode 10.0.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is the third commit involved in [perl #132063], and the bottom-line
cause of it. The problem is that the code is incorrectly branching to a
portion of the code that expects it is handling UTF-8. And the input
isn't UTF-8. The fix is to handle this case and branch correctly. This
bug requires the following things in order to manifest:
1) the pattern is compiled under /il
2) the pattern does not contain any characters below 256
3) the target string is not UTF-8.
(The committer changed the test to test this issue on EBCDIC, as the
original \xFF is an invariant there that wouldn't exercise the problem.
We want a start byte for a long UTF-8 sequence for a single character.
On the EBCDIC pages we support, \xFE fits that bill.)
|
| |
There were three things that were fixed as a result of this ticket, any
one of which would have avoided the issue.
Commit 421da25c4318861925129cd1b17263289db3443c already has fixed
one of those. The issue was reading beyond the end of a buffer, and
that commit keeps from reading beyond a NUL, which normally should be
present, marking the end of the buffer.
This commit fixes the issue where the code was told that reading that
many bytes was ok to do. There were several instances in regexec.c of
the code assuming that the input was valid UTF-8, whereas the input was
too short for what the start byte claimed it would be.
I grepped through the core for any other similar uses, and did not find
any.
The next commit will fix the third thing.
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit 77584140f7cbfe714083cacfa671085466e98a7b.
This optimisation of mine from 5.25.9 is ill-conceived; under the right
permutations of backtracking, it is possible for the current positions
of one or more captures not to be restored to their previous positions.
This commit reverts the code change, but keeps the benchmark part of
that commit, and adds a test.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Make the various debugging outputs identify where the message is coming
from; e.g., change
    trying longer...
to
    WHILEM: B min fail: trying longer...
and change some existing "whilem: ..." messages to "WHILEM: ..." for
consistency.
|
|
|
|
|
|
|
|
|
| |
The code points that Unicode furnishes will always be unsigned. This
changes the code to uniformly treat the ones in the constructed tables
of Unicode properties as unsigned, avoiding possible signedness compiler
warnings on some systems.
Spotted by Dave Mitchell.
|
|
|
|
| |
The macro hides the underlying implementation detail.
|
|
|
|
| |
The argument is 32 bits, but only the lowest 8 are used.
|
|
|
|
|
| |
A previous commit has changed the circumstances of this code, so that we
now know certain things to be true that we didn't before.
|
|
|
|
|
| |
This outdents code due to the removal of an enclosing block by a
previous commit.
|
|
|
|
|
|
|
|
|
|
| |
This commit changes to use the C data structures generated by the
previous commit to compute what characters fold to a given one. This is
used to find out what things should match under /i.
This now avoids the expensive start-up cost of switching to the
Perl-level utf8_heavy.pl, loading a file from disk, and constructing a
hash from it.
|
|
|
|
|
| |
These are unused now that all the POSIX class lookups are done through
inversion lists, instead of swashes.
|
|
|
|
| |
Fix up indentation based on the previous few commits
|
| |
|
|
|
|
|
|
|
| |
Previously this had two loops, the first one was used to keep from
loading the swash for as long as possible. Now that it is loaded by
default, there is no need to do this. This overwrites the first loop
with the second loop.
|
| |
|
|
|
|
|
| |
This is so the default: can be used for another purpose in the next
commit.
|
|
|
|
|
|
| |
We've been burned before by malformed UTF-8 causing us to read outside
the buffer bounds. Here is a case I saw during code inspection, and
it's easy to add the buffer end limit.
|
|
|
|
|
| |
I'm doing this one-at-a-time for bisection reasons, in case I make a
mistake.
|
|
|
|
|
| |
This macro is obsolete because the inversion list for this property is
now always loaded, so there is no need to load it.
|
|
|
|
| |
When this #ifdef'd code was compiled, there was a warning.
|
|
|
|
|
|
|
|
|
| |
The recent changes fixed by this commit neglected to take into account
EBCDIC differences.
Mostly, the algorithms apply only to ASCII platforms, so the EBCDIC code
is ifdef'd out. In a couple of cases, the algorithm mostly applies, so
the scope of the ifdefs is smaller.
|
|
|
|
| |
Properly indent preprocessor directives
|
|
|
|
|
|
| |
Daniel Dragan pointed out that this parameter is unused (the commits
that want it didn't get into 5.28), and is causing a table to be
duplicated all over the place, so just remove it for now.
|
|
|
|
|
|
|
|
| |
The root cause of this was using a 'char' where it should have been
'U8'. I changed the signatures so that all the related functions take
and return U8's, and the compiler detects what should be cast to/from
char. The functions all deal with byte bit patterns, so unsigned is the
appropriate declaration.
|
|
|
|
|
|
|
| |
The EXACTFA nodes are in fact not generated by /a, but by /aa. Change
the name to EXACTFAA to correspond.
I found myself getting confused by this.
|
| |
| |
find_byclass() is used to scan through a target string looking for
something that matches a particular class. This commit speeds that up
for patterns of the /foo/i type, where neither the target string nor
the pattern are UTF-8.
More precisely, it speeds up only finding the first byte of 'foo'
in the string. The actual matching speed remains the same, once that
initial character that is a potential match is found.
But finding that first character is sped up immensely by this commit.
It does this by using memchr() when the character is caseless. For
example in the pattern /:abcde/i, the colon is first, and is caseless.
On my system memchr is extremely fast, so the numbers below for this
case may not be as good on other systems.
And when the first character is cased, such as in /abcde/i, it uses the
techniques added in 2813d4adc971fbaa124b5322d4bccaa73e9df8e2 for the
ANYOFM regnode. In both ASCII and EBCDIC machines, the case folds of
the cased letters are almost all of the form that these techniques work
on. There are no tests in our current test suite that don't have this
form. However, /il (locale) matching may very well violate this, and
will use the per-byte scheme that has been in effect until this commit.
The numbers below are for finding the first letter after a long string
that doesn't include that character. Doing this isolates the speed up
attributable to this commit from overhead.
The only downsides of this commit are that on some systems memchr() may
introduce function call overhead that won't pay off if the next
occurrence of the character is close by; and in the other case, a single
extra conditional is required to determine if there is at least a word's
worth of data to look at, plus some masking, shifting, and arithmetic
instructions associated with that conditional. A static function is
called, so there may or may not be function call overhead, depending on
the compiler optimizer.
Key:
    Ir    Instruction read
    Dr    Data read
    Dw    Data write
    COND  conditional branches
    IND   indirect branches
The numbers represent raw counts per loop iteration.

caseless first letter
    ('b' x 10000) . ':' =~ /:a/i

              blead     fast   Ratio %
            --------  -------  -------
    Ir       72109.0   4819.0   1496.3
    Dr       20608.0   1237.5   1665.3
    Dw       10409.0    409.5   2541.9
    COND     20376.0    702.0   2902.6
    IND         15.0     16.0     93.8

cased first letter
    ('b' x 10000) . 'a' =~ /A/i

              blead     fast   Ratio %
            --------  -------  -------
    Ir      103074.0  25704.6    401.0
    Dr       20896.5   2164.9    965.2
    Dw       10587.5    601.9   1759.0
    COND     30516.0   3036.2   1005.1
    IND         22.0     22.0    100.0
|
|
|
|
|
|
|
| |
In this macro, COND has just returned true for the given byte. We then
need to test that the rest of the relevant portion of the input string
and pattern match. But before this commit, we started at the byte we
already know the answer for. Change to test starting one position over.
|
| |
|
| |
|