| Commit message (Collapse) | Author | Age | Files | Lines |
... | |
|
|
|
|
|
|
|
|
|
|
|
|
| |
It turns out that now, the regular methods for handling multi-character
folds work for the ones involving LATIN SMALL LETTER SHARP S when the
pattern is in UTF-8. So the special code for handling this case can be
removed, and a regular EXACTFU node is generated. This has the
advantage of being trie-able, and requiring fewer operations at run
time, as the pattern is pre-folded at compile time, and doesn't have to
be re-folded during each backtracking at run-time.
This means that the EXACTFU_SS node type will only be generated for
non-UTF-8 patterns, and the handling of it is unchanged in these cases.
|
|
|
|
|
|
| |
By setting a variable, we can place some subsequent code into an else,
avoiding duplication of work. (Perhaps the compiler optimizers already
figure this out.)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
I didn't fully understand the use of two variables that deal with /u in
regular expression pattern compilation when I commited
1a26cbcb0358334c3eb1941a919ffdb57eb4fe7e.
RExC_uni_semantics is used to say that there is something in the pattern
that indicates it is supposed to use /u if /d would otherwise be
selected. Once set, it cannot be cleared.
UNI_SEMANTICS means if we are operating under /u. Code forbids /d to be
selected if RExC_uni_semantics is set, but we could select /l, /a, or
/aa, so we don't have to be in /u (and hence UNI_SEMANTICS) if
RExC_uni_semantics is set. We just can't be in /d.
This commit adds comments, and fixes things to match the above
description.
|
|
|
|
| |
The new names are shorter and more meaningful.
|
|
|
|
|
|
| |
Change 'has_upper_latin1_only_utf8_matches' to
'upper_latin1_only_utf8_matches ', as the initial 'has_' is unnecessary
and somewhat misleading in spots
|
|
|
|
|
| |
The inversion list this refers to now includes the Latin 1 range, so the
name was misleading.
|
|
|
|
|
|
|
|
|
| |
These 2 Unicode-like property definitions used internally by the regular
expression compiler are moved by this commit from regen/mk_invlists.pl
to lib/unicore/mktables.
By placing all these in the same place, maintainers only have to learn
one bit of code, instead of two.
|
|
|
|
|
|
|
| |
This refactors the code so that it doesn't refer to an object before it
makes sure it exists and isn't empty. This hasn't been a problem in the
past, but a future commit will call this subroutine with parameters that
expose this bug.
|
|
|
|
|
|
| |
This variable's name was out-of-date and misleading. It is the name of
an inversion list that contains all the code points in the current
version of Unicode that participate in any way in a /i type of fold.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Prior to this commit, a class containing U+FF, LATIN SMALL LETTER Y WITH
DIAERESIS, generated an ANYOFD regnode because it thought that what
matched depended on the UTF-8ness of the target string. But it doesn't.
No bugs were introduced because when ANYOFD is encountered the code
looks at some flags to determine what sorts of dependencies to further
look for, and the flags remained clear. But ANYOFD is less desirable
than plain ANYOF, because it adds extra branches to execute.
Tests for this fix will be added in a future commit.
|
|
|
|
|
|
| |
If perl is compiled on an early enough Unicode release, these won't be
defined, but since they're only used in deprecated functions, we can
just #ifdef them away.
|
|
|
|
| |
This comment was out-of-date
|
|
|
|
|
|
|
| |
In the (?[ ... ]) regex sets features, one can embed another compiled
regex set pattern. Such compiled patterns always have a flag of '^',
which we weren't looking for prior to this commit. That meant that
uncompiled patterns would be mistaken for compiled ones.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The previous two commits fixed bugs where it would be possible during
optimization to join two EXACTFish nodes together, and the result would
not work properly with LATIN SMALL LETTER SHARP S. But by doing so,
the commits caused all non-UTF-8 EXACTFU nodes that begin or end with
[Ss] from being trieable.
This commit changes things so that the only the ones that are
non-trieable are the ones that, when joined, have the sequence [Ss][Ss]
in them. To do so, I created three new node types that indicate if the
node begins with [Ss] or ends with them, or both. These preclude having
to examine the node contents at joining to determine this. And since
there are plenty of node types available, it seemed the best choice.
But other options would be available should we run out of nodes.
Examining the first and final characters of a node is not expensive, for
example.
|
|
|
|
|
|
|
|
|
|
|
| |
This is defensive coding. The previous commit changed things so under
/di a node ending in [Ss] doesn't get made an EXACTFU. This commit does
the same for nodes that begin with [Ss]. This isn't actually necessary
as one needs two EXACTFU nodes in a row for the problem to occur, and
the previous commit appears to remove the possibility for the first node
being an EXACTFU. But I'm leery of relying on this. So this commit
makes sure that a node beginning with 'S' or 's' under /di remains
EXACTF
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Prior to this commit only nodes that filled fully were guaranteed not to
be upgraded to EXACTFU.
EXACTF nodes are used when /d rules are to be used unless the string
being matched against is in UTF-8. EXACTFU nodes are used when the /u
rules are to be used always. If the node contains only characters whose
interpretation is the same under /d and /u, both node types behave
identically, and so either can be used. EXACTFU nodes are preferred
because they are trie-able, and otherwise faster at runtime due to not
having to check the UTF-8ness of the target string. The pattern
compilation uses an EXACTFU node when possible, over an EXACTF node.
The sequences 'ss', 'SS', 'Ss', 'sS' are very tricky, for several
reasons. For this commit, the trickiness lies in the fact that they are
the only sequences where both the pattern and target string aren't in
UTF-8, and what matches under /ui rules differs from what matches under
/di. (There are various single characters that match differently, but
this is the only sequence whose individual components match the same,
but the sequence as a whole matches differently.)
The code has long taken special care for these sequences, but overlooked
two fairly obscure cases where it matters. This commit addresses one of
those cases; the next commit, the other.
Because these are sequences, it might be possible for it to be split
across two EXACTFish nodes, so that the first 's' or 'S' is the last
thing in the first node, and the second 's' or 'S' is the first thing in
the second. Should these nodes get joined during optimization, they
form the sequence. The code has long recognized this possibility if the
first node gets filled up, necessitating a split, and it doesn't make
the first node EXACTFU if it ends in 's' or 'S'. But we don't fill
those nodes totally completely, and optimization can join two together.
Future commits in the pipeline will join nodes in more cases during
optimization, and so, we need to not create an EXACTFU for trailing 's'
or 'S' always, not just if the first node fills up. This commit moves
the code that accomplishes this so it always gets executed at node end
of /di nodes, instead of previously only getting executed when the node
fills.
|
|
|
|
|
| |
By using a macro with a slightly different API, we don't have to mess
with the parse pointer.
|
|
|
|
|
|
|
|
|
| |
The optimization phase of regular expression pattern compilation looks
for adjacent EXACTish nodes and joins them if they are the same flavor
of EXACT. Commits a9f8c7ac75c364c3e05305718f38c5f8ccd935d8 and
f6b4b99d2e584fbcd85eeed475eea10b87858e54 introduced two new nodes
that are so close to existing flavors that they are joinable with their
respective flavor. This commit does that.
|
|
|
|
|
| |
This is in preparation for making the conditional more complicated than
can be easily done in the condition.
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
Previous commits in 5.29 have removed all but two calls to this
function, and the remaining ones take radically different paths in it,
with very little common code. It simplifies things if we expand each
call to the code that gets evaluated. This commit does one call.
Changing the other usage is deferred until a later commit.
The need for an #ifdef is removed by adding a flag and setting it in an
existing #ifdef.
|
|
|
|
|
|
|
| |
This table contains all the code points that are in any multi-character
fold (not the folded-from character, but what that character folds to).
It will be used in a future commit.
|
|
|
|
|
|
| |
This just extends the use of a variable name a little longer, as it's
easier to read than the nested macro calls that eventually have to be
used.
|
|
|
|
|
|
| |
These two variables are similarly named, but have slightly different
purposes. Comment out one of them, and convert to using the other
consistently
|
|
|
|
|
|
|
| |
Under -Dr (or use re 'Debug') the compiled regex engine program is
displayed. I noticed that it used two different spellings for
'infinity'. This commit changes so only one is used, the one that has
been in the field the longest.
|
| |
|
|
|
|
|
|
|
| |
This reverts commit 5a4bc50b395654abde21718cbce2e296049f470d.
It turns out I was wrong, and the value isn't ignored. The commit
caused the pattern to be optimized differently, but still be valid,
so no tests failed.
|
| |
|
|
|
|
| |
Vertical align continuation chars in a macro
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is a regnode that otherwise would be an EXACTFU except that it
contains a code point that requires UTF-8 to match, including all the
possible folds involving it. Hence if the target string isn't UTF-8, we
know it can't possibly match, without needing to try.
For completeness, there could also be an EXACTFAA_ONLY8 and an
EXACTFL_ONLY8 created, but I think these are unlikely to actually appear
in the wild, since using /aa is mainly about ASCII, and /l mostly will
involve characters that don't require UTF-8.
|
|
|
|
|
|
|
| |
This is a regnode that otherwise would be an EXACT except that it
contains a code point that requires UTF-8 to represent. Hence if the
target string isn't UTF-8, we know it can't possibly match, without
needing to try.
|
|
|
|
|
| |
The common code is about to get more complicated, so use it instead of a
copy.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Where a runtime pattern contains both compile-time and run-time code
blocks, e.g.:
$re = '(?{ RRR })';
/ $re X(?{ CCC })Y/
The compile-time code-block CCC is parsed at the same time as the
surrounding text. The runtime code RRR is parsed at runtime by
constructing a fake pattern and re-parsing it, but with any compile-time
code-blocks blanked out (so they don't get compiled twice). The compiled
regex is then thrown away, but any optrees just created for the runtime
code blocks are kept.
For example at runtime, the re-parsed pattern looks like:
/ (?{ RRR }) X__________Y/
Unfortunately this was failing for the conditional pattern, e.g.
/ $re X(?(?{ CCC }))Y/
which was getting blanked as
/ (?{ RRR }) X(?_______)Y/
which isn't valid syntax.
This commit blanks (?{...}) into (?=====) instead which is always legal.
|
| |
|
|
|
|
|
| |
It doesn't matter currently, but thes variable shouldn't be TRUE unless
/i is in effect.
|
|
|
|
|
|
| |
This way, it doesn't confuse that it is legal, and should it stop being
ignored in the called function, it will show up as a problem much
sooner.
|
| |
|
|
|
|
|
|
|
|
| |
This was changing the parse pointer around some code and restoring it
afterwards. The purpose must have been to get the debug offsets
correct. But there is a better way to do that, which doesn't take up
cycles on a non-Debugging build, and that is to set the offsets
directly.
|
|
|
|
|
| |
We no longer play with the emit ptr in this function, so no need to save
and restore it.
|
|
|
|
|
|
|
|
|
|
| |
This had a plain '=' instead of an '==', thus setting the value in the
process of asserting it, hence would always be true. But changing it to
'==' causes the assertion to fail in various cases.
I need to rethink this, so in the meantime am simply removing it.
Spotted by Dave Mitchell++
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This matches when the existing node ANYOFM would not match; i.e., they
are complements.
I almost didn't create this node, but it turns out to significantly
speed up various classes of matches. For example qr/[^g]/, both /i and
not, turn into this node; and something like
(("a" x $large_number) . "b") =~ /[^a]/
goes through the string a word at a time, instead of previously
byte-by-byte. Benchmarks are at the end of this mesage.
This node gets generated when complementing any single ASCII character
and when complementing any ASCII case pair, like /[^Gg]/. It never gets
generated if the class includes a character that isn't ASCII (actually
UTF-8 invariant, which matters only on EBCDIC platforms).
The details of when this node gets constructed are complicated. It
happens when the bit patterns of the characters in the class happen to
have certain very particular characteristics, depending on the vagaries
of the character set. [BbCc] will do so, but [AaBb] does not. [^01]
does, but not [^12]. Precisely, look at all the bit patterns of the
characters in the set, and count the total number of differing bits,
calling it 'n'. If and only if the number of characters is 2**n, this
node gets generated. As an example, on both ASCII and EBCDIC, the last
4 bits of '0' are 0000; of '1' are 0001; of '2' are 0010; and of '3' are
0011. The other 4 bits are the same for each of these 4 digits. That
means that only 2 bits differ among the 4 characters, and 2**2==4, so
the NANYOFM node will get generated. Similarly, 8=1000 and 0=0000
differ only in one bit so 2**1==2, and so [^08] will generate this node.
We could consider in the future, an extension where, if the input
doesn't work to generate this node, that we construct the closure of
that input to generate this node, which would have false positives that
would have to be tested for. The speedup of this node is so significant
that that could still be faster than what we have today.
The benchmarks are for a 64-bit word. 32-bits would not be as good.
Key:
Ir Instruction read
Dr Data read
Dw Data write
COND conditional branches
IND indirect branches
The numbers (except for the final column) represent raw counts per loop
iteration. The higher the number in the final column, the faster.
(('a' x 1) . 'b') =~ /[^a]/
blead nanyof Ratio %
-------- -------- --------
Ir 2782.0 2648.0 105.1
Dr 845.0 799.0 105.8
Dw 531.0 500.0 106.2
COND 431.0 419.0 102.9
IND 22.0 22.0 100.0
(('a' x 10) . 'b') =~ /[^a]/
blead nanyof Ratio %
-------- -------- --------
Ir 3358.0 2671.0 125.7
Dr 998.0 801.0 124.6
Dw 630.0 500.0 126.0
COND 503.0 424.0 118.6
IND 22.0 22.0 100.0
(('a' x 100) . 'b') =~ /[^a]/
blead nanyof Ratio %
-------- -------- --------
Ir 9118.0 2773.0 328.8
Dr 2528.0 814.0 310.6
Dw 1620.0 500.0 324.0
COND 1223.0 450.0 271.8
IND 22.0 22.0 100.0
(('a' x 1000) . 'b') =~ /[^a]/
blead nanyof Ratio %
-------- -------- --------
Ir 66718.0 3650.0 1827.9
Dr 17828.0 923.0 1931.5
Dw 11520.0 500.0 2304.0
COND 8423.0 668.0 1260.9
IND 22.0 22.0 100.0
(('a' x 10000) . 'b') =~ /[^a]/
blead nanyof Ratio %
-------- -------- --------
Ir 642718.0 12650.0 5080.8
Dr 170828.0 2048.0 8341.2
Dw 110520.0 500.0 22104.0
COND 80423.0 2918.0 2756.1
IND 22.0 22.0 100.0
(('a' x 100000) . 'b') =~ /[^a]/
blead nanyof Ratio %
-------- -------- --------
Ir Inf 102654.8 6237.1
Dr Inf 13299.3 12788.9
Dw Inf 500.9 219708.7
COND 800424.1 25419.1 3148.9
IND 22.0 22.0 100.0
|
|
|
|
|
| |
Previous commits have removed the need for certain macros and generality
in returning from functions early. Correspondingly simplify
|
|
|
|
|
|
|
| |
This static function no longer is called with a non-NULL final
parameter. That means it no longer returns a list, and its name is
hereby changed to reflect that. It also means the function can be
refactored and made simpler.
|
|
|
|
|
| |
This parameter is always NULL. No need to have it in this static
function
|
|
|
|
|
|
| |
This commit keeps track of if there are any operations encountered which
differ under /d from /u. If we switch to /u and haven't so far found
anything which differs, there's no need to reparse
|
|
|
|
|
|
|
|
|
| |
Prior to this commit, if the rules changed from /d to /u, the parse was
immediately restarted. This commit changes that so that it doesn't do
this if it is known that the parse will be redone anyway, but a full
parse needs to done first in order to count the parentheses.
Doing this can avoid the need for an almost full extra reparse.
|
|
|
|
|
|
|
|
|
|
| |
Prior to this commit, if it became apparent that long branches were
going to be needed, the parse was immediately restarted. This commit
changes that so that it doesn't do this if it is known that the parse
will be redone anyway, but a full parse needs to done first in order to
count the parentheses.
This can avoid an almost complete reparse in some situations.
|
|
|
|
| |
It's easier to understand if the simplest case is first in the code.
|
|
|
|
|
|
|
|
|
|
| |
The previous commits have allowed us to refactor this to eliminate
redundancies.
Previously, the same logic was done separately for UTF-8 and non-UTF-8
patterns. This refactors so the logic is done once. The details differ
for UTF-8 and non-UTF-8. So that's where the differences lie, in the
details without having to duplicate the logic.
|
|
|
|
|
|
|
|
| |
\N{} changes /d to /u, and may require reparsing of the current node to
fix wrong assumptions.
This commit is necessary for the simplifications achieved in the next
commit.
|