| Commit message | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
| |
The /d, /l, and /u regex modifiers are mutually exclusive. This patch
changes the field that stores the character set to span more than one
bit, with an enum value determining which character set is in effect.
This data structure follows the mutually exclusive semantics more
closely, conserves bits, and is easier to extend.
A small API is added to set and query the bit field.
This patch is not source-backwards-compatible for .xs code. A handful of
CPAN programs are affected.
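As an illustration only, a multi-bit field holding an enum, with a small
set/query API, might look like the sketch below. Every name, the field
position, and the field width are assumptions made for this example; they
are not the identifiers used in the perl source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names throughout; not the identifiers in the perl source. */
typedef enum {
    CHARSET_DEPENDS = 0,  /* /d */
    CHARSET_LOCALE  = 1,  /* /l */
    CHARSET_UNICODE = 2   /* /u */
} charset_t;

#define CHARSET_SHIFT 5u                      /* assumed position of the field */
#define CHARSET_MASK  (3u << CHARSET_SHIFT)   /* two bits, so one value is spare */

/* Store a character set in the field, leaving all other flag bits alone. */
static inline void set_charset(uint32_t *flags, charset_t cs)
{
    *flags = (*flags & ~CHARSET_MASK) | ((uint32_t)cs << CHARSET_SHIFT);
}

/* Read the character set back out of the flags word. */
static inline charset_t get_charset(uint32_t flags)
{
    return (charset_t)((flags & CHARSET_MASK) >> CHARSET_SHIFT);
}
```

Because the field holds exactly one enum value, two modifiers can never
be recorded at once, which is what mutual exclusivity requires.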
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
This bug stemmed from Latin1 characters not matching any (non-complemented)
character class in /d semantics when the target string is not utf8, but having
Unicode semantics when it is. The solution here is to add a special flag.
There were several tests that relied on the broken behavior; specifically, they
tested that \xff isn't a printable word character even in utf8. I changed the
deparse test to use a non-printable code point instead, and I changed the ones
in re_tests to be TODOs, and will change them back using /a when that is
added shortly.
|
|
|
|
|
|
| |
It turns out that the INVERT and EOS flags can actually share the same bit, as
explained in the comments, freeing up a bit for other uses. No code changes
need be made.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch restructures the regex ANYOF code to generate ANYOFV nodes instead
when there is a possibility that it could match more than one character. Note
that this doesn't affect the optimizer, as it essentially ignores things that
fit into this category. (But it means that the optimizer will no longer reject
these when it shouldn't have.)
The handling of the LATIN SHARP s is modified to correspond with this new node
type.
The initial handling of ANYOFV is placed in regexec.c. More analysis will come
on that. But there was significant change to the part that handles matching
multi-char strings. This has long been buggy: it previously compared a
folded version on one side with a non-folded version on the other.
This patch fixes about 60% of the problems that my undelivered test suite gives
for multi-char folds. But there are still 17K test failures left, so I'm still
not delivering that. The TODOs that this fixes will be cleaned up in a later
commit.
|
| |
|
|
|
|
|
|
|
|
|
| |
# New Ticket Created by (Peter J. Acklam)
# Please include the string: [perl #81904]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=81904 >
Signed-off-by: Abigail <abigail@abigail.be>
|
|
|
|
|
|
|
|
|
|
|
| |
ANYOF_FOLD is now used under fewer conditions. Otherwise the
bitmap of characters 0-255 is fully calculated with the folds, and the
flag is not set. One condition is under locale, where the folds aren't
known at compile time; the other is for things accessible through a
swash.
By changing the name to its new meaning, certain optimizations become more
obvious.
|
|
|
|
|
|
| |
instead of the less familiar octal for larger values. Perhaps they
should actually print the actual character, but this is far easier to
understand than the previous output.
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
A two character character class where the two elements are the folds of
each other can be optimized into an EXACTF regnode. This should not
change the speed of execution appreciably, except that EXACTF regnodes
are candidates for further optimization by combining with adjacent nodes
of the same type.
This commit brings the optimization level up to somewhat better than
when I started mucking around with ANYOF nodes.
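The check behind this optimization can be sketched as follows. This is
a simplified illustration, not the code in regcomp.c: the helper names
are hypothetical, the bitmap is assumed to hold one bit per code point
0-255, and folding is restricted to ASCII letters on an ASCII platform.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASCII-only case fold; an assumption made to keep the sketch small. */
static int ascii_fold(int c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Hypothetical helper: returns true if exactly two bits are set in the
 * class bitmap and the two characters fold to the same thing.  On
 * success *folded receives the folded character, which is what a
 * case-insensitive exact-match node would store. */
static bool is_simple_fold_pair(const uint8_t bitmap[32], unsigned char *folded)
{
    int found[2];
    int count = 0;

    for (int c = 0; c < 256; c++) {
        if (bitmap[c >> 3] & (1u << (c & 7))) {
            if (count == 2)
                return false;          /* more than two elements */
            found[count++] = c;
        }
    }
    if (count != 2)
        return false;
    if (found[0] >= 128 || found[1] >= 128)
        return false;                  /* outside this sketch's ASCII scope */
    if (ascii_fold(found[0]) != ascii_fold(found[1]))
        return false;                  /* e.g. [ab]: not folds of each other */

    *folded = (unsigned char)ascii_fold(found[0]);
    return true;
}
```

With this, a class such as [kK] would qualify, while [ab] or [kKx]
would not.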
|
|
|
|
|
|
| |
I overlooked two cases in a previous commit where it would be advisable
to make changes in case the ANYOF_CLASS bit needs to be combined with
ANYOF_LOCALE.
|
|
|
|
| |
The number passed here is never larger than 255.
|
|
|
|
|
|
| |
The problem is that I confused FOLD with ANYOF_FOLD, and as a result,
emitted a locale regnode, which is tainted. Any tests that required
non-tainting started failing.
|
|
|
|
|
|
|
|
|
|
| |
A single character character class can be optimized into an EXACT node.
The changes elsewhere allow this to no longer be constrained to
ASCII-only when the pattern isn't UTF-8. Also, the optimization
shouldn't have happened for FOLDED characters, as explained in the
comments, when they participate in multi-char folds; so that is removed.
Also, a locale node with folded characters can be optimized.
|
|
|
|
|
|
| |
The flags field is fully used, and until the ANYOF node is split later
in development, the CLASS bit will need to be freed up to give the space
for other uses. This patch allows for this to easily be toggled.
|
|
|
|
|
|
| |
The optimization to do inversion at compile time is moved to earlier.
This doesn't help today, but it may someday when we start keeping better
track of Unicode characters, and is the more logical place for it.
|
|
|
|
|
| |
When one complements every bit, the count of those that are set should
be complemented as well.
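A minimal sketch of the idea, assuming a 256-bit bitmap and a separately
maintained count of set bits (the names are hypothetical, not those in
regcomp.c):

```c
#include <stddef.h>
#include <stdint.h>

#define BITMAP_BYTES 32
#define BITMAP_BITS  (BITMAP_BYTES * 8)

/* Flip every bit in the bitmap and return the matching new count of
 * set bits: what was set is now clear, and vice versa. */
static unsigned invert_bitmap(uint8_t bitmap[BITMAP_BYTES], unsigned set_count)
{
    for (size_t i = 0; i < BITMAP_BYTES; i++)
        bitmap[i] ^= 0xFF;            /* exclusive-or with all ones */
    return BITMAP_BITS - set_count;
}

/* Recount from scratch; only used to check the bookkeeping above. */
static unsigned count_bits(const uint8_t bitmap[BITMAP_BYTES])
{
    unsigned n = 0;
    for (int c = 0; c < BITMAP_BITS; c++)
        if (bitmap[c >> 3] & (1u << (c & 7)))
            n++;
    return n;
}
```

If the count were left alone after the inversion, any later decision
based on "how many characters does this class match" would be made on
the wrong number.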
|
|
|
|
|
| |
When something matches above Latin1, it should have the ANYOF_UTF8 bit
set.
|
|
|
|
|
| |
Oddly, it is clearer to use 0xFF as an exclusive-or target instead of an
unrelated #define that happens to have that value.
|
| |
|
|
|
|
|
|
|
|
| |
optimize_invert is no longer needed given the changes already made: now,
if there is something not in the bitmap, a flag will be set, and the
optimization doesn't take place unless the only flag is inversion. Also,
the bitmap is now set up completely for anything that doesn't have to be
deferred to runtime, and such deferrals are marked with other flags.
|
|
|
|
|
| |
So there is no need to tell regexec that it does, and two other
statements can then be combined.
|
|
|
|
|
|
| |
One smoke is warning about truncated results. This should fix that. It
may be that other compilers will now complain, and we'll need to add
casts, but I'm waiting to see.
|
|
|
|
|
|
|
|
| |
Recent changes to this cause the bitmap to be populated where possible
with all the folding taken into consideration. Therefore, the FOLD flag
isn't necessary except when that wasn't possible, and the loop that went
through looking for folds is also not necessary, as the bitmap is
now completely populated before it gets to where the loop was.
|
|
|
|
|
|
|
| |
stored now contains the number of 1 bits in the ANYOF node, and no
longer needs to be set arbitrarily. Part of this is because there is
now a flag that is set if there is any match outside the bitmap, which
prohibits the optimization.
|
| |
|
|
|
|
|
|
|
|
| |
The DIGITL and NDIGITL regnodes were not being generated; instead
regular DIGIT and NDIGIT regnodes were generated even under locale.
This probably means that no one has ever used Perl on a locale that
changed the digits.
|
|
|
|
|
|
| |
Now that the new nodes are grouped properly, we can use the fact that
the named backreferences all come after all the numbered backreferences,
as had been there before.
|
|
|
|
|
|
|
|
|
|
| |
This patch should get rid of the compiler warnings recently introduced.
Another way to handle the pm_flags warning is to declare it to be
volatile, but apparently not all compilers that perl uses support that,
so I chose the longer way of introducing a new variable that isn't
changed within the jumpable area. The others were fixed by not
initializing them before the jumpable area.
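The two techniques described above can be sketched in isolation as
follows. This is not the perl code: the function, the flag value, and
the control flow are invented for illustration. Only the pattern is the
point: a copy that is never changed after setjmp(), and a working
variable that is not initialized before it.

```c
#include <setjmp.h>

static jmp_buf env;

/* Stand-in for code deep inside the jumpable area that bails out. */
static void bail_out(void)
{
    longjmp(env, 1);
}

/* Hypothetical function, for illustration only. */
static int compile_with_flags(int pm_flags, int do_fail)
{
    /* Copied before setjmp() and never modified afterwards, so its
     * value is well defined even after a longjmp() back here. */
    const int orig_flags = pm_flags;

    /* Deliberately not initialized before the jumpable area. */
    int flags;

    if (setjmp(env) != 0) {
        /* Reached via longjmp(): rely only on the unchanged copy. */
        return orig_flags;
    }

    flags = orig_flags | 0x100;       /* modified inside the jumpable area */
    if (do_fail)
        bail_out();
    return flags;
}
```

The C standard leaves non-volatile automatic variables that were
modified between setjmp() and longjmp() with indeterminate values, which
is what compilers warn about; neither variable above falls into that
category on the path that reads it.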
|
|
|
|
|
| |
The code had hard-coded into it the ASCII platform values for possible
start bytes. There are macros that do that portably with no branches.
|
|
|
|
| |
The inline function repeats the test removed here.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The 7-bit test operations always fail on non-ASCII characters, so
there is no need to test each such character individually. The loops
that do that and then set a bit for each character can therefore stop at
127 instead of 255 (the bitmaps are initialized to all zeros). For
EBCDIC, the same applies, except that we have to map those 7-bit
characters to the 8-bit EBCDIC range. This creates an extra array
lookup for each EBCDIC character, but half as many times through the
loop.
For the complement of the 7-bit operations, we know that they will all
be set for the non-ascii characters. Therefore, we don't need to test,
we can just unconditionally set those bits. It would not be a good idea
to just do a memset on that range under /i, as each character that gets
chosen may have its fold added as well and that has to be looked up
individually.
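The loop structure described above might look like this on an ASCII
platform. It is a simplified sketch with hypothetical helper names: the
EBCDIC mapping is omitted, and the per-character fold lookup needed
under /i is only indicated by a comment.

```c
#include <stdint.h>

static void set_bit(uint8_t bm[32], int c)
{
    bm[c >> 3] |= (uint8_t)(1u << (c & 7));
}

static int test_bit(const uint8_t bm[32], int c)
{
    return (bm[c >> 3] >> (c & 7)) & 1;
}

/* Example 7-bit test: true only for the ASCII digits. */
static int is_ascii_digit(int c)
{
    return c >= '0' && c <= '9';
}

/* A 7-bit class: nothing above 127 can pass, so stop there.  The
 * caller is assumed to have zeroed the bitmap beforehand. */
static void add_ascii_class(uint8_t bm[32], int (*test)(int))
{
    for (int c = 0; c < 128; c++)
        if (test(c))
            set_bit(bm, c);
}

/* The complement: test the low half, then set the high half without
 * testing.  The bits are set one at a time rather than with a memset
 * because under /i each character's fold would also have to be looked
 * up and added at this point. */
static void add_ascii_class_complement(uint8_t bm[32], int (*test)(int))
{
    for (int c = 0; c < 128; c++)
        if (!test(c))
            set_bit(bm, c);
    for (int c = 128; c < 256; c++)
        set_bit(bm, c);
}
```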
|
|
|
|
|
| |
This patch replaces hex ordinals by macros containing the character
names, for clarity and portability to EBCDIC.
|
|
|
|
|
| |
This code should be done before the setjmp to avoid the longjmp
clobbering it.
|
| |
|
| |
|
|
|
|
|
|
|
|
| |
This causes the new nodes that denote Unicode semantics in
backreferences to be generated when appropriate.
Because the addition of these nodes was at the end of the node list, the
arithmetic relation that previously was valid no longer is.
|
|
|
|
|
|
|
| |
This is because the pattern may not specify unicode semantics, but if
the target matching string is in utf8, then unicode semantics may be
needed nonetheless. So to avoid the regexec optimizer rejecting the
match, we need to allow for a possible false positive.
|
|
|
|
|
|
| |
A utf8 pattern should force unicode semantics unless otherwise
overridden. This means that the 'd' regex modifier means Unicode
semantics as well.
|
| |
|
| |
|
|
|
|
|
| |
The flags this statement cleared are cleared as part of the macro called
just before it.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The super-linear cache in regexec.c can prevent a valid match
from being detected. For example:
    print "yay\n" if 'xayxay' =~ /(q1|.)*(q2|.)*(x(a|bc)*y){2,}/;
This should match, but it doesn't because the cache fails to
distinguish between matching the final xay to x(a|bc)*y as the
first instance of the {2,} and matching it in the same position
as the second instance.
This seems to do the trick.
|
|
|
|
|
|
| |
This patch also changes the optimizer to include the other member of a
fold pair in the bitmap. Thus if 'b' is set under /i, so will 'B', and
vice versa.
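A minimal sketch of adding both members of a fold pair to a bitmap,
assuming ASCII-only case folding and hypothetical helper names (the
real optimizer handles more than the 26 ASCII letters):

```c
#include <stdint.h>

static void set_bit(uint8_t bm[32], int c)
{
    bm[c >> 3] |= (uint8_t)(1u << (c & 7));
}

static int test_bit(const uint8_t bm[32], int c)
{
    return (bm[c >> 3] >> (c & 7)) & 1;
}

/* Set the bit for c and, under case-insensitive matching, the bit for
 * the other member of its fold pair as well. */
static void set_with_fold(uint8_t bm[32], unsigned char c)
{
    set_bit(bm, c);
    if (c >= 'a' && c <= 'z')
        set_bit(bm, c - ('a' - 'A'));   /* 'b' also sets 'B' */
    else if (c >= 'A' && c <= 'Z')
        set_bit(bm, c + ('a' - 'A'));   /* 'B' also sets 'b' */
}
```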
|
|
|
|
|
|
|
| |
The ordinals that are output in the debugging output have been in octal,
which is ok for the low controls, but for above Latin1, the standard is
hex, so this changes them all to correspond. If desired the low
controls could be changed back to be in octal.
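For comparison, here is the same ordinal rendered both ways. The
function names and the exact output formats are assumptions made for
this sketch; the source does not show what the debugging code prints.

```c
#include <stddef.h>
#include <stdio.h>

/* Hex rendering, e.g. \x{100}; the format itself is an assumption. */
static int format_ordinal_hex(char *buf, size_t size, unsigned long cp)
{
    return snprintf(buf, size, "\\x{%lX}", cp);
}

/* Octal rendering of the same value, e.g. \400. */
static int format_ordinal_octal(char *buf, size_t size, unsigned long cp)
{
    return snprintf(buf, size, "\\%lo", cp);
}
```

For the code point 0x100 the first form prints the digits 100, which
match the usual U+0100 notation, while the second prints 400.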
|
|
|
|
|
| |
The 'outside bitmap' message isn't orthogonal to the others; it is
independent.
|
| |
|
| |
|
|
|
|
|
| |
The tests in the else are unnecessary as they comprise everything else
but what the 'if' says.
|