| Commit message (Collapse) | Author | Age | Files | Lines |
|
This code is obsolete, as new code has been written to do folding;
now that smokes are all passing with that new code, there is no point in
retaining the old.
|
This code has been rendered obsolete in 5.14 by using a different
mechanism altogether. This functionality is now provided at run-time,
user-selectable, via the /u and /d regex modifiers. This code was
for compile-time selection of which to use.
|
A previous commit collapsed nested blocks. This outdents the nested
part.
|
An earlier commit removed code so that these two blocks can be written
as one.
|
This code was inserted to make sure no tests failed in the intermediate
commits leading up to d50a4f90cab527593b2dd218f71b66a6be555490, and
should have been removed in that commit, but I forgot to.
|
Commit 56ca34cada940c7f6aae9a59da266e541530041e had the side effect of
causing regular expressions with things like [a-z], or even just [k] to
go out to disk to read tables to create swashes because it knew that
some of those characters matched outside the bitmap (and due to
l1_char_class_tab.h it knew which ones had those matches), but it didn't
know what the characters were that participated in those folds.
This patch hard-codes the Unicode 6.0 rules into regcomp.c for the
code points 0-255, so that the very slow utf8_heavy is not invoked on
them. (Code points above 255 will continue to invoke it.) It would,
of course, be better if these rules could be regen'd into regcomp.c, as
there is a risk that the standard will change, and the code will not.
But I don't think that has ever happened; in other words, I think that
the rules haven't changed so far since Day 1 of Unicode. (That would
not be the case if we were doing simple case folding, as the capital
sharp ss which folds to U+00DF was added later.) And the Standard is
getting more stable in this area. I believe one of their stability
policies now forbids them from adding something that simply folds to
one of the characters that already has a fold, such as M and m.
Ligatures are frowned on, and I doubt that new ones would be encoded,
so that leaves a new Unicode character that folds to a Latin-1 character
plus some sort of mark. For those, this code is a no-op, so those aren't
a problem either.
|
This sets things up in preparation for a future commit that will
move calculating all folds involving characters in the bit map.
|
The set_regclass_bit functions will be adding to a new inversion list.
This declares that list and passes it to them.
|
A pointer to the list of multi-char folds in an ANYOF node is now passed
to the routines that set the bit map. This is in preparation for those
routines to add to the list.
|
The code that handles a false range in a [character class] hadn't been
converted to use inversion lists.
|
This is just an inline shorthand when a single code point is all that is
needed. A macro could have been used instead, but this just seemed nicer.
|
This is part of the refactoring of the code that sets the alternate array
for multi-char folds. Changing the node type to ANYOFV can be done at
the last second, in pass 2, as it doesn't change any sizing, etc.
|
A future commit uses this same code, so put it into a common place.
|
A previous commit changed add_range_to_invlist() to do the creation
that these lines did.
|
Change the function add_range_to_invlist() to accept NULL as the
inversion list, in which case it creates it. Callers commonly create the
list if it doesn't exist before calling this function, so this puts that
code in one place.
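The create-on-NULL pattern described here can be sketched in isolation.
This is only an illustration under stated assumptions: RangeList and
range_list_add() are hypothetical names, and the flat array of ranges is
a simplified stand-in, not Perl's actual inversion list representation or
the real add_range_to_invlist() signature.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for an inversion list: a growable
 * array of [start, end] pairs.  Perl's real representation differs. */
typedef struct {
    size_t len;             /* number of ranges stored */
    size_t cap;             /* number of ranges allocated */
    unsigned long *bounds;  /* bounds[2*i] = start, bounds[2*i+1] = end */
} RangeList;

/* Append start..end to list.  A NULL list is created on demand, so callers
 * do not need their own "create it if it doesn't exist" step.
 * Returns the (possibly new) list, or NULL if allocation fails; on failure
 * a list passed in by the caller is left untouched. */
RangeList *range_list_add(RangeList *list, unsigned long start,
                          unsigned long end)
{
    int created = 0;

    assert(start <= end);

    if (list == NULL) {
        list = malloc(sizeof *list);
        if (list == NULL)
            return NULL;
        list->len = 0;
        list->cap = 0;
        list->bounds = NULL;
        created = 1;
    }

    if (list->len == list->cap) {
        size_t new_cap = list->cap ? list->cap * 2 : 4;
        unsigned long *grown =
            realloc(list->bounds, new_cap * 2 * sizeof *grown);
        if (grown == NULL) {
            if (created)
                free(list);   /* do not leak the list we just made */
            return NULL;
        }
        list->bounds = grown;
        list->cap = new_cap;
    }

    list->bounds[2 * list->len] = start;
    list->bounds[2 * list->len + 1] = end;
    list->len++;
    return list;
}
```

With this shape a caller writes list = range_list_add(list, lo, hi);
whether or not list has been created yet, which is the duplication the
commit removes from the call sites.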
|
Actually, it doesn’t. The original test case was:
    #!/usr/bin/perl
    my $rx = qr'\$ (?| {(.+?)} | (.+?); | (.+?)(\s) )'x;
    my $test = '/home/$USERNAME ';
    die unless $test =~ $rx;
    print "1: $1\n";
    print "2: $2\n" if defined $2;
This crashes even if I put an ‘exit’ right after the pattern match.
What’s happening is that regcomp miscounts the number of capturing
parenthesis pairs (cf. [perl #59734]), so the execution of the regular
expression causes a buffer overflow which overwrites the op_sibling
field of the regcreset op, causing a crash when the op is freed. (The
exact failure may differ between builds, platforms, etc., of course.)
S_reg in regcomp.c keeps a count of the parenthesised groups in a
(?|...) construct, which it updates after each branch, if that branch
has more captures than any previous branch. But it was not updating
the count after the last branch.
So this bug would occur if the last branch had more capturing
parentheses than any previous branch.
Commit ee91d26, which fixed bug #59734, only solved the problem when
there was just one branch (by updating the count before the loop that
deals with subsequent branches was entered).
This commit changes the code at the end of S_reg to take into account
that RExC_npar (the current paren count) might have been increased by
the last branch.
Since the loop to deal with subsequent branches resets the count
*before* each branch, the code that commit ee91d26 added is no longer
necessary, so this commit removes it.
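The counting rule can be sketched outside of regcomp.c. This is a
minimal model, not the code in S_reg: npar_after_branch_reset() and its
parameters are hypothetical, and the per-branch capture counts are
supplied by the caller rather than discovered by parsing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of paren counting for a (?|...) group.
 * npar_before:          capture groups opened before the group starts
 * captures_per_branch:  capture groups opened by each branch
 * Returns the paren count that should be in effect after the group. */
int npar_after_branch_reset(int npar_before,
                            const int *captures_per_branch,
                            size_t nbranches)
{
    int max_npar = npar_before;
    size_t i;

    for (i = 0; i < nbranches; i++) {
        /* The count is reset before each branch... */
        int npar = npar_before;

        assert(captures_per_branch[i] >= 0);
        npar += captures_per_branch[i];

        /* ...and the maximum is updated after every branch, including
         * the last one.  Skipping this update for the final branch is
         * the kind of miscount the commit message describes. */
        if (npar > max_npar)
            max_npar = npar;
    }
    return max_npar;
}
```

For the test case above, the three branches open 1, 1 and 2 groups, so
with no earlier captures the result is 2; a version that ignored the last
branch would report 1 and leave too little room for $2.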
|
This is the foundation for fixing the regression RT #82610. My analysis
that two bits could be shared was wrong; they cannot be, at least not
without further work. This commit switches to a different mechanism for
passing the needed information to regexec.c so that another bit can be
freed up and, in a later commit, the two bits can become unshared again.
The bit that is freed up is ANYOF_UTF8, which basically said there is
something that is matched outside the ANYOF bitmap, and requires the
target string to be in utf8. This changes things so the existence of
something besides the bitmap indicates this, and so no flag is needed.
The flag bit ANYOF_NONBITMAP_NON_UTF8 remains to indicate that there is
something that should be matched outside the bitmap even if the target
string isn't in utf8.
|
This just moves the code to later in the subroutine, in preparation for
future commits.
|
As the comment above the changed line says, \p doesn't have to match only
utf8, but it sets the flag that is two bits, meaning UTF8. Set just the
one flag.
|
This reverts commit fb2e24cdda774d9e9c28f1cd0356bba9070894c7.
This turned out to be contentious, and is past the date for
contentious changes.
|
As explained in the doc changes of this patch, under /l, caseless
matching of code points less than 256 now uses locale rules regardless
of the utf8ness of the pattern or string. It now matches the behavior
of things like \w in this regard.
|
Tests for \N{} with this option will be added later.
|
I found this through gdb. Sign extension was happening.
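The message does not say which expression was affected, so the following
is only a generic illustration of how sign extension arises in C when a
signed byte is widened, and of the usual fix of going through unsigned
char first. The function names are hypothetical and are not taken from
the Perl source.

```c
#include <assert.h>

/* Widening a signed byte directly: a negative value is sign-extended,
 * so the high bits of the result are all set. */
unsigned long widen_naive(signed char byte)
{
    return (unsigned long) byte;
}

/* Converting to unsigned char first keeps the value in 0..255, so the
 * widened result has no stray high bits. */
unsigned long widen_masked(signed char byte)
{
    return (unsigned long) (unsigned char) byte;
}
```

On a platform with 8-bit chars, widen_masked(-33) yields 0xDF, while
widen_naive(-33) yields a very large value, which is the sort of
discrepancy that stands out when inspecting variables in a debugger.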
|
Certain characters are not placed in EXACTish nodes because of problems
mostly with the optimizer. However, not all notations that generated
those characters were caught. This catches all but those in \N{}
constructs, which are coming later.
This does not use FOLDCHAR, which doesn't know the difference between
/d and /u; instead it uses ANYOFV, which does handle those cases already,
at the expense of larger (in storage) regexes for these few characters.
If this were deemed a problem, there would be some work involved in
adding FOLDCHARU, and fixing the code where it doesn't work properly now.
|
A previous commit removed some things, so this block can be rearranged
|
The code elsewhere is now better equipped to handle this.
|
A multi-char fold that matches in the Latin1 range needs to have that
fact communicated to regexec.
|
Some characters above 255 fold to the < 256 range. These need to be in
the synthetic start class so the optimizer won't reject them.
This is temporary code which creates false positives, to be
replaced by more precise matching later.
|
There are two flags for matching outside the ANYOF bitmap. Instead of
setting both, set the corresponding one.
|
This allows future use by Perl of these
|
The warning message about regex unrecognized escapes passed through is
changed to include any literal '{' following the 2-char escape.
e.g., "\q{" will include the { in the message as part of the escape.
|
Throughout 5.13 there was temporary code to deprecate and forbid
certain values of X following a \c in qq strings. This patch fixes
this to the final 5.14 semantics.
These are:
1) a utf8 non-ASCII character will croak. This is the same
behavior as pre-5.13, but it gives a correct error message, rather than
the malformed utf8 message previously.
2) \c{ and \cX where X is above ASCII will generate a deprecated
message. The intent is to remove these capabilities in 5.16. The
original agreement was to croak on above ASCII, but that does violate
our stability policy, so I'm deprecating it instead.
3) A non-deprecated warning is generated for all other \cX; this is the
same as throughout the 5.13 series.
I did not have the tuits to use \c{} as I had planned in 5.14, but \N{}
can be used instead.
|
The recent move of Unicode folding to the compilation phase caused
spurious warnings during the miniperl build phase of Perl itself before
the Unicode tables get built. Before the tables are built, Perl is
unable to know about the Unicode semantics (it has ASCII/Latin1
hard-coded in), but was still trying to access the tables. Now, it
checks, and if the tables aren't present, uses just the hard-coded
ASCII/Latin1 semantics.
|
The poison exposes a failure in t/op/magic:
panic: corrupt saved stack index at - line 6.
FAILED at test 7
|
This is for security as well as performance. It allows Unicode properties to
not be matched case sensitively. As a result the swash inversion hash is
converted from having utf8 keys to numeric, code point, keys.
It also for the first time fixes the bug where /i doesn't work when a code point
not at the end of a range in a bracketed character class has a multi-character
fold.
|