| Commit message (Collapse) | Author | Age | Files | Lines |
| |
|
|
|
|
| |
Tests for \N{} with this option will be added later.
|
|
|
|
| |
I found this through gdb. Sign extension was happening.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Certain characters are not placed in EXACTish nodes because of problems
mostly with the optimizer. However, not all notations that generated
those characters were caught. This catches all but those in \N{}
constructs; handling for those is coming later.
This does not use FOLDCHAR, which doesn't know the difference between
/d and /u; instead it uses ANYOFV, which does handle those cases already,
at the expense of larger (in storage) regexes for these few characters.
If this were deemed a problem, there would be some work involved in
adding FOLDCHARU, and fixing the code where it doesn't work properly now.
|
| |
|
| |
|
|
|
|
| |
A previous commit removed some things, so this block can be rearranged.
|
| |
|
|
|
|
| |
The code elsewhere is now better equipped to handle this.
|
|
|
|
|
| |
A multi-char fold that matches in the Latin1 range needs to have that
fact communicated to regexec.
|
|
|
|
|
|
|
|
| |
Some characters above 255 fold to the < 256 range. These need to be in
the synthetic start class so the optimizer won't reject them.
This is temporary code which creates false positives, to be
replaced by more precise matching later.
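To get a feel for which characters are involved, here is a minimal Python sketch that lists code points above 255 whose full case fold consists only of characters below 256. It uses Python's str.casefold as a stand-in for Perl's fold tables, which is an assumption; the function name is hypothetical and the exact set depends on the Unicode version in use.

```python
def folds_into_latin1(limit=0x10000):
    """Return code points in [0x100, limit) whose case fold lies entirely below 0x100."""
    hits = []
    for cp in range(0x100, limit):
        folded = chr(cp).casefold()
        # Keep the code point only if every character of its fold is < 256.
        if all(ord(c) < 0x100 for c in folded):
            hits.append(cp)
    return hits
```

KELVIN SIGN (U+212A, folds to "k") and LATIN SMALL LETTER LONG S (U+017F, folds to "s") are among the results.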
|
|
|
|
|
| |
There are two flags for matching outside the ANYOF bitmap. Instead of
setting both, set the corresponding one.
|
| |
|
| |
|
|
|
|
| |
This allows future use by Perl of these
|
|
|
|
|
|
| |
The warning message about regex unrecognized escapes passed through is
changed to include any literal '{' following the 2-char escape.
e.g., "\q{" will include the { in the message as part of the escape.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Throughout 5.13 there was temporary code to deprecate and forbid
certain values of X following a \c in qq strings. This patch fixes
this to the final 5.14 semantics.
These are:
1) a utf8 non-ASCII character will croak. This is the same
behavior as pre-5.13, but it gives a correct error message, rather than
the malformed utf8 message previously.
2) \c{ and \cX where X is above ASCII will generate a deprecated
message. The intent is to remove these capabilities in 5.16. The
original agreement was to croak on above ASCII, but that does violate
our stability policy, so I'm deprecating it instead.
3) A non-deprecated warning is generated for all other \cX; this is the
same as throughout the 5.13 series.
I did not have the tuits to use \c{} as I had planned in 5.14, but \N{}
can be used instead.
|
| |
|
|
|
|
|
|
|
|
|
|
| |
The recent move of Unicode folding to the compilation phase caused
spurious warnings during the miniperl build phase of Perl itself before
the Unicode tables get built. Before the tables are built, Perl is
unable to know about the Unicode semantics (it has ASCII/Latin1
hard-coded in), but was still trying to access the tables. Now, it
checks and if the tables aren't present uses just the hard-coded
ASCII/Latin1 semantics.
|
|
|
|
|
|
|
| |
The poison exposes a failure in t/op/magic:
panic: corrupt saved stack index at - line 6.
FAILED at test 7
|
| |
|
|
|
|
|
|
|
|
|
|
| |
This is for security as well as performance. It allows Unicode properties to
not be matched case sensitively. As a result the swash inversion hash is
converted from having utf8 keys to numeric, code point, keys.
It also fixes, for the first time, the bug where /i doesn't work for a code point
that has a multi-character fold and is not at the end of a range in a bracketed
character class.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Going forward the intent is to convert from swashes to the better-suited
inversion list data structure. This adds rudimentary inversion lists that have
only the functionality needed for 5.14. As a result, they are as much as
possible static to one file.
What's necessary for 5.14 is enough to allow folding of ANYOF nodes to be moved
from regexec to regcomp. Inversion lists are needed for that in order to generate
class definitions that are as compact as possible; otherwise, very long linear
lists might be generated. (They still may be, but that's inherent in the problem
domain; this generates lists that are as compact as possible, combining
overlapping ranges, etc.)
The only two non-trivial methods in this object are from published algorithms.
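For readers unfamiliar with the data structure: an inversion list stores a set of code points as a sorted list of boundaries at which membership flips. The Python sketch below illustrates the idea only. It is not the code in regcomp.c, the function names are hypothetical, the union shown is a simple re-merge rather than the published algorithm the commit refers to, and it assumes even-length lists (no range open to infinity).

```python
from bisect import bisect_right

def from_ranges(ranges):
    """Build an inversion list from inclusive (lo, hi) ranges, merging overlaps."""
    out = []
    for lo, hi in sorted(ranges):
        if out and lo <= out[-1]:
            # Range overlaps or abuts the previous one: extend its end boundary.
            out[-1] = max(out[-1], hi + 1)
        else:
            out += [lo, hi + 1]
    return out

def contains(invlist, cp):
    """cp is in the set when an odd number of boundaries are <= cp."""
    return bisect_right(invlist, cp) % 2 == 1

def union(a, b):
    """Union of two inversion lists, via their (lo, hi) ranges."""
    def pairs(lst):
        return [(lst[i], lst[i + 1] - 1) for i in range(0, len(lst), 2)]
    return from_ranges(pairs(a) + pairs(b))
```

With this representation [A-Za-z] is just the four numbers [0x41, 0x5B, 0x61, 0x7B], and a membership test is a binary search.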
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch causes regcomp.c to generate a different property name under /i than
not. utf8_heavy.pl will later resolve whether this is to match the same under
/i or not, based on the data structure generated by mktables.
This is part of moving non-locale folding into regcomp from regexec. The
reasons are primarily security, but this has been planned to do at some point
anyway for performance. It was not until a 5.13.X build that fixed the regexec
code that the case-insensitive matching mostly worked. With that change, things like
/\p{ASCII_Hex_Digit}+/i
would match non-ASCII characters, such as LATIN SMALL LIGATURE FF, and almost
certainly that would not be the expectation of the coder. The Unicode
Standard is silent on the matter, but as of this writing, it appears that they
will act to recommend against caseless matching of properties; I get the sense
that they would never have thought someone would think to do it, but Perl has.
I ran some experiments, and actually very few properties have differences under
caseless matching anyway. I have submitted a proposal to them that says that,
but suggests that certain properties can be grandfathered-in. Perl users have
come to expect that /\p{Uppercase}/i would match lower case letters, and have
written bug reports that they don't, until 5.13.X fixed them, but in addition
added the unintended wrinkle from the example above.
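The ligature example can be reproduced outside Perl. The Python sketch below is a naive model of caseless matching, not Perl's matcher: it folds the target and then checks every resulting character against the property. The names are hypothetical and str.casefold stands in for Perl's fold tables.

```python
ASCII_HEX_DIGITS = set("0123456789abcdefABCDEF")

def naive_caseless_match(text, members):
    """True if every character of the folded text is a member of the set."""
    folded = text.casefold()
    return len(folded) > 0 and all(c in members for c in folded)
```

LATIN SMALL LIGATURE FF (U+FB00) folds to "ff", and both of those are hex digits, so the naive test accepts a character that is not itself in the property.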
The design is for mktables to generate tables for /i matching for the few
properties that have differences, and to create a hash mapping the standard
table to the /i table, which is read by utf8_heavy.pl. regcomp.c munges the
names of all properties under /i to be __foo_i. The two initial underscores
make sure there is no conflict with existing single underscore initial tables.
utf8_heavy strips these off, and computes the table as normal from the
remaining unmunged name. At the last moment, it looks up that name in the list
of those that have /i tables, and substitutes if found.
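The munge-then-resolve flow described above can be sketched in Python as follows. This only models the naming scheme from the text; the function names, the dictionary, and the table name "Uppercase_caseless" are hypothetical placeholders, not what mktables or utf8_heavy.pl actually use.

```python
PREFIX, SUFFIX = "__", "_i"

def munge(name, caseless):
    """What regcomp.c is described as doing: __foo_i under /i, foo otherwise."""
    return f"{PREFIX}{name}{SUFFIX}" if caseless else name

def resolve(requested, caseless_tables):
    """Strip the munging, then substitute the /i table only if one exists."""
    caseless = requested.startswith(PREFIX) and requested.endswith(SUFFIX)
    name = requested[len(PREFIX):-len(SUFFIX)] if caseless else requested
    table = name  # stands for "compute the table as normal from the unmunged name"
    if caseless:
        table = caseless_tables.get(name, table)
    return table

# Hypothetical mapping from a standard table to its /i table.
CASELESS_TABLES = {"Uppercase": "Uppercase_caseless"}
```

A property with no entry in the mapping resolves to the same table with or without /i, which is what keeps the ASCII_Hex_Digit example from matching the ligature.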
This completely hides all this from the swash mechanism and regexec.c.
This can't be completely hidden from user-defined properties. Now, a boolean
will be passed to those subroutines indicating if /i is in effect or not. They
are free to ignore it, but they can return a different set of code points
depending on its value. They will be called once for each type, and the
results cached by the normal swash mechanism, which thinks these are two
different properties.
|
|
|
|
|
|
| |
If ANDing two nodes together and they both have UNICODE_ALL set, the result
should also. I don't have a test case for this, but the bug is exposed by some
commits soon to come in a test case in pat_advanced.t for cl_and.
|
|
|
|
|
| |
\p implies Unicode matching rules, which are likely going to be different than
the locale's.
|
|
|
|
| |
Now, a Unicode property match specified in the pattern will indicate that the
pattern is meant for matching according to Unicode rules.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[perl #2460] described a case where electric fence reported an invalid
read. This could be reproduced under valgrind with blead and -e'/x/',
but only on a non-debugging build.
This was because it was checking for certain pairs of nodes (e.g. BOL + END)
and wasn't allowing for EXACT nodes, which have the string at the next
node position when using a naive NEXTOPER(first). In the non-debugging
build, the nodes aren't initialised to zero, and a 1-char EXACT node isn't
long enough to spill into the type field of the "next node".
Fix this by only using NEXTOPER(first) when we know the first node is
kosher.
|
|
|
|
|
|
|
|
|
|
|
| |
_C_C_T was being passed both a test and the value to put the test on, whereas
the value is actually set up inside the macro, which is not a clean interface,
and would not work on EBCDIC, as the loop bounds are for ASCII. By just
passing the test name, the macro can generate code that will work also on
EBCDIC.
Unfortunately, the same can't happen for the NOLOC version of the macro, as it
needs to look at more than a single byte, so needs an address.
|
|
|
|
|
|
|
|
|
|
|
| |
The addition of the ANYOFV regnode to treat multi-char folds in a bracketed
character class has exposed a bug, in which those classes have long been able
to be varying length (due to the multi-char fold), but the compiler wasn't
aware of it. Now it is, and hence won't allow those which have multi-char
folds to be part of a lookbehind pattern, which requires a constant length.
This patch disallows multi-char folds in a lookbehind bracketed character
class.
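To illustrate why such a class is not constant length, the Python sketch below computes the range of match lengths for a non-empty set of class members under caseless matching. It is a rough model with hypothetical names: str.casefold stands in for Perl's fold tables, and a member is assumed to match either itself (one character) or its full fold.

```python
def match_length_range(chars, caseless=True):
    """Return (min, max) number of characters a class with these members can match."""
    lengths = {1}  # every member matches itself as a single character
    if caseless:
        for ch in chars:
            lengths.add(len(ch.casefold()))  # a multi-char fold adds a longer match
    return min(lengths), max(lengths)

def allowed_in_lookbehind(chars, caseless=True):
    """Lookbehind needs a constant length, so min and max must agree."""
    lo, hi = match_length_range(chars, caseless)
    return lo == hi
```

For example, a class containing LATIN SMALL LETTER SHARP S can match either that one character or the two characters "ss" under /i, so it fails the check.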
|
|
|
|
|
| |
This restricts certain constructs, like \w, to matching in the ASCII range
only.
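As an analogy only (this is Python's re module, not Perl), the re.ASCII flag has a similar effect on \w. The helper name below is hypothetical.

```python
import re

def is_word_char(ch, ascii_only):
    """Test a single character against \\w, optionally restricted to ASCII."""
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(r"\w", ch, flags) is not None
```

Here is_word_char("é", False) is true while is_word_char("é", True) is false; plain ASCII letters match either way.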
|
| |
|
| |
|
|
|
|
|
| |
This refactors one area in regexec.c to use BOUNDU, NBOUNDU for
efficiency, and easier adding of the future BOUNDA.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch converts the Unicode semantics of \s, \w and their complements
from using the flags field of their nodes to using separate
nodes. This gains some efficiency, especially useful in tight loops and
backtracking of regexec.c, and prepares the way for easily adding other
semantic variations, such as /a.
It refactors the CCC_TRY... macros. I tried to break this piece up into
smaller chunks, but found it much easier to get to this in one step.
Further patches will do some more refactoring of these.
As part of the CCC_TRY macro refactoring, the lines that include the
test if (! nextchr) are changed to just look for the end-of-string by
position instead of it being NUL. In locales, it could be (however
unlikely) that NUL is a real alphabetic, digit, or space character.
|
|
|
|
|
|
|
|
|
|
|
| |
The code here was asymmetrical. It did not account for Unicode
semantics when ORing \W. For \w, \s, and \S it does. This patch
changes the code to be symmetrical.
I spent a couple hours trying to come up with a test, but could not get
this area of the code to execute, which may explain why there has not
been a field report of it. It may be that it is unreachable; there has
been other code in the routine that wasn't.
|
|
|
|
|
|
| |
This code can never be reached, as the switch statement switches on the
regkind of the op, not the op itself; and the kind of all the locale
regnodes is the base regnode itself. For example regkind[ALNUML] is ALNUM.
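A small Python model of the situation, using only the ALNUM/ALNUML pair named above (the dispatch function and its return strings are hypothetical): because the switch is on the kind, the branch for the locale op can never be selected.

```python
ALNUM, ALNUML = "ALNUM", "ALNUML"

# Maps each op to its kind; the locale variant shares the base node's kind.
REGKIND = {ALNUM: ALNUM, ALNUML: ALNUM}

def dispatch(op):
    kind = REGKIND[op]          # the switch is on the kind, not on the op
    if kind == ALNUM:
        return "ALNUM case"
    if kind == ALNUML:          # dead code: no op has ALNUML as its kind
        return "ALNUML case"
    return "default"
```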
|
|
|
|
|
|
|
|
|
| |
The FLAGS fields of certain regnodes were encoded with USE_UNI if
unicode semantics were to be used. This patch instead sets them to the
character set used, one of the possibilities of which is to use unicode
semantics. This shortens the code somewhat, and always puts the
character set in the flags field, which can allow use of switch
statements on it for efficiency, especially as new values are added.
|
|
|
|
|
|
|
| |
I much prefer David Golden's name for /d whose meaning 'depends' on
circumstances, instead of 'dual' meaning it could be one or another.
Change it before this gets out in a stable release, and we're stuck with
the old name.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The /d, /l, and /u regex modifiers are mutually exclusive. This patch
changes the field that stores the character set to use more than one bit
with an enum determining which one. This data structure more
closely follows the semantics of their being mutually exclusive, and
conserves bits as well, and is better expandable.
A small API is added to set and query the bit field.
This patch is not .xs source backwards compatible. A handful of cpan
programs are affected.
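The shape of such a field and its accessors might look like the Python sketch below. Everything here is illustrative: the enum values, the shift, the mask width, and the function names are assumptions, not the ones used in the Perl source.

```python
from enum import IntEnum

class Charset(IntEnum):
    D = 0   # /d
    L = 1   # /l
    U = 2   # /u

CHARSET_SHIFT = 5                     # hypothetical position within the flags word
CHARSET_MASK = 0b11 << CHARSET_SHIFT  # two bits: room for one more value later

def set_charset(flags, charset):
    """Clear the field, then store the new value; other flag bits are untouched."""
    return (flags & ~CHARSET_MASK) | (charset << CHARSET_SHIFT)

def get_charset(flags):
    """Extract the field and convert it back to the enum."""
    return Charset((flags & CHARSET_MASK) >> CHARSET_SHIFT)
```

Because the field holds exactly one value, the modifiers cannot be set simultaneously, and two bits leave a spare value where three separate flag bits would have been needed.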
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
This bug stemmed from Latin1 characters not matching any (non-complemented)
character class in /d semantics when the target string is not utf8, but having
unicode semantics when it is. The solution here is to add a special flag.
There were several tests that relied on the broken behavior, specifically they
tested that \xff isn't a printable word character even in utf8. I changed the
deparse test to instead use a non-printable code point, and I changed the ones
in re_tests to be TODOs, and will change them back using /a when that is
shortly added.
|
|
|
|
|
|
| |
It turns out that the INVERT and EOS flags can actually share the same bit, as
explained in the comments, freeing up a bit for other uses. No code changes
need be made.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch restructures the regex ANYOF code to generate ANYOFV nodes instead
when there is a possibility that it could match more than one character. Note
that this doesn't affect the optimizer, as it essentially ignores things that
fit into this category. (But it means that the optimizer will no longer reject
these when it shouldn't have.)
The handling of the LATIN SHARP s is modified to correspond with this new node
type.
The initial handling of ANYOFV is placed in regexec.c. More analysis will come
on that. But there was significant change to the part that handles matching
multi-char strings. This has long been buggy, with it previously comparing a
folded-version on one side with a non-folded version on the other.
This patch fixes about 60% of the problems that my undelivered test suite gives
for multi-char folds. But there are still 17K test failures left, so I'm still
not delivering that. The TODOs that this fixes will be cleaned up in a later commit
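The class of bug mentioned above, comparing a folded string with an unfolded one, is easy to demonstrate. The Python sketch below is a model with hypothetical names, using str.casefold in place of Perl's folding; it is not the regexec.c code.

```python
def one_sided_equal(pattern_text, target_text):
    """Buggy shape: only one side is folded before comparing."""
    return pattern_text.casefold() == target_text

def fold_equal(pattern_text, target_text):
    """Caseless comparison with both sides folded."""
    return pattern_text.casefold() == target_text.casefold()
```

With LATIN SMALL LETTER SHARP S against "SS", the one-sided version compares "ss" with "SS" and fails, while folding both sides gives "ss" on each side.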
|
| |
|
|
|
|
|
|
|
|
|
| |
# New Ticket Created by (Peter J. Acklam)
# Please include the string: [perl #81904]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=81904 >
Signed-off-by: Abigail <abigail@abigail.be>
|
|
|
|
|
|
|
|
|
|
|
| |
ANYOF_FOLD is now used under fewer conditions. Otherwise the
bitmap of characters 0-255 is fully calculated with the folds, and the
flag is not set. One condition is under locale, where the folds aren't
known at compile time; the other is for things accessible through a
swash.
By changing the name to its new meaning, certain optimizations become more
obvious.
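A minimal Python sketch of computing such a bitmap at compile time follows. It assumes all class members are below 256, considers only folds that stay within 0-255, and uses str.casefold in place of Perl's tables; the function names are hypothetical.

```python
def build_bitmap(chars, caseless):
    """Return a 256-bit integer with a bit set for each matching code point."""
    bitmap = 0
    for ch in chars:
        bitmap |= 1 << ord(ch)
        if caseless:
            folded = ch.casefold()
            # Add every Latin1 character that folds to the same thing.
            for other in range(256):
                if chr(other).casefold() == folded:
                    bitmap |= 1 << other
    return bitmap

def in_bitmap(bitmap, ch):
    """Membership is a single bit test; no folding is needed at match time."""
    return ord(ch) < 256 and (bitmap >> ord(ch)) & 1 == 1
```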
|
|
|
|
|
|
| |
instead of the less familiar octal for larger values. Perhaps they
should actually print the actual character, but this is far easier than
the previous to understand.
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
A two-character character class where the two elements are the folds of
each other can be optimized into an EXACTF regnode. This should not
change the speed of execution appreciably, except that EXACTF regnodes
are candidates for further optimization by combining with adjacent nodes
of the same type.
This commit brings the optimization level up to somewhat better than
when I started mucking around with ANYOF nodes.
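The test for this optimization can be sketched in Python as below. The tuple representation and function name are hypothetical, str.casefold stands in for Perl's fold tables, and the sketch ignores characters outside the class that happen to share the same fold, which a real implementation would have to consider.

```python
def optimize_class(chars):
    """Return ("EXACTF", c) for a class like [Aa], else ("ANYOF", members)."""
    members = sorted(set(chars))
    if len(members) == 2:
        a, b = members
        # Both elements must fold to the same single character.
        if len(a.casefold()) == 1 and a.casefold() == b.casefold():
            return ("EXACTF", a.casefold())
    return ("ANYOF", "".join(members))
```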
|