| Commit message (Collapse) | Author | Age | Files | Lines |
| |
The previous return value where NULL meant OK is outside-the-norm.
| |
This commit adds the new construct \o{} to express a character constant
by its octal ordinal value, along with ancillary tests and
documentation.
A function to handle this is added to util.c, and it is called from the
3 parsing places it could occur. The function is a candidate for
in-lining, though I doubt that it will ever be used frequently.
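As a rough illustration of what such a construct has to do, here is a minimal Python sketch of parsing an octal-brace escape. The name parse_octal_brace and its error handling are assumptions made for this example; it is not the function added to util.c, and the real code may warn rather than fail on bad input.

```python
OCTAL_DIGITS = "01234567"


def parse_octal_brace(text, pos):
    """Parse an octal-brace escape that starts at text[pos].

    Returns (code_point, index_just_past_the_closing_brace).
    Hypothetical helper for illustration; not the util.c function.
    """
    prefix = "\\o{"
    if not text.startswith(prefix, pos):
        raise ValueError("expected \\o{ at offset %d" % pos)
    start = pos + len(prefix)
    close = text.find("}", start)
    if close == -1:
        raise ValueError("missing right brace on \\o{")
    digits = text[start:close]
    # Assumption: empty or non-octal content is rejected outright.
    if not digits or any(ch not in OCTAL_DIGITS for ch in digits):
        raise ValueError("expected octal digits, got %r" % digits)
    return int(digits, 8), close + 1
```

Returning the index past the closing brace lets each of several call sites resume scanning without re-parsing the escape.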
| |
Prior to this patch, \400 - \777 meant something different in some
circumstances in regexes outside bracketed character classes. A
deprecated warning message has been in place since 5.10.1 when this
happens. Remove the warning, and bring the behavior into line with the
other double-quotish contexts. \400 - \777 now always means the same
thing as \x{100} - \x{1FF} (except when the octal forms are taken as
backreferences.)
Signed-off-by: David Golden <dagolden@cpan.org>
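The equivalence between the two ranges is plain base conversion. A small Python sketch (the function name is made up for this example) shows the end points:

```python
def octal_escape_to_hex_form(octal_digits):
    """Rewrite the digits of an octal escape as the equivalent hex-brace form."""
    value = int(octal_digits, 8)
    return "\\x{%X}" % value
```

With this, "400" maps to \x{100} and "777" maps to \x{1FF}, matching the ranges quoted in the message.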
| |
This reduces object code size, reducing CPU cache pressure on the non-exception
paths.
| |
Previously the AV paren_name_list would "leak" until global destruction.
This was only an issue under -DDEBUGGING. Fixes RT #73438.
| |
This allows the implementation of the lookup mechanism to change.
| |
The problem is that a dot can come between the braces in \N{foo.bar},
but when searching for it, I didn't stop looking at the right brace, so
it generated an error inappropriately.
This is essentially a minimum patch; efficiency could be improved
slightly with a little more work.
| |
This is an indirect fix for
[perl #74484] Regex causing exponential runtime+mem usage
The trie runtime code was doing more SAVETMPS than FREETMPS and was thus
growing a large tmps stack on heavy backtracking. Rather than fixing this
directly, I rewrote part of the trie code so that it no longer needs to
allocate memory in S_regmatch (it still does in find_byclass()).
The basic issue is that multiple branches in the trie may trigger an
accept state; for example:
    "abcd" =~ /xyz|abcd.*X|ab.*Y/
here, words (branches) 2 and 3 are accept states. The original approach
was, at run time, to create a list of accepted word numbers and the
character positions of the end of each of those words. Then run the rest
of the pattern for each word in the list in turn (in word index order).
This requires memory for the list to be allocated and freed.
The new approach involves creating extra info at compile time; in
particular, for each word, a pointer to the previous accepted word (if
any) in the state tree. For example for the above pattern, part of the
state tree may be
      a    b    c    d
    1 -> 2 -> 3 -> 4 -> 5
            (#3)      (#2)
(e.g. at state 1, if the next char is 'a', we transition to state 2).
Here, state 3 is an accept state with word #3, and 5 is an accept state
with word #2. So we build a table indexed by word number, which has
wordinfo[2] = 3, wordinfo[3] = 0, thus building the word chain 2->3->0.
At run time we run the trie to completion, and remember the word
associated with the longest accept state (word #2 above). Then by following
back the chain of .prev fields, we can produce a list of all accepting
words. We then iteratively find the smallest-numbered (i.e. leftmost) word in
the chain, and run with it. On failure and backtrack, we find the
next-smallest and so on.
Since we are no longer recording the end-position of each word in the
string, we have to recalculate this for each backtrack. We initially
record the end-position of the shortest accepting word, and given that we
know the length of each word, we can calculate the new position each time
as an offset from that first word. Depending on unicode and folding, that
calculation can be cheap or expensive.
This algorithm is optimised for the typical case where there are a small
number (<= 2) of accepting states.
This patch creates a new compile-time array, trie->wordinfo[], indexed by
word number, which contains relevant info about each word. This also
supersedes the old trie->newword[] array, whose function of recording
"overspills" of multiple words per accept state, is now handled as part of
the wordinfo[].prev chain.
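The chain walk described above can be sketched in a few lines of Python. This is only a model of the idea, not the regexec.c code: the dictionary layout, the field names "prev" and "len", and the helper names are assumptions for this example, and it treats one character as one string position (no Unicode or folding cost).

```python
# Word table for the three-branch example: 1 = "xyz", 2 = "abcd", 3 = "ab".
# prev == 0 terminates a chain.
WORDINFO = {
    1: {"prev": 0, "len": 3},
    2: {"prev": 3, "len": 4},
    3: {"prev": 0, "len": 2},
}


def next_word(wordinfo, longest_word, after):
    """Smallest word number in the chain that is greater than `after`.

    Walks the .prev links from the longest accepting word each time, so no
    list of accepted words has to be allocated. Returns 0 when exhausted.
    """
    best = 0
    word = longest_word
    while word != 0:
        if word > after and (best == 0 or word < best):
            best = word
        word = wordinfo[word]["prev"]
    return best


def end_position(wordinfo, shortest_word, shortest_end, word):
    """End offset of `word`, derived from the shortest accepting word."""
    extra = wordinfo[word]["len"] - wordinfo[shortest_word]["len"]
    return shortest_end + extra
```

Starting with after = 0, repeated calls give word 2, then word 3, then 0, which is the order in which the branches would be retried on backtracking.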
| |
This makes the other 26 (or 58) bits available for save data.
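The figures 26 and 58 are what remains of a 32-bit or 64-bit word once 6 bits are set aside, so the following Python sketch assumes a 6-bit type tag kept in the low bits. The names and the exact layout are assumptions for illustration only; the message itself does not spell them out.

```python
TYPE_BITS = 6                      # assumed tag width: 32 - 6 = 26, 64 - 6 = 58
TYPE_MASK = (1 << TYPE_BITS) - 1


def pack(save_type, data, word_bits=32):
    """Combine a small type tag and a data value into one word."""
    if not 0 <= save_type <= TYPE_MASK:
        raise ValueError("type does not fit in %d bits" % TYPE_BITS)
    if not 0 <= data < (1 << (word_bits - TYPE_BITS)):
        raise ValueError("data does not fit in the remaining bits")
    return (data << TYPE_BITS) | save_type


def unpack(word):
    """Split a packed word back into (save_type, data)."""
    return word & TYPE_MASK, word >> TYPE_BITS
```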
| |
make regen is needed
This patch forbids non-ascii following the "\c". It also terminates for
"\c{" with a message to contact p5p if there is need for continuing its
current definition. And if the character following the "\c" causes the
result to not be a control character, a warning is issued. The warning is
currently in the 'deprecated' category, which is on by default. This can
easily be changed later.
This is only an initial patch. It does not attempt to show the context in
which the problematic construct occurs. That can be added
later.
It gathers the 3 occurrences of evaluating \c and puts them in one
common routine.
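A single routine of this kind might look like the Python sketch below. It assumes the usual ASCII rule for \c (uppercase the character, then toggle bit 64); the function name, the exception type and the warning text are invented for this example and do not reproduce the real diagnostics.

```python
def grok_control(ch):
    """Return (result_char, warnings) for the character that follows \\c.

    Hypothetical helper; error and warning wording are placeholders.
    """
    if len(ch) != 1 or ord(ch) > 127:
        raise ValueError("character following \\c must be ASCII")
    if ch == "{":
        raise ValueError("\\c{ is rejected in this sketch")
    result = chr(ord(ch.upper()) ^ 64)
    warnings = []
    # Control characters are 0..31 and DEL (127).
    if not (ord(result) < 32 or ord(result) == 127):
        warnings.append("\\c%s does not yield a control character" % ch)
    return result, warnings
```

Keeping the checks in one place means every caller rejects and warns about the same inputs.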
| |
Broken in ff3f963aa0f95ea53996b6a3842b824504b57c79.
| |
It is possible to bypass the lexer's parsing of \N. This patch causes
the regex compiler to deal with that better. The compiler no longer
assumes that the lexer parsed the \N. It generates an error message if
the \N isn't in a form it is expecting, and invalid hexadecimal digits
are now fatal errors, with the position of the error more clearly
marked.
The diagnostic pod has been updated to reflect the new error messages,
with some slight clarifications to the previous ones as well.
| |
make regen embed.fnc
needs to be run on this patch.
This patch fixes Bugs #56444 and #62056.
Hopefully we have finally gotten this right. The parser used to handle
all the escaped constants, expanding \x2e to its single byte equivalent.
The problem is that for regexp patterns, this is a '.', which is a
metacharacter and has special meaning that \x2e does not. So things
were changed so that the parser didn't expand things in patterns. But
this causes problems for \N{NAME}, when the pattern doesn't get
evaluated until runtime, as for example when it has a scalar reference
in it, like qr/$foo\N{NAME}/. We want the value for \N{NAME} that was
in effect at the point during the parsing phase that this regex was
encountered in, but we don't actually look at it until runtime, when
these bug reports show that it is gone. The solution is for the
tokenizer to parse \N{NAME}, but to compile it into an intermediate
value that won't ever be considered a metacharacter. We have chosen to
compile NAME to its equivalent code point value, and express it in the
already existing \N{U+...} form. This indicates to the regex compiler
that the original input was a named character and retains the value it
had at that point in the parse.
This means that \N{U+...} now always must imply Unicode semantics for
the string or pattern it appeared in. Previously there was an
inconsistency, where effectively \N{NAME} implied Unicode semantics, but
\N{U+...} did not necessarily. So now, any string or pattern that has
either of these forms is utf8 upgraded.
A complication is that a charnames handler can return a sequence of
multiple characters instead of just one. To deal with this case, the
tokenizer will generate a constant of the form \N{U+c1.c2.c3...}, where
c1 etc are the individual characters. Perhaps this will be made a
public interface someday, but I decided to not expose it externally as
far as possible for now in case we find reason to change it. It is
possible to defeat this by passing it in a single quoted string to the
regex compiler, so the documentation will be changed to discourage that.
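To make the intermediate form concrete, here is a small Python sketch that turns the characters returned by a charnames handler into a dotted \N{U+...} constant. The function name and the exact digit formatting (uppercase hex, no padding) are assumptions for this example; the internal form is deliberately not a public interface.

```python
def to_intermediate_form(chars):
    """Build a dotted U+ constant from a one-or-more character sequence."""
    if not chars:
        raise ValueError("a named sequence needs at least one character")
    code_points = ".".join("%X" % ord(ch) for ch in chars)
    return "\\N{U+" + code_points + "}"
```

A single character yields one code point between the braces; a multi-character sequence yields the code points joined by dots, none of which can be mistaken for a regex metacharacter.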
A further complication is that \N can have an additional meaning: to
match a non-newline. This means that the two meanings have to be
disambiguated.
embed.fnc was changed to make public the function regcurly() in
regcomp.c so that it could be referred to in toke.c to see if the ... in
\N{...} is a legal quantifier like {2,}. This is used in the
disambiguation.
toke.c was changed to update some out-dated relevant comments.
It now parses \N in patterns. If it determines that it isn't a named
sequence, it passes it through unchanged. This happens when there is no
brace after the \N, or no closing brace, or if the braces enclose a
legal quantifier. Previously there has been essentially no restriction
on what can come between the braces so that a custom translator can
accept virtually anything. Now, legal quantifiers are assumed to mean
that the \N is a "match non-newline that quantity of times".
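The disambiguation rules in the paragraph above can be summarised as a short Python sketch. The quantifier test stands in for what regcurly() decides and is assumed here to accept {n}, {n,} and {n,m}; the function name and return values are made up for this example.

```python
import re

# Assumed shape of a legal quantifier: {n}, {n,} or {n,m}.
_QUANTIFIER = re.compile(r"\{\d+(?:,\d*)?\}")


def classify_backslash_N(rest):
    """Classify a backslash-N given the text that immediately follows it.

    Returns "non-newline" when it should be passed through unchanged,
    or "named" when the braces are taken to hold a character name.
    """
    if not rest.startswith("{"):
        return "non-newline"          # no brace after it
    if "}" not in rest:
        return "non-newline"          # no closing brace
    if _QUANTIFIER.match(rest):
        return "non-newline"          # braces hold a legal quantifier
    return "named"
```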
I removed the #ifdef'd out code that had been left in in case pack U
reverted to earlier behavior. I did this because it complicated things,
and because the change to pack U has been in long enough and shown that
it is correct so it's not likely to be reverted.
\N meaning a named character is handled differently depending on whether
this is a pattern or not. In all cases, the output will be upgraded to
utf8 because a named character implies Unicode semantics. If not a
pattern, the \N is parsed into a utf8 string, as before. Otherwise it
will be parsed into the intermediate \N{U+...} form. If the original
was already a valid \N{U+...} constant, it is passed through unchanged.
I now check that the sequence returned by the charnames handler is not
malformed, which was lacking before.
The code in regcomp.c which dealt with interfacing with the charnames
handler has been removed. All the values should be determined by the
time regcomp.c gets involved. The affected subroutine is necessarily
restructured.
An EXACT-type node is generated for the character sequence. Such a node
has a capacity of 255 bytes, and so it is possible to overflow it. This
wasn't checked for before, but now it is: a warning is issued and the
overflowing characters are discarded.
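The overflow check amounts to counting encoded bytes against the node capacity. The Python sketch below models that with UTF-8 byte lengths; the function name is hypothetical, and where the real code issues a warning this version just reports how many characters were dropped.

```python
MAX_EXACT_BYTES = 255  # node capacity stated in the commit message


def fill_exact_node(chars):
    """Return (kept_text, number_of_discarded_characters)."""
    kept = []
    used = 0
    for index, ch in enumerate(chars):
        size = len(ch.encode("utf-8"))
        if used + size > MAX_EXACT_BYTES:
            # A real implementation would emit a warning here.
            return "".join(kept), len(chars) - index
        kept.append(ch)
        used += size
    return "".join(kept), 0
```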
| |
Change c2123ae380a372d5 exposed the fact that Perl_reg_temp_copy() didn't
reset SvMAGIC() to NULL after block copying the "parent" regexp. The analogous
problem with SvSTASH() was fixed with change b9ad13acb338e137.
| |
That bug happens when we detect a compilation error in the statement
being parsed, and when the continuation of the parsing of that same
statement needs to load the file unicore/Name.pl via charnames.pm.
In that case perl gets confused, fails to parse Name.pl because
the parser is already in error, and also fails to properly rewind
to a normal error-reporting state.
This patch does not attempt to fix the whole error-reporting process;
instead, it simply prevents perl from trying to load charnames if it has
already recorded a parse error. So, in a way, it hides the bug under
the carpet. However, this is a safe fix, suitable for a code-freeze
stage.
| |
$ ./perl -lwe '$a = ${qr//}; $a = 2; print re::is_regexp(\$a)'
1
It is possible for arbitrary SVs (eg PAD entries) to be upgraded to
SVt_REGEXP. (This is new with first class regexps)
Whilst the example above does not SEGV, it will be possible to write
code that will cause SEGVs (or worse) at the point when the scalar is freed,
because the code in sv_clear() assumes that all scalars of type
SVt_REGEXP *are* regexps, and passes them to pregfree2(), which assumes that
pointers within are valid.
| |
A simple program like the following could coredump:
use charnames ':full';
/\N{LATIN SMALL LETTER E}/;
The moral being, make sure sp is synced on return from call_sv()
*before* using the stack!
(Was a regression since 5.10)
| |
Instead of returning a(nother) reference to the (pre-compiled) regexp in the
optree, use reg_temp_copy() to create a copy of it, and return a reference to
that. This resolves issues about Regexp::DESTROY not being called in a timely
fashion (the original bug tracked by RT #69852), as well as bugs related to
blessing regexps, and of assigning to regexps, as described in correspondence
added to the ticket.
It transpires that we also need to undo the SvPVX() sharing when ithreads
cloning a Regexp SV, because mother_re is set to NULL, instead of a cloned
copy of the mother_re. This change might fix bugs with regexps and threads in
certain other situations, but as yet neither tests nor bug reports have
indicated any problems, so it might not actually be an edge case that it's
possible to reach.
| |
Also revert 8902bb05b18c9858efa90229ca1ee42b17277554 as it merely
masked one symptom of the deeper problems.
Also fixes RT #69973, which was a segfault which was exposed by
8902bb05, see the ticket for further details.
http://rt.perl.org/rt3//Public/Bug/Display.html?id=69973
At the core of this is the problem that in unicode matching a bunch
of code points have case folding rules beyond just A-Z/a-z. Since
the case folding rules are decided at runtime by the string, we can't
use the same TRIE tables for both unicode/non-unicode matching.
Until this is reconciled or some other solution is found, case insensitive
matching only gets the TRIE optimisation when the pattern is unicode.
From CaseFolding.txt:
00B5; C; 03BC; # MICRO SIGN
00C0; C; 00E0; # LATIN CAPITAL LETTER A WITH GRAVE
00C1; C; 00E1; # LATIN CAPITAL LETTER A WITH ACUTE
00C2; C; 00E2; # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
00C3; C; 00E3; # LATIN CAPITAL LETTER A WITH TILDE
00C4; C; 00E4; # LATIN CAPITAL LETTER A WITH DIAERESIS
00C5; C; 00E5; # LATIN CAPITAL LETTER A WITH RING ABOVE
00C6; C; 00E6; # LATIN CAPITAL LETTER AE
00C7; C; 00E7; # LATIN CAPITAL LETTER C WITH CEDILLA
00C8; C; 00E8; # LATIN CAPITAL LETTER E WITH GRAVE
00C9; C; 00E9; # LATIN CAPITAL LETTER E WITH ACUTE
00CA; C; 00EA; # LATIN CAPITAL LETTER E WITH CIRCUMFLEX
00CB; C; 00EB; # LATIN CAPITAL LETTER E WITH DIAERESIS
00CC; C; 00EC; # LATIN CAPITAL LETTER I WITH GRAVE
00CD; C; 00ED; # LATIN CAPITAL LETTER I WITH ACUTE
00CE; C; 00EE; # LATIN CAPITAL LETTER I WITH CIRCUMFLEX
00CF; C; 00EF; # LATIN CAPITAL LETTER I WITH DIAERESIS
00D0; C; 00F0; # LATIN CAPITAL LETTER ETH
00D1; C; 00F1; # LATIN CAPITAL LETTER N WITH TILDE
00D2; C; 00F2; # LATIN CAPITAL LETTER O WITH GRAVE
00D3; C; 00F3; # LATIN CAPITAL LETTER O WITH ACUTE
00D4; C; 00F4; # LATIN CAPITAL LETTER O WITH CIRCUMFLEX
00D5; C; 00F5; # LATIN CAPITAL LETTER O WITH TILDE
00D6; C; 00F6; # LATIN CAPITAL LETTER O WITH DIAERESIS
00D8; C; 00F8; # LATIN CAPITAL LETTER O WITH STROKE
00D9; C; 00F9; # LATIN CAPITAL LETTER U WITH GRAVE
00DA; C; 00FA; # LATIN CAPITAL LETTER U WITH ACUTE
00DB; C; 00FB; # LATIN CAPITAL LETTER U WITH CIRCUMFLEX
00DC; C; 00FC; # LATIN CAPITAL LETTER U WITH DIAERESIS
00DD; C; 00FD; # LATIN CAPITAL LETTER Y WITH ACUTE
00DE; C; 00FE; # LATIN CAPITAL LETTER THORN
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
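For readers unfamiliar with the file format, the Python sketch below parses a few of the lines quoted above into a folding table. The parser and its names are assumptions for this example; it only handles the "code; status; mapping; # name" layout shown in the excerpt.

```python
SAMPLE = """\
00B5; C; 03BC; # MICRO SIGN
00C0; C; 00E0; # LATIN CAPITAL LETTER A WITH GRAVE
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
"""


def parse_case_folding(text):
    """Map each source character to (status, folded_string)."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()   # drop the trailing comment
        if not data:
            continue
        code, status, mapping = [f.strip() for f in data.split(";")[:3]]
        folded = "".join(chr(int(cp, 16)) for cp in mapping.split())
        table[chr(int(code, 16))] = (status, folded)
    return table


FOLDS = parse_case_folding(SAMPLE)
```

Every key here lies above 0x7F, and the sharp s folds to two characters, which is why a table built only for A-Z/a-z cannot serve both kinds of matching.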
| |
It uses reg_temp_copy to copy the REGEXP onto the destination SV without
needing to copy the underlying pattern structure. This means changing
the prototype of reg_temp_copy, so it can copy onto a passed-in SV, but
it isn't API (and probably shouldn't be exported) so I don't think this
is a problem.
| |
They had been concatenating "%s" REPORT_LOCATION, as they weren't passing in a
format string, which wasn't consistent with the 2-5 argument versions. None of
the strings passed in have % characters in them, so this is safe (and any static
analyser will be able to see this).
| |
This folds many pairs of ckWARN*() && Perl_warner() calls into single calls to
Perl_ck_warner(). vWARN(), vWARNdep() and vWARN2() are no longer used, so are
removed.
| |
Commit c74340f9 added backreferences as well as the idea of a ->swap
regex pointer to keep track of the match offsets in case of backtracking.
The problem is that when Perl re-enters the regex engine to handle
utf8::SWASHNEW, the ->swap is not saved/restored/cleared so any capture
from the utf8 (Perl) code could inadvertently modify the regex match
data that caused the utf8 swash to get built.
This change should close out RT #60508
| |
Calculate memory allocation using regexp and XPVIO, and the offset of the first
real structure member. This avoids tripping over alignment differences between
X* and x*_allocated, because x*_allocated doesn't have a double in it.
| |
"Hugo van der Sanden via RT" <perlbug-followup@perl.org> wrote:
:This is caused by a failure of the start_class optimization in the case
:of lookahead, as per the attached comment.
:
:In more detail: at the point study_chunk() attempts to deal with the
:start_class discovered for the lookahead chunk, we have
:SCF_DO_STCLASS_OR set, and_withp has the starting value of ANYOF_EOS |
:ANYOF_UNICODE_ALL, and data->start_class has [a] | ANYOF_EOS.
[...]
:In other words, we need to stack an alternation of ANDs and ORs to cope
:with this situation, and we don't have a mechanism to do that except to
:recurse into study_chunk() some more.
:
:A simpler short-term fix is instead to throw up our hands in this
:situation, and just nullify start_class. I'm not sure exactly how to do
:that, but it seems the more likely to be achievable for 5.10.1.
This patch implements the simple fix, and passes all tests including
Abigail's test cases for the bug.
Yves: note that I've preserved the 'was' code in this chunk, introduced
by you in the patch [1], discussed in the thread [2]. As far as I can
see the 3 lines propagating ANYOF_EOS via 'was' (and the copy of those
3 lines a little later) are simply doing the wrong thing - they seem
to be saying "when we combine two start classes using SCF_DO_STCLASS_AND,
claim that end-of-string is valid if the first class says it would be
even though the second says it wouldn't be". Removing those lines doesn't
cause any test failures - can you remember why you introduced those lines,
and maybe add a test case that fails without them?
Hugo
[1] http://perl5.git.perl.org/perl.git/commit/b515a41db88584b4fd1c30cf890c92d3f9697760
[2] http://groups.google.co.uk/group/perl.perl5.porters/browse_thread/thread/436187077ef96918/f11c3268394abf89
Message-Id: <200907021036.n62Aa8rv029500@zen.crypt.org>
rt.perl.org #56690
| |
... )
This fixes RT #59734 : Segfault when using (?|) in regexp.
| |
\N, like in Perl 6, is equivalent to . but not influenced by /s.
It matches any character except \n. Note that followed by { and
a non-number, \N is still a named character.
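The difference from "." under a dot-matches-newline flag can be shown by analogy in Python, whose re module has no such escape; the class [^\n] is used here as a stand-in, and re.DOTALL plays the role of /s. This is an illustration of the semantics only, not of Perl's implementation.

```python
import re

NON_NEWLINE = r"[^\n]"   # stand-in for the new escape

dot_with_s = re.compile(r"a.b", re.DOTALL)
non_newline_with_s = re.compile("a" + NON_NEWLINE + "b", re.DOTALL)
```

Under the flag, the dot pattern matches "a\nb" while the stand-in pattern does not; both still match "a-b".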
| |
(Tweaked by rgs)
Message-ID: <496D3F02.6020204@khwilliamson.com>
| |
the regex engine)
Perlbug #60156 and #49302 (and probably others) resolve down to the problem
that the definition of \s and \w and \d and the POSIX charclasses are different
for unicode strings and for non-unicode strings. This broke the character class
logic in the regex engine. The easiest fix to make the character class logic sane
again is to define new properties which do match.
This change creates new property classes that can be used instead of the
traditional ones (it does not change the previously defined ones). If the
define in regcomp.h:
#define PERL_LEGACY_UNICODE_CHARCLASS_MAPPINGS 1
is changed to 0, then the new mappings will be used. This will fix a bunch
of bugs that are reported as TODO items in the new reg_posixcc.t test file.
p4raw-id: //depot/perl@34769
| |
And refactor the code that adds the extra braces into a macro, and make it support the colorization stuff.
p4raw-id: //depot/perl@34766
| |
* Make ANYOF output from regprop easier to read by adding ][ in between the unicode representation and the "ascii" one
* Make it possible to make tests in re_tests todo.
* add a todo test for a complementary character class match that should fail (perl #60156)
* Also add a comment explaining a previous commit (relating to perl #60344)
p4raw-id: //depot/perl@34755
| |
Subject: PATCH [perl #59328] In re's, \N{U+...} doesn't match for ... > 256
Message-ID: <49124B78.2000907@khwilliamson.com>
Date: Wed, 05 Nov 2008 18:42:16 -0700
p4raw-id: //depot/perl@34747
| |
Message-ID: <25940.1225611819@chthon>
Date: Sun, 02 Nov 2008 01:43:39 -0600
p4raw-id: //depot/perl@34698
| |
From: Michael Cartmell (via RT) <perlbug-followup@perl.org>
Message-ID: <rt-3.6.HEAD-27577-1215001078-1211.56526-75-0@perl.org>
p4raw-id: //depot/perl@34697
| |
This is mostly to silence gcc's warning, "format not a string
literal and no format arguments".
p4raw-id: //depot/perl@34694
| |
erroneous const in dump.c.
p4raw-id: //depot/perl@34675
| |
to Perl_re_compile() can't be const, which means that the pattern
argument to Perl_pregcomp() can't be const, as can't the argument in
the function in the regexp engine structure.
It's a shame that no-one spotted this earlier.
(Again) I may have rendered the documentation inaccurate.
p4raw-id: //depot/perl@34672
| |
p4raw-id: //depot/perl@34653
| |
p4raw-id: //depot/perl@34650
| |
p4raw-id: //depot/perl@34629
| |
p4raw-id: //depot/perl@34628
| |
p4raw-id: //depot/perl@34585
| |
optimization. This was most probably introduced with #28262.
This change fixes perl #59516.
p4raw-id: //depot/perl@34507
|