| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
| |
As far as I can tell these particular tests were never actually TODO,
but were lumped in with other failing tests for ease of coding the .t file
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Prior to now just about anything has been legal for a character name in
\N{...}. This means that legal code was broken by having \N{3,4} for
example mean [^\n]{3,4}. Such code doesn't come from standard
charnames, but from legal custom translators.
This patch deprecates "unreasonable" names. handy.h is changed by the
addition of macros that taken together define the names we deem
reasonable, namely alpha beginning with alphanumerics and some
punctuations as continuations.
toke.c is changed to parse each name and to raise a warning if any
problematic characters are found.
Some tests and diagnostic documentation are also included.
|
|
|
|
|
|
|
|
|
|
|
|
| |
It is possible to bypass the lexer's parsing of \N. This patch causes
the regex compiler to deal with that better. The compiler no longer
assumes that the lexer parsed the \N. It generates an error message if
the \N isn't in a form it is expecting, and invalid hexadecimal digits
are now fatal errors, with the position of the error more clearly
marked.
The diagnostic pod has been updated to reflect the new error messages,
with some slight clarifications to the previous ones as well.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
make regen embed.fnc
needs to be run on this patch.
This patch fixes Bugs #56444 and #62056.
Hopefully we have finally gotten this right. The parser used to handle
all the escaped constants, expanding \x2e to its single byte equivalent.
The problem is that for regexp patterns, this is a '.', which is a
metacharacter and has special meaning that \x2e does not. So things
were changed so that the parser didn't expand things in patterns. But
this causes problems for \N{NAME}, when the pattern doesn't get
evaluated until runtime, as for example when it has a scalar reference
in it, like qr/$foo\N{NAME}/. We want the value for \N{NAME} that was
in effect at the point during the parsing phase that this regex was
encountered in, but we don't actually look at it until runtime, when
these bug reports show that it is gone. The solution is for the
tokenizer to parse \N{NAME}, but to compile it into an intermediate
value that won't ever be considered a metacharacter. We have chosen to
compile NAME to its equivalent code point value, and express it in the
already existing \N{U+...} form. This indicates to the regex compiler
that the original input was a named character and retains the value it
had at that point in the parse.
This means that \N{U+...} now always must imply Unicode semantics for
the string or pattern it appeared in. Previously there was an
inconsistency, where effectively \N{NAME} implied Unicode semantics, but
\N{U+...} did not necessarily. So now, any string or pattern that has
either of these forms is utf8 upgraded.
A complication is that a charnames handler can return a sequence of
multiple characters instead of just one. To deal with this case, the
tokenizer will generate a constant of the form \N{U+c1.c2.c2...}, where
c1 etc are the individual characters. Perhaps this will be made a
public interface someday, but I decided to not expose it externally as
far as possible for now in case we find reason to change it. It is
possible to defeat this by passing it in a single quoted string to the
regex compiler, so the documentation will be changed to discourage that.
A further complication is that \N can have an additional meaning: to
match a non-newline. This means that the two meanings have to be
disambiguated.
embed.fnc was changed to make public the function regcurly() in
regcomp.c so that it could be referred to in toke.c to see if the ... in
\N{...} is a legal quantifier like {2,}. This is used in the
disambiguation.
toke.c was changed to update some out-dated relevant comments.
It now parses \N in patterns. If it determines that it isn't a named
sequence, it passes it through unchanged. This happens when there is no
brace after the \N, or no closing brace, or if the braces enclose a
legal quantifier. Previously there has been essentially no restriction
on what can come between the braces so that a custom translator can
accept virtually anything. Now, legal quantifiers are assumed to mean
that the \N is a "match non-newline that quantity of times".
I removed the #ifdef'd out code that had been left in in case pack U
reverted to earlier behavior. I did this because it complicated things,
and because the change to pack U has been in long enough and shown that
it is correct so it's not likely to be reverted.
\N meaning a named character is handled differently depending on whether
this is a pattern or not. In all cases, the output will be upgraded to
utf8 because a named character implies Unicode semantics. If not a
pattern, the \N is parsed into a utf8 string, as before. Otherwise it
will be parsed into the intermediate \N{U+...} form. If the original
was already a valid \N{U+...} constant, it is passed through unchanged.
I now check that the sequence returned by the charnames handler is not
malformed, which was lacking before.
The code in regcomp.c which dealt with interfacing with the charnames
handler has been removed. All the values should be determined by the
time regcomp.c gets involved. The affected subroutine is necessarily
restructured.
An EXACT-type node is generated for the character sequence. Such a node
has a capacity of 255 bytes, and so it is possible to overflow it. This
wasn't checked for before, but now it is, and a warning issued and the
overflowing characters are discarded.
|
|
|
|
| |
available for the pos and len arguments, with safe conversion to STRLEN where it's smaller than an IV.
|
|
|
|
|
| |
[N.B. I converted package name separators from q{'} to q{::} in
the test files as suggested by demerphq. -- dagolden]
|
|
|
|
|
|
|
|
| |
This reverts commit b6d1426f94a845fb8fece8b6ad0b7d9f35f2d62e.
Conflicts:
pp.c
|
|
|
|
| |
(This is only a partial fix, since it doesn't handle lvalue substr)
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
matches
The match vars are associated with the regexp that last matched
successfully. In the case of $str =~ $qr or /$qr/, since the $qr could
be used in multiple scopes that need their own sets of match vars, the
$qr is cloned by Perl_reg_temp_copy as of change 30677/28d8d7f. This
happens in pp_regcomp before pp_match has stringified the LHS, hence the
bug. In short, /$gror/ is not equivalent to
($which = !$which) ? /$gror/ : /$gror/, which is weird.
Attached is a patch, which admittedly is a hack, but fixes this
particular side effect of what is probably a bad design, by stringifying
the LHS in pp_regcomp, and having pp_match skip get-magic in such cases.
A real fix far exceeds my capabalities, and would also be very intrusive
according to
<http://www.nntp.perl.org/group/perl.perl5.porters/2007/03/msg122415.html>.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Instead of returning a(nother) reference to the (pre-compiled) regexp in the
optree, use reg_temp_copy() to create a copy of it, and return a reference to
that. This resolves issues about Regexp::DESTROY not being called in a timely
fashion (the original bug tracked by RT #69852), as well as bugs related to
blessing regexps, and of assigning to regexps, as described in correspondence
added to the ticket.
It transpires that we also need to undo the SvPVX() sharing when ithreads
cloning a Regexp SV, because mother_re is set to NULL, instead of a cloned
copy of the mother_re. This change might fix bugs with regexps and threads in
certain other situations, but as yet neither tests nor bug reports have
indicated any problems, so it might not actually be an edge case that it's
possible to reach.
|
|
|
|
|
| |
Also add a test for that, fill in test description, and sneak in a vim
modeline for re_tests
|
|
|
|
| |
Message-ID: <4B0C4744.7080401@khwilliamson.com>
|
| |
|
| |
|
| |
|
|
|
|
|
|
| |
quantified groups)
This patch was modified to work with the updated file locations.
|
|
|
|
| |
This means that re::is_regexp(${qr/x/}) will now return true.
|
|
|
|
|
|
|
|
| |
revert ba9ac1759cb6e7a5e6883c85edd0b450061b5ccb
Changing the semantics of \w \s and \d breaks too much
and Jesse wants to do a rollout. This disables the new
semantics until we can get all the details worked out.
|
|
|
|
| |
Note: we currently fail these tests. This will be recitified.
|
|
|
|
| |
anyway, and resort and update the MANIFEST
|
| |
|
|
|
|
|
|
| |
next commit i will mutually gut them all, and this step will allow git to trivially
identify that all of these files were copies of pat.t originally, and thus the blame
log should show the full history.
|
| |
|
|
|
|
| |
Remove the long deprecated feature where split in scalar context writes to @_
|
| |
|
| |
|
| |
|
|
|