path: root/t/re
Commit message | Author | Date | Files | Lines
* unTODO some passing TODO tests in reg_fold.t | David Mitchell | 2010-02-20 | 1 | -1/+6
    As far as I can tell, these particular tests were never actually TODO, but were lumped in with other failing tests for ease of coding the .t file.
* PATCH: deprecation warnings for unreasonable charnames | Karl Williamson | 2010-02-20 | 1 | -1/+15
    Prior to now, just about anything has been legal for a character name in \N{...}. This means that legal code was broken by having \N{3,4}, for example, mean [^\n]{3,4}. Such code doesn't come from standard charnames, but from legal custom translators.

    This patch deprecates "unreasonable" names. handy.h is changed by the addition of macros that, taken together, define the names we deem reasonable: names that begin with an alphabetic character and continue with alphanumerics and some punctuation. toke.c is changed to parse each name and to raise a warning if any problematic characters are found. Some tests and diagnostic documentation are also included.
* Improve handling of qq(\N{...}); and /x | Karl Williamson | 2010-02-20 | 1 | -1/+30
    It is possible to bypass the lexer's parsing of \N. This patch causes the regex compiler to deal with that better. The compiler no longer assumes that the lexer parsed the \N. It generates an error message if the \N isn't in a form it is expecting, and invalid hexadecimal digits are now fatal errors, with the position of the error more clearly marked.

    The diagnostic pod has been updated to reflect the new error messages, with some slight clarifications to the previous ones as well.
* PATCH: [perl #56444] delayed interpolation of \N{...} | Karl Williamson | 2010-02-19 | 4 | -8/+76
    make regen needs to be run for this patch, because embed.fnc is changed.

    This patch fixes Bugs #56444 and #62056. Hopefully we have finally gotten this right.

    The parser used to handle all the escaped constants, expanding \x2e to its single-byte equivalent. The problem is that for regexp patterns, this is a '.', which is a metacharacter and has special meaning that \x2e does not. So things were changed so that the parser didn't expand things in patterns. But this causes problems for \N{NAME} when the pattern doesn't get evaluated until runtime, as for example when it has a scalar reference in it, like qr/$foo\N{NAME}/. We want the value for \N{NAME} that was in effect at the point in the parsing phase where this regex was encountered, but we don't actually look at it until runtime, when these bug reports show that it is gone.

    The solution is for the tokenizer to parse \N{NAME}, but to compile it into an intermediate value that won't ever be considered a metacharacter. We have chosen to compile NAME to its equivalent code point value, and express it in the already existing \N{U+...} form. This indicates to the regex compiler that the original input was a named character and retains the value it had at that point in the parse.

    This means that \N{U+...} now always must imply Unicode semantics for the string or pattern it appeared in. Previously there was an inconsistency, where effectively \N{NAME} implied Unicode semantics, but \N{U+...} did not necessarily. So now, any string or pattern that has either of these forms is utf8 upgraded.

    A complication is that a charnames handler can return a sequence of multiple characters instead of just one. To deal with this case, the tokenizer will generate a constant of the form \N{U+c1.c2.c3...}, where c1 etc. are the individual characters. Perhaps this will be made a public interface someday, but I decided to not expose it externally as far as possible for now in case we find reason to change it. It is possible to defeat this by passing it in a single-quoted string to the regex compiler, so the documentation will be changed to discourage that.

    A further complication is that \N can have an additional meaning: to match a non-newline. This means that the two meanings have to be disambiguated.

    embed.fnc was changed to make public the function regcurly() in regcomp.c so that it could be referred to in toke.c to see if the ... in \N{...} is a legal quantifier like {2,}. This is used in the disambiguation.

    toke.c was changed to update some outdated relevant comments. It now parses \N in patterns. If it determines that it isn't a named sequence, it passes it through unchanged. This happens when there is no brace after the \N, or no closing brace, or if the braces enclose a legal quantifier. Previously there has been essentially no restriction on what can come between the braces, so that a custom translator can accept virtually anything. Now, legal quantifiers are assumed to mean that the \N is a "match non-newline that quantity of times".

    I removed the #ifdef'd out code that had been left in in case pack U reverted to earlier behavior. I did this because it complicated things, and because the change to pack U has been in long enough and shown that it is correct, so it's not likely to be reverted.

    \N meaning a named character is handled differently depending on whether this is a pattern or not. In all cases, the output will be upgraded to utf8 because a named character implies Unicode semantics. If not a pattern, the \N is parsed into a utf8 string, as before. Otherwise it will be parsed into the intermediate \N{U+...} form. If the original was already a valid \N{U+...} constant, it is passed through unchanged.

    I now check that the sequence returned by the charnames handler is not malformed, which was lacking before.

    The code in regcomp.c which dealt with interfacing with the charnames handler has been removed. All the values should be determined by the time regcomp.c gets involved. The affected subroutine is necessarily restructured.

    An EXACT-type node is generated for the character sequence. Such a node has a capacity of 255 bytes, and so it is possible to overflow it. This wasn't checked for before, but now it is: a warning is issued and the overflowing characters are discarded.
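    The disambiguation rule described in the message above (no brace, no closing brace, or a legal quantifier means "match non-newline") can be sketched in C roughly as follows. This is only an illustration under stated assumptions: looks_like_quantifier() and classify_backslash_N() are hypothetical names, not the real regcurly() or the toke.c code, and the quantifier shapes accepted here ({n}, {n,}, {n,m}) are an assumption based on the {2,} example in the message.

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical stand-in for a regcurly()-style check (not the real function):
 * returns true when s points at a brace group shaped like {n}, {n,} or {n,m}. */
static bool looks_like_quantifier(const char *s)
{
    if (*s++ != '{')
        return false;
    if (!isdigit((unsigned char)*s))      /* at least one leading digit */
        return false;
    while (isdigit((unsigned char)*s))
        s++;
    if (*s == ',')                        /* optional comma */
        s++;
    while (isdigit((unsigned char)*s))    /* optional upper bound */
        s++;
    return *s == '}';
}

/* Hypothetical classification of the text that follows a \N in a pattern. */
enum n_kind { N_NON_NEWLINE, N_NAMED };

static enum n_kind classify_backslash_N(const char *after_N)
{
    const char *p;

    if (*after_N != '{')                  /* no brace after the \N */
        return N_NON_NEWLINE;
    for (p = after_N + 1; *p && *p != '}'; p++)
        ;                                 /* scan for the closing brace */
    if (*p != '}')                        /* no closing brace */
        return N_NON_NEWLINE;
    if (looks_like_quantifier(after_N))   /* e.g. \N{3,4} */
        return N_NON_NEWLINE;
    return N_NAMED;                       /* treat as a named character */
}
```

    Under this sketch, the text after \N in \N{3,4} and \N{2,} is classified as a non-newline match followed by a quantifier, while \N{LATIN SMALL LETTER A} and \N{U+41} are classified as named characters.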
* Removes 32-bit limit on substr arguments. The full range of IV and UV is available for the pos and len arguments, with safe conversion to STRLEN where it's smaller than an IV. | Eric Brine | 2010-02-14 | 1 | -1/+41
* fix qr// and get-magic problems | Father Chrysostomos | 2010-01-19 | 2 | -2/+28
    [N.B. I converted package name separators from q{'} to q{::} in the test files as suggested by demerphq. -- dagolden]
* Revert "[perl #62646] Maximum string length with substr" | Rafael Garcia-Suarez | 2010-01-18 | 1 | -16/+1
    This reverts commit b6d1426f94a845fb8fece8b6ad0b7d9f35f2d62e.
    Conflicts: pp.c
* [perl #62646] Maximum string length with substr | Zefram | 2010-01-15 | 1 | -1/+16
    (This is only a partial fix, since it doesn't handle lvalue substr)
* Tie::Hash::NamedCapture::* shouldn't abort if passed bad input [RT #71828] | Nicholas Clark | 2010-01-05 | 1 | -1/+16
* Allow U+0FFFF in regex | Karl Williamson | 2009-12-20 | 1 | -1/+12
* more regex folding tests | Karl Williamson | 2009-12-15 | 1 | -15/+45
* [perl #70764] $' fails to be initialized for pre-compiled regular expression matches | Father Chrysostomos | 2009-12-14 | 1 | -1/+61
    The match vars are associated with the regexp that last matched successfully. In the case of $str =~ $qr or /$qr/, since the $qr could be used in multiple scopes that need their own sets of match vars, the $qr is cloned by Perl_reg_temp_copy as of change 30677/28d8d7f. This happens in pp_regcomp before pp_match has stringified the LHS, hence the bug. In short, /$gror/ is not equivalent to ($which = !$which) ? /$gror/ : /$gror/, which is weird.

    Attached is a patch, which admittedly is a hack, but fixes this particular side effect of what is probably a bad design, by stringifying the LHS in pp_regcomp, and having pp_match skip get-magic in such cases. A real fix far exceeds my capabilities, and would also be very intrusive according to <http://www.nntp.perl.org/group/perl.perl5.porters/2007/03/msg122415.html>.
* Ensure that pp_qr returns a new regexp SV each time. Resolves RT #69852. | Nicholas Clark | 2009-12-02 | 1 | -4/+0
    Instead of returning a(nother) reference to the (pre-compiled) regexp in the optree, use reg_temp_copy() to create a copy of it, and return a reference to that. This resolves issues about Regexp::DESTROY not being called in a timely fashion (the original bug tracked by RT #69852), as well as bugs related to blessing regexps, and of assigning to regexps, as described in correspondence added to the ticket.

    It transpires that we also need to undo the SvPVX() sharing when ithreads clones a Regexp SV, because mother_re is set to NULL, instead of a cloned copy of the mother_re. This change might fix bugs with regexps and threads in certain other situations, but as yet neither tests nor bug reports have indicated any problems, so it might not actually be an edge case that it's possible to reach.
* Document backreferences to groups that did not match | Moritz Lenz | 2009-11-28 | 1 | -1/+4
    Also add a test for that, fill in test description, and sneak in a vim modeline for re_tests
* wrap uniprops.t; makefile changes for mktables | Karl Williamson | 2009-11-25 | 1 | -71657/+4
    Message-ID: <4B0C4744.7080401@khwilliamson.com>
* mktables not run unless needed | Karl Williamson | 2009-11-24 | 1 | -1/+1
* Fix plan syntax in TAP output | Rafael Garcia-Suarez | 2009-11-22 | 1 | -1/+1
* mktables revamp | Karl Williamson | 2009-11-21 | 3 | -5/+71667
* [PATCH] Todo test for [perl #38133] (was: [regex] backref problem with quantified groups) | Bram via RT | 2009-10-31 | 1 | -1/+13
    This patch was modified to work with the updated file locations.
* Let SvRX(OK) recognise a bare REGEXP. | Ben Morrow | 2009-10-22 | 1 | -5/+15
    This means that re::is_regexp(${qr/x/}) will now return true.
* revert to 5.8.x semantics for \s \w and \d | Yves Orton | 2009-10-19 | 3 | -3/+3
    revert ba9ac1759cb6e7a5e6883c85edd0b450061b5ccb
    Changing the semantics of \w \s and \d breaks too much and Jesse wants to do a rollout. This disables the new semantics until we can get all the details worked out.
* add tests to make sure the \s and [\s] match the same thing | Yves Orton | 2009-10-05 | 2 | -0/+63
    Note: we currently fail these tests. This will be rectified.
* dropped a test by accident the last go, so resurrect the pat_re_eval.t anyway, and re-sort and update the MANIFEST | Yves Orton | 2009-09-19 | 5 | -199/+409
* split t/re/pat.t into new pieces | Yves Orton | 2009-09-19 | 10 | -22131/+236
* copy pat.t into five new files | Yves Orton | 2009-09-19 | 5 | -0/+21955
    In the next commit I will gut them all. This step allows git to trivially identify that all of these files were originally copies of pat.t, and thus the blame log should show the full history.
* Avoid using lib.pm in miniperl's tests. | Nicholas Clark | 2009-09-18 | 1 | -1/+3
* split: Remove implicit split to @_ | Bo Borgerson | 2009-09-13 | 1 | -2/+2
    Remove the long-deprecated feature where split in scalar context writes to @_.
* Update some remaining comments that still point to the old regexp tests location | Vincent Pit | 2009-09-10 | 3 | -4/+4
* Fix paths in threaded regexp tests | Vincent Pit | 2009-09-10 | 4 | -4/+4
* missed a comment reference to t/op that should now be t/re | Yves Orton | 2009-09-10 | 1 | -1/+1
* move regex related tests out of t/op/ into t/re/ | Yves Orton | 2009-09-10 | 33 | -0/+8680