path: root/t/re
Commit message | Author | Age | Files | Lines
* Fix a bunch of repeated-word typos | Dagfinn Ilmari Mannsåker | 2020-05-22 | 2 | -5/+5
  Mostly in comments and docs, but some in diagnostic messages and one case of 'or die die'.
* regcomp.c: Fix named sequences in (?[...]) | Karl Williamson | 2020-04-29 | 1 | -0/+15
  The regex_sets feature cannot yet handle named sequences possibly returned by \p{name=...}. I forgot to check for this possibility which led to a null pointer dereference. Also, the called function was returning success when it should have failed in this circumstance.

  This fixes #17732
* regcomp.c: Avoid use after free | Karl Williamson | 2020-04-29 | 1 | -0/+9
  It turns out that the SV returned by re_intuit_string() may be freed by future calls to re_intuit_start(). Thus, the caller doesn't get clear title to the returned SV. (This wasn't documented until the commit immediately prior to this one.)

  Cope with this situation by making a mortalized copy. This commit also changes to use the copy's PV directly, simplifying some 'if' statements.

  re_intuit_string() is effectively in the API, as it is an element in the regex engine structure, callable by anyone. It should not be returning a tenuous SV. That returned scalar should not be freed before the pattern it is for is freed. It is too late in the development cycle to change this, so this workaround is presented instead for 5.32.

  This fixes #17734.
* regcomp: avoid overflow setting last_start_max | Hugo van der Sanden | 2020-04-23 | 1 | -1/+8
  The dubious '((*ACCEPT)0)*' construct resulted, on the one hand, in is_inf being false but, on the other, in pos_delta being set to OPTIMIZE_INFTY.
* gh16947: avoid mutating regexp program only within GOSUB | Hugo van der Sanden | 2020-04-23 | 1 | -3/+8
  Commits 3bc2a7809d and bdb91f3f96 used the existence of a frame to decide when it was unsafe to mutate the regexp program, due to having recursed for a GOSUB. However, the frame recursion mechanism is also used for SUSPEND.

  Refine it further to avoid mutation only when within a GOSUB by saving a new boolean in the frame structure, and using that to derive a "mutate_ok" flag.
* study_chunk: do not rewrite for trie while enframed | Hugo van der Sanden | 2020-04-11 | 1 | -1/+8
  gh16947: the outer frame may be in the middle of looking at the part of the program we would rewrite. Let the outer frame deal with it.
* regcomp.c: Die on relative group number overflow | Karl Williamson | 2020-04-05 | 1 | -0/+10
  What this code was doing in the presence of an illegally large (or small) relative capturing group number was to set it to about the furthest away from zero it could get, and silently carry on, where it likely overflowed a few lines down. Instead die immediately with a proper message.
* regcomp.c: Avoid a segfault | Karl Williamson | 2020-04-05 | 1 | -1/+5
  This was resulting in C undefined behavior reported by asan. Check that it won't overflow before doing the operation; then die instead of going ahead anyway.

  This fixes #17593
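The check-before-operate idea in the commit above can be sketched as follows. This is only an illustration in Python, which emulates a signed 32-bit quantity; it assumes the operation is an addition, which the commit message does not state, and `I32_MAX`, `I32_MIN` and `checked_add` are hypothetical names, not identifiers from regcomp.c.

```python
# Hypothetical bounds of a signed 32-bit integer (an assumption for this sketch).
I32_MAX = 2**31 - 1
I32_MIN = -2**31


def checked_add(a: int, b: int) -> int:
    """Add two values that must stay within a signed 32-bit range.

    The bounds are tested *before* the addition is performed, which is what
    C code has to do, because signed overflow itself is undefined behavior.
    """
    if b > 0 and a > I32_MAX - b:
        raise OverflowError("addition would exceed I32_MAX")
    if b < 0 and a < I32_MIN - b:
        raise OverflowError("addition would go below I32_MIN")
    return a + b
```

Raising an exception here plays the role of dying with a proper message instead of carrying on with a wrapped value.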
* regcomp.c: Rmv unnecessary code | Karl Williamson | 2020-04-05 | 1 | -0/+7
  As spotted by Hugo van der Sanden, a backwards reference to a capturing group doesn't need to look forward instead. If we don't have a capturing group we already should have, looking forward won't magically produce it.
* re/pat_advanced.t: Rmv extraneous line | Karl Williamson | 2020-04-04 | 1 | -1/+0
  Spotted by Dagfinn Ilmari Mannsåker
* regcomp.c: Handle /ss\xdf/iaa properly | Karl Williamson | 2020-04-03 | 1 | -0/+6
  Having both ss and \xdf in a string caused the node type to be changed back to a wrong one.

  This fixes #17486
* Restrict features in wildcards | Karl Williamson | 2020-02-19 | 1 | -0/+13
  The algorithm for dealing with Unicode property wildcards is to wrap the user-supplied pattern with /miaa. We don't want the user to be able to override the /m and /aa parts.

  Modifiers that are only specifiable as a modifier in a qr or similar op (like /gc) can't be included in things like (?gc). These normally incur a warning that they are ignored, but the texts of those warnings are misleading when using wildcards, so I chose to just make them illegal. Of course that could be changed to having custom useful warning texts, but I didn't think it was worth it.

  I also chose to forbid recursion of using nested \p{}, just from fear that it might lead to issues down the road, and it really isn't useful for this limited universe of strings to match against. Because wildcards currently can't handle '}' inside them, only the single letter \p,\P are valid anyway.

  Similarly, I forbid the '*' quantifier to make it harder for the constructed subpattern to take forever to make any progress and decide to halt. Again, using it would be overkill on the universe of possible match strings.
* t/re/reg_mesg.t: Emit diagnostics on some more failing tests | Karl Williamson | 2020-02-19 | 1 | -2/+4
* Improve handling of nested qr/(?[...])/ | Karl Williamson | 2020-02-19 | 2 | -1/+7
  A set operations expression can contain a previously-compiled one interpolated in. Prior to this commit, some heuristics were employed to verify it actually was such a thing, and not a sort of look-alike that wasn't necessarily valid. The heuristics actually forbade legal ones. I don't know of any illegal ones that were let through, but it is certainly possible. Also, the error/warning messages referred to the heuristics, and were unhelpful at best.

  The technique used instead in this commit is to return a regop only used by this feature for any nested compilations. This guarantees that the caller can determine if the result is valid, and what that result is without having to do any heuristics or inspecting any flags. The error/warning messages are changed to reflect this, and I believe are now helpful.

  This fixes the bugs in #16779
  https://github.com/Perl/perl5/issues/16779#issuecomment-563987618
* Add qr/\p{Name=...}/ | Karl Williamson | 2020-02-12 | 1 | -0/+3
  This accomplishes the same thing as \N{...}, but only for regex patterns, using loose matching and only the official Unicode names. This commit includes a comparison of the two approaches, added to perlunicode. But the real reason to do this is as a way station to being able to specify wild card lookup on the name property, coming in a later commit.

  I chose to not include user-defined aliases nor :short character names at this time. I thought that there might be unforeseen consequences of using them. It's better to later relax a requirement than to try to restrict it.
* regcomp.c: make \K+ and \K* illegal. | Yves Orton | 2020-02-07 | 1 | -0/+3
* regexec: don't increment recursion counter for non-postponed EVAL | Hugo van der Sanden | 2020-01-27 | 1 | -1/+10
  It wasn't intended to be part of the recursion logic, and doesn't get decremented again (GH 17490).
* (toke|regcomp).c: Use common fcn to handle \0 problems | Karl Williamson | 2020-01-23 | 1 | -4/+5
  This changes warning messages for too short \0 octal constants to use the function introduced in the previous commit. This function assures a consistent and clear warning message, which is slightly different than the one this commit replaces. I know of no CPAN code which depends on this warning's wording.
* regcomp.c: Code points above 255 are portable | Karl Williamson | 2020-01-23 | 1 | -1/+0
  These tests are to generate warnings that the affected code is not portable between ASCII and EBCDIC systems. But, it was being too picky. Code points above 255 are the same on both systems, so the warning shouldn't be generated for those.
* Revise \o{ missing '}' error message | Karl Williamson | 2020-01-23 | 1 | -7/+7
  All the other messages raised when a construct is expecting a terminating '}' but none is found include the '}' in the message. '\o{' did not. Since these diagnostics are getting revised anyway, and I didn't find any CPAN modules relying on the wording, this commit makes the messages consistent by adding the '}' to the \o message.
* Restructure grok_bslash_[ox] | Karl Williamson | 2020-01-23 | 2 | -15/+8
  This commit causes these functions to allow a caller to request any messages generated to be returned to the caller, instead of always being handled within these functions. The messages are somewhat changed from previously to be clearer. I did not find any code in CPAN that relied on the previous message text.

  Like the previous commit for grok_bslash_c, here are two reasons to do this, repeated here.

  1) In pattern compilation this brings these messages into conformity with the other ones that get generated in pattern compilation, where there is a particular syntax, including marking the exact position in the parse where the problem occurred.

  2) These could generate truncated messages due to the (mostly) single-pass nature of pattern compilation that is now in effect. It keeps track of where during a parse a message has been output, and won't output it again if a second parsing pass turns out to be necessary. Prior to this commit, it had to assume that a message from one of these functions did get output, and this caused some out-of-bounds reads when a subparse (using a constructed pattern) was executed. The possibility of those went away in commit 5d894ca5213, which guarantees it won't try to read outside bounds, but that may still mean it is outputting text from the wrong parse, giving meaningless results. This commit should stop that possibility.
* Restructure grok_bslash_c | Karl Williamson | 2020-01-23 | 1 | -2/+2
  This commit causes this function to allow a caller to request any messages generated to be returned to the caller, instead of always being handled within this function.

  Like the previous commit for grok_bslash_c, here are two reasons to do this, repeated here.

  1) In pattern compilation this brings these messages into conformity with the other ones that get generated in pattern compilation, where there is a particular syntax, including marking the exact position in the parse where the problem occurred.

  2) The messages could be truncated due to the (mostly) single-pass nature of pattern compilation that is now in effect. It keeps track of where during a parse a message has been output, and won't output it again if a second parsing pass turns out to be necessary. Prior to this commit, it had to assume that a message from one of these functions did get output, and this caused some out-of-bounds reads when a subparse (using a constructed pattern) was executed. The possibility of those went away in commit 5d894ca5213, which guarantees it won't try to read outside bounds, but that may still mean it is outputting text from the wrong parse, giving meaningless results. This commit should stop that possibility.
* Test that a leading '_' is ok in \x{}, \o{} | Karl Williamson | 2020-01-23 | 1 | -0/+2
  These are ok perhaps accidentally, but shouldn't accidentally become not ok
* re/pat.t: Fix failure on some systems | Karl Williamson | 2020-01-18 | 1 | -1/+1
  This appears to be a difference in how shells run. On many of H. Merijn Brand's boxes, this test is failing. By avoiding the shell by using fresh_perl instead of runperl, it succeeds there, without breaking elsewhere.
* Fix Issue #17372 - Deal with NOTHING regops in trie code properly | Yves Orton | 2020-01-09 | 1 | -1/+2
  We weren't handling NOTHING regops that were not followed by a trieable type in the trie code.
* PATCH: GH #17384 out of bounds read in qr// | Karl Williamson | 2019-12-27 | 1 | -1/+4
  This turned out to be because there are two versions of the property name being parsed: 1) the original input; and 2) a canonicalized one with characters squeezed out that are usually optional, such as spaces, dashes and, here, underscores.

  The code was conflating the two names, and moving along the squeezed name based on counts from the unsqueezed one, hence going too far in the buffer.
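One way to keep the two names from being conflated is to record, for each character kept in the squeezed name, where it came from in the original. The sketch below is a rough Python illustration of that idea, not the actual fix; `squeeze` is a hypothetical helper, and the set of squeezed characters (space, dash, underscore) is taken from the description above.

```python
def squeeze(name: str):
    """Return (squeezed, origin).

    `squeezed` is `name` with spaces, dashes and underscores removed.
    `origin[i]` is the index in `name` of the character `squeezed[i]`,
    so positions in one string can be translated to the other instead
    of being used interchangeably.
    """
    kept_chars = []
    origin = []
    for index, ch in enumerate(name):
        if ch in " -_":
            continue  # optional character: dropped from the canonical form
        kept_chars.append(ch)
        origin.append(index)
    return "".join(kept_chars), origin


original = "Is_Upper_case"
squeezed, origin = squeeze(original)

# The squeezed name is shorter, so a count taken from the original
# (for example its length) is not a valid position in the squeezed one.
assert len(squeezed) < len(original)
```

Here `origin[-1]` is the position in `original` of the last kept character, which is the kind of translation the buggy code skipped.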
* t/re/regexp.t: Speed up many regex tests on ASCII platform | Karl Williamson | 2019-12-19 | 1 | -15/+18
  This commit:

      commit 0cd59ee9ca0f0af3c0c172ecc27bb3f02da6db08
      Author:     Karl Williamson <khw@cpan.org>
      AuthorDate: Fri Sep 6 10:23:26 2019 -0600
      Commit:     Karl Williamson <khw@cpan.org>
      CommitDate: Mon Nov 11 21:05:13 2019 -0700

          t/re/regexp.t: Only convert to EBCDIC once

          Some tests get added as we go along, and those added tests have already been converted to EBCDIC if necessary. Don't reconvert, which messes things up.

  caused a huge slowdown in regex tests. The most noticeable on my platform was regexp_qr_embed_thr.t which doubled in wall clock time spent.

  It turns out that it was because a function was now always being called, and that does nothing on ASCII platforms besides return its argument, which then was copied over the argument.

  This new commit causes the function to be a constant { 1; } on ASCII platforms, so should be completely optimized out, returning the time spent in that .t to 5.30 levels.
* PATCH: GH #17371 | Karl Williamson | 2019-12-18 | 1 | -1/+4
  This was caused by a character being counted as both the first delimiter of a pattern, and the final one, which led to the pattern's length being negative, which was turned into a very large unsigned number.
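The arithmetic behind this failure mode can be illustrated as below. This is a hedged Python sketch, not the Perl source: it assumes a 64-bit unsigned length and a length computed as the distance between the two delimiter positions, and `pattern_length` and `as_unsigned` are hypothetical names.

```python
def pattern_length(open_index: int, close_index: int) -> int:
    """Number of characters strictly between two delimiter positions."""
    return close_index - open_index - 1


def as_unsigned(value: int, bits: int = 64) -> int:
    """Reinterpret a (possibly negative) integer as an unsigned one,
    the way a C conversion to an unsigned type of that width would."""
    return value & ((1 << bits) - 1)


# Normal case: delimiters at positions 0 and 4 enclose three characters.
normal = pattern_length(0, 4)

# Bug case: the same character is taken as both the opening and the
# closing delimiter, so the computed length is -1 ...
bogus = pattern_length(3, 3)

# ... which becomes an enormous value once treated as unsigned.
huge = as_unsigned(bogus)
```

With these assumptions `huge` is the largest 64-bit unsigned value, so any loop bounded by it would read far past the real pattern.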
* PATCH: GH #17370, read beyond buffer in grok_inf_nan | Karl Williamson | 2019-12-17 | 1 | -1/+4
  Like GH #17367, this was caused by a failure to check that we aren't at the end of the buffer after advancing the ptr to it.
* PATCH: GH #17367 read 1 beyond end of buffer | Karl Williamson | 2019-12-17 | 1 | -1/+5
  This is a bug in grok_infnan() in which in one place it failed to check that it was reading within bounds.
* t/re/pat.t: White-space only | Karl Williamson | 2019-12-14 | 1 | -7/+7
* t/re/pat.t: Fix skip count for limited mem platforms | Karl Williamson | 2019-12-14 | 1 | -6/+9
  And rearrange so it is easier to see the correct value.
* t/re/pat.t: Skip tests that don't work on EBCDIC | Karl Williamson | 2019-12-14 | 1 | -0/+3
  These are fuzzer generated, and don't translate well to EBCDIC
* Only allow punct delimiter for regex subpattern | Karl Williamson | 2019-12-11 | 1 | -0/+1
  The experimental feature that allows wildcard subpatterns in finding Unicode properties is supposed to only allow ASCII punctuation for delimiters. But if you preceded the delimiter by a backslash, the check was skipped. This commit fixes that.

  It may be that we will eventually want to loosen the restriction and allow a wider range of delimiters. But until we have valid use-cases that would push us in that direction, I don't want to get into supporting stuff that we might later regret, such as invisible characters for delimiters. This feature is not really required for programs to work, so I don't view it as necessary to be as general as possible.
* re/regexp_unicode_prop.t: Add comment | Karl Williamson | 2019-12-09 | 1 | -0/+1
* PATCH GH #17025 \p{user-defined} overrides official Unicode | Karl Williamson | 2019-12-09 | 1 | -8/+2
  Prior to this patch, they only sometimes overrode.
* skip code that requires dynamic loading and minitest works | Tony Cook | 2019-12-04 | 1 | -2/+4
* fix skip for loop over locales | Tony Cook | 2019-12-04 | 1 | -30/+32
  Any skip inside this loop was skipping over the entire loop, not just the current iteration. From the conditions tested in the loop, this seemed incorrect (besides messing up the test count).
* POSIX isn't loadable in miniperl | Tony Cook | 2019-12-04 | 1 | -1/+3
* PATCH: gh #17319 Segfault | Karl Williamson | 2019-11-22 | 1 | -1/+10
  It turns out that one isn't supposed to fill in the offset to the next regnode at node creation time. And this node is like EXACTish, so the string stuff isn't accounted for in its regcomp.sym definition
* Properly handle filled /il regnodes and multi-char folds | Karl Williamson | 2019-11-21 | 1 | -4/+7
  Previously we were ignoring this possibility.

  Suppose a pattern being compiled under /il contains 'SS', and that it so happens that a regnode becomes filled with the first 'S', so that the next regnode would begin with the second one. If at runtime, the locale is UTF-8, the pattern should match a LATIN SHARP S. Until this commit, it wouldn't.

  The commit just extends the current mechanism used in this situation (of a filled regnode) for non-/l patterns.

  If the locale isn't a UTF-8 one, the 'SS' sequence shouldn't match the SHARP S, and it won't, but we have to construct the node so that it can handle the UTF-8 case.
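The multi-character fold at issue, where 'SS' and LATIN SMALL LETTER SHARP S compare equal under Unicode full case folding, can be seen outside Perl as well. The snippet below uses Python's `str.casefold()` purely as an analogy; it does not model regnodes, node filling, or the /l locale rules, and `fold_equal` is a hypothetical helper.

```python
SHARP_S = "\u00DF"  # LATIN SMALL LETTER SHARP S


def fold_equal(a: str, b: str) -> bool:
    """Compare two strings under Unicode full case folding."""
    return a.casefold() == b.casefold()


# One character folds to two, which is why a matcher that processes the
# two 'S' characters in separate chunks still has to consider them together.
folded = SHARP_S.casefold()
same = fold_equal("SS", SHARP_S)
```

Because the fold changes the length of the string, a comparison done one character at a time cannot detect this match, which is the difficulty a regnode boundary between the two 'S' characters creates.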
* Add ANYOFHs regnode | Karl Williamson | 2019-11-20 | 1 | -1/+12
  This node is like ANYOFHb, but is used when more than one leading byte is the same in all the matched code points.

  ANYOFHb is used to avoid having to convert from UTF-8 to code point for something that won't match. It checks that the first byte in the UTF-8 encoded target is the desired one, thus ruling out most of the possible code points. But for higher code points that require longer UTF-8 sequences, many, many non-matching code points pass this filter. For code points above 0xFFFF, it is ineffective for almost 200K of them.

  This commit creates a new node type that addresses this problem. Instead of a single byte, it stores as many leading bytes as are the same for all code points that match the class. For many classes, that will cut down the number of possible false positives by a huge amount before having to convert to code point to make the final determination.

  This regnode adds a UTF-8 string at the end. It is still much smaller, even in the rare worst case, than a plain ANYOF node because the maximum string length, 15 bytes, is still shorter than the 32-byte bitmap that is present in a plain ANYOF. Most of the time the added string will instead be at most 4 bytes.
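The "as many leading bytes as are the same" idea can be sketched as a longest-common-prefix computation over the UTF-8 encodings of the matched code points. This is a minimal Python illustration using standard UTF-8, not the regcomp.c implementation; `common_leading_bytes` is a hypothetical name.

```python
def common_leading_bytes(code_points) -> bytes:
    """Longest byte prefix shared by the UTF-8 encodings of all code points.

    The common prefix of a set of byte strings equals the common prefix of
    its lexicographically smallest and largest members, so only those two
    need to be compared.
    """
    encodings = [chr(cp).encode("utf-8") for cp in code_points]
    lowest, highest = min(encodings), max(encodings)
    length = 0
    while (length < len(lowest) and length < len(highest)
           and lowest[length] == highest[length]):
        length += 1
    return lowest[:length]


# U+1F600..U+1F64F all encode as F0 9F xx xx, so two leading bytes are
# shared; a filter on the first byte alone (F0) would let far more through.
emoticons_prefix = common_leading_bytes(range(0x1F600, 0x1F650))

# A narrower class shares three leading bytes.
narrow_prefix = common_leading_bytes(range(0x1F600, 0x1F640))
```

A target string whose next bytes do not start with the stored prefix can be rejected without decoding it to a code point.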
* Add ANYOFRb regnode | Karl Williamson | 2019-11-17 | 1 | -35/+35
  This is like the ANYOFR regnode added in the previous commit, but all code points in the range it matches are known to have the same first UTF-8 start byte. That means it can't match UTF-8 invariant characters, like ASCII, because the "start" byte is different on each one, so it could only match a range of 1, and the compiler wouldn't generate this node for that; instead using an EXACT.

  Pattern matching can rule out most code points by looking at the first byte of their UTF-8 representation, before having to convert from UTF-8.

  On ASCII this rules out all but 64 2-byte UTF-8 characters from this simple comparison. 3-byte it's up to 4096, and 4-byte, 2**18, so the test is less effective for higher code points.

  I believe that most UTF-8 patterns that otherwise would compile to ANYOFR will instead compile to this, as I can't envision real life applications wanting to match large single ranges. Even the 2048 surrogates all have the same first byte.
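The figures quoted above (64, 4096 and 2**18 code points per start byte, and a single start byte for all 2048 surrogates) can be checked by counting. The following Python sketch uses standard UTF-8 as on ASCII platforms; `code_points_per_start_byte` is a hypothetical helper, and the `surrogatepass` error handler is used only so that surrogates can be encoded for the count.

```python
from collections import Counter


def code_points_per_start_byte(first: int, last: int) -> Counter:
    """Count how many code points in [first, last] share each UTF-8 start byte."""
    counts = Counter()
    for cp in range(first, last + 1):
        start_byte = chr(cp).encode("utf-8", "surrogatepass")[0]
        counts[start_byte] += 1
    return counts


two_byte = code_points_per_start_byte(0x80, 0x7FF)        # all 2-byte sequences
three_byte = code_points_per_start_byte(0x1000, 0x1FFF)   # start byte E1
four_byte = code_points_per_start_byte(0x40000, 0x7FFFF)  # start byte F1
surrogates = code_points_per_start_byte(0xD800, 0xDFFF)   # start byte ED
```

Each 2-byte start byte covers 64 code points, while a single 4-byte start byte covers 2**18, which is why a first-byte filter rules out proportionally fewer candidates as code points get higher.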
* Add ANYOFR regnode | Karl Williamson | 2019-11-17 | 1 | -37/+37
  This matches a single range of code points. It is both faster and smaller than other ANYOF-type nodes, requiring, after set-up, a single subtraction and conditional branch.

  The vast majority of Unicode properties match a single range (though most of the properties likely to be used in real world applications have more than a single range). But things like [ij] are a single range, and those are quite commonly encountered. This new regnode matches them more efficiently than a bitmap would, and doesn't require the space for one either.

  The flags field is used to store the minimum matchable start byte for UTF-8 strings, and is ignored for non-UTF-8 targets. This, like ANYOFH nodes which have a similar mechanism, allows for quick weeding out of many possible matches without having to convert the UTF-8 to its corresponding code point.

  This regnode packs the 32 bit argument with 20 bits for the minimum code point the node matches, and 12 bits for the maximum range. If the input is a value outside these, it simply won't compile to this regnode, instead going to one of the ANYOFH flavors.

  ANYOFR is sufficient to match all of Unicode except for the final (private use) 65K plane.
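The packing and the "single subtraction and conditional branch" test can be sketched as below. This is an illustrative Python model only: the 20/12 split comes from the commit message, but which bits hold which field is an assumption of this sketch, and `pack_range` and `matches` are hypothetical names rather than Perl internals.

```python
BASE_BITS = 20   # bits for the minimum code point (from the commit message)
DELTA_BITS = 12  # bits for the size of the range above the minimum

# Assumed layout for this sketch: delta in the high 12 bits, base in the low 20.


def pack_range(lowest: int, highest: int):
    """Pack an inclusive code point range into one 32-bit argument.

    Returns None when the range does not fit, standing in for the compiler
    falling back to another node type.
    """
    delta = highest - lowest
    if lowest < 0 or delta < 0:
        return None
    if lowest >= (1 << BASE_BITS) or delta >= (1 << DELTA_BITS):
        return None
    return (delta << BASE_BITS) | lowest


def matches(arg: int, code_point: int) -> bool:
    """Range test using one subtraction and one comparison.

    The mask emulates unsigned 32-bit wraparound, so code points below the
    base wrap to a huge value and fail the same comparison as those above.
    """
    base = arg & ((1 << BASE_BITS) - 1)
    delta = arg >> BASE_BITS
    return ((code_point - base) & 0xFFFFFFFF) <= delta


ij = pack_range(ord("i"), ord("j"))  # the [ij] example from the text
```

With a 20-bit base the largest representable minimum is U+FFFFF, which is consistent with the statement that only the final plane (U+100000 and up) is out of reach.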
* t/re/anyof.t: Add a test | Karl Williamson | 2019-11-17 | 1 | -0/+1
  This makes sure a non-folding above-Latin1 character is tested.
* Prefer EXACTish regnodes to ANYOFH nodes | Karl Williamson | 2019-11-17 | 2 | -25/+38
  ANYOFH nodes (that match code points above 255) are smaller than regular ANYOF nodes because they don't have a 256-bit bitmap. But the disadvantage of them over EXACT nodes is that the characters encountered must first be converted from UTF-8 to code point to see if they match the ANYOFH. (The difference is less clearcut with /i, because typically, currently, the UTF-8 must be converted to code point anyway in order to fold them.) But the EXACTFish node doesn't have an inversion list to do lookup in, and occupies less space, because it doesn't have inversion list data attached to it.

  Also there is a bug in using ANYOFH under /l, as wide character warnings should be emitted if the locale isn't a UTF-8 one.

  The reason this change hasn't been made before (by me anyway) is that the old way avoided upgrading the pattern to UTF-8. But having thought about this for a long time, to match this node, the target string must be in UTF-8 anyway, and having a UTF8ness mismatch slows down pattern matching, as things have to be continually converted, and reconverted after backtracking.
* t/re/anyof.t: Fix highest range tests | Karl Williamson | 2019-11-17 | 1 | -50/+52
  Previously we had infinity minus 1, but infinity should be beyond the range, and the highest isn't infinity - 1, but the highest legal code point.
* t/re/anyof.t: Remove duplicate test | Karl Williamson | 2019-11-17 | 1 | -1/+0
  These are covered by the single code point tests.
* t/re/anyof.t: Remove invalid test | Karl Williamson | 2019-11-17 | 1 | -1/+0
  One shouldn't be able to specify an infinite code point. The tests have the conceit that one can specify a range's upper limit as infinity, but that is just shorthand for the range being unbounded.
* re/anyof.t: Clarify failing message | Karl Williamson | 2019-11-17 | 1 | -1/+1
  When a test fails, an extra test is run to output debugging info; this will cause the planned number of tests to be wrong, which will output an extra, confusing message. This adds an explanation that the number is expected to be wrong, hence not to worry.