path: root/regcomp.c
Commit message | Author | Age | Files | Lines
* implicitly anchor .{0,} like .* [perl #125810] | Lukas Mai | 2015-08-24 | 1 | -13/+15
* Output bad locale warning in regex synthetic start class | Karl Williamson | 2015-08-24 | 1 | -0/+4
    perl detects some locale errors when a new locale is entered. It stores these up to output upon first use of something that uses that locale. A synthetic start class (SSC) is used by the regex optimizer under certain circumstances. Prior to this patch, it was possible for the stored-up bad locale message to not be raised if the match failed the SSC. This patch fixes this by changing the node type of the SSC to be one that checks for the stored-up message should there be locale-dependent portions of the pattern.
* PATCH: [perl #125825] {n}+ possessive quantifier broken | Karl Williamson | 2015-08-24 | 1 | -3/+1
    I was unaware of this construct when I wrote the commit that broke it, and there were no tests for it. Now there are.
* Make qr/(?[ ])/ work in UTF-8 locales | Karl Williamson | 2015-08-24 | 1 | -4/+43
    Previously, use of this under /l regex rules was a compile-time error. Now it works like \b{wb} and \b{sb}, which compile under locale rules and always work like Unicode says they should. A UTF-8 locale implies Unicode rules, and the goal is for it to work seamlessly with the rest of perl. This construct was the only one I am aware of that didn't work seamlessly (not counting OS interfaces) under UTF-8 LC_CTYPE locales.

    For all three of these constructs, use with a non-UTF-8 runtime locale raises a warning, and Unicode rules are used anyway.

    UTF-8 locale collation still has problems, but this is low priority to fix, as it's a lot of work, and if one really cares, one should be using Unicode::Collate.
* regcomp.c: Add a parameter to static function | Karl Williamson | 2015-08-24 | 1 | -3/+13
    This will be used by the next commit.
* regcomp.h: Fold 2 ANYOF flags into a single one | Karl Williamson | 2015-08-24 | 1 | -8/+29
    The ANYOF_FLAGS bits are all used up, but a future commit wants one. This commit frees up a bit by sharing two of the existing comparatively-rarely-used ones. One bit is used only under /d matching rules, while the other is used only when not under /d. Only the latter bit is used in synthetic start classes.

    The previous commit introduced an ANYOFD node type corresponding to /d. An SSC never is this type. Thus, the bits have mutually exclusive meanings, and we can use the node type to distinguish between the two meanings of the combined bit.

    An alternative implementation would have been to use the ANYOF_HAS_NONBITMAP_NON_UTF8_MATCHES non-/d bit instead of the one chosen. But this is used more frequently, so the disambiguation would have been exercised more frequently, slowing execution down ever so slightly; more importantly, this one required fewer code changes, by a slight amount.
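The bit-sharing idea described above can be illustrated with a minimal sketch. The node types, flag name, and struct layout here are hypothetical stand-ins, not the definitions in regcomp.h; the only point is that one physical bit can carry two meanings when the node type says which meaning applies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical node types: one for /d rules, one for everything else. */
enum node_type { NODE_ANYOF, NODE_ANYOFD };

/* One physical bit that carries two mutually exclusive meanings. */
#define FLAG_SHARED_BIT 0x40u

struct class_node {
    enum node_type type;
    uint8_t flags;
};

/* Meaning A: only meaningful on /d nodes. */
static bool has_d_only_property(const struct class_node *n)
{
    return n->type == NODE_ANYOFD && (n->flags & FLAG_SHARED_BIT) != 0;
}

/* Meaning B: only meaningful on non-/d nodes. */
static bool has_non_d_property(const struct class_node *n)
{
    return n->type != NODE_ANYOFD && (n->flags & FLAG_SHARED_BIT) != 0;
}
```

The cost of the scheme is the extra type comparison on every query, which is why the commit message prefers to share the less frequently tested bit.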
* Add ANYOFD regex node | Karl Williamson | 2015-08-24 | 1 | -1/+4
    This is like an ANYOF node, but just for when /d is in effect. It will be used in future commits.
* PATCH: [perl #125805] Perl segfaults with a regex_sets error message | Karl Williamson | 2015-08-24 | 1 | -0/+1
    This fix required an extra test of the return value of a function.
* [perl #125826] make the buffer large enough in TRIE_STORE_REVCHAR | Tony Cook | 2015-08-19 | 1 | -1/+1
    Since the SV is discarded almost immediately (in non-DEBUGGING builds), don't worry about making it the smallest possible size.
* Safefree(NULL) reduction | Daniel Dragan | 2015-08-03 | 1 | -1/+2
    locale.c:
    - the pointers are always null at this point, see http://www.nntp.perl.org/group/perl.perl5.porters/2015/07/msg229533.html

    pp.c:
    - reduce scope of temp_buffer and svrecode, into an inner branch
    - in some permutations, either temp_buffer or svrecode is never set to non-null; in permutations where it is known that the variable hasn't been set yet, skip the freeing calls at the end. This doesn't eliminate all permutations with NULL being passed to Safefree and SvREFCNT_dec, but only some of them.

    regcomp.c:
    - don't create a save stack entry to call Safefree(NULL), see ticket for this patch for some profiling stats
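As a rough illustration of the regcomp.c item, the following sketch skips registering a deferred free when the pointer is NULL. The `cleanup_stack` type and both functions are hypothetical and stand in for perl's save stack, which works differently in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CLEANUP_MAX 16

/* Hypothetical stand-in for a stack of deferred cleanup actions. */
struct cleanup_stack {
    void *ptrs[CLEANUP_MAX];
    size_t depth;
};

/* Register ptr to be freed later. Returns 1 if an entry was pushed. */
static int defer_free(struct cleanup_stack *s, void *ptr)
{
    if (ptr == NULL)
        return 0;               /* freeing NULL is a no-op, so skip the entry */
    assert(s->depth < CLEANUP_MAX);
    s->ptrs[s->depth++] = ptr;
    return 1;
}

/* Pop and free every registered pointer, most recent first. */
static void run_cleanups(struct cleanup_stack *s)
{
    while (s->depth > 0)
        free(s->ptrs[--s->depth]);
}
```

The saving is the push and the later pop, not the free itself, since `free(NULL)` is already harmless.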
* f | Karl Williamson | 2015-07-28 | 1 | -1/+1
* Handle Unicode 3.0.1 /i Turkish "i" rules | Karl Williamson | 2015-07-28 | 1 | -0/+15
    Actually, there are no special rules for this Unicode release. All 4 "i" characters (upper and lowercase, dotted and dotless) are considered equivalent under /i only in this release. This adds special cases that are only compiled in for that release.
* There are no folds to multiple chars in early Unicode versions | Karl Williamson | 2015-07-28 | 1 | -8/+35
    Several places require special handling because of this, notably for the lowercase Sharp S, but not in Unicode versions before 3.0.1.
* regcomp.c: Rmv useless 'continue' | Karl Williamson | 2015-07-28 | 1 | -1/+0
    This is the final statement of the short loop. It does nothing.
* Allow Perl to compile and work on Unicode releases without U+1E9E | Karl Williamson | 2015-07-28 | 1 | -0/+6
    U+1E9E LATIN CAPITAL LETTER SHARP S is handled specially by Perl because of its relationship to the infamous LATIN SMALL LETTER SHARP S, which folds to 'ss'. That makes it the only character whose code point is < 256 to have a multi-char fold, and this creates lots of special cases. But U+1E9E wasn't in all Unicode releases. Because Perl is supposed to work with any release, we need to be able to compile when this character isn't defined. In some of those cases we use U+017F (LATIN SMALL LETTER LONG S) instead, which is in all releases.
* regcomp.c: Add comment | Karl Williamson | 2015-07-27 | 1 | -1/+7
* dquote_static.c -> dquote.c | Jarkko Hietaniemi | 2015-07-22 | 1 | -1/+0
    Instead of #include-ing the C file, compile it normally.
* static inlines from dquote_static.c -> new dquote_inline.h | Jarkko Hietaniemi | 2015-07-22 | 1 | -0/+1
* inline_invlist.c -> invlist_inline.h | Jarkko Hietaniemi | 2015-07-22 | 1 | -2/+2
* Cannot do much if putc fails in debug output. | Jarkko Hietaniemi | 2015-06-26 | 1 | -1/+1
    Coverity CID 104782 (only flagged the deb.c spot)
* remove deprecated /\C/ RE character class | David Mitchell | 2015-06-19 | 1 | -19/+2
    This horrible thing broke encapsulation and was as buggy as a very buggy thing. It's been officially deprecated since 5.20.0 and now it can finally die die die!!!!
* avoid uninit read in re_op_compile() | David Mitchell | 2015-04-28 | 1 | -3/+3
    Some code in this function examines the first two nodes in the regex to set suitable flags etc. Part of the code accesses the second node by using regnext(first), other parts by NEXTOPER(first). The second method only works when the node is the same size as a basic node. I *think* that the code only makes use of this second value in situations where the node *is* basic, but nevertheless, it makes valgrind unhappy when the first node is an EXACT node, and reading the second node's supposed type field is actually reading the padding bytes at the end of the EXACT string, which are uninitialised.

    So just use regnext() only.

    Something as simple as /x/ on non-debugging builds was enough to make valgrind complain. (On debugging builds, the program buffer is initially zero-filled.)
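The difference between the two ways of reaching the second node can be shown with a toy program layout. Everything below (the `node` struct, the opcode values, and both helper functions) is a hypothetical simplification and not the real regnode layout or the real regnext()/NEXTOPER() definitions; it only shows why stepping over a fixed-size header lands inside the payload of a variable-size node.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes for a toy regex program. */
enum { OP_END = 0, OP_EXACT = 1, OP_PAYLOAD = 0xFF };

/* One slot of the program. A node with a string payload occupies its
 * header slot plus `payload_slots` further slots. */
struct node {
    uint8_t  type;
    uint8_t  payload_slots;
    uint16_t next_off;      /* distance in slots to the next real node */
};

/* Follow the stored link: correct for every node type. */
static size_t next_via_link(const struct node *prog, size_t i)
{
    return i + prog[i].next_off;
}

/* Step over the header only: correct only when the node has no payload. */
static size_t next_via_header_size(const struct node *prog, size_t i)
{
    (void)prog;
    return i + 1;
}
```

For a program whose first node carries a one-slot payload, `next_via_link` reaches the following node, while `next_via_header_size` returns the index of the payload slot, whose "type" field is just string bytes or padding.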
* Fix regression in 5.21: /[A-Z]/ai | Karl Williamson | 2015-04-09 | 1 | -3/+2
    /[A-Z]/ai should match KELVIN SIGN, as it folds to a 'k'. It should not match under /aai, as that restricts fold matching. But I tested for the wrong symbol, which ended up forbidding both /ai and /aai. This commit changes to the correct symbol.

    I also reordered the 'if' while I was at it as a nano optimisation, to test for the /aa last, as that is the less common part of the '&&' test.
* Perl_save_re_context(): re-indent after last commit | David Mitchell | 2015-03-30 | 1 | -16/+12
    whitespace-only change.
* save_re_context(): do "local $n" with no PL_curpm | David Mitchell | 2015-03-30 | 1 | -3/+19
    RT #124109.

    2c1f00b9036 localised PL_curpm to NULL when calling swash init code (i.e. perl-level code that is loaded and executed when something like "lc $large_codepoint" is executed).

    b4fa55d3f1 followed this up by gutting Perl_save_re_context(), since that function did, basically,

        if (PL_curpm) {
            for (i = 1; i <= RX_NPARENS(PM_GETRE(PL_curpm)); i++) {
                do the C equivalent of the perl code "local ${i}";
            }
        }

    and now that PL_curpm was null, the code wasn't called any more.

    However, it turns out that the localisation *was* still needed, it's just that nothing in the test suite actually tested for it. In something like the following:

        $x = "\x{41c}";
        $x =~ /(.*)/;
        $s = lc $1;

    pp_lc() calls get magic on $1, which sets $1's PV value to a copy of the substring captured by the current pattern match. Then pp_lc() calls a function to convert the string to lower case, which triggers a swash load, which calls perl code that does a pattern match and, most importantly, uses the value of $1. This triggers get magic on $1, which overwrites $1's PV value with a new value. When control returns to pp_lc(), $1 now holds the wrong string value.

    Hence $1, $2 etc need localising as well as PL_curpm.

    The old way that Perl_save_re_context() used to work (localising $1..${RX_NPARENS}) won't work directly when PL_curpm is NULL (as in the swash case), since we don't know how many vars to localise. In this case, hard-code it as localising $1,$2,$3 and add a porting test file that checks that the utf8.pm code and dependencies don't use anything outside those 3 vars.
* Revert "Gut Perl_save_re_context" | David Mitchell | 2015-03-30 | 1 | -3/+21
    This reverts commit b4fa55d3f12c6d98b13a8b3db4f8d921c8e56edc.

    Turns out we need Perl_save_re_context() after all.
* Revert "Don’t call save_re_context" | David Mitchell | 2015-03-30 | 1 | -0/+1
    This reverts commit d28a9254e445aee7212523d9a7ff62ae0a743fec.

    Turns out we need save_re_context() after all.
* Revert "Mathomise save_re_context" | David Mitchell | 2015-03-30 | 1 | -0/+11
    This reverts commit 0ddd4a5b1910c8bfa9b7e55eb0db60a115fe368c.

    Turns out we need the save_re_context() function after all.
* Replace common Emacs file-local variables with dir-locals | Dagfinn Ilmari Mannsåker | 2015-03-22 | 1 | -6/+0
    An empty cpan/.dir-locals.el stops Emacs using the core defaults for code imported from CPAN.

    Committer's work:

    To keep t/porting/cmp_version.t and t/porting/utils.t happy, $VERSION needed to be incremented in many files, including throughout dist/PathTools.

    perldelta entry for module updates.

    Add two Emacs control files to MANIFEST; re-sort MANIFEST.

    For: RT #124119.
* regcomp.c: Fix so works on Unicode 5.2 | Karl Williamson | 2015-03-19 | 1 | -3/+12
    Unicode 5.2 had an anomalous situation, fixed in the next release, which runs afoul of an assert() in regcomp.c. This just modifies the assert so that it does not fail in this situation.
* Change /(?[...]) to have normal operator precedence | Karl Williamson | 2015-03-19 | 1 | -195/+407
    In this experimental feature, the intersection operator ("&") now has higher precedence than the other binary operators.
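To make the precedence rule concrete, here is a small, hypothetical evaluator over bitmask "sets". It is not the parser in regcomp.c and handles only single-letter operands with "&" (intersection), "+" (union) and "-" (difference); the one property it demonstrates is that "&" is parsed at a tighter level than the other two operators.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical parser state: cursor plus a table of sets indexed by letter. */
struct parser {
    const char *p;
    const uint32_t *sets;   /* sets[0] is 'A', sets[1] is 'B', ... */
};

static uint32_t parse_operand(struct parser *ps)
{
    char c = *ps->p;
    assert(c >= 'A' && c <= 'Z');
    ps->p++;
    return ps->sets[c - 'A'];
}

/* Tighter level: chains of '&'. */
static uint32_t parse_intersection(struct parser *ps)
{
    uint32_t v = parse_operand(ps);
    while (*ps->p == '&') {
        ps->p++;
        v &= parse_operand(ps);
    }
    return v;
}

/* Looser level: '+' and '-' applied left to right. */
static uint32_t parse_expr(struct parser *ps)
{
    uint32_t v = parse_intersection(ps);
    for (;;) {
        char op = *ps->p;
        if (op == '+') {
            ps->p++;
            v |= parse_intersection(ps);
        } else if (op == '-') {
            ps->p++;
            v &= ~parse_intersection(ps);
        } else {
            return v;
        }
    }
}

static uint32_t eval_sets(const char *expr, const uint32_t *sets)
{
    struct parser ps = { expr, sets };
    return parse_expr(&ps);
}
```

With A = 0x3, B = 0x4 and C = 0x6, the expression "A+B&C" evaluates as A + (B & C) = 0x7; a strict left-to-right reading, (A + B) & C, would give 0x6 instead.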
* regcomp.c: White-space only | Karl Williamson | 2015-03-18 | 1 | -14/+14
    Outdent code that the previous commit removed the surrounding block from.
* Fix qr'\N{U+41}' on EBCDIC platforms | Karl Williamson | 2015-03-18 | 1 | -196/+263
    Prior to this commit, the regex compiler was relying on the lexer to do the translation from Unicode to native for \N{...} constructs, where it was simpler to do. However, when the pattern is a single-quoted string, it is passed unchanged to the regex compiler, and this did not work. Fixing it required some refactoring, though it led to a clean API in a static function.

    This was spotted by Father Chrysostomos.
* fix XXX comment for regcomp.c:S_reg | Hugo van der Sanden | 2015-03-10 | 1 | -1/+1
    It actually does do the right thing: /(?(R0))/ and /(?(R00))/ both fall through to give an appropriate error 'Switch condition not recognized'.
* [perl #123814] replace grok_atou with grok_atoUV | Hugo van der Sanden | 2015-03-09 | 1 | -17/+28
    Some questions and loose ends:
      XXX gv.c:S_gv_magicalize - why are we using SSize_t for paren?
      XXX mg.c:Perl_magic_set - need appropriate error handling for $)
      XXX regcomp.c:S_reg - need to check if we do the right thing if parno was not grokked

    Perl_get_debug_opts should probably return something unsigned; not sure if that's something we can change.
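For readers unfamiliar with why a stricter number parser matters here, the sketch below shows one way to parse an unsigned decimal strictly. It is a standalone illustration, not perl's grok_atoUV; the function name and the exact rules it enforces (no sign, no whitespace, no leading zeros, no overflow) are assumptions of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Parse an unsigned decimal at the start of s.
 * On success stores the value in *out, the first unparsed character in
 * *end (if end is not NULL), and returns true. */
static bool parse_decimal_strict(const char *s, uint64_t *out, const char **end)
{
    const char *p = s;
    uint64_t val = 0;

    if (*p < '0' || *p > '9')
        return false;               /* rejects "", signs and whitespace */
    if (*p == '0' && p[1] >= '0' && p[1] <= '9')
        return false;               /* rejects leading zeros such as "007" */

    while (*p >= '0' && *p <= '9') {
        unsigned digit = (unsigned)(*p - '0');
        if (val > (UINT64_MAX - digit) / 10)
            return false;           /* val * 10 + digit would overflow */
        val = val * 10 + digit;
        p++;
    }

    *out = val;
    if (end != NULL)
        *end = p;
    return true;
}
```

Returning a success flag separately from the value lets the caller distinguish "parsed 0" from "did not parse", which is the situation the S_reg loose end above is concerned with.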
* [perl #123814] stricter handling of numbers in regexp quantifiers | Hugo van der Sanden | 2015-03-09 | 1 | -5/+20
* Consistently use NOT_REACHED; /* NOTREACHED */ | Jarkko Hietaniemi | 2015-03-04 | 1 | -6/+6
    Both needed: the macro is for compilers, the comment for static checkers.

    (This doesn't address whether each spot is correct and necessary.)
* \s matching VT is no longer experimental | Karl Williamson | 2015-02-21 | 1 | -5/+2
    This was experimentally introduced in 5.18, and no issues were raised, except that it got us to thinking and spurred us to stop allowing $^X, where 'X' is a non-printable control character, and that change caused some issues.
* regcomp.c: Add assertion | Karl Williamson | 2015-02-19 | 1 | -0/+2
* Add \b{sb} | Karl Williamson | 2015-02-19 | 1 | -0/+7
* Add qr/\b{wb}/ | Karl Williamson | 2015-02-19 | 1 | -1/+8
* Add qr/\b{gcb}/ | Karl Williamson | 2015-02-19 | 1 | -13/+83
    A function implements seeing if the space between any two characters is a grapheme cluster break. After I wrote this, I realized that an array lookup might be a better implementation, but the deadline for v5.22 was too close to change it. I did see that my gcc optimized it down to an array lookup.

    This makes the implementation of \X go from being complicated to trivial.
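The function-versus-table trade-off mentioned above can be sketched with a heavily reduced class set. The four classes and the rules below are a small subset chosen for illustration (keep CR LF together, always break around other newline characters, never break before an Extend character, otherwise break); the enum, function, and table names are hypothetical and the real implementation covers many more classes.

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced, hypothetical set of break classes. */
enum gcb_class { GCB_CR, GCB_LF, GCB_EXTEND, GCB_OTHER, GCB_COUNT };

/* Function form: apply the rules in priority order. */
static bool is_break_fn(enum gcb_class before, enum gcb_class after)
{
    if (before == GCB_CR && after == GCB_LF)
        return false;               /* keep CR LF together */
    if (before == GCB_CR || before == GCB_LF)
        return true;                /* always break after a newline char */
    if (after == GCB_CR || after == GCB_LF)
        return true;                /* always break before a newline char */
    if (after == GCB_EXTEND)
        return false;               /* never break before Extend */
    return true;                    /* otherwise break everywhere */
}

/* Table form: the same answers, indexed [before][after]. */
static const bool is_break_table[GCB_COUNT][GCB_COUNT] = {
    /*              CR     LF     EXTEND OTHER */
    /* CR     */ { true,  false, true,  true },
    /* LF     */ { true,  true,  true,  true },
    /* EXTEND */ { true,  true,  false, true },
    /* OTHER  */ { true,  true,  false, true },
};
```

The table replaces a chain of comparisons with a single indexed load, at the cost of keeping the rule priorities implicit in the data rather than visible in the code.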
* regen/mk_invlists.pl: Revamp #if generation | Karl Williamson | 2015-02-19 | 1 | -2/+0
    This changes where the symbols are defined to a single file each. This may save text space, depending on the compiler. The next commit will cause this hdr to be included in more places, so it becomes more important to do this.

    At the same time this removes the guard for #ifndef PERL_IN_XSUB_RE. The code now is executed regardless of that. This is simpler, and previously there might have been the possibility of uninitialized memory being read, should re_comp.o be executed before regcomp.o.
* [perl #123852] avoid capture side-effects under noncapture flag | Hugo van der Sanden | 2015-02-18 | 1 | -0/+2
    //n was implemented by avoiding the primary side-effects of compiling a capture when the flag was turned on; however some secondary effects still occurred later in the same function, by using the value of the 'paren' variable - even as far as causing coredumps.

    Setting paren to ':' when NOCAPTURE is enabled makes the rest of the function act just as if it had parsed (?:...) instead of (...).
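The approach of normalising the group kind once, up front, can be sketched as follows. The `group_state` struct, the use of '(' to mean a plain capturing group, and the `open_group` function are all hypothetical simplifications; the real S_reg tracks considerably more state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical compile state: number of capture groups allocated so far. */
struct group_state {
    int npar;
};

/* Open a group. `paren` is '(' for a plain group or ':' for (?:...).
 * Returns the capture number allocated, or 0 for a non-capturing group. */
static int open_group(struct group_state *st, char paren, bool nocapture)
{
    if (paren == '(' && nocapture)
        paren = ':';            /* from here on, behave as if (?:...) was parsed */

    /* Every later decision looks only at `paren`, never at the flag. */
    if (paren == '(')
        return ++st->npar;
    return 0;
}
```

Rewriting the variable in one place means later code cannot accidentally take the capturing path, which is the class of bug the commit describes.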
* [perl #123843] fix SEGV reading data->flags | Hugo van der Sanden | 2015-02-15 | 1 | -1/+1
    This could be triggered by trying to compile e.g. 'qr{x+(y(?0))*}'.
* Add comments about how backrefs are parsed | Yves Orton | 2015-02-15 | 1 | -8/+27
* fix infinite loop in parsing backrefs in regex patterns | Yves Orton | 2015-02-15 | 1 | -2/+4
* [perl #123782] regcomp: check for overflow on /(?123)/ | Hugo van der Sanden | 2015-02-10 | 1 | -1/+3
    AFL (<http://lcamtuf.coredump.cx/afl>) found that the UV to I32 conversion can evade the necessary range checks on wraparound, leading to bad reads. Check for it, and force to I32_MAX, expecting that this will usually yield a "Reference to nonexistent group" error.
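A minimal sketch of the clamping step, using standard fixed-width types in place of perl's UV and I32. The function name is hypothetical; the technique is simply to compare in the wide unsigned type before narrowing, so that a huge value cannot wrap into a small or negative group number.

```c
#include <assert.h>
#include <stdint.h>

/* Narrow a parsed group number to 32 bits, saturating instead of wrapping. */
static int32_t group_number_from_unsigned(uint64_t parsed)
{
    if (parsed > (uint64_t)INT32_MAX)
        return INT32_MAX;       /* out of range: saturate so later range
                                   checks reject it as nonexistent */
    return (int32_t)parsed;
}
```

Saturating rather than reporting an overflow error directly keeps the fix small: the existing "no such group" check then handles the clamped value.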
* regcomp can read past end of string after parsing flags | Hugo van der Sanden | 2015-02-10 | 1 | -1/+2
    New test in 8a6d8ec6fe revealed an additional code problem: reading past the end of the string, detected under clang with sanitize=address.
* [perl #123755] including unknown char in error requires care | Hugo van der Sanden | 2015-02-09 | 1 | -3/+8
    AFL (<http://lcamtuf.coredump.cx/afl>) found that when producing the error message for /(??/ we hit an assert because we've stepped past the end of the pattern string. Code inspection found that we also do that in other branches, and we also need to check UTF more carefully.
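The bounds check involved can be sketched as below. The function name and both message strings are invented for this illustration and are not perl's diagnostics; the sketch also treats the pattern as single-byte characters, so it does not address the UTF-8 concern raised above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Describe the character at offset pos of a pattern of length len.
 * pos may equal len when the pattern ended before the expected char. */
static int describe_bad_char(char *buf, size_t bufsize,
                             const char *pat, size_t len, size_t pos)
{
    if (pos >= len)             /* nothing to quote: the pattern ended early */
        return snprintf(buf, bufsize, "pattern ended unexpectedly");
    return snprintf(buf, bufsize, "unexpected character '%c'", pat[pos]);
}
```

Checking the position before dereferencing is what keeps the error path itself from reading past the end of the pattern.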