path: root/regcomp.c
Commit message | Author | Age | Files | Lines
* Fix spelling: precede | Felipe Gasper | 2021-06-15 | 1 | -1/+1
* regcomp.c: comments | Hugo van der Sanden | 2021-06-14 | 1 | -16/+14
    Comment change suggestions from @hvds in PR #18835.
* regcomp.c: White-space only | Karl Williamson | 2021-06-14 | 1 | -512/+512
    My attempt to insulate from the leading tab removal the year-old commits finally pushed as 77a6d54c0deb1165b37dcf11c21cd334ae2579bb and 403d7eb3e4320188571cf61b9dab62ff10799f49 failed miserably. I spent a bunch of time sorting it all out, and this is the result.
* regcomp.c: Fix typo in comment | Karl Williamson | 2021-06-12 | 1 | -1/+1
* Rename G_ARRAY to G_LIST; provide back-compat when not(PERL_CORE) | Paul "LeoNerd" Evans | 2021-06-02 | 1 | -1/+1
* gh18770: stop scanning for substrs after *COMMIT | Hugo van der Sanden | 2021-06-01 | 1 | -6/+20
    *ACCEPT already avoids this (because it is "ENDLIKE"), but gets a related fix to stop scanning for start class.
* regcomp.c: white-space; comments | Karl Williamson | 2021-05-31 | 1 | -268/+239
* Base *.[ch] files: Replace leading tabs with blanks | Michael G Schwern | 2021-05-31 | 1 | -2615/+2615
    This is a rebasing by @khw of part of GH #18792, which I needed to get in now to proceed with other commits. It also strips trailing white space from the affected files.
* regcomp.c: Extract code from a too-large-function | Karl Williamson | 2021-05-31 | 1 | -140/+191
    S_regclass() is unwieldy. This commit splits it into two parts of nearly equal size. More could be done.
* [gh 17847] data->pos_delta should stick at infinity | Hugo van der Sanden | 2021-05-31 | 1 | -0/+1
    The expression we're about to add to data->pos_delta in this part of study_chunk() can be either positive or negative. While we apply an overflow check to avoid exceeding OPTIMIZE_INFTY, we were happily subtracting from it when the expression was negative, making it no longer infinite.
* [gh 17847] avoid overflow on delta in study_chunk | Hugo van der Sanden | 2021-05-31 | 1 | -2/+14
    delta and pos_delta may hold OPTIMIZE_INFTY to represent infinity.
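The two [gh 17847] fixes above amount to saturating arithmetic on a counter that uses a sentinel value for infinity. The following is a minimal sketch of that idea in Python, not the code from study_chunk(): the value given to OPTIMIZE_INFTY and the helper name add_delta are assumptions made for illustration.

```python
# Stand-in for perl's OPTIMIZE_INFTY; the value here is an assumption,
# chosen only so the sketch has a concrete sentinel to work with.
OPTIMIZE_INFTY = 2**31 - 1


def add_delta(pos_delta: int, change: int) -> int:
    """Add a signed change to pos_delta, treating OPTIMIZE_INFTY as infinity."""
    # Once infinite, stay infinite, even when the change is negative.
    if pos_delta == OPTIMIZE_INFTY:
        return OPTIMIZE_INFTY
    # Clamp at the sentinel instead of stepping past it.
    if change > 0 and pos_delta > OPTIMIZE_INFTY - change:
        return OPTIMIZE_INFTY
    return pos_delta + change
```

The first check corresponds to the "stick at infinity" fix, and the second to the overflow guard; finite values with a negative change fall through to ordinary addition.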
* [gh 17847] Include data->pos_delta in #if'd-out diagnostic | Hugo van der Sanden | 2021-05-31 | 1 | -2/+3
* regcomp.c: Remove memory leak | Karl Williamson | 2021-02-28 | 1 | -0/+7
    This fixes GH #18604. There was a path through the code where a particular SV did not get its reference count decremented. I did an audit of the function and came up with several other possibilities that are included in this commit.
    Further, there would be leaks for some instances of finding syntax errors in the input pattern, or when warnings are fatalized. Those would require mortalizing some SVs, but that is beyond the scope of this commit.
* Hide Perl_regcurly in the re extension | Craig A. Berry | 2021-02-15 | 1 | -6/+8
    Otherwise a strict linker will fail to build the extension due to a multiply defined symbol. We used to do this but it was removed in e513125ac7bdea1f for unknown reasons.
    The same commit also defined some macros inside the function that are used both inside and outside it, so put them where they can be seen regardless of whether we are defining the function itself.
* gh18515: fix special handling of specific split() patterns | Hugo van der Sanden | 2021-02-09 | 1 | -4/+8
    Commit 122af31004 acted on the wrong assumption that NEXTOPER() and regnext() were equivalent, and in fixing a valgrind complaint tried to simplify code for detecting specific patterns for split() that merited special-case handling by making them all use regnext(). As a result, the special case /\s+/ was no longer correctly detected, resulting in a degree of pessimisation.
    This commit fixes that, and avoids reading via the calculated 'next' pointer except for the ops we need (in which cases we know it'll point to another regop). For the EXACT case (which we don't need), valgrind was correctly pointing out that it points to potentially uninitialized data.
* regcomp.c: White-space and comments | Karl Williamson | 2021-01-20 | 1 | -12/+15
* Allow blanks within and adjacent to {...} constructs | Karl Williamson | 2021-01-20 | 1 | -24/+77
    This was the consensus in http://nntp.perl.org/group/perl.perl5.porters/258489
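As a rough illustration of a quantifier parser that tolerates blanks and an empty lower bound (the subject of this commit and of "Allow empty lower bound in /{,n}/" further down), here is a small Python sketch. The grammar it accepts is an assumption for illustration and may differ in detail from what perl accepts; parse_quantifier is a hypothetical name, not a perl function.

```python
import re
from typing import Optional, Tuple

# Assumed grammar: optional blanks (space or tab) just inside the braces and
# around the comma; either bound may be empty, but not both.
_QUANT = re.compile(r"\{[ \t]*(\d*)[ \t]*(?:(,)[ \t]*(\d*)[ \t]*)?\}")


def parse_quantifier(text: str) -> Optional[Tuple[int, Optional[int]]]:
    """Return (min, max) for a {m,n}-style quantifier; max None means unbounded.

    Returns None when the text is not accepted as a quantifier.
    """
    m = _QUANT.fullmatch(text)
    if m is None:
        return None
    lower, comma, upper = m.group(1), m.group(2), m.group(3)
    if comma is None:
        # {n} means exactly n; an empty {} is rejected by this sketch.
        if not lower:
            return None
        return int(lower), int(lower)
    if not lower and not upper:
        # {,} carries no bound at all; this sketch rejects it.
        return None
    # An empty lower bound defaults to 0; an empty upper bound is unbounded.
    return (int(lower) if lower else 0, int(upper) if upper else None)
```

For example, parse_quantifier("{,3}") yields (0, 3) and parse_quantifier("{ 2 , 5 }") yields (2, 5).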
* perlre: Note the other forms of \k<name> | Karl Williamson | 2021-01-20 | 1 | -2/+2
    Not all three synonyms were documented. This also fixes up related comments in regcomp.c to correspond.
* regcomp.c: Further refactor \g | Karl Williamson | 2021-01-20 | 1 | -14/+15
    By changing a bool into a pointer, we can avoid some work and prepare for a future commit.
* regcomp.c: Refactor portions of \g parsing | Karl Williamson | 2021-01-20 | 1 | -13/+39
    This moves the finding of the matching '}' for \g{ to earlier, and creates a temporary to point to the current position in the parse. This makes it easier to deal with backtracking; we haven't advanced the main parse pointer, so don't have to remember how far we advanced. This will prove advantageous in a future commit.
* regcomp.c: Move initialization into declaration | Karl Williamson | 2021-01-20 | 1 | -2/+2
    This is considered better practice.
* regcomp.c: Slight simplification | Karl Williamson | 2021-01-20 | 1 | -1/+1
    Rather than know how far we have advanced in parsing when we have to back up, use the already-existing checkpoint position. This results in slightly more maintainable code that a future commit will take advantage of.
* Allow empty lower bound in /{,n}/ | Karl Williamson | 2021-01-20 | 1 | -56/+11
    This change has been planned for a long time, bringing Perl into parity with similar languages, but it took many deprecation cycles to be able to reach the point where it could safely go in.
    This fixes GH #18264.
* Point to error in malformed /x{y,z}/ | Karl Williamson | 2021-01-20 | 1 | -2/+2
    Prior to this commit, a curly quantifier that had an error in the bounds pointed to the left brace. Now the error message points to the first bound that has a problem.
* Revamp regcurly(), regpiece() use of it | Karl Williamson | 2021-01-20 | 1 | -66/+159
    This commit copies portions of new_regcurly(), which has been around since 5.28, into plain regcurly(), as a baby step in preparation for converting entirely to the new one. These functions are used for parsing {m,n} quantifiers. Future commits will add capabilities not available using the old version.
    The commit adds an optional parameter, to return to the caller information it gleans during parsing. regpiece() is changed by this commit to use this information, instead of itself reparsing the input. Part of the reason for this commit is that changes are planned soon to what is legal syntax. With this commit in place, those changes only have to be done once.
    This commit also extracts into a function the calculation of the quantifier bounds. This allows the logic for that to be done in one place instead of two.
* regcomp.c: Change names of 2 macros for mnemonics | Karl Williamson | 2021-01-20 | 1 | -2626/+2627
    The new names are more understandable to me. This also adds a second parameter to one macro, that is unused until the next commit in the series.
* style: Detabify indentation of the C code maintained by the core. | Michael G. Schwern | 2021-01-17 | 1 | -2631/+2631
    This just detabifies to get rid of the mixed tab/space indentation. Applying consistent indentation and dealing with other tabs are another issue. Done with `expand -i`.
    * vutil.* left alone, it's part of version.
    * Left regen managed files alone for now.
* Don't define Perl_regcurly in re extension | Craig A. Berry | 2020-12-24 | 1 | -1/+2
    This makes the linker have to decide (or guess) which of the identically-named symbols to include. The VMS linker refuses and throws a multiply-defined symbol error.
* Remove empty "#ifdef"s | Tom Hukins | 2020-12-08 | 1 | -4/+0
* Restrict scope/Shorten some very long macro names | Karl Williamson | 2020-11-22 | 1 | -11/+0
    The names were intended to force people to not use them outside their intended scopes. But by restricting those scopes in the first place, we don't need such unwieldy names.
* Move regcurly to regcomp.c (from inline.h) | Karl Williamson | 2020-11-18 | 1 | -0/+24
    This function is called only at compile time; experience has shown that compile-time operations are not time-critical. And future commits will lengthen it, making it not practically inlinable anyway.
* autodoc.pl: Specify scn for single-purpose files | Karl Williamson | 2020-11-06 | 1 | -1/+0
    Many of the files in perl are for one thing only, and hence their embedded documentation will be for that one thing. By creating a hash here of them, those files don't have to worry about what section that documentation goes under, and so it can be completely changed without affecting them.
* don't croak when the \K follows the lookaround assertion | Tony Cook | 2020-11-04 | 1 | -23/+12
    This also simplifies the flagging for these assertions: since this error is now the only thing using in_lookhead and in_lookbehind, they can be combined into a single in_lookaround.
    Rather than a conditional increment/decrement as we recurse into S_reg, I simply save the value of in_lookaround and restore it before returning. Some unsuccessful or restart paths don't do the restore, but they either result in a croak(), or a restart which reinitialises in_lookaround anyway.
    Also added tests to ensure that all the different zero-width assertions with content trigger the error.
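The save-and-restore approach described above can be sketched with a toy recursive walker. This is only an illustration in Python, not the code in S_reg; ParseState, parse_group, and the dictionary-based pattern tree are all hypothetical.

```python
class ParseState:
    """Hypothetical parser state; only the one flag matters here."""

    def __init__(self) -> None:
        self.in_lookaround = 0


def parse_group(state: ParseState, node: dict, seen: list) -> None:
    """Walk a toy pattern tree, recording the flag value seen at each node."""
    saved = state.in_lookaround          # remember the caller's value
    if node.get("lookaround"):
        state.in_lookaround = 1          # set while inside the assertion
    seen.append((node["name"], state.in_lookaround))
    for child in node.get("children", []):
        parse_group(state, child, seen)
    state.in_lookaround = saved          # restore before returning
```

Because every call restores what it found on entry, nested assertions need no counting: a sibling parsed after a lookaround sees the flag exactly as the parent left it.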
* Fix GH #17278 | Karl Williamson | 2020-10-23 | 1 | -5/+10
    This was an assertion failure in regexec.c under rare circumstances. A reduction of the fuzzed test case is now in pat_advanced.t.
    The root cause of this was that the pattern being compiled was encoded in UTF-8 and 'use locale' was in effect, equivalent to the /l charset, and then the charset was reset inside the pattern, to /d. But /d in a UTF-8 pattern is illegal, hence the later assertion failure. The solution is to reset instead to /u when the pattern is UTF-8.
* perlapi: Add markup | Karl Williamson | 2020-10-22 | 1 | -1/+1
* regcomp.c: Do some extra folding | Karl Williamson | 2020-10-16 | 1 | -4/+19
    Generally we have to wait until runtime to do folding for regnodes that are locale dependent, because we don't know what the locale at runtime will be, and hence what the folds will be.
    But UTF-8 locales all have the same folding behavior, no matter what the locale is, with the exception of two fold pairs in Turkish. (Lithuanian too, but Perl doesn't support that language's special folding rules.)
    UTF-8 is the only locale type that Perl supports that can represent code points above 255. Therefore we do know at compile time what the above-255 folds are (again excepting the two in Turkish), and so we can do the folding then. But only if both the components are above 255. There are a few folds that cross the 255/256 boundary, and they must be deferred.
    However, there are two instances where there are three characters that fold together in which two of them are above 255, and the third isn't. That the two high ones are equivalent under /i is known at compile time, and so that equivalence can be stated then.
* regcharclass.h: multi-folds: Add some unfoldeds | Karl Williamson | 2020-10-16 | 1 | -11/+5
    Prior to this commit, the generated macros for dealing with multi-char folds in UTF-8 strings only recognized completely folded strings. This commit changes that to add the uppercase for characters in the Latin1 range. Hopefully an example will clarify.
    The fold for U+0130: LATIN CAPITAL LETTER I WITH DOT ABOVE is 'i' followed by U+0307: COMBINING DOT ABOVE. But since we are doing /i matching, an 'I' followed by U+0307 should also match. This commit changes the macros to know this. Before this, if the fold were entirely ASCII, the macros would know all the possible combinations. This commit extends that to all code points < 256. (Since there are no folds to the upper Latin1 range, that really means all code points below 128. But making it general means it wouldn't have to be revised if a fold were ever added to the upper half range.)
    The reason to make this change is that it makes some future code less complicated. And it adds very little complexity to the generated macros; less than the code it will save. I originally thought it would be more complex than it now turns out to be. Much of that is because the infrastructure has advanced since that decision.
    I couldn't find any current places that this change will allow to be simplified. There could be if the macros were extended to do this on all code points, not just the low ones. I tried that, but the generated macros were at least 400 lines longer than before. That does add significant complexity, so I backed that out.
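The U+0130 example above can be checked with any implementation of Unicode full case folding. The sketch below uses Python's str.casefold() purely to demonstrate the equivalence; it says nothing about how the generated macros in regcharclass.h are written, and fold_equal is a hypothetical helper.

```python
def fold_equal(a: str, b: str) -> bool:
    """Compare two strings under Unicode full case folding."""
    return a.casefold() == b.casefold()


dotted_capital_i = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
fully_folded = "i\u0307"      # its full fold: 'i' + COMBINING DOT ABOVE
partly_unfolded = "I\u0307"   # uppercase 'I' + COMBINING DOT ABOVE
```

Here fold_equal(dotted_capital_i, fully_folded) and fold_equal(dotted_capital_i, partly_unfolded) both hold, which is the case-insensitive equivalence the commit teaches the macros to recognize for the partly unfolded form.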
* regcomp.c: Simplify | Karl Williamson | 2020-10-16 | 1 | -22/+7
    This was a case statement of every type of EXACTish node. Instead, there is a simple way to see if something is EXACTish.
* regcomp.c,regexec.c: Simplify | Karl Williamson | 2020-10-16 | 1 | -20/+5
    This commit uses the new macros from the previous commit to simplify some code.
* regcomp.c: Simplify | Karl Williamson | 2020-10-16 | 1 | -3/+3
    The previous commit made the opcodes for two regops adjacent, so that we can refer to them by a single range. This commit takes advantage of that change.
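The simplifications above rely on related opcodes being numbered contiguously, so that membership becomes a single range comparison. A minimal sketch of that technique follows; the opcode numbering and the set of members shown are assumptions for illustration and do not reproduce perl's real regop table.

```python
from enum import IntEnum


class Op(IntEnum):
    # Hypothetical numbering; only the contiguity of the EXACT* group matters.
    END = 0
    EXACT = 1
    EXACTF = 2
    EXACTFU = 3
    STAR = 4


def is_exactish(op: Op) -> bool:
    """One range comparison replaces a case for every EXACT-like opcode."""
    return Op.EXACT <= op <= Op.EXACTFU
```

The trade-off is that the range test silently depends on the numbering: adding a new opcode in the middle of the group extends the test for free, but reordering the table can break it.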
* regcomp.c: Clarify comment | Karl Williamson | 2020-10-16 | 1 | -6/+6
* regcomp.c: Zero width constructs shouldn't be SIMPLE | Karl Williamson | 2020-10-15 | 1 | -3/+0
    This is reserved for length-1 constructs.
* regcomp.c: regpiece: swap order of conditionals | Karl Williamson | 2020-10-12 | 1 | -6/+6
    It's a bit clearer to test the 0 case before the 1 case, and doing so makes it visually easier to compare and contrast the two cases.
* regcomp.c: regpiece: Move chunk of code for clarity | Karl Williamson | 2020-10-12 | 1 | -16/+21
    This changes an error branch to be goto'd out of the mainline code. The large chunk being in the middle obscured the commonality of the slightly different non-error cases. The branch is moved to the bottom of the routine, and croaks, so there is no return.
    This is a modification to a suggestion by Hugo van der Sanden.
* regcomp.c: White-space only | Karl Williamson | 2020-10-12 | 1 | -58/+59
    Change indentation to correspond with new blocks formed by the previous commit.
* regcomp.c: regpiece(): Convert to a switch() stmt | Karl Williamson | 2020-10-12 | 1 | -19/+28
    This makes the code easier to understand, I think.
* regcomp.c: regpiece(): More comments; white-space | Karl Williamson | 2020-10-12 | 1 | -1/+9
* regcomp.c: regpiece(): Refactor two 'if's | Karl Williamson | 2020-10-12 | 1 | -24/+27
    I think this makes the commonalities of the * and + quantifiers clearer.
* regcomp.c: regpiece: Consolidate code | Karl Williamson | 2020-10-12 | 1 | -5/+5
    There is a common place where these three occurrences can be put.
* regcomp.c: Change label name; rmv extraneous goto | Karl Williamson | 2020-10-12 | 1 | -5/+3
    The name was misleading. There are other things being done here. And previous restructuring left a goto immediately before the label it jumped to.