path: root/regexec.c
Commit message (Author, Age, Files, Lines)
* regexec.c: Comments, White-space only (Karl Williamson, 2018-09-30, 1 file, -7/+8)
| A few clarifications
* regexec.c: Remove obsolete comments (Karl Williamson, 2018-09-30, 1 file, -13/+0)
| The first comment listed an item as a TODO that was recommended by Unicode. That recommendation is being rescinded in Unicode 12.0 based on a ticket I filed against Unicode, which in turn was based on feedback from Asmus Freitag.
|
| The second comment was obsoleted by later code changes.
* regexec.c: Remove macro use for further clarity (Karl Williamson, 2018-09-30, 1 file, -4/+3)
| Commit 4c83fb55d7096a1d0e6a7a8e25d20b186be3281d added a macro for clarity. I have since realized that it is even clearer to spell things as this commit now does.
* PATCH: [perl #133547]: script run broken (Karl Williamson, 2018-09-30, 1 file, -55/+56)
| All scripts can have the ASCII digits for their numbers. Scripts with their own digits can alternatively use those. Only one of these two sets can be used in a script run. The decision as to which set to use must be deferred until the first digit is encountered, as otherwise we don't know which set will be used. Prior to this commit, the decision was being made prematurely in some cases.
|
| As a result of this change, the non-ASCII digits in the Common script need to be special-cased, and different criteria are used to decide if we need to look up whether a character is a digit or not.
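The deferred decision described above can be sketched roughly as follows. This is an illustrative simplification, not the code in regexec.c: `digit_set_zero()` and `digits_from_one_set()` are hypothetical names, only ASCII and Thai digits are modelled, and script membership is ignored entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the code point of the ZERO of the digit set
 * that cp belongs to, or 0 if cp is not a digit.  Only two sets are
 * modelled in this sketch. */
static unsigned long digit_set_zero(unsigned long cp)
{
    if (cp >= 0x30 && cp <= 0x39)       /* ASCII 0-9 */
        return 0x30;
    if (cp >= 0x0E50 && cp <= 0x0E59)   /* THAI DIGIT ZERO..NINE */
        return 0x0E50;
    return 0;
}

/* Returns 1 if every digit in cps[0..n-1] comes from a single digit set. */
static int digits_from_one_set(const unsigned long *cps, size_t n)
{
    unsigned long zero_in_use = 0;      /* 0 means "no digit seen yet" */
    size_t i;

    assert(cps != NULL || n == 0);
    for (i = 0; i < n; i++) {
        unsigned long zero = digit_set_zero(cps[i]);

        if (zero == 0)
            continue;                   /* not a digit: nothing to decide */
        if (zero_in_use == 0)
            zero_in_use = zero;         /* the first digit fixes the set */
        else if (zero != zero_in_use)
            return 0;                   /* digits from two different sets */
    }
    return 1;
}
```

The point of the sketch is that nothing is committed to before a digit is actually seen; a Thai letter followed by ASCII digits is accepted just as a Thai letter followed by Thai digits is.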
* regexec.c: Rename variable (Karl Williamson, 2018-09-30, 1 file, -7/+7)
| The new name is clearer as to its meaning, more so after the next commit.
* S_regmatch: add debugging to UNWIND_PAREN() (David Mitchell, 2018-08-26, 1 file, -1/+10)
| (and tweak the debugging output of CLOSE_CAPTURE())
* S_rematch(): CLOSE_CAPTURE(): set last(close)paren (David Mitchell, 2018-08-26, 1 file, -17/+7)
| Every use of the CLOSE_CAPTURE() macro is followed by the setting of lastparen and lastcloseparen, so include these actions in the macro itself.
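A macro that folds the bookkeeping into the capture-closing step might look something like the sketch below. Everything here is an assumption made for illustration: `struct capture_state`, `MAX_PARENS` and `CLOSE_CAPTURE_SKETCH` are invented names, the real macro also emits debugging output, and the update rule for `lastparen` (highest group closed so far) is assumed rather than taken from the commit.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PARENS 8

/* Hypothetical stand-in for the regex's capture bookkeeping. */
struct capture_state {
    struct { ptrdiff_t start, end; } offs[MAX_PARENS];
    unsigned lastparen;        /* assumed: highest group closed so far */
    unsigned lastcloseparen;   /* assumed: most recently closed group */
};

/* Sketch only: record the capture and update both "last" fields in one
 * place, so that no call site can forget the bookkeeping. */
#define CLOSE_CAPTURE_SKETCH(st, ix, s, e)          \
    do {                                            \
        assert((ix) < MAX_PARENS);                  \
        (st)->offs[(ix)].start = (s);               \
        (st)->offs[(ix)].end = (e);                 \
        if ((ix) > (st)->lastparen)                 \
            (st)->lastparen = (ix);                 \
        (st)->lastcloseparen = (ix);                \
    } while (0)
```

Taking the index and the start and end offsets as parameters is what lets a single definition serve every call site.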
* S_regmatch(): use CLOSE_CAPTURE() macro more (David Mitchell, 2018-08-26, 1 file, -5/+5)
| This macro includes debugging output, so by using it rather than setting rex->offs[paren].start/end directly, you get better debugging.
* S_regmatch(): parameterise CLOSE_CAPTURE macro (David Mitchell, 2018-08-26, 1 file, -8/+12)
| Make its index and start+end values into parameters. This will shortly allow its use in other places, bringing consistent code and debug logging to the whole of S_regmatch().
* S_regmatch(): move CLOSE_CAPTURE macro definition (David Mitchell, 2018-08-26, 1 file, -13/+14)
| Move this macro to earlier in the file to be with the other functions and macros which deal with setting and restoring captures.
|
| No changes (functional or textual) apart from the physical moving of the 13 lines.
* S_regmatch(): handle GOSUB within (.)* specially (David Mitchell, 2018-08-26, 1 file, -7/+9)
| The (?n) mechanism allows you to 'gosub' to a subpattern delineated by capture n. For 1-char-width repeats such as a+, \w*?, (\d)*, the code currently checks whether it's in a gosub each time it attempts to start executing the B part of A*B, regardless of whether the A is in a capture.
|
| This commit moves the GOSUB check to within the capture-only variant (CURLYN), which then directly just looks for one instance of A and returns. This moves the check away from more frequently called code paths.
* S_regmatch(): combine CURLY_B_min/_known states (David Mitchell, 2018-08-26, 1 file, -45/+35)
| There are currently two similar backtracking states for simple non-greedy pattern repeats:
|
|     CURLY_B_min
|     CURLY_B_min_known
|
| the latter is a variant of the former for when the character which must follow the repeat is known, e.g. /(...)*?X.../, which allows quick skipping to the next viable position.
|
| The code for the two cases:
|
|     case CURLY_B_min_fail:
|     case CURLY_B_min_known_fail:
|
| share a lot of similarities. This commit merges the two states into a single CURLY_B_min state, with an associated single CURLY_B_min_fail fail state.
|
| That one code block can handle both types, with a single
|
|     if (ST.c1 == CHRTEST_VOID) ...
|
| test to choose between the two variant parts of the code.
|
| This makes the code smaller and more maintainable, at the cost of one extra test per backtrack.
* Fix script run bug '1' followed by Thai digit (Karl Williamson, 2018-08-16, 1 file, -15/+31)
| This does not have a ticket, but was pointed out in http://nntp.perl.org/group/perl.perl5.porters/251870
|
| The logic for deciding whether a character needed to be checked for being a digit was flawed.
* regexec.c: Use a macro for clarity (Karl Williamson, 2018-08-16, 1 file, -5/+7)
| This commit #defines a macro and uses it, which makes the code easier to understand.
* regexec.c: Fix UNLIKELY() scope (Karl Williamson, 2018-08-16, 1 file, -1/+1)
| The expression as a whole is unlikely to be true, not just the portion that was marked so.
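The difference in scope can be illustrated with a small sketch. `UNLIKELY_SKETCH` is a stand-in defined here with `__builtin_expect` on GCC and Clang; it is not Perl's own definition, and the conditions are invented for the example.

```c
#include <assert.h>

#if defined(__GNUC__) || defined(__clang__)
#  define UNLIKELY_SKETCH(cond) __builtin_expect(!!(cond), 0)
#else
#  define UNLIKELY_SKETCH(cond) (cond)
#endif

/* The hint covers only the first operand of the condition. */
static int is_rare_narrow(int a, int b)
{
    if (UNLIKELY_SKETCH(a > 100) && b > 100)
        return 1;
    return 0;
}

/* The hint covers the condition as a whole. */
static int is_rare_wide(int a, int b)
{
    if (UNLIKELY_SKETCH(a > 100 && b > 100))
        return 1;
    return 0;
}

/* The hint is advisory: both forms must always give the same answer. */
static void check_forms_agree(int a, int b)
{
    assert(is_rare_narrow(a, b) == is_rare_wide(a, b));
}
```

Both functions compute the same result; only the branch-prediction hint given to the compiler differs, which is why a fix of this kind changes a single line and no behaviour.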
* Prepare for Unicode 11.0 (Karl Williamson, 2018-07-20, 1 file, -0/+18)
| Unicode 11 has some new data files needed for it, and some changes in the boundary rules that need to be accounted for. This does all that can be done without causing tests to fail.
|
| The LB algorithm has changed, and tests would fail if we included the code changes needed for that change in this commit. Instead those few lines will come as part of the Unicode 11.0 commit.
* regexec.c: Use macro instead of Perl_invlist_search() (Karl Williamson, 2018-07-17, 1 file, -1/+1)
| There's no reason to use the function name, and by using the macro, this code is insulated against future changes.
* regexec.c: Call macro with correct args. (Karl Williamson, 2018-07-01, 1 file, -1/+1)
| The second argument to this macro is a pointer to the end, as opposed to a length.
* toke.c: Move some code into called function (Karl Williamson, 2018-06-25, 1 file, -1/+8)
| It makes more sense for this code to be in the function called, rather than separated out.
* PATCH: [perl #133175] script run free from wrong pool panic (Karl Williamson, 2018-05-09, 1 file, -0/+2)
| Setting the pointer to NULL after freeing signals the code in later iterations that it has been freed already.
|
| No test is added because it could become outdated (not testing what it was designed to test) with a new Unicode version changing the underlying data. This bug was discovered by testing on Unicode 7.0, and the data changed so that there was not a problem by Unicode 10.0.
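The general pattern, shown here with plain `malloc`/`free` rather than Perl's own allocator, is a minimal sketch along these lines. `run_rounds()` and the buffer are invented for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Runs `rounds` loop iterations in which a scratch buffer may be released.
 * Returns how many times a live buffer was actually freed inside the loop. */
static int run_rounds(int rounds)
{
    char *scratch = malloc(16);
    int frees = 0;
    int i;

    assert(rounds >= 0);
    for (i = 0; i < rounds; i++) {
        if (scratch != NULL) {      /* NULL means "already released" */
            free(scratch);
            scratch = NULL;         /* later iterations can see it is gone */
            frees++;
        }
    }
    free(scratch);                  /* no-op when scratch is already NULL */
    return frees;
}
```

Without the reset to NULL, the second iteration would pass a dangling pointer to `free()` again.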
* fix TRIE_READ_CHAR and DECL_TRIE_TYPE to account for non-utf8 target (Yves Orton, 2018-04-16, 1 file, -4/+10)
| This is the third commit involved in [perl #132063], and the bottom line cause of it.
|
| The problem is that the code is incorrectly branching to a portion of the code that expects it is handling UTF-8. And the input isn't UTF-8. The fix is to handle this case and branch correctly.
|
| This bug requires the following things in order to manifest:
|
| 1) the pattern is compiled under /il
| 2) the pattern does not contain any characters below 256
| 3) the target string is not UTF-8.
|
| (The committer changed the test to test this issue on EBCDIC, as the original \xFF is an invariant there that wouldn't exercise the problem. We want a start byte for a long UTF-8 sequence for a single character. On the EBCDIC pages we support, \xFE fits that bill.)
* Subject: PATCH: [perl #132063]: Heap buffer overflow (Karl Williamson, 2018-04-16, 1 file, -15/+18)
| There were three things that were fixed as a result of this ticket, any one of which would have avoided the issue.
|
| Commit 421da25c4318861925129cd1b17263289db3443c already has fixed one of those. The issue was reading beyond the end of a buffer, and that commit keeps from reading beyond a NUL, which normally should be present, marking the end of the buffer.
|
| This commit fixes the issue where the code was told that reading that many bytes was ok to do. This is several instances in regexec.c of the code assuming that the input was valid UTF-8, whereas the input was too short for what the start byte claimed it would be.
|
| I grepped through the core for any other similar uses, and did not find any.
|
| The next commit will fix the third thing.
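The kind of check involved, comparing the length a start byte claims against the bytes actually left in the buffer, might be sketched as below. The helpers `utf8_claimed_len()` and `utf8_char_fits()` are hypothetical, and the length table is for standard UTF-8 only, which is an assumption: Perl's internal encoding permits longer sequences.

```c
#include <assert.h>
#include <stddef.h>

/* Length that a standard UTF-8 start byte claims for its character, or 0
 * for a byte that cannot start one (continuation or invalid byte). */
static size_t utf8_claimed_len(unsigned char start)
{
    if (start < 0x80) return 1;
    if (start >= 0xC2 && start <= 0xDF) return 2;
    if (start >= 0xE0 && start <= 0xEF) return 3;
    if (start >= 0xF0 && start <= 0xF4) return 4;
    return 0;
}

/* Returns 1 only if the whole character starting at s lies before end. */
static int utf8_char_fits(const unsigned char *s, const unsigned char *end)
{
    size_t need;

    assert(s != NULL && end != NULL && s <= end);
    if (s == end)
        return 0;                       /* nothing left to read */
    need = utf8_claimed_len(*s);
    return need != 0 && need <= (size_t)(end - s);
}
```

A caller that performs this test before decoding never reads past `end`, even when the input is truncated in the middle of a character.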
* Revert "S_regmatch: eliminate WHILEM_A_min paren saving" (David Mitchell, 2018-04-16, 1 file, -2/+5)
| This reverts commit 77584140f7cbfe714083cacfa671085466e98a7b.
|
| This optimisation of mine from 5.25.9 is ill-conceived; under the right permutations of backtracking, it is possible for the current positions of one or more captures not to be restored to their previous positions.
|
| This commit reverts the code change, but keeps the benchmark part of that commit, and adds a test.
* S_regmatch(): improve debugging output (David Mitchell, 2018-04-09, 1 file, -20/+20)
| Make the various debugging outputs identify where the message is coming from; e.g. change
|
|     trying longer...
|
| to
|
|     WHILEM: B min fail: trying longer...
|
| and change some existing "whilem: ..." messages to "WHILEM: ..." for consistency.
* Use unsigned to avoid compiler warning (Karl Williamson, 2018-04-03, 1 file, -2/+2)
| The code points that Unicode furnishes will always be unsigned. This changes to uniformly treat the ones in the constructed tables of Unicode properties to be unsigned, avoiding possible signedness compiler warnings on some systems.
|
| Spotted by Dave Mitchell.
* regexec.c: Use macro intended for the purpose (Karl Williamson, 2018-04-02, 1 file, -1/+1)
| The macro hides the underlying implementation detail.
* regexec.c: Silence a compiler warning (Karl Williamson, 2018-03-31, 1 file, -1/+2)
| The argument is 32 bits, but only the lowest 8 are used.
* regexec.c: Simplify a little (Karl Williamson, 2018-03-31, 1 file, -15/+9)
| A previous commit has changed the circumstances of this code so that we know certain things to be true that we didn't use to.
* regexec.c: White-space only (Karl Williamson, 2018-03-31, 1 file, -25/+24)
| This outdents due to the removal of an enclosing block by a previous commit.
* Use compiled-in C structure for inverted case folds (Karl Williamson, 2018-03-31, 1 file, -39/+16)
| This commit changes to use the C data structures generated by the previous commit to compute what characters fold to a given one. This is used to find out what things should match under /i.
|
| This now avoids the expensive start-up cost of switching to perl utf8_heavy.pl, loading a file from disk, and constructing a hash from it.
* regexec.c: Remove no longer used macros (Karl Williamson, 2018-03-31, 1 file, -26/+0)
| These are unused now that all the POSIX class lookups are done through inversion lists, instead of swashes.
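For readers unfamiliar with the data structure: an inversion list is a sorted array of code points in which each element marks where membership in the set flips. A lookup is then a binary search, as in the following sketch. `invlist_contains_sketch()` is an illustrative function written for this note, not Perl's implementation or API.

```c
#include <assert.h>
#include <stddef.h>

/* list[] is sorted ascending.  list[0], list[2], ... start ranges that are
 * IN the set; list[1], list[3], ... start ranges that are OUT of it. */
static int invlist_contains_sketch(const unsigned long *list, size_t len,
                                   unsigned long cp)
{
    size_t lo = 0, hi = len;    /* search the half-open window [lo, hi) */

    assert(list != NULL || len == 0);

    /* Count how many elements are <= cp. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (list[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* lo elements are <= cp, the last of them at index lo - 1.  cp is in
     * the set when that index is even, that is, when lo is odd. */
    return (lo % 2) == 1;
}
```

For example, the ASCII alphanumerics can be written as the six-element list `{0x30, 0x3A, 0x41, 0x5B, 0x61, 0x7B}`; the lookup needs no hash and no file on disk, only a constant array that can be compiled in.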
* regexec.c: White-space, comment only (Karl Williamson, 2018-03-31, 1 file, -92/+94)
| Fix up indentation based on the previous few commits
* regexec.c: Convert swash lookup to inversion list (Karl Williamson, 2018-03-31, 1 file, -64/+12)
* regexec.c: Convert swash lookup to inversion list (Karl Williamson, 2018-03-31, 1 file, -50/+9)
| Previously this had two loops; the first one was used to keep from loading the swash for as long as possible. Now that it is loaded by default, there is no need to do this. This overwrites the first loop with the second loop.
* regexec.c: Convert swash lookup to inversion list (Karl Williamson, 2018-03-31, 1 file, -18/+8)
* regexec.c: Explicitly use case: instead of default: (Karl Williamson, 2018-03-31, 1 file, -2/+2)
| This is so the default: can be used for another purpose in the next commit.
* regexec.c: Check for UTF-8 fitting (Karl Williamson, 2018-03-31, 1 file, -5/+6)
| We've been burned before by malformed UTF-8 causing us to read outside the buffer bounds. Here is a case I saw during code inspection, and it's easy to add the buffer end limit.
* regexec.c: Convert one swash to inversion list (Karl Williamson, 2018-03-31, 1 file, -18/+3)
| I'm doing this one-at-a-time for bisection reasons, in case I make a mistake.
* regexec.c: Rmv obsolete macro (Karl Williamson, 2018-03-31, 1 file, -7/+0)
| This macro is obsolete because the inversion list for this property is now always loaded, so no need to load.
* regexec.c: Silence compiler warning (Karl Williamson, 2018-03-31, 1 file, -0/+1)
| When this #ifdef'd code was compiled, there was a warning.
* EBCDIC conditional compilation fixes (Karl Williamson, 2018-03-05, 1 file, -2/+23)
| The recent changes fixed by this commit neglected to take into account EBCDIC differences. Mostly, the algorithms apply only to ASCII platforms, so the EBCDIC is ifdef'd out. In a couple cases, the algorithm mostly applies, so the scope of the ifdefs is smaller.
* regexec.c: White-space only (Karl Williamson, 2018-03-05, 1 file, -4/+4)
| Properly indent preprocessor directives
* Remove parameter from isSCRIPT_RUN (Karl Williamson, 2018-03-01, 1 file, -2/+5)
| Daniel Dragan pointed out that this parameter is unused (the commits that want it didn't get into 5.28), and is causing a table to be duplicated all over the place, so just remove it for now.
* PATCH: [perl #132900] Blead Breaks CPAN: FELIPE/Crypt-Perl (Karl Williamson, 2018-02-22, 1 file, -14/+14)
| The root cause of this was using a 'char' where it should have been 'U8'. I changed the signatures so that all the related functions take and return U8's, and the compiler detects what should be cast to/from char.
|
| The functions all deal with byte bit patterns, so unsigned is the appropriate declaration.
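Why the signedness matters for byte bit patterns can be seen in a small sketch. The functions and the `0xC0` threshold are invented for illustration, `U8_sketch` stands in for Perl's `U8`, and the signed result shown assumes the usual two's-complement representation.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char U8_sketch;    /* stand-in for Perl's U8 */

/* Buggy variant: a byte in 0x80..0xFF read through a signed type becomes
 * negative, so it can never compare >= 0xC0. */
static int is_high_byte_signed(const void *p)
{
    int v;

    assert(p != NULL);
    v = *(const signed char *)p;
    return v >= 0xC0;
}

/* Correct variant: the byte keeps its value in 0..255. */
static int is_high_byte_unsigned(const void *p)
{
    int v;

    assert(p != NULL);
    v = *(const U8_sketch *)p;
    return v >= 0xC0;
}
```

Plain `char` may be either signed or unsigned depending on the platform, which is how a bug of this kind can stay hidden on one system and break code on another.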
* Change name of regnode for clarity (Karl Williamson, 2018-02-16, 1 file, -18/+18)
| The EXACTFA nodes are in fact not generated by /a, but by /aa. Change the name to EXACTFAA to correspond. I found myself getting confused by this.
* fix perl #132630, dont try to fbm match past end of string (Yves Orton, 2018-02-07, 1 file, -1/+3)
* Speed up finding non-UTF8 EXACTFish initial matches (Karl Williamson, 2018-02-01, 1 file, -13/+49)
| find_byclass() is used to scan through a target string looking for something that matches a particular class. This commit speeds that up for patterns of the /foo/i type, where neither the target string nor the pattern are UTF-8. More precisely, it speeds up only finding the first byte of 'foo' in the string.
|
| The actual matching speed remains the same, once that initial character that is a potential match is found. But finding that first character is sped up immensely by this commit.
|
| It does this by using memchr() when the character is caseless. For example in the pattern /:abcde/i, the colon is first, and is caseless. On my system memchr is extremely fast, so the numbers below for this case may not be as good on other systems.
|
| And when the first character is cased, such as in /abcde/i, it uses the techniques added in 2813d4adc971fbaa124b5322d4bccaa73e9df8e2 for the ANYOFM regnode. In both ASCII and EBCDIC machines, the case folds of the cased letters are almost all of the form that these techniques work on. There are no tests in our current test suite that don't have this form. However, /il (locale) matching may very well violate this, and will use the per-byte scheme that has been in effect until this commit.
|
| The numbers below are for finding the first letter after a long string that doesn't include that character. Doing this isolates the speed up attributable to this commit from overhead.
|
| The only downsides of this commit are that on some systems memchr() may introduce function call overhead that won't pay off if the next occurrence of the character is close by; and in the other case, a single extra conditional is required to determine if there is at least a word's worth of data to look at, plus some masking, shifting, and arithmetic instructions associated with that conditional. A static function is called, so there may or may not be function call overhead, depending on the compiler optimizer.
|
| Key:
|     Ir   Instruction read
|     Dr   Data read
|     Dw   Data write
|     COND conditional branches
|     IND  indirect branches
|
| The numbers represent raw counts per loop iteration.
|
| caseless first letter
|     ('b' x 10000) . ':' =~ /:a/i
|
|               blead     fast  Ratio %
|            -------- -------- --------
|     Ir      72109.0   4819.0   1496.3
|     Dr      20608.0   1237.5   1665.3
|     Dw      10409.0    409.5   2541.9
|     COND    20376.0    702.0   2902.6
|     IND        15.0     16.0     93.8
|
| cased first letter
|     ('b' x 10000) . 'a' =~ /A/i
|
|               blead     fast  Ratio %
|            -------- -------- --------
|     Ir     103074.0  25704.6    401.0
|     Dr      20896.5   2164.9    965.2
|     Dw      10587.5    601.9   1759.0
|     COND    30516.0   3036.2   1005.1
|     IND        22.0     22.0    100.0
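The two search strategies can be sketched in simplified form as follows. This is not the code from the commit: the function names are invented, the cased search works one byte at a time rather than a word at a time, and the mask relies on the ASCII property that the two cases of a letter differ only in bit 0x20.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The first byte has no case (e.g. ':'): memchr() can scan for it directly. */
static const char *find_caseless_byte(const char *s, size_t len, char c)
{
    assert(s != NULL || len == 0);
    return len ? memchr(s, (unsigned char)c, len) : NULL;
}

/* The first byte is an ASCII letter: clearing bit 0x20 maps both cases to
 * the same value, so a single comparison accepts either one. */
static const char *find_ascii_letter_any_case(const char *s, size_t len,
                                              char letter)
{
    const unsigned char want = (unsigned char)letter & (unsigned char)~0x20u;
    size_t i;

    assert(s != NULL || len == 0);
    assert((letter >= 'A' && letter <= 'Z') ||
           (letter >= 'a' && letter <= 'z'));

    for (i = 0; i < len; i++) {
        if (((unsigned char)s[i] & (unsigned char)~0x20u) == want)
            return s + i;
    }
    return NULL;
}
```

Either function only locates a candidate starting position; the rest of the pattern still has to be matched from there in the usual way.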
* regexec.c: Don't retest the same byte immediately (Karl Williamson, 2018-02-01, 1 file, -1/+1)
| In this macro, COND has just returned true for the given byte. We then need to test that the rest of the relevant portion of the input string and pattern match. But before this commit, we started at the byte we already know the answer for. Change to test starting one position over.
* isSCRIPT_RUN: Document in perlintern (Karl Williamson, 2018-01-30, 1 file, -6/+46)
* An empty string is a script_run, but marked INVALID (Karl Williamson, 2018-01-30, 1 file, -1/+1)