path: root/regexec.c
Commit message / Author / Age / Files / Lines
* Correct for build-time warningJames E Keenan2021-01-101-1/+1
| | | | | | Addresses this build-time warning: suggest braces around initialization of subobject [-Wmissing-braces]
* regexec.c: Fix assertion failure GH #18451Karl Williamson2021-01-031-13/+26
| | | | | This was caused by copying too many characters for the size of the buffer. Only one character is needed.
* regexec.c: Clarify commentsKarl Williamson2021-01-031-3/+6
|
* regexec.c: Silence compiler warningKarl Williamson2020-12-211-1/+1
|
* regexec.c: Fix failing CI 32-bit testsKarl Williamson2020-12-211-1/+3
| | | | | | This bug was introduced by bb3825626ed2b1217a2ac184eff66d0d4ed6e070, and was the result of overflowing a 32 bit space. The solution is to rework the expression so that it can't overflow.
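The general shape of this kind of fix can be sketched as follows. This is not the expression from the commit, which this log does not show; the helper name `fits_without_overflow` and its parameters are hypothetical. The idea is to replace an addition that can wrap in 32 bits with a subtraction that cannot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: do `need` more units fit between `pos` and `limit`?
 * The naive test `pos + need <= limit` can wrap around in 32 bits and
 * wrongly succeed.  Reworked as a subtraction, nothing can overflow,
 * provided pos <= limit. */
static bool fits_without_overflow(uint32_t pos, uint32_t need, uint32_t limit)
{
    assert(pos <= limit);          /* precondition that keeps limit - pos valid */
    return need <= limit - pos;
}
```

With `pos` and `need` both near 4 billion, the naive sum wraps to a small number and passes the check; the reworked form correctly rejects it.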
* regexec.c: Link to github issue in commentKarl Williamson2020-12-201-0/+3
|
* regexec.c: White-space, comments onlyKarl Williamson2020-12-191-48/+50
| | | | Mostly indent because the prior commit created a new block
* regexec.c: Revamp S_setup_EXACTISH_ST() loop end conditionsKarl Williamson2020-12-191-599/+780
| | | | | | | | | | | Consider the pattern /A*B/ where A and B are arbitrary. The pattern matching code tries to make a tight loop to match the span of A's. The logic of this was not really updated when UTF-8 was added. I did revamp it some releases ago to fix some bugs and to at least consider UTF-8. This commit changes it so that Unicode is now a first class citizen. Some details are listed in the ticket GH #18414
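As a rough illustration of the /A*B/ case described above, here is a byte-only sketch in which A and B are single bytes. It is not the logic of S_setup_EXACTISH_ST(), and it ignores UTF-8 entirely; the function name and signature are hypothetical. It only shows the "tight loop over the span of A's, then look for B" shape.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: match /A*B/ anchored at s[0], where A and B are
 * single bytes.  Returns the length of the match, or -1 if none. */
static long match_star_then(const char *s, size_t len, char a, char b)
{
    size_t n = 0;

    assert(s != NULL);

    /* Tight loop: consume the longest possible span of A's. */
    while (n < len && s[n] == a)
        n++;

    /* Backtrack from the longest span until B follows it. */
    for (;;) {
        if (n < len && s[n] == b)
            return (long)(n + 1);
        if (n == 0)
            return -1;
        n--;
    }
}
```

The backtracking step matters when A and B can match the same byte: for /a*a/ against "aaa", the span first takes all three bytes and must give one back.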
* regexec.c: Change name of static functionKarl Williamson2020-12-191-4/+4
| | | | The new name reflects its new functionality coming in future commits
* regexec.c: Trim trailing blanksKarl Williamson2020-12-191-87/+87
|
* Restrict scope/Shorten some very long macro namesKarl Williamson2020-11-221-2/+0
| | | | | | The names were intended to force people to not use them outside their intended scopes. But by restricting those scopes in the first place, we don't need such unwieldy names
* autodoc.pl: Enhance apidoc_section featureKarl Williamson2020-11-061-1/+1
| | | | | | | | | | | This feature allows documentation destined for perlapi or perlintern to be split into sections of related functions, no matter where the documentation source is. Prior to this commit the line had to contain the exact text of the title of the section. Now it can be a $variable name that autodoc.pl expands to the title. It still has to be an exact match for the variable in autodoc, but now, the expanded text can be changed in autodoc alone, without other files needing to be updated at the same time.
* regexec.c: Store expression in a variableKarl Williamson2020-10-161-10/+11
| | | | | | This makes the text look cleaner, and prepares for a future commit, where we will want to change the variable (which can't be done with the expression).
* regexec.c: Change variable name in a functionKarl Williamson2020-10-161-8/+8
| | | | This makes it like a corresponding variable.
* regexec.c: Rename local variable; change typeKarl Williamson2020-10-161-32/+33
| | | | | | | | | | | I found myself getting confused, as this most likely was named before UTF-8 came along. It actually is just a byte, plus an out-of-bounds value. While I'm at it, I'm also changing the type from I32, to the perl equivalent of the C99 'int_fast16_t', as it doesn't need to be 32 bits, and we should let the compiler choose what size is the most efficient that still meets our needs.
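A minimal sketch of the "byte plus an out-of-bounds value" idea, using the C99 type directly. The sentinel name `NO_SINGLE_BYTE` and the function are hypothetical and not taken from regexec.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel: any value outside 0..255 works as "no byte". */
#define NO_SINGLE_BYTE (-1)

/* Returns the first byte of s as 0..255, or NO_SINGLE_BYTE if s is empty.
 * int_fast16_t holds both the byte range and the sentinel, and lets the
 * compiler pick whichever width of at least 16 bits is fastest. */
static int_fast16_t leading_byte(const unsigned char *s, size_t len)
{
    assert(s != NULL || len == 0);
    return len > 0 ? (int_fast16_t)s[0] : NO_SINGLE_BYTE;
}
```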
* regcomp.c,regexec.c: SimplifyKarl Williamson2020-10-161-11/+3
| | | | | This commit uses the new macros from the previous commit to simplify some code.
* regexec.c: Macroize another common paradigmKarl Williamson2020-10-141-22/+16
|
* regexec.c: Macroize a common paradigmKarl Williamson2020-10-141-19/+12
|
* regexec.c: Rename a static variableKarl Williamson2020-10-141-5/+9
| | | | | This is to distinguish it from a similar variable being added in a future commit
* regexec.c: find_byclass(): RestructureKarl Williamson2020-10-141-465/+754
| | | | | | | | | | | | This is a follow-on to the previous commit. The case number of the main switch statement now includes three things: the regnode op, the UTF8ness of the target, and the UTF8ness of the pattern. This allows the conditionals within the previous cases (which only encoded the op) to be removed, and things to be moved around so that there are more fall-throughs and fewer gotos. The macros that are called no longer have to test for UTF8ness, so I teased the UTF8 ones apart from the non-UTF8 ones.
* regexec.c: S_find_byclass(): utf8ness in switch()Karl Williamson2020-10-141-40/+40
| | | | | | | | | | | | | | | | | | | | | | | | This uses the #defines created in the previous commit to make the switch statement in this function incorporate the UTF8ness of both the pattern and the target string. The reason for this is that the first statement in nearly every case of the switch is to test if the target string being matched is UTF-8 or not. By putting that information into the case number, those conditionals can be eliminated, leading to cleaner, more modular code. I had hoped that this would also improve performance since there are fewer conditionals, but Sergey Aleynikov did performance testing of this change for me, and found no noticeable gain or loss. Further, the cases involving matching EXACTish nodes have to also test if the pattern is UTF-8 or not before doing anything else. I added that information as well to the case number, so that those conditionals can be eliminated. For the non-EXACTish nodes, it simply means that two case statements execute the same code. This is an intermediate commit, which only does the expansion of the current cases into four for each. The refactoring that takes advantage of this is in the following commit.
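The technique of folding the UTF8ness flags into the case number can be sketched like this. The op numbers, the `CASE_NUM` macro, and the handler ids are all hypothetical stand-ins; the real #defines are in the commit and are not shown in this log.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical op numbers; the real regnode ops are defined elsewhere. */
enum { OP_EXACT = 0, OP_ANYOF = 1 };

/* Hypothetical handler ids, standing in for the bodies of the cases. */
enum handler {
    H_EXACT_BYTES,        /* neither pattern nor target is UTF-8 */
    H_EXACT_UTF8_PAT,     /* only the pattern is UTF-8 */
    H_EXACT_UTF8_TARGET,  /* only the target is UTF-8 */
    H_EXACT_UTF8_BOTH,
    H_ANYOF_BYTES,
    H_ANYOF_UTF8_TARGET
};

/* Fold the op and both UTF8ness flags into one switch value. */
#define CASE_NUM(op, utf8_target, utf8_pat) \
    (((op) << 2) | ((utf8_target) ? 2 : 0) | ((utf8_pat) ? 1 : 0))

static enum handler pick_handler(int op, bool utf8_target, bool utf8_pat)
{
    assert(op == OP_EXACT || op == OP_ANYOF);

    switch (CASE_NUM(op, utf8_target, utf8_pat)) {
    /* EXACTish: four cases, no UTF8ness tests needed inside them. */
    case CASE_NUM(OP_EXACT, 0, 0): return H_EXACT_BYTES;
    case CASE_NUM(OP_EXACT, 0, 1): return H_EXACT_UTF8_PAT;
    case CASE_NUM(OP_EXACT, 1, 0): return H_EXACT_UTF8_TARGET;
    case CASE_NUM(OP_EXACT, 1, 1): return H_EXACT_UTF8_BOTH;

    /* Non-EXACTish: pattern UTF8ness is irrelevant, so pairs of
     * case labels share the same code. */
    case CASE_NUM(OP_ANYOF, 0, 0):
    case CASE_NUM(OP_ANYOF, 0, 1): return H_ANYOF_BYTES;
    case CASE_NUM(OP_ANYOF, 1, 0):
    case CASE_NUM(OP_ANYOF, 1, 1): return H_ANYOF_UTF8_TARGET;
    }
    return H_EXACT_BYTES; /* not reached for the ops asserted above */
}
```

Because `CASE_NUM` expands to an integer constant expression when given constants, the same macro can build both the switch value and the case labels.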
* regexec: disallow zero-width nodes in regrepeatHugo van der Sanden2020-10-081-19/+0
| | | | | GH #17594: the logic here expects the node to have width 1 (except for LNBREAK); it is not expected to do the right thing on zero-width nodes.
* regexec.c: White-space onlyKarl Williamson2020-10-021-169/+170
| | | | Adjust indentation as a result of the previous commit.
* S_find_byclass() Restructure bounds checkingKarl Williamson2020-10-021-59/+16
| | | | | | | There are five \b variants. Plain \b (without braces) is the outlier as far as implementation. This commit moves the handling of plain \b to outside the switch that handles the others. That allows the duplicate code that previously existed to be consolidated into one occurrence.
* Change some =head1 to apidoc_section linesKarl Williamson2020-09-041-1/+1
| | | | | apidoc_section is slightly favored over head1, as it is known only to autodoc, and can't be confused with real pod.
* Use av_top_index() instead of av_tindex()Karl Williamson2020-08-191-1/+1
| | | | | | | I was never happy with this short form, and other people weren't either. Now that most things are better expressed in terms of av_count, convert the few remaining items that are clearer when referring to an index into using the fully spelled out form
* regexec.c: Use withinCOUNT()Karl Williamson2020-08-081-7/+2
| | | | This is faster, and clearer
* regexec.c: Clarify commentKarl Williamson2020-08-081-1/+1
|
* regexec.c: Use UTF8SKIP for utf8 stringKarl Williamson2020-07-301-1/+1
| | | | | This code was advancing per-byte on a UTF-8 string, which still works, but is slower than need be.
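To illustrate stepping per character rather than per byte, here is a sketch for standard UTF-8. The helper `utf8_skip` is a hypothetical stand-in for Perl's UTF8SKIP, which is implemented differently and also covers Perl's extended UTF-8; this version assumes well-formed input of at most 4 bytes per character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for UTF8SKIP: length of the character whose
 * first byte is `lead`, for standard UTF-8 only. */
static size_t utf8_skip(unsigned char lead)
{
    if (lead < 0x80)           return 1;  /* ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;  /* continuation or invalid byte: step over it */
}

/* Count characters by jumping from start byte to start byte, instead of
 * examining every byte in between. */
static size_t count_chars(const unsigned char *s, size_t len)
{
    size_t i = 0, n = 0;

    assert(s != NULL);
    while (i < len) {
        i += utf8_skip(s[i]);
        n++;
    }
    return n;
}
```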
* Remove use of dVAR in coreDagfinn Ilmari Mannsåker2020-07-201-15/+0
| | | | | It only does anything under PERL_GLOBAL_STRUCT, which is gone. Keep the dNOOP definition for CPAN back-compat
* regexec.c: Fix commentKarl Williamson2020-07-171-1/+2
|
* regexec.c: Don't use sizeof()Karl Williamson2020-07-171-1/+1
| | | | | A future commit will change this array so that its size isn't known at compilation time.
* Fix a bunch of repeated-word typosDagfinn Ilmari Mannsåker2020-05-221-2/+2
| | | | | Mostly in comments and docs, but some in diagnostic messages and one case of 'or die die'.
* regexec.c: Clean up debug callHugo van der Sanden2020-03-111-3/+5
| | | | | | The code this replaces relies on the internal structure of a macro, which can change and break things. This commit changes to use a more straightforward way of accomplishing the same thing.
* Allow debugging from regexec.c back to regcomp.cKarl Williamson2020-03-111-2/+8
| | | | | | | | | The compilation of User-defined properties in a regular expression that haven't been defined at the time that pattern is compiled is deferred until execution time. Until this commit, any request for debugging info on those was ignored. This fixes that by
* regex: Change internal macro nameKarl Williamson2020-03-051-9/+9
| | | | | It wasn't clear to me that the macro did more than a declaration, given its name. Rename it to be clear as to what it does.
* regexec.c: Fix Debug statementKarl Williamson2020-02-261-2/+2
| | | | | | | | A proper debugging statement isn't just controlled by DEBUG_r; it also needs to name the class of debugging that controls it, so that re.pm can operate properly. This is the second of two cases in the code where it was wrong.
* regexec.c: Fix Debug statementKarl Williamson2020-02-261-1/+1
| | | | | | | | A proper debugging statement isn't just controlled by DEBUG_r; it also needs to name the class of debugging that controls it, so that re.pm can operate properly. This is one of two cases in the code where it was wrong.
* Change Unicode property abbrev to upcoming officialKarl Williamson2020-01-301-1/+1
| | | | | | | | | | | | Unicode 12.0 used a new property file that was not from the Unicode Character Database. It only had a long property name. I incorporated it into our data, and rather than use the very long name all the time, I created my own short name, since there was no official one. Now, the upcoming 13.0 has moved the file to the UCD, and come up with a short name that differs from the one I had. This commit converts to use Unicode's name. This property is not exposed to user or XS space, so there is no user impact.
* regexec: don't increment recursion counter for non-postponed EVALHugo van der Sanden2020-01-271-1/+1
| | | | | It wasn't intended to be part of the recursion logic, and doesn't get decremented again (GH 17490).
* Change parameter type of static fcnKarl Williamson2020-01-031-1/+1
| | | | | This makes the first parameter consistent with the other similar parameter.
* Change some structures/fcns to use I32 and U32Karl Williamson2020-01-031-1/+1
| | | | | | | This is because these deal with only legal Unicode code points, which are restricted to 21 bits, so 16 is too few, but 32 is sufficient to hold them. Doing this saves some space/memory on 64 bit builds where an int is 64 bits.
* regexec.c: Clarify commentKarl Williamson2019-12-111-1/+1
|
* Rmv leading underscore from macro nameKarl Williamson2019-12-111-2/+2
| | | | | | | These are illegal in C, but we have plenty of them around; I happened to be looking at this function, and decided to fix it. Note that only the macro name is illegal; the function was fine, but to change the macro name means changing the function one.
* Add ANYOFHs regnodeKarl Williamson2019-11-201-1/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | This node is like ANYOFHb, but is used when more than one leading byte is the same in all the matched code points. ANYOFHb is used to avoid having to convert from UTF-8 to code point for something that won't match. It checks that the first byte in the UTF-8 encoded target is the desired one, thus ruling out most of the possible code points. But for higher code points that require longer UTF-8 sequences, a great many non-matching code points pass this filter. It's ineffective for almost 200K code points above 0xFFFF. This commit creates a new node type that addresses this problem. Instead of a single byte, it stores as many leading bytes as are the same for all code points that match the class. For many classes, that will cut down the number of possible false positives by a huge amount before having to convert to code point to make the final determination. This regnode adds a UTF-8 string at the end. It is still much smaller, even in the rare worst case, than a plain ANYOF node because the maximum string length, 15 bytes, is still shorter than the 32-byte bitmap that is present in a plain ANYOF. Most of the time the added string will instead be at most 4 bytes.
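The leading-bytes filter described above can be sketched as a plain prefix comparison. The function name and parameters are hypothetical, and how the real regnode stores its string is not shown here; this only illustrates rejecting a target before any UTF-8 decoding takes place.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: `prefix` holds the leading UTF-8 bytes shared by
 * every code point in the class.  A target that does not start with those
 * bytes cannot match, so it is rejected without being decoded.  A target
 * that passes still needs the full class check afterwards. */
static bool passes_prefix_filter(const unsigned char *target, size_t target_len,
                                 const unsigned char *prefix, size_t prefix_len)
{
    assert(target != NULL && prefix != NULL);

    if (target_len < prefix_len)
        return false;
    return memcmp(target, prefix, prefix_len) == 0;
}
```

For example, every code point from U+1F600 to U+1F63F encodes as F0 9F 98 followed by one more byte, so a three-byte prefix rules out U+1F680 (F0 9F 9A 80), which a first-byte-only check would let through.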
* Add ANYOFRb regnodeKarl Williamson2019-11-171-0/+68
| | | | | | | | | | | | | | | | | | | | | | This is like the ANYOFR regnode added in the previous commit, but all code points in the range it matches are known to have the same first UTF-8 start byte. That means it can't match UTF-8 invariant characters, like ASCII, because the "start" byte is different on each one, so it could only match a range of 1, and the compiler wouldn't generate this node for that; instead using an EXACT. Pattern matching can rule out most code points by looking at the first character of their UTF-8 representation, before having to convert from UTF-8. On ASCII this rules out all but 64 2-byte UTF-8 characters from this simple comparison. 3-byte it's up to 4096, and 4-byte, 2**18, so the test is less effective for higher code points. I believe that most UTF-8 patterns that otherwise would compile to ANYOFR will instead compile to this, as I can't envision real life applications wanting to match large single ranges. Even the 2048 surrogates all have the same first byte.
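The counts quoted above follow from standard UTF-8 on ASCII platforms, where each continuation byte carries 6 bits, so a given start byte is shared by 64^(n-1) characters of an n-byte sequence. A small sketch of that arithmetic (the function name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Number of code points sharing one start byte, for a standard UTF-8
 * sequence of seq_len bytes: each of the seq_len - 1 continuation bytes
 * contributes 6 bits. */
static uint32_t code_points_per_start_byte(unsigned seq_len)
{
    assert(seq_len >= 2 && seq_len <= 4);
    return (uint32_t)1 << (6 * (seq_len - 1));
}
```

This gives 64 for 2-byte, 4096 for 3-byte, and 2**18 for 4-byte sequences, matching the figures in the message.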
* Add ANYOFR regnodeKarl Williamson2019-11-171-0/+65
| | | | | | | | | | | | | | | | | | | | | | | | | | | This matches a single range of code points. It is both faster and smaller than other ANYOF-type nodes, requiring, after set-up, a single subtraction and conditional branch. The vast majority of Unicode properties match a single range (though most of the properties likely to be used in real world applications have more than a single range). But things like [ij] are a single range, and those are quite commonly encountered. This new regnode matches them more efficiently than a bitmap would, and doesn't require the space for one either. The flags field is used to store the minimum matchable start byte for UTF-8 strings, and is ignored for non-UTF-8 targets. This, like ANYOFH nodes which have a similar mechanism, allows for quick weeding out of many possible matches without having to convert the UTF-8 to its corresponding code point. This regnode packs the 32 bit argument with 20 bits for the minimum code point the node matches, and 12 bits for the maximum range. If the input is a value outside these, it simply won't compile to this regnode, instead going to one of the ANYOFH flavors. ANYOFR is sufficient to match all of Unicode except for the final (private use) 65K plane.
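The "single subtraction and conditional branch" test, and the 20-bit/12-bit packing, can be sketched as below. Which half of the 32-bit argument holds the minimum and which holds the range is an assumption here, as are all the names; the message states only the bit widths.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: low 20 bits = minimum code point, high 12 bits = size
 * of the range above it.  Only the widths come from the commit message. */
#define BASE_BITS 20u
#define BASE_MASK ((UINT32_C(1) << BASE_BITS) - 1)
#define DELTA_MAX ((UINT32_C(1) << (32u - BASE_BITS)) - 1)

static uint32_t pack_range(uint32_t lo, uint32_t hi)
{
    /* Ranges that do not fit cannot use this representation. */
    assert(lo <= hi && lo <= BASE_MASK && hi - lo <= DELTA_MAX);
    return ((hi - lo) << BASE_BITS) | lo;
}

static bool in_packed_range(uint32_t arg, uint32_t cp)
{
    uint32_t base  = arg & BASE_MASK;
    uint32_t delta = arg >> BASE_BITS;

    /* One unsigned subtraction and one comparison: if cp < base the
     * subtraction wraps to a huge value, which is necessarily > delta. */
    return cp - base <= delta;
}
```

For [ij], `pack_range('i', 'j')` stores a minimum of 0x69 and a range of 1, and the check accepts only those two code points.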
* regexec.c: Rmv some unnecessary castsKarl Williamson2019-11-171-5/+5
| | | | The called macro does the cast, and this makes it more legible
* Change the names of some regnodesKarl Williamson2019-10-291-14/+14
| | | | The new name is shorter and I believe, clearer.
* Move Unicode.org URLs to https:// in source codeMax Maischein2019-10-111-2/+2
| | | | | This URL is outdated, but the link forwards to the correct section in a PDF.