summaryrefslogtreecommitdiff
path: root/regcomp.h
Commit message (Collapse)AuthorAgeFilesLines
* regcomp.c - decompose into smaller filesYves Orton2022-12-091-96/+96
| | | | | | | | | | | | | | | | | This splits a bunch of the subcomponents of the regex engine into smaller files. regcomp_debug.c regcomp_internal.h regcomp_invlist.c regcomp_study.c regcomp_trie.c The only real change besides to the build machine to achieve the split is to also adds some new defines which can be used in embed.fnc to control exports without having to enumerate /every/ regex engine file. For instance all of regcomp*.c defines PERL_IN_REGCOMP_ANY, and this is used in embed.fnc to manage exports.
* regex engine - cleanup internal tabs and ws (use -w to ignore)Yves Orton2022-12-091-95/+95
| | | | | | | | | | | | | | Having internal tabs causes confusion in diffs and reviews. In the following patch I will move a lot of code around, creating new files and they will all be whitespace clean: no trailing whitespace, tabs expanded to the next tabstop properly, and no trailing empty lines at the bottom of the file. This patch prepares for that split, and future splits and changes to the regex engine by precleaning the main regex engine files with the same rules. It should show no changes under '-w'.
* regcomp.c - add a PARNO() macro to wrap the ARG() macroYves Orton2022-11-101-0/+3
| | | | | | | | | | | | | | We used the ARG() macro to access the parno data for the OPEN and CLOSE regops. This made it difficult to find what needed to change when the type and size or location of this data in the node was modified. Replacing this access with a specific macro makes the code more legible and future proof. This was actually backported from finding everything that broke by changing the regnode type for OPEN and CLOSE to 2L and moving the paren parameter to the 2L slot. We might do something like this in the future and separating the PARNO() macros from their implementation will make it easier.
* regcomp.h - put STMT_START on its own line and lined up with STMT_ENDYves Orton2022-09-071-3/+6
|
* avoid dereferencing prog which may be NULLTony Cook2022-08-081-0/+1
| | | | CID 353002
* regex engine - replace many attribute arrays with oneYves Orton2022-08-061-5/+6
| | | | | | | | | | | | | | This replaces PL_regnode_arg_len, PL_regnode_arg_len_varies, PL_regnode_off_by_arg and PL_regnode_kind with a single PL_regnode_info array, which is an array of struct regnode_meta, which contains the same data but as a struct. Since PL_regnode_name is only used in debugging builds of the regex engine we keep it separate. If we add more debug properties it might be good to create a PL_regnode_debug_info[] to hold that data instead. This means when we add new properties we do not need to modify any secondary sources to add new properites, just the struct definition and regen/regcomp.pl
* regex engine - wrap PL_regnode_name with macro REGNODE_NAME()Yves Orton2022-08-061-0/+1
|
* regex engine - wrap PL_regnode_arg_len_varies with macro ↵Yves Orton2022-08-061-0/+1
| | | | REGNODE_ARG_LEN_VARIES()
* regex engine - wrap PL_regnode_arg_len with macro REGNODE_ARG_LEN()Yves Orton2022-08-061-3/+4
|
* regex engine - wrap PL_regnode_off_by_arg with macro REGNODE_OFF_BY_ARG()Yves Orton2022-08-061-0/+1
|
* regex engine - wrap PL_regnode_kind with macro REGNODE_TYPE()Yves Orton2022-08-061-1/+3
| | | | The code confusing uses type and kind as synonyms. Lets end that bad habit
* regex engine - improved comments explaining REGNODE_AFTER()Yves Orton2022-08-031-19/+60
| | | | | This rewrites one comment to include more explanation of the difference between Perl_regnext() and REGNODE_AFTER().
* regex engine - integrate regnode_after() support for EXACTish nodesYves Orton2022-08-031-17/+14
| | | | | | | | | | This adds REGNODE_AFTER_varies() which is used when the called *knows* that the current regnode is variable length. We then use it to handle EXACTish style nodes as determined by PL_regnode_arg_len_varies. As part of this patch Perl_regnext() Perl_regnode_after() and Perl_check_regnode_after() are moved to reginline.h, which is loaded via regcomp.c only when we are compiling the regex engine.
* regex engine - rename REGNODE_AFTER_dynamic() REGNODE_AFTER()Yves Orton2022-08-031-2/+4
| | | | | Now that REGNODE_AFTER() can handle all cases it makes sense to remove the dynamic() suffix.
* regex engine - Rename PL_regkind to PL_regnode_kindYves Orton2022-08-031-1/+1
|
* regex engine - Rename PL_regarglen to PL_regnode_arg_lenYves Orton2022-08-031-4/+4
|
* regcomp.c - rename NEXTOPER to REGNODE_AFTER and related logicYves Orton2022-08-031-4/+63
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is really easy to get confused about the difference between NEXTOPER() and regnext() of a regnode. The two concepts are related, similar, but importantly distinct. NEXTOPER() is also defined in such a way that it is easy to abuse and misunderstand and encourages producing code that is fragile to larger change, effectively "baking in" assumptions to the code that are difficult to discover by searching. Changing the type and storage requirements of a regnode may break things in subtle and hard to debug ways. An example of how NEXTOPER() is problematic is that this: NEXTOPER(NEXTOPER(branch)) does not mean "find the second node after the branch node", it means "jump forward by a regnode which happens to be two regnodes large". In other words NEXTOPER is just a fancy way of writing "node+1". This patch replaces NEXTOPER() with three new macros: REGNODE_AFTER_dynamic(node) REGNODE_AFTER_opcode(node,op) REGNODE_AFTER_type(node,tregnode_OPNAME) The first is the most generic case, it jumps forward by the size of the node, and determines that size by consulting OP(node). The second is where you have already extracted OP(node), and the third is where you know the actual structure that you want to jump forward by. Every regnode type has a corresponding type, which is known at compile time, so using the third will produce the most efficient code. However in many cases the code operates on one of several types, whose size may be the same now, but may change in the future, in which case one of the other forms is preferred. The run time logic in regexec.c should probably only use the REGNODE_AFTER_type() interface. Note that there is also a REGNODE_BEFORE() which replaces PREVOPER(), which is used in a specific piece of legacy logic but should not be used otherwise. It is not safe to go backwards from an arbitrary node, we simply have no way to know how large the previous node is and thus where it starts. This patch includes some logic that validates assumptions during DEBUG mode which should catch errors from resizing regnodes. After this patch changing the size of an existing regnode should be relatively safe and errors related to sizing should trigger assertion fails. This patch includes changes to perlreguts.pod to explain this stuff better.
* regen/regcomp.pl - Make regarglen available as PL_regarglen in regexec.cYves Orton2022-08-031-3/+3
| | | | | | | | | In a follow up patch we will use this data from regexec.c which currently cannot see the variable. This changes a comment in regen/mk_invlists.pl which necessitated rebuilding several files related to unicode. Only the hashes associated with mk_invlists.pl were changed.
* regcomp.h: deal with 64 bit aligned pointer data in regex program.Yves Orton2022-07-151-13/+85
| | | | | | | | | | | | | | | | | | | | We cannot safely store 64 bit aligned data in a regnode structure due to the implicit 32 bit alignment the base structure forces on the data. Thanks to Tony Cook for the suggestion on how to cleanly support variable sized pointers without alignment issues. I am pretty sure we should not be storing pointers in the regexp program like this. In most cases where we need an SV attached to a regnode structure we store it in the 'data' array which part of the regexp structure, and then store an index to that item in the regnode. This allows the use of a smaller member for the index instead. This was identified by running "make test_reonly" under the ubsan build: ./Configure -d -Doptimize=-g -Dusedevel -DDEBUGGING \ -Accflags='-fsanitize=address -fsanitize=undefined \ -ggdb3' -Aldflags='-Wl,--no-as-needed -lasan -lubsan' \ -Dcc=ccache\ gcc -Dld=gcc
* regex: Add optimizing regnodeKarl Williamson2022-07-121-0/+14
| | | | | | | | | | | | | | | | | | | | | | It turns out that any character class whose UTF-8 representation is two bytes long, and where all elements share the same first byte can be represented by a compact, fast regnode designed for the purpose. This commit adds that regnode, ANYOFHbbm. ANYOFHb already exists for classes where all elements have the same first byte, and this just changes the two-byte ones to use a bitmap instead of an inversion list. The advantages of this are that no conversion to code point is required (the continuation byte is just looked up in the bitmap) and no inversion list is needed. The inversion list would occupy more space, from 4 to 34 extra 64-bit words, plus an AV and SV, depending on what elements the class matches. Many characters in the Latin, Greek, Cyrillic, Greek, Hebrew, Arabic, and several other (lesser-known) scripts are of this form. It would be possible to extend this technique to larger bitmaps, but this commit is a start.
* regcomp.h: Make bitmap lookups more generalKarl Williamson2022-07-121-6/+6
| | | | | | This introduces a new macro and converts to use it so that bitmaps other than the traditional ones in ANYOF nodes may be defined in a common manner.
* regex: Refactor bitmap vs non-bitmap of qr/[]/Karl Williamson2022-07-101-25/+23
| | | | | | | | | | | | | | | | | | A bracketed character class in a pattern is generally represented by some form of ANYOF node, with matches of characters in the Latin1 range handled by a bitmap, and an inversion list for higher code point matches. But some patterns only have low matches, and some only high, and some match everything that is high. This commit refactors a little so that the distinction between nothing high matches vs everything high matches is done through the same technique. Previously one was indicated by a flag, and the other by a special value in the node's structure. Now there are two special values, and the flag is freed up for a potential future use. In the past the meaning of the flags has had to be overloaded go accommodate all the needs. freeing of a flag means This all allows for some slight simplicfications.
* regex: Refactor a shared flagKarl Williamson2022-07-101-61/+48
| | | | | | | | | | | | | In ANYOF nodes (generated for qr/[]/), there is a bitmap component, and possibly a non-bitmap component. It turns out that a single flag can be used to indicate the existence of the latter. When looked at this way, the name of the flag becomes simpler, and incorporates the meaning of another bit, which was previously shared with yet another meaning. Thus that other meaning can become an unshared bit. This allows for some simplification, and being able to handle the uncommon Turkish locale with fewer main-line conditionals being executed at runtime.
* regcomp.h: Use mnemonic instead of literal+constantKarl Williamson2022-07-051-1/+1
| | | | This is a small thing, but might as well use the mnemonic.
* regcomp.h: Add comments better explaining ANYOF nodesKarl Williamson2022-07-031-46/+68
|
* regex: Change some internal macro names for clarityKarl Williamson2022-07-031-12/+12
| | | | | | | | | These long names are designed to remind the coder that they have multiple meanings. But move the reminder text to the end, as it obscures the purposes. And some have two halves for the separate meanings; change the names so the halves are split by two underscores to visually emphasize this.
* Revert "regex: Add POSIXA1R node"Karl Williamson2022-07-011-31/+0
| | | | | | | This reverts commit d62feba66bf43f35d092bb026694f927e9f94d38. As explained in its commit message. It adds some comments to point out that the commit exists, for the curious.
* regex: Add POSIXA1R nodeKarl Williamson2022-07-011-0/+31
| | | | | | | | | | | | | | | | | | | | | | | | Several of the POSIXA classes are a single range on ASCII platforms, and [:digit:] is a single range on both ASCII and EBCDIC. This regnode was designed to replace the POSIXA regnode for such classes to get a bit of performance by not needing to do an array lookup. Instead it encodes some bits in the flags field that with shifting and masking get the right values for the single range's bounds for any such node. However, performance tests conducted by Sergey Aleynikov showed this was actually slower than what it intended to replace. Rather than completely drop this work, I'm adding it to blead, and immediately reverting it, so that should parts of it ever become useful, it would be available. A few tests fail; those are skipped for the purposes of this commit so that it doesn't interfere with bisecting. The code also isn't completely commented. One could add a regnode for each posix class it was decided should have the expected performance boost. But regnodes are a finite resource, and the boost is probably not large enough to justify doing so.
* Change handy.h macro names to be C standard conformantKarl Williamson2022-06-121-20/+20
| | | | | | | C reserves symbols beginning with underscores for its own use. This commit moves the underscore so it is trailing, which is legal. The symbols changed here are many of the ones in handy.h that have significant uses outside it.
* regex: Create a macro to avoid some #ifdef'sKarl Williamson2022-06-021-0/+6
| | | | Easier to read
* regcomp,regexec: Shorten #define nameKarl Williamson2022-06-021-6/+6
| | | | | | | | This names a flag bit whose meaning depends on context. The previous name contained both meanings, and was simply too long to be comprehensible. This commit splits it into two names, with the suffix '_shared'. It's not necessary to know precisely what the other meaning is when reading the code, only that it has another meaning.
* regcomp.h: Add, fix, clarify commentsKarl Williamson2022-06-021-44/+65
|
* regcomp.c: consistent NOTHING/OPFAIL optimizations for lookaroundYves Orton2022-02-231-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | Add funcs to parse empty constructs which boil down to OPFAIL or NOTHING, and then consistently use them while parsing lookahead and lookbehind. This allows us to eliminate duplicated code and optimize the various cases which boil down to these constructs. Eg. (?!) and (?<!) are equivalent to OPFAIL ops, and (?=) and (?<=) are equivalent to NOTHING ops. This patch does not deal with the (* ... ) forms, that will come in a follow up. One advantage of this optimization is it does not need to set the various special flags related to lookaround as they aren't really lookaround ops. As a bonus this also improves the error messages from incomplete patterns, and add tests for various error messages. Note the functions are given awkward but shortish names so they are not always forced to be line broken. I would prefer to name them something more descriptive, but that would make their use harder to read by forcing line breaks. So I chose to use an abbreviation.
* regcomp.h: change regexp_internal attribute from I32 to U32Yves Orton2022-02-181-1/+4
| | | | | | | | | | This changes the name_list_idx attribute from I32 to a U32 as it will never be negative, and as of a963d6d5acabdd8c7 a 0 can be safely used to represent "no value" for items in the 'data' array. I noticed this while cleaning up the offsets debug logic and updating the perlreguts documentation, so I figured I might as well clean it up at the same time.
* regcomp.c,re.pm: Remove "offsets" debugging codeYves Orton2022-02-181-26/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This code was added by Mark Jason Dominus to aid a regex debugger he wrote for ActiveState. The basic premise is that every opcode in a regex can be attributed back to a contiguous sequence of characters that make up the pattern. This assumption has not been true ever since the "jump" TRIE optimizations were added to the engine. I spoke to MJD many years ago about whether it was ok to remove this from the regex engine and he said he had no objections. An example of a pattern that cannot be handled correctly by this logic is /(?: a x+ | b y+ | c z+ )/x where the (?:a ... | b ... | c ...) parts will now be handled by the TRIE logic and not by the BRANCH/EXACT opcodes that it would have been in the past. The offset debug output cannot handle this type of transformation, and produce nonsense output that mention opcodes that have been optimized away from the final program. The regex compiler is complicated enough without having to maintain this logic. There are essentially no tests for it, and the few tests that do cover it do so as a byproduct of testing other things. Despite the offsets logic only being used in debug supporting it does have a cost to non-debug logic as various internal routines include parameters related to it that are otherwise unused. Note this output is only usable or visible by enabling special flags in re.pm, there is no formal API to access it short of parsing the output of the debug mode of the regex engine, which has changed multiple time over the past years.
* regcomp: Save '&' instrs by casting to U8Karl Williamson2021-07-301-1/+1
|
* regcomp.h: Add internal macroKarl Williamson2021-07-301-12/+13
| | | | | | | This returns the locale bitmap field from an ANYOF node. In this commit, it just tidies up the code, omitting lengthy casts. But a future commit will use it on its own.
* regcomp.h: Clarify commentKarl Williamson2021-06-121-1/+1
|
* regcomp.h: Fix typos in commentsKarl Williamson2021-05-281-2/+2
|
* speed up regex ops under DEBUGGING when -Dr not usedYves Orton2021-02-121-2/+2
|
* style: Detabify indentation of the C code maintained by the core.Michael G. Schwern2021-01-171-17/+17
| | | | | | | | | | | This just detabifies to get rid of the mixed tab/space indentation. Applying consistent indentation and dealing with other tabs are another issue. Done with `expand -i`. * vutil.* left alone, it's part of version. * Left regen managed files alone for now.
* regcomp.h: Restrict scope of symbols to coreKarl Williamson2020-11-021-1/+2
| | | | Everything defined here are internal symbols
* handy.h: Create nBIT_UMAX() macroKarl Williamson2020-07-171-1/+1
| | | | This encapsulates a common paradigm
* handy.h: Create nBIT_MASK(n) macroKarl Williamson2020-07-171-5/+5
| | | | | This encapsulates a common paradigm, making sure that it is done correctly for the platform's size.
* Use dNOOP for otherwise empty declarationsKarl Williamson2020-04-261-3/+3
| | | | | | | | Otherwise, the semi-colon the code has afterwards can be interpreted by some compilers as ending the block of declarations, and treat any later declarations in the same block as erors in C89. For an example, see the ticket this fixes #17725
* regcomp.c: Rmv C undefined behaviorKarl Williamson2020-04-121-4/+3
| | | | | | | | | | | | | | | One analyzer said that what this commit changes was C undefined behavior, in casting void* pointers. Right now, the only actual type it is called with is SV*, but I made it void*, because I thought it might be used more generally. But, it turns out that Unicode is planning on changing its regular expression processing requirements to where what I have no longer will make sense. And, since only SV* is actually used, this commit changes the void* to SV*, removing any undefined behavior, with no changes to program logic. The changes for the new Unicode direction will come in probably 5.34; their document is still in draft, but I anticipate it will soon be finalized.
* regcomp.h: Add macroKarl Williamson2020-03-181-0/+16
| | | | | | | This macro is for files outside of regcomp.c/regexec.c (and re_comp.c/re_exec.c) to have access to the re.pm Debug options. Right now pp_match could benefit from this.
* regcomp.h: Combine two macros into oneKarl Williamson2020-03-111-13/+14
| | | | One of these macros is no longer used, so just combine them.
* Allow wildcard pattern debuggingKarl Williamson2020-03-051-1/+5
| | | | | | | | | | | | | | | | | | | | | | This fixes #17026 Patterns can have subpatterns since 5.30. These are processed when encountered, by suspending the main pattern compilation, compiling the subpattern, and then matching that against the set of all legal possibilities, which Perl knows about. Debugging info for the compilation portion of the subpattern was added by be8790133a0ce8fc67454e55e7849a47a0858d32, without fanfare. But, prior to this new commit, debugging info was not available for that matching portion of the compilation, except under DEBUGGING builds, with -Drv. This commit adds a new option to 'use re qw(Debug ...)', WILDCARD, to enable subpattern match debugging. Whatever other match debugging options have been turned on will show up when a wildcard subpattern is compiled iff WILDCARD is specified. The output of this may be voluminous, which is why you have to ask for it specifically. Or, the EXTRA option turns it on, along with several other things.
* regex: Change internal macro nameKarl Williamson2020-03-051-2/+2
| | | | | It wasn't clear to me that the macro did more than a declaration, given its name. Rename it to be clear as to what it does.