path: root/regnodes.h
Commit message | Author | Age | Files | Lines
* rework split() special case interaction with regex engine | Yves Orton | 2013-03-27 | 1 | -5/+5
  This patch resolves several issues at once. The parts are sufficiently interconnected that it is hard to break it down into smaller commits. The tickets open for these issues are:

    RT #94490 - split and constant folding
    RT #116086 - split "\x20" doesn't work as documented

  It additionally corrects some issues with cached regexes that were exposed by the split changes (and applied to them).

  It effectively reverts 5255171e6cd0accee6f76ea2980e32b3b5b8e171 and cccd1425414e6518c1fc8b7bcaccfb119320c513.

  Prior to this patch the special RXf_SKIPWHITE behavior of split(" ", $thing) was only available if Perl could resolve the first argument to split at compile time, which meant it was unavailable in various arcane situations. This manifested as oddities like

    my $delim = $cond ? " " : qr/\s+/;
    split $delim, $string;

  and

    split $cond ? " " : qr/\s+/, $string

  not behaving the same as:

    ($cond ? split(" ", $string) : split(/\s+/, $string))

  which isn't very convenient.

  This patch changes this by adding a new flag to the op_pmflags, PMf_SPLIT, which lets pp_regcomp() know whether it was called as part of split, which in turn allows RXf_SPLIT to be passed into run time regex compilation. We also preserve the original flags so pattern caching works properly, by adding a new property to the regexp structure, "compflags", and related macros for accessing it. We preserve the original flags passed into the compilation process, so we can compare them when we are trying to decide if we need to recompile.

  Note that this is essentially the opposite fix from the one applied originally to fix #94490 in 5255171e6cd0accee6f76ea2980e32b3b5b8e171.

  The reverted patch was meant to make:

    split( 0 || " ", $thing ) #1

  consistent with

    my $x=0; split( $x || " ", $thing ) #2

  and not with

    split( " ", $thing ) #3

  This was reverted because it broke C<split("\x{20}", $thing)>, and because one might argue that it is not that #1 does the wrong thing, but rather that the behavior of #2 is wrong. In other words we might expect that all three should behave the same as #3, and that instead of "fixing" the behavior of #1 to be like #2, we should really fix the behavior of #2 to behave like #3. (Which is what we did.)

  Also, it doesn't make sense to move the special case detection logic further from the regex engine. We really want the regex engine to decide this stuff itself, otherwise split " ", ... wouldn't work properly with an alternate engine. (Imagine we add a special regexp meta pattern that behaves the same as " " does in a split /.../. For instance we might make split /(*SPLITWHITE)/ trigger the same behavior as split " ".)

  The other major change as a result of this patch is that it effectively reverts commit cccd1425414e6518c1fc8b7bcaccfb119320c513, which was intended to get rid of RXf_SPLIT and RXf_SKIPWHITE and free up bits in the regex flags structure. But we don't want to get rid of these vars, and it turns out that RXf_SEEN_LOOKBEHIND is used only in the same situation as the new RXf_MODIFIES_VARS. So I have renamed RXf_SEEN_LOOKBEHIND to RXf_NO_INPLACE_SUBST, and then instead of using two vars we use only the one. Which in turn allows RXf_SPLIT and RXf_SKIPWHITE to have their bits back.
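  The "compflags" idea described above can be illustrated with a minimal C sketch. It only assumes that a cached pattern remembers the flags it was originally compiled with and is reused when both the source text and those flags match. The struct, the function and the flag value (`sketch_regexp`, `sketch_needs_recompile`, `SKETCH_RXf_SPLIT`) are hypothetical names invented for this illustration and are not the Perl core API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flag bit standing in for RXf_SPLIT; the real value differs. */
#define SKETCH_RXf_SPLIT 0x1u

/* Hypothetical cached-pattern record: the pattern source plus the flags
 * originally passed to the compiler (the "compflags" of the commit message). */
typedef struct {
    const char *source;
    uint32_t    compflags;
} sketch_regexp;

/* Returns true when the cached regexp cannot be reused for this request. */
static bool sketch_needs_recompile(const sketch_regexp *cached,
                                   const char *source, uint32_t flags)
{
    assert(source != NULL);
    if (cached == NULL)
        return true;                      /* nothing cached yet */
    if (strcmp(cached->source, source) != 0)
        return true;                      /* different pattern text */
    return cached->compflags != flags;    /* same text, different flags */
}
```

  Under these assumptions, a pattern " " compiled for split and the same text compiled outside split carry different flags, so the second request triggers a recompile instead of silently reusing the first result.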
* Improve how regcomp.pl handles multibits | Yves Orton | 2013-03-27 | 1 | -3/+3
  In preparation for future changes.
* Add new regnode for synthetic start class | Karl Williamson | 2012-12-28 | 1 | -150/+155
  This creates a regnode specifically for the synthetic start class, which is a type of ANYOF node. The flag bit previously used to denote this is removed. This paves the way for this bit to be freed up, but first the other use of this bit must also be removed, which will be done in the next commit.

  There are now three ANYOF-type regnodes. This one should be called only in one place in regexec.c. The other special one is ANYOF_WARN_SUPER. A synthetic start class node should not do any warning, so there is no issue of having something need to be both types.
* Free up regex ANYOF bit. | Karl Williamson | 2012-12-28 | 1 | -150/+155
  This uses a regnode type, of which we have many available, to free up a bit in the ANYOF regnode flag field, of which we have none to spare; we are currently making the same bit do double duty. This will enable us to remove some of that double duty in the next commit.
* Regenerate the regnode table in perldebguts.pod automatically | Father Chrysostomos | 2012-12-22 | 1 | -2/+2
* Consolidate some regex OPS | Karl Williamson | 2012-12-22 | 1 | -293/+150
  The regular expression operation POSIXA works on any of the (currently) 16 posix classes (like \w and [:graph:]) under the regex modifier /a. This commit creates similar operations for the other modifiers: POSIXL (for /l), POSIXD (for /d), POSIXU (for /u), plus their complements. It causes these ops to be generated instead of the ALNUM, DIGIT, HORIZWS, SPACE, and VERTWS ops, as well as all their variants. The net saving is 22 regnode types.

  The reason to do this is for maintenance. As of this commit, there are now 22 fewer node types for which code has to be maintained. The code for each variant was essentially the same logic, but on different operands. It would be easy to make a change to one copy and forget to make the corresponding change in the others. Indeed, this patch fixes [perl #114272] in which one copy was out of sync with others.

  This patch actually reduces the number of separate code paths to 5: POSIXA, NPOSIXA, POSIXL, POSIXD, and POSIXU. The complements of the last 3 use the same code path as their non-complemented version, except that a variable is initialized differently. The code then XORs this variable with its result to do the complementing or not. Further, the POSIXD branch now just checks if the target string being matched is UTF-8 or not, and then jumps to either the POSIXU or POSIXA code respectively. So, there are effectively only 4 cases that are coded: POSIXA, NPOSIXA, POSIXL, and POSIXU. (POSIXA doesn't have to worry about UTF-8, while NPOSIXA does, hence these for efficiency are coded separately.)

  Removing all this code saves memory. The output of the Linux size command shows that the perl executable was shrunk by 33K bytes on my platform compiled under -O0 (.7%) and by 18K bytes (1.3%) under -O2.

  The reason this patch was doable was previous work in numbering the POSIX classes, so that they could be indexed in arrays and bit positions. This is a large patch; I didn't see how to break it into smaller components.

  I chose to make this code more efficient as opposed to saving even more memory. Thus there is a separate loop that is jumped to after we know we have to load a swash; this just saves having to test if the swash is loaded each time through the loop. I avoid loading the swash until absolutely necessary. In places in the previous version of this code, the swash was loaded when the input was UTF-8, even if it wasn't yet needed (and might never be if the input didn't contain anything above Latin1); apparently to avoid the extra test per iteration.

  The Perl test suite runs slightly faster on my platform with this patch under -O0, and the speeds are indistinguishable under -O2. This is in spite of these new POSIX regops being unknown to the regex optimizer (this will be addressed in future commits), and extra machine instructions being required for each character (the xor, and some shifting and masking). I expect this is a result of better caching, and not loading swashes unless absolutely necessary.
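  The XOR technique mentioned above (one code path for a class and its complement, selected by a differently initialized variable) can be sketched in C as follows. This is an ASCII-only illustration under stated assumptions: the class numbering and the names `sketch_class`, `sketch_is_class` and `sketch_match` are hypothetical, and the real code in regexec.c additionally handles locale, UTF-8 and swashes.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical class numbers; the real numbering lives in the Perl source. */
enum sketch_class {
    SKETCH_WORDCHAR,
    SKETCH_DIGIT,
    SKETCH_SPACE,
    SKETCH_CLASS_COUNT
};

/* ASCII-only membership test, selected by class number. */
static bool sketch_is_class(enum sketch_class cls, unsigned char c)
{
    assert(cls < SKETCH_CLASS_COUNT);
    if (c > 127)
        return false;            /* nothing above ASCII matches here */
    switch (cls) {
    case SKETCH_WORDCHAR: return isalnum(c) || c == '_';
    case SKETCH_DIGIT:    return isdigit(c) != 0;
    case SKETCH_SPACE:    return isspace(c) != 0;
    default:              return false;
    }
}

/* One code path serves both the class and its complement: to_complement is
 * false for the plain node and true for the complemented ("N...") node. */
static bool sketch_match(enum sketch_class cls, bool to_complement,
                         unsigned char c)
{
    return to_complement ^ sketch_is_class(cls, c);
}
```

  The point of the design is that only the initialization of `to_complement` differs between the two node types, so a fix to the membership test automatically applies to both.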
* regcomp.sym: Change regkind for NPOSIX regnodes | Karl Williamson | 2012-11-19 | 1 | -4/+4
  It turns out that it is more convenient for the complement of a node to have a regkind that is also the complement of a node. This creates slight inconveniences that are included in this patch, but will help further patches.
* regex: Remove old code that tried to handle multi-char folds | Karl Williamson | 2012-10-14 | 1 | -212/+207
  A recent commit has changed the algorithm used to handle multi-character folding in bracketed character classes. The old code is no longer needed.
* RXf_MODIFIES_VARS | Father Chrysostomos | 2012-10-11 | 1 | -2/+2
  regcomp.c sets this new flag whenever regops that could modify $REGMARK or $REGERROR have been seen. pp_subst will use this to tell whether it should repeatedly stringify the replacement.
* Define RXf_SPLIT and RXf_SKIPWHITE as 0 | Father Chrysostomos | 2012-10-11 | 1 | -3/+3
  They are no longer used in core, and we need room for more flags.

  The only CPAN modules that use them check whether RXf_SPLIT is set (which no longer happens) before setting RXf_SKIPWHITE (which is ignored).
* regcomp.sym: Add new node types POSIXA and NPOSIXA | Karl Williamson | 2012-07-24 | 1 | -141/+182
  These will be used to handle things like /[[:word:]]/a. This patch doesn't add the code to actually use these. That will be done in a future patch.

  Placeholders POSIXD, POSIXL, and POSIXU are also added for future use.
* regcomp.c: Simplify some node calculations | Karl Williamson | 2012-06-29 | 1 | -148/+158
  For the node types that have differing versions depending on the character set regex modifiers, /d, /l, /u, /a, and /aa, we can use the enum values as offsets from the base node number to derive the correct one. This eliminates a number of tests.

  Because there is no DIGITU node type, I added placeholders for it (and NDIGITU) to avoid some special casing of it (more important in future commits). We currently have many available node types, so can afford to waste these two.
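  A minimal C sketch of the offset calculation described above, assuming each node family is laid out in the same order as the charset enum. All names and numeric values here (`sketch_charset`, `sketch_node`, `sketch_node_for`) are hypothetical and chosen only for the illustration, and /aa is left out for brevity.

```c
#include <assert.h>

/* Hypothetical charset enum; its order must match the node layout below. */
enum sketch_charset {
    SKETCH_CS_D,
    SKETCH_CS_L,
    SKETCH_CS_U,
    SKETCH_CS_A,
    SKETCH_CS_COUNT
};

/* Hypothetical node numbers. Each family is laid out in charset order, so
 * SK_DIGITU exists only as a placeholder that keeps the offsets aligned. */
enum sketch_node {
    SK_BOUND = 10, SK_BOUNDL, SK_BOUNDU, SK_BOUNDA,
    SK_DIGIT = 20, SK_DIGITL, SK_DIGITU, SK_DIGITA
};

/* Derive the node for a charset by adding the enum value to the base node,
 * instead of testing each modifier separately. */
static int sketch_node_for(int base, enum sketch_charset cs)
{
    assert(cs < SKETCH_CS_COUNT);
    return base + (int)cs;
}
```

  Without the SK_DIGITU placeholder, SK_DIGITA would sit at offset 2 rather than 3 and the digit family would need special casing, which is the situation the commit message says the placeholders avoid.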
* regcomp.sym: Reorder a couple of nodes | Karl Williamson | 2012-06-29 | 1 | -9/+9
  This causes all the nodes that depend on the regex modifier, BOUND, BOUNDL, etc. to have the same relative ordering. This will enable a future commit to simplify generation of the correct node.
* regcomp.sym: Fix out-dated description | Karl Williamson | 2012-03-03 | 1 | -1/+1
  As a result of commit fab2782b37b5570d7f8f8065fd7d18621117ed49 the description is no longer valid. This node type is trieable.
* rework how the trie logic handles the newer EXACT nodetypes | Yves Orton | 2012-03-03 | 1 | -5/+5
  This cleans up, simplifies and extends how the trie logic interacts with the new node types. This change ultimately makes the EXACTFU, EXACTFU_SS, EXACTFU_NO_TRIE (renamed to EXACTFU_TRICKYFOLD) work properly with the trie engine regardless of whether the string is utf8 or latin1.

  This patch depends on the following:

    EXACT              => utf8 or "binary" text
    EXACTFU            => either pre-folded utf8, or latin1 that has to be folded as though it was utf8
    EXACTFU_SS         => special case of EXACTFU to handle \xDF/ss (affects latin1 treatment)
    EXACTFU_TRICKYFOLD => special case of EXACTFU to handle tricky non-latin1 fold rules
    EXACTF             => "old style fold logic" untriable nodetype
    EXACTFA            => (currently) untriable nodetype
    EXACTFL            => (currently) untriable nodetype

  See the comments in regcomp.sym for these fold types.

  This patch involves a number of distinct, but related parts. Starting from compilation:

  * Simplify how we detect a triable sequence given the new nodetypes, this also probably fixed some "bugs" in how we detected certain sequences, like /||foo|bar/.

  * Simplify how we read EXACTFU nodes under utf8 by removing the now redundant folding logic (EXACTFU nodes under utf8 are prefolded). Also extend this logic to handle latin1 patterns properly (in conjunction with other changes).

  * Part of the problems associated with EXACTFU_SS and EXACTFU_TRICKYFOLD have to do with how the trie logic interacts with the minlen logic. This change handles both by pessimising the minlen when encountering these nodetypes. One observation is that the minlen logic is basically broken, and works only because it conflates bytes and codepoints in such a way that we more or less always get a value small enough that things work out anyway. Fixing that properly is the job of another patch.

  * Part of the problem of doing folding under unicode rules is that there are a lot of foldings possible, some with strange rules. This means that the bitmap logic does not work correctly in all cases, as we currently do not have any way to populate it properly. So this patch disables the bitmap entirely when folding is involved until that is fixed.

  The end result of this is: we can TRIE/AHOCORASICK any sequence of EXACT, or EXACTFU (ish) nodes, regardless of utf8 or not, but we disable the bitmap when folding.

  A note for follow up relating to this patch is that the way EXACTFU_XXX nodes are currently dealt with we won't build the "maximal" trie because of their presence, instead creating a "jumptrie" consisting of either a leading EXACTFU node followed by an EXACTFU_XXX node, or vice versa. We should eventually address that.
* regex: Remove FOLDCHAR regnode type | Karl Williamson | 2012-01-19 | 1 | -11/+6
  This node type hasn't been used since 5.14.0. Instead an ANYOFV node was generated where formerly a FOLDCHAR node would have been used. The ANYOFV was used because it already existed and was up-to-date, whereas FOLDCHAR would have needed some bug fixes to adapt it, even though it would be faster in execution than ANYOFV; so the code for it was retained in case it was needed.

  However, both these solutions were defective, and a previous commit has changed things to a different type of solution entirely. Thus FOLDCHAR is obsolescent and can be removed, though the code in it was used as a base for some of the new solutions.
* regex: Add new node type EXACTFU_NO_TRIE | Karl Williamson | 2012-01-19 | 1 | -124/+129
  This new node is like EXACTFU but is not currently trie'able. This adds handling for it in regexec.c, but it is not currently generated; this commit is preparing for future commits.
* regex: Add new node type EXACTFU_SS | Karl Williamson | 2012-01-19 | 1 | -125/+130
  This node will be used to distinguish the case, in a non-UTF8 pattern and string, where something could be matched that is of different lengths. The only instance where this can happen is that LATIN SMALL LETTER SHARP S can match the sequences "ss", "Ss", "sS", or "SS", hence the name.

  This node is not currently generated; this prepares for future commits.
* regcomp.sym: Change comments | Karl Williamson | 2012-01-19 | 1 | -4/+4
* regcomp.sym: Add comments | Karl Williamson | 2011-10-17 | 1 | -4/+4
* regcomp.sym: Add nodes for backref of EXACTFA | Karl Williamson | 2011-02-14 | 1 | -90/+100
  These are not used yet.
* regcomp.sym: Add regnode for /aa matching | Karl Williamson | 2011-02-14 | 1 | -118/+123
  It is not used yet.
* Initial setup to accommodate /aa regex modifier | Karl Williamson | 2011-02-14 | 1 | -4/+4
  This changes the bits to add a new charset type for /aa, and other bookkeeping for it.
* Move all the generated file header printing into read_only_top() | Nicholas Clark | 2011-01-23 | 1 | -1/+1
  Previously all the scripts in regen/ had code to generate header comments (buffer-read-only, "do not edit this file", and optionally regeneration script, regeneration data, copyright years and filename).

  This change results in some minor reformatting of header blocks, and standardises the copyright line as "Larry Wall and others".
* regcomp.sym: Add nodes for /a | Karl Williamson | 2011-01-17 | 1 | -185/+226
  These aren't used yet.
* regcomp.sym: Remove unused nodes DIGITU, NDIGITU | Karl Williamson | 2011-01-16 | 1 | -148/+137
  These are unused because there is no difference between Unicode and non-Unicode semantics for digits. That is, there are no digit characters in the 128-255 range.
* regcomp.sym: Add BOUNDU, NBOUNDU regnodes | Karl Williamson | 2011-01-16 | 1 | -185/+195
  This will make for somewhat more efficient execution, as we won't have to test the regnode type multiple times, at the expense of slightly bigger code space.
* regex: Add separate regnodes for \w \s Uni semantics | Karl Williamson | 2011-01-16 | 1 | -156/+187
  These nodes aren't actually used yet, but allow the splitting out of Unicode semantics for \w, \s, and complements.
* regcomp.sym: add clarifying comments | Karl Williamson | 2011-01-16 | 1 | -2/+2
* Use multi-bit field for regex character set | Karl Williamson | 2011-01-16 | 1 | -2/+2
  The /d, /l, and /u regex modifiers are mutually exclusive. This patch changes the field that stores the character set to use more than one bit with an enum determining which one. This data structure more closely follows the semantics of their being mutually exclusive, conserves bits as well, and is more easily expanded.

  A small API is added to set and query the bit field.

  This patch is not .xs source backwards compatible. A handful of CPAN programs are affected.
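  The multi-bit field and its small set/query API can be sketched in C as below. This is an illustration under assumptions, not the actual Perl macros: the shift, the field width and the names `sketch_set_charset` and `sketch_get_charset` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Mutually exclusive character sets, stored as a small integer rather than
 * as one flag bit per modifier. */
enum sketch_charset {
    SKETCH_CS_D = 0,
    SKETCH_CS_L = 1,
    SKETCH_CS_U = 2
};

/* Hypothetical position and width of the field inside the flags word. */
#define SKETCH_CS_SHIFT 5u
#define SKETCH_CS_MASK  (0x3u << SKETCH_CS_SHIFT)

/* Store a charset in the field, leaving every other flag bit untouched. */
static uint32_t sketch_set_charset(uint32_t flags, enum sketch_charset cs)
{
    assert((unsigned)cs <= 0x3u);
    return (flags & ~SKETCH_CS_MASK) | ((uint32_t)cs << SKETCH_CS_SHIFT);
}

/* Read the charset back out of the flags word. */
static enum sketch_charset sketch_get_charset(uint32_t flags)
{
    return (enum sketch_charset)((flags & SKETCH_CS_MASK) >> SKETCH_CS_SHIFT);
}
```

  A two-bit field can encode four values, so a fourth charset fits without claiming another bit, whereas one flag per modifier would need a bit for each and could represent contradictory combinations.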
* regcomp.sym: Add ANYOFV node | Karl Williamson | 2011-01-13 | 1 | -160/+165
  This node is like a straight ANYOF node to match [bracketed character classes], but can match multiple characters; in particular it can match a multi-char fold.

  When multi-char Unicode folding was added to Perl, it was overlooked that the ANYOF node is supposed to match exactly one character, hence there have been bugs ever since. By adding a specialized node that can match multiple chars, these can be fixed more easily. I tried at first to make ANYOF match multiple chars, but this causes Perl to not be able to fully compile.
* Run make regen after 486ec47ab73770ab updated regcomp.sym. | Nicholas Clark | 2011-01-07 | 1 | -1/+1
* regcomp.sym: Correct DIGITL, NDIGITL entries | Karl Williamson | 2010-12-07 | 1 | -3/+3
  These entries failed to record that the nodes are simple (matching exactly 1 character) and have 0 regnode arguments.
* regcomp.sym: Re-order for better grouping | Karl Williamson | 2010-12-07 | 1 | -134/+134
  The recently added regnodes are moved to their respective equivalence classes, and the named backreferences are moved to just after the numbered backreferences.
* regcomp.sym: Add REFFU and NREFFU nodes | Karl Williamson | 2010-12-01 | 1 | -9/+20
  These will be used for matching capture buffers case-insensitively using Unicode semantics.

  make regen will regenerate the delivered regnodes.h
* regcomp.sym: Add EXACTFU regnode | Karl Williamson | 2010-11-28 | 1 | -6/+11
  This node will be used for matching case insensitive exactish nodes using Unicode semantics.
* regcomp.sym: Clarify comment | Karl Williamson | 2010-11-22 | 1 | -1/+1
  make regen needed
* regcomp.sym: Fix descriptions | Karl Williamson | 2010-11-22 | 1 | -4/+4
  requires regen
* regcomp.pl -> regen/regcomp.pl | Father Chrysostomos | 2010-10-13 | 1 | -1/+1
* Add /d, /l, /u (infixed) regex modifiers | Karl Williamson | 2010-09-22 | 1 | -2/+2
  This patch adds recognition of these modifiers, with appropriate action for d and l. u does nothing useful yet. This allows for the interpolation of a regex into another one without losing the character set semantics that it was compiled with, as for the first time, the semantics is now specified in the stringification as one of these modifiers.

  To this end, it allocates an unused bit in the structures. The offsets change so as to not disturb other bits.
* regexp.h: Move bits around | Karl Williamson | 2010-08-11 | 1 | -13/+13
  make regen needed.

  This commit moves some bits in extflags around so that all the unallocated ones are at the boundary between the unshared portion and the portion shared with op.h. This allows them to be allocated in the future to go either way, without affecting binary compatibility at that time.

  The high-order bits are unaffected, but the low order ones move to fill the gap.
* Convert REGNODE_{SIMPLE,VARIES} to a bitmask lookup, from a strchr() lookup. | Nicholas Clark | 2010-05-27 | 1 | -6/+22
  This is O(1) with no branching, instead of O(n) with branching.

  Deprecate the old implementation's externally visible variables PL_simple and PL_varies. Google codesearch suggests that nothing outside the core regexp code was using these.
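  The difference between the two lookup styles can be sketched in C. The opcode numbers, the table contents and the names below are made up for the illustration and are not the generated PL_simple data; the sketch only assumes that the old form was a NUL-terminated opcode list searched with strchr() and that the new form keeps one bit per opcode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_NODE_COUNT 16u   /* hypothetical number of node types */

/* Old style: a NUL-terminated list of opcodes, scanned linearly.
 * Opcode 0 cannot be looked up this way because 0 is the terminator. */
static const char sketch_simple_list[] = { 3, 5, 9, 0 };

static bool sketch_simple_strchr(unsigned op)
{
    assert(op > 0 && op < SK_NODE_COUNT);
    return strchr(sketch_simple_list, (int)op) != NULL;
}

/* New style: one bit per opcode, packed eight to a byte. */
static const uint8_t sketch_simple_bitmask[(SK_NODE_COUNT + 7u) / 8u] = {
    (1u << 3) | (1u << 5),   /* opcodes 3 and 5 */
    (1u << (9 - 8))          /* opcode 9 */
};

/* Index the byte with op >> 3 and the bit with op & 7: constant time. */
static bool sketch_simple_bitmask_lookup(unsigned op)
{
    assert(op < SK_NODE_COUNT);
    return (sketch_simple_bitmask[op >> 3] & (1u << (op & 7u))) != 0;
}
```

  Both functions answer the same membership question for opcodes 1 to 15 in this sketch, but the bitmask version does a fixed amount of work regardless of how many opcodes are in the set.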
* Encapsulate lookups in PL_{varies,simple} within macros REGNODE_{VARIES,SIMPLE} | Nicholas Clark | 2010-05-27 | 1 | -0/+4
  This allows the implementation of the lookup mechanism to change.
* Generate PL_simple[] and PL_varies[] with regcomp.pl, rather than hard-coding. | Nicholas Clark | 2010-05-27 | 1 | -0/+24
  Add a new flags column to regcomp.sym, with V if the node type is in PL_varies, S if it is in PL_simple, and . if a placeholder is needed because subsequent optional columns are present.
* Remove stray tab character in definition for VERB. | Nicholas Clark | 2010-05-27 | 1 | -2/+2
  As VERB is "Used only for the type field of verbs" this is only a cosmetic change, causing that correct description to appear in the comment in regnodes.h. The change to regarglen doesn't affect anything, as the VERB type is never actually used for compiled nodes.
* Abolish RXf_UTF8. Store the UTF-8-ness of the pattern with SvUTF8(). | Nicholas Clark | 2008-01-05 | 1 | -2/+2
  p4raw-id: //depot/perl@32852
* Reorder the external regexp flags to get RXf_PMf_STD_PMMOD into the | Nicholas Clark | 2007-12-29 | 1 | -30/+30
  lowest 4 bits (which saves a shift), and the "flags indicating special patterns" into contiguous bits. This makes everything a little tidier, and saves 88 bytes (woohoo!) of object file with -Os on x86 FreeBSD.

  p4raw-id: //depot/perl@32775
* TRIE must use 'yes' state transitions when more than one match possible to ensure proper scope cleanup. | Marcus Holland-Moritz | 2007-08-18 | 1 | -2/+2
  Fix and test for issue raised in:

  Subject: Very strange interaction between regex and lexical array in blead
  Message-ID: <20070818015537.0088db31@r2d2>

  p4raw-id: //depot/perl@31733
* /p vs (?p) | Abigail | 2007-06-30 | 1 | -0/+42
  Date: Fri, 29 Jun 2007 23:38:07 +0200
  Message-ID: <20070629213807.GA14454@abigail.nl>

  Subject: [PATCH pod/perlre.pod] Keeping up with the changes.
  From: Abigail <abigail@abigail.be>
  Date: Sat, 30 Jun 2007 01:24:36 +0200
  Message-ID: <20070629232436.GA15326@abigail.nl>

  Plus tweaks, and debug enhancements.

  p4raw-id: //depot/perl@31506
* Re: Analysis of problems with mixed encoding case insensitive matches in regex engine. | Yves Orton | 2007-04-26 | 1 | -6/+11
  Message-ID: <9b18b3110704240746u461e4bdcl208ef7d7f9c5ef64@mail.gmail.com>

  p4raw-id: //depot/perl@31081