path: root/regcomp.sym
Commit message | Author | Age | Files | Lines
* regcomp.sym: Add new node types POSIXA and NPOSIXA | Karl Williamson | 2012-07-24 | 1 | -1/+9
  These will be used to handle things like /[[:word:]]/a. This patch doesn't add the code to actually use these; that will be done in a future patch. Placeholders POSIXD, POSIXL, and POSIXU are also added for future use.
* regcomp.sym: Correct and add comments | Karl Williamson | 2012-07-24 | 1 | -1/+2
* regcomp.c: Optimize e.g., /[^\w]/, /[[^:word:]]/ into /\W/ | Karl Williamson | 2012-06-29 | 1 | -0/+3
  This optimizes a character class whose single element is one of the ops that have the same meaning outside a class (namely \d, \h, \s, \w, \v, :word:, :digit: and their complements) into that op. Those ops take less space than a character class and run faster. An initial '^' for complementing the class is also handled.
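The rewrite described above can be sketched roughly as follows. This is an illustration only, not the code in regcomp.c: it works on pattern strings rather than regnodes, it leaves out the POSIX classes, and the names `COMPLEMENT` and `simplify_class` are made up for the example.

```python
# Illustrative sketch only: collapse a one-element bracketed class into the
# equivalent bare escape. Works on strings, not on compiled regnodes.
COMPLEMENT = {r"\d": r"\D", r"\h": r"\H", r"\s": r"\S", r"\w": r"\W", r"\v": r"\V"}
# Make the mapping symmetric so that \W maps back to \w, and so on.
COMPLEMENT.update({upper: lower for lower, upper in list(COMPLEMENT.items())})


def simplify_class(pattern: str) -> str:
    """Return the bare escape for a single-element class, else the input."""
    if not (pattern.startswith("[") and pattern.endswith("]")):
        return pattern
    body = pattern[1:-1]
    negated = body.startswith("^")
    if negated:
        body = body[1:]
    if body not in COMPLEMENT:
        return pattern  # several elements, or not one of the shared ops
    return COMPLEMENT[body] if negated else body


print(simplify_class(r"[^\w]"))
```

A class with more than one element, such as `[\w\d]`, is returned unchanged, since the optimization applies only to the single-element case.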
* regcomp.c: Simplify some node calculations | Karl Williamson | 2012-06-29 | 1 | -1/+16
  For the node types that have differing versions depending on the character set regex modifiers (/d, /l, /u, /a, and /aa), we can use the enum values as offsets from the base node number to derive the correct one. This eliminates a number of tests. Because there is no DIGITU node type, I added placeholders for it (and NDIGITU) to avoid some special casing of it (more important in future commits). We currently have many available node types, so we can afford to waste these two.
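The offset technique can be illustrated with a small sketch. The enum order, the node numbers, and the names `Charset` and `node_for` are assumptions for this example; the real values are generated from regcomp.sym and may differ.

```python
from enum import IntEnum


class Charset(IntEnum):
    # Assumed order; it must match the relative order of the nodes below.
    DEPENDS = 0  # /d
    LOCALE = 1   # /l
    UNICODE = 2  # /u
    ASCII = 3    # /a


# Hypothetical node numbering: each family occupies consecutive slots in
# charset order. DIGITU is a placeholder so the DIGIT family has no gap.
NODES = ["BOUND", "BOUNDL", "BOUNDU", "BOUNDA",
         "DIGIT", "DIGITL", "DIGITU", "DIGITA"]
NODE_NUMBER = {name: 20 + i for i, name in enumerate(NODES)}
NODE_NAME = {number: name for name, number in NODE_NUMBER.items()}


def node_for(base: str, charset: Charset) -> str:
    """Derive the modifier-specific node by adding the enum value as an offset."""
    return NODE_NAME[NODE_NUMBER[base] + charset]


print(node_for("BOUND", Charset.UNICODE))
```

Without the DIGITU placeholder, the DIGIT family would need a special case, because DIGITA would no longer sit three slots after DIGIT.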
* regcomp.sym: Reorder a couple of nodes | Karl Williamson | 2012-06-29 | 1 | -1/+1
  This causes all the nodes that depend on the regex modifier, BOUND, BOUNDL, etc., to have the same relative ordering. This will enable a future commit to simplify generation of the correct node.
* regcomp.sym: Fix out-dated description | Karl Williamson | 2012-03-03 | 1 | -1/+1
  As a result of commit fab2782b37b5570d7f8f8065fd7d18621117ed49 the description is no longer valid. This node type is trieable.
* rework how the trie logic handles the newer EXACT nodetypes | Yves Orton | 2012-03-03 | 1 | -2/+2
  This cleans up, simplifies, and extends how the trie logic interacts with the new node types. This change ultimately makes EXACTFU, EXACTFU_SS, and EXACTFU_NO_TRIE (renamed to EXACTFU_TRICKYFOLD) work properly with the trie engine regardless of whether the string is utf8 or latin1.

  This patch depends on the following:

    EXACT              => utf8 or "binary" text
    EXACTFU            => either pre-folded utf8, or latin1 that has to be folded as though it was utf8
    EXACTFU_SS         => special case of EXACTFU to handle \xDF/ss (affects latin1 treatment)
    EXACTFU_TRICKYFOLD => special case of EXACTFU to handle tricky non-latin1 fold rules
    EXACTF             => "old style fold logic" untriable nodetype
    EXACTFA            => (currently) untriable nodetype
    EXACTFL            => (currently) untriable nodetype

  See the comments in regcomp.sym for these fold types.

  This patch involves a number of distinct, but related parts. Starting from compilation:

  * Simplify how we detect a triable sequence given the new nodetypes. This also probably fixed some "bugs" in how we detected certain sequences, like /||foo|bar/.
  * Simplify how we read EXACTFU nodes under utf8 by removing the now redundant folding logic (EXACTFU nodes under utf8 are prefolded). Also extend this logic to handle latin1 patterns properly (in conjunction with other changes).
  * Part of the problems associated with EXACTFU_SS and EXACTFU_TRICKYFOLD have to do with how the trie logic interacts with the minlen logic. This change handles both by pessimising the minlen when encountering these nodetypes. One observation is that the minlen logic is basically broken, and works only because it conflates bytes and codepoints in such a way that we more or less always get a value small enough that things work out anyway. Fixing that properly is the job of another patch.
  * Part of the problem of doing folding under unicode rules is that there are a lot of foldings possible, some with strange rules. This means that the bitmap logic does not work correctly in all cases, as we currently do not have any way to populate it properly. So this patch disables the bitmap entirely when folding is involved until that is fixed.

  The end result of this is: we can TRIE/AHOCORASICK any sequence of EXACT or EXACTFU (ish) nodes, regardless of utf8 or not, but we disable the bitmap when folding.

  A note for follow up relating to this patch: with the way EXACTFU_XXX nodes are currently dealt with, we won't build the "maximal" trie because of their presence, instead creating a "jumptrie" consisting of either a leading EXACTFU node followed by an EXACTFU_XXX node, or vice versa. We should eventually address that.
* regex: Remove FOLDCHAR regnode type | Karl Williamson | 2012-01-19 | 1 | -2/+0
  This node type hasn't been used since 5.14.0. Instead an ANYOFV node was generated where formerly a FOLDCHAR node would have been used. The ANYOFV was used because it already existed and was up-to-date, whereas FOLDCHAR would have needed some bug fixes to adapt it, even though it would be faster in execution than ANYOFV; so the code for it was retained in case it was needed. However, both these solutions were defective, and a previous commit has changed things to a different type of solution entirely. Thus FOLDCHAR is obsolescent and can be removed, though the code in it was used as a base for some of the new solutions.
* regex: Add new node type EXACTFU_NO_TRIE | Karl Williamson | 2012-01-19 | 1 | -0/+1
  This new node is like EXACTFU but is not currently trie'able. This adds handling for it in regexec.c, but it is not currently generated; this commit is preparing for future commits.
* regex: Add new node type EXACTFU_SS | Karl Williamson | 2012-01-19 | 1 | -1/+2
  This node will be used to mark the case, in a non-UTF8 pattern and string, where the text matched can differ in length from the pattern text. The only instance where this can happen is that LATIN SMALL LETTER SHARP S can match the sequences "ss", "Ss", "sS", or "SS", hence the name. This node is not currently generated; this prepares for future commits.
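The length mismatch behind EXACTFU_SS can be shown with full Unicode case folding. This sketch uses Python's `str.casefold` as a stand-in for the fold rules and is not how the regex engine performs the comparison; `fold_equal` is a name invented for the example.

```python
SHARP_S = "\u00df"  # LATIN SMALL LETTER SHARP S


def fold_equal(a: str, b: str) -> bool:
    """Compare two strings under full Unicode case folding."""
    return a.casefold() == b.casefold()


for candidate in ("ss", "Ss", "sS", "SS"):
    # One character on the left matches two characters on the right.
    print(candidate, fold_equal(SHARP_S, candidate), len(SHARP_S), len(candidate))
```

Because a single character folds to a two-character sequence, code that assumes the pattern and the matched text have equal length cannot handle this case, which is why it gets its own node type.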
* regcomp.sym: Change comments | Karl Williamson | 2012-01-19 | 1 | -4/+4
* regcomp.sym: Add comments | Karl Williamson | 2011-10-17 | 1 | -4/+4
* regcomp.sym: Add comment | Karl Williamson | 2011-05-18 | 1 | -0/+1
* regcomp.sym: Add nodes for backref of EXACTFA | Karl Williamson | 2011-02-14 | 1 | -1/+3
  These are not used yet.
* regcomp.sym: Add regnode for /aa matching | Karl Williamson | 2011-02-14 | 1 | -1/+2
  It is not used yet.
* regcomp.sym: Add nodes for /a | Karl Williamson | 2011-01-17 | 1 | -0/+8
  These aren't used yet.
* regex: Use BOUNDU regnodes | Karl Williamson | 2011-01-16 | 1 | -0/+1
  This refactors one area in regexec.c to use BOUNDU and NBOUNDU, for efficiency and to make it easier to add the future BOUNDA.
* regcomp.sym: Remove unused nodes DIGITU, NDIGITU | Karl Williamson | 2011-01-16 | 1 | -2/+0
  These are unused because there is no difference between Unicode and non-Unicode semantics for digits; that is, there are no digit characters in the 128-255 range.
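The claim about the 128-255 range can be checked against the Unicode database. The sketch below assumes that "digit" means Unicode general category Nd (decimal number), which is an assumption of this example rather than something stated in the commit; `decimal_digits_in` is a hypothetical helper.

```python
import unicodedata


def decimal_digits_in(lo: int, hi: int) -> list:
    """List the characters in [lo, hi] whose general category is Nd."""
    return [chr(cp) for cp in range(lo, hi + 1)
            if unicodedata.category(chr(cp)) == "Nd"]


print(decimal_digits_in(0, 127))
print(decimal_digits_in(128, 255))
```

Characters such as SUPERSCRIPT TWO (U+00B2) fall in this range but have category No rather than Nd, so they do not count as decimal digits under this assumption.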
* regcomp.sym: Add BOUNDU, NBOUNDU regnodes | Karl Williamson | 2011-01-16 | 1 | -2/+4
  This will make for somewhat more efficient execution, as the engine won't have to test the regnode type multiple times, at the expense of slightly bigger code space.
* regex: Add separate regnodes for \w \s Uni semantics | Karl Williamson | 2011-01-16 | 1 | -6/+12
  These nodes aren't actually used yet, but allow the splitting out of Unicode semantics for \w, \s, and their complements.
* regcomp.sym: add clarifying comments | Karl Williamson | 2011-01-16 | 1 | -2/+2
* regcomp.sym: Add ANYOFV node | Karl Williamson | 2011-01-13 | 1 | -1/+2
  This node is like a straight ANYOF node to match [bracketed character classes], but can match multiple characters; in particular it can match a multi-char fold. When multi-char Unicode folding was added to Perl, it was overlooked that the ANYOF node is supposed to match exactly one character, hence there have been bugs ever since. With a specialized node that can match multiple chars, these can be fixed more easily. I tried at first to make ANYOF match multiple chars, but this causes Perl to not be able to fully compile.
* Fix typos (spelling errors) in Perl sources. | Peter J. Acklam (via RT) | 2011-01-07 | 1 | -4/+4
  # New Ticket Created by (Peter J. Acklam)
  # Please include the string: [perl #81904]
  # in the subject line of all future correspondence about this issue.
  # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=81904 >
  Signed-off-by: Abigail <abigail@abigail.be>
* regcomp.sym: Correct DIGITL, NDIGITL entries | Karl Williamson | 2010-12-07 | 1 | -2/+2
  These were missing that they were simple (matching exactly 1 character) and have 0 regnode arguments.
* regcomp.sym: Re-order for better grouping | Karl Williamson | 2010-12-07 | 1 | -13/+10
  The recently added regnodes are moved to their respective equivalence classes, and the named backreferences are moved to just after the numbered backreferences.
* regcomp.sym: Remove misleading comments | Karl Williamson | 2010-12-07 | 1 | -18/+18
  Yves informed me that in spite of the comments giving precise node numbers, those numbers can change, so new nodes can be mixed in with their kin. Remove those comments.
* regcomp.sym: Add REFFU and NREFFU nodes | Karl Williamson | 2010-12-01 | 1 | -0/+7
  These will be used for matching capture buffers case-insensitively using Unicode semantics. make regen will regenerate the delivered regnodes.h.
* regcomp.sym: update comment | Karl Williamson | 2010-12-01 | 1 | -1/+1
* regcomp.sym: Add EXACTFU regnode | Karl Williamson | 2010-11-28 | 1 | -0/+1
  This node will be used for matching case insensitive exactish nodes using Unicode semantics.
* regcomp.sym: Clarify comment | Karl Williamson | 2010-11-22 | 1 | -1/+1
  make regen needed
* regcomp.sym: Fix descriptions | Karl Williamson | 2010-11-22 | 1 | -4/+4
  requires regen
* Generate PL_simple[] and PL_varies[] with regcomp.pl, rather than hard-coding. | Nicholas Clark | 2010-05-27 | 1 | -43/+42
  Add a new flags column to regcomp.sym, with V if the node type is in PL_varies, S if it is in PL_simple, and . if a placeholder is needed because subsequent optional columns are present.
* Re-work the regcomp.sym to remove use of hard tabs. No data change. | Nicholas Clark | 2010-05-27 | 1 | -119/+119
  The tab separating name and type is replaced with whitespace, the tab marking the start of the description is replaced by a semicolon.
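A reader for this layout might look like the sketch below. It is not regcomp.pl: the sample lines and their exact column order are hypothetical, and only the three facts stated in these commits are relied on (whitespace between name and type, a semicolon before the description, and a flags column holding V, S, or a `.` placeholder).

```python
def parse_entry(line: str) -> dict:
    """Split one entry into name, type, flag, and description."""
    fields, _, description = line.partition(";")
    tokens = fields.split()
    name, node_type = tokens[0], tokens[1].rstrip(",")
    # Assumption: the flag is the first V, S, or '.' token after the type.
    flags = [tok for tok in tokens[2:] if tok in ("V", "S", ".")]
    return {
        "name": name,
        "type": node_type,
        "flag": flags[0] if flags else ".",
        "description": description.strip(),
    }


def build_tables(lines):
    """Collect the names that would go into the simple and varies tables."""
    entries = [parse_entry(line) for line in lines]
    simple = [e["name"] for e in entries if e["flag"] == "S"]
    varies = [e["name"] for e in entries if e["flag"] == "V"]
    return simple, varies


# Hypothetical sample entries, not copied from regcomp.sym.
SAMPLE = [
    "REG_ANY  REG_ANY, no 0 S   ; Match any one character.",
    "STAR     STAR,    node 0 V ; Match this thing 0 or more times.",
    "END      END,     no       ; End of program.",
]

print(build_tables(SAMPLE))
```

Deriving both tables from a single flags column keeps the node list and the tables from drifting apart, which is the motivation given for generating PL_simple[] and PL_varies[] instead of hard-coding them.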
* Correct the node numbers in comments. | Nicholas Clark | 2010-05-27 | 1 | -4/+4
  Really, should we be maintaining these manually?
* Remove stray tab character in definition for VERB. | Nicholas Clark | 2010-05-27 | 1 | -1/+1
  As VERB is "Used only for the type field of verbs", this is only a cosmetic change, causing that correct description to appear in the comment in regnodes.h. The change to regarglen doesn't affect anything, as the VERB type is never actually used for compiled nodes.
* Re: Analysis of problems with mixed encoding case insensitive matches in regex engine. | Yves Orton | 2007-04-26 | 1 | -0/+2
  Message-ID: <9b18b3110704240746u461e4bdcl208ef7d7f9c5ef64@mail.gmail.com>
  p4raw-id: //depot/perl@31081
* Change meaning of \v, \V, and add \h, \H to match Perl6, add \R to match PCRE and unicode tr18 | Yves Orton | 2007-04-23 | 1 | -1/+8
  Message-ID: <9b18b3110704221434g43457742p28cab00289f83639@mail.gmail.com>
  p4raw-id: //depot/perl@31026
* Add Regexp::Keep \K functionality to regex engine as well as add \v and \V, cleanup and more docs for regatom() | Yves Orton | 2007-01-11 | 1 | -3/+7
  Message-ID: <9b18b3110701101133i46dc5fd0p1476a0f1dd1e9c5a@mail.gmail.com>
  (plus POD nits by Merijn and myself)
  p4raw-id: //depot/perl@29756
* \G with /g results in infinite loop in 5.6 and later | Yves Orton | 2006-11-22 | 1 | -1/+0
  Message-ID: <9b18b3110611220811k1a54f650t1bd7c6a9450b0a7e@mail.gmail.com>
  p4raw-id: //depot/perl@29354
* Re: [PATCH] New regex syntax omnibus | Yves Orton | 2006-11-13 | 1 | -13/+17
  Message-ID: <9b18b3110611090809l667860c9t6c27453d7c86a21e@mail.gmail.com>
  p4raw-id: //depot/perl@29260
* New regex syntax omnibus | Yves Orton | 2006-11-07 | 1 | -5/+12
  Message-ID: <9b18b3110611060406u2fa1572as57073949a5df9e62@mail.gmail.com>
  Plus a portability fix (in string comparison for regex verbs) and doc tweaks / podchecker fixes
  p4raw-id: //depot/perl@29222
* Add more backtracking control verbs to regex engine (?CUT), (?ERROR) | Yves Orton | 2006-11-02 | 1 | -2/+5
  Message-ID: <9b18b3110611020335h7ea469a8g28ca483f6832816d@mail.gmail.com>
  p4raw-id: //depot/perl@29189
* Add a commit verb to regex engine to allow fine tuning of backtracking control. | Yves Orton | 2006-11-01 | 1 | -1/+2
  Message-ID: <9b18b3110610311349n5947cc8fsf0b2e6ddd9a7ee01@mail.gmail.com>
  p4raw-id: //depot/perl@29183
* The second patch from: | Yves Orton | 2006-10-30 | 1 | -4/+4
  Subject: [PATCH] regex engine optimiser should grok subroutine patterns, and, name subroutine regops more intuitively
  Message-ID: <9b18b3110610300915x3abf6cddu9c2071a70bea48e1@mail.gmail.com>
  p4raw-id: //depot/perl@29162
* The first patch from: | Yves Orton | 2006-10-30 | 1 | -4/+3
  Subject: [PATCH] regex engine optimiser should grok subroutine patterns, and, name subroutine regops more intuitively
  Message-ID: <9b18b3110610300915x3abf6cddu9c2071a70bea48e1@mail.gmail.com>
  p4raw-id: //depot/perl@29161
* Fix a problem with jump-tries, add (?FAIL) pattern. | Yves Orton | 2006-10-26 | 1 | -2/+5
  Message-ID: <9b18b3110610260559k3efa98barc28987e88c581a8a@mail.gmail.com>
  p4raw-id: //depot/perl@29118
* Add Regex conditionals. Various bugfixes. More tests. | Yves Orton | 2006-10-12 | 1 | -0/+5
  Message-ID: <9b18b3110610111546j74ca490dg21bd9fd1e7e10d42@mail.gmail.com>
  p4raw-id: //depot/perl@28998
* Re: [PATCH] Initial attempt at named captures for perls regexp engine | Yves Orton | 2006-10-07 | 1 | -5/+11
  Message-ID: <9b18b3110610061016x5ddce965u30d9a821f632d450@mail.gmail.com>
  p4raw-id: //depot/perl@28957
* migrate CURLYX/WHILEM branch in regmatch() to new FSM-esque paradigm | Dave Mitchell | 2006-10-05 | 1 | -2/+3
  p4raw-id: //depot/perl@28944
* Re: [PATCH] Add recursive regexes similar to PCRE | Yves Orton | 2006-10-05 | 1 | -1/+3
  Date: Wed, 4 Oct 2006 15:45:15 +0200
  Message-ID: <9b18b3110610040645s563220a2id6f235494b497e90@mail.gmail.com>

  Subject: Re: [PATCH] Add recursive regexes similar to PCRE
  From: demerphq <demerphq@gmail.com>
  Date: Wed, 4 Oct 2006 21:05:10 +0200
  Message-ID: <9b18b3110610041205m2660eb43m1315cf4b0653db96@mail.gmail.com>
  p4raw-id: //depot/perl@28939