path: root/regcomp.h
Commit message | Author | Age | Files | Lines
* Fix typos (spelling errors) in Perl sources. | Peter J. Acklam (via RT) | 2011-01-07 | 1 | -4/+4
  # New Ticket Created by (Peter J. Acklam)
  # Please include the string: [perl #81904]
  # in the subject line of all future correspondence about this issue.
  # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=81904 >

  Signed-off-by: Abigail <abigail@abigail.be>
* Change name of regex intrnl macro to new meaning | Karl Williamson | 2010-12-20 | 1 | -3/+12
  ANYOF_FOLD is now used under fewer conditions. Otherwise the bitmap of
  characters 0-255 is fully calculated with the folds, and the flag is not
  set. One condition is under locale, where the folds aren't known at
  compile time; the other is for things accessible through a swash.

  By changing the name to its new meaning, certain optimizations become
  more obvious.
* Change regexes to debug dump non-ASCII as hex. | Karl Williamson | 2010-12-19 | 1 | -3/+3
  instead of the less familiar octal for larger values. Perhaps they
  should actually print the actual character, but this is far easier to
  understand than the previous output.
* regcomp: Allow freeing up bit in ANYOF flags | Karl Williamson | 2010-12-11 | 1 | -7/+23
  The flags field is fully used, and until the ANYOF node is split later
  in development, the CLASS bit will need to be freed up to give the space
  for other uses. This patch allows for this to easily be toggled.
* regcomp.h: Restore separate bit for LOC class | Karl Williamson | 2010-11-22 | 1 | -2/+1
  This commit partially reverts cefafd73018b048fa66d2b22250431112141955a,
  which unconditionally used a bitmap for classes like \w in ANYOF nodes
  used in locales. Unfortunately, I forgot to unconditionally allocate
  that space, so things were getting corrupted. It is scary that that did
  not show up in my testing, but locales are hard to test. It showed up in
  a workspace without DEBUGGING.

  This commit now causes the bitmap to be used only when necessary, at the
  expense of using a precious bit in the flags field to indicate that it
  is being used. However, as events have turned out since that commit,
  that flags bit isn't as precious as I thought. It looks like we will
  have to split the ANYOF node into two similar nodes, one of which is
  variable length, as there are bugs due to the optimizer thinking it is
  of length 1, when in fact it doesn't currently have to be. That split
  should allow more bits to be freed.

  I'm retaining for now some ancillary code that was to help improve the
  efficiency when that bit was removed, just in case we have to redo this.
  But if we do, we have to unconditionally allocate the space we think we
  are using.

  Signed-off-by: David Golden <dagolden@cpan.org>
* Split ANYOF_NONBITMAP into two components | Karl Williamson | 2010-11-22 | 1 | -1/+7
  ANYOF_NONBITMAP means that the node can match things that aren't in its
  bitmap. Some things can match only when the target string is in utf8,
  and some things can match even if it isn't. If the target string is not
  in utf8, and we know that the only possible match is when it is in utf8,
  we know it can't match, and avoid a fruitless, expensive swash load.

  This change also fixes a number of problems shown in
  t/re/grind_fold.t that I will deliver soon.
* regcomp.h: Add comment | Karl Williamson | 2010-11-22 | 1 | -1/+2
* regcomp.h: Renumber ANYOF_EOS bit | Karl Williamson | 2010-11-22 | 1 | -3/+3
  This is in preparation for adding a new bit which for debugging ease
  ought to be adjacent to another one.
* rename ANYOF_UNICODE to ANYOF_NONBITMAP | Karl Williamson | 2010-11-22 | 1 | -1/+4
  I am about to hone the meaning of this to mean that there is something
  outside the bitmap that is matchable by the node, and the new name
  reflects that more accurately.

  I am not retaining the old name because I'm about to remove it from the
  flags field to save a bit and avoid masking operations, and any code
  that would be using it would break at that point.
* regex free up bit in ANYOF node | Karl Williamson | 2010-11-22 | 1 | -1/+8
  This patch causes all locale ANYOF nodes to have a class bitmap (4
  bytes) even if they don't have a class (such as \w, \d, [:posix:]). This
  frees up a bit in the flags field that was used to signal if the node
  had the bitmap. I intend to use it instead to signal that loading a
  swash, which is slow, can be bypassed. Thus this is a time/space
  tradeoff, applicable to not just locale nodes: adding a word to the
  locale nodes saves time for all nodes.

  I added the ANYOF_CLASS_TEST_ANY_SET() macro to determine quickly if
  there are actually any classes in the node.

  Minimal code was changed, so this can be easily reversed if another bit
  frees up.

  Another possibility is to share with the ANYOF_EOS bit instead, as this
  is used just in the optimizer's start class, and only in regcomp.c. But
  this requires more careful coding.

  Another possibility is to add a byte (hence likely at least 4 because of
  alignment issues) to store extra flags.

  And still another possibility is to add just the byte for the start
  class, which would not need to affect other ANYOF nodes, since the EOS
  bit is not used outside regcomp.c. But various routines in regcomp
  assume that the start class and other ANYOF nodes are interchangeable,
  so this option would require more code changes.
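Editorial sketch: a quick "is any class set?" test over a 4-byte class bitmap could be written as below. The macro and struct names are hypothetical; only the 4-byte size comes from the commit message, and the real ANYOF_CLASS_TEST_ANY_SET() may be implemented differently.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SK_CLASS_BYTES 4            /* size stated in the commit message */

struct sk_locale_anyof {
    uint8_t flags;
    uint8_t classflags[SK_CLASS_BYTES];
};

/* Hypothetical stand-in: non-zero if any class bit is set in the node. */
#define SK_CLASS_TEST_ANY_SET(p) \
    (((p)->classflags[0] | (p)->classflags[1] | \
      (p)->classflags[2] | (p)->classflags[3]) != 0)

/* Set class number c (0..31) in the node's class bitmap. */
#define SK_CLASS_SET(p, c) \
    ((p)->classflags[(c) >> 3] |= (uint8_t)(1u << ((c) & 7)))
```

Because every locale node carries the bitmap, no separate "has bitmap" flag bit is needed; the cost is the four extra bytes per node.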
* regcomp.h: Add comment | Karl Williamson | 2010-11-22 | 1 | -1/+1
* regcomp.h: Reorder statements for clarity | Karl Williamson | 2010-11-22 | 1 | -4/+6
  Reorder #defines of bits so they are in numerical order
* regcomp.h: Remove unused #define | Karl Williamson | 2010-10-31 | 1 | -3/+0
  ANYOF_RUNTIME() is no longer used, so can be removed. I had long tried
  to figure out what the purpose of this was, and discovered it really had
  none. I think it must have had something to do with locales at one time.
  But locales don't do well with utf8, and I don't know how to make it
  better. In any event this wasn't actually accomplishing anything.
* regcomp.h: Clean up some comments | Karl Williamson | 2010-10-31 | 1 | -10/+10
* ANYOF_LARGE is now the same as ANYOF_CLASS | Karl Williamson | 2010-10-31 | 1 | -4/+2
  These two #defines now mean the same thing. Free up bit used by
  ANYOF_LARGE
* regcomp.c: Get rid of compiler warning. | Karl Williamson | 2010-10-21 | 1 | -1/+1
  This patch should remove a compiler warning that is currently only
  showing up in one compiler. It declares a debug-only variable to be
  volatile, so should silence the warning that it is getting clobbered.
  Since this variable is only used for debugging purposes, when DEBUGGING
  is defined, performance should not be an issue.
* Subject: [perl #58182] partial: Add uni \s,\w matching | Karl Williamson | 2010-10-15 | 1 | -0/+3
  This commit causes regex sequences \b, \s, and \w (and complements) to
  match in the latin1 range in the scope of feature 'unicode_strings' or
  with the /u regex modifier.

  It uses the previously unused flags field in the respective regnodes to
  indicate the type of matching, and in regexec.c, uses that to decide
  which of the handy.h macros to use, native or Latin1.

  I chose this for now rather than create new nodes for each type of
  match. An earlier version of this patch did that, and in every case the
  switch case: statements were adjacent, offering no performance
  advantage. If regexec were modified to use in-line functions or more
  macros for various short sections of it, then it would be faster to have
  new nodes rather than using the flags field. But, using that field
  simplified things, as this change flies under the radar in a number of
  places where it would not if separate nodes were used.
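Editorial sketch: selecting the matching rules from a per-node flags field, rather than from a distinct node type, might look like this. All names are hypothetical, the "native" rules are simplified to ASCII-only, and an ASCII-based platform is assumed; the real code uses the handy.h macros instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum sk_charset { SK_CHARSET_NATIVE = 0, SK_CHARSET_LATIN1 = 1 };

struct sk_regnode {
    uint8_t flags;                  /* holds an enum sk_charset value */
    uint8_t type;
};

/* ASCII-only \w (simplified stand-in for the native rules). */
static bool sk_is_word_ascii(uint8_t c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
        || (c >= '0' && c <= '9') || c == '_';
}

/* \w over the whole Latin-1 range, using Unicode rules. */
static bool sk_is_word_latin1(uint8_t c)
{
    if (c < 0x80)
        return sk_is_word_ascii(c);
    return c == 0xAA || c == 0xB5 || c == 0xBA
        || (c >= 0xC0 && c != 0xD7 && c != 0xF7);
}

/* One node type, two behaviours: the flags field picks the rule set. */
static bool sk_match_word(const struct sk_regnode *node, uint8_t c)
{
    assert(node->flags == SK_CHARSET_NATIVE || node->flags == SK_CHARSET_LATIN1);
    return node->flags == SK_CHARSET_LATIN1 ? sk_is_word_latin1(c)
                                            : sk_is_word_ascii(c);
}
```

The trade-off the commit describes is visible here: the extra branch on the flags costs a little at match time, but no new node types have to be taught to the rest of the engine.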
* Subject: regcomp.h: Add macro to get regnode flags | Karl Williamson | 2010-10-15 | 1 | -0/+2
* PATCH: regex longjmp flaws | Karl Williamson | 2010-09-15 | 1 | -1/+3
  The netbsd - 5.0.2 compiler pointed out that the recent changes to add
  longjmps to speed up some regex compilations can result in clobbering a
  few values. These depend on the compiled code, and so didn't show up in
  other compilers' warnings. This patch reinitializes them after a
  longjmp.
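Editorial sketch: in C, a non-volatile local that is modified between setjmp() and longjmp() has an indeterminate value after the jump. The two usual remedies, declaring the variable volatile or reinitializing it after every return from setjmp(), can be illustrated as follows. The function and variable names are hypothetical and unrelated to the actual regcomp.c code.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf sk_restart;

/* Simulates a compile that restarts itself once via longjmp. */
static int sk_compile(void)
{
    volatile int restarts = 0;      /* volatile: keeps its value across longjmp */
    int depth;                      /* plain local: must not be trusted after longjmp */

    if (setjmp(sk_restart) != 0)
        restarts = restarts + 1;    /* we arrived here via longjmp */

    depth = 0;                      /* reinitialize on every pass */
    depth += 3;                     /* work that modifies the local */

    if (restarts == 0)
        longjmp(sk_restart, 1);     /* abandon the first pass and start over */

    return restarts * 100 + depth;
}
```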
* Properly free paren_name_list with its regexp. | Nicholas Clark | 2010-05-29 | 1 | -0/+1
  Previously the AV paren_name_list would "leak" until global destruction.
  This was only an issue under -DDEBUGGING. Fixes RT #73438.
* Generate PL_simple[] and PL_varies[] with regcomp.pl, rather than hard-coding. | Nicholas Clark | 2010-05-27 | 1 | -31/+0
  Add a new flags column to regcomp.sym, with V if the node type is in
  PL_varies, S if it is in PL_simple, and . if a placeholder is needed
  because subsequent optional columns are present.
* tries: don't allocate memory at runtime | David Mitchell | 2010-05-03 | 1 | -3/+11
  This is an indirect fix for
  [perl #74484] Regex causing exponential runtime+mem usage

  The trie runtime code was doing more SAVETMPS than FREETMPS and was thus
  growing a large tmps stack on heavy backtracking. Rather than fixing
  this directly, I rewrote part of the trie code so that it no longer
  needs to allocate memory in S_regmatch (it still does in
  find_byclass()).

  The basic issue is that multiple branches in the trie may trigger an
  accept state; for example:

      "abcd" =~ /xyz|abcd.*X|ab.*Y/

  here, words (branches) 2 and 3 are accept states. The original approach
  was, at run time, to create a list of accepted word numbers and the
  character positions of the end of each of those words. Then run the rest
  of the pattern for each word in the list in turn (in word index order).
  This requires memory for the list to be allocated and freed.

  The new approach involves creating extra info at compile time; in
  particular, for each word, a pointer to the previous accepted word (if
  any) in the state tree. For example for the above pattern, part of the
  state tree may be

        a    b    c    d
      1 -> 2 -> 3 -> 4 -> 5
               (#3)      (#2)

  (e.g. at state 1, if the next char is 'a', we transition to state 2).
  Here, state 3 is an accept state with word #3, and 5 is an accept state
  with word #2. So we build a table indexed by word number, which has
  wordinfo[2] = 3, wordinfo[3] = 0, thus building the word chain 2->3->0.

  At run time we run the trie to completion, and remember the word
  associated with the longest accept state (word #2 above). Then by
  following back the chain of .prev fields, we can produce a list of all
  accepting words. We then iteratively find the smallest-numbered (ie
  LH-most) word in the chain, and run with it. On failure and backtrack,
  we find the next-smallest and so on.

  Since we are no longer recording the end-position of each word in the
  string, we have to recalculate this for each backtrack. We initially
  record the end-position of the shortest accepting word, and given that
  we know the length of each word, we can calculate the new position each
  time as an offset from that first word. Depending on unicode and
  folding, that calculation can be cheap or expensive.

  This algorithm is optimised for the typical case where there are a small
  number (<= 2) of accepting states.

  This patch creates a new compile-time array, trie->wordinfo[], indexed
  by word number, which contains relevant info about each word. This also
  supersedes the old trie->newword[] array, whose function of recording
  "overspills" of multiple words per accept state is now handled as part
  of the wordinfo[].prev chain.
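Editorial sketch: walking the .prev chain to pick accepting words in left-to-right alternation order, without allocating a list, could look like the following. The struct and function names are hypothetical simplifications of trie->wordinfo[], and the end-position helper assumes one byte per character (no UTF-8, no folding), which is the cheap case the commit mentions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified per-word record; index 0 means "no word". */
struct sk_wordinfo {
    unsigned prev;                  /* previous accepting word on this path, or 0 */
    unsigned len;                   /* word length in characters */
};

/* Return the smallest word number in the chain starting at `longest`
 * that is greater than `after`, or 0 when the chain is exhausted.
 * Call with after == 0 first, then with the last word tried. */
static unsigned sk_next_word(const struct sk_wordinfo *info,
                             unsigned longest, unsigned after)
{
    unsigned best = 0;
    for (unsigned w = longest; w != 0; w = info[w].prev) {
        if (w > after && (best == 0 || w < best))
            best = w;
    }
    return best;
}

/* Recompute where a word ends, as an offset from the recorded end of
 * the shortest accepting word (byte == character assumed). */
static size_t sk_word_end(size_t shortest_end, unsigned shortest_len,
                          unsigned word_len)
{
    assert(word_len >= shortest_len);
    return shortest_end + (word_len - shortest_len);
}
```

For the example pattern, with wordinfo[2].prev = 3 and wordinfo[3].prev = 0, repeated calls starting from word 2 yield word 2, then word 3, then 0. Each call rescans the chain, which is cheap when, as the commit expects, there are only one or two accepting words.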
* revert to 5.8.x semantics for \s \w and \d | Yves Orton | 2009-10-19 | 1 | -1/+1
  revert ba9ac1759cb6e7a5e6883c85edd0b450061b5ccb

  Changing the semantics of \w \s and \d breaks too much and Jesse wants
  to do a rollout. This disables the new semantics until we can get all
  the details worked out.
* somewhat fix failing regex tests. but break lots of other stuff at the same time | Yves Orton | 2009-10-19 | 1 | -1/+1
* add more positive gofs GPOS tests and fix some bugs too | Yves Orton | 2009-09-10 | 1 | -0/+3
* set PERL_LEGACY_UNICODE_CHARCLASS_MAPPINGS to 0 and enable proper POSIX char class matching | Yves Orton | 2009-09-02 | 1 | -1/+2
  This also alters which Unicode properties the POSIX character classes
  and the Perl "special" character classes, like \w and \d, map to. At the
  same time it allows a number of tests for POSIX character class
  behaviour to be switched from todo to non todo.

  Legacy testing is still available by changing the define and setting the
  PERL_TEST_LEGACY_POSIX_CC value to true.
* create new unicode props as defined in POSIX spec (optionally use them in the regex engine) | Yves Orton | 2008-11-07 | 1 | -0/+18
  Perlbug #60156 and #49302 (and probably others) resolve down to the
  problem that the definition of \s and \w and \d and the POSIX
  charclasses are different for unicode strings and for non-unicode
  strings. This broke the character class logic in the regex engine. The
  easiest fix to make the character class logic sane again is to define
  new properties which do match.

  This change creates new property classes that can be used instead of the
  traditional ones (it does not change the previously defined ones). If
  the define in regcomp.h:

      #define PERL_LEGACY_UNICODE_CHARCLASS_MAPPINGS 1

  is changed to 0, then the new mappings will be used. This will fix a
  bunch of bugs that are reported as TODO items in the new reg_posixcc.t
  test file.

  p4raw-id: //depot/perl@34769
* Various changes to regex diagnostics and testing | Yves Orton | 2008-11-06 | 1 | -2/+2
    * Make ANYOF output from regprop easier to read by adding ][ in
      between the unicode representation and the "ascii" one
    * Make it possible to make tests in re_tests todo.
    * add a todo test for a complementary character class match that
      should fail (perl #60156)
    * Also add a comment explaining a previous commit (relating to
      perl #60344)

  p4raw-id: //depot/perl@34755
* Make struct regexp the body of SVt_REGEXP SVs, REGEXPs become SVs, | Nicholas Clark | 2008-01-02 | 1 | -4/+4
  and regexp reference counting is via the regular SV reference counting.
  This was not as easy as it looks.

  p4raw-id: //depot/perl@32804
* Wrap all dereferences of struct regexp* in macros RX_*() [and for | Nicholas Clark | 2008-01-02 | 1 | -1/+4
  regcomp.c and regexec.c RXp_* where necessary] so that in future we can
  maintain source compatibility when we add an extra level of
  dereferencing.

  p4raw-id: //depot/perl@32802
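Editorial sketch: routing every field access through accessor macros means an extra level of indirection can be introduced later by changing one macro. The names below (SK_RXp, SK_RX_NPARENS, SK_EXTRA_INDIRECTION, and the struct members) are hypothetical and only mimic the RX_*() / RXp_*() idea.

```c
#include <assert.h>

struct sk_regexp_body {
    int nparens;
    int minlen;
};

#ifdef SK_EXTRA_INDIRECTION
/* Later layout: the handle is a wrapper that points at the body. */
struct sk_regexp_handle { struct sk_regexp_body *body; };
typedef struct sk_regexp_handle SK_REGEXP;
#define SK_RXp(rx) ((rx)->body)
#else
/* Original layout: the handle is the body itself. */
typedef struct sk_regexp_body SK_REGEXP;
#define SK_RXp(rx) (rx)
#endif

/* Callers use only these, so they compile unchanged under either layout. */
#define SK_RX_NPARENS(rx) (SK_RXp(rx)->nparens)
#define SK_RX_MINLEN(rx)  (SK_RXp(rx)->minlen)

/* Example caller: start/end slots for group 0 plus each capture group. */
static int sk_capture_slots(const SK_REGEXP *rx)
{
    assert(SK_RX_NPARENS(rx) >= 0);
    return 2 * (SK_RX_NPARENS(rx) + 1);
}
```

sk_capture_slots() never names a struct member directly, so only the SK_RXp definition changes when the wrapper layout is switched on.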
* Add editor blocks to some header files. | Marcus Holland-Moritz | 2008-01-01 | 1 | -2/+9
  p4raw-id: //depot/perl@32793
* Fix up copyright years for files modified in 2007. | Nicholas Clark | 2007-11-07 | 1 | -1/+1
  p4raw-id: //depot/perl@32237
* API spelling patch, by Jerry D. Hedden | Rafael Garcia-Suarez | 2007-09-26 | 1 | -1/+1
  p4raw-id: //depot/perl@31983
* /p vs (?p) | Abigail | 2007-06-30 | 1 | -1/+3
  Date: Fri, 29 Jun 2007 23:38:07 +0200
  Message-ID: <20070629213807.GA14454@abigail.nl>

  Subject: [PATCH pod/perlre.pod] Keeping up with the changes.
  From: Abigail <abigail@abigail.be>
  Date: Sat, 30 Jun 2007 01:24:36 +0200
  Message-ID: <20070629232436.GA15326@abigail.nl>

  Plus tweaks, and debug enhancements.

  p4raw-id: //depot/perl@31506
* s/\bunicode\b/Unicode/; # For everything not dual life | Nicholas Clark | 2007-06-24 | 1 | -1/+1
  p4raw-id: //depot/perl@31455
* Re: [PATCH] Callbacks for named captures (%+ and %-) | Ævar Arnfjörð Bjarmason | 2007-06-06 | 1 | -1/+2
  From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
  Message-ID: <51dd1af80706031324y5618d519p460da27a2e7fe712@mail.gmail.com>
  p4raw-id: //depot/perl@31341
* FETCH/STORE/LENGTH callbacks for numbered capture variables | Ævar Arnfjörð Bjarmason | 2007-05-03 | 1 | -4/+6
  From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
  Message-ID: <51dd1af80705011658g1156e14cw4d2b21a8d772ed41@mail.gmail.com>
  p4raw-id: //depot/perl@31130
* Re: [PATCH] Cleanup of the regexp API | Ævar Arnfjörð Bjarmason | 2007-04-30 | 1 | -1/+1
  From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
  Message-ID: <51dd1af80704261922j3db0615wa86ccc4cb65b2713@mail.gmail.com>
  p4raw-id: //depot/perl@31106
* Change meaning of \v, \V, and add \h, \H to match Perl6, add \R to match PCRE and unicode tr18 | Yves Orton | 2007-04-23 | 1 | -1/+13
  Message-ID: <9b18b3110704221434g43457742p28cab00289f83639@mail.gmail.com>
  p4raw-id: //depot/perl@31026
* Re: Proposed changes and to regular expression interfaces in core | Ævar Arnfjörð Bjarmason | 2007-04-06 | 1 | -1/+2
  From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
  Message-ID: <51dd1af80703291552y1073bcb6r954b043eb68a4459@mail.gmail.com>
  p4raw-id: //depot/perl@30849
* Reorder the members of various regexp structs to reduce their size on | Nicholas Clark | 2007-03-31 | 1 | -5/+5
  LP64 platforms, by pairing up I32 and U32 members. Notably structs
  _reg_trie_data, reg_ac_data, regexp and regmatch_state down by 8 bytes,
  re_save_state by 16, and regmatch_slab up by 48 (ie one more state per
  slab)

  p4raw-id: //depot/perl@30815
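Editorial sketch: the saving from pairing 32-bit members comes from removing alignment padding before 8-byte pointers. The two structs below are hypothetical and are not the actual Perl structs; on a typical LP64 ABI the first occupies 32 bytes and the second 24.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit members interleaved with pointers: on LP64 each 4-byte member
 * is followed by 4 bytes of padding so the next pointer is aligned. */
struct sk_interleaved {
    int32_t  a;
    void    *p;
    uint32_t b;
    void    *q;
};

/* Same members, with the two 32-bit fields paired: they share one
 * 8-byte slot and no padding is needed. */
struct sk_paired {
    int32_t  a;
    uint32_t b;
    void    *p;
    void    *q;
};

/* Bytes saved by the reordering on the current platform. */
static size_t sk_bytes_saved(void)
{
    return sizeof(struct sk_interleaved) - sizeof(struct sk_paired);
}
```

On platforms with 4-byte pointers both layouts are typically the same size, which is why the commit singles out LP64.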
* Resolve PL_curpm issues with (??{}) and fix corruption of match results when pattern is a qr. | Yves Orton | 2007-03-22 | 1 | -6/+1
  Message-ID: <9b18b3110703210239x540f5ad9mdb41c2ea6229ac31@mail.gmail.com>
  plus two follow-up patches (minor tweaks)
  p4raw-id: //depot/perl@30678
* Re: New file: t/op/regexp_email.t | Yves Orton | 2007-03-01 | 1 | -0/+4
  Message-ID: <9b18b3110702280845p7860ca08taf1aead39a178aa4@mail.gmail.com>
  p4raw-id: //depot/perl@30436
* add hooks for capture buffers into regex engine. | Yves Orton | 2007-02-13 | 1 | -0/+2
  Message-ID: <9b18b3110702131127q79cc6df1lb1480d9a40d15213@mail.gmail.com>
  p4raw-id: //depot/perl@30265
* Improve regex stringification code | Yves Orton | 2007-01-31 | 1 | -0/+1
  Message-ID: <9b18b3110701301458k2f6a8254hea6c6db28489c38b@mail.gmail.com>
  p4raw-id: //depot/perl@30084
* Disable positive lookaround optimisations | Yves Orton | 2007-01-22 | 1 | -1/+1
  Message-ID: <9b18b3110701210953l4df6198re36a9342e6049583@mail.gmail.com>
  Date: Sun, 21 Jan 2007 18:53:38 +0100
  p4raw-id: //depot/perl@29923
* Re: [PATCH] Change implementation of %+ to use a proper tied hash interface and add support for %- | Yves Orton | 2007-01-16 | 1 | -4/+4
  Message-ID: <9b18b3110701151406p7168b20byf873ee2e58091ca3@mail.gmail.com>
  p4raw-id: //depot/perl@29843
* Make offsets support conditional | Yves Orton | 2007-01-16 | 1 | -3/+14
  Message-ID: <9b18b3110701140624v452f7684x5e9d2890805489fd@mail.gmail.com>
  p4raw-id: //depot/perl@29842
* Add support for /k modifier for matching along with ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} | Yves Orton | 2007-01-15 | 1 | -3/+0
  Message-ID: <9b18b3110701111731x29b1c63i57b1698f769b3bbc@mail.gmail.com>
  (with tweaks)
  p4raw-id: //depot/perl@29831
* Update copyright years in .h files. Also, in .pl | Rafael Garcia-Suarez | 2007-01-05 | 1 | -1/+1
  files that generate .h files, so they'll be ready next time.

  p4raw-id: //depot/perl@29695