path: root/regcomp.h
Commit message (Author, Age; Files, Lines)
* regcomp.h: Remove extraneous 'struct's (Karl Williamson, 2015-12-26; 1 file, -3/+3)
| | | | Better to not have this clutter.
* regcomp.h: Fix shift and mask (Karl Williamson, 2015-12-26; 1 file, -1/+1)
| | | | | | | | | | | The mask removed here was to make sure that right shifting didn't propagate the sign bit, but is unnecessary as the value shifted is unsigned. And confining things to a U8 with that mask assumes that the bit vector being operated on has 256 elements max. This isn't necessarily true these days, as one can change ANYOF_BITMAP_SIZE. In fact changing that number was failing until this commit. It also adds white space to make it easier to read.
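The effect of the removed mask can be illustrated with a minimal sketch. The names below (`demo_bitmap`, `demo_byte_index`, and so on) are hypothetical and are not the macros in regcomp.h, and the masked variant is only a guess at the general shape of the old code. It shows that confining the code point to a byte and masking the byte index makes a bit in a 512-bit bitmap alias a bit in the first 256, while a plain shift of an unsigned value indexes correctly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bitmap width, deliberately larger than 256 bits. */
#define DEMO_BITMAP_BITS 512u

typedef struct {
    uint8_t bits[DEMO_BITMAP_BITS / 8u];
} demo_bitmap;

/* Assumed shape of the old calculation: the cast and the mask confine the
 * result to the first 32 bytes, so code points >= 256 alias lower ones. */
static size_t demo_byte_index_masked(unsigned cp)
{
    return (size_t)(((uint8_t)cp >> 3) & 31u);
}

/* Unsigned operand: the right shift cannot propagate a sign bit, so no
 * mask is needed and the full bitmap is addressable. */
static size_t demo_byte_index(unsigned cp)
{
    assert(cp < DEMO_BITMAP_BITS);
    return (size_t)(cp >> 3);
}

static void demo_set(demo_bitmap *bm, unsigned cp)
{
    bm->bits[demo_byte_index(cp)] |= (uint8_t)(1u << (cp & 7u));
}

static int demo_test(const demo_bitmap *bm, unsigned cp)
{
    return (int)((bm->bits[demo_byte_index(cp)] >> (cp & 7u)) & 1u);
}
```

With these definitions, code point 300 lands in byte 37 under the plain shift but in byte 5 under the masked form, which is the byte that holds code point 44.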
* regcomp.h: Use more basic macro in #defines (Karl Williamson, 2015-12-26; 1 file, -2/+2)
| | | | | Instead of having this code repeated in several places, call the more base macro from the others.
* regcomp.h: Free up bit in ANYOF FLAGS field (Karl Williamson, 2015-12-26; 1 file, -71/+64)
| | | | | | | | | | | I've long been confronted with trying to do things to create a spare bit to use. I thought it easier now, while it's fresh in my mind, to free up one for future use, rather than re-learn things when it next becomes necessary. It would have been a different story if the freed bit had required a performance penalty. This commit also updates the comments about how to create even more spare bits should it become necessary.
* regcomp.h: Shorten, clarify names of internal flags (Karl Williamson, 2015-12-26; 1 file, -16/+17)
| | | | Some of the names are expanded slightly and not shortened
* regcomp.h: reword some comments (Karl Williamson, 2015-12-22; 1 file, -33/+32)
|
* regcomp.h: Add comments (Karl Williamson, 2015-12-17; 1 file, -40/+119)
|
* regex matching: Don't do unnecessary work (Karl Williamson, 2015-12-17; 1 file, -1/+3)
| | | | | | This commit sets a flag at pattern compilation time to indicate if a rare case is present that requires special handling, so that that handling can be avoided unless necessary.
* regcomp.h: Renumber 2 flag bits (Karl Williamson, 2015-12-17; 1 file, -4/+4)
| | | | | | This changes the spare bit to be adjacent to the LOC_FOLD bit, in preparation for the next commit, which will use that bit for a LOC_FOLD-related use.
* regex: Free a ANYOF node bit (Karl Williamson, 2015-12-17; 1 file, -10/+16)
| | | | | | | | This is done by combining 2 mutually exclusive bits into one. I hadn't seen this possibility before because the name of one of them misled me. It also misled me into turning on that flag unnecessarily, and to miss opportunities to not have to create a swash at runtime. This commit corrects those things as well.
* regcomp.[ch]: Comment additions, fixes (Karl Williamson, 2015-09-03; 1 file, -23/+46)
|
* regcomp.h: Reorder some flag definitions. (Karl Williamson, 2015-09-03; 1 file, -15/+16)
| | | | | This places the flag bits of like-type flags adjacent for convenience in reading the code. It also improves the commentary about their purposes.
* regcomp.h: SSC no longer has to be strict ANYOF (Karl Williamson, 2015-09-03; 1 file, -1/+1)
| | | | | | Since commit a0bd1a30d379f2625c307657d63fc50173d7a56d, a synthetic start class node can be just an ANYOF-type node. I don't think this causes a bug, just misses a potential optimisation.
* Make qr/(?[ ])/ work in UTF-8 locales (Karl Williamson, 2015-08-24; 1 file, -2/+6)
| | | | | | | | | | | | | | | | Previously use of this under /l regex rules was a compile time error. Now it works like \b{wb} and \b{sb}, which compile under locale rules and always work like Unicode says they should. A UTF-8 locale implies Unicode rules, and the goal is for it to work seamlessly with the rest of perl. This construct was the only one I am aware of that didn't work seamlessly (not counting OS interfaces) under UTF-8 LC_CTYPE locales. For all three of these constructs, use with a non-UTF-8 runtime locale raises a warning, and Unicode rules are used anyway. UTF-8 locale collation still has problems, but this is low priority to fix, as it's a lot of work, and if one really cares, one should be using Unicode::Collate.
* regcomp.h: Fold 2 ANYOF flags into a single one (Karl Williamson, 2015-08-24; 1 file, -8/+13)
| | | | | | | | | | | | | | | | | | The ANYOF_FLAGS bits are all used up, but a future commit wants one. This commit frees up a bit by sharing two of the existing comparatively-rarely-used ones. One bit is used only under /d matching rules, while the other is used only when not under /d. Only the latter bit is used in synthetic start classes. The previous commit introduced an ANYOFD node type corresponding to /d. An SSC never is this type. Thus, the bits have mutually exclusive meanings, and we can use the node type to distinguish between the two meanings of the combined bit. An alternative implementation would have been to use the ANYOF_HAS_NONBITMAP_NON_UTF8_MATCHES non-/d bit instead of the one chosen. But this is used more frequently, so the disambiguation would have been exercised more frequently, slowing execution down ever so slightly; more importantly, this one required fewer code changes, by a slight amount.
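The bit-sharing idea can be sketched as follows. This is an illustration under stated assumptions, not the code in regcomp.h: the struct, the bit position `DEMO_SHARED_BIT`, and the accessor names are all hypothetical. Only the ANYOF and ANYOFD node types come from the commit message. The point is that one physical bit carries two meanings, and the node type selects which meaning applies.

```c
#include <stdint.h>

/* ANYOFD stands for a class compiled under /d rules; everything else in
 * this sketch is hypothetical. */
enum demo_node_type { DEMO_ANYOF, DEMO_ANYOFD };

#define DEMO_SHARED_BIT 0x40u /* hypothetical bit position */

typedef struct {
    enum demo_node_type type;
    uint8_t flags;
} demo_node;

/* Meaning 1: only valid on /d nodes. */
static int demo_has_d_only_flag(const demo_node *n)
{
    return n->type == DEMO_ANYOFD && (n->flags & DEMO_SHARED_BIT) != 0;
}

/* Meaning 2: only valid on non-/d nodes, which includes every SSC. */
static int demo_has_non_d_flag(const demo_node *n)
{
    return n->type != DEMO_ANYOFD && (n->flags & DEMO_SHARED_BIT) != 0;
}
```

The cost of the sharing is the extra type comparison in each accessor, which is why the commit message prefers to share the less frequently tested bits.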
* remove deprecated /\C/ RE character class (David Mitchell, 2015-06-19; 1 file, -2/+1)
| | | | | | This horrible thing broke encapsulation and was as buggy as a very buggy thing. It's been officially deprecated since 5.20.0 and now it can finally die die die!!!!
* Replace common Emacs file-local variables with dir-locals (Dagfinn Ilmari Mannsåker, 2015-03-22; 1 file, -6/+0)
| | | | | | | | | | | | | | | | An empty cpan/.dir-locals.el stops Emacs using the core defaults for code imported from CPAN. Committer's work: To keep t/porting/cmp_version.t and t/porting/utils.t happy, $VERSION needed to be incremented in many files, including throughout dist/PathTools. perldelta entry for module updates. Add two Emacs control files to MANIFEST; re-sort MANIFEST. For: RT #124119.
* \s matching VT is no longer experimental (Karl Williamson, 2015-02-21; 1 file, -2/+0)
| | | | | | | This was experimentally introduced in 5.18, and no issues were raised, except that it got us to thinking and spurred us to stop allowing $^X, where 'X' is a non-printable control character, and that change caused some issues.
* Add \b{sb} (Karl Williamson, 2015-02-19; 1 file, -0/+1)
|
* Add qr/\b{wb}/ (Karl Williamson, 2015-02-19; 1 file, -1/+2)
|
* Add qr/\b{gcb}/ (Karl Williamson, 2015-02-19; 1 file, -0/+5)
| | | | | | | | | | | A function implements seeing if the space between any two characters is a grapheme cluster break. After I wrote this, I realized that an array lookup might be a better implementation, but the deadline for v5.22 was too close to change it. I did see that my gcc optimized it down to an array lookup. This makes the implementation of \X go from being complicated to trivial.
* Corrections to spelling and grammatical errors. (Lajos Veres, 2015-01-28; 1 file, -1/+1)
| | | | Extracted from patch submitted by Lajos Veres in RT #123693.
* regcomp.h: Clarify comment (Karl Williamson, 2015-01-21; 1 file, -1/+1)
|
* Add regex nodes for locale (Karl Williamson, 2014-12-29; 1 file, -1/+9)
| | | | | These will be used in a future commit to distinguish between /l patterns vs non-/l.
* Eliminate unused BACK regnode (Aaron Crane, 2014-09-29; 1 file, -3/+1)
|
* regcomp.c: Add a function and use it (Karl Williamson, 2014-09-29; 1 file, -0/+7)
| | | | | | | This adds a function to allocate a regnode with 2 32-bit arguments, and uses it, rather than the ad-hoc code that did the same thing previously. This is in preparation for this code being used in a 2nd place in a future commit.
* regcomp.h: Add comment (Karl Williamson, 2014-09-29; 1 file, -1/+1)
|
* regcomp.h: Remove obsolete #defines (Karl Williamson, 2014-09-29; 1 file, -5/+0)
| | | | These internal definitions are no longer used.
* regcomp.h: Use existing macro instead of reinventing (Karl Williamson, 2014-09-29; 1 file, -2/+2)
|
* Add tests for a51d618a fix of RT #122283 (Yves Orton, 2014-09-28; 1 file, -0/+3)
| | | | | | | | | | | | | | | | | | | | | Add a new re debug mode for outputting stuff useful for testing. In this case we count the number of times that we go through study_chunk. With a51d618a we should do 5 times (or less) when we traverse the test pattern. Without a51d618a we recurse 11 times. In the case of RT #122283 we would do gazillions of recursions, so many that I never let it run to completion. / (?(DEFINE)(?<foo>foo)) (?(DEFINE)(?<bar>(?&foo)bar)) (?(DEFINE)(?<baz>(?&bar)baz)) (?(DEFINE)(?<bop>(?&baz)bop)) /x I say "or less" because you could argue that since these defines are never called, we should not actually recurse at all, and should maybe just compile this as a simple empty pattern.
* change NODE_ALIGN_FILL to set flags to 0 (Yves Orton, 2014-09-17; 1 file, -1/+10)
| | | | | | | | | | | | In 075abff3 Andy Lester set the flags field of regops to default to 0xde. I find this really weird, and possibly dangerous, as it seems to me reasonable to assume a new regop would have this field set to 0, so that later on code can set it to something else if necessary. (Which is what I wanted to do.) Since nothing breaks if I set it to 0x0 and I find that to be a much more natural default than 0xde (the prefix of 0xdeadbeef), I am changing this to set it to 0.
* Eliminate the duplicative regops BOL and EOL (Yves Orton, 2014-09-17; 1 file, -6/+5)
| | | | | | | | | | | | | | | | | | | | | | | | | | | See also perl5porters thread titled: "Perl MBOLism in regex engine" In the perl 5.000 release (a0d0e21ea6ea90a22318550944fe6cb09ae10cda) the BOL regop was split into two behaviours MBOL and SBOL, with SBOL and BOL behaving identically. Similarly the EOL regop was split into two behaviors SEOL and MEOL, with EOL and SEOL behaving identically. This then resulted in various duplicative code related to flags and case statements in various parts of the regex engine. It appears that perhaps BOL and EOL were kept because they are the type ("regkind") for SBOL/MBOL and SEOL/MEOL/EOS. Reworking regcomp.pl to handle aliases for the type data so that SBOL/MBOL are of type BOL, even though BOL == SBOL seems to cover that case without adding to the confusion. This means two regops, a regstate, and an internal regex flag can be removed (and used for other things), and various logic relating to them can be removed. For the uninitiated, SBOL is /^/ and /\A/ (with or without /m) and MBOL is /^/m. (I consider it a fail we have no way to say MBOL without the /m modifier). Similarly SEOL is /$/ and MEOL is /$/m (there is also a /\z/ which is EOS "end of string" with or without the /m).
* regcomp.h: Comment nits (Karl Williamson, 2014-09-03; 1 file, -2/+2)
|
* Allow for changing size of bracketed regex char class (Karl Williamson, 2014-09-03; 1 file, -1/+14)
| | | | | | | | This commit allows Perl to be compiled with a bitmap size that is larger than 256. This bitmap is used to directly look up whether a character matches or not, without having to do a binary search or hash lookup. It might improve the performance for some installations that have a lot of use of scripts that are above the Latin1 range.
* Rename some internal regex #defines (Karl Williamson, 2014-09-03; 1 file, -23/+24)
| | | | | | | | | These are renamed to be more clear as to their actual meanings. I know other people have been confused by their former names. Some of the name changes will become more important as future commits will allow the bitmap in a bracketed character class to be a different size.
* regcomp.h: Remove some no-longer used #defines (Karl Williamson, 2014-09-03; 1 file, -10/+0)
| | | | This is an internal header, so can change names within it.
* regcomp.h: Use unsigned 1 in left shift (Karl Williamson, 2014-09-03; 1 file, -2/+2)
| | | | | | This prevents a signed result if this macro ever gets used in a U8. The ANYOF_BITMAP_TEST macro must now be cast or it would generate warnings when compiled with -DPERL_BOOL_AS_CHAR
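The signedness point follows from C's integer promotion rules and can be checked with a small C11 sketch. The macro and function names here are hypothetical, not those in regcomp.h. `1 << n` has type `int`, and even a byte-sized left operand is promoted to `int` before shifting, whereas `1U << n` stays `unsigned int`. The test helper collapses its result to 0 or 1 so that it fits a char-sized boolean, which is the kind of cast the commit message says is needed under -DPERL_BOOL_AS_CHAR.

```c
#include <stdint.h>

/* Evaluates to 1 when the expression has type unsigned int (C11 _Generic). */
#define DEMO_IS_UNSIGNED_INT(expr) _Generic((expr), unsigned int: 1, default: 0)

#define DEMO_BIT_SIGNED(c)   (1  << ((c) & 7)) /* result type: int */
#define DEMO_BIT_UNSIGNED(c) (1U << ((c) & 7)) /* result type: unsigned int */

/* Reduce the masked value to 0 or 1 so that storing it in a char-sized
 * boolean neither truncates nor warns. */
static unsigned char demo_bit_test(uint8_t byte, unsigned c)
{
    return (unsigned char)((byte & DEMO_BIT_UNSIGNED(c)) != 0);
}
```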
* regcomp.h: Fix comment that said the opposite of the truth (Karl Williamson, 2014-09-03; 1 file, -1/+1)
| | | | Too many negations led to this.
* regex: Use #define for number of bits in ANYOF (Karl Williamson, 2014-08-21; 1 file, -3/+8)
| | | | | | | ANYOF nodes (for bracketed character classes) currently are for code points 0-255. This is the first step in eventually making that size configurable. This also renames a static function, as the domain may not necessarily be 'latin1'.
* regcomp.c: Make SSC node clone safe (Karl Williamson, 2014-03-12; 1 file, -9/+13)
| | | | | | This just sets to NULL the ptr field in the Synthetic Start Class that will be passed to regexec.c, and clarifies the comments in regcomp.h. See the thread starting at http://markmail.org/message/2txwaqnjco6zodeo
* regcomp.c: Fix more alignment problems (Karl Williamson, 2014-02-19; 1 file, -20/+16)
| | | | | | | | | | | | | | | | | | | | | | | | | I believe this will fix the remaining alignment problems recently being shown on gcc on HP-UX. It works on the procura machine. regnodes should not have stricter alignment than required by U32, for reasons given in the comments this commit adds to the beginning of regcomp.h. Commit 31f05a37 added a new ANYOF regnode struct with a pointer field. This requires stricter alignment on some 64-bit platforms, and hence doesn't work on those platforms. This commit removes that regnode struct type, and instead stores the pointer it used via a more indirect, but already existing mechanism that stores other data. The function that returns that other data is enlarged to return this new field as well. It now needs to be called from regcomp.c, so the previous commit had renamed and made it accessible from there. The "public" function that wraps this one is unchanged. (I put "public" in quotes here, because I don't think anyone outside core is or should be using it, but since it has been publicly available for a long time, I'm treating the API as unchangeable. regcomp.c called this public function before this commit, but needs the additional data returned by the inner one).
* regcomp.h: Allow compiler to perform calculation (Karl Williamson, 2014-02-19; 1 file, -1/+1)
| | | | | | | | Instead of doing the calculation of how many bytes a 256 bitmap occupies, let the compiler do it. I believe we are not too far away from having the ability to allow applications to recompile Perl to increase the bitmap size trading speed for memory. ICU has an 8192 bitmap last time I checked.
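Deriving the byte count from the bit count can be sketched as below. The names are hypothetical and this is not the definition in regcomp.h. It only shows the general technique of stating the width once and letting the compiler compute the array size, so that changing the width cannot leave a stale hand-computed constant behind.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names: the width is stated once, the byte count is derived. */
#define DEMO_BITMAP_BITS  256u
#define DEMO_BITMAP_BYTES (DEMO_BITMAP_BITS / 8u)

/* Guard against a width that does not fill whole bytes. */
_Static_assert(DEMO_BITMAP_BITS % 8u == 0,
               "bitmap width must be a whole number of bytes");

typedef struct {
    uint8_t bitmap[DEMO_BITMAP_BYTES];
} demo_class;

/* Size of the bitmap member as the compiler laid it out. */
static size_t demo_class_bitmap_bytes(void)
{
    return sizeof(((demo_class *)0)->bitmap);
}
```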
* Change method of passing some info from regcomp to regexec (Karl Williamson, 2014-02-19; 1 file, -14/+6)
| | | | | | | | | | | | | | For the last several releases, the fact that an ANYOF node could match something outside its bitmap has been passed to regexec.c by having its ARG field not be -1 (appropriately cast). A bit was set if the match could occur even if the target string was not UTF-8 encoded. This design was used to save a bit, as previously there was a bit also for it matching UTF-8 strings. That design is no longer tenable, as a future commit will have a third (independent) reason for something to match outside the bitmap. This commit uses the current spare bit flag to indicate if the match can only occur if the target string is UTF-8.
* regcomp.h: Remove extraneous comment (Karl Williamson, 2014-02-19; 1 file, -7/+0)
| | | | | This is obsolete and is a partial copy of the up-to-date comment below it.
* regcomp.h: Free up flag bit in ANYOF nodes (Karl Williamson, 2014-02-19; 1 file, -10/+8)
| | | | The ANYOF_LOC bit was removed from final use in the previous commit.
* regexes: Remove uses of ANYOF_LOCALE flag (Karl Williamson, 2014-02-19; 1 file, -4/+2)
| | | | | | | | | | | | | This flag no longer adds any useful information and can be removed. An ANYOF node that depends on locale either matches a POSIX class like /d, or matches case insensitively, or both. There are flags for both these cases, and to see if something matches locale, one merely needs to see if either flag is set. Not having to keep track of this extra flag simplifies things, and will allow it to be removed. There was a time when this flag was shared with one of the remaining locale ones, and there was relict code that allowed that sharing to be reinstated, and which this commit also removes.
* regcomp.c: Simplify /l Synthetic Start Class construction (Karl Williamson, 2014-02-19; 1 file, -3/+12)
| | | | | | | | | | | | | | | The ANYOF_POSIXL flag is needed in general for ANYOF nodes to indicate if the struct contains an extra U32 element used to hold the list of POSIX classes (like \w and [:punct:]) whose matches depend on the locale in effect at the time of runtime pattern matching. But the SSC always contains this U32, and so doesn't need to use the flag. Instead, if there aren't any such classes, the U32 will be zero. Removing keeping track of this flag during the assembly of the SSC simplifies things. At the completion of this process, this flag is set if the U32 is non-zero to pass that information on to regexec.c so that it doesn't have to special case things.
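The deferred-flag approach described above can be sketched as follows. All names are hypothetical and the struct is a simplification, not the SSC layout in regcomp.h. While the class is being assembled only the 32-bit class set is updated, and the flag is derived from it once at the end.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_FLAG_POSIXL 0x01u /* hypothetical flag value */

/* Simplified stand-in for an SSC: the class word is always present. */
typedef struct {
    uint8_t  flags;
    uint32_t posix_classes; /* one bit per locale-dependent POSIX class */
} demo_ssc;

/* During assembly only the class word changes; the flag is not tracked. */
static void demo_ssc_add_class(demo_ssc *s, unsigned class_index)
{
    assert(class_index < 32u);
    s->posix_classes |= (uint32_t)1 << class_index;
}

/* At completion, derive the flag from the class word so the matcher does
 * not need a special case for this node kind. */
static void demo_ssc_finish(demo_ssc *s)
{
    if (s->posix_classes != 0) {
        s->flags |= DEMO_FLAG_POSIXL;
    } else {
        s->flags = (uint8_t)(s->flags & ~DEMO_FLAG_POSIXL);
    }
}
```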
* Revert "Free up bit for regex ANYOF nodes" (Karl Williamson, 2014-02-15; 1 file, -5/+21)
| | | | | This reverts commit 34fdef848b1687b91892ba55e9e0c3430e0770f6, and adds comments referring to it, in case it is ever needed.
* Free up bit for regex ANYOF nodes (Karl Williamson, 2014-02-15; 1 file, -16/+10)
| | | | | | | | | | | | | | | | | | | | | | | | | | This commit frees up a bit by using an extra regnode to pass the information to the regex engine instead of the flag. I originally thought that if this was needed, it should be the ANYOF_ABOVE_LATIN1_ALL bit, as that might speed some things up. But if we need to do this again by adding another node to get another bit, we want one that is mutually exclusive of the first one we did, for otherwise we start having to make 3 nodes instead of two to get the combinations 1 0, 0 1, and 1 1. This combinatorial problem is avoided by using bits that are mutually exclusive, which the ABOVE_LATIN1_ALL isn't, but the one freed by this commit ANYOF_NON_UTF8_NON_ASCII_ALL is only set under /d matching, and there are other bits that are set only under /l, so if we need to do this again, we should use one of those. I wrote this code when I thought I really needed a bit. But since then, I have figured out a better way to get the bit needed now. But I don't want to lose this code to posterity, so this commit is being made long enough to get the commit number, then it will be reverted, adding comments referring to the commit number, so that it can easily be reconstructed when necessary.
* regcomp.h: Rmv false comments (Karl Williamson, 2014-02-12; 1 file, -4/+4)
| | | | I misread the code when I added these comments