path: root/regexp.h
Commit message [Author, Age, Files, Lines]
* Eliminate PL_reg_state.re_reparsing, part 2 [David Mitchell, 2013-04-12, 1 file, -1/+0]
| | | | | | The previous commit added an alternative flag mechanism to PL_reg_state.re_reparsing, but kept the old one around for consistency checking. Remove the old one now.
* rework split() special case interaction with regex engine [Yves Orton, 2013-03-27, 1 file, -17/+18]
| This patch resolves several issues at once. The parts are sufficiently interconnected that it is hard to break it down into smaller commits. The tickets open for these issues are:
|
|   RT #94490 - split and constant folding
|   RT #116086 - split "\x20" doesn't work as documented
|
| It additionally corrects some issues with cached regexes that were exposed by the split changes (and applied to them).
|
| It effectively reverts 5255171e6cd0accee6f76ea2980e32b3b5b8e171 and cccd1425414e6518c1fc8b7bcaccfb119320c513.
|
| Prior to this patch the special RXf_SKIPWHITE behavior of split(" ", $thing) was only available if Perl could resolve the first argument to split at compile time, meaning under various arcane situations. This manifested as oddities like
|
|   my $delim = $cond ? " " : qr/\s+/;
|   split $delim, $string;
|
| and
|
|   split $cond ? " " : qr/\s+/, $string
|
| not behaving the same as:
|
|   ($cond ? split(" ", $string) : split(/\s+/, $string))
|
| which isn't very convenient.
|
| This patch changes this by adding a new flag to the op_pmflags, PMf_SPLIT, which enables pp_regcomp() to know whether it was called as part of split, which allows the RXf_SPLIT to be passed into run time regex compilation. We also preserve the original flags so pattern caching works properly, by adding a new property to the regexp structure, "compflags", and related macros for accessing it. We preserve the original flags passed into the compilation process, so we can compare when we are trying to decide if we need to recompile.
|
| Note that this is essentially the opposite fix from the one applied originally to fix #94490 in 5255171e6cd0accee6f76ea2980e32b3b5b8e171. The reverted patch was meant to make:
|
|   split( 0 || " ", $thing )             #1
|
| consistent with
|
|   my $x=0; split( $x || " ", $thing )   #2
|
| and not with
|
|   split( " ", $thing )                  #3
|
| This was reverted because it broke C<split("\x{20}", $thing)>, and because one might argue that it is not that #1 does the wrong thing, but rather that the behavior of #2 is wrong. In other words we might expect that all three should behave the same as #3, and that instead of "fixing" the behavior of #1 to be like #2, we should really fix the behavior of #2 to behave like #3. (Which is what we did.)
|
| Also, it doesn't make sense to move the special case detection logic further from the regex engine. We really want the regex engine to decide this stuff itself, otherwise split " ", ... wouldn't work properly with an alternate engine. (Imagine we add a special regexp meta pattern that behaves the same as " " does in a split /.../. For instance we might make split /(*SPLITWHITE)/ trigger the same behavior as split " ".)
|
| The other major change as a result of this patch is that it effectively reverts commit cccd1425414e6518c1fc8b7bcaccfb119320c513, which was intended to get rid of RXf_SPLIT and RXf_SKIPWHITE and free up bits in the regex flags structure. But we don't want to get rid of these vars, and it turns out that RXf_SEEN_LOOKBEHIND is used only in the same situation as the new RXf_MODIFIES_VARS. So I have renamed RXf_SEEN_LOOKBEHIND to RXf_NO_INPLACE_SUBST, and then instead of using two vars we use only the one. Which in turn allows RXf_SPLIT and RXf_SKIPWHITE to have their bits back.
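The caching idea in the split() rework can be sketched in C as follows. This is a minimal illustration, not the code from regexp.h: the `demo_` struct, the functions, and the bit values are all hypothetical, and the assumption that the engine adds bits to the post-compilation flags (which is why the original flags are stored separately) is an inference from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Illustrative bit values only; the real ones are defined in regexp.h. */
#define DEMO_RXf_SPLIT     (1u << 0)
#define DEMO_RXf_SKIPWHITE (1u << 1)

/* Hypothetical stand-in for a cached compiled pattern. */
typedef struct {
    const char *pattern;   /* source text the regex was compiled from */
    unsigned    compflags; /* flags exactly as passed to the compiler */
    unsigned    extflags;  /* flags after compilation; the engine may add bits */
} demo_regexp;

/* The engine itself decides whether split(" ") semantics apply. */
static demo_regexp demo_compile(const char *pattern, unsigned flags)
{
    demo_regexp rx;
    assert(pattern != NULL);
    rx.pattern   = pattern;
    rx.compflags = flags;
    rx.extflags  = flags;
    if ((flags & DEMO_RXf_SPLIT) && strcmp(pattern, " ") == 0)
        rx.extflags |= DEMO_RXf_SKIPWHITE;
    return rx;
}

/* A cached regex is reusable only if pattern AND original flags match. */
static bool demo_needs_recompile(const demo_regexp *rx,
                                 const char *pattern, unsigned flags)
{
    return strcmp(rx->pattern, pattern) != 0 || rx->compflags != flags;
}
```

Comparing against `compflags` rather than `extflags` means a pattern compiled for split is not mistaken for a stale entry merely because the engine set an extra bit on it.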
* Reorder the members of struct re_save_state to reduce its size on LP64. [Nicholas Clark, 2013-02-26, 1 file, -3/+4]
| | | | | This drops its size by 8 bytes on LP64 platforms. This also makes the interpreter struct 8 bytes smaller, as it embeds a re_save_state struct.
* Eliminate 'swap' from struct regexp_internal. [Nicholas Clark, 2013-02-20, 1 file, -1/+0]
| | | | It's been unused since commit e9105d30edfbaa7f in July 2009.
* Eliminate PL_reg_flags [David Mitchell, 2012-12-25, 1 file, -2/+0]
| The previous 3 commits have removed any usage of the 3 flag bits from this var; remove the (now unused) variable (which is actually #deffed to PL_reg_state.re_state_reg_flags).
|
| This change brought to you by the Campaign to Remove Global State from the Regex Engine(tm).
* Eliminate RF_tainted flag from PL_reg_flags [David Mitchell, 2012-12-25, 1 file, -4/+10]
| | | | | | | | | | | This global flag is cleared at the start of execution, and then set if any locale-based nodes are executed. At the end of execution, the RXf_TAINTED_SEEN flag on the regex is set/cleared based on RF_tainted. We eliminate RF_tainted by simply directly setting RXf_TAINTED_SEEN each time a taintable node is executed. This is the final step before eliminating PL_reg_flags.
* eliminate RF_warned flag from PL_reg_flags [David Mitchell, 2012-12-25, 1 file, -0/+1]
| | | | | | | | This global flag indicates whether the currently executing regex has issued a recursion limit warning yet. Replace it with a boolean var local to the regmatch_info struct. This is a second step to eliminating PL_reg_flags.
* eliminate RF_utf8 flag from PL_reg_flags [David Mitchell, 2012-12-25, 1 file, -3/+2]
| This global flag indicates whether the currently executing regex is utf8. Replace it with a boolean var local to the matching function, and pass it around via function args, or as a member of the regmatch_info struct.
|
| This is a first step to eliminating PL_reg_flags.
* eliminate PL_regsize [David Mitchell, 2012-12-16, 1 file, -2/+0]
| | | | | | | | | This var (or rather PL_reg_state.re_state_regsize, which it is #deffed to) just holds the index of the maximum opening paren index seen so far in S_regmatch(). So make it a local var of S_regmatch() and pass it as a param to the couple of static functions called from there that need it. (Also give the local var the more meaningful name 'maxopenparen'.)
* New COW mechanism [Father Chrysostomos, 2012-11-27, 1 file, -3/+3]
| This was discussed in ticket #114820.
|
| This new copy-on-write mechanism stores a reference count for the PV inside the PV itself, at the very end. (I was using SvEND+1 at first, but parts of the regexp engine expect to be able to do SvCUR_set(sv,0), which causes the wrong byte of the string to be used as the reference count.) Only 256 SVs can share the same PV this way. Also, only strings with allocated space after the trailing null can be used for copy-on-write.
|
| Much of the code is shared with PERL_OLD_COPY_ON_WRITE. The restriction against doing copy-on-write with magical variables has hence been inherited, though it is not necessary. A future commit will take care of that.
|
| I had to modify _core_swash_init to handle $@ differently. The existing mechanism of copying $@ to a new scalar and back again was very fragile. With copy-on-write, $@ =~ s/// can cause pp_subst’s string pointers to become stale. So now we remove the scalar from *@ and allow the utf8-table-loading code to autovivify a new one. Then we restore the untouched $@ afterwards if all goes well.
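The "reference count at the very end of the PV" scheme can be pictured with a small C sketch. Everything here is hypothetical: the `demo_pv` struct only mimics the buffer/current-length/allocated-length triple, and treating the byte as a count of additional sharers (0..255, hence 256 SVs in total) is an assumption made to match the limit stated in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a string body. */
typedef struct {
    char  *buf; /* string bytes followed by a trailing '\0' */
    size_t cur; /* bytes in use, not counting the trailing '\0' */
    size_t len; /* bytes allocated */
} demo_pv;

/* The count needs a spare byte after the trailing '\0'. */
static bool demo_can_cow(const demo_pv *pv)
{
    return pv->len >= pv->cur + 2;
}

/* The count lives in the LAST ALLOCATED byte, so its position depends on
 * len, not cur; truncating the string (cur = 0) does not move it. */
static unsigned char *demo_cow_count(const demo_pv *pv)
{
    assert(demo_can_cow(pv));
    return (unsigned char *)pv->buf + pv->len - 1;
}

/* Register one more sharer; fails once the single byte is saturated. */
static bool demo_cow_share(demo_pv *pv)
{
    unsigned char *count = demo_cow_count(pv);
    if (*count == 255)
        return false;
    ++*count;
    return true;
}
```

Placing the count relative to the allocation size rather than the string end is what sidesteps the SvCUR_set(sv,0) problem described above.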
* Add C define to remove taint support from perl [Steffen Mueller, 2012-11-05, 1 file, -0/+8]
| By defining NO_TAINT_SUPPORT, all the various checks that perl does for tainting become no-ops. It's not an entirely complete change: it doesn't attempt to remove the taint-related interpreter variables, but instead virtually eliminates access to them.
|
| Why, you ask? Because it appears to speed up perl's run-time significantly by avoiding various "are we running under taint" checks and the like.
|
| This change is not in a state to go into blead yet. The actual way I implemented it might raise some (valid) objections. Basically, I replaced all uses of the global taint variables (but not PL_taint_warn!) with an extra layer of get/set macros (TAINT_get/TAINTING_get). Furthermore, the change is not complete:
|
| - PL_taint_warn would likely deserve the same treatment.
| - Obviously, tests fail. We have tests for -t/-T.
| - Right now, I added a Perl warn() on startup when -t/-T are detected but the perl was not compiled to support it. It might be argued that it should be silently ignored! Needs some thinking.
| - Code quality concerns - needs review.
| - Configure support required.
| - Needs thinking: How does this tie in with CPAN XS modules that use PL_taint and friends? It's easy to backport the new macros via PPPort, but that doesn't magically change all code out there. Might be harmless, though, because whenever you're running under NO_TAINT_SUPPORT, any check of PL_taint/etc is going to come up false. Thus, the only CPAN code that SHOULD be adversely affected is code that changes taint state.
* Allow regexp-to-pvlv assignment [Father Chrysostomos, 2012-10-30, 1 file, -65/+21]
| Since the xpvlv and regexp structs conflict, we have to find somewhere else to put the regexp struct.
|
| I was going to sneak it in SvPVX, allocating a buffer large enough to fit the regexp struct followed by the string, and have SvPVX - sizeof(regexp) point to the struct. But that would make all regexp flag-checking macros fatter, and those are used in hot code.
|
| So I came up with another method. Regexp stringification is not speed-critical. So we can move the regexp stringification out of re->sv_u and put it in the regexp struct. Then the regexp struct itself can be pointed to by re->sv_u. So SVt_REGEXPs will have re->sv_any and re->sv_u pointing to the same spot. PVLVs can then have sv->sv_any point to the xpvlv body as usual, but have sv->sv_u point to a regexp struct. All regexp member access can go through sv_u instead of sv_any, which will be no slower than before.
|
| Regular expressions will no longer be SvPOK, so we give sv_2pv special logic for regexps. We don’t need to make the regexp struct larger, as SvLEN is currently always 0 iff mother_re is set. So we can replace the SvLEN field with the pv.
|
| SvFAKE is never used without SvPOK or SvSCREAM also set. So we can use that to identify regexps.
* regexec: Do less work on quantified UTF-8 [Karl Williamson, 2012-10-16, 1 file, -2/+8]
| Consider the regexes /A*B/ and /A*?B/ where A and B are arbitrary, except that B begins with an EXACTish node. Prior to this patch, as a shortcut, the loop for accumulating A* would look for the first character of B to help it decide if B is a possibility for the next thing. It did not test for all of B unless testing showed that the next thing could be the beginning of B. If the target string was UTF-8, it converted each new sequence of bytes to the code point they represented, and then did the comparison. This is a relatively expensive process.
|
| This commit avoids that conversion by just doing a memEQ at the current input position. To do this, it revamps S_setup_EXACTISH_ST_c1_c2() to output the UTF-8 sequences to compare against. The function also has been tightened up so that there are fewer false positives.
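The byte-comparison shortcut can be illustrated with a minimal C sketch. It uses the standard `memcmp` in place of Perl's memEQ, and `demo_next_is` is a hypothetical helper, not a function from regexec.c; the candidate byte sequence is assumed to have been prepared in advance, as the message says the setup function now does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Does the input at pos start with the precomputed UTF-8 byte sequence?
 * No decoding to a code point is needed; the encoded bytes are compared
 * directly against the candidate. */
static bool demo_next_is(const unsigned char *pos, const unsigned char *end,
                         const unsigned char *seq, size_t seq_len)
{
    assert(pos <= end);
    return (size_t)(end - pos) >= seq_len
        && memcmp(pos, seq, seq_len) == 0;
}
```

A caller holding more than one candidate sequence could simply call the helper once per candidate at the current position.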
* regexp.h: Update comments [Karl Williamson, 2012-10-16, 1 file, -2/+2]
| | | | | These comments should have been changed in commit c74f6de970ef0f0eb8ba43b1840fde0cf5a45497, but were mistakenly omitted.
* RXf_MODIFIES_VARS [Father Chrysostomos, 2012-10-11, 1 file, -0/+2]
| | | | | | regcomp.c sets this new flag whenever regops that could modify $REGMARK or $REGERROR have been seen. pp_subst will use this to tell whether it should repeatedly stringify the replacement.
* Define RXf_SPLIT and RXf_SKIPWHITE as 0 [Father Chrysostomos, 2012-10-11, 1 file, -15/+13]
| They are no longer used in core, and we need room for more flags.
|
| The only CPAN modules that use them check whether RXf_SPLIT is set (which no longer happens) before setting RXf_SKIPWHITE (which is ignored).
* [perl #94490] const fold should not trigger special split " " [Father Chrysostomos, 2012-09-22, 1 file, -5/+11]
| | | | | | | | | | | The easiest way to fix this was to move the special handling out of the regexp engine. Now a flag is set on the split op itself for this case. A real regexp is still created, as that is the most convenient way to propagate locale settings, and it prevents the need to rework pp_split to handle a null regexp. This also means that custom regexp plugins no longer need to handle split specially (which they all do currently).
* regexp.h: Correct comment [Father Chrysostomos, 2012-09-22, 1 file, -1/+1]
| | | | RXf_SKIPWHITE is for split " ", which is special, *not* for split / /.
* eliminate PL_reginput [David Mitchell, 2012-09-14, 1 file, -2/+0]
| | | | | | | | | | | | | PL_reginput (which is actually #defined to PL_reg_state.re_state_reginput) is, to all intents and purposes, state that is only used within S_regmatch(). The only other places it is referenced are in S_regtry() and S_regrepeat(), where it is used to pass the current match position back and forth between the subs. Do this passing instead via function args, and bingo! PL_reginput is now just a local var of S_regmatch().
* Don't copy all of the match string buffer [David Mitchell, 2012-09-08, 1 file, -0/+25]
| When a pattern matches, and that pattern contains captures (or $`, $&, $' or /p are present), a copy is made of the whole original string, so that $1 et al continue to hold the correct value even if the original string is subsequently modified. This can have severe performance penalties; for example, this code causes a 1Mb buffer to be allocated, copied and freed a million times:
|
|   $&;
|   $x = 'x' x 1_000_000;
|   1 while $x =~ /(.)/g;
|
| This commit changes this so that, where possible, only the needed substring of the original string is copied: in the above case, only a 1-byte buffer is copied each time. Also, it now reuses or reallocs the buffer, rather than freeing and mallocing each time.
|
| Now that PL_sawampersand is a 3-bit flag indicating separately whether $`, $& and $' have been seen, they each contribute only their own individual penalty; which ones have been seen will limit the extent to which we can avoid copying the whole buffer.
|
| Note that the above code *without* the $& is not currently slow, but only because the copying is artificially disabled to avoid the performance hit. The next but one commit will remove that hack, meaning that it will still be fast, but will now be correct in the presence of a modified original string.
|
| We achieve this by adding suboffset and subcoffset fields to the existing subbeg and sublen fields of a regex, to indicate how many bytes and characters have been skipped from the logical start of the string till the physical start of the buffer. To avoid copying stuff at the end, we just reduce sublen. For example, in this:
|
|   "abcdefgh" =~ /(c)d/
|
| subbeg points to a malloced buffer containing "c\0"; sublen == 1, and suboffset == 2 (as does subcoffset).
|
| If $& has been seen, subbeg points to a malloced buffer containing "cd\0"; sublen == 2, and suboffset == 2.
|
| If in addition $' has been seen, then subbeg points to a malloced buffer containing "cdefgh\0"; sublen == 6, and suboffset == 2.
|
| The regex engine won't do this by default; there are two new flag bits, REXEC_COPY_SKIP_PRE and REXEC_COPY_SKIP_POST, which in conjunction with REXEC_COPY_STR, request that the engine skip the start or end of the buffer (it will still copy in the presence of the relevant $`, $&, $', /p).
|
| Only pp_match has been enhanced to use these extra flags; substitution can't easily benefit, since the usual action of s///g is to copy the whole string first time round, then perform subsequent matching iterations against the copy, without further copying. So you still need to copy most of the buffer.
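The suboffset/sublen arithmetic in the worked example can be reproduced with a short C sketch. The `demo_` names and flag bits are hypothetical and only byte offsets are modelled (no subcoffset); it assumes the pattern has at least one capture and that the caller passes the lowest capture start and highest capture end.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bits recording which match variables have been seen. */
#define DEMO_SAW_PREMATCH  (1u << 0) /* $` */
#define DEMO_SAW_MATCH     (1u << 1) /* $& */
#define DEMO_SAW_POSTMATCH (1u << 2) /* $' */

typedef struct {
    size_t suboffset; /* bytes skipped from the logical start of the string */
    size_t sublen;    /* bytes actually copied */
} demo_copy_span;

/* len: subject length; m_start/m_end: whole-match offsets;
 * c_start/c_end: lowest capture start and highest capture end. */
static demo_copy_span demo_span(size_t len, size_t m_start, size_t m_end,
                                size_t c_start, size_t c_end, unsigned saw)
{
    size_t lo = c_start, hi = c_end;
    demo_copy_span s;

    assert(m_start <= m_end && m_end <= len);
    assert(c_start <= c_end && c_end <= len);

    if (saw & DEMO_SAW_MATCH) {      /* $& needs the whole match */
        if (m_start < lo) lo = m_start;
        if (m_end > hi)   hi = m_end;
    }
    if (saw & DEMO_SAW_PREMATCH)     /* $` needs everything before it */
        lo = 0;
    if (saw & DEMO_SAW_POSTMATCH)    /* $' needs everything after it */
        hi = len;

    s.suboffset = lo;
    s.sublen    = hi - lo;
    return s;
}
```

For "abcdefgh" =~ /(c)d/ the match spans offsets 2..4 and the capture 2..3, so the three cases in the message correspond to `saw` values of 0, `DEMO_SAW_MATCH`, and `DEMO_SAW_MATCH | DEMO_SAW_POSTMATCH`.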
* Separate handling of ${^PREMATCH} from $` etc [David Mitchell, 2012-09-08, 1 file, -3/+6]
| | | | | | | | | | | | | | | | | | | Currently the handling of getting the value, length etc of ${^PREMATCH} etc is identical to that of $` etc. Handle them separately, by adding RX_BUFF_IDX_CARET_PREMATCH etc constants to the existing RX_BUFF_IDX_PREMATCH set. This allows, when retrieving them, to always return undef if the current match didn't use //p. Previously the result depended on stuff such as whether the (non-//p) pattern included captures or not. The documentation for ${^PREMATCH} etc states that it's only guaranteed to return a defined value when the last pattern was //p. As well as making things more consistent, this is a necessary prerequisite for the following commit, which may not always copy the whole string during a non-//p match.
* $+ and $^N not always correct on backtracking [David Mitchell, 2012-06-13, 1 file, -0/+7]
| | | | | | | | | | Certain ops (TRIE, BRANCH, CURLYM, CURLY and related) didn't always correctly restore or update lastcloseparen ($^N) - and sometimes lastparen ($+) - when backtracking, or doing a zero-length match (i.e. A* matching zero times). Fix this by saving lastcloseparen and lastparen in the relevant regmatch_state union sub structs, then restoring as necessary.
* reduce size of struct regmatch_state [David Mitchell, 2012-06-13, 1 file, -2/+1]
| | | | | | | | | | Currently the trie struct is the largest sub-struct in the union; reduce its size by 2 x 8 bytes (on a 64-bit system) by 1) aligning a bool better 2) eliminating ST.B, which can be trivially derived as ST.me + NEXT_OFF(ST.me) (I'm going to partially spoil this by adding a new field in the next commit)
* make regexp_paren_pair.start_tmp an offset [David Mitchell, 2012-06-13, 1 file, -1/+1]
| Currently the start_tmp field is a pointer into the string, whereas the start and end fields are offsets within that string. Make start_tmp an offset too for consistency.
* eliminate PL_reg_start_tmp, PL_reg_start_tmpl [David Mitchell, 2012-06-13, 1 file, -4/+9]
| | | | | | | | | | | | | | | | | | PL_reg_start_tmp is a global array of temporary parentheses start positions. An element is set when a '(' is first encountered, while when a ')' is seen, the per-regex offs array is updated with the start and end position: the end derived from the position where the ')' was encountered, and the start position derived from PL_reg_start_tmp[n]. This allows us to differentiate between pending and fully-processed captures. Change it so that the tmp start value becomes a third field in the offs array (.start_tmp), along with the existing .start and .end fields. This makes the value now per regex rather than global. Although it uses a bit more memory (the start_tmp values aren't needed after the match has completed), it simplifies the code, and will make it easier to make a (??{}) switch to the new regex without having to dump everything on the save stack.
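The pending-versus-committed distinction described for start_tmp can be sketched in C like this. The field names `.start`, `.end` and `.start_tmp` come from the message; the struct name, the `long` field types, the use of -1 for "unset", and the two helper functions are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-regex capture record. */
typedef struct {
    long start;     /* committed start offset, -1 if group has not closed */
    long end;       /* committed end offset, -1 if group has not closed */
    long start_tmp; /* pending start, recorded when '(' is encountered */
} demo_paren_pair;

/* '(' seen: remember where the group might start, commit nothing yet. */
static void demo_open_paren(demo_paren_pair *offs, size_t n, long pos)
{
    offs[n].start_tmp = pos;
}

/* ')' seen: the capture is now complete, so publish start and end. */
static void demo_close_paren(demo_paren_pair *offs, size_t n, long pos)
{
    assert(offs[n].start_tmp >= 0 && pos >= offs[n].start_tmp);
    offs[n].start = offs[n].start_tmp;
    offs[n].end   = pos;
}
```

Because the pending value sits next to the committed ones in the same per-regex array, no global array has to be saved or swapped when execution moves to another regex.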
* eliminate PL_reglast(close)?paren, PL_regoffs [David Mitchell, 2012-06-13, 1 file, -6/+0]
| | | | | | | | | | | | | | eliminate the three vars PL_reglastcloseparen PL_reglastparen PL_regoffs (which are actually aliases to PL_reg_state struct elements). These three vars always point to the corresponding fields within the currently executing regex; so just access those fields directly instead. This makes switching between regexes with (??{}) simpler: just update rex, and everything automatically references the new fields.
* make is_bare_re bool, not int in re_op_compile [David Mitchell, 2012-06-13, 1 file, -1/+1]
| | | | | This flag pointer only stores truth, so make it a pointer to a bool rather than to an int.
* rename and simplify PL_reg_eval_set [David Mitchell, 2012-06-13, 1 file, -2/+1]
| | | | | | | | | These days PL_reg_eval_set is actually aliased to PL_reg_state.re_state_reg_eval_set. Rename this field to re_state_eval_setup_done to make it clearer what it represents; remove the PL_ backcompat macro; make it boolean; and remove the two redundant macros RS_init and RS_set.
* eliminate RExC_seen_evals and RExC_rx->seen_evals [David Mitchell, 2012-06-13, 1 file, -3/+0]
| | | | | these were used as part of the old "use re 'eval'" security mechanism used by the now-eliminated PL_reginterp_cnt
* Fix up runtime regex codeblocks. [David Mitchell, 2012-06-13, 1 file, -0/+1]
| The previous commits in this branch have brought literal code blocks into the New World Order; now do the same for runtime blocks, i.e. those needing "use re 'eval'".
|
| The main user-visible changes from this commit are that:
|
| * the code is now fully parsed, rather than needing balanced {}'s; i.e. this now works:
|
|     my $code = q[ (?{ $a = '{' }) ];
|     use re 'eval';
|     /$code/
|
| * warnings and errors are now reported as coming from "(eval NNN)" rather than "(re_eval NNN)" (although see the next commit for some fixups to that). Indeed, the string "re_eval" has been expunged from the source and documentation.
|
| The big internal difference is that the sv_compile_2op() and sv_compile_2op_is_broken() functions are no longer used, and will be removed shortly.
|
| It works by the regex compiler detecting the presence of run-time code blocks, and feeding the whole pattern string back into the parser (where the run-time blocks are now seen as compile-time), then extracting out any compiled code blocks and adding them to the mix.
|
| For example, in the following:
|
|     $c = '(?{"runtime"})d';
|     use re 'eval';
|     /a(?{"literal"})\b'$c/
|
| At the point the regex compiler is called, the perl parser will already have compiled the literal code block and presented it to the regex engine. The engine examines the pattern string, sees two '(?{', but only one accounted for by the parser, and so constructs a short string to be evalled: based on the pattern, but with literal code-blocks blanked out, and \ and ' escaped. In the above example, the pattern string is
|
|     a(?{"literal"})\b'(?{"runtime"})d
|
| and we call eval_sv() with an SV containing the text
|
|     qr'a \\b\'(?{"runtime"})d'
|
| The returned qr will contain the new code-block (and associated CV and pad) which can be extracted and added to the list of compiled code blocks of the original pattern.
|
| Note that with this scheme, the requirement for "use re 'eval'" is easily determined, and no longer requires all the pp_regcreset / PL_reginterp_cnt machinery, which will be removed shortly.
|
| Two subtleties of this scheme are that normally, \\ isn't collapsed into \ for literal regexes (unlike literal strings), and hints aren't inherited when using eval_sv(). We get round both of these by adding and setting a new flag, PL_reg_state.re_reparsing, which indicates that we are refeeding a pattern into the perl parser.
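The construction of the string handed to the parser can be sketched in C. This is not the regcomp.c implementation: `demo_build_eval_text` and `demo_block` are hypothetical, the positions of the already-compiled literal blocks are assumed to be supplied by the caller, and blanking each literal block with one space per byte is an assumption (the message only says the blocks are "blanked out").

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Half-open byte range [start, end) of a literal code block. */
typedef struct { size_t start, end; } demo_block;

/* Writes qr'...' into out: bytes inside literal blocks become spaces,
 * and every backslash or single quote elsewhere gets a backslash in
 * front. Blocks must be sorted and non-overlapping. Returns the length
 * of the resulting text. */
static size_t demo_build_eval_text(const char *pat,
                                   const demo_block *lit, size_t nlit,
                                   char *out, size_t cap)
{
    size_t len = strlen(pat), i, b = 0, n = 0;

    /* Worst case: every byte escaped, plus the qr'' wrapper and a NUL. */
    assert(cap >= 2 * len + 5);

    out[n++] = 'q'; out[n++] = 'r'; out[n++] = '\'';
    for (i = 0; i < len; i++) {
        while (b < nlit && i >= lit[b].end)
            b++;                              /* move past finished blocks */
        if (b < nlit && i >= lit[b].start) {
            out[n++] = ' ';                   /* blank out literal block */
            continue;
        }
        if (pat[i] == '\\' || pat[i] == '\'')
            out[n++] = '\\';                  /* escape \ and ' */
        out[n++] = pat[i];
    }
    out[n++] = '\'';
    out[n] = '\0';
    return n;
}
```

Keeping the blanked text the same length as the original would mean offsets into the re-parsed pattern still line up with the original pattern string, which is one plausible reason for blanking rather than deleting.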
* make _REGEXP_COMMON work under win32 [David Mitchell, 2012-06-13, 1 file, -1/+1]
| | | | There's an extra trailing semicolon which the win32 compiler doesn't like.
* add op_comp field to regexp_engine API [David Mitchell, 2012-06-13, 1 file, -0/+4]
| Perl's internal function for compiling regexes that knows about code blocks, Perl_re_op_compile, isn't part of the engine API. However, the way that regcomp.c is dual-lifed as ext/re/re_comp.c with debugging compiled in, means that Perl_re_op_compile is also compiled as my_re_op_compile. These days the mechanism to choose whether to call the main functions or the debugging my_* functions when 'use re debug' is in scope, is the re engine API jump table. Ergo, to ensure that my_re_op_compile gets called appropriately, this method needs adding to the jump table.
|
| So, I've added it, but documented as 'for perl internal use only, set to null in your engine'.
|
| I've also updated current_re_engine() to always return a pointer to a jump table, even if we're using the internal engine (formerly it returned null). This then allows us to use the simple condition (eng->op_comp) to determine whether the current engine supports code blocks.
* preserve code blocks in interpolated qr//s [David Mitchell, 2012-06-13, 1 file, -0/+1]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This now works: { my $x = 1; $r = qr/(??{$x})/ } my $x = 2; print "ok\n" if "1" =~ /^$r$/; When a qr// is interpolated into another pattern, the pattern is still recompiled using the stringified qr, but now the pre-compiled code blocks from the qr are reused rather than being re-compiled, so it behaves like a closure. Note that this makes some tests in regexp_qr_embed_thr.t fail, due to a pre-existing threads bug, which can be summarised as: use threads; my $s = threads->new(sub { return sub { $::x = 1} })->join; $s->(); print "\$::x=[$::x]\n"; which prints undef, not 1, since the *::x is cloned into the child thread, then cloned back into the parent as part of the CV (linked from the pad) being returned in the join. The cloning/join code isn't clever enough to realise that the back-cloned *::x is the same as the original *::x, so the main thread ends up with two copies. This manifests itself in the re tests as my $re = threads->new( sub { qr/(?{$::x = 1 })/ })->join(); where, since the returned qr// is now a closure, it suffers from the same glob duplication in the parent. So I've disabled 4 re_tests tests under threads for now.
* in re_op_compile(), keep code_blocks for qr// [David Mitchell, 2012-06-13, 1 file, -1/+10]
| | | | | | | | code_blocks is a temporary list of start/end indices and pointers to DO blocks, that is used during the regexp compilation. Change it so that in the qr// case, this structure is preserved (attached to regexp_internal), so that in a forthcoming commit it will be available for use when interpolating a qr within another pattern.
* make qr/(?{})/ behave with closures [David Mitchell, 2012-06-13, 1 file, -1/+2]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With this commit, qr// with a literal (compile-time) code block will Do the Right Thing as regards closures and the scope of lexical vars; in particular, the following now creates three regexes that match 1, 2 and 3: for my $i (0..2) { push @r, qr/^(??{$i})$/; } "1" =~ $r[1]; # matches Previously, $i would be evaluated as undef in all 3 patterns. This is achieved by wrapping the compilation of the pattern within a new anonymous CV, which is then attached to the pattern. At run-time pp_qr() clones the CV as well as copying the REGEXP; and when the code block is executed, it does so using the pad of the cloned CV. Which makes everything come out all right in the wash. The CV is stored in a new field of the REGEXP, called qr_anoncv. Note that run-time qr//s are still not fixed, e.g. qr/$foo(?{...})/; nor is it yet fixed where the qr// is embedded within another pattern: continuing with the code example from above, my $i = 99; "1" =~ $r[1]; # bare qr: matches: correct! "X99" =~ /X$r[1]/; # embedded qr: matches: whoops, it's still seeing the wrong $i
* update the editor hints for spaces, not tabs [Ricardo Signes, 2012-05-29, 1 file, -2/+2]
| | | | | This updates the editor hints in our files for Emacs and vim to request that tabs be inserted as spaces.
* regcomp.c: Add ability to take union of a complement [Karl Williamson, 2012-02-09, 1 file, -0/+1]
| | | | | | | | | Previous commits have added the ability to the inversion list intersection routine to take the complement of one of its inputs. Likewise, for unions, this will be a frequent paradigm, and it is cheaper to do the complement of an input in the routine than to construct a new temporary that is the desired complement, and throw it away.
* regcomp.c: _invlist_subtract() becomes a macro [Karl Williamson, 2012-02-09, 1 file, -0/+4]
| | | | | | This function is no longer necessary, as it is just a call to the newly created _invlist_intersection_maybe_complement_2nd() with the correct parameters.
* regcomp.c: Add ability to take intersection of complement [Karl Williamson, 2012-02-09, 1 file, -0/+4]
| It turns out that it is a common paradigm to want to take the intersection of an inversion list with the complement of another inversion list. In fact, this is how to subtract the second inversion list from the first, as what remains in the first after the subtraction is everything in it that is not in the second.
|
| It also turns out that it adds very few cycles to an intersection to complement one (or both, should we choose to) of the operands. By adding this capability, we don't have to create a copy of the inverted operand beforehand, just to throw it away.
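Why complementing an operand during intersection is nearly free can be seen in a small C sketch over plain arrays. This is a hypothetical reimplementation with its own signature, not the regcomp.c routine; it assumes each inversion list is a strictly increasing array of boundaries where membership starts as "out" and toggles at every boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* {10, 20, 30, 40} denotes [10,20) and [30,40).
 * Intersects a with b, or with the complement of b when complement_b is
 * set. Writes the resulting inversion list to out, returns its length. */
static size_t demo_intersect_maybe_complement_2nd(
    const unsigned *a, size_t na,
    const unsigned *b, size_t nb,
    bool complement_b,
    unsigned *out, size_t out_cap)
{
    size_t ia = 0, ib = 0, n = 0;
    bool in_a = false;
    bool in_b = complement_b; /* complementing only flips the start state */
    bool in_out = false;

    while (ia < na || ib < nb) {
        unsigned cp;
        bool now;

        /* The next boundary is the smaller of the two list heads. */
        if (ib >= nb || (ia < na && a[ia] <= b[ib]))
            cp = a[ia];
        else
            cp = b[ib];

        /* Toggle whichever lists have a boundary at cp. */
        if (ia < na && a[ia] == cp) { in_a = !in_a; ia++; }
        if (ib < nb && b[ib] == cp) { in_b = !in_b; ib++; }

        now = in_a && in_b;
        if (now != in_out) {  /* result membership changed: emit boundary */
            assert(n < out_cap);
            out[n++] = cp;
            in_out = now;
        }
    }
    return n;
}

/* Subtraction is just intersection with the complement of the second. */
#define demo_subtract(a, na, b, nb, out, cap) \
    demo_intersect_maybe_complement_2nd((a), (na), (b), (nb), true, (out), (cap))
```

In this sketch the only cost of complementing is the initial value of `in_b`, so no temporary inverted copy of the second list is ever built.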
* regexp.h: Update comment [Karl Williamson, 2012-01-21, 1 file, -3/+4]
|
* avoid overflowing a 32-bit signed int [Tony Cook, 2012-01-17, 1 file, -1/+1]
| | | | | | and the associated warning from Solaris C: "regcomp.c", line 5294: warning: integer overflow detected: op "<<"
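The message does not show the change itself. As a general illustration of this class of warning, shifting a signed 1 into bit 31 overflows a 32-bit int, and one common remedy is to do the shift on an unsigned operand. The helper below is hypothetical and is not claimed to be the fix that was applied.

```c
#include <assert.h>
#include <stdint.h>

/* Build a single-bit mask without shifting into the sign bit.
 * (1 << 31) overflows a 32-bit signed int; the unsigned form is
 * well defined for every n below 32. */
static uint32_t demo_bit(unsigned n)
{
    assert(n < 32);
    return (uint32_t)1 << n;
}
```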
* regexp.h: remove completely redundant return statement [Ævar Arnfjörð Bjarmason, 2011-12-23, 1 file, -1/+0]
| Remove a redundant return() statement at the end of the get_regex_charset_name function. The "default" case for the above switch statement will always return for us.
|
| This was added intentionally in v5.14.0-354-g0984e55 by Jim Cromie, but the rationale for doing so is that we might have a compiler bug here, but we're pretty screwed anyway if switch statements stop working as advertised by the standard so there's no reason to be defensive in this particular case.
|
| This is also causing a lot of whine from Sun Studio 12 Update 1:
|
|   "regexp.h", line 329: warning: statement not reached
* regexp.h: work around -Werror compile failure of XS in linux perf tool [Jim Cromie, 2011-06-02, 1 file, -1/+2]
| The linux perf tool has an XS component, but when built using system perl 5.14.0, its compilation errors out on the switch statement in regexp.h: get_regex_charset_name(), which lacks a default case. Add one, copying the end-of-function return "?".
|
| Preserve the end-of-function return "?", to avoid a hypothetical compiler bug which misses the default case, and infers a void return for a function that's declared otherwise.
|
| Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
* Allow suffix form for /a /d /l /u [Karl Williamson, 2011-02-19, 1 file, -2/+4]
| | | | | | This patch contains the code changes for doing this, but not most of the pod changes, nor the new .t tests required. There were already tests in place to make sure that this didn't break backcompat.
* regexp.h: The length of 'aa' is 2 [Karl Williamson, 2011-02-19, 1 file, -1/+2]
|
* document how tainting works with pattern matching [David Mitchell, 2011-02-16, 1 file, -1/+5]
|
* Initial setup to accommodate /aa regex modifier [Karl Williamson, 2011-02-14, 1 file, -2/+5]
| | | | | This changes the bits to add a new charset type for /aa, and other bookkeeping for it.
* Add /a regex modifier [Karl Williamson, 2011-01-17, 1 file, -0/+3]
| | | | | This restricts certain constructs, like \w, to matching in the ASCII range only.
* Change name of /d to DEPENDS [Karl Williamson, 2011-01-16, 1 file, -3/+3]
| | | | | | | I much prefer David Golden's name for /d whose meaning 'depends' on circumstances, instead of 'dual' meaning it could be one or another. Change it before this gets out in a stable release, and we're stuck with the old name.
* Use multi-bit field for regex character set [Karl Williamson, 2011-01-16, 1 file, -1/+27]
| | | | | | | | | | | | | The /d, /l, and /u regex modifiers are mutually exclusive. This patch changes the field that stores the character set to use more than one bit with an enum determining which one. This data structure more closely follows the semantics of their being mutually exclusive, and conserves bits as well, and is better expandable. A small API is added to set and query the bit field. This patch is not .xs source backwards compatible. A handful of cpan programs are affected.
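Storing mutually exclusive modifiers as a small number rather than one bit each can be sketched in C as below. The enum, the shift, the mask width and the function names are all hypothetical and do not reproduce the layout or API in regexp.h; the "?" fallback mirrors the default case discussed in the later get_regex_charset_name commits.

```c
#include <assert.h>

/* Mutually exclusive character-set modes, held as a value, not as flags. */
typedef enum {
    DEMO_CHARSET_D = 0, /* /d */
    DEMO_CHARSET_L,     /* /l */
    DEMO_CHARSET_U      /* /u */
} demo_charset;

/* Hypothetical layout: a 2-bit field starting at bit 5 of the flags word. */
#define DEMO_CHARSET_SHIFT 5
#define DEMO_CHARSET_MASK  (3u << DEMO_CHARSET_SHIFT)

/* Replace the field, leaving all other flag bits untouched. */
static void demo_set_charset(unsigned *flags, demo_charset cs)
{
    assert((unsigned)cs <= 3u);
    *flags = (*flags & ~DEMO_CHARSET_MASK)
           | ((unsigned)cs << DEMO_CHARSET_SHIFT);
}

static demo_charset demo_get_charset(unsigned flags)
{
    return (demo_charset)((flags & DEMO_CHARSET_MASK) >> DEMO_CHARSET_SHIFT);
}

static const char *demo_charset_name(demo_charset cs)
{
    switch (cs) {
        case DEMO_CHARSET_D: return "d";
        case DEMO_CHARSET_L: return "l";
        case DEMO_CHARSET_U: return "u";
        default:             return "?"; /* unknown value */
    }
}
```

With two bits the field can hold four values, so a fourth mode fits without claiming another flag bit, and setting one mode automatically clears the previous one.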