| Commit message | Author | Age | Files | Lines |
| |
|
| |
|
|
|
|
|
|
|
| |
By definition a regex pattern that is in UTF-8 uses Unicode matching
rules, and EXACTF is non-Unicode (unless the target string is UTF-8).
Therefore an EXACTF node will never be generated for a UTF-8 pattern,
and there is no need to test for it being so.
|
|
|
|
|
| |
We don't have to convert from utf8 to code point to fold; instead we
can call the function that starts from utf8
|
|
|
|
|
|
|
|
| |
This revises commit e067297c376fbbb5a0dc8428c65d922f11e1f4c6
slightly so that we round up to get the search stopping point.
We aren't matching partial characters, so if we were to match 3+1/3
characters, we really have to match 4 characters.
|
| |
|
| |
|
|
|
|
| |
The latter is the Perl standard way of making this declaration
|
|
|
|
|
|
| |
The root cause of this bug is that the code assumed a string was in
utf8 when it wasn't, and so treated a byte as a starter byte when it
wasn't one, skipping ahead based on that supposed starter byte.
|
|
|
|
| |
This extends the recent commits so that regrepeat() also avoids re-folding.
|
|
|
|
|
| |
A recent commit allowed regexec.c to stop recalculating the folds in
one circumstance. This one adds the same case in regmatch().
|
|
|
|
|
|
|
|
|
|
|
| |
If you watch an execution trace of regexec /i, often you will see it
folding the same thing over and over as it backtracks or searches
ahead. regcomp.c has now been changed to always fold UTF-8 encoded
EXACTF and EXACTFU nodes. This allows these to not be re-folded each
time.
This commit does it just for find_byclass(). Other commits will expand
this technique to other cases.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is a partial reversion of commit
7c1b9f38fcbfdb3a9e1766e02bcb991d1a5452d9
which went unnecessarily far in fixing the problem.
After studying the situation some more, I see more clearly what was
going on. The point is that if you have only 2 characters left in the
string, but the pattern requires 3 to work, it's guaranteed to fail, so
pointless, and unnecessary work, to try. So don't begin a match trial
at a position when there are fewer than the minimum number of characters
necessary. That is what the code before that commit did. However it
neglected the fact that it is possible for a single character to match
multiple ones, so there is not a 1:1 ratio. This new commit assumes the
worst possible ratio to calculate how far into a string is the furthest
a successful match could start. In most cases this will still look
too far, but it is much better than always going up to the final
character, as the previous patch did.
The maximum ratio is guaranteed by Unicode to be 3:1, but when the
target isn't in UTF-8, the max is 2:1, determined simply by inspection
of the defined folds. And actually, currently, the single case where it
isn't 1:1 doesn't come up here, because regcomp.c guarantees that that
match doesn't generate one of these EXACTFish nodes. However, I expect
that to change for 5.16, and so am preparing for that case by making it
2:1.
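The cutoff arithmetic described above can be sketched as follows; this is an illustrative stand-alone helper, not the code regexec.c actually uses, and the function names are hypothetical:

```c
#include <assert.h>

/* Hypothetical sketch: how many target characters a match needs at
 * minimum, given that a single target character may fold to at most
 * max_ratio pattern characters (3 for UTF-8 targets, 2 otherwise).
 * Round up, because partial characters cannot match: needing to match
 * "3 + 1/3" characters really means matching 4. */
static long
min_target_chars(long min_pattern_chars, int max_ratio)
{
    /* ceiling division */
    return (min_pattern_chars + max_ratio - 1) / max_ratio;
}

/* The furthest character offset into the target at which a successful
 * match could still begin; trial starts past this point are guaranteed
 * to fail, so the engine need not try them. */
static long
latest_start(long target_chars, long min_pattern_chars, int max_ratio)
{
    return target_chars - min_target_chars(min_pattern_chars, max_ratio);
}
```

With a 10-character target and a pattern that must match 10 characters, a 3:1 ratio means only 4 target characters might suffice, so trial positions up to offset 6 must still be attempted.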
|
| |
|
|
|
|
|
|
|
| |
The structure of this code is that initial setup is done and then gotos
or fall-through used to join for the main logic. This commit just moves
a block, without logic changes, so that the more common case has a
fall-through instead of a goto.
|
|
|
|
|
|
|
|
|
|
|
| |
Only the first character of the string was being checked when scanning
for the beginning position of the pattern match.
This was so wrong, it looks like it has to be a regression. I
experimented a little and did not find any. I believe (but am not
certain) that a multi-char fold has to be involved. The handling of
these was so broken before 5.14 that there very well may not be a
regression.
|
| |
|
|
|
|
|
|
|
|
| |
When a swash is loaded, generally it is checked for sanity with an
assert(). The strings used are hard-coded utf8 strings, which will be
different in EBCDIC, and hence will fail. I haven't figured out a
simple way to get compile-time utf8 vs utfebcdic strings, but we can
just skip the check in EBCDIC builds
|
|
|
|
|
| |
This makes sure, before a segfault can happen, that the is_() functions
actually have the side effect that this code expects.
|
|
|
|
|
|
| |
The HORIZWS and similar regexp ops didn't check that the end of the string
had been reached; therefore they would blithely compare against the \0 at
the end of the string, or beyond.
|
|
|
|
|
| |
This was due to my failure to realize that this 'if' needed to
be updated when the /aa modifier was added.
|
|
|
|
|
| |
This was due to my oversight in not fixing this switch statement
to accommodate /aa when it was added.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
(?{...}) deliberately doesn't introduce a new scope (so that the effects of
local() can accumulate across multiple calls to the code). This also means
that the SAVEt_CLEARSVs pushed onto the save stack by lexical declarations
(i.e. (?{ my $x; ... })) also accumulate, and are only processed en masse at
the end, on exit from the regex. Currently they are usually processed in
the wrong pad (the caller of the pattern, rather than the pads of the
individual code block(s)), leading to random misbehaviour and SEGVs.
Hence the long-standing advice to avoid lexical declarations within
re_evals.
We fix this by wrapping a pair of SAVECOMPPADs around each call to a code
block. Eventually the save stack will be a long accumulation of
SAVEt_CLEARSVs interspersed with SAVEt_COMPPADs, which when popped
en masse should unwind in the right order with the right pad at the right
time.
The price to pay for this is two extra additions to the save stack (which
accumulate) for each code call.
A few TODO tests in reg_eval_scope.t now pass, so I'm probably doing the
right thing ;-)
|
| |
|
|
|
|
|
|
|
| |
The assumption is that most studied strings are fairly short, hence the pain
of the extra code is worth it, given the memory savings.
80 character string, 336 bytes as U8, down from 1344 as U32
800 character string, 2112 bytes as U16, down from 4224 as U32
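The savings can be checked with a little arithmetic: the study data needs one "first occurrence" slot per possible octet (256) plus one "next occurrence" slot per character of the string, so narrowing the slot type shrinks both tables. A hypothetical sketch of the size calculation (not the perl source; the widest value of each narrow type is reserved for the ~0 sentinel, hence the strict comparisons):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes used by the study tables for a string of
 * `len` octets, choosing the narrowest unsigned type that can hold all
 * positions plus the all-ones "no more" sentinel. */
static size_t
study_bytes(size_t len)
{
    size_t slots = 256 + len;   /* 256 first-occurrence + len next-occurrence */
    size_t width = len < 0xFF   ? sizeof(uint8_t)
                 : len < 0xFFFF ? sizeof(uint16_t)
                 :                sizeof(uint32_t);
    return slots * width;
}
```

For an 80-character string this gives (256 + 80) * 1 = 336 bytes as U8, versus (256 + 80) * 4 = 1344 as U32; for 800 characters, (256 + 800) * 2 = 2112 as U16, versus 4224 as U32 — matching the figures above.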
|
|
|
|
| |
The "no more" condition is now represented as ~0, instead of -1.
|
|
|
|
|
| |
This allows more than one C<study> to be active at the same time.
It eliminates PL_screamfirst, PL_lastscream, PL_maxscream.
|
|
|
|
|
|
|
|
|
|
| |
PL_screamnext gives the position of the next occurrence of the current octet.
Previously it stored this as an offset from the current position, with -pos
stored for "no more", so that the calculated new offset would be zero,
allowing a zero/non-zero loop exit test in Perl_screaminstr().
Now it stores absolute position, with -1 for "no more". Also codify -1 as the
"not present" value for PL_screamfirst, instead of any negative value.
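The scheme can be sketched in isolation as follows; this is an illustrative reimplementation under the representation described above (absolute positions, -1 sentinel), not the actual perl code, and the names are hypothetical:

```c
#include <string.h>

#define SCREAM_NONE (-1)  /* shared "not present" / "no more" sentinel */

/* Hypothetical sketch: first[b] is the absolute position of the first
 * occurrence of octet b in s (or -1 if absent); next[i] is the absolute
 * position of the next occurrence of the octet found at position i
 * (or -1 if there is no later occurrence). */
static void
build_scream(const char *s, int len, int first[256], int *next)
{
    int last[256];              /* most recent position of each octet */
    int i, b;
    for (b = 0; b < 256; b++)
        first[b] = last[b] = SCREAM_NONE;
    for (i = 0; i < len; i++) {
        b = (unsigned char)s[i];
        next[i] = SCREAM_NONE;
        if (first[b] == SCREAM_NONE)
            first[b] = i;       /* first time we've seen this octet */
        else
            next[last[b]] = i;  /* chain from its previous occurrence */
        last[b] = i;
    }
}
```

Walking the occurrences of an octet is then `for (i = first[b]; i != SCREAM_NONE; i = next[i])`, with a single sentinel test rather than the old zero/non-zero offset trick.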
|
|
|
|
|
|
|
|
| |
Callers to the engine set REXEC_SCREAM in the flags when the target scalar is
studied, and the engine should use the study data. It's possible for embedded
code blocks to cause the target scalar to stop being studied. Hence the engine
needs to check for this, instead of simply assuming that the study data is
present and valid to read. This resolves #92696.
|
|
|
|
|
| |
regcomp.c has been changed, so the case that this handled no longer
comes up.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
| |
In '"s\N{U+DF}" =~ /\x{00DF}/i', the LHS folds to 'sss', the RHS to 'ss'.
The bug occurs when the RHS tries to match the first two 's's, but that
splits the LHS \xDF character, which Perl doesn't know how to handle,
and the assertion got triggered. (This is similar to [perl #72998].)
The solution adopted here is to disallow a partial character match,
as #72998 did as well.
|
|
|
|
|
|
|
|
|
|
| |
A missing '!' turned \W into \w in some code execution paths with utf8 data.
This patch fixes that.
It does not include tests at the moment, since I don't have time
just now to examine why the existing tests didn't catch this, when
it looks like they are set up to, and there have been several BBC tickets
lately that I'm hopeful this may fix and head off other ones.
|
| |
|
|
|
|
| |
The trickiness has been resolved elsewhere
|
| |
|
|
|
|
|
|
| |
The comment said that there was no use doing this if lenp was NULL,
but there is, as it sees whether there is a match or not and sets the
appropriate variable.
|
| |
|
|
|
|
|
|
|
| |
The algorithm for mapping multi-char fold matches back to the source in
processing ANYOF nodes was defective. This caused the regex engine to
hang on certain character combinations. I've also added an assert to
stop instead of looping.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When we are doing a CURLYX/WHILEM loop and the min iterations is
larger than zero we were not saving the buffer state before each
iteration. This meant that partial matches would end up with strange
buffer pointers, with the start *after* the end point.
In more detail WHILEM has four exits, three of which as far as I could
tell would do a regcppush/regcppop in their state transitions, only one,
WHILEM_A_pre which is entered when (n < min) would not. And it is this state
that we repeatedly enter when performing A the min number of times.
When I made the logic similar to the handling of ( n < max ), the bug
went away, and as far as I can tell nothing else broke.
Review by Dave Mitchell required before release.
|
|
|
|
|
| |
Recent simplification of this code left it to be the equivalent
of an existing macro
|
|
|
|
|
|
|
| |
There have been various segfaults apparently due to trying to access
the swash (and allies) portion of an ANYOF which doesn't have that.
This doesn't show up on all platforms. The assert() should detect
this and help debugging
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit ac51e94be5daabecdeb0ed734f3ccc059b7b77e3 didn't
do what it purported, because it omitted parentheses that
were necessary to change the natural precedence. It's strange that
it passed all tests on my machine, and failed so miserably elsewhere
that it was quickly reverted by commit
63c0bfd59562325fed2ac5a90088ed40960ac2ad.
This reinstates it with the correct precedence. The next commit
will add an assert() so that the underlying issue will be detected
on all platforms
|
|
|
|
|
|
|
| |
This reverts commit ac51e94be5daabecdeb0ed734f3ccc059b7b77e3.
This commit made many of the re/*.t tests fail, on my build at least.
Haven't looked at why, just reverting it for the moment.
|
|
|
|
|
|
|
|
|
|
|
|
| |
ANYOF_NONBITMAP is supposed to be set iff there is something outside
the bitmap to try matching in an ANYOF node. Due to slight changes in
the meaning of this, the code has been trying to access this
if ANYOF_NONBITMAP_NON_UTF8 is set without ANYOF_NONBITMAP being set,
which means it was trying to access something that doesn't exist.
I'm hopeful, based on a stack trace sent to me, that this is the cause
of [perl #85478], but can't reproduce that easily. But the logic
is clearly wrong.
|
|
|
|
|
|
|
|
|
| |
Now that regexes can combine different charset modifiers, a synthetic
start class can match both locale and non-locale things. Locale matching
should generally match only things in the bitmap for code points < 256,
but a synthetic start class with a non-locale component can match such
code points. This patch makes an exception for synthetic nodes, which
will be resolved when, if the start class passes, the match is done
again for real.
|
|
|
|
|
| |
This code was retained for a while until it was clear that the replacement
code worked.
|
|
|
|
|
|
|
| |
This code has been rendered obsolete in 5.14 by using a different
mechanism altogether. This functionality is now provided at run-time,
user-selectable, via the /u and /d regex modifiers. This code was
for compile-time selection of which to use.
|
|
|
|
|
| |
The code dealing with the sharp ss is now handled by the ANYOFV node,
and shouldn't appear here.
|