path: root/regcomp.c
Commit message (Author, Age, Files, Lines)
* Revert "use symbolic constants for substrs[] indices" (David Mitchell, 2017-07-05, 1, -45/+42)
| | | | | | This reverts commit 2ac902efe11ee156653eb2ca1369f0e5f4546c31. See thread at Message-ID: <d8jfuebazrl.fsf@dalvik.ping.uio.no>
* regcomp.c: use symbolic constants for substrs[] indices (Dagfinn Ilmari Mannsåker, 2017-07-05, 1, -42/+45)
|
* regcomp.c: parameterise scan_data_t substrs[] (David Mitchell, 2017-07-02, 1, -157/+133)
| | | | | | | | | | | | | | Now that the scan_data_t stores its fixed and floating substring data as a 2-element array, replace various bits of duplicated code which separately handled fixed and floating substrings with for (i = 0; i < 2; i++) loops etc. This makes the code shorter and simpler, and will make it easier in future to expand to more than a single each of fixed+float. There should be no functional changes, except that debugging output now displays N..N rather than just N for the fixed substring start range (i.e. it's now just a subset of float where max == min)
* scan_data_t: rename 'longest' field (David Mitchell, 2017-07-02, 1, -17/+19)
| .. to 'cur_is_floating'
|
| It's an index into either the fixed or float substring info; it indicates whether the substring currently being captured is fixed or floating. It has nothing to do with whether the fixed or the floating substring is currently the longest.
* regcomp.c: remove float_min_offset etc macro use (David Mitchell, 2017-07-02, 1, -56/+65)
| In this src file, expand all the various macros like
|
|     #define anchored_offset  substrs->data[0].min_offset
|     #define float_min_offset substrs->data[1].min_offset
|
| This will later allow parts of the code to be parameterised, e.g.
|
|     for (i = 0; i < 2; i++) {
|         substrs->data[i].min_offset = ...;
|         ...
|     }
* regcomp.c: S_setup_longest(): simplify args (David Mitchell, 2017-07-02, 1, -12/+8)
| | | | | | | | | | | | | | Rather than passing e.g. &(r->float_utf8), &(r->float_substr), &(r->float_end_shift), pass the single arg &(r->substrs->data[1]) (float_foo are macros which expand to substrs->data[1].foo)
* regcomp: set fixed max_offset to min_offset (David Mitchell, 2017-07-02, 1, -8/+23)
| Previously scan_data_t had the three fields
|
|     offset_fixed
|     offset_float_min
|     offset_float_max
|
| A few commits ago that was converted into a 2-element array (for fixed and float), each with the fields
|
|     min_offset
|     max_offset
|
| where max_offset was unused in the fixed (substrs[0]) case. Instead, set it equal to min_offset.
|
| This makes the fixed and float code paths more similar.
|
| At the same time expand a few of the 'float_max_offset' type macros to make it clearer what's going on.
* S_study_chunk: have per substring flags (David Mitchell, 2017-07-02, 1, -39/+40)
| Currently the scan_data_t struct has a flags field which contains SF_ and SCF_ flags. Some of the SF_ flags are general; others are specific to the fixed or floating substr. For example there are these 3 flags:
|
|     SF_BEFORE_MEOL
|     SF_FIX_BEFORE_MEOL
|     SF_FL_BEFORE_MEOL
|
| This commit adds a flags field to the per-substring substruct and sets some flags per-substring instead. For example:
|
|     previously we did:                  now we would do:
|     --------------------------------    --------------------------------------
|     data->flags |= SF_BEFORE_MEOL       unchanged
|     data->flags |= SF_FIX_BEFORE_MEOL   data->substrs[0].flags |= SF_BEFORE_MEOL
|     data->flags |= SF_FL_BEFORE_MEOL    data->substrs[1].flags |= SF_BEFORE_MEOL
|
| This allows us to simplify the code (e.g. eliminating some args from S_setup_longest()) and in future will allow more than one fixed or floating substring.
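The before/after table in the commit message above can be sketched in C. This is only an illustration: the struct and function names are hypothetical, and the flag bit values are invented for the sketch rather than taken from regcomp.c.

```c
#include <assert.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define SF_BEFORE_MEOL      0x1u
#define SF_FIX_BEFORE_MEOL  0x2u   /* old scheme: fixed-only variant */
#define SF_FL_BEFORE_MEOL   0x4u   /* old scheme: floating-only variant */

typedef struct { unsigned flags; } substr_t;

typedef struct {
    unsigned flags;         /* general flags */
    substr_t substrs[2];    /* [0] = fixed, [1] = floating */
} scan_data;

/* Old scheme: the flag name depends on which substring is meant. */
static int old_before_meol(unsigned flags, int i)
{
    return (flags & (i ? SF_FL_BEFORE_MEOL : SF_FIX_BEFORE_MEOL)) != 0;
}

/* New scheme: one flag name, stored in the per-substring flags field. */
static int new_before_meol(const scan_data *d, int i)
{
    assert(i == 0 || i == 1);
    return (d->substrs[i].flags & SF_BEFORE_MEOL) != 0;
}
```

With the per-substring field, a caller that already holds an index no longer needs to choose between two differently named flags.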
* regcomp.c: DEBUG_PEEP(): invalid flags (David Mitchell, 2017-07-02, 1, -6/+6)
| | | | | | | | DEBUG_PEEP(..., flags) was invoked from 3 functions - however in two of those functions, the 'flags' local var did *not* contain SF_ and SCF_ bits, so the flag bits were being incorrectly displayed as SF_ etc. In those two functions, change it instead to DEBUG_PEEP(...., 0)
* regcomp.c: convert debugging macros to static fns (David Mitchell, 2017-07-02, 1, -89/+133)
| Make these 3 macros into thin wrappers around some new static functions, rather than just being huge macros:
|
|     DEBUG_SHOW_STUDY_FLAGS
|     DEBUG_STUDYDATA
|     DEBUG_PEEP
|
| Also, avoid the macros implicitly using local vars: make them into explicit parameters instead (this is one of my pet peeves).
* make struct scan_data_t->longest an index val (David Mitchell, 2017-07-02, 1, -23/+22)
| In this private data structure used during regex compilation, the 'longest' field was an SV** pointer which was always set to point to one of these two addresses:
|
|     &(data->substrs[0].str)
|     &(data->substrs[1].str)
|
| Instead, just make it a U8 with the value 0 or 1.
* S_setup_longest() pass struct rather than fields (David Mitchell, 2017-07-02, 1, -25/+21)
| | | | | Now that a substring is a separate struct, pass as a single pointer rather than as 4 separate args.
* struct scan_data_t: make some fields into an array (David Mitchell, 2017-07-02, 1, -79/+85)
| This private struct is used just within regcomp.c while compiling a pattern. It has a set of fields for a fixed substring, and a similar set for floating, e.g.
|
|     SV      *longest_fixed;
|     SV      *longest_float;
|     SSize_t *minlen_fixed;
|     SSize_t *minlen_float;
|     etc
|
| Instead have a 2-element array, one for fixed, one for float, so e.g.
|
|     data->offset_float_max
|
| becomes
|
|     data->substrs[1].max_offset
|
| There are 3 reasons for doing this. First, it makes the code more regular, and allows a whole substr ptr to be passed as an arg to a function rather than having to pass every individual field; second, it makes the compile-time struct more similar to the runtime struct, which already has such an arrangement; third, it allows for a hypothetical future expansion where there aren't necessarily at most 1 fixed and 1 floating substring.
|
| Note that a side effect of this commit has been to change lookbehind_fixed from I32 to SSize_t; lookbehind_float was already SSize_t, so the I32 was probably a bug.
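A rough sketch of the layout this commit describes, assuming simplified stand-in types: `const char *` replaces `SV *`, `ptrdiff_t` replaces `SSize_t`, and the struct and helper names (`substr_t`, `scan_data`, `substr_clear`, `scan_data_init`, `scan_data_current`) are hypothetical. It is not the actual definition from regcomp.c.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *str;         /* stand-in for the SV* holding the substring */
    ptrdiff_t   min_offset;
    ptrdiff_t   max_offset;  /* equals min_offset for the fixed substring */
    ptrdiff_t   lookbehind;
} substr_t;

typedef struct {
    substr_t      substrs[2];       /* [0] = fixed, [1] = floating */
    unsigned char cur_is_floating;  /* index of the substring being captured */
} scan_data;

/* One helper serves both substrings because it takes the whole sub-struct. */
static void substr_clear(substr_t *s)
{
    assert(s != NULL);
    s->str = NULL;
    s->min_offset = s->max_offset = 0;
    s->lookbehind = 0;
}

/* A loop replaces separate "fixed" and "float" copies of the same code. */
static void scan_data_init(scan_data *d)
{
    int i;
    for (i = 0; i < 2; i++)
        substr_clear(&d->substrs[i]);
    d->cur_is_floating = 0;
}

/* An index field selects the current substring, instead of a pointer. */
static substr_t *scan_data_current(scan_data *d)
{
    return &d->substrs[d->cur_is_floating];
}
```

Passing `&d->substrs[i]` to a helper is the single-pointer argument style that the neighbouring S_setup_longest() commits adopt.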
* regcomp.c: correct the regdata which parameters under DEBUG (Yves Orton, 2017-06-27, 1, -1/+4)
| | | | | | this worked because 'a' and 'o' are treated the same for all intents and purposes, but it is confusing as 'a' stands for array, and 'o' for hash, and the DEBUG mode code here adds two arrays not hashes.
* regcomp.c: document reg_data types better in reg_dup (Yves Orton, 2017-06-27, 1, -8/+22)
|
* wrap multi-statement macros in STMT_START/STMT_END (Lukas Mai, 2017-06-20, 1, -4/+8)
| With the original code you'd have to be very, very careful:
|
|     if (foo)
|         CLEAR_POSIX_WARNINGS_AND_RETURN(42);
|
| would have expanded to
|
|     if (foo)
|         CLEAR_POSIX_WARNINGS();
|     return 42; /* always returns! */
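The hazard in the commit message above can be reproduced with a small sketch. The macro names and the `cleared` counter are hypothetical, and `MY_STMT_START`/`MY_STMT_END` are local stand-ins defined as the usual `do { ... } while (0)` wrapper, which is an assumption about the typical expansion rather than a copy of perl.h.

```c
#include <assert.h>

/* Local stand-ins (assumption): the classic do { ... } while (0) wrapper. */
#define MY_STMT_START do
#define MY_STMT_END   while (0)

static int cleared;   /* counts how often the "clear" step ran */

/* Unsafe: two statements; a preceding 'if' governs only the first one. */
#define CLEAR_AND_RETURN_UNSAFE(v)  cleared++; return (v)

/* Safe: the whole body forms a single statement. */
#define CLEAR_AND_RETURN_SAFE(v) \
    MY_STMT_START { cleared++; return (v); } MY_STMT_END

static int unsafe_fn(int cond)
{
    if (cond)
        CLEAR_AND_RETURN_UNSAFE(42);  /* 'return 42' runs unconditionally */
    return 0;                         /* never reached */
}

static int safe_fn(int cond)
{
    assert(cond == 0 || cond == 1);
    if (cond)
        CLEAR_AND_RETURN_SAFE(42);
    return 0;
}
```

Calling `unsafe_fn(0)` still returns 42, while `safe_fn(0)` returns 0 as intended. A compiler may warn about the unsafe variant, which is the point of the illustration.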
* Resolve Perl #131522: Spurious "Assuming NOT a POSIX class" warning (Yves Orton, 2017-06-18, 1, -12/+18)
|
* enforce size constraint via STATIC_ASSERT, not just a comment (Lukas Mai, 2017-06-07, 1, -0/+1)
|
* regcomp.c: Simplify expression (Karl Williamson, 2017-06-01, 1, -1/+2)
| | | | | Here, there is no advantage to assigning a variable within an 'if', and it is somewhat harder to read, so don't do it.
* Relax fatal circumstances of unescaped '{' (Karl Williamson, 2017-06-01, 1, -16/+71)
| | | | | | | | | | | | | | | | | | | | After the 5.26.0 code freeze, it came out that an application that many others depend on, GNU Autoconf, has an unescaped '{' in it. Commit 7335cb814c19345052a23bc4462c701ce734e6c5 created a kludge that was minimal, and designed to get just that one application to work. I originally proposed a less kludgy patch that was applicable across a larger set of applications. The proposed patch didn't fatalize uses of unescaped '{' where we don't anticipate using it for something other than its literal self. That approach worked for Autoconf and for far more instances, but was more complicated, and was rejected as being too risky during code freeze. Now this commit implements my original suggestion. I am putting it in now, to let it soak in blead, in case something else surfaces besides Autoconf, that we need to work around. By having experience with the patch live, we can be more confident about using it, if necessary, in a dot release.
* regcomp.c: Don't set variable within an 'if' (Karl Williamson, 2017-06-01, 1, -4/+4)
| | | | | | | Sometimes it is convenient and/or necessary to do an assignment within a clause of an 'if', but it adds a little cognitive load. In this case, it's entirely unnecessary. This patch changes to do the assignment before the 'if'.
* Slightly change -Dr output of regex ANYOF nodes (Karl Williamson, 2017-06-01, 1, -1/+1)
| | | | | This changes to precede each literal '[' in a [...] class with a backslash to make it stand out better as a literal
* regcomp.c: Change lookup for dumping pattern (Karl Williamson, 2017-06-01, 1, -2/+1)
| | | | | | Instead of using a bunch of branches, use strchr() to see if a character is a member of a class. This is a common paradigm in the parsers.
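The strchr() idiom mentioned above might look like the following sketch. The character set and function names are hypothetical and are not the ones used in regcomp.c.

```c
#include <string.h>

/* Hypothetical character class; not the set used in regcomp.c. */
static const char special_chars[] = "[](){}";

/* Membership test via strchr() instead of a chain of comparisons. */
static int is_special(char c)
{
    /* strchr() also matches the terminating NUL, so exclude it first. */
    return c != '\0' && strchr(special_chars, c) != NULL;
}

/* Equivalent branch-based form, shown for comparison. */
static int is_special_branches(char c)
{
    return c == '[' || c == ']' || c == '(' || c == ')'
        || c == '{' || c == '}';
}
```

The explicit NUL check matters because strchr() treats the string terminator as part of the searched string, so a bare `strchr(set, c) != NULL` would report NUL as a member.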
* Workaround for GNU Autoconf unescaped left brace (Karl Williamson, 2017-04-17, 1, -2/+22)
| | | | | | | | | | | | | | | | | | | | | See [perl #130497] GNU Autoconf depends on Perl, and will not work on Blead (and the forthcoming Perl 5.26), due to a single unescaped '{', that has previously been deprecated and is now fatal. A patch for it has been in the Autoconf repository since early 2013, but there has not been a release since before then. Because this is depended on by so much code, and because it is simpler than trying to revert to making the fatality merely deprecated, this patch simply changes perl to not die when compiled with the exact pattern that trips up Autoconf. Thus Autoconf can continue to work, but any other patterns that use the now illegal construct will continue to die. If other code uses the exact pattern, they too will not die, but the deprecation message continues to get raised. The use of the left brace in this particular pattern is not one where we envision using the construct to mean something else, so a deprecation is suitable for the foreseeable future.
* update size after Renew (Hugo van der Sanden, 2017-03-15, 1, -4/+6)
| RT #130841
|
| In general code, change this idiom:
|
|     PL_foo_max += size;
|     Renew(PL_foo, PL_foo_max, foo_t);
|
| to
|
|     Renew(PL_foo, PL_foo_max + size, foo_t);
|     PL_foo_max += size;
|
| so that if Renew dies, PL_foo_max won't be left hanging.
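The same ordering rule can be sketched in plain C, with realloc() standing in for Renew. Note the assumption: Renew dies on failure, whereas this sketch models failure as a return code; `buf_t` and `buf_grow` are hypothetical names.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    int    *items;
    size_t  max;     /* number of allocated elements */
} buf_t;

/* Grow the buffer by 'extra' elements.
 * Returns 0 on success, -1 on failure; on failure *b is left unchanged,
 * so the recorded size never runs ahead of the real allocation. */
static int buf_grow(buf_t *b, size_t extra)
{
    int *p;

    /* Reject requests whose byte count would overflow size_t. */
    if (extra > SIZE_MAX / sizeof *b->items - b->max)
        return -1;

    p = realloc(b->items, (b->max + extra) * sizeof *p);
    if (p == NULL)
        return -1;

    b->items = p;
    b->max += extra;   /* update the size only after the allocation worked */
    return 0;
}
```

If the size were bumped first, a failed allocation would leave `max` claiming more room than `items` actually has, which is the "left hanging" state the commit avoids.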
* (perl #130822) fix an AV leak in Perl_reg_named_buff_fetch (Tony Cook, 2017-02-21, 1, -4/+1)
| | | | Originally noted as a scoping issue by Andy Lester.
* Revert "Deprecating the use of C<< \cI<X> >> to specify a printable character." (Sawyer X, 2017-02-12, 1, -15/+6)
| | | | This reverts commit bfdc8cd3d5a81ab176f7d530d2e692897463c97d.
* Change av_foo_nomg() name (Karl Williamson, 2017-02-11, 1, -17/+17)
| These names sparked some controversy when created:
| http://www.nntp.perl.org/group/perl.perl5.porters/2016/03/msg235216.html
|
| I looked through existing code for paradigms to follow, and found some occurrences of 'skip_foo_mg'. So this commit changes the names to be
|
|     av_top_index_skip_len_mg()
|     av_tindex_skip_len_mg()
|
| This is explicit about the type of magic that is ignored, and will still be valid if another type of magic ever gets added.
* Coverity #155950: pRExC->code_blocks is blindly derefed (Jarkko Hietaniemi, 2017-02-10, 1, -0/+2)
| | | | | Even though code calling S_pat_upgrade_to_utf8 from the Perl_re_op_compile is testing the code_blocks for NULLness.
* regcomp.c: Fix so will compile on C++11 (Karl Williamson, 2017-02-09, 1, -1/+1)
| | | | See 147e38468b8279e26a0ca11e4efd8492016f2702 for complete explanation
* [perl #129061] CURLYX nodes can be studied more than once (Hugo van der Sanden, 2017-02-06, 1, -3/+9)
| | | | | | | | | | | | | study_chunk() for CURLYX is used to set flags on the linked WHILEM node to say it is the whilem_c'th of whilem_seen. However it assumes each CURLYX can be studied only once, which is not the case - there are various cases such as GOSUB which call study_chunk() recursively on already-visited parts of the program. Storing the wrong index can cause the super-linear cache handling in regmatch() to read/write the byte after the end of poscache. Also reported in [perl #129281].
* avoid double-freeing regex code blocks (David Mitchell, 2017-02-01, 1, -10/+10)
| RT #130650 heap-use-after-free in S_free_codeblocks
|
| When compiling qr/(?{...})/, a reg_code_blocks structure is allocated and various SVs are attached to it. Initially this is set to be freed via a destructor on the savestack, in case of early dying. Later the structure is attached to the compiling regex, and a boolean flag in the structure, 'attached', is set to true to show that the destructor no longer needs to free the struct.
|
| However, it is possible to get three orders of destruction:
|
| 1) allocate, push destructor, die early
| 2) allocate, push destructor, attach to regex, die
| 3) allocate, push destructor, attach to regex, succeed
|
| In 2, the regex is freed (via the savestack) before the destructor is called. In 3, the destructor is called, then later the regex is freed.
|
| It turns out perl can't currently handle case 2:
|
|     qr'(?{})\6'
|
| Fix this by turning the 'attached' boolean field into an integer refcount, then keep a count of whether the struct is referenced from the savestack and/or the regex. Since it normally has a value of 1 or 2, it's similar to a boolean flag, but crucially it no longer just indicates that the regex has a pointer to it ('attached'), but that at least one of the savestack and regex have a pointer to it. So order of freeing no longer matters.
|
| I also updated S_free_codeblocks() so that it nulls out SV pointers in the reg_code_blocks struct before freeing them. This is generally good practice to avoid double frees, although it is probably not needed at the moment.
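A minimal sketch of the refcount idea, under stated assumptions: the two owners (savestack destructor and regex) are modelled as plain calls to a release function, and every name here (`code_blocks_t`, `cb_alloc`, `cb_attach`, `cb_release`, `frees`) is hypothetical rather than taken from regcomp.c.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int  refcnt;   /* number of owners: savestack and/or regex */
    int *blocks;   /* stand-in for the array of code blocks */
} code_blocks_t;

static int frees;  /* counts real frees, for demonstration only */

/* Allocate with one reference, held by the (modelled) savestack. */
static code_blocks_t *cb_alloc(void)
{
    code_blocks_t *cb = malloc(sizeof *cb);
    if (cb == NULL)
        return NULL;
    cb->refcnt = 1;
    cb->blocks = malloc(4 * sizeof *cb->blocks);
    return cb;
}

/* The regex takes a second reference when the struct is attached to it. */
static void cb_attach(code_blocks_t *cb)
{
    cb->refcnt++;
}

/* Either owner may release first; the last release frees the struct. */
static void cb_release(code_blocks_t *cb)
{
    assert(cb->refcnt > 0);
    if (--cb->refcnt > 0)
        return;
    free(cb->blocks);
    cb->blocks = NULL;   /* null the pointer so it cannot be freed twice */
    free(cb);
    frees++;
}
```

Whether the regex or the savestack releases first, exactly one free happens, which is why the order of destruction stops mattering.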
* (perl #130684) allocate enough space for the extra 'x' (Tony Cook, 2017-02-01, 1, -1/+1)
| | | | | | | | | 77c8f26370dcc0e added support for a doubled x regexp flag, and ensured the doubled flag was passed to the qr// created by S_compile_runtime_code(). Unfortunately it didn't ensure enough space was allocated for that extra 'x'.
* mention PASS2 in reginsert() example (Hugo van der Sanden, 2017-01-29, 1, -1/+2)
| | | | As per bb78386f13.
* assert that the RExC_recurse data structure points at a valid GOSUB (Yves Orton, 2017-01-28, 1, -0/+12)
| | | | | This assert will fail if someone adds code that optimises away a GOSUB call. At which point they will see the comment and know what to do.
* only mess with NEXT_OFF() when we are in PASS2 (Yves Orton, 2017-01-27, 1, -2/+2)
| | | | | | In 31fc93954d1f379c7a49889d91436ce99818e1f6 I added code that would modify NEXT_OFF() when we were not in PASS2, when we should not do so. Strangely this did not segfault when I tested, but this fix is required.
* add some details to the docs for S_reginsert() (Yves Orton, 2017-01-27, 1, -0/+7)
| | | | | | Had these docs been here I would have saved some time debugging. So save the next guy from the same trouble... (with my memory *I* might even be the /next guy/. Sigh.)
* fix RT #130561 - recursion and optimising away impossible quantifiers are not friends (Yves Orton, 2017-01-27, 1, -11/+3)
| Instead of optimising away impossible quantifiers like (foo){1,0} treat them as unquantified, and guard them with an OPFAIL. Thus /(foo){1,0}/ is treated the same as /(*FAIL)(foo)/. This is important in patterns like /(foo){1,0}|(?1)/ where the (?1) needs to be able to recurse into the (foo) even though the (foo){1,0} can never match. It also resolves various issues (SEGVs) with patterns like /((?1)){1,0}/.
|
| This patch would have been easier if S_reginsert() documented that it is the caller's responsibility to properly set up the NEXT_OFF() of the inserted node (if the node has a NEXT_OFF())
* rename opnd to operand to save my sanity (Yves Orton, 2017-01-27, 1, -5/+5)
|
* better handle freeing of code blocks in /(?{...})/ (David Mitchell, 2017-01-24, 1, -107/+121)
| [perl #129140] attempting double-free
|
| This fixes some leaks and double frees in regexes which contain code blocks.
|
| During compilation, an array of struct reg_code_block's is malloced. Initially this is just attached to the RExC_state_t struct local var in Perl_re_op_compile(). Later it may be attached to a pattern. The difficulty is ensuring that the array is free()d (and the ref counts contained within decremented) should compilation croak early, while avoiding double frees once the array has been attached to a regex.
|
| The current mechanism of making the array the PVX of an SV is a bit flaky, as the array can be realloced(), and code can be re-entered when utf8 is detected mid-compilation.
|
| This commit changes the array into separately malloced head and body. The body contains the actual array, and can be realloced. The head contains a pointer to the array, plus size and an 'attached' boolean. This indicates whether the struct has been attached to a regex, and is effectively a 1-bit ref count.
|
| Whenever a head is allocated, SAVEDESTRUCTOR_X() is used to call S_free_codeblocks() to free the head and body on scope exit. This function skips the freeing if 'attached' is true, and this flag is set only at the point where the head gets attached to the regex.
|
| In one way this complicates the code, since the num_code_blocks field is now not always available (it's only there if a head has been allocated), but mainly it simplifies, since all the book-keeping is now done in the two new static functions S_alloc_code_blocks() and S_free_codeblocks()
* Fix bug with a digit range under re 'strict' (Karl Williamson, 2017-01-19, 1, -39/+71)
| | | | | | | "use re 'strict'" is supposed to warn if the start and end points of a range are digits that aren't from the same group of 10, for example if you mix Bengali and Thai digits. It wasn't working properly for 5 groups of mathematical digits starting at U+1D7CE. This commit fixes that, and refactors the code to bail out as soon as it discovers that no warning is warranted, instead of doing unnecessary work.
* Deprecating the use of C<< \cI<X> >> to specify a printable character. (Abigail, 2017-01-16, 1, -6/+15)
| | | | | | | | | | | Starting in 5.14, we deprecated the use of "\cI<X>" when this results in a printable character. For instance, "\c:" is just a fancy way of writing "z". Starting in 5.28, this will be a fatal error. This also includes certain usage in regular expressions with the experimental (?[ ]) construct, or when "use re 'strict'" is in effect (also experimental).
* Unescaped left braces in regular expressions will be fatal in 5.30. (Abigail, 2017-01-16, 1, -1/+1)
| | | | | | | | | | | | In 5.26, some uses of unescaped left braces were made fatal; they have given a deprecation warning since 5.20. Due to an oversight, some cases were missed, and did not give a deprecation warning. They do now. This patch changes said deprecation warning to mention the Perl version in which the use of an unescaped left brace will be fatal (5.30). The patch also cleans up some unnecessary quotes inside a C<> construct in the discussion of this warning in perldiag.pod.
* Warn on unescaped /[]}]/ under re strict (Karl Williamson, 2017-01-13, 1, -0/+6)
| | | | | | | | | | | | | | | | | | | This commit generates a warning when the experimental 're strict' feature is in effect for unescaped '}' and ']' characters (in a regular expression pattern) that are interpreted literally. This brings the behavior of these more in line with ')' which croaks when it is taken literally. The problem with the existing behavior is that these characters may be metacharacters or they may be literals, depending on action at a distance. Not so with ')', which is always a metacharacter unless escaped. Ideally, all three of these characters should behave similarly, but it really is too late for that, except we can warn if the user has requested extra checking of their patterns with this experimental 're strict' feature.
* regcomp.c: Clarify comment. (Karl Williamson, 2017-01-13, 1, -1/+1)
|
* Add /xx regex pattern modifier (Karl Williamson, 2017-01-13, 1, -8/+19)
| | | | | This was first proposed in the thread starting at http://www.nntp.perl.org/group/perl.perl5.porters/2014/09/msg219394.html
* regcomp.c: Remove obsolete data structure element (Karl Williamson, 2017-01-13, 1, -6/+0)
| | | | | This was used for the removed feature of having the source in a different encoding.
* Rmv unused regex implementation structure element (Karl Williamson, 2017-01-12, 1, -9/+0)
|
* PATCH: [perl #130530]: HP-UX assertion failure (Karl Williamson, 2017-01-11, 1, -3/+2)
| | | | | | | | | | | | This was introduced in a1a5ec35e6a3df0994b103aadb28a8c1a3a278da, and was due to a thinko on my part. Zefram figured it out. A macro evaluating to a string constant returns an instance of that constant. Compilers are free to collapse all instances into a single one (which saves space), or to have multiple copies. The code was assuming the former, and HP-UX cc doesn't. The passed size also was one byte larger than it should have been.
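The two pitfalls described above (assuming every expansion of a string-constant macro shares one address, and passing a size one byte too large) can be sketched as follows. `TAG`, `TAG_LEN` and both helper functions are hypothetical names invented for the illustration; they are not from the patched code.

```c
#include <string.h>

#define TAG     "utf8"             /* hypothetical string-constant macro */
#define TAG_LEN (sizeof(TAG) - 1)  /* sizeof counts the trailing NUL */

/* Fragile: separate expansions of TAG may or may not share an address,
 * depending on whether the compiler merges identical literals. */
static int same_by_address(const char *a, const char *b)
{
    return a == b;
}

/* Robust: compare the contents rather than the pointers. */
static int same_by_content(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

A copy such as `char copy[] = TAG;` always lives at a different address from the literal, so the address comparison fails there while the content comparison succeeds. Using `sizeof(TAG)` directly as a length would include the terminating NUL, which is the off-by-one the commit mentions.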
* Eliminate two unused variables detected by clang. (James E Keenan, 2017-01-06, 1, -3/+0)
| | | | "warning: unused variable 'i' [-Wunused-variable]"