| Commit message | Author | Age | Files | Lines |
| |
|
|
|
|
|
|
|
|
|
|
|
| |
perl detects some locale errors when a new locale is entered. It saves
the resulting messages and outputs them on the first use of something
that depends on that locale. A synthetic start class (SSC) is used by
the regex optimizer under certain circumstances. Prior to this patch, a
saved bad-locale message might never be raised if the match failed the
SSC. This patch fixes that by changing the node type of the SSC to one
that checks for the saved message whenever the pattern has
locale-dependent portions.
|
|
|
|
|
| |
I was unaware of this construct when I wrote the commit that broke it,
and there were no tests for it. Now there are.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously use of this under /l regex rules was a compile time error.
Now it works like \b{wb} and \b{sb}, which compile under locale rules
and always work like Unicode says they should. A UTF-8 locale implies
Unicode rules, and the goal is for it to work seamlessly with the rest
of perl. This construct was the only one I am aware of that didn't work
seamlessly (not counting OS interfaces) under UTF-8 LC_CTYPE locales.
For all three of these constructs, use with a non-UTF-8 runtime locale
raises a warning, and Unicode rules are used anyway.
UTF-8 locale collation still has problems, but this is low priority to
fix, as it's a lot of work, and if one really cares, one should be using
Unicode::Collate.
|
|
|
|
| |
This will be used by the next commit
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The ANYOF_FLAGS bits are all used up, but a future commit wants one.
This commit frees up a bit by sharing two of the existing
comparatively-rarely-used ones. One bit is used only under /d matching
rules, while the other is used only when not under /d. Only the latter
bit is used in synthetic start classes. The previous commit introduced
an ANYOFD node type corresponding to /d. An SSC never is this type.
Thus, the bits have mutually exclusive meanings, and we can use the node
type to distinguish between the two meanings of the combined bit.
An alternative implementation would have been to use the
ANYOF_HAS_NONBITMAP_NON_UTF8_MATCHES non-/d bit instead of the one
chosen. But this is used more frequently, so the disambiguation would
have been exercised more frequently, slowing execution down ever so
slightly; more importantly, this one required fewer code changes, by a
slight amount.
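
The bit sharing described above can be sketched roughly as follows. This is a minimal illustration, not the actual regcomp.c code: the node types, the flag names and the bit value are all hypothetical, and only the idea of letting the node type select the meaning of a shared bit is taken from the message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical node types and flag names, for illustration only. */
enum node_type { NODE_ANYOF, NODE_ANYOFD };

/* One physical bit with two meanings; the node type selects which applies. */
#define SHARED_BIT            0x40u
#define FLAG_ONLY_UNDER_D     SHARED_BIT   /* meaningful only in NODE_ANYOFD */
#define FLAG_ONLY_NOT_UNDER_D SHARED_BIT   /* meaningful only in NODE_ANYOF  */

struct node {
    enum node_type type;
    uint8_t flags;
};

/* True if the /d-only meaning of the shared bit is set on this node. */
static bool has_d_only_flag(const struct node *n)
{
    return n->type == NODE_ANYOFD && (n->flags & FLAG_ONLY_UNDER_D) != 0;
}

/* True if the non-/d meaning of the shared bit is set on this node. */
static bool has_non_d_flag(const struct node *n)
{
    return n->type != NODE_ANYOFD && (n->flags & FLAG_ONLY_NOT_UNDER_D) != 0;
}
```

Every reader of the shared bit pays for one extra node-type comparison, which is why it matters that the bits chosen are rarely consulted.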
|
|
|
|
|
| |
This is like an ANYOF node, but only for when /d is in effect. It will
be used in future commits.
|
|
|
|
| |
This fix required an extra test of the return value of a function.
|
|
|
|
|
| |
Since the SV is discarded almost immediately (in non-DEBUGGING builds)
don't worry about making it the smallest possible size.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
locale.c:
- the pointers are always null at this point, see
http://www.nntp.perl.org/group/perl.perl5.porters/2015/07/msg229533.html
pp.c:
- reduce the scope of temp_buffer and svrecode to an inner branch
- in some permutations, either temp_buffer or svrecode is never set to
non-null; where it is known that the variable hasn't been set yet, skip
the freeing calls at the end. This eliminates only some, not all, of the
permutations in which NULL is passed to Safefree and SvREFCNT_dec
regcomp.c:
- don't create a save stack entry to call Safefree(NULL); see the ticket
for this patch for some profiling stats
|
| |
|
|
|
|
|
|
|
| |
Actually, there are no special rules for this Unicode release. Only in
this release are all four "i" characters (upper- and lowercase dotted
and dotless "i") considered equivalent under /i. This adds special
cases that are only compiled in for that release.
|
|
|
|
|
| |
Several places require special handling because of this, notably for the
lowercase Sharp S, but not in Unicode releases before 3.0.1.
|
|
|
|
| |
This is the final statement of the short loop. It does nothing.
|
|
|
|
|
|
|
|
|
|
|
|
| |
U+1E9E LATIN CAPITAL LETTER SHARP S is handled specially by Perl,
because of its relationship to the infamous LATIN SMALL LETTER SHARP S,
which folds to 'ss', being the only character whose code point is < 256
to have a multi char fold, and this creates lots of special cases.
But U+1E9E wasn't in all Unicode releases. Because Perl is supposed to
work with any release, we need to be able to compile when this character
isn't defined. In some of those cases we use U+017F (LATIN SMALL LETTER
LONG S) instead, which is in all releases.
|
| |
|
|
|
|
| |
Instead of #include-ing the C file, compile it normally.
|
| |
|
| |
|
|
|
|
| |
Coverity CID 104782 (only flagged the deb.c spot)
|
|
|
|
|
|
| |
This horrible thing broke encapsulation and was as buggy as a very buggy
thing. It's been officially deprecated since 5.20.0 and now it can finally
die die die!!!!
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Some code in this function examines the first two nodes in the regex to
set suitable flags etc. Part of the code accesses the second node
by using regnext(first), other parts by NEXTOPER(first). The second method
only works when the node is the same size as a basic node. I *think*
that the code only makes use of this second value in situations where
the node *is* basic, but nevertheless, it makes valgrind unhappy when
the first node is an EXACT node, and reading the second node's
supposed type field is actually reading the padding bytes at the end of
the EXACT string, which are uninitialised.
So just use regnext() only.
Something as simple as /x/ on non-debugging builds was enough to make
valgrind complain. (On debugging builds, the program buffer is initially
zero-filled.)
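
The difference between the two ways of reaching the second node can be sketched as below. This is a simplified model under stated assumptions: the node layout, the opcodes and the helper names (next_by_size, next_by_offset, emit_exact) are hypothetical and far simpler than the real regnode structures. It only shows why a fixed-size step lands inside the string payload of an EXACT-like node while following the stored offset does not.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { OP_END = 0, OP_EXACT = 1 };   /* hypothetical opcodes */

/* One fixed-size header per node; an EXACT node is followed by payload. */
struct node {
    uint8_t  type;
    uint8_t  len;        /* payload length in bytes (EXACT only) */
    uint16_t next_off;   /* distance, in node units, to the next node; 0 = none */
};

/* Like NEXTOPER(): assumes the node occupies exactly one header. */
static struct node *next_by_size(struct node *p)
{
    return p + 1;
}

/* Like regnext(): follows the stored offset, so any payload is skipped. */
static struct node *next_by_offset(struct node *p)
{
    return p->next_off ? p + p->next_off : NULL;
}

/* Emit an EXACT node plus its string (assumed shorter than 256 bytes);
 * returns the number of node units consumed. */
static size_t emit_exact(struct node *p, const char *s)
{
    size_t len = strlen(s);
    size_t payload = (len + sizeof *p - 1) / sizeof *p;   /* round up */

    p->type = OP_EXACT;
    p->len = (uint8_t)len;
    p->next_off = (uint16_t)(1 + payload);
    memcpy(p + 1, s, len);   /* padding bytes after the string stay unwritten */
    return 1 + payload;
}
```

For a one-character string, next_by_size() returns the unit holding the character and its padding, so reading a type field there inspects bytes that were never initialised, whereas next_by_offset() returns the real following node.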
|
|
|
|
|
|
|
|
|
|
| |
/[A-Z]/ai should match KELVIN SIGN, as it folds to a 'k'. It should not
match under /aai, as that restricts fold matching. But I tested for the
wrong symbol which ended up forbidding both /ai and /aai.
This commit changes to the correct symbol. I also reordered the 'if'
while I was at it as a nano optimisation, to test for the /aa last, as
that is the less common part of the '&&' test.
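
The intended behaviour can be written as a small predicate. The flag names below are hypothetical stand-ins for the /i, /a and /aa modifiers rather than the symbols used in regcomp.c; the sketch only encodes the rule stated above and keeps the /aa test last.

```c
#include <stdbool.h>

/* Hypothetical flag bits standing in for the /i, /a and /aa modifiers. */
enum { MOD_FOLD = 1, MOD_ASCII = 2, MOD_ASCII_RESTRICT = 4 };

/* Should /[A-Z]/ with these modifiers match KELVIN SIGN (U+212A)? */
static bool az_matches_kelvin_sign(unsigned mods)
{
    /* Folding must be on; the /aa restriction is tested last, mirroring
     * the reordering described in the message. */
    return (mods & MOD_FOLD) != 0 && (mods & MOD_ASCII_RESTRICT) == 0;
}
```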
|
|
|
|
| |
whitespace-only change.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
RT #124109.
2c1f00b9036 localised PL_curpm to NULL when calling swash init code
(i.e. perl-level code that is loaded and executed when something
like "lc $large_codepoint" is executed).
b4fa55d3f1 followed this up by gutting Perl_save_re_context(), since
that function did, basically,
    if (PL_curpm) {
        for (i = 1; i <= RX_NPARENS(PM_GETRE(PL_curpm)); i++) {
            /* do the C equivalent of the perl code "local ${i}" */
        }
    }
and now that PL_curpm was null, the code wasn't called any more. However,
it turns out that the localisation *was* still needed, it's just that
nothing in the test suite actually tested for it.
In something like the following:
    $x = "\x{41c}";
    $x =~ /(.*)/;
    $s = lc $1;
pp_lc() calls get magic on $1, which sets $1's PV value to a copy of the
substring captured by the current pattern match.
Then pp_lc() calls a function to convert the string to lower case, which
triggers a swash load, which calls perl code that does a pattern match
and, most importantly, uses the value of $1. This triggers get magic on
$1, which overwrites $1's PV value with a new value. When control returns
to pp_lc(), $1 now holds the wrong string value.
Hence $1, $2 etc need localising as well as PL_curpm.
The old way that Perl_save_re_context() used to work (localising
$1..${RX_NPARENS}) won't work directly when PL_curpm is NULL (as in the
swash case), since we don't know how many vars to localise.
In this case, hard-code it as localising $1, $2 and $3, and add a porting
test file that checks that the utf8.pm code and its dependencies don't
use anything outside those three vars.
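
The clobbering can be modelled in a few lines of C. This is a loose sketch and not how the interpreter stores captures: cur_match stands in for PL_curpm, capture_buf for the PV slot of $1 that get magic refills on every fetch, and all function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins: cur_match plays the role of PL_curpm, and
 * capture_buf the PV slot of $1 that "get magic" refills on each fetch. */
static const char *cur_match = NULL;
static char capture_buf[32];

/* Model of get magic: copy the current capture into the shared slot. */
static const char *fetch_capture(void)
{
    snprintf(capture_buf, sizeof capture_buf, "%s", cur_match ? cur_match : "");
    return capture_buf;
}

/* A helper that runs its own match and reads the capture, like the
 * perl-level code executed during a swash load. */
static void helper_with_own_match(void)
{
    const char *saved_match = cur_match;
    cur_match = "helper";
    (void)fetch_capture();      /* overwrites capture_buf */
    cur_match = saved_match;    /* restoring the match alone is not enough */
}

/* Unsafe: the string the caller already fetched is clobbered by the helper. */
static int caller_unsafe(const char *expect)
{
    const char *s = fetch_capture();
    helper_with_own_match();
    return strcmp(s, expect) == 0;
}

/* Safe: also save and restore the capture slot, as "local $1" would. */
static int caller_localised(const char *expect)
{
    char saved[sizeof capture_buf];
    const char *s = fetch_capture();

    memcpy(saved, capture_buf, sizeof saved);
    helper_with_own_match();
    memcpy(capture_buf, saved, sizeof saved);
    return strcmp(s, expect) == 0;
}
```

With cur_match set to some string, caller_unsafe() sees the helper's value after the call, while caller_localised() still sees its own.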
|
|
|
|
|
|
| |
This reverts commit b4fa55d3f12c6d98b13a8b3db4f8d921c8e56edc.
Turns out we need Perl_save_re_context() after all
|
|
|
|
|
|
| |
This reverts commit d28a9254e445aee7212523d9a7ff62ae0a743fec.
Turns out we need save_re_context() after all
|
|
|
|
|
|
| |
This reverts commit 0ddd4a5b1910c8bfa9b7e55eb0db60a115fe368c.
Turns out we need the save_re_context() function after all.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
An empty cpan/.dir-locals.el stops Emacs using the core defaults for
code imported from CPAN.
Committer's work:
To keep t/porting/cmp_version.t and t/porting/utils.t happy, $VERSION needed
to be incremented in many files, including throughout dist/PathTools.
perldelta entry for module updates.
Add two Emacs control files to MANIFEST; re-sort MANIFEST.
For: RT #124119.
|
|
|
|
|
|
| |
Unicode 5.2 had an anomalous situation, fixed in the next release, which
runs afoul of an assert() in regcomp.c. This just modifies the assert
for it to not fail for this situation.
|
|
|
|
|
| |
In this experimental feature, the intersection operator ("&") now has
higher precedence than the other binary operators.
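
The effect of the new precedence can be shown with a toy evaluator. It is a sketch under stated assumptions and has nothing to do with the real parser in regcomp.c: sets are 32-bit masks named by single lowercase letters, and the parser struct and the eval_sets helper are hypothetical. Intersection gets its own grammar level, while union ("+" or "|"), subtraction ("-") and symmetric difference ("^") share one level and associate left to right.

```c
#include <stdint.h>

struct parser {
    const char *p;          /* current position in the expression */
    const uint32_t *sets;   /* sets[0] is 'a', sets[1] is 'b', ... */
};

static uint32_t parse_expr(struct parser *ps);

static void skip_spaces(struct parser *ps)
{
    while (*ps->p == ' ')
        ps->p++;
}

/* factor := '!' factor | '(' expr ')' | letter */
static uint32_t parse_factor(struct parser *ps)
{
    skip_spaces(ps);
    if (*ps->p == '!') {
        ps->p++;
        return ~parse_factor(ps);
    }
    if (*ps->p == '(') {
        uint32_t v;
        ps->p++;
        v = parse_expr(ps);
        skip_spaces(ps);
        if (*ps->p == ')')
            ps->p++;
        return v;
    }
    if (*ps->p >= 'a' && *ps->p <= 'z') {
        char name = *ps->p++;
        return ps->sets[name - 'a'];
    }
    return 0;   /* malformed input is treated as the empty set here */
}

/* term := factor ('&' factor)*, so intersection binds tightest */
static uint32_t parse_term(struct parser *ps)
{
    uint32_t v = parse_factor(ps);
    for (;;) {
        skip_spaces(ps);
        if (*ps->p != '&')
            return v;
        ps->p++;
        v &= parse_factor(ps);
    }
}

/* expr := term (('+' | '|' | '-' | '^') term)*, evaluated left to right */
static uint32_t parse_expr(struct parser *ps)
{
    uint32_t v = parse_term(ps);
    for (;;) {
        char op;
        skip_spaces(ps);
        op = *ps->p;
        if (op == '+' || op == '|') { ps->p++; v |= parse_term(ps); }
        else if (op == '-')         { ps->p++; v &= ~parse_term(ps); }
        else if (op == '^')         { ps->p++; v ^= parse_term(ps); }
        else return v;
    }
}

/* Evaluate a set expression such as "a + b & c". */
static uint32_t eval_sets(const char *src, const uint32_t *sets)
{
    struct parser ps = { src, sets };
    return parse_expr(&ps);
}
```

With a = 0x3, b = 0x5 and c = 0x6, "a + b & c" groups as a + (b & c) and yields 0x7; a strict left-to-right reading, which the explicit "(a + b) & c" reproduces, yields 0x6.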
|
|
|
|
| |
Outdent code that the previous commit removed the surrounding block from
|
|
|
|
|
|
|
|
|
|
|
| |
Prior to this commit, the regex compiler was relying on the lexer to do
the translation from Unicode to native for \N{...} constructs, where it
was simpler to do. However, when the pattern is a single-quoted string,
it is passed unchanged to the regex compiler, so the translation never
happened and the construct did not work. Fixing this required some
refactoring, though it led to a clean API in a static function.
This was spotted by Father Chrysostomos.
|
|
|
|
|
| |
It actually does do the right thing: /(?(R0))/ and /(?(R00))/ both fall
through to give an appropriate error 'Switch condition not recognized'
|
|
|
|
|
|
|
|
|
|
|
|
| |
Some questions and loose ends:
XXX gv.c:S_gv_magicalize - why are we using SSize_t for paren?
XXX mg.c:Perl_magic_set - need appropriate error handling for $)
XXX regcomp.c:S_reg - need to check if we do the right thing if parno
was not grokked
Perl_get_debug_opts should probably return something unsigned; not sure
if that's something we can change.
|
| |
|
|
|
|
|
|
| |
Both needed: the macro is for compilers, the comment for static checkers.
(This doesn't address whether each spot is correct and necessary.)
|
|
|
|
|
|
|
| |
This was experimentally introduced in 5.18, and no issues were raised,
except that it got us to thinking and spurred us to stop allowing $^X,
where 'X' is a non-printable control character, and that change caused
some issues.
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
A function determines whether the position between any two characters is
a grapheme cluster break. After I wrote this, I realized that an array
lookup might be a better implementation, but the deadline for v5.22 was
too close to change it. I did see that my gcc optimized it down to
an array lookup.
This makes the implementation of \X go from being complicated to
trivial.
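
The array-lookup alternative mentioned above might look like the sketch below. It assumes a deliberately reduced set of grapheme cluster break classes (Other, CR, LF, Control, Extend) and covers only the corresponding basic rules of UAX #29; the enum, the table and the function names are hypothetical and are not the ones Perl uses.

```c
#include <stdbool.h>
#include <stddef.h>

/* Reduced, hypothetical set of grapheme cluster break classes. */
enum gcb { GCB_OTHER, GCB_CR, GCB_LF, GCB_CONTROL, GCB_EXTEND, GCB_COUNT };

/* Rule-by-rule version, following the order of the UAX #29 rules. */
static bool is_break_fn(enum gcb before, enum gcb after)
{
    if (before == GCB_CR && after == GCB_LF)
        return false;                                   /* CR x LF */
    if (before == GCB_CR || before == GCB_LF || before == GCB_CONTROL)
        return true;                                    /* break after controls */
    if (after == GCB_CR || after == GCB_LF || after == GCB_CONTROL)
        return true;                                    /* break before controls */
    if (after == GCB_EXTEND)
        return false;                                   /* x Extend */
    return true;                                        /* otherwise break */
}

/* Table version: gcb_break[before][after], 1 means a break is allowed. */
static const unsigned char gcb_break[GCB_COUNT][GCB_COUNT] = {
    /*              OTHER CR LF CONTROL EXTEND */
    /* OTHER   */ { 1,    1, 1, 1,      0 },
    /* CR      */ { 1,    1, 0, 1,      1 },
    /* LF      */ { 1,    1, 1, 1,      1 },
    /* CONTROL */ { 1,    1, 1, 1,      1 },
    /* EXTEND  */ { 1,    1, 1, 1,      0 },
};

/* Length, in characters, of the first cluster in cls[0..n-1]: the core of
 * an \X-like match once the break test is available. */
static size_t cluster_len(const enum gcb *cls, size_t n)
{
    size_t i;

    if (n == 0)
        return 0;
    for (i = 1; i < n && !gcb_break[cls[i - 1]][cls[i]]; i++)
        ;
    return i;
}
```

Matching one cluster then reduces to scanning forward until the table reports a break, which is the sense in which \X becomes trivial.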
|
|
|
|
|
|
|
|
|
|
|
|
| |
This changes the symbols so that each is defined in a single file. This
may save text space, depending on the compiler. The next commit will
cause this header to be included in more places, so it becomes more
important to do this.
At the same time this removes the guard for #ifndef PERL_IN_XSUB_RE.
The code now is executed regardless of that. This is simpler, and
previously there might have been the possibility of uninitialized memory
being read, should re_comp.o be executed before regcomp.o.
|
|
|
|
|
|
|
|
|
|
| |
//n was implemented by avoiding the primary side-effects of compiling
a capture when the flag was turned on; however some secondary effects
still occurred later in the same function, by using the value of the
'paren' variable - even as far as causing coredumps.
Setting paren to ':' when NOCAPTURE is enabled makes the rest of the
function act just as if it had parsed (?:...) instead of (...).
|
|
|
|
| |
This could be triggered by trying to compile e.g. 'qr{x+(y(?0))*}'.
|
| |
|
| |
|
|
|
|
|
|
|
|
| |
AFL (<http://lcamtuf.coredump.cx/afl>) found that the UV to I32 conversion
can evade the necessary range checks on wraparound, leading to bad reads.
Check for it, and force to I32_MAX, expecting that this will usually
yield a "Reference to nonexistent group" error.
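
A clamp of this kind can be sketched as follows, using the fixed-width standard types in place of Perl's UV and I32. The function names are hypothetical; the point is only that the range check happens before the narrowing conversion, so a huge parsed number cannot wrap around into a small, valid-looking group number.

```c
#include <stdbool.h>
#include <stdint.h>

/* Convert a parsed group number to 32 bits, saturating instead of wrapping. */
static int32_t group_number_from_uv(uint64_t parsed)
{
    return parsed > (uint64_t)INT32_MAX ? INT32_MAX : (int32_t)parsed;
}

/* Range check performed afterwards on the narrowed value. */
static bool group_exists(int32_t n, int32_t ngroups)
{
    return n >= 1 && n <= ngroups;
}
```

For instance, 4294967297 (2^32 + 1) would truncate to 1 under a plain cast and pass the check; clamped to INT32_MAX it is rejected, which corresponds to the "Reference to nonexistent group" outcome.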
|
|
|
|
|
| |
The new test in 8a6d8ec6fe revealed an additional code problem: reading
past the end of the string under clang with sanitize=address.
|
|
|
|
|
|
|
| |
AFL (<http://lcamtuf.coredump.cx/afl>) found that when producing the
error message for /(??/ we hit an assert because we've stepped past
the end of the pattern string. Code inspection found that we also do
that in other branches, and we also need to check UTF more carefully.
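
One way to keep such scanning inside the pattern is a bounded advance helper, sketched below. The name advance_char and the simplified lead-byte test are assumptions made for illustration; the real code uses Perl's own UTF-8 macros.

```c
#include <stdbool.h>
#include <stddef.h>

/* Advance one character from p, never moving past end. When utf8 is true,
 * the step is taken from the lead byte (simplified: no validation). */
static const char *advance_char(const char *p, const char *end, bool utf8)
{
    size_t step = 1;

    if (p >= end)
        return end;                 /* already at or past the end */
    if (utf8) {
        unsigned char c = (unsigned char)*p;
        if (c >= 0xF0)      step = 4;
        else if (c >= 0xE0) step = 3;
        else if (c >= 0xC0) step = 2;
    }
    /* A truncated multi-byte character must not carry us beyond end. */
    return (size_t)(end - p) < step ? end : p + step;
}
```

Any branch that wants to show the next character in an error message can call such a helper instead of incrementing the pointer unconditionally.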
|