| Commit message | Author | Age | Files | Lines |
|
[Why does my (Windows) git cherry-pick change 100755 modes to 100644?]
|
(cherry picked from commit f86d720ebb7ad53ce8b1c12cee66586eabffe0c8)
[Edited by the committer to bump the $VERSION to a value that has not
already been used in a blead release and will not be used in a future
blead release.]
|
overrun-local: Overrunning array PL_reg_intflags_name of 14 8-byte elements at element index 31 (byte offset 248) using index bit (which evaluates to 31).
Needed compile-time limits for the PL_reg_intflags_name so that the
bit loop doesn't waltz off past the array. Could not use C_ARRAY_LENGTH
because the size of name array is not visible during compile time
(only const char*[] is), so modified regcomp.pl to generate the size and
made it visible only under DEBUGGING. Did extflags analogously
even though its size is currently exactly 32 already. The sizeof(flags)*8
is extra paranoia for ILP64.
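A minimal sketch of the bounded bit loop described above. The table `flag_names`, the constant `FLAG_NAMES_SIZE`, and the function `describe_flags` are hypothetical stand-ins for illustration, not the actual generated core code:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical name table; stands in for a generated array such as
 * PL_reg_intflags_name. */
static const char *const flag_names[] = { "SKIP", "IMPLICIT", "NAUGHTY" };

/* Hypothetical generated constant: the generator emits the array length
 * because other files only see an unsized const char *[] declaration. */
#define FLAG_NAMES_SIZE 3u

/* Write the comma-separated names of the set bits of flags into out.
 * Returns the number of names written. */
static int describe_flags(unsigned long flags, char *out, size_t outlen)
{
    size_t used = 0;
    int count = 0;
    unsigned bit;

    assert(out != NULL && outlen > 0);
    out[0] = '\0';

    /* Two limits: the name table length and the width of the flags word. */
    for (bit = 0; bit < FLAG_NAMES_SIZE && bit < sizeof(flags) * CHAR_BIT; bit++) {
        if (flags & (1UL << bit)) {
            int n = snprintf(out + used, outlen - used, "%s%s",
                             count ? "," : "", flag_names[bit]);
            if (n < 0 || (size_t)n >= outlen - used) {
                out[used] = '\0';   /* drop a truncated name */
                break;
            }
            used += (size_t)n;
            count++;
        }
    }
    return count;
}
```

With this bound in place, a set bit beyond the end of the name table (bit 31, say) is simply skipped instead of indexing past the array.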
|
(and of regen/warnings.pl)
|
we return, rather than print, the warnings, so we can potentially
futz around with the string and put it where we like without
having to worry about C<select>
|
...to limit the number of variables visible everywhere and
make it a bit easier to see what I am doing as I refactor
regen/warnings.pl
|
In doing an audit of regcomp.c, and experimenting using
Encode::_utf8_on(), I found this one instance of a regen/regcharclass.pl
macro that could read beyond the end of the string if given malformed
UTF-8. Hence we convert to use the 'safe' form. There are no other
uses of the non-safe version, so don't need to generate them.
|
Having these unused macros around just clutters up the header file
|
It makes no sense to check for length safeness for the macros generated
by this program which take a single UV code point as a parameter. Prior
to this patch, it would skip trying to generate them if asked. But,
because of the way things are structured, that means that if you need
just this and the safe versions, you can't do it so easily. What this
commit does is generate the cp macro if requested even if the 'safe'
version of other macros are also requested.
|
For matches that can match more than a single code point, one should
always use a macro that makes sure that one doesn't read off the end of
the buffer. This is because the buffer might end with the first N
characters of a sequence with at least N+1 in it, and we don't want to
read that N+1 position in the buffer.
If this had been in place, buggy commit 3a8bbffbce would not have
happened.
|
The macros generated by these options are not needed in the core;
generating them just clutters up the header file, and some will actually
be forbidden by the next commit.
|
My thinking was muddled when I made that commit, and this reverts the
essence of it. The theory was that since we have already processed the
regex pattern, we don't need to check it for malformedness, hence we
don't need the "safe" form of certain macros that check for and avoid
running off the end of the buffer. It is true that we don't have to
worry about malformedness indicating that the buffer is bigger than it
really is, but these macros can match up to three well-formed
characters, so we do have to make sure that all three are in the buffer,
and that the input isn't just the first two at the buffer's end.
This was caught by running valgrind.
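The point about needing all three characters in the buffer can be sketched as follows. The function `is_ffi_safe` and the choice of the sequence "ffi" are illustrative assumptions, not the macros that regen/regcharclass.pl actually generates:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical "safe" matcher: true if the three characters "ffi" start
 * at s. e points one past the last valid byte; the length test comes
 * first, so no byte at or beyond e is ever read. */
static int is_ffi_safe(const unsigned char *s, const unsigned char *e)
{
    assert(s != NULL && e != NULL && s <= e);
    return (e - s) >= 3 && s[0] == 'f' && s[1] == 'f' && s[2] == 'i';
}
```

If the buffer ends after "ff", the input is perfectly well-formed, yet a matcher without the `(e - s) >= 3` test would still read one byte past the end.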
|
Indent to account for new block added in the previous commit
|
This simplifies the macros generated which make sure there are no read
errors. Prior to this commit, the code generated looked like
    (e - s) > 3
    ? see if things of at most length 4 match
    : (e - s) > 2
      ? see if things of at most length 3 match
      : (e - s) > 1
        ? see if things of at most length 2 match
        : (e - s) > 0
          ? see if things of at most length 1 match
For things that are a single character, the ones greater than length 2
must be in UTF8, and their needed length can be determined by UTF8SKIP,
so we can get rid of most of the (e-s) tests.
This doesn't change the macros which can match multiple characters; that
is harder to do.
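A rough sketch of the simplified shape, assuming a hypothetical `utf8_skip` helper that plays the role of UTF8SKIP (limited here to standard 1-4 byte sequences) and an illustrative set of single characters (LF, U+0085, U+2028). This is not the generated macro text:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for UTF8SKIP: expected sequence length, derived
 * from the lead byte alone. */
static size_t utf8_skip(unsigned char lead)
{
    if (lead < 0xC0) return 1;   /* ASCII, or a stray continuation byte */
    if (lead < 0xE0) return 2;
    if (lead < 0xF0) return 3;
    return 4;
}

/* True if one character from { LF, U+0085, U+2028 } starts at s.
 * A single length test replaces the cascade of (e - s) comparisons. */
static int is_line_break_safe(const unsigned char *s, const unsigned char *e)
{
    assert(s != NULL && e != NULL && s <= e);
    if (e - s < 1) return 0;
    if ((size_t)(e - s) < utf8_skip(s[0])) return 0;   /* the one test */
    if (s[0] == 0x0A) return 1;
    if (s[0] == 0xC2) return s[1] == 0x85;
    if (s[0] == 0xE2) return s[1] == 0x80 && s[2] == 0xA8;
    return 0;
}
```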
|
This adds a comment to the generated file that the macros are not to be
generally used.
|
The term 'semantics' in documentation when applied to character sets is
changed to 'rules' as being a shorter less-jargony synonym in this case.
This was discussed several releases ago, but I didn't get around to it.
|
Declarative syntax to unwrap argument list into lexical variables.
"sub foo ($a,$b) {...}" checks number of arguments and puts the
arguments into lexical variables. Signatures are not equivalent to the
existing idiom of "sub foo { my($a,$b) = @_; ... }". Signatures are only
available by enabling a non-default feature, and generate warnings about
being experimental. The syntactic clash with prototypes is managed by
disabling the short prototype syntax when signatures are enabled.
|
This large (sorry, I couldn't figure out how to meaningfully split it
up) commit causes Perl to fully support LC_CTYPE operations (case
changing, character classification) in UTF-8 locales.
As a side effect it resolves [perl #56820].
The basics are easy, but there were a lot of details, and one
troublesome edge case discussed below.
What essentially happens is that when the locale is changed to a UTF-8
one, a global variable is set TRUE (FALSE when changed to a non-UTF-8
locale). Within the scope of 'use locale', this variable is checked,
and if TRUE, the code that Perl uses for non-locale behavior is used
instead of the code for locale behavior. Since Perl's internal
representation is UTF-8, we get UTF-8 behavior for a UTF-8 locale.
More work had to be done for regular expressions. There are three
cases.
1) The character classes \w, [[:punct:]] needed no extra work, as
the changes fall out from the base work.
2) Strings that are to be matched case-insensitively. These form
EXACTFL regops (nodes). Notice that if such a string contains only
above-Latin1 characters that match only themselves, the node can be
downgraded to an EXACT-only node, which presents better optimization
possibilities, as we now have a fixed string known at compile time to be
required to be in the target string to match. Similarly if all
characters in the string match only other above-Latin1 characters
case-insensitively, the node can be downgraded to a regular EXACTFU node
(match, folding, using Unicode, not locale, rules). The code changes
for this could be done without accepting UTF-8 locales fully, but there
were edge cases which needed to be handled differently if I stopped
there, so I continued on.
In an EXACTFL node, all such characters are now folded at compile time
(just as before this commit), while the other characters whose folds are
locale-dependent are left unfolded. This means that they have to be
folded at execution time based on the locale in effect at the moment.
Again, this isn't a change from before. The difference is that now some
of the folds that need to be done at execution time (in regexec) are
potentially multi-char. Some of the code in regexec was trivial to
extend to account for this because of existing infrastructure, but the
part dealing with regex quantifiers needed more work.
Also the code that joins EXACTish nodes together had to be expanded to
account for the possibility of multi-character folds within locale
handling. This was fairly easy, because it already has infrastructure
to handle these under somewhat different circumstances.
3) In bracketed character classes, represented by ANYOF nodes, a new
inversion list was created giving the characters that should be matched
by this node when the runtime locale is UTF-8. The list is ignored
except under that circumstance. To do this, I created a new ANYOF type
which has an extra SV for the inversion list.
The edge case that caused the most difficulty is folding involving the
MICRO SIGN, U+00B5. It folds to the GREEK SMALL LETTER MU, as does the
GREEK CAPITAL LETTER MU. The MICRO SIGN is the only 0-255 range
character that folds to outside that range. The issue is that it
doesn't naturally fall out that it will match the CAP MU. If we let the
CAP MU fold to the small mu at compile time (which it can because both
are above-Latin1 and so the fold is the same no matter what locale is in
effect), it could appear that the regnode can be downgraded away from
EXACTFL to EXACTFU, but doing so would cause the MICRO SIGN to not
case-insensitively match the CAP MU. This could be special cased in regcomp
and regexec, but I wanted to avoid that. Instead the mktables tables
are set up to include the CAP MU as a character whose presence forbids
the downgrading, so the special casing is in mktables, and not in the C
code.
|
regen/regcharclass.pl can create macros for use where we need to worry
about the possibility of malformed UTF-8, and for where we don't. In
the case of looking at regex patterns, the Perl core has complete
control over generating them, and hence isn't generally going to create
too short a buffer; if it does, it's a bug that will show up and get
fixed. This commit changes to generate and use the faster macros that
don't do bounds checking.
|
An unsigned must always be >= 0, and generating a test for that can lead
to a compiler warning, even if it gets optimized out. The input to the
macros generated by this are supposed to be UV. This commit suppresses
any >= 0 test.
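As an illustration of the idea, here is a small sketch of a generator step that emits a range test and leaves out the lower bound when it is zero. The function name `emit_range_test` and the emitted text format are assumptions made for this example; the real generator is a Perl script:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Write a C expression testing that an unsigned cp lies in [lo, hi].
 * When lo is 0 the "0 <= cp" half is always true for an unsigned value,
 * so it is omitted to avoid a compiler warning in the generated code.
 * Returns what snprintf returns. */
static int emit_range_test(char *buf, size_t len,
                           unsigned long lo, unsigned long hi)
{
    assert(buf != NULL && lo <= hi);
    if (lo == 0)
        return snprintf(buf, len, "( cp <= 0x%lX )", hi);
    return snprintf(buf, len, "( 0x%lX <= cp && cp <= 0x%lX )", lo, hi);
}
```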
|
wrap() is already defined by the regen infrastructure; no need to do so
again, and we get a warning if we persist in doing so.
|
Prior to this patch, this was in regen/mk_invlists.pl, but future
commits will want it to also be used by the header generated by
regen/regcharclass.pl, so use a common source so the logic doesn't have
to be duplicated.
|
Namely, Android.
|
Until now, the behavior of the statements
    use warnings "FATAL";
    use warnings "NONFATAL";
    no warnings "FATAL";
was unspecified and inconsistent. This change causes them to be handled
with an implied "all" at the end of the import list.
Tony Cook: fix AUTHORS formatting
|
We need a better name for the experimental category, but I have not
thought of one, even after sleeping on it.
|
These are the base names for various macros used in parsing identifiers.
Prior to this patch, parsing a code point above Latin1 caused loading
disk files. This patch causes all the information to be compiled into
the Perl binary.
|
This outdents a block to be in line with adjacent lines.
|
Previous commits in this series have removed all uses of this global
array. This completely removes it.
Since it is a global, consideration need be given to possible uses of it
outside the core. It has never been externally documented, and is an
opaque structure whose internals have changed with every release. The
functions used to access it are almost all static to regcomp.c; those
few that aren't have been hidden from all but the few .c files that need
to have access to them, via #if's.
|
This global array is no longer used, having been removed in previous
commits in this series.
Since it is a global, consideration need be given to possible uses of it
outside the core. It has never been externally documented, and is an
opaque structure whose internals have changed with every release. The
functions used to access it are almost all static to regcomp.c; those
few that aren't have been hidden from all but the few .c files that need
to have access to them, via #if's.
|
When constructing what matches code points under /i, Perl uses an
inversion list of all the possible code points that participate in
folds. This number is relatively few compared to the possible universe
of code points, as most of the world's scripts aren't cased, and many
characters in the scripts that do fold aren't foldable (such as
punctuation). Prior to this commit, the list for the above-Latin1 code
points was read-in from disk if and only if needed. This commit causes
the list to be added to read-only data in a C header, trading a little
space in Perl's text segment for speed at execution. This will enable
ripping out some code in this and future commits (offsetting the space
used by this one).
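For readers unfamiliar with inversion lists, a minimal lookup sketch over a read-only table follows. The table contents (`ascii_alpha_invlist`) and the function name are hypothetical and far smaller than any real fold list; the layout shown (a sorted array where each element toggles membership) is the general inversion-list idea, not Perl's exact internal format:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical read-only inversion list. Each element starts a range
 * that flips membership: [0x41,0x5B) in, [0x5B,0x61) out,
 * [0x61,0x7B) in, [0x7B,...) out. */
static const unsigned long ascii_alpha_invlist[] = { 0x41, 0x5B, 0x61, 0x7B };

/* Binary search for the number of elements <= cp. An odd count means
 * cp falls inside an "in" range. */
static int invlist_contains(const unsigned long *list, size_t n,
                            unsigned long cp)
{
    size_t lo = 0, hi = n;

    assert(list != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (list[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo & 1) != 0;
}
```

Because the table is `static const`, it can live in read-only data and needs no loading or parsing at run time, which is the trade described above.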
|
This changes charclass_invlists.h to have the complete definitions for
all the POSIX classes, like \w and [:alpha:]. Thus these won't have to
be loaded off disk at run-time.
Taking advantage of this will be done in stages in future commits
|
These note that warnings categories should be independent in the calls
to ckWARN() and packWARN() type macros.
|
Due to the security risks associated with user-supplied formats
being passed to C-level printf() style functions (eg %n),
gcc has a -Wformat-nonliteral warning that complains whenever such a
function is passed a non-literal format string.
This commit silences all such warnings in core and ext/.
The main changes are
1) the 'f' (format) flag in embed.fnc is now handled slightly more
cleverly. Rather than just applying to functions whose last arg is '...'
(and where the format arg is assumed to be the previous arg), it
can now handle non-'...' functions: arg checking is disabled, but format
checking is still done: it works by assuming that an arg called 'fmt',
'pat' or 'f' is the format string (and dies if fails to find exactly one
such arg).
2) with the new embed.fnc functionality, more functions have been marked
with the 'f' flag. When such a function passes its fmt arg onto an inner
printf-like function, we simply disable the warning for that call using
GCC_DIAG_IGNORE(-Wformat-nonliteral), since we know that the caller must
have already checked it.
3) In quite a few places the format string isn't literal, but it *is*
constant (e.g. PL_warn_uninit_sv). For those cases, again disable the
warning.
4) In pp_formline(), a particular format was one of several different
literal strings depending on circumstances. Rather than assigning this
string to a temporary variable, incorporate the ?: branches directly in
the function call arg. gcc is clever enough to decide the arg is then
always literal.
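Points 1 and 4 can be sketched roughly as below. The macro `ATTR_PRINTF` and the functions `fmt_buf` and `format_number` are hypothetical names for this example; the real code uses attributes generated from embed.fnc:

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical wrapper macro; expands to nothing on non-GNU compilers. */
#if defined(__GNUC__)
#  define ATTR_PRINTF(fmt_idx, first_arg) \
       __attribute__((format(printf, fmt_idx, first_arg)))
#else
#  define ATTR_PRINTF(fmt_idx, first_arg)
#endif

/* Marked printf-like so callers' format strings get checked. */
static int fmt_buf(char *buf, size_t len, const char *fmt, ...)
    ATTR_PRINTF(3, 4);

static int fmt_buf(char *buf, size_t len, const char *fmt, ...)
{
    va_list ap;
    int n;

    assert(buf != NULL && fmt != NULL);
    va_start(ap, fmt);
    n = vsnprintf(buf, len, fmt, ap);
    va_end(ap);
    return n;
}

/* The ?: between two literals sits directly in the call argument,
 * instead of being assigned to a temporary const char * first. */
static int format_number(char *buf, size_t len, double v, int with_digits)
{
    return fmt_buf(buf, len, with_digits ? "%.2f" : "%.0f", v);
}
```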
|
mark this function with
__attribute__format__null_ok__(__strftime__,pTHX_1,0)
so that the compiler can check strftime-style format args and warn
about them.
Rather than adding new flag(s) to embed.fnc, I just enhanced the f flag
to treat it as strftime-style rather than printf if the function name
matches /strftime/. This was quicker, and we're unlikely to have many
such functions.
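For comparison, a sketch of what a strftime-style format attribute looks like when written by hand with the GCC attribute syntax. The macro `ATTR_STRFTIME` and the wrapper `strftime_sketch` are hypothetical, and the argument positions differ from the pTHX_-adjusted ones shown above:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical wrapper macro. For strftime-style formats GCC requires
 * the "first argument to check" parameter to be 0. */
#if defined(__GNUC__)
#  define ATTR_STRFTIME(fmt_idx) \
       __attribute__((format(strftime, fmt_idx, 0)))
#else
#  define ATTR_STRFTIME(fmt_idx)
#endif

static size_t strftime_sketch(char *buf, size_t len, const char *fmt,
                              const struct tm *tm) ATTR_STRFTIME(3);

/* Thin wrapper: callers' format strings are checked as strftime formats. */
static size_t strftime_sketch(char *buf, size_t len, const char *fmt,
                              const struct tm *tm)
{
    assert(buf != NULL && fmt != NULL && tm != NULL);
    return strftime(buf, len, fmt, tm);
}
```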
|
When a scalar is returned from (??{...}) inside a regexp, it gets com-
piled into a regexp if it is not one already. Then the regexp is sup-
posed to be cached on that scalar (in magic), so that the same scalar
returned again will not require another compilation.
Commit e4bfbed39b disabled caching except on references to overloaded
objects. But in that one case the caching caused erroneous behaviour,
which was just fixed by 636209429f and this commit’s parent, effect-
ively disabling the cache altogether.
The cache is disabled because it does not apply to TEMP variables
(those about to be freed anyway, for which caching would be a waste
of CPU), and all non-overloaded non-qr thingies get copied into
new mortal (TEMP) scalars (as of e4bfbed39b) before reaching the
caching code.
This commit skips the copy if the return value is already a non-magi-
cal string or number. It also allows the caching to happen on con-
stants, which has never been permitted before. (There is actually no
reason for disallowing qr magic on read-only variables.)
|
by removing the hint from the exit op itself and just having pp_exit
look in the cop hint hash, where it is already stored (as a result of
having been in %^H at compile time).
&CORE:: subs intentionally lack a nextstate op (cop) so they can see
the hints in the caller’s nextstate op.
|
This commit makes them behave like exit and die without the ampersand
by moving the OPpHUSH_VMSISH hint from exit/die op to the current
statement (nextstate/cop) instead. &CORE:: subs intentionally lack a
nextstate op, so they can see the hints in the caller’s nextstate op.
|
The way CORE:: was handled in the lexer was convoluted.
CORE was treated initially as a keyword, with exceptions in the lexer
to make it behave correctly. If it turned out not to be followed
by ::, then the lexer would fall back to treating it as a bareword
or sub name.
Before even checking for a keyword, the lexer looks for :: and goes
to the bareword/sub code. But it made a special exception there
for CORE::.
In the end, treating CORE as a keyword recognized by the keyword()
function requires more special cases than simply special-casing CORE::
in toke.c.
This fixes the lexical CORE sub bug, while reducing the total num-
ber of lines.
|
It is used for two op types, but only a small portion of it applies
to both, so we can put that in a static function. This makes the
next commit easier.
|
rv2hv has had a TARG since perl 5.000, but it has not used it since
hv_scalar was added in perl-5.8.0-3008-ga3bcc51.
This commit removes it, saving a tiny bit of space in the pad.