| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
| |
This commit, made during the freeze, was approved by the pumpking
|
| |
More functions have appeared that are PERL_STATIC_INLINE, but
porting/extrefs.t compiles with -DPERL_NO_INLINE_FUNCTIONS, which
means no bodies are visible; the Tru64 cc takes static inline
seriously and requires the bodies.
Instead of manually adding #ifndef PERL_NO_INLINE_FUNCTIONS
to embed.fnc, fix the problem in embed.pl so that the 'i' type inserts the
required ifndef. Remove the manual PERL_NO_INLINE_FUNCTIONS insertions
made in a4570f51 (note that the types of some have diverged).
Now extrefs.t works again on Tru64 (and no other compiler
has ever tripped on this).
|
|
|
|
|
|
| |
I found myself needing this function for development debugging, which
formerly was only usable from utf8.c. This enhances it to allow a
second format type, and makes it core-accessible.
|
|
|
|
|
|
| |
This is to be called to abort the parsing early, before the required
number of errors has been found. It is used when continuing the parse
would be fruitless, or when we could be looking at garbage.
|
|
|
|
|
|
| |
This creates a function in toke.c to output the compilation aborted
message, changing perl.c to call that function. This is in preparation
for it to be called from a second place.
|
| |
Previous commits have tightened up the checking of UTF-8 for
well-formedness in the input program or string eval. This is done in
lex_next_chunk and lex_start. But it doesn't handle the case of
use utf8; foo
because 'foo' is checked while UTF-8 is still off. This solves that
problem by noticing when utf8 is turned on, and then rechecking at the
next opportunity.
See thread beginning at
http://nntp.perl.org/group/perl.perl5.porters/242916
This fixes [perl #130675]. A test will be added in a future commit.
This catches some errors earlier than before and aborts, so
some tests in the suite had to be split into multiple parts.
|
| |
Commit d2067945159644d284f8064efbd41024f9e8448a reverted commit
b5248d1e210c2a723adae8e9b7f5d17076647431. b5248 removed a parameter
from S_scan_ident, and changed its interior to use PL_bufend instead of
that parameter. The parameter had been used to limit how far into the
string being parsed scan_ident could look. In all calls to scan_ident
but one, the parameter was already PL_bufend. In the one call where it
wasn't, b5248 compensated by temporarily changing PL_bufend around the
call, running afoul, eventually, of the expectation that PL_bufend
points to a NUL.
I would have expected the reversion to add back both the parameter and
the uses of it, but apparently the function interior has changed enough
since the original commit that the merge didn't even register any
conflicts. As a result the parameter got added back, but not the uses
of it.
I tried both approaches to fix this:
1) to change the function to use the parameter;
2) to simply delete the parameter.
Only the latter passed the test suite without error.
I then tried to understand why the parameter existed in the first place,
and why b5248 introduced its kludge to work around removing it. It appears
to me that this is for the benefit of the intuit_more function to enable
it to discern $] from a $ ending a bracketed character class, by ending
the scan before the ']' when in a pattern.
The trouble is that modern scan_ident versions do not view themselves as
constrained by PL_bufend. If that is reached at a point where white
space is allowed, it will try appending the next input line and
continuing, thus changing PL_bufend. Thus the kludge in b5248 wouldn't
necessarily do the expected limiting anyway. The reason the approach
"1)" I tried didn't work was that the function continued to use the
original value, even after it had read in new things, instead of
accounting for those.
Hence approach "2)" is used. I'm a little nervous about this, as it may
lead to intuit_more() (which uses heuristics) having more cases where it
makes the wrong choice about $] vs [...$]. But I don't see a way around
this, and the pre-existing code could fail anyway.
Spotted by Dave Mitchell.
|
| |
buffer" argument, use PL_bufend"
This reverts commit b5248d1e210c2a723adae8e9b7f5d17076647431.
This commit, dating from 2013, was made unnecessary by later removal of
the MAD code. It temporarily changed the PL_bufend variable; doing that
ran afoul of an assertion, added in
fac0f7a38edc4e50a7250b738699165079b852d8, that expects PL_bufend to
point to a terminating NUL.
Beyond the reversion, a test is added here.
|
| |
Given an op, this function determines what type of struct it has been
allocated as. Returns one of the OPclass enums, such as OPclass_LISTOP.
Originally this was a static function in B.xs, but it has wider
applicability; indeed several XS modules on CPAN have cut and pasted it.
It adds the OPclass enum to op.h. In B.xs there was a similar enum, but
with names like OPc_LISTOP. I've renamed them to OPclass_LISTOP etc. so as
not to clash with the cut+paste code already on CPAN.
|
|
|
|
|
|
|
|
|
| |
In order for Perl to eventually allow string delimiters to be Unicode
grapheme clusters (which look like a single character, but may be
multiple ones), we have to stop allowing a single char delimiter that
isn't a grapheme by itself. These are unlikely to exist in actual code,
as they would typically display as attached to the character in front of
them, but we should be sure.
|
|
|
|
| |
Now that there are _safe versions, deprecate the unsafe ones.
|
| |
|
|
|
|
|
|
| |
These macros are being replaced by a safe version; they now generate a
deprecation message at each call site upon the first use there in each
program run.
|
|
|
|
| |
This is in preparation for it to be called from outside this file.
|
| |
The original API does not check that we aren't reading beyond the end of
a buffer, apparently assuming that we could keep malformed UTF-8 out by
use of gatekeepers, but that is currently impossible. This commit adds
"safe" macros for determining if a UTF-8 sequence represents
an alphabetic, a digit, etc. Each new macro has an extra parameter
pointing to the end of the sequence, so that looking beyond the input
string can be avoided.
The macros aren't currently completely safe, as they don't test that
there is at least a single valid byte in the input, except by an
assertion in DEBUGGING builds. This is because typically they are
called in code that makes that assumption, and frequently tests the
current byte for one thing or another.
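The idea can be sketched roughly as follows (these are invented helper names, not Perl's actual macros): the extra end-pointer lets the checker refuse to read continuation bytes that lie beyond the buffer, where the unsafe variant would dereference them unconditionally.

```c
#include <assert.h>
#include <stddef.h>

/* How many bytes the UTF-8 sequence starting with 'first' occupies;
 * 0 if 'first' cannot start a well-formed sequence. */
size_t utf8_need(unsigned char first) {
    if (first < 0x80) return 1;          /* ASCII */
    if (first < 0xC2) return 0;          /* continuation byte or overlong lead */
    if (first < 0xE0) return 2;
    if (first < 0xF0) return 3;
    if (first < 0xF5) return 4;
    return 0;                            /* would encode above U+10FFFF */
}

/* The "safe" pattern: before classifying the character at 's', verify
 * that the whole sequence fits inside the 'len' bytes available.  An
 * unsafe macro would read s[1], s[2], ... without this guard. */
int utf8_fits(const char *s, size_t len) {
    size_t need = utf8_need((unsigned char)s[0]);
    return need != 0 && len >= need;
}
```

For example, a two-byte sequence with both bytes available passes, while the same lead byte with only one byte left is rejected instead of overread.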
|
|
|
|
|
|
|
| |
The bottom level UTF-8 decode routine now generates detailed messages
when it encounters malformations. In some instances these should be
treated as croak reasons and output even if warnings are off, just
before dying. This commit adds a function to do this.
|
|
|
|
|
|
| |
As explained in the previous commit, I misunderstood the available scope
of functions not otherwise qualified by a flag. This changes some of
them to correspond with my new understanding.
|
|
|
|
|
|
|
| |
This function is equivalent to sv_setsv(sv, &PL_sv_undef), but more
efficient.
Also change the obvious places in the core to use the new idiom.
|
| |
Unlike utf8_hop(), utf8_hop_safe() won't navigate before the
beginning or after the end of the supplied buffer.
The original version of this put all of the logic into
utf8_hop_safe(), but in many cases a caller specifically
needs to go forward or backward, and supplying the other limit
made the function less usable, so I split the function
into forward and backward cases.
This split may also make inlining these functions more efficient
or more likely.
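The forward/backward split can be sketched like this (an illustrative reconstruction, not the perl source): each direction takes only the limit it needs, and clamps to it rather than walking past it.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes in the sequence started by 'b' (0 for an invalid start byte). */
static size_t seq_len(unsigned char b) {
    if (b < 0x80) return 1;
    if (b < 0xC2) return 0;
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF5) return 4;
    return 0;
}

/* Move forward by up to 'off' characters, never past s + len.
 * Returns the resulting byte offset. */
ptrdiff_t hop_forward(const char *s, size_t len, size_t off) {
    const unsigned char *p = (const unsigned char *)s;
    const unsigned char *end = p + len;
    while (off-- && p < end) {
        size_t n = seq_len(*p);
        if (n == 0 || (size_t)(end - p) < n)
            return end - (const unsigned char *)s;  /* truncated: clamp to end */
        p += n;
    }
    return p - (const unsigned char *)s;
}

/* Move backward from s + len by up to 'off' characters, never before s. */
ptrdiff_t hop_back(const char *s, size_t len, size_t off) {
    const unsigned char *start = (const unsigned char *)s;
    const unsigned char *p = start + len;
    while (off-- && p > start) {
        do { p--; } while (p > start && (*p & 0xC0) == 0x80); /* skip continuations */
    }
    return p - start;
}
```

With `"a\xC3\xA9b"` (4 bytes, 3 characters), hopping forward 2 characters lands at offset 3, and an over-large hop in either direction stops at the buffer boundary instead of running past it.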
|
|
|
|
|
|
|
|
| |
Commit 2b5e7bc2e60b4c4b5d87aa66e066363d9dce7930 changed the algorithm
for detecting overflow when decoding UTF-8 into code points. However,
on 32-bit platforms, this change caused it to claim that some things
overflow when they really don't. All such cases are overlong
malformations, which are normally forbidden, but not necessarily always.
This commit fixes that.
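The distinction can be illustrated with a sketch (not the perl source): whether a sequence overflows depends on the accumulated value, not on the lead byte or sequence length alone, because an overlong sequence can spend many bytes encoding a small code point.

```c
#include <assert.h>
#include <stdint.h>

/* Fold 'n' continuation bytes into a 32-bit accumulator starting at
 * 'acc', reporting overflow only when a 6-bit shift would actually
 * lose bits.  Returns 1 if the value overflowed 32 bits, else 0. */
int decode_overflows(uint32_t acc, const char *cont, int n) {
    for (int i = 0; i < n; i++) {
        if (acc > (UINT32_MAX >> 6))
            return 1;                    /* genuine 32-bit overflow */
        acc = (acc << 6) | ((unsigned char)cont[i] & 0x3F);
    }
    return 0;
}
```

An overlong encoding of U+0000 keeps the accumulator at zero no matter how many continuation bytes follow, so it must not be reported as overflow; only sequences whose accumulated value genuinely exceeds 32 bits should be.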
|
| |
|
| |
In the regex engine it can be useful in debugging mode to
maintain a depth counter, but in normal mode this argument
would be unused. This allows us to define functions in embed.fnc
with a "W" flag which use _pDEPTH and _aDEPTH defines which
effectively define/pass through a U32 depth parameter to the
macro wrappers. These defines are similar to the existing
aTHX and pTHX parameters.
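The pattern can be sketched like this (a reconstruction for illustration, not the actual embed.fnc machinery; the build flag name here is invented): the macros expand to a trailing parameter only when the debugging flag is defined, in the same spirit as pTHX/aTHX.

```c
#include <assert.h>

/* With -DDEBUG_DEPTH the functions gain a trailing depth argument;
 * without it, the parameter vanishes entirely. */
#ifdef DEBUG_DEPTH
#  define pDEPTH , unsigned depth    /* in declarations/definitions */
#  define aDEPTH , depth + 1         /* at recursive call sites */
#  define DEPTH_START , 0            /* at the outermost call */
#else
#  define pDEPTH
#  define aDEPTH
#  define DEPTH_START
#endif

int countdown(int n pDEPTH) {
    if (n == 0)
        return 0;
    return 1 + countdown(n - 1 aDEPTH);  /* depth threaded only when enabled */
}

int run(int n) { return countdown(n DEPTH_START); }
```

Compiled normally, `countdown` takes a single argument; compiled with the debug flag, every call site automatically passes an incremented depth without any per-call source changes.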
|
|
|
|
|
|
| |
This is enabled by a C flag, as commented. It is designed to be found
only by someone reading the code and wanting something temporary to help
in debugging.
|
| |
|
| |
The first can be used to wrap several SVPV steps into
a single sub; the second is a wrapper macro which is the equivalent
of
    $s = "";
but is optimized in various ways.
|
| |
This new function behaves like utf8n_to_uvchr(), but takes an extra
parameter that points to a U32 which will be set to 0 if no errors are
found; otherwise each error found will set a bit in it. This can be
used by the caller to figure out precisely what the error(s) is/are.
Previously, one would have to capture and parse the warning/error
messages raised. This can be used, for example, to customize the
messages to the expected end-user's knowledge level.
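The shape of such an API can be sketched as follows (the flag names and the two-byte-only decoder are invented for illustration; Perl's real function handles full UTF-8 and a richer set of error bits):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {                            /* hypothetical error bits */
    ERR_BAD_START        = 1,     /* byte cannot start a character */
    ERR_SHORT            = 2,     /* input ends mid-character      */
    ERR_NON_CONTINUATION = 4      /* bad byte inside a sequence    */
};

/* Decode a 1- or 2-byte UTF-8 character.  Instead of emitting warning
 * text, record what went wrong as bits in *errors so the caller can
 * compose its own message. */
uint32_t decode2(const char *s, size_t len, uint32_t *errors) {
    const unsigned char *u = (const unsigned char *)s;
    *errors = 0;
    if (u[0] < 0x80) return u[0];
    if ((u[0] & 0xE0) != 0xC0) { *errors |= ERR_BAD_START;        return 0xFFFD; }
    if (len < 2)               { *errors |= ERR_SHORT;            return 0xFFFD; }
    if ((u[1] & 0xC0) != 0x80) { *errors |= ERR_NON_CONTINUATION; return 0xFFFD; }
    return ((uint32_t)(u[0] & 0x1F) << 6) | (u[1] & 0x3F);
}

/* test helper: return just the error bits for a given input */
uint32_t decode2_errs(const char *s, size_t len) {
    uint32_t e;
    decode2(s, len, &e);
    return e;
}
```

A caller gets back both a replacement-character result and a precise reason (truncated vs. bad continuation vs. bad start byte), with no message parsing required.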
|
|
|
|
|
| |
This is in preparation for the same functionality to be used in a
new place in a future commit.
|
| |
I've long been unsatisfied with the information contained in the
error/warning messages raised when some input is malformed UTF-8, but
have been reluctant to change the text in case some one is relying on
it. One reason that someone might be parsing the messages is that there
has been no convenient way to otherwise pin down what the exact
malformation might be. A few commits from now will add a facility
to get the type of malformation unambiguously. This will be a better
mechanism to use for those rare modules that need to know what's the
exact malformation.
So, I will fix and issue pull requests for any module broken by this
commit.
The messages are changed by now dumping (in \xXY format) the bytes that
make up the malformed character, and extra details are added in most
cases.
Messages about overlongs now display the code point they evaluate to and
what the shortest UTF-8 sequence for generating that code point is.
Messages about overflow now just state that the sequence overflows,
since the entire byte sequence is now dumped. The previous message
displayed only the byte being processed when overflow was detected, but
that information is not at all meaningful.
|
|
|
|
| |
This text is generated in 2 places; consolidate into one place.
|
|
|
|
|
|
|
|
| |
This encodes a simple pattern that may not be immediately obvious to
someone needing it. If you have a fixed-size buffer that is full of
purportedly UTF-8 bytes, is it valid or not? It's easy to do, as shown
in this commit. The file test operators -T and -B can be simplified by
using this function.
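The pattern encoded can be sketched like this (an invented illustration, not Perl's actual function): every complete character must be well-formed, and a truncated final character is acceptable as long as its bytes are a legal prefix.

```c
#include <assert.h>
#include <stddef.h>

static size_t char_len(unsigned char b) {   /* 0 = cannot start a char */
    if (b < 0x80) return 1;
    if (b < 0xC2) return 0;
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF5) return 4;
    return 0;
}

/* Valid "as far as it goes": complete characters must be well-formed,
 * and the buffer may end in the middle of one final character. */
int fixed_buf_ok(const char *buf, size_t len) {
    const unsigned char *s = (const unsigned char *)buf;
    const unsigned char *end = s + len;
    while (s < end) {
        size_t need = char_len(*s);
        if (need == 0)
            return 0;                     /* invalid start byte */
        size_t have = (size_t)(end - s);
        size_t check = need < have ? need : have;
        for (size_t i = 1; i < check; i++)
            if ((s[i] & 0xC0) != 0x80)
                return 0;                 /* bad continuation byte */
        if (have < need)
            return 1;                     /* legal partial final character */
        s += need;
    }
    return 1;
}
```

A buffer ending with just the lead byte of a two-byte character passes, while one containing a lead byte followed by a non-continuation byte fails outright.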
|
| |
These functions are all extensions of the is_utf8_string_foo()
functions, that restrict the UTF-8 recognized as valid in various ways.
There are named ones for the two definitions that Unicode makes, and
foo_flags ones for more custom restrictions.
The named ones are implemented as tries, while the flags ones provide
complete generality.
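The two Unicode definitions referred to can be sketched as code-point predicates (illustrative, not the perl API): "strict" UTF-8 excludes surrogates, non-characters, and anything above U+10FFFF, while the Corrigendum #9 variant re-admits non-characters.

```c
#include <assert.h>
#include <stdint.h>

int is_surrogate(uint32_t cp) { return cp >= 0xD800 && cp <= 0xDFFF; }
int is_nonchar(uint32_t cp) {
    return (cp & 0xFFFE) == 0xFFFE          /* U+xxFFFE / U+xxFFFF           */
        || (cp >= 0xFDD0 && cp <= 0xFDEF);  /* the contiguous nonchar block  */
}
int is_above_unicode(uint32_t cp) { return cp > 0x10FFFF; }

/* Unicode's strict definition of interchangeable code points */
int strict_ok(uint32_t cp) {
    return !is_surrogate(cp) && !is_nonchar(cp) && !is_above_unicode(cp);
}

/* Corrigendum #9: non-characters are permitted in open interchange */
int c9strict_ok(uint32_t cp) {
    return !is_surrogate(cp) && !is_above_unicode(cp);
}
```

So U+FFFF fails the strict test but passes the C9 one, while a surrogate or an above-Unicode value fails both.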
|
|
|
|
|
| |
This is a generalization of is_utf8_valid_partial_char to allow the
caller to automatically exclude things such as surrogates.
|
|
|
|
|
|
|
| |
This changes the name of this helper function and adds a parameter and
functionality to allow it to exclude problematic classes of code
points, the same ones excludable by utf8n_to_uvchr(), like surrogates
or non-character code points.
|
|
|
|
|
|
| |
Actually the code isn't quite duplicated, but it should be, because one
instance is wrong. This failure would only show up on EBCDIC platforms.
Tests are coming in a future commit.
|
| |
$ cat > foo
#!/usr/bin/perl
print "What?!\n"
^D
$ chmod +x foo
$ ./perl -Ilib -Te '$ENV{PATH}="."; exec "foo"'
Insecure directory in $ENV{PATH} while running with -T switch at -e line 1.
That is what I expect to see. But:
$ ./perl -Ilib -Te '$ENV{PATH}="/\\:."; exec "foo"'
What?!
Perl is allowing the \ to escape the :, but the \ is not treated as an
escape by the system, allowing a relative path in PATH to be considered
safe.
|
|
|
|
|
|
|
|
|
| |
This new function can test some purported UTF-8 to see if it is
well-formed as far as it goes. That is, there may not be enough bytes
for the final character, but what is there is legal so far. This can
be useful with a fixed-width buffer, where the final character may be
split in the middle, and we want to test, without waiting for the next
read, that the entire buffer is valid.
|
| |
The macro isUTF8_CHAR calls a helper function for code points higher
than it can handle. That function had been an inlined wrapper around
utf8n_to_uvchr().
The function has been rewritten to not call utf8n_to_uvchr(), so it is
now too big to be effectively inlined. Instead, it implements a faster
method of checking the validity of the UTF-8 without having to decode
it. It just checks for valid syntax and now knows where the
few discontinuities are in UTF-8 where overlongs can occur, and uses a
string compare to verify that overflow won't occur.
As a result this is now a pure function.
This also causes a previously generated deprecation warning to no
longer be raised, because when printing UTF-8 it no longer has to be
converted to internal form. I could add a check for that, but I think
it's best not to: if you manipulated what is getting printed in any way,
the deprecation message will already have been raised.
This commit also fleshes out the documentation of isUTF8_CHAR.
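The "discontinuities" idea can be illustrated like this (a simplified sketch; the overflow compare here uses U+10FFFF's encoding, whereas the real concern is the platform's word size): both overlongs and too-large values are detectable from raw bytes, with no decoding.

```c
#include <assert.h>
#include <string.h>

/* Overlongs can only occur at a few spots, all detectable from the
 * first two bytes without computing the code point. */
int is_overlong(unsigned char b0, unsigned char b1) {
    if (b0 == 0xC0 || b0 == 0xC1) return 1;  /* 2-byte form of U+0000..U+007F */
    if (b0 == 0xE0 && b1 < 0xA0) return 1;   /* 3-byte form of < U+0800       */
    if (b0 == 0xF0 && b1 < 0x90) return 1;   /* 4-byte form of < U+10000      */
    return 0;
}

/* A byte-wise compare against the largest legal encoding tells us a
 * 4-byte sequence is too big, again without decoding it. */
int exceeds_max(const char *s) {
    return memcmp(s, "\xF4\x8F\xBF\xBF", 4) > 0;  /* U+10FFFF */
}
```

This is why the rewritten checker can be a pure function: it inspects bytes and compares strings, rather than decoding to a code point and testing the result.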
|
| |
|
|
|
|
|
|
| |
This is clearer as to its meaning than the existing 'is_ascii_string'
and 'is_invariant_string', which are retained for back compat. The
thread context variable is removed as it is not used.
|
|
|
|
|
|
|
|
| |
This function has been in several releases without problem, and is short
enough that some compilers can inline it. This commit also notes that
the result should not be ignored, and removes the unused pTHX. The
function has explicitly been marked as being changeable, and has not
been part of the API until now.
|
| |
When I moved subroutine signature processing into perly.y with
v5.25.3-101-gd3d9da4, I added a new lexer PL_expect state, XSIGVAR.
This indicated, when about to parse a variable, that it was a signature
element rather than a my variable; in particular, it makes ($,...)
be toked as the lone sigil '$' rather than the punctuation variable '$,'.
However this is a bit heavy-handed; so instead this commit adds a
new allowed pseudo-keyword value to PL_in_my: as well as KEY_my, KEY_our and
KEY_state, it can now be KEY_sigvar. This is a less intrusive change
to the lexer.
|
| |
This code is confusing, and confusion has resulted in bugs. So rework
the code a bit to make it more comprehensible.
gv_magicalize no longer has such an arcane return value. What it does
now is simply add appropriate magic to (or do appropriate vivification
in) the GV passed as its argument. It returns true if magic or
vivification was applicable.
The caller (gv_fetchpvn_flags) uses that return value to determine
whether to install the GV in the symbol table, if the caller has
requested that a symbol only be added if it is a magical one
(GV_ADDMG).
This reworking does mean that the GV is now checked for content even
when it would make no difference, but I think the resulting clarity
(ahem, *relative* clarity) of the code is worth it.
|
|
|
|
|
|
| |
All the callers create the SV on the fly. We might as well put the
SV creation into the function itself. (A forthcoming commit will
refactor things to avoid the SV when possible.)
|
| |
There are many built-in variables that perl creates on demand for
efficiency’s sake. gv_fetchpvn_flags (which is responsible for symbol
lookup) will fill in those variables automatically when adding a symbol.
The special GV_ADDMG flag passed to this function by a few code paths
(such as defined *{"..."}) tells gv_fetchpvn_flags to add the symbol,
but only if it is one of the ‘magical’ built-in variables that we
pretend already exist.
To accomplish this, when the GV_ADDMG flag is passed,
gv_fetchpvn_flags, if the symbol does not already exist, creates a new
GV that is not attached to the stash. It then runs it through its
magicalization code and checks afterward to see whether the GV
changed. If it did, then it gets added to the stash. Otherwise, it
is discarded.
Three of the variables, %-, %!, and $], are problematic, in that they
are implemented by external modules. gv_fetchpvn_flags loads those
modules, which tie the variable in question, and then control is
returned to gv_fetchpvn_flags. If it has a GV that has not been
installed in the symbol table yet, then the module will vivify that GV
on its own by a recursive call to gv_fetchpvn_flags (with the GV_ADD
flag, which does none of this temporary-dangling-GV stuff), and
gv_fetchpvn_flags will have a separate one which, when installed,
would clobber the one with the tied variable.
We solved that by having the GV installed right before calling the
module, for those three variables (in perl 5.16).
The implementation changed in commit v5.19.3-437-g930867a, which was
supposed to clean up the code and make it easier to follow.
Unfortunately there was a bug in the implementation. It tries to
install the GV for those cases *before* the magicalization code, but the
logic is wrong. It checks to see whether we are adding only magical
symbols (addmg) and whether the GV has anything in it, but before
anything has been added to the GV. So the symbol never gets installed.
Instead, it just leaks, and the one that the implementing module
vivifies gets used.
This leak can be observed with XS::APItest::sv_count:
$ ./perl -Ilib -MXS::APItest -e 'for (1..10){ defined *{"!"}; delete $::{"!"}; warn sv_count }'
3833 at -e line 1.
4496 at -e line 1.
4500 at -e line 1.
4504 at -e line 1.
4508 at -e line 1.
4512 at -e line 1.
4516 at -e line 1.
4520 at -e line 1.
4524 at -e line 1.
4528 at -e line 1.
Perl 5.18 does not exhibit the leak.
So in this commit I am finally implementing something that was
discussed around the time that v5.19.3-437-g930867a was introduced. To
avoid the whole problem of recursive calls to gv_fetchpvn_flags vying
over whose GV counts, I have stopped the implementing modules from
tying the variables themselves. Instead, whichever gv_fetchpvn_flags
call is trying to create the glob is now responsible for seeing that
the variable is tied after the module is loaded. Each module now
provides a _tie_it function that gv_fetchpvn_flags can call.
One remaining infelicity is that Errno mentions $! in its source, so
*! will be vivified when it is loading, only to be clobbered by the
GV subsequently installed by gv_fetchpvn_flags. But at least it
will not leak.
One test that failed as a result of this (in t/op/magic.t) was trying
to undo the loading of Errno.pm in order to test it afresh with
*{"!"}. But it did not remove *! before the test. The new logic in
the code happens to work in such a way that the tiedness of the
variable determines whether the module needs to be loaded (which is
necessary, now that the module does not tie the variable). Since the
test is by no means normal code, it seems reasonable to change it.
|
| |
Currently subroutine signature parsing emits many small discrete ops
to implement arg handling. This commit replaces them with a couple of ops
per signature element, plus an initial signature check op.
These new ops are added to the OP tree during parsing, so will be visible
to hooks called up to and including peephole optimisation. It is intended
soon that the peephole optimiser will take these per-element ops, and
replace them with a single OP_SIGNATURE op which handles the whole
signature in a single go. So normally these ops won't actually get executed
much. But adding these intermediate-level ops gives three advantages:
1) it allows the parser to efficiently generate subtrees containing
individual signature elements, which can't be done if only OP_SIGNATURE
or discrete ops are available;
2) prior to optimisation, it provides a simple and straightforward
representation of the signature;
3) hooks can mess with the signature OP subtree in ways that make it
no longer possible to optimise into an OP_SIGNATURE, but which can
still be executed, deparsed etc (if less efficiently).
This code:
use feature "signatures";
sub f($a, $, $b = 1, @c) {$a}
under 'perl -MO=Concise,f' now gives:
d <1> leavesub[1 ref] K/REFC,1 ->(end)
- <@> lineseq KP ->d
1 <;> nextstate(main 84 foo:6) v:%,469762048 ->2
2 <+> argcheck(3,1,@) v ->3
3 <;> nextstate(main 81 foo:6) v:%,469762048 ->4
4 <+> argelem(0)[$a:81,84] v/SV ->5
5 <;> nextstate(main 82 foo:6) v:%,469762048 ->6
8 <+> argelem(2)[$b:82,84] vKS/SV ->9
6 <|> argdefelem(other->7)[2] sK ->8
7 <$> const(IV 1) s ->8
9 <;> nextstate(main 83 foo:6) v:%,469762048 ->a
a <+> argelem(3)[@c:83,84] v/AV ->b
- <;> ex-nextstate(main 84 foo:6) v:%,469762048 ->b
b <;> nextstate(main 84 foo:6) v:%,469762048 ->c
c <0> padsv[$a:81,84] s ->d
The argcheck(3,1,@) op knows the number of positional params (3), the
number of optional params (1), and whether it has an array / hash slurpy
element at the end. This op is responsible for checking that @_ contains
the right number of args.
A simple argelem(0)[$a] op does the equivalent of 'my $a = $_[0]'.
Similarly, argelem(3)[@c] is equivalent to 'my @c = @_[3..$#_]'.
If it has a child, it gets its arg from the stack rather than using $_[N].
Currently the only used child is the logop argdefelem.
argdefelem(other->7)[2] is equivalent to '@_ > 2 ? $_[2] : other'.
[ These ops currently assume that the lexical var being introduced
is undef/empty and non-magical etc. This is an incorrect assumption and
is fixed in a few commits' time ]
|
|
|
|
| |
This is principally so that it can be accessed from perly.y too.
|
| |
Currently the signature of a sub (i.e. the '($a, $b = 1)' bit) is parsed
in toke.c using a roll-your-own mini-parser. This commit makes
the signature be part of the general grammar in perly.y instead.
In theory it should still generate the same optree as before, except
that an OP_STUB is no longer appended to each signature optree: it's
unnecessary, and I assume that was a hangover from early development of
the original signature code.
Error messages have changed somewhat: the generic 'Parse error' has
changed to the generic 'syntax error', with the addition of ', near "xyz"'
now appended to each message.
Also, some specific error messages have been added; for example
(@a=1) now says that slurpy params can't have a default value, rather
than just giving 'Parse error'.
It introduces a new lexer expect state, XSIGVAR, since otherwise when
the lexer saw something like '($, ...)' it would see the identifier
'$,' rather than the tokens '$' and ','.
Since it no longer uses parse_termexpr(), it is no longer subject to the
bug (#123010) associated with that; so sub f($x = print, $y) {}
is no longer mis-interpreted as sub f($x = print($_, $y)) {}
|
|
|
|
|
| |
This also involves moving some complicated debugging statements to a
separate function so that it can be called from more than one place.
|
|
|
|
| |
These functions are no longer needed in re_comp.c
|
|
|
|
|
|
| |
It turns out that the changes in
0854ea0b9abfd9ff71c9dca1b5a5765dad2a20bd caused two functions to no
longer be used in re_comp.c
|