| Commit message (Collapse) | Author | Age | Files | Lines |
| |
In certain places in the documentation, "5.20" is no longer applicable.
Also, a message referred to in perldiag got reworded, but our checks did
not catch that perldiag should have been updated.
|
| |
| |
PL_timesbuf is effectively a vestige of Perl 1, and doesn't actually need to
be an interpreter variable. It will be removed early in v5.21.x, but it's a
good idea to refactor the code not to use it before then. A local struct tms
will be on the C stack, which will be in the CPU's L1 cache, whereas the
relevant part of the interpreter struct may well not be in the CPU cache at
all. Therefore this change might reduce cache pressure fractionally. A local
variable access should also be simpler machine code on most CPU architectures.
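A minimal sketch of the kind of refactoring described above, assuming a POSIX system. The helper name cpu_seconds_used is hypothetical and does not come from the Perl source; the point is only that the struct tms is a stack local rather than interpreter state.

```c
#include <assert.h>
#include <sys/times.h>
#include <unistd.h>

/* Hypothetical helper: user + system CPU seconds consumed so far.
 * The struct tms lives on the C stack, not in a global or in the
 * interpreter struct. Returns -1.0 if times() fails. */
double cpu_seconds_used(void)
{
    struct tms buf;                      /* local buffer, replaces a global */
    long ticks = sysconf(_SC_CLK_TCK);   /* clock ticks per second */

    assert(ticks > 0);
    if (times(&buf) == (clock_t)-1)
        return -1.0;
    return (double)(buf.tms_utime + buf.tms_stime) / (double)ticks;
}
```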
|
|
|
|
| |
on the grounds that it's a reasonably hot variable.
|
| |
| |
This large (sorry, I couldn't figure out how to meaningfully split it
up) commit causes Perl to fully support LC_CTYPE operations (case
changing, character classification) in UTF-8 locales.
As a side effect it resolves [perl #56820].
The basics are easy, but there were a lot of details, and one
troublesome edge case discussed below.
What essentially happens is that when the locale is changed to a UTF-8
one, a global variable is set TRUE (FALSE when changed to a non-UTF-8
locale). Within the scope of 'use locale', this variable is checked,
and if TRUE, the code that Perl uses for non-locale behavior is used
instead of the code for locale behavior. Since Perl's internal
representation is UTF-8, we get UTF-8 behavior for a UTF-8 locale.
More work had to be done for regular expressions. There are three
cases.
1) The character classes \w, [[:punct:]] needed no extra work, as
the changes fall out from the base work.
2) Strings that are to be matched case-insensitively. These form
EXACTFL regops (nodes). Notice that if such a string contains only
above-Latin1 characters that match only themselves, the node can be
downgraded to an EXACT-only node, which presents better optimization
possibilities, as we now have a fixed string known at compile time to be
required to be in the target string to match. Similarly if all
characters in the string match only other above-Latin1 characters
case-insensitively, the node can be downgraded to a regular EXACTFU node
(match, folding, using Unicode, not locale, rules). The code changes
for this could be done without accepting UTF-8 locales fully, but there
were edge cases which needed to be handled differently if I stopped
there, so I continued on.
In an EXACTFL node, all such characters are now folded at compile time
(just as before this commit), while the other characters whose folds are
locale-dependent are left unfolded. This means that they have to be
folded at execution time based on the locale in effect at the moment.
Again, this isn't a change from before. The difference is that now some
of the folds that need to be done at execution time (in regexec) are
potentially multi-char. Some of the code in regexec was trivial to
extend to account for this because of existing infrastructure, but the
part dealing with regex quantifiers needed more work.
Also the code that joins EXACTish nodes together had to be expanded to
account for the possibility of multi-character folds within locale
handling. This was fairly easy, because it already has infrastructure
to handle these under somewhat different circumstances.
3) In bracketed character classes, represented by ANYOF nodes, a new
inversion list was created giving the characters that should be matched
by this node when the runtime locale is UTF-8. The list is ignored
except under that circumstance. To do this, I created a new ANYOF type
which has an extra SV for the inversion list.
The edge case that caused the most difficulty is folding involving the
MICRO SIGN, U+00B5. It folds to the GREEK SMALL LETTER MU, as does the
GREEK CAPITAL LETTER MU. The MICRO SIGN is the only 0-255 range
character that folds to outside that range. The issue is that it
doesn't naturally fall out that it will match the CAP MU. If we let the
CAP MU fold to the small mu at compile time (which it can because both
are above-Latin1 and so the fold is the same no matter what locale is in
effect), it could appear that the regnode can be downgraded away from
EXACTFL to EXACTFU, but doing so would cause the MICRO SIGN to not
case-insensitively match the CAP MU. This could be special cased in regcomp
and regexec, but I wanted to avoid that. Instead the mktables tables
are set up to include the CAP MU as a character whose presence forbids
the downgrading, so the special casing is in mktables, and not in the C
code.
|
| |
| |
This global array is no longer used, having been removed in previous
commits in this series.
Since it is a global, consideration need be given to possible uses of it
outside the core. It has never been externally documented, and is an
opaque structure whose internals have changed with every release. The
functions used to access it are almost all static to regcomp.c; those
few that aren't have been hidden from all but the few .c files that need
to have access to them, via #if's.
|
| |
PL_ASCII contains an inversion list to match the ASCII-range code
points. It is unusable outside the core regular expression code because
all the functions that manipulate inversion lists are defined only
within a few core files. Therefore no outside code should be depending
on it.
It turns out that there are arrays of similar inversion lists, and these
all have slots which should have this inversion list in them. This
commit fills them, instead of using PL_ASCII.
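For readers unfamiliar with the data structure, here is a minimal sketch of an inversion list lookup. It is not the core's implementation; the names ascii_invlist and invlist_contains are hypothetical. The sketch assumes the usual convention that the list is a sorted array of code points where each element toggles membership, so the ASCII set is just the two boundaries 0x00 and 0x80.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical inversion list for ASCII: [0x00, 0x80) is in the set. */
const uint32_t ascii_invlist[] = { 0x00, 0x80 };

/* Returns 1 if cp is in the set described by list[0..len), else 0. */
int invlist_contains(const uint32_t *list, size_t len, uint32_t cp)
{
    size_t lo = 0, hi = len;

    assert(list != NULL || len == 0);
    /* Binary search for the number of boundaries that are <= cp. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (list[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* An odd number of boundaries at or below cp means cp is inside. */
    return (int)(lo & 1);
}
```

Calling invlist_contains(ascii_invlist, 2, 'A') reports membership, while 0x80 and above fall outside the set.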
|
|
|
|
|
| |
This is the upper half of the Latin1 range. This simplifies some code
very slightly, but will be of use in future commits.
|
|
|
|
|
| |
Added as an experiment in 462e5cf6, it never quite worked, and
recently wasn't even using registers.
|
| |
Based on Yves's random branch work.
This version makes the new random number visible to external modules,
for example, List::Util's XS shuffle() implementation.
I've also added a 64-bit implementation for when HAS_QUAD is true; this
should be significantly faster, even on 32-bit CPUs. This is intended to
produce exactly the same sequence as the original implementation.
The original version of this commit retained the "freebsd" name from
Yves's original work for the function and data structure names. I've
removed "freebsd" from most function names so the name isn't an issue
if we choose to replace the implementation.
|
| |
This is a partial fix for #119161.
On 64-bit platforms, I32 is too small to hold offsets into a stack
that can grow larger than I32_MAX. What happens is the offsets can
wrap so we end up referencing and modifying elements with negative
indices, corrupting memory, and causing crashes.
With this commit, ()=1..1000000000000 stops crashing immediately.
Instead, it gobbles up all your memory first, and then, if your
computer still survives, crashes. The second crash happens because of
a similar bug with the argument stack, which the next commit will
take care of.
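The wrap-around described above can be modelled in a few lines. This is only an illustration with hypothetical function names, not code from the patch: it shows what a 32-bit signed field retains of a wider offset, using unsigned arithmetic so the sketch itself does not depend on implementation-defined conversions.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if the offset can be stored in a 32-bit signed integer
 * without losing information, 0 otherwise. */
int offset_fits_i32(int64_t offset)
{
    return offset >= INT32_MIN && offset <= INT32_MAX;
}

/* Models what a 32-bit signed field keeps of a wider offset: the low
 * 32 bits, reinterpreted as two's complement. */
int64_t offset_as_seen_through_i32(int64_t offset)
{
    uint32_t low = (uint32_t)((uint64_t)offset & UINT64_C(0xFFFFFFFF));

    return low >= UINT32_C(0x80000000)
        ? (int64_t)low - INT64_C(0x100000000)
        : (int64_t)low;
}
```

An offset just past I32_MAX comes back negative, which is how a stack index ends up pointing before the start of the stack.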
|
| |
PL_hints stores the hints at compile time that get copied into the
cop_hints field of each COP (in newSTATEOP).
Since perl-5.8.0-8053-gd5ec298, COPs have stored all the hints.
Before that, COPs used to store only some of the hints. The hints
were copied here and there into PL_compiling, a static COP-shaped
buffer used during compilation, so that things like constant folding
would see the correct hints. a0ed51b3 back in 1998 did that.
Now that COPs can store all the hints, we can just use
PL_compiling.cop_hints to avoid having to copy them from PL_hints from
time to time.
This simplifies the code and avoids creating bugs like those that
a547fd219 and 1c75beb82 fixed.
|
| |
This reverts commit c82ecf346.
It turns out to be faulty, because a location shared between threads
(the cop) was holding a reference count on a pad entry in a particular
thread. So when you free the cop, how do you know where to do
SvREFCNT_dec?
In reverting c82ecf346, this commit still preserves the bug fix from
1311cfc0a7b, but shifts it around.
|
| |
This saves having to allocate a separate string buffer for every cop
(control op; every statement has one).
Under non-threaded builds, every cop has a pointer to the GV for that
source file, namely *{"_<filename"}.
Under threaded builds, the name of the GV used to be stored instead.
Now we store an offset into the per-interpreter PL_filegvpad, which
points to the GV.
This makes no significant speed difference, but it reduces memory
usage.
|
| |
This produces a report on the number of OPs of a given type that were
executed at the end of a program run. This can be useful in multiple
ways. One, it can help determine hotspots for optimization (yes, I know
execution count does not equal execution time). It can also help with
determining whether a given change to perl has had the desired effect on
deterministic programs.
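The idea can be sketched as a table of counters indexed by op type, bumped in the run loop and dumped at exit. Everything named here (the enum, the counter array, the functions) is hypothetical and only illustrates the technique, not how this commit wires it into perl.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical op types standing in for the real opcode list. */
enum demo_op { DEMO_OP_CONST, DEMO_OP_ADD, DEMO_OP_PRINT, DEMO_OP_COUNT };

const char *const demo_op_names[DEMO_OP_COUNT] = { "const", "add", "print" };
unsigned long demo_op_exec_count[DEMO_OP_COUNT];   /* zero-initialized */

/* Call once each time an op of the given type is executed. */
void note_op_executed(enum demo_op type)
{
    assert(type >= 0 && type < DEMO_OP_COUNT);
    demo_op_exec_count[type]++;
}

/* Read back one counter. */
unsigned long op_exec_count(enum demo_op type)
{
    assert(type >= 0 && type < DEMO_OP_COUNT);
    return demo_op_exec_count[type];
}

/* Print every op type that ran at least once, one per line. */
void report_op_counts(FILE *out)
{
    int i;

    for (i = 0; i < DEMO_OP_COUNT; i++)
        if (demo_op_exec_count[i] != 0)
            fprintf(out, "%-8s %lu\n", demo_op_names[i], demo_op_exec_count[i]);
}
```

Comparing two such reports before and after a change is what makes the counts useful for deterministic programs.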
|
| |
SV_CONST(XXX) returns an SV* that contains the string "XXX".
SVs are built on demand and stored in the interpreter's structure
for re-use. All SVs have a precomputed hash value.
SVs are created on demand because we don't want 35 SVs created during
compile time or cloned during thread creation.
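The build-on-demand pattern can be sketched without any perl internals. In this illustration a plain struct stands in for the SV, FNV-1a stands in for perl's hash function, and all names are hypothetical; only the laziness and the precomputed hash mirror the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical constant names standing in for the real list. */
enum { DEMO_CONST_FETCH, DEMO_CONST_STORE, DEMO_CONST_DELETE, DEMO_CONST_COUNT };

const char *const demo_const_names[DEMO_CONST_COUNT] = { "FETCH", "STORE", "DELETE" };

/* Stand-in for an SV: the string, its length and a precomputed hash. */
typedef struct {
    const char *str;
    size_t      len;
    uint32_t    hash;
    int         built;   /* 0 until first requested */
} cached_const;

/* Stand-in for the per-interpreter storage; must start zeroed. */
typedef struct { cached_const slot[DEMO_CONST_COUNT]; } interp_consts;

/* FNV-1a, used here only as a placeholder hash function. */
uint32_t demo_hash(const char *s, size_t len)
{
    uint32_t h = UINT32_C(2166136261);
    size_t i;

    for (i = 0; i < len; i++) {
        h ^= (unsigned char)s[i];
        h *= UINT32_C(16777619);
    }
    return h;
}

/* Returns the cached constant, building it on first use. */
const cached_const *get_cached_const(interp_consts *in, int idx)
{
    cached_const *c;

    assert(in != NULL && idx >= 0 && idx < DEMO_CONST_COUNT);
    c = &in->slot[idx];
    if (!c->built) {                    /* first request: build it now */
        c->str   = demo_const_names[idx];
        c->len   = strlen(c->str);
        c->hash  = demo_hash(c->str, c->len);
        c->built = 1;
    }
    return c;
}
```

Slots that are never requested stay empty, so nothing is paid for them at startup or when the storage is copied.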
|
| |
This global (per-interpreter) var is just used during regex compilation as
a placeholder to point RExC_emit at during the first (non-emitting) pass,
to indicate that nothing should be emitted. There's no need for it to be a global
var: just add it as an extra field in the RExC_state_t struct instead.
|
| |
This is a struct that holds all the global state of the current regex
match.
The previous set of commits have gradually removed all the fields of this
struct (by making things local rather than global state). Since the struct
is now empty, the PL_reg_state var can be removed, along with the
SAVEt_RE_STATE save type which was used to save and restore those fields
on recursive re-entry to the regex engine.
|
| |
Currently PL_reg_curpm is actually #deffed to a field within PL_reg_state;
promote it into a fully autonomous perl-interpreter variable.
PL_reg_curpm points to a fake PMOP that's used to temporarily point
PL_curpm to, that we can hang the current regex off, so that this works:
"a" =~ /^(.)(?{ print $1 })/ # prints 'a'
It turns out that it doesn't need to be saved and restored when we
recursively enter the regex engine; that is already handled by saving and
restoring which regex is currently attached to PL_reg_curpm.
So we just need a single global (per interpreter) placeholder.
Since we're shortly going to get rid of PL_reg_state, we need to move it
out of that struct.
|
| |
Adds support for PERL_PERTURB_KEYS environment variable, which in turn allows one to control
the level of randomization applied to keys() and friends.
When PERL_PERTURB_KEYS is 0 we will not randomize key order at all. The
chance that keys() changes due to an insert will be the same as in
previous perls, basically only when the bucket size is changed.
When PERL_PERTURB_KEYS is 1 we will randomize keys in a non-repeatable
way. The chance that keys() changes due to an insert will be very high.
This is the most secure and default mode.
When PERL_PERTURB_KEYS is 2 we will randomize keys in a repeatable way.
Repeated runs of the same program should produce the same output every
time. The chance that keys() changes due to an insert will be very high.
This patch also makes PERL_HASH_SEED imply a non-default
PERL_PERTURB_KEYS setting. Setting PERL_HASH_SEED=0 (exactly one 0) implies
PERL_PERTURB_KEYS=0 (hash key randomization disabled); setting PERL_HASH_SEED
to any other value implies PERL_PERTURB_KEYS=2 (deterministic/repeatable
hash key randomization). Specifying PERL_PERTURB_KEYS explicitly to a
different level overrides this behavior.
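The precedence rules above can be summarized as a small decision function. This is a sketch with hypothetical names, not the code in the patch; it accepts only the numeric levels 0, 1 and 2 mentioned here, and treating any other PERL_PERTURB_KEYS value as if the variable were unset is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum perturb_mode {
    PERTURB_NONE          = 0,   /* no key order randomization */
    PERTURB_RANDOM        = 1,   /* non-repeatable, the default */
    PERTURB_DETERMINISTIC = 2    /* repeatable */
};

/* hash_seed and perturb_keys are the raw environment strings, or NULL
 * when the corresponding variable is not set. */
enum perturb_mode resolve_perturb_mode(const char *hash_seed,
                                       const char *perturb_keys)
{
    /* An explicit PERL_PERTURB_KEYS always wins. */
    if (perturb_keys != NULL) {
        if (strcmp(perturb_keys, "0") == 0) return PERTURB_NONE;
        if (strcmp(perturb_keys, "1") == 0) return PERTURB_RANDOM;
        if (strcmp(perturb_keys, "2") == 0) return PERTURB_DETERMINISTIC;
        /* Assumption: unrecognized values fall through as if unset. */
    }
    /* Otherwise PERL_HASH_SEED implies a non-default mode:
     * exactly "0" disables, anything else is repeatable. */
    if (hash_seed != NULL)
        return strcmp(hash_seed, "0") == 0 ? PERTURB_NONE
                                           : PERTURB_DETERMINISTIC;
    return PERTURB_RANDOM;
}
```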
Includes changes to allow one to compile out various aspects of the
patch. One can compile such that PERL_PERTURB_KEYS is not respected, or
can compile without hash key traversal randomization at all. Note that
support for these modes is incomplete, and currently a few tests will
fail.
Also includes a new subroutine, Hash::Util::hash_traversal_mask(),
which can be used to ensure a given hash produces a predictable key
order (assuming the same hash seed is in effect). This sub acts as a
getter and a setter.
NOTE - this patch lacks tests, but I lack tuits to get them done quickly,
so I am pushing this with the hope that others can add them afterwards.
|
|
|
|
| |
Commit 19bc2726ec6be805 created 32 bytes of holes (on LP64 systems).
|
| |
Adds:
S_ptr_hash() - A new static function in hv.c which can be used to
hash a pointer or integer.
PL_hash_rand_bits - A new interpreter variable used as a cheap
provider of "semi-random" state for use by the hash infrastructure.
xpvhv_aux.xhv_rand - Used as a mask which is xored against the
xpvhv_aux.riter during iteration to randomize the order the actual
buckets are visited.
PL_hash_rand_bits is initialized at interpreter start from the random
hash seed, and then modified by "mixing in" the result of ptr_hash()
on the bucket array pointer in the hv (HvARRAY(hv)) every time
hv_auxinit() allocates a new iterator structure.
The net result is that every hash has its own iteration order, which
should make it much more difficult to determine what the current hash
seed is.
This required some tests to be restructured, as they tested for something
that was not necessarily true: we never guaranteed that two hashes with
the same keys would produce the same key order, we merely promised that
using keys(), values(), or each() on the same hash, without any
insertions in between, would produce the same order of visiting the
key/values.
|
| |
Move more of the more commonly-used PL_ variables towards the front of the
file (and thus to the top of the interpreter struct on MULTIPLICITY
builds).
This helps ensure that "hot" variables are clustered together on the same
small number of cache lines, and also that the machine code to load them
will have shorter offsets, which on some architectures may be achieved
with shorter instructions.
The "hotness" has been determined purely by my subjective judgement rather
than any profiling. It's still open for the latter to be done.
(Only simple shunting of whole lines has been done; no changes have been
made to individual lines.)
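The effect of field order on cache-line sharing can be checked with offsetof. The two structs below are hypothetical stand-ins for an interpreter struct, and the 64-byte line size and line-aligned struct base are assumptions that hold on many, but not all, machines.

```c
#include <assert.h>
#include <stddef.h>

#define ASSUMED_CACHE_LINE 64   /* assumption: 64-byte cache lines */

/* Hot fields pushed behind a large, rarely used buffer. */
struct cold_first_interp {
    char  cold_buffer[256];
    void *stack_sp;
    void *cur_op;
};

/* The same fields with the hot ones moved to the front. */
struct hot_first_interp {
    void *stack_sp;
    void *cur_op;
    char  cold_buffer[256];
};

/* Returns 1 if two member offsets land on the same cache line,
 * assuming the struct itself starts on a line boundary. */
int same_cache_line(size_t off_a, size_t off_b)
{
    return off_a / ASSUMED_CACHE_LINE == off_b / ASSUMED_CACHE_LINE;
}
```

In hot_first_interp both hot members sit at small offsets on the first line; in cold_first_interp they start at offset 256 or later.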
|
| |
This used to keep track of all objects. By now there is no
particularly good reason for that: it merely avoided a bit of
work during global destruction if no objects remained.
Let's do less work at run-time instead.
The interpreter global will remain for one deprecation cycle.
|
| |
It may be unlikely that a Perl program will hit 2 billion SVs, but by
the time that 5.18 is ancient history, it's looking a lot more likely.
This makes two global counters use native-size ints.
I'm preserving signedness just for hysterical raisins: It might be
deliberate.
|
| |
Holes were created by commit f59909ab8dad6ceb (April 2012) which removed
PL_reginterp_cnt, commit 7dc8663964c66a69 (Nov 2012) which removed
PL_rehash_seed_set, and commit 8936b48a49448f4e (Dec 2012) which removed
PL_glob_index.
There is still an unavoidable U16 sized hole on the default threaded
configuration on x86_64. (U8 if PERL_SAWAMPERSAND is defined).
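How removing or reordering members creates and closes such holes can be illustrated with two hypothetical structs (not the interpreter's real layout). The padding is measured as sizeof minus the sum of the member sizes, so the exact numbers depend on the ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small members interleaved with wide ones: the compiler must pad
 * after each small member to align the next wide one. */
struct with_holes {
    uint8_t  flag_a;
    uint64_t counter;
    uint8_t  flag_b;
    uint32_t index;
};

/* Same members sorted widest first: only tail padding can remain. */
struct repacked {
    uint64_t counter;
    uint32_t index;
    uint8_t  flag_a;
    uint8_t  flag_b;
};

/* Sum of the member sizes, identical for both layouts. */
#define MEMBER_BYTES (sizeof(uint8_t) * 2 + sizeof(uint64_t) + sizeof(uint32_t))

size_t padding_with_holes(void) { return sizeof(struct with_holes) - MEMBER_BYTES; }
size_t padding_repacked(void)   { return sizeof(struct repacked) - MEMBER_BYTES; }
```

The leftover tail padding in the repacked struct is the same kind of unavoidable hole the message mentions: the members simply do not add up to a multiple of the alignment.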
|
| |
/[[:upper:]]/i and /[[:lower:]]/i should match the Unicode property
\p{Cased}. This commit introduces a pseudo-Posix class, internally named
'cased', to represent this. This class isn't specifiable by the user,
except through using either /[[:upper:]]/i or /[[:lower:]]/i. Debug
output will say ':cased:'.
The regex parsing either of :lower: or :upper: will change them into
:cased:, where already existing logic can handle this, just like any
other class.
This commit fixes the regression introduced in
3018b823898645e44b8c37c70ac5c6302b031381, and also fixes the fact that
these have never worked under 'use locale'. The next commit will un-TODO
the tests for these things.
|
| |
This also changes isIDCONT_utf8() to use the Perl definition, which
excludes any \W characters (the Unicode definition includes a few of
these). Tests are also added. These macros remain undocumented for
now.
|
|
|
|
|
| |
Previous commits have placed some inversion list pointers into arrays.
This commit extends that to another group of inversion lists.
|
|
|
|
|
| |
An earlier commit placed some inversion list pointers into an array.
This commit extends that to another group of inversion lists.
|
|
|
|
|
|
| |
This patch creates an array pointing to the inversion lists that cover
the Latin-1 ranges for Posix character classes, and uses it instead of
the individual variables previously referred to.
|
| |
|