| Commit message | Author | Age | Files | Lines |
| |
Better to not have this clutter.
| |
The mask removed here was to make sure that right shifting didn't
propagate the sign bit, but is unnecessary as the value shifted is
unsigned. And confining things to a U8 with that mask assumes that the
bit vector being operated on has 256 elements max. This isn't
necessarily true these days, as one can change ANYOF_BITMAP_SIZE.
In fact changing that number was failing until this commit.
It also adds white space to make it easier to read.
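As a rough illustration of the point about the mask (a sketch only, not the actual regcomp.h macros; every `sketch_` name and the 512-bit size are assumptions made for this example), a bit-vector lookup on an unsigned index needs no mask after the shift, and works for vectors larger than 256 bits:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for ANYOF_BITMAP_SIZE; any multiple of 8 works. */
#define SKETCH_BITMAP_SIZE 512u

typedef struct {
    uint8_t bits[SKETCH_BITMAP_SIZE / 8];
} sketch_bitmap;

/* Byte that holds the bit for code point c.  Because c is unsigned, the
 * right shift cannot propagate a sign bit, so no mask is applied to the
 * result.  Narrowing c to 8 bits first would fold code points >= 256 back
 * onto the first 256 bits. */
static uint8_t *sketch_byte(sketch_bitmap *bm, unsigned c)
{
    assert(c < SKETCH_BITMAP_SIZE);
    return &bm->bits[c >> 3];
}

/* Turn on the bit for code point c. */
static void sketch_set(sketch_bitmap *bm, unsigned c)
{
    *sketch_byte(bm, c) |= (uint8_t)(1u << (c & 7));
}

/* Return 1 if the bit for code point c is on, else 0. */
static int sketch_test(sketch_bitmap *bm, unsigned c)
{
    return (*sketch_byte(bm, c) >> (c & 7)) & 1;
}
```

With a zeroed bitmap, setting code point 300 and then testing 300 and 44 (300 minus 256) gives different answers, which is exactly what an 8-bit truncation would get wrong.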
| |
Instead of having this code repeated in several places, call
the more base macro from the others.
| |
I've long been confronted with trying to do things to create a spare bit
to use. I thought it easier now, while it's fresh in my mind, to free
up one for future use, rather than re-learn things when it next becomes
necessary. It would have been a different story if the freed bit had
required a performance penalty.
This commit also updates the comments about how to create even more
spare bits should it become necessary.
| |
Some of the names are expanded slightly and not shortened
| |
This commit sets a flag at pattern compilation time to indicate if
a rare case is present that requires special handling, so that that
handling can be avoided unless necessary.
| |
This changes the spare bit to be adjacent to the LOC_FOLD bit, in
preparation for the next commit, which will use that bit for a
LOC_FOLD-related use.
| |
This is done by combining 2 mutually exclusive bits into one. I hadn't
seen this possibility before because the name of one of them misled me.
It also misled me into turning on that flag unnecessarily, and into
missing opportunities to avoid creating a swash at runtime. This
commit corrects those things as well.
| |
This places the flag bits of like-type flags adjacent for convenience in
reading the code. It also improves the commentary about their purposes.
| |
Since commit a0bd1a30d379f2625c307657d63fc50173d7a56d, a synthetic start
class node can be just an ANYOF-type node. I don't think this causes a
bug, just misses a potential optimisation.
| |
Previously use of this under /l regex rules was a compile time error.
Now it works like \b{wb} and \b{sb}, which compile under locale rules
and always work like Unicode says they should. A UTF-8 locale implies
Unicode rules, and the goal is for it to work seamlessly with the rest
of perl. This construct was the only one I am aware of that didn't work
seamlessly (not counting OS interfaces) under UTF-8 LC_CTYPE locales.
For all three of these constructs, use with a non-UTF-8 runtime locale
raises a warning, and Unicode rules are used anyway.
UTF-8 locale collation still has problems, but this is low priority to
fix, as it's a lot of work, and if one really cares, one should be using
Unicode::Collate.
| |
The ANYOF_FLAGS bits are all used up, but a future commit wants one.
This commit frees up a bit by sharing two of the existing
comparatively-rarely-used ones. One bit is used only under /d matching
rules, while the other is used only when not under /d. Only the latter
bit is used in synthetic start classes. The previous commit introduced
an ANYOFD node type corresponding to /d. An SSC never is this type.
Thus, the bits have mutually exclusive meanings, and we can use the node
type to distinguish between the two meanings of the combined bit.
An alternative implementation would have been to use the
ANYOF_HAS_NONBITMAP_NON_UTF8_MATCHES non-/d bit instead of the one
chosen. But this is used more frequently, so the disambiguation would
have been exercised more frequently, slowing execution down ever so
slightly; more importantly, this one required fewer code changes, by a
slight amount.
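The idea of one flag bit carrying two mutually exclusive meanings, selected by node type, might look roughly like the following. This is a sketch under assumptions: the `sketch_` names, the enum, and the bit value are invented for illustration and are not the identifiers used in regcomp.h.

```c
#include <stdint.h>

/* Hypothetical node types: a regular class node and its /d-only sibling. */
enum sketch_op { SKETCH_ANYOF, SKETCH_ANYOFD };

/* One physical bit, two meanings (the value is an arbitrary assumption). */
#define SKETCH_SHARED_BIT 0x40u

typedef struct {
    enum sketch_op op;
    uint8_t        flags;
} sketch_node;

/* First meaning: only consulted for /d nodes. */
static int sketch_d_meaning(const sketch_node *n)
{
    return n->op == SKETCH_ANYOFD && (n->flags & SKETCH_SHARED_BIT) != 0;
}

/* Second meaning: only consulted for non-/d nodes, which in this sketch
 * includes a synthetic start class, since that is never the /d type. */
static int sketch_non_d_meaning(const sketch_node *n)
{
    return n->op != SKETCH_ANYOFD && (n->flags & SKETCH_SHARED_BIT) != 0;
}
```

The extra comparison against the node type is the disambiguation cost mentioned above, which is why sharing a rarely consulted bit is cheaper than sharing a frequently consulted one.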
| |
This horrible thing broke encapsulation and was as buggy as a very buggy
thing. It's been officially deprecated since 5.20.0 and now it can finally
die die die!!!!
| |
An empty cpan/.dir-locals.el stops Emacs using the core defaults for
code imported from CPAN.
Committer's work:
To keep t/porting/cmp_version.t and t/porting/utils.t happy, $VERSION needed
to be incremented in many files, including throughout dist/PathTools.
perldelta entry for module updates.
Add two Emacs control files to MANIFEST; re-sort MANIFEST.
For: RT #124119.
| |
This was experimentally introduced in 5.18, and no issues were raised,
except that it got us to thinking and spurred us to stop allowing $^X,
where 'X' is a non-printable control character, and that change caused
some issues.
| |
A function implements seeing if the space between any two characters is
a grapheme cluster break. After I wrote this, I realized that an array
lookup might be a better implementation, but the deadline for v5.22 was
too close to change it. I did see that my gcc optimized it down to
an array lookup.
This makes the implementation of \X go from being complicated to
trivial.
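The array-lookup alternative mentioned above could be sketched as follows. This is deliberately a simplified subset, not perl's actual table: it covers only five break classes (Other, CR, LF, Control, Extend) and the corresponding pairwise rules from Unicode's grapheme cluster boundary algorithm. The full algorithm has more classes, and all `sketch_` and `GCB_` names here are assumptions.

```c
#include <stdbool.h>

/* Simplified subset of grapheme-cluster-break classes (illustrative only). */
enum sketch_gcb { GCB_OTHER, GCB_CR, GCB_LF, GCB_CONTROL, GCB_EXTEND, GCB_COUNT };

/* sketch_gcb_break[before][after] is true when a break is allowed between
 * a character of class `before` and a following one of class `after`.
 *   - no break between CR and LF
 *   - always break after and before CR, LF, Control otherwise
 *   - no break before Extend (unless preceded by CR, LF, or Control)
 *   - break everywhere else */
static const bool sketch_gcb_break[GCB_COUNT][GCB_COUNT] = {
    /* after:        Other  CR     LF     Control Extend */
    /* Other   */ { true,  true,  true,  true,   false },
    /* CR      */ { true,  true,  false, true,   true  },
    /* LF      */ { true,  true,  true,  true,   true  },
    /* Control */ { true,  true,  true,  true,   true  },
    /* Extend  */ { true,  true,  true,  true,   false },
};

/* The whole "is there a break here?" question becomes one table read. */
static bool sketch_is_gcb(enum sketch_gcb before, enum sketch_gcb after)
{
    return sketch_gcb_break[before][after];
}
```

With a predicate like this, matching \X reduces to advancing through the string until the predicate reports a break between the current pair of characters.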
| |
Extracted from patch submitted by Lajos Veres in RT #123693.
| |
These will be used in a future commit to distinguish between /l patterns
vs non-/l.
| |
This adds a function to allocate a regnode with 2 32-bit arguments, and
uses it, rather than the ad-hoc code that did the same thing previously.
This is in preparation for this code being used in a 2nd place in a
future commit.
| |
These internal definitions are no longer used.
| |
Add a new re debug mode for outputting stuff useful for testing.
In this case we count the number of times that we go through
study_chunk. With a51d618a we should do so 5 times (or less) when
we traverse the test pattern. Without a51d618a we recurse 11
times. In the case of RT #122283 we would do gazillions of
recursions, so many that I never let it run to completion.
    /
    (?(DEFINE)(?<foo>foo))
    (?(DEFINE)(?<bar>(?&foo)bar))
    (?(DEFINE)(?<baz>(?&bar)baz))
    (?(DEFINE)(?<bop>(?&baz)bop))
    /x
I say "or less" because you could argue that since these defines are
never called, we should not actually recurse at all, and should maybe
just compile this as a simple empty pattern.
| |
In 075abff3 Andy Lester set the flags field of regops
to default to 0xde. I find this really weird, and possibly dangerous,
as it seems to me reasonable to assume a new regop would have this
field set to 0, so that later on code can set it to something else
if necessary. (Which is what I wanted to do.)
Since nothing breaks if I set it to 0x0 and I find that to be a much
more natural default than 0xde (the prefix of 0xdeadbeef), I am
changing this to set it to 0.
| |
See also perl5porters thread titled: "Perl MBOLism in regex engine"
In the perl 5.000 release (a0d0e21ea6ea90a22318550944fe6cb09ae10cda)
the BOL regop was split into two behaviours MBOL and SBOL, with SBOL
and BOL behaving identically. Similarly the EOL regop was split into
two behaviors SEOL and MEOL, with EOL and SEOL behaving identically.
This then resulted in various duplicative code related to flags and
case statements in various parts of the regex engine.
It appears that perhaps BOL and EOL were kept because they are the
type ("regkind") for SBOL/MBOL and SEOL/MEOL/EOS. Reworking regcomp.pl
to handle aliases for the type data so that SBOL/MBOL are of type
BOL, even though BOL == SBOL seems to cover that case without adding
to the confusion.
This means two regops, a regstate, and an internal regex flag can
be removed (and used for other things), and various logic relating
to them can be removed.
For the uninitiated, SBOL is /^/ and /\A/ (with or without /m) and
MBOL is /^/m. (I consider it a fail we have no way to say MBOL without
the /m modifier). Similarly SEOL is /$/ and MEOL is /$/m (there is
also a /\z/ which is EOS "end of string" with or without the /m).
| |
This commit allows Perl to be compiled with a bitmap size that is larger
than 256. This bitmap is used to directly look up whether a character
matches or not, without having to do a binary search or hash lookup. It
might improve the performance for some installations that have a lot of
use of scripts that are above the Latin1 range.
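To make the trade-off concrete, here is a minimal sketch of a class lookup that answers directly from a bitmap for low code points and falls back to a binary search above it. The 1024-bit size, the sorted-array fallback, and all `sketch_` names are assumptions for illustration; perl's real out-of-bitmap data structure is not shown here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical enlarged bitmap size (the stock value discussed is 256). */
#define SKETCH_BITMAP_SIZE 1024u

typedef struct {
    uint8_t         bitmap[SKETCH_BITMAP_SIZE / 8];
    const uint32_t *above;      /* sorted code points >= SKETCH_BITMAP_SIZE */
    size_t          above_len;
} sketch_class;

/* Return 1 if cp is in the class, else 0. */
static int sketch_matches(const sketch_class *cls, uint32_t cp)
{
    /* Fast path: one index, one shift, one mask. */
    if (cp < SKETCH_BITMAP_SIZE)
        return (cls->bitmap[cp >> 3] >> (cp & 7)) & 1;

    /* Slow path: binary search over the sorted out-of-bitmap list. */
    size_t lo = 0, hi = cls->above_len;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cls->above[mid] == cp)
            return 1;
        if (cls->above[mid] < cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}
```

With a 1024-bit bitmap, a Greek letter such as U+03B1 is answered on the fast path, whereas with 256 bits it would take the slow path. The cost is a larger node for every bracketed class.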
| |
These are renamed to be more clear as to their actual meanings. I know
other people have been confused by their former names.
Some of the name changes will become more important as future commits
will allow the bitmap in a bracketed character class to be a different
size.
| |
This is an internal header, so can change names within it.
| |
This prevents a signed result if this macro ever gets used in a U8.
The ANYOF_BITMAP_TEST macro must now be cast or it would generate warnings
when compiled with -DPERL_BOOL_AS_CHAR
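A hedged sketch of why the cast matters when the boolean type is a plain char: the masked value can be as large as 0x80, which is not a clean 0 or 1 and may not even fit in a signed char. The macro names and the `sketch_bool` typedef below are assumptions, not the real definitions.

```c
#include <stdint.h>

/* Hypothetical stand-in for a build where the boolean type is a char. */
typedef char sketch_bool;

/* Unsigned constant, so the bit value is never a signed int. */
#define SKETCH_BIT(c)         (1U << ((c) & 7))
#define SKETCH_BYTE(bits, c)  ((bits)[(c) >> 3])

/* Normalize to exactly 0 or 1 before narrowing to the char-sized bool. */
#define SKETCH_TEST(bits, c) \
    ((sketch_bool)((SKETCH_BYTE(bits, c) & SKETCH_BIT(c)) != 0))
```

Without the `!= 0` normalization and cast, testing bit 7 would yield 128 rather than 1, which is the kind of implicit narrowing a compiler warns about.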
| |
Too many negations led to this.
| |
ANYOF nodes (for bracketed character classes) currently are for code
points 0-255. This is the first step toward eventually making that size
configurable. This also renames a static function, as the domain may
not necessarily be 'latin1'.
| |
This just sets the ptr field in the Synthetic Start Class that will be
passed to regexec.c to NULL, and clarifies the comments in regcomp.h. See
the thread starting at http://markmail.org/message/2txwaqnjco6zodeo
| |
I believe this will fix the remaining alignment problems recently being
shown on gcc on HP-UX. It works on the procura machine.
regnodes should not have stricter alignment than required by U32, for
reasons given in the comments this commit adds to the beginning of
regcomp.h. Commit 31f05a37 added a new ANYOF regnode struct with a
pointer field. This requires stricter alignment on some 64-bit platforms,
and hence doesn't work on those platforms.
This commit removes that regnode struct type, and instead stores the
pointer it used via a more indirect, but already existing mechanism
that stores other data.
The function that returns that other data is enlarged to return this new
field as well. It now needs to be called from regcomp.c, so the
previous commit had renamed and made it accessible from there. The
"public" function that wraps this one is unchanged. (I put "public" in
quotes here, because I don't think anyone outside core is or should be
using it, but since it has been publicly available for a long time, I'm
treating the API as unchangeable. regcomp.c called this public function
before this commit, but needs the additional data returned by the inner
one).
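The alignment issue can be illustrated with a small C11 sketch. The struct layouts, field names, and the side-table workaround below are assumptions made for this example. They only mirror the general shape of the problem (a pointer member raises the struct's alignment above that of a 32-bit integer on many 64-bit platforms) and one generic way around it (store an index and keep the pointer elsewhere).

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical node with only 32-bit-or-smaller members. */
struct sketch_node_u32 {
    uint8_t  flags;
    uint8_t  type;
    uint16_t next_off;
    uint32_t arg1;
};

/* Hypothetical node with an embedded pointer: on typical 64-bit
 * platforms this needs 8-byte alignment. */
struct sketch_node_ptr {
    uint8_t  flags;
    uint8_t  type;
    uint16_t next_off;
    void    *ptr;
};

/* The first layout can live anywhere in a stream of 32-bit words. */
_Static_assert(alignof(struct sketch_node_u32) <= alignof(uint32_t),
               "node must not be stricter than a 32-bit word");

/* Assumed workaround: keep pointers in a side table and store only the
 * 32-bit slot index in the node. */
#define SKETCH_MAX_SLOTS 8u

typedef struct {
    void    *slots[SKETCH_MAX_SLOTS];
    uint32_t used;
} sketch_side_table;

/* Store p in the side table and return the index to put in the node. */
static uint32_t sketch_stash(sketch_side_table *t, void *p)
{
    assert(t->used < SKETCH_MAX_SLOTS);
    t->slots[t->used] = p;
    return t->used++;
}
```

A node then carries `arg1 = sketch_stash(&table, ptr)` and the consumer reads `table.slots[node.arg1]`, at the price of one extra indirection.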
| |
Instead of doing the calculation of how many bytes a 256-bit bitmap
occupies by hand, let the compiler do it. I believe we are not too far away
from having the ability to allow applications to recompile Perl to
increase the bitmap size, trading memory for speed. ICU has an 8192-bit
bitmap last time I checked.
| |
For the last several releases, the fact that an ANYOF node could match
something outside its bitmap has been passed to regexec.c by having its
ARG field not be -1 (appropriately cast). A bit was set if the match
could occur even if the target string was not UTF-8 encoded. This
design was used to save a bit, as previously there was a bit also for it
matching UTF-8 strings.
That design is no longer tenable, as a future commit will have a third
(independent) reason for something to match outside the bitmap. This
commit uses the current spare bit flag to indicate if the match can
only occur if the target string is UTF-8.
| |
This is obsolete and is a partial copy of the up-to-date comment below
it.
| |
The ANYOF_LOC bit was removed from final use in the previous commit.
| |
This flag no longer adds any useful information and can be removed. An
ANYOF node that depends on locale either matches a POSIX class like \d,
or matches case insensitively, or both. There are flags for both these
cases, and to see if something matches locale, one merely needs to see
if either flag is set.
Not having to keep track of this extra flag simplifies things, and will
allow it to be removed. There was a time when this flag was shared with
one of the remaining locale ones, and there was relict code that allowed
that sharing to be reinstated, and which this commit also removes.
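The reasoning above, that "is locale-dependent" can be derived from two existing flags rather than stored in a third, might be sketched like this. The bit values and `SKETCH_` names are assumptions. Only the idea of combining the POSIX-class flag and the fold flag comes from the text.

```c
#include <stdint.h>

/* Hypothetical bit assignments for the two remaining locale flags. */
#define SKETCH_POSIXL   0x01u  /* matches locale-dependent POSIX classes */
#define SKETCH_LOC_FOLD 0x02u  /* matches case-insensitively under locale */
#define SKETCH_OTHER    0x04u  /* some unrelated flag */

/* A node depends on locale exactly when either locale flag is set, so no
 * separate "is locale" bit has to be kept in sync. */
#define SKETCH_IS_LOCALE(flags) \
    (((flags) & (SKETCH_POSIXL | SKETCH_LOC_FOLD)) != 0)
```

Deriving the property removes the possibility of the third flag disagreeing with the other two, and frees its bit.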
| |
The ANYOF_POSIXL flag is needed in general for ANYOF nodes to indicate
if the struct contains an extra U32 element used to hold the list of
POSIX classes (like \w and [:punct:]) whose matches depend on the locale
in effect at the time of runtime pattern matching.
But the SSC always contains this U32, and so doesn't need to use the
flag. Instead, if there aren't any such classes, the U32 will be zero.
Removing keeping track of this flag during the assembly of the SSC
simplifies things. At the completion of this process, this flag is
set if the U32 is non-zero to pass that information on to regexec.c so
that it doesn't have to special case things.
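A minimal sketch of that scheme, assuming an invented struct and invented function names: while the start class is being assembled only the 32-bit class set is updated, and the flag is derived from it once, at the end.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bit; the real value is not taken from regcomp.h. */
#define SKETCH_POSIXL 0x01u

/* Hypothetical synthetic start class: it always carries the class set. */
typedef struct {
    uint8_t  flags;
    uint32_t posix_classes;  /* one bit per locale-dependent POSIX class */
} sketch_ssc;

/* During assembly: record a class, do not touch the flag. */
static void sketch_ssc_add_class(sketch_ssc *s, unsigned class_num)
{
    assert(class_num < 32);
    s->posix_classes |= (uint32_t)1 << class_num;
}

/* At completion: derive the flag once so the matcher can treat the
 * start class like any other class node. */
static void sketch_ssc_finish(sketch_ssc *s)
{
    if (s->posix_classes != 0)
        s->flags |= SKETCH_POSIXL;
    else
        s->flags = (uint8_t)(s->flags & ~SKETCH_POSIXL);
}
```

An SSC with no classes added ends with the flag clear. Adding any class and calling `sketch_ssc_finish` sets it.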
| |
This reverts commit 34fdef848b1687b91892ba55e9e0c3430e0770f6, and
adds comments referring to it, in case it is ever needed.
| |
This commit frees up a bit by using an extra regnode to pass the
information to the regex engine instead of the flag. I originally
thought that if this was needed, it should be the ANYOF_ABOVE_LATIN1_ALL
bit, as that might speed some things up. But if we need to do this
again by adding another node to get another bit, we want one that is
mutually exclusive of the first one we did, for otherwise we start
having to make 3 nodes instead of two to get the combinations:
    1 0
    0 1
    1 1
This combinatorial problem is avoided by using bits that are mutually
exclusive, which the ABOVE_LATIN1_ALL isn't, but the one freed by this
commit, ANYOF_NON_UTF8_NON_ASCII_ALL, is only set under /d matching, and
there are other bits that are set only under /l, so if we need to do
this again, we should use one of those.
I wrote this code when I thought I really needed a bit. But since then, I
have figured out a better way to get the bit needed now. But I don't
want to lose this code to posterity, so this commit is being made long
enough to get the commit number, then it will be reverted, adding
comments referring to the commit number, so that it can easily be
reconstructed when necessary.
| |
I misread the code when I added these comments
|