Commit messages
|
Addresses this build-time warning:
suggest braces around initialization of subobject [-Wmissing-braces]
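A standalone illustration of the kind of initializer that draws this
warning (not the actual declaration the commit touches):

    /* An aggregate member initialized without its own braces triggers
     * -Wmissing-braces; fully bracing each subobject silences it. */
    struct pair { int a, b; };

    struct pair warns[2] = { 1, 2, 3, 4 };          /* compiler suggests braces */
    struct pair quiet[2] = { { 1, 2 }, { 3, 4 } };  /* fully braced */

    int main(void) { return warns[0].a + quiet[0].a - 2; }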
|
This was caused by copying too many characters for the size of the
buffer. Only one character is needed.
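A standalone illustration of the bug pattern described above (the real
buffer, source, and sizes are different):

    #include <string.h>

    int main(void)
    {
        const char src[] = "AB";
        char dst[1];

        /* memcpy(dst, src, sizeof(src));   would write past the end of dst */
        memcpy(dst, src, 1);             /* only one character is needed */
        return dst[0] == 'A' ? 0 : 1;
    }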
|
This bug was introduced by bb3825626ed2b1217a2ac184eff66d0d4ed6e070, and
was the result of overflowing a 32-bit space. The solution is to rework
the expression so that it can't overflow.
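A standalone illustration of the failure mode, with invented values and
an invented expression (the commit's actual expression is different);
the fix here widens the intermediate arithmetic so it cannot wrap:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t count = 5000000, scale = 1000, divisor = 1000;

        /* The product needs 33 bits, so 32-bit arithmetic wraps before the
         * division can bring the value back into range. */
        uint32_t bad  = count * scale / divisor;
        uint32_t good = (uint32_t)((uint64_t)count * scale / divisor);

        printf("%u %u\n", bad, good);   /* prints 705032 5000000 */
        return 0;
    }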
|
Mostly indentation changes, because the prior commit created a new block.
|
Consider the pattern /A*B/ where A and B are arbitrary. The pattern
matching code tries to make a tight loop to match the span of A's. The
logic of this was not really updated when UTF-8 was added. I did
revamp it some releases ago to fix some bugs and to at least consider
UTF-8.
This commit changes it so that Unicode is now a first-class citizen.
Some details are listed in ticket GH #18414.
|
The new name reflects its new functionality coming in future commits.
|
The names were intended to force people not to use them outside their
intended scopes. But by restricting those scopes in the first place, we
don't need such unwieldy names.
|
This feature allows documentation destined for perlapi or perlintern to
be split into sections of related functions, no matter where the
documentation source is. Prior to this commit the line had to contain
the exact text of the title of the section. Now it can be a $variable
name that autodoc.pl expands to the title. It still has to be an exact
match for the variable in autodoc, but now, the expanded text can be
changed in autodoc alone, without other files needing to be updated at
the same time.
|
This makes the text look cleaner, and prepares for a future commit,
where we will want to change the variable (which can't be done with the
expression).
|
This makes it like a corresponding variable.
|
I found myself getting confused, as this most likely was named before
UTF-8 came along. It actually is just a byte, plus an out-of-bounds
value.
While I'm at it, I'm also changing the type from I32 to the Perl
equivalent of C99's 'int_fast16_t', as it doesn't need to be 32 bits,
and we should let the compiler choose whatever size is most efficient
and still meets our needs.
|
This commit uses the new macros from the previous commit to simplify
some code.
|
This is to distinguish it from a similar variable being added in a
future commit.
|
This is a follow-on to the previous commit. The case number of the main
switch statement now includes three things: the regnode op, the UTF8ness
of the target, and the UTF8ness of the pattern.
This allows the conditionals within the previous cases (which only
encoded the op) to be removed, and things to be moved around so that
there are more fall-throughs and fewer gotos, and the macros that are
called no longer have to test for UTF8ness; so I teased the UTF-8 ones
apart from the non-UTF-8 ones.
|
This uses the #defines created in the previous commit to make the switch
statement in this function incorporate the UTF8ness of both the pattern
and the target string.
The reason for this is that the first statement in nearly every case of
the switch is to test if the target string being matched is UTF-8 or
not. By putting that information into the case number, those
conditionals can be eliminated, leading to cleaner, more modular code.
I had hoped that this would also improve performance since there are
fewer conditionals, but Sergey Aleynikov did performance testing of this
change for me, and found no real noticeable gain nor loss.
Further, the cases involving matching EXACTish nodes have to also test
if the pattern is UTF-8 or not before doing anything else. I added that
information as well to the case number, so that those conditionals can
be eliminated. For the non-EXACTish nodes, it simply means that two
case statements execute the same code.
This is an intermediate commit, which only does the expansion of the
current cases into four for each. The refactoring that takes advantage
of this is in the following commit.
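A standalone sketch of the dispatch technique (the macro, op names, and
variables here are invented; the real definitions live in regexec.c and
regcomp.h):

    #include <stdio.h>

    /* Fold a regnode op together with the UTF-8ness of the target and of
     * the pattern into one value, so a single switch dispatches on all
     * three at once instead of re-testing the flags inside every case. */
    enum { OP_EXACT = 1, OP_BOUND = 2 };   /* stand-ins for real regnode ops */

    #define WITH_UTF8NESS(op, t_utf8, p_utf8) \
        (((op) << 2) | ((t_utf8) ? 2 : 0) | ((p_utf8) ? 1 : 0))

    static const char *dispatch(int op, int utf8_target, int utf8_pattern)
    {
        switch (WITH_UTF8NESS(op, utf8_target, utf8_pattern)) {
          case WITH_UTF8NESS(OP_EXACT, 0, 0):
            return "EXACT: byte target, byte pattern";
          case WITH_UTF8NESS(OP_EXACT, 1, 0):
            return "EXACT: UTF-8 target, byte pattern";
          case WITH_UTF8NESS(OP_EXACT, 0, 1):
          case WITH_UTF8NESS(OP_EXACT, 1, 1):
            return "EXACT: UTF-8 pattern";
          /* An op that doesn't care about the pattern's UTF-8ness just
           * gives two case labels the same body, as described above. */
          case WITH_UTF8NESS(OP_BOUND, 1, 0):
          case WITH_UTF8NESS(OP_BOUND, 1, 1):
            return "BOUND: UTF-8 target";
          default:
            return "other";
        }
    }

    int main(void)
    {
        printf("%s\n", dispatch(OP_EXACT, 1, 0));
        return 0;
    }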
|
GH #17594: the logic here expects the node to have width 1 (except for
LNBREAK); it is not expected to do the right thing on zero-width nodes.
|
Adjust indentation as a result of the previous commit.
|
There are five \b variants. Plain \b (without braces) is the outlier as
far as implementation goes. This commit moves the handling of plain \b to
outside the switch that handles the others. That allows the duplicate
code that previously existed to be consolidated into one occurrence.
|
apidoc_section is slightly favored over head1, as it is known only to
autodoc, and can't be confused with real pod.
|
I was never happy with this short form, and other people weren't either.
Now that most things are better expressed in terms of av_count, convert
the few remaining items that are clearer when referring to an index into
using the fully spelled-out form.
|
This is faster and clearer.
|
This code was advancing per-byte on a UTF-8 string, which still works,
but is slower than need be.
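A standalone sketch of the difference; in the core, the UTF8SKIP() macro
plays the role of the little helper function used here:

    #include <stdio.h>
    #include <string.h>

    /* Length in bytes of the UTF-8 character whose first byte is b
     * (assumes well-formed input). */
    static size_t utf8_char_len(unsigned char b)
    {
        if (b < 0x80) return 1;     /* ASCII */
        if (b < 0xE0) return 2;     /* 110xxxxx */
        if (b < 0xF0) return 3;     /* 1110xxxx */
        return 4;                   /* 11110xxx */
    }

    int main(void)
    {
        const char *s     = "caf\xC3\xA9s";  /* "cafes" with e-acute */
        size_t      bytes = strlen(s);
        const char *send  = s + bytes;
        size_t      chars = 0;

        /* Advance a whole character at a time, instead of byte-by-byte. */
        while (s < send) {
            s += utf8_char_len((unsigned char)*s);
            chars++;
        }
        printf("%zu characters in %zu bytes\n", chars, bytes);  /* 5 in 6 */
        return 0;
    }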
|
It only does anything under PERL_GLOBAL_STRUCT, which is gone.
Keep the dNOOP definition for CPAN back-compat.
|
A future commit will change this array so that its size isn't known at
compilation time.
|
Mostly in comments and docs, but some in diagnostic messages and one
case of 'or die die'.
|
The code this replaces relies on the internal structure of a macro,
which can change and break things. This commit changes to use a more
straightforward way of accomplishing the same thing.
|
The compilation of User-defined properties in a regular expression that
haven't been defined at the time that pattern is compiled is deferred
until execution time. Until this commit, any request for debugging info
on those was ignored.
This fixes that by
|
It wasn't clear to me that the macro did more than a declaration, given
its name. Rename it to be clear as to what it does.
|
A proper debugging statement isn't just controlled by DEBUG_r; it needs
to say what class of debugging controls it, so that re.pm can operate
properly.
This is the second of two cases in the code where it was wrong.
|
A proper debugging statement isn't just controlled by DEBUG_r; it needs
to say what class of debugging controls it, so that re.pm can operate
properly.
This is one of two cases in the code where it was wrong.
|
Unicode 12.0 used a new property file that was not from the Unicode
Character Database. It only had a long property name. I incorporated
it into our data, and rather than use the very long name all the time, I
created my own short name, since there was no official one.
Now, the upcoming 13.0 has moved the file to the UCD, and come up with a
short name that differs from the one I had. This commit converts to use
Unicode's name. This property is not exposed to user or XS space, so
there is no user impact.
|
It wasn't intended to be part of the recursion logic, and doesn't get
decremented again (GH 17490).
|
This makes the first parameter consistent with the other similar
parameter.
|
This is because these deal with only legal Unicode code points, which
are restricted to 21 bits, so 16 is too few, but 32 is sufficient to
hold them. Doing this saves some space/memory on 64-bit builds where an
int is 64 bits.
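A tiny standalone check of the bit-width arithmetic (using plain C99
types rather than the typedefs the commit actually changes):

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        /* The highest legal Unicode code point is U+10FFFF, which needs 21
         * bits: too big for a 16-bit type, comfortably inside 32 bits. */
        assert(0x10FFFFUL > UINT16_MAX);
        assert(0x10FFFFUL <= UINT32_MAX);
        assert(0x10FFFFUL < (1UL << 21));
        return 0;
    }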
|
These are illegal in C, but we have plenty of them around; I happened
to be looking at this function, and decided to fix it. Note that only
the macro name is illegal; the function's name was fine, but changing
the macro name means changing the function's name too.
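The message doesn't say which names these were; presumably "illegal"
refers to C's reserved identifiers. A standalone reminder of that rule,
with invented names:

    /* Identifiers beginning with an underscore followed by an uppercase
     * letter, or containing two consecutive underscores, are reserved for
     * the C implementation, so defining them is technically not allowed. */
    #define _Is_Foo(c)  ((c) == 'f')    /* reserved-style name */
    #define isFOO(c)    ((c) == 'f')    /* unproblematic spelling */

    int main(void) { return isFOO('f') ? 0 : 1; }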
|
This node is like ANYOFHb, but is used when more than one leading byte
is the same in all the matched code points.
ANYOFHb is used to avoid having to convert from UTF-8 to code point for
something that won't match. It checks that the first byte in the UTF-8
encoded target is the desired one, thus ruling out most of the possible
code points.
But for higher code points that require longer UTF-8 sequences, many,
many non-matching code points pass this filter; for code points above
0xFFFF it is ineffective for almost 200K of them.
This commit creates a new node type that addresses this problem.
Instead of a single byte, it stores as many leading bytes as are the
same for all code points that match the class. For many classes, that
will cut down the number of possible false positives by a huge amount
before having to convert to code point to make the final determination.
This regnode adds a UTF-8 string at the end. It is still much smaller,
even in the rare worst case, than a plain ANYOF node because the maximum
string length, 15 bytes, is still shorter than the 32-byte bitmap that
is present in a plain ANYOF. Most of the time the added string will
instead be at most 4 bytes.
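A standalone sketch of the filtering idea (the struct, names, and layout
are invented; the real node stores its bytes inside the regnode itself):

    #include <stdio.h>
    #include <string.h>

    /* If every code point the class matches shares the same leading UTF-8
     * bytes, compare just those bytes against the target character before
     * doing any UTF-8-to-code-point conversion at all. */
    struct class_filter {
        unsigned char leading[4];   /* bytes shared by every member */
        size_t        len;          /* how many of them are shared */
    };

    static int might_match(const struct class_filter *f,
                           const unsigned char *target, size_t target_len)
    {
        /* Cheap rejection: if the shared prefix differs, no member matches. */
        return target_len >= f->len && memcmp(target, f->leading, f->len) == 0;
    }

    int main(void)
    {
        /* A class whose members all lie in U+10300..U+1032F: their UTF-8
         * encodings all begin with the three bytes F0 90 8C. */
        struct class_filter f = { { 0xF0, 0x90, 0x8C }, 3 };

        unsigned char in_range[]  = { 0xF0, 0x90, 0x8C, 0x80 };  /* U+10300 */
        unsigned char elsewhere[] = { 0xF0, 0x9F, 0x98, 0x80 };  /* U+1F600 */

        printf("%d %d\n", might_match(&f, in_range, 4),
                          might_match(&f, elsewhere, 4));        /* prints 1 0 */
        return 0;
    }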
|
This is like the ANYOFR regnode added in the previous commit, but all
code points in the range it matches are known to have the same first
UTF-8 start byte. That means it can't match UTF-8 invariant characters,
like ASCII, because the "start" byte is different on each one, so it
could only match a range of 1, and the compiler wouldn't generate this
node for that; instead using an EXACT.
Pattern matching can rule out most code points by looking at the first
character of their UTF-8 representation, before having to convert from
UTF-8.
On ASCII platforms this simple comparison rules out all but 64 of the
2-byte UTF-8 characters (a single start byte leaves only 6 variable
bits); for 3-byte characters it's up to 4096, and for 4-byte ones 2**18,
so the test is less effective for higher code points.
I believe that most UTF-8 patterns that otherwise would compile to
ANYOFR will instead compile to this, as I can't envision real life
applications wanting to match large single ranges. Even the 2048
surrogates all have the same first byte.
|
This matches a single range of code points. It is both faster and
smaller than other ANYOF-type nodes, requiring, after set-up, a single
subtraction and conditional branch.
The vast majority of Unicode properties match a single range (though
most of the properties likely to be used in real world applications have
more than a single range). But things like [ij] are a single range, and
those are quite commonly encountered. This new regnode matches them more
efficiently than a bitmap would, and doesn't require the space for one
either.
The flags field is used to store the minimum matchable start byte for
UTF-8 strings, and is ignored for non-UTF-8 targets. This, like ANYOFH
nodes which have a similar mechanism, allows for quick weeding out of
many possible matches without having to convert the UTF-8 to its
corresponding code point.
This regnode packs the 32-bit argument with 20 bits for the minimum code
point the node matches, and 12 bits for the length of the range. If the
input doesn't fit in those limits, it simply won't compile to this
regnode, instead going to one of the ANYOFH flavors.
ANYOFR is sufficient to match all of Unicode except for the final
(private use) 65K plane.
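A standalone sketch of the packing and of the single-comparison match
test (field order and macro names are invented, not the actual
regcomp.h definitions):

    #include <stdio.h>
    #include <stdint.h>

    /* 20 bits of minimum code point plus 12 bits of range length in one
     * 32-bit argument; matching is one subtraction and one unsigned compare. */
    #define PACK_RANGE(min, delta)  (((uint32_t)(min) << 12) | (uint32_t)(delta))
    #define RANGE_MIN(arg)          ((arg) >> 12)
    #define RANGE_DELTA(arg)        ((arg) & 0xFFF)

    static int in_range(uint32_t arg, uint32_t cp)
    {
        /* If cp is below the minimum, the subtraction wraps to a huge
         * unsigned value, so one comparison handles both ends of the range. */
        return cp - RANGE_MIN(arg) <= RANGE_DELTA(arg);
    }

    int main(void)
    {
        uint32_t ij = PACK_RANGE('i', 'j' - 'i');    /* the class [ij] */

        printf("%d %d %d\n", in_range(ij, 'i'), in_range(ij, 'j'),
                             in_range(ij, 'k'));     /* prints 1 1 0 */
        return 0;
    }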
|
The called macro does the cast, and this makes it more legible.
|
The new name is shorter and, I believe, clearer.
|
This URL is outdated, but the link forwards to the correct
section in a PDF.