| Commit message | Author | Age | Files | Lines |
|
The convention is that when the interpreter dies with an internal error, the
message starts "panic: ". Historically, many panic messages were terse
fixed strings, which meant that the out-of-range values that triggered the
panic were lost. Now we try to report these values, as such panics may not be
repeatable, and the original error message may be the only diagnostic we get
when we try to find the cause.
We can't report diagnostics when the panic message is generated by something
other than croak(), as we don't have *printf-style format strings. Don't
attempt to report values in panics related to *printf buffer overflows, as
attempting to format the values to strings may repeat or compound the
original error.
|
All Unicode properties actually turn into bracketed character classes,
whether explicitly done or not. A swash is generated for each property
in the class. If that is the only thing not in the class's bitmap, it
specifies completely the non-bitmap behavior of the class, and can be
passed explicitly to regexec.c. This avoids having to regenerate the
swash. It also means that the same swash is used for multiple instances
of a property. And that means the number of duplicated data structures
is greatly reduced. This currently doesn't extend to cases where
multiple Unicode properties are used in the same class:
[\p{greek}\p{latin}] will not share the same swash as another character
class with the same components. This is because I don't know of an
efficient method to determine whether a new class being parsed has the
same components as one already generated. I suppose some sort of
checksum could be generated, but that is for future consideration.
|
As a result of previous commits adding and removing if() {} blocks,
indent and outdent and reflow comments and statements to not exceed 80
columns.
|
Add a new parameter to _core_swash_init() that is an inversion list to
add to the swash, along with a boolean to indicate if this inversion
list is derived from a user-defined property. This capability will prove
useful in future commits.
|
This adds the capability, to be used in future commits, for swash_init()
to return NULL instead of croaking if it can't find a property, so that
the caller can choose how to handle the situation.
|
Make sure there is something before the character being read before
reading it.
|
Prior to this patch, every time a code point was matched against a swash,
and the result was not previously known, a linear search through the
swash was performed. This patch changes that to generate an inversion
list whenever a swash for a binary property is created. A binary search
is then performed for missing values.
This change does not have much effect on the speed of Perl's regression
test suite, but the speed-up in worst-case scenarios is huge. The
program at the end of this commit is crafted to avoid the caching that
hides much of the current inefficiencies. At character classes of 100
isolated code points, the new method is about an order of magnitude
faster; two orders of magnitude at 1000 code points. The program at the
end of this commit message took 97s to execute on my box using blead,
and 1.5 seconds using this new scheme. I was surprised to see that even
with classes containing fewer than 10 code points, the binary search
trumped, by a little, the linear search.
Even after this patch, under the current scheme, one can easily run out
of memory due to the permanent storing of results of swash lookups in
hashes. The new search mechanism might be fast enough to enable the
elimination of that memory usage. Instead, a simple cache in each
inversion list that stored its previous result could be created, and
that checked to see if it's still valid before starting the search,
under the assumption, which the current scheme also makes, that probes
will tend to be clustered together, as nearby code points are often in
the same script.
===============================================
# This program creates longer and longer character class lists while
# testing code points matches against them. By adding or subtracting
# 65 from the previous member, caching of results is eliminated (as of
# this writing), so this essentially tests for how long it takes to
# search through swashes to see if a code point matches or not.
use Benchmark ':hireswallclock';
my $string = "";
my $class_cp = 2**30; # Divide the code space in half, approx.
my $string_cp = $class_cp;
my $iterations = 10000;
for my $j (1..2048) {
    # Append the next character to the [class]
    my $hex_class_cp = sprintf("%X", $class_cp);
    $string .= "\\x{$hex_class_cp}";
    $class_cp -= 65;

    next if $j % 100 != 0;  # Only test certain ones

    print "$j: lowest is [$hex_class_cp]: ";
    timethis(1, "no warnings qw(portable non_unicode); my \$i = $string_cp; for (0 .. $iterations) { chr(\$i) =~ /[$string]/; \$i += 65 }");
    $string_cp += ($iterations + 1) * 65;
}
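The inversion-list lookup described above can be sketched in Python (an illustrative model only; the real implementation is C code in the perl core). An inversion list is a sorted array of the code points at which membership in the set flips; one binary search answers a membership probe, with the parity of the insertion point telling whether the probe falls inside an "in" range.

```python
from bisect import bisect_right

def in_inversion_list(inv_list, cp):
    """inv_list holds the sorted start points of alternating
    in/out-of-set ranges, beginning with an "in" range.
    Membership is a single O(log n) binary search."""
    i = bisect_right(inv_list, cp)
    # An odd number of boundaries at or below cp means cp is inside.
    return i % 2 == 1

# Example: ASCII letters as two ranges, A-Z and a-z.
ascii_letters = [0x41, 0x5B, 0x61, 0x7B]
```

For a class of 100 or 1000 isolated code points the list simply has two boundaries per code point, and the search stays logarithmic, which is where the order-of-magnitude speedups quoted above come from.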
|
Future commits will split up the necessary initialization into two
components. This patch prepares for that without adding anything new.
|
Currently, swash_init returns a copy of the swash it finds. The core
portions of the swash are read-only, and the non-read-only portions are
derived from them. When the value for a code point is looked up, the
results for it and adjacent code points are stored in a new element,
so that the lookup never has to be performed again. But since a copy is
returned, those results are stored only in the copy, and any other uses
of the same logical swash don't have access to them, so the lookups have
to be performed for each logical use.
Here's an example. If you have 2 occurrences of /\p{Upper}/ in your
program, there are 2 different swashes created, both initialized
identically. As you start matching against code points, say "A" =~
/\p{Upper}/, the swashes diverge, as the results for each match are
saved in the one applicable to that match. If you match "A" in each
swash, it has to be looked up in each swash, and an (identical) element
will be saved for it in each swash. This is wasteful of both time and
memory.
This patch renames the function and returns the original and not a copy,
thus eliminating the overhead for swashes accessed through the new
interface. The old function name is serviced by a new function which
merely wraps the new name result with a copy, thus preserving the
interface for existing calls.
Thus, in the example above, there is only one swash, and matching "A"
against it results in only one new element, and so the second use will
find that, and not have to go out looking again. In a program with lots
of regular expressions, the savings in time and memory can be quite
large.
The new name is restricted to use only in regcomp.c and utf8.c (unless
XS code cheats the preprocessor), where we will code so as to not
destroy the original's data. Otherwise, a change to that would change
the definition of a Unicode property everywhere in the program.
Note that there are no current callers of the new interface; these will
be added in future commits.
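The sharing argument can be modeled with a toy Python class (purely illustrative; real swashes are hashes managed by C code and utf8_heavy.pl). Returning the original object instead of a copy lets every use of the same property fill one common lookup cache:

```python
import copy

class Swash:
    """Toy stand-in: a membership predicate plus a lazily filled cache."""
    def __init__(self, predicate):
        self.predicate = predicate
        self.cache = {}                      # code point -> bool

    def matches(self, cp):
        if cp not in self.cache:             # expensive lookup, done once
            self.cache[cp] = self.predicate(cp)
        return self.cache[cp]

upper = Swash(lambda cp: chr(cp).isupper())

# Old interface: each use gets a copy, so identical work is repeated
# and identical cache entries are stored twice.
use1, use2 = copy.deepcopy(upper), copy.deepcopy(upper)
use1.matches(ord('A')); use2.matches(ord('A'))
assert len(use1.cache) == 1 and len(use2.cache) == 1   # duplicated

# New interface: both uses share the original; 'A' is looked up once
# and the result is visible to every user of the property.
shared1 = shared2 = upper
shared1.matches(ord('A')); shared2.matches(ord('A'))
assert len(upper.cache) == 1
```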
|
This function has always confused me, as it doesn't return a swash, but
a swatch.
|
We set the upper limit of the loops, before entering them, to the minimum
of the two possible limits, thus avoiding a test each time through.
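A minimal sketch of the hoisting (illustrative Python; the actual change is in C loops over two buffers):

```python
def pair_slow(a, b):
    # Before: two bounds are tested on every iteration.
    out, i = [], 0
    while i < len(a) and i < len(b):
        out.append((a[i], b[i]))
        i += 1
    return out

def pair_fast(a, b):
    # After: take the minimum of the two limits once, up front,
    # so the loop tests a single bound.
    limit = min(len(a), len(b))
    return [(a[i], b[i]) for i in range(limit)]
```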
|
Also, one test is reordered for clarity.
|
In two instances, I actually modified the code to avoid %s for a
constant string, as it should be faster that way.
|
The test here was if any flag was set, not the particular desired one.
This doesn't cause any bugs as things are currently structured, but
could in the future.
The reason it doesn't currently cause any bugs is that the other
flags are tested first, and only if they are both 0 does this flag get
tested.
|
_to_uni_fold_flags() and _to_fold_latin1() now have their flags
parameter be a boolean. The name 'flags' is retained, rather than naming
the parameter after its only current use, in case the usage ever expands.
This is a result of confusion between these functions and
_to_utf8_fold_flags(), which does have more than one flag possibility.
|
This indents previous lines that are now within new blocks
|
This changes the 4 case changing functions to take extra parameters to
specify if the utf8 string is to be processed under locale rules when
the code points are < 256. The current functions are changed to macros
that call the new versions so that current behavior is unchanged.
An additional, static, function is created that makes sure that the
255/256 boundary is not crossed during the case change.
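The boundary rule can be sketched as follows (a hypothetical Python model; the helper name here is an assumption for illustration, not the actual static C function). Locale rules govern only single-byte code points, so a case change that would map a code point below 256 to one above 255 is suppressed:

```python
def check_locale_boundary(orig_cp, cased_cp):
    """Return the case-changed code point, unless the change would
    cross the 255/256 boundary under single-byte locale rules, in
    which case the original code point is kept unchanged."""
    if orig_cp < 256 and cased_cp > 255:
        return orig_cp          # refuse to cross the boundary
    return cased_cp

# U+00B5 MICRO SIGN uppercases to U+039C GREEK CAPITAL LETTER MU in
# full Unicode; that crosses the boundary, so it is left alone here.
```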
|
This function and is_utf8_string_loclen() are modified to check before
reading beyond the end of the string; and the pod for is_utf8_char()
is modified to warn about the buffer overflow potential.
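The shape of the fix can be sketched in Python (illustrative only; the real functions operate on C pointers in utf8.c): derive the expected sequence length from the first byte, then verify the buffer actually contains that many bytes before touching any continuation byte.

```python
def utf8_char_len_checked(buf, start, end):
    """Length of the UTF-8 sequence at buf[start], or 0 if it is
    malformed or reading it would run past `end` (no over-read)."""
    first = buf[start]
    if first < 0x80:
        length = 1
    elif 0xC2 <= first <= 0xDF:
        length = 2
    elif 0xE0 <= first <= 0xEF:
        length = 3
    elif 0xF0 <= first <= 0xF4:
        length = 4
    else:
        return 0                     # invalid start byte
    if start + length > end:         # the crucial check, BEFORE reading
        return 0
    for b in buf[start + 1:start + length]:
        if not 0x80 <= b <= 0xBF:    # continuations must be 10xxxxxx
            return 0
    return length
```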
|
The function to_uni_fold() works without first requiring conversion to
utf8.
|
It's very rare that someone will be outputting these unusual code points.
|
I now understand swashes well enough to document them better; also fixes
nits in other comments.
|
Perl has allowed user-defined properties to match above-Unicode code
points, while falsely warning that it doesn't. This removes that
warning.
|
The code assumed that there is a code point above the highest value we
are looking at. That is true except when we are looking at the highest
representable code point on the machine. A special case is needed for
that.
|
On a 32 bit machine with USE_MORE_BITS, a UV is 64 bits, but STRLEN is
32 bits. A cast was missing during a bit complement that led to loss of
32 bits.
|
When a code point is to be checked if it matches a property, a swatch of
the swash is read in. Typically this is a block of 64 code points that
contain the one desired. A bit map is set for those 64 code points,
apparently under the expectation that the program will desire code
points near the original.
However, it just adds 63 to the original code point to get the ending
point of the block. When the original is so close to the maximum UV
expressible on the platform, this will overflow.
The patch is simply to check for overflow and if it happens use the max
possible. A special case is still needed to handle the very maximum
possible code point, and a future commit will deal with that.
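The guard amounts to this (a sketch; the UV_MAX value is an assumption for a 64-bit build, and Python integers don't wrap, so the wraparound that C sees is simulated with an explicit comparison):

```python
UV_MAX = 2**64 - 1        # assumed platform maximum for a UV

def swatch_end(cp):
    """Inclusive end of the 64-code-point block holding cp, clamped
    so that cp + 63 cannot exceed (in C, wrap past) UV_MAX."""
    end = cp + 63
    return min(end, UV_MAX)
```

In C the overflow shows up as the sum wrapping below its starting value, so the actual check compares the computed end against the start rather than against a limit.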
|
This adds a function similar to the ones for the other three case
changing operations that works on latin1 characters only, and avoids
having to go out to swashes. It changes to_uni_fold() and
to_utf8_fold() to call it on the appropriate input.
|
This creates a new function to handle upper/title casing code points in
the latin1 range, and avoids using a swash to compute the case. This is
because the correct values are compiled-in.
And it calls this function when appropriate for both title and upper
casing, in both utf8 and uni forms.
Unlike the similar function for lower casing, it may make sense for this
function to be called from outside utf8.c, but inside the core, so it is
not static, but its name begins with an underscore.
|
The new function split out from to_uni_lower is now called when
appropriate from to_utf8_lower.
And to_uni_lower no longer calls to_utf8_lower, using the macro instead,
saving a function call and duplicate work.
|
The portion that deals with Latin1 range characters is refactored into a
separate (static) function, so that it can be called from more than one place.
|
Future commits will use these in additional places, so macroize them.
|
There are five functions in utf8.c that look up Unicode maps--the case
changing functions. They look up these maps under the names ToDigit,
ToFold, ToLower, ToTitle, and ToUpper. The imminent expansion of Unicode::UCD
to return the mappings for all properties creates a naming conflict, as
three of those names are the same as other properties, Upper, Lower, and
Title.
It was an unfortunate choice of names originally. Now mktables has been
changed to create a list of mapping properties that utf8_heavy.pl reads.
It uses the official names of those properties, so change utf8.c to
correspond.
|
The lowercase of latin-1 range code points is known to the perl core, so
for those we can short-circuit converting to utf8 and reading in a swash.
|
Indent newly formed blocks, and reflow comments and code to fit in
narrower space
|
This adds flags so that if one of the input strings is known to already
have been folded, this routine can skip the (redundant) folding step.
|
The previous commit introduced some code that concatenates a pv on to
an sv and then does SvUTF8_on on the sv if the pv was utf8.
That can’t work if the sv was in Latin-1 (or single-byte) encoding
and contained extra-ASCII characters. Nor can it work if bytes are
appended to a utf8 sv. Both produce mangled utf8.
There is apparently no function apart from sv_catsv that handles
this. So I’ve modified sv_catpvn_flags to handle this if passed the
SV_CATUTF8 (concatenating a utf8 pv) or SV_CATBYTES (concatenating a
byte pv) flag.
This avoids the overhead of creating a new sv (in fact, sv_catsv
even copies its rhs in some cases, so that would mean creating two
new svs). It might even be worthwhile to redefine sv_catsv in terms
of this....
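The failure mode and the fix can be modeled in Python with raw byte strings (illustrative; the real code manipulates SV buffers and the SvUTF8 flag in C):

```python
# An SV holding "café" in single-byte (Latin-1) encoding, and a pv
# already known to be UTF-8 (an "é").
latin1_sv = "caf\xe9".encode("latin-1")
utf8_pv = "\u00e9".encode("utf-8")

# Blind concatenation plus "turn the UTF-8 flag on" mangles the data:
# the Latin-1 0xE9 byte is not valid inside a UTF-8 stream.
mangled = latin1_sv + utf8_pv
try:
    mangled.decode("utf-8")
    valid = True
except UnicodeDecodeError:
    valid = False
assert not valid

# The fix (what SV_CATUTF8 arranges): upgrade the byte side to UTF-8
# first, then append, leaving well-formed UTF-8 throughout.
upgraded = latin1_sv.decode("latin-1").encode("utf-8") + utf8_pv
result = upgraded.decode("utf-8")
```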
|
The swashes already have the underscore, so this test is redundant. It
does save some time for this character to avoid having to go out and
load the swash, but why just the underscore? In fact an earlier commit
changed the macro that most people should use to access this function to
not even call it for the underscore.
|
The Unicode stability policy guarantees that no code points will ever be
added to the set of control characters beyond those already in it.
All such characters are in the Latin1 range, and so the Perl core
already knows which ones those are, and so there is no need to go out to
disk and create a swash for these.
|
The name XPerlSpace is less confusing than SpacePerl (at least to me):
it means take PerlSpace and extend it beyond ASCII.
|
These three properties are restricted to being true only for ASCII
characters. That information is compiled into Perl, so no need to
create swashes for them.
|
This information is trivially computed via the macro; there is no need
to go out to disk and store a swash for this.
|
This new function is now potentially called. However, there are no data
files or other circumstances that currently cause this path to be executed.
|