Prior to this commit, 98.4% of Unicode code points that went through
\X had to be looked up to see if they begin a grapheme cluster, then
looked up again to find that they didn't require special handling.
This commit refactors things so that only one look-up is required for
those 98.4%. It changes the table generated by mktables to accomplish
this; the table's name, and references to it, are changed to
correspond.
|
This changes code to be able to handle Unicode 6.2, while continuing
to handle all previous releases.

The major change was a new definition of \X, which adds a property to
its calculation. Unfortunately, \X is hard-coded into regexec.c, and
so has to be revised whenever there is a change of this magnitude in
Unicode, which fortunately isn't all that often. I refactored the
code in mktables to make it easier next time there is a change like
this one.
|
For most Unicode releases, GCB=Prepend matches absolutely nothing.
And that appears to be the case going forward, as the Unicode
Consortium added things to it and later removed them based on field
experience.

An earlier commit improved the performance of this significantly by
using a binary search of an empty array instead of a swash hash.
However, that search requires several layers of function calls to
discover that the array is empty, which this commit avoids.

This patch will use whatever swash_init() returns unless it is empty,
preserving backwards compatibility with older Unicode releases. But
if it is empty, the routine sets things up so that future calls will
always fail without further testing.
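
A minimal sketch of that caching idea follows; every name below
except swash_init() is hypothetical, and this is not the actual
regexec.c code:

    /* Cache the "empty" result so later calls short-circuit instead
     * of re-searching an empty list. */
    static SV   *prepend_swash    = NULL;
    static bool  prepend_is_empty = FALSE;

    bool
    is_grapheme_prepend(pTHX_ UV cp)
    {
        if (prepend_is_empty)
            return FALSE;               /* fail fast, no lookup at all */
        if (!prepend_swash) {
            prepend_swash = swash_init("utf8", "_X_GCB_Prepend",
                                       &PL_sv_undef, 1, 0);
            if (swash_is_empty(prepend_swash)) {   /* hypothetical test */
                prepend_is_empty = TRUE;
                return FALSE;
            }
        }
        return swash_member(prepend_swash, cp);    /* hypothetical lookup */
    }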
|
A binary swash is a hash of bitmaps used to cache the results of
looking up whether a code point matches a Unicode property or a regex
bracketed character class. An inversion list is a data structure that
also holds information about which code points match a Unicode
property or character class. It is implemented as an SV* pointing to
a sorted C array, and hence can be searched using a binary search.

This patch converts to using a binary search of an inversion list
instead of a hash look-up for inversion lists that have no more than
512 elements (9 iterations of the search loop, since 2^9 = 512).
That number can easily be adjusted if necessary.

Theoretically, a hash is faster than a binary search over a very long
run of lookups, so this may negatively impact long-running servers.
But in the short run, where most programs reside, the binary search
is significantly faster.

A swash is filled as necessary as time goes on, caching each new
distinct code point it is called with. If it is called with many,
many such code points, its performance can degrade as collisions
increase. A binary search does not have that drawback. However,
most real-world scenarios do not have a program being called with
huge numbers of distinct code points. Mostly, a program will be
called with code points from just one or a few of the world's
scripts, so the swash will remain sparse. The bitmaps in a swash are
each 64 bits long (except for ASCII, where they are 128). That means
that when the swash is populated, a lookup of a single code point
that hasn't been checked before also has to look up the 63 adjoining
code points, increasing its startup overhead. Of course, if one of
those 63 code points is later accessed, no extra populating happens.
This is the typical case, where a language's code points are all near
each other.

The bottom line, though, is that in the short term this patch speeds
up the processing of \X regex matching by about 35-40%, with modern
Korean (which has uniquely complicated \X processing) closer to 40%,
and other scripts closer to 35%.

The 512-element boundary means that over 90% of the official Unicode
properties are handled using binary search. I settled on that number
by experimenting with several properties besides \X and with various
powers-of-2 limits. Until I got that high, performance kept improving
when a property went from being a swash to a binary search. \X
improved even up to 2048, which encompasses 100% of the official
Unicode properties.

The implementation changes so that an inversion list instead of a
swash is returned by swash_init() when the input flags allow it to do
so, for all inversion lists no longer than the compiled-in constant
of 512 elements (actually <= 512). The other functions that access
swashes have added intelligence to deal with an object of either
type. Should someone on CPAN be using the public swash_init()
interface, they will not see any difference, as the option to get an
inversion list is not available to them.
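
For reference, an inversion list's membership test is compact enough
to show in a few lines. This is an illustrative sketch, not the Perl
source: the list stores the sorted start points of alternating
inside/outside ranges, and a code point is a member exactly when the
largest element not exceeding it sits at an even index.

    #include <stdbool.h>
    #include <stddef.h>

    /* invlist[] holds range start points; even indices start ranges
     * that are in the set, odd indices start gaps. */
    static bool
    invlist_contains(const unsigned *invlist, size_t len, unsigned cp)
    {
        size_t lo = 0, hi = len;    /* search window [lo, hi) */

        if (len == 0 || cp < invlist[0])
            return false;
        while (lo + 1 < hi) {       /* plain binary search */
            size_t mid = lo + (hi - lo) / 2;
            if (invlist[mid] <= cp)
                lo = mid;
            else
                hi = mid;
        }
        return lo % 2 == 0;         /* even index => inside a range */
    }

For a 512-element list this loop settles membership in at most 9
probes, which is where the "9 iterations" figure above comes from.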
|
We might as well call the core swash initialization directly: we are
the core here, and the public one merely wraps it.
|
This might keep someone from later attempting this speedup; it didn't
actually help, so I didn't commit it.
|
Experiments have shown that longer hash keys impact performance. See
the thread at
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2012-08/msg00869.html

This patch shortens a key that is used very frequently. There are
other keys in this hash which are used frequently in some
circumstances, but I expect to change the code to use fewer of them
in the future, so I am not changing them now.
|
Now that we have a flags parameter, we can pass this parameter as
just another flag, giving a cleaner interface to this internal-only
function. This also renames the flag parameter to <flag_p> to
indicate that it needs to be dereferenced.
|
A new get method has been written to access the internals of a swash,
so it's best to use it. This also moves the error checking into the
method.
|
This function only does something on EBCDIC platforms. On ASCII
ones, make it a macro, like similar ones, to avoid useless function
nesting.
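
A sketch of the pattern; the names here are hypothetical, and only
the technique is taken from the commit:

    /* On EBCDIC a real translation function is needed; on ASCII
     * platforms the same name collapses to an identity macro, so no
     * extra function call is nested for the common case. */
    #ifdef EBCDIC
    U8 my_native_to_latin1(U8 ch);          /* table lookup in a .c file */
    #else
    #define my_native_to_latin1(ch) (ch)    /* ASCII is already Latin-1 */
    #endif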
|
This revises the API for the version of swash_init() that is usable
by core Perl. The external interface is unaffected. There is now a
flags parameter to allow for future growth. And the core
internal-only function that reported whether or not a swash contains
a user-defined property has been removed. This information is now
returned via the new flags parameter upon initialization, and is
unavailable afterwards. This prepares for the flexibility to change
the swash that is needed in future commits.
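
A hedged sketch of the shape of such an interface; the names, flag
bit, and exact signature below are assumptions, not the real core
API:

    #define MYF_USER_DEFINED  0x01   /* set on return: user-defined property */

    static void
    example(pTHX)
    {
        U8  flags = 0;               /* options in, information out */
        SV *sw = my_core_swash_init("utf8", "IsMyProp", &PL_sv_undef,
                                    1, 0, &flags);
        if (flags & MYF_USER_DEFINED) {
            /* only discoverable here, at initialization time */
        }
        (void)sw;
    }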
|
In looking at \X handling, I noticed that this function, which is
intended for use in it, actually isn't used. This function may
someday be useful, so I'm leaving the source in.
|
\X matches according to a complicated pattern that is hard-coded in
regexec.c. Part of that pattern involves checking if a code point is a
component of a Hangul Syllable or not. For Korean code points, this
involves checking against multiple tables. It turns out that two of
those tables are arranged so that the checks for them can be done via an
arithmetic expression; Unicode publishes algorithms for determining
various characteristics based on their very structured ordering.
This patch converts the routines that check these two tables to instead
use the arithmetic expression.
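
The constants involved are published in the Unicode standard's Hangul
syllable algorithm, which makes the checks one-line arithmetic. The
sketch below illustrates the technique and is not the Perl source:

    /* Unicode's algorithmic Hangul constants: precomposed syllables
     * start at U+AC00, and a syllable's trailing-consonant index is
     * its offset modulo TCOUNT (27 trailing consonants + 1 for none). */
    #define SBASE  0xAC00
    #define SCOUNT 11172
    #define TCOUNT 28

    /* GCB=LV: a precomposed syllable with no trailing consonant */
    static int is_hangul_LV(unsigned cp) {
        return cp >= SBASE && cp < SBASE + SCOUNT
            && (cp - SBASE) % TCOUNT == 0;
    }

    /* GCB=LVT: a precomposed syllable with a trailing consonant */
    static int is_hangul_LVT(unsigned cp) {
        return cp >= SBASE && cp < SBASE + SCOUNT
            && (cp - SBASE) % TCOUNT != 0;
    }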
|
This will be used for things that need to handle inversion lists in
the three files that currently use them. I'm putting this in a
separate hdr, because inversion lists are very internal-only, and so
should not be grouped in with things for which there is an external
API. It is a dot-c file so that the functions can continue to be
declared with embed.fnc, and porting/args_assert.t will continue to
work, as it looks only in .c files.
|
This removes most register declarations in C code (and accompanying
documentation) in the Perl core. Retained are those in the ext
directory, in Configure, and those that are associated with assembly
language.

See:
http://stackoverflow.com/questions/314994/whats-a-good-example-of-register-variable-usage-in-c
which says, in part:

    There is no good example of register usage when using modern
    compilers (read: last 10+ years) because it almost never does any
    good and can do some bad. When you use register, you are telling
    the compiler "I know how to optimize my code better than you do",
    which is almost never the case. One of three things can happen
    when you use register:

    1. The compiler ignores it. This is most likely; the only harm is
       that you cannot take the address of the variable in the code.
    2. The compiler honors your request, and as a result the code
       runs slower.
    3. The compiler honors your request and the code runs faster;
       this is the least likely scenario.

    Even if one compiler produces better code when you use register,
    there is no reason to believe another will do the same. If you
    have some critical code that the compiler is not optimizing well
    enough, your best bet is probably to use assembler for that part
    anyway, but of course do the appropriate profiling to verify that
    the generated code is really a problem first.
|
This should have been written this way to begin with (I'm the culprit).
But we should have a method so another routine doesn't have to know the
internal details.
|
This proved useful when I recently needed to use these for debugging.
|
This warning was being generated inappropriately during some internal
operations, such as parsing a program; spotted by Tom Christiansen.

The solution is to move the check for this situation out of the
common code, and into the code where just \p{} and \P{} are handled.

As mentioned in the commit's perldelta, there remains a bug,
[perl #114148], in which no warning gets generated when one should
be.
|
This creates a function to hide some of the internal details of
swashes from the regex engine, which is the only authorized user, as
enforced through #ifdefs in embed.fnc. The two work closely together,
but it's best to have a clean interface.
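
The gating works roughly as below. This is a sketch of the sort of
declaration embed.fnc causes to be emitted into the generated
headers; both the guard macros and the function name shown are
stand-ins, not a quote of the real output:

    /* Only compilation units that define one of these macros ever see
     * the declaration, so nothing else can call the function. */
    #if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C)
    PERL_CALLCONV SV *_get_swash_invlist(pTHX_ SV *swash);
    #endif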
|
These macros have never worked outside the Latin1 range, so this
extends them to work there.

There are no tests I could find for things in handy.h, except that
many of them are called all over the place during the normal course
of events. This commit adds a new file for such testing, containing
for now only a few tests for the isBLANK macros.
|
to make sure it really is never reached.
|
This routine can never return 0, because if there is no case mapping,
the input is used instead. The code point for that input has already
been derived earlier in the function, so it doesn't have to be
recalculated. Also, rearrange the order of things slightly.
|
In the case changed, the output is the input, so we can just Copy()
it instead of re-deriving it.
|
These new properties are generated for all Unicode releases, and so
\X can now work on all Unicodes, not just the releases for which
Unicode has defined them.
|
This updates the editor hints in our files for Emacs and vim to request
that tabs be inserted as spaces.
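
For reference, the hint blocks at the bottom of Perl's C sources take
roughly this form after the change; this is reproduced from memory,
so treat the details as illustrative:

    /*
     * Local variables:
     * c-indentation-style: bsd
     * c-basic-offset: 4
     * indent-tabs-mode: nil
     * End:
     *
     * ex: set ts=8 sts=4 sw=4 et:
     */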
|
Under /iaa regex matching, folds that cross the ASCII/non-ASCII
boundary are prohibited. This changes the _to_uni_fold_flags() and
_to_utf8_fold_flags() functions to take a new flag which, when set,
tells them not to accept such folds.

This allows us to later move the intelligence for handling this
situation into these centralized functions.
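
Such boundary-crossing folds do exist: U+017F LATIN SMALL LETTER LONG
S, for example, folds to ASCII "s". A hedged sketch of the check; the
flag bit and helper below are hypothetical, and only the two fold
functions named above are real:

    #define MYF_NOMIX_ASCII 0x04    /* hypothetical bit for the new flag */

    static UV
    my_fold(UV cp, U8 *folded, STRLEN *lenp, U8 flags)
    {
        UV fold = lookup_fold(cp, folded, lenp);   /* hypothetical helper */

        /* Under /iaa an ASCII character may not fold to a non-ASCII
         * one, nor vice versa; refuse such folds entirely. */
        if ((flags & MYF_NOMIX_ASCII) && isASCII(cp) != isASCII(fold))
            return cp;              /* caller sees the identity fold */
        return fold;
    }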
|
Probably the C optimizer does this anyway, but do the uncomplicated
test before the (mutually exclusive) complicated test (though the
complications are hidden in a macro). The new first test is a
prerequisite for the new second test anyway.
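
In outline, with stand-in names for the real tests:

    /* && short-circuits, so the complicated test runs only when the
     * cheap one passes; the cheap test is also a precondition that
     * the complicated one relies on. */
    if (cheap_test(uv) && COMPLICATED_TEST(uv))
        handle(uv);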
|
Tell the compiler that malformed input is not likely, so it can optimize
accordingly.
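
The usual way to express this is a branch-prediction hint. A sketch
assuming a GCC-style __builtin_expect is available; the macro and the
helper functions are invented for illustration:

    #if defined(__GNUC__)
    #  define MY_UNLIKELY(x) __builtin_expect(!!(x), 0)
    #else
    #  define MY_UNLIKELY(x) (x)
    #endif

    static UV
    decode(const U8 *s, STRLEN curlen, STRLEN *retlen)
    {
        /* the compiler moves the error path out of the hot code */
        if (MY_UNLIKELY(!is_start_byte(*s)))
            return handle_malformation(s, curlen, retlen);
        return decode_valid(s, curlen, retlen);
    }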
|
This eliminates an intermediate function call by calling the base level
one directly.
|
These two functions are to be called only on strings known to be valid,
so we can skip the validation.
|
This test eliminates all code points less than U+D800 from having to
be checked more than once, at the expense of an extra test for code
points that are larger.
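
That is, one comparison settles the common case, since the
problematic code points (the surrogates and friends) all lie at
U+D800 and above. A sketch of the shape, with hypothetical helpers:

    static bool
    is_problematic(UV uv)
    {
        if (uv < 0xD800)    /* everything below the surrogates: one test */
            return false;
        /* the rarer, larger code points pay for one extra comparison */
        return is_surrogate(uv) || is_noncharacter(uv) || uv > 0x10FFFF;
    }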
|
All code points whose UTF-8 representations start with a byte
containing either \xFE or \xFF are considered problematic, because
they are not portable. Many such code points are too large to
represent on a 32-bit or even a 64-bit platform. Commit
eb83ed87110e41de6a4cd4463f75df60798a9243 failed to properly catch
overflow when the input flags to this function say to warn about, but
otherwise accept, FE and FF sequences. Now overflow is checked for
unconditionally.
|
There are possible overlong sequences (sequences that encode a code
point in more bytes than necessary, such as \xC0\x80 for U+0000) that
this function blindly accepts. Instead of developing the code to
figure this out, turn this function into a wrapper for
utf8n_to_uvuni(), which already has this check.
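
The wrapper pattern is simple. In this sketch, utf8n_to_uvuni() and
UTF8_MAXBYTES are real Perl internals, but the wrapper shown and the
zero flags argument are assumptions:

    UV
    my_utf8_to_uvuni(const U8 *s, STRLEN *retlen)
    {
        /* Delegate to the length-checked decoder, which also rejects
         * overlongs; UTF8_MAXBYTES bounds how far it may read. */
        return utf8n_to_uvuni(s, UTF8_MAXBYTES, retlen, 0);
    }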
|
This outdents to account for the removal of a surrounding block.
|
The prior version had a number of issues, some of which have been
taken care of in previous commits.

The goal when presented with malformed input is to consume as few
bytes as possible, so as to position the input for the next try at
the first possible byte that could be the beginning of a character.
We don't want to consume too few bytes, so that the next call has us
thinking that what is the middle of a character is really the
beginning; nor do we want to consume too many, so as to skip valid
input characters. (The latter is forbidden by the Unicode standard
because of security considerations.) The previous code could do both
of these under various circumstances.

In some cases it took as a given that the first byte in a character
is correct, and skipped looking at the rest of the bytes in the
sequence. This is wrong when just that first byte is garbled. We
have to look at all bytes in the expected sequence to make sure it
hasn't been prematurely terminated relative to what that first byte
led us to expect.

Likewise, when we get an overflow we have to keep looking at each
byte in the sequence. It may be that the initial byte was garbled,
so that it appeared that there was going to be overflow, but in
reality the input was supposed to be a shorter sequence that doesn't
overflow. We want to raise the error on that shorter sequence, and
advance the pointer to just beyond it, which is the first position
where a valid character could start.

This fixes a long-standing TODO from an externally supplied utf8
decode test suite.

Also, the old algorithm for finding overflow failed to detect it on
some inputs. This was spotted by Hugo van der Sanden, who suggested
the new algorithm that this commit uses, and which should work in all
instances. For example, on a 32-bit machine, any string beginning
with "\xFE" whose next byte is either "\x86" or "\x87" overflows, but
this was missed by the old algorithm.

Another bug was that the code was careless about what happens when a
malformation occurs that the input flags allow. For example, a
sequence should not start with a continuation byte. If that
malformation is allowed, the code pretended the byte was a start byte
and extracted the "length" of the sequence from it. But pretending
it is a start byte is not the same thing as it actually being one:
there is no extractable length in it, so the number the code took for
a "length" was bogus.

Yet another bug fixed is that if only the warning subcategories of
the utf8 category were turned on, and not the entire utf8 category
itself, warnings that should have been raised were not.

And yet another change is that, given malformed input with warnings
turned off, this function used to return whatever it had computed so
far, which is incomplete or erroneous garbage. This commit changes it
to return the REPLACEMENT CHARACTER instead.

Thanks to Hugo van der Sanden for reviewing and finding problems with
an earlier version of these commits.
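
One way to realize the new overflow check; this sketch is an
assumption about the implementation, not a quote of it. In Perl's
extended UTF-8, byte-wise lexicographic order of well-formed
sequences tracks code-point order, so a sequence overflows exactly
when it compares greater than the encoding of the largest
representable UV:

    #include <string.h>

    /* UTF-8 encoding of UV_MAX on a 32-bit platform: \xFE introduces
     * a seven-byte sequence, and these continuation bytes spell out
     * 0xFFFFFFFF.  (Illustrative; the real limit depends on the
     * platform's word size.) */
    static const unsigned char highest_uv_utf8[] = {
        0xFE, 0x83, 0xBF, 0xBF, 0xBF, 0xBF, 0xBF
    };

    static int
    does_overflow(const unsigned char *s, size_t len)
    {
        size_t cmplen = sizeof(highest_uv_utf8);
        if (len < cmplen)
            cmplen = len;    /* compare only the bytes seen so far */
        return memcmp(s, highest_uv_utf8, cmplen) > 0;
    }

Under this scheme "\xFE\x86" and "\xFE\x87" already compare greater
than the limit at their second byte, so the 32-bit examples above are
caught immediately.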
|
Prior to this patch, if the first byte of a UTF-8 sequence indicated
that the sequence occupied n bytes, but the input parameters
indicated that fewer were available, the code attempted to read all
n bytes anyway.
|
Some of these were spotted by Hugo van der Sanden.
|
There are two existing macros that do the job that this longish sequence
does. One, UTF8SKIP(), does an array lookup and is very likely to be in
the machine's cache as it is used ubiquitously when processing UTF-8.
The other is a simple test and shift. These simplify the code and
should speed things up as well.
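
The first of those macros is small enough to quote. The definition
below is from utf8.h; the loop after it is illustrative:

    /* Number of bytes in the UTF-8 sequence whose first byte s points
     * to: a 256-entry table lookup. */
    #define UTF8SKIP(s)  PL_utf8skip[*(const U8*)(s)]

    /* Walking a buffer character by character; the table stays hot in
     * cache because nearly all UTF-8 processing goes through it. */
    while (s < send)
        s += UTF8SKIP(s);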
|
The code assumed that all property definitions would be well-formed,
meaning, in part, that they would be numerically sorted by code
point, with each range disjoint from all others. So the code was
just appending each range, as it was found, to the inversion list
being built.

This assumption is true for all definitions generated by mktables,
but it might not be true for user-defined ones. The solution is
merely to stop calling the function that appends, and instead call
the existing function that handles the more general case.

However, that function was not previously used outside the file it is
defined in, so it must now be made public. Also, this whole interface
is considered volatile, so the names of the public functions in it
begin with an underscore, to further discourage XS writers from using
them. Therefore the more general add function is renamed to begin
with an underscore.

And the append function is no longer needed outside the file it is
defined in, so, again to keep XS writers from using it, this commit
makes it static.
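
To see why appending is unsafe, consider a user-defined property
whose definition lists its ranges out of order. The function name
below follows the underscore convention just described, though the
exact signature is an assumption:

    /* mktables output arrives sorted and disjoint, so a plain append
     * keeps the inversion list's invariant.  User code may instead
     * hand us ranges out of order:
     *     0x100  0x1FF
     *     0x000  0x0FF
     * Appending the second range would break the sorted-and-disjoint
     * invariant; the general add searches for the insertion point and
     * merges adjacent or overlapping ranges. */
    invlist = _add_range_to_invlist(invlist, 0x100, 0x1FF);
    invlist = _add_range_to_invlist(invlist, 0x000, 0x0FF);
    /* result: the single range 0x000-0x1FF, invariant intact */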
|
This was the result of assuming that these would not be on unless
the main category was also on.
|
This was deleted by mistake in commit
4b88fb76efce8c436e63b907c9842345d4fa77c7
|
Commit 4b88fb76efce8c436e63b907c9842345d4fa77c7 missed 2 occurrences of
this, one of which is #ifdef'd out.
|
These functions can read beyond the end of their input strings if
presented with malformed UTF-8 input. Perl core code has been converted
to use other functions instead of these.
|
These functions should be used in preference to the old ones which can
read beyond the end of the input string.