This commit changes utf8_length to read the input a word at a time.
The current method of examining the input per character is retained for
shorter strings. The per-word method yields significant time savings
for very long strings, and for typical inputs as well.
The timings vary depending on the average number of bytes per character
in the input. If all our characters were 13 bytes long, this commit
would always be a loser, as we would be advancing 8 bytes (or 4 on
32-bit platforms) per conditional instead of 13. But we don't care
about performance for non-Unicode code points, and the maximum legal
Unicode code point occupies 4 UTF-8 bytes, which makes this a wash on
32-bit platforms but a real gain on 64-bit ones. And, except for emoji,
most text in modern languages is at most 3 bytes per character, with a
significant number of single-byte characters (e.g., for punctuation)
even in non-Latin scripts.
For very long strings we would expect to use 1/8 as many conditionals
if the input is entirely ASCII, 1/4 if it is entirely 2-byte UTF-8, and
1/2 if entirely 4-byte. (For 32-bit systems, the savings are
approximately half this.) Because of set-up and tear-down
complications, these values are limits that are approached the longer
the string is (which is where it matters most).
The per-word method kicks in for input strings 96 bytes and longer.
This value was based on eyeballing cachegrind output, and could be
tweaked, but the difference in time spent on strings this short is
tiny.
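To illustrate the general per-word idea (a minimal sketch only, not the
code in this commit; the helper name, the fixed 64-bit word size, and
the use of GCC/Clang's __builtin_popcountll() are all assumptions of
the sketch): count, a word at a time, how many bytes are UTF-8
continuation bytes (0b10xxxxxx); the character count is then the byte
length minus that.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical helper, not Perl's utf8_length(): count the characters
     * in a buffer of well-formed UTF-8 by reading a word at a time.  The
     * character count is the number of bytes that are NOT continuation
     * bytes (0b10xxxxxx). */
    static size_t
    utf8_char_count_per_word(const unsigned char *s, size_t len)
    {
        const unsigned char *end = s + len;
        size_t continuations = 0;

        /* Process per byte until 's' reaches a word boundary. */
        while (s < end && ((uintptr_t) s & (sizeof(uint64_t) - 1)))
            continuations += (*s++ & 0xC0) == 0x80;

        /* Main loop: one pass per 8 bytes instead of per character. */
        while ((size_t) (end - s) >= sizeof(uint64_t)) {
            uint64_t word;
            memcpy(&word, s, sizeof word);      /* aligned load */

            /* Bit 7 of each byte survives iff that byte is 10xxxxxx:
             * bit 7 set, and bit 6 (moved up by the shift) clear. */
            uint64_t cont = word & ~(word << 1)
                                 & UINT64_C(0x8080808080808080);
            continuations += (size_t) __builtin_popcountll(cont);
            s += sizeof(uint64_t);
        }

        /* Trailing bytes that don't fill a whole word. */
        while (s < end)
            continuations += (*s++ & 0xC0) == 0x80;

        return len - continuations;
    }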
This function does a half-hearted job of checking for UTF-8 validity:
it doesn't do extra work, but it does make sure that the lengths implied
by the start bytes it sees all add up. (It doesn't check that the bytes
in between are all continuation bytes.) In order to preserve this
checking, the new version has to stop the per-word scan a word earlier
than it otherwise would have.
There are complications, as it has to process per byte to get to a word
boundary before reading per word. Here are benchmarks for 2-byte
characters using the best- and worst-case scenarios. (All benchmarks
are for a 64-bit platform.)
Key:
    Ir    Instruction read
    Dr    Data read
    Dw    Data write
    COND  conditional branches
    IND   indirect branches
The numbers represent relative counts per loop iteration, compared to
blead at 100.0%.
Higher is better: for example, using half as many instructions gives 200%,
while using twice as many gives 50%.
Best case 2-byte scenario:
string length 48 characters; 2 bytes per character;
0 bytes after word boundary

                blead     patch
               ------    ------
    Ir         100.00    123.09
    Dr         100.00    130.18
    Dw         100.00    111.44
    COND       100.00    128.63
    IND        100.00    100.00

Worst case 2-byte scenario:
string length 48 characters; 2 bytes per character;
7 bytes after word boundary

                blead     patch
               ------    ------
    Ir         100.00    122.46
    Dr         100.00    129.52
    Dw         100.00    111.07
    COND       100.00    127.65
    IND        100.00    100.00
Very long strings run an order of magnitude fewer instructions than
blead. Here are worst case scenarios (7 bytes after word boundary).
string length 10000000 characters; 1 byte per character

                blead     patch
               ------    ------
    Ir         100.00    814.53
    Dr         100.00   1069.58
    Dw         100.00   3296.55
    COND       100.00   1575.83
    IND        100.00    100.00

string length 5000000 characters; 2 bytes per character

                blead     patch
               ------    ------
    Ir         100.00    408.86
    Dr         100.00    536.32
    Dw         100.00   1698.31
    COND       100.00    788.72
    IND        100.00    100.00

string length 3333333 characters; 3 bytes per character

                blead     patch
               ------    ------
    Ir         100.00    273.64
    Dr         100.00    358.56
    Dw         100.00   1165.55
    COND       100.00    526.35
    IND        100.00    100.00

string length 2500000 characters; 4 bytes per character

                blead     patch
               ------    ------
    Ir         100.00    206.03
    Dr         100.00    269.68
    Dw         100.00    899.17
    COND       100.00    395.17
    IND        100.00    100.00
|
In GH 20435 many typos in our C code were corrected. However, that pull
request was never applied to blead and has since developed merge
conflicts. I extracted diffs for the individual modified files and
applied them with 'git apply', except for four files where patch
conflicts were reported.
Those files were:
handy.h
locale.c
regcomp.c
toke.c
We can handle those in a subsequent commit. Also, I had to run these
two programs to keep 'make test_porting' happy:
$ ./perl -Ilib regen/uconfig_h.pl
$ ./perl -Ilib regen/regcomp.pl regnodes.h
|
We can't put PL_compiling or PL_curcop on the save stack as we don't
have a way to ensure they cross threads properly. This showed up as a
win32 t/op/fork.t failure in the thread based fork emulation layer.
This adds a new save type SAVEt_CURCOP_WARNINGS and macro
SAVECURCOPWARNINGS() to complement SAVEt_COMPILE_WARNINGS and
SAVECOMPILEWARNINGS(). By simply hard-coding where the pointers should
be restored to, we sidestep the issue of which thread we are in.
Thanks to Graham Knop for helping identify that one of my commits was
responsible.
|
With RCPV strings we can use the RCPV_LEN() macro, and
make this logic a little less weird.
|
It's unlikely that perl will be compiled without the LC_CTYPE locale
category being enabled. But if it isn't enabled, there is no sense in
having per-interpreter variables for various conditions in it, nor in
having code that tests those variables.
This commit changes a macro to always yield 'false' when this category
is disabled, adds a new similar macro, and changes some occurrences that
test a variable to use the macros instead. That way the compiler knows
these two conditions can never be true.
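A minimal sketch of the pattern (the macro and variable names below are
placeholders, not the actual diff): when the category is compiled out,
the macro becomes a constant 0, so the compiler can discard both the
test and the guarded code.

    /* Placeholder names; illustration only. */
    #ifdef USE_LOCALE_CTYPE
    #  define IN_CTYPE_UTF8_LOCALE_  (PL_in_utf8_CTYPE_locale)
    #else
    #  define IN_CTYPE_UTF8_LOCALE_  0   /* always false without LC_CTYPE */
    #endif

        /* Call sites then test the macro rather than the variable: */
        if (IN_CTYPE_UTF8_LOCALE_) {
            /* locale-aware handling; dead code when LC_CTYPE is disabled */
        }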
|
This changes utf8_to_bytes() to do a per-word initial scan, before
starting the conversion, to see if the source is actually downgradable.
This is significantly faster than the current per-character scan.
However, the speed advantage evaporates during the actual conversion,
which ends up a wash with the previous scheme.
The net effect is that it finds out more quickly whether the source is
downgradable.
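As a rough illustration of this kind of per-word scan (a sketch only,
not the actual utf8_to_bytes() change; it assumes well-formed UTF-8 on
an ASCII platform, a 64-bit word, and a made-up helper name): in
well-formed UTF-8, a code point above 0xFF is present exactly when some
byte is 0xC4 or higher, so the scan only has to look for such a byte.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    static bool
    utf8_is_downgradable(const unsigned char *s, size_t len)
    {
        const uint64_t HIGH_BITS = UINT64_C(0x8080808080808080);
        const uint64_t ADDEND    = UINT64_C(0x3C3C3C3C3C3C3C3C); /* 0x100-0xC4 */
        const unsigned char *end = s + len;

        /* Per byte until 's' reaches a word boundary. */
        while (s < end && ((uintptr_t) s & (sizeof(uint64_t) - 1)))
            if (*s++ >= 0xC4) return false;

        /* Per word: a byte is >= 0xC4 iff its high bit is set and its low
         * seven bits plus 0x3C carry into bit 7.  The addition cannot carry
         * across byte boundaries: each per-byte sum is at most 0xBB. */
        while ((size_t) (end - s) >= sizeof(uint64_t)) {
            uint64_t word;
            memcpy(&word, s, sizeof word);
            if ((((word & ~HIGH_BITS) + ADDEND) & word & HIGH_BITS) != 0)
                return false;
            s += sizeof(uint64_t);
        }

        /* Trailing bytes. */
        while (s < end)
            if (*s++ >= 0xC4) return false;
        return true;
    }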
Cachegrind yields the following, based on a 100K-character string; in
the non-downgradable case, the character immediately after those 100K
is the only one that is too large:
Key:
    Ir      Instruction read
    Dr      Data read
    Dw      Data write
    COND    conditional branches
    IND     indirect branches
    _m      branch predict miss
    _m1     level 1 cache miss
    _mm     last cache (e.g. L3) miss
    -       indeterminate percentage (e.g. 1/0)
The numbers represent relative counts per loop iteration, compared to
blead at 100.0%.
Higher is better: for example, using half as many instructions gives 200%,
while using twice as many gives 50%.
unicode::bytes_to_utf8_legal_API_test
Downgrading 100K valid characters

                blead    proposed
               ------    --------
    Ir         100.00       99.99
    Dr         100.00      100.03
    Dw         100.00      100.04
    COND       100.00      100.05
    IND        100.00      100.00
    COND_m     100.00       87.25
    IND_m      100.00      100.00
    Ir_m1      100.00      123.25
    Dr_m1      100.00      100.18
    Dw_m1      100.00       99.94
    Ir_mm      100.00      100.00
    Dr_mm      100.00      100.00
    Dw_mm      100.00      100.00
unicode::bytes_to_utf8_illegal
Finding too high a character after 100K valid ones

                blead        fast
               ------    --------
    Ir         100.00      188.91
    Dr         100.00      179.77
    Dw         100.00       66.75
    COND       100.00      278.47
    IND        100.00      100.00
    COND_m     100.00       88.71
    IND_m      100.00      100.00
    Ir_m1      100.00      121.86
    Dr_m1      100.00      100.01
    Dw_m1      100.00      100.03
    Ir_mm      100.00      100.00
    Dr_mm      100.00      100.00
    Dw_mm      100.00      100.00
|
C reserves identifiers with leading underscores for system use; this
commit renames _CHECK_AND_WARN_PROBLEMATIC_LOCALE to
CHECK_AND_WARN_PROBLEMATIC_LOCALE_
|
Most of these have been deprecated for a long time, but we never
bothered to follow through in removing them. This commit does that.
|
This may cause problems when not used correctly, so continue to
restrict it.
|
Don't dereference the pointer before checking that it is ok to do so.
|
This effectively reverts 3ece276e6c0.
It turns out it was a bad idea to make U mean the non-native official
Unicode code points. It may seem to make sense to do so, but it broke
multiple CPAN modules which were using U the previous way.
This commit has no effect on ASCII-platform functioning.
|
This code is identical to a few lines above it
|
This fixes GH #19091
This is from a rebasing error. The two variable assignments were
supposed to have been superseded by the first one in the function and
then removed, but they didn't get removed until now.
|
This commit allows this function to be called with NULL parameters when
the corresponding results are not needed.
|
The code this commit removes made sense when we were using swashes, and
we had to go out to files on disk to find the answers. It used
knowledge of the Unicode character database to skip swaths of scripts
which are caseless.
But now, all that information is stored in C arrays that are paged in
when accessed, and the access is done by a binary search. The
information about those swaths is already in those arrays. The time
spent on the conditionals removed here is better spent executing
iterations of the search, which run in L1 cache.
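For context, a minimal sketch of the kind of lookup involved (an
inversion-list style binary search; illustration only, not the actual
Perl code): the sorted array holds the code points at which membership
in a class flips, so a handful of cache-resident comparisons answer the
question.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* 'list' is sorted ascending; entries at even indexes start ranges
     * that are inside the class, odd indexes start ranges outside it. */
    static bool
    invlist_contains(const uint32_t *list, size_t n, uint32_t cp)
    {
        size_t lo = 0, hi = n;          /* count of entries <= cp */
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (list[mid] <= cp) lo = mid + 1;
            else                 hi = mid;
        }
        /* Largest entry <= cp is at index lo-1; an even index (odd count)
         * means cp falls in an in-class range. */
        return (lo & 1) != 0;
    }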
|
This adds a new function for changing the case of an input code point.
The difference between this and the existing function is that the new
one returns an array of UVs instead of a combination of the first code
point and UTF-8 of the whole thing, a somewhat awkward API that made
more sense when we used swashes. That function is retained for now, at
least, but most of the work is done in the new function.
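A hypothetical illustration of the shape of such an API (the real
function's name and arguments are not shown in this log): the caller
gets the entire mapping as code points plus a count, which handles
mappings that expand to more than one character without any UTF-8 round
trip.

    #include <stddef.h>

    typedef unsigned long UV;    /* stand-in for Perl's UV type */

    /* Made-up example function, illustration only. */
    static size_t
    toupper_cps(UV cp, UV result[3])
    {
        if (cp == 0xDF) {            /* U+00DF sharp s uppercases to "SS" */
            result[0] = 'S';
            result[1] = 'S';
            return 2;
        }
        result[0] = (cp >= 'a' && cp <= 'z') ? cp - 0x20 : cp;  /* toy fallback */
        return 1;
    }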
|
The fancy wrapper macro that does extra things was being used where all
that is needed is the bare libc function. The calling code already does
its own wrapping, and avoids this call unless the bare version is what
is appropriate.
|
Instead of destroying the input by first swapping the bytes, this calls
a base function with the order to use. The non-reverse function is
changed to call the base function with the non-reversed order.
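The pattern looks roughly like this (a sketch with made-up names, not
the actual functions): the base routine is told which byte of each
16-bit unit comes first, so neither wrapper has to modify the caller's
buffer.

    #include <stddef.h>
    #include <stdint.h>

    /* 'out' must have room for bytes/2 units. */
    static size_t
    read_units_base(const unsigned char *p, size_t bytes, int hi_first,
                    uint16_t *out)
    {
        size_t n = 0;
        for (size_t i = 0; i + 1 < bytes; i += 2)
            out[n++] = hi_first ? (uint16_t) ((p[i] << 8) | p[i + 1])
                                : (uint16_t) ((p[i + 1] << 8) | p[i]);
        return n;
    }

    /* The two public-facing variants become thin wrappers: */
    static size_t
    read_units(const unsigned char *p, size_t bytes, uint16_t *out)
    {   return read_units_base(p, bytes, 1, out);   }

    static size_t
    read_units_reversed(const unsigned char *p, size_t bytes, uint16_t *out)
    {   return read_units_base(p, bytes, 0, out);   }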
|
A previous commit has simplified uvoffuni_to_utf8_flags() so that it is
hardly more than the code in this function. So strip out the code and
replace it by a call to uvoffuni_to_utf8_flags().
|
This is unnecessary in a .c file, and the code it referred to has been
moved away.
|
Now that the only callers of this function use the DFA, which
eliminates the need to check for things such as wrong continuation
bytes, the function can be refactored to use a switch statement. This
makes it clearer, shorter, and faster.
The name is changed to indicate its private nature.
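A minimal sketch of the shape of such a refactor (not the actual
function; Perl's extended lengths above 4 bytes are omitted): once the
callers' DFA has already validated the sequence, the decoder can simply
switch on the length and combine the payload bits, with no per-byte
validity checks.

    #include <stddef.h>
    #include <stdint.h>

    static uint32_t
    decode_valid_utf8(const unsigned char *s, size_t len)
    {
        uint32_t uv;

        switch (len) {
          default:                          /* 4-byte sequence */
            uv = ((uint32_t) (s[0] & 0x07) << 18)
               | ((uint32_t) (s[1] & 0x3F) << 12)
               | ((uint32_t) (s[2] & 0x3F) <<  6)
               |             (s[3] & 0x3F);
            break;
          case 3:
            uv = ((uint32_t) (s[0] & 0x0F) << 12)
               | ((uint32_t) (s[1] & 0x3F) <<  6)
               |             (s[2] & 0x3F);
            break;
          case 2:
            uv = ((uint32_t) (s[0] & 0x1F) << 6)
               |             (s[1] & 0x3F);
            break;
          case 1:
            uv = s[0];                      /* plain ASCII */
            break;
        }
        return uv;
    }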
|
This makes it use the fast DFA for this functionality.
|
There are new macros that suffice to make the determination here.
|
This is now generated by regcharclass.pl
|
The new name is more mnemonic.
|
These macros don't need to be macros, as they each are only called from
one place, and that isn't likely to change.
|
The previous commit for EBCDIC paved the way for moving some checks for
a code point being for Perl extended UTF-8 out of places where they
cannot succeed. The resultant simplifications more than compensate for
the two extra case statements added by this commit.
|
Simply by adjusting the case statement labels, and adding an extra case,
the code can avoid checking for a problem on EBCDIC boxes when it would
be impossible for the problem to exist.
|
Having a fast UVOFFUNISKIP() allows this function to be refactored to
simplify it.
This commit continues to shortchange large code points and EBCDIC by a
little. For example, it checks if a 4-byte character is above Unicode,
but no 4-byte characters fit that description in UTF-EBCDIC. This will
be fixed in the next commit, which will prepare for further
enhancements.
|
This will make the next commit easier to follow.
|
This specialized functionality is used to check the validity of Perl's
extended-length UTF-8, which has some idiosyncratic characteristics
compared to the shorter sequences. Because it handles only the extended
sequences, this function doesn't have to consider those differences. It
will be used in the next commit to avoid some work, and eventually to
enable is_utf8_char_helper() to be simplified.
|
One of these functions is now only called from the other, and there is
significant overlap in their logic.
This commit refactors them into a single function, which is half the
code and more straightforward.
|
The sequences here aren't UTF-8, but UTF, since they are I8 in
UTF-EBCDIC terms
|
The code has hard-coded into it the UTF-8 for the highest representable
code point for various platforms and word sizes. The algorithm is to
compare the input sequence to verify it is <= the highest. But the tail
of each of them has some number of the highest possible continuation
byte. We need not look at the tail, as the input cannot be above the
highest possible. This commit shortens the highest string constants and
exits the loop when we get to where the tail used to be.
This change allows for the complete removal of the code that is #ifdef'd
out that would be used when we allow core to use code points up to
UV_MAX.
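The idea in sketch form (illustration only; the constant below uses the
two leading bytes of U+10FFFF's encoding, F4 8F, as a stand-in for the
real platform-specific sequences): keep only the leading bytes of the
highest sequence up to and including its last byte that is not the
maximal continuation byte, and stop comparing there, because no
well-formed input can exceed the all-maximal tail.

    #include <stdbool.h>
    #include <stddef.h>

    /* Stand-in prefix; the full constant would continue with 0xBF bytes.
     * Inputs are assumed well-formed and the same length as the full
     * constant. */
    static const unsigned char HIGHEST_PREFIX_[] = { 0xF4, 0x8F };

    static bool
    fits_under_highest(const unsigned char *s)
    {
        for (size_t i = 0; i < sizeof(HIGHEST_PREFIX_); i++) {
            if (s[i] < HIGHEST_PREFIX_[i]) return true;   /* already below */
            if (s[i] > HIGHEST_PREFIX_[i]) return false;  /* already above */
        }
        /* Equal through the prefix: the remaining bytes of the full
         * constant are all 0xBF, which no continuation byte can exceed. */
        return true;
    }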
|
This makes the code easier to read.
|
This macro is preferred to sizeof()
|
I've always been uncomfortable with the input constraints this function
had. Now that it has been refactored to use a switch(), new cases for
full generality can be added without affecting performance, and some
conditionals that callers needed before calling it can be removed.
The function is renamed to reflect its greater generality.
|
The insight in the previous commit allows this function to become much
more compact.
|
I hadn't previously noticed the underlying symmetry between the
platforms.
|
UTF_MIN_CONTINUATION_BYTE is clearer for use in some contexts
|
This changes only portions of the capitalization, and the new version is
more in keeping with other function names.
|
The previous commit added a convenient place to create a symbol
indicating that the UTF-8 on this platform includes Perl's
nearly-double-length extension. The only platforms where this isn't
needed are 32-bit ASCII ones. This symbol allows removing one place
where EBCDIC needs to be considered, and future commits will use it as
well.
|