| Commit message | Author | Age | Files | Lines |
| |
This commit inlines the simple portion of the dfa that translates from
UTF-8 to code points, used in functions like utf8_to_uvchr_buf.
This dfa has been changed in previous commits so that it is small; it
punts on any problematic input, as well as on 18% of the Hangul syllable
code points. (These still come out faster than in blead.) The smallness
allows it to be inlined, adding <2000 total bytes to the perl text space.
The inlined part never calls anything that needs thread context, so that
parameter can be removed. I decided to remove it also from the
Perl_utf8_to_uvchr_buf() and Perl_utf8n_to_uvchr_error() functions.
There is a small risk that someone is actually using those functions
instead of the documented macros utf8_to_uvchr_buf() and
utf8n_to_uvchr_error(). If so, this can be added back in.
Perl_utf8_to_uvchr_msgs() is entirely removed, but the macro
utf8_to_uvchr_msgs(), which is the normal interface to it, is retained
unchanged; it is marked as unstable anyway.
This change decreases the number of conditional branches in the Perl
statement
my $a = ord("\x{foo}")
where foo is a non-problematic code point by about 11%, except for
ASCII characters, where it is 4%, and those Hangul syllables mentioned
above, where it is 7%. Problematic code points fare much worse here
than in blead. These are the surrogates, non-characters, and
non-Unicode code points. We don't care very much about the speed of
handling these code points, which are mostly considered illegal by
Unicode anyway.
The percentage decrease is higher for just the function itself, as the
measured Perl statement includes unchanged overhead.
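To make the approach concrete, here is a minimal sketch of the general
shape: a small inlined fast path that handles the easy sequences and
punts everything problematic to a full out-of-line decoder. This is not
perl's actual code; the names full_utf8_decode and inline_utf8_decode
are invented, and the real dfa covers more cases.

    #include <stddef.h>

    typedef unsigned char U8;
    typedef unsigned long UV;

    /* The full, out-of-line decoder: handles 3- and 4-byte sequences,
     * malformations, surrogates, and the other problematic cases.
     * (Stands in for the existing function; body not shown.) */
    UV full_utf8_decode(const U8 *s, const U8 *send, size_t *retlen);

    /* The inlined fast path: tiny, and needs no thread context. */
    static inline UV
    inline_utf8_decode(const U8 *s, const U8 *send, size_t *retlen)
    {
        if (*s < 0x80) {                    /* ASCII: one byte */
            *retlen = 1;
            return *s;
        }
        if (*s >= 0xC2 && *s <= 0xDF        /* 2-byte lead: U+0080..U+07FF */
            && send - s >= 2
            && (s[1] & 0xC0) == 0x80)       /* valid continuation byte */
        {
            *retlen = 2;
            return ((UV)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        }
        /* Everything else punts to the full decoder. */
        return full_utf8_decode(s, send, retlen);
    }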
Here are the annotated benchmarks:
Key:
Ir Instruction read
Dr Data read
Dw Data write
COND conditional branches
IND indirect branches
_m branch predict miss
_m1 level 1 cache miss
_mm last cache (e.g. L3) miss
- indeterminate percentage (e.g. 1/0)
The numbers represent raw counts per loop iteration.
translate_utf8_to_uv_007f
my $a = ord("\x{007f}")
blead dfa Ratio %
----- ----- -------
Ir 395.0 370.0 106.8
Dr 122.0 115.0 106.1
Dw 71.0 61.0 116.4
COND 49.0 47.0 104.3
IND 5.0 5.0 100.0
In all the measurements, the branch-miss and cache-miss counts (the _m,
_m1, and _mm rows) were all zeros and unchanged, and are omitted from
this message.
translate_utf8_to_uv_07ff
my $a = ord("\x{07ff}")
blead dfa Ratio %
----- ----- -------
Ir 438.0 390.0 112.3
Dr 128.0 118.0 108.5
Dw 71.0 61.0 116.4
COND 57.0 51.0 111.8
IND 5.0 5.0 100.0
translate_utf8_to_uv_cfff
my $a = ord("\x{cfff}")
This is the highest Hangul syllable that gets the full reduction.
blead dfa Ratio %
----- ----- -------
Ir 457.0 410.0 111.5
Dr 131.0 121.0 108.3
Dw 71.0 61.0 116.4
COND 61.0 55.0 110.9
IND 5.0 5.0 100.0
translate_utf8_to_uv_d000
my $a = ord("\x{d000}")
This is the lowest affected Hangul syllable
blead dfa Ratio %
----- ----- -------
Ir 457.0 443.0 103.2
Dr 131.0 132.0 99.2
Dw 71.0 71.0 100.0
COND 61.0 57.0 107.0
IND 5.0 5.0 100.0
translate_utf8_to_uv_d7ff
my $a = ord("\x{d7ff}")
This is the highest affected Hangul syllable
blead dfa Ratio %
----- ----- -------
Ir 457.0 443.0 103.2
Dr 131.0 132.0 99.2
Dw 71.0 71.0 100.0
COND 61.0 57.0 107.0
IND 5.0 5.0 100.0
translate_utf8_to_uv_d800
my $a = ord("\x{d800}")
This is a surrogate, showing much worse performance, but we don't care
blead dfa Ratio %
----- ----- -------
Ir 457.0 515.0 88.7
Dr 131.0 134.0 97.8
Dw 71.0 73.0 97.3
COND 61.0 75.0 81.3
IND 5.0 5.0 100.0
translate_utf8_to_uv_fdd0
my $a = ord("\x{fdd0}")
This is a non-char, showing much worse performance, but we don't care
blead dfa Ratio %
----- ----- -------
Ir 457.0 548.0 83.4
Dr 131.0 139.0 94.2
Dw 71.0 73.0 97.3
COND 61.0 81.0 75.3
IND 5.0 5.0 100.0
translate_utf8_to_uv_fffd
my $a = ord("\x{fffd}")
blead dfa Ratio %
----- ----- -------
Ir 457.0 410.0 111.5
Dr 131.0 121.0 108.3
Dw 71.0 61.0 116.4
COND 61.0 55.0 110.9
IND 5.0 5.0 100.0
translate_utf8_to_uv_ffff
my $a = ord("\x{ffff}")
This is another non-char, showing much worse performance, but we don't
care
blead dfa Ratio %
----- ----- -------
Ir 457.0 548.0 83.4
Dr 131.0 139.0 94.2
Dw 71.0 73.0 97.3
COND 61.0 81.0 75.3
IND 5.0 5.0 100.0
translate_utf8_to_uv_1fffd
my $a = ord("\x{1fffd}")
blead dfa Ratio %
----- ----- -------
Ir 476.0 430.0 110.7
Dr 134.0 124.0 108.1
Dw 71.0 61.0 116.4
COND 65.0 59.0 110.2
IND 5.0 5.0 100.0
translate_utf8_to_uv_10fffd
my $a = ord("\x{10fffd}")
blead dfa Ratio %
----- ----- -------
Ir 476.0 430.0 110.7
Dr 134.0 124.0 108.1
Dw 71.0 61.0 116.4
COND 65.0 59.0 110.2
IND 5.0 5.0 100.0
translate_utf8_to_uv_110000
my $a = ord("\x{110000}")
This is a non-Unicode code point, showing much worse performance, but we
don't care
blead dfa Ratio %
----- ----- -------
Ir 476.0 544.0 87.5
Dr 134.0 137.0 97.8
Dw 71.0 73.0 97.3
COND 65.0 81.0 80.2
IND 5.0 5.0 100.0
| |
It was a macro that used a trie. This changes it to use the dfa
constructed in previous commits. I didn't bother taking measurements;
a dfa should require fewer conditional branches to be executed for many
code points.
| |
Several problems with this compile option were not caught before 5.28
was frozen.
| |
These have been deprecated since 5.18, and have security issues, as they
can try to read beyond the end of the buffer.
| |
This is like my_atof2(), but with an extra argument signifying the
length of the input string to parse. If that length is 0, it uses
strlen() to determine it.
Then my_atof2() just calls my_atof3() with a zero final parameter.
This commit uses the bulk of the current my_atof2() as the core of
my_atof3(). Changes were needed, however, because that code relied on
NUL-termination in a number of places.
This allows one to convert a string that isn't necessarily
NUL-terminated to an NV.
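A hedged sketch of the delegation described above; the signatures are
simplified (the thread-context parameter is omitted, and the parsing
body is elided), so treat the details as illustrative.

    #include <string.h>             /* strlen() */

    typedef double NV;              /* perl's floating point type, simplified */
    typedef size_t STRLEN;

    char *
    my_atof3(const char *s, NV *value, STRLEN len)
    {
        if (len == 0)               /* 0 means: find the length ourselves */
            len = strlen(s);

        /* ... parse at most 'len' bytes of 's' into '*value' without
         * relying on NUL-termination (parsing body elided) ... */

        return (char *)s + len;     /* placeholder: just past the input */
    }

    char *
    my_atof2(const char *s, NV *value)
    {
        /* The old interface becomes a thin wrapper. */
        return my_atof3(s, value, 0);
    }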
| |
This was using the incorrect formal parameter name. It did not generate
an error because the function declares a variable with that incorrect
name, so the assertion was actually testing the wrong thing.
| |
The previous commits in this series have been preparing to allow the
Devel::Tokenizer::C code to be swapped out for the much smaller perfect
hash code.
| |
This commit causes the looking up of \p{} Unicode properties to be done
without having to use the swash mechanism, with certain exceptions.
This will all be explained in the merge commit.
This commit uses Devel::Tokenizer::C to generate the code that turns
the property strings, treated as keywords, into numbers that the
computer can understand. This mechanism generates relatively large
code; the next commits will replace it with a smaller mechanism.
| |
This function will parse the interior of \p{} Unicode property names in
regular expression patterns.
The design of this function will be to return NULL on the properties it
cannot handle; otherwise it returns an inversion list representing the
property it did find. The current mechanism will be used to handle the
cases where this function returns NULL.
This initial state just has the function always return NULL, so the
existing mechanism is always used. A later commit will add the
functionality that bypasses the existing mechanism.
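A minimal sketch of that fallback protocol; all the names here are
invented for illustration, not perl's actual functions.

    #include <stddef.h>

    typedef struct sv SV;           /* perl's scalar type, opaque here */

    /* New parser: returns an inversion list, or NULL for any property
     * it cannot (yet) handle. */
    SV *parse_uniprop_sketch(const char *name, size_t len);

    /* Existing swash-based mechanism, used as the fallback. */
    SV *swash_property_lookup(const char *name, size_t len);

    static SV *
    lookup_property(const char *name, size_t len)
    {
        SV *invlist = parse_uniprop_sketch(name, len);
        if (invlist)
            return invlist;                         /* new code handled it */
        return swash_property_lookup(name, len);    /* current mechanism */
    }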
| |
These previously were statics in perl.c. A future commit will need
access to them from regcomp.c. We could create an access function in
perl.c so that regcomp.c could use them, or we could move them to
regcomp.c. But moving them there would also make them statics in
re_comp.c, and that would mean two copies. So an access function is
needed in any case. Their use is really unrelated to perl.c, which
merely initializes them, so the access function need not live there.
The most logical home for them is utf8.c, which is described as being
for Unicode things, not just UTF-8 things.
So this commit moves these inversion lists to utf8.c, and creates an
initialization function that is called from perl.c on perl startup.
| |
Commit 3f1866a8f6 assumed that the "A" flag means a function can't be
mathomed. Not true: many funcs were listed in embed.fnc as "A" yet were
in mathoms.c. This caused a missing-symbol link failure on Win32 with
-DNO_MATHOMS, since the "A" mathomed funcs were now put into
perldll.def, while previously they were parsed out of mathoms.c by
makedef.pl. Revise the logic so "b" means instant removal from the
export list on a no-mathoms build.
The embed.fnc "b" flag additions were generated from a missing-symbol
list from my linker; some funcs not in my build/platform config might
need to be "b" flagged in the future. Some funcs, like ASCII_TO_NEED,
were already marked "b" but were still being exported by mistake
because they were also "A".
sv_2bool, sv_eq and sv_collxfrm also needed a "p" flag, or else a
Perl_-less symbol was declared in proto.h. sv_2bool and sv_collxfrm
also failed porting/args_assert.t, so add those macros to mathoms.c.
| |
The code points that Unicode furnishes will always be unsigned. This
changes the constructed tables of Unicode properties to treat them
uniformly as unsigned, avoiding possible signedness compiler warnings
on some systems.
Spotted by Dave Mitchell.
| |
The previous commit completely stopped using this core-only function.
Remove it.
| |
This commit changes the code to use the C data structures generated by
the previous commit to compute which characters fold to a given one.
This is used to find out what things should match under /i.
This now avoids the expensive startup cost of switching to the
perl-level utf8_heavy.pl, loading a file from disk, and constructing a
hash from it.
| |
An inversion map currently is used only for Unicode-range code points,
which can fit in an int, so don't use the space unnecessarily.
| |
We've been burned before by malformed UTF-8 causing us to read outside
the buffer bounds. Here is a case I saw during code inspection; it's
easy to add the buffer end limit.
| |
This commit makes the inversion lists for parsing character names
global instead of interpreter level, so they can be initialized once
per process, and no copies are created upon new thread instantiation.
More importantly, this is another instance where utf8_heavy.pl no
longer needs to be loaded and the definition files read from disk.
| |
The previous commits have caused certain parameters to be ignored in
some calls to these functions. Change them to dummies, so that if a
mistake is made, it can be caught and not propagated.
| |
To match what the file declares it as.
| |
Prior to this commit, if a program wanted to compute the case-change of
a character above 0xFF, the C code would switch to perl, load
lib/utf8_heavy.pl, read another file from disk, and create a hash.
Future references would use the hash, but the start-up cost is quite
large. There are five case-change types: uc, lc, tc, fc, and simple fc.
Only the first one encountered requires loading utf8_heavy.pl, but each
requires switching to it and reading the appropriate file from disk.

This commit changes these functions to use compiled-in C data
structures (inversion maps) to represent the data. To look something up
requires a binary search instead of a hash lookup.

An individual hash lookup tends to be faster than a binary search, but
the differences are small for small sizes. I did some benchmarking some
years ago (see the message of commit
87367d5f9dc9bbf7db1a6cf87820cea76571bf1a), and the results were that
for fewer than 512 entries, the binary search was just as fast as a
hash, if not actually faster. Now I've done some more benchmarks on
blead, using the tool benchmark.pl, which wasn't available back then.
The results below indicate that the differences are minimal up through
2047 entries, which all Unicode properties are well within.

A hash, PL_foldclosures, is still constructed at runtime for regular
expression /i matching; this could be generated at Perl compile time,
as a further enhancement for later. But reading a file from disk is no
longer required to do this.
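As background for the numbers below, here is a minimal sketch of the
lookup being benchmarked: an inversion list stores the code points at
which membership in the property flips, so a binary search for the
containing span answers the question. Illustrative only, not perl's
actual invlist code.

    #include <stdbool.h>
    #include <stddef.h>

    typedef unsigned long UV;

    /* 'list' holds the code points where membership flips; ranges
     * starting at even indices are in the property. */
    static bool
    invlist_contains(const UV *list, size_t n, UV cp)
    {
        size_t lo = 0, hi = n;

        if (n == 0 || cp < list[0])
            return false;

        /* Binary search for the last entry <= cp. */
        while (lo + 1 < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (list[mid] <= cp)
                lo = mid;
            else
                hi = mid;
        }
        return (lo % 2) == 0;   /* even index: inside an included range */
    }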
======================= benchmarking results =======================
Key:
Ir Instruction read
Dr Data read
Dw Data write
COND conditional branches
IND indirect branches
_m branch predict miss
_m1 level 1 cache miss
_mm last cache (e.g. L3) miss
- indeterminate percentage (e.g. 1/0)
The numbers represent raw counts per loop iteration.
"\x{10000}" =~ qr/\p{CWKCF}/"
swash invlist Ratio %
fetch search
------ ------- -------
Ir 2259.0 2264.0 99.8
Dr 665.0 664.0 100.2
Dw 406.0 404.0 100.5
COND 406.0 405.0 100.2
IND 17.0 15.0 113.3
COND_m 8.0 8.0 100.0
IND_m 4.0 4.0 100.0
Ir_m1 8.9 17.0 52.4
Dr_m1 4.5 3.4 132.4
Dw_m1 1.9 1.2 158.3
Ir_mm 0.0 0.0 100.0
Dr_mm 0.0 0.0 100.0
Dw_mm 0.0 0.0 100.0
These were constructed by using the file whose contents are below,
which uses the Unicode property that currently has the largest number
of entries in its inversion list, > 1600. The test was run on blead,
compiled -O2, no debugging, no threads. Then the cut-off boundary for
when we use a hash vs. an inversion list was changed from 512 to 2047,
and the test run again. This yields the difference between a hash fetch
and an inversion list binary search.
===================== The benchmark file is below ===============
    no warnings 'once';
    my @benchmarks;
    push @benchmarks, 'swash' => {
        desc  => '"\x{10000}" =~ qr/\p{CWKCF}/"',
        setup => 'no warnings "once"; my $re = qr/\p{CWKCF}/; my $a = "\x{10000}";',
        code  => '$a =~ $re;',
    };
    \@benchmarks;
| |
This commit will turn UTF-8 on in the returned SV if its string is
legal UTF-8 that contains something besides ASCII and the locale is a
UTF-8 one. It is based on the patch included in the ticket, but is
generalized to handle edge cases.
| |
The recent changes fixed by this commit neglected to take EBCDIC
differences into account.
Mostly, the algorithms apply only to ASCII platforms, so the code is
ifdef'd out on EBCDIC. In a couple of cases the algorithm mostly
applies, so the scope of the ifdefs is smaller.
| |
This moves a block of code out from perly.y into its own function,
because it will shortly be needed in more than one place.
There should be no functional changes.
| |
Daniel Dragan pointed out that this parameter is unused (the commits
that want it didn't get into 5.28), and is causing a table to be
duplicated all over the place, so just remove it for now.
| |
It turns out that it will be convenient in a future commit to have this
function handle NULL input. That also means that every call should use
the return value of this function.
| |
The root cause of this was using a 'char' where it should have been
'U8'. I changed the signatures so that all the related functions take
and return U8's, and the compiler detects what should be cast to/from
char. The functions all deal with byte bit patterns, so unsigned is the
appropriate declaration.
| |
To avoid having to create deferred elements every time a sparse array
is pushed on to the stack, store a magic scalar in the array itself,
which av_exists and refto recognise as not existing.
This means there is only a one-time cost for putting such arrays on
the stack.
It also means that deferred elements that live long enough don’t
start pointing to the wrong array entry if the array gets shifted (or
unshifted/spliced) in the mean time. Instead, the scalar is already
in the array, so it cannot lose its place. This fix only applies
when the array as a whole is pushed on to the stack, but it could be
extended in future commits to apply to other places where we currently
use deferred elements.
| |
This new API function is for use in applications that call alien library
routines that are expecting the old pre-POSIX 2008 locale functionality,
namely a single global locale accessible via setlocale().
This function converts the calling thread to use that global locale, if
not already there.
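A hedged usage sketch from an XS module's point of view; the library
routine legacy_library_init() is invented, and the perl headers provide
switch_to_global_locale().

    #include "EXTERN.h"
    #include "perl.h"
    #include "XSUB.h"

    /* Hypothetical third-party routine that expects the traditional
     * global locale set up via setlocale(). */
    void legacy_library_init(void);

    static void
    prepare_for_legacy_library(void)
    {
        dTHX;   /* fetch thread context for the perl API */

        /* Convert this thread to the single global locale, if it is
         * not already using it, before handing control over. */
        switch_to_global_locale();
        legacy_library_init();
    }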
| |
This function now returns a boolean, and does not want an aTHX
parameter. There should be no impact on code that uses the macro form
to call it.
| |
This (large) commit allows locales to be used in threaded perls on
platforms that support it, including recent Windows and POSIX 2008
ones.
| |
This core-only function is now used only in one file.
| |
khw could not find any modules on CPAN that correctly use the C library
function setlocale(). (The very few that do try misuse the return
value, so they are broken.) This analysis does not include modules that
call non-Perl libraries that may call setlocale().
And a future commit will render the setlocale() function useless in
some configurations on some platforms.
So this commit adds Perl_setlocale() for XS code to call, which is
always effective; but it should not be used to alter the locale except
on platforms where the predefined variable ${^SAFE_LOCALES} evaluates
to 1.
This function is also what POSIX::setlocale() calls to do the real
work.
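A hedged sketch of how XS code might use it; Perl_setlocale() takes the
category and locale just as the C library function does, and the
wrapper name here is merely an example.

    #include "EXTERN.h"
    #include "perl.h"
    #include "XSUB.h"

    #include <locale.h>     /* LC_NUMERIC, ... */

    static const char *
    current_numeric_locale(void)
    {
        /* As with the C library's setlocale(), a NULL locale argument
         * queries without changing anything. */
        return Perl_setlocale(LC_NUMERIC, NULL);
    }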
| |
This is prompted by Encode's needs. When called with the proper
parameter, it returns any warnings instead of displaying them directly.
| |
This is in preparation for the next commit, which will use this code in
multiple places.
| |
This function was written specifically for Encode's needs. My intent is
to eventually make it publicly usable, but since it's new, we should
give some time for it to prove itself.
| |
These two paradigms are each repeated in 4 places. Make them into two
subroutines.
| |
This allows it to return the script of the run.
| |
This is a specialized ANYOF node for use when the code points in it
have characteristics that allow them to be matched with a mask instead
of a bit map. When this happens, the speed up is pretty spectacular:
Key:
Ir Instruction read
Dr Data read
Dw Data write
COND conditional branches
IND indirect branches
The numbers represent raw counts per loop iteration.
Results of ('b' x 10000) . 'a' =~ /[Aa]/
blead mask Ratio %
-------- ------- -------
Ir 153132.0 25636.0 597.3
Dr 40909.0 2155.0 1898.3
Dw 20593.0 593.0 3472.7
COND 20529.0 3028.0 678.0
IND 22.0 22.0 100.0
See the comments in regcomp.c or
http://nntp.perl.org/group/perl.perl5.porters/249001 for a description
of the cases that this new technique can handle. But several common
ones include the C0 controls (on ASCII platforms), [01], [0-7], [Aa] and
any other ASCII case pair.
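To illustrate the underlying trick (a sketch, not the regnode's actual
code): all members of such a set agree on every bit that a chosen mask
keeps, so one AND plus one compare decides membership.

    #include <stdbool.h>

    /* [Aa]: 'A' (0x41) and 'a' (0x61) differ only in the 0x20 bit,
     * so clearing that bit and comparing decides membership. */
    static bool
    matches_Aa(unsigned char c)
    {
        return (c & ~0x20) == 'A';
    }

    /* [0-7]: all eight share the high five bits of '0' (0x30). */
    static bool
    matches_0_to_7(unsigned char c)
    {
        return (c & ~0x07) == '0';
    }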
The set of ASCII characters also could be done with this node instead of
having the special ASCII regnode, reducing code size and complexity.
I haven't investigated the speed loss of doing so.
A NANYOFM node could be created for matching the complements this one
matches.
A pattern like /A/i is not affected by this commit, but the regex
optimizer could be changed to take advantage of it. What would need to
be done is for it to look at the first byte of an EXACTFish node and,
if it's one of the case pairs this handles, to generate a synthetic
start class for it. This would automatically invoke the sped-up code.
| |
In which case, handling is skipped. This is in preparation for a future
commit, which will use this function in a slightly different manner.
| |
For most of the case folding pairs, like [Aa], it is possible to use a
mask to match them word-at-a-time in regrepeat(), so that long sequences
of them are handled with significantly better performance.
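A sketch of the idea, with alignment and tail handling omitted; this is
illustrative, not the actual regrepeat() code. The case bit is folded
across a whole word so eight bytes are tested at once.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* True if all 8 bytes at p are 'a' or 'A'. */
    static bool
    word_is_all_Aa(const unsigned char *p)
    {
        uint64_t w;

        memcpy(&w, p, sizeof w);                    /* safe unaligned load */
        w |= UINT64_C(0x2020202020202020);          /* set the case bit in every byte */
        return w == UINT64_C(0x6161616161616161);   /* eight 'a's */
    }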
| |
There is special code in the function regrepeat() to handle instances
where the pattern to repeat is a single byte. These all can be done
word-at-a-time to significantly increase the performance of long
repeats.
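A sketch of the single-byte variant, again illustrative and with
alignment and tail handling omitted: the byte is replicated across a
word, and whole words are compared until one differs.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* True if all 8 bytes at p equal b. */
    static bool
    word_is_all_byte(const unsigned char *p, unsigned char b)
    {
        uint64_t w;
        const uint64_t pat = b * UINT64_C(0x0101010101010101);

        memcpy(&w, p, sizeof w);    /* safe unaligned load */
        return w == pat;            /* compare eight bytes at once */
    }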
| |
Future commits will use this without regard to platform.
| |
Recent commit 0b08cab0fc46a5f381ca18a451f55cf12c81d966 caused a
function not to be compiled when running on MSVC6, and hence its
callers needed to use an alternative mechanism there. This is easy
enough, but it turns out that there are more opportunities to call this
function. Rather than having each caller know about the MSVC6 problem,
this commit reimplements the function on that platform using a slow,
dumb method, so that knowledge of the issue is confined to this one
function.
| |
This UTF-8 to code point translator variant is to meet the needs of
Encode, and provides XS authors with more general capability than
the other decoders.
| |
See [perl #132766]
| |
Change the signature of all the internal do_trans*() functions to
return Size_t rather than I32, so that the count returned by tr/// can
cope with strings longer than 2GB.
| |
This reverts commit 523d71b314dc75bd212794cc8392eab8267ea744, reinstating
commit 2cdf406af42834c46ef407517daab0734f7066fc. Reversion is not the
way to address the porting problem that motivated that reversion.
|