| Commit message | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
| |
This commit switches to using the C data structures generated by the
previous commit to compute which characters fold to a given one. This is
used to find out what things should match under /i.
This avoids the expensive start-up cost of switching to the perl-level
utf8_heavy.pl, loading a file from disk, and constructing a hash from
it.
|
|
|
|
|
| |
An inversion map is currently used only for Unicode-range code points,
which fit in an int, so don't use the extra space unnecessarily.
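As a rough sketch of the space argument (illustrative names and layout, not
perl's actual structures), a 32-bit field is plenty for anything in the
Unicode range:

    #include <stdint.h>

    /* Unicode code points top out at 0x10FFFF, so both range starts and the
     * mapped values fit in 32 bits; a full 64-bit UV per entry would double
     * the table size for no benefit. */
    typedef struct {
        uint32_t start;   /* first code point of this range        */
        int32_t  value;   /* what the range maps to (or an offset) */
    } invmap_entry;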
|
|
|
|
|
| |
These were for when some of the POSIX character classes were implemented
as swashes, which is no longer the case, so they can be removed.
|
|
|
|
|
|
| |
The previous commits have caused certain parameters to be ignored in
some calls to these functions. Change them to dummies, so that if a
mistake is made, it can be caught and not propagated.
|
|
|
|
|
|
|
| |
These read-only globals can be initialized in perl.c, which allows us to
remove runtime checks that they are initialized. This commit also takes
advantage of the fact that they are now always initialized, using them
as inversion lists and avoiding swash creation.
|
|
|
|
|
|
|
|
| |
This was using the wrong variable: the one used by plain
_is_utf8_idcont().
Since both of these are in mathoms.c and deprecated, this really wasn't
causing an issue in the field.
|
|
|
|
|
|
|
|
|
| |
Now that we prefer inversion lists over swashes, we can just use the
inversion-list functions when we have an inversion list, avoiding the
swash code altogether in these instances.
This commit stops using inversion lists for two internal properties, but
the next commit will restore that.
|
|
|
|
|
|
|
|
| |
Measurements I took in 8946fcd98c63bdc848cec00a1c72aaf232d932a1 indicate
that, at the sizes Unicode inversion lists come in, there is no slowdown
in retrieval of data using an inversion list vs a hash. Converting to
use an inversion list, when possible, avoids the hash-construction
overhead and eventually allows the removal of a bunch of code.
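For reference, a minimal sketch of the retrieval being compared (illustrative
types and names, not perl's code): an inversion list is a sorted array of code
points in which even-numbered indices start ranges that are in the set, so
membership is a single binary search.

    #include <stdbool.h>
    #include <stddef.h>

    typedef unsigned int UV_t;   /* stand-in for perl's UV */

    static bool
    invlist_contains(const UV_t *list, size_t len, UV_t cp)
    {
        size_t lo = 0, hi = len;

        if (len == 0 || cp < list[0])
            return false;

        while (lo + 1 < hi) {            /* invariant: list[lo] <= cp */
            size_t mid = lo + (hi - lo) / 2;
            if (list[mid] <= cp)
                lo = mid;
            else
                hi = mid;
        }
        return (lo % 2) == 0;            /* even index => inside a range */
    }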
|
| |
|
|
|
|
|
|
| |
This adds comments, and some white-space changes, to the case-changing
function modified in 8946fcd98c63bdc848cec00a1c72aaf232d932a1.
|
|
|
|
|
|
|
| |
Commit 8946fcd98c63bdc848cec00a1c72aaf232d932a1 broke the compilation of
utf8.c when perl is compiled against very early Unicode versions, as
some tables it expects don't exist in them. But this is easily solved
by a few #ifdefs.
|
|
|
|
|
| |
Commit 8946fcd98c63bdc848cec00a1c72aaf232d932a1 failed to free a scalar
it created. I meant to do so, but in the end, forgot.
|
| |
Prior to this commit, if a program wanted to compute the case-change of
a character above 0xFF, the C code would switch to perl, loading
lib/utf8_heavy.pl, reading another file from disk, and then creating a
hash. Future references would use the hash, but the start-up cost is
quite large. There are five case-change types: uc, lc, tc, fc, and
simple fc. Only the first one encountered requires loading utf8_heavy.pl
itself, but each requires switching to utf8_heavy and reading the
appropriate file from disk.
This commit changes these functions to use compiled-in C data structures
(inversion maps) to represent the data. Looking something up requires a
binary search instead of a hash lookup.
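A minimal sketch of such a lookup (made-up names and layout, not the generated
tables): the inversion map is a pair of parallel arrays, one of sorted range
starts and one saying what each range maps to, and the binary search finds the
range containing the code point.

    #include <stddef.h>

    typedef unsigned int UV_t;           /* stand-in for perl's UV */

    /* 'ranges' must start at 0 so every code point falls in some range;
     * a map value of -1 is used here to mean "maps to itself", anything
     * else is an offset to add.  Both conventions are illustrative only. */
    static UV_t
    invmap_case_change(const UV_t *ranges, const int *map, size_t len, UV_t cp)
    {
        size_t lo = 0, hi = len;

        while (lo + 1 < hi) {            /* largest i with ranges[i] <= cp */
            size_t mid = lo + (hi - lo) / 2;
            if (ranges[mid] <= cp)
                lo = mid;
            else
                hi = mid;
        }
        return (map[lo] == -1) ? cp : (UV_t)(cp + map[lo]);
    }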
An individual hash lookup tends to be faster than a binary search, but
the differences are small at small sizes. I did some benchmarking some
years ago (see the message of commit
87367d5f9dc9bbf7db1a6cf87820cea76571bf1a), and the results were that for
fewer than 512 entries, the binary search was just as fast as a hash, if
not actually faster. Now I've done some more benchmarks on blead, using
the tool benchmark.pl, which wasn't available back then. The results
below indicate that the differences are minimal up through 2047 entries,
which all Unicode properties are well within.
A hash, PL_foldclosures, is still constructed at runtime for the case of
regular expression /i matching, and this could be generated at Perl
compile time, as a further enhancement for later. But reading a file
from disk is no longer required to do this.
======================= benchmarking results =======================
Key:
Ir Instruction read
Dr Data read
Dw Data write
COND conditional branches
IND indirect branches
_m branch predict miss
_m1 level 1 cache miss
_mm last cache (e.g. L3) miss
- indeterminate percentage (e.g. 1/0)
The numbers represent raw counts per loop iteration.
"\x{10000}" =~ qr/\p{CWKCF}/"
swash invlist Ratio %
fetch search
------ ------- -------
Ir 2259.0 2264.0 99.8
Dr 665.0 664.0 100.2
Dw 406.0 404.0 100.5
COND 406.0 405.0 100.2
IND 17.0 15.0 113.3
COND_m 8.0 8.0 100.0
IND_m 4.0 4.0 100.0
Ir_m1 8.9 17.0 52.4
Dr_m1 4.5 3.4 132.4
Dw_m1 1.9 1.2 158.3
Ir_mm 0.0 0.0 100.0
Dr_mm 0.0 0.0 100.0
Dw_mm 0.0 0.0 100.0
These were constructed by using the file whose contents are below, which
uses the Unicode property that currently has the largest number of
entries in its inversion list, > 1600. The test was run on blead
compiled with -O2, no debugging, no threads. Then the cut-off boundary
for when we use a hash vs an inversion list was changed from 512 to
2047, and the test run again. This yields the difference between a hash
fetch and an inversion-list binary search.
===================== The benchmark file is below ===============
no warnings 'once';
my @benchmarks;
push @benchmarks, 'swash' => {
    desc  => '"\x{10000}" =~ qr/\p{CWKCF}/"',
    setup => 'no warnings "once"; my $re = qr/\p{CWKCF}/; my $a = "\x{10000}";',
    code  => '$a =~ $re;',
};
\@benchmarks;
|
| |
|
|
|
|
| |
The caller is responsible for freeing the memory used by these functions
|
| |
The original test case in this ticket has already been fixed; but
modifying it slightly showed some other issues that are now fixed by
this commit.
The deepest problem is that this code in some paths creates a string to
parse instead of the original pattern. And in some cases, it's not even
the original pattern, but something that had already been created to
parse instead of the pattern. Any messages that are raised should be
output in terms of the original. regcomp.c already has the
infrastructure to handle the case where a message is raised during
parsing of a constructed string, but it can't handle a 2nd-level
constructed string. That was what led to the segfault in the original
ticket. Unrelated fixes caused the original ticket to no longer be
applicable, so this fix adds tests for things that would still cause a
problem.
The method chosen here is to make sure that the string constructed here
to parse is error-free, so no messages will be raised from it. Instead,
the error checking is done while constructing the string, so if what is
being parsed to construct a new string is itself an already-constructed
one, the existing infrastructure handles outputting the message relative
to the original pattern. Since what is being parsed is a series of hex
numbers, it's easy to find their values: just accumulate a total,
shifting 4 bits each time through the loop. A side benefit is that this
fixes some unreported bugs dealing with an input code point that
overflows. Prior to this patch, it would error ungracefully.
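A minimal sketch of that accumulation (standalone C with libc, not the
regcomp.c code; the helper name is made up): one hex digit is folded in per
iteration by shifting the running total left 4 bits, and the shift is refused
once it would overflow.

    #include <stdint.h>
    #include <stdbool.h>
    #include <ctype.h>

    typedef uint64_t UV_t;               /* stand-in for perl's UV */
    #define UV_T_MAX UINT64_MAX

    static bool
    parse_hex_cp(const char *s, const char *end, UV_t *out)
    {
        UV_t cp = 0;

        for (; s < end && isxdigit((unsigned char)*s); s++) {
            unsigned digit = isdigit((unsigned char)*s)
                             ? (unsigned)(*s - '0')
                             : (unsigned)(tolower((unsigned char)*s) - 'a') + 10;
            if (cp > (UV_T_MAX >> 4))    /* another shift would overflow */
                return false;
            cp = (cp << 4) | digit;
        }
        *out = cp;
        return true;                     /* caller always sees a clean value */
    }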
|
| |
Perl has, for some releases now, checked for anomalies in non-UTF-8
locales and raised warnings if any are found when such a locale actually
gets used. This came about because it turns out that vendors ship
defective locale definitions (which Perl has no control over, beyond
reporting them as bugs to the proper places).
I was surprised to stumble across UTF-8 locales that don't adhere
strictly to Unicode, so this commit now checks for such things and
raises an appropriate message.
Some of this is understandable, because Turkish and related languages
have a locale-dependent exemption in the Unicode standard, but other
cases are simply defective locale definitions.
Perl will use the standard Unicode rules, but the user is now warned
that these aren't what the locale specified.
An example is that there are some UTF-8 locales where common punctuation
characters like "," and "$" aren't marked as punctuation.
|
|
|
|
| |
These are spurious warnings, from NetBSD.
|
| |
|
|
|
|
|
| |
This is prompted by Encode's needs. When called with the proper
parameter, it returns any warnings instead of displaying them directly.
|
|
|
|
|
| |
This is in preparation for the next commit, which will use this code in
multiple places.
|
| |
This DFA, which is available on the internet, has the reputation of
being the fastest general translator. This commit changes our translator
to use it as its first stage, modifying the DFA slightly to accept
surrogates and all 4-byte Perl-extended forms. If necessary, the code
drops down into our original translator to handle errors, warnings, and
Perl's extended UTF-8.
It shows some improvement over our base translation:
Key:
Ir Instruction read
Dr Data read
Dw Data write
COND conditional branches
IND indirect branches
_m branch predict miss
- indeterminate percentage (e.g. 1/0)
The numbers represent raw counts per loop iteration.
unicode::utf8n_to_uvchr_0x007f
ord(X)
              blead      dfa   Ratio %
              -----    -----   -------
    Ir        359.0    359.0     100.0
    Dr        111.0    111.0     100.0
    Dw         64.0     64.0     100.0
    COND       42.0     42.0     100.0
    IND         5.0      5.0     100.0
    COND_m      2.0      0.0       Inf
    IND_m       5.0      5.0     100.0

unicode::utf8n_to_uvchr_0x07ff
ord(X)
              blead      dfa   Ratio %
              -----    -----   -------
    Ir        478.0    467.0     102.4
    Dr        132.0    133.0      99.2
    Dw         79.0     78.0     101.3
    COND       63.0     57.0     110.5
    IND         5.0      5.0     100.0
    COND_m      1.0      0.0       Inf
    IND_m       5.0      5.0     100.0

unicode::utf8n_to_uvchr_0xfffd
ord(X)
              blead      dfa   Ratio %
              -----    -----   -------
    Ir        494.0    486.0     101.6
    Dr        134.0    136.0      98.5
    Dw         79.0     78.0     101.3
    COND       67.0     61.0     109.8
    IND         5.0      5.0     100.0
    COND_m      2.0      0.0       Inf
    IND_m       5.0      5.0     100.0

unicode::utf8n_to_uvchr_0x1fffd
ord(X)
              blead      dfa   Ratio %
              -----    -----   -------
    Ir        508.0    505.0     100.6
    Dr        135.0    139.0      97.1
    Dw         79.0     78.0     101.3
    COND       70.0     65.0     107.7
    IND         5.0      5.0     100.0
    COND_m      2.0      1.0     200.0
    IND_m       5.0      5.0     100.0

unicode::utf8n_to_uvchr_0x10fffd
ord(X)
              blead      dfa   Ratio %
              -----    -----   -------
    Ir        508.0    505.0     100.6
    Dr        135.0    139.0      97.1
    Dw         79.0     78.0     101.3
    COND       70.0     65.0     107.7
    IND         5.0      5.0     100.0
    COND_m      2.0      1.0     200.0
    IND_m       5.0      5.0     100.0
Each successive code point tested requires one more byte in its UTF-8
representation than the previous one.
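The borrowed decoder itself is table-driven and too long to reproduce here,
but a hand-rolled sketch of the same idea, a finite-state machine fed one byte
at a time, is below. Note it accepts only strict RFC 3629 UTF-8; it does not
accept the surrogates or Perl-extended forms the modified copy does.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdbool.h>

    enum { S_ACCEPT, S_REJECT, S_1MORE, S_2MORE, S_3MORE,
           S_E0, S_ED, S_F0, S_F4 };

    static int
    next_state(int state, uint8_t b)
    {
        switch (state) {
        case S_ACCEPT:
            if (b <= 0x7F)              return S_ACCEPT;
            if (b >= 0xC2 && b <= 0xDF) return S_1MORE;
            if (b == 0xE0)              return S_E0;   /* excludes overlongs  */
            if (b == 0xED)              return S_ED;   /* excludes surrogates */
            if (b >= 0xE1 && b <= 0xEF) return S_2MORE;
            if (b == 0xF0)              return S_F0;
            if (b >= 0xF1 && b <= 0xF3) return S_3MORE;
            if (b == 0xF4)              return S_F4;   /* caps at 0x10FFFF    */
            return S_REJECT;
        case S_1MORE: return (b >= 0x80 && b <= 0xBF) ? S_ACCEPT : S_REJECT;
        case S_2MORE: return (b >= 0x80 && b <= 0xBF) ? S_1MORE  : S_REJECT;
        case S_3MORE: return (b >= 0x80 && b <= 0xBF) ? S_2MORE  : S_REJECT;
        case S_E0:    return (b >= 0xA0 && b <= 0xBF) ? S_1MORE  : S_REJECT;
        case S_ED:    return (b >= 0x80 && b <= 0x9F) ? S_1MORE  : S_REJECT;
        case S_F0:    return (b >= 0x90 && b <= 0xBF) ? S_2MORE  : S_REJECT;
        case S_F4:    return (b >= 0x80 && b <= 0x8F) ? S_2MORE  : S_REJECT;
        default:      return S_REJECT;
        }
    }

    static bool
    is_valid_utf8(const uint8_t *s, size_t len)
    {
        int state = S_ACCEPT;
        for (size_t i = 0; i < len; i++)
            if ((state = next_state(state, s[i])) == S_REJECT)
                return false;
        return state == S_ACCEPT;        /* rejects a truncated final char */
    }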
|
|
|
|
|
|
| |
This UTF-8 to code point translator variant is to meet the needs of
Encode, and provides XS authors with more general capability than
the other decoders.
|
|
|
|
|
| |
Contrary to what it previously said, it does not croak. This clarifies
what happens if the start and end pointers have the same value.
|
|
|
|
| |
Properly outdent 2 lines
|
|
|
|
|
|
|
| |
It seemed to imply that the bytes making up the char were s..e; they're
actually s..(e-1).
NPD
|
|
|
|
|
|
|
|
|
|
| |
These undocumented functions require the destination buffer to have the
worst-case size. However, that size (previously listed as 3/2 * input)
is wrong for EBCDIC. Correct the comments, and the single use of these
in core.
These functions do not have a way to avoid overflowing, which strikes me
as wrong.
|
|
|
|
|
| |
This changes this function to not put an initial space character in the
returned string.
|
|
|
|
|
| |
This allows things to work properly in the face of embedded NULs.
See the branch merge message for more information.
|
|
|
|
|
|
|
|
|
|
| |
Where the length is known, we can use these functions, which relieve
the programmer and the program reader from having to count characters.
The memFOO functions should also be slightly faster than the strFOO
equivalents.
In some instances in this commit, hard-coded numbers are used. These
come from the 'case' statement values that apply to them.
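The memFOO macros themselves aren't shown here, but the underlying point can
be illustrated with plain libc (hypothetical helper names): when the caller
already knows the length, a comparison against a literal of known size neither
rescans the string nor requires anyone to count its characters.

    #include <string.h>
    #include <stdbool.h>

    static bool
    is_default_str(const char *s)                 /* length not known */
    {
        return strcmp(s, "default") == 0;
    }

    static bool
    is_default_mem(const char *s, size_t len)     /* length already known */
    {
        return len == sizeof("default") - 1
            && memcmp(s, "default", sizeof("default") - 1) == 0;
    }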
|
|
|
|
| |
This was no longer true
|
|
|
|
|
|
|
| |
I asked on p5p if anyone had an opinion about whether to trim
overallocated space in this function, and got no replies.
It seems to me to be best to tidy up upon return.
|
| |
|
|
|
|
|
|
|
|
| |
Commit d819dc506b9fbd0d9bb316e42ca5bbefdd5f1d77 did not fully work. I
swapped the wrong value between native and Unicode/Latin1
representations, and forgot to update the test file.
Hopefully this is now correct.
|
|
|
|
|
|
|
|
| |
This fixes a warning message for EBCDIC. The native character set is
different from Unicode, and needs special handling. I earlier tried to
save an #ifdef, but the resulting warning was hard to test correctly,
and that helped convince me that it would be confusing to anyone trying
to make sense of the message. So, in goes the #ifdef.
|
| |
This implements the restriction of code points to 0..IV_MAX in such a
way that the process doesn't die when presented with input UTF-8 that
evaluates to a larger value. Instead, that input is treated as overflow.
The commit reinstates dying if the program tries to create a character
above IV_MAX some other way (like chr(0xFFFFFFFFFFFFF)), or tries to do
certain operations on one if somehow one did get created.
The long-term goal is to use code points above IV_MAX internally, as
Perl 6 does, so the code and tests are not removed, just commented out.
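A minimal sketch of the policy (stand-in types and names, not perl's API):
overly large decoded input is merely flagged as overflow, while explicitly
constructing such a character stays fatal.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef uint64_t UV_t;               /* stand-in for perl's UV */
    #define IV_T_MAX INT64_MAX           /* stand-in for IV_MAX    */

    /* Input UTF-8 that decodes above IV_MAX: report overflow, don't die. */
    static bool
    decoded_cp_overflows(UV_t decoded)
    {
        return decoded > (UV_t)IV_T_MAX;
    }

    /* Explicitly asking for such a character, e.g. via chr(): still fatal. */
    static void
    check_chr_arg(UV_t requested)
    {
        if (requested > (UV_t)IV_T_MAX) {
            fprintf(stderr, "code point 0x%llx is above IV_MAX\n",
                    (unsigned long long)requested);
            exit(1);
        }
    }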
|
|
|
|
|
|
|
| |
This will be used in the following commit.
One function is made more complicated, so we stop asking it to be
inlined.
|
|
|
|
| |
This is so there are fewer real differences shown in the next commit
|
|
|
|
| |
This makes it harder to think that 0 means a definite FALSE.
|
|
|
|
|
|
| |
This simply moves a function to later in the file. The next commit will
change it to need a definition which, until this commit, came after it
in the file, and so was not available to it.
|
|
|
|
| |
This makes it harder to think that 0 means a definite FALSE.
|
|
|
|
|
|
|
|
|
|
|
| |
Prior to this commit, isFF_OVERLONG() returned a boolean, with 0 also
indicating that there wasn't enough information to make a determination.
While coding, I realized I kept forgetting that 0 wasn't necessarily
definitive. By changing the API to return 3 values, that mistake becomes
much less likely.
This and the next several commits change several other functions that
have the same predicament.
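A sketch of the tri-state shape, using a simpler check than isFF_OVERLONG()
(the enum and function are made up for illustration): a distinct "can't tell
yet" value can no longer be mistaken for a definite "no".

    #include <stdint.h>
    #include <stddef.h>

    typedef enum {
        IS_NOT_OVERLONG =  0,   /* definitely not overlong   */
        IS_OVERLONG     =  1,   /* definitely overlong       */
        NEED_MORE_BYTES = -1    /* input too short to decide */
    } overlong_result;

    /* Is the first character a two-byte overlong (C0/C1 start byte)? */
    static overlong_result
    starts_with_overlong_2byte(const uint8_t *s, size_t len)
    {
        if (len == 0)
            return NEED_MORE_BYTES;
        if (s[0] != 0xC0 && s[0] != 0xC1)
            return IS_NOT_OVERLONG;      /* 2-byte overlongs start C0/C1 only */
        if (len < 2)
            return NEED_MORE_BYTES;      /* start byte alone isn't definitive */
        return (s[1] >= 0x80 && s[1] <= 0xBF) ? IS_OVERLONG : IS_NOT_OVERLONG;
    }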
|
|
|
|
|
| |
This is purely to get vertical alignment, which makes it easier to see
how slightly differently spelled tests differ.
|
|
|
|
|
| |
This just does a small refactor, which I think makes things easier to
understand.
|
|
|
|
|
|
|
|
| |
It turns out that it could incorrectly deem something to be overflowing
or overlong. This fixes that, and changes the test to catch this
possibility. In particular, on 32-bit systems it now detects that a
start byte of FE requires looking at a continuation byte to determine
whether the result overflows.
|
| |
It somehow dawned on me that the code is incorrect for
warning/disallowing very high code points. What is really wanted in the
API is to catch UTF-8 that is not necessarily portable. There are
several classes of this, but I'm referring here to just the code points
that are above the Unicode-defined maximum of 0x10FFFF. These can be
considered non-portable, and there is a mechanism in the API to
warn/disallow these.
However, an earlier standard defined UTF-8 to handle code points up to
2**31-1. Anything above that is using an extension to UTF-8 that has
never been officially recognized. Perl does use such an extension, and
the API is supposed to have a different mechanism to warn/disallow on
it.
Thus there are two classes of warning/disallowing for above-Unicode code
points. One for things that have some non-Unicode official recognition,
and the other for things that have never had official recognition.
UTF-EBCDIC differs somewhat in this, and since Perl 5.24 we have had a
Perl extension that allows it to handle any code point that fits in a
64-bit word. This kicks in at code points above 2**30-1, a different
threshold from where extended UTF-8 kicks in on ASCII platforms.
Things are also complicated by the fact that the API has provisions for
accepting the overlong UTF-8 malformation. It is possible to use
extended UTF-8 to represent code points smaller than 31-bit ones.
Until this commit, the extended warning/disallowing was based on the
resultant code point, and only when that code point did not fit into 31
bits.
But what is really wanted is if extended UTF-8 was used to represent a
code point, no matter how large the resultant code point is. This
differs from the previous definition, but only for EBCDIC platforms, or
when the overlong malformation was also present. So it does not affect
very many real-world cases.
This commit fixes that. It turns out that it is easy to tell whether
something is using extended UTF-8: one just looks at the first byte of
the sequence.
The trailing part of the warning message that gets raised is slightly
changed to be clearer. It's not significant enough to affect perldiag.
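A sketch of that first-byte test for ASCII platforms (the 0xFE/0xFF threshold
follows from the original 31-bit definition topping out at start byte 0xFD;
the EBCDIC threshold differs and isn't shown):

    #include <stdbool.h>
    #include <stdint.h>

    /* A sequence uses Perl's extension exactly when its start byte is 0xFE
     * or 0xFF, regardless of what code point it ends up representing. */
    static bool
    uses_perl_extended_utf8(const uint8_t *s)
    {
        return *s >= 0xFE;
    }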
|
| |
The next commit will fix the detection of using Perl's extended UTF-8 to
be more accurate. The current name of various flags in the API is
somewhat misleading. What one really wants to know is whether extended
UTF-8 was used, not the value of the resultant code point.
This commit basically does
s/ABOVE_31_BIT/PERL_EXTENDED/g
It also similarly changes the name of a hash key in APItest/t/utf8.t.
This intermediary step makes the next commit easier to read.
|
| |
The code handling the UTF-8 overlong malformation must come after the
handling of all the other malformations. This is because it may change
the code point represented to the REPLACEMENT CHARACTER, while the other
malformation code expects the code point to be the original one.
Processing overlongs first could cause failure to catch and report other
malformations, or reporting the wrong value for the erroneous code
point.
What was needed was simply to move the 'if else' branch for overlongs to
after the branches for the other malformations.
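A schematic of the ordering point (placeholder names, not the utf8.c code):
the branches that inspect the decoded value run first, and only then is the
overlong branch allowed to rewrite it.

    #include <stdint.h>

    #define REPLACEMENT_CHARACTER 0xFFFD
    enum { MALF_SURROGATE = 1, MALF_OVERLONG = 2 };

    typedef struct {
        uint32_t cp;                     /* code point as originally decoded */
        unsigned malformations;
    } decode_result;

    static void
    classify(decode_result *r, int was_overlong)
    {
        if (r->cp >= 0xD800 && r->cp <= 0xDFFF)
            r->malformations |= MALF_SURROGATE;     /* sees the original cp */

        if (was_overlong) {                         /* handled last */
            r->malformations |= MALF_OVERLONG;
            r->cp = REPLACEMENT_CHARACTER;          /* only now rewritten   */
        }
    }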
|
|
|
|
|
|
|
| |
For above-Unicode code points, we should use 0xDEADBEEF instead of
U+DEADBEEF (a "0x" prefix rather than "U+"), because "U+" only applies
to Unicode. This only affects a warning message for overlongs.
|