path: root/utf8.c
Commit log (subject; author, date; files changed, lines -/+):
* Use compiled-in C structure for inverted case folds (Karl Williamson, 2018-03-31; 1 file, -0/+60)

This commit changes to use the C data structures generated by the previous commit to compute what characters fold to a given one. This is used to find out what things should match under /i. This now avoids the expensive start-up cost of switching to the perl-level utf8_heavy.pl, loading a file from disk, and constructing a hash from it.

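To make that concrete, below is a minimal, self-contained C sketch of what an inverted-folds lookup can look like: a table, sorted by the folded-to code point, listing the code points that case-fold to it, searched with a binary search. The type, names, and table contents are invented for illustration; they are not Perl's generated structures.

    #include <stdio.h>

    /* Invented "inverted folds" table: for a folded-to code point, the
     * code points that case-fold to it under /i.  Sorted by 'to'. */
    typedef struct {
        unsigned to;          /* folded-to code point          */
        unsigned from[2];     /* code points that fold to it   */
        int      n;           /* how many entries 'from' holds */
    } inv_fold_t;

    static const inv_fold_t inv_folds[] = {
        { 'k', { 'K', 0x212A }, 2 },   /* 0x212A = KELVIN SIGN         */
        { 's', { 'S', 0x17F  }, 2 },   /* 0x17F  = LATIN SMALL LONG S  */
    };

    /* Binary search the table for the entry whose 'to' equals cp */
    static const inv_fold_t *find_inv_fold(unsigned cp)
    {
        int lo = 0, hi = (int)(sizeof inv_folds / sizeof inv_folds[0]) - 1;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            if      (inv_folds[mid].to < cp) lo = mid + 1;
            else if (inv_folds[mid].to > cp) hi = mid - 1;
            else return &inv_folds[mid];
        }
        return NULL;   /* nothing besides cp itself folds to cp */
    }

    int main(void)
    {
        const inv_fold_t *e = find_inv_fold('k');
        for (int i = 0; e && i < e->n; i++)
            printf("U+%04X folds to 'k'\n", e->from[i]);
        return 0;
    }

Because such a table is compiled in, none of utf8_heavy.pl's file loading or hash construction is needed at the first /i match.
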
* regen/mk_invlists.pl: Inversion maps don't have to be IV (Karl Williamson, 2018-03-31; 1 file, -1/+1)

An inversion map is currently used only for Unicode-range code points, which fit in an int, so don't use the extra space unnecessarily.

* Remove obsolete variables (Karl Williamson, 2018-03-31; 1 file, -2/+0)

These were for when some of the POSIX character classes were implemented as swashes, which is no longer the case, so they can be removed.

* utf8.c: Change no-longer-used params to dummies (Karl Williamson, 2018-03-31; 1 file, -15/+23)

The previous commits have caused certain parameters to be ignored in some calls to these functions. Change them to dummies so that, if a mistake is made, it can be caught rather than propagated.

* Move init of 2 inversion lists to perl.c (Karl Williamson, 2018-03-31; 1 file, -21/+7)

These read-only globals can be initialized in perl.c, which allows us to remove runtime checks that they are initialized. This commit also takes advantage of the fact that they are now always initialized, using them as inversion lists and avoiding swash creation.

* Fix bug in mathoms fcn _is_utf8_xidcont() (Karl Williamson, 2018-03-31; 1 file, -1/+1)

This was using the wrong variable, the one used by plain _is_utf8_idcont(). Since both of these are in mathoms.c, and deprecated, this really wasn't causing an issue in the field.

* utf8.c: Avoid calling swash code (Karl Williamson, 2018-03-31; 1 file, -13/+11)

Now that we prefer inversion lists over swashes, we can just use the inversion list functions when we have an inversion list, avoiding the swash code altogether in these instances. This commit stops using inversion lists for two internal properties, but the next commit will restore that.

* utf8.c: Prefer an inversion list over a swash (Karl Williamson, 2018-03-31; 1 file, -11/+4)

Measurements I took in 8946fcd98c63bdc848cec00a1c72aaf232d932a1 indicate that, at the sizes Unicode inversion lists are, there is no slowdown in retrieving data from an inversion list vs. a hash. Converting to use an inversion list, when possible, avoids the hash construction overhead and eventually leads to the removal of a bunch of code.

* utf8.c: Clarify comment (Karl Williamson, 2018-03-31; 1 file, -1/+1)

* utf8.c: Add comments (Karl Williamson, 2018-03-31; 1 file, -2/+25)

This adds comments, and some white-space changes, to the case-changing function modified in 8946fcd98c63bdc848cec00a1c72aaf232d932a1.

* utf8.c: Allow compiling for early Unicode versions (Karl Williamson, 2018-03-31; 1 file, -0/+23)

Commit 8946fcd98c63bdc848cec00a1c72aaf232d932a1 broke the compilation of utf8.c when perl is compiled against very early Unicode versions, as some tables it expects don't exist in them. But this is easily solved by a few #ifdefs.

* utf8.c: fix leak (Karl Williamson, 2018-03-27; 1 file, -0/+2)

Commit 8946fcd98c63bdc848cec00a1c72aaf232d932a1 failed to free a scalar it created. I meant to do so, but in the end, forgot.

* Move UTF-8 case changing data into core (Karl Williamson, 2018-03-26; 1 file, -58/+52)

Prior to this commit, if a program wanted to compute the case-change of a character above 0xFF, the C code would switch to perl, loading lib/utf8_heavy.pl, then read another file from disk, and then create a hash. Future references would use the hash, but the start-up cost is quite large. There are five case-change types: uc, lc, tc, fc, and simple fc. Only the first encountered requires loading utf8_heavy, but each required switching to utf8_heavy and reading the appropriate file from disk.

This commit changes these functions to use compiled-in C data structures (inversion maps) to represent the data. To look something up requires a binary search instead of a hash lookup. An individual hash lookup tends to be faster than a binary search, but the differences are small for small sizes. I did some benchmarking some years ago (commit message 87367d5f9dc9bbf7db1a6cf87820cea76571bf1a), and the results were that for fewer than 512 entries, the binary search was just as fast as a hash, if not actually faster.

Now, I've done some more benchmarks on blead, using the tool benchmark.pl, which wasn't available back then. The results below indicate that the differences are minimal up through 2047 entries, which all Unicode properties are well within.

A hash, PL_foldclosures, is still constructed at runtime for the case of regular expression /i matching, and this could be generated at Perl compile time as a further enhancement for later. But reading a file from disk is no longer required to do this.

======================= benchmarking results =======================

Key:
    Ir      Instruction read
    Dr      Data read
    Dw      Data write
    COND    conditional branches
    IND     indirect branches
    _m      branch predict miss
    _m1     level 1 cache miss
    _mm     last cache (e.g. L3) miss
    -       indeterminate percentage (e.g. 1/0)

The numbers represent raw counts per loop iteration.

"\x{10000}" =~ qr/\p{CWKCF}/

              swash    invlist   Ratio %
              fetch    search
              ------   -------   -------
    Ir        2259.0   2264.0       99.8
    Dr         665.0    664.0      100.2
    Dw         406.0    404.0      100.5
    COND       406.0    405.0      100.2
    IND         17.0     15.0      113.3
    COND_m       8.0      8.0      100.0
    IND_m        4.0      4.0      100.0
    Ir_m1        8.9     17.0       52.4
    Dr_m1        4.5      3.4      132.4
    Dw_m1        1.9      1.2      158.3
    Ir_mm        0.0      0.0      100.0
    Dr_mm        0.0      0.0      100.0
    Dw_mm        0.0      0.0      100.0

These were constructed by using the file whose contents are below, which uses the property in Unicode that currently has the largest number of entries in its inversion list, > 1600. The test was run on blead, -O2, no debugging, no threads. Then the cut-off boundary for when we use a hash vs. an inversion list was changed from 512 to 2047, and the test run again. This yields the difference between a hash fetch and an inversion list binary search.

===================== The benchmark file is below ===============

    no warnings 'once';
    my @benchmarks;

    push @benchmarks, 'swash' => {
        desc  => '"\x{10000}" =~ qr/\p{CWKCF}/"',
        setup => 'no warnings "once"; my $re = qr/\p{CWKCF}/; my $a = "\x{10000}";',
        code  => '$a =~ $re;',
    };

    \@benchmarks;

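As an illustration of the data structure being described, here is a minimal C sketch with invented contents: the ranges are stored as a sorted array of start points, a parallel array holds each range's mapping, and a binary search finds the range containing a code point. This is the lookup whose cost the benchmark above compares against a hash fetch; Perl's real tables are generated by regen/mk_invlists.pl and are far larger.

    #include <stdio.h>

    /* Toy inversion map for uppercasing: entry i covers code points
     * starts[i] .. starts[i+1]-1, and maps each by adding deltas[i]
     * (0 means "no change").  Invented data covering only ASCII and
     * part of Latin-1. */
    static const unsigned starts[] = { 0, 'a',       'z' + 1, 0xE0,        0xF7 };
    static const int      deltas[] = { 0, 'A' - 'a', 0,       0xC0 - 0xE0, 0    };
    static const int n_entries = sizeof starts / sizeof starts[0];

    static unsigned toupper_invmap(unsigned cp)
    {
        int lo = 0, hi = n_entries - 1;
        while (lo < hi) {                  /* binary search: last start <= cp */
            int mid = (lo + hi + 1) / 2;
            if (starts[mid] <= cp) lo = mid;
            else                   hi = mid - 1;
        }
        return cp + deltas[lo];
    }

    int main(void)
    {
        /* U+00E9 (e-acute) uppercases to U+00C9 */
        printf("U+%04X -> U+%04X\n", 0xE9, toupper_invmap(0xE9));
        return 0;
    }
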
* utf8.c: Update comment (Karl Williamson, 2018-03-22; 1 file, -4/+4)

* perlapi: bytes_to_utf8(), from_utf8(): Add clarification (Karl Williamson, 2018-03-19; 1 file, -2/+5)

The caller is responsible for freeing the memory used by these functions.

* PATCH: [perl #132163] regex assertion failure (Karl Williamson, 2018-03-06; 1 file, -0/+2)

The original test case in this ticket has already been fixed, but modifying it slightly showed some other issues that are now fixed by this commit.

The deepest problem is that this code in some paths creates a string to parse instead of the original pattern. And in some cases, it's not even the original pattern, but something that had already been created to parse instead of the pattern. Any messages that are raised should be output in terms of the original. regcomp.c already has the infrastructure to handle the case where a message is raised during parsing of a constructed string, but it can't handle a 2nd-level constructed string. That was what led to the segfault in the original ticket. Unrelated fixes caused the original ticket to no longer be applicable, and so this fix adds tests for things that would still cause a problem.

The method chosen here is to just make sure that the string constructed here to parse is error-free, so no messages will be raised. Instead, it does the error checking as it constructs the string, so if what is being parsed to construct a new string is an already-constructed one, the existing infrastructure handles outputting the message relative to the original pattern. Since what is being parsed is a series of hex numbers, it's easy to find out what their values are: just accumulate a total, shifting 4 bits each time through the loop.

A side benefit is that this fixes some unreported bugs dealing with an input code point that overflows. Prior to this patch, it would error ungracefully.

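A sketch of that accumulate-and-shift idea, with the overflow check noticed as the digits arrive; the type and names are stand-ins for this sketch, not Perl's parser:

    #include <ctype.h>
    #include <stdio.h>

    /* Stand-in for Perl's UV type; hypothetical for this sketch */
    typedef unsigned long my_uv;

    /* Accumulate a run of hex digits 4 bits at a time, noticing overflow
     * before a shift would lose bits, instead of erroring ungracefully
     * afterwards. */
    static int parse_hex(const char *s, my_uv *total, int *overflowed)
    {
        *total = 0;
        *overflowed = 0;
        for (; isxdigit((unsigned char)*s); s++) {
            if (*total >> (sizeof(my_uv) * 8 - 4))  /* top nibble occupied  */
                *overflowed = 1;                    /* next shift overflows */
            *total = (*total << 4)
                   | (my_uv)(isdigit((unsigned char)*s)
                             ? *s - '0'
                             : (*s | 0x20) - 'a' + 10);
        }
        return !*overflowed;
    }

    int main(void)
    {
        my_uv v; int ovf;
        parse_hex("10FFFD", &v, &ovf);
        printf("value=0x%lX overflowed=%d\n", v, ovf);
        return 0;
    }
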
* perlapi: utf8_to_uvuni_buf(): Add clarification (Karl Williamson, 2018-03-04; 1 file, -1/+3)

* locale.c: Check for anomalies in UTF-8 locales (Karl Williamson, 2018-02-18; 1 file, -5/+4)

Perl has, for some releases now, checked for anomalies in non-UTF-8 locales and raised warnings if any are found when such a locale actually gets used. This came about because it turns out that vendors have defective locale definitions (which Perl has no control over, besides reporting them as bugs to the proper places).

I was surprised to stumble across UTF-8 locales that don't adhere strictly to Unicode, and so this commit now checks for such things and raises an appropriate message. Some of this is understandable, because Turkish and related languages have a locale-dependent exemption for them in the Unicode standard, but others are simply defective locale definitions. Perl will use the standard Unicode rules, but the user is now warned that these aren't what the locale specified.

An example is that there are some UTF-8 locales where common punctuation characters like "," and "$" aren't marked as punctuation.

* utf8.c: Silence compiler warnings (Karl Williamson, 2018-02-15; 1 file, -4/+4)

These are spurious warnings from NetBSD.

* perlapi: Rmv nonapplicable text (Karl Williamson, 2018-02-07; 1 file, -5/+3)

* Add uvchr_to_utf8_flags_msgs() (Karl Williamson, 2018-02-07; 1 file, -22/+112)

This is prompted by Encode's needs. When called with the proper parameter, it returns any warnings instead of displaying them directly.

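For XS authors, usage might look like the sketch below. It assumes the perlapi signature U8 *uvchr_to_utf8_flags_msgs(U8 *d, UV uv, UV flags, HV **msgs) and that the returned hash, when non-NULL, carries a "text" entry; check perlapi for the authoritative contract before relying on either.

    #define PERL_NO_GET_CONTEXT
    #include "EXTERN.h"
    #include "perl.h"
    #include "XSUB.h"

    /* Encode a code point into 'buf' (sized UTF8_MAXBYTES + 1 by the
     * caller), collecting any warning instead of letting perl emit it.
     * Sketch only; see perlapi for the authoritative details. */
    static STRLEN encode_quietly(pTHX_ UV uv, U8 *buf)
    {
        HV *msgs = NULL;
        U8 *end  = uvchr_to_utf8_flags_msgs(buf, uv, UNICODE_WARN_SUPER,
                                            &msgs);

        if (msgs) {                             /* a message was generated */
            SV **text = hv_fetchs(msgs, "text", 0);
            if (text)
                PerlIO_printf(Perl_debug_log, "suppressed: %s\n",
                              SvPV_nolen(*text));
            SvREFCNT_dec((SV *) msgs);          /* we own the returned hash */
        }
        return end - buf;
    }
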
* utf8.c: Extract code into separate function (Karl Williamson, 2018-02-07; 1 file, -10/+26)

This is in preparation for the next commit, which will use this code in multiple places.

* Use dfa to speed up translating UTF-8 into code point (Karl Williamson, 2018-01-30; 1 file, -8/+149)

This dfa, available from the internet, has the reputation of being the fastest general translator. This commit changes to use it at the beginning of our translator, modifying it slightly to accept surrogates and all 4-byte Perl-extended forms. If necessary, it drops down into our translator to handle errors, warnings, and Perl extended. It shows some improvement over our base translation:

Key:
    Ir      Instruction read
    Dr      Data read
    Dw      Data write
    COND    conditional branches
    IND     indirect branches
    _m      branch predict miss
    -       indeterminate percentage (e.g. 1/0)

The numbers represent raw counts per loop iteration.

unicode::utf8n_to_uvchr_0x007f ord(X)

             blead    dfa      Ratio %
             -----    -----    -------
    Ir       359.0    359.0      100.0
    Dr       111.0    111.0      100.0
    Dw        64.0     64.0      100.0
    COND      42.0     42.0      100.0
    IND        5.0      5.0      100.0
    COND_m     2.0      0.0        Inf
    IND_m      5.0      5.0      100.0

unicode::utf8n_to_uvchr_0x07ff ord(X)

             blead    dfa      Ratio %
             -----    -----    -------
    Ir       478.0    467.0      102.4
    Dr       132.0    133.0       99.2
    Dw        79.0     78.0      101.3
    COND      63.0     57.0      110.5
    IND        5.0      5.0      100.0
    COND_m     1.0      0.0        Inf
    IND_m      5.0      5.0      100.0

unicode::utf8n_to_uvchr_0xfffd ord(X)

             blead    dfa      Ratio %
             -----    -----    -------
    Ir       494.0    486.0      101.6
    Dr       134.0    136.0       98.5
    Dw        79.0     78.0      101.3
    COND      67.0     61.0      109.8
    IND        5.0      5.0      100.0
    COND_m     2.0      0.0        Inf
    IND_m      5.0      5.0      100.0

unicode::utf8n_to_uvchr_0x1fffd ord(X)

             blead    dfa      Ratio %
             -----    -----    -------
    Ir       508.0    505.0      100.6
    Dr       135.0    139.0       97.1
    Dw        79.0     78.0      101.3
    COND      70.0     65.0      107.7
    IND        5.0      5.0      100.0
    COND_m     2.0      1.0      200.0
    IND_m      5.0      5.0      100.0

unicode::utf8n_to_uvchr_0x10fffd ord(X)

             blead    dfa      Ratio %
             -----    -----    -------
    Ir       508.0    505.0      100.6
    Dr       135.0    139.0       97.1
    Dw        79.0     78.0      101.3
    COND      70.0     65.0      107.7
    IND        5.0      5.0      100.0
    COND_m     2.0      1.0      200.0
    IND_m      5.0      5.0      100.0

Each code point represents an extra byte required in its UTF-8 representation compared to the previous one.

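The shape of such a table-driven decoder, reconstructed from scratch rather than copied from utf8.c or from the published version, is sketched below: one function classifies each byte, a table drives state transitions, and the code point is accumulated 6 bits per continuation byte. The classes, states, and masks are my own simplified strict-UTF-8 construction, not the tables the commit actually uses.

    #include <stdint.h>
    #include <stdio.h>

    enum { ACC = 0, REJ = 8 };     /* accept / reject states */

    /* Map a byte to a character class 0..11 (strict RFC 3629 UTF-8) */
    static uint8_t cls(uint8_t b)
    {
        if (b <  0x80) return 0;    /* ASCII                       */
        if (b <  0x90) return 1;    /* continuation 80..8F         */
        if (b <  0xA0) return 2;    /* continuation 90..9F         */
        if (b <  0xC0) return 3;    /* continuation A0..BF         */
        if (b <  0xC2) return 11;   /* C0, C1: always overlong     */
        if (b <  0xE0) return 4;    /* 2-byte lead                 */
        if (b == 0xE0) return 5;    /* needs A0..BF next           */
        if (b == 0xED) return 7;    /* surrogate guard: 80..9F     */
        if (b <  0xF0) return 6;    /* other 3-byte leads          */
        if (b == 0xF0) return 8;    /* needs 90..BF next           */
        if (b <  0xF4) return 9;    /* F1..F3                      */
        if (b == 0xF4) return 10;   /* needs 80..8F next           */
        return 11;                  /* F5..FF: invalid             */
    }

    /* next[state][class]; states 1..7 encode what may follow a lead */
    static const uint8_t next[9][12] = {
    /* cls:   0  1  2  3  4  5  6  7  8  9 10 11 */
    /* S0 */ {0, 8, 8, 8, 1, 2, 3, 4, 5, 6, 7, 8},
    /* S1 */ {8, 0, 0, 0, 8, 8, 8, 8, 8, 8, 8, 8},  /* 1 cont left   */
    /* S2 */ {8, 8, 8, 1, 8, 8, 8, 8, 8, 8, 8, 8},  /* after E0      */
    /* S3 */ {8, 1, 1, 1, 8, 8, 8, 8, 8, 8, 8, 8},  /* 2 conts left  */
    /* S4 */ {8, 1, 1, 8, 8, 8, 8, 8, 8, 8, 8, 8},  /* after ED      */
    /* S5 */ {8, 8, 3, 3, 8, 8, 8, 8, 8, 8, 8, 8},  /* after F0      */
    /* S6 */ {8, 3, 3, 3, 8, 8, 8, 8, 8, 8, 8, 8},  /* 3 conts left  */
    /* S7 */ {8, 3, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8},  /* after F4      */
    /* S8 */ {8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8},  /* reject sticks */
    };

    /* Payload mask for a lead byte of each class */
    static const uint8_t mask[12] = { 0x7F, 0, 0, 0, 0x1F, 0x0F, 0x0F,
                                      0x0F, 0x07, 0x07, 0x07, 0 };

    static uint32_t step(uint32_t *state, uint32_t *cp, uint8_t b)
    {
        uint8_t c = cls(b);
        *cp = (*state == ACC) ? (uint32_t)(b & mask[c])   /* lead byte */
                              : (*cp << 6) | (b & 0x3F);  /* continuation */
        return *state = next[*state][c];
    }

    int main(void)
    {
        const uint8_t s[] = { 0xF0, 0x9F, 0x98, 0x80 };   /* U+1F600 */
        uint32_t state = ACC, cp = 0;
        for (size_t i = 0; i < sizeof s; i++)
            step(&state, &cp, s[i]);
        printf(state == ACC ? "U+%04X\n" : "malformed\n", cp);
        return 0;
    }
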
* Add utf8n_to_uvchr_msgs() (Karl Williamson, 2018-01-30; 1 file, -18/+129)

This UTF-8-to-code-point translator variant is to meet the needs of Encode, and provides XS authors with more general capability than the other decoders.

* Fix and clarify the pod for utf8_length() (Karl Williamson, 2017-11-26; 1 file, -3/+6)

Contrary to what it previously said, it does not croak. This clarifies what happens if the start and end pointers have the same value.

* utf8.c: White-space only (Karl Williamson, 2017-11-23; 1 file, -2/+2)

Properly outdent 2 lines.

* clarify the pod for Perl_utf8_length() (David Mitchell, 2017-11-16; 1 file, -2/+2)

It seemed to imply that the bytes making up the char were s..e; they're actually s..(e-1). NPD

* Dest buffer needs to be bigger for utf16_to_utf8() (Karl Williamson, 2017-11-08; 1 file, -3/+12)

These undocumented functions require the destination buffer to have the worst-case size. However, that size (previously listed as 3/2 * input) is wrong for EBCDIC. Correct the comments and the single use of these in core. These functions have no way to avoid overflowing, which strikes me as wrong.

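The sizing concern can be sketched as below. A per-unit worst case of 3 output bytes holds for UTF-8 on ASCII platforms (a BMP code point from one 2-byte UTF-16 unit); the EBCDIC figure here is an assumption for illustration, since UTF-EBCDIC continuation bytes carry fewer payload bits, and the comment fixed by this commit is the authoritative number.

    #include <stdlib.h>

    /* Worst-case output bytes per 2-byte UTF-16 code unit.  3 is the
     * UTF-8/ASCII-platform bound; the EBCDIC value below is an assumed
     * figure for this sketch, not the one utf8.c settled on. */
    #ifdef EBCDIC
    #  define MAX_BYTES_PER_UTF16_UNIT 4   /* assumed UTF-EBCDIC bound */
    #else
    #  define MAX_BYTES_PER_UTF16_UNIT 3
    #endif

    /* The conversion cannot report that it ran out of room, so the
     * caller must allocate for the worst case up front. */
    static unsigned char *alloc_utf16_to_utf8_dest(size_t utf16_bytes)
    {
        size_t units = utf16_bytes / 2;
        return (unsigned char *) malloc(units * MAX_BYTES_PER_UTF16_UNIT + 1);
    }
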
* _byte_dump_string(): Don't output leading space (Karl Williamson, 2017-11-08; 1 file, -5/+8)

This changes this function to not put an initial space character in the returned string.

* utf8.c: Use memchr instead of strchr (Karl Williamson, 2017-11-06; 1 file, -1/+1)

This allows things to work properly in the face of embedded NULs. See the branch merge message for more information.

* Use memEQs, memNEs in core files (Karl Williamson, 2017-11-06; 1 file, -11/+3)

Where the length is known, we can use these functions, which relieve the programmer and the program reader from having to count characters. The memFOO functions should also be slightly faster than the strFOO equivalents. In some instances in this commit, hard-coded numbers are used; these come from the 'case' statement values that apply to them.

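The idea behind memEQs is that when one side's length is known and the other side is a string literal, sizeof can supply the literal's length, so nobody counts characters by hand. A standalone imitation of the Perl macro, for illustration only:

    #include <stdio.h>
    #include <string.h>

    /* Imitation of Perl's memEQs(s1, l, "lit"): compare a pointer plus
     * known length against a string literal, letting sizeof supply the
     * literal's length instead of a hand-counted constant. */
    #define my_memEQs(s1, l, lit) \
        ((l) == sizeof(lit) - 1 \
         && memcmp((s1), ("" lit ""), sizeof(lit) - 1) == 0)

    int main(void)
    {
        const char *name = "utf8";
        size_t len = 4;

        if (my_memEQs(name, len, "utf8"))   /* no strlen, no magic number */
            puts("matched");
        return 0;
    }
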
* utf8.c: Rmv obsolete comment (Karl Williamson, 2017-11-01; 1 file, -1/+0)

This was no longer true.

* bytes_to_utf8(): Trim unused malloc'd space (Karl Williamson, 2017-10-28; 1 file, -0/+5)

I asked on p5p if anyone had an opinion about whether to trim over-allocated space in this function, and got no replies. It seems to me best to tidy up upon return.

* utf8.c: Use mnemonic for repeatedly used number (Karl Williamson, 2017-09-09; 1 file, -2/+4)

* utf8.c: EBCDIC fix (Karl Williamson, 2017-08-09; 1 file, -3/+3)

Commit d819dc506b9fbd0d9bb316e42ca5bbefdd5f1d77 did not fully work. I switched the wrong thing between native and Unicode/Latin1, and forgot to update the test file. Hopefully this is correct.

* utf8_to_uvchr() EBCDIC fix (Karl Williamson, 2017-08-05; 1 file, -5/+11)

This fixes a warning message for EBCDIC. The native character set is different from Unicode and needs special handling. I earlier tried to save an #ifdef, but the resulting warning was hard to test correctly, and that helped convince me it would be confusing to anyone trying to make sense of the message. So, in goes the #ifdef.

* Forbid above-IV_MAX code points (Karl Williamson, 2017-07-12; 1 file, -108/+102)

This implements the restriction of code points to 0..IV_MAX in such a way that the process doesn't die when presented with input UTF-8 that evaluates to a larger one. Instead, it is treated as overflow.

The commit reinstates causing the offending process to die if it somehow tries to create a character above IV_MAX (like chr(0xFFFFFFFFFFFFF)), or tries to do certain operations on one if somehow one did get created.

The long-term goal is to use code points above IV_MAX internally, as Perl 6 does. So code and tests are not removed, just commented out.

* utf8.c: Change 2 static fcns to handle overlongs (Karl Williamson, 2017-07-12; 1 file, -60/+139)

This will be used in the following commit. One function is made more complicated, so we stop asking for it to be inlined.

* utf8.c: Move and slightly change comment block (Karl Williamson, 2017-07-12; 1 file, -15/+18)

This is so there are fewer real differences shown in the next commit.

* utf8.c: Generalize static fcn return for indeterminate result (Karl Williamson, 2017-07-12; 1 file, -26/+37)

This makes it harder to think that 0 means a definite FALSE.

* utf8.c: Move a fcn within the file (Karl Williamson, 2017-07-12; 1 file, -76/+76)

This simply moves a function to later in the file. The next commit will change it to need a definition which, until this commit, came after it in the file, and so was not available to it.

* utf8.c: Generalize static fcn return for indeterminate result (Karl Williamson, 2017-07-12; 1 file, -15/+20)

This makes it harder to think that 0 means a definite FALSE.

* utf8.c: Generalize static fcn return for indeterminate result (Karl Williamson, 2017-07-12; 1 file, -8/+25)

Prior to this commit, isFF_OVERLONG() returned a boolean, with 0 also indicating that there wasn't enough information to make a determination. I realized that I was forgetting that 0 wasn't necessarily definitive while coding. By changing the API to return 3 values, forgetting that won't likely happen. This and the next several commits change several other functions that have the same predicament.

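The API change can be sketched as follows: a boolean whose 0 conflates "no" with "can't tell" becomes a three-valued return that neither the reader nor the caller can mistake. The names and the decision criterion below are invented for the sketch, not what isFF_OVERLONG() actually tests.

    #include <stdio.h>

    /* Hypothetical three-valued result, replacing a boolean whose 0
     * ambiguously meant both "false" and "not enough input to decide". */
    typedef enum {
        IS_NOT    =  0,   /* definitely not                            */
        IS_SO     =  1,   /* definitely so                             */
        CANT_TELL = -1    /* input too short to make a determination   */
    } tristate_t;

    /* Toy check in the spirit of isFF_OVERLONG(): a sequence starting
     * with 0xFF needs more bytes before overlong-ness can be decided. */
    static tristate_t is_ff_overlong_sketch(const unsigned char *s,
                                            size_t len)
    {
        if (len == 0 || s[0] != 0xFF) return IS_NOT;
        if (len < 2)                  return CANT_TELL; /* was silently 0 */
        return s[1] == 0x80 ? IS_SO : IS_NOT;           /* invented test  */
    }

    int main(void)
    {
        unsigned char seq[] = { 0xFF };
        if (is_ff_overlong_sketch(seq, 1) == CANT_TELL)
            puts("need more bytes to decide");
        return 0;
    }
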
* utf8.c: Reorder two 'if' clauses (Karl Williamson, 2017-07-12; 1 file, -5/+5)

This is purely to get vertical alignment that makes it easier to see the slightly differently spelled tests.

* utf8.c: Slightly simplify some code (Karl Williamson, 2017-07-12; 1 file, -11/+9)

This just does a small refactor, which I think makes things easier to understand.

* utf8n_to_uvchr(): Properly handle extremely high code points (Karl Williamson, 2017-07-12; 1 file, -3/+4)

It turns out that it could incorrectly deem something to be overflowing or overlong. This fixes that, and changes the test to catch this possibility. With the bug fixed, on 32-bit systems it now detects that if you have a start byte of FE, you need a continuation byte to determine whether the result overflows.

* utf8n_to_uvchr(): Properly test for extended UTF-8 (Karl Williamson, 2017-07-12; 1 file, -94/+107)

It somehow dawned on me that the code is incorrect for warning about/disallowing very high code points. What is really wanted in the API is to catch UTF-8 that is not necessarily portable. There are several classes of this, but I'm referring here to just the code points that are above the Unicode-defined maximum of 0x10FFFF. These can be considered non-portable, and there is a mechanism in the API to warn about/disallow these.

However, an earlier standard defined UTF-8 to handle code points up to 2**31-1. Anything above that is using an extension to UTF-8 that has never been officially recognized. Perl does use such an extension, and the API is supposed to have a different mechanism to warn about/disallow it. Thus there are two classes of warning/disallowing for above-Unicode code points: one for things that have some non-Unicode official recognition, and the other for things that have never had official recognition.

UTF-EBCDIC differs somewhat in this, and since Perl 5.24 we have had a Perl extension that allows it to handle any code point that fits in a 64-bit word. This kicks in at code points above 2**30-1, a different threshold than where extended UTF-8 kicks in on ASCII platforms. Things are also complicated by the fact that the API has provisions for accepting the overlong UTF-8 malformation, and it is possible to use extended UTF-8 to represent code points smaller than 31-bit ones.

Until this commit, the extended warning/disallowing was based on the resultant code point, and only when that code point did not fit into 31 bits. But what is really wanted is whether extended UTF-8 was used to represent a code point, no matter how large the resultant code point is. This differs from the previous definition, but only on EBCDIC platforms, or when the overlong malformation was also present, so it does not affect very many real-world cases. This commit fixes that. It turns out that it is easier to tell if something is using extended UTF-8: one just looks at the first byte of a sequence.

The trailing part of the warning message that gets raised is slightly changed to be clearer. It's not significant enough to affect perldiag.

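The first-byte observation can be sketched as follows. The ASCII-platform start bytes shown (0xFE and 0xFF, since the original UTF-8 definition's six-byte sequences use start bytes only up to 0xFD) are my reading of the encoding, offered as an illustration rather than a quote from utf8.c:

    #include <stdio.h>

    /* Whether a sequence uses Perl's unofficial UTF-8 extension can be
     * told from its first byte alone on ASCII platforms: start bytes
     * 0xFE and 0xFF lie beyond the original 2**31-1 definition. */
    static int uses_perl_extended_utf8(unsigned char start_byte)
    {
        return start_byte >= 0xFE;
    }

    int main(void)
    {
        printf("0xF4: %d\n", uses_perl_extended_utf8(0xF4)); /* 0: Unicode  */
        printf("0xFE: %d\n", uses_perl_extended_utf8(0xFE)); /* 1: extended */
        return 0;
    }
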
* utf8.h: Add synonyms for flag names (Karl Williamson, 2017-07-12; 1 file, -22/+22)

The next commit will fix the detection of using Perl's extended UTF-8 to be more accurate. The current name for various flags in the API is somewhat misleading. What is really wanted to know is whether extended UTF-8 was used, not the value of the resultant code point.

This commit basically does s/ABOVE_31_BIT/PERL_EXTENDED/g. It also similarly changes the name of a hash key in APItest/t/utf8.t. This intermediary step makes the next commit easier to read.

* utf8.c: Fix bugs with overlongs combined with other malformations (Karl Williamson, 2017-07-12; 1 file, -58/+60)

The code handling the UTF-8 overlong malformation must come after handling all the other malformations. This is because it may change the code point represented to the REPLACEMENT CHARACTER, while the other malformation code expects the code point to be the original one. Doing otherwise may cause failure to catch and report other malformations, or report the wrong value of the erroneous code point. What was needed was simply to move the 'if else' branch for overlongs to after the branches for the other malformations.

* utf8n_to_uvchr(): U+ should be for only Unicode code points (Karl Williamson, 2017-07-12; 1 file, -1/+5)

For above-Unicode code points, we should use the 0x prefix (0xDEADBEEF) instead of U+ (U+DEADBEEF), because U+ applies only to Unicode. This only affects a warning message for overlongs.

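The prefix choice boils down to a single branch; plain printf stands in for Perl's warning machinery, and the message wording here is invented for the sketch:

    #include <stdio.h>

    /* U+ only for code points Unicode actually defines; 0x for anything
     * above the Unicode maximum of 0x10FFFF. */
    static void warn_cp(unsigned long cp)
    {
        if (cp > 0x10FFFF)
            printf("code point 0x%lX\n", cp);    /* above Unicode */
        else
            printf("code point U+%04lX\n", cp);  /* Unicode range */
    }

    int main(void)
    {
        warn_cp(0xFFFDUL);       /* prints U+FFFD      */
        warn_cp(0xDEADBEEFUL);   /* prints 0xDEADBEEF  */
        return 0;
    }
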