path: root/utf8.c

Commit message  (Author, Date, Files, Lines changed)

* utf8.c - add category parameter to unused warn_on_first_deprecated_use function  (Yves Orton, 2023-03-18, 1 file, -4/+5)

* Use per-word calcs in utf8_length()  (Karl Williamson, 2023-02-08, 1 file, -20/+130)

  This commit changes utf8_length() to read the input a word at a time.
  The current per-character method is retained for shorter strings.

  The per-word method yields significant time savings for very long
  strings and typical inputs. The timings vary depending on the average
  number of bytes per character in the input. If all our characters were
  13 bytes, this commit would always be a loser, as we would be
  processing per 8 (or 4 on 32-bit platforms) instead of 13. But we
  don't care about performance for non-Unicode code points, and the
  maximum legal Unicode code point occupies 4 UTF-8 bytes, which means
  this is a wash on 32-bit platforms, but a real gain on 64-bit ones.
  And, except for emoji, most text in modern languages is 3 bytes max,
  with a significant share of single-byte characters (e.g., for
  punctuation) even in non-Latin scripts.

  For very long strings we would expect to use 1/8 the conditionals if
  the input is entirely ASCII, 1/4 if entirely 2-byte UTF-8, and 1/2 if
  entirely 4-byte. (For 32-bit systems, the savings is approximately
  half this.) Because of set-up and tear-down complications, these
  values are limits that are approached the longer the string is (which
  is where it matters most).

  The per-word method kicks in for input strings 96 bytes and longer.
  This value was based on eyeballing some cachegrind output, and could
  be tweaked, but the differences in time spent on strings this short
  are tiny.

  This function does a half-hearted job of checking for UTF-8 validity;
  it doesn't do extra work, but it makes sure that the lengths implied
  by the start bytes it sees all add up. (It doesn't check that the
  bytes in between are all continuation bytes.) In order to preserve
  this checking, the new version has to stop per-word looking a word
  earlier than it otherwise would have. There are complications, as it
  has to process per-byte to get to a word boundary before reading
  per-word.

  Here are benchmarks for 2-byte characters using the best and worst
  case scenarios. (All benchmarks are for a 64-bit platform.)

  Key:
    Ir    Instruction read
    Dr    Data read
    Dw    Data write
    COND  conditional branches
    IND   indirect branches

  The numbers represent relative counts per loop iteration, compared to
  blead at 100.0%. Higher is better: for example, using half as many
  instructions gives 200%, while using twice as many gives 50%.

  Best case 2-byte scenario: string length 48 characters; 2 bytes per
  character; 0 bytes after word boundary

            blead   patch
           ------  ------
    Ir     100.00  123.09
    Dr     100.00  130.18
    Dw     100.00  111.44
    COND   100.00  128.63
    IND    100.00  100.00

  Worst case 2-byte scenario: string length 48 characters; 2 bytes per
  character; 7 bytes after word boundary

            blead   patch
           ------  ------
    Ir     100.00  122.46
    Dr     100.00  129.52
    Dw     100.00  111.07
    COND   100.00  127.65
    IND    100.00  100.00

  Very long strings run an order of magnitude fewer instructions than
  blead. Here are worst case scenarios (7 bytes after word boundary):

  string length 10000000 characters; 1 byte per character

            blead    patch
           ------   ------
    Ir     100.00   814.53
    Dr     100.00  1069.58
    Dw     100.00  3296.55
    COND   100.00  1575.83
    IND    100.00   100.00

  string length 5000000 characters; 2 bytes per character

    Ir     100.00   408.86
    Dr     100.00   536.32
    Dw     100.00  1698.31
    COND   100.00   788.72
    IND    100.00   100.00

  string length 3333333 characters; 3 bytes per character

    Ir     100.00   273.64
    Dr     100.00   358.56
    Dw     100.00  1165.55
    COND   100.00   526.35
    IND    100.00   100.00

  string length 2500000 characters; 4 bytes per character

    Ir     100.00   206.03
    Dr     100.00   269.68
    Dw     100.00   899.17
    COND   100.00   395.17
    IND    100.00   100.00

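  The word-at-a-time counting idea is easier to see in isolation. The
  following is a minimal C sketch of the general technique, not Perl's
  actual code: it counts characters by subtracting continuation bytes
  (0b10xxxxxx) from the byte count, one word per iteration, and assumes
  the alignment handling described above has already been dealt with.

      #include <stddef.h>
      #include <stdint.h>
      #include <string.h>

      /* Sketch only: count the UTF-8 characters in 'len' bytes at 's'.
       * Each character contributes exactly one non-continuation byte,
       * so the count is 'len' minus the number of continuation bytes. */
      static size_t
      utf8_length_sketch(const uint8_t *s, size_t len)
      {
          /* 0x80 replicated into every byte, at any word size */
          const uintptr_t high_bits = ((uintptr_t) -1 / 0xFF) * 0x80;
          size_t chars = len;
          size_t i = 0;

          for (; i + sizeof(uintptr_t) <= len; i += sizeof(uintptr_t)) {
              uintptr_t w;
              memcpy(&w, s + i, sizeof w);  /* no strict-aliasing games */

              /* A byte is a continuation byte iff bit 7 is set and bit
               * 6 is clear; 'w << 1' lines each byte's bit 6 up under
               * its own bit 7, and the mask keeps only bit-7 markers. */
              uintptr_t cont = w & ~(w << 1) & high_bits;

              while (cont) {            /* one marker per cont byte */
                  chars--;
                  cont &= cont - 1;     /* clear lowest set bit */
              }
          }

          for (; i < len; i++)          /* per-byte tail */
              if ((s[i] & 0xC0) == 0x80)
                  chars--;

          return chars;
      }
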
* Correct typos as per GH 20435  (James E Keenan, 2022-12-29, 1 file, -3/+3)

  In GH 20435 many typos in our C code were corrected. However, this
  pull request was not applied to blead and developed merge conflicts.
  I extracted diffs for the individual modified files and applied them
  with 'git apply', excepting four files where patch conflicts were
  reported. Those files were:

    handy.h
    locale.c
    regcomp.c
    toke.c

  We can handle these in a subsequent commit. Also, I had to run these
  two programs to keep 'make test_porting' happy:

    $ ./perl -Ilib regen/uconfig_h.pl
    $ ./perl -Ilib regen/regcomp.pl regnodes.h

* scope.* - revert and rework SAVECOPWARNINGS change  (Yves Orton, 2022-11-04, 1 file, -3/+1)

  We can't put PL_compiling or PL_curcop on the save stack, as we don't
  have a way to ensure they cross threads properly. This showed up as a
  win32 t/op/fork.t failure in the thread-based fork emulation layer.

  This adds a new save type SAVEt_CURCOP_WARNINGS and macro
  SAVECURCOPWARNINGS() to complement SAVEt_COMPILE_WARNINGS and
  SAVECOMPILEWARNINGS(). By simply hard-coding where the pointers
  should be restored to, we sidestep the issue of which thread we are
  in.

  Thanks to Graham Knop for help identifying that one of my commits was
  responsible.

* cop.h - get rid of the STRLEN* stuff from cop_warnings  (Yves Orton, 2022-11-02, 1 file, -13/+1)

  With RCPV strings we can use the RCPV_LEN() macro and make this logic
  a little less weird.

* handy.h: Set macro to false if can't ever be true  (Karl Williamson, 2022-10-10, 1 file, -3/+3)

  It's unlikely that perl will be compiled without the LC_CTYPE locale
  category being enabled. But if it isn't, there is no sense in having
  per-interpreter variables for various conditions in it, and no sense
  in having code that tests those variables.

  This commit changes a macro to always yield 'false' when this
  category is disabled, adds a new similar macro, and changes some
  occurrences that test a variable to use the macros instead. That way
  the compiler knows these two conditions can never be true.

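  As an illustration of the pattern (a sketch with illustrative names,
  not the actual handy.h change): when the category is compiled out,
  the macro folds to a compile-time 0, so every branch testing it is
  provably dead and the compiler deletes it.

      #include <stdio.h>

      #ifdef USE_LOCALE_CTYPE
      static int ctype_is_utf8;               /* per-interpreter state */
      #  define IN_UTF8_CTYPE_LOCALE  (ctype_is_utf8)
      #else
      #  define IN_UTF8_CTYPE_LOCALE  0       /* can never be true */
      #endif

      int
      main(void)
      {
          if (IN_UTF8_CTYPE_LOCALE)           /* dead code if disabled */
              puts("locale-dependent path");
          else
              puts("locale-independent path");
          return 0;
      }
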
* Per-word utf8_to_bytes()  (Karl Williamson, 2022-08-20, 1 file, -22/+188)

  This changes utf8_to_bytes() to do a per-word initial scan to see if
  the source is actually downgradable, before starting the conversion.
  This is significantly faster than the current per-character scan.
  However, the speed advantage evaporates during the actual conversion,
  which is a wash with the previous scheme. Thus the gain is in finding
  out more quickly whether the source is downgradable.

  Cachegrind yields the following, based on a 100K character string;
  for the non-downgradable case, the character immediately following
  those is the only one that's too large:

  Key:
    Ir      Instruction read
    Dr      Data read
    Dw      Data write
    COND    conditional branches
    IND     indirect branches
    _m      branch predict miss
    _m1     level 1 cache miss
    _mm     last cache (e.g. L3) miss
    -       indeterminate percentage (e.g. 1/0)

  The numbers represent relative counts per loop iteration, compared to
  blead at 100.0%. Higher is better: for example, using half as many
  instructions gives 200%, while using twice as many gives 50%.

  unicode::bytes_to_utf8_legal_API_test
  Downgrading 100K valid characters

              blead  proposed
             ------  --------
    Ir       100.00     99.99
    Dr       100.00    100.03
    Dw       100.00    100.04
    COND     100.00    100.05
    IND      100.00    100.00
    COND_m   100.00     87.25
    IND_m    100.00    100.00
    Ir_m1    100.00    123.25
    Dr_m1    100.00    100.18
    Dw_m1    100.00     99.94
    Ir_mm    100.00    100.00
    Dr_mm    100.00    100.00
    Dw_mm    100.00    100.00

  unicode::bytes_to_utf8_illegal
  Finding too high a character after 100K valid ones

              blead      fast
             ------    ------
    Ir       100.00    188.91
    Dr       100.00    179.77
    Dw       100.00     66.75
    COND     100.00    278.47
    IND      100.00    100.00
    COND_m   100.00     88.71
    IND_m    100.00    100.00
    Ir_m1    100.00    121.86
    Dr_m1    100.00    100.01
    Dw_m1    100.00    100.03
    Ir_mm    100.00    100.00
    Dr_mm    100.00    100.00
    Dw_mm    100.00    100.00

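  A hedged sketch of the pre-scan idea (not Perl's implementation):
  assuming the input is valid UTF-8, it is downgradable exactly when
  every start byte is ASCII, 0xC2, or 0xC3, since code points
  0x80-0xFF encode as two bytes starting with 0xC2 or 0xC3. Words
  containing only ASCII bytes can be cleared a whole word at a time.

      #include <stdbool.h>
      #include <stddef.h>
      #include <stdint.h>
      #include <string.h>

      /* Sketch only: can this valid UTF-8 be downgraded to single
       * bytes, i.e. is every code point below 256? */
      static bool
      utf8_is_downgradable_sketch(const uint8_t *s, size_t len)
      {
          const uintptr_t high_bits = ((uintptr_t) -1 / 0xFF) * 0x80;
          size_t i = 0;

          while (i < len) {
              if (i + sizeof(uintptr_t) <= len) {
                  uintptr_t w;
                  memcpy(&w, s + i, sizeof w);
                  if ((w & high_bits) == 0) {  /* all ASCII: skip word */
                      i += sizeof(uintptr_t);
                      continue;
                  }
              }
              if (s[i] < 0x80)                 /* one ASCII byte */
                  i += 1;
              else if (s[i] == 0xC2 || s[i] == 0xC3)
                  i += 2;                      /* code point 0x80-0xFF */
              else
                  return false;   /* code point >= 256: can't downgrade */
          }
          return true;
      }
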
* locale: Change macro name to be C conformant  (Karl Williamson, 2022-07-03, 1 file, -2/+2)

  C reserves leading underscores for system use; this commit fixes
  _CHECK_AND_WARN_PROBLEMATIC_LOCALE to be
  CHECK_AND_WARN_PROBLEMATIC_LOCALE_.

* Remove deprecated functions  (Karl Williamson, 2022-06-05, 1 file, -37/+0)

  Most of these have been deprecated for a long time, and we've never
  bothered to follow through in removing them. This commit does that.

* utf8.c: Minor comment changes  (Karl Williamson, 2022-06-04, 1 file, -2/+2)

* Make STRLENs() available to core  (Karl Williamson, 2022-05-31, 1 file, -5/+0)

  This may cause problems when not used correctly, so we continue to
  restrict it.

* utf8.c: Klortho advises: Check before deref  (Karl Williamson, 2022-03-05, 1 file, -1/+2)

  Don't dereference before checking that it is OK to do so.

* utf8.c: Fix typo in comment  (Karl Williamson, 2022-03-05, 1 file, -1/+1)

* Change pack U behavior for EBCDIC  (Karl Williamson, 2021-12-28, 1 file, -3/+3)

  This effectively reverts 3ece276e6c0. It turns out it was a bad idea
  to make U mean the non-native official Unicode code points. It may
  seem to make sense to do so, but it broke multiple CPAN modules that
  were using U the previous way.

  This commit has no effect on ASCII-platform functioning.

* utf8.c: Rmv duplicate #define.  (Karl Williamson, 2021-12-18, 1 file, -4/+0)

  This code is identical to a few lines above it.

* utf8.c: Rmv redundant assigns  (Karl Williamson, 2021-09-05, 1 file, -2/+0)

  This fixes GH #19091.

  This is from a rebasing error. The two variable assignments were
  supposed to have been superseded by the first one in the function and
  then removed, but they didn't get removed until now.

* utf8.c: White-space, comment only  (Karl Williamson, 2021-09-05, 1 file, -10/+11)

* utf8.c: Make new static fcn more flexible  (Karl Williamson, 2021-08-23, 1 file, -6/+22)

  This commit allows the function to be called with NULL parameters
  when those results are not needed.

* utf8.c: White-space, comment only  (Karl Williamson, 2021-08-23, 1 file, -40/+42)

* utf8.c: Rmv no longer needed speed-up code  (Karl Williamson, 2021-08-23, 1 file, -78/+12)

  The code this commit removes made sense when we were using swashes
  and had to go out to files on disk to find the answers. It used
  knowledge of the Unicode character database to skip swaths of scripts
  which are caseless.

  But now all that information is stored in C arrays that will be paged
  in when accessed, found by a binary search. The information about
  those swaths is in those arrays. The time spent on the conditionals
  removed here is better spent executing iterations of the search in
  the L1 cache.

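  The arrays in question are in the style of inversion lists: sorted
  range boundaries where property membership flips at each element. A
  hedged sketch of that kind of lookup, with invented data standing in
  for the generated tables:

      #include <stddef.h>
      #include <stdint.h>
      #include <stdio.h>

      /* Invented sample data: starts of ranges, alternating in/out */
      static const uint32_t cased_invlist[] = {
          0x41, 0x5B,       /* A-Z */
          0x61, 0x7B,       /* a-z */
          0xB5, 0xB6,       /* MICRO SIGN */
          /* ... the real tables are generated and far longer ... */
      };

      /* Binary-search for the last element <= cp; an even index means
       * cp falls inside a range that has the property. */
      static int
      invlist_contains(const uint32_t *list, size_t n, uint32_t cp)
      {
          size_t lo = 0, hi = n;

          while (lo < hi) {
              size_t mid = lo + (hi - lo) / 2;
              if (list[mid] <= cp)
                  lo = mid + 1;
              else
                  hi = mid;
          }
          return lo != 0 && ((lo - 1) % 2 == 0);
      }

      int
      main(void)
      {
          size_t n = sizeof(cased_invlist) / sizeof(cased_invlist[0]);
          printf("%d %d\n",
                 invlist_contains(cased_invlist, n, 'A'),    /* 1 */
                 invlist_contains(cased_invlist, n, '!'));   /* 0 */
          return 0;
      }
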
* utf8.c: Split a static fcn  (Karl Williamson, 2021-08-23, 1 file, -54/+89)

  This adds a new function for changing the case of an input code
  point. The difference between this and the existing function is that
  the new one returns an array of UVs, instead of a combination of the
  first code point and the UTF-8 of the whole thing, a somewhat awkward
  API that made more sense when we used swashes. That function is
  retained for now, at least, but most of the work is done in the new
  function.

* utf8.c: Use porcelain libc case changing fcn  (Karl Williamson, 2021-08-23, 1 file, -7/+7)

  The fancy wrapper macro that does extra things was being used when
  all that is needed is the bare libc function. The code calling it
  already wraps it, so it avoids calling it unless the bare version is
  what is appropriate.

* Add utf8_to_utf16  (Karl Williamson, 2021-08-14, 1 file, -0/+72)

* Improve utf16_to_utf8_reversed()  (Karl Williamson, 2021-08-14, 1 file, -37/+51)

  Instead of destroying the input by first swapping the bytes, this
  calls a base function with the byte order to use. The non-reversed
  function is changed to call the base function with the non-reversed
  order.

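  A minimal sketch of the shape of this refactoring (invented names;
  Perl's actual functions and signatures differ): the base routine
  takes the byte order as a parameter and reads each code unit
  accordingly, so neither wrapper has to mutate the caller's buffer.

      #include <stdint.h>

      typedef enum { UTF16_BIG, UTF16_LITTLE } utf16_order;

      /* Read one UTF-16 code unit in the requested byte order */
      static uint16_t
      read_code_unit(const uint8_t *p, utf16_order order)
      {
          return order == UTF16_BIG
                 ? (uint16_t) ((p[0] << 8) | p[1])
                 : (uint16_t) ((p[1] << 8) | p[0]);
      }

      /* The public pair then become thin wrappers over one base
       * conversion loop, e.g.:
       *   utf16_to_utf8(...)          => base(..., UTF16_BIG)
       *   utf16_to_utf8_reversed(...) => base(..., UTF16_LITTLE)
       * leaving the input untouched. */
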
* Simplify utf16_to_utf8()  (Karl Williamson, 2021-08-14, 1 file, -30/+10)

  A previous commit has simplified uvoffuni_to_utf8_flags() so that it
  is hardly more than the code in this function. So strip out the code
  and replace it by a call to uvoffuni_to_utf8_flags().

* utf8.c: Rmv #undef  (Karl Williamson, 2021-08-14, 1 file, -2/+0)

  This is unnecessary in a .c file, and the code it referred to has
  been moved away.

* utf8.c: Use macros instead of reinventing them  (Karl Williamson, 2021-08-14, 1 file, -12/+8)

* utf8.c: Refactor is_utf8_char_helper()  (Karl Williamson, 2021-08-14, 1 file, -111/+82)

  Now that the DFA is used by the only callers of this function to
  eliminate the need to check for, e.g., wrong continuation bytes, it
  can be refactored to use a switch statement, which makes it clearer,
  shorter, and faster. The name is changed to indicate its private
  nature.

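  In the same spirit: once a DFA in the callers has already rejected
  malformed continuation bytes, classifying a character reduces to a
  switch on the start byte. A hedged sketch of that shape (not the real
  helper, which also knows about Perl's extended and EBCDIC forms):

      #include <stdint.h>

      /* Sketch only: expected sequence length from the start byte;
       * 0 means "not a start byte we handle here". */
      static int
      utf8_char_len_sketch(uint8_t start_byte)
      {
          switch (start_byte >> 4) {
              case 0x0: case 0x1: case 0x2: case 0x3:
              case 0x4: case 0x5: case 0x6: case 0x7:
                  return 1;             /* 0xxxxxxx: ASCII */
              case 0xC: case 0xD:
                  return 2;             /* 110xxxxx */
              case 0xE:
                  return 3;             /* 1110xxxx */
              case 0xF:
                  return (start_byte & 0x08)
                         ? 0            /* 0xF8+: longer forms omitted */
                         : 4;           /* 11110xxx */
              default:
                  return 0;             /* 10xxxxxx: continuation byte */
          }
      }
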
* Make macro isUTF8_CHAR_flags an inline fcn  (Karl Williamson, 2021-08-14, 1 file, -1/+4)

  This makes it use the fast DFA for this functionality.

* utf8.c: Rmv EBCDIC dependency  (Karl Williamson, 2021-08-14, 1 file, -17/+8)

  There are new macros that suffice to make the determination here.

* utf8.c: Rmv an EBCDIC dependency  (Karl Williamson, 2021-08-14, 1 file, -9/+2)

  This is now generated by regcharclass.pl.

* utf8.c: Rename formal param to static fcn  (Karl Williamson, 2021-08-07, 1 file, -19/+19)

  The new name is more mnemonic.

* utf8.c: in-line only use of two macros  (Karl Williamson, 2021-08-07, 1 file, -40/+30)

  These macros don't need to be macros, as each is only called from one
  place, and that isn't likely to change.

* utf8.c: Comment non-obvious fcn param meaning  (Karl Williamson, 2021-08-07, 1 file, -1/+2)

* uvoffuni_to_utf8_flags_msgs: Avoid extra conditionals  (Karl Williamson, 2021-08-07, 1 file, -25/+44)

  The previous commit for EBCDIC paved the way for moving some checks
  for a code point being in Perl's extended UTF-8 out of places where
  they cannot succeed. The resultant simplifications more than
  compensate for the two extra case statements added by this commit.

* Fix EBCDIC deficiency in uvoffuni_to_utf8_flags_msgs()  (Karl Williamson, 2021-08-07, 1 file, -4/+16)

  Simply by adjusting the case statement labels and adding an extra
  case, the code can avoid checking for a problem on EBCDIC boxes when
  it would be impossible for the problem to exist.

* Refactor uvoffuni_to_utf8_flags_msgs  (Karl Williamson, 2021-08-07, 1 file, -119/+73)

  Having a fast UVOFFUNISKIP() allows this function to be refactored to
  simplify it.

  This commit continues to shortchange large code points and EBCDIC by
  a little. For example, it checks whether a 4-byte character is above
  Unicode, but no 4-byte characters fit that description in
  UTF-EBCDIC. This will be fixed in the next commit, which will prepare
  for further enhancements.

* utf8.c: Change formal parameter name to fcn  (Karl Williamson, 2021-08-07, 1 file, -39/+39)

  This will make the next commit easier to follow.

* Add helper function for longest UTF8 sequence  (Karl Williamson, 2021-08-07, 1 file, -0/+64)

  This specialized functionality is used to check the validity of
  Perl's extended-length UTF-8, which has some idiosyncratic
  characteristics compared with the shorter sequences. This means this
  function doesn't have to consider those differences. It will be used
  in the next commit to avoid some work, and to eventually enable
  is_utf8_char_helper() to be simplified.

* utf8.c: Fold 2 overlapping fcns into one  (Karl Williamson, 2021-08-07, 1 file, -198/+97)

  One of these functions is now only called from the other, and there
  is significant overlap in their logic. This commit refactors them
  into one resulting function, which is half the code and more
  straightforward.

* utf8.c: Change internal macro name  (Karl Williamson, 2021-08-07, 1 file, -6/+6)

  The sequences here aren't UTF-8 but UTF, since they are I8 in
  UTF-EBCDIC terms.

* utf8.c: Improve algorithm for detecting overflow  (Karl Williamson, 2021-08-07, 1 file, -61/+25)

  The code has hard-coded into it the UTF-8 for the highest
  representable code point for various platforms and word sizes. The
  algorithm is to compare the input sequence to verify it is <= the
  highest. But the tail of each of these constants is some number of
  copies of the highest possible continuation byte, and the input
  cannot be above the highest possible there, so we need not look at
  the tail at all.

  This commit shortens the highest-string constants and exits the loop
  when we get to where the tail used to be.

  This change allows for the complete removal of some #ifdef'd-out code
  that would be used if we ever allow core to use code points up to
  UV_MAX.

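  The trick is easy to sketch. The values below are an assumption for a
  64-bit ASCII platform (the real constants are platform-dependent):
  UV_MAX encodes in Perl's extended UTF-8 as 0xFF 0x80 0x8F followed by
  ten 0xBF bytes, and since 0xBF is the largest possible continuation
  byte, no input can exceed the constant in the tail.

      #include <stdbool.h>
      #include <stddef.h>
      #include <stdint.h>

      /* The shortened constant: just the bytes before the all-0xBF tail */
      static const uint8_t highest_prefix[] = { 0xFF, 0x80, 0x8F };

      /* Sketch only: assumes 's' points at a full-length 0xFF-initial
       * sequence; returns whether it encodes a value above UV_MAX. */
      static bool
      utf8_overflows_sketch(const uint8_t *s)
      {
          size_t i;

          for (i = 0; i < sizeof(highest_prefix); i++) {
              if (s[i] > highest_prefix[i])
                  return true;          /* above the highest: overflow */
              if (s[i] < highest_prefix[i])
                  return false;         /* already below: fits in a UV */
          }
          return false;  /* tied on the prefix; tail can't exceed 0xBF */
      }
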
* utf8.c: Use STRLENs() instead of sizeof()  (Karl Williamson, 2021-08-07, 1 file, -9/+14)

  This makes the code easier to read.

* utf8.c: Use C_ARRAY_LENGTH()  (Karl Williamson, 2021-08-07, 1 file, -1/+1)

  This macro is preferred to sizeof().

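  For reference, this is the conventional element-count idiom (shown
  here as an illustration of the pattern, not copied from handy.h):
  dividing by the size of one element yields a count of elements rather
  than bytes, which is what loops over a table actually want.

      #include <stddef.h>
      #include <stdio.h>

      #define C_ARRAY_LENGTH(a)  (sizeof(a) / sizeof((a)[0]))

      int
      main(void)
      {
          static const int table[] = { 2, 3, 5, 7, 11 };
          size_t i;

          for (i = 0; i < C_ARRAY_LENGTH(table); i++)  /* 5 iterations */
              printf("%d\n", table[i]);
          return 0;
      }
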
* utf8.c: Generalize static fcn  (Karl Williamson, 2021-08-07, 1 file, -25/+40)

  I've always been uncomfortable with the input constraints this
  function had. Now that it has been refactored to use a switch(), new
  cases for full generality can be added without affecting performance,
  and some conditionals can be removed before calling it. The function
  is renamed to reflect its greater generality.

* utf8.c: Refactor internal function  (Karl Williamson, 2021-08-07, 1 file, -35/+25)

  The insight in the previous commit allows this function to become
  much more compact.

* utf8.c: Rmv some EBCDIC dependencies  (Karl Williamson, 2021-08-07, 1 file, -8/+4)

  I hadn't previously noticed the underlying symmetry between the
  platforms.

* utf8.h: Add #define  (Karl Williamson, 2021-08-07, 1 file, -1/+1)

  UTF_MIN_CONTINUATION_BYTE is clearer for use in some contexts.

* utf8.c: Change name of static function  (Karl Williamson, 2021-08-07, 1 file, -2/+2)

  This changes only portions of the capitalization, and the new version
  is more in keeping with other function names.

* utf8.h: Add new #define for extended length UTF-8  (Karl Williamson, 2021-08-07, 1 file, -1/+1)

  The previous commit added a convenient place to create a symbol
  indicating that the UTF-8 on this platform includes Perl's
  nearly-double-length extension. The platforms where this isn't needed
  are 32-bit ASCII ones. The symbol allows removing one place where
  EBCDIC need be considered, and future commits will use it as well.