path: root/utf8.c
Commit message | Author | Age | Files | Lines
* Change pack U behavior for EBCDICKarl Williamson2021-12-281-3/+3
| | | | | | | | | | This effectively reverts 3ece276e6c0. It turns out it was a bad idea to make U mean the non-native official Unicode code points. It may seem to make sense to do so, but it broke multiple CPAN modules which were using U the previous way. This commit has no effect on ASCII-platform functioning.
* utf8.c: Rmv duplicate #define.Karl Williamson2021-12-181-4/+0
| | | | This code is identical to a few lines above it
* utf8.c: Rmv redundant assignsKarl Williamson2021-09-051-2/+0
| | | | | | | | This fixes GH #19091. This is from a rebasing error. The two variable assignments were supposed to have been superseded by the first one in the function and removed, but they were not removed until now.
* utf8.c: White-space, comment onlyKarl Williamson2021-09-051-10/+11
* utf8.c: Make new static fcn more flexibleKarl Williamson2021-08-231-6/+22
| | | | | This commit allows this function to be called with NULL parameters when the result of these is not needed.
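The optional-parameter pattern this commit describes can be sketched as follows. This is a minimal illustration, not the utf8.c code: `upper_sketch`, `MAX_MAPPED`, and the parameter names are hypothetical, and the mapping table covers only ASCII letters plus U+00DF, which uppercases to "SS".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MAPPED 3   /* hypothetical cap on code points per mapping */

/* Hypothetical sketch: returns the first code point of the uppercase
 * mapping of 'cp'.  'out' and 'out_len' are optional; pass NULL for
 * either one when that result is not needed. */
static uint32_t upper_sketch(uint32_t cp, uint32_t *out, size_t *out_len)
{
    uint32_t buf[MAX_MAPPED];
    size_t n = 1;

    if (cp == 0xDF) {                     /* U+00DF maps to "SS" */
        buf[0] = 'S';
        buf[1] = 'S';
        n = 2;
    } else if (cp >= 'a' && cp <= 'z') {  /* ASCII lowercase */
        buf[0] = cp - 32;
    } else {                              /* everything else unchanged */
        buf[0] = cp;
    }
    assert(n <= MAX_MAPPED);

    if (out != NULL) {                    /* copy only if the caller wants it */
        for (size_t i = 0; i < n; i++)
            out[i] = buf[i];
    }
    if (out_len != NULL)
        *out_len = n;

    return buf[0];
}
```

A caller that only needs the first code point can write `upper_sketch(cp, NULL, NULL)` and skip allocating a result buffer.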
* utf8.c: White-space, comment onlyKarl Williamson2021-08-231-40/+42
* utf8.c: Rmv no longer needed speed-up codeKarl Williamson2021-08-231-78/+12
| | | | | | | | | | | | The code this commit removes made sense when we were using swashes, and we had to go out to files on disk to find the answers. It used knowledge of the Unicode character database to skip swaths of scripts which are caseless. But now, all that information is stored in C arrays that will be paged in when accessed, which is done by a binary search. The information about those swaths is in those arrays. The conditionals removed here are better spent in executing iterations of the search in L1 cache.
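The lookup style described here, a binary search over sorted C arrays in which caseless swaths are simply ranges with no change, might look roughly like the sketch below. The table and the name `to_lower_sketch` are hypothetical and cover ASCII only; Perl's real tables are generated and far larger.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy table: each entry starts a range that runs up to the
 * next entry.  Caseless ranges carry a delta of 0. */
static const uint32_t range_start[] = { 0x00, 0x41, 0x5B };
static const int32_t  lower_delta[] = {    0,   32,    0 };

static uint32_t to_lower_sketch(uint32_t cp)
{
    size_t lo = 0;
    size_t hi = sizeof range_start / sizeof range_start[0];

    assert(range_start[0] == 0);   /* table must cover every input */

    /* Invariant: range_start[lo] <= cp, and cp is below range 'hi'
     * (or 'hi' is past the end of the table). */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (range_start[mid] <= cp)
            lo = mid;
        else
            hi = mid;
    }
    return cp + (uint32_t)lower_delta[lo];
}
```

A code point in a caseless script lands in a zero-delta range after a few iterations, so no separate pre-check for such scripts is needed.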
* utf8.c: Split a static fcnKarl Williamson2021-08-231-54/+89
| | | | | | | | | This adds a new function for changing the case of an input code point. The difference between this and the existing function is that the new one returns an array of UVs instead of a combination of the first code point and UTF-8 of the whole thing, a somewhat awkward API that made more sense when we used swashes. That function is retained for now, at least, but most of the work is done in the new function.
* utf8.c: Use porcelain libc case changing fcnKarl Williamson2021-08-231-7/+7
| | | | | | | The fancy wrapper macro that does extra things was being used, when all that is needed is the bare libc function. This is because the code calling it already wraps it, so avoids calling it unless the bare version is what is appropriate.
* Add utf8_to_utf16Karl Williamson2021-08-141-0/+72
* Improve utf16_to_utf8_reversed()Karl Williamson2021-08-141-37/+51
| | | | | | Instead of destroying the input by first swapping the bytes, this calls a base function with the order to use. The non-reverse function is changed to call the base function with the non-reversed order.
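One way to read UTF-16 in either byte order without modifying the input, as this commit describes, is to pass the byte offsets to a shared base routine. The sketch below is an assumption about the general shape, not the actual utf8.c code; `read_unit`, `decode_first`, and the `high`/`low` parameters are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical base routine: read one 16-bit unit, where 'high' and 'low'
 * are the offsets (0 or 1) of the high and low bytes.  The input is const
 * and is never swapped in place. */
static uint16_t read_unit(const unsigned char *p, int high, int low)
{
    assert((high == 0 && low == 1) || (high == 1 && low == 0));
    return (uint16_t)((p[high] << 8) | p[low]);
}

/* Decode the first code point of a UTF-16 buffer of 'nbytes' bytes.
 * A lone or truncated surrogate is returned as-is in this sketch. */
static uint32_t decode_first(const unsigned char *p, size_t nbytes,
                             int high, int low)
{
    assert(nbytes >= 2);
    uint32_t u = read_unit(p, high, low);

    if (u >= 0xD800 && u <= 0xDBFF && nbytes >= 4) {
        uint32_t trail = read_unit(p + 2, high, low);
        if (trail >= 0xDC00 && trail <= 0xDFFF)
            return 0x10000 + ((u - 0xD800) << 10) + (trail - 0xDC00);
    }
    return u;
}
```

The big-endian caller passes `(0, 1)` and the reversed caller passes `(1, 0)`, so both share one code path.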
* Simplify utf16_to_utf8()Karl Williamson2021-08-141-30/+10
| | | | | | A previous commit has simplified uvoffuni_to_utf8_flags() so that it is hardly more than the code in this function. So strip out the code and replace it by a call to uvoffuni_to_utf8_flags().
* utf8.c: Rmv #undefKarl Williamson2021-08-141-2/+0
| | | | | This is unnecessary in a .c file, and the code it referred to has been moved away.
* utf8.c: Use macros instead of reinventing themKarl Williamson2021-08-141-12/+8
* utf8.c: Refactor is_utf8_char_helper()Karl Williamson2021-08-141-111/+82
| | | | | | | | | Now that the DFA is used by the only callers to this to eliminate the need to check for e.g., wrong continuation bytes, this function can be refactored to use a switch statement, which makes it clearer, shorter, and faster. The name is changed to indicate its private nature.
* Make macro isUTF8_CHAR_flags an inline fcnKarl Williamson2021-08-141-1/+4
| | | | This makes it use the fast DFA for this functionality.
* utf8.c: Rmv EBCDIC dependencyKarl Williamson2021-08-141-17/+8
| | | | There are new macros that suffice to make the determination here.
* utf8.c: Rmv an EBCDIC dependencyKarl Williamson2021-08-141-9/+2
| | | | This is now generated by regcharclass.pl
* utf8.c: Rename formal param to static fcnKarl Williamson2021-08-071-19/+19
| | | | The new name is more mnemonic
* utf8.c: in-line only use of two macrosKarl Williamson2021-08-071-40/+30
| | | | | These macros don't need to be macros, as they each are only called from one place, and that isn't likely to change.
* utf8.c: Comment non-obvious fcn param meaningKarl Williamson2021-08-071-1/+2
* uvoffuni_to_utf8_flags_msgs: Avoid extra conditionalsKarl Williamson2021-08-071-25/+44
| | | | | | | The previous commit for EBCDIC paved the way for moving some checks for a code point being for Perl extended UTF-8 out of places where they cannot succeed. The resultant simplifications more than compensate for the two extra case statements added by this commit.
* Fix EBCDIC deficiency in uvoffuni_to_utf8_flags_msgs()Karl Williamson2021-08-071-4/+16
| | | | | | Simply by adjusting the case statement labels, and adding an extra case, the code can avoid checking for a problem on EBCDIC boxes when it would be impossible for the problem to exist.
* Refactor uvoffuni_to_utf8_flags_msgsKarl Williamson2021-08-071-119/+73
| | | | | | | | | | | Having a fast UVOFFUNISKIP() allows this function to be refactored to simplify it. This commit continues to shortchange large code points and EBCDIC by a little. For example, it checks if a 4-byte character is above Unicode, but no 4-byte characters fit that description in UTF-EBCDIC. This will be fixed in the next commit, which will prepare for further enhancements.
* utf8.c: Change formal parameter name to fcnKarl Williamson2021-08-071-39/+39
| | | | This will make more sense of the next commit
* Add helper function for longest UTF8 sequenceKarl Williamson2021-08-071-0/+64
| | | | | | | | | This specialized functionality is used to check the validity of Perl's extended-length UTF-8, which has some idiosyncratic characteristics that differ from the shorter sequences. This means this function doesn't have to consider those differences. It will be used in the next commit to avoid some work, and to eventually enable is_utf8_char_helper() to be simplified.
* utf8.c: Fold 2 overlapping fcns into oneKarl Williamson2021-08-071-198/+97
| | | | | | | | One of these functions is now only called from the other, and there is significant overlap in their logic. This commit refactors them into one resulting function, which is half the code, and more straightforward.
* utf8.c: Change internal macro nameKarl Williamson2021-08-071-6/+6
| | | | | The sequences here aren't UTF-8, but UTF, since they are I8 in UTF-EBCDIC terms
* utf8.c: Improve algorithm for detecting overflowKarl Williamson2021-08-071-61/+25
| | | | | | | | | | | | | | The code has hard-coded into it the UTF-8 for the highest representable code point for various platforms and word sizes. The algorithm is to compare the input sequence to verify it is <= the highest. But the tail of each of them has some number of the highest possible continuation byte. We need not look at the tail, as the input cannot be above the highest possible. This commit shortens the highest string constants and exits the loop when we get to where the tail used to be. This change allows for the complete removal of the code that is #ifdef'd out that would be used when we allow core to use code points up to UV_MAX.
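The idea of comparing only against a shortened "highest" constant can be sketched with a familiar limit. This illustration uses U+10FFFF (UTF-8 bytes F4 8F BF BF) instead of the platform-specific maxima the commit refers to, and the names `highest_prefix` and `exceeds_limit` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustration only: the limit is U+10FFFF (F4 8F BF BF).  The trailing
 * BF BF are the highest possible continuation bytes, so they are omitted
 * from the constant. */
static const unsigned char highest_prefix[] = { 0xF4, 0x8F };

/* Hypothetical helper: 's' must be a structurally valid 4-byte UTF-8
 * sequence.  Returns true if it encodes a code point above the limit. */
static bool exceeds_limit(const unsigned char *s)
{
    assert(s != NULL);

    for (size_t i = 0; i < sizeof highest_prefix; i++) {
        if (s[i] < highest_prefix[i])
            return false;               /* definitely below the limit */
        if (s[i] > highest_prefix[i])
            return true;                /* definitely above the limit */
    }

    /* Equal through the prefix.  Every remaining byte is a continuation
     * byte, hence at most 0xBF, so the input cannot exceed the limit and
     * the tail need not be examined. */
    return false;
}
```

Stopping at the end of the prefix saves the comparisons against the tail, which could never change the answer.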
* utf8.c: Use STRLENs() instead of sizeof()Karl Williamson2021-08-071-9/+14
| | | | This makes the code easier to read.
* utf8.c: Use C_ARRAY_LENGTH()Karl Williamson2021-08-071-1/+1
| | | | This macro is preferred to sizeof()
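Macros in the spirit of STRLENs() and C_ARRAY_LENGTH() can be sketched as below. The definitions are stand-ins written for this illustration rather than copied from Perl's headers, and `LITERAL_LEN`, `ARRAY_LEN`, `BOM_UTF8`, and `starts_with_bom` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Length of a string literal, excluding the trailing NUL.  The "" s ""
 * concatenation makes the macro fail to compile on a non-literal. */
#define LITERAL_LEN(s)  (sizeof("" s "") - 1)

/* Number of elements in a true array (not a pointer). */
#define ARRAY_LEN(a)    (sizeof(a) / sizeof((a)[0]))

#define BOM_UTF8 "\xEF\xBB\xBF"   /* UTF-8 encoding of U+FEFF */

/* Hypothetical example use: does the buffer begin with a UTF-8 BOM? */
static bool starts_with_bom(const unsigned char *s, size_t len)
{
    assert(s != NULL);
    return len >= LITERAL_LEN(BOM_UTF8)
        && memcmp(s, BOM_UTF8, LITERAL_LEN(BOM_UTF8)) == 0;
}
```

Naming the intent at the call site avoids the easy-to-miss `- 1` that a raw `sizeof()` on a string constant requires.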
* utf8.c: Generalize static fcnKarl Williamson2021-08-071-25/+40
| | | | | | | | | I've always been uncomfortable with the input constraints this function had. Now that it has been refactored into using a switch(), new cases for full generality can be added without affecting performance, and some conditionals removed before calling it. The function is renamed to reflect its greater generality.
* utf8.c: Refactor internal functionKarl Williamson2021-08-071-35/+25
| | | | | The insight in the previous commit allows this function to become much more compact.
* utf8.c: Rmv some EBCDIC dependenciesKarl Williamson2021-08-071-8/+4
| | | | | I hadn't previously noticed the underlying symmetry between the platforms.
* utf8.h: Add #defineKarl Williamson2021-08-071-1/+1
| | | | UTF_MIN_CONTINUATION_BYTE is clearer for use in some contexts
* utf8.c: Change name of static functionKarl Williamson2021-08-071-2/+2
| | | | | This changes only portions of the capitalization, and the new version is more in keeping with other function names.
* utf8.h: Add new #define for extended length UTF-8Karl Williamson2021-08-071-1/+1
| | | | | | | | The previous commit added a convenient place to create a symbol to indicate that the UTF-8 on this platform includes Perl's nearly-double length extension. The platforms this isn't needed on are 32-bit ASCII ones. This symbol allows removing one place where EBCDIC need be considered, and future commits will use it as well.
* Rename internal macro and move to utf8.hKarl Williamson2021-08-071-5/+3
| | | | | | This macro has a corresponding, older, name for the non-UTF-8 case. It makes sense to use the same paradigm, and move the definitions together so that the comments for one don't have to be repeated for the other.
* regcharclass.h: Remove 2 EBCDIC dependenciesKarl Williamson2021-07-311-5/+1
| | | | | | | | | This commit makes is_HANGUL_ED_utf8_safe() return 0 unconditionally on EBCDIC platforms. This means its callers don't have to care what platform is running. Change the two callers to take advantage of this. The commit also changes the description of the macro to be slightly more accurate.
* utf8.c: Remove redundant '&' instructionKarl Williamson2021-07-301-1/+1
| | | | The cast does this without an extra instruction
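Why the mask is redundant can be shown with a small sketch. It assumes the cast in question is to an 8-bit unsigned type; `uint8_t` stands in for Perl's U8 here, and the two function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Masking before the cast ... */
static uint8_t low_byte_masked(uint32_t v)
{
    return (uint8_t)(v & 0xFF);
}

/* ... gives the same result as the cast alone, because converting to an
 * unsigned 8-bit type already reduces the value modulo 256. */
static uint8_t low_byte_cast(uint32_t v)
{
    assert(sizeof(uint8_t) == 1);
    return (uint8_t)v;
}
```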
* utf8_length: Fix undefined C behaviorKarl Williamson2021-06-301-11/+18
| | | | | | | | | In C the comparison of two pointers is only legal if both point to within the same object, or to a virtual element one above the high edge of the object. The previous code was doing an addition potentially outside that range, and so the results would be undefined.
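The pointer rule described in this commit can be illustrated with a minimal sketch. The helper `advance_fits` and its parameters are hypothetical, not taken from utf8.c; the point is to compare a length against the remaining distance rather than form a pointer beyond the buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: reports whether 'skip' more bytes starting at 's'
 * still lie within the buffer that ends at 'e' (one past the last byte). */
static bool advance_fits(const unsigned char *s, const unsigned char *e,
                         size_t skip)
{
    assert(s <= e);   /* both point into, or one past, the same object */

    /* Undefined if s + skip lands beyond one-past-the-end:
     *     return s + skip <= e;
     * Well defined: e - s is a valid distance within one object, so
     * compare the requested skip against it instead. */
    return skip <= (size_t)(e - s);
}
```

A loop that steps through a string by per-character lengths can call such a check before advancing, so the addition itself never leaves the object.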
* Fix several unicode.org linksThibault DUPONCHELLE2021-06-171-1/+1
* style: Detabify indentation of the C code maintained by the core.Michael G. Schwern2021-01-171-255/+255
| | | | | | | | | | | This just detabifies to get rid of the mixed tab/space indentation. Applying consistent indentation and dealing with other tabs are another issue. Done with `expand -i`. * vutil.* left alone, it's part of version. * Left regen managed files alone for now.
* Restrict scope/Shorten some very long macro namesKarl Williamson2020-11-221-4/+2
| | | | | | The names were intended to force people to not use them outside their intended scopes. But by restricting those scopes in the first place, we don't need such unwieldy names
* utf8.c: Note various symbols are documented hereKarl Williamson2020-11-091-1/+40
* autodoc.pl: Specify scn for single-purpose filesKarl Williamson2020-11-061-3/+0
| | | | | | | | Many of the files in perl are for one thing only, and hence their embedded documentation will be for that one thing. By creating a hash here of them, those files don't have to worry about what section that documentation goes under, and so it can be completely changed without affecting them.
* utf8.c: Add blank line for visual clarityKarl Williamson2020-10-221-0/+1
* Fix typosSamanta Navarro2020-10-031-1/+1
| | | | | | | | | For: https://github.com/Perl/perl5/pull/18201 Committer: Samanta Navarro is now a Perl author. To keep 'make test_porting' happy: Increment $VERSION in several files. Regenerate uconfig.h via './perl -Ilib regen/uconfig_h.pl'.
* Fix bytes_from_utf8_loc() podKarl Williamson2020-09-051-2/+2
| | | | | I didn't mean to apply 8505db87404436956f86e54885cacd73840801c0 just yet, but having done so, there is a change required in a link
* Make bytes_from_utf8_loc() internalKarl Williamson2020-09-041-6/+1
| | | | And add it to perlintern, and fix grammar in its pod