path: root/utf8.h
Commit message (Author, Age; Files, Lines)
...
* handy.h: Create nBIT_MASK(n) macro (Karl Williamson, 2020-07-17; 1 file, -2/+2)
  This encapsulates a common paradigm, making sure that it is done correctly for the platform's size.
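The "common paradigm" is building a mask with the n lowest bits set. A minimal sketch of that idea follows; the names `LOW_BITS_MASK_SKETCH` and `low_bits_mask_sketch` are hypothetical and this is not claimed to be the handy.h definition. It assumes the widest unsigned type is wanted, so that the shift stays defined when n exceeds the width of `int`.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical stand-in, not the handy.h definition: a mask with the n
 * lowest bits set.  UINTMAX_C(1) makes the shift happen in the widest
 * unsigned type, so n may exceed the width of int.
 * Valid only for 0 <= n < (number of bits in uintmax_t). */
#define LOW_BITS_MASK_SKETCH(n) ((UINTMAX_C(1) << (n)) - 1)

/* Checked wrapper: traps an out-of-range n in debugging builds. */
uintmax_t low_bits_mask_sketch(unsigned n)
{
    assert(n < sizeof(uintmax_t) * CHAR_BIT);
    return LOW_BITS_MASK_SKETCH(n);
}
```

Writing `(1 << n) - 1` by hand would shift a plain `int`, which is undefined once n reaches the width of `int`; centralising the expression in one macro avoids repeating that mistake.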
* Fix a bunch of repeated-word typos (Dagfinn Ilmari Mannsåker, 2020-05-22; 1 file, -1/+1)
  Mostly in comments and docs, but some in diagnostic messages and one case of 'or die die'.
* pv_uni_display: Use common fcn; \b mnemonic (Karl Williamson, 2020-01-23; 1 file, -1/+7)
  This removes the (almost) duplicate code in this function to display mnemonics for control characters that have them. The reason the two pieces of code aren't precisely the same is that the other function also uses \b as a mnemonic for backspace. Using all possible mnemonics is desirable, so a flag is added for pv_uni_display to now use \b. This is now enabled by default in double-quoted strings, but not in regex patterns (as \b there means something quite different except in character classes). B.pm is changed to expect \b.
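The flag-controlled mnemonic could look roughly like the sketch below. The flag name `DISPLAY_BACKSPACE_SKETCH`, the function name, and the particular set of other mnemonics are all assumptions made for illustration, not the names or table used by pv_uni_display.

```c
#include <assert.h>

/* Hypothetical flag: when set, backspace is displayed as \b. */
#define DISPLAY_BACKSPACE_SKETCH 0x1u

/* Returns the letter that follows the backslash in the mnemonic for
 * control character c, or 0 if c has no mnemonic under these flags. */
char control_mnemonic_sketch(unsigned char c, unsigned flags)
{
    assert((flags & ~DISPLAY_BACKSPACE_SKETCH) == 0); /* only one flag known */
    switch (c) {
    case '\a': return 'a';
    case '\f': return 'f';
    case '\n': return 'n';
    case '\r': return 'r';
    case '\t': return 't';
    case '\b':
        /* \b means something else in a regex pattern, so a caller
         * displaying a pattern would leave the flag clear. */
        return (flags & DISPLAY_BACKSPACE_SKETCH) ? 'b' : 0;
    default:   return 0;
    }
}
```

With a single shared lookup, the only difference between the two display contexts is whether the caller passes the flag.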
* Fix UTF8_IS_START on EBCDIC (Karl Williamson, 2019-12-07; 1 file, -3/+11)
* utf8.h: Rmv obsolete macros (Karl Williamson, 2019-11-24; 1 file, -16/+0)
  These macros were missed in dd1a3ba7882ca70c1e85b0fd6c03d07856672075 and 059703b088f44d5665f67fba0b9d80cad89085fd. Using them would cause things to fail to compile.
* Add missing back compat macros (Karl Williamson, 2019-11-24; 1 file, -0/+1)
  These are needed only to allow some modules to stay updated with blead.
* utf8.h: Use MAX() macro instead of its expansion (Karl Williamson, 2019-11-14; 1 file, -3/+1)
  It makes things a little clearer.
* utf8.h: Use a cast to U8 to avoid an AND (Karl Williamson, 2019-11-11; 1 file, -1/+1)
* Allow core to work with code points above IV_MAX (Karl Williamson, 2019-11-06; 1 file, -0/+4)
  Higher code points have been reserved for core use, and a future commit will want to finally do this.
* Consolidate uses of PERL_SMALL_MACRO_BUFFER (Karl Williamson, 2019-10-31; 1 file, -17/+2)
  At the moment the _ASSERT_() is the one which has been showing large expansions. Change so it doesn't do anything if PERL_SMALL_MACRO_BUFFER is defined. That means various other calls that use PERL_SMALL_MACRO_BUFFER can be simplified to not use it.
* Document UTF8_MAXBYTES_CASE (Karl Williamson, 2019-10-15; 1 file, -8/+18)
* Fix UTF8_CHK_SKIP() (Karl Williamson, 2019-10-13; 1 file, -1/+2)
  I forgot an arg in a macro it calls.
* Add UTF8_CHK_SKIP() macro (Karl Williamson, 2019-10-09; 1 file, -2/+40)
  This is a safer version of UTF8SKIP for use when the input could possibly be malformed. It uses strnlen() to not read past a NUL in the input. Since Perl adds NULs to the end of SVs, this will likely prevent reading beyond the end of a buffer. A still safer version could be written that doesn't look for just a NUL, but any unexpected byte, and stops just before that. I suspect that is overkill, and since strnlen() can be very fast, I went with this approach instead. Nothing precludes adding another version that does this full checking.
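The NUL-bounded skip can be sketched as follows. This is an illustration under stated assumptions, not the utf8.h macro: the function names are hypothetical, the length table covers only standard 1 to 4 byte UTF-8 (Perl's own table goes further), a hand-written loop stands in for strnlen() to keep the block portable, and returning 1 for a leading NUL is a choice made by this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Length implied by a lead byte, standard UTF-8 only (assumption).
 * Stray continuation bytes and out-of-range bytes count as 1 so the
 * caller always makes progress. */
static size_t chk_expected_len(unsigned char lead)
{
    if (lead < 0xC0) return 1;
    if (lead < 0xE0) return 2;
    if (lead < 0xF0) return 3;
    if (lead < 0xF8) return 4;
    return 1;
}

/* Stand-in for strnlen(): bytes before the first NUL, examining at
 * most max bytes. */
static size_t bounded_len(const char *s, size_t max)
{
    size_t n = 0;
    while (n < max && s[n] != '\0') n++;
    return n;
}

/* Bytes to advance from s, never stepping past a terminating NUL. */
size_t chk_skip_sketch(const char *s)
{
    size_t want, have;
    assert(s != NULL);
    if (s[0] == '\0') return 1;     /* NUL itself is a one-byte character */
    want = chk_expected_len((unsigned char) s[0]);
    have = bounded_len(s, want);
    return have < want ? have : want;
}
```

A truncated three-byte sequence followed by the trailing NUL therefore yields 2 rather than 3, so the caller stops at the NUL instead of reading beyond it.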
* Document UTF8_SKIP() (Karl Williamson, 2019-10-09; 1 file, -0/+8)
* Make defn of UTF_IS_CONTINUED common (Karl Williamson, 2019-10-06; 1 file, -6/+5)
  This can be derived from other values, removing an EBCDIC dependency.
* Make defn of UVCHR_IS_INVARIANT common (Karl Williamson, 2019-10-06; 1 file, -13/+13)
  This can be derived from other values, removing an EBCDIC dependency.
* Make defn of OFFUNI_IS_INVARIANT common (Karl Williamson, 2019-10-06; 1 file, -8/+5)
  This can be derived from other values, removing an EBCDIC dependency.
* Make defn of UTF8_IS_DOWNGRADEABLE_START common (Karl Williamson, 2019-10-06; 1 file, -8/+7)
  This can be derived from other values, removing an EBCDIC dependency.
* Make defn of UTF_IS_ABOVE_LATIN1 common (Karl Williamson, 2019-10-06; 1 file, -6/+8)
  This can be derived from other values, removing an EBCDIC dependency.
* Make defn of UTF8_IS_START common (Karl Williamson, 2019-10-06; 1 file, -7/+10)
  This can be derived from other values, removing an EBCDIC dependency.
* Make defn of UTF8_IS_CONTINUATION common (Karl Williamson, 2019-10-06; 1 file, -6/+6)
  This can be derived from other values, removing an EBCDIC dependency.
* Make defn of UTF_CONTINUATION_MARK common (Karl Williamson, 2019-10-06; 1 file, -4/+6)
  This can be derived from other values, removing an EBCDIC dependency.
* Make defn of UTF_IS_CONTINUATION_MASK common (Karl Williamson, 2019-10-06; 1 file, -3/+4)
  This variable can be defined from the same base in both UTF-8 and UTF-EBCDIC, and doing so eliminates an EBCDIC dependency.
* utf8.h: Add comment (Karl Williamson, 2019-10-06; 1 file, -1/+3)
* utf8.h: Remove redundant cast (Karl Williamson, 2019-10-06; 1 file, -1/+1)
  The called macro does the cast already.
* utf8.h: Make sure macros not called with a ptr (Karl Williamson, 2019-10-06; 1 file, -8/+8)
  By doing an '| 0' with a parameter in a macro expansion, a C syntax error will be generated. This is free protection.
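The '| 0' guard works because the bitwise OR operator requires integer operands, so a pointer argument is rejected at compile time while integer arguments are unchanged. A minimal sketch, with the hypothetical macro name `IS_ASCII_BYTE_SKETCH` (not one of the utf8.h macros):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macro: true if c is in the 7-bit ASCII range.
 * The '| 0' leaves an integer value unchanged but makes the expansion
 * ill-formed when c is a pointer. */
#define IS_ASCII_BYTE_SKETCH(c) ((((c) | 0) & ~0x7F) == 0)

/* Counts the ASCII bytes in a NUL-terminated string. */
size_t count_ascii_sketch(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (IS_ASCII_BYTE_SKETCH(*s))   /* correct: a byte value */
            n++;
    }
    /* IS_ASCII_BYTE_SKETCH(s) would fail to compile, because s is a
     * pointer and '|' does not accept pointer operands. */
    return n;
}
```

The guard compiles away entirely for valid calls, which is why the commit message can describe it as free protection.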
* Remove deprecated character classification/case changing macros (Karl Williamson, 2019-09-29; 1 file, -9/+0)
  It has been deprecated since 5.26 to use various macros that deal with UTF-8 inputs but don't have a parameter indicating the maximum length beyond which we should not look. This commit changes all such macros, as threatened in existing documentation and warning messages, to have an extra parameter giving the length. This was originally scheduled to happen in 5.30, but was delayed because it broke some CPAN modules, and there wasn't really a good way around it. But now that Devel::PPPort 3.54 is out, ppport.h has new facilities for getting modules making these changes to work with older Perl releases.
* Change name of _utf8_to_uvchr_buf() (Karl Williamson, 2019-09-15; 1 file, -1/+1)
  A function name with a leading underscore is not legal in C. Instead add a suffix to differentiate this name from an otherwise identical one.
* Strip leading underscore from is_utf8_char_helper() (Karl Williamson, 2019-09-15; 1 file, -3/+3)
  Names with a leading underscore are reserved for the C implementers.
* Document UTF8_MAXBYTES (Karl Williamson, 2019-09-02; 1 file, -5/+13)
* perlapi: Document UNICODE_REPLACEMENT (Karl Williamson, 2019-09-02; 1 file, -0/+5)
* perlapi: Document NATIVE_to_foo, converse functions (Karl Williamson, 2019-09-02; 1 file, -3/+45)
* utf8_to_uvchr_buf() make behavior match docs (Karl Williamson, 2019-07-01; 1 file, -3/+1)
  For well-formed input, there is no change. But for malformed input, it wasn't returning the documented length when warnings were enabled, and not always the documented value when they were disabled. This is implemented as an inline function, called from both the macro and the Perl_ form. Devel::PPPort has sufficient tests for this.
* PATCH: [perl #133896] Assertion failure (Karl Williamson, 2019-04-05; 1 file, -4/+7)
  This was due to UTF8_SAFE_SKIP(s, e) not allowing s to be as large as e, and there are legitimate cases where it can be. This commit hardens the macro so that it never reads above e-1, returning 0 if it otherwise would be required to. The assertion is changed to 's <= e'.
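The hardened behaviour described above can be sketched as a function. The names are hypothetical, the length table covers only standard 1 to 4 byte UTF-8, and the real macro may be written quite differently; the sketch only mirrors the three stated properties: assert 's <= e', return 0 when s has reached e, and never report more bytes than remain.

```c
#include <assert.h>
#include <stddef.h>

/* Length implied by a lead byte, standard UTF-8 only (assumption). */
static size_t safe_expected_len(unsigned char lead)
{
    if (lead < 0xC0) return 1;
    if (lead < 0xE0) return 2;
    if (lead < 0xF0) return 3;
    if (lead < 0xF8) return 4;
    return 1;
}

/* Bytes to advance from s without passing e (one past the last byte). */
size_t safe_skip_sketch(const char *s, const char *e)
{
    size_t avail, want;
    assert(s <= e);                 /* s == e is legitimate */
    if (s >= e) return 0;           /* nothing readable: do not touch *s */
    avail = (size_t) (e - s);
    want = safe_expected_len((unsigned char) *s);
    return want < avail ? want : avail;
}
```

Returning 0 at the end of the buffer lets a loop of the form `s += skip` terminate naturally, provided the caller checks for a zero step.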
* Add UTF8_SAFE_SKIP API macro (Karl Williamson, 2019-03-13; 1 file, -0/+11)
  This version of UTF8SKIP refuses to advance beyond the end pointer.
* Remove relics of regex swash use (Karl Williamson, 2019-02-14; 1 file, -5/+0)
  This removes the most obvious and easy things that are no longer needed since regexes no longer use swashes at all. tr/// continues, for the time being, to use swashes, so not all swash handling is removable now. But tr/// doesn't use inversion lists, and so a bunch of code is ripped out here. Other code could have been, but I did only the relatively easy stuff. The rest can be ripped out all at once when tr/// stops using swashes.
* Pass a UV to a format expecting a UV (Tony Cook, 2018-11-29; 1 file, -1/+1)
  MAX_LEGAL_CP can end up as int depending on the ranges of the types involved, causing a type mismatch on the format in cp_above_legal_max. By adding the cast to the macro definition we both prevent the type mismatch on the format and may allow some static analysis tool to detect comparisons against signed types, which is likely an error.
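The effect of putting the cast inside the macro can be illustrated with standard fixed-width types. Everything named here (`UV_sketch`, `MAX_CP_SKETCH`, `format_max_cp_sketch`) is hypothetical, and the value chosen is only a placeholder assumed for the example.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

typedef uint64_t UV_sketch;         /* stand-in for an unsigned value type */

/* Without the cast, the constant would have a signed type and would not
 * match an unsigned conversion specifier.  With it, every use of the
 * macro has the unsigned type the format expects. */
#define MAX_CP_SKETCH ((UV_sketch) INT64_MAX)

/* Formats the limit in hexadecimal; returns the snprintf() result. */
int format_max_cp_sketch(char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "%" PRIX64, MAX_CP_SKETCH);
}
```

Because the cast lives in the definition, callers cannot forget it, and a comparison between the macro and a signed variable becomes a mixed-sign comparison that a compiler or analyser can flag.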
* utf8.h: Update outmoded comment (Karl Williamson, 2018-08-20; 1 file, -1/+1)
* utf8.c: Rename macro and move to utf8.h, and use it in regcomp.c (Karl Williamson, 2018-08-20; 1 file, -0/+2)
  This hides an internal detail.
* Make isC9_STRICT_UTF8_CHAR() an inline dfa (Karl Williamson, 2018-07-05; 1 file, -62/+0)
  This replaces a complicated trie with a dfa. This should cut down the number of conditionals encountered in parsing many code points.
* Make isSTRICT_UTF8_CHAR() an inline function (Karl Williamson, 2018-07-05; 1 file, -88/+3)
  It was a macro that used a trie. This changes to use the dfa constructed in previous commits. I didn't bother with taking measurements. A dfa should have fewer conditionals for many code points.
* Make isUTF8_char() an inline function (Karl Williamson, 2018-07-05; 1 file, -73/+0)
  It was a macro that used a trie. This changes to use the dfa constructed in previous commits. I didn't bother with taking measurements. A dfa should require fewer conditionals to be executed for many code points.
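To show what recognising one UTF-8 character with a finite automaton involves, here is a small hand-written state machine. It is a sketch under stated assumptions and not the automaton in the Perl source: it accepts only sequences that are well-formed per the Unicode Standard (no overlongs, no surrogates, nothing above U+10FFFF), it uses a switch for readability where a production DFA would typically index a transition table, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum u8_state {
    S_ACCEPT,                  /* start state, and "character complete" */
    S_TAIL1, S_TAIL2, S_TAIL3, /* that many continuation bytes remain   */
    S_E0, S_ED, S_F0, S_F4,    /* lead bytes with a restricted 2nd byte */
    S_REJECT
};

static int in_range(unsigned char b, unsigned lo, unsigned hi)
{
    return b >= lo && b <= hi;
}

/* One transition of the automaton. */
static enum u8_state u8_step(enum u8_state st, unsigned char b)
{
    switch (st) {
    case S_ACCEPT:
        if (b <= 0x7F)              return S_ACCEPT;
        if (in_range(b, 0xC2, 0xDF)) return S_TAIL1;
        if (b == 0xE0)              return S_E0;   /* excludes overlongs  */
        if (b == 0xED)              return S_ED;   /* excludes surrogates */
        if (in_range(b, 0xE1, 0xEF)) return S_TAIL2;
        if (b == 0xF0)              return S_F0;   /* excludes overlongs  */
        if (in_range(b, 0xF1, 0xF3)) return S_TAIL3;
        if (b == 0xF4)              return S_F4;   /* caps at U+10FFFF    */
        return S_REJECT;
    case S_TAIL1: return in_range(b, 0x80, 0xBF) ? S_ACCEPT : S_REJECT;
    case S_TAIL2: return in_range(b, 0x80, 0xBF) ? S_TAIL1  : S_REJECT;
    case S_TAIL3: return in_range(b, 0x80, 0xBF) ? S_TAIL2  : S_REJECT;
    case S_E0:    return in_range(b, 0xA0, 0xBF) ? S_TAIL1  : S_REJECT;
    case S_ED:    return in_range(b, 0x80, 0x9F) ? S_TAIL1  : S_REJECT;
    case S_F0:    return in_range(b, 0x90, 0xBF) ? S_TAIL2  : S_REJECT;
    case S_F4:    return in_range(b, 0x80, 0x8F) ? S_TAIL2  : S_REJECT;
    default:      return S_REJECT;
    }
}

/* Length of the well-formed character at the start of s[0..len), or 0
 * if the bytes are malformed or the buffer ends mid-character. */
size_t utf8_char_len_sketch(const char *s, size_t len)
{
    enum u8_state st = S_ACCEPT;
    size_t i;
    assert(s != NULL);
    for (i = 0; i < len; i++) {
        st = u8_step(st, (unsigned char) s[i]);
        if (st == S_ACCEPT) return i + 1;
        if (st == S_REJECT) return 0;
    }
    return 0;
}
```

Each input byte causes exactly one transition, so the work per character is bounded by its length regardless of which code point it encodes.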
* utf8.h: Remove obsolete comment (Karl Williamson, 2018-07-05; 1 file, -3/+2)
* utf8.h: Add assert for utf8n_to_uvchr_buf() (Karl Williamson, 2018-07-01; 1 file, -2/+3)
  The Perl_utf8n_to_uvchr_buf() version of this function has an assert; this adds it as well to the macro that bypasses the function.
* utf8.h: Add in #define for backcompat (Karl Williamson, 2018-02-19; 1 file, -0/+1)
  This symbol somehow got deleted, and it really shouldn't have been. This should not go in perldelta, as we don't want people to be using this ancient symbol who aren't already.
* Add uvchr_to_utf8_flags_msgs() (Karl Williamson, 2018-02-07; 1 file, -1/+11)
  This is prompted by Encode's needs. When called with the proper parameter, it returns any warnings instead of displaying them directly.
* Add utf8n_to_uvchr_msgs() (Karl Williamson, 2018-01-30; 1 file, -0/+2)
  This UTF-8 to code point translator variant is to meet the needs of Encode, and provides XS authors with more general capability than the other decoders.
* utf8.h: Comments only (Karl Williamson, 2017-07-12; 1 file, -9/+16)
  An earlier commit had split some comments up. And this adds clarifying details.
* utf8n_to_uvchr() Properly test for extended UTF-8 (Karl Williamson, 2017-07-12; 1 file, -2/+4)
  It somehow dawned on me that the code is incorrect for warning/disallowing very high code points. What is really wanted in the API is to catch UTF-8 that is not necessarily portable. There are several classes of this, but I'm referring here to just the code points that are above the Unicode-defined maximum of 0x10FFFF. These can be considered non-portable, and there is a mechanism in the API to warn/disallow these.

  However, an earlier standard defined UTF-8 to handle code points up to 2**31-1. Anything above that is using an extension to UTF-8 that has never been officially recognized. Perl does use such an extension, and the API is supposed to have a different mechanism to warn/disallow on this.

  Thus there are two classes of warning/disallowing for above-Unicode code points: one for things that have some non-Unicode official recognition, and the other for things that have never had official recognition.

  UTF-EBCDIC differs somewhat in this, and since Perl 5.24, we have had a Perl extension that allows it to handle any code point that fits in a 64-bit word. This kicks in at code points above 2**30-1, a different threshold from the one at which extended UTF-8 kicks in on ASCII platforms.

  Things are also complicated by the fact that the API has provisions for accepting the overlong UTF-8 malformation. It is possible to use extended UTF-8 to represent code points smaller than 31-bit ones.

  Until this commit, the extended warning/disallowing was based on the resultant code point, and only when that code point did not fit into 31 bits. But what is really wanted is to know whether extended UTF-8 was used to represent a code point, no matter how large the resultant code point is. This differs from the previous definition, but only for EBCDIC platforms, or when the overlong malformation was also present. So it does not affect very many real-world cases. This commit fixes that.

  It turns out that it is easier to tell if something is using extended UTF-8. One just looks at the first byte of a sequence.

  The trailing part of the warning message that gets raised is slightly changed to be clearer. It's not significant enough to affect perldiag.
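The "look at the first byte" test can be sketched for ASCII platforms. The sketch rests on one assumption that should be checked against utf8.h: since the longest form in the original UTF-8 definition (six bytes, reaching 2**31-1) begins with 0xFC or 0xFD, any sequence whose lead byte is 0xFE or 0xFF must be using an extension. The function name is hypothetical, and UTF-EBCDIC would need a different threshold.

```c
#include <assert.h>
#include <stddef.h>

/* Returns nonzero if the sequence starting at s uses a lead byte that
 * the original (up to 2**31-1) UTF-8 definition never produces.
 * ASCII-platform assumption: such lead bytes are 0xFE and 0xFF. */
int starts_extended_sketch(const char *s, size_t len)
{
    assert(s != NULL && len > 0);
    return (unsigned char) s[0] >= 0xFE;
}
```

Because the decision depends only on the lead byte, it gives the same answer for an overlong encoding of a small code point as for a genuinely huge one, which is the distinction the commit message draws between "extended UTF-8 was used" and "the resulting code point is large".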
* utf8.h: Add synonyms for flag names (Karl Williamson, 2017-07-12; 1 file, -14/+19)
  The next commit will fix the detection of using Perl's extended UTF-8 to be more accurate. The current name for various flags in the API is somewhat misleading. What is really wanted to know is if extended UTF-8 was used, not the value of the resultant code point.

  This commit basically does
      s/ABOVE_31_BIT/PERL_EXTENDED/g
  It also similarly changes the name of a hash key in APItest/t/utf8.t.

  This intermediary step makes the next commit easier to read.