path: root/utf8.h
Commit message  Author  Age  Files  Lines
* utf8.h: Simplify macroKarl Williamson2016-10-131-3/+1
| | | | This complicated macro boils down to just one bit.
* Add utf8n_to_uvchr_errorKarl Williamson2016-10-131-0/+2
| | | | | | | | | | This new function behaves like utf8n_to_uvchr(), but takes an extra parameter that points to a U32 which will be set to 0 if no errors are found; otherwise each error found will set a bit in it. This can be used by the caller to figure out precisely what the error(s) is/are. Previously, one would have to capture and parse the warning/error messages raised. This can be used, for example, to customize the messages to the expected end-user's knowledge level.
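The error-bitmask idea described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: `decode_with_errors()`, the `ERR_*` bits and `friendly_message()` are hypothetical names, not Perl's API, and the decoder handles only standard UTF-8 of up to 4 bytes without checking for overlongs or surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical error bits for this sketch; not Perl's actual constants. */
#define ERR_EMPTY      (1u << 0)  /* zero-length input */
#define ERR_BAD_START  (1u << 1)  /* first byte cannot start a sequence */
#define ERR_SHORT      (1u << 2)  /* input ends before the sequence does */
#define ERR_NON_CONT   (1u << 3)  /* an expected continuation byte is missing */

#define REPLACEMENT 0xFFFDu

/* Sequence length implied by the start byte, or 0 if it cannot start one. */
static size_t expected_len(unsigned char b)
{
    if (b < 0x80)           return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Decode one character; *errors is 0 on success, otherwise one bit per
 * problem found, so the caller never has to parse a warning message. */
static uint32_t decode_with_errors(const unsigned char *s, size_t len,
                                   uint32_t *errors)
{
    size_t need, avail, i;
    uint32_t cp;

    assert(errors != NULL);
    *errors = 0;

    if (len == 0) { *errors |= ERR_EMPTY; return REPLACEMENT; }
    need = expected_len(s[0]);
    if (need == 0) { *errors |= ERR_BAD_START; return REPLACEMENT; }
    if (need == 1) return s[0];

    cp = s[0] & (0xFFu >> (need + 1));      /* payload bits of the start byte */
    if (len < need) *errors |= ERR_SHORT;
    avail = len < need ? len : need;
    for (i = 1; i < avail; i++) {
        if ((s[i] & 0xC0) != 0x80) { *errors |= ERR_NON_CONT; break; }
        cp = (cp << 6) | (s[i] & 0x3Fu);
    }
    return *errors ? REPLACEMENT : cp;
}

/* The caller maps bits to wording suited to its own audience. */
static const char *friendly_message(uint32_t errors)
{
    if (errors == 0)        return "ok";
    if (errors & ERR_SHORT) return "the text appears to be cut off";
    return "the text contains bytes that are not valid UTF-8";
}
```

A caller would test individual bits, for example `if (errors & ERR_SHORT)`, and choose its own message instead of capturing and parsing warning text.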
* utf8n_to_uvchr(): Note multiple malformationsKarl Williamson2016-10-131-8/+13
| | | | | | | | | | | | | | | | | | | | | | | | | Some UTF-8 sequences can have multiple malformations. For example, a sequence can be the start of an overlong representation of a code point, and still be incomplete. Until this commit what was generally done was to stop looking when the first malformation was found. This was not correct behavior, as that malformation may be allowed, while another unallowed one went unnoticed. (But this did not actually create security holes, as those allowed malformations replaced the input with a REPLACEMENT CHARACTER.) This commit refactors the error handling of this function to set a flag and keep going if a malformation is found that doesn't preclude others. Then each is handled in a loop at the end, warning if warranted. The result is that there is a warning for each malformation for which warnings should be generated, and an error return is made if any one is disallowed. Overflow doesn't happen except for very high code points, well above the Unicode range, and above fitting in 31 bits. Hence the latter 2 potential malformations are subsets of overflow, so only one warning is output--the most dire. This will speed up the normal case slightly, as the test for overflow is pulled out of the loop, allowing the UV to overflow. Then a single test after the loop is done to see if there was overflow or not.
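The approach in this commit message (record each malformation and keep going, process them all at the end, and test for overflow once after the loop) might look roughly like the following. This is a sketch, not the code in utf8.c: the `MALF_*` bits, `decode_collecting()` and the 32-bit limit are assumptions made for illustration, and it handles old-style sequences of up to 7 bytes (start byte 0xFE).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical malformation bits, for illustration only. */
#define MALF_SHORT     (1u << 0)  /* input ends before the sequence is complete */
#define MALF_NON_CONT  (1u << 1)  /* an expected continuation byte is missing */
#define MALF_OVERLONG  (1u << 2)  /* more bytes used than the value needs */
#define MALF_OVERFLOW  (1u << 3)  /* value does not fit in 32 bits */
#define MALF_BAD_START (1u << 4)  /* first byte cannot start a sequence */

/* Smallest value that legitimately needs n bytes (index = n). */
static const uint32_t min_for_len[8] = {
    0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000u
};

static size_t seq_len(unsigned char b)
{
    size_t n = 0;
    if (b < 0x80) return 1;
    while (b & 0x80) { n++; b = (unsigned char)(b << 1); }
    return (n == 1 || n > 7) ? 0 : n;   /* continuation byte, or 0xFF */
}

/* Returns 1 if no disallowed malformation was found, else 0.
 * *found receives every malformation seen, allowed or not. */
static int decode_collecting(const unsigned char *s, size_t len,
                             uint32_t allowed, uint32_t *cp, uint32_t *found)
{
    uint32_t acc = 0, f = 0, bit;
    size_t need, i;
    int ok = 1;

    assert(len > 0);
    need = seq_len(s[0]);
    if (need == 1) { *cp = s[0]; *found = 0; return 1; }

    if (need == 0) {
        f |= MALF_BAD_START;
    } else {
        acc = s[0] & (0xFFu >> (need + 1));
        if (s[0] == 0xC0 || s[0] == 0xC1)
            f |= MALF_OVERLONG;          /* overlong whatever follows */

        for (i = 1; i < need; i++) {
            if (i >= len) { f |= MALF_SHORT; break; }
            if ((s[i] & 0xC0) != 0x80) { f |= MALF_NON_CONT; break; }
            acc = (acc << 6) | (s[i] & 0x3Fu);  /* may wrap; checked below */
        }

        /* Single overflow test after the loop: in a 7-byte sequence the
         * first continuation byte carries bits 30..35 of the value. */
        if (need == 7 && len > 1 && (s[1] & 0xC0) == 0x80 && (s[1] & 0x3C))
            f |= MALF_OVERFLOW;
        else if (i == need && acc < min_for_len[need])
            f |= MALF_OVERLONG;
    }

    /* Handle each malformation found; real code might warn here too. */
    for (bit = 1; bit <= MALF_BAD_START; bit <<= 1)
        if ((f & bit) && !(allowed & bit))
            ok = 0;

    *found = f;
    *cp = f ? 0xFFFDu : acc;             /* REPLACEMENT CHARACTER on any problem */
    return ok;
}
```

With the single byte 0xC0 as input and only `MALF_SHORT` allowed, the sketch reports both `MALF_SHORT` and `MALF_OVERLONG` and returns failure, which is the case the commit says used to go unnoticed.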
* utf8.h: Change some flag definition constantsKarl Williamson2016-10-131-9/+9
| | | | | | These #defines give flag bits in a U32. This commit opens a gap that will be filled in a future commit. A test file has to change to correspond, as it duplicates the defines.
* Add API Unicode handling functionsKarl Williamson2016-09-251-0/+9
| | | | | | | | | | These functions are all extensions of the is_utf8_string_foo() functions, that restrict the UTF-8 recognized as valid in various ways. There are named ones for the two definitions that Unicode makes, and foo_flags ones for more custom restrictions. The named ones are implemented as tries, while the flags ones provide complete generality.
* Move #define to different headerKarl Williamson2016-09-251-3/+0
| | | | | | Instead of having a comment in one header pointing to the #define in the other, remove the indirection and just have the #define itself where it is needed.
* perlapi: Clarifications, nits in Unicode support docsKarl Williamson2016-09-251-9/+29
| | | | This also does a white space change to inline.h
* Add isUTF8_CHAR_flags() macroKarl Williamson2016-09-171-0/+34
| | | | | | | | | This is like the previous 2 commits, but the macro takes a flags parameter so any combination of the disallowed flags may be used. The others, along with the original isUTF8_CHAR(), are the most commonly desired strictures, and use an implementation of a, hopefully, inlined trie for speed. This is for generality and the major portion of its implementation isn't inlined.
* Add macro for Unicode Corrigendum #9 strictKarl Williamson2016-09-171-1/+54
| | | | | | | | | | | | | This macro follows Unicode Corrigendum #9 to allow non-character code points. These are still discouraged but not completely forbidden. It's best for code that isn't intended to operate on arbitrary other code text to use the original definition, but code that does things, such as source code control, should change to use this definition if it wants to be Unicode-strict. Perl can't adopt C9 wholesale, as it might create security holes in existing applications that rely on Perl keeping non-chars out.
* Add macro for determining if UTF-8 is Unicode-strictKarl Williamson2016-09-171-3/+78
|
* perlapi: Clarify isUTF8_CHAR()Karl Williamson2016-09-171-3/+4
|
* utf8.h: Add comment, white-space changesKarl Williamson2016-09-171-7/+11
|
* Enhance and rename is_utf8_char_slow()Karl Williamson2016-09-171-3/+3
| | | | | | | This changes the name of this helper function and adds a parameter and functionality to allow it to exclude problematic classes of code points, the same ones excludable by utf8n_to_uvchr(), like surrogates or non-character code points.
* isUTF8_CHAR(): Bring UTF-EBCDIC to parity with ASCIIKarl Williamson2016-09-171-51/+37
| | | | | | | | | | | | | | | | This changes the macro isUTF8_CHAR to have the same number of code points built-in for EBCDIC as ASCII. This obsoletes the IS_UTF8_CHAR_FAST macro, which is removed. Previously, the code generated by regen/regcharclass.pl for ASCII platforms was hand copied into utf8.h, and LIKELY's manually added, then the generating code was commented out. Now this has been done with EBCDIC platforms as well. This makes regenerating regcharclass.h faster. The copied macro in utf8.h is moved by this commit to within the main code section for non-EBCDIC compiles, cutting the number of #ifdef's down, and the comments about it are changed somewhat.
* Add IS_UTF8_INVARIANT and IS_UVCHR_INVARIANT to APIKarl Williamson2016-09-171-10/+32
|
* Add #defines for XS code for Unicode Corrigendum 9Karl Williamson2016-09-171-5/+14
| | | | These are convenience macros.
* Make 3 UTF-8 macros APIKarl Williamson2016-08-311-16/+62
| | | | | | | | | | | | | | | | | These may be useful to various module writers. They certainly are useful for Encode. This makes public API macros to determine if the input UTF-8 represents (one macro for each category) a) a surrogate code point b) a non-character code point c) a code point that is above Unicode's legal maximum. The macros are machine generated. In making them public, I am now using the string end location parameter to guard against running off the end of the input. Previously this parameter was ignored, as their use in the core could be tightly controlled so that we already knew that the string was long enough when calling these macros. But this can't be guaranteed in the public API. An optimizing compiler should be able to remove redundant length checks.
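The three categories, and the end-of-string guard that a public API needs, can be illustrated with hand-written functions. These are a sketch only: the function names are hypothetical, the real macros are machine generated, and the "above Unicode" test assumes an extended UTF-8 in which start bytes beyond 0xF4 encode larger values.

```c
#include <assert.h>
#include <stddef.h>

/* Each function inspects the bytes in [s, e) and never reads at or past e. */

/* U+D800..U+DFFF encode as ED A0..BF 80..BF. */
static int is_surrogate_utf8(const unsigned char *s, const unsigned char *e)
{
    assert(s <= e);
    return (e - s) >= 3
        && s[0] == 0xED
        && s[1] >= 0xA0 && s[1] <= 0xBF
        && s[2] >= 0x80 && s[2] <= 0xBF;
}

/* Above U+10FFFF: F4 followed by 90..BF, or any start byte beyond F4
 * (the start byte alone decides in that second case). */
static int is_above_unicode_utf8(const unsigned char *s, const unsigned char *e)
{
    assert(s <= e);
    if (e - s < 1) return 0;
    if (s[0] >= 0xF5) return 1;
    return (e - s) >= 2 && s[0] == 0xF4 && s[1] >= 0x90 && s[1] <= 0xBF;
}

/* Non-characters: U+FDD0..U+FDEF, plus the last two code points of
 * every plane (U+FFFE, U+FFFF, U+1FFFE, ... U+10FFFF). */
static int is_nonchar_utf8(const unsigned char *s, const unsigned char *e)
{
    assert(s <= e);
    if ((e - s) >= 3 && s[0] == 0xEF) {
        if (s[1] == 0xB7 && s[2] >= 0x90 && s[2] <= 0xAF) return 1;
        if (s[1] == 0xBF && (s[2] == 0xBE || s[2] == 0xBF)) return 1;
    }
    return (e - s) >= 4
        && s[0] >= 0xF0 && s[0] <= 0xF4
        && (s[1] & 0xCF) == 0x8F          /* continuation byte, low 4 bits set */
        && s[2] == 0xBF
        && (s[3] == 0xBE || s[3] == 0xBF)
        && !(s[0] == 0xF0 && s[1] == 0x8F)   /* would be overlong */
        && !(s[0] == 0xF4 && s[1] != 0x8F);  /* would be above U+10FFFF */
}
```

Passing a too-short range simply yields 0, which is the safety property the commit adds for callers outside the core.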
* Move isUTF8_CHAR helper function, and reimplement itKarl Williamson2016-08-311-7/+18
| | | | | | | | | | | | | | | | | | | | | | | The macro isUTF8_CHAR calls a helper function for code points higher than it can handle. That function had been an inlined wrapper around utf8n_to_uvchr(). The function has been rewritten to not call utf8n_to_uvchr(), so it is now too big to be effectively inlined. Instead, it implements a faster method of checking the validity of the UTF-8 without having to decode it. It just checks for valid syntax and now knows where the few discontinuities are in UTF-8 where overlongs can occur, and uses a string compare to verify that overflow won't occur. As a result this is now a pure function. This also means that a deprecation warning that was previously generated no longer is, because printing UTF-8 no longer requires converting it to internal form. I could add a check for that, but I think it's best not to. If you manipulated what is getting printed in any way, the deprecation message will already have been raised. This commit also fleshes out the documentation of isUTF8_CHAR.
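Validation without decoding, as described above, can be sketched as a syntax check, a test at the few places where overlongs can start, and a byte-wise comparison against the encoding of the highest representable value. The names below are hypothetical and the sketch assumes a 32-bit maximum with sequences of up to 7 bytes; the real helper differs in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Encoding of 0xFFFFFFFF, assumed here to be the highest representable value. */
static const unsigned char highest_utf8[7] = {
    0xFE, 0x83, 0xBF, 0xBF, 0xBF, 0xBF, 0xBF
};

/* For each length n >= 3: the start byte with no payload bits set, and the
 * smallest second byte that is NOT overlong after that start byte. */
static const unsigned char zero_start[8] = {0, 0, 0, 0xE0, 0xF0, 0xF8, 0xFC, 0xFE};
static const unsigned char min_second[8] = {0, 0, 0, 0xA0, 0x90, 0x88, 0x84, 0x82};

/* Returns the length of the valid character at s, or 0 if it is invalid.
 * Nothing is decoded and no state is touched, so this is a pure function. */
static size_t valid_char_len(const unsigned char *s, const unsigned char *e)
{
    size_t avail, need = 0, i;
    unsigned char b;

    assert(s <= e);
    avail = (size_t)(e - s);
    if (avail == 0) return 0;

    b = s[0];
    if (b < 0x80) return 1;
    while (b & 0x80) { need++; b = (unsigned char)(b << 1); }
    if (need == 1 || need > 7 || need > avail) return 0;

    /* Syntax: every following byte must be a continuation byte. */
    for (i = 1; i < need; i++)
        if ((s[i] & 0xC0) != 0x80) return 0;

    /* Overlongs can only begin at these few discontinuities. */
    if (s[0] < 0xC2) return 0;                       /* C0 and C1 */
    if (need >= 3 && s[0] == zero_start[need] && s[1] < min_second[need])
        return 0;

    /* Overflow: same-length sequences sort byte-wise in value order. */
    if (need == 7 && memcmp(s, highest_utf8, 7) > 0) return 0;

    return need;
}
```

Like the lax definition discussed in this log, the sketch accepts surrogates and non-characters; excluding those would be a separate check.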
* Add #defines for UTF-8 of highest representable code pointKarl Williamson2016-08-311-0/+7
| | | | | This will allow the next commit to not have to actually try to decode the UTF-8 string in order to see if it overflows the platform.
* utf8.h: Add some LIKELY() to help branch predictionKarl Williamson2016-08-311-9/+11
| | | | | This macro gives the legal UTF-8 byte sequences. Almost always, the input will be legal, so help compiler branch prediction for that.
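A branch hint of this kind is commonly built on the GCC/Clang builtin `__builtin_expect`. The sketch below uses a hypothetical `MY_LIKELY()` macro and a made-up helper to show the pattern; it is not the definition Perl uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hint macro: tell the compiler the condition is usually true.
 * Falls back to a plain boolean on compilers without the builtin. */
#if defined(__GNUC__) || defined(__clang__)
#  define MY_LIKELY(x)  __builtin_expect(!!(x), 1)
#else
#  define MY_LIKELY(x)  (!!(x))
#endif

/* Counts the leading bytes of s that are plain ASCII. The hint marks the
 * "byte is ASCII" case as the expected one. */
static size_t count_ascii_prefix(const unsigned char *s, size_t len)
{
    size_t i = 0;
    assert(s != NULL || len == 0);
    while (i < len && MY_LIKELY(s[i] < 0x80))
        i++;
    return i;
}
```

The hint changes only how the compiler lays out the code, never the result, so a wrong guess costs speed but not correctness.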
* utf8.h, utfebcdic.h: Add comments, align white spaceKarl Williamson2016-08-311-19/+27
|
* Add new synonym 'is_utf8_invariant_string'Karl Williamson2016-08-311-3/+10
| | | | | | This is clearer as to its meaning than the existing 'is_ascii_string' and 'is_invariant_string', which are retained for back compat. The thread context variable is removed as it is not used.
* silence MSVC warnings for NATIVE_UTF8_TO_I8/I8_TO_NATIVE_UTF8Daniel Dragan2016-08-141-2/+2
| | | | | | | | | | | | | | | | | | The result of I8_TO_NATIVE_UTF8 has to be cast to U8 for the MSVC-specific PERL_SMALL_MACRO_BUFFER option, just as it is for newer CCs that don't have a small CPP buffer. Commit 1a3756de64/#127426 did add U8 casts to NATIVE_TO_LATIN1/LATIN1_TO_NATIVE but missed NATIVE_UTF8_TO_I8/I8_TO_NATIVE_UTF8. This commit fixes that. One example of the C4244 warning is VC6 thinks 0xFF & (0xFE << 6) in UTF_START_MARK could be bigger than 0xff (a char), fixes ..\inline.h(247) : warning C4244: '=' : conversion from 'long ' to 'unsigned char ', possible loss of data Also fixes ..\utf8.c(146) : warning C4244: '=' : conversion from 'UV' to 'U8', possible loss of data and a lot more warnings in utf8.c
* utf8.h: Add comment clarificationKarl Williamson2016-08-021-2/+5
|
* utf8.h: Guard some macros against improper callsKarl Williamson2016-02-101-12/+18
| | | | | | | | | | The UTF8_IS_foo() macros have an inconsistent API. In some, the parameter is a pointer, and in others it is a byte. In the former case, a call of the wrong type will not compile, as it will try to dereference a non-ptr. This commit makes the other ones not compile when called wrongly, by using the technique shown by Lukas Mai (in 9c903d5937fa3682f21b2aece7f6011b6fcb2750) of ORing the argument with a constant 0, which should get optimized out.
* [perl #127426] fixes for 126045 patch, restrict to MSVC, add castsTony Cook2016-02-011-3/+3
|
* [perl #126045] part revert e9b19ab7 for vc2003 and earlierTony Cook2016-02-011-0/+15
| | | | This avoids an internal compiler error on VC 2003 and earlier
* utf8.h: Add 2 assertionsKarl Williamson2015-12-221-2/+4
| | | | This makes sure in DEBUGGING builds that the macro is called correctly.
* utf8.h, utfebcdic.h: Add #defineKarl Williamson2015-12-091-1/+5
| | | | for future use
* utf8.h: Fix macro definitionKarl Williamson2015-12-091-1/+1
| | | | | This has been wrong, and won't compile, should anyone have tried, since 635e76f560b3b3ca075aa2cb5d6d661601968e04 earlier in 5.23.
* utf8.h: Remove unused #defineKarl Williamson2015-12-091-4/+0
| | | | | | UTF8_QUAD_MAX is no longer used in the core, and is not in cpan, and its name is highly misleading. It is defined to be 2**36, which has really nothing to do with what its name indicates.
* utf8.h: Split UNICODE_IS_NONCHAR() into smaller macrosKarl Williamson2015-12-081-8/+17
| | | | | | This defines 2 macros that can be used individually to check for non-characters when the input range is known to be restricted to not include every possible one. This is for future commits.
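Splitting the non-character test into two pieces might look like the following. The macro names are hypothetical, chosen for this sketch; the point is that a caller who already knows its input cannot exceed U+10FFFF, or cannot fall in the U+FDD0 block, can use just one piece.

```c
#include <assert.h>
#include <stdint.h>

/* The 32 contiguous non-characters U+FDD0..U+FDEF. */
#define IN_FDD0_BLOCK(cp) \
    (((uint32_t)(cp)) >= 0xFDD0u && ((uint32_t)(cp)) <= 0xFDEFu)

/* The last two code points of a plane (xxFFFE and xxFFFF).
 * Assumes the caller already knows cp <= 0x10FFFF. */
#define ENDS_PLANE_GIVEN_IN_UNICODE(cp) \
    ((((uint32_t)(cp)) & 0xFFFEu) == 0xFFFEu)

/* The general test, built from the two pieces plus the range check. */
#define IS_NONCHAR(cp)                                        \
    (IN_FDD0_BLOCK(cp)                                        \
     || (((uint32_t)(cp)) <= 0x10FFFFu && ENDS_PLANE_GIVEN_IN_UNICODE(cp)))

static int is_nonchar(uint32_t cp)
{
    /* The two pieces can never both match: no value in U+FDD0..U+FDEF
     * has its low 16 bits equal to FFFE or FFFF. */
    assert(!(IN_FDD0_BLOCK(cp) && ENDS_PLANE_GIVEN_IN_UNICODE(cp)));
    return IS_NONCHAR(cp);
}
```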
* utf8.h, utfebcdic.h: Comments, white-space onlyKarl Williamson2015-12-061-6/+7
|
* utf8.h: Remove a branch in macro for Unicode surrogatesKarl Williamson2015-12-051-2/+5
| | | | | | By masking, this macro can be written so it only has one branch. Presumably an optimizing compiler would do the same, but not necessarily so.
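The masking idea relies on the surrogates occupying exactly the block U+D800..U+DFFF, whose members all share the same bits above the low 11. A sketch with hypothetical macro names, assuming 32-bit inputs:

```c
#include <assert.h>
#include <stdint.h>

/* Straightforward form: two comparisons, so potentially two branches. */
#define IS_SURROGATE_RANGE(cp) \
    (((uint32_t)(cp)) >= 0xD800u && ((uint32_t)(cp)) <= 0xDFFFu)

/* Masked form: clear the low 11 bits and compare once.
 * 0xD800..0xDFFF are exactly the values whose remaining bits are 0xD800. */
#define IS_SURROGATE_MASKED(cp) \
    ((((uint32_t)(cp)) & 0xFFFFF800u) == 0xD800u)

static int is_surrogate(uint32_t cp)
{
    /* Both forms must always agree. */
    assert(IS_SURROGATE_RANGE(cp) == IS_SURROGATE_MASKED(cp));
    return IS_SURROGATE_MASKED(cp);
}
```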
* utf8.h: Add some casts in macros, for safetyKarl Williamson2015-12-051-7/+7
| | | | | | This also renames the macro formal parameter to uv to be clearer as to what is expected as input, and there were cases where it was referred to inside the macro without being parenthesized, which was dangerous.
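The danger of an unparenthesized macro parameter is easy to reproduce. In this sketch, which uses hypothetical macro names, the unsafe version gives the wrong answer when the argument is a conditional expression, because `>` binds more tightly than `?:`.

```c
#include <assert.h>
#include <stdint.h>

/* Unsafe: uv is pasted in bare, so operators in the argument can
 * regroup with the comparison. */
#define IS_ABOVE_UNICODE_UNSAFE(uv)  (uv > 0x10FFFF)

/* Safe: the parameter is parenthesized and cast before use. */
#define IS_ABOVE_UNICODE_SAFE(uv)    (((uint32_t)(uv)) > 0x10FFFFu)

/* With flag == 1 the argument is 0x41, which is not above Unicode.
 * The unsafe macro expands to  flag ? 0x41 : (0x42 > 0x10FFFF),
 * which evaluates to 0x41, i.e. "true". */
static int unsafe_result(int flag)
{
    return IS_ABOVE_UNICODE_UNSAFE(flag ? 0x41 : 0x42) != 0;
}

static int safe_result(int flag)
{
    assert(flag == 0 || flag == 1);
    return IS_ABOVE_UNICODE_SAFE(flag ? 0x41 : 0x42) != 0;
}
```

Some compilers warn about the unsafe expansion, but it still compiles, which is why the commit fixes the macros themselves.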
* utf8.h: Combine EBCDIC and ASCII macrosKarl Williamson2015-12-051-18/+25
| | | | | | | | Previous commits have set things up so the macros are the same on both platforms. By moving them to the common part of utf8.h, they can share the same definition. The difference listing shows instead other things being moved due to the size of this move in comparison with those things that really stayed the same.
* utf8.h: Refactor macro definitionKarl Williamson2015-12-051-7/+27
| | | | | | | | This changes to use the underlying UTF-8 structure to compute the numbers in this macro, instead of hand-specifying the resultant ones. Thus, this macro, with minor tweaks, is the same text on both ASCII and EBCDIC platforms (though the resultant numbers differ), and the next commit will change them to use it in common.
* utf8.h: Combine EBCDIC and ASCII macrosKarl Williamson2015-12-051-9/+9
| | | | | | | | The previous commits have made these macros be the exact same text, so can be combined, and defined just once. This requires moving them to the portion of the file that is common with both EBCDIC and ASCII. The commit diff shows instead other code being moved.
* utf8.h: Refactor a macroKarl Williamson2015-12-051-1/+1
| | | | | | | This new definition expands to the same thing as before, but now the unexpanded text is identical to the EBCDIC definition (which expands to something else), so the next commit can combine the ASCII and EBCDIC ones into a single definition.
* utf8.h: Use common macro to avoid repeatingKarl Williamson2015-12-051-10/+14
| | | | | | | | This refactors two macros that have mostly the same guts to have a third macro to define the common guts. It also changes to use UV_IS_QUAD instead of a home-grown #define that a future commit will remove.
* utf8.h: Move #define within fileKarl Williamson2015-12-051-4/+4
| | | | This makes 2 related definitions adjacent.
* utf8.h: Combine EBCDIC and ASCII #definesKarl Williamson2015-12-051-4/+6
| | | | | | Change to use the same definition for two macros on both types of platforms, simplifying the code, by using the underlying structure of the encoding.
* utf8.h: Move #define to earlier in the fileKarl Williamson2015-12-051-6/+6
| | | | | And use its mnemonic in other #defines instead of repeating the raw value.
* utf8.h, et.al.: Clean up some castsKarl Williamson2015-12-051-10/+10
| | | | | By making sure the no-op macros cast the output appropriately, we can eliminate the casts that have been added in things that call them
* utf8.h: Combine ASCII and EBCDIC defines into oneKarl Williamson2015-12-051-3/+3
| | | | | By using a more fundamental value, these two definitions of the macro can be made the same, so only need one, common to both platforms
* utf8.h, utfebcdic.h: Fix-up UTF8_MAXBYTES_CASE defnKarl Williamson2015-12-051-12/+13
| | | | | | | | The definition had gotten moved away from its comments in utf8.h, and the wrong thing was being guarded by a #error, (UTF8_MAXBYTES instead). And it is possible to generalize to get the compiler to do the calculation, and to consolidate the definitions from the two files into a single one.
* utf8.h: Remove use of redundant flagsKarl Williamson2015-11-281-5/+6
| | | | | | | | The ABOVE_31_BIT flags are a proper subset of the SUPER flags, so if the latter are set, we don't have to bother setting the former. On the other hand, there is no harm in doing so, as these changes are all resolved at compile time. The reason I'm changing them is that it makes it easier to explain in the pod what is happening, in the next commit.
* utf8.h: Add clearer #define synonymsKarl Williamson2015-11-281-19/+23
| | | | | | | | | These names have long caused me consternation, as they are named after the internal ASCII-platform UTF-8 representation, which is not the same for EBCDIC platforms, nor do they convey meaning to someone who isn't currently steeped in the UTF-8 internals. I've added synonyms that are platform-independent in meaning and make more sense to someone coming at this cold. The old names are retained for back compat.
* Extend UTF-EBCDIC to handle up to 2**64-1Karl Williamson2015-11-251-10/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This uses for UTF-EBCDIC essentially the same mechanism that Perl already uses for UTF-8 on ASCII platforms to extend it beyond what might be its natural maximum. That is, when the UTF-8 start byte is 0xFF, it adds a bunch more bytes to the character than it otherwise would, bringing it to a total of 14 for UTF-EBCDIC. This is enough to handle any code point that fits in a 64 bit word. The downside of this is that this extension is not compatible with previous perls for the range 2**30 up through the previous max, 2**31 - 1. A simple program could be written to convert files that were written out using an older perl so that they can be read with newer perls, and the perldelta says we will do this should anyone ask. However, I strongly suspect that the number of such files in existence is zero, as people in EBCDIC land don't seem to use Unicode much, and these are very large code points, which are associated with a portability warning every time they are output in some way. This extension brings UTF-EBCDIC to parity with UTF-8, so that both can cover a 64-bit word. It allows some removal of special cases for EBCDIC in core code and core tests. And it is a necessary step to handle Perl 6's NFG, which I'd like eventually to bring to Perl 5. This commit causes two implementations of a macro in utf8.h and utfebcdic.h to become the same, and both are moved to a single one in the portion of utf8.h common to both. To illustrate, the I8 for U+3FFFFFFF (2**30-1) is "\xFE\xBF\xBF\xBF\xBF\xBF\xBF" before and after this commit, but the I8 for the next code point, U+40000000 is now "\xFF\xA0\xA0\xA0\xA0\xA0\xA0\xA1\xA0\xA0\xA0\xA0\xA0\xA0", and before this commit it was "\xFF\xA0\xA0\xA0\xA0\xA0\xA0". The I8 for 2**64-1 (U+FFFFFFFFFFFFFFFF) is "\xFF\xAF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF", whereas before this commit it was unrepresentable.
Commit 7c560c3beefbb9946463c9f7b946a13f02f319d8 said in its message that it was moving something that hadn't been needed on EBCDIC until the "next commit". That statement turned out to be wrong, overtaken by events. This now is the commit it was referring to; I pushed that earlier commit prematurely.
* utf8.h: Move #define within fileKarl Williamson2015-11-091-7/+7
| | | | | This #define has, until the next commit, not been needed on EBCDIC platforms; move it in preparation for that commit.