path: root/utf8.h
Commit message | Author | Age | Files | Lines
* utf8.h: Comments only (Karl Williamson, 2017-07-12; 1 file, -9/+16)
  An earlier commit had split some comments up; this one adds clarifying details.
* utf8n_to_uvchr() Properly test for extended UTF-8 (Karl Williamson, 2017-07-12; 1 file, -2/+4)
  It somehow dawned on me that the code is incorrect for warning on/disallowing very high code points. What is really wanted in the API is to catch UTF-8 that is not necessarily portable. There are several classes of this, but I'm referring here to just the code points above the Unicode-defined maximum of 0x10FFFF. These can be considered non-portable, and there is a mechanism in the API to warn/disallow them.

  However, an earlier standard defined UTF-8 to handle code points up to 2**31 - 1. Anything above that uses an extension to UTF-8 that has never been officially recognized. Perl does use such an extension, and the API is supposed to have a different mechanism to warn/disallow it. Thus there are two classes of warning/disallowing for above-Unicode code points: one for things that have some non-Unicode official recognition, and the other for things that have never had official recognition.

  UTF-EBCDIC differs somewhat in this, and since Perl 5.24 we have had a Perl extension that allows it to handle any code point that fits in a 64-bit word. This kicks in at code points above 2**30 - 1, a different number than where extended UTF-8 kicks in on ASCII platforms.

  Things are also complicated by the fact that the API has provisions for accepting the overlong UTF-8 malformation, and it is possible to use extended UTF-8 to represent code points smaller than 31-bit ones. Until this commit, the extended warning/disallowing was based on the resultant code point, and applied only when that code point did not fit into 31 bits. But what is really wanted is whether extended UTF-8 was used to represent a code point, no matter how large the result is. This differs from the previous definition only on EBCDIC platforms, or when the overlong malformation is also present, so it does not affect very many real-world cases. This commit fixes that.

  It turns out that it is easier to tell if something is using extended UTF-8: one just looks at the first byte of a sequence. The trailing part of the warning message that gets raised is slightly changed to be clearer; it's not significant enough to affect perldiag.
* utf8.h: Add synonyms for flag names (Karl Williamson, 2017-07-12; 1 file, -14/+19)
  The next commit will make the detection of Perl's extended UTF-8 more accurate. The current names of various flags in the API are somewhat misleading: what we really want to know is whether extended UTF-8 was used, not the value of the resultant code point. This commit basically does s/ABOVE_31_BIT/PERL_EXTENDED/g. It also similarly changes the name of a hash key in APItest/t/utf8.t. This intermediary step makes the next commit easier to read.
* utf8.c: Move some #defines here, the only file that uses them (Karl Williamson, 2017-07-01; 1 file, -7/+0)
  These are very specialized #defines to determine if UTF-8 overflows the word size of the platform. I think it's unwise to make them generally available.
* Add new function utf8_from_bytes_loc() (Karl Williamson, 2017-06-14; 1 file, -0/+2)
  This is currently undocumented externally, so we can change the API if needed. It is like utf8_from_bytes(), but when it cannot convert the whole string, it converts the initial substring that is convertible and tells you where it had to stop.
* Use new paradigm for hdr file double inclusion guard (Karl Williamson, 2017-06-02; 1 file, -3/+3)
  We changed to use symbols not likely to be used by non-Perl code that could conflict, and which have trailing underbars, so they don't look like a regular Perl #define. See https://rt.perl.org/Ticket/Display.html?id=131110

  There are many more header files which are not guarded.
* utf8.h: Add parens around macro param in expansion (Karl Williamson, 2017-06-01; 1 file, -3/+4)
  a6951642ede4abe605dcf0e94b74948e0a60a56b added an assertion to find bugs in calling macros and, so far, has instead found a bug in a macro. A parameter needs to be enclosed in parens in case it is an expression, so that operator precedence works as intended.
* utf8.h: Add assertions for macros that take chars (Karl Williamson, 2017-06-01; 1 file, -10/+18)
  This is inspired by [perl #131190]. The UTF-8 macros whose parameters are characters now have assertions that verify they are not being called with something that won't fit in a char. These assertions should be getting optimized out if the input type is a char or U8.
* utf8.h: Clarify comment (Karl Williamson, 2017-02-11; 1 file, -4/+4)
* utf8.h: White-space, parens only (Karl Williamson, 2017-02-11; 1 file, -6/+6)
  Add parens to clarify grouping, and white space for legibility.
* utf8.h: Add branch prediction (Karl Williamson, 2017-02-11; 1 file, -1/+1)
  A "use bytes" being in effect is unlikely, so mark that branch as such.
* Deprecate toFOO_utf8() (Karl Williamson, 2016-12-23; 1 file, -4/+4)
  Now that there are _safe versions, deprecate the unsafe ones.
* Add toFOO_utf8_safe() macros (Karl Williamson, 2016-12-23; 1 file, -4/+9)
* utf8.c: Add flag to indicate unsure as to end of string to print (Karl Williamson, 2016-12-23; 1 file, -0/+1)
  When decoding a UTF-8 encoded string, we may have guessed how long it is. This adds a flag so that the base-level decode routine knows the length is a guess; it then prints minimal rather than the normal full information, so as to minimize reading past the end of the string.
* Deprecate isFOO_utf8() macros (Karl Williamson, 2016-12-23; 1 file, -7/+13)
  These macros are being replaced by safe versions; they now generate a deprecation message at each call site upon the first use there in each program run.
* Allow allowing UTF-8 overflow malformation (Karl Williamson, 2016-12-23; 1 file, -3/+8)
  perl has never allowed the UTF-8 overflow malformation, for some reason. But as long as overflows are turned into the REPLACEMENT CHARACTER, there is no real reason not to. Making overflow allowable lets code that wants to carry on in the face of malformed input do so without risk of contamination, as the REPLACEMENT CHARACTER is the Unicode-prescribed way of handling malformations.
* Return REPLACEMENT for UTF-8 overlong malformation (Karl Williamson, 2016-12-23; 1 file, -1/+4)
  When perl decodes UTF-8 into a code point, it must decide what to do if the input is malformed in some way. When the flags passed to the decode function indicate that a given malformation type is not acceptable, the function returns 0 to indicate failure; on success it returns the decoded code point (unfortunately that may require disambiguation when the input is validly a NUL).

  As perl evolved, the handling of allowed malformations got stricter and stricter. The overlong is the final malformation that was not turned into a REPLACEMENT CHARACTER when allowed, and this commit changes the function to return that. Unlike most other malformations, the code point value of an overlong is well-defined, which is why it hadn't been changed heretofore. But it is safer to use the Unicode-prescribed behavior on all malformations, which is to replace them with the REPLACEMENT CHARACTER. Just in case some code requires the old behavior, it is retained, but you have to search the source for the undocumented flag that enables it.
* utf8.h: Don't allow zero length malformation unless requested (Karl Williamson, 2016-12-23; 1 file, -6/+10)
  The bottom-level Perl routine that decodes UTF-8 into a code point has long accepted inputs where the length is specified to be 0, returning a NUL. It considers this a malformation, which is accepted in some scenarios but not others. In consultation with Tony Cook, we decided this really isn't a malformation but a bug in the calling program: rather than call the decode routine with nothing to decode, the caller should simply not call it.

  This commit removes acceptance of a zero-length string from all of the canned flag combinations passed to the decode function. One can still specify this flag explicitly if necessary. However, the next commit will cause that to fail under DEBUGGING builds, as a step towards removing the capability altogether.
* utf8.h: Renumber flag bits (Karl Williamson, 2016-12-23; 1 file, -10/+10)
  This creates a gap that will be filled by future commits.
* Add isFOO_utf8_safe() macros (Karl Williamson, 2016-12-23; 1 file, -0/+11)
  The original API does not check that we aren't reading beyond the end of a buffer, apparently assuming that malformed UTF-8 could be kept out by gatekeepers, but that is currently impossible. This commit adds "safe" macros for determining if a UTF-8 sequence represents an alphabetic, a digit, etc. Each new macro has an extra parameter pointing to the end of the sequence, so that looking beyond the input string can be avoided.

  The macros aren't currently completely safe, as they don't test that there is at least a single valid byte in the input, except by an assertion in DEBUGGING builds. This is because they are typically called from code that makes that assumption and frequently has already tested the current byte for one thing or another.
* Fix above-Unicode UTF-8 detection for EBCDIC (Karl Williamson, 2016-12-13; 1 file, -1/+1)
  The root cause was missing parentheses causing (s[0] + 1) to be evaluated instead of the desired s[1]. This caused an error in lib/warnings.t, but only on EBCDIC platforms.
* Fix const correctness in utf8.h (Petr Písař, 2016-12-01; 1 file, -57/+57)
  The original code was generated and then hand-tuned. Therefore I edited the code in place instead of fixing the regen/regcharclass.pl generator.

  Signed-off-by: Petr Písař <ppisar@redhat.com>
* utf8.h: Simplify macro (Karl Williamson, 2016-10-13; 1 file, -3/+1)
  This complicated macro boils down to just one bit.
* Add utf8n_to_uvchr_error (Karl Williamson, 2016-10-13; 1 file, -0/+2)
  This new function behaves like utf8n_to_uvchr(), but takes an extra parameter pointing to a U32, which is set to 0 if no errors are found; otherwise each error found sets a bit in it. The caller can use this to figure out precisely what the error(s) is/are; previously, one would have to capture and parse the raised warning/error messages. This can be used, for example, to customize the messages to the expected end-user's knowledge level.
* utf8n_to_uvchr(): Note multiple malformations (Karl Williamson, 2016-10-13; 1 file, -8/+13)
  Some UTF-8 sequences can have multiple malformations. For example, a sequence can be the start of an overlong representation of a code point and still be incomplete. Until this commit, the code generally stopped looking when the first malformation was found. This was not correct behavior, as that malformation might be allowed while another, unallowed one went unnoticed. (This did not actually create security holes, as allowed malformations replace the input with a REPLACEMENT CHARACTER.)

  This commit refactors the error handling of this function to set a flag and keep going when a malformation is found that doesn't preclude others. Each is then handled in a loop at the end, warning if warranted. The result is a warning for each malformation for which warnings should be generated, and an error return if any one is disallowed.

  Overflow doesn't happen except for very high code points, well above the Unicode range and above fitting in 31 bits. Hence the latter two potential malformations are subsets of overflow, so only one warning is output: the most dire. This also speeds up the normal case slightly, as the test for overflow is pulled out of the loop, allowing the UV to overflow; a single test after the loop determines whether overflow occurred.
* utf8.h: Change some flag definition constants (Karl Williamson, 2016-10-13; 1 file, -9/+9)
  These #defines give flag bits in a U32. This commit opens a gap that will be filled in a future commit. A test file has to change to correspond, as it duplicates the defines.
* Add API Unicode handling functions (Karl Williamson, 2016-09-25; 1 file, -0/+9)
  These functions are all extensions of the is_utf8_string_foo() functions that restrict the UTF-8 recognized as valid in various ways. There are named ones for the two definitions that Unicode makes, and foo_flags ones for more custom restrictions. The named ones are implemented as tries, while the flags ones provide complete generality.
* Move #define to different header (Karl Williamson, 2016-09-25; 1 file, -3/+0)
  Instead of having a comment in one header pointing to the #define in the other, remove the indirection and just have the #define itself where it is needed.
* perlapi: Clarifications, nits in Unicode support docs (Karl Williamson, 2016-09-25; 1 file, -9/+29)
  This also makes a white-space change to inline.h.
* Add isUTF8_CHAR_flags() macro (Karl Williamson, 2016-09-17; 1 file, -0/+34)
  This is like the previous two commits, but the macro takes a flags parameter so that any combination of the disallowed flags may be used. The others, along with the original isUTF8_CHAR(), cover the most commonly desired strictures and use an implementation of a (hopefully inlined) trie for speed. This macro is for generality, and the major portion of its implementation isn't inlined.
* Add macro for Unicode Corrigendum #9 strict (Karl Williamson, 2016-09-17; 1 file, -1/+54)
  This macro follows Unicode Corrigendum #9 in allowing non-character code points; these are still discouraged but not completely forbidden. It's best for code that isn't intended to operate on arbitrary other text to use the original definition, but code that does, such as source code control systems, should change to use this definition if it wants to be Unicode-strict. Perl can't adopt Corrigendum #9 wholesale, as that might create security holes in existing applications that rely on Perl keeping non-characters out.
* Add macro for determining if UTF-8 is Unicode-strict (Karl Williamson, 2016-09-17; 1 file, -3/+78)
* perlapi: Clarify isUTF8_CHAR() (Karl Williamson, 2016-09-17; 1 file, -3/+4)
* utf8.h: Add comment, white-space changes (Karl Williamson, 2016-09-17; 1 file, -7/+11)
* Enhance and rename is_utf8_char_slow() (Karl Williamson, 2016-09-17; 1 file, -3/+3)
  This changes the name of this helper function and adds a parameter and functionality to allow it to exclude problematic classes of code points, the same ones excludable by utf8n_to_uvchr(), like surrogates or non-character code points.
* isUTF8_CHAR(): Bring UTF-EBCDIC to parity with ASCII (Karl Williamson, 2016-09-17; 1 file, -51/+37)
  This changes the macro isUTF8_CHAR to have the same number of code points built in for EBCDIC as for ASCII, obsoleting the IS_UTF8_CHAR_FAST macro, which is removed.

  Previously, the code generated by regen/regcharclass.pl for ASCII platforms was hand-copied into utf8.h with LIKELY()s manually added, and the generating code was commented out. Now this has been done for EBCDIC platforms as well, which makes regenerating regcharclass.h faster. The copied macro in utf8.h is moved by this commit to within the main code section for non-EBCDIC compiles, cutting down the number of #ifdefs, and the comments about it are changed somewhat.
* Add IS_UTF8_INVARIANT and IS_UVCHR_INVARIANT to API (Karl Williamson, 2016-09-17; 1 file, -10/+32)
* Add #defines for XS code for Unicode Corrigendum #9 (Karl Williamson, 2016-09-17; 1 file, -5/+14)
  These are convenience macros.
* Make 3 UTF-8 macros API (Karl Williamson, 2016-08-31; 1 file, -16/+62)
  These may be useful to various module writers; they certainly are useful for Encode. This makes public API macros to determine if the input UTF-8 represents (one macro for each category): a) a surrogate code point; b) a non-character code point; c) a code point above Unicode's legal maximum.

  The macros are machine-generated. In making them public, I am now using the string-end location parameter to guard against running off the end of the input. Previously this parameter was ignored, as use within the core could be tightly controlled so that we already knew the string was long enough when calling these macros; that can't be guaranteed in the public API. An optimizing compiler should be able to remove redundant length checks.
* Move isUTF8_CHAR helper function, and reimplement it (Karl Williamson, 2016-08-31; 1 file, -7/+18)
  The macro isUTF8_CHAR calls a helper function for code points higher than it can handle itself. That function had been an inlined wrapper around utf8n_to_uvchr(). It has been rewritten to not call utf8n_to_uvchr(), so it is now too big to be effectively inlined. Instead, it implements a faster method of checking the validity of the UTF-8 without having to decode it: it just checks for valid syntax, knows where the few discontinuities are in UTF-8 where overlongs can occur, and uses a string compare to verify that overflow won't occur. As a result, this is now a pure function.

  This also stops a previously generated deprecation warning, because printing UTF-8 no longer requires converting it to internal form. I could add a check for that, but I think it's best not to: if you manipulated what is getting printed in any way, the deprecation message will already have been raised.

  This commit also fleshes out the documentation of isUTF8_CHAR.
* Add #defines for UTF-8 of highest representable code point (Karl Williamson, 2016-08-31; 1 file, -0/+7)
  This will allow the next commit to avoid actually decoding a UTF-8 string in order to see if it overflows the platform.
* utf8.h: Add some LIKELY() to help branch prediction (Karl Williamson, 2016-08-31; 1 file, -9/+11)
  This macro matches the legal UTF-8 byte sequences. The input will almost always be legal, so help the compiler's branch prediction for that case.
* utf8.h, utfebcdic.h: Add comments, align white space (Karl Williamson, 2016-08-31; 1 file, -19/+27)
* Add new synonym 'is_utf8_invariant_string' (Karl Williamson, 2016-08-31; 1 file, -3/+10)
  This name is clearer as to its meaning than the existing 'is_ascii_string' and 'is_invariant_string', which are retained for back compat. The thread context variable is removed, as it is not used.
* silence MSVC warnings for NATIVE_UTF8_TO_I8/I8_TO_NATIVE_UTF8 (Daniel Dragan, 2016-08-14; 1 file, -2/+2)
  The result of I8_TO_NATIVE_UTF8 has to be cast to U8 for the MSVC-specific PERL_SMALL_MACRO_BUFFER option, just as it is for newer compilers that don't have a small CPP buffer. Commit 1a3756de64/#127426 added U8 casts to NATIVE_TO_LATIN1/LATIN1_TO_NATIVE but missed NATIVE_UTF8_TO_I8/I8_TO_NATIVE_UTF8; this commit fixes that.

  One example of the C4244 warning is that VC6 thinks 0xFF & (0xFE << 6) in UTF_START_MARK could be bigger than 0xFF (a char). This fixes

    ..\inline.h(247) : warning C4244: '=' : conversion from 'long ' to 'unsigned char ', possible loss of data

  and

    ..\utf8.c(146) : warning C4244: '=' : conversion from 'UV' to 'U8', possible loss of data

  and a lot more warnings in utf8.c.
* utf8.h: Add comment clarification (Karl Williamson, 2016-08-02; 1 file, -2/+5)
* utf8.h: Guard some macros against improper calls (Karl Williamson, 2016-02-10; 1 file, -12/+18)
  The UTF8_IS_foo() macros have an inconsistent API: in some, the parameter is a pointer, and in others it is a byte. In the former case, a call of the wrong type will not compile, as it will try to dereference a non-pointer. This commit makes the byte-taking ones likewise fail to compile when called wrongly, by using the technique shown by Lukas Mai (in 9c903d5937fa3682f21b2aece7f6011b6fcb2750) of ORing the argument with a constant 0, which should get optimized out.
* [perl #127426] fixes for 126045 patch, restrict to MSVC, add casts (Tony Cook, 2016-02-01; 1 file, -3/+3)
* [perl #126045] part revert e9b19ab7 for vc2003 and earlier (Tony Cook, 2016-02-01; 1 file, -0/+15)
  This avoids an internal compiler error on VC 2003 and earlier.
* utf8.h: Add 2 assertions (Karl Williamson, 2015-12-22; 1 file, -2/+4)
  This makes sure, in DEBUGGING builds, that the macro is called correctly.