path: root/utf8.h
Commit log (newest first):
...
* Pass a UV to a format expecting a UV (Tony Cook, 2018-11-29; 1 file, -1/+1)
  MAX_LEGAL_CP can end up as an int, depending on the ranges of the types
  involved, causing a type mismatch on the format in cp_above_legal_max.
  Adding the cast to the macro definition both prevents the type mismatch
  on the format and may allow a static analysis tool to detect comparisons
  against signed types, which are likely errors.

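  A minimal sketch of the pattern, using stand-ins for Perl's UV type and
  its printf format; the exact MAX_LEGAL_CP definition in utf8.h may
  differ:

      #include <inttypes.h>
      #include <stdio.h>

      typedef uint64_t UV;           /* stand-in for Perl's UV */
      #define UVuf PRIu64            /* stand-in for Perl's UV format */

      /* The cast inside the macro makes every use of MAX_LEGAL_CP an
       * unsigned UV expression, so it matches a %"UVuf" format and never
       * silently participates in signed comparisons. */
      #define MAX_LEGAL_CP ((UV) 0x7FFFFFFFFFFFFFFFULL)

      int main(void) {
          printf("max legal cp = %" UVuf "\n", MAX_LEGAL_CP);
          return 0;
      }
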
* utf8.h: Update outmoded comment (Karl Williamson, 2018-08-20; 1 file, -1/+1)

* utf8.c: Rename macro and move to utf8.h, and use it in regcomp.c (Karl Williamson, 2018-08-20; 1 file, -0/+2)
  This hides an internal detail.

* Make isC9_STRICT_UTF8_CHAR() an inline dfa (Karl Williamson, 2018-07-05; 1 file, -62/+0)
  This replaces a complicated trie with a dfa. This should cut down the
  number of conditionals encountered in parsing many code points.

* Make isSTRICT_UTF8_CHAR() an inline function (Karl Williamson, 2018-07-05; 1 file, -88/+3)
  It was a macro that used a trie. This changes it to use the dfa
  constructed in previous commits. I didn't bother with taking
  measurements; a dfa should have fewer conditionals for many code points.

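  Perl's actual dfa is machine-generated and table-driven; the acceptance
  rules it encodes can be sketched with explicit branches over the strict
  (RFC 3629) ranges. This is an illustration, not the code from utf8.h:

      #include <stddef.h>

      /* Accepts one strictly legal UTF-8 character (no surrogates,
       * nothing above U+10FFFF, no overlongs). Returns the sequence
       * length, or 0 if [s,e) doesn't start with one. */
      static size_t is_strict_utf8_char(const unsigned char *s,
                                        const unsigned char *e)
      {
          size_t len, i;
          unsigned char lo = 0x80, hi = 0xBF;   /* bounds for 2nd byte */

          if (s >= e)            return 0;
          if (s[0] <= 0x7F)      return 1;      /* ASCII */
          else if (s[0] < 0xC2)  return 0;      /* cont. byte or overlong */
          else if (s[0] <= 0xDF) len = 2;
          else if (s[0] <= 0xEF) {
              len = 3;
              if      (s[0] == 0xE0) lo = 0xA0; /* excludes overlongs   */
              else if (s[0] == 0xED) hi = 0x9F; /* excludes surrogates  */
          }
          else if (s[0] <= 0xF4) {
              len = 4;
              if      (s[0] == 0xF0) lo = 0x90; /* excludes overlongs   */
              else if (s[0] == 0xF4) hi = 0x8F; /* caps at U+10FFFF     */
          }
          else return 0;                        /* 0xF5..0xFF illegal   */

          if ((size_t)(e - s) < len)  return 0; /* don't read past e    */
          if (s[1] < lo || s[1] > hi) return 0;
          for (i = 2; i < len; i++)
              if (s[i] < 0x80 || s[i] > 0xBF) return 0;
          return len;
      }
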
* Make isUTF8_CHAR() an inline function (Karl Williamson, 2018-07-05; 1 file, -73/+0)
  It was a macro that used a trie. This changes it to use the dfa
  constructed in previous commits. I didn't bother with taking
  measurements; a dfa should require fewer conditionals to be executed for
  many code points.

* utf8.h: Remove obsolete comment (Karl Williamson, 2018-07-05; 1 file, -3/+2)

* utf8.h: Add assert for utf8n_to_uvchr_buf() (Karl Williamson, 2018-07-01; 1 file, -2/+3)
  The Perl_utf8n_to_uvchr_buf() version of this function has an assert;
  this adds it as well to the macro that bypasses the function.

* utf8.h: Add in #define for backcompat (Karl Williamson, 2018-02-19; 1 file, -0/+1)
  This symbol somehow got deleted, and it really shouldn't have been. This
  should not go in perldelta, as we don't want people who aren't already
  using this ancient symbol to start.

* Add uvchr_to_utf8_flags_msgs() (Karl Williamson, 2018-02-07; 1 file, -1/+11)
  This is prompted by Encode's needs. When called with the proper
  parameter, it returns any warnings instead of displaying them directly.

* Add utf8n_to_uvchr_msgs() (Karl Williamson, 2018-01-30; 1 file, -0/+2)
  This UTF-8 to code point translator variant is to meet the needs of
  Encode, and provides XS authors with more general capability than the
  other decoders.

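  A sketch of XS-side use, going by the perlapi documentation; the
  convention of the caller freeing the returned AV is an assumption to
  verify there:

      #include "EXTERN.h"
      #include "perl.h"

      /* decode one character, capturing instead of emitting diagnostics */
      static UV decode_with_msgs(pTHX_ const U8 *s, STRLEN curlen)
      {
          STRLEN retlen;
          U32 errors;
          AV *msgs = NULL;   /* set non-NULL only if there were messages */
          UV cp = utf8n_to_uvchr_msgs(s, curlen, &retlen, 0,
                                      &errors, &msgs);

          if (msgs) {
              /* each element describes one suppressed diagnostic; here
               * they are simply discarded */
              SvREFCNT_dec((SV *) msgs);
          }
          return cp;
      }
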
* utf8.h: Comments only (Karl Williamson, 2017-07-12; 1 file, -9/+16)
  An earlier commit had split some comments up; this adds clarifying
  details.

* utf8n_to_uvchr(): Properly test for extended UTF-8 (Karl Williamson, 2017-07-12; 1 file, -2/+4)
  It somehow dawned on me that the code is incorrect for
  warning/disallowing very high code points. What is really wanted in the
  API is to catch UTF-8 that is not necessarily portable. There are
  several classes of this, but I'm referring here to just the code points
  above the Unicode-defined maximum of 0x10FFFF. These can be considered
  non-portable, and there is a mechanism in the API to warn/disallow them.

  However, an earlier standard defined UTF-8 to handle code points up to
  2**31-1. Anything above that uses an extension to UTF-8 that has never
  been officially recognized. Perl does use such an extension, and the API
  is supposed to have a different mechanism to warn/disallow on this. Thus
  there are two classes of warning/disallowing for above-Unicode code
  points: one for things that have some non-Unicode official recognition,
  and the other for things that have never had official recognition.

  UTF-EBCDIC differs somewhat in this, and since Perl 5.24 we have had a
  Perl extension that allows it to handle any code point that fits in a
  64-bit word. This kicks in at code points above 2**30-1, a different
  number than the one at which extended UTF-8 kicks in on ASCII platforms.
  Things are also complicated by the fact that the API has provisions for
  accepting the overlong UTF-8 malformation: it is possible to use
  extended UTF-8 to represent code points smaller than 31-bit ones.

  Until this commit, the extended warning/disallowing was based on the
  resultant code point, and only when that code point did not fit into 31
  bits. But what is really wanted is whether extended UTF-8 was used to
  represent a code point, no matter how large the resultant code point is.
  This differs from the previous definition only on EBCDIC platforms, or
  when the overlong malformation was also present, so it does not affect
  very many real-world cases. This commit fixes that.

  It turns out that it is easier to tell if something is using extended
  UTF-8: one just looks at the first byte of a sequence. The trailing part
  of the warning message that gets raised is slightly changed to be
  clearer. It's not significant enough to affect perldiag.

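  On ASCII platforms the start-byte check is trivial, since classic UTF-8
  stopped at start byte 0xFD. A hedged sketch; the macro name here is
  invented, not Perl's:

      /* ASCII platforms only. Classic (pre-RFC 3629) UTF-8 used start
       * bytes up through 0xFD for code points up to 2**31-1; 0xFE and
       * 0xFF begin only Perl's never-standardized extended sequences. */
      #define IS_PERL_EXTENDED_UTF8_START(b) ((unsigned char)(b) >= 0xFE)
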
* utf8.h: Add synonyms for flag names (Karl Williamson, 2017-07-12; 1 file, -14/+19)
  The next commit will fix the detection of using Perl's extended UTF-8 to
  be more accurate. The current name for various flags in the API is
  somewhat misleading: what is really wanted to know is whether extended
  UTF-8 was used, not the value of the resultant code point.

  This commit basically does s/ABOVE_31_BIT/PERL_EXTENDED/g. It also
  similarly changes the name of a hash key in APItest/t/utf8.t. This
  intermediary step makes the next commit easier to read.

* utf8.c: Move some #defines here, the only file that uses them (Karl Williamson, 2017-07-01; 1 file, -7/+0)
  These are very specialized #defines to determine if UTF-8 overflows the
  word size of the platform. I think it's unwise to make them generally
  available.

* Add new function utf8_from_bytes_loc() (Karl Williamson, 2017-06-14; 1 file, -0/+2)
  This is currently undocumented externally, so we can change the API if
  needed. This is like utf8_from_bytes(), but in the case of not being
  able to convert the whole string, it converts the initial substring that
  is convertible, and tells you where it had to stop.

* Use new paradigm for hdr file double inclusion guard (Karl Williamson, 2017-06-02; 1 file, -3/+3)
  We changed to use symbols not likely to be used by non-Perl code that
  could conflict, and which have trailing underbars, so they don't look
  like a regular Perl #define. See
  https://rt.perl.org/Ticket/Display.html?id=131110

  There are many more header files which are not guarded.

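  The pattern looks like this; the guard symbol shown is an assumption, so
  check the header for the exact name:

      #ifndef PERL_UTF8_H_    /* trailing underbar: not a normal Perl symbol */
      #define PERL_UTF8_H_

      /* ... contents of utf8.h ... */

      #endif /* PERL_UTF8_H_ */
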
* utf8.h: Add parens around macro param in expansion (Karl Williamson, 2017-06-01; 1 file, -3/+4)
  a6951642ede4abe605dcf0e94b74948e0a60a56b added an assertion to find bugs
  in calling macros, and so far, instead, it found a bug in a macro. A
  parameter needs to be enclosed in parens in case it is an expression, so
  that precedence works.

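  A generic illustration of the bug class (not the actual utf8.h macro):

      #include <assert.h>

      #define DOUBLE_BAD(x)  (x * 2)        /* parameter used bare     */
      #define DOUBLE_GOOD(x) ((x) * 2)      /* parameter parenthesized */

      int main(void) {
          assert(DOUBLE_BAD(1 + 2)  == 5);  /* expands to (1 + 2 * 2)  */
          assert(DOUBLE_GOOD(1 + 2) == 6);  /* expands to ((1 + 2) * 2) */
          return 0;
      }
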
* utf8.h: Add assertions for macros that take chars (Karl Williamson, 2017-06-01; 1 file, -10/+18)
  This is inspired by [perl #131190]. The UTF-8 macros whose parameters
  are characters now have assertions that verify they are not being called
  with something that won't fit in a char. These assertions should be
  getting optimized out if the input type is a char or U8.

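  A sketch of the mechanism with illustrative names; Perl's real macros in
  handy.h/utf8.h differ in detail:

      #include <assert.h>

      /* True when the value fits in 8 bits. For a char or U8 argument,
       * sizeof(c) == 1 makes this a compile-time constant, so the assert
       * below can be optimized away entirely. */
      #define FITS_IN_8_BITS_(c) \
          ((sizeof(c) == 1) || ((unsigned long long)(c) <= 0xFF))

      #define MY_IS_ASCII(c) \
          (assert(FITS_IN_8_BITS_(c)), ((unsigned char)(c) <= 0x7F))

      int main(void) {
          unsigned char uc = 'A';
          return MY_IS_ASCII(uc) ? 0 : 1;   /* assert compiles out here */
      }
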
* utf8.h: Clarify comment (Karl Williamson, 2017-02-11; 1 file, -4/+4)

* utf8.h: White-space, parens only (Karl Williamson, 2017-02-11; 1 file, -6/+6)
  Add parens to clarify grouping, white-space for legibility.

* utf8.h: Add branch prediction (Karl Williamson, 2017-02-11; 1 file, -1/+1)
  Code running under "use bytes" is the unlikely case.

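  The idiom, sketched with the GCC/Clang builtin that Perl's LIKELY and
  UNLIKELY macros in perl.h wrap:

      #define LIKELY(x)   __builtin_expect(!!(x), 1)
      #define UNLIKELY(x) __builtin_expect(!!(x), 0)

      static int handle(int in_bytes) {
          if (UNLIKELY(in_bytes))
              return 1;    /* rare: "use bytes" semantics        */
          return 0;        /* common path, laid out fall-through */
      }
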
* Deprecate toFOO_utf8() (Karl Williamson, 2016-12-23; 1 file, -4/+4)
  Now that there are _safe versions, deprecate the unsafe ones.

* Add toFOO_utf8_safe() macros (Karl Williamson, 2016-12-23; 1 file, -4/+9)

* utf8.c: Add flag to indicate unsure as to end of string to print (Karl Williamson, 2016-12-23; 1 file, -0/+1)
  When decoding a UTF-8 encoded string, we may have guessed how long it
  is. This adds a flag so that the base-level decode routine knows that it
  is a guess, and it minimizes what gets printed, rather than the normal
  full information, so as to minimize reading past the end of the string.

* Deprecate isFOO_utf8() macros (Karl Williamson, 2016-12-23; 1 file, -7/+13)
  These macros are being replaced by a safe version; they now generate a
  deprecation message at each call site upon the first use there in each
  program run.

* Allow accepting the UTF-8 overflow malformation (Karl Williamson, 2016-12-23; 1 file, -3/+8)
  perl has never allowed the UTF-8 overflow malformation, for some reason.
  But as long as overflows are turned into the REPLACEMENT CHARACTER,
  there is no real reason not to. Making it allowable lets code that wants
  to carry on in the face of malformed input do so, without risk of
  contaminating things, as the REPLACEMENT CHARACTER is the
  Unicode-prescribed way of handling malformations.

* Return REPLACEMENT for UTF-8 overlong malformation (Karl Williamson, 2016-12-23; 1 file, -1/+4)
  When perl decodes UTF-8 into a code point, it must decide what to do if
  the input is malformed in some way. When the flags passed to the decode
  function indicate that a given malformation type is not acceptable, the
  function returns 0 to indicate failure; on success it returns the
  decoded code point (unfortunately that may require disambiguation if the
  input is validly a NUL).

  As perl evolved, what happened when various allowed malformations were
  encountered got stricter and stricter. This is the final malformation
  that was not turned into a REPLACEMENT CHARACTER when the malformation
  was allowed, and this commit changes it to return that.

  Unlike most other malformations, the code point value of an overlong is
  well-defined, which is why it hadn't been changed heretofore. But it is
  safer to use the Unicode-prescribed behavior on all malformations, which
  is to replace them with the REPLACEMENT CHARACTER. Just in case there is
  code that requires the old behavior, it is retained, but you have to
  search the source for the undocumented flag that enables it.

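  The shape of the policy, simplified; UNICODE_REPLACEMENT is the name
  Perl's headers use for U+FFFD, while the surrounding logic is a sketch:

      typedef unsigned long UV;             /* stand-in for Perl's UV    */
      #define UNICODE_REPLACEMENT 0xFFFDUL  /* U+FFFD REPLACEMENT CHAR   */

      /* On an allowed malformation, even a well-defined decoded value
       * (as with overlongs) is discarded in favor of U+FFFD. */
      static UV finish_decode(UV cp, int malformed, int allowed)
      {
          if (!malformed)
              return cp;
          if (!allowed)
              return 0;                 /* failure (ambiguous with NUL)  */
          return UNICODE_REPLACEMENT;
      }
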
* utf8.h: Don't allow zero length malformation unless requested (Karl Williamson, 2016-12-23; 1 file, -6/+10)
  The bottom-level Perl routine that decodes UTF-8 into a code point has
  long accepted inputs where the length is specified to be 0, returning a
  NUL. It considers this a malformation, which is accepted in some
  scenarios, but not others.

  In consultation with Tony Cook, we decided this really isn't a
  malformation, but a bug in the calling program: rather than call the
  decode routine when it has nothing to decode, the caller should just not
  call it.

  This commit removes the acceptance of a zero-length string from all of
  the canned flag combinations passed to the decode function. Callers can
  still specify this flag explicitly, if necessary. However, the next
  commit will cause this to fail under DEBUGGING builds, as a step towards
  removing the capability altogether.

* utf8.h: Renumber flag bits (Karl Williamson, 2016-12-23; 1 file, -10/+10)
  This creates a gap that will be filled by future commits.

* Add isFOO_utf8_safe() macros (Karl Williamson, 2016-12-23; 1 file, -0/+11)
  The original API does not check that we aren't reading beyond the end of
  a buffer, apparently assuming that we could keep malformed UTF-8 out by
  use of gatekeepers, but that is currently impossible. This commit adds
  "safe" macros for determining if a UTF-8 sequence represents an
  alphabetic, a digit, etc. Each new macro has an extra parameter pointing
  to the end of the sequence, so that looking beyond the input string can
  be avoided.

  The macros aren't currently completely safe, as they don't test that
  there is at least a single valid byte in the input, except by an
  assertion in DEBUGGING builds. This is because typically they are called
  in code that makes that assumption, and frequently tests the current
  byte for one thing or another.

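  A sketch of a scan loop using the _safe forms; handle_alpha() is a
  hypothetical callback, while UTF8SKIP and isALPHA_utf8_safe are real
  names from utf8.h/perlapi:

      #include "EXTERN.h"
      #include "perl.h"

      extern void handle_alpha(const U8 *s);   /* hypothetical */

      static void scan_alphas(pTHX_ SV *sv)
      {
          U8 *s    = (U8 *) SvPVX(sv);
          U8 *send = s + SvCUR(sv);

          while (s < send) {
              /* the end pointer lets the macro refuse to read past the
               * buffer */
              if (isALPHA_utf8_safe(s, send))
                  handle_alpha(s);
              s += UTF8SKIP(s);     /* advance one encoded character */
          }
      }
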
* Fix above-Unicode UTF-8 detection for EBCDIC (Karl Williamson, 2016-12-13; 1 file, -1/+1)
  The root cause of this was missing parentheses, causing (s[0] + 1) to be
  evaluated instead of the desired s[1]. It was causing an error in
  lib/warnings.t, but only on EBCDIC platforms.

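  The bug class in miniature (illustrative macros, not the original code):

      #include <assert.h>

      /* "*s + 1" adds 1 to the first byte's value; "*((s) + 1)" reads
       * the second byte, which is what was wanted. */
      #define SECOND_BYTE_BAD(s)  (*s + 1)
      #define SECOND_BYTE_GOOD(s) (*((s) + 1))

      int main(void) {
          const unsigned char s[] = { 0xF0, 0x9F, 0x92, 0xA9 };
          assert(SECOND_BYTE_BAD(s)  == 0xF1);  /* s[0] + 1 */
          assert(SECOND_BYTE_GOOD(s) == 0x9F);  /* s[1]     */
          return 0;
      }
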
* Fix const correctness in utf8.h (Petr Písař, 2016-12-01; 1 file, -57/+57)
  The original code was generated and then hand-tuned, therefore I edited
  the code in place instead of fixing the regen/regcharclass.pl generator.

  Signed-off-by: Petr Písař <ppisar@redhat.com>

* utf8.h: Simplify macro (Karl Williamson, 2016-10-13; 1 file, -3/+1)
  This complicated macro boils down to just one bit.

* Add utf8n_to_uvchr_error (Karl Williamson, 2016-10-13; 1 file, -0/+2)
  This new function behaves like utf8n_to_uvchr(), but takes an extra
  parameter that points to a U32, which will be set to 0 if no errors are
  found; otherwise each error found will set a bit in it. This can be used
  by the caller to figure out precisely what the error(s) is/are;
  previously, one would have to capture and parse the warning/error
  messages raised. This can be used, for example, to customize the
  messages to the expected end-user's knowledge level.

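  A sketch of use, going by the perlapi documentation; the
  UTF8_DISALLOW_SURROGATE and UTF8_GOT_SURROGATE names are from utf8.h:

      #include "EXTERN.h"
      #include "perl.h"

      static UV decode_no_surrogates(pTHX_ const U8 *s, STRLEN curlen)
      {
          STRLEN retlen;
          U32 errors;
          UV cp = utf8n_to_uvchr_error(s, curlen, &retlen,
                                       UTF8_DISALLOW_SURROGATE, &errors);

          if (errors & UTF8_GOT_SURROGATE)
              Perl_croak(aTHX_ "surrogate code point in input");
          return cp;
      }
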
* utf8n_to_uvchr(): Note multiple malformations (Karl Williamson, 2016-10-13; 1 file, -8/+13)
  Some UTF-8 sequences can have multiple malformations. For example, a
  sequence can be the start of an overlong representation of a code point,
  and still be incomplete. Until this commit, what was generally done was
  to stop looking when the first malformation was found. This was not
  correct behavior, as that malformation may be allowed while another,
  unallowed one goes unnoticed. (But this did not actually create security
  holes, as the allowed malformations replaced the input with a
  REPLACEMENT CHARACTER.)

  This commit refactors the error handling of this function to set a flag
  and keep going if a malformation is found that doesn't preclude others.
  Then each is handled in a loop at the end, warning if warranted. The
  result is that there is a warning for each malformation for which
  warnings should be generated, and an error return is made if any one is
  disallowed.

  Overflow doesn't happen except for very high code points, well above the
  Unicode range and above fitting in 31 bits. Hence the latter two
  potential malformations are subsets of overflow, so only one warning,
  the most dire, is output.

  This will speed up the normal case slightly, as the test for overflow is
  pulled out of the loop, allowing the UV to overflow. A single test after
  the loop then determines whether there was overflow or not.

* utf8.h: Change some flag definition constants (Karl Williamson, 2016-10-13; 1 file, -9/+9)
  These #defines give flag bits in a U32. This commit opens a gap that
  will be filled in a future commit. A test file has to change to
  correspond, as it duplicates the defines.

* Add API Unicode handling functions (Karl Williamson, 2016-09-25; 1 file, -0/+9)
  These functions are all extensions of the is_utf8_string_foo()
  functions that restrict the UTF-8 recognized as valid in various ways.
  There are named ones for the two definitions that Unicode makes, and
  foo_flags ones for more custom restrictions. The named ones are
  implemented as tries, while the flags ones provide complete generality.

* Move #define to different header (Karl Williamson, 2016-09-25; 1 file, -3/+0)
  Instead of having a comment in one header pointing to the #define in the
  other, remove the indirection and just have the #define itself where it
  is needed.

* perlapi: Clarifications, nits in Unicode support docs (Karl Williamson, 2016-09-25; 1 file, -9/+29)
  This also does a white-space change to inline.h.

* Add isUTF8_CHAR_flags() macro (Karl Williamson, 2016-09-17; 1 file, -0/+34)
  This is like the previous 2 commits, but the macro takes a flags
  parameter so any combination of the disallowed flags may be used. The
  others, along with the original isUTF8_CHAR(), are the most commonly
  desired strictures, and use an implementation of a (hopefully inlined)
  trie for speed. This one is for generality, and the major portion of its
  implementation isn't inlined.

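  A sketch of use; the flag name is from utf8.h, and going by perlapi the
  macro evaluates to the sequence length, or 0 on failure:

      /* returns the length of one interchange-legal character, else 0 */
      static STRLEN one_char_len(const U8 *s, const U8 *send)
      {
          return isUTF8_CHAR_flags(s, send,
                                   UTF8_DISALLOW_ILLEGAL_INTERCHANGE);
      }
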
* Add macro for Unicode Corrigendum #9 strict (Karl Williamson, 2016-09-17; 1 file, -1/+54)
  This macro follows Unicode Corrigendum #9 to allow non-character code
  points. These are still discouraged but not completely forbidden. It's
  best for code that isn't intended to operate on arbitrary other-code
  text to use the original definition, but code that does things such as
  source code control should change to use this definition if it wants to
  be Unicode-strict.

  Perl can't adopt C9 wholesale, as it might create security holes in
  existing applications that rely on Perl keeping non-chars out.

* Add macro for determining if UTF-8 is Unicode-strict (Karl Williamson, 2016-09-17; 1 file, -3/+78)

* perlapi: Clarify isUTF8_CHAR() (Karl Williamson, 2016-09-17; 1 file, -3/+4)

* utf8.h: Add comment, white-space changes (Karl Williamson, 2016-09-17; 1 file, -7/+11)

* Enhance and rename is_utf8_char_slow() (Karl Williamson, 2016-09-17; 1 file, -3/+3)
  This changes the name of this helper function and adds a parameter and
  functionality to allow it to exclude problematic classes of code points,
  the same ones excludeable by utf8n_to_uvchr(), like surrogates or
  non-character code points.

* isUTF8_CHAR(): Bring UTF-EBCDIC to parity with ASCII (Karl Williamson, 2016-09-17; 1 file, -51/+37)
  This changes the macro isUTF8_CHAR to have the same number of code
  points built-in for EBCDIC as for ASCII. This obsoletes the
  IS_UTF8_CHAR_FAST macro, which is removed.

  Previously, the code generated by regen/regcharclass.pl for ASCII
  platforms was hand-copied into utf8.h, and LIKELY's manually added, then
  the generating code was commented out. Now this has been done with
  EBCDIC platforms as well. This makes regenerating regcharclass.h faster.

  The copied macro in utf8.h is moved by this commit to within the main
  code section for non-EBCDIC compiles, cutting the number of #ifdef's
  down, and the comments about it are changed somewhat.

* Add IS_UTF8_INVARIANT and IS_UVCHR_INVARIANT to API (Karl Williamson, 2016-09-17; 1 file, -10/+32)

* Add #defines for XS code for Unicode Corrigendum 9 (Karl Williamson, 2016-09-17; 1 file, -5/+14)
  These are convenience macros.

* Make 3 UTF-8 macros API (Karl Williamson, 2016-08-31; 1 file, -16/+62)
  These may be useful to various module writers. They certainly are useful
  for Encode. This makes public API macros to determine if the input UTF-8
  represents (one macro for each category):
  a) a surrogate code point
  b) a non-character code point
  c) a code point that is above Unicode's legal maximum

  The macros are machine-generated. In making them public, I am now using
  the string end location parameter to guard against running off the end
  of the input. Previously this parameter was ignored, as their use in the
  core could be tightly controlled so that we already knew that the string
  was long enough when calling these macros. But this can't be guaranteed
  in the public API. An optimizing compiler should be able to remove
  redundant length checks.