path: root/utfebcdic.h
Commit message (author, date, files changed, lines -/+)
* utf8.h: Remove EBCDIC dependency (Karl Williamson, 2021-08-07, 1 file, -5/+0)
| | | | By generalizing a macro, we can make it serve both ASCII and EBCDIC.
* utfebcdic.h: White-space, comment only (Karl Williamson, 2021-08-07, 1 file, -11/+17)
|
* utf8.h: Rmv EBCDIC dependency (Karl Williamson, 2021-08-07, 1 file, -20/+0)
| | | | | | | This moves a #define into the common code for ASCII and EBCDIC machines. It adds a bunch of comments about the value that I wish I hadn't had to figure out for myself.
* utf8.h: Remove an EBCDIC dependency (Karl Williamson, 2021-08-07, 1 file, -2/+0)
| | | | | A symbol introduced in a previous commit allows this internal macro to only need a single version, suitable for either EBCDIC or ASCII.
* utf8.h: Make a bit of EBCDIC known to ASCII (Karl Williamson, 2021-08-07, 1 file, -5/+1)
| | | | | This info is needed in one other place; doing it here means only specifying it once.
* utf8.h: Add a #define synonym (Karl Williamson, 2021-08-07, 1 file, -2/+1)
| | | | | This is more clearly named for various uses in this file. It has an unwieldy length, but is unlikely to be used outside it.
* utfebcdic.h: Remove obsolete #defines (Karl Williamson, 2021-06-27, 1 file, -243/+0)
| | | | For a couple of releases now, these have been uncalled.
* Fix several unicode.org links (Thibault DUPONCHELLE, 2021-06-17, 1 file, -1/+1)
|
* style: Detabify indentation of the C code maintained by the core. (Michael G. Schwern, 2021-01-17, 1 file, -61/+61)
| | | | | | | | | | | This just detabifies to get rid of the mixed tab/space indentation. Applying consistent indentation and dealing with other tabs are separate issues. Done with `expand -i`. vutil.* was left alone, as it is part of version. Regen-managed files were left alone for now.
* utfebcdic.h: Add comments (Karl Williamson, 2019-11-11, 1 file, -2/+15)
|
* utfebcdic.h: Move some #defines (Karl Williamson, 2019-10-09, 1 file, -3/+4)
|
* Make defn of UTF_IS_CONTINUED common (Karl Williamson, 2019-10-06, 1 file, -4/+0)
| | | | This can be derived from other values, removing an EBCDIC dependency.
* Make defn of UVCHR_IS_INVARIANT common (Karl Williamson, 2019-10-06, 1 file, -11/+1)
| | | | This can be derived from other values, removing an EBCDIC dependency.
* Make defn of OFFUNI_IS_INVARIANT common (Karl Williamson, 2019-10-06, 1 file, -3/+0)
| | | | This can be derived from other values, removing an EBCDIC dependency.
* Make defn of UTF8_IS_DOWNGRADEABLE_START common (Karl Williamson, 2019-10-06, 1 file, -3/+0)
| | | | This can be derived from other values, removing an EBCDIC dependency.
* Make defn of UTF_IS_ABOVE_LATIN1 common (Karl Williamson, 2019-10-06, 1 file, -6/+0)
| | | | This can be derived from other values, removing an EBCDIC dependency.
* Make defn of UTF8_IS_START common (Karl Williamson, 2019-10-06, 1 file, -3/+0)
| | | | This can be derived from other values, removing an EBCDIC dependency.
* Make defn of UTF8_IS_CONTINUATION common (Karl Williamson, 2019-10-06, 1 file, -8/+0)
| | | | This can be derived from other values, removing an EBCDIC dependency.
* Make defn of UTF_CONTINUATION_MARK common (Karl Williamson, 2019-10-06, 1 file, -2/+0)
| | | | This can be derived from other values, removing an EBCDIC dependency.
* Make defn of UTF_IS_CONTINUATION_MASK common (Karl Williamson, 2019-10-06, 1 file, -2/+0)
| | | | | This variable can be defined from the same base in both UTF-8 and UTF-EBCDIC, and doing so eliminates an EBCDIC dependency.
* utfebcdic.h: Fix EBCDIC compilation error (Karl Williamson, 2019-08-19, 1 file, -14/+0)
| | | | | | The #include needs to always be done, so remove the #ifdef. The included file has the proper setup anyway for the variables that were used.
* utf8n_to_uvchr() Properly test for extended UTF-8 (Karl Williamson, 2017-07-12, 1 file, -0/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It somehow dawned on me that the code is incorrect for warning/disallowing very high code points. What is really wanted in the API is to catch UTF-8 that is not necessarily portable. There are several classes of this, but I'm referring here to just the code points that are above the Unicode-defined maximum of 0x10FFFF. These can be considered non-portable, and there is a mechanism in the API to warn/disallow these. However an earlier standard defined UTF-8 to handle code points up to 2**31-1. Anything above that is using an extension to UTF-8 that has never been officially recognized. Perl does use such an extension, and the API is supposed to have a different mechanism to warn/disallow on this. Thus there are two classes of warning/disallowing for above-Unicode code points. One for things that have some non-Unicode official recognition, and the other for things that have never had official recognition. UTF-EBCDIC differs somewhat in this, and since Perl 5.24, we have had a Perl extension that allows it to handle any code point that fits in a 64-bit word. This kicks in at code points above 2**30-1, a number different than UTF-8 extended kicks in on ASCII platforms. Things are also complicated by the fact that the API has provisions for accepting the overlong UTF-8 malformation. It is possible to use extended UTF-8 to represent code points smaller than 31-bit ones. Until this commit, the extended warning/disallowing was based on the resultant code point, and only when that code point did not fit into 31 bits. But what is really wanted is if extended UTF-8 was used to represent a code point, no matter how large the resultant code point is. This differs from the previous definition, but only for EBCDIC platforms, or when the overlong malformation was also present. So it does not affect very many real-world cases. This commit fixes that. 
It turns out that it is easier to tell if something is using extended-UTF8. One just looks at the first byte of a sequence. The trailing part of the warning message that gets raised is slightly changed to be clearer. It's not significant enough to affect perldiag.
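The first-byte test described above can be sketched as follows for ASCII platforms. This is a minimal illustration, not Perl's actual macro: both function names are hypothetical, and the 0xFE threshold is an assumption that follows from the 6-byte form (start bytes 0xFC and 0xFD) already carrying 31 bits, which leaves only the start bytes 0xFE and 0xFF to the extension. The decoder does no validation and deliberately tolerates overlongs.

```python
def uses_extended_utf8(seq: bytes) -> bool:
    """Hypothetical check (ASCII platforms): does the sequence start with a
    byte that only the extended forms use? Looks at the first byte only."""
    return seq[0] >= 0xFE


def decode_payload(seq: bytes) -> int:
    """Naively decode one UTF-8-style sequence, overlongs included.
    No validation is performed; this is for illustration only."""
    first = seq[0]
    if first < 0x80:
        return first
    # Count the leading 1 bits of the start byte.
    ones = 0
    while ones < 8 and first & (0x80 >> ones):
        ones += 1
    # Start bytes 0xFE and 0xFF carry no payload bits of their own.
    cp = first & (0xFF >> (ones + 1)) if ones < 7 else 0
    for b in seq[1:]:
        cp = (cp << 6) | (b & 0x3F)  # 6 payload bits per continuation byte
    return cp


# Overlong: U+0041 written in the 7-byte form that starts with 0xFE.
overlong_a = bytes([0xFE, 0x80, 0x80, 0x80, 0x80, 0x81, 0x81])
# Largest code point of the 6-byte form: 2**31 - 1.
six_byte_max = bytes([0xFD, 0xBF, 0xBF, 0xBF, 0xBF, 0xBF])

# A test on the resulting code point misses the overlong case...
old_style_flag = decode_payload(overlong_a) > 0x7FFFFFFF
# ...while the first-byte test catches it.
new_style_flag = uses_extended_utf8(overlong_a)
```

Here `old_style_flag` is false and `new_style_flag` is true, which is the difference the commit message describes for the overlong case.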
* utf8.c: Move some #defines here, the only file that uses them (Karl Williamson, 2017-07-01, 1 file, -7/+0)
| | | | | | These are very specialized #defines to determine if UTF-8 overflows the word size of the platform. I think it's unwise to make them kinda generally available.
* utfebcdic.h: Fix typo in comment (Karl Williamson, 2016-12-19, 1 file, -1/+1)
| | | | Spotted by Christian Hansen
* utfebcdic.h: Follow up to adding const qualifiers (Karl Williamson, 2016-12-11, 1 file, -57/+74)
| | | | | | | | | | | | | | | | | | | | | | | | This is a follow-up to commit 9f2eed981068e7abbcc52267863529bc59e9c8c0, which manually added const qualifiers to some generated code in order to avoid some compiler warnings. The code changed by the other commit had been hand-edited after being generated to add branch prediction, which would be too hard to program in at this time, so the const additions also had to be hand-edited in. The commit just before this current one changed the generator to add the const, and I then did comparisons by hand to make sure the only differences were the branch predictions. In doing so, I found one missing const, plus a bunch of differences in the generated code for EBCDIC 037. We do not currently have a smoker for that system, so the differences could be as a result of a previous error, or they could be the result of the added 'const' causing the macro generator to split things differently. It splits in order to avoid size limits in some preprocessors, and the extra 'const' tokens could have caused it to make its splits differently. Since we don't have any smokers for this, and no known actual systems running it, I decided not to bother to hand-edit the output to add branch prediction.
* Fix const correctness in utf8.h (Petr Písař, 2016-12-01, 1 file, -138/+138)
| | | | | | | | The original code was generated and then hand-tuned. Therefore I edited the code in place instead of fixing the regen/regcharclass.pl generator. Signed-off-by: Petr Písař <ppisar@redhat.com>
* Add macro for Unicode Corrigendum #9 strict (Karl Williamson, 2016-09-17, 1 file, -0/+42)
| | | | | | | | | | | | | This macro follows Unicode Corrigendum #9 to allow non-character code points. These are still discouraged but not completely forbidden. It's best for code that isn't intended to operate on arbitrary other code text to use the original definition, but code that does things, such as source code control, should change to use this definition if it wants to be Unicode-strict. Perl can't adopt C9 wholesale, as it might create security holes in existing applications that rely on Perl keeping non-chars out.
* Add macro for determining if UTF-8 is Unicode-strict (Karl Williamson, 2016-09-17, 1 file, -8/+140)
|
* isUTF8_CHAR(): Bring UTF-EBCDIC to parity with ASCII (Karl Williamson, 2016-09-17, 1 file, -0/+51)
| | | | | | | | | | | | | | | | This changes the macro isUTF8_CHAR to have the same number of code points built-in for EBCDIC as ASCII. This obsoletes the IS_UTF8_CHAR_FAST macro, which is removed. Previously, the code generated by regen/regcharclass.pl for ASCII platforms was hand copied into utf8.h, and LIKELY's manually added, then the generating code was commented out. Now this has been done with EBCDIC platforms as well. This makes regenerating regcharclass.h faster. The copied macro in utf8.h is moved by this commit to within the main code section for non-EBCDIC compiles, cutting the number of #ifdef's down, and the comments about it are changed somewhat.
* utfebcdic.h: Fix typo in comment (Karl Williamson, 2016-09-17, 1 file, -1/+1)
|
* Add #defines for UTF-8 of highest representable code point (Karl Williamson, 2016-08-31, 1 file, -0/+8)
| | | | | This will allow the next commit to not have to actually try to decode the UTF-8 string in order to see if it overflows the platform.
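As a rough sketch of how overflow can be detected without decoding, the candidate bytes can be compared against the encoding of the highest representable code point. The code below is illustrative only and is not Perl's implementation: the names are hypothetical, it assumes a 64-bit word, and it assumes the input is a well-formed sequence already in the I8 intermediate form (not native UTF-EBCDIC bytes, whose ordering differs). The constant is the I8 for 2**64-1 quoted in the "Extend UTF-EBCDIC to handle up to 2**64-1" commit message in this log.

```python
# I8 for 2**64 - 1, as quoted elsewhere in this log: 0xFF, 0xAF, then 12 x 0xBF.
HIGHEST_I8_64BIT = bytes([0xFF, 0xAF] + [0xBF] * 12)


def i8_overflows_64bit(seq: bytes) -> bool:
    """Hypothetical check: would this well-formed I8 sequence decode to a
    value that does not fit in 64 bits? No decoding is performed."""
    if not seq or seq[0] != 0xFF:
        # Shorter forms (start byte below 0xFF) hold at most 30 bits.
        return False
    if len(seq) != len(HIGHEST_I8_64BIT):
        raise ValueError("expected a complete 14-byte sequence")
    # For equal-length well-formed sequences, bytewise order matches
    # code point order, so a lexicographic comparison is enough.
    return seq > HIGHEST_I8_64BIT


# 2**64 itself would be 0xFF, 0xB0, then 12 x 0xA0 in this scheme.
too_big = bytes([0xFF, 0xB0] + [0xA0] * 12)
```

With these definitions, `i8_overflows_64bit(too_big)` is true and `i8_overflows_64bit(HIGHEST_I8_64BIT)` is false.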
* utf8.h, utfebcdic.h: Add comments, align white space (Karl Williamson, 2016-08-31, 1 file, -1/+29)
|
* utf8.h, utfebcdic.h: Add #define (Karl Williamson, 2015-12-09, 1 file, -0/+9)
| | | | for future use
* utf8.h, utfebcdic.h: Comments, white-space only (Karl Williamson, 2015-12-06, 1 file, -1/+6)
|
* utf8.h: Combine EBCDIC and ASCII macros (Karl Williamson, 2015-12-05, 1 file, -8/+0)
| | | | | | | | Previous commits have set things up so the macros are the same on both platforms. By moving them to the common part of utf8.h, they can share the same definition. The difference listing shows instead other things being moved due to the size of this move in comparison with those things that really stayed the same.
* utf8.h: Combine EBCDIC and ASCII macros (Karl Williamson, 2015-12-05, 1 file, -5/+0)
| | | | | | | | The previous commits have made these macros be the exact same text, so can be combined, and defined just once. This requires moving them to the portion of the file that is common with both EBCDIC and ASCII. The commit diff shows instead other code being moved.
* utf8.h: Combine EBCDIC and ASCII #defines (Karl Williamson, 2015-12-05, 1 file, -2/+0)
| | | | | | Change to use the same definition for two macros on both types of platforms, simplifying the code, by using the underlying structure of the encoding.
* utf8.h, et al.: Clean up some casts (Karl Williamson, 2015-12-05, 1 file, -2/+2)
| | | | | By making sure the no-op macros cast the output appropriately, we can eliminate the casts that have been added in things that call them.
* utf8.h: Combine ASCII and EBCDIC defines into one (Karl Williamson, 2015-12-05, 1 file, -1/+0)
| | | | | By using a more fundamental value, these two definitions of the macro can be made the same, so only one is needed, common to both platforms.
* utfebcdic.h: Use an internal macro to avoid repeating (Karl Williamson, 2015-12-05, 1 file, -15/+12)
| | | | | This creates a macro that is used in portions of 2 other macros, thus removing repetition.
* utf8.h, utfebcdic.h: Fix-up UTF8_MAXBYTES_CASE defn (Karl Williamson, 2015-12-05, 1 file, -6/+0)
| | | | | | | | The definition had gotten moved away from its comments in utf8.h, and the wrong thing (UTF8_MAXBYTES instead) was being guarded by a #error. It is also possible to generalize so that the compiler does the calculation, and to consolidate the definitions from the two files into a single one.
* Extend UTF-EBCDIC to handle up to 2**64-1 (Karl Williamson, 2015-11-25, 1 file, -17/+27)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This uses for UTF-EBCDIC essentially the same mechanism that Perl already uses for UTF-8 on ASCII platforms to extend it beyond what might be its natural maximum. That is, when the UTF-8 start byte is 0xFF, it adds a bunch more bytes to the character than it otherwise would, bringing it to a total of 14 for UTF-EBCDIC. This is enough to handle any code point that fits in a 64 bit word. The downside of this is that this extension is not compatible with previous perls for the range 2**30 up through the previous max, 2**31 - 1. A simple program could be written to convert files that were written out using an older perl so that they can be read with newer perls, and the perldelta says we will do this should anyone ask. However, I strongly suspect that the number of such files in existence is zero, as people in EBCDIC land don't seem to use Unicode much, and these are very large code points, which are associated with a portability warning every time they are output in some way. This extension brings UTF-EBCDIC to parity with UTF-8, so that both can cover a 64-bit word. It allows some removal of special cases for EBCDIC in core code and core tests. And it is a necessary step to handle Perl 6's NFG, which I'd like eventually to bring to Perl 5. This commit causes two implementations of a macro in utf8.h and utfebcdic.h to become the same, and both are moved to a single one in the portion of utf8.h common to both. To illustrate, the I8 for U+3FFFFFFF (2**30-1) is "\xFE\xBF\xBF\xBF\xBF\xBF\xBF" before and after this commit, but the I8 for the next code point, U+40000000 is now "\xFF\xA0\xA0\xA0\xA0\xA0\xA0\xA1\xA0\xA0\xA0\xA0\xA0\xA0", and before this commit it was "\xFF\xA0\xA0\xA0\xA0\xA0\xA0". The I8 for 2**64-1 (U+FFFFFFFFFFFFFFFF) is "\xFF\xAF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF", whereas before this commit it was unrepresentable.
Commit 7c560c3beefbb9946463c9f7b946a13f02f319d8 said in its message that it was moving something that hadn't been needed on EBCDIC until the "next commit". That statement turned out to be wrong, overtaken by events: I prematurely pushed that commit. This now is the commit it was referring to.
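The I8 examples in the message above can be reproduced with a small encoder. This is a sketch under stated assumptions rather than Perl's code: the function name and table are hypothetical, the layout assumed is start byte plus continuation bytes of the form 101xxxxx (5 payload bits each), with 0xFE starting a 7-byte sequence and 0xFF starting the 14-byte extended form. It produces I8 only; real UTF-EBCDIC additionally maps each I8 byte through a code-page-specific table, which is not modeled here.

```python
# (continuation bytes, total payload bits, start-byte marker) for the
# 2- to 7-byte I8 forms assumed in this sketch.
_I8_FORMS = [
    (1, 10, 0xC0),
    (2, 14, 0xE0),
    (3, 18, 0xF0),
    (4, 22, 0xF8),
    (5, 26, 0xFC),
    (6, 30, 0xFE),
]


def i8_encode(cp: int) -> bytes:
    """Hypothetical encoder: code point -> I8 bytes (post-commit layout)."""
    if not 0 <= cp < 2**64:
        raise ValueError("code point must fit in 64 bits")
    if cp < 0xA0:
        return bytes([cp])  # single-byte form
    for n_cont, bits, mark in _I8_FORMS:
        if cp < (1 << bits):
            break
    else:
        # Extended form: 0xFF followed by 13 continuation bytes.
        n_cont, mark = 13, 0xFF
    # Start byte carries whatever high bits remain (none for 0xFE/0xFF).
    out = [mark | (cp >> (5 * n_cont))]
    for i in reversed(range(n_cont)):
        out.append(0xA0 | ((cp >> (5 * i)) & 0x1F))
    return bytes(out)


print(i8_encode(0x3FFFFFFF).hex())  # the 7-byte form starting with fe
print(i8_encode(0x40000000).hex())  # the 14-byte form starting with ff
```

The three byte strings quoted in the commit message (for U+3FFFFFFF, U+40000000, and 2**64-1) come out of this encoder unchanged, which is a useful sanity check on the assumed layout.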
* utf8.h, utfebcdic.h: Use mnemonic constant (Karl Williamson, 2015-11-09, 1 file, -14/+15)
| | | | | | | The magic number 13 is used in various places on ASCII platforms, and 7 correspondingly on EBCDIC. This moves the #defines for what these represent to early in their files, and uses the symbolic name thereafter.
* Change meaning of UNI_IS_INVARIANT on EBCDIC platforms (Karl Williamson, 2015-09-18, 1 file, -2/+1)
| | | | | | | | | | | | | This should make more CPAN and other code work without change. Usually, unwittingly, code that says UNI_IS_INVARIANT means to use the native platform code values for code points below 256, so acquiesce to the expected meaning and make the macro correspond. Since the native values on ASCII machines are the same as Unicode, this change doesn't affect code running on them. A new macro, OFFUNI_IS_INVARIANT, is created for those few places that really do want a Unicode value. There are just a few places in the Perl core like that, which this commit changes.
* Fix potential flaw in 2 EBCDIC macros. (Karl Williamson, 2015-09-04, 1 file, -2/+2)
| | | | | | | | It occurred to me in code reading that it was possible for these macros to not give the correct result if passed a signed argument. An earlier version of this commit was buggy. Thanks to Yaroslav Kuzmin for spotting that.
* utf8.h, utfebcdic.h: Add some assertions (Karl Williamson, 2015-09-04, 1 file, -4/+6)
| | | | | | These will detect an array bounds error that occurs on EBCDIC machines, and by including the assert on non-EBCDIC, we verify that the code wouldn't fail when built on EBCDIC.
* Change EBCDIC macro definition (Karl Williamson, 2015-09-04, 1 file, -0/+3)
| | | | | | This changes the definition of isUTF8_POSSIBLY_PROBLEMATIC() on EBCDIC platforms to use PL_charclass[] instead of PL_e2a[]. The new array is more likely to be in the memory cache.
* Change EBCDIC macro definition (Karl Williamson, 2015-09-04, 1 file, -0/+6)
| | | | | | | | Prior to this commit UVCHR_SKIP() was defined the same in both ASCII and EBCDIC, but the two expanded to different things. Now they are defined separately, each as what it expands to, and the EBCDIC version, once fully expanded, uses PL_charclass[] instead of PL_e2a[]. The new array is more likely to be in the memory cache.
* Change EBCDIC macro definition (Karl Williamson, 2015-09-04, 1 file, -0/+8)
| | | | | | | | Prior to this commit UVCHR_IS_INVARIANT() was defined the same in both ASCII and EBCDIC, but the two expanded to different things. Now they are defined separately, each as what it expands to, and the EBCDIC version, once fully expanded, uses PL_charclass[] instead of PL_e2a[]. The new array is more likely to be in the memory cache.
* Change some UTF-EBCDIC macro handling defns (Karl Williamson, 2015-09-04, 1 file, -14/+19)
| | | | | | | | | This commit changes the definitions of some macros for UTF-8 handling on EBCDIC platforms. The previous definitions transformed the bytes into I8 and did tests on the transformed values. The change is to use previously unused bits in l1_char_class_tab.h so the transform isn't needed, and generally only one branch is. These macros are called from the inner loops of, for example, regex backtracking.