| Commit message | Author | Age | Files | Lines |
 
By generalizing a macro, we can make it serve both ASCII and EBCDIC
 
This moves a #define into the common code for ASCII and EBCDIC machines.
It adds a bunch of comments about the value that I wish I hadn't had to
figure out for myself.
 
A symbol introduced in a previous commit allows this internal macro to
only need a single version, suitable for either EBCDIC or ASCII.
 
This info is needed in one other place; doing it here means only
specifying it once.
 
This is more clearly named for various uses in this file. It has an
unwieldy length, but is unlikely to be used outside it.
 
For a couple of releases now, these have been uncalled
 
This just detabifies to get rid of the mixed tab/space indentation.
Applying consistent indentation and dealing with other tabs is a separate
issue.
Done with `expand -i`.
* vutil.* is left alone; it's part of version.
* Regen-managed files are left alone for now.
 
This can be derived from other values, removing an EBCDIC dependency
 
This can be derived from other values, removing an EBCDIC dependency
 
This can be derived from other values, removing an EBCDIC dependency
 
This can be derived from other values, removing an EBCDIC dependency
 
This can be derived from other values, removing an EBCDIC dependency
 
This can be derived from other values, removing an EBCDIC dependency
 
This can be derived from other values, removing an EBCDIC dependency
 
This can be derived from other values, removing an EBCDIC dependency
 
This variable can be defined from the same base in both UTF-8 and
UTF-EBCDIC, and doing so eliminates an EBCDIC dependency.
 
The #include needs to always be done, so remove the #ifdef. The
included file has the proper setup anyway for the variables that were
used.
 
It somehow dawned on me that the code is incorrect for
warning/disallowing very high code points. What is really wanted in the
API is to catch UTF-8 that is not necessarily portable. There are
several classes of this, but I'm referring here to just the code points
that are above the Unicode-defined maximum of 0x10FFFF. These can be
considered non-portable, and there is a mechanism in the API to
warn/disallow them.

However, an earlier standard defined UTF-8 to handle code points up to
2**31-1. Anything above that is using an extension to UTF-8 that has
never been officially recognized. Perl does use such an extension, and
the API is supposed to have a different mechanism to warn/disallow it.

Thus there are two classes of warning/disallowing for above-Unicode code
points: one for things that have some non-Unicode official recognition,
and the other for things that have never had official recognition.

UTF-EBCDIC differs somewhat in this, and since Perl 5.24 we have had a
Perl extension that allows it to handle any code point that fits in a
64-bit word. This kicks in at code points above 2**30-1, a different
number than the one at which extended UTF-8 kicks in on ASCII platforms.

Things are also complicated by the fact that the API has provisions for
accepting the overlong UTF-8 malformation: it is possible to use
extended UTF-8 to represent code points smaller than 31-bit ones.

Until this commit, the extended warning/disallowing was based on the
resultant code point, and applied only when that code point did not fit
into 31 bits. But what is really wanted is to know whether extended
UTF-8 was used to represent a code point, no matter how large the
resultant code point is. This differs from the previous definition only
on EBCDIC platforms, or when the overlong malformation is also present,
so it does not affect very many real-world cases.

This commit fixes that. It turns out that it is easier to tell if
something is using extended UTF-8: one just looks at the first byte of a
sequence.

The trailing part of the warning message that gets raised is slightly
changed to be clearer. It's not significant enough to affect perldiag.
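As a rough sketch of the "just look at the first byte" observation, for
an ASCII platform only (an illustration, not Perl's actual macro; EBCDIC
uses different lead bytes): the pre-Unicode 31-bit UTF-8 definition never
produces 0xFE or 0xFF as a lead byte, so a sequence starting with either
is using the extension, whatever code point it eventually decodes to.

    #include <stdbool.h>

    /* Sketch, not the perl source: lead bytes 0x00 through 0xFD are all
     * that 31-bit UTF-8 ever produces, so a first byte of 0xFE or 0xFF
     * means the extension is in use, regardless of what code point the
     * sequence ultimately decodes to (overlongs included). */
    static bool uses_extended_utf8(const unsigned char *s)
    {
        return s[0] >= 0xFE;
    }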
 
These are very specialized #defines to determine if UTF-8 overflows the
word size of the platform. I think it's unwise to make them even
somewhat generally available.
 
Spotted by Christian Hansen
 
This is a follow-up to commit 9f2eed981068e7abbcc52267863529bc59e9c8c0,
which manually added const qualifiers to some generated code in order to
avoid some compiler warnings. The code changed by the other commit had
been hand-edited after being generated to add branch prediction, which
would be too hard to program in at this time, so the const additions
also had to be hand-edited in.

The commit just before this current one changed the generator to add the
const, and I then did comparisons by hand to make sure the only
differences were the branch predictions. In doing so, I found one
missing const, plus a bunch of differences in the generated code for
EBCDIC 037. We do not currently have a smoker for that system, so the
differences could be as a result of a previous error, or they could be
the result of the added 'const' causing the macro generator to split
things differently. It splits in order to avoid size limits in some
preprocessors, and the extra 'const' tokens could have caused it to make
its splits differently.

Since we don't have any smokers for this, and no known actual systems
running it, I decided not to bother to hand-edit the output to add
branch prediction.
 
The original code was generated and then hand-tuned. Therefore
I edited the code in place instead of fixing the regen/regcharclass.pl
generator.
Signed-off-by: Petr Písař <ppisar@redhat.com>
 
This macro follows Unicode Corrigendum #9 to allow non-character code
points. These are still discouraged but not completely forbidden.
It's best for code that isn't intended to operate on arbitrary other
text to use the original definition, but code that does, such as source
code control systems, should change to use this definition if it wants
to be Unicode-strict.
Perl can't adopt C9 wholesale, as it might create security holes in
existing applications that rely on Perl keeping non-chars out.
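For reference, the code points involved are the 66 Unicode
non-characters: U+FDD0..U+FDEF plus the last two code points of each of
the 17 planes. A minimal check, written here as an illustration rather
than as Perl's macro, might look like:

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustration only: true for the 66 non-character code points that
     * Corrigendum #9 downgraded from "forbidden in interchange" to
     * "discouraged". */
    static bool is_unicode_noncharacter(uint32_t cp)
    {
        return (cp >= 0xFDD0 && cp <= 0xFDEF)
            || (cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE);
    }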
 
This changes the macro isUTF8_CHAR to have the same number of code
points built-in for EBCDIC as for ASCII. This obsoletes the
IS_UTF8_CHAR_FAST macro, which is removed.

Previously, the code generated by regen/regcharclass.pl for ASCII
platforms was hand-copied into utf8.h, LIKELYs were added manually, and
then the generating code was commented out. Now this has been done for
EBCDIC platforms as well. This makes regenerating regcharclass.h
faster.

The copied macro in utf8.h is moved by this commit to within the main
code section for non-EBCDIC compiles, cutting down the number of
#ifdefs, and the comments about it are changed somewhat.
 
This will allow the next commit to not have to actually try to decode
the UTF-8 string in order to see if it overflows the platform.
 
for future use
 
Previous commits have set things up so that these macros are the same on
both platforms. By moving them to the common part of utf8.h, they can
share the same definition. Because of the size of this move relative to
what actually stayed the same, the diff listing instead shows other
things being moved.
 
The previous commits have made these macros the exact same text, so they
can be combined and defined just once. This requires moving them to the
portion of the file that is common to both EBCDIC and ASCII.
The commit diff instead shows other code being moved.
 
Change to use the same definition for two macros on both types of
platforms, simplifying the code by using the underlying structure of the
encoding.
 
By making sure the no-op macros cast their output appropriately, we can
eliminate the casts that have been added to the things that call them.
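The idea, sketched with a hypothetical name (not the actual Perl macro):
even a conversion that does nothing on a given platform still performs
the cast, so its callers no longer have to.

    typedef unsigned char U8;

    /* Hypothetical sketch: on an ASCII platform the native-to-Latin-1
     * mapping is an identity, but keeping the cast inside the "no-op"
     * macro spares every caller from writing (U8) itself. */
    #define MY_NATIVE_TO_LATIN1(ch)  ((U8)(ch))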
 
By using a more fundamental value, these two definitions of the macro
can be made the same, so only one is needed, common to both platforms.
 
This creates a macro that is used in portions of 2 other macros, thus
removing repetition.
 
The definition had gotten moved away from its comments in utf8.h, and
the wrong thing (UTF8_MAXBYTES) was being guarded by an #error. It is
also possible to generalize the definition so that the compiler does the
calculation, and to consolidate the definitions from the two files into
a single one.
 
This uses for UTF-EBCDIC essentially the same mechanism that Perl
already uses for UTF-8 on ASCII platforms to extend it beyond what might
be its natural maximum. That is, when the UTF-8 start byte is 0xFF, it
adds a bunch more bytes to the character than it otherwise would,
bringing it to a total of 14 for UTF-EBCDIC. This is enough to handle
any code point that fits in a 64-bit word.

The downside is that this extension is not compatible with previous
perls for the range 2**30 up through the previous max, 2**31 - 1. A
simple program could be written to convert files that were written out
using an older perl so that they can be read with newer perls, and the
perldelta says we will do this should anyone ask. However, I strongly
suspect that the number of such files in existence is zero, as people in
EBCDIC land don't seem to use Unicode much, and these are very large
code points, which are associated with a portability warning every time
they are output in some way.

This extension brings UTF-EBCDIC to parity with UTF-8, so that both can
cover a 64-bit word. It allows some removal of special cases for EBCDIC
in core code and core tests. And it is a necessary step to handle Perl
6's NFG, which I'd like eventually to bring to Perl 5.

This commit causes two implementations of a macro in utf8.h and
utfebcdic.h to become the same, and both are moved to a single one in
the portion of utf8.h common to both.

To illustrate, the I8 for U+3FFFFFFF (2**30-1) is
"\xFE\xBF\xBF\xBF\xBF\xBF\xBF" before and after this commit, but the I8
for the next code point, U+40000000, is now
"\xFF\xA0\xA0\xA0\xA0\xA0\xA0\xA1\xA0\xA0\xA0\xA0\xA0\xA0",
whereas before this commit it was "\xFF\xA0\xA0\xA0\xA0\xA0\xA0".
The I8 for 2**64-1 (U+FFFFFFFFFFFFFFFF) is
"\xFF\xAF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF", whereas
before this commit it was unrepresentable.

Commit 7c560c3beefbb9946463c9f7b946a13f02f319d8 said in its message that
it was moving something that hadn't been needed on EBCDIC until the
"next commit". That statement turned out to be wrong, overtaken by
events -- a commit I prematurely pushed. This now is the commit it was
referring to.
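The byte count follows from the I8 format: a continuation byte
(0xA0-0xBF) carries 5 payload bits and the 0xFF start byte carries none,
so 13 continuations give 65 bits, just enough for a 64-bit word, while
the old FE-led 7-byte form tops out at 6 * 5 = 30 bits, i.e. 2**30 - 1.
A quick check of that arithmetic (a sketch, not perl source):

    #include <stdio.h>

    int main(void)
    {
        const int i8_payload_bits = 5;  /* I8 continuation bytes: 101xxxxx */

        /* New FF-led form: 13 continuation bytes */
        printf("14-byte FF form: %d payload bits\n", 13 * i8_payload_bits); /* 65 */

        /* Old FE-led maximum: 6 continuation bytes */
        printf(" 7-byte FE form: %d payload bits\n",  6 * i8_payload_bits); /* 30 */
        return 0;
    }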
 
The magic number 13 is used in various places on ASCII platforms, and
7 correspondingly on EBCDIC. This moves the #defines for what these
represent to early in their files, and uses the symbolic name
thereafter.
 
This should make more CPAN and other code work without change.
Usually, unwittingly, code that says UNI_IS_INVARIANT means to use the
native platform code values for code points below 256, so this
acquiesces to the expected meaning and makes the macro correspond.
Since the native values on ASCII machines are the same as Unicode's,
this change doesn't affect code running on them.

A new macro, OFFUNI_IS_INVARIANT, is created for those few places that
really do want a Unicode value. There are just a few places in the Perl
core like that, which this commit changes.
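To make the native-versus-Unicode distinction concrete (a hypothetical
helper, not the perl macro): on an ASCII platform the two value spaces
coincide below 256, so the test is the familiar one below; on EBCDIC
they do not (native 'A' is 0xC1, Unicode 'A' is 0x41), which is why
callers holding native values need the macro that takes native values.

    #include <stdbool.h>
    #include <stdint.h>

    /* Sketch for an ASCII platform only: a Unicode code point is
     * invariant when its encoded form is the single identical byte. */
    static bool offuni_is_invariant_sketch(uint32_t uv)
    {
        return uv < 0x80;
    }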
 
It occurred to me in code reading that it was possible for these macros
to not give the correct result if passed a signed argument.
An earlier version of this commit was buggy. Thanks to Yaroslav Kuzmin
for spotting that.
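The class of bug being guarded against looks roughly like this (an
example, not the actual macro): with a plain char argument that happens
to hold a negative value, an unguarded range test answers for the
sign-extended integer rather than for the byte.

    #include <stdio.h>

    #define NAIVE_IS_INVARIANT(c) ((c) < 0x80)                  /* wrong for signed c */
    #define CAST_IS_INVARIANT(c)  (((unsigned char)(c)) < 0x80) /* one possible fix   */

    int main(void)
    {
        char c = (char)0xE9;  /* a Latin-1 byte; negative wherever char is signed */
        printf("naive=%d cast=%d\n", NAIVE_IS_INVARIANT(c), CAST_IS_INVARIANT(c));
        return 0;
    }

On a typical signed-char build the naive form wrongly reports the 0xE9
byte as invariant; the cast form does not.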
 
These will detect an array bounds error that occurs on EBCDIC machines,
and by including the assert on non-EBCDIC, we verify that the code
wouldn't fail when built on EBCDIC.
 
This changes the definition of isUTF8_POSSIBLY_PROBLEMATIC() on EBCDIC
platforms to use PL_charclass[] instead of PL_e2a[]. The new array is
more likely to be in the memory cache.
 
Prior to this commit, UVCHR_SKIP() had the same definition on both ASCII
and EBCDIC platforms, but that definition expanded to different things
on each. Now the two cases are defined separately, directly in terms of
what they expand to, and the EBCDIC version, when fully expanded, uses
PL_charclass[] instead of PL_e2a[]. The new array is more likely to be
in the memory cache.
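What the macro computes is just the length of the encoding for a given
code point. On an ASCII platform that can be written arithmetically,
roughly as below (a sketch that ignores the extended forms, not the perl
definition); the commit's point is that on EBCDIC the same answer now
comes from PL_charclass[] rather than from a detour through PL_e2a[].

    #include <stdint.h>

    /* Sketch: UTF-8 length of a code point on an ASCII platform,
     * ignoring Perl's extended (above 2**31-1) representation. */
    static int uvchr_skip_sketch(uint32_t cp)
    {
        return cp < 0x80       ? 1
             : cp < 0x800      ? 2
             : cp < 0x10000    ? 3
             : cp < 0x200000   ? 4
             : cp < 0x4000000  ? 5
             :                   6;
    }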
 
Prior to this commit, UVCHR_IS_INVARIANT() had the same definition on
both ASCII and EBCDIC platforms, but that definition expanded to
different things on each. Now the two cases are defined separately,
directly in terms of what they expand to, and the EBCDIC version, when
fully expanded, uses PL_charclass[] instead of PL_e2a[]. The new array
is more likely to be in the memory cache.
 
This commit changes the definitions of some macros for UTF-8 handling on
EBCDIC platforms. The previous definitions transformed the bytes into
I8 and did tests on the transformed values. The change is to use
previously unused bits in l1_char_class_tab.h so the transform isn't
needed, and generally only one branch is. These macros are called from
the inner loops of, for example, regex backtracking.
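The general shape of the technique, shown here with hypothetical names
rather than the real l1_char_class_tab.h contents: give every possible
byte value a precomputed classification word, so the hot-path macro can
answer its question with one table lookup and one mask instead of first
transforming the byte to I8.

    #include <stdbool.h>
    #include <stdint.h>

    #define CC_BIT_UTF8_START (1u << 0)   /* hypothetical bit assignment */

    static uint32_t char_class_tab[256];  /* Perl generates its table at build time */

    static void init_char_class_tab(void)
    {
        /* Bytes that can begin a multi-byte sequence in 31-bit UTF-8 on an
         * ASCII platform; 0xC0 and 0xC1 are omitted as always-overlong. */
        for (int b = 0xC2; b <= 0xFD; b++)
            char_class_tab[b] |= CC_BIT_UTF8_START;
    }

    static bool is_utf8_start(uint8_t byte)
    {
        return (char_class_tab[byte] & CC_BIT_UTF8_START) != 0;
    }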