| Commit message (Collapse) | Author | Age | Files | Lines |
| |
This macro follows Unicode Corrigendum #9 to allow non-character code
points. These are still discouraged but not completely forbidden.
Code that isn't intended to operate on the text of arbitrary other code
should keep using the original definition, but code that does, such as
source code control, should change to this definition if it wants to be
Unicode-strict.
Perl can't adopt C9 wholesale, as it might create security holes in
existing applications that rely on Perl keeping non-chars out.
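
As a rough illustration of the distinction above, here is a minimal
sketch, assuming the standard Unicode list of 66 non-characters
(U+FDD0..U+FDEF plus the last two code points of each plane). The
function names are hypothetical and are not the macros this commit
adds; rejecting surrogates in both checks is also an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not a Perl core macro: true if cp is one of the
 * 66 Unicode non-character code points. */
static bool is_unicode_nonchar(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return false;                 /* beyond the Unicode range */
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return true;                  /* the contiguous block of 32 */
    return (cp & 0xFFFE) == 0xFFFE;   /* U+xxFFFE, U+xxFFFF of each plane */
}

/* Assumption: both checks reject surrogates and above-Unicode values. */
static bool in_unicode_scalar_range(uint32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* Original-style strict check: non-characters are rejected. */
static bool acceptable_strict(uint32_t cp)
{
    return in_unicode_scalar_range(cp) && !is_unicode_nonchar(cp);
}

/* Corrigendum #9 style check: non-characters are allowed through. */
static bool acceptable_c9(uint32_t cp)
{
    return in_unicode_scalar_range(cp);
}
```

Under these assumptions U+FFFF fails acceptable_strict() but passes
acceptable_c9(), while a surrogate such as U+D800 fails both.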
| |
This changes the macro isUTF8_CHAR to have the same number of code
points built-in for EBCDIC as ASCII. This obsoletes the
IS_UTF8_CHAR_FAST macro, which is removed.
Previously, the code generated by regen/regcharclass.pl for ASCII
platforms was hand copied into utf8.h, and LIKELY's manually added, then
the generating code was commented out. Now this has been done with
EBCDIC platforms as well. This makes regenerating regcharclass.h
faster.
The copied macro in utf8.h is moved by this commit to within the main
code section for non-EBCDIC compiles, cutting the number of #ifdef's
down, and the comments about it are changed somewhat.
| |
This will allow the next commit to not have to actually try to decode
the UTF-8 string in order to see if it overflows the platform.
| |
for future use
| |
Previous commits have set things up so the macros are the same on both
platforms. By moving them to the common part of utf8.h, they can share
the same definition. Because this move is large compared with the code
that really stayed in place, the diff listing instead shows that other
code as being moved.
| |
The previous commits have made these macros the exact same text, so
they can be combined and defined just once. This requires moving them to
the portion of the file that is common to both EBCDIC and ASCII.
The commit diff instead shows other code being moved.
| |
Change to use the same definition for two macros on both types of
platforms, simplifying the code, by using the underlying structure of
the encoding.
| |
By making sure the no-op macros cast the output appropriately, we can
eliminate the casts that have been added in things that call them.
| |
By using a more fundamental value, these two definitions of the macro
can be made the same, so only one is needed, common to both platforms.
| |
This creates a macro that is used in portions of 2 other macros, thus
removing repetition.
| |
The definition had gotten moved away from its comments in utf8.h, and
a #error was guarding the wrong thing (UTF8_MAXBYTES instead).
And it is possible to generalize to get the compiler to do the
calculation, and to consolidate the definitions from the two files into
a single one.
| |
This uses for UTF-EBCDIC essentially the same mechanism that Perl
already uses for UTF-8 on ASCII platforms to extend it beyond what might
be its natural maximum. That is, when the UTF-8 start byte is 0xFF, it
adds a bunch more bytes to the character than it otherwise would,
bringing it to a total of 14 for UTF-EBCDIC. This is enough to handle
any code point that fits in a 64 bit word.
The downside of this is that this extension is not compatible with
previous perls for the range 2**30 up through the previous max,
2**31 - 1. A simple program could be written to convert files that were
written out using an older perl so that they can be read with newer
perls, and the perldelta says we will do this should anyone ask.
However, I strongly suspect that the number of such files in existence
is zero, as people in EBCDIC land don't seem to use Unicode much, and
these are very large code points, which are associated with a
portability warning every time they are output in some way.
This extension brings UTF-EBCDIC to parity with UTF-8, so that both can
cover a 64-bit word. It allows some removal of special cases for EBCDIC
in core code and core tests. And it is a necessary step to handle Perl
6's NFG, which I'd like eventually to bring to Perl 5.
This commit causes two implementations of a macro in utf8.h and
utfebcdic.h to become the same, and both are moved to a single one in
the portion of utf8.h common to both.
To illustrate, the I8 for U+3FFFFFFF (2**30-1) is
"\xFE\xBF\xBF\xBF\xBF\xBF\xBF" before and after this commit, but the I8
for the next code point, U+40000000 is now
"\xFF\xA0\xA0\xA0\xA0\xA0\xA0\xA1\xA0\xA0\xA0\xA0\xA0\xA0",
and before this commit it was "\xFF\xA0\xA0\xA0\xA0\xA0\xA0".
The I8 for 2**64-1 (U+FFFFFFFFFFFFFFFF) is
"\xFF\xAF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF", whereas
before this commit it was unrepresentable.
Commit 7c560c3beefbb9946463c9f7b946a13f02f319d8 said in its message that
it was moving something that hadn't been needed on EBCDIC until the
"next commit". That statement turned out to be wrong, overtaken by
events. This now is the commit it was referring to.
commit I prematurely
pushed that
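
The two 0xFF-form examples above can be reproduced with a short sketch,
assuming only what the examples show: a 0xFF start byte followed by 13
continuation bytes, each of the form 101xxxxx carrying 5 bits. The
function name is hypothetical. It produces I8 only; real UTF-EBCDIC
output would still need the code-page-specific byte mapping, which is
not shown, and the shorter forms used for smaller code points are not
handled.

```c
#include <stddef.h>
#include <stdint.h>

#define I8_FF_LEN 14   /* total bytes in the 0xFF form, per the message */

/* Hypothetical helper: write the 14-byte 0xFF-form I8 sequence for cp
 * into out[]. Intended for cp >= 2**30; smaller values would come out
 * as overlong sequences. */
static void i8_encode_ff_form(uint64_t cp, unsigned char out[I8_FF_LEN])
{
    size_t i;

    out[0] = 0xFF;
    /* Fill the 13 continuation bytes from last to first; each one
     * carries the next 5 low-order bits under the 101xxxxx marker. */
    for (i = I8_FF_LEN - 1; i >= 1; i--) {
        out[i] = (unsigned char)(0xA0 | (cp & 0x1F));
        cp >>= 5;
    }
}
```

Thirteen continuation bytes of 5 bits each give 65 bits of payload,
which is why this form can hold any 64-bit value.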
| |
The magic number 13 is used in various places on ASCII platforms, and
7 correspondingly on EBCDIC. This moves the #defines for what these
represent to early in their files, and uses the symbolic name
thereafter.
| |
This should make more CPAN and other code work without change. Usually,
unwittingly, code that says UNI_IS_INVARIANT means to use the native
platform code values for code points below 256, so acquiesce to the
expected meaning and make the macro correspond. Since the native values
on ASCII machines are the same as Unicode, this change doesn't affect
code running on them.
A new macro, OFFUNI_IS_INVARIANT, is created for those few places that
really do want a Unicode value. There are just a few places in the Perl
core like that, which this commit changes.
| |
It occurred to me in code reading that it was possible for these macros
to not give the correct result if passed a signed argument.
An earlier version of this commit was buggy. Thanks to Yaroslav Kuzmin
for spotting that.
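
A minimal sketch of the kind of problem described, assuming a range
test written without regard to signedness. Both macro names are
hypothetical and neither is the definition in utf8.h; the cast shown is
one possible remedy, not necessarily the one this commit uses.

```c
#include <stdint.h>

/* Naive form: a negative argument compares as less than 0x80. */
#define NAIVE_IS_INVARIANT(c)  ((c) < 0x80)

/* Safer form: converting to an unsigned 64-bit type first turns any
 * negative argument into a very large value, so the test fails. */
#define SAFER_IS_INVARIANT(c)  (((uint64_t)(c)) < 0x80)
```

With a signed char holding -61 (the bit pattern 0xC3 on typical
platforms), the naive macro reports an invariant and the safer one does
not.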
| |
These will detect an array bounds error that occurs on EBCDIC machines,
and by including the assert on non-EBCDIC, we verify that the code
wouldn't fail when built on EBCDIC.
| |
This changes the definition of isUTF8_POSSIBLY_PROBLEMATIC() on EBCDIC
platforms to use PL_charclass[] instead of PL_e2a[]. The new array is
more likely to be in the memory cache.
| |
Prior to this commit, UVCHR_SKIP() had the same definition on both ASCII
and EBCDIC, but it expanded to different things on each. Now the two are
defined separately, each as what it previously expanded to, and the
EBCDIC version, when fully expanded, is changed to use PL_charclass[]
instead of PL_e2a[]. The new array is more likely to be in the memory
cache.
| |
Prior to this commit, UVCHR_IS_INVARIANT() had the same definition on
both ASCII and EBCDIC, but it expanded to different things on each. Now
the two are defined separately, each as what it previously expanded to,
and the EBCDIC version, when fully expanded, is changed to use
PL_charclass[] instead of PL_e2a[]. The new array is more likely to be
in the memory cache.
| |
This commit changes the definitions of some macros for UTF-8 handling on
EBCDIC platforms. The previous definitions transformed the bytes into
I8 and did tests on the transformed values. The change is to use
previously unused bits in l1_char_class_tab.h so the transform isn't
needed, and generally only one branch is needed. These macros are called
from the inner loops of, for example, regex backtracking.
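
The general technique, a per-byte flag table consulted with a single
mask test, can be sketched as follows. This is an illustration on plain
UTF-8 byte ranges (continuation bytes 0x80..0xBF, standard start bytes
0xC2..0xF4), not the EBCDIC tables or the actual bit layout of
l1_char_class_tab.h. The table, flag, and function names are all
hypothetical.

```c
#include <stdint.h>

#define CLASS_CONTINUATION 0x01u   /* hypothetical flag bit */
#define CLASS_START        0x02u   /* hypothetical flag bit */

static uint8_t class_tab[256];

/* Fill the table once; afterwards each classification is one lookup
 * and one mask, with no per-byte transformation. */
static void init_class_tab(void)
{
    unsigned b;

    for (b = 0; b < 256; b++) {
        uint8_t flags = 0;
        if (b >= 0x80 && b <= 0xBF)
            flags |= CLASS_CONTINUATION;
        if (b >= 0xC2 && b <= 0xF4)
            flags |= CLASS_START;
        class_tab[b] = flags;
    }
}

static int is_continuation(uint8_t byte)
{
    return (class_tab[byte] & CLASS_CONTINUATION) != 0;
}

static int is_start(uint8_t byte)
{
    return (class_tab[byte] & CLASS_START) != 0;
}
```

On a platform where the interesting byte values are not contiguous, the
same lookup still costs one table read, which is the appeal of putting
the information in spare table bits.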
| |
One is false, and one is addressed now in the perlebcdic.pod
| |
An empty cpan/.dir-locals.el stops Emacs using the core defaults for
code imported from CPAN.
Committer's work:
To keep t/porting/cmp_version.t and t/porting/utils.t happy, $VERSION needed
to be incremented in many files, including throughout dist/PathTools.
perldelta entry for module updates.
Add two Emacs control files to MANIFEST; re-sort MANIFEST.
For: RT #124119.
| |
The definition was incorrect. When going from control to printable
name, we need to go from Latin1 -> Native, so that, e.g., a 65 gets
turned into the native 'A'.
| |
This causes the generated file ebcdic_tables.h to be #included by
utfebcdic.h instead of the hand-coded tables that were formerly there.
This makes it much easier to add or remove support for EBCDIC code
pages.
The UTF-EBCDIC-related tables for 037 and POSIX-BC are somewhat modified
from what they were before. They were changed by hand minimally a long
time ago to prevent segfaults, but in so doing, they lost an important
sorting characteristic of UTF-EBCDIC. The machine-generated versions
retain the sorting while also avoiding the segfaults. utfebcdic.h has
more detail about this, regarding tr16.
| |
Clarifications and typo fix.
| |
This is for consistency with the rest of Perl
| |
The operand of this macro is implicitly a UV. Make sure that it is.
| |
The previous definition broke good encapsulation rules. UTF_START_MARK
should return something that fits in a byte; it shouldn't be the caller
that does this. So the mask is moved into the definition. This means
it can apply only to the portion that creates something larger than a
byte. Further, the EBCDIC version can be simplified, since 7 is the
largest possible number of bytes in an EBCDIC UTF8 character.
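
To illustrate keeping the mask inside the definition, here is a sketch
written as a function for UTF-8 on ASCII platforms. The name is
hypothetical and the expression is an assumption modelled on the usual
UTF-8 lead-byte patterns; it is not claimed to be the text in utf8.h.

```c
#include <stdint.h>

/* Hypothetical function form: return the start-byte marker bits for a
 * sequence of len bytes (len >= 2). The 0xFF mask lives here, so the
 * result always fits in a byte and callers need no mask of their own. */
static uint8_t utf8_start_mark(unsigned len)
{
    if (len > 7)
        return 0xFF;    /* extended form: every marker bit set */
    return (uint8_t)(0xFF & (0xFE << (7 - len)));
}
```

For example, a length of 2 yields 0xC0 and a length of 4 yields 0xF0,
matching the familiar 110xxxxx and 11110xxx lead-byte patterns.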
| |
These two macros were improperly expanding the parameters as well as
defining the operation, leading to compile errors.
| |
This means use official Unicode code point numbering, not native. Doing
this converts the existing UNISKIP calls in the code to refer to native
code points, which is what they meant anyway. The terminology is
somewhat ambiguous, but I don't think it will cause real confusion.
NATIVE_SKIP is also introduced for situations where it is important to
be precise.
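
For readers unfamiliar with what a "skip" computation returns, this is
a sketch of the byte count needed for a code point in UTF-8 on an ASCII
platform, where Unicode and native code points coincide. The function
name is hypothetical. The thresholds up to 4 bytes are standard UTF-8;
the longer forms, including the jump to 13 bytes, are assumptions based
on Perl's extended UTF-8 and on the number 13 mentioned elsewhere in
this log.

```c
#include <stdint.h>

/* Hypothetical helper: number of bytes needed to encode cp. */
static unsigned utf8_skip_for(uint64_t cp)
{
    if (cp < 0x80ULL)          return 1;   /* 7 bits  */
    if (cp < 0x800ULL)         return 2;   /* 11 bits */
    if (cp < 0x10000ULL)       return 3;   /* 16 bits */
    if (cp < 0x200000ULL)      return 4;   /* 21 bits */
    if (cp < 0x4000000ULL)     return 5;   /* 26 bits (assumed extension) */
    if (cp < 0x80000000ULL)    return 6;   /* 31 bits (assumed extension) */
    if (cp < 0x1000000000ULL)  return 7;   /* 36 bits (assumed extension) */
    return 13;                             /* assumed 0xFF-start form */
}
```

On EBCDIC the same question has two different answers depending on
whether the argument is a Unicode or a native code point, which is the
ambiguity the separate names are meant to remove.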
| |
These are final tables that haven't been converted to native character
set casing.
| |
These macros are no longer called in the Perl core. This commit turns
them into functions so that they can use gcc's deprecation facility.
I believe these were defective right from the beginning, and I have
struggled to understand what's going on. From the name, it appears
NATIVE_TO_NEED takes a native byte and turns it into UTF-8 if the
appropriate parameter indicates that. But that is impossible to do
correctly from that API, as for variant characters, it needs to return
two bytes. It could only work correctly if ch is an I8 byte, which
isn't native, and hence the name would be wrong.
Similar arguments for ASCII_TO_NEED.
The function S_append_utf8_from_native_byte(const U8 byte, U8** dest)
does what I think NATIVE_TO_NEED intended.
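
The "needs to return two bytes" point can be seen in a sketch of an
append-style helper with the same shape of signature, restricted here
to ASCII platforms where a native byte is a Latin-1 code point. The
function name is hypothetical and this is not the body of
S_append_utf8_from_native_byte; an EBCDIC version would need an extra
mapping step that is not shown.

```c
#include <stdint.h>

/* Hypothetical ASCII-platform helper: append the UTF-8 encoding of one
 * Latin-1 byte at *dest and advance *dest past what was written. */
static void append_utf8_from_latin1_byte(uint8_t byte, uint8_t **dest)
{
    if (byte < 0x80) {
        /* Invariant: the same single byte in UTF-8. */
        *(*dest)++ = byte;
    } else {
        /* Variant: two bytes, a 110xxxxx start and a 10xxxxxx
         * continuation. */
        *(*dest)++ = (uint8_t)(0xC0 | (byte >> 6));
        *(*dest)++ = (uint8_t)(0x80 | (byte & 0x3F));
    }
}
```

Because the output length varies, a macro that evaluates to a single
byte value cannot express this, whereas a helper that advances a
destination pointer can.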
| |
This converts several areas of code to use the more clearly named macros
introduced in the previous commit.
| |
This commit creates macros whose names mean something to me, and which I
don't find confusing. The older names are retained for backwards
compatibility. Future commits will fix bugs I introduced from
misunderstanding the meaning of the older names.
The older names are now #defined in terms of the newer ones, and moved
so that they are only defined once, valid for both ASCII and EBCDIC
platforms.
| |
This patch was posted in
http://markmail.org/message/pwjxbxnlazvxgsyw
| |
This also reorders one #define to be closer to a related one.