This means use official Unicode code point numbering, not native. Doing
this converts the existing UNISKIP calls in the code to refer to native
code points, which is what they meant anyway. The terminology is
somewhat ambiguous, but I don't think it will cause real confusion.
NATIVE_SKIP is also introduced for situations where it is important to
be precise.
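For illustration, a minimal sketch of what a "skip" computation does,
assuming an ASCII platform and standard UTF-8 (this is not Perl's
actual macro; UTF-EBCDIC lengths differ, which is why the
native-versus-Unicode numbering distinction matters):

    #include <stdio.h>

    /* Number of bytes a code point occupies when UTF-8 encoded
     * (ASCII platform assumed; UTF-EBCDIC differs). */
    static int my_uvchr_skip(unsigned long uv)
    {
        if (uv < 0x80)    return 1;
        if (uv < 0x800)   return 2;
        if (uv < 0x10000) return 3;
        return 4;  /* the rest of Unicode, up to 0x10FFFF */
    }

    int main(void)
    {
        printf("U+00E9 needs %d byte(s)\n", my_uvchr_skip(0xE9));   /* 2 */
        printf("U+20AC needs %d byte(s)\n", my_uvchr_skip(0x20AC)); /* 3 */
        return 0;
    }

On an EBCDIC platform the same character has a different native code
point, so passing a native value where Unicode numbering is expected
(or vice versa) yields the wrong length; hence the two names.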
|
Everything but \N{U+XXXX} is now in native,
|
This is in preparation for deprecating these functions, to force any
code that has been using these functions to change.
Since the Unicode tables are now stored in native order, these
functions should only rarely be needed.
However, the functionality of these is needed, and in actuality, on
ASCII platforms, the native functions are #defined to these. So what
this commit does is rename the functions to something else, and create
wrappers with the old names, so that anyone using them will get the
deprecation when it actually goes into effect: we are waiting for CPAN
files distributed with the core to change before doing the deprecation.
According to grep.cpan.me, this should affect fewer than 10 additional
CPAN distributions.
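A minimal sketch of the rename-plus-wrapper pattern described above
(names hypothetical, not the actual Perl functions):

    #include <stdio.h>

    /* The implementation moves to a new name... */
    static unsigned long uv_convert_impl(unsigned long uv)
    {
        return uv;  /* stand-in for the real work */
    }

    /* ...and the old name becomes a thin wrapper that can carry a
     * deprecation marker once the deprecation goes into effect. */
    #if defined(__GNUC__)
    __attribute__((deprecated))
    #endif
    static unsigned long uv_convert_old(unsigned long uv)
    {
        return uv_convert_impl(uv);
    }

    int main(void)
    {
        /* Still works; gcc can warn here once the attribute is on. */
        printf("0x%lX\n", uv_convert_old(0x41));
        return 0;
    }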
|
Code should almost never be dealing with non-native code points.
This is in preparation for later deprecation when our CPAN modules have
been converted away from using it.
|
Now that the tables are stored in native order, there is almost no need
for code to be dealing in Unicode order.
According to grep.cpan.me, there are no uses of this function in CPAN.
|
Now that all the tables are stored in native format, there is very
little reason to use this function; and those who do need this kind of
functionality should be using the bottom-level routine, so as to make it
clear they are doing nonstandard stuff.
According to grep.cpan.me, there are no uses of this function in CPAN.
|
This is in preparation for the current wrappee becoming deprecated.
|
Since the value is invariant under both UTF-8 and not, we already have
it in 'uv'; no need to do anything else to get it.
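A sketch of the invariant property being relied on, assuming an ASCII
platform (not Perl's actual macro):

    #include <assert.h>

    /* Code points below 0x80 are "invariant": their UTF-8 encoding
     * is the identical single byte, so a UV known to be invariant
     * already *is* its encoded form. */
    #define MY_UTF8_IS_INVARIANT(uv) ((uv) < 0x80)

    int main(void)
    {
        unsigned long uv = 'A';           /* 0x41 */
        assert(MY_UTF8_IS_INVARIANT(uv)); /* nothing to convert */
        return 0;
    }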
|
These messages say the output number is Unicode, but it is really
native, so change them to say 0xXXXX instead.
|
All the tables are now based on the native character set, so using
uvuni() in almost all cases is wrong.
|
These functions no longer have the hard-coded definitions in them,
but now end up resolving to internal functions, so that new encodings
could be added and these would automatically understand them.
Instead of using tr///, these now go character by character, converting
to/from ord, which is slower but allows them to operate on utf8
strings.
Peephole optimization should make these essentially no-ops on ASCII
platforms.
|
This commit changes these functions so that, instead of converting
to/from a string, they call functions in the utf8:: namespace that
operate on ordinals.
|
These are the final tables that haven't been converted to native
character-set casing.
|
We have not had a working modern Perl on EBCDIC for some years. When I
started out, comments and code led me to conclude erroneously that
natively it supported semantics for all 256 characters 0-255. It turns
out that I was wrong; it natively (at least on some platforms) has the
same rules (essentially none) for the characters which don't correspond
to ASCII ones, as the rules for these on ASCII platforms.
A previous commit for 5.18 changed the docs about this issue. This
current commit forces ASCII rules on EBCDIC platforms (even should there
be one that natively uses all 256). To get all 256, the same things,
such as 'use feature "unicode_strings"', must now be done.
|
handy.h is included in files that don't include perl.h, and hence not
utf8.h. We therefore can't rely on the ASCII/EBCDIC conversion
macros being available to us. The best way to cope is to use the native
ctype functions. Most, but not all, of the macros in this commit
currently resolve to use those native ones, but a future commit will
change that.
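A sketch of what delegating to the native ctype functions looks like
(macro names here are illustrative, not handy.h's actual definitions):

    #include <ctype.h>
    #include <stdio.h>

    /* With no ASCII/EBCDIC conversion macros available, just defer
     * to the platform's own classifiers. */
    #define MY_isDIGIT(c) isdigit((unsigned char)(c))
    #define MY_isSPACE(c) isspace((unsigned char)(c))

    int main(void)
    {
        printf("%d %d\n", MY_isDIGIT('7') != 0, MY_isSPACE('\t') != 0);
        /* prints: 1 1 */
        return 0;
    }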
|
Now only one of the macros (isPRINT) relies on magic numbers, leading
to clearer definitions.
|
These 4 macros can have the same RHS for their ASCII and EBCDIC
versions, so there is no need to duplicate their definitions.
This also enables the EBCDIC versions to avoid undefined expansions
when compiling without perl.h.
|
These macros are no longer called in the Perl core. This commit turns
them into functions so that they can use gcc's deprecation facility.
I believe these were defective right from the beginning, and I have
struggled to understand what's going on. From the name, it appears
NATIVE_TO_NEED takes a native byte and turns it into UTF-8 if the
appropriate parameter indicates that. But that is impossible to do
correctly from that API, as for variant characters, it needs to return
two bytes. It could only work correctly if ch is an I8 byte, which
isn't native, and hence the name would be wrong.
Similar arguments for ASCII_TO_NEED.
The function S_append_utf8_from_native_byte(const U8 byte, U8** dest)
does what I think NATIVE_TO_NEED intended.
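A sketch of what that function is understood to do, assuming an
ASCII/Latin-1 platform (this is not the core's actual code): an
invariant byte is copied as-is, while a variant byte expands to two
bytes -- precisely what a one-byte-in, one-byte-out macro cannot
express:

    #include <stdio.h>

    static void append_utf8_from_byte(unsigned char byte, unsigned char **dest)
    {
        if (byte < 0x80) {              /* invariant: copy unchanged */
            *(*dest)++ = byte;
        }
        else {                          /* variant: two UTF-8 bytes */
            *(*dest)++ = (unsigned char)(0xC0 | (byte >> 6));
            *(*dest)++ = (unsigned char)(0x80 | (byte & 0x3F));
        }
    }

    int main(void)
    {
        unsigned char buf[4] = {0};
        unsigned char *p = buf;
        append_utf8_from_byte(0xFF, &p);  /* Latin-1 y with diaeresis */
        printf("%d bytes: %02X %02X\n", (int)(p - buf), buf[0], buf[1]);
        /* prints: 2 bytes: C3 BF */
        return 0;
    }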
|
These calls are just copying the input to the output byte by byte.
There is no need to worry about UTF-8 or not, as the output is just an
exact copy of the input.
|
I believe NATIVE_TO_NEED is defective, and will remove it in a future
commit. But, just in case I'm wrong, I'm doing it in small steps so
bisects will show the culprit. This removes the calls to it where the
parameter is clearly invariant under UTF-8 and UTF-EBCDIC, and so the
result can't be other than just the parameter.
|
This code is attempting to deal with the problem of holes in the ranges
a-z and A-Z in EBCDIC. By using macros with the suffix "_A", we
emphasize that.
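A sketch of the problem and a portable alternative (the macro name
imitates the "_A" convention; it is not the actual handy.h
definition):

    #include <stdio.h>
    #include <string.h>

    /* On EBCDIC, 'a'..'z' is not contiguous: there are gaps between
     * 'i' and 'j' and between 'r' and 's', so the classic range test
     * ('a' <= c && c <= 'z') matches extra characters there.  An
     * explicit membership test is correct on both character sets. */
    #define MY_isLOWER_A(c) \
        ((c) != '\0' && strchr("abcdefghijklmnopqrstuvwxyz", (c)) != NULL)

    int main(void)
    {
        printf("%d\n", MY_isLOWER_A('q') != 0); /* 1 on ASCII and EBCDIC */
        return 0;
    }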
|
The code here was wrong in assuming that \xFF is not legal in UTF-8
encoded strings. \xFF currently doesn't work due to a bug, but that
may eventually be fixed: [perl #116867]. The comments are also wrong
in saying that all bytes are legal in UTF-EBCDIC.
It turns out that in well-formed UTF-8, the bytes C0 and C1 never
appear (C2, C3, and C4 as well in UTF-EBCDIC), as they would be the
start byte of an illegal overlong sequence.
This commit creates a #define for an illegal byte, using one of the
genuinely illegal ones, and changes the code to use that.
No test is included due to #116867.
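The arithmetic behind that claim, as a sketch: a two-byte UTF-8
sequence decodes to ((b1 & 0x1F) << 6) | (b2 & 0x3F), and with a start
byte of C0 or C1 that value can never exceed 0x7F, a code point that
must be encoded in one byte, so any such sequence is overlong:

    #include <stdio.h>

    int main(void)
    {
        unsigned char b1 = 0xC1, b2 = 0xBF; /* largest value C1 can start */
        unsigned int  cp = ((b1 & 0x1F) << 6) | (b2 & 0x3F);
        printf("%02X %02X decodes to U+%04X -- overlong\n", b1, b2, cp);
        /* prints: C1 BF decodes to U+007F -- overlong */
        return 0;
    }

That is what makes such a byte a safe choice for an "illegal byte"
#define.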
|
The code prior to this commit converted something like \04 into its
EBCDIC equivalent only in double-quoted strings. This was not done in
patterns, and so gave inconsistent results. The correct thing is to do
the native thing: what someone who works on the platform would expect
\04 to be. Platform-independent characters are available through
\N{}, either by name or by U+XXXX.
The comment changed by this commit was wrong, as in some cases the
value was native, and in some cases Unicode.
|
Now that the Unicode tables are stored in native format, we shouldn't be
doing remapping.
Note that this assumes that the Latin1 casing tables are stored in
native order; not all of this has been done yet.
|
The conversion from UTF-8 to code point should generally be to the
native code point. This adds a macro to do that, and converts the
core calls to the existing macro to use the new one instead. The old
macro is retained for possible backwards compatibility, though it
probably should be deprecated.
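A sketch of the idea (names hypothetical, not the actual utf8.h
macros): decode the two bytes to a Unicode code point, then map that
to the native code point; on ASCII platforms the mapping is the
identity:

    #include <stdio.h>

    static unsigned int uni_to_native(unsigned int cp)
    {
        return cp;  /* identity on ASCII; a table lookup on EBCDIC */
    }

    #define MY_TWO_BYTE_UTF8_TO_NATIVE(hi, lo) \
        uni_to_native((((hi) & 0x1F) << 6) | ((lo) & 0x3F))

    int main(void)
    {
        /* C3 A9 is the UTF-8 encoding of U+00E9 (e with acute). */
        printf("U+%04X\n", MY_TWO_BYTE_UTF8_TO_NATIVE(0xC3, 0xA9));
        return 0;
    }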
|
Now that mktables generates native tables, we need to make U+XXXX mean
Unicode instead of native.
|
Now that mktables generates native tables, it is a fairly simple matter
to get Unicode::UCD to work on those platforms.
|
The output tables for mktables are now in the platform's native
character set. This means there is no change for ASCII platforms, but
it is a change for EBCDIC ones.
Code that didn't realize there was a potential difference between EBCDIC
and non-EBCDIC platforms will now start to work; code that tried to do
the right thing under these circumstances will no longer work. Fixing
that comes in later commits.
|
This code is moved later in the process, in preparation for mktables
generating tables in the native character set. By that later point,
the translation to native has already been done, so no special coding
is needed.
This also caught 7 code points that were somehow omitted by the
previous logic.
|
These spots have native code points, so they should be using the
macros for native code points instead of Unicode ones; this commit
also changes the code to use the portable symbolic name of a code
point instead of the non-portable hex.
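A sketch of why the bare hex is the non-portable part (the constant
name is illustrative, in the style of the generated
unicode_constants headers; the EBCDIC value assumes code page 1047):

    #include <stdio.h>

    #ifdef MY_EBCDIC  /* hypothetical platform switch */
    #  define LATIN_SMALL_LETTER_SHARP_S_NATIVE 0x59  /* CP 1047 */
    #else
    #  define LATIN_SMALL_LETTER_SHARP_S_NATIVE 0xDF  /* ASCII/Latin-1 */
    #endif

    int main(void)
    {
        /* The symbolic name stays the same everywhere; only the
         * generated value differs per platform. */
        printf("sharp s is 0x%02X here\n",
               LATIN_SMALL_LETTER_SHARP_S_NATIVE);
        return 0;
    }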
|
These areas of code included an unnecessary temporary.
|
These macros were incorrect for EBCDIC. The 3-step process given in
utfebcdic.h wasn't being followed.
|
This fairly short paradigm is repeated in several places; a later commit
will improve it.
|
C recognizes '\a' (for BEL); just use that instead of a look-up.
regen/unicode_constants.pl could be used to generate the character for
the ESC (set in nearby code), but I didn't do that because of
potential bootstrapping problems when porting to an EBCDIC platform
without a working perl. (The other characters generated in that .pl are
less likely to cause problems when compiling perl.)
|
The macros like NATIVE_TO_UNI will work on EBCDIC, but operate on the
whole Unicode range. In the locations affected by this commit, it is
known that the domain is limited to a single byte, so the simpler ones
whose names contain LATIN1 may be used.
On ASCII platforms, all the macros are null, so there is no effective
change.
|
This converts several areas of code to use the more clearly named macros
introduced in the previous commit.
|
This commit creates macros whose names mean something to me, and which I
don't find confusing. The older names are retained for backwards
compatibility. Future commits will fix bugs I introduced from
misunderstanding the meaning of the older names.
The older names are now #defined in terms of the newer ones, and moved
so that they are only defined once, valid for both ASCII and EBCDIC
platforms.
|
This is clearer and portable to EBCDIC.
There is a subtle difference in the behavior here, which I believe is a
bug fix. Before, the code didn't treat DEL as a control, and now it
does.
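A sketch of the distinction, assuming an ASCII platform: the controls
are 0x00-0x1F plus DEL (0x7F), so a test for bytes below 0x20 alone
silently misses DEL:

    #include <stdio.h>

    #define MY_isCNTRL_A(c) ((unsigned char)(c) < 0x20 || (c) == 0x7F)

    int main(void)
    {
        printf("DEL is a control: %d\n", MY_isCNTRL_A(0x7F) != 0); /* 1 */
        printf("'A' is a control: %d\n", MY_isCNTRL_A('A')  != 0); /* 0 */
        return 0;
    }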
|
What is passed is the actual length of the native UTF-8 character.
What this was calculating was the length it would have if it were a
Unicode character, comparing apples to oranges.
|
test.pl will delete any file made by tempfile(), but it won't delete the
*.dir and *.pag files made by dbmopen. Hopefully this is the right way to
delete them, cribbed from lib/AnyDBM_File.t.
|
Filehandles need to be closed for unlink to work on Windows.
|
The link function call in the have_compiler and have_cplusplus tests creates
a compilet.def file on Windows which is correctly recorded for cleaning up
when the EU::CB object is destroyed, but if another one gets made in the
meantime then ExtUtils::Mksymlists::Mksymlists moves the first one to
compilet_def.old, which isn't recorded for cleaning up and gets left
behind when the test script has finished. Using a new object each time,
destroying the previous one first, prevents this.
|
(This should have been part of the previous commit, but I forgot to --amend
it)
|
A missing break; in Perl_gv_fetchpvn_flags() meant that the variable ${^MPEN}
had been behaving as a synonym of ${^MATCH}. Adding the break makes ${^MPEN}
behave like all other unused multi-character control character variables.
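A toy illustration of this bug class (not the actual
Perl_gv_fetchpvn_flags() code): with the marked break removed, the
name "^MPEN" would fall through into the next case's test and be
treated as a different variable entirely:

    #include <stdio.h>
    #include <string.h>

    static const char *lookup(const char *name)
    {
        switch (name[1]) {
        case 'M':
            if (strcmp(name + 2, "ATCH") == 0)
                return "${^MATCH}";
            break;  /* the kind of break that was missing */
        case 'O':
            if (strcmp(name + 2, "PEN") == 0)
                return "${^OPEN}";
            break;
        }
        return "unused";
    }

    int main(void)
    {
        printf("%s\n", lookup("^MATCH")); /* ${^MATCH} */
        printf("%s\n", lookup("^MPEN"));  /* unused -- not ${^OPEN} */
        return 0;
    }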