Commit messages

This complicated macro boils down to just one bit.
This new function behaves like utf8n_to_uvchr(), but takes an extra
parameter that points to a U32 which will be set to 0 if no errors are
found; otherwise each error found will set a bit in it. This can be
used by the caller to figure out precisely what the error(s) is/are.
Previously, one would have to capture and parse the warning/error
messages raised. The new information allows, for example, customizing
the messages to the expected end-user's knowledge level.
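The out-parameter pattern described above can be sketched in a self-contained way (this is a hand-rolled toy, not Perl's actual implementation; the `decode2()` name and `ERR_*` bits are hypothetical stand-ins for the real flag names):

```c
#include <stddef.h>

/* Hypothetical error bits, modeled on the idea described above */
#define ERR_SHORT     0x01u  /* sequence is truncated */
#define ERR_NONCONT   0x02u  /* expected continuation byte is missing */
#define ERR_OVERLONG  0x04u  /* non-shortest-form encoding */

/* Decode one UTF-8 character of at most two bytes.  On return, *errors is
 * 0 if the sequence was well formed; otherwise each problem found has set
 * a bit in it, and U+FFFD is returned for unrecoverable cases. */
static unsigned decode2(const unsigned char *s, size_t len, unsigned *errors)
{
    *errors = 0;
    if (len && s[0] < 0x80)
        return s[0];                              /* ASCII */
    if (len && (s[0] & 0xE0) == 0xC0) {           /* two-byte start byte */
        if (len < 2) { *errors |= ERR_SHORT; return 0xFFFD; }
        if ((s[1] & 0xC0) != 0x80) { *errors |= ERR_NONCONT; return 0xFFFD; }
        if (s[0] < 0xC2)
            *errors |= ERR_OVERLONG;              /* C0/C1 starts are overlong */
        return (unsigned)((s[0] & 0x1F) << 6) | (s[1] & 0x3F);
    }
    *errors |= ERR_NONCONT;                       /* not handled in this sketch */
    return 0xFFFD;
}
```

A caller can then test individual bits and phrase its own diagnostics, instead of parsing warning text.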
Some UTF-8 sequences can have multiple malformations. For example, a
sequence can be the start of an overlong representation of a code point,
and still be incomplete. Until this commit what was generally done was
to stop looking when the first malformation was found. This was not
correct behavior, as that malformation may be allowed, while another
unallowed one went unnoticed. (But this did not actually create
security holes, as those allowed malformations replaced the input with a
REPLACEMENT CHARACTER.) This commit refactors the error handling of
this function to set a flag and keep going if a malformation is found
that doesn't preclude others. Then each is handled in a loop at the
end, warning if warranted. The result is that there is a warning for
each malformation for which warnings should be generated, and an error
return is made if any one is disallowed.
Overflow doesn't happen except for very high code points, well above the
Unicode range, and above fitting in 31 bits. Hence the latter 2
potential malformations are subsets of overflow, so only one warning is
output--the most dire.
This will speed up the normal case slightly, as the test for overflow is
pulled out of the loop; the UV is simply allowed to overflow during the
loop, and a single test afterwards determines whether it did.
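The keep-going refactor can be illustrated with a toy classifier (not the real utf8.c code; the `MALF_*` names are invented): each malformation ORs a bit into a flag word instead of causing an early return, and the bits are handled in a loop at the end.

```c
#include <stddef.h>

#define MALF_SHORT     0x1u   /* sequence is truncated */
#define MALF_NONCONT   0x2u   /* continuation byte missing */
#define MALF_OVERLONG  0x4u   /* non-shortest-form start */

/* Record every malformation seen in a two-byte-start sequence, rather
 * than stopping at the first one found. */
static unsigned classify2(const unsigned char *s, size_t len)
{
    unsigned malf = 0;
    if (len && (s[0] & 0xE0) == 0xC0) {
        if (s[0] < 0xC2)
            malf |= MALF_OVERLONG;    /* overlong, but keep looking */
        if (len < 2)
            malf |= MALF_SHORT;       /* the same sequence can *also* be short */
        else if ((s[1] & 0xC0) != 0x80)
            malf |= MALF_NONCONT;
    }
    return malf;
}

/* Handle each recorded malformation in a loop at the end (here, count
 * how many would warrant a warning). */
static int count_malformations(unsigned malf)
{
    int n = 0;
    for (unsigned bit = MALF_SHORT; bit <= MALF_OVERLONG; bit <<= 1)
        if (malf & bit)
            n++;
    return n;
}
```

The lone byte 0xC0, for example, is both an overlong start and incomplete, so both bits come back set.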
These #defines give flag bits in a U32. This commit opens a gap that
will be filled in a future commit. A test file has to change to
correspond, as it duplicates the defines.
These functions are all extensions of the is_utf8_string_foo()
functions that restrict the UTF-8 recognized as valid in various ways.
There are named ones for the two definitions that Unicode makes, and
foo_flags ones for more custom restrictions.
The named ones are implemented as tries, while the flags ones provide
complete generality.
Instead of having a comment in one header pointing to the #define in the
other, remove the indirection and just have the #define itself where it
is needed.
This also does a white space change to inline.h
This is like the previous 2 commits, but the macro takes a flags
parameter so any combination of the disallowed flags may be used. The
others, along with the original isUTF8_CHAR(), cover the most commonly
desired strictures and use an implementation of a (hopefully inlined)
trie for speed. This one exists for generality, and the major portion
of its implementation isn't inlined.
This macro follows Unicode Corrigendum #9 to allow non-character code
points. These are still discouraged but not completely forbidden.
It's best for code that isn't intended to operate on arbitrary text to
use the original definition, but code that does, such as source code
control tools, should change to use this definition if it wants to be
Unicode-strict.
Perl can't adopt C9 wholesale, as it might create security holes in
existing applications that rely on Perl keeping non-chars out.
This changes the name of this helper function and adds a parameter and
functionality to allow it to exclude problematic classes of code
points, the same ones excludable by utf8n_to_uvchr(), like surrogates
or non-character code points.
This changes the macro isUTF8_CHAR to have the same number of code
points built-in for EBCDIC as ASCII. This obsoletes the
IS_UTF8_CHAR_FAST macro, which is removed.
Previously, the code generated by regen/regcharclass.pl for ASCII
platforms was hand copied into utf8.h, and LIKELY's manually added, then
the generating code was commented out. Now this has been done with
EBCDIC platforms as well. This makes regenerating regcharclass.h
faster.
The copied macro in utf8.h is moved by this commit to within the main
code section for non-EBCDIC compiles, cutting the number of #ifdef's
down, and the comments about it are changed somewhat.
These are convenience macros.
These may be useful to various module writers. They certainly are
useful for Encode. This makes public API macros to determine if the
input UTF-8 represents (one macro for each category)
a) a surrogate code point
b) a non-character code point
c) a code point that is above Unicode's legal maximum.
The macros are machine generated. In making them public, I am now using
the string end location parameter to guard against running off the end
of the input. Previously this parameter was ignored, as their use in
the core could be tightly controlled so that we already knew that the
string was long enough when calling these macros. But this can't be
guaranteed in the public API. An optimizing compiler should be able to
remove redundant length checks.
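The end-pointer guard can be shown with a self-contained check (a sketch only; Perl's actual macros are machine generated): UTF-8-encoded surrogates all begin 0xED 0xA0..0xBF, and putting the length test first guarantees no byte at or past e is ever read.

```c
/* Does the buffer starting at s (ending before e) begin with a
 * UTF-8-encoded surrogate (U+D800..U+DFFF)?  The length check comes
 * first, so a short buffer is never overread. */
static int is_surrogate_utf8(const unsigned char *s, const unsigned char *e)
{
    return e - s >= 3
        && s[0] == 0xED
        && (s[1] & 0xE0) == 0xA0   /* second byte in 0xA0..0xBF */
        && (s[2] & 0xC0) == 0x80;  /* ordinary continuation byte */
}
```

When the caller already knows the buffer is long enough, a good compiler can fold the length test away, which is the "redundant length checks" point above.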
The macro isUTF8_CHAR calls a helper function for code points higher
than it can handle. That function had been an inlined wrapper around
utf8n_to_uvchr().
The function has been rewritten to not call utf8n_to_uvchr(), so it is
now too big to be effectively inlined. Instead, it implements a faster
method of checking the validity of the UTF-8 without having to decode
it. It just checks for valid syntax and now knows where the
few discontinuities are in UTF-8 where overlongs can occur, and uses a
string compare to verify that overflow won't occur.
As a result this is now a pure function.
This also causes a previously generated deprecation warning to no
longer be raised, because printing UTF-8 no longer requires converting
it to internal form. I could add a check for that, but I think it's
best not to: if you manipulated what is getting printed in any way, the
deprecation message will already have been raised.
This commit also fleshes out the documentation of isUTF8_CHAR.
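The string-compare overflow idea can be sketched as follows (a hypothetical helper; the extended-UTF-8 forms Perl actually checks are longer and platform dependent). For sequences of equal length, byte-wise order matches code point order, so one memcmp() against the encoding of the largest value that fits decides overflow without decoding anything:

```c
#include <string.h>

/* 0xFFFFFFFF (the largest 32-bit value) encoded as a 7-byte extended
 * UTF-8 sequence: FE 83 BF BF BF BF BF. */
static const unsigned char max32_utf8[7] =
    { 0xFE, 0x83, 0xBF, 0xBF, 0xBF, 0xBF, 0xBF };

/* For a 7-byte sequence, lexicographic byte order equals numeric order,
 * so the sequence overflows 32 bits iff it compares greater. */
static int overflows_32_bits(const unsigned char *s)
{
    return memcmp(s, max32_utf8, sizeof max32_utf8) > 0;
}
```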
This will allow the next commit to not have to actually try to decode
the UTF-8 string in order to see if it overflows the platform.
This macro gives the legal UTF-8 byte sequences. Almost always, the
input will be legal, so help compiler branch prediction for that.
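The kind of hint meant here looks like this (a generic sketch; Perl's real LIKELY() lives in perl.h and the real macro bodies differ):

```c
#include <stddef.h>

/* GCC/Clang branch-prediction hint; a no-op on other compilers */
#if defined(__GNUC__)
#  define LIKELY(x)  __builtin_expect(!!(x), 1)
#else
#  define LIKELY(x)  (x)
#endif

/* Nearly all real-world input is well formed, so predict the common
 * path so the compiler lays it out as the fall-through case. */
static size_t count_ascii(const unsigned char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (LIKELY(s[i] < 0x80))
            n++;
    return n;
}
```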
This is clearer as to its meaning than the existing 'is_ascii_string'
and 'is_invariant_string', which are retained for back compat. The
thread context variable is removed as it is not used.
The result of I8_TO_NATIVE_UTF8 has to be cast to U8 for the
MSVC-specific PERL_SMALL_MACRO_BUFFER option, just as it is for newer
compilers that don't have a small CPP buffer. Commit 1a3756de64/#127426
added U8 casts to NATIVE_TO_LATIN1/LATIN1_TO_NATIVE but missed
NATIVE_UTF8_TO_I8/I8_TO_NATIVE_UTF8. This commit fixes that.
One example of the C4244 warning: VC6 thinks 0xFF & (0xFE << 6) in
UTF_START_MARK could be bigger than 0xff (a char). This fixes
..\inline.h(247) : warning C4244: '=' : conversion from 'long ' to
'unsigned char ', possible loss of data
Also fixes
..\utf8.c(146) : warning C4244: '=' : conversion from 'UV' to 'U8',
possible loss of data
and a lot more warnings in utf8.c
The UTF8_IS_foo() macros have an inconsistent API. In some, the
parameter is a pointer, and in others it is a byte. In the former case,
a call of the wrong type will not compile, as it will try to dereference
a non-ptr. This commit makes the other ones not compile when called
wrongly, by using the technique shown by Lukas Mai (in
9c903d5937fa3682f21b2aece7f6011b6fcb2750) of ORing the argument with a
constant 0, which should get optimized out.
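The trick works because '|' is defined only for integer operands, so a pointer argument becomes a compile-time error while the "| 0" itself costs nothing. A stand-alone illustration (the macro name is hypothetical, not one of Perl's):

```c
/* Byte-classifying macro: the "| 0" forces the argument to be of
 * integer type.  Calling IS_ASCII_BYTE(some_pointer) fails to compile,
 * because a pointer cannot be an operand of '|'; the OR with a constant
 * 0 is optimized out. */
#define IS_ASCII_BYTE(c)  (((c) | 0) < 0x80)
```

With a bare `(c) < 0x80` body, a pointer argument might slip through with at most a diagnostic and compare the address instead of a byte.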
This avoids an internal compiler error on VC 2003 and earlier
This makes sure in DEBUGGING builds that the macro is called correctly.
for future use
This has been wrong, and won't compile, should anyone have tried, since
635e76f560b3b3ca075aa2cb5d6d661601968e04 earlier in 5.23.
UTF8_QUAD_MAX is no longer used in the core, is not used on CPAN, and
its name is highly misleading. It is defined to be 2**36, which has
really nothing to do with what its name indicates.
This defines 2 macros that can be used individually to check for
non-characters when the input range is known to be restricted to not
include every possible one. This is for future commits.
By masking, this macro can be written so it only has one branch.
Presumably an optimizing compiler would do the same, but not necessarily
so.
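For instance (an illustrative macro, not necessarily the one this commit touched): the noncharacters whose code points end in FFFE or FFFF differ only in the low bit, so masking that bit off lets a single compare replace a pair of tests per plane:

```c
/* True for U+FFFE/U+FFFF and the corresponding pair in every other
 * plane (U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF).  The caller is
 * assumed to have already ruled out values above the Unicode range. */
#define IS_END_PLANE_NONCHAR(uv)  (((uv) & 0xFFFEu) == 0xFFFEu)
```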
This also renames the macro's formal parameter to uv, to be clearer as
to what is expected as input. There were also cases where the parameter
was referred to inside the macro without being parenthesized, which was
dangerous.
Previous commits have set things up so the macros are the same on both
platforms. By moving them to the common part of utf8.h, they can share
the same definition. Because of the size of this move, the difference
listing instead shows other things being moved, rather than the text
that actually stayed the same.
This changes to use the underlying UTF-8 structure to compute the
numbers in this macro, instead of hand-specifying the resultant ones.
Thus, this macro, with minor tweaks, is the same text on both ASCII and
EBCDIC platforms (though the resultant numbers differ), and the next
commit will change them to use it in common.
The previous commits have made these macros be the exact same text, so
can be combined, and defined just once. This requires moving them to
the portion of the file that is common with both EBCDIC and ASCII.
The commit diff shows instead other code being moved.
This new definition expands to the same thing as before, but now the
unexpanded text is identical to the EBCDIC definition (which expands to
something else), so the next commit can combine the ASCII and EBCDIC
ones into a single definition.
This refactors two macros that have mostly the same guts to have a third
macro to define the common guts.
It also changes to use UV_IS_QUAD instead of a home-grown #define that a
future commit will remove.
This makes 2 related definitions adjacent.
Change to use the same definition for two macros on both types of
platforms, simplifying the code, by using the underlying structure of
the encoding.
And use its mnemonic in other #defines instead of repeating the raw
value.
By making sure the no-op macros cast their output appropriately, we can
eliminate the casts that have been added to the things that call them.
By using a more fundamental value, these two definitions of the macro
can be made the same, so only one is needed, common to both platforms.
The definition had gotten moved away from its comments in utf8.h, and
the wrong thing (UTF8_MAXBYTES) was being guarded by a #error.
And it is possible to generalize to get the compiler to do the
calculation, and to consolidate the definitions from the two files into
a single one.
The ABOVE_31_BIT flags are a proper subset of the SUPER flags, so if the
latter is set, we don't have to bother setting the former. On the other
hand, there is no harm in doing so, these changes are all resolved at
compile time. The reason I'm changing them is that it is easier to
explain in the pod what is happening, in the next commit.
These names have long caused me consternation, as they are named after
the internal ASCII-platform UTF-8 representation, which is not the same
for EBCDIC platforms, nor do they convey meaning to someone who isn't
currently steeped in the UTF-8 internals. I've added synonyms that are
platform-independent in meaning and make more sense to someone coming at
this cold. The old names are retained for back compat.
This uses for UTF-EBCDIC essentially the same mechanism that Perl
already uses for UTF-8 on ASCII platforms to extend it beyond what might
be its natural maximum. That is, when the UTF-8 start byte is 0xFF, it
adds a bunch more bytes to the character than it otherwise would,
bringing it to a total of 14 for UTF-EBCDIC. This is enough to handle
any code point that fits in a 64 bit word.
The downside of this is that this extension is not compatible with
previous perls for the range 2**30 up through the previous max,
2**31 - 1. A simple program could be written to convert files that were
written out using an older perl so that they can be read with newer
perls, and the perldelta says we will do this should anyone ask.
However, I strongly suspect that the number of such files in existence
is zero, as people in EBCDIC land don't seem to use Unicode much, and
these are very large code points, which are associated with a
portability warning every time they are output in some way.
This extension brings UTF-EBCDIC to parity with UTF-8, so that both can
cover a 64-bit word. It allows some removal of special cases for EBCDIC
in core code and core tests. And it is a necessary step to handle Perl
6's NFG, which I'd like eventually to bring to Perl 5.
This commit causes two implementations of a macro in utf8.h and
utfebcdic.h to become the same, and both are moved to a single one in
the portion of utf8.h common to both.
To illustrate, the I8 for U+3FFFFFFF (2**30-1) is
"\xFE\xBF\xBF\xBF\xBF\xBF\xBF" before and after this commit, but the I8
for the next code point, U+40000000 is now
"\xFF\xA0\xA0\xA0\xA0\xA0\xA0\xA1\xA0\xA0\xA0\xA0\xA0\xA0",
and before this commit it was "\xFF\xA0\xA0\xA0\xA0\xA0\xA0".
The I8 for 2**64-1 (U+FFFFFFFFFFFFFFFF) is
"\xFF\xAF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF", whereas
before this commit it was unrepresentable.
Commit 7c560c3beefbb9946463c9f7b946a13f02f319d8 said in its message that
it was moving something that hadn't been needed on EBCDIC until the
"next commit". That statement turned out to be wrong, overtaken by
events in the commit I prematurely pushed; this now is the commit it
was referring to.
This #define has, until the next commit, not been needed on EBCDIC
platforms; move it in preparation for that commit.