| Commit message | Author | Age | Files | Lines |
|
|
|
|
| |
An earlier commit had split some comments up. This commit adds clarifying
details.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It somehow dawned on me that the code is incorrect for
warning/disallowing very high code points. What is really wanted in the
API is to catch UTF-8 that is not necessarily portable. There are
several classes of this, but I'm referring here to just the code points
that are above the Unicode-defined maximum of 0x10FFFF. These can be
considered non-portable, and there is a mechanism in the API to
warn/disallow these.
However, an earlier standard defined UTF-8 to handle code points up to
2**31-1. Anything above that is using an extension to UTF-8 that has
never been officially recognized. Perl does use such an extension, and
the API is supposed to have a different mechanism to warn/disallow on
this.
Thus there are two classes of warning/disallowing for above-Unicode code
points. One for things that have some non-Unicode official recognition,
and the other for things that have never had official recognition.
UTF-EBCDIC differs somewhat in this, and since Perl 5.24 we have had a
Perl extension that allows it to handle any code point that fits in a
64-bit word. This extension kicks in at code points above 2**30-1, a
different threshold than where extended UTF-8 kicks in on ASCII platforms.
Things are also complicated by the fact that the API has provisions for
accepting the overlong UTF-8 malformation. It is possible to use
extended UTF-8 to represent code points smaller than 31-bit ones.
Until this commit, the extended warning/disallowing was based on the
resultant code point, and only when that code point did not fit into 31
bits.
But what is really wanted is to know whether extended UTF-8 was used to
represent a code point, no matter how large the resultant code point is.
This differs from the previous definition, but only on EBCDIC platforms,
or when the overlong malformation is also present, so it does not affect
very many real-world cases.
This commit fixes that. It turns out that it is actually easier to tell
whether something is using extended UTF-8: one just looks at the first
byte of the sequence.
The trailing part of the warning message that gets raised is slightly
changed to be clearer. It's not significant enough to affect perldiag.
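As a rough illustration of "looks at the first byte": on ASCII platforms,
standard UTF-8 (even the old 6-byte form) never uses 0xFE or 0xFF as a
start byte, so those two start bytes are what mark Perl's extension. A
minimal sketch follows (the function name is hypothetical, and EBCDIC
platforms need a different test):

    #include <stdbool.h>

    /* Hypothetical helper: does this sequence use Perl's extended UTF-8?
     * On ASCII platforms only the extension uses the 0xFE and 0xFF
     * start bytes. */
    static bool
    is_perl_extended_utf8_start(const unsigned char *s)
    {
        return *s >= 0xFE;
    }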
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The next commit will fix the detection of using Perl's extended UTF-8 to
be more accurate. The current names of various flags in the API are
somewhat misleading. What one really wants to know is whether extended
UTF-8 was used, not the value of the resultant code point.
This commit basically does
s/ABOVE_31_BIT/PERL_EXTENDED/g
It also similarly changes the name of a hash key in APItest/t/utf8.t.
This intermediary step makes the next commit easier to read.
|
|
|
|
|
|
| |
These are very specialized #defines to determine if UTF-8 overflows the
word size of the platform. I think it's unwise to make them generally
available.
|
|
|
|
|
|
|
|
|
| |
This is currently undocumented externally, so we can change the API if
needed.
This is like utf8_from_bytes(), but in the case of not being able to
convert the whole string, it converts the initial substring that is
convertible, and tells you where it had to stop.
|
|
|
|
|
|
|
|
|
|
| |
We changed to use symbols that are unlikely to conflict with those used
by non-Perl code, and which have trailing underbars, so they don't look
like regular Perl #defines.
See https://rt.perl.org/Ticket/Display.html?id=131110
There are many more header files which are not guarded.
|
|
|
|
|
|
|
| |
a6951642ede4abe605dcf0e94b74948e0a60a56b added an assertion to find bugs
in calling macros, and so far, instead, it found a bug in a macro. A
parameter needs to be enclosed in parens in case it is an expression, so
that precedence works.
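A generic illustration of the class of bug the assertion caught (not the
actual Perl macro):

    /* Without parentheses, an expression argument binds incorrectly: */
    #define TWICE_BAD(x)  (x * 2)     /* TWICE_BAD(a + 1)  expands to (a + 1 * 2)   */
    #define TWICE_GOOD(x) ((x) * 2)   /* TWICE_GOOD(a + 1) expands to ((a + 1) * 2) */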
|
|
|
|
|
|
|
| |
This is inspired by [perl #131190]. The UTF-8 macros whose parameters
are characters now have assertions that verify they are not being called
with something that won't fit in a char. These assertions should be
getting optimized out if the input type is a char or U8.
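A minimal sketch of the kind of check involved (not Perl's actual
assertion macro): for an argument that is already a char or U8 the
condition is trivially true, so a decent compiler removes it entirely.

    #include <assert.h>

    /* Hypothetical: asserts that the argument's value fits in one byte. */
    #define ASSERT_FITS_IN_BYTE(c) \
        assert(sizeof(c) == 1 || (((unsigned long)(c)) & ~0xFFUL) == 0)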
|
| |
|
|
|
|
| |
Add parens to clarify grouping, white-space for legibility
|
|
|
|
|
|
| |
use bytes;
is unlikely to be the case.
|
|
|
|
| |
Now that there are _safe versions, deprecate the unsafe ones.
|
| |
|
|
|
|
|
|
|
|
| |
When decoding a UTF-8 encoded string, we may have guessed at how long
it is. This adds a flag so that the base level decode routine knows
that the length is a guess; it then minimizes what gets printed, rather
than printing the normal full information, so as to avoid reading past
the end of the string.
|
|
|
|
|
|
| |
These macros are being replaced by a safe version; they now generate a
deprecation message at each call site upon the first use there in each
program run.
|
|
|
|
|
|
|
|
|
| |
perl has never allowed the UTF-8 overflow malformation, for some reason.
But as long as overflows are turned into the REPLACEMENT CHARACTER,
there is no real reason not to allow them. And making them allowable
lets code that wants to carry on in the face of malformed input do so
without risk of contaminating things, as the REPLACEMENT CHARACTER is
the Unicode-prescribed way of handling malformations.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When perl decodes UTF-8 into a code point, it must decide what to do if
the input is malformed in some way. When the flags passed to the decode
function indicate that a given malformation type is not acceptable, the
function returns 0 to indicate failure; on success it returns the decoded
code point (unfortunately that may require disambiguation if the
input is validly a NUL). As perl evolved, what happened when various
allowed malformations were encountered got stricter and stricter. This
is the final malformation that was not turned into a REPLACEMENT
CHARACTER when the malformation was allowed, and this commit changes it
to return that as well. Unlike most other malformations, the code point
value of an overlong is well-defined, and that is why it hadn't been
changed heretofore. But it is safer to use the Unicode-prescribed
behavior on all malformations, which is to replace them with the REPLACEMENT
CHARACTER. Just in case there is code that requires the old behavior,
it is retained, but you have to search the source for the undocumented
flag that enables it.
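A minimal sketch of the disambiguation mentioned above, assuming a perl
core/XS context where utf8n_to_uvchr() and the U8/UV/STRLEN types are
available (a genuine NUL can only come from an actual "\0" input byte):

    const U8 *s   = (const U8 *) "\xC0\xAF";  /* overlong encoding of '/' */
    STRLEN    len = 2;
    STRLEN    retlen;
    UV cp = utf8n_to_uvchr(s, len, &retlen, 0);
    if (cp == 0 && (len == 0 || s[0] != '\0')) {
        /* decode failure: the zero return does not come from a genuine
         * NUL byte in the input */
    }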
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The bottom level Perl routine that decodes UTF-8 into a code point has
long accepted inputs where the length is specified to be 0, returning a
NUL. It considers this a malformation, which is accepted in some
scenarios, but not others. In consultation with Tony Cook, we decided
this really isn't a malformation, but is a bug in the calling program.
Rather than call the decode routine when it has nothing to decode, it
should just not call it.
This commit removes the acceptance of a zero length string from any of
the canned flag combinations passed to the decode function. Callers can
be converted to specify this flag explicitly, if necessary. However, the next
commit will cause this to fail under DEBUGGING builds, as a step towards
removing the capability altogether.
|
|
|
|
| |
This creates a gap that will be filled by future commits
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The original API does not check that we aren't reading beyond the end of
a buffer, apparently assuming that we could keep malformed UTF-8 out by
use of gatekeepers, but that is currently impossible. This commit adds
"safe" macros for determining if a UTF-8 sequence represents
an alphabetic, a digit, etc. Each new macro has an extra parameter
pointing to the end of the sequence, so that looking beyond the input
string can be avoided.
The macros aren't currently completely safe, as they don't test that
there is at least a single valid byte in the input, except by an
assertion in DEBUGGING builds. This is because typically they are
called in code that makes that assumption, and frequently tests the
current byte for one thing or another.
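A sketch of the new calling convention (assuming the names follow the
existing isFOO_utf8 pattern with a _safe suffix, e.g. isALPHA_utf8_safe,
in a perl core/XS context):

    const U8 *s = (const U8 *) "abc";
    const U8 *e = s + 3;              /* one past the last valid byte */
    if (isALPHA_utf8_safe(s, e)) {
        /* the character starting at s is alphabetic; the macro will not
         * read at or beyond e */
    }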
|
|
|
|
|
|
| |
The root cause of this was missing parentheses causing (s[0] + 1) to be
evaluated instead of the desired s[1]. It was causing an error in
lib/warnings.t, but only on EBCDIC platforms.
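A generic illustration of the precedence problem (not the actual Perl
macro):

    #define SECOND_BYTE_BAD(s)  (*(s) + 1)    /* binds as (*(s)) + 1, i.e. s[0] + 1 */
    #define SECOND_BYTE_GOOD(s) (*((s) + 1))  /* s[1] */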
|
|
|
|
|
|
|
|
| |
The original code was generated and then hand-tuned. Therefore
I edited the code in place instead of fixing the regen/regcharclass.pl
generator.
Signed-off-by: Petr Písař <ppisar@redhat.com>
|
|
|
|
| |
This complicated macro boils down to just one bit.
|
|
|
|
|
|
|
|
|
|
| |
This new function behaves like utf8n_to_uvchr(), but takes an extra
parameter that points to a U32 which will be set to 0 if no errors are
found; otherwise each error found will set a bit in it. This can be
used by the caller to figure out precisely what the error(s) is/are.
Previously, one would have to capture and parse the warning/error
messages raised. This can be used, for example, to customize the
messages to the expected end-user's knowledge level.
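A minimal usage sketch, assuming a perl core/XS context and the
UTF8_DISALLOW_SURROGATE / UTF8_GOT_SURROGATE flag names:

    const U8 *s   = (const U8 *) "\xED\xA0\x80";  /* UTF-8 for U+D800, a surrogate */
    STRLEN    len = 3;
    STRLEN    retlen;
    U32       errors;
    UV cp = utf8n_to_uvchr_error(s, len, &retlen,
                                 UTF8_DISALLOW_SURROGATE, &errors);
    if (errors & UTF8_GOT_SURROGATE) {
        /* craft a message suited to the audience instead of parsing the
         * default warning text */
    }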
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Some UTF-8 sequences can have multiple malformations. For example, a
sequence can be the start of an overlong representation of a code point,
and still be incomplete. Until this commit what was generally done was
to stop looking when the first malformation was found. This was not
correct behavior, as that malformation may be allowed, while another
unallowed one went unnoticed. (But this did not actually create
security holes, as those allowed malformations replaced the input with a
REPLACEMENT CHARACTER.) This commit refactors the error handling of
this function to set a flag and keep going if a malformation is found
that doesn't preclude others. Then each is handled in a loop at the
end, warning if warranted. The result is that there is a warning for
each malformation for which warnings should be generated, and an error
return is made if any one is disallowed.
Overflow doesn't happen except for very high code points, well above the
Unicode range and above fitting in 31 bits. Hence an input that overflows
also has those 2 other potential malformations, so only one warning is
output--the most dire.
This will speed up the normal case slightly, as the test for overflow is
pulled out of the loop, allowing the UV to overflow during accumulation;
a single test after the loop then determines whether overflow occurred.
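A generic sketch of the "record everything, then report" structure the
refactor moves to (illustrative names only, not Perl's):

    /* Illustrative flag bits and conditions only. */
    enum { GOT_OVERLONG = 0x01, GOT_SHORT = 0x02 };

    static unsigned
    classify(int looks_overlong, int sequence_truncated)
    {
        unsigned problems = 0;

        /* record every problem rather than returning at the first one */
        if (looks_overlong)     problems |= GOT_OVERLONG;
        if (sequence_truncated) problems |= GOT_SHORT;

        return problems;   /* the caller warns about each recorded bit,
                            * and fails if any recorded bit is disallowed */
    }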
|
|
|
|
|
|
| |
These #defines give flag bits in a U32. This commit opens a gap that
will be filled in a future commit. A test file has to change to
correspond, as it duplicates the defines.
|
|
|
|
|
|
|
|
|
|
| |
These functions are all extensions of the is_utf8_string_foo()
functions, restricting the UTF-8 recognized as valid in various ways.
There are named ones for the two definitions that Unicode makes, and
foo_flags ones for more custom restrictions.
The named ones are implemented as tries, while the flags ones provide
complete generality.
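A short sketch of the distinction, assuming the function names
is_strict_utf8_string(), is_c9strict_utf8_string() and
is_utf8_string_flags() in a perl core/XS context:

    const U8 *s   = (const U8 *) "abc";
    STRLEN    len = 3;

    if (is_strict_utf8_string(s, len))   { /* Unicode-strict definition */ }
    if (is_c9strict_utf8_string(s, len)) { /* Corrigendum #9 definition */ }
    if (is_utf8_string_flags(s, len, UTF8_DISALLOW_SURROGATE)) {
        /* custom restriction selected via flags */
    }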
|
|
|
|
|
|
| |
Instead of having a comment in one header pointing to the #define in the
other, remove the indirection and just have the #define itself where it
is needed.
|
|
|
|
| |
This also does a white space change to inline.h
|
|
|
|
|
|
|
|
|
| |
This is like the previous 2 commits, but the macro takes a flags
parameter so any combination of the disallowed flags may be used. The
others, along with the original isUTF8_CHAR(), are the most commonly
desired strictures, and use an implementation of a (hopefully inlined)
trie for speed. This one is for generality, and the major portion of its
implementation isn't inlined.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This macro follows Unicode Corrigendum #9 to allow non-character code
points. These are still discouraged but not completely forbidden.
It's best for code that isn't intended to operate on arbitrary other
text to use the original definition, but code that does, such as source
code control systems, should change to use this definition if it wants
to be Unicode-strict.
Perl can't adopt C9 wholesale, as it might create security holes in
existing applications that rely on Perl keeping non-chars out.
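For illustration, assuming the macro names isSTRICT_UTF8_CHAR() and
isC9_STRICT_UTF8_CHAR(), each taking a pointer to the start of the
character and one to the end of the buffer:

    const U8 *s = (const U8 *) "x";
    const U8 *e = s + 1;

    if (isSTRICT_UTF8_CHAR(s, e))    { /* Unicode-strict: no surrogates,
                                        * non-chars, or above-Unicode */ }
    if (isC9_STRICT_UTF8_CHAR(s, e)) { /* same, but non-character code
                                        * points are accepted */ }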
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
| |
This changes the name of this helper function and adds a parameter and
functionality to allow it to exclude problematic classes of code
points, the same ones excludable by utf8n_to_uvchr(), like surrogates
or non-character code points.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This changes the macro isUTF8_CHAR to have the same number of code
points built-in for EBCDIC as ASCII. This obsoletes the
IS_UTF8_CHAR_FAST macro, which is removed.
Previously, the code generated by regen/regcharclass.pl for ASCII
platforms was hand-copied into utf8.h, with LIKELYs added manually, then
the generating code was commented out. Now this has been done with
EBCDIC platforms as well. This makes regenerating regcharclass.h
faster.
The copied macro in utf8.h is moved by this commit to within the main
code section for non-EBCDIC compiles, cutting the number of #ifdef's
down, and the comments about it are changed somewhat.
|
| |
|
|
|
|
| |
These are convenience macros.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
These may be useful to various module writers. They certainly are
useful for Encode. This makes public API macros to determine if the
input UTF-8 represents (one macro for each category)
a) a surrogate code point
b) a non-character code point
c) a code point that is above Unicode's legal maximum.
The macros are machine generated. In making them public, I am now using
the string end location parameter to guard against running off the end
of the input. Previously this parameter was ignored, as their use in
the core could be tightly controlled so that we already knew that the
string was long enough when calling these macros. But this can't be
guaranteed in the public API. An optimizing compiler should be able to
remove redundant length checks.
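A short sketch, with hypothetical macro names, of the calling convention
described above (start of the sequence plus end of the buffer):

    const U8 *s = (const U8 *) "\xED\xA0\x80";  /* UTF-8 for U+D800 */
    const U8 *e = s + 3;

    if (UTF8_IS_SURROGATE(s, e)) { /* a) surrogate code point */ }
    if (UTF8_IS_NONCHAR(s, e))   { /* b) non-character code point */ }
    if (UTF8_IS_SUPER(s, e))     { /* c) above Unicode's legal maximum */ }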
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The macro isUTF8_CHAR calls a helper function for code points higher
than it can handle. That function had been an inlined wrapper around
utf8n_to_uvchr().
The function has been rewritten to not call utf8n_to_uvchr(), so it is
now too big to be effectively inlined. Instead, it implements a faster
method of checking the validity of the UTF-8 without having to decode
it. It just checks for valid syntax and now knows where the
few discontinuities are in UTF-8 where overlongs can occur, and uses a
string compare to verify that overflow won't occur.
As a result this is now a pure function.
This also causes a previously generated deprecation warning to no longer
be raised, because when printing UTF-8, it no longer has to be converted
to internal form. I could add a check for that, but I think it's best not
to. If you manipulated what is getting printed in any way, the
deprecation message will already have been raised.
This commit also fleshes out the documentation of isUTF8_CHAR.
|
|
|
|
|
| |
This will allow the next commit to avoid actually trying to decode the
UTF-8 string in order to see whether it overflows the platform's word size.
|
|
|
|
|
| |
This macro matches the legal UTF-8 byte sequences. Almost always, the
input will be legal, so the macro is written to help compiler branch
prediction for that case.
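A generic illustration of the branch-prediction hint (the MY_LIKELY
wrapper below stands in for a __builtin_expect-style macro, as used
elsewhere in the perl source; it is illustrative, not the macro added):

    #define MY_LIKELY(x) __builtin_expect(!!(x), 1)   /* gcc/clang */

    /* Continuation bytes are the common case in well-formed input. */
    #define MY_IS_CONTINUATION_BYTE(b) (MY_LIKELY(((b) & 0xC0) == 0x80))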
|
| |
|
|
|
|
|
|
| |
This is clearer as to its meaning than the existing 'is_ascii_string'
and 'is_invariant_string', which are retained for back compat. The
thread context variable is removed as it is not used.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The result of I8_TO_NATIVE_UTF8 has to be cast to U8 for the MSVC-specific
PERL_SMALL_MACRO_BUFFER option, just like it is for newer CCs that don't
have a small CPP buffer. Commit 1a3756de64/#127426 did add U8 casts to
NATIVE_TO_LATIN1/LATIN1_TO_NATIVE but missed
NATIVE_UTF8_TO_I8/I8_TO_NATIVE_UTF8. This commit fixes that.
One example of the C4244 warning is that VC6 thinks 0xFF & (0xFE << 6) in
UTF_START_MARK could be bigger than 0xff (a char). This fixes
..\inline.h(247) : warning C4244: '=' : conversion from 'long ' to
'unsigned char ', possible loss of data
It also fixes
..\utf8.c(146) : warning C4244: '=' : conversion from 'UV' to 'U8',
possible loss of data
and a lot more warnings in utf8.c
|
| |
|
|
|
|
|
|
|
|
|
|
| |
The UTF8_IS_foo() macros have an inconsistent API. In some, the
parameter is a pointer, and in others it is a byte. In the former case,
a call of the wrong type will not compile, as it will try to dereference
a non-ptr. This commit makes the other ones not compile when called
wrongly, by using the technique shown by Lukas Mai (in
9c903d5937fa3682f21b2aece7f6011b6fcb2750) of ORing the argument with a
constant 0, which should get optimized out.
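A generic illustration of the "| 0" technique (not the actual Perl
macro): ORing with a constant 0 is a no-op for integer arguments and is
optimized away, but fails to compile if a pointer is passed by mistake.

    #define IS_HIGH_BIT_SET(c) ((((c) | 0)) & 0x80)
    /* IS_HIGH_BIT_SET('a')  -> compiles, evaluates normally         */
    /* IS_HIGH_BIT_SET(ptr)  -> compile error: invalid operands to | */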
|
| |
|
|
|
|
| |
This avoids an internal compiler error on VC 2003 and earlier
|
|
|
|
| |
This makes sure in DEBUGGING builds that the macro is called correctly.
|