| Commit message | Author | Age | Files | Lines |
| |
These macros are supposed to accommodate inputs larger than a byte.
Therefore, under EBCDIC, we have to use a different macro that handles
the larger values. On ASCII platforms, the called macros are no-ops,
so it doesn't matter there.
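As an illustration, using real Perl macro names (which call sites this
commit switched is not shown here, and uv is an assumed variable):

    /* NATIVE_TO_LATIN1() is only defined for inputs that fit in a
     * byte; under EBCDIC it is a 256-entry table lookup.  A value
     * that may exceed 255 must use the wider macro instead: */
    UV cp = NATIVE_TO_UNI(uv);    /* correct for any code point */

    /* On ASCII platforms both macros are no-ops, so a mixup is
     * harmless there, but silently wrong under EBCDIC. */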
|
| |
This adds to handy.h isALPHA_FOLD_EQ(c1,c2), which efficiently tests
if c1 and c2 are the same character, case-insensitively. For example,
isALPHA_FOLD_EQ(c, 's') returns true if and only if <c> is 's' or 'S'.
isALPHA_FOLD_NE() is also added by this commit.
At least one of c1 and c2 must be known to be in [A-Za-z], or this
macro doesn't work properly. (There is an assert for this in the
macro in DEBUGGING builds.) That is why the name includes "ALPHA", so
you won't forget when using it.
This functionality has been in regcomp.c for a while, under a
different name. I had thought that the only reason to make it more
generally available was a potential speed gain, but recent gcc
versions optimize both forms to the same code, so that is no longer a
consideration.
But I now think that using this macro makes things easier to read (and
certainly shorter to type). Once you grok what it does, it simplifies
what you have to keep in your mind when reading logical expressions
with multiple operands. That something can be either upper or lower
case can be a distraction from the larger point of the expression.
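A sketch of the standard trick such a macro can use (whether handy.h
uses exactly this form is an assumption):

    /* 'A' ^ 'a' is the single bit that differs between the cases
     * (0x20 on ASCII, 0x40 on EBCDIC), so masking it off folds the
     * case.  Only valid when at least one operand is known to be
     * in [A-Za-z]. */
    #define MY_ALPHA_FOLD_EQ(c1, c2) \
        ((((c1) ^ (c2)) & ~('A' ^ 'a')) == 0)

    /* MY_ALPHA_FOLD_EQ(c, 's') is true for 's' and 'S' only. */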
|
| |
It is not very user friendly to list functions as
"Functions found in file FOO". It is better to group them by purpose,
as many already were. I went through and placed the ones that weren't
already grouped into groups. Patches welcome if you have a better
classification.
I changed the headings of some so that the important distinction was
the first word, so that they are placed in the file more
appropriately. And for a couple of groups that I had created myself,
I came up with names that I think are better than the originals.
|
| |
This commit allows one to enable locale-awareness for only a specified
subset of the locale categories. Thus you could make a section of code
LC_MESSAGES-aware, with no locale-awareness for the other categories.
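The C-level mechanism underneath is per-category setlocale(); a
minimal standalone sketch (LC_MESSAGES is POSIX, not ISO C, and this
is an illustration, not the pragma's implementation):

    #include <locale.h>
    #include <stdio.h>

    int main(void)
    {
        /* Only LC_MESSAGES follows the user's environment; every
         * other category stays in the default "C" locale. */
        setlocale(LC_ALL, "C");
        setlocale(LC_MESSAGES, "");
        printf("LC_NUMERIC is still: %s\n",
               setlocale(LC_NUMERIC, NULL));
        return 0;
    }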
|
| |
The definition was incorrect. When going from a control to its
printable name, we need to go from Latin1 -> Native, so that, e.g., a
65 gets turned into the native 'A'.
|
| |
It's actually very rare for code to be presented with malformed UTF-8,
so give the compiler a hint about the likely branches.
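The kind of hint meant, as a standalone sketch (Perl's real LIKELY and
UNLIKELY macros live in perl.h; this version assumes a GCC-style
compiler and invented helper names):

    #if defined(__GNUC__)
    #  define MY_UNLIKELY(x)  __builtin_expect(!!(x), 0)
    #else
    #  define MY_UNLIKELY(x)  (x)
    #endif

    static int decode_first_byte(const unsigned char *s, int valid)
    {
        if (MY_UNLIKELY(!valid)) {
            return -1;        /* rare: malformed UTF-8 path */
        }
        return s[0];          /* common: well-formed path */
    }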
|
| |
This function is now more efficiently implemented as a synonym for
isUTF8_CHAR(). We retain the Perl_is_utf8_char_buf() function for code
that calls it that way.
|
| |
This allows for an efficient isUTF8_CHAR macro, which does its own
length checking and uses the UTF8_INVARIANT macro for the first byte.
On EBCDIC systems this macro, which does a table lookup, is quite a
bit more efficient than all the branches that would otherwise have to
be done.
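The shape of the first-byte test, sketched with hypothetical names:

    /* ASCII platforms: the invariants are exactly 0x00-0x7F, so a
     * single comparison suffices. */
    #define MY_UTF8_IS_INVARIANT_ASCII(b)  ((unsigned char)(b) < 0x80)

    /* EBCDIC platforms: the invariant bytes are scattered around
     * the code page, so a 256-entry lookup table replaces what
     * would otherwise be a chain of branches.  (Table contents
     * depend on the code page and are omitted here.) */
    extern const unsigned char my_invariant_map[256];
    #define MY_UTF8_IS_INVARIANT_EBCDIC(b) \
        (my_invariant_map[(unsigned char)(b)])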
|
| |
This macro will inline the code to determine if a character is
well-formed UTF-8 for code points below a certain value, falling back
to a slower function for larger ones. On ASCII platforms, it will
inline for code points well beyond all legal Unicode ones. On EBCDIC,
it currently does so for code points up to 0x3FFF. This could be
increased, but our porting tests do the regen every time to make sure
everything is ok, and making it larger slows that down. This is
worked around on ASCII by normally commenting out the code that
generates this info, but including in utf8.h a version that did get
generated. This is static information and won't change. (This could
be done for EBCDIC too, but I chose not to at this time, as each code
page has a different macro generated, and it gets ugly getting all of
them into utf8.h.)
Using this macro allowed for the simplification of several functions
in utf8.c.
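The overall shape, as a hedged sketch with invented names (the real
macro is generated by regen and lives in utf8.h/regcharclass.h):

    /* Inline the cases the generated tables cover; anything bigger
     * falls back to the general, out-of-line check. */
    extern int my_is_utf8_char_slow(const unsigned char *s,
                                    const unsigned char *e);

    #define MY_IS_UTF8_CHAR(s, e)                                   \
        ((s)[0] < 0x80                                              \
            ? ((e) - (s)) >= 1   /* invariant: trivially valid */   \
            : my_is_utf8_char_slow((s), (e)))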
|
| |
This places it in a spot better situated for later commits.
|
| |
This causes the generated regcharclass.h to be valid on all
supported platforms
|
| |
This mostly indents and outdents based on blocks added or removed by
the previous commit. But there are a few comment changes, vertical
alignment of macro backslash-continuation characters, and other
white-space changes.
|
| |
The UTF8 in the name is kind of misleading, and would be more misleading
after future commits make UTF8 locales special.
|
| |
The documentation says that Perl taints certain operations when subject
to locale rules, such as lc() and ucfirst(). Prior to this commit
there were exceptions when the operand to these functions contained no
characters whose case change actually varied depending on the locale,
for example the empty string or above-Latin1 code points. Changing to
conform to the documentation simplifies the core code, and yields more
consistent results.
|
| |
This bottom-level function decodes the first character of a UTF-8
string into a code point. Using it directly is discouraged. This
commit cleans up some of the warnings it can raise. Now, tests for
malformations are done before any tests for other potential issues.
One of those issues involves code points so large that they have never
appeared in any official standard (the current standard has scaled
back the highest acceptable code point from earlier versions). It is
possible (though not done in CPAN) to warn about and/or forbid these
code points, while accepting smaller code points that are still above
the legal Unicode maximum. The warning message for this now includes
the code point if it is representable on the machine. Previously it
always displayed raw bytes, which is what it still does for
non-representable code points.
|
| |
Future commits will want this available outside utf8.h
|
| |
This change should catch some wrong calls to these macros. The meat of
the macros is extracted out into two internal-only macros, and the other
macros are rearranged to call these.
|
| |
I believe this makes the macro easier to read
|
| |
Previously it was based on HAS_QUAD, which is not (as) correct.
|
| |
This removes a macro not yet even in a development release, and splits
its calls into two classes: those where the input is a byte, and those
where it can be any unsigned integer. The byte implementation avoids
a function call on EBCDIC platforms.
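A sketch of the resulting split, with invented names (on ASCII
platforms both collapse to the identity):

    /* Byte-sized input: a 256-entry table keeps this a plain array
     * lookup on EBCDIC, with no function call. */
    extern const unsigned char my_native_to_latin1[256];
    #define MY_NATIVE_TO_LATIN1(b) \
        (my_native_to_latin1[(unsigned char)(b)])

    /* Full-range input: only values below 256 differ between the
     * native and Unicode orderings, so the wide form dispatches. */
    #define MY_NATIVE_TO_UNI(uv) \
        ((uv) < 256 ? (unsigned long)MY_NATIVE_TO_LATIN1(uv) : (uv))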
|
| |
I will not otherwise mention that stealing .c code from the core is a
dangerous practice.
This is actually a bug in the module, which had been masked until now.
The first two parameters to utf8_to_uvchr_buf() are both U8*. But both
's' and PL_bufend are char*. The 's' has a cast to U8* in the failing
line, but not PL_bufend.
Interestingly, the line in the official toke.c (introduced in 4b88fb76)
has always been right, so the stealer didn't copy it correctly.
What de69f3af3 did was turn this former function call into a macro that
manipulates the parameters and calls another function, thereby removing
a layer of function call overhead. The manipulation involves
subtracting 's' from PL_bufend, and this fails to compile due to the
missing cast on the latter parameter.
The problem goes away if the macro casts both parameters to U8*, and
that is what this commit does.
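The shape of the fix, sketched (utf8_to_uvchr_buf is the real API; the
surrounding declarations are assumed, not copied from the module):

    STRLEN len;
    /* Before: the macro's pointer arithmetic mixes char* and U8*,
     * which fails to compile:
     *     uv = utf8_to_uvchr_buf((U8*)s, PL_bufend, &len);
     * After: cast both pointers, as the official toke.c does: */
    UV uv = utf8_to_uvchr_buf((U8*)s, (U8*)PL_bufend, &len);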
|
| |
Vertically align the definitions of a few #defines
|
| |
These will be used in a future commit
|
| |
The parentheses were misplaced, so it wasn't looking at the second byte
of the input string properly.
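The commit doesn't show the macro, but the class of bug looks like
this hypothetical sketch:

    /* Broken: a misplaced closing parenthesis compares the result
     * of the && to 0x80, so s[1] itself is never tested: */
    #define IS_SEQ_BAD(s)  ((s)[0] == 0xDF && (s)[1]) == 0x80

    /* Fixed: the comparison of the second byte is inside: */
    #define IS_SEQ_OK(s)   ((s)[0] == 0xDF && (s)[1] == 0x80)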
|
| |
Now that the Unicode data is stored in native character set order, it is
rare to need to work with the Unicode order. Traditionally, the real
work was done in functions that worked with the Unicode order, and
wrapper functions (or macros) were used to translate to/from native.
There are two groups of functions: one that translates from code point
to UTF-8, and the other group goes the opposite direction.
This commit changes the base function that translates from UTF-8 to code
point to output native instead of Unicode. Those extremely rare
instances where Unicode output is needed instead will have to hand-wrap
calls to this function with a translation macro, as now described in the
API pod. Prior to this it was the other way around: the native was
wrapped, and the rare, strict Unicode wasn't. This eliminates a layer
of function call overhead for a common case.
The base function that translates from code point to UTF-8 retains its
Unicode input, as that is more natural to process. However, it is
de-emphasized in the pod, with the functionality description moved to
the pod for a native input wrapper function. And, those wrappers are
now macros in all cases; previously there was function call overhead
sometimes. (Equivalent exported functions are retained, however, for XS
code that uses the Perl_foo() form.)
I had hoped to rebase this commit, squashing it with an earlier commit
in this series, eliminating the use of a temporary function name change,
but the work involved turns out to be large, with no real payoff.
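For the rare strict-Unicode caller, the hand-wrapping now looks like
this (both names are real Perl API; the variables are assumed):

    STRLEN len;
    /* The base function returns the native code point... */
    UV native_cp = utf8_to_uvchr_buf(s, send, &len);
    /* ...so a caller wanting Unicode ordering wraps it: */
    UV unicode_cp = NATIVE_TO_UNI(native_cp);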
|
| |
The previous definition broke good encapsulation rules. UTF_START_MARK
should return something that fits in a byte; it shouldn't be up to the
caller to mask it down to one. So the mask is moved into the
definition. This means it need apply only to the portion that creates
something larger than a byte. Further, the EBCDIC version can be
simplified, since 7 is the largest possible number of bytes in an
EBCDIC UTF8 character.
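A sketch of the post-fix shape, with a hypothetical name (the real
UTF_START_MARK differs between platforms):

    /* The start byte of a len-byte sequence has its top len bits
     * set: len 2 => 0xC0, len 3 => 0xE0, and so on.  With the mask
     * inside the macro, the result always fits in a byte. */
    #define MY_UTF_START_MARK(len)  ((unsigned char)(0xFF00 >> (len)))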
|
| |
These two files were only being #included for non-EBCDIC compiles;
they should always be included.
|
| |
These macros were previously defined in terms of UTF8_TWO_BYTE_HI and
UTF8_TWO_BYTE_LO. But the EIGHT_BIT versions can use the less general
and simpler NATIVE_TO_LATIN1 instead of NATIVE_TO_UNI, because the
input domain is restricted in the EIGHT_BIT case. Note that on ASCII
platforms these both expand to the same thing, so the difference
matters only on EBCDIC.
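The underlying arithmetic, sketched with hypothetical names for a code
point c in the 0x80-0xFF range (on EBCDIC the input would first go
through NATIVE_TO_LATIN1, and the output bytes through the
I8-to-native mapping):

    /* Start byte: 0xC0 plus the top two bits of c. */
    #define MY_EIGHT_BIT_HI(c)  ((unsigned char)(0xC0 | ((c) >> 6)))
    /* Continuation byte: 0x80 plus the bottom six bits. */
    #define MY_EIGHT_BIT_LO(c)  ((unsigned char)(0x80 | ((c) & 0x3F)))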
|
| |
This means use official Unicode code point numbering, not native. Doing
this converts the existing UNISKIP calls in the code to refer to native
code points, which is what they meant anyway. The terminology is
somewhat ambiguous, but I don't think it will cause real confusion.
NATIVE_SKIP is also introduced for situations where it is important to
be precise.
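Usage sketch (both names are from this commit's description; uv is an
assumed native code point):

    STRLEN n = NATIVE_SKIP(uv);  /* precise: native numbering */
    STRLEN m = UNISKIP(uv);      /* existing calls, now understood
                                  * to mean the same thing */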
|
| |
This is in preparation for deprecating these functions, to force any
code that has been using these functions to change.
Since the Unicode tables are now stored in native order, these
functions should only rarely be needed.
However, the functionality of these is needed, and in actuality, on
ASCII platforms, the native functions are #defined to these. So what
this commit does is rename the functions to something else, and create
wrappers with the old names, so that anyone using them will get the
deprecation when it actually goes into effect: we are waiting for CPAN
files distributed with the core to change before doing the deprecation.
According to cpan.grep.me, this should affect fewer than 10 additional
CPAN distributions.
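The renaming pattern described, as a hedged sketch with invented names
(GCC's attribute stands in for whatever mechanism the core actually
uses):

    typedef unsigned long UV;   /* stand-in for Perl's UV */

    /* The real work moves to a new name... */
    UV my_new_name(UV uv) { /* ...real body... */ return uv; }

    /* ...and the old name becomes a wrapper that can later carry
     * the deprecation without touching its callers. */
    #if defined(__GNUC__)
    __attribute__((deprecated))
    #endif
    UV my_old_name(UV uv) { return my_new_name(uv); }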
|
| |
Code should almost never be dealing with non-native code points.
This is in preparation for later deprecation, once our CPAN modules
have been converted away from using it.
|
| |
This is in preparation for the current wrappee becoming deprecated.
|
| |
These macros are no longer called in the Perl core. This commit turns
them into functions so that they can use gcc's deprecation facility.
I believe these were defective right from the beginning, and I have
struggled to understand what's going on. From the name, it appears
NATIVE_TO_NEED takes a native byte and turns it into UTF-8 if the
appropriate parameter indicates that. But that is impossible to do
correctly from that API, as for variant characters it needs to return
two bytes. It could only work correctly if ch is an I8 byte, which
isn't native, and hence the name would be wrong.
Similar arguments apply to ASCII_TO_NEED.
The function S_append_utf8_from_native_byte(const U8 byte, U8** dest)
does what I think NATIVE_TO_NEED intended.
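A sketch of what that function plausibly does (the name and signature
come from this message; the body is an assumption, written for an
ASCII platform, with U8 being Perl's unsigned char type):

    static void
    S_append_utf8_from_native_byte(const U8 byte, U8 **dest)
    {
        if (byte < 0x80) {               /* UTF-8 invariant */
            *(*dest)++ = byte;
        }
        else {                           /* variant: two bytes */
            *(*dest)++ = (U8)(0xC0 | (byte >> 6));
            *(*dest)++ = (U8)(0x80 | (byte & 0x3F));
        }
    }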
|
| |
The code here was wrong in assuming that \xFF is not legal in UTF-8
encoded strings. It currently doesn't work due to a bug, but that may
eventually be fixed: [perl #116867]. The comments are also wrong that
all bytes are legal in UTF-EBCDIC.
It turns out that in well-formed UTF-8, the bytes C0 and C1 never appear
(C2, C3, and C4 as well in UTF-EBCDIC), as they would be the start byte
of an illegal overlong sequence.
This creates a #define for an illegal byte using one of the real illegal
ones, and changes the code to use that.
No test is included due to #116867.
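The sort of definition meant, with a hypothetical name (0xC0 and 0xC1
can only start overlong two-byte sequences, so neither ever occurs in
well-formed UTF-8):

    /* A byte guaranteed never to appear in well-formed UTF-8,
     * usable as an in-band sentinel. */
    #define MY_ILLEGAL_UTF8_BYTE  0xC1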
|
| |
The conversion from UTF-8 to code point should generally be to the
native code point. This adds a macro to do that, and converts the
core calls to the existing macro to use the new one instead. The old
macro is retained for possible backwards compatibility, though it
probably should be deprecated.
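A sketch of the shape such a macro can take (the name here is
hypothetical; LATIN1_TO_NATIVE is real Perl API):

    /* Decode a two-byte UTF-8 sequence to its code point, then map
     * that to the native character set: */
    #define MY_TWO_BYTE_UTF8_TO_NATIVE(hi, lo) \
        LATIN1_TO_NATIVE((((hi) & 0x1F) << 6) | ((lo) & 0x3F))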
|
| |
These macros were incorrect for EBCDIC. The 3-step process given in
utfebcdic.h wasn't being followed.
|
| |
This converts several areas of code to use the more clearly named macros
introduced in the previous commit
|
| |
This commit creates macros whose names mean something to me, and which I
don't find confusing. The older names are retained for backwards
compatibility. Future commits will fix bugs I introduced from
misunderstanding the meaning of the older names.
The older names are now #defined in terms of the newer ones, and moved
so that they are only defined once, valid for both ASCII and EBCDIC
platforms.
|
| |
The previous commit added macros to do some case changing. This
commit uses them in the core, where appropriate.
|
| |
This also reorders one #define to be closer to a related one.
|
| |
These macros should not be used, as they are prone to misuse. There
are no occurrences of them in CPAN. The single use of either of them
in core has recently been removed (commit
8d40577bdbdfa85ed3293f84bf26a313b1b92f55), because it was a misuse.
Instead, code should use isIDFIRST_lazy_if or isWORDCHAR_lazy_if
(isALNUM_lazy_if is also available, but can be confused with the Posix
alnum, which it doesn't mean).
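Usage shape of the recommended macros (real handy.h names; the helpers
and variables here are assumed):

    /* p points into the string; UTF says whether it is UTF-8.
     * The _lazy_if forms only look past the first byte if needed. */
    if (isIDFIRST_lazy_if(p, UTF))
        start_identifier(p);
    else if (isWORDCHAR_lazy_if(p, UTF))
        continue_word(p);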
|
| |
This is a synonym for the existing isALNUM_lazy_if(), which can be
confused with meaning the Posix alnum instead of the Perl \w.
|
| |
If a char* was passed prior to this commit, an above-ASCII char could
have been considered negative instead of positive, and thus screwed up
these tests.
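The classic pitfall being fixed, sketched:

    char c = (char)0xE9;        /* 'e'-acute in Latin-1 */
    /* With plain char (signed on most platforms), c is -23 here,
     * so a range test goes the wrong way: */
    int wrong = (c >= 0x80);                /* 0: looks like ASCII */
    /* Casting to an unsigned type first restores the intent: */
    int right = ((unsigned char)c >= 0x80); /* 1 */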
|