| Commit message | Author | Age | Files | Lines |
|
|
|
|
|
| |
These will detect an array bounds error that occurs on EBCDIC machines,
and by including the assert on non-EBCDIC, we verify that the code
wouldn't fail when built on EBCDIC.
|
| |
|
|
|
|
| |
The new definition is simpler to read.
|
|
|
|
|
|
| |
This changes the definition of isUTF8_POSSIBLY_PROBLEMATIC() on EBCDIC
platforms to use PL_charclass[] instead of PL_e2a[]. The new array is
more likely to be in the memory cache.
|
|
|
|
|
|
|
|
| |
Prior to this commit, UVCHR_SKIP() had the same definition on both ASCII
and EBCDIC platforms, but it expanded to different things on each. Now
the two are defined separately, each as what it expands to, and the
fully expanded EBCDIC version is changed to use PL_charclass[] instead
of PL_e2a[]. The new array is more likely to be in the memory cache.
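For orientation, here is a minimal sketch of the kind of computation a "skip" macro performs on an ASCII platform: given a code point, return how many bytes its UTF-8 encoding occupies. The function name is hypothetical, the sketch stops at four-byte sequences, and it is not the actual expansion of UVCHR_SKIP().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not the real UVCHR_SKIP(): number of bytes
 * needed to encode code point cp as UTF-8 on an ASCII platform.
 * Assumption: only sequences of up to 4 bytes are handled here. */
static size_t uvchr_skip_sketch(unsigned long cp)
{
    if (cp < 0x80)    return 1;   /* invariant: encodes as itself */
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    assert(cp <= 0x1FFFFF);       /* largest value that fits in 4 bytes */
    return 4;
}
```

An EBCDIC build would need a different set of thresholds, which is one reason for giving each platform its own definition.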
|
|
|
|
|
|
|
|
| |
Prior to this commit, UVCHR_IS_INVARIANT() had the same definition on
both ASCII and EBCDIC platforms, but it expanded to different things on
each. Now the two are defined separately, each as what it expands to,
and the fully expanded EBCDIC version is changed to use PL_charclass[]
instead of PL_e2a[]. The new array is more likely to be in the memory
cache.
|
|
|
|
|
| |
This changes it to be isASCII(), instead of repeating the "special"
number 0x80.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There are three classes of problematic Unicode code points that may
require special handling. Which code points are problematic is fairly
complicated, requiring lots of branches. However, the smallest of them
is 0xD800, which means that most code points in modern use are below
them all, and a single test can be used to exclude just about everything
likely to be encountered. The problem was that the way this test was
done on EBCDIC let far too many code points pass, so they had to be
checked with the more complicated branches. The digits 0-9 and some
capital letters were not filtered out, for example. This commit changes
the EBCDIC test to transform into I8 (an array lookup), which excludes
the code points that shouldn't have passed before.
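The filtering idea can be sketched on plain code points, assuming an ASCII platform. All names below are hypothetical, and the real macro inspects UTF-8 bytes rather than decoded code points; the sketch only shows the cheap test in front of the slower branches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The three problematic classes, checked with several branches. */
static bool is_surrogate(uint32_t cp) { return cp >= 0xD800 && cp <= 0xDFFF; }

static bool is_nonchar(uint32_t cp)
{
    return (cp >= 0xFDD0 && cp <= 0xFDEF)
        || (cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE);
}

static bool is_above_unicode(uint32_t cp) { return cp > 0x10FFFF; }

/* Hypothetical name: one comparison excludes everything below the
 * smallest problematic code point before the slow checks run. */
static bool is_problematic_sketch(uint32_t cp)
{
    if (cp < 0xD800)
        return false;
    return is_surrogate(cp) || is_nonchar(cp) || is_above_unicode(cp);
}

/* A few illustrative calls. */
static void problematic_examples(void)
{
    assert(!is_problematic_sketch('9'));     /* filtered by the cheap test */
    assert(is_problematic_sketch(0xD800));   /* first surrogate */
    assert(is_problematic_sketch(0x110000)); /* above Unicode */
}
```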
|
|
|
|
|
|
|
|
|
| |
This adds a macro that converts a code point in the ASCII 128-255 range
to UTF-8, and changes existing code to use it when the range is known to
be restricted to this one, rather than the previous macro which accepted
a wider range (any code point representable by 2 bytes), but had an
extra test on EBCDIC platforms, hence was larger than necessary and
slightly slower.
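A minimal sketch of such a restricted conversion, assuming an ASCII platform and a hypothetical function name. Because the input is known to lie in 128-255, the result is always exactly two bytes and no range test is needed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: encode a code point in 128..255 as two UTF-8
 * bytes. Assumes an ASCII platform; EBCDIC needs a different mapping. */
static void latin1_high_to_utf8_sketch(uint8_t cp, uint8_t out[2])
{
    assert(cp >= 0x80);                      /* caller guarantees the range */
    out[0] = (uint8_t)(0xC0 | (cp >> 6));    /* start byte: 0xC2 or 0xC3 */
    out[1] = (uint8_t)(0x80 | (cp & 0x3F));  /* continuation byte */
}
```

For example, U+00E9 comes out as the bytes 0xC3 0xA9.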
|
| |
|
|
|
|
|
| |
so that these constructs appear on a single output line for reader
convenience.
|
|
|
|
|
|
|
|
|
| |
The new name is a more accurate description of what it does.
I meant to apply this commit before
d0664088be143e921b2e717524bafddf6a406029, but somehow it didn't happen.
On EBCDIC platforms, that commit will fail to compile without this one,
so could be a problem if ever bisecting on one.
|
|
|
|
|
|
|
|
| |
These macros are not in the public API, but they might be someday. We
may want to check for valid UTF-8 at some point. Add a parameter so
that is possible then without having to change the API.
This also changes to use the short name of one macro.
|
|
|
|
| |
The trailing underscore was unintended.
|
| |
|
|
|
|
|
| |
Most of the other names in utf8.h have an underscore; this allows
someone to keep things consistent in their code.
|
|
|
|
| |
Align some macro definitions vertically to make them easier to read.
|
|
|
|
|
|
|
| |
Actually, there are no special rules for this Unicode release. All
four "i" characters (upper- and lowercase, dotted and dotless) are
considered equivalent under /i only in this release. This adds special
cases that are only compiled in for that release.
|
|
|
|
|
|
|
|
|
|
|
|
| |
U+1E9E LATIN CAPITAL LETTER SHARP S is handled specially by Perl,
because of its relationship to the infamous LATIN SMALL LETTER SHARP S,
which folds to 'ss', being the only character whose code point is < 256
to have a multi char fold, and this creates lots of special cases.
But U+1E9E wasn't in all Unicode releases. Because Perl is supposed to
work with any release, we need to be able to compile when this character
isn't defined. In some of those cases we use U+017F (LATIN SMALL LETTER
LONG S) instead, which is in all releases.
|
| |
|
|
|
|
|
| |
These are mentioned in some other pods. It's best to bring them into
perlapi, and refer to them from the other pods.
|
|
|
|
|
|
| |
The name UVCHR... parallels the usage of various functions uvchr...
It's less confusing to keep the same name form for the same type of
functionality.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
An empty cpan/.dir-locals.el stops Emacs using the core defaults for
code imported from CPAN.
Committer's work:
To keep t/porting/cmp_version.t and t/porting/utils.t happy, $VERSION needed
to be incremented in many files, including throughout dist/PathTools.
perldelta entry for module updates.
Add two Emacs control files to MANIFEST; re-sort MANIFEST.
For: RT #124119.
|
|
|
|
|
|
| |
Replace the stricter MAX_PORTABLE_UTF8_TWO_BYTE check with a looser
MAX_UTF8_TWO_BYTE check; otherwise we can't correctly convert code points
in the range 0x400-0x7FF from UTF-16 to UTF-8 on non-EBCDIC platforms.
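The effect of the limit can be sketched as follows, assuming an ASCII platform. The constant and function names are hypothetical stand-ins. With a two-byte limit of 0x3FF, code points 0x400-0x7FF would fall through to the three-byte branch and be encoded incorrectly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: largest code point that fits in two UTF-8 bytes on an
 * ASCII platform. A stricter, portable limit would be 0x3FF. */
#define SKETCH_MAX_TWO_BYTE 0x7FF

/* Hypothetical helper: encode one non-surrogate UTF-16 code unit as
 * UTF-8 and return the number of bytes written (1 to 3). */
static size_t utf16_unit_to_utf8_sketch(uint16_t u, uint8_t *out)
{
    assert(u < 0xD800 || u > 0xDFFF);   /* surrogate pairs not handled */
    if (u < 0x80) {
        out[0] = (uint8_t)u;
        return 1;
    }
    if (u <= SKETCH_MAX_TWO_BYTE) {
        out[0] = (uint8_t)(0xC0 | (u >> 6));
        out[1] = (uint8_t)(0x80 | (u & 0x3F));
        return 2;
    }
    out[0] = (uint8_t)(0xE0 | (u >> 12));
    out[1] = (uint8_t)(0x80 | ((u >> 6) & 0x3F));
    out[2] = (uint8_t)(0x80 | (u & 0x3F));
    return 3;
}
```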
|
|
|
|
| |
The comments explain their purpose
|
|
|
|
|
|
| |
This is a more accurately named synonym for is_ascii_string(), which is
retained. The old name is misleading to someone programming for
non-ASCII platforms.
|
|
|
|
|
|
|
| |
These macros are supposed to accommodate inputs larger than a byte.
Therefore, under EBCDIC, we have to use a different macro which handles
the larger values. On ASCII platforms, these called macros are no-ops
so it doesn't matter there.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adds to handy.h isALPHA_FOLD_EQ(c1,c2) which efficiently tests if
c1 and c2 are the same character, case-insensitively. For example
isALPHA_FOLD_EQ(c, 's') returns true if and only if <c> is 's' or 'S'.
isALPHA_FOLD_NE() is also added by this commit.
At least one of c1 and c2 must be known to be in [A-Za-z] or this macro
doesn't work properly. (There is an assert for this in the macro in
DEBUGGING builds). That is why the name includes "ALPHA", so you won't
forget when using it.
This functionality has been in regcomp.c for a while, under a different
name. I had thought that the only reason to make it more generally
available was potential speed gain, but recent gcc versions optimize to
the same code, so I thought there wasn't any point to doing so.
But I now think that using this makes things easier to read (and
certainly shorter to type in). Once you grok what this macro does, it
simplifies what you have to keep in your mind when reading logical
expressions with multiple operands. That something can be either upper
or lower case can be a distraction to understanding the larger point of
the expression.
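A minimal sketch of the technique, assuming an ASCII platform and using hypothetical function names in place of the macro. Upper- and lowercase ASCII letters differ in a single bit, so masking that bit out of both operands makes the comparison case-insensitive.

```c
#include <assert.h>

/* Hypothetical helper: true for A-Z and a-z on an ASCII platform. */
static int is_ascii_alpha_sketch(int c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* Sketch of the comparison: clear the case bit ('A' ^ 'a' is 0x20 on
 * ASCII) in both operands, then compare. At least one operand must be
 * a letter, otherwise unrelated characters could compare equal. */
static int alpha_fold_eq_sketch(int c1, int c2)
{
    assert(is_ascii_alpha_sketch(c1) || is_ascii_alpha_sketch(c2));
    return (c1 & ~('A' ^ 'a')) == (c2 & ~('A' ^ 'a'));
}
```

The precondition matters because '[' (0x5B) and '{' (0x7B) also differ only in that bit, so without a letter on one side they would be reported as equal.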
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It is not very user friendly to list functions as
"Functions found in file FOO". Better is to group them by purpose, as
many were already. I went through and placed the ones that weren't
already so grouped into groups. Patches welcome if you have a better
classification.
I changed the headings of some so that the important distinction is
the first word, which places them in the file more appropriately.
And for a couple that I had created myself, I came up with a name
that I think is better than the original.
|
|
|
|
|
|
|
| |
This commit allows locale-awareness to be enabled for only a specified
subset of the locale categories. Thus you could make a section of code
LC_MESSAGES aware, with no locale-awareness for the other categories.
|
|
|
|
|
|
| |
The definition was incorrect. When going from control to printable
name, we need to go from Latin1 -> Native, so that e.g., a 65 gets
turned into the native 'A'
|
|
|
|
|
| |
It's very rare actually for code to be presented with malformed UTF-8,
so give the compiler a hint about the likely branches.
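A sketch of what such a hint can look like with GCC or Clang, where `__builtin_expect` is available. The macro and function names are hypothetical, and the byte test shown is for strict UTF-8 only, which is narrower than what Perl accepts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hint macros with a no-op fallback for other compilers. */
#if defined(__GNUC__) || defined(__clang__)
#  define SKETCH_LIKELY(x)   __builtin_expect(!!(x), 1)
#  define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define SKETCH_LIKELY(x)   (x)
#  define SKETCH_UNLIKELY(x) (x)
#endif

/* Return the index of the first byte that can never occur in strict
 * UTF-8 (0xC0, 0xC1, 0xF5..0xFF), or len if there is none. The
 * malformed case is marked as the unlikely branch. */
static size_t first_never_valid_byte(const unsigned char *s, size_t len)
{
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        unsigned char b = s[i];
        if (SKETCH_UNLIKELY(b == 0xC0 || b == 0xC1 || b >= 0xF5))
            return i;
    }
    return len;
}
```

The hint changes only how the compiler lays out the branches; the result is the same with or without it.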
|
|
|
|
|
|
| |
This function is now more efficiently implemented as a synonym for
isUTF8_CHAR(). We retain the Perl_is_utf8_char_buf() function for code
that calls it that way.
|
|
|
|
|
|
|
|
| |
This allows for an efficient isUTF8_CHAR macro, which does its own
length checking, and uses the UTF8_INVARIANT macro for the first byte.
On EBCDIC systems this macro, which does a table lookup, is quite a bit
more efficient than all the branches that would normally have to be
done.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This macro will inline the code to determine if a character is
well-formed UTF-8 for code points below a certain value, falling back to
a slower function for larger ones. On ASCII platforms, it will inline
for well beyond all legal Unicode code points. On EBCDIC, it currently
does it for code points up to 0x3FFF. This could be increased, but our
porting tests do the regen every time to make sure everything is ok, and
making it larger slows that down. This is worked around on ASCII by
normally commenting out the code that generates this info, but including
in utf8.h a version that did get generated. This is static information
and won't change. (This could be done for EBCDIC too, but I chose not
to at this time as each code page has a different macro generated, and
it gets ugly getting all of them in utf8.h)
Using this macro allowed for simplification of several functions in
utf8.c
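The inline-then-fall-back structure can be sketched like this, assuming an ASCII platform. The names are hypothetical, the inline part covers only one- and two-byte characters, and the slow path is a simplified stand-in limited to Unicode's range, not the generated code described above.

```c
#include <assert.h>
#include <stddef.h>

static int is_cont(unsigned char b) { return (b & 0xC0) == 0x80; }

/* Slow path stand-in: 3- and 4-byte characters. Surrogates are not
 * rejected here. */
static size_t utf8_char_len_slow(const unsigned char *s, const unsigned char *e)
{
    size_t avail = (size_t)(e - s);
    if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        if (avail < 3 || !is_cont(s[1]) || !is_cont(s[2])) return 0;
        if (s[0] == 0xE0 && s[1] < 0xA0) return 0;  /* overlong */
        return 3;
    }
    if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        if (avail < 4 || !is_cont(s[1]) || !is_cont(s[2]) || !is_cont(s[3]))
            return 0;
        if (s[0] == 0xF0 && s[1] < 0x90) return 0;  /* overlong */
        if (s[0] == 0xF4 && s[1] > 0x8F) return 0;  /* above U+10FFFF */
        return 4;
    }
    return 0;
}

/* Fast path: does its own length check, handles the common short
 * cases inline, and returns the character's length or 0 if malformed. */
static inline size_t utf8_char_len_sketch(const unsigned char *s,
                                          const unsigned char *e)
{
    assert(s != NULL && e != NULL);
    if (e <= s) return 0;
    if (s[0] < 0x80) return 1;                      /* invariant */
    if (s[0] >= 0xC2 && s[0] <= 0xDF)
        return (e - s >= 2 && is_cont(s[1])) ? 2 : 0;
    return utf8_char_len_slow(s, e);
}
```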
|
|
|
|
| |
This places it in a better situated spot for later commits
|
|
|
|
|
| |
This causes the generated regcharclass.h to be valid on all
supported platforms
|
|
|
|
|
|
|
| |
This mostly indents and outdents based on blocks added or removed by the
previous commit. But there are a few comment changes and vertical
alignment of macro backslash continuation characters, and other
white-space changes.
|
|
|
|
|
| |
The UTF8 in the name is kind of misleading, and would be more misleading
after future commits make UTF8 locales special.
|
|
|
|
|
|
|
|
|
|
| |
The documentation says that Perl taints certain operations when subject
to locale rules, such as lc() and ucfirst(). Prior to this commit
there were exceptions when the operand to these functions contained no
characters whose case change actually varied depending on the locale,
for example the empty string or above-Latin1 code points. Changing to
conform to the documentation simplifies the core code, and yields more
consistent results.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This bottom level function decodes the first character of a UTF-8 string
into a code point. Using it directly is discouraged. This
commit cleans up some of the warnings it can raise. Now, tests for
malformations are done before any tests for other potential issues. One
of those issues involves code points so large that they have never
appeared in any official standard (the current standard has scaled back
the highest acceptable code point from earlier versions). It is
possible (though not done in CPAN) to warn and/or forbid these code
points, while accepting smaller code points that are still above the
legal Unicode maximum. The warning message for this now includes the
code point if representable on the machine. Previously it always
displayed raw bytes, which is what it still does for non-representable
code points.
|
|
|
|
| |
Future commits will want this available outside utf8.h
|
|
|
|
|
|
| |
This change should catch some wrong calls to these macros. The meat of
the macros is extracted out into two internal-only macros, and the other
macros are rearranged to call these.
|
| |
|
|
|
|
| |
I believe this makes the macro easier to read
|
|
|
|
| |
Previously it was based on HAS_QUAD, which is not (as) correct.
|
|
|
|
|
|
|
| |
This removes a macro not yet even in a development release, and splits
its calls into two classes: those where the input is a byte; and those
where it can be any unsigned integer. The byte implementation avoids a
function call on EBCDIC platforms.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
I will not otherwise mention that stealing .c code from the core is a
dangerous practice.
This is actually a bug in the module, which had been masked until now.
The first two parameters to utf8_to_uvchr_buf() are both U8*. But both
's' and PL_bufend are char*. The 's' has a cast to U8* in the failing
line, but not PL_bufend.
Interestingly, the line in the official toke.c (introduced in 4b88fb76)
has always been right, so the stealer didn't copy it correctly.
What de69f3af3 did was turn this former function call into a macro that
manipulates the parameters and calls another function, thereby removing
a layer of function call overhead. The manipulation involves
subtracting 's' from PL_bufend, and this fails to compile due to the
missing cast on the latter parameter.
The problem goes away if the macro casts both parameters to U8*, and
that is what this commit does.
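The fix can be illustrated with a small sketch in which every name is a hypothetical stand-in for the real macro and function. Subtracting a `U8*` from a `char*` does not compile, so the wrapper casts both arguments before doing the pointer arithmetic.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char U8;   /* stand-in for Perl's U8 */

/* Stand-in for the function the macro forwards to: it takes a start
 * pointer and a length instead of a start and an end pointer. */
static size_t span_len_sketch(const U8 *s, size_t len)
{
    assert(s != NULL);
    return len;
}

/* Wrapper macro: casting both parameters to U8* lets callers pass any
 * mix of char* and U8* while the subtraction still compiles. */
#define SPAN_LEN_BUF_SKETCH(s, send)                          \
    span_len_sketch((const U8 *)(s),                          \
                    (size_t)((const U8 *)(send) - (const U8 *)(s)))
```

A caller holding a `U8 *s` and a `char *end` can then use `SPAN_LEN_BUF_SKETCH(s, end)` without adding casts of its own.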
|
|
|
|
| |
Vertically align the definitions of a few #defines
|