This function and is_utf8_string_loclen() are modified to check that
they do not read beyond the end of the string; and the pod for
is_utf8_char() is modified to warn about its potential for buffer
overflow.
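A minimal sketch of the kind of check involved, assuming a string pointer s and a hypothetical end pointer send; UTF8SKIP() is the core macro reporting how many bytes the start byte claims the character occupies:

    /* A sketch: validate that the character starting at s fits before send.
     * Without comparing the claimed length against the true end of the
     * buffer, validating the character can read too far. */
    static bool
    fits_in_buffer(const U8 *s, const U8 *send)
    {
        STRLEN char_len = UTF8SKIP(s);  /* bytes the start byte claims */
        return s + char_len <= send;    /* false: would read past the end */
    }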
The function to_uni_fold() works without first requiring conversion to
utf8.
It's very rare that someone will be outputting these unusual code points.
I now understand swashes well enough to document them better; this also
fixes nits in other comments.
Perl has allowed user-defined properties to match above-Unicode code
points, while falsely warning that it doesn't. This removes that
warning.
The code assumed that there is a code point above the highest value we
are looking at. That is true except when we are looking at the highest
representable code point on the machine. A special case is needed for
that.
On a 32 bit machine with USE_MORE_BITS, a UV is 64 bits, but STRLEN is
32 bits. A cast was missing in a bit complement, which led to the loss
of the upper 32 bits.
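A sketch of the failure mode, with illustrative variable names, assuming a 32-bit STRLEN and a 64-bit UV:

    STRLEN len = 5;            /* 32 bits under these conditions */
    UV bad  = ~ len;           /* complemented at 32 bits, then zero-extended:
                                  the upper 32 bits of the UV end up 0 */
    UV good = ~ (UV) len;      /* widen to 64 bits first, then complement */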
When a code point is checked to see if it matches a property, a swatch of
the swash is read in. Typically this is a block of 64 code points
containing the one desired. A bit map is set for those 64 code points,
apparently under the expectation that the program will want code
points near the original.
However, the code just adds 63 to the original code point to get the
ending point of the block. When the original is close enough to the
maximum UV expressible on the platform, this addition overflows.
The patch simply checks for overflow and, if it happens, uses the maximum
possible value. A special case is still needed to handle the very maximum
possible code point, and a future commit will deal with that.
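The guard amounts to a standard unsigned-wraparound check; a sketch with hypothetical variable names:

    UV end = start + 63;   /* ending point of the 64-code-point block */
    if (end < start)       /* the unsigned addition wrapped: overflow */
        end = UV_MAX;      /* clamp to the highest representable value */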
This adds a function, similar to the ones for the other three case
changing operations, that works on latin1 characters only and avoids
having to go out to swashes. It changes to_uni_fold() and
to_utf8_fold() to call it on the appropriate input.
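A sketch of what such a Latin-1-only fold looks like; the real function's name and signature differ, and the two special cases noted in the comment are excluded for brevity:

    /* MICRO SIGN (0xB5) folds to Greek small mu, outside Latin-1, and
     * SHARP S (0xDF) folds to the two-character "ss"; both need special
     * handling elsewhere.  Every other Latin-1 fold stays in the range. */
    static UV
    latin1_fold(UV c)   /* assumes c < 256, c != 0xB5, c != 0xDF */
    {
        if ((c >= 'A' && c <= 'Z')
            || (c >= 0xC0 && c <= 0xDE && c != 0xD7))  /* skip MULTIPLY SIGN */
            return c + 0x20;   /* uppercase letters fold to lowercase */
        return c;              /* everything else folds to itself */
    }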
This creates a new function to handle upper/title casing of code points
in the latin1 range, and avoids using a swash to compute the case,
because the correct values are compiled-in.
It calls this function when appropriate for both title and upper
casing, in both utf8 and uni forms.
Unlike the similar function for lower casing, it may make sense for this
function to be called from outside utf8.c, but inside the core, so it is
not static, but its name begins with an underscore.
The new function split out from to_uni_lower is now called when
appropriate from to_utf8_lower.
Also, to_uni_lower no longer calls to_utf8_lower; it uses the macro
instead, saving a function call and duplicate work.
The portion that deals with Latin1 range characters is refactored into a
separate (static) function, so that it can be called from more than one place.
Future commits will use these in additional places, so macroize them.
There are five functions in utf8.c that look up Unicode maps--the case
changing functions. They look up these maps under the names ToDigit,
ToFold, ToLower, ToTitle, and ToUpper. The imminent expansion of Unicode::UCD
to return the mappings for all properties creates a naming conflict, as
three of those names are the same as other properties, Upper, Lower, and
Title.
It was an unfortunate choice of names originally. Now mktables has been
changed to create a list of mapping properties that utf8_heavy.pl reads.
It uses the official names of those properties, so change utf8.c to
correspond.
The lowercase of latin-1 range code points is known to the perl core, so
for those we can short-circuit converting to utf8 and reading in a swash.
Indent newly formed blocks, and reflow comments and code to fit in the
narrower space.
This adds flags so that if one of the input strings is known to already
have been folded, this routine can skip the (redundant) folding step.
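Assuming this is the foldEQ_utf8 family gaining a flags variant, a caller that has already folded its first string might look like the sketch below; the function and flag names here are my assumption, not verbatim from the commit:

    /* s1 was folded by the caller, so FOLDEQ_S1_ALREADY_FOLDED lets the
     * routine compare it directly instead of folding it a second time. */
    if (foldEQ_utf8_flags(s1, NULL, l1, TRUE,   /* s1: utf8, pre-folded */
                          s2, NULL, l2, TRUE,   /* s2: utf8, folded here */
                          FOLDEQ_S1_ALREADY_FOLDED))
        matched = TRUE;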
The previous commit introduced some code that concatenates a pv on to
an sv and then does SvUTF8_on on the sv if the pv was utf8.
That can’t work if the sv was in Latin-1 (or single-byte) encoding
and contained extra-ASCII characters. Nor can it work if bytes are
appended to a utf8 sv. Both produce mangled utf8.
There is apparently no function apart from sv_catsv that handles
this. So I’ve modified sv_catpvn_flags to handle this if passed the
SV_CATUTF8 (concatenating a utf8 pv) or SV_CATBYTES (concatenating a
byte pv) flag.
This avoids the overhead of creating a new sv (in fact, sv_catsv
even copies its rhs in some cases, so that would mean creating two
new svs). It might even be worthwhile to redefine sv_catsv in terms
of this....
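For instance (a sketch; the buffer names are placeholders), appending a buffer known to hold UTF-8 now does the right thing even when the target SV is a byte string:

    /* SV_CATUTF8 says "pv is UTF-8": the concatenation upgrades dsv as
     * needed instead of appending raw bytes and flipping SvUTF8 after. */
    sv_catpvn_flags(dsv, pv, pvlen, SV_GMAGIC | SV_CATUTF8);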
The swashes already have the underscore, so this test is redundant. It
does save some time for this character, by avoiding having to go out and
load the swash, but why just the underscore? In fact, an earlier commit
changed the macro that most people should use to access this function so
that it doesn't even call it for the underscore.
The Unicode stability policy guarantees that no code points will ever be
added to the control characters beyond those already in that category.
All such characters are in the Latin1 range, so the Perl core
already knows which ones they are, and there is no need to go out to
disk and create a swash for them.
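Because the set is closed, a direct range test suffices; a sketch, not the core's actual macro:

    /* Unicode's Cc (control) category is frozen: U+0000..U+001F, U+007F,
     * and U+0080..U+009F; all 65 code points are within Latin-1. */
    static bool
    is_control_cp(UV cp)
    {
        return cp <= 0x1F || (0x7F <= cp && cp <= 0x9F);
    }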
The name XPerlSpace is less confusing than SpacePerl (at least to me).
It means: take PerlSpace and extend it beyond ASCII.
These three properties are restricted to being true only for ASCII
characters. That information is compiled into Perl, so no need to
create swashes for them.
This information is trivially computed via the macro; there is no need
to go out to disk and store a swash for it.
This new function is now potentially called. However, there is no data
file or other circumstance that currently causes this path to be executed.
I believe that the new wording is clearer than the old wording, which I
also wrote.
This indents a block of code to match its placement in a newly created
block.
The Unicode properties are defined only on Unicode code points. In the
past, this meant all property matches would fail for non-Unicode code
points. However, starting with 5.15.1 some properties do succeed. This
restores the previous behavior.
Having PL_parser->error_count set to non-zero when utf8_heavy.pl tries
to do() one of its swashes results in ‘Compilation error’ being placed
in $@ during the do, even if it was successful. This patch sets the
error_count to 0 before calling SWASHNEW, to prevent that. It uses
SAVEI8, to make sure it is restored on scope exit.
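The shape of the fix, as described (a sketch; the surrounding code is omitted):

    /* Localize the parser's error count so a pending compilation error
     * can't leak "Compilation error" into $@ during the do(); SAVEI8
     * restores the saved value automatically on scope exit. */
    SAVEI8(PL_parser->error_count);
    PL_parser->error_count = 0;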
This allows a swash to return a list, along with an extra key in the
hash which says that the list should be inverted.
A future commit will generate such keys.
This function has not been able to handle what are called EXTRAS in
its input. These are things like:
    !utf8::InHiragana
    -utf8::InKatakana
    +utf8::IsCn
besides the normal list of ranges.
This commit allows this function to handle all the same constructs as
the regular swash input function, from which most of the new code was
copied.
Unicode inversion lists commonly contain UV_MAX, which may trigger
these warnings. This adds a flag to the numeric grok functions to
suppress them, which can be set by the code that deals with these
lists.
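A sketch of a caller opting out; PERL_SCAN_SILENT_NON_PORTABLE is my assumption for the flag's name:

    const char *nptr  = "FFFFFFFFFFFFFFFF";   /* UV_MAX on a 64-bit build */
    STRLEN      len   = strlen(nptr);
    I32         flags = PERL_SCAN_SILENT_NON_PORTABLE;   /* assumed name */
    UV          val   = grok_hex(nptr, &len, &flags, NULL);  /* no warning */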
The inversion list is an opaque object, currently implemented as an SV.
Even if it ends up being an HV in the future, Nicholas is of the opinion
that it should be presented to the world as an SV*.
Consider U+FB05 and U+FB06. These both fold to 'st', and hence should
match each other under /i. However, Unicode doesn't furnish a rule for
this, and Perl hasn't been smart enough to figure it out. The bug that
shows up is in constructs like
"\x{fb06}" =~ /[^\x{fb05}]/i
succeeding. Most of these instances also have a 'S' entry in Unicode's
CaseFolding.txt, which avoids the problem (as mktables was earlier
changed to include those in the generated table). But there were
several code points that didn't.
This patch changes utf8.c to look for these when constructing its
inverted list of case fold equivalents. An alternative would have been
to change mktables instead to look for them and create synthetic rules.
But, this is more general in case the function ends up being used for
other things.
I will change fold_grind.t to test for these in a separate commit.
This comment will no longer apply, as the code it talked about is
moving into swash_init().
av_len() is misnamed (it returns the index of the last element, not the
length of the array), and hence led me earlier to stop the loop one shy
of where it should have stopped. No actual bugs were caused by this, but
it could cause a duplicate entry in an array which is searched linearly,
hence a slight slowdown.
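The trap in miniature; since av_len() is the top index (-1 for an empty array), a full traversal needs an inclusive bound (do_something() is a hypothetical callback):

    SSize_t i;
    for (i = 0; i <= av_len(av); i++) {     /* '<' would skip the last one */
        SV **svp = av_fetch(av, i, FALSE);  /* borrow element i */
        if (svp)
            do_something(*svp);             /* hypothetical processing */
    }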
And also to_uni_fold().
The flag allows retrieving either simple or full folds.
The interface is subject to change, so these are marked experimental
and their names begin with an underscore. The old versions are turned
into macros calling the new versions with the correct extra parameter.
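The wrapper pattern looks like the sketch below; the macro and flag spellings here are hypothetical:

    /* The old name keeps working; the new underscore-prefixed function
     * takes an extra argument selecting simple vs. full folding. */
    #define to_utf8_fold(p, ustrp, lenp) \
            _to_utf8_fold_flags(p, ustrp, lenp, FOLD_FLAGS_FULL)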
prefer foo("%s", fixedstr) over foo(fixedstr).
One day someone might change fixedstr to include '%' characters.
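Concretely, with croak() standing in for foo():

    Perl_croak(aTHX_ "%s", fixedstr);  /* safe even if fixedstr gains a '%' */
    Perl_croak(aTHX_ fixedstr);        /* risky: the string is a format */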
The code for handling locale can be moved entirely to the place where
locale handling is done for the second string, as by that time we have
processed both the first string and the second. Because we succeed only
if both are atomic single bytes, we don't need the loop below.
This doesn't appear to actually break anything.
A new flag can now be passed to this function to indicate that locale
rules should be used for code points below 256. Unicode rules are still
used above 255. Folds which cross that boundary are disallowed.