| Commit message | Author | Age | Files | Lines |
... | |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The previous commit introduced some code that concatenates a pv onto
an sv and then does SvUTF8_on on the sv if the pv was utf8.
That can’t work if the sv was in Latin-1 (or single-byte) encoding
and contained non-ASCII characters. Nor can it work if bytes are
appended to a utf8 sv. Both produce mangled utf8.
There is apparently no function apart from sv_catsv that handles
this. So I’ve modified sv_catpvn_flags to handle this if passed the
SV_CATUTF8 (concatenating a utf8 pv) or SV_CATBYTES (concatenating a
byte pv) flag.
This avoids the overhead of creating a new sv (in fact, sv_catsv
even copies its rhs in some cases, so that would mean creating two
new svs). It might even be worthwhile to redefine sv_catsv in terms
of this....
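As a rough sketch of how these flags are meant to be used from C code that
already has a Perl context (sv, utf8_buf/utf8_len, and byte_buf/byte_len are
hypothetical variables):

    /* Append a pv known to be UTF-8 encoded; sv_catpvn_flags() upgrades the
     * destination sv as needed, instead of the broken pattern of
     * concatenating first and calling SvUTF8_on() afterwards. */
    sv_catpvn_flags(sv, utf8_buf, utf8_len, SV_CATUTF8);

    /* Append raw (native/Latin-1) bytes to an sv that may already be utf8. */
    sv_catpvn_flags(sv, byte_buf, byte_len, SV_CATBYTES);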
|
| |
|
| |
|
|
|
|
|
|
|
|
| |
The swashes already have the underscore, so this test is redundant.
Testing for it does save some time for this one character, by avoiding
having to go out and load the swash, but why just the underscore? In
fact, an earlier commit changed the macro that most people should use to
access this function so that it isn't even called for the underscore.
|
|
|
|
|
|
|
|
|
| |
The Unicode stability policy guarantees that no code points will ever be
added to the control characters beyond those already in that category.
All such characters are in the Latin1 range, so the Perl core
already knows which ones they are, and there is no need to go out to
disk and create a swash for them.
|
|
|
|
|
| |
The name XPerlSpace is less confusing than SpacePerl (at least to me):
it means take PerlSpace and extend it beyond ASCII.
|
|
|
|
|
|
| |
These three properties are restricted to being true only for ASCII
characters. That information is compiled into Perl, so no need to
create swashes for them.
|
|
|
|
|
| |
This information is trivially computed via the macro; there is no need
to go out to disk and store a swash for it.
|
|
|
|
|
| |
This new function is now potentially called. However, there is no data file
or other circumstance that currently causes this path to be executed.
|
|
|
|
| |
I believe that the new wording is clearer than the old wording, which I wrote.
|
|
|
|
| |
This indents a block of code to match its placement inside a newly created block.
|
|
|
|
|
|
|
| |
The Unicode properties are defined only on Unicode code points. In the
past, this meant all property matches would fail for non-Unicode code
points. However, starting with 5.15.1, some property matches began to
succeed for them. This restores the previous behavior.
|
|
|
|
|
|
|
|
| |
Having PL_parser->error_count set to non-zero when utf8_heavy.pl tries
to do() one of its swashes results in ‘Compilation error’ being placed
in $@ during the do, even if the do was successful. This patch sets the
error_count to 0 before calling SWASHNEW, to prevent that. It uses
SAVEI8 to make sure it is restored on scope exit.
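A minimal sketch of the guard described (assuming an enclosing ENTER/LEAVE
pair, so the SAVEI8 localisation is undone on scope exit):

    if (PL_parser && PL_parser->error_count) {
        SAVEI8(PL_parser->error_count);   /* restore the old count at scope exit */
        PL_parser->error_count = 0;       /* so do() doesn't see "Compilation error" */
    }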
|
| |
|
|
|
|
|
|
|
| |
This allows a swash to return a list, along with an extra key in the
hash which says that the list should be inverted.
A future commit will generate such keys.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This function has not been able to handle what are called EXTRAS in
its input. These are things like:
!utf8::InHiragana
-utf8::InKatakana
+utf8::IsCn
besides the normal list of ranges.
This commit allows this function to handle all the same constructs as
the regular swash input function, from which most of the new code was
copied.
|
|
|
|
|
|
|
| |
Unicode inversion lists will commonly contain UV_MAX, which may
trigger these warnings. Add to the numeric grok functions a flag to
suppress them, which can be set by the code that deals with these lists.
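For illustration, a hedged sketch of parsing one inversion-list entry in code
that already includes perl.h; I am assuming the suppression flag is
PERL_SCAN_SILENT_NON_PORTABLE, and s is a hypothetical pointer to the hex digits:

    STRLEN len   = strlen(s);
    I32    flags = PERL_SCAN_SILENT_NON_PORTABLE | PERL_SCAN_DISALLOW_PREFIX;
    /* No "non-portable" warning even when the value is UV_MAX on a 64-bit build */
    UV     cp    = grok_hex(s, &len, &flags, NULL);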
|
|
|
|
|
|
| |
The inversion list is an opaque object, currently implemented as an SV.
Even if it ends up being an HV in the future, Nicholas is of the opinion
that it should be presented to the world as an SV*.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Consider U+FB05 and U+FB06. These both fold to 'st', and hence should
match each other under /i. However, Unicode doesn't furnish a rule for
this, and Perl hasn't been smart enough to figure it out. The bug that
shows up is in constructs like
"\x{fb06}" =~ /[^\x{fb05}]/i
succeeding. Most of these instances also have an 'S' entry in Unicode's
CaseFolding.txt, which avoids the problem (as mktables was earlier
changed to include those in the generated table). But there were
several code points that didn't.
This patch changes utf8.c to look for these when constructing its
inverted list of case fold equivalents. An alternative would have been
to change mktables instead to look for them and create synthetic rules.
But this is more general, in case the function ends up being used for
other things.
I will change fold_grind.t to test for these in a separate commit.
|
|
|
|
|
| |
This comment will no longer apply, as the code it talked about is
moving into swash_init().
|
| |
|
|
|
|
|
|
|
| |
av_len() is misnamed, and hence earlier led me to stop the loop
one iteration shy of where it should have stopped. No actual bugs were
caused by this, but it could cause a duplicate entry in an array that is
searched linearly, hence a slight slowdown.
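For clarity, a small sketch of the distinction (av and the loop body are
placeholders): av_len() returns the index of the last element, not the
element count.

    I32 i;
    for (i = 0; i <= av_len(av); i++) {   /* <=, because av_len() is the last index */
        SV **svp = av_fetch(av, i, FALSE);
        if (svp) {
            /* ... examine *svp ... */
        }
    }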
|
|
|
|
|
|
|
|
|
|
| |
And also to_uni_fold().
The flag allows retrieving either simple or full folds.
The interface is subject to change, so these are marked experimental
and their names begin with an underscore. The old versions are turned
into macros calling the new versions with the correct extra parameter.
|
|
|
|
|
| |
prefer foo("%s", fixedstr) over foo(fixedstr).
One day someone might change fixedstr to include '%' characters.
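A tiny illustration, using Perl_croak() as a stand-in for any printf-like
function and fixedstr as a hypothetical string:

    Perl_croak(aTHX_ "%s", fixedstr);   /* safe even if fixedstr ever gains a '%' */
    /* Perl_croak(aTHX_ fixedstr);      would treat any '%' as a format directive */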
|
|
|
|
|
|
|
| |
The code for handling locale can be moved entirely to the place where
locale handling is done for the second string, as by that time we have
processed both the first string and the second. Since we succeed only
if both are single, atomic bytes, we don't need to do the loop below.
|
|
|
|
| |
This doesn't appear to actually break anything.
|
|
|
|
|
|
| |
A new flag can now be passed to this function to indicate that locale
rules should be used for code points below 256. Unicode rules are still
used for those above 255. Folds that cross that boundary are disallowed.
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Previously this used a home-grown definition of an identifier start,
stemming from a bug in some early Unicode versions. This led to some
problems, fixed by #74022.
But the home-grown solution did not track Unicode, and allowed
characters, such as marks, to begin words when they shouldn't. This change
brings this macro into compliance with Unicode going forward.
|
|
|
|
|
| |
If this option is set, any non-ASCII character that has an ASCII
character in its fold will not match that fold.
|
|
|
|
|
| |
The parameter doesn't do anything yet. The old version becomes a macro
calling the new version with 0 as the flags.
|
|
|
|
|
|
|
|
|
|
| |
The recent move of Unicode folding to the compilation phase caused
spurious warnings during the miniperl build phase of Perl itself,
before the Unicode tables get built. Before the tables are built, Perl
is unable to know about Unicode semantics (it has ASCII/Latin1
hard-coded in), but was still trying to access the tables. Now it
checks, and if the tables aren't present it uses just the hard-coded
ASCII/Latin1 semantics.
|
|
|
|
|
|
|
|
|
|
| |
This is for security as well as performance. It allows Unicode properties to
be matched case-insensitively. As a result, the swash inversion hash is
converted from having utf8 keys to numeric, code point, keys.
It also fixes, for the first time, the bug where /i doesn't work for a code
point that is not at the end of a range in a bracketed character class and
has a multi-character fold.
|
|
|
|
| |
This shouldn't be called from XS code.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Going forward, the intent is to convert from swashes to the better-suited
inversion list data structure. This adds rudimentary inversion lists that have
only the functionality needed for 5.14. As a result, they are kept, as much as
possible, static to one file.
What's necessary for 5.14 is enough to allow folding of ANYOF nodes to be moved
from regexec to regcomp. They are needed for that in order to generate class
definitions that are as compact as possible; otherwise, very long linear lists
might be generated. (They still may be, but that's inherent in the problem
domain; this generates them as compactly as possible, combining overlapping
ranges, etc.)
The only two non-trivial methods in this object are from published algorithms.
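To illustrate the idea only (a self-contained sketch, not Perl's actual invlist
API): an inversion list is a sorted array of code points in which ranges
starting at even indices are inside the set and ranges starting at odd indices
are outside it, so membership is a single binary search.

    #include <stdbool.h>
    #include <stddef.h>

    typedef unsigned long CP;      /* stand-in for Perl's UV */

    /* Return true if cp is in the set described by inversion list 'list'
     * of 'len' entries. */
    static bool invlist_contains(const CP *list, size_t len, CP cp)
    {
        size_t lo = 0, hi = len;
        while (lo < hi) {                        /* find the first entry > cp */
            const size_t mid = lo + (hi - lo) / 2;
            if (list[mid] <= cp)
                lo = mid + 1;
            else
                hi = mid;
        }
        /* cp falls in the range beginning at list[lo-1]; even index => in set */
        return lo > 0 && (lo - 1) % 2 == 0;
    }

    /* Example: A-Z plus U+00C0..U+00D6 is just { 0x41, 0x5B, 0xC0, 0xD7 }. */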
|
|
|
|
|
|
|
|
| |
Ideally it would be available; calling Perl_newSVpvf_nocontext
directly is an alternative, but the comment in sv.c makes that
questionable.
Since the function this is called from already has a context, use it.
|
|
|
|
| |
This tidies things up after several of them were removed.
|
|
|
|
| |
The routines that these call use the warn_d forms, so these should as well.
|
|
|
|
|
|
|
|
| |
The non-Unicode code points have no Unicode semantics, so applying operations
such as casing to them warns.
This patch also includes the changes to test the warnings added by recent
commits for handling the surrogates and above-Unicode code points.
|
|
|
|
| |
outdent in response to the enclosing block being removed
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Surrogates, non-character code points, and code points that aren't in Unicode
are now allowed by default, instead of having to specify a flag to allow them.
(Most code did specify those flags anyway.)
This affects uvuni_to_utf8_flags(), utf8n_to_uvuni() and various routines that
are specialized interfaces to them.
Now there is a new set of flags to disallow those code points. Further, all 66
of the non-character code points are known about and handled consistently,
instead of just U+FFFF.
Code that requires these code points to be forbidden will have to change to use
the new flags. I have looked at all the (few) instances in CPAN where these
routines are used, and the only one I found that appears to need to do this,
Encode, has already been patched to accommodate this change. Of course,
I may have overlooked some subtleties.
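As a hedged sketch of opting back in to the stricter checks (I am assuming the
new disallow flag names in utf8.h, and that a disallowed code point makes the
function fail; buf and cp are placeholders):

    U8  buf[UTF8_MAXBYTES + 1];
    U8 *end = uvuni_to_utf8_flags(buf, cp,
                                  UNICODE_DISALLOW_SURROGATE
                                | UNICODE_DISALLOW_NONCHAR
                                | UNICODE_DISALLOW_SUPER);
    if (!end) {
        /* cp was a surrogate, a non-character, or above Unicode */
    }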
|
| |
|
|
|
|
|
|
|
| |
This new function looks for problematic code points on output, and warns if any
are found, returning FALSE as well.
What it warns about may change, so it is marked as experimental.
|
|
|
|
|
|
| |
This gives a small space saving on this platform, likely due to code being
shared with the other call to call_sv(). (It also removes a level of function
call at runtime.)
|
|
|
|
|
|
|
|
|
| |
Historically Perl_swash_init() called Perl_gv_fetchmeth() simply to determine
if the requested package was loaded, and if not, attempt to load it. However,
Perl_gv_fetchmeth() is actually making the same lookup as Perl_call_method()
uses to get a pointer to the relevant method. Hence if we get a non-NULL
return from Perl_gv_fetchmeth() we can pass it directly to Perl_call_sv(), and
save duplicated work.
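Roughly, the pattern being described looks like this (a sketch only, not the
exact code in Perl_swash_init(); stash and count are placeholders, and the
method's arguments are assumed to have been pushed onto the Perl stack already):

    GV * const gv = gv_fetchmeth(stash, "SWASHNEW", 8, -1);
    I32 count;
    if (gv) {
        /* reuse the lookup gv_fetchmeth() already performed */
        count = call_sv((SV *)gv, G_SCALAR);
    }
    else {
        /* package not loaded yet: require it here, then fall back to the
         * name-based lookup that call_method() does */
        count = call_method("SWASHNEW", G_SCALAR);
    }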
|
| |
|
| |
|
| |
|
| |
|