| Commit message | Author | Age | Files | Lines |
|
|
|
|
|
|
| |
This allows a swash to return a list, along with an extra key in the
hash which says that the list should be inverted.
A future commit will generate such keys.
| |
This function has not been able to handle what are called EXTRAS in
its input. These are entries such as:
!utf8::InHiragana
-utf8::InKatakana
+utf8::IsCn
in addition to the normal list of ranges.
This commit allows the function to handle all the same constructs as
the regular swash input function, from which most of the new code was
copied.
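As a Perl-level illustration (a sketch modelled on the user-defined property documentation; the property name InKana is just an example), these EXTRAS lines are the kind of thing a user-defined property sub returns:

# A user-defined property may combine existing properties with '+', '-',
# '!' (and '&') prefixed utf8:: names, alongside plain hexadecimal ranges.
sub InKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}
print "kana\n" if "\x{3041}" =~ /\p{InKana}/;   # HIRAGANA LETTER SMALL A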
|
|
|
|
|
|
|
| |
Unicode inversion lists commonly contain UV_MAX, which may trigger
these warnings. Add a flag to the numeric grok functions to suppress
them; it can be set by the code that deals with these lists.
|
|
|
|
|
|
| |
The inversion list is an opaque object, currently implemented as an SV.
Even if it ends up being an HV in the future, Nicholas is of the opinion
that it should be presented to the world as an SV*.
|
| |
| |
Consider U+FB05 and U+FB06. These both fold to 'st', and hence should
match each other under /i. However, Unicode doesn't furnish a rule for
this, and Perl hasn't been smart enough to figure it out. The bug that
shows up is in constructs like
"\x{fb06}" =~ /[^\x{fb05}]/i
succeeding. Most of these instances also have an 'S' entry in Unicode's
CaseFolding.txt, which avoids the problem (as mktables was earlier
changed to include those in the generated table). But there were
several code points that didn't.
This patch changes utf8.c to look for these when constructing its
inverted list of case fold equivalents. An alternative would have been
to change mktables instead to look for them and create synthetic rules.
But this is more general, in case the function ends up being used for
other things.
I will change fold_grind.t to test for these in a separate commit.
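At the Perl level, the intended behaviour looks like this (a sketch; it assumes a perl containing this fix):

# U+FB05 (LATIN SMALL LIGATURE LONG S T) and U+FB06 (LATIN SMALL LIGATURE ST)
# both case-fold to "st", so under /i they should match each other.
print "ok\n"  if "\x{FB06}" =~ /\x{FB05}/i;      # expected to match
print "bug\n" if "\x{FB06}" =~ /[^\x{FB05}]/i;   # the reported bug: should not match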
|
|
|
|
|
| |
This comment will no longer apply, as the code it talked about is
moving into swash_init().
|
| |
|
|
|
|
|
|
|
| |
av_len() is misnamed (it returns the highest index, not the length),
and hence earlier led me to stop the loop one short of where it should
have stopped. No actual bugs were caused by this, but it could cause a
duplicate entry in an array which is searched linearly, hence a slight
slowdown.
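The Perl-level analogue of the pitfall, for reference: av_len() reports the highest index, like $#array, not the number of elements.

my @a = ('x', 'y', 'z');
print scalar(@a), "\n";   # 3 -- the number of elements
print $#a, "\n";          # 2 -- the highest index, which is what av_len() returns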
|
|
|
|
|
|
|
|
|
|
| |
And also to_uni_fold().
The flag allows retrieving either simple or full folds.
The interface is subject to change, so these are marked experimental
and their names begin with an underscore. The old versions are turned
into macros calling the new versions with the correct extra parameter.
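For what full folding means in practice, here is a Perl-level sketch using fc() (which arrived later, in 5.16, and performs full folding):

use feature 'fc';
# Full case folding maps U+00DF (LATIN SMALL LETTER SHARP S) to "ss";
# a simple fold must remain a single character, so it leaves it unchanged.
print fc("\x{DF}"), "\n";   # prints "ss"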
|
|
|
|
|
| |
prefer foo("%s", fixedstr) over foo(fixedstr).
One day someone might change fixedstr to include '%' characters.
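The same hazard exists with Perl-level printf, for example (a generic sketch, not the actual call site):

my $msg = 'load at 100% of capacity';
printf($msg);          # the '%' is treated as a format directive; output is mangled
printf("%s", $msg);    # safe: the string is passed as data, not as a format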
|
|
|
|
|
|
|
| |
The code for handling locale can be moved entirely to the place where
locale handling is done for the second string, as by that time we have
processed both strings. Since we only succeed if both are atomic,
single-byte characters, we don't need to do the loop below.
|
|
|
|
| |
This doesn't appear to actually break anything.
|
|
|
|
|
|
| |
A new flag is now passable to this function to indicate that locale
rules should be used for code points below 256; Unicode rules are still
used above 255. Folds which cross that boundary are disallowed.
|
| |
| |
Previously this used a home-grown definition of an identifier start,
stemming from a bug in some early Unicode versions. This led to some
problems, fixed by #74022.
But the home-grown solution did not track Unicode, and allowed
characters, such as marks, to begin words when they shouldn't. This
change brings this macro into compliance with Unicode going forward.
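A rough Perl-level illustration of the Unicode rule involved, using the XID_Start/XID_Continue properties (whether the macro maps to exactly these properties is an assumption here):

# A combining mark may continue an identifier but must not start one.
print "starts\n"    if "a"        =~ /^\p{XID_Start}/;      # letter: yes
print "starts\n"    if "\x{0300}" =~ /^\p{XID_Start}/;      # COMBINING GRAVE ACCENT: no
print "continues\n" if "\x{0300}" =~ /^\p{XID_Continue}/;   # but it may continue one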
|
|
|
|
|
| |
If this option is set, any match that has a non-ASCII character that has
an ASCII character in its fold will not match that fold.
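Assuming this is the option exposed at the Perl level as the /aa modifier, the visible effect is along these lines:

# U+017F (LATIN SMALL LETTER LONG S) folds to the ASCII letter 's'.
print "folds\n"   if "\x{17F}" =~ /s/i;     # matches under plain /i folding
print "blocked\n" if "\x{17F}" =~ /s/iaa;   # no match: ASCII/non-ASCII folds are refused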
|
|
|
|
|
| |
The parameter doesn't do anything yet. The old version becomes a macro
calling the new version with 0 as the flags.
|
|
|
|
|
|
|
|
|
|
| |
The recent move of Unicode folding to the compilation phase caused
spurious warnings during the miniperl build phase of Perl itself,
before the Unicode tables get built. Before the tables are built, Perl
cannot know the Unicode semantics (it has only ASCII/Latin1 hard-coded
in), but it was still trying to access the tables. Now it checks first,
and if the tables aren't present it uses just the hard-coded
ASCII/Latin1 semantics.
|
|
|
|
|
|
|
|
|
|
| |
This is for security as well as performance. It allows Unicode
properties to be matched case-insensitively. As a result, the swash
inversion hash is converted from having UTF-8 keys to numeric (code
point) keys.
It also fixes, for the first time, the bug where /i doesn't work for a
code point that has a multi-character fold and is not at the end of a
range in a bracketed character class.
|
|
|
|
| |
This shouldn't be called from XS code.
|
| |
Going forward, the intent is to convert from swashes to the better-suited
inversion list data structure. This adds rudimentary inversion lists that
have only the functionality needed for 5.14. As a result, they are kept as
static as possible to one file.
What's necessary for 5.14 is enough to allow folding of ANYOF nodes to be
moved from regexec to regcomp. They are needed for that in order to
generate class definitions that are as compact as possible; otherwise,
very long linear lists might be generated. (They still may be, but that's
inherent in the problem domain; this generates output that is as compact
as possible, combining overlapping ranges, etc.)
The only two non-trivial methods in this object are from published
algorithms.
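For readers unfamiliar with the structure, here is a minimal Perl sketch of an inversion list and its membership test (illustrative only, not the core's code): the list holds the sorted start points of alternating in/out ranges, and a code point is in the set when the search lands on an even slot.

# Inversion list for the set { 0x41..0x5A, 0x61..0x7A } (the ASCII letters):
# an element at an even index starts a range that is IN the set; the next
# element is the first code point that is NOT in the set.
my @inv = (0x41, 0x5B, 0x61, 0x7B);

sub invlist_contains {
    my ($inv, $cp) = @_;
    my ($lo, $hi) = (0, scalar @$inv);
    while ($lo < $hi) {                    # binary search: first element > $cp
        my $mid = int(($lo + $hi) / 2);
        if ($inv->[$mid] <= $cp) { $lo = $mid + 1 } else { $hi = $mid }
    }
    return $lo > 0 && ($lo - 1) % 2 == 0;  # even slot index => inside the set
}

print invlist_contains(\@inv, ord 'q') ? "in\n" : "out\n";   # in
print invlist_contains(\@inv, ord '!') ? "in\n" : "out\n";   # out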
|
|
|
|
|
|
|
|
| |
Ideally it would be available; calling Perl_newSVpvf_nocontext
directly is an alternative, but the comment in sv.c makes that
questionable.
Since the function this is called from already has a context, use it.
|
|
|
|
| |
This tidies things up after several of them were removed.
|
|
|
|
| |
The routines that these call used the warn_d forms; so these should as well.
|
|
|
|
|
|
|
|
| |
The non-Unicode code points have no Unicode semantics, so applying
operations such as casing to them warns.
This patch also includes the changes to test the warnings added by
recent commits for handling surrogates and above-Unicode code points.
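A Perl-level example of the sort of warning involved (the exact wording varies between versions):

use warnings;
my $above = chr(0x110_000);   # one past Unicode's last code point, U+10FFFF
my $upper = uc $above;        # warns along the lines of:
                              #   Operation "uc" returns its argument for
                              #   non-Unicode code point 0x110000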
|
|
|
|
| |
outdent in response to the enclosing block being removed
|
| |
Surrogates, non-character code points, and code points that aren't in Unicode
are now allowed by default, instead of having to specify a flag to allow them.
(Most code did specify those flags anyway.)
This affects uvuni_to_utf8_flags(), utf8n_to_uvuni() and various routines that
are specialized interfaces to them.
Now there is a new set of flags to disallow those code points. Further, all 66
of the non-character code points are known about and handled consistently,
instead of just U+FFFF.
Code that requires these code points to be forbidden will have to change to use
the new flags. I have looked at all the (few) instances in CPAN where these
routines are used, and the only one I found that appears to have need to do
this, Encode, has already been patched to accommodate this change. Of course,
I may have overlooked some subtleties.
|
| |
|
|
|
|
|
|
|
| |
This new function looks for problematic code points on output, and
warns if any are found, returning FALSE as well.
What it warns about may change, so it is marked as experimental.
|
|
|
|
|
|
| |
This gives a small space saving on this platform, likely due to code being
shared with the other call to call_sv(). (It also removes a level of function
call at runtime.)
|
|
|
|
|
|
|
|
|
| |
Historically Perl_swash_init() called Perl_gv_fetchmeth() simply to determine
if the requested package was loaded, and if not, attempt to load it. However,
Perl_gv_fetchmeth() is actually making the same lookup as Perl_call_method()
uses to get a pointer to the relevant method. Hence if we get a non-NULL
return from Perl_gv_fetchmeth() we can pass it directly to Perl_call_sv(), and
save duplicated work.
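A Perl-level analogy for the work being saved (an analogy only; the actual change is in the C paths named above):

package My::Class;
sub new { bless {}, shift }

package main;
# Instead of checking that a method exists and then dispatching by name
# (two lookups), keep the code ref returned by can() and call it directly.
my $class = 'My::Class';
if (my $method = $class->can('new')) {
    my $obj = $class->$method();    # no second name-based lookup
    print ref $obj, "\n";           # My::Class
}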
|
| |
|
| |
|
| |
|
| |
| |
Convert sv_eq_flags() and sv_cmp_flags() to use it.
Previously, to compare two strings of characters, where one was in UTF-8
and one was not, you had to either:
1: Upgrade the second to UTF-8
2: Compare the resulting octet sequence
3: Free the temporary UTF-8 string
or:
1: Attempt to downgrade the first to bytes. If it can't be, they aren't equal
2: Else compare the resulting octet sequence
3: Free the temporary byte string
For the general case this involves a malloc()/free() and at least two O(n)
scans per comparison, whereas this approach has no allocation and a single
O(n) scan, which terminates as early as the best case of the second
approach.
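A rough Perl sketch of the single-scan idea, operating on plain byte strings (illustrative only, not the actual C implementation): decode the UTF-8 side one character at a time and compare it against the Latin-1 side, with no temporary copy of either string.

# Compare a Latin-1 byte string with a UTF-8 encoded byte string for
# equality in a single left-to-right scan, without upgrading or downgrading.
sub latin1_eq_utf8 {
    my ($latin1, $utf8) = @_;
    my ($i, $pos) = (0, 0);
    while ($pos < length $utf8) {
        return 0 if $i >= length $latin1;              # UTF-8 side is longer
        my $b = ord substr($utf8, $pos, 1);
        my $cp;
        if ($b < 0x80) {                               # 1-byte sequence (ASCII)
            ($cp, $pos) = ($b, $pos + 1);
        }
        elsif (($b & 0xE0) == 0xC0 && $pos + 1 < length $utf8) {
            my $b2 = ord substr($utf8, $pos + 1, 1);   # 2-byte sequence, U+0080..U+07FF
            ($cp, $pos) = ((($b & 0x1F) << 6) | ($b2 & 0x3F), $pos + 2);
        }
        else {
            return 0;    # longer sequences encode > U+07FF and can't equal Latin-1
        }
        return 0 if $cp != ord substr($latin1, $i++, 1);
    }
    return $i == length $latin1;                       # both exhausted together
}

print latin1_eq_utf8("caf\xE9", "caf\xC3\xA9") ? "equal\n" : "differ\n";   # equal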
|
| |
This adds _swash_inversion_hash(), which takes a mapping swash and returns
a hash that is the inverse relation. That is, given a code point, it
allows quick lookup of all code points that map to it.
The function is not for public use, as it will likely be revised, so it is
not in the public API, and its name begins with an underscore.
It does not deal with multi-char mappings at this time, nor other swash
complications.
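Conceptually, the inversion just turns a many-to-one mapping around; a trivial Perl sketch with made-up sample data (the real swash structures are internal to the core):

# A few simple fold mappings, keyed and valued by code point.
my %fold = (
    0x41   => 0x61,    # 'A' folds to 'a'
    0xC5   => 0xE5,    # LATIN CAPITAL LETTER A WITH RING ABOVE folds to U+00E5
    0x212B => 0xE5,    # ANGSTROM SIGN also folds to U+00E5
);

# Invert it: for each target, collect every code point that maps to it.
my %inverse;
push @{ $inverse{ $fold{$_} } }, $_ for keys %fold;

# All code points whose fold is U+00E5 can now be looked up directly.
printf "U+%04X\n", $_ for sort { $a <=> $b } @{ $inverse{0xE5} };   # U+00C5, U+212B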
|
|
|
|
|
|
| |
This patch moves the code that reads a single line from the main body of
an input Unicode property table into a separate subroutine. This is in
preparation for using it from another place.
|
|
|
|
| |
I added comments as I was reading the code, trying to understand it.
|
| |
| |
As discussed on p5p, ibcmp has different semantics from other cmp
functions in that it is a binary instead of a ternary function. It is
less confusing, then, to have a name that implies true/false.
There are three functions affected: ibcmp, ibcmp_locale and ibcmp_utf8.
ibcmp is actually equivalent to foldNE, but for the same reason that
things like 'unless' and 'until' are cautioned against, I changed the
functions to foldEQ, so that the existing names, like ibcmp_utf8, are
defined as macros that are the complement of foldEQ.
This patch also changes the one file where turning ibcmp into a macro
causes problems; it now uses the new name. It also documents ibcmp,
ibcmp_locale and their new names for the first time.
|
| |
|
|
|
|
|
|
|
| |
This removes the comment about the function name, and converts tabs to
blanks throughout the function, as so much of it is changing already.
It also removes trailing whitespace in other lines of the file.
|
| |
I had a hard time understanding how this routine worked; there were no
comments. In figuring it out, I discovered it could be made more
efficient. This routine is called over and over in the innermost loops
in regex matching, so efficiency is a concern.
Setup is done once before the main while loop so that it now has two
conditions instead of eight. The loop was rearranged slightly to be
smaller and a couple of unneeded assignments to temporaries were
removed, and recomputation of some values was avoided. Several other
small efficiency changes were made.
Several asserts had been commented out, saying that they make tests
fail. But they no longer do, at least on my platform. There was a
reason that they were asserts to begin with, and that is they denoted an
insane or trivial condition. Apparently there have been fixes to the
other code calling this, so I re-enabled them.
The names of several variables were changed to be less confusing; for
example, f1 now means the fold buffer for string 1, whereas it used to
mean its goal, which is now g1.
The leading indent was changed from 5 to 4 blanks; I made enough other
changes that I didn't submit this as a separate commit.
|
| |
| |
Users can define their own case changing mappings to replace the
standard ones. Prior to this patch, any mappings on characters whose
ordinals are 0-222, 224-255 that resulted in multiple characters were
ignored.
Note that there still is a deficiency in that the mappings will be
applied only to strings in utf8 format.
|
| |
If a character folds to multiple ones in case-insensitive matching,
it should not match just one of those, or the regular expression can
loop. For example, \N{LATIN SMALL LIGATURE FF} folds to 'ff', and so
"\N{LATIN SMALL LIGATURE FF}" =~ /f+/i
should match. Prior to this patch, this function returned that there is
a match, but left the matching string pointer at the beginning of the
"\N{LATIN SMALL LIGATURE FF}" because it doesn't make sense to match
just half a character, and at this level it doesn't know about the '+'.
This leaves things in an inconsistent state, with the reporting of a
match, but the input pointer unchanged, the result of which is a loop.
I don't know how to fix this so that it correctly matches, and there are
semantic issues with doing so. For example, if
"\N{LATIN SMALL LIGATURE FF}" =~ /ff/i
matches, then one would think that so should
"\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i
But $1 and $2 don't really make sense here, since they would each refer
to half of the same character.
So this patch just returns failure if only a partial character is
matched. That leaves things consistent, and solves the problem of
looping, so that Perl doesn't hang on such a construct, but leaves the
ultimate solution for another day.
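A Perl-level sketch of the situation (only the whole-character case is shown matching, since the partial-match behaviour is exactly what is unresolved here):

# U+FB00 (LATIN SMALL LIGATURE FF) folds to the two characters "ff".
print "whole\n" if "\x{FB00}" =~ /ff/i;   # matches: the entire fold is consumed
# A pattern that could consume only one of the folded "f"s is the case this
# commit makes fail outright, rather than reporting a match without advancing
# the input pointer (which is what could loop).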
|
| |
|
| |
|