| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
| |
The previous commit added macros to do some case changing. This
commit uses them in the core, where appropriate.
|
| |
|
| |
|
|
|
|
| |
This also reorders one #define to be closer to a related one.
|
|
|
|
|
|
|
|
|
|
|
| |
These macros should not be used as they are prone to misuse. There are
no occurrences of them in CPAN. The single use of either of them in
core has recently been removed (commit
8d40577bdbdfa85ed3293f84bf26a313b1b92f55), because it was a misuse.
Instead code should use isIDFIRST_lazy_if or isWORDCHAR_lazy_if
(isALNUM_lazy_if is also available, but can be confused with the Posix
alnum, which it doesn't mean).
|
|
|
|
|
| |
This is a synonym for the existing isALNUM_lazy_if(), which can be
confused with meaning the Posix alnum instead of the Perl \w.
|
|
|
|
|
|
| |
If a char* is passed prior to this commit, an above-ASCII char could
have been considered negative instead of positive, and thus screwed up
these tests
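
The signedness problem can be illustrated with a minimal sketch. The macro and function names here are hypothetical and are not the ones changed by this commit; the point is only that a byte fetched through a plain `char*` may be negative, so a range test needs a cast to an unsigned type first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macro, for illustration only.  Casting to unsigned char
 * keeps bytes 0x80..0xFF positive even on platforms where plain char is
 * signed; without the cast, such a byte would compare as less than 0x80. */
#define IS_ASCII_BYTE(c)  (((unsigned char) (c)) < 0x80)

/* Returns 1 if the first byte of s is ASCII, 0 otherwise. */
static int first_byte_is_ascii(const char *s)
{
    assert(s != NULL);
    return IS_ASCII_BYTE(*s);
}
```

With the cast in place, `first_byte_is_ascii("\xC3\xA9")` is false regardless of whether the platform's `char` is signed.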
|
|
|
|
|
| |
This apparently hasn't caused us problems, but all uses of a macro
parameter should be parenthesized to prevent surprises.
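
A small sketch of the kind of surprise meant here, using hypothetical macros rather than the ones this commit touches. When the argument is itself an expression, operator precedence can regroup it unless the parameter is parenthesized.

```c
#include <assert.h>

/* Hypothetical macros, for illustration only. */
#define LOW_NIBBLE_UNSAFE(c)  (c & 0x0F)     /* parameter not parenthesized */
#define LOW_NIBBLE_SAFE(c)    ((c) & 0x0F)   /* parameter parenthesized */

/* Demonstrates the difference when the argument contains an operator
 * of lower precedence than '&'. */
static void check_low_nibble_macros(void)
{
    /* Expands to (0xF0 | 0x01 & 0x0F), i.e. 0xF0 | (0x01 & 0x0F). */
    assert(LOW_NIBBLE_UNSAFE(0xF0 | 0x01) == 0xF1);

    /* Expands to ((0xF0 | 0x01) & 0x0F), which is what was intended. */
    assert(LOW_NIBBLE_SAFE(0xF0 | 0x01) == 0x01);
}
```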
|
|
|
|
|
|
| |
Some compilers can't handle unexpanded macros longer than something
like 8000 characters. So we split up long ones into sub macros to work
around the problem
|
|
|
|
|
|
|
|
|
|
| |
The macro used traditionally to see if there is a two-byte UTF-8
sequence doesn't make sure that there actually is a second byte
available; it only checks if the first byte indicates that there is.
This adds a macro that is safe in the face of malformed UTF-8.
I inspected the existing calls in the core to the unsafe macro, and I
believe that none of them need to be converted to the safe version.
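
The difference between the two kinds of check can be sketched as follows. The function names are hypothetical, the byte ranges assume an ASCII platform, and the extra test that the second byte is a continuation byte is an assumption of this sketch rather than something the commit message states.

```c
#include <assert.h>
#include <stddef.h>

/* True if the byte has the bit pattern 110xxxxx, which announces a
 * two-byte sequence.  This looks at the first byte only. */
static int starts_two_byte(unsigned char b)
{
    return (b & 0xE0) == 0xC0;
}

/* True if the byte has the bit pattern 10xxxxxx. */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Safe variant: s points at the current byte, e one past the end of the
 * buffer.  The length test comes first, so s[1] is never read when only
 * one byte remains. */
static int two_byte_char_available(const unsigned char *s,
                                   const unsigned char *e)
{
    assert(s != NULL && e != NULL && e >= s);
    return (e - s) > 1 && starts_two_byte(s[0]) && is_continuation(s[1]);
}
```

For a buffer that ends right after `0xC3`, `starts_two_byte()` still reports true, while `two_byte_char_available()` reports false.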
|
|
|
|
| |
A future commit will #include this from another header
|
|
|
|
|
|
| |
It occurred to me that EBCDIC has different maximums for the number of
bytes a character can occupy. This moves the definition in utf8.h to
within an #ifndef EBCDIC, and adds the correct values to utfebcdic.h
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This takes the output of regen/regcharclass.pl for all the 1-4 byte
UTF8-representations of Unicode code points, and replaces the current
hand-rolled definition there. It does this only for ASCII platforms,
leaving EBCDIC to be machine generated when run on such a platform.
I would rather have both versions regenerated each time they are
needed, to avoid an EBCDIC dependency, but it takes more than 10 minutes
on my computer to process the 2 billion code points that have to be
checked for on ASCII platforms, and currently t/porting/regen.t runs
this program every time; that slowdown would be unacceptable. If
this is ever run under EBCDIC, the macro should be machine computed
(very slowly). So, even though there is an EBCDIC dependency, it has
essentially been solved.
|
|
|
|
|
|
|
|
|
|
|
| |
regen/regcharclass.pl has been enhanced in previous commits so that it
generates as good code as these hand-defined macro definitions for
various UTF-8 constructs. And, it should be able to generate EBCDIC
ones as well. By using its definitions, we can remove the EBCDIC
dependencies for them. It is quite possible that the EBCDIC versions
were wrong, since they have never been tested. Even if
regcharclass.pl has bugs under EBCDIC, it is easier to find and fix
those in one place, than all the sundry definitions.
|
|
|
|
|
| |
By adding a mask, we can save a branch. The two expressions match the
exact same code points.
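
The commit message does not say which expressions were involved, so the following is only a generic sketch of the technique, with hypothetical function names. It tests whether a byte is 0xC2 or 0xC3, first with two comparisons and then with one mask and one comparison.

```c
#include <assert.h>

/* Two comparisons joined by ||, which typically compiles to a branch. */
static int is_c2_or_c3_branch(unsigned char b)
{
    return b == 0xC2 || b == 0xC3;
}

/* 0xC2 and 0xC3 differ only in the lowest bit, so clearing that bit
 * with a mask leaves a single comparison. */
static int is_c2_or_c3_mask(unsigned char b)
{
    return (b & 0xFE) == 0xC2;
}

/* Checks that both forms agree for every possible byte value. */
static void check_equivalence(void)
{
    int i;
    for (i = 0; i < 256; i++)
        assert(is_c2_or_c3_branch((unsigned char) i)
               == is_c2_or_c3_mask((unsigned char) i));
}
```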
|
|
|
|
| |
This reflows some lines to fit into 80 columns
|
|
|
|
|
|
|
|
|
| |
These macros were incorrect for EBCDIC. The relationships are based on
I8, the intermediate-utf8 defined for UTF-EBCDIC, not the final encoding.
I was the culprit who did this originally; I was confused by the names of
the conversion macros. I'm adding names that are clearer to me; which
have already been defined in utfebcdic.h, but weren't defined for
non-EBCDIC platforms.
|
|
|
|
|
|
| |
A new regen'd header file has been created that contains the native
values for certain characters. By using those macros, we can eliminate
EBCDIC dependencies.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A binary swash is a hash of bitmaps used to cache the results of looking
up if a code point matches a Unicode property or regex bracketed
character class. An inversion list is a data structure that also holds
information about what code points match a Unicode property or character
class. It is implemented as an SV* to a sorted C array, and hence can
be searched using a binary search.
This patch converts to using a binary search of an inversion list
instead of a hash look-up for inversion lists that are no more than 512
elements (9 iterations of the search loop). That number can be easily
adjusted, if necessary.
Theoretically, a hash is faster than a binary lookup over a very long
period. So this may negatively impact long-running servers. But in the
short run, where most programs reside, the binary search is
significantly faster.
A swash is filled as necessary as time goes on, caching each new
distinct code point it is called with. If it is called with many, many
such code points, its performance can degrade as collisions increase. A
binary search does not have that drawback. However, most real-world
scenarios do not have a program being called with huge numbers of
distinct code points. Mostly, the program will be called with code
points from just one or a few of the world's scripts, so will remain
sparse. The bitmaps in a swash are each 64 bits long (except for ASCII,
where it is 128). That means that when the swash is populated, a lookup
of a single code point that hasn't been checked before will have to
lookup the 63 adjoining code points as well, increasing its startup
overhead. Of course, if one of those 63 code points is later accessed,
no extra populate happens. This is a typical case where a language's
code points are all near each other.
The bottom line, though, is in the short term, this patch speeds up the
processing of \X regex matching about 35-40%, with modern Korean (which
has uniquely complicated \X processing) closer to 40%, and other scripts
closer to 35%.
The 512 boundary means that over 90% of the official Unicode properties
are handled using binary search. I settled on that number by
experimenting with several properties besides \X and with various
powers-of-2 limits. Until I got that high, performance kept improving
when the property went from being a swash to a binary search. \X
improved even up to 2048, which encompasses 100% of the official Unicode
properties.
The implementation changes so that an inversion list instead of a swash
is returned by swash_init() when the input flags allow it to do so, for
all inversion lists shorter than the compiled in constant of 512
(actually <= 512). The other functions that access swashes have added
intelligence to deal with an object of either type. Should someone in
CPAN be using the public swash_init() interface, they will not see any
difference, as the option to get an inversion list is not available to
them.
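
The lookup described above can be sketched roughly as follows. This is not the code from the patch: the function name and the plain `uint32_t` array are assumptions, standing in for the SV-wrapped C array. The sketch assumes the usual inversion-list layout, in which even-indexed elements begin ranges that match and odd-indexed elements begin ranges that do not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if cp is matched by the inversion list, 0 otherwise.
 * list must be sorted in ascending order. */
static int invlist_contains(const uint32_t *list, size_t len, uint32_t cp)
{
    size_t lo = 0;
    size_t hi = len;            /* search the half-open range [lo, hi) */

    assert(list != NULL || len == 0);

    /* Binary search for the number of elements that are <= cp. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (list[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* lo is now the count of elements <= cp.  An odd count means the
     * last boundary crossed was the start of a matching range. */
    return (lo & 1) != 0;
}
```

For example, the list `{0x30, 0x3A, 0x41, 0x5B}` matches the ASCII digits and uppercase letters. With at most 512 elements the loop halves the search range about nine times, which is where the figure of 9 iterations comes from.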
|
|
|
|
|
|
|
| |
Now that we have a flags parameter, we can pass this parameter as
just another flag, giving a cleaner interface to this internal-only
function. This also renames the flag parameter to <flag_p> to indicate
it needs to be dereferenced.
|
|
|
|
|
| |
This function only does something on EBCDIC platforms. On ASCII ones,
make it a macro, like similar ones, to avoid useless function nesting.
|
|
|
|
|
|
|
|
|
|
|
| |
This revises the API for the version of swash_init() that is usable
by core Perl. The external interface is unaffected. There is now a
flags parameter to allow for future growth. And the core internal-only
function that returns if a swash has a user-defined property in it or
not has been removed. This information is now returned via the new
flags parameter upon initialization, and is unavailable afterwards.
This is to prepare for the flexibility to change the swash that is
needed in future commits.
|
|
|
|
|
|
| |
Add a mnemonic definition for these three code points. They are
currently used in only one place, but future commits will use them
elsewhere.
|
|
|
|
|
| |
This updates the editor hints in our files for Emacs and vim to request
that tabs be inserted as spaces.
|
|
|
|
|
|
|
|
|
|
| |
Under /iaa regex matching, folds that cross the ASCII/non-ASCII
boundary are prohibited. This changes _to_uni_fold_flags() and
_to_utf8_fold_flags() functions to take a new flag which, when set,
tells them to not accept such folds.
This allows us to later move the intelligence for handling this
situation to these centralized functions.
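
As a rough sketch of the test such a flag implies (the function name is hypothetical and this is not the code added to the fold functions), a fold crosses the boundary when the source code point and any code point of its fold lie on different sides of 0x80. KELVIN SIGN (U+212A), which folds to ASCII 'k', is one such case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if folding 'from' to the sequence to[0..to_len-1] crosses
 * the ASCII/non-ASCII boundary, 0 otherwise. */
static int fold_crosses_ascii_boundary(uint32_t from,
                                       const uint32_t *to, size_t to_len)
{
    size_t i;
    const int from_is_ascii = from < 0x80;

    assert(to != NULL && to_len > 0);

    for (i = 0; i < to_len; i++) {
        if ((to[i] < 0x80) != from_is_ascii)
            return 1;
    }
    return 0;
}
```

A caller honoring the flag would reject a crossing fold and fall back to some other result; what that fallback should be is outside the scope of this sketch.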
|
|
|
|
|
| |
This should speed things up slightly, as it looks directly at the UTF-8
source, instead of having to decode it first.
|
|
|
|
|
|
|
|
|
| |
These expressions, while valid, are overly complicated in order to make
it easy to separate out problematic code points in the future, such as
surrogates. But we made a decision in 5.12 to not go in that direction,
but to accept such problematic code points in general. I haven't
heard any cause to regret that decision; if we ever want to go back, the
blame log will easily allow us to.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The prior version had a number of issues, some of which have been taken
care of in previous commits.
The goal when presented with malformed input is to consume as few bytes
as possible, so as to position the input for the next try to the first
possible byte that could be the beginning of a character. We don't want
to consume too few bytes, so that the next call has us thinking that
what is the middle of a character is really the beginning; nor do we
want to consume too many, so as to skip valid input characters. (This
is forbidden by the Unicode standard because of security
considerations.) The previous code could do both of these under various
circumstances.
In some cases it took as a given that the first byte in a character is
correct, and skipped looking at the rest of the bytes in the sequence.
This is wrong when just that first byte is garbled. We have to look at
all bytes in the expected sequence to make sure it hasn't been
prematurely terminated from what we were led to expect by that first
byte.
Likewise when we get an overflow: we have to keep looking at each byte
in the sequence. It may be that the initial byte was garbled, so that
it appeared that there was going to be overflow, but in reality, the
input was supposed to be a shorter sequence that doesn't overflow. We
want to have an error on that shorter sequence, and advance the pointer
to just beyond it, which is the first position where a valid character
could start.
This fixes a long-standing TODO from an externally supplied utf8 decode
test suite.
And, the old algorithm for finding overflow failed to detect it on some
inputs. This was spotted by Hugo van der Sanden, who suggested the new
algorithm that this commit uses, and which should work in all instances.
For example, on a 32-bit machine, any string beginning with "\xFE" and
having the next byte be either "\x86" or "\x87" overflows, but this was
missed by the old algorithm.
Another bug was that the code was careless about what happens when a
malformation occurs that the input flags allow. For example, a sequence
should not start with a continuation byte. If that malformation is
allowed, the code pretended it is a start byte and extracts the "length"
of the sequence from it. But pretending it is a start byte is not the
same thing as it actually being a start byte, and so there is no
extractable length in it, so the number that this code thought was
"length" was bogus.
Yet another bug fixed is that if only the warning subcategories of the
utf8 category were turned on, and not the entire utf8 category itself,
warnings were not raised that should have been.
And yet another change is that given malformed input with warnings
turned off, this function used to return whatever it had computed so
far, which is incomplete or erroneous garbage. This commit changes to
return the REPLACEMENT CHARACTER instead.
Thanks to Hugo van der Sanden for reviewing and finding problems with an
earlier version of these commits
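
The "\xFE" example can be reproduced with a simplified sketch. Neither function below is the Perl code: the first is a naive wraparound test standing in for an overflow check that can miss cases, and the second tests the bits that are about to be shifted out, which is one way to detect overflow reliably. The sketch assumes a 32-bit accumulator and assumes that "\xFE" contributes no payload bits and is followed by six continuation bytes of six payload bits each.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Naive check: report overflow only if the accumulated value became
 * smaller after adding a byte.  cont[] holds the continuation bytes
 * that follow the start byte. */
static int overflows_wraparound_check(const unsigned char *cont, size_t n)
{
    uint32_t uv = 0;
    size_t i;

    assert(cont != NULL);
    for (i = 0; i < n; i++) {
        uint32_t next = (uint32_t) (uv << 6) | (cont[i] & 0x3Fu);
        if (next < uv)
            return 1;
        uv = next;
    }
    return 0;
}

/* Mask check: before shifting left by 6, test whether any of the top
 * 6 bits are set.  If so, the shift would discard them. */
static int overflows_mask_check(const unsigned char *cont, size_t n)
{
    uint32_t uv = 0;
    size_t i;

    assert(cont != NULL);
    for (i = 0; i < n; i++) {
        if (uv & 0xFC000000u)
            return 1;
        uv = (uint32_t) (uv << 6) | (cont[i] & 0x3Fu);
    }
    return 0;
}
```

For the continuation bytes `86 80 80 80 80 80`, the accumulator holds 0x06000000 before the last shift. The shift truncates to 0x80000000, which is larger than the previous value, so the naive test sees nothing wrong, while the mask test finds a set bit among the top six and reports the overflow.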
|
|
|
|
|
|
| |
The previous definition allowed for (illegal) overlongs. The uses of
this macro in the core assume that it is accurate. The inaccuracy can
cause such code to fail.
|
|
|
|
|
|
|
|
|
| |
Previously, the macro changed by this commit would accept overlong
sequences.
This patch was changed by the committer to include EBCDIC changes;
and in the non-EBCDIC case, to save a test, by using a mask instead, in
keeping with the prior version of the code
|
|
|
|
|
|
| |
Commit 66cbab2c91fca8c9abc65a7231a053898208efe3 changed the definition
of IN_UNI_8_BIT, but in so doing, lost the 2nd line of the macro; and I
did not catch it. Tests will be added shortly.
|
|
|
|
|
| |
This adds the parameter handling, tests, and documentation for this new
feature which allows locale and Unicode to play well with each other.
|
|
|
|
|
|
|
| |
Sync copyright dates with actual changes according to git history.
[Plus run regen_perly.h to update the SHA-256 checksums, and
regen/regcharclass.pl to update regcharclass.h]
|
|
|
|
|
|
|
|
|
|
| |
This changes the 4 case changing functions to take extra parameters to
specify if the utf8 string is to be processed under locale rules when
the code points are < 256. The current functions are changed to macros
that call the new versions so that current behavior is unchanged.
An additional, static, function is created that makes sure that the
255/256 boundary is not crossed during the case change.
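
The compatibility pattern described here, where an old entry point becomes a macro that calls the new one with a default argument, might look like the sketch below. All names are hypothetical, and the ASCII-only case mapping is a stand-in for the real Unicode tables; the boundary-guarding static function mentioned above is not modeled.

```c
#include <assert.h>
#include <ctype.h>

/* New-style function with an extra parameter.  When use_locale is set
 * and the code point is below 256, the C library's locale-aware
 * toupper() is used; otherwise a fixed rule applies. */
static unsigned int upper_flags(unsigned int cp, int use_locale)
{
    assert(use_locale == 0 || use_locale == 1);

    if (use_locale && cp < 256)
        return (unsigned int) toupper((int) cp);

    /* Stand-in for a real Unicode lookup: ASCII letters only. */
    if (cp >= 'a' && cp <= 'z')
        return cp - ('a' - 'A');
    return cp;
}

/* The old name survives as a macro that passes the default flag, so
 * existing callers see unchanged behavior. */
#define upper(cp)  upper_flags((cp), 0)
```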
|
|
|
|
|
| |
These weren't caught because this code is compiled only on an EBCDIC
platform, and I had to fake it to force the compilation.
|
|
|
|
|
| |
This is based on my eyeballing a file I had generated of the encodings
for Unicode code points, so could be wrong. It does compile
|
|
|
|
| |
This indents for clarity with the surrounding #if and #endif.
|
| |
|
|
|
|
|
| |
This adds flags so that if one of the input strings is known to already
have been folded, this routine can skip the (redundant) folding step.
|
| |
|
|
|
|
|
|
| |
The macros that these call have been revised to do the same checks,
enhanced to not call the functions for all of Latin1, not just ASCII as
these did. So the tests here are redundant.
|
|
|
|
|
|
|
|
|
|
| |
And also to_uni_fold().
The flag allows retrieving either simple or full folds.
The interface is subject to change, so these are marked experimental
and their names begin with underscore. The old versions are turned
into macros calling the new versions with the correct extra parameter.
|
| |
|
|
|
|
| |
It can't just be large enough to hold the Unicode subset.
|
|
|
|
|
| |
These will be used in a future commit; the ordinals are different on
EBCDIC vs. ASCII
|
|
|
|
|
| |
These were defined in a .c, but now there is need for them in another .c,
so move them to a header.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is the foundation for fixing the regression RT #82610. My analysis
was wrong that two bits could be shared, at least not without further
work. This changes to use a different mechanism to pass needed
information to regexec.c so that another bit can be freed up and, in a
later commit, the two bits can become unshared again.
The bit that is freed up is ANYOF_UTF8, which basically said there is
something that is matched outside the ANYOF bitmap, and requires the
target string to be in utf8. This changes things so the existence of
something besides the bitmap indicates this, and so no flag is needed.
The flag bit ANYOF_NONBITMAP_NON_UTF8 remains to indicate that there is
something that should be matched outside the bitmap even if the target
string isn't in utf8.
|
|
|
|
|
|
| |
A new flag is now passable to this function to indicate to use locale
rules for code points below 256. Unicode rules are still used for above
255. Folds which cross that boundary are disallowed.
|