| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
| |
This changes the 4 case changing functions to take extra parameters to
specify if the utf8 string is to be processed under locale rules when
the code points are < 256. The current functions are changed to macros
that call the new versions so that current behavior is unchanged.
An additional, static, function is created that makes sure that the
255/256 boundary is not crossed during the case change.
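A minimal sketch of the pattern described above, not the actual implementation: the old entry point becomes a macro that passes 0 for the new flags parameter, and a static helper rejects any case change that crosses the 255/256 boundary. All names (sketch_to_upper_flags, SKETCH_USE_LOCALE, toy_unicode_upper) are hypothetical, the mapping table is a toy, and real locale lookups are not modelled.

```c
#include <stdint.h>

#define SKETCH_USE_LOCALE 0x1   /* hypothetical flag name */

/* Toy stand-in for a real Unicode uppercase table (assumption). */
static uint32_t toy_unicode_upper(uint32_t cp)
{
    if (cp >= 'a' && cp <= 'z') return cp - 0x20;
    if (cp == 0xE9) return 0xC9;    /* stays below 256 */
    if (cp == 0xFF) return 0x178;   /* crosses the 255/256 boundary */
    return cp;
}

/* Return the changed code point only if it stays on the same side of
 * the 255/256 boundary as the original; otherwise keep the original. */
static uint32_t keep_within_boundary(uint32_t original, uint32_t changed)
{
    int original_low = original < 256;
    int changed_low = changed < 256;
    return (original_low == changed_low) ? changed : original;
}

/* New-style function taking the extra flags parameter. */
uint32_t sketch_to_upper_flags(uint32_t cp, unsigned flags)
{
    uint32_t changed = toy_unicode_upper(cp);
    if (flags & SKETCH_USE_LOCALE)
        return keep_within_boundary(cp, changed);
    return changed;
}

/* Old name kept as a macro, so existing callers see unchanged behavior. */
#define sketch_to_upper(cp) sketch_to_upper_flags((cp), 0)
```

Existing callers keep using sketch_to_upper(); only callers that care about locale pass the flag explicitly.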
|
|
|
|
|
| |
These weren't caught because the code is compiled only on an EBCDIC platform,
and I had to fake one to force the compilation.
|
|
|
|
|
| |
This is based on my eyeballing a file I had generated of the encodings
for Unicode code points, so it could be wrong. It does compile.
|
|
|
|
| |
This indents for clarity with the surrounding #if and #endif.
|
| |
|
|
|
|
|
| |
This adds flags so that if one of the input strings is known to already
have been folded, this routine can skip the (redundant) folding step.
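A sketch of the idea, assuming ASCII-only folding via tolower() and hypothetical flag names; the real routine handles utf8 and full Unicode folds, which this does not attempt.

```c
#include <ctype.h>
#include <stddef.h>

#define SKETCH_S1_ALREADY_FOLDED 0x1   /* hypothetical flag names */
#define SKETCH_S2_ALREADY_FOLDED 0x2

/* Return 1 if the two byte strings are equal under (ASCII) case folding.
 * A string flagged as already folded is compared as-is, skipping the
 * redundant fold of each of its characters. */
int sketch_fold_eq(const char *s1, const char *s2, size_t len, unsigned flags)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char c1 = (unsigned char)s1[i];
        unsigned char c2 = (unsigned char)s2[i];
        if (!(flags & SKETCH_S1_ALREADY_FOLDED))
            c1 = (unsigned char)tolower(c1);
        if (!(flags & SKETCH_S2_ALREADY_FOLDED))
            c2 = (unsigned char)tolower(c2);
        if (c1 != c2)
            return 0;
    }
    return 1;
}
```

The flag is a promise by the caller: passing it for a string that is not actually folded gives a wrong answer rather than an error.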
|
| |
|
|
|
|
|
|
| |
The macros that these call have been revised to do the same checks,
enhanced to not call the functions for all of Latin1, not just ASCII as
these did. So the tests here are redundant.
|
|
|
|
|
|
|
|
|
|
| |
And also to_uni_fold().
The flag allows retrieving either simple or full folds.
The interface is subject to change, so these are marked experimental
and their names begin with underscore. The old versions are turned
into macros calling the new versions with the correct extra parameter.
|
| |
|
|
|
|
| |
It can't just be large enough to hold the Unicode subset.
|
|
|
|
|
| |
These will be used in a future commit; the ordinals are different on
EBCDIC vs. ASCII.
|
|
|
|
|
| |
These were defined in a .c, but now there is need for them in another .c,
so move them to a header.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is the foundation for fixing the regression RT #82610. My analysis
that two bits could be shared was wrong, at least without further
work. This changes to use a different mechanism to pass needed
information to regexec.c so that another bit can be freed up and, in a
later commit, the two bits can become unshared again.
The bit that is freed up is ANYOF_UTF8, which basically said there is
something that is matched outside the ANYOF bitmap, and requires the
target string to be in utf8. This changes things so the existence of
something besides the bitmap indicates this, and so no flag is needed.
The flag bit ANYOF_NONBITMAP_NON_UTF8 remains to indicate that there is
something that should be matched outside the bitmap even if the target
string isn't in utf8.
|
|
|
|
|
|
| |
A new flag is now passable to this function to indicate to use locale
rules for code points below 256. Unicode rules are still used for above
255. Folds which cross that boundary are disallowed.
|
|
|
|
|
| |
If this option is set, any match that has a non-ASCII character that has
an ASCII character in its fold will not match that fold.
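To illustrate: U+212A (KELVIN SIGN) folds to 'k' and U+017F (LATIN SMALL LETTER LONG S) folds to 's'. The sketch below assumes a toy fold table containing only those two entries plus ASCII letters, and a hypothetical flag name; it is not the real fold machinery.

```c
#include <stdint.h>

#define SKETCH_NO_MIX_ASCII 0x1   /* hypothetical flag name */

/* Toy fold table (assumption): ASCII letters plus two non-ASCII
 * characters whose folds are ASCII. */
static uint32_t toy_fold(uint32_t cp)
{
    if (cp >= 'A' && cp <= 'Z') return cp + 0x20;
    if (cp == 0x212A) return 'k';   /* KELVIN SIGN */
    if (cp == 0x017F) return 's';   /* LATIN SMALL LETTER LONG S */
    return cp;
}

/* Return 1 if a and b match under folding. With the flag set, an ASCII
 * code point never matches a non-ASCII one, even if their folds agree. */
int sketch_fold_match(uint32_t a, uint32_t b, unsigned flags)
{
    if ((flags & SKETCH_NO_MIX_ASCII) && ((a < 128) != (b < 128)))
        return 0;
    return toy_fold(a) == toy_fold(b);
}
```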
|
|
|
|
|
| |
The parameter doesn't do anything yet. The old version becomes a macro
calling the new version with 0 as the flags.
|
|
|
|
| |
The analysis by the submitter was correct.
|
| |
|
|
|
|
|
|
| |
UNICODE_ILLEGAL only referred to one of 66 code points in the same class. And
they aren't illegal except in certain circumstances. New #defines have taken
over the use this formerly had, so it is now meaningless.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Surrogates, non-character code points, and code points that aren't in Unicode
are now allowed by default, instead of having to specify a flag to allow them.
(Most code did specify those flags anyway.)
This affects uvuni_to_utf8_flags(), utf8n_to_uvuni() and various routines that
are specialized interfaces to them.
Now there is a new set of flags to disallow those code points. Further, all 66
of the non-character code points are known about and handled consistently,
instead of just U+FFFF.
Code that requires these code points to be forbidden will have to change to use
the new flags. I have looked at all the (few) instances in CPAN where these
routines are used, and the only one I found that appears to have need to do
this, Encode, has already been patched to accommodate this change. Of course,
I may have overlooked some subtleties.
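A sketch of the allow-by-default model, using hypothetical flag and function names rather than the real ones. The 66 non-characters are U+FDD0..U+FDEF plus the last two code points of each of the 17 planes.

```c
#include <stdint.h>

/* Hypothetical flag names; each one forbids a class of code points. */
#define SKETCH_DISALLOW_SURROGATE 0x1
#define SKETCH_DISALLOW_NONCHAR   0x2
#define SKETCH_DISALLOW_SUPER     0x4

static int is_surrogate(uint32_t cp) { return cp >= 0xD800 && cp <= 0xDFFF; }
static int is_super(uint32_t cp)     { return cp > 0x10FFFF; }

static int is_nonchar(uint32_t cp)
{
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return 1;
    /* U+xxFFFE and U+xxFFFF in every plane, within Unicode only. */
    return !is_super(cp) && (cp & 0xFFFE) == 0xFFFE;
}

/* Everything is allowed unless the caller asks to forbid a class. */
int sketch_cp_allowed(uint32_t cp, unsigned flags)
{
    if ((flags & SKETCH_DISALLOW_SURROGATE) && is_surrogate(cp)) return 0;
    if ((flags & SKETCH_DISALLOW_NONCHAR) && is_nonchar(cp))     return 0;
    if ((flags & SKETCH_DISALLOW_SUPER) && is_super(cp))         return 0;
    return 1;
}

/* Count the non-characters in the Unicode range, as a sanity check. */
int sketch_count_nonchars(void)
{
    int count = 0;
    for (uint32_t cp = 0; cp <= 0x10FFFF; cp++)
        count += is_nonchar(cp);
    return count;
}
```

Code that needs strict Unicode conformance would pass all three flags; code that passes 0 gets the permissive default.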
|
| |
|
|
|
|
|
|
|
| |
Surrogates, non-character code points, and non-Unicode code points are
problematic in some contexts. These macros allow easy determination if
a code point is in one of these classes. There are versions both for
UVs, and utf8-encoded.
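For the utf8-encoded versions, two of the three classes can be recognised from the leading bytes alone. The sketch below assumes an ASCII platform and that s points at the first byte of a well-formed character; the function names are hypothetical and the non-character test is omitted for brevity.

```c
/* Surrogates U+D800..U+DFFF encode as ED A0 80 .. ED BF BF. */
int sketch_utf8_is_surrogate(const unsigned char *s)
{
    return s[0] == 0xED && s[1] >= 0xA0;
}

/* Code points above U+10FFFF: U+10FFFF is F4 8F BF BF, so anything
 * with a larger start byte, or F4 followed by 90 or above, is beyond
 * Unicode. */
int sketch_utf8_is_super(const unsigned char *s)
{
    return s[0] > 0xF4 || (s[0] == 0xF4 && s[1] >= 0x90);
}
```

Testing the encoded bytes directly avoids decoding the character first, which matters in tight scanning loops.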
|
|
|
|
|
|
|
| |
The refactoring of 3b0fc154d4e77cfb inadvertently introduced a bug
in Perl_is_utf8_char() and its callers, such as Perl_is_utf8_string(),
whereby the beyond-Unicode characters 0x140000 to 0x1fffff were no longer
recognised as valid.
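For orientation, the affected range 0x140000 to 0x1fffff is exactly the set of four-byte code points whose start byte is F5, F6 or F7, which strict UTF-8 never uses. A small sketch with hypothetical helper names shows the arithmetic; it is not the validation code itself.

```c
#include <stdint.h>

/* Encoded length in bytes for code points that fit in at most four
 * bytes; returns 0 for anything larger (not handled in this sketch). */
int sketch_utf8_len(uint32_t cp)
{
    if (cp < 0x80)     return 1;
    if (cp < 0x800)    return 2;
    if (cp < 0x10000)  return 3;
    if (cp < 0x200000) return 4;
    return 0;
}

/* Start byte of a four-byte encoding: 11110xxx, where xxx holds the
 * top three bits (bits 18..20) of the code point. */
unsigned sketch_four_byte_start(uint32_t cp)
{
    return 0xF0u | (unsigned)(cp >> 18);
}
```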
|
|
|
|
|
|
|
|
|
|
|
| |
ANYOF_FOLD is now used under fewer conditions. Otherwise the
bitmap of characters 0-255 is fully calculated with the folds, and the
flag is not set. One condition is under locale, where the folds aren't
known at compile time; the other is for things accessible through a
swash.
By changing the name to its new meaning, certain optimizations become more
obvious.
|
| |
|
|
|
|
|
|
|
|
| |
The UTF8_TWO_BYTE_HI_nocast() macro has an error in it, in that the
START_MARK is larger than a byte, and only the last 8 bits of it are
relevant. This hasn't caused a problem because the macro hasn't been
called directly, but from other macros that make sure the result gets
cast to a U8.
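A sketch of the problem, assuming the start mark is built by left-shifting 0xFE (the macro names here are hypothetical). In int arithmetic the shifted value is wider than a byte, so it has to be masked before being OR-ed into the high byte.

```c
/* Start mark computed by shifting; for len == 2 this is 0x1FC0,
 * which does not fit in a byte. */
#define SKETCH_START_MARK_WIDE(len) (0xFE << (7 - (len)))

/* Keep only the low 8 bits, which are the ones that matter. */
#define SKETCH_START_MARK(len) (SKETCH_START_MARK_WIDE(len) & 0xFF)

/* High (first) byte of a two-byte encoding, without any cast. */
unsigned sketch_two_byte_hi(unsigned cp)
{
    return (cp >> 6) | SKETCH_START_MARK(2);
}

/* Low (second) byte: continuation marker plus the low six bits. */
unsigned sketch_two_byte_lo(unsigned cp)
{
    return (cp & 0x3F) | 0x80;
}
```

With the mask in place the result is correct even when the caller stores it in something wider than a byte.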
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
| |
The code to do this isn't obvious, as it was wrong in 5 different places
in two different files, forgetting one or both of the required
conversions to UTF (which is a no-op except on EBCDIC machines, or it
would have been detected sooner).
Some of that code depended on left shifting being truncated in a U8.
This adds UTF_START_MASK so it can work in a larger width variable.
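A sketch of the difference, with hypothetical names and a mask that is only meant for lengths 2 to 4: stripping the marker bits of a start byte by shifting works only if the intermediate value is truncated to 8 bits, whereas a mask works at any width.

```c
#include <stdint.h>

/* Hypothetical mask of the payload bits in the start byte of a
 * len-byte sequence: 0x1F for len 2, 0x0F for len 3, 0x07 for len 4. */
#define SKETCH_START_MASK(len) (0x1F >> ((len) - 2))

/* Relies on the shifted value being truncated to 8 bits. */
static unsigned payload_by_truncation(uint8_t start, int len)
{
    uint8_t shifted = (uint8_t)(start << len);
    return (unsigned)(shifted >> len);
}

/* Same shift in a wider variable: nothing is truncated, so the marker
 * bits survive and the result is wrong. */
static unsigned payload_wide_shift(unsigned start, int len)
{
    return (start << len) >> len;
}

/* Masking gives the right answer regardless of variable width. */
static unsigned payload_by_mask(unsigned start, int len)
{
    return start & (unsigned)SKETCH_START_MASK(len);
}
```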
|
|
|
|
|
|
|
|
|
|
| |
I am about to hone the meaning of this to mean that there is something
outside the bitmap that is matchable by the node, and the new name
reflects that more accurately.
I am not retaining the old name because I'm about to remove it from the
flags field to save a bit and avoid masking operations, and any code
that would be using it would break at that point.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As discussed on p5p, ibcmp has different semantics from other cmp
functions in that it is a binary instead of ternary function. It is
less confusing then to have a name that implies true/false.
There are three functions affected: ibcmp, ibcmp_locale and ibcmp_utf8.
ibcmp is actually equivalent to foldNE, but for the same reason that things
like 'unless' and 'until' are cautioned against, I changed the functions
to foldEQ, so that the existing names, like ibcmp_utf8 are defined as
macros as being the complement of foldEQ.
This patch also changes the one file where turning ibcmp into a macro
causes problems. It changes it to use the new name. It also documents
for the first time ibcmp, ibcmp_locale and their new names.
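The relationship between the old and new names can be sketched as follows. This is an ASCII-only illustration with a sketch_ prefix on every name; the real functions and their signatures are not reproduced here.

```c
#include <ctype.h>
#include <stddef.h>

/* Return 1 (true) if the first len bytes are equal ignoring case. */
int sketch_foldEQ(const char *a, const char *b, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    }
    return 1;
}

/* Old name kept as a macro: true when the strings DIFFER, i.e. the
 * complement of foldEQ. It is binary, not a ternary cmp-style result. */
#define sketch_ibcmp(a, b, len) (!sketch_foldEQ((a), (b), (len)))
```

A caller written as `if (!sketch_ibcmp(a, b, n))` reads more directly as `if (sketch_foldEQ(a, b, n))`.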
|
|
|
|
|
| |
perl uses UTF8_IS_START() to test if a byte is a valid start byte,
but this didn't take perl's extended UTF-8 range into account.
|
|
|
|
|
| |
is unused in the code, and is wrong for EBCDIC platforms, as there can
be invariants there that aren't ASCII. I simply removed it.
|
|
|
|
| |
Signed-off-by: Abigail <abigail@abigail.be>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This turns on the unicode semantics for uc/lc/ucfirst/lcfirst
operations on strings without the UTF8 bit set but with
characters higher than 127. This replaces the "legacy" pragma
experiment.
Note that currently this feature sets both a bit in $^H and
an (unused) key in %^H. The bit in $^H could be replaced by
a flag on the uc/lc/etc op. It's probably not feasible to
test a key in %^H in pp_uc and friends each time we want to
know which semantics to apply.
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Attached is a patch that removes from utfebcdic.h most definitions that
are common to it and utf8.h, and moves them to the common area of
utf8.h. The duplicate ones that are retained are each an integral part
of a larger related set that do differ between the headers.
Some of the definitions had started to drift, so this brings them back
into line, with a lowered possibility of future drift. In particular
the ones for the 'lazy' macros did not do quite as intended, especially
in the EBCDIC case. The bugs were a small performance hit only, in that
the macro was not quite as lazy as expected, and so loaded utf8_heavy.pl
possibly unnecessarily. In examining these, I noted that the utf8.h
definition of the start byte of a utf8 encoded string accepts invalid
start bytes 0xC0 and 0xC1. These are invalid because they are for
overlong encodings of ASCII code points. One is not supposed to allow
these, and there have been security attacks, according to Wikipedia,
against code that does. But I don't know all the ramifications for Perl
of changing to exclude these, so I left it alone, but added a comment
(and an item on my personal todo list to check into it).
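To make the overlong problem concrete, here is a sketch (hypothetical helper names, ASCII platform assumed) of a naive two-byte decode. Any sequence starting with 0xC0 or 0xC1 decodes to a value below 0x80, so it is a second spelling of an ASCII character.

```c
#include <stdint.h>

/* Naive decode of a two-byte sequence, with no validity checks. */
static uint32_t sketch_decode_two_byte(uint8_t hi, uint8_t lo)
{
    return ((uint32_t)(hi & 0x1F) << 6) | (uint32_t)(lo & 0x3F);
}

/* Start bytes that can only introduce an overlong encoding. */
static int sketch_is_overlong_start(uint8_t b)
{
    return b == 0xC0 || b == 0xC1;
}
```

For example, C0 AF decodes to 0x2F ('/'), which is how a filter that looks only for the single byte 0x2F can be bypassed.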
I made some comment clarifications, and removed some definitions marked
as obsolete in utf8.h that are in fact no longer used.
I added some synonyms for existing macros that more clearly reflect the
use that I intend to put them to in future patches.
From ba581aa4db767e5531ec0c0efdea5de4e9b09921 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@khw-desktop.(none)>
Date: Mon, 9 Nov 2009 08:38:24 -0700
Subject: [PATCH] Clean up utf headers
Signed-off-by: H.Merijn Brand <h.m.brand@xs4all.nl>
|
| |
|
| |
|
|
|
|
| |
(and run "make regen")
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Consider what currently happens when the tokenizer is scanning a string.
It looks through it byte-by-byte until it finds a character that forces
it to decide to go to utf8. It then calls sv_utf8_upgrade() with the
portion of the string scanned so far.
sv_utf8_upgrade() starts over from the beginning, and scans the string
byte-by-byte until it finds a character that varies between non-utf8 and
utf8. It then calls bytes_to_utf8().
bytes_to_utf8() allocates a new string that can handle the worst case
expansion, 2n+1, of the entire string, and starts over from the
beginning, and scans the input string byte-by-byte copying and
converting each character to the output string as it goes.
It doesn't return the size of the new string, so sv_utf8_upgrade()
assumes it is only as big as what actually got converted, throwing away
knowledge of any spare.
It then returns to the tokenizer, which immediately does a grow to get
space for the unparsed input. This is likely to cause a new string to
be allocated and copied from the one we had just created, even if that
string in actuality had enough space in it.
Thus, the invariant head portion of the string is scanned 3 times, and
probably 2 strings will be allocated and copied.
My solution to cutting this down is to do several things.
First, I added an extra flag for sv_utf8_upgrade that says don't bother
to check if the string needs to be converted to utf8, just assume it
does. This eliminates one of the passes.
I also added a new parameter to sv_utf8_upgrade that says when you
return, I want this much unused space in the string. That eliminates
the extra grow.
This was all done by renaming the current work-horse function from
sv_utf8_upgrade_flags to be sv_utf8_upgrade_flags_grow() and making the
current function name be a macro which calls the revised one with a 0
grow parameter.
I also improved the internal efficiency of sv_utf8_upgrade so that when
it does scan the string, it doesn't call bytes_to_utf8, but does the
conversion itself, using a fast memory copy instead of the byte-oriented
one for the invariant header, and it uses that header to get a better
estimate of the needed size of the new string, and it doesn't throw away
the knowledge of the allocated size.
And, if it is clear without scanning the whole string that the
conversion will fit in the already allocated string, it just uses that
instead of allocating and copying a new one, using the algorithm I
copied from the tokenizer. (In this case it does have to finish
scanning the whole string to get the correct size.) The comments have
details.
It still is byte-oriented. Vectorization et al. could yield
performance improvements. One idea for that is in the comments.
The patch also includes a new synonym I created which is a more accurate
name than NATIVE_TO_ASCII.
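The core of the approach can be sketched outside of perl's SV machinery. The function below is an illustration under stated assumptions, not the real sv_utf8_upgrade_flags_grow(): it works on a plain Latin-1 byte buffer, always allocates a new buffer, and its name and signature are hypothetical. It scans the invariant head once, copies it with memcpy, sizes the result from the remaining tail plus the requested spare, and reports the allocated size instead of discarding it.

```c
#include <stdlib.h>
#include <string.h>

/* Convert len Latin-1 bytes to UTF-8, leaving at least `extra` unused
 * bytes (plus a trailing NUL) in the result. Returns a malloc'd buffer
 * or NULL on allocation failure; *out_len is the number of bytes used
 * and *out_cap is the number of bytes allocated. */
unsigned char *sketch_upgrade_grow(const unsigned char *in, size_t len,
                                   size_t extra,
                                   size_t *out_len, size_t *out_cap)
{
    /* One pass over the invariant head: these bytes need no conversion. */
    size_t head = 0;
    while (head < len && in[head] < 0x80)
        head++;

    /* Only the tail can expand, and at worst it doubles. */
    size_t cap = head + 2 * (len - head) + extra + 1;
    unsigned char *out = malloc(cap);
    if (out == NULL)
        return NULL;

    memcpy(out, in, head);              /* fast copy of the head */
    size_t o = head;
    for (size_t i = head; i < len; i++) {
        if (in[i] < 0x80) {
            out[o++] = in[i];
        } else {
            out[o++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[o++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    out[o] = '\0';

    *out_len = o;
    *out_cap = cap;                     /* the caller keeps this knowledge */
    return out;
}
```

Because the caller learns both the used and the allocated size, it can append up to `extra` bytes of unparsed input without a second allocation and copy.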
|
|
|
| |
p4raw-id: //depot/perl@32793
|
|
|
| |
p4raw-id: //depot/perl@32237
|
|
|
|
|
|
| |
files that generate .h files, so they'll be ready
next time.
p4raw-id: //depot/perl@29695
|
|
|
|
|
| |
Message-Id: <20060402224657.B942.BQW10602@nifty.com>
p4raw-id: //depot/perl@27688
|
|
|
|
|
|
|
| |
I believe that all are now found, as redefining CopHINTS_get(c)
to (~(c)->op_private) (with corresponding changes to CopHINTS_set()
and the initialisation of PL_compiling) works.
p4raw-id: //depot/perl@27687
|
|
|
|
|
| |
tested by Rajarshi Das
p4raw-id: //depot/perl@26452
|