| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
| |
This mostly indents and outdents based on blocks added or removed by the
previous commit. But there are a few comment changes, vertical
alignment of macro backslash continuation characters, and other
white-space changes.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This large (sorry, I couldn't figure out how to meaningfully split it
up) commit causes Perl to fully support LC_CTYPE operations (case
changing, character classification) in UTF-8 locales.
As a side effect it resolves [perl #56820].
The basics are easy, but there were a lot of details, and one
troublesome edge case discussed below.
What essentially happens is that when the locale is changed to a UTF-8
one, a global variable is set TRUE (FALSE when changed to a non-UTF-8
locale). Within the scope of 'use locale', this variable is checked,
and if TRUE, the code that Perl uses for non-locale behavior is used
instead of the code for locale behavior. Since Perl's internal
representation is UTF-8, we get UTF-8 behavior for a UTF-8 locale.
More work had to be done for regular expressions. There are three
cases.
1) The character classes \w, [[:punct:]] needed no extra work, as
the changes fall out from the base work.
2) Strings that are to be matched case-insensitively. These form
EXACTFL regops (nodes). Notice that if such a string contains only
characters above-Latin1 that match only themselves, the node can be
downgraded to an EXACT-only node, which presents better optimization
possibilities, as we now have a fixed string known at compile time to be
required to be in the target string to match. Similarly if all
characters in the string match only other above-Latin1 characters
case-insensitively, the node can be downgraded to a regular EXACTFU node
(match, folding, using Unicode, not locale, rules). The code changes
for this could be done without accepting UTF-8 locales fully, but there
were edge cases which needed to be handled differently if I stopped
there, so I continued on.
In an EXACTFL node, all such characters are now folded at compile time
(just as before this commit), while the other characters whose folds are
locale-dependent are left unfolded. This means that they have to be
folded at execution time based on the locale in effect at the moment.
Again, this isn't a change from before. The difference is that now some
of the folds that need to be done at execution time (in regexec) are
potentially multi-char. Some of the code in regexec was trivial to
extend to account for this because of existing infrastructure, but the
part dealing with regex quantifiers needed more work.
Also the code that joins EXACTish nodes together had to be expanded to
account for the possibility of multi-character folds within locale
handling. This was fairly easy, because it already has infrastructure
to handle these under somewhat different circumstances.
3) In bracketed character classes, represented by ANYOF nodes, a new
inversion list was created giving the characters that should be matched
by this node when the runtime locale is UTF-8. The list is ignored
except under that circumstance. To do this, I created a new ANYOF type
which has an extra SV for the inversion list.
The edge case that caused the most difficulty is folding involving the
MICRO SIGN, U+00B5. It folds to the GREEK SMALL LETTER MU, as does the
GREEK CAPITAL LETTER MU. The MICRO SIGN is the only 0-255 range
character that folds to outside that range. The issue is that it
doesn't naturally fall out that it will match the CAP MU. If we let the
CAP MU fold to the small mu at compile time (which it can because both
are above-Latin1 and so the fold is the same no matter what locale is in
effect), it could appear that the regnode can be downgraded away from
EXACTFL to EXACTFU, but doing so would cause the MICRO SIGN to not
case-insensitively match the CAP MU. This could be special cased in regcomp
and regexec, but I wanted to avoid that. Instead the mktables tables
are set up to include the CAP MU as a character whose presence forbids
the downgrading, so the special casing is in mktables, and not in the C
code.
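The basic mechanism described above (a global flag set on locale change and consulted within the scope of 'use locale') can be sketched as follows. This is only an illustration: the variable and function names are hypothetical, the codeset comparison is simplified to an exact string match, and the "Unicode rules" function handles just ASCII letters plus one Latin-1 letter.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the global variable the commit describes. */
static bool in_utf8_ctype_locale = false;

/* Hypothetical hook, called whenever LC_CTYPE is changed.  Real codeset
 * names vary ("UTF-8", "utf8", ...); an exact match is an assumption. */
static void note_new_ctype_locale(const char *codeset)
{
    in_utf8_ctype_locale = (strcmp(codeset, "UTF-8") == 0);
}

/* Greatly simplified non-locale (Unicode rules) upper-casing. */
static unsigned int upper_unicode_rules(unsigned int c)
{
    if (c >= 'a' && c <= 'z')
        return c - ('a' - 'A');
    if (c == 0xE9)              /* LATIN SMALL LETTER E WITH ACUTE */
        return 0xC9;
    return c;
}

/* What a case-changing operation might do under 'use locale'. */
static unsigned int upper_under_use_locale(unsigned int c)
{
    if (in_utf8_ctype_locale)   /* UTF-8 locale: use the non-locale code */
        return upper_unicode_rules(c);

    /* Other locales: defer to libc for the 0-255 range. */
    return (c < 256) ? (unsigned int) toupper((int) c) : c;
}
```

After note_new_ctype_locale("UTF-8"), upper_under_use_locale(0xE9) takes the Unicode-rules path; after switching to any other codeset it goes through toupper().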
|
|
|
|
|
| |
The UTF8 in the name is kind of misleading, and would be more misleading
after future commits make UTF8 locales special.
|
|
|
|
|
|
|
|
|
|
| |
The documentation says that Perl taints certain operations when subject
to locale rules, such as lc() and ucfirst(). Prior to this commit
there were exceptions when the operand to these functions contained no
characters whose case change actually varied depending on the locale,
for example the empty string or above-Latin1 code points. Changing to
conform to the documentation simplifies the core code, and yields more
consistent results.
|
|
|
|
|
|
|
|
| |
This adds and modifies various comments in several files, rewrapping
some comments to occupy fewer lines but not exceed 79 columns. And
fixes some indentation and other white space issues. It includes
removing trailing white space in lines in regcomp.c. I didn't think it
was worth making a commit for each file.
|
|
|
|
|
|
| |
These lists are read-only. Turning on the flag may allow some
optimisations to be done, including some that may be added in the
future.
|
|
|
|
|
|
|
| |
These are the base names for various macros used in parsing identifiers.
Prior to this patch, parsing a code point above Latin1 caused loading
disk files. This patch causes all the information to be compiled into
the Perl binary.
|
|
|
|
|
|
|
| |
Previous commits in this series have caused all the POSIX classes to be
completely specified at C compile time. This allows us to revise the
base function used by all these macros to use these definitions,
avoiding reading them in from disk.
|
| |
|
|
|
|
|
| |
These functions will be out of the way in mathoms. There were a few
that could not be moved, as-is, so I left them.
|
|
|
|
|
|
| |
In all these cases, there is an already existing macro that does exactly
the same thing as the code that this commit replaces. No sense
duplicating logic.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This bottom level function decodes the first character of a UTF-8 string
into a code point. Using it directly is discouraged. This
commit cleans up some of the warnings it can raise. Now, tests for
malformations are done before any tests for other potential issues. One
of those issues involves code points so large that they have never
appeared in any official standard (the current standard has scaled back
the highest acceptable code point from earlier versions). It is
possible (though not done in CPAN) to warn and/or forbid these code
points, while accepting smaller code points that are still above the
legal Unicode maximum. The warning message for this now includes the
code point if representable on the machine. Previously it always
displayed raw bytes, which is what it still does for non-representable
code points.
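The ordering described above (malformation tests first, other potential issues afterwards) can be sketched roughly as below. This is not the code in utf8.c: the function name, the status enum, and the restriction to sequences of at most four bytes on an ASCII platform are all assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical status codes for this sketch. */
enum decode_status {
    DECODE_OK,
    DECODE_MALFORMED,
    DECODE_SURROGATE,
    DECODE_ABOVE_UNICODE
};

/* Decode the first character of s (len bytes available).  On any status
 * other than DECODE_MALFORMED, *cp and *used are set. */
static enum decode_status decode_first(const unsigned char *s, size_t len,
                                       unsigned long *cp, size_t *used)
{
    size_t need, i;
    unsigned long v;

    /* Step 1: malformations. */
    if (len == 0)
        return DECODE_MALFORMED;
    if (s[0] < 0x80) {
        *cp = s[0];
        *used = 1;
        return DECODE_OK;
    }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; v = s[0] & 0x07; }
    else
        return DECODE_MALFORMED;    /* stray continuation or long start */

    if (len < need)
        return DECODE_MALFORMED;    /* sequence is cut short */
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return DECODE_MALFORMED;    /* not a continuation byte */
        v = (v << 6) | (s[i] & 0x3F);
    }
    if ((need == 2 && v < 0x80) || (need == 3 && v < 0x800)
        || (need == 4 && v < 0x10000))
        return DECODE_MALFORMED;    /* overlong encoding */

    /* Step 2: well-formed, but possibly problematic, code points. */
    *cp = v;
    *used = need;
    if (v >= 0xD800 && v <= 0xDFFF)
        return DECODE_SURROGATE;
    if (v > 0x10FFFF)
        return DECODE_ABOVE_UNICODE;
    return DECODE_OK;
}
```

Because the code point is already computed by the time the second group of tests runs, a caller can include it in a warning message instead of showing raw bytes.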
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The warnings categories non_unicode, nonchar, and surrogate are all
subcategories of 'utf8'. One should never call a packWARN() with both a
category and a subcategory of it, as it will mean that one can't
completely make the subcategory independent. For example, with
    use warnings 'utf8';
    no warnings 'surrogate';
surrogate warnings will still be output if they are tested with
    ckWARN2(WARN_UTF8, WARN_SURROGATE);
utf8.c was guilty of this.
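The problem can be modelled with a small bitmask sketch. The bit values and the ck_warn()/ck_warn2() helpers here are hypothetical stand-ins for the real warning offsets and the ckWARN()/ckWARN2() macros; the only assumption carried over is that ckWARN2() succeeds when either category is enabled.

```c
#include <stdbool.h>

/* Hypothetical bit assignments; Perl's real warning offsets differ. */
enum {
    W_UTF8        = 1,
    W_SURROGATE   = 2,
    W_NONCHAR     = 4,
    W_NON_UNICODE = 8
};

/* Set of categories enabled in the current lexical scope. */
typedef unsigned int warn_mask;

/* Models ckWARN(): true only if this one category is enabled. */
static bool ck_warn(warn_mask enabled, unsigned int category)
{
    return (enabled & category) != 0;
}

/* Models ckWARN2(): true if either of the two categories is enabled. */
static bool ck_warn2(warn_mask enabled, unsigned int a, unsigned int b)
{
    return (enabled & (a | b)) != 0;
}
```

With 'utf8' enabled and 'surrogate' then disabled, the mask is W_UTF8 | W_NONCHAR | W_NON_UNICODE. Testing ck_warn2(mask, W_UTF8, W_SURROGATE) still succeeds because of the W_UTF8 bit, whereas ck_warn(mask, W_SURROGATE) correctly fails.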
|
|
|
|
|
| |
The test here for WARN_UTF8 is redundant, as only if one of the other
three warning categories is enabled will anything actually be output.
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
The function _invlist_invert_prop() is hereby removed. The recent
changes to allow \p{} to match above-Unicode mean that no special
handling of properties need be done when inverting.
This function was accessible to XS code that cheated by using #defines
to pretend it was something it wasn't, but it also has been marked
as subject to change since its inception, and never appeared in any
documentation.
|
|
|
|
|
| |
This indents various newly-formed blocks (by the previous commit) in
these three files, and reflows lines to fit into 79 columns.
|
|
|
|
|
|
|
|
|
| |
mktables now outputs the tables for binary properties as inversion
lists, with a size as the first element. This means simpler handling of
these tables in the core, including removal of an entire pass over them
(it was done just to get the size). These tables are marked as for
internal use by the Perl core only, so their format is changeable at
will.
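For readers unfamiliar with inversion lists, here is a minimal sketch of a lookup in one that carries its size as the first element. The table contents and the function name are hypothetical, and the actual format that mktables emits may differ in detail; the point is only that the size comes for free, with no separate counting pass.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical table for ASCII letters.  Element [0] is the count of the
 * elements that follow.  Each following element is the first code point
 * of a range, alternating between "in the set" and "not in the set". */
static const unsigned long ascii_alpha_invlist[] = {
    4,
    0x41, 0x5B,     /* 'A'..'Z' match; 0x5B and up do not */
    0x61, 0x7B      /* 'a'..'z' match; 0x7B and up do not */
};

static bool invlist_contains(const unsigned long *table, unsigned long cp)
{
    size_t len = (size_t) table[0];
    const unsigned long *list = table + 1;
    size_t lo = 0, hi = len;

    /* Binary search for the number of elements that are <= cp. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (list[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* An odd count means cp lies inside a matching range. */
    return (lo & 1) != 0;
}
```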
|
|
|
|
| |
plus some typo fixes. I probably changed some things in perlintern, too.
|
|
|
|
| |
Rearrange this multi-line conditional to be easier to read.
|
|
|
|
|
| |
"The" referring to a parameter here does not look right to me, a native
English speaker.
|
|
|
|
|
|
|
| |
The names of these hashes stored in some disk files are retrievable by a
standardized lookup. There is no need to have them hard-coded in C
code. This is one less opportunity for the file and the code to get out
of sync.
|
| |
|
|
|
|
|
|
| |
These temporaries are all known to fit into 8 bits; by using a U8 it
should be more obvious to an optimizing compiler, and so the bounds
checking need not be done.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There were a few places that were doing
    unsigned_var = cond ? signed_val : unsigned_val;
or similar. Fixed by suitable casts etc.
The four in utf8.c were fixed by assigning to an intermediate
unsigned var; this has the happy side-effect of collapsing
a large macro expansion, where toUPPER_LC() etc evaluate their arg
multiple times.
|
|
|
|
|
| |
This outdents code to the proper level given that the surrounding block
has been removed.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This makes all the tables in the lib/unicore/To directory that map from
code point to code point be formatted so that the mapped-to code point
is expressed as hexadecimal.
This allows for uniform treatment of these tables in utf8.c, and removes
the final use of strtol() in the (non-CPAN) core. strtol() should be
avoided because it is subject to locale rules, and some older libc
implementations have been buggy. It was used because Perl doesn't have
an efficient way of parsing a decimal number and advancing the parse
pointer to beyond it; we do have such a method for hex numbers.
The input to mktables published by Unicode is also in hex, so this now
conforms to that convention.
This also will facilitate the new work currently being done to read in
the tables that find the closing bracket given an opening one.
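As a rough sketch of the kind of parsing this enables, the following reads a hex number and advances the parse pointer beyond it, with no dependence on the locale. The function name is hypothetical (it is not the routine Perl uses), and overflow checking is omitted for brevity.

```c
/* Hypothetical helper: parse the hex digits at *pp, leave *pp pointing
 * at the first non-hex character, and return the value.  No locale is
 * consulted, and no overflow checking is done in this sketch. */
static unsigned long parse_hex_advance(const char **pp)
{
    const char *p = *pp;
    unsigned long value = 0;

    for (;;) {
        unsigned int digit;
        char c = *p;

        if (c >= '0' && c <= '9')
            digit = (unsigned int) (c - '0');
        else if (c >= 'a' && c <= 'f')
            digit = (unsigned int) (c - 'a') + 10;
        else if (c >= 'A' && c <= 'F')
            digit = (unsigned int) (c - 'A') + 10;
        else
            break;

        value = (value << 4) | digit;
        p++;
    }

    *pp = p;
    return value;
}
```

For a tab-separated line such as "03BC\t039C", one call yields 0x3BC and leaves the pointer at the tab; stepping over the tab and calling again yields 0x39C.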
|
|
|
|
|
| |
The Win32 compiler doesn't realize that the values in these places can
be a max of 255. Other compilers don't warn.
|
|
|
|
|
|
| |
IS_UTF8_CHAR is defined by utf8.h, so this is always defined.
In fact, later in utf8.c we use it again, this time without the
ifdef.
|
|
|
|
| |
Previously it was based on HAS_QUAD, which is not (as) correct.
|
| |
|
|
|
|
|
|
|
| |
This removes a macro not yet even in a development release, and splits
its calls into two classes: those where the input is a byte; and those
where it can be any unsigned integer. The byte implementation avoids a
function call on EBCDIC platforms.
|
|
|
|
|
|
|
| |
These functions are still called by some CPAN-upstream modules, so can't
go into mathoms until those are fixed. There are other changes needed
in these modules, so I'm deferring sending patches to their maintainers
until I know all the necessary changes.
|
|
|
|
|
| |
Since commit 010ab96b9b802bbf77168b5af384569e053cdb63, this function is
no longer a wrapper, so shouldn't be described as such.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The LATIN SMALL LETTER SHARP S can't fold to 'ss' under /iaa because the
definition of /aa prohibits it, but it can fold to two consecutive
instances of LATIN SMALL LETTER LONG S. A capital sharp s can do the
same, and that was fixed in 1ca267a5, but this one was overlooked then.
It turns out that another possibility was also overlooked in 1ca267a5.
Both U+FB05 (LATIN SMALL LIGATURE LONG S T) and U+FB06 (LATIN SMALL
LIGATURE ST) fold to the string 'st', except under /iaa these folds are
prohibited. But U+FB05 and U+FB06 are equivalent to each other under
/iaa. This wasn't working until now. This commit changes things so
both fold to FB06.
This bug would only surface during /iaa matching, and I don't believe
there are any current code paths which lead to it, hence no tests are
added by this commit. However, a future commit will lead to this bug,
and existing tests find it then.
|
|
|
|
|
| |
This is a micro optimization. We now check for a common case and return
if found, before checking for a relatively uncommon case.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Now that the Unicode data is stored in native character set order, it is
rare to need to work with the Unicode order. Traditionally, the real
work was done in functions that worked with the Unicode order, and
wrapper functions (or macros) were used to translate to/from native.
There are two groups of functions: one that translates from code point
to UTF-8, and the other group goes the opposite direction.
This commit changes the base function that translates from UTF-8 to code
point to output native instead of Unicode. Those extremely rare
instances where Unicode output is needed instead will have to hand-wrap
calls to this function with a translation macro, as now described in the
API pod. Prior to this, it was the other way around: the native was
wrapped, and the rare, strict Unicode wasn't. This eliminates a layer of
function call overhead for a common case.
The base function that translates from code point to UTF-8 retains its
Unicode input, as that is more natural to process. However, it is
de-emphasized in the pod, with the functionality description moved to
the pod for a native input wrapper function. And, those wrappers are
now macros in all cases; previously there was function call overhead
sometimes. (Equivalent exported functions are retained, however, for XS
code that uses the Perl_foo() form.)
I had hoped to rebase this commit, squashing it with an earlier commit
in this series, eliminating the use of a temporary function name change,
but the work involved turns out to be large, with no real payoff.
|
| |
|
|
|
|
|
| |
This moves these two functions to be adjacent to the function they each
call, thus keeping like things together.
|
|
|
|
|
|
| |
There is a macro that accomplishes the same task for a two byte UTF-8
encoded character, and avoids the overhead of the general purpose
function call.
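To illustrate why a macro suffices here, a two-byte decode on an ASCII platform is just two masks, a shift and an OR. The macro name below is hypothetical; Perl's own macro has a different name and also handles EBCDIC.

```c
/* Hypothetical macro for an ASCII platform: combine the 5 payload bits
 * of the start byte with the 6 payload bits of the continuation byte.
 * It assumes the input is already known to be a well-formed two-byte
 * sequence, and each argument is evaluated exactly once. */
#define TWO_BYTE_UTF8_TO_CP(hi, lo) \
    ((((unsigned int) (hi) & 0x1Fu) << 6) | ((unsigned int) (lo) & 0x3Fu))
```

For example, the bytes 0xC2 0xB5 give 0xB5 (MICRO SIGN) and 0xC3 0xA9 give 0xE9.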
|
|
|
|
|
| |
The formal parameter gets evaluated multiple times on an EBCDIC
platform, thus incrementing more than the intended once.
|
|
|
|
| |
There is a macro that accomplishes this task, and is easier to read.
|
|
|
|
|
|
|
|
|
|
|
|
| |
This changes the code so that converting to UTF-8 is avoided unless
necessary. For such inputs, the conversion back from UTF-8 is also
avoided. The cost of doing this is that the first swatches are combined
into one that contains the values for all characters 0-255, instead of
having multiple swatches. That means when first calculating the swatch
it calculates all 256, instead of 128 (160 on EBCDIC).
This also fixes an EBCDIC bug in which characters in this range were
being translated twice.
|
|
|
|
|
|
|
|
| |
This function assumes that the input is well-formed UTF-8, even though
until this commit, the prefatory comments didn't say so. The API does
not pass the buffer length, so there is no way it could check for
reading off the end of the buffer. One code path already calls
valid_utf8_to_uvchr(); this changes the remaining code path to correspond.
|
| |
|
| |
|
| |
|
|
|
|
|
|
| |
In the case of invariants these two macros should do the same thing,
but it seems to me that the latter name more clearly indicates what is
going on.
|
|
|
|
|
|
|
|
|
| |
This means use official Unicode code point numbering, not native. Doing
this converts the existing UNISKIP calls in the code to refer to native
code points, which is what they meant anyway. The terminology is
somewhat ambiguous, but I don't think it will cause real confusion.
NATIVE_SKIP is also introduced for situations where it is important to
be precise.
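For context, a skip calculation of this kind returns how many bytes the UTF-8 encoding of a code point occupies. The sketch below uses a hypothetical function name and assumes an ASCII platform, where Unicode and native code points coincide; it covers only code points below 0x80000000 and returns 0 for anything larger, which Perl's own macros handle differently.

```c
/* Hypothetical helper: number of bytes needed to encode cp in UTF-8 on
 * an ASCII platform.  Covers the one- to six-byte forms only. */
static int utf8_skip_for_cp(unsigned long cp)
{
    if (cp < 0x80UL)        return 1;
    if (cp < 0x800UL)       return 2;
    if (cp < 0x10000UL)     return 3;
    if (cp < 0x200000UL)    return 4;
    if (cp < 0x4000000UL)   return 5;
    if (cp < 0x80000000UL)  return 6;
    return 0;   /* outside the range this sketch handles */
}
```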
|