| Commit message | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
| |
This closes #19603.
Unicode has various characters whose numeric value is a rational
non-integer. These can be specified in \p{nv=...} constructs by either
the rational form or the decimal value it evaluates to. The number
of significant digits that must match is kept to a minimum to allow for
variances in different platforms' floating-point lengths and rounding
decisions. Previously that number was 2 digits, but that is no longer
sufficient on all platforms. This commit changes it to 3.
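As a rough illustration of matching a rational numeric value to a fixed number of significant digits, here is a minimal Python sketch. It is not how mktables does it; the helper names `to_sig` and `nv_matches` are hypothetical, and the rounding rule (compare both sides after rounding to 3 significant digits) is an assumption made for the example.

```python
from fractions import Fraction

SIGNIFICANT_DIGITS = 3  # the new minimum named in the commit message


def to_sig(value, digits=SIGNIFICANT_DIGITS):
    """Round a number to `digits` significant digits, returned as a string."""
    return format(float(value), f".{digits}g")


def nv_matches(exact, user_input, digits=SIGNIFICANT_DIGITS):
    """Hypothetical helper: does user_input name the rational `exact`?

    user_input may be the rational form ("1/3") or a decimal ("0.333").
    """
    if "/" in user_input:
        return Fraction(user_input) == exact
    # Compare both sides after rounding to `digits` significant digits.
    return to_sig(user_input, digits) == to_sig(exact, digits)


one_third = Fraction(1, 3)
print(nv_matches(one_third, "1/3"))    # rational form
print(nv_matches(one_third, "0.333"))  # three significant digits agree
print(nv_matches(one_third, "0.33"))   # too few digits to agree
```

Under this assumed rule, an input only has to agree with the exact value in its first three significant digits, so small differences in how a platform rounds later digits do not matter.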
|
|
|
|
|
| |
This macro starts from the right side and matches UTF-8 white space
characters.
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
\p{Present_In} is a Perl extension of the Unicode Age property, added
because knowing the exact Unicode version in which a code point became
assigned is rarely what you want; much more frequently you want to know
whether the code point exists in a given version. (Since this extension
was added, Unicode changed their language to declare that the Age
property should be interpreted in pattern matching, not as described,
but as Perl's Present_In is. But I chose not to change Age, to avoid
backwards compatibility issues, and this way, a coder can choose which
behavior they want.)
Unicode typically has synonyms (aliases) for each value a property can
take on, so \p{Age=6.1} and \p{Age=V6_1} mean the same thing.
Prior to this commit, neither \p{Present_In=1_1} nor \p{Present_In=NA}
worked.
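The difference between Age and Present_In described above can be sketched in Python. This is only an illustration: the `SAMPLE_AGE` table is a tiny hand-picked sample, and the function names `age_is` and `present_in` are hypothetical, not part of Perl or mktables.

```python
# Illustrative sample of Age values (the version in which each code point
# was first assigned). None stands for "NA", i.e. unassigned.
SAMPLE_AGE = {
    0x0041: (1, 1),   # LATIN CAPITAL LETTER A
    0x20AC: (2, 1),   # EURO SIGN
    0x1F600: (6, 1),  # GRINNING FACE
    0x0378: None,     # unassigned at the time of writing
}


def age_is(cp, version):
    """Age semantics: the code point was first assigned in exactly `version`."""
    return SAMPLE_AGE[cp] == version


def present_in(cp, version):
    """Present_In semantics: the code point exists in `version`,
    i.e. it was assigned in that version or an earlier one."""
    age = SAMPLE_AGE[cp]
    return age is not None and age <= version


print(age_is(0x0041, (6, 1)))       # 'A' was not first assigned in 6.1
print(present_in(0x0041, (6, 1)))   # but it does exist in 6.1
print(present_in(0x1F600, (6, 0)))  # not yet assigned in 6.0
```

Versions are compared as tuples so that, for example, (2, 1) sorts before (6, 1).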
|
|
|
|
|
| |
Now that this function is available in miniperl, mktables can use it to
avoid a bunch of visually distracting 'no overloading' calls.
|
|
|
|
| |
These apparently were once needed, but no longer.
|
|
|
|
| |
Spotted by Dagfinn Ilmari Mannsåker
|
| |
|
|
|
|
|
| |
These mentions of the tables removed in
b852e1da77b497e086508451bebff00541073fb1 were missed in that commit.
|
| |
|
|
|
|
| |
This is used for \b{lb}, and the rule is changing in Unicode 14.0.
|
|
|
|
| |
Move comments closer to the action
|
| |
|
|
|
|
|
| |
This generated file will be changed in a future commit. This shouldn't
have been relying on its syntax anyway, but on the value it returns.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
These files were once apparently intended for use by modules to
supplement the core Unicode handling. They contain tables for the
portions of the Unicode character database about changing the case of
characters and finding the numeric value of a given \d character, in a
form suitable for use by Perl code. In particular, they were designed
for fast access using the swash mechanism that has since been removed.
Unicode::UCD now contains more convenient methods of accessing
the data these contain, and the use of these files has been deprecated
since 5.16. I could not figure out a way to force a message should
someone open and read one of these files, but the text of each says
that the file may be removed without notice at any time. I did not find
any uses of them on CPAN.
Unicode is adding new properties that the format of these files will
not be able to handle. Consequently I'm coming up with a new format.
Though these files don't contain the new properties, their existence
means bearing the burden of maintaining two separate mechanisms.
Better to have just one mechanism, suitable for going forward.
|
|
|
|
|
| |
All .pm files are supposed to have this line. So far this hasn't been
necessary for this file, but future commits will require it.
|
|
|
|
|
| |
Then, re-run regen/mk_invlists.pl and regen/regcharclass.pl and commit
changes in headers.
|
|
|
|
| |
This is now generated by regcharclass.pl
|
|
|
|
| |
The latter phrase makes more sense
|
| |
|
|
|
|
|
|
| |
Commit 4fe9356b250 changed the signatures on subroutines, and didn't do
these correctly. The result was that perl would croak when using the
mktables debugging facility.
|
|
|
|
|
|
| |
Commit 4fe9356b250 changed the signatures on subroutines, and didn't do
this one correctly. The result was that the comments in the generated
files had duplicate text and were slightly garbled.
|
|
|
|
|
| |
This will be used in the next commit. It requires only the first two
bytes to determine if a UTF-8 or UTF-EBCDIC sequence is for a surrogate.
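To show why two bytes are enough, here is a minimal Python sketch for the ASCII-platform UTF-8 case only (UTF-EBCDIC uses different byte values and is not covered). The function name `starts_surrogate` is hypothetical; it is not the macro this commit adds.

```python
def starts_surrogate(seq: bytes) -> bool:
    """True if the first two bytes of a UTF-8 sequence indicate a surrogate.

    Assumes ASCII-platform UTF-8: U+D800..U+DFFF encode as ED A0..BF xx,
    so the third byte never needs to be examined.
    """
    return len(seq) >= 2 and seq[0] == 0xED and 0xA0 <= seq[1] <= 0xBF


# Python refuses to encode surrogates unless asked to pass them through.
high = "\ud800".encode("utf-8", "surrogatepass")
print(high.hex(), starts_surrogate(high))

# U+D7FF, the code point just below the surrogate range, shares the ED lead byte.
below = "\ud7ff".encode("utf-8")
print(below.hex(), starts_surrogate(below))
```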
|
|
|
|
|
|
|
|
|
|
|
| |
A change a couple of commits ago improved the generated output of this
script. This builds on that. The improvement was to try a transform that
could lead to fewer conditionals, as bytes were grouped in fewer
ranges.
But that introduced a useless transformation for the single-element
ranges that remain. This commit removes the transformation if not
needed.
|
|
|
|
| |
This is in preparation for a future commit
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
UTF-8 has some desirable characteristics not shared by UTF-EBCDIC. One
example is that all the continuation bytes are in a single range.
By transforming a UTF-EBCDIC byte into I8 (similar to UTF-8), we gain
those characteristics, and may be able to save a conditional or three.
This commit creates a 2nd pass over the bytes that are to be matched,
transforming them into I8. If that pass results in fewer conditionals
than the traditional, native, generated code, use that result instead.
This saves quite a bit in some of the generated code, enabling the
quotemeta macro to be represented in a single part; previously it had to
be split to avoid compiler macro size limits.
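The idea of keeping whichever byte representation needs fewer range tests can be sketched in Python. This is an analogy only, not the regcharclass.pl algorithm: it uses the single-byte EBCDIC code page cp037 in place of UTF-EBCDIC, and ASCII in place of I8. The helper name `count_ranges` is hypothetical.

```python
def count_ranges(byte_values):
    """Number of maximal runs of consecutive values in a set of bytes.

    Each run can be tested with one range conditional, so fewer runs
    means fewer conditionals in generated matching code.
    """
    values = sorted(set(byte_values))
    return sum(
        1 for i, v in enumerate(values) if i == 0 or v != values[i - 1] + 1
    )


letters = "abcdefghijklmnopqrstuvwxyz"
native = letters.encode("cp037")       # EBCDIC: a-i, j-r, s-z are separate runs
transformed = letters.encode("ascii")  # stand-in for the transformed form

# Keep whichever representation needs fewer range tests.
best = min((native, transformed), key=count_ranges)
print(count_ranges(native), count_ranges(transformed))
```

Here the native bytes fall into three runs while the transformed bytes form one, so the transformed form would be chosen.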
|
|
|
|
| |
A future commit will put a block around this; indent now.
|
|
|
|
| |
These will be used in a future commit
|
|
|
|
|
|
|
| |
A future commit will pass this function data that shouldn't be
translated into a mnemonic, like 'f' for the letter f. The reason is
that that code will potentially be executed on a machine with a
different character set than what the mnemonic would be valid for.
|
|
|
|
| |
A future commit will use this differently than the current name implies
|
|
|
|
|
| |
This moves a loop earlier in the execution path. This will be useful in
a later commit.
|
| |
|
| |
|
|
|
|
|
| |
We can short circuit some work by moving the test earlier. This does
not change the generated file.
|
| |
|
|
|
|
| |
This will make future commits read better.
|
|
|
|
|
|
|
|
| |
I realized that two base level utf8.h macros for UTF-8 could be
refactored to eliminate the conditionals in each. Those macros have
equivalents in the pure perl code changed by this commit, which I
changed before the utf8.h versions to verify that everything worked, by
verifying there was no difference in the generated tables.
|
|
|
|
|
|
|
|
|
| |
This commit makes is_HANGUL_ED_utf8_safe() return 0 unconditionally on
EBCDIC platforms. This means its callers don't have to care what
platform is running. Change the two callers to take advantage of this.
The commit also changes the description of the macro to be slightly more
accurate.
|
|
|
|
|
|
|
|
|
|
| |
This creates macros for the non-character code points so that, given the
length of the UTF-8 sequence, only those non-characters that have that
length match. This makes for more efficient processing, to be used in a
future commit.
The place where the length changes depends on the platform type, and
these macros will keep the code from having to worry about that.
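Grouping the non-characters by encoded length can be sketched in Python as follows. This assumes ASCII-platform UTF-8 lengths only; on UTF-EBCDIC the boundaries fall elsewhere, which is the platform dependence the message mentions. The names `is_noncharacter` and `by_length` are hypothetical and are not the generated macros.

```python
def is_noncharacter(cp):
    """Unicode noncharacters: U+FDD0..U+FDEF plus the last two code
    points (xxFFFE and xxFFFF) of each of the 17 planes."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# Group the noncharacters by the length of their UTF-8 encoding.
by_length = {}
for cp in range(0x110000):
    if is_noncharacter(cp):
        by_length.setdefault(len(chr(cp).encode("utf-8")), []).append(cp)

for length, cps in sorted(by_length.items()):
    print(length, len(cps), hex(cps[0]), hex(cps[-1]))
```

A caller that already knows the sequence length then only needs to test against the group for that length.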
|
| |
|
|
|
|
| |
redirection
|
| |
|
|
|
|
|
|
| |
This CPAN module doesn't work on recent Unicode versions.
This fixes GH #18787.
|
|
|
|
|
|
|
| |
Some other tweaks or modernizations are present, but I expect none of
this is controversial.
This also includes running regen/mk_invlists.pl and regen/regcharclass.pl.
|
|
|
|
|
|
|
|
|
|
|
| |
They generate C files.
Bump feature.pm and warnings.pm versions to satisfy cmpVERSION.pl.
I can't get it to easily ignore whitespace; `git diff --name-only`
does not respect the -w flag.
regen_perly.pl is left alone. That would require rebuilding
perly.* which is beyond a simple indentation change.
|
|
|
|
|
|
|
|
| |
The macros generated by this script may have to be split into sub-macros
to make the overall macro fit the maximum number of characters allowed
by the compiler for a macro definition. This commit adds a trailing
underscore to the names of such intermediate macros so as to mark them
as non-API for autodoc.
|
|
|
|
|
|
|
| |
Previously regcharclass.pl could tell if an input string was a
multi-character fold of some Unicode code point. This commit adds the
ability to return what that code point is. This capability will be used
in a later commit.
|