| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
| |
Vim's filetype declarations are case sensitive. The correct types for
Perl, C, and Pod are perl, c, and pod, respectively.
|
|
|
|
| |
so that GitHub syntax-highlights them properly
|
|
|
|
| |
Helps disentangle mixed up output
|
| |
|
|
|
|
|
|
|
|
|
| |
These are newly delivered by Unicode. I haven't had time to analyze
them for use for potential new properties. They deal with security
issues of characters that look alike.
I'm not adding them to the list of files under git, but they are
explicitly mentioned in mktables to indicate their not being used.
|
|
|
|
| |
As it becomes obsolete
|
|
|
|
|
| |
Unicode 15.0 is revising the heading format for non-UCD files; fix
mktables to be able to parse that.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Unicode 15.0 will now use this approach to deal with ranges of code
points that have a different default for unassigned code points than the
table at large. For example, a table may have one default, but all
Ideographic character ranges have something else. Prior to this new
mechanism, the files had entries for each unassigned code point that had
a different default than the global one. So this saves some lines in
the files that Unicode delivers that were otherwise useless.
Not all files in 15.0 have been converted to use the new scheme, for
whatever reason.
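
As a rough illustration of the range-specific defaults described above, here is a minimal Python sketch (not the Perl code in mktables). The sample lines, the helper names `parse_missing` and `default_for`, and the rule that a later matching line overrides an earlier one are assumptions made for the example.

```python
import re

# Matches lines of the assumed form "# @missing: 0000..10FFFF; N".
MISSING_RE = re.compile(
    r"^#\s*@missing:\s*([0-9A-Fa-f]+)\.\.([0-9A-Fa-f]+)\s*;\s*(\S+)"
)

def parse_missing(lines):
    """Collect (low, high, default) triples from @missing lines."""
    defaults = []
    for line in lines:
        m = MISSING_RE.match(line)
        if m:
            defaults.append((int(m.group(1), 16), int(m.group(2), 16), m.group(3)))
    return defaults

def default_for(code_point, defaults):
    """Return the default for an unassigned code point.

    Assumption: a later matching line overrides an earlier one, so a
    narrow range listed after the global line wins inside that range.
    """
    value = None
    for low, high, default in defaults:
        if low <= code_point <= high:
            value = default
    return value

# Illustrative input: one global default plus one range-specific default.
sample = [
    "# @missing: 0000..10FFFF; N",
    "# @missing: 3400..4DBF; W",
]
defaults = parse_missing(sample)
```

With this input, a code point inside 3400..4DBF gets the range default and everything else falls back to the global one, so no per-code-point entries are needed.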
|
|
|
|
|
|
| |
Unicode 15.0 may have multiple @missing lines for a single property,
that should use this class. This commit converts the storage into an
array to accommodate that need.
|
|
|
|
|
|
| |
These are so that you don't have to know everything at construction
time. The constructor function changes to call these with whatever it
does get passed.
|
|
|
|
|
|
| |
These lines have all had the same range (all of Unicode). But in
Unicode 15.0, there will be some with different ranges. This commit
changes to save those values (which are currently still unused)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As stated in the code comments added by this commit, Unicode has various
spellings for the same property value. For example in some places it
uses 'W', and in others 'Wide'. The legal spellings are listed in
PropValueAliases.txt, which is processed early in the construction. So
we can standardize things on input, which makes it easier later. This
commit produces minimal changes in the generated tables, so that the
algorithm can be verified by inspection of the results. And no other
code that has hard-coded in expected spellings needs to be changed.
Prior to this commit, we standardized the default value for properties
that have a default value.
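
The idea of standardizing spellings on input can be sketched as below. This is a Python illustration, not the mktables code; the function names are hypothetical, and choosing the long name as the standard spelling is an assumption made for the example, since the message does not say which spelling mktables settles on.

```python
def load_aliases(lines):
    """Build a lookup from (property, any spelling) to one standard spelling.

    Each input line is assumed to look like "ea ; W ; Wide":
    property, short name, long name, then any further aliases.
    """
    table = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        prop, names = fields[0], fields[1:]
        # Assumption for this sketch: the long name is the standard one.
        standard = names[1] if len(names) > 1 else names[0]
        for name in names:
            table[(prop.lower(), name.lower())] = standard
    return table

def standardize(table, prop, value):
    """Map a value to its standard spelling; unknown values pass through."""
    return table.get((prop.lower(), value.lower()), value)

aliases = load_aliases([
    "ea ; W  ; Wide",
    "ea ; Na ; Narrow",
])
```

Once every value has passed through `standardize`, later code only ever sees one spelling per value.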
|
|
|
|
|
| |
This includes indenting a block of code in anticipation of a future
commit which will form a conditional block around it
|
|
|
|
|
| |
This changes an inside-out hash reference to have a shorthand for it,
making for better readability
|
|
|
|
|
|
|
|
| |
Prior to this commit we had a two element array, and it was known that
element 0 contained a particular thing; and element 1 contained the
other. But a future commit will add several elements, so keeping track
of which is which will become more problematic. Solve this by using a
hash instead, with the elements appropriately named.
|
|
|
|
|
| |
These functions were missed or broken by 4fe9356b250. They're used only
in debugging, so it wasn't noticed until now.
|
|
|
|
|
|
|
| |
This moves some code ahead of other code so that the end of the sub
deals entirely with a single issue. This is in preparation for 15.0, where
that issue becomes moot, so we can then change to return early from the
sub.
|
|
|
|
| |
This failed to exit when the file handle was exhausted
|
|
|
|
|
|
|
| |
This is a relic from long ago. mktables creates lib/unicore/Name.pm.
And in that file which is for internal core use only, it was creating
the beginnings of some pod, but quite incomplete; this was confusing
buildtoc, which perhaps could be hardened against such inputs.
|
|
|
|
|
|
|
|
| |
These use checksums to see if the generated data could be out of date.
The new NormTest.pl wasn't counted in this, and needn't be, but
excluding it and other similar ones is more trouble than it's worth, so
make a comment to that effect and update to include the NormTest.pl
digest value.
|
|
|
|
|
|
|
|
|
|
|
|
| |
This closes #19603
Unicode has various characters whose numeric value is rational
non-integer. These can be specified in \p{nv=...} constructs by either
the rational form or by an expression that it evaluates to. The number
of significant digits that must match is kept to a minimum to allow for
variances in different platforms' floating point lengths and rounding
decisions. Previously that number was 2 digits; but that is no longer
always sufficient for all platforms. This commit changes it to 3.
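
To make the significant-digit comparison concrete, here is a small Python sketch. It is not the algorithm Perl uses internally; the function name `matches_nv` and the approach of comparing values rounded in scientific notation are assumptions for illustration only.

```python
def matches_nv(user_value, table_value, digits=3):
    """Compare two numbers after rounding both to `digits` significant digits.

    Formatting with '.{digits-1}e' keeps one digit before the decimal
    point and digits-1 after it, i.e. `digits` significant digits.
    """
    fmt = "{:." + str(digits - 1) + "e}"
    return fmt.format(user_value) == fmt.format(table_value)
```

With 3 digits, 0.333 matches 1/3 but 0.33 does not; with the earlier setting of 2 digits, 0.33 would also have matched.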
|
|
|
|
|
| |
This macro starts from the right side and matches UTF-8 white space
characters.
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
\p{Present_In} is a Perl extension of the Unicode Age property, added
because knowing the exact Unicode version in which a code point became
assigned is rarely what you want; much more frequently you want to know
if the code point exists in the version or not. (Since this extension
was added, Unicode changed their language to declare that the Age
property should be interpreted in pattern matching, not as described,
but as Perl's Present_In is. But I chose to not change Age, to avoid
backwards compatibility issues, and this way, a coder can choose which
thing s/he wanted.)
Unicode typically has synonyms (aliases) for each value a property can
take on, so \p{Age=6.1} and \p{Age=V6_1} mean the same thing.
Prior to this commit, neither \p{Present_In=1_1} nor \p{Present_In=NA}
worked.
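
The difference between the two interpretations can be sketched in Python as follows. The helper names are hypothetical, and the ages are passed in by the caller rather than looked up in the Unicode database; `None` stands for an unassigned code point (the NA value).

```python
def parse_version(text):
    """Turn '6.1' into (6, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))

def age_matches(code_point_age, version):
    """Age as described: true only for the exact version of first assignment."""
    return code_point_age is not None and (
        parse_version(code_point_age) == parse_version(version)
    )

def present_in(code_point_age, version):
    """Present_In: true if the code point was assigned in that version or earlier."""
    return code_point_age is not None and (
        parse_version(code_point_age) <= parse_version(version)
    )
```

For example, the euro sign U+20AC was first assigned in Unicode 2.1, so `present_in("2.1", "6.1")` is true while `age_matches("2.1", "6.1")` is false.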
|
|
|
|
|
| |
Now that this function is available in miniperl, mktables can use it to
avoid a bunch of visually distracting 'no overloading' calls.
|
|
|
|
| |
These apparently were once needed, but no longer.
|
|
|
|
| |
Spotted by Dagfinn Ilmari Mannsåker
|
| |
|
|
|
|
|
| |
These mentions of the tables removed in
b852e1da77b497e086508451bebff00541073fb1 were missed in that commit.
|
| |
|
|
|
|
| |
This is used for the \b{lb}, and the rule is changing in Unicode 14.0
|
|
|
|
| |
Move comments closer to the action
|
| |
|
|
|
|
|
| |
This generated file will be changed in a future commit. This shouldn't
have been relying on its syntax anyway, but on the value it returns.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
These files were once apparently intended for use by modules to
supplement the core Unicode handling. They contain tables, in a form
suitable for use by Perl code, of the portions of the Unicode character
database about changing the case of characters and finding the numeric
value of a given \d character. In particular, they were designed for
fast access using the swash mechanism that has since been removed.
Unicode::UCD now contains more convenient methods of accessing the data
these contain, and the use of these files has been deprecated since
5.16. I could not figure out a way to force a message should someone
open and read one of these files, but the text of each says that the
file may be removed without notice at any time. I did not find any uses
of them on CPAN.
Unicode is adding new properties that the format of these files will
not be able to handle. Consequently I'm coming up with a new format.
Though these files don't contain the new properties, their existence
means having the burden of having to maintain two separate mechanisms.
Better to have just one mechanism, suitable for going forward.
|
|
|
|
|
| |
All .pm files are supposed to have this line. So far this hasn't been
necessary for this file, but future commits will require it.
|
|
|
|
|
| |
Then, re-run regen/mk_invlists.pl and regen/regcharclass.pl and commit
changes in headers.
|
|
|
|
| |
This is now generated by regcharclass.pl
|
|
|
|
| |
The latter phrase makes more sense
|
| |
|
|
|
|
|
|
| |
Commit 4fe9356b250 changed the signatures on subroutines, and didn't do
these correctly. The result was that perl would croak when using the
mktables debugging facility.
|
|
|
|
|
|
| |
Commit 4fe9356b250 changed the signatures on subroutines, and didn't do
this one correctly. The result was that the comments in the generated
files had duplicate text and were slightly garbled.
|
|
|
|
|
| |
This will be used in the next commit. It requires only the first two
bytes to determine if a UTF-8 or UTF-EBCDIC sequence is for a surrogate
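
For the UTF-8 half of this, the two-byte check can be sketched in Python as below. This is not the C macro the commit adds, and the function name is hypothetical. It relies on the fact that code points U+D800..U+DFFF, when encoded with the UTF-8 algorithm, start with byte 0xED followed by a byte in 0xA0..0xBF. UTF-EBCDIC uses different byte values and is not covered here.

```python
def starts_surrogate_utf8(first_byte, second_byte):
    """True if a UTF-8-style sequence beginning with these bytes encodes a surrogate.

    U+D800..U+DFFF encode as ED A0..BF xx, so the third byte never
    needs to be examined.
    """
    return first_byte == 0xED and 0xA0 <= second_byte <= 0xBF

def encode_lax(code_point):
    """Encode a code point as UTF-8, allowing surrogates through for testing."""
    return chr(code_point).encode("utf-8", "surrogatepass")
```

Checking the boundaries: U+D7FF encodes as ED 9F BF and U+E000 as EE 80 80, so neither is reported as a surrogate.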
|
|
|
|
|
|
|
|
|
|
|
| |
A commit a couple of commits ago improved the generated output of this
script, and this builds on that. The improvement was to try a transform
that could lead to fewer conditionals, as bytes were grouped in fewer
ranges.
But that introduced a useless transformation for the single element
ranges that remain. This commit removes the transformation if not
needed.
|
|
|
|
| |
This is in preparation for a future commit
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
UTF-8 has some desirable characteristics not shared by UTF-EBCDIC. One
example is all the continuation bytes are in a single range.
By transforming a UTF-EBCDIC byte into I8 (similar to UTF-8), we gain
those characteristics, and may be able to save a conditional or three.
This commit creates a 2nd pass over the bytes that are to be matched,
transforming them into I8. If that pass results in fewer conditionals
than the traditional, native, generated code, use its result instead.
This saves quite a bit in some of the generated code, enabling the
quotemeta macro to be represented in a single part; previously it had to
be split to avoid compiler macro size limits.
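
The two-pass comparison can be sketched in Python as follows. This is an illustration of the idea only, not the generator script itself: the function names are hypothetical, the number of contiguous ranges is used as a stand-in for the number of conditionals, and the `toy_transform` mapping is invented for the example and is not the real native-to-I8 table.

```python
def to_ranges(byte_values):
    """Group byte values into sorted, inclusive (low, high) ranges."""
    ranges = []
    for value in sorted(set(byte_values)):
        if ranges and value == ranges[-1][1] + 1:
            ranges[-1][1] = value  # extend the current contiguous range
        else:
            ranges.append([value, value])
    return [tuple(r) for r in ranges]

def pick_cheaper(native_bytes, transform):
    """Try both forms and keep whichever needs fewer ranges.

    Assumption: each range costs about one conditional, so fewer ranges
    means fewer conditionals. The native form wins ties, since it needs
    no transformation step at run time.
    """
    native = to_ranges(native_bytes)
    transformed = to_ranges(transform[b] for b in native_bytes)
    if len(transformed) < len(native):
        return "transformed", transformed
    return "native", native

# Invented mapping: three scattered native bytes become one contiguous run.
toy_transform = {0x41: 0xA0, 0x51: 0xA1, 0x61: 0xA2}
choice = pick_cheaper([0x41, 0x51, 0x61], toy_transform)
```

Here the native bytes form three single-element ranges, while the transformed bytes form one range, so the transformed form is chosen.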
|
|
|
|
| |
A future commit will put a block around this; indent now.
|