| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
| |
Fix spelling on various files pertaining to core Perl.
|
|
|
|
|
| |
Vim's filetype declarations are case sensitive. The correct types for
Perl, C, and Pod are perl, c, and pod, respectively.
|
|
|
|
| |
so that GitHub syntax-highlights them properly
|
|
|
|
| |
Helps disentangle mixed-up output
|
| |
|
|
|
|
|
|
|
|
|
| |
These are newly delivered by Unicode. I haven't had time to analyze
them for potential use in new properties. They deal with security
issues of characters that look alike.
I'm not adding them to the list of files under git, but they are
explicitly mentioned in mktables to indicate that they are not being used.
|
|
|
|
| |
As it becomes obsolete
|
|
|
|
|
| |
Unicode 15.0 is revising the heading format for non-UCD files; fix
mktables to be able to parse that.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Unicode 15.0 will now use this approach to deal with ranges of code
points that have a different default for unassigned code points than the
table at large. For example, a table may have one default, but all
Ideographic character ranges have something else. Prior to this new
mechanism, the files had entries for each unassigned code point that had
a different default than the global one. So this saves some lines in
the files that Unicode delivers that were otherwise useless.
Not all files in 15.0 have been converted to use the new scheme, for
whatever reason.
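A minimal sketch of how range-specific defaults of this kind might be resolved, written in Python rather than mktables' Perl. The sample lines follow the `# @missing: START..END; VALUE` shape; the function names, the two-field-only parsing, and the rule that a later line overrides an earlier one where ranges overlap are assumptions made for illustration.

```python
import re

# Matches two-field lines such as "# @missing: 0590..05FF; Right_To_Left".
# Other @missing layouts are deliberately ignored in this sketch.
MISSING_RE = re.compile(
    r"^#\s*@missing:\s*([0-9A-Fa-f]+)\.\.([0-9A-Fa-f]+)\s*;\s*(\S+)\s*$"
)


def parse_missing_lines(lines):
    """Return a list of (start, end, default) tuples in file order."""
    defaults = []
    for line in lines:
        m = MISSING_RE.match(line)
        if m:
            defaults.append((int(m.group(1), 16), int(m.group(2), 16), m.group(3)))
    return defaults


def default_for(code_point, defaults):
    """Assumed rule: the last matching line in file order wins."""
    result = None
    for start, end, value in defaults:
        if start <= code_point <= end:
            result = value
    return result


sample = [
    "# @missing: 0000..10FFFF; Left_To_Right",
    "# @missing: 0590..05FF; Right_To_Left",
]
table = parse_missing_lines(sample)
```

With this sample, the global default covers most code points, and the narrower range supplies a different default without needing one file entry per unassigned code point.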
|
|
|
|
|
|
| |
Unicode 15.0 may have multiple @missing lines for a single property,
which should use this class. This commit converts the storage into an
array to accommodate that need.
|
|
|
|
|
|
| |
These are so that you don't have to know everything at construction
time. The constructor function changes to call these with whatever it
does get passed.
|
|
|
|
|
|
| |
These lines have all had the same range (all of Unicode). But in
Unicode 15.0, there will be some with different ranges. This commit
saves those values (which are currently still unused).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As stated in the code comments added by this commit, Unicode has various
spellings for the same property value. For example in some places it
uses 'W', and in others 'Wide'. The legal spellings are listed in
PropValueAliases.txt, which is processed early in the construction. So
we can standardize things on input, which makes it easier later. This
commit produces minimal changes in the generated tables, so that the
algorithm can be verified by inspection of the results. And no other
code that has hard-coded in expected spellings needs to be changed.
Prior to this commit, we standardized the default value for properties
that have a default value.
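The idea of standardizing spellings on input can be sketched as follows, in Python rather than mktables' Perl. The sample lines use the `property ; short ; long` layout of PropValueAliases.txt; the function names, the loose-matching rule, and the choice of the last-listed alias as the standard spelling are assumptions, since the commit message does not say which spelling mktables settles on.

```python
def loose(name):
    """Loose key: ignore case, spaces, underscores, and hyphens."""
    return "".join(ch for ch in name.lower() if ch not in " _-")


def build_alias_map(lines):
    """Map (property, loose spelling) to the last alias listed on the line."""
    aliases = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        prop, spellings = fields[0], fields[1:]
        standard = spellings[-1]  # assumed choice of standard form
        for spelling in spellings:
            aliases[(loose(prop), loose(spelling))] = standard
    return aliases


def standardize(aliases, prop, value):
    """Return the standard spelling, or the input unchanged if unknown."""
    return aliases.get((loose(prop), loose(value)), value)


sample = [
    "ea ; W  ; Wide",
    "ea ; Na ; Narrow",
]
alias_map = build_alias_map(sample)
```

Here both `standardize(alias_map, "ea", "W")` and `standardize(alias_map, "ea", "Wide")` yield the same spelling, so later stages only ever see one form.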
|
|
|
|
|
| |
This includes indenting a block of code in anticipation of a future
commit which will form a conditional block around it
|
|
|
|
|
| |
This changes an inside-out hash reference to have a shorthand for it,
making for better readability
|
|
|
|
|
|
|
|
| |
Prior to this commit we had a two element array, and it was known that
element 0 contained a particular thing; and element 1 contained the
other. But a future commit will add several elements, so keeping track
of which is which will become more problematic. Solve this by using a
hash instead, with the elements appropriately named.
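As a small illustration of the same refactoring in Python (the commit itself is Perl, and the field names below are hypothetical because the message does not name the two elements):

```python
# Before: position carries the meaning, so every caller must remember
# which index holds which item.
entry_as_list = ["first thing", "other thing"]

# After: each element is named, so adding more fields later does not
# disturb existing callers.
entry_as_dict = {"thing": "first thing", "other": "other thing"}


def describe(entry):
    """Read fields by name rather than by index."""
    return f"{entry['thing']} / {entry['other']}"


summary = describe(entry_as_dict)
```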
|
|
|
|
|
| |
These functions were missed or broken by 4fe9356b250. They're used only
in debugging, so it wasn't noticed until now.
|
|
|
|
|
|
|
| |
This moves some code ahead of other code so that the end of the sub all
works on a single related issue. This is in preparation for 15.0, where
that issue becomes moot, so we can then change to return early from the
sub.
|
|
|
|
| |
This failed to exit when the file handle was exhausted
|
| |
|
|
|
|
|
|
|
| |
This is a relic from long ago. mktables creates lib/unicore/Name.pm,
and in that file, which is for internal core use only, it was creating
the beginnings of some pod, but quite incomplete. This was confusing
buildtoc, which perhaps could be hardened against such inputs.
|
|
|
|
|
|
|
| |
C reserves symbols beginning with underscores for its own use. This
commit moves the underscore so it is trailing, which is legal. The
symbols changed here are many of the ones in handy.h that have
significant uses outside it.
|
|
|
|
|
|
|
|
| |
These use checksums to see if the generated data could be out of date.
The new NormTest.pl wasn't counted in this, and needn't be, but
excluding it and other similar ones is more trouble than it's worth, so
make a comment to that effect and update to include the NormTest.pl
digest value.
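A rough sketch of a checksum-based staleness check, in Python for illustration only. The digest algorithm (SHA-256 here), the function names, and the in-memory stand-ins for file contents are all assumptions; the commit message does not describe how the real digests are computed.

```python
import hashlib


def digest_of(named_contents):
    """Hash all inputs in a stable (sorted) order."""
    h = hashlib.sha256()
    for name in sorted(named_contents):
        h.update(name.encode("utf-8"))
        h.update(b"\0")  # separator so names and contents cannot run together
        h.update(named_contents[name])
    return h.hexdigest()


def is_stale(recorded_digest, named_contents):
    """Generated data may be out of date if the inputs' digest changed."""
    return recorded_digest != digest_of(named_contents)


# Stand-ins for the files that feed the generated data.
inputs = {"mktables": b"script body", "NormTest.pl": b"test data"}
recorded = digest_of(inputs)
changed = dict(inputs)
changed["NormTest.pl"] = b"edited test data"
```

Including a file that does not strictly need tracking, as this commit does with NormTest.pl, only means the digest changes slightly more often than necessary.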
|
|
|
|
|
|
|
|
| |
This commit is actually by the committer, and is intended to ensure that
someone looking for what the author wrote can find it. It took me a while
to get an email address for him, or I would have done this in eda35008b17e739922,
which is where his work on the _squeeze() split key algorithm was added.
Credit where credit is due and all of that. Thanks, Ilya.
|
|
|
|
|
|
|
| |
Exercise an abundance of caution and validate that the buffer and split
point data returned is fit for purpose.
Includes the output of running regen/mk_invlists.pl.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
smaller blobs
The squeeze algorithm produces smaller blobs, 10-20% depending on how it
is used. With the "randomize_squeeze" option enabled it is slower but
produces 20% smaller blobs than the "_simple" strategy we used to use.
With the "randomize_squeeze" option disabled it is about as fast as
"_simple" but produces about 10% smaller blobs. Regardless, "_squeeze"
uses more memory than "_simple"; quite a bit more currently, although that
is not inherent and could be changed if required.
-blob length: 10548
+blob length: 8635
...
-data size: 69908 (%67.07)
+data size: 67995 (%65.23)
So it saves 1913 bytes running with this seed. I happened to get lucky
with the seed; depending on the seed used, the blob ended up at about 8650
bytes.
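The figures quoted above can be checked with a few lines of arithmetic (Python used purely as a calculator; the variable names are illustrative):

```python
old_blob, new_blob = 10548, 8635
old_data, new_data = 69908, 67995

saved_blob = old_blob - new_blob   # bytes saved in the blob itself
saved_data = old_data - new_data   # bytes saved in the overall data size
percent_smaller = 100 * saved_blob / old_blob

print(saved_blob, saved_data, round(percent_smaller, 1))
```

Both differences come to 1913 bytes, and the blob shrinks by roughly 18% for this seed, which is in line with the "about 20%" figure given for the randomized option.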
This algorithm is originally by Ilya Sashcheka, so I have added him to
the AUTHORS file, but unfortunately I no longer have his email address
as we lost touch. It contains many modifications by me.
|
|
|
|
|
|
|
|
|
|
|
| |
The old sub-based API was passing around an awkward number of arguments
and was becoming difficult to enhance in certain ways. This patch
changes all the "user serviceable" functions into methods, and moves the
configuration defaults into the constructor.
Note that not all the functions have been converted; the core routines with
simple interfaces have not been changed. This is OO for the purpose of
encapsulation, not inheritance or overloading.
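The shape of that change can be sketched in Python. Everything here is hypothetical: the class name, the `max_tries` and `debug` options, and the method names are invented for illustration and are not the real regen/mph.pl interface.

```python
class BlobBuilder:
    """Hypothetical stand-in showing configuration held by an object."""

    def __init__(self, max_tries=3, debug=0):
        # Configuration defaults live in one place: the constructor.
        self.max_tries = max_tries
        self.debug = debug

    def describe(self):
        # Methods read configuration from self instead of receiving it
        # as extra positional arguments on every call.
        return f"max_tries={self.max_tries} debug={self.debug}"

    def build(self, keys):
        # A real builder would do actual work; this only reports inputs.
        return {"config": self.describe(), "count": len(keys)}


builder = BlobBuilder(debug=1)
result = builder.build(["alpha", "beta"])
```

Callers pass only the options they care about, and adding a new option later does not change any existing call site.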
|
|
|
|
|
| |
Be silent unless requested to. If DEBUG>1 produce lots of output,
if DEBUG==1 produce some basic information about what is going on.
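One way to picture the three levels, as a Python sketch. The function name and messages are hypothetical, and it is an assumption that the verbose level also includes the basic information.

```python
def debug_lines(level, basic, detail):
    """Return the messages that would be emitted at a given DEBUG level."""
    out = []
    if level >= 1:
        out.append(basic)    # DEBUG==1: basic progress information
    if level > 1:
        out.append(detail)   # DEBUG>1: the verbose output as well
    return out               # level 0: silent


silent = debug_lines(0, "building table", "bucket 17 has 3 keys")
basic_only = debug_lines(1, "building table", "bucket 17 has 3 keys")
verbose = debug_lines(2, "building table", "bucket 17 has 3 keys")
```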
|
|
|
|
|
|
|
|
|
|
|
| |
This adds a way to tell mk_invlists.pl to dump the keywords hash so it
can be reviewed, or used for testing or whatnot. A user can define the
env var DUMP_KEYWORDS_FILE to be a file name which will be used to save
the keywords hash to. If the env var is not set the file won't get
written to disk.
Includes regenerated output from running regen/mk_invlists.pl to keep
porting/regen.t happy.
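The opt-in behaviour of DUMP_KEYWORDS_FILE can be sketched like this in Python. The function name, the JSON output format, and the sample hash are assumptions for illustration; the real script is Perl and its dump format is not described here.

```python
import json
import os
import tempfile


def maybe_dump_keywords(keywords, environ=os.environ):
    """Write the keywords hash only if DUMP_KEYWORDS_FILE names a file."""
    path = environ.get("DUMP_KEYWORDS_FILE")
    if not path:
        return False  # env var unset: nothing is written to disk
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(keywords, fh, indent=1, sort_keys=True)
    return True


# Demonstration with an explicit environment mapping and a temp directory.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "keywords.json")
    wrote = maybe_dump_keywords({"nv=1/2": 1}, {"DUMP_KEYWORDS_FILE": target})
    with open(target, encoding="utf-8") as fh:
        dumped = json.load(fh)

skipped = maybe_dump_keywords({"nv=1/2": 1}, {})
```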
|
|
|
|
|
|
|
|
|
| |
sub token_name() was injected into the middle of totally unrelated logic
that does not use it. token_name() is a wrapper around sanitize_name()
so move it next to that sub.
Also includes the output from running regen/mk_invlists.pl to keep
porting/regen.t happy.
|
|
|
|
|
|
|
|
|
|
|
| |
mk_invlists.pl does a lot and takes a while before it gets to the part
where it requires regen/mph.pl, which means that if there are issues in
it they aren't discovered until a fair amount of time elapses, which is
frustrating when debugging. Moving the require to the top means the
script dies early and can be fixed.
Includes a regen of uni_keywords.h and friends as this changes a regen
script which causes regen.t to fail if its output is not up to date.
|
|
|
|
|
|
|
|
|
|
|
|
| |
This closes #19603
Unicode has various characters whose numeric value is a rational
non-integer. These can be specified in \p{nv=...} constructs by either
the rational form or by an expression that it evaluates to. The number
of significant digits that must match is kept to a minimum to allow for
variations in floating-point precision and rounding decisions across
platforms. Previously that number was 2 digits, but that is no longer
sufficient on all platforms. This commit changes it to 3.
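To picture what "three significant digits must match" could mean, here is a Python sketch. It is not the comparison the Perl core performs; the function names and the round-then-compare approach are assumptions chosen to make the idea concrete.

```python
from fractions import Fraction


def round_sig(x, digits=3):
    """Round a float to the given number of significant digits."""
    return float(f"{x:.{digits}g}")


def nv_matches(rational, given, digits=3):
    """Compare a rational numeric value with a user-supplied decimal."""
    return round_sig(float(rational), digits) == round_sig(given, digits)


third = Fraction(1, 3)
close_enough = nv_matches(third, 0.333)
extra_digits = nv_matches(third, 0.33333333)
too_coarse = nv_matches(third, 0.33)
```

Under this sketch, 0.333 and 0.33333333 both match 1/3, while 0.33 does not, because it differs in the third significant digit.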
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
\p{Present_In} is a Perl extension of the Unicode Age property, added
because knowing the exact Unicode version in which a code point became
assigned is rarely what you want; much more frequently you want to know
if the code point exists in the version or not. (Since this extension
was added, Unicode changed their language to declare that the Age
property should be interpreted in pattern matching, not as described,
but as Perl's Present_In is. But I chose to not change Age, to avoid
backwards-compatibility issues, and this way, a coder can choose which
thing s/he wants.)
Unicode typically has synonyms (aliases) for each value a property can
take on, so \p{Age=6.1} and \p{Age=V6_1} mean the same thing.
Prior to this commit, neither \p{Present_In=1_1} nor \p{Present_In=NA}
worked.
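The difference between the two properties can be sketched in Python. The function names and the tiny lookup table are illustrative only; the three ages listed are the Unicode versions in which those code points were first assigned.

```python
# Version in which each sample code point was first assigned.
AGE = {
    0x0041: (1, 1),   # LATIN CAPITAL LETTER A
    0x20AC: (2, 1),   # EURO SIGN
    0x1F600: (6, 1),  # GRINNING FACE
}


def age_is(code_point, version):
    """Age semantics: assigned in exactly this version."""
    return AGE.get(code_point) == version


def present_in(code_point, version):
    """Present_In semantics: assigned in this version or any earlier one."""
    age = AGE.get(code_point)
    return age is not None and age <= version


euro_age_6_1 = age_is(0x20AC, (6, 1))
euro_present_6_1 = present_in(0x20AC, (6, 1))
```

The euro sign has an Age of 2.1, so it does not match Age 6.1, yet it is present in 6.1, which is usually the question being asked.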
|
|
|
|
|
| |
Now that this function is available in miniperl, mktables can use it to
avoid a bunch of visually distracting 'no overloading' calls.
|
|
|
|
| |
These apparently were once needed, but no longer.
|
|
|
|
| |
Spotted by Dagfinn Ilmari Mannsåker
|
| |
|
|
|
|
|
| |
These mentions of the tables removed in
b852e1da77b497e086508451bebff00541073fb1 were missed in that commit.
|
| |
|
| |
|
|
|
|
| |
This is used for the \b{lb}, and the rule is changing in Unicode 14.0
|
|
|
|
| |
Move comments closer to the action
|
| |
|
|
|
|
|
| |
This generated file will be changed in a future commit. This shouldn't
have been relying on its syntax anyway, but on the value it returns.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
These files were apparently once intended for use by modules to
supplement the core Unicode handling. They contain the portions of the
Unicode character database about changing the case of characters and
finding the numeric value of a given \d character, as tables in a form
suitable for use by Perl code. In particular, they were designed for
fast access using the swash mechanism that has since been removed.
Unicode::UCD now contains more convenient methods of accessing
the data these contain, and the use of these files has been deprecated
since 5.16. I could not figure out a way to force a message should
someone open and read one of these files, but the text of each says
that the file may be removed without notice at any time. I did not find
any uses of them on CPAN.
Unicode is adding new properties that the format of these files will
not be able to handle. Consequently I'm coming up with a new format.
Though these files don't contain the new properties, their existence
means having the burden of having to maintain two separate mechanisms.
Better to have just one mechanism, suitable for going forward.
|
|
|
|
|
| |
All .pm files are supposed to have this line. So far this hasn't been
necessary for this file, but future commits will require it.
|