| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
| |
Fix spelling on various files pertaining to core Perl.
|
| |
|
|
|
|
|
|
|
|
| |
These use checksums to see if the generated data could be out of date.
The new NormTest.pl wasn't counted in this, and needn't be, but
excluding it and other similar ones is more trouble than it's worth, so
make a comment to that effect and update to include the NormTest.pl
digest value.
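A rough sketch of the staleness check, in C for illustration only. It assumes a recorded digest is compared against a freshly computed one; FNV-1a is swapped in as a simple stand-in for whatever digest the real scripts use, and `digest` and `is_stale` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* FNV-1a, 32-bit: a simple stand-in for the real digest (assumption). */
uint32_t digest(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t h = 2166136261u;          /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;                /* FNV prime */
    }
    return h;
}

/* Returns 1 if the input no longer matches the digest recorded when the
 * output was generated, i.e. the generated data could be out of date. */
int is_stale(const void *input, size_t len, uint32_t recorded)
{
    return digest(input, len) != recorded;
}
```

Adding a new input such as NormTest.pl to the set being digested changes the recorded value, which is why the digest has to be updated.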
|
| |
smaller blobs
The squeeze algorithm produces smaller blobs, 10-20% depending on how it
is used. With the "randomize_squeeze" option enabled it is slower but
produces 20% smaller blobs than the "_simple" strategy we used to use.
With the "randomize_squeeze" option disabled it is about as fast as
"_simple" but produces about 10% smaller blobs. Regardless, "_squeeze"
uses more memory than "_simple"; quite a bit more currently, although that
is not inherent to the algorithm and could be changed if required.
-blob length: 10548
+blob length: 8635
...
-data size: 69908 (%67.07)
+data size: 67995 (%65.23)
So it saves 1913 bytes running with this seed. I happened to get lucky
with the seed; depending on the seed used, the blob ended up at about
8650 bytes.
This algorithm is originally by Ilya Sashcheka, so I have added him to
the AUTHORS file, but unfortunately I no longer have his email address
as we lost touch. It contains many modifications by me.
|
|
|
|
|
|
|
|
|
|
|
| |
The old sub based API was passing around an awkward number of arguments
and it was becoming difficult to enhance in certain ways. This patch
changes all the "user serviceable" functions into methods, and moves the
configuration defaults into the constructor.
Note that not all the functions have been converted: the core routines
with simple interfaces have not been changed. This is OO for the purpose
of encapsulation, not inheritance or overloading.
|
|
|
|
|
|
|
|
|
|
|
| |
This adds a way to tell mk_invlists.pl to dump the keywords hash so it
can be reviewed, or used for testing or whatnot. A user can set the
env var DUMP_KEYWORDS_FILE to the name of a file to which the keywords
hash will be saved. If the env var is not set, no file is written to
disk.
Includes regenerated output from running regen/mk_invlists.pl to keep
porting/regen.t happy.
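The opt-in behaviour can be sketched as follows. This is C for illustration only (the real script is Perl); only the DUMP_KEYWORDS_FILE name comes from the message, while the function names, the "key value" output format, and the treatment of an empty value as unset are assumptions.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: writes "key value" lines to 'path'.
 * Returns 0 if path is NULL or empty (nothing is written),
 * 1 if the file was written, -1 on an I/O error. */
int dump_keywords_to(const char *path, const char *const keys[],
                     const int values[], size_t n)
{
    if (path == NULL || *path == '\0')
        return 0;

    FILE *fh = fopen(path, "w");
    if (fh == NULL)
        return -1;

    for (size_t i = 0; i < n; i++)
        fprintf(fh, "%s %d\n", keys[i], values[i]);

    return fclose(fh) == 0 ? 1 : -1;
}

/* The dump only happens when the user has set the environment variable. */
int dump_keywords_if_requested(const char *const keys[],
                               const int values[], size_t n)
{
    return dump_keywords_to(getenv("DUMP_KEYWORDS_FILE"), keys, values, n);
}
```

Keeping the getenv() call in a thin wrapper means the writing logic can be exercised without touching the environment.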
|
|
|
|
|
|
|
|
|
| |
sub token_name() was injected into the middle of totally unrelated logic
that does not use it. token_name() is a wrapper around sanitize_name()
so move it next to that sub.
Also includes the output from running regen/mk_invlists.pl to keep
porting/regen.t happy.
|
|
|
|
|
|
|
|
|
|
|
| |
mk_invlists.pl does a lot and takes a while before it gets to the part
where it requires regen/mph.pl, which means that any issues in mph.pl
aren't discovered until a fair amount of time has elapsed, which is
frustrating when debugging. Moving the require to the top means the
script dies early and can be fixed.
Includes a regen of uni_keywords.h and friends as this changes a regen
script which causes regen.t to fail if its output is not up to date.
|
| |
|
| |
|
|
|
|
| |
This is used for \b{lb}, and the rule is changing in Unicode 14.0.
|
|
|
|
|
|
|
|
|
|
|
| |
They generate C files.
Bump feature.pm and warnings.pm versions to satisfy cmpVERSION.pl.
I can't easily get it to ignore whitespace: `git diff --name-only`
does not respect the -w flag.
regen_perly.pl is left alone. That would require rebuilding
perly.* which is beyond a simple indentation change.
|
|
|
|
| |
All symbols in here are for core use only.
|
|
|
|
|
|
|
|
|
| |
The MICRO SIGN folds to above the Latin1 range, the only character that
does so in Unicode (or is ever likely to). This requires special handling.
This commit reduces some of the need for that handling by creating the
inversion map for it, which can be used in certain instances in pattern
matching, without having to have a special case. The actual use of this
will come in a future commit.
|
|
|
|
|
|
| |
The previous commit changed the code so that enums and #defines could be
requested to be in re_comp.c. This commit changes to use that new
capability.
|
|
|
|
|
|
|
| |
To save memory, tables that are for regcomp.c are excluded from
re_comp.c. But enums use no resources, and a later commit will want them
accessible from re_comp.c. So change the code so that they can be
requested to be in re_comp.c.
|
|
|
|
|
| |
This value will be needed outside of where it is currently defined; this
commit makes it available elsewhere.
|
| |
Unicode has changed its yearly release cycle so that the final version
is not available until early March of the year. This year it is March
10, 2020.
However, all changes planned were finalized in early January, and the
actual computer files have been updated to their presumably final
substantive versions. The release has been authorized without further
review needed.
The release is awaiting final documentation additions, and soak time of
the beta to verify there are no glitches. This commit causes Perl to
participate in that soak.
I don't anticipate any problems, and likely the only substantive change
upon the official release will be to update perldelta. Comments in the
files supplied by Unicode will likely also change to indicate these are
no longer beta.
There were very few changes affecting existing characters; most of the
changes involved adding new characters, including emoji. The break
characteristics of some existing characters were changed (GCB, LB, WB,
and SB properties). The only perl code I really had to change to cope
with the new release was about rules in the Line Break property, dealing
with ellipses (...) and certain East Asian characters next to opening
parentheses.
If there are problems, we can revert this at any time, and ship with
12.0.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Unicode 12.0 used a new property file that was not from the Unicode
Character Database. It only had a long property name. I incorporated
it into our data, and rather than use the very long name all the time, I
created my own short name, since there was no official one.
Now, the upcoming 13.0 has moved the file to the UCD, and come up with a
short name that differs from the one I had. This commit converts to use
Unicode's name. This property is not exposed to user or XS space, so
there is no user impact.
|
|
|
|
|
| |
This change causes certain tables to be sorted so their row and column
headings appear alphabetically, which makes them easier to read.
|
| |
|
|
|
|
| |
This makes things more like dictionary order.
|
| |
|
|
|
|
|
|
|
| |
This is because these deal with only legal Unicode code points, which
are restricted to 21 bits, so 16 bits is too few, but 32 is sufficient to
hold them. Doing this saves some space/memory on 64-bit builds where an
int is 64 bits.
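The size argument can be checked at compile time. A minimal sketch, assuming the tables would otherwise be stored in a 64-bit element; `cp_t` and `bytes_saved` are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* The highest legal Unicode code point, U+10FFFF. */
#define MAX_UNICODE_CP 0x10FFFF

/* It needs exactly 21 bits: too big for 16, comfortably inside 32. */
_Static_assert(MAX_UNICODE_CP >= (1UL << 20), "needs more than 20 bits");
_Static_assert(MAX_UNICODE_CP <  (1UL << 21), "fits in 21 bits");
_Static_assert(MAX_UNICODE_CP >  UINT16_MAX,  "16 bits is too few");
_Static_assert(MAX_UNICODE_CP <= UINT32_MAX,  "32 bits is sufficient");

/* Hypothetical element type for a table of code points. */
typedef uint32_t cp_t;

/* Bytes saved by storing n entries as cp_t instead of a 64-bit type. */
size_t bytes_saved(size_t n_entries)
{
    return n_entries * (sizeof(uint64_t) - sizeof(cp_t));
}
```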
|
|
|
|
|
| |
This will lessen any paging that might occur. Further, on most builds,
it and another table are identical, so only one is actually needed.
|
|
|
|
|
| |
This makes it consistent with the other inversion lists for this sort of
thing, and finishes the fix for GH #17154.
|
|
|
|
|
|
|
| |
This file was for the use of utf8_heavy.pl. But now that that is
incorporated into Unicode::UCD, move the definitions from Heavy.pl to
lib/unicore/UCD.pl which is used by Unicode::UCD. This allows removing
package names.
|
|
|
|
|
|
| |
Two variables weren't getting initialized properly in one code path, with
the result that the case folding tables were pretty much garbage, though
ASCII platforms were not affected.
|
|
|
|
|
| |
0 is a special marker, and shouldn't be remapped. It would be
unlikely to be so, but this makes sure.
|
|
|
|
|
|
| |
Inversion maps are supposed to have an entry for what to do above the
Unicode range. This subroutine crafts a custom map that was missing
that.
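For readers unfamiliar with the data structure, here is a toy inversion map with the final above-Unicode entry in place. The contents and the names `starts`, `values`, and `invmap_lookup` are made up for illustration; the generated maps are far larger.

```c
#include <stddef.h>
#include <stdint.h>

/* starts[i] begins a range that runs up to, but not including,
 * starts[i + 1]; the last range is open-ended.  values[i] applies to every
 * code point in range i.  starts[0] must be 0 so every input is covered. */
static const uint32_t starts[] = { 0x0, 0x41, 0x5B, 0x110000 };
static const int      values[] = {   0,    1,    0,       -1 };
enum { N_RANGES = sizeof starts / sizeof starts[0] };

/* Binary search for the last range whose start is <= cp. */
int invmap_lookup(uint32_t cp)
{
    size_t lo = 0, hi = N_RANGES;      /* the answer lies in [lo, hi) */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (starts[mid] <= cp)
            lo = mid;
        else
            hi = mid;
    }
    return values[lo];
}
```

Without the 0x110000 entry, anything above the Unicode range would silently inherit the value of the last in-range entry, which is the kind of gap the message describes.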
|
|
|
|
| |
This supports this new feature.
|
|
|
|
| |
These debugging lines were left in by 21c34e9717d
|
| |
|
|
|
|
|
| |
This renames a variable to more accurately reflect its content, and adds
a new one which has the old name but with accurate content.
|
| |
|
|
|
|
|
|
|
|
|
|
| |
I am starting to write a Unicode::Private_Use module which will allow
one to specify the Unicode properties of private use code points, thus
making them actually useful. This commit adds a hook to regcomp.c to
accommodate this module. The changes are pretty minimal. This way we
don't have to wait another release cycle to get it out there.
I don't want to document this interface until it's proven.
|
|
|
|
|
|
|
|
|
|
|
| |
Change a couple of
const char * foo[] = { ... }
to
const char * const foo[] = { ... }
Making the string ptrs const means the whole thing is RO and doesn't
appear in the data section, making porting/libperl.t happier when building
under -DPERL_GLOBAL_STRUCT_PRIVATE.
|
|
|
|
| |
This will be used in a future commit.
|
| |
|
|
|
|
|
|
|
|
|
| |
This reverts commit 7e9b4fe4d85e9b669993bf96a7e33ffff3197e20, with
additional changes to get things to compile.
It turns out I was wrong about the underlying cause that commit
addressed, and it is easier to just use the existing constants that get
generated.
|
|
|
|
|
|
|
|
| |
Before the GCB property handling got more complicated, it was possible
to represent its vagaries with a boolean table on early Unicode
releases. Now there are more complicated rules, and even though early
releases only use 0 or 1, the rules exist and lead to compilation
errors. Just remove the special handling, and let the table be U8.
|
|
|
|
|
|
|
|
|
| |
These 2 Unicode-like property definitions used internally by the regular
expression compiler are moved by this commit from regen/mk_invlists.pl
to lib/unicore/mktables.
By placing all these in the same place, maintainers only have to learn
one bit of code, instead of two.
|
|
|
|
|
|
|
|
| |
If two tables are identical, the code created a #define of one index of
a pointer array to be the other index. But in some cases, that's not sufficient,
and the actual pointer must be defined in terms of the other. This
showed up in compiling perl with an early Unicode version, but the
circumstances could arise again in a future version.
|
|
|
|
|
|
|
| |
This table contains all the code points that are in any multi-character
fold (not the folded-from character, but what that character folds to).
It will be used in a future commit.
|
| |
|
|
|
|
|
| |
The new value is the maximum number of code points that fold to any
single code point. It will be used in a future commit.
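One way such a maximum could be computed, as a sketch only: the `struct fold` layout and `max_folds_to_one` are hypothetical, and whether the real value also counts a code point folding to itself is not stated here.

```c
#include <stddef.h>
#include <stdint.h>

/* One fold: code point 'from' folds to code point 'to'. */
struct fold { uint32_t from, to; };

/* Returns the largest number of entries in folds[] that share one 'to'
 * value.  Quadratic, which is fine for a sketch. */
size_t max_folds_to_one(const struct fold *folds, size_t n)
{
    size_t best = 0;
    for (size_t i = 0; i < n; i++) {
        size_t count = 0;
        for (size_t j = 0; j < n; j++)
            if (folds[j].to == folds[i].to)
                count++;
        if (count > best)
            best = count;
    }
    return best;
}
```

For example, both 'K' (U+004B) and KELVIN SIGN (U+212A) fold to 'k' (U+006B), so a list containing those two entries yields at least 2.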
|
| |
These are only used in compiling patterns. They previously were placed
in utf8.c because they are large, and there is a copy of regcomp.c in
ext/re, so they would have used twice the space.
This commit changes things so that they are only used and defined in
regcomp.c (not re_comp.c), so that duplication does not occur. They are
accessed only from one function, and that is also moved from utf8.c to
regcomp.c, only compiled in regcomp.c, and referred to as an external by
re_comp.c.
I had to change the names of the tables. Previously they started with
'PL_' in case any got exposed, but globvar.t mindlessly assumes that any
such variables in the file regcomp.c are globals, and wrongly complains.
It was easier to just change the prefix to 'UNI_' instead.
A few tables are used in regexec.c, and are duplicated in re_exec.c.
Things could be adjusted so that only one copy is used. I tried this,
but the tables are far more intertwined in regexec.c functions than
the ones changed in this commit, as only a single function accesses
these. Thus doing this would be a lot harder, and the payback isn't all
that much. I started work to make them EXTCONSTs, and then discovered
the intertwining, but left in that work, unused.
|
| |
Each Unicode property that specifies boundary conditions, like
Word_Break, partitions all the Unicode code points into equivalence
classes. So, for example, combining marks are placed into the Extend
class, because they are usually used to extend the previous character
and don't stand on their own. mk_invlists.pl creates a boolean table of
all pairwise combinations of these classes, so that it knows by simple
lookup if the first character is class X and the next character is class
Y, if a break is permitted between these.
However, in some cases the answer isn't as simple as this, and other
means such as the characters in the vicinity of X and Y must be used to
disambiguate. In these cases the table value in the cell (X,Y) isn't a
boolean, but is some other number indicating some specially crafted code
section to execute to resolve the issue.
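The lookup just described can be sketched like this. The classes, the cell values, and the single NEEDS_CONTEXT cell are invented for illustration and do not reproduce any real Unicode break rule set; `is_break` takes the outcome of the context inspection as a parameter, standing in for the specially crafted code.

```c
#include <stdbool.h>

/* Hypothetical equivalence classes; real properties have many more. */
enum brk_class { CLS_OTHER, CLS_EXTEND, CLS_CR, CLS_LF, N_CLASSES };

/* 0 and 1 are plain booleans; larger values select special handling. */
enum { NO_BREAK = 0, BREAK = 1, NEEDS_CONTEXT = 2 };

/* Row = class of the first character, column = class of the next one. */
static const unsigned char brk_table[N_CLASSES][N_CLASSES] = {
    /*              Other          Extend    CR     LF       */
    /* Other  */ { BREAK,         NO_BREAK, BREAK, BREAK    },
    /* Extend */ { NEEDS_CONTEXT, NO_BREAK, BREAK, BREAK    },
    /* CR     */ { BREAK,         BREAK,    BREAK, NO_BREAK },
    /* LF     */ { BREAK,         BREAK,    BREAK, BREAK    },
};

/* context_says_break stands in for the result of inspecting the
 * characters in the vicinity; it is consulted only for special cells. */
bool is_break(enum brk_class before, enum brk_class after,
              bool context_says_break)
{
    unsigned char cell = brk_table[before][after];
    if (cell == NEEDS_CONTEXT)
        return context_says_break;
    return cell == BREAK;
}
```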
Over the years, Unicode has tended to subdivide partitions into smaller
ones, as they've refined their algorithms. But with Unicode 11, they
used another method and effectively removed partitions. More precisely,
they retain the partitions, but no code point actually takes on the value
of an obsolete partition.
In order to not have to change the algorithm unnecessarily between
Unicode releases (who knows, they might change their minds, and
unobsolete these next time), mk_invlists has just kept the tables
around, but those cells won't ever get accessed because no code point in
the current release evaluates to them.
But that makes the tables unnecessarily large. We can achieve the same
thing by mapping each unused equivalence class to the same value, which
we call 'unused'. The algorithms that refer to the obsolete partitions
go through the data assigning values to the cells, but now the cells
overlap, since all obsolete classes map to the same row or column. Thus
the data is total garbage. But that doesn't matter, since that row or
column is never read by the data in the Unicode release the table is
constructed for.
mk_invlists also can compile older Unicode releases, and this makes
those tables smaller than before, with all unused classes in a
given release collapsed into a single row and single column of (unused)
garbage.
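The collapsing step might look like the following sketch. The choice of index 0 for the shared 'unused' slot and the names `UNUSED_SLOT` and `collapse_unused` are assumptions made for illustration.

```c
#include <stddef.h>

#define UNUSED_SLOT 0   /* shared row/column index for every obsolete class */

/* Fills remap[] so that every class with in_use[i] == 0 maps to
 * UNUSED_SLOT and the live classes get 1, 2, ... in their original order.
 * Returns the dimension of the resulting square table. */
size_t collapse_unused(const int *in_use, size_t n_classes, size_t *remap)
{
    size_t next = UNUSED_SLOT + 1;
    for (size_t i = 0; i < n_classes; i++)
        remap[i] = in_use[i] ? next++ : UNUSED_SLOT;
    return next;
}
```

With six classes of which three are obsolete, the table shrinks from 6x6 to 4x4: three live rows and columns plus the one shared slot whose contents are never read.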
|
| |
|
|
|
|
| |
This completes the process of upgrading to Unicode 11.0.
|