| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
| |
This makes them correspond to names for single characters, and will make
parsing easier in the next commits.
|
|
|
|
|
|
| |
Also regenerate files depending on lib/unicore/mktables
./perl -Ilib regen/mk_invlists.pl; ./perl -Ilib regen/regcharclass.pl
|
|
|
|
|
| |
This commit adds wildcard subpatterns for the Name and Name Aliases
properties.
|
|
|
|
|
|
|
|
|
|
|
| |
Many ideographic character names are of the form 'prefix-code_point'.
For these, we know that the only legal characters are the ones in the
prefix, the dash, and uppercase hex digits. For each series of names of
this type, this commit figures out which characters are legal in that
series, and adds that info to the hash describing the series. This will
be used in a later commit to rule out entire series when matching
under some circumstances, without having to try any individual matches
within it.
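A minimal sketch of the idea, in Python rather than the Perl used by
mktables. The function names and the example prefix are illustrative
assumptions, not taken from the commit. It shows how a per-series set of
legal characters lets a whole series be ruled out when a literal that
must appear in the name uses a character no name in the series can have.

```python
HEX_DIGITS = set("0123456789ABCDEF")


def legal_chars(prefix):
    """Characters that can occur in any name of the form 'prefix-HEX'."""
    return set(prefix) | {"-"} | HEX_DIGITS


def series_may_match(required_literal, prefix):
    """Return False when the literal uses a character that cannot appear
    in any name of the series, so the series can be skipped entirely."""
    return set(required_literal) <= legal_chars(prefix)
```

For example, series_may_match("LATIN", "CJK UNIFIED IDEOGRAPH") is false
because 'L' and 'T' never occur in that series, so none of its names need
to be tried individually.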
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This changes the format of this generated file so that it can more
easily be used with the Unicode Name property in wildcard matching.
Each line will now end with \n\n, and the \t characters are replaced by
\n. Thus an entry will look like
00001\nSTART OF HEADING\n\n
This makes matching of user-defined patterns using anchors work under
/m, which commit 4829f32decd128e6a122bd8ce35fe944bd87f104 forces. That
commit also changed some anchors' definitions to make them match \n under
/m with wildcards, so this makes it all transparent to user patterns.
The double \n\n at the end of an entry is so that the code can
distinguish between a line that contains a code point vs a name without
relying on the content; it is a disambiguator, like the \t that used to
be.
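The layout can be illustrated with a small Python sketch. This is not
the Perl or C code that reads the file; the second entry, the variable
names, and the find_names helper are assumptions made for the example.
It shows a user pattern with anchors matching under multiline mode, and
the "\n\n" being used to tell a name line from a code point line. For
simplicity it only accepts matches that run to the end of a line.

```python
import re

# Two entries in the new layout: code point, "\n", name, "\n\n".
# The second entry is made up for this example.
DATA = "00001\nSTART OF HEADING\n\n00041\nLATIN CAPITAL LETTER A\n\n"


def find_names(user_pattern):
    """Return (code_point, name) pairs whose name line matches."""
    results = []
    for m in re.finditer(user_pattern, DATA, re.MULTILINE):
        # A name is followed by "\n\n"; a code point by a single "\n".
        if not DATA.startswith("\n\n", m.end()):
            continue
        name_start = DATA.rfind("\n", 0, m.end()) + 1
        cp_start = DATA.rfind("\n", 0, name_start - 1) + 1
        code_point = DATA[cp_start:name_start - 1]
        results.append((code_point, DATA[name_start:m.end()]))
    return results
```

A pattern such as ^0+1$ matches the text of a code point line, but that
match is discarded because it is not followed by the blank line.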
|
|
|
|
|
|
| |
The previous commit changed the code so that enums and #defines could be
requested to be in re_comp.c. This commit changes to use that new
capability.
|
|
|
|
|
|
|
| |
To save memory, tables that are for regcomp.c are excluded from
re_comp.c. But enums use no resources, and a later commit will want them
accessible from re_comp.c. So change the code so that they can be
requested to be in re_comp.c.
|
|
|
|
|
| |
This value will be needed outside of where it currently is defined; this
commit makes it available elsewhere.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The algorithm for dealing with Unicode property wildcards is to wrap the
user-supplied pattern with /miaa. We don't want the user to be able to
override the /m and /aa parts. Modifiers that are only specifiable as a
modifier in a qr or similar op (like /gc) can't be included in things
like (?gc). These normally incur a warning that they are ignored, but
the texts of those warnings are misleading when using wildcards, so I
chose to just make them illegal. Of course that could be changed to
having custom useful warning texts, but I didn't think it was worth it.
I also chose to forbid recursion of using nested \p{}, just from fear
that it might lead to issues down the road, and it really isn't useful
for this limited universe of strings to match against. Because
wildcards currently can't handle '}' inside them, only the single letter
\p,\P are valid anyway.
Similarly, I forbid the '*' quantifier to make it harder for the
constructed subpattern to take forever to make any progress and decide
to halt. Again, using it would be overkill on the universe of possible
match strings.
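The restrictions can be sketched as a validation pass followed by
compilation with fixed flags. This is a Python approximation, not the
regcomp.c implementation: the names are hypothetical, the flag set is
only a rough analogue of /miaa, the modifier checks are omitted, and the
scan also rejects a '*' inside a bracketed class, which the real code
may treat differently.

```python
import re


class WildcardError(ValueError):
    """Raised when a subpattern uses a forbidden construct."""


def compile_wildcard(subpattern):
    """Validate a user-supplied subpattern and compile it with fixed flags."""
    i = 0
    while i < len(subpattern):
        ch = subpattern[i]
        if ch == "\\":
            # Nested property lookups such as \p{...} are rejected.
            if subpattern[i + 1:i + 3] in ("p{", "P{"):
                raise WildcardError("nested \\p{} is not allowed")
            i += 2  # skip the escaped character
            continue
        if ch == "*":
            raise WildcardError("the '*' quantifier is not allowed")
        i += 1
    # Rough analogue of /miaa: multiline, case-insensitive, ASCII rules.
    return re.compile(subpattern, re.MULTILINE | re.IGNORECASE | re.ASCII)
```

Python's re module has no \p escapes at all, so only the validation step
carries over; the compile step is there to show the flags being imposed
from outside the user's pattern.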
|
|
|
|
|
|
| |
Unicode has made minor changes in its data files since I added the beta
versions to Perl 5.31. These are still beta; the final release date is
March 10. I thought it best to get the latest into Perl 5.31.9.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Unicode has lately been asking implementations to support non-Unicode
Character Database properties. Files for these contain a different
versioning syntax than the UCD files. Previously I was hand-editing
those files before committing, to bring them to a consistent style.
But that is tedious, and I decided to invest a little time to be able to
handle all the current versioning syntaxes automatically, to save having
to manually update in the future.
This was complicated by the fact that some Unicode non-UCD files have
BOM marks on many comment lines. I submitted a trouble report to them.
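A sketch of what tolerant version parsing might look like, in Python.
The two header styles below are assumptions about what such files
contain, and find_version is a hypothetical helper; the actual parsing
lives in mktables and may recognize other forms.

```python
import re

BOM = "\ufeff"

# Assumed header styles; the real files may use others.
VERSION_PATTERNS = [
    re.compile(r"^#\s*\S+-(\d+(?:\.\d+)+)\.txt\b"),         # "# Name-13.0.0.txt"
    re.compile(r"^#\s*Version:\s*(\d+(?:\.\d+)+)", re.I),   # "# Version: 13.0"
]


def find_version(lines):
    """Return the version string found in the leading comment lines."""
    for line in lines:
        line = line.replace(BOM, "").strip()  # drop stray BOMs first
        if not line:
            continue
        if not line.startswith("#"):
            break  # the header comments are over
        for pattern in VERSION_PATTERNS:
            m = pattern.match(line)
            if m:
                return m.group(1)
    return None
```

Stripping the BOM before any other test means a comment line that begins
with one is still recognized as a comment.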
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This accomplishes the same thing as \N{...}, but only for regex
patterns, using loose matching and only the official Unicode names.
This commit includes a comparison of the two approaches, added to
perlunicode. But the real reason to do this is as a way station to
being able to specify wild card lookup on the name property, coming in a
later commit.
I chose to not include user-defined aliases nor :short character names
at this time. I thought that there might be unforeseen consequences of
using them. It's better to later relax a requirement than to try to
restrict it.
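Loose matching on names can be sketched in Python with the standard
unicodedata module. This is a simplification, not Perl's implementation:
it ignores only case, spaces, and underscores, whereas the full Unicode
rule (UAX44-LM2) also deals with medial hyphens. The helper names and
the small code point range are choices made for the example.

```python
import unicodedata


def loose_key(name):
    """Simplified loose-matching key: ignore case, spaces, underscores."""
    return "".join(ch for ch in name.upper() if ch not in " _")


def build_index(code_points):
    """Map the loose key of each official name to its code point."""
    index = {}
    for cp in code_points:
        name = unicodedata.name(chr(cp), None)
        if name is not None:
            index[loose_key(name)] = cp
    return index


INDEX = build_index(range(0x0250))  # small range to keep the demo fast


def lookup(name):
    """Return the code point for a loosely written name, or None."""
    return INDEX.get(loose_key(name))
```

With this index, "latin small letter a" and "LATIN_SMALL_LETTER_A" both
resolve to the same code point.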
|
|
|
|
|
|
|
|
|
|
| |
The Unicode regular expression specification, UTS #18, now asks for
these non-UCD properties to be supported.
These have a slightly different header syntax for giving the version
than UCD files. In this commit, I modify these to fit, but will
probably have to generalize at some point the parsing of versions in
mktables.
|
|
|
|
|
|
|
|
|
| |
Until now, this property was unique in that it specifies a set of
possible values for scripts that a character can be in, rather than a
single script. That multiplicity has been handled specially. But the
next couple of commits will introduce another property that has similar
characteristics. This commit makes the scx handling more general, so as
to also be usable for the new property.
|
| |
|
|
|
|
|
|
| |
This is because this is still supposed to work on DOS 8.3 filesystems,
and future commits will use non-Unicode-Character-Database tables which
don't have shorter names.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Unicode has changed its yearly release cycle so that the final version
is not available until early March of the year. This year it is March
10, 2020.
However, all changes planned were finalized in early January, and the
actual computer files have been updated to their presumably final
substantive versions. The release has been authorized without further
review needed.
The release is awaiting final documentation additions, and soak time of
the beta to verify there are no glitches. This commit causes Perl to
participate in that soak.
I don't anticipate any problems, and likely the only substantive change
upon the official release will be to update perldelta. Comments in the
files supplied by Unicode will likely also change to indicate these are
no longer beta.
There were very few changes affecting existing characters; most of the
changes involved adding new characters, including emoji. The break
characteristics of some existing characters were changed (GCB, LB, WB,
and SB properties). The only perl code I really had to change to cope
with the new release was about rules in the Line Break property, dealing
around ellipses (...) and certain East Asian characters next to opening
parentheses.
If there are problems, we can revert this at any time, and ship with
12.0.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Unicode 12.0 used a new property file that was not from the Unicode
Character Database. It only had a long property name. I incorporated
it into our data, and rather than use the very long name all the time, I
created my own short name, since there was no official one.
Now, the upcoming 13.0 has moved the file to the UCD, and come up with a
short name that differs from the one I had. This commit converts to use
Unicode's name. This property is not exposed to user or XS space, so
there is no user impact.
|
|
|
|
|
| |
This change causes certain tables to be sorted so their row and column
headings appear alphabetically, which makes them easier to read.
|
| |
|
|
|
|
| |
This makes things more like dictionary order
|
| |
|
|
|
|
|
|
| |
The motivation behind these extra bits is to allow three functions that
deal with, respectively, binary, octal, and hex data to use the same
paradigm, and hence be collapsible into a single function.
|
|
|
|
|
|
|
| |
This is because these deal with only legal Unicode code points, which
are restricted to 21 bits, so 16 is too few, but 32 is sufficient to
hold them. Doing this saves some space/memory on 64 bit builds where an
int is 64 bits.
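The bit-width arithmetic can be checked directly. A short Python
illustration (the variable names are only for this example):

```python
MAX_UNICODE = 0x10FFFF  # highest legal Unicode code point

bits_needed = MAX_UNICODE.bit_length()
fits_in_16 = MAX_UNICODE < 2 ** 16
fits_in_32 = MAX_UNICODE < 2 ** 32
```

Here bits_needed is 21, fits_in_16 is false, and fits_in_32 is true.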
|
|
|
|
|
| |
This will lessen any paging that might occur. Further, on most builds,
it and another table are identical, so only one is actually needed.
|
|
|
|
| |
Prior to this patch, they only sometimes overrode.
|
|
|
|
|
| |
This makes it consistent with the other inversion lists for this sort of
thing, and finishes the fix for GH #17154
|
|
|
|
|
|
| |
With the revamping done in cc288b7a2732c37504039083ebb98241954636be, the
table of Unicode case folds that are more than a single character is no
longer used, so there is no need to generate it or make it available.
|
|
|
|
|
|
| |
This wasn't generating the correct values. It is no longer used, and
the next commit will remove it, but I wanted to get it right, in case it
is ever needed again.
|
|
|
|
|
|
|
| |
This file was for the use of utf8_heavy.pl. But now that that is
incorporated into Unicode::UCD, move the definitions from Heavy.pl to
lib/unicore/UCD.pl which is used by Unicode::UCD. This allows removing
package names.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The only remaining user of this is Unicode::UCD, and so most of the code
from utf8_heavy.pl is moved into that UCD.pm.
It removes a no-longer relevant test (that had been changed into a skip
anyway), and it changes or removes the no-longer relevant references in
comments to utf8_heavy.pl
Later commits will do some simplification as not all the previous
functionality is needed. This commit removed only the parts that were
preventing compilation and tests passing.
|
|
|
|
| |
Also references to the term.
|
|
|
|
| |
These are no longer needed.
|
| |
|
|
|
|
| |
This table wasn't being translated into native code points
|
|
|
|
|
|
| |
Two variables weren't getting initialized properly in one code path, with
the result that the case folding tables were pretty much garbage, but
not on ASCII platforms.
|
|
|
|
|
| |
0 is a special marker, and shouldn't be remapped. It would be
unlikely to be so, but this makes sure.
|
|
|
|
|
|
| |
Inversion maps are supposed to have an entry for what to do above the
Unicode range. This subroutine crafts a custom map that was missing
that.
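As an illustration of the shape of the fix, here is a Python sketch of
an inversion map as two parallel lists: sorted range starts and the
value for each range. The function names and the value used for the
above-Unicode entry are assumptions; the commit's subroutine is Perl and
may represent this differently.

```python
import bisect

MAX_UNICODE = 0x10FFFF


def complete_map(starts, values, above_unicode_value=None):
    """Append an entry for code points above Unicode if one is missing."""
    if starts[-1] <= MAX_UNICODE:
        starts = starts + [MAX_UNICODE + 1]
        values = values + [above_unicode_value]
    return starts, values


def lookup(starts, values, cp):
    """Value for cp: the entry whose range start is the largest one <= cp."""
    return values[bisect.bisect_right(starts, cp) - 1]
```

Without the final entry, the last in-range value would silently apply to
every code point above 0x10FFFF as well.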
|
|
|
|
|
|
|
| |
The table is collated in ASCII, so convert to that when looking things
up.
This change does not affect the binary of ASCII platform compilations
|
|
|
|
| |
And regen affected files
|
|
|
|
|
|
|
|
|
|
| |
For a property \p{Block=Foo}, we allow the synonym \p{InFoo} as
documented variously, including perluniprops, even though this usage is
discouraged, as a new Unicode release used in a new version of Perl
could cause the synonym to no longer work.
Prior to this commit, we erroneously allowed the synonym for other
properties, such as \p{InKana} or \p{InS}.
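The corrected rule can be sketched in Python. The tables below are tiny
hypothetical excerpts and resolve is an invented helper, not the lookup
code in Perl itself. The point is only that the "In" prefix is tried
against block names and nothing else.

```python
# Hypothetical, tiny excerpts of the property value tables.
BLOCKS = {"greek", "cyrillic", "hiragana"}
SCRIPTS = {"kana", "greek", "cyrillic"}


def normalize(name):
    """Loose form of a property name: no spaces or underscores, lowercase."""
    return name.replace("_", "").replace(" ", "").lower()


def resolve(single_form):
    """Resolve a single-form property name to (property, value), or None."""
    key = normalize(single_form)
    # The "In" prefix is a synonym for Block=... only.
    if key.startswith("in") and key[2:] in BLOCKS:
        return ("Block", key[2:])
    if key in SCRIPTS:
        return ("Script", key)
    return None
```

Under this rule "InGreek" resolves to a block, "Kana" to a script, and
"InKana" to nothing, since Kana is not a block in these tables.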
|
| |
|
|
|
|
|
|
|
| |
This takes the few latest changes in the draft Unicode 12.1, ahead of
our freeze. None are substantive. No further non-substantive changes
will be taken. In the unlikely event that a substantive change is made,
we will take it and potentially delay Perl 5.30.
|
|
|
|
| |
A variable needed to be updated for Unicode 12.1
|
|
|
|
|
|
| |
I realized that commit f9c1e7e9ed13a16099c8471c2030b93deb482571
works now, but future Unicode versions may add fractions that fool it.
This commit should handle any such event
|
| |
|
|
|
|
| |
Indent block newly formed in previous commit
|
|
|
|
|
|
|
| |
This turns out to be because Windows doesn't necessarily round to even
on floating point %e conversions. The solution is to add an extra entry
rounding up to odd when a fraction is precisely representable in binary.
So far, the only case where this occurs is 1/32.
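The tie can be reproduced with Python's decimal module. This is an
illustration of the rounding issue only, not the mktables code; the
function name and the choice of three significant digits are
assumptions. A value that is exactly representable in binary can sit
precisely halfway between two printed forms, so the two rounding rules
disagree and both forms need an entry.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def rounded_forms(value, digits):
    """Distinct results of rounding value to `digits` significant digits
    under round-half-even and round-half-up."""
    exact = Decimal(value)  # exact decimal expansion of the binary double
    forms = set()
    for mode in (ROUND_HALF_EVEN, ROUND_HALF_UP):
        context = Context(prec=digits, rounding=mode)
        forms.add(str(context.create_decimal(exact)))
    return forms
```

For 1/32 (exactly 0.03125) at three digits this yields both 0.0312 and
0.0313, while a value such as 1/3, which is not an exact tie, yields a
single form.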
|
|
|
|
| |
This inadvertently was left on, slowing down the process a little
|