| Commit message | Author | Age | Files | Lines |
... | |
|
|
|
| |
These will be used in a future commit.
|
|
|
|
|
|
|
| |
A future commit will pass this function data that shouldn't be
translated into a mnemonic, like 'f' for the letter f. The reason is
that this code may be executed on a machine whose character set differs
from the one the mnemonic is valid for.
|
|
|
|
| |
A future commit will use this differently than the current name implies.
|
|
|
|
|
| |
This moves a loop earlier in the execution path. This will be useful in
a later commit.
|
| |
|
| |
|
|
|
|
|
| |
We can short circuit some work by moving the test earlier. This does
not change the generated file.
|
| |
|
|
|
|
| |
This will make future commits read better.
|
|
|
|
|
|
|
|
| |
I realized that two base level utf8.h macros for UTF-8 could be
refactored to eliminate the conditionals in each. Those macros have
equivalents in the pure perl code changed by this commit, which I
changed before the utf8.h versions to verify that everything worked, by
verifying there was no difference in the generated tables.
|
|
|
|
|
|
|
|
|
| |
This commit makes is_HANGUL_ED_utf8_safe() return 0 unconditionally on
EBCDIC platforms. This means its callers don't have to care what
platform is running. Change the two callers to take advantage of this.
The commit also changes the description of the macro to be slightly more
accurate.
|
|
|
|
|
|
|
|
|
|
| |
This creates macros for the non-character code points so that, given the
length of the UTF-8 sequence, only those ones that have that length
match. This makes for more efficient processing, to be used in a future
commit.
The place where the length changes depends on the platform type, and
these macros will keep the code from having to worry about that.
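As a rough illustration of the grouping described above, the following Python sketch buckets the Unicode non-character code points by the length of their UTF-8 encoding. It assumes ordinary UTF-8 on an ASCII platform; on EBCDIC platforms the boundaries fall elsewhere, which is the platform dependence the macros are meant to hide. The function names are hypothetical and are not taken from the commit.

```python
from collections import defaultdict

def noncharacters():
    """Yield every Unicode non-character code point."""
    yield from range(0xFDD0, 0xFDF0)       # U+FDD0..U+FDEF
    for plane in range(0x11):              # planes 0 through 16
        yield plane * 0x10000 + 0xFFFE     # U+nFFFE
        yield plane * 0x10000 + 0xFFFF     # U+nFFFF

def by_utf8_length():
    """Group the non-characters by how many UTF-8 bytes each one needs."""
    groups = defaultdict(list)
    for cp in noncharacters():
        groups[len(chr(cp).encode("utf-8"))].append(cp)
    return dict(groups)

groups = by_utf8_length()
```

A caller that already knows the sequence length only needs to test against the matching bucket.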
|
| |
|
|
|
|
| |
redirection
|
| |
|
|
|
|
|
|
| |
This CPAN module doesn't work on recent Unicode versions.
This fixes GH #18787.
|
|
|
|
|
|
|
| |
Some other tweaks or modernizations are present, but I expect none of
this is controversial.
This also includes running regen/mk_invlists.pl and regen/regcharclass.pl.
|
|
|
|
|
|
|
|
|
|
|
| |
They generate C files.
Bump feature.pm and warnings.pm versions to satisfy cmpVERSION.pl.
I can't get it to easily ignore whitespace; `git diff --name-only`
does not respect the -w flag.
regen_perly.pl is left alone. That would require rebuilding
perly.* which is beyond a simple indentation change.
|
|
|
|
|
|
|
|
| |
The macros generated by this script may have to be split into sub-macros
to make the overall macro fit the maximum number of characters allowed
by the compiler for a macro definition. This commit adds a trailing
underscore to the names of such intermediate macros so as to mark them
as non-API for autodoc.
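The splitting idea can be sketched as follows. This is only an illustration in Python: the `_partN_` naming scheme, the `cp` parameter, and the `max_len` limit are assumptions made for the example, not what the script necessarily emits. The only detail taken from the commit message is that intermediate macros end in an underscore.

```python
def split_macro(name, terms, max_len):
    """Return #define lines; intermediate sub-macros get a trailing underscore."""
    chunks, current = [], []
    for term in terms:
        # Start a new chunk when adding this term would exceed the limit.
        if current and len(" || ".join(current + [term])) > max_len:
            chunks.append(current)
            current = []
        current.append(term)
    if current:
        chunks.append(current)

    if len(chunks) <= 1:
        body = " || ".join(chunks[0]) if chunks else "0"
        return [f"#define {name}(cp) ( {body} )"]

    lines, parts = [], []
    for i, chunk in enumerate(chunks):
        part = f"{name}_part{i}_"          # hypothetical naming scheme
        parts.append(f"{part}(cp)")
        lines.append(f"#define {part}(cp) ( {' || '.join(chunk)} )")
    lines.append(f"#define {name}(cp) ( {' || '.join(parts)} )")
    return lines

lines = split_macro("is_FOO", ["cp == 1", "cp == 2", "cp == 3"], 20)
```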
|
|
|
|
|
|
|
| |
Previously regcharclass.pl could tell if an input string was a
multi-character fold of some Unicode code point. This commit adds the
ability to return what that code point is. This capability will be used
in a later commit.
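To make the difference concrete, here is a small Python sketch with a hand-picked table of multi-character folds. The table and function names are hypothetical, and the table is far from complete. The first function answers only "is this a multi-character fold?", while the second also reports which code points fold to the string.

```python
# Hand-picked sample of full case folds (illustrative, not complete).
MULTI_CHAR_FOLDS = {
    "ss": [0x00DF, 0x1E9E],      # LATIN SMALL/CAPITAL LETTER SHARP S
    "i\u0307": [0x0130],         # LATIN CAPITAL LETTER I WITH DOT ABOVE
    "ff": [0xFB00],              # LATIN SMALL LIGATURE FF
}

def is_multi_char_fold(s):
    """The earlier capability: a yes/no answer."""
    return s in MULTI_CHAR_FOLDS

def fold_sources(s):
    """The added capability: which code points fold to s."""
    return MULTI_CHAR_FOLDS.get(s, [])
```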
|
|
|
|
|
| |
This avoided checking for optimizations. Whatever its original use, it
doesn't do any good, and the optimizations are actually useful.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The previous commit split inRANGE up so that code that was known to have
valid inputs to it could use a component that didn't have all the
compile-time checks (often duplicates) that otherwise are made.
This commit switches to that component. The compile-time checks are
unnecessary here because this is machine-generated code known to
meet the inRANGE input requirements.
All those compile-time checks added up to being too large for some
compilers to handle.
|
|
|
|
|
| |
The regen script was improperly collapsing two-element ranges into two
separate elements, which caused extraneous code to be generated.
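A minimal sketch of the intended behaviour, written in Python rather than the script's own Perl and not derived from its source: consecutive code points are merged into ranges, and a run of exactly two stays a single range rather than becoming two separate elements.

```python
def to_ranges(code_points):
    """Merge consecutive code points into inclusive (low, high) ranges."""
    ranges = []
    for cp in sorted(set(code_points)):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)   # extend the current range
        else:
            ranges.append((cp, cp))            # start a new range
    return ranges
```

With this shape, `[0x41, 0x42]` yields one range, so a generator working from it would emit one range test instead of two equality tests.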
|
|
|
|
| |
This does some line wrapping, etc.
|
| |
|
|
|
|
|
|
| |
Prior to this commit, only the upper case of Latin1 characters was dealt
with. But we really want case folding, and there are a few other
characters that fold to Latin1. This commit acknowledges them.
|
|
|
|
| |
Outdent and remove lines from changes in the previous commit.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Prior to this commit, the generated macros for dealing with multi-char
folds in UTF-8 strings only recognized completely folded strings. This
commit changes that to add the uppercase for characters in the Latin1
range. Hopefully an example will clarify.
The fold for U+0130: LATIN CAPITAL LETTER I WITH DOT ABOVE is 'i'
followed by U+0307: COMBINING DOT ABOVE. But since we are doing /i
matching, an 'I' followed by U+0307 should also match. This commit
changes the macros to know this. Before this, if the fold were entirely
ASCII, the macros would know all the possible combinations. This commit
extends that to all code points < 256. (Since there are no folds to the
upper Latin1 range, that really means all code points below 128. But
making it general means it won't have to be revised if a fold is
ever added to the upper half range.)
The reason to make this change is that it makes some future code less
complicated. And it adds very little complexity to the generated
macros; less than the code it will save. I originally thought it would
be more complex than it now turns out to be. Much of that is because
the infrastructure has advanced since that decision.
I couldn't find any current places that this change will allow to be
simplified. There could be if the macros were extended to do this on
all code points, not just the low ones. I tried that, but the generated
macros were at least 400 lines longer than before. That does add
significant complexity, so I backed that out.
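The U+0130 example can be reproduced with a short Python sketch. It is an illustration of the idea only, not the generator's actual code: for each character of a folded string that lies below 256, it also allows the uppercase form when that form is a single character that is itself below 256, and then enumerates every combination. The function name is hypothetical.

```python
from itertools import product

def case_variants(folded):
    """List the folded string plus its Latin1-range uppercase combinations."""
    choices = []
    for ch in folded:
        options = [ch]
        if ord(ch) < 256:
            upper = ch.upper()
            # Skip uppercase forms that expand or leave the Latin1 range.
            if len(upper) == 1 and upper != ch and ord(upper) < 256:
                options.append(upper)
        choices.append(options)
    return ["".join(combo) for combo in product(*choices)]
```

For the fold of U+0130 this gives both 'i' + U+0307 and 'I' + U+0307, matching the case described above.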
|
|
|
|
|
| |
This changes the generated macros to use a printable character or
mnemonic instead of a hex value. This makes the macros easier to read.
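The readability gain can be sketched in Python as follows. The sketch assumes an ASCII platform and a C-style character literal as the output form; the function name and the exact escaping rules are assumptions for illustration, not the script's.

```python
def c_literal(byte):
    """Render a byte as a quoted character when printable, else as hex."""
    if byte in (ord("'"), ord("\\")):
        return "'\\" + chr(byte) + "'"     # quote and backslash need escaping
    if 0x20 <= byte <= 0x7E:
        return "'" + chr(byte) + "'"       # printable ASCII
    return "0x%02X" % byte                 # everything else stays hex
```

A comparison against `'f'` is easier to read in a generated macro than one against `0x66`.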
|
|
|
|
|
|
| |
This commit changes a sub in this file to be passed a new parameter.
This is in preparation for the value to be used in the caller. No need
to derive it twice.
|
|
|
|
|
|
|
|
| |
This will give future commits the flexibility to use a format based on
the input value instead of a static one.
The only non-whitespace change from this commit is the reordering of a
couple of tests; I'm not sure why that happened.
|
|
|
|
|
|
|
|
|
|
|
|
| |
This changes mktables, which generates this pod, to consider long pod
lines to be splittable before most backslashes. On os390, the lack of
this caused a line to not be split at all, creating a Porting test
failure.
There is also a current rule that you can split at a lowercase/uppercase
boundary. This works for the limited domain this code is run on. But
it shouldn't split \cK. So don't do the split if the lowercase is a
single letter preceded by a backslash.
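The two splitting rules can be sketched in Python as below. This is a simplification, not mktables' code: the message says lines are splittable before "most" backslashes, while the sketch treats every backslash other than one at the start of the line as a candidate. The function name is hypothetical.

```python
import re

def split_points(line):
    """Return the indices at which the line may be broken."""
    points = set()
    # Rule 1: a split is allowed just before a backslash.
    for m in re.finditer(r"\\", line):
        if m.start() > 0:
            points.add(m.start())
    # Rule 2: a split is allowed at a lowercase/uppercase boundary ...
    for m in re.finditer(r"(?<=[a-z])(?=[A-Z])", line):
        i = m.start()
        # ... unless the lowercase letter directly follows a backslash,
        # as in \cK, which must stay intact.
        if i >= 2 and line[i - 2] == "\\":
            continue
        points.add(i)
    return sorted(points)
```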
|
|
|
|
|
| |
Mostly in comments and docs, but some in diagnostic messages and one
case of 'or die die'.
|
|
|
|
|
| |
This makes them correspond to names for single characters, and will make
parsing easier in the next commits.
|
|
|
|
|
|
| |
Also regenerate files depending on lib/unicore/mktables
./perl -Ilib regen/mk_invlists.pl; ./perl -Ilib regen/regcharclass.pl
|
|
|
|
|
| |
This commit adds wildcard subpatterns for the Name and Name Aliases
properties.
|
|
|
|
|
|
|
|
|
|
|
| |
Many ideographic character names are of the form 'prefix-code_point'.
For these, we know that the legal names are just the ones in the prefix,
the dash, and uppercase hex digits. This commit for each series of
these types of names figures out what characters are legal in that
series, and adds that info to the hash describing the series. This will
be used in a later commit to rule out entire series when matching
under some circumstances, without having to try any individual matches
within it.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This changes the format of this generated file so that it can more
easily be used with the Unicode Name property in wildcard matching.
Each line will now end with \n\n, and the \t characters are replaced by
\n. Thus an entry will look like
00001\nSTART OF HEADING\n\n
This makes matching of user-defined patterns using anchors work under
/m, which commit 4829f32decd128e6a122bd8ce35fe944bd87f104 forces. That
commit also changed some anchors' definitions to make them match \n under
/m with wildcards, so this makes it all transparent to user patterns.
The double \n\n at the end of an entry is so that the code can
distinguish between a line that contains a code point vs a name without
relying on the content; it is a disambiguator, like the \t used to
be.
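The new layout can be illustrated with a Python sketch that converts old tab-separated entries and then matches an anchored user pattern in multi-line mode. The helper names are hypothetical, and the lookup is simplified: it assumes the pattern matches on a name line, and it uses Python's `re.M` as a stand-in for Perl's /m.

```python
import re

def convert_entry(old_line):
    """Turn 'CODE\\tNAME\\n' into 'CODE\\nNAME\\n\\n'."""
    code_point, name = old_line.rstrip("\n").split("\t", 1)
    return f"{code_point}\n{name}\n\n"

def lookup(data, user_pattern):
    """Return the code point of the first entry whose name matches."""
    m = re.search(user_pattern, data, re.M)
    if m is None:
        return None
    # An entry starts right after the previous blank line, or at offset 0.
    prev_blank = data.rfind("\n\n", 0, m.start())
    entry_start = 0 if prev_blank == -1 else prev_blank + 2
    return int(data[entry_start:data.index("\n", entry_start)], 16)

data = "".join(convert_entry(line) for line in
               ["00001\tSTART OF HEADING\n", "00002\tSTART OF TEXT\n"])
```

Because each name sits on its own line, `^` and `$` in the user's pattern anchor to the name itself.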
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The algorithm for dealing with Unicode property wildcards is to wrap the
user-supplied pattern with /miaa. We don't want the user to be able to
override the /m and /aa parts. Modifiers that are only specifiable as a
modifier in a qr or similar op (like /gc) can't be included in things
like (?gc). These normally incur a warning that they are ignored, but
the texts of those warnings are misleading when using wildcards, so I
chose to just make them illegal. Of course that could be changed to
having custom useful warning texts, but I didn't think it was worth it.
I also chose to forbid recursion of using nested \p{}, just from fear
that it might lead to issues down the road, and it really isn't useful
for this limited universe of strings to match against. Because
wildcards currently can't handle '}' inside them, only the single letter
\p,\P are valid anyway.
Similarly, I forbid the '*' quantifier to make it harder for the
constructed subpattern to take forever to make any progress and decide
to halt. Again, using it would be overkill on the universe of possible
match strings.
|
|
|
|
|
|
| |
Unicode has made minor changes in its data files since I added the beta
versions to Perl 5.31. These are still beta; the final release date is
March 10. I thought it best to get the latest into Perl 5.31.9.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Unicode has lately been asking implementations to support non-Unicode
Character Database properties. Files for these contain a different
versioning syntax than the UCD files. Previously I was hand-editing
those files before committing to bring them to use a consistent style.
But that is tedious, and I decided to invest a little time to be able to
handle all the current versioning syntaxes automatically, to save having
to manually update in the future.
This was complicated by the fact that some Unicode non-UCD files have
BOM marks on many comment lines. I submitted a trouble report to them.
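A sketch of this kind of tolerant parsing, in Python. The two header styles shown are assumptions chosen for illustration (a file name carrying the version, and a "Version:" comment line); the commit message does not list the syntaxes mktables handles. BOM characters are stripped from every line before matching, not only from the first.

```python
import re

BOM = "\ufeff"

# Assumed header styles, for illustration only.
VERSION_PATTERNS = [
    re.compile(r"^#\s*\S+-(\d+(?:\.\d+)+)\.txt"),    # "# PropList-13.0.0.txt"
    re.compile(r"^#\s*Version:\s*(\d+(?:\.\d+)+)"),  # "# Version: 13.0"
]

def find_version(lines):
    """Return the first version string found in the header lines, or None."""
    for line in lines:
        line = line.replace(BOM, "")       # BOMs may appear on any comment line
        for pattern in VERSION_PATTERNS:
            m = pattern.match(line)
            if m:
                return m.group(1)
    return None
```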
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This accomplishes the same thing as \N{...}, but only for regex
patterns, using loose matching and only the official Unicode names.
This commit includes a comparison of the two approaches, added to
perlunicode. But the real reason to do this is as a way station to
being able to specify wildcard lookup on the name property, coming in a
later commit.
I chose to not include user-defined aliases nor :short character names
at this time. I thought that there might be unforeseen consequences of
using them. It's better to later relax a requirement than to try to
restrict it.
|
|
|
|
|
|
|
|
|
|
| |
These non-UCD properties are now being asked to be supported by the
Unicode regular expression specification, UTS #18.
These have a slightly different header syntax for giving the version
than UCD files. In this commit, I modify these to fit, but will
probably have to generalize at some point the parsing of versions in
mktables.
|
|
|
|
|
|
|
|
|
| |
Until now, this property was unique in that it specifies a set of
possible values for scripts that a character can be in, rather than a
single script. That multiplicity has been handled specially. But the
next couple of commits will introduce another property that has similar
characteristics. This commit makes the scx handling more general, so as
to also be usable for the new property.
|
| |
|
|
|
|
|
|
| |
This is because this is still supposed to work on DOS 8.3 filesystems,
and future commits will use non-Unicode-Character-Database tables which
don't have shorter names.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Unicode has changed its yearly release cycle so that the final version
is not available until early March of the year. This year it is March
10, 2020.
However, all changes planned were finalized in early January, and the
actual computer files have been updated to their presumably final
substantive versions. The release has been authorized without further
review needed.
The release is awaiting final documentation additions, and soak time of
the beta to verify there are no glitches. This commit causes Perl to
participate in that soak.
I don't anticipate any problems, and likely the only substantive change
upon the official release will be to update perldelta. Comments in the
files supplied by Unicode will likely also change to indicate these are
no longer beta.
There were very few changes affecting existing characters; most of the
changes involved adding new characters, including emoji. The break
characteristics of some existing characters were changed (GCB, LB, WB,
and SB properties). The only perl code I really had to change to cope
with the new release was about rules in the Line Break property, dealing
around ellipses (...) and certain East Asian characters next to opening
parentheses.
If there are problems, we can revert this at any time, and ship with
12.0.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Unicode 12.0 used a new property file that was not from the Unicode
Character Database. It only had a long property name. I incorporated
it into our data, and rather than use the very long name all the time, I
created my own short name, since there was no official one.
Now, the upcoming 13.0 has moved the file to the UCD, and come up with a
short name that differs from the one I had. This commit converts to use
Unicode's name. This property is not exposed to user or XS space, so
there is no user impact.
|
|
|
|
| |
Prior to this patch, they only sometimes overrode.
|