| Commit message (Collapse) | Author | Age | Files | Lines |
|
This generated file will be changed in a future commit. This shouldn't
have been relying on its syntax anyway, but on the value it returns.
|
These files were once apparently intended for use by modules to
supplement the core Unicode handling. They contain tables of the
portions of the Unicode character database about changing the case of
characters and finding the numeric value of a given \d character, in a
form suitable for use by Perl code. In particular, they were designed
for fast access using the swash mechanism that has since been removed.
Unicode::UCD now contains more convenient methods of accessing the data
these contain, and the use of these files has been deprecated since
5.16. I could not figure out a way to force a message should someone
open and read one of these files, but the text of each says that the
file may be removed without notice at any time. I did not find any uses
of them on CPAN.
Unicode is adding new properties that the format of these files will
not be able to handle. Consequently I'm coming up with a new format.
Though these files don't contain the new properties, their existence
means bearing the burden of maintaining two separate mechanisms.
Better to have just one mechanism, suitable for going forward.
|
This changes the format of this generated file so that it can more
easily be used with the Unicode Name property in wildcard matching.
Each line will now end with \n\n, and the \t characters are replaced by
\n. Thus an entry will look like
00001\nSTART OF HEADING\n\n
This makes matching of user-defined patterns using anchors work under
/m, which commit 4829f32decd128e6a122bd8ce35fe944bd87f104 forces. That
commit also changed some anchors' definitions to make them match \n under
/m with wildcards, so this makes it all transparent to user patterns.
The double \n\n at the end of an entry is so that the code can
distinguish between a line that contains a code point vs a name without
relying on the content; it is a disambiguator, like the \t that used to
be.
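The entry layout described above can be illustrated with a minimal
sketch. It is written in Python rather than Perl, and the sample data,
the NAMES string, and the find_code_points() helper are all hypothetical;
this is not the code in the Perl core. It only shows how multi-line
anchors line up with name lines and how the trailing \n\n tells a name
line from a code point line.

```python
import re

# Hypothetical sample in the new format: code point, "\n", name, "\n\n".
NAMES = (
    "00001\nSTART OF HEADING\n\n"
    "00002\nSTART OF TEXT\n\n"
    "00041\nLATIN CAPITAL LETTER A\n\n"
)


def find_code_points(name_pattern):
    """Return the code points whose name matches a user-supplied pattern."""
    # With re.MULTILINE, ^ and $ in the user's pattern anchor at line
    # boundaries, so they match the start and end of each name.
    regex = re.compile(name_pattern, re.MULTILINE)
    found = []
    for m in regex.finditer(NAMES):
        line_end = NAMES.find("\n", m.end())
        # A name line is followed by an empty line; a code point line is not.
        if not NAMES.startswith("\n\n", line_end):
            continue
        line_start = NAMES.rfind("\n", 0, m.start()) + 1
        # The code point is the whole line just before the name line.
        cp_start = NAMES.rfind("\n", 0, line_start - 1) + 1
        found.append(int(NAMES[cp_start:line_start - 1], 16))
    return found
```

A pattern such as "^0000" matches only code point lines, so the
disambiguation step discards it without looking at the content.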
|
Unicode has changed its yearly release cycle so that the final version
is not available until early March of the year. This year it is March
10, 2020.
However, all changes planned were finalized in early January, and the
actual computer files have been updated to their presumably final
substantive versions. The release has been authorized without further
review needed.
The release is awaiting final documentation additions, and soak time of
the beta to verify there are no glitches. This commit causes Perl to
participate in that soak.
I don't anticipate any problems, and likely the only substantive change
upon the official release will be to update perldelta. Comments in the
files supplied by Unicode will likely also change to indicate these are
no longer beta.
There were very few changes affecting existing characters; most of the
changes involved adding new characters, including emoji. The break
characteristics of some existing characters were changed (GCB, LB, WB,
and SB properties). The only perl code I really had to change to cope
with the new release concerned rules in the Line Break property dealing
with ellipses (...) and certain East Asian characters next to opening
parentheses.
If there are problems, we can revert this at any time, and ship with
12.0.
|
This file was for the use of utf8_heavy.pl. But now that that is
incorporated into Unicode::UCD, move the definitions from Heavy.pl to
lib/unicore/UCD.pl which is used by Unicode::UCD. This allows removing
package names.
|
This was only used by tr///, and hence is no longer relevant. I never
really understood it.
|
The only remaining user of this is Unicode::UCD, and so most of the code
from utf8_heavy.pl is moved into that UCD.pm.
It removes a no-longer relevant test (that had been changed into a skip
anyway), and it changes or removes the no-longer relevant references in
comments to utf8_heavy.pl
Later commits will do some simplification as not all the previous
functionality is needed. This commit removed only the parts that were
preventing compilation and tests passing.
|
This test file invented its own environment variable, whereas everyone
else uses a different one. Make this one comply.
|
Unicode 12.0 is finalized. Change to use it.
|
Committer: For porting tests: Update $VERSION in 4 files.
Run:
./perl -Ilib regen/mk_invlists.pl
./perl -Ilib regen/regcharclass.pl
|
This completes the process of upgrading to Unicode 11.0.
|
I found a case where this array can be empty, so add a test for that to
avoid trying to look at the first (non-existent) element.
|
Get rid of the file-global filehandles and the unused filename return
value, instead return the filehandle and assign it to a lexical
variable. Also don't bother checking the return value; it croaks on
failure anyway.
In passing, eliminate erroneous assignment of {} to %CASESPEC for
Unicode < 2.1.8.
|
As discussed in http://nntp.perl.org/group/perl.perl5.porters/244444,
this sets the optional scalar ref parameter to the length of the valid
initial portion of the first parameter passed to num(). This is useful
in teasing apart why the input is invalid.
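The idea of reporting the length of the valid initial portion can be
sketched as follows. This is a Python analogue, not the Perl num()
interface: the function name num_with_valid_length() is hypothetical, it
returns the length as a second value instead of writing through a
reference, and it only accepts runs of decimal digits without trying to
reproduce every check that num() makes.

```python
import unicodedata


def num_with_valid_length(s):
    """Return (value, valid_len) for the string s.

    value is None unless all of s is decimal digits; valid_len is the
    number of leading characters that were acceptable.
    """
    value = 0
    valid_len = 0
    for ch in s:
        digit = unicodedata.decimal(ch, None)
        if digit is None:
            break  # first invalid character; stop counting here
        value = value * 10 + digit
        valid_len += 1
    if s and valid_len == len(s):
        return value, valid_len
    return None, valid_len
```

For an input like "12a4" the value is None and the length is 2, which
points at the offending character.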
|
This allows charprop() to be called on a Perl-internal-only property
|
Return the correct value when asked.
|
The new file from Unicode "extracted/DerivedName.txt" is not delivered
here because Perl doesn't need it; it duplicates information in other
files.
|
Switch from two-argument form. Filehandle cloning is still done with the
two-argument form for backward compatibility.
Committer: Get all porting tests to pass. Increment some $VERSIONs.
Run: ./perl -Ilib regen/mk_invlists.pl; ./perl -Ilib regen/regcharclass.pl
For: RT #130122
|
when 'foo' is a script. Also update the pods correspondingly, and to
encourage scx property use.
See http://nntp.perl.org/group/perl.perl5.porters/237403
|
This includes regenerating the files that depend on the Unicode 9 data
files
|
The major code changes needed to support Unicode 9.0 are changes in
the boundary (break) rules, for things like \b{lb} and \b{wb}.
regen/mk_invlists.pl creates two-dimensional arrays for all these
properties. To see if a given point in the target string is a break or
not, regexec.c looks up the entry in the property's table whose row
corresponds to the code point before the potential break, and whose
column corresponds to the one after. Mostly this fully determines the
answer, but for some cases, extra context is required, and the
array entry indicates this, and there has to be specially crafted code
in regexec.c to handle each such possibility. When a new release comes
along, mk_invlists.pl has to be changed to handle any new or changed
rules, and regexec.c has to be changed to handle any changes to the
custom code.
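The table-plus-custom-code scheme can be illustrated with a toy sketch
in Python. Everything in it is an assumption made for illustration: the
three classes, the classify() helper, and the tiny table are not the
generated Perl tables, and the two context cases are a simplified form
of the word-break rules about a letter, a mid-letter character, and
another letter.

```python
# Table entries: a final answer, or a marker that more context is needed.
BREAK, NO_BREAK, NEED_LETTER_AFTER, NEED_LETTER_BEFORE = range(4)

# Toy character classes (row and column indexes).
ALETTER, MIDLETTER, OTHER = range(3)

# Rows: class before the potential break. Columns: class after it.
TABLE = [
    # ALETTER             MIDLETTER          OTHER
    [NO_BREAK,            NEED_LETTER_AFTER, BREAK],  # before: ALETTER
    [NEED_LETTER_BEFORE,  BREAK,             BREAK],  # before: MIDLETTER
    [BREAK,               BREAK,             BREAK],  # before: OTHER
]


def classify(ch):
    """Toy classification; only ':' counts as a mid-letter character."""
    if ch.isalpha():
        return ALETTER
    if ch == ":":
        return MIDLETTER
    return OTHER


def is_break(text, i):
    """True if a break is allowed between text[i-1] and text[i].

    Requires 0 < i < len(text).
    """
    entry = TABLE[classify(text[i - 1])][classify(text[i])]
    if entry == NEED_LETTER_AFTER:
        # letter x mid-letter: no break only if a letter follows.
        return not (i + 1 < len(text) and classify(text[i + 1]) == ALETTER)
    if entry == NEED_LETTER_BEFORE:
        # mid-letter x letter: no break only if a letter precedes.
        return not (i >= 2 and classify(text[i - 2]) == ALETTER)
    return entry == BREAK
```

Most entries answer directly; only the two marker entries fall through
to hand-written code, which is the part that has to be revisited when
the rules change.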
Unfortunately this is not a mature area of the Standard, and changes are
fairly common in new releases. In part, this is because new types of
code points come along, which need new rules. Sometimes it is because
they realized the previous version didn't work as well as it could. An
example of the latter is that Unicode now realizes that Regional
Indicator (RI) characters come in pairs, and that one should be able to
break between each pair, but not within a pair. Previous versions
treated any run of them as unbreakable. (Regional Indicators are a
fairly recent type that was added to the Standard in 6.0, and things are
still getting shaken out.)
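The pairing rule for Regional Indicators can be sketched as a parity
check. This is a minimal Python illustration under the assumption that
both neighbors of the candidate position are already known to be RIs;
the function names are hypothetical and this is not the regexec.c code.

```python
def is_regional_indicator(ch):
    """Regional Indicator symbols occupy U+1F1E6 through U+1F1FF."""
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def ri_break_allowed(text, i):
    """May we break between text[i-1] and text[i], both being RIs?

    Count the run of RIs ending at text[i-1]. An odd count means
    text[i-1] opens a pair that text[i] completes, so no break; an even
    count means the preceding RIs are all paired, so a break is allowed.
    """
    count = 0
    j = i - 1
    while j >= 0 and is_regional_indicator(text[j]):
        count += 1
        j -= 1
    return count % 2 == 0
```

With four RIs in a row (two flags), the only position where a break is
allowed is the middle one.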
The other main changes to these rules also involve a fairly new type of
character, emojis. We can expect further changes to these in the next
Unicode releases.
For the first time, \b{gcb} now depends on context (in rarely
encountered cases, like RI's), so the function had to be changed from a
simple table look-up to be more like the functions handling the other
break properties.
Some years ago I revamped mktables in part to try to make it require as
few manual interventions as possible when upgrading to a new version of
Unicode. For example, a new data file in a release requires telling
mktables about it, but as long as it follows the format of existing
recent files, nothing else need be done to get whatever properties it
describes to be included.
Some of the changes to mktables involved guessing, from existing limited
data, what the underlying paradigm for that data was. The problem with
that is there may not have been a paradigm, just something they did ad
hoc, which can change at will; or I didn't understand their unstated
thinking, and guessed wrong.
Besides the boundary rule changes, the only change that the existing
mktables couldn't cope with was the addition of the Tangut script, whose
character names include the code point, like CJK UNIFIED IDEOGRAPH-3400
has always done. The paradigm for this wasn't clear, since CJK was the
only script that had this characteristic, and so I hard-coded it into
mktables. The way Tangut is structured may show that there is a
paradigm emerging (but we only have two examples, and there may not be a
paradigm at all), and so I have guessed one, and changed mktables to
assume this guessed paradigm. If other scripts like this come along,
and I have guessed correctly, mktables will cope with these
automatically without manual intervention.
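The guessed paradigm, a fixed prefix followed by the code point in hex,
can be sketched with a small table-driven example. This is Python, not
mktables code; the table name and helper functions are hypothetical, and
the range end points are illustrative values that vary by Unicode
version.

```python
# (name prefix, first code point, last code point).
# The end points are illustrative and depend on the Unicode version.
ALGORITHMIC_NAME_RANGES = [
    ("CJK UNIFIED IDEOGRAPH-", 0x3400, 0x4DB5),
    ("TANGUT IDEOGRAPH-", 0x17000, 0x187EC),
]


def algorithmic_name(cp):
    """Return the generated name for cp, or None if it has none."""
    for prefix, first, last in ALGORITHMIC_NAME_RANGES:
        if first <= cp <= last:
            return f"{prefix}{cp:04X}"
    return None


def code_point_from_name(name):
    """Inverse mapping: return the code point for a generated name."""
    for prefix, first, last in ALGORITHMIC_NAME_RANGES:
        if name.startswith(prefix):
            suffix = name[len(prefix):]
            try:
                cp = int(suffix, 16)
            except ValueError:
                return None
            # Require the canonical hex spelling and an in-range value.
            if first <= cp <= last and suffix == f"{cp:04X}":
                return cp
    return None
```

Under this paradigm, supporting another such script would only mean
adding a row to the table.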
|
This now looks for the PERL_DIFF_TOOL environment variable, and if found
uses that to display some problems. If not found, it uses is(), with a
message that better output is available through setting this variable.
PERL_DIFF_TOOL is a convention I wasn't familiar with.
|
I am not certain that I find the 'leading zero means hex' format
recommendable ('0123' meaning '0x123', the octal format has poisoned
the well); but water under the bridge.
|
Prior to this commit, it would not compile because 2 properties weren't
defined in very early Unicodes.
|
This ticket was originally because the requester did not realize the
function Unicode::UCD::charscript took a code point argument instead of
a chr one. It was rejected on that basis. But discussion here
suggested it would be better to warn on bad input instead of just
returning <undef>. It turns out that all other routines in Unicode::UCD
but charscript and charblock already do warn. This commit extends that
to the two outlier returns.
|
perl needs the Name_Alias property accessible in all releases in order
for charnames to work properly. However the property was not created
until Unicode version 5.0. Previously, the property was made available
to all Unicode versions, which is contrary to the policy of exposing
properties to public use only when Unicode so exposes them, a policy
that keeps the behavior as close as possible to Unicode-specified. This
commit creates an internal-only property for the perl core, and removes
the general access on early Unicode releases.
|
This property is not included in the standard Perl distribution, but it
is normative in the Unicode Unihan database and perl can be compiled to
include it. This property is currently unique in that it operates much
like how Perl defines string truthfulness: non-empty values are
considered true. That is, \p{kIICore} matches all characters which have
a non-empty value for this property, and the actual values have
meaning that may need to be examined in some circumstances. These can
be retrieved via Unicode::UCD::prop_invmap().
Unicode 7.0 changed this property without my noticing, and went a very
different direction with it than I anticipated. And the perl
interpreter would loop when trying to deal with it under some
circumstances.
This property is true for all 'core' Chinese/Japanese/Korean characters
that every implementation of CJK things should strive to handle, i.e., the
minimally acceptable set, though the values now specify a precedence as
their first letter, A, B, or C (I suppose this means one could implement
just the A level ones first). The remaining letters in each value
encode the standards which were used as the source for the character.
In previous versions of the Standard, every non-null value was the
string "2.1".
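The "non-empty means true" behavior and the structure of the newer
values can be sketched as follows. This is a Python illustration only:
the table contents are invented, not taken from the Unihan database, and
the helper names are hypothetical.

```python
# Invented sample values: first letter is the precedence (A, B, or C),
# the remaining letters stand for source standards.
KIICORE = {
    0x4E00: "AGTJ",
    0x4E01: "BG",
    0x4E02: "",  # present in the table, but empty, hence not a match
}


def matches_kiicore(cp):
    """True if cp has a non-empty value, mirroring string truthiness."""
    return bool(KIICORE.get(cp, ""))


def split_value(value):
    """Split a non-empty value into (precedence, list of source letters)."""
    return value[0], list(value[1:])


def code_points_at_level(level):
    """Code points whose value has the given precedence letter."""
    return sorted(cp for cp, v in KIICORE.items() if v and v[0] == level)
```

Matching only needs the truth value; code that cares about precedence or
sources has to look at the value itself.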
|
No current input comes inverted, but it could some time in the future,
and we wouldn't know. In one case, it's easy to handle, so do so; in
another, die with a message so it won't sneak past. At that point, if
and when it happens, time could be spent figuring out the best way to
handle the situation.
|
Re-indent after the previous commit
|
This commit causes this test to pass even when run on old Unicodes,
back to 3.0, where it becomes just too much. Most tests aren't
structured to pass on older Unicodes, but it's somewhat important that
this one does.
|
The formats of some older Unicode releases can be different than
previously expected.
|
Actually, there are no special rules for this Unicode release. All the
4 "i" characters are considered equivalent under /i only in this
release. (Upper and lowercase dotted and dotless "i"). This
adds special cases that are only compiled in for that release.
|
Nothing should get executed after a croak.
|