| Commit message | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
| |
Get rid of the file-global filehandles and the unused filename return
value; instead, return the filehandle and assign it to a lexical
variable. Also don't bother checking the return value; it croaks on
failure anyway.
In passing, eliminate erroneous assignment of {} to %CASESPEC for
Unicode < 2.1.8.
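The pattern described can be sketched as follows; the `open_reader` helper and the temp-file setup are illustrative, not the actual mktables code:

```perl
use strict;
use warnings;
use autodie;                 # open/close now croak on failure, so the
                             # caller needn't check their return values
use File::Temp 'tempfile';

# Return a lexical filehandle instead of setting a file-global one
# (and with no unused filename return value).
sub open_reader {
    my ($name) = @_;
    open my $fh, '<', $name;     # three-argument open, lexical handle
    return $fh;
}

# Demonstrate with a throwaway temporary file.
my ($out, $tmpname) = tempfile(UNLINK => 1);
print {$out} "hello\n";
close $out;

my $fh   = open_reader($tmpname);
my $line = <$fh>;
close $fh;
```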
|
|
|
|
|
|
|
| |
As discussed in http://nntp.perl.org/group/perl.perl5.porters/244444,
this sets the optional scalar ref parameter to the length of the valid
initial portion of the first parameter passed to num(). This is useful
in teasing apart why the input is invalid.
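For illustration, a minimal use of the new parameter (assuming Unicode::UCD from a perl recent enough to support it):

```perl
use strict;
use warnings;
use Unicode::UCD 'num';

# "123junk" is not a valid number as a whole, so num() returns undef;
# the optional scalar ref is set to the length of the valid initial
# portion ("123"), which helps in teasing apart why the input failed.
my $valid_len;
my $result = num("123junk", \$valid_len);
```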
|
| |
|
|
|
|
| |
This allows charprop() to be called on a Perl-internal-only property.
|
|
|
|
|
| |
Some early Unicode releases used a hyphen instead of an underscore in
script names. This changes them all into underscores.
|
|
|
|
|
| |
The value for this variable is already known; use that instead of
rederiving it.
|
|
|
|
| |
Return the correct value when asked.
|
|
|
|
|
|
|
|
|
| |
This commit changes the generated perluniprops to include some or all of
the code points matched by binary tables. All characters matched in the
00-FF range are listed, as well as the first few ranges beyond that.
This is to make this pod more useful for people using it as an index to
look things up.
|
|
|
|
|
|
|
|
|
|
| |
perluniprops had a few entries like
XPosixCntrl General_Category=XPosixCntrlControl
It should have read
XPosixCntrl General_Category=Control
|
|
|
|
|
|
| |
The complete set of C0 controls is listed by standard abbreviation, but
it is better to display them alphabetically, and not in ASCII-platform
code point order.
|
|
|
|
|
|
|
|
|
|
| |
This isn't currently necessary to add, but I discovered this deficiency
during debugging, and it could come up in some later change.
This code only writes one file when two tables match identically. But
it could happen that we've got the pointers to the two tables
intertwined so that they each think the other one is the one getting
written out, so neither of them does. This checks for that.
|
|
|
|
|
|
| |
The scx is an improved version of the sc(ript) property. This changes
mktables to generate perluniprops so that the entries for sc tables
refer to the equivalent scx ones.
|
|
|
|
|
|
|
|
|
|
|
| |
I spotted this entry in perluniprops recently:
\p{Nko} \p{Script_Extensions=Nko} (NOT \p{NKo})
It's saying Nko is not NKo. But case isn't supposed to matter. It
turned out that the bug was an eq done without first canonicalizing
the names to account for case differences. I was expecting there to be more
entries that were erroneous, but it was just this one.
|
| |
|
|
|
|
| |
This is used in several places, so make its scope global to the program.
|
|
|
|
|
|
| |
Since 5.26.0, this (generated) pod has been wrong. The single-form Perl
shortcuts for script names now use the Script_Extensions property
instead of the (inferior) plain Script property.
|
|
|
|
|
|
|
| |
Unicode has some property values that should be sorted numerically but
have prefixes that currently keep them from being recognized as
numbers; for example, CCC101 and V10_5. This commit changes the sort so
that they are ordered by their numeric parts.
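A minimal sketch of sorting by numeric part (the helper name and the treatment of the underscore as a decimal point are illustrative; the real mktables logic is more involved):

```perl
use strict;
use warnings;

# Extract the numeric part of values like CCC101 or V10_5, treating
# the underscore as a decimal point (V10_5 => 10.5).
sub numeric_part {
    my ($value) = @_;
    $value =~ tr/_/./;
    return $value =~ /(\d+(?:\.\d+)?)/ ? $1 : 0;
}

my @sorted = sort { numeric_part($a) <=> numeric_part($b) }
             qw(CCC101 CCC9 V10_5 V2_0);
# CCC9 (9) sorts before V10_5 (10.5), which sorts before CCC101 (101)
```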
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Some tables generated by this program are completely described as the
complements of other tables. There is thus no need to generate them:
when their value is needed, it can be derived from the other table.
However, that derivation takes time, so this commit caches the result
the first time it is needed and returns the cached value for any future
needs.
This must not be done until after the controlling table is fully
populated, or else the cache would have to be invalidated. Since there
is unlikely to be a need for the value before the populating is done,
what is done here is simply to lock the controlling table, so that any
attempt to change it will raise an error, and the code can be fixed at
that time, if the need ever does arise.
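The compute-once caching can be sketched like this (toy data; the real code derives complement tables from inversion lists):

```perl
use strict;
use warnings;

# Controlling table: a toy set of code points (here A-Z). In mktables
# this table would be locked once populated, so any later change
# raises an error instead of silently invalidating the cache.
my %controlling = map { $_ => 1 } 0x41 .. 0x5A;
my $complement_cache;

# Derive the complement on first need; afterwards return the cache.
sub complement {
    $complement_cache //= [ grep { !$controlling{$_} } 0x00 .. 0x7F ];
    return $complement_cache;
}
```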
|
|
|
|
|
|
|
|
|
|
|
| |
perluniprops was not updated to reflect the changes made in 5.18 to
what \p{Word} contains. What was added was the code points that have
the Join_Control property, which so far consists only of U+200C and
U+200D. This commit uses Join_Control instead of the hard-coded code
point numbers, so that if Unicode changes the property, the pod will
automatically remain valid.
Thanks for spotting this.
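The behavior being documented can be checked directly (requires perl 5.18 or later):

```perl
use strict;
use warnings;

# Since 5.18, \w (and so \p{Word}) matches the two Join_Control
# characters, U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH
# JOINER.
my $zwnj = "\x{200C}";
my $zwj  = "\x{200D}";
my $both_are_word = ($zwnj =~ /\w/ && $zwj =~ /\w/) ? 1 : 0;
```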
|
|
|
|
|
|
| |
This is in preparation for later commits to restrict Unicode code points
to IV_MAX. No tables are currently output that go this high, so this
change has no current effect.
|
|
|
|
|
|
| |
The new file from Unicode, "extracted/DerivedName.txt", is not
delivered here: Perl doesn't need it, as it duplicates information in
other files.
|
|
|
|
|
| |
This informs mktables of the new files in 10.0, and updates some
comments in other files to reflect new Unicode terminology.
|
|
|
|
|
|
|
|
|
|
| |
We changed to use symbols that are unlikely to be used by, and so
conflict with, non-Perl code, and which have trailing underbars, so
they don't look like regular Perl #defines.
See https://rt.perl.org/Ticket/Display.html?id=131110
There are many more header files which are not guarded.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is a feature that is used to compare 2 different Unicode versions
for changes to existing code points, ignoring code points that aren't in
the earlier version. It does this by removing the newly-added code
points coming from the later version. One can then diff the generated
directory structure against an existing one that was built under the old
rules to see what changed.
Prior to this commit, it assumed all version numbers were a single digit
for the major number. This will no longer work for Unicode 10, about to
be released.
As part of the process, mktables adds blocks that didn't exist in the
earlier version back to the unallocated pool. This gives better diff
results. This commit does a better job of finding such blocks.
|
|
|
|
|
| |
Note that this isn't normally executed during build, so it wasn't spotted
earlier.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 5656b1f654bb034c561558968ed3cf87a737b3e1 split the tests
generated by mktables so that 10 separate files each execute 10% of the
tests. But it turns out that some tests are much more involved than
others, so that some of those 10 files still took much longer than
average. This commit changes the split so that the amount of time each
file takes is more balanced. It uses a natural breaking spot for the
tests for the \b{} flavors, except that GCB and SB are each short (so
are combined into being tested from one file), and LB is very long, so
is split into 4 test groups.
|
|
|
|
|
|
|
|
|
| |
"use re 'strict" is supposed to warn if a range whose start and end
points are digits aren't from the same group of 10. For example, if you
mix Bengali and Thai digits. It wasn't working properly for 5 groups of
mathematical digits starting at U+1D7E. This commit fixes that, and
refactors the code to bail out as soon as it discovers that no warning
is warranted, instead of doing unnecessary work.
|
|
|
|
|
|
|
|
|
|
| |
Switch from the two-argument form of open to the three-argument form.
Filehandle cloning is still done with the two-argument form for
backward compatibility.
Committer: Get all porting tests to pass. Increment some $VERSIONs.
Run: ./perl -Ilib regen/mk_invlists.pl; ./perl -Ilib regen/regcharclass.pl
For: RT #130122
|
|
|
|
|
|
|
|
|
|
| |
This is a follow-up to commit 9f2eed981068e7abbcc52267863529bc59e9c8c0,
which manually added const qualifiers to some generated code in order to
avoid some compiler warnings. This changes the code generator to use
the same 'const' qualifier generally. The code changed by the other
commit had been hand-edited after being generated to add branch
prediction, which would be too hard to program in at this time, so the
const additions also had to be hand-edited in.
|
| |
|
|\
| |
| |
| |
| |
| | |
Build with -Ddefault_inc_excludes_dot to exclude . from @INC.
The *current* default is set to be effectively no change. A future change
will most likely revert the default to the safer exclusion of .
|
| |
| |
| |
| |
| |
| | |
A few regen scripts aren't run by "make regen", either because they depend
on an external tool, or they must be run by the Perl just built. So they
must be run manually.
|
|/
|
|
|
| |
p5p has taken over the maintenance of this module, so it should be in
dist/
|
|
|
|
|
| |
This way we can run them at the same time under parallel testing; as
there are a lot of tests (140k or so), this makes a difference.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This macro follows Unicode Corrigendum #9 to allow non-character code
points. These are still discouraged but not completely forbidden.
It's best for code that isn't intended to operate on arbitrary
external text to use the original definition, but code that does, such
as source code control systems, should change to use this definition if
it wants to be Unicode-strict.
Perl can't adopt C9 wholesale, as it might create security holes in
existing applications that rely on Perl keeping non-chars out.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This changes the macro isUTF8_CHAR to have the same number of code
points built-in for EBCDIC as ASCII. This obsoletes the
IS_UTF8_CHAR_FAST macro, which is removed.
Previously, the code generated by regen/regcharclass.pl for ASCII
platforms was hand copied into utf8.h, and LIKELY's manually added, then
the generating code was commented out. Now this has been done with
EBCDIC platforms as well. This makes regenerating regcharclass.h
faster.
The copied macro in utf8.h is moved by this commit to within the main
code section for non-EBCDIC compiles, cutting the number of #ifdef's
down, and the comments about it are changed somewhat.
|
|
|
|
| |
They are not "characters".
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
These may be useful to various module writers. They certainly are
useful for Encode. This makes public API macros to determine if the
input UTF-8 represents (one macro for each category)
a) a surrogate code point
b) a non-character code point
c) a code point that is above Unicode's legal maximum.
The macros are machine generated. In making them public, I am now using
the string end location parameter to guard against running off the end
of the input. Previously this parameter was ignored, as their use in
the core could be tightly controlled so that we already knew that the
string was long enough when calling these macros. But this can't be
guaranteed in the public API. An optimizing compiler should be able to
remove redundant length checks.
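The macros themselves are C and operate on raw UTF-8 with an end pointer; their three categories can be illustrated at the Perl level by classifying code points directly (the ranges below are from the Unicode Standard; the function name is illustrative):

```perl
use strict;
use warnings;

# Classify a code point into the categories the macros detect.
sub classify_cp {
    my ($cp) = @_;
    return 'surrogate' if $cp >= 0xD800 && $cp <= 0xDFFF;
    return 'nonchar'   if ($cp >= 0xFDD0 && $cp <= 0xFDEF)
                       || (($cp & 0xFFFE) == 0xFFFE && $cp <= 0x10FFFF);
    return 'super'     if $cp > 0x10FFFF;   # above Unicode's maximum
    return 'ordinary';
}
```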
|
|
|
|
|
|
|
|
|
| |
All tests appear to pass. I hereby disclaim any explicit or implicit
ownership of my changes and place them under the
public-domain/CC0/X11L/same-terms-as-Perl/any-other-license-of-your-choice
multi-license.
Thanks to Vim's ":help spell" for helping a lot.
|
|
|
|
|
|
|
| |
when 'foo' is a script. Also update the pods correspondingly, and to
encourage scx property use.
See http://nntp.perl.org/group/perl.perl5.porters/237403
|
|
|
|
|
|
|
| |
Commit 3d6c5fec8cb3579a30be60177e31058bc31285d7 changed mktables to
change to slightly less nice pod in order to remove a warning that was a
bug in Pod::Checker. Pod::Checker has now been fixed, and the current
commit reinstates the old pod.
|
|
|
|
|
| |
This includes regenerating the files that depend on the Unicode 9 data
files.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The major code changes needed to support Unicode 9.0 are to changes in
the boundary (break) rules, for things like \b{lb}, \b{wb}.
regen/mk_invlists.pl creates two-dimensional arrays for all these
properties. To see if a given point in the target string is a break or
not, regexec.c looks up the entry in the property's table whose row
corresponds to the code point before the potential break, and whose
column corresponds to the one after. Mostly this is completely
determining, but for some cases, extra context is required, and the
array entry indicates this, and there has to be specially crafted code
in regexec.c to handle each such possibility. When a new release comes
along, mk_invlists.pl has to be changed to handle any new or changed
rules, and regexec.c has to be changed to handle any changes to the
custom code.
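The row/column lookup can be sketched as follows; this is a toy table with invented class indices covering just CR/LF, not the generated data:

```perl
use strict;
use warnings;

# Tiny illustration of the two-dimensional lookup described above.
# Rows are the break-property class of the code point before the
# potential break; columns are the class after. 1 = break, 0 = no
# break. (The real tables also contain values meaning "extra context
# needed", handled by custom code in regexec.c.)
my %class = ( Other => 0, CR => 1, LF => 2 );
my @gcb = (
    # after:  Other CR LF
    [ 1, 1, 1 ],    # before: Other
    [ 1, 1, 0 ],    # before: CR   (GB3: no break between CR and LF)
    [ 1, 1, 1 ],    # before: LF
);

sub is_break {
    my ($before, $after) = @_;
    return $gcb[ $class{$before} ][ $class{$after} ];
}
```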
Unfortunately this is not a mature area of the Standard, and changes are
fairly common in new releases. In part, this is because new types of
code points come along, which need new rules. Sometimes it is because
they realized the previous version didn't work as well as it could. An
example of the latter is that Unicode now realizes that Regional
Indicator (RI) characters come in pairs, and that one should be able to
break between each pair, but not within a pair. Previous versions
treated any run of them as unbreakable. (Regional Indicators are a
fairly recent type that was added to the Standard in 6.0, and things are
still getting shaken out.)
The other main changes to these rules also involve a fairly new type of
character, emojis. We can expect further changes to these in the next
Unicode releases.
\b{gcb}, for the first time, now depends on context (in rarely
encountered cases, like RI's), so the function had to be changed from a
simple table look-up to be more like the functions handling the other
break properties.
Some years ago I revamped mktables in part to try to make it require as
few manual interventions as possible when upgrading to a new version of
Unicode. For example, a new data file in a release requires telling
mktables about it, but as long as it follows the format of existing
recent files, nothing else need be done to get whatever properties it
describes to be included.
Some of the changes to mktables involved guessing, from existing limited
data, what the underlying paradigm for that data was. The problem with
that is there may not have been a paradigm, just something they did ad
hoc, which can change at will; or I didn't understand their unstated
thinking, and guessed wrong.
Besides the boundary rule changes, the only change that the existing
mktables couldn't cope with was the addition of the Tangut script, whose
character names include the code point, like CJK UNIFIED IDEOGRAPH-3400
has always done. The paradigm for this wasn't clear, since CJK was the
only script that had this characteristic, and so I hard-coded it into
mktables. The way Tangut is structured may show that there is a
paradigm emerging (but we only have two examples, and there may not be a
paradigm at all), and so I have guessed one, and changed mktables to
assume this guessed paradigm. If other scripts like this come along,
and I have guessed correctly, mktables will cope with these
automatically without manual intervention.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A downside of supporting the Unicode break properties like \b{gcb},
\b{lb} is that these aren't very mature in the Standard, and so code
likely has to change when updating Perl to support a new version of the
Standard.
And the new rules may not be backwards compatible. This commit creates
a mechanism to tell mktables the Unicode version that the rules are
written for. If that is not the same version as being compiled, the
test file marks any failing boundary tests as TODO, and outputs a
warning if the compiled version is later than the code expects, to
alert you to the fact that the code needs to be updated.
|
|
|
|
|
|
|
|
|
| |
mktables generates a file of tests used in t/re/uniprops.t.
The tests furnished by Unicode for the boundaries like \b{gcb} have
comments that indicate the rules each test is testing. These are useful
in debugging. This commit changes things so the generated file that
includes these Unicode-supplied tests also has the corresponding
comments, which are output as part of the test descriptions.
|
|
|
|
|
|
| |
I am not certain that I find the 'leading zero means hex' format
recommendable ('0123' meaning '0x123'; the octal format has poisoned
the well), but that's water under the bridge.
|
|
|
|
| |
As scheduled for 5.26, this construct will no longer be accepted.
|
|
|
|
|
|
|
| |
It can happen that one table depends on another table for its
contents. This adds a crude mechanism to prevent the depended-upon
table from being destroyed prematurely. So far this has only shown up
during debugging, but it could have happened generally.
|
|
|
|
| |
This adds some helpful text when this option, which is for examining
the Unicode database in great detail, is used.
|