Commit message | Author | Age | Files | Lines

Build with -Ddefault_inc_excludes_dot to exclude . from @INC.
The *current* default is set so that there is effectively no change. A future
change will most likely make the default the safer exclusion of '.'.
A few regen scripts aren't run by "make regen", either because they depend
on an external tool or because they must be run by the perl just built. So
they must be run manually.
p5p has taken over the maintenance of this module, so it should be in
dist/
This way we can run them at the same time under the parallel tester; as
there are a lot of tests (140k or so), this makes a difference.
This macro follows Unicode Corrigendum #9 to allow non-character code
points. These are still discouraged, but not completely forbidden.
It's best for code that isn't intended to operate on arbitrary text from
other sources to use the original definition, but code that does operate on
such text, such as source code control systems, should change to use this
definition if it wants to be Unicode-strict.
Perl can't adopt C9 wholesale, as it might create security holes in
existing applications that rely on Perl keeping non-chars out.
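As a rough sketch of the distinction (in Python, not the actual C macros), Unicode's 66 noncharacters are U+FDD0..U+FDEF plus the last two code points of every plane; the original strict definition rejects them for interchange, while the Corrigendum #9 definition merely discourages them. The function names here are invented for illustration:

```python
def is_noncharacter(cp):
    """True for Unicode's 66 permanent noncharacters:
    U+FDD0..U+FDEF, and U+xFFFE / U+xFFFF in every plane."""
    return (0xFDD0 <= cp <= 0xFDEF) or \
           ((cp & 0xFFFE) == 0xFFFE and cp <= 0x10FFFF)

def is_strict_interchange(cp):
    # Original strictness: no surrogates, no noncharacters, <= U+10FFFF
    return (cp <= 0x10FFFF
            and not 0xD800 <= cp <= 0xDFFF
            and not is_noncharacter(cp))

def is_c9_interchange(cp):
    # Corrigendum #9: noncharacters are allowed in interchange,
    # merely discouraged; surrogates and super-Unicode still rejected
    return cp <= 0x10FFFF and not 0xD800 <= cp <= 0xDFFF
```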
This changes the macro isUTF8_CHAR to have the same number of code
points built in for EBCDIC as for ASCII. This obsoletes the
IS_UTF8_CHAR_FAST macro, which is removed.
Previously, the code generated by regen/regcharclass.pl for ASCII
platforms was hand-copied into utf8.h with LIKELYs added manually, and
the generating code was commented out. Now this has been done for
EBCDIC platforms as well. This makes regenerating regcharclass.h
faster.
The copied macro in utf8.h is moved by this commit to within the main
code section for non-EBCDIC compiles, cutting the number of #ifdefs
down, and the comments about it are changed somewhat.
They are not "characters"
These may be useful to various module writers. They certainly are
useful for Encode. This makes public API macros to determine if the
input UTF-8 represents (one macro for each category)
a) a surrogate code point
b) a non-character code point
c) a code point that is above Unicode's legal maximum.
The macros are machine generated. In making them public, I am now using
the string end location parameter to guard against running off the end
of the input. Previously this parameter was ignored, as their use in
the core could be tightly controlled so that we already knew that the
string was long enough when calling these macros. But this can't be
guaranteed in the public API. An optimizing compiler should be able to
remove redundant length checks.
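The guarded lookup can be sketched in Python (the real macros are C and machine-generated; the names and return values here are invented for illustration). The point is that the end location is consulted before any continuation byte is read:

```python
MASKS = {1: 0x7F, 2: 0x1F, 3: 0x0F, 4: 0x07}  # start-byte payload masks

def classify_utf8(buf, pos, end):
    """Classify the UTF-8 character starting at buf[pos], never reading
    at or past `end` (the string-end guard described above)."""
    b0 = buf[pos]
    n = 1 if b0 < 0x80 else 2 if b0 < 0xE0 else 3 if b0 < 0xF0 else 4
    if pos + n > end:
        return "truncated"            # would run off the end of the input
    cp = b0 & MASKS[n]
    for b in buf[pos + 1 : pos + n]:  # accumulate 6 bits per continuation
        cp = (cp << 6) | (b & 0x3F)
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    if cp > 0x10FFFF:
        return "super"                # above Unicode's legal maximum
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "nonchar"
    return "ordinary"
```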
All tests appear to pass. I hereby disclaim any explicit or implicit
ownership of my changes and place them under the
public-domain/CC0/X11L/same-terms-as-Perl/any-other-license-of-your-choice
multi-license.
Thanks to Vim's ":help spell" for helping a lot.
when 'foo' is a script. Also update the pods correspondingly, and to
encourage scx property use.
See http://nntp.perl.org/group/perl.perl5.porters/237403
Commit 3d6c5fec8cb3579a30be60177e31058bc31285d7 changed mktables to
change to slightly less nice pod in order to remove a warning that was a
bug in Pod::Checker. Pod::Checker has now been fixed, and the current
commit reinstates the old pod.
This includes regenerating the files that depend on the Unicode 9 data
files.
The major code changes needed to support Unicode 9.0 are to changes in
the boundary (break) rules, for things like \b{lb}, \b{wb}.
regen/mk_invlists.pl creates two-dimensional arrays for all these
properties. To see if a given point in the target string is a break or
not, regexec.c looks up the entry in the property's table whose row
corresponds to the code point before the potential break, and whose
column corresponds to the one after. Mostly this is completely
determining, but for some cases, extra context is required, and the
array entry indicates this, and there has to be specially crafted code
in regexec.c to handle each such possibility. When a new release comes
along, mk_invlists.pl has to be changed to handle any new or changed
rules, and regexec.c has to be changed to handle any changes to the
custom code.
Unfortunately this is not a mature area of the Standard, and changes are
fairly common in new releases. In part, this is because new types of
code points come along, which need new rules. Sometimes it is because
they realized the previous version didn't work as well as it could. An
example of the latter is that Unicode now realizes that Regional
Indicator (RI) characters come in pairs, and that one should be able to
break between each pair, but not within a pair. Previous versions
treated any run of them as unbreakable. (Regional Indicators are a
fairly recent type that was added to the Standard in 6.0, and things are
still getting shaken out.)
The other main changes to these rules also involve a fairly new type of
character, emojis. We can expect further changes to these in the next
Unicode releases.
\b{gcb}, for the first time, now depends on context (in rarely
encountered cases, like RIs), so the function had to be changed from a
simple table look-up to be more like the functions handling the other
break properties.
Some years ago I revamped mktables in part to try to make it require as
few manual interventions as possible when upgrading to a new version of
Unicode. For example, a new data file in a release requires telling
mktables about it, but as long as it follows the format of existing
recent files, nothing else need be done to get whatever properties it
describes to be included.
Some of the changes to mktables involved guessing, from existing limited
data, what the underlying paradigm for that data was. The problem with
that is that there may not have been a paradigm, just something done ad
hoc, which can change at will; or I didn't understand their unstated
thinking and guessed wrong.
Besides the boundary rule changes, the only change that the existing
mktables couldn't cope with was the addition of the Tangut script, whose
character names include the code point, like CJK UNIFIED IDEOGRAPH-3400
has always done. The paradigm for this wasn't clear, since CJK was the
only script that had this characteristic, and so I hard-coded it into
mktables. The way Tangut is structured may show that there is a
paradigm emerging (but we only have two examples, and there may not be a
paradigm at all), and so I have guessed one, and changed mktables to
assume this guessed paradigm. If other scripts like this come along,
and I have guessed correctly, mktables will cope with these
automatically without manual intervention.
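A toy model of such a two-dimensional table, with the new Regional Indicator rule as the context-dependent case (Python; the real tables are generated by regen/mk_invlists.pl and the custom code lives in regexec.c — the class names and table here are invented, and reduced to two classes):

```python
OTHER, RI = "Other", "RI"   # a tiny subset of the real break classes

# Row = class of the code point before the potential break,
# column = class after.  True = break, False = no break,
# "context" = the entry alone doesn't decide; look further back.
TABLE = {
    (OTHER, OTHER): True,
    (OTHER, RI):    True,
    (RI,    OTHER): True,
    (RI,    RI):    "context",   # break only between RI *pairs*
}

def is_break(classes, i):
    """May we break between classes[i-1] and classes[i]?"""
    entry = TABLE[(classes[i - 1], classes[i])]
    if entry != "context":
        return entry
    # RI x RI: count the RIs immediately preceding position i; a break
    # is permitted only after a complete pair, i.e. when that count is even.
    n, j = 0, i - 1
    while j >= 0 and classes[j] == RI:
        n += 1
        j -= 1
    return n % 2 == 0
```

Between four consecutive RIs (two flags), the only permitted break is in the middle, matching the Unicode 9 change described above.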
A downside of supporting the Unicode break properties like \b{gcb},
\b{lb} is that these aren't very mature in the Standard, and so code
likely has to change when updating Perl to support a new version of the
Standard.
And the new rules may not be backwards compatible. This commit creates
a mechanism to tell mktables the Unicode version that the rules are
written for. If that is not the same version as being compiled, the
test file marks any failing boundary tests as TODO, and outputs a
warning if the compiled version is later than the code expects, to
alert you to the fact that the code needs to be updated.
mktables generates a file of tests used in t/re/uniprops.t.
The tests furnished by Unicode for the boundaries like \b{gcb} have
comments that indicate the rules each test is testing. These are useful
in debugging. This commit changes things so the generated file that
includes these Unicode-supplied tests also has the corresponding
comments which are output as part of the test descriptions.
I am not certain that I find the 'leading zero means hex' format
recommendable ('0123' meaning '0x123'; the octal format has poisoned
the well), but that's water under the bridge.
As scheduled for 5.26, this construct will no longer be accepted.
It can happen that one table depends on another table for its
contents. This adds a crude mechanism to prevent the depended-upon
table from being destroyed prematurely. So far this has only shown up
during debugging, but it could have happened generally.
This adds some helpful text when this option is used. The option is for
examining the Unicode database in great detail.
This can be used for debugging.
The code that generates the tables for the \b{foo} handling (in
regexec.c) did not work correctly when compiled on an earlier Unicode.
This fixes that, consolidating some common code into a common function
and making the generated header file look nicer, with the tables taking
up fewer columns of screen space.
There are two types of tables in mktables: map tables, which map code
points to the values a property has for those code points; and match
tables, which are boolean, answering "does a code point match a given
property value?". There are different data structures to encapsulate
each. This code was using the wrong structure to look something up.
Usually this failed and a fall-back value was used instead, but when
compiling an early Unicode release, I discovered that there could be a
conflict.
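The two kinds of table answer different questions, which a minimal sketch makes concrete (Python, with made-up miniature data; mktables actually uses inversion lists and richer objects, not plain dicts and sets):

```python
# Map table: code point -> property value (here, a sliver of the
# General_Category map; "Cn" is the property's default value)
GC_MAP = {0x41: "Lu", 0x61: "Ll", 0x30: "Nd"}

# Match table: for ONE property value, the set of matching code points
GC_LU_MATCH = {0x41}

def value_of(cp):
    """What a map table answers: which value does this code point have?"""
    return GC_MAP.get(cp, "Cn")

def matches_Lu(cp):
    """What a match table answers: does this code point have value Lu?"""
    return cp in GC_LU_MATCH
```

Looking up a property value in a match table (or vice versa), as the buggy code did, type-confuses the two: one maps, the other only answers yes/no.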
An array had 2 optional elements at the end, and I got confused about
handling them. This change deals with the final one first, popping it
and saving it separately if found; then only one optional element needs
to be dealt with in the rest of the code.
This only gets executed for very early Unicode versions.
Therefore, we shouldn't add any above that.
mktables generates the file used in this test. Unicode version 9
introduces a numeric value that is an order of magnitude closer to 0
than any previous version had. This demonstrated a bug in mktables,
where it didn't consider the possibility of floating point numbers being
indistinguishably close to integers. It did check for being too close
to the rational numbers used in Unicode, but omitted checking for
integers. This adds that check, which in turn causes some wrong test
cases to not be generated for this .t.
This bug has not shown up in earlier Unicode versions, but is there
nonetheless, so I'm pushing this now instead of waiting.
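The omitted check can be illustrated with a hedged sketch (Python; the function name and epsilon are invented, not mktables' actual code). A deliberately "wrong" numeric test value is only usable if it is distinguishable from every real value the property takes, integers included:

```python
from fractions import Fraction

def distinguishable(candidate, real_values, epsilon=1e-6):
    """A generated 'wrong' test value is usable only if it is not
    within epsilon of ANY real value of the property -- including
    plain integers, the case the original check omitted."""
    return all(abs(candidate - float(v)) >= epsilon for v in real_values)
```

With a Unicode-9-style value an order of magnitude closer to 0 than before, a candidate such as 1e-7 clears the rational values but collides with the integer 0, so it must be rejected rather than emitted as a test case.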
Things like qr/\s/ are expecting native code points, not EBCDIC.
An 'ord' was missing, so a warning was raised. This file is generated
by mktables.
This adds the final Unicode boundary type previously missing from core
Perl: the LineBreak one. This feature is already available in the
Unicode::LineBreak module, but I've been told that there are portability
and some other issues with that module. What's added here is a
light-weight version that is lacking the customizable features of the
module.
This implements the default Line Breaking algorithm, but with the
customizations that Unicode is expecting everybody to add, as their
test file tests for them. In other words, this passes Unicode's fairly
extensive furnished tests, but wouldn't if it didn't include certain
customizations specified by Unicode beyond the basic algorithm.
The implementation uses a look-up table of the characters surrounding a
boundary to see if it is a suitable place to break a line. In a few
cases, context needs to be taken into account, so there is code in
addition to the lookup table to handle those.
This should meet the needs for line breaking of many applications,
without having to load the module.
The algorithm is somewhat independent of the Unicode version, just like
the other boundary types. Only if new rules are added, or existing ones
modified, is there a need to go in and change this code. Otherwise,
running regen/mk_invlists.pl when a new Unicode release comes out should
be sufficient to keep it up to date, again like the other Unicode
boundary types.
This is in preparation for adding qr/\b{lb}/. This just generates the
tables, and is a separate commit because otherwise the diff listing is
confusing, as it doesn't realize there are only additions. So, even
though the difference listing for this commit for the generated header
file is wildly crazy, the only changes in reality are the addition of
some tables for Line Break.
This allows a default value to be specified, to prepare for a later
commit.
The guts of this test are generated by mktables. Commit
f1f6961f5a6fd77a3e3c36f242f1b72ce5dfe205 broke the handling of early
Unicode versions.
Prior to this commit, it would not compile because 2 properties weren't
defined in very early Unicodes.
This commit comments out the code that generates these tables. This is
trivially reversible. We don't believe anyone is using Perl on POSIX-BC
at this time, and this saves time during development when the tables
have to be regenerated, and makes the resulting tarball smaller.
See thread beginning at
http://nntp.perl.org/group/perl.perl5.porters/233663
The Unicode \b{wb} matches the boundary between space characters in a
span of them. This is the opposite of what \b does, and is
counterintuitive to Perl expectations. This commit tailors \b{wb} so
that it does not split up spans of white space.
I have submitted a request to Unicode to re-examine their algorithm, and
this has been assigned to a subcommittee to look at, but the result
won't be available until after 5.24 is done. In any event, Unicode
encourages tailoring for local conditions.
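The effect of the tailoring on runs of spaces can be mimicked with a simple sketch (Python regexes standing in for the real word-break machinery; this illustrates only the whitespace behavior, not the full \b{wb} rules):

```python
import re

def segments_default(s):
    """Untailored behavior: a boundary between every pair of adjacent
    spaces, so each space is its own segment."""
    return re.findall(r"\s|\S+", s)

def segments_tailored(s):
    """Perl's tailoring: a run of white space stays in one piece."""
    return re.findall(r"\s+|\S+", s)
```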
This new parameter, which will be used in the next commit, adds a
special case for handling tables that the perl interpreter relies on
when compiling a Unicode version earlier than the one in which the
property was defined. This will allow tailoring the property to Perl's
needs in the next commit.
The utf8 locale testing was not getting done.
This may be enough to let some platforms that aren't able to compile the
Unicode tables work. But it's quite late in the process. The ultimate
solution would be for the tables all to be compiled ahead of time; that
is under consideration for the future.
So that, in a make, it is abundantly clear where the messages are coming
from.
...not the hyphenated form
commit message by rjbs
This uses for UTF-EBCDIC essentially the same mechanism that Perl
already uses for UTF-8 on ASCII platforms to extend it beyond what might
be its natural maximum. That is, when the UTF-8 start byte is 0xFF, it
adds a bunch more bytes to the character than it otherwise would,
bringing it to a total of 14 for UTF-EBCDIC. This is enough to handle
any code point that fits in a 64 bit word.
The downside of this is that this extension is not compatible with
previous perls for code points from 2**30 up through the previous
maximum. A simple program could be written to convert files that were
written out using an older perl so that they can be read with newer
perls, and the perldelta says we will do this should anyone ask.
However, I strongly suspect that the number of such files in existence
is zero, as people in EBCDIC land don't seem to use Unicode much, and
these are very large code points, which are associated with a
portability warning every time they are output in some way.
This extension brings UTF-EBCDIC to parity with UTF-8, so that both can
cover a 64-bit word. It allows some removal of special cases for EBCDIC
in core code and core tests. And it is a necessary step to handle Perl
6's NFG, which I'd like eventually to bring to Perl 5.
This commit causes two implementations of a macro in utf8.h and
utfebcdic.h to become the same, and both are moved to a single one in
the portion of utf8.h common to both.
To illustrate, the I8 for U+3FFFFFFF (2**30-1) is
"\xFE\xBF\xBF\xBF\xBF\xBF\xBF" before and after this commit, but the I8
for the next code point, U+40000000 is now
"\xFF\xA0\xA0\xA0\xA0\xA0\xA0\xA1\xA0\xA0\xA0\xA0\xA0\xA0",
and before this commit it was "\xFF\xA0\xA0\xA0\xA0\xA0\xA0".
The I8 for 2**64-1 (U+FFFFFFFFFFFFFFFF) is
"\xFF\xAF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF", whereas
before this commit it was unrepresentable.
Commit 7c560c3beefbb9946463c9f7b946a13f02f319d8 said in its message that
it was moving something that hadn't been needed on EBCDIC until the
"next commit". That statement turned out to be wrong, overtaken by
events: that commit was pushed prematurely, and this now is the commit
it was referring to.
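The arithmetic of the new FF form can be checked against the examples above: an 0xFF start byte is followed by 13 I8 continuation bytes (0xA0-0xBF), each carrying 5 payload bits, for 65 bits of payload. A sketch (Python, not the core's C implementation; the function name is invented) reproduces the quoted byte strings:

```python
def i8_ff_form(cp):
    """Encode a code point in the extended (0xFF start byte) I8 form:
    13 continuation bytes of 5 payload bits each, most significant
    group first, each emitted as 0xA0 | bits."""
    out = [0xFF]
    for shift in range(60, -5, -5):           # shifts 60, 55, ..., 5, 0
        out.append(0xA0 | ((cp >> shift) & 0x1F))
    return bytes(out)
```

13 groups of 5 bits cover 65 bits, which is why this form is enough for any code point that fits in a 64-bit word.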
Locale handling doesn't work without the POSIX module being able to
load, so it doesn't work under minitest. Prior to this patch, the code
checked for only one case of locale handling to skip when there was no
POSIX, but there was a second case it failed to detect.
These tests were using individually defined heuristics to decide whether
to do locale testing or not. However t/loc_tools.pl provides functions
that are more reliable and complete for determining this than the
hand-rolled ones in these tests.
./perl -Ilib regen/regcharclass.pl
./perl -Ilib regen/mk_invlists.pl
Through an oversight, the text that was supposed to be printed as the
name for a test was just getting output as a 1 or 0.
This function's capabilities have expanded beyond its original use, but
the descriptive comments weren't updated until now.
This ticket was originally filed because the requester did not realize
that the function Unicode::UCD::charscript takes a code point argument
rather than a chr one. It was rejected on that basis. But discussion
there suggested it would be better to warn on bad input instead of just
returning <undef>. It turns out that all routines in Unicode::UCD other
than charscript and charblock already do warn. This commit extends that
to the two outliers.
This changes perluniprops to not list the equivalent 'In' single form
method of specifying the Block property, and to discourage its use. The
reason is that this is a Perl extension, the use of which is unstable.
A future Unicode release could take over the 'In...' name for a new
purpose, and perl would follow along, breaking the code that assumed the
former meaning. Unicode does not know about this Perl extension, and
they wouldn't care if they did know.
The reason I'm doing this now is that the latest Unicode version
introduced some properties whose names begin with 'In', though no
conflicts arose. But it is clear that such conflicts could arise in the
future, so for now only the documentation is changed, to warn people of
this potential.
perlunicode is updated accordingly.
Special code suppressed the expanded output of some ranges, where it
would be clear from the range itself what was meant. However, for many
output tables, that range output was changed, so the desired
information is missing. For these tables, don't suppress the expanded
output.