| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
| |
Things like qr/\s/ are expecting native code points, not EBCDIC.
|
|
|
|
|
|
|
|
| |
This follows the recent commits for lb and gcb, and generates a table at
regen time for Word Breaking. The result may run faster, depending on
the compiler optimization capabilities, than before, and is easier to
maintain, as it's easier to smack a new rule into the regen perl script
than it is to change the C code.
|
|
|
|
|
|
| |
This suppresses many clang warnings saying "suggest braces around
initialization of subobject" when the generated charclass_invlists.h
is included.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This changes the handling of Grapheme Cluster Breaks to be entirely via
a lookup table generated by regen/mk_invlists.pl.
This is easier to maintain and follow, as the generation of the table
follows the text of Unicode's UAX29 precisely, and loops can be used to
set every class up instead of having to name each explicitly, so it will
be easier to add new rules. And the runtime switch statement is
replaced by a single line.
My gcc compiler optimized the previous version to an array lookup, but
this commit does it for not so clever compilers.
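
As a rough illustration of the table-driven approach described above, the sketch below uses a hand-written table over a hypothetical, heavily reduced set of classes. The real table is generated by regen/mk_invlists.pl and covers every class; the names `gcb_class`, `gcb_break`, and `is_gcb_boundary` are assumptions made for this example only. The entries encode UAX29 rules GB3, GB4, GB5, GB9, and the default "break everywhere else".

```c
#include <stdbool.h>

/* Hypothetical, reduced set of Grapheme_Cluster_Break classes. */
typedef enum {
    GCB_Other, GCB_CR, GCB_LF, GCB_Control, GCB_Extend, GCB_COUNT
} gcb_class;

/* gcb_break[before][after] is true when a boundary is allowed between
 * a character of class 'before' and one of class 'after'. */
static const bool gcb_break[GCB_COUNT][GCB_COUNT] = {
    /*               Other  CR     LF     Control Extend */
    /* Other   */ {  true,  true,  true,  true,   false },  /* GB9: x Extend */
    /* CR      */ {  true,  true,  false, true,   true  },  /* GB3: CR x LF  */
    /* LF      */ {  true,  true,  true,  true,   true  },  /* GB4           */
    /* Control */ {  true,  true,  true,  true,   true  },  /* GB4           */
    /* Extend  */ {  true,  true,  true,  true,   false },  /* GB9           */
};

/* The runtime decision is a single array lookup instead of a switch. */
static bool is_gcb_boundary(gcb_class before, gcb_class after)
{
    return gcb_break[before][after];
}
```

Adding a rule in this scheme means changing table entries rather than control flow, which is the maintenance benefit the commit describes.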
|
|
|
|
|
| |
An 'ord' was missing, so a warning was raised. This file is generated
by mktables.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adds the final Unicode boundary type previously missing from core
Perl: the LineBreak one. This feature is already available in the
Unicode::LineBreak module, but I've been told that there are portability
and some other issues with that module. What's added here is a
light-weight version that is lacking the customizable features of the
module.
This implements the default Line Breaking algorithm, but with the
customizations that Unicode is expecting everybody to add, as their
test file tests for them. In other words, this passes Unicode's fairly
extensive furnished tests, but wouldn't if it didn't include certain
customizations specified by Unicode beyond the basic algorithm.
The implementation uses a look-up table of the characters surrounding a
boundary to see if it is a suitable place to break a line. In a few
cases, context needs to be taken into account, so there is code in
addition to the lookup table to handle those.
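
A minimal sketch of "lookup table plus context code", assuming a hypothetical four-class subset of Line_Break and made-up names (`lb_class`, `lb_action`, `lb_table`, `is_lb_boundary`); it is not the code in the perl core. Most pairs are decided directly by the table. A pair whose first member is a space is marked as needing context, because UAX #14 rule LB14 (no break after an opening punctuation, even across intervening spaces) overrides the usual break after spaces (LB18).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset: alphabetic, space, open punctuation, break-after. */
typedef enum { LB_AL, LB_SP, LB_OP, LB_BA, LB_COUNT } lb_class;

typedef enum {
    LB_BREAKABLE,    /* a break is allowed here                  */
    LB_NOBREAK,      /* a break is forbidden here                */
    LB_CHECK_OP_SP   /* table alone cannot decide: look backward */
} lb_action;

/* lb_table[before][after] */
static const lb_action lb_table[LB_COUNT][LB_COUNT] = {
    /*          AL              SP          OP              BA             */
    /* AL */ { LB_NOBREAK,     LB_NOBREAK, LB_NOBREAK,     LB_NOBREAK     },
    /* SP */ { LB_CHECK_OP_SP, LB_NOBREAK, LB_CHECK_OP_SP, LB_CHECK_OP_SP },
    /* OP */ { LB_NOBREAK,     LB_NOBREAK, LB_NOBREAK,     LB_NOBREAK     },
    /* BA */ { LB_BREAKABLE,   LB_NOBREAK, LB_BREAKABLE,   LB_NOBREAK     },
};

/* Is there a break opportunity between cls[pos - 1] and cls[pos]?
 * The caller must ensure pos >= 1 and pos < length of cls. */
static bool is_lb_boundary(const lb_class *cls, size_t pos)
{
    lb_action action = lb_table[cls[pos - 1]][cls[pos]];
    if (action != LB_CHECK_OP_SP)
        return action == LB_BREAKABLE;

    /* Context case: skip back over the run of spaces before the
     * boundary and see whether it was introduced by an OP. */
    size_t i = pos - 1;
    while (i > 0 && cls[i] == LB_SP)
        i--;
    return cls[i] != LB_OP;
}
```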
This should meet the needs for line breaking of many applications,
without having to load the module.
The algorithm is somewhat independent of the Unicode version, just like
the other boundary types. Only if new rules are added, or existing ones
modified is there need to go in and change this code. Otherwise,
running regen/mk_invlists.pl should be sufficient when a new Unicode
release is done to keep it up-to-date, again like the other Unicode
boundary types.
|
|
|
|
|
|
|
|
|
| |
This is in preparation for adding qr/\b{lb}/. This just generates the
tables, and is a separate commit because otherwise the diff listing is
confusing, as it doesn't realize there are only additions. So, even
though the difference listing for this commit for the generated header
file is wildly crazy, the only changes in reality are the addition of
some tables for Line Break.
|
|
|
|
|
|
|
| |
A future commit will tailor a property to use fewer values than Unicode
provides. Currently we look at the official property, and croak if not
all the property values are there. This commit instead looks at the
tailored property, the one that actually is being output.
|
|
|
|
|
| |
This allows a default value to be specified, to prepare for a later
commit.
|
|
|
|
|
|
|
|
|
| |
This moves the name of a synthetic enum value to a better place in the
code. The list it had been in is for a specific purpose that is not
applicable to synthetic values, though it worked.
But the new place is more logical, and can take advantage of the
previous commit which makes things in this place more predictable.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Most Unicode properties have a finite set of possible values. Most, for
example, are binary: they can be either true or false, but nothing in
between. Others have more possibilities (and still others, like Name,
are not restricted at all). The Word Break property, for example, can
take on a restricted set of values, currently 19 in all, that indicate
what type, for purposes of word breaking, the character is.
In implementing things like Word Break, Perl adds some internal-only
values, like EDGE, which means matching like /^/ or /$/. By using
these synthetic values, we don't need to have extra code for edge
cases.
These properties are implemented using C enums. Prior to this commit,
the actual numeric value for each enum was mostly arbitrary, with the
synthetic ones intermixed with the official ones. This commit changes
that so the synthetic ones are all higher numbers than any official ones,
and the order they appear in the generating code will be the numerical
order they have, so that the program has control of their order.
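
The resulting layout can be pictured roughly as follows. This is only a sketch: the enumerator names and the three "official" values are hypothetical stand-ins, not the generated enum, and the `wb_is_synthetic` helper is invented for this example.

```c
/* Official values come first, in the order the generating code lists
 * them; synthetic, Perl-internal values are numbered after all of them. */
typedef enum {
    /* Official Word_Break values (hypothetical subset). */
    WB_Other,
    WB_ALetter,
    WB_Numeric,
    WB_OFFICIAL_COUNT,            /* number of official values */

    /* Synthetic values start where the official ones end. */
    WB_EDGE = WB_OFFICIAL_COUNT,  /* start or end of the string */
    WB_COUNT                      /* total, usable as a table dimension */
} wb_value;

/* With this ordering, "is this value synthetic?" is a range check. */
static int wb_is_synthetic(wb_value v)
{
    return v >= WB_OFFICIAL_COUNT && v < WB_COUNT;
}
```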
|
|
|
|
|
|
| |
The guts of this test are generated by mktables. Commit
f1f6961f5a6fd77a3e3c36f242f1b72ce5dfe205 broke the handling of early
Unicode versions.
|
|
|
|
|
| |
Prior to this commit, it would not compile because 2 properties weren't
defined in very early Unicodes.
|
|
|
|
|
|
|
|
|
|
|
| |
This commit comments out the code that generates these tables. This is
trivially reversible. We don't believe anyone is using Perl and
POSIX-BC at this time, and this saves time during development when
having to regenerate these tables, and makes the resulting tar ball
smaller.
See thread beginning at
http://nntp.perl.org/group/perl.perl5.porters/233663
|
|
|
|
|
|
|
|
|
|
|
|
| |
The Unicode \b{wb} matches the boundary between space characters in a
span of them. This is opposite of what \b does, and is counterintuitive
to Perl expectations. This commit tailors \b{wb} to not split up spans
of white space.
I have submitted a request to Unicode to re-examine their algorithm, and
this has been assigned to a subcommittee to look at, but the result
won't be available until after 5.24 is done. In any event, Unicode
encourages tailoring for local conditions.
|
|
|
|
|
|
|
| |
This new parameter, which will be used in the next commit, adds a
special case for handling tables that the perl interpreter relies on
when compiling a Unicode version earlier than the one in which Unicode
defined the property. This will allow for tailoring the property to
Perl's needs in the next commit.
|
|
|
|
| |
The utf8 locale testing was not getting done.
|
|
|
|
|
|
|
| |
This may be enough for some platforms that aren't able to compile the
Unicode tables to work. But it's quite late in the process. The
ultimate solution would be for the tables to all be compiled ahead of
time. That is under consideration for the future.
|
|
|
|
| |
So in a make, it is abundantly clear where the messages are coming from.
|
|
|
|
|
| |
This is needed because mktables changed. A porting test did not pick
this up, and so probably should be made to.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This uses for UTF-EBCDIC essentially the same mechanism that Perl
already uses for UTF-8 on ASCII platforms to extend it beyond what might
be its natural maximum. That is, when the UTF-8 start byte is 0xFF, it
adds a bunch more bytes to the character than it otherwise would,
bringing it to a total of 14 for UTF-EBCDIC. This is enough to handle
any code point that fits in a 64 bit word.
The downside of this is that this extension is not compatible with
previous perls for the range 2**30 up through the previous max,
2**31 - 1. A simple program could be written to convert files that were
written out using an older perl so that they can be read with newer
perls, and the perldelta says we will do this should anyone ask.
However, I strongly suspect that the number of such files in existence
is zero, as people in EBCDIC land don't seem to use Unicode much, and
these are very large code points, which are associated with a
portability warning every time they are output in some way.
This extension brings UTF-EBCDIC to parity with UTF-8, so that both can
cover a 64-bit word. It allows some removal of special cases for EBCDIC
in core code and core tests. And it is a necessary step to handle Perl
6's NFG, which I'd like eventually to bring to Perl 5.
This commit causes two implementations of a macro in utf8.h and
utfebcdic.h to become the same, and both are moved to a single one in
the portion of utf8.h common to both.
To illustrate, the I8 for U+3FFFFFFF (2**30-1) is
"\xFE\xBF\xBF\xBF\xBF\xBF\xBF" before and after this commit, but the I8
for the next code point, U+40000000 is now
"\xFF\xA0\xA0\xA0\xA0\xA0\xA0\xA1\xA0\xA0\xA0\xA0\xA0\xA0",
and before this commit it was "\xFF\xA0\xA0\xA0\xA0\xA0\xA0".
The I8 for 2**64-1 (U+FFFFFFFFFFFFFFFF) is
"\xFF\xAF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF", whereas
before this commit it was unrepresentable.
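
The byte sequences above can be reproduced with a short sketch. It assumes only what the examples show: the extended form is the start byte 0xFF followed by 13 continuation bytes, each carrying 5 payload bits under the 0xA0 marker. The function name `i8_encode_extended` is made up for this example, and the result is I8, not the final UTF-EBCDIC bytes, which need a further byte-for-byte mapping.

```c
#include <stddef.h>
#include <stdint.h>

#define I8_EXTENDED_LEN 14   /* 0xFF start byte + 13 continuation bytes */

/* Write the extended (0xFF-start) I8 form of cp into out.
 * 13 continuation bytes x 5 bits = 65 bits, enough for any 64-bit value. */
static void i8_encode_extended(uint64_t cp, unsigned char out[I8_EXTENDED_LEN])
{
    out[0] = 0xFF;
    /* Fill from the last byte backward, 5 low-order bits at a time. */
    for (size_t i = I8_EXTENDED_LEN - 1; i >= 1; i--) {
        out[i] = (unsigned char)(0xA0 | (cp & 0x1F));
        cp >>= 5;
    }
}
```

For U+40000000 the only set bit is bit 30, which falls in the seventh continuation byte, giving the lone \xA1 in the sequence shown above.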
Commit 7c560c3beefbb9946463c9f7b946a13f02f319d8 said in its message that
it was moving something that hadn't been needed on EBCDIC until the
"next commit". That statement turned out to be wrong, overtaken by
events. This now is the commit it was referring to.
commit I prematurely
pushed that
|
|
|
|
|
|
|
| |
Locale handling doesn't work unless the POSIX module can be loaded, so
it doesn't work on minitest. Prior to this patch, the code checked for
only one case of locale handling to skip when there was no POSIX, but
there was a 2nd case it failed to detect.
|
|
|
|
|
|
|
| |
These tests were using individually defined heuristics to decide whether
to do locale testing or not. However t/loc_tools.pl provides functions
that are more reliable and complete for determining this than the
hand-rolled ones in these tests.
|
|
|
|
|
| |
./perl -Ilib regen/regcharclass.pl
./perl -Ilib regen/mk_invlists.pl
|
|
|
|
|
| |
Through an oversight, the text that was supposed to be printed as the
name for a test was just getting output as a 1 or 0.
|
|
|
|
|
| |
This function's capabilities have expanded beyond its original use, but
the descriptive comments had not been updated to match until now.
|
|
|
|
|
|
|
|
|
|
| |
This ticket was originally filed because the requester did not realize
the function Unicode::UCD::charscript took a code point argument instead
of a chr one. It was rejected on that basis. But discussion here
suggested it would be better to warn on bad input instead of just
returning <undef>. It turns out that all other routines in Unicode::UCD
but charscript and charblock already do warn. This commit extends that
to the two outlier routines.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This changes perluniprops to not list the equivalent 'In' single form
method of specifying the Block property, and to discourage its use. The
reason is that this is a Perl extension, the use of which is unstable.
A future Unicode release could take over the 'In...' name for a new
purpose, and perl would follow along, breaking the code that assumed the
former meaning. Unicode does not know about this Perl extension, and
they wouldn't care if they did know.
The reason I'm doing this now is that the latest Unicode version
introduced some properties whose names begin with 'In', though no
conflicts arose. But it is clear that such conflicts could arise in the
future. So the documentation only is changed to warn people of this
potential.
perlunicode is updated accordingly.
|
|
|
|
|
|
|
|
| |
Special code suppressed the expanded output of some ranges, where it
would be clear from the range itself what was meant. However, for many
output tables, that range output was changed, so the desired
information is missing. For these tables, don't suppress the expanded
output.
|
| |
|
|
|
|
|
| |
The DAge.txt property until the previous commit had to be handled
out-of-the-normal order. This is no longer required.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This functionality is rarely used, but enables someone to see what
Unicode has changed between releases X and Y, without the clutter of the
things that are added after X came out. In other words it compiles
release X using Y's rules. To use it, you must go in and edit mktables
to specify to use this; so it is intended only for a developer who wants
to look at Unicode history. One use I've made of it is to look at the
beta version of a new release to compare with the previous official one.
This allows me to find typos, and unintentional changes and report them
back to Unicode.
This commit significantly overhauls this feature, giving better results
than before.
|
|
|
|
|
|
| |
There were several glitches when compiling very early Unicode releases.
This commit changes things so the age property reference is stored in a
global, and doesn't have to be refound multiple times.
|
|
|
|
|
|
| |
This takes two code sections and moves each into its own function. For
one, this is in preparation for being used in a 2nd place. For the
other, the other existing places now call the new function.
|
|
|
|
|
|
|
| |
The Default_Ignorable_Code_Point property is applicable to unassigned
code points, so shouldn't restrict our calculated value to assigned.
(We calculate what the property would be when run on Unicode releases
that haven't defined it yet.)
|
|
|
|
|
| |
These constants are used in more than one place. Use a common variable
instead of repeating the hex numbers.
|
|
|
|
| |
This can be useful information.
|
|
|
|
|
|
|
| |
I found these places where file existence can be used instead of knowing
what version something happened in. Sometimes those numbers are wrong,
and one of these was. If it can be avoided, it is better not to use
version numbers.
|
|
|
|
|
|
| |
It turns out that the generated map files look better (even if
functionally equivalent) if the default mapping is the one to the
above-Unicode code points. This was the only one that had it different.
|
|
|
|
|
| |
This isn't defined by Unicode until a later version, but Perl wants it
in all versions.
|
|
|
|
|
|
|
|
|
|
|
| |
perl needs the Name_Alias property accessible in all releases in order
for charnames to work properly. However the property was not created
until Unicode version 5.0. Previously, the property was made available
to all Unicode versions, which is contrary to the policy of exposing
properties to public use only when Unicode so exposes them. Thus the
behavior is as close as possible to Unicode-specified. This commit
creates an internal-only property for the perl core, and removes the
general access on early Unicode releases.
|
|
|
|
|
|
|
| |
This allows \b{wb} and \b{sb} to work on all Unicode releases. The huge
number of differences in charclass_invlists.h is only because the names
of the SB and WB tables change, and the code automatically
re-alphabetizes things.
|
|
|
|
|
|
|
|
| |
The GCB property was not properly being generated in early Unicode
releases. The huge commit diff is due solely to the fact that the name
of this property changes, so that it is sure not to be accessible
outside the perl core, and the property tables are automatically
re-sorted alphabetically.
|
|
|
|
| |
This allows us to remove some special handling.
|
|
|
|
|
|
| |
This file is crucial to compiling perl these days. This commit converts
to use the new infrastructure for dealing with compiling Unicode
releases prior to when this file was made available.
|
|
|
|
|
|
| |
This file is crucial to compiling perl these days. This commit converts
to use the new infrastructure for dealing with compiling Unicode
releases prior to when this file was made available.
|
|
|
|
|
|
|
|
|
|
|
| |
This adds infrastructure to the constructor of the Input_file class to
allow an alternative to be specified when compiling a Unicode release
that is earlier than the one in which the file first became available.
This is only used when the property is used by core perl and has to work
in all releases. For example the qr/\X/ construct should always work,
but relies on a property that isn't specified before Unicode 4.1. This
allows for easier specification of how to handle this type of case.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Until this commit there were two mechanisms available to specify which
files in the Unicode Character Database are not used by mktables. Now there
is one. The global that contained such files is deleted, and instead
all such files are specified by an Input_file class object. This has
the advantage of just one method, and the constructor already has
parameters to specify when a file first appeared, and when it was
removed. This allows automatic generation of the pod, listing just the
appropriate files for the version being compiled. It also allows for
the automatic check of all files to see that they are DOS 8.3 filesystem
compatible. And it allows for some code simplification.
Unicode specifies some .html files in the UCD. These are always skipped
(so far, and likely forever), and were in the global. Now they are in
the constructor, which means that the code that looks for potential
files that aren't being handled has to be changed to look for .html
files as well.
|
|
|
|
|
|
|
|
|
| |
Two files can have the same base name, but be different if they have
different suffixes. Until this commit, mktables thought they were the
same, because it ignored the suffix when calculating this. Some file
names are version strings like "3.1", which look like floating point
numbers. These are first converted to a form like "3_1" so that the .1
doesn't look like a suffix.
|
|
|
|
|
|
|
| |
This follows up the previous commit by actually using the new
infrastructure it created. The optional Unihan files are switched to
use the new capabilities. This means that the globals they previously
used are no longer necessary, and are ripped out here.
|