| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This makes all the tables in the lib/unicore/To directory that map from
code point to code point be formatted so that the mapped-to code point
is expressed as hexadecimal.
This allows for uniform treatment of these tables in utf8.c, and removes
the final use of strtol() in the (non-CPAN) core. strtol() should be
avoided because it is subject to locale rules, and some older libc
implementations have been buggy. It was used because Perl doesn't have
an efficient way of parsing a decimal number and advancing the parse
pointer to beyond it; we do have such a method for hex numbers.
The input to mktables published by Unicode is also in hex, so this now
conforms to that convention.
This also will facilitate the new work currently being done to read in
the tables that find the closing bracket given an opening one.
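
As a rough illustration of the "parse a number and advance the parse
pointer" idea described above, here is a minimal Python sketch. It is not
the code in utf8.c; the function names and the tab-separated line layout
are assumptions made only for this example.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def parse_hex(text, pos):
    """Parse hex digits starting at text[pos]; return (value, new_pos).

    The returned position points just past the last digit consumed, so the
    caller can continue parsing from there. No locale rules are involved.
    """
    start = pos
    value = 0
    while pos < len(text) and text[pos] in HEX_DIGITS:
        value = value * 16 + int(text[pos], 16)
        pos += 1
    if pos == start:
        raise ValueError(f"no hex digits at offset {start}")
    return value, pos


def parse_mapping_line(line):
    """Parse a hypothetical 'FROM<TAB>TO' line with both fields in hex."""
    source, pos = parse_hex(line, 0)
    if pos >= len(line) or line[pos] != "\t":
        raise ValueError("expected a tab after the first field")
    target, _ = parse_hex(line, pos + 1)
    return source, target
```

With both columns in hex, the same scanner serves every field of every
table, which is the uniformity the commit message refers to.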
|
|
|
|
|
|
|
|
|
|
| |
Since Unicode 3.2, all Unicode database source files (except,
unfortunately, the most important one, UnicodeData.txt) have identifying
information on their first line: their name and version number. This commit
checks that the version is the expected one. This should prevent the
database from being out-of-sync. Perl changes the names of some files
so that they are distinct on DOS filesystems, so we can't easily check
that the name in the file is the same as the name of the file.
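
A version check of this kind might look like the following Python sketch.
It assumes a first line of the form "# PropList-6.2.0.txt"; the regular
expression and function names are illustrative, not taken from mktables.

```python
import re

# Assumed header shape: "# <Name>-<major>.<minor>.<update>.txt"
HEADER_RE = re.compile(r"^#\s*(?P<name>.+?)-(?P<version>\d+\.\d+\.\d+)\.txt\s*$")


def header_version(first_line):
    """Return the version string found in a header line, or None."""
    match = HEADER_RE.match(first_line)
    return match.group("version") if match else None


def check_version(first_line, expected):
    """True only if the header names exactly the expected version."""
    return header_version(first_line) == expected
```

Only the version is compared; the name in the header is ignored, for the
reason given above about renamed files.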
|
|
|
|
|
| |
In commit a9c9e371c40cf388593577cf577494e91793f62a, I forgot to update
the Unicode version in the file that states it.
|
| |
|
|
|
|
|
|
| |
The code was confused about what certain variables signified, and raised
erroneous warnings at other times. These bugs did not show up until
compiling Unicode 6.3.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The Unicode Character Database consists of many files in various
different formats. mktables has a single routine that processes the
most common format type. Files with different formats are run through
filters to transform them into this format, so that almost all end up
being handled by this common function.
This commit adds a way of specifying the format for one of the other
format types, and then automatically generating the code to do the
transformation. This doesn't work if the file has lines that have
special cases, such as if there is a known typo in it; the current
scheme can be used for those.
Unfortunately, all but one of the candidate files in Unicode 6.2 aren't
suitable for this table-driven approach. But a second one is coming in
6.3, and I anticipate more in the future, as Unicode has tightened their
quality control significantly in recent releases.
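
The table-driven idea can be sketched in Python as below. Everything here
is hypothetical: the field names, the "range; property; value" common form,
and the use of a closure instead of generated code are assumptions made for
illustration, not the mechanism mktables actually uses.

```python
def make_filter(field_names, property_for):
    """Build a line transformer from a declarative description.

    field_names:  names of the semicolon-separated input fields, in order;
                  one of them must be 'range'.
    property_for: maps an input field name to the property name to emit.
    """
    def transform(line):
        body = line.split("#", 1)[0].strip()  # drop trailing comments
        if not body:
            return []
        fields = [field.strip() for field in body.split(";")]
        if len(fields) != len(field_names):
            raise ValueError(f"expected {len(field_names)} fields: {line!r}")
        record = dict(zip(field_names, fields))
        # Emit one common-form line per non-empty described field.
        return [
            f"{record['range']}; {prop}; {record[name]}"
            for name, prop in property_for.items()
            if record[name]
        ]
    return transform
```

A file with irregular lines, such as one containing a known typo, would not
fit a description like this, which matches the limitation noted above.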
|
|
|
|
|
|
|
| |
Unicode furnishes various files that Perl ignores. perluniprops lists
these, with a brief reason of what they are for and why they aren't used
by Perl. Two files weren't listed, and one had a typo in the name and
an inadequate description.
|
|
|
|
| |
This changes this to conform to changes in Unicode 6.2.
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
The output tables for mktables are now in the platform's native
character set. This means there is no change for ASCII platforms, but
is a change for EBCDIC ones.
Code that didn't realize there was a potential difference between EBCDIC
and non-EBCDIC platforms will now start to work; code that tried to do
the right thing under these circumstances will no longer work. Fixing
that comes in later commits.
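
To make "native character set" concrete, here is a small Python sketch
that re-keys a table from Unicode code points to EBCDIC ones. It assumes
code page 037 (Python's "cp037" codec) and assumes code points above 0xFF
are left unchanged; neither detail comes from this commit.

```python
def to_native(code_point, codec="cp037"):
    """Map a Unicode code point to its assumed native (EBCDIC) value."""
    if code_point > 0xFF:
        return code_point  # assumption: identical above the 8-bit range
    return chr(code_point).encode(codec)[0]


def translate_table(table, codec="cp037"):
    """Return a copy of {code_point: value} keyed by native code points."""
    return {to_native(cp, codec): value for cp, value in table.items()}
```

On an ASCII platform the equivalent mapping is the identity, which is why
nothing changes there.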
|
|
|
|
|
|
|
|
|
|
| |
This code is moved later in the process. This is in preparation for
mktables generating tables in the native character set. By moving it to
later, the translation to native has already been done, and special
coding need not be done.
This also caught 7 code points that were somehow omitted by the previous
logic.
|
|
|
|
|
| |
mktables omitted the equal sign from the generated pod for certain
properties that should match it.
|
| |
|
|
|
|
|
| |
One of these fixes is for a place where a real CTRL-X was specified instead
of $^X.
|
| |
|
| |
|
| |
|
|
|
|
|
| |
I was re-reading some code and got confused. This table matches just
the first character of a sequence that may or may not contain others.
|
|
|
|
|
|
|
|
|
|
|
|
| |
The recent change to random hash ordering caused some of the files
output by mktables to vary from run to run. Everything still worked.
However, one of the ways I debug mktables is to make a change, and then
compare the tables it generates with those from before the change. That
tells me the precise effect of the change. That no longer works if the
tables come out in random order from run to run.
This patch just sorts certain things so that the tables are output in
the same order each time.
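
The fix amounts to never relying on hash iteration order when writing
output. A minimal Python sketch of the same idea, with a made-up table
layout:

```python
def render_table(mapping):
    """Render 'key<TAB>value' lines in sorted key order.

    Sorting the keys makes the output identical from run to run, so two
    generated tables can be compared with a plain diff.
    """
    return "\n".join(f"{key}\t{mapping[key]}" for key in sorted(mapping))
```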
|
|
|
|
|
|
| |
mktables is changed to add two new tables, one that matches the first
character in a character name, and one that matches continuation
characters.
|
|
|
|
|
| |
These internal tables were only used in regen code, which has been
modified to not use them, so they can be removed.
|
|
|
|
| |
This will be used in a later commit.
|
|
|
|
|
|
|
|
|
| |
These files were included by Unicode for the first time in the final
release of version 6.2. They document proposals for encoding
Han characters in Unicode. As far as I can tell, they have no real use
except to people working on such proposals. They are considered part of
the Unicode Character Database, however, and should be mentioned in
perluniprops as data that Perl ignores from that database.
|
| |
|
|
|
|
|
|
| |
Unicode 6.2 has been officially released, and is delivered by this
commit. There are no substantive changes from the final 6.2 beta, which
this replaces.
|
|
|
|
|
|
|
| |
Commit 27d4fc33343f0dd4287f0e7b9e6b4ff67c5d8399 neglected to include a
change required for a few Unicode releases where the \X prepend property
is not empty. This does that, and suppresses a mktables warning for
Unicode releases prior to 6.2.
|
|
|
|
|
|
|
|
|
| |
Prior to this commit 98.4% of Unicode code points that went through \X
had to be looked up to see if they begin a grapheme cluster; then looked
up again to find that they didn't require special handling. This commit
refactors things so only one look-up is required for those 98.4%. It
changes the table generated by mktables to accomplish this, and hence
the name of it, and references to it are changed to correspond.
|
|
|
|
|
|
|
|
|
|
|
|
| |
These supposedly are the final data files for 6.2. Earlier changes
originally proposed for 6.2 have been deferred until a later release.
Thus there is no change in the general category of ASCII characters in
these files from what they were in 6.1 and earlier, unlike what had been
proposed.
Unlike the previous experimental beta, code is now in place in Perl to
handle the revised definition of \X in 6.2. The current working draft
of that definition is at http://unicode.org/draft/reports/tr29/tr29.html
|
|
|
|
|
|
|
|
|
|
|
| |
This changes code to be able to handle Unicode 6.2, while continuing to
handle all previous releases.
The major change was a new definition of \X, which adds a property to
its calculation. Unfortunately \X is hard-coded into regexec.c, and so
has to be revised whenever there is a change of this magnitude in Unicode,
which fortunately isn't all that often. I refactored the code in
mktables to make it easier next time there is a change like this one.
|
|
|
|
|
| |
Unicode 6.2 is changing some of these things; this re-ordering will make
that more convenient.
|
| |
|
|
|
|
|
| |
This adds comment symbols and redirects error messages to /dev/null for
things that are likely to fail.
|
|
|
|
|
| |
This reverts commit 5435c3759c4567a1bb51384f6641c04822ec6391.
A new beta has been released, and so we should use that instead.
|
|
|
|
|
|
| |
When a Range_List is emptied, there is a bug which causes a runtime
error when trying to refer to a non-existent element. This avoids that.
A future commit would have run afoul of this bug.
|
|
|
|
|
|
|
| |
Normally, mktables is called from the Makefile at the base level. But
during development, it may manually be called from the directory (and
hence that directory's Makefile). This patch causes it to rebuild if
that Makefile changes.
|
| |
|
| |
|
|
|
|
|
| |
Unicode has changed their definition of what should match \w (see
http://www.unicode.org/reports/tr18/). This follows that change.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This table consists of all characters that participate in any way in a
fold in the current Unicode version. regcomp.c currently uses the Cased
property as a proxy for these. This information is used to limit the
number of characters whose folds have to be dealt with in compiling
bracketed regex character classes. It turns out that Cased contains
more than 1300 more code points than actually do appear in folds, which
means potential extra work for compiling. Hence this patch allows that
work to be avoided.
There are a few characters in this new table that aren't in Cased, which
are potential bugs in the old way of doing things. In Unicode 6.1,
these are: U+02BC MODIFIER LETTER APOSTROPHE, U+0308 COMBINING
DIAERESIS, U+0313 COMBINING COMMA ABOVE, and U+0342 COMBINING GREEK
PERISPOMENI. I can't figure out how these might be currently causing a
bug, but this patch fixes any such.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Unicode 6.2 is proposing some changes that may very well break some
CPAN modules. The timing of this nicely coincides with Perl's being
early in the release cycle. This commit takes the current beta 6.2,
adds the proposed changes that aren't yet in it, and subtracts the
changes that would affect \X processing, as those turn out to have
errors, and may have to be rethought. Unicode has been notified of
these problems.
This commit is to gather data as to whether or not the proposed changes
cause us problems. These will be presented to Unicode to aid in their
final decision as to whether or not to go forward with the changes.
This commit should be reverted at some point, and the final 6.2 used
instead.
|
|
|
|
|
| |
This variable is no longer used, but the expression needs to be
evaluated anyway. The code is outdented.
|
|
|
|
|
|
| |
This is so that it can be run by a Unix shell command to rename the
files that Unicode furnishes to the ones that Perl expects (because of
DOS 8.3 filesystems).
|
|
|
|
|
| |
This operation is not commutative, so should fail if the operands are
swapped.
|
|
|
|
| |
This is useful under the -annotate option.
|
|
|
|
|
| |
Now that all the control characters have names, use them instead of the
generic, "Control", but retain that as a fall back just in case.
|
|
|
|
|
| |
The code later on used to be done only sometimes; now that it is
executed always, some of it can be done at initialization time.
|
|
|
|
|
|
|
|
|
| |
As a result of the Unicode 6.0 mistake of using "BELL" to refer to
a different code point, Perl has deprecated use of this name for 2 major
release cycles, while not fully implementing Unicode in the interim, to
allow any affected code to migrate to the new name.
This commit now switches to the new meaning of BELL.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Does the attached patch make sense? It lowers RAM and CPU usage by about 10%
on Linux, and 6% on FreeBSD.
Nicholas Clark
From fe46bd796c282f6a6e4793afaf847e04d3be3524 Mon Sep 17 00:00:00 2001
From: Nicholas Clark <nick@ccl4.org>
Date: Mon, 7 May 2012 09:58:13 +0200
Subject: [PATCH] In mktables, lazily compute the 'standard_form' for Ranges.
Instead of calculating the standard form up front, calculate it only when
needed and cache the result. There are 368676 non-special objects, but
the standard form is only requested for 22047 of them. For the systems I
tested on, this reduces RAM and CPU usage by about 10% on Linux, and 6% on
FreeBSD.
This is more significant than it may first seem, because mktables is the
largest RAM user of anything run during the build process, so this reduces
the build process peak RAM requirement.
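
The compute-on-demand-and-cache pattern the patch describes can be
sketched in Python as follows. The class layout, the standardize() rule
(lowercase, with spaces, underscores and hyphens removed) and the call
counter are assumptions for illustration; they are not the mktables code.

```python
CALLS = {"standardize": 0}  # instrumentation for the example only


def standardize(name):
    """Hypothetical normalization standing in for the real computation."""
    CALLS["standardize"] += 1
    return name.lower().translate(str.maketrans("", "", " _-"))


class Range:
    __slots__ = ("start", "end", "value", "_standard_form")

    def __init__(self, start, end, value):
        self.start = start
        self.end = end
        self.value = value
        self._standard_form = None  # not computed until first requested

    @property
    def standard_form(self):
        # Compute on first access, then reuse the cached result.
        if self._standard_form is None:
            self._standard_form = standardize(self.value)
        return self._standard_form
```

Objects whose standard form is never requested pay only for an empty slot,
which is where a saving would come from when most objects are never asked.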
|
|
|
|
| |
I think the 'for' is easier to understand.
|
|
|
|
|
| |
Sometimes in debugging, etc, it is useful to have these files; this adds
a single scalar to control if they get generated.
|