path: root/lib/unicore
Commit message | Author | Age | Files | Lines
* Change mktables output for some tables to use hex | Karl Williamson | 2013-10-16 | 1 | -5/+27
  This makes all the tables in the lib/unicore/To directory that map from code point to code point be formatted so that the mapped-to code point is expressed as hexadecimal. This allows for uniform treatment of these tables in utf8.c, and removes the final use of strtol() in the (non-CPAN) core. strtol() should be avoided because it is subject to locale rules, and some older libc implementations have been buggy. It was used because Perl doesn't have an efficient way of parsing a decimal number and advancing the parse pointer to beyond it; we do have such a method for hex numbers. The input to mktables published by Unicode is also in hex, so this now conforms to that convention. This also will facilitate the new work currently being done to read in the tables that find the closing bracket given an opening one.
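The "parse a hex number and advance the parse pointer" idea from the commit above can be sketched as follows. This is only an illustration in Python, not the code in utf8.c; the function name `parse_hex` and the tab-separated sample line are assumptions made for the example.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def parse_hex(text, pos=0):
    """Parse a run of hex digits starting at pos; return (value, new_pos)."""
    end = pos
    while end < len(text) and text[end] in HEX_DIGITS:
        end += 1
    if end == pos:
        raise ValueError(f"no hex digits at offset {pos}")
    return int(text[pos:end], 16), end


# Hypothetical table line: range start, range end, mapped-to code point.
line = "0041\t005A\t0061"
start, pos = parse_hex(line)
end, pos = parse_hex(line, pos + 1)      # skip the tab separator
mapped, pos = parse_hex(line, pos + 1)
```

Because every field is hex, one routine handles all three columns, and no locale-dependent conversion is involved.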
* mktables: Verify input files' version | Karl Williamson | 2013-10-04 | 1 | -0/+9
  Since Unicode 3.2, all Unicode database source files (except unfortunately, the most important one, UCD.txt), begin with an identifying line that gives their name and version number. This commit checks that the version is the expected one. This should prevent the database from being out-of-sync. Perl changes the names of some files so that they are distinct on DOS filesystems, so we can't easily check that the name in the file is the same as the name of the file.
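A minimal sketch of such a check, assuming the first line looks like `# PropList-6.3.0.txt` (the usual header in Unicode data files). The function name `check_version` is hypothetical and this is not mktables's Perl code. As in the commit, only the version is compared, not the file name.

```python
import re


def check_version(first_line, expected_version):
    """Return True if the header line names the expected Unicode version."""
    match = re.match(r"#\s*\S+?-(\d+\.\d+\.\d+)\.txt", first_line)
    if match is None:
        raise ValueError(f"no version found in header: {first_line!r}")
    return match.group(1) == expected_version
```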
* Bump Unicode version | Karl Williamson | 2013-10-04 | 1 | -1/+1
  In commit a9c9e371c40cf388593577cf577494e91793f62a, I forgot to update the Unicode version in the file that states it.
* Upgrade to Unicode 6.3 | Karl Williamson | 2013-10-03 | 49 | -425/+2157
* mktables: Fix logic with binary vs enum properties | Karl Williamson | 2013-10-03 | 1 | -9/+15
  The code was confused about what certain variables signified, and raised erroneous warnings at other times. These bugs did not show up until compiling Unicode 6.3.
* mktables: Do some table-driven code generation | Karl Williamson | 2013-10-03 | 1 | -30/+55
  The Unicode Character Database consists of many files in various different formats. mktables has a single routine that processes the most common format type. Files with different formats are run through filters to transform them into this format, so that almost all end up being handled by this common function. This commit adds a way of specifying the format for one of the other format types, and then automatically generating the code to do the transformation. This doesn't work if the file has lines that have special cases, such as if there is a known typo in it; the current scheme can be used for those. Unfortunately, all but one of the candidate files in Unicode 6.2 aren't suitable for this table-driven approach. But a second one is coming in 6.3, and I anticipate more in the future, as Unicode has tightened their quality control significantly in recent releases.
* perluniprops: Add correct ignored files docs | Karl Williamson | 2013-10-03 | 1 | -1/+3
  Unicode furnishes various files that Perl ignores. perluniprops lists these, with a brief explanation of what they are for and why they aren't used by Perl. Two files weren't listed, and one had a typo in the name and an inadequate description.
* lib/unicore/README.perl: Update | Karl Williamson | 2013-10-03 | 1 | -3/+3
  This updates the file to conform to changes in Unicode 6.2.
* convert mktables to parent.pm instead of base.pm | Ricardo Signes | 2013-09-12 | 1 | -4/+4
* mktables: Generate native code-point tables | Karl Williamson | 2013-08-29 | 1 | -35/+161
  The output tables for mktables are now in the platform's native character set. This means there is no change for ASCII platforms, but is a change for EBCDIC ones. Code that didn't realize there was a potential difference between EBCDIC and non-EBCDIC platforms will now start to work; code that tried to do the right thing under these circumstances will no longer work. Fixing that comes in later commits.
* mktables: Move table creation code | Karl Williamson | 2013-08-29 | 1 | -20/+12
  This code is moved later in the process. This is in preparation for mktables generating tables in the native character set. By moving it to later, the translation to native has already been done, and special coding need not be done. This also caught 7 code points that were omitted somehow in the previous logic.
* perluniprops: Add missing character to what's matched | Karl Williamson | 2013-08-14 | 1 | -3/+3
  mktables omitted the equal sign from the generated pod for certain properties that should match it.
* fix various podcheck nits | David Golden | 2013-05-23 | 1 | -7/+21
* mktables: Fix typos in comments | Karl Williamson | 2013-05-20 | 1 | -2/+2
  One of these fixes is for where a real CTRL-X was specified, instead of $^X.
* Add, fix comments | Karl Williamson | 2013-02-25 | 1 | -2/+2
* Document \s change for VT, commit 075b9d7d9a6d4473b240a047655e507c8baa6db3 | Karl Williamson | 2013-02-24 | 1 | -2/+2
* mktables, regexec.c: Comments, white-space; no code changes | Karl Williamson | 2012-12-16 | 1 | -2/+2
* Rename property involved in \X matching, for clarity | Karl Williamson | 2012-12-16 | 1 | -4/+4
  I was re-reading some code and got confused. This table matches just the first character of a sequence that may or may not contain others.
* mktables: Sort some outputs for repeatability | Karl Williamson | 2012-11-28 | 1 | -14/+21
  The recent change to random hash ordering caused some of the files output by mktables to vary from run to run. Everything still worked. However, one of the ways I debug mktables is to make a change, and then compare the tables it generates with those from before the change. That tells me the precise effect of the change. That no longer works if the tables come out in random order from run to run. This patch just sorts certain things so that the tables are output in the same order each time.
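The general technique, sorting keys before emitting so that output does not depend on container ordering, can be sketched like this. It is a Python illustration under assumed names (`emit_table`, a name-to-value mapping), not the Perl change itself.

```python
def emit_table(entries):
    """Render name -> value pairs in a stable, sorted order."""
    return "\n".join(
        f"{name}\t{value}" for name, value in sorted(entries.items())
    )


# Two mappings with the same content but different insertion order
# produce byte-identical output, so before/after diffs stay meaningful.
first = emit_table({"Lu": "0041", "Ll": "0061"})
second = emit_table({"Ll": "0061", "Lu": "0041"})
```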
* mktables: Create tables for charname beginning and follow-on | Karl Williamson | 2012-11-11 | 1 | -1/+18
  mktables is changed to add two new tables, one that matches the first character in a character name, and one that matches continuation characters.
* mktables: Don't generate no-longer needed tables | Karl Williamson | 2012-10-20 | 1 | -19/+0
  These internal tables were only used in regen code, which has been modified to not use them, so they can be removed.
* mktables: Add table for chars with multi-char fold | Karl Williamson | 2012-10-14 | 1 | -1/+9
  This will be used in a later commit.
* mktables: Mention USourceData in generated pod | Karl Williamson | 2012-09-27 | 1 | -0/+2
  These files were included by Unicode for the first time in the final version of its version 6.2. They document proposals for encoding Han characters in Unicode. As far as I can tell, they have no real use except to people working on such proposals. They are considered part of the Unicode Character Database, however, and should be mentioned in perluniprops as data that Perl ignores from that database.
* mktables: Nits in comments, generated pod | Karl Williamson | 2012-09-27 | 1 | -3/+3
* Use official Unicode 6.2 | Karl Williamson | 2012-09-26 | 3 | -97/+79
  Unicode 6.2 has been officially released, and is delivered by this commit. There are no substantive changes from the final 6.2 beta, which this replaces.
* Fix \X handling for Unicode 5.1 - 6.0 | Karl Williamson | 2012-09-13 | 1 | -1/+6
  Commit 27d4fc33343f0dd4287f0e7b9e6b4ff67c5d8399 neglected to include a change required for a few Unicode releases where the \X prepend property is not empty. This does that, and suppresses a mktables warning for Unicode releases prior to 6.2.
* Refactor \X regex handling to avoid a typical case table lookup | Karl Williamson | 2012-08-28 | 1 | -5/+8
  Prior to this commit 98.4% of Unicode code points that went through \X had to be looked up to see if they begin a grapheme cluster; then looked up again to find that they didn't require special handling. This commit refactors things so only one look-up is required for those 98.4%. To accomplish this, it changes the table generated by mktables, and hence its name; references to it are changed to correspond.
* Use new Unicode 6.2 beta | Karl Williamson | 2012-08-26 | 47 | -1351/+1974
  These supposedly are the final data files for 6.2. Earlier changes originally proposed for 6.2 have been deferred until a later release. Thus there is no change in the general category of ASCII characters in these files from what they were in 6.1 and earlier, unlike what had been proposed. Unlike the previous experimental beta, code is now in place in Perl to handle the revised definition of \X in 6.2. The current working draft of that definition is at http://unicode.org/draft/reports/tr29/tr29.html
* Prepare for Unicode 6.2 | Karl Williamson | 2012-08-26 | 1 | -18/+44
  This changes code to be able to handle Unicode 6.2, while continuing to handle all previous releases. The major change was a new definition of \X, which adds a property to its calculation. Unfortunately \X is hard-coded into regexec.c, and so has to be revised whenever there is a change of this magnitude in Unicode, which fortunately isn't all that often. I refactored the code in mktables to make it easier next time there is a change like this one.
* mktables: Re-order some code, change comments | Karl Williamson | 2012-08-26 | 1 | -27/+39
  Unicode 6.2 is changing some of these things; this re-ordering will make that more convenient.
* mktables: Correct generated table comment | Karl Williamson | 2012-08-26 | 1 | -1/+1
* lib/unicore/README.perl: Make usable as a shell script | Karl Williamson | 2012-08-26 | 1 | -19/+19
  This adds comment symbols and redirects error messages to /dev/null for things that are likely to fail.
* Revert "Experimentally Use Unicode 6.2 beta" | Karl Williamson | 2012-08-26 | 46 | -1833/+1350
  This reverts commit 5435c3759c4567a1bb51384f6641c04822ec6391. A new beta has been released, and so we should use that instead.
* mktables: Fix bug when deleting final range | Karl Williamson | 2012-08-25 | 1 | -4/+11
  When a Range_List is emptied, there is a bug which causes a runtime error when trying to refer to a non-existent element. This avoids that. A future commit would have run afoul of this bug.
* mktables: Rebuild if local Makefile has changed | Karl Williamson | 2012-08-11 | 1 | -1/+5
  Normally, mktables is called from the Makefile at the base level. But during development, it may be called manually from its own directory (and hence by that directory's Makefile). This patch causes it to rebuild if that Makefile changes.
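One common way to decide "rebuild if a dependency changed" is to compare modification times. The sketch below assumes that approach; the source does not say how mktables makes the decision, and `needs_rebuild` and its parameters are hypothetical names.

```python
import os


def needs_rebuild(output_path, input_paths):
    """Return True if the output is missing or older than any existing input."""
    if not os.path.exists(output_path):
        return True
    out_mtime = os.path.getmtime(output_path)
    return any(
        os.path.getmtime(path) > out_mtime
        for path in input_paths
        if os.path.exists(path)
    )
```

Adding the local Makefile to `input_paths` is then enough to make a change to it trigger regeneration.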
* mktables: Add comment to gen'd data file | Karl Williamson | 2012-08-02 | 1 | -0/+5
* mktables: grammar in comments | Karl Williamson | 2012-08-02 | 1 | -2/+2
* mktables: Change \w definition to match new Unicode's | Karl Williamson | 2012-07-26 | 1 | -0/+7
  Unicode has changed their definition of what should match \w (see http://www.unicode.org/reports/tr18/). This follows that change.
* mktables: Generate new table for foldable chars | Karl Williamson | 2012-07-24 | 1 | -1/+14
  This table consists of all characters that participate in any way in a fold in the current Unicode version. regcomp.c currently uses the Cased property as a proxy for these. This information is used to limit the number of characters whose folds have to be dealt with in compiling bracketed regex character classes. It turns out that Cased contains over 1300 more code points than actually do appear in folds, which means potential extra work for compiling. Hence this patch allows that work to be avoided. There are a few characters in this new table that aren't in Cased, which are potential bugs in the old way of doing things. In Unicode 6.1, these are: U+02BC MODIFIER LETTER APOSTROPHE, U+0308 COMBINING DIAERESIS, U+0313 COMBINING COMMA ABOVE, and U+0342 COMBINING GREEK PERISPOMENI. I can't figure out how these might be currently causing a bug, but this patch fixes any such.
* Experimentally Use Unicode 6.2 beta | Karl Williamson | 2012-06-08 | 46 | -1350/+1833
  Unicode 6.2 is proposing some changes that may very well break some CPAN modules. The timing of this nicely coincides with Perl's being early in the release cycle. This commit takes the current beta 6.2, adds the proposed changes that aren't yet in it, and subtracts the changes that would affect \X processing, as those turn out to have errors, and may have to be rethought. Unicode has been notified of these problems. This commit is to gather data as to whether or not the proposed changes cause us problems. These will be presented to Unicode to aid in their final decision as to whether or not to go forward with the changes. This commit should be reverted at some point, and the final 6.2 used instead.
* mktables: Remove unused $scalar | Karl Williamson | 2012-06-08 | 1 | -5/+5
  This variable is no longer used, but the expression needs to be evaluated anyway. The code is outdented.
* unicore/README.perl: Make text comments | Karl Williamson | 2012-06-08 | 1 | -82/+83
  This is so that it can be run by a Unix shell command to rename the files that Unicode furnishes to the ones that Perl expects (because of DOS 8.3 filesystems).
* mktables: Add error check for overloaded &= | Karl Williamson | 2012-06-08 | 1 | -0/+10
  This operation is not commutative, so should fail if the operands are swapped.
* mktables: Use control names under -annotate | Karl Williamson | 2012-06-08 | 1 | -2/+1
  Now that all the control characters have names, use them instead of the generic "Control", but retain that as a fallback just in case.
* mktables: Use initialize instead of a push | Karl Williamson | 2012-06-07 | 1 | -3/+1
  The code later on used to be done only sometimes; now that it is executed always, some of it can be done at initialization time.
* mktables: Convert to BELL meaning U+1F514 | Karl Williamson | 2012-06-02 | 1 | -3/+9
  As a result of the Unicode 6.0 mistake of using "BELL" to refer to a different code point, Perl has deprecated use of this name for 2 major release cycles, while not fully implementing Unicode in the interim, to allow any affected code to migrate to the new name. This commit now switches to the new meaning of BELL.
* mktables memory reduction | Nicholas Clark | 2012-06-02 | 1 | -7/+9
  Does the attached patch make sense? It lowers RAM and CPU usage by about 10% on Linux, and 6% on FreeBSD.

  Nicholas Clark

  From fe46bd796c282f6a6e4793afaf847e04d3be3524 Mon Sep 17 00:00:00 2001
  From: Nicholas Clark <nick@ccl4.org>
  Date: Mon, 7 May 2012 09:58:13 +0200
  Subject: [PATCH] In mktables, lazily compute the 'standard_form' for Ranges.

  Instead of calculating the standard form up front, calculate it only when needed and cache the result. There are 368676 non-special objects, but the standard form is only requested for 22047 of them. For the systems I tested on, this reduces RAM and CPU usage by about 10% on Linux, and 6% on FreeBSD. This is more significant than it may first seem, because mktables is the largest RAM user of anything run during the build process, so this reduces the build process peak RAM requirement.
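The compute-on-first-use-and-cache pattern described in this patch can be sketched as below. This is a Python illustration, not the Perl patch; the `standardize` helper and its normalization rule are assumptions made for the example, and its call counter exists only to show that the work is deferred.

```python
def standardize(name):
    """Hypothetical normalization: lowercase and drop spaces, '-' and '_'."""
    standardize.calls += 1  # count how often the expensive step really runs
    return "".join(ch for ch in name.lower() if ch not in " -_")


standardize.calls = 0


class Range:
    def __init__(self, start, end, value):
        self.start, self.end, self.value = start, end, value
        self._standard_form = None  # not computed until first requested

    @property
    def standard_form(self):
        # Compute on first access, then reuse the cached result.
        if self._standard_form is None:
            self._standard_form = standardize(self.value)
        return self._standard_form
```

Objects whose standard form is never requested pay neither the CPU cost of computing it nor the memory cost of storing it, which is where the saving reported in the patch comes from.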
* mktables: Use for loop instead of each | Karl Williamson | 2012-06-02 | 1 | -2/+1
  I think the 'for' is easier to understand.
* mktables: Allow easy generation of Unicode-deprecated files | Karl Williamson | 2012-06-02 | 1 | -0/+11
  Sometimes in debugging, etc, it is useful to have these files; this adds a single scalar to control if they get generated.