path: root/charclass_invlists.h
Commit message (Author, Age, Files, Lines)
* mktables: Fix up version compare (Karl Williamson, 2017-06-01, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | This is a feature that is used to compare 2 different Unicode versions for changes to existing code points, ignoring code points that aren't in the earlier version. It does this by removing the newly-added code points coming from the later version. One can then diff the generated directory structure against an existing one that was built under the old rules to see what changed. Prior to this commit, it assumed all version numbers were a single digit for the major number. This will no longer work for Unicode 10, about to be released. As part of the process, mktables adds blocks that didn't exist in the earlier version back to the unallocated pool. This gives better diff results. This commit does a better job of finding such blocks.
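The single-digit assumption described above can be illustrated with a minimal Python sketch. This is not mktables' actual Perl code, and the function name `parse_version` is hypothetical. Comparing version strings character by character ranks "10.0.0" below "9.0.0", whereas comparing tuples of integers does not:

```python
def parse_version(text):
    """Split a dotted version string such as '10.0.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Plain string comparison looks at characters, so '1' < '9' decides the result
# and Unicode 10 appears to be older than Unicode 9.
string_order_is_wrong = "10.0.0" < "9.0.0"

# Tuple comparison treats the whole major number as one integer.
tuple_order_is_right = parse_version("10.0.0") > parse_version("9.0.0")
```

Any scheme that reads only the first character of the major number fails in the same way once the major number reaches two digits.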
* Balance uniprops tests (Karl Williamson, 2017-02-19, 1 file, -1/+1)
| | | | | | | | | | | | Commit 5656b1f654bb034c561558968ed3cf87a737b3e1 split the tests generated by mktables so that 10 separate files each execute 10% of the tests. But it turns out that some tests are much more involved than others, so that some of those 10 files still took much longer than average. This commit changes the split so that the amount of time each file takes is more balanced. It uses a natural breaking spot for the tests for the \b{} flavors, except that GCB and SB are each short (so are combined into being tested from one file), and LB is very long, so is split into 4 test groups.
* Fix bug with a digit range under re 'strict' (Karl Williamson, 2017-01-19, 1 file, -1/+1)
| "use re 'strict'" is supposed to warn if the start and end points of a range are digits that aren't from the same group of 10, for example if you mix Bengali and Thai digits. It wasn't working properly for the 5 groups of mathematical digits starting at U+1D7CE. This commit fixes that, and refactors the code to bail out as soon as it discovers that no warning is warranted, instead of doing unnecessary work.
* regen/mk_invlists.pl: Create list of Assigned code points (Karl Williamson, 2016-12-23, 1 file, -1/+3850)
| This creates a read-only C array, to be compiled into the text segment of perl, containing an inversion list of the characters that are assigned in the current Unicode version. This will be used in a future commit. The difference listing is large because of defects in the diff algorithm.
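For readers unfamiliar with inversion lists, here is a minimal Python sketch of the general technique. It is an illustration only: the toy data and the name `in_inversion_list` are assumptions, and the exact layout of the arrays Perl generates (which may carry extra header elements) is not shown in this log. An inversion list is a sorted list of code points at which membership flips, so each even index starts a range that is in the set and each odd index starts a range that is not.

```python
from bisect import bisect_right


def in_inversion_list(invlist, code_point):
    """Return True if code_point falls in a range that is 'in' the set.

    bisect_right counts how many boundaries are <= code_point; an odd
    count means the most recent boundary opened an included range.
    """
    return bisect_right(invlist, code_point) % 2 == 1


# Toy list (an assumption for illustration): the ranges [0x30, 0x3A) and
# [0x41, 0x5B), i.e. ASCII digits and ASCII uppercase letters.
TOY = [0x30, 0x3A, 0x41, 0x5B]
```

A lookup is a binary search, so membership costs O(log n) in the number of range boundaries rather than in the number of code points.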
* Switch most open() calls to three-argument form. (John Lightsey, 2016-12-23, 1 file, -2/+2)
| | | | | | | | | | Switch from two-argument form. Filehandle cloning is still done with the two argument form for backward compatibility. Committer: Get all porting tests to pass. Increment some $VERSIONs. Run: ./perl -Ilib regen/mk_invlists.pl; ./perl -Ilib regen/regcharclass.pl For: RT #130122
* Fixup Unicode::UCD pod/version and regen dependent files (Matthew Horsfall, 2016-11-14, 1 file, -1/+1)
|
* Allow "." to be excluded from @INC. (Merijn Brand, 2016-11-11, 1 file, -1/+1)
|\ | | | | | | | | | | Build with -Ddefault_inc_excludes_dot to exclude . from @INC. The *current* default is set to be effectively no change. A future change will most likely revert the default to the safer exclusion of .
| * Regen from the "special" regen scripts (Aaron Crane, 2016-11-11, 1 file, -1/+1)
| | | | | | | | | | | | A few regen scripts aren't run by "make regen", either because they depend on an external tool, or they must be run by the Perl just built. So they must be run manually.
* | Move Unicode-Normalize to dist/ (Karl Williamson, 2016-11-11, 1 file, -1/+1)
|/ | | | | p5p has taken over the maintenance of this module, so it should be in dist/
* uniprops.t: split into 10 separate test files t/re/uniprops01.t etc (Yves Orton, 2016-10-19, 1 file, -1/+1)
| This way we can run them at the same time under parallel testing. As there are a lot of tests (140k or so), this makes a difference.
* Correct spelling errors: lib/unicore/mktables (Shlomi Fish, 2016-08-06, 1 file, -1/+1)
| | | | | | | | | All tests appear to pass. I hereby disclaim any explicit or implicit ownership of my changes and place them under the public-domain/CC0/X11L/same-terms-as-Perl/any-other-license-of-your-choice multi-license. Thanks to Vim's ":help spell" for helping a lot.
* Change \p{foo} to mean \p{scx: foo} (Karl Williamson, 2016-06-30, 1 file, -2/+2)
| | | | | | | when 'foo' is a script. Also update the pods correspondingly, and to encourage scx property use. See http://nntp.perl.org/group/perl.perl5.porters/237403
* perluniprops: Fix pod (Karl Williamson, 2016-06-25, 1 file, -1/+1)
| | | | | | | Commit 3d6c5fec8cb3579a30be60177e31058bc31285d7 changed mktables to change to slightly less nice pod in order to remove a warning that was a bug in Pod::Checker. Pod::Checker has now been fixed, and the current commit reinstates the old pod.
* Use Unicode 9.0 (Unicode Consortium, 2016-06-21, 1 file, -611/+4217)
| | | | | This includes regenerating the files that depend on the Unicode 9 data files
* Prepare for Unicode 9.0 (Karl Williamson, 2016-06-21, 1 file, -253/+318)
| The major code changes needed to support Unicode 9.0 are changes to the boundary (break) rules, for things like \b{lb} and \b{wb}. regen/mk_invlists.pl creates two-dimensional arrays for all these properties. To see if a given point in the target string is a break or not, regexec.c looks up the entry in the property's table whose row corresponds to the code point before the potential break, and whose column corresponds to the one after. Mostly this is completely determining, but for some cases extra context is required. The array entry indicates this, and there has to be specially crafted code in regexec.c to handle each such possibility. When a new release comes along, mk_invlists.pl has to be changed to handle any new or changed rules, and regexec.c has to be changed to handle any changes to the custom code.
|
| Unfortunately this is not a mature area of the Standard, and changes are fairly common in new releases. In part, this is because new types of code points come along, which need new rules. Sometimes it is because they realized the previous version didn't work as well as it could. An example of the latter is that Unicode now realizes that Regional Indicator (RI) characters come in pairs, and that one should be able to break between each pair, but not within a pair. Previous versions treated any run of them as unbreakable. (Regional Indicators are a fairly recent type that was added to the Standard in 6.0, and things are still getting shaken out.) The other main changes to these rules also involve a fairly new type of character, emojis. We can expect further changes to these in the next Unicode releases.
|
| \b{gcb}, for the first time, now depends on context (in rarely encountered cases, like RI's), so the function had to be changed from a simple table look-up to be more like the functions handling the other break properties.
|
| Some years ago I revamped mktables in part to try to make it require as few manual interventions as possible when upgrading to a new version of Unicode. For example, a new data file in a release requires telling mktables about it, but as long as it follows the format of existing recent files, nothing else need be done to get whatever properties it describes to be included. Some of the changes to mktables involved guessing, from existing limited data, what the underlying paradigm for that data was. The problem with that is there may not have been a paradigm, just something they did ad hoc, which can change at will; or I didn't understand their unstated thinking, and guessed wrong.
|
| Besides the boundary rule changes, the only change that the existing mktables couldn't cope with was the addition of the Tangut script, whose character names include the code point, like CJK UNIFIED IDEOGRAPH-3400 has always done. The paradigm for this wasn't clear, since CJK was the only script that had this characteristic, and so I hard-coded it into mktables. The way Tangut is structured may show that there is a paradigm emerging (but we only have two examples, and there may not be a paradigm at all), and so I have guessed one, and changed mktables to assume this guessed paradigm. If other scripts like this come along, and I have guessed correctly, mktables will cope with these automatically without manual intervention.
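The table-plus-context scheme described in this message can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the generated C tables or the regexec.c code: the class set `Cls`, the table contents, and the names `TABLE` and `is_break` are all hypothetical. The row is the class before the potential break, the column is the class after it, and one entry defers to hand-written context code for the Regional Indicator pairing rule mentioned above.

```python
from enum import IntEnum


class Cls(IntEnum):
    """A tiny, hypothetical set of break classes."""
    OTHER = 0
    CR = 1
    LF = 2
    RI = 3


NO_BREAK, BREAK, NEEDS_CONTEXT = 0, 1, 2

# Row = class before the potential break, column = class after it.
TABLE = [
    # after:  OTHER  CR     LF        RI
    [BREAK, BREAK, BREAK,    BREAK],          # before OTHER
    [BREAK, BREAK, NO_BREAK, BREAK],          # before CR: never split CR LF
    [BREAK, BREAK, BREAK,    BREAK],          # before LF
    [BREAK, BREAK, BREAK,    NEEDS_CONTEXT],  # before RI: RI RI needs context
]


def is_break(classes, pos):
    """Is there a break between classes[pos - 1] and classes[pos]?"""
    entry = TABLE[classes[pos - 1]][classes[pos]]
    if entry != NEEDS_CONTEXT:
        return entry == BREAK
    # Context case: count the run of RIs ending just before pos. An even
    # count means the preceding RIs form complete pairs, so a break is
    # allowed; an odd count means we are inside a pair.
    run = 0
    i = pos - 1
    while i >= 0 and classes[i] == Cls.RI:
        run += 1
        i -= 1
    return run % 2 == 0
```

With four consecutive RIs, this allows a break only in the middle, between the two pairs. Adding a rule means changing table entries, and only the context-dependent cases need custom code.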
* Tell mktables what Unicode version mk_invlists.pl handles (Karl Williamson, 2016-06-21, 1 file, -1/+1)
| | | | | | | | | | | | | | A downside of supporting the Unicode break properties like \b{gcb}, \b{lb} is that these aren't very mature in the Standard, and so code likely has to change when updating Perl to support a new version of the Standard. And the new rules may not be backwards compatible. This commit creates a mechanism to tell mktables the Unicode version that the rules are written for. If that is not the same version as being compiled, the test file marks any failing boundary tests as TODO, and outputs a warning if the compiled version is later than the code expects, to alert you to the fact that the code needs to be updated.
* t/re/uniprops.t: Add more description for \b{} tests (Karl Williamson, 2016-06-21, 1 file, -1/+1)
| | | | | | | | | mktables generates a file of tests used in t/re/uniprops.t. The tests furnished by Unicode for the boundaries like \b{gcb} have comments that indicate the rules each test is testing. These are useful in debugging. This commit changes things so the generated file that includes these Unicode-supplied tests also has the corresponding comments which are output as part of the test descriptions.
* Add an example of the '0x' string format. (Jarkko Hietaniemi, 2016-05-30, 1 file, -1/+1)
| | | | | | I am not certain that I find the 'leading zero means hex' format recommendable ('0123' meaning '0x123', the octal format has poisoned the well); but water under the bridge.
* Stop accepting deprecated NBSP in \N{} (Karl Williamson, 2016-05-09, 1 file, -1/+1)
| | | | As scheduled for 5.26, this construct will no longer be accepted.
* mktables: Don't destroy a data structure too soon. (Karl Williamson, 2016-05-09, 1 file, -1/+1)
| | | | | | | It can happen that one table depends on another table for its contents. This adds a crude mechanism to prevent the depended-upon table from being destroyed prematurely. So far this has only shown up during debugging, but it could have happened generally.
* mktables: Add info under -annotate option (Karl Williamson, 2016-05-09, 1 file, -1/+1)
| | | | This adds some helpful text when this option is used, which is for examining the Unicode database in great detail
* mktables: Add stack trace facility (Karl Williamson, 2016-05-09, 1 file, -1/+1)
| | | | This can be used for debugging.
* regen/mk_invlists.pl: Revamp so works on earlier Unicodes (Karl Williamson, 2016-03-17, 1 file, -97/+97)
| The code that generates the tables for the \b{foo} handling (in regexec.c) did not work correctly when compiled on an earlier Unicode. This fixes things up to do that, consolidating some common code into a common function and making the generated hdr file look nice, with the tables taking fewer columns of screen space.
* mktables: Use correct structure to look up data (Karl Williamson, 2016-03-17, 1 file, -1/+1)
| There are two types of tables in mktables: map tables, which map code points to the values a property has for those code points; and match tables, which are booleans that answer "does a code point match a given property value?". There are different data structures to encapsulate each. This code was using the wrong structure to look something up. Usually this failed, and a fall-back value was used instead. When compiling an early Unicode release, I discovered that there could be a conflict.
* mktables: Fix bug with early Unicode versions (Karl Williamson, 2016-03-17, 1 file, -1/+1)
| | | | | | | | | An array had 2 optional elements at the end. I got confused about handling them. This change first deals with the final one, pops it and saves it separately if found. Then only one optional element needs to be dealt with in the course of the code. This only gets executed for very early Unicode versions
* mktables: Unicode 1.5 only had 2**16 code points (Karl Williamson, 2016-03-17, 1 file, -1/+1)
| | | | Therefore, we shouldn't add any above that.
* t/re/uniprops.t: Remove wrong test cases (Karl Williamson, 2016-03-12, 1 file, -1/+1)
| | | | | | | | | | | | | | mktables generates the file used in this test. Unicode version 9 introduces a numeric value that is an order of magnitude closer to 0 than any previous version had. This demonstrated a bug in mktables, where it didn't consider the possibility of floating point numbers being indistinguishably close to integers. It did check for being too close to the rational numbers used in Unicode, but omitted checking for integers. This adds that check, which in turn causes some wrong test cases to not be generated for this .t. This bug has not shown up in earlier Unicode versions, but is there nonetheless, so I'm pushing this now instead of waiting.
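The missing check can be sketched as follows in Python. This is an assumption-laden illustration, not mktables' code: the function name `is_safely_distinct`, the tolerance, and the sample values are all hypothetical. A candidate "should not match" value is rejected if it is indistinguishably close either to one of the known rational values or to any integer.

```python
from fractions import Fraction


def is_safely_distinct(candidate, known_rationals, tolerance=1e-9):
    """True if candidate is clearly different from every known rational
    value and also from its nearest integer."""
    near_rational = any(
        abs(candidate - float(r)) < tolerance for r in known_rationals
    )
    # The integer check is the part the message says had been omitted.
    near_integer = abs(candidate - round(candidate)) < tolerance
    return not (near_rational or near_integer)
```

A value such as 1e-12 is far from 1/2 but indistinguishable from the integer 0, so without the integer check it would wrongly be accepted as a non-matching test value.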
* re/uniprops: Fix EBCDIC issue (Karl Williamson, 2016-02-03, 1 file, -1/+1)
| | | | Things like qr/\s/ are expecting native code points, not EBCDIC.
* Use table lookup for qr/\b{wb}/ (Karl Williamson, 2016-02-03, 1 file, -1/+37)
| | | | | | | | This follows the recent commits for lb and gcb, and generates a table at regen time for Word Breaking. The result may run faster, depending on the compiler optimization capabilities, than before, and is easier to maintain, as it's easier to smack a new rule into the regen perl script than it is to change the C code.
* regen/mk_invlists.pl: add braces round subobject initialisers (Aaron Crane, 2016-01-21, 1 file, -53/+53)
| | | | | | This suppresses many clang warnings saying "suggest braces around initialization of subobject" when the generated charclass_invlists.h is included.
* Use lookup table for /\b{gcb}/ instead of switch stmt (Karl Williamson, 2016-01-19, 1 file, -1/+20)
| | | | | | | | | | | | | | This changes the handling of Grapheme Cluster Breaks to be entirely via a lookup table generated by regen/mk_invlists.pl. This is easier to maintain and follow, as the generation of the table follows the text of Unicode's UAX29 precisely, and loops can be used to set every class up instead of having to name each explicitly, so it will be easier to add new rules. And the runtime switch statement is replaced by a single line. My gcc compiler optimized the previous version to an array lookup, but this commit does it for not so clever compilers.
* t/re/uniprops.t: Fix bug in diagnostic output (Karl Williamson, 2016-01-19, 1 file, -1/+1)
| An 'ord' was missing, so a warning was raised. This file is generated by mktables.
* Add qr/\b{lb}/ (Karl Williamson, 2016-01-19, 1 file, -2/+58)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds the final Unicode boundary type previously missing from core Perl: the LineBreak one. This feature is already available in the Unicode::LineBreak module, but I've been told that there are portability and some other issues with that module. What's added here is a light-weight version that is lacking the customizable features of the module. This implements the default Line Breaking algorithm, but with the customizations that Unicode is expecting everybody to add, as their test file tests for them. In other words, this passes Unicode's fairly extensive furnished tests, but wouldn't if it didn't include certain customizations specified by Unicode beyond the basic algorithm. The implementation uses a look-up table of the characters surrounding a boundary to see if it is a suitable place to break a line. In a few cases, context needs to be taken into account, so there is code in addition to the lookup table to handle those. This should meet the needs for line breaking of many applications, without having to load the module. The algorithm is somewhat independent of the Unicode version, just like the other boundary types. Only if new rules are added, or existing ones modified is there need to go in and change this code. Otherwise, running regen/mk_invlists.pl should be sufficient when a new Unicode release is done to keep it up-to-date, again like the other Unicode boundary types.
* Make tables for Perl-tailored Unicode Line_Break property (Karl Williamson, 2016-01-19, 1 file, -2/+13163)
| | | | | | | | | This is in preparation for adding qr/\b{lb}/. This just generates the tables, and is a separate commit because otherwise the diff listing is confusing, as it doesn't realize there are only additions. So, even though the difference listing for this commit for the generated header file is wildly crazy, the only changes in reality are the addition of some tables for Line Break.
* regen/mk_invlists.pl: Use property's real values (Karl Williamson, 2016-01-19, 1 file, -1/+1)
| | | | | | | A future commit will tailor a property to use fewer values than Unicode provides. Currently we look at the official property, and croak if not all the property values are there. This commit instead looks at the tailored property, the one that actually is being output.
* mktables: Add field to constructor (Karl Williamson, 2016-01-19, 1 file, -1/+1)
| | | | | This allows a default value to be specified, to prepare for a later commit.
* regen/mk_invlists.pl: Internal housekeeping (Karl Williamson, 2016-01-19, 1 file, -7/+7)
| | | | | | | | | This moves the name of a synthetic enum value to a better place in the code. The list it had been in is for a specific purpose that is not applicable to synthetic values, though it worked. But the new place is more logical, and can take advantage of the previous commit which makes things in this place more predictable.
* regen/mk_invlists.pl: Keep internal enum values last (Karl Williamson, 2016-01-19, 1 file, -118/+118)
| Most Unicode properties have a finite set of possible values. Most, for example, are binary: they can be either true or false, but nothing in between. Others have more possibilities (and still others, like Name, are not restricted at all). The Word Break property, for example, can take on a restricted set of values, currently 19 in all, that indicate what type, for purposes of word breaking, the character is. In implementing things like Word Break, Perl adds some internal-only values, like EDGE, which means matching like /^/ or /$/. By using these synthetic values, we don't need to have extra code for edge cases. These properties are implemented using C enums. Prior to this commit, the actual numeric values for each enum were mostly arbitrary, with the synthetic ones intermixed with the official ones. This commit changes that so the synthetic ones are all higher numbers than any official ones, and the order they appear in the generating code will be the numerical order they have, so that the program has control of their order.
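The numbering scheme this message describes can be sketched in Python. It is only an illustration of the ordering idea, not the Perl generator or the C enums it emits: the value names in `OFFICIAL` are stand-ins, `build_enum` is a hypothetical helper, and only EDGE is taken from the text above.

```python
from enum import IntEnum

# Stand-ins for values Unicode defines (an assumption for illustration).
OFFICIAL = ["Other", "CR", "LF", "Extend"]
# Perl-internal value named in the message; listed after the official ones.
SYNTHETIC = ["EDGE"]


def build_enum(name, official, synthetic):
    """Number official values first, then synthetic ones, in list order."""
    ordered = list(official) + list(synthetic)
    return IntEnum(name, {member: index for index, member in enumerate(ordered)})


WB = build_enum("WB", OFFICIAL, SYNTHETIC)
```

Because the numbers follow list order, every synthetic value is guaranteed to be higher than every official one, and reordering the lists in the generator is all that is needed to change the numbering.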
* re/uniprops: Fix to work on early Unicodes (Karl Williamson, 2016-01-14, 1 file, -1/+1)
| | | | | | The guts of this test are generated by mktables. Commit f1f6961f5a6fd77a3e3c36f242f1b72ce5dfe205 broke early Unicode versions handling.
* Unicode::UCD: Fix to work on very early Unicode versions (Karl Williamson, 2016-01-14, 1 file, -1/+1)
| | | | | Prior to this commit, it would not compile because 2 properties weren't defined in very early Unicodes.
* Don't generate EBCDIC POSIX-BC tables (Karl Williamson, 2016-01-14, 1 file, -25003/+1)
| This commit comments out the code that generates these tables. This is trivially reversible. We don't believe anyone is using Perl and POSIX-BC at this time, and this saves time during development when having to regenerate these tables, and makes the resulting tarball smaller. See thread beginning at http://nntp.perl.org/group/perl.perl5.porters/233663
* Tailor \b{wb} for Perl (Karl Williamson, 2016-01-08, 1 file, -26/+106)
| | | | | | | | | | | | The Unicode \b{wb} matches the boundary between space characters in a span of them. This is opposite of what \b does, and is counterintuitive to Perl expectations. This commit tailors \b{wb} to not split up spans of white space. I have submitted a request to Unicode to re-examine their algorithm, and this has been assigned to a subcommittee to look at, but the result won't be available until after 5.24 is done. In any event, Unicode encourages tailoring for local conditions.
* mktables: Add constructor parameter (Karl Williamson, 2016-01-08, 1 file, -1/+1)
| This new parameter adds a special case for handling tables that the perl interpreter relies on, when compiling a Unicode version earlier than the one in which Unicode defines the property. This will allow for tailoring the property to Perl's needs in the next commit.
* mktables: Fix /l testing in re/uniprops.t (Karl Williamson, 2016-01-06, 1 file, -1/+1)
| | | | The utf8 locale testing was not getting done.
* mktables: Free up some memory after final use (Karl Williamson, 2015-12-23, 1 file, -1/+1)
| This may be enough for some platforms that aren't able to compile the Unicode tables to work. But it's quite late in the process. The ultimate solution would be for the tables to all be compiled ahead of time. That is under consideration for the future.
* mktables: Add "$0:" to its first output (Karl Williamson, 2015-12-19, 1 file, -1/+1)
| | | | So in a make, it is abundantly clear where the messages are coming from
* regen charclass_invlists.h (Ricardo Signes, 2015-12-07, 1 file, -1/+1)
| This is needed because mktables changed. A porting test did not pick this up, and so probably should be made to.
* Extend UTF-EBCDIC to handle up to 2**64-1 (Karl Williamson, 2015-11-25, 1 file, -1/+1)
| This uses for UTF-EBCDIC essentially the same mechanism that Perl already uses for UTF-8 on ASCII platforms to extend it beyond what might be its natural maximum. That is, when the UTF-8 start byte is 0xFF, it adds a bunch more bytes to the character than it otherwise would, bringing it to a total of 14 for UTF-EBCDIC. This is enough to handle any code point that fits in a 64 bit word.
|
| The downside of this is that this extension is not compatible with previous perls for the range 2**30 up through the previous max, 2**31 - 1. A simple program could be written to convert files that were written out using an older perl so that they can be read with newer perls, and the perldelta says we will do this should anyone ask. However, I strongly suspect that the number of such files in existence is zero, as people in EBCDIC land don't seem to use Unicode much, and these are very large code points, which are associated with a portability warning every time they are output in some way.
|
| This extension brings UTF-EBCDIC to parity with UTF-8, so that both can cover a 64-bit word. It allows some removal of special cases for EBCDIC in core code and core tests. And it is a necessary step to handle Perl 6's NFG, which I'd like eventually to bring to Perl 5.
|
| This commit causes two implementations of a macro in utf8.h and utfebcdic.h to become the same, and both are moved to a single one in the portion of utf8.h common to both.
|
| To illustrate, the I8 for U+3FFFFFFF (2**30-1) is "\xFE\xBF\xBF\xBF\xBF\xBF\xBF" before and after this commit, but the I8 for the next code point, U+40000000, is now "\xFF\xA0\xA0\xA0\xA0\xA0\xA0\xA1\xA0\xA0\xA0\xA0\xA0\xA0", and before this commit it was "\xFF\xA0\xA0\xA0\xA0\xA0\xA0". The I8 for 2**64-1 (U+FFFFFFFFFFFFFFFF) is "\xFF\xAF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF", whereas before this commit it was unrepresentable.
|
| Commit 7c560c3beefbb9946463c9f7b946a13f02f319d8 said in its message that it was moving something that hadn't been needed on EBCDIC until the "next commit". That statement turned out to be wrong, overtaken by events, because I pushed that commit prematurely. This now is the commit it was referring to.
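The two 14-byte examples in this message are consistent with a simple layout: a 0xFF start byte followed by 13 continuation bytes, each of the form 0xA0 plus 5 payload bits. The Python sketch below is inferred from those examples alone and is not taken from Perl's source; the function name `i8_ff_form` is hypothetical. It produces only the I8 intermediate form, not the final UTF-EBCDIC bytes, and does not cover the shorter forms used for smaller code points.

```python
def i8_ff_form(code_point):
    """Build the 14-byte I8 sequence that starts with 0xFF.

    Layout inferred from the examples in the commit message: 13
    continuation bytes, most significant first, each 0xA0 | 5 bits.
    13 * 5 = 65 bits, enough for any 64-bit code point.
    """
    if not 0 <= code_point < 2**64:
        raise ValueError("code point must fit in 64 bits")
    continuation = [
        0xA0 | ((code_point >> (5 * i)) & 0x1F) for i in range(12, -1, -1)
    ]
    return bytes([0xFF] + continuation)
```

Calling `i8_ff_form(0x40000000)` and `i8_ff_form(2**64 - 1)` reproduces the two 14-byte sequences quoted in the message.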
* Get re/uniprops.t to pass on minitest (Karl Williamson, 2015-11-21, 1 file, -1/+1)
| Locale handling doesn't work without the POSIX module being able to load, so it doesn't work on minitest. Prior to this patch, the code checked for only one case of locale handling to skip when there was no POSIX, but there was a 2nd case it failed to detect.
* Various tests: use centralized locale detection (Karl Williamson, 2015-11-20, 1 file, -1/+1)
| | | | | | | These tests were using individually defined heuristics to decide whether to do locale testing or not. However t/loc_tools.pl provides functions that are more reliable and complete for determining this than the hand-rolled ones in these tests.