path: root/charclass_invlists.h
Commit message (Author, Age, Files, Lines)
* re/uniprops: Fix EBCDIC issue (Karl Williamson, 2016-02-03, 1 file, -1/+1)
| | | | Things like qr/\s/ are expecting native code points, not EBCDIC.
* Use table lookup for qr/\b{wb}/ (Karl Williamson, 2016-02-03, 1 file, -1/+37)
| | | | | | | | This follows the recent commits for lb and gcb, and generates a table at regen time for Word Breaking. The result may run faster, depending on the compiler optimization capabilities, than before, and is easier to maintain, as it's easier to smack a new rule into the regen perl script than it is to change the C code.
* regen/mk_invlists.pl: add braces round subobject initialisers (Aaron Crane, 2016-01-21, 1 file, -53/+53)
| | | | | | This suppresses many clang warnings saying "suggest braces around initialization of subobject" when the generated charclass_invlists.h is included.
* Use lookup table for /\b{gcb}/ instead of switch stmt (Karl Williamson, 2016-01-19, 1 file, -1/+20)
| | | | | | | | | | | | | | This changes the handling of Grapheme Cluster Breaks to be entirely via a lookup table generated by regen/mk_invlists.pl. This is easier to maintain and follow, as the generation of the table follows the text of Unicode's UAX29 precisely, and loops can be used to set every class up instead of having to name each explicitly, so it will be easier to add new rules. And the runtime switch statement is replaced by a single line. My gcc compiler optimized the previous version to an array lookup, but this commit does it for not so clever compilers.
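A minimal sketch of the switch-to-table idea described above. The class names, the tiny four-value enum, and the helper `is_gcb_break` are hypothetical; the real generated header covers far more classes and is produced by regen/mk_invlists.pl. The break decisions shown are only loosely modelled on UAX #29 (no break between CR and LF, no break before Extend).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily reduced set of Grapheme_Cluster_Break classes. */
typedef enum { GCB_Other, GCB_CR, GCB_LF, GCB_Extend, GCB_COUNT } gcb_class;

/* GCB_table[before][after] is true when a break is allowed between a
 * character of class `before` and a following one of class `after`.
 * Each row is its own braced subobject initialiser. */
static const bool GCB_table[GCB_COUNT][GCB_COUNT] = {
    /*              Other  CR     LF     Extend */
    /* Other  */ {  true,  true,  true,  false },
    /* CR     */ {  true,  true,  false, true  },
    /* LF     */ {  true,  true,  true,  true  },
    /* Extend */ {  true,  true,  true,  false },
};

/* The whole runtime decision is a single array lookup, not a switch. */
static bool is_gcb_break(gcb_class before, gcb_class after)
{
    assert(before < GCB_COUNT && after < GCB_COUNT);
    return GCB_table[before][after];
}
```

Adding a rule under this scheme means changing the generator that fills the table, while the lookup function stays the same.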
* t/re/uniprops.t: Fix bug in diagnostic output (Karl Williamson, 2016-01-19, 1 file, -1/+1)
| | | | | An 'ord' was missing, so a warning was raised. This file is generated by mktables.
* Add qr/\b{lb}/ (Karl Williamson, 2016-01-19, 1 file, -2/+58)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds the final Unicode boundary type previously missing from core Perl: the LineBreak one. This feature is already available in the Unicode::LineBreak module, but I've been told that there are portability and some other issues with that module. What's added here is a light-weight version that is lacking the customizable features of the module. This implements the default Line Breaking algorithm, but with the customizations that Unicode is expecting everybody to add, as their test file tests for them. In other words, this passes Unicode's fairly extensive furnished tests, but wouldn't if it didn't include certain customizations specified by Unicode beyond the basic algorithm. The implementation uses a look-up table of the characters surrounding a boundary to see if it is a suitable place to break a line. In a few cases, context needs to be taken into account, so there is code in addition to the lookup table to handle those. This should meet the needs for line breaking of many applications, without having to load the module. The algorithm is somewhat independent of the Unicode version, just like the other boundary types. Only if new rules are added, or existing ones modified is there need to go in and change this code. Otherwise, running regen/mk_invlists.pl should be sufficient when a new Unicode release is done to keep it up-to-date, again like the other Unicode boundary types.
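The combination of a pair lookup table with extra code for context-dependent cases might look roughly like the sketch below. Everything here is an assumption made for illustration: the four classes, the `lb_action` values, and the function `lb_break_before` are hypothetical and are not taken from the Perl source. The rules are only loosely modelled on UAX #14 (no break before a space, break after a run of spaces unless the run follows opening punctuation).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduced set of Line_Break classes. */
typedef enum { LB_AL, LB_SP, LB_OP, LB_ZW, LB_COUNT } lb_class;

/* What the table says about the boundary between two classes. */
typedef enum { LB_NOBREAK, LB_BREAKABLE, LB_SP_CONTEXT } lb_action;

/* LB_table[before][after]; LB_SP_CONTEXT means "look further back". */
static const lb_action LB_table[LB_COUNT][LB_COUNT] = {
    /*           AL             SP          OP             ZW */
    /* AL */ { LB_NOBREAK,    LB_NOBREAK, LB_NOBREAK,    LB_NOBREAK },
    /* SP */ { LB_SP_CONTEXT, LB_NOBREAK, LB_SP_CONTEXT, LB_NOBREAK },
    /* OP */ { LB_NOBREAK,    LB_NOBREAK, LB_NOBREAK,    LB_NOBREAK },
    /* ZW */ { LB_BREAKABLE,  LB_NOBREAK, LB_BREAKABLE,  LB_NOBREAK },
};

/* Returns true if a line may be broken just before classes[pos]. */
static bool lb_break_before(const lb_class *classes, size_t len, size_t pos)
{
    assert(pos > 0 && pos < len);
    lb_action action = LB_table[classes[pos - 1]][classes[pos]];
    if (action != LB_SP_CONTEXT)
        return action == LB_BREAKABLE;

    /* Context case: skip back over the run of spaces and inspect
     * whatever precedes it. */
    size_t i = pos - 1;
    while (i > 0 && classes[i] == LB_SP)
        i--;
    return classes[i] != LB_OP;
}
```

Most boundaries are settled by the table alone; only the entries marked `LB_SP_CONTEXT` fall through to hand-written code.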
* Make tables for Perl-tailored Unicode Line_Break property (Karl Williamson, 2016-01-19, 1 file, -2/+13163)
| | | | | | | | | This is in preparation for adding qr/\b{lb}/. This just generates the tables, and is a separate commit because otherwise the diff listing is confusing, as it doesn't realize there are only additions. So, even though the difference listing for this commit for the generated header file is wildly crazy, the only changes in reality are the addition of some tables for Line Break.
* regen/mk_invlists.pl: Use property's real values (Karl Williamson, 2016-01-19, 1 file, -1/+1)
| | | | | | | A future commit will tailor a property to use fewer values than Unicode provides. Currently we look at the official property, and croak if not all the property values are there. This commit instead looks at the tailored property, the one that actually is being output.
* mktables: Add field to constructor (Karl Williamson, 2016-01-19, 1 file, -1/+1)
| | | | | This allows a default value to be specified, to prepare for a later commit.
* regen/mk_invlists.pl: Internal housekeeping (Karl Williamson, 2016-01-19, 1 file, -7/+7)
| | | | | | | | | This moves the name of a synthetic enum value to a better place in the code. The list it had been in is for a specific purpose that is not applicable to synthetic values, though it worked. But the new place is more logical, and can take advantage of the previous commit which makes things in this place more predictable.
* regen/mk_invlists.pl: Keep internal enum values last (Karl Williamson, 2016-01-19, 1 file, -118/+118)
| | | | | | | | | | | | | | | | | | | | | Most Unicode properties have a finite set of possible values. Most, for example, are binary: they can be either true or false, but nothing in between. Others have more possibilities (and still others, like Name, are not restricted at all). The Word Break property, for example, can take on a restricted set of values, currently 19 in all, that indicate what type, for purposes of word breaking, the character is. In implementing things like Word Break, Perl adds some internal-only values, like EDGE, which means matching like /^/ or /$/. By using these synthetic values, we don't need to have extra code for edge cases. These properties are implemented using C enums. Prior to this commit, the actual numeric value of each enum was mostly arbitrary, with the synthetic ones intermixed with the official ones. This commit changes that so the synthetic ones are all higher numbers than any official ones, and the order they appear in the generating code will be the numerical order they have, so that the program has control of their order.
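As an illustration of the ordering described above, here is a small sketch of an enum whose synthetic values sort after all official ones. The identifiers (`wb_class`, `WB_FIRST_SYNTHETIC`, `wb_is_synthetic`) and the three-value subset of official classes are hypothetical; only the EDGE concept comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    /* Official Unicode Word_Break values (hypothetical subset). */
    WB_Other,
    WB_ALetter,
    WB_Numeric,
    /* Synthetic, Perl-internal values: always numbered after the
     * official ones, in the order the generator lists them. */
    WB_EDGE,
    WB_ENUM_COUNT
} wb_class;

/* First synthetic value; everything at or above it is internal-only. */
#define WB_FIRST_SYNTHETIC WB_EDGE

static bool wb_is_synthetic(wb_class c)
{
    assert(c < WB_ENUM_COUNT);
    return c >= WB_FIRST_SYNTHETIC;
}
```

One consequence of this layout, under the assumptions above, is that a single comparison separates official values from internal ones, and `WB_ENUM_COUNT` can size any table indexed by the enum.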
* re/uniprops: Fix to work on early Unicodes (Karl Williamson, 2016-01-14, 1 file, -1/+1)
| | | | | | The guts of this test are generated by mktables. Commit f1f6961f5a6fd77a3e3c36f242f1b72ce5dfe205 broke early Unicode versions handling.
* Unicode::UCD: Fix to work on very early Unicode versions (Karl Williamson, 2016-01-14, 1 file, -1/+1)
| | | | | Prior to this commit, it would not compile because 2 properties weren't defined in very early Unicodes.
* Don't generate EBCDIC POSIX-BC tables (Karl Williamson, 2016-01-14, 1 file, -25003/+1)
| | | | | | | | | | | This commit comments out the code that generates these tables. This is trivially reversible. We don't believe anyone is using Perl and POSIX-BC at this time, and this saves time during development when having to regenerate these tables, and makes the resulting tar ball smaller. See thread beginning at http://nntp.perl.org/group/perl.perl5.porters/233663
* Tailor \b{wb} for Perl (Karl Williamson, 2016-01-08, 1 file, -26/+106)
| | | | | | | | | | | | The Unicode \b{wb} matches the boundary between space characters in a span of them. This is opposite of what \b does, and is counterintuitive to Perl expectations. This commit tailors \b{wb} to not split up spans of white space. I have submitted a request to Unicode to re-examine their algorithm, and this has been assigned to a subcommittee to look at, but the result won't be available until after 5.24 is done. In any event, Unicode encourages tailoring for local conditions.
* mktables: Add constructor parameter (Karl Williamson, 2016-01-08, 1 file, -1/+1)
| | | | | | | This new parameter adds a special case for handling tables that the perl interpreter relies on when compiling a Unicode version from before the property was defined by Unicode. This will allow for tailoring the property to Perl's needs in the next commit.
* mktables: Fix /l testing in re/uniprops.t (Karl Williamson, 2016-01-06, 1 file, -1/+1)
| | | | The utf8 locale testing was not getting done.
* mktables: Free up some memory after final use (Karl Williamson, 2015-12-23, 1 file, -1/+1)
| | | | | | | This may be enough for some platforms that aren't able to compile the Unicode tables to work. But it's quite late in the process. The ultimate solution would be for the tables to all be compiled ahead of time. That is under consideration for the future.
* mktables: Add "$0:" to its first output (Karl Williamson, 2015-12-19, 1 file, -1/+1)
| | | | So in a make, it is abundantly clear where the messages are coming from
* regen charclass_invlists.h (Ricardo Signes, 2015-12-07, 1 file, -1/+1)
| | | | | This is needed because mktables changed. A porting test did not pick this up, and so probably should be made to.
* Extend UTF-EBCDIC to handle up to 2**64-1 (Karl Williamson, 2015-11-25, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This uses for UTF-EBCDIC essentially the same mechanism that Perl already uses for UTF-8 on ASCII platforms to extend it beyond what might be its natural maximum. That is, when the UTF-8 start byte is 0xFF, it adds a bunch more bytes to the character than it otherwise would, bringing it to a total of 14 for UTF-EBCDIC. This is enough to handle any code point that fits in a 64 bit word. The downside of this is that this extension is not compatible with previous perls for the range 2**30 up through the previous max, 2**30 - 1. A simple program could be written to convert files that were written out using an older perl so that they can be read with newer perls, and the perldelta says we will do this should anyone ask. However, I strongly suspect that the number of such files in existence is zero, as people in EBCDIC land don't seem to use Unicode much, and these are very large code points, which are associated with a portability warning every time they are output in some way. This extension brings UTF-EBCDIC to parity with UTF-8, so that both can cover a 64-bit word. It allows some removal of special cases for EBCDIC in core code and core tests. And it is a necessary step to handle Perl 6's NFG, which I'd like eventually to bring to Perl 5. This commit causes two implementations of a macro in utf8.h and utfebcdic.h to become the same, and both are moved to a single one in the portion of utf8.h common to both. To illustrate, the I8 for U+3FFFFFFF (2**30-1) is "\xFE\xBF\xBF\xBF\xBF\xBF\xBF" before and after this commit, but the I8 for the next code point, U+40000000 is now "\xFF\xA0\xA0\xA0\xA0\xA0\xA0\xA1\xA0\xA0\xA0\xA0\xA0\xA0", and before this commit it was "\xFF\xA0\xA0\xA0\xA0\xA0\xA0". The I8 for 2**64-1 (U+FFFFFFFFFFFFFFFF) is "\xFF\xAF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF", whereas before this commit it was unrepresentable. 
Commit 7c560c3beefbb9946463c9f7b946a13f02f319d8 said in its message that it was moving something that hadn't been needed on EBCDIC until the "next commit". That statement turned out to be wrong, overtaken by events. This now is the commit it was referring to. commit I prematurely pushed that
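The byte strings quoted in the commit message above are consistent with a simple layout for the extended form: a 0xFF start byte followed by 13 continuation bytes, each carrying 5 payload bits under a 0xA0 marker. The sketch below encodes only that extended 14-byte I8 form. The function name `i8_encode_extended` is hypothetical, the layout is inferred from the quoted examples rather than taken from utfebcdic.h, and the final mapping from I8 to actual UTF-EBCDIC bytes is not shown.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define I8_EXTENDED_LEN 14

/* Writes the extended I8 form of cp into out: 0xFF, then 13
 * continuation bytes of 5 payload bits each, most significant first.
 * Intended only for code points that need the extended form. */
static void i8_encode_extended(uint64_t cp,
                               unsigned char out[I8_EXTENDED_LEN])
{
    out[0] = 0xFF;
    /* Fill from the last byte backwards, 5 bits at a time. */
    for (int i = I8_EXTENDED_LEN - 1; i >= 1; i--) {
        out[i] = (unsigned char)(0xA0 | (cp & 0x1F));
        cp >>= 5;
    }
    /* 13 * 5 = 65 bits of room, so a 64-bit value is fully consumed. */
    assert(cp == 0);
}
```

Comparing the output for U+40000000 and for 2**64-1 with `memcmp` against the strings quoted in the message is a quick way to check the inferred layout.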
* Get re/uniprops.t to pass on minitest (Karl Williamson, 2015-11-21, 1 file, -1/+1)
| | | | | | | Locale handling doesn't work without the POSIX module being able to load, so it doesn't work on minitest. Prior to this patch, the code checked for only one case of locale handling to skip when there was no POSIX, but there was a 2nd case it failed to detect.
* Various tests: use centralized locale detection (Karl Williamson, 2015-11-20, 1 file, -1/+1)
| | | | | | | These tests were using individually defined heuristics to decide whether to do locale testing or not. However t/loc_tools.pl provides functions that are more reliable and complete for determining this than the hand-rolled ones in these tests.
* Re-run two regen/ programs to clear up test failures in t/porting/regen.t (James E Keenan, 2015-10-20, 1 file, -1/+1)
| | | | | ./perl -Ilib regen/regcharclass.pl ./perl -Ilib regen/mk_invlists.pl
* mktables: Improve .t diagnostic message (Karl Williamson, 2015-10-19, 1 file, -1/+1)
| | | | | Through an oversight, the text that was supposed to be printed as the name for a test was just getting output as a 1 or 0.
* mktables: Update comments (Karl Williamson, 2015-10-19, 1 file, -1/+1)
| | | | | This function's capabilities have expanded beyond its original use, but the descriptive comments hadn't until now.
* PATCH [perl #120790] Unicode::UCD failure to warn on bad input (Karl Williamson, 2015-09-14, 1 file, -1/+1)
| | | | | | | | | | This ticket was originally raised because the requester did not realize the function Unicode::UCD::charscript took a code point argument instead of a chr one. It was rejected on that basis. But discussion here suggested it would be better to warn on bad input instead of just returning <undef>. It turns out that all other routines in Unicode::UCD but charscript and charblock already do warn. This commit extends that to the two outlier routines.
* pods: Discourage use of 'In' prefix for Unicode Block property (Karl Williamson, 2015-09-11, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | This changes perluniprops to not list the equivalent 'In' single form method of specifying the Block property, and to discourage its use. The reason is that this is a Perl extension, the use of which is unstable. A future Unicode release could take over the 'In...' name for a new purpose, and perl would follow along, breaking the code that assumed the former meaning. Unicode does not know about this Perl extension, and they wouldn't care if they did know. The reason I'm doing this now is that the latest Unicode version introduced some properties whose names begin with 'In', though no conflicts arose. But it is clear that such conflicts could arise in the future. So the documentation only is changed to warn people of this potential. perlunicode is updated accordingly.
* mktables: Fix --annotate option output (Karl Williamson, 2015-09-08, 1 file, -1/+1)
| | | | | | | | Special code suppressed the expanded output of some ranges, where it would be clear from the range itself what was meant. However, for many output tables, that range output was changed, so the desired information is missing. For these tables, don't suppress the expanded output.
* mktables: Comment changes only (Karl Williamson, 2015-08-20, 1 file, -1/+1)
|
* mktables: Move file handling to non-exceptional order (Karl Williamson, 2015-08-20, 1 file, -1/+1)
| | | | | The DAge.txt property until the previous commit had to be handled out-of-the-normal order. This is no longer required.
* mktables: Revamp the compare versions functionality (Karl Williamson, 2015-08-20, 1 file, -1/+1)
| | | | | | | | | | | | | | | This functionality is rarely used, but enables someone to see what Unicode has changed between releases X and Y, without the clutter of the things that are added after X came out. In other words it compiles release X using Y's rules. To use it, you must go in and edit mktables to specify to use this; so it is intended only for a developer who wants to look at Unicode history. One use I've done is to look at the beta version of a new release to compare with the previous official one. This allows me to find typos, and unintentional changes and report them back to Unicode. This commit significantly overhauls this feature, giving better results than before.
* mktables: Fix so -annotate works on early Unicodes (Karl Williamson, 2015-08-20, 1 file, -1/+1)
| | | | | | There were several glitches when compiling very early Unicode releases. This commit changes things so the age property reference is stored in a global, and doesn't have to be refound multiple times.
* mktables: Move code to common functions (Karl Williamson, 2015-08-20, 1 file, -1/+1)
| | | | | | This takes two code sections and moves them to a function each. For one, this is in preparation for being used in a 2nd place. For the other, call the code in existing other places.
* mktables: Fix up property calc for early Unicodes (Karl Williamson, 2015-08-20, 1 file, -1/+1)
| | | | | | | The Default_Ignorable_Code_Point property is applicable to unassigned code points, so shouldn't restrict our calculated value to assigned. (We calculate what the property would be when run on Unicode releases that haven't defined it yet.)
* mktables: Use mnemonic instead of hex constant (Karl Williamson, 2015-08-20, 1 file, -1/+1)
| | | | | These constants are used in more than one place. Use a common variable instead of repeating the hex numbers
* mktables: Add code point ages to --annotate option (Karl Williamson, 2015-08-18, 1 file, -1/+1)
| | | | This can be useful information.
* mktables: Minimize use of version numbers for decisions (Karl Williamson, 2015-07-28, 1 file, -1/+1)
| | | | | | | I found these places where file existence can be used instead of knowing what version something happened in. Sometimes those numbers are wrong, and one of these was. If it can be avoided, better not to use version numbers
* mktables: improve LineBreak table for early Unicodes (Karl Williamson, 2015-07-28, 1 file, -1/+1)
| | | | | | It turns out that the generated map files look better (even if functionally equivalent) if the default mapping is the one to the above-Unicode code points. This was the only one that had it different.
* mktables: Make sure \p{Space} works in all Unicodes (Karl Williamson, 2015-07-28, 1 file, -1/+1)
| | | | | This isn't defined by Unicode until a later version, but Perl wants it in all versions.
* mktables: Fix up Name_Alias in early Unicodes (Karl Williamson, 2015-07-28, 1 file, -2/+2)
| | | | | | | | | | | perl needs the Name_Alias property accessible in all releases in order for charnames to work properly. However the property was not created until Unicode version 5.0. Previously, the property was made available to all Unicode versions, which is contrary to the policy of exposing properties to public use only when Unicode so exposes them. Thus the behavior is as close as possible to Unicode-specified. This commit creates an internal-only property for the perl core, and removes the general access on early Unicode releases.
* mktables: Add handling of WB and SB for early Unicodes (Karl Williamson, 2015-07-28, 1 file, -296/+296)
| | | | | | | This allows \b{wb} and \b{sb} to work on all Unicode releases. The huge number of differences in charclass_invlists.h is only because the names of the SB and WB tables change, and the code automatically re-alphabetizes things.
* mktables: Fix GCB to work on early Unicodes (Karl Williamson, 2015-07-28, 1 file, -1604/+1620)
| | | | | | | | The GCB property was not properly being generated in early Unicode releases. The huge commit diff is due solely to the fact that the name of this property changes, so that it is sure not to be accessible outside the perl core, and the property tables are automatically re-sorted alphabetically.
* mktables: Convert Hangul_Syllable_Type to use new early infrastructure (Karl Williamson, 2015-07-28, 1 file, -1/+1)
| | | | This allows us to remove some special handling.
* mktables: Convert PropValueAliases.txt to use new early infrastructure (Karl Williamson, 2015-07-28, 1 file, -1/+1)
| | | | | | This file is crucial to compiling perl these days. This commit converts to use the new infrastructure for dealing with compiling Unicode releases prior to when this file was made available.
* mktables: Convert PropertyAliases.txt to use new early infrastructure (Karl Williamson, 2015-07-28, 1 file, -1/+1)
| | | | | | This file is crucial to compiling perl these days. This commit converts to use the new infrastructure for dealing with compiling Unicode releases prior to when this file was made available.
* mktables: Add code for easier handling of early Unicode versions (Karl Williamson, 2015-07-28, 1 file, -1/+1)
| | | | | | | | | | | This adds infrastructure to the constructor of the Input_file class to allow an alternative to be specified when compiling a Unicode release that is earlier than the one in which the file first became available. This is only used when the property is used by core perl and has to work in all releases. For example, the qr/\X/ construct should always work, but relies on a property that isn't specified before Unicode 4.1. This allows for easier specification of how to handle this type of case.
* mktables: Use Input_file class for always skipped files (Karl Williamson, 2015-07-28, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | Until this commit there were two mechanisms available to specify files in the Unicode Character Database that are not used by mktables. Now there is one. The global that contained such files is deleted, and instead all such files are specified by an Input_file class object. This has the advantage of just one method, and the constructor already has parameters to specify when a file first appeared, and when it was removed. This allows automatic generation of the pod, listing just the appropriate files for the version being compiled. It also allows for the automatic check of all files to see that they are DOS 8.3 filesystem compatible. And it allows for some code simplification. Unicode specifies some .html files in the UCD. These are always skipped (so far, and likely forever), and were in the global. Now they are in the constructor, which means that the code that looks for potential files that aren't being handled has to be changed to look for .html files as well.
* mktables: For 8.3 filesystems, the suffix matters (Karl Williamson, 2015-07-28, 1 file, -1/+1)
| | | | | | | | | Two files can have the same file name, but be different if they have different suffixes. Until this commit, mktables thought they were the same, because it ignored the suffix when calculating this. Some file names are version strings like "3.1", which look like a floating point number. These are first converted to something like "3_1" so that the .1 doesn't look like a suffix.
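The 8.3 check described above could be sketched as follows. This is not the mktables code (which is Perl); the helper names `looks_like_version`, `key_8_3` and `collide_8_3` are hypothetical, and the truncate-to-8-plus-3 rule and the lowercasing are assumptions about how a DOS 8.3 filesystem would treat the names.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True for names such as "3.1": digits, a single dot, digits. */
static bool looks_like_version(const char *name)
{
    const char *p = name;
    if (!isdigit((unsigned char)*p)) return false;
    while (isdigit((unsigned char)*p)) p++;
    if (*p != '.') return false;
    p++;
    if (!isdigit((unsigned char)*p)) return false;
    while (isdigit((unsigned char)*p)) p++;
    return *p == '\0';
}

/* Builds the key "base.ext": base cut to 8 chars, suffix cut to 3,
 * lowercased. out must hold 8 + 1 + 3 + 1 = 13 bytes. */
static void key_8_3(const char *name, char out[13])
{
    char copy[256];
    assert(strlen(name) < sizeof copy);
    strcpy(copy, name);

    /* "3.1" becomes "3_1" so the ".1" is not mistaken for a suffix. */
    if (looks_like_version(copy))
        *strchr(copy, '.') = '_';

    const char *suffix = "";
    char *dot = strrchr(copy, '.');
    if (dot) {
        *dot = '\0';
        suffix = dot + 1;
    }
    snprintf(out, 13, "%.8s.%.3s", copy, suffix);
    for (char *p = out; *p; p++)
        *p = (char)tolower((unsigned char)*p);
}

/* Two names collide only if both base and suffix parts match. */
static bool collide_8_3(const char *a, const char *b)
{
    char ka[13], kb[13];
    key_8_3(a, ka);
    key_8_3(b, kb);
    return strcmp(ka, kb) == 0;
}
```

With the suffix included in the key, `Index.txt` and `Index.html` are treated as distinct, whereas a comparison on the base name alone would report a clash.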
* mktables: Use new infrastructure for optional files (Karl Williamson, 2015-07-28, 1 file, -1/+1)
| | | | | | | This follows up the previous commit by actually using the new infrastructure it created. The optional Unihan files are switched to use the new capabilities. This means that the globals they previously used are no longer necessary, and are ripped out here.