path: root/regcharclass.h
Commit message | Author | Age | Files | Lines
* Simplify Unicode::UCD::openunicode() and callersDagfinn Ilmari Mannsåker2017-12-281-1/+1
| | | | | | | | | | Get rid of the file-global filehandles and the unused filename return value, instead return the filehandle and assign it to a lexical variable. Also don't bother checking the return value; it croaks on failure anyway. In passing, eliminate erroneous assignment of {} to %CASESPEC for Unicode < 2.1.8.
* Unicode::UCD: Add optional parameter to num()Karl Williamson2017-12-271-1/+1
| | | | | | | As discussed in http://nntp.perl.org/group/perl.perl5.porters/244444, this sets the optional scalar ref parameter to the length of the valid initial portion of the first parameter passed to num(). This is useful in teasing apart why the input is invalid.
* mktables: Generate _Perl_SCX propertyKarl Williamson2017-12-241-1/+1
|
* Unicode::UCD.pm Add undocumented internal featureKarl Williamson2017-12-241-1/+1
| | | | This allows charprop() to be called on a Perl-internal-only property
* mktables: Canonicalize '-' into '_'Karl Williamson2017-12-241-1/+1
| | | | | Some early Unicode releases used a hyphen instead of an underscore in script names. This changes all into underscores
* mktables: Use already set variableKarl Williamson2017-12-181-1/+1
| | | | | The value for this variable is already known; use that instead of rederiving it.
* Unicode::UCD: max code point is now IV_MAXKarl Williamson2017-12-161-1/+1
| | | | Return the correct value when asked.
* perluniprops/mktables: Add code points matched annotationsKarl Williamson2017-12-031-1/+1
| | | | | | | | | This commit changes the generated perluniprops to include some or all of the code points matched by binary tables. All characters matched in the 00-FF range are listed, as well as the first few ranges beyond that. This is to make this pod more useful for people using it as an index to look things up.
* perluniprops/mktables: Fix wrong output.Karl Williamson2017-12-031-1/+1
| | | | | | | | | | perluniprops had a few entries like: XPosixCntrl General_Category=XPosixCntrlControl. It should have read: XPosixCntrl General_Category=Control
* perluniprops: Display controls sorted by alphaKarl Williamson2017-11-301-1/+1
| | | | | | The complete set of C0 controls is listed by standard abbreviation, but it is better to display them alphabetically, and not in ASCII-platform code point order.
* mktables: Add safety codeKarl Williamson2017-11-301-1/+1
| | | | | | | | | | This isn't currently necessary to add, but I discovered this deficiency during debugging, and it could come up in some later change. This code only writes one file when two tables match identically. But it could happen that we've got the pointers to the two tables intertwined so that they each think the other one is the one getting written out, so neither of them do. This checks for that.
* perluniprops: Make sc property refer to scxKarl Williamson2017-11-301-1/+1
| | | | | | The scx is an improved version of the sc(ript) property. This changes mktables to generate perluniprops so that the entries for sc tables refer to the equivalent scx ones.
* perluniprops/mktables: Fix bad entryKarl Williamson2017-11-301-1/+1
| | | | | | | | | | | I spotted this entry in perluniprops recently: \p{Nko} \p{Script_Extensions=Nko} (NOT \p{NKo}). It's saying Nko is not NKo. But case isn't supposed to matter. It turned out that the bug was doing an eq without first canonicalizing the names to account for case differences. I was expecting there to be more entries that were erroneous, but it was just this one.
* mktables: Comment fixes onlyKarl Williamson2017-11-301-1/+1
|
* mktables: Use global for Script_Extensions objectKarl Williamson2017-11-301-1/+1
| | | | This is used in several places, so make its scope global to the program.
* perluniprops: \p{Greek} is a shortcut for scx:greekKarl Williamson2017-11-301-1/+1
| | | | | | Since 5.26.0, this (generated) pod has been wrong. The single-form Perl shortcuts for script names now use the Script_Extensions property instead of the (inferior) plain Script property.
* perluniprops: Improve sortingKarl Williamson2017-11-291-1/+1
| | | | | | | Unicode has some property values that should be sorted numerically, but have prefixes that make them not currently appear to be numbers. For example, CCC101 and V10_5. This commit changes the sort so that they are ordered by their numeric parts.
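The sorting idea can be illustrated with a minimal sketch in C. This is not the code in mktables (which is Perl); the function name `cmp_prefixed_numeric` and the treatment of `_` as a separator between numeric components are assumptions made for the example.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical comparator: orders names such as "CCC9" before "CCC101"
 * and "V2_1" before "V10_5" by comparing the non-digit prefix as text
 * and each numeric component as a number. */
int cmp_prefixed_numeric(const char *a, const char *b)
{
    /* Compare the non-digit prefixes character by character. */
    while (*a && *b
           && !isdigit((unsigned char)*a) && !isdigit((unsigned char)*b)) {
        if (*a != *b)
            return (unsigned char)*a < (unsigned char)*b ? -1 : 1;
        a++;
        b++;
    }

    /* Compare numeric components; '_' separates components (assumption). */
    while (isdigit((unsigned char)*a) && isdigit((unsigned char)*b)) {
        char *end_a, *end_b;
        unsigned long na = strtoul(a, &end_a, 10);
        unsigned long nb = strtoul(b, &end_b, 10);
        if (na != nb)
            return na < nb ? -1 : 1;
        a = end_a;
        b = end_b;
        if (*a == '_' && *b == '_') {
            a++;
            b++;
        }
    }

    /* Whatever is left is compared as plain text. */
    return strcmp(a, b);
}
```

A plain string comparison would place CCC101 before CCC9; the comparator above reverses that by looking at the numeric value.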
* mktables: Avoid some workKarl Williamson2017-11-231-1/+1
| | | | | | | | | | | | | | | Some tables generated by this program are completely described as the complements of other tables. There is no need to thus generate them, as when their value is needed, they can be generated from the other one. However, this takes time, and so this commit caches the result the first time it is needed, and returns that for any future needs. This must not be done until after the controlling table is fully populated, or else the cache would have to be invalidated. Since there is unlikely to be the need for getting the value before the populating is done, what is done here is to simply lock the controlling table, so that any attempt to change it will raise an error, and the code can be fixed at that time, if the need ever does arise.
* PATCH: [perl #132463] perluniprops for \p{Word}Karl Williamson2017-11-181-1/+1
| | | | | | | | | | | perluniprops was not updated to reflect the changes made to what \p{Word} contains as of 5.18. What was added was the code points that have the Join_Control property, which, so far, only contain U+200C and U+200D. This commit uses Join Control instead of the hard-coded code point numbers, so that when Unicode changes it, it automatically will still be valid. Thanks for spotting this.
* mktables: Don't output anything above IV_MAXKarl Williamson2017-07-021-1/+1
| | | | | | This is in preparation for later commits to restrict Unicode code points to IV_MAX. No tables are currently output that go this high, so this change has no current effect.
* Use Unicode 10.0Karl Williamson2017-06-201-43/+44
| | | | | | The new file from Unicode "extracted/DerivedName.txt" is not delivered here, as Perl doesn't need it, as it duplicates information in other files.
* Prepare for Unicode 10.0Karl Williamson2017-06-201-1/+1
| | | | | This informs mktables of the new files in 10.0, and updates some comments in other files to reflect new Unicode terminology.
* Use new paradigm for hdr file double inclusion guardKarl Williamson2017-06-021-4/+4
| | | | | | | | | | We changed to use symbols not likely to be used by non-Perl code that could conflict, and which have trailing underbars, so they don't look like a regular Perl #define. See https://rt.perl.org/Ticket/Display.html?id=131110 There are many more header files which are not guarded.
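As an illustration of the guard style described here, a minimal sketch follows. The header name and the symbols `PERL_EXAMPLE_H_` and `EXAMPLE_UNICODE_MAJOR` are hypothetical; only the pattern (a project-specific prefix plus a trailing underscore) comes from the commit message.

```c
/* example.h: hypothetical header showing the guard pattern. */
#ifndef PERL_EXAMPLE_H_   /* Guard against nested #includes */
#define PERL_EXAMPLE_H_

/* Contents are seen once per translation unit, however many times
 * the header is included. */
typedef struct {
    int value;
} example_type;

#define EXAMPLE_UNICODE_MAJOR 10

#endif /* PERL_EXAMPLE_H_ */
```

The trailing underscore makes the guard symbol visibly different from an ordinary configuration `#define`, and the prefix makes a clash with non-Perl code unlikely.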
* mktables: Fix up version compareKarl Williamson2017-06-011-1/+1
| | | | | | | | | | | | | | | | | This is a feature that is used to compare 2 different Unicode versions for changes to existing code points, ignoring code points that aren't in the earlier version. It does this by removing the newly-added code points coming from the later version. One can then diff the generated directory structure against an existing one that was built under the old rules to see what changed. Prior to this commit, it assumed all version numbers were a single digit for the major number. This will no longer work for Unicode 10, about to be released. As part of the process, mktables adds blocks that didn't exist in the earlier version back to the unallocated pool. This gives better diff results. This commit does a better job of finding such blocks.
* fixup regen/regcharclass.pl for no '.'in @INCDavid Mitchell2017-04-071-1/+1
| | | | | Note that this isn't normally executed during build, so it wasn't spotted earlier.
* Balance uniprops testsKarl Williamson2017-02-191-1/+1
| | | | | | | | | | | | Commit 5656b1f654bb034c561558968ed3cf87a737b3e1 split the tests generated by mktables so that 10 separate files each execute 10% of the tests. But it turns out that some tests are much more involved than others, so that some of those 10 files still took much longer than average. This commit changes the split so that the amount of time each file takes is more balanced. It uses a natural breaking spot for the tests for the \b{} flavors, except that GCB and SB are each short (so are combined into being tested from one file), and LB is very long, so is split into 4 test groups.
* Fix bug with a digit range under re 'strict'Karl Williamson2017-01-191-1/+1
| | | | | | | | | "use re 'strict'" is supposed to warn if the start and end points of a range are digits that aren't from the same group of 10, for example if you mix Bengali and Thai digits. It wasn't working properly for the 5 groups of mathematical digits starting at U+1D7CE. This commit fixes that, and refactors the code to bail out as soon as it discovers that no warning is warranted, instead of doing unnecessary work.
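The check described above can be sketched in C as follows. This is not the code in the regex compiler; the function names and the small table of zero digits are assumptions for illustration (the table lists ASCII, Bengali, Thai, and the five mathematical digit groups, which in Unicode begin at U+1D7CE).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Code point of the zero digit of each group of 10 (illustrative subset). */
static const uint32_t digit_zeros[] = {
    0x0030,                       /* ASCII digits */
    0x09E6,                       /* Bengali digits */
    0x0E50,                       /* Thai digits */
    0x1D7CE, 0x1D7D8, 0x1D7E2,    /* mathematical digits: 5 groups */
    0x1D7EC, 0x1D7F6
};

/* Returns the index of the group containing cp, or -1 if cp is not a
 * digit known to this table. */
static int digit_group(uint32_t cp)
{
    for (size_t i = 0; i < sizeof digit_zeros / sizeof digit_zeros[0]; i++) {
        if (cp >= digit_zeros[i] && cp - digit_zeros[i] < 10)
            return (int)i;
    }
    return -1;
}

/* Hypothetical check: true when both ends of a range are digits from
 * different groups of 10. Bails out as soon as no warning is possible. */
bool digit_range_needs_warning(uint32_t start, uint32_t end)
{
    int start_group = digit_group(start);
    if (start_group < 0)
        return false;             /* start is not a digit: nothing to do */

    int end_group = digit_group(end);
    if (end_group < 0)
        return false;             /* end is not a digit: nothing to do */

    return start_group != end_group;
}
```

Adjacent mathematical groups are the interesting case: U+1D7CE and U+1D7D8 are both "zero" digits, but of different groups, so a range between them would be flagged.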
* Switch most open() calls to three-argument form.John Lightsey2016-12-231-2/+2
| | | | | | | | | | Switch from two-argument form. Filehandle cloning is still done with the two-argument form for backward compatibility. Committer: Get all porting tests to pass. Increment some $VERSIONs. Run: ./perl -Ilib regen/mk_invlists.pl; ./perl -Ilib regen/regcharclass.pl. For: RT #130122
* regen/regcharclass.pl: Add const castKarl Williamson2016-12-111-1088/+1112
| | | | | | | | | | This is a follow-up to commit 9f2eed981068e7abbcc52267863529bc59e9c8c0, which manually added const qualifiers to some generated code in order to avoid some compiler warnings. This changes the code generator to use the same 'const' qualifier generally. The code changed by the other commit had been hand-edited after being generated to add branch prediction, which would be too hard to program in at this time, so the const additions also had to be hand-edited in.
* Fixup Unicode::UCD pod/version and regen dependent filesMatthew Horsfall2016-11-141-1/+1
|
* Allow "." to be excluded from @INCH.Merijn Brand2016-11-111-1/+1
|\ | | | | | | | | | | Build with -Ddefault_inc_excludes_dot to exclude . from @INC. The *current* default is set to be effectively no change. A future change will most likely revert the default to the safer exclusion of .
| * Regen from the "special" regen scriptsAaron Crane2016-11-111-1/+1
| | | | | | | | | | | | A few regen scripts aren't run by "make regen", either because they depend on an external tool, or they must be run by the Perl just built. So they must be run manually.
* | Move Unicode-Normalize to dist/Karl Williamson2016-11-111-1/+1
|/ | | | | p5p has taken over the maintenance of this module, so it should be in dist/
* uniprops.t: split into 10 separate test files t/re/uniprops01.t etcYves Orton2016-10-191-1/+1
| | | | | This way we can run them at the same time under parallel test; as there are a lot of tests (140k or so), this makes a difference.
* Add macro for Unicode Corrigendum #9 strictKarl Williamson2016-09-171-1/+1
| | | | | | | | | | | | | This macro follows Unicode Corrigendum #9 to allow non-character code points. These are still discouraged but not completely forbidden. It's best for code that isn't intended to operate on arbitrary other code text to use the original definition, but code that does things, such as source code control, should change to use this definition if it wants to be Unicode-strict. Perl can't adopt C9 wholesale, as it might create security holes in existing applications that rely on Perl keeping non-chars out.
* Add macro for determining if UTF-8 is Unicode-strictKarl Williamson2016-09-171-1/+1
|
* isUTF8_CHAR(): Bring UTF-EBCDIC to parity with ASCIIKarl Williamson2016-09-171-23/+1
| | | | | | | | | | | | | | | | This changes the macro isUTF8_CHAR to have the same number of code points built-in for EBCDIC as ASCII. This obsoletes the IS_UTF8_CHAR_FAST macro, which is removed. Previously, the code generated by regen/regcharclass.pl for ASCII platforms was hand copied into utf8.h, and LIKELY's manually added, then the generating code was commented out. Now this has been done with EBCDIC platforms as well. This makes regenerating regcharclass.h faster. The copied macro in utf8.h is moved by this commit to within the main code section for non-EBCDIC compiles, cutting the number of #ifdef's down, and the comments about it are changed somewhat.
* regen/regcharclass.pl: surrogates are code pointsKarl Williamson2016-09-171-4/+4
| | | | They are not "characters"
* Make 3 UTF-8 macros APIKarl Williamson2016-08-311-45/+45
| | | | | | | | | | | | | | | | | These may be useful to various module writers. They certainly are useful for Encode. This makes public API macros to determine if the input UTF-8 represents (one macro for each category) a) a surrogate code point b) a non-character code point c) a code point that is above Unicode's legal maximum. The macros are machine generated. In making them public, I am now using the string end location parameter to guard against running off the end of the input. Previously this parameter was ignored, as their use in the core could be tightly controlled so that we already knew that the string was long enough when calling these macros. But this can't be guaranteed in the public API. An optimizing compiler should be able to remove redundant length checks.
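The role of the string end parameter can be illustrated with a minimal sketch in C. These are not the machine-generated macros; the function names are hypothetical, the decoder handles only standard UTF-8 sequences of up to 4 bytes (so it reaches at most 0x1FFFFF, unlike Perl's extended UTF-8), and it does not reject overlong encodings.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from [s, e). Returns its length in bytes,
 * or 0 if the input is too short or malformed. Never reads at or past e. */
static size_t decode_utf8(const uint8_t *s, const uint8_t *e, uint32_t *cp)
{
    size_t len;
    uint32_t v;

    if (s >= e)
        return 0;

    if (s[0] < 0x80)                { len = 1; v = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; v = s[0] & 0x07; }
    else
        return 0;

    if ((size_t)(e - s) < len)      /* the guard the end pointer enables */
        return 0;

    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        v = (v << 6) | (s[i] & 0x3F);
    }
    *cp = v;
    return len;
}

/* Each function returns the sequence length if the category matches, else 0. */
size_t utf8_surrogate_len(const uint8_t *s, const uint8_t *e)
{
    uint32_t cp;
    size_t len = decode_utf8(s, e, &cp);
    return (len && cp >= 0xD800 && cp <= 0xDFFF) ? len : 0;
}

size_t utf8_nonchar_len(const uint8_t *s, const uint8_t *e)
{
    uint32_t cp;
    size_t len = decode_utf8(s, e, &cp);
    if (!len || cp > 0x10FFFF)
        return 0;
    return ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE) ? len : 0;
}

size_t utf8_above_unicode_len(const uint8_t *s, const uint8_t *e)
{
    uint32_t cp;
    size_t len = decode_utf8(s, e, &cp);
    return (len && cp > 0x10FFFF) ? len : 0;
}
```

A caller that has already verified the length pays for one redundant comparison, which, as the commit message notes, an optimizing compiler may be able to remove.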
* Correct spelling errors: lib/unicore/mktablesShlomi Fish2016-08-061-1/+1
| | | | | | | | | All tests appear to pass. I hereby disclaim any explicit or implicit ownership of my changes and place them under the public-domain/CC0/X11L/same-terms-as-Perl/any-other-license-of-your-choice multi-license. Thanks to Vim's ":help spell" for helping a lot.
* Change \p{foo} to mean \p{scx: foo}Karl Williamson2016-06-301-2/+2
| | | | | | | when 'foo' is a script. Also update the pods correspondingly, and to encourage scx property use. See http://nntp.perl.org/group/perl.perl5.porters/237403
* perluniprops: Fix podKarl Williamson2016-06-251-1/+1
| | | | | | | Commit 3d6c5fec8cb3579a30be60177e31058bc31285d7 changed mktables to change to slightly less nice pod in order to remove a warning that was a bug in Pod::Checker. Pod::Checker has now been fixed, and the current commit reinstates the old pod.
* Use Unicode 9.0Unicode Consortium2016-06-211-43/+43
| | | | | This includes regenerating the files that depend on the Unicode 9 data files
* Prepare for Unicode 9.0Karl Williamson2016-06-211-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The major code changes needed to support Unicode 9.0 are changes to the boundary (break) rules, for things like \b{lb}, \b{wb}. regen/mk_invlists.pl creates two-dimensional arrays for all these properties. To see if a given point in the target string is a break or not, regexec.c looks up the entry in the property's table whose row corresponds to the code point before the potential break, and whose column corresponds to the one after. Mostly this fully determines the answer, but for some cases, extra context is required, and the array entry indicates this, and there has to be specially crafted code in regexec.c to handle each such possibility. When a new release comes along, mk_invlists.pl has to be changed to handle any new or changed rules, and regexec.c has to be changed to handle any changes to the custom code. Unfortunately this is not a mature area of the Standard, and changes are fairly common in new releases. In part, this is because new types of code points come along, which need new rules. Sometimes it is because they realized the previous version didn't work as well as it could. An example of the latter is that Unicode now realizes that Regional Indicator (RI) characters come in pairs, and that one should be able to break between each pair, but not within a pair. Previous versions treated any run of them as unbreakable. (Regional Indicators are a fairly recent type that was added to the Standard in 6.0, and things are still getting shaken out.) The other main changes to these rules also involve a fairly new type of character, emojis. We can expect further changes to these in the next Unicode releases. \b{gcb}, for the first time, now depends on context (in rarely encountered cases, like RI's), so the function had to be changed from a simple table look-up to be more like the functions handling the other break properties.
Some years ago I revamped mktables in part to try to make it require as few manual interventions as possible when upgrading to a new version of Unicode. For example, a new data file in a release requires telling mktables about it, but as long as it follows the format of existing recent files, nothing else need be done to get whatever properties it describes to be included. Some of the changes to mktables involved guessing, from existing limited data, what the underlying paradigm for that data was. The problem with that is there may not have been a paradigm, just something they did ad hoc, which can change at will; or I didn't understand their unstated thinking, and guessed wrong. Besides the boundary rule changes, the only change that the existing mktables couldn't cope with was the addition of the Tangut script, whose character names include the code point, like CJK UNIFIED IDEOGRAPH-3400 has always done. The paradigm for this wasn't clear, since CJK was the only script that had this characteristic, and so I hard-coded it into mktables. The way Tangut is structured may show that there is a paradigm emerging (but we only have two examples, and there may not be a paradigm at all), and so I have guessed one, and changed mktables to assume this guessed paradigm. If other scripts like this come along, and I have guessed correctly, mktables will cope with these automatically without manual intervention.
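The two-dimensional break table with context-dependent entries can be sketched in C as follows. This is a heavily simplified illustration, not the tables produced by regen/mk_invlists.pl or the code in regexec.c; the class names, the action names, and the function `is_break` are assumptions, and only the CR/LF and Regional Indicator rules are modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical break classes (a tiny subset of what a real table holds). */
typedef enum { CL_OTHER, CL_CR, CL_LF, CL_RI, CL_COUNT } brk_class;

/* Table entries: a definite answer, or a marker that context is needed. */
typedef enum { BRK, NO_BRK, RI_CONTEXT } brk_action;

/* Row: class before the potential break. Column: class after it. */
static const brk_action brk_table[CL_COUNT][CL_COUNT] = {
    /*              OTHER  CR   LF      RI         */
    [CL_OTHER] = { BRK,   BRK, BRK,    BRK        },
    [CL_CR]    = { BRK,   BRK, NO_BRK, BRK        },
    [CL_LF]    = { BRK,   BRK, BRK,    BRK        },
    [CL_RI]    = { BRK,   BRK, BRK,    RI_CONTEXT },
};

/* Is there a break between classes[pos - 1] and classes[pos]? */
bool is_break(const brk_class *classes, size_t n, size_t pos)
{
    if (pos == 0 || pos >= n)
        return true;              /* always break at the edges of the text */

    switch (brk_table[classes[pos - 1]][classes[pos]]) {
    case BRK:
        return true;
    case NO_BRK:
        return false;
    case RI_CONTEXT: {
        /* Count the run of RIs ending just before pos. An odd count means
         * pos falls inside a pair, so there is no break there. */
        size_t run = 0;
        for (size_t i = pos; i > 0 && classes[i - 1] == CL_RI; i--)
            run++;
        return run % 2 == 0;
    }
    }
    return true;
}
```

For a run of four Regional Indicators, this yields a break only in the middle (between the two pairs), whereas a rule that treats any run as unbreakable would yield none.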
* Tell mktables what Unicode version mk_invlist.pl handlesKarl Williamson2016-06-211-1/+1
| | | | | | | | | | | | | | A downside of supporting the Unicode break properties like \b{gcb}, \b{lb} is that these aren't very mature in the Standard, and so code likely has to change when updating Perl to support a new version of the Standard. And the new rules may not be backwards compatible. This commit creates a mechanism to tell mktables the Unicode version that the rules are written for. If that is not the same version as being compiled, the test file marks any failing boundary tests as TODO, and outputs a warning if the compiled version is later than the code expects, to alert you to the fact that the code needs to be updated.
* t/re/uniprops.t: Add more description for \b{} testsKarl Williamson2016-06-211-1/+1
| | | | | | | | | mktables generates a file of tests used in t/re/uniprops.t. The tests furnished by Unicode for the boundaries like \b{gcb} have comments that indicate the rules each test is testing. These are useful in debugging. This commit changes things so the generated file that includes these Unicode-supplied tests also has the corresponding comments which are output as part of the test descriptions.
* Add an example of the '0x' string format.Jarkko Hietaniemi2016-05-301-1/+1
| | | | | | I am not certain that I find the 'leading zero means hex' format recommendable ('0123' meaning '0x123', the octal format has poisoned the well); but water under the bridge.
* Stop accepting deprecated NBSP in \N{}Karl Williamson2016-05-091-1/+1
| | | | As scheduled for 5.26, this construct will no longer be accepted.
* mktables: Don't destroy a data structure too soon.Karl Williamson2016-05-091-1/+1
| | | | | | | It can happen that one table depends on another table for its contents. This adds a crude mechanism to prevent the depended-upon table from being destroyed prematurely. So far this has only shown up during debugging, but it could have happened generally.
* mktables: Add info under -annotate optionKarl Williamson2016-05-091-1/+1
| | | | This adds some helpful text when this option is used, which is for examining the Unicode database in great detail