path: root/charclass_invlists.h
Commit message | Author | Age | Files | Lines
* mktables: Use builtin::refaddr | Karl Williamson | 2021-12-13 | 1 | -1/+1
| Now that this function is available in miniperl, mktables can use it to avoid a bunch of visually distracting 'no overloading' calls.
* mktables: Don't calculate some unused values | Karl Williamson | 2021-12-13 | 1 | -1/+1
| These apparently were once needed, but no longer.
* mktables: Use mnemonic variable names | Karl Williamson | 2021-12-07 | 1 | -1/+1
| Spotted by Dagfinn Ilmari Mannsåker
* Fix unicore/mktables to avoid any @_ accesses in signatured subs | Paul "LeoNerd" Evans | 2021-12-07 | 1 | -1/+1
|
* mktables: Remove relics of removed legacy tables | Karl Williamson | 2021-09-15 | 1 | -1/+1
| These mentions of the tables removed in b852e1da77b497e086508451bebff00541073fb1 were missed in that commit.
* Support Unicode 14.0 | Unicode Consortium | 2021-09-15 | 1 | -5288/+15906
|
* regen/mk_invlists.pl: Add comment | Karl Williamson | 2021-09-15 | 1 | -1/+1
|
* mktables: Split a Line Break equivalence class | Karl Williamson | 2021-09-15 | 1 | -80/+390
| This is used for \b{lb}, and the rule is changing in Unicode 14.0
* mktables: Reorder some comments, white-space | Karl Williamson | 2021-09-15 | 1 | -1/+1
| Move comments closer to the action
* mktables: Rename variable, and hoist calc from loop | Karl Williamson | 2021-09-15 | 1 | -1/+1
|
* Unicode::UCD: Don't depend on a file's current syntax | Karl Williamson | 2021-08-31 | 1 | -1/+1
| This generated file will be changed in a future commit. The code shouldn't have been relying on its syntax anyway, only on the value it returns.
* Unicode::UCD: Fix typo in pod | Karl Williamson | 2021-08-31 | 1 | -1/+1
|
* Remove deprecated Unicode files | Karl Williamson | 2021-09-01 | 1 | -1/+1
| These files were once apparently intended for use by modules to supplement the core Unicode handling. They contain tables, in a form suitable for use by Perl code, of the portions of the Unicode character database that deal with changing the case of characters and finding the numeric value of a given \d character. In particular, they were designed for fast access using the swash mechanism that has since been removed.
| Unicode::UCD now contains more convenient methods of accessing the data these files contain, and their use has been deprecated since 5.16. I could not figure out a way to force a message should someone open and read one of these files, but the text of each says that the file may be removed without notice at any time. I did not find any uses of them on CPAN.
| Unicode is adding new properties that the format of these files will not be able to handle. Consequently I'm coming up with a new format. Though these files don't contain the new properties, their existence means the burden of maintaining two separate mechanisms. Better to have just one mechanism, suitable for going forward.
* mktables: Generate =head1 NAME line in Name.pm | Karl Williamson | 2021-08-15 | 1 | -1/+1
| All .pm files are supposed to have this line. So far this hasn't been necessary for this file, but future commits will require it.
* lib/unicore/mktables: correct sub signatures in 2 locations | James E Keenan | 2021-08-14 | 1 | -1/+1
| Then, re-run regen/mk_invlists.pl and regen/regcharclass.pl and commit changes in headers.
* mktables: Change "null string" to "empty string" | Karl Williamson | 2021-08-11 | 1 | -1/+1
| The latter phrase makes more sense
* mktables: Add, fix comments | Karl Williamson | 2021-08-11 | 1 | -1/+1
|
* mktables: Fix debugging issues | Karl Williamson | 2021-08-11 | 1 | -1/+1
| Commit 4fe9356b250 changed the signatures on subroutines, and didn't do these correctly. The result was that perl would croak when using the mktables debugging facility.
* mktables: Fix table output | Karl Williamson | 2021-08-09 | 1 | -1/+1
| Commit 4fe9356b250 changed the signatures on subroutines, and didn't do this one correctly. The result was that the comments in the generated files had duplicate text and were slightly garbled.
* Remove EBCDIC-only code | Karl Williamson | 2021-08-07 | 1 | -1/+1
| The previous commit stopped using this code, so we can just get rid of it.
* regen/charset_translations.pl: Use revised UTF-8 macros | Karl Williamson | 2021-07-31 | 1 | -1/+1
| I realized that two base level utf8.h macros for UTF-8 could be refactored to eliminate the conditionals in each. Those macros have equivalents in the pure perl code changed by this commit, which I changed before the utf8.h versions to verify that everything worked, by checking that there was no difference in the generated tables.
* Unicode::UCD: Bump version; regen | Karl Williamson | 2021-07-20 | 1 | -1/+1
|
* Put back the old URL for unicode.org (in lib/unicore) since there is now a redirection | Thibault DUPONCHELLE | 2021-07-17 | 1 | -4/+4
* Update UCD version. Remove changes to cpan Encode. Regen | Thibault DUPONCHELLE | 2021-06-17 | 1 | -5/+5
|
* perluniprops: Remove references to Unicode::Unihan | Karl Williamson | 2021-05-31 | 1 | -1/+1
| This CPAN module doesn't work on recent Unicode versions.
| This fixes GH #18787.
* mostly docs: replace "pumpking" when referring to the present | Ricardo Signes | 2021-04-16 | 1 | -1/+1
| Some other tweaks or modernizations are present, but I expect none of this is controversial.
| This also includes running regen/mk_invlists.pl and regen/regcharclass.pl
* style: Detabify regen files. | Michael G. Schwern | 2021-01-17 | 1 | -3/+3
| They generate C files.
| Bump feature.pm and warnings.pm versions to satisfy cmpVERSION.pl. I can't get it to easily ignore whitespace; `git diff --name-only` does not respect the -w flag.
| regen_perly.pl is left alone. That would require rebuilding perly.*, which is beyond a simple indentation change.
* uni_keywords.h: Confine the scope to core | Karl Williamson | 2020-11-02 | 1 | -1/+1
| All symbols in here are for core only use
* charclass_invlists.h: Add some inverse folds. | Karl Williamson | 2020-10-16 | 1 | -155/+191
| The MICRO SIGN folds to above the Latin1 range, the only character that does so in Unicode (or is ever likely to). This requires special handling. This commit reduces some of the need for that handling by creating the inversion map for it, which can be used in certain instances in pattern matching, without having to have a special case. The actual use of this will come in a future commit.
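For readers unfamiliar with the data structure named in the header's file name, here is a minimal Python sketch of an inversion list and an inversion map. It is an illustration only, not the layout mk_invlists.pl actually emits; the function names and the sample tables are hypothetical.

```python
from bisect import bisect_right

def in_inversion_list(invlist, code_point):
    """Return True if code_point is in the set described by invlist.

    invlist is a sorted list of code points: each even index starts a range
    that is in the set, each odd index starts a range that is not.
    """
    # Count the boundaries <= code_point; an odd count means "inside a range".
    return bisect_right(invlist, code_point) % 2 == 1

def map_lookup(starts, values, code_point, default=None):
    """Inversion map: values[i] applies from starts[i] up to starts[i+1] - 1."""
    i = bisect_right(starts, code_point) - 1
    return values[i] if i >= 0 else default

# Hypothetical set: ASCII digits and ASCII uppercase letters.
DIGITS_AND_UPPER = [0x30, 0x3A, 0x41, 0x5B]

# Hypothetical one-entry inverse-fold map: U+03BC (GREEK SMALL LETTER MU) is
# what U+00B5 (MICRO SIGN) folds to; None marks "no entry".
INVERSE_FOLD_STARTS = [0x0, 0x3BC, 0x3BD]
INVERSE_FOLD_VALUES = [None, 0xB5, None]
```

A lookup is a single binary search over the boundaries, which is why such a table can replace a hand-written special case.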
* pod/perluniprops: Split run-on lines before '\' | Karl Williamson | 2020-05-27 | 1 | -1/+1
| This changes mktables, which generates this pod, to consider long pod lines to be splittable before most backslashes. On os390, the lack of this caused a line to not be split at all, creating a Porting test failure.
| There is also a current rule that you can split at a lowercase/uppercase boundary. This works for the limited domain this code is run on. But it shouldn't split \cK. So don't do the split if the lowercase is a single letter preceded by a backslash.
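The two splitting rules can be sketched in Python as follows. This is not the mktables code: the function name is hypothetical, and treating every backslash as a legal break point is a simplifying assumption (the commit says "most").

```python
def split_points(line):
    """Return the indices i at which the line may be broken before line[i]."""
    points = []
    for i in range(1, len(line)):
        prev, cur = line[i - 1], line[i]
        if cur == "\\":
            # Assumption: any backslash is an acceptable place to break.
            points.append(i)
        elif prev.islower() and cur.isupper():
            # Lowercase/uppercase boundary, except when the lowercase letter
            # is a lone escape letter such as the 'c' in \cK.
            if i >= 2 and line[i - 2] == "\\":
                continue
            points.append(i)
    return points
```

For the string `\p{IsFoo}\cK` this yields a break before `F` and before the second backslash, but none between `c` and `K`.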
* Fix a bunch of repeated-word typos | Dagfinn Ilmari Mannsåker | 2020-05-22 | 1 | -1/+1
| Mostly in comments and docs, but some in diagnostic messages and one case of 'or die die'.
* mktables: Change named sequences to 5 digits | Karl Williamson | 2020-03-20 | 1 | -1/+1
| This makes them correspond to names for single characters, and will make parsing easier in the next commits.
* lib/unicore/mktables: use function signatures | Nicolas R | 2020-03-19 | 1 | -1/+1
| Also regenerate files depending on lib/unicore/mktables:
|   ./perl -Ilib regen/mk_invlists.pl; ./perl -Ilib regen/regcharclass.pl
* Implement \p{Name=/.../} wildcards | Karl Williamson | 2020-03-11 | 1 | -1/+1
| This commit adds wildcard subpatterns for the Name and Name Aliases properties.
* mktables: Calculate legal chars in algorithmic names | Karl Williamson | 2020-03-11 | 1 | -1/+1
| Many ideographic character names are of the form 'prefix-code_point'. For these, we know that the legal characters in the names are just the ones in the prefix, the dash, and uppercase hex digits. For each series of these types of names, this commit figures out what characters are legal in that series, and adds that info to the hash describing the series. This will be used in a later commit to rule out entire series when matching under some circumstances, without having to try any individual matches within it.
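A rough Python sketch of the idea, assuming a series is described by a small dictionary. The key names and the helper functions are hypothetical and do not mirror the hash mktables builds.

```python
HEX_DIGITS = set("0123456789ABCDEF")

def legal_chars(prefix):
    """Characters that can occur in any name of the form 'prefix-HEX'."""
    return set(prefix) | {"-"} | HEX_DIGITS

def series_can_match(series, literal):
    """Cheap filter: False means no name in the series can contain literal."""
    return set(literal) <= series["legal"]

# Hypothetical description of one series; the key names are illustrative.
cjk = {"prefix": "CJK UNIFIED IDEOGRAPH", "low": 0x4E00, "high": 0x9FFF}
cjk["legal"] = legal_chars(cjk["prefix"])
```

With this, a literal such as `LATIN` rules out the whole series at once, because `L` and `T` are not among its legal characters, so no individual name needs to be generated and tried.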
* Reformat lib/unicore/Name.pl | Karl Williamson | 2020-03-11 | 1 | -2/+2
| This changes the format of this generated file so that it can more easily be used with the Unicode Name property in wildcard matching.
| Each line will now end with \n\n, and the \t characters are replaced by \n. Thus an entry will look like
|   00001\nSTART OF HEADING\n\n
| This makes matching of user-defined patterns using anchors work under /m, which commit 4829f32decd128e6a122bd8ce35fe944bd87f104 forces. That commit also changed some anchors' definitions to make them match \n under /m with wildcards, so this makes it all transparent to user patterns.
| The double \n\n at the end of an entry is so that the code can distinguish between a line that contains a code point vs a name without relying on the content; it is a disambiguator, like the \t that used to be.
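To illustrate why this layout helps, here is a Python sketch that runs an anchored user pattern over the whole table in multi-line mode and keeps only hits on name lines. Python's `re.MULTILINE` stands in for Perl's /m; the table contents and the function name are hypothetical, and the real matching code in the core is not shown in this log.

```python
import re

# Two hypothetical entries in the new layout: code point, newline, name,
# then an empty line.
NAME_TABLE = "00001\nSTART OF HEADING\n\n00002\nSTART OF TEXT\n\n"

def find_names(table, user_pattern):
    """Return (code_point, name) pairs whose name line matches user_pattern."""
    regex = re.compile(user_pattern, re.MULTILINE)
    hits = []
    for match in regex.finditer(table):
        line_start = table.rfind("\n", 0, match.start()) + 1
        line_end = table.find("\n", match.start())
        # A name line is followed by an empty line; a code point line is not.
        # The decision never looks at the content of the line itself.
        if line_end < 0 or table[line_end:line_end + 2] != "\n\n":
            continue
        name = table[line_start:line_end]
        prev_start = table.rfind("\n", 0, line_start - 1) + 1
        entry = (int(table[prev_start:line_start - 1], 16), name)
        if not hits or hits[-1] != entry:  # skip repeated hits on one line
            hits.append(entry)
    return hits
```

Because each name sits on its own line, `^` and `$` in the user's pattern bind to the start and end of a name without the user needing to know anything about the file format.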
* charclass_invlists.h: Give re_comp.c access to enums, #defines | Karl Williamson | 2020-03-02 | 1 | -1/+17
| The previous commit changed the code so that enums and #defines could be requested to be in re_comp.c. This commit uses that new capability.
* regen/mk_invlists.pl: Allow enums/defines to be in re_comp.c | Karl Williamson | 2020-03-02 | 1 | -1/+1
| To save memory, tables that are for regcomp.c are excluded from re_comp.c, but enums use no resources, and a later commit will want them accessible from re_comp.c. So change the code so that they can be requested to be in re_comp.c
* regen/mk_invlists.pl: Move #define in output | Karl Williamson | 2020-03-02 | 1 | -1/+3
| This value will be needed outside of where it currently is defined; this commit makes it available elsewhere
* Restrict features in wildcards | Karl Williamson | 2020-02-19 | 1 | -1/+1
| The algorithm for dealing with Unicode property wildcards is to wrap the user-supplied pattern with /miaa. We don't want the user to be able to override the /m and /aa parts.
| Modifiers that are only specifiable as a modifier in a qr or similar op (like /gc) can't be included in things like (?gc). These normally incur a warning that they are ignored, but the texts of those warnings are misleading when using wildcards, so I chose to just make them illegal. Of course that could be changed to having custom useful warning texts, but I didn't think it was worth it.
| I also chose to forbid recursion via nested \p{}, just from fear that it might lead to issues down the road, and it really isn't useful for this limited universe of strings to match against. Because wildcards currently can't handle '}' inside them, only the single letter \p,\P are valid anyway.
| Similarly, I forbid the '*' quantifier to make it harder for the constructed subpattern to take forever to make any progress and decide to halt. Again, using it would be overkill on the universe of possible match strings.
* Regenerate charclass_invlists.h | Karl Williamson | 2020-02-19 | 1 | -182/+216
| There is something wrong with our mechanism to show if this is out-of-sync, because it didn't. And it needed regenerating. I will have to look to understand the reason why.
| Nor did any of the tests fail. In part, I see from looking at the diffs that there is a rule that is no longer used. But it also may be that the Unicode-supplied tests are missing things. Obviously one can't test every code point, but just a representative sample, so some things may fall through the cracks.
* Update to latest Unicode 13.0 | Karl Williamson | 2020-02-19 | 1 | -6/+6
| Unicode has made minor changes in its data files since I added the beta versions to Perl 5.31. These are still beta; the final release date is March 10. I thought it best to get the latest into Perl 5.31.9.
* mktables: Handle versioning of non-UCD files | Karl Williamson | 2020-02-19 | 1 | -1/+1
| Unicode has lately been asking implementations to support non-Unicode Character Database properties. Files for these contain a different versioning syntax than the UCD files. Previously I was hand-editing those files before committing to bring them to use a consistent style. But that is tedious, and I decided to invest a little time to be able to handle all the current versioning syntaxes automatically, to save having to manually update in the future.
| This was complicated by the fact that some Unicode non-UCD files have BOM marks on many comment lines. I submitted a trouble report to them.
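A sketch of the kind of parsing this implies, in Python. The two header styles below are assumptions chosen for illustration (a version embedded in the file name line, and a separate "Version:" line); the commit does not list the syntaxes mktables actually accepts, and the function name is hypothetical.

```python
import re

BOM = "\ufeff"

# Two assumed header styles; the real files may use others.
FILENAME_STYLE = re.compile(r"^#\s*\S+-(\d+(?:\.\d+)+)\.txt\b")
KEYWORD_STYLE = re.compile(r"^#\s*Version:\s*(\d+(?:\.\d+)+)\b")

def file_version(lines):
    """Return the version string from the leading comment lines, or None."""
    for line in lines:
        # Some files carry a BOM on many comment lines, so strip it per line.
        line = line.lstrip(BOM).strip()
        if not line.startswith("#"):
            break  # the header comments are over
        for pattern in (FILENAME_STYLE, KEYWORD_STYLE):
            match = pattern.match(line)
            if match:
                return match.group(1)
    return None
```

Stripping the BOM on every line, rather than only at the start of the file, is what lets one parser cope with both kinds of input without hand-editing.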
* perluniprops: Fix missing backslash | Karl Williamson | 2020-02-13 | 1 | -1/+1
|
* Add qr/\p{Name=...}/ | Karl Williamson | 2020-02-12 | 1 | -1/+1
| This accomplishes the same thing as \N{...}, but only for regex patterns, using loose matching and only the official Unicode names.
| This commit includes a comparison of the two approaches, added to perlunicode. But the real reason to do this is as a way station to being able to specify wild card lookup on the name property, coming in a later commit.
| I chose to not include user-defined aliases nor :short character names at this time. I thought that there might be unforeseen consequences of using them. It's better to later relax a requirement than to try to restrict it.
* Support Unicode properties Identifier_(Status|Type) | Karl Williamson | 2020-02-03 | 1 | -15/+16373
| These non-UCD properties are now being asked to be supported by the Unicode regular expression specification, UTS #18.
| These have a slightly different header syntax for giving the version than UCD files. In this commit, I modify these to fit, but will probably have to generalize at some point the parsing of versions in mktables.
* mktables: Generalize the scx property handling | Karl Williamson | 2020-02-03 | 1 | -1/+1
| Until now, this property was unique in that it specifies a set of possible values for scripts that a character can be in, rather than a single script. That multiplicity has been handled specially. But the next couple of commits will introduce another property that has similar characteristics. This commit makes the scx handling more general, so as to also be usable for the new property.
* mktables: Improve warning msg | Karl Williamson | 2020-02-03 | 1 | -1/+1
|
* mktables: Add capability to override match directory | Karl Williamson | 2020-02-03 | 1 | -1/+1
| This is because this is still supposed to work on DOS 8.3 filesystems, and future commits will use non-Unicode-Character-Database tables which don't have shorter names.
* Use Unicode 13.0 (beta) | Unicode Consortium | 2020-01-30 | 1 | -4372/+12656
| Unicode has changed its yearly release cycle so that the final version is not available until early March of the year. This year it is March 10, 2020. However, all changes planned were finalized in early January, and the actual computer files have been updated to their presumably final substantive versions. The release has been authorized without further review needed. The release is awaiting final documentation additions, and soak time of the beta to verify there are no glitches. This commit causes Perl to participate in that soak.
| I don't anticipate any problems, and likely the only substantive change upon the official release will be to update perldelta. Comments in the files supplied by Unicode will likely also change to indicate these are no longer beta.
| There were very few changes affecting existing characters; most of the changes involved adding new characters, including emoji. The break characteristics of some existing characters were changed (GCB, LB, WB, and SB properties). The only perl code I really had to change to cope with the new release was about rules in the Line Break property, dealing around ellipses (...) and certain East Asian characters next to opening parentheses.
| If there are problems, we can revert this at any time, and ship with 12.0.