path: root/regen/mk_invlists.pl
Commit message (Author, Age; Files, Lines -/+)
* Remove duplicate "the" in comments (Elvin Aslanov, 2023-05-03; 1 file, -1/+1)
    Fix spelling on various files pertaining to core Perl.
* regen/mk_invlists.pl - under DEBUG=1 show some progress output (Yves Orton, 2022-08-03; 1 file, -4/+26)
* Update checksums in some generated files (Karl Williamson, 2022-06-06; 1 file, -1/+4)
    These use checksums to see if the generated data could be out of date. The new NormTest.pl wasn't counted in this, and needn't be, but excluding it and other similar ones is more trouble than it's worth, so make a comment to that effect and update to include the NormTest.pl digest value.
* regen/mph.pl & mk_invlists.pl - add the "_squeeze" algorithm to produce smaller blobs (Yves Orton, 2022-04-19; 1 file, -0/+6)
    The squeeze algorithm produces smaller blobs, 10-20% depending on how it is used. With the "randomize_squeeze" option enabled it is slower but produces 20% smaller blobs than the "_simple" strategy we used to use. With the "randomize_squeeze" option disabled it is about as fast as "_simple" but produces about 10% smaller blobs. Regardless, "_squeeze" uses more memory than "_simple"; quite a bit more currently, although that is unforced and could be changed if required.

        -blob length: 10548
        +blob length: 8635
        ...
        -data size: 69908 (%67.07)
        +data size: 67995 (%65.23)

    So it saves 1913 bytes running with this seed. I happened to get lucky with the seed; depending on the seed used, the blob ended up at about 8650 bytes.

    This algorithm is originally by Ilya Sashcheka, so I have added him to the AUTHORS file, but unfortunately I no longer have his email address as we lost touch. It contains many modifications by me.
* regen/mph.pl & mk_invlists.pl - convert from sub interfaces to OO interfaces (Yves Orton, 2022-04-19; 1 file, -6/+7)
    The old sub based API was passing around an awkward number of arguments and it was becoming difficult to enhance in certain ways. This patch changes all the "user serviceable" functions into methods, and moves the configuration defaults into the constructor. Note that not all the functions have been converted; the core routines with simple interfaces have not been changed. This is OO for the purpose of encapsulation, not inheritance or overloading.
* regen/mk_invlists.pl - add a way to dump the keywords hash for review (Yves Orton, 2022-04-19; 1 file, -1/+10)
    This adds a way to tell mk_invlists.pl to dump the keywords hash so it can be reviewed, or used for testing or whatnot. A user can set the env var DUMP_KEYWORDS_FILE to a file name, and the keywords hash will be saved to that file. If the env var is not set, the file won't get written to disk.

    Includes regenerated output from running regen/mk_invlists.pl to keep porting/regen.t happy.
* regen/mk_invlists.pl - move token_name() sub closer to where it is used (Yves Orton, 2022-04-19; 1 file, -7/+8)
    sub token_name() was injected into the middle of totally unrelated logic that does not use it. token_name() is a wrapper around sanitize_name() so move it next to that sub.

    Also includes the output from running regen/mk_invlists.pl to keep porting/regen.t happy.
* regen/mk_invlists.pl - move require to top of file (Yves Orton, 2022-04-19; 1 file, -1/+1)
    mk_invlists.pl does a lot and takes a while before it gets to the part where it requires regen/mph.pl, which means that if there are issues in it they aren't discovered until a fair amount of time elapses, which is frustrating when debugging. Moving the require to the top means the script dies early and can be fixed.

    Includes a regen of uni_keywords.h and friends as this changes a regen script, which causes regen.t to fail if its output is not up to date.
* Support Unicode 14.0 (Unicode Consortium, 2021-09-15; 1 file, -13/+6)
* regen/mk_invlists.pl: Add comment (Karl Williamson, 2021-09-15; 1 file, -0/+2)
* mktables: Split a Line Break equivalence class (Karl Williamson, 2021-09-15; 1 file, -1/+6)
    This is used for the \b{lb}, and the rule is changing in Unicode 14.0.
* style: Detabify regen files. (Michael G. Schwern, 2021-01-17; 1 file, -3/+3)
    They generate C files.

    Bump feature.pm and warnings.pm versions to satisfy cmpVERSION.pl. I can't get it to easily ignore whitespace; `git diff --name-only` does not respect the -w flag.

    regen_perly.pl is left alone. That would require rebuilding perly.*, which is beyond a simple indentation change.
* uni_keywords.h: Confine the scope to core (Karl Williamson, 2020-11-02; 1 file, -0/+3)
    All symbols in here are for core only use.
* charclass_invlists.h: Add some inverse folds. (Karl Williamson, 2020-10-16; 1 file, -3/+29)
    The MICRO SIGN folds to above the Latin1 range, the only character that does so in Unicode (or ever likely to). This requires special handling. This commit reduces some of the need for that handling by creating the inversion map for it, which can be used in certain instances in pattern matching, without having to have a special case. The actual use of this will come in a future commit.
* char_class_invlists.h: Give re_comp.c access to enums,#defines (Karl Williamson, 2020-03-02; 1 file, -2/+6)
    The previous commit changed the code so that enums and #defines could be requested to be in re_comp.c. This commit changes to use that new capability.
* regen/mk_invlists.pl: Allow enums/defines to be in re_comp.c (Karl Williamson, 2020-03-02; 1 file, -2/+4)
    To save memory, tables that are for regcomp.c are excluded from re_comp.c, but enums use no resources, and a later commit will want them accessible from re_comp.c. So change the code so that they can be requested to be in re_comp.c.
* regen/mk_invlists.pl: Move #define in output (Karl Williamson, 2020-03-02; 1 file, -7/+6)
    This value will be needed outside of where it currently is defined; this commit makes it available elsewhere.
* Use Unicode 13.0 (beta) (Unicode Consortium, 2020-01-30; 1 file, -31/+32)
    Unicode has changed its yearly release cycle so that the final version is not available until early March of the year. This year it is March 10, 2020.

    However, all changes planned were finalized in early January, and the actual computer files have been updated to their presumably final substantive versions. The release has been authorized without further review needed.

    The release is awaiting final documentation additions, and soak time of the beta to verify there are no glitches. This commit causes Perl to participate in that soak.

    I don't anticipate any problems, and likely the only substantive change upon the official release will be to update perldelta. Comments in the files supplied by Unicode will likely also change to indicate these are no longer beta.

    There were very few changes affecting existing characters; most of the changes involved adding new characters, including emoji. The break characteristics of some existing characters were changed (GCB, LB, WB, and SB properties). The only perl code I really had to change to cope with the new release was about rules in the Line Break property, dealing around ellipses (...) and certain East Asian characters next to opening parentheses.

    If there are problems, we can revert this at any time, and ship with 12.0.
* Change Unicode property abbrev to upcoming official (Karl Williamson, 2020-01-30; 1 file, -19/+19)
    Unicode 12.0 used a new property file that was not from the Unicode Character Database. It only had a long property name. I incorporated it into our data, and rather than use the very long name all the time, I created my own short name, since there was no official one.

    Now, the upcoming 13.0 has moved the file to the UCD, and come up with a short name that differs from the one I had. This commit converts to use Unicode's name. This property is not exposed to user or XS space, so there is no user impact.
* regen/mk_invlists.pl: Sort generated tables alphabetically (Karl Williamson, 2020-01-30; 1 file, -76/+114)
    This change causes certain tables to be sorted so their row and column headings appear alphabetically, which makes them easier to read.
* regen/mk_invlists.pl: Clarify comment in output .h (Karl Williamson, 2020-01-30; 1 file, -1/+1)
* regen/mk_invlists.pl: Do sort caselessly in places (Karl Williamson, 2020-01-30; 1 file, -5/+7)
    This makes things more like dictionary order.
* regen/mk_invlists.pl; White space, comments only (Karl Williamson, 2020-01-30; 1 file, -31/+65)
* Change some structures/fcns to use I32 and U32 (Karl Williamson, 2020-01-03; 1 file, -2/+9)
    This is because these deal with only legal Unicode code points, which are restricted to 21 bits, so 16 is too few, but 32 is sufficient to hold them. Doing this saves some space/memory on 64 bit builds where an int is 64 bits.
* regen/mk_invlists.pl: Move inversion list adjacent to similar (Karl Williamson, 2019-12-18; 1 file, -8/+8)
    This will lessen any paging that might occur. Further, on most builds, it and another table are identical, so only one is actually needed.
* Move data for PL_InBitmap to charclass_invlists.h (Karl Williamson, 2019-11-26; 1 file, -0/+31)
    This makes it consistent with the other inversion lists for this sort of thing, and finishes the fix for GH #17154.
* Remove lib/unicore/Heavy.pl (Karl Williamson, 2019-11-06; 1 file, -16/+16)
    This file was for the use of utf8_heavy.pl. But now that that is incorporated into Unicode::UCD, move the definitions from Heavy.pl to lib/unicore/UCD.pl which is used by Unicode::UCD. This allows removing package names.
* regen/mk_invlists.pl: Fix /i rules for non-ASCII machines (Karl Williamson, 2019-08-26; 1 file, -5/+8)
    Two variables weren't getting initialized properly in one code path, with the result that the case folding tables were pretty much garbage, but not on ASCII platforms.
* regen/mk_invlists.pl: Never remap 0 (Karl Williamson, 2019-08-26; 1 file, -2/+2)
    0 is a special marker, and shouldn't be remapped. It would be unlikely to be so, but this makes sure.
* regen/mk_invlists.pl: inversion map requires a final entry (Karl Williamson, 2019-08-26; 1 file, -0/+3)
    Inversion maps are supposed to have an entry for what to do above the Unicode range. This subroutine crafts a custom map that was missing that.
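The role of that final entry can be illustrated with a small, simplified sketch. It assumes an inversion map is a sorted array of range starts paired with one value per range, where the last range is open-ended. The names `toy_starts`, `toy_values`, `toy_range_index` and `toy_lookup` and the data are hypothetical; this is not the layout mk_invlists.pl actually generates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy data, not taken from charclass_invlists.h.
 * toy_starts[i] is the first code point of range i.  Range i ends just
 * before toy_starts[i + 1]; the last range has no upper bound. */
static const uint32_t toy_starts[] = { 0x0, 0x41, 0x5B, 0x110000 };
static const int      toy_values[] = { 0,   1,    0,    -1 };
#define TOY_LEN (sizeof toy_starts / sizeof toy_starts[0])

/* Binary search for the last range whose start is <= cp. */
static size_t toy_range_index(const uint32_t *starts, size_t len, uint32_t cp)
{
    size_t lo = 0, hi = len;   /* invariant: starts[lo] <= cp, answer in [lo, hi) */

    assert(len > 0 && starts[0] == 0);
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (starts[mid] <= cp)
            lo = mid;
        else
            hi = mid;
    }
    return lo;
}

/* Map a code point to the value of the range that contains it. */
static int toy_lookup(uint32_t cp)
{
    return toy_values[toy_range_index(toy_starts, TOY_LEN, cp)];
}
```

Without the entry starting at 0x110000, every input above the Unicode range would silently inherit the value of the preceding range (0 in this toy data) instead of a deliberately chosen one.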
* regen/mk_invlists.pl: Add tables for Unicode wildcards (Karl Williamson, 2019-03-12; 1 file, -0/+132)
    This supports this new feature.
* regen/mk_invlists.pl: Remove stray debugging stmts (Karl Williamson, 2019-03-12; 1 file, -2/+0)
    These debugging lines were left in by 21c34e9717d.
* regen/mk_invlists.pl: Comment/white-space only (Karl Williamson, 2019-03-12; 1 file, -3/+6)
* regen/mk_invlists.pl, lib/utf8_heavy.pl: Rename variable (Karl Williamson, 2019-03-12; 1 file, -7/+14)
    This renames a variable to more accurately reflect its content, and adds a new one which has the old name but with an accurate content.
* charclass_invlists.h: Add comment (Karl Williamson, 2019-03-12; 1 file, -1/+1)
* Add hook for Unicode private use override (Karl Williamson, 2019-03-07; 1 file, -0/+36)
    I am starting to write a Unicode::Private_Use module which will allow one to specify the Unicode properties of private use code points, thus making them actually useful. This commit adds a hook to regcomp.c to accommodate this module. The changes are pretty minimal. This way we don't have to wait another release cycle to get it out there.

    I don't want to document this interface, until it's proven.
* PERL_GLOBAL_STRUCT_PRIVATE: fix some const strings (David Mitchell, 2019-02-19; 1 file, -1/+1)
    change a couple of

        const char * foo[] = { ... }

    to

        const char * const foo[] = { ... }

    Making the string ptrs const means the whole thing is RO and doesn't appear in data section, making porting/libperl.t happier when building under -DPERL_GLOBAL_STRUCT_PRIVATE.
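The difference between the two declarations can be seen in a minimal sketch. The array names and contents are hypothetical, not the arrays touched by this commit, and where the linker actually places each array depends on the toolchain.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The characters are const, but the pointers are not: each slot can be
 * reassigned at run time, so the array itself has to live in writable data. */
const char *toy_mutable_names[] = { "alpha", "beta" };

/* Both the characters and the pointers are const: nothing here can be
 * modified, so the toolchain is free to place the array in read-only storage. */
const char * const toy_readonly_names[] = { "alpha", "beta" };

/* toy_readonly_names[0] = "gamma";   would be rejected by the compiler */

/* Number of entries in the read-only array. */
size_t toy_count(void)
{
    return sizeof toy_readonly_names / sizeof toy_readonly_names[0];
}
```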
* regen/mk_invlists.pl: Create new inversion list (Karl Williamson, 2019-02-05; 1 file, -0/+30)
    This will be used in a future commit.
* regen/mk_invlists.pl: Rmv extraneous tab in output (Karl Williamson, 2019-01-04; 1 file, -1/+1)
* Revert "regen/mk_invlists.pl: Fix bug when 2 ident tables" (Karl Williamson, 2018-12-31; 1 file, -6/+4)
    This reverts commit 7e9b4fe4d85e9b669993bf96a7e33ffff3197e20, with additional changes to get things to compile.

    It turns out I was wrong about the underlying cause that commit addressed, and it is easier to just use the existing constants that get generated.
* regen/mk_invlists.pl: Rmv outdated code (Karl Williamson, 2018-12-26; 1 file, -15/+2)
    Before the GCB property handling got more complicated, it was possible to represent its vagaries with a boolean table on early Unicode releases. Now there are more complicated rules, and even though early releases only use 0 or 1, the rules exist and lead to compilation errors. Just remove the special handling, and let the table be U8.
* Move 2 property defns to mktables (Karl Williamson, 2018-12-25; 1 file, -32/+2)
    These 2 Unicode-like property definitions used internally by the regular expression compiler are moved by this commit from regen/mk_invlists.pl to lib/unicore/mktables.

    By placing all these in the same place, maintainers only have to learn one bit of code, instead of two.
* regen/mk_invlists.pl: Fix bug when 2 ident tables (Karl Williamson, 2018-12-25; 1 file, -4/+6)
    If two tables are identical, the code created a #define of one index of a pointer array to be the other index. But in some cases, that's not sufficient, and the actual pointer must be defined in terms of the other. This showed up in compiling perl with an early Unicode version, but the circumstances could arise again in a future version.
* regen/mk_invlists.pl: Add new table (Karl Williamson, 2018-12-07; 1 file, -3/+15)
    This table contains all the code points that are in any multi-character fold (not the folded-from character, but what that character folds to).

    It will be used in a future commit.
* regen/mk_invlists.pl: Rmv no longer used array (Karl Williamson, 2018-12-07; 1 file, -2/+0)
* regen/mk_invlists.pl: Generate a new value (Karl Williamson, 2018-11-26; 1 file, -1/+24)
    The new value is the maximum number of code points that fold to any single code point. It will be used in a future commit.
* Move Unicode \p{} definitions to regcomp.c (Karl Williamson, 2018-08-02; 1 file, -33/+127)
    These are only used in compiling patterns. They previously were placed in utf8.c because they are large, and there is a copy of regcomp.c in ext/re, so they would have used twice the space. This commit changes things so that they only are used and defined in regcomp.c (not re_comp.c), so that duplication does not occur. They are accessed only from one function, and that is also moved from utf8.c to regcomp.c, only compiled in regcomp.c, and referred to as an external by re_comp.c.

    I had to change the names of the tables. Previously they started with 'PL_' in case any got exposed, but globvar.t mindlessly assumes that any such variables in the file regcomp.c are globals, and wrongly complains. It was easier to just change the prefix to 'UNI_' instead.

    A few tables are used in regexec.c, and are duplicated in re_exec.c. Things could be adjusted so that only one copy is used. I tried this, but the tables are far more intertwined in regexec.c functions than the ones changed in this commit, as only a single function accesses these. Thus doing this would be a lot harder, and the payback isn't all that much. I started work to make them EXTCONSTs, and then discovered the intertwining, but left in that work, unused.
* regen/mk_invlists.pl: Collapse unused boundary values (Karl Williamson, 2018-07-21; 1 file, -37/+70)
    Each Unicode property that specifies boundary conditions, like Word_Break, partitions all the Unicode code points into equivalence classes. So, for example, combining marks are placed into the Extend class, because they are usually used to extend the previous character and don't stand on their own. mk_invlists.pl creates a boolean table of all pairwise combinations of these classes, so that it knows by simple lookup, if the first character is class X and the next character is class Y, whether a break is permitted between these.

    However, in some cases the answer isn't as simple as this, and other means such as the characters in the vicinity of X and Y must be used to disambiguate. In these cases the table value in the cell (X,Y) isn't a boolean, but is some other number indicating some specially crafted code section to execute to resolve the issue.

    Over the years, Unicode has tended to subdivide partitions into smaller ones, as they've refined their algorithms. But with Unicode 11, they used another method and actually removed partitions. Rather, they retain the partitions, but no code point actually takes on the value of an obsolete partition.

    In order to not have to change the algorithm unnecessarily between Unicode releases (who knows, they might change their minds, and unobsolete these next time), mk_invlists has just kept the tables around, but those cells won't ever get accessed because no code point in the current release evaluates to them. But that makes the tables unnecessarily large.

    We can achieve the same thing by mapping each unused equivalence class to the same value, which we call 'unused'. The algorithms that refer to the obsolete partitions go through the data assigning values to the cells, but now the cells overlap, since all obsolete classes map to the same row or column. Thus the data is total garbage. But that doesn't matter, since that row or column is never read by the data in the Unicode release the table is constructed for.

    mk_invlists also can compile older Unicode releases, and this makes those tables smaller than before, with all unused classes in a given release collapsed into a single row and single column of (unused) garbage.
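The pairwise table described above can be sketched in a heavily reduced form. The class names, the cell values and the `toy_` identifiers are all hypothetical and do not reproduce Unicode's actual break rules or the generated tables; the only point is the shape of the lookup and the single shared row and column for obsolete classes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced set of equivalence classes.  Every obsolete
 * class would be mapped to TOY_UNUSED, so the table gains one shared
 * row and column instead of one per obsolete class. */
enum toy_class { TOY_OTHER, TOY_EXTEND, TOY_LETTER, TOY_UNUSED, TOY_CLASS_COUNT };

/* Cell values: 0 and 1 are plain booleans; larger numbers stand for
 * "run some specially crafted code that looks at the wider context". */
enum { TOY_BREAK = 0, TOY_NO_BREAK = 1, TOY_NEEDS_CONTEXT = 2 };

static const uint8_t toy_table[TOY_CLASS_COUNT][TOY_CLASS_COUNT] = {
    /*               OTHER      EXTEND        LETTER             UNUSED */
    /* OTHER  */ { TOY_BREAK, TOY_NO_BREAK, TOY_BREAK,         0 },
    /* EXTEND */ { TOY_BREAK, TOY_NO_BREAK, TOY_BREAK,         0 },
    /* LETTER */ { TOY_BREAK, TOY_NO_BREAK, TOY_NEEDS_CONTEXT, 0 },
    /* UNUSED */ { 0,         0,            0,                 0 }, /* never read */
};

/* Look up what to do between a character of class `before` and one of
 * class `after`. */
static uint8_t toy_break_action(enum toy_class before, enum toy_class after)
{
    return toy_table[before][after];
}
```

Because no code point in the release the table was built for maps to `TOY_UNUSED`, the contents of that row and column are irrelevant, which is why overlapping writes to them are harmless.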
* regen/mk_invlists.pl: Make adjacent comment and its code (Karl Williamson, 2018-07-21; 1 file, -2/+2)
* Use Unicode 11.0 (Unicode Consortium, 2018-07-20; 1 file, -10/+5)
    This completes the process of upgrading to Unicode 11.0.