path: root/regen/regcharclass_multi_char_folds.pl
Commit message | Author | Age | Files | Lines
* regcharclass.pl: Get code point folding to a seq | Karl Williamson | 2020-12-19 | 1 | -8/+7
  Previously regcharclass.pl could tell if an input string was a multi-character fold of some Unicode code point. This commit adds the ability to return what that code point is. This capability will be used in a later commit.
* regen/regcharclass_multi_char_folds.pl: Use case fold | Karl Williamson | 2020-10-16 | 1 | -3/+18
  Prior to this commit, only the upper case of Latin1 characters was dealt with. But we really want case folding, and there are a few other characters that fold to Latin1. This commit acknowledges them.
* regen/regcharclass_multi_char_folds.pl: White space, comment only | Karl Williamson | 2020-10-16 | 1 | -17/+16
  Outdent and remove lines from changes in the previous commit.
* regcharclass.h: multi-folds: Add some unfoldeds | Karl Williamson | 2020-10-16 | 1 | -7/+6
  Prior to this commit, the generated macros for dealing with multi-char folds in UTF-8 strings only recognized completely folded strings. This commit changes that to add the uppercase for characters in the Latin1 range.

  Hopefully an example will clarify. The fold for U+0130: LATIN CAPITAL LETTER I WITH DOT ABOVE is 'i' followed by U+0307: COMBINING DOT ABOVE. But since we are doing /i matching, an 'I' followed by U+0307 should also match. This commit changes the macros to know this.

  Before this, if the fold were entirely ASCII, the macros would know all the possible combinations. This commit extends that to all code points < 256. (Since there are no folds to the upper Latin1 range, that really means all code points below 128. But making it general means it wouldn't have to be revised if a fold were ever added to the upper half range.)

  The reason to make this change is that it makes some future code less complicated. And it adds very little complexity to the generated macros; less than the code it will save. I originally thought it would be more complex than it now turns out to be. Much of that is because the infrastructure has advanced since that decision.

  I couldn't find any current places that this change will allow to be simplified. There could be if the macros were extended to do this on all code points, not just the low ones. I tried that, but the generated macros were at least 400 lines longer than before. That does add significant complexity, so I backed that out.
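The U+0130 example in this commit message can be checked from Perl itself. The following is a minimal sketch, assuming Perl 5.16 or later (for `fc`); the variable names are illustrative, and this is not the code of the generated macros, only a check of the folding facts they rely on.

```perl
use strict;
use warnings;
use feature 'fc';    # fc() needs Perl 5.16 or later

# Full case fold of U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE.
my $fold = fc("\x{130}");

# The completely folded form: 'i' followed by U+0307 COMBINING DOT ABOVE.
my $fully_folded = "i\x{307}";

# A partially unfolded form: uppercase 'I' followed by U+0307.
my $unfolded = "I\x{307}";

# Both forms fold to the same string, which is why both should be
# accepted when matching case-insensitively.
my $same_fold = fc($unfolded) eq $fold ? 1 : 0;

printf "fold equals fully folded form: %s\n",
    $fold eq $fully_folded ? "yes" : "no";
printf "unfolded form has the same fold: %s\n",
    $same_fold ? "yes" : "no";
```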
* regcharclass.h: Add some macros | Karl Williamson | 2019-11-16 | 1 | -6/+17
  These macros will be used in a future commit, and are for three-character folds. regen/regcharclass*.pl are changed for this purpose.
* regen/regcharclass_multi_char_folds.pl: Simplify | Karl Williamson | 2019-11-16 | 1 | -9/+12
  This creates a simply named array instead of a more complicated array ref, so it is easier to understand.
* regen/regcharclass_multi_char_folds.pl: Use printable char | Karl Williamson | 2019-11-16 | 1 | -3/+9
  It makes the result more legible if it uses the printable character instead of an escape sequence when appropriate. Although, currently, the value is re-escaped for output. This helped during debugging.
* regen/regcharclass_multi_char_folds.pl: Fix comments | Karl Williamson | 2019-11-16 | 1 | -10/+10
* fix typos | Alexandr Savca | 2018-10-09 | 1 | -1/+1
  Committer: For porting tests: Update $VERSION in 4 files.
  Run:
    ./perl -Ilib regen/mk_invlists.pl
    ./perl -Ilib regen/regcharclass.pl
* There are no folds to multiple chars in early Unicode versions | Karl Williamson | 2015-07-28 | 1 | -0/+2
  Several places require special handling because of this, notably for the lowercase Sharp S, but not in Unicode versions before 3.0.1.
* regen/regcharclass_multi_char_folds.pl: Don't do unnecessary work | Karl Williamson | 2014-05-31 | 1 | -1/+1
  This bit of code is not about just ASCII folds, so skip it when doing just those.
* regen/regcharclass_multi_char_folds.pl: Add some comments | Karl Williamson | 2014-05-30 | 1 | -6/+13
* Don't refer to U+XXXX when mean native | Karl Williamson | 2013-08-29 | 1 | -1/+1
  These messages say the output number is Unicode, but it is really native, so change them to say 0xXXXX.
* Fix multi-char fold edge case | Karl Williamson | 2013-05-20 | 1 | -0/+23
    use locale;
    fc("\N{LATIN CAPITAL LETTER SHARP S}") eq fc("\N{LATIN SMALL LETTER LONG S}") x 2

  should return true, as the SHARP S folds to two 's's in a row, and the LONG S is an antique variant of 's', and folds to 's'. Until this commit, the expression was false.

  Similarly, the following should match, but didn't until this commit:

    "\N{LATIN SMALL LETTER SHARP S}" =~ /\N{LATIN SMALL LETTER LONG S}{2}/iaa

  The reason these didn't work properly is that in both cases the actual fold to 's' is disallowed. In the first case because of locale; and in the second because of /aa. And the code wasn't smart enough to realize that these were legal.

  The fix is to special case these so that the fold of sharp s (both capital and small) is two LONG S's under /aa; as is the fold of the capital sharp s under locale. The latter is user-visible, and the documentation of fc() now points that out. I believe this is such an edge case that no mention of it need be done in perldelta.
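Outside of `use locale` and `/aa`, the relationship this commit relies on can be checked with plain `fc`. The following is a minimal sketch, assuming Perl 5.16 or later; the variable names are illustrative. It deliberately does not exercise the locale or `/aa` special cases, whose results depend on the Perl version and the locale in effect.

```perl
use strict;
use warnings;
use feature 'fc';    # fc() needs Perl 5.16 or later

my $capital_sharp_s = "\x{1E9E}";    # LATIN CAPITAL LETTER SHARP S
my $long_s          = "\x{17F}";     # LATIN SMALL LETTER LONG S

# Under plain Unicode rules the capital sharp s folds to two 's's ...
my $sharp_fold = fc($capital_sharp_s);

# ... and the long s folds to a single 's', so two of them fold to the same.
# Note the operand order: the string goes on the left of 'x', the count on
# the right.
my $long_s_fold_twice = fc($long_s) x 2;

printf "folds agree: %s\n",
    $sharp_fold eq $long_s_fold_twice ? "yes" : "no";
```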
* regen/regcharclass.pl: Change name of generated macro | Karl Williamson | 2012-10-16 | 1 | -9/+13
  This changes the macro isMULTI_CHAR_FOLD() (non-utf8 version) from just generating ASCII-range code points to generating the full Latin1 range. However, there are no such non-ASCII values, so the macro expansion is unchanged. By changing the name, it becomes clearer in future commits that we aren't excluding things that we should be considering.
* Add regen/regcharclass_multi_char_folds.pl | Karl Williamson | 2012-10-09 | 1 | -0/+106
  This takes as input the current Unicode character database, and outputs lists of the multi-character folds in it, in a form suitable for input to regen/regcharclass.pl.
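To give a feel for what "the multi-character folds" are, the sketch below lists them by brute force. It is an assumption-laden illustration, not the script's method: the real script reads the Unicode character database, whereas this swaps in a scan over the Basic Multilingual Plane using `fc` (Perl 5.16 or later). The helper name `multi_char_folds` and the output format are hypothetical.

```perl
use strict;
use warnings;
use feature qw(fc unicode_strings);   # Unicode rules even below U+0100

# Hypothetical helper, not part of the real script: return a hash ref that
# maps each BMP code point whose full case fold is longer than one
# character to that fold.
sub multi_char_folds {
    my %folds;

    # Skip the surrogate range U+D800..U+DFFF; surrogates have no folds.
    for my $cp (0 .. 0xD7FF, 0xE000 .. 0xFFFD) {
        my $fold = fc(chr $cp);
        $folds{$cp} = $fold if length($fold) > 1;
    }
    return \%folds;
}

my $folds = multi_char_folds();

# Print each source code point and the code points of its fold, in hex.
for my $cp (sort { $a <=> $b } keys %$folds) {
    printf "U+%04X => %s\n", $cp,
        join " ", map { sprintf "U+%04X", ord } split //, $folds->{$cp};
}
```

The exact list printed depends on the Unicode version that the running Perl was built with.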