author Karl Williamson <khw@cpan.org> 2020-03-26 15:59:50 -0600
committer Karl Williamson <khw@cpan.org> 2020-10-16 07:01:41 -0600
commit cf9d46fde42340bce59de919e4518881c97b3a85
tree de8071a1d8485bc2e15b82e3dc4ecb80e3dd32f4 /regen
parent fdc26d940a357441833197cb1b9b1d9a4420638e
regcharclass.h: multi-folds: Add some unfoldeds
Prior to this commit, the generated macros for dealing with multi-char folds in UTF-8 strings only recognized completely folded strings. This commit changes that to also recognize the uppercase of characters in the Latin1 range. An example should clarify.

The fold for U+0130: LATIN CAPITAL LETTER I WITH DOT ABOVE is 'i' followed by U+0307: COMBINING DOT ABOVE. But since we are doing /i matching, an 'I' followed by U+0307 should also match. This commit changes the macros to know this.

Before this, if the fold were entirely ASCII, the macros would know all the possible combinations. This commit extends that to all code points < 256. (Since there are no folds to the upper Latin1 range, that really means all code points below 128. But making it general means it won't have to be revised if a fold is ever added to the upper half of the range.)

The reason to make this change is that it makes some future code less complicated, and it adds very little complexity to the generated macros; less than the code it will save. I originally thought it would be more complex than it now turns out to be. Much of that is because the infrastructure has advanced since that decision.

I couldn't find any current places that this change will allow to be simplified. There could be some if the macros were extended to do this on all code points, not just the low ones. I tried that, but the generated macros were at least 400 lines longer than before. That does add significant complexity, so I backed that out.
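For readers who want to see the effect on the U+0130 example, here is a minimal standalone sketch. It is not the code from regen/regcharclass_multi_char_folds.pl: `expand_fold_sketch` and `combinations_sketch` are hypothetical names, and `combinations_sketch` is only an assumed stand-in for what `gen_combinations` does (its real implementation is not shown in this commit).

```perl
use strict;
use warnings;

# Hypothetical helper: turn one fold (a list of code points) into
# per-position alternatives, mirroring the loop changed by this commit.
sub expand_fold_sketch {
    my @fold = @_;
    my @alternatives;
    for my $ord (@fold) {
        if ($ord < 256 && chr($ord) =~ /\p{Cased}/) {
            # Low cased code point: accept it and what uc() returns for it.
            push @alternatives, [ $ord, ord(uc(chr($ord))) ];
        }
        else {
            # Anything else is accepted only as itself.
            push @alternatives, [ $ord ];
        }
    }
    return \@alternatives;
}

# Hypothetical stand-in for gen_combinations(): the Cartesian product of
# the per-position alternatives, each result returned as a string.
sub combinations_sketch {
    my ($alternatives) = @_;
    my @results = ('');
    for my $position (@$alternatives) {
        my @next;
        for my $prefix (@results) {
            push @next, $prefix . chr($_) for @$position;
        }
        @results = @next;
    }
    return @results;
}

# The fold of U+0130 is 'i' (U+0069) followed by U+0307.
my @strings = combinations_sketch(expand_fold_sketch(0x69, 0x307));

# Print each accepted sequence as a list of code points.
for my $string (@strings) {
    print join(' ', map { sprintf 'U+%04X', ord($_) } split(//, $string)), "\n";
}
```

Under these assumptions the sketch yields two sequences, 'i' + U+0307 and 'I' + U+0307. U+0307 itself is left alone because it is not below 256.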
Diffstat (limited to 'regen')
-rw-r--r-- regen/regcharclass_multi_char_folds.pl 13
1 files changed, 6 insertions, 7 deletions
diff --git a/regen/regcharclass_multi_char_folds.pl b/regen/regcharclass_multi_char_folds.pl
index a54b05cb19..ce8e8feaab 100644
--- a/regen/regcharclass_multi_char_folds.pl
+++ b/regen/regcharclass_multi_char_folds.pl
@@ -111,27 +111,26 @@ sub multi_char_folds ($$) {
         # Skip if something else already has this fold
         next if grep { $_ eq $fold } @output_folds;
-        if ($type eq 'u') {
-            push @output_folds, $fold;
-        } # Skip if wants only all-ascii folds, and there is a non-ascii
-        elsif (! grep { chr($_) =~ /[^[:ascii:]]/ } @folds) {
             # If the fold is to a cased letter, replace the entry with an
             # array which also includes its upper case.
             my $this_fold_ref = \@folds;
             for my $j (0 .. @$this_fold_ref - 1) {
                 my $this_ord = $this_fold_ref->[$j];
-                if (chr($this_ord) =~ /\p{Cased}/) {
+                undef $this_fold_ref->[$j];
+
+                if ($this_ord < 256 && chr($this_ord) =~ /\p{Cased}/) {
                     my $uc = ord(uc(chr($this_ord)));
-                    undef $this_fold_ref->[$j];
                     @{$this_fold_ref->[$j]} = ( $this_ord, $uc);
                 }
+                else {
+                    @{$this_fold_ref->[$j]} = ( $this_ord );
+                }
             }
             # Then generate all combinations of upper/lower case of the fold.
             push @output_folds, gen_combinations($this_fold_ref);
-        }
     }
     # \x17F is the small LONG S, which folds to 's'. Both Capital and small