Fix multi-char fold edge case

    use locale;
    fc("\N{LATIN CAPITAL LETTER SHARP S}")
        eq fc("\N{LATIN SMALL LETTER LONG S}") x 2

should return true, as the SHARP S folds to two 's's in a row, and the LONG S is an antique variant of 's' and folds to 's'. Until this commit, the expression was false.

Similarly, the following should match, but didn't until this commit:

    "\N{LATIN SMALL LETTER SHARP S}" =~ /\N{LATIN SMALL LETTER LONG S}{2}/iaa

These didn't work properly because in both cases the actual fold to 's' is disallowed: in the first case because of locale, and in the second because of /aa. The code wasn't smart enough to realize that these matches were nonetheless legal.

The fix is to special-case these so that the fold of sharp s (both capital and small) is two LONG S's under /aa, as is the fold of the capital sharp s under locale. The latter is user-visible, and the documentation of fc() now points that out. I believe this is such an edge case that no mention of it is needed in perldelta.
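
The folds described above can be checked with a short script. This is only a sketch, not part of the commit: the helper name sharp_s_fold_checks and its hash keys are made up for illustration, and it assumes a perl that already contains this fix. It deliberately leaves out `use locale`, since the result there depends on the locale in effect, and checks the plain Unicode fold instead, along with the /iaa match.

```perl
use strict;
use warnings;
use feature 'fc';    # fc() is only available when this feature is enabled

# Hypothetical helper (not from the commit): collects the checks in a hash.
sub sharp_s_fold_checks {
    my $cap_sharp_s   = "\N{LATIN CAPITAL LETTER SHARP S}";    # U+1E9E
    my $small_sharp_s = "\N{LATIN SMALL LETTER SHARP S}";      # U+00DF
    my $long_s        = "\N{LATIN SMALL LETTER LONG S}";       # U+017F

    # The ternaries force each test into boolean context, so a failed
    # match cannot return an empty list and misalign the hash.
    return (
        # Full fold of capital sharp s compared with two folded long s's.
        fc_equal    => (fc($cap_sharp_s) eq (fc($long_s) x 2)) ? 1 : 0,
        # Sharp s against two long s's under /iaa (the case this commit fixes).
        long_s_iaa  => ($small_sharp_s =~ /\N{LATIN SMALL LETTER LONG S}{2}/iaa) ? 1 : 0,
        # Sharp s against ASCII 'ss' under /iaa, which /aa is meant to forbid.
        ascii_s_iaa => ($small_sharp_s =~ /ss/iaa) ? 1 : 0,
    );
}

my %result = sharp_s_fold_checks();
for my $name (sort keys %result) {
    printf "%-12s %s\n", $name, $result{$name} ? "true" : "false";
}
```

Going by the commit message, the first two checks should come out true on a fixed perl, while the third should stay false because /aa prohibits a non-ASCII character from matching ASCII 's' case-insensitively.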
author: Karl Williamson <public@khwilliamson.com> 2013-05-18 08:25:16 -0600
committer: Karl Williamson <public@khwilliamson.com> 2013-05-20 11:01:52 -0600
commit: 1ca267a56acf698557ec1deec44af651acc88696 (patch)
tree: 007f9d40b63b92728a095d5defdad663e239289c /regen/regcharclass_multi_char_folds.pl
parent: 519101418837cf0edacb710b2b38b42dad6e47c1 (diff)
download: perl-1ca267a56acf698557ec1deec44af651acc88696.tar.gz
1 files changed, 23 insertions, 0 deletions
diff --git a/regen/regcharclass_multi_char_folds.pl b/regen/regcharclass_multi_char_folds.pl
index f0fd6b3a89..f04be85c58 100644
--- a/regen/regcharclass_multi_char_folds.pl
+++ b/regen/regcharclass_multi_char_folds.pl
@@ -104,6 +104,29 @@ sub multi_char_folds ($) {
         }
     }
 
+    # \x17F is the small LONG S, which folds to 's'.  Both Capital and small
+    # LATIN SHARP S fold to 'ss'.  Therefore, they should also match two 17F's
+    # in a row under regex /i matching.  But under /iaa regex matching, all
+    # three folds to 's' are prohibited, but the sharp S's should still match
+    # two 17F's.  This prohibition causes our regular regex algorithm that
+    # would ordinarily allow this match to fail.  This is the only instance in
+    # all Unicode of this kind of issue.  By adding a special case here, we
+    # can use the regular algorithm (with some other changes elsewhere as
+    # well).
+    #
+    # It would be possible to re-write the above code to automatically detect
+    # and handle this case, and any others that might eventually get added to
+    # the Unicode standard, but I (khw) don't think it's worth it.  I believe
+    # that it's extremely unlikely that more folds to ASCII characters are
+    # going to be added, and if I'm wrong, fold_grind.t has the intelligence
+    # to detect them, and test that they work, at which point another special
+    # case could be added here if necessary.
+    #
+    # No combinations of this with 's' need be added, as any of these
+    # containing 's' are prohibited under /iaa.
+    push @folds, "\"\x{17F}\x{17F}\"";
+
+
     return @folds;
 }