Allow \N{named seq} in qr/[...]/

This commit changes the regex handler to properly match, in many instances, a \N{named sequence} in a bracketed character class.

A named sequence is a string of multiple characters that is given a single name. Unicode has hundreds of them, like LATIN CAPITAL LETTER A WITH MACRON AND GRAVE. These are encoded by Unicode when there is some user community that thinks of the conglomeration as a single unit, but there was no prior standard that had it so, and it is possible to encode it in Unicode using other means, typically a sequence of a base character followed by some combining marks. (If there had not been such a prior standard, 8859-1, things like LATIN CAPITAL LETTER A WITH GRAVE would have been put into Unicode this way too.) If Unicode did not do it this way, it would run out of available code points much sooner.

Not having these as single characters adds a burden to the programmer who has to deal with them. Hiding this detail as much as possible makes it easier to program. This commit hides it in one more place than previously. It takes advantage of the infrastructure, added some releases ago, that deals with the fact that the case-insensitive match of some single characters can be 2 or even 3 characters long. "ss" =~ /[ß]/i; is the most prominent example.

We earlier discovered that /[^ß]/ leads to unexpected behavior, and it is also unclear what is meant when one of these sequences is used as an endpoint in a range. This commit leaves existing behavior for those cases. That behavior is to use just the first code point in the sequence for regular [...], and to generate a fatal syntax error for (?[...]).
author: Karl Williamson <khw@cpan.org> 2014-09-05 09:09:28 -0600
committer: Karl Williamson <khw@cpan.org> 2014-09-06 21:44:49 -0600
commit: 8f0cd35a38dde9ab975f5ee1a663b81939e17745 (patch)
tree: 1b79e320980b4937f349841c068458ce5d68c529 /pod/perlrecharclass.pod
parent: a5454c469023876ca9422440f302f587dba2a438 (diff)
download: perl-8f0cd35a38dde9ab975f5ee1a663b81939e17745.tar.gz
1 files changed, 47 insertions, 17 deletions
diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod
index a8dda141a9..5cd0ae7aab 100644
--- a/pod/perlrecharclass.pod
+++ b/pod/perlrecharclass.pod
@@ -457,30 +457,59 @@ Examples:
 
  -------
 
-* There is an exception to a bracketed character class matching a
-single character only.  When the class is to match caselessly under C</i>
-matching rules, and a character that is explicitly mentioned inside the
-class matches a
+* There are two exceptions to a bracketed character class matching a
+single character only.  Each requires special handling by Perl to make
+things work:
+
+=over
+
+=item *
+
+When the class is to match caselessly under C</i> matching rules, and a
+character that is explicitly mentioned inside the class matches a
 multiple-character sequence caselessly under Unicode rules, the class
-(when not L<inverted|/Negation>) will also match that sequence.  For
-example, Unicode says that the letter C<LATIN SMALL LETTER SHARP S>
-should match the sequence C<ss> under C</i> rules.  Thus,
+will also match that sequence.  For example, Unicode says that the
+letter C<LATIN SMALL LETTER SHARP S> should match the sequence C<ss>
+under C</i> rules.  Thus,
 
  'ss' =~ /\A\N{LATIN SMALL LETTER SHARP S}\z/i             # Matches
  'ss' =~ /\A[aeioust\N{LATIN SMALL LETTER SHARP S}]\z/i    # Matches
 
-For this to happen, the character must be explicitly specified, and not
-be part of a multi-character range (not even as one of its endpoints).
-(L</Character Ranges> will be explained shortly.)  Therefore,
+For this to happen, the class must not be inverted (see L</Negation>)
+and the character must be explicitly specified, and not be part of a
+multi-character range (not even as one of its endpoints).  (L</Character
+Ranges> will be explained shortly.) Therefore,
 
  'ss' =~ /\A[\0-\x{ff}]\z/i        # Doesn't match
  'ss' =~ /\A[\0-\N{LATIN SMALL LETTER SHARP S}]\z/i    # No match
- 'ss' =~ /\A[\xDF-\xDF]\z/i    # Matches on ASCII platforms, since \XDF
-                               # is LATIN SMALL LETTER SHARP S, and the
-                               # range is just a single element
+ 'ss' =~ /\A[\xDF-\xDF]\z/i    # Matches on ASCII platforms, since
+                               # \XDF is LATIN SMALL LETTER SHARP S,
+                               # and the range is just a single
+                               # element
 
 Note that it isn't a good idea to specify these types of ranges anyway.
 
+=item *
+
+Some names known to C<\N{...}> refer to a sequence of multiple characters,
+instead of the usual single character.  When one of these is included in
+the class, the entire sequence is matched.  For example,
+
+  "\N{TAMIL LETTER KA}\N{TAMIL VOWEL SIGN AU}"
+                              =~ / ^ [\N{TAMIL SYLLABLE KAU}]  $ /x;
+
+matches, because C<\N{TAMIL SYLLABLE KAU}> is a named sequence
+consisting of the two characters matched against.  Like the other
+instance where a bracketed class can match multi characters, and for
+similar reasons, the class must not be inverted, and the named sequence
+may not appear in a range, even one where it is both endpoints.  If
+these happen, it is a fatal error if the character class is within an
+extended L<C<(?[...])>|/Extended Bracketed Character Classes>
+class; and only the first code point is used (with
+a C<regexp>-type warning raised) otherwise.
+
+=back
+
 =head3 Special Characters Inside a Bracketed Character Class
 
 Most characters that are meta characters in regular expressions (that
@@ -597,9 +626,10 @@ the caret as one of the characters to match, either escape the caret or
 else don't list it first.
 
 In inverted bracketed character classes, Perl ignores the Unicode rules
-that normally say that certain characters should match a sequence of
-multiple characters under caseless C</i> matching.  Following those
-rules could lead to highly confusing situations:
+that normally say that named sequence, and certain characters should
+match a sequence of multiple characters use under caseless C</i>
+matching.  Following those rules could lead to highly confusing
+situations:
 
  "ss" =~ /^[^\xDF]+$/ui;   # Matches!
 
@@ -608,7 +638,7 @@ what C<\xDF> matches under C</i>.  C<"s"> isn't C<\xDF>, but Unicode
 says that C<"ss"> is what C<\xDF> matches under C</i>.  So which one
 "wins"? Do you fail the match because the string has C<ss> or accept it
 because it has an C<s> followed by another C<s>?  Perl has chosen the
-latter.
+latter.  (See note in L</Bracketed Character Classes> above.)
 
 Examples: