Teach Perl about Unicode named character sequences

mktables is changed to process the Unicode named sequence file. charnames.pm is changed to cache the looked-up values in utf8. A new function, string_vianame(), is created that can handle named sequences, as the interface for vianame() cannot. The subroutine lookup_name() is slightly refactored to do almost all of the common work for \N{} and the vianame routines. It now understands named sequences as created by mktables. Tests and documentation are added. In the randomized testing section, half of the tests use vianame() and half use string_vianame().
author: Karl Williamson <public@khwilliamson.com> 2010-09-12 21:33:12 -0600
committer: Father Chrysostomos <sprout@cpan.org> 2010-09-25 00:47:02 -0700
commit: fb121860c2407cd1d1566d63a95a5220fa93d8e4 (patch)
tree: cc61893dd3ffe9966e079addeaa538172e2290e9 /pod
parent: 8ebef31d4feab4b7c35ff0eb427632a67b1abdd9 (diff)
download: perl-fb121860c2407cd1d1566d63a95a5220fa93d8e4.tar.gz
6 files changed, 36 insertions, 19 deletions
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index abbca524c3..b21f2534b3 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -53,6 +53,21 @@ The C<"d"> modifier is used in the scope of C<use locale> to compile the
 regular expression as if it were not in that scope.
 See L<perlre/(?dlupimsx-imsx)>.
 
+=head2 C<\N{...}> now handles Unicode named character sequences
+
+Unicode has a number of named character sequences, in which particular sequences
+of code points are given names.  C<\N{...}> now recognizes these.
+See L<charnames>.
+
+=head2 New function C<charnames::string_vianame()>
+
+This function is a run-time version of C<\N{...}>, returning the string
+of characters whose Unicode name is its parameter.  It can handle
+Unicode named character sequences, whereas the pre-existing
+C<charnames::vianame()> cannot, as the latter returns a single code
+point.
+See L<charnames>.
+
 =head1 Security
 
 XXX Any security-related notices go here.  In particular, any security
diff --git a/pod/perlop.pod b/pod/perlop.pod
index dc5118c300..d5ca94262a 100644
--- a/pod/perlop.pod
+++ b/pod/perlop.pod
@@ -1029,7 +1029,7 @@ X<\o{}>
     \e                  escape            (ESC)
     \x{263a}     [1,8]  hex char          (example: SMILEY)
     \x1b         [2,8]  restricted range hex char (example: ESC)
-    \N{name}     [3]    named Unicode character
+    \N{name}     [3]    named Unicode character or character sequence
     \N{U+263D}   [4,8]  Unicode character (example: FIRST QUARTER MOON)
     \c[          [5]    control char      (example: chr(27))
     \o{23072}    [6,8]  octal char        (example: SMILEY)
@@ -1073,7 +1073,7 @@ For example:
 
 =item [3]
 
-The result is the Unicode character given by I<name>.
+The result is the Unicode character or character sequence given by I<name>.
 See L<charnames>.
 
 =item [4]
diff --git a/pod/perlre.pod b/pod/perlre.pod
index b9216c156c..88089ee1d7 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -231,7 +231,7 @@ also work:
  \e          escape (think troff)  (ESC)
  \cK         control char          (example: VT)
  \x{}, \x00  character whose ordinal is the given hexadecimal number
- \N{name}    named Unicode character
+ \N{name}    named Unicode character or character sequence
  \N{U+263D}  Unicode character     (example: FIRST QUARTER MOON)
  \o{}, \000  character whose ordinal is the given octal number
  \l          lowercase next char (think vi)
@@ -316,9 +316,9 @@ See L</Extended Patterns> below for details.
 =item [7]
 
 Note that C<\N> has two meanings.  When of the form C<\N{NAME}>, it matches the
-character whose name is C<NAME>; and similarly when of the form
-C<\N{U+I<wide hex char>}>, it matches the character whose Unicode ordinal is
-I<wide hex char>.  Otherwise it matches any character but C<\n>.
+character or character sequence whose name is C<NAME>; and similarly
+when of the form C<\N{U+I<hex>}>, it matches the character whose Unicode
+code point is I<hex>.  Otherwise it matches any character but C<\n>.
 
 =back
 
diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod
index eb51d94305..a9257c7d82 100644
--- a/pod/perlrebackslash.pod
+++ b/pod/perlrebackslash.pod
@@ -85,7 +85,7 @@ as C<Not in [].>
  \L                Lowercase till \E.  Not in [].
  \n                (Logical) newline character.
  \N                Any character but newline.  Experimental.  Not in [].
- \N{}              Named or numbered (Unicode) character.
+ \N{}              Named or numbered (Unicode) character or sequence.
  \o{}              Octal escape sequence.
  \p{}, \pP         Character with the given Unicode property.
  \P{}, \PP         Character without the given Unicode property.
@@ -165,14 +165,15 @@ Mnemonic: I<c>ontrol character.
 
  $str =~ /\cK/;  # Matches if $str contains a vertical tab (control-K).
 
-=head3 Named or numbered characters
+=head3 Named or numbered characters and character sequences
 
 Unicode characters have a Unicode name and numeric ordinal value.  Use the
 C<\N{}> construct to specify a character by either of these values.
+Certain sequences of characters also have names.
 
-To specify by name, the name of the character goes between the curly braces.
-In this case, you have to C<use charnames> to load the Unicode names of the
-characters, otherwise Perl will complain.
+To specify by name, the name of the character or character sequence goes
+between the curly braces.  In this case, you have to C<use charnames> to
+load the Unicode names of the characters, otherwise Perl will complain.
 
 To specify a character by Unicode code point, use the form
 C<\N{U+I<wide hex character>}>, where I<wide hex character> is a number in
@@ -183,8 +184,8 @@ C<LATIN CAPITAL LETTER A>, and you will rarely see it written without the two
 leading zeros.  C<\N{U+0041}> means "A" even on EBCDIC machines (where the
 ordinal value of "A" is not 0x41).
 
-It is even possible to give your own names to characters, and even to short
-sequences of characters.  For details, see L<charnames>.
+It is even possible to give your own names to characters and character
+sequences.  For details, see L<charnames>.
 
 (There is an expanded internal form that you may see in debug output:
 C<\N{U+I<wide hex character>.I<wide hex character>...}>.
@@ -194,9 +195,9 @@ form only, subject to change, and you should not try to use it yourself.)
 
 Mnemonic: I<N>amed character.
 
-Note that a character that is expressed as a named or numbered character is
-considered as a character without special meaning by the regex engine, and will
-match "as is".
+Note that a character or character sequence that is expressed as a named
+or numbered character is considered as a character without special
+meaning by the regex engine, and will match "as is".
 
 =head4 Example
 
@@ -572,7 +573,8 @@ identical to the C<.> metasymbol, except under the C</s> flag, which changes
 the meaning of C<.>, but not C<\N>.
 
 Note that C<\N{...}> can mean a
-L<named or numbered character|/Named or numbered characters>.
+L<named or numbered character
+|/Named or numbered characters and character sequences>.
 
 Mnemonic: Complement of I<\n>.
 
diff --git a/pod/perlreref.pod b/pod/perlreref.pod
index 01d57cc4e6..6e028ee9c3 100644
--- a/pod/perlreref.pod
+++ b/pod/perlreref.pod
@@ -94,7 +94,7 @@ These work as in normal strings.
    \x7f     Char whose ordinal is the 2 hex digits, max \xFF
    \x{263a} Char whose ordinal is the hex number, unrestricted
    \cx      Control-x
-   \N{name} A named Unicode character
+   \N{name} A named Unicode character or character sequence
    \N{U+263D} A Unicode character by hex ordinal
 
    \l  Lowercase next character
diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 54ce2f0a1c..f0b2be5a40 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -248,7 +248,7 @@ characters:
 
 Note that both C<\x{...}> and C<\N{...}> are compile-time string
 constants: you cannot use variables in them.  if you want similar
-run-time functionality, use C<chr()> and C<charnames::vianame()>.
+run-time functionality, use C<chr()> and C<charnames::string_vianame()>.
 
 If you want to force the result to Unicode characters, use the special
 C<"U0"> prefix.  It consumes no arguments but causes the following bytes
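
The run-time lookup that this change documents can be illustrated with a minimal sketch. It assumes a perl that already ships charnames::string_vianame() (5.14 or later) and an ASCII platform; the two names looked up are illustrative choices, not taken from the patch. The sketch uses vianame() only for a single character, since the patch states that vianame() cannot handle named sequences.

```perl
use strict;
use warnings;
use charnames ':full';    # load the full Unicode names for this scope

# Single character: vianame() returns a code point,
# string_vianame() returns the character itself as a string.
my $cp  = charnames::vianame('LATIN CAPITAL LETTER A');
my $chr = charnames::string_vianame('LATIN CAPITAL LETTER A');

# Named sequence whose name is only known at run time (illustrative name).
my $name = 'KATAKANA LETTER AINU P';
my $seq  = charnames::string_vianame($name);

# Render every character of the returned string as U+XXXX.
my $points = join ' ', map { sprintf 'U+%04X', ord } split //, $seq;

printf "vianame:        U+%04X\n", $cp;
printf "string_vianame: %d character(s): %s\n", length($seq), $points;
```

Because string_vianame() returns a string rather than a single code point, the same call works for both single characters and sequences, so callers need not know in advance which kind of name they hold.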