author | Jeffrey Friedl <jfriedl@regex.info> | 2001-12-16 03:36:32 -0800 |
---|---|---|
committer | Jarkko Hietaniemi <jhi@iki.fi> | 2001-12-17 16:57:57 +0000 |
commit | fc48dff7b7bc73c9dc6c44005736d101ad52bf90 | |
tree | 8223d6836144cc0590e9a6a2243fbfa8e37ea3ae /pod/perluniintro.pod | |
parent | 57af5541bbf51a20321557b8e12b6b22535e6ada | |
Will the real Unicode encoding please stand up?
Message-Id: <200112161936.fBGJaWe41263@ventrue.corp.yahoo.com>
p4raw-id: //depot/perl@13726
Diffstat (limited to 'pod/perluniintro.pod')
-rw-r--r-- | pod/perluniintro.pod | 45 |
1 files changed, 25 insertions, 20 deletions
diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 0ecfba0662..67ce214568 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -358,18 +358,23 @@ its argument so that Unicode characters with code points greater than
 255 are displayed as "\x{...}", control characters (like "\n") are
 displayed as "\x..", and the rest of the characters as themselves.
 
-sub nice_string {
-    join("",
-      map { $_ > 255 ?                  # if wide character...
-            sprintf("\\x{%x}", $_) :    # \x{...}
-            chr($_) =~ /[[:cntrl:]]/ ?  # else if control character ...
-            sprintf("\\x%02x", $_) :    # \x..
-            chr($_) }                   # else as themselves
-          unpack("U*", $_[0]));         # unpack Unicode characters
-}
-
-For example, C<nice_string("foo\x{100}bar\n")> will return
-C<"foo\x{100}bar\x0a">.
+    sub nice_string {
+        join("",
+          map { $_ > 255 ?                  # if wide character...
+                sprintf("\\x{%x}", $_) :    # \x{...}
+                chr($_) =~ /[[:cntrl:]]/ ?  # else if control character ...
+                sprintf("\\x%02x", $_) :    # \x..
+                chr($_)                     # else as themselves
+          } unpack("U*", $_[0]));           # unpack Unicode characters
+    }
+
+For example,
+
+    nice_string("foo\x{100}bar\n")
+
+will return:
+
+    "foo\x{100}bar\x0a"
 
 =head2 Special Cases
 
@@ -423,7 +428,7 @@ C<LATIN CAPITAL LETTER A>?)
 
 The short answer is that by default Perl compares equivalence (C<eq>,
 C<ne>) based only on code points of the characters.
-In the above case, no (because 0x00C1 != 0x0041). But sometimes any
+In the above case, the answer is no (because 0x00C1 != 0x0041). But sometimes any
 CAPITAL LETTER As being considered equal, or even any As of any case,
 would be desirable.
 
@@ -433,7 +438,7 @@ Reports #15 and #21, I<Unicode Normalization Forms> and I<Case
 Mappings>, http://www.unicode.org/unicode/reports/tr15/
 http://www.unicode.org/unicode/reports/tr21/
 
-As of Perl 5.8.0, the's regular expression case-ignoring matching
+As of Perl 5.8.0, regular expression case-ignoring matching
 implements only 1:1 semantics: one character matches one character.
 In I<Case Mappings> both 1:N and N:1 matches are defined.
 
@@ -447,9 +452,9 @@ parlance goes, collated. But again, what do you mean by collate?
 (Does C<LATIN CAPITAL LETTER A WITH ACUTE> come before or after
 C<LATIN CAPITAL LETTER A WITH GRAVE>?)
 
-The short answer is that by default Perl compares strings (C<lt>,
+The short answer is that by default, Perl compares strings (C<lt>,
 C<le>, C<cmp>, C<ge>, C<gt>) based only on the code points of the
-characters. In the above case, after, since 0x00C1 > 0x00C0.
+characters. In the above case, the answer is "after", since 0x00C1 > 0x00C0.
 
 The long answer is that "it depends", and a good answer cannot be
 given without knowing (at the very least) the language context.
@@ -468,12 +473,12 @@ Character Ranges
 
 Character ranges in regular expression character classes (C</[a-z]/>)
 and in the C<tr///> (also known as C<y///>) operator are not magically
-Unicode-aware. What this means that C<[a-z]> will not magically start
+Unicode-aware. What this means that C<[A-Za-z]> will not magically start
 to mean "all alphabetic letters" (not that it does mean that even for
 8-bit characters, you should be using C</[[:alpha]]/> for that).
 
-For specifying things like that in regular expressions you can use the
-various Unicode properties, C<\pL> in this particular case. You can
+For specifying things like that in regular expressions, you can use the
+various Unicode properties, C<\pL> or perhaps C<\p{Alphabetic}>, in this particular case. You can
 use Unicode code points as the end points of character ranges, but
 that means that particular code point range, nothing more. For
 further information, see L<perlunicode>.
@@ -485,7 +490,7 @@ String-To-Number Conversions
 Unicode does define several other decimal (and numeric) characters
 than just the familiar 0 to 9, such as the Arabic and Indic digits.
 Perl does not support string-to-number conversion for digits other
-than the 0 to 9 (and a to f for hexadecimal).
+than ASCII 0 to 9 (and ASCII a to f for hexadecimal).
 
 =back
 