Diffstat (limited to 'pod/perlretut.pod')
-rw-r--r-- | pod/perlretut.pod | 63 |
1 files changed, 20 insertions, 43 deletions
diff --git a/pod/perlretut.pod b/pod/perlretut.pod
index c1f37fe4b7..da3e82c74f 100644
--- a/pod/perlretut.pod
+++ b/pod/perlretut.pod
@@ -1841,27 +1841,21 @@ substituted.
 With the advent of 5.6.0, Perl regexps can handle more than just the
 standard ASCII character set. Perl now supports I<Unicode>, a standard
 for representing the alphabets from virtually all of the world's written
-languages, and a host of symbols. Perl uses the UTF-8 encoding, in which
-ASCII characters are still encoded as one byte, but characters greater
-than C<chr(127)> may be stored as two or more bytes.
+languages, and a host of symbols. Perl's text strings are unicode strings, so
+they can contain characters with a value (codepoint or character number) higher
+than 255
 
 What does this mean for regexps? Well, regexp users don't need to know
 much about Perl's internal representation of strings. But they do need
-to know 1) how to represent Unicode characters in a regexp and 2) when
-a matching operation will treat the string to be searched as a
-sequence of bytes (the old way) or as a sequence of Unicode characters
-(the new way). The answer to 1) is that Unicode characters greater
-than C<chr(127)> may be represented using the C<\x{hex}> notation,
-with C<hex> a hexadecimal integer:
+to know 1) how to represent Unicode characters in a regexp and 2) that
+a matching operation will treat the string to be searched as a sequence
+of characters, not bytes. The answer to 1) is that Unicode characters
+greater than C<chr(255)> are represented using the C<\x{hex}> notation,
+because the \0 octal and \x hex (without curly braces) don't go further
+than 255.
 
     /\x{263a}/; # match a Unicode smiley face :)
 
-Unicode characters in the range of 128-255 use two hexadecimal digits
-with braces: C<\x{ab}>. Note that this is in general different than
-C<\xab>, which is just a hexadecimal byte with no Unicode significance,
-except when your script is encoded in UTF-8 where C<\xab> has the
-same byte representation as C<\x{ab}>.
-
 B<NOTE>: In Perl 5.6.0 it used to be that one needed to say C<use
 utf8> to use any Unicode features. This is no more the case: for
 almost all Unicode processing, the explicit C<utf8> pragma is not
@@ -1896,34 +1890,17 @@ A list of full names is found in the file NamesList.txt in the
 lib/perl5/X.X.X/unicore directory (where X.X.X is the perl
 version number as it is installed on your system).
 
-The answer to requirement 2), as of 5.6.0, is that if a regexp
-contains Unicode characters, the string is searched as a sequence of
-Unicode characters. Otherwise, the string is searched as a sequence of
-bytes. If the string is being searched as a sequence of Unicode
-characters, but matching a single byte is required, we can use the C<\C>
-escape sequence. C<\C> is a character class akin to C<.> except that
-it matches I<any> byte 0-255. So
-
-    use charnames ":full"; # use named chars with Unicode full names
-    $x = "a";
-    $x =~ /\C/; # matches 'a', eats one byte
-    $x = "";
-    $x =~ /\C/; # doesn't match, no bytes to match
-    $x = "\N{MERCURY}"; # two-byte Unicode character
-    $x =~ /\C/; # matches, but dangerous!
-
-The last regexp matches, but is dangerous because the string
-I<character> position is no longer synchronized to the string I<byte>
-position. This generates the warning 'Malformed UTF-8
-character'. The C<\C> is best used for matching the binary data in strings
-with binary data intermixed with Unicode characters.
-
-Let us now discuss the rest of the character classes. Just as with
-Unicode characters, there are named Unicode character classes
-represented by the C<\p{name}> escape sequence. Closely associated is
-the C<\P{name}> character class, which is the negation of the
-C<\p{name}> class. For example, to match lower and uppercase
-characters,
+The answer to requirement 2), as of 5.6.0, is that a regexp uses unicode
+characters. Internally, this is encoded to bytes using either UTF-8 or a
+native 8 bit encoding, depending on the history of the string, but
+conceptually it is a sequence of characters, not bytes. See
+L<perlunitut> for a tutorial about that.
+
+Let us now discuss Unicode character classes. Just as with Unicode
+characters, there are named Unicode character classes represented by the
+C<\p{name}> escape sequence. Closely associated is the C<\P{name}>
+character class, which is the negation of the C<\p{name}> class. For
+example, to match lower and uppercase characters,
 
     use charnames ":full"; # use named chars with Unicode full names
     $x = "BOB";