path: root/pod/perlretut.pod
Diffstat (limited to 'pod/perlretut.pod')
-rw-r--r-- pod/perlretut.pod 63
1 files changed, 20 insertions, 43 deletions
diff --git a/pod/perlretut.pod b/pod/perlretut.pod
index c1f37fe4b7..da3e82c74f 100644
--- a/pod/perlretut.pod
+++ b/pod/perlretut.pod
@@ -1841,27 +1841,21 @@ substituted.
With the advent of 5.6.0, Perl regexps can handle more than just the
standard ASCII character set. Perl now supports I<Unicode>, a standard
for representing the alphabets from virtually all of the world's written
-languages, and a host of symbols. Perl uses the UTF-8 encoding, in which
-ASCII characters are still encoded as one byte, but characters greater
-than C<chr(127)> may be stored as two or more bytes.
+languages, and a host of symbols. Perl's text strings are Unicode strings, so
+they can contain characters with a value (code point or character number) higher
+than 255.
What does this mean for regexps? Well, regexp users don't need to know
much about Perl's internal representation of strings. But they do need
-to know 1) how to represent Unicode characters in a regexp and 2) when
-a matching operation will treat the string to be searched as a
-sequence of bytes (the old way) or as a sequence of Unicode characters
-(the new way). The answer to 1) is that Unicode characters greater
-than C<chr(127)> may be represented using the C<\x{hex}> notation,
-with C<hex> a hexadecimal integer:
+to know 1) how to represent Unicode characters in a regexp and 2) that
+a matching operation will treat the string to be searched as a sequence
+of characters, not bytes. The answer to 1) is that Unicode characters
+greater than C<chr(255)> are represented using the C<\x{hex}> notation,
+because the C<\0> octal and C<\x> hex escapes (without curly braces)
+don't go further than 255.
/\x{263a}/; # match a Unicode smiley face :)
-Unicode characters in the range of 128-255 use two hexadecimal digits
-with braces: C<\x{ab}>. Note that this is in general different than
-C<\xab>, which is just a hexadecimal byte with no Unicode significance,
-except when your script is encoded in UTF-8 where C<\xab> has the
-same byte representation as C<\x{ab}>.
-
B<NOTE>: In Perl 5.6.0 it used to be that one needed to say C<use
utf8> to use any Unicode features. This is no more the case: for
almost all Unicode processing, the explicit C<utf8> pragma is not
@@ -1896,34 +1890,17 @@ A list of full names is found in the file NamesList.txt in the
lib/perl5/X.X.X/unicore directory (where X.X.X is the perl
version number as it is installed on your system).
-The answer to requirement 2), as of 5.6.0, is that if a regexp
-contains Unicode characters, the string is searched as a sequence of
-Unicode characters. Otherwise, the string is searched as a sequence of
-bytes. If the string is being searched as a sequence of Unicode
-characters, but matching a single byte is required, we can use the C<\C>
-escape sequence. C<\C> is a character class akin to C<.> except that
-it matches I<any> byte 0-255. So
-
- use charnames ":full"; # use named chars with Unicode full names
- $x = "a";
- $x =~ /\C/; # matches 'a', eats one byte
- $x = "";
- $x =~ /\C/; # doesn't match, no bytes to match
- $x = "\N{MERCURY}"; # two-byte Unicode character
- $x =~ /\C/; # matches, but dangerous!
-
-The last regexp matches, but is dangerous because the string
-I<character> position is no longer synchronized to the string I<byte>
-position. This generates the warning 'Malformed UTF-8
-character'. The C<\C> is best used for matching the binary data in strings
-with binary data intermixed with Unicode characters.
-
-Let us now discuss the rest of the character classes. Just as with
-Unicode characters, there are named Unicode character classes
-represented by the C<\p{name}> escape sequence. Closely associated is
-the C<\P{name}> character class, which is the negation of the
-C<\p{name}> class. For example, to match lower and uppercase
-characters,
+The answer to requirement 2), as of 5.6.0, is that a regexp uses Unicode
+characters. Internally, the string is encoded to bytes using either UTF-8 or a
+native 8-bit encoding, depending on the history of the string, but
+conceptually it is a sequence of characters, not bytes. See
+L<perlunitut> for a tutorial about that.
+
+Let us now discuss Unicode character classes. Just as with Unicode
+characters, there are named Unicode character classes represented by the
+C<\p{name}> escape sequence. Closely associated is the C<\P{name}>
+character class, which is the negation of the C<\p{name}> class. For
+example, to match lower and uppercase characters,
use charnames ":full"; # use named chars with Unicode full names
$x = "BOB";