author: Karl Williamson <public@khwilliamson.com> 2010-08-09 11:35:51 -0600
committer: Rafael Garcia-Suarez <rgs@consttype.org> 2010-08-11 10:12:24 +0200
commit: 17657a39a2e6771c7aa399eb1696b9a06bbd58f9
tree: 32c27b2e38c111eae20b5cc99ade689b4d86e241 /pod/perlrecharclass.pod
parent: 6671dd37b1706ba1c924653baae22e7d739fcdd4
perlrecharclass: Document subtlety in Unicode
The documentation had failed to mention that a regex pattern in utf8 encoding forces a Unicode interpretation on a non-utf8 string.
Diffstat (limited to 'pod/perlrecharclass.pod')
-rw-r--r-- pod/perlrecharclass.pod 101
1 files changed, 53 insertions, 48 deletions
diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod
index aad038db72..7fcb92d421 100644
--- a/pod/perlrecharclass.pod
+++ b/pod/perlrecharclass.pod
@@ -75,13 +75,13 @@ character classes, see L<perlrebackslash>.)
=head3 Digits
C<\d> matches a single character that is considered to be a decimal I<digit>.
-What is considered a decimal digit depends on the internal encoding of the
-source string and the locale that is in effect. If the source string is in
-UTF-8 format, C<\d> not only matches the digits '0' - '9', but also Arabic,
-Devanagari and digits from other languages. Otherwise, if there is a locale in
-effect, it will match whatever characters the locale considers decimal digits.
-Without a locale, C<\d> matches just the digits '0' to '9'.
-See L</Locale, EBCDIC, Unicode and UTF-8>.
+What is considered a decimal digit depends on several factors, detailed
+below in L</Locale, EBCDIC, Unicode and UTF-8>. If those factors
+indicate a Unicode interpretation, C<\d> not only matches the digits
+'0' - '9', but also Arabic, Devanagari and digits from other languages.
+Otherwise, if there is a locale in effect, it will match whatever
+characters the locale considers decimal digits. Without a locale, C<\d>
+matches just the digits '0' to '9'.
Unicode digits may cause some confusion, and some security issues. In UTF-8
strings, C<\d> matches the same characters matched by
@@ -116,15 +116,15 @@ A C<\w> matches a single alphanumeric character (an alphabetic character, or a
decimal digit) or an underscore (C<_>), not a whole word. To match a whole
word, use C<\w+>. This isn't the same thing as matching an English word, but
is the same as a string of Perl-identifier characters. What is considered a
-word character depends on the internal
-encoding of the string and the locale or EBCDIC code page that is in effect. If
-it's in UTF-8 format, C<\w> matches those characters that are considered word
+word character depends on several factors, detailed below in L</Locale,
+EBCDIC, Unicode and UTF-8>. If those factors indicate a Unicode
+interpretation, C<\w> matches the characters that are considered word
characters in the Unicode database. That is, it not only matches ASCII letters,
-but also Thai letters, Greek letters, etc. If the source string isn't in UTF-8
-format, C<\w> matches those characters that are considered word characters by
-the current locale or EBCDIC code page. Without a locale or EBCDIC code page,
-C<\w> matches the ASCII letters, digits and the underscore.
-See L</Locale, EBCDIC, Unicode and UTF-8>.
+but also Thai letters, Greek letters, etc. If a Unicode interpretation
+is not indicated, C<\w> matches those characters that are considered
+word characters by the current locale or EBCDIC code page. Without a
+locale or EBCDIC code page, C<\w> matches the ASCII letters, digits and
+the underscore.
There are a number of security issues with the full Unicode list of word
characters. See L<http://unicode.org/reports/tr36>.
@@ -139,19 +139,19 @@ Any character that isn't matched by C<\w> will be matched by C<\W>.
=head3 Whitespace
C<\s> matches any single character that is considered whitespace. The exact
-set of characters matched by C<\s> depends on whether the source string is in
-UTF-8 format and the locale or EBCDIC code page that is in effect. If it's in
-UTF-8 format, C<\s> matches what is considered whitespace in the Unicode
-database; the complete list is in the table below. Otherwise, if there is a
-locale or EBCDIC code page in effect, C<\s> matches whatever is considered
-whitespace by the current locale or EBCDIC code page. Without a locale or
-EBCDIC code page, C<\s> matches the horizontal tab (C<\t>), the newline
-(C<\n>), the form feed (C<\f>), the carriage return (C<\r>), and the space.
-(Note that it doesn't match the vertical tab, C<\cK>.) Perhaps the most notable
-possible surprise is that C<\s> matches a non-breaking space only if the
-non-breaking space is in a UTF-8 encoded string or the locale or EBCDIC code
-page that is in effect has that character.
-See L</Locale, EBCDIC, Unicode and UTF-8>.
+set of characters matched by C<\s> depends on several factors, detailed
+below in L</Locale, EBCDIC, Unicode and UTF-8>. If those factors
+indicate a Unicode interpretation, C<\s> matches what is considered
+whitespace in the Unicode database; the complete list is in the table
+below. Otherwise, if there is a locale or EBCDIC code page in effect,
+C<\s> matches whatever is considered whitespace by the current locale or
+EBCDIC code page. Without a locale or EBCDIC code page, C<\s> matches
+the horizontal tab (C<\t>), the newline (C<\n>), the form feed (C<\f>),
+the carriage return (C<\r>), and the space. (Note that it doesn't match
+the vertical tab, C<\cK>.) Perhaps the most notable possible surprise
+is that C<\s> matches a non-breaking space only if a Unicode
+interpretation is indicated, or the locale or EBCDIC code page that is
+in effect has that character.
Any character that isn't matched by C<\s> will be matched by C<\S>.
@@ -172,9 +172,8 @@ class; use C<\v> instead (vertical whitespace).
Details are discussed in L<perlrebackslash>.
Note that unlike C<\s>, C<\d> and C<\w>, C<\h> and C<\v> always match
-the same characters, regardless whether the source string is in UTF-8
-format or not. The set of characters they match is also not influenced
-by locale nor EBCDIC code page.
+the same characters, without regard to other factors, such as if the
+source string is in UTF-8 format or not.
One might think that C<\s> is equivalent to C<[\h\v]>. This is not true. The
vertical tab (C<"\x0b">) is not matched by C<\s>, it is however considered
@@ -661,28 +660,34 @@ such a construct will lead to an error.
=head2 Locale, EBCDIC, Unicode and UTF-8
Some of the character classes have a somewhat different behaviour depending
-on the internal encoding of the source string, and the locale that is
-in effect, and if the program is running on an EBCDIC platform.
+on the internal encoding of the source string, if the regular expression
+is marked as having Unicode semantics, the locale that is in effect,
+and if the program is running on an EBCDIC platform.
C<\w>, C<\d>, C<\s> and the POSIX character classes (and their negations,
-including C<\W>, C<\D>, C<\S>) suffer from this behaviour. (Since the backslash
+including C<\W>, C<\D>, C<\S>) have this behaviour. (Since the backslash
sequences C<\b> and C<\B> are defined in terms of C<\w> and C<\W>, they also are
affected.)
-The rule is that if the source string is in UTF-8 format, the character
-classes match according to the Unicode properties. If the source string
-isn't, then the character classes match according to whatever locale or EBCDIC
-code page is in effect. If there is no locale nor EBCDIC, they match the ASCII
-defaults (0 to 9 for C<\d>; 52 letters, 10 digits and underscore for C<\w>;
-etc.).
-
-This usually means that if you are matching against characters whose C<ord()>
-values are between 128 and 255 inclusive, your character class may match
-or not depending on the current locale or EBCDIC code page, and whether the
-source string is in UTF-8 format. The string will be in UTF-8 format if it
-contains characters whose C<ord()> value exceeds 255. But a string may be in
-UTF-8 format without it having such characters. See L<perlunicode/The
-"Unicode Bug">.
+The rule is that if the source string is in UTF-8 format or the regular
+expression is marked as indicating Unicode semantics (see the next
+paragraph), the character classes match according to the Unicode
+properties. Otherwise, the character classes match according to
+whatever locale or EBCDIC code page is in effect. If there is no locale
+nor EBCDIC, they match the ASCII defaults (0 to 9 for C<\d>; 52 letters,
+10 digits and underscore for C<\w>; etc.).
+
+A regular expression is marked for Unicode semantics if it is encoded in
+utf8 (usually as a result of including a literal character whose code
+point is above 255), or if it contains a C<\N{U+...}> or C<\N{I<name>}>
+construct.
+
+The differences in behavior between locale and non-locale semantics
+can affect any character whose code point is 255 or less. The
+differences in behavior between Unicode and non-Unicode semantics
+affects only ASCII platforms, and only when matching against characters
+whose code points are between 128 and 255 inclusive. See
+L<perlunicode/The "Unicode Bug">.
For portability reasons, it may be better to not use C<\w>, C<\d>, C<\s>
or the POSIX character classes, and use the Unicode properties instead.
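
The rule documented by this patch can be illustrated with a small sketch. It is not part of the commit. It assumes an ASCII platform, no C<use locale>, and no C<use feature 'unicode_strings'> (or C<use v5.12> and later) in scope; the variable names are illustrative only. Under those assumptions, a non-breaking space held in a non-UTF-8 string should fail to match C<\s> with a plain pattern, but should match once either the pattern or the string triggers Unicode semantics.

```perl
use strict;
use warnings;

# NO-BREAK SPACE (code point 0xA0) as a native 8-bit string,
# i.e. not stored internally as UTF-8.
my $nbsp = "\xA0";

# Plain pattern, non-UTF-8 string: nothing indicates a Unicode
# interpretation, so the ASCII defaults apply to \s.
my $plain = $nbsp =~ /^\s$/ ? 1 : 0;

# The \N{U+...} construct marks the pattern for Unicode semantics,
# which then also governs how \s treats the non-UTF-8 string.
my $marked = $nbsp =~ /^(?:\s|\N{U+263A})$/ ? 1 : 0;

# Upgrading a copy of the string to UTF-8 format is the other
# trigger described in the text.
my $upgraded_str = $nbsp;
utf8::upgrade($upgraded_str);
my $upgraded = $upgraded_str =~ /^\s$/ ? 1 : 0;

print "plain pattern, non-UTF-8 string: $plain\n";
print "pattern containing \\N{U+...}:    $marked\n";
print "plain pattern, UTF-8 string:     $upgraded\n";
```

The second case is the subtlety the commit message refers to: the string itself is unchanged, yet the character class behaves differently because of what else appears in the pattern.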