author    Yves Orton <demerphq@gmail.com>  2009-09-02 20:29:13 +0200
committer Yves Orton <demerphq@gmail.com>  2009-09-02 20:30:08 +0200
commit    6fa80ea2f46c3527db9bfc3ba3a52c12826c85c7
tree      0c260c769d1d4bde6dcf5465bdc7c2a924a16e80 /pod
parent    42cb3b6d01c213ff7daf09d23f012bd467bea3aa
update perlre and perldelta to document change in behaviour of \w and \d and POSIX charclasses
Diffstat (limited to 'pod')
-rw-r--r-- pod/perl5110delta.pod | 58
-rw-r--r-- pod/perlre.pod        | 105
2 files changed, 89 insertions, 74 deletions
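
The difference between the old and new mappings is easiest to see with a character that is a Unicode decimal digit but not an ASCII one. The following is a minimal sketch and not part of the patch. It uses the explicit range from the "digit" row of the table and the standard \p{Nd} property, rather than \d or \p{IsPosixDigit}, because what those two match depends on the Perl version in use. The variable names are illustrative only.

```perl
use strict;
use warnings;

# U+0663 ARABIC-INDIC DIGIT THREE: a Unicode decimal digit outside ASCII.
my $arabic_three = "\x{0663}";
my $ascii_three  = "3";

# Explicit ASCII range, as listed in the "digit" row of the table.
my $posix_digit = qr/^[0-9]$/;

# Unicode general category Nd (decimal number): the broader definition.
my $unicode_digit = qr/^\p{Nd}$/;

for my $char ($ascii_three, $arabic_three) {
    printf "U+%04X posix=%d unicode=%d\n",
        ord($char),
        ($char =~ $posix_digit)   ? 1 : 0,
        ($char =~ $unicode_digit) ? 1 : 0;
}
```

Under the mapping described in this commit, \d is meant to behave like the first pattern. Code that relied on the broader behaviour would need an explicit \p{} property such as the second.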
diff --git a/pod/perl5110delta.pod b/pod/perl5110delta.pod
index 7336692aee..16266b18a8 100644
--- a/pod/perl5110delta.pod
+++ b/pod/perl5110delta.pod
@@ -11,6 +11,61 @@ the 5.11.0 development release.
=head1 Incompatible Changes
+=head2 Unicode interpretation of \w, \d, \s, and the POSIX character classes redefined.
+
+Previous versions of Perl tried to map POSIX-style character class definitions onto
+Unicode property names so that patterns would "DWIM" when matched against Latin-1 or
+Unicode strings. This proved to be a mistake: it broke character class negation, caused
+forward compatibility problems (as Unicode keeps updating its property definitions and adding
+new characters), and led to other problems.
+
+Therefore we have now defined a new set of artificial "unicode" property names which will be
+used to do unicode matching of patterns using POSIX style character classes and perl short-form
+escape character classes like \w and \d.
+
+The key change here is that \d will no longer match every digit in the Unicode standard
+(there are thousands), nor will \w match every word character in the standard. Instead they
+will match precisely their POSIX or Perl definition.
+
+Those needing to match based on Unicode properties can continue to do so by using the \p{} syntax
+to match whichever property they like, including the new artificial definitions.
+
+B<NOTE:> This is a backwards-incompatible change in behaviour that produces no warning. If you
+are upgrading and you process large volumes of text, look for POSIX and Perl style character
+classes and change them to the relevant property name (by removing the word 'Posix' from the current name).
+
+The following table maps the POSIX character class names, the escapes and the old and new
+Unicode property mappings:
+
+    POSIX   Esc Class               New-Property  ! Old-Property
+    ----------------------------------------------+-------------
+    alnum       [0-9A-Za-z]         IsPosixAlnum  ! IsAlnum
+    alpha       [A-Za-z]            IsPosixAlpha  ! IsAlpha
+    ascii       [\000-\177]         IsASCII       = IsASCII
+    blank       [\011 ]             IsPosixBlank  !
+    cntrl       [\0-\37\177]        IsPosixCntrl  ! IsCntrl
+    digit   \d  [0-9]               IsPosixDigit  ! IsDigit
+    graph       [!-~]               IsPosixGraph  ! IsGraph
+    lower       [a-z]               IsPosixLower  ! IsLower
+    print       [ -~]               IsPosixPrint  ! IsPrint
+    punct       [!-/:-@[-`{-~]      IsPosixPunct  ! IsPunct
+    space       [\11-\15 ]          IsPosixSpace  ! IsSpace
+            \s  [\11\12\14\15 ]     IsPerlSpace   ! IsSpacePerl
+    upper       [A-Z]               IsPosixUpper  ! IsUpper
+    word    \w  [0-9A-Z_a-z]        IsPerlWord    ! IsWord
+    xdigit      [0-9A-Fa-f]         IsXDigit      = IsXDigit
+
+If you wish to build perl with the old mapping you may do so by setting
+
+ #define PERL_LEGACY_UNICODE_CHARCLASS_MAPPINGS 1
+
+in regcomp.h, and then setting
+
+ PERL_TEST_LEGACY_POSIX_CC
+
+to true in your environment when testing.
+
+
=head2 In @INC, move ARCHLIB and PRIVLIB after the current version's site_perl and vendor_perl.
=head2 Switch statement changes
@@ -2294,3 +2349,6 @@ when creation of a temporary file in it fails
=head2 Add a pluggable hook in op_free()
+
+
+
diff --git a/pod/perlre.pod b/pod/perlre.pod
index ee1c2cb940..1336c5c24d 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -316,26 +316,34 @@ they must always be used within a character class expression.
# this is not, and will generate a warning:
$string =~ /[:alpha:]/;
-The available classes and their backslash equivalents (if available) are
-as follows:
-X<character class>
+The following table shows the mapping of POSIX character class
+names, common escapes, literal escape sequences and their equivalent
+Unicode style property names.
+X<character class> X<\p> X<\p{}>
X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph>
X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit>
- alpha
- alnum
- ascii
- blank [1]
- cntrl
- digit \d
- graph
- lower
- print
- punct
- space \s [2]
- upper
- word \w [3]
- xdigit
+B<Note:> up to Perl 5.10 the property names used were shared with
+standard Unicode properties; this was changed in Perl 5.11. See
+L<perl5110delta> for details.
+
+    POSIX   Esc Class               Property      Note
+    --------------------------------------------------------
+    alnum       [0-9A-Za-z]         IsPosixAlnum
+    alpha       [A-Za-z]            IsPosixAlpha
+    ascii       [\000-\177]         IsASCII
+    blank       [\011 ]             IsPosixBlank  [1]
+    cntrl       [\0-\37\177]        IsPosixCntrl
+    digit   \d  [0-9]               IsPosixDigit
+    graph       [!-~]               IsPosixGraph
+    lower       [a-z]               IsPosixLower
+    print       [ -~]               IsPosixPrint
+    punct       [!-/:-@[-`{-~]      IsPosixPunct
+    space       [\11-\15 ]          IsPosixSpace  [2]
+            \s  [\11\12\14\15 ]     IsPerlSpace   [2]
+    upper       [A-Z]               IsPosixUpper
+    word    \w  [0-9A-Z_a-z]        IsPerlWord    [3]
+    xdigit      [0-9A-Fa-f]         IsXDigit
=over
@@ -345,8 +353,9 @@ A GNU extension equivalent to C<[ \t]>, "all horizontal whitespace".
=item [2]
-Not exactly equivalent to C<\s> since the C<[[:space:]]> includes
-also the (very rare) "vertical tabulator", "\cK" or chr(11) in ASCII.
+Note that C<\s> and C<[[:space:]]> are B<not> equivalent as C<[[:space:]]>
+includes also the (very rare) "vertical tabulator", "\cK" or chr(11) in
+ASCII.
=item [3]
@@ -362,58 +371,6 @@ whole character class. For example:
matches zero, one, any alphabetic character, and the percent sign.
-The following equivalences to Unicode \p{} constructs and equivalent
-backslash character classes (if available), will hold:
-X<character class> X<\p> X<\p{}>
-
-    [[:...:]]   \p{...}         backslash
-
-    alpha       IsAlpha
-    alnum       IsAlnum
-    ascii       IsASCII
-    blank
-    cntrl       IsCntrl
-    digit       IsDigit         \d
-    graph       IsGraph
-    lower       IsLower
-    print       IsPrint         (but see [2] below)
-    punct       IsPunct         (but see [3] below)
-    space       IsSpace
-                IsSpacePerl     \s
-    upper       IsUpper
-    word        IsWord          \w
-    xdigit      IsXDigit
-
-For example C<[[:lower:]]> and C<\p{IsLower}> are equivalent.
-
-However, the equivalence between C<[[:xxxxx:]]> and C<\p{IsXxxxx}>
-is not exact.
-
-=over 4
-
-=item [1]
-
-If the C<utf8> pragma is not used but the C<locale> pragma is, the
-classes correlate with the usual isalpha(3) interface (except for
-"word" and "blank").
-
-But if the C<locale> or C<encoding> pragmas are not used and
-the string is not C<utf8>, then C<[[:xxxxx:]]> (and C<\w>, etc.)
-will not match characters 0x80-0xff; whereas C<\p{IsXxxxx}> will
-force the string to C<utf8> and can match these characters
-(as Unicode).
-
-=item [2]
-
-C<\p{IsPrint}> matches characters 0x09-0x0d but C<[[:print:]]> does not.
-
-=item [3]
-
-C<[[:punct::]]> matches the following but C<\p{IsPunct}> does not,
-because they are classed as symbols (not punctuation) in Unicode.
-
-=over 4
-
=item C<$>
Currency symbol
@@ -473,9 +430,9 @@ X<character class, negation>
     POSIX           traditional  Unicode
-    [[:^digit:]]    \D           \P{IsDigit}
-    [[:^space:]]    \S           \P{IsSpace}
-    [[:^word:]]     \W           \P{IsWord}
+    [[:^digit:]]    \D           \P{IsPosixDigit}
+    [[:^space:]]    \S           \P{IsPosixSpace}
+    [[:^word:]]     \W           \P{IsPerlWord}
Perl respects the POSIX standard in that POSIX character classes are
only supported within a character class. The POSIX character classes