author: Jarkko Hietaniemi <jhi@iki.fi> 2001-09-15 13:53:42 +0000
committer: Jarkko Hietaniemi <jhi@iki.fi> 2001-09-15 13:53:42 +0000
commit: 776f8809bbc48a9d2c3912352f517ede1485f2f7
tree: 1426850646afca5fcaeb3e356df06613b766e617 /pod
parent: db28379b6202839e1772e5a65654df24b3660070

Document what's still to be done on the regular expression
Unicode support, based on the UTR#18.

p4raw-id: //depot/perl@12029

Diffstat (limited to 'pod')
-rw-r--r--  pod/perltodo.pod     13
-rw-r--r--  pod/perlunicode.pod  67
2 files changed, 76 insertions, 4 deletions
diff --git a/pod/perltodo.pod b/pod/perltodo.pod
index 0f4ea6353d..35956e1788 100644
--- a/pod/perltodo.pod
+++ b/pod/perltodo.pod
@@ -61,17 +61,26 @@ engine. The idea is that, for instance, C<\b> needs to be
algorithmically computed if you're dealing with Thai text. Hence, the
B<\b> assertion wants to be overloaded by a function.
-=head2 Unicode case mappings
+=head2 Unicode
+
+=over 4
+
+=item *
Case Mappings? http://www.unicode.org/unicode/reports/tr21/
-=head2 Unicode regular expression character classes
+=item *
They have some tricks Perl doesn't yet implement like character
class subtraction.
http://www.unicode.org/unicode/reports/tr18/
+=back
+
+See L<perlunicode/UNICODE REGULAR EXPRESSION SUPPORT LEVEL> for what's
+there and what's missing.
+
=head2 use Thread for iThreads
Artur Bergman's C<iThreads> module is a start on this, but needs to
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index e6a14a7a92..ba73eb37c1 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -6,8 +6,8 @@ perlunicode - Unicode support in Perl
=head2 Important Caveats
-WARNING: While the implementation of Unicode support in Perl is now fairly
-complete it is still evolving to some extent.
+WARNING: While the implementation of Unicode support in Perl is now
+fairly complete it is still evolving to some extent.
In particular the way Unicode is handled on EBCDIC platforms is still
rather experimental. On such a platform references to UTF-8 encoding
@@ -497,6 +497,69 @@ some attempt to apply 8-bit locale info to characters in the range
characters above that range (when mapped into Unicode). It will also
tend to run slower. Avoidance of locales is strongly encouraged.
+=head1 UNICODE REGULAR EXPRESSION SUPPORT LEVEL
+
+The following list of Unicode regular expression support describes
+feature by feature the Unicode support implemented in Perl as of Perl
+5.8.0. The "Level N" and the section numbers refer to the Unicode
+Technical Report 18, "Unicode Regular Expression Guidelines".
+
+=over 4
+
+=item *
+
+Level 1 - Basic Unicode Support
+
+ 2.1 Hex Notation - done [1]
+ Named Notation - done [2]
+ 2.2 Categories - done [3][4]
+ 2.3 Subtraction - MISSING [5][6]
+ 2.4 Simple Word Boundaries - done [7]
+ 2.5 Simple Loose Matches - MISSING [8]
+ 2.6 End of Line - MISSING [9][10]
+
+ [ 1] \x{...}
+ [ 2] \N{...}
+ [ 3] . \p{Is...} \P{Is...}
+ [ 4] now scripts (see UTR#24 Script Names) in addition to blocks
+ [ 5] have negation
+ [ 6] can use look-ahead to emulate subtraction
+ [ 7] include Letters in word characters
+ [ 8] see UTR#21 Case Mappings
+ [ 9] see UTR#13 Unicode Newline Guidelines
+ [10] should do ^ and $ also on \x{2028} and \x{2029}
+
+=item *
+
+Level 2 - Extended Unicode Support
+
+ 3.1 Surrogates - MISSING
+ 3.2 Canonical Equivalents - MISSING [11][12]
+ 3.3 Locale-Independent Graphemes - MISSING [13]
+ 3.4 Locale-Independent Words - MISSING [14]
+ 3.5 Locale-Independent Loose Matches - MISSING [15]
+
+ [11] see UTR#15 Unicode Normalization
+ [12] have Unicode::Normalize but not integrated to regexes
+ [13] have \X but at this level . should equal that
+ [14] need three classes, not just \w and \W
+ [15] see UTR#21 Case Mappings
+
+=item *
+
+Level 3 - Locale-Sensitive Support
+
+ 4.1 Locale-Dependent Categories - MISSING
+ 4.2 Locale-Dependent Graphemes - MISSING [16][17]
+ 4.3 Locale-Dependent Words - MISSING
+ 4.4 Locale-Dependent Loose Matches - MISSING
+ 4.5 Locale-Dependent Ranges - MISSING
+
+ [16] see UTR#10 Unicode Collation Algorithm
+ [17] have Unicode::Collate but not integrated to regexes
+
+=back
+
=head1 SEE ALSO
L<bytes>, L<utf8>, L<perlretut>, L<perlvar/"${^WIDE_SYSTEM_CALLS}">
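
Note on footnotes [5] and [6] above: character class subtraction is listed as MISSING, but the patch says look-ahead can emulate it. The following is a minimal sketch of that idea, not code from the commit. The helper name letters_without_vowels and the sample string are hypothetical, and the use of \p{L} (the general category Letter) assumes a Perl whose regex engine accepts single-letter property names. A negative look-ahead rejects the characters to be subtracted before the property class consumes one character, which approximates "letters minus ASCII vowels".

```perl
use strict;
use warnings;

# Emulated subtraction: "any Unicode letter, except an ASCII vowel".
# The negative look-ahead (?!...) fails at a vowel without consuming it;
# otherwise \p{L} matches exactly one letter.
my $letter_minus_vowel = qr/(?![aeiouAEIOU])\p{L}/;

# Hypothetical helper: return only the characters of $text that are in
# the emulated set, in their original order.
sub letters_without_vowels {
    my ($text) = @_;
    # With /g and no capture groups, a list-context match returns
    # every matched substring.
    return join '', $text =~ /$letter_minus_vowel/g;
}

# The sample uses the \x{...} hex notation from footnote [1] so that the
# source file itself stays pure ASCII.
my $sample   = "Perl 5.8 \x{142}\x{F3}d\x{17A}";
my $filtered = letters_without_vowels($sample);
```

The look-ahead only narrows what the following class may match, so the same pattern shape should work for any "A minus B" pair where B can be written as an ordinary class or property.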