author | Jarkko Hietaniemi <jhi@iki.fi> | 2001-09-15 13:53:42 +0000 |
---|---|---|
committer | Jarkko Hietaniemi <jhi@iki.fi> | 2001-09-15 13:53:42 +0000 |
commit | 776f8809bbc48a9d2c3912352f517ede1485f2f7 (patch) | |
tree | 1426850646afca5fcaeb3e356df06613b766e617 /pod | |
parent | db28379b6202839e1772e5a65654df24b3660070 (diff) | |
download | perl-776f8809bbc48a9d2c3912352f517ede1485f2f7.tar.gz |
Document what's still to be done on the regular expression
Unicode support, based on the UTR#18.
p4raw-id: //depot/perl@12029
Diffstat (limited to 'pod')
-rw-r--r-- | pod/perltodo.pod | 13 |
-rw-r--r-- | pod/perlunicode.pod | 67 |
2 files changed, 76 insertions, 4 deletions
diff --git a/pod/perltodo.pod b/pod/perltodo.pod
index 0f4ea6353d..35956e1788 100644
--- a/pod/perltodo.pod
+++ b/pod/perltodo.pod
@@ -61,17 +61,26 @@ engine. The idea is that, for instance, C<\b> needs to be
 algorithmically computed if you're dealing with Thai text. Hence, the
 B<\b> assertion wants to be overloaded by a function.
 
-=head2 Unicode case mappings
+=head2 Unicode
+
+=over 4
+
+=item *
 
 Case Mappings? http://www.unicode.org/unicode/reports/tr21/
 
-=head2 Unicode regular expression character classes
+=item *
 
 They have some tricks Perl doesn't yet implement like character
 class subtraction.
 
 http://www.unicode.org/unicode/reports/tr18/
 
+=back
+
+See L<perlunicode/UNICODE REGULAR EXPRESSION SUPPORT LEVEL> for what's
+there and what's missing.
+
 =head2 use Thread for iThreads
 
 Artur Bergman's C<iThreads> module is a start on this, but needs to
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index e6a14a7a92..ba73eb37c1 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -6,8 +6,8 @@ perlunicode - Unicode support in Perl
 
 =head2 Important Caveats
 
-WARNING: While the implementation of Unicode support in Perl is now fairly
-complete it is still evolving to some extent.
+WARNING: While the implementation of Unicode support in Perl is now
+fairly complete it is still evolving to some extent.
 
 In particular the way Unicode is handled on EBCDIC platforms is still
 rather experimental. On such a platform references to UTF-8 encoding
@@ -497,6 +497,69 @@ some attempt to apply 8-bit locale info to characters in the range
 characters above that range (when mapped into Unicode). It will also
 tend to run slower. Avoidance of locales is strongly encouraged.
 
+=head1 UNICODE REGULAR EXPRESSION SUPPORT LEVEL
+
+The following list of Unicode regular expression support describes
+feature by feature the Unicode support implemented in Perl as of Perl
+5.8.0. The "Level N" and the section numbers refer to the Unicode
+Technical Report 18, "Unicode Regular Expression Guidelines".
+
+=over 4
+
+=item *
+
+Level 1 - Basic Unicode Support
+
+        2.1 Hex Notation                        - done          [1]
+            Named Notation                      - done          [2]
+        2.2 Categories                          - done          [3][4]
+        2.3 Subtraction                         - MISSING       [5][6]
+        2.4 Simple Word Boundaries              - done          [7]
+        2.5 Simple Loose Matches                - MISSING       [8]
+        2.6 End of Line                         - MISSING       [9][10]
+
+        [ 1] \x{...}
+        [ 2] \N{...}
+        [ 3] . \p{Is...} \P{Is...}
+        [ 4] now scripts (see UTR#24 Script Names) in addition to blocks
+        [ 5] have negation
+        [ 6] can use look-ahead to emulate subtracion
+        [ 7] include Letters in word characters
+        [ 8] see UTR#21 Case Mappings
+        [ 9] see UTR#13 Unicode Newline Guidelines
+        [10] should do ^ and $ also on \x{2028} and \x{2029}
+
+=item *
+
+Level 2 - Extended Unicode Support
+
+        3.1 Surrogates                          - MISSING
+        3.2 Canonical Equivalents               - MISSING       [11][12]
+        3.3 Locale-Independent Graphemes        - MISSING       [13]
+        3.4 Locale-Independent Words            - MISSING       [14]
+        3.5 Locale-Independent Loose Matches    - MISSING       [15]
+
+        [11] see UTR#15 Unicode Normalization
+        [12] have Unicode::Normalize but not integrated to regexes
+        [13] have \X but at this level . should equal that
+        [14] need three classes, not just \w and \W
+        [15] see UTR#21 Case Mappings
+
+=item *
+
+Level 3 - Locale-Sensitive Support
+
+        4.1 Locale-Dependent Categories         - MISSING
+        4.2 Locale-Dependent Graphemes          - MISSING       [16][17]
+        4.3 Locale-Dependent Words              - MISSING
+        4.4 Locale-Dependent Loose Matches      - MISSING
+        4.5 Locale-Dependent Ranges             - MISSING
+
+        [16] see UTR#10 Unicode Collation Algorithms
+        [17] have Unicode::Collate but not integrated to regexes
+
+=back
+
 =head1 SEE ALSO
 
 L<bytes>, L<utf8>, L<perlretut>, L<perlvar/"${^WIDE_SYSTEM_CALLS}">
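
Footnotes [5] and [6] of the patch note that character class subtraction (item 2.3) is missing but can be emulated with look-ahead. The following is a minimal sketch of that idea and is not part of the commit. The helper name `is_non_latin_letter` is hypothetical, and the property spelling `\p{Latin}` assumes a reasonably modern Perl rather than the `\p{Is...}` forms listed in footnote [3].

```perl
use strict;
use warnings;

# Emulates the set "all letters minus Latin-script letters".
# The negative look-ahead (?!\p{Latin}) rejects the subtracted set
# without consuming input; \p{L} then matches one remaining letter.
my $non_latin_letter = qr/(?!\p{Latin})\p{L}/;

# Hypothetical helper: true if the single character is a letter
# that does not belong to the Latin script.
sub is_non_latin_letter {
    my ($char) = @_;
    return $char =~ /\A$non_latin_letter\z/ ? 1 : 0;
}

# \x{...} is the hex notation listed as "done" under item 2.1.
for my $char ("a", "\x{0434}", "\x{03B1}", "7") {
    printf "U+%04X %s\n", ord($char),
        is_non_latin_letter($char) ? "kept" : "rejected";
}
```

The look-ahead comes first so that the exclusion is checked at the same position where `\p{L}` then consumes the character. Swapping in a different property pair gives other "A minus B" sets under the same assumption.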