Notes on 5.12 Unicode revamping planned.

Complete the "reporting bug" section of perldelta. p4raw-id: //depot/perl@32636
author: Rafael Garcia-Suarez <rgarciasuarez@gmail.com> 2007-12-18 09:51:39 +0000
committer: Rafael Garcia-Suarez <rgarciasuarez@gmail.com> 2007-12-18 09:51:39 +0000
commit: a3d15f9a2bf22c599dfee4c8fb750856644c6d1f (patch)
tree: 618bbc3d2b62e25d629446f0575644e5a6118db4 /pod/perltodo.pod
parent: 4e4a88873a7e660d02017ef31d574516d5d3deb6 (diff)
download: perl-a3d15f9a2bf22c599dfee4c8fb750856644c6d1f.tar.gz
1 files changed, 16 insertions, 8 deletions
diff --git a/pod/perltodo.pod b/pod/perltodo.pod
index 0c85ceb1ce..d869b67c87 100644
--- a/pod/perltodo.pod
+++ b/pod/perltodo.pod
@@ -667,6 +667,22 @@ also the warning messages (see L<perllexwarn>, C<warnings.pl>).
 These tasks would need C knowledge, and knowledge of how the interpreter works,
 or a willingness to learn.
 
+=head2 UTF-8 revamp
+
+The handling of Unicode is unclean in many places. For example, the regexp
+engine matches in Unicode semantics whenever the string or the pattern is
+flagged as UTF-8, but that should not be dependent on an internal storage
+detail of the string. Likewise, case folding behaviour is dependent on the
+UTF8 internal flag being on or off.
+
+=head2 Properly Unicode safe tokeniser and pads.
+
+The tokeniser isn't actually very UTF-8 clean. C<use utf8;> is a hack -
+variable names are stored in stashes as raw bytes, without the utf-8 flag
+set. The pad API only takes a C<char *> pointer, so that's all bytes too. The
+tokeniser ignores the UTF-8-ness of C<PL_rsfp>, or any SVs returned from
+source filters.  All this could be fixed.
+
 =head2 state variable initialization in list context
 
 Currently this is illegal:
@@ -776,14 +792,6 @@ reinstated.
 
 The old perltodo notes "Look at the "reification" code in C<av.c>".
 
-=head2 Properly Unicode safe tokeniser and pads.
-
-The tokeniser isn't actually very UTF-8 clean. C<use utf8;> is a hack -
-variable names are stored in stashes as raw bytes, without the utf-8 flag
-set. The pad API only takes a C<char *> pointer, so that's all bytes too. The
-tokeniser ignores the UTF-8-ness of C<PL_rsfp>, or any SVs returned from
-source filters.  All this could be fixed.
-
 =head2 The yada yada yada operators
 
 Perl 6's Synopsis 3 says: