author | Rafael Garcia-Suarez <rgarciasuarez@gmail.com> | 2007-12-18 09:51:39 +0000 |
---|---|---|
committer | Rafael Garcia-Suarez <rgarciasuarez@gmail.com> | 2007-12-18 09:51:39 +0000 |
commit | a3d15f9a2bf22c599dfee4c8fb750856644c6d1f (patch) | |
tree | 618bbc3d2b62e25d629446f0575644e5a6118db4 /pod/perltodo.pod | |
parent | 4e4a88873a7e660d02017ef31d574516d5d3deb6 (diff) | |
download | perl-a3d15f9a2bf22c599dfee4c8fb750856644c6d1f.tar.gz |
Notes on 5.12 Unicode revamping planned.
Complete the "reporting bug" section of perldelta.
p4raw-id: //depot/perl@32636
Diffstat (limited to 'pod/perltodo.pod')
-rw-r--r-- | pod/perltodo.pod | 24 |
1 files changed, 16 insertions, 8 deletions
diff --git a/pod/perltodo.pod b/pod/perltodo.pod
index 0c85ceb1ce..d869b67c87 100644
--- a/pod/perltodo.pod
+++ b/pod/perltodo.pod
@@ -667,6 +667,22 @@ also the warning messages (see L<perllexwarn>, C<warnings.pl>).
 These tasks would need C knowledge, and knowledge of how the interpreter works,
 or a willingness to learn.
 
+=head2 UTF-8 revamp
+
+The handling of Unicode is unclean in many places. For example, the regexp
+engine matches in Unicode semantics whenever the string or the pattern is
+flagged as UTF-8, but that should not be dependent on an internal storage
+detail of the string. Likewise, case folding behaviour is dependent on the
+UTF8 internal flag being on or off.
+
+=head2 Properly Unicode safe tokeniser and pads.
+
+The tokeniser isn't actually very UTF-8 clean. C<use utf8;> is a hack -
+variable names are stored in stashes as raw bytes, without the utf-8 flag
+set. The pad API only takes a C<char *> pointer, so that's all bytes too. The
+tokeniser ignores the UTF-8-ness of C<PL_rsfp>, or any SVs returned from
+source filters. All this could be fixed.
+
 =head2 state variable initialization in list context
 
 Currently this is illegal:
@@ -776,14 +792,6 @@ reinstated.
 
 The old perltodo notes "Look at the "reification" code in C<av.c>".
 
-=head2 Properly Unicode safe tokeniser and pads.
-
-The tokeniser isn't actually very UTF-8 clean. C<use utf8;> is a hack -
-variable names are stored in stashes as raw bytes, without the utf-8 flag
-set. The pad API only takes a C<char *> pointer, so that's all bytes too. The
-tokeniser ignores the UTF-8-ness of C<PL_rsfp>, or any SVs returned from
-source filters. All this could be fixed.
-
 =head2 The yada yada yada operators
 
 Perl 6's Synopsis 3 says: