author | Gurusamy Sarathy <gsar@cpan.org> | 2000-02-02 03:40:49 +0000 |
---|---|---|
committer | Gurusamy Sarathy <gsar@cpan.org> | 2000-02-02 03:40:49 +0000 |
commit | 8cbd9a7a04590303d418ea175c242a7a066b65f6 (patch) | |
tree | de832f2b3bc0abbbd2ebae9969a66621bd843da8 /pod/perlunicode.pod | |
parent | 95151ede387cd538ff8505bd2a82530dd81dcfe9 (diff) | |
reword some sections of perlunicode.pod
p4raw-id: //depot/perl@4943
Diffstat (limited to 'pod/perlunicode.pod')
-rw-r--r-- | pod/perlunicode.pod | 74 |
1 files changed, 53 insertions, 21 deletions
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index b0efcca8df..5a73d4e959 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -14,13 +14,48 @@ uses the UTF-8 encoding.
 In future, Perl-level operations will expect to work with characters
 rather than bytes, in general.
 
-However, Perl v5.6 aims to provide a safe migration path from byte
-semantics to character semantics for programs. To preserve compatibility
-with earlier versions of Perl which allowed byte semantics in Perl
-operations (owing to the fact that the internal representation for
-characters was in bytes) byte semantics will continue to be in effect
-until a the C<utf8> pragma is used in the C<main> package, or the C<$^U>
-global flag is explicitly set.
+However, as strictly an interim compatibility measure, Perl v5.6 aims to
+provide a safe migration path from byte semantics to character semantics
+for programs. For operations where Perl can unambiguously decide that the
+input data is characters, Perl now switches to character semantics.
+For operations where this determination cannot be made without additional
+information from the user, Perl decides in favor of compatibility, and
+chooses to use byte semantics.
+
+This behavior preserves compatibility with earlier versions of Perl,
+which allowed byte semantics in Perl operations, but only as long as
+none of the program's inputs are marked as being as source of Unicode
+character data. Such data may come from filehandles, from calls to
+external programs, from information provided by the system (such as %ENV),
+or from literals and constants in the source text. Later, in
+L</Character encodings for input and output>, we'll see how such
+inputs may be marked as being Unicode character data sources.
+
+One particular condition will enable character semantics on the entire
+program, bypassing the compatibility mode: if the C<$^U> global flag is
+set to C<1>, nearly all operations will use character semantics by
+default. As an added convenience, if the C<utf8> pragma is used in the
+C<main> package, C<$^U> is enabled automatically. [XXX: Should there
+be a -C switch to enable $^U?]
+
+Regardless of the above, the C<byte> pragma can always be used to force
+byte semantics in a particular lexical scope. See L<byte>.
+
+The C<utf8> pragma is primarily a compatibility device that enables
+recognition of UTF-8 in literals encountered by the parser. It is also
+used for enabling some of the more experimental Unicode support features.
+Note that this pragma is only required until a future version of Perl
+in which character semantics will become the default. This pragma may
+then become a no-op. See L<utf8>.
+
+Unless mentioned otherwise, Perl operators will use character semantics
+when they are dealing with Unicode data, and byte semantics otherwise.
+Thus, character semantics for these operations apply transparently; if
+the input data came from a Unicode source (for example, by adding a
+character encoding discipline to the filehandle whence it came, or a
+literal UTF-8 string constant in the program), character semantics
+apply; otherwise, byte semantics are in effect. To force byte semantics
+on Unicode data, the C<byte> pragma should be used.
 
 Under character semantics, many operations that formerly operated on
 bytes change to operating on characters. For ASCII data this makes
@@ -33,15 +68,7 @@ ranging from 0 to 2**32 or so. Larger characters encode to longer
 sequences of bytes internally, but again, this is just an internal
 detail which is hidden at the Perl level.
 
-The C<byte> pragma can be used to force byte semantics in a particular
-lexical scope. See L<byte>.
-
-The C<utf8> pragma is a compatibility device to enables recognition
-of UTF-8 in literals encountered by the parser. It is also used
-for enabling some experimental Unicode support features. Note that
-this pragma is only required until a future version of Perl in which
-character semantics will become the default. This pragma may then
-become a no-op. See L<utf8>.
+=head2 Effects of character semantics
 
 Character semantics have the following effects:
 
@@ -73,9 +100,9 @@ characters, including ideographs. (You are currently on your own when
 it comes to using the canonical forms of characters--Perl doesn't (yet)
 attempt to canonicalize variable names for you.)
 
-This also needs C<use utf8> currently. [XXX: Why? High-bit chars were
+This also needs C<use utf8> currently. [XXX: Why?!? High-bit chars were
 syntax errors when they occurred within identifiers in previous versions,
-so this should be enabled by default.]
+so this should probably be enabled by default.]
 
 =item *
 
@@ -86,7 +113,8 @@ C<\C>).)
 
 Unicode support in regular expressions needs C<use utf8> currently.
 [XXX: Because the SWASH routines need to be loaded. And the RE engine
-appears to need an overhaul to Unicode by default anyway.]
+appears to need an overhaul to dynamically match Unicode anyway--the
+current RE compiler creates different nodes with and without C<use utf8>.]
 
 =item *
 
@@ -180,14 +208,18 @@ And finally, C<scalar reverse()> reverses by character rather than by byte.
 
 =back
 
+=head2 Character encodings for input and output
+
+[XXX: This feature is not yet implemented.]
+
 =head1 CAVEATS
 
 As of yet, there is no method for automatically coercing input and
 output to some encoding other than UTF-8. This is planned in the near
 future, however.
 
-Whether a piece of data will be treated as "characters" or "bytes"
-by internal operations cannot be divined at the current time.
+Whether an arbitrary piece of data will be treated as "characters" or
+"bytes" by internal operations cannot be divined at the current time.
 
 Use of locales with utf8 may lead to odd results. Currently there is
 some attempt to apply 8-bit locale info to characters in the range