author Gurusamy Sarathy <gsar@cpan.org> 2000-02-02 03:40:49 +0000
committer Gurusamy Sarathy <gsar@cpan.org> 2000-02-02 03:40:49 +0000
commit 8cbd9a7a04590303d418ea175c242a7a066b65f6
tree de832f2b3bc0abbbd2ebae9969a66621bd843da8 /pod/perlunicode.pod
parent 95151ede387cd538ff8505bd2a82530dd81dcfe9
reword some sections of perlunicode.pod
p4raw-id: //depot/perl@4943
Diffstat (limited to 'pod/perlunicode.pod')
-rw-r--r-- pod/perlunicode.pod 74
1 files changed, 53 insertions, 21 deletions
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index b0efcca8df..5a73d4e959 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -14,13 +14,48 @@ uses the UTF-8 encoding.
In future, Perl-level operations will expect to work with characters
rather than bytes, in general.
-However, Perl v5.6 aims to provide a safe migration path from byte
-semantics to character semantics for programs. To preserve compatibility
-with earlier versions of Perl which allowed byte semantics in Perl
-operations (owing to the fact that the internal representation for
-characters was in bytes) byte semantics will continue to be in effect
-until a the C<utf8> pragma is used in the C<main> package, or the C<$^U>
-global flag is explicitly set.
+However, as strictly an interim compatibility measure, Perl v5.6 aims to
+provide a safe migration path from byte semantics to character semantics
+for programs. For operations where Perl can unambiguously decide that the
+input data is characters, Perl now switches to character semantics.
+For operations where this determination cannot be made without additional
+information from the user, Perl decides in favor of compatibility, and
+chooses to use byte semantics.
+
+This behavior preserves compatibility with earlier versions of Perl,
+which allowed byte semantics in Perl operations, but only as long as
+none of the program's inputs are marked as being a source of Unicode
+character data. Such data may come from filehandles, from calls to
+external programs, from information provided by the system (such as %ENV),
+or from literals and constants in the source text. Later, in
+L</Character encodings for input and output>, we'll see how such
+inputs may be marked as being Unicode character data sources.
+
+One particular condition will enable character semantics on the entire
+program, bypassing the compatibility mode: if the C<$^U> global flag is
+set to C<1>, nearly all operations will use character semantics by
+default. As an added convenience, if the C<utf8> pragma is used in the
+C<main> package, C<$^U> is enabled automatically. [XXX: Should there
+be a -C switch to enable $^U?]
+
+Regardless of the above, the C<byte> pragma can always be used to force
+byte semantics in a particular lexical scope. See L<byte>.
+
+The C<utf8> pragma is primarily a compatibility device that enables
+recognition of UTF-8 in literals encountered by the parser. It is also
+used for enabling some of the more experimental Unicode support features.
+Note that this pragma is only required until a future version of Perl
+in which character semantics will become the default. This pragma may
+then become a no-op. See L<utf8>.
+
+Unless mentioned otherwise, Perl operators will use character semantics
+when they are dealing with Unicode data, and byte semantics otherwise.
+Thus, character semantics for these operations apply transparently; if
+the input data came from a Unicode source (for example, by adding a
+character encoding discipline to the filehandle whence it came, or a
+literal UTF-8 string constant in the program), character semantics
+apply; otherwise, byte semantics are in effect. To force byte semantics
+on Unicode data, the C<byte> pragma should be used.
Under character semantics, many operations that formerly operated on
bytes change to operating on characters. For ASCII data this makes
@@ -33,15 +68,7 @@ ranging from 0 to 2**32 or so. Larger characters encode to longer
sequences of bytes internally, but again, this is just an internal
detail which is hidden at the Perl level.
-The C<byte> pragma can be used to force byte semantics in a particular
-lexical scope. See L<byte>.
-
-The C<utf8> pragma is a compatibility device to enables recognition
-of UTF-8 in literals encountered by the parser. It is also used
-for enabling some experimental Unicode support features. Note that
-this pragma is only required until a future version of Perl in which
-character semantics will become the default. This pragma may then
-become a no-op. See L<utf8>.
+=head2 Effects of character semantics
Character semantics have the following effects:
@@ -73,9 +100,9 @@ characters, including ideographs. (You are currently on your own when
it comes to using the canonical forms of characters--Perl doesn't (yet)
attempt to canonicalize variable names for you.)
-This also needs C<use utf8> currently. [XXX: Why? High-bit chars were
+This also needs C<use utf8> currently. [XXX: Why?!? High-bit chars were
syntax errors when they occurred within identifiers in previous versions,
-so this should be enabled by default.]
+so this should probably be enabled by default.]
=item *
@@ -86,7 +113,8 @@ C<\C>).)
Unicode support in regular expressions needs C<use utf8> currently.
[XXX: Because the SWASH routines need to be loaded. And the RE engine
-appears to need an overhaul to Unicode by default anyway.]
+appears to need an overhaul to dynamically match Unicode anyway--the
+current RE compiler creates different nodes with and without C<use utf8>.]
=item *
@@ -180,14 +208,18 @@ And finally, C<scalar reverse()> reverses by character rather than by byte.
=back
+=head2 Character encodings for input and output
+
+[XXX: This feature is not yet implemented.]
+
=head1 CAVEATS
As of yet, there is no method for automatically coercing input and
output to some encoding other than UTF-8. This is planned in the near
future, however.
-Whether a piece of data will be treated as "characters" or "bytes"
-by internal operations cannot be divined at the current time.
+Whether an arbitrary piece of data will be treated as "characters" or
+"bytes" by internal operations cannot be divined at the current time.
Use of locales with utf8 may lead to odd results. Currently there is
some attempt to apply 8-bit locale info to characters in the range