Diffstat (limited to 'ext/Encode/Unicode/Unicode.pm')
-rw-r--r-- ext/Encode/Unicode/Unicode.pm | 112
1 files changed, 63 insertions, 49 deletions
diff --git a/ext/Encode/Unicode/Unicode.pm b/ext/Encode/Unicode/Unicode.pm
index 257989a636..bfe03bd646 100644
--- a/ext/Encode/Unicode/Unicode.pm
+++ b/ext/Encode/Unicode/Unicode.pm
@@ -3,7 +3,7 @@ package Encode::Unicode;
use strict;
use warnings;
-our $VERSION = do { my @r = (q$Revision: 1.32 $ =~ /\d+/g); sprintf "%d."."%02d" x $#r, @r };
+our $VERSION = do { my @r = (q$Revision: 1.35 $ =~ /\d+/g); sprintf "%d."."%02d" x $#r, @r };
use XSLoader;
XSLoader::load(__PACKAGE__,$VERSION);
@@ -47,11 +47,23 @@ sub new_sequence
return bless {%$self},ref($self);
}
+sub needs_lines { 0 };
+
+sub perlio_ok {
+ eval{ require PerlIO::encoding };
+ if ($@){
+ return 0;
+ }else{
+ return 1;
+ }
+}
+
#
-# three implementation of (en|de)code exist. XS version is the fastest.
-# *_modern use # an array and *_classic stick with substr. *_classic is
-# much slower but more memory conservative. *_xs is default.
+# three implementations of (en|de)code exist. The XS version is the
+# fastest. *_modern uses an array and *_classic sticks with substr.
+# *_classic is much slower but more memory conservative.
+# *_xs is the default.
sub set_transcoder{
no warnings qw(redefine);
@@ -273,7 +285,7 @@ __END__
=head1 NAME
-Encode::Unicode -- Various Unicode Transform Format
+Encode::Unicode -- Various Unicode Transformation Formats
=cut
@@ -308,8 +320,8 @@ UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE and UTF-32LE.
UTF-16BE 2/4 N Y S.P S.P 0xd82a,0xdfcd
UTF-16LE 2 N Y S.P S.P 0x2ad8,0xcddf
UTF-32 4 Y - is bogus As is BE/LE
- UTF-32BE 4 N - bogus As is 0x0010abcd
- UTF-32LE 4 N - bogus As is 0xcdab1000
+ UTF-32BE 4 N - bogus As is 0x0001abcd
+ UTF-32LE 4 N - bogus As is 0xcdab0100
UTF-8 1-4 - - bogus >= 4 octets \xf0\x9a\af\8d
---------------+-----------------+------------------------------
@@ -317,38 +329,41 @@ UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE and UTF-32LE.
=head1 Size, Endianness, and BOM
-You can categorize these CES by 3 criteria; Size of each character,
-Endianness, and Byte Order Mark.
+You can categorize these CES by 3 criteria: size of each character,
+endianness, and Byte Order Mark.
-=head2 by Size
+=head2 by size
UCS-2 is a fixed-length encoding with each character taking 16 bits.
-It B<does not> support I<Surrogate Pairs>. When a surrogate pair is
-encountered during decode(), its place is filled with \xFFFD without
-I<CHECK> or croaks if I<CHECK>. When a character whose ord value is
-larger than 0xFFFF is encountered, it uses 0xFFFD without I<CHECK> or
-croaks if <CHECK>.
-
-UTF-16 is almost the same as UCS-2 but it supports I<Surrogate Pairs>.
+It B<does not> support I<surrogate pairs>. When a surrogate pair
+is encountered during decode(), its place is filled with \x{FFFD}
+if I<CHECK> is 0, or the routine croaks if I<CHECK> is 1. When a
+character whose ord value is larger than 0xFFFF is encountered,
+its place is filled with \x{FFFD} if I<CHECK> is 0, or the routine
+croaks if I<CHECK> is 1.
+
+UTF-16 is almost the same as UCS-2 but it supports I<surrogate pairs>.
When it encounters a high surrogate (0xD800-0xDBFF), it fetches the
-following low surrogate (0xDC00-0xDFFF), C<desurrogate>s them to form a
-character. Bogus surrogates result in death. When \x{10000} or above
-is encountered during encode(), it C<ensurrogate>s them and pushes the
-surrogate pair to the output stream.
+following low surrogate (0xDC00-0xDFFF) and C<desurrogate>s them to
+form a character. Bogus surrogates result in death. When \x{10000}
+or above is encountered during encode(), it C<ensurrogate>s them and
+pushes the surrogate pair to the output stream.
UTF-32 is a fixed-length encoding with each character taking 32 bits.
-Since it is 32-bit there is no need for I<Surrogate Pairs>.
+Since it is 32-bit, there is no need for I<surrogate pairs>.
-=head2 by Endianness
+=head2 by endianness
-First (and now failed) goal of Unicode was to map all character
-repertories into a fixed-length integer so programmers are happy.
-Since each character is either I<short> or I<long> in C, you have to
-put endianness of each platform when you pass data to one another.
+The first (and now failed) goal of Unicode was to map all character
+repertoires into a fixed-length integer so that programmers are happy.
+Since each character is either a I<short> or I<long> in C, you have to
+pay attention to the endianness of each platform when you pass data
+to one another.
Anything marked as BE is Big Endian (or network byte order) and LE is
-Little Endian (aka VAX byte order). For anything without, a character
-called Byte Order Mark (BOM) is prepended to the head of string.
+Little Endian (aka VAX byte order). For anything not marked either
+BE or LE, a character called Byte Order Mark (BOM) indicating the
+endianness is prepended to the string.
=over 4
@@ -362,31 +377,31 @@ called Byte Order Mark (BOM) is prepended to the head of string.
=back
-This modules handles BOM as follows.
+This module handles the BOM as follows.
=over 4
=item *
When BE or LE is explicitly stated as the name of encoding, BOM is
-simply treated as one of characters (ZERO WIDTH NO-BREAK SPACE).
+simply treated as a normal character (ZERO WIDTH NO-BREAK SPACE).
=item *
-When BE or LE is omitted during decode(), it checks if BOM is in the
-beginning of the string and if found endianness is set to what BOM
-says. If not found, dies.
+When BE or LE is omitted during decode(), it checks if BOM is at the
+beginning of the string; if one is found, the endianness is set to
+what the BOM says. If no BOM is found, the routine dies.
=item *
When BE or LE is omitted during encode(), it returns a BE-encoded
string with BOM prepended. So when you want to encode a whole text
-file, make sure you encode() by whole text, not line by line or each
-line, not file, is prepended with BOMs.
+file, make sure you encode() the whole text at once, not line by line;
+otherwise each line, not the file, will have a BOM prepended.
=item *
-C<UCS-2> is an exception. Unlike others this is an alias of UCS-2BE.
+C<UCS-2> is an exception. Unlike others, this is an alias of UCS-2BE.
UCS-2 is already registered by IANA and others that way.
=back
@@ -404,18 +419,19 @@ magnitude so let's forgive them.
Vogons here ;) Or, comparing Encode to Babel Fish is completely
appropriate -- if you can only stick this into your ear :)
-Surrogate pairs were born when Unicode Consortium finally
+Surrogate pairs were born when the Unicode Consortium finally
admitted that 16 bits were not big enough to hold all the world's
-character repertoire. But they have already made UCS-2 16-bit. What
+character repertoires. But they already made UCS-2 16-bit. What
do we do?
-Back then 0xD800-0xDFFF was not allocated. Let's split them half and
-use the first half to represent C<upper half of a character> and the
-latter C<lower half of a character>. That way you can represent 1024
-* 1024 = 1048576 more characters. Now we can store character ranges
-up to \x{10ffff} even with 16-bit encodings. This pair of
-half-character is now called a I<Surrogate Pair> and UTF-16 is the
-name of the encoding that embraces them.
+Back then, the range 0xD800-0xDFFF was not allocated. Let's split
+that range in half and use the first half to represent the C<upper
+half of a character> and the second half to represent the C<lower
+half of a character>. That way, you can represent 1024 * 1024 =
+1048576 more characters. Now we can store character ranges up to
+\x{10ffff} even with 16-bit encodings. This pair of half-characters is
+now called a I<surrogate pair> and UTF-16 is the name of the encoding
+that embraces them.
Here is a formula to ensurrogate a Unicode character \x{10000} and
above;
@@ -432,9 +448,7 @@ perl does not prohibit the use of characters within this range. To perl,
every one of \x{0000_0000} up to \x{ffff_ffff} (*) is I<a character>.
(*) or \x{ffff_ffff_ffff_ffff} if your perl is compiled with 64-bit
- integer support! (**)
-
- (**) Is anything beyond \x{11_0000} still Unicode :?
+ integer support!
=head1 SEE ALSO
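As a cross-check of the corrected UTF-32BE/UTF-32LE table rows, the sample code point they imply (U+1ABCD) can be run through the standard UTF-16 surrogate arithmetic described in the POD. The following is a minimal Perl sketch and is not code from this patch: the helper names `to_surrogates` and `from_surrogates` are hypothetical, and the byte layouts are produced with core `pack` rather than with Encode::Unicode itself.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only; Encode::Unicode performs
# its own conversion internally (in XS by default).

# Split a code point in 0x10000..0x10FFFF into (high, low) surrogates.
sub to_surrogates {
    my ($cp) = @_;
    die "code point out of range" if $cp < 0x10000 || $cp > 0x10FFFF;
    my $v  = $cp - 0x10000;           # 20-bit offset
    my $hi = 0xD800 + ($v >> 10);     # top 10 bits    -> 0xD800-0xDBFF
    my $lo = 0xDC00 + ($v & 0x3FF);   # bottom 10 bits -> 0xDC00-0xDFFF
    return ($hi, $lo);
}

# Recombine a (high, low) surrogate pair into a code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    die "not a high surrogate" unless $hi >= 0xD800 && $hi <= 0xDBFF;
    die "not a low surrogate"  unless $lo >= 0xDC00 && $lo <= 0xDFFF;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

my $cp = 0x1ABCD;
my ($hi, $lo) = to_surrogates($cp);

# Octet layouts for each fixed-endian encoding, shown as hex.
printf "UTF-16BE %s\n", unpack("H*", pack("n2", $hi, $lo));  # d82adfcd
printf "UTF-16LE %s\n", unpack("H*", pack("v2", $hi, $lo));  # 2ad8cddf
printf "UTF-32BE %s\n", unpack("H*", pack("N",  $cp));       # 0001abcd
printf "UTF-32LE %s\n", unpack("H*", pack("V",  $cp));       # cdab0100
```

The four hex strings agree with the table as amended by this diff: the UTF-16 rows were already consistent with U+1ABCD, and the new UTF-32 values (0x0001abcd and 0xcdab0100) now match them.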