diff options
Diffstat (limited to 'ext/Encode/Unicode/Unicode.pm')
-rw-r--r-- | ext/Encode/Unicode/Unicode.pm | 112 |
1 files changed, 63 insertions, 49 deletions
diff --git a/ext/Encode/Unicode/Unicode.pm b/ext/Encode/Unicode/Unicode.pm index 257989a636..bfe03bd646 100644 --- a/ext/Encode/Unicode/Unicode.pm +++ b/ext/Encode/Unicode/Unicode.pm @@ -3,7 +3,7 @@ package Encode::Unicode; use strict; use warnings; -our $VERSION = do { my @r = (q$Revision: 1.32 $ =~ /\d+/g); sprintf "%d."."%02d" x $#r, @r }; +our $VERSION = do { my @r = (q$Revision: 1.35 $ =~ /\d+/g); sprintf "%d."."%02d" x $#r, @r }; use XSLoader; XSLoader::load(__PACKAGE__,$VERSION); @@ -47,11 +47,23 @@ sub new_sequence return bless {%$self},ref($self); } +sub needs_lines { 0 }; + +sub perlio_ok { + eval{ require PerlIO::encoding }; + if ($@){ + return 0; + }else{ + return 1; + } +} + # -# three implementation of (en|de)code exist. XS version is the fastest. -# *_modern use # an array and *_classic stick with substr. *_classic is -# much slower but more memory conservative. *_xs is default. +# three implementations of (en|de)code exist. The XS version is the +# fastest. *_modern uses an array and *_classic sticks with substr. +# *_classic is much slower but more memory conservative. +# *_xs is the default. sub set_transcoder{ no warnings qw(redefine); @@ -273,7 +285,7 @@ __END__ =head1 NAME -Encode::Unicode -- Various Unicode Transform Format +Encode::Unicode -- Various Unicode Transformation Formats =cut @@ -308,8 +320,8 @@ UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE and UTF-32LE. UTF-16BE 2/4 N Y S.P S.P 0xd82a,0xdfcd UTF-16LE 2 N Y S.P S.P 0x2ad8,0xcddf UTF-32 4 Y - is bogus As is BE/LE - UTF-32BE 4 N - bogus As is 0x0010abcd - UTF-32LE 4 N - bogus As is 0xcdab1000 + UTF-32BE 4 N - bogus As is 0x0001abcd + UTF-32LE 4 N - bogus As is 0xcdab0100 UTF-8 1-4 - - bogus >= 4 octets \xf0\x9a\af\8d ---------------+-----------------+------------------------------ @@ -317,38 +329,41 @@ UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE and UTF-32LE. =head1 Size, Endianness, and BOM -You can categorize these CES by 3 criteria; Size of each character, -Endianness, and Byte Order Mark. +You can categorize these CES by 3 criteria: size of each character, +endianness, and Byte Order Mark. -=head2 by Size +=head2 by size UCS-2 is a fixed-length encoding with each character taking 16 bits. -It B<does not> support I<Surrogate Pairs>. When a surrogate pair is -encountered during decode(), its place is filled with \xFFFD without -I<CHECK> or croaks if I<CHECK>. When a character whose ord value is -larger than 0xFFFF is encountered, it uses 0xFFFD without I<CHECK> or -croaks if <CHECK>. - -UTF-16 is almost the same as UCS-2 but it supports I<Surrogate Pairs>. +It B<does not> support I<surrogate pairs>. When a surrogate pair +is encountered during decode(), its place is filled with \x{FFFD} +if I<CHECK> is 0, or the routine croaks if I<CHECK> is 1. When a +character whose ord value is larger than 0xFFFF is encountered, +its place is filled with \x{FFFD} if I<CHECK> is 0, or the routine +croaks if I<CHECK> is 1. + +UTF-16 is almost the same as UCS-2 but it supports I<surrogate pairs>. When it encounters a high surrogate (0xD800-0xDBFF), it fetches the -following low surrogate (0xDC00-0xDFFF), C<desurrogate>s them to form a -character. Bogus surrogates result in death. When \x{10000} or above -is encountered during encode(), it C<ensurrogate>s them and pushes the -surrogate pair to the output stream. +following low surrogate (0xDC00-0xDFFF) and C<desurrogate>s them to +form a character. Bogus surrogates result in death. When \x{10000} +or above is encountered during encode(), it C<ensurrogate>s them and +pushes the surrogate pair to the output stream. UTF-32 is a fixed-length encoding with each character taking 32 bits. -Since it is 32-bit there is no need for I<Surrogate Pairs>. +Since it is 32-bit, there is no need for I<surrogate pairs>. -=head2 by Endianness +=head2 by endianness -First (and now failed) goal of Unicode was to map all character -repertories into a fixed-length integer so programmers are happy. -Since each character is either I<short> or I<long> in C, you have to -put endianness of each platform when you pass data to one another. +The first (and now failed) goal of Unicode was to map all character +repertoires into a fixed-length integer so that programmers are happy. +Since each character is either a I<short> or I<long> in C, you have to +pay attention to the endianness of each platform when you pass data +to one another. Anything marked as BE is Big Endian (or network byte order) and LE is -Little Endian (aka VAX byte order). For anything without, a character -called Byte Order Mark (BOM) is prepended to the head of string. +Little Endian (aka VAX byte order). For anything not marked either +BE or LE, a character called Byte Order Mark (BOM) indicating the +endianness is prepended to the string. =over 4 @@ -362,31 +377,31 @@ called Byte Order Mark (BOM) is prepended to the head of string. =back -This modules handles BOM as follows. +This modules handles the BOM as follows. =over 4 =item * When BE or LE is explicitly stated as the name of encoding, BOM is -simply treated as one of characters (ZERO WIDTH NO-BREAK SPACE). +simply treated as a normal character (ZERO WIDTH NO-BREAK SPACE). =item * -When BE or LE is omitted during decode(), it checks if BOM is in the -beginning of the string and if found endianness is set to what BOM -says. If not found, dies. +When BE or LE is omitted during decode(), it checks if BOM is at the +beginning of the string; if one is found, the endianness is set to +what the BOM says. If no BOM is found, the routine dies. =item * When BE or LE is omitted during encode(), it returns a BE-encoded string with BOM prepended. So when you want to encode a whole text -file, make sure you encode() by whole text, not line by line or each -line, not file, is prepended with BOMs. +file, make sure you encode() the whole text at once, not line by line +or each line, not file, will have a BOM prepended. =item * -C<UCS-2> is an exception. Unlike others this is an alias of UCS-2BE. +C<UCS-2> is an exception. Unlike others, this is an alias of UCS-2BE. UCS-2 is already registered by IANA and others that way. =back @@ -404,18 +419,19 @@ magnitude so let's forgive them. Vogons here ;) Or, comparing Encode to Babel Fish is completely appropriate -- if you can only stick this into your ear :) -Surrogate pairs were born when Unicode Consortium finally +Surrogate pairs were born when the Unicode Consortium finally admitted that 16 bits were not big enough to hold all the world's -character repertoire. But they have already made UCS-2 16-bit. What +character repertoires. But they already made UCS-2 16-bit. What do we do? -Back then 0xD800-0xDFFF was not allocated. Let's split them half and -use the first half to represent C<upper half of a character> and the -latter C<lower half of a character>. That way you can represent 1024 -* 1024 = 1048576 more characters. Now we can store character ranges -up to \x{10ffff} even with 16-bit encodings. This pair of -half-character is now called a I<Surrogate Pair> and UTF-16 is the -name of the encoding that embraces them. +Back then, the range 0xD800-0xDFFF was not allocated. Let's split +that range in half and use the first half to represent the C<upper +half of a character> and the second half to represent the C<lower +half of a character>. That way, you can represent 1024 * 1024 = +1048576 more characters. Now we can store character ranges up to +\x{10ffff} even with 16-bit encodings. This pair of half-character is +now called a I<surrogate pair> and UTF-16 is the name of the encoding +that embraces them. Here is a formula to ensurrogate a Unicode character \x{10000} and above; @@ -432,9 +448,7 @@ perl does not prohibit the use of characters within this range. To perl, every one of \x{0000_0000} up to \x{ffff_ffff} (*) is I<a character>. (*) or \x{ffff_ffff_ffff_ffff} if your perl is compiled with 64-bit - integer support! (**) - - (**) Is anything beyond \x{11_0000} still Unicode :? + integer support! =head1 SEE ALSO |