diff options
Diffstat (limited to 'lib/utf8.pm')
-rw-r--r-- | lib/utf8.pm | 181 |
1 files changed, 181 insertions, 0 deletions
diff --git a/lib/utf8.pm b/lib/utf8.pm new file mode 100644 index 0000000000..be46d17230 --- /dev/null +++ b/lib/utf8.pm @@ -0,0 +1,181 @@ +package utf8; + +sub import { + $^H |= 0x00000008; + $enc{caller()} = $_[1] if $_[1]; +} + +sub unimport { + $^H &= ~0x00000008; +} + +sub AUTOLOAD { + require "utf8_heavy.pl"; + goto &$AUTOLOAD; +} + +1; +__END__ + +=head1 NAME + +utf8 - Perl pragma to turn on UTF-8 and Unicode support + +=head1 SYNOPSIS + + use utf8; + no utf8; + +=head1 DESCRIPTION + +The utf8 pragma tells Perl to use UTF-8 as its internal string +representation for the rest of the enclosing block. (The "no utf8" +pragma tells Perl to switch back to ordinary byte-oriented processing +for the rest of the enclosing block.) Under utf8, many operations that +formerly operated on bytes change to operating on characters. For +ASCII data this makes no difference, because UTF-8 stores ASCII in +single bytes, but for any character greater than C<chr(127)>, the +character is stored in a sequence of two or more bytes, all of which +have the high bit set. But by and large, the user need not worry about +this, because the utf8 pragma hides it from the user. A character +under utf8 is logically just a number ranging from 0 to 2**32 or so. +Larger characters encode to longer sequences of bytes, but again, this +is hidden. + +Use of the utf8 pragma has the following effects: + +=over 4 + +=item * + +Strings and patterns may contain characters that have an ordinal value +larger than 255. Presuming you use a Unicode editor to edit your +program, these will typically occur directly within the literal strings +as UTF-8 characters, but you can also specify a particular character +with an extension of the C<\x> notation. UTF-8 characters are +specified by putting the hexidecimal code within curlies after the +C<\x>. For instance, a Unicode smiley face is C<\x{263A}>. A +character in the Latin-1 range (128..255) should be written C<\x{ab}> +rather than C<\xab>, since the former will turn into a two-byte UTF-8 +code, while the latter will continue to be interpreted as generating a +8-bit byte rather than a character. In fact, if -w is turned on, it will +produce a warning that you might be generating invalid UTF-8. + +=item * + +Identifiers within the Perl script may contain Unicode alphanumeric +characters, including ideographs. (You are currently on your own when +it comes to using the canonical forms of characters--Perl doesn't (yet) +attempt to canonicalize variable names for you.) + +=item * + +Regular expressions match characters instead of bytes. For instance, +"." matches a character instead of a byte. (However, the C<\C> pattern +is provided to force a match a single byte ("C<char>" in C, hence +C<\C>).) + +=item * + +Character classes in regular expressions match characters instead of +bytes, and match against the character properties specified in the +Unicode properties database. So C<\w> can be used to match an ideograph, +for instance. + +=item * + +Named Unicode properties and block ranges make be used as character +classes via the new C<\p{}> (matches property) and C<\P{}> (doesn't +match property) constructs. For instance, C<\p{Lu}> matches any +character with the Unicode uppercase property, while C<\p{M}> matches +any mark character. Single letter properties may omit the brackets, so +that can be written C<\pM> also. Many predefined character classes are +available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>. + +=item * + +The special pattern C<\X> match matches any extended Unicode sequence +(a "combining character sequence" in Standardese), where the first +character is a base character and subsequent characters are mark +characters that apply to the base character. It is equivalent to +C<(?:\pM\PM*)>. + +=item * + +The C<tr///> operator translates characters instead of bytes. It can also +be forced to translate between 8-bit codes and UTF-8 regardless of the +surrounding utf8 state. For instance, if you know your input in Latin-1, +you can say: + + use utf8; + while (<>) { + tr/\0-\xff//CU; # latin1 char to utf8 + ... + } + +Similarly you could translate your output with + + tr/\0-\x{ff}//UC; # utf8 to latin1 char + +No, C<s///> doesn't take /U or /C (yet?). + +=item * + +Case translation operators use the Unicode case translation tables. +Note that C<uc()> translates to uppercase, while C<ucfirst> translates +to titlecase (for languages that make the distinction). Naturally +the corresponding backslash sequences have the same semantics. + +=item * + +Most operators that deal with positions or lengths in the string will +automatically switch to using character positions, including C<chop()>, +C<substr()>, C<pos()>, C<index()>, C<rindex()>, C<sprintf()>, +C<write()>, and C<length()>. Operators that specifically don't switch +include C<vec()>, C<pack()>, and C<unpack()>. Operators that really +don't care include C<chomp()>, as well as any other operator that +treats a string as a bucket of bits, such as C<sort()>, and the +operators dealing with filenames. + +=item * + +The C<pack()>/C<unpack()> letters "C<c>" and "C<C>" do I<not> change, +since they're often used for byte-oriented formats. (Again, think +"C<char>" in the C language.) However, there is a new "C<U>" specifier +that will convert between UTF-8 characters and integers. (It works +outside of the utf8 pragma too.) + +=item * + +The C<chr()> and C<ord()> functions work on characters. This is like +C<pack("U")> and C<unpack("U")>, not like C<pack("C")> and +C<unpack("C")>. In fact, the latter are how you now emulate +byte-oriented C<chr()> and C<ord()> under utf8. + +=item * + +And finally, C<scalar reverse()> reverses by character rather than by byte. + +=back + +=head1 CAVEATS + +As of yet, there is no method for automatically coercing input and +output to some encoding other than UTF-8. This is planned in the near +future, however. + +In any event, you'll need to keep track of whether interfaces to other +modules expect UTF-8 data or something else. The utf8 pragma does not +magically mark strings for you in order to remember their encoding, nor +will any automatic coercion happen (other than that eventually planned +for I/O). If you want such automatic coercion, you can build yourself +a set of pretty object-oriented modules. Expect it to run considerably +slower than than this low-level support. + +Use of locales with utf8 may lead to odd results. Currently there is +some attempt to apply 8-bit locale info to characters in the range +0..255, but this is demonstrably incorrect for locales that use +characters above that range (when mapped into Unicode). It will also +tend to run slower. Avoidance of locales is strongly encouraged. + +=cut |