From 7a4efbb2994f7a46ffdcc9d9a4730a93332c577a Mon Sep 17 00:00:00 2001
From: Jarkko Hietaniemi
Date: Sat, 10 Nov 2001 15:15:29 +0000
Subject: Document the UTF-16 surrogate encoding and decoding.

p4raw-id: //depot/perl@12932
---
 ext/Encode/Encode.pm | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

(limited to 'ext')

diff --git a/ext/Encode/Encode.pm b/ext/Encode/Encode.pm
index 863022b18f..f2116c54e4 100644
--- a/ext/Encode/Encode.pm
+++ b/ext/Encode/Encode.pm
@@ -728,9 +728,21 @@ For CHECK see L.
 =head2 Other Encodings of Unicode
 
 UTF-16 is similar to UCS-2, 16 bit or 2-byte chunks. UCS-2 can only
-represent 0..0xFFFF, while UTF-16 has a "surrogate pair" scheme which
+represent 0..0xFFFF, while UTF-16 has a I<surrogate pair> scheme which
 allows it to cover the whole Unicode range.
 
+Surrogates are code points set aside to encode the 0x10000..0x10FFFF
+range of Unicode code points in pairs of 16-bit units. The I<high
+surrogates> are the range 0xD800..0xDBFF, and the I<low surrogates>
+are the range 0xDC00..0xDFFF. The surrogate encoding is
+
+    $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
+    $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
+
+and the decoding is
+
+    $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
+
 Encode implements big-endian UCS-2 aliased to "iso-10646-1" as that
 happens to be the name used by that representation when used with X11
 fonts.
-- 
cgit v1.2.1
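
The surrogate arithmetic documented in this patch can be checked with a small standalone script. This is only a sketch: the helper names `to_surrogates` and `from_surrogates` are hypothetical and not part of Encode. It assumes the standard UTF-16 constants (supplementary range 0x10000..0x10FFFF, high surrogates from 0xD800, low surrogates from 0xDC00) and uses `int()` because Perl's `/` operator is not integer division unless `use integer` is in effect.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; not part of the Encode API.

# Split a supplementary code point (0x10000..0x10FFFF) into a
# (high, low) surrogate pair.
sub to_surrogates {
    my ($uni) = @_;
    die sprintf("U+%04X is outside 0x10000..0x10FFFF\n", $uni)
        if $uni < 0x10000 || $uni > 0x10FFFF;
    my $off = $uni - 0x10000;              # 20-bit offset
    my $hi  = int($off / 0x400) + 0xD800;  # top 10 bits of the offset
    my $lo  = $off % 0x400 + 0xDC00;       # low 10 bits of the offset
    return ($hi, $lo);
}

# Combine a (high, low) surrogate pair back into a code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    die "not a high surrogate\n" if $hi < 0xD800 || $hi > 0xDBFF;
    die "not a low surrogate\n"  if $lo < 0xDC00 || $lo > 0xDFFF;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

my ($hi, $lo) = to_surrogates(0x1F600);
printf "U+%04X -> 0x%04X 0x%04X -> U+%04X\n",
    0x1F600, $hi, $lo, from_surrogates($hi, $lo);
# prints "U+1F600 -> 0xD83D 0xDE00 -> U+1F600"
```

Without `int()`, the high value for U+1F600 would come out as 0xD83D plus a fractional part of 0.5, which is why the sketch truncates explicitly before adding the 0xD800 base.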