Document the UTF-16 surrogate encoding and decoding.

p4raw-id: //depot/perl@12932
author: Jarkko Hietaniemi <jhi@iki.fi> 2001-11-10 15:15:29 +0000
committer: Jarkko Hietaniemi <jhi@iki.fi> 2001-11-10 15:15:29 +0000
commit: 7a4efbb2994f7a46ffdcc9d9a4730a93332c577a (patch)
tree: e85eb20b2c9b8102f83cc44554948249f9911827 /ext
parent: 3e16932524becaa89351f54bcab041596f67d434 (diff)
download: perl-7a4efbb2994f7a46ffdcc9d9a4730a93332c577a.tar.gz
1 files changed, 13 insertions, 1 deletions
diff --git a/ext/Encode/Encode.pm b/ext/Encode/Encode.pm
index 863022b18f..f2116c54e4 100644
--- a/ext/Encode/Encode.pm
+++ b/ext/Encode/Encode.pm
@@ -728,9 +728,21 @@ For CHECK see L</"Handling Malformed Data">.
 =head2 Other Encodings of Unicode
 
 UTF-16 is similar to UCS-2, 16 bit or 2-byte chunks.  UCS-2 can only
-represent 0..0xFFFF, while UTF-16 has a "surrogate pair" scheme which
+represent 0..0xFFFF, while UTF-16 has a I<surrogate pair> scheme which
 allows it to cover the whole Unicode range.
 
+Surrogates are code points set aside to encode the 0x10000..0x10FFFF
+range of Unicode code points in pairs of 16-bit units.  The I<high
+surrogates> are the range 0xD800..0xDBFF, and the I<low surrogates>
+are the range 0xDC00..0xDFFF.  The surrogate encoding is
+
+	$hi = int(($uni - 0x10000) / 0x400) + 0xD800;
+	$lo = ($uni - 0x10000) % 0x400 + 0xDC00;
+
+and the decoding is
+
+	$uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
+
 Encode implements big-endian UCS-2 aliased to "iso-10646-1" as that
 happens to be the name used by that representation when used with X11
 fonts.
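
The surrogate arithmetic documented in this patch can be checked with a short round trip. The following is a minimal sketch and not part of Encode: the helper names encode_surrogates and decode_surrogates are hypothetical, and the range checks are an assumption about how invalid input might be rejected. It uses the high surrogate range 0xD800..0xDBFF and the low surrogate range 0xDC00..0xDFFF, and truncates the division with int() because Perl's "/" is not integer division by default.

```perl
use strict;
use warnings;

# Hypothetical helper: split a supplementary code point into a surrogate pair.
sub encode_surrogates {
    my ($uni) = @_;
    die sprintf("U+%04X is outside 0x10000..0x10FFFF\n", $uni)
        if $uni < 0x10000 || $uni > 0x10FFFF;
    my $offset = $uni - 0x10000;
    my $hi = int($offset / 0x400) + 0xD800;   # top 10 bits of the offset
    my $lo = $offset % 0x400 + 0xDC00;        # bottom 10 bits of the offset
    return ($hi, $lo);
}

# Hypothetical helper: combine a surrogate pair back into a code point.
sub decode_surrogates {
    my ($hi, $lo) = @_;
    die sprintf("0x%04X is not a high surrogate\n", $hi)
        if $hi < 0xD800 || $hi > 0xDBFF;
    die sprintf("0x%04X is not a low surrogate\n", $lo)
        if $lo < 0xDC00 || $lo > 0xDFFF;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

# U+1D11E (MUSICAL SYMBOL G CLEF) encodes as 0xD834 0xDD1E.
my ($hi, $lo) = encode_surrogates(0x1D11E);
printf "U+%04X -> 0x%04X 0x%04X\n", 0x1D11E, $hi, $lo;
printf "0x%04X 0x%04X -> U+%04X\n", $hi, $lo, decode_surrogates($hi, $lo);
```

The boundary cases are a useful sanity check: 0x10000 should map to the pair 0xD800 0xDC00, and 0x10FFFF to 0xDBFF 0xDFFF.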