Doc describing the GNU Classpath Unicode Attribute Database format

author: Paul Fisher <rao@gnu.org> 1998-08-09 23:07:47 +0000
committer: Paul Fisher <rao@gnu.org> 1998-08-09 23:07:47 +0000
commit: c5ffc63a3c55d65882fee858e396d5ca0bfd690a (patch)
tree: 421be8e81e49865d6a2a0282a62ee3b646f5a767 /doc
parent: e640b39f8231747473b04518f10f1c53300dbbbd (diff)
download: classpath-c5ffc63a3c55d65882fee858e396d5ca0bfd690a.tar.gz
1 files changed, 105 insertions, 0 deletions
diff --git a/doc/unicode/unicode.database.format b/doc/unicode/unicode.database.format
new file mode 100644
index 000000000..6621755cd
--- /dev/null
+++ b/doc/unicode/unicode.database.format
@@ -0,0 +1,105 @@
+GNU Classpath Unicode Attribute Database
+----------------------------------------
+java.lang.Character allows one to retrieve information on all 38,887
+characters of the Unicode character set.  This is a lot of data.  The
+database specification outlined here is meant to be fast, small, and
+upgradable to new versions of the Unicode 2 specification (minus
+Character.Subset information) by running a script on the data files
+that the Unicode Consortium distributes.
+
+The database consists of three files:
+1) character.uni (main database of character attributes)
+2) block.uni (mappings from each block to offset in char file)
+3) titlecase.uni (list of characters where titlecase differs from uppercase)
+
+File sizes for Unicode 2.1.2 spec
+---------------------------------
+character.uni: 16359 bytes
+block.uni    :  4995 bytes
+titlecase.uni:    16 bytes
+
+All quantities are unsigned unless otherwise specified.
+All quantities are stored in big endian format.
+
+character.uni
+-------------
+Each character outside a compressed block has its own entry in the
+character.uni file; all characters of a compressed block share a single
+entry.  Entries are stored sequentially, based on the Unicode character
+number, and there are no null entries.  Each entry consists of 7 bytes.
+
+C = Category
+N = Numerical Decimal Value
+    (0xFFFF if unused, 0xFFFE if not representable as nonnegative 
+     integer value.  If category is "Nd", then this is also the
+     decimal digit value.)
+S = No-break version of a space (1/0)
+U = Uppercase mapping (0 = no mapping)
+L = Lowercase mapping (0 = no mapping)
+x = Empty
+
+ xxSCCCCC   NNNNNNNN   NNNNNNNN   UUUUUUUU   UUUUUUUU
+\________/ \________/ \________/ \________/ \________/
+  byte 6     byte 5     byte 4     byte 3     byte 2
+
+ LLLLLLLL   LLLLLLLL
+\________/ \________/
+  byte 1     byte 0
+
+ranges of values in Unicode 2.1.2 spec
+
+C = 0..28 (Sun uses 0..28, and skips 17, so that's what we do too)
+B = 1..69
+N = 0..10000
+D = 0..9
+
+block.uni
+---------
+Characters within the Unicode specification tend to come in blocks --
+sets of sequential characters.  The Classpath Unicode database takes
+advantage of this property.  Each entry in the block.uni file consists
+of 9 bytes.  Entries are stored sequentially, based on the Unicode
+character number which starts a block.  If the compressed bit is set,
+then there is only one entry for this block in the character.uni file.
+That entry in the character.uni file represents the attributes of all
+the characters of that block.
+
+Note: For Unicode 2.1.2, compressed blocks are mandatory for:
+
+U+4E00 - U+9FFF: The CJK Ideographs Area
+U+AC00 - U+D7A3: The Hangul Syllables Area
+U+D800 - U+DFFF: The Surrogates Area
+U+E000 - U+F8FF: The Private Use Area
+U+F900 - U+FAFF: CJK Compatibility Ideographs
+
+S = Unicode character which represents start of block
+E = Unicode character which represents end of block
+O = Offset of this block within the character.uni file
+C = Compressed
+x = Empty
+
+ SSSSSSSS   SSSSSSSS   EEEEEEEE   EEEEEEEE   xxxxxxxC
+\________/ \________/ \________/ \________/ \________/
+  byte 8     byte 7     byte 6     byte 5     byte 4
+
+ OOOOOOOO   OOOOOOOO   OOOOOOOO   OOOOOOOO  
+\________/ \________/ \________/ \________/ 
+  byte 3     byte 2     byte 1     byte 0  
+
+titlecase.uni
+-------------
+Characters in which the titlecase differs from the uppercase are
+stored in titlecase.uni.  There are only four characters in the
+Unicode 2.1.2 specification which fit this description, and it's
+doubtful that any others will ever be added to the specification.
+However, we should be able to support more, without changing
+java.lang.Character, and this is why we have not hardcoded these
+values.  Each entry is 4 bytes.  Entries are stored sequentially,
+based on the Unicode character number.
+
+U = Unicode character which has a titlecase
+T = Unicode mapping to titlecase
+
+ UUUUUUUU   UUUUUUUU   TTTTTTTT   TTTTTTTT
+\________/ \________/ \________/ \________/
+  byte 3     byte 2     byte 1     byte 0
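
Reader's sketch (not part of the patch): the three record layouts above
can be decoded roughly as follows.  This is a minimal illustration, not
the Classpath implementation.  It assumes that the highest-numbered byte
in each diagram comes first in the file (consistent with big endian
storage), that O is a byte offset into character.uni, and that a block
which is not compressed holds one 7-byte entry per character starting at
O.  All function and type names are hypothetical.  The documented file
sizes are at least consistent with the entry sizes: 16359 / 7 = 2337,
4995 / 9 = 555, and 16 / 4 = 4 entries.

```python
import struct
from bisect import bisect_right
from typing import Dict, List, NamedTuple, Optional

# Assumed on-disk order: highest-numbered byte of each diagram first.
CHAR_ENTRY = struct.Struct(">BHHH")   # xxSCCCCC, N, U, L   (7 bytes)
BLOCK_ENTRY = struct.Struct(">HHBI")  # S, E, xxxxxxxC, O   (9 bytes)
TITLE_ENTRY = struct.Struct(">HH")    # U, T                (4 bytes)

NUMERIC_UNUSED = 0xFFFF           # N value meaning "unused"
NUMERIC_NOT_NONNEGATIVE = 0xFFFE  # N not representable as nonnegative int


class CharAttr(NamedTuple):
    category: int   # C, low 5 bits of the first byte
    no_break: bool  # S, bit 5 of the first byte
    numeric: int    # N, raw value including the two sentinels above
    upper: int      # U, 0 means no mapping
    lower: int      # L, 0 means no mapping


class Block(NamedTuple):
    start: int
    end: int
    compressed: bool
    offset: int     # assumed to be a byte offset into character.uni


def decode_char(raw: bytes) -> CharAttr:
    """Decode one 7-byte character.uni entry."""
    flags, numeric, upper, lower = CHAR_ENTRY.unpack(raw)
    return CharAttr(flags & 0x1F, bool(flags & 0x20), numeric, upper, lower)


def decode_blocks(data: bytes) -> List[Block]:
    """Decode the whole block.uni file into a list sorted by start."""
    return [Block(s, e, bool(flag & 0x01), off)
            for s, e, flag, off in BLOCK_ENTRY.iter_unpack(data)]


def decode_titlecase(data: bytes) -> Dict[int, int]:
    """Decode titlecase.uni into a {character: titlecase} mapping."""
    return dict(TITLE_ENTRY.iter_unpack(data))


def lookup(ch: int, blocks: List[Block], char_data: bytes) -> Optional[CharAttr]:
    """Find the attributes of character ch, or None if no block covers it."""
    # Blocks are sorted by start, so binary search for the last start <= ch.
    i = bisect_right([b.start for b in blocks], ch) - 1
    if i < 0 or ch > blocks[i].end:
        return None
    block = blocks[i]
    # A compressed block has a single entry shared by all its characters.
    index = 0 if block.compressed else ch - block.start
    pos = block.offset + index * CHAR_ENTRY.size
    return decode_char(char_data[pos:pos + CHAR_ENTRY.size])
```

With the real files read as bytes, lookup(ord("A"), decode_blocks(block_bytes),
char_bytes) would return the attributes for U+0041, provided the
assumptions above match what the generator script actually writes.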