author | Aaron M. Renn <arenn@urbanophile.com> | 1999-01-03 06:59:02 +0000 |
---|---|---|
committer | Aaron M. Renn <arenn@urbanophile.com> | 1999-01-03 06:59:02 +0000 |
commit | cf9e3fd3e35697f6f51e4d9b37d913f11c6f80e4 (patch) | |
tree | f6af6a785c3ba18291b2a760b94e980e88cdacad /doc | |
parent | 7300b43f313d6ffcba56e57acf26666de969b23c (diff) | |
Added section on byte/char converters
Diffstat (limited to 'doc')
-rw-r--r-- | doc/hacking.texinfo | 107 |
1 files changed, 104 insertions, 3 deletions
diff --git a/doc/hacking.texinfo b/doc/hacking.texinfo
index 12c7ec411..cf5919bbf 100644
--- a/doc/hacking.texinfo
+++ b/doc/hacking.texinfo
@@ -11,7 +11,7 @@ This file contains important information you will need to know if you are going to hack on the GNU Classpath project code.
-Copyright (C) 1998 Free Software Foundation, Inc.
+Copyright (C) 1998, 1999 Free Software Foundation, Inc.
 @end ifinfo
@@ -23,7 +23,7 @@ Copyright (C) 1998 Free Software Foundation, Inc.
 @page
 @vskip 0pt plus 1filll
-Copyright @copyright{} 1998 Free Software Foundation, Inc.
+Copyright @copyright{} 1998, 1999 Free Software Foundation, Inc.
 @sp 2
 Permission is granted to make and distribute verbatim copies of
 this document provided the copyright notice and this permission notice
@@ -66,6 +66,7 @@ This document is definitely a work in progress.
 * Native Efficiency:: Tips for making native Java code faster
 * Specification Sources:: Where to find the Java class library specs
 * Naming Conventions:: How files and directories are named in Classpath
+* Character Conversions:: How byte to char conversions work in Classpath
 @end menu
 @node Introduction, Requirements, Top, Top
@@ -566,7 +567,7 @@ papers are more canonical than the JavaDoc documentation. This is true in general.
-@node Naming Conventions, , Specification Sources, Top
+@node Naming Conventions, Character Conversions, Specification Sources, Top
 @comment node-name, next, previous, up
 @chapter Directory and File Naming Conventions
@@ -688,5 +689,105 @@ java.util.InetAddress would go in native/java.net/InetAddress.c.
 reside in files of any name.
 @end itemize
+@node Character Conversions, , Naming Conventions, Top
+@comment node-name, next, previous, up
+@chapter Character Conversions
+
+Java uses the Unicode character encoding system internally. This is a
+sixteen bit (two byte) collection of characters encompassing most of the
+world's written languages. However, Java programs must often deal with
+outside interfaces that are byte (eight bit) oriented. For example, a
+Unix file, a stream of data from a network socket, etc. Beginning with
+Java 1.1, the @code{Reader} and @code{Writer} classes provide functionality
+for dealing with character oriented streams. The classes
+@code{InputStreamReader} and @code{OutputStreamWriter} bridge the gap
+between byte streams and character streams by converting bytes to
+Unicode characters and vice versa.
+
+In Classpath, @code{InputStreamReader} and @code{OutputStreamWriter}
+rely on an internal class called @code{gnu.java.io.EncodingManager} to load
+translators that perform the actual conversion. There are two types of
+converters, encoders and decoders. Encoders are subclasses of
+@code{gnu.java.io.encoder.Encoder}. This type of converter takes a Java
+(Unicode) character stream or buffer and converts it to bytes using
+a specified encoding scheme. Decoders are subclasses of
+@code{gnu.java.io.decoder.Decoder}. This type of converter takes a
+byte stream or buffer and converts it to Unicode characters. The
+@code{Encoder} and @code{Decoder} classes are subclasses of
+@code{Writer} and @code{Reader} respectively, and so can be used in
+contexts that require character streams, but the Classpath implementation
+currently does not make use of them in this fashion.
+
+The @code{EncodingManager} class searches for requested encoders and
+decoders by name. Since encoders and decoders are separate in Classpath,
+it is possible to have a decoder without an encoder for a particular
+encoding scheme, or vice versa. @code{EncodingManager} searches the
+package path specified by the @code{file.encoding.pkg} property. The
+name of the encoder or decoder is appended to the search path to
+produce the required class name. Note that @code{EncodingManager} knows
+about the default system encoding scheme, which it retrieves from the
+system property @code{file.encoding}, and it will return the proper
+translator for the default encoding if no scheme is specified. Also, the
+Classpath standard translator library, which is the @code{gnu.java.io} package,
+is automatically appended to the end of the path.
+
+For efficiency, @code{EncodingManager} maintains a cache of translators
+that it has loaded. This eliminates the need to search for a commonly
+used translator each time it is requested.
+
+Finally, @code{EncodingManager} supports aliasing of encoding scheme names.
+For example, the ISO Latin-1 encoding scheme can be referred to as
+''8859_1'' or ''ISO-8859-1''. @code{EncodingManager} searches for
+aliases by looking for the existence of a system property called
+@code{gnu.java.io.encoding_scheme_alias.<encoding name>}. If such a
+property exists, the value of that property is assumed to be the
+canonical name of the encoding scheme, and a translator with that name is
+looked up instead of one with the original name.
+
+Here is an example of how @code{EncodingManager} works. A class requests
+a decoder for the ''UTF-8'' encoding scheme by calling
+@code{EncodingManager.getDecoder("UTF-8")}. First, an alias is searched
+for by looking for the system property
+@code{gnu.java.io.encoding_scheme_alias.UTF-8}. In our example, this
+property exists and has the value ''UTF8''. That is the actual
+decoder that will be searched for. Next, @code{EncodingManager} looks
+in its cache for this translator. Assuming it does not find it, it
+searches the translator path, which in this example consists only of
+the default @code{gnu.java.io}. The ''decoder'' package name is
+appended since we are looking for a decoder. (''encoder'' would be
+used if we were looking for an encoder). Then the name of the translator
+is appended. So @code{EncodingManager} attempts to load a translator
+class called @code{gnu.java.io.decoder.UTF8}. If that class is found,
+an instance of it is returned. If it is not found, an
+@code{UnsupportedEncodingException} is thrown.
+
+To write a new translator, it is only necessary to subclass
+@code{Encoder} and/or @code{Decoder}. Only a handful of abstract
+methods need to be implemented. In general, no methods need to be
+overridden. The needed methods calculate the number of bytes/chars
+that the translation will generate, convert buffers to/from bytes,
+and read/write a requested number of characters to/from a stream.
+
+Many common encoding schemes use only eight bits to encode characters.
+Writing a translator for these encodings is very easy. There are
+abstract translator classes @code{gnu.java.io.decode.DecoderEightBitLookup}
+and @code{gnu.java.io.encode.EncoderEightBitLookup}. These classes
+implement all of the necessary methods. All that is necessary is to
+create a lookup table array that maps bytes to Unicode characters and
+set the class variable @code{lookup_table} equal to it in a static
+initializer. Also, a single constructor that takes an appropriate
+stream as an argument must be supplied. These translators are
+exceptionally easy to create and there are several of them supplied
+in the Classpath distribution.
+
+Writing multi-byte or variable-byte encodings is more difficult, but
+often not especially challenging. The Classpath distribution ships with
+translators for the UTF8 encoding scheme which uses from one to three
+bytes to encode Unicode characters. This can serve as an example of
+how to write such a translator.
+
+Many more translators are needed. All major character encodings should
+eventually be supported.
+
 @bye