author Nick Ing-Simmons <nik@tiuk.ti.com> 2001-02-25 19:36:28 +0000
committer Nick Ing-Simmons <nik@tiuk.ti.com> 2001-02-25 19:36:28 +0000
commit 4edaa979293b3e482a715be9ab66acf7eb848c46 (patch)
tree 5a39b0ff53d1fb685e54998f9a791a94b4d8e887
parent 50d2698546d7dba5ed48b29bf1c13887b041a4ba (diff)
Encode implementations docs.
p4raw-id: //depot/perlio@8938
-rw-r--r-- ext/Encode/Encode.pm 150
1 files changed, 148 insertions, 2 deletions
diff --git a/ext/Encode/Encode.pm b/ext/Encode/Encode.pm
index f17cc1afeb..2d49865491 100644
--- a/ext/Encode/Encode.pm
+++ b/ext/Encode/Encode.pm
@@ -158,9 +158,12 @@ sub decode_utf8
package Encode::Encoding;
# Base class for classes which implement encodings
+
# Temporary legacy methods
-sub toUnicode { shift->decode(@_) }
-sub fromUnicode { shift->encode(@_) }
+sub toUnicode { shift->decode(@_) }
+sub fromUnicode { shift->encode(@_) }
+
+sub new_sequence { return $_[0] }
package Encode::XS;
use base 'Encode::Encoding';
@@ -809,6 +812,149 @@ not a string.
=back
+=head1 IMPLEMENTATION CLASSES
+
+As mentioned above, encodings are (in the current implementation at least)
+defined by objects. The mapping of encoding name to object is via the
+C<%Encode::encodings> hash. (It is a package hash to allow XS code to get
+at it.)
+
+The values of the hash can currently be either strings or objects.
+The string form may go away in the future. The string form occurs
+when C<encodings()> has scanned C<@INC> for loadable encodings but has
+not actually loaded the encoding in question. This is because the
+current "loading" process is all perl and a bit slow.
+
+Once an encoding is loaded, the value of the hash is the object which implements
+the encoding. The object should provide the following interface:
+
+=over 4
+
+=item -E<gt>name
+
+Should return the string representing the canonical name of the encoding.
+
+=item -E<gt>new_sequence
+
+This is a placeholder for encodings with state. It should return an object
+which implements this interface; all current implementations return the
+original object.
+
+=item -E<gt>encode($string,$check)
+
+Should return the octet sequence representing I<$string>. If I<$check> is true
+it should modify I<$string> in place to remove the converted part (i.e.
+the whole string unless there is an error).
+If an error occurs it should return the octet sequence for the
+fragment of the string that has been converted, and modify I<$string> in place
+to remove the converted part, leaving it starting with the problem fragment.
+
+If I<$check> is false then C<encode> should make a "best effort" to convert
+the string - for example by using a replacement character.
+
+=item -E<gt>decode($octets,$check)
+
+Should return the string that I<$octets> represents. If I<$check> is true
+it should modify I<$octets> in place to remove the converted part (i.e.
+the whole sequence unless there is an error).
+If an error occurs it should return the fragment of the string
+that has been converted, and modify I<$octets> in place to remove the converted part,
+leaving it starting with the problem fragment.
+
+If I<$check> is false then C<decode> should make a "best effort" to convert
+the string - for example by using Unicode's "\x{FFFD}" as a replacement character.
+
+=back
+
+It should be noted that the check behaviour is different from the outer
+public API. The logic is that the "unchecked" case is useful when
+encoding is part of a stream which may be reporting errors (e.g. STDERR).
+In such cases it is desirable to get everything through somehow without
+causing additional errors which obscure the original one. Also the encoding
+is best placed to know what the correct replacement character is, so if that
+is the desired behaviour then letting low level code do it is the most efficient.
+
+In contrast, if check is true, the scheme above allows the encoding to do as
+much as it can and tell the layer above how much that was. What is lacking
+at present is a mechanism to report what went wrong. The most likely interface
+will be an additional method call to the object, or perhaps
+(to avoid forcing per-stream objects on otherwise stateless encodings)
+an additional parameter.
+
+It is also highly desirable that encoding classes inherit from C<Encode::Encoding>
+as a base class. This allows that class to define additional behaviour for
+all encoding objects.
+
+=head2 Compiled Encodings
+
+F<Encode.xs> provides a class C<Encode::XS> which implements the interface described
+above. It calls a generic octet-sequence to octet-sequence "engine" that is
+driven by tables (defined in F<encengine.c>). The same engine is used for both
+encode and decode. C<Encode::XS>'s C<encode> forces perl's characters to their UTF-8 form
+and then treats them as just another multibyte encoding. C<Encode::XS>'s C<decode> transforms
+the sequence and then turns the UTF-8-ness flag on, as that is the form that the tables
+are defined to produce. For details of the engine see the comments in F<encengine.c>.
+
+The tables are produced by the perl script F<compile> (the name needs to change so
+we can eventually install it somewhere). F<compile> can currently read two formats:
+
+=over 4
+
+=item *.enc
+
+This is a coined format used by Tcl. It is documented in Encode/EncodeFormat.pod.
+
+=item *.ucm
+
+This is the semi-standard format used by IBM's ICU package.
+
+=back
+
+F<compile> can write the following forms:
+
+=over 4
+
+=item *.ucm
+
+See above - the F<Encode/*.ucm> files provided with the distribution have
+been created from the original Tcl .enc files using this approach.
+
+=item *.c
+
+Produces tables as C data structures - this is used to build encodings
+into F<Encode.so>/F<Encode.dll>.
+
+=item *.xs
+
+In theory this allows encodings to be stand-alone loadable perl extensions.
+The process has not yet been tested. The plan is to use this approach
+for large East Asian encodings.
+
+=back
+
+The set of encodings built-in to F<Encode.so>/F<Encode.dll> is determined by
+F<Makefile.PL>. The current set is as follows:
+
+=over 4
+
+=item ascii and iso-8859-*
+
+That is all the common 8-bit "western" encodings.
+
+=item IBM-1047 and two other variants of EBCDIC.
+
+These are the same variants that are supported by EBCDIC perl as "native" encodings.
+They are included to prove "reversibility" of some constructs in EBCDIC perl.
+
+=item symbol and dingbats as used by Tk on X11.
+
+(The reason Encode got started was to support perl/Tk.)
+
+=back
+
+That set is rather ad hoc and has been driven by the needs of the tests rather
+than the needs of typical applications. It is likely to be rationalized.
+
=head1 SEE ALSO
L<perlunicode>, L<perlebcdic>, L<perlfunc/open>
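
The interface documented in the patch above can be illustrated with a small pure-Perl class. The following is only a sketch under stated assumptions: the package name `My::AsciiSketch`, the encoding name `x-ascii-sketch`, and the use of `?` as the octet replacement character are all hypothetical and are not part of Encode. It shows the checked case (convert what is possible, strip the converted part from the caller's variable, leave the problem fragment at the front) and the unchecked "best effort" case.

```perl
use strict;
use warnings;

{
    # Hypothetical class, for illustration only. A real encoding would
    # normally inherit from Encode::Encoding, as the patch text recommends;
    # that is left out here to keep the sketch self-contained.
    package My::AsciiSketch;

    sub new          { my ($class) = @_; return bless { name => 'x-ascii-sketch' }, $class }
    sub name         { return $_[0]{name} }
    sub new_sequence { return $_[0] }    # stateless: return the same object

    # Shared worker. $repl is the replacement used in unchecked mode.
    # $_[1] aliases the caller's variable, so it can be modified in place.
    sub _convert {
        my ($self, $input, $check, $repl) = @_;
        my ($output, $done) = ('', 0);
        for my $ch (split //, $input) {
            if (ord($ch) > 0x7F) {
                last if $check;          # checked: stop at the problem fragment
                $output .= $repl;        # unchecked: best effort
            }
            else {
                $output .= $ch;
            }
            $done++;
        }
        # Checked mode: remove the converted part from the caller's variable.
        substr($_[1], 0, $done, '') if $check;
        return $output;
    }

    # encode($string, $check): characters to octets ('?' is an assumed choice).
    sub encode { my $self = shift; return $self->_convert($_[0], $_[1], '?') }

    # decode($octets, $check): octets to characters, U+FFFD as replacement.
    sub decode { my $self = shift; return $self->_convert($_[0], $_[1], "\x{FFFD}") }
}

my $enc  = My::AsciiSketch->new;
my $text = "abc\x{263A}def";

# Unchecked: everything gets through somehow, $text is left alone.
my $best = $enc->encode($text, 0);      # "abc?def"

# Checked: only "abc" is returned; $text now starts with U+263A.
my $part = $enc->encode($text, 1);      # "abc"
```

Because the checked call shortens its argument, a caller can tell how far conversion got by inspecting what remains. Reporting why it stopped is the gap the patch text itself identifies.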