\input texinfo
@setfilename mbapi.info
@settitle Multibyte API
@setchapternewpage off

@c Open issues:

@c What's the best way to report errors?  Should functions return a
@c magic value, according to C tradition, or should they signal a
@c Guile exception?

@c

@node Working With Multibyte Strings in C
@chapter Working With Multibyte Strings in C

Guile allows strings to contain characters drawn from a wide variety of
languages, including many Asian, Eastern European, and Middle Eastern
languages, in a uniform and unrestricted way.  The string representation
normally used in C code --- an array of @sc{ASCII} characters --- is not
sufficient for Guile strings, since they may contain characters not
present in @sc{ASCII}.

Instead, Guile uses a very large character set, and encodes each
character as a sequence of one or more bytes.  We call this
variable-width encoding a @dfn{multibyte} encoding.  Guile uses this
single encoding internally for all strings, symbol names, error
messages, etc., and performs appropriate conversions upon input and
output.

The use of this variable-width encoding is almost invisible to Scheme
code.  Strings are still indexed by character number, not by byte
offset; @code{string-length} still returns the length of a string in
characters, not in bytes.  @code{string-ref} and @code{string-set!} are
no longer guaranteed to be constant-time operations, but Guile uses
various strategies to reduce the impact of this change.

However, the encoding is visible via Guile's C interface, which gives
the user direct access to a string's bytes.  This chapter explains how
to work with Guile multibyte text in C code.  Since variable-width
encodings are clumsier to work with than simple fixed-width encodings,
Guile provides a set of standard macros and functions for manipulating
multibyte text to make the job easier.  Furthermore, Guile makes some
promises about the encoding which you can use in writing your own text
processing code.

While we discuss guaranteed properties of Guile's encoding, and provide
functions to operate on its character set, we do not actually specify
either the character set or encoding here.  This is because we expect
both of them to change in the future: currently, Guile uses the same
encoding as GNU Emacs 20.4, but we hope to change Guile (and GNU Emacs
as well) to use Unicode and UTF-8, with some extensions.  This will make
it more comfortable to use Guile with other systems which use UTF-8,
like the GTK user interface toolkit.

@menu
* Multibyte String Terminology::
* Promised Properties of the Guile Multibyte Encoding::
* Functions for Operating on Multibyte Text::
* Multibyte Text Processing Errors::
* Why Guile Does Not Use a Fixed-Width Encoding::
@end menu

@node Multibyte String Terminology, Promised Properties of the Guile Multibyte Encoding, Working With Multibyte Strings in C, Working With Multibyte Strings in C
@section Multibyte String Terminology

In the descriptions which follow, we make the following definitions:
@table @dfn

@item byte
A @dfn{byte} is a number between 0 and 255.  It has no inherent textual
interpretation.  So 65 is a byte, not a character.

@item character
A @dfn{character} is a unit of text.  It has no inherent numeric value.
@samp{A} and @samp{.} are characters, not bytes.  (This is different
from the C language's definition of @dfn{character}; in this chapter, we
will always use a phrase like ``the C language's @code{char} type'' when
that's what we mean.)

@item character set
A @dfn{character set} is an invertible mapping between numbers and a
given set of characters.  @sc{ASCII} is a character set assigning
characters to the numbers 0 through 127.  It maps @samp{A} onto the
number 65, and @samp{.} onto 46.

Note that a character set maps characters onto numbers, @emph{not
necessarily} onto bytes.  For example, the Unicode character set maps
the Greek lower-case @samp{alpha} character onto the number 945, which
is not a byte.
(This is what Internet standards would call a ``coded character set''.)

@item encoding
An @dfn{encoding} maps numbers onto sequences of bytes.  For example,
the UTF-8 encoding, defined in the Unicode Standard, would map the
number 945 onto the sequence of bytes @samp{206 177}.  When using the
@sc{ASCII} character set, every number assigned also happens to be a
byte, so there is an obvious trivial encoding for @sc{ASCII} in bytes.

(This is what Internet standards would call a ``character encoding
scheme''.)

@end table

Thus, to turn a character into a sequence of bytes, you need a character
set to assign a number to that character, and then an encoding to turn
that number into a sequence of bytes.

Likewise, to interpret a sequence of bytes as a sequence of characters,
you use an encoding to extract a sequence of numbers from the bytes, and
then a character set to turn the numbers into characters.

Errors can occur while carrying out either of these processes.  Under a
particular encoding, a given string of bytes might not correspond to any
number.  For example, the byte sequence @samp{128 128} is not a valid
encoding of any number under UTF-8.

Having carefully defined our terminology, we will now abuse it.

We will sometimes use the word @dfn{character} to refer to the number
assigned to a character by a character set, in contexts where it's
obvious we mean a number.

Sometimes there is a close association between a particular encoding and
a particular character set.  Thus, we may sometimes refer to the
character set and encoding together as an @dfn{encoding}.

@node Promised Properties of the Guile Multibyte Encoding, Functions for Operating on Multibyte Text, Multibyte String Terminology, Working With Multibyte Strings in C
@section Promised Properties of the Guile Multibyte Encoding

Internally, Guile uses a single encoding for all text --- symbols,
strings, error messages, etc.  Here we list a number of helpful
properties of Guile's encoding.
It is correct to write code which assumes these properties; code which
uses these assumptions will be portable to all future versions of Guile,
as far as we know.

@b{Every @sc{ASCII} character is encoded as a single byte from 0 to 127, in
the obvious way.}  This means that a standard C string containing only
@sc{ASCII} characters is a valid Guile string (except for the terminator;
Guile strings store the length explicitly, so they can contain null
characters).

@b{The encodings of non-@sc{ASCII} characters use only bytes between 128
and 255.}  That is, when we turn a non-@sc{ASCII} character into a
series of bytes, none of those bytes can ever be mistaken for the
encoding of an @sc{ASCII} character.  This means that you can search a
Guile string for an @sc{ASCII} character using the standard
@code{memchr} library function.  By extension, you can search for an
@sc{ASCII} substring in a Guile string using a traditional substring
search algorithm --- you needn't add special checks to verify encoding
boundaries, etc.

@b{No character encoding is a subsequence of any other character
encoding.}  (This is just a stronger version of the previous promise.)
This means that you can search for occurrences of one Guile string
within another Guile string just as if they were raw byte strings.  You
can use the stock @code{memmem} function (provided on GNU systems, at
least) for such searches.  If you don't need the ability to represent
null characters in your text, you can still use null-termination for
strings, and use the traditional string-handling functions like
@code{strlen}, @code{strstr}, and @code{strcat}.

@b{You can always determine the full length of a character's encoding
from its first byte.}  Guile provides the macro @code{scm_mb_len} which
computes the encoding's length from its first byte.  Given the first
rule, you can see that @code{scm_mb_len (@var{b})}, for any @code{0 <=
@var{b} <= 127}, returns 1.

@b{Given an arbitrary byte position in a Guile string, you can always
find the beginning and end of the character containing that byte without
scanning too far in either direction.}  This means that, if you are sure
a byte sequence is a valid encoding of a character sequence, you can
find character boundaries without keeping track of the beginning and
ending of the overall string.  This promise relies on the fact that, in
addition to storing the string's length explicitly, Guile always either
terminates the string's storage with a zero byte, or shares it with
another string which is terminated this way.

@node Functions for Operating on Multibyte Text, Multibyte Text Processing Errors, Promised Properties of the Guile Multibyte Encoding, Working With Multibyte Strings in C
@section Functions for Operating on Multibyte Text

Guile provides a variety of functions, variables, and types for working
with multibyte text.

@menu
* Basic Multibyte Character Processing::
* Finding Character Encoding Boundaries::
* Multibyte String Functions::
* Exchanging Guile Text With the Outside World in C::
* Implementing Your Own Text Conversions::
@end menu

@node Basic Multibyte Character Processing, Finding Character Encoding Boundaries, Functions for Operating on Multibyte Text, Functions for Operating on Multibyte Text
@subsection Basic Multibyte Character Processing

Here are the essential types and functions for working with Guile text.
Guile uses the C type @code{unsigned char *} to refer to text encoded
with Guile's encoding.

Note that any operation marked here as a ``Libguile Macro'' might
evaluate its argument multiple times.

@deftp {Libguile Type} scm_char_t
This is a signed integral type large enough to hold any character in
Guile's character set.  All character numbers are non-negative.
@end deftp

@deftypefn {Libguile Macro} scm_char_t scm_mb_get (const unsigned char *@var{p})
Return the character whose encoding starts at @var{p}.
If @var{p} does not point at a valid character encoding, the behavior is
undefined.
@end deftypefn

@deftypefn {Libguile Macro} int scm_mb_put (unsigned char *@var{p}, scm_char_t @var{c})
Place the encoded form of the Guile character @var{c} at @var{p}, and
return its length in bytes.  If @var{c} is not a Guile character, the
behavior is undefined.
@end deftypefn

@deftypevr {Libguile Constant} int scm_mb_max_len
The maximum length of any character's encoding, in bytes.  You may
assume this is relatively small --- less than a dozen or so.
@end deftypevr

@deftypefn {Libguile Macro} int scm_mb_len (unsigned char @var{b})
If @var{b} is the first byte of a character's encoding, return the full
length of the character's encoding, in bytes.  If @var{b} is not a valid
leading byte, the behavior is undefined.
@end deftypefn

@deftypefn {Libguile Macro} int scm_mb_char_len (scm_char_t @var{c})
Return the length of the encoding of the character @var{c}, in bytes.
If @var{c} is not a valid Guile character, the behavior is undefined.
@end deftypefn

@deftypefn {Libguile Function} scm_char_t scm_mb_get_func (const unsigned char *@var{p})
@deftypefnx {Libguile Function} int scm_mb_put_func (unsigned char *@var{p}, scm_char_t @var{c})
@deftypefnx {Libguile Function} int scm_mb_len_func (unsigned char @var{b})
@deftypefnx {Libguile Function} int scm_mb_char_len_func (scm_char_t @var{c})
These are functions identical to the corresponding macros.  You can use
them in situations where the overhead of a function call is acceptable,
and the cleaner semantics of function application are desirable.
@end deftypefn

@node Finding Character Encoding Boundaries, Multibyte String Functions, Basic Multibyte Character Processing, Functions for Operating on Multibyte Text
@subsection Finding Character Encoding Boundaries

These are functions for finding the boundaries between characters in
multibyte text.

Note that any operation marked here as a ``Libguile Macro'' might
evaluate its argument multiple times, unless the definition promises
otherwise.

@deftypefn {Libguile Macro} int scm_mb_boundary_p (const unsigned char *@var{p})
Return non-zero if and only if @var{p} points to the start of a
character in multibyte text.

This macro will evaluate its argument only once.
@end deftypefn

@deftypefn {Libguile Function} {const unsigned char *} scm_mb_floor (const unsigned char *@var{p})
``Round'' @var{p} to the previous character boundary.  That is, if
@var{p} points to the middle of the encoding of a Guile character,
return a pointer to the first byte of the encoding.  If @var{p} points
to the start of the encoding of a Guile character, return @var{p}
unchanged.
@end deftypefn

@deftypefn {Libguile Function} {const unsigned char *} scm_mb_ceiling (const unsigned char *@var{p})
``Round'' @var{p} to the next character boundary.  That is, if @var{p}
points to the middle of the encoding of a Guile character, return a
pointer to the first byte of the encoding of the next character.  If
@var{p} points to the start of the encoding of a Guile character, return
@var{p} unchanged.
@end deftypefn

Note that it is usually not friendly for functions to silently correct
byte offsets that point into the middle of a character's encoding.  Such
offsets almost always indicate a programming error, and they should be
reported as early as possible.  So, when you write code which operates
on multibyte text, you should not use functions like these to ``clean
up'' byte offsets which the originator believes to be correct; instead,
your code should signal a @code{text:not-char-boundary} error as soon as
it detects an invalid offset.  @xref{Multibyte Text Processing Errors}.

@node Multibyte String Functions, Exchanging Guile Text With the Outside World in C, Finding Character Encoding Boundaries, Functions for Operating on Multibyte Text
@subsection Multibyte String Functions

These functions allow you to operate on multibyte strings: sequences of
character encodings.

@deftypefn {Libguile Function} int scm_mb_count (const unsigned char *@var{p}, int @var{len})
Return the number of Guile characters encoded by the @var{len} bytes at
@var{p}.

If the sequence contains any invalid character encodings, or ends with
an incomplete character encoding, signal a @code{text:bad-encoding}
error.
@end deftypefn

@deftypefn {Libguile Macro} scm_char_t scm_mb_walk (unsigned char **@var{pp})
Return the character whose encoding starts at @code{*@var{pp}}, and
advance @code{*@var{pp}} to the start of the next character.  Return -1
if @code{*@var{pp}} does not point to a valid character encoding.
@end deftypefn

@deftypefn {Libguile Function} {const unsigned char *} scm_mb_prev (const unsigned char *@var{p})
If @var{p} points to the middle of the encoding of a Guile character,
return a pointer to the first byte of the encoding.  If @var{p} points
to the start of the encoding of a Guile character, return the start of
the previous character's encoding.

This is like @code{scm_mb_floor}, but the returned pointer will always
be before @var{p}.  If you use this function to drive an iteration, it
guarantees backward progress.
@end deftypefn

@deftypefn {Libguile Function} {const unsigned char *} scm_mb_next (const unsigned char *@var{p})
If @var{p} points to the encoding of a Guile character, return a pointer
to the first byte of the encoding of the next character.

This is like @code{scm_mb_ceiling}, but the returned pointer will always
be after @var{p}.  If you use this function to drive an iteration, it
guarantees forward progress.
@end deftypefn

@deftypefn {Libguile Function} {const unsigned char *} scm_mb_index (const unsigned char *@var{p}, int @var{len}, int @var{i})
Assuming that the @var{len} bytes starting at @var{p} are a
concatenation of valid character encodings, return a pointer to the
start of the @var{i}'th character encoding in the sequence.

This function scans the sequence from the beginning to find the
@var{i}'th character, and will generally require time proportional to
the distance from @var{p} to the returned address.

If the sequence contains any invalid character encodings, or ends with
an incomplete character encoding, signal a @code{text:bad-encoding}
error.
@end deftypefn

It is common to process the characters in a string from left to right.
However, if you fetch each character using @code{scm_mb_index}, each
call will scan the text from the beginning, so your loop will require
time proportional to at least the square of the length of the text.  To
avoid this poor performance, you can use an @code{scm_mb_cache}
structure and the @code{scm_mb_index_cached} macro.

@deftp {Libguile Type} {struct scm_mb_cache}
This structure holds information that allows a string scanning operation
to use the results from a previous scan of the string.  It has the
following members:
@table @code

@item character
An index, in characters, into the string.

@item byte
The index, in bytes, of the start of that character.

@end table

In other words, @code{byte} is the byte offset of the
@code{character}'th character of the string.  Note that if @code{byte}
and @code{character} are equal, then all characters before that point
must have encodings exactly one byte long, and the string can be indexed
normally.

All elements of a @code{struct scm_mb_cache} structure should be
initialized to zero before its first use, and whenever the string's text
changes.
@end deftp

@deftypefn {Libguile Macro} {const unsigned char *} scm_mb_index_cached (const unsigned char *@var{p}, int @var{len}, int @var{i}, struct scm_mb_cache *@var{cache})
@deftypefnx {Libguile Function} {const unsigned char *} scm_mb_index_cached_func (const unsigned char *@var{p}, int @var{len}, int @var{i}, struct scm_mb_cache *@var{cache})
This macro and this function are identical to @code{scm_mb_index},
except that they may consult and update @code{*@var{cache}} in order to
avoid scanning the string from the beginning.  @code{scm_mb_index_cached}
is a macro, so it may have less overhead than
@code{scm_mb_index_cached_func}, but it may evaluate its arguments more
than once.

Using @code{scm_mb_index_cached} or @code{scm_mb_index_cached_func}, you
can scan a string from left to right, or from right to left, in time
proportional to the length of the string.  As long as each character
fetched is less than some constant distance before or after the previous
character fetched with @var{cache}, each access will require constant
time.
@end deftypefn

Guile also provides functions to convert between an encoded sequence of
characters, and an array of @code{scm_char_t} objects.

@deftypefn {Libguile Function} {scm_char_t *} scm_mb_multibyte_to_fixed (const unsigned char *@var{p}, int @var{len}, int *@var{result_len})
Convert the variable-width text in the @var{len} bytes at @var{p} to an
array of @code{scm_char_t} values.  Return a pointer to the array, and
set @code{*@var{result_len}} to the number of elements it contains.  The
returned array is allocated with @code{malloc}, and it is the caller's
responsibility to free it.

If the text is not a sequence of valid character encodings, this
function will signal a @code{text:bad-encoding} error.
@end deftypefn

@deftypefn {Libguile Function} {unsigned char *} scm_mb_fixed_to_multibyte (const scm_char_t *@var{fixed}, int @var{len}, int *@var{result_len})
Convert the array of @code{scm_char_t} values to a sequence of
variable-width character encodings.  Return a pointer to the array of
bytes, and set @code{*@var{result_len}} to its length, in bytes.

The returned byte sequence is terminated with a zero byte, which is not
counted in the length returned in @code{*@var{result_len}}.

The returned byte sequence is allocated with @code{malloc}; it is the
caller's responsibility to free it.

If @var{fixed} contains a value that is not a valid Guile character,
this function will signal a @code{text:bad-encoding} error.
@end deftypefn

@node Exchanging Guile Text With the Outside World in C, Implementing Your Own Text Conversions, Multibyte String Functions, Functions for Operating on Multibyte Text
@subsection Exchanging Guile Text With the Outside World in C

[[This is kind of a heavy-weight model, given that one end of the
conversion is always going to be the Guile encoding.  Any way to shorten
things a bit?]]

Guile provides functions for converting between Guile's internal text
representation and encodings popular in the outside world.  These
functions are closely modeled after the @code{iconv} functions available
on some systems.

To convert text between two encodings, you should first call
@code{scm_mb_iconv_open} to indicate the source and destination
encodings; this function returns a context object which records the
conversion to perform.

Then, you should call @code{scm_mb_iconv} to actually convert the text.
This function expects input and output buffers, and a pointer to the
context you got from @code{scm_mb_iconv_open}.  You don't need to pass
all your input to @code{scm_mb_iconv} at once; you can invoke it on
successive blocks of input (as you read it from a file, say), and it
will convert as much as it can each time, indicating when you should
grow your output buffer.

An encoding may be @dfn{stateless}, or @dfn{stateful}.  In most
encodings, a contiguous group of bytes from the sequence completely
specifies a particular character; these are stateless encodings.
However, some encodings require you to look back an unbounded number of
bytes in the stream to assign a meaning to a particular byte sequence;
such encodings are stateful.

For example, in the @samp{ISO-2022-JP} encoding for Japanese text, the
byte sequence @samp{27 36 66} indicates that subsequent bytes should be
taken in pairs and interpreted as characters from the JIS-0208 character
set.  An arbitrary number of byte pairs may follow this sequence.  The
byte sequence @samp{27 40 66} indicates that subsequent bytes should be
interpreted as @sc{ASCII}.  In this encoding, you cannot tell whether a
given byte is an @sc{ASCII} character without looking back an arbitrary
distance for the most recent escape sequence, so it is a stateful
encoding.

In Guile, if a conversion involves a stateful encoding, the context
object carries any necessary state.  Thus, you can have many independent
conversions to or from stateful encodings taking place simultaneously,
as long as each data stream uses its own context object for the
conversion.

@deftp {Libguile Type} {struct scm_mb_iconv}
This is the type for context objects, which represent the encodings and
current state of an ongoing text conversion.  A @code{struct
scm_mb_iconv} records the source and destination encodings, and keeps
track of any information needed to handle stateful encodings.
@end deftp

@deftypefn {Libguile Function} {struct scm_mb_iconv *} scm_mb_iconv_open (const char *@var{tocode}, const char *@var{fromcode})
Return a pointer to a new @code{struct scm_mb_iconv} context object,
ready to convert from the encoding named @var{fromcode} to the encoding
named @var{tocode}.  For stateful encodings, the context object is in
some appropriate initial state, ready for use with the
@code{scm_mb_iconv} function.

When you are done using a context object, you may call
@code{scm_mb_iconv_close} to free it.

If either @var{tocode} or @var{fromcode} is not the name of a known
encoding, this function will signal the @code{text:unknown-conversion}
error, described below.

@c Try to use names here from the IANA list:
@c see ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
Guile supports at least these encodings:
@table @samp

@item US-ASCII
@sc{US-ASCII}, in the standard one-character-per-byte encoding.

@item ISO-8859-1
The usual character set for Western European languages, in its usual
one-character-per-byte encoding.

@item Guile-MB
Guile's current internal multibyte encoding.  The actual encoding this
name refers to will change from one version of Guile to the next.  You
should use this when converting data between external sources and the
encoding used by Guile objects.

You should @emph{not} use this as the encoding for data presented to the
outside world, for two reasons.  1) Its meaning will change over time,
so data written using the @samp{Guile-MB} encoding with one version of
Guile might not be readable with the @samp{Guile-MB} encoding in another
version of Guile.  2) It currently corresponds to @samp{Emacs-Mule},
which was invented for Emacs's internal use, and was never intended to
serve as an exchange medium.

@item Guile-Wide
Guile's character set, as an array of @code{scm_char_t} values.

Note that this encoding is even less suitable for public use than
@samp{Guile-MB}, since the exact sequence of bytes depends heavily on
the size and endianness the host system uses for @code{scm_char_t}.
Using this encoding is very much like calling the
@code{scm_mb_multibyte_to_fixed} or @code{scm_mb_fixed_to_multibyte}
functions, except that @code{scm_mb_iconv} gives you more control over
buffer allocation and management.

@item Emacs-Mule
This is the variable-length encoding for multi-lingual text used by GNU
Emacs, at least through version 20.4.
You probably should not use this encoding, as it is designed only for
Emacs's internal use.  However, we provide it here because it's trivial
to support, and some people probably do have @samp{emacs-mule}-format
files lying around.

@end table

(At the moment, this list doesn't include any character sets suitable
for external use that can actually handle multilingual data; this is
unfortunate, as it encourages users to write data in Emacs-Mule format,
which nobody but Emacs and Guile understands.  We hope to add support
for Unicode in UTF-8 soon, which should solve this problem.)

Case is not significant in encoding names.

You can define your own conversions; see @ref{Implementing Your Own Text
Conversions}.
@end deftypefn

@deftypefn {Libguile Function} int scm_mb_have_encoding (const char *@var{encoding})
Return a non-zero value if Guile supports the encoding named
@var{encoding}, and zero otherwise.
@end deftypefn

@deftypefn {Libguile Function} size_t scm_mb_iconv (struct scm_mb_iconv *@var{context}, const char **@var{inbuf}, size_t *@var{inbytesleft}, char **@var{outbuf}, size_t *@var{outbytesleft})
Convert a sequence of characters from one encoding to another.  The
argument @var{context} specifies the encodings to use for the input and
output, and carries state for stateful encodings; use
@code{scm_mb_iconv_open} to create a @var{context} object for a
particular conversion.

Upon entry to the function, @code{*@var{inbuf}} should point to the
input buffer, and @code{*@var{inbytesleft}} should hold the number of
input bytes present in the buffer; @code{*@var{outbuf}} should point to
the output buffer, and @code{*@var{outbytesleft}} should hold the number
of bytes available to hold the conversion results in that buffer.

Upon exit from the function, @code{*@var{inbuf}} points to the first
unconsumed byte of input, and @code{*@var{inbytesleft}} holds the number
of unconsumed input bytes; @code{*@var{outbuf}} points to the byte after
the last output byte, and @code{*@var{outbytesleft}} holds the number of
bytes left unused in the output buffer.

For stateful encodings, @var{context} carries encoding state from one
call to @code{scm_mb_iconv} to the next.  Thus, successive calls to
@code{scm_mb_iconv} which use the same context object can convert a
stream of data one chunk at a time.

If @var{inbuf} is zero or @code{*@var{inbuf}} is zero, then the call is
taken as a request to reset the states of the input and the output
encodings.  If @var{outbuf} is non-zero and @code{*@var{outbuf}} is
non-zero, then @code{scm_mb_iconv} stores a byte sequence in the output
buffer to put the output encoding in its initial state.  If the output
buffer is not large enough to hold this byte sequence,
@code{scm_mb_iconv} returns @code{scm_mb_iconv_too_big}, and leaves the
shift states of @var{context}'s input and output encodings unchanged.

The @code{scm_mb_iconv} function always consumes only complete
characters or shift sequences from the input buffer, and the output
buffer always contains a sequence of complete characters or escape
sequences.

If the input sequence contains characters which are not expressible in
the output encoding, @code{scm_mb_iconv} converts them in an
implementation-defined way.  It may simply delete such a character.

Some encodings use byte sequences which do not correspond to any textual
character.  For example, the escape sequence of a stateful encoding has
no textual meaning.  When converting from such an encoding, a call to
@code{scm_mb_iconv} might consume input but produce no output, since the
input sequence might contain only escape sequences.

Normally, @code{scm_mb_iconv} returns the number of input characters it
could not convert perfectly to the output encoding.
However, it may return one of the @code{scm_mb_iconv_} codes described
below, to indicate an error.  All of these codes are negative values.

If the input sequence contains an invalid character encoding, conversion
stops before the invalid input character, and @code{scm_mb_iconv}
returns the constant value @code{scm_mb_iconv_bad_encoding}.

If the input sequence ends with an incomplete character encoding,
@code{scm_mb_iconv} will leave it in the input buffer, unconsumed, and
return the constant value @code{scm_mb_iconv_incomplete_encoding}.  This
is not necessarily an error, if you expect to call @code{scm_mb_iconv}
again with more data which might contain the rest of the encoding
fragment.

If the output buffer does not contain enough room to hold the converted
form of the complete input text, @code{scm_mb_iconv} converts as much as
it can, changes the input and output pointers to reflect the amount of
text successfully converted, and then returns
@code{scm_mb_iconv_too_big}.
@end deftypefn

Here are the status codes that might be returned by
@code{scm_mb_iconv}.  They are all negative integers.
@table @code

@item scm_mb_iconv_too_big
The conversion needs more room in the output buffer.  Some characters
may have been consumed from the input buffer, and some characters may
have been placed in the available space in the output buffer.

@item scm_mb_iconv_bad_encoding
@code{scm_mb_iconv} encountered an invalid character encoding in the
input buffer.  Conversion stopped before the invalid character, so there
may be some characters consumed from the input buffer, and some
converted text in the output buffer.

@item scm_mb_iconv_incomplete_encoding
The input buffer ends with an incomplete character encoding.  The
incomplete encoding is left in the input buffer, unconsumed.  This is
not necessarily an error, if you expect to call @code{scm_mb_iconv}
again with more data which might contain the rest of the incomplete
encoding.

@end table

Finally, Guile provides a function for destroying conversion contexts.

@deftypefn {Libguile Function} void scm_mb_iconv_close (struct scm_mb_iconv *@var{context})
Deallocate the conversion context object @var{context}, and all other
resources allocated by the call to @code{scm_mb_iconv_open} which
returned @var{context}.
@end deftypefn

@node Implementing Your Own Text Conversions, , Exchanging Guile Text With the Outside World in C, Functions for Operating on Multibyte Text
@subsection Implementing Your Own Text Conversions

[[note that conversions to and from Guile must produce streams
containing only valid character encodings, or else Guile will crash]]

This section describes the interface for adding your own encoding
conversions for use with @code{scm_mb_iconv}.  The interface here is
borrowed from the GNOME Project's @file{libunicode} library.

Guile's @code{scm_mb_iconv} function works by converting the input text
to a stream of @code{scm_char_t} characters, and then converting those
characters to the desired output encoding.  This makes it easy for Guile
to choose the appropriate conversion back ends for an arbitrary pair of
input and output encodings, but it also means that the accuracy and
quality of the conversions depend on the fidelity of Guile's internal
character set to the source and destination encodings.  Since
@code{scm_mb_iconv} will be used almost exclusively for converting to
and from Guile's internal character set, this shouldn't be a problem.

To add support for a particular encoding to Guile, you must provide one
function (called the @dfn{read} function) which converts from your
encoding to an array of @code{scm_char_t}'s, and another function
(called the @dfn{write} function) to convert from an array of
@code{scm_char_t}'s back into your encoding.  To convert from some
encoding @var{a} to some other encoding @var{b}, Guile pairs up
@var{a}'s read function with @var{b}'s write function.
Each call to @code{scm_mb_iconv} passes text in encoding @var{a} through the read function, to produce an array of @code{scm_char_t}'s, and then passes that array to the write function, to produce text in encoding @var{b}.

For stateful encodings, a read or write function can hang its own data structures off the conversion object, and provide its own functions to allocate and destroy them; this allows read and write functions to maintain whatever state they like.

The Guile conversion back end represents each available encoding with a @code{struct scm_mb_encoding} object.

@deftp {Libguile Type} {struct scm_mb_encoding}
This data structure describes an encoding. It has the following members:

@table @code

@item char **names
An array of strings, giving the various names for this encoding. The array should be terminated by a zero pointer. Case is not significant in encoding names.

The @code{scm_mb_iconv_open} function searches the list of registered encodings for an encoding whose @code{names} array matches its @var{tocode} or @var{fromcode} argument.

@item int (*init) (void **@var{cookie})
An initialization function for the encoding's private data. @code{scm_mb_iconv_open} will call this function, passing it the address of the cookie for this encoding in this context. (We explain cookies below.) There is no way for the @code{init} function to tell whether the encoding will be used for reading or writing.

Note that @code{init} receives a @emph{pointer} to the cookie, not the cookie itself. Because the type of @var{cookie} is @code{void **}, the C compiler will not check it as carefully as it would other types.

The @code{init} member may be zero, indicating that no initialization is necessary for this encoding.

@item int (*destroy) (void **@var{cookie})
A deallocation function for the encoding's private data. @code{scm_mb_iconv_close} calls this function, passing it the address of the cookie for this encoding in this context.
The @code{destroy} function should free any data the @code{init} function allocated.

Note that @code{destroy} receives a @emph{pointer} to the cookie, not the cookie itself. Because the type of @var{cookie} is @code{void **}, the C compiler will not check it as carefully as it would other types.

The @code{destroy} member may be zero, indicating that this encoding doesn't need to perform any special action to destroy its local data.

@item int (*reset) (void *@var{cookie}, char **@var{outbuf}, size_t *@var{outbytesleft})
Put the encoding into its initial shift state. Guile calls this function whether the encoding is being used for input or output, so this should take appropriate steps for both directions. If @var{outbuf} and @var{outbytesleft} are valid, the reset function should emit an escape sequence to reset the output stream to its initial state; @var{outbuf} and @var{outbytesleft} should be handled just as for @code{scm_mb_iconv}.

This function can return an @code{scm_mb_iconv_} error code (@pxref{Exchanging Guile Text With the Outside World in C}). If it returns @code{scm_mb_iconv_too_big}, then the output buffer's shift state must be left unchanged.

Note that @code{reset} receives the cookie's value itself, not a pointer to the cookie, as the @code{init} and @code{destroy} functions do.

The @code{reset} member may be zero, indicating that this encoding doesn't use a shift state.

@item enum scm_mb_read_result (*read) (void *@var{cookie}, const char **@var{inbuf}, size_t *@var{inbytesleft}, scm_char_t **@var{outbuf}, size_t *@var{outcharsleft})
Read some bytes and convert into an array of Guile characters. This is the encoding's read function.

On entry, there are *@var{inbytesleft} bytes of text at *@var{inbuf} to be converted, and *@var{outcharsleft} characters available at *@var{outbuf} to hold the results.

On exit, *@var{inbytesleft} and *@var{inbuf} indicate the input bytes still not consumed.
*@var{outcharsleft} and *@var{outbuf} indicate the output buffer space still not filled. (By exclusion, these indicate which input bytes were consumed, and which output characters were produced.)

Return one of the @code{enum scm_mb_read_result} values, described below.

Note that @code{read} receives the cookie's value itself, not a pointer to the cookie, as the @code{init} and @code{destroy} functions do.

@item enum scm_mb_write_result (*write) (void *@var{cookie}, scm_char_t **@var{inbuf}, size_t *@var{incharsleft}, char **@var{outbuf}, size_t *@var{outbytesleft})
Convert an array of Guile characters to output bytes. This is the encoding's write function.

On entry, there are *@var{incharsleft} Guile characters available at *@var{inbuf}, and *@var{outbytesleft} bytes available to store output at *@var{outbuf}.

On exit, *@var{incharsleft} and *@var{inbuf} indicate the number of Guile characters left unconverted (because there was insufficient room in the output buffer to hold their converted forms), and *@var{outbytesleft} and *@var{outbuf} indicate the unused portion of the output buffer.

Return one of the @code{enum scm_mb_write_result} values, described below.

Note that @code{write} receives the cookie's value itself, not a pointer to the cookie, as the @code{init} and @code{destroy} functions do.

@item struct scm_mb_encoding *next
This is used by Guile to maintain a linked list of encodings. It is filled in when you call @code{scm_mb_register_encoding} to add your encoding to the list.

@end table
@end deftp

Here is the enumerated type for the values an encoding's read function can return:

@deftp {Libguile Type} {enum scm_mb_read_result}
This type represents the result of a call to an encoding's read function. It has the following values:

@table @code

@item scm_mb_read_ok
The read function consumed at least one byte of input.

@item scm_mb_read_incomplete
The data present in the input buffer does not contain a complete character encoding.
No input was consumed, and no characters were produced as output. This is not necessarily an error status, if there is more data to pass through.

@item scm_mb_read_error
The input contains an invalid character encoding.

@end table
@end deftp

Here is the enumerated type for the values an encoding's write function can return:

@deftp {Libguile Type} {enum scm_mb_write_result}
This type represents the result of a call to an encoding's write function. It has the following values:

@table @code

@item scm_mb_write_ok
The write function was able to convert all the characters in @var{inbuf} successfully.

@item scm_mb_write_too_big
The write function filled the output buffer, but there are still characters in @var{inbuf} left unconsumed; @var{inbuf} and @var{incharsleft} indicate the unconsumed portion of the input buffer.

@end table
@end deftp

Conversions to or from stateful encodings need to keep track of each encoding's current state. Each conversion context contains two @code{void *} variables called @dfn{cookies}, one for the input encoding, and one for the output encoding. These cookies are passed to the encodings' functions, for them to use however they please. A stateful encoding can use its cookie to hold a pointer to some object which maintains the context's current shift state. Stateless encodings will probably not use their cookies.

The cookies' lifetime is the same as that of the context object. When the user calls @code{scm_mb_iconv_close} to destroy a context object, @code{scm_mb_iconv_close} calls the input and output encodings' @code{destroy} functions, passing them their respective cookies, so each encoding can free any data it allocated for that context.

Note that, if a read or write function returns a successful result code like @code{scm_mb_read_ok} or @code{scm_mb_write_ok}, then the remaining input, together with the output, must represent the complete input text; the encoding may not store any text temporarily in its cookie.
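
As an illustration of the cookie's life cycle, here is a minimal sketch of a stateful read function. The encoding itself is invented for the example: byte 0x0E shifts into a mode where 0x80 is added to each following byte, and byte 0x0F shifts back. The @code{scm_char_t} typedef is a stand-in, the names @code{toy_state}, @code{toy_init}, @code{toy_destroy}, and @code{toy_read} are hypothetical, and the convention that @code{init} and @code{destroy} return zero on success is an assumption, since this chapter does not say what they return.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for Guile's character type (assumption).  */
typedef unsigned long scm_char_t;

enum scm_mb_read_result
  { scm_mb_read_ok, scm_mb_read_incomplete, scm_mb_read_error };

/* Private per-context data: only the shift state, never text.  */
struct toy_state { int shifted; };

/* init receives the ADDRESS of the cookie and stores its state there.
   Returning 0 for success is this sketch's own convention.  */
int
toy_init (void **cookie)
{
  struct toy_state *s = malloc (sizeof *s);
  if (s == NULL)
    return -1;
  s->shifted = 0;
  *cookie = s;
  return 0;
}

/* destroy also receives the address of the cookie.  */
int
toy_destroy (void **cookie)
{
  free (*cookie);
  *cookie = NULL;
  return 0;
}

/* read receives the cookie's VALUE.  Every byte it consumes is either
   a shift byte or is converted immediately.  */
enum scm_mb_read_result
toy_read (void *cookie, const char **inbuf, size_t *inbytesleft,
          scm_char_t **outbuf, size_t *outcharsleft)
{
  struct toy_state *s = cookie;
  const char *start = *inbuf;

  while (*inbytesleft > 0 && *outcharsleft > 0)
    {
      unsigned char b = (unsigned char) **inbuf;
      if (b >= 0x80)
        return scm_mb_read_error;   /* Stop before the bad byte.  */
      ++*inbuf;
      --*inbytesleft;
      if (b == 0x0E)
        s->shifted = 1;
      else if (b == 0x0F)
        s->shifted = 0;
      else
        {
          *(*outbuf)++ = (scm_char_t) b + (s->shifted ? 0x80 : 0);
          --*outcharsleft;
        }
    }
  return *inbuf == start ? scm_mb_read_incomplete : scm_mb_read_ok;
}
```
@end verbatim

The shift state set by one call carries over to the next through the cookie, but the cookie holds only that flag; no input text is held back between calls.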
This is because, if @code{scm_mb_iconv} returns a successful result to the user, it is correct for the user to assume that all the consumed input has been converted and placed in the output buffer. There is no ``flush'' operation to push any final results out of the encodings' buffers.

Here is the function you call to register a new encoding with the conversion system:

@deftypefn {Libguile Function} void scm_mb_register_encoding (struct scm_mb_encoding *@var{encoding})
Add the encoding described by @code{*@var{encoding}} to the set understood by @code{scm_mb_iconv_open}. Once you have registered your encoding, you can use it by calling @code{scm_mb_iconv_open} with one of the names in @code{@var{encoding}->names}.
@end deftypefn


@node Multibyte Text Processing Errors, Why Guile Does Not Use a Fixed-Width Encoding, Functions for Operating on Multibyte Text, Working With Multibyte Strings in C
@section Multibyte Text Processing Errors

This section describes error conditions which code can signal to indicate problems encountered while processing multibyte text. In each case, the arguments @var{message} and @var{args} are an error format string and arguments to be substituted into the string, as accepted by the @code{display-error} function.

@deffn Condition text:not-char-boundary func message args object offset
By calling @var{func}, the program attempted to access a character at byte offset @var{offset} in the Guile object @var{object}, but @var{offset} is not the start of a character's encoding in @var{object}.

Typically, @var{object} is a string or symbol. If the function signalling the error cannot find the Guile object that contains the text it is inspecting, it should use @code{#f} for @var{object}.
@end deffn

@deffn Condition text:bad-encoding func message args object
By calling @var{func}, the program attempted to interpret the text in @var{object}, but @var{object} contains a byte sequence which is not a valid encoding for any character.
@end deffn

@deffn Condition text:not-guile-char func message args number
By calling @var{func}, the program attempted to treat @var{number} as the number of a character in the Guile character set, but @var{number} does not correspond to any character in the Guile character set.
@end deffn

@deffn Condition text:unknown-conversion func message args from to
By calling @var{func}, the program attempted to convert from an encoding named @var{from} to an encoding named @var{to}, but Guile does not support such a conversion.
@end deffn

@deftypevr {Libguile Variable} SCM scm_text_not_char_boundary
@deftypevrx {Libguile Variable} SCM scm_text_bad_encoding
@deftypevrx {Libguile Variable} SCM scm_text_not_guile_char
These variables hold the Scheme symbol objects whose names are the condition symbols above. You can use these when signalling these errors, instead of looking them up yourself.
@end deftypevr


@node Why Guile Does Not Use a Fixed-Width Encoding, , Multibyte Text Processing Errors, Working With Multibyte Strings in C
@section Why Guile Does Not Use a Fixed-Width Encoding

Multibyte encodings are clumsier to work with than encodings which use a fixed number of bytes for every character. For example, using a fixed-width encoding, we can extract the @var{i}th character of a string in constant time, and we can always substitute the @var{i}th character of a string with any other character without reallocating or copying the string.

However, there are no fixed-width encodings which include the characters we wish to include, and also fit in a reasonable amount of space. Despite the Unicode standard's claims to the contrary, Unicode is not really a fixed-width encoding. Unicode uses surrogate pairs to represent characters outside the 16-bit range; a surrogate pair must be treated as a single character, but occupies two 16-bit spaces. As of this writing, there are already plans to assign characters to the surrogate character codes.
Three- and four-byte encodings are too wasteful for a majority of Guile's users, who only need @sc{ASCII} and a few accented characters.

Another alternative would be to have several different fixed-width string representations, each with a different element size. For each string, Guile would use the smallest element size capable of accommodating the string's text. This would allow users of English and the Western European languages to use the traditional memory-efficient encodings. However, if Guile has @var{n} string representations, then users must write @var{n} versions of any code which manipulates text directly --- one for each element size. And if a user wants to operate on two strings simultaneously, and wants to avoid testing the string sizes within the loop, she must make @var{n}*@var{n} copies of the loop. Most users will simply not bother. Instead, they will write code which supports only one string size, leaving us back where we started. By using a single internal representation, Guile makes it easier for users to write multilingual code.

[[What about tagging each string with its encoding? "Every extension must be written to deal with every encoding"]]

[[You don't really want to index strings anyway.]]

Finally, Guile's multibyte encoding is not so bad. Unlike a two- or four-byte encoding, it is efficient in space for American and European users. Furthermore, the properties described above mean that many functions can be coded just as they would for a single-byte encoding; see @ref{Promised Properties of the Guile Multibyte Encoding}.

@bye