\input texinfo
@setfilename mbapi.info
@settitle Multibyte API
@setchapternewpage off

@c Open issues:

@c What's the best way to report errors?  Should functions return a
@c magic value, according to C tradition, or should they signal a
@c Guile exception?

@c 


@node Working With Multibyte Strings in C
@chapter Working With Multibyte Strings in C

Guile allows strings to contain characters drawn from a wide variety of
languages, including many Asian, Eastern European, and Middle Eastern
languages, in a uniform and unrestricted way.  The string representation
normally used in C code --- an array of @sc{ASCII} characters --- is not
sufficient for Guile strings, since they may contain characters not
present in @sc{ASCII}.

Instead, Guile uses a very large character set, and encodes each
character as a sequence of one or more bytes.  We call this
variable-width encoding a @dfn{multibyte} encoding.  Guile uses this
single encoding internally for all strings, symbol names, error
messages, etc., and performs appropriate conversions upon input and
output.

The use of this variable-width encoding is almost invisible to Scheme
code.  Strings are still indexed by character number, not by byte
offset; @code{string-length} still returns the length of a string in
characters, not in bytes.  @code{string-ref} and @code{string-set!} are
no longer guaranteed to be constant-time operations, but Guile uses
various strategies to reduce the impact of this change.

However, the encoding is visible via Guile's C interface, which gives
the user direct access to a string's bytes.  This chapter explains how
to work with Guile multibyte text in C code.  Since variable-width
encodings are clumsier to work with than simple fixed-width encodings,
Guile provides a set of standard macros and functions for manipulating
multibyte text to make the job easier.  Furthermore, Guile makes some
promises about the encoding which you can use in writing your own text
processing code.

While we discuss guaranteed properties of Guile's encoding, and provide
functions to operate on its character set, we do not actually specify
either the character set or encoding here.  This is because we expect
both of them to change in the future: currently, Guile uses the same
encoding as GNU Emacs 20.4, but we hope to change Guile (and GNU Emacs
as well) to use Unicode and UTF-8, with some extensions.  This will make
it more comfortable to use Guile with other systems which use UTF-8,
like the GTK user interface toolkit.

@menu
* Multibyte String Terminology::  
* Promised Properties of the Guile Multibyte Encoding::  
* Functions for Operating on Multibyte Text::  
* Multibyte Text Processing Errors::  
* Why Guile Does Not Use a Fixed-Width Encoding::  
@end menu


@node Multibyte String Terminology, Promised Properties of the Guile Multibyte Encoding, Working With Multibyte Strings in C, Working With Multibyte Strings in C
@section Multibyte String Terminology 

In the descriptions which follow, we make the following definitions:
@table @dfn

@item byte
A @dfn{byte} is a number between 0 and 255.  It has no inherent textual
interpretation.  So 65 is a byte, not a character.

@item character
A @dfn{character} is a unit of text.  It has no inherent numeric value.
@samp{A} and @samp{.} are characters, not bytes.  (This is different
from the C language's definition of @dfn{character}; in this chapter, we
will always use a phrase like ``the C language's @code{char} type'' when
that's what we mean.)

@item character set
A @dfn{character set} is an invertible mapping between numbers and a
given set of characters.  @sc{ASCII} is a character set assigning
characters to the numbers 0 through 127.  It maps @samp{A} onto the
number 65, and @samp{.} onto 46.

Note that a character set maps characters onto numbers, @emph{not
necessarily} onto bytes.  For example, the Unicode character set maps
the Greek lower-case @samp{alpha} character onto the number 945, which
is not a byte.

(This is what Internet standards would call a ``coded character set''.)

@item encoding
An @dfn{encoding} maps numbers onto sequences of bytes.  For example, the
UTF-8 encoding, defined in the Unicode Standard, would map the number
945 onto the sequence of bytes @samp{206 177}.  When using the
@sc{ASCII} character set, every number assigned also happens to be a
byte, so there is an obvious trivial encoding for @sc{ASCII} in bytes.

(This is what Internet standards would call a ``character encoding
scheme''.)

@end table

Thus, to turn a character into a sequence of bytes, you need a character
set to assign a number to that character, and then an encoding to turn
that number into a sequence of bytes.

Likewise, to interpret a sequence of bytes as a sequence of characters,
you use an encoding to extract a sequence of numbers from the bytes, and
then a character set to turn the numbers into characters.

Errors can occur while carrying out either of these processes.  Under a
particular encoding, a given string of bytes might not correspond to any
number: the byte sequence @samp{128 128}, for example, is not a valid
encoding of any number under UTF-8.
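
As a rough illustration of the two steps, the sketch below takes the
number that Unicode assigns to a character and applies the UTF-8
encoding to it.  The names @code{demo_utf8_encode} and
@code{demo_utf8_lead_byte_p} are hypothetical and are not part of Guile;
UTF-8 is used only because it is the encoding cited in the definitions
above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a Guile function: encode the number CP
   with UTF-8 into OUT, which must have room for 4 bytes.  Return the
   number of bytes written, or 0 if CP has no UTF-8 encoding.  */
size_t
demo_utf8_encode (unsigned long cp, unsigned char *out)
{
  assert (out != NULL);
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogates are not characters */
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  if (cp < 0x110000)
    {
      out[0] = (unsigned char) (0xF0 | (cp >> 18));
      out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[3] = (unsigned char) (0x80 | (cp & 0x3F));
      return 4;
    }
  return 0;
}

/* Return non-zero if B can be the first byte of a UTF-8 encoding.  */
int
demo_utf8_lead_byte_p (unsigned char b)
{
  return b < 0x80 || (b >= 0xC2 && b <= 0xF4);
}
```

With these definitions, @code{demo_utf8_encode (945, buf)} stores the
bytes 206 and 177 in @code{buf} and returns 2, while
@code{demo_utf8_lead_byte_p (128)} returns zero, which is why the
sequence @samp{128 128} cannot be decoded.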

Having carefully defined our terminology, we will now abuse it.

We will sometimes use the word @dfn{character} to refer to the number
assigned to a character by a character set, in contexts where it's
obvious we mean a number.

Sometimes there is a close association between a particular encoding and
a particular character set.  Thus, we may sometimes refer to the
character set and encoding together as an @dfn{encoding}.


@node Promised Properties of the Guile Multibyte Encoding, Functions for Operating on Multibyte Text, Multibyte String Terminology, Working With Multibyte Strings in C
@section Promised Properties of the Guile Multibyte Encoding

Internally, Guile uses a single encoding for all text --- symbols,
strings, error messages, etc.  Here we list a number of helpful
properties of Guile's encoding.  It is correct to write code which
assumes these properties; code which uses these assumptions will be
portable to all future versions of Guile, as far as we know.

@b{Every @sc{ASCII} character is encoded as a single byte from 0 to 127, in
the obvious way.}  This means that a standard C string containing only
@sc{ASCII} characters is a valid Guile string (except for the terminator;
Guile strings store the length explicitly, so they can contain null
characters).

@b{The encodings of non-@sc{ASCII} characters use only bytes between 128
and 255.}  That is, when we turn a non-@sc{ASCII} character into a
series of bytes, none of those bytes can ever be mistaken for the
encoding of an @sc{ASCII} character.  This means that you can search a
Guile string for an @sc{ASCII} character using the standard
@code{memchr} library function.  By extension, you can search for an
@sc{ASCII} substring in a Guile string using a traditional substring
search algorithm --- you needn't add special checks to verify encoding
boundaries, etc.
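
A minimal sketch of such a search follows.  The function name
@code{demo_find_ascii} is hypothetical, and the sample text in the usage
note is UTF-8, which is assumed here only as a stand-in that keeps the
same promise; this chapter does not specify Guile's actual encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the byte offset of the first occurrence
   of the ASCII character C in the LEN bytes at TEXT, or -1 if it does
   not occur.  The length is explicit, so TEXT may contain null
   bytes.  */
long
demo_find_ascii (const unsigned char *text, size_t len, char c)
{
  const unsigned char *hit;

  assert (text != NULL);
  hit = memchr (text, c, len);
  return hit != NULL ? (long) (hit - text) : -1;
}
```

For the seven bytes @samp{206 177 58 206 178 0 33}, searching for
@samp{:} (byte 58) yields offset 2, and searching for @samp{!} (byte 33)
yields offset 6 even though a null byte precedes it.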

@b{No character encoding is a subsequence of any other character
encoding.}  (This is just a stronger version of the previous promise.)
This means that you can search for occurrences of one Guile string
within another Guile string just as if they were raw byte strings.  You
can use the stock @code{memmem} function (provided on GNU systems, at
least) for such searches.  If you don't need the ability to represent
null characters in your text, you can still use null-termination for
strings, and use the traditional string-handling functions like
@code{strlen}, @code{strstr}, and @code{strcat}.
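
For null-terminated text, a substring search might be sketched as
follows.  The function @code{demo_count_occurrences} is a hypothetical
name; it uses only the standard @code{strlen} and @code{strstr}, and it
assumes both arguments are valid encoded text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the non-overlapping occurrences of the
   null-terminated text NEEDLE in the null-terminated text HAYSTACK.
   No decoding is needed; the bytes are compared directly.  */
size_t
demo_count_occurrences (const char *haystack, const char *needle)
{
  size_t needle_len, count = 0;
  const char *hit;

  assert (haystack != NULL && needle != NULL);
  needle_len = strlen (needle);
  if (needle_len == 0)
    return 0;
  for (hit = strstr (haystack, needle);
       hit != NULL;
       hit = strstr (hit + needle_len, needle))
    count++;
  return count;
}
```

The promise matters because a match can never begin or end in the
middle of a character when both strings are valid; the loop above would
give misleading results for an encoding without that property.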

@b{You can always determine the full length of a character's encoding
from its first byte.}  Guile provides the macro @code{scm_mb_len} which
computes the encoding's length from its first byte.  Given the first
rule, you can see that @code{scm_mb_len (@var{b})}, for any @code{0 <=
@var{b} <= 127}, returns 1.
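
To show how this property is typically used, here is a sketch that
steps through text one character at a time.  It is not the definition of
@code{scm_mb_len}: the function @code{demo_len_from_lead} is a
hypothetical stand-in written for UTF-8, and @code{demo_count} is an
equally hypothetical caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a "length from first byte" operation,
   written for UTF-8.  Return 0 if B cannot start a character.  */
int
demo_len_from_lead (unsigned char b)
{
  if (b < 0x80)
    return 1;
  if (b >= 0xC2 && b <= 0xDF)
    return 2;
  if (b >= 0xE0 && b <= 0xEF)
    return 3;
  if (b >= 0xF0 && b <= 0xF4)
    return 4;
  return 0;
}

/* Count the characters in the LEN bytes at P by hopping from one
   leading byte to the next.  Return -1 if a byte that should start a
   character cannot, or if the last character is cut short.  */
long
demo_count (const unsigned char *p, size_t len)
{
  size_t i = 0;
  long n = 0;

  assert (p != NULL);
  while (i < len)
    {
      int k = demo_len_from_lead (p[i]);
      if (k == 0 || (size_t) k > len - i)
        return -1;
      i += (size_t) k;
      n++;
    }
  return n;
}
```

Only the leading bytes are inspected here; the trailing bytes of each
character are skipped without being examined.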

@b{Given an arbitrary byte position in a Guile string, you can always
find the beginning and end of the character containing that byte without
scanning too far in either direction.}  This means that, if you are sure
a byte sequence is a valid encoding of a character sequence, you can
find character boundaries without keeping track of the beginning and
ending of the overall string.  This promise relies on the fact that, in
addition to storing the string's length explicitly, Guile always either
terminates the string's storage with a zero byte, or shares it with
another string which is terminated this way.
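
The following sketch shows what such boundary finding can look like for
UTF-8, where every byte that is not the first byte of a character has
the bit pattern @code{10xxxxxx}.  The names @code{demo_floor} and
@code{demo_ceiling} are hypothetical, and both functions assume the text
is valid and followed by a zero byte.

```c
#include <assert.h>
#include <stddef.h>

/* Non-zero if B is a UTF-8 trailing byte (10xxxxxx).  */
static int
demo_trailing_p (unsigned char b)
{
  return (b & 0xC0) == 0x80;
}

/* Return the start of the character containing the byte at P.  A
   UTF-8 character has at most three trailing bytes, so the backward
   scan is bounded.  */
const unsigned char *
demo_floor (const unsigned char *p)
{
  int steps = 0;

  assert (p != NULL);
  while (steps < 3 && demo_trailing_p (*p))
    {
      p--;
      steps++;
    }
  return p;
}

/* Return P if it is a character boundary, else the start of the next
   character.  The zero terminator is not a trailing byte, so the
   forward scan stops there at the latest.  */
const unsigned char *
demo_ceiling (const unsigned char *p)
{
  assert (p != NULL);
  while (demo_trailing_p (*p))
    p++;
  return p;
}
```

Neither function needs to know where the whole string starts or ends,
which is the point of the promise; the forward scan leans on the
terminating zero byte mentioned above.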


@node Functions for Operating on Multibyte Text, Multibyte Text Processing Errors, Promised Properties of the Guile Multibyte Encoding, Working With Multibyte Strings in C
@section Functions for Operating on Multibyte Text

Guile provides a variety of functions, variables, and types for working
with multibyte text.

@menu
* Basic Multibyte Character Processing::  
* Finding Character Encoding Boundaries::  
* Multibyte String Functions::  
* Exchanging Guile Text With the Outside World in C::  
* Implementing Your Own Text Conversions::  
@end menu


@node Basic Multibyte Character Processing, Finding Character Encoding Boundaries, Functions for Operating on Multibyte Text, Functions for Operating on Multibyte Text
@subsection Basic Multibyte Character Processing

Here are the essential types and functions for working with Guile text.
Guile uses the C type @code{unsigned char *} to refer to text encoded
with Guile's encoding.

Note that any operation marked here as a ``Libguile Macro'' might
evaluate its argument multiple times.

@deftp {Libguile Type} scm_char_t
This is a signed integral type large enough to hold any character in
Guile's character set.  All character numbers are non-negative.
@end deftp

@deftypefn {Libguile Macro} scm_char_t scm_mb_get (const unsigned char *@var{p})
Return the character whose encoding starts at @var{p}.  If @var{p} does
not point at a valid character encoding, the behavior is undefined.
@end deftypefn

@deftypefn {Libguile Macro} int scm_mb_put (unsigned char *@var{p}, scm_char_t @var{c})
Place the encoded form of the Guile character @var{c} at @var{p}, and
return its length in bytes.  If @var{c} is not a Guile character, the
behavior is undefined.
@end deftypefn

@deftypevr {Libguile Constant} int scm_mb_max_len
The maximum length of any character's encoding, in bytes.  You may
assume this is relatively small --- less than a dozen or so.
@end deftypevr

@deftypefn {Libguile Macro} int scm_mb_len (unsigned char @var{b})
If @var{b} is the first byte of a character's encoding, return the full
length of the character's encoding, in bytes.  If @var{b} is not a valid
leading byte, the behavior is undefined.
@end deftypefn

@deftypefn {Libguile Macro} int scm_mb_char_len (scm_char_t @var{c})
Return the length of the encoding of the character @var{c}, in bytes.
If @var{c} is not a valid Guile character, the behavior is undefined.
@end deftypefn

@deftypefn {Libguile Function} scm_char_t scm_mb_get_func (const unsigned char *@var{p})
@deftypefnx {Libguile Function} int scm_mb_put_func (unsigned char *@var{p}, scm_char_t @var{c})
@deftypefnx {Libguile Function} int scm_mb_len_func (unsigned char @var{b})
@deftypefnx {Libguile Function} int scm_mb_char_len_func (scm_char_t @var{c})
These are functions identical to the corresponding macros.  You can use
them in situations where the overhead of a function call is acceptable,
and the cleaner semantics of function application are desirable.
@end deftypefn
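
The warning about multiple evaluation deserves a concrete picture.  In
the sketch below, @code{DEMO_MB_LEN} is a hypothetical macro, written for
UTF-8, that mentions its argument more than once, as any ``Libguile
Macro'' is permitted to do.  The loop therefore reads the byte into a
variable first, rather than passing an expression with side effects such
as @code{*q++}.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macro, not Guile's: length of a UTF-8 character from
   its first byte.  The argument B may be evaluated up to three
   times.  */
#define DEMO_MB_LEN(b) \
  ((b) < 0x80 ? 1 : (b) < 0xE0 ? 2 : (b) < 0xF0 ? 3 : 4)

/* Return the number of bytes occupied by the first N characters of
   the valid text at P.  */
size_t
demo_skip_chars (const unsigned char *p, size_t n)
{
  const unsigned char *q = p;

  assert (p != NULL);
  while (n-- > 0)
    {
      unsigned char lead = *q;  /* evaluate the side-effect-free read once */
      q += DEMO_MB_LEN (lead);  /* never DEMO_MB_LEN (*q++) */
    }
  return (size_t) (q - p);
}
```

When an argument with side effects cannot be avoided, the
@code{_func} variants described above are the safer choice.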


@node Finding Character Encoding Boundaries, Multibyte String Functions, Basic Multibyte Character Processing, Functions for Operating on Multibyte Text
@subsection Finding Character Encoding Boundaries

These are functions for finding the boundaries between characters in
multibyte text.

Note that any operation marked here as a ``Libguile Macro'' might
evaluate its argument multiple times, unless the definition promises
otherwise.

@deftypefn {Libguile Macro} int scm_mb_boundary_p (const unsigned char *@var{p})
Return non-zero iff @var{p} points to the start of a character in
multibyte text.

This macro will evaluate its argument only once.
@end deftypefn

@deftypefn {Libguile Function} {const unsigned char *} scm_mb_floor (const unsigned char *@var{p})
``Round'' @var{p} to the previous character boundary.  That is, if
@var{p} points to the middle of the encoding of a Guile character,
return a pointer to the first byte of the encoding.  If @var{p} points
to the start of the encoding of a Guile character, return @var{p}
unchanged.
@end deftypefn

@deftypefn {Libguile Function} {const unsigned char *} scm_mb_ceiling (const unsigned char *@var{p})
``Round'' @var{p} to the next character boundary.  That is, if @var{p}
points to the middle of the encoding of a Guile character, return a
pointer to the first byte of the encoding of the next character.  If
@var{p} points to the start of the encoding of a Guile character, return
@var{p} unchanged.
@end deftypefn

Note that it is usually not friendly for functions to silently correct
byte offsets that point into the middle of a character's encoding.  Such
offsets almost always indicate a programming error, and they should be
reported as early as possible.  So, when you write code which operates
on multibyte text, you should not use functions like these to ``clean
up'' byte offsets which the originator believes to be correct; instead,
your code should signal a @code{text:not-char-boundary} error as soon as
it detects an invalid offset.  @xref{Multibyte Text Processing Errors}.
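
As a sketch of this advice, the function below refuses to operate on
offsets that fall inside a character instead of rounding them.  All
names are hypothetical, the boundary test is written for UTF-8, and a
plain return code stands in for signalling the
@code{text:not-char-boundary} error.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_NOT_CHAR_BOUNDARY (-1)

/* Non-zero if OFF is a character boundary in the LEN bytes at TEXT.
   The end of the text counts as a boundary.  */
static int
demo_offset_ok (const unsigned char *text, size_t len, size_t off)
{
  if (off == len)
    return 1;
  return off < len && (text[off] & 0xC0) != 0x80;
}

/* Copy the bytes from offset START up to, but not including, offset
   END into OUT and return how many were copied.  If either offset is
   not a character boundary, copy nothing and return
   DEMO_NOT_CHAR_BOUNDARY.  */
int
demo_checked_slice (const unsigned char *text, size_t len,
                    size_t start, size_t end, unsigned char *out)
{
  assert (text != NULL && out != NULL);
  if (start > end
      || !demo_offset_ok (text, len, start)
      || !demo_offset_ok (text, len, end))
    return DEMO_NOT_CHAR_BOUNDARY;
  memcpy (out, text + start, end - start);
  return (int) (end - start);
}
```

A caller that passes offset 2 into the five bytes
@samp{65 226 130 172 66} gets the error code back immediately, so the
mistake surfaces at its source.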


@node Multibyte String Functions, Exchanging Guile Text With the Outside World in C, Finding Character Encoding Boundaries, Functions for Operating on Multibyte Text
@subsection Multibyte String Functions

These functions allow you to operate on multibyte strings: sequences of
character encodings.

@deftypefn {Libguile Function} int scm_mb_count (const unsigned char *@var{p}, int @var{len})
Return the number of Guile characters encoded by the @var{len} bytes at
@var{p}.

If the sequence contains any invalid character encodings, or ends with
an incomplete character encoding, signal a @code{text:bad-encoding}
error.
@end deftypefn

@deftypefn {Libguile Macro} scm_char_t scm_mb_walk (unsigned char **@var{pp})
Return the character whose encoding starts at @code{*@var{pp}}, and
advance @code{*@var{pp}} to the start of the next character.  Return -1
if @code{*@var{pp}} does not point to a valid character encoding.
@end deftypefn

@deftypefn {Libguile Function} {const unsigned char *} scm_mb_prev (const unsigned char *@var{p})
If @var{p} points to the middle of the encoding of a Guile character,
return a pointer to the first byte of the encoding.  If @var{p} points
to the start of the encoding of a Guile character, return the start of
the previous character's encoding.

This is like @code{scm_mb_floor}, but the returned pointer will always
be before @var{p}.  If you use this function to drive an iteration, it
guarantees backward progress.
@end deftypefn

@deftypefn {Libguile Function} {const unsigned char *} scm_mb_next (const unsigned char *@var{p})
If @var{p} points to the encoding of a Guile character, return a pointer
to the first byte of the encoding of the next character.

This is like @code{scm_mb_ceiling}, but the returned pointer will always
be after @var{p}.  If you use this function to drive an iteration, it
guarantees forward progress.
@end deftypefn

@deftypefn {Libguile Function} {const unsigned char *} scm_mb_index (const unsigned char *@var{p}, int @var{len}, int @var{i})
Assuming that the @var{len} bytes starting at @var{p} are a
concatenation of valid character encodings, return a pointer to the
start of the @var{i}'th character encoding in the sequence.

This function scans the sequence from the beginning to find the
@var{i}'th character, and will generally require time proportional to
the distance from @var{p} to the returned address.

If the sequence contains any invalid character encodings, or ends with
an incomplete character encoding, signal a @code{text:bad-encoding}
error.
@end deftypefn

It is common to process the characters in a string from left to right.
However, if you fetch each character using @code{scm_mb_index}, each
call will scan the text from the beginning, so your loop will require
time proportional to at least the square of the length of the text.  To
avoid this poor performance, you can use an @code{scm_mb_cache}
structure and the @code{scm_mb_index_cached} macro.

@deftp {Libguile Type} {struct scm_mb_cache}
This structure holds information that allows a string scanning operation
to use the results from a previous scan of the string.  It has the
following members:
@table @code

@item character
An index, in characters, into the string.

@item byte
The index, in bytes, of the start of that character.

@end table

In other words, @code{byte} is the byte offset of the
@code{character}'th character of the string.  Note that if @code{byte}
and @code{character} are equal, then all characters before that point
must have encodings exactly one byte long, and the string can be indexed
normally.

All elements of a @code{struct scm_mb_cache} structure should be
initialized to zero before its first use, and whenever the string's text
changes.
@end deftp

@deftypefn {Libguile Macro} const unsigned char *scm_mb_index_cached (const unsigned char *@var{p}, int @var{len}, int @var{i}, struct scm_mb_cache *@var{cache})
@deftypefnx {Libguile Function} const unsigned char *scm_mb_index_cached_func (const unsigned char *@var{p}, int @var{len}, int @var{i}, struct scm_mb_cache *@var{cache})
This macro and this function are identical to @code{scm_mb_index},
except that they may consult and update @code{*@var{cache}} in order to avoid
scanning the string from the beginning.  @code{scm_mb_index_cached} is a
macro, so it may have less overhead than
@code{scm_mb_index_cached_func}, but it may evaluate its arguments more
than once.

Using @code{scm_mb_index_cached} or @code{scm_mb_index_cached_func}, you
can scan a string from left to right, or from right to left, in time
proportional to the length of the string.  As long as each character
fetched is less than some constant distance before or after the previous
character fetched with @var{cache}, each access will require constant
time.
@end deftypefn
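
To make the idea of the cache concrete, here is one way such a lookup
could work.  This is a sketch under stated assumptions, not Guile's
implementation: @code{struct demo_mb_cache} merely mirrors the two
members described above, @code{demo_index_cached} is a hypothetical
name, the text is assumed to be valid UTF-8, and an out-of-range index
is reported with a return value rather than an error.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the members described for struct scm_mb_cache.  */
struct demo_mb_cache
{
  size_t character;             /* an index, in characters */
  size_t byte;                  /* byte offset of that character */
};

static size_t
demo_lead_len (unsigned char b)
{
  return b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
}

/* Return the byte offset of character number I in the LEN bytes of
   valid text at P, or (size_t) -1 if there is no such character.
   The scan starts from the position remembered in CACHE, which is
   updated on success.  */
size_t
demo_index_cached (const unsigned char *p, size_t len, size_t i,
                   struct demo_mb_cache *cache)
{
  size_t ch, by;

  assert (p != NULL && cache != NULL);
  ch = cache->character;
  by = cache->byte;
  while (ch > i)                /* walk backward, one character at a time */
    {
      do
        by--;
      while ((p[by] & 0xC0) == 0x80);
      ch--;
    }
  while (ch < i && by < len)    /* walk forward */
    {
      by += demo_lead_len (p[by]);
      ch++;
    }
  if (ch != i || by >= len)
    return (size_t) -1;
  cache->character = ch;
  cache->byte = by;
  return by;
}
```

Starting from a zeroed cache, fetching characters 0, 1, 2, and so on
moves the remembered position forward by one character per call, so a
complete left-to-right pass touches each byte only once.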

Guile also provides functions to convert between an encoded sequence of
characters, and an array of @code{scm_char_t} objects.

@deftypefn {Libguile Function} scm_char_t *scm_mb_multibyte_to_fixed (const unsigned char *@var{p}, int @var{len}, int *@var{result_len})
Convert the variable-width text in the @var{len} bytes at @var{p}
to an array of @code{scm_char_t} values.  Return a pointer to the array,
and set @code{*@var{result_len}} to the number of elements it contains.
The returned array is allocated with @code{malloc}, and it is the
caller's responsibility to free it.

If the text is not a sequence of valid character encodings, this
function will signal a @code{text:bad-encoding} error.
@end deftypefn
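
The shape of such a conversion can be sketched as follows.  The
function @code{demo_multibyte_to_fixed} is hypothetical; it decodes
UTF-8 rather than Guile's own encoding, uses @code{long} in place of
@code{scm_char_t}, and returns a null pointer where the real function
would signal @code{text:bad-encoding}.  For brevity it does not reject
overlong UTF-8 forms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Decode the LEN bytes of UTF-8 at P into a malloc'd array of
   character numbers, storing the element count in *RESULT_LEN.
   Return NULL on malformed input or allocation failure.  The caller
   frees the result.  */
long *
demo_multibyte_to_fixed (const unsigned char *p, size_t len,
                         size_t *result_len)
{
  long *fixed;
  size_t i = 0, n = 0;

  assert (p != NULL && result_len != NULL);
  /* No text has more characters than bytes.  */
  fixed = malloc ((len > 0 ? len : 1) * sizeof *fixed);
  if (fixed == NULL)
    return NULL;

  while (i < len)
    {
      unsigned char b = p[i];
      int extra = (b < 0x80 ? 0
                   : (b & 0xE0) == 0xC0 ? 1
                   : (b & 0xF0) == 0xE0 ? 2
                   : (b & 0xF8) == 0xF0 ? 3 : -1);
      long c;
      int k;

      if (extra < 0 || (size_t) extra >= len - i)
        {
          free (fixed);         /* bad leading byte or truncated text */
          return NULL;
        }
      c = extra == 0 ? b : (b & (0x3F >> extra));
      for (k = 1; k <= extra; k++)
        {
          if ((p[i + k] & 0xC0) != 0x80)
            {
              free (fixed);     /* not a trailing byte */
              return NULL;
            }
          c = (c << 6) | (p[i + k] & 0x3F);
        }
      fixed[n++] = c;
      i += (size_t) extra + 1;
    }
  *result_len = n;
  return fixed;
}
```

Once the text is in this fixed-width form, the @var{i}'th character is
simply element @var{i} of the array, at the cost of the extra storage.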

@deftypefn {Libguile Function} unsigned char *scm_mb_fixed_to_multibyte (const scm_char_t *@var{fixed}, int @var{len}, int *@var{result_len})
Convert the array of @code{scm_char_t} values to a sequence of
variable-width character encodings.  Return a pointer to the array of
bytes, and set @code{*@var{result_len}} to its length, in bytes.

The returned byte sequence is terminated with a zero byte, which is not
counted in the length returned in @code{*@var{result_len}}.

The returned byte sequence is allocated with @code{malloc}; it is the
caller's responsibility to free it.

If @var{fixed} contains a value which is not a valid Guile character,
this function will signal a @code{text:bad-encoding} error.
@end deftypefn


@node Exchanging Guile Text With the Outside World in C, Implementing Your Own Text Conversions, Multibyte String Functions, Functions for Operating on Multibyte Text
@subsection Exchanging Guile Text With the Outside World in C

[[This is kind of a heavy-weight model, given that one end of the
conversion is always going to be the Guile encoding.  Any way to shorten
things a bit?]]

Guile provides functions for converting between Guile's internal text
representation and encodings popular in the outside world.  These
functions are closely modeled after the @code{iconv} functions available
on some systems.

To convert text between two encodings, you should first call
@code{scm_mb_iconv_open} to indicate the source and destination
encodings; this function returns a context object which records the
conversion to perform.

Then, you should call @code{scm_mb_iconv} to actually convert the text.
This function expects input and output buffers, and a pointer to the
context you got from @code{scm_mb_iconv_open}.  You don't need to pass
all your input to @code{scm_mb_iconv} at once; you can invoke it on
successive blocks of input (as you read it from a file, say), and it
will convert as much as it can each time, indicating when you should
grow your output buffer.

An encoding may be @dfn{stateless}, or @dfn{stateful}.  In most
encodings, a contiguous group of bytes from the sequence completely
specifies a particular character; these are stateless encodings.
However, some encodings require you to look back an unbounded number of
bytes in the stream to assign a meaning to a particular byte sequence;
such encodings are stateful.

For example, in the @samp{ISO-2022-JP} encoding for Japanese text, the
byte sequence @samp{27 36 66} indicates that subsequent bytes should be
taken in pairs and interpreted as characters from the JIS X 0208 character
set.  An arbitrary number of byte pairs may follow this sequence.  The
byte sequence @samp{27 40 66} indicates that subsequent bytes should be
interpreted as @sc{ASCII}.  In this encoding, you cannot tell whether a
given byte is an @sc{ASCII} character without looking back an arbitrary
distance for the most recent escape sequence, so it is a stateful
encoding.
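
The sketch below makes the look-back explicit.  It recognizes only the
two escape sequences mentioned above, which is far less than the full
@samp{ISO-2022-JP} encoding, and the names @code{demo_charset_at} and
@code{enum demo_charset} are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum demo_charset { DEMO_ASCII, DEMO_JIS_PAIRS };

/* Return the character set in effect for the byte at offset POS in
   the LEN bytes at P.  The only way to find out is to scan from the
   start of the stream, tracking the most recent escape sequence.  */
enum demo_charset
demo_charset_at (const unsigned char *p, size_t len, size_t pos)
{
  enum demo_charset state = DEMO_ASCII;   /* initial state */
  size_t i = 0;

  assert (p != NULL);
  while (i < len && i <= pos)
    {
      if (i + 2 < len && p[i] == 27 && p[i + 1] == 36 && p[i + 2] == 66)
        {
          state = DEMO_JIS_PAIRS;         /* bytes now come in pairs */
          i += 3;
        }
      else if (i + 2 < len && p[i] == 27 && p[i + 1] == 40 && p[i + 2] == 66)
        {
          state = DEMO_ASCII;
          i += 3;
        }
      else
        i += (state == DEMO_JIS_PAIRS) ? 2 : 1;
    }
  return state;
}
```

The state variable in this loop is exactly the kind of information that
a conversion context has to carry from one block of input to the next.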

In Guile, if a conversion involves a stateful encoding, the context
object carries any necessary state.  Thus, you can have many independent
conversions to or from stateful encodings taking place simultaneously,
as long as each data stream uses its own context object for the
conversion.

@deftp {Libguile Type} {struct scm_mb_iconv}
This is the type for context objects, which represent the encodings and
current state of an ongoing text conversion.  A @code{struct
scm_mb_iconv} records the source and destination encodings, and keeps
track of any information needed to handle stateful encodings.
@end deftp

@deftypefn {Libguile Function} {struct scm_mb_iconv *} scm_mb_iconv_open (const char *@var{tocode}, const char *@var{fromcode})
Return a pointer to a new @code{struct scm_mb_iconv} context object,
ready to convert from the encoding named @var{fromcode} to the encoding
named @var{tocode}.  For stateful encodings, the context object is in
some appropriate initial state, ready for use with the
@code{scm_mb_iconv} function.

When you are done using a context object, you may call
@code{scm_mb_iconv_close} to free it.

If either @var{tocode} or @var{fromcode} is not the name of a known
encoding, this function will signal the @code{text:unknown-conversion}
error, described below.

@c Try to use names here from the IANA list: 
@c see ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
Guile supports at least these encodings:
@table @samp 

@item US-ASCII
@sc{US-ASCII}, in the standard one-character-per-byte encoding.

@item ISO-8859-1
The usual character set for Western European languages, in its usual
one-character-per-byte encoding.

@item Guile-MB
Guile's current internal multibyte encoding.  The actual encoding this
name refers to will change from one version of Guile to the next.  You
should use this when converting data between external sources and the
encoding used by Guile objects.

You should @emph{not} use this as the encoding for data presented to the
outside world, for two reasons.  1) Its meaning will change over time,
so data written using the @samp{Guile-MB} encoding with one version of
Guile might not be readable with the @samp{Guile-MB} encoding in another
version of Guile.  2) It currently corresponds to @samp{Emacs-Mule},
which was invented for Emacs's internal use, and was never intended to
serve as an exchange medium.

@item Guile-Wide
Guile's character set, as an array of @code{scm_char_t} values.

Note that this encoding is even less suitable for public use than
@samp{Guile-MB}, since the exact sequence of bytes depends heavily on the
size and endianness the host system uses for @code{scm_char_t}.  Using
this encoding is very much like calling the
@code{scm_mb_multibyte_to_fixed} or @code{scm_mb_fixed_to_multibyte}
functions, except that @code{scm_mb_iconv} gives you more control over
buffer allocation and management.

@item Emacs-Mule
This is the variable-length encoding used for multilingual text by GNU
Emacs, at least through version 20.4.  You probably should not use this
encoding, as it is designed only for Emacs's internal use.  However, we
provide it here because it's trivial to support, and some people
probably do have @samp{Emacs-Mule}-format files lying around.

@end table

(At the moment, this list doesn't include any character sets suitable for
external use that can actually handle multilingual data; this is
unfortunate, as it encourages users to write data in Emacs-Mule format,
which nobody but Emacs and Guile understands.  We hope to add support
for Unicode in UTF-8 soon, which should solve this problem.)

Case is not significant in encoding names.

You can define your own conversions; see @ref{Implementing Your Own Text
Conversions}.
@end deftypefn

@deftypefn {Libguile Function} int scm_mb_have_encoding (const char *@var{encoding})
Return a non-zero value if Guile supports the encoding named
@var{encoding}, and zero otherwise.
@end deftypefn

@deftypefn {Libguile Function} size_t scm_mb_iconv (struct scm_mb_iconv *@var{context}, const char **@var{inbuf}, size_t *@var{inbytesleft}, char **@var{outbuf}, size_t *@var{outbytesleft})
Convert a sequence of characters from one encoding to another.  The
argument @var{context} specifies the encodings to use for the input and
output, and carries state for stateful encodings; use
@code{scm_mb_iconv_open} to create a @var{context} object for a
particular conversion.

Upon entry to the function, @code{*@var{inbuf}} should point to the
input buffer, and @code{*@var{inbytesleft}} should hold the number of
input bytes present in the buffer; @code{*@var{outbuf}} should point to
the output buffer, and @code{*@var{outbytesleft}} should hold the number
of bytes available to hold the conversion results in that buffer.

Upon exit from the function, @code{*@var{inbuf}} points to the first
unconsumed byte of input, and @code{*@var{inbytesleft}} holds the number
of unconsumed input bytes; @code{*@var{outbuf}} points to the byte after
the last output byte, and @code{*@var{outbytesleft}} holds the number of
bytes left unused in the output buffer.

For stateful encodings, @var{context} carries encoding state from one
call to @code{scm_mb_iconv} to the next.  Thus, successive calls to
@code{scm_mb_iconv} which use the same context object can convert a
stream of data one chunk at a time.

If @var{inbuf} is zero or @code{*@var{inbuf}} is zero, then the call is
taken as a request to reset the states of the input and the output
encodings.  If @var{outbuf} is non-zero and @code{*@var{outbuf}} is
non-zero, then @code{scm_mb_iconv} stores a byte sequence in the output
buffer to put the output encoding in its initial state.  If the output
buffer is not large enough to hold this byte sequence,
@code{scm_mb_iconv} returns @code{scm_mb_iconv_too_big}, and leaves
the shift states of @var{context}'s input and output encodings
unchanged.

The @code{scm_mb_iconv} function always consumes only complete
characters or shift sequences from the input buffer, and the output
buffer always contains a sequence of complete characters or escape
sequences.

If the input sequence contains characters which are not expressible in
the output encoding, @code{scm_mb_iconv} converts them in an
implementation-defined way.  It may simply delete them.

Some encodings use byte sequences which do not correspond to any textual
character.  For example, the escape sequence of a stateful encoding has
no textual meaning.  When converting from such an encoding, a call to
@code{scm_mb_iconv} might consume input but produce no output, since the
input sequence might contain only escape sequences.

Normally, @code{scm_mb_iconv} returns the number of input characters it
could not convert perfectly to the output encoding.  However, it may
return one of the @code{scm_mb_iconv_} codes described below, to
indicate an error.  All of these codes are negative values.

If the input sequence contains an invalid character encoding, conversion
stops before the invalid input character, and @code{scm_mb_iconv}
returns the constant value @code{scm_mb_iconv_bad_encoding}.

If the input sequence ends with an incomplete character encoding,
@code{scm_mb_iconv} will leave it in the input buffer, unconsumed, and
return the constant value @code{scm_mb_iconv_incomplete_encoding}.  This
is not necessarily an error, if you expect to call @code{scm_mb_iconv}
again with more data which might contain the rest of the encoding
fragment.

If the output buffer does not contain enough room to hold the converted
form of the complete input text, @code{scm_mb_iconv} converts as much as
it can, changes the input and output pointers to reflect the amount of
text successfully converted, and then returns
@code{scm_mb_iconv_too_big}.
@end deftypefn
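
As a rough illustration of the buffer protocol and of the
@code{scm_mb_iconv_too_big} case, the following sketch drains a
deliberately tiny output buffer in a loop.  It does not call Guile:
@code{toy_iconv}, @code{toy_too_big} and @code{convert_in_chunks} are
hypothetical stand-ins which only mimic the argument conventions
described above, here for a Latin-1 to UTF-8 conversion.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for scm_mb_iconv_too_big; the manual says only
   that the real codes are negative.  */
enum { toy_too_big = -1 };

/* Stand-in converter (Latin-1 in, UTF-8 out) that updates its four
   arguments the way scm_mb_iconv is documented to.  Returns 0 once all
   input is consumed.  */
static int
toy_iconv (const char **inbuf, size_t *inbytesleft,
           char **outbuf, size_t *outbytesleft)
{
  while (*inbytesleft > 0)
    {
      unsigned char c = (unsigned char) **inbuf;
      size_t need = c < 0x80 ? 1 : 2;

      if (*outbytesleft < need)
        return toy_too_big;     /* never store part of a character */
      if (need == 1)
        *(*outbuf)++ = (char) c;
      else
        {
          *(*outbuf)++ = (char) (0xC0 | (c >> 6));
          *(*outbuf)++ = (char) (0x80 | (c & 0x3F));
        }
      *outbytesleft -= need;
      (*inbuf)++;
      (*inbytesleft)--;
    }
  return 0;
}

/* Convert LEN bytes at TEXT through a three-byte scratch buffer,
   appending each drained chunk to RESULT (capacity CAP).  Returns the
   number of bytes stored, or (size_t) -1 if RESULT is too small.  */
size_t
convert_in_chunks (const char *text, size_t len, char *result, size_t cap)
{
  const char *in = text;
  size_t inleft = len, stored = 0;
  int status;

  do
    {
      char scratch[3];
      char *out = scratch;
      size_t outleft = sizeof scratch;
      size_t produced;

      status = toy_iconv (&in, &inleft, &out, &outleft);
      produced = sizeof scratch - outleft;
      /* The scratch buffer must hold at least one whole character,
         or this loop would never make progress.  */
      assert (produced > 0 || status == 0);
      if (stored + produced > cap)
        return (size_t) -1;
      memcpy (result + stored, scratch, produced);
      stored += produced;
    }
  while (status == toy_too_big);

  return stored;
}
```
@end verbatim

A caller of @code{scm_mb_iconv} could follow the same pattern, emptying
its output buffer each time it sees @code{scm_mb_iconv_too_big}.  Since
only complete characters are ever stored, each drained chunk is valid
text on its own.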

Here are the status codes that might be returned by @code{scm_mb_iconv}.
They are all negative integers.
@table @code

@item scm_mb_iconv_too_big
The conversion needs more room in the output buffer.  Some characters
may have been consumed from the input buffer, and some characters may
have been placed in the available space in the output buffer.

@item scm_mb_iconv_bad_encoding
@code{scm_mb_iconv} encountered an invalid character encoding in the
input buffer.  Conversion stopped before the invalid character, so there
may be some characters consumed from the input buffer, and some
converted text in the output buffer.

@item scm_mb_iconv_incomplete_encoding
The input buffer ends with an incomplete character encoding.  The
incomplete encoding is left in the input buffer, unconsumed.  This is
not necessarily an error, if you expect to call @code{scm_mb_iconv}
again with more data which might contain the rest of the incomplete
encoding.

@end table
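
The @code{scm_mb_iconv_incomplete_encoding} case typically arises when
text arrives in chunks which split a character.  The sketch below shows
one way a caller might carry the unconsumed fragment over to the next
chunk.  It does not call Guile: @code{toy_decode},
@code{toy_incomplete}, @code{struct feeder} and @code{feed} are
hypothetical names, and the decoder handles only one- and two-byte UTF-8
forms.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for scm_mb_iconv_incomplete_encoding.  */
enum { toy_incomplete = -1 };

/* Stand-in decoder: UTF-8 restricted to one- and two-byte forms in,
   Latin-1 out.  Assumes valid input and a large enough output buffer.  */
static int
toy_decode (const char **inbuf, size_t *inbytesleft,
            char **outbuf, size_t *outbytesleft)
{
  while (*inbytesleft > 0)
    {
      unsigned char lead = (unsigned char) (*inbuf)[0];
      size_t need = lead < 0x80 ? 1 : 2;

      if (*inbytesleft < need)
        return toy_incomplete;  /* leave the fragment unconsumed */
      assert (*outbytesleft > 0);
      if (need == 1)
        *(*outbuf)++ = (char) lead;
      else
        *(*outbuf)++ = (char) (((lead & 0x1F) << 6)
                               | ((unsigned char) (*inbuf)[1] & 0x3F));
      (*outbytesleft)--;
      *inbuf += need;
      *inbytesleft -= need;
    }
  return 0;
}

/* Bytes left over from the previous chunk.  */
struct feeder
{
  char pending[4];
  size_t npending;
};

/* Decode one chunk, prefixed by any fragment saved from the last call.
   Returns the number of bytes stored at OUT.  */
size_t
feed (struct feeder *f, const char *chunk, size_t len,
      char *out, size_t outcap)
{
  char work[64];
  const char *in = work;
  char *o = out;
  size_t inleft = f->npending + len, outleft = outcap;

  assert (inleft <= sizeof work);
  memcpy (work, f->pending, f->npending);
  memcpy (work + f->npending, chunk, len);

  f->npending = 0;
  if (toy_decode (&in, &inleft, &o, &outleft) == toy_incomplete)
    {
      assert (inleft <= sizeof f->pending);
      memcpy (f->pending, in, inleft);  /* keep it for the next call */
      f->npending = inleft;
    }
  return outcap - outleft;
}
```
@end verbatim

If the stream ends while a fragment is still pending, the caller should
treat that as a genuine error, since no further data will complete the
character.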


Finally, Guile provides a function for destroying conversion contexts.

@deftypefn {Libguile Function} void scm_mb_iconv_close (struct scm_mb_iconv *@var{context})
Deallocate the conversion context object @var{context}, and all other
resources allocated by the call to @code{scm_mb_iconv_open} which
returned @var{context}.
@end deftypefn


@node Implementing Your Own Text Conversions,  , Exchanging Guile Text With the Outside World in C, Functions for Operating on Multibyte Text
@subsection Implementing Your Own Text Conversions

[[note that conversions to and from Guile must produce streams
containing only valid character encodings, or else Guile will crash]]

This section describes the interface for adding your own encoding
conversions for use with @code{scm_mb_iconv}.  The interface here is
borrowed from the GNOME Project's @file{libunicode} library.

Guile's @code{scm_mb_iconv} function works by converting the input text
to a stream of @code{scm_char_t} characters, and then converting
those characters to the desired output encoding.  This makes it easy
for Guile to choose the appropriate conversion back ends for an
arbitrary pair of input and output encodings, but it also means that the
accuracy and quality of the conversions depend on the fidelity of
Guile's internal character set to the source and destination encodings.
Since @code{scm_mb_iconv} will be used almost exclusively for converting
to and from Guile's internal character set, this shouldn't be a problem.

To add support for a particular encoding to Guile, you must provide one
function (called the @dfn{read} function) which converts from your
encoding to an array of @code{scm_char_t}'s, and another function
(called the @dfn{write} function) to convert from an array of
@code{scm_char_t}'s back into your encoding.  To convert from some
encoding @var{a} to some other encoding @var{b}, Guile pairs up
@var{a}'s read function with @var{b}'s write function.  Each call to
@code{scm_mb_iconv} passes text in encoding @var{a} through the read
function, to produce an array of @code{scm_char_t}'s, and then passes
that array to the write function, to produce text in encoding @var{b}.
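
The pairing described above can be sketched as follows.  This is only an
illustration under stated assumptions, not Guile's implementation: the
definition of @code{scm_char_t} as @code{unsigned long}, the enumeration
declarations, and the functions @code{latin1_read}, @code{utf8_write}
and @code{pivot_convert} are all hypothetical.  The driver uses a
one-character pivot buffer so that it can simply un-consume a character
whose converted form does not fit in the output buffer.

@verbatim
```c
#include <assert.h>
#include <stddef.h>

/* Assumed declarations: the manual names these types and values but
   does not show their definitions.  */
typedef unsigned long scm_char_t;
enum scm_mb_read_result
  { scm_mb_read_ok, scm_mb_read_incomplete, scm_mb_read_error };
enum scm_mb_write_result { scm_mb_write_ok, scm_mb_write_too_big };

/* Read function for Latin-1: every byte is one character.  */
static enum scm_mb_read_result
latin1_read (void *cookie, const char **inbuf, size_t *inbytesleft,
             scm_char_t **outbuf, size_t *outcharsleft)
{
  (void) cookie;                /* stateless: the cookie is unused */
  assert (*outcharsleft > 0);
  if (*inbytesleft == 0)
    return scm_mb_read_incomplete;
  while (*inbytesleft > 0 && *outcharsleft > 0)
    {
      *(*outbuf)++ = (unsigned char) **inbuf;
      (*inbuf)++;
      (*inbytesleft)--;
      (*outcharsleft)--;
    }
  return scm_mb_read_ok;
}

/* Write function for UTF-8, limited here to characters below 0x800.  */
static enum scm_mb_write_result
utf8_write (void *cookie, scm_char_t **inbuf, size_t *incharsleft,
            char **outbuf, size_t *outbytesleft)
{
  (void) cookie;
  while (*incharsleft > 0)
    {
      scm_char_t c = **inbuf;
      size_t need = c < 0x80 ? 1 : 2;

      assert (c < 0x800);
      if (*outbytesleft < need)
        return scm_mb_write_too_big;
      if (need == 1)
        *(*outbuf)++ = (char) c;
      else
        {
          *(*outbuf)++ = (char) (0xC0 | (c >> 6));
          *(*outbuf)++ = (char) (0x80 | (c & 0x3F));
        }
      *outbytesleft -= need;
      (*inbuf)++;
      (*incharsleft)--;
    }
  return scm_mb_write_ok;
}

/* Pair the two functions through a one-character pivot buffer.
   Returns 0 on success, -1 if the output buffer fills up, -2 on bad
   input.  */
int
pivot_convert (const char **inbuf, size_t *inbytesleft,
               char **outbuf, size_t *outbytesleft)
{
  while (*inbytesleft > 0)
    {
      const char *saved_inbuf = *inbuf;
      size_t saved_left = *inbytesleft;
      scm_char_t pivot;
      scm_char_t *filled = &pivot, *pending = &pivot;
      size_t room = 1, count;

      if (latin1_read (NULL, inbuf, inbytesleft, &filled, &room)
          != scm_mb_read_ok)
        return -2;
      count = (size_t) (filled - &pivot);
      if (utf8_write (NULL, &pending, &count, outbuf, outbytesleft)
          != scm_mb_write_ok)
        {
          /* Nothing was written for this character: un-consume it.  */
          *inbuf = saved_inbuf;
          *inbytesleft = saved_left;
          return -1;
        }
    }
  return 0;
}
```
@end verbatim

Un-consuming input this way is only safe because both toy encodings are
stateless; a driver for stateful encodings would need a different
strategy, such as a larger pivot buffer whose contents persist between
calls.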

For stateful encodings, a read or write function can hang its own data
structures off the conversion object, and provide its own functions to
allocate and destroy them; this allows read and write functions to
maintain whatever state they like.

The Guile conversion back end represents each available encoding with a
@code{struct scm_mb_encoding} object.

@deftp {Libguile Type} {struct scm_mb_encoding}
This data structure describes an encoding.  It has the following
members:

@table @code

@item char **names
An array of strings, giving the various names for this encoding.  The
array should be terminated by a zero pointer.  Case is not significant
in encoding names.

The @code{scm_mb_iconv_open} function searches the list of registered
encodings for an encoding whose @code{names} array matches its
@var{tocode} or @var{fromcode} argument.

@item int (*init) (void **@var{cookie})
An initialization function for the encoding's private data.
@code{scm_mb_iconv_open} will call this function, passing it the address
of the cookie for this encoding in this context.  (We explain cookies
below.)  There is no way for the @code{init} function to tell whether
the encoding will be used for reading or writing.

Note that @code{init} receives a @emph{pointer} to the cookie, not the
cookie itself.  Because the type of @var{cookie} is @code{void **}, the
C compiler will not check it as carefully as it would other types.

The @code{init} member may be zero, indicating that no initialization is
necessary for this encoding.

@item int (*destroy) (void **@var{cookie})
A deallocation function for the encoding's private data.
@code{scm_mb_iconv_close} calls this function, passing it the address of
the cookie for this encoding in this context.  The @code{destroy}
function should free any data the @code{init} function allocated.

Note that @code{destroy} receives a @emph{pointer} to the cookie, not the
cookie itself.  Because the type of @var{cookie} is @code{void **}, the
C compiler will not check it as carefully as it would other types.

The @code{destroy} member may be zero, indicating that this encoding
doesn't need to perform any special action to destroy its local data.

@item int (*reset) (void *@var{cookie}, char **@var{outbuf}, size_t *@var{outbytesleft})
Put the encoding into its initial shift state.  Guile calls this
function whether the encoding is being used for input or output, so this
should take appropriate steps for both directions.  If @var{outbuf} and
@var{outbytesleft} are valid, the reset function should emit an escape
sequence to reset the output stream to its initial state; @var{outbuf}
and @var{outbytesleft} should be handled just as for
@code{scm_mb_iconv}.

This function can return an @code{scm_mb_iconv_} error code
(@pxref{Exchanging Guile Text With the Outside World in C}).  If it
returns @code{scm_mb_iconv_too_big}, then the output buffer's shift
state must be left unchanged.

Note that @code{reset} receives the cookie's value itself, not a pointer
to the cookie, as the @code{init} and @code{destroy} functions do.

The @code{reset} member may be zero, indicating that this encoding
doesn't use a shift state.

@item enum scm_mb_read_result (*read) (void *@var{cookie}, const char **@var{inbuf},  size_t *@var{inbytesleft}, scm_char_t **@var{outbuf}, size_t *@var{outcharsleft})
Read some bytes and convert into an array of Guile characters.  This is
the encoding's read function.

On entry, there are @code{*@var{inbytesleft}} bytes of text at
@code{*@var{inbuf}} to be converted, and @code{*@var{outcharsleft}}
characters available at @code{*@var{outbuf}} to hold the results.

On exit, @code{*@var{inbytesleft}} and @code{*@var{inbuf}} indicate the
input bytes still not consumed.  @code{*@var{outcharsleft}} and
@code{*@var{outbuf}} indicate the output buffer space still not filled.
(By exclusion, these indicate which input bytes were consumed, and which
output characters were produced.)

Return one of the @code{enum scm_mb_read_result} values, described below.

Note that @code{read} receives the cookie's value itself, not a pointer
to the cookie, as the @code{init} and @code{destroy} functions do.

@item enum scm_mb_write_result (*write) (void *@var{cookie}, scm_char_t **@var{inbuf}, size_t *@var{incharsleft}, char **@var{outbuf}, size_t *@var{outbytesleft})
Convert an array of Guile characters to output bytes.  This is
the encoding's write function.

On entry, there are @code{*@var{incharsleft}} Guile characters available
at @code{*@var{inbuf}}, and @code{*@var{outbytesleft}} bytes available to
store output at @code{*@var{outbuf}}.

On exit, @code{*@var{incharsleft}} and @code{*@var{inbuf}} indicate the
Guile characters left unconverted (because there was insufficient room
in the output buffer to hold their converted forms), and
@code{*@var{outbytesleft}} and @code{*@var{outbuf}} indicate the unused
portion of the output buffer.

Return one of the @code{enum scm_mb_write_result} values, described below.

Note that @code{write} receives the cookie's value itself, not a pointer
to the cookie, as the @code{init} and @code{destroy} functions do.

@item struct scm_mb_encoding *next
This is used by Guile to maintain a linked list of encodings.  It is
filled in when you call @code{scm_mb_register_encoding} to add your
encoding to the list.

@end table
@end deftp
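
To make the member list concrete, here is one plausible C declaration of
the structure, followed by a descriptor for a stateless 7-bit
@sc{ascii} encoding.  The declarations of @code{scm_char_t} and the two
result enumerations are assumptions, since this manual does not show
them, and @code{ascii_read}, @code{ascii_write}, @code{ascii_names} and
@code{ascii_encoding} are hypothetical names.  Stopping before an
invalid byte and returning @code{scm_mb_read_error}, and writing
@samp{?} for inexpressible characters, are choices made for this sketch.

@verbatim
```c
#include <assert.h>
#include <stddef.h>

/* Assumed declarations; the manual names these but does not show them.  */
typedef unsigned long scm_char_t;
enum scm_mb_read_result
  { scm_mb_read_ok, scm_mb_read_incomplete, scm_mb_read_error };
enum scm_mb_write_result { scm_mb_write_ok, scm_mb_write_too_big };

/* The members, in the order the table above lists them.  */
struct scm_mb_encoding
{
  char **names;
  int (*init) (void **cookie);
  int (*destroy) (void **cookie);
  int (*reset) (void *cookie, char **outbuf, size_t *outbytesleft);
  enum scm_mb_read_result (*read) (void *cookie,
                                   const char **inbuf, size_t *inbytesleft,
                                   scm_char_t **outbuf, size_t *outcharsleft);
  enum scm_mb_write_result (*write) (void *cookie,
                                     scm_char_t **inbuf, size_t *incharsleft,
                                     char **outbuf, size_t *outbytesleft);
  struct scm_mb_encoding *next;
};

static enum scm_mb_read_result
ascii_read (void *cookie, const char **inbuf, size_t *inbytesleft,
            scm_char_t **outbuf, size_t *outcharsleft)
{
  (void) cookie;
  /* The caller is expected to offer some input and some room.  */
  assert (*inbytesleft > 0 && *outcharsleft > 0);
  while (*inbytesleft > 0 && *outcharsleft > 0)
    {
      unsigned char c = (unsigned char) **inbuf;

      if (c >= 0x80)
        return scm_mb_read_error;       /* stop before the bad byte */
      *(*outbuf)++ = c;
      (*inbuf)++;
      (*inbytesleft)--;
      (*outcharsleft)--;
    }
  return scm_mb_read_ok;
}

static enum scm_mb_write_result
ascii_write (void *cookie, scm_char_t **inbuf, size_t *incharsleft,
             char **outbuf, size_t *outbytesleft)
{
  (void) cookie;
  while (*incharsleft > 0)
    {
      scm_char_t c = **inbuf;

      if (*outbytesleft == 0)
        return scm_mb_write_too_big;
      /* Inexpressible characters become '?': this sketch's choice.  */
      *(*outbuf)++ = c < 0x80 ? (char) c : '?';
      (*outbytesleft)--;
      (*inbuf)++;
      (*incharsleft)--;
    }
  return scm_mb_write_ok;
}

static char *ascii_names[] = { "US-ASCII", "ASCII", NULL };

/* Zero init, destroy and reset: no private data, no shift state.  */
struct scm_mb_encoding ascii_encoding =
  { ascii_names, NULL, NULL, NULL, ascii_read, ascii_write, NULL };
```
@end verbatim

The @code{next} member is left zero here, since it is filled in at
registration time.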

Here is the enumerated type for the values an encoding's read function
can return:

@deftp {Libguile Type} {enum scm_mb_read_result}
This type represents the result of a call to an encoding's read
function.  It has the following values:

@table @code

@item scm_mb_read_ok
The read function consumed at least one byte of input.

@item scm_mb_read_incomplete
The data present in the input buffer does not contain a complete
character encoding.  No input was consumed, and no characters were
produced as output.  This is not necessarily an error status, if there
is more data to pass through.

@item scm_mb_read_error
The input contains an invalid character encoding.

@end table
@end deftp

Here is the enumerated type for the values an encoding's write function
can return:

@deftp {Libguile Type} {enum scm_mb_write_result}
This type represents the result of a call to an encoding's write
function.  It has the following values:

@table @code

@item scm_mb_write_ok
The write function was able to convert all the characters in @var{inbuf}
successfully.

@item scm_mb_write_too_big
The write function filled the output buffer, but there are still
characters in @var{inbuf} left unconsumed; @var{inbuf} and
@var{incharsleft} indicate the unconsumed portion of the input buffer.

@end table
@end deftp


Conversions to or from stateful encodings need to keep track of each
encoding's current state.  Each conversion context contains two
@code{void *} variables called @dfn{cookies}, one for the input
encoding, and one for the output encoding.  These cookies are passed to
the encodings' functions, for them to use however they please.  A
stateful encoding can use its cookie to hold a pointer to some object
which maintains the context's current shift state.  Stateless encodings
will probably not use their cookies.

The cookies' lifetime is the same as that of the context object.  When
the user calls @code{scm_mb_iconv_close} to destroy a context object,
@code{scm_mb_iconv_close} calls the input and output encodings'
@code{destroy} functions, passing them their respective cookies, so each
encoding can free any data it allocated for that context.
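
As a sketch of how a stateful encoding might manage its cookie, the
following hypothetical @code{init}, @code{destroy} and @code{reset}
functions keep a single shift flag in a heap-allocated object.  The
names @code{toy_state}, @code{toy_init}, @code{toy_destroy},
@code{toy_reset} and @code{toy_too_big} are invented for illustration,
as is the convention that @code{init} and @code{destroy} return zero on
success, which this manual does not specify.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for scm_mb_iconv_too_big.  */
enum { toy_too_big = -1 };

/* Private data for a toy stateful encoding with one shifted mode.  */
struct toy_state
{
  int shifted;                  /* nonzero after a shift-out */
};

/* init receives the ADDRESS of the cookie and stores a fresh state
   object there.  Returning 0 for success is an assumption.  */
int
toy_init (void **cookie)
{
  struct toy_state *state = malloc (sizeof *state);

  if (state == NULL)
    return -1;
  state->shifted = 0;
  *cookie = state;
  return 0;
}

/* destroy also receives the cookie's address, and frees what init
   allocated.  */
int
toy_destroy (void **cookie)
{
  free (*cookie);
  *cookie = NULL;
  return 0;
}

/* reset receives the cookie's VALUE.  When given an output buffer it
   emits SI (0x0F) to leave shifted mode; if that byte does not fit, it
   reports toy_too_big and leaves the shift state untouched.  */
int
toy_reset (void *cookie, char **outbuf, size_t *outbytesleft)
{
  struct toy_state *state = cookie;

  assert (state != NULL);
  if (state->shifted && outbuf != NULL && *outbuf != NULL)
    {
      if (*outbytesleft < 1)
        return toy_too_big;
      *(*outbuf)++ = 0x0F;
      (*outbytesleft)--;
    }
  state->shifted = 0;
  return 0;
}
```
@end verbatim

Because @code{reset} may be called for either direction, this sketch
clears the flag even when no output buffer is supplied, which is the
behavior an input encoding would want.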

Note that, if a read or write function returns a successful result code
like @code{scm_mb_read_ok} or @code{scm_mb_write_ok}, then the remaining
input and the output must together represent the complete
input text; the encoding may not store any text temporarily in its
cookie.  This is because, if @code{scm_mb_iconv} returns a successful
result to the user, it is correct for the user to assume that all the
consumed input has been converted and placed in the output buffer.
There is no ``flush'' operation to push any final results out of the
encodings' buffers.

Here is the function you call to register a new encoding with the
conversion system:

@deftypefn {Libguile Function} void scm_mb_register_encoding (struct scm_mb_encoding *@var{encoding})
Add the encoding described by @code{*@var{encoding}} to the set
understood by @code{scm_mb_iconv_open}.  Once you have registered your
encoding, you can use it by calling @code{scm_mb_iconv_open} with one of
the names in @code{@var{encoding}->names}.
@end deftypefn
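
For illustration only, registration and the case-insensitive name lookup
performed by @code{scm_mb_iconv_open} might be organized along these
lines.  Nothing here is Guile's actual code: @code{struct toy_encoding}
carries just the @code{names} and @code{next} members, and
@code{toy_register_encoding}, @code{toy_find_encoding} and
@code{names_equal} are hypothetical.

@verbatim
```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Only the two members this sketch needs.  */
struct toy_encoding
{
  char **names;                 /* zero-terminated array of names */
  struct toy_encoding *next;    /* filled in by registration */
};

static struct toy_encoding *registered = NULL;

/* Push ENCODING onto the front of the list of known encodings.  */
void
toy_register_encoding (struct toy_encoding *encoding)
{
  assert (encoding != NULL && encoding->names != NULL);
  encoding->next = registered;
  registered = encoding;
}

/* Compare two encoding names, ignoring case.  */
static int
names_equal (const char *a, const char *b)
{
  while (*a != '\0' && *b != '\0')
    {
      if (tolower ((unsigned char) *a) != tolower ((unsigned char) *b))
        return 0;
      a++;
      b++;
    }
  return *a == *b;              /* equal only if both ended together */
}

/* Return the registered encoding one of whose names matches NAME, or
   a null pointer if there is none.  */
struct toy_encoding *
toy_find_encoding (const char *name)
{
  struct toy_encoding *e;
  char **n;

  for (e = registered; e != NULL; e = e->next)
    for (n = e->names; *n != NULL; n++)
      if (names_equal (*n, name))
        return e;
  return NULL;
}
```
@end verbatim

Since registration writes to the @code{next} member, the descriptor
passed in must remain allocated for as long as the list is in use; a
statically allocated structure is the simplest way to ensure that.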


@node Multibyte Text Processing Errors, Why Guile Does Not Use a Fixed-Width Encoding, Functions for Operating on Multibyte Text, Working With Multibyte Strings in C
@section Multibyte Text Processing Errors

This section describes error conditions which code can signal to
indicate problems encountered while processing multibyte text.  In each
case, the arguments @var{message} and @var{args} are an error format
string and arguments to be substituted into the string, as accepted by
the @code{display-error} function.

@deffn Condition text:not-char-boundary func message args object offset
By calling @var{func}, the program attempted to access a character at
byte offset @var{offset} in the Guile object @var{object}, but
@var{offset} is not the start of a character's encoding in @var{object}.

Typically, @var{object} is a string or symbol.  If the function signalling
the error cannot find the Guile object that contains the text it is
inspecting, it should use @code{#f} for @var{object}.
@end deffn

@deffn Condition text:bad-encoding func message args object
By calling @var{func}, the program attempted to interpret the text in
@var{object}, but @var{object} contains a byte sequence which is not a
valid encoding for any character.
@end deffn

@deffn Condition text:not-guile-char func message args number
By calling @var{func}, the program attempted to treat @var{number} as the
number of a character in the Guile character set, but @var{number} does
not correspond to any character in the Guile character set.
@end deffn

@deffn Condition text:unknown-conversion func message args from to
By calling @var{func}, the program attempted to convert from an encoding
named @var{from} to an encoding named @var{to}, but Guile does not
support such a conversion.
@end deffn

@deftypevr {Libguile Variable} SCM scm_text_not_char_boundary
@deftypevrx {Libguile Variable} SCM scm_text_bad_encoding
@deftypevrx {Libguile Variable} SCM scm_text_not_guile_char
These variables hold the Scheme symbol objects whose names are the
condition symbols above.  You can use these when signalling these
errors, instead of looking them up yourself.
@end deftypevr


@node Why Guile Does Not Use a Fixed-Width Encoding,  , Multibyte Text Processing Errors, Working With Multibyte Strings in C
@section Why Guile Does Not Use a Fixed-Width Encoding

Multibyte encodings are clumsier to work with than encodings which use a
fixed number of bytes for every character.  For example, using a
fixed-width encoding, we can extract the @var{i}th character of a string
in constant time, and we can always substitute the @var{i}th character
of a string with any other character without reallocating or copying the
string.

However, there are no fixed-width encodings which include the characters
we wish to include, and also fit in a reasonable amount of space.
Despite the Unicode standard's claims to the contrary, Unicode is not
really a fixed-width encoding.  Unicode uses surrogate pairs to
represent characters outside the 16-bit range; a surrogate pair must be
treated as a single character, but occupies two 16-bit spaces.  As of
this writing, there are already plans to assign characters to the
surrogate character codes.  Three- and four-byte encodings are
too wasteful for a majority of Guile's users, who only need @sc{ascii}
and a few accented characters.

Another alternative would be to have several different fixed-width
string representations, each with a different element size.  For each
string, Guile would use the smallest element size capable of
accommodating the string's text.  This would allow users of English and
the Western European languages to use the traditional memory-efficient
encodings.  However, if Guile has @var{n} string representations, then
users must write @var{n} versions of any code which manipulates text
directly --- one for each element size.  And if a user wants to operate
on two strings simultaneously, and wants to avoid testing the string
sizes within the loop, she must make @var{n}*@var{n} copies of the loop.
Most users will simply not bother.  Instead, they will write code which
supports only one string size, leaving us back where we started.  By
using a single internal representation, Guile makes it easier for users
to write multilingual code.

[[What about tagging each string with its encoding?
"Every extension must be written to deal with every encoding"]]

[[You don't really want to index strings anyway.]]

Finally, Guile's multibyte encoding is not so bad.  Unlike a two- or
four-byte encoding, it is efficient in space for American and European
users.  Furthermore, the properties described above mean that many
functions can be coded just as they would for a single-byte encoding;
see @ref{Promised Properties of the Guile Multibyte Encoding}.

@bye