author Andy Wingo <wingo@pobox.com> 2013-01-10 22:50:27 +0100
committer Andy Wingo <wingo@pobox.com> 2013-01-11 15:15:37 +0100
commit f05bb8494c9636cd7a44aaf7d4e08f4b66004b6e
tree 8eb62db9749f0f5f373cca3edef12e6f9a612e8b
parent b194b59fa10574868f7b1663a1f2d447baa18c5e
add bytevector->string and string->bytevector in new (ice-9 iconv) module

* module/Makefile.am:
* module/ice-9/iconv.scm: New module implementing procedures to encode
  and decode representations of strings as bytes.

* test-suite/Makefile.am:
* test-suite/tests/iconv.test: Add tests.

* doc/ref/api-data.texi: Add docs.
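
For orientation, here is a minimal sketch of how the two conversion procedures added by this commit might be used. It assumes a Guile build that includes this commit and an iconv that knows the "latin1" alias (the test suite below relies on the same alias). The variable names are illustrative only. The string is built from code points so that the example does not depend on the encoding of the file it is pasted into.

```scheme
(use-modules (ice-9 iconv)
             (rnrs bytevectors))

;; "été", built from code points (233 is U+00E9).
(define s (string (integer->char 233) #\t (integer->char 233)))

;; Latin-1 uses one byte per character for this string.
(define latin1-bytes (string->bytevector s "latin1"))

;; UTF-8 needs two bytes for each "é".
(define utf8-bytes (string->bytevector s "utf-8"))

;; Decoding with the matching encoding gives back the original string.
(define round-trip (bytevector->string latin1-bytes "latin1"))

(display (bytevector->u8-list latin1-bytes)) (newline)
(display (bytevector->u8-list utf8-bytes)) (newline)
```

The byte values checked by the new iconv.test for this string are #vu8(233 116 233) for Latin-1 and #vu8(195 169 116 195 169) for UTF-8.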
-rw-r--r-- doc/ref/api-data.texi 80
-rw-r--r-- module/Makefile.am 3
-rw-r--r-- module/ice-9/iconv.scm 82
-rw-r--r-- test-suite/Makefile.am 3
-rw-r--r-- test-suite/tests/iconv.test 115
5 files changed, 277 insertions, 6 deletions
diff --git a/doc/ref/api-data.texi b/doc/ref/api-data.texi
index 6d8de2bd6..3bd38d28b 100644
--- a/doc/ref/api-data.texi
+++ b/doc/ref/api-data.texi
@@ -1,6 +1,6 @@
@c -*-texinfo-*-
@c This is part of the GNU Guile Reference Manual.
-@c Copyright (C) 1996, 1997, 2000, 2001, 2002, 2003, 2004, 2006, 2007, 2008, 2009, 2010, 2011, 2012
+@c Copyright (C) 1996, 1997, 2000, 2001, 2002, 2003, 2004, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013
@c Free Software Foundation, Inc.
@c See the file guile.texi for copying conditions.
@@ -2881,6 +2881,7 @@ Guile provides all procedures of SRFI-13 and a few more.
* Reversing and Appending Strings:: Appending strings to form a new string.
* Mapping Folding and Unfolding:: Iterating over strings.
* Miscellaneous String Operations:: Replicating, insertion, parsing, ...
+* Representing Strings as Bytes:: Encoding and decoding strings.
* Conversion to/from C::
* String Internals:: The storage strategy for strings.
@end menu
@@ -4163,6 +4164,70 @@ a predicate, if it is a character, it is tested for equality and if it
is a character set, it is tested for membership.
@end deffn
+@node Representing Strings as Bytes
+@subsubsection Representing Strings as Bytes
+
+Out in the cold world outside of Guile, not all strings are treated in
+the same way. Out there there are only bytes, and there are many ways
+of representing a string (a sequence of characters) as binary data
+(sequences of bytes).
+
+As a user, usually you don't have to think about this very much. When
+you type on your keyboard, your system encodes your keystrokes as bytes
+according to the locale that you have configured on your computer.
+Guile uses the locale to decode those bytes back into characters --
+hopefully the same characters that you typed in.
+
+All is not so clear when dealing with a system with multiple users, such
+as a web server. Your web server might get a request from one user for
+data encoded in the ISO-8859-1 character set, and then another request
+from a different user for UTF-8 data.
+
+@cindex iconv
+@cindex character encoding
+Guile provides an @dfn{iconv} module for converting between strings and
+sequences of bytes. @xref{Bytevectors}, for more on how Guile
+represents raw byte sequences. This module gets its name from the
+common @sc{unix} command of the same name.
+
+Unlike the rest of the procedures in this section, you have to load the
+@code{iconv} module before having access to these procedures:
+
+@example
+(use-modules (ice-9 iconv))
+@end example
+
+@deffn string->bytevector string encoding [#:conversion-strategy='error]
+Encode @var{string} as a sequence of bytes.
+
+The string will be encoded in the character set specified by the
+@var{encoding} string. If the string has characters that cannot be
+represented in the encoding, by default this procedure raises an
+@code{encoding-error}, though the @code{#:conversion-strategy} keyword
+can specify other behaviors.
+
+The return value is a bytevector. @xref{Bytevectors}, for more on
+bytevectors. @xref{Ports}, for more on character encodings and
+conversion strategies.
+@end deffn
+
+@deffn bytevector->string bytevector encoding [#:conversion-strategy='error]
+Decode @var{bytevector} into a string.
+
+The bytes will be decoded from the character set specified by the
+@var{encoding} string. If the bytes do not form a valid encoding, by
+default this procedure raises a @code{decoding-error}, though that may
+be overridden with the @code{#:conversion-strategy} keyword.
+@xref{Ports}, for more on character encodings and conversion strategies.
+@end deffn
+
+@deffn call-with-encoded-output-string encoding proc [#:conversion-strategy='error]
+Like @code{call-with-output-string}, but instead of returning a string,
+returns an encoding of the string according to @var{encoding}, as a
+bytevector. This procedure can be more efficient than collecting a
+string and then converting it via @code{string->bytevector}.
+@end deffn
+
@node Conversion to/from C
@subsubsection Conversion to/from C
@@ -4172,9 +4237,9 @@ important.
In C, a string is just a sequence of bytes, and the character encoding
describes the relation between these bytes and the actual characters
-that make up the string. For Scheme strings, character encoding is
-not an issue (most of the time), since in Scheme you never get to see
-the bytes, only the characters.
+that make up the string. For Scheme strings, character encoding is not
+an issue (most of the time), since in Scheme you usually treat strings
+as character sequences, not byte sequences.
Converting to C and converting from C each have their own challenges.
@@ -4305,6 +4370,9 @@ into @var{encoding}.
If @var{lenp} is @code{NULL}, this function will return a null-terminated C
string. It will throw an error if the string contains a null
character.
+
+The Scheme interface to this function is @code{string->bytevector},
+from the @code{(ice-9 iconv)} module. @xref{Representing Strings as Bytes}.
@end deftypefn
@deftypefn {C Function} SCM scm_from_stringn (const char *str, size_t len, const char *encoding, scm_t_string_failed_conversion_handler handler)
@@ -4313,6 +4381,9 @@ length in bytes of the C string is input as @var{len}. The encoding of the C
string is passed as the ASCII, null-terminated C string @code{encoding}.
The @var{handler} parameters suggests a strategy for dealing with
unconvertable characters.
+
+The Scheme interface to this function is @code{bytevector->string}.
+@xref{Representing Strings as Bytes}.
@end deftypefn
The following conversion functions are provided as a convenience for the
@@ -4810,6 +4881,7 @@ the host's native endianness.
Bytevector contents can also be interpreted as Unicode strings encoded
in one of the most commonly available encoding formats.
+@xref{Representing Strings as Bytes}, for a more generic interface.
@lisp
(utf8->string (u8-list->bytevector '(99 97 102 101)))
diff --git a/module/Makefile.am b/module/Makefile.am
index 3d3eae364..472bc4838 100644
--- a/module/Makefile.am
+++ b/module/Makefile.am
@@ -1,6 +1,6 @@
## Process this file with automake to produce Makefile.in.
##
-## Copyright (C) 2009, 2010, 2011, 2012 Free Software Foundation, Inc.
+## Copyright (C) 2009, 2010, 2011, 2012, 2013 Free Software Foundation, Inc.
##
## This file is part of GUILE.
##
@@ -210,6 +210,7 @@ ICE_9_SOURCES = \
ice-9/getopt-long.scm \
ice-9/hcons.scm \
ice-9/i18n.scm \
+ ice-9/iconv.scm \
ice-9/lineio.scm \
ice-9/ls.scm \
ice-9/mapping.scm \
diff --git a/module/ice-9/iconv.scm b/module/ice-9/iconv.scm
new file mode 100644
index 000000000..40d595473
--- /dev/null
+++ b/module/ice-9/iconv.scm
@@ -0,0 +1,82 @@
+;;; Encoding and decoding byte representations of strings
+
+;; Copyright (C) 2013 Free Software Foundation, Inc.
+
+;;;; This library is free software; you can redistribute it and/or
+;;;; modify it under the terms of the GNU Lesser General Public
+;;;; License as published by the Free Software Foundation; either
+;;;; version 3 of the License, or (at your option) any later version.
+;;;;
+;;;; This library is distributed in the hope that it will be useful,
+;;;; but WITHOUT ANY WARRANTY; without even the implied warranty of
+;;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+;;;; Lesser General Public License for more details.
+;;;;
+;;;; You should have received a copy of the GNU Lesser General Public
+;;;; License along with this library; if not, write to the Free Software
+;;;; Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+
+;;; Code:
+
+(define-module (ice-9 iconv)
+ #:use-module (rnrs bytevectors)
+ #:use-module (ice-9 binary-ports)
+ #:use-module ((ice-9 rdelim) #:select (read-delimited))
+ #:export (string->bytevector
+ bytevector->string
+ call-with-encoded-output-string))
+
+;; Like call-with-output-string, but actually closes the port.
+(define (call-with-output-string* proc)
+ (let ((port (open-output-string)))
+ (proc port)
+ (let ((str (get-output-string port)))
+ (close-port port)
+ str)))
+
+(define (call-with-output-bytevector* proc)
+ (call-with-values (lambda () (open-bytevector-output-port))
+ (lambda (port get-bytevector)
+ (proc port)
+ (let ((bv (get-bytevector)))
+ (close-port port)
+ bv))))
+
+(define* (call-with-encoded-output-string encoding proc
+ #:key (conversion-strategy 'error))
+ (if (string-ci=? encoding "utf-8")
+ ;; I don't know why, but this appears to be faster; at least for
+ ;; serving examples/debug-sxml.scm (1464 reqs/s versus 850
+ ;; reqs/s).
+ (string->utf8 (call-with-output-string* proc))
+ (call-with-output-bytevector*
+ (lambda (port)
+ (set-port-encoding! port encoding)
+ (if conversion-strategy
+ (set-port-conversion-strategy! port conversion-strategy))
+ (proc port)))))
+
+;; TODO: Provide C implementations that call scm_from_stringn and
+;; friends?
+
+(define* (string->bytevector str encoding #:key (conversion-strategy 'error))
+ (if (string-ci=? encoding "utf-8")
+ (string->utf8 str)
+ (call-with-encoded-output-string
+ encoding
+ (lambda (port)
+ (display str port))
+ #:conversion-strategy conversion-strategy)))
+
+(define* (bytevector->string bv encoding #:key (conversion-strategy 'error))
+ (if (string-ci=? encoding "utf-8")
+ (utf8->string bv)
+ (let ((p (open-bytevector-input-port bv)))
+ (set-port-encoding! p encoding)
+ (if conversion-strategy
+ (set-port-conversion-strategy! p conversion-strategy))
+ (let ((res (read-delimited "" p)))
+ (close-port p)
+ (if (eof-object? res)
+ ""
+ res)))))
diff --git a/test-suite/Makefile.am b/test-suite/Makefile.am
index a843fcd39..880e1e2cf 100644
--- a/test-suite/Makefile.am
+++ b/test-suite/Makefile.am
@@ -1,7 +1,7 @@
## Process this file with automake to produce Makefile.in.
##
## Copyright 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,
-## 2010, 2011, 2012 Software Foundation, Inc.
+## 2010, 2011, 2012, 2013 Software Foundation, Inc.
##
## This file is part of GUILE.
##
@@ -62,6 +62,7 @@ SCM_TESTS = tests/00-initial-env.test \
tests/hash.test \
tests/hooks.test \
tests/i18n.test \
+ tests/iconv.test \
tests/import.test \
tests/interp.test \
tests/keywords.test \
diff --git a/test-suite/tests/iconv.test b/test-suite/tests/iconv.test
new file mode 100644
index 000000000..e6ee90d1d
--- /dev/null
+++ b/test-suite/tests/iconv.test
@@ -0,0 +1,115 @@
+;;;; iconv.test --- Exercise the iconv API. -*- coding: utf-8; mode: scheme; -*-
+;;;;
+;;;; Copyright (C) 2013 Free Software Foundation, Inc.
+;;;; Andy Wingo
+;;;;
+;;;; This library is free software; you can redistribute it and/or
+;;;; modify it under the terms of the GNU Lesser General Public
+;;;; License as published by the Free Software Foundation; either
+;;;; version 3 of the License, or (at your option) any later version.
+;;;;
+;;;; This library is distributed in the hope that it will be useful,
+;;;; but WITHOUT ANY WARRANTY; without even the implied warranty of
+;;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+;;;; Lesser General Public License for more details.
+;;;;
+;;;; You should have received a copy of the GNU Lesser General Public
+;;;; License along with this library; if not, write to the Free Software
+;;;; Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+
+(define-module (test-suite iconv)
+ #:use-module (ice-9 iconv)
+ #:use-module (rnrs bytevectors)
+ #:use-module (test-suite lib))
+
+
+(define exception:encoding-error
+ '(encoding-error . ""))
+
+(define exception:decoding-error
+ '(decoding-error . ""))
+
+
+(with-test-prefix "ascii string"
+ (let ((s "Hello, World!"))
+ ;; For ASCII, all of these encodings should be the same.
+
+ (pass-if "to ascii bytevector"
+ (equal? (string->bytevector s "ASCII")
+ #vu8(72 101 108 108 111 44 32 87 111 114 108 100 33)))
+
+ (pass-if "to ascii bytevector (length check)"
+ (equal? (string-length s)
+ (bytevector-length (string->bytevector s "ascii"))))
+
+ (pass-if "from ascii bytevector"
+ (equal? s
+ (bytevector->string (string->bytevector s "ascii") "ascii")))
+
+ (pass-if "to utf-8 bytevector"
+ (equal? (string->bytevector s "ASCII")
+ (string->bytevector s "utf-8")))
+
+ (pass-if "to UTF-8 bytevector (testing encoding case sensitivity)"
+ (equal? (string->bytevector s "ascii")
+ (string->bytevector s "UTF-8")))
+
+ (pass-if "from utf-8 bytevector"
+ (equal? s
+ (bytevector->string (string->bytevector s "utf-8") "utf-8")))
+
+ (pass-if "to latin1 bytevector"
+ (equal? (string->bytevector s "ASCII")
+ (string->bytevector s "latin1")))
+
+ (pass-if "from latin1 bytevector"
+ (equal? s
+ (bytevector->string (string->bytevector s "latin1") "latin1")))))
+
+(with-test-prefix "narrow non-ascii string"
+ (let ((s "été"))
+ (pass-if "to latin1 bytevector"
+ (equal? (string->bytevector s "latin1")
+ #vu8(233 116 233)))
+
+ (pass-if "to latin1 bytevector (length check)"
+ (equal? (string-length s)
+ (bytevector-length (string->bytevector s "latin1"))))
+
+ (pass-if "from latin1 bytevector"
+ (equal? s
+ (bytevector->string (string->bytevector s "latin1") "latin1")))
+
+ (pass-if "to utf-8 bytevector"
+ (equal? (string->bytevector s "utf-8")
+ #vu8(195 169 116 195 169)))
+
+ (pass-if "from utf-8 bytevector"
+ (equal? s
+ (bytevector->string (string->bytevector s "utf-8") "utf-8")))
+
+ (pass-if-exception "encode latin1 as ascii" exception:encoding-error
+ (string->bytevector s "ascii"))
+
+ (pass-if-exception "misparse latin1 as utf8" exception:decoding-error
+ (bytevector->string (string->bytevector s "latin1") "utf-8"))
+
+ (pass-if-exception "misparse latin1 as ascii" exception:decoding-error
+ (bytevector->string (string->bytevector s "latin1") "ascii"))))
+
+
+(with-test-prefix "wide non-ascii string"
+ (let ((s "ΧΑΟΣ"))
+ (pass-if "to utf-8 bytevector"
+ (equal? (string->bytevector s "utf-8")
+ #vu8(206 167 206 145 206 159 206 163)))
+
+ (pass-if "from utf-8 bytevector"
+ (equal? s
+ (bytevector->string (string->bytevector s "utf-8") "utf-8")))
+
+ (pass-if-exception "encode as ascii" exception:encoding-error
+ (string->bytevector s "ascii"))
+
+ (pass-if-exception "encode as latin1" exception:encoding-error
+ (string->bytevector s "latin1"))))
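
The tests above exercise only the default 'error conversion strategy. As a hedged sketch (not part of the commit), the following shows how a caller might catch the encoding-error key, select another strategy through #:conversion-strategy, and use call-with-encoded-output-string. The variable names are illustrative. The exact bytes produced under 'substitute depend on the port implementation, so the sketch does not assume them.

```scheme
(use-modules (ice-9 iconv)
             (rnrs bytevectors))

;; The Greek test string, built from code points so the example does
;; not depend on the encoding of the file it lives in.
(define chaos
  (apply string (map integer->char '(#x3A7 #x391 #x39F #x3A3))))

;; With the default 'error strategy, encoding to ASCII throws to the
;; encoding-error key; here the handler turns that into a symbol.
(define strict-result
  (catch 'encoding-error
    (lambda () (string->bytevector chaos "ascii"))
    (lambda (key . args) 'failed)))

;; With 'substitute, unrepresentable characters are replaced instead
;; of raising an error, so a bytevector is returned.
(define lenient-result
  (string->bytevector chaos "ascii" #:conversion-strategy 'substitute))

;; Collect output written to a port directly as encoded bytes.
(define greeting
  (call-with-encoded-output-string "utf-8"
    (lambda (port)
      (display "Hello, " port)
      (display "World!" port))))
```

Here greeting should hold the same bytes as (string->bytevector "Hello, World!" "ASCII"), since the string is pure ASCII.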