Doc updates for character encoding of source code files

* NEWS
* doc/ref/scheme-scripts.texi: doc updates for character encoding of source code
* doc/ref/api-evaluation.texi: doc updates for character encoding of source code
author: Michael Gran <spk121@yahoo.com> 2009-09-05 10:42:15 -0700
committer: Michael Gran <spk121@yahoo.com> 2009-09-05 10:42:15 -0700
commit: 8748ffeaa770ed47192f970ef5302a7c7aa7a935 (patch)
tree: 99dc3f28337308232e29d9ac462ed86930e2d732
parent: 28cc8dac2f520fa9de29e93dca52e4892b945a3c (diff)
download: guile-8748ffeaa770ed47192f970ef5302a7c7aa7a935.tar.gz
3 files changed, 88 insertions, 0 deletions
diff --git a/NEWS b/NEWS
index a3c4dddc1..147d0822a 100644
--- a/NEWS
+++ b/NEWS
@@ -10,6 +10,18 @@ prerelease, and a full NEWS corresponding to 1.8 -> 2.0.)
 
 Changes in 1.9.3 (since the 1.9.2 prerelease):
 
+** Non-ASCII source code files can be read, but require coding
+   declarations
+
+The default reader now handles source code files for some of the
+non-ASCII character encodings, such as UTF-8.  A non-ASCII source file
+should have an encoding declaration near the top of the file.  Also,
+there is a new function file-encoding that scans a port for a coding
+declaration.
+
+The pre-1.9.3 reader handled 8-bit clean but otherwise unspecified source
+code.  This use is now discouraged.
+
 ** Ports do transcoding
 
 Ports now have an associated character encoding, and port read/write
diff --git a/doc/ref/api-evaluation.texi b/doc/ref/api-evaluation.texi
index d8412154c..9fc5ef5de 100644
--- a/doc/ref/api-evaluation.texi
+++ b/doc/ref/api-evaluation.texi
@@ -17,6 +17,7 @@ loading, evaluating, and compiling Scheme code at run time.
 * Fly Evaluation::              Procedures for on the fly evaluation.
 * Compilation::                 How to compile Scheme files and procedures.
 * Loading::                     Loading Scheme code from file.
+* Character Encoding of Source Files:: Loading non-ASCII Scheme code from file.
 * Delayed Evaluation::          Postponing evaluation until it is needed.
 * Local Evaluation::            Evaluation in a local environment.
 * Evaluator Behaviour::         Modifying Guile's evaluator.
@@ -229,6 +230,12 @@ Thus a Guile script often starts like this.
 More details on Guile scripting can be found in the scripting section
 (@pxref{Guile Scripting}).
 
+There is one special case where the contents of a comment can actually
+affect the interpretation of code.  When a character encoding
+declaration, such as @code{coding: utf-8}, appears in one of the first
+few lines of a source file, it indicates to Guile's default reader
+that this source code file is not ASCII.  For details see @ref{Character
+Encoding of Source Files}.
 
 @node Case Sensitivity
 @subsubsection Case Sensitivity
@@ -590,6 +597,69 @@ a file to load.  By default, @code{%load-extensions} is bound to the
 list @code{("" ".scm")}.
 @end defvar
 
+@node Character Encoding of Source Files
+@subsection Character Encoding of Source Files
+
+@cindex primitive-load
+@cindex load
+Scheme source code files are usually encoded in ASCII, but the
+built-in reader can interpret other character encodings.  The
+procedure @code{primitive-load}, and by extension the functions that
+call it, such as @code{load}, first scan the top 500 characters of the
+file for a coding declaration.
+
+A coding declaration has the form @code{coding: XXXXXX}, where
+@code{XXXXXX} is the name of a character encoding in which the source
+code file has been encoded.  The coding declaration must appear in a
+Scheme comment.  It can be either a semicolon-initiated comment or a block
+@code{#!} comment.
+
+The name of the character encoding in the coding declaration is
+typically lower case and contains only letters, numbers, and
+hyphens.  The most common examples of character encodings are
+@code{utf-8} and @code{iso-8859-1}.  Following this convention keeps
+the coding declaration compatible with Emacs.
+
+For source code, only a subset of all possible character encodings can
+be interpreted by the built-in source code reader.  Only those
+character encodings in which ASCII text appears unmodified can be
+used.  This includes @code{UTF-8} and @code{ISO-8859-1} through
+@code{ISO-8859-15}.  The character encodings @code{UTF-16}
+and @code{UTF-32} may not be used because they are not compatible with
+ASCII.
+
+@cindex read
+@cindex set-port-encoding!
+There might be a scenario in which one would want to read non-ASCII
+code from a port, such as with the function @code{read}, instead of
+with @code{load}.  If the port's character encoding is the same as the
+encoding of the code to be read by the port, no other special
+handling is necessary.  The port will automatically do the character
+encoding conversion.  The functions @code{setlocale} and
+@code{set-port-encoding!} are used to set port encodings.
+
+If a port is used to read code of unknown character encoding, this can
+be accomplished in three steps.  First, the character encoding of the
+port should be set to ISO-8859-1 using @code{set-port-encoding!}.
+Then, the procedure @code{file-encoding}, described below, is used to
+scan for a coding declaration when reading from the port.  As a side
+effect, it rewinds the port after its scan is complete.  After that,
+the port's character encoding should be set to the encoding returned
+by @code{file-encoding}, if any, again by using
+@code{set-port-encoding!}.  Then the code can be read as normal.
+
+@deffn {Scheme Procedure} file-encoding port
+@deffnx {C Function} scm_file_encoding port
+Scans @var{port} for an Emacs-like character coding declaration near the
+top of its contents.  The port must have randomly accessible contents.  The
+coding declaration is of the form @code{coding: XXXXX} and must appear
+in a Scheme comment.
+
+Returns a string containing the character encoding of the file
+if a declaration was found, or @code{#f} otherwise.  The port is
+rewound.
+@end deffn
+
 
 @node Delayed Evaluation
 @subsection Delayed Evaluation
diff --git a/doc/ref/scheme-scripts.texi b/doc/ref/scheme-scripts.texi
index e12eee60f..249bc3414 100644
--- a/doc/ref/scheme-scripts.texi
+++ b/doc/ref/scheme-scripts.texi
@@ -64,6 +64,12 @@ operating system never reads this far, but Guile treats this as the end
 of the comment begun on the first line by the @samp{#!} characters.
 
 @item
+If this source code file is not ASCII or ISO-8859-1 encoded, a coding
+declaration such as @code{coding: utf-8} should appear in a comment
+somewhere in the first five lines of the file: see @ref{Character
+Encoding of Source Files}.
+
+@item
 The rest of the file should be a Scheme program.
 
 @end itemize
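
The three-step procedure documented in this patch (read as ISO-8859-1, scan for a coding declaration, then switch to the declared encoding) can be illustrated with a minimal Python sketch. This is not Guile's implementation: the regular expression, the names `find_coding_declaration` and `decode_source`, and the restriction to semicolon comments are all assumptions made for illustration. Only the 500-character scan window and the `coding: NAME` form come from the documentation above.

```python
import re

# Assumed pattern: "coding:" followed by a name made of letters,
# digits, hyphens, and underscores (e.g. utf-8, iso-8859-1).
CODING_RE = re.compile(r"coding:\s*([A-Za-z0-9_-]+)")

# The documentation says the top 500 characters are scanned.
SCAN_LIMIT = 500


def find_coding_declaration(raw):
    """Return the declared encoding name, or None if none is found.

    Simplification: only text after the first ';' on a line is treated
    as a comment; '#!' block comments and semicolons inside string
    literals are not handled.
    """
    # Step 1: decode as ISO-8859-1, which maps every byte to a
    # character and therefore cannot fail.
    head = raw[:SCAN_LIMIT].decode("iso-8859-1")
    # Step 2: look for "coding: NAME" inside a semicolon comment.
    for line in head.split("\n"):
        _, sep, comment = line.partition(";")
        if not sep:
            continue
        match = CODING_RE.search(comment)
        if match:
            return match.group(1)
    return None


def decode_source(raw):
    """Decode source bytes using the declared encoding, if any."""
    # Step 3: re-read the whole input with the declared encoding,
    # falling back to ISO-8859-1 when there is no declaration.
    encoding = find_coding_declaration(raw)
    return raw.decode(encoding if encoding else "iso-8859-1")
```

Because the declaration itself is plain ASCII, it survives the initial ISO-8859-1 decoding unchanged. That is the same reason the documentation restricts source files to ASCII-compatible encodings.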