author | Gerd Moellmann <gerd@gnu.org> | 2000-05-11 15:44:54 +0000 |
---|---|---|
committer | Gerd Moellmann <gerd@gnu.org> | 2000-05-11 15:44:54 +0000 |
commit | 0ace421a2d9e1f69f139c3316df662a541acbd67 (patch) | |
tree | 37db4c604f04574142f0c4a5d9b2966ec968c202 /lispref/nonascii.texi | |
parent | 796184bc2047de12f0cfe7ae178be236f5a0256a (diff) | |
*** empty log message ***
Diffstat (limited to 'lispref/nonascii.texi')
-rw-r--r-- | lispref/nonascii.texi | 138 |
1 files changed, 51 insertions, 87 deletions
diff --git a/lispref/nonascii.texi b/lispref/nonascii.texi
index 149d0354c29..29d97d81acd 100644
--- a/lispref/nonascii.texi
+++ b/lispref/nonascii.texi
@@ -59,12 +59,13 @@ stored. The first byte of a multibyte character is always in the range
 character are always in the range 160 through 255 (octal 0240 through
 0377); these values are @dfn{trailing codes}.
 
- Some sequences of bytes do not form meaningful multibyte characters:
-for example, a single isolated byte in the range 128 through 255 is
-never meaningful. Such byte sequences are not entirely valid, and never
-appear in proper multibyte text (since that consists of a sequence of
-@emph{characters}); but they can appear as part of ``raw bytes''
-(@pxref{Explicit Encoding}).
+ Some sequences of bytes are not valid in multibyte text: for example,
+a single isolated byte in the range 128 through 159 is not allowed.
+But character codes 128 through 159 can appear in multibyte text,
+represented as two-byte sequences. None of the character codes 128
+through 255 normally appear in ordinary multibyte text, but they do
+appear in multibyte buffers and strings when you do explicit encoding
+and decoding (@pxref{Explicit Encoding}).
 
  In a buffer, the buffer-local value of the variable
 @code{enable-multibyte-characters} specifies the representation used.
@@ -237,10 +238,11 @@ If @var{string} is already a multibyte string, then the value is
 codes. The valid character codes for unibyte representation range from
 0 to 255---the values that can fit in one byte. The valid character
 codes for multibyte representation range from 0 to 524287, but not all
-values in that range are valid. In particular, the values 128 through
-255 are not legitimate in multibyte text (though they can occur in ``raw
-bytes''; @pxref{Explicit Encoding}). Only the @sc{ascii} codes 0
-through 127 are fully legitimate in both representations.
+values in that range are valid. The values 128 through 255 are not
+really proper in multibyte text, but they can occur if you do explicit
+encoding and decoding (@pxref{Explicit Encoding}). Some other character
+codes cannot occur at all in multibyte text. Only the @sc{ascii} codes
+0 through 127 are truly legitimate in both representations.
 
 @defun char-valid-p charcode
 This returns @code{t} if @var{charcode} is valid for either one of the two
@@ -410,17 +412,9 @@ is non-@code{nil}, then each character in the region is translated
 through this table, and the value returned describes the translated
 characters instead of the characters actually in the buffer.
 
-In two peculiar cases, the value includes the symbol @code{unknown}:
-
-@itemize @bullet
-@item
-When a unibyte buffer contains non-@sc{ascii} characters.
-
-@item
-When a multibyte buffer contains invalid byte-sequences (raw bytes).
-@xref{Explicit Encoding}.
-@end itemize
-@end defun
+When a buffer contains non-@sc{ascii} characters, codes 128 through 255,
+they are assigned the character set @code{unknown}. @xref{Explicit
+Encoding}.
 
 @defun find-charset-string string &optional translation
 This function returns a list of the character sets that appear in the
@@ -690,7 +684,7 @@ encode all the character sets in the list @var{charsets}.
 
 @defun detect-coding-region start end &optional highest
 This function chooses a plausible coding system for decoding the text
-from @var{start} to @var{end}. This text should be ``raw bytes''
+from @var{start} to @var{end}. This text should be a byte sequence
 (@pxref{Explicit Encoding}).
 
 Normally this function returns a list of coding systems that could
@@ -923,90 +917,59 @@ ability to use a coding system to encode or decode the text.
 You can also explicitly encode and decode text using the functions
 in this section.
 
-@cindex raw bytes
  The result of encoding, and the input to decoding, are not ordinary
-text. They are ``raw bytes''---bytes that represent text in the same
-way that an external file would. When a buffer contains raw bytes, it
-is most natural to mark that buffer as using unibyte representation,
-using @code{set-buffer-multibyte} (@pxref{Selecting a Representation}),
-but this is not required. If the buffer's contents are only temporarily
-raw, leave the buffer multibyte, which will be correct after you decode
-them.
-
- The usual way to get raw bytes in a buffer, for explicit decoding, is
-to read them from a file with @code{insert-file-contents-literally}
-(@pxref{Reading from Files}) or specify a non-@code{nil} @var{rawfile}
-argument when visiting a file with @code{find-file-noselect}.
-
- The usual way to use the raw bytes that result from explicitly
-encoding text is to copy them to a file or process---for example, to
-write them with @code{write-region} (@pxref{Writing to Files}), and
-suppress encoding for that @code{write-region} call by binding
-@code{coding-system-for-write} to @code{no-conversion}.
-
- Raw bytes typically contain stray individual bytes with values in the
-range 128 through 255, that are legitimate only as part of multibyte
-sequences. Even if the buffer is multibyte, Emacs treats each such
-individual byte as a character and uses the byte value as its character
-code. In this way, character codes 128 through 255 can be found in a
-multibyte buffer, even though they are not legitimate multibyte
-character codes.
-
- Raw bytes sometimes contain overlong byte-sequences that look like a
-proper multibyte character plus extra superfluous trailing codes. For
-most purposes, Emacs treats such a sequence in a buffer or string as a
-single character, and if you look at its character code, you get the
-value that corresponds to the multibyte character
-sequence---disregarding the extra trailing codes. This is not quite
-clean, but raw bytes are used only in limited ways, so as a practical
-matter it is not worth the trouble to treat this case differently.
-
- When a multibyte buffer contains illegitimate byte sequences,
-sometimes insertion or deletion can cause them to coalesce into a
-legitimate multibyte character. For example, suppose the buffer
-contains the sequence 129 68 192, 68 being the character @samp{D}. If
-you delete the @samp{D}, the bytes 129 and 192 become adjacent, and thus
-become one multibyte character (Latin-1 A with grave accent). Point
-moves to one side or the other of the character, since it cannot be
-within a character. Don't be alarmed by this.
-
- Some really peculiar situations prevent proper coalescence. For
-example, if you narrow the buffer so that the accessible portion begins
-just before the @samp{D}, then delete the @samp{D}, the two surrounding
-bytes cannot coalesce because one of them is outside the accessible
-portion of the buffer. In this case, the deletion cannot be done, so
-@code{delete-region} signals an error.
+text. They logically consist of a series of byte values; that is, a
+series of characters whose codes are in the range 0 through 255. In a
+multibyte buffer or string, character codes 128 through 159 are
+represented by multibyte sequences, but this is invisible to Lisp
+programs.
+
+ The usual way to read a file into a buffer as a sequence of bytes, so
+you can decode the contents explicitly, is with
+@code{insert-file-contents-literally} (@pxref{Reading from Files});
+alternatively, specify a non-@code{nil} @var{rawfile} argument when
+visiting a file with @code{find-file-noselect}. These methods result in
+a unibyte buffer.
+
+ The usual way to use the byte sequence that results from explicitly
+encoding text is to copy it to a file or process---for example, to write
+it with @code{write-region} (@pxref{Writing to Files}), and suppress
+encoding by binding @code{coding-system-for-write} to
+@code{no-conversion}.
 
  Here are the functions to perform explicit encoding or decoding. The
-decoding functions produce ``raw bytes''; the encoding functions are
-meant to operate on ``raw bytes''. All of these functions discard text
-properties.
+decoding functions produce sequences of bytes; the encoding functions
+are meant to operate on sequences of bytes. All of these functions
+discard text properties.
 
 @defun encode-coding-region start end coding-system
 This function encodes the text from @var{start} to @var{end} according
 to coding system @var{coding-system}. The encoded text replaces the
-original text in the buffer. The result of encoding is ``raw bytes,''
-but the buffer remains multibyte if it was multibyte before.
+original text in the buffer. The result of encoding is logically a
+sequence of bytes, but the buffer remains multibyte if it was multibyte
+before.
 @end defun
 
 @defun encode-coding-string string coding-system
 This function encodes the text in @var{string} according to coding
 system @var{coding-system}. It returns a new string containing the
-encoded text. The result of encoding is a unibyte string of ``raw bytes.''
+encoded text. The result of encoding is a unibyte string.
 @end defun
 
 @defun decode-coding-region start end coding-system
 This function decodes the text from @var{start} to @var{end} according
 to coding system @var{coding-system}. The decoded text replaces the
 original text in the buffer. To make explicit decoding useful, the text
-before decoding ought to be ``raw bytes.''
+before decoding ought to be a sequence of byte values, but both
+multibyte and unibyte buffers are acceptable.
 @end defun
 
 @defun decode-coding-string string coding-system
 This function decodes the text in @var{string} according to coding
 system @var{coding-system}. It returns a new string containing the
 decoded text. To make explicit decoding useful, the contents of
-@var{string} ought to be ``raw bytes.''
+@var{string} ought to be a sequence of byte values, but a multibyte
+string is acceptable.
 @end defun
 
 @node Terminal I/O Encoding
@@ -1051,7 +1014,7 @@ that means do not encode terminal output.
 
  On MS-DOS and Microsoft Windows, Emacs guesses the appropriate
 end-of-line conversion for a file by looking at the file's name. This
-feature classifies fils as @dfn{text files} and @dfn{binary files}. By
+feature classifies files as @dfn{text files} and @dfn{binary files}. By
 ``binary file'' we mean a file of literal byte values that are not
 necessarily meant to be characters; Emacs does no end-of-line conversion
 and no character code conversion for them. On the other hand, the bytes
@@ -1157,14 +1120,14 @@ Here @var{input-method} is the input method name, a string;
 environment this input method is recommended for. (That serves only for
 documentation purposes.)
 
-@var{title} is a string to display in the mode line while this method is
-active. @var{description} is a string describing this method and what
-it is good for.
-
 @var{activate-func} is a function to call to activate this method. The
 @var{args}, if any, are passed as arguments to @var{activate-func}. All
 told, the arguments to @var{activate-func} are @var{input-method} and
 the @var{args}.
+
+@var{title} is a string to display in the mode line while this method is
+active. @var{description} is a string describing this method and what
+it is good for.
 @end defvar
 
  The fundamental interface to input methods is through the
@@ -1202,3 +1165,4 @@ Changing the locale can cause messages to appear according to the
 conventions of a different language. If the variable is @code{nil}, the
 locale is specified by environment variables in the usual POSIX fashion.
 @end defvar
+