diff --git a/data/i18n_sdd.txt b/data/i18n_sdd.txt
new file mode 100644
index 000000000..5c6cbcedc
--- /dev/null
+++ b/data/i18n_sdd.txt
@@ -0,0 +1,2337 @@
+
+
+ WORKING DRAFT Ira McDonald
+ <i18n_sdd.txt> High North Inc
+
+ Common UNIX Printing System ("CUPS")
+ Internationalization Software Design Description v0.3
+
+ Copyright (C) Easy Software Products (2002) - All Rights Reserved
+
+
+ Status of this Document
+
+ This document is an unapproved working draft and is incomplete in some
+ sections (see 'Ed Note:' comments).
+
+
+ Abstract
+
+ This document provides general information and high-level design for the
+ Internationalization extensions for the Common UNIX Printing System
+ ("CUPS") Version 1.2. This document also provides C language header
+ files and high-level pseudo-code for all new modules and external
+ functions.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ McDonald June 20, 2002 [Page 1]
+
+ CUPS Internationalization Software Design Description v0.3
+
+ Table of Contents
+
+ 1. Scope
+ 1.1. Identification
+ 1.2. System Overview
+ 1.3. Document Overview
+ 2. References
+ 2.1. CUPS References
+ 2.2. Other Documents
+ 3. Design Overview
+ 3.1. Transcoding - New
+ 3.1.1. transcode.h - Transcoding header
+ 3.1.1.1. cups_cmap_t - SBCS Charmap Structure
+ 3.1.1.2. cups_dmap_t - DBCS Charmap Structure
+ 3.1.2. transcode.c - Transcoding module
+ 3.1.2.1. cupsUtf8ToCharset()
+ 3.1.2.2. cupsCharsetToUtf8()
+ 3.1.2.3. cupsUtf8ToUtf16()
+ 3.1.2.4. cupsUtf16ToUtf8()
+ 3.1.2.5. cupsUtf8ToUtf32()
+ 3.1.2.6. cupsUtf32ToUtf8()
+ 3.1.2.7. cupsUtf16ToUtf32()
+ 3.1.2.8. cupsUtf32ToUtf16()
+ 3.1.2.9. Transcoding Utility Functions
+ 3.1.2.9.1. cupsCharmapGet()
+ 3.1.2.9.2. cupsCharmapFree()
+ 3.1.2.9.3. cupsCharmapFlush()
+ 3.2. Normalization - New
+ 3.2.1. normalize.h - Normalization header
+ 3.2.1.1. cups_normmap_t - Normalize Map Structure
+ 3.2.1.2. cups_foldmap_t - Case Fold Map Structure
+ 3.2.1.3. cups_propmap_t - Char Property Map Structure
+ 3.2.1.4. cups_prop_t - Char Property Structure
+ 3.2.1.5. cups_breakmap_t - Line Break Map Structure
+ 3.2.1.6. cups_combmap_t - Combining Class Map Structure
+ 3.2.1.7. cups_comb_t - Combining Class Structure
+ 3.2.2. normalize.c - Normalization module
+ 3.2.2.1. cupsUtf8Normalize()
+ 3.2.2.2. cupsUtf32Normalize()
+ 3.2.2.3. cupsUtf8CaseFold()
+ 3.2.2.4. cupsUtf32CaseFold()
+ 3.2.2.5. cupsUtf8CompareCaseless()
+ 3.2.2.6. cupsUtf32CompareCaseless()
+ 3.2.2.7. cupsUtf8CompareIdentifier()
+ 3.2.2.8. cupsUtf32CompareIdentifier()
+ 3.2.2.9. cupsUtf32CharacterProperty()
+ 3.2.2.10. Normalization Utility Functions
+ 3.2.2.10.1. cupsNormalizeMapsGet()
+ 3.2.2.10.2. cupsNormalizeMapsFree()
+ 3.2.2.10.3. cupsNormalizeMapsFlush()
+ 3.3. Language - Existing
+ 3.3.1. language.h - Language header
+ 3.3.2. language.c - Language module
+ 3.3.2.1. cupsLangEncoding() - Existing
+ 3.3.2.2. cupsLangFlush() - Existing
+ 3.3.2.3. cupsLangFree() - Existing
+ 3.3.2.4. cupsLangGet() - Existing
+ 3.3.2.5. cupsLangPrintf() - New
+ 3.3.2.6. cupsLangPuts() - New
+ 3.3.2.7. cupsEncodingName() - New
+ 3.4. Common Text Filter - Existing
+ 3.4.1. textcommon.h - Common text filter header
+ 3.4.1.1. lchar_t - Character/Attribute Structure
+ 3.4.2. textcommon.c - Common text filter
+ 3.4.2.1. TextMain() - Existing
+ 3.4.2.2. compare_keywords() - Existing
+ 3.4.2.3. getutf8() - Existing
+ 3.5. Text to PostScript Filter - Existing
+ 3.5.1. texttops.c - Text to PostScript filter
+ 3.5.1.1. main() - Existing
+ 3.5.1.2. WriteEpilogue () - Existing
+ 3.5.1.3. WritePage () - Existing
+ 3.5.1.4. WriteProlog () - Existing
+ 3.5.1.5. write_line() - Existing
+ 3.5.1.6. write_string() - Existing
+ 3.5.1.7. write_text() - Existing
+ A. Glossary
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ 1. Scope
+
+
+
+ 1.1. Identification
+
+ This document provides general information and high-level design for the
+ Internationalization extensions for the Common UNIX Printing System
+ ("CUPS") Version 1.2. This document also provides C language header
+ files and high-level pseudo-code for all new modules and external
+ functions.
+
+
+ 1.2. System Overview
+
+ The CUPS Internationalization extensions provide multilingual support
+ via Unicode 3.2:2002 [UNICODE3.2] / ISO-10646-1:2000 [ISO10646-1] and a
+ suite of local character sets (including all adopted parts of ISO-8859
+ and many MS Windows code pages) for CUPS 1.2.
+
+ The CUPS Internationalization extensions support UTF-8 [RFC2279] as the
+ common stream-oriented representation of all character data. UTF-8 is
+ defined in [ISO10646-1] and is further constrained (for integrity and
+ security) by [UNICODE3.2].
+
+ UTF-8 is the native character set of LDAPv3 [RFC2251], SLPv2 [RFC2608],
+ IPP/1.1 [RFC2910] [RFC2911], and many other Internet protocols.
+
+
+ 1.3. Document Overview
+
+
+ This software design description document is organized into the
+ following sections:
+
+ o 1 - Scope
+ o 2 - References
+ o 3 - Design Overview
+ o A - Glossary
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ 2. References
+
+
+
+ 2.1. CUPS References
+
+ See: Section 2.1 'CUPS Documentation' of CUPS Software Design
+ Description.
+
+
+ 2.2. Other Documents
+
+ The following non-CUPS documents are referenced by this document.
+
+ [ANSI-X3.4] ANSI Coded Character Set - 7-bit American National Standard
+ Code for Information Interchange, ANSI X3.4, 1986 (aka US-ASCII).
+
+ [GB2312] Code of Chinese Graphic Character Set for Information
+ Interchange, Primary Set, GB 2312, 1980.
+
+ [ISO639-1] Codes for the Representation of Names of Languages -- Part 1:
+ Alpha-2 Code, ISO/IEC 639-1, 2000.
+
+ [ISO639-2] Codes for the Representation of Names of Languages -- Part 2:
+ Alpha-3 Code, ISO/IEC 639-2, 1998.
+
+ [ISO646] Information Technology - ISO 7-bit Coded Character Set for
+ Information Interchange, ISO/IEC 646, 1991.
+
+ [ISO2022] Information Processing - ISO 7-bit and 8-bit Coded Character
+ Sets - Code Extension Techniques, ISO/IEC 2022, 1994. (Technically
+ identical to ECMA-35.)
+
+ [ISO3166-1] Codes for the Representation of Names of Countries and their
+ Subdivisions, Part 1: Country Codes, ISO 3166-1, 1997.
+
+ [ISO8859] Information Processing - 8-bit Single-Byte Coded Graphic
+ Character Sets, ISO/IEC 8859-n, 1987-2001.
+
+ [ISO10646-1] Information Technology - Universal Multiple-Octet Coded
+ Character Set (UCS) - Part 1: Architecture and Basic Multilingual
+ Plane, ISO/IEC 10646-1, September 2000.
+
+ [ISO10646-2] Information Technology - Universal Multiple-Octet Coded
+ Character Set (UCS) - Part 2: Supplementary Planes, ISO/IEC 10646-2,
+ January 2001.
+
+ [RFC2119] Bradner. Key words for use in RFCs to Indicate Requirement
+ Levels, RFC 2119, March 1997.
+
+
+
+
+ [RFC2251] Wahl, Howes, Kille. Lightweight Directory Access Protocol
+ Version 3 (LDAPv3), RFC 2251, December 1997.
+
+ [RFC2277] Alvestrand. IETF Policy on Character Sets and Languages, RFC
+ 2277, January 1998.
+
+ [RFC2279] Yergeau. UTF-8, a Transformation Format of ISO 10646, RFC
+ 2279, January 1998.
+
+ [RFC2608] Guttman, Perkins, Veizades, Day. Service Location Protocol
+ Version 2 (SLPv2), RFC 2608, June 1999.
+
+ [RFC2910] Herriot, Butler, Moore, Turner, Wenn. Internet Printing
+ Protocol/1.1: Encoding and Transport, RFC 2910, September 2000.
+
+ [RFC2911] Hastings, Herriot, deBry, Isaacson, Powell. Internet Printing
+ Protocol/1.1: Model and Semantics, RFC 2911, September 2000.
+
+ [UNICODE3.0] Unicode Consortium, Unicode Standard Version 3.0,
+ Addison-Wesley Developers Press, ISBN 0-201-61633-5, 2000.
+
+ [UNICODE3.1] Unicode Consortium, Unicode Standard Version 3.1 (UAX-27),
+ May 2001.
+
+ [UNICODE3.2] Unicode Consortium, Unicode Standard Version 3.2 (UAX-28),
+ March 2002.
+
+ [US-ASCII] See [ANSI-X3.4] above.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ 3. Design Overview
+
+ The CUPS Internationalization extensions are composed of several header
+ files and modules which extend the Language functions in the existing
+ CUPS Application Programmers Interface (API).
+
+
+ 3.1. Transcoding - New
+
+ Initially, the CUPS Internationalization extensions will only support
+ SBCS (single-byte character set) transcoding. But the design allows
+ future support for DBCS (double-byte character set) transcoding for CJK
+ (Chinese/Japanese/Korean) languages and the MBCS (multiple-byte
+ character set) compound sets that use escapes for charset switching.
+
+ To reduce code size and increase performance, each conventional
+ 'mapping file' (a table of values in a legacy character set with their
+ corresponding Unicode scalar values) will also be sorted and stored in
+ memory as a reverse map (for efficient conversion from Unicode scalar
+ values to the corresponding legacy character set values). Transcoding
+ will then be done directly by two-level lookup (without any searching
+ or sorting).
+
+ [Ed Note: CJK languages will be fairly costly in mapping table sizes,
+ because they have thousands (or tens of thousands) of codepoints.]
+
+
+
+ 3.1.1. transcode.h - Transcoding header
+
+ /*
+ * "$Id: i18n_sdd.txt 2678 2002-08-19 01:15:26Z mike $"
+ *
+ * Transcoding support for the Common UNIX Printing System (CUPS).
+ *
+ * Copyright 1997-2002 by Easy Software Products.
+ *
+ * These coded instructions, statements, and computer programs are
+ * the property of Easy Software Products and are protected by Federal
+ * copyright law. Distribution and use rights are outlined in the
+ * file "LICENSE.txt" which should have been included with this file.
+ * If this file is missing or damaged please contact Easy Software
+ * Products at:
+ *
+ * Attn: CUPS Licensing Information
+ * Easy Software Products
+ * 44141 Airport View Drive, Suite 204
+ * Hollywood, Maryland 20636-3111 USA
+ *
+ * Voice: (301) 373-9603
+ * EMail: cups-info@cups.org
+ * WWW: http://www.cups.org
+ */
+
+ #ifndef _CUPS_TRANSCODE_H_
+ # define _CUPS_TRANSCODE_H_
+
+ /*
+ * Include necessary headers...
+ */
+
+ # include "cups/language.h"
+
+ # ifdef __cplusplus
+ extern "C" {
+ # endif /* __cplusplus */
+
+ /*
+ * Types...
+ */
+
+ typedef unsigned char utf8_t; /* UTF-8 Unicode/ISO-10646 code unit */
+ typedef unsigned short utf16_t; /* UTF-16 Unicode/ISO-10646 code unit */
+ typedef unsigned long utf32_t; /* UTF-32 Unicode/ISO-10646 code unit */
+ typedef unsigned short ucs2_t; /* UCS-2 Unicode/ISO-10646 code unit */
+ typedef unsigned long ucs4_t; /* UCS-4 Unicode/ISO-10646 code unit */
+ typedef unsigned char sbcs_t; /* SBCS Legacy 8-bit code unit */
+ typedef unsigned short dbcs_t; /* DBCS Legacy 16-bit code unit */
+
+ /*
+ * Structures...
+ */
+
+ typedef struct cups_cmap_str /**** SBCS Charmap Cache Structure ****/
+ {
+ struct cups_cmap_str *next; /* Next charmap in cache */
+ int used; /* Number of times entry used */
+ cups_encoding_t encoding; /* Legacy charset encoding */
+ ucs2_t char2uni[256]; /* Map Legacy SBCS -> UCS-2 */
+ sbcs_t *uni2char[256]; /* Map UCS-2 -> Legacy SBCS */
+ } cups_cmap_t;
+
+ #if 0
+ typedef struct cups_dmap_str /**** DBCS Charmap Cache Structure ****/
+ {
+ struct cups_dmap_str *next; /* Next charmap in cache */
+ int used; /* Number of times entry used */
+ cups_encoding_t encoding; /* Legacy charset encoding */
+ ucs2_t *char2uni[256]; /* Map Legacy DBCS -> UCS-2 */
+ dbcs_t *uni2char[256]; /* Map UCS-2 -> Legacy DBCS */
+ } cups_dmap_t;
+ #endif
+
+
+
+ /*
+ * Constants...
+ */
+ #define CUPS_MAX_USTRING 1024 /* Maximum size of Unicode string */
+
+ /*
+ * Globals...
+ */
+
+ extern int TcFixMapNames; /* Fix map names to Unicode names */
+ extern int TcStrictUtf8; /* Non-shortest-form is illegal */
+ extern int TcStrictUtf16; /* Invalid surrogate pair is illegal */
+ extern int TcStrictUtf32; /* Greater than 0x10FFFF is illegal */
+ extern int TcRequireBOM; /* Require BOM for little/big-endian */
+ extern int TcSupportBOM; /* Support BOM for little/big-endian */
+ extern int TcSupport8859; /* Support ISO 8859-x repertoires */
+ extern int TcSupportWin; /* Support Windows-x repertoires */
+ extern int TcSupportCJK; /* Support CJK (Asian) repertoires */
+
+ /*
+ * Prototypes...
+ */
+
+ /*
+ * Utility functions for character set maps
+ */
+ extern void *cupsCharmapGet(const cups_encoding_t encoding);
+ /* I - Encoding */
+ extern void cupsCharmapFree(const cups_encoding_t encoding);
+ /* I - Encoding */
+ extern void cupsCharmapFlush(void);
+
+ /*
+ * Convert UTF-8 to and from legacy character set
+ */
+ extern int cupsUtf8ToCharset(char *dest, /* O - Target string */
+ const utf8_t *src, /* I - Source string */
+ const int maxout, /* I - Max output */
+ cups_encoding_t encoding); /* I - Encoding */
+ extern int cupsCharsetToUtf8(utf8_t *dest, /* O - Target string */
+ const char *src, /* I - Source string */
+ const int maxout, /* I - Max output */
+ cups_encoding_t encoding); /* I - Encoding */
+
+ /*
+ * Convert UTF-8 to and from UTF-16
+ */
+ extern int cupsUtf8ToUtf16(utf16_t *dest, /* O - Target string */
+ const utf8_t *src, /* I - Source string */
+ const int maxout); /* I - Max output */
+ extern int cupsUtf16ToUtf8(utf8_t *dest, /* O - Target string */
+ const utf16_t *src, /* I - Source string */
+ const int maxout); /* I - Max output */
+
+ /*
+ * Convert UTF-8 to and from UTF-32
+ */
+ extern int cupsUtf8ToUtf32(utf32_t *dest, /* O - Target string */
+ const utf8_t *src, /* I - Source string */
+ const int maxout); /* I - Max output */
+ extern int cupsUtf32ToUtf8(utf8_t *dest, /* O - Target string */
+ const utf32_t *src, /* I - Source string */
+ const int maxout); /* I - Max output */
+
+ /*
+ * Convert UTF-16 to and from UTF-32
+ */
+ extern int cupsUtf16ToUtf32(utf32_t *dest, /* O - Target string */
+ const utf16_t *src, /* I - Source string */
+ const int maxout); /* I - Max output */
+ extern int cupsUtf32ToUtf16(utf16_t *dest, /* O - Target string */
+ const utf32_t *src, /* I - Source string */
+ const int maxout); /* I - Max output */
+
+ # ifdef __cplusplus
+ }
+ # endif /* __cplusplus */
+
+ #endif /* !_CUPS_TRANSCODE_H_ */
+
+ /*
+ * End of "$Id: i18n_sdd.txt 2678 2002-08-19 01:15:26Z mike $"
+ */
+
+
+
+ 3.1.1.1. cups_cmap_t - SBCS Charmap Structure
+
+ typedef struct cups_cmap_str /**** SBCS Charmap Cache Structure ****/
+ {
+ struct cups_cmap_str *next; /* Next charset map in cache */
+ int used; /* Number of times entry used */
+ cups_encoding_t encoding; /* Legacy charset encoding */
+ ucs2_t char2uni[256]; /* Map Legacy SBCS -> UCS-2 */
+ sbcs_t *uni2char[256]; /* Map UCS-2 -> Legacy SBCS */
+ } cups_cmap_t;
+
+ 'char2uni[]' is a (complete) array of UCS-2 values that supports direct
+ one-level lookup from an input SBCS legacy charset code point, for use
+ by 'cupsCharsetToUtf8()'.
+
+ 'uni2char[]' is a (sparse) array of pointers to arrays of (256 each)
+ SBCS values, that supports direct two-level lookup from an input UCS-2
+ code point, for use by 'cupsUtf8ToCharset()'.
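+
+ The two lookups described above can be sketched as follows. This is
+ only an illustration under stated assumptions, not the CUPS
+ implementation: 'demo_cmap_t', 'demo_cmap_add()', 'demo_to_unicode()'
+ and 'demo_to_legacy()' are hypothetical names, and the convention that
+ a zero entry in a row means "unmapped" is an assumption.

```c
#include <stdlib.h>

typedef unsigned short ucs2_t;   /* UCS-2 code unit, as in transcode.h */
typedef unsigned char  sbcs_t;   /* SBCS legacy code unit, as in transcode.h */

/* Hypothetical cut-down map: only the two lookup tables of cups_cmap_t. */
typedef struct
{
  ucs2_t  char2uni[256];         /* Legacy SBCS -> UCS-2 (complete) */
  sbcs_t *uni2char[256];         /* UCS-2 high byte -> row of 256 SBCS values */
} demo_cmap_t;

/* Hypothetical helper: record one pair in both directions; 0 or -1. */
int demo_cmap_add(demo_cmap_t *map, sbcs_t legacy, ucs2_t unichar)
{
  sbcs_t **row = &map->uni2char[(unichar >> 8) & 0xff];

  /* Rows of the sparse reverse map are allocated only when first needed */
  if (*row == NULL && (*row = calloc(256, sizeof(sbcs_t))) == NULL)
    return (-1);

  map->char2uni[legacy]  = unichar;
  (*row)[unichar & 0xff] = legacy;
  return (0);
}

/* Legacy -> UCS-2: direct one-level lookup. */
ucs2_t demo_to_unicode(const demo_cmap_t *map, sbcs_t legacy)
{
  return (map->char2uni[legacy]);
}

/* UCS-2 -> legacy: two-level lookup; 'missing' is returned if unmapped. */
sbcs_t demo_to_legacy(const demo_cmap_t *map, ucs2_t unichar, sbcs_t missing)
{
  const sbcs_t *row = map->uni2char[(unichar >> 8) & 0xff];
  sbcs_t        legacy;

  if (row == NULL)
    return (missing);            /* whole row absent from the sparse map */

  legacy = row[unichar & 0xff];

  /* Assumption: a zero entry means "unmapped", except for U+0000 itself */
  return ((legacy == 0 && unichar != 0) ? missing : legacy);
}
```

+ For example, after adding the ISO 8859-15 pair (0xA4, U+20AC),
+ 'demo_to_legacy()' finds the row for high byte 0x20 and then indexes
+ it with low byte 0xAC; no searching is involved in either direction.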
+
+
+
+ 3.1.1.2. cups_dmap_t - DBCS Charmap Structure
+
+ typedef struct cups_dmap_str /**** DBCS Charmap Cache Structure ****/
+ {
+ struct cups_dmap_str *next; /* Next charset map in cache */
+ int used; /* Number of times entry used */
+ cups_encoding_t encoding; /* Legacy charset encoding */
+ ucs2_t *char2uni[256]; /* Map Legacy DBCS -> UCS-2 */
+ dbcs_t *uni2char[256]; /* Map UCS-2 -> Legacy DBCS */
+ } cups_dmap_t;
+
+ 'char2uni[]' is a (sparse) array of pointers to arrays of (256 each)
+ UCS-2 values that supports direct two-level lookup from an input DBCS
+ legacy charset code point, for (future) use by 'cupsCharsetToUtf8()'.
+
+ 'uni2char[]' is a (sparse) array of pointers to arrays of (256 each)
+ DBCS values, that supports direct two-level lookup from an input UCS-2
+ code point, for (future) use by 'cupsUtf8ToCharset()'.
+
+
+
+ 3.1.2. transcode.c - Transcoding module
+
+ All of the transcoding functions are modelled on the C standard library
+ function 'strncpy()', except that they return the count of output, like
+ 'strlen()', rather than the (redundant) pointer to the output.
+
+ If the transcoding functions detect invalid input parameters or they
+ detect an encoding error in their input, then they return '-1', rather
+ than the count of output.
+
+ All of the transcoding functions take an input parameter indicating the
+ maximum output units (for safe operation). The functions that return
+ 16-bit (UTF-16) or 32-bit (UTF-32/UCS-4) output always return the output
+ string count (not including the final null) and NOT the memory size in
+ bytes.
+
+
+
+ 3.1.2.1. cupsUtf8ToCharset()
+
+ extern int cupsUtf8ToCharset(char *dest, /* O - Target string */
+ const utf8_t *src, /* I - Source string */
+ const int maxout, /* I - Max output */
+ cups_encoding_t encoding); /* I - Encoding */
+
+ <Find charset map by calling 'cupsCharmapGet()'>
+ <Convert input UTF-8 to internal UCS-4 by calling 'cupsUtf8ToUtf32()'>
+ <Convert internal UCS-4 to legacy charset via charset map>
+ <Release charset map by calling 'cupsCharmapFree()'>
+ <Return length of output legacy charset string -- size in bytes>
+
+
+
+ 3.1.2.2. cupsCharsetToUtf8()
+
+ extern int cupsCharsetToUtf8(utf8_t *dest, /* O - Target string */
+ const char *src, /* I - Source string */
+ const int maxout, /* I - Max output */
+ cups_encoding_t encoding); /* I - Encoding */
+
+ <Find charset map by calling 'cupsCharmapGet()'>
+ <Convert input legacy charset to internal UCS-4 via charset map>
+ <Convert internal UCS-4 to UTF-8 by calling 'cupsUtf32ToUtf8()'>
+ <Release charset map by calling 'cupsCharmapFree()'>
+ <Return length of output UTF-8 string -- size in bytes>
+
+
+
+ 3.1.2.3. cupsUtf8ToUtf16()
+
+ extern int cupsUtf8ToUtf16(utf16_t *dest, /* O - Target string */
+ const utf8_t *src, /* I - Source string */
+ const int maxout); /* I - Max output */
+
+ <...to avoid duplicate code to handle surrogate pairs...>
+ <Convert input UTF-8 to internal UCS-4 by calling 'cupsUtf8ToUtf32()'>
+ <Convert internal UCS-4 to UTF-16 by calling 'cupsUtf32ToUtf16()'>
+ <Return count of output UTF-16 string -- NOT memory size in bytes>
+
+
+
+ 3.1.2.4. cupsUtf16ToUtf8()
+
+ extern int cupsUtf16ToUtf8(utf8_t *dest, /* O - Target string */
+ const utf16_t *src, /* I - Source string */
+ const int maxout); /* I - Max output */
+
+ <...to avoid duplicate code to handle surrogate pairs...>
+ <Convert input UTF-16 to internal UCS-4 by calling 'cupsUtf16ToUtf32()'>
+ <Convert internal UCS-4 to UTF-8 by calling 'cupsUtf32ToUtf8()'>
+ <Return length of output UTF-8 string -- size in bytes>
+
+
+
+ 3.1.2.5. cupsUtf8ToUtf32()
+
+ extern int cupsUtf8ToUtf32(utf32_t *dest, /* O - Target string */
+ const utf8_t *src, /* I - Source string */
+ const int maxout); /* I - Max output */
+
+ <Convert input UTF-8 directly to output UCS-4...>
+ <...checking for valid range, shortest-form, etc.>
+ <Return count of output UTF-32 string -- NOT memory size in bytes>
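+
+ A minimal sketch of the decoding steps above, under stated
+ assumptions: 'demo_utf8_to_utf32()' is a hypothetical name, 'maxout'
+ is assumed to include room for the final null, and output overflow is
+ assumed to be reported as '-1'. The sketch always applies the strict
+ checks and does not consult the 'TcStrictUtf8' global.

```c
#include <stddef.h>

typedef unsigned char utf8_t;    /* UTF-8 code unit, as in transcode.h */
typedef unsigned long utf32_t;   /* UTF-32 code unit, as in transcode.h */

/*
 * Hypothetical sketch: decode null-terminated UTF-8 into UTF-32.
 * Returns the count of output code units (not including the final
 * null), or -1 on bad parameters, malformed input, or overflow.
 */
int demo_utf8_to_utf32(utf32_t *dest, const utf8_t *src, int maxout)
{
  int count = 0;

  if (dest == NULL || src == NULL || maxout < 1)
    return (-1);

  while (*src != 0)
  {
    utf32_t ch = *src++;
    utf32_t min;                 /* Smallest value legal for this length */
    int     more;                /* Continuation bytes still expected */

    if (ch < 0x80)                { more = 0; min = 0; }
    else if ((ch & 0xe0) == 0xc0) { more = 1; min = 0x80;    ch &= 0x1f; }
    else if ((ch & 0xf0) == 0xe0) { more = 2; min = 0x800;   ch &= 0x0f; }
    else if ((ch & 0xf8) == 0xf0) { more = 3; min = 0x10000; ch &= 0x07; }
    else
      return (-1);               /* Stray continuation or invalid lead byte */

    while (more-- > 0)
    {
      if ((*src & 0xc0) != 0x80)
        return (-1);             /* Truncated sequence (also stops at null) */
      ch = (ch << 6) | (*src++ & 0x3f);
    }

    /* Reject non-shortest form, values above 0x10FFFF, and surrogates */
    if (ch < min || ch > 0x10ffff || (ch >= 0xd800 && ch <= 0xdfff))
      return (-1);

    if (count >= maxout - 1)
      return (-1);               /* No room for this char plus final null */

    dest[count++] = ch;
  }

  dest[count] = 0;
  return (count);
}
```

+ The 'min' value per sequence length is what enforces shortest form:
+ the two-byte input 0xC0 0x80 decodes to 0, which is below 0x80, so it
+ is rejected rather than accepted as an alternate encoding of null.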
+
+
+
+ 3.1.2.6. cupsUtf32ToUtf8()
+
+ extern int cupsUtf32ToUtf8(utf8_t *dest, /* O - Target string */
+ const utf32_t *src, /* I - Source string */
+ const int maxout); /* I - Max output */
+
+ <Convert input UCS-4 directly to output UTF-8...>
+ <...checking for valid range, etc.>
+ <Return length of output UTF-8 string -- size in bytes>
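+
+ The encoding direction can be sketched in the same way. Again this is
+ an illustration only: 'demo_utf32_to_utf8()' is a hypothetical name,
+ 'maxout' is assumed to include room for the final null, and overflow
+ is assumed to be reported as '-1'.

```c
#include <stddef.h>

typedef unsigned char utf8_t;    /* UTF-8 code unit, as in transcode.h */
typedef unsigned long utf32_t;   /* UTF-32 code unit, as in transcode.h */

/*
 * Hypothetical sketch: encode null-terminated UTF-32 as UTF-8.
 * Returns the length of the output in bytes (not including the final
 * null), or -1 on bad parameters, invalid input, or overflow.
 */
int demo_utf32_to_utf8(utf8_t *dest, const utf32_t *src, int maxout)
{
  int count = 0;

  if (dest == NULL || src == NULL || maxout < 1)
    return (-1);

  for (; *src != 0; src++)
  {
    utf32_t ch = *src;
    int     len;

    /* Check valid range: nothing above 0x10FFFF, no surrogates */
    if (ch > 0x10ffff || (ch >= 0xd800 && ch <= 0xdfff))
      return (-1);

    len = ch < 0x80 ? 1 : ch < 0x800 ? 2 : ch < 0x10000 ? 3 : 4;

    if (count + len > maxout - 1)
      return (-1);               /* No room for sequence plus final null */

    switch (len)
    {
      case 1 :
          dest[count++] = (utf8_t)ch;
          break;
      case 2 :
          dest[count++] = (utf8_t)(0xc0 | (ch >> 6));
          dest[count++] = (utf8_t)(0x80 | (ch & 0x3f));
          break;
      case 3 :
          dest[count++] = (utf8_t)(0xe0 | (ch >> 12));
          dest[count++] = (utf8_t)(0x80 | ((ch >> 6) & 0x3f));
          dest[count++] = (utf8_t)(0x80 | (ch & 0x3f));
          break;
      default :
          dest[count++] = (utf8_t)(0xf0 | (ch >> 18));
          dest[count++] = (utf8_t)(0x80 | ((ch >> 12) & 0x3f));
          dest[count++] = (utf8_t)(0x80 | ((ch >> 6) & 0x3f));
          dest[count++] = (utf8_t)(0x80 | (ch & 0x3f));
          break;
    }
  }

  dest[count] = 0;
  return (count);
}
```

+ Because the sequence length is chosen from the value itself, the
+ output is always in shortest form.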
+
+
+
+ 3.1.2.7. cupsUtf16ToUtf32()
+
+ extern int cupsUtf16ToUtf32(utf32_t *dest, /* O - Target string */
+ const utf16_t *src, /* I - Source string */
+ const int maxout); /* I - Max output */
+
+ <Convert input UTF-16 directly to output UCS-4...>
+ <...handling surrogate pairs decoding from UTF-16>
+ <Return count of output UTF-32 string -- NOT memory size in bytes>
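+
+ Surrogate pair decoding can be sketched as below. This is a sketch
+ under assumptions, not the CUPS code: 'demo_utf16_to_utf32()' is a
+ hypothetical name, the input is assumed to be in native byte order
+ with no BOM handling, and overflow is assumed to be reported as '-1'.

```c
#include <stddef.h>

typedef unsigned short utf16_t;  /* UTF-16 code unit, as in transcode.h */
typedef unsigned long  utf32_t;  /* UTF-32 code unit, as in transcode.h */

/*
 * Hypothetical sketch: decode null-terminated UTF-16 into UTF-32.
 * Returns the count of output code units (not including the final
 * null), or -1 on bad parameters, unpaired surrogates, or overflow.
 */
int demo_utf16_to_utf32(utf32_t *dest, const utf16_t *src, int maxout)
{
  int count = 0;

  if (dest == NULL || src == NULL || maxout < 1)
    return (-1);

  while (*src != 0)
  {
    utf32_t ch = *src++;

    if (ch >= 0xdc00 && ch <= 0xdfff)
      return (-1);               /* Low surrogate with no high before it */

    if (ch >= 0xd800 && ch <= 0xdbff)
    {
      utf32_t low = *src;        /* May be the final null; rejected below */

      if (low < 0xdc00 || low > 0xdfff)
        return (-1);             /* High surrogate not followed by low */

      src++;
      ch = 0x10000 + ((ch - 0xd800) << 10) + (low - 0xdc00);
    }

    if (count >= maxout - 1)
      return (-1);               /* No room for this char plus final null */

    dest[count++] = ch;
  }

  dest[count] = 0;
  return (count);
}
```

+ Each pair contributes 10 bits from the high surrogate and 10 bits from
+ the low surrogate, offset by 0x10000, so two input units produce one
+ output unit and the returned count can be smaller than the input count.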
+
+
+
+ 3.1.2.8. cupsUtf32ToUtf16()
+
+ extern int cupsUtf32ToUtf16(utf16_t *dest, /* O - Target string */
+ const utf32_t *src, /* I - Source string */
+ const int maxout); /* I - Max output */
+
+ <Convert input UCS-4 directly to output UTF-16...>
+ <...handling surrogate pairs encoding to UTF-16>
+ <Return count of output UTF-16 string -- NOT memory size in bytes>
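+
+ Surrogate pair encoding is the inverse operation. The sketch below
+ makes the same assumptions as the other sketches in this section:
+ 'demo_utf32_to_utf16()' is a hypothetical name, no BOM is written,
+ 'maxout' is assumed to include room for the final null, and overflow
+ is assumed to be reported as '-1'.

```c
#include <stddef.h>

typedef unsigned short utf16_t;  /* UTF-16 code unit, as in transcode.h */
typedef unsigned long  utf32_t;  /* UTF-32 code unit, as in transcode.h */

/*
 * Hypothetical sketch: encode null-terminated UTF-32 as UTF-16.
 * Returns the count of output code units (not including the final
 * null), or -1 on bad parameters, invalid input, or overflow.
 */
int demo_utf32_to_utf16(utf16_t *dest, const utf32_t *src, int maxout)
{
  int count = 0;

  if (dest == NULL || src == NULL || maxout < 1)
    return (-1);

  for (; *src != 0; src++)
  {
    utf32_t ch = *src;
    int     len;

    /* Check valid range: nothing above 0x10FFFF, no surrogates */
    if (ch > 0x10ffff || (ch >= 0xd800 && ch <= 0xdfff))
      return (-1);

    len = ch < 0x10000 ? 1 : 2;

    if (count + len > maxout - 1)
      return (-1);               /* No room for unit(s) plus final null */

    if (len == 1)
      dest[count++] = (utf16_t)ch;
    else
    {
      ch -= 0x10000;             /* 20 bits remain, split 10 + 10 */
      dest[count++] = (utf16_t)(0xd800 | (ch >> 10));
      dest[count++] = (utf16_t)(0xdc00 | (ch & 0x3ff));
    }
  }

  dest[count] = 0;
  return (count);
}
```

+ For U+1F600 the sketch writes the pair 0xD83D 0xDE00, so the returned
+ count is in 16-bit units and is NOT the memory size in bytes.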
+
+
+
+ 3.1.2.9. Transcoding Utility Functions
+
+ The transcoding utility functions are used to load (from a file into
+ memory), free (logically, without freeing memory), and flush (actually
+ free memory) character maps for SBCS (single-byte character set) and
+ (future) DBCS (double-byte character set) transcoding to and from UTF-8.
+
+
+
+
+
+
+
+ 3.1.2.9.1. cupsCharmapGet()
+
+ extern void *cupsCharmapGet(const cups_encoding_t encoding);
+ /* I - Encoding */
+
+ <Find SBCS or DBCS charset map in cache>
+ <...If found, increment 'used'>
+ <...and return pointer to SBCS or DBCS charset map>
+ <Get charset map file name by calling 'cupsEncodingName()'>
+ <Open charset map file>
+ <...If not found, return NULL>
+ <Allocate memory for SBCS or DBCS charset map in cache>
+ <...If no memory, return NULL>
+ <Add to SBCS or DBCS cache by assigning 'next' field>
+ <Assign 'encoding' field>
+ <Increment 'used' field>
+ <Read charset map file into memory in loop...>
+ <If SBCS, then 'char2uni[]' is an array of 'ucs2_t' values>
+ <...and 'uni2char[]' is an array of pointers to 'sbcs_t' arrays>
+ <If DBCS, then 'char2uni[]' is an array of pointers to 'ucs2_t' arrays>
+ <...and 'uni2char[]' is an array of pointers to 'dbcs_t' arrays>
+ <Close charset map file>
+ <Return pointer to SBCS or DBCS charset map>
+
+
+
+ 3.1.2.9.2. cupsCharmapFree()
+
+ extern void cupsCharmapFree(const cups_encoding_t encoding);
+ /* I - Encoding */
+
+ <Find SBCS or DBCS charset map in cache>
+ <...If found, decrement 'used'>
+ <Return void>
+
+
+
+ 3.1.2.9.3. cupsCharmapFlush()
+
+ extern void cupsCharmapFlush(void);
+
+ <Loop through SBCS charset map cache...>
+ <...Free 'uni2char[]' memory>
+ <...Free SBCS charset map memory>
+ <Loop through DBCS charset map cache...>
+ <...Free 'char2uni[]' memory>
+ <...Free 'uni2char[]' memory>
+ <...Free DBCS charset map memory>
+ <Return void>
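+
+ The get / free / flush life cycle of the three utility functions can
+ be sketched as a linked-list cache with a usage count. This is an
+ illustration under assumptions: all 'demo_' names are hypothetical,
+ 'int' stands in for 'cups_encoding_t', and reading the map file and
+ the lookup tables are omitted.

```c
#include <stdlib.h>

typedef struct demo_map_str          /* Hypothetical cut-down cache entry */
{
  struct demo_map_str *next;         /* Next map in cache */
  int                  used;         /* Number of times entry used */
  int                  encoding;     /* Stand-in for cups_encoding_t */
} demo_map_t;

static demo_map_t *demo_cache = NULL; /* Head of the cache list */

/* Find a map in the cache or create it; returns NULL if out of memory. */
demo_map_t *demo_map_get(int encoding)
{
  demo_map_t *map;

  for (map = demo_cache; map != NULL; map = map->next)
    if (map->encoding == encoding)
    {
      map->used++;
      return (map);
    }

  /* Not cached: allocate a new entry (map file loading omitted here) */
  if ((map = calloc(1, sizeof(demo_map_t))) == NULL)
    return (NULL);

  map->next     = demo_cache;
  map->used     = 1;
  map->encoding = encoding;
  demo_cache    = map;

  return (map);
}

/* Logical free: decrement the usage count, keep the memory. */
void demo_map_free(int encoding)
{
  demo_map_t *map;

  for (map = demo_cache; map != NULL; map = map->next)
    if (map->encoding == encoding)
    {
      if (map->used > 0)
        map->used--;
      return;
    }
}

/* Flush: actually release the memory of every cached map. */
void demo_map_flush(void)
{
  while (demo_cache != NULL)
  {
    demo_map_t *next = demo_cache->next;

    free(demo_cache);
    demo_cache = next;
  }
}
```

+ Keeping the memory after a logical free means a later get for the same
+ encoding does not have to read the map file again; only the flush
+ releases it.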
+
+
+
+
+
+
+ 3.2. Normalization - New
+
+
+
+ 3.2.1. normalize.h - Normalization header
+
+ /*
+ * "$Id: i18n_sdd.txt 2678 2002-08-19 01:15:26Z mike $"
+ *
+ * Unicode normalization for the Common UNIX Printing System (CUPS).
+ *
+ * Copyright 1997-2002 by Easy Software Products.
+ *
+ * These coded instructions, statements, and computer programs are
+ * the property of Easy Software Products and are protected by Federal
+ * copyright law. Distribution and use rights are outlined in the
+ * file "LICENSE.txt" which should have been included with this file.
+ * If this file is missing or damaged please contact Easy Software
+ * Products at:
+ *
+ * Attn: CUPS Licensing Information
+ * Easy Software Products
+ * 44141 Airport View Drive, Suite 204
+ * Hollywood, Maryland 20636-3111 USA
+ *
+ * Voice: (301) 373-9603
+ * EMail: cups-info@cups.org
+ * WWW: http://www.cups.org
+ */
+
+ #ifndef _CUPS_NORMALIZE_H_
+ # define _CUPS_NORMALIZE_H_
+
+ /*
+ * Include necessary headers...
+ */
+
+ # include "transcode.h"
+
+ # ifdef __cplusplus
+ extern "C" {
+ # endif /* __cplusplus */
+
+ /*
+ * Types...
+ */
+
+ typedef enum /**** Normalization Types ****/
+ {
+ CUPS_NORM_NFD, /* Canonical Decomposition */
+ CUPS_NORM_NFKD, /* Compatibility Decomposition */
+ CUPS_NORM_NFC, /* NFD, then Canonical Composition */
+ CUPS_NORM_NFKC /* NFKD, then Canonical Composition */
+ } cups_normalize_t;
+
+ typedef enum /**** Case Folding Types ****/
+ {
+ CUPS_FOLD_SIMPLE, /* Simple - no expansion in size */
+ CUPS_FOLD_FULL /* Full - possible expansion in size */
+ } cups_folding_t;
+
+ typedef enum /**** Unicode Char Property Types ****/
+ {
+ CUPS_PROP_GENERAL_CATEGORY, /* See 'cups_gencat_t' enum */
+ CUPS_PROP_BIDI_CATEGORY, /* See 'cups_bidicat_t' enum */
+ CUPS_PROP_COMBINING_CLASS, /* See 'cups_combclass_t' type */
+ CUPS_PROP_BREAK_CLASS /* See 'cups_breakclass_t' enum */
+ } cups_property_t;
+
+ /*
+ * Note - parse Unicode char general category from 'UnicodeData.txt'
+ * into sparse local table in 'normalize.c'.
+ * Use major classes for logic optimizations throughout (by mask).
+ */
+
+ typedef enum /**** Unicode General Category ****/
+ {
+ CUPS_GENCAT_L = 0x10, /* Letter major class */
+ CUPS_GENCAT_LU = 0x11, /* Lu Letter, Uppercase */
+ CUPS_GENCAT_LL = 0x12, /* Ll Letter, Lowercase */
+ CUPS_GENCAT_LT = 0x13, /* Lt Letter, Titlecase */
+ CUPS_GENCAT_LM = 0x14, /* Lm Letter, Modifier */
+ CUPS_GENCAT_LO = 0x15, /* Lo Letter, Other */
+ CUPS_GENCAT_M = 0x20, /* Mark major class */
+ CUPS_GENCAT_MN = 0x21, /* Mn Mark, Non-Spacing */
+ CUPS_GENCAT_MC = 0x22, /* Mc Mark, Spacing Combining */
+ CUPS_GENCAT_ME = 0x23, /* Me Mark, Enclosing */
+ CUPS_GENCAT_N = 0x30, /* Number major class */
+ CUPS_GENCAT_ND = 0x31, /* Nd Number, Decimal Digit */
+ CUPS_GENCAT_NL = 0x32, /* Nl Number, Letter */
+ CUPS_GENCAT_NO = 0x33, /* No Number, Other */
+ CUPS_GENCAT_P = 0x40, /* Punctuation major class */
+ CUPS_GENCAT_PC = 0x41, /* Pc Punctuation, Connector */
+ CUPS_GENCAT_PD = 0x42, /* Pd Punctuation, Dash */
+ CUPS_GENCAT_PS = 0x43, /* Ps Punctuation, Open (start) */
+ CUPS_GENCAT_PE = 0x44, /* Pe Punctuation, Close (end) */
+ CUPS_GENCAT_PI = 0x45, /* Pi Punctuation, Initial Quote */
+ CUPS_GENCAT_PF = 0x46, /* Pf Punctuation, Final Quote */
+ CUPS_GENCAT_PO = 0x47, /* Po Punctuation, Other */
+ CUPS_GENCAT_S = 0x50, /* Symbol major class */
+ CUPS_GENCAT_SM = 0x51, /* Sm Symbol, Math */
+ CUPS_GENCAT_SC = 0x52, /* Sc Symbol, Currency */
+ CUPS_GENCAT_SK = 0x53, /* Sk Symbol, Modifier */
+ CUPS_GENCAT_SO = 0x54, /* So Symbol, Other */
+ CUPS_GENCAT_Z = 0x60, /* Separator major class */
+ CUPS_GENCAT_ZS = 0x61, /* Zs Separator, Space */
+ CUPS_GENCAT_ZL = 0x62, /* Zl Separator, Line */
+ CUPS_GENCAT_ZP = 0x63, /* Zp Separator, Paragraph */
+ CUPS_GENCAT_C = 0x70, /* Other (miscellaneous) major class */
+ CUPS_GENCAT_CC = 0x71, /* Cc Other, Control */
+ CUPS_GENCAT_CF = 0x72, /* Cf Other, Format */
+ CUPS_GENCAT_CS = 0x73, /* Cs Other, Surrogate */
+ CUPS_GENCAT_CO = 0x74, /* Co Other, Private Use */
+ CUPS_GENCAT_CN = 0x75 /* Cn Other, Not Assigned */
+ } cups_gencat_t;
+
+ /*
+ * Note - parse Unicode char bidi category from 'UnicodeData.txt'
+ * into sparse local table in 'normalize.c'.
+ * Add bidirectional support to 'textcommon.c' - per Mike
+ */
+
+ typedef enum /**** Unicode Bidi Category ****/
+ {
+ CUPS_BIDI_L, /* Left-to-Right (Alpha, Syllabic, Ideographic) */
+ CUPS_BIDI_LRE, /* Left-to-Right Embedding (explicit) */
+ CUPS_BIDI_LRO, /* Left-to-Right Override (explicit) */
+ CUPS_BIDI_R, /* Right-to-Left (Hebrew alphabet and most punct) */
+ CUPS_BIDI_AL, /* Right-to-Left Arabic (Arabic, Thaana, Syriac) */
+ CUPS_BIDI_RLE, /* Right-to-Left Embedding (explicit) */
+ CUPS_BIDI_RLO, /* Right-to-Left Override (explicit) */
+ CUPS_BIDI_PDF, /* Pop Directional Format */
+ CUPS_BIDI_EN, /* Euro Number (Euro and East Arabic-Indic digits) */
+ CUPS_BIDI_ES, /* Euro Number Separator (Slash) */
+ CUPS_BIDI_ET, /* Euro Number Terminator (Plus, Minus, Degree, etc) */
+ CUPS_BIDI_AN, /* Arabic Number (Arabic-Indic digits, separators) */
+ CUPS_BIDI_CS, /* Common Number Separator (Colon, Comma, Dot, etc) */
+ CUPS_BIDI_NSM, /* Non-Spacing Mark (category Mn / Me in UCD) */
+ CUPS_BIDI_BN, /* Boundary Neutral (Formatting / Control chars) */
+ CUPS_BIDI_B, /* Paragraph Separator */
+ CUPS_BIDI_S, /* Segment Separator (Tab) */
+ CUPS_BIDI_WS, /* Whitespace Space (Space, Line Separator, etc) */
+ CUPS_BIDI_ON /* Other Neutrals */
+ } cups_bidicat_t;
+
+ /*
+ * Note - parse Unicode line break class from 'DerivedLineBreak.txt'
+ * into sparse local table (list of class ranges) in 'normalize.c'.
+ * Note - add state table from UAX-14, section 7.3 - Ira
+ * Remember to do BK and SP in outer loop (not in state table).
+ * Consider optimization for CM (combining mark).
+ * See 'LineBreak.txt' (12,875) and 'DerivedLineBreak.txt' (1,350).
+ */
+
+ typedef enum /**** Unicode Line Break Class ****/
+ {
+ /*
+ * (A) - Allow Break AFTER
+ * (XA) - Prevent Break AFTER
+ * (B) - Allow Break BEFORE
+ * (XB) - Prevent Break BEFORE
+ * (P) - Allow Break For Pair
+ * (XP) - Prevent Break For Pair
+ */
+ CUPS_BREAK_AI, /* Ambiguous (Alphabetic or Ideograph) */
+ CUPS_BREAK_AL, /* Ordinary Alphabetic / Symbol Chars (XP) */
+ CUPS_BREAK_BA, /* Break Opportunity After Chars (A) */
+ CUPS_BREAK_BB, /* Break Opportunities Before Chars (B) */
+ CUPS_BREAK_B2, /* Break Opportunity Before / After (B/A/XP) */
+ CUPS_BREAK_BK, /* Mandatory Break (A) (normative) */
+ CUPS_BREAK_CB, /* Contingent Break (B/A) (normative) */
+ CUPS_BREAK_CL, /* Closing Punctuation (XB) */
+ CUPS_BREAK_CM, /* Attached Chars / Combining (XB) (normative) */
+ CUPS_BREAK_CR, /* Carriage Return (A) (normative) */
+ CUPS_BREAK_EX, /* Exclamation / Interrogation (XB) */
+ CUPS_BREAK_GL, /* Non-breaking ("Glue") (XB/XA) (normative) */
+ CUPS_BREAK_HY, /* Hyphen (XA) */
+ CUPS_BREAK_ID, /* Ideographic (B/A) */
+ CUPS_BREAK_IN, /* Inseparable chars (XP) */
+ CUPS_BREAK_IS, /* Numeric Separator (Infix) (XB) */
+ CUPS_BREAK_LF, /* Line Feed (A) (normative) */
+ CUPS_BREAK_NS, /* Non-starters (XB) */
+ CUPS_BREAK_NU, /* Numeric (XP) */
+ CUPS_BREAK_OP, /* Opening Punctuation (XA) */
+ CUPS_BREAK_PO, /* Postfix (Numeric) (XB) */
+ CUPS_BREAK_PR, /* Prefix (Numeric) (XA) */
+ CUPS_BREAK_QU, /* Ambiguous Quotation (XB/XA) */
+ CUPS_BREAK_SA, /* Context Dependent (South East Asian) (P) */
+ CUPS_BREAK_SG, /* Surrogates (XP) (normative) */
+ CUPS_BREAK_SP, /* Space (A) (normative) */
+ CUPS_BREAK_SY, /* Symbols Allowing Break After (A) */
+ CUPS_BREAK_XX, /* Unknown (XP) */
+ CUPS_BREAK_ZW /* Zero Width Space (A) (normative) */
+ } cups_breakclass_t;
+
+ typedef int cups_combclass_t; /**** Unicode Combining Class ****/
+ /* 0=base / 1..254=combining char */
+
+ /*
+ * Structures...
+ */
+
+ typedef struct cups_normmap_str /**** Normalize Map Cache Struct ****/
+ {
+ struct cups_normmap_str *next; /* Next normalize in cache */
+ int used; /* Number of times entry used */
+ cups_normalize_t normalize; /* Normalization type */
+ int normcount; /* Count of Source Chars */
+ ucs2_t *uni2norm; /* Char -> Normalization */
+ /* ...only supports UCS-2 */
+ } cups_normmap_t;
+
+ typedef struct cups_foldmap_str /**** Case Fold Map Cache Struct ****/
+ {
+ struct cups_foldmap_str *next; /* Next case fold in cache */
+ int used; /* Number of times entry used */
+ cups_folding_t fold; /* Case folding type */
+ int foldcount; /* Count of Source Chars */
+ ucs2_t *uni2fold; /* Char -> Folded Char(s) */
+ /* ...only supports UCS-2 */
+ } cups_foldmap_t;
+
+ typedef struct cups_prop_str /**** Char Property Struct ****/
+ {
+ ucs2_t ch; /* Unicode Char as UCS-2 */
+ unsigned char gencat; /* General Category */
+ unsigned char bidicat; /* Bidirectional Category */
+ } cups_prop_t;
+
+ typedef struct /**** Char Property Map Struct ****/
+ {
+ int used; /* Number of times entry used */
+ int propcount; /* Count of Source Chars */
+ cups_prop_t *uni2prop; /* Char -> Properties */
+ } cups_propmap_t;
+
+ typedef struct /**** Line Break Class Map Struct ****/
+ {
+ int used; /* Number of times entry used */
+ int breakcount; /* Count of Source Chars */
+ ucs2_t *uni2break; /* Char -> Line Break Class */
+ } cups_breakmap_t;
+
+ typedef struct cups_comb_str /**** Char Combining Class Struct ****/
+ {
+ ucs2_t ch; /* Unicode Char as UCS-2 */
+ unsigned char combclass; /* Combining Class */
+ unsigned char reserved; /* Reserved for alignment */
+ } cups_comb_t;
+
+ typedef struct /**** Combining Class Map Struct ****/
+ {
+ int used; /* Number of times entry used */
+ int combcount; /* Count of Source Chars */
+ cups_comb_t *uni2comb; /* Char -> Combining Class */
+ } cups_combmap_t;
+
+ /*
+ * Globals...
+ */
+
+ extern int NzSupportUcs2; /* Support UCS-2 (16-bit) mapping */
+ extern int NzSupportUcs4; /* Support UCS-4 (32-bit) mapping */
+
+ /*
+ * Prototypes...
+ */
+
+ /*
+ * Utility functions for normalization module
+ */
+ extern int cupsNormalizeMapsGet(void);
+ extern int cupsNormalizeMapsFree(void);
+ extern void cupsNormalizeMapsFlush(void);
+
+ /*
+ * Normalize UTF-8 string to Unicode UAX-15 Normalization Form
+ * Note - Compatibility Normalization Forms (NFKD/NFKC) are
+ * unsafe for subsequent transcoding to legacy charsets
+ */
+ extern int cupsUtf8Normalize(utf8_t *dest, /* O - Target string */
+ const utf8_t *src, /* I - Source string */
+ const int maxout, /* I - Max output */
+ const cups_normalize_t normalize);
+ /* I - Normalization */
+
+ /*
+ * Normalize UTF-32 string to Unicode UAX-15 Normalization Form
+ * Note - Compatibility Normalization Forms (NFKD/NFKC) are
+ * unsafe for subsequent transcoding to legacy charsets
+ */
+ extern int cupsUtf32Normalize(utf32_t *dest,
+ /* O - Target string */
+ const utf32_t *src, /* I - Source string */
+ const int maxout, /* I - Max output */
+ const cups_normalize_t normalize);
+ /* I - Normalization */
+
+ /*
+ * Case Fold UTF-8 string per Unicode UAX-21 Section 2.3
+ * Note - Case folding output is
+ * unsafe for subsequent transcoding to legacy charsets
+ */
+ extern int cupsUtf8CaseFold(utf8_t *dest, /* O - Target string */
+ const utf8_t *src, /* I - Source string */
+ const int maxout, /* I - Max output */
+ const cups_folding_t fold); /* I - Fold Mode */
+
+ /*
+ * Case Fold UTF-32 string per Unicode UAX-21 Section 2.3
+ * Note - Case folding output is
+ * unsafe for subsequent transcoding to legacy charsets
+ */
+ extern int cupsUtf32CaseFold(utf32_t *dest,/* O - Target string */
+ const utf32_t *src, /* I - Source string */
+ const int maxout, /* I - Max output */
+ const cups_folding_t fold); /* I - Fold Mode */
+
+ /*
+ * Compare UTF-8 strings after case folding
+ */
+ extern int cupsUtf8CompareCaseless(const utf8_t *s1,
+ /* I - String1 */
+ const utf8_t *s2); /* I - String2 */
+
+ /*
+ * Compare UTF-32 strings after case folding
+ */
+ extern int cupsUtf32CompareCaseless(const utf32_t *s1,
+ /* I - String1 */
+ const utf32_t *s2); /* I - String2 */
+
+ /*
+ * Compare UTF-8 strings after case folding and NFKC normalization
+ */
+ extern int cupsUtf8CompareIdentifier(const utf8_t *s1,
+ /* I - String1 */
+ const utf8_t *s2); /* I - String2 */
+
+ /*
+ * Compare UTF-32 strings after case folding and NFKC normalization
+ */
+ extern int cupsUtf32CompareIdentifier(const utf32_t *s1,
+ /* I - String1 */
+ const utf32_t *s2); /* I - String2 */
+
+ /*
+ * Get UTF-32 character property
+ */
+ extern int cupsUtf32CharacterProperty(const utf32_t ch,
+ /* I - Source char */
+ const cups_property_t property);
+ /* I - Char Property */
+
+ # ifdef __cplusplus
+ }
+ # endif /* __cplusplus */
+
+ #endif /* !_CUPS_NORMALIZE_H_ */
+
+ /*
+ * End of "$Id: i18n_sdd.txt 2678 2002-08-19 01:15:26Z mike $"
+ */
+
+
+
+ 3.2.1.1. cups_normmap_t - Normalize Map Structure
+
+ typedef struct cups_normmap_str /**** Normalize Map Cache Struct ****/
+ {
+ struct cups_normmap_str *next; /* Next normalize in cache */
+ int used; /* Number of times entry used */
+ cups_normalize_t normalize; /* Normalization type */
+ int normcount; /* Count of Source Chars */
+ ucs2_t *uni2norm; /* Char -> Normalization */
+ /* ...only supports UCS-2 */
+ } cups_normmap_t;
+
+ 'uni2norm' is a pointer to an array of _triplets_ of UCS-2 values.
+ 'normcount' is a count of _triplets_ in the 'uni2norm[]' array.
+
+ For decompositions (NFD and NFKD), the triplets are: composed base
+ character, decomposed base character, and decomposed accent character.
+ These are used by 'cupsUtf8Normalize()' and 'cupsUtf32Normalize()' in
+ performing canonical (NFD) or compatibility (NFKD) decomposition.
+
+ For compositions (NFC and NFKC), the triplets are: decomposed base
+ character, decomposed accent character, and composed base character.
+ These are used by 'cupsUtf8Normalize()' and 'cupsUtf32Normalize()' in
+ performing canonical composition (for NFC or NFKC).
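+
+ The triplet layout lends itself to a binary search keyed on the first
+ member of each triplet. The following is a minimal sketch only, not
+ the 'normalize.c' implementation: 'decompose_once()' and 'sample_nfd[]'
+ are hypothetical names, 'ucs2_t' is redeclared locally as a stand-in
+ for the 'transcode.h' type, and the table is assumed to be sorted by
+ composed character.
+

```c
#include <stdlib.h>

typedef unsigned short ucs2_t;  /* stand-in for the 'transcode.h' type */

/* Compare search key with the first member (composed char) of a triplet */
static int compare_decompose(const void *key, const void *triplet)
{
  ucs2_t k = *(const ucs2_t *) key;
  ucs2_t c = *(const ucs2_t *) triplet;

  return ((k > c) - (k < c));
}

/*
 * Hypothetical helper: one decomposition step for 'ch'.
 * Returns 1 and stores base and accent if a triplet matches, else 0.
 */
int decompose_once(const ucs2_t *uni2norm, int normcount, ucs2_t ch,
                   ucs2_t *base, ucs2_t *accent)
{
  const ucs2_t *hit;

  hit = bsearch(&ch, uni2norm, (size_t) normcount, 3 * sizeof(ucs2_t),
                compare_decompose);
  if (hit == NULL)
    return (0);

  *base   = hit[1];
  *accent = hit[2];
  return (1);
}

/* Illustrative NFD triplets, sorted by composed character */
const ucs2_t sample_nfd[] =
{
  0x00C0, 0x0041, 0x0300,  /* A grave -> A, combining grave */
  0x00C9, 0x0045, 0x0301,  /* E acute -> E, combining acute */
  0x00E9, 0x0065, 0x0301   /* e acute -> e, combining acute */
};
```

+
+ A caller would apply such a step repeatedly until a pass finds no
+ further decomposition, as described for 'cupsUtf32Normalize()'.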
+
+
+
+ 3.2.1.2. cups_foldmap_t - Case Fold Map Structure
+
+ typedef struct cups_foldmap_str /**** Case Fold Map Cache Struct ****/
+ {
+ struct cups_foldmap_str *next; /* Next case fold in cache */
+ int used; /* Number of times entry used */
+ cups_folding_t fold; /* Case folding type */
+ int foldcount; /* Count of Source Chars */
+ ucs2_t *uni2fold; /* Char -> Folded Char(s) */
+ /* ...only supports UCS-2 */
+ } cups_foldmap_t;
+
+ 'uni2fold' is a pointer to an array of _quadruplets_ of UCS-2 values.
+ 'foldcount' is a count of _quadruplets_ in the 'uni2fold[]' array.
+
+ For simple case folding (without expansion of the size of the output
+ string), the quadruplets are: input base character, output case folded
+ character, zero (unused), and zero (unused).
+
+ For full case folding (with possible expansion of the size of the output
+ string), the quadruplets are: input base character, output case folded
+ character, second output character or zero, third output character or
+ zero.
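+
+ The quadruplet layout can be read back as sketched below. This is an
+ illustration under stated assumptions, not the 'normalize.c' code:
+ 'fold_char()' and 'sample_fold[]' are hypothetical names, 'ucs2_t' is
+ a local stand-in for the 'transcode.h' type, and the table is assumed
+ to be sorted by input character. A character with no entry is assumed
+ to fold to itself.
+

```c
#include <stdlib.h>

typedef unsigned short ucs2_t;  /* stand-in for the 'transcode.h' type */

/* Compare search key with the first member (input char) of a quadruplet */
static int compare_foldchar(const void *key, const void *quad)
{
  ucs2_t k = *(const ucs2_t *) key;
  ucs2_t c = *(const ucs2_t *) quad;

  return ((k > c) - (k < c));
}

/*
 * Hypothetical helper: case fold one character.
 * Writes 1 to 3 folded characters to 'out' and returns how many.
 */
int fold_char(const ucs2_t *uni2fold, int foldcount, ucs2_t ch,
              ucs2_t out[3])
{
  const ucs2_t *hit;
  int          n;

  hit = bsearch(&ch, uni2fold, (size_t) foldcount, 4 * sizeof(ucs2_t),
                compare_foldchar);
  if (hit == NULL)
  {
    out[0] = ch;                /* no entry: character folds to itself */
    return (1);
  }

  out[0] = hit[1];              /* first output is always present */
  for (n = 1; n < 3 && hit[n + 1] != 0; n ++)
    out[n] = hit[n + 1];        /* optional second and third outputs */
  return (n);
}

/* Illustrative full case folding quadruplets, sorted by input character */
const ucs2_t sample_fold[] =
{
  0x0041, 0x0061, 0x0000, 0x0000,  /* A -> a */
  0x00DF, 0x0073, 0x0073, 0x0000,  /* sharp s -> s s */
  0x0390, 0x03B9, 0x0308, 0x0301   /* U+0390 -> three characters */
};
```

+
+ Because full case folding can emit up to three characters per input
+ character, a caller has to check the remaining output space before
+ each call.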
+
+
+
+ 3.2.1.3. cups_propmap_t - Char Property Map Structure
+
+ typedef struct /**** Char Property Map Struct ****/
+ {
+ int used; /* Number of times entry used */
+ int propcount; /* Count of Source Chars */
+ cups_prop_t *uni2prop; /* Char -> Properties */
+ } cups_propmap_t;
+
+ 'uni2prop' is a pointer to an array of 'cups_prop_t' (see below).
+ 'propcount' is a count of elements in the 'uni2prop[]' array.
+
+
+
+ 3.2.1.4. cups_prop_t - Char Property Structure
+
+ typedef struct cups_prop_str /**** Char Property Struct ****/
+ {
+ ucs2_t ch; /* Unicode Char as UCS-2 */
+ unsigned char gencat; /* General Category */
+ unsigned char bidicat; /* Bidirectional Category */
+ } cups_prop_t;
+
+
+
+ 3.2.1.5. cups_breakmap_t - Line Break Map Structure
+
+ typedef struct /**** Line Break Class Map Struct ****/
+ {
+ int used; /* Number of times entry used */
+ int breakcount; /* Count of Source Chars */
+ ucs2_t *uni2break; /* Char -> Line Break Class */
+ } cups_breakmap_t;
+
+ 'uni2break' is a pointer to an array of _triplets_ of UCS-2 values.
+ 'breakcount' is a count of _triplets_ in the 'uni2break[]' array.
+
+ The triplets in 'uni2break' are: first UCS-2 value in a range, last
+ UCS-2 value in a range, and line break class stored as UCS-2.
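+
+ A range table of this shape can be searched with 'bsearch()' by using
+ a comparison that treats each triplet as a closed interval. The sketch
+ below is illustrative only: 'break_class()', 'compare_breakrange()',
+ the 'SAMPLE_BREAK_...' values, and 'sample_break[]' are hypothetical,
+ and the ranges are assumed to be sorted and non-overlapping.
+

```c
#include <stdlib.h>

typedef unsigned short ucs2_t;  /* stand-in for the 'transcode.h' type */

/* Hypothetical class values (the real ones are in 'cups_breakclass_t') */
enum { SAMPLE_BREAK_AL, SAMPLE_BREAK_NU, SAMPLE_BREAK_SP, SAMPLE_BREAK_XX };

/* Compare search key with the [first, last] range of a triplet */
static int compare_breakrange(const void *key, const void *triplet)
{
  ucs2_t       k = *(const ucs2_t *) key;
  const ucs2_t *r = (const ucs2_t *) triplet;

  if (k < r[0])
    return (-1);
  if (k > r[1])
    return (1);
  return (0);
}

/*
 * Hypothetical helper: return the line break class stored for 'ch',
 * or 'unknown' if no range covers it.
 */
int break_class(const ucs2_t *uni2break, int breakcount, ucs2_t ch,
                int unknown)
{
  const ucs2_t *hit;

  hit = bsearch(&ch, uni2break, (size_t) breakcount, 3 * sizeof(ucs2_t),
                compare_breakrange);
  return (hit != NULL ? (int) hit[2] : unknown);
}

/* Illustrative range triplets: first, last, class */
const ucs2_t sample_break[] =
{
  0x0020, 0x0020, SAMPLE_BREAK_SP,  /* space */
  0x0030, 0x0039, SAMPLE_BREAK_NU,  /* ASCII digits */
  0x0041, 0x005A, SAMPLE_BREAK_AL,  /* ASCII uppercase letters */
  0x0061, 0x007A, SAMPLE_BREAK_AL   /* ASCII lowercase letters */
};
```

+
+ Storing ranges rather than one entry per character keeps the table
+ small, which matches the note in 'normalize.h' about parsing
+ 'DerivedLineBreak.txt' into a sparse list of class ranges.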
+
+
+
+ 3.2.1.6. cups_combmap_t - Combining Class Map Structure
+
+ typedef struct /**** Combining Class Map Struct ****/
+ {
+ int used; /* Number of times entry used */
+ int combcount; /* Count of Source Chars */
+ cups_comb_t *uni2comb; /* Char -> Combining Class */
+ } cups_combmap_t;
+
+ 'uni2comb' is a pointer to an array of 'cups_comb_t' (see below).
+ 'combcount' is a count of elements in the 'uni2comb[]' array.
+
+
+
+ 3.2.1.7. cups_comb_t - Combining Class Structure
+
+ typedef struct cups_comb_str /**** Char Combining Class Struct ****/
+ {
+ ucs2_t ch; /* Unicode Char as UCS-2 */
+ unsigned char combclass; /* Combining Class */
+ unsigned char reserved; /* Reserved for alignment */
+ } cups_comb_t;
+
+
+
+ 3.2.2. normalize.c - Normalization module
+
+ The normalization function 'cupsUtf8Normalize()' and the case folding
+ function 'cupsUtf8CaseFold()' are modelled on the C standard library
+ function 'strncpy()', except that they return the count of the output,
+ like 'strlen()', rather than the (redundant) pointer to the output.
+
+ If the normalization or case folding functions detect invalid input
+ parameters or they detect an encoding error in their input, then they
+ return '-1', rather than the count of output.
+
+ The normalization and case folding functions take an input parameter
+ indicating the maximum output units (for safe operation).
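+
+ The calling convention described above (bounded output, count
+ returned, '-1' on bad parameters or bad input) can be illustrated with
+ a plain bounded copy. This is a sketch, not a function of this design:
+ 'bounded_copy_utf32()' is a hypothetical name, 'utf32_t' is a local
+ stand-in for the 'transcode.h' type, and the sketch ASSUMES that
+ 'maxout' includes room for the terminating zero and that overlong
+ input is truncated.
+

```c
#include <stddef.h>

typedef unsigned long utf32_t;  /* stand-in for the 'transcode.h' type */

/*
 * Hypothetical helper: copy a zero-terminated UTF-32 string.
 * Returns the count of characters written (excluding the terminator),
 * or -1 on invalid parameters or an invalid code point.
 */
int bounded_copy_utf32(utf32_t *dest, const utf32_t *src, const int maxout)
{
  int i;

  if (dest == NULL || src == NULL || maxout < 1)
    return (-1);

  for (i = 0; src[i] != 0 && i < maxout - 1; i ++)
  {
    /* Reject values above U+10FFFF and surrogate code points */
    if (src[i] > 0x10FFFF || (src[i] >= 0xD800 && src[i] <= 0xDFFF))
      return (-1);
    dest[i] = src[i];
  }

  dest[i] = 0;
  return (i);
}
```

+
+ Returning a count rather than the 'dest' pointer lets the caller
+ detect errors and size the next buffer without a separate length call.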
+
+
+
+ 3.2.2.1. cupsUtf8Normalize()
+
+ /*
+ * Normalize UTF-8 string to Unicode UAX-15 Normalization Form
+ * Note - Compatibility Normalization Forms (NFKD/NFKC) are
+ * unsafe for subsequent transcoding to legacy charsets
+ */
+ extern int cupsUtf8Normalize(utf8_t *dest, /* O - Target string */
+ const utf8_t *src, /* I - Source string */
+ const int maxout, /* I - Max output */
+ const cups_normalize_t normalize);
+ /* I - Normalization */
+
+ <Convert input UTF-8 to internal UCS-4 by calling 'cupsUtf8ToUtf32()'>
+ <Normalize by calling 'cupsUtf32Normalize()'>
+ <Convert normalized UCS-4 to UTF-8 by calling 'cupsUtf32ToUtf8()'>
+ <Return length of output UTF-8 string -- size in bytes>
+
+
+
+ 3.2.2.2. cupsUtf32Normalize()
+
+ extern int cupsUtf32Normalize(utf32_t *dest,
+ /* O - Target string */
+ const utf32_t *src, /* I - Source string */
+ const int maxout, /* I - Max output */
+ const cups_normalize_t normalize);
+ /* I - Normalization */
+
+ <Find normalize maps by calling 'cupsNormalizeMapsGet()'>
+ <...if not found, return '-1'>
+ <Repeatedly traverse internal UCS-4, decomposing (NFD or NFKD)...>
+ <...with 'bsearch()' of 'uni2norm[]' using local 'compare_decompose()'>
+ <...until one pass yields no further decomposition>
+ <Repeatedly traverse internal UCS-4, doing canonical reordering>
+ <...with 'bsearch()' of 'uni2comb[]' using local 'compare_combchar()'>
+ <...until one pass yields no further canonical reordering>
+ <If 'normalize' requests composition (NFC or NFKC)...>
+ <...repeatedly traverse internal UCS-4, composing (NFC or NFKC)...>
+ <...with 'bsearch()' of 'uni2norm[]' using local 'compare_compose()'>
+ <...until one pass yields no further composition>
+ <Release normalize maps by calling 'cupsNormalizeMapsFree()'>
+ <Return count of output UTF-32 string -- NOT memory size in bytes>
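+
+ The canonical reordering pass in the pseudo-code above can be sketched
+ as a repeated adjacent-swap pass over the string. This is a minimal
+ illustration, not the 'normalize.c' implementation:
+ 'canonical_reorder()', 'comb_class()', and 'sample_comb[]' are
+ hypothetical names, the types are local stand-ins mirroring
+ 'cups_comb_t', and characters absent from the table are assumed to
+ have combining class 0.
+

```c
#include <stdlib.h>

typedef unsigned short ucs2_t;   /* stand-in for the 'transcode.h' type */
typedef unsigned long  utf32_t;  /* stand-in for the 'transcode.h' type */

typedef struct cups_comb_str     /* mirrors 'cups_comb_t' in this design */
{
  ucs2_t        ch;
  unsigned char combclass;
  unsigned char reserved;
} cups_comb_t;

/* Compare search key with the 'ch' member of a table element */
static int compare_combchar(const void *key, const void *elem)
{
  ucs2_t k = *(const ucs2_t *) key;
  ucs2_t c = ((const cups_comb_t *) elem)->ch;

  return ((k > c) - (k < c));
}

/* Hypothetical helper: combining class of 'ch', or 0 if not in table */
static int comb_class(const cups_comb_t *uni2comb, int combcount,
                      utf32_t ch)
{
  ucs2_t            key;
  const cups_comb_t *hit;

  if (ch > 0xFFFF)
    return (0);                  /* table only supports UCS-2 */
  key = (ucs2_t) ch;
  hit = bsearch(&key, uni2comb, (size_t) combcount, sizeof(cups_comb_t),
                compare_combchar);
  return (hit != NULL ? hit->combclass : 0);
}

/* Swap adjacent marks until no pair has class(first) > class(second) > 0 */
void canonical_reorder(utf32_t *str, int len,
                       const cups_comb_t *uni2comb, int combcount)
{
  int i, swapped;

  do
  {
    swapped = 0;
    for (i = 0; i + 1 < len; i ++)
    {
      int c1 = comb_class(uni2comb, combcount, str[i]);
      int c2 = comb_class(uni2comb, combcount, str[i + 1]);

      if (c2 > 0 && c1 > c2)
      {
        utf32_t tmp = str[i];

        str[i]     = str[i + 1];
        str[i + 1] = tmp;
        swapped    = 1;
      }
    }
  } while (swapped);
}

/* Illustrative combining classes, sorted by character */
const cups_comb_t sample_comb[] =
{
  { 0x0301, 230, 0 },  /* combining acute accent */
  { 0x0323, 220, 0 },  /* combining dot below */
  { 0x0327, 202, 0 }   /* combining cedilla */
};
```

+
+ Base characters (class 0) are never moved, so marks are only reordered
+ within the run that follows a single base character.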
+
+
+
+ 3.2.2.3. cupsUtf8CaseFold()
+
+ /*
+ * Case Fold UTF-8 string per Unicode UAX-21 Section 2.3
+ * Note - Case folding output is
+ * unsafe for subsequent transcoding to legacy charsets
+ */
+ extern int cupsUtf8CaseFold(utf8_t *dest, /* O - Target string */
+ const utf8_t *src, /* I - Source string */
+ const int maxout, /* I - Max output */
+ const cups_folding_t fold); /* I - Fold Mode */
+
+ <Find normalize maps by calling 'cupsNormalizeMapsGet()'>
+ <...if not found, return '-1'>
+ <Convert input UTF-8 to internal UCS-4 by calling 'cupsUtf8ToUtf32()'>
+ <Case fold internal UCS-4 by calling 'cupsUtf32CaseFold()'>
+ <Convert internal UCS-4 to output UTF-8 by calling 'cupsUtf32ToUtf8()'>
+ <Release normalize maps by calling 'cupsNormalizeMapsFree()'>
+ <Return length of output UTF-8 string -- size in bytes>
+
+
+
+ 3.2.2.4. cupsUtf32CaseFold()
+
+ /*
+ * Case Fold UTF-32 string per Unicode UAX-21 Section 2.3
+ * Note - Case folding output is
+ * unsafe for subsequent transcoding to legacy charsets
+ */
+ extern int cupsUtf32CaseFold(utf32_t *dest,/* O - Target string */
+ const utf32_t *src, /* I - Source string */
+ const int maxout, /* I - Max output */
+ const cups_folding_t fold); /* I - Fold Mode */
+
+ <Find case fold maps by calling 'cupsNormalizeMapsGet()'>
+ <...if not found, return '-1'>
+ <Traverse internal UCS-4 once, performing case folding...>
+ <...with 'bsearch()' of 'uni2fold[]' using local 'compare_foldchar()'>
+ <Copy internal UCS-4 to output UTF-32 string>
+ <Release normalize maps by calling 'cupsNormalizeMapsFree()'>
+ <Return count of output UTF-32 string -- NOT memory size in bytes>
+
+
+
+ 3.2.2.5. cupsUtf8CompareCaseless()
+
+ /*
+ * Compare UTF-8 strings after case folding
+ */
+ extern int cupsUtf8CompareCaseless(const utf8_t *s1,
+ /* I - String1 */
+ const utf8_t *s2); /* I - String2 */
+
+ <Case fold both input UTF-8 strings by calling 'cupsUtf8CaseFold()'>
+ <Return compare of case folded first and second strings>
+
+
+
+ 3.2.2.6. cupsUtf32CompareCaseless()
+
+ /*
+ * Compare UTF-32 strings after case folding
+ */
+ extern int cupsUtf32CompareCaseless(const utf32_t *s1,
+ /* I - String1 */
+ const utf32_t *s2); /* I - String2 */
+
+ <Case fold both input UTF-32 strings by calling 'cupsUtf32CaseFold()'>
+ <Return compare of case folded first and second strings>
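+
+ The fold-then-compare approach can be sketched as follows. This is an
+ illustration only: 'compare_caseless_utf32()' and 'fold_ascii()' are
+ hypothetical names, 'utf32_t' is a local stand-in for the
+ 'transcode.h' type, and an ASCII-only fold replaces the call to
+ 'cupsUtf32CaseFold()'. Folding character by character, as done here,
+ is only valid for simple (one-to-one) case folding.
+

```c
typedef unsigned long utf32_t;  /* stand-in for the 'transcode.h' type */

/* Stand-in for real case folding: folds ASCII uppercase letters only */
static utf32_t fold_ascii(utf32_t ch)
{
  if (ch >= 'A' && ch <= 'Z')
    return (ch + ('a' - 'A'));
  return (ch);
}

/*
 * Hypothetical helper: compare two zero-terminated UTF-32 strings
 * after folding. Returns <0, 0, or >0 in the manner of 'strcmp()'.
 */
int compare_caseless_utf32(const utf32_t *s1, const utf32_t *s2)
{
  for (;; s1 ++, s2 ++)
  {
    utf32_t c1 = fold_ascii(*s1);
    utf32_t c2 = fold_ascii(*s2);

    if (c1 != c2)
      return (c1 < c2 ? -1 : 1);
    if (c1 == 0)
      return (0);               /* both strings ended together */
  }
}
```

+
+ With full case folding the two strings would first be folded into
+ separate buffers, because one input character may expand to several.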
+
+
+
+ 3.2.2.7. cupsUtf8CompareIdentifier()
+
+ /*
+ * Compare UTF-8 strings after case folding and NFKC normalization
+ */
+ extern int cupsUtf8CompareIdentifier(const utf8_t *s1,
+ /* I - String1 */
+ const utf8_t *s2); /* I - String2 */
+
+ <Convert input UTF-8 to internal UCS-4 by calling 'cupsUtf8ToUtf32()'>
+ <Case fold both strings by calling 'cupsUtf32CaseFold()'>
+ <Normalize both strings to NFKC by calling 'cupsUtf32Normalize()'>
+ <Return compare of case folded/normalized first and second strings>
+
+
+
+ 3.2.2.8. cupsUtf32CompareIdentifier()
+
+ /*
+ * Compare UTF-32 strings after case folding and NFKC normalization
+ */
+ extern int cupsUtf32CompareIdentifier(const utf32_t *s1,
+ /* I - String1 */
+ const utf32_t *s2); /* I - String2 */
+
+ <Case fold both strings by calling 'cupsUtf32CaseFold()'>
+ <Normalize both strings to NFKC by calling 'cupsUtf32Normalize()'>
+ <Return compare of case folded/normalized first and second strings>
+
+
+
+ 3.2.2.9. cupsUtf32CharacterProperty()
+
+ /*
+ * Get UTF-32 character property
+ */
+ extern int cupsUtf32CharacterProperty(const utf32_t ch,
+ /* I - Source char */
+ const cups_property_t property);
+ /* I - Char Property */
+
+ <Lookup UTF-32 character property in appropriate map...>
+ <...internal functions for each different map lookup>
+
+
+
+ 3.2.2.10. Normalization Utility Functions
+
+
+
+
+ 3.2.2.10.1. cupsNormalizeMapsGet()
+
+ extern int cupsNormalizeMapsGet(void);
+
+ <Find normalize maps in cache>
+ <...If found, increment 'used'>
+ <...and return '0'>
+ <For each map (normalization, case fold, combining class, etc.)...>
+ <Open (preprocessed form of) Unicode data file...>
+ <...If not found, return '-1'>
+ <Count lines in preprocessed form, for mapping memory alloc>
+ <...Close (preprocessed form of) Unicode data file>
+ <Open (preprocessed form of) Unicode data file...>
+ <...If not found, return '-1'>
+ <Allocate memory for appropriate map in cache...>
+ <...If no memory, return '-1'>
+ <Add to appropriate cache by assigning 'next' field>
+ <Assign map type field and count field>
+ <Increment 'used' field>
+ <Read normalize map into memory in loop...>
+ <...Add values to 'uni2xxx[]' array>
+ <Close (preprocessed form of) Unicode data file>
+ <Return '0'>
+
+
+
+ 3.2.2.10.2. cupsNormalizeMapsFree()
+
+ extern int cupsNormalizeMapsFree(void);
+
+ <Find normalize maps in cache>
+ <...If found, decrement 'used'>
+ <Return '0'>
+
+
+
+ 3.2.2.10.3. cupsNormalizeMapsFlush()
+
+ extern void cupsNormalizeMapsFlush(void);
+
+ <Loop through normalize maps cache...>
+ <...Free 'uni2norm[]' memory>
+ <...Free normalize map memory>
+ <Loop through case folding cache...>
+ <...Free 'uni2fold[]' memory>
+ <...Free case folding memory>
+ <Loop through char property map cache...>
+ <...Free 'uni2prop[]' memory>
+ <...Free char property map memory>
+ <Loop through line break class map cache...>
+ <...Free 'uni2break[]' memory>
+ <...Free line break class map memory>
+ <Loop through combining class map cache...>
+ <...Free 'uni2comb[]' memory>
+ <...Free combining class map memory>
+ <Return void>
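+
+ The get / free / flush life cycle of the three utility functions
+ above amounts to a use-counted cache. The sketch below shows that
+ pattern for a single map. All names ('sample_map_t', 'sample_cache',
+ 'sample_maps_get()', 'sample_maps_free()', 'sample_maps_flush()') are
+ hypothetical, and a hard-coded triplet stands in for reading the
+ preprocessed Unicode data file.
+

```c
#include <stdlib.h>

typedef unsigned short ucs2_t;  /* stand-in for the 'transcode.h' type */

typedef struct                  /* hypothetical single-map cache entry */
{
  int    used;                  /* Number of times entry used */
  int    count;                 /* Count of triplets in 'data' */
  ucs2_t *data;                 /* Triplets, or NULL if not loaded */
} sample_map_t;

static sample_map_t sample_cache = { 0, 0, NULL };

/* Load the map on first use, then count the user. Returns 0 or -1. */
int sample_maps_get(void)
{
  if (sample_cache.data == NULL)
  {
    sample_cache.data = calloc(3, sizeof(ucs2_t));
    if (sample_cache.data == NULL)
      return (-1);

    /* A real loader would read the preprocessed data file here */
    sample_cache.data[0] = 0x00C9;
    sample_cache.data[1] = 0x0045;
    sample_cache.data[2] = 0x0301;
    sample_cache.count   = 1;
  }

  sample_cache.used ++;
  return (0);
}

/* Release one use of the map; the memory stays cached. Returns 0. */
int sample_maps_free(void)
{
  if (sample_cache.used > 0)
    sample_cache.used --;
  return (0);
}

/* Discard the cached map regardless of its use count */
void sample_maps_flush(void)
{
  free(sample_cache.data);
  sample_cache.data  = NULL;
  sample_cache.count = 0;
  sample_cache.used  = 0;
}
```

+
+ Keeping the map loaded after the use count drops to zero avoids
+ re-reading the data file on every normalization call; only the flush
+ function returns the memory.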
+
+
+
+ 3.3. Language - Existing
+
+
+
+ 3.3.1. language.h - Language header
+
+ Required Changes:
+
+ (1) Change definition of 'cups_lang_t' to correct length of 'language[]'
+ to 32 characters per [RFC3066] and [ISO639-2] and [ISO3166-1].
+
+
+
+ 3.3.2. language.c - Language module
+
+
+
+ 3.3.2.1. cupsLangEncoding() - Existing
+
+ [No Change]
+
+
+
+ 3.3.2.2. cupsLangFlush() - Existing
+
+ [No Change]
+
+
+
+ 3.3.2.3. cupsLangFree() - Existing
+
+ [No Change]
+
+
+
+ 3.3.2.4. cupsLangGet() - Existing
+
+ Required Changes:
+
+ (1) Change length of 'langname[]' and 'real[]' to 64 characters per
+ [RFC3066] and potential length of encoding (charset) names;
+ (2) Change language string normalization to support:
+ (a) 8-character language codes per [RFC3066] and 3-character
+ language codes per [ISO639-2];
+ (b) 8-character country codes per [RFC3066] and 3-character country
+ codes per [ISO3166-1];
+ (c) Support for 'i' (IANA registered) and 'x' (private) language
+ prefixes per [RFC3066];
+ (d) Invariant use of 'utf-8' for encoding in message catalog, but
+ save actual requested encoding name for later use.
+ (3) Correct broken do/while statement for message catalog lookup (while
+ condition is _never_ satisfied).
+
+
+
+ 3.3.2.5. cupsLangPrintf() - New
+
+ extern int cupsLangPrintf(FILE *fp, /* I - File to write */
+ const cups_lang_t *lang, /* I - Language/locale*/
+ const cups_msg_t msg, /* I - Msg to format */
+ ...); /* I - Args to format */
+
+ <Set up variable args by calling 'va_start()'>
+ <Format CUPS message with variable args by calling 'vsnprintf()'>
+ <Clean up variable args by calling 'va_end()'>
+ <Transcode CUPS message by calling 'cupsUtf8ToCharset()'>
+ <Write CUPS message by calling 'fputs()'>
+ <Return transcoded output CUPS message length>
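+
+ The format / transcode / write sequence above can be sketched as
+ follows. This is not the 'language.c' implementation:
+ 'sample_lang_printf()' is a hypothetical name that takes a format
+ string directly instead of a 'cups_lang_t' and 'cups_msg_t', and
+ 'sample_transcode()' is a byte-copy stand-in for 'cupsUtf8ToCharset()'
+ whose real signature is not assumed here. The 1024-byte buffers are
+ also an assumption.
+

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for transcoding: copies bytes unchanged, truncating to fit */
static int sample_transcode(char *dest, const char *src, int maxout)
{
  int len;

  if (dest == NULL || src == NULL || maxout < 1)
    return (-1);

  len = (int) strlen(src);
  if (len > maxout - 1)
    len = maxout - 1;
  memcpy(dest, src, (size_t) len);
  dest[len] = '\0';
  return (len);
}

/*
 * Hypothetical helper: format a message, transcode it, and write it.
 * Returns the transcoded length, or -1 on error.
 */
int sample_lang_printf(FILE *fp, const char *format, ...)
{
  char    utf8[1024];           /* formatted UTF-8 message */
  char    out[1024];            /* transcoded message */
  va_list ap;
  int     len;

  if (fp == NULL || format == NULL)
    return (-1);

  va_start(ap, format);
  len = vsnprintf(utf8, sizeof(utf8), format, ap);
  va_end(ap);
  if (len < 0)
    return (-1);

  len = sample_transcode(out, utf8, (int) sizeof(out));
  if (len < 0)
    return (-1);

  if (fputs(out, fp) == EOF)
    return (-1);
  return (len);
}
```

+
+ Formatting before transcoding keeps the message catalog in UTF-8 and
+ confines charset handling to one step just before output.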
+
+
+
+ 3.3.2.6. cupsLangPuts() - New
+
+ extern int cupsLangPuts(FILE *fp, /* I - File to write */
+ const cups_lang_t *lang, /* I - Language/locale*/
+ const cups_msg_t msg); /* I - Msg to write */
+
+ <Transcode CUPS message by calling 'cupsUtf8ToCharset()'>
+ <Write CUPS message by calling 'fputs()'>
+ <Return transcoded output CUPS message length>
+
+
+
+ 3.3.2.7. cupsEncodingName() - New
+
+ extern char *cupsEncodingName(cups_encoding_t encoding);
+
+ <Lookup encoding name in static 'lang_encodings[]' array>
+ <Return pointer to encoding name (charset map file name)>
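+
+ A static table lookup of this kind can be sketched as below. The
+ enumeration, the table contents, and the names 'sample_encoding_t',
+ 'sample_encodings[]', and 'sample_encoding_name()' are hypothetical
+ stand-ins for 'cups_encoding_t' and 'lang_encodings[]'. Falling back
+ to the first entry for an out-of-range value is an assumption of this
+ sketch, not something this design specifies.
+

```c
/* Hypothetical subset of encodings, in table order */
typedef enum
{
  SAMPLE_US_ASCII,
  SAMPLE_ISO8859_1,
  SAMPLE_UTF8,
  SAMPLE_ENCODING_MAX           /* number of entries, not an encoding */
} sample_encoding_t;

/* Encoding names indexed by 'sample_encoding_t' */
static const char * const sample_encodings[] =
{
  "us-ascii",
  "iso-8859-1",
  "utf-8"
};

/* Return the name for 'encoding', or the first name if out of range */
const char *sample_encoding_name(sample_encoding_t encoding)
{
  if ((int) encoding < 0 || (int) encoding >= (int) SAMPLE_ENCODING_MAX)
    return (sample_encodings[0]);
  return (sample_encodings[encoding]);
}
```

+
+ Keeping the enumeration and the table in the same order makes the
+ lookup a single array index, at the cost of having to update both
+ together.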
+
+
+
+ 3.4. Common Text Filter - Existing
+
+
+
+ 3.4.1. textcommon.h - Common text filter header
+
+ Required changes:
+
+ (1) Revise 'lchar_t' as specified below, adding 'attrx' bit-mask for
+ selected Unicode character properties;
+ (2) Revise 'lchar_t' as specified below, adding 'comblen' and 'combch[]'
+ for Unicode combining/attached chars (accents);
+ (3) Add 'COMBLEN_MAX' limit as specified below;
+ (4) Add 'ATTRX_...' selected Unicode character properties as specified
+ below.
+
+
+
+ 3.4.1.1. lchar_t - Character/Attribute Structure
+
+ typedef struct lchar_str /**** Character / Attribute Structure ****/
+ {
+ unsigned short ch; /* Unicode Char as UCS-2 */
+ /* or 8/16-bit Legacy Char */
+ unsigned short attr; /* Attributes of Char */
+ unsigned short attrx; /* Extended Attributes */
+ unsigned short comblen; /* Combining Char Count */
+ unsigned short combch[8]; /* Combining Chars as UCS-2 */
+ } lchar_t;
+
+ 'ch' is a 16-bit UCS-2 character or an 8/16-bit legacy char. 'attr' is
+ the character attributes defined for the existing 'lchar_t' structure
+ (defined in 'textcommon.h'). 'attrx' is the extended character
+ attributes defined for future selected Unicode character properties (see
+ below). 'comblen' is the number of attached/combining characters.
+ 'combch' is an array of 16-bit UCS-2 attached/combining characters.
+
+ Add to 'textcommon.h' constants:
+
+ COMBLEN_MAX 8
+
+ ATTRX_RIGHT2LEFT 0x0001
+
+
+
+ 3.4.2. textcommon.c - Common text filter
+
+ Required Changes:
+
+ (1) Revise 'TextMain()' function as described below.
+
+
+
+ 3.4.2.1. TextMain() - Existing
+
+ Required Changes:
+
+ [Ed Note: Pseudo code below needs more work on bidi handling.]
+
+ (1) In main loop at the _beginning_ of the 'default' clause, add the
+ following code for combining marks:
+ lchar_t *cp;
+
+ cp = Page[line];
+ cp += column;
+ /*
+ * Check for Unicode combining mark (accent)
+ */
+ if (UTF-8 && cupsUtf32CombiningClass(ch) > 0)
+ {
+
+ /*
+ * Save Unicode combining mark in SAME character
+ */
+ if (cp->comblen >= COMBLEN_MAX)
+ break;
+ cp->combch[cp->comblen] = ch;
+ cp->comblen ++;
+ break;
+ }
+
+ (2) In main loop _after_ combining chars section in 'default' clause,
+ add the following code for Unicode bidi control characters:
+ cups_bidicat_t bidicat;
+
+ /*
+ * Check for Unicode bidi control character
+ */
+ if (UTF-8)
+ {
+ bidicat = (cups_bidicat_t)
+ cupsUtf32CharacterProperty(ch, CUPS_PROP_BIDI_CATEGORY);
+ if ((bidicat == CUPS_BIDI_LRE) /* Left-to-Right Embedding */
+ || (bidicat == CUPS_BIDI_LRO) /* Left-to-Right Override */
+ || (bidicat == CUPS_BIDI_RLE) /* Right-to-Left Embedding */
+ || (bidicat == CUPS_BIDI_RLO) /* Right-to-Left Override */
+ || (bidicat == CUPS_BIDI_PDF)) /* Pop Directional Format */
+ {
+ /* Do bidi stuff here with memory for NEXT char's direction */
+ /* Discard bidi control character and break */
+ }
+ if ((bidicat == CUPS_BIDI_R) /* Right-to-Left Hebrew */
+ || (bidicat == CUPS_BIDI_AL)) /* Right-to-Left Arabic */
+ {
+ /* Set attrx for right-to-left */
+ cp->attrx |= ATTRX_RIGHT2LEFT;
+ }
+ }
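+
+ The combining mark handling in step (1) above can be isolated as a
+ small helper, sketched here for illustration. 'append_combining()' is
+ a hypothetical name and 'lchar_t' is redeclared locally to mirror
+ section 3.4.1.1. The helper refuses a mark once 'comblen' has reached
+ 'COMBLEN_MAX', so that 'combch[]' is never indexed past its last
+ element.
+

```c
#include <stddef.h>

#define COMBLEN_MAX 8

typedef struct lchar_str        /* mirrors 'lchar_t' in this design */
{
  unsigned short ch;            /* Unicode Char as UCS-2 */
  unsigned short attr;          /* Attributes of Char */
  unsigned short attrx;         /* Extended Attributes */
  unsigned short comblen;       /* Combining Char Count */
  unsigned short combch[COMBLEN_MAX]; /* Combining Chars as UCS-2 */
} lchar_t;

/*
 * Hypothetical helper: attach combining mark 'ch' to '*cp'.
 * Returns 1 if the mark was stored, 0 if it was dropped.
 */
int append_combining(lchar_t *cp, unsigned short ch)
{
  if (cp == NULL || cp->comblen >= COMBLEN_MAX)
    return (0);

  cp->combch[cp->comblen] = ch;
  cp->comblen ++;
  return (1);
}
```

+
+ Dropping marks beyond the limit, rather than failing, keeps the base
+ character printable even for pathological input.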
+
+
+
+ 3.4.2.2. compare_keywords() - Existing
+
+ [No Change]
+
+
+
+ 3.4.2.3. getutf8() - Existing
+
+ [No Change]
+
+ [Ed Note: Future - allow 21-bit UTF-32 code points - requires updates
+ in both 'textcommon.c' and 'texttops.c' for extended PostScript.]
+
+
+
+ 3.5. Text to PostScript Filter - Existing
+
+
+
+ 3.5.1. texttops.c - Text to PostScript filter
+
+ Required Changes:
+
+ (1) Revise local 'write_string()' function as described below.
+
+
+
+ 3.5.1.1. main() - Existing
+
+ [No Change]
+
+
+
+ 3.5.1.2. WriteEpilogue () - Existing
+
+ [No Change]
+
+
+
+ 3.5.1.3. WritePage () - Existing
+
+ [No Change]
+
+
+
+ 3.5.1.4. WriteProlog () - Existing
+
+ [No Change]
+
+
+
+ 3.5.1.5. write_line() - Existing
+
+ [No Change]
+
+
+
+ 3.5.1.6. write_string() - Existing
+
+ Required Changes:
+
+ (1) At the _beginning_ of Multiple Fonts section, _replace_ the while()
+ loop and surrounding 'putchar()' calls with the following code:
+
+ for (; len > 0; len --, s ++)
+ {
+ utf32_t decstr[COMBLEN_MAX * 2];
+ utf32_t cmpstr[COMBLEN_MAX * 2];
+ int cmplen;
+ int i;
+
+ if (s->comblen == 0)
+ {
+ printf("<%04x>", Chars[s->ch]);
+ continue;
+ }
+
+ /*
+ * Normalize decomposed Unicode character to NFKC
+ * (compatibility decomposition, then canonical composition)
+ */
+ decstr[0] = (utf32_t) s->ch;
+ for (i = 0; i < s->comblen; i ++)
+ decstr[i + 1] = (utf32_t) s->combch[i];
+ decstr[i + 1] = 0;
+ cmplen = cupsUtf32Normalize (&cmpstr[0],
+ &decstr[0], COMBLEN_MAX * 2, CUPS_NORM_NFKC);
+ if (cmplen < 1)
+ continue;
+
+ /*
+ * Write combining chars, then composed base, to same location
+ */
+ for (i = 1; i < cmplen; i ++)
+ {
+ printf("<%04x>", Chars[(int) cmpstr[i]]);
+ /*
+ * Superimpose glyphs by backing up one column width
+ */
+ printf (" -%.3f ", (72.0f / (float) CharsPerInch));
+ }
+ printf("<%04x>", Chars[(int) cmpstr[0]]);
+ }
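+
+ The step that builds the zero-terminated sequence passed to
+ 'cupsUtf32Normalize()' can be sketched as a standalone helper. This is
+ an illustration only: 'build_sequence()' is a hypothetical name, and
+ 'lchar_t' and 'utf32_t' are redeclared locally as stand-ins. The
+ terminator goes AFTER the last combining character, at index
+ 'comblen + 1'.
+

```c
#include <stddef.h>

#define COMBLEN_MAX 8

typedef unsigned long utf32_t;  /* stand-in for the 'transcode.h' type */

typedef struct lchar_str        /* mirrors 'lchar_t' in this design */
{
  unsigned short ch;            /* Unicode Char as UCS-2 */
  unsigned short attr;          /* Attributes of Char */
  unsigned short attrx;         /* Extended Attributes */
  unsigned short comblen;       /* Combining Char Count */
  unsigned short combch[COMBLEN_MAX]; /* Combining Chars as UCS-2 */
} lchar_t;

/*
 * Hypothetical helper: write base char, combining chars, and a zero
 * terminator to 'decstr'. Returns the character count (excluding the
 * terminator), or -1 if the parameters are invalid or it will not fit.
 */
int build_sequence(const lchar_t *s, utf32_t *decstr, int maxout)
{
  int i;

  if (s == NULL || decstr == NULL || s->comblen > COMBLEN_MAX ||
      maxout < (int) s->comblen + 2)
    return (-1);

  decstr[0] = (utf32_t) s->ch;
  for (i = 0; i < (int) s->comblen; i ++)
    decstr[i + 1] = (utf32_t) s->combch[i];
  decstr[i + 1] = 0;            /* i == comblen here */

  return (i + 1);
}
```

+
+ With 'COMBLEN_MAX' combining characters the sequence needs
+ 'COMBLEN_MAX + 2' elements, which fits in the 'COMBLEN_MAX * 2'
+ buffers used above.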
+
+ [Ed Note: Future - Bidi support - When writing Unicode characters
+ (checking for explicit bidi) convert input string (lchar_t) to display
+ order???]
+
+
+
+ 3.5.1.7. write_text() - Existing
+
+ [No Change]
+
+
+ APPENDIX A
+ Glossary
+
+
+
+ A. Glossary
+
+ Abstract Character: A unit of information used for the organization,
+ control, or representation of textual data.
+
+ Accent Mark: A mark placed above, below, or to the side of a character
+ to alter its phonetic value (also 'diacritic').
+
+ Alphabet: A collection of symbols that, in the context of a particular
+ written language, represent the sounds of that language.
+
+ Base Character: A character that does not graphically combine with
+ preceding characters, and that is neither a control nor a format
+ character.
+
+ Basic Multilingual Plane: The Unicode (or UCS) code values 0x0000
+ through 0xFFFF, specified by [ISO10646] (also 'Plane 0').
+
+ BIDI: Abbreviation for Bidirectional, in reference to mixed
+ left-to-right and right-to-left text.
+
+ Bidirectional Display: The process or result of mixing left-to-right
+ oriented text and right-to-left oriented text in a single line.
+
+ Big-endian: A computer architecture that stores multiple-byte numerical
+ values with the most significant byte (MSB) values first.
+
+ BMP: Abbreviation for Basic Multilingual Plane.
+
+ BOM: Acronym for byte order mark (also 'ZWNBSP').
+
+ Byte Order Mark: The Unicode character U+FEFF Zero Width No-Break Space
+ (ZWNBSP) when used to indicate the byte order of text.
+
+ Canonical: (1) Conforming to the general rules for encoding -- that is,
+ not compressed, compacted, or in any other form specified by a higher
+ protocol. (2) Characteristic of a normative mapping and form of
+ equivalence.
+
+ Canonical Decomposition: The decomposition of a character that results
+ from recursively applying the canonical mappings defined in the Unicode
+ Character Database until no characters can be further decomposed, then
+ reordering nonspacing marks according to section 3.10 of [UNICODE3.2].
+
+ Canonical Equivalent: Two characters are canonical equivalents if their
+ full canonical decompositions are identical.
+
+ Case: (1) Feature of certain alphabets where the letters have two
+ distinct forms. These variants are called the 'uppercase' letter (also
+ known as 'capital' or 'majuscule') and the 'lowercase' letter (also
+ known as 'small' or 'minuscule'). (2) Normative property of Unicode
+ characters, consisting of uppercase, lowercase, and titlecase.
+
+ Character: (1) The smallest component of written language that has
+ semantic value; refers to the abstract meaning and/or shape, rather than
+ a specific shape (see also 'glyph'). (2) Synonym for 'abstract
+ character'. (3) The basic unit of encoding for the Unicode character
+ encoding. (4) The English name for the ideographic written elements of
+ Chinese origin (see 'ideograph').
+
+ Character Encoding Form (CEF): Mapping from a character set definition
+ to the actual bits used to represent the data.
+
+ Character Encoding Scheme (CES): A 'character encoding form' plus byte
+ serialization. [UNICODE3.2] defines seven character encoding schemes:
+ UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.
+
+ Character Properties: A set of property names and property values
+ associated with individual characters defined in [UNICODE3.2].
+
+ Character Repertoire: (1) The collection of characters included in a
+ character set. (2) The SUBSET of characters included in a large
+ character set, e.g., [UNICODE3.2], that are necessary to support a
+ complete mapping to another smaller character set, e.g., ISO8859-1 (also
+ called 'Latin-1').
+
+ Character Set: A collection of elements used to represent textual
+ information.
+
+ Coded Character Set: A character set in which each character is
+ assigned a numeric code value. Frequently abbreviated as 'character
+ set', 'charset', or 'code set'.
+
+ Code Point: (1) A numerical index (or position) in an encoding table
+ used for encoding characters. (2) Synonym for 'Unicode scalar value'.
+
+ Collation: The process of ordering units of textual information.
+ Collation is usually specific to a particular language. Also known as
+ 'alphabetizing' or 'alphabetic sorting'.
+
+ Combining Character: A character that graphically combines with a
+ preceding 'base character'. The combining character is said to 'apply'
+ to that base character. (See also 'nonspacing mark'.)
+
+ Compatibility: (1) Consistency with existing practice or preexisting
+ character encoding standards. (2) Characteristic of a normative
+ mapping and form of equivalence (see 'compatibility decomposition').
+
+ Compatibility Character: A character that has a compatibility
+ decomposition.
+
+ Compatibility Decomposition: The decomposition of a character that
+ results from recursively applying BOTH the compatibility mappings AND
+ the canonical mappings found in the Unicode Character Database until no
+ characters can be further decomposed, then reordering nonspacing marks
+ according to section 3.10 of [UNICODE3.2].
+
+ Compatibility Equivalent: Two characters are compatibility equivalents
+ if their full compatibility decompositions are identical.
+
+ Composed Character: (See 'decomposable character'.)
+
+ DBCS: Acronym for 'double-byte character set'.
+
+ Decomposable Character: A character that is equivalent to a sequence of
+ one or more other characters, according to the decomposition mappings
+ found in [UNICODE3.2]. It may also be known as a 'precomposed
+ character' or a 'composite character'.
+
+ Decomposition: (1) The process of separating or analyzing a text
+ element into component units. (2) A sequence of one or more characters
+ that is equivalent to a 'decomposable character'.
+
+ Diacritic: (See 'accent mark'.)
+
+ Double-Byte Character Set (DBCS): One of a number of character sets
+ defined for representing Chinese, Japanese, or Korean text (for example,
+ JIS X 0208-1990). These character sets are often encoded in such a way
+ as to allow double-byte character encodings to be mixed with single-byte
+ character encodings. (See also 'multiple-byte character set'.)
+
+ Font: A collection of glyphs used for visual depiction of character
+ data.
+
+ FSS-UTF: Abbreviation for 'File System Safe UCS Transformation Format',
+ originally published by X/Open. Now called 'UTF-8'.
+
+ Fullwidth: Characters of East Asian character sets whose glyph image
+ extends across the entire character display cell. In legacy character
+ sets, fullwidth characters are normally encoded in two or three bytes.
+
+ Glyph: (1) An abstract form that represents one or more glyph images.
+ (2) A synonym for 'glyph image'.
+
+ Glyph Image: The actual, concrete image of a glyph representation
+ having been rasterized or otherwise imaged onto some display surface.
+
+ Halfwidth: Characters of East Asian character sets whose glyph image
+ occupies half of the character display cell. In legacy character sets,
+ halfwidth characters are normally encoded in a single byte.
+
+ Han Characters: Ideographic characters of Chinese origin.
+
+ Hangul: The name of the script used to write the Korean language.
+
+ High-Surrogate: A Unicode code value in the range U+D800 to U+DBFF.
+
+ Hiragana: One of two standard syllabaries associated with the Japanese
+ writing system. Used to write particles, grammatical affixes, and words
+ that have no 'kanji' form.
+
+ IANA: Internet Assigned Numbers Authority.
+
+ Ideograph: (1) Any symbol that denotes an idea (or meaning) in contrast
+ to a sound or pronunciation (for example, a 'smiley face'). (2) A
+ common term used to refer to Han characters.
+
+ IPA: International Phonetic Alphabet.
+
+ IRG: Abbreviation for Ideographic Rapporteur Group, a subgroup of
+ ISO/IEC JTC1/SC2/WG2 (who work on Han unification and submission of new
+ Han characters for inclusion in revised versions of Unicode/ISO 10646).
+
+ Jamo: The Korean name for a single letter of the Hangul script. Jamos
+ are used to form Hangul syllables.
+
+ Joiner: An invisible character that affects the joining behavior of
+ surrounding characters.
+
+ JTC1: Abbreviation for Joint Technical Committee 1 of ISO/IEC,
+ responsible for information technology standardization.
+
+ Kana: The name of a primarily syllabic script used by the Japanese
+ writing system, composed of 'hiragana' and 'katakana'.
+
+ Kanji: The Japanese name for Han characters; derived from the Chinese
+ word 'hanzi'. Also romanized as 'kanzi'.
+
+ Katakana: One of two standard syllabaries associated with the Japanese
+ writing system, typically used in representation of borrowed vocabulary.
+
+ Ligature: A glyph representing a combination of two or more characters,
+ for example, in the Latin script, the single glyph 'fi' that joins 'f'
+ and 'i'.
+
+ Logical Order: The order in which text is typed on a keyboard. For the
+ most part, logical order corresponds to phonetic order.
+
+ Lowercase: (See 'case'.)
+
+ Low-Surrogate: A Unicode code value in the range U+DC00 to U+DFFF.
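+
+ The high-surrogate and low-surrogate ranges can be tested and combined
+ with simple arithmetic. The following is a minimal sketch only; the
+ function names are hypothetical (not part of the CUPS API), and
+ returning U+FFFD for a malformed pair is an assumption made here.

```c
#include <stdint.h>

/* Returns 1 if 'unit' lies in the high-surrogate range U+D800..U+DBFF. */
int is_high_surrogate(uint16_t unit)
{
  return unit >= 0xD800 && unit <= 0xDBFF;
}

/* Returns 1 if 'unit' lies in the low-surrogate range U+DC00..U+DFFF. */
int is_low_surrogate(uint16_t unit)
{
  return unit >= 0xDC00 && unit <= 0xDFFF;
}

/*
 * Combines a high/low surrogate pair into one value in the range
 * U+10000..U+10FFFF. Each surrogate contributes 10 bits.
 * Returns 0xFFFD if the two units do not form a valid pair.
 */
uint32_t combine_surrogates(uint16_t high, uint16_t low)
{
  if (!is_high_surrogate(high) || !is_low_surrogate(low))
    return 0xFFFD;

  return 0x10000 + (((uint32_t)(high - 0xD800) << 10) |
                    (uint32_t)(low - 0xDC00));
}
```

+ For example, the pair 0xD83D 0xDE00 combines to 0x1F600.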
+
+ MBCS: Acronym for 'multiple-byte character set'.
+
+ Multiple-Byte Character Set (MBCS): A character set encoded with a
+ variable number of bytes per character. Many large character sets have
+ been defined as MBCS so as to keep strict compatibility with the
+ US-ASCII subset and/or [ISO2022].
+
+ Normalization: Transformation of data to a normal form.
+
+ Plain Text: Computer-encoded text that consists ONLY of a sequence of
+ code values from a given standard, with no other formatting or
+ structural information.
+
+ Precomposed Character: (See 'decomposable character'.)
+
+ Rendering: (1) The process of selecting and laying out glyphs for the
+ purpose of depicting characters. (2) The process of making glyphs
+ visible on a display device.
+
+ Repertoire: (See 'character repertoire'.)
+
+ Replacement Character: A character used as a substitute for an
+ uninterpretable character from another encoding. [UNICODE3.2] defines
+ U+FFFD REPLACEMENT CHARACTER for this function.
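+
+ A minimal sketch of this substitution, assuming a US-ASCII source in
+ which any byte of 0x80 or above is uninterpretable. The function name
+ 'ascii_to_ucs4' is hypothetical and is not part of the CUPS API.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu   /* U+FFFD REPLACEMENT CHARACTER */

/*
 * Hypothetical helper: widen US-ASCII bytes to 32-bit code values.
 * Bytes outside the 7-bit range have no US-ASCII meaning, so each one
 * is replaced by U+FFFD. Returns the number of replacements made.
 */
size_t ascii_to_ucs4(const unsigned char *src, size_t count, uint32_t *dest)
{
  size_t i;
  size_t replaced = 0;

  for (i = 0; i < count; i++) {
    if (src[i] < 0x80) {
      dest[i] = src[i];
    } else {
      dest[i] = REPLACEMENT_CHAR;
      replaced++;
    }
  }
  return replaced;
}
```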
+
+ Rich Text: The result of adding information such as font data, color,
+ formatting, phonetic annotations, etc. to 'plain text' (e.g., HTML).
+
+ SBCS: Acronym for 'single-byte character set'.
+
+ Scalar Value: (See 'Unicode scalar value'.)
+
+ Script: A collection of symbols used to represent textual information
+ in one or more writing systems.
+
+ Single-Byte Character Set (SBCS): One of a number of one-byte character
+ sets defined for representing (mostly) Western languages (for example,
+ ISO 8859-1 'Latin-1'). These character sets are often encoded in such a
+ way as to be strict supersets of 7-bit [US-ASCII].
+
+ Sorting: (See 'collation'.)
+
+ Transcoding: Conversion of character data between different character
+ sets.
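+
+ As one concrete case, ISO 8859-1 text can be transcoded to UTF-8
+ without a lookup table, because each Latin-1 byte equals its Unicode
+ code point. This is a minimal sketch only; 'latin1_to_utf8' is a
+ hypothetical name and does not describe the CUPS transcoding functions.

```c
#include <stddef.h>

/*
 * Hypothetical helper: transcode ISO 8859-1 bytes to UTF-8.
 * Bytes below 0x80 are copied; bytes 0x80..0xFF become two octets.
 * Stops early if the next character would not fit in 'dest'.
 * Returns the number of octets written to 'dest'.
 */
size_t latin1_to_utf8(const unsigned char *src, size_t srclen,
                      unsigned char *dest, size_t destsize)
{
  size_t i;
  size_t out = 0;

  for (i = 0; i < srclen; i++) {
    if (src[i] < 0x80) {
      if (out + 1 > destsize)
        break;
      dest[out++] = src[i];
    } else {
      if (out + 2 > destsize)
        break;
      dest[out++] = (unsigned char)(0xC0 | (src[i] >> 6));
      dest[out++] = (unsigned char)(0x80 | (src[i] & 0x3F));
    }
  }
  return out;
}
```

+ A destination buffer of twice the source length is always sufficient.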
+
+ Transformation Format: A mapping from a coded character sequence to a
+ unique sequence of code values (typically octets).
+
+ UCS: Abbreviation for Universal Character Set, specified by [ISO10646].
+
+ UCS-2: UCS encoded in 2 octets, specified by [ISO10646].
+
+ UCS-4: UCS encoded in 4 octets, specified by [ISO10646].
+
+ Unicode Scalar Value: A number in the range 0 to 0x10FFFF.
+
+ Uppercase: (See 'case'.)
+
+ UTF: Abbreviation for Unicode (or UCS) Transformation Format.
+
+ UTF-8: Unicode (or UCS) Transformation Format, 8-bit encoding form.
+ Serializes a Unicode (or UCS) scalar value (code point) as a sequence of
+ one to four octets. Does NOT suffer from byte-ordering ambiguities.
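+
+ The one-to-four octet layout can be sketched as follows. This is an
+ illustration only; 'utf8_encode' is a hypothetical name (not part of
+ the CUPS API), and rejecting surrogate values is a choice made here.

```c
#include <stdint.h>

/*
 * Hypothetical helper: encode one scalar value as UTF-8.
 * Returns the number of octets written to 'out' (1 to 4), or 0 if
 * 'scalar' is a surrogate value or lies above 0x10FFFF.
 */
int utf8_encode(uint32_t scalar, unsigned char out[4])
{
  if (scalar < 0x80) {                       /* 0xxxxxxx */
    out[0] = (unsigned char)scalar;
    return 1;
  }
  if (scalar < 0x800) {                      /* 110xxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xC0 | (scalar >> 6));
    out[1] = (unsigned char)(0x80 | (scalar & 0x3F));
    return 2;
  }
  if (scalar < 0x10000) {                    /* 1110xxxx + 2 trailing */
    if (scalar >= 0xD800 && scalar <= 0xDFFF)
      return 0;                              /* surrogate value */
    out[0] = (unsigned char)(0xE0 | (scalar >> 12));
    out[1] = (unsigned char)(0x80 | ((scalar >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (scalar & 0x3F));
    return 3;
  }
  if (scalar <= 0x10FFFF) {                  /* 11110xxx + 3 trailing */
    out[0] = (unsigned char)(0xF0 | (scalar >> 18));
    out[1] = (unsigned char)(0x80 | ((scalar >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((scalar >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (scalar & 0x3F));
    return 4;
  }
  return 0;                                  /* out of range */
}
```

+ For example, U+20AC is encoded as the three octets E2 82 AC.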
+
+ UTF-16: Unicode (or UCS) Transformation Format, 16-bit encoding form.
+ Serializes a Unicode (or UCS) scalar value (code point) as a sequence of
+ two or four octets (one 16-bit code unit, or a surrogate pair of two),
+ in either big-endian or little-endian format. Uses an (optional) BOM
+ prefix to disambiguate byte-ordering.
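+
+ A minimal sketch of UTF-16 encoding and big-endian serialization. A
+ scalar value above U+FFFF needs a surrogate pair, that is, two 16-bit
+ code units. The names 'utf16_encode' and 'utf16_put_be' are
+ hypothetical and are not part of the CUPS API.

```c
#include <stdint.h>

/*
 * Hypothetical helper: encode one scalar value as UTF-16 code units.
 * Returns the number of units written (1 or 2), or 0 if 'scalar' is a
 * surrogate value or lies above 0x10FFFF.
 */
int utf16_encode(uint32_t scalar, uint16_t units[2])
{
  if (scalar >= 0xD800 && scalar <= 0xDFFF)
    return 0;
  if (scalar < 0x10000) {
    units[0] = (uint16_t)scalar;
    return 1;
  }
  if (scalar <= 0x10FFFF) {
    scalar -= 0x10000;                                /* 20 bits remain */
    units[0] = (uint16_t)(0xD800 | (scalar >> 10));   /* high-surrogate */
    units[1] = (uint16_t)(0xDC00 | (scalar & 0x3FF)); /* low-surrogate */
    return 2;
  }
  return 0;
}

/* Writes one 16-bit code unit as two octets, most significant first. */
void utf16_put_be(uint16_t unit, unsigned char out[2])
{
  out[0] = (unsigned char)(unit >> 8);
  out[1] = (unsigned char)(unit & 0xFF);
}
```

+ Passing the BOM value 0xFEFF to 'utf16_put_be' yields the octets FE FF;
+ a little-endian writer would emit the same two octets in reverse order.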
+
+ UTF-32: Unicode (or UCS) Transformation Format, 32-bit encoding form.
+ Serializes a Unicode (or UCS) scalar value (code point) as a sequence of
+ four octets, in either big-endian or little-endian format. Uses an
+ (optional) BOM prefix to disambiguate byte-ordering.
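+
+ A minimal sketch of UTF-32 serialization and BOM detection, assuming
+ the BOM is the code point U+FEFF written in the stream's own byte
+ order. The names 'utf32_put' and 'utf32_detect_bom' are hypothetical
+ and are not part of the CUPS API.

```c
#include <stdint.h>

/*
 * Hypothetical helper: write one scalar value as four octets.
 * If 'big_endian' is non-zero the most significant octet comes first;
 * otherwise the least significant octet comes first.
 */
void utf32_put(uint32_t scalar, int big_endian, unsigned char out[4])
{
  int i;

  for (i = 0; i < 4; i++) {
    int shift = big_endian ? (24 - 8 * i) : (8 * i);
    out[i] = (unsigned char)((scalar >> shift) & 0xFF);
  }
}

/*
 * Hypothetical helper: inspect the first four octets of a stream.
 * Returns 1 for a big-endian BOM (00 00 FE FF), -1 for a little-endian
 * BOM (FF FE 00 00), and 0 if no BOM is present.
 */
int utf32_detect_bom(const unsigned char in[4])
{
  if (in[0] == 0x00 && in[1] == 0x00 && in[2] == 0xFE && in[3] == 0xFF)
    return 1;
  if (in[0] == 0xFF && in[1] == 0xFE && in[2] == 0x00 && in[3] == 0x00)
    return -1;
  return 0;
}
```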
+
+ Zero Width: Characteristic of some spaces or format control characters
+ that do not advance text along the horizontal baseline.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ McDonald June 20, 2002 [Page A-6]