Diffstat (limited to 'lib/unicode/readme.txt')
-rwxr-xr-x | lib/unicode/readme.txt | 301 |
1 files changed, 301 insertions, 0 deletions
diff --git a/lib/unicode/readme.txt b/lib/unicode/readme.txt
new file mode 100755
index 0000000000..5f908d3067
--- /dev/null
+++ b/lib/unicode/readme.txt
@@ -0,0 +1,301 @@
+
+UNICODE 2.0 CHARACTER DATABASE
+
+Copyright (c) 1991-1996 Unicode, Inc.
+All Rights reserved.
+
+DISCLAIMER
+
+The Unicode Character Database "UNIDATA2.TXT" is provided as-is by
+Unicode, Inc. (The Unicode Consortium). No claims are made as to fitness for any
+particular purpose. No warranties of any kind are expressed or implied. The
+recipient agrees to determine applicability of information provided. If this
+file has been purchased on magnetic or optical media from Unicode, Inc.,
+the sole remedy for any claim will be exchange of defective media within
+90 days of receipt.
+
+This disclaimer is applicable for all other data files accompanying the
+Unicode Character Database, some of which have been compiled by the
+Unicode Consortium, and some of which have been supplied by other vendors.
+
+LIMITATIONS ON RIGHTS TO REDISTRIBUTE THIS DATA
+
+Recipient is granted the right to make copies in any form for internal
+distribution and to freely use the information supplied in the creation of
+products supporting the Unicode (TM) Standard. This file can be redistributed
+to third parties or other organizations (whether for profit or not) as long
+as this notice and the disclaimer notice are retained.
+
+EXPLANATORY INFORMATION
+
+The Unicode Character Database defines the default Unicode character
+properties, and internal mappings. Particular implementations may choose to
+override the properties and mappings that are not normative. If that is done,
+it is up to the implementer to establish a protocol to convey that
+information. For more information about character properties and mappings,
+see "The Unicode Standard, Worldwide Character Encoding, Version 2.0",
+published by Addison-Wesley. For information about other data files
+accompanying the Unicode Character Database, see the section of the
+Unicode Standard they were extracted from, or the explanatory readme
+files and/or header sections with those files.
+
+The Unicode Character Database is a plain ASCII text file consisting of lines
+containing fields terminated by semicolons. Each line represents the data for
+one encoded character in the Unicode Standard, Version 2.0. Every encoded
+character has a data entry, with the exception of certain special ranges, as
+detailed below.
+
+There are five special ranges of characters that are represented only by
+their start and end characters, since the properties in the file are uniform,
+except for code values (which are all sequential and assigned). The names of CJK
+ideograph characters and Hangul syllable characters are algorithmically
+derivable. (See the Unicode Standard for more information). Surrogate
+characters and private use characters have no names.
+
+The exact ranges represented by start and end characters are:
+
+  The CJK Ideographs Area (U+4E00 - U+9FFF)
+  The Hangul Syllables Area (U+AC00 - U+D7A3)
+  The Surrogates Area (U+D800 - U+DFFF)
+  The Private Use Area (U+E000 - U+F8FF)
+  CJK Compatibility Ideographs (U+F900 - U+FAFF)
+
+The following table describes the format and meaning of each field in a
+data entry in the Unicode Character Database. Fields which contain
+normative information are so indicated.
+
+Field   Explanation
+-----   -----------
+
+  0     Code value in 4-digit hexadecimal format.
+        This field is normative.
+
+  1     Unicode 2.0 Character Name. These names match exactly the
+        names published in Chapter 7 of the Unicode Standard.
+        This field is normative.
+
+  2     General Category. This is a useful breakdown into various "character
+        types" which can be used as a default categorization in implementations.
+        Some of the values are normative, and some are informative.
+        See below for a brief explanation.
+
+  3     Canonical Combining Classes. The classes used for the
+        Canonical Ordering Algorithm in the Unicode Standard. These
+        classes are also printed in Chapter 4 of the Unicode Standard.
+        This field is normative. See below for a brief explanation.
+
+  4     Bidirectional Category. See the list below for an explanation of the
+        abbreviations used in this field. These are the categories required
+        by the Bidirectional Behavior Algorithm in the Unicode Standard.
+        These categories are summarized in Chapter 4 of the Unicode Standard.
+        This field is normative.
+
+  5     Character Decomposition. In the Unicode Standard, Version 2.0, not all of
+        the decompositions are full decompositions. Recursive
+        application of look-up for decompositions will, in all cases, lead to
+        a maximal decomposition. The decompositions match exactly the
+        decompositions published with the character names in Chapter 7
+        of the Unicode Standard. This field is normative.
+
+  6     Decimal digit value. This is a numeric field. If the character
+        has the decimal digit property, as specified in Chapter 4 of
+        the Unicode Standard, the value of that digit is represented
+        with an integer value in this field. This field is normative.
+
+  7     Digit value. This is a numeric field. If the character represents a
+        digit, not necessarily a decimal digit, the value is here. This
+        covers digits which do not form decimal radix forms, such as the
+        compatibility superscript digits. This field is informative.
+
+  8     Numeric value. This is a numeric field. If the character has the
+        numeric property, as specified in Chapter 4 of the Unicode
+        Standard, the value of that character is represented with an
+        integer or rational number in this field. This includes fractions as,
+        e.g., "1/5" for U+2155 VULGAR FRACTION ONE FIFTH.
+        Also included are numerical values for compatibility characters
+        such as circled numbers. This field is normative.
+
+  9     If the characters has been identified as a "mirrored" character in
+        bidirectional text, this field has the value "Y"; otherwise "N".
+        The list of mirrored characters is also printed in Chapter 4 of
+        the Unicode Standard. This field is normative.
+
+ 10     Unicode 1.0 Name. This is the old name as published in Unicode 1.0.
+        This name is only provided when it is significantly different from
+        the Unicode 2.0 name for the character. This field is informative.
+
+ 11     10646 Comment field. This field is informative.
+
+ 12     Upper case equivalent mapping. If a character is part of an
+        alphabet with case distinctions, and has an upper case equivalent,
+        then the upper case equivalent is in this field. See the explanation
+        below on case distinctions. These mappings are always one-to-one,
+        not one-to-many or many-to-one. This field is informative.
+
+ 13     Lower case equivalent mapping. Similar to 12. This field is informative.
+
+ 14     Title case equivalent mapping. Similar to 12. This field is informative.
+
+GENERAL CATEGORY
+
+The values in this field are abbreviations for the following. Some of the
+values are normative, and some are informative. For more information, see
+the Unicode Standard.
+
+Normative
+    Mn = Mark, Non-Spacing
+    Mc = Mark, Combining
+    Nd = Number, Decimal Digit
+    No = Number, Other
+    Zs = Separator, Space
+    Zl = Separator, Line
+    Zp = Separator, Paragraph
+    Cc = Other, Control or Format
+    Co = Other, Private Use
+    Cn = Other, Not Assigned
+
+Informative
+    Lu = Letter, Uppercase
+    Ll = Letter, Lowercase
+    Lt = Letter, Titlecase
+    Lm = Letter, Modifier
+    Lo = Letter, Other
+    Pd = Punctuation, Dash
+    Ps = Punctuation, Open
+    Pe = Punctuation, Close
+    Po = Punctuation, Other
+    Sm = Symbol, Math
+    Sc = Symbol, Currency
+    So = Symbol, Other
+
+BIDIRECTIONAL PROPERTIES
+
+Please refer to the Unicode Standard for an explanation of the algorithm for
+Bidirectional Behavior and an explanation of the sigificance of these categories.
+These values are normative.
+
+Strong types:
+    L     Left-Right; Most alphabetic, syllabic, and logographic
+          characters (e.g., CJK ideographs)
+    R     Right-Left; Arabic, Hebrew, and
+          punctuation specific to those scripts
+Weak types:
+    EN    European Number
+    ES    European Number Separator
+    ET    European Number Terminator
+    AN    Arabic Number
+    CS    Common Number Separator
+
+Separators:
+    B     Block Separator
+    S     Segment Separator
+
+Neutrals:
+    WS    Whitespace
+    ON    Other Neutrals ; All other characters: punctuation, symbols
+
+CHARACTER DECOMPOSITION TAGS
+
+The decomposition is a normative property of a character. The tags supplied
+with certain decompositions generally indicate formatting information.
+Where no such tag is given, the decomposition is designated as canonical.
+Conversely, the presence of a formatting tag also indicates
+that the decomposition is a compatibility decomposition and not a canonical
+decomposition. In the absence of other formatting information in a
+compatibility decomposition, the tag <compat> is used to distinguish it from
+canonical decompositions.
+
+In some instances a canonical decomposition or a compatibility decomposition
+may consist of a single character. For a canonical decomposition, this
+indicates that the character is a canonical equivalent of another single
+character. For a compatibility decomposition, this indicates that the
+character is a compatibility equivalent of another single character.
+
+The compatibility formatting tags used are:
+
+    <font>      A font variant (e.g. a blackletter form).
+    <noBreak>   A no-break version of a space or hyphen.
+    <initial>   An initial presentation form (Arabic).
+    <medial>    A medial presentation form (Arabic).
+    <final>     A final presentation form (Arabic).
+    <isolated>  An isolated presentation form (Arabic).
+    <circle>    An encircled form.
+    <super>     A superscript form.
+    <sub>       A subscript form.
+    <vertical>  A vertical layout presentation form.
+    <wide>      A wide (or zenkaku) compatibility character.
+    <narrow>    A narrow (or hankaku) compatibility character.
+    <small>     A small variant form (CNS compatibility).
+    <square>    A CJK squared font variant.
+    <compat>    Otherwise unspecified compatibility character.
+
+CANONICAL COMBINING CLASSES
+
+  0: Spacing, enclosing, reordrant, and surrounding
+  1: Overlays and interior
+  6: Tibetan subjoined Letters
+  7: Nuktas
+  8: Hiragana/Katakana voiced marks
+  9: Viramas
+ 10: Start of fixed position classes
+199: End of fixed position classes
+200: Below left attached
+202: Below attached
+204: Below right attached
+208: Left attached (reordrant around single base character)
+210: Right attached
+212: Above left attached
+214: Above attached
+216: Above right attached
+218: Below left
+220: Below
+222: Below right
+224: Left (reordrant around single base character)
+226: Right
+228: Above left
+230: Above
+232: Above right
+234: Double above
+
+Note: some of the combining classes in this list do not currently have
+members but are specified here for completeness.
+
+CASE MAPPINGS
+
+In addition to uppercase and lowercase, because of the inclusion of certain
+composite characters for compatibility, such as "01F1;LATIN CAPITAL LETTER
+DZ", there is a third case, called titlecase, which is used where the first
+character of a word is to be capitalized (e.g. UPPERCASE, Titlecase,
+lowercase). An example of such a character is "01F2;LATIN CAPITAL LETTER D
+WITH SMALL LETTER Z".
+
+The uppercase, titlecase and lowercase fields are only included for characters
+that have a single corresponding character of that type. Composite characters
+(such as "339D;SQUARE CM") that do not have a single corresponding character
+of that type can be cased by decomposition.
+
+The case mapping is an informative, default mapping. Certain languages, such
+as Turkish, German, French, or Greek may have small deviations from the
+default mappings listed in the Unicode Character Database.
+
+MODIFICATION HISTORY
+
+Some of the modifications made in updating the Unicode Character Database
+for the Unicode Standard, Version 2.0 are:
+* Fixed decompositions with TONOS to use correct NSM: 030D.
+* Removed old Hangul Syllables; mapping to new characters are
+  in a separate table.
+* Marked compability decompositions with additional tags.
+* Changed old tag names for clarity.
+* Revision of decompositions to use first-level decomposition, instead
+  of maximal decomposition.
+* Correction of all known errors in decompositions from earlier versions.
+* Added control code names (as old Unicode names).
+* Added Hangul Jamo decompositions.
+* Added Number category to match properties list in book.
+* Fixed categories of Koranic Arabic marks.
+* Fixed categories of precomposed characters to match decomposition where possible.
+* Added Hebrew cantillation marks and the Tibetan script.
+* Added place holders for ranges such as CJK Ideographic Area and the
+  Private Use Area.
+* Eliminated "Nd" as a category.
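
The field table in the readme above (fields 0 through 14, separated by semicolons) can be read with a few lines of code. The following is a minimal sketch, not part of the commit: the names UnicodeEntry and parse_entry are hypothetical, the two sample lines are hand-written from the field descriptions rather than copied from the data file, and tolerating a trailing semicolon after field 14 is an assumption based on the phrase "fields terminated by semicolons".

```python
from fractions import Fraction
from typing import NamedTuple, Optional, Tuple


class UnicodeEntry(NamedTuple):
    """One data entry. Attribute names are illustrative, not from the readme."""
    code: int                          # field 0, hexadecimal code value
    name: str                          # field 1
    general_category: str              # field 2, e.g. "Lu", "Mn"
    combining_class: int               # field 3
    bidi_category: str                 # field 4, e.g. "L", "ON"
    decomposition_tag: Optional[str]   # field 5 tag such as "<compat>", if any
    decomposition: Tuple[int, ...]     # field 5 code values (empty if none)
    decimal_digit: Optional[int]       # field 6
    digit: Optional[int]               # field 7
    numeric: Optional[Fraction]        # field 8, integer or rational like "1/5"
    mirrored: bool                     # field 9, "Y" or "N"
    unicode_1_name: str                # field 10
    comment: str                       # field 11
    upper: Optional[int]               # field 12
    lower: Optional[int]               # field 13
    title: Optional[int]               # field 14


def _hex(field: str) -> Optional[int]:
    """Parse a hexadecimal code value, or return None for an empty field."""
    return int(field, 16) if field else None


def _dec(field: str) -> Optional[int]:
    """Parse a decimal integer, or return None for an empty field."""
    return int(field) if field else None


def parse_entry(line: str) -> UnicodeEntry:
    fields = line.rstrip("\r\n").split(";")
    # Assumption: accept one trailing semicolon after field 14.
    if len(fields) == 16 and fields[15] == "":
        fields.pop()
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")

    # Field 5: an optional <tag> followed by hexadecimal code values.
    # No tag means the decomposition (if present) is canonical.
    parts = fields[5].split()
    tag = parts.pop(0) if parts and parts[0].startswith("<") else None

    return UnicodeEntry(
        code=int(fields[0], 16),
        name=fields[1],
        general_category=fields[2],
        combining_class=int(fields[3]),
        bidi_category=fields[4],
        decomposition_tag=tag,
        decomposition=tuple(int(p, 16) for p in parts),
        decimal_digit=_dec(fields[6]),
        digit=_dec(fields[7]),
        numeric=Fraction(fields[8]) if fields[8] else None,
        mirrored=(fields[9] == "Y"),
        unicode_1_name=fields[10],
        comment=fields[11],
        upper=_hex(fields[12]),
        lower=_hex(fields[13]),
        title=_hex(fields[14]),
    )


# Hand-written sample lines for illustration only (values are assumptions).
SAMPLE_TITLECASE = (
    "01F2;LATIN CAPITAL LETTER D WITH SMALL LETTER Z;Lt;0;L;"
    "<compat> 0044 007A;;;;N;;;01F1;01F3;01F2"
)
SAMPLE_FRACTION = (
    "2155;VULGAR FRACTION ONE FIFTH;No;0;ON;"
    "<compat> 0031 2044 0035;;;1/5;N;;;;;"
)

if __name__ == "__main__":
    for sample in (SAMPLE_TITLECASE, SAMPLE_FRACTION):
        entry = parse_entry(sample)
        print(f"U+{entry.code:04X} {entry.general_category} {entry.name}")
```

A NamedTuple keeps each entry immutable and lets callers use either attribute names or the readme's field order. The five special ranges, which the file represents only by their start and end entries, are not expanded here and would need separate handling.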