Diffstat (limited to 'lib/unicode/readme.txt')
-rwxr-xr-x lib/unicode/readme.txt 301
1 file changed, 301 insertions, 0 deletions
diff --git a/lib/unicode/readme.txt b/lib/unicode/readme.txt
new file mode 100755
index 0000000000..5f908d3067
--- /dev/null
+++ b/lib/unicode/readme.txt
@@ -0,0 +1,301 @@
+
+UNICODE 2.0 CHARACTER DATABASE
+
+Copyright (c) 1991-1996 Unicode, Inc.
+All Rights reserved.
+
+DISCLAIMER
+
+The Unicode Character Database "UNIDATA2.TXT" is provided as-is by
+Unicode, Inc. (The Unicode Consortium). No claims are made as to fitness for any
+particular purpose. No warranties of any kind are expressed or implied. The
+recipient agrees to determine applicability of information provided. If this
+file has been purchased on magnetic or optical media from Unicode, Inc.,
+the sole remedy for any claim will be exchange of defective media within
+90 days of receipt.
+
+This disclaimer is applicable for all other data files accompanying the
+Unicode Character Database, some of which have been compiled by the
+Unicode Consortium, and some of which have been supplied by other vendors.
+
+LIMITATIONS ON RIGHTS TO REDISTRIBUTE THIS DATA
+
+Recipient is granted the right to make copies in any form for internal
+distribution and to freely use the information supplied in the creation of
+products supporting the Unicode (TM) Standard. This file can be redistributed
+to third parties or other organizations (whether for profit or not) as long
+as this notice and the disclaimer notice are retained.
+
+EXPLANATORY INFORMATION
+
+The Unicode Character Database defines the default Unicode character
+properties and internal mappings. Particular implementations may choose to
+override the properties and mappings that are not normative. If that is done,
+it is up to the implementer to establish a protocol to convey that
+information. For more information about character properties and mappings,
+see "The Unicode Standard, Worldwide Character Encoding, Version 2.0",
+published by Addison-Wesley. For information about other data files
+accompanying the Unicode Character Database, see the section of the
+Unicode Standard they were extracted from, or the explanatory readme
+files and/or header sections with those files.
+
+The Unicode Character Database is a plain ASCII text file consisting of lines
+containing fields terminated by semicolons. Each line represents the data for
+one encoded character in the Unicode Standard, Version 2.0. Every encoded
+character has a data entry, with the exception of certain special ranges, as
+detailed below.
+
+There are five special ranges of characters that are represented only by
+their start and end characters, since the properties in the file are uniform,
+except for code values (which are all sequential and assigned). The names of CJK
+ideograph characters and Hangul syllable characters are algorithmically
+derivable. (See the Unicode Standard for more information.) Surrogate
+characters and private use characters have no names.
+
+The exact ranges represented by start and end characters are:
+
+ The CJK Ideographs Area (U+4E00 - U+9FFF)
+ The Hangul Syllables Area (U+AC00 - U+D7A3)
+ The Surrogates Area (U+D800 - U+DFFF)
+ The Private Use Area (U+E000 - U+F8FF)
+ CJK Compatibility Ideographs (U+F900 - U+FAFF)
+
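Editor's sketch (not part of the committed readme): because the five ranges above appear in the data file only as start and end entries, a reader of the file typically needs a range check before falling back to per-character lookup. The following is a minimal Python illustration under that assumption; the names `SPECIAL_RANGES` and `special_range` are hypothetical, and the boundaries are copied from the list above.

```python
# Special ranges as listed in the readme. Each one is represented in the
# data file only by its start and end entries.
SPECIAL_RANGES = [
    (0x4E00, 0x9FFF, "CJK Ideographs Area"),
    (0xAC00, 0xD7A3, "Hangul Syllables Area"),
    (0xD800, 0xDFFF, "Surrogates Area"),
    (0xE000, 0xF8FF, "Private Use Area"),
    (0xF900, 0xFAFF, "CJK Compatibility Ideographs"),
]


def special_range(code_point):
    """Return the name of the special range containing code_point, or None."""
    for start, end, name in SPECIAL_RANGES:
        if start <= code_point <= end:
            return name
    return None


if __name__ == "__main__":
    for cp in (0x0041, 0x4E00, 0xD7A3, 0xE000):
        print(f"U+{cp:04X}", special_range(cp))
```

A linear scan is enough for five ranges; a larger table would probably call for a sorted list and binary search.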
+The following table describes the format and meaning of each field in a
+data entry in the Unicode Character Database. Fields which contain
+normative information are so indicated.
+
+Field Explanation
+----- -----------
+
+ 0 Code value in 4-digit hexadecimal format.
+ This field is normative.
+
+ 1 Unicode 2.0 Character Name. These names match exactly the
+ names published in Chapter 7 of the Unicode Standard.
+ This field is normative.
+
+ 2 General Category. This is a useful breakdown into various "character
+ types" which can be used as a default categorization in implementations.
+ Some of the values are normative, and some are informative.
+ See below for a brief explanation.
+
+ 3 Canonical Combining Classes. The classes used for the
+ Canonical Ordering Algorithm in the Unicode Standard. These
+ classes are also printed in Chapter 4 of the Unicode Standard.
+ This field is normative. See below for a brief explanation.
+
+ 4 Bidirectional Category. See the list below for an explanation of the
+ abbreviations used in this field. These are the categories required
+ by the Bidirectional Behavior Algorithm in the Unicode Standard.
+ These categories are summarized in Chapter 4 of the Unicode Standard.
+ This field is normative.
+
+ 5 Character Decomposition. In the Unicode Standard, Version 2.0, not all of
+ the decompositions are full decompositions. Recursive
+ application of look-up for decompositions will, in all cases, lead to
+ a maximal decomposition. The decompositions match exactly the
+ decompositions published with the character names in Chapter 7
+ of the Unicode Standard. This field is normative.
+
+ 6 Decimal digit value. This is a numeric field. If the character
+ has the decimal digit property, as specified in Chapter 4 of
+ the Unicode Standard, the value of that digit is represented
+ with an integer value in this field. This field is normative.
+
+ 7 Digit value. This is a numeric field. If the character represents a
+ digit, not necessarily a decimal digit, the value is here. This
+ covers digits which do not form decimal radix forms, such as the
+ compatibility superscript digits. This field is informative.
+
+ 8 Numeric value. This is a numeric field. If the character has the
+ numeric property, as specified in Chapter 4 of the Unicode
+ Standard, the value of that character is represented with an
+ integer or rational number in this field. This includes fractions
+ such as "1/5" for U+2155 VULGAR FRACTION ONE FIFTH.
+ Also included are numerical values for compatibility characters
+ such as circled numbers. This field is normative.
+
+ 9 If the character has been identified as a "mirrored" character in
+ bidirectional text, this field has the value "Y"; otherwise "N".
+ The list of mirrored characters is also printed in Chapter 4 of
+ the Unicode Standard. This field is normative.
+
+ 10 Unicode 1.0 Name. This is the old name as published in Unicode 1.0.
+ This name is only provided when it is significantly different from
+ the Unicode 2.0 name for the character. This field is informative.
+
+ 11 ISO 10646 comment field. This field is informative.
+
+ 12 Upper case equivalent mapping. If a character is part of an
+ alphabet with case distinctions, and has an upper case equivalent,
+ then the upper case equivalent is in this field. See the explanation
+ below on case distinctions. These mappings are always one-to-one,
+ not one-to-many or many-to-one. This field is informative.
+
+ 13 Lower case equivalent mapping. Similar to 12. This field is informative.
+
+ 14 Title case equivalent mapping. Similar to 12. This field is informative.
+
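Editor's sketch (not part of the committed readme): the field table above can be turned into a small record parser. This is a hedged Python illustration, not the Consortium's reference code. The names `UnicodeRecord` and `parse_record` are hypothetical, and the sample line is written to match the field layout rather than copied from the data file. It assumes 15 fields separated by semicolons and also tolerates a terminating semicolon after the last field.

```python
from fractions import Fraction
from typing import NamedTuple, Optional


class UnicodeRecord(NamedTuple):
    """One data entry; attribute order follows fields 0-14 of the readme."""
    code: int
    name: str
    general_category: str
    combining_class: int
    bidi_category: str
    decomposition: str
    decimal_digit: Optional[int]
    digit: Optional[int]
    numeric: Optional[Fraction]
    mirrored: bool
    unicode_1_name: str
    comment: str
    upper: Optional[int]
    lower: Optional[int]
    title: Optional[int]


def _hex(value):
    # Code values are hexadecimal; an empty field means "no mapping".
    return int(value, 16) if value else None


def _dec(value):
    # Digit fields are decimal integers; an empty field means "not a digit".
    return int(value) if value else None


def parse_record(line):
    fields = line.rstrip("\r\n").split(";")
    # Assumption: accept an optional terminating semicolon after field 14.
    if len(fields) == 16 and fields[-1] == "":
        fields.pop()
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    return UnicodeRecord(
        code=int(fields[0], 16),
        name=fields[1],
        general_category=fields[2],
        combining_class=int(fields[3]),
        bidi_category=fields[4],
        decomposition=fields[5],
        decimal_digit=_dec(fields[6]),
        digit=_dec(fields[7]),
        # Field 8 may hold an integer or a rational such as "1/5".
        numeric=Fraction(fields[8]) if fields[8] else None,
        mirrored=fields[9] == "Y",
        unicode_1_name=fields[10],
        comment=fields[11],
        upper=_hex(fields[12]),
        lower=_hex(fields[13]),
        title=_hex(fields[14]),
    )


if __name__ == "__main__":
    # Illustrative line matching the layout above.
    record = parse_record("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
    print(record.name, record.general_category, record.lower)
```

Keeping the numeric value as a `Fraction` avoids the rounding that a float would introduce for entries such as "1/5".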
+GENERAL CATEGORY
+
+The values in this field are abbreviations for the following. Some of the
+values are normative, and some are informative. For more information, see
+the Unicode Standard.
+
+Normative
+ Mn = Mark, Non-Spacing
+ Mc = Mark, Combining
+ Nd = Number, Decimal Digit
+ No = Number, Other
+ Zs = Separator, Space
+ Zl = Separator, Line
+ Zp = Separator, Paragraph
+ Cc = Other, Control or Format
+ Co = Other, Private Use
+ Cn = Other, Not Assigned
+
+Informative
+ Lu = Letter, Uppercase
+ Ll = Letter, Lowercase
+ Lt = Letter, Titlecase
+ Lm = Letter, Modifier
+ Lo = Letter, Other
+ Pd = Punctuation, Dash
+ Ps = Punctuation, Open
+ Pe = Punctuation, Close
+ Po = Punctuation, Other
+ Sm = Symbol, Math
+ Sc = Symbol, Currency
+ So = Symbol, Other
+
+BIDIRECTIONAL PROPERTIES
+
+Please refer to the Unicode Standard for an explanation of the algorithm for
+Bidirectional Behavior and an explanation of the significance of these categories.
+These values are normative.
+
+Strong types:
+ L Left-Right; Most alphabetic, syllabic, and logographic
+ characters (e.g., CJK ideographs)
+ R Right-Left; Arabic, Hebrew, and
+ punctuation specific to those scripts
+Weak types:
+ EN European Number
+ ES European Number Separator
+ ET European Number Terminator
+ AN Arabic Number
+ CS Common Number Separator
+
+Separators:
+ B Block Separator
+ S Segment Separator
+
+Neutrals:
+ WS Whitespace
+ ON Other Neutrals ; All other characters: punctuation, symbols
+
+CHARACTER DECOMPOSITION TAGS
+
+The decomposition is a normative property of a character. The tags supplied
+with certain decompositions generally indicate formatting information.
+Where no such tag is given, the decomposition is designated as canonical.
+Conversely, the presence of a formatting tag also indicates
+that the decomposition is a compatibility decomposition and not a canonical
+decomposition. In the absence of other formatting information in a
+compatibility decomposition, the tag <compat> is used to distinguish it from
+canonical decompositions.
+
+In some instances a canonical decomposition or a compatibility decomposition
+may consist of a single character. For a canonical decomposition, this
+indicates that the character is a canonical equivalent of another single
+character. For a compatibility decomposition, this indicates that the
+character is a compatibility equivalent of another single character.
+
+The compatibility formatting tags used are:
+
+ <font> A font variant (e.g. a blackletter form).
+ <noBreak> A no-break version of a space or hyphen.
+ <initial> An initial presentation form (Arabic).
+ <medial> A medial presentation form (Arabic).
+ <final> A final presentation form (Arabic).
+ <isolated> An isolated presentation form (Arabic).
+ <circle> An encircled form.
+ <super> A superscript form.
+ <sub> A subscript form.
+ <vertical> A vertical layout presentation form.
+ <wide> A wide (or zenkaku) compatibility character.
+ <narrow> A narrow (or hankaku) compatibility character.
+ <small> A small variant form (CNS compatibility).
+ <square> A CJK squared font variant.
+ <compat> Otherwise unspecified compatibility character.
+
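Editor's sketch (not part of the committed readme): the tag rules above, together with the statement under field 5 that recursive lookup leads to a maximal decomposition, can be illustrated in a few lines of Python. The function names and the small `SAMPLE` table are hypothetical. The table entries are illustrative values in the field 5 format, not an extract of the data file.

```python
import re

# A leading "<tag>" marks a compatibility decomposition; no tag means canonical.
_TAG = re.compile(r"^<(\w+)>\s*")


def parse_decomposition(field):
    """Split a field 5 value into (tag, code_points); tag is None if canonical."""
    if not field:
        return None, []
    tag = None
    match = _TAG.match(field)
    if match:
        tag = match.group(1)
        field = field[match.end():]
    return tag, [int(cp, 16) for cp in field.split()]


def decompose(code_point, table, compat=False):
    """Recursively expand code_point using table {code point: field 5 string}.

    Compatibility decompositions are followed only when compat is True.
    """
    tag, parts = parse_decomposition(table.get(code_point, ""))
    if not parts or (tag is not None and not compat):
        return [code_point]
    expanded = []
    for part in parts:
        expanded.extend(decompose(part, table, compat))
    return expanded


# Illustrative entries only.
SAMPLE = {
    0x01D5: "00DC 0304",     # first-level: U WITH DIAERESIS, then MACRON
    0x00DC: "0055 0308",     # U, then DIAERESIS
    0x00B2: "<super> 0032",  # compatibility form of DIGIT TWO
}

if __name__ == "__main__":
    print([f"{cp:04X}" for cp in decompose(0x01D5, SAMPLE)])
    print([f"{cp:04X}" for cp in decompose(0x00B2, SAMPLE, compat=True)])
```

Because the file stores first-level decompositions, the recursion is what produces the maximal form; a single lookup would stop at the first level.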
+CANONICAL COMBINING CLASSES
+
+ 0: Spacing, enclosing, reordrant, and surrounding
+ 1: Overlays and interior
+ 6: Tibetan subjoined Letters
+ 7: Nuktas
+ 8: Hiragana/Katakana voiced marks
+ 9: Viramas
+ 10: Start of fixed position classes
+199: End of fixed position classes
+200: Below left attached
+202: Below attached
+204: Below right attached
+208: Left attached (reordrant around single base character)
+210: Right attached
+212: Above left attached
+214: Above attached
+216: Above right attached
+218: Below left
+220: Below
+222: Below right
+224: Left (reordrant around single base character)
+226: Right
+228: Above left
+230: Above
+232: Above right
+234: Double above
+
+Note: some of the combining classes in this list do not currently have
+members but are specified here for completeness.
+
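Editor's sketch (not part of the committed readme): the combining classes above feed the Canonical Ordering Algorithm, for which the Unicode Standard is the authoritative description. One common way to express the ordering step is a stable sort, by class, of each run of characters whose class is non-zero. The Python below illustrates that idea. The function name and the two-entry class table are hypothetical and chosen only for the example.

```python
def canonical_order(code_points, combining_class):
    """Return code_points with each run of combining marks sorted by class.

    combining_class maps code point -> class; missing entries count as 0.
    Characters of class 0 never move, and marks never cross them.
    """
    result = list(code_points)
    n = len(result)
    start = 0
    while start < n:
        if combining_class.get(result[start], 0) == 0:
            start += 1
            continue
        # Find the end of this run of non-zero-class characters.
        end = start
        while end < n and combining_class.get(result[end], 0) != 0:
            end += 1
        # sorted() is stable, so marks of equal class keep their order.
        result[start:end] = sorted(
            result[start:end], key=lambda cp: combining_class.get(cp, 0)
        )
        start = end
    return result


# Illustrative classes: 220 = Below, 230 = Above (see the list above).
SAMPLE_CLASSES = {0x0323: 220, 0x0301: 230}

if __name__ == "__main__":
    ordered = canonical_order([0x0061, 0x0301, 0x0323], SAMPLE_CLASSES)
    print([f"{cp:04X}" for cp in ordered])
```

Stability matters here: two marks with the same class interact typographically, so their relative order must be preserved.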
+CASE MAPPINGS
+
+In addition to uppercase and lowercase, because of the inclusion of certain
+composite characters for compatibility, such as "01F1;LATIN CAPITAL LETTER
+DZ", there is a third case, called titlecase, which is used where the first
+character of a word is to be capitalized (e.g. UPPERCASE, Titlecase,
+lowercase). An example of such a character is "01F2;LATIN CAPITAL LETTER D
+WITH SMALL LETTER Z".
+
+The uppercase, titlecase and lowercase fields are only included for characters
+that have a single corresponding character of that type. Composite characters
+(such as "339D;SQUARE CM") that do not have a single corresponding character
+of that type can be cased by decomposition.
+
+The case mapping is an informative, default mapping. Certain languages, such
+as Turkish, German, French, or Greek may have small deviations from the
+default mappings listed in the Unicode Character Database.
+
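Editor's sketch (not part of the committed readme): fields 12 to 14 give at most one mapped character per case, so a default case lookup reduces to a table access. The Python below is a minimal illustration. `CASE_TABLE` and `map_case` are hypothetical names, the entries for the DZ digraph are illustrative, and returning the character unchanged when a field is empty is an assumption of this sketch, not something the readme specifies.

```python
# code point -> (upper, lower, title); None means the field is empty.
# Illustrative entries for the DZ digraph discussed above.
CASE_TABLE = {
    0x01F1: (None, 0x01F3, 0x01F2),    # LATIN CAPITAL LETTER DZ
    0x01F2: (0x01F1, 0x01F3, None),    # CAPITAL LETTER D WITH SMALL LETTER Z
    0x01F3: (0x01F1, None, 0x01F2),    # LATIN SMALL LETTER DZ
}

_FIELD_INDEX = {"upper": 0, "lower": 1, "title": 2}


def map_case(code_point, kind, table=CASE_TABLE):
    """Return the default single-character mapping for kind.

    Assumption: when the field is empty, the character maps to itself.
    """
    index = _FIELD_INDEX[kind]
    entry = table.get(code_point)
    if entry is None or entry[index] is None:
        return code_point
    return entry[index]


if __name__ == "__main__":
    for kind in ("upper", "lower", "title"):
        print(kind, f"{map_case(0x01F3, kind):04X}")
```

Language-specific deviations, such as those mentioned for Turkish, would need an override layer on top of this default table.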
+MODIFICATION HISTORY
+
+Some of the modifications made in updating the Unicode Character Database
+for the Unicode Standard, Version 2.0 are:
+* Fixed decompositions with TONOS to use correct NSM: 030D.
+* Removed old Hangul Syllables; mappings to new characters are
+ in a separate table.
+* Marked compatibility decompositions with additional tags.
+* Changed old tag names for clarity.
+* Revision of decompositions to use first-level decomposition, instead
+ of maximal decomposition.
+* Correction of all known errors in decompositions from earlier versions.
+* Added control code names (as old Unicode names).
+* Added Hangul Jamo decompositions.
+* Added Number category to match properties list in book.
+* Fixed categories of Koranic Arabic marks.
+* Fixed categories of precomposed characters to match decomposition where possible.
+* Added Hebrew cantillation marks and the Tibetan script.
+* Added place holders for ranges such as CJK Ideographic Area and the
+ Private Use Area.
+* Eliminated "Nd" as a category.