author Jarkko Hietaniemi <jhi@iki.fi> 2001-10-21 13:36:40 +0000
committer Jarkko Hietaniemi <jhi@iki.fi> 2001-10-21 13:36:40 +0000
commit e9ad1727960211698dc6e5554115c0dbf4254536
tree 4c36724f4a0ff8e809f8c8ab4643935358c75a81 /pod
parent 8a93676d2b6d9cfcd46e9efcc3c94cc624b3b332
Prettyprinting.
p4raw-id: //depot/perl@12543
Diffstat (limited to 'pod')
-rw-r--r-- pod/perlunicode.pod 254
1 files changed, 129 insertions, 125 deletions
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 6bd0423c68..4e7c936b20 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -254,7 +254,8 @@ The following reserved ranges have C<In> tests:
Plane 16 Private Use
For example C<"\x{AC00}" =~ \p{HangulSyllable}> will test true.
-(Handling of surrogates is not implemented yet.)
+(Handling of surrogates is not implemented yet, because Perl
+uses UTF-8 and not UTF-16 internally to represent Unicode.)
Additionally, because scripts differ in their directionality
(for example Hebrew is written right to left), all characters
@@ -285,66 +286,66 @@ have their directionality defined:
The scripts available for C<\p{In...}> and C<\P{In...}>, for example
\p{InCyrillic>, are as follows, for example C<\p{InLatin}> or C<\P{InHan}>:
- Latin
- Greek
- Cyrillic
- Armenian
- Hebrew
Arabic
- Syriac
- Thaana
- Devanagari
+ Armenian
Bengali
- Gurmukhi
+ Bopomofo
+ Canadian-Aboriginal
+ Cherokee
+ Cyrillic
+ Deseret
+ Devanagari
+ Ethiopic
+ Georgian
+ Gothic
+ Greek
Gujarati
- Oriya
- Tamil
- Telugu
+ Gurmukhi
+ Han
+ Hangul
+ Hebrew
+ Hiragana
+ Inherited
Kannada
- Malayalam
- Sinhala
- Thai
+ Katakana
+ Khmer
Lao
- Tibetan
+ Latin
+ Malayalam
+ Mongolian
Myanmar
- Georgian
- Hangul
- Ethiopic
- Cherokee
- Canadian Aboriginal
Ogham
+ Old-Italic
+ Oriya
Runic
- Khmer
- Mongolian
- Hiragana
- Katakana
- Bopomofo
- Han
+ Sinhala
+ Syriac
+ Tamil
+ Telugu
+ Thaana
+ Thai
+ Tibetan
Yi
- Old Italic
- Gothic
- Deseret
- Inherited
There are also extended property classes that supplement the basic
properties, defined by the F<PropList> Unicode database:
- White_space
+ ASCII_Hex_Digit
Bidi_Control
- Join_Control
Dash
- Hyphen
- Quotation_Mark
- Other_Math
- Hex_Digit
- ASCII_Hex_Digit
- Other_Alphabetic
- Ideographic
Diacritic
Extender
+ Hex_Digit
+ Hyphen
+ Ideographic
+ Join_Control
+ Noncharacter_Code_Point
+ Other_Alphabetic
Other_Lowercase
+ Other_Math
Other_Uppercase
- Noncharacter_Code_Point
+ Quotation_Mark
+ White_space
and further derived properties:
@@ -365,11 +366,14 @@ and further derived properties:
In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
-former concept is closer to natural languages, while the latter
+scripts concept is closer to natural languages, while the blocks
concept is more an artificial grouping based on groups of 256 Unicode
characters. For example, the C<Latin> script contains letters from
-many blocks, but it does not contain all the characters from those
-blocks, it does not for example contain digits.
+many blocks. On the other hand, the C<Latin> script does not contain
+all the characters from those blocks, it does not for example contain
+digits because digits are shared across many scripts. Digits and
+other similar groups, like punctuation, are in a category called
+C<Common>.
For more about scripts see the UTR #24:
http://www.unicode.org/unicode/reports/tr24/
@@ -386,102 +390,102 @@ preferential Unicode character class definition; this meant that
the definitions of some character classes changed (the ones in the
below list that have the C<Block> appended).
+ Alphabetic Presentation Forms
+ Arabic Block
+ Arabic Presentation Forms-A
+ Arabic Presentation Forms-B
+ Armenian Block
+ Arrows
Basic Latin
- Latin 1 Supplement
- Latin Extended-A
- Latin Extended-B
- IPA Extensions
- Spacing Modifier Letters
+ Bengali Block
+ Block Elements
+ Bopomofo Block
+ Bopomofo Extended
+ Box Drawing
+ Braille Patterns
+ Byzantine Musical Symbols
+ CJK Compatibility
+ CJK Compatibility Forms
+ CJK Compatibility Ideographs
+ CJK Compatibility Ideographs Supplement
+ CJK Radicals Supplement
+ CJK Symbols and Punctuation
+ CJK Unified Ideographs
+ CJK Unified Ideographs Extension A
+ CJK Unified Ideographs Extension B
+ Cherokee Block
Combining Diacritical Marks
- Greek Block
+ Combining Half Marks
+ Combining Marks for Symbols
+ Control Pictures
+ Currency Symbols
Cyrillic Block
- Armenian Block
- Hebrew Block
- Arabic Block
- Syriac Block
- Thaana Block
+ Deseret Block
Devanagari Block
- Bengali Block
- Gurmukhi Block
- Gujarati Block
- Oriya Block
- Tamil Block
- Telugu Block
- Kannada Block
- Malayalam Block
- Sinhala Block
- Thai Block
- Lao Block
- Tibetan Block
- Myanmar Block
+ Dingbats
+ Enclosed Alphanumerics
+ Enclosed CJK Letters and Months
+ Ethiopic Block
+ General Punctuation
+ Geometric Shapes
Georgian Block
+ Gothic Block
+ Greek Block
+ Greek Extended
+ Gujarati Block
+ Gurmukhi Block
+ Halfwidth and Fullwidth Forms
+ Hangul Compatibility Jamo
Hangul Jamo
- Ethiopic Block
- Cherokee Block
- Unified Canadian Aboriginal Syllabics
- Ogham Block
- Runic Block
+ Hangul Syllables
+ Hebrew Block
+ High Private Use Surrogates
+ High Surrogates
+ Hiragana Block
+ IPA Extensions
+ Ideographic Description Characters
+ Kanbun
+ Kangxi Radicals
+ Kannada Block
+ Katakana Block
Khmer Block
- Mongolian Block
+ Lao Block
+ Latin 1 Supplement
Latin Extended Additional
- Greek Extended
- General Punctuation
- Superscripts and Subscripts
- Currency Symbols
- Combining Marks for Symbols
+ Latin Extended-A
+ Latin Extended-B
Letterlike Symbols
- Number Forms
- Arrows
+ Low Surrogates
+ Malayalam Block
+ Mathematical Alphanumeric Symbols
Mathematical Operators
+ Miscellaneous Symbols
Miscellaneous Technical
- Control Pictures
+ Mongolian Block
+ Musical Symbols
+ Myanmar Block
+ Number Forms
+ Ogham Block
+ Old Italic Block
Optical Character Recognition
- Enclosed Alphanumerics
- Box Drawing
- Block Elements
- Geometric Shapes
- Miscellaneous Symbols
- Dingbats
- Braille Patterns
- CJK Radicals Supplement
- Kangxi Radicals
- Ideographic Description Characters
- CJK Symbols and Punctuation
- Hiragana Block
- Katakana Block
- Bopomofo Block
- Hangul Compatibility Jamo
- Kanbun
- Bopomofo Extended
- Enclosed CJK Letters and Months
- CJK Compatibility
- CJK Unified Ideographs Extension A
- CJK Unified Ideographs
- Yi Syllables
- Yi Radicals
- Hangul Syllables
- High Surrogates
- High Private Use Surrogates
- Low Surrogates
+ Oriya Block
Private Use
- CJK Compatibility Ideographs
- Alphabetic Presentation Forms
- Arabic Presentation Forms-A
- Combining Half Marks
- CJK Compatibility Forms
+ Runic Block
+ Sinhala Block
Small Form Variants
- Arabic Presentation Forms-B
+ Spacing Modifier Letters
Specials
- Halfwidth and Fullwidth Forms
- Old Italic Block
- Gothic Block
- Deseret Block
- Byzantine Musical Symbols
- Musical Symbols
- Mathematical Alphanumeric Symbols
- CJK Unified Ideographs Extension B
- CJK Compatibility Ideographs Supplement
+ Superscripts and Subscripts
+ Syriac Block
Tags
+ Tamil Block
+ Telugu Block
+ Thaana Block
+ Thai Block
+ Tibetan Block
+ Unified Canadian Aboriginal Syllabics
+ Yi Radicals
+ Yi Syllables
=item *