Diffstat (limited to 'pod/perlunicode.pod')
-rw-r--r-- pod/perlunicode.pod 285
1 files changed, 178 insertions, 107 deletions
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index d629cabe9f..877b497613 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -105,13 +105,14 @@ bytes change to operating on characters. For ASCII data this makes
no difference, because UTF-8 stores ASCII in single bytes, but for
any character greater than C<chr(127)>, the character may be stored in
a sequence of two or more bytes, all of which have the high bit set.
-For C1 controls or Latin 1 characters on an EBCDIC platform the character
-may be stored in a UTF-EBCDIC multi byte sequence.
-But by and large, the user need not worry about this, because Perl
-hides it from the user. A character in Perl is logically just a number
-ranging from 0 to 2**32 or so. Larger characters encode to longer
-sequences of bytes internally, but again, this is just an internal
-detail which is hidden at the Perl level.
+
+For C1 controls or Latin 1 characters on an EBCDIC platform the
+character may be stored in a UTF-EBCDIC multi byte sequence. But by
+and large, the user need not worry about this, because Perl hides it
+from the user. A character in Perl is logically just a number ranging
+from 0 to 2**32 or so. Larger characters encode to longer sequences
+of bytes internally, but again, this is just an internal detail which
+is hidden at the Perl level.
=head2 Effects of character semantics
@@ -166,7 +167,8 @@ with all non-alphanumeric characters removed, for example the block
name C<"Latin-1 Supplement"> becomes C<\p{InLatin1Supplement}>.
Here is the list as of Unicode 3.1.0 (the two-letter classes) and
-Perl 5.8.0 (the one-letter classes):
+as defined by Perl (the one-letter classes); in Unicode materials,
+what Perl calls C<L> is often called C<L&>:
L Letter
Lu Letter, Uppercase
@@ -232,105 +234,174 @@ have their directionality defined:
BidiWS Whitespace
BidiON Other Neutrals
-The blocks available for C<\p{InBlock}> and C<\P{InBlock}>, for
-example \p{InCyrillic>, are as follows:
-
- BasicLatin
- Latin1Supplement
- LatinExtendedA
- LatinExtendedB
- IPAExtensions
- SpacingModifierLetters
- CombiningDiacriticalMarks
- Greek
- Cyrillic
- Armenian
- Hebrew
- Arabic
- Syriac
- Thaana
- Devanagari
- Bengali
- Gurmukhi
- Gujarati
- Oriya
- Tamil
- Telugu
- Kannada
- Malayalam
- Sinhala
- Thai
- Lao
- Tibetan
- Myanmar
- Georgian
- HangulJamo
- Ethiopic
- Cherokee
- UnifiedCanadianAboriginalSyllabics
- Ogham
- Runic
- Khmer
- Mongolian
- LatinExtendedAdditional
- GreekExtended
- GeneralPunctuation
- SuperscriptsandSubscripts
- CurrencySymbols
- CombiningMarksforSymbols
- LetterlikeSymbols
- NumberForms
- Arrows
- MathematicalOperators
- MiscellaneousTechnical
- ControlPictures
- OpticalCharacterRecognition
- EnclosedAlphanumerics
- BoxDrawing
- BlockElements
- GeometricShapes
- MiscellaneousSymbols
- Dingbats
- BraillePatterns
- CJKRadicalsSupplement
- KangxiRadicals
- IdeographicDescriptionCharacters
- CJKSymbolsandPunctuation
- Hiragana
- Katakana
- Bopomofo
- HangulCompatibilityJamo
- Kanbun
- BopomofoExtended
- EnclosedCJKLettersandMonths
- CJKCompatibility
- CJKUnifiedIdeographsExtensionA
- CJKUnifiedIdeographs
- YiSyllables
- YiRadicals
- HangulSyllables
- HighSurrogates
- HighPrivateUseSurrogates
- LowSurrogates
- PrivateUse
- CJKCompatibilityIdeographs
- AlphabeticPresentationForms
- ArabicPresentationFormsA
- CombiningHalfMarks
- CJKCompatibilityForms
- SmallFormVariants
- ArabicPresentationFormsB
- Specials
- HalfwidthandFullwidthForms
- OldItalic
- Gothic
- Deseret
- ByzantineMusicalSymbols
- MusicalSymbols
- MathematicalAlphanumericSymbols
- CJKUnifiedIdeographsExtensionB
- CJKCompatibilityIdeographsSupplement
- Tags
+=head2 Scripts
+
+The scripts available for C<\p{In...}> and C<\P{In...}>, for
+example C<\p{InCyrillic}>, C<\p{InLatin}>, or C<\P{InHan}>, are as
+follows:
+
+ Latin
+ Greek
+ Cyrillic
+ Armenian
+ Hebrew
+ Arabic
+ Syriac
+ Thaana
+ Devanagari
+ Bengali
+ Gurmukhi
+ Gujarati
+ Oriya
+ Tamil
+ Telugu
+ Kannada
+ Malayalam
+ Sinhala
+ Thai
+ Lao
+ Tibetan
+ Myanmar
+ Georgian
+ Hangul
+ Ethiopic
+ Cherokee
+ CanadianAboriginal
+ Ogham
+ Runic
+ Khmer
+ Mongolian
+ Hiragana
+ Katakana
+ Bopomofo
+ Han
+ Yi
+ OldItalic
+ Gothic
+ Deseret
+ Inherited
+
+=head2 Blocks
+
+In addition to B<scripts>, Unicode also defines B<blocks> of
+characters. The difference between scripts and blocks is that the
+former concept is closer to natural languages, while the latter
+concept is more an artificial grouping based on groups of 256 Unicode
+characters. For example, the C<Latin> script contains letters from
+many blocks, but it does not contain all the characters from those
+blocks; it does not, for example, contain digits.
+
+For more about scripts, see UTR #24:
+http://www.unicode.org/unicode/reports/tr24/
+For more about blocks see
+http://www.unicode.org/Public/UNIDATA/Blocks.txt
+
+Because there are overlaps in naming (there are, for example, both
+a script called C<Katakana> and a block called C<Katakana>), the block
+version has C<Block> appended to its name, C<\p{InKatakanaBlock}>.
+
+Notice that this definition was introduced in Perl 5.8.0: in Perl
+5.6.0 only the blocks were used; in Perl 5.8.0 scripts became the
+preferred character class definition. This means that the
+definitions of some character classes changed (the ones in the
+list below that have C<Block> appended).
+
+ BasicLatin
+ Latin1Supplement
+ LatinExtendedA
+ LatinExtendedB
+ IPAExtensions
+ SpacingModifierLetters
+ CombiningDiacriticalMarks
+ GreekBlock
+ CyrillicBlock
+ ArmenianBlock
+ HebrewBlock
+ ArabicBlock
+ SyriacBlock
+ ThaanaBlock
+ DevanagariBlock
+ BengaliBlock
+ GurmukhiBlock
+ GujaratiBlock
+ OriyaBlock
+ TamilBlock
+ TeluguBlock
+ KannadaBlock
+ MalayalamBlock
+ SinhalaBlock
+ ThaiBlock
+ LaoBlock
+ TibetanBlock
+ MyanmarBlock
+ GeorgianBlock
+ HangulJamo
+ EthiopicBlock
+ CherokeeBlock
+ UnifiedCanadianAboriginalSyllabics
+ OghamBlock
+ RunicBlock
+ KhmerBlock
+ MongolianBlock
+ LatinExtendedAdditional
+ GreekExtended
+ GeneralPunctuation
+ SuperscriptsandSubscripts
+ CurrencySymbols
+ CombiningMarksforSymbols
+ LetterlikeSymbols
+ NumberForms
+ Arrows
+ MathematicalOperators
+ MiscellaneousTechnical
+ ControlPictures
+ OpticalCharacterRecognition
+ EnclosedAlphanumerics
+ BoxDrawing
+ BlockElements
+ GeometricShapes
+ MiscellaneousSymbols
+ Dingbats
+ BraillePatterns
+ CJKRadicalsSupplement
+ KangxiRadicals
+ IdeographicDescriptionCharacters
+ CJKSymbolsandPunctuation
+ HiraganaBlock
+ KatakanaBlock
+ BopomofoBlock
+ HangulCompatibilityJamo
+ Kanbun
+ BopomofoExtended
+ EnclosedCJKLettersandMonths
+ CJKCompatibility
+ CJKUnifiedIdeographsExtensionA
+ CJKUnifiedIdeographs
+ YiSyllables
+ YiRadicals
+ HangulSyllables
+ HighSurrogates
+ HighPrivateUseSurrogates
+ LowSurrogates
+ PrivateUse
+ CJKCompatibilityIdeographs
+ AlphabeticPresentationForms
+ ArabicPresentationFormsA
+ CombiningHalfMarks
+ CJKCompatibilityForms
+ SmallFormVariants
+ ArabicPresentationFormsB
+ Specials
+ HalfwidthandFullwidthForms
+ OldItalicBlock
+ GothicBlock
+ DeseretBlock
+ ByzantineMusicalSymbols
+ MusicalSymbols
+ MathematicalAlphanumericSymbols
+ CJKUnifiedIdeographsExtensionB
+ CJKCompatibilityIdeographsSupplement
+ Tags
=item *