1 files changed, 178 insertions, 107 deletions
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index d629cabe9f..877b497613 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -105,13 +105,14 @@ bytes change to operating on characters.  For ASCII data this makes
 no difference, because UTF-8 stores ASCII in single bytes, but for
 any character greater than C<chr(127)>, the character may be stored in
 a sequence of two or more bytes, all of which have the high bit set.
-For C1 controls or Latin 1 characters on an EBCDIC platform the character
-may be stored in a UTF-EBCDIC multi byte sequence.
-But by and large, the user need not worry about this, because Perl
-hides it from the user.  A character in Perl is logically just a number
-ranging from 0 to 2**32 or so.  Larger characters encode to longer
-sequences of bytes internally, but again, this is just an internal
-detail which is hidden at the Perl level.
+
+For C1 controls or Latin 1 characters on an EBCDIC platform the
+character may be stored in a UTF-EBCDIC multibyte sequence.  But by
+and large, the user need not worry about this, because Perl hides it
+from the user.  A character in Perl is logically just a number ranging
+from 0 to 2**32 or so.  Larger characters encode to longer sequences
+of bytes internally, but again, this is just an internal detail which
+is hidden at the Perl level.
 
 =head2 Effects of character semantics
 
@@ -166,7 +167,8 @@ with all non-alphanumeric characters removed, for example the block
 name C<"Latin-1 Supplement"> becomes C<\p{InLatin1Supplement}>.
 
 Here is the list as of Unicode 3.1.0 (the two-letter classes) and
-Perl 5.8.0 (the one-letter classes):
+as defined by Perl (the one-letter classes; in Unicode materials,
+what Perl calls C<L> is often called C<L&>):
 
    L  Letter
    Lu Letter, Uppercase
@@ -232,105 +234,174 @@ have their directionality defined:
    BidiWS  Whitespace
    BidiON  Other Neutrals
 
-The blocks available for C<\p{InBlock}> and C<\P{InBlock}>, for
-example \p{InCyrillic>, are as follows:
-
-    BasicLatin
-    Latin1Supplement
-    LatinExtendedA
-    LatinExtendedB
-    IPAExtensions
-    SpacingModifierLetters
-    CombiningDiacriticalMarks
-    Greek
-    Cyrillic
-    Armenian
-    Hebrew
-    Arabic
-    Syriac
-    Thaana
-    Devanagari
-    Bengali
-    Gurmukhi
-    Gujarati
-    Oriya
-    Tamil
-    Telugu
-    Kannada
-    Malayalam
-    Sinhala
-    Thai
-    Lao
-    Tibetan
-    Myanmar
-    Georgian
-    HangulJamo
-    Ethiopic
-    Cherokee
-    UnifiedCanadianAboriginalSyllabics
-    Ogham
-    Runic
-    Khmer
-    Mongolian
-    LatinExtendedAdditional
-    GreekExtended
-    GeneralPunctuation
-    SuperscriptsandSubscripts
-    CurrencySymbols
-    CombiningMarksforSymbols
-    LetterlikeSymbols
-    NumberForms
-    Arrows
-    MathematicalOperators
-    MiscellaneousTechnical
-    ControlPictures
-    OpticalCharacterRecognition
-    EnclosedAlphanumerics
-    BoxDrawing
-    BlockElements
-    GeometricShapes
-    MiscellaneousSymbols
-    Dingbats
-    BraillePatterns
-    CJKRadicalsSupplement
-    KangxiRadicals
-    IdeographicDescriptionCharacters
-    CJKSymbolsandPunctuation
-    Hiragana
-    Katakana
-    Bopomofo
-    HangulCompatibilityJamo
-    Kanbun
-    BopomofoExtended
-    EnclosedCJKLettersandMonths
-    CJKCompatibility
-    CJKUnifiedIdeographsExtensionA
-    CJKUnifiedIdeographs
-    YiSyllables
-    YiRadicals
-    HangulSyllables
-    HighSurrogates
-    HighPrivateUseSurrogates
-    LowSurrogates
-    PrivateUse
-    CJKCompatibilityIdeographs
-    AlphabeticPresentationForms
-    ArabicPresentationFormsA
-    CombiningHalfMarks
-    CJKCompatibilityForms
-    SmallFormVariants
-    ArabicPresentationFormsB
-    Specials
-    HalfwidthandFullwidthForms
-    OldItalic
-    Gothic
-    Deseret
-    ByzantineMusicalSymbols
-    MusicalSymbols
-    MathematicalAlphanumericSymbols
-    CJKUnifiedIdeographsExtensionB
-    CJKCompatibilityIdeographsSupplement
-    Tags
+=head2 Scripts
+
+The scripts available for C<\p{In...}> and C<\P{In...}>, for
+example C<\p{InLatin}> or C<\P{InHan}>, are as
+follows:
+
+   Latin
+   Greek
+   Cyrillic
+   Armenian
+   Hebrew
+   Arabic
+   Syriac
+   Thaana
+   Devanagari
+   Bengali
+   Gurmukhi
+   Gujarati
+   Oriya
+   Tamil
+   Telugu
+   Kannada
+   Malayalam
+   Sinhala
+   Thai
+   Lao
+   Tibetan
+   Myanmar
+   Georgian
+   Hangul
+   Ethiopic
+   Cherokee
+   CanadianAboriginal
+   Ogham
+   Runic
+   Khmer
+   Mongolian
+   Hiragana
+   Katakana
+   Bopomofo
+   Han
+   Yi
+   OldItalic
+   Gothic
+   Deseret
+   Inherited
+
+=head2 Blocks
+
+In addition to B<scripts>, Unicode also defines B<blocks> of
+characters.  The difference between scripts and blocks is that the
+former concept is closer to natural languages, while the latter
+concept is more an artificial grouping based on groups of 256 Unicode
+characters.  For example, the C<Latin> script contains letters from
+many blocks, but it does not contain all the characters from those
+blocks; for example, it does not contain digits.
+
+For more about scripts, see UTR #24:
+http://www.unicode.org/unicode/reports/tr24/
+For more about blocks see
+http://www.unicode.org/Public/UNIDATA/Blocks.txt
+
+Because there are overlaps in naming (there are, for example, both
+a script called C<Katakana> and a block called C<Katakana>), the block
+version has C<Block> appended to its name, as in C<\p{InKatakanaBlock}>.
+
+Notice that this definition was introduced in Perl 5.8.0.  In Perl
+5.6.0 only the blocks were used; in Perl 5.8.0 scripts became the
+preferred character class definition.  As a result, the
+definitions of some character classes changed (the ones in the
+list below that have C<Block> appended).
+
+   BasicLatin
+   Latin1Supplement
+   LatinExtendedA
+   LatinExtendedB
+   IPAExtensions
+   SpacingModifierLetters
+   CombiningDiacriticalMarks
+   GreekBlock
+   CyrillicBlock
+   ArmenianBlock
+   HebrewBlock
+   ArabicBlock
+   SyriacBlock
+   ThaanaBlock
+   DevanagariBlock
+   BengaliBlock
+   GurmukhiBlock
+   GujaratiBlock
+   OriyaBlock
+   TamilBlock
+   TeluguBlock
+   KannadaBlock
+   MalayalamBlock
+   SinhalaBlock
+   ThaiBlock
+   LaoBlock
+   TibetanBlock
+   MyanmarBlock
+   GeorgianBlock
+   HangulJamo
+   EthiopicBlock
+   CherokeeBlock
+   UnifiedCanadianAboriginalSyllabics
+   OghamBlock
+   RunicBlock
+   KhmerBlock
+   MongolianBlock
+   LatinExtendedAdditional
+   GreekExtended
+   GeneralPunctuation
+   SuperscriptsandSubscripts
+   CurrencySymbols
+   CombiningMarksforSymbols
+   LetterlikeSymbols
+   NumberForms
+   Arrows
+   MathematicalOperators
+   MiscellaneousTechnical
+   ControlPictures
+   OpticalCharacterRecognition
+   EnclosedAlphanumerics
+   BoxDrawing
+   BlockElements
+   GeometricShapes
+   MiscellaneousSymbols
+   Dingbats
+   BraillePatterns
+   CJKRadicalsSupplement
+   KangxiRadicals
+   IdeographicDescriptionCharacters
+   CJKSymbolsandPunctuation
+   HiraganaBlock
+   KatakanaBlock
+   BopomofoBlock
+   HangulCompatibilityJamo
+   Kanbun
+   BopomofoExtended
+   EnclosedCJKLettersandMonths
+   CJKCompatibility
+   CJKUnifiedIdeographsExtensionA
+   CJKUnifiedIdeographs
+   YiSyllables
+   YiRadicals
+   HangulSyllables
+   HighSurrogates
+   HighPrivateUseSurrogates
+   LowSurrogates
+   PrivateUse
+   CJKCompatibilityIdeographs
+   AlphabeticPresentationForms
+   ArabicPresentationFormsA
+   CombiningHalfMarks
+   CJKCompatibilityForms
+   SmallFormVariants
+   ArabicPresentationFormsB
+   Specials
+   HalfwidthandFullwidthForms
+   OldItalicBlock
+   GothicBlock
+   DeseretBlock
+   ByzantineMusicalSymbols
+   MusicalSymbols
+   MathematicalAlphanumericSymbols
+   CJKUnifiedIdeographsExtensionB
+   CJKCompatibilityIdeographsSupplement
+   Tags
 
 =item *