Support preferentially the Unicode 'scripts' definition

in the \p{In...} notation, since according to Unicode the scripts concept is more natural for matching than the somewhat artificial block names. The block names are still available, though: if there's a name conflict, the scripts one wins and the blocks one has to make do with 'Block' appended to its name. For more information see http://www.unicode.org/unicode/reports/tr24/

p4raw-id: //depot/perl@11132
author: Jarkko Hietaniemi <jhi@iki.fi> 2001-07-04 01:32:11 +0000
committer: Jarkko Hietaniemi <jhi@iki.fi> 2001-07-04 01:32:11 +0000
commit: 2796c109dc2c56e2241410992d78bd8e0cccd71f (patch)
tree: 6afcbd325dc2525c4681ef8e20e95afc8fcd49a4 /pod
parent: ad9cab3708f3a6aff28b5c1ca3a390c013235283 (diff)
download: perl-2796c109dc2c56e2241410992d78bd8e0cccd71f.tar.gz
4 files changed, 209 insertions, 120 deletions
diff --git a/pod/perl572delta.pod b/pod/perl572delta.pod
index 2800cf85dc..1ff8436508 100644
--- a/pod/perl572delta.pod
+++ b/pod/perl572delta.pod
@@ -49,6 +49,34 @@ statically built in.  This may or may not be a problem with ancient
 TCP/IP stacks of VMS: we do not know since we weren't able to test
 Perl in such configurations.
 
+=head2 Different Definition of the Unicode Character Classes \p{In...}
+
+As suggested by the Unicode Consortium, the Unicode character classes
+in Perl now prefer I<scripts> over I<blocks> (both as defined by Unicode)
+when the C<\p{In...}> and C<\P{In...}> regular expression
+constructs are used.  This has changed the definition of some of those
+character classes.
+
+The difference between scripts and blocks is that scripts are the
+glyphs used by a language or a group of languages, while the blocks
+are more artificial groupings of 256 characters based on the Unicode
+numbering.
+
+In general this change results in more inclusive Unicode character
+classes, but changes in the other direction also take place:
+for example, while the script C<Latin> includes all the Latin
+characters and their various diacritic-adorned versions, it
+does not include the various punctuation or digits (since they
+are not solely C<Latin>).
+
+Changes in the character class semantics may have happened if a script
+and a block happen to have the same name, for example C<Hebrew>.
+In such cases the script wins and C<\p{InHebrew}> now means the script
+definition of Hebrew.  The block definition is still available,
+though, by appending C<Block> to the name: C<\p{InHebrewBlock}> means
+what C<\p{InHebrew}> meant in perl 5.6.0.  For the full list
+of affected character classes, see L<perlunicode/Blocks>.
+
 =head2 Deprecations
 
 The current user-visible implementation of pseudo-hashes (the weird
diff --git a/pod/perlretut.pod b/pod/perlretut.pod
index 3e83c1305f..7f8e8f5430 100644
--- a/pod/perlretut.pod
+++ b/pod/perlretut.pod
@@ -1752,8 +1752,9 @@ character class of Unicode 'marks', for example accent marks.
 For the full list see L<perlunicode>.
 
 The Unicode has also been separated into blocks of charaters which you
-can test with C<\p{InBlock}> and C<\P{InBlock}>, for example C<\p{InGreek}>
-and C<\P{InKatakana}>.  For the full list see L<perlunicode>.
+can test with C<\p{In...}> (in) and C<\P{In...}> (not in), for example
+C<\p{InLatin}>, C<\p{InGreek}>, or C<\P{InKatakana}>.  For the full list see
+L<perlunicode>.
 
 For the the full and latest information see the latest Unicode standard.
 
diff --git a/pod/perltodo.pod b/pod/perltodo.pod
index f96c7704e1..3c7243254e 100644
--- a/pod/perltodo.pod
+++ b/pod/perltodo.pod
@@ -87,17 +87,6 @@ class subtraction.
 
 	http://www.unicode.org/unicode/reports/tr18/
 
-=head2 Unicode Scripts support
-
-Currently the C<\p{In...}> supports only the Blocks database, like
-C<\p{BasicLatin}>, C<\p{InGreek}>, C<\p{InThai}>, but there's also the
-Scripts database, which has members like C<Latin>, C<Greek>,
-C<Armenian>, C<Han>.  It is desireable that also the script names
-could be used for the C<\p{In...}> construct.  Note: needs to be
-researched whether this is possible, that is, are there conflicts
-between the Blocks and the Scripts, is the Blocks Greek the same as
-the Scripts Greek?
-
 =head2 use Thread for iThreads
 
 Artur Bergman's C<iThreads> module is a start on this, but needs to
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index d629cabe9f..877b497613 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -105,13 +105,14 @@ bytes change to operating on characters.  For ASCII data this makes
 no difference, because UTF-8 stores ASCII in single bytes, but for
 any character greater than C<chr(127)>, the character may be stored in
 a sequence of two or more bytes, all of which have the high bit set.
-For C1 controls or Latin 1 characters on an EBCDIC platform the character
-may be stored in a UTF-EBCDIC multi byte sequence.
-But by and large, the user need not worry about this, because Perl
-hides it from the user.  A character in Perl is logically just a number
-ranging from 0 to 2**32 or so.  Larger characters encode to longer
-sequences of bytes internally, but again, this is just an internal
-detail which is hidden at the Perl level.
+
+For C1 controls or Latin 1 characters on an EBCDIC platform the
+character may be stored in a UTF-EBCDIC multi byte sequence.  But by
+and large, the user need not worry about this, because Perl hides it
+from the user.  A character in Perl is logically just a number ranging
+from 0 to 2**32 or so.  Larger characters encode to longer sequences
+of bytes internally, but again, this is just an internal detail which
+is hidden at the Perl level.
 
 =head2 Effects of character semantics
 
@@ -166,7 +167,8 @@ with all non-alphanumeric characters removed, for example the block
 name C<"Latin-1 Supplement"> becomes C<\p{InLatin1Supplement}>.
 
 Here is the list as of Unicode 3.1.0 (the two-letter classes) and
-Perl 5.8.0 (the one-letter classes):
+as defined by Perl (the one-letter classes; in Unicode materials
+what Perl calls C<L> is often called C<L&>):
 
    L  Letter
    Lu Letter, Uppercase
@@ -232,105 +234,174 @@ have their directionality defined:
    BidiWS  Whitespace
    BidiON  Other Neutrals
 
-The blocks available for C<\p{InBlock}> and C<\P{InBlock}>, for
-example \p{InCyrillic>, are as follows:
-
-    BasicLatin
-    Latin1Supplement
-    LatinExtendedA
-    LatinExtendedB
-    IPAExtensions
-    SpacingModifierLetters
-    CombiningDiacriticalMarks
-    Greek
-    Cyrillic
-    Armenian
-    Hebrew
-    Arabic
-    Syriac
-    Thaana
-    Devanagari
-    Bengali
-    Gurmukhi
-    Gujarati
-    Oriya
-    Tamil
-    Telugu
-    Kannada
-    Malayalam
-    Sinhala
-    Thai
-    Lao
-    Tibetan
-    Myanmar
-    Georgian
-    HangulJamo
-    Ethiopic
-    Cherokee
-    UnifiedCanadianAboriginalSyllabics
-    Ogham
-    Runic
-    Khmer
-    Mongolian
-    LatinExtendedAdditional
-    GreekExtended
-    GeneralPunctuation
-    SuperscriptsandSubscripts
-    CurrencySymbols
-    CombiningMarksforSymbols
-    LetterlikeSymbols
-    NumberForms
-    Arrows
-    MathematicalOperators
-    MiscellaneousTechnical
-    ControlPictures
-    OpticalCharacterRecognition
-    EnclosedAlphanumerics
-    BoxDrawing
-    BlockElements
-    GeometricShapes
-    MiscellaneousSymbols
-    Dingbats
-    BraillePatterns
-    CJKRadicalsSupplement
-    KangxiRadicals
-    IdeographicDescriptionCharacters
-    CJKSymbolsandPunctuation
-    Hiragana
-    Katakana
-    Bopomofo
-    HangulCompatibilityJamo
-    Kanbun
-    BopomofoExtended
-    EnclosedCJKLettersandMonths
-    CJKCompatibility
-    CJKUnifiedIdeographsExtensionA
-    CJKUnifiedIdeographs
-    YiSyllables
-    YiRadicals
-    HangulSyllables
-    HighSurrogates
-    HighPrivateUseSurrogates
-    LowSurrogates
-    PrivateUse
-    CJKCompatibilityIdeographs
-    AlphabeticPresentationForms
-    ArabicPresentationFormsA
-    CombiningHalfMarks
-    CJKCompatibilityForms
-    SmallFormVariants
-    ArabicPresentationFormsB
-    Specials
-    HalfwidthandFullwidthForms
-    OldItalic
-    Gothic
-    Deseret
-    ByzantineMusicalSymbols
-    MusicalSymbols
-    MathematicalAlphanumericSymbols
-    CJKUnifiedIdeographsExtensionB
-    CJKCompatibilityIdeographsSupplement
-    Tags
+=head2 Scripts
+
+The scripts available for C<\p{In...}> and C<\P{In...}>, for
+example C<\p{InLatin}> or C<\P{InHan}>, are as
+follows:
+
+   Latin
+   Greek
+   Cyrillic
+   Armenian
+   Hebrew
+   Arabic
+   Syriac
+   Thaana
+   Devanagari
+   Bengali
+   Gurmukhi
+   Gujarati
+   Oriya
+   Tamil
+   Telugu
+   Kannada
+   Malayalam
+   Sinhala
+   Thai
+   Lao
+   Tibetan
+   Myanmar
+   Georgian
+   Hangul
+   Ethiopic
+   Cherokee
+   CanadianAboriginal
+   Ogham
+   Runic
+   Khmer
+   Mongolian
+   Hiragana
+   Katakana
+   Bopomofo
+   Han
+   Yi
+   OldItalic
+   Gothic
+   Deseret
+   Inherited
+
+=head2 Blocks
+
+In addition to B<scripts>, Unicode also defines B<blocks> of
+characters.  The difference between scripts and blocks is that the
+former concept is closer to natural languages, while the latter
+concept is more an artificial grouping based on groups of 256 Unicode
+characters.  For example, the C<Latin> script contains letters from
+many blocks, but it does not contain all the characters from those
+blocks; it does not, for example, contain digits.
+
+For more about scripts see the UTR #24:
+http://www.unicode.org/unicode/reports/tr24/
+For more about blocks see
+http://www.unicode.org/Public/UNIDATA/Blocks.txt
+
+Because there are overlaps in naming (there are, for example, both
+a script called C<Katakana> and a block called C<Katakana>), the block
+version has C<Block> appended to its name, C<\p{InKatakanaBlock}>.
+
+Notice that this definition was introduced in Perl 5.8.0.  In Perl
+5.6.0 only the blocks were used; in Perl 5.8.0 scripts became the
+preferred character class definition.  This means that the
+definitions of some character classes changed (the ones in the
+list below that have C<Block> appended).
+
+   BasicLatin
+   Latin1Supplement
+   LatinExtendedA
+   LatinExtendedB
+   IPAExtensions
+   SpacingModifierLetters
+   CombiningDiacriticalMarks
+   GreekBlock
+   CyrillicBlock
+   ArmenianBlock
+   HebrewBlock
+   ArabicBlock
+   SyriacBlock
+   ThaanaBlock
+   DevanagariBlock
+   BengaliBlock
+   GurmukhiBlock
+   GujaratiBlock
+   OriyaBlock
+   TamilBlock
+   TeluguBlock
+   KannadaBlock
+   MalayalamBlock
+   SinhalaBlock
+   ThaiBlock
+   LaoBlock
+   TibetanBlock
+   MyanmarBlock
+   GeorgianBlock
+   HangulJamo
+   EthiopicBlock
+   CherokeeBlock
+   UnifiedCanadianAboriginalSyllabics
+   OghamBlock
+   RunicBlock
+   KhmerBlock
+   MongolianBlock
+   LatinExtendedAdditional
+   GreekExtended
+   GeneralPunctuation
+   SuperscriptsandSubscripts
+   CurrencySymbols
+   CombiningMarksforSymbols
+   LetterlikeSymbols
+   NumberForms
+   Arrows
+   MathematicalOperators
+   MiscellaneousTechnical
+   ControlPictures
+   OpticalCharacterRecognition
+   EnclosedAlphanumerics
+   BoxDrawing
+   BlockElements
+   GeometricShapes
+   MiscellaneousSymbols
+   Dingbats
+   BraillePatterns
+   CJKRadicalsSupplement
+   KangxiRadicals
+   IdeographicDescriptionCharacters
+   CJKSymbolsandPunctuation
+   HiraganaBlock
+   KatakanaBlock
+   BopomofoBlock
+   HangulCompatibilityJamo
+   Kanbun
+   BopomofoExtended
+   EnclosedCJKLettersandMonths
+   CJKCompatibility
+   CJKUnifiedIdeographsExtensionA
+   CJKUnifiedIdeographs
+   YiSyllables
+   YiRadicals
+   HangulSyllables
+   HighSurrogates
+   HighPrivateUseSurrogates
+   LowSurrogates
+   PrivateUse
+   CJKCompatibilityIdeographs
+   AlphabeticPresentationForms
+   ArabicPresentationFormsA
+   CombiningHalfMarks
+   CJKCompatibilityForms
+   SmallFormVariants
+   ArabicPresentationFormsB
+   Specials
+   HalfwidthandFullwidthForms
+   OldItalicBlock
+   GothicBlock
+   DeseretBlock
+   ByzantineMusicalSymbols
+   MusicalSymbols
+   MathematicalAlphanumericSymbols
+   CJKUnifiedIdeographsExtensionB
+   CJKCompatibilityIdeographsSupplement
+   Tags
 
 =item *