The Unicode categories doc patch to go with #14254, from Jeffrey.

p4raw-id: //depot/perl@14263
author: Jarkko Hietaniemi <jhi@iki.fi> 2002-01-15 02:14:29 +0000
committer: Jarkko Hietaniemi <jhi@iki.fi> 2002-01-15 02:14:29 +0000
commit: eb0cc9e3552a0fa3abde76a3fd73dea2d3b4e730 (patch)
tree: a785a41e214ad4900417ee21c2502360f5355c0e
parent: 9b99345a93e83058ceff44eef19901d8cd699da0 (diff)
download: perl-eb0cc9e3552a0fa3abde76a3fd73dea2d3b4e730.tar.gz
3 files changed, 247 insertions, 269 deletions
diff --git a/lib/Unicode/UCD.pm b/lib/Unicode/UCD.pm
index ff9cc8fc05..3f8b896beb 100644
--- a/lib/Unicode/UCD.pm
+++ b/lib/Unicode/UCD.pm
@@ -108,7 +108,7 @@ as defined by the Unicode standard:
     title            titlecase equivalent mapping
 
     block            block the character belongs to (used in \p{In...})
-    script           script the character belongs to 
+    script           script the character belongs to
 
 If no match is found, a reference to an empty hash is returned.
 
@@ -280,13 +280,12 @@ positions within all blocks are defined.
 
 See also L</Blocks versus Scripts>.
 
-If supplied with an argument that can't be a code point, charblock()
-tries to do the opposite and interpret the argument as a character
-block.  The return value is a I<range>: an anonymous list that
-contains anonymous lists, which in turn contain I<start-of-range>,
-I<end-of-range> code point pairs.  You can test whether a code point
-is in a range using the L</charinrange> function.  If the argument is
-not a known charater block, C<undef> is returned.
+If supplied with an argument that can't be a code point, charblock() tries
+to do the opposite and interpret the argument as a character block. The
+return value is a I<range>: an anonymous list of lists that contain
+I<start-of-range>, I<end-of-range> code point pairs. You can test whether a
+code point is in a range using the L</charinrange> function. If the
+argument is not a known charater block, C<undef> is returned.
 
 =cut
 
@@ -342,13 +341,12 @@ character belongs to, e.g.  C<Latin>, C<Greek>, C<Han>.
 
 See also L</Blocks versus Scripts>.
 
-If supplied with an argument that can't be a code point, charscript()
-tries to do the opposite and interpret the argument as a character
-script.  The return value is a I<range>: an anonymous list that
-contains anonymous lists, which in turn contain I<start-of-range>,
-I<end-of-range> code point pairs.  You can test whether a code point
-is in a range using the L</charinrange> function.  If the argument is
-not a known charater script, C<undef> is returned.
+If supplied with an argument that can't be a code point, charscript() tries
+to do the opposite and interpret the argument as a character script. The
+return value is a I<range>: an anonymous list of lists that contain
+I<start-of-range>, I<end-of-range> code point pairs. You can test whether a
+code point is in a range using the L</charinrange> function. If the
+argument is not a known charater script, C<undef> is returned.
 
 =cut
 
@@ -433,13 +431,13 @@ sub charscripts {
 The difference between a block and a script is that scripts are closer
 to the linguistic notion of a set of characters required to present
 languages, while block is more of an artifact of the Unicode character
-numbering and separation into blocks of 256 characters.
+numbering and separation into blocks of (mostly) 256 characters.
 
 For example the Latin B<script> is spread over several B<blocks>, such
 as C<Basic Latin>, C<Latin 1 Supplement>, C<Latin Extended-A>, and
 C<Latin Extended-B>.  On the other hand, the Latin script does not
 contain all the characters of the C<Basic Latin> block (also known as
-the ASCII): it includes only the letters, not for example the digits
+the ASCII): it includes only the letters, and not, for example, the digits
 or the punctuation.
 
 For blocks see http://www.unicode.org/Public/UNIDATA/Blocks.txt
@@ -448,18 +446,10 @@ For scripts see UTR #24: http://www.unicode.org/unicode/reports/tr24/
 
 =head2 Matching Scripts and Blocks
 
-Both scripts and blocks can be matched using the regular expression
-construct C<\p{In...}> and its negation C<\P{In...}>.
-
-The name of the script or the block comes after the C<In>, for example
-C<\p{InCyrillic}>, C<\P{InBasicLatin}>.  Spaces and dashes ('-') are
-removed from the names for the C<\p{In...}>, for example
-C<LatinExtendedA> instead of C<Latin Extended-A>.
-
-There are a few cases where there is both a script and a block by the
-same name, in these cases the block version has C<Block> appended to
-its name: C<\p{InKatakana}> is the script, C<\p{InKatakanaBlock}> is
-the block.
+Scripts are matched with the regular-expression construct
+C<\p{...}> (e.g. C<\p{Tibetan}> matches characters of the Tibetan script),
+while C<\p{In...}> is used for blocks (e.g. C<\p{InTibetan}> matches
+any of the 256 code points in the Tibetan block).
 
 =head2 Code Point Arguments
 
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index f3f2a19300..a2923f89db 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -88,33 +88,31 @@ point format on OpenVMS Alpha, potentially breaking binary compatibility
 with external libraries or existing data.  G_FLOAT is still available as
 a configuration option.  The default on VAX (D_FLOAT) has not changed.
 
-=head2 Different Definition of the Unicode Character Classes \p{In...}
-
-As suggested by the Unicode consortium, the Unicode character classes
-now prefer I<scripts> as opposed to I<blocks> (as defined by Unicode);
-in Perl, when the C<\p{In....}> and the C<\p{In....}> regular expression
-constructs are used.  This has changed the definition of some of those
-character classes.
-
-The difference between scripts and blocks is that scripts are the
-glyphs used by a language or a group of languages, while the blocks
-are more artificial groupings of 256 characters based on the Unicode
-numbering.
-
-In general this change results in more inclusive Unicode character
-classes, but changes to the other direction also do take place:
-for example while the script C<Latin> includes all the Latin
-characters and their various diacritic-adorned versions, it
-does not include the various punctuation or digits (since they
-are not solely C<Latin>).
-
-Changes in the character class semantics may have happened if a script
-and a block happen to have the same name, for example C<Hebrew>.
-In such cases the script wins and C<\p{InHebrew}> now means the script
-definition of Hebrew.  The block definition in still available,
-though, by appending C<Block> to the name: C<\p{InHebrewBlock}> means
-what C<\p{InHebrew}> meant in perl 5.6.0.  For the full list
-of affected character classes, see L<perlunicode/Blocks>.
+=head2 New Unicode Properties
+
+Unicode I<scripts> are now supported. Scripts are similar to (and superior
+to) Unicode I<blocks>. The difference between scripts and blocks is that
+scripts are the glyphs used by a language or a group of languages, while
+the blocks are more artificial groupings of (mostly) 256 characters based
+on the Unicode numbering.
+
+In general, scripts are more inclusive, but not universally so. For
+example, while the script C<Latin> includes all the Latin characters and
+their various diacritic-adorned versions, it does not include the various
+punctuation or digits (since they are not solely C<Latin>).
+
+A number of other properties are now supported, including C<\p{L&}>,
+C<\p{Any}> C<\p{Assigned}>, C<\p{Unassigned}>, C<\p{Blank}> and
+C<\p{SpacePerl}> (along with their C<\P{...}> versions, of course).
+See L<perlunicode> for details, and more additions.
+
+The C<In> or C<Is> prefix to names used with the C<\p{...}> and C<\P{...}>
+are now almost always optional. The only exception is that a C<In> prefix
+is required to signify a Unicode block when a block name conflicts with a
+script name. For example, C<\p{Tibetan}> refers to the script, while
+C<\p{InTibetan}> refers to the block. When there is no name conflict, you
+can omit the C<In> from the block name (e.g. C<\p{BraillePatterns}>), but
+to be safe, it's probably best to always use the C<In>).
 
 =head2 Perl Parser Stress Tested
 
@@ -351,12 +349,14 @@ considerations, is the Unihan database.
 
 =item *
 
-The Unicode character classes \p{Blank} and \p{SpacePerl} have been
-added.  "Blank" is like C isblank(), that is, it contains only
-"horizontal whitespace" (the space character is, the newline isn't),
-and the "SpacePerl" is the Unicode equivalent of C<\s> (\p{Space}
-isn't, since that includes the vertical tabulator character, whereas
-C<\s> doesn't.)
+The properties \p{Blank} and \p{SpacePerl} have been added. "Blank" is like
+C isblank(), that is, it contains only "horizontal whitespace" (the space
+character is, the newline isn't), and the "SpacePerl" is the Unicode
+equivalent of C<\s> (\p{Space} isn't, since that includes the vertical
+tabulator character, whereas C<\s> doesn't.)
+
+See "New Unicode Properties" earlier in this document for additional
+information on changes with Unicode properties.
 
 =back
 
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index beb742efb1..0264568e55 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -156,109 +156,94 @@ ideograph, for instance.
 
 =item *
 
-Named Unicode properties and block ranges may be used as character
-classes via the new C<\p{}> (matches property) and C<\P{}> (doesn't
-match property) constructs.  For instance, C<\p{Lu}> matches any
+Named Unicode properties, scripts, and block ranges may be used like
+character classes via the new C<\p{}> (matches property) and C<\P{}>
+(doesn't match property) constructs. For instance, C<\p{Lu}> matches any
 character with the Unicode "Lu" (Letter, uppercase) property, while
 C<\p{M}> matches any character with a "M" (mark -- accents and such)
-property.  Single letter properties may omit the brackets, so that can
-be written C<\pM> also.  Many predefined character classes are
-available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>.
-
-The C<\p{Is...}> test for "general properties" such as "letter",
-"digit", while the C<\p{In...}> test for Unicode scripts and blocks.
+property. Single letter properties may omit the brackets, so that can be
+written C<\pM> also. Many predefined properties are available, such
+as C<\p{Mirrored}> and C<\p{Tibetan}>.
 
 The official Unicode script and block names have spaces and dashes as
-separators, but for convenience you can have dashes, spaces, and
-underbars at every word division, and you need not care about correct
-casing.  It is recommended, however, that for consistency you use the
-following naming: the official Unicode script, block, or property name
-(see below for the additional rules that apply to block names), with
-whitespace and dashes replaced with underbar, and the words
-"uppercase-first-lowercase-rest".  That is, "Latin-1 Supplement"
-becomes "Latin_1_Supplement".
+separators, but for convenience you can have dashes, spaces, and underbars
+at every word division, and you need not care about correct casing. It is
+recommended, however, that for consistency you use the following naming:
+the official Unicode script, block, or property name (see below for the
+additional rules that apply to block names), with whitespace and dashes
+removed, and the words "uppercase-first-lowercase-rest". That is, "Latin-1
+Supplement" becomes "Latin1Supplement".
 
 You can also negate both C<\p{}> and C<\P{}> by introducing a caret
-(^) between the first curly and the property name: C<\p{^In_Tamil}> is
-equal to C<\P{In_Tamil}>.
+(^) between the first curly and the property name: C<\p{^Tamil}> is
+equal to C<\P{Tamil}>.
 
-The C<In> and C<Is> can be left out: C<\p{Greek}> is equal to
-C<\p{In_Greek}>, C<\P{Pd}> is equal to C<\P{Pd}>.
+Here are the basic Unicode General Category properties, followed by their
+long form (you can use either, e.g. C<\p{Lu}> and C<\p{LowercaseLetter}>
+are identical).
 
     Short       Long
 
     L           Letter
-    Lu          Uppercase_Letter
-    Ll          Lowercase_Letter
-    Lt          Titlecase_Letter
-    Lm          Modifier_Letter
-    Lo          Other_Letter
+    Lu          UppercaseLetter
+    Ll          LowercaseLetter
+    Lt          TitlecaseLetter
+    Lm          ModifierLetter
+    Lo          OtherLetter
 
     M           Mark
-    Mn          Nonspacing_Mark
-    Mc          Spacing_Mark
-    Me          Enclosing_Mark
+    Mn          NonspacingMark
+    Mc          SpacingMark
+    Me          EnclosingMark
 
     N           Number
-    Nd          Decimal_Number
-    Nl          Letter_Number
-    No          Other_Number
+    Nd          DecimalNumber
+    Nl          LetterNumber
+    No          OtherNumber
 
     P           Punctuation
-    Pc          Connector_Punctuation
-    Pd          Dash_Punctuation
-    Ps          Open_Punctuation
-    Pe          Close_Punctuation
-    Pi          Initial_Punctuation
+    Pc          ConnectorPunctuation
+    Pd          DashPunctuation
+    Ps          OpenPunctuation
+    Pe          ClosePunctuation
+    Pi          InitialPunctuation
                 (may behave like Ps or Pe depending on usage)
-    Pf          Final_Punctuation
+    Pf          FinalPunctuation
                 (may behave like Ps or Pe depending on usage)
-    Po          Other_Punctuation
+    Po          OtherPunctuation
 
     S           Symbol
-    Sm          Math_Symbol
-    Sc          Currency_Symbol
-    Sk          Modifier_Symbol
-    So          Other_Symbol
+    Sm          MathSymbol
+    Sc          CurrencySymbol
+    Sk          ModifierSymbol
+    So          OtherSymbol
 
     Z           Separator
-    Zs          Space_Separator
-    Zl          Line_Separator
-    Zp          Paragraph_Separator
+    Zs          SpaceSeparator
+    Zl          LineSeparator
+    Zp          ParagraphSeparator
 
     C           Other
     Cc          Control
     Cf          Format
-    Cs          Surrogate
-    Co          Private_Use
+    Cs          Surrogate   (not usable)
+    Co          PrivateUse
     Cn          Unassigned
 
 The single-letter properties match all characters in any of the
 two-letter sub-properties starting with the same letter.
 There's also C<L&> which is an alias for C<Ll>, C<Lu>, and C<Lt>.
 
-The following reserved ranges have C<In> tests:
-
-    CJK_Ideograph_Extension_A
-    CJK_Ideograph
-    Hangul_Syllable
-    Non_Private_Use_High_Surrogate
-    Private_Use_High_Surrogate
-    Low_Surrogate
-    Private_Surrogate
-    CJK_Ideograph_Extension_B
-    Plane_15_Private_Use
-    Plane_16_Private_Use
-
-For example C<"\x{AC00}" =~ \p{HangulSyllable}> will test true.
-(Handling of surrogates is not implemented yet, because Perl
-uses UTF-8 and not UTF-16 internally to represent Unicode.
-So you really can't use the "Cs" category.)
+Because Perl hides the need for the user to understand the internal
+representation of Unicode characters, it has no need to support the
+somewhat messy concept of surrogates. Therefore, the C<Cs> property is not
+supported.
 
-Additionally, because scripts differ in their directionality
-(for example Hebrew is written right to left), all characters
-have their directionality defined:
+Because scripts differ in their directionality (for example Hebrew is
+written right to left), Unicode supplies these properties:
 
+    Property    Meaning
+  
     BidiL       Left-to-Right
     BidiLRE     Left-to-Right Embedding
     BidiLRO     Left-to-Right Override
@@ -279,18 +264,21 @@ have their directionality defined:
     BidiWS      Whitespace
     BidiON      Other Neutrals
 
+For example, C<\p{BidiR}> matches all characters that are normally
+written right to left.
+
 =back
 
 =head2 Scripts
 
-The scripts available for C<\p{In...}> and C<\P{In...}>, for example
-C<\p{InLatin}> or \p{InCyrillic>, are as follows:
+The scripts available via C<\p{...}> and C<\P{...}>, for example
+C<\p{Latin}> or \p{Cyrillic>, are as follows:
 
     Arabic
     Armenian
     Bengali
     Bopomofo
-    Canadian-Aboriginal
+    CanadianAboriginal
     Cherokee
     Cyrillic
     Deseret
@@ -315,7 +303,7 @@ C<\p{InLatin}> or \p{InCyrillic>, are as follows:
     Mongolian
     Myanmar
     Ogham
-    Old-Italic
+    OldItalic
     Oriya
     Runic
     Sinhala
@@ -331,49 +319,52 @@ There are also extended property classes that supplement the basic
 properties, defined by the F<PropList> Unicode database:
 
     ASCII_Hex_Digit
-    Bidi_Control
+    BidiControl
     Dash
     Diacritic
     Extender
-    Hex_Digit
+    HexDigit
     Hyphen
     Ideographic
-    Join_Control
-    Noncharacter_Code_Point
-    Other_Alphabetic
-    Other_Lowercase
-    Other_Math
-    Other_Uppercase
-    Quotation_Mark
-    White_Space
+    JoinControl
+    NoncharacterCodePoint
+    OtherAlphabetic
+    OtherLowercase
+    OtherMath
+    OtherUppercase
+    QuotationMark
+    WhiteSpace
 
 and further derived properties:
 
-    Alphabetic      Lu + Ll + Lt + Lm + Lo + Other_Alphabetic
-    Lowercase       Ll + Other_Lowercase
-    Uppercase       Lu + Other_Uppercase
-    Math            Sm + Other_Math
+    Alphabetic      Lu + Ll + Lt + Lm + Lo + OtherAlphabetic
+    Lowercase       Ll + OtherLowercase
+    Uppercase       Lu + OtherUppercase
+    Math            Sm + OtherMath
 
     ID_Start        Lu + Ll + Lt + Lm + Lo + Nl
     ID_Continue     ID_Start + Mn + Mc + Nd + Pc
 
     Any             Any character
-    Assigned        Any non-Cn character
+    Assigned        Any non-Cn character (i.e. synonym for C<\P{Cn}>)
+    Unassigned      Synonym for C<\p{Cn}>
     Common          Any character (or unassigned code point)
                     not explicitly assigned to a script
 
+For backward compatability, all properties mentioned so far may have C<Is>
+prepended to their name (e.g. C<\P{IsLu}> is equal to C<\P{Lu}>).
+
 =head2 Blocks
 
-In addition to B<scripts>, Unicode also defines B<blocks> of
-characters.  The difference between scripts and blocks is that the
-scripts concept is closer to natural languages, while the blocks
-concept is more an artificial grouping based on groups of 256 Unicode
-characters.  For example, the C<Latin> script contains letters from
-many blocks.  On the other hand, the C<Latin> script does not contain
-all the characters from those blocks. It does not, for example,
-contain digits because digits are shared across many scripts.  Digits
-and other similar groups, like punctuation, are in a category called
-C<Common>.
+In addition to B<scripts>, Unicode also defines B<blocks> of characters.
+The difference between scripts and blocks is that the scripts concept is
+closer to natural languages, while the blocks concept is more an artificial
+grouping based on groups of mostly 256 Unicode characters. For example, the
+C<Latin> script contains letters from many blocks. On the other hand, the
+C<Latin> script does not contain all the characters from those blocks. It
+does not, for example, contain digits because digits are shared across many
+scripts. Digits and other similar groups, like punctuation, are in a
+category called C<Common>.
 
 For more about scripts, see the UTR #24:
 
@@ -383,113 +374,110 @@ For more about blocks, see:
 
    http://www.unicode.org/Public/UNIDATA/Blocks.txt
 
-Because there are overlaps in naming (there are, for example, both
-a script called C<Katakana> and a block called C<Katakana>, the block
-version has C<Block> appended to its name, C<\p{InKatakanaBlock}>.
-
-Notice that this definition was introduced in Perl 5.8.0: in Perl
-5.6 only the blocks were used; in Perl 5.8.0 scripts became the
-preferential Unicode character class definition (prompted by
-recommendations from the Unicode consortium); this meant that
-the definitions of some character classes changed (the ones in
-the below list that have the C<Block> appended).
-
-   Alphabetic Presentation Forms
-   Arabic Block
-   Arabic Presentation Forms-A
-   Arabic Presentation Forms-B
-   Armenian Block
-   Arrows
-   Basic Latin
-   Bengali Block
-   Block Elements
-   Bopomofo Block
-   Bopomofo Extended
-   Box Drawing
-   Braille Patterns
-   Byzantine Musical Symbols
-   CJK Compatibility
-   CJK Compatibility Forms
-   CJK Compatibility Ideographs
-   CJK Compatibility Ideographs Supplement
-   CJK Radicals Supplement
-   CJK Symbols and Punctuation
-   CJK Unified Ideographs
-   CJK Unified Ideographs Extension A
-   CJK Unified Ideographs Extension B
-   Cherokee Block
-   Combining Diacritical Marks
-   Combining Half Marks
-   Combining Marks for Symbols
-   Control Pictures
-   Currency Symbols
-   Cyrillic Block
-   Deseret Block
-   Devanagari Block
-   Dingbats
-   Enclosed Alphanumerics
-   Enclosed CJK Letters and Months
-   Ethiopic Block
-   General Punctuation
-   Geometric Shapes
-   Georgian Block
-   Gothic Block
-   Greek Block
-   Greek Extended
-   Gujarati Block
-   Gurmukhi Block
-   Halfwidth and Fullwidth Forms
-   Hangul Compatibility Jamo
-   Hangul Jamo
-   Hangul Syllables
-   Hebrew Block
-   High Private Use Surrogates
-   High Surrogates
-   Hiragana Block
-   IPA Extensions
-   Ideographic Description Characters
-   Kanbun
-   Kangxi Radicals
-   Kannada Block
-   Katakana Block
-   Khmer Block
-   Lao Block
-   Latin 1 Supplement
-   Latin Extended Additional
-   Latin Extended-A
-   Latin Extended-B
-   Letterlike Symbols
-   Low Surrogates
-   Malayalam Block
-   Mathematical Alphanumeric Symbols
-   Mathematical Operators
-   Miscellaneous Symbols
-   Miscellaneous Technical
-   Mongolian Block
-   Musical Symbols
-   Myanmar Block
-   Number Forms
-   Ogham Block
-   Old Italic Block
-   Optical Character Recognition
-   Oriya Block
-   Private Use
-   Runic Block
-   Sinhala Block
-   Small Form Variants
-   Spacing Modifier Letters
-   Specials
-   Superscripts and Subscripts
-   Syriac Block
-   Tags
-   Tamil Block
-   Telugu Block
-   Thaana Block
-   Thai Block
-   Tibetan Block
-   Unified Canadian Aboriginal Syllabics
-   Yi Radicals
-   Yi Syllables
+Blocks names are given with the C<In> prefix. For example, the
+Katakana block is referenced via C<\p{InKatakana}. The C<In>
+prefix may be omitted if there is no nameing conflict with a script
+or any other property, but it is recommended that C<In> always be used
+to avoid confusion.
+
+These block names are supported:
+
+   InAlphabeticPresentationForms
+   InArabicBlock
+   InArabicPresentationFormsA
+   InArabicPresentationFormsB
+   InArmenianBlock
+   InArrows
+   InBasicLatin
+   InBengaliBlock
+   InBlockElements
+   InBopomofoBlock
+   InBopomofoExtended
+   InBoxDrawing
+   InBraillePatterns
+   InByzantineMusicalSymbols
+   InCJKCompatibility
+   InCJKCompatibilityForms
+   InCJKCompatibilityIdeographs
+   InCJKCompatibilityIdeographsSupplement
+   InCJKRadicalsSupplement
+   InCJKSymbolsAndPunctuation
+   InCJKUnifiedIdeographs
+   InCJKUnifiedIdeographsExtensionA
+   InCJKUnifiedIdeographsExtensionB
+   InCherokeeBlock
+   InCombiningDiacriticalMarks
+   InCombiningHalfMarks
+   InCombiningMarksForSymbols
+   InControlPictures
+   InCurrencySymbols
+   InCyrillicBlock
+   InDeseretBlock
+   InDevanagariBlock
+   InDingbats
+   InEnclosedAlphanumerics
+   InEnclosedCJKLettersAndMonths
+   InEthiopicBlock
+   InGeneralPunctuation
+   InGeometricShapes
+   InGeorgianBlock
+   InGothicBlock
+   InGreekBlock
+   InGreekExtended
+   InGujaratiBlock
+   InGurmukhiBlock
+   InHalfwidthAndFullwidthForms
+   InHangulCompatibilityJamo
+   InHangulJamo
+   InHangulSyllables
+   InHebrewBlock
+   InHighPrivateUseSurrogates
+   InHighSurrogates
+   InHiraganaBlock
+   InIPAExtensions
+   InIdeographicDescriptionCharacters
+   InKanbun
+   InKangxiRadicals
+   InKannadaBlock
+   InKatakanaBlock
+   InKhmerBlock
+   InLaoBlock
+   InLatin1Supplement
+   InLatinExtendedAdditional
+   InLatinExtended-A
+   InLatinExtended-B
+   InLetterlikeSymbols
+   InLowSurrogates
+   InMalayalamBlock
+   InMathematicalAlphanumericSymbols
+   InMathematicalOperators
+   InMiscellaneousSymbols
+   InMiscellaneousTechnical
+   InMongolianBlock
+   InMusicalSymbols
+   InMyanmarBlock
+   InNumberForms
+   InOghamBlock
+   InOldItalicBlock
+   InOpticalCharacterRecognition
+   InOriyaBlock
+   InPrivateUse
+   InRunicBlock
+   InSinhalaBlock
+   InSmallFormVariants
+   InSpacingModifierLetters
+   InSpecials
+   InSuperscriptsAndSubscripts
+   InSyriacBlock
+   InTags
+   InTamilBlock
+   InTeluguBlock
+   InThaanaBlock
+   InThaiBlock
+   InTibetanBlock
+   InUnifiedCanadianAboriginalSyllabics
+   InYiRadicals
+   InYiSyllables
 
 =over 4
 
@@ -634,7 +622,7 @@ Level 1 - Basic Unicode Support
 
         [ 1] \x{...}
         [ 2] \N{...}
-        [ 3] . \p{Is...} \P{Is...}
+        [ 3] . \p{...} \P{...}
         [ 4] now scripts (see UTR#24 Script Names) in addition to blocks
         [ 5] have negation
         [ 6] can use look-ahead to emulate subtraction (*)
@@ -657,8 +645,8 @@ For example, what TR18 might write as
 
 in Perl can be written as:
 
-    (?!\p{UNASSIGNED})\p{GreekBlock}
-    (?=\p{ASSIGNED})\p{GreekBlock}
+    (?!\p{Unassigned})\p{InGreek}
+    (?=\p{Assigned})\p{InGreek}
 
 But in this particular example, you probably really want