author | Jarkko Hietaniemi <jhi@iki.fi> | 2002-01-15 02:14:29 +0000 |
---|---|---|
committer | Jarkko Hietaniemi <jhi@iki.fi> | 2002-01-15 02:14:29 +0000 |
commit | eb0cc9e3552a0fa3abde76a3fd73dea2d3b4e730 (patch) | |
tree | a785a41e214ad4900417ee21c2502360f5355c0e | |
parent | 9b99345a93e83058ceff44eef19901d8cd699da0 (diff) | |
download | perl-eb0cc9e3552a0fa3abde76a3fd73dea2d3b4e730.tar.gz |
The Unicode categories doc patch to go with #14254,
from Jeffrey.
p4raw-id: //depot/perl@14263
-rw-r--r-- | lib/Unicode/UCD.pm | 48 |
-rw-r--r-- | pod/perldelta.pod | 66 |
-rw-r--r-- | pod/perlunicode.pod | 402 |
3 files changed, 247 insertions, 269 deletions
diff --git a/lib/Unicode/UCD.pm b/lib/Unicode/UCD.pm
index ff9cc8fc05..3f8b896beb 100644
--- a/lib/Unicode/UCD.pm
+++ b/lib/Unicode/UCD.pm
@@ -108,7 +108,7 @@ as defined by the Unicode standard:
     title           titlecase equivalent mapping
 
     block           block the character belongs to (used in \p{In...})
-    script          script the character belongs to
+    script          script the character belongs to
 
 If no match is found, a reference to an empty hash is returned.
 
@@ -280,13 +280,12 @@ positions within all blocks are defined.
 
 See also L</Blocks versus Scripts>.
 
-If supplied with an argument that can't be a code point, charblock()
-tries to do the opposite and interpret the argument as a character
-block. The return value is a I<range>: an anonymous list that
-contains anonymous lists, which in turn contain I<start-of-range>,
-I<end-of-range> code point pairs. You can test whether a code point
-is in a range using the L</charinrange> function. If the argument is
-not a known charater block, C<undef> is returned.
+If supplied with an argument that can't be a code point, charblock() tries
+to do the opposite and interpret the argument as a character block. The
+return value is a I<range>: an anonymous list of lists that contain
+I<start-of-range>, I<end-of-range> code point pairs. You can test whether a
+code point is in a range using the L</charinrange> function. If the
+argument is not a known charater block, C<undef> is returned.
 
 =cut
 
@@ -342,13 +341,12 @@ character belongs to, e.g. C<Latin>, C<Greek>, C<Han>.
 
 See also L</Blocks versus Scripts>.
 
-If supplied with an argument that can't be a code point, charscript()
-tries to do the opposite and interpret the argument as a character
-script. The return value is a I<range>: an anonymous list that
-contains anonymous lists, which in turn contain I<start-of-range>,
-I<end-of-range> code point pairs. You can test whether a code point
-is in a range using the L</charinrange> function. If the argument is
-not a known charater script, C<undef> is returned.
+If supplied with an argument that can't be a code point, charscript() tries
+to do the opposite and interpret the argument as a character script. The
+return value is a I<range>: an anonymous list of lists that contain
+I<start-of-range>, I<end-of-range> code point pairs. You can test whether a
+code point is in a range using the L</charinrange> function. If the
+argument is not a known charater script, C<undef> is returned.
 
 =cut
 
@@ -433,13 +431,13 @@ sub charscripts {
 The difference between a block and a script is that scripts are closer
 to the linguistic notion of a set of characters required to present
 languages, while block is more of an artifact of the Unicode character
-numbering and separation into blocks of 256 characters.
+numbering and separation into blocks of (mostly) 256 characters.
 
 For example the Latin B<script> is spread over several B<blocks>, such
 as C<Basic Latin>, C<Latin 1 Supplement>, C<Latin Extended-A>, and
 C<Latin Extended-B>. On the other hand, the Latin script does not
 contain all the characters of the C<Basic Latin> block (also known as
-the ASCII): it includes only the letters, not for example the digits
+the ASCII): it includes only the letters, and not, for example, the digits
 or the punctuation.
 
 For blocks see http://www.unicode.org/Public/UNIDATA/Blocks.txt
@@ -448,18 +446,10 @@ For scripts see UTR #24: http://www.unicode.org/unicode/reports/tr24/
 
 =head2 Matching Scripts and Blocks
 
-Both scripts and blocks can be matched using the regular expression
-construct C<\p{In...}> and its negation C<\P{In...}>.
-
-The name of the script or the block comes after the C<In>, for example
-C<\p{InCyrillic}>, C<\P{InBasicLatin}>. Spaces and dashes ('-') are
-removed from the names for the C<\p{In...}>, for example
-C<LatinExtendedA> instead of C<Latin Extended-A>.
-
-There are a few cases where there is both a script and a block by the
-same name, in these cases the block version has C<Block> appended to
-its name: C<\p{InKatakana}> is the script, C<\p{InKatakanaBlock}> is
-the block.
+Scripts are matched with the regular-expression construct
+C<\p{...}> (e.g. C<\p{Tibetan}> matches characters of the Tibetan script),
+while C<\p{In...}> is used for blocks (e.g. C<\p{InTibetan}> matches
+any of the 256 code points in the Tibetan block).
 
 =head2 Code Point Arguments
 
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index f3f2a19300..a2923f89db 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -88,33 +88,31 @@ point format on OpenVMS Alpha, potentially breaking binary compatibility
 with external libraries or existing data. G_FLOAT is still available as
 a configuration option. The default on VAX (D_FLOAT) has not changed.
 
-=head2 Different Definition of the Unicode Character Classes \p{In...}
-
-As suggested by the Unicode consortium, the Unicode character classes
-now prefer I<scripts> as opposed to I<blocks> (as defined by Unicode);
-in Perl, when the C<\p{In....}> and the C<\p{In....}> regular expression
-constructs are used. This has changed the definition of some of those
-character classes.
-
-The difference between scripts and blocks is that scripts are the
-glyphs used by a language or a group of languages, while the blocks
-are more artificial groupings of 256 characters based on the Unicode
-numbering.
-
-In general this change results in more inclusive Unicode character
-classes, but changes to the other direction also do take place:
-for example while the script C<Latin> includes all the Latin
-characters and their various diacritic-adorned versions, it
-does not include the various punctuation or digits (since they
-are not solely C<Latin>).
-
-Changes in the character class semantics may have happened if a script
-and a block happen to have the same name, for example C<Hebrew>.
-In such cases the script wins and C<\p{InHebrew}> now means the script
-definition of Hebrew. The block definition in still available,
-though, by appending C<Block> to the name: C<\p{InHebrewBlock}> means
-what C<\p{InHebrew}> meant in perl 5.6.0. For the full list
-of affected character classes, see L<perlunicode/Blocks>.
+=head2 New Unicode Properties
+
+Unicode I<scripts> are now supported. Scripts are similar to (and superior
+to) Unicode I<blocks>. The difference between scripts and blocks is that
+scripts are the glyphs used by a language or a group of languages, while
+the blocks are more artificial groupings of (mostly) 256 characters based
+on the Unicode numbering.
+
+In general, scripts are more inclusive, but not universally so. For
+example, while the script C<Latin> includes all the Latin characters and
+their various diacritic-adorned versions, it does not include the various
+punctuation or digits (since they are not solely C<Latin>).
+
+A number of other properties are now supported, including C<\p{L&}>,
+C<\p{Any}> C<\p{Assigned}>, C<\p{Unassigned}>, C<\p{Blank}> and
+C<\p{SpacePerl}> (along with their C<\P{...}> versions, of course).
+See L<perlunicode> for details, and more additions.
+
+The C<In> or C<Is> prefix to names used with the C<\p{...}> and C<\P{...}>
+are now almost always optional. The only exception is that a C<In> prefix
+is required to signify a Unicode block when a block name conflicts with a
+script name. For example, C<\p{Tibetan}> refers to the script, while
+C<\p{InTibetan}> refers to the block. When there is no name conflict, you
+can omit the C<In> from the block name (e.g. C<\p{BraillePatterns}>), but
+to be safe, it's probably best to always use the C<In>).
 
 =head2 Perl Parser Stress Tested
 
@@ -351,12 +349,14 @@ considerations, is the Unihan database.
 
 =item *
 
-The Unicode character classes \p{Blank} and \p{SpacePerl} have been
-added. "Blank" is like C isblank(), that is, it contains only
-"horizontal whitespace" (the space character is, the newline isn't),
-and the "SpacePerl" is the Unicode equivalent of C<\s> (\p{Space}
-isn't, since that includes the vertical tabulator character, whereas
-C<\s> doesn't.)
+The properties \p{Blank} and \p{SpacePerl} have been added. "Blank" is like
+C isblank(), that is, it contains only "horizontal whitespace" (the space
+character is, the newline isn't), and the "SpacePerl" is the Unicode
+equivalent of C<\s> (\p{Space} isn't, since that includes the vertical
+tabulator character, whereas C<\s> doesn't.)
+
+See "New Unicode Properties" earlier in this document for additional
+information on changes with Unicode properties.
 
 =back
 
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index beb742efb1..0264568e55 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -156,109 +156,94 @@ ideograph, for instance.
 
 =item *
 
-Named Unicode properties and block ranges may be used as character
-classes via the new C<\p{}> (matches property) and C<\P{}> (doesn't
-match property) constructs. For instance, C<\p{Lu}> matches any
+Named Unicode properties, scripts, and block ranges may be used like
+character classes via the new C<\p{}> (matches property) and C<\P{}>
+(doesn't match property) constructs. For instance, C<\p{Lu}> matches any
 character with the Unicode "Lu" (Letter, uppercase) property, while
 C<\p{M}> matches any character with a "M" (mark -- accents and such)
-property. Single letter properties may omit the brackets, so that can
-be written C<\pM> also. Many predefined character classes are
-available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>.
-
-The C<\p{Is...}> test for "general properties" such as "letter",
-"digit", while the C<\p{In...}> test for Unicode scripts and blocks.
+property. Single letter properties may omit the brackets, so that can be
+written C<\pM> also. Many predefined properties are available, such
+as C<\p{Mirrored}> and C<\p{Tibetan}>.
 
 The official Unicode script and block names have spaces and dashes as
-separators, but for convenience you can have dashes, spaces, and
-underbars at every word division, and you need not care about correct
-casing. It is recommended, however, that for consistency you use the
-following naming: the official Unicode script, block, or property name
-(see below for the additional rules that apply to block names), with
-whitespace and dashes replaced with underbar, and the words
-"uppercase-first-lowercase-rest". That is, "Latin-1 Supplement"
-becomes "Latin_1_Supplement".
+separators, but for convenience you can have dashes, spaces, and underbars
+at every word division, and you need not care about correct casing. It is
+recommended, however, that for consistency you use the following naming:
+the official Unicode script, block, or property name (see below for the
+additional rules that apply to block names), with whitespace and dashes
+removed, and the words "uppercase-first-lowercase-rest". That is, "Latin-1
+Supplement" becomes "Latin1Supplement".
 
 You can also negate both C<\p{}> and C<\P{}> by introducing a caret
-(^) between the first curly and the property name: C<\p{^In_Tamil}> is
-equal to C<\P{In_Tamil}>.
+(^) between the first curly and the property name: C<\p{^Tamil}> is
+equal to C<\P{Tamil}>.
 
-The C<In> and C<Is> can be left out: C<\p{Greek}> is equal to
-C<\p{In_Greek}>, C<\P{Pd}> is equal to C<\P{Pd}>.
+Here are the basic Unicode General Category properties, followed by their
+long form (you can use either, e.g. C<\p{Lu}> and C<\p{LowercaseLetter}>
+are identical).
 
     Short       Long
 
     L           Letter
-    Lu          Uppercase_Letter
-    Ll          Lowercase_Letter
-    Lt          Titlecase_Letter
-    Lm          Modifier_Letter
-    Lo          Other_Letter
+    Lu          UppercaseLetter
+    Ll          LowercaseLetter
+    Lt          TitlecaseLetter
+    Lm          ModifierLetter
+    Lo          OtherLetter
 
     M           Mark
-    Mn          Nonspacing_Mark
-    Mc          Spacing_Mark
-    Me          Enclosing_Mark
+    Mn          NonspacingMark
+    Mc          SpacingMark
+    Me          EnclosingMark
 
     N           Number
-    Nd          Decimal_Number
-    Nl          Letter_Number
-    No          Other_Number
+    Nd          DecimalNumber
+    Nl          LetterNumber
+    No          OtherNumber
 
     P           Punctuation
-    Pc          Connector_Punctuation
-    Pd          Dash_Punctuation
-    Ps          Open_Punctuation
-    Pe          Close_Punctuation
-    Pi          Initial_Punctuation
+    Pc          ConnectorPunctuation
+    Pd          DashPunctuation
+    Ps          OpenPunctuation
+    Pe          ClosePunctuation
+    Pi          InitialPunctuation
                 (may behave like Ps or Pe depending on usage)
-    Pf          Final_Punctuation
+    Pf          FinalPunctuation
                 (may behave like Ps or Pe depending on usage)
-    Po          Other_Punctuation
+    Po          OtherPunctuation
 
     S           Symbol
-    Sm          Math_Symbol
-    Sc          Currency_Symbol
-    Sk          Modifier_Symbol
-    So          Other_Symbol
+    Sm          MathSymbol
+    Sc          CurrencySymbol
+    Sk          ModifierSymbol
+    So          OtherSymbol
 
     Z           Separator
-    Zs          Space_Separator
-    Zl          Line_Separator
-    Zp          Paragraph_Separator
+    Zs          SpaceSeparator
+    Zl          LineSeparator
+    Zp          ParagraphSeparator
 
     C           Other
     Cc          Control
     Cf          Format
-    Cs          Surrogate
-    Co          Private_Use
+    Cs          Surrogate (not usable)
+    Co          PrivateUse
     Cn          Unassigned
 
 The single-letter properties match all characters in any of the
 two-letter sub-properties starting with the same letter.
 There's also C<L&> which is an alias for C<Ll>, C<Lu>, and C<Lt>.
 
-The following reserved ranges have C<In> tests:
-
-    CJK_Ideograph_Extension_A
-    CJK_Ideograph
-    Hangul_Syllable
-    Non_Private_Use_High_Surrogate
-    Private_Use_High_Surrogate
-    Low_Surrogate
-    Private_Surrogate
-    CJK_Ideograph_Extension_B
-    Plane_15_Private_Use
-    Plane_16_Private_Use
-
-For example C<"\x{AC00}" =~ \p{HangulSyllable}> will test true.
-(Handling of surrogates is not implemented yet, because Perl
-uses UTF-8 and not UTF-16 internally to represent Unicode.
-So you really can't use the "Cs" category.)
+Because Perl hides the need for the user to understand the internal
+representation of Unicode characters, it has no need to support the
+somewhat messy concept of surrogates. Therefore, the C<Cs> property is not
+supported.
 
-Additionally, because scripts differ in their directionality
-(for example Hebrew is written right to left), all characters
-have their directionality defined:
+Because scripts differ in their directionality (for example Hebrew is
+written right to left), Unicode supplies these properties:
 
+    Property    Meaning
+
     BidiL       Left-to-Right
     BidiLRE     Left-to-Right Embedding
     BidiLRO     Left-to-Right Override
@@ -279,18 +264,21 @@ have their directionality defined:
     BidiWS      Whitespace
     BidiON      Other Neutrals
 
+For example, C<\p{BidiR}> matches all characters that are normally
+written right to left.
+
 =back
 
 =head2 Scripts
 
-The scripts available for C<\p{In...}> and C<\P{In...}>, for example
-C<\p{InLatin}> or \p{InCyrillic>, are as follows:
+The scripts available via C<\p{...}> and C<\P{...}>, for example
+C<\p{Latin}> or \p{Cyrillic>, are as follows:
 
     Arabic
     Armenian
     Bengali
     Bopomofo
-    Canadian-Aboriginal
+    CanadianAboriginal
     Cherokee
     Cyrillic
     Deseret
@@ -315,7 +303,7 @@ C<\p{InLatin}> or \p{InCyrillic>, are as follows:
     Mongolian
     Myanmar
     Ogham
-    Old-Italic
+    OldItalic
     Oriya
     Runic
     Sinhala
@@ -331,49 +319,52 @@ There are also extended property classes that supplement the basic
 properties, defined by the F<PropList> Unicode database:
 
     ASCII_Hex_Digit
-    Bidi_Control
+    BidiControl
     Dash
     Diacritic
     Extender
-    Hex_Digit
+    HexDigit
     Hyphen
     Ideographic
-    Join_Control
-    Noncharacter_Code_Point
-    Other_Alphabetic
-    Other_Lowercase
-    Other_Math
-    Other_Uppercase
-    Quotation_Mark
-    White_Space
+    JoinControl
+    NoncharacterCodePoint
+    OtherAlphabetic
+    OtherLowercase
+    OtherMath
+    OtherUppercase
+    QuotationMark
+    WhiteSpace
 
 and further derived properties:
 
-    Alphabetic      Lu + Ll + Lt + Lm + Lo + Other_Alphabetic
-    Lowercase       Ll + Other_Lowercase
-    Uppercase       Lu + Other_Uppercase
-    Math            Sm + Other_Math
+    Alphabetic      Lu + Ll + Lt + Lm + Lo + OtherAlphabetic
+    Lowercase       Ll + OtherLowercase
+    Uppercase       Lu + OtherUppercase
+    Math            Sm + OtherMath
 
     ID_Start        Lu + Ll + Lt + Lm + Lo + Nl
     ID_Continue     ID_Start + Mn + Mc + Nd + Pc
 
     Any             Any character
-    Assigned        Any non-Cn character
+    Assigned        Any non-Cn character (i.e. synonym for C<\P{Cn}>)
+    Unassigned      Synonym for C<\p{Cn}>
     Common          Any character (or unassigned code point)
                     not explicitly assigned to a script
 
+For backward compatability, all properties mentioned so far may have C<Is>
+prepended to their name (e.g. C<\P{IsLu}> is equal to C<\P{Lu}>).
+
 =head2 Blocks
 
-In addition to B<scripts>, Unicode also defines B<blocks> of
-characters. The difference between scripts and blocks is that the
-scripts concept is closer to natural languages, while the blocks
-concept is more an artificial grouping based on groups of 256 Unicode
-characters. For example, the C<Latin> script contains letters from
-many blocks. On the other hand, the C<Latin> script does not contain
-all the characters from those blocks. It does not, for example,
-contain digits because digits are shared across many scripts. Digits
-and other similar groups, like punctuation, are in a category called
-C<Common>.
+In addition to B<scripts>, Unicode also defines B<blocks> of characters.
+The difference between scripts and blocks is that the scripts concept is
+closer to natural languages, while the blocks concept is more an artificial
+grouping based on groups of mostly 256 Unicode characters. For example, the
+C<Latin> script contains letters from many blocks. On the other hand, the
+C<Latin> script does not contain all the characters from those blocks. It
+does not, for example, contain digits because digits are shared across many
+scripts. Digits and other similar groups, like punctuation, are in a
+category called C<Common>.
 
 For more about scripts, see the UTR #24:
 
@@ -383,113 +374,110 @@ For more about blocks, see:
 
     http://www.unicode.org/Public/UNIDATA/Blocks.txt
 
-Because there are overlaps in naming (there are, for example, both
-a script called C<Katakana> and a block called C<Katakana>, the block
-version has C<Block> appended to its name, C<\p{InKatakanaBlock}>.
-
-Notice that this definition was introduced in Perl 5.8.0: in Perl
-5.6 only the blocks were used; in Perl 5.8.0 scripts became the
-preferential Unicode character class definition (prompted by
-recommendations from the Unicode consortium); this meant that
-the definitions of some character classes changed (the ones in
-the below list that have the C<Block> appended).
-
-    Alphabetic Presentation Forms
-    Arabic Block
-    Arabic Presentation Forms-A
-    Arabic Presentation Forms-B
-    Armenian Block
-    Arrows
-    Basic Latin
-    Bengali Block
-    Block Elements
-    Bopomofo Block
-    Bopomofo Extended
-    Box Drawing
-    Braille Patterns
-    Byzantine Musical Symbols
-    CJK Compatibility
-    CJK Compatibility Forms
-    CJK Compatibility Ideographs
-    CJK Compatibility Ideographs Supplement
-    CJK Radicals Supplement
-    CJK Symbols and Punctuation
-    CJK Unified Ideographs
-    CJK Unified Ideographs Extension A
-    CJK Unified Ideographs Extension B
-    Cherokee Block
-    Combining Diacritical Marks
-    Combining Half Marks
-    Combining Marks for Symbols
-    Control Pictures
-    Currency Symbols
-    Cyrillic Block
-    Deseret Block
-    Devanagari Block
-    Dingbats
-    Enclosed Alphanumerics
-    Enclosed CJK Letters and Months
-    Ethiopic Block
-    General Punctuation
-    Geometric Shapes
-    Georgian Block
-    Gothic Block
-    Greek Block
-    Greek Extended
-    Gujarati Block
-    Gurmukhi Block
-    Halfwidth and Fullwidth Forms
-    Hangul Compatibility Jamo
-    Hangul Jamo
-    Hangul Syllables
-    Hebrew Block
-    High Private Use Surrogates
-    High Surrogates
-    Hiragana Block
-    IPA Extensions
-    Ideographic Description Characters
-    Kanbun
-    Kangxi Radicals
-    Kannada Block
-    Katakana Block
-    Khmer Block
-    Lao Block
-    Latin 1 Supplement
-    Latin Extended Additional
-    Latin Extended-A
-    Latin Extended-B
-    Letterlike Symbols
-    Low Surrogates
-    Malayalam Block
-    Mathematical Alphanumeric Symbols
-    Mathematical Operators
-    Miscellaneous Symbols
-    Miscellaneous Technical
-    Mongolian Block
-    Musical Symbols
-    Myanmar Block
-    Number Forms
-    Ogham Block
-    Old Italic Block
-    Optical Character Recognition
-    Oriya Block
-    Private Use
-    Runic Block
-    Sinhala Block
-    Small Form Variants
-    Spacing Modifier Letters
-    Specials
-    Superscripts and Subscripts
-    Syriac Block
-    Tags
-    Tamil Block
-    Telugu Block
-    Thaana Block
-    Thai Block
-    Tibetan Block
-    Unified Canadian Aboriginal Syllabics
-    Yi Radicals
-    Yi Syllables
+Blocks names are given with the C<In> prefix. For example, the
+Katakana block is referenced via C<\p{InKatakana}. The C<In>
+prefix may be omitted if there is no nameing conflict with a script
+or any other property, but it is recommended that C<In> always be used
+to avoid confusion.
+
+These block names are supported:
+
+    InAlphabeticPresentationForms
+    InArabicBlock
+    InArabicPresentationFormsA
+    InArabicPresentationFormsB
+    InArmenianBlock
+    InArrows
+    InBasicLatin
+    InBengaliBlock
+    InBlockElements
+    InBopomofoBlock
+    InBopomofoExtended
+    InBoxDrawing
+    InBraillePatterns
+    InByzantineMusicalSymbols
+    InCJKCompatibility
+    InCJKCompatibilityForms
+    InCJKCompatibilityIdeographs
+    InCJKCompatibilityIdeographsSupplement
+    InCJKRadicalsSupplement
+    InCJKSymbolsAndPunctuation
+    InCJKUnifiedIdeographs
+    InCJKUnifiedIdeographsExtensionA
+    InCJKUnifiedIdeographsExtensionB
+    InCherokeeBlock
+    InCombiningDiacriticalMarks
+    InCombiningHalfMarks
+    InCombiningMarksForSymbols
+    InControlPictures
+    InCurrencySymbols
+    InCyrillicBlock
+    InDeseretBlock
+    InDevanagariBlock
+    InDingbats
+    InEnclosedAlphanumerics
+    InEnclosedCJKLettersAndMonths
+    InEthiopicBlock
+    InGeneralPunctuation
+    InGeometricShapes
+    InGeorgianBlock
+    InGothicBlock
+    InGreekBlock
+    InGreekExtended
+    InGujaratiBlock
+    InGurmukhiBlock
+    InHalfwidthAndFullwidthForms
+    InHangulCompatibilityJamo
+    InHangulJamo
+    InHangulSyllables
+    InHebrewBlock
+    InHighPrivateUseSurrogates
+    InHighSurrogates
+    InHiraganaBlock
+    InIPAExtensions
+    InIdeographicDescriptionCharacters
+    InKanbun
+    InKangxiRadicals
+    InKannadaBlock
+    InKatakanaBlock
+    InKhmerBlock
+    InLaoBlock
+    InLatin1Supplement
+    InLatinExtendedAdditional
+    InLatinExtended-A
+    InLatinExtended-B
+    InLetterlikeSymbols
+    InLowSurrogates
+    InMalayalamBlock
+    InMathematicalAlphanumericSymbols
+    InMathematicalOperators
+    InMiscellaneousSymbols
+    InMiscellaneousTechnical
+    InMongolianBlock
+    InMusicalSymbols
+    InMyanmarBlock
+    InNumberForms
+    InOghamBlock
+    InOldItalicBlock
+    InOpticalCharacterRecognition
+    InOriyaBlock
+    InPrivateUse
+    InRunicBlock
+    InSinhalaBlock
+    InSmallFormVariants
+    InSpacingModifierLetters
+    InSpecials
+    InSuperscriptsAndSubscripts
+    InSyriacBlock
+    InTags
+    InTamilBlock
+    InTeluguBlock
+    InThaanaBlock
+    InThaiBlock
+    InTibetanBlock
+    InUnifiedCanadianAboriginalSyllabics
+    InYiRadicals
+    InYiSyllables
 
 =over 4
 
@@ -634,7 +622,7 @@ Level 1 - Basic Unicode Support
 
         [ 1] \x{...}
         [ 2] \N{...}
-        [ 3] . \p{Is...} \P{Is...}
+        [ 3] . \p{...} \P{...}
         [ 4] now scripts (see UTR#24 Script Names) in addition to blocks
         [ 5] have negation
         [ 6] can use look-ahead to emulate subtraction (*)
@@ -657,8 +645,8 @@ For example, what TR18 might write as
 
 in Perl can be written as:
 
-        (?!\p{UNASSIGNED})\p{GreekBlock}
-        (?=\p{ASSIGNED})\p{GreekBlock}
+        (?!\p{Unassigned})\p{InGreek}
+        (?=\p{Assigned})\p{InGreek}
 
 But in this particular example, you probably really want