author | Jarkko Hietaniemi <jhi@iki.fi> | 2001-07-04 01:32:11 +0000 |
---|---|---|
committer | Jarkko Hietaniemi <jhi@iki.fi> | 2001-07-04 01:32:11 +0000 |
commit | 2796c109dc2c56e2241410992d78bd8e0cccd71f (patch) | |
tree | 6afcbd325dc2525c4681ef8e20e95afc8fcd49a4 /pod | |
parent | ad9cab3708f3a6aff28b5c1ca3a390c013235283 (diff) | |
Support preferentially the Unicode 'scripts' definition
in the \p{In...} notation since according to Unicode the
scripts concept is more natural for matching than using
the somewhat artificial block names. The block names are
still available, though, and if there's a name conflict,
the scripts one wins and the blocks one has to do with
'Block' appended to its name. For more information see
http://www.unicode.org/unicode/reports/tr24/
p4raw-id: //depot/perl@11132
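The script/block distinction described in the commit message can be sketched as follows. This is only an illustration, not part of the commit: it assumes a recent Perl that supports the explicit C<\p{Script=...}> and C<\p{Block=...}> property forms, and uses those instead of the C<\p{In...}> spelling so the result does not depend on which meaning C<In> has in a given release. The helper name C<classify> is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether a code point belongs to the
# Hebrew script and/or the Hebrew block.
# Assumption: the running Perl supports \p{Script=...} and \p{Block=...}.
sub classify {
    my ($cp) = @_;
    my $ch = chr $cp;
    return sprintf "script=%s block=%s",
        ($ch =~ /\p{Script=Hebrew}/ ? "yes" : "no"),
        ($ch =~ /\p{Block=Hebrew}/  ? "yes" : "no");
}

# U+05D0 HEBREW LETTER ALEF lies in the Hebrew block and the Hebrew script.
# U+FB1D HEBREW LETTER YOD WITH HIRIQ is Hebrew by script, but it is encoded
# in the Alphabetic Presentation Forms block, outside the Hebrew block.
printf "U+%04X %s\n", $_, classify($_) for 0x05D0, 0xFB1D;
# prints:
# U+05D0 script=yes block=yes
# U+FB1D script=yes block=no
```

The second code point shows why a script-based class is usually more inclusive than the block of the same name: the script follows the characters wherever they are encoded, while the block is a fixed code point range.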
Diffstat (limited to 'pod')
-rw-r--r-- | pod/perl572delta.pod | 28 |
-rw-r--r-- | pod/perlretut.pod | 5 |
-rw-r--r-- | pod/perltodo.pod | 11 |
-rw-r--r-- | pod/perlunicode.pod | 285 |
4 files changed, 209 insertions, 120 deletions
diff --git a/pod/perl572delta.pod b/pod/perl572delta.pod
index 2800cf85dc..1ff8436508 100644
--- a/pod/perl572delta.pod
+++ b/pod/perl572delta.pod
@@ -49,6 +49,34 @@ statically built in. This may or may not be a problem with ancient
 TCP/IP stacks of VMS: we do not know since we weren't able to test
 Perl in such configurations.
 
+=head2 Different Definition of the Unicode Character Classes \p{In...}
+
+As suggested by the Unicode consortium, the Unicode character classes
+now prefer I<scripts> as opposed to I<blocks> (as defined by Unicode);
+in Perl, when the C<\p{In....}> and the C<\p{In....}> regular expression
+constructs are used. This has changed the definition of some of those
+character classes.
+
+The difference between scripts and blocks is that scripts are the
+glyphs used by a language or a group of languages, while the blocks
+are more artificial groupings of 256 characters based on the Unicode
+numbering.
+
+In general this change results in more inclusive Unicode character
+classes, but changes to the other direction also do take place:
+for example while the script C<Latin> includes all the Latin
+characters and their various diacritic-adorned versions, it
+does not include the various punctuation or digits (since they
+are not solely C<Latin>).
+
+Changes in the character class semantics may have happened if a script
+and a block happen to have the same name, for example C<Hebrew>.
+In such cases the script wins and C<\p{InHebrew}> now means the script
+definition of Hebrew. The block definition in still available,
+though, by appending C<Block> to the name: C<\p{InHebrewBlock}> means
+what C<\p{InHebrew}> meant in perl 5.6.0. For the full list
+of affected character classes, see L<perlunicode/Blocks>.
+
 =head2 Deprecations
 
 The current user-visible implementation of pseudo-hashes (the weird
diff --git a/pod/perlretut.pod b/pod/perlretut.pod
index 3e83c1305f..7f8e8f5430 100644
--- a/pod/perlretut.pod
+++ b/pod/perlretut.pod
@@ -1752,8 +1752,9 @@ character class of Unicode 'marks', for example accent marks.
 For the full list see L<perlunicode>.
 
 The Unicode has also been separated into blocks of charaters which you
-can test with C<\p{InBlock}> and C<\P{InBlock}>, for example C<\p{InGreek}>
-and C<\P{InKatakana}>. For the full list see L<perlunicode>.
+can test with C<\p{In...}> (in) and C<\P{In...}> (not in), for example
+C<\p{InLatin}, C<\p{InGreek}>, or C<\P{InKatakana}>. For the full list see
+L<perlunicode>.
 
 For the the full and latest information see the latest
 Unicode standard.
diff --git a/pod/perltodo.pod b/pod/perltodo.pod
index f96c7704e1..3c7243254e 100644
--- a/pod/perltodo.pod
+++ b/pod/perltodo.pod
@@ -87,17 +87,6 @@ class subtraction.
 
  http://www.unicode.org/unicode/reports/tr18/
 
-=head2 Unicode Scripts support
-
-Currently the C<\p{In...}> supports only the Blocks database, like
-C<\p{BasicLatin}>, C<\p{InGreek}>, C<\p{InThai}>, but there's also the
-Scripts database, which has members like C<Latin>, C<Greek>,
-C<Armenian>, C<Han>. It is desireable that also the script names
-could be used for the C<\p{In...}> construct. Note: needs to be
-researched whether this is possible, that is, are there conflicts
-between the Blocks and the Scripts, is the Blocks Greek the same as
-the Scripts Greek?
-
 =head2 use Thread for iThreads
 
 Artur Bergman's C<iThreads> module is a start on this, but needs to
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index d629cabe9f..877b497613 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -105,13 +105,14 @@ bytes change to operating on characters. For ASCII data this makes no
 difference, because UTF-8 stores ASCII in single bytes, but for any
 character greater than C<chr(127)>, the character may be stored in
 a sequence of two or more bytes, all of which have the high bit set.
-For C1 controls or Latin 1 characters on an EBCDIC platform the character
-may be stored in a UTF-EBCDIC multi byte sequence.
-But by and large, the user need not worry about this, because Perl
-hides it from the user. A character in Perl is logically just a number
-ranging from 0 to 2**32 or so. Larger characters encode to longer
-sequences of bytes internally, but again, this is just an internal
-detail which is hidden at the Perl level.
+
+For C1 controls or Latin 1 characters on an EBCDIC platform the
+character may be stored in a UTF-EBCDIC multi byte sequence. But by
+and large, the user need not worry about this, because Perl hides it
+from the user. A character in Perl is logically just a number ranging
+from 0 to 2**32 or so. Larger characters encode to longer sequences
+of bytes internally, but again, this is just an internal detail which
+is hidden at the Perl level.
 
 =head2 Effects of character semantics
 
@@ -166,7 +167,8 @@ with all non-alphanumeric characters removed, for example the block
 name C<"Latin-1 Supplement"> becomes C<\p{InLatin1Supplement}>.
 
 Here is the list as of Unicode 3.1.0 (the two-letter classes) and
-Perl 5.8.0 (the one-letter classes):
+as defined by Perl (the one-letter classes) (in Unicode materials
+what Perl calls C<L> is often called C<L&>):
 
  L Letter
  Lu Letter, Uppercase
@@ -232,105 +234,174 @@ have their directionality defined:
  BidiWS Whitespace
  BidiON Other Neutrals
 
-The blocks available for C<\p{InBlock}> and C<\P{InBlock}>, for
-example \p{InCyrillic>, are as follows:
-
- BasicLatin
- Latin1Supplement
- LatinExtendedA
- LatinExtendedB
- IPAExtensions
- SpacingModifierLetters
- CombiningDiacriticalMarks
- Greek
- Cyrillic
- Armenian
- Hebrew
- Arabic
- Syriac
- Thaana
- Devanagari
- Bengali
- Gurmukhi
- Gujarati
- Oriya
- Tamil
- Telugu
- Kannada
- Malayalam
- Sinhala
- Thai
- Lao
- Tibetan
- Myanmar
- Georgian
- HangulJamo
- Ethiopic
- Cherokee
- UnifiedCanadianAboriginalSyllabics
- Ogham
- Runic
- Khmer
- Mongolian
- LatinExtendedAdditional
- GreekExtended
- GeneralPunctuation
- SuperscriptsandSubscripts
- CurrencySymbols
- CombiningMarksforSymbols
- LetterlikeSymbols
- NumberForms
- Arrows
- MathematicalOperators
- MiscellaneousTechnical
- ControlPictures
- OpticalCharacterRecognition
- EnclosedAlphanumerics
- BoxDrawing
- BlockElements
- GeometricShapes
- MiscellaneousSymbols
- Dingbats
- BraillePatterns
- CJKRadicalsSupplement
- KangxiRadicals
- IdeographicDescriptionCharacters
- CJKSymbolsandPunctuation
- Hiragana
- Katakana
- Bopomofo
- HangulCompatibilityJamo
- Kanbun
- BopomofoExtended
- EnclosedCJKLettersandMonths
- CJKCompatibility
- CJKUnifiedIdeographsExtensionA
- CJKUnifiedIdeographs
- YiSyllables
- YiRadicals
- HangulSyllables
- HighSurrogates
- HighPrivateUseSurrogates
- LowSurrogates
- PrivateUse
- CJKCompatibilityIdeographs
- AlphabeticPresentationForms
- ArabicPresentationFormsA
- CombiningHalfMarks
- CJKCompatibilityForms
- SmallFormVariants
- ArabicPresentationFormsB
- Specials
- HalfwidthandFullwidthForms
- OldItalic
- Gothic
- Deseret
- ByzantineMusicalSymbols
- MusicalSymbols
- MathematicalAlphanumericSymbols
- CJKUnifiedIdeographsExtensionB
- CJKCompatibilityIdeographsSupplement
- Tags
+=head2 Scripts
+
+The scripts available for C<\p{In...}> and C<\P{In...}>, for
+example \p{InCyrillic>, are as follows, for example C<\p{InLatin}>
+or C<\P{InHan}>:
+
+ Latin
+ Greek
+ Cyrillic
+ Armenian
+ Hebrew
+ Arabic
+ Syriac
+ Thaana
+ Devanagari
+ Bengali
+ Gurmukhi
+ Gujarati
+ Oriya
+ Tamil
+ Telugu
+ Kannada
+ Malayalam
+ Sinhala
+ Thai
+ Lao
+ Tibetan
+ Myanmar
+ Georgian
+ Hangul
+ Ethiopic
+ Cherokee
+ CanadianAboriginal
+ Ogham
+ Runic
+ Khmer
+ Mongolian
+ Hiragana
+ Katakana
+ Bopomofo
+ Han
+ Yi
+ OldItalic
+ Gothic
+ Deseret
+ Inherited
+
+=head2 Blocks
+
+In addition to B<scripts>, Unicode also defines B<blocks> of
+characters. The difference between scripts and blocks is that the
+former concept is closer to natural languages, while the latter
+concept is more an artificial grouping based on groups of 256 Unicode
+characters. For example, the C<Latin> script contains letters from
+many blocks, but it does not contain all the characters from those
+blocks, it does not for example contain digits.
+
+For more about scripts see the UTR #24:
+http://www.unicode.org/unicode/reports/tr24/
+For more about blocks see
+http://www.unicode.org/Public/UNIDATA/Blocks.txt
+
+Because there are overlaps in naming (there are, for example, both
+a script called C<Katakana> and a block called C<Katakana>, the block
+version has C<Block> appended to its name, C<\p{InKatakanaBlock}>.
+
+Notice that this definition was introduced in Perl 5.8.0: in Perl
+5.6.0 only the blocks were used; in Perl 5.8.0 scripts became the
+preferential character class definition; this meant that the
+definitions of some character classes changed (the ones in the
+below list that have the C<Block> appended).
+
+ BasicLatin
+ Latin1Supplement
+ LatinExtendedA
+ LatinExtendedB
+ IPAExtensions
+ SpacingModifierLetters
+ CombiningDiacriticalMarks
+ GreekBlock
+ CyrillicBlock
+ ArmenianBlock
+ HebrewBlock
+ ArabicBlock
+ SyriacBlock
+ ThaanaBlock
+ DevanagariBlock
+ BengaliBlock
+ GurmukhiBlock
+ GujaratiBlock
+ OriyaBlock
+ TamilBlock
+ TeluguBlock
+ KannadaBlock
+ MalayalamBlock
+ SinhalaBlock
+ ThaiBlock
+ LaoBlock
+ TibetanBlock
+ MyanmarBlock
+ GeorgianBlock
+ HangulJamo
+ EthiopicBlock
+ CherokeeBlock
+ UnifiedCanadianAboriginalSyllabics
+ OghamBlock
+ RunicBlock
+ KhmerBlock
+ MongolianBlock
+ LatinExtendedAdditional
+ GreekExtended
+ GeneralPunctuation
+ SuperscriptsandSubscripts
+ CurrencySymbols
+ CombiningMarksforSymbols
+ LetterlikeSymbols
+ NumberForms
+ Arrows
+ MathematicalOperators
+ MiscellaneousTechnical
+ ControlPictures
+ OpticalCharacterRecognition
+ EnclosedAlphanumerics
+ BoxDrawing
+ BlockElements
+ GeometricShapes
+ MiscellaneousSymbols
+ Dingbats
+ BraillePatterns
+ CJKRadicalsSupplement
+ KangxiRadicals
+ IdeographicDescriptionCharacters
+ CJKSymbolsandPunctuation
+ HiraganaBlock
+ KatakanaBlock
+ BopomofoBlock
+ HangulCompatibilityJamo
+ Kanbun
+ BopomofoExtended
+ EnclosedCJKLettersandMonths
+ CJKCompatibility
+ CJKUnifiedIdeographsExtensionA
+ CJKUnifiedIdeographs
+ YiSyllables
+ YiRadicals
+ HangulSyllables
+ HighSurrogates
+ HighPrivateUseSurrogates
+ LowSurrogates
+ PrivateUse
+ CJKCompatibilityIdeographs
+ AlphabeticPresentationForms
+ ArabicPresentationFormsA
+ CombiningHalfMarks
+ CJKCompatibilityForms
+ SmallFormVariants
+ ArabicPresentationFormsB
+ Specials
+ HalfwidthandFullwidthForms
+ OldItalicBlock
+ GothicBlock
+ DeseretBlock
+ ByzantineMusicalSymbols
+ MusicalSymbols
+ MathematicalAlphanumericSymbols
+ CJKUnifiedIdeographsExtensionB
+ CJKCompatibilityIdeographsSupplement
+ Tags
 
 =item *