author: Jarkko Hietaniemi <jhi@iki.fi> 2001-07-04 01:32:11 +0000
committer: Jarkko Hietaniemi <jhi@iki.fi> 2001-07-04 01:32:11 +0000
commit: 2796c109dc2c56e2241410992d78bd8e0cccd71f
tree: 6afcbd325dc2525c4681ef8e20e95afc8fcd49a4 /pod
parent: ad9cab3708f3a6aff28b5c1ca3a390c013235283
Support preferentially the Unicode 'scripts' definition in the \p{In...} notation, since according to Unicode the scripts concept is more natural for matching than the somewhat artificial block names. The block names are still available, though: if there is a name conflict, the script name wins and the block has to make do with 'Block' appended to its name. For more information see http://www.unicode.org/unicode/reports/tr24/

p4raw-id: //depot/perl@11132
Diffstat (limited to 'pod')
-rw-r--r-- pod/perl572delta.pod 28
-rw-r--r-- pod/perlretut.pod 5
-rw-r--r-- pod/perltodo.pod 11
-rw-r--r-- pod/perlunicode.pod 285
4 files changed, 209 insertions, 120 deletions
diff --git a/pod/perl572delta.pod b/pod/perl572delta.pod
index 2800cf85dc..1ff8436508 100644
--- a/pod/perl572delta.pod
+++ b/pod/perl572delta.pod
@@ -49,6 +49,34 @@ statically built in. This may or may not be a problem with ancient
TCP/IP stacks of VMS: we do not know since we weren't able to test
Perl in such configurations.
+=head2 Different Definition of the Unicode Character Classes \p{In...}
+
+As suggested by the Unicode consortium, the Unicode character classes
+now prefer I<scripts> as opposed to I<blocks> (as defined by Unicode)
+when the C<\p{In...}> and the C<\P{In...}> regular expression
+constructs are used in Perl. This has changed the definition of some
+of those character classes.
+
+The difference between scripts and blocks is that scripts are the
+glyphs used by a language or a group of languages, while the blocks
+are more artificial groupings of 256 characters based on the Unicode
+numbering.
+
+In general this change results in more inclusive Unicode character
+classes, but changes in the other direction also take place:
+for example while the script C<Latin> includes all the Latin
+characters and their various diacritic-adorned versions, it
+does not include the various punctuation or digits (since they
+are not solely C<Latin>).
+
+Changes in the character class semantics may have happened if a script
+and a block happen to have the same name, for example C<Hebrew>.
+In such cases the script wins and C<\p{InHebrew}> now means the script
+definition of Hebrew. The block definition is still available,
+though, by appending C<Block> to the name: C<\p{InHebrewBlock}> means
+what C<\p{InHebrew}> meant in perl 5.6.0. For the full list
+of affected character classes, see L<perlunicode/Blocks>.
+
=head2 Deprecations
The current user-visible implementation of pseudo-hashes (the weird
diff --git a/pod/perlretut.pod b/pod/perlretut.pod
index 3e83c1305f..7f8e8f5430 100644
--- a/pod/perlretut.pod
+++ b/pod/perlretut.pod
@@ -1752,8 +1752,9 @@ character class of Unicode 'marks', for example accent marks.
For the full list see L<perlunicode>.
The Unicode has also been separated into blocks of characters which you
-can test with C<\p{InBlock}> and C<\P{InBlock}>, for example C<\p{InGreek}>
-and C<\P{InKatakana}>. For the full list see L<perlunicode>.
+can test with C<\p{In...}> (in) and C<\P{In...}> (not in), for example
+C<\p{InLatin}>, C<\p{InGreek}>, or C<\P{InKatakana}>. For the full list see
+L<perlunicode>.
For the full and latest information see the latest Unicode standard.
diff --git a/pod/perltodo.pod b/pod/perltodo.pod
index f96c7704e1..3c7243254e 100644
--- a/pod/perltodo.pod
+++ b/pod/perltodo.pod
@@ -87,17 +87,6 @@ class subtraction.
http://www.unicode.org/unicode/reports/tr18/
-=head2 Unicode Scripts support
-
-Currently the C<\p{In...}> supports only the Blocks database, like
-C<\p{BasicLatin}>, C<\p{InGreek}>, C<\p{InThai}>, but there's also the
-Scripts database, which has members like C<Latin>, C<Greek>,
-C<Armenian>, C<Han>. It is desireable that also the script names
-could be used for the C<\p{In...}> construct. Note: needs to be
-researched whether this is possible, that is, are there conflicts
-between the Blocks and the Scripts, is the Blocks Greek the same as
-the Scripts Greek?
-
=head2 use Thread for iThreads
Artur Bergman's C<iThreads> module is a start on this, but needs to
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index d629cabe9f..877b497613 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -105,13 +105,14 @@ bytes change to operating on characters. For ASCII data this makes
no difference, because UTF-8 stores ASCII in single bytes, but for
any character greater than C<chr(127)>, the character may be stored in
a sequence of two or more bytes, all of which have the high bit set.
-For C1 controls or Latin 1 characters on an EBCDIC platform the character
-may be stored in a UTF-EBCDIC multi byte sequence.
-But by and large, the user need not worry about this, because Perl
-hides it from the user. A character in Perl is logically just a number
-ranging from 0 to 2**32 or so. Larger characters encode to longer
-sequences of bytes internally, but again, this is just an internal
-detail which is hidden at the Perl level.
+
+For C1 controls or Latin 1 characters on an EBCDIC platform the
+character may be stored in a UTF-EBCDIC multi byte sequence. But by
+and large, the user need not worry about this, because Perl hides it
+from the user. A character in Perl is logically just a number ranging
+from 0 to 2**32 or so. Larger characters encode to longer sequences
+of bytes internally, but again, this is just an internal detail which
+is hidden at the Perl level.
=head2 Effects of character semantics
@@ -166,7 +167,8 @@ with all non-alphanumeric characters removed, for example the block
name C<"Latin-1 Supplement"> becomes C<\p{InLatin1Supplement}>.
Here is the list as of Unicode 3.1.0 (the two-letter classes) and
-Perl 5.8.0 (the one-letter classes):
+as defined by Perl (the one-letter classes) (in Unicode materials
+what Perl calls C<L> is often called C<L&>):
L Letter
Lu Letter, Uppercase
@@ -232,105 +234,174 @@ have their directionality defined:
BidiWS Whitespace
BidiON Other Neutrals
-The blocks available for C<\p{InBlock}> and C<\P{InBlock}>, for
-example \p{InCyrillic>, are as follows:
-
- BasicLatin
- Latin1Supplement
- LatinExtendedA
- LatinExtendedB
- IPAExtensions
- SpacingModifierLetters
- CombiningDiacriticalMarks
- Greek
- Cyrillic
- Armenian
- Hebrew
- Arabic
- Syriac
- Thaana
- Devanagari
- Bengali
- Gurmukhi
- Gujarati
- Oriya
- Tamil
- Telugu
- Kannada
- Malayalam
- Sinhala
- Thai
- Lao
- Tibetan
- Myanmar
- Georgian
- HangulJamo
- Ethiopic
- Cherokee
- UnifiedCanadianAboriginalSyllabics
- Ogham
- Runic
- Khmer
- Mongolian
- LatinExtendedAdditional
- GreekExtended
- GeneralPunctuation
- SuperscriptsandSubscripts
- CurrencySymbols
- CombiningMarksforSymbols
- LetterlikeSymbols
- NumberForms
- Arrows
- MathematicalOperators
- MiscellaneousTechnical
- ControlPictures
- OpticalCharacterRecognition
- EnclosedAlphanumerics
- BoxDrawing
- BlockElements
- GeometricShapes
- MiscellaneousSymbols
- Dingbats
- BraillePatterns
- CJKRadicalsSupplement
- KangxiRadicals
- IdeographicDescriptionCharacters
- CJKSymbolsandPunctuation
- Hiragana
- Katakana
- Bopomofo
- HangulCompatibilityJamo
- Kanbun
- BopomofoExtended
- EnclosedCJKLettersandMonths
- CJKCompatibility
- CJKUnifiedIdeographsExtensionA
- CJKUnifiedIdeographs
- YiSyllables
- YiRadicals
- HangulSyllables
- HighSurrogates
- HighPrivateUseSurrogates
- LowSurrogates
- PrivateUse
- CJKCompatibilityIdeographs
- AlphabeticPresentationForms
- ArabicPresentationFormsA
- CombiningHalfMarks
- CJKCompatibilityForms
- SmallFormVariants
- ArabicPresentationFormsB
- Specials
- HalfwidthandFullwidthForms
- OldItalic
- Gothic
- Deseret
- ByzantineMusicalSymbols
- MusicalSymbols
- MathematicalAlphanumericSymbols
- CJKUnifiedIdeographsExtensionB
- CJKCompatibilityIdeographsSupplement
- Tags
+=head2 Scripts
+
+The scripts available for C<\p{In...}> and C<\P{In...}>, for
+example C<\p{InCyrillic}>, C<\p{InLatin}>, or C<\P{InHan}>, are as
+follows:
+
+ Latin
+ Greek
+ Cyrillic
+ Armenian
+ Hebrew
+ Arabic
+ Syriac
+ Thaana
+ Devanagari
+ Bengali
+ Gurmukhi
+ Gujarati
+ Oriya
+ Tamil
+ Telugu
+ Kannada
+ Malayalam
+ Sinhala
+ Thai
+ Lao
+ Tibetan
+ Myanmar
+ Georgian
+ Hangul
+ Ethiopic
+ Cherokee
+ CanadianAboriginal
+ Ogham
+ Runic
+ Khmer
+ Mongolian
+ Hiragana
+ Katakana
+ Bopomofo
+ Han
+ Yi
+ OldItalic
+ Gothic
+ Deseret
+ Inherited
+
+=head2 Blocks
+
+In addition to B<scripts>, Unicode also defines B<blocks> of
+characters. The difference between scripts and blocks is that the
+former concept is closer to natural languages, while the latter
+concept is more an artificial grouping based on groups of 256 Unicode
+characters. For example, the C<Latin> script contains letters from
+many blocks, but it does not contain all the characters from those
+blocks: for example, it does not contain digits.
+
+For more about scripts see the UTR #24:
+http://www.unicode.org/unicode/reports/tr24/
+For more about blocks see
+http://www.unicode.org/Public/UNIDATA/Blocks.txt
+
+Because there are overlaps in naming (there are, for example, both
+a script called C<Katakana> and a block called C<Katakana>), the block
+version has C<Block> appended to its name, C<\p{InKatakanaBlock}>.
+
+Notice that this definition was introduced in Perl 5.8.0: in Perl
+5.6.0 only the blocks were used; in Perl 5.8.0 scripts became the
+preferential character class definition; this meant that the
+definitions of some character classes changed (the ones in the
+list below that have C<Block> appended).
+
+ BasicLatin
+ Latin1Supplement
+ LatinExtendedA
+ LatinExtendedB
+ IPAExtensions
+ SpacingModifierLetters
+ CombiningDiacriticalMarks
+ GreekBlock
+ CyrillicBlock
+ ArmenianBlock
+ HebrewBlock
+ ArabicBlock
+ SyriacBlock
+ ThaanaBlock
+ DevanagariBlock
+ BengaliBlock
+ GurmukhiBlock
+ GujaratiBlock
+ OriyaBlock
+ TamilBlock
+ TeluguBlock
+ KannadaBlock
+ MalayalamBlock
+ SinhalaBlock
+ ThaiBlock
+ LaoBlock
+ TibetanBlock
+ MyanmarBlock
+ GeorgianBlock
+ HangulJamo
+ EthiopicBlock
+ CherokeeBlock
+ UnifiedCanadianAboriginalSyllabics
+ OghamBlock
+ RunicBlock
+ KhmerBlock
+ MongolianBlock
+ LatinExtendedAdditional
+ GreekExtended
+ GeneralPunctuation
+ SuperscriptsandSubscripts
+ CurrencySymbols
+ CombiningMarksforSymbols
+ LetterlikeSymbols
+ NumberForms
+ Arrows
+ MathematicalOperators
+ MiscellaneousTechnical
+ ControlPictures
+ OpticalCharacterRecognition
+ EnclosedAlphanumerics
+ BoxDrawing
+ BlockElements
+ GeometricShapes
+ MiscellaneousSymbols
+ Dingbats
+ BraillePatterns
+ CJKRadicalsSupplement
+ KangxiRadicals
+ IdeographicDescriptionCharacters
+ CJKSymbolsandPunctuation
+ HiraganaBlock
+ KatakanaBlock
+ BopomofoBlock
+ HangulCompatibilityJamo
+ Kanbun
+ BopomofoExtended
+ EnclosedCJKLettersandMonths
+ CJKCompatibility
+ CJKUnifiedIdeographsExtensionA
+ CJKUnifiedIdeographs
+ YiSyllables
+ YiRadicals
+ HangulSyllables
+ HighSurrogates
+ HighPrivateUseSurrogates
+ LowSurrogates
+ PrivateUse
+ CJKCompatibilityIdeographs
+ AlphabeticPresentationForms
+ ArabicPresentationFormsA
+ CombiningHalfMarks
+ CJKCompatibilityForms
+ SmallFormVariants
+ ArabicPresentationFormsB
+ Specials
+ HalfwidthandFullwidthForms
+ OldItalicBlock
+ GothicBlock
+ DeseretBlock
+ ByzantineMusicalSymbols
+ MusicalSymbols
+ MathematicalAlphanumericSymbols
+ CJKUnifiedIdeographsExtensionB
+ CJKCompatibilityIdeographsSupplement
+ Tags
=item *