author    Jarkko Hietaniemi <jhi@iki.fi> 2001-07-03 23:02:02 +0000
committer Jarkko Hietaniemi <jhi@iki.fi> 2001-07-03 23:02:02 +0000
commit    ad9cab3708f3a6aff28b5c1ca3a390c013235283
tree      080d5152748296c3c6decee34699748f96f3b5d3
parent    16703a004678038faba1eda656251a1ad71e30db
Better document the difference between a block and a script.
p4raw-id: //depot/perl@11131
Diffstat (limited to 'lib')
-rw-r--r-- lib/Unicode/UCD.pm 37
1 files changed, 24 insertions, 13 deletions
diff --git a/lib/Unicode/UCD.pm b/lib/Unicode/UCD.pm
index 81a9aed348..4e310e7c1c 100644
--- a/lib/Unicode/UCD.pm
+++ b/lib/Unicode/UCD.pm
@@ -194,17 +194,7 @@ sub charblock {
my $charscript = charscript(0x41);
charscript() returns the script the character belongs to, e.g.
-C<Latin>, C<Greek>, C<Han>. Note that not all the character positions
-within all scripts are defined.
-
-The difference between a character block and a script is that script
-names are closer to the linguistic notion of a set of characters,
-while block is more of an artifact of the Unicode character numbering.
-For example the Latin B<script> is spread over several B<blocks>.
-
-Note also that the script names are all in uppercase, e.g. C<HEBREW>,
-while the block names are Capitalized and with intermixed spaces,
-e.g. C<Yi Syllables>.
+C<Latin>, C<Greek>, C<Han>.
Unfortunately, currently (Perl 5.8.0) there is no regular expression
notation for matching scripts as there is for blocks (C<\p{In...}>.
@@ -231,10 +221,31 @@ sub charscript {
_search(\@SCRIPTS, 0, $#SCRIPTS, $code);
}
+=head2 charblock versus charscript
+
+The difference between a character block and a script is that scripts
+are closer to the linguistic notion of a set of characters required to
+present languages, while block is more of an artifact of the Unicode
+character numbering. For example the Latin B<script> is spread over
+several B<blocks>, such as C<Basic Latin>, C<Latin 1 Supplement>,
+C<Latin Extended-A>, and C<Latin Extended-B>. On the other hand, the
+Latin script does not contain all the characters of the C<Basic Latin>
+block (also known as the ASCII): it includes only the letters, not for
+example the digits or the punctuation.
+
+For block see http://www.unicode.org/Public/UNIDATA/Blocks.txt
+
+For scripts see UTR #24: http://www.unicode.org/unicode/reports/tr24/
+
+Note also that the script names are all in uppercase, e.g. C<HEBREW>,
+while the block names are Capitalized and with intermixed spaces,
+e.g. C<Yi Syllables>.
+
=head1 IMPLEMENTATION NOTE
-The first use of L<charinfo> opens a read-only filehandle to the Unicode
-Character Database. The filehandle is kept open for further queries.
+The first use of charinfo() opens a read-only filehandle to the Unicode
+Character Database (the database is included in the Perl distribution).
+The filehandle is then kept open for further queries.
=head1 AUTHOR
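
The block/script distinction documented by this patch can be observed directly with the two functions it describes. The following is a minimal sketch, assuming a Perl installation that ships Unicode::UCD; the helper block_and_script() and the sample characters are illustrative and not part of the module. The exact name strings returned (including their case) depend on the Unicode data bundled with the Perl in use, so no output is shown here.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charblock charscript);

# Hypothetical helper (not part of Unicode::UCD): return the block name
# and the script name for a single code point.
sub block_and_script {
    my ($code) = @_;
    return (charblock($code), charscript($code));
}

# 'A' and '1' both sit in the Basic Latin block, but only the letter is
# part of the Latin script. U+00E9 is a Latin-script letter that lives
# in a different block, and U+03B1 is a Greek letter for comparison.
for my $char ('A', '1', "\x{E9}", "\x{3B1}") {
    my $code = ord $char;
    my ($block, $script) = block_and_script($code);
    printf "U+%04X block=%s script=%s\n",
        $code,
        defined $block  ? $block  : '(none)',
        defined $script ? $script : '(none)';
}
```

Comparing the rows for 'A' and '1' shows the point made in the new "charblock versus charscript" section: characters that share a block need not share a script.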