author | Karl Williamson <khw@cpan.org> | 2016-06-30 22:05:55 -0600 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2016-06-30 22:22:36 -0600 |
commit | 48791bf1d9612a84d71edc00af8610da1a6cf34b (patch) | |
tree | 38935030c6eeba3170b7d6396461ef46354d7d84 /pod/perlunicode.pod | |
parent | 7d7345cf4f14a683b78978462e37e75c5bccd5ed (diff) | |
download | perl-48791bf1d9612a84d71edc00af8610da1a6cf34b.tar.gz |
Change \p{foo} to mean \p{scx: foo}
when 'foo' is a script. Also update the pods correspondingly, and to
encourage scx property use.
See http://nntp.perl.org/group/perl.perl5.porters/237403
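
A minimal Perl sketch of what this change means in practice, assuming a perl built with Unicode 6.0 or later data (so that C<Script_Extensions> exists). The sample character U+30FC is chosen here for illustration and does not come from the commit; its C<Script> value is C<Common>, while its C<Script_Extensions> value lists Hiragana and Katakana. The C<%matches> hash is a hypothetical helper name.

```perl
use strict;
use warnings;

# U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK (illustrative choice,
# not taken from the commit): sc=Common, scx={Hiragana, Katakana}.
my $mark = "\x{30FC}";

my %matches = (
    'sc=Common'    => ($mark =~ /\p{Script=Common}/              ? 1 : 0),
    'sc=Katakana'  => ($mark =~ /\p{Script=Katakana}/            ? 1 : 0),
    'scx=Katakana' => ($mark =~ /\p{Script_Extensions=Katakana}/ ? 1 : 0),
    # The single form: plain Script before v5.26, Script_Extensions
    # from v5.26 on, as described in the commit message above.
    'Katakana'     => ($mark =~ /\p{Katakana}/                   ? 1 : 0),
);

# Report each property form and whether it matched the sample character.
printf "%-13s %s\n", $_, ($matches{$_} ? 'matches' : 'no match')
    for sort keys %matches;
```

On a perl older than v5.26 the single-form line should report no match, because C<\p{Katakana}> then meant plain C<Script>; on v5.26 and later it should agree with the C<scx=Katakana> line.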
Diffstat (limited to 'pod/perlunicode.pod')
-rw-r--r-- | pod/perlunicode.pod | 35 |
1 files changed, 23 insertions, 12 deletions
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 959b8008fc..8346b2390f 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -602,16 +602,19 @@ The world's languages are written in many different scripts. This sentence
 written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
 Hiragana or Katakana. There are many more.
 
-The Unicode C<Script> and C<Script_Extensions> properties give what script a
-given character is in. Either property can be specified with the
-compound form like
+The Unicode C<Script> and C<Script_Extensions> properties give what
+script a given character is in. The C<Script_Extensions> property is an
+improved version of C<Script>, as demonstrated below. Either property
+can be specified with the compound form like
 C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>), or
 C<\p{Script_Extensions=Javanese}> (short: C<\p{scx=java}>).
 In addition, Perl furnishes shortcuts for all
-C<Script> property names. You can omit everything up through the equals
-(or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>.
-(This is not true for C<Script_Extensions>, which is required to be
-written in the compound form.)
+C<Script_Extensions> property names. You can omit everything up through
+the equals (or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>.
+(This is not true for C<Script>, which is required to be
+written in the compound form. Prior to Perl v5.26, the single form
+returned the plain old C<Script> version, but was changed because
+C<Script_Extensions> gives better results.)
 
 The difference between these two properties involves characters that are
 used in multiple scripts. For example the digits '0' through '9' are
@@ -645,7 +648,11 @@ fewer characters in the C<Common> script, and correspondingly more in
 other scripts. It is new in Unicode version 6.0, and its data are likely
 to change significantly in later releases, as things get sorted out.
 New code should probably be using C<Script_Extensions> and not plain
-C<Script>.
+C<Script>. If you compile perl with a Unicode release that doesn't have
+C<Script_Extensions>, the single form Perl extensions will instead refer
+to the plain C<Script> property. If you compile with a version of
+Unicode that doesn't have the C<Script> property, these extensions will
+not be defined at all.
 
 (Actually, besides C<Common>, the C<Inherited> script, contains
 characters that are used in multiple scripts. These are modifier
@@ -658,10 +665,13 @@ C<Script>, but not in C<Script_Extensions>.)
 It is worth stressing that there are several different sets of digits in
 Unicode that are equivalent to 0-9 and are matchable by C<\d> in a
 regular expression. If they are used in a single language only, they
-are in that language's C<Script> and C<Script_Extension>. If they are
+are in that language's C<Script> and C<Script_Extensions>. If they are
 used in more than one script, they will be in C<sc=Common>, but only
 if they are used in many scripts should they be in C<scx=Common>.
 
+The explanation above has omitted some detail; refer to UAX#24 "Unicode
+Script Property": L<http://www.unicode.org/reports/tr24>.
+
 A complete list of scripts and their shortcuts is in L<perluniprops>.
 
 =head3 B<Use of the C<"Is"> Prefix>
@@ -690,7 +700,7 @@ C<Common> script.
 For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
 L<http://www.unicode.org/reports/tr24>
 
-The C<Script> or C<Script_Extensions> properties are likely to be the
+The C<Script_Extensions> or C<Script> properties are likely to be the
 ones you want to use when processing
 natural language; the C<Block> property may occasionally be useful in working
 with the nuts and bolts of Unicode.
@@ -711,10 +721,11 @@ longer work. The extensions are mentioned here for completeness: Take
 the block name and prefix it with one of: C<In> (for example
 C<\p{Blk=Arrows}> can currently be written as C<\p{In_Arrows}>); or
 sometimes C<Is> (like C<\p{Is_Arrows}>); or sometimes no prefix at all
-(C<\p{Arrows}>). As of this writing (Unicode 8.0) there are no
+(C<\p{Arrows}>). As of this writing (Unicode 9.0) there are no
 conflicts with using the C<In_> prefix, but there are plenty with the
 other two forms. For example, C<\p{Is_Hebrew}> and C<\p{Hebrew}> mean
-C<\p{Script=Hebrew}> which is NOT the same thing as C<\p{Blk=Hebrew}>. Our
+C<\p{Script_Extensions=Hebrew}> which is NOT the same thing as
+C<\p{Blk=Hebrew}>. Our
 advice used to be to use the C<In_> prefix as a single form way of
 specifying a block. But Unicode 8.0 added properties whose names begin
 with C<In>, and it's now clear that it's only luck that's so far