author | Karl Williamson <khw@cpan.org> | 2016-06-30 22:05:55 -0600 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2016-06-30 22:22:36 -0600 |
commit | 48791bf1d9612a84d71edc00af8610da1a6cf34b (patch) | |
tree | 38935030c6eeba3170b7d6396461ef46354d7d84 /pod | |
parent | 7d7345cf4f14a683b78978462e37e75c5bccd5ed (diff) | |
download | perl-48791bf1d9612a84d71edc00af8610da1a6cf34b.tar.gz |
Change \p{foo} to mean \p{scx: foo}
when 'foo' is a script. Also update the pods correspondingly, and to
encourage scx property use.
See http://nntp.perl.org/group/perl.perl5.porters/237403
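
A minimal sketch of what this change means in practice, assuming a perl built with Unicode 6.0 or later data. It uses U+30A0 KATAKANA-HIRAGANA DOUBLE HYPHEN, which `Script` places in `Common` but `Script_Extensions` places in both Hiragana and Katakana. The helper name `script_matches` is made up for illustration and is not part of perl.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of perl): report which script
# properties a single character matches.
sub script_matches {
    my ($ch) = @_;
    return {
        sc_common    => ($ch =~ /\p{sc=Common}/    ? 1 : 0),
        scx_common   => ($ch =~ /\p{scx=Common}/   ? 1 : 0),
        sc_katakana  => ($ch =~ /\p{sc=Katakana}/  ? 1 : 0),
        scx_katakana => ($ch =~ /\p{scx=Katakana}/ ? 1 : 0),
        # Single form: means Script_Extensions after this change
        # (v5.26 and later), plain Script on older perls.
        single_katakana => ($ch =~ /\p{Katakana}/ ? 1 : 0),
    };
}

my $r = script_matches("\x{30A0}");   # KATAKANA-HIRAGANA DOUBLE HYPHEN
printf "%-16s %d\n", $_, $r->{$_} for sort keys %$r;
```

The compound forms give the same answers before and after this commit: the character matches `sc=Common` and `scx=Katakana`, but not `scx=Common` or `sc=Katakana`. Only the `single_katakana` result depends on the perl version; on v5.26 or later it should agree with `scx_katakana`.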
Diffstat (limited to 'pod')
-rw-r--r-- | pod/perldelta.pod | 12 |
-rw-r--r-- | pod/perlretut.pod | 18 |
-rw-r--r-- | pod/perlunicode.pod | 35 |
-rw-r--r-- | pod/perlunicook.pod | 2 |
4 files changed, 47 insertions, 20 deletions
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index 53d839ae6d..8bca8c7ea0 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -34,6 +34,18 @@ L<http://www.unicode.org/versions/Unicode9.0.0/>.
 Modules that are shipped with core Perl but not maintained by p5p do not
 necessarily support Unicode 9.0. L<Unicode::Normalize> does work on 9.0.

+=head2 Use of C<\p{I<script>}> uses the improved Script_Extensions
+property
+
+Unicode 6.0 introduced an improved form of the Script (C<sc>) property,
+and called it Script_Extensions (C<scx>). As of now, Perl uses this
+improved version when a property is specified as just C<\p{I<script>}>.
+The meaning of compound forms, like C<\p{sc=I<script>}> are unchanged.
+This should make programs be more accurate when determining if a
+character is used in a given script, but there is a slight chance of
+breakage for programs that very specifically needed the old behavior.
+See L<perlunicode/Scripts>.
+
 =head1 Security

 XXX Any security-related notices go here. In particular, any security
diff --git a/pod/perlretut.pod b/pod/perlretut.pod
index efebb11125..734ca5cfb3 100644
--- a/pod/perlretut.pod
+++ b/pod/perlretut.pod
@@ -1986,14 +1986,18 @@ also listed there. Some synonyms are a single character. For these,
 you can drop the braces. For instance, C<\pM> is the same thing as
 C<\p{Mark}>, meaning things like accent marks.

-The Unicode C<\p{Script}> property is used to categorize every Unicode
-character into the language script it is written in. For example,
+The Unicode C<\p{Script}> and C<\p{Script_Extensions}> properties are
+used to categorize every Unicode character into the language script it
+is written in. (C<Script_Extensions> is an improved version of
+C<Script>, which is retained for backward compatibility, and so you
+should generally use C<Script_Extensions>.)
+For example,
 English, French, and a bunch of other European languages are written in
 the Latin script. But there is also the Greek script, the Thai script,
 the Katakana script, etc. You can test whether a character is in a
-particular script with, for example C<\p{Latin}>, C<\p{Greek}>,
-or C<\p{Katakana}>. To test if it isn't in the Balinese script, you
-would use C<\P{Balinese}>.
+particular script (based on C<Script_Extensions>) with, for example
+C<\p{Latin}>, C<\p{Greek}>, or C<\p{Katakana}>. To test if it isn't in
+the Balinese script, you would use C<\P{Balinese}>.

 What we have described so far is the single form of the C<\p{...}> character
 classes. There is also a compound form which you may run into. These
@@ -2001,8 +2005,8 @@ look like C<\p{name=value}> or C<\p{name:value}> (the equals sign and colon
 can be used interchangeably). These are more general than the single form,
 and in fact most of the single forms are just Perl-defined shortcuts for common
 compound forms. For example, the script examples in the previous paragraph
-could be written equivalently as C<\p{Script=Latin}>, C<\p{Script:Greek}>,
-C<\p{script=katakana}>, and C<\P{script=balinese}> (case is irrelevant
+could be written equivalently as C<\p{Script_Extensions=Latin}>, C<\p{Script_Extensions:Greek}>,
+C<\p{script_extensions=katakana}>, and C<\P{script_extensions=balinese}> (case is irrelevant
 between the C<{}> braces). You may never have to use the compound forms, but
 sometimes it is necessary, and their use can make your code easier to
 understand.
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 959b8008fc..8346b2390f 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -602,16 +602,19 @@ The world's languages are written in many different scripts. This sentence
 written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
 Hiragana or Katakana. There are many more.

-The Unicode C<Script> and C<Script_Extensions> properties give what script a
-given character is in. Either property can be specified with the
-compound form like
+The Unicode C<Script> and C<Script_Extensions> properties give what
+script a given character is in. The C<Script_Extensions> property is an
+improved version of C<Script>, as demonstrated below. Either property
+can be specified with the compound form like
 C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>), or
 C<\p{Script_Extensions=Javanese}> (short: C<\p{scx=java}>).
 In addition, Perl furnishes shortcuts for all
-C<Script> property names. You can omit everything up through the equals
-(or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>.
-(This is not true for C<Script_Extensions>, which is required to be
-written in the compound form.)
+C<Script_Extensions> property names. You can omit everything up through
+the equals (or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>.
+(This is not true for C<Script>, which is required to be
+written in the compound form. Prior to Perl v5.26, the single form
+returned the plain old C<Script> version, but was changed because
+C<Script_Extensions> gives better results.)

 The difference between these two properties involves characters that are
 used in multiple scripts. For example the digits '0' through '9' are
@@ -645,7 +648,11 @@ fewer characters in the C<Common> script, and correspondingly more in
 other scripts. It is new in Unicode version 6.0, and its data are likely
 to change significantly in later releases, as things get sorted out.
 New code should probably be using C<Script_Extensions> and not plain
-C<Script>.
+C<Script>. If you compile perl with a Unicode release that doesn't have
+C<Script_Extensions>, the single form Perl extensions will instead refer
+to the plain C<Script> property. If you compile with a version of
+Unicode that doesn't have the C<Script> property, these extensions will
+not be defined at all.

 (Actually, besides C<Common>, the C<Inherited> script, contains
 characters that are used in multiple scripts. These are modifier
@@ -658,10 +665,13 @@ C<Script>, but not in C<Script_Extensions>.)
 It is worth stressing that there are several different sets of digits in
 Unicode that are equivalent to 0-9 and are matchable by C<\d> in a
 regular expression. If they are used in a single language only, they
-are in that language's C<Script> and C<Script_Extension>. If they are
+are in that language's C<Script> and C<Script_Extensions>. If they are
 used in more than one script, they will be in C<sc=Common>, but only
 if they are used in many scripts should they be in C<scx=Common>.

+The explanation above has omitted some detail; refer to UAX#24 "Unicode
+Script Property": L<http://www.unicode.org/reports/tr24>.
+
 A complete list of scripts and their shortcuts is in L<perluniprops>.

 =head3 B<Use of the C<"Is"> Prefix>
@@ -690,7 +700,7 @@ C<Common> script.
 For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
 L<http://www.unicode.org/reports/tr24>

-The C<Script> or C<Script_Extensions> properties are likely to be the
+The C<Script_Extensions> or C<Script> properties are likely to be the
 ones you want to use when processing natural language; the C<Block>
 property may occasionally be useful in working with the nuts and bolts of
 Unicode.
@@ -711,10 +721,11 @@ longer work. The extensions are mentioned here for completeness:
 Take the block name and prefix it with one of: C<In> (for example
 C<\p{Blk=Arrows}> can currently be written as C<\p{In_Arrows}>); or
 sometimes C<Is> (like C<\p{Is_Arrows}>); or sometimes no prefix at all
-(C<\p{Arrows}>). As of this writing (Unicode 8.0) there are no
+(C<\p{Arrows}>). As of this writing (Unicode 9.0) there are no
 conflicts with using the C<In_> prefix, but there are plenty with the
 other two forms. For example, C<\p{Is_Hebrew}> and C<\p{Hebrew}> mean
-C<\p{Script=Hebrew}> which is NOT the same thing as C<\p{Blk=Hebrew}>. Our
+C<\p{Script_Extensions=Hebrew}> which is NOT the same thing as
+C<\p{Blk=Hebrew}>. Our
 advice used to be to use the C<In_> prefix as a single form way of
 specifying a block. But Unicode 8.0 added properties whose names begin
 with C<In>, and it's now clear that it's only luck that's so far
diff --git a/pod/perlunicook.pod b/pod/perlunicook.pod
index e1693cd6b7..ac305098eb 100644
--- a/pod/perlunicook.pod
+++ b/pod/perlunicook.pod
@@ -391,7 +391,7 @@ one codepoint lacking that property.
  \p{Sk}, \p{Ps}, \p{Lt}
  \p{alpha}, \p{upper}, \p{lower}
  \p{Latin}, \p{Greek}
- \p{script=Latin}, \p{script=Greek}
+ \p{script_extensions=Latin}, \p{scx=Greek}
  \p{East_Asian_Width=Wide}, \p{EA=W}
  \p{Line_Break=Hyphen}, \p{LB=HY}
  \p{Numeric_Value=4}, \p{NV=4}