author Karl Williamson <khw@cpan.org> 2016-06-30 22:05:55 -0600
committer Karl Williamson <khw@cpan.org> 2016-06-30 22:22:36 -0600
commit 48791bf1d9612a84d71edc00af8610da1a6cf34b
tree 38935030c6eeba3170b7d6396461ef46354d7d84 /pod
parent 7d7345cf4f14a683b78978462e37e75c5bccd5ed
Change \p{foo} to mean \p{scx: foo}
when 'foo' is a script. Also update the pods correspondingly, and to encourage scx property use. See http://nntp.perl.org/group/perl.perl5.porters/237403
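
The effect of this change can be observed with a short script. This is a minimal sketch, not part of the commit: it assumes U+30A0 KATAKANA-HIRAGANA DOUBLE HYPHEN as the test character (its Script is Common, while its Script_Extensions lists Hiragana and Katakana), and the %result hash is just an illustrative name.

```perl
use strict;
use warnings;

# U+30A0 KATAKANA-HIRAGANA DOUBLE HYPHEN (chosen for illustration):
# Script is Common; Script_Extensions lists Hiragana and Katakana.
my $char = "\x{30A0}";

my %result = (
    'sc=Katakana'  => ($char =~ /\p{sc=Katakana}/  ? 1 : 0),
    'scx=Katakana' => ($char =~ /\p{scx=Katakana}/ ? 1 : 0),
    'Katakana'     => ($char =~ /\p{Katakana}/     ? 1 : 0),
);

# Report each form; the single form follows scx only on a perl that
# includes this change (v5.26 and later) and follows sc on older perls.
for my $form (sort keys %result) {
    printf "%-16s %s\n", '\p{' . $form . '}',
        $result{$form} ? 'matches' : 'no match';
}
```

The compound forms give the same answers before and after this commit; only the single form C<\p{Katakana}> changes meaning.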
Diffstat (limited to 'pod')
-rw-r--r-- pod/perldelta.pod 12
-rw-r--r-- pod/perlretut.pod 18
-rw-r--r-- pod/perlunicode.pod 35
-rw-r--r-- pod/perlunicook.pod 2
4 files changed, 47 insertions, 20 deletions
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index 53d839ae6d..8bca8c7ea0 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -34,6 +34,18 @@ L<http://www.unicode.org/versions/Unicode9.0.0/>. Modules that are
shipped with core Perl but not maintained by p5p do not necessarily
support Unicode 9.0. L<Unicode::Normalize> does work on 9.0.
+=head2 Use of C<\p{I<script>}> uses the improved Script_Extensions
+property
+
+Unicode 6.0 introduced an improved form of the Script (C<sc>) property,
+and called it Script_Extensions (C<scx>). As of now, Perl uses this
+improved version when a property is specified as just C<\p{I<script>}>.
+The meaning of compound forms, like C<\p{sc=I<script>}>, is unchanged.
+This should make programs be more accurate when determining if a
+character is used in a given script, but there is a slight chance of
+breakage for programs that very specifically needed the old behavior.
+See L<perlunicode/Scripts>.
+
=head1 Security
XXX Any security-related notices go here. In particular, any security
diff --git a/pod/perlretut.pod b/pod/perlretut.pod
index efebb11125..734ca5cfb3 100644
--- a/pod/perlretut.pod
+++ b/pod/perlretut.pod
@@ -1986,14 +1986,18 @@ also listed there. Some synonyms are a single character. For these,
you can drop the braces. For instance, C<\pM> is the same thing as
C<\p{Mark}>, meaning things like accent marks.
-The Unicode C<\p{Script}> property is used to categorize every Unicode
-character into the language script it is written in. For example,
+The Unicode C<\p{Script}> and C<\p{Script_Extensions}> properties are
+used to categorize every Unicode character into the language script it
+is written in. (C<Script_Extensions> is an improved version of
+C<Script>, which is retained for backward compatibility, and so you
+should generally use C<Script_Extensions>.)
+For example,
English, French, and a bunch of other European languages are written in
the Latin script. But there is also the Greek script, the Thai script,
the Katakana script, etc. You can test whether a character is in a
-particular script with, for example C<\p{Latin}>, C<\p{Greek}>,
-or C<\p{Katakana}>. To test if it isn't in the Balinese script, you
-would use C<\P{Balinese}>.
+particular script (based on C<Script_Extensions>) with, for example
+C<\p{Latin}>, C<\p{Greek}>, or C<\p{Katakana}>. To test if it isn't in
+the Balinese script, you would use C<\P{Balinese}>.
What we have described so far is the single form of the C<\p{...}> character
classes. There is also a compound form which you may run into. These
@@ -2001,8 +2005,8 @@ look like C<\p{name=value}> or C<\p{name:value}> (the equals sign and colon
can be used interchangeably). These are more general than the single form,
and in fact most of the single forms are just Perl-defined shortcuts for common
compound forms. For example, the script examples in the previous paragraph
-could be written equivalently as C<\p{Script=Latin}>, C<\p{Script:Greek}>,
-C<\p{script=katakana}>, and C<\P{script=balinese}> (case is irrelevant
+could be written equivalently as C<\p{Script_Extensions=Latin}>, C<\p{Script_Extensions:Greek}>,
+C<\p{script_extensions=katakana}>, and C<\P{script_extensions=balinese}> (case is irrelevant
between the C<{}> braces). You may
never have to use the compound forms, but sometimes it is necessary, and their
use can make your code easier to understand.
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 959b8008fc..8346b2390f 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -602,16 +602,19 @@ The world's languages are written in many different scripts. This sentence
written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
Hiragana or Katakana. There are many more.
-The Unicode C<Script> and C<Script_Extensions> properties give what script a
-given character is in. Either property can be specified with the
-compound form like
+The Unicode C<Script> and C<Script_Extensions> properties give what
+script a given character is in. The C<Script_Extensions> property is an
+improved version of C<Script>, as demonstrated below. Either property
+can be specified with the compound form like
C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>), or
C<\p{Script_Extensions=Javanese}> (short: C<\p{scx=java}>).
In addition, Perl furnishes shortcuts for all
-C<Script> property names. You can omit everything up through the equals
-(or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>.
-(This is not true for C<Script_Extensions>, which is required to be
-written in the compound form.)
+C<Script_Extensions> property names. You can omit everything up through
+the equals (or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>.
+(This is not true for C<Script>, which is required to be
+written in the compound form. Prior to Perl v5.26, the single form
+returned the plain old C<Script> version, but was changed because
+C<Script_Extensions> gives better results.)
The difference between these two properties involves characters that are
used in multiple scripts. For example the digits '0' through '9' are
@@ -645,7 +648,11 @@ fewer characters in the C<Common> script, and correspondingly more in
other scripts. It is new in Unicode version 6.0, and its data are likely
to change significantly in later releases, as things get sorted out.
New code should probably be using C<Script_Extensions> and not plain
-C<Script>.
+C<Script>. If you compile perl with a Unicode release that doesn't have
+C<Script_Extensions>, the single form Perl extensions will instead refer
+to the plain C<Script> property. If you compile with a version of
+Unicode that doesn't have the C<Script> property, these extensions will
+not be defined at all.
(Actually, besides C<Common>, the C<Inherited> script, contains
characters that are used in multiple scripts. These are modifier
@@ -658,10 +665,13 @@ C<Script>, but not in C<Script_Extensions>.)
It is worth stressing that there are several different sets of digits in
Unicode that are equivalent to 0-9 and are matchable by C<\d> in a
regular expression. If they are used in a single language only, they
-are in that language's C<Script> and C<Script_Extension>. If they are
+are in that language's C<Script> and C<Script_Extensions>. If they are
used in more than one script, they will be in C<sc=Common>, but only
if they are used in many scripts should they be in C<scx=Common>.
+The explanation above has omitted some detail; refer to UAX#24 "Unicode
+Script Property": L<http://www.unicode.org/reports/tr24>.
+
A complete list of scripts and their shortcuts is in L<perluniprops>.
=head3 B<Use of the C<"Is"> Prefix>
@@ -690,7 +700,7 @@ C<Common> script.
For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
L<http://www.unicode.org/reports/tr24>
-The C<Script> or C<Script_Extensions> properties are likely to be the
+The C<Script_Extensions> or C<Script> properties are likely to be the
ones you want to use when processing
natural language; the C<Block> property may occasionally be useful in working
with the nuts and bolts of Unicode.
@@ -711,10 +721,11 @@ longer work. The extensions are mentioned here for completeness: Take
the block name and prefix it with one of: C<In> (for example
C<\p{Blk=Arrows}> can currently be written as C<\p{In_Arrows}>); or
sometimes C<Is> (like C<\p{Is_Arrows}>); or sometimes no prefix at all
-(C<\p{Arrows}>). As of this writing (Unicode 8.0) there are no
+(C<\p{Arrows}>). As of this writing (Unicode 9.0) there are no
conflicts with using the C<In_> prefix, but there are plenty with the
other two forms. For example, C<\p{Is_Hebrew}> and C<\p{Hebrew}> mean
-C<\p{Script=Hebrew}> which is NOT the same thing as C<\p{Blk=Hebrew}>. Our
+C<\p{Script_Extensions=Hebrew}> which is NOT the same thing as
+C<\p{Blk=Hebrew}>. Our
advice used to be to use the C<In_> prefix as a single form way of
specifying a block. But Unicode 8.0 added properties whose names begin
with C<In>, and it's now clear that it's only luck that's so far
diff --git a/pod/perlunicook.pod b/pod/perlunicook.pod
index e1693cd6b7..ac305098eb 100644
--- a/pod/perlunicook.pod
+++ b/pod/perlunicook.pod
@@ -391,7 +391,7 @@ one codepoint lacking that property.
\p{Sk}, \p{Ps}, \p{Lt}
\p{alpha}, \p{upper}, \p{lower}
\p{Latin}, \p{Greek}
- \p{script=Latin}, \p{script=Greek}
+ \p{script_extensions=Latin}, \p{scx=Greek}
\p{East_Asian_Width=Wide}, \p{EA=W}
\p{Line_Break=Hyphen}, \p{LB=HY}
\p{Numeric_Value=4}, \p{NV=4}
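
The perlunicode hunk above stresses that C<\p{Hebrew}> (a script) is not the same thing as C<\p{Blk=Hebrew}> (a block). The following is a hedged sketch of that difference, not part of the commit. The two code points are chosen for illustration: U+05D0 is in both the Hebrew script and the Hebrew block, while U+FB1D is in the Hebrew script but sits in the Alphabetic Presentation Forms block. The hash names are hypothetical.

```perl
use strict;
use warnings;

# U+05D0 HEBREW LETTER ALEF: Hebrew script, Hebrew block.
# U+FB1D HEBREW LETTER YOD WITH HIRIQ: Hebrew script, but located in the
# Alphabetic Presentation Forms block rather than the Hebrew block.
my (%in_script, %in_block);

for my $cp (0x05D0, 0xFB1D) {
    my $char = chr $cp;
    $in_script{$cp} = $char =~ /\p{Hebrew}/     ? 1 : 0;   # script test
    $in_block{$cp}  = $char =~ /\p{Blk=Hebrew}/ ? 1 : 0;   # block test
    printf "U+%04X  script: %d  block: %d\n",
        $cp, $in_script{$cp}, $in_block{$cp};
}
```

Spelling the block test in the compound form C<\p{Blk=...}> keeps it unambiguous regardless of how the single forms are defined.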