author | Rafael Garcia-Suarez <rgarciasuarez@gmail.com> | 2002-12-10 21:30:10 +0000 |
---|---|---|
committer | Rafael Garcia-Suarez <rgarciasuarez@gmail.com> | 2002-12-10 21:30:10 +0000 |
commit | 3a2263fe90d1c0e6c8f9368f10e6672379a975a2 (patch) | |
tree | f4ecc8075c4fe608fca0d50cea8273adb3179ea8 /pod | |
parent | 05b465836ef698192f94eef4a60cd63313013848 (diff) | |
download | perl-3a2263fe90d1c0e6c8f9368f10e6672379a975a2.tar.gz |
Integrate from the maint-5.8/ branch :
changes 18219, 18236, 18242-3, 18247-8,
18253-5, 18257, 18273-6
p4raw-id: //depot/perl@18280
p4raw-branched: from //depot/maint-5.8/perl@18279 'branch in'
t/op/lc_user.t
p4raw-integrated: from //depot/maint-5.8/perl@18279 'copy in'
lib/File/Copy.pm (@17645..) lib/utf8_heavy.pl pod/perlsec.pod
(@18080..) hints/irix_6.sh (@18173..) t/uni/tr_utf8.t
(@18197..) pod/perlunicode.pod (@18242..) t/op/pat.t (@18248..)
t/op/split.t (@18274..) 'edit in' pod/perlguts.pod (@18242..)
'merge in' pp.c (@18126..) MANIFEST (@18234..)
p4raw-integrated: from //depot/maint-5.8/perl@18254 'merge in'
pod/perldiag.pod (@18234..)
Diffstat (limited to 'pod')
-rw-r--r-- | pod/perldiag.pod | 7 |
-rw-r--r-- | pod/perlguts.pod | 25 |
-rw-r--r-- | pod/perlsec.pod | 2 |
-rw-r--r-- | pod/perlunicode.pod | 64 |
4 files changed, 82 insertions, 16 deletions
diff --git a/pod/perldiag.pod b/pod/perldiag.pod
index 6c566e5f16..6a8148ca8e 100644
--- a/pod/perldiag.pod
+++ b/pod/perldiag.pod
@@ -3668,6 +3668,13 @@ target of the change to
 (F) Your version of the C library apparently doesn't do times(). I
 suspect you're not running on Unix.
 
+=item To%s: illegal mapping '%s'
+
+(F) You tried to define a customized To-mapping for lc(), lcfirst,
+uc(), or ucfirst() (or their string-inlined versions), but you
+specified an illegal mapping.
+See L<perlunicode/"User-Defined Character Properties">.
+
 =item Too few args to syscall
 
 (F) There has to be at least one argument to syscall() to specify the
diff --git a/pod/perlguts.pod b/pod/perlguts.pod
index 1601e3d1fc..39f23929a3 100644
--- a/pod/perlguts.pod
+++ b/pod/perlguts.pod
@@ -2231,13 +2231,15 @@ C<utf8_hop>, which takes a string and a number of characters to skip
 over. You're on your own about bounds checking, though, so don't use it
 lightly.
 
-All bytes in a multi-byte UTF8 character will have the high bit set, so
-you can test if you need to do something special with this character
-like this:
+All bytes in a multi-byte UTF8 character will have the high bit set,
+so you can test if you need to do something special with this
+character like this (the UTF8_IS_INVARIANT() is a macro that tests
+whether the byte can be encoded as a single byte even in UTF-8):
 
-    UV uv;
+    U8 *utf;
+    UV uv;      /* Note: a UV, not a U8, not a char */
 
-    if (utf & 0x80)
+    if (!UTF8_IS_INVARIANT(*utf))
         /* Must treat this as UTF8 */
         uv = utf8_to_uv(utf);
     else
@@ -2248,7 +2250,7 @@ You can also see in that example that we use C<utf8_to_uv> to get the
 value of the character; the inverse function C<uv_to_utf8> is available
 for putting a UV into UTF8:
 
-    if (uv > 0x80)
+    if (!UTF8_IS_INVARIANT(uv))
         /* Must treat this as UTF8 */
         utf8 = uv_to_utf8(utf8, uv);
     else
@@ -2310,6 +2312,10 @@ In fact, your C<frobnicate> function should be made aware of whether
 or not it's dealing with UTF8 data, so that it can handle the
 string appropriately.
 
+Since just passing an SV to an XS function and copying the data of
+the SV is not enough to copy the UTF8 flags, even less right is just
+passing a C<char *> to an XS function.
+
 =head2 How do I convert a string to UTF8?
 
 If you're mixing UTF8 and non-UTF8 strings, you might find it necessary
@@ -2350,12 +2356,13 @@ it's not - if you pass on the PV to somewhere, pass on the flag too.
 =item *
 
 If a string is UTF8, B<always> use C<utf8_to_uv> to get at the value,
-unless C<!(*s & 0x80)> in which case you can use C<*s>.
+unless C<UTF8_IS_INVARIANT(*s)> in which case you can use C<*s>.
 
 =item *
 
-When writing to a UTF8 string, B<always> use C<uv_to_utf8>, unless
-C<uv < 0x80> in which case you can use C<*s = uv>.
+When writing a character C<uv> to a UTF8 string, B<always> use
+C<uv_to_utf8>, unless C<UTF8_IS_INVARIANT(uv))> in which case
+you can use C<*s = uv>.
 
 =item *
 
diff --git a/pod/perlsec.pod b/pod/perlsec.pod
index 2e1fda3704..1c2dbd266d 100644
--- a/pod/perlsec.pod
+++ b/pod/perlsec.pod
@@ -164,7 +164,7 @@ or a dot.
     if ($data =~ /^([-\@\w.]+)$/) {
         $data = $1;                     # $data now untainted
     } else {
-        die "Bad data in $data";        # log this somewhere
+        die "Bad data in '$data'";      # log this somewhere
     }
 
 This is fairly secure because C</\w+/> doesn't normally match shell
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index bf21206a94..ee8b6efe7e 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -616,10 +616,10 @@ And finally, C<scalar reverse()> reverses by character rather than by byte.
 =head2 User-Defined Character Properties
 
 You can define your own character properties by defining subroutines
-whose names begin with "In" or "Is". The subroutines must be
-visible in the package that uses the properties. The user-defined
-properties can be used in the regular expression C<\p> and C<\P>
-constructs.
+whose names begin with "In" or "Is". The subroutines must be defined
+in the C<main> package. The user-defined properties can be used in the
+regular expression C<\p> and C<\P> constructs. Note that the effect
+is compile-time and immutable once defined.
 
 The subroutines must return a specially-formatted string, with one
 or more newline-separated lines. Each line must be one of the following:
@@ -698,6 +698,56 @@ The negation is useful for defining (surprise!) negated classes.
     END
     }
 
+You can also define your own mappings to be used in the lc(),
+lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
+The principle is the same: define subroutines in the C<main> package
+with names like C<ToLower> (for lc() and lcfirst()), C<ToTitle> (for
+the first character in ucfirst()), and C<ToUpper> (for uc(), and the
+rest of the characters in ucfirst()).
+
+The string returned by the subroutines needs now to be three
+hexadecimal numbers separated by tabulators: start of the source
+range, end of the source range, and start of the destination range.
+For example:
+
+    sub ToUpper {
+        return <<END;
+    0061\t0063\t0041
+    END
+    }
+
+defines an uc() mapping that causes only the characters "a", "b", and
+"c" to be mapped to "A", "B", "C", all other characters will remain
+unchanged.
+
+If there is no source range to speak of, that is, the mapping is from
+a single character to another single character, leave the end of the
+source range empty, but the two tabulator characters are still needed.
+For example:
+
+    sub ToLower {
+        return <<END;
+    0041\t\t0061
+    END
+    }
+
+defines a lc() mapping that causes only "A" to be mapped to "a", all
+other characters will remain unchanged.
+
+(For serious hackers only) If you want to introspect the default
+mappings, you can find the data in the directory
+C<$Config{privlib}>/F<unicore/To/>. The mapping data is returned as
+the here-document, and the C<utf8::ToSpecFoo> are special exception
+mappings derived from <$Config{privlib}>/F<unicore/SpecialCasing.txt>.
+The C<Digit> and C<Fold> mappings that one can see in the directory
+are not directly user-accessible, one can use either the
+C<Unicode::UCD> module, or just match case-insensitively (that's when
+the C<Fold> mapping is used).
+
+A final note on the user-defined property tests and mappings: they
+will be used only if the scalar has been marked as having Unicode
+characters. Old byte-style strings will not be affected.
+
 =head2 Character Encodings for Input and Output
 
 See L<Encode>.
@@ -1015,8 +1065,10 @@ straddling of the proverbial fence causes problems.
 
 =head2 Using Unicode in XS
 
-If you want to handle Perl Unicode in XS extensions, you may find
-the following C APIs useful. See L<perlapi> for details.
+If you want to handle Perl Unicode in XS extensions, you may find the
+following C APIs useful. See also L<perlguts/"Unicode Support"> for an
+explanation about Unicode at the XS level, and L<perlapi> for the API
+details.
 
 =over 4
 