Integrate from the maint-5.8/ branch:

changes 18219, 18236, 18242-3, 18247-8, 18253-5, 18257, 18273-6

p4raw-id: //depot/perl@18280
p4raw-branched: from //depot/maint-5.8/perl@18279 'branch in'
	t/op/lc_user.t
p4raw-integrated: from //depot/maint-5.8/perl@18279 'copy in'
	lib/File/Copy.pm (@17645..) lib/utf8_heavy.pl pod/perlsec.pod
	(@18080..) hints/irix_6.sh (@18173..) t/uni/tr_utf8.t
	(@18197..) pod/perlunicode.pod (@18242..) t/op/pat.t
	(@18248..) t/op/split.t (@18274..) 'edit in' pod/perlguts.pod
	(@18242..) 'merge in' pp.c (@18126..) MANIFEST (@18234..)
p4raw-integrated: from //depot/maint-5.8/perl@18254 'merge in'
	pod/perldiag.pod (@18234..)
author: Rafael Garcia-Suarez <rgarciasuarez@gmail.com> 2002-12-10 21:30:10 +0000
committer: Rafael Garcia-Suarez <rgarciasuarez@gmail.com> 2002-12-10 21:30:10 +0000
commit: 3a2263fe90d1c0e6c8f9368f10e6672379a975a2 (patch)
tree: f4ecc8075c4fe608fca0d50cea8273adb3179ea8 /pod
parent: 05b465836ef698192f94eef4a60cd63313013848 (diff)
download: perl-3a2263fe90d1c0e6c8f9368f10e6672379a975a2.tar.gz
4 files changed, 82 insertions, 16 deletions
diff --git a/pod/perldiag.pod b/pod/perldiag.pod
index 6c566e5f16..6a8148ca8e 100644
--- a/pod/perldiag.pod
+++ b/pod/perldiag.pod
@@ -3668,6 +3668,13 @@ target of the change to
 (F) Your version of the C library apparently doesn't do times().  I
 suspect you're not running on Unix.
 
+=item To%s: illegal mapping '%s'
+
+(F) You tried to define a customized To-mapping for lc(), lcfirst(),
+uc(), or ucfirst() (or their string-inlined versions), but you
+specified an illegal mapping.
+See L<perlunicode/"User-Defined Character Properties">.
+
 =item Too few args to syscall
 
 (F) There has to be at least one argument to syscall() to specify the
diff --git a/pod/perlguts.pod b/pod/perlguts.pod
index 1601e3d1fc..39f23929a3 100644
--- a/pod/perlguts.pod
+++ b/pod/perlguts.pod
@@ -2231,13 +2231,15 @@ C<utf8_hop>, which takes a string and a number of characters to skip
 over. You're on your own about bounds checking, though, so don't use it
 lightly.
 
-All bytes in a multi-byte UTF8 character will have the high bit set, so
-you can test if you need to do something special with this character
-like this:
+All bytes in a multi-byte UTF8 character will have the high bit set,
+so you can test if you need to do something special with this
+character like this (C<UTF8_IS_INVARIANT()> is a macro that tests
+whether the byte can be encoded as a single byte even in UTF-8):
 
-    UV uv;
+    U8 *utf;
+    UV uv;	/* Note: a UV, not a U8, not a char */
 
-    if (utf & 0x80)
+    if (!UTF8_IS_INVARIANT(*utf))
         /* Must treat this as UTF8 */
         uv = utf8_to_uv(utf);
     else
@@ -2248,7 +2250,7 @@ You can also see in that example that we use C<utf8_to_uv> to get the
 value of the character; the inverse function C<uv_to_utf8> is available
 for putting a UV into UTF8:
 
-    if (uv > 0x80)
+    if (!UTF8_IS_INVARIANT(uv))
         /* Must treat this as UTF8 */
         utf8 = uv_to_utf8(utf8, uv);
     else
@@ -2310,6 +2312,10 @@ In fact, your C<frobnicate> function should be made aware of whether or
 not it's dealing with UTF8 data, so that it can handle the string
 appropriately.
 
+Copying the data of an SV that was passed to an XS function does not
+copy the SV's UTF8 flag, and passing a bare C<char *> to an XS
+function is even less adequate, since no flag accompanies it at all.
+
 =head2 How do I convert a string to UTF8?
 
 If you're mixing UTF8 and non-UTF8 strings, you might find it necessary
@@ -2350,12 +2356,13 @@ it's not - if you pass on the PV to somewhere, pass on the flag too.
 =item *
 
 If a string is UTF8, B<always> use C<utf8_to_uv> to get at the value,
-unless C<!(*s & 0x80)> in which case you can use C<*s>.
+unless C<UTF8_IS_INVARIANT(*s)> in which case you can use C<*s>.
 
 =item *
 
-When writing to a UTF8 string, B<always> use C<uv_to_utf8>, unless
-C<uv < 0x80> in which case you can use C<*s = uv>.
+When writing a character C<uv> to a UTF8 string, B<always> use
+C<uv_to_utf8>, unless C<UTF8_IS_INVARIANT(uv)> in which case
+you can use C<*s = uv>.
 
 =item *
 
diff --git a/pod/perlsec.pod b/pod/perlsec.pod
index 2e1fda3704..1c2dbd266d 100644
--- a/pod/perlsec.pod
+++ b/pod/perlsec.pod
@@ -164,7 +164,7 @@ or a dot.
     if ($data =~ /^([-\@\w.]+)$/) {
 	$data = $1; 			# $data now untainted
     } else {
-	die "Bad data in $data"; 	# log this somewhere
+	die "Bad data in '$data'"; 	# log this somewhere
     }
 
 This is fairly secure because C</\w+/> doesn't normally match shell
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index bf21206a94..ee8b6efe7e 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -616,10 +616,10 @@ And finally, C<scalar reverse()> reverses by character rather than by byte.
 =head2 User-Defined Character Properties
 
 You can define your own character properties by defining subroutines
-whose names begin with "In" or "Is".  The subroutines must be
-visible in the package that uses the properties.  The user-defined
-properties can be used in the regular expression C<\p> and C<\P>
-constructs.
+whose names begin with "In" or "Is".  The subroutines must be defined
+in the C<main> package.  The user-defined properties can be used in the
+regular expression C<\p> and C<\P> constructs.  Note that the effect
+is compile-time and immutable once defined.
 
 The subroutines must return a specially-formatted string, with one
 or more newline-separated lines.  Each line must be one of the following:
@@ -698,6 +698,56 @@ The negation is useful for defining (surprise!) negated classes.
     END
     }
 
+You can also define your own mappings to be used by lc(),
+lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
+The principle is the same: define subroutines in the C<main> package
+with names like C<ToLower> (for lc() and lcfirst()), C<ToTitle> (for
+the first character in ucfirst()), and C<ToUpper> (for uc(), and the
+rest of the characters in ucfirst()).
+
+The string returned by these subroutines must consist of three
+hexadecimal numbers separated by tabs: start of the source
+range, end of the source range, and start of the destination range.
+For example:
+
+    sub ToUpper {
+	return <<END;
+    0061\t0063\t0041
+    END
+    }
+
+defines a uc() mapping that causes only the characters "a", "b", and
+"c" to be mapped to "A", "B", and "C"; all other characters remain
+unchanged.
+
+If there is no source range to speak of, that is, the mapping is from
+a single character to another single character, leave the end of the
+source range empty; both tab characters are still required.
+For example:
+
+    sub ToLower {
+	return <<END;
+    0041\t\t0061
+    END
+    }
+
+defines an lc() mapping that causes only "A" to be mapped to "a"; all
+other characters remain unchanged.
+
+(For serious hackers only)  If you want to introspect the default
+mappings, you can find the data in the directory
+C<$Config{privlib}>/F<unicore/To/>.  The mapping data is returned as
+the here-document, and the C<utf8::ToSpecFoo> are special exception
+mappings derived from C<$Config{privlib}>/F<unicore/SpecialCasing.txt>.
+The C<Digit> and C<Fold> mappings that one can see in the directory
+are not directly user-accessible; one can use either the
+C<Unicode::UCD> module, or just match case-insensitively (that's when
+the C<Fold> mapping is used).
+
+A final note on the user-defined property tests and mappings: they
+will be used only if the scalar has been marked as having Unicode
+characters.  Old byte-style strings will not be affected.
+
 =head2 Character Encodings for Input and Output
 
 See L<Encode>.
@@ -1015,8 +1065,10 @@ straddling of the proverbial fence causes problems.
 
 =head2 Using Unicode in XS
 
-If you want to handle Perl Unicode in XS extensions, you may find
-the following C APIs useful.  See L<perlapi> for details.
+If you want to handle Perl Unicode in XS extensions, you may find the
+following C APIs useful.  See also L<perlguts/"Unicode Support"> for an
+explanation about Unicode at the XS level, and L<perlapi> for the API
+details.
 
 =over 4
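
The perlguts.pod hunks in this change replace raw C<0x80> comparisons with C<UTF8_IS_INVARIANT>. The following is a minimal standalone sketch of why that matters, assuming an ASCII platform where the invariant values are exactly those below 0x80. It does not use the Perl headers: `is_invariant` and `put_char` are hypothetical stand-ins for `UTF8_IS_INVARIANT` and `uv_to_utf8`, not the real implementations. Note that the old test `uv > 0x80` would have written the value 0x80 as a single byte, although it needs two bytes in UTF-8.

```c
#include <assert.h>

/* Hypothetical stand-in for UTF8_IS_INVARIANT, ASCII platforms only:
 * values below 0x80 are stored as the same single byte in UTF-8. */
int is_invariant(unsigned long v)
{
    return v < 0x80;
}

/* Hypothetical stand-in for uv_to_utf8: writes code point uv at d and
 * returns a pointer just past the last byte written.  The caller must
 * provide room for up to 4 bytes.  Surrogates are not rejected here. */
unsigned char *put_char(unsigned char *d, unsigned long uv)
{
    assert(uv <= 0x10FFFF);
    if (is_invariant(uv)) {
        /* Same byte in UTF-8 and non-UTF-8 strings: plain *s = uv. */
        *d++ = (unsigned char)uv;
    } else if (uv < 0x800) {
        *d++ = (unsigned char)(0xC0 | (uv >> 6));
        *d++ = (unsigned char)(0x80 | (uv & 0x3F));
    } else if (uv < 0x10000) {
        *d++ = (unsigned char)(0xE0 | (uv >> 12));
        *d++ = (unsigned char)(0x80 | ((uv >> 6) & 0x3F));
        *d++ = (unsigned char)(0x80 | (uv & 0x3F));
    } else {
        *d++ = (unsigned char)(0xF0 | (uv >> 18));
        *d++ = (unsigned char)(0x80 | ((uv >> 12) & 0x3F));
        *d++ = (unsigned char)(0x80 | ((uv >> 6) & 0x3F));
        *d++ = (unsigned char)(0x80 | (uv & 0x3F));
    }
    return d;
}
```

In real XS code the macro and the API functions from the Perl headers should be used instead, since they also cover non-ASCII platforms that this sketch ignores.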
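
The perlunicode.pod hunk in this change describes the table format returned by C<ToLower>, C<ToTitle>, and C<ToUpper>: one line per mapping, with three tab-separated hexadecimal fields (source start, source end, destination start), where the source end may be left empty. The sketch below shows one way such a table could be read. It is an illustration of the format only, not Perl's actual parser: `map_char` is a hypothetical helper, and on a malformed line it simply returns the input unchanged, whereas Perl reports the fatal "illegal mapping" diagnostic.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: look up code point cp in a mapping table made
 * of "start\tend\tdest" lines (hexadecimal, end may be empty).
 * Returns the mapped code point, or cp itself if no line matches. */
unsigned long map_char(const char *table, unsigned long cp)
{
    const char *p = table;

    assert(table != NULL);
    while (*p) {
        char *q;
        unsigned long start, end, dest;

        start = strtoul(p, &q, 16);
        if (q == p || *q != '\t')
            return cp;              /* malformed line: leave cp alone */
        p = q + 1;

        if (*p == '\t') {
            end = start;            /* empty field: single character */
        } else {
            end = strtoul(p, &q, 16);
            if (q == p || *q != '\t')
                return cp;
            p = q;
        }
        p++;                        /* skip the second tab */

        dest = strtoul(p, &q, 16);
        if (q == p)
            return cp;

        if (cp >= start && cp <= end)
            return dest + (cp - start);   /* same offset into dest range */

        p = q;
        while (*p == '\n')
            p++;                    /* advance to the next line */
    }
    return cp;
}
```

The empty-field check comes before the second `strtoul` call on purpose: `strtoul` skips leading whitespace, so without the check it would step over the second tab and read the destination as the end of the source range.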