author: Rafael Garcia-Suarez <rgarciasuarez@gmail.com> 2002-12-10 21:30:10 +0000
committer: Rafael Garcia-Suarez <rgarciasuarez@gmail.com> 2002-12-10 21:30:10 +0000
commit: 3a2263fe90d1c0e6c8f9368f10e6672379a975a2
tree: f4ecc8075c4fe608fca0d50cea8273adb3179ea8 /pod/perlguts.pod
parent: 05b465836ef698192f94eef4a60cd63313013848
Integrate from the maint-5.8/ branch :
changes 18219, 18236, 18242-3, 18247-8, 18253-5, 18257, 18273-6
p4raw-id: //depot/perl@18280
p4raw-branched: from //depot/maint-5.8/perl@18279 'branch in' t/op/lc_user.t
p4raw-integrated: from //depot/maint-5.8/perl@18279 'copy in' lib/File/Copy.pm (@17645..) lib/utf8_heavy.pl pod/perlsec.pod (@18080..) hints/irix_6.sh (@18173..) t/uni/tr_utf8.t (@18197..) pod/perlunicode.pod (@18242..) t/op/pat.t (@18248..) t/op/split.t (@18274..) 'edit in' pod/perlguts.pod (@18242..) 'merge in' pp.c (@18126..) MANIFEST (@18234..)
p4raw-integrated: from //depot/maint-5.8/perl@18254 'merge in' pod/perldiag.pod (@18234..)
Diffstat (limited to 'pod/perlguts.pod')
-rw-r--r-- pod/perlguts.pod 25
1 files changed, 16 insertions, 9 deletions
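Reviewer note: the diff that follows swaps hand-written 0x80 comparisons for UTF8_IS_INVARIANT(). The sketch below is one way to see why that matters on the write side, where the removed line tested "uv > 0x80". It is not Perl's header code: UV is replaced by a plain typedef, and SKETCH_IS_INVARIANT is a hypothetical stand-in that assumes an ASCII-based platform, where a code point is taken to be invariant when it is below 0x80. The real macro lives in Perl's own headers and may be defined differently on other platforms.

```c
/* Stand-in for Perl's UV type; an assumption for this sketch only. */
typedef unsigned long UV;

/* Hypothetical stand-in for UTF8_IS_INVARIANT(), assuming an ASCII-based
 * platform: values below 0x80 are encoded as a single, unchanged byte. */
#define SKETCH_IS_INVARIANT(c) (((UV)(c)) < 0x80)

/* The write-side test from the removed line: "if (uv > 0x80)". */
int old_needs_utf8(UV uv)
{
    return uv > 0x80;
}

/* The write-side test from the added line, using the stand-in macro. */
int new_needs_utf8(UV uv)
{
    return !SKETCH_IS_INVARIANT(uv);
}
```

Under these assumptions the two tests agree everywhere except at uv == 0x80: the old comparison lets that value fall through to the single-byte branch, while the macro-based test sends it to the multi-byte branch.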
diff --git a/pod/perlguts.pod b/pod/perlguts.pod
index 1601e3d1fc..39f23929a3 100644
--- a/pod/perlguts.pod
+++ b/pod/perlguts.pod
@@ -2231,13 +2231,15 @@ C<utf8_hop>, which takes a string and a number of characters to skip
over. You're on your own about bounds checking, though, so don't use it
lightly.
-All bytes in a multi-byte UTF8 character will have the high bit set, so
-you can test if you need to do something special with this character
-like this:
+All bytes in a multi-byte UTF8 character will have the high bit set,
+so you can test if you need to do something special with this
+character like this (the UTF8_IS_INVARIANT() is a macro that tests
+whether the byte can be encoded as a single byte even in UTF-8):
- UV uv;
+ U8 *utf;
+ UV uv; /* Note: a UV, not a U8, not a char */
- if (utf & 0x80)
+ if (!UTF8_IS_INVARIANT(*utf))
/* Must treat this as UTF8 */
uv = utf8_to_uv(utf);
else
@@ -2248,7 +2250,7 @@ You can also see in that example that we use C<utf8_to_uv> to get the
value of the character; the inverse function C<uv_to_utf8> is available
for putting a UV into UTF8:
- if (uv > 0x80)
+ if (!UTF8_IS_INVARIANT(uv))
/* Must treat this as UTF8 */
utf8 = uv_to_utf8(utf8, uv);
else
@@ -2310,6 +2312,10 @@ In fact, your C<frobnicate> function should be made aware of whether or
not it's dealing with UTF8 data, so that it can handle the string
appropriately.
+Since just passing an SV to an XS function and copying the data of
+the SV is not enough to copy the UTF8 flags, even less right is just
+passing a C<char *> to an XS function.
+
=head2 How do I convert a string to UTF8?
If you're mixing UTF8 and non-UTF8 strings, you might find it necessary
@@ -2350,12 +2356,13 @@ it's not - if you pass on the PV to somewhere, pass on the flag too.
=item *
If a string is UTF8, B<always> use C<utf8_to_uv> to get at the value,
-unless C<!(*s & 0x80)> in which case you can use C<*s>.
+unless C<UTF8_IS_INVARIANT(*s)> in which case you can use C<*s>.
=item *
-When writing to a UTF8 string, B<always> use C<uv_to_utf8>, unless
-C<uv < 0x80> in which case you can use C<*s = uv>.
+When writing a character C<uv> to a UTF8 string, B<always> use
+C<uv_to_utf8>, unless C<UTF8_IS_INVARIANT(uv))> in which case
+you can use C<*s = uv>.
=item *