author Karl Williamson <khw@cpan.org> 2015-05-06 21:01:32 -0600
committer Karl Williamson <khw@cpan.org> 2015-05-07 17:32:47 -0600
commit 61ad4b941643c11b80309e30eab01446ad239acc (patch)
tree 9cc879d82a648ca6b3a4282a184f0fab28cc04aa /pod/perlguts.pod
parent bd18bd400813bdd63c4212b321b8876f2ea01818 (diff)
perlguts: Nits, corrections and clarifications
Diffstat (limited to 'pod/perlguts.pod')
-rw-r--r-- pod/perlguts.pod 56
1 files changed, 30 insertions, 26 deletions
diff --git a/pod/perlguts.pod b/pod/perlguts.pod
index 27f7540196..3e4c54849b 100644
--- a/pod/perlguts.pod
+++ b/pod/perlguts.pod
@@ -1365,7 +1365,7 @@ aware that the behavior may change in the future, umm, without warning.
The perl tie function associates a variable with an object that implements
the various GET, SET, etc methods. To perform the equivalent of the perl
tie function from an XSUB, you must mimic this behaviour. The code below
-carries out the necessary steps - firstly it creates a new hash, and then
+carries out the necessary steps -- firstly it creates a new hash, and then
creates a second hash which it blesses into the class which will implement
the tie methods. Lastly it ties the two hashes together, and returns a
reference to the new tied hash. Note that the code below does NOT call the
@@ -2729,7 +2729,7 @@ macros is faster than using C<call_*>.
=head2 Source Documentation
There's an effort going on to document the internal functions and
-automatically produce reference manuals from them - L<perlapi> is one
+automatically produce reference manuals from them -- L<perlapi> is one
such manual which details all the functions which are available to XS
writers. L<perlintern> is the autogenerated manual for the functions
which are not part of the API and are supposedly for internal use only.
@@ -2806,14 +2806,15 @@ You can't. This is because UTF-8 data is stored in bytes just like
non-UTF-8 data. The Unicode character 200, (C<0xC8> for you hex types)
capital E with a grave accent, is represented by the two bytes
C<v196.172>. Unfortunately, the non-Unicode string C<chr(196).chr(172)>
-has that byte sequence as well. So you can't tell just by looking - this
+has that byte sequence as well. So you can't tell just by looking -- this
is what makes Unicode input an interesting problem.
In general, you either have to know what you're dealing with, or you
have to guess. The API function C<is_utf8_string> can help; it'll tell
-you if a string contains only valid UTF-8 characters. However, it can't
-do the work for you. On a character-by-character basis,
-C<isUTF8_CHAR>
+you if a string contains only valid UTF-8 characters, and the chances
+of a non-UTF-8 string looking like valid UTF-8 become very small very
+quickly with increasing string length. On a character-by-character
+basis, C<isUTF8_CHAR>
will tell you whether the current character in a string is valid UTF-8.
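To see why guessing works better on longer strings, here is a minimal sketch in plain C of a structural UTF-8 check. It is not Perl's C<is_utf8_string>: the name C<looks_like_utf8> is hypothetical, and the check is deliberately simplified (it only verifies lead bytes and continuation bytes, and does not reject every overlong or surrogate encoding).

```c
#include <stddef.h>

/* Hypothetical helper, NOT part of the Perl API.
 * Returns 1 if the buffer is structurally plausible UTF-8, else 0.
 * Simplified: checks lead bytes and continuation bytes only. */
static int looks_like_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = s[i];
        size_t extra;   /* number of continuation bytes expected */
        size_t k;

        if (b < 0x80)
            extra = 0;                      /* ASCII, one byte */
        else if (b >= 0xC2 && b <= 0xDF)
            extra = 1;                      /* two-byte lead */
        else if ((b & 0xF0) == 0xE0)
            extra = 2;                      /* three-byte lead */
        else if (b >= 0xF0 && b <= 0xF4)
            extra = 3;                      /* four-byte lead */
        else
            return 0;                       /* stray continuation or bad lead */

        if (extra > len - i - 1)
            return 0;                       /* sequence runs off the end */

        for (k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;                   /* not a continuation byte */

        i += extra + 1;
    }
    return 1;
}
```

The two bytes 196 and 172 pass this check whether or not they were meant as UTF-8, which is exactly the ambiguity described above. A lone high byte followed by ASCII fails, and each additional high byte in a non-UTF-8 string is another chance to fail.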
=head2 How does UTF-8 represent Unicode characters?
@@ -2823,7 +2824,7 @@ character. Characters with values 0...127 are stored in one
byte, just like good ol' ASCII. Character 128 is stored as
C<v194.128>; this continues up to character 191, which is
C<v194.191>. Now we've run out of bits (191 is binary
-C<10111111>) so we move on; 192 is C<v195.128>. And
+C<10111111>) so we move on; character 192 is C<v195.128>. And
so it goes on, moving to three bytes at character 2048.
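The byte layout described above can be sketched in plain C. This is an illustration only, not how Perl does it internally (in XS code you would use C<uvchr_to_utf8>); the name C<encode_utf8> is hypothetical, and the sketch stops at three-byte sequences.

```c
#include <stddef.h>

/* Hypothetical helper, NOT part of the Perl API.
 * Writes the UTF-8 encoding of cp into out (at least 3 bytes) and
 * returns the number of bytes written, or 0 if cp needs more than
 * three bytes (not handled in this sketch). */
static size_t encode_utf8(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {                /* 0..127: one byte, same as ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {               /* 128..2047: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {             /* 2048..65535: three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    return 0;
}
```

With this sketch, character 128 comes out as the bytes 194, 128; character 191 as 194, 191; character 192 as 195, 128; and character 2048 is the first to need three bytes.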
Assuming you know you're dealing with a UTF-8 string, you can find out
@@ -2843,7 +2844,7 @@ lightly.
All bytes in a multi-byte UTF-8 character will have the high bit set,
so you can test if you need to do something special with this
-character like this (the UTF8_IS_INVARIANT() is a macro that tests
+character like this (the C<UTF8_IS_INVARIANT()> is a macro that tests
whether the byte is encoded as a single byte even in UTF-8):
U8 *utf;
@@ -2862,7 +2863,7 @@ You can also see in that example that we use C<utf8_to_uvchr_buf> to get the
value of the character; the inverse function C<uvchr_to_utf8> is available
for putting a UV into UTF-8:
- if (!UTF8_IS_INVARIANT(uv))
+ if (!UVCHR_IS_INVARIANT(uv))
/* Must treat this as UTF8 */
utf8 = uvchr_to_utf8(utf8, uv);
else
@@ -2877,16 +2878,19 @@ for instance, if your UTF-8 string contains C<v196.172>, and you skip
that character, you can never match a C<chr(200)> in a non-UTF-8 string.
So don't do that!
+(Note that we don't have to test for invariant characters in the
+examples above. The functions work on any well-formed UTF-8 input.
+It's just that it's faster to avoid the function overhead when it's not
+needed.)
+
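The point about the invariant test being an optimization rather than a requirement can be sketched in plain C. The helpers C<decode_utf8> and C<count_chars> are hypothetical stand-ins, not Perl API functions; the decoder assumes well-formed input of at most three bytes per character.

```c
#include <stddef.h>

/* Hypothetical helper, NOT part of the Perl API.
 * Decodes one well-formed sequence of 1 to 3 bytes starting at s,
 * stores its length in *len and returns the code point. */
static unsigned long decode_utf8(const unsigned char *s, size_t *len)
{
    if (s[0] < 0x80) {                       /* single byte */
        *len = 1;
        return s[0];
    }
    if ((s[0] & 0xE0) == 0xC0) {             /* two bytes */
        *len = 2;
        return ((s[0] & 0x1FUL) << 6) | (s[1] & 0x3FUL);
    }
    *len = 3;                                /* three bytes */
    return ((s[0] & 0x0FUL) << 12)
         | ((s[1] & 0x3FUL) << 6)
         |  (s[2] & 0x3FUL);
}

/* Counts characters in a well-formed UTF-8 buffer of n bytes. */
static size_t count_chars(const unsigned char *s, size_t n)
{
    size_t i = 0, count = 0;

    while (i < n) {
        if (s[i] < 0x80) {
            i++;                             /* fast path: skip the call */
        } else {
            size_t len;
            (void)decode_utf8(s + i, &len);  /* general path */
            i += len;
        }
        count++;
    }
    return count;
}
```

Removing the fast path and always calling the decoder would give the same counts, because the decoder handles single-byte characters too; the branch only saves a function call.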
=head2 How does Perl store UTF-8 strings?
-Currently, Perl deals with Unicode strings and non-Unicode strings
+Currently, Perl deals with UTF-8 strings and non-UTF-8 strings
slightly differently. A flag in the SV, C<SVf_UTF8>, indicates that the
string is internally encoded as UTF-8. Without it, the byte value is the
-codepoint number and vice versa (in other words, the string is encoded
-as iso-8859-1, but C<use feature 'unicode_strings'> is needed to get iso-8859-1
-semantics). This flag is only meaningful if the SV is C<SvPOK>
-or immediately after stringification via C<SvPV> or a similar
-macro. You can check and manipulate this flag with the
+codepoint number and vice versa. This flag is only meaningful if the SV
+is C<SvPOK> or immediately after stringification via C<SvPV> or a
+similar macro. You can check and manipulate this flag with the
following macros:
SvUTF8(sv)
@@ -2894,16 +2898,16 @@ following macros:
SvUTF8_off(sv)
This flag has an important effect on Perl's treatment of the string: if
-Unicode data is not properly distinguished, regular expressions,
+UTF-8 data is not properly distinguished, regular expressions,
C<length>, C<substr> and other string handling operations will have
-undesirable results.
+undesirable (wrong) results.
The problem comes when you have, for instance, a string that isn't
-flagged as UTF-8, and contains a byte sequence that could be UTF-8 -
+flagged as UTF-8, and contains a byte sequence that could be UTF-8 --
especially when combining non-UTF-8 and UTF-8 strings.
-Never forget that the C<SVf_UTF8> flag is separate to the PV value; you
-need be sure you don't accidentally knock it off while you're
+Never forget that the C<SVf_UTF8> flag is separate from the PV value; you
+need to be sure you don't accidentally knock it off while you're
manipulating SVs. More specifically, you cannot expect to do this:
SV *sv;
@@ -2932,12 +2936,12 @@ appropriately.
Since just passing an SV to an XS function and copying the data of
the SV is not enough to copy the UTF8 flags, even less right is just
-passing a C<char *> to an XS function.
+passing a S<C<char *>> to an XS function.
=head2 How do I convert a string to UTF-8?
If you're mixing UTF-8 and non-UTF-8 strings, it is necessary to upgrade
-one of the strings to UTF-8. If you've got an SV, the easiest way to do
+the non-UTF-8 strings to UTF-8. If you've got an SV, the easiest way to do
this is:
sv_utf8_upgrade(sv);
@@ -2979,8 +2983,8 @@ unless C<UTF8_IS_INVARIANT(*s)> in which case you can use C<*s>.
=item *
-When writing a character C<uv> to a UTF-8 string, B<always> use
-C<uvchr_to_utf8>, unless C<UTF8_IS_INVARIANT(uv))> in which case
+When writing a character UV to a UTF-8 string, B<always> use
+C<uvchr_to_utf8>, unless C<UVCHR_IS_INVARIANT(uv))> in which case
you can use C<*s = uv>.
=item *
@@ -3003,8 +3007,8 @@ C<gvsv, gvsv, add>.)
This feature is implemented as a new op type, C<OP_CUSTOM>. The Perl
core does not "know" anything special about this op type, and so it will
not be involved in any optimizations. This also means that you can
-define your custom ops to be any op structure - unary, binary, list and
-so on - you like.
+define your custom ops to be any op structure -- unary, binary, list and
+so on -- you like.
It's important to know what custom operators won't do for you. They
won't let you add new syntax to Perl, directly. They won't even let you