author Tony Cook <tony@develop-help.com> 2017-07-24 11:05:40 +1000
committer Tony Cook <tony@develop-help.com> 2017-07-24 11:10:30 +1000
commit 0397beb0d12565d70e168bfea7376e2612a6748a
tree c4f63582b3e9ad74cc683a778dabd713f3ef72f3 /lib/utf8.pm
parent ee329aefb9c0bfcee0e6cc41dcd6eb8b03206f30
(perl #131685) improve utf8::* function documentation

Splits the little cheat sheet I posted as a comment into pieces and
puts them closer to where they belong

- better document why you'd want to use utf8::upgrade()
- similarly for utf8::downgrade()
- try hard to convince people not to use utf8::is_utf8()
- no, utf8::is_utf8() isn't what you want instead of utf8::valid()
- change some examples to use $x instead of the sort reserved $a

Diffstat (limited to 'lib/utf8.pm')
-rw-r--r-- lib/utf8.pm 71
1 files changed, 57 insertions, 14 deletions
diff --git a/lib/utf8.pm b/lib/utf8.pm
index 324cb87c86..34930a0554 100644
--- a/lib/utf8.pm
+++ b/lib/utf8.pm
@@ -2,7 +2,7 @@ package utf8;
$utf8::hint_bits = 0x00800000;
-our $VERSION = '1.19';
+our $VERSION = '1.20';
sub import {
$^H |= $utf8::hint_bits;
@@ -109,11 +109,26 @@ you should not say that unless you really want to have UTF-8 source code.
Converts in-place the internal representation of the string from an octet
sequence in the native encoding (Latin-1 or EBCDIC) to UTF-8. The
logical character sequence itself is unchanged. If I<$string> is already
-stored as UTF-8, then this is a no-op. Returns the
-number of octets necessary to represent the string as UTF-8. Can be
-used to make sure that the UTF-8 flag is on, so that C<\w> or C<lc()>
-work as Unicode on strings containing non-ASCII characters whose code points
-are below 256.
+upgraded, then this is a no-op. Returns the
+number of octets necessary to represent the string as UTF-8.
+
+If your code needs to be compatible with versions of perl without
+C<use feature 'unicode_strings';>, you can force Unicode semantics on
+a given string:
+
+ # force unicode semantics for $string without the
+ # "unicode_strings" feature
+ utf8::upgrade($string);
+
+For example:
+
+ # without explicit or implicit use feature 'unicode_strings'
+ my $x = "\xDF"; # LATIN SMALL LETTER SHARP S
+ $x =~ /ss/i; # won't match
+ my $y = uc($x); # won't convert
+ utf8::upgrade($x);
+ $x =~ /ss/i; # matches
+ my $z = uc($x); # converts to "SS"
B<Note that this function does not handle arbitrary encodings>;
use L<Encode> instead.
@@ -136,6 +151,15 @@ true, returns false.
Returns true on success.
+If your code expects an octet sequence this can be used to validate
+that you've received one:
+
+ # throw an exception if not representable as octets
+ utf8::downgrade($string)
+
+ # or do your own error handling
+ utf8::downgrade($string, 1) or die "string must be octets";
+
B<Note that this function does not handle arbitrary encodings>;
use L<Encode> instead.
@@ -148,11 +172,16 @@ replaced with a sequence of one or more characters that represent the
individual UTF-8 bytes of the character. The UTF8 flag is turned off.
Returns nothing.
- my $a = "\x{100}"; # $a contains one character, with ord 0x100
- utf8::encode($a); # $a contains two characters, with ords (on
+ my $x = "\x{100}"; # $x contains one character, with ord 0x100
+ utf8::encode($x); # $x contains two characters, with ords (on
# ASCII platforms) 0xc4 and 0x80. On EBCDIC
# 1047, this would instead be 0x8C and 0x41.
+Similar to:
+
+ use Encode;
+ $x = Encode::encode("utf8", $x);
+
B<Note that this function does not handle arbitrary encodings>;
use L<Encode> instead.
@@ -167,11 +196,11 @@ turned on only if the source string contains multiple-byte UTF-8
characters. If I<$string> is invalid as UTF-8, returns false;
otherwise returns true.
- my $a = "\xc4\x80"; # $a contains two characters, with ords
+ my $x = "\xc4\x80"; # $x contains two characters, with ords
# 0xc4 and 0x80
- utf8::decode($a); # On ASCII platforms, $a contains one char,
+ utf8::decode($x); # On ASCII platforms, $x contains one char,
# with ord 0x100. Since these bytes aren't
- # legal UTF-EBCDIC, on EBCDIC platforms, $a is
+ # legal UTF-EBCDIC, on EBCDIC platforms, $x is
# unchanged and the function returns FALSE.
B<Note that this function does not handle arbitrary encodings>;
@@ -208,7 +237,22 @@ platforms, so there is no performance hit in using it there.
=item * C<$flag = utf8::is_utf8($string)>
(Since Perl 5.8.1) Test whether I<$string> is marked internally as encoded in
-UTF-8. Functionally the same as C<Encode::is_utf8()>.
+UTF-8. Functionally the same as C<Encode::is_utf8($string)>.
+
+Typically only necessary for debugging and testing, if you need to
+dump the internals of an SV, L<Devel::Peek's|Devel::Peek> Dump()
+provides more detail in a compact form.
+
+If you still think you need this outside of debugging, testing or
+dealing with filenames, you should probably read L<perlunitut> and
+L<perlunifaq/What is "the UTF8 flag"?>.
+
+Don't use this flag as a marker to distinguish character and binary
+data, that should be decided for each variable when you write your
+code.
+
+To force unicode semantics in code portable to perl 5.8 and 5.10, call
+C<utf8::upgrade($string)> unconditionally.
=item * C<$flag = utf8::valid($string)>
@@ -216,8 +260,7 @@ UTF-8. Functionally the same as C<Encode::is_utf8()>.
UTF-8. Will return true if it is well-formed UTF-8 and has the UTF-8 flag
on B<or> if I<$string> is held as bytes (both these states are 'consistent').
Main reason for this routine is to allow Perl's test suite to check
-that operations have left strings in a consistent state. You most
-probably want to use C<utf8::is_utf8()> instead.
+that operations have left strings in a consistent state.
=back
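
The behaviour described in the hunks above can be checked with a short script. This is only a sketch, not part of the commit: the variable names are illustrative, it assumes an ASCII (non-EBCDIC) platform, and it assumes neither C<use feature 'unicode_strings'> nor a C<use v5.12>-or-later declaration is in effect.

```perl
use strict;
use warnings;

# Deliberately no "use feature 'unicode_strings'" here, so a
# non-upgraded string gets native (byte) semantics.

# --- utf8::upgrade(): force Unicode semantics ---
my $x = "\xDF";                              # LATIN SMALL LETTER SHARP S
my $matched_before = ($x =~ /ss/i) ? 1 : 0;  # no case-fold to "ss" yet
my $upper_before   = uc($x);                 # left as "\xDF"

utf8::upgrade($x);                           # same characters, UTF-8 internally
my $matched_after  = ($x =~ /ss/i) ? 1 : 0;  # Unicode case folding applies
my $upper_after    = uc($x);                 # "SS"

# --- utf8::downgrade(): check that a string is representable as octets ---
my $wide    = "\x{100}";
my $wide_ok = utf8::downgrade($wide, 1);     # false: U+0100 does not fit in a byte
my $back_ok = utf8::downgrade($x);           # true: "\xDF" fits in a byte

# --- utf8::encode() / utf8::decode(): characters <-> UTF-8 octets ---
my $encoded = "\x{100}";
utf8::encode($encoded);                      # now two characters: 0xC4 0x80
my $encoded_len = length($encoded);
my $decoded_ok  = utf8::decode($encoded);    # back to one character, ord 0x100

printf "match before/after upgrade: %d/%d\n", $matched_before, $matched_after;
printf "uc after upgrade: %s\n", $upper_after;
printf "downgrade wide ok: %s\n", $wide_ok ? "yes" : "no";
printf "encoded length: %d, decoded ord: 0x%X\n", $encoded_len, ord($encoded);
```

Passing a true second argument to C<utf8::downgrade()> is what turns the failure into a false return value instead of an exception, which is why the script can continue after the C<$wide> check.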