Add a note about the dangers of bad UTF-8.

p4raw-id: //depot/perl@12953
author: Jarkko Hietaniemi <jhi@iki.fi> 2001-11-12 13:11:55 +0000
committer: Jarkko Hietaniemi <jhi@iki.fi> 2001-11-12 13:11:55 +0000
commit: bf0fa0b28861f64af680a3c19765ac8a24e4f2bd (patch)
tree: a358b8d15bc72e9c377f13a10f8359a377eab6db /pod/perlunicode.pod
parent: 31f17f41bb8d60c477667b416652af44045ba3ed (diff)
download: perl-bf0fa0b28861f64af680a3c19765ac8a24e4f2bd.tar.gz
1 files changed, 12 insertions, 0 deletions
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 2c9b078029..277238e452 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -742,6 +742,18 @@ is not extensible beyond 0xFFFF, because it does not use surrogates.
 A seven-bit safe (non-eight-bit) encoding, useful if the
 transport/storage is not eight-bit safe.  Defined by RFC 2152.
 
+=head2 Security Implications of Malformed UTF-8
+
+Unfortunately, the specification of UTF-8 leaves some room for
+interpretation of how many bytes of encoded output one should generate
+from one input Unicode character.  Strictly speaking, one is supposed
+to always generate the shortest possible sequence of UTF-8 bytes,
+because otherwise there is potential for input buffer overflow at the
+receiving end of a UTF-8 connection.  Perl always generates the
+shortest-length UTF-8, and with warnings on (C<-w> or C<use warnings;>)
+Perl will warn about non-shortest-length UTF-8 (and other malformations,
+too, such as surrogates, which are not real character code points).
+
 =head2 Unicode in Perl on EBCDIC
 
 The way Unicode is handled on EBCDIC platforms is still rather
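
Illustration (not part of the patch): the added section turns on the idea of a "shortest possible sequence" of UTF-8 bytes. The following is a minimal Python sketch of that idea, not Perl's implementation. The helper names `shortest_utf8_len`, `encode_overlong_2byte`, and `is_strict_utf8` are hypothetical and exist only for this example. It builds a deliberately non-shortest two-byte form of `/` (U+002F) and checks that a strict decoder (here, Python's built-in `utf-8` codec) rejects both that form and an encoded surrogate.

```python
def shortest_utf8_len(cp: int) -> int:
    """Number of bytes in the shortest UTF-8 encoding of code point cp."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


def encode_overlong_2byte(cp: int) -> bytes:
    """Deliberately malformed: pack an ASCII code point into two bytes.

    A conforming encoder never produces this; it is built here only to
    show what a non-shortest sequence looks like.
    """
    if not 0 <= cp < 0x80:
        raise ValueError("only ASCII code points are handled in this sketch")
    lead = 0xC0 | (cp >> 6)      # 110xxxxx lead byte (payload bits are zero)
    trail = 0x80 | (cp & 0x3F)   # 10xxxxxx continuation byte
    return bytes([lead, trail])


def is_strict_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


slash = ord("/")
overlong = encode_overlong_2byte(slash)   # two bytes instead of one
surrogate = b"\xed\xa0\x80"               # UTF-8-style encoding of U+D800

print(shortest_utf8_len(slash), len(overlong))
print(is_strict_utf8("/".encode("utf-8")))
print(is_strict_utf8(overlong))
print(is_strict_utf8(surrogate))
```

A decoder that accepted the two-byte form would see a `/` that a byte-level check for `0x2F` had missed, which is one reason strict decoders treat non-shortest forms as malformed.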