perlunicode: Update text about malformed UTF-8

author: Karl Williamson <khw@cpan.org> 2017-04-11 13:31:20 -0600
committer: Karl Williamson <khw@cpan.org> 2017-04-11 13:37:11 -0600
commit: f57d8456e7b8d6b2dad0bb49899cfdc68007b794 (patch)
tree: a7f582250bcf2ab5925548264341fcc7a364fb0d /pod/perlunicode.pod
parent: 04e8f31976b33aacdeb5a6d9c5b75dda622712b8 (diff)
download: perl-f57d8456e7b8d6b2dad0bb49899cfdc68007b794.tar.gz
1 files changed, 19 insertions, 11 deletions
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index bd70c251f6..9c13c35a9c 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -36,8 +36,8 @@ Unicode support is an extensive requirement. While Perl does not
 implement the Unicode standard or the accompanying technical reports
 from cover to cover, Perl does support many Unicode features.
 
-Also, the use of Unicode may present security issues that aren't obvious.
-Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
+Also, the use of Unicode may present security issues that aren't
+obvious, see L</Security Implications of Unicode>.
 
 =over 4
 
@@ -1633,15 +1633,23 @@ Also, note the following:
 
 Malformed UTF-8
 
-Unfortunately, the original specification of UTF-8 leaves some room for
-interpretation of how many bytes of encoded output one should generate
-from one input Unicode character.  Strictly speaking, the shortest
-possible sequence of UTF-8 bytes should be generated,
-because otherwise there is potential for an input buffer overflow at
-the receiving end of a UTF-8 connection.  Perl always generates the
-shortest length UTF-8, and with warnings on, Perl will warn about
-non-shortest length UTF-8 along with other malformations, such as the
-surrogates, which are not Unicode code points valid for interchange.
+UTF-8 is very structured, so many combinations of bytes are invalid.  In
+the past, Perl tried to soldier on and make some sense of invalid
+combinations, but this can lead to security holes, so now, if the Perl
+core needs to process an invalid combination, it will either raise a
+fatal error, or will replace those bytes by the sequence that forms the
+Unicode REPLACEMENT CHARACTER, for which purpose Unicode created it.
+
+Every code point can be represented by more than one possible
+syntactically valid UTF-8 sequence.  Early on, both Unicode and Perl
+considered any of these to be valid, but now, all sequences longer
+than the shortest possible one are considered to be malformed.
+
+Unicode considers many code points to be illegal, or to be avoided.
+Perl generally accepts them, once they have passed through any input
+filters that may try to exclude them.  These have been discussed above
+(see "Surrogates" under UTF-16 in L</Unicode Encodings>,
+L</Noncharacter code points>, and L</Beyond Unicode code points>).
 
 =item *
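
The rewritten "Malformed UTF-8" text says that non-shortest sequences are now treated as malformed and that invalid bytes either raise an error or become the REPLACEMENT CHARACTER. The sketch below illustrates that general behavior. It uses Python's built-in UTF-8 codec purely as a stand-in, so it shows how a strict decoder treats such input, not how the Perl core is implemented. The names `decode_strict`, `decode_replacing`, `OVERLONG_SLASH`, and `SHORTEST_SLASH` are hypothetical and exist only for this example.

```python
# Illustration only: this exercises Python's strict UTF-8 codec,
# not Perl's internals. All names here are made up for the example.

OVERLONG_SLASH = b"\xc0\xaf"  # non-shortest (overlong) encoding of U+002F "/"
SHORTEST_SLASH = b"\x2f"      # the shortest, well-formed encoding of U+002F


def decode_strict(data: bytes):
    """Return the decoded string, or None if the bytes are malformed UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


def decode_replacing(data: bytes) -> str:
    """Decode, substituting U+FFFD REPLACEMENT CHARACTER for malformed bytes."""
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Well-formed input decodes normally.
    print(decode_strict(SHORTEST_SLASH))
    # The overlong form is rejected rather than being read as "/".
    print(decode_strict(OVERLONG_SLASH))
    # With replacement enabled, the malformed bytes turn into U+FFFD.
    print(ascii(decode_replacing(OVERLONG_SLASH)))
```

The point of rejecting the overlong form is that a filter checking for the byte 0x2F would otherwise miss a "/" smuggled in as 0xC0 0xAF, which is the kind of security hole the new text alludes to.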