Make the UTF-8 decoding stricter and more verbose when malformation happens.

This involved adding an argument to utf8_to_uv_chk(), which changed its prototype, and preferring STRLEN over I32 for the UTF-8 length, which as a domino effect necessitated changing the prototypes of scan_bin(), scan_oct(), scan_hex(), and reg_uni().

The stricter UTF-8 decoding check uses Markus Kuhn's UTF-8 Decode Stress Tester from http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

p4raw-id: //depot/perl@7416
author: Jarkko Hietaniemi <jhi@iki.fi> 2000-10-24 02:55:33 +0000
committer: Jarkko Hietaniemi <jhi@iki.fi> 2000-10-24 02:55:33 +0000
commit: ba210ebec161cde003bc967e8e460c72f71fb70c (patch)
tree: 7eefd78e8e365cbf64ddf49314681d17b83c3025 /pod
parent: 177b92d2814bfc842f28f277e0a2f353c652a5e3 (diff)
download: perl-ba210ebec161cde003bc967e8e460c72f71fb70c.tar.gz
3 files changed, 14 insertions, 5 deletions
diff --git a/pod/perlapi.pod b/pod/perlapi.pod
index a5178e8d61..730d89f896 100644
--- a/pod/perlapi.pod
+++ b/pod/perlapi.pod
@@ -3225,7 +3225,7 @@ advanced to the end of the character.
 If C<s> does not point to a well-formed UTF8 character, an optional UTF8
 warning is produced.
 
-	U8* s	utf8_to_uv(I32 *retlen)
+	U8* s	utf8_to_uv(STRLEN *retlen)
 
 =for hackers
 Found in file utf8.c
@@ -3233,9 +3233,9 @@ Found in file utf8.c
 =item utf8_to_uv_chk
 
 Returns the character value of the first character in the string C<s>
-which is assumed to be in UTF8 encoding; C<retlen> will be set to the
-length, in bytes, of that character, and the pointer C<s> will be
-advanced to the end of the character.
+which is assumed to be in UTF8 encoding and no longer than C<curlen>;
+C<retlen> will be set to the length, in bytes, of that character,
+and the pointer C<s> will be advanced to the end of the character.
 
 If C<s> does not point to a well-formed UTF8 character, the behaviour
 is dependent on the value of C<checking>: if this is true, it is
@@ -3243,7 +3243,7 @@ assumed that the caller will raise a warning, and this function will
 set C<retlen> to C<-1> and return. If C<checking> is not true, an optional UTF8
 warning is produced.
 
-	U8* s	utf8_to_uv_chk(I32 *retlen, I32 checking)
+	U8* s	utf8_to_uv_chk(STRLEN curlen, I32 *retlen, I32 checking)
 
 =for hackers
 Found in file utf8.c
diff --git a/pod/perldiag.pod b/pod/perldiag.pod
index 480ab8492d..139bab98d5 100644
--- a/pod/perldiag.pod
+++ b/pod/perldiag.pod
@@ -1789,6 +1789,10 @@ a builtin library search path, prefix2 is substituted.  The error may
 appear if components are not found, or are too long.  See
 "PERLLIB_PREFIX" in L<perlos2>.
 
+=item Malformed UTF-8 character (%s)
+
+Perl detected something that didn't comply with UTF-8 encoding rules.
+
 =item Malformed UTF-16 surrogate
 
 Perl thought it was reading UTF-16 encoded character data but while
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index c9954d8e96..145c953099 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -71,6 +71,11 @@ on Windows.
 Regardless of the above, the C<bytes> pragma can always be used to force
 byte semantics in a particular lexical scope.  See L<bytes>.
 
+One effect of the C<utf8> pragma is that the internal UTF-8 decoding
+becomes stricter so that the character 0xFFFF (UTF-8 bytes 0xEF 0xBF
+0xBF), and the bytes 0xFE and 0xFF, start to cause warnings if they
+appear in the data.
+
 The C<utf8> pragma is primarily a compatibility device that enables
 recognition of UTF-8 in literals encountered by the parser.  It may also
 be used for enabling some of the more experimental Unicode support features.
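
For readers unfamiliar with the calling convention documented in the perlapi.pod hunk above, the following is a minimal standalone sketch of a bounded, strict decoder. It is not the code from utf8.c: the function name sketch_utf8_decode, the plain C types, and the reason strings are all assumptions made for illustration. It mirrors only what the patch text states: the read is limited to curlen bytes, a malformed character sets retlen to -1, a warning is printed only when checking is false, and 0xFFFF, 0xFE and 0xFF are rejected. As a simplification it handles sequences of at most four bytes and does not check surrogates or an upper code point limit.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, NOT Perl's utf8_to_uv_chk(). Decodes one UTF-8
 * character from s, reading at most curlen bytes. On success it stores the
 * byte length in *retlen and returns the code point. On malformation it
 * sets *retlen to -1 and returns 0; when checking is false it also prints
 * a warning, since the caller is then not expected to raise one itself. */
unsigned long sketch_utf8_decode(const unsigned char *s, size_t curlen,
                                 long *retlen, int checking)
{
    /* Smallest code point that legitimately needs 1..4 bytes. */
    static const unsigned long min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
    const char *why = NULL;   /* reason text; wording is an assumption */
    unsigned long uv = 0;
    size_t len = 0, i;

    if (curlen == 0)       why = "empty string";
    else if (s[0] < 0x80)  { uv = s[0]; len = 1; }
    else if (s[0] < 0xC0)  why = "unexpected continuation byte";
    else if (s[0] < 0xE0)  { uv = s[0] & 0x1F; len = 2; }
    else if (s[0] < 0xF0)  { uv = s[0] & 0x0F; len = 3; }
    else if (s[0] < 0xF8)  { uv = s[0] & 0x07; len = 4; }
    else                   why = "invalid start byte"; /* covers 0xFE, 0xFF */

    /* Never read beyond the caller-supplied bound. */
    if (!why && len > curlen)
        why = "character runs past curlen";

    /* Every following byte must look like 10xxxxxx. */
    for (i = 1; !why && i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            why = "missing continuation byte";
        else
            uv = (uv << 6) | (s[i] & 0x3F);
    }

    if (!why && uv < min_for_len[len])
        why = "overlong encoding";
    if (!why && uv == 0xFFFF)
        why = "character 0xFFFF";

    if (why) {
        if (!checking)
            fprintf(stderr, "Malformed UTF-8 character (%s)\n", why);
        *retlen = -1;
        return 0;
    }
    *retlen = (long)len;
    return uv;
}
```

A caller that wants to report the problem itself would pass a true checking value and test retlen for -1, for example after decoding the bytes 0xEF 0xBF 0xBF with a curlen of 3.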