From ba210ebec161cde003bc967e8e460c72f71fb70c Mon Sep 17 00:00:00 2001
From: Jarkko Hietaniemi
Date: Tue, 24 Oct 2000 02:55:33 +0000
Subject: Make the UTF-8 decoding stricter and more verbose when malformation happens.

This involved adding an argument to utf8_to_uv_chk(), which involved
changing its prototype, and prefer STRLEN over I32 for the UTF-8 length,
which as a domino effect necessitated changing the prototypes of
scan_bin(), scan_oct(), scan_hex(), and reg_uni().
The stricter UTF-8 decoding checking uses Markus Kuhn's
UTF-8 Decode Stress Tester from
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

p4raw-id: //depot/perl@7416
---
 pod/perlunicode.pod | 5 +++++
 1 file changed, 5 insertions(+)
(limited to 'pod/perlunicode.pod')

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index c9954d8e96..145c953099 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -71,6 +71,11 @@ on Windows.
 Regardless of the above, the C<bytes> pragma can always be used to force
 byte semantics in a particular lexical scope.  See L<bytes>.
 
+One effect of the C<utf8> pragma is that the internal UTF-8 decoding
+becomes stricter so that the character 0xFFFF (UTF-8 bytes 0xEF 0xBF
+0xBF), and the bytes 0xFE and 0xFF, start to cause warnings if they
+appear in the data.
+
 The C<utf8> pragma is primarily a compatibility device that enables
 recognition of UTF-8 in literals encountered by the parser.  It may also
 be used for enabling some of the more experimental Unicode support features.
-- 
cgit v1.2.1
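
The byte values quoted in the added POD paragraph can be checked independently of Perl. The following is a minimal sketch in C, not the code touched by this commit: `sketch_encode_bmp()`, `sketch_count_fe_ff()`, and `sketch_selfcheck()` are hypothetical names made up for illustration and have no relation to `utf8_to_uv_chk()`. It only shows that the code point 0xFFFF encodes to the three bytes 0xEF 0xBF 0xBF, and how the bytes 0xFE and 0xFF, which no UTF-8 sequence uses, could be spotted in a buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of Perl): encode a code point below
 * 0x10000 as UTF-8. Returns the number of bytes written (1 to 3), or 0
 * if cp is outside the range this sketch handles. Surrogates are not
 * rejected here, to keep the example short. */
size_t sketch_encode_bmp(uint32_t cp, unsigned char out[3])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    return 0;
}

/* Hypothetical helper (not part of Perl): count occurrences of the
 * bytes 0xFE and 0xFF, which never appear in a UTF-8 sequence. */
size_t sketch_count_fe_ff(const unsigned char *s, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (s[i] == 0xFE || s[i] == 0xFF)
            count++;
    }
    return count;
}

/* Usage: confirm the byte sequence quoted in the patch text. */
void sketch_selfcheck(void)
{
    unsigned char buf[3];
    assert(sketch_encode_bmp(0xFFFF, buf) == 3);
    assert(buf[0] == 0xEF && buf[1] == 0xBF && buf[2] == 0xBF);
}
```

Whether such input produces a warning or something else is up to the decoder; the sketch only identifies the bytes the paragraph mentions.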