Return REPLACEMENT for UTF-8 overlong malformation

When perl decodes UTF-8 into a code point, it must decide what to do if the input is malformed in some way. When the flags passed to the decode function indicate that a given malformation type is not acceptable, the function returns 0 to indicate failure; on success it returns the decoded code point (unfortunately, a return of 0 may require disambiguation, since the input could validly be a NUL). As perl evolved, its handling of the various allowed malformations got stricter and stricter. The overlong was the final malformation that was not turned into a REPLACEMENT CHARACTER when it was allowed, and this commit changes it to return that. Unlike most other malformations, the code point value of an overlong is well-defined, which is why it hadn't been changed heretofore. But it is safer to use the Unicode-prescribed behavior on all malformations, which is to replace them with the REPLACEMENT CHARACTER. In case there is code that requires the old behavior, it is retained, but you have to search the source for the undocumented flag that enables it.
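
To illustrate why an overlong has a well-defined value, here is a minimal pure-Perl sketch of the two behaviors described above. It is not perl's actual decoder: the name decode_sketch and its $want_actual_value argument are hypothetical, it assumes an ASCII platform, and it handles only sequences of 1 to 4 bytes that are well-formed apart from being overlong.

```perl
use strict;
use warnings;

# U+FFFD REPLACEMENT CHARACTER
my $REPLACEMENT = 0xFFFD;

# Hypothetical helper, for illustration only (not a perl core API).
# Decodes the single UTF-8 sequence at the start of $bytes.  If the
# sequence is overlong, returns the REPLACEMENT CHARACTER unless
# $want_actual_value is true, in which case the code point the
# overlong represents is returned instead.
sub decode_sketch {
    my ($bytes, $want_actual_value) = @_;
    my @b     = unpack 'C*', $bytes;
    my $first = $b[0];

    # Sequence length and payload bits come from the start byte.
    my ($len, $cp) = $first < 0x80            ? (1, $first)
                   : ($first & 0xE0) == 0xC0  ? (2, $first & 0x1F)
                   : ($first & 0xF0) == 0xE0  ? (3, $first & 0x0F)
                   : ($first & 0xF8) == 0xF0  ? (4, $first & 0x07)
                   : die "unsupported start byte";
    die "short input" if @b < $len;

    # Each continuation byte contributes its low 6 bits.
    for my $i (1 .. $len - 1) {
        die "not a continuation byte" unless ($b[$i] & 0xC0) == 0x80;
        $cp = ($cp << 6) | ($b[$i] & 0x3F);
    }

    # Smallest code point that genuinely needs $len bytes; anything
    # below it could have been encoded in fewer bytes, so is overlong.
    my @min = (0, 0, 0x80, 0x800, 0x10000);
    return $cp unless $cp < $min[$len];
    return $want_actual_value ? $cp : $REPLACEMENT;
}

# "\xC1\x81" is a two-byte overlong encoding of "A" (U+0041).
printf "U+%04X\n", decode_sketch("\xC1\x81", 0);   # prints U+FFFD
printf "U+%04X\n", decode_sketch("\xC1\x81", 1);   # prints U+0041
```

The second call plays the role of the retained old behavior: the payload bits of the overlong still spell out U+0041, so there is an unambiguous value to return if the caller asks for it.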
author: Karl Williamson <khw@cpan.org> 2016-12-10 15:26:24 -0700
committer: Karl Williamson <khw@cpan.org> 2016-12-23 16:48:35 -0700
commit: 9495395586e6a655057cb766ed00213037dd06c0 (patch)
tree: dfb0df883a3dd756d58ce106bb70bd8e57a55203 /ext
parent: 5a48568dae7e81342fc2f8d0845423834f5c818f (diff)
download: perl-9495395586e6a655057cb766ed00213037dd06c0.tar.gz
1 files changed, 24 insertions, 0 deletions
diff --git a/ext/XS-APItest/t/utf8.t b/ext/XS-APItest/t/utf8.t
index e8ed76ea34..5fe56df39b 100644
--- a/ext/XS-APItest/t/utf8.t
+++ b/ext/XS-APItest/t/utf8.t
@@ -98,6 +98,7 @@ my $UTF8_GOT_NON_CONTINUATION   = $UTF8_ALLOW_NON_CONTINUATION;
 my $UTF8_ALLOW_SHORT            = 0x0008;
 my $UTF8_GOT_SHORT              = $UTF8_ALLOW_SHORT;
 my $UTF8_ALLOW_LONG             = 0x0010;
+my $UTF8_ALLOW_LONG_AND_ITS_VALUE = $UTF8_ALLOW_LONG|0x0020;
 my $UTF8_GOT_LONG               = $UTF8_ALLOW_LONG;
 my $UTF8_GOT_OVERFLOW           = 0x0080;
 my $UTF8_DISALLOW_SURROGATE     = 0x0100;
@@ -1422,6 +1423,29 @@ else { # 64-bit ASCII, or EBCDIC of any size.
     }
 }
 
+# For each overlong malformation in the list, we modify it, so that there are
+# two tests.  The first one returns the replacement character given the input
+# flags, and the second test adds a flag that causes the actual code point the
+# malformation represents to be returned.
+my @added_overlongs;
+foreach my $test (@malformations) {
+    my ($testname, $bytes, $length, $allow_flags, $expected_error_flags,
+        $allowed_uv, $expected_len, $needed_to_discern_len, $message ) = @$test;
+    next unless $testname =~ /overlong/;
+
+    $test->[0] .= "; use REPLACEMENT CHAR";
+    $test->[5] = $REPLACEMENT;
+
+    push @added_overlongs,
+        [ $testname . "; use actual value",
+          $bytes, $length,
+          $allow_flags | $UTF8_ALLOW_LONG_AND_ITS_VALUE,
+          $expected_error_flags, $allowed_uv, $expected_len,
+          $needed_to_discern_len, $message
+        ];
+}
+push @malformations, @added_overlongs;
+
 foreach my $test (@malformations) {
     my ($testname, $bytes, $length, $allow_flags, $expected_error_flags,
         $allowed_uv, $expected_len, $needed_to_discern_len, $message ) = @$test;