author | Karl Williamson <khw@cpan.org> | 2016-12-10 15:26:24 -0700 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2016-12-23 16:48:35 -0700 |
commit | 9495395586e6a655057cb766ed00213037dd06c0 | |
tree | dfb0df883a3dd756d58ce106bb70bd8e57a55203 /ext | |
parent | 5a48568dae7e81342fc2f8d0845423834f5c818f | |
Return REPLACEMENT for UTF-8 overlong malformation
When perl decodes UTF-8 into a code point, it must decide what to do if
the input is malformed in some way. When the flags passed to the decode
function indicate that a given malformation type is not acceptable, the
function returns 0 to indicate failure; on success it returns the decoded
code point (unfortunately, a return of 0 is ambiguous when the input is
validly a NUL, so the caller may need to disambiguate). As perl evolved,
its handling of the various allowed malformations became stricter and
stricter. The overlong was the last malformation that was not turned
into a REPLACEMENT CHARACTER when allowed, and this commit changes the
decoder to return the REPLACEMENT CHARACTER for it as well. Unlike most
other malformations, the code point value of
an overlong is well-defined, which is why it had not been changed
before now. But it is safer to use the Unicode-prescribed behavior on
all malformations, which is to replace them with the REPLACEMENT
CHARACTER. Just in case there is code that requires the old behavior,
it is retained, but you have to search the source for the undocumented
flag that enables it.
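
The behavior described above can be illustrated with a minimal sketch in C. This is not perl's decoder: `decode_one`, `demo_overlong`, and the flag macro names are hypothetical, and only the two flag values (0x0010 and 0x0010|0x0020) mirror the constants in the test file changed by this commit. The sketch assumes an ASCII platform, handles only 1- to 4-byte sequences, and omits surrogate and range checks. It reports success through a separate status value, so a genuine NUL is never confused with failure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names. The values mirror $UTF8_ALLOW_LONG and
 * $UTF8_ALLOW_LONG_AND_ITS_VALUE from the test file in this commit. */
#define ALLOW_LONG               0x0010u
#define ALLOW_LONG_AND_ITS_VALUE (ALLOW_LONG | 0x0020u)
#define REPLACEMENT_CHAR         0xFFFDu

/* Smallest code point that legitimately needs n bytes (n = 1..4). */
static const uint32_t min_for_len[5] = { 0, 0x0, 0x80, 0x800, 0x10000 };

/* Decode one sequence starting at s[0], with len bytes available.
 * Returns 1 and stores the result in *cp on success, 0 on failure.
 * Assumption: only overlongs are treated as an allowable malformation. */
static int decode_one(const unsigned char *s, size_t len,
                      unsigned flags, uint32_t *cp)
{
    size_t need, i;
    uint32_t v;

    if (len == 0) return 0;
    if (s[0] < 0x80)                { need = 1; v = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; v = s[0] & 0x07; }
    else return 0;                  /* invalid start byte */

    if (len < need) return 0;       /* truncated input */
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation */
        v = (v << 6) | (s[i] & 0x3F);
    }

    if (v < min_for_len[need]) {    /* overlong: more bytes than needed */
        if (!(flags & ALLOW_LONG)) return 0;
        /* Allowed: substitute U+FFFD unless the caller also asked
         * for the value the overlong actually encodes. */
        if ((flags & ALLOW_LONG_AND_ITS_VALUE) != ALLOW_LONG_AND_ITS_VALUE)
            v = REPLACEMENT_CHAR;
    }
    *cp = v;
    return 1;
}

/* Usage: C0 80 is an overlong encoding of NUL. */
void demo_overlong(void)
{
    static const unsigned char overlong_nul[] = { 0xC0, 0x80 };
    static const unsigned char real_nul[]     = { 0x00 };
    uint32_t cp = 1;

    assert(decode_one(overlong_nul, 2, 0, &cp) == 0);
    assert(decode_one(overlong_nul, 2, ALLOW_LONG, &cp) == 1
           && cp == REPLACEMENT_CHAR);
    assert(decode_one(overlong_nul, 2, ALLOW_LONG_AND_ITS_VALUE, &cp) == 1
           && cp == 0);
    assert(decode_one(real_nul, 1, 0, &cp) == 1 && cp == 0);
}
```

The third assertion shows why the old behavior can be risky: the overlong decodes to a NUL that a byte-level scan of the input would not have seen.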
Diffstat (limited to 'ext')
-rw-r--r-- | ext/XS-APItest/t/utf8.t | 24 |
1 files changed, 24 insertions, 0 deletions
diff --git a/ext/XS-APItest/t/utf8.t b/ext/XS-APItest/t/utf8.t
index e8ed76ea34..5fe56df39b 100644
--- a/ext/XS-APItest/t/utf8.t
+++ b/ext/XS-APItest/t/utf8.t
@@ -98,6 +98,7 @@ my $UTF8_GOT_NON_CONTINUATION = $UTF8_ALLOW_NON_CONTINUATION;
 my $UTF8_ALLOW_SHORT = 0x0008;
 my $UTF8_GOT_SHORT = $UTF8_ALLOW_SHORT;
 my $UTF8_ALLOW_LONG = 0x0010;
+my $UTF8_ALLOW_LONG_AND_ITS_VALUE = $UTF8_ALLOW_LONG|0x0020;
 my $UTF8_GOT_LONG = $UTF8_ALLOW_LONG;
 my $UTF8_GOT_OVERFLOW = 0x0080;
 my $UTF8_DISALLOW_SURROGATE = 0x0100;
@@ -1422,6 +1423,29 @@ else { # 64-bit ASCII, or EBCDIC of any size.
     }
 }
 
+# For each overlong malformation in the list, we modify it, so that there are
+# two tests. The first one returns the replacement character given the input
+# flags, and the second test adds a flag that causes the actual code point the
+# malformation represents to be returned.
+my @added_overlongs;
+foreach my $test (@malformations) {
+    my ($testname, $bytes, $length, $allow_flags, $expected_error_flags,
+        $allowed_uv, $expected_len, $needed_to_discern_len, $message ) = @$test;
+    next unless $testname =~ /overlong/;
+
+    $test->[0] .= "; use REPLACEMENT CHAR";
+    $test->[5] = $REPLACEMENT;
+
+    push @added_overlongs,
+        [ $testname . "; use actual value",
+          $bytes, $length,
+          $allow_flags | $UTF8_ALLOW_LONG_AND_ITS_VALUE,
+          $expected_error_flags, $allowed_uv, $expected_len,
+          $needed_to_discern_len, $message
+        ];
+}
+push @malformations, @added_overlongs;
+
 foreach my $test (@malformations) {
     my ($testname, $bytes, $length, $allow_flags, $expected_error_flags,
         $allowed_uv, $expected_len, $needed_to_discern_len, $message ) = @$test;