Return REPLACEMENT for UTF-8 overlong malformation

When perl decodes UTF-8 into a code point, it must decide what to do if the input is malformed in some way. When the flags passed to the decode function indicate that a given malformation type is not acceptable, the function returns 0 to indicate failure; on success it returns the decoded code point (unfortunately, a return of 0 may require disambiguation, since the input could validly be a NUL).

As perl evolved, its handling of allowed malformations became stricter and stricter. Overlongs were the final malformation that was not turned into a REPLACEMENT CHARACTER when allowed, and this commit changes the function to return the REPLACEMENT CHARACTER for them as well.

Unlike most other malformations, the code point value of an overlong is well-defined, which is why this case had not been changed heretofore. But it is safer to use the Unicode-prescribed behavior for all malformations, which is to replace them with the REPLACEMENT CHARACTER. In case there is code that requires the old behavior, it is retained, but you have to search the source for the undocumented flag that enables it.
author: Karl Williamson <khw@cpan.org> 2016-12-10 15:26:24 -0700
committer: Karl Williamson <khw@cpan.org> 2016-12-23 16:48:35 -0700
commit: 9495395586e6a655057cb766ed00213037dd06c0 (patch)
tree: dfb0df883a3dd756d58ce106bb70bd8e57a55203 /utf8.c
parent: 5a48568dae7e81342fc2f8d0845423834f5c818f (diff)
download: perl-9495395586e6a655057cb766ed00213037dd06c0.tar.gz
1 files changed, 15 insertions, 4 deletions
diff --git a/utf8.c b/utf8.c
index d34597bb56..d5e675b043 100644
--- a/utf8.c
+++ b/utf8.c
@@ -875,9 +875,10 @@ is, when there is a shorter sequence that can express the same code point;
 overlong sequences are expressly forbidden in the UTF-8 standard due to
 potential security issues).  Another malformation example is the first byte of
 a character not being a legal first byte.  See F<utf8.h> for the list of such
-flags.  For allowed overlong sequences, the computed code point is returned;
-for all other allowed malformations, the Unicode REPLACEMENT CHARACTER is
-returned.
+flags.  Even if allowed, this function generally returns the Unicode
+REPLACEMENT CHARACTER when it encounters a malformation.  There are flags in
+F<utf8.h> to override this behavior for the overlong malformations, but don't
+do that except for very specialized purposes.
 
 The C<UTF8_CHECK_ONLY> flag overrides the behavior when a non-allowed (by other
 flags) malformation is found.  If this flag is set, the routine assumes that
@@ -1465,7 +1466,17 @@ Perl_utf8n_to_uvchr_error(pTHX_ const U8 *s,
                 possible_problems &= ~UTF8_GOT_LONG;
                 *errors |= UTF8_GOT_LONG;
 
-                if (! (flags & UTF8_ALLOW_LONG)) {
+                if (flags & UTF8_ALLOW_LONG) {
+
+                    /* We don't allow the actual overlong value, unless the
+                     * special extra bit is also set */
+                    if (! (flags & (   UTF8_ALLOW_LONG_AND_ITS_VALUE
+                                    & ~UTF8_ALLOW_LONG)))
+                    {
+                        uv = UNICODE_REPLACEMENT;
+                    }
+                }
+                else {
                     disallowed = TRUE;
 
                     if (ckWARN_d(WARN_UTF8) && ! (flags & UTF8_CHECK_ONLY)) {
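
The branch added in the hunk above can be modeled in isolation. The following is only a sketch, not the code in utf8.c: the `SKETCH_*` names and their bit values are hypothetical stand-ins for the flags in utf8.h, and the helper `sketch_overlong_result` does not exist in perl. It assumes, as the mask in the diff suggests, that the "and its value" flag is the plain allow-long bit plus one extra bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values, chosen only for this sketch; the real
 * definitions live in utf8.h and may differ. */
#define SKETCH_ALLOW_LONG                0x01u
#define SKETCH_ALLOW_LONG_AND_ITS_VALUE  (SKETCH_ALLOW_LONG | 0x02u)
#define SKETCH_REPLACEMENT               0xFFFDu

/* Models only the overlong branch.  'uv' is the code point computed from
 * an overlong sequence.  Returns the value to hand back to the caller and
 * sets *disallowed when 'flags' does not permit the malformation. */
static uint32_t
sketch_overlong_result(uint32_t uv, unsigned flags, bool *disallowed)
{
    *disallowed = false;

    if (flags & SKETCH_ALLOW_LONG) {
        /* Masking out the plain ALLOW_LONG bit leaves only the extra bit,
         * so this tests whether the caller also asked for the raw value. */
        if (!(flags & (SKETCH_ALLOW_LONG_AND_ITS_VALUE & ~SKETCH_ALLOW_LONG))) {
            uv = SKETCH_REPLACEMENT;
        }
    }
    else {
        /* Not allowed: the real function goes on to warn and fail;
         * this sketch simply reports failure as 0. */
        *disallowed = true;
        uv = 0;
    }
    return uv;
}
```

For example, the bytes 0xC0 0xAF are an overlong encoding of U+002F. Under this model, passing only `SKETCH_ALLOW_LONG` yields U+FFFD, passing `SKETCH_ALLOW_LONG_AND_ITS_VALUE` yields 0x2F, and passing neither marks the input as disallowed.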