author Karl Williamson <khw@cpan.org> 2017-08-10 15:52:35 -0600
committer Karl Williamson <khw@cpan.org> 2017-11-08 20:21:44 -0700
commit 624504c5a60da0880a7d1d6d3e66f65c68ba28ae
tree 2e10bdcad3179576394c2f1e3fb2fd1327cac6f0
parent 63ab03b3966fa7dcc24a137305becdb56bbf4e5a
Dest buffer needs to be bigger for utf16_to_utf8()
These undocumented functions require the destination buffer to be sized for the worst case. However, that size (previously listed as 3/2 * input) is wrong for EBCDIC. Correct the comments, and the single use of these functions in core. These functions do not have a way to avoid overflowing, which strikes me as wrong.
Diffstat (limited to 'utf8.c')
-rw-r--r-- utf8.c 15
1 files changed, 12 insertions, 3 deletions
diff --git a/utf8.c b/utf8.c
index 6107348523..b731780fe4 100644
--- a/utf8.c
+++ b/utf8.c
@@ -2364,10 +2364,19 @@ Perl_bytes_to_utf8(pTHX_ const U8 *s, STRLEN *lenp)
 }
 /*
- * Convert native (big-endian) or reversed (little-endian) UTF-16 to UTF-8.
+ * Convert native (big-endian) UTF-16 to UTF-8. For reversed (little-endian),
+ * use utf16_to_utf8_reversed().
  *
- * Destination must be pre-extended to 3/2 source. Do not use in-place.
- * We optimize for native, for obvious reasons. */
+ * UTF-16 requires 2 bytes for every code point below 0x10000; otherwise 4 bytes.
+ * UTF-8 requires 1-3 bytes for every code point below 0x1000; otherwise 4 bytes.
+ * UTF-EBCDIC requires 1-4 bytes for every code point below 0x1000; otherwise 4-5 bytes.
+ *
+ * These functions don't check for overflow. The worst case is every code
+ * point in the input is 2 bytes, and requires 4 bytes on output. (If the code
+ * is never going to run in EBCDIC, it is 2 bytes requiring 3 on output.) Therefore the
+ * destination must be pre-extended to 2 times the source length.
+ *
+ * Do not use in-place. We optimize for native, for obvious reasons. */
 U8*
 Perl_utf16_to_utf8(pTHX_ U8* p, U8* d, I32 bytelen, I32 *newlen)
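The sizing argument in the new comment can be checked with a small standalone sketch. It is not code from utf8.c: the helper names utf8_worst_case() and utf16be_to_utf8_checked() are hypothetical, the converter handles only the ASCII-platform (plain UTF-8) case, and it rejects unpaired surrogates, which the Perl functions may treat differently. The first helper encodes the arithmetic: a 2-byte UTF-16 unit (a code point below 0x10000) expands to at most 3 bytes of UTF-8 or 4 bytes of UTF-EBCDIC, while a 4-byte surrogate pair yields 4 or at most 5 bytes, so the 2-byte case sets the bound. The second shows one way a caller-supplied capacity could prevent the overflow the commit message worries about.

```c
#include <stddef.h>
#include <stdint.h>

/* Worst-case output size, in bytes, for utf16_bytes bytes of UTF-16 input.
 * Hypothetical helper: 3 bytes per 2-byte unit for UTF-8, 4 for UTF-EBCDIC. */
static size_t utf8_worst_case(size_t utf16_bytes, int ebcdic)
{
    size_t units = utf16_bytes / 2;
    return units * (ebcdic ? 4u : 3u);
}

/* Hypothetical bounds-checked big-endian UTF-16 to UTF-8 converter
 * (ASCII platforms only). Returns the number of bytes written, or
 * (size_t)-1 if the input is malformed or dst_cap is too small. */
static size_t utf16be_to_utf8_checked(const uint8_t *src, size_t src_len,
                                      uint8_t *dst, size_t dst_cap)
{
    size_t i = 0, out = 0;

    if (src_len % 2 != 0)
        return (size_t)-1;                    /* truncated code unit */

    while (i < src_len) {
        uint32_t cp = ((uint32_t)src[i] << 8) | src[i + 1];
        i += 2;

        if (cp >= 0xD800 && cp <= 0xDBFF) {   /* high surrogate: need a pair */
            uint32_t lo;
            if (i + 2 > src_len)
                return (size_t)-1;
            lo = ((uint32_t)src[i] << 8) | src[i + 1];
            if (lo < 0xDC00 || lo > 0xDFFF)
                return (size_t)-1;
            i += 2;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1;                /* lone low surrogate */
        }

        /* Number of UTF-8 bytes this code point needs. */
        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (need > dst_cap - out)
            return (size_t)-1;                /* would overflow dst */

        switch (need) {
        case 1:
            dst[out++] = (uint8_t)cp;
            break;
        case 2:
            dst[out++] = (uint8_t)(0xC0 | (cp >> 6));
            dst[out++] = (uint8_t)(0x80 | (cp & 0x3F));
            break;
        case 3:
            dst[out++] = (uint8_t)(0xE0 | (cp >> 12));
            dst[out++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            dst[out++] = (uint8_t)(0x80 | (cp & 0x3F));
            break;
        default:
            dst[out++] = (uint8_t)(0xF0 | (cp >> 18));
            dst[out++] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
            dst[out++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            dst[out++] = (uint8_t)(0x80 | (cp & 0x3F));
            break;
        }
    }
    return out;
}
```

A caller would allocate utf8_worst_case(len, 0) bytes before converting len bytes of input on an ASCII platform, or pass whatever capacity it has and check for the (size_t)-1 return instead of relying on pre-extension alone.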