Patch from Inaba Hiroto:

- canonical UTF-8 hash keys: if a key string for a hash is UTF8-on,
  try to downgrade the string and use the downgraded form if
  unicode::distinct is not in effect. For this task, I added a
  function bytes_from_utf8() to utf8.c. It may resemble
  utf8_to_bytes(), but that function is not convenient for this
  task. Made a test for it and added it to t/op/each.t
- Changed do_print in doio.c to apply sv_utf8_(downgrade|upgrade)
  to the mortal copy of the argument SV, and changed t/io/utf8.t
  test 18, which expects print() to upgrade its argument.
- re-implemented sv_eq with bytes_from_utf8()
- some bug fixes:
  - tr/// does not handle a UTF8 range (\x{}-\x{})
  - \ before a raw UTF8 character produced a
    "Malformed UTF-8 character" warning.
  - "\x{100}\N{CENT SIGN}" is Malformed.
  Added tests for these 3.
- and one silly bug (by me) with the qu operator.

p4raw-id: //depot/perl@8583
author: Jarkko Hietaniemi <jhi@iki.fi> 2001-01-28 19:28:40 +0000
committer: Jarkko Hietaniemi <jhi@iki.fi> 2001-01-28 19:28:40 +0000
commit: f9a6324217cffea75ff769ddd313748c0613a128 (patch)
tree: 9fb5b4ade5877ba969d093cfe37ec605de62d8dc /utf8.c
parent: 9ee2bb1a7c54b1866ff07ab9c157254810ee5205 (diff)
download: perl-f9a6324217cffea75ff769ddd313748c0613a128.tar.gz
1 files changed, 57 insertions, 0 deletions
diff --git a/utf8.c b/utf8.c
index 156e63f717..046df74d9c 100644
--- a/utf8.c
+++ b/utf8.c
@@ -583,6 +583,63 @@ Perl_utf8_to_bytes(pTHX_ U8* s, STRLEN *len)
 }
 
 /*
+=for apidoc A|U8 *|bytes_from_utf8|U8 *s|STRLEN *len|bool *is_utf8
+
+Converts a string C<s> of length C<len> from UTF8 into byte encoding.
+Unlike C<utf8_to_bytes> but like C<bytes_to_utf8>, returns a pointer to
+the newly-created string, and updates C<len> to contain the new length.
+Returns the original string, with C<len> unchanged, if no conversion
+occurs. Does nothing if C<is_utf8> points to 0. Sets C<is_utf8> to 0 if
+C<s> is converted, is pure ASCII, or is malformed.
+
+=cut */
+
+U8 *
+Perl_bytes_from_utf8(pTHX_ U8* s, STRLEN *len, bool *is_utf8)
+{
+    U8 *send;
+    U8 *d;
+    U8 *start = s;
+    I32 count = 0;
+
+    if (!*is_utf8)
+	return start;
+
+    /* ensure valid UTF8 and chars < 256 before updating string */
+    for (send = s + *len; s < send;) {
+	U8 c = *s++;
+        if (!UTF8_IS_ASCII(c)) {
+	    if (UTF8_IS_CONTINUATION(c) || s >= send ||
+		!UTF8_IS_CONTINUATION(*s)) {
+		*is_utf8 = 0;		
+		return start;
+	    }
+	    if ((c & 0xfc) != 0xc0)
+		return start;
+	    s++, count++;
+        }
+    }
+
+    *is_utf8 = 0;		
+
+    if (!count)
+	return start;
+
+    Newz(801, d, (*len) - count + 1, U8);
+    s = start; start = d;
+    while (s < send) {
+	U8 c = *s++;
+	if (UTF8_IS_ASCII(c))
+	    *d++ = c;
+	else
+	    *d++ = UTF8_ACCUMULATE(c&3, *s++);
+    }
+    *d = '\0';
+    *len = d - start;
+    return start;
+}
+
+/*
 =for apidoc A|U8 *|bytes_to_utf8|U8 *s|STRLEN *len
 
 Converts a string C<s> of length C<len> from ASCII into UTF8 encoding.
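The two-pass downgrade described in the apidoc comment can be sketched outside the Perl core as follows. This is a minimal illustration, not the patch itself. The name downgrade_utf8 is hypothetical, malloc stands in for Newz, and the NULL return on allocation failure is an assumption (the Perl allocator does not return on failure). Like the patch, it accepts the lead bytes 0xC0 and 0xC1, which a strict decoder would reject as overlong.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical standalone analogue of bytes_from_utf8(); not the Perl API.
 * Returns s itself when nothing is converted, or a newly malloc'ed,
 * NUL-terminated buffer (owned by the caller) when a conversion happens. */
unsigned char *
downgrade_utf8(const unsigned char *s, size_t *len, bool *is_utf8)
{
    const unsigned char *p = s;
    const unsigned char *end;
    size_t two_byte = 0;            /* number of 2-byte sequences seen */
    unsigned char *out, *d;

    assert(s != NULL && len != NULL && is_utf8 != NULL);
    end = s + *len;

    if (!*is_utf8)
        return (unsigned char *)s;  /* already bytes: nothing to do */

    /* Pass 1: check that every character is well formed and < 256. */
    while (p < end) {
        unsigned char c = *p++;
        if (c < 0x80)
            continue;               /* ASCII */
        if ((c & 0xC0) == 0x80 || p >= end || (*p & 0xC0) != 0x80) {
            *is_utf8 = false;       /* malformed: treat as bytes */
            return (unsigned char *)s;
        }
        if ((c & 0xFC) != 0xC0)     /* lead byte beyond 0xC3: char >= 256 */
            return (unsigned char *)s;
        p++;
        two_byte++;
    }

    if (two_byte == 0) {            /* pure ASCII: same bytes either way */
        *is_utf8 = false;
        return (unsigned char *)s;
    }

    out = malloc(*len - two_byte + 1);
    if (out == NULL)
        return NULL;                /* assumption: report failure as NULL */
    *is_utf8 = false;

    /* Pass 2: fold each 2-byte sequence into a single byte. */
    for (p = s, d = out; p < end; ) {
        unsigned char c = *p++;
        if (c < 0x80)
            *d++ = c;
        else
            *d++ = (unsigned char)(((c & 0x03) << 6) | (*p++ & 0x3F));
    }
    *d = '\0';
    *len = (size_t)(d - out);
    return out;
}
```

A caller can tell whether a new buffer was allocated by comparing the returned pointer with the input pointer, and should free the result only when the two differ.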