author | Jarkko Hietaniemi <jhi@iki.fi> | 2001-01-28 19:28:40 +0000 |
---|---|---|
committer | Jarkko Hietaniemi <jhi@iki.fi> | 2001-01-28 19:28:40 +0000 |
commit | f9a6324217cffea75ff769ddd313748c0613a128 (patch) | |
tree | 9fb5b4ade5877ba969d093cfe37ec605de62d8dc /utf8.c | |
parent | 9ee2bb1a7c54b1866ff07ab9c157254810ee5205 (diff) | |
Patch from Inaba Hiroto:
- canonical UTF-8 hash keys: if a key string for a hash is
UTF8-on, try to downgrade the string, and use the downgraded
form if unicode::distinct is not in effect.
For this, I added a function bytes_from_utf8() to utf8.c.
It resembles utf8_to_bytes(), but utf8_to_bytes() is not
convenient for the task.
Made a test for it and added it to t/op/each.t
- Changed do_print in doio.c to apply sv_utf8_(downgrade|upgrade) to
the mortal copy of the argument SV.
And changed t/io/utf8.t test 18 which expects print() to
upgrade its argument.
- re-implement sv_eq with bytes_from_utf8()
- some bug fixes
- tr/// does not handle UTF8 range (\x{}-\x{})
- \ before raw UTF8 character produced
"Malformed UTF-8 character" warning.
- "\x{100}\N{CENT SIGN}" is Malformed.
Added tests for these 3.
- and one silly bug (by me) with qu operator.
p4raw-id: //depot/perl@8583
Diffstat (limited to 'utf8.c')
-rw-r--r-- | utf8.c | 57 |
1 files changed, 57 insertions, 0 deletions
@@ -583,6 +583,63 @@ Perl_utf8_to_bytes(pTHX_ U8* s, STRLEN *len)
 }
 
 /*
+=for apidoc A|U8 *|bytes_from_utf8|U8 *s|STRLEN *len|bool *is_utf8
+
+Converts a string C<s> of length C<len> from UTF8 into byte encoding.
+Unlike <utf8_to_bytes> but like C<bytes_to_utf8>, returns a pointer to
+the newly-created string, and updates C<len> to contain the new length.
+Returns the original string if no conversion occurs, C<len> and
+C<is_utf8> are unchanged. Do nothing if C<is_utf8> points to 0. Sets
+C<is_utf8> to 0 if C<s> is converted or malformed .
+
+=cut */
+
+U8 *
+Perl_bytes_from_utf8(pTHX_ U8* s, STRLEN *len, bool *is_utf8)
+{
+    U8 *send;
+    U8 *d;
+    U8 *start = s;
+    I32 count = 0;
+
+    if (!*is_utf8)
+        return start;
+
+    /* ensure valid UTF8 and chars < 256 before updating string */
+    for (send = s + *len; s < send;) {
+        U8 c = *s++;
+        if (!UTF8_IS_ASCII(c)) {
+            if (UTF8_IS_CONTINUATION(c) || s >= send ||
+                !UTF8_IS_CONTINUATION(*s)) {
+                *is_utf8 = 0;
+                return start;
+            }
+            if ((c & 0xfc) != 0xc0)
+                return start;
+            s++, count++;
+        }
+    }
+
+    *is_utf8 = 0;
+
+    if (!count)
+        return start;
+
+    Newz(801, d, (*len) - count + 1, U8);
+    d = s = start;
+    while (s < send) {
+        U8 c = *s++;
+        if (UTF8_IS_ASCII(c))
+            *d++ = c;
+        else
+            *d++ = UTF8_ACCUMULATE(c&3, *s++);
+    }
+    *d = '\0';
+    *len = d - start;
+    return start;
+}
+
+/*
 =for apidoc A|U8 *|bytes_to_utf8|U8 *s|STRLEN *len
 
 Converts a string C<s> of length C<len> from ASCII into UTF8 encoding.
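
For readers without a Perl build tree, the two-pass idea in the hunk above (validate and count first, then allocate and decode) can be sketched in plain C. This is only an illustration under stated assumptions: `downgrade_utf8` is a hypothetical name, not a Perl API, and it uses `malloc` and raw bit masks in place of `Newz` and the `UTF8_*` macros. It also differs from the patch in two ways: it accepts only the lead bytes 0xC2 and 0xC3 (so overlong forms are refused), and it leaves the flag unchanged for malformed input rather than clearing it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-alone analogue of bytes_from_utf8(); not Perl API.
 * Returns a newly malloc'd, NUL-terminated byte string when a conversion
 * happened, otherwise returns the original pointer. The caller frees the
 * result only if it differs from the input pointer. */
unsigned char *
downgrade_utf8(const unsigned char *s, size_t *len, int *is_utf8)
{
    const unsigned char *p = s;
    const unsigned char *end;
    unsigned char *out, *d;
    size_t two_byte = 0;

    assert(s != NULL && len != NULL && is_utf8 != NULL);
    end = s + *len;

    if (!*is_utf8)
        return (unsigned char *)s;

    /* Pass 1: every non-ASCII character must be a two-byte sequence
     * encoding U+0080..U+00FF, i.e. lead byte 0xC2 or 0xC3 followed by
     * one continuation byte. Anything else leaves the input untouched. */
    while (p < end) {
        unsigned char c = *p++;
        if (c < 0x80)
            continue;
        if ((c != 0xC2 && c != 0xC3) || p >= end || (*p & 0xC0) != 0x80)
            return (unsigned char *)s;
        p++;
        two_byte++;
    }

    if (two_byte == 0) {
        /* Pure ASCII: bytes and UTF-8 are identical, just drop the flag. */
        *is_utf8 = 0;
        return (unsigned char *)s;
    }

    /* Each two-byte sequence shrinks to one byte; +1 for the NUL. */
    out = malloc(*len - two_byte + 1);
    if (out == NULL)
        return (unsigned char *)s;

    /* Pass 2: decode into the new buffer, reading from the original. */
    d = out;
    p = s;
    while (p < end) {
        unsigned char c = *p++;
        if (c < 0x80)
            *d++ = c;
        else
            *d++ = (unsigned char)(((c & 0x03) << 6) | (*p++ & 0x3F));
    }
    *d = '\0';
    *len = (size_t)(d - out);
    *is_utf8 = 0;
    return out;
}
```

The sketch keeps separate read and write pointers and returns the new buffer, so the input is never modified. A hash-key caller in the spirit of the commit message would compare the returned pointer with the original to decide whether it owns (and must free) the result.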