author | karl williamson <public@khwilliamson.com> | 2009-01-02 11:22:02 +0100 |
---|---|---|
committer | Rafael Garcia-Suarez <rgarciasuarez@gmail.com> | 2009-01-02 11:22:02 +0100 |
commit | b3ab6785f6871a84567168e1bd0426ff2f66d282 (patch) | |
tree | 55bbeae5eb31ddce81ea422f6736e1a828ccc2b2 /utfebcdic.h | |
parent | 797f6e9fa603b84d2f7e8cbe978a6be740007606 (diff) | |
Faster sv_utf8_upgrade()
Consider what currently happens when the tokenizer is scanning a string.
It looks through it byte-by-byte until it finds a character that forces
it to decide to go to utf8. It then calls sv_utf8_upgrade() with the
portion of the string scanned so far.
sv_utf8_upgrade() starts over from the beginning, and scans the string
byte-by-byte until it finds a character that varies between non-utf8 and
utf8. It then calls bytes_to_utf8().
bytes_to_utf8() allocates a new string that can handle the worst case
expansion, 2n+1, of the entire string, and starts over from the
beginning, and scans the input string byte-by-byte copying and
converting each character to the output string as it goes.
bytes_to_utf8() doesn't return the allocated size of the new string, so
sv_utf8_upgrade() assumes the buffer is only as big as what actually got
converted, throwing away knowledge of any spare space.
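
As a rough illustration of the behaviour described above, here is a minimal sketch of a worst-case-allocating converter. It is not Perl's bytes_to_utf8(): the function name and signature are hypothetical, and it assumes an ASCII platform where the input bytes are Latin-1 code points, so every byte >= 0x80 expands to exactly two UTF-8 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the conversion described above (not Perl's
 * bytes_to_utf8()).  Assumes ASCII/Latin-1 input, not EBCDIC. */
static unsigned char *
latin1_to_utf8_worst_case(const unsigned char *src, size_t len, size_t *out_len)
{
    unsigned char *dst;
    unsigned char *d;
    size_t i;

    assert(src != NULL && out_len != NULL);

    /* Worst case: every input byte becomes two, plus a trailing NUL. */
    dst = malloc(2 * len + 1);
    if (dst == NULL)
        return NULL;

    /* Byte-by-byte pass over the whole input, including any head that
     * needs no conversion at all. */
    d = dst;
    for (i = 0; i < len; i++) {
        unsigned char c = src[i];
        if (c < 0x80) {
            *d++ = c;                                   /* invariant */
        } else {
            *d++ = (unsigned char)(0xC0 | (c >> 6));    /* lead byte */
            *d++ = (unsigned char)(0x80 | (c & 0x3F));  /* continuation */
        }
    }
    *d = '\0';

    /* Only the converted length is reported; the caller never learns
     * that 2 * len + 1 bytes were actually allocated. */
    *out_len = (size_t)(d - dst);
    return dst;
}
```

The caller gets back the converted length only, which is the point made above: any spare bytes in the allocation are invisible to it.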
It then returns to the tokenizer, which immediately does a grow to get
space for the unparsed input. This is likely to cause a new string to
be allocated and copied from the one we had just created, even if that
string in actuality had enough space in it.
Thus, the invariant head portion of the string is scanned 3 times, and
probably 2 strings will be allocated and copied.
My solution to cutting this down is to do several things.
First, I added an extra flag for sv_utf8_upgrade that says don't bother
to check if the string needs to be converted to utf8, just assume it
does. This eliminates one of the passes.
I also added a new parameter to sv_utf8_upgrade that says when you
return, I want this much unused space in the string. That eliminates
the extra grow.
This was all done by renaming the current work-horse function from
sv_utf8_upgrade_flags() to sv_utf8_upgrade_flags_grow() and making the
old name a macro which calls the revised function with a grow parameter
of 0.
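
The shape of that interface change can be sketched as follows. Everything here is a toy under stated assumptions: toy_sv, TOY_FORCE_UPGRADE, toy_upgrade_flags_grow() and toy_upgrade_flags() are hypothetical names invented for illustration, not the real Perl types, flags, or functions, and the conversion assumes Latin-1 input on an ASCII platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_FORCE_UPGRADE 0x1u  /* hypothetical: caller knows conversion is needed */

typedef struct {                /* hypothetical toy string, not Perl's SV */
    unsigned char *buf;         /* malloc'ed, NUL-terminated */
    size_t len;                 /* bytes in use, excluding the NUL */
    size_t cap;                 /* bytes allocated */
    int is_utf8;
} toy_sv;

/* Work-horse: convert to UTF-8 and leave at least `extra` unused bytes.
 * Returns 0 on success, -1 on allocation failure. */
static int
toy_upgrade_flags_grow(toy_sv *sv, unsigned flags, size_t extra)
{
    int needs_conversion;
    size_t i, n = 0, cap;
    unsigned char *out;

    assert(sv != NULL && sv->buf != NULL);
    if (sv->is_utf8)
        return 0;

    /* The checking pass is skipped entirely when the caller forces it. */
    needs_conversion = (flags & TOY_FORCE_UPGRADE) != 0;
    for (i = 0; !needs_conversion && i < sv->len; i++)
        needs_conversion = sv->buf[i] >= 0x80;

    /* Reserve the caller's spare space up front, so no later grow is needed. */
    cap = (needs_conversion ? 2 * sv->len : sv->len) + extra + 1;
    out = malloc(cap);
    if (out == NULL)
        return -1;

    for (i = 0; i < sv->len; i++) {
        unsigned char c = sv->buf[i];
        if (c < 0x80) {
            out[n++] = c;
        } else {
            out[n++] = (unsigned char)(0xC0 | (c >> 6));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    out[n] = '\0';

    free(sv->buf);
    sv->buf = out;
    sv->len = n;
    sv->cap = cap;
    sv->is_utf8 = 1;
    return 0;
}

/* The old name survives as a macro that asks for no extra space. */
#define toy_upgrade_flags(sv, flags) toy_upgrade_flags_grow((sv), (flags), 0)
```

Existing callers of the old name keep compiling unchanged, while a caller such as the tokenizer can pass both the force flag and the amount of input it still has to append.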
I also improved the internal efficiency of sv_utf8_upgrade(): when it
does scan the string, it no longer calls bytes_to_utf8() but does the
conversion itself. It uses a fast memory copy instead of the
byte-oriented one for the invariant head, it uses that head to get a
better estimate of the needed size of the new string, and it doesn't
throw away the knowledge of the allocated size.
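
A minimal sketch of that idea, again with a hypothetical function name and assuming Latin-1 input on an ASCII platform: the invariant head is found once, copied with memcpy(), and excluded from the expansion estimate, and the allocated size is handed back to the caller instead of being discarded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical illustration only; not the code in sv.c. */
static unsigned char *
upgrade_with_head_copy(const unsigned char *src, size_t len, size_t extra,
                       size_t *out_len, size_t *out_cap)
{
    size_t head = 0, cap, i;
    unsigned char *dst, *d;

    assert(src != NULL && out_len != NULL && out_cap != NULL);

    /* The invariant head: bytes that are identical in both encodings. */
    while (head < len && src[head] < 0x80)
        head++;

    /* Only the tail can expand, so head + 2 * tail is a tighter bound
     * than 2 * len.  Add the caller's requested spare and the NUL. */
    cap = head + 2 * (len - head) + extra + 1;
    dst = malloc(cap);
    if (dst == NULL)
        return NULL;

    /* Bulk copy for the head, byte-wise conversion only for the tail. */
    memcpy(dst, src, head);
    d = dst + head;
    for (i = head; i < len; i++) {
        unsigned char c = src[i];
        if (c < 0x80) {
            *d++ = c;
        } else {
            *d++ = (unsigned char)(0xC0 | (c >> 6));
            *d++ = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    *d = '\0';

    *out_len = (size_t)(d - dst);
    *out_cap = cap;     /* the allocated size is reported, not thrown away */
    return dst;
}
```

For a long ASCII string with a single accented character near the end, the bound here is barely larger than the input, whereas a 2n+1 allocation would be about twice as large.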
And, if it is clear without scanning the whole string that the
conversion will fit in the already allocated string, it just uses that
instead of allocating and copying a new one, using the algorithm I
copied from the tokenizer. (In this case it does have to finish
scanning the whole string to get the correct size.) The comments have
details.
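
The in-place case can be sketched like this. It is a simplified illustration, not the tokenizer's or sv.c's code: the function name is hypothetical, Latin-1 input on an ASCII platform is assumed, and it counts the variant bytes over the whole string before deciding, rather than stopping early once the fit is certain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration only.  Expands `buf` (len bytes in use, cap
 * bytes allocated) to UTF-8 in place.  Returns 1 and sets *new_len on
 * success, or 0 (buffer untouched) if the result would not fit. */
static int
upgrade_in_place(unsigned char *buf, size_t len, size_t cap, size_t *new_len)
{
    size_t variants = 0, need, i;
    unsigned char *s, *d;

    assert(buf != NULL && new_len != NULL && len < cap);

    /* Each byte >= 0x80 needs exactly one additional byte. */
    for (i = 0; i < len; i++)
        if (buf[i] >= 0x80)
            variants++;

    need = len + variants;
    if (need + 1 > cap)
        return 0;               /* no room for the result plus its NUL */

    /* Work backwards from the end, so that a byte is always read
     * before the expanding output can overwrite it. */
    s = buf + len;
    d = buf + need;
    *d = '\0';
    while (s > buf) {
        unsigned char c = *--s;
        if (c < 0x80) {
            *--d = c;
        } else {
            *--d = (unsigned char)(0x80 | (c & 0x3F));  /* continuation */
            *--d = (unsigned char)(0xC0 | (c >> 6));    /* lead byte */
        }
    }

    *new_len = need;
    return 1;
}
```

Because the write pointer never falls behind the read pointer, no second buffer and no copy are needed when the spare space is sufficient.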
It still is byte-oriented. Vectorization et al. could yield
performance improvements. One idea for that is in the comments.
The patch also includes a new synonym I created, NATIVE8_TO_UNI, which
is a more accurate name than NATIVE_TO_ASCII.
Diffstat (limited to 'utfebcdic.h')
-rw-r--r-- | utfebcdic.h | 5 |
1 files changed, 3 insertions, 2 deletions
diff --git a/utfebcdic.h b/utfebcdic.h
index bb88571212..7ed9375028 100644
--- a/utfebcdic.h
+++ b/utfebcdic.h
@@ -29,7 +29,7 @@
  * in I8, far beyond the current Unicode standard's
  * max, as shown in the comment later in this file.)
  * 3) Use the table published in tr16 to convert each byte from step 2 into
- * final UTF-EBCDIC. The table in this file is PL_utf2e, and its invverse
+ * final UTF-EBCDIC. The table in this file is PL_utf2e, and its inverse
  * is PL_e2utf. They are constructed so that all EBCDIC invariants remain
  * invariant, but no others do. For example, the ordinal value of 'A' is
  * 193 in EBCDIC, and also is 193 in UTF-EBCDIC. Step 1) converts it to
@@ -406,6 +406,7 @@ END_EXTERN_C
 /* Native to iso-8859-1 */
 #define NATIVE_TO_ASCII(ch) PL_e2a[(U8)(ch)]
+#define NATIVE8_TO_UNI(ch) NATIVE_TO_ASCII(ch) /* synonym */
 #define ASCII_TO_NATIVE(ch) PL_a2e[(U8)(ch)]
 /* Transform after encoding */
 #define NATIVE_TO_UTF(ch) PL_e2utf[(U8)(ch)]
@@ -461,7 +462,7 @@ END_EXTERN_C
 #define UNI_IS_INVARIANT(c) ((c) < 0xA0)
 /* UTF-EBCDIC sematic macros - transform back into UTF-8-Mod and then compare */
-#define NATIVE_IS_INVARIANT(c) UNI_IS_INVARIANT(NATIVE_TO_ASCII(c))
+#define NATIVE_IS_INVARIANT(c) UNI_IS_INVARIANT(NATIVE8_TO_UNI(c))
 #define UTF8_IS_INVARIANT(c) UNI_IS_INVARIANT(NATIVE_TO_UTF(c))
 #define UTF8_IS_START(c) (NATIVE_TO_UTF(c) >= 0xA0 && (NATIVE_TO_UTF(c) & 0xE0) != 0xA0)
 #define UTF8_IS_CONTINUATION(c) ((NATIVE_TO_UTF(c) & 0xE0) == 0xA0)