author | Karl Williamson <khw@cpan.org> | 2018-06-27 21:52:47 -0600 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2018-07-05 14:47:19 -0600 |
commit | e6a4ffc3f7aa69cbf3e5e83518e40e529a34b75b (patch) | |
tree | a861c0d604b080ed7993759dbcd53719389a608c /utf8.c | |
parent | 5af9f82224b62a56edca6981b8638328d12ba98a (diff) | |
Inline dfa for translating from UTF-8
This commit inlines the simple portion of the dfa that translates from
UTF-8 to code points, used in functions like utf8_to_uvchr_buf.
This dfa has been changed in previous commits so that it is small, and
punts on any problematic input, plus 18% of the Hangul syllable code
points. (These still come out faster than blead.) The smallness allows
it to be inlined, adding <2000 total bytes to the perl text space.
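The split described above (a small inline fast path that punts anything unusual to an out-of-line routine) can be sketched roughly as follows. This is not the table-driven dfa from this commit: it is a hand-written stand-in that only goes up to three-byte sequences, and the names decode_fast, decode_slow and DECODE_PUNTED are hypothetical. Which inputs it punts on (lead byte 0xED, non-characters, anything four bytes or longer) is an assumption chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define DECODE_PUNTED UINT32_MAX   /* hypothetical sentinel for this sketch */

/* Hypothetical out-of-line routine for everything the fast path declines.
 * A real one would decode the hard cases and report malformations; this
 * stand-in only signals that it was reached. */
static uint32_t decode_slow(const uint8_t *s, const uint8_t *send, size_t *retlen)
{
    (void) s; (void) send;
    *retlen = 0;
    return DECODE_PUNTED;
}

/* Is b a UTF-8 continuation byte (10xxxxxx)? */
static inline int is_cont(uint8_t b) { return (b & 0xC0) == 0x80; }

/* Fast path, small enough to inline: ASCII, two-byte sequences, and
 * three-byte sequences whose lead byte is not 0xED.  Assumes s <= send. */
static inline uint32_t decode_fast(const uint8_t *s, const uint8_t *send, size_t *retlen)
{
    size_t avail = (size_t) (send - s);

    if (avail >= 1 && s[0] < 0x80) {
        *retlen = 1;
        return s[0];
    }
    if (avail >= 2 && s[0] >= 0xC2 && s[0] <= 0xDF && is_cont(s[1])) {
        *retlen = 2;
        return ((uint32_t) (s[0] & 0x1F) << 6) | (s[1] & 0x3F);
    }
    if (avail >= 3 && s[0] >= 0xE0 && s[0] <= 0xEF && s[0] != 0xED
        && is_cont(s[1]) && is_cont(s[2])
        && (s[0] != 0xE0 || s[1] >= 0xA0))      /* reject overlongs */
    {
        uint32_t cp = ((uint32_t) (s[0] & 0x0F) << 12)
                    | ((uint32_t) (s[1] & 0x3F) << 6)
                    |  (uint32_t) (s[2] & 0x3F);

        /* Non-characters are left to the slow path as well. */
        if (! ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE)) {
            *retlen = 3;
            return cp;
        }
    }
    return decode_slow(s, send, retlen);
}
```

Because decode_fast touches nothing but its arguments, it needs no per-thread state, which is the property that makes inlining cheap.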
The inlined part never calls anything that needs thread context, so that
parameter can be removed. I decided to remove it also from the
Perl_utf8_to_uvchr_buf() and Perl_utf8n_to_uvchr_error() functions.
There is a small risk that someone is actually using those functions
instead of the documented macros utf8_to_uvchr_buf() and
utf8n_to_uvchr_error(). If so, this can be added back in.
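To illustrate why callers of the documented macros are unaffected while direct callers of the functions are not, here is a minimal sketch. All names in it (ctx, first_byte, FIRST_BYTE, SKETCH_OLD_API, my_ctx) are hypothetical stand-ins, not the perl API; the only assumption borrowed from perl is that the macro supplies the context argument implicitly from a variable in scope.

```c
#include <stddef.h>

typedef struct ctx { int unused; } ctx;   /* hypothetical thread context */

#ifdef SKETCH_OLD_API
/* Old shape: the context is the first parameter, though the body
 * never looks at it. */
static unsigned first_byte(ctx *c, const unsigned char *s, size_t len)
{
    (void) c;
    return len ? s[0] : 0;
}
#  define FIRST_BYTE(s, len) first_byte(my_ctx, (s), (len))
#else
/* New shape: the unused context parameter is gone. */
static unsigned first_byte(const unsigned char *s, size_t len)
{
    return len ? s[0] : 0;
}
#  define FIRST_BYTE(s, len) first_byte((s), (len))
#endif

/* A caller written against the macro compiles unchanged either way.
 * A caller that spelled out first_byte(my_ctx, s, len) by hand would
 * stop compiling once the parameter is removed. */
static unsigned caller(ctx *my_ctx, const unsigned char *s, size_t len)
{
    (void) my_ctx;
    return FIRST_BYTE(s, len);
}
```

Building the sketch with and without -DSKETCH_OLD_API exercises both shapes through the same caller.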
Perl_utf8_to_uvchr_msgs() is entirely removed, but the macro
utf8_to_uvchr_msgs() which is the normal interface to it is retained
unchanged, and it is marked as unstable anyway.
This change decreases the number of conditional branches in the Perl
statement
my $a = ord("\x{foo}")
where foo is a non-problematic code point by about 11%, except for
ASCII characters, where it is 4%, and those Hangul syllables mentioned
above, where it is 7%. Problematic code points fare much worse here
than in blead. These are the surrogates, non-characters, and
non-Unicode code points. We don't care very much about the speed of
handling these code points, which are mostly considered illegal by
Unicode anyway.
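For reference, the three problematic classes named above can be written out as range checks. This is a sketch using the ranges defined by Unicode (surrogates U+D800..U+DFFF, the 66 non-characters, and anything above U+10FFFF); the function names are hypothetical and are not the ones perl uses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Surrogates: U+D800..U+DFFF. */
static bool is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Non-characters: U+FDD0..U+FDEF, plus the last two code points of
 * every plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFF). */
static bool is_nonchar(uint32_t cp)
{
    return cp <= 0x10FFFF
        && ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE);
}

/* Non-Unicode: anything above the last Unicode code point. */
static bool is_above_unicode(uint32_t cp)
{
    return cp > 0x10FFFF;
}

static bool is_problematic(uint32_t cp)
{
    return is_surrogate(cp) || is_nonchar(cp) || is_above_unicode(cp);
}
```

Applied to the benchmarked code points, is_problematic() is true for d800, fdd0, ffff and 110000, and false for the rest.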
The percentage decrease is higher for just the function itself, as
the measured Perl statement has unchanged overhead.
Here are the annotated benchmarks:
Key:
Ir Instruction read
Dr Data read
Dw Data write
COND conditional branches
IND indirect branches
_m branch predict miss
_m1 level 1 cache miss
_mm last cache (e.g. L3) miss
- indeterminate percentage (e.g. 1/0)
The numbers represent raw counts per loop iteration.
translate_utf8_to_uv_007f
my $a = ord("\x{007f}")
blead dfa Ratio %
----- ----- -------
Ir 395.0 370.0 106.8
Dr 122.0 115.0 106.1
Dw 71.0 61.0 116.4
COND 49.0 47.0 104.3
IND 5.0 5.0 100.0
In all the measurements, the indirect numbers were all zeros and
unchanged, and are omitted in this message.
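The Ratio % column appears to be the blead count divided by the dfa count, times 100 (an inference from the figures, e.g. 395.0 / 370.0 is about 106.8), so values above 100 are improvements. That is a different number from the percentage decrease quoted in the prose, which is taken relative to blead. A small sketch of both calculations, with hypothetical function names:

```c
/* Ratio as it appears in the tables: blead count over dfa count, in
 * percent.  Above 100 means dfa does less work than blead. */
static double ratio_percent(double blead, double dfa)
{
    return 100.0 * blead / dfa;
}

/* Reduction relative to blead, in percent.  Positive means dfa does
 * less work than blead. */
static double decrease_percent(double blead, double dfa)
{
    return 100.0 * (blead - dfa) / blead;
}
```

For the COND row of the 007f table (49.0 against 47.0), ratio_percent gives roughly 104.3 and decrease_percent roughly 4.1, which lines up with the 4% figure given for ASCII.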
translate_utf8_to_uv_07ff
my $a = ord("\x{07ff}")
blead dfa Ratio %
----- ----- -------
Ir 438.0 390.0 112.3
Dr 128.0 118.0 108.5
Dw 71.0 61.0 116.4
COND 57.0 51.0 111.8
IND 5.0 5.0 100.0
translate_utf8_to_uv_cfff
my $a = ord("\x{cfff}")
This is the highest Hangul syllable that gets the full reduction.
blead dfa Ratio %
----- ----- -------
Ir 457.0 410.0 111.5
Dr 131.0 121.0 108.3
Dw 71.0 61.0 116.4
COND 61.0 55.0 110.9
IND 5.0 5.0 100.0
translate_utf8_to_uv_d000
my $a = ord("\x{d000}")
This is the lowest affected Hangul syllable
blead dfa Ratio %
----- ----- -------
Ir 457.0 443.0 103.2
Dr 131.0 132.0 99.2
Dw 71.0 71.0 100.0
COND 61.0 57.0 107.0
IND 5.0 5.0 100.0
translate_utf8_to_uv_d7ff
my $a = ord("\x{d7ff}")
This is the highest affected Hangul syllable
blead dfa Ratio %
----- ----- -------
Ir 457.0 443.0 103.2
Dr 131.0 132.0 99.2
Dw 71.0 71.0 100.0
COND 61.0 57.0 107.0
IND 5.0 5.0 100.0
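One plausible reading of why the boundary sits exactly at d000 (an inference, not something stated in this message): on ASCII platforms every code point from U+D000 through U+DFFF encodes with the same UTF-8 lead byte, 0xED, so the range d000..d7ff shares its first byte with the surrogates d800..dfff. The helper below, whose name is hypothetical, computes that lead byte for any code point that takes three bytes.

```c
#include <stdint.h>

/* First byte of the three-byte UTF-8 encoding of cp.  Only meaningful
 * for U+0800..U+FFFF; the top four bits of cp land in the low nibble
 * of the lead byte, under the 1110xxxx marker. */
static uint8_t lead_byte_3(uint32_t cp)
{
    return (uint8_t) (0xE0 | (cp >> 12));
}
```

With this, cfff yields 0xEC, while d000, d7ff and d800 all yield 0xED.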
translate_utf8_to_uv_d800
my $a = ord("\x{d800}")
This is a surrogate, showing much worse performance, but we don't care
blead dfa Ratio %
----- ----- -------
Ir 457.0 515.0 88.7
Dr 131.0 134.0 97.8
Dw 71.0 73.0 97.3
COND 61.0 75.0 81.3
IND 5.0 5.0 100.0
translate_utf8_to_uv_fdd0
my $a = ord("\x{fdd0}")
This is a non-char, showing much worse performance, but we don't care
blead dfa Ratio %
----- ----- -------
Ir 457.0 548.0 83.4
Dr 131.0 139.0 94.2
Dw 71.0 73.0 97.3
COND 61.0 81.0 75.3
IND 5.0 5.0 100.0
translate_utf8_to_uv_fffd
my $a = ord("\x{fffd}")
blead dfa Ratio %
----- ----- -------
Ir 457.0 410.0 111.5
Dr 131.0 121.0 108.3
Dw 71.0 61.0 116.4
COND 61.0 55.0 110.9
IND 5.0 5.0 100.0
translate_utf8_to_uv_ffff
my $a = ord("\x{ffff}")
This is another non-char, showing much worse performance, but we don't
care
blead dfa Ratio %
----- ----- -------
Ir 457.0 548.0 83.4
Dr 131.0 139.0 94.2
Dw 71.0 73.0 97.3
COND 61.0 81.0 75.3
IND 5.0 5.0 100.0
translate_utf8_to_uv_1fffd
my $a = ord("\x{1fffd}")
blead dfa Ratio %
----- ----- -------
Ir 476.0 430.0 110.7
Dr 134.0 124.0 108.1
Dw 71.0 61.0 116.4
COND 65.0 59.0 110.2
IND 5.0 5.0 100.0
translate_utf8_to_uv_10fffd
my $a = ord("\x{10fffd}")
blead dfa Ratio %
----- ----- -------
Ir 476.0 430.0 110.7
Dr 134.0 124.0 108.1
Dw 71.0 61.0 116.4
COND 65.0 59.0 110.2
IND 5.0 5.0 100.0
translate_utf8_to_uv_110000
my $a = ord("\x{110000}")
This is a non-Unicode code point, showing much worse performance, but we
don't care
blead dfa Ratio %
----- ----- -------
Ir 476.0 544.0 87.5
Dr 134.0 137.0 97.8
Dw 71.0 73.0 97.3
COND 65.0 81.0 80.2
IND 5.0 5.0 100.0
Diffstat (limited to 'utf8.c')
-rw-r--r-- | utf8.c | 46 |
1 files changed, 8 insertions, 38 deletions
@@ -1275,10 +1275,10 @@ Also implemented as a macro in utf8.h
 */
 
 UV
-Perl_utf8n_to_uvchr(pTHX_ const U8 *s,
-                    STRLEN curlen,
-                    STRLEN *retlen,
-                    const U32 flags)
+Perl_utf8n_to_uvchr(const U8 *s,
+                    STRLEN curlen,
+                    STRLEN *retlen,
+                    const U32 flags)
 {
     PERL_ARGS_ASSERT_UTF8N_TO_UVCHR;
 
@@ -1404,7 +1404,7 @@ Also implemented as a macro in utf8.h
 */
 
 UV
-Perl_utf8n_to_uvchr_error(pTHX_ const U8 *s,
+Perl_utf8n_to_uvchr_error(const U8 *s,
                           STRLEN curlen,
                           STRLEN *retlen,
                           const U32 flags,
@@ -1468,7 +1468,7 @@ The caller, of course, is responsible for freeing any returned AV.
 */
 
 UV
-Perl_utf8n_to_uvchr_msgs(pTHX_ const U8 *s,
+Perl__utf8n_to_uvchr_msgs_helper(const U8 *s,
                                STRLEN curlen,
                                STRLEN *retlen,
                                const U32 flags,
@@ -1492,39 +1492,9 @@ Perl_utf8n_to_uvchr_msgs(pTHX_ const U8 *s,
     U8 temp_char_buf[UTF8_MAXBYTES + 1]; /* Used to avoid a Newx in this
                                             routine; see [perl #130921] */
     UV uv_so_far;
-    UV state = 0;
-
-    PERL_ARGS_ASSERT_UTF8N_TO_UVCHR_MSGS;
-
-    /* Measurements show that this dfa is somewhat faster than the regular code
-     * below, so use it first, dropping down for the non-normal cases. */
-
-#define PERL_UTF8_DECODE_REJECT 1
-
-    while (s < send && LIKELY(state != PERL_UTF8_DECODE_REJECT)) {
-        UV type = strict_utf8_dfa_tab[*s];
-
-        uv = (state == 0)
-             ?  ((0xff >> type) & NATIVE_UTF8_TO_I8(*s))
-             : UTF8_ACCUMULATE(uv, *s);
-        state = strict_utf8_dfa_tab[256 + state + type];
-
-        if (state == 0) {
-            if (retlen) {
-                *retlen = s - s0 + 1;
-            }
-            if (errors) {
-                *errors = 0;
-            }
-            if (msgs) {
-                *msgs = NULL;
-            }
+    dTHX;
 
-            return uv;
-        }
-
-        s++;
-    }
+    PERL_ARGS_ASSERT__UTF8N_TO_UVCHR_MSGS_HELPER;
 
     /* Here, is one of: a) malformed; b) a problematic code point (surrogate,
      * non-unicode, or nonchar); or c) on ASCII platforms, one of the Hangul