author | Karl Williamson <khw@cpan.org> | 2015-11-23 15:00:55 -0700 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2015-11-25 15:48:17 -0700 |
commit | b67fd2c557cdf9bdc899813a5e4f2dee22e4f63e (patch) | |
tree | 055007b23413908232464966d3435a702fb4424a /toke.c | |
parent | 1d1c12d9a3f5f51cb9639329ae0b854f2dab7b05 (diff) | |
download | perl-b67fd2c557cdf9bdc899813a5e4f2dee22e4f63e.tar.gz |
toke.c: Remove soon-to-be invalid assumption
The code in toke.c assumes that the UTF-8 expansion of the string
"\x{foo}" takes no more bytes than the original input text, which
includes the 4 bytes of overhead "\x{}". Similarly for "\o{}". The
functions that convert to the code point actually now assert for this.
The next commit will make this assumption definitely invalid on EBCDIC
platforms. Remove the assertions, and actually handle the case
properly. The other places that call the conversion functions do not
make this assumption, so there is no harm in removing them from there.
Since we believe that this can't happen except on EBCDIC, we
could #ifdef this code and use just an assert on non-EBCDIC. But it's
easier to maintain if #ifdef's are minimized. Parsing is not a
time-critical operation, like being in an inner loop, and the extra test
gives a branch prediction hint to the compiler.
Diffstat (limited to 'toke.c')
-rw-r--r-- | toke.c | 19 |
1 files changed, 15 insertions, 4 deletions
@@ -3357,10 +3357,7 @@ S_scan_const(pTHX_ char *start)
                 }
 
               NUM_ESCAPE_INSERT:
-                /* Insert oct or hex escaped character. There will always be
-                 * enough room in sv since such escapes will be longer than any
-                 * UTF-8 sequence they can end up as, except if they force us
-                 * to recode the rest of the string into utf8 */
+                /* Insert oct or hex escaped character. */
 
                 /* Here uv is the ordinal of the next character being added */
                 if (UVCHR_IS_INVARIANT(uv)) {
@@ -3388,6 +3385,20 @@ S_scan_const(pTHX_ char *start)
                     }
 
                     if (has_utf8) {
+                        /* Usually, there will already be enough room in 'sv'
+                         * since such escapes are likely longer than any UTF-8
+                         * sequence they can end up as. This isn't the case on
+                         * EBCDIC where \x{40000000} contains 12 bytes, and the
+                         * UTF-8 for it contains 14. And, we have to allow for
+                         * a trailing NUL. It probably can't happen on ASCII
+                         * platforms, but be safe */
+                        const STRLEN needed = d - SvPVX(sv) + UVCHR_SKIP(uv)
+                                            + 1;
+                        if (UNLIKELY(needed > SvLEN(sv))) {
+                            SvCUR_set(sv, d - SvPVX_const(sv));
+                            d = sv_grow(sv, needed) + SvCUR(sv);
+                        }
+
                         d = (char*)uvchr_to_utf8((U8*)d, uv);
                         if (PL_lex_inwhat == OP_TRANS
                             && PL_sublex_info.sub_op)
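
The guard added in the second hunk follows a common pattern: compute how many bytes the write needs (including the trailing NUL), grow the buffer if that exceeds its allocation, and recompute the write pointer from the possibly moved base before writing. The following is only a rough standalone sketch of that pattern, not perl's code. The `GrowBuf` type and the helpers `utf8_len`, `utf8_put`, and `append_cp` are hypothetical names invented for illustration; `realloc` stands in for `sv_grow`, and standard UTF-8 (code points up to U+10FFFF) stands in for `UVCHR_SKIP` and `uvchr_to_utf8`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical growable byte buffer; a stand-in for perl's SV. */
typedef struct {
    char  *data;  /* start of storage, playing the role of SvPVX(sv) */
    size_t cap;   /* allocated bytes, playing the role of SvLEN(sv)  */
} GrowBuf;

/* Bytes that standard UTF-8 needs for cp (assumes cp <= 0x10FFFF).
 * Stand-in for UVCHR_SKIP(uv). */
static size_t utf8_len(uint32_t cp)
{
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Encode cp as standard UTF-8 at d; return a pointer just past the
 * last byte written. Stand-in for uvchr_to_utf8(). */
static char *utf8_put(char *d, uint32_t cp)
{
    unsigned char *p = (unsigned char *)d;
    if (cp < 0x80) {
        *p++ = (unsigned char)cp;
    } else if (cp < 0x800) {
        *p++ = (unsigned char)(0xC0 | (cp >> 6));
        *p++ = (unsigned char)(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        *p++ = (unsigned char)(0xE0 | (cp >> 12));
        *p++ = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        *p++ = (unsigned char)(0x80 | (cp & 0x3F));
    } else {
        *p++ = (unsigned char)(0xF0 | (cp >> 18));
        *p++ = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        *p++ = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        *p++ = (unsigned char)(0x80 | (cp & 0x3F));
    }
    return (char *)p;
}

/* Append cp at write pointer d, growing buf first when the encoded
 * bytes plus a trailing NUL would not fit. Returns the new write
 * pointer, or NULL if the allocation fails. */
static char *append_cp(GrowBuf *buf, char *d, uint32_t cp)
{
    assert(cp <= 0x10FFFF && d >= buf->data);

    size_t used   = (size_t)(d - buf->data);
    size_t needed = used + utf8_len(cp) + 1;  /* +1 for trailing NUL */

    if (needed > buf->cap) {
        char *grown = realloc(buf->data, needed);
        if (grown == NULL)
            return NULL;
        buf->data = grown;
        buf->cap  = needed;
        d = grown + used;  /* the old d may dangle after realloc */
    }

    d = utf8_put(d, cp);
    *d = '\0';
    return d;
}
```

A caller would keep the returned pointer as its new write position, for example `d = append_cp(&buf, d, 0x20AC);`. As in the patch, the growth branch is expected to be rare, and the important detail is that `d` is rebuilt from the new base address rather than reused.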