author | David Mitchell <davem@iabyn.com> | 2018-01-09 10:05:33 +0000 |
---|---|---|
committer | David Mitchell <davem@iabyn.com> | 2018-01-19 13:45:19 +0000 |
commit | ea088e559c9bca8e7337d3d6236f06deb5afda32 (patch) | |
tree | d2b45636f53f4a63d07230bb6ca0d74818c4ee6a /doop.c | |
parent | ddcffd84df3b0cf7b2325bed1db38b89bf19fa53 (diff) | |
OP_TRANS: change extended table format
For non-utf8, OP_TRANS(R) ops have a translation table consisting of an
array of 256 shorts attached. For tr///c, this table is extended to hold
information about chars in the replacement list which aren't paired with
chars in the search list. For example,
tr/\x00-AE-\xff/bcdefg/c
is equivalent to
tr/BCD\x{100}-\x{7fffffff}/bcdefg/
which is equivalent to
tr/BCD\x{100}-\x{7fffffff}/bcdefggggggggg..../
Only the BCD => bcd mappings can be stored in the basic 256-slot table,
so potentially the following extra information needs recording in an
extended table to handle codepoints > 0xff in the string being modified:
1) the extra replacement chars ("efg");
2) the number of extra replacement chars (3);
3) the "repeat" char ('g').
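As a rough illustration of the three items above, the following C sketch splits a replacement list once the search chars in 0x00..0xff have used up their share. The struct and function names are hypothetical and are not part of the perl source; it assumes a non-empty, single-byte replacement list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of perl: describes what is left of a
 * replacement list after the 0x00..0xff search chars have been paired. */
struct tr_split {
    const char *extra;   /* 1) replacement chars left for codepoints > 0xff */
    size_t      n_extra; /* 2) how many of them there are */
    char        repeat;  /* 3) char reused once the list runs out */
};

/* n_low is the number of search chars in 0x00..0xff (3 for "BCD"). */
static struct tr_split split_replacement(const char *repl, size_t n_low)
{
    size_t len = strlen(repl);
    struct tr_split s;

    assert(len > 0);                        /* empty list is the tr/....//c case */
    s.repeat  = repl[len - 1];              /* last replacement char */
    s.n_extra = len > n_low ? len - n_low : 0;
    s.extra   = repl + (len > n_low ? n_low : len);
    return s;
}
```

For the example above, `split_replacement("bcdefg", 3)` yields the extra chars "efg", a count of 3 and the repeat char 'g'.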
Currently 2) and 3) are combined: the repeat char is found as the last
extra char, and if there are no extra chars, the repeat char is treated
as an extra char list of length 1.
As a result, an 'extra chars' length value of 1 is ambiguous: it can mean
either one extra char, or no extra chars with the repeat char faked as an
extra char.
An 'extra chars' length of 0 implies an empty replacement list, i.e.
tr/....//c.
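The old scheme can be sketched as a standalone lookup, modelled on the lines this commit removes from S_do_trans_complex. The function name and the plain `short`/`long` types are assumptions made for illustration; the real code uses perl's own types and works in place.

```c
#include <assert.h>

/* Sketch of the OLD layout: tbl[0x100] holds the length, and the extra
 * chars (or the faked repeat char) are stored from tbl[0x101] onwards.
 * old_lookup() is a hypothetical name, not a perl function. */
static long old_lookup(const short *tbl, unsigned long comp)
{
    unsigned long rlen;

    assert(comp >= 0x100);              /* only the implicit > 0xff range */
    rlen = (unsigned long)tbl[0x100];
    if (rlen == 0)
        return (long)comp;              /* tr/....//c: char maps to itself */
    if (comp - 0x100 < rlen)
        return tbl[comp + 1];           /* paired with an extra char */
    return tbl[0x100 + rlen];           /* repeat char == last stored char */
}
```

Note that nothing in the table distinguishes a real single extra char from a faked one: both are stored as a length of 1 followed by one char.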
This commit changes it so that the repeat char is *always* stored (in slot
0x101), with the extra chars stored beginning at slot 0x102.
The 'extra chars' length value (located at slot 0x0100) has changed its
meaning slightly: now
-1 implies tr/....//c
0 implies no more replacement chars than search chars
1+ the number of excess replacement chars.
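A minimal sketch of a lookup under the new layout follows, modelled on the expressions this commit adds to S_do_trans_complex. The enum and function names are hypothetical, and the condition is written as `idx >= excess`, which is equivalent to the commit's `excess == 0 || excess < comp - 0xff` test for non-negative excess values.

```c
#include <assert.h>

/* Hypothetical names for the three kinds of extended-table slot. */
enum { TR_EXCESS = 0x100, TR_REPEAT = 0x101, TR_EXTRA = 0x102 };

/* Sketch of the NEW layout; new_lookup() is not a perl function.
 * Assumes comp is in 0x100..0x7fffffff. */
static long new_lookup(const short *tbl, unsigned long comp)
{
    long excess = tbl[TR_EXCESS];
    long idx;

    assert(comp >= 0x100);
    idx = (long)(comp - 0x100);         /* position in the implicit range */
    if (excess == -1)
        return (long)comp;              /* tr/....//c: char maps to itself */
    if (idx >= excess)
        return tbl[TR_REPEAT];          /* past the extra chars: repeat char */
    return tbl[TR_EXTRA + idx];         /* paired with an extra char */
}
```

For the tr/BCD\x{100}-\x{7fffffff}/bcdefg/ example, the table would hold 3 at 0x100, 'g' at 0x101 and "efg" from 0x102, so 0x100 maps to 'e' and anything from 0x103 upwards maps to 'g'.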
This should make no functional difference, but the extra information
stored will make it easier to fix some bugs shortly.
Diffstat (limited to 'doop.c')
-rw-r--r-- | doop.c | 24 |
1 files changed, 15 insertions, 9 deletions
@@ -217,7 +217,7 @@ S_do_trans_complex(pTHX_ SV * const sv)
     const I32 del = PL_op->op_private & OPpTRANS_DELETE;
     U8 *d;
     U8 *dstart;
-    STRLEN rlen = 0;
+    SSize_t excess = 0;
 
     if (grows)
         Newx(d, len*2+1, U8);
@@ -225,7 +225,9 @@ S_do_trans_complex(pTHX_ SV * const sv)
         d = s;
     dstart = d;
     if (complement && !del)
-        rlen = tbl[0x100];
+        /* number of replacement chars in excess of any 0x00..0xff
+         * search characters */
+        excess = (SSize_t)tbl[0x100];
 
     if (PL_op->op_private & OPpTRANS_SQUASH) {
         UV pch = 0xfeedface;
@@ -244,9 +246,10 @@ S_do_trans_complex(pTHX_ SV * const sv)
                     /* use the implicit 0x100..0x7fffffff search range */
                     matches++;
                     if (!del) {
-                        ch = (rlen == 0) ? (I32)comp :
-                            (comp - 0x100 < rlen) ?
-                                tbl[comp+1] : tbl[0x100+rlen];
+                        ch = (excess == -1) ? (I32)comp :
+                             (   excess == 0
+                              || excess < (IV)comp - 0xff) ? tbl[0x101]
+                                                           : tbl[comp+2];
                         if ((UV)ch != pch) {
                             d = uvchr_to_utf8(d, ch);
                             pch = (UV)ch;
@@ -290,10 +293,13 @@ S_do_trans_complex(pTHX_ SV * const sv)
                     /* use the implicit 0x100..0x7fffffff search range */
                     matches++;
                     if (!del) {
-                        if (comp - 0x100 < rlen)
-                            d = uvchr_to_utf8(d, tbl[comp+1]);
-                        else
-                            d = uvchr_to_utf8(d, tbl[0x100+rlen]);
+                        /* tr/...//c should call S_do_trans_count
+                         * instead */
+                        assert(excess != -1);
+                        ch = (   excess == 0
+                              || excess < (IV)comp - 0xff) ? tbl[0x101]
+                                                           : tbl[comp+2];
+                        d = uvchr_to_utf8(d, ch);
                     }
                 }
             }