author | Assaf Gordon <assafgordon@gmail.com> | 2016-06-22 20:34:01 -0400 |
---|---|---|
committer | Assaf Gordon <assafgordon@gmail.com> | 2016-07-04 14:26:09 -0400 |
commit | 0052daf1e5651c4772d75b595dfc714e0e20587e (patch) | |
tree | ffd184023a2f6396e776b9376dc81bd93e091778 /m4 | |
parent | 6ad4acfbd5f1f56e6a93db476f29d299a698150e (diff) | |
download | sed-0052daf1e5651c4772d75b595dfc714e0e20587e.tar.gz |
sed: fix minor multibyte parsing bug
Previously sed would parse multibyte characters incorrectly in two scenarios:
1. Slash following an incomplete-yet-valid multibyte sequence (match_slash):
$ LC_ALL=en_US.UTF-8 sed $'s/\316/X/'
sed: -e expression #1, char 6: unterminated `s' command
2. Open/close brackets as part of a valid multibyte string inside a character
class (snarf_char_class). In the example below, '\203]' is a valid
multibyte character in a SHIFT-JIS locale:
$ LC_ALL=ja_JP.shiftjis sed $'/[\203]/]/p'
sed: -e expression #1, char #5: Unmatched [ or [^
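To illustrate the general hazard behind scenario 2, the sketch below scans the bytes of the character class from the example above, once byte-wise and once skipping the trail byte of a two-byte character. It is a minimal illustration, not the code in snarf_char_class: the names sjis_is_lead_byte and find_class_end are hypothetical, the lead-byte ranges (0x81-0x9F, 0xE0-0xFC) are hard-coded as an assumption instead of being taken from the locale, and the POSIX rule that a ']' directly after '[' is literal is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: assumed Shift-JIS lead-byte ranges,
   hard-coded here instead of asking the locale.  */
static int
sjis_is_lead_byte (unsigned char b)
{
  return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Hypothetical helper: return the index of the ']' that closes the
   class opened by s[0] == '[', or -1 if none is found.  When MB_AWARE
   is zero every byte is inspected on its own; otherwise the byte
   after a lead byte is skipped as part of the same character.  */
static long
find_class_end (const unsigned char *s, size_t len, int mb_aware)
{
  assert (s != NULL && len > 0 && s[0] == '[');

  for (size_t i = 1; i < len; i++)
    {
      if (mb_aware && sjis_is_lead_byte (s[i]) && i + 1 < len)
        {
          i++;          /* skip the trail byte, even if it is 0x5D (']') */
          continue;
        }
      if (s[i] == ']')
        return (long) i;
    }
  return -1;
}
```

For the five bytes '[', 0x83, ']', '/', ']', the byte-wise scan stops at index 2 (the trail byte), while the multibyte-aware scan stops at index 4 (the real closing bracket).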
Both cases stem from mbcs.c:brlen() being non-intuitive:
it returned 1 for a valid single-byte character, for an invalid
multibyte character, and for the last byte of a valid multibyte sequence,
making it non-trivial to use correctly.
This commit replaces brlen() with a simpler is_mb_char() function, which
returns non-zero for bytes that are part of a multibyte sequence and zero
for single-byte characters and invalid sequences.
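The contract described above can be sketched with the standard mbrtowc(3) interface, feeding one byte at a time and keeping the conversion state between calls. This is an illustrative sketch under that assumption, not the function added to sed/mbcs.c; the name sketch_is_mb_char is hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical sketch: return non-zero if byte CH, seen with the
   conversion state in *STATE, belongs to a multibyte sequence;
   return zero for a single-byte character or an invalid sequence.  */
static int
sketch_is_mb_char (int ch, mbstate_t *state)
{
  assert (state != NULL);

  const char c = (char) ch;
  /* Non-zero if earlier bytes of a sequence are still pending.  */
  const int pending = !mbsinit (state);
  const size_t n = mbrtowc (NULL, &c, 1, state);

  if (n == (size_t) -2)
    return 1;           /* incomplete but so far valid sequence */

  if (n == (size_t) -1)
    {
      /* Invalid sequence: the state is unspecified, so reset it.  */
      memset (state, 0, sizeof *state);
      return 0;
    }

  /* A character was completed by this byte: it is multibyte only
     if earlier bytes were pending.  */
  return pending;
}
```

The caller is expected to zero-initialize the mbstate_t before the first byte. In a UTF-8 locale one would expect the byte '\316' to return non-zero and a following '/' to return zero (the pending sequence is invalidated and the state reset), which is the behaviour scenario 1 needs; in the "C" locale ASCII bytes return zero.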
* sed/sed.h: (BRLEN, brlen): Remove declarations.
(IS_MB_CHAR, is_mb_char): Add macro and function declaration.
* sed/mbcs.c: (brlen): Remove function. (is_mb_char): New function.
* sed/compile.c: (snarf_char_class, match_slash): Use IS_MB_CHAR instead of
BRLEN; Adjust local variables accordingly.
* testsuite/mb-match-slash.sh: New test for scenario 1.
* testsuite/mb-charclass-non-utf8.sh: New test for scenario 2,
requires SHIFT-JIS locale.
* testsuite/Makefile.am: Add new tests.
* testsuite/init.cfg: (require_ja_shiftjis_locale_): New function.
* NEWS: Mention bug fix.