author: Assaf Gordon <assafgordon@gmail.com> 2016-06-22 20:34:01 -0400
committer: Assaf Gordon <assafgordon@gmail.com> 2016-07-04 14:26:09 -0400
commit: 0052daf1e5651c4772d75b595dfc714e0e20587e
tree: ffd184023a2f6396e776b9376dc81bd93e091778 /m4
parent: 6ad4acfbd5f1f56e6a93db476f29d299a698150e
sed: fix minor multibyte parsing bug

Previously sed would parse multibyte characters incorrectly in two
scenarios:

1. A slash following an incomplete-yet-valid multibyte sequence
(match_slash):

    $ LC_ALL=en_US.UTF-8 sed $'s/\316/X/'
    sed: -e expression #1, char 6: unterminated `s' command

2. Open/close brackets that are part of a valid multibyte string inside
a character class (snarf_char_class). In the example below, '\203]' is
a valid multibyte character in the SHIFT-JIS locale:

    $ LC_ALL=ja_JP.shiftjis sed $'/[\203]/]/p'
    sed: -e expression #1, char #5: Unmatched [ or [^

Both cases stem from mbcs.c:brlen() being non-intuitive: it returned 1
for a valid single-byte character, for an invalid multibyte character,
and for the last byte of a valid multibyte sequence, which made it
non-trivial to use correctly.

This commit replaces brlen() with a simpler is_mb_char() function: it
returns non-zero for multibyte sequences and zero for single-byte or
invalid sequences.

* sed/sed.h: (BRLEN, brlen): Remove declaration.
(IS_MB_CHAR, is_mb_char): Add macro and function declaration.
* sed/mbcs.c: (brlen): Remove function.
(is_mb_char): New function.
* sed/compile.c: (snarf_char_class, match_slash): Use IS_MB_CHAR
instead of BRLEN; adjust local variables accordingly.
* testsuite/mb-match-slash.sh: New test for scenario 1.
* testsuite/mb-charclass-non-utf8.sh: New test for scenario 2;
requires a SHIFT-JIS locale.
* testsuite/Makefile.am: Add new tests.
* testsuite/init.cfg: (require_ja_shiftjis_locale_): New function.
* NEWS: Mention bug fix.
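
The contract described above (non-zero for any byte of a multibyte sequence, zero for single-byte or invalid input) can be illustrated with a minimal sketch built on the standard mbrtowc(3) function. This is not the code from sed/mbcs.c: the name is_mb_char_sketch is hypothetical, and two behaviours are assumptions made for illustration only, namely resetting the conversion state after an invalid sequence and treating a NUL byte like any other completed character.

```c
#include <assert.h>
#include <locale.h>   /* callers select the encoding with setlocale() */
#include <string.h>
#include <wchar.h>

/* Hypothetical sketch, not the function added by this commit.
   Feed one byte at a time; returns non-zero if the byte belongs to a
   multibyte sequence, zero if it is a single-byte or invalid character.
   The result depends on the LC_CTYPE locale currently in effect.  */
static int
is_mb_char_sketch (int ch, mbstate_t *cur_stat)
{
  assert (cur_stat != NULL);

  const char c = (char) ch;
  /* Non-zero if earlier bytes left a multibyte sequence unfinished.  */
  const int mb_pending = !mbsinit (cur_stat);
  const size_t result = mbrtowc (NULL, &c, 1, cur_stat);

  if (result == (size_t) -2)
    return 1;   /* first or middle byte of a still-incomplete sequence */

  if (result == (size_t) -1)
    {
      /* Invalid sequence: assumption, reset the state and report the
         byte as an ordinary single-byte character.  */
      memset (cur_stat, 0, sizeof *cur_stat);
      return 0;
    }

  /* A character was completed by this byte.  It is multibyte only if
     previous bytes were pending; otherwise it stood alone.  */
  return mb_pending;
}
```

With such a function, a caller scanning for the closing slash in scenario 1 would treat '/' as a delimiter only when the call returns zero for that byte. In a UTF-8 locale the byte \316 starts a sequence, the following '/' makes the sequence invalid, the state is reset, and the slash is reported as a plain single-byte character.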
Diffstat (limited to 'm4')
0 files changed, 0 insertions, 0 deletions