author | Norihiro Tanaka <noritnk@kcn.ne.jp> | 2014-11-18 13:36:42 +0900 |
---|---|---|
committer | Jim Meyering <meyering@fb.com> | 2014-11-20 14:52:33 -0800 |
commit | ff7e6edab11d6de3bfa33d426ab0f66eb23fa35a (patch) | |
tree | 9939b921338c0bd9e5abfcb8b5ee70b362a5b678 | |
parent | 83af04cef3139045f1a95a63b70b5935dbf857d8 (diff) | |
download | grep-ff7e6edab11d6de3bfa33d426ab0f66eb23fa35a.tar.gz |
grep -F could erroneously fail to match in non-UTF8 multibyte locales
This fixes a bug that can strike only when using a non-UTF8 multibyte
locale like ja_JP.SHIFT_JIS.
Consider this example: it would mistakenly fail to match before
this patch:
    printf '\203AA\n'|LC_ALL=ja_JP.SHIFT_JIS src/grep -F A
When searching for a single byte that happens to be the latter
byte of a multibyte character, and the target byte also follows
that multibyte character, grep -F would advance an internal pointer
by one byte too many, thus missing the target byte. A test case
for this bug is already included in tests/sjis-mb.
* src/kwsearch.c (Fexecute): Skip one byte less after matching the middle of a
multibyte character. Bug introduced by commit v2.18-119-gfb7d538.
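
The off-by-one described above can be reproduced with a small model of the search loop. This is only a sketch, not the code in src/kwsearch.c: toy_clen, toy_goback and toy_fsearch are hypothetical names, memchr stands in for kwsexec, and a hard-coded Shift-JIS lead-byte range replaces the locale-aware character-length logic that grep actually uses. The adjust parameter selects between the old assignment (beg = mb_start) and the new one (beg = mb_start - 1).

```c
#include <stddef.h>
#include <string.h>

/* Assumption: a byte in 0x81-0x9F or 0xE0-0xFC starts a 2-byte character.
   That approximates Shift-JIS; grep itself asks the C library instead.  */
static size_t
toy_clen (const char *buf, size_t size, size_t i)
{
  unsigned char c = (unsigned char) buf[i];
  int lead = (0x81 <= c && c <= 0x9F) || (0xE0 <= c && c <= 0xFC);
  return lead && size - i >= 2 ? 2 : 1;
}

/* Hypothetical stand-in for mb_goback: step over whole characters from
   *MB_START until reaching or passing CUR.  Return 0 if CUR is a character
   boundary, else 1.  Leave *MB_START at the first boundary at or after CUR.  */
static int
toy_goback (const char *buf, size_t size, size_t *mb_start, size_t cur)
{
  size_t p = *mb_start;
  while (p < cur)
    p += toy_clen (buf, size, p);
  *mb_start = p;
  return p != cur;
}

/* Search BUF for the single byte NEEDLE, rejecting hits that fall inside a
   multibyte character.  ADJUST is 0 for the pre-patch assignment and 1 for
   the patched one.  Return the match offset, or -1 if nothing matches.  */
static ptrdiff_t
toy_fsearch (const char *buf, size_t size, char needle, size_t adjust)
{
  size_t mb_start = 0;
  for (size_t beg = 0; beg < size; beg++)
    {
      const char *hit = memchr (buf + beg, needle, size - beg);
      if (!hit)
        return -1;
      size_t pos = (size_t) (hit - buf);
      if (toy_goback (buf, size, &mb_start, pos))
        {
          /* Restart at MB_START; the loop's beg++ still runs after continue.  */
          beg = mb_start - adjust;
          continue;
        }
      return (ptrdiff_t) pos;
    }
  return -1;
}
```

With the input from the example, toy_fsearch ("\203AA\n", 4, 'A', 0) returns -1, because the search resumes at offset 3 and never sees the 'A' at offset 2. With adjust set to 1 it returns 2.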
-rw-r--r-- | NEWS | 7 |
-rw-r--r-- | src/kwsearch.c | 17 |
2 files changed, 21 insertions, 3 deletions
@@ -48,6 +48,13 @@ GNU grep NEWS -*- outline -*-
   of a multibyte character when using a '^'-anchored alternate in a pattern,
   leading it to print non-matching lines. [bug present since "the beginning"]
 
+  grep -F Y no longer fails to match in non-UTF8 multibyte locales like
+  Shift-JIS, when the input contains a 2-byte character, XY, followed by
+  the single-byte search pattern, Y. grep would find the first, middle-
+  of-multibyte matching "Y", and then mistakenly advance an internal
+  pointer one byte too far, skipping over the target "Y" just after that.
+  [bug introduced in grep-2.19]
+
   grep -E rejected unmatched ')', instead of treating it like '\)'.
   [bug present since "the beginning"]
 
diff --git a/src/kwsearch.c b/src/kwsearch.c
index aa965f62..1335a269 100644
--- a/src/kwsearch.c
+++ b/src/kwsearch.c
@@ -133,9 +133,20 @@ Fexecute (char const *buf, size_t size, size_t *match_size,
       if (!match_lines && MB_CUR_MAX > 1 && !using_utf8 ()
           && mb_goback (&mb_start, beg + offset, buf + size) != 0)
         {
-          /* The match was a part of multibyte character, advance at least
-             one byte to ensure no infinite loop happens. */
-          beg = mb_start;
+          /* We have matched a single byte that is not at the beginning of a
+             multibyte character. mb_goback has advanced MB_START past that
+             multibyte character. Now, we want to position BEG so that the
+             next kwsexec search starts there. Thus, to compensate for the
+             for-loop's BEG++, above, subtract one here. This code is
+             unusually hard to reach, and exceptionally, let's show how to
+             trigger it here:
+
+               printf '\203AA\n'|LC_ALL=ja_JP.SHIFT_JIS src/grep -F A
+
+             That assumes the named locale is installed.
+             Note that your system's shift-JIS locale may have a different
+             name, possibly including "sjis". */
+          beg = mb_start - 1;
           continue;
         }
       beg += offset;