summaryrefslogtreecommitdiff
path: root/diff-lib.c
diff options
context:
space:
mode:
authorJonathan Nieder <jrnieder@gmail.com>2011-01-11 15:48:50 -0600
committerJunio C Hamano <gitster@pobox.com>2011-01-18 08:59:39 -0800
commit664d44ee7fb18bdfdd66a1be760c7ee1bbe911c6 (patch)
treee7a26f9f61b8f3f05702c491a7bb28325d07d265 /diff-lib.c
parent8d96e7288f2be9de6d09352dc445f18f89564500 (diff)
downloadgit-664d44ee7fb18bdfdd66a1be760c7ee1bbe911c6.tar.gz
userdiff: simplify word-diff safeguard
git's diff-words support has a detail that can be a little dangerous: any text not matched by a given language's tokenization pattern is treated as whitespace and changes in such text would go unnoticed. Therefore each of the built-in regexes allows a special token type consisting of a single non-whitespace character [^[:space:]]. To make sure UTF-8 sequences remain human readable, the builtin regexes also have a special token type for runs of bytes with the high bit set. In English, non-ASCII characters are usually isolated so this is analogous to the [^[:space:]] pattern, except it matches a single _multibyte_ character despite use of the C locale. Unfortunately it is easy to make typos or forget entirely to include these catch-all token types when adding support for new languages (see v1.7.3.5~16, userdiff: fix typo in ruby and python word regexes, 2010-12-18). Avoid this by including them automatically within the PATTERNS and IPATTERN macros. While at it, change the UTF-8 sequence token type to match exactly one non-ASCII multi-byte character, rather than an arbitrary run of them. Suggested-by: Thomas Rast <trast@student.ethz.ch> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Diffstat (limited to 'diff-lib.c')
0 files changed, 0 insertions, 0 deletions