diff options
author | Lorry Tar Creator <lorry-tar-importer@lorry> | 2016-04-22 04:38:07 +0000 |
---|---|---|
committer | Lorry Tar Creator <lorry-tar-importer@lorry> | 2016-04-22 04:38:07 +0000 |
commit | 28ef1abc10cfbc2c3d2747c008eb2300858d0426 (patch) | |
tree | 41208fb8f393e6cb6cc8f939623ad47a0db17876 /TODO | |
download | grep-tarball-28ef1abc10cfbc2c3d2747c008eb2300858d0426.tar.gz |
Diffstat (limited to 'TODO')
-rw-r--r-- | TODO | 338 |
1 files changed, 338 insertions, 0 deletions
@@ -0,0 +1,338 @@ + Copyright (C) 1992, 1997-2002, 2004-2016 Free Software Foundation, Inc. + + Copying and distribution of this file, with or without modification, + are permitted in any medium without royalty provided the copyright + notice and this notice are preserved. + +=============== +Short term work +=============== + +See where we are with UTF-8 performance. + +Merge Debian patches 55-bigfile.patch, 69-mbtowc.patch and +70-man_apostrophe.patch. Go through patches in Savannah. + +Cleanup of the grep(), grepdir(), recursion (the "main loop") to use fts. +Fix --directories=read. + +Write better Texinfo documentation for grep. The manual page would be a +good place to start, but Info documents are also supposed to contain a +tutorial and examples. + +Some test in tests/spencer2.tests should have failed! Need to filter out +some bugs in dfa.[ch]/regex.[ch]. + +Multithreading? + +GNU grep does 32-bit arithmetic, it needs to move to 64-bit (i.e. +size_t/ptrdiff_t). + +Lazy dynamic linking of libpcre. + +Check FreeBSD's integration of zgrep (-Z) and bzgrep (-J) in one +binary. Is there a possibility of doing even better by automatically +checking the magic of binary files ourselves (0x1F 0x8B for gzip, 0x1F +0x9D for compress, and 0x42 0x5A 0x68 for bzip2)? Once what to do with +libpcre is decided, do the same for libz and libbz2. + + +================== +Matching algorithms +================== + +Check <http://tony.abou-assaleh.net/greps.html>. Take a look at these +and consider opportunities for merging or cloning: + + -- ja-grep's mlb2 patch (Japanese grep) + <ftp://ftp.freebsd.org/pub/FreeBSD/ports/distfiles/grep-2.4.2-mlb2.patch.gz> + -- lgrep (from lv, a Powerful Multilingual File Viewer / Grep) + <http://www.ff.iij4u.or.jp/~nrt/lv/>; + -- cgrep (Context grep) <http://plg.uwaterloo.ca/~ftp/mt/cgrep/> + seems like nice work; + -- sgrep (Struct grep) <http://www.cs.helsinki.fi/u/jjaakkol/sgrep.html>; + -- agrep (Approximate grep) <http://www.tgries.de/agrep/>, + from glimpse; + -- nr-grep (Nondeterministic reverse grep) + <http://www.dcc.uchile.cl/~gnavarro/software/>; + -- ggrep (Grouse grep) <http://www.grouse.com.au/ggrep/>; + -- grep.py (Python grep) <http://www.vdesmedt.com/~vds2212/grep.html>; + -- freegrep <http://www.vocito.com/downloads/software/grep/>; + +Check some new algorithms for matching; talk to Karl Berry and Nelson. +Sunday's "Quick Search" Algorithm (CACM 33, 1990-08-08 pp. 132-142) +claim that his algorithm is faster than Boyer-More. Worth checking. + +Fix the DFA matcher to never use exponential space. (Fortunately, these +cases are rare.) + + +============================ +Standards: POSIX and Unicode +============================ + +For POSIX compliance, see p10003.x. Current support for the POSIX [= =] +and [. .] constructs is limited. This is difficult because it requires +locale-dependent details of the character set and collating sequence, +but POSIX does not standardize any method for accessing this information! + +For Unicode, interesting things to check include the Unicode Standard +<http://www.unicode.org/standard/standard.html> and the Unicode Technical +Standard #18 (<http://www.unicode.org/reports/tr18/> “Unicode Regular +Expressions”). Talk to Bruno Haible who's maintaining GNU libunistring. +See also Unicode Standard Annex #15 (<http://www.unicode.org/reports/tr15/> +“Unicode Normalization Forms”), already implemented by GNU libunistring. + +In particular, --ignore-case needs to be evaluated against the standards. +We may want to deviate from POSIX if Unicode provides better or clearer +semantics. + +POSIX and --ignore-case +----------------------- + +For this issue, interesting things to check in POSIX include the +Volume “Base Definitions (XBD)”, Chapter “Regular Expressions” and in +particular Section “Regular Expression General Requirements” and its +paragraph about caseless matching (note that this may not have been +fully thought through and that this text may be self-contradicting +[specifically: “of either data or patterns” versus all the rest]). + +In particular, consider the following with POSIX's approach to case +folding in mind. Assume a non-Turkic locale with a character +repertoire reduced to the following various forms of “LATIN LETTER I”: + +0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069; +0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049 +0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;\ + LATIN CAPITAL LETTER I DOT;;;0069; +0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049 + +First note the differing UTF-8 octet lengths of U+0049 (0x49) and +U+0069 (0x69) versus U+0130 (0xC4 0xB0) and U+0131 (0xC4 0xB1). This +implies that whole UTF-8 strings cannot be case-converted in place, +using the same memory buffer, and that the needed octet-size of the +new buffer cannot merely be guessed (although there's a simple upper +bound of six times the size of the input, as the longest UTF-8 +encoding of any character is six bytes). + +We have + +lc(I) = i, uc(I) = I +lc(i) = i, uc(i) = I +lc(İ) = i, uc(İ) = İ +lc(ı) = ı, uc(ı) = I + +where lc() and uc() denote lower-case and upper-case conversions. + +There are several candidate --ignore-case logics (including the one +mandated by POSIX): + +Using the + +if (lc(input_wchar) == lc(pattern_wchar)) + +logic leads to the following matches: + + \in I i İ ı +pat\ ---------- +"I" | Y Y Y n +"i" | Y Y Y n +"İ" | Y Y Y n +"ı" | n n n Y + +There is a lack of symmetry between CAPITAL and SMALL LETTERs with +this. + +Using the + +if (uc(input_wchar) == uc(pattern_wchar)) + +logic leads to the following matches: + + \in I i İ ı +pat\ ---------- +"I" | Y Y n Y +"i" | Y Y n Y +"İ" | n n Y n +"ı" | Y Y n Y + +There is a lack of symmetry between CAPITAL and SMALL LETTERs with +this. + +Using the + +if ( lc(input_wchar) == lc(pattern_wchar) +|| uc(input_wchar) == uc(pattern_wchar)) + +logic leads to the following matches: + + \in I i İ ı +pat\ ---------- +"I" | Y Y Y Y +"i" | Y Y Y Y +"İ" | Y Y Y n +"ı" | Y Y n Y + +There is some elegance and symmetry with this. But there are +potentially two conversions to be made per input character. If the +pattern is pre-converted, two copies of it need to be kept and used in +a mutually coherent fashion. + +Using the + +if ( input_wchar == pattern_wchar +|| lc(input_wchar) == pattern_wchar +|| uc(input_wchar) == pattern_wchar) + +logic (as mandated by POSIX) leads to the following matches: + + \in I i İ ı +pat\ ---------- +"I" | Y Y n Y +"i" | Y Y Y n +"İ" | n n Y n +"ı" | n n n Y + +There is a different CAPITAL/SMALL symmetry with this. But there's +also a loss of pattern/input symmetry that's unique to it. Also there +are potentially two conversions to be made per input character. + +Using the + +if (lc(uc(input_wchar)) == lc(uc(pattern_wchar))) + + +logic leads to the following matches: + + \in I i İ ı +pat\ ---------- +"I" | Y Y Y Y +"i" | Y Y Y Y +"İ" | Y Y Y Y +"ı" | Y Y Y Y + +This shows total symmetry and transitivity +(at least in this example analysis). +There are two conversions to be made per input character, +but support could be added for having +a single straight mapping performing +a composition of the two conversions. + +Any optimization in the implementation of each logic +must not change its basic semantic. + + +Unicode and --ignore-case +------------------------- + +For this issue, interesting things to check in Unicode include: + + -- The Unicode Standard, Chapter 3 + (<http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf> + “Conformance”), Section 3.13 (“Default Case Operations”) and the + toCasefold() case conversion operation. + + -- The Unicode Standard, Chapter 4 + (<http://www.unicode.org/versions/Unicode5.2.0/ch04.pdf> + “Character Properties”), Section 4.2 (“Case—Normative”) and + the <http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt> + SpecialCasing.txt and + <http://www.unicode.org/Public/UNIDATA/CaseFolding.txt> + CaseFolding.txt files from the + <http://www.unicode.org/Public/UNIDATA/UCD.html> Unicode + Character Database . + +The <http://www.unicode.org/standard/standard.html> Unicode Standard, +Chapter 5 (“<http://www.unicode.org/versions/Unicode5.2.0/ch05.pdf> +Implementation Guidelines ”), Section 5.18 (“Case Mappings”), +Subsection “Caseless Matching”. + +The Unicode <http://www.unicode.org/charts/case/> case charts. + +Unicode uses the + +if (toCasefold(input_wchar_string) == toCasefold(pattern_wchar_string)) + +logic for caseless matching. Let's consider the “LATIN LETTER I” +example mentioned above. In a non-Turkic locale, simple case folding +yields + +toCasefold_simple(U+0049) = U+0069 +toCasefold_simple(U+0069) = U+0069 +toCasefold_simple(U+0130) = U+0130 +toCasefold_simple(U+0131) = U+0131 + +which leads to the following matches: + + \in I i İ ı +pat\ ---------- +"I" | Y Y n n +"i" | Y Y n n +"İ" | n n Y n +"ı" | n n n Y + +This is different from anything so far! + +In a non-Turkic locale, full case folding yields + +toCasefold_full(U+0049) = U+0069 +toCasefold_full(U+0069) = U+0069 +toCasefold_full(U+0130) = <U+0069, U+0307> +toCasefold_full(U+0131) = U+0131 + +with + +0307;COMBINING DOT ABOVE;Mn;230;NSM;;;;;N;NON-SPACING DOT ABOVE;;;; + +which leads to the following matches: + + \in I i İ ı +pat\ ---------- +"I" | Y Y * n +"i" | Y Y * n +"İ" | n n Y n +"ı" | n n n Y + +This is just sad! + +Note that having toCasefold(U+0131), simple or full, map to itself +instead of U+0069 is in contradiction with the rules of Section 5.18 +of the Unicode Standard since toUpperCase(U+0131) is U+0049. Same +thing for toCasefold_simple(U+0130) since toLowerCase(U+0131) is +U+0069. The justification for the weird toCasefold_full(U+0130) +mapping is unknown; it doesn't even make sense to add a dot (U+0307) +to a letter that already has one (U+0069). It would have been so +simple to put them all in the same equivalence class! + +Otherwise, also consider the following problem with Unicode's approach +on case folding in mind. Assume that we want to perform + +echo 'AßBC | grep -i 'Sb' + +which corresponds to + +input: U+0041 U+00DF U+0042 U+0043 U+000A +pattern: U+0053 U+0062 + +Following “CaseFolding-4.1.0.txt”, applying the toCasefold() +transformation to these yields + +input: U+0061 U+0073 U+0073 U+0062 U+0063 U+000A +pattern: U+0073 U+0062 + +so, according to this approach, the input should match the pattern. As +long as the original input line is to be reported to the user as a +whole, there is no problem (from the user's point-of-view; +implementation is complicated by this). + +However, consider both these GNU extensions: + +echo 'AßBC' | grep -i --only-matching 'Sb' +echo 'AßBC' | grep -i --color=always 'Sb' + +What is to be reported in these cases, since the match begins in the +*middle* of the original input character 'ß'? + +Note that Unicode's toCasefold() cannot be implemented in terms of +POSIX' towctrans() since that can only return a single wint_t value +per input wint_t value. |