From 28ef1abc10cfbc2c3d2747c008eb2300858d0426 Mon Sep 17 00:00:00 2001 From: Lorry Tar Creator Date: Fri, 22 Apr 2016 04:38:07 +0000 Subject: grep-2.25 --- TODO | 338 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 338 insertions(+) create mode 100644 TODO (limited to 'TODO') diff --git a/TODO b/TODO new file mode 100644 index 0000000..cb48cc2 --- /dev/null +++ b/TODO @@ -0,0 +1,338 @@ + Copyright (C) 1992, 1997-2002, 2004-2016 Free Software Foundation, Inc. + + Copying and distribution of this file, with or without modification, + are permitted in any medium without royalty provided the copyright + notice and this notice are preserved. + +=============== +Short term work +=============== + +See where we are with UTF-8 performance. + +Merge Debian patches 55-bigfile.patch, 69-mbtowc.patch and +70-man_apostrophe.patch. Go through patches in Savannah. + +Cleanup of the grep(), grepdir(), recursion (the "main loop") to use fts. +Fix --directories=read. + +Write better Texinfo documentation for grep. The manual page would be a +good place to start, but Info documents are also supposed to contain a +tutorial and examples. + +Some test in tests/spencer2.tests should have failed! Need to filter out +some bugs in dfa.[ch]/regex.[ch]. + +Multithreading? + +GNU grep does 32-bit arithmetic, it needs to move to 64-bit (i.e. +size_t/ptrdiff_t). + +Lazy dynamic linking of libpcre. + +Check FreeBSD's integration of zgrep (-Z) and bzgrep (-J) in one +binary. Is there a possibility of doing even better by automatically +checking the magic of binary files ourselves (0x1F 0x8B for gzip, 0x1F +0x9D for compress, and 0x42 0x5A 0x68 for bzip2)? Once what to do with +libpcre is decided, do the same for libz and libbz2. + + +================== +Matching algorithms +================== + +Check . Take a look at these +and consider opportunities for merging or cloning: + + -- ja-grep's mlb2 patch (Japanese grep) + + -- lgrep (from lv, a Powerful Multilingual File Viewer / Grep) + ; + -- cgrep (Context grep) + seems like nice work; + -- sgrep (Struct grep) ; + -- agrep (Approximate grep) , + from glimpse; + -- nr-grep (Nondeterministic reverse grep) + ; + -- ggrep (Grouse grep) ; + -- grep.py (Python grep) ; + -- freegrep ; + +Check some new algorithms for matching; talk to Karl Berry and Nelson. +Sunday's "Quick Search" Algorithm (CACM 33, 1990-08-08 pp. 132-142) +claim that his algorithm is faster than Boyer-More. Worth checking. + +Fix the DFA matcher to never use exponential space. (Fortunately, these +cases are rare.) + + +============================ +Standards: POSIX and Unicode +============================ + +For POSIX compliance, see p10003.x. Current support for the POSIX [= =] +and [. .] constructs is limited. This is difficult because it requires +locale-dependent details of the character set and collating sequence, +but POSIX does not standardize any method for accessing this information! + +For Unicode, interesting things to check include the Unicode Standard + and the Unicode Technical +Standard #18 ( “Unicode Regular +Expressions”). Talk to Bruno Haible who's maintaining GNU libunistring. +See also Unicode Standard Annex #15 ( +“Unicode Normalization Forms”), already implemented by GNU libunistring. + +In particular, --ignore-case needs to be evaluated against the standards. +We may want to deviate from POSIX if Unicode provides better or clearer +semantics. + +POSIX and --ignore-case +----------------------- + +For this issue, interesting things to check in POSIX include the +Volume “Base Definitions (XBD)”, Chapter “Regular Expressions” and in +particular Section “Regular Expression General Requirements” and its +paragraph about caseless matching (note that this may not have been +fully thought through and that this text may be self-contradicting +[specifically: “of either data or patterns” versus all the rest]). + +In particular, consider the following with POSIX's approach to case +folding in mind. Assume a non-Turkic locale with a character +repertoire reduced to the following various forms of “LATIN LETTER I”: + +0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069; +0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049 +0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;\ + LATIN CAPITAL LETTER I DOT;;;0069; +0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049 + +First note the differing UTF-8 octet lengths of U+0049 (0x49) and +U+0069 (0x69) versus U+0130 (0xC4 0xB0) and U+0131 (0xC4 0xB1). This +implies that whole UTF-8 strings cannot be case-converted in place, +using the same memory buffer, and that the needed octet-size of the +new buffer cannot merely be guessed (although there's a simple upper +bound of six times the size of the input, as the longest UTF-8 +encoding of any character is six bytes). + +We have + +lc(I) = i, uc(I) = I +lc(i) = i, uc(i) = I +lc(İ) = i, uc(İ) = İ +lc(ı) = ı, uc(ı) = I + +where lc() and uc() denote lower-case and upper-case conversions. + +There are several candidate --ignore-case logics (including the one +mandated by POSIX): + +Using the + +if (lc(input_wchar) == lc(pattern_wchar)) + +logic leads to the following matches: + + \in I i İ ı +pat\ ---------- +"I" | Y Y Y n +"i" | Y Y Y n +"İ" | Y Y Y n +"ı" | n n n Y + +There is a lack of symmetry between CAPITAL and SMALL LETTERs with +this. + +Using the + +if (uc(input_wchar) == uc(pattern_wchar)) + +logic leads to the following matches: + + \in I i İ ı +pat\ ---------- +"I" | Y Y n Y +"i" | Y Y n Y +"İ" | n n Y n +"ı" | Y Y n Y + +There is a lack of symmetry between CAPITAL and SMALL LETTERs with +this. + +Using the + +if ( lc(input_wchar) == lc(pattern_wchar) +|| uc(input_wchar) == uc(pattern_wchar)) + +logic leads to the following matches: + + \in I i İ ı +pat\ ---------- +"I" | Y Y Y Y +"i" | Y Y Y Y +"İ" | Y Y Y n +"ı" | Y Y n Y + +There is some elegance and symmetry with this. But there are +potentially two conversions to be made per input character. If the +pattern is pre-converted, two copies of it need to be kept and used in +a mutually coherent fashion. + +Using the + +if ( input_wchar == pattern_wchar +|| lc(input_wchar) == pattern_wchar +|| uc(input_wchar) == pattern_wchar) + +logic (as mandated by POSIX) leads to the following matches: + + \in I i İ ı +pat\ ---------- +"I" | Y Y n Y +"i" | Y Y Y n +"İ" | n n Y n +"ı" | n n n Y + +There is a different CAPITAL/SMALL symmetry with this. But there's +also a loss of pattern/input symmetry that's unique to it. Also there +are potentially two conversions to be made per input character. + +Using the + +if (lc(uc(input_wchar)) == lc(uc(pattern_wchar))) + + +logic leads to the following matches: + + \in I i İ ı +pat\ ---------- +"I" | Y Y Y Y +"i" | Y Y Y Y +"İ" | Y Y Y Y +"ı" | Y Y Y Y + +This shows total symmetry and transitivity +(at least in this example analysis). +There are two conversions to be made per input character, +but support could be added for having +a single straight mapping performing +a composition of the two conversions. + +Any optimization in the implementation of each logic +must not change its basic semantic. + + +Unicode and --ignore-case +------------------------- + +For this issue, interesting things to check in Unicode include: + + -- The Unicode Standard, Chapter 3 + ( + “Conformance”), Section 3.13 (“Default Case Operations”) and the + toCasefold() case conversion operation. + + -- The Unicode Standard, Chapter 4 + ( + “Character Properties”), Section 4.2 (“Case—Normative”) and + the + SpecialCasing.txt and + + CaseFolding.txt files from the + Unicode + Character Database . + +The Unicode Standard, +Chapter 5 (“ +Implementation Guidelines ”), Section 5.18 (“Case Mappings”), +Subsection “Caseless Matching”. + +The Unicode case charts. + +Unicode uses the + +if (toCasefold(input_wchar_string) == toCasefold(pattern_wchar_string)) + +logic for caseless matching. Let's consider the “LATIN LETTER I” +example mentioned above. In a non-Turkic locale, simple case folding +yields + +toCasefold_simple(U+0049) = U+0069 +toCasefold_simple(U+0069) = U+0069 +toCasefold_simple(U+0130) = U+0130 +toCasefold_simple(U+0131) = U+0131 + +which leads to the following matches: + + \in I i İ ı +pat\ ---------- +"I" | Y Y n n +"i" | Y Y n n +"İ" | n n Y n +"ı" | n n n Y + +This is different from anything so far! + +In a non-Turkic locale, full case folding yields + +toCasefold_full(U+0049) = U+0069 +toCasefold_full(U+0069) = U+0069 +toCasefold_full(U+0130) = +toCasefold_full(U+0131) = U+0131 + +with + +0307;COMBINING DOT ABOVE;Mn;230;NSM;;;;;N;NON-SPACING DOT ABOVE;;;; + +which leads to the following matches: + + \in I i İ ı +pat\ ---------- +"I" | Y Y * n +"i" | Y Y * n +"İ" | n n Y n +"ı" | n n n Y + +This is just sad! + +Note that having toCasefold(U+0131), simple or full, map to itself +instead of U+0069 is in contradiction with the rules of Section 5.18 +of the Unicode Standard since toUpperCase(U+0131) is U+0049. Same +thing for toCasefold_simple(U+0130) since toLowerCase(U+0131) is +U+0069. The justification for the weird toCasefold_full(U+0130) +mapping is unknown; it doesn't even make sense to add a dot (U+0307) +to a letter that already has one (U+0069). It would have been so +simple to put them all in the same equivalence class! + +Otherwise, also consider the following problem with Unicode's approach +on case folding in mind. Assume that we want to perform + +echo 'AßBC | grep -i 'Sb' + +which corresponds to + +input: U+0041 U+00DF U+0042 U+0043 U+000A +pattern: U+0053 U+0062 + +Following “CaseFolding-4.1.0.txt”, applying the toCasefold() +transformation to these yields + +input: U+0061 U+0073 U+0073 U+0062 U+0063 U+000A +pattern: U+0073 U+0062 + +so, according to this approach, the input should match the pattern. As +long as the original input line is to be reported to the user as a +whole, there is no problem (from the user's point-of-view; +implementation is complicated by this). + +However, consider both these GNU extensions: + +echo 'AßBC' | grep -i --only-matching 'Sb' +echo 'AßBC' | grep -i --color=always 'Sb' + +What is to be reported in these cases, since the match begins in the +*middle* of the original input character 'ß'? + +Note that Unicode's toCasefold() cannot be implemented in terms of +POSIX' towctrans() since that can only return a single wint_t value +per input wint_t value. -- cgit v1.2.1