author     Bruno Haible <bruno@clisp.org>    2003-02-22 00:19:28 +0000
committer  Bruno Haible <bruno@clisp.org>    2003-02-22 00:19:28 +0000
commit     76575063ead694d3680ae1b3d85cda21b8fd1e8e (patch)
tree       bcc8544c42876899ba341575dd58898b4c522df2
parent     f1da37e04b52e2479b1dcf7570fb195f3bf2f024 (diff)
download   gperf-76575063ead694d3680ae1b3d85cda21b8fd1e8e.tar.gz
Implement backtracking.
-rw-r--r--  ChangeLog        22
-rw-r--r--  NEWS              3
-rw-r--r--  doc/gperf.texi   50
-rw-r--r--  src/output.cc    82
-rw-r--r--  src/search.cc   117
-rw-r--r--  src/search.h      2
6 files changed, 120 insertions, 156 deletions
diff --git a/ChangeLog b/ChangeLog
index ba7a397..fcd131d 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,25 @@
+2002-11-20 Bruno Haible <bruno@clisp.org>
+
+ Implement backtracking.
+ * src/search.h (Search::has_collisions): Renamed from
+ Search::less_collisions. Return a boolean.
+ * src/search.cc (Search::has_collisions): Renamed from
+ Search::less_collisions. Return a boolean.
+ (StackEntry): Remove field _collisions_so_far.
+ (Search::find_asso_values): Backtrack when encountering an unresolved
+ collision. Assume collisions_so_far is always zero.
+ (Search::optimize): Exit if there are accidental duplicates at the end.
+ * src/output.cc (Output::num_hash_values): Simply return the list
+ length.
+ (Output::output_keylength_table): Remove handling of accidental
+ duplicates.
+ (Output::output_keyword_table, Output::output_lookup_array): Likewise.
+ (output_switch_case, output_switches): Likewise.
+ * doc/gperf.texi (Algorithmic Details): Adjust description of options
+ -D, -f, -o, -r.
+ (Bugs): Remove note about missing backtracking.
+ (Projects): Likewise.
+
2002-11-19 Bruno Haible <bruno@clisp.org>
Prepare for backtracking.
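
[Editor's note: the new ChangeLog entry above is the heart of the patch: a
candidate _asso_values[] change is now accepted only if it leaves zero
collisions, and when no candidate works the search backs up to an earlier
keyword's choice point. A minimal C++ sketch of that search shape, using
hypothetical names (Choice, collides) rather than gperf's actual classes:

    #include <vector>

    // Hypothetical sketch of the search shape this patch adopts:
    // depth-first assignment with backtracking, where a choice is kept
    // only if it causes no collision among the keywords placed so far.
    struct Choice { int value; };

    // Returns true if keywords k..n-1 can all be placed collision-free;
    // otherwise leaves no commitment and lets the caller backtrack.
    static bool
    place (std::vector<Choice> &chosen, int k, int n,
           bool (*collides) (const std::vector<Choice> &, int))
    {
      if (k == n)
        return true;                    // every keyword placed
      for (int v = 0; v < 256; v++)     // bounded set of candidate values
        {
          chosen[k].value = v;
          if (!collides (chosen, k)     // zero collisions required now
              && place (chosen, k + 1, n, collides))
            return true;                // this choice leads to a solution
        }
      return false;                     // exhausted: caller backtracks
    }

The real implementation avoids recursion (see the StackEntry changes in
src/search.cc below) and enumerates candidate _asso_values[] via jumps or
rand() rather than a dense 0..255 scan.]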
diff --git a/NEWS b/NEWS
index c5b023f..cdfe72b 100644
--- a/NEWS
+++ b/NEWS
@@ -34,6 +34,9 @@ New in 2.8:
* Some keyword sets containing permutations, like { "xy", "yx", "xz", "zx" }
or { "abc", "acb", "bca", "cab" }, are now handled by gperf without
requiring the option -D.
+* When the search for a good hash function is not immediately successful,
+ backtracking is now used to continue the search. Earlier versions of gperf
+ bailed out with an "Internal error, duplicate hash code value".
* Bug fixes.
New in 2.7.2:
diff --git a/doc/gperf.texi b/doc/gperf.texi
index f47c706..f52bb04 100644
--- a/doc/gperf.texi
+++ b/doc/gperf.texi
@@ -7,7 +7,7 @@
@c some day we should @include version.texi instead of defining
@c these values at hand.
-@set UPDATED 16 November 2002
+@set UPDATED 20 November 2002
@set EDITION 2.7.2
@set VERSION 2.7.2
@c ---------------------
@@ -993,27 +993,14 @@ through a search that minimizes the number of byte positions.
@itemx --duplicates
@cindex Duplicates
Handle keywords whose selected byte sets hash to duplicate values.
-Duplicate hash values can occur for two reasons:
-
-@itemize @bullet
-@item
-Since @code{gperf} does not backtrack it is possible for it to process
-all your input keywords without finding a unique mapping for each word.
-However, frequently only a very small number of duplicates occur, and
-the majority of keywords still require one probe into the table. To
-overcome this problem, the option @samp{-m 50} should be used.
-
-@item
-Sometimes a set of keywords may have the same names, but possess different
-attributes. With the -D option @code{gperf} treats all these keywords as
+Duplicate hash values can occur if a set of keywords has the same names, but
+possesses different attributes, or if the selected byte positions are not well
+chosen. With the -D option @code{gperf} treats all these keywords as
part of an equivalence class and generates a perfect hash function with
multiple comparisons for duplicate keywords. It is up to you to completely
disambiguate the keywords by modifying the generated C code. However,
@code{gperf} helps you out by organizing the output.
-@end itemize
-Option @samp{-D} is extremely useful for certain large or highly
-redundant keyword sets, e.g., assembler instruction opcodes.
Using this option usually means that the generated hash function is no
longer perfect. On the other hand, it permits @code{gperf} to work on
keyword sets that it otherwise could not handle.
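
[Editor's note: as the rewritten -D text above says, keywords that hash
identically form an equivalence class and the generated function falls back
to multiple comparisons. A hedged illustration of what resolving such a
class amounts to — this is the shape of the idea, not gperf's actual
generated output:

    #include <string.h>

    /* Hypothetical bucket of keywords sharing one hash value under -D:
       the lookup must compare against each candidate in turn.  */
    static const char *
    lookup_in_bucket (const char *s, int len,
                      const char *const *candidates, int count)
    {
      for (int i = 0; i < count; i++)
        if (strlen (candidates[i]) == (size_t) len
            && memcmp (s, candidates[i], len) == 0)
          return candidates[i];   /* disambiguation is up to the user */
      return 0;
    }
]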
@@ -1025,7 +1012,7 @@ Generate the perfect hash function ``fast''. This decreases
table-size. The iteration amount represents the number of times to
iterate when resolving a collision. `0' means iterate by the number of
keywords. This option is probably most useful when used in conjunction
-with options @samp{-D} and/or @samp{-S} for @emph{large} keyword sets.
+with option @samp{-o} for @emph{large} keyword sets.
@item -m @var{iterations}
@itemx --multiple-iterations=@var{iterations}
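
[Editor's note: the -f hunk above gives `0' a special meaning. A one-line
sketch of that convention, with a hypothetical helper name:

    /* Hypothetical helper for -f's iteration amount, where 0 means
       "iterate by the number of keywords".  */
    static int
    effective_iterations (int option_iterations, int num_keywords)
    {
      return option_iterations != 0 ? option_iterations : num_keywords;
    }
]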
@@ -1067,7 +1054,7 @@ produce more minimal perfect hash functions. The reason for this is
that the reordering helps prune the search time by handling inevitable
collisions early in the search process. On the other hand, in practice,
a decreased search time also means a less minimal hash function, and a
-higher probability of duplicate hash values. Furthermore, if the
+higher frequency of backtracking. Furthermore, if the
number of keywords is @emph{very} large using @samp{-o} may
@emph{increase} @code{gperf}'s execution time, since collisions will
begin earlier and continue throughout the remainder of keyword
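
[Editor's note: the -o discussion above hinges on handling "inevitable
collisions early in the search process". A sketch of the kind of
frequency-based reordering it describes; the names (KW, occurrence_sum)
are illustrative assumptions, not gperf's actual fields:

    #include <algorithm>
    #include <vector>

    // Illustrative -o-style reordering: keywords whose selected bytes
    // occur most often across the whole set go first, so hard collisions
    // surface near the root of the search and failed branches are pruned
    // early.
    struct KW { const char *name; int occurrence_sum; };

    static void
    reorder (std::vector<KW> &keywords)
    {
      std::sort (keywords.begin (), keywords.end (),
                 [] (const KW &a, const KW &b)
                 { return a.occurrence_sum > b.occurrence_sum; });
    }
]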
@@ -1080,8 +1067,7 @@ Utilizes randomness to initialize the associated values table. This
frequently generates solutions faster than using deterministic
initialization (which starts all associated values at 0). Furthermore,
using the randomization option generally increases the size of the
-table. If @code{gperf} has difficultly with a certain keyword set try using
-@samp{-r} or @samp{-D}.
+table.
@item -s @var{size-multiple}
@itemx --size-multiple=@var{size-multiple}
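
[Editor's note: the -r text above describes random initialization of the
associated values table. A minimal sketch, assuming — as the search code
below does — that the maximum is a power of two, so masking keeps values
in range:

    #include <cstdlib>

    /* Sketch of -r-style initialization: start each associated value at
       a random point instead of 0.  asso_value_max is assumed to be a
       power of two.  */
    static void
    init_asso_values (int *asso_values, int alpha_size, int asso_value_max)
    {
      for (int i = 0; i < alpha_size; i++)
        asso_values[i] = std::rand () & (asso_value_max - 1);
    }
]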
@@ -1154,16 +1140,6 @@ work efficiently on much larger keyword sets (over 15,000 keywords).
When processing large keyword sets it helps greatly to have over 8 megs
of RAM.
-However, since @code{gperf} does not backtrack no guaranteed solution
-occurs on every run. On the other hand, it is usually easy to obtain a
-solution by varying the option parameters. In particular, try the
-@samp{-r} option, and also try changing the default arguments to the
-@samp{-s} and @samp{-j} options. To @emph{guarantee} a solution, use
-the @samp{-D} and @samp{-S} options, although the final results are not
-likely to be a @emph{perfect} hash function anymore! Finally, use the
-@samp{-f} option if you want @code{gperf} to generate the perfect hash
-function @emph{fast}, with less emphasis on making it minimal.
-
@item
The size of the generate static keyword array can get @emph{extremely}
large if the input keyword file is large or if the keywords are quite
@@ -1171,7 +1147,7 @@ similar. This tends to slow down the compilation of the generated C
code, and @emph{greatly} inflates the object code size. If this
situation occurs, consider using the @samp{-S} option to reduce data
size, potentially increasing keyword recognition time a negligible
-amount. Since many C compilers cannot correctly generated code for
+amount. Since many C compilers cannot correctly generate code for
large switch statements it is important to qualify the @var{-S} option
with an appropriate numerical argument that controls the number of
switch statements generated.
@@ -1193,18 +1169,10 @@ worthwhile improvements include:
@itemize @bullet
@item
-Make the algorithm more robust. At present, the program halts with an
-error diagnostic if it can't find a direct solution and the @samp{-D}
-option is not enabled. A more comprehensive, albeit computationally
-expensive, approach would employ backtracking or enable alternative
-options and retry. It's not clear how helpful this would be, in
-general, since most search sets are rather small in practice.
-
-@item
Another useful extension involves modifying the program to generate
``minimal'' perfect hash functions (under certain circumstances, the
current version can be rather extravagant in the generated table size).
-Again, this is mostly of theoretical interest, since a sparse table
+This is mostly of theoretical interest, since a sparse table
often produces faster lookups, and use of the @samp{-S} @code{switch}
option can minimize the data size, at the expense of slightly longer
lookups (note that the gcc compiler generally produces good code for
diff --git a/src/output.cc b/src/output.cc
index 9060e58..03b7777 100644
--- a/src/output.cc
+++ b/src/output.cc
@@ -75,10 +75,9 @@ static const char *char_to_index;
- Duplicates, i.e. keywords with the same _selchars set, are chained
through the _duplicate_link pointer. Only one representative per
duplicate equivalence class remains on the linear keyword list.
- - Still, accidental duplicates, i.e. keywords for which the _asso_values[]
- search couldn't achieve different hash values, can occur on the linear
- keyword list. After Search::sort(), we know that they form blocks of
- consecutive list elements.
+ - Accidental duplicates, i.e. keywords for which the _asso_values[] search
+ couldn't achieve different hash values, cannot occur on the linear
+ keyword list. Search::optimize would catch this mistake.
*/
Output::Output (KeywordExt_List *head, const char *struct_decl,
unsigned int struct_decl_lineno, const char *return_type,
@@ -134,20 +133,11 @@ Output::compute_min_max ()
int
Output::num_hash_values () const
{
- /* Since the list is already sorted by hash value we can count the
- different hash values in a single pass through the list. */
- int count = 1;
- KeywordExt_List *temp;
- int value;
-
- for (temp = _head, value = temp->first()->_hash_value; (temp = temp->rest()) != NULL; )
- {
- if (value != temp->first()->_hash_value)
- {
- value = temp->first()->_hash_value;
- count++;
- }
- }
+ /* Since the list is already sorted by hash value and doesn't contain
+ duplicates, we can simply count the number of keywords on the list. */
+ int count = 0;
+ for (KeywordExt_List *temp = _head; temp; temp = temp->rest())
+ count++;
return count;
}
@@ -667,9 +657,7 @@ Output::output_keylength_table () const
/* If generating a switch statement, and there is no user defined type,
we generate non-duplicates directly in the code. Only duplicates go
into the table. */
- if (option[SWITCH] && !option[TYPE]
- && !(temp->first()->_duplicate_link
- || (temp->rest() && temp->first()->_hash_value == temp->rest()->first()->_hash_value)))
+ if (option[SWITCH] && !option[TYPE] && !temp->first()->_duplicate_link)
continue;
if (index < temp->first()->_hash_value && !option[SWITCH] && !option[DUP])
@@ -789,9 +777,7 @@ Output::output_keyword_table () const
for (temp = _head, index = 0; temp; temp = temp->rest())
{
- if (option[SWITCH] && !option[TYPE]
- && !(temp->first()->_duplicate_link
- || (temp->rest() && temp->first()->_hash_value == temp->rest()->first()->_hash_value)))
+ if (option[SWITCH] && !option[TYPE] && !temp->first()->_duplicate_link)
continue;
if (index > 0)
@@ -865,34 +851,20 @@ Output::output_lookup_array () const
if (option[DEBUG])
fprintf (stderr, "keyword = %.*s, index = %d\n",
temp->first()->_allchars_length, temp->first()->_allchars, temp->first()->_final_index);
- if (temp->first()->_duplicate_link
- || (temp->rest() && hash_value == temp->rest()->first()->_hash_value))
+ if (temp->first()->_duplicate_link)
{
/* Start a duplicate entry. */
dup_ptr->hash_value = hash_value;
dup_ptr->index = temp->first()->_final_index;
dup_ptr->count = 1;
- for (;;)
+ for (KeywordExt *ptr = temp->first()->_duplicate_link; ptr; ptr = ptr->_duplicate_link)
{
- for (KeywordExt *ptr = temp->first()->_duplicate_link; ptr; ptr = ptr->_duplicate_link)
- {
- dup_ptr->count++;
- if (option[DEBUG])
- fprintf (stderr,
- "static linked keyword = %.*s, index = %d\n",
- ptr->_allchars_length, ptr->_allchars, ptr->_final_index);
- }
-
- if (!(temp->rest() && hash_value == temp->rest()->first()->_hash_value))
- break;
-
- temp = temp->rest();
-
dup_ptr->count++;
if (option[DEBUG])
- fprintf (stderr, "dynamic linked keyword = %.*s, index = %d\n",
- temp->first()->_allchars_length, temp->first()->_allchars, temp->first()->_final_index);
+ fprintf (stderr,
+ "static linked keyword = %.*s, index = %d\n",
+ ptr->_allchars_length, ptr->_allchars, ptr->_final_index);
}
assert (dup_ptr->count >= 2);
dup_ptr++;
@@ -1026,9 +998,7 @@ output_switch_case (KeywordExt_List *list, int indent, int *jumps_away)
printf ("%*s/* hash value = %4d, keyword = \"%.*s\" */\n",
indent, "", list->first()->_hash_value, list->first()->_allchars_length, list->first()->_allchars);
- if (option[DUP]
- && (list->first()->_duplicate_link
- || (list->rest() && list->first()->_hash_value == list->rest()->first()->_hash_value)))
+ if (option[DUP] && list->first()->_duplicate_link)
{
if (option[LENTABLE])
printf ("%*slengthptr = &lengthtable[%d];\n",
@@ -1037,13 +1007,8 @@ output_switch_case (KeywordExt_List *list, int indent, int *jumps_away)
indent, "", option.get_wordlist_name (), list->first()->_final_index);
int count = 0;
- for (KeywordExt_List *temp = list; ; temp = temp->rest())
- {
- for (KeywordExt *links = temp->first(); links; links = links->_duplicate_link)
- count++;
- if (!(temp->rest() && temp->first()->_hash_value == temp->rest()->first()->_hash_value))
- break;
- }
+ for (KeywordExt *links = list->first(); links; links = links->_duplicate_link)
+ count++;
printf ("%*swordendptr = wordptr + %d;\n"
"%*sgoto multicompare;\n",
@@ -1080,10 +1045,7 @@ output_switch_case (KeywordExt_List *list, int indent, int *jumps_away)
*jumps_away = 1;
}
- while (list->rest() && list->first()->_hash_value == list->rest()->first()->_hash_value)
- list = list->rest();
- list = list->rest();
- return list;
+ return list->rest();
}
/* Output a total of size cases, grouped into num_switches switch statements,
@@ -1105,11 +1067,7 @@ output_switches (KeywordExt_List *list, int num_switches, int size, int min_hash
KeywordExt_List *temp = list;
for (int count = size1; count > 0; count--)
- {
- while (temp->first()->_hash_value == temp->rest()->first()->_hash_value)
- temp = temp->rest();
- temp = temp->rest();
- }
+ temp = temp->rest();
printf ("%*sif (key < %d)\n"
"%*s {\n",
diff --git a/src/search.cc b/src/search.cc
index 7b6210f..9a1c9b3 100644
--- a/src/search.cc
+++ b/src/search.cc
@@ -891,16 +891,12 @@ Search::sort_by_occurrence (unsigned int *set, int len) const
}
}
-/* If the recomputed hash values for the keywords from _head->first() to
- curr - inclusive - give fewer than collision_bound collisions, this
- collision count is returned. Otherwise some value >= collision_bound
- is returned.
+/* Returns true if the recomputed hash values for the keywords from
+ _head->first() to curr - inclusive - give at least one collision.
This is called very frequently, and needs to be fast! */
-unsigned int
-Search::less_collisions (KeywordExt *curr, unsigned int collision_bound)
+bool
+Search::has_collisions (KeywordExt *curr)
{
- unsigned int collisions = 0;
-
/* Iteration Number array is a win, O(1) initialization time! */
_collision_detector->clear ();
@@ -911,12 +907,11 @@ Search::less_collisions (KeywordExt *curr, unsigned int collision_bound)
/* Compute new hash code for the keyword, and see whether it
collides with another keyword's hash code. If we have too
many collisions, we can safely abort the fruitless loop. */
- if (_collision_detector->set_bit (compute_hash (keyword))
- && ++collisions >= collision_bound)
- return collision_bound; /* >= collision_bound */
+ if (_collision_detector->set_bit (compute_hash (keyword)))
+ return true;
if (keyword == curr)
- return collisions; /* < collision_bound */
+ return false;
}
}
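
[Editor's note: the comment above ("Iteration Number array is a win, O(1)
initialization time!") refers to a classic trick: instead of memset-ing a
bit array on every call, each slot stores a generation stamp and "clearing"
just bumps the current generation. A self-contained sketch with
illustrative names — not gperf's actual Bool_Array interface:

    #include <vector>

    // Generation-stamped bit set: clear() is O(1) because it only bumps
    // the generation counter; a slot is "set" iff its stamp matches.
    class GenerationBitSet
    {
      std::vector<unsigned> _stamp;   // per-slot generation stamp
      unsigned _generation;           // stamp meaning "set in this round"
    public:
      explicit GenerationBitSet (unsigned size)
        : _stamp (size, 0), _generation (1) {}
      void clear () { _generation++; }   // O(1); ignores stamp wraparound
      bool set_bit (unsigned index)      // returns true if already set
      {
        if (_stamp[index] == _generation)
          return true;
        _stamp[index] = _generation;
        return false;
      }
    };

has_collisions then needs exactly one clear() per call: it hashes each
keyword up to curr and reports a collision as soon as set_bit returns
true.]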
@@ -945,9 +940,6 @@ Search::collision_prior_to (KeywordExt *curr)
we perform the processing without recursion, and simulate the stack. */
struct StackEntry
{
- /* The number of collisions so far. */
- unsigned int _collisions_so_far;
-
/* The current keyword. */
KeywordExt * _curr;
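
[Editor's note: the surrounding code simulates recursion with an explicit
stack of StackEntry frames; each frame records where the enumeration of
candidates left off, so popping a frame resumes the parent's loop. A
compilable sketch of that conversion, with a simplified, hypothetical
frame:

    #include <vector>

    struct Frame { int candidate; };   // per-depth resume point

    // Iterative depth-first search: returns true once items 0..n-1 all
    // have an acceptable candidate; ok(stack, depth) checks the partial
    // assignment (a caller-supplied stand-in for the collision test).
    static bool
    solve (int n, int num_candidates,
           bool (*ok) (const std::vector<Frame> &, int))
    {
      if (n == 0)
        return true;
      std::vector<Frame> stack (n, Frame { 0 });
      int depth = 0;
      for (;;)
        {
          if (stack[depth].candidate == num_candidates)
            {                                  // exhausted at this depth
              if (depth == 0)
                return false;                  // total failure: no solution
              stack[--depth].candidate++;      // backtrack: resume parent
            }
          else if (!ok (stack, depth))
            stack[depth].candidate++;          // try next candidate here
          else if (depth + 1 == n)
            return true;                       // everything placed
          else
            stack[++depth].candidate = 0;      // "recurse" one level down
        }
    }

find_asso_values does the same thing with goto labels (STARTOUTERLOOP,
RECURSE, BACKTRACK_*) and a richer frame; its "total failure" branch is the
new "Big failure" diagnostic further down in this patch.]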
@@ -1006,8 +998,6 @@ Search::find_asso_values ()
StackEntry *sp = &stack[0];
/* Local variables corresponding to *sp. */
- /* The number of collisions so far. */
- unsigned int collisions_so_far;
/* The current keyword. */
KeywordExt *curr;
/* The prior keyword, with which curr collides. */
@@ -1024,8 +1014,6 @@ Search::find_asso_values ()
/* Remaining number of iterations. */
int iter;
- collisions_so_far = 0;
-
STARTOUTERLOOP:
/* Next keyword from the list. */
@@ -1039,8 +1027,6 @@ Search::find_asso_values ()
if (prior != NULL)
{
- collisions_so_far++;
-
/* Handle collision: Attempt to change an _asso_value[], in order to
resolve a hash value collision between the two given keywords. */
@@ -1075,11 +1061,10 @@ Search::find_asso_values ()
/* Try various other values for _asso_values[c]. A value is
successful if, with it, the recomputed hash values for the
- keywords from _head->first() to curr - inclusive - give fewer
- than collisions_so_far collisions. Up to the given number of
- iterations are performed. If successful, _asso_values[c] is
- changed, collisions_so_far is decreased, and the recursion
- continued. If all iterations are unsuccessful, _asso_values[c]
+ keywords from _head->first() to curr - inclusive - give no
+ collisions. Up to the given number of iterations are performed.
+ If successful, _asso_values[c] is changed, and the recursion
+ continues. If all iterations are unsuccessful, _asso_values[c]
is restored and we backtrack, trying the next union_index. */
original_asso_value = _asso_values[c];
@@ -1092,11 +1077,8 @@ Search::find_asso_values ()
(_asso_values[c] + (_jump != 0 ? _jump : rand ()))
& (_asso_value_max - 1);
- unsigned int collisions =
- less_collisions (curr, collisions_so_far);
- if (collisions < collisions_so_far)
+ if (!has_collisions (curr))
{
- collisions_so_far = collisions;
/* Good, this _asso_values[] modification reduces the
number of collisions so far.
All keyword->_hash_value up to curr - inclusive -
@@ -1110,6 +1092,16 @@ Search::find_asso_values ()
fflush (stderr);
}
goto RECURSE;
+ BACKTRACK_COLLISION: ;
+ if (option[DEBUG])
+ {
+ fprintf (stderr, "back to collision on keyword #%d, prior = \"%.*s\", curr = \"%.*s\" hash = %d\n",
+ sp - stack + 1,
+ prior->_allchars_length, prior->_allchars,
+ curr->_allchars_length, curr->_allchars,
+ curr->_hash_value);
+ fflush (stderr);
+ }
}
}
@@ -1119,24 +1111,52 @@ Search::find_asso_values ()
/* Failed to resolve a collision. */
- /* Recompute all keyword->_hash_value up to curr - inclusive -. */
+ /* Recompute all keyword->_hash_value up to curr - exclusive -. */
for (KeywordExt_List *ptr = _head; ; ptr = ptr->rest())
{
KeywordExt* keyword = ptr->first();
- compute_hash (keyword);
if (keyword == curr)
break;
+ compute_hash (keyword);
}
if (option[DEBUG])
{
- fprintf (stderr, "** collision not resolved after %d iterations, %d duplicates remain, continuing...\n",
- iterations, collisions_so_far + _total_duplicates);
+ fprintf (stderr, "** collision not resolved after %d iterations, backtracking...\n",
+ iterations);
fflush (stderr);
}
+
+ BACKTRACK_NO_COLLISION:
+ if (sp != stack)
+ {
+ sp--;
+ curr = sp->_curr;
+ prior = sp->_prior;
+ union_set = sp->_union_set;
+ union_set_length = sp->_union_set_length;
+ union_index = sp->_union_index;
+ c = sp->_c;
+ original_asso_value = sp->_original_asso_value;
+ iter = sp->_iter;
+ if (prior != NULL)
+ goto BACKTRACK_COLLISION;
+ else
+ goto BACKTRACK_NO_COLLISION;
+ }
+
+ /* No solution found after an exhaustive search!
+ We should ideally turn off option[FAST] and, if that doesn't help,
+ multiply _asso_value_max by 2. */
+ fprintf (stderr,
+ "\nBig failure, always got duplicate hash code values.\n");
+ if (option[POSITIONS])
+ fprintf (stderr, "try options -m or -r, or use new key positions.\n\n");
+ else
+ fprintf (stderr, "try options -m or -r.\n\n");
+ exit (1);
}
RECURSE:
- sp->_collisions_so_far = collisions_so_far;
/*sp->_curr = curr;*/ // redundant
sp->_prior = prior;
/*sp->_union_set = union_set;*/ // redundant
@@ -1147,10 +1167,7 @@ Search::find_asso_values ()
sp->_iter = iter;
sp++;
if (sp - stack < _list_len)
- {
- /*collisions_so_far = sp[-1]._collisions_so_far;*/ // redundant
- goto STARTOUTERLOOP;
- }
+ goto STARTOUTERLOOP;
}
/* Deallocate stack. */
@@ -1285,19 +1302,15 @@ Search::optimize ()
unsigned int hashcode = compute_hash (curr);
if (_collision_detector->set_bit (hashcode))
{
- if (option[DUP]) /* Keep track of this number... */
- _total_duplicates++;
- else /* Yow, big problems. we're outta here! */
- {
- fprintf (stderr,
- "\nInternal error, duplicate hash code value %d:\n",
- hashcode);
- if (option[POSITIONS])
- fprintf (stderr, "try options -m or -D or -r, or use new key positions.\n\n");
- else
- fprintf (stderr, "try options -m or -D or -r.\n\n");
- exit (1);
- }
+ /* This shouldn't happen. proj1, proj2, proj3 must have been
+ computed to be injective on the given keyword set. */
+ fprintf (stderr,
+ "\nInternal error, unexpected duplicate hash code\n");
+ if (option[POSITIONS])
+ fprintf (stderr, "try options -m or -r, or use new key positions.\n\n");
+ else
+ fprintf (stderr, "try options -m or -r.\n\n");
+ exit (1);
}
}
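
[Editor's note: after this patch the final pass in Search::optimize is a
pure sanity check — the search only terminates once the hash is injective,
so a duplicate here means an internal bug. A minimal stand-alone version of
such a check (hypothetical helper, using std::set rather than gperf's
collision detector):

    #include <set>
    #include <vector>

    // Returns true iff every hash code is unique; after a successful
    // search this must hold, so a false result signals an internal error.
    static bool
    all_hashes_unique (const std::vector<unsigned> &hashes)
    {
      std::set<unsigned> seen;
      for (unsigned h : hashes)
        if (!seen.insert (h).second)
          return false;              // duplicate hash code
      return true;
    }
]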
diff --git a/src/search.h b/src/search.h
index afd1bfa..aefcd08 100644
--- a/src/search.h
+++ b/src/search.h
@@ -93,7 +93,7 @@ private:
/* Sorts the given set in increasing frequency of _occurrences[]. */
void sort_by_occurrence (unsigned int *set, int len) const;
- unsigned int less_collisions (KeywordExt *curr, unsigned int collision_bound);
+ bool has_collisions (KeywordExt *curr);
KeywordExt * collision_prior_to (KeywordExt *curr);