diff options
author | Thomas G. Lockhart <lockhart@fourpalms.org> | 1998-03-01 08:16:16 +0000 |
---|---|---|
committer | Thomas G. Lockhart <lockhart@fourpalms.org> | 1998-03-01 08:16:16 +0000 |
commit | c8cfb0cea88fec22f5aa0582fe846b46baf77eb1 (patch) | |
tree | eb009c5363c2d5d8f7d99e6264c2a282d4ad7e58 /doc/src/sgml/geqo.sgml | |
parent | 878531f1ac8a288fdd04c7c41ac2f02d2506bcb7 (diff) | |
download | postgresql-c8cfb0cea88fec22f5aa0582fe846b46baf77eb1.tar.gz |
SGML source for new documentation.
Diffstat (limited to 'doc/src/sgml/geqo.sgml')
-rw-r--r-- | doc/src/sgml/geqo.sgml | 228 |
1 files changed, 228 insertions, 0 deletions
diff --git a/doc/src/sgml/geqo.sgml b/doc/src/sgml/geqo.sgml new file mode 100644 index 0000000000..725504c28f --- /dev/null +++ b/doc/src/sgml/geqo.sgml @@ -0,0 +1,228 @@ +<Chapter> +<DocInfo> +<Author> +<FirstName>Martin</FirstName> +<SurName>Utesch</SurName> +</Author> +</DocInfo> + +<Title>Genetic Query Optimization in Database Systems</Title> + +<Para> +<ProgramListing> +<ULink url="utesch@aut.tu-freiberg.de">Martin Utesch</ULink> + + Institute of Automatic Control + University of Mining and Technology + Freiberg, Germany + + 02/10/1997 + + +1.) Query Handling as a Complex Optimization Problem +==================================================== + + Among all relational operators the most difficult one to process and +optimize is the JOIN. The number of alternative plans to answer a query +grows exponentially with the number of JOINs included in it. Further +optimization effort is caused by the support of a variety of *JOIN +methods* (e.g., nested loop, index scan, merge join in Postgres) to +process individual JOINs and a diversity of *indices* (e.g., r-tree, +b-tree, hash in Postgres) as access paths for relations. + + The current Postgres optimizer implementation performs a *near- +exhaustive search* over the space of alternative strategies. This query +optimization technique is inadequate to support database application +domains that involve the need for extensive queries, such as artificial +intelligence. + + The Institute of Automatic Control at the University of Mining and +Technology, in Freiberg, Germany, encountered the described problems as its +folks wanted to take the Postgres DBMS as the backend for a decision +support knowledge based system for the maintenance of an electrical +power grid. The DBMS needed to handle large JOIN queries for the +inference machine of the knowledge based system. + + Performance difficulties within exploring the space of possible query +plans arose the demand for a new optimization technique being developed. + + In the following we propose the implementation of a *Genetic +Algorithm* as an option for the database query optimization problem. + + +2.) Genetic Algorithms (GA) +=========================== + + The GA is a heuristic optimization method which operates through +determined, randomized search. The set of possible solutions for the +optimization problem is considered as a *population* of *individuals*. +The degree of adaption of an individual to its environment is specified +by its *fitness*. + + The coordinates of an individual in the search space are represented +by *chromosomes*, in essence a set of character strings. A *gene* is a +subsection of a chromosome which encodes the value of a single parameter +being optimized. Typical encodings for a gene could be *binary* or +*integer*. + + Through simulation of the evolutionary operations *recombination*, +*mutation*, and *selection* new generations of search points are found +that show a higher average fitness than their ancestors. + + According to the "comp.ai.genetic" FAQ it cannot be stressed too +strongly that a GA is not a pure random search for a solution to a +problem. A GA uses stochastic processes, but the result is distinctly +non-random (better than random). + +Structured Diagram of a GA: +--------------------------- + +P(t) generation of ancestors at a time t +P''(t) generation of descendants at a time t + ++=========================================+ +|>>>>>>>>>>> Algorithm GA <<<<<<<<<<<<<<| ++=========================================+ +| INITIALIZE t := 0 | ++=========================================+ +| INITIALIZE P(t) | ++=========================================+ +| evalute FITNESS of P(t) | ++=========================================+ +| while not STOPPING CRITERION do | +| +-------------------------------------+ +| | P'(t) := RECOMBINATION{P(t)} | +| +-------------------------------------+ +| | P''(t) := MUTATION{P'(t)} | +| +-------------------------------------+ +| | P(t+1) := SELECTION{P''(t) + P(t)} | +| +-------------------------------------+ +| | evalute FITNESS of P''(t) | +| +-------------------------------------+ +| | t := t + 1 | ++===+=====================================+ + + +3.) Genetic Query Optimization (GEQO) in PostgreSQL +=================================================== + + The GEQO module is intended for the solution of the query +optimization problem similar to a traveling salesman problem (TSP). +Possible query plans are encoded as integer strings. Each string +represents the JOIN order from one relation of the query to the next. +E. g., the query tree /\ + /\ 2 + /\ 3 + 4 1 is encoded by the integer string '4-1-3-2', +which means, first join relation '4' and '1', then '3', and +then '2', where 1, 2, 3, 4 are relids in PostgreSQL. + + Parts of the GEQO module are adapted from D. Whitley's Genitor +algorithm. + + Specific characteristics of the GEQO implementation in PostgreSQL +are: + +o usage of a *steady state* GA (replacement of the least fit + individuals in a population, not whole-generational replacement) + allows fast convergence towards improved query plans. This is + essential for query handling with reasonable time; + +o usage of *edge recombination crossover* which is especially suited + to keep edge losses low for the solution of the TSP by means of a GA; + +o mutation as genetic operator is deprecated so that no repair + mechanisms are needed to generate legal TSP tours. + + The GEQO module gives the following benefits to the PostgreSQL DBMS +compared to the Postgres query optimizer implementation: + +o handling of large JOIN queries through non-exhaustive search; + +o improved cost size approximation of query plans since no longer + plan merging is needed (the GEQO module evaluates the cost for a + query plan as an individual). + + +References +========== + +J. Heitk"otter, D. Beasley: +--------------------------- + "The Hitch-Hicker's Guide to Evolutionary Computation", + FAQ in 'comp.ai.genetic', + 'ftp://ftp.Germany.EU.net/pub/research/softcomp/EC/Welcome.html' + +Z. Fong: +-------- + "The Design and Implementation of the Postgres Query Optimizer", + file 'planner/Report.ps' in the 'postgres-papers' distribution + +R. Elmasri, S. Navathe: +----------------------- + "Fundamentals of Database Systems", + The Benjamin/Cummings Pub., Inc. + + +=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*= +* Things left to done for the PostgreSQL * += Genetic Query Optimization (GEQO) = +* module implementation * +=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*= +* Martin Utesch * Institute of Automatic Control * += = University of Mining and Technology = +* utesch@aut.tu-freiberg.de * Freiberg, Germany * +=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*= + + +1.) Basic Improvements +=============================================================== + +a) improve freeing of memory when query is already processed: +------------------------------------------------------------- +with large JOIN queries the computing time spent for the genetic query +optimization seems to be a mere *fraction* of the time Postgres +needs for freeing memory via routine 'MemoryContextFree', +file 'backend/utils/mmgr/mcxt.c'; +debugging showed that it get stucked in a loop of routine +'OrderedElemPop', file 'backend/utils/mmgr/oset.c'; +the same problems arise with long queries when using the normal +Postgres query optimization algorithm; + +b) improve genetic algorithm parameter settings: +------------------------------------------------ +file 'backend/optimizer/geqo/geqo_params.c', routines +'gimme_pool_size' and 'gimme_number_generations'; +we have to find a compromise for the parameter settings +to satisfy two competing demands: +1. optimality of the query plan +2. computing time + +c) find better solution for integer overflow: +--------------------------------------------- +file 'backend/optimizer/geqo/geqo_eval.c', routine +'geqo_joinrel_size'; +the present hack for MAXINT overflow is to set the Postgres integer +value of 'rel->size' to its logarithm; +modifications of 'struct Rel' in 'backend/nodes/relation.h' will +surely have severe impacts on the whole PostgreSQL implementation. + +d) find solution for exhausted memory: +-------------------------------------- +that may occur with more than 10 relations involved in a query, +file 'backend/optimizer/geqo/geqo_eval.c', routine +'gimme_tree' which is recursively called; +maybe I forgot something to be freed correctly, but I dunno what; +of course the 'rel' data structure of the JOIN keeps growing and +growing the more relations are packed into it; +suggestions are welcome :-( + + +2.) Further Improvements +=============================================================== +Enable bushy query tree processing within PostgreSQL; +that may improve the quality of query plans. + +</ProgramListing> +</Para> +</Chapter> |