summaryrefslogtreecommitdiff
path: root/doc/src/sgml/geqo.sgml
diff options
context:
space:
mode:
authorThomas G. Lockhart <lockhart@fourpalms.org>1998-03-01 08:16:16 +0000
committerThomas G. Lockhart <lockhart@fourpalms.org>1998-03-01 08:16:16 +0000
commitc8cfb0cea88fec22f5aa0582fe846b46baf77eb1 (patch)
treeeb009c5363c2d5d8f7d99e6264c2a282d4ad7e58 /doc/src/sgml/geqo.sgml
parent878531f1ac8a288fdd04c7c41ac2f02d2506bcb7 (diff)
downloadpostgresql-c8cfb0cea88fec22f5aa0582fe846b46baf77eb1.tar.gz
SGML source for new documentation.
Diffstat (limited to 'doc/src/sgml/geqo.sgml')
-rw-r--r--doc/src/sgml/geqo.sgml228
1 files changed, 228 insertions, 0 deletions
diff --git a/doc/src/sgml/geqo.sgml b/doc/src/sgml/geqo.sgml
new file mode 100644
index 0000000000..725504c28f
--- /dev/null
+++ b/doc/src/sgml/geqo.sgml
@@ -0,0 +1,228 @@
+<Chapter>
+<DocInfo>
+<Author>
+<FirstName>Martin</FirstName>
+<SurName>Utesch</SurName>
+</Author>
+</DocInfo>
+
+<Title>Genetic Query Optimization in Database Systems</Title>
+
+<Para>
+<ProgramListing>
+<ULink url="utesch@aut.tu-freiberg.de">Martin Utesch</ULink>
+
+ Institute of Automatic Control
+ University of Mining and Technology
+ Freiberg, Germany
+
+ 02/10/1997
+
+
+1.) Query Handling as a Complex Optimization Problem
+====================================================
+
+ Among all relational operators the most difficult one to process and
+optimize is the JOIN. The number of alternative plans to answer a query
+grows exponentially with the number of JOINs included in it. Further
+optimization effort is caused by the support of a variety of *JOIN
+methods* (e.g., nested loop, index scan, merge join in Postgres) to
+process individual JOINs and a diversity of *indices* (e.g., r-tree,
+b-tree, hash in Postgres) as access paths for relations.
+
+ The current Postgres optimizer implementation performs a *near-
+exhaustive search* over the space of alternative strategies. This query
+optimization technique is inadequate to support database application
+domains that involve the need for extensive queries, such as artificial
+intelligence.
+
+ The Institute of Automatic Control at the University of Mining and
+Technology, in Freiberg, Germany, encountered the described problems as its
+folks wanted to take the Postgres DBMS as the backend for a decision
+support knowledge based system for the maintenance of an electrical
+power grid. The DBMS needed to handle large JOIN queries for the
+inference machine of the knowledge based system.
+
+ Performance difficulties within exploring the space of possible query
+plans arose the demand for a new optimization technique being developed.
+
+ In the following we propose the implementation of a *Genetic
+Algorithm* as an option for the database query optimization problem.
+
+
+2.) Genetic Algorithms (GA)
+===========================
+
+ The GA is a heuristic optimization method which operates through
+determined, randomized search. The set of possible solutions for the
+optimization problem is considered as a *population* of *individuals*.
+The degree of adaption of an individual to its environment is specified
+by its *fitness*.
+
+ The coordinates of an individual in the search space are represented
+by *chromosomes*, in essence a set of character strings. A *gene* is a
+subsection of a chromosome which encodes the value of a single parameter
+being optimized. Typical encodings for a gene could be *binary* or
+*integer*.
+
+ Through simulation of the evolutionary operations *recombination*,
+*mutation*, and *selection* new generations of search points are found
+that show a higher average fitness than their ancestors.
+
+ According to the "comp.ai.genetic" FAQ it cannot be stressed too
+strongly that a GA is not a pure random search for a solution to a
+problem. A GA uses stochastic processes, but the result is distinctly
+non-random (better than random).
+
+Structured Diagram of a GA:
+---------------------------
+
+P(t) generation of ancestors at a time t
+P''(t) generation of descendants at a time t
+
++=========================================+
+|>>>>>>>>>>> Algorithm GA <<<<<<<<<<<<<<|
++=========================================+
+| INITIALIZE t := 0 |
++=========================================+
+| INITIALIZE P(t) |
++=========================================+
+| evalute FITNESS of P(t) |
++=========================================+
+| while not STOPPING CRITERION do |
+| +-------------------------------------+
+| | P'(t) := RECOMBINATION{P(t)} |
+| +-------------------------------------+
+| | P''(t) := MUTATION{P'(t)} |
+| +-------------------------------------+
+| | P(t+1) := SELECTION{P''(t) + P(t)} |
+| +-------------------------------------+
+| | evalute FITNESS of P''(t) |
+| +-------------------------------------+
+| | t := t + 1 |
++===+=====================================+
+
+
+3.) Genetic Query Optimization (GEQO) in PostgreSQL
+===================================================
+
+ The GEQO module is intended for the solution of the query
+optimization problem similar to a traveling salesman problem (TSP).
+Possible query plans are encoded as integer strings. Each string
+represents the JOIN order from one relation of the query to the next.
+E. g., the query tree /\
+ /\ 2
+ /\ 3
+ 4 1 is encoded by the integer string '4-1-3-2',
+which means, first join relation '4' and '1', then '3', and
+then '2', where 1, 2, 3, 4 are relids in PostgreSQL.
+
+ Parts of the GEQO module are adapted from D. Whitley's Genitor
+algorithm.
+
+ Specific characteristics of the GEQO implementation in PostgreSQL
+are:
+
+o usage of a *steady state* GA (replacement of the least fit
+ individuals in a population, not whole-generational replacement)
+ allows fast convergence towards improved query plans. This is
+ essential for query handling with reasonable time;
+
+o usage of *edge recombination crossover* which is especially suited
+ to keep edge losses low for the solution of the TSP by means of a GA;
+
+o mutation as genetic operator is deprecated so that no repair
+ mechanisms are needed to generate legal TSP tours.
+
+ The GEQO module gives the following benefits to the PostgreSQL DBMS
+compared to the Postgres query optimizer implementation:
+
+o handling of large JOIN queries through non-exhaustive search;
+
+o improved cost size approximation of query plans since no longer
+ plan merging is needed (the GEQO module evaluates the cost for a
+ query plan as an individual).
+
+
+References
+==========
+
+J. Heitk"otter, D. Beasley:
+---------------------------
+ "The Hitch-Hicker's Guide to Evolutionary Computation",
+ FAQ in 'comp.ai.genetic',
+ 'ftp://ftp.Germany.EU.net/pub/research/softcomp/EC/Welcome.html'
+
+Z. Fong:
+--------
+ "The Design and Implementation of the Postgres Query Optimizer",
+ file 'planner/Report.ps' in the 'postgres-papers' distribution
+
+R. Elmasri, S. Navathe:
+-----------------------
+ "Fundamentals of Database Systems",
+ The Benjamin/Cummings Pub., Inc.
+
+
+=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
+* Things left to done for the PostgreSQL *
+= Genetic Query Optimization (GEQO) =
+* module implementation *
+=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
+* Martin Utesch * Institute of Automatic Control *
+= = University of Mining and Technology =
+* utesch@aut.tu-freiberg.de * Freiberg, Germany *
+=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
+
+
+1.) Basic Improvements
+===============================================================
+
+a) improve freeing of memory when query is already processed:
+-------------------------------------------------------------
+with large JOIN queries the computing time spent for the genetic query
+optimization seems to be a mere *fraction* of the time Postgres
+needs for freeing memory via routine 'MemoryContextFree',
+file 'backend/utils/mmgr/mcxt.c';
+debugging showed that it get stucked in a loop of routine
+'OrderedElemPop', file 'backend/utils/mmgr/oset.c';
+the same problems arise with long queries when using the normal
+Postgres query optimization algorithm;
+
+b) improve genetic algorithm parameter settings:
+------------------------------------------------
+file 'backend/optimizer/geqo/geqo_params.c', routines
+'gimme_pool_size' and 'gimme_number_generations';
+we have to find a compromise for the parameter settings
+to satisfy two competing demands:
+1. optimality of the query plan
+2. computing time
+
+c) find better solution for integer overflow:
+---------------------------------------------
+file 'backend/optimizer/geqo/geqo_eval.c', routine
+'geqo_joinrel_size';
+the present hack for MAXINT overflow is to set the Postgres integer
+value of 'rel->size' to its logarithm;
+modifications of 'struct Rel' in 'backend/nodes/relation.h' will
+surely have severe impacts on the whole PostgreSQL implementation.
+
+d) find solution for exhausted memory:
+--------------------------------------
+that may occur with more than 10 relations involved in a query,
+file 'backend/optimizer/geqo/geqo_eval.c', routine
+'gimme_tree' which is recursively called;
+maybe I forgot something to be freed correctly, but I dunno what;
+of course the 'rel' data structure of the JOIN keeps growing and
+growing the more relations are packed into it;
+suggestions are welcome :-(
+
+
+2.) Further Improvements
+===============================================================
+Enable bushy query tree processing within PostgreSQL;
+that may improve the quality of query plans.
+
+</ProgramListing>
+</Para>
+</Chapter>