summaryrefslogtreecommitdiff
path: root/Docs/internals.texi
diff options
context:
space:
mode:
authorunknown <serg@serg.mysql.com>2002-04-19 13:36:16 +0000
committerunknown <serg@serg.mysql.com>2002-04-19 13:36:16 +0000
commit19013b30316de332144f9b8820276262d049dc1c (patch)
treeffbb4597baf09e29d29b267b6435b1c367d98c12 /Docs/internals.texi
parent8a20bc123ba9346138f4a6c96d4f66db22e6f1d4 (diff)
downloadmariadb-git-19013b30316de332144f9b8820276262d049dc1c.tar.gz
boolean fulltext search weighting scheme changed
Docs/internals.texi: fulltext chapter added Docs/manual.texi: news updated BitKeeper/etc/ignore: Added Docs/internals.info to the ignore list myisam/ft_boolean_search.c: weighting scheme changed
Diffstat (limited to 'Docs/internals.texi')
-rw-r--r--Docs/internals.texi45
1 files changed, 44 insertions, 1 deletions
diff --git a/Docs/internals.texi b/Docs/internals.texi
index 8f358982ded..871e51c50bd 100644
--- a/Docs/internals.texi
+++ b/Docs/internals.texi
@@ -57,6 +57,7 @@ This is a manual about @strong{MySQL} internals.
* mysys functions:: Functions In The @code{mysys} Library
* DBUG:: DBUG Tags To Use
* protocol:: MySQL Client/Server Protocol
+* Fulltext Search:: Fulltext Search in MySQL
@end menu
@@ -535,7 +536,7 @@ Print query.
@end table
-@node protocol, , DBUG, Top
+@node protocol, Fulltext Search, DBUG, Top
@chapter MySQL Client/Server Protocol
@menu
@@ -785,6 +786,48 @@ Date 03 0A 00 00 |01 0A |03 00 00 00
@c @printindex fn
+@node Fulltext Search, , protocol, Top
+@chapter Fulltext Search in MySQL
+
+Hopefully, sometime there will be complete description of
+fulltext search algorithms.
+Now it's just unsorted notes.
+
+@menu
+* Weighting in boolean mode::
+@end menu
+
+@node Weighting in boolean mode, , , Fulltext Search
+@section Weighting in boolean mode
+
+The basic idea is as follows: in expression
+@code{A or B or (C and D and E)}, either @code{A} or @code{B} alone
+is enough to match the whole expression. While @code{C},
+@code{D}, and @code{E} should @strong{all} match. So it's
+reasonable to assign weight 1 to @code{A}, @code{B}, and
+@code{(C and D and E)}. And @code{C}, @code{D}, and @code{E}
+should get a weight of 1/3.
+
+Things become more complicated when considering boolean
+operators, as used in MySQL FTB. Obvioulsy, @code{+A +B}
+should be treated as @code{A and B}, and @code{A B} -
+as @code{A or B}. The problem is, that @code{+A B} can @strong{not}
+be rewritten in and/or terms (that's the reason why this - extended -
+set of operators was chosen). Still, aproximations can be used.
+@code{+A B C} can be approximated as @code{A or (A and (B or C))}
+or as @code{A or (A and B) or (A and C) or (A and B and C)}.
+Applying the above logic (and omitting mathematical
+transformations and normalization) one gets that for
+@code{+A_1 +A_2 ... +A_N B_1 B_2 ... B_M} the weights
+should be: @code{A_i = 1/N}, @code{B_j=1} if @code{N==0}, and,
+otherwise, in the first rewritting approach @code{B_j = 1/3},
+and in the second one - @code{B_j = (1+(M-1)*2^M)/(M*(2^(M+1)-1))}.
+
+The second expression gives somewhat steeper increase in total
+weight as number of matched B's increases, because it assigns
+higher weights to individual B's. Also the first expression in
+much simplier. So it is the first one, that is implemented in MySQL.
+
@summarycontents
@contents