Diffstat (limited to 'Docs')
-rw-r--r-- | Docs/internals.texi | 45 |
-rw-r--r-- | Docs/manual.texi | 2 |
2 files changed, 46 insertions, 1 deletions
diff --git a/Docs/internals.texi b/Docs/internals.texi
index 8f358982ded..871e51c50bd 100644
--- a/Docs/internals.texi
+++ b/Docs/internals.texi
@@ -57,6 +57,7 @@ This is a manual about @strong{MySQL} internals.
 * mysys functions:: Functions In The @code{mysys} Library
 * DBUG:: DBUG Tags To Use
 * protocol:: MySQL Client/Server Protocol
+* Fulltext Search:: Fulltext Search in MySQL
 @end menu
@@ -535,7 +536,7 @@ Print query.
 @end table
-@node protocol, , DBUG, Top
+@node protocol, Fulltext Search, DBUG, Top
 @chapter MySQL Client/Server Protocol
 @menu
@@ -785,6 +786,48 @@ Date 03 0A 00 00 |01 0A |03 00 00 00
 @c @printindex fn
+@node Fulltext Search, , protocol, Top
+@chapter Fulltext Search in MySQL
+
+Hopefully, there will someday be a complete description of the
+fulltext search algorithms.
+For now, these are just unsorted notes.
+
+@menu
+* Weighting in boolean mode::
+@end menu
+
+@node Weighting in boolean mode, , , Fulltext Search
+@section Weighting in boolean mode
+
+The basic idea is as follows: in the expression
+@code{A or B or (C and D and E)}, either @code{A} or @code{B} alone
+is enough to match the whole expression, while @code{C},
+@code{D}, and @code{E} must @strong{all} match. So it is
+reasonable to assign weight 1 to @code{A}, @code{B}, and
+@code{(C and D and E)}. Then @code{C}, @code{D}, and @code{E}
+should each get a weight of 1/3.
+
+Things become more complicated when considering the boolean
+operators used in MySQL FTB. Obviously, @code{+A +B}
+should be treated as @code{A and B}, and @code{A B}
+as @code{A or B}. The problem is that @code{+A B} can @strong{not}
+be rewritten in and/or terms (that is the reason why this extended
+set of operators was chosen). Still, approximations can be used.
+@code{+A B C} can be approximated as @code{A or (A and (B or C))}
+or as @code{A or (A and B) or (A and C) or (A and B and C)}.
+Applying the above logic (and omitting mathematical
+transformations and normalization), one gets that for
+@code{+A_1 +A_2 ... +A_N B_1 B_2 ... B_M} the weights
+should be: @code{A_i = 1/N}, @code{B_j = 1} if @code{N==0}, and,
+otherwise, in the first rewriting approach @code{B_j = 1/3},
+and in the second one @code{B_j = (1+(M-1)*2^M)/(M*(2^(M+1)-1))}.
+
+The second expression gives a somewhat steeper increase in total
+weight as the number of matched B's increases, because it assigns
+higher weights to individual B's. Also, the first expression is
+much simpler. So it is the first one that is implemented in MySQL.
+
 @summarycontents
 @contents
diff --git a/Docs/manual.texi b/Docs/manual.texi
index a3cc6ffd799..47082b839ba 100644
--- a/Docs/manual.texi
+++ b/Docs/manual.texi
@@ -48933,6 +48933,8 @@ Our TODO section contains what we plan to have in 4.0. @xref{TODO MySQL 4.0}.
 @itemize @bullet
 @item
+Boolean fulltext search weighting scheme changed to something more reasonable.
+@item
 Fixed bug in boolean fulltext search, that caused MySQL to ignore queries of
 @code{ft_min_word_len} characters.
 @item
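
The weight formulas quoted in the "Weighting in boolean mode" notes can be checked numerically. The following is a minimal Python sketch, not the MySQL implementation: the function name `boolean_weights`, its parameters, and the choice to return `None` when a query has no `+A` or no `B` words are all assumptions made for illustration. It only evaluates the two expressions for `B_j` given in the notes, using exact fractions.

```python
from fractions import Fraction


def boolean_weights(n_required, m_optional, approach="first"):
    """Evaluate the per-word weights from the notes (hypothetical helper).

    n_required -- N, the number of +A_i words
    m_optional -- M, the number of plain B_j words
    approach   -- "first" or "second" rewriting approach
    Returns (weight of each A_i, weight of each B_j); an entry is None
    when the query contains no word of that kind (an assumption here).
    """
    if n_required < 0 or m_optional < 0:
        raise ValueError("word counts must be non-negative")
    if approach not in ("first", "second"):
        raise ValueError("approach must be 'first' or 'second'")

    # A_i = 1/N for every required word.
    a = Fraction(1, n_required) if n_required else None

    if m_optional == 0:
        b = None                      # no optional words to weight
    elif n_required == 0:
        b = Fraction(1)               # B_j = 1 if N == 0
    elif approach == "first":
        b = Fraction(1, 3)            # first rewriting approach
    else:
        m = m_optional                # second rewriting approach
        b = Fraction(1 + (m - 1) * 2**m, m * (2**(m + 1) - 1))
    return a, b


if __name__ == "__main__":
    # Compare both approaches for a query with two +A words.
    for m in (1, 2, 3):
        first = boolean_weights(2, m, "first")[1]
        second = boolean_weights(2, m, "second")[1]
        print(f"M={m}: first={first}, second={second}")
```

Under these formulas the two approaches coincide at 1/3 for M = 1, and the second gives 5/14 for M = 2 and 17/45 for M = 3, which matches the remark that it assigns higher weights to individual B's.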