diff options
Diffstat (limited to 'src/docs/huffman.dox')
-rw-r--r-- | src/docs/huffman.dox | 64 |
1 files changed, 64 insertions, 0 deletions
diff --git a/src/docs/huffman.dox b/src/docs/huffman.dox new file mode 100644 index 00000000000..e4693873d66 --- /dev/null +++ b/src/docs/huffman.dox @@ -0,0 +1,64 @@ +/*! @page huffman Huffman Encoding + +Keys in row-stores and variable-length values in either row- or +column-stores can be compressed with Huffman encoding. + +Huffman compression is maintained in memory as well as on disk, and can +increase the amount of usable data the cache can hold as well as +decrease the size of the data on disk. + +To configure Huffman encoding for the key in a row-store, specify \c +"huffman_key=english", \c "huffman_key=utf8<file>" or \c +"huffman_key=utf16<file>" in the configuration passed to \c +WT_SESSION::create. + +To configure Huffman encoding for a variable-length value in either a +row-store or a column-store, specify "huffman_value=english", \c +"huffman_value=utf8<file>" or \c "huffman_value=utf16<file>" in the +configuration passed to \c WT_SESSION::create. + +Setting Huffman encoding to \c "english" configures WiredTiger to use a +built-in English language frequency table. + +Setting Huffman encoding to \c "utf8<file>" or \c "utf16<file>" +configures WiredTiger to use a frequency table read from a file. The +frequency table file format is lines containing pairs of unsigned +integers separated by whitespace. The first integer is the symbol +value, the second integer is the frequency value. Symbol values may be +specified as hexadecimal numbers (with a leading \c "0x" prefix), or as +integers. For example, an English-language frequency table for the +characters '0' through '9' might look like this: + +@code +0x30 546233 +0x31 460946 +0x32 333499 +0x33 187606 +0x34 192528 +0x35 374413 +0x36 153865 +0x37 120094 +0x38 182627 +0x39 282364 +@endcode + +Frequency table symbol values must be unique. In the case of \c utf8 +files, symbol values must be in the range of 0 to 255. In the case of +\c utf16 files, symbol values must be in the range of 0 to 65,535. +Frequency values do not need to be unique, but must be in the range of +0 to the maximum 32-bit unsigned integer value (4,294,967,295), where +the lower a frequency value, the less likely the byte value is to occur. + +Any symbol values not listed in the frequency table are assumed to have +frequencies of 0. Input containing symbol values that did not appear +in the frequency table (or appeared in the frequency table, but with +frequency values of 0), are accepted, but will not compress as well as +if they are listed in the frequency table, with frequency values other +than 0. + +The English language frequency table is based on \c "Case-sensitive +letter and bigram frequency counts from large-scale English corpora", +by Michael N. Jones and D.J.K. Mewhort, modified to support space and +tab characters. + +*/ |