diff options
Diffstat (limited to 'src/third_party/wiredtiger/src/docs/huffman.dox')
-rw-r--r-- | src/third_party/wiredtiger/src/docs/huffman.dox | 66 |
1 files changed, 66 insertions, 0 deletions
diff --git a/src/third_party/wiredtiger/src/docs/huffman.dox b/src/third_party/wiredtiger/src/docs/huffman.dox new file mode 100644 index 00000000000..41e4be5ced3 --- /dev/null +++ b/src/third_party/wiredtiger/src/docs/huffman.dox @@ -0,0 +1,66 @@ +/*! @page huffman Huffman Encoding + +Keys in row-stores and variable-length values in either row- or +column-stores can be compressed with Huffman encoding. + +Huffman compression is maintained in memory as well as on disk, and can +increase the amount of usable data the cache can hold as well as +decrease the size of the data on disk. The additional CPU cost of +Huffman coding can be high, and should be considered. + +To configure Huffman encoding for the key in a row-store, specify \c +huffman_key=english, \c huffman_key=utf8<file> or \c +huffman_key=utf16<file> in the configuration passed to \c +WT_SESSION::create. + +To configure Huffman encoding for a variable-length value in either a +row-store or a column-store, specify \c huffman_value=english, \c +huffman_value=utf8<file> or \c huffman_value=utf16<file> in the +configuration passed to \c WT_SESSION::create. + +Setting Huffman encoding to \c english configures WiredTiger to use a +built-in English language frequency table. The English language +frequency table is based on \c "Case-sensitive letter and bigram +frequency counts from large-scale English corpora", by Michael N. Jones +and D.J.K. Mewhort, modified to support space and tab characters. + +Setting Huffman encoding to \c utf8<file> or \c utf16<file> configures +WiredTiger to use a frequency table read from a file. (Note: the \c < +and \c > characters are not literal, and should not appear in the +string.) + +The frequency table file format is lines containing pairs of unsigned +integers separated by whitespace. The first integer is the symbol +value, the second integer is the frequency value. Symbol values may be +specified as hexadecimal numbers (with a leading \c 0x prefix), or as +integers. For example, an English-language frequency table for the +characters \c 0 through \c 9 might look like this: + +@code +0x30 546233 +0x31 460946 +0x32 333499 +0x33 187606 +0x34 192528 +0x35 374413 +0x36 153865 +0x37 120094 +0x38 182627 +0x39 282364 +@endcode + +Frequency table symbol values must be unique. In the case of \c utf8 +files, symbol values must be in the range of 0 to 255. In the case of +\c utf16 files, symbol values must be in the range of 0 to 65,535. +Frequency values do not need to be unique, but must be in the range of +0 to the maximum 32-bit unsigned integer value (4,294,967,295), where +the lower a frequency value, the less likely the byte value is to occur. + +Any symbol values not listed in the frequency table are assumed to have +frequencies of 0. Input containing symbol values that did not appear +in the frequency table (or appeared in the frequency table, but with +frequency values of 0), are accepted, but will not compress as well as +if they are listed in the frequency table, with frequency values other +than 0. + +*/ |