path: root/src/third_party/wiredtiger/src/docs/huffman.dox
Diffstat (limited to 'src/third_party/wiredtiger/src/docs/huffman.dox')
-rw-r--r-- src/third_party/wiredtiger/src/docs/huffman.dox 66
1 files changed, 66 insertions, 0 deletions
diff --git a/src/third_party/wiredtiger/src/docs/huffman.dox b/src/third_party/wiredtiger/src/docs/huffman.dox
new file mode 100644
index 00000000000..41e4be5ced3
--- /dev/null
+++ b/src/third_party/wiredtiger/src/docs/huffman.dox
@@ -0,0 +1,66 @@
+/*! @page huffman Huffman Encoding
+
+Keys in row-stores and variable-length values in either row- or
+column-stores can be compressed with Huffman encoding.
+
+Huffman-encoded data stays compressed in memory as well as on disk, so
+it can increase the amount of usable data the cache holds and decrease
+the size of the data on disk. The additional CPU cost of Huffman coding
+can be high, and should be weighed against those space savings.
+
+To configure Huffman encoding for the key in a row-store, specify \c
+huffman_key=english, \c huffman_key=utf8<file> or \c
+huffman_key=utf16<file> in the configuration passed to \c
+WT_SESSION::create.
+
+To configure Huffman encoding for a variable-length value in either a
+row-store or a column-store, specify \c huffman_value=english, \c
+huffman_value=utf8<file> or \c huffman_value=utf16<file> in the
+configuration passed to \c WT_SESSION::create.
+
+Setting Huffman encoding to \c english configures WiredTiger to use a
+built-in English language frequency table. The English language
+frequency table is based on \c "Case-sensitive letter and bigram
+frequency counts from large-scale English corpora", by Michael N. Jones
+and D.J.K. Mewhort, modified to support space and tab characters.
+
+Setting Huffman encoding to \c utf8<file> or \c utf16<file> configures
+WiredTiger to use a frequency table read from a file. (Note: the \c <
+and \c > characters are not literal, and should not appear in the
+string.)
+
+Each line of the frequency table file contains a pair of unsigned
+integers separated by whitespace. The first integer is the symbol
+value and the second integer is the frequency value. Symbol values may
+be specified as hexadecimal numbers (with a leading \c 0x prefix) or as
+integers. For example, an English-language frequency table for the
+characters \c 0 through \c 9 might look like this:
+
+@code
+0x30 546233
+0x31 460946
+0x32 333499
+0x33 187606
+0x34 192528
+0x35 374413
+0x36 153865
+0x37 120094
+0x38 182627
+0x39 282364
+@endcode
+
+Frequency table symbol values must be unique. In \c utf8 files, symbol
+values must be in the range 0 to 255. In \c utf16 files, symbol values
+must be in the range 0 to 65,535. Frequency values do not need to be
+unique, but must be in the range 0 to the maximum 32-bit unsigned
+integer value (4,294,967,295); the lower a frequency value, the less
+likely the symbol is to occur.
+
+Any symbol values not listed in the frequency table are assumed to have
+frequencies of 0. Input containing symbol values that do not appear in
+the frequency table (or that appear in it with frequency values of 0)
+is accepted, but will not compress as well as it would if those symbol
+values were listed in the frequency table with frequency values other
+than 0.
+
+*/
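
The frequency table rules described in the new page (unique symbols, per-encoding symbol ranges, 32-bit frequencies) can be illustrated with a small script. This is a minimal sketch and not WiredTiger's own parser: the helper names `parse_frequency_table` and `build_utf8_table` are hypothetical, and skipping blank lines and clamping oversized counts are assumptions the page does not state.

```python
from collections import Counter

# Upper bounds taken from the documentation text in the diff above.
MAX_SYMBOL = {"utf8": 0xFF, "utf16": 0xFFFF}
MAX_FREQUENCY = 0xFFFFFFFF  # maximum 32-bit unsigned integer value


def parse_frequency_table(text, encoding="utf8"):
    """Validate frequency table text and return {symbol: frequency}.

    Hypothetical helper that mirrors the documented rules only.
    """
    max_symbol = MAX_SYMBOL[encoding]
    table = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected two integers")
        # Base 0 accepts both "0x30" and "48".
        symbol = int(fields[0], 0)
        frequency = int(fields[1], 0)
        if not 0 <= symbol <= max_symbol:
            raise ValueError(f"line {lineno}: symbol out of range")
        if not 0 <= frequency <= MAX_FREQUENCY:
            raise ValueError(f"line {lineno}: frequency out of range")
        if symbol in table:
            raise ValueError(f"line {lineno}: duplicate symbol")
        table[symbol] = frequency
    return table


def build_utf8_table(sample):
    """Count byte values in sample (bytes) and format them as table lines."""
    counts = Counter(sample)
    lines = []
    for symbol, count in sorted(counts.items()):
        # Assumption: clamp counts that would overflow the 32-bit limit.
        lines.append(f"{symbol:#04x} {min(count, MAX_FREQUENCY)}\n")
    return "".join(lines)
```

Text produced by `build_utf8_table` from representative sample data could be written to a file and checked with `parse_frequency_table` before that file is named in a \c huffman_key or \c huffman_value setting.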