summaryrefslogtreecommitdiff
path: root/src/docs/huffman.dox
diff options
context:
space:
mode:
Diffstat (limited to 'src/docs/huffman.dox')
-rw-r--r--src/docs/huffman.dox64
1 files changed, 64 insertions, 0 deletions
diff --git a/src/docs/huffman.dox b/src/docs/huffman.dox
new file mode 100644
index 00000000000..e4693873d66
--- /dev/null
+++ b/src/docs/huffman.dox
@@ -0,0 +1,64 @@
+/*! @page huffman Huffman Encoding
+
+Keys in row-stores and variable-length values in either row- or
+column-stores can be compressed with Huffman encoding.
+
+Huffman compression is maintained in memory as well as on disk, and can
+increase the amount of usable data the cache can hold as well as
+decrease the size of the data on disk.
+
+To configure Huffman encoding for the key in a row-store, specify \c
+"huffman_key=english", \c "huffman_key=utf8<file>" or \c
+"huffman_key=utf16<file>" in the configuration passed to \c
+WT_SESSION::create.
+
+To configure Huffman encoding for a variable-length value in either a
+row-store or a column-store, specify "huffman_value=english", \c
+"huffman_value=utf8<file>" or \c "huffman_value=utf16<file>" in the
+configuration passed to \c WT_SESSION::create.
+
+Setting Huffman encoding to \c "english" configures WiredTiger to use a
+built-in English language frequency table.
+
+Setting Huffman encoding to \c "utf8<file>" or \c "utf16<file>"
+configures WiredTiger to use a frequency table read from a file. The
+frequency table file format is lines containing pairs of unsigned
+integers separated by whitespace. The first integer is the symbol
+value, the second integer is the frequency value. Symbol values may be
+specified as hexadecimal numbers (with a leading \c "0x" prefix), or as
+integers. For example, an English-language frequency table for the
+characters '0' through '9' might look like this:
+
+@code
+0x30 546233
+0x31 460946
+0x32 333499
+0x33 187606
+0x34 192528
+0x35 374413
+0x36 153865
+0x37 120094
+0x38 182627
+0x39 282364
+@endcode
+
+Frequency table symbol values must be unique. In the case of \c utf8
+files, symbol values must be in the range of 0 to 255. In the case of
+\c utf16 files, symbol values must be in the range of 0 to 65,535.
+Frequency values do not need to be unique, but must be in the range of
+0 to the maximum 32-bit unsigned integer value (4,294,967,295), where
+the lower a frequency value, the less likely the byte value is to occur.
+
+Any symbol values not listed in the frequency table are assumed to have
+frequencies of 0. Input containing symbol values that did not appear
+in the frequency table (or appeared in the frequency table, but with
+frequency values of 0), are accepted, but will not compress as well as
+if they are listed in the frequency table, with frequency values other
+than 0.
+
+The English language frequency table is based on \c "Case-sensitive
+letter and bigram frequency counts from large-scale English corpora",
+by Michael N. Jones and D.J.K. Mewhort, modified to support space and
+tab characters.
+
+*/