/*! @page huffman Huffman Encoding

Variable-length values in either row- or column-stores can be
compressed with Huffman encoding.

Huffman-encoded values remain encoded in memory as well as on disk, so
encoding can increase the amount of usable data the cache can hold as
well as decrease the size of the data on disk.  The additional CPU cost
of Huffman encoding and decoding can be high, and should be weighed
against those savings.

To configure Huffman encoding for a variable-length value in either a
row-store or a column-store, specify \c huffman_value=english, \c
huffman_value=utf8<file> or \c huffman_value=utf16<file> in the
configuration passed to \c WT_SESSION::create.

Setting Huffman encoding to \c english configures WiredTiger to use a
built-in English language frequency table.  The English language
frequency table is based on \c "Case-sensitive letter and bigram
frequency counts from large-scale English corpora", by Michael N. Jones
and D.J.K. Mewhort, modified to support space and tab characters.

Setting Huffman encoding to \c utf8<file> or \c utf16<file> configures
WiredTiger to use a frequency table read from a file.  (Note: the \c <
and \c > characters are not literal, and should not appear in the
string.)
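
As a minimal sketch of the configuration strings described above, the
following C helper builds the \c huffman_value setting.  The helper name
\c build_huffman_config, the \c value_format=S setting and the file name
\c digits.freq are illustrative assumptions, not part of the WiredTiger
API; the resulting string is what would be passed as the configuration
argument to \c WT_SESSION::create.

```c
#include <stdio.h>
#include <string.h>

// Hypothetical helper (not part of WiredTiger): build a configuration
// string that enables Huffman encoding of values.  "encoding" is one of
// "english", "utf8" or "utf16"; "file" names the frequency table file for
// utf8/utf16 and is ignored for english.
// Returns 0 on success, -1 on bad arguments or if the buffer is too small.
static int
build_huffman_config(char *buf, size_t len, const char *encoding, const char *file)
{
    int n;

    if (strcmp(encoding, "english") == 0)
        n = snprintf(buf, len, "value_format=S,huffman_value=english");
    else if ((strcmp(encoding, "utf8") == 0 || strcmp(encoding, "utf16") == 0) &&
      file != NULL && file[0] != '\0')
        // The file name follows the prefix directly, with no < or >.
        n = snprintf(buf, len, "value_format=S,huffman_value=%s%s", encoding, file);
    else
        return (-1);

    return (n < 0 || (size_t)n >= len ? -1 : 0);
}
```

With \c encoding set to \c utf8 and \c file set to \c digits.freq, the
helper produces \c value_format=S,huffman_value=utf8digits.freq.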

Each line of the frequency table file contains a pair of unsigned
integers separated by whitespace.  The first integer is the symbol
value, the second integer is the frequency value.  Symbol values may be
specified as hexadecimal numbers (with a leading \c 0x prefix), or as
decimal integers.  For example, an English-language frequency table for
the characters \c 0 through \c 9 might look like this:

@code
0x30 546233
0x31 460946
0x32 333499
0x33 187606
0x34 192528
0x35 374413
0x36 153865
0x37 120094
0x38 182627
0x39 282364
@endcode

Frequency table symbol values must be unique.  In the case of \c utf8
files, symbol values must be in the range of 0 to 255.  In the case of
\c utf16 files, symbol values must be in the range of 0 to 65,535.
Frequency values do not need to be unique, but must be in the range of
0 to the maximum 32-bit unsigned integer value (4,294,967,295).  The
lower a frequency value, the less likely the symbol is to occur.
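
The rules above can be checked before a table is handed to WiredTiger.
The following is a sketch only, not WiredTiger's own parser: the names
\c check_frequency_table and \c parse_uint are hypothetical, the table is
read from an in-memory string rather than a file, and numbers without a
\c 0x prefix are assumed to be decimal.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical helper: parse one unsigned integer (0x-prefixed hex or
// decimal) no larger than max, advancing *pp past it.  Returns 0 or -1.
static int
parse_uint(const char **pp, unsigned long long max, unsigned long long *valuep)
{
    const char *p = *pp;
    char *end;
    int base;

    if (!isdigit((unsigned char)*p))    // rejects signs and empty fields
        return (-1);
    base = (p[0] == '0' && (p[1] == 'x' || p[1] == 'X')) ? 16 : 10;
    errno = 0;
    *valuep = strtoull(p, &end, base);
    if (errno != 0 || end == p || *valuep > max)
        return (-1);
    *pp = end;
    return (0);
}

// Hypothetical checker: validate frequency table text.  max_symbol is 255
// for utf8 tables and 65535 for utf16 tables.  Returns the number of
// entries, or -1 if any rule is violated.  Not reentrant (static buffer).
static long
check_frequency_table(const char *text, unsigned long max_symbol)
{
    static unsigned char seen[65536];   // one flag per possible symbol
    const char *p = text;
    unsigned long long symbol, freq;
    long entries = 0;

    if (max_symbol > 65535)
        return (-1);
    memset(seen, 0, sizeof(seen));

    for (;;) {
        while (isspace((unsigned char)*p))  // skip blank space between entries
            ++p;
        if (*p == '\0')
            return (entries);

        if (parse_uint(&p, max_symbol, &symbol) != 0)
            return (-1);
        if (*p != ' ' && *p != '\t')        // whitespace must separate the pair
            return (-1);
        while (*p == ' ' || *p == '\t')
            ++p;
        if (parse_uint(&p, UINT32_MAX, &freq) != 0)
            return (-1);
        while (*p == ' ' || *p == '\t' || *p == '\r')
            ++p;
        if (*p != '\n' && *p != '\0')       // nothing else may follow on the line
            return (-1);
        if (seen[symbol])                   // symbol values must be unique
            return (-1);
        seen[symbol] = 1;
        ++entries;
    }
}
```

For the ten-line digit table shown earlier, \c check_frequency_table
called with a \c max_symbol of 255 returns 10; a table that lists the
same symbol twice, for example as both \c 0x30 and \c 48, is rejected.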

Any symbol values not listed in the frequency table are assumed to have
frequencies of 0.  Input containing symbol values that do not appear in
the frequency table (or that appear in the frequency table, but with
frequency values of 0) is accepted, but will not compress as well as it
would if those symbol values were listed in the frequency table with
frequency values other than 0.

*/