path: root/src/docs/file-formats.dox
/*! @m_page{{c,java},file_formats,File formats and compression}

@section file_formats_formats File formats

WiredTiger supports two underlying file formats: row-store and
column-store, both of which are B+tree implementations of key/value stores.
WiredTiger also supports @ref lsm, implemented as a tree of B+trees.

In a row-store, both keys and data are variable-length byte strings.  In
a column-store, keys are 64-bit record numbers (key_format type 'r'),
and values are either variable- or fixed-length byte strings.

Generally, row-stores are faster for queries where all of the columns
are required by every lookup (because there's only a single set of
meta-data pages to read into the cache and search).  Column-stores are
faster when most queries require only a subset of the columns (because
columns can be separated into multiple files and only the columns being
returned need be present in the cache).

Row-store keys and values, and variable-length column-store values, can
be up to (4GB - 512B) in length.  Keys and values too large to fit on a
normal page are stored as overflow items in the file, and are likely to
require additional file I/O to access.

Fixed-length column-store values (value_format type 't') are limited
to 8 bits, so only values between 0 and 255 may be stored.
Additionally, there is no out-of-band fixed-length "deleted" value, and
deleting a value is the same as storing a value of 0.  For the same
reason, storing a value of 0 will cause cursor scans to skip the record.
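
The consequence of having no out-of-band "deleted" value can be sketched
with a plain byte array standing in for a fixed-length column-store.  This
is not WiredTiger code: \c fix_remove and \c fix_scan_count are hypothetical
helpers, and the array slots are zero-based indexes rather than record
numbers.  The sketch only illustrates that a stored 0 and a removed record
are indistinguishable.

```c
#include <stddef.h>
#include <stdint.h>

// Removing a record is modeled as storing 0 in its slot.
static void
fix_remove(uint8_t *store, size_t slot)
{
	store[slot] = 0;
}

// Count the slots a scan would visit: slots holding 0 are skipped,
// whether they were removed or explicitly set to 0.
static size_t
fix_scan_count(const uint8_t *store, size_t nslots)
{
	size_t i, visited;

	visited = 0;
	for (i = 0; i < nslots; ++i)
		if (store[i] != 0)
			++visited;
	return (visited);
}
```

An application that needs to distinguish "zero" from "absent" would have to
reserve some other value in the 0-255 range for one of the two meanings.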

WiredTiger does not support duplicate data items: there can be only a
single value associated with any given key, and applications are
responsible for creating unique key/value pairs.

WiredTiger allocates space from the underlying files in block units.
The minimum file allocation unit WiredTiger supports is 512B and the
maximum is 512MB.  File offsets are signed 8B values, so a file offset
can be as large as 2^63 - 1 bytes, far beyond any practical file size.
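
To make block-unit allocation concrete, the sketch below checks an
allocation unit against the limits stated above and rounds a byte count up
to a whole number of units.  The helper names are hypothetical, MB is
assumed to mean 2^20 bytes, and WiredTiger's actual allocator is not shown.

```c
#include <stdbool.h>
#include <stdint.h>

#define ALLOC_MIN UINT64_C(512)			// 512B
#define ALLOC_MAX (UINT64_C(512) << 20)		// 512MB, assuming 1MB = 2^20B

// Return true if unit lies within the documented range.
static bool
alloc_unit_in_range(uint64_t unit)
{
	return (unit >= ALLOC_MIN && unit <= ALLOC_MAX);
}

// Round len up to the next multiple of unit (unit must be non-zero).
static uint64_t
alloc_round_up(uint64_t len, uint64_t unit)
{
	return ((len + unit - 1) / unit * unit);
}
```

For example, with a 4096-byte unit, an item of 4097 bytes would occupy two
units (8192 bytes) in this model.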

@section file_formats_choice Choosing a file format

The row-store format is the default choice for most applications.
Column-store format may be a better choice if the primary key is a
record number, if there are advantages to storing columns in separate
files, or if the underlying data is a set of bits.

Both row- and column-store formats can maintain high volumes of writes,
but for data sets requiring sustained, extreme write throughput,
@ref lsm are usually a better choice.  For applications that do not
require extreme write throughput, row- or column-store is likely to be
a better choice because the read throughput is better than with LSM trees
(an effect that becomes more pronounced as additional read threads are
added).

Applications with complex schemas may also benefit from using multiple
storage formats, that is, using a combination of different formats in
the database, and even in individual tables (for example, a sparse, wide
table configured with a column-store primary, where indexes are stored
in an LSM tree).

Finally, because WiredTiger makes it easy to switch back and forth between
storage configurations, it is usually worthwhile to benchmark the candidate
configurations whenever there is any question.

@section file_formats_compression File formats and compression

Row-stores support four types of compression: key prefix compression,
dictionary compression, Huffman encoding and block compression.

- Key prefix compression reduces the size requirement of both in-memory
and on-disk objects by storing any identical key prefix only once per
page.

  The cost is additional CPU and memory when operating on the in-memory tree.
Specifically, sequential cursor movement through a prefix-compressed page in
reverse (but not forward) order, or the random lookup of a key/value pair will
allocate sufficient memory to hold some number of uncompressed keys.  So, for
example, if key prefix compression only saves a small number of bytes per key,
the additional memory cost of instantiating the uncompressed key may mean
prefix compression is not worthwhile.  Further, in cases where the
on-disk cost is the primary concern, block compression may mean prefix
compression is less useful.

  Applications may limit the use of prefix compression by configuring the
minimum number of bytes that must be gained before prefix compression is
used with the WT_SESSION::create method's \c prefix_compression_min
configuration string.

  Key prefix compression is disabled by default.

- Dictionary compression reduces the size requirement of both the
in-memory and on-disk objects by storing any identical value only once
per page.  The cost is minor additional CPU and memory use when writing
pages to disk.

  Dictionary compression is disabled by default.

- Huffman encoding reduces the size requirement of both the in-memory
and on-disk objects by compressing individual key/value items, and can
be configured separately for keys, values, or both.  The cost is
additional CPU and memory use when searching the in-memory tree (if keys
are encoded), and additional CPU and memory use when returning values
from the in-memory tree and when writing pages to disk.  Note the
additional CPU cost of Huffman encoding can be high, and should be
considered.  (See @subpage_single huffman for details.)

  Huffman encoding is disabled by default.

- Block compression reduces the size requirement of on-disk objects by
compressing blocks of the backing object's file.  The cost is additional
CPU and memory use when reading and writing pages to disk.  Note the
additional CPU cost of block compression can be high, and should be
considered.   (See @x_ref compression_formats for details.)

  Block compression is disabled by default.
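
The idea behind key prefix compression can be sketched as follows: each key
is compared with the key before it, and the leading bytes they share need
not be stored again.  This is a simplified illustration, not WiredTiger's
on-page format.  The names \c prefix_len and \c prefix_savings are
hypothetical, keys are assumed to be NUL-terminated strings, and the
\c min_gain parameter only mimics the role of \c prefix_compression_min.

```c
#include <stddef.h>

// Number of leading bytes shared by two NUL-terminated keys.
static size_t
prefix_len(const char *prev, const char *key)
{
	size_t n;

	for (n = 0; prev[n] != '\0' && prev[n] == key[n]; ++n)
		;
	return (n);
}

// Total bytes saved over a sorted key list when a shared prefix is
// dropped only if it is at least min_gain bytes long.  The bytes
// needed to record each prefix length are ignored here.
static size_t
prefix_savings(const char **keys, size_t nkeys, size_t min_gain)
{
	size_t i, n, saved;

	saved = 0;
	for (i = 1; i < nkeys; ++i) {
		n = prefix_len(keys[i - 1], keys[i]);
		if (n >= min_gain)
			saved += n;
	}
	return (saved);
}
```

Raising \c min_gain in this model leaves short shared prefixes
uncompressed, which corresponds to the trade-off described above: small
per-key savings may not justify the cost of rebuilding full keys in memory.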

Column-stores with variable-length byte string values support four
types of compression: run-length encoding, dictionary compression,
Huffman encoding and block compression.

- Run-length encoding reduces the size requirement of both the in-memory
and on-disk objects by storing sequential, duplicate values in the store
only a single time (with an associated count).  The cost is minor
additional CPU and memory use when returning values from the in-memory
tree and when writing pages to disk.

  Run-length encoding is always enabled and cannot be turned off.

- Dictionary compression reduces the size requirement of both the
in-memory and on-disk objects by storing any identical value only once
per page.  The cost is minor additional CPU and memory use when
returning values from the in-memory tree and when writing pages to disk.

  Dictionary compression is disabled by default.

- Huffman encoding reduces the size requirement of both the in-memory
and on-disk objects by compressing individual value items.  The cost is
additional CPU and memory use when returning values from the in-memory
tree and when writing pages to disk.  Note the additional CPU cost of
Huffman encoding can be high, and should be considered.
(See @ref_single huffman for details.)

  Huffman encoding is disabled by default.

- Block compression reduces the size requirement of on-disk objects by
compressing blocks of the backing object's file.  The cost is additional
CPU and memory use when reading and writing pages to disk.  Note the
additional CPU cost of block compression can be high, and should be
considered.   (See @x_ref compression_formats for details.)

  Block compression is disabled by default.
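
Run-length encoding, as described above, collapses a run of identical
consecutive values into a single value plus a repeat count.  The sketch
below illustrates the technique on an array of one-byte values for brevity;
it is not WiredTiger's cell format, and \c rle_pair and \c rle_encode are
hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

struct rle_pair {
	uint8_t value;		// the repeated value
	uint64_t count;		// how many consecutive times it appears
};

// Encode in[0..n) into out, which must have room for n pairs.
// Returns the number of pairs written.
static size_t
rle_encode(const uint8_t *in, size_t n, struct rle_pair *out)
{
	size_t i, npairs;

	npairs = 0;
	for (i = 0; i < n; ++i) {
		if (npairs > 0 && out[npairs - 1].value == in[i])
			++out[npairs - 1].count;
		else {
			out[npairs].value = in[i];
			out[npairs].count = 1;
			++npairs;
		}
	}
	return (npairs);
}
```

Only adjacent duplicates are merged in this sketch, so the benefit depends
on equal values being stored under consecutive record numbers.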

Column-stores with fixed-length byte values support a single type of
compression: block compression.

- Block compression reduces the size requirement of on-disk objects by
compressing blocks of the backing object's file.  The cost is additional
CPU and memory use when reading and writing pages to disk.  Note the
additional CPU cost of block compression can be high, and should be
considered.   (See @x_ref compression_formats for details.)

  Block compression is disabled by default.
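
As an illustration of how these choices might be expressed, the sketch
below builds a configuration string for a fixed-length column-store with
optional block compression.  The \c key_format, \c value_format and
\c block_compressor keys are WT_SESSION::create configuration strings; the
helper \c fix_colstore_config is hypothetical, and the example assumes a
compressor with the given name (for instance "snappy") has already been
loaded into the connection.

```c
#include <stdio.h>

// Build "key_format=r,value_format=8t[,block_compressor=NAME]" in buf.
// Returns 0 on success, -1 if buf is too small or formatting fails.
static int
fix_colstore_config(char *buf, size_t len, const char *compressor)
{
	int n;

	if (compressor == NULL)
		n = snprintf(buf, len, "key_format=r,value_format=8t");
	else
		n = snprintf(buf, len,
		    "key_format=r,value_format=8t,block_compressor=%s",
		    compressor);
	return (n < 0 || (size_t)n >= len ? -1 : 0);
}
```

The resulting string would be passed as the configuration argument to
WT_SESSION::create; passing \c NULL for the compressor name leaves block
compression at its default, disabled.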

*/