src/third_party/wiredtiger/src/docs/tune-page-size-and-comp.dox


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412

/*! @page tune_page_size_and_comp Tuning page size and compression

This document aims to explain the role played by different page sizes in
WiredTiger. It also details motivation behind an application wanting to modify
these page sizes from their default values and the procedure to do so.
Applications commonly configure page sizes based on their workload's typical key
and value size. Once a page size has been chosen, appropriate defaults for the
other configuration values are derived by WiredTiger from the page sizes, and
relatively few applications will need to modify the other page and key/value
size configuration options. WiredTiger also offers several compression options
that have an impact on the size of the data both in-memory and on-disk. Hence
while selecting page sizes, an application must also look at its desired
compression needs. Since the data and workload for a table differs from one
table to another in the database, an application can choose to set page sizes
and compression options on a per-table basis.

@section data_life_cycle Data life cycle
Before detailing each page size, here is a review of how data gets stored inside
WiredTiger:
 - WiredTiger uses the physical disks to store data durably, creating on-disk
files for the tables in the database directory. It also caches the portion of
the table being currently accessed by the application for reading or writing in
main memory.
 - WiredTiger maintains a table's data in memory using a data structure called a
<a href="https://en.wikipedia.org/wiki/B-tree">B-Tree</a> (
<a href="https://en.wikipedia.org/wiki/B%2B_tree">B+ Tree</a> to be specific),
referring to the nodes of a B-Tree as pages. Internal pages carry only keys. The
leaf pages store both keys and values.
 - The format of the in-memory pages is not the same as the format of the
on-disk pages.  Therefore, the in-memory pages regularly go through a process
called reconciliation to create data structures appropriate for storage on the
disk. These data structures are referred to as on-disk pages. An application can
set a maximum size separately for the internal and leaf on-disk pages otherwise
WiredTiger uses a default value. If reconciliation of an in-memory page is
leading to an on-disk page size greater than this maximum, WiredTiger creates
multiple smaller on-disk pages.
 - A component of WiredTiger called the Block Manager divides the on-disk pages
into smaller chunks called blocks, which then get written to the disk. The size
of these blocks is defined by a parameter called allocation_size, which is the
underlying unit of allocation for the file the data gets stored in.  An
application might choose to have data compressed before it gets stored to disk
by enabling block compression.
 - A database's tables are usually much larger than the main memory available.
Not all of the data can be kept in memory at any given time. A process called
eviction takes care of making space for new data by freeing the memory of data
infrequently accessed. An eviction server regularly finds in-memory pages that
have not been accessed in a while (following an LRU algorithm). Several
background eviction threads continuously process these pages, reconcile them to
disk and remove them from the main memory.
 - When an application does an insert or an update of a key/value pair, the
associated key is used to refer to an in-memory page. In the case of this page
not being in memory, appropriate on-disk page(s) are read and an in-memory page
constructed (the opposite of reconciliation). A data structure is maintained on
every in-memory page to store any insertions or modifications to the data done
on that page. As more and more data gets written to this page, the page's memory
footprint keeps growing.
 - An application can choose to set the maximum size a page is allowed to grow
in-memory. A default size is set by WiredTiger if the application doesn't
specify one. To keep page management efficient, as a page grows larger in-memory
and approaches this maximum size, if possible, it is split into smaller
in-memory pages.
 - When doing an insert or an update, if a page grows larger than the maximum,
the application thread is used to forcefully evict this page. This is done to
split the growing page into smaller in-memory pages and reconcile them into
on-disk pages. Once written to the disk they are removed from the main memory,
making space for more data to be written. When an application gets involved in
forced eviction, it might take longer than usual to do these inserts and
updates. It is not always possible to (force) evict a page from memory and this
page can temporarily grow larger in size than the configured maximum. This page
then remains marked to be evicted and reattempts are made as the application
puts more data in it.

@section configurable_page_struct Configurable page structures in WiredTiger
There are three page sizes that the user can configure:
 1. The maximum page size of any type of in-memory page in the WiredTiger cache,
memory_page_max.
 2. The maximum size of the on-disk page for an internal page, internal_page_max.
 3. The maximum size of the on-disk leaf page, leaf_page_max.

There are additional configuration settings that tune more esoteric and
specialized data. Those are included for completeness but are rarely changed.

@subsection memory_page_max memory_page_max
The maximum size a table's page is allowed to grow to in memory before being
reconciled to disk.
 - An integer, with acceptable values between 512B and 10TB
 - Default size: 5 MB
 - Additionally constrained by the condition:
   leaf_page_max <= memory_page_max <= cache_size/10
 - Motivation to tune the value:
\n memory_page_max is significant for applications wanting to tune for
consistency in write intensive workloads.
   - This is the parameter to start with for tuning and trying different values
to find the correct balance between overall throughput and individual operation
latency for each table.
   - Splitting a growing in-memory page into smaller pages and reconciliation
both require exclusive access to the page which makes an application's write
operations wait. Having a large memory_page_max means that the pages will need
to be split and reconciled less often. But when that happens, the duration that
an exclusive access to the page is required is longer, increasing the latency of
an application's insert or update operations. Conversely, having a smaller
memory_page_max reduces the time taken for splitting and reconciling the pages,
but causes it to happen more frequently, forcing more frequent but shorter
exclusive accesses to the pages.
   - Applications should choose the memory_page_max value considering the
trade-off between frequency of exclusive access to the pages (for reconciliation
or splitting pages into smaller pages) versus the duration that the exclusive
access is required.
 - Configuration:
\n Specified as memory_page_max configuration option to WT_SESSION::create. An
example of such a configuration string is as follows:

<pre>
     "key_format=S,value_format=S,memory_page_max=10MB"
</pre>

@subsection internal_page_max internal_page_max
The maximum page size for the reconciled on-disk internal pages of the B-Tree,
in bytes. When an internal page grows past this size, it splits into multiple
pages.
 - An integer, with acceptable values between 512B and 512MB
 - Default size: 4 KB (*appropriate for applications with relatively small keys)
 - Additionally constrained by the condition: the size must be a multiple of the
allocation size
 - Motivation to tune the value:
\n internal_page_max is significant for applications wanting to avoid excessive
L2 cache misses while searching the tree.
   - Recall that only keys are stored on internal pages, so the type and size of
the key values for a table help drive the setting for this parameter.
   - Should be sized to fit into on-chip caches.
   - Applications doing full-table scans with out-of-memory workloads might
increase internal_page_max to transfer more data per I/O.
   - Influences the shape of the B-Tree, i.e. depth and the number of children
each page in B-Tree has. To iterate to the desired key/value pair in the B-Tree,
WiredTiger has to binary search the key-range in a page to determine the child
page to proceed to and continue down the depth until it reaches the correct leaf
page. Having an unusually deep B-Tree, or having too many children per page can
negatively impact time taken to iterate the B-Tree, slowing down the application.
The number of children per page and, hence, the tree depth depends upon the
number of keys that can be stored in an internal page, which is
internal_page_max divided by key size. Applications should choose an appropriate
internal_page_max size that avoids the B-Tree from getting too deep.
 - Configuration:
\n Specified as internal_page_max configuration option to WT_SESSION::create.
An example of such a configuration string is as follows:

<pre>
     "key_format=S,value_format=S,internal_page_max=16KB,leaf_page_max=1MB"
</pre>

@subsection leaf_page_max leaf_page_max
The maximum page size for the reconciled on-disk leaf pages of the B-Tree, in
bytes. When a leaf page grows past this size, it splits into multiple pages.
 - An integer, with acceptable values between 512B and 512MB
(fixed-length column store pages are limited to 128KB; the configured
size does not include the additional size of timestamp information
when timestamps are in use)
 - Default size: 32 KB (*appropriate for applications with relatively small keys
and values)
 - Additionally constrained by the condition: must be a multiple of the
allocation size
 - Motivation to tune the value:
\n leaf_page_max is significant for applications wanting to maximize sequential
data transfer from a storage device.
   - Should be sized to maximize I/O performance (when reading from disk, it is
usually desirable to read a large amount of data, assuming some locality of
reference in the application's access pattern).
   - Applications doing full-table scans through out-of-cache workloads might
increase leaf_page_max to transfer more data per I/O.
   - Applications focused on read/write amplification might decrease the page
size to better match the underlying storage block size.
   - In fixed-length column stores, a relatively large number of
items fit on a single page; this can occasionally lead to contention
issues that can be mitigated by making the pages smaller.
   - For fixed-length column stores, the much-larger size of in-memory
pending changes relative to on-disk values can lead to adverse
eviction behavior if a large number of uncommitted changes are present
on the same page. Applications that expect to update large numbers of
adjacent values at once, especially in long-running transactions, may
want to decrease \c leaf_page_max or, alternatively, increase \c
memory_page_max.
 - Configuration:
\n Specified as leaf_page_max configuration option to WT_SESSION::create. An
example of such a configuration string is as follows:

<pre>
     "key_format=S,value_format=S,internal_page_max=16KB,leaf_page_max=1MB"
</pre>

The following configuration items following are rarely used.  They are described
for completeness:

@subsection allocation_size allocation_size
This is the underlying unit of allocation for the file. As the unit of file
allocation, it sets the minimum page size and how much space is wasted when
storing small amounts of data and overflow items.
 - an integer between 512B and 128 MB
 - must a power-of-two
 - default : 4 KB
 - Motivation to tune the value:
\n Most applications should not need to tune the allocation size.
   - To be compatible with virtual memory page sizes and direct I/O requirements
on the platform (4KB for most common server platforms)
   - Smaller values decrease the file space required by overflow items.
   - For example, if the allocation size is set to 4KB, an overflow item of
18,000 bytes requires 5 allocation units and wastes about 2KB of space. If the
allocation size is 16KB, the same overflow item would waste more than 10KB.
 - Configuration:
\n Specified as allocation_size configuration option to WT_SESSION::create. An
example of such a configuration string is as follows:

<pre>
     "key_format=S,value_format=S,allocation_size=4KB"
</pre>

@subsection key_val_max internal/leaf key/value max
 - Overflow items
\n Overflow items are keys and values too large to easily store on a leaf
page. Overflow items are stored separately in the file from the page where
the item logically appears, and so reading an overflow item is more expensive
than an on-page item, normally requiring additional I/O.  Additionally,
overflow values are not cached in memory. This means overflow items won't
affect the caching behavior of the application.  It also means that each
time an overflow value is read, it is re-read from disk.
 - leaf_key_max
\n The largest key stored in a leaf page, in bytes. If set, keys larger than the
specified size are stored as overflow items.
   - The default value is one-tenth the size of a newly split leaf page.
 - leaf_value_max
\n The largest value stored in a leaf page, in bytes. If set, values larger than
the specified size are stored as overflow items
   - The default is one-half the size of a newly split leaf page.
   - If the size is larger than the maximum leaf page size, the page size is
temporarily ignored when large values are written.
 - Motivation to tune the values:
\n Most applications should not need to tune the maximum key and value sizes.
Applications requiring a small page size, but also having latency concerns such
that the additional work to retrieve an overflow item may find modifying these
values useful.
\n Since overflow items are separately stored in the on-disk file, aren't cached
and require additional I/O to access (read or write), applications should avoid
creating overflow items.
   - Since page sizes also determine the default size of overflow items, i.e.,
keys and values too large to easily store on a page, they can be configured to
avoid performance penalties working with overflow items:
     - Applications with large keys and values, and concerned with latency,
might increase the page size to avoid creating overflow items, in order to avoid
the additional cost of retrieving them.
     - Applications with large keys and values, doing random searches, might
decrease the page size to avoid wasting cache space on overflow items that
aren't likely to be needed.
     - Applications with large keys and values, doing table scans, might
increase the page size to avoid creating overflow items, as the overflow items
must be read into memory in all cases, anyway.
   - leaf_key_max and leaf_value_max configuration values allow applications
to change the size at which a key or value will be treated as an overflow item.
     - Most applications should not need to tune the maximum key and value
sizes.
     - The values of leaf_key_max and leaf_value_max are not relative to the
maximum leaf page size. If either is larger than the maximum page size, the page
size will be ignored when the larger keys and values are being written, and a
larger page will be created as necessary.
 - Configuration:
\n Specified as leaf_key_max and leaf_value_max configuration options to
WT_SESSION::create. An example of configuration string for a large leaf
overflow value:

<pre>
     "key_format=S,value_format=S,leaf_page_max=16KB,leaf_value_max=256KB"
</pre>

@subsection split_pct split_pct (split percentage)
The size (specified as percentage of internal/leaf page_max) at which the
reconciled page must be split into multiple smaller pages before being sent for
compression and then be written to the disk. If the reconciled page can fit into
a single on-disk page without the page growing beyond it's set max size,
split_pct is ignored and the page isn't split.
 - an integer between 50 and 100
 - default : 90
 - Motivation to tune the value:
\n Most applications should not need to tune the split percentage size.
   - This value should be selected based on the expected workload.  Workloads
that perform many random updates benefit from values around 66%, leaving room
on the newly split pages for future updates. Append-only workloads benefit from
a value of 100%, as they will not add updates to the newly split pages.
The default value for split_pct is 90%, intended as a balance between these two
workload types.
 - Configuration:
\n Specified as split_pct configuration option to WT_SESSION::create. An
example of such a configuration string is as follows:

<pre>
     "key_format=S,value_format=S,split_pct=80"
</pre>

@section compression_considerations Compression considerations
WiredTiger compresses data at several stages to preserve memory and disk space.
Applications can configure these different compression algorithms to tailor
their requirements between memory, disk and CPU consumption.  Compression
algorithms other than block compression work by modifying how the keys and
values are represented, and will reduce data size both in-memory and on-disk.
Block compression compresses the data in its binary representation while
saving it on the disk, and so only reduces the data size on-disk.

Configuring compression may change application throughput. For example, in
applications using solid-state drives (where I/O is less expensive), turning
off compression may increase application performance by reducing CPU costs; in
applications where I/O costs are more expensive, turning on compression may
increase application performance by reducing the overall number of I/O
operations.

WiredTiger uses some internal algorithms to compress the amount of data
stored that are not configurable, but always on.  For example, run-length
encoding reduces in-memory and on-disk size requirements by storing
sequential, duplicate values in a column-store object only a single time.

Different compression options available with WiredTiger:
- Key-prefix
\n
Reduces the size requirement by storing any identical key prefix
only once per page. The cost is additional CPU and memory when operating on
the in-memory tree. Specifically, reverse sequential cursor movement (but not
forward) through a prefix-compressed page or the random lookup of a key/value
pair will allocate sufficient memory to hold some number of uncompressed keys.
So, for example, if key prefix compression only saves a small number of bytes
per key, the additional memory cost of instantiating the uncompressed key may
mean prefix compression is not worthwhile. Further, in cases where the on-disk
cost is the primary concern, block compression may mean prefix compression is
less useful.
\n\n
Key-prefix configuration:
\n
Specified using \c the prefix_compression configuration option to
WT_SESSION::create. Applications may limit the use of prefix compression
by configuring the minimum number of bytes that must be gained before
prefix compression is used with the \c prefix_compression_min configuration
option. An example of such a configuration string is as follows:
<pre>
"key_format=S,value_format=S,prefix_compression=true,prefix_compression_min=7"
</pre>

- Dictionary
\n
Reduces the size requirement by storing any identical value only once per page.
\n\n
Dictionary configuration:
\n
Specified using the \c dictionary configuration configuration option to
WT_SESSION::create, which specifies the maximum number of unique values
remembered in the B-Tree row-store leaf page value dictionary. An example of
such a configuration string is as follows:
<pre>
"key_format=S,value_format=S,dictionary=1000"
</pre>

- Huffman
\n
Reduces the size requirement by compressing individual value items, and
can be configured for values. The additional CPU cost of Huffman encoding
can be high, and should be considered. (See @ref huffman for details.)
\n\n
Huffman configuration:
\n
Specified using the \c huffman_value configuration
options to WT_SESSION::create. These options can take values of "english"
(to use a built-in English language frequency table), "utf8<file>" or
"utf16<file>" (to use a custom UTF-8 or UTF-16 symbol frequency table file).
An example of such a configuration string is as follows:
<pre>
"key_format=S,value_format=S,huffman_value=english"
</pre>

- Block Compression
\n
Reduces the size requirement of on-disk objects by compressing blocks of
the backing object's file. The additional CPU cost of block compression can be
high, and should be considered. When block compression has been configured,
configured page sizes will not match the actual size of the page on disk.
\n
When block compression is configured, chunks of data are given to the
compression system, then returned as a compressed block of data. The amount of
data chosen to compress is based on previous compression results.  When
compressed, the size written to disk will vary depending on the compressibility
of the stored data.
\n\n
Block compression configuration:
\n
Specified using the \c block_compressor configuration option to
WT_SESSION::create.  If WiredTiger was built with support for "lz4", "snappy",
"zlib" or "zstd" compression, these names are available as the value to
the configuration option. An example of such a configuration string is as
follows:
<pre>
"key_format=S,value_format=S,block_compressor=snappy"
</pre>

See @ref compression for further information on how to configure and enable
different compression options.

@subsection table_compress Table summarizing compression in WiredTiger

<table>
@hrow{Compression Type, Supported by row-store, Supported by variable col-store,
      Supported by fixed col-store, Default config, Reduces in-mem size,
      Reduces on-disk size, CPU and Memory cost}
@row{Key-prefix, yes, no, no, disabled, yes, yes, minor}
@row{Dictionary, yes, yes, no, disabled, yes, yes, minor}
@row{Huffman, yes, yes, no, disabled, yes, yes, can be high}
@row{Block, yes, yes, yes, disabled, no, yes, can be high}
</table>

*/