diff options
Diffstat (limited to 'src/third_party/wiredtiger/src/docs/lsm.dox')
-rw-r--r-- | src/third_party/wiredtiger/src/docs/lsm.dox | 133 |
1 files changed, 133 insertions, 0 deletions
diff --git a/src/third_party/wiredtiger/src/docs/lsm.dox b/src/third_party/wiredtiger/src/docs/lsm.dox new file mode 100644 index 00000000000..08512f27c92 --- /dev/null +++ b/src/third_party/wiredtiger/src/docs/lsm.dox @@ -0,0 +1,133 @@ +/*! @m_page{{c,java},lsm,Log-Structured Merge Trees} + +@section lsm_background Background + +A common requirement is sustained throughput under a workload that consists +of random inserts, where either the key range is chosen so that inserts are +very unlikely to conflict (e.g., 128-bit hashes), or where inserts are +expected to overwrite existing values. + +With traditional btree variants, inserts are very fast while the data set +remains in cache, but once the tree overflows the cache, performance drops +significantly. There are two factors involved: + + -# once the data fills the cache, new inserts have some probability of going +to a page that is not in cache, requiring a read; and + -# the cache is full of dirty pages, so pages have to be written to free +space in the cache before the read can be satisfied. + +@section lsm_description Description of LSM trees + +Log-Structured Merge Trees are described in this paper by Patrick O'Neil, +Edward Cheng, Dieter Gawlick and Elizabeth O'Neil: +http://www.cs.umb.edu/~poneil/lsmtree.pdf + +A logical tree is split into several physical pieces so that the +most-recently-updated portion of data is in a tree that fits entirely in +memory. The size of the in-memory chunk can be configured with the +\c "lsm=(chunk_size)" configuration key to WT_SESSION::create. + +Once the in-memory tree reaches a threshold size, a new in-memory tree is +created and the old tree synced to disk. Once written to disk, trees are +read-only, though they are merged in the background with other on-disk +trees to reduce the cost of reads. + +With this structure, "blind writes" can be performed entirely on the +in-memory tree. Deletes are implemented by inserting a special "tombstone" +record into the in-memory tree. + +@section lsm_api Interface to LSM trees + +An LSM tree can be created as follows, in much the same way as a +WiredTiger btree file: + +@code +@m_if{c} +session->create(session, "table:bucket", "type=lsm,key_format=S,value_format=S"); +@m_else +session.create("table:bucket", "type=lsm,key_format=S,value_format=S"); +@m_endif +@endcode + +Once created, the LSM tree can be accessed using the same cursor interface +as other data sources in WiredTiger: + +@code +@m_if{c} +WT_CURSOR *c; + +session->open_cursor(session, "table:bucket", NULL, NULL, &c); +for(;;) { + c->set_key(c, "key"); + c->set_value(c, "value"); + c->insert(c); +} +@m_else +Cursor c = session.open_cursor("table:bucket", null, null); +for(;;) { + c.putKeyString("key"); + c.putValueString("value"); + c.insert(); +} +@m_endif +@endcode + +If an LSM cursor is configured with \c "overwrite=false" passed to +WT_SESSION::open_cursor, the result will be a search through the levels of the +LSM tree before every modification. + +@section lsm_merge Merging + +A background thread is opened for each active LSM tree. This thread is +responsible for both writing old chunks to stable storage, and for merging +multiple chunks together so that reads can be satisfied from a small number +of files. There is currently no way to configure merges: they are performed +automatically by the background thread. + +@section lsm_bloom Bloom filters + +WiredTiger creates a Bloom filter when merging. This is an additional file +that contains a configurable number of bits per key (default 8). Keys are +hashed a configurable number of times (default 4), and the corresponding +bits set. The Bloom filter is used to avoid reading from a chunk if the key +cannot be present. + +With the defaults, the Bloom filter only requires one byte per key, so +it usually fits in cache. The Bloom parameters can be configured with +\c "lsm=(bloom_bit_count)" and \c "lsm=(bloom_hash_count)" configuration +keys to WT_SESSION::create. The Bloom file can be configured with the +\c "lsm=(bloom_config)" key. + +@section lsm_schema Creating tables using LSM trees + +Tables or indices can be stored using LSM trees. Schema support is provided +for LSM as an extension to the WT_SESSION::create method: + +@code +@m_if{c} +session->create(session, "table:T", "type=lsm"); +@m_else +session.create("table:T", "type=lsm"); +@m_endif +@endcode + +The default type for all schema objects will continue to be btree. + +@section lsm_caveats Caveats + +@subsection lsm_hazard Hazard configuration + +Reads from an LSM cursor may need to position a cursor in each active chunk. +The number of chunks depends on the chunk size, and how many chunks have +been merged. There must be at least as many hazard pointers available as +there are chunks in the tree for each cursor that is open on the LSM tree. +The number of hazard pointers is configured with the \c "hazard_max" +configuration key to ::wiredtiger_open. + +@subsection lsm_checkpoints Named checkpoints + +Named checkpoints are not supported on LSM trees, and cursors cannot be +opened with a non-empty \c "checkpoint" configuration (that is, only the +most recent standard checkpoint can be read). + + */ |