diff options
Diffstat (limited to 'docs/src/layout.dox')
-rw-r--r-- | docs/src/layout.dox | 194 |
1 files changed, 194 insertions, 0 deletions
diff --git a/docs/src/layout.dox b/docs/src/layout.dox new file mode 100644 index 00000000000..d5967ff3873 --- /dev/null +++ b/docs/src/layout.dox @@ -0,0 +1,194 @@ +/* vim: set filetype=c.doxygen : */ + +/*! \defgroup layout Wired Tiger File Format + +<p> +There are a number of page formats used in WiredTiger, each with a unique page +type. However, all of the pages store data in some specific order, and it's +simplest to think of all of the formats as sets of key/data pairs. + +<p> +Keys are either record numbers or variable-length byte strings. + +<p> +Data items are either fixed-length byte strings, variable-length byte strings +or off-page references (pointers to other pages in the file). + +<p> +WiredTiger page types are presented in the API as "row" or "column" stores. + +<p> +A "row" store is a Berkeley DB style key/data store, where both keys and data +are variable-length byte strings. A "column" store is one where the key is a +record number and the data item is either a variable- or fixed-length byte +string. We support both because row stores are faster for queries where all +columns are required by every lookup (because there's only a single set of +meta-data pages to go through, or read into the cache). Column stores are +faster for queries where only a few of the columns are required for any lookup +(because only the columns being returned are present in the cache). Column +stores also offer significant compression advantages when there is minimal +difference between adjacent column values. + +<p> +To restate, there are 2 different types of keys in WiredTiger: + +<pre> +Recno : 64-bit record number +Variable-Length : byte string + length +</pre> + +<p> +The record-number isn't stored explicitly in the page, it's implicitly stored +as part of the Btree. + +<p> +And there are 3 different types of data items in WiredTiger: + +<pre> +Fixed-Length : fixed length number of bytes +Variable-Length : byte string + length +Off-page Ref: : off-page reference (a ref to another page in the file) +</pre> + +<p> +A row store is one where there's a variable-length key associated with one or +more variable-length data items. Because record numbers are an implicit part +of the Btree, it's possible to retrieve records from a row store using record +numbers, where that's useful. In a row store, a key may reference multiple +data items (that is, duplicate data items). + +<p> +Here's the most complex possible row store layout: + +<pre> + IP + | +IP -> IP -> IP -> IP + | + LP -> LP -> LP + | + DIP + | + DIP -> DIP -> DIP + | + | + DLP -> DLP + +Key: IP Internal page + LP Leaf page + DIP Duplicate internal page + DLP Duplicate leaf page +</pre> + +<p> +This is a standard B+tree layout. One or more levels of internal pages, then +one level of leaf pages. + +<p> +A row store's internal page keys are variable-length byte strings. The internal +page data items are references to other pages (the off-page reference type). + +<p> +A row store's leaf page key and data items are both variable-length byte +strings. + +<p> +As mentioned above, it's possible to store multiple data items with a single +key. In this case, the key is only stored a single time. If there are enough +duplicate data items, the duplicate data items are moved into their own, +separate Btree, referenced from the leaf page. These are the "duplicate +internal" and "duplicate leaf" pages in the above diagram. + +<p> +Duplicate internal page keys are record numbers, and the duplicate internal +page data items are references to other pages (the off-page reference type). + +<p> +Duplicate leaf page keys are record numbers, and the duplicate leaf page data +items are variable-length strings. + +<p> +Row store notes: +<ul> + +<li> +WiredTiger internal pages are used only for navigation, and may not store +true database keys. In other words, a true B+tree, where keys can only be +returned from leaf pages. + +<li> +Any variable-length key or data item can be replaced with a reference to +an overflow page. This is used whenever a key or data item won't fit onto a +page because it's too large. However, references to overflow pages can be +thought of as an additional step taken to get a key or data item, and not a +fundamental layout structure. + +<li> +Any variable-length key or data item can be compressed using the Huffman +compression engine. Again, decompression is an additional step taken to get +a key or data item, and not a fundamental layout structure issue. + +</ul> + +<p> +This concludes the discussion of a row store. + +<p> +A column store is one where there's a single key, that references a single +data item, and that data item may be fixed- or variable-length. + +<p> +Here's the most complex possible row store layout: + +<pre> + IP + | +IP -> IP -> IP -> IP + | + LP -> LP -> LP + +Key: IP Internal page + LP Leaf page +</pre> + +<p> +A column store's internal page keys are record numbers. The internal page +data items are references to other pages (the off-page reference type). + +<p> +A column store's leaf page keys are record numbers. The leaf page data items +are fixed- or variable-length byte strings. + +<p> +Because column store keys do not disappear on delete, there needs to be an +on-page, permanent "deleted" data value; for variable-length byte strings, +this is a specific item type. For fixed-length byte strings, the top bit +of the field is the deleted flag (which means that the user-specified data +length is off by a single bit). + +<p> +Column store notes: +<ul> + +<li> +Like the row store, any variable-length data item can be replaced +with a reference to an overflow page, and can be compressed using the +Huffman compression engine. + +<li> +Any fixed-length data item can be compressed using a dictionary lookup +or repeat count. + +</ul> + +<p> +Q&A: +<ul> + +<li> +Why not have fixed-length data items in a row store? Because once you have +variable-length items on the page, having fixed-length items doesn't make +things any faster, you still have to maintain a page index. + +</ul> +*/ |