summaryrefslogtreecommitdiff
path: root/docs/src/layout.dox
diff options
context:
space:
mode:
Diffstat (limited to 'docs/src/layout.dox')
-rw-r--r--docs/src/layout.dox194
1 files changed, 194 insertions, 0 deletions
diff --git a/docs/src/layout.dox b/docs/src/layout.dox
new file mode 100644
index 00000000000..d5967ff3873
--- /dev/null
+++ b/docs/src/layout.dox
@@ -0,0 +1,194 @@
+/* vim: set filetype=c.doxygen : */
+
+/*! \defgroup layout Wired Tiger File Format
+
+<p>
+There are a number of page formats used in WiredTiger, each with a unique page
+type. However, all of the pages store data in some specific order, and it's
+simplest to think of all of the formats as sets of key/data pairs.
+
+<p>
+Keys are either record numbers or variable-length byte strings.
+
+<p>
+Data items are either fixed-length byte strings, variable-length byte strings
+or off-page references (pointers to other pages in the file).
+
+<p>
+WiredTiger page types are presented in the API as "row" or "column" stores.
+
+<p>
+A "row" store is a Berkeley DB style key/data store, where both keys and data
+are variable-length byte strings. A "column" store is one where the key is a
+record number and the data item is either a variable- or fixed-length byte
+string. We support both because row stores are faster for queries where all
+columns are required by every lookup (because there's only a single set of
+meta-data pages to go through, or read into the cache). Column stores are
+faster for queries where only a few of the columns are required for any lookup
+(because only the columns being returned are present in the cache). Column
+stores also offer significant compression advantages when there is minimal
+difference between adjacent column values.
+
+<p>
+To restate, there are 2 different types of keys in WiredTiger:
+
+<pre>
+Recno : 64-bit record number
+Variable-Length : byte string + length
+</pre>
+
+<p>
+The record-number isn't stored explicitly in the page, it's implicitly stored
+as part of the Btree.
+
+<p>
+And there are 3 different types of data items in WiredTiger:
+
+<pre>
+Fixed-Length : fixed length number of bytes
+Variable-Length : byte string + length
+Off-page Ref: : off-page reference (a ref to another page in the file)
+</pre>
+
+<p>
+A row store is one where there's a variable-length key associated with one or
+more variable-length data items. Because record numbers are an implicit part
+of the Btree, it's possible to retrieve records from a row store using record
+numbers, where that's useful. In a row store, a key may reference multiple
+data items (that is, duplicate data items).
+
+<p>
+Here's the most complex possible row store layout:
+
+<pre>
+ IP
+ |
+IP -> IP -> IP -> IP
+ |
+ LP -> LP -> LP
+ |
+ DIP
+ |
+ DIP -> DIP -> DIP
+ |
+ |
+ DLP -> DLP
+
+Key: IP Internal page
+ LP Leaf page
+ DIP Duplicate internal page
+ DLP Duplicate leaf page
+</pre>
+
+<p>
+This is a standard B+tree layout. One or more levels of internal pages, then
+one level of leaf pages.
+
+<p>
+A row store's internal page keys are variable-length byte strings. The internal
+page data items are references to other pages (the off-page reference type).
+
+<p>
+A row store's leaf page key and data items are both variable-length byte
+strings.
+
+<p>
+As mentioned above, it's possible to store multiple data items with a single
+key. In this case, the key is only stored a single time. If there are enough
+duplicate data items, the duplicate data items are moved into their own,
+separate Btree, referenced from the leaf page. These are the "duplicate
+internal" and "duplicate leaf" pages in the above diagram.
+
+<p>
+Duplicate internal page keys are record numbers, and the duplicate internal
+page data items are references to other pages (the off-page reference type).
+
+<p>
+Duplicate leaf page keys are record numbers, and the duplicate leaf page data
+items are variable-length strings.
+
+<p>
+Row store notes:
+<ul>
+
+<li>
+WiredTiger internal pages are used only for navigation, and may not store
+true database keys. In other words, a true B+tree, where keys can only be
+returned from leaf pages.
+
+<li>
+Any variable-length key or data item can be replaced with a reference to
+an overflow page. This is used whenever a key or data item won't fit onto a
+page because it's too large. However, references to overflow pages can be
+thought of as an additional step taken to get a key or data item, and not a
+fundamental layout structure.
+
+<li>
+Any variable-length key or data item can be compressed using the Huffman
+compression engine. Again, decompression is an additional step taken to get
+a key or data item, and not a fundamental layout structure issue.
+
+</ul>
+
+<p>
+This concludes the discussion of a row store.
+
+<p>
+A column store is one where there's a single key, that references a single
+data item, and that data item may be fixed- or variable-length.
+
+<p>
+Here's the most complex possible row store layout:
+
+<pre>
+ IP
+ |
+IP -> IP -> IP -> IP
+ |
+ LP -> LP -> LP
+
+Key: IP Internal page
+ LP Leaf page
+</pre>
+
+<p>
+A column store's internal page keys are record numbers. The internal page
+data items are references to other pages (the off-page reference type).
+
+<p>
+A column store's leaf page keys are record numbers. The leaf page data items
+are fixed- or variable-length byte strings.
+
+<p>
+Because column store keys do not disappear on delete, there needs to be an
+on-page, permanent "deleted" data value; for variable-length byte strings,
+this is a specific item type. For fixed-length byte strings, the top bit
+of the field is the deleted flag (which means that the user-specified data
+length is off by a single bit).
+
+<p>
+Column store notes:
+<ul>
+
+<li>
+Like the row store, any variable-length data item can be replaced
+with a reference to an overflow page, and can be compressed using the
+Huffman compression engine.
+
+<li>
+Any fixed-length data item can be compressed using a dictionary lookup
+or repeat count.
+
+</ul>
+
+<p>
+Q&A:
+<ul>
+
+<li>
+Why not have fixed-length data items in a row store? Because once you have
+variable-length items on the page, having fixed-length items doesn't make
+things any faster, you still have to maintain a page index.
+
+</ul>
+*/