author | Michael Cahill <michael.cahill@wiredtiger.com> | 2010-12-01 15:22:26 +1100 |
---|---|---|
committer | Michael Cahill <michael.cahill@wiredtiger.com> | 2010-12-01 15:22:26 +1100 |
commit | 9c248d20eb88e58769c01f38741d052dec036810 (patch) | |
tree | d00100e3ce2a581294d8cf8f0a9749386ab768f7 /docs | |
parent | 1ed1a5892e1a8a0452238f96eb00d7fb7f9ed228 (diff) | |
download | mongo-9c248d20eb88e58769c01f38741d052dec036810.tar.gz |
More work on the API, plus a client/server implementation.
--HG--
branch : mjc
extra : transplant_source : %3E%C8%0D%F8Z%11%B8%FC5%B4JV%C3%F75%16%98%A8%E0n
Diffstat (limited to 'docs')
-rw-r--r-- | docs/src/design.dox | 625 | ||||
-rw-r--r-- | docs/src/layout.dox | 194 | ||||
-rw-r--r-- | docs/style/footer.html | 7 |
3 files changed, 824 insertions, 2 deletions
diff --git a/docs/src/design.dox b/docs/src/design.dox
new file mode 100644
index 00000000000..4d6e77645d4
--- /dev/null
+++ b/docs/src/design.dox
@@ -0,0 +1,625 @@
+/* vim: set filetype=c.doxygen : */
+
+/*! \defgroup architecture Wired Tiger Architecture
+
+<h3>Major architectural differences from Berkeley DB:</h3>
+<ol>
+
+<li><b>Support for multi-core architectures</b>:
+Modern architectures have tens or hundreds of cores, with multiple
+threads per core, and touching shared memory kills performance.
+Berkeley DB's architecture is based on shared memory protected by data
+mutexes (mutexes serialize access to the data structures), which
+doesn't scale well.
+
+<p>Wired Tiger will have a "work queue architecture", that is, an
+interface outside of the database engine where threads of control queue
+work. Inside the database engine, a different thread of control will
+perform the work. The work queue will be read and written by only two
+threads, limiting the amount of memory that's touched by more than a
+single thread of control.
+
+<p>Initial benchmarks show that schedulers/architectures are not
+uniform: on some, processes are better than threads, and on others the
+reverse is true; likewise, polling is sometimes better than mutexing
+and sometimes worse. For that reason, we will use a combination of
+polling and mutexes for the work queue, and configure at run-time
+whether the database engine uses separate processes or threads to
+perform work.
+
+<p>Mutexes will behave in the usual fashion: clients will write their
+work into the shared memory queue, wake the worker if necessary, then
+sleep on a mutex. Workers will perform the work and wake the client.
+Polling will be implemented using shared memory and a memory flush; in
+other words, the usual "lock; write the data structure; unlock" cycle
+is replaced with "write new task into a memory slot, flush memory, poll
+for new task to complete". (A sketch of this hand-off appears after
+this list.) However, only the application knows how many threads it
+has and what they're doing, so the application will have to be
+responsible for binding threads to cores. This means we will need
+to export controls so applications can bind threads of control to
+specific processors and do other platform-specific scheduling tasks.
+
+<p>As in Berkeley DB, this "work queue architecture" must handle two
+different programmatic architectures:
+
+<ol>
+<li>A single process, without threads, in other words, a single thread
+of control. In this architecture, the work queue will be skipped
+entirely, and the thread of control will enter the library and do all
+work itself.
+
+<li>One or more processes, optionally threaded -- in other words,
+multiple threads of control. Complicating this architecture is the
+possibility that Wired Tiger may be called from other library routines
+(this is the Berkeley DB Subversion problem, and the reason Berkeley DB
+implemented the DB_REGISTER flag).
+</ol>
+
+<p>
+The programmatic architecture will be selectable at run-time (unless
+that proves too costly in terms of performance, in which case we can
+add configuration-time options).
+
+<p>
+The "work queue architecture" results in one significant difference from
+the Berkeley DB architecture. A separate process cannot configure any
+database engine callbacks or other functions for the database (such as
+providing a key sort function); those must be configured within the
+threads of control executing the actual database code. This is
+annoying, as it requires the database configuration and callback to be
+done before any separate process accesses the database.
+On the other hand, it is useful as only one process in the application
+suite has to be aware of the underlying configuration details of the
+database.
+
+<p>
+<font color="blue">Maybe </font>automatically partition the database to
+transfer work requests to a different machine or thread of control.
+Once the database is large enough that threads will reasonably not
+conflict, it would be best to have multiple threads of control to
+partition work on the data. In this case, each thread which owns a
+chunk of a database should write to its own individual log file.
+
+<p><li><b>Environment creation and recovery</b>:
+Berkeley DB's environment creation and recovery model is too complex.
+Applications were often unable to determine when to run recovery,
+requiring, for example, features like the DB_REGISTER flag and the
+DbEnv.failchk method.
+
+<p>As previously described, Wired Tiger environments will have separate
+threads of control performing all database work. The default Wired
+Tiger environment will be a combination of Berkeley DB's DbEnv.failchk
+and DB_REGISTER configurations: whenever a thread of control "joins" a
+Wired Tiger environment, or periodically as called by an application
+thread, Wired Tiger will confirm the database engine's thread of control
+is still running. If that thread of control has failed, another thread
+of control will optionally be created, whatever recovery is necessary
+will be performed, and the system will continue. In this sense, Wired
+Tiger will be a "crashless" system; that is, applications will never
+know that recovery was performed, because it is performed behind the
+scenes as necessary.
+
+<p><li><b>Schema layer</b>:
+Wired Tiger database environments have a configuration which describes
+the contents of the environment. This focuses on, but is not limited
+to, the database files. The goal is to eliminate Berkeley DB's baroque
+file handling. File operations (for example, remove, rename, create,
+etc.) will be done by transactionally modifying the schema database. To
+remove a file, we can remove the name in the schema database, and deal
+with the physical remove as part of the transaction commit. Any
+transaction serialization required by operations on that file can be
+done based on handling the record in the schema database. Recovery
+cleanup can be done by reviewing the entries in the schema database
+after it has been recovered. By divorcing file system names from
+logical database names, things get easier: as another example, file
+striping can be done at a logical layer, where looking up a name in the
+schema database can return whatever physical file and/or offset we want.
+
+<p><li><b><font color="blue">Maybe </font>Record-level locking</b>:
+Record-level locking is tempting in order to increase concurrency.
+Unfortunately, it significantly complicates recovery. However, there
+isn't any other way to increase insert rates where the keys are inserted
+in sorted order, which is a significant failing of Berkeley DB.
+
+<p><li><b><font color="blue">Maybe </font>Logical logging</b>:
+Logical logging is tempting because of the possibility of supporting
+multi-master replication (by shipping logical operations to the replicas
+instead of shipping physical record changes). However, logical logging
+significantly complicates recovery (for example, no split was required
+during the operation but a split is unexpectedly required during
+recovery), and complicating recovery is a bad idea.
+
+</ol>
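To make the polling variant of the work-queue hand-off concrete, here is a minimal sketch of the single-slot, two-thread exchange described in the first item above. It is an illustration only: the slot layout, the state names, and the `client_call`/`worker_poll` functions are assumptions rather than code from this commit, and C11 atomics stand in for whatever memory-flush primitive the engine actually uses.

```c
#include <stdatomic.h>

typedef enum { SLOT_EMPTY, SLOT_READY, SLOT_DONE } slot_state;

typedef struct {
	_Atomic slot_state state;	/* shared-memory slot state */
	int op;				/* requested operation (hypothetical) */
	void *arg;			/* operation argument */
	int ret;			/* operation result */
} work_slot;

/* Client side: publish a task, then poll for completion. */
static int
client_call(work_slot *slot, int op, void *arg)
{
	slot->op = op;
	slot->arg = arg;
	/* The release store is the "flush memory" step from the text. */
	atomic_store_explicit(&slot->state, SLOT_READY, memory_order_release);
	while (atomic_load_explicit(&slot->state, memory_order_acquire) !=
	    SLOT_DONE)
		;	/* spin; a real implementation might yield or sleep */
	atomic_store_explicit(&slot->state, SLOT_EMPTY, memory_order_relaxed);
	return (slot->ret);
}

/* Engine side: poll for a task, perform it, publish the result. */
static void
worker_poll(work_slot *slot, int (*perform)(int, void *))
{
	if (atomic_load_explicit(&slot->state, memory_order_acquire) ==
	    SLOT_READY) {
		slot->ret = perform(slot->op, slot->arg);
		atomic_store_explicit(&slot->state, SLOT_DONE,
		    memory_order_release);
	}
}
```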
+
+<h3>General API changes:</h3>
+<ol>
+
+<p><li><b>Per-thread handles</b>:
+Applications must allocate a per-thread object for any thread that will
+be calling a WiredTiger function. When crossing the library API, the
+per-thread handle will be updated to reference any current environment
+and database handles; the per-thread handle will then be the handle
+passed as the first argument of every internal database function.
+
+<p>
+Each thread of control's information will include a lock ID which will
+be used for deadlock detection. (This assumes we eventually implement
+some kind of deadlock detection...)
+
+<p><li><b>Object constructors/destructors</b>:
+There will be explicit object constructor/destructor functions which are
+used for configuration only (the DB_ENV vs. ENV split done right). The
+application has to allocate the object, then configure it, then use it
+and then destroy it. It may be repeatedly used, as it contains no
+internal information. (A usage sketch follows this list.)
+
+<p><li><b>Error handling</b>:
+Wired Tiger error messages will include a URL which can be used to look
+up detailed error information. This makes error configuration fairly
+complicated: an error message will have two parts, a per-release URL for
+detailed information and a cryptic error message. By default, both will
+be displayed, but there will be a compile-time option to output only
+the number (to support a small footprint). Error messages in the code
+will need to contain the number, the short message and the detailed
+message, and the URL support will be automatically generated for each
+release.
+
+<p>
+Error handling (error message, prefix, stream, whatever) will be
+database name specific.
+
+<p><li><b>Configuration</b>:
+There will be no general configuration functions; Wired Tiger will use
+configuration methods only, and, as in Berkeley DB, there will be getter
+functions in all cases. This will also apply to many method arguments;
+that is, optional arguments will be set with configuration methods,
+rather than by passing potentially ignored values to the method call.
+
+<p>
+Configuration routines will only update specific shared memory region
+locations, to allow auto-generation of the configuration code.
+
+<p><li><b>Cursor errors</b>:
+By default, cursors will become unusable after any error; applications
+wanting Berkeley DB behavior, where the cursor has not moved and can be
+used after error, will have a flag which duplicates the cursor before
+each operation (Berkeley DB's DbCursor.dup method). The point is that
+cursor duplication is painful and difficult, and most applications don't
+need it.
+
+<p><li><b>Secondary indices</b>:
+Combine the Db.get and Db.pget calls into a single method.
+
+<p><li><b>Delete of non-existent entry</b>:
+Delete of a non-existent entry should optionally not return an error.
+
+<p><li><b>File/directory permission modes</b>:
+Support explicit modes for all physical files.
+
+<p><li><b>Flash support</b>:
+Add high-level environment configuration which indicates no underlying
+disk support.
+
+</ol>
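The per-thread handle and the constructor/configure/use/destroy cycle described above might look roughly like the following from an application's point of view. The type and function names below (ENV, TOC, env_create, and so on) are placeholders invented for illustration; the commit does not define this API.

```c
/* Hypothetical handle types and functions, declared only as stand-ins. */
typedef struct __env ENV;	/* environment object */
typedef struct __toc TOC;	/* per-thread handle ("thread of control") */

int env_create(ENV **envp);				/* constructor */
int env_cache_size_set(ENV *env, unsigned long bytes);	/* configuration method */
int env_open(ENV *env, const char *home);
int env_toc_create(ENV *env, TOC **tocp);		/* per-thread handle */
int db_get(TOC *toc, const char *db, const void *key, void *value);
int toc_destroy(TOC *toc);
int env_destroy(ENV *env);				/* destructor */

int
example(const char *home)
{
	ENV *env;
	TOC *toc;
	int ret;

	if ((ret = env_create(&env)) != 0)		/* allocate */
		return (ret);
	(void)env_cache_size_set(env, 32UL * 1024 * 1024); /* configure */
	if ((ret = env_open(env, home)) != 0)		/* use */
		goto err;

	/* Each calling thread allocates its own handle ... */
	if ((ret = env_toc_create(env, &toc)) != 0)
		goto err;
	/* ... and that handle is the first argument of every call. */
	ret = db_get(toc, "mydb", "key", NULL);
	(void)toc_destroy(toc);

err:	(void)env_destroy(env);				/* destroy */
	return (ret);
}
```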
+
+<h3>Btree access method changes:</h3>
+
+<table width="100%" border="1" align=left>
+<tr valign="top">
+<th align="left" valign="top" width="15%">Feature</th>
+<th align="left" width="2%">Implemented</th>
+<th align="left">Notes</th>
+</tr>
+
+</td></tr>
+<tr valign="top"><td>
+Bulk load
+</td><td>
+Yes
+</td><td>
+Wired Tiger supports bulk load, where sorted key/data pairs are read
+from a callback function, and a B+tree is built bottom-up in a new
+database.
+
+<tr valign="top"><td>
+Page size
+</td><td>
+Yes
+</td><td>
+Berkeley DB's maximum page size is 64KB, limiting on-page key/data items
+to roughly 16KB. Wired Tiger supports pages of any practical size. The
+goal is two-fold: to allow key or data items larger than 16KB to be
+stored on-page, and to increase the data transferred by a single disk
+read. However, simply increasing the size of the database page has a
+problem: in Berkeley DB, internal nodes are the same size as leaf pages,
+and a binary search of large pages will behave badly in the L1/L2
+caches. To avoid this problem, Wired Tiger supports configuration of
+different sizes for internal and leaf nodes.
+
+</td></tr>
+<tr valign="top"><td>
+Overflow items
+</td><td>
+Yes
+</td><td>
+<p>
+In Berkeley DB, overflow items (key or data items too large to fit on a
+leaf page) are allocated in units of database page size. This wastes,
+on average, half a database page for each overflow item. Also, overflow
+items are then marshalled to/from those database pages. WiredTiger
+allocates overflow objects directly from the underlying file, using a
+configurable value as the unit of allocation. This eliminates the
+wasted space and the manipulation of multiple pages in order to store
+or retrieve overflow items.
+
+</td></tr>
+<tr valign="top"><td>
+Extent allocation
+</td><td>
+</td><td>
+Wired Tiger supports extent-based allocation in the B+tree. Pages will
+be initially allocated at the page size; when splitting, pages will
+increase in size up to the maximum page size. Pages split after
+reaching the maximum page size will be aggregated into extents. For
+example, imagine an 8KB page size with a maximum page size of 64KB and
+10MB extents. The page is initially 8KB; as it runs out of space, the
+page size doubles until it reaches 64KB. The next split results in 2
+contiguous 64KB pages, and further splits add contiguous pages until
+there are 160 of them in memory, that is, a 10MB extent. Internal pages
+will not be gathered into extents, although they will increase in size
+from the initial size to the maximum page size. (The short program
+after this table walks through this growth schedule.)
+
+<p>
+The goal is to speed sequential search, as much larger chunks of disk
+can be returned in a single disk read. As extents are made up of
+smaller pages, random accesses do not need to read the full extent.
+
+</td></tr>
+<tr valign="top"><td>
+Secondary indices
+</td><td>
+</td><td>
+Wired Tiger optionally stores a copy (or part of a copy) of the primary
+data item in the secondary to speed the use of a secondary index,
+avoiding a subsequent lookup in the primary. In Berkeley DB, an
+"ordered" lookup when walking a secondary becomes a "random" lookup on
+the primary in order to return the primary's data item, resulting in a
+colder cache and more deadlocks. By storing a copy of the primary's
+data in the secondary, we can avoid the second lookup.
+
+</td></tr>
+<tr valign="top"><td>
+Compression
+</td><td>
+</td><td>
+Wired Tiger will support static (probably Huffman encoding) key/data
+item compression. This is worth doing because CPU is less important
+than memory, and trading CPU to keep the cache hotter is the right
+choice to make.
+
+</td></tr>
+<tr valign="top"><td>
+Salvage
+</td><td>
+</td><td>
+Wired Tiger supports in-place database salvage.
+
+</td></tr>
+<tr valign="top"><td>
+Automatic transaction retry
+</td><td>
+</td><td>
+Wired Tiger supports automatic retry of an individual transactional
+operation, should it be selected to resolve a deadlock.
+
+</td></tr>
+<tr valign="top"><td>
+Copy avoidance
+</td><td>
+</td><td>
+Wired Tiger supports a call-back function which is called from inside
+the library and given a pointer to the key/data item (normally on the
+page), avoiding any data copy at all.
+
+</td></tr>
+<tr valign="top"><td>
+BLOB support
+</td><td>
+</td><td>
+Wired Tiger will support BLOBs using Berkeley DB's "returned file
+handle" design. While this is less necessary given Wired Tiger's
+support for large page sizes, it's still useful for truly large data
+items.
+
+</td></tr>
+</table>
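The extent-allocation example above can be restated as a tiny, self-contained program that prints the growth schedule for the numbers used in the table (an 8KB initial page, a 64KB maximum page, and 10MB extents). The constants are taken from the worked example in the text, not from any configuration API.

```c
#include <stdio.h>

int
main(void)
{
	unsigned long pagesize = 8UL * 1024;		/* initial page size */
	const unsigned long maxpagesize = 64UL * 1024;	/* maximum page size */
	const unsigned long extentsize = 10UL * 1024 * 1024;

	/* A full page doubles in size on split, up to the maximum. */
	while (pagesize < maxpagesize) {
		pagesize *= 2;
		printf("split: grow page to %luKB\n", pagesize / 1024);
	}

	/* After that, splits add contiguous maximum-size pages. */
	printf("then aggregate %lu contiguous %luKB pages into one %luMB extent\n",
	    extentsize / maxpagesize, maxpagesize / 1024,
	    extentsize / (1024 * 1024));
	return (0);
}
```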
+
+<h3>Cache changes:</h3>
+
+<table width="100%" border="1" align=left>
+<tr valign="top">
+<th align="left" valign="top" width="15%">Feature</th>
+<th align="left" width="2%">Implemented</th>
+<th align="left">Notes</th>
+</tr>
+
+</td></tr>
+<tr valign="top"><td>
+Database page alignment
+</td><td>
+Yes
+</td><td>
+Database pages are allocated separately from their headers so direct I/O
+from the underlying file into memory is possible.
+
+</td></tr>
+<tr valign="top"><td>
+File identification
+</td><td>
+Yes
+</td><td>
+Berkeley DB encodes a file ID string in the file header; WiredTiger uses
+the file name itself and expects it to be unique.
+
+</td></tr>
+<tr valign="top"><td>
+MVCC
+</td><td>
+</td><td>
+Remove MVCC's support for backing freeze files; applications must grow
+the cache if there's insufficient space for the transactional
+requirements.
+
+</td></tr>
+</table>
+
+<h3>Implementation changes:</h3>
+<ol>
+
+<p><li><b>Micro-logging</b>:
+To the extent possible, Wired Tiger should support TimesTen's notion of
+micro-logging (where threads of control do work in such a way that, if
+they fail, other threads of control can clean up without interrupting
+the overall progress of the database engine). While desirable, this is
+not a requirement of the architecture itself; it's a performance
+enhancement in that it minimizes the events which can force more or less
+expensive recovery operations to be performed.
+
+<p>
+The biggest problem for Berkeley DB was the shared regions, as they
+could be corrupted by an application thread of control dying while
+modifying the shared region, resulting in an expensive recovery period
+as well as an expensive re-creation of the large shared region. In
+Wired Tiger, the only shared regions will be the individual chunks used
+by threads of control to communicate with the threads running in the
+database engine, removing this problem entirely.
+
+<p><li><b>Utilities</b>:
+All of the Wired Tiger utilities (except for any code generation
+utility) will be implemented as thin wrappers. (In Berkeley DB,
+db_dump, db_load and db_hotbackup have far too much program logic.)
+
+<p><li><b>Exported subsystems</b>:
+Wired Tiger will not export the cache, locking, logging, mutex and
+transaction APIs in a general way. Reason: the insistence on separate
+subsystems meant we didn't pass a single object handle throughout the
+entire system. There isn't enough demand for the subsystems to make
+them available initially. That said, we will continue to rigorously
+separate the functionality to the extent possible.
+
+<p><li><b>API layering</b>:
+Wired Tiger will not require mutexes to cross the API layer. This will
+allow us to consistently use handle methods to access internal as well
+as external functions throughout the library. The "global" thread
+cookie will contain the information we need to figure out if any locks
+are needed as we cross API layers.
+
+<p><li><b>Code Generation</b>:
+Wired Tiger will support a code-generation utility early on. It will
+be a script that outputs a C-language program, supporting all features.
+
+<p><li><b>File types</b>:
+Wired Tiger will use the file name space to aggressively categorize
+storage by class: optionally taking advantage of Flash vs. RAM vs.
+durable disk writes vs. standard disk vs. network storage.
+
+<p><li><b>Hot backup</b>:
+Wired Tiger will have a hot-backup operation as part of the library.
+Hot backups can be faster if the hot backup thread writes a log record
+("starting database copy") into the log, because recovery only has to
+go back to the checkpoint preceding that record, rather than to the
+beginning of the log, when doing "catastrophic" recovery on a hot
+backup.
+
+<p>
+Additionally, hot backup should be integrated with replication: a hot
+backup is no different from the initialization of a replica.
+
+<p><li><b>Scripting language</b>:
+Use Python or Ruby as our initial scripting language, and possibly
+Erlang to write multi-threaded tests.
+
+<p><li><b>Cursor operations</b>:
+Split up the cursor get/pget functions into a group of methods, instead
+of the current complex set of flags.
+
+<p><li><b>Record numbers</b>:
+Support 64-bit logical record numbers.
+<p>
+Give the application a flag to specify that keys are numbers; getting
+this wrong is too common an error.
+<p>
+Support holes in the record number space.
+
+<p><li><b>Statistics</b>:
+Implement 64-bit statistics values.
+<p>
+Output statistics in XML, and provide a script to display them in text.
+<p>
+Store all statistics in a single chunk of space so it's easy to snapshot
+the engine's performance. (A sketch of such a statistics block follows
+this list.)
+
+<p><li><b>Log files</b>:
+Dump log files in XML and provide a script which turns them into flat
+text.
+
+<p><li><b>Compaction</b>:
+Do compaction lazily, top-down through the tree instead of bottom-up.
+
+</ol>
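The statistics item above suggests keeping every counter in one contiguous chunk of memory so engine performance can be snapshotted cheaply. A minimal sketch of that idea follows; the counter names are invented for illustration and are not taken from the commit.

```c
#include <stdint.h>
#include <string.h>

/* All statistics live in one contiguous block of 64-bit counters. */
typedef struct {
	uint64_t cache_hits;
	uint64_t cache_misses;
	uint64_t pages_read;
	uint64_t pages_written;
	uint64_t txn_commits;
	uint64_t txn_aborts;
} stats_block;

/* A point-in-time snapshot is a single copy of the whole block. */
static void
stats_snapshot(const stats_block *live, stats_block *copy)
{
	memcpy(copy, live, sizeof(*copy));
}
```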
+<h3>Documentation:</h3>
+<ol>
+
+<p><li>
+Support HTML only. Use cpp instead of m4 as our flat-text to HTML
+pre-processor.
+
+<p><li>
+Write tutorials (booklets) around the features from day 1; don't depend
+on the man pages alone. (Instead of having a Reference Guide, have
+booklets.)
+
+<p><li>
+Every man page will have a code example that can be cut-and-pasted by
+programmers.
+
+</ol>
+<h3>Do it the same (mostly):</h3>
+<ol>
+
+<ul type=disc>
+<li>Bulk get
+<li>C API (require ANSI C)
+<li>C++ API (we should support C++ hash & cursors)
+<li>Checksums
+<li>Documentation
+<li>Duplicates (both on- and off-page)
+<li>Error handling
+<li>Event notification
+<li>Group commit
+<li>Growing and shrinking the cache
+<li>Micro tests
+<li>Mutexes
+<li>Page layout
+<li>POSIX and Windows support, initially
+<li>Purify
+<li>Secondary indices, foreign key constraints
+<li>Subsystem separation (access methods, cache, lock, log, replication, txn)
+<li>Test suite (written in Python, Ruby, or C)
+</ul>
+
+<p><li><b>Replication Manager</b>:
+Wired Tiger's focus needs to be on partial and cascading replication;
+there are useful features in both.
+
+</ol>
+<h3>Don't do it at all:</h3>
+<ol>
+
+<p><li><b>Concurrent Data Store, in-memory databases</b>:
+Named, non-durable databases are the simpler solution to both problems.
+
+<p><li><b>Cryptography</b>:
+It's hard and nobody on the engineering team ever understands it. If
+the files need to be encrypted, rely on 3rd party products that encrypt
+the entire filesystem.
+ +<p><li><b>Deadlock detection</b>: +Use timeout-based deadlock detection, instead -- the graphs are +complicated and expensive to build. + +<p><li><b>Java API</b>: +Java: the Berkeley DB JE product is a better solution than JNI. + +<p><li><b>Application-specific logging</b>: +Not useful enough to replace, although we should continue to allow +applications to insert their own records into the log, just no longer +document any interface to those records during recovery (or even when +the records will be used and/or discarded). Application-written log +records will only be subsequently available by reading the log from +outside of the engine. + +<p><li><b>Hash access method</b>: +The Berkeley DB Hash access method offers one feature: a hotter cache +because fewer meta-data pages are required when the data is truly large. +This feature is less useful for Wired Tiger, given Wired Tiger's support +for larger pages, but is still useful in some cases. However, it isn't +useful enough to implement a separate access method as in Berkeley DB. +Could we do it by modifying the B+tree somehow? + +<p><li><b>Recno access method</b>: +The Berkeley DB Recno access method offers two features: immutable +record numbers and backing text files. Immutable record numbers could +be supported in a key-based B+tree where deleted records were replaced +with placeholders. Backing text files aren't useful enough to re-implement. + +<p><li><b>Queue access method</b>: +The Berkeley DB Queue access method offers two features: removal of +extent-based files as the database grows and shrinks, and record-level +locking, at the price of fixed-sized key/data items. These features +could be provided by a key-based, record-level locking B+tree, that +could be striped across multiple files (using the schema layer +database). The one problem is this will result in phantoms, which may +be a problem for queue-based applications. + +<p><li><b>Subdatabases</b>: +Subdatabases (multiple databases per file) are actually useful for the +few applications wanting to use the database name space for things like +objects (for example, a C++ compiler writing objects into databases, or +an application storing a user mailbox per database). However, they +greatly complicate database creation and deletion, not to mention +logging and recovery. It might be possible to do something cleaner than +Berkeley DB in the context of a schema file, but for now I don't think +it's worth the effort. + +<p><li><b>Db.feedback</b>: +Not useful enough to replace -- include any feedback as part of general +event notification. + +<p><li> +List of other features/support not useful enough to replace: +<ul type=disc> +<li>Apache, PHP, Perl, Tcl +<li>BREW, QNX, S60 +<li>dbm, gdbm, ndbm, hsearch, DB 1.85 APIs +<li>DB_REGISTER (it's now the default) +<li>Db.key_range +<li>Example code (replaced by documentation and code generation) +<li>Flat-text backing files +<li>RPC +<li>Replication base API +<li>SWIG +<li>Sequences +<li>System V shared memory regions +<li>XA-style distributed transactions +</ul> + +</ol> +<h3>FAQ:</h3> +<ol> +<p><li> +Why not use real database terms ("database" means environment, "table" +means database)? +<p> +Even though that's desirable, since we're continuing to use the +Berkeley DB API, it doesn't make sense to switch. + +<p><li> +Will the engine require a 64-bit type? +<p> +I guess so, this shouldn't be a problem, should it? 
+ +</ol> +<h3>References:</h3> + +<ul type="none"> + +<li><a href="http://www.cs.berkeley.edu/~brewer/cs262/Aries.pdf">ARIES: +A Transaction Recovery Method Supporting Fine-Granularity Locking and +Partial Rollbacks Using Write-Ahead Logging</a> + +<p> +<li><a href="http://Berkeley DB.cs.yale.edu/hstore/oltpperf-sigmod08.pdf">OLTP Through the Looking Glass, and What We Found There</a> + +</ul> +*/ diff --git a/docs/src/layout.dox b/docs/src/layout.dox new file mode 100644 index 00000000000..d5967ff3873 --- /dev/null +++ b/docs/src/layout.dox @@ -0,0 +1,194 @@ +/* vim: set filetype=c.doxygen : */ + +/*! \defgroup layout Wired Tiger File Format + +<p> +There are a number of page formats used in WiredTiger, each with a unique page +type. However, all of the pages store data in some specific order, and it's +simplest to think of all of the formats as sets of key/data pairs. + +<p> +Keys are either record numbers or variable-length byte strings. + +<p> +Data items are either fixed-length byte strings, variable-length byte strings +or off-page references (pointers to other pages in the file). + +<p> +WiredTiger page types are presented in the API as "row" or "column" stores. + +<p> +A "row" store is a Berkeley DB style key/data store, where both keys and data +are variable-length byte strings. A "column" store is one where the key is a +record number and the data item is either a variable- or fixed-length byte +string. We support both because row stores are faster for queries where all +columns are required by every lookup (because there's only a single set of +meta-data pages to go through, or read into the cache). Column stores are +faster for queries where only a few of the columns are required for any lookup +(because only the columns being returned are present in the cache). Column +stores also offer significant compression advantages when there is minimal +difference between adjacent column values. + +<p> +To restate, there are 2 different types of keys in WiredTiger: + +<pre> +Recno : 64-bit record number +Variable-Length : byte string + length +</pre> + +<p> +The record-number isn't stored explicitly in the page, it's implicitly stored +as part of the Btree. + +<p> +And there are 3 different types of data items in WiredTiger: + +<pre> +Fixed-Length : fixed length number of bytes +Variable-Length : byte string + length +Off-page Ref: : off-page reference (a ref to another page in the file) +</pre> + +<p> +A row store is one where there's a variable-length key associated with one or +more variable-length data items. Because record numbers are an implicit part +of the Btree, it's possible to retrieve records from a row store using record +numbers, where that's useful. In a row store, a key may reference multiple +data items (that is, duplicate data items). + +<p> +Here's the most complex possible row store layout: + +<pre> + IP + | +IP -> IP -> IP -> IP + | + LP -> LP -> LP + | + DIP + | + DIP -> DIP -> DIP + | + | + DLP -> DLP + +Key: IP Internal page + LP Leaf page + DIP Duplicate internal page + DLP Duplicate leaf page +</pre> + +<p> +This is a standard B+tree layout. One or more levels of internal pages, then +one level of leaf pages. + +<p> +A row store's internal page keys are variable-length byte strings. The internal +page data items are references to other pages (the off-page reference type). + +<p> +A row store's leaf page key and data items are both variable-length byte +strings. + +<p> +As mentioned above, it's possible to store multiple data items with a single +key. 
In this case, the key is only stored a single time. If there are enough
+duplicate data items, the duplicate data items are moved into their own
+separate Btree, referenced from the leaf page. These are the "duplicate
+internal" and "duplicate leaf" pages in the above diagram.
+
+<p>
+Duplicate internal page keys are record numbers, and the duplicate internal
+page data items are references to other pages (the off-page reference type).
+
+<p>
+Duplicate leaf page keys are record numbers, and the duplicate leaf page data
+items are variable-length strings.
+
+<p>
+Row store notes:
+<ul>
+
+<li>
+WiredTiger internal pages are used only for navigation, and may not store
+true database keys. In other words, this is a true B+tree, where keys can
+only be returned from leaf pages.
+
+<li>
+Any variable-length key or data item can be replaced with a reference to
+an overflow page. This is used whenever a key or data item won't fit onto a
+page because it's too large. However, references to overflow pages can be
+thought of as an additional step taken to get a key or data item, and not a
+fundamental layout structure.
+
+<li>
+Any variable-length key or data item can be compressed using the Huffman
+compression engine. Again, decompression is an additional step taken to get
+a key or data item, and not a fundamental layout structure issue.
+
+</ul>
+
+<p>
+This concludes the discussion of a row store.
+
+<p>
+A column store is one where a single key references a single data item,
+and that data item may be fixed- or variable-length.
+
+<p>
+Here's the most complex possible column store layout:
+
+<pre>
+        IP
+         |
+IP -> IP -> IP -> IP
+ |
+ LP -> LP -> LP
+
+Key: IP    Internal page
+     LP    Leaf page
+</pre>
+
+<p>
+A column store's internal page keys are record numbers. The internal page
+data items are references to other pages (the off-page reference type).
+
+<p>
+A column store's leaf page keys are record numbers. The leaf page data items
+are fixed- or variable-length byte strings.
+
+<p>
+Because column store keys do not disappear on delete, there needs to be an
+on-page, permanent "deleted" data value; for variable-length byte strings,
+this is a specific item type. For fixed-length byte strings, the top bit
+of the field is the deleted flag (which means that the user-specified data
+length is off by a single bit).
+
+<p>
+Column store notes:
+<ul>
+
+<li>
+Like the row store, any variable-length data item can be replaced
+with a reference to an overflow page, and can be compressed using the
+Huffman compression engine.
+
+<li>
+Any fixed-length data item can be compressed using a dictionary lookup
+or repeat count.
+
+</ul>
+
+<p>
+Q&A:
+<ul>
+
+<li>
+Why not have fixed-length data items in a row store? Because once you have
+variable-length items on the page, having fixed-length items doesn't make
+things any faster; you still have to maintain a page index.
+
+</ul>
+*/
diff --git a/docs/style/footer.html b/docs/style/footer.html
index 27f762d85d0..a80f673fe18
--- a/docs/style/footer.html
+++ b/docs/style/footer.html
@@ -1,4 +1,7 @@
-<hr class="footer"/><address class="footer"><small>
-Generated on $datetime for $projectname by <a href="http://www.doxygen.org/index.html">doxygen</a> $doxygenversion</small></address>
+<hr class="footer"/>
+<address class="footer">
+  <p>Copyright (c) 2010 WiredTiger. All rights reserved.
Contact <a href="mailto:info@wiredtiger.com">info@wiredtiger.com</a> for more information.</p> + <p><small>Generated on $datetime for $projectname by <a href="http://www.doxygen.org/index.html">doxygen</a> $doxygenversion</small></p> +</address> </body> </html> |