Helium README.

The data structures are "Helium sources" which map to one or more physical
volumes; each Helium source supports any number of "WiredTiger sources",
where a WiredTiger source is an object similar to a Btree "file:" object.
Each WiredTiger source supports any number of WiredTiger cursors.

Each Helium source is given a logical name when first referenced, and that
logical name is subsequently used when a WiredTiger source is created.  For
example, the logical name for a Helium source might be "dev1", and it would
map to the Helium volumes /dev/sd0 and /dev/sd1; subsequent
WT_SESSION.create calls specify a URI like "table:dev1/my_table".

For each WiredTiger source, we create two namespaces on the underlying
device, a "cache" and a "primary".

The cache contains key/value pairs based on updates or changes that have
been made, and includes transactional information.  So, for example, if
transaction 3 modifies key/value pair "foo/aaa", and then transaction 4
removes key "foo", then transaction 5 inserts key/value pair "foo/bbb",
the entry in the cache will look something like:

    Key:    foo
    Value:  [transaction ID 3] [aaa]
            [transaction ID 4] [remove]
            [transaction ID 5] [bbb]

Obviously, we have to marshal/unmarshal these values to/from the cache.

In contrast, the primary contains only key/value pairs known to be
committed and visible to any reader.
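
As an illustration of the marshal/unmarshal step for a cache value, here is
a minimal Python sketch.  The byte layout (8-byte transaction ID, 1-byte
remove flag, 4-byte length, then the value) and the function names are
assumptions made for this example; they are not the format the Helium data
source actually writes.

```python
import struct

# Hypothetical layout for one update: 8-byte transaction ID, 1-byte remove
# flag, 4-byte value length, then the value bytes.  The "<" prefix fixes the
# byte order so the encoding does not depend on the host machine.
_HEADER = struct.Struct("<QBI")


def marshal(updates):
    """Pack a list of (txn_id, value) updates; a value of None means remove."""
    out = bytearray()
    for txn_id, value in updates:
        data = b"" if value is None else value
        out += _HEADER.pack(txn_id, 1 if value is None else 0, len(data))
        out += data
    return bytes(out)


def unmarshal(buf):
    """Unpack bytes produced by marshal() back into (txn_id, value) pairs."""
    updates, offset = [], 0
    while offset < len(buf):
        txn_id, removed, length = _HEADER.unpack_from(buf, offset)
        offset += _HEADER.size
        value = None if removed else buf[offset:offset + length]
        offset += length
        updates.append((txn_id, value))
    return updates


# The cache entry for key "foo" from the example above.
foo_entry = [(3, b"aaa"), (4, None), (5, b"bbb")]
round_trip = unmarshal(marshal(foo_entry))
```

Here round_trip holds the same three updates as foo_entry, in the same
order.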

When an insert, update or remove is done:

    acquire a lock
    read any matching key from the cache
    check to see if the update can proceed
    append a new value for this transaction
    release the lock

When a search is done:

    if there's a matching key/value pair in the cache {
        if there's an item visible to the reading transaction
            return it
    }
    if there's a matching key/value pair in the primary {
        return it
    }

When a next/prev is done:

    move to the next/prev visible item in the cache
    move to the next/prev visible item in the primary
    return the one closest to the starting position

Locks are not acquired for read operations, and no flushes are done for any
of these operations.

We also create one additional object, the transaction name space, which
serves all of the WiredTiger and Helium objects in a WiredTiger connection.
Whenever a transaction involving a Helium source commits, we insert a commit
record into the transaction name space and flush the device.  When a
transaction rolls back, we insert an abort record into the transaction name
space, but don't flush the device.

The visibility check is slightly different from the rest of WiredTiger: we
do not reset anything when a transaction aborts, and so we have to check if
the transaction has been aborted as well as check the transaction ID for
visibility.

We create a "cleanup" thread for every underlying Helium source.  The job of
this thread is to migrate rows from the cache object into the primary.
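
As a rough illustration of the update and search steps and the visibility
check described above, here is a small in-memory Python sketch.  Everything
in it is an assumption made for the example: the class and method names, the
use of dictionaries in place of Helium namespaces, the snapshot_max
parameter, and in particular the visibility and conflict rules, which are
simplified stand-ins for WiredTiger's real ones.

```python
import threading

REMOVE = None  # hypothetical marker for a removed value


class HeliumSourceSketch:
    def __init__(self):
        self.cache = {}      # key -> list of (txn_id, value), oldest first
        self.primary = {}    # key -> value, committed and globally visible
        self.txn_state = {}  # stand-in for the transaction name space
        self.lock = threading.Lock()

    def commit(self, txn_id):
        self.txn_state[txn_id] = "commit"  # a real store would flush here

    def rollback(self, txn_id):
        self.txn_state[txn_id] = "abort"   # no flush on abort

    def visible(self, update_txn, reader_txn, snapshot_max):
        # Assumed rule: a transaction sees its own updates; other updates
        # must be committed and older than the reader's snapshot.  Aborted
        # updates stay in the cache, so the state check covers them too.
        if update_txn == reader_txn:
            return True
        if self.txn_state.get(update_txn) != "commit":
            return False
        return update_txn < snapshot_max

    def update(self, txn_id, snapshot_max, key, value):
        with self.lock:
            updates = self.cache.setdefault(key, [])
            # Assumed conflict rule: the update can proceed only if every
            # non-aborted update on the row is visible to this transaction.
            for other_txn, _ in updates:
                if self.txn_state.get(other_txn) == "abort":
                    continue
                if not self.visible(other_txn, txn_id, snapshot_max):
                    return False
            updates.append((txn_id, value))
            return True

    def search(self, txn_id, snapshot_max, key):
        # No lock is taken for reads.  Newest visible cache update wins;
        # it may be REMOVE, which reads as "not found".
        for update_txn, value in reversed(self.cache.get(key, [])):
            if self.visible(update_txn, txn_id, snapshot_max):
                return value
        return self.primary.get(key)


src = HeliumSourceSketch()
src.update(3, 3, "foo", "aaa")
src.commit(3)
src.update(4, 4, "foo", REMOVE)
src.rollback(4)
src.update(5, 5, "foo", "bbb")  # transaction 5 has not committed yet
```

In this sketch a reader other than transaction 5 skips both the uncommitted
update and the aborted remove, and falls back to the value written by
transaction 3.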
Any committed, globally visible change in the cache can be copied into the
primary and removed from the cache:

    set BaseTxnID to the oldest transaction ID
        not yet visible to a running transaction

    for each row in the cache:
        if all of the updates are less than BaseTxnID
            copy the last update to the primary

    flush the primary to stable storage

    lock the cache
    for each row in the cache:
        if all of the updates are less than BaseTxnID
            remove the row from the cache
    unlock the cache

    for each row in the transaction store:
        if the transaction ID is less than BaseTxnID
            remove the row

We only need to lock the cache when removing rows; the initial copy to the
primary does not require locks because only the cleanup thread ever writes
to the primary.

No lock is required when removing rows from the transaction store: once the
transaction ID is less than the BaseTxnID, it will never be read.

Helium recovery is almost identical to the cleanup thread, which migrates
rows from the cache into the primary.  For every cache/primary pair, migrate
every commit to the primary (by definition, at recovery time it must be
globally visible), and discard everything else (by definition, at recovery
time anything not committed has been aborted).

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions, problems, whatever:

* The implementation is endian-specific, that is, the WiredTiger metadata
stored on the Helium device is not portable to a big-endian machine.
Helium's metadata is portable between different endian machines, so this
should probably be fixed.

* There's a problem with transactions in WiredTiger that span more than a
single data source.  For example, consider a transaction that modifies both
a Helium object and a Btree object.  If we commit and push the Helium commit
record to stable storage, and then crash before committing the Btree change,
the enclosing WiredTiger transaction will/should end up aborting, and
there's no way for us to back out the change in Helium.
I'm leaving this problem alone until WiredTiger fine-grained durability is
complete; we're going to need WiredTiger support for some kind of 2PC to
solve this.

* If a record in the cache gets too busy, we could end up unable to remove
it (there would always be an active transaction), and it would grow forever.
I suspect the solution is to clean it up when we realize we can't remove it,
that is, we can rebuild the record, discarding the no-longer-needed entries,
even if the record can't be entirely discarded.
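
To make the last suggestion concrete, here is a minimal Python sketch of
rebuilding a busy cache record.  It is not part of the implementation.  The
function name, its arguments, and the rule for which entries are "no longer
needed" are all assumptions: the sketch drops aborted updates, and drops
committed updates older than BaseTxnID when a newer committed update also
older than BaseTxnID follows them.

```python
def rebuild_record(updates, txn_state, base_txn_id):
    """Return a trimmed copy of one cache record.

    updates: list of (txn_id, value) pairs, oldest first.
    txn_state: maps txn_id to "commit" or "abort"; missing means unresolved.
    base_txn_id: oldest transaction ID not yet visible to a running
        transaction.
    """
    # Aborted updates can never become visible, so drop them first.
    live = [(t, v) for t, v in updates if txn_state.get(t) != "abort"]

    # Find the newest committed update that every reader can already see.
    newest_global = None
    for index, (txn_id, _) in enumerate(live):
        if txn_state.get(txn_id) == "commit" and txn_id < base_txn_id:
            newest_global = index
    if newest_global is None:
        return live

    # Older globally visible updates are shadowed for all readers; keep
    # everything else, including updates a running transaction may need.
    trimmed = []
    for index, (txn_id, value) in enumerate(live):
        shadowed = (index < newest_global
                    and txn_state.get(txn_id) == "commit"
                    and txn_id < base_txn_id)
        if not shadowed:
            trimmed.append((txn_id, value))
    return trimmed


record = [(3, "aaa"), (4, None), (5, "bbb"), (9, "ccc")]
states = {3: "commit", 4: "abort", 5: "commit"}
trimmed_record = rebuild_record(record, states, base_txn_id=8)
```

With these inputs the record shrinks to the update from transaction 5 plus
the unresolved update from transaction 9, so the row stops growing even
though it cannot yet be removed from the cache.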