diff options
author | Geert Bosch <geert@mongodb.com> | 2015-04-22 14:34:20 -0400 |
---|---|---|
committer | Geert Bosch <geert@mongodb.com> | 2015-05-08 18:08:20 -0400 |
commit | 2f453786e5ff1f268da1bf1e3d050a592ad46d13 (patch) | |
tree | a8ef48a3341c0d33a8c0894fd8b5ef5fee6380ee /src/mongo/db/storage/README.md | |
parent | 24fb02a7f57a9ab690f61fec73b6a1896079ce05 (diff) | |
download | mongo-2f453786e5ff1f268da1bf1e3d050a592ad46d13.tar.gz |
SERVER-17639: Rewrite the storage README.md
Diffstat (limited to 'src/mongo/db/storage/README.md')
-rw-r--r-- | src/mongo/db/storage/README.md | 362 |
1 files changed, 121 insertions, 241 deletions
diff --git a/src/mongo/db/storage/README.md b/src/mongo/db/storage/README.md index ceab8e8b4d8..16f48f5f0ab 100644 --- a/src/mongo/db/storage/README.md +++ b/src/mongo/db/storage/README.md @@ -1,243 +1,123 @@ -Frequently Asked Questions -========================== - -This is a list of frequently asked questions relating to storage engines, nearly -all of which come from the mongodb-dev Google group. Many of the categories -overlap, so please do not pay too much attention to the section titles. - -**Q**: If I have a question about the storage API, what's the best way to get an answer? -**A**: Your best bet is to post on mongodb-dev. Everybody involved in the storage API will read the -email. - Storage Engine API ------------------- -**Q**: On InnoDB, row locks are not released until log sync is done. Will that be a -problem in the API? -**A**: It’s not a problem with the API. A storage engine implementation could log sync at the end of -every unit of work. The exact behavior depends on the durability behavior of the storage engine -implementation. Row locks will be held by MongoDB until the end of a WriteUnitOfWork (eg an -operation on one document). - -**Q**: As far as I can tell, the storage engine API allows storage engines to keep -some context in the OperationContext's RecoveryUnit, but only for updates (there -is no beginUnitOfWork equivalent for queries). This is presumably because of the -history of locking inside MongoDB, but to prepare for future storage engines, it -would be helpful to have some context that the storage engine can track across -related read operations. -**A**: We agree that it would be helpful. We haven’t gotten to this yet because our -current storage engine and isolation level don’t require this. Feel encouraged -to suggest details. - -**Q**: Does the API expect committed changes from a transaction to be visible to -others before log sync? -**A**: The API does not require a log sync along with the commit of a unit of work. -Changes should only be visible to the operation making the changes until those -changes are committed, at which point the changes should be visible to all -clients of the storage engine. - -**Q**: What is the difference between storageSize and dataSize in RecordStore? -**A**: dataSize is the size of the data stored and storageSize is the (approximate) on-disk size of -the data and metadata. Indices are not counted in storageSize; they are counted separately. - -**Q**: Is RecordStore::numRecords called frequently? Does it need to be fast? -**A**: Yes and yes. Our Rocks implementation caches numRecords and dataSize so accessing that data -is O(1) instead of O(n) - -Operation Context ------------------ -**Q**: I didn't find documentation for the transaction API, nor source code for it. -How can I find the information for it? -**A**: You can find the documentation and source code for this in -src/mongo/db/operation_context.h and src/mongo/db/operation_context_impl.cpp, -respectively. OperationContext contains many classes, in particular the Locker -and RecoveryUnit, which have their own definitions. - -**Q**: In the RecordStore::getIterator() interface, the first parameter is a -OperationContext "txn". I didn't find any documentation about how this context -should be used by a specific storage implementation. I don't find it used in -rocks implementation, nor heap1 or mmap_v1. How can I find answer to this -question? -**A**: OperationContext is all of the state for an operation: the locks, the client -information, the recovery unit, etc. In particular, a record store -implementation would probably access the lock state and the recovery unit. - Storage engines are free to do whatever they need to with the OperationContext -in order to satisfy the storage APIs. - -**Q**: About OperationContext "txn", does it mean states in operation context are only used by the -storage engines to ensure the consistency inside the storage engines, and storage engine needs to -follow no synchronization guideline because of it? To be more precise, if the same "txn" is passed -to RecordStore::getIterator() twice, does it mean the iterator should be built from the same -snapshot? -**A**: Yes. - -**Q**: What guarantees must a storage engine provide if a document is updateRecord()'d and then -returned by a getIterator()? That is, do storage engines read their own writes? -**A**: Operations should read their own writes. - - -Reading & Writing ------------------ -**Q**: Are write cursors expected to see their changes in progress? -**A**: Yes, this is currently expected. There has been some discussion about removing this -requirement. For now we are not planning on doing so. Instead, a queryable and iterable WriteBatch -class is being created to allow access to data which has not yet been committed. - -**Q**: "Read-your-writes" consistency was mentioned in Mathias's MongoWorld -presentation, but as far as I can see, the storage engine has no way to connect -a RecordStore::insertRecord call with a subsequent RecordStore::dataFor -- the -latter doesn't take an OperationContext. Is this an oversight? -**A**: Yes, this is an oversight. We’ll be doing a sweep through and fixing this -and other problems. (FYI, as an implementation side-effect, mmapv1 -automatically provides read-your-writes, which partially explains this -oversight.) - -**Q**: Storage engines need to support a "read-your-own-update.” In the storage -engine interface, which parameter/variable passes this transaction/session -information? -**A**: The OperationContext “txn”, which is also called “opCtx” in some areas, has -the information. There are probably some places we have neglected to pass this -in. - -**Q**: When executing a query, if both of indexes ("btree") and record store are -read, how do we make sure we are reading from a consistent view of them (RocksDB -or other storage)? I didn't find any handling of that from the codes. Can you -point me in a direction to look for this? -**A**: We do not handle this now, because it is not an issue with our current -(mmapv1) storage engine. We’re discussing how to solve this. We currently store -the information necessary to obtain a consistent view of the database in the -RocksRecoveryUnit class, which is itself a member of the OperationContext class. -Therefore, one possible solution would be to pass the OperationContext class to -every method which needs a consistent view of the database. We will soon merge -code which has comments mentioning every instance in rocks where this needs to -be done. - -**Q**: The RocksDB implementation of insert, update, delete methods for RecordStore add something to -a write batch. This is fast and unlikely to fail. The real work is delayed until commit. What is the -expected behavior when writing the write batch on commit fails because of a disk write timeout, disk -full or other odd error? The likely error would occur on insert or update when unique index -maintenance is done, but that has yet to be implemented. -**A**: We’re planning on implementing operation-level retry to avoid deadlocks that might otherwise -occur while processing a document. An operation could probably be similarly retried in the case of -transient errors upon commit. - -**Q**: What kind of operations might occur between prepareToYield and recoverFromYield? DDL -including drop collection? Or just other transactions? For an index scan does prepareToYield require -me to save the last key seen so that recover knows where to restart? -**A**: In 2.6, DDL could occur, as could other transactions. These methods were introduced when -operations were scheduled through “time-slicing” with voluntary lock yielding. They’re changing -meaning now that yielding is gone. We’re still figuring out what state-saving behavior we need, in -particular for operations that mutate the results of a query. The getMore implementation currently -uses them as well and will probably continue to do so. - -**Q**: Is there global locking for SortedDataInterface::isEmpty to prevent inserts from being done -when the answer is being determined? -**A**: Yes. Any caller of isEmpty must have locks to ensure that it is still true by the time it -acts on that information. - - -RecoveryUnit ------------- -**Q**: As documented I don’t understand the point of the RecoverUnit::endUnitOfWork -nesting behavior. Can you explain where it is used or will be used? -**A**: The RecoveryUnit interface and the mmapv1 (current mongodb storage engine) -implementation are both works in progress :) We’re currently adding unit of -work declarations and two phase locking. The nesting behavior currently exists -to verify that we’re adding units of work correctly and we expect to remove it -when two phase locking is completed. - -**Q**: RecoveryUnit::{beginUnitOfWork, endUnitOfWork, commitUnitOfWork} - these -return nothing. What happens on failure? With optimistic concurrency control -commit can fail in the normal case. In theory, someone might try to use -optimistic CC for a RocksDB+Mongo engine. -**A**: We’re currently not planning these interfaces with OCC (or MVCC) in mind. - Currently, if any of these fooUnitOfWork functions fail, we expect to roll back -and probably retry the operation. The interfaces are rather fluid right now and -will probably return a Status (or throw an exception) at some point. However, -rollback should never fail. - - -RocksDB -------- -**Q**: I think RocksRecoveryUnit::awaitCommit should remain in RecoveryUnit but be -renamed to ::awaitLogSync. If force log once per second is done, then this -blocks until the next time log is forced to disk. But I think we [should] be explicit -about "commit" vs "forcing redo log to storage" given that many engines -including InnoDB, RocksDB & MongoDB let commit get done without a log force. -**A**: We agree that these should be two separately defined pieces of functionality. -We’re currently discussing whether or not to expose the “forcing redo log to -storage” in the API. We also are planning on doing a renaming pass. - -**Q**: Why doesn't RocksRecoveryUnit::endUnitOfWork respect _defaultCommit before -calling commitUnitOfWork? -**A**: _defaultCommit is a temporary boolean that should disappear once we’ve fully -implemented rollbacks. - -**Q**: Why is RocksRecoveryUnit::isCommitNeeded based on size of buffered write -batch? Isn't this supposed to return TRUE If a WAL sync is needed? -**A**: This is an mmapv1-specific method that will be going away. It’s part of the -API but will be removed soon. - -**Q**: In RocksRecordStore::updateWithDamages() there is a comment "todo: this -should use the merge functionality in rocks". Can you explain more about the -motivation? Is it for atomicity, or for reducing write amplification? -**A**: We want to do this for speed, as it will allow us to avoid reading in data -from rocks, updating it in memory, and writing it back. However, due to the way -the rest of our code works, we’re not sure if this will yield much of a -performance increase. For now, we’re focusing on getting minimal functionality -working, but may benchmark this in the future. - -**Q**: How do I install RocksDB? -**A**: https://groups.google.com/forum/#!topic/mongodb-dev/ilcHAg6JgQI - -**Q**: I think RocksRecordStore::_nextId needs to be persistent. But I also think the API needs to -support logical IDs as DiskLoc is physical today and using (int,int) for file number / offset -implies that files are at most 2G. And I suspect that the conversion of it to a byte stream means -that Rocks documents are not in PK order on little endian (see _makeKey). -**A**: We're planning on pushing DiskLoc under the storage layer API (it’s really an mmapv1 data -type) and replacing it with a 64-bit RecordID type that can be converted into whatever a storage -engine requires. - -General -======= -**Q**: Does the storage engine allow for group commit? -**A**: Yes. In fact, the mmapv1 impl does group commit. - -**Q**: cleanShutdown has no return value. What is to be done on an internal error -(assert & crash)? -**A**: Yes, on internal error, assert and crash. - -**Q**: Storage engine initialization is done in the constructor. How are errors on -init to be returned or handled? -**A**: Currently all errors on storage engine init are fatal. We assume if the -storage engine can’t work, the database probably can’t work either. - -**Q**: Is Command::runAgainstRegistered() the entry point of query and update -queries? I saw these lines: - - OperationContext* noTxn = NULL; // mongos doesn't use transactions SERVER-13931 - execCommandClientBasic(noTxn, c, *client, queryOptions, ns, jsobj, anObjBuilder, false); - -Are we always passing noTxn to the query? What does it mean? -**A**: The entry point of query and updates is assembleResponse in instance.cpp. In -the case you cite, the command is being invoked from mongos, which doesn’t need -to pay attention to durability or take locks. - -**Q**: From my reading of the codes, the Cloner component is the way for MongoDB to -build a new slave from a master (called by ReplSource::resync()). If I -understand the codes correctly, cloning always does logical copy of keys -(reading keys one by one and insert them one by one). Two comments I have: - - 1. RocksDB uses LSM, which provides a good feature that you can do physical copy - of the files, which should be faster. Is there a long term plan to make - use of it? - 2. If we stick on logical copy, the best practice is to tune the RocksDB in - the new slave side to speed up the copy process: - 1. Disable WAL tune compactions - 2. Tune compaction to never happen - 3. Use vectorrep mem table - 4. Issue a full compaction and reopen the DB after cloning finishes. -We might consider to design the storage plug-in and components to use it to be -flexible enough to make it easy to make those future improvements when needed. -**A**: Offering a physical file copy for initial sync is something we're -considering for the future, but not at this time. +================== + +The purpose of the Storage Engine API is to allow for pluggable storage engines in MongoDB, see +also the [Storage FAQ][]. This document gives a brief overview of the API, and provides pointers +to places with more detailed documentation. Where referencing code, links are to the version that +was current at the time when the reference was made. Always compare with the latest version for +changes not yet reflected here. For questions on the API that are not addressed by this material, +use the [mongodb-dev][] Google group. Everybody involved in the Storage Engine API will read your +post. + +Third-party storage engines are integrated through self-contained modules that can be dropped into +an existing MongoDB source tree, and will be automatically configured and included. A typical +module would at least have the following files: + + src/ Directory with the actual source files + README.md Information specific to the storage engine + SConscript Scons build rules + build.py Module configuration script + +See <https://github.com/mongodb-partners/mongo-rocks> for a good example of the structure. + + +Concepts +-------- + +### Record Stores +A database contains one or more collections, each with a number of indexes, and a catalog listing +them. All MongoDB collections are implemented with record stores: one for the documents themselves, +and one for each index. By using the KVEngine class, you only have to deal with the abstraction, as +the KVStorageEngine] implements the StorageEngine interface, using record stores for catalogs and +indexes. + +#### Record Identities +A RecordId is a unique identifier, assigned by the storage engine, for a specific document or entry +in a record store at a given time. For storage engines based in the KVEngine the record identity is +fixed, but other storage engines, such as MMAPv1, may change it when updating a document. Note that +this can be very expensive, as indexes map to the RecordId. A single document with a large array +may have thousands of index entries, resulting in very expensive updates. + +#### Cloning and bulk operations +Currently all cloning, [initial sync][] and other operations are done in terms of operating on +individual documents, though there is a BulkBuilder class for more efficiently building indexes. + +### Locking and Concurrency +MongoDB uses multi-granular intent locking, see the [Concurrency FAQ][]. In all cases, this will +ensure that operations to meta-data, such as creation and deletion of record stores, are serialized +with respect to other accesses. Storage engines can choose to support document-level concurrency, +in which case the storage engine is responsible for any additional synchronization necessary. For +storage engines not supporting document-level concurrency, MongoDB will use shared/exclusive locks +at the collection level, so all record store accesses will be serialized. + +MongoDB uses [two-phase locking][] (2PL) to guarantee serializability of accesses to resources it +manages. For storage engines that support document level concurrency, MongoDB will only use intent +locks for the most common operations, leaving synchronization at the record store layer up to the +storage engine. + +### Transactions +Each operation creates an OperationContext with a new RecoveryUnit, implemented by the storage +engine, that lives until the operation finishes. Currently, query operations that return a cursor +to the client live as long as that client cursor, with the operation context switching between its +own recovery unit and that of the client cursor. In a few other cases an internal command may use +an extra recovery unit as well. The recovery unit must implement transaction semantics as described +below. + +#### Atomicity +Writes must only become visible when explicitly committed, and in that case all pending writes +become visible atomically. Writes that are not committed before the unit of work ends must be +rolled back. In addition to writes done directly through the Storage API, such as document updates +and creation of record stores, other custom changes can be registered with the recovery unit. + +#### Consistency +Storage engines must ensure that atomicity and isolation guarantees span all record stores, as +otherwise the guarantee of atomic updates on a document and all its indexes would be violated. + +#### Isolation +Storage engines must provide snapshot isolation, either through locking (as is the case for the +MMAPv1 engine), through multi-version concurrency control (MVCC) or otherwise. The first read +implicitly establishes the snapshot. Operations can always see all changes they make in the context +of a recovery unit, but other operations cannot until a successfull commit. + +#### Durability +Once a transaction is committed, it is not necessarily durable: if, and only if the server fails, +as result of power loss or otherwise, the database may recover to an earlier point in time. +However, atomicity of transactions must remain preserved. Similarly, in a replica set, a primary +that becomes unavailable may need to rollback to an earlier state when rejoining the replicas et, +if its changes were not yet seen by a majority of nodes. The RecoveryUnit implements methods to +allow operations to wait for their committed transactions to become durable. + +A transaction may become visible to other transactions as soon as it commits, and a storage engine +may use a group commit bundling a number of transactions to achieve durability. Alternatively, a +storage engine may wait for durability at commit time. + +### Write Conflicts +Systems with optimistic concurrency control (OCC) or multi-version concurrency control (MVCC) may +find that a transaction conflicts with other transactions, that executing an operation would result +in deadlock or violate other resource constraints. In such cases the storage engine may throw a +WriteConflictException to signal the transient failure. MongoDB will handle the exception, abort +and restart the transaction. + + +Classes to implement +-------------------- + +A storage engine should generally implement the following classes. See their definition for more +details. + +* KVEngine +* RecordStore +* RecordIterator +* RecoveryUnit +* SortedDataInterface +* ServerStatusSection +* ServerParameter + + +[Concurrency FAQ]: http://docs.mongodb.org/manual/faq/concurrency/ +[initial sync]: http://docs.mongodb.org/manual/core/replica-set-sync/#replica-set-initial-sync +[mongodb-dev]: https://groups.google.com/forum/#!forum/mongodb-dev +[replica set]: http://docs.mongodb.org/manual/replication/ +[Storage FAQ]: http://docs.mongodb.org/manual/faq/storage +[two-phase locking]: http://en.wikipedia.org/wiki/Two-phase_locking |