Diffstat (limited to 'bdb/db/Design.fileop')
-rw-r--r--bdb/db/Design.fileop452
1 files changed, 452 insertions, 0 deletions
diff --git a/bdb/db/Design.fileop b/bdb/db/Design.fileop
new file mode 100644
index 00000000000..187f1ffaf22
--- /dev/null
+++ b/bdb/db/Design.fileop
@@ -0,0 +1,452 @@
+# $Id: Design.fileop,v 11.4 2000/02/19 20:57:54 bostic Exp $
+
+The design of file operation recovery.
+
+Keith has asked me to write up notes on our current status of database
+create and delete and recovery, why it's so hard, and how we've violated
+all the cornerstone assumptions on which our recovery framework is based.
+
+I am including two documents at the end of this one. The first is the
+initial design of the recoverability of file create and delete (there is
+no talk of subdatabases there, because we didn't think we'd have to do
+anything special there). I will annotate this document on where things
+changed.
+
+The second is the design of recd007 which is supposed to test our ability
+to recover these operations regardless of where one crashes. This test
+is fundamentally different from our other recovery tests in the following
+manner. Normally, the application controls transaction boundaries.
+Therefore, we can perform an operation and then decide whether to commit
+or abort it. In the normal recovery tests, we force the database into
+each of the four possible states from a recovery perspective:
+
+ database is pre-op, undo (do nothing)
+ database is pre-op, redo
+ database is post-op, undo
+ database is post-op, redo (do nothing)
+
+By copying databases at various points and initiating txn_commit and abort
+appropriately, we can make all these things happen. Notice that the one
+case we don't handle is where page A is in one state (e.g., pre-op) and
+page B is in another state (e.g., post-op). I will argue that these don't
+matter because each page is recovered independently. If anyone can poke
+holes in this, I'm interested.
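+
+A minimal sketch of the four-state table above, assuming a single
+operation on a single page. The enum and function names are hypothetical
+and are not part of the library; the point is only that two of the four
+combinations require work and two are no-ops.
+

```c
#include <stdbool.h>

/* Hypothetical names: state of the on-disk data relative to one operation. */
enum db_state { STATE_PRE_OP, STATE_POST_OP };

/* What recovery was asked to do with the operation's log record. */
enum rec_action { ACTION_UNDO, ACTION_REDO };

/*
 * Return true if recovery has to modify the database, false for the
 * two "do nothing" combinations in the table.
 */
bool
recovery_must_work(enum db_state state, enum rec_action action)
{
	/* Pre-op data only needs a redo; post-op data only needs an undo. */
	if (state == STATE_PRE_OP)
		return (action == ACTION_REDO);
	return (action == ACTION_UNDO);
}
```

+
+A recovery test that covers all four cases would call the recovery
+routine once for each (state, action) pair and check that the database
+changes only in the two cases where this function returns true.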
+
+The problem with create/delete recovery testing is that the transaction
+is begun and ended all inside the library. Therefore, there is never any
+point (outside the library) where we can copy files and/or initiate
+abort/commit. In order to still put the recovery code through its paces,
+Sue designed an infrastructure that lets you tell the library where to
+make copies of things and where to suddenly inject errors so that the
+transaction gets aborted. This level of detail allows us to push the
+create/delete recovery code through just about every recovery path
+possible (although I'm sure Mike will tell me I'm wrong when he starts to
+run code coverage tools).
+
+OK, so that's all preamble and a brief discussion of the documents I'm
+enclosing.
+
+Why was this so hard and painful and why is the code so Q@#$!% complicated?
+The following is a discussion/explanation, but to the best of my knowledge,
+the structure we have in place now works. The key question we need to be
+asking is, "Does this need to be so complex or should we redesign
+portions to simplify it?" At this point, there is no obvious way to simplify
+it in my book, but I may be having difficulty seeing this because my mind is
+too polluted at this point.
+
+Our overall strategy for recovery is write-ahead logging: we log an
+operation and make sure the log record is on disk before any data that
+the log record describes reaches disk.
+Typically we use log sequence numbers (LSNs) to mark the data so that
+during recovery, we can look at the data and determine if it is in a
+state before a particular log record or after a particular log record.
+
+In the good old days, opens were not transaction protected, so we could
+do regular old opens during recovery and if the file existed, we opened
+it and if it didn't (or appeared corrupt), we didn't and treated it like
+a missing file. As will be discussed below in detail, our states are much
+more complicated and recovery can't make such simplistic assumptions.
+
+Also, since we are now dealing with file system operations, we have less
+control over when they actually happen and what the state of the system
+can be. That is, we have to write create log records synchronously, because
+the create/open system call may force a newly created (0-length) file to
+disk. This file has to now be identified as being in the "being-created"
+state.
+
+A. We used to make a number of assumptions during recovery:
+
+1. We could call db_open at any time and one of three things would happen:
+ a) the file would be opened cleanly
+ b) the file would not exist
+ c) we would encounter an error while opening the file
+
+Case a posed no difficulty.
+In Case b, we simply spit out a warning that a file was missing and then
+ ignored all subsequent operations to that file.
+In Case c, we reported a fatal error.
+
+2. We can always generate a warning if a file is missing.
+
+3. We never encounter NULL file names in the log.
+
+B. We also made some assumptions in the main-line library:
+
+1. If you try to open a file and it exists but is 0-length, then
+someone else is trying to open it.
+
+2. You can write pages anywhere in a file and any non-existent pages
+are 0-filled. [This breaks on Windows.]
+
+3. If you have proper permissions then you can always evict pages from
+the buffer pool.
+
+4. During open, we can close the master database handle as soon as
+we're done with it since all the rest of the activity will take place
+on the subdatabase handle.
+
+In our brave new world, most of these assumptions are no longer valid.
+Let's address them one at a time.
+
+A.1 We could call db_open at any time and one of three things would happen:
+ a) the file would be opened cleanly
+ b) the file would not exist
+ c) we would encounter an error while opening the file
+There are now additional states. Since we are trying to make file
+operations recoverable, you can now die in the middle of such an
+operation and we have to be able to pick up the pieces. What this
+now means is that:
+
+ * a 0-length file can be an indication of a create in-progress
+ * you can have a meta-data page but no root page (of a btree)
+ * if a file doesn't exist, it could mean that it was just about
+ to be created and needs to be rolled forward.
+ * if you encounter an error in a file (e.g., the meta-data page
+ is all 0's) you could still be in mid-open.
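+
+One way to picture the additional states is as a classification over
+what an open finds on disk. The following is a sketch only: the enum,
+the struct and the function are hypothetical, and it assumes (as a
+simplification) that a btree's root page is the page immediately after
+the meta-data page.
+

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical classification of what an open might find on disk. */
enum file_state {
	FILE_MISSING,			/* absent: maybe a create to roll forward */
	FILE_CREATE_IN_PROGRESS,	/* exists but is 0-length */
	FILE_META_UNWRITTEN,		/* meta-data page is all 0's: mid-open? */
	FILE_META_ONLY,			/* meta-data page but no btree root page */
	FILE_COMPLETE
};

/* Hypothetical summary of what was observed; filled in by the caller. */
struct file_probe {
	bool exists;
	size_t length;		/* file length in bytes */
	size_t pagesize;
	bool meta_all_zero;	/* first page read back as all 0's */
	bool is_btree;
};

enum file_state
classify_file(const struct file_probe *p)
{
	if (!p->exists)
		return (FILE_MISSING);
	if (p->length == 0)
		return (FILE_CREATE_IN_PROGRESS);
	if (p->meta_all_zero)
		return (FILE_META_UNWRITTEN);
	/* Assumption: the root page follows the meta-data page directly. */
	if (p->is_btree && p->length < 2 * p->pagesize)
		return (FILE_META_ONLY);
	return (FILE_COMPLETE);
}
```

+
+None of the first four results can be treated as a plain "missing" or
+"corrupt" file any more; each may be a legitimate intermediate state
+that recovery has to roll forward or back.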
+
+I have now made this all work, but it required significant changes to the
+db_open code and error handling and this is the sort of change that makes
+everyone nervous.
+
+A.2. We can always generate a warning if a file is missing.
+
+Now that we have a delete file method in the API, we need to make sure
+that we do not generate warning messages for files that don't exist if
+we see that they were explicitly deleted.
+
+This means that we need to save state during recovery, determine which
+files were missing and were not being recreated and were not deleted and
+only complain about those.
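+
+The state that has to be saved could look roughly like the table below.
+This is a sketch under assumptions: the structure, the fixed-size table
+and the function names are all hypothetical, and the real recovery code
+may track this quite differently.
+

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_TRACKED 64	/* arbitrary bound for the sketch */

/* Hypothetical per-file state accumulated while scanning the log. */
struct file_note {
	const char *name;	/* not copied; caller keeps it alive */
	bool missing;		/* an open during recovery found no file */
	bool recreated;		/* a create record will roll the file forward */
	bool deleted;		/* a delete record explains the absence */
};

struct recovery_notes {
	struct file_note files[MAX_TRACKED];
	size_t count;
};

/* Find the note for name, adding one if needed; NULL if the table is full. */
struct file_note *
note_for(struct recovery_notes *n, const char *name)
{
	size_t i;

	for (i = 0; i < n->count; i++)
		if (strcmp(n->files[i].name, name) == 0)
			return (&n->files[i]);
	if (n->count == MAX_TRACKED)
		return (NULL);
	n->files[n->count] = (struct file_note){ name, false, false, false };
	return (&n->files[n->count++]);
}

/* Only files that are missing with no explanation deserve a warning. */
bool
should_warn(const struct file_note *f)
{
	return (f->missing && !f->recreated && !f->deleted);
}
```

+
+At the end of recovery, one pass over the table with should_warn()
+yields exactly the files worth complaining about.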
+
+A.3. We never encounter NULL file names in the log.
+
+Now that we allow transaction protection on memory-resident files, we write
+log messages for files with NULL file names. This means that our assumption
+of always being able to call "db_open" on any log_register OPEN message found
+in the log is no longer valid.
+
+B.1. If you try to open a file and it exists but is 0-length, then
+someone else is trying to open it.
+
+As discussed for A.1, this is no longer true. It may be instead that you
+are in the process of recovering a create.
+
+B.2. You can write pages anywhere in a file and any non-existent pages
+are 0-filled.
+
+It turns out that this is not true on Windows. This means that places
+we do group allocation (hash) must explicitly allocate each page, because
+we can't count on recognizing the uninitialized pages later.
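+
+A minimal sketch of explicit allocation, using standard C stdio rather
+than the library's own __os_* layer. The function name is hypothetical,
+zero-filling stands in for whatever page initialization is appropriate,
+and the sketch assumes that page `first` is at the current end of the
+file.
+

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Write npages zero-filled pages starting at page first, one page at a
 * time, instead of seeking past end-of-file and writing only the last
 * page.  Returns 0 on success, -1 on failure.
 */
int
alloc_pages(FILE *fp, long first, long npages, size_t pagesize)
{
	void *zero;
	long i;
	int ret = 0;

	if ((zero = calloc(1, pagesize)) == NULL)
		return (-1);
	if (fseek(fp, first * (long)pagesize, SEEK_SET) != 0)
		ret = -1;
	/* Every page in the group is written, so none is left undefined. */
	for (i = 0; ret == 0 && i < npages; i++)
		if (fwrite(zero, pagesize, 1, fp) != 1)
			ret = -1;
	free(zero);
	return (ret);
}
```

+
+The cost is one write per page in the group, in exchange for not
+depending on how the file system fills the gap.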
+
+B.3. If you have proper permissions then you can always evict pages from
+the buffer pool.
+
+In the brave new world though, files can be deleted and they may
+have pages in the mpool. If you later try to evict these, you
+discover that the file doesn't exist. We'd get here when we had
+to dirty pages during a remove operation.
+
+B.4. During open, we can close the master database handle as soon as
+we're done with it.
+
+However, if the file takes part in the open/remove transaction,
+then we had better not close it until after the transaction
+commits/aborts, because we need to be able to get our hands on the
+dbp and the open happened in a different transaction.
+
+=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
+Design for recovering file create and delete in the presence of subdatabases.
+
+Assumptions:
+ Remove the O_TRUNCATE flag.
+ Single-thread all open/create/delete operations.
+ (Well, almost all; we'll optimize opens without DB_CREATE set.)
+ The reasoning for this is that with two simultaneous
+ open/creators, during recovery, we cannot identify which
+ transaction successfully created files and therefore cannot
+ recover correctly.
+ File system creates/deletes are synchronous
+ Once the file is open, subdatabase creates look like regular
+ get/put operations and a metadata page creation.
+
+There are 4 cases to deal with:
+ 1. Open/create file
+ 2. Open/create subdatabase
+ 3. Delete
+ 4. Recovery records
+
+ __db_fileopen_recover
+ __db_metapage_recover
+ __db_delete_recover
+ existing c_put and c_get routines for subdatabase creation
+
+ Note that the open/create of the file and the open/create of the
+ subdatabase need to be in the same transaction.
+
+1. Open/create (full file and subdb version)
+
+If create
+ LOCK_FILEOP
+ txn_begin
+ log create message (open message below)
+ do file system open/create
+ if we did not create
+ abort transaction (before going to open_only)
+ if (!subdb)
+ set dbp->open_txn = NULL
+ else
+ txn_begin a new transaction for the subdb open
+
+ construct meta-data page
+ log meta-data page (see metapage)
+ write the meta-data page
+ * It may be the case that btrees need to log both meta-data pages
+ and root pages. If that is the case, I believe that we can use
+ this same record and recovery routines for both
+
+ txn_commit
+ UNLOCK_FILEOP
+
+2. Delete
+ LOCK_FILEOP
+ txn_begin
+ log delete message (delete message below)
+ mv file __db.file.lsn
+ txn_commit
+ unlink __db.file.lsn
+ UNLOCK_FILEOP
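+
+The ordering in the delete sequence can be sketched with standard C
+rename() and remove(). Everything named here is hypothetical: the
+fileop_hooks structure and the trivial stand-in functions replace the
+real logging and transaction calls, locking is omitted, and the abort
+path restores the name inline rather than through a recovery routine.
+

```c
#include <stdio.h>

/* Hypothetical stand-ins for the logging and transaction calls. */
struct fileop_hooks {
	int (*log_delete)(const char *name);	/* must be on disk first */
	int (*commit)(void);
};

/* Trivial stand-ins for trying the sketch out. */
int log_ok(const char *name) { (void)name; return (0); }
int commit_ok(void) { return (0); }
int commit_fail(void) { return (-1); }

/*
 * Log the delete, rename the file to its backup name, commit, and only
 * then unlink the backup.  Until the commit, the file still exists
 * under the backup name and the delete can be undone by renaming back.
 */
int
remove_file(const char *name, const char *backup,
    const struct fileop_hooks *h)
{
	if (h->log_delete(name) != 0)
		return (-1);
	if (rename(name, backup) != 0)
		return (-1);
	if (h->commit() != 0) {
		/* Abort: put the file back under its original name. */
		(void)rename(backup, name);
		return (-1);
	}
	return (remove(backup));
}
```

+
+The rename is what makes the delete reversible: nothing is destroyed
+until after the commit.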
+
+3. Recovery Routines
+
+__db_fileopen_recover
+ if (argp->name.size == 0)
+ done;
+
+ if (redo) /* Commit */
+ __os_open(argp->name, DB_OSO_CREATE, argp->mode, &fh)
+ __os_closehandle(fh)
+ if (undo) /* Abort */
+ if (argp->name exists)
+ unlink(argp->name);
+
+__db_metapage_recover
+ if (redo)
+ __os_open(argp->name, 0, 0, &fh)
+ __os_lseek(meta data page)
+ __os_write(meta data page)
+ __os_closehandle(fh);
+ if (undo)
+ done = 0;
+ if (argp->name exists)
+ if (length of file argp->name != 0)
+ __os_open(argp->name, 0, 0, &fh)
+ __os_lseek(meta data page)
+ __os_read(meta data page)
+ if (read succeeds && page lsn != current_lsn)
+ done = 1
+ __os_closehandle(fh);
+ if (!done)
+ unlink(argp->name)
+
+__db_delete_recover
+ if (redo)
+ Check if the backup file still exists and if so, delete it.
+
+ if (undo)
+ if (__db_appname(__db.file.lsn) exists)
+ mv __db_appname(__db.file.lsn) __db_appname(file)
+
+__db_metasub_recover
+ /* This is like a normal recovery routine */
+ Get the metadata page
+ if (cmp_n && redo)
+ copy the log page onto the page
+ update the lsn
+ make sure page gets put dirty
+ else if (cmp_p && undo)
+ update the lsn to the lsn in the log record
+ make sure page gets put dirty
+
+ if the page was modified, put it back dirty
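+
+A sketch of the LSN test that a "normal" page recovery routine
+performs. The types and names are hypothetical, and the sketch spells
+the conditions out as explicit equality tests (page is pre-op, page is
+post-op) rather than using the cmp_n/cmp_p shorthand above. As in the
+pseudocode, undo resets only the LSN.
+

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 64	/* tiny page for the sketch */

/* Hypothetical two-part LSN: log file number and offset within it. */
struct lsn { uint32_t file, offset; };

/* Hypothetical page: an LSN header followed by payload bytes. */
struct page { struct lsn lsn; unsigned char data[PAGE_SIZE]; };

/* Hypothetical log record for a meta-data page write. */
struct meta_rec {
	struct lsn lsn;		/* LSN of this log record */
	struct lsn prev_lsn;	/* page LSN before the operation */
	struct page image;	/* page contents after the operation */
};

int
lsn_cmp(const struct lsn *a, const struct lsn *b)
{
	if (a->file != b->file)
		return (a->file < b->file ? -1 : 1);
	if (a->offset != b->offset)
		return (a->offset < b->offset ? -1 : 1);
	return (0);
}

/* Apply redo or undo to one page; returns true if the page was modified. */
bool
metasub_recover(struct page *pg, const struct meta_rec *rec, bool redo)
{
	bool pre_op = lsn_cmp(&pg->lsn, &rec->prev_lsn) == 0;
	bool post_op = lsn_cmp(&pg->lsn, &rec->lsn) == 0;

	if (redo && pre_op) {
		*pg = rec->image;	/* copy the logged page onto the page */
		pg->lsn = rec->lsn;
		return (true);
	}
	if (!redo && post_op) {
		pg->lsn = rec->prev_lsn;
		return (true);
	}
	return (false);			/* one of the "do nothing" cases */
}
```

+
+A caller would put the page back dirty only when the function returns
+true.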
+
+In db.src
+
+# name: filename (before call to __db_appname)
+# mode: file system mode
+BEGIN open
+DBT name DBT s
+ARG mode u_int32_t o
+END
+
+# opcode: indicate if it is a create/delete and if it is a subdatabase
+# pgsize: page size on which we're going to write the meta-data page
+# pgno: page number on which to write this meta-data page
+# page: the actual meta-data page
+# lsn: LSN of the meta-data page -- 0 for new databases, may be non-0
+# for subdatabases.
+
+BEGIN metapage
+ARG opcode u_int32_t x
+DBT name DBT s
+ARG pgno db_pgno_t d
+DBT page DBT s
+POINTER lsn DB_LSN * lu
+END
+
+# We do not need a subdatabase name here because removing a subdatabase
+# name is simply a regular bt_delete operation from the master database.
+# It will get logged normally.
+# name: filename
+BEGIN delete
+DBT name DBT s
+END
+
+# We also need to reclaim pages, but we can use the existing
+# bt_pg_alloc routines.
+
+=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
+Testing recoverability of create/delete.
+
+These tests are unlike other tests in that they are going to
+require hooks in the library. The reason is that the create
+and delete calls are internally wrapped in a transaction, so
+that if the call returns, the transaction has already either
+committed or aborted. Using only that interface limits what
+kind of testing we can do. To match our other recovery testing
+efforts, we need to add hooks to trigger aborts at particular
+times in the create/delete path.
+
+The general recovery testing strategy is that we wish to
+execute every path through every recovery routine. That
+means that we try to:
+ catch each operation in its pre-operation state
+ call the recovery function with redo
+ call the recovery function with undo
+ catch each operation in its post-operation state
+ call the recovery function with redo
+ call the recovery function with undo
+
+In addition, there are a few critical points in the create and
+delete path that we want to make sure we capture.
+
+1. Test Structure
+
+The test structure should be similar to the existing recovery
+tests. We will want to have a structure in place where we
+can execute different commands:
+ create a file/database
+ create a file that will contain subdatabases.
+ create a subdatabase
+ remove a subdatabase (that contains valid data)
+ remove a subdatabase (that does not contain any data)
+ remove a file that used to contain subdatabases
+ remove a file that contains a database
+
+The tricky part is capturing the state of the world at the
+various points in the create/delete process.
+
+The critical points in the create process are:
+
+ 1. After we've logged the create, but before we've done anything.
+ in db/db.c
+ after the open_retry
+ after the __crdel_fileopen_log call (and before we've
+ called __os_open).
+
+ 2. Immediately after the __os_open
+
+ 3. Immediately after each __db_log_page call
+ in bt_open.c
+ log meta-data page
+ log root page
+ in hash.c
+ log meta-data page
+
+ 4. With respect to the log records above, shortly after each
+ log write is a memp_fput. We need to do a sync after
+ each memp_fput and trigger a point after that sync.
+
+The critical points in the remove process are:
+
+ 1. Right after the crdel_delete_log in db/db.c
+
+ 2. Right after the __os_rename call (below the crdel_delete_log)
+
+ 3. After the __db_remove_callback call.
+
+I believe that these are the places where we'll need some sort of hook.
+
+2. Adding hooks to the library.
+
+The hooks need two components. One component is to capture the state of
+the database at the hook point and the other is to trigger a txn_abort at
+the hook point. The second part is fairly trivial.
+
+The first part requires more thought. Let me explain what we do in a
+"normal" recovery test. In a normal recovery test, we save an initial
+copy of the database (this copy is called init). Then we execute one
+or more operations. Then, right before the commit/abort, we sync the
+file, and save another copy (the afterop copy). Finally, we call txn_commit
+or txn_abort, sync the file again, and save the database one last time (the
+final copy).
+
+Then we run recovery. The first time, this should be a no-op, because
+we've either committed the transaction and are checking to redo it or
+we aborted the transaction, undid it on the abort and are checking to
+undo it again.
+
+We then run recovery again on whatever database will force us through
+the path that requires work. In the commit case, this means we start
+with the init copy of the database and run recovery. This pushes us
+through all the redo paths. In the abort case, we start with the afterop
+copy which pushes us through all the undo cases.
+
+In some sense, we're asking the create/delete test to be more exhaustive
+by defining all the trigger points, but I think that's the correct thing
+to do, since the create/delete is not initiated by a user transaction.
+
+So, what do we have to do at the hook points?
+ 1. sync the file to disk.
+ 2. save the file itself
+ 3. save any files named __db_backup_name(name, &backup_name, lsn)
+ Since we may not know the right lsns, I think we should save
+ every file of the form __db.name.0xNNNNNNNN.0xNNNNNNNN into
+ some temporary files from which we can restore it to run
+ recovery.
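+
+Recognizing backup files without knowing the LSN could be done with a
+pattern check along these lines. This is a sketch: the function names
+are hypothetical, and the only thing taken from the text is the
+__db.name.0xNNNNNNNN.0xNNNNNNNN form, read here as exactly eight hex
+digits per component.
+

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Return true if s starts with "0x" followed by eight hex digits. */
static bool
is_hex8(const char *s)
{
	int i;

	if (s[0] != '0' || s[1] != 'x')
		return (false);
	for (i = 2; i < 10; i++)
		if (!isxdigit((unsigned char)s[i]))
			return (false);
	return (true);
}

/*
 * Return true if file has the form __db.<name>.0xNNNNNNNN.0xNNNNNNNN
 * for the given database name, whatever the two LSN components are.
 */
bool
is_backup_of(const char *file, const char *name)
{
	size_t nlen = strlen(name);
	const char *p = file;

	if (strncmp(p, "__db.", 5) != 0)
		return (false);
	p += 5;
	if (strncmp(p, name, nlen) != 0)
		return (false);
	p += nlen;
	if (*p != '.' || !is_hex8(p + 1))
		return (false);
	p += 11;		/* '.' plus "0x" plus eight digits */
	if (*p != '.' || !is_hex8(p + 1))
		return (false);
	return (p[11] == '\0');	/* nothing may follow the second component */
}
```

+
+A hook could scan the directory and save every entry for which
+is_backup_of() returns true.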
+
+3. Putting it all together
+
+So, the three pieces are writing the test structure, putting in the hooks
+and then writing the recovery portions so that we restore the right thing
+that the hooks saved in order to initiate recovery.
+
+Some of the technical issues that need to be solved are:
+ How does the hook code become active (i.e., we don't
+ want it in there normally, but it's got to be
+ there when you configure for testing)?
+ How do you (the test) tell the library that you want a
+ particular hook to abort?
+ How do you (the test) tell the library that you want the
+ hook code doing its copies (do we really want
+ *every* test doing these copies during testing?
+ Maybe it's not a big deal, but maybe it is; we
+ should at least think about it).
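+
+One possible shape for an answer to these questions is sketched below.
+It is not a description of what the library does: the hook names, the
+test_config structure and the FILEOP_TEST_HOOKS macro are all
+hypothetical. The idea is that the hook call compiles away unless the
+build is configured for testing, and that the test selects one hook to
+abort at and one to copy at, so that other tests pay nothing.
+

```c
/* Hypothetical hook points, following the critical points listed above. */
enum test_hook {
	HOOK_NONE = 0,
	HOOK_POST_LOG,		/* after the create/delete record is logged */
	HOOK_POST_OPEN,		/* immediately after the file system open */
	HOOK_POST_LOGMETA,	/* after logging a meta-data or root page */
	HOOK_POST_SYNC,		/* after the sync that follows memp_fput */
	HOOK_POST_RENAME	/* after the rename in the remove path */
};

/* Hypothetical per-environment test settings, chosen by the test. */
struct test_config {
	enum test_hook abort_at;	/* hook that should inject an error */
	enum test_hook copy_at;		/* hook that should copy the files */
	int copies_made;		/* counter standing in for the copy */
};

/*
 * Run one hook point.  Returns nonzero if the caller should fail the
 * operation so that the enclosing transaction aborts.
 */
int
test_hook_run(struct test_config *cfg, enum test_hook here)
{
	if (cfg->copy_at == here)
		cfg->copies_made++;	/* real code would sync and copy here */
	return (cfg->abort_at == here);
}

/* Compiled in only when configured for testing; otherwise a no-op. */
#ifdef FILEOP_TEST_HOOKS
#define	TEST_HOOK(cfg, here)	test_hook_run((cfg), (here))
#else
#define	TEST_HOOK(cfg, here)	0
#endif
```

+
+With both fields left at HOOK_NONE, a test build behaves like a normal
+build apart from the cost of the comparisons.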