diff --git a/bdb/db/Design.fileop b/bdb/db/Design.fileop
deleted file mode 100644
index 187f1ffaf22..00000000000

# $Id: Design.fileop,v 11.4 2000/02/19 20:57:54 bostic Exp $

The design of file operation recovery.

Keith has asked me to write up notes on the current status of database
create/delete recovery: why it is so hard, and how we have violated all
the cornerstone assumptions on which our recovery framework is based.

I am including two documents at the end of this one. The first is the
initial design of the recoverability of file create and delete (there is
no talk of subdatabases there, because we didn't think we'd have to do
anything special for them). I will annotate that document where things
changed.

The second is the design of recd007, which is supposed to test our ability
to recover these operations regardless of where one crashes. This test
is fundamentally different from our other recovery tests in the following
manner. Normally, the application controls transaction boundaries.
Therefore, we can perform an operation and then decide whether to commit
or abort it. In the normal recovery tests, we force the database into
each of the four possible states from a recovery perspective:

	database is pre-op, undo (do nothing)
	database is pre-op, redo
	database is post-op, undo
	database is post-op, redo (do nothing)

By copying databases at various points and initiating txn_commit and abort
appropriately, we can make all these things happen. Notice that the one
case we don't handle is where page A is in one state (e.g., pre-op) and
page B is in another state (e.g., post-op). I will argue that these don't
matter because each page is recovered independently. If anyone can poke
holes in this, I'm interested.

The problem with create/delete recovery testing is that the transaction
is begun and ended entirely inside the library.
Therefore, there is never any
point (outside the library) where we can copy files and/or initiate
abort/commit. In order to still put the recovery code through its paces,
Sue designed an infrastructure that lets you tell the library where to
make copies of things and where to suddenly inject errors so that the
transaction gets aborted. This level of detail allows us to push the
create/delete recovery code through just about every recovery path
possible (although I'm sure Mike will tell me I'm wrong when he starts to
run code coverage tools).

OK, so that's all preamble and a brief discussion of the documents I'm
enclosing.

Why was this so hard and painful, and why is the code so Q@#$!% complicated?
The following is a discussion/explanation, but to the best of my knowledge,
the structure we have in place now works. The key question we need to be
asking is, "Does this need to be so complex, or should we redesign portions
to simplify it?" At this point, there is no obvious way to simplify it in
my book, but I may be having difficulty seeing this because my mind is too
polluted at this point.

Our overall strategy for recovery is write-ahead logging: we log an
operation and make sure the log record is on disk before any of the data
it describes is on disk. Typically we use log sequence numbers (LSNs) to
mark the data so that during recovery, we can look at the data and
determine whether it is in a state before or after a particular log
record.

In the good old days, opens were not transaction protected, so we could
do regular old opens during recovery: if the file existed, we opened it,
and if it didn't (or appeared corrupt), we didn't, and treated it like a
missing file. As will be discussed below in detail, our states are now
much more complicated, and recovery can't make such simplistic assumptions.

Also, since we are now dealing with file system operations, we have less
control over when they actually happen and what the state of the system
can be. That is, we have to write create log records synchronously, because
the create/open system call may force a newly created (0-length) file to
disk. This file now has to be identified as being in the "being-created"
state.

A. We used to make a number of assumptions during recovery:

1. We could call db_open at any time and one of three things would happen:
	a) the file would be opened cleanly
	b) the file would not exist
	c) we would encounter an error while opening the file

Case a posed no difficulty.
In case b, we simply spat out a warning that a file was missing and then
	ignored all subsequent operations to that file.
In case c, we reported a fatal error.

2. We can always generate a warning if a file is missing.

3. We never encounter NULL file names in the log.

B. We also made some assumptions in the main-line library:

1. If you try to open a file and it exists but is 0-length, then
someone else is trying to open it.

2. You can write pages anywhere in a file, and any non-existent pages
are 0-filled. [This breaks on Windows.]

3. If you have proper permissions, then you can always evict pages from
the buffer pool.

4. During open, we can close the master database handle as soon as
we're done with it, since all the rest of the activity will take place
on the subdatabase handle.

In our brave new world, most of these assumptions are no longer valid.
Let's address them one at a time.

A.1. We could call db_open at any time and one of three things would happen:
	a) the file would be opened cleanly
	b) the file would not exist
	c) we would encounter an error while opening the file

There are now additional states. Since we are trying to make file
operations recoverable, you can now die in the middle of such an
operation, and we have to be able to pick up the pieces.
What this now means is that:

	* a 0-length file can be an indication of a create in progress
	* you can have a meta-data page but no root page (of a btree)
	* if a file doesn't exist, it could mean that it was just about
	  to be created and needs to be rolled forward
	* if you encounter an error in a file (e.g., the meta-data page
	  is all 0's), you could still be in mid-open

I have now made this all work, but it required significant changes to the
db_open code and error handling, and this is the sort of change that makes
everyone nervous.

A.2. We can always generate a warning if a file is missing.

Now that we have a delete-file method in the API, we need to make sure
that we do not generate warning messages for files that don't exist if
we see that they were explicitly deleted.

This means that we need to save state during recovery, determine which
files were missing, were not being re-created, and were not deleted, and
complain only about those.

A.3. We never encounter NULL file names in the log.

Now that we allow transaction protection on memory-resident files, we write
log messages for files with NULL file names. This means that our assumption
of always being able to call "db_open" on any log_register OPEN message found
in the log is no longer valid.

B.1. If you try to open a file and it exists but is 0-length, then
someone else is trying to open it.

As discussed for A.1, this is no longer true. It may instead be that you
are in the process of recovering a create.

B.2. You can write pages anywhere in a file and any non-existent pages
are 0-filled.

It turns out that this is not true on Windows. This means that places
where we do group allocation (hash) must explicitly allocate each page,
because we can't count on recognizing the uninitialized pages later.

B.3. If you have proper permissions then you can always evict pages from
the buffer pool.

In the brave new world, though, files can be deleted while they still
have pages in the mpool. If you later try to evict these, you discover
that the file doesn't exist. We'd get here when we had to dirty pages
during a remove operation.

B.4. You can close files any time you want.

However, if the file takes part in the open/remove transaction,
then we had better not close it until after the transaction
commits/aborts, because we need to be able to get our hands on the
dbp, and the open happened in a different transaction.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Design for recovering file create and delete in the presence of subdatabases.

Assumptions:
	Remove the O_TRUNCATE flag.
	Single-thread all open/create/delete operations.
	(Well, almost all; we'll optimize opens without DB_CREATE set.)
		The reasoning for this is that with two simultaneous
		open/creators, during recovery, we cannot identify which
		transaction successfully created files and therefore cannot
		recover correctly.
	File system creates/deletes are synchronous.
	Once the file is open, subdatabase creates look like regular
	get/put operations plus a metadata page creation.

There are 4 cases to deal with:
	1. Open/create file
	2. Open/create subdatabase
	3. Delete
	4. Recovery records

		__db_fileopen_recover
		__db_metapage_recover
		__db_delete_recover
		existing c_put and c_get routines for subdatabase creation

	Note that the open/create of the file and the open/create of the
	subdatabase need to be in the same transaction.

1.
Open/create (full file and subdb version)

If create
	LOCK_FILEOP
	txn_begin
	log create message (open message below)
	do file system open/create
	if we did not create
		abort transaction (before going to open_only)
	if (!subdb)
		set dbp->open_txn = NULL
	else
		txn_begin a new transaction for the subdb open

	construct meta-data page
	log meta-data page (see metapage)
	write the meta-data page
	* It may be the case that btrees need to log both meta-data pages
	  and root pages. If that is the case, I believe that we can use
	  this same record and these same recovery routines for both.

	txn_commit
	UNLOCK_FILEOP

2. Delete
	LOCK_FILEOP
	txn_begin
	log delete message (delete message below)
	mv file __db.file.lsn
	txn_commit
	unlink __db.file.lsn
	UNLOCK_FILEOP

3. Recovery Routines

__db_fileopen_recover
	if (argp->name.size == 0)
		done;

	if (redo)	/* Commit */
		__os_open(argp->name, DB_OSO_CREATE, argp->mode, &fh)
		__os_closehandle(fh)
	if (undo)	/* Abort */
		if (argp->name exists)
			unlink(argp->name);

__db_metapage_recover
	if (redo)
		__os_open(argp->name, 0, 0, &fh)
		__os_lseek(meta-data page)
		__os_write(meta-data page)
		__os_closehandle(fh);
	if (undo)
		done = 0;
		if (argp->name exists)
			if (length of argp->name != 0)
				__os_open(argp->name, 0, 0, &fh)
				__os_lseek(meta-data page)
				__os_read(meta-data page)
				if (read succeeds && page lsn != current_lsn)
					done = 1
				__os_closehandle(fh);
		if (!done)
			unlink(argp->name)

__db_delete_recover
	if (redo)
		Check if the backup file still exists and, if so, delete it.

	if (undo)
		if (__db_appname(__db.file.lsn) exists)
			mv __db_appname(__db.file.lsn) __db_appname(file)

__db_metasub_recover
	/* This is like a normal recovery routine */
	Get the metadata page
	if (cmp_p && redo)
		copy the log page onto the page
		update the lsn
		make sure page gets put dirty
	else if (cmp_n && undo)
		update the lsn to the lsn in the log record
		make sure page gets put dirty

	if the page was modified, put it back dirty

In db.src

# name: filename (before call to __db_appname)
# mode: file system mode
BEGIN open
DBT	name	DBT		s
ARG	mode	u_int32_t	o
END

# opcode: indicate if it is a create/delete and if it is a subdatabase
# pgsize: page size on which we're going to write the meta-data page
# pgno: page number on which to write this meta-data page
# page: the actual meta-data page
# lsn: LSN of the meta-data page -- 0 for new databases, may be non-0
#	for subdatabases.
BEGIN metapage
ARG	opcode	u_int32_t	x
DBT	name	DBT		s
ARG	pgno	db_pgno_t	d
DBT	page	DBT		s
POINTER	lsn	DB_LSN *	lu
END

# We do not need a subdatabase name here because removing a subdatabase
# name is simply a regular bt_delete operation from the master database.
# It will get logged normally.
# name: filename
BEGIN delete
DBT	name	DBT		s
END

# We also need to reclaim pages, but we can use the existing
# bt_pg_alloc routines.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Testing recoverability of create/delete.

These tests are unlike other tests in that they are going to
require hooks in the library. The reason is that the create
and delete calls are internally wrapped in a transaction, so
that by the time the call returns, the transaction has already
either committed or aborted. Using only that interface limits
what kind of testing we can do. To match our other recovery
testing efforts, we need to add hooks to trigger aborts at
particular times in the create/delete path.

The general recovery testing strategy is that we wish to
execute every path through every recovery routine. That
means that we try to:
	catch each operation in its pre-operation state
		call the recovery function with redo
		call the recovery function with undo
	catch each operation in its post-operation state
		call the recovery function with redo
		call the recovery function with undo

In addition, there are a few critical points in the create and
delete paths that we want to make sure we capture.

1. Test Structure

The test structure should be similar to the existing recovery
tests. We will want to have a structure in place where we
can execute different commands:
	create a file/database
	create a file that will contain subdatabases
	create a subdatabase
	remove a subdatabase (that contains valid data)
	remove a subdatabase (that does not contain any data)
	remove a file that used to contain subdatabases
	remove a file that contains a database

The tricky part is capturing the state of the world at the
various points in the create/delete process.

The critical points in the create process are:

	1. After we've logged the create, but before we've done anything:
		in db/db.c
		after the open_retry
		after the __crdel_fileopen_log call (and before we've
		called __os_open).

	2. Immediately after the __os_open.

	3. Immediately after each __db_log_page call:
		in bt_open.c
			log meta-data page
			log root page
		in hash.c
			log meta-data page

	4. With respect to the log records above, shortly after each
	   log write is a memp_fput. We need to do a sync after
	   each memp_fput and trigger a point after that sync.

The critical points in the remove process are:

	1. Right after the crdel_delete_log in db/db.c.

	2. Right after the __os_rename call (below the crdel_delete_log).

	3. After the __db_remove_callback call.

I believe that these are the places where we'll need some sort of hook.

2. Adding hooks to the library.

The hooks need two components. One component is to capture the state of
the database at the hook point, and the other is to trigger a txn_abort at
the hook point. The second part is fairly trivial.

The first part requires more thought. Let me explain what we do in a
"normal" recovery test. In a normal recovery test, we save an initial
copy of the database (this copy is called init). Then we execute one
or more operations. Then, right before the commit/abort, we sync the
file and save another copy (the afterop copy). Finally, we call txn_commit
or txn_abort, sync the file again, and save the database one last time (the
final copy).

Then we run recovery. The first time, this should be a no-op, because
we've either committed the transaction and are checking whether to redo it,
or we aborted the transaction, undid it on the abort, and are checking
whether to undo it again.

We then run recovery again on whatever database will force us through
the path that requires work. In the commit case, this means we start
with the init copy of the database and run recovery. This pushes us
through all the redo paths. In the abort case, we start with the afterop
copy, which pushes us through all the undo cases.

In some sense, we're asking the create/delete test to be more exhaustive
by defining all the trigger points, but I think that's the correct thing
to do, since the create/delete is not initiated by a user transaction.

So, what do we have to do at the hook points?
	1. Sync the file to disk.
	2. Save the file itself.
	3. Save any files named __db_backup_name(name, &backup_name, lsn).
	   Since we may not know the right lsns, I think we should save
	   every file of the form __db.name.0xNNNNNNNN.0xNNNNNNNN into
	   some temporary files from which we can restore it to run
	   recovery.

3.
Putting it all together

So, the three pieces are writing the test structure, putting in the hooks,
and then writing the recovery portions so that we restore the right thing
that the hooks saved in order to initiate recovery.

Some of the technical issues that need to be solved are:
	How does the hook code become active (i.e., we don't
		want it in there normally, but it's got to be
		there when you configure for testing)?
	How do you (the test) tell the library that you want a
		particular hook to abort?
	How do you (the test) tell the library that you want the
		hook code doing its copies (do we really want
		*every* test doing these copies during testing?
		Maybe it's not a big deal, but maybe it is; we
		should at least think about it)?