Diffstat (limited to 'bdb/db/Design.fileop')
-rw-r--r--  bdb/db/Design.fileop  452
1 files changed, 0 insertions, 452 deletions
diff --git a/bdb/db/Design.fileop b/bdb/db/Design.fileop
deleted file mode 100644
index 187f1ffaf22..00000000000
--- a/bdb/db/Design.fileop
+++ /dev/null
@@ -1,452 +0,0 @@
-# $Id: Design.fileop,v 11.4 2000/02/19 20:57:54 bostic Exp $
-
-The design of file operation recovery.
-
-Keith has asked me to write up notes on the current status of database
-create and delete recovery, why it's so hard, and how we've violated
-all the cornerstone assumptions on which our recovery framework is based.
-
-I am including two documents at the end of this one. The first is the
-initial design of the recoverability of file create and delete (there is
-no talk of subdatabases there, because we didn't think we'd have to do
-anything special there). I will annotate this document to show where
-things changed.
-
-The second is the design of recd007 which is supposed to test our ability
-to recover these operations regardless of where one crashes. This test
-is fundamentally different from our other recovery tests in the following
-manner. Normally, the application controls transaction boundaries.
-Therefore, we can perform an operation and then decide whether to commit
-or abort it. In the normal recovery tests, we force the database into
-each of the four possible states from a recovery perspective:
-
- database is pre-op, undo (do nothing)
- database is pre-op, redo
- database is post-op, undo
- database is post-op, redo (do nothing)
-
-By copying databases at various points and initiating txn_commit and abort
-appropriately, we can make all these things happen. Notice that the one
-case we don't handle is where page A is in one state (e.g., pre-op) and
-page B is in another state (e.g., post-op). I will argue that these don't
-matter because each page is recovered independently. If anyone can poke
-holes in this, I'm interested.
-
-The problem with create/delete recovery testing is that the transaction
-is begun and ended all inside the library. Therefore, there is never any
-point (outside the library) where we can copy files and/or initiate
-abort/commit. In order to still put the recovery code through its paces,
-Sue designed an infrastructure that lets you tell the library where to
-make copies of things and where to suddenly inject errors so that the
-transaction gets aborted. This level of detail allows us to push the
-create/delete recovery code through just about every recovery path
-possible (although I'm sure Mike will tell me I'm wrong when he starts to
-run code coverage tools).
-
-OK, so that's all preamble and a brief discussion of the documents I'm
-enclosing.
-
-Why was this so hard and painful and why is the code so Q@#$!% complicated?
-The following is a discussion/explanation, but to the best of my knowledge,
-the structure we have in place now works. The key question we need to be
-asking is, "Does this have to be so complex, or should we redesign
-portions to simplify it?" At this point, there is no obvious way to simplify
-it in my book, but I may be having difficulty seeing this because my mind is
-too polluted at this point.
-
-Our overall strategy for recovery is write-ahead logging: we log an
-operation and make sure the log record is on disk before any of the
-data that the log record describes is on disk.
-Typically we use log sequence numbers (LSNs) to mark the data so that
-during recovery, we can look at the data and determine if it is in a
-state before a particular log record or after a particular log record.
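-
-As a concrete illustration, here is a minimal sketch (not the library's
-actual code; apply_op/undo_op are invented stand-ins) of how a page LSN
-drives the redo/undo decision during recovery:
-
-    #include <db.h>    /* DB_LSN, log_compare() */
-
-    void apply_op(void);    /* Invented stand-ins for the real */
-    void undo_op(void);     /* redo/undo work on the page.     */
-
-    /*
-     * Decide what to do with one page for one log record: prevlsn is
-     * the page's LSN before the operation, reclsn is the LSN of the
-     * log record itself.
-     */
-    void
-    recover_page(DB_LSN *pagelsn,
-        const DB_LSN *reclsn, const DB_LSN *prevlsn, int redo)
-    {
-        if (redo && log_compare(pagelsn, prevlsn) == 0) {
-            apply_op();                /* Page is pre-op: redo. */
-            *pagelsn = *reclsn;
-        } else if (!redo && log_compare(pagelsn, reclsn) == 0) {
-            undo_op();                 /* Page is post-op: undo. */
-            *pagelsn = *prevlsn;
-        }
-        /* Otherwise the page is already in the right state. */
-    }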
-
-In the good old days, opens were not transaction protected, so we could
-do regular old opens during recovery and if the file existed, we opened
-it and if it didn't (or appeared corrupt), we didn't and treated it like
-a missing file. As will be discussed below in detail, our states are much
-more complicated and recovery can't make such simplistic assumptions.
-
-Also, since we are now dealing with file system operations, we have less
-control over when they actually happen and what state the system can
-be in. That is, we have to write create log records synchronously, because
-the create/open system call may force a newly created (0-length) file to
-disk. This file now has to be identified as being in the "being-created"
-state.
-
-A. We used to make a number of assumptions during recovery:
-
-1. We could call db_open at any time and one of three things would happen:
- a) the file would be opened cleanly
- b) the file would not exist
- c) we would encounter an error while opening the file
-
-Case a posed no difficulty.
-In Case b, we simply spit out a warning that a file was missing and then
- ignored all subsequent operations to that file.
-In Case c, we reported a fatal error.
-
-2. We can always generate a warning if a file is missing.
-
-3. We never encounter NULL file names in the log.
-
-B. We also made some assumptions in the main-line library:
-
-1. If you try to open a file and it exists but is 0-length, then
-someone else is trying to open it.
-
-2. You can write pages anywhere in a file and any non-existent pages
-are 0-filled. [This breaks on Windows.]
-
-3. If you have proper permissions then you can always evict pages from
-the buffer pool.
-
-4. During open, we can close the master database handle as soon as
-we're done with it since all the rest of the activity will take place
-on the subdatabase handle.
-
-In our brave new world, most of these assumptions are no longer valid.
-Let's address them one at a time.
-
-A.1 We could call db_open at any time and one of three things would happen:
- a) the file would be opened cleanly
- b) the file would not exist
- c) we would encounter an error while opening the file
-There are now additional states. Since we are trying to make file
-operations recoverable, you can now die in the middle of such an
-operation and we have to be able to pick up the pieces. What this
-now means is that:
-
- * a 0-length file can be an indication of a create in-progress
- * you can have a meta-data page but no root page (of a btree)
- * if a file doesn't exist, it could mean that it was just about
- to be created and needs to be rolled forward.
- * if you encounter an error in a file (e.g., the meta-data page
- is all 0's) you could still be in mid-open.
-
-I have now made this all work, but it required significant changes to the
-db_open code and error handling and this is the sort of change that makes
-everyone nervous.
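-
-A hedged sketch of the sort of case analysis open now has to do (the
-enum and both helpers are invented for illustration, not the real
-db_open code):
-
-    #include <sys/stat.h>
-
-    int meta_page_is_zeroed(const char *);    /* Hypothetical check. */
-
-    enum open_state {
-        OPEN_CLEAN,        /* Normal, complete database. */
-        OPEN_MISSING,      /* May need to be rolled forward. */
-        OPEN_MIDCREATE,    /* Create was in progress at the crash. */
-    };
-
-    /* Sketch: classify what an on-disk file can mean once file
-     * creates are transaction protected. */
-    enum open_state
-    classify_open(const char *name)
-    {
-        struct stat sb;
-
-        if (stat(name, &sb) != 0)
-            return (OPEN_MISSING);
-        if (sb.st_size == 0)
-            return (OPEN_MIDCREATE);    /* 0-length file. */
-        if (meta_page_is_zeroed(name))
-            return (OPEN_MIDCREATE);    /* Died mid-open. */
-        return (OPEN_CLEAN);
-    }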
-
-A.2. We can always generate a warning if a file is missing.
-
-Now that we have a delete file method in the API, we need to make sure
-that we do not generate warning messages for files that don't exist if
-we see that they were explicitly deleted.
-
-This means that we need to save state during recovery, determine which
-files were missing and were not being recreated and were not deleted and
-only complain about those.
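-
-One way to picture the bookkeeping (a sketch with invented names, not
-the library's actual structures): recovery keeps a list of files the
-log shows as explicitly deleted and consults it before warning.
-
-    #include <stdlib.h>
-    #include <string.h>
-
-    struct filelist {
-        char *name;
-        struct filelist *next;
-    };
-    static struct filelist *deleted_files;
-
-    /* Remember a file the log shows as explicitly deleted. */
-    void
-    note_deleted(const char *name)
-    {
-        struct filelist *fp;
-
-        if ((fp = malloc(sizeof(*fp))) == NULL)
-            return;
-        fp->name = strdup(name);
-        fp->next = deleted_files;
-        deleted_files = fp;
-    }
-
-    /* At the end of recovery, warn about a missing file only if it
-     * was not deleted and is not being recreated. */
-    int
-    was_deleted(const char *name)
-    {
-        struct filelist *fp;
-
-        for (fp = deleted_files; fp != NULL; fp = fp->next)
-            if (strcmp(fp->name, name) == 0)
-                return (1);
-        return (0);
-    }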
-
-A.3. We never encounter NULL file names in the log.
-
-Now that we allow transaction protection on memory-resident files, we write
-log messages for files with NULL file names. This means that our assumption
-of always being able to call "db_open" on any log_register OPEN message found
-in the log is no longer valid.
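-
-Concretely, a recovery routine handling a log_register OPEN record now
-needs a guard along these lines (a sketch in the style of the recovery
-routines shown later; argp mirrors their argument structs):
-
-    /* A memory-resident file logs an OPEN record with no name, so
-     * there is nothing on disk for recovery to open. */
-    if (argp->name.size == 0 || argp->name.data == NULL)
-        return (0);    /* Skip the db_open path entirely. */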
-
-B.1. If you try to open a file and it exists but is 0-length, then
-someone else is trying to open it.
-
-As discussed for A.1, this is no longer true. It may be instead that you
-are in the process of recovering a create.
-
-B.2. You can write pages anywhere in a file and any non-existent pages
-are 0-filled.
-
-It turns out that this is not true on Windows. This means that places
-we do group allocation (hash) must explicitly allocate each page, because
-we can't count on recognizing the uninitialized pages later.
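-
-The fix is to materialize every page at allocation time. A
-POSIX-flavored sketch of the idea (the helper name is invented; the
-real library goes through its __os_* wrappers):
-
-    #include <stdlib.h>
-    #include <unistd.h>
-
-    /* Explicitly write a zeroed buffer for each page in the group
-     * instead of seeking past EOF and hoping the hole is 0-filled. */
-    int
-    alloc_group(int fd, size_t pgsize, unsigned first, unsigned npages)
-    {
-        char *buf;
-        unsigned i;
-
-        if ((buf = calloc(1, pgsize)) == NULL)
-            return (-1);
-        for (i = 0; i < npages; i++)
-            if (pwrite(fd, buf, pgsize,
-                (off_t)(first + i) * pgsize) != (ssize_t)pgsize) {
-                free(buf);
-                return (-1);
-            }
-        free(buf);
-        return (0);
-    }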
-
-B.3. If you have proper permissions then you can always evict pages from
-the buffer pool.
-
-In the brave new world though, files can be deleted and they may
-have pages in the mpool. If you later try to evict these, you
-discover that the file doesn't exist. We'd get here when we had
-to dirty pages during a remove operation.
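-
-So eviction has to tolerate a vanished backing file. A sketch of the
-shape of the tolerance needed (write_page and file_was_removed are
-invented stand-ins for illustration):
-
-    #include <errno.h>
-    #include <db.h>
-
-    int write_page(DB_MPOOLFILE *, db_pgno_t, void *);    /* Invented. */
-    int file_was_removed(DB_MPOOLFILE *);                 /* Invented. */
-
-    /* Sketch: flush one dirty page, tolerating a backing file that
-     * was removed by the same transaction. */
-    int
-    evict_page(DB_MPOOLFILE *dbmfp, db_pgno_t pgno, void *buf)
-    {
-        int ret;
-
-        ret = write_page(dbmfp, pgno, buf);
-        if (ret == ENOENT && file_was_removed(dbmfp))
-            ret = 0;    /* The page dies with the file. */
-        return (ret);
-    }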
-
-B.4. You can close files any time you want.
-
-However, if the file takes part in the open/remove transaction,
-then we had better not close it until after the transaction
-commits/aborts, because we need to be able to get our hands on the
-dbp and the open happened in a different transaction.
-
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
-Design for recovering file create and delete in the presence of subdatabases.
-
-Assumptions:
- Remove the O_TRUNCATE flag.
- Single-thread all open/create/delete operations.
- (Well, almost all; we'll optimize opens without DB_CREATE set.)
-	The reasoning for this is that with two simultaneous
-	open/creators, during recovery, we cannot identify which
-	transaction successfully created files and therefore cannot
-	recover correctly.
- File system creates/deletes are synchronous
- Once the file is open, subdatabase creates look like regular
- get/put operations and a metadata page creation.
-
-There are 4 cases to deal with:
- 1. Open/create file
- 2. Open/create subdatabase
- 3. Delete
- 4. Recovery records
-
- __db_fileopen_recover
- __db_metapage_recover
- __db_delete_recover
- existing c_put and c_get routines for subdatabase creation
-
- Note that the open/create of the file and the open/create of the
- subdatabase need to be in the same transaction.
-
-1. Open/create (full file and subdb version)
-
-If create
-	LOCK_FILEOP
-	txn_begin
-	log create message (open message below)
-	do file system open/create
-	if we did not create
-		abort transaction (before going to open_only)
-	if (!subdb)
-		set dbp->open_txn = NULL
-	else
-		txn_begin a new transaction for the subdb open
-
-	construct meta-data page
-	log meta-data page (see metapage)
-	write the meta-data page
-	* It may be the case that btrees need to log both meta-data pages
-	  and root pages. If that is the case, I believe that we can use
-	  this same record and recovery routines for both.
-
-	txn_commit
-	UNLOCK_FILEOP
-
-2. Delete
- LOCK_FILEOP
- txn_begin
- log delete message (delete message below)
- mv file __db.file.lsn
- txn_commit
- unlink __db.file.lsn
- UNLOCK_FILEOP
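-
-A sketch of the same two-phase delete in C (log_delete_record is an
-invented helper; the rename is what keeps the delete undoable until
-the transaction commits):
-
-    #include <stdio.h>     /* rename() */
-    #include <unistd.h>    /* unlink() */
-    #include <db.h>
-
-    void log_delete_record(DB_TXN *, const char *);    /* Invented. */
-
-    /* Log first (write-ahead logging), rename rather than unlink so
-     * that abort can restore the file, and remove the backup only
-     * after commit. */
-    int
-    recoverable_delete(DB_TXN *txn, const char *name, const char *backup)
-    {
-        log_delete_record(txn, name);
-        if (rename(name, backup) != 0)    /* name -> __db.file.lsn */
-            return (-1);
-        if (txn_commit(txn) != 0)
-            return (-1);       /* On abort/crash, recovery renames
-                                * the backup back into place. */
-        (void)unlink(backup);  /* Now safe to really delete. */
-        return (0);
-    }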
-
-3. Recovery Routines
-
-__db_fileopen_recover
-	if (argp->name.size == 0)
-		done;
-
-	if (redo)	/* Commit */
-		__os_open(argp->name, DB_OSO_CREATE, argp->mode, &fh)
-		__os_closehandle(fh)
-	if (undo)	/* Abort */
-		if (argp->name exists)
-			unlink(argp->name);
-
-__db_metapage_recover
-	if (redo)
-		__os_open(argp->name, 0, 0, &fh)
-		__os_lseek(meta data page)
-		__os_write(meta data page)
-		__os_closehandle(fh);
-	if (undo)
-		done = 0;
-		if (argp->name exists)
-			if (length of argp->name != 0)
-				__os_open(argp->name, 0, 0, &fh)
-				__os_lseek(meta data page)
-				__os_read(meta data page)
-				if (read succeeds && page lsn != current_lsn)
-					done = 1
-				__os_closehandle(fh);
-		if (!done)
-			unlink(argp->name)
-
-__db_delete_recover
-	if (redo)
-		Check if the backup file still exists and if so, delete it.
-
-	if (undo)
-		if (__db_appname(__db.file.lsn) exists)
-			mv __db_appname(__db.file.lsn) __db_appname(file)
-
-__db_metasub_recover
-	/* This is like a normal recovery routine */
-	Get the metadata page
-	if (cmp_n && redo)
-		copy the log page onto the page
-		update the lsn
-		make sure page gets put dirty
-	else if (cmp_p && undo)
-		update the lsn to the lsn in the log record
-		make sure page gets put dirty
-
-	if the page was modified, put it back dirty
-
-In db.src
-
-# name: filename (before call to __db_appname)
-# mode: file system mode
-BEGIN open
-DBT name DBT s
-ARG mode u_int32_t o
-END
-
-# opcode: indicate if it is a create/delete and if it is a subdatabase
-# pgsize: page size on which we're going to write the meta-data page
-# pgno: page number on which to write this meta-data page
-# page: the actual meta-data page
-# lsn: LSN of the meta-data page -- 0 for new databases, may be non-0
-# for subdatabases.
-
-BEGIN metapage
-ARG opcode u_int32_t x
-DBT name DBT s
-ARG pgno db_pgno_t d
-DBT page DBT s
-POINTER lsn DB_LSN * lu
-END
-
-# We do not need a subdatabase name here because removing a subdatabase
-# name is simply a regular bt_delete operation from the master database.
-# It will get logged normally.
-# name: filename
-BEGIN delete
-DBT name DBT s
-END
-
-# We also need to reclaim pages, but we can use the existing
-# bt_pg_alloc routines.
-
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
-Testing recoverability of create/delete.
-
-These tests are unlike other tests in that they are going to
-require hooks in the library. The reason is that the create
-and delete calls are internally wrapped in a transaction, so
-that if the call returns, the transaction has already either
-committed or aborted. Using only that interface limits what
-kind of testing we can do. To match our other recovery testing
-efforts, we need to add hooks to trigger aborts at particular
-times in the create/delete path.
-
-The general recovery testing strategy is that we wish to
-execute every path through every recovery routine. That
-means that we try to:
- catch each operation in its pre-operation state
- call the recovery function with redo
- call the recovery function with undo
- catch each operation in its post-operation state
- call the recovery function with redo
- call the recovery function with undo
-
-In addition, there are a few critical points in the create and
-delete path that we want to make sure we capture.
-
-1. Test Structure
-
-The test structure should be similar to the existing recovery
-tests. We will want to have a structure in place where we
-can execute different commands:
- create a file/database
- create a file that will contain subdatabases.
- create a subdatabase
- remove a subdatabase (that contains valid data)
- remove a subdatabase (that does not contain any data)
- remove a file that used to contain subdatabases
- remove a file that contains a database
-
-The tricky part is capturing the state of the world at the
-various points in the create/delete process.
-
-The critical points in the create process are:
-
- 1. After we've logged the create, but before we've done anything.
- in db/db.c
- after the open_retry
- after the __crdel_fileopen_log call (and before we've
- called __os_open).
-
- 2. Immediately after the __os_open
-
- 3. Immediately after each __db_log_page call
- in bt_open.c
- log meta-data page
- log root page
- in hash.c
- log meta-data page
-
- 4. With respect to the log records above, shortly after each
-	log write is a memp_fput. We need to do a sync after
- each memp_fput and trigger a point after that sync.
-
-The critical points in the remove process are:
-
- 1. Right after the crdel_delete_log in db/db.c
-
- 2. Right after the __os_rename call (below the crdel_delete_log)
-
- 3. After the __db_remove_callback call.
-
-I believe that these are the places where we'll need some sort of hook.
-
-2. Adding hooks to the library.
-
-The hooks need two components. One component is to capture the state of
-the database at the hook point and the other is to trigger a txn_abort at
-the hook point. The second part is fairly trivial.
-
-The first part requires more thought. Let me explain what we do in a
-"normal" recovery test. In a normal recovery test, we save an intial
-copy of the database (this copy is called init). Then we execute one
-or more operations. Then, right before the commit/abort, we sync the
-file, and save another copy (the afterop copy). Finally, we call txn_commit
-or txn_abort, sync the file again, and save the database one last time (the
-final copy).
-
-Then we run recovery. The first time, this should be a no-op, because
-we've either committed the transaction and are checking to redo it or
-we aborted the transaction, undid it on the abort and are checking to
-undo it again.
-
-We then run recovery again on whatever database will force us through
-the path that requires work. In the commit case, this means we start
-with the init copy of the database and run recovery. This pushes us
-through all the redo paths. In the abort case, we start with the afterop
-copy which pushes us through all the undo cases.
-
-In some sense, we're asking the create/delete test to be more exhaustive
-by defining all the trigger points, but I think that's the correct thing
-to do, since the create/delete is not initiated by a user transaction.
-
-So, what do we have to do at the hook points?
- 1. sync the file to disk.
- 2. save the file itself
- 3. save any files named __db_backup_name(name, &backup_name, lsn)
- Since we may not know the right lsns, I think we should save
- every file of the form __db.name.0xNNNNNNNN.0xNNNNNNNN into
-	some temporary files from which we can restore them to run
- recovery.
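-
-For point 3, a sketch of the sweep (copy_file is an invented helper):
-
-    #include <dirent.h>
-    #include <string.h>
-
-    void copy_file(const char *);    /* Hypothetical. */
-
-    /* Save every file that looks like a rename-style backup,
-     * __db.<name>.0xNNNNNNNN.0xNNNNNNNN, since the test may not know
-     * which LSNs are in play. */
-    void
-    save_backups(const char *dir)
-    {
-        DIR *dp;
-        struct dirent *ent;
-
-        if ((dp = opendir(dir)) == NULL)
-            return;
-        while ((ent = readdir(dp)) != NULL)
-            if (strncmp(ent->d_name, "__db.", 5) == 0)
-                copy_file(ent->d_name);
-        closedir(dp);
-    }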
-
-3. Putting it all together
-
-So, the three pieces are writing the test structure, putting in the hooks
-and then writing the recovery portions so that we restore the right thing
-that the hooks saved in order to initiate recovery.
-
-Some of the technical issues that need to be solved are:
- How does the hook code become active (i.e., we don't
- want it in there normally, but it's got to be
- there when you configure for testing)?
- How do you (the test) tell the library that you want a
- particular hook to abort?
- How do you (the test) tell the library that you want the
- hook code doing its copies (do we really want
- *every* test doing these copies during testing?
- Maybe it's not a big deal, but maybe it is; we
- should at least think about it).