diff --git a/bdb/db/Design.fileop b/bdb/db/Design.fileop
deleted file mode 100644
index 187f1ffaf22..00000000000

# $Id: Design.fileop,v 11.4 2000/02/19 20:57:54 bostic Exp $

The design of file operation recovery.

Keith has asked me to write up notes on the current status of database
create/delete recovery: why it is so hard, and how we have violated all
the cornerstone assumptions on which our recovery framework is based.

I am including two documents at the end of this one. The first is the
initial design of the recoverability of file create and delete (there is
no talk of subdatabases there, because we didn't think we'd have to do
anything special for them). I will annotate that document where things
changed.

The second is the design of recd007, which is supposed to test our ability
to recover these operations regardless of where one crashes. This test
is fundamentally different from our other recovery tests in the following
manner. Normally, the application controls transaction boundaries.
Therefore, we can perform an operation and then decide whether to commit
or abort it. In the normal recovery tests, we force the database into
each of the four possible states from a recovery perspective:

	database is pre-op, undo (do nothing)
	database is pre-op, redo
	database is post-op, undo
	database is post-op, redo (do nothing)

By copying databases at various points and initiating txn_commit and abort
appropriately, we can make all these things happen. Notice that the one
case we don't handle is where page A is in one state (e.g., pre-op) and
page B is in another state (e.g., post-op). I will argue that these don't
matter because each page is recovered independently. If anyone can poke
holes in this, I'm interested.

The problem with create/delete recovery testing is that the transaction
is begun and ended entirely inside the library.
Therefore, there is never any
point (outside the library) where we can copy files and/or initiate
abort/commit. In order to still put the recovery code through its paces,
Sue designed an infrastructure that lets you tell the library where to
make copies of things and where to suddenly inject errors so that the
transaction gets aborted. This level of detail allows us to push the
create/delete recovery code through just about every recovery path
possible (although I'm sure Mike will tell me I'm wrong when he starts to
run code coverage tools).

OK, so that's all preamble and a brief discussion of the documents I'm
enclosing.

Why was this so hard and painful, and why is the code so Q@#$!% complicated?
The following is a discussion/explanation, but to the best of my knowledge,
the structure we have in place now works. The key question we need to be
asking is, "Does this need to be so complex, or should we redesign portions
to simplify it?" At this point, there is no obvious way to simplify it in
my book, but I may be having difficulty seeing this because my mind is too
polluted at this point.

Our overall strategy for recovery is write-ahead logging: we log an
operation and make sure the log record is on disk before any of the data
it describes is on disk. Typically we use log sequence numbers (LSNs) to
mark the data so that during recovery, we can look at the data and
determine whether it is in a state before or after a particular log
record.

In the good old days, opens were not transaction protected, so we could
do regular old opens during recovery: if the file existed, we opened it,
and if it didn't (or appeared corrupt), we didn't, and treated it like a
missing file. As will be discussed below in detail, our states are now
much more complicated, and recovery can't make such simplistic assumptions.

Also, since we are now dealing with file system operations, we have less
control over when they actually happen and what the state of the system
can be. That is, we have to write create log records synchronously, because
the create/open system call may force a newly created (0-length) file to
disk. This file now has to be identified as being in the "being-created"
state.

A. We used to make a number of assumptions during recovery:

1. We could call db_open at any time and one of three things would happen:
	a) the file would be opened cleanly
	b) the file would not exist
	c) we would encounter an error while opening the file

Case a posed no difficulty.
In case b, we simply spat out a warning that a file was missing and then
	ignored all subsequent operations to that file.
In case c, we reported a fatal error.

2. We can always generate a warning if a file is missing.

3. We never encounter NULL file names in the log.

B. We also made some assumptions in the main-line library:

1. If you try to open a file and it exists but is 0-length, then
someone else is trying to open it.

2. You can write pages anywhere in a file, and any non-existent pages
are 0-filled. [This breaks on Windows.]

3. If you have proper permissions, then you can always evict pages from
the buffer pool.

4. During open, we can close the master database handle as soon as
we're done with it, since all the rest of the activity will take place
on the subdatabase handle.

In our brave new world, most of these assumptions are no longer valid.
Let's address them one at a time.

A.1. We could call db_open at any time and one of three things would happen:
	a) the file would be opened cleanly
	b) the file would not exist
	c) we would encounter an error while opening the file

There are now additional states. Since we are trying to make file
operations recoverable, you can now die in the middle of such an
operation, and we have to be able to pick up the pieces.
What this now means is that:

	* a 0-length file can be an indication of a create in progress
	* you can have a meta-data page but no root page (of a btree)
	* if a file doesn't exist, it could mean that it was just about
	  to be created and needs to be rolled forward
	* if you encounter an error in a file (e.g., the meta-data page
	  is all 0's), you could still be in mid-open

I have now made this all work, but it required significant changes to the
db_open code and error handling, and this is the sort of change that makes
everyone nervous.

A.2. We can always generate a warning if a file is missing.

Now that we have a delete-file method in the API, we need to make sure
that we do not generate warning messages for files that don't exist if
we see that they were explicitly deleted.

This means that we need to save state during recovery, determine which
files were missing, were not being re-created, and were not deleted, and
complain only about those.

A.3. We never encounter NULL file names in the log.

Now that we allow transaction protection on memory-resident files, we write
log messages for files with NULL file names. This means that our assumption
of always being able to call "db_open" on any log_register OPEN message found
in the log is no longer valid.

B.1. If you try to open a file and it exists but is 0-length, then
someone else is trying to open it.

As discussed for A.1, this is no longer true. It may instead be that you
are in the process of recovering a create.

B.2. You can write pages anywhere in a file and any non-existent pages
are 0-filled.

It turns out that this is not true on Windows. This means that places
where we do group allocation (hash) must explicitly allocate each page,
because we can't count on recognizing the uninitialized pages later.

B.3. If you have proper permissions then you can always evict pages from
the buffer pool.

In the brave new world, though, files can be deleted while they still
have pages in the mpool. If you later try to evict these, you discover
that the file doesn't exist. We'd get here when we had to dirty pages
during a remove operation.

B.4. You can close files any time you want.

However, if the file takes part in the open/remove transaction,
then we had better not close it until after the transaction
commits/aborts, because we need to be able to get our hands on the
dbp, and the open happened in a different transaction.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Design for recovering file create and delete in the presence of subdatabases.

Assumptions:
	Remove the O_TRUNCATE flag.
	Single-thread all open/create/delete operations.
	(Well, almost all; we'll optimize opens without DB_CREATE set.)
		The reasoning for this is that with two simultaneous
		open/creators, during recovery, we cannot identify which
		transaction successfully created files and therefore cannot
		recover correctly.
	File system creates/deletes are synchronous.
	Once the file is open, subdatabase creates look like regular
	get/put operations plus a metadata page creation.

There are 4 cases to deal with:
	1. Open/create file
	2. Open/create subdatabase
	3. Delete
	4. Recovery records

		__db_fileopen_recover
		__db_metapage_recover
		__db_delete_recover
		existing c_put and c_get routines for subdatabase creation

	Note that the open/create of the file and the open/create of the
	subdatabase need to be in the same transaction.

1.
Open/create (full file and subdb version)

If create
	LOCK_FILEOP
	txn_begin
	log create message (open message below)
	do file system open/create
	if we did not create
		abort transaction (before going to open_only)
	if (!subdb)
		set dbp->open_txn = NULL
	else
		txn_begin a new transaction for the subdb open

	construct meta-data page
	log meta-data page (see metapage)
	write the meta-data page
	* It may be the case that btrees need to log both meta-data pages
	  and root pages. If that is the case, I believe that we can use
	  this same record and these same recovery routines for both.

	txn_commit
	UNLOCK_FILEOP

2. Delete
	LOCK_FILEOP
	txn_begin
	log delete message (delete message below)
	mv file __db.file.lsn
	txn_commit
	unlink __db.file.lsn
	UNLOCK_FILEOP

3. Recovery Routines

__db_fileopen_recover
	if (argp->name.size == 0)
		done;

	if (redo)	/* Commit */
		__os_open(argp->name, DB_OSO_CREATE, argp->mode, &fh)
		__os_closehandle(fh)
	if (undo)	/* Abort */
		if (argp->name exists)
			unlink(argp->name);

__db_metapage_recover
	if (redo)
		__os_open(argp->name, 0, 0, &fh)
		__os_lseek(meta-data page)
		__os_write(meta-data page)
		__os_closehandle(fh);
	if (undo)
		done = 0;
		if (argp->name exists)
			if (length of argp->name != 0)
				__os_open(argp->name, 0, 0, &fh)
				__os_lseek(meta-data page)
				__os_read(meta-data page)
				if (read succeeds && page lsn != current_lsn)
					done = 1
				__os_closehandle(fh);
		if (!done)
			unlink(argp->name)

__db_delete_recover
	if (redo)
		Check if the backup file still exists and, if so, delete it.

	if (undo)
		if (__db_appname(__db.file.lsn) exists)
			mv __db_appname(__db.file.lsn) __db_appname(file)

__db_metasub_recover
	/* This is like a normal recovery routine */
	Get the metadata page
	if (cmp_p && redo)
		copy the log page onto the page
		update the lsn
		make sure page gets put dirty
	else if (cmp_n && undo)
		update the lsn to the lsn in the log record
		make sure page gets put dirty

	if the page was modified, put it back dirty

In db.src

# name: filename (before call to __db_appname)
# mode: file system mode
BEGIN open
DBT	name	DBT		s
ARG	mode	u_int32_t	o
END

# opcode: indicate if it is a create/delete and if it is a subdatabase
# pgsize: page size on which we're going to write the meta-data page
# pgno: page number on which to write this meta-data page
# page: the actual meta-data page
# lsn: LSN of the meta-data page -- 0 for new databases, may be non-0
#	for subdatabases.
BEGIN metapage
ARG	opcode	u_int32_t	x
DBT	name	DBT		s
ARG	pgno	db_pgno_t	d
DBT	page	DBT		s
POINTER	lsn	DB_LSN *	lu
END

# We do not need a subdatabase name here because removing a subdatabase
# name is simply a regular bt_delete operation from the master database.
# It will get logged normally.
# name: filename
BEGIN delete
DBT	name	DBT		s
END

# We also need to reclaim pages, but we can use the existing
# bt_pg_alloc routines.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Testing recoverability of create/delete.

These tests are unlike other tests in that they are going to
require hooks in the library. The reason is that the create
and delete calls are internally wrapped in a transaction, so
that by the time the call returns, the transaction has already
either committed or aborted. Using only that interface limits
what kind of testing we can do. To match our other recovery
testing efforts, we need to add hooks to trigger aborts at
particular times in the create/delete path.

The general recovery testing strategy is that we wish to
execute every path through every recovery routine. That
means that we try to:
	catch each operation in its pre-operation state
		call the recovery function with redo
		call the recovery function with undo
	catch each operation in its post-operation state
		call the recovery function with redo
		call the recovery function with undo

In addition, there are a few critical points in the create and
delete paths that we want to make sure we capture.

1. Test Structure

The test structure should be similar to the existing recovery
tests. We will want to have a structure in place where we
can execute different commands:
	create a file/database
	create a file that will contain subdatabases
	create a subdatabase
	remove a subdatabase (that contains valid data)
	remove a subdatabase (that does not contain any data)
	remove a file that used to contain subdatabases
	remove a file that contains a database

The tricky part is capturing the state of the world at the
various points in the create/delete process.

The critical points in the create process are:

	1. After we've logged the create, but before we've done anything:
		in db/db.c
		after the open_retry
		after the __crdel_fileopen_log call (and before we've
		called __os_open).

	2. Immediately after the __os_open.

	3. Immediately after each __db_log_page call:
		in bt_open.c
			log meta-data page
			log root page
		in hash.c
			log meta-data page

	4. With respect to the log records above, shortly after each
	   log write is a memp_fput. We need to do a sync after
	   each memp_fput and trigger a point after that sync.

The critical points in the remove process are:

	1. Right after the crdel_delete_log in db/db.c.

	2. Right after the __os_rename call (below the crdel_delete_log).

	3. After the __db_remove_callback call.

I believe that these are the places where we'll need some sort of hook.

2. Adding hooks to the library.

The hooks need two components. One component is to capture the state of
the database at the hook point, and the other is to trigger a txn_abort at
the hook point. The second part is fairly trivial.

The first part requires more thought. Let me explain what we do in a
"normal" recovery test. In a normal recovery test, we save an initial
copy of the database (this copy is called init). Then we execute one
or more operations. Then, right before the commit/abort, we sync the
file and save another copy (the afterop copy). Finally, we call txn_commit
or txn_abort, sync the file again, and save the database one last time (the
final copy).

Then we run recovery. The first time, this should be a no-op, because
we've either committed the transaction and are checking whether to redo it,
or we aborted the transaction, undid it on the abort, and are checking
whether to undo it again.

We then run recovery again on whatever database will force us through
the path that requires work. In the commit case, this means we start
with the init copy of the database and run recovery. This pushes us
through all the redo paths. In the abort case, we start with the afterop
copy, which pushes us through all the undo cases.

In some sense, we're asking the create/delete test to be more exhaustive
by defining all the trigger points, but I think that's the correct thing
to do, since the create/delete is not initiated by a user transaction.

So, what do we have to do at the hook points?
	1. Sync the file to disk.
	2. Save the file itself.
	3. Save any files named __db_backup_name(name, &backup_name, lsn).
	   Since we may not know the right lsns, I think we should save
	   every file of the form __db.name.0xNNNNNNNN.0xNNNNNNNN into
	   some temporary files from which we can restore it to run
	   recovery.

3.
Putting it all together

So, the three pieces are writing the test structure, putting in the hooks,
and then writing the recovery portions so that we restore the right thing
that the hooks saved in order to initiate recovery.

Some of the technical issues that need to be solved are:
	How does the hook code become active (i.e., we don't
		want it in there normally, but it's got to be
		there when you configure for testing)?
	How do you (the test) tell the library that you want a
		particular hook to abort?
	How do you (the test) tell the library that you want the
		hook code doing its copies (do we really want
		*every* test doing these copies during testing?
		Maybe it's not a big deal, but maybe it is; we
		should at least think about it)?