# $Id: Design.fileop,v 11.4 2000/02/19 20:57:54 bostic Exp $

The design of file operation recovery.

Keith has asked me to write up notes on our current status of database create and delete and recovery, why it's so hard, and how we've violated all the cornerstone assumptions on which our recovery framework is based.

I am including two documents at the end of this one. The first is the initial design of the recoverability of file create and delete (there is no talk of subdatabases there, because we didn't think we'd have to do anything special for them). I will annotate this document where things changed. The second is the design of recd007, which is supposed to test our ability to recover these operations regardless of where one crashes.

This test is fundamentally different from our other recovery tests in the following manner. Normally, the application controls transaction boundaries. Therefore, we can perform an operation and then decide whether to commit or abort it. In the normal recovery tests, we force the database into each of the four possible states from a recovery perspective:

	database is pre-op, undo (do nothing)
	database is pre-op, redo
	database is post-op, undo
	database is post-op, redo (do nothing)

By copying databases at various points and initiating txn_commit and txn_abort appropriately, we can make all these things happen. Notice that the one case we don't handle is where page A is in one state (e.g., pre-op) and page B is in another state (e.g., post-op). I will argue that these don't matter because each page is recovered independently. If anyone can poke holes in this, I'm interested.

The problem with create/delete recovery testing is that the transaction is begun and ended entirely inside the library. Therefore, there is never any point (outside the library) where we can copy files and/or initiate abort/commit. In order to still put the recovery code through its paces, Sue designed an infrastructure that lets you tell the library where to make copies of things and where to suddenly inject errors so that the transaction gets aborted. This level of detail allows us to push the create/delete recovery code through just about every recovery path possible (although I'm sure Mike will tell me I'm wrong when he starts to run code coverage tools).

OK, so that's all preamble and a brief discussion of the documents I'm enclosing. Why was this so hard and painful and why is the code so Q@#$!% complicated? The following is a discussion/explanation, but to the best of my knowledge, the structure we have in place now works. The key question we need to be asking is, "Does this really need to be so complex or should we redesign portions to simplify it?" At this point, there is no obvious way to simplify it in my book, but I may be having difficulty seeing this because my mind is too polluted at this point.

Our overall strategy for recovery is write-ahead logging: we log an operation and make sure the log record is on disk before any of the data that log record describes is on disk. Typically we use log sequence numbers (LSNs) to mark the data so that during recovery, we can look at the data and determine whether it is in a state before or after a particular log record.

In the good old days, opens were not transaction protected, so we could do regular old opens during recovery: if the file existed, we opened it, and if it didn't (or appeared corrupt), we didn't, and treated it like a missing file.
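To make the LSN marking and the four states above concrete, here is a minimal sketch (simplified page type, invented helper names; not the library's actual code) of the comparison every page-oriented recovery routine makes:

#include <db.h>		/* DB_LSN, log_compare() */

/* Simplified stand-in for the library's internal page header. */
typedef struct {
	DB_LSN	lsn;	/* LSN of the last change to this page. */
	/* ... page contents ... */
} PAGE;

void	apply_change(PAGE *);	/* Invented: the record-specific redo work. */
void	revert_change(PAGE *);	/* Invented: the record-specific undo work. */

/*
 * The log record carries the page's LSN from before the operation
 * (prev_lsn); the record's own LSN is this_lsn.  Comparing them with
 * the on-disk page LSN identifies which of the four states we're in.
 */
void
page_recover(PAGE *pagep, const DB_LSN *prev_lsn,
    const DB_LSN *this_lsn, int redo)
{
	if (redo && log_compare(&pagep->lsn, prev_lsn) == 0) {
		/* Page is pre-op, redoing: reapply, roll the LSN forward. */
		apply_change(pagep);
		pagep->lsn = *this_lsn;
	} else if (!redo && log_compare(&pagep->lsn, this_lsn) == 0) {
		/* Page is post-op, undoing: revert, roll the LSN back. */
		revert_change(pagep);
		pagep->lsn = *prev_lsn;
	}
	/* Pre-op/undo and post-op/redo are the do-nothing cases. */
}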
As will be discussed below in detail, our states are much more complicated, and recovery can't make such simplistic assumptions. Also, since we are now dealing with file system operations, we have less control over when they actually happen and what the state of the system can be. That is, we have to write create log records synchronously, because the create/open system call may force a newly created (0-length) file to disk. This file now has to be identified as being in the "being-created" state.

A. We used to make a number of assumptions during recovery:

1. We could call db_open at any time and one of three things would happen:
	a) the file would be opened cleanly
	b) the file would not exist
	c) we would encounter an error while opening the file

   Case a posed no difficulty. In case b, we simply spit out a warning that a file was missing and then ignored all subsequent operations on that file. In case c, we reported a fatal error.

2. We can always generate a warning if a file is missing.

3. We never encounter NULL file names in the log.

B. We also made some assumptions in the main-line library:

1. If you try to open a file and it exists but is 0-length, then someone else is trying to open it.

2. You can write pages anywhere in a file, and any non-existent pages are 0-filled. [This breaks on Windows.]

3. If you have proper permissions, then you can always evict pages from the buffer pool.

4. During open, we can close the master database handle as soon as we're done with it, since all the rest of the activity will take place on the subdatabase handle.

In our brave new world, most of these assumptions are no longer valid. Let's address them one at a time.

A.1. We could call db_open at any time and one of three things would happen:
	a) the file would be opened cleanly
	b) the file would not exist
	c) we would encounter an error while opening the file

There are now additional states. Since we are trying to make file operations recoverable, you can now die in the middle of such an operation, and we have to be able to pick up the pieces. What this now means is that:

	* a 0-length file can be an indication of a create in progress
	* you can have a meta-data page but no root page (of a btree)
	* if a file doesn't exist, it could mean that it was just about to be created and needs to be rolled forward
	* if you encounter an error in a file (e.g., the meta-data page is all 0's), you could still be in mid-open

I have now made this all work, but it required significant changes to the db_open code and error handling, and this is the sort of change that makes everyone nervous.

A.2. We can always generate a warning if a file is missing.

Now that we have a delete file method in the API, we need to make sure that we do not generate warning messages for files that don't exist if we see that they were explicitly deleted. This means that we need to save state during recovery, determine which files were missing, were not being re-created, and were not deleted, and complain only about those.

A.3. We never encounter NULL file names in the log.

Now that we allow transaction protection on memory-resident files, we write log messages for files with NULL file names. This means that our assumption of always being able to call "db_open" on any log_register OPEN message found in the log is no longer valid.

B.1. If you try to open a file and it exists but is 0-length, then someone else is trying to open it.

As discussed for A.1, this is no longer true. It may instead be that you are in the process of recovering a create.
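To make the A.1 states concrete, here is a sketch of the classification an open performed during recovery now has to make; the names and the simplified logic are invented for illustration:

#include <sys/stat.h>
#include <stddef.h>

enum file_state {
	FILE_OK,	/* Open normally and verify. */
	FILE_MISSING,	/* May need to be rolled forward. */
	FILE_MIDCREATE	/* Create was in progress; recovery decides. */
};

/*
 * Sketch only: classify a database file during recovery.  metabuf is
 * the meta-data page already read by the caller (NULL if the file is
 * too short to have one).
 */
enum file_state
classify_file(const char *path, const unsigned char *metabuf, size_t metasize)
{
	struct stat sb;
	size_t i;

	if (stat(path, &sb) != 0)
		return (FILE_MISSING);	/* Could be about to be created. */
	if (sb.st_size == 0)
		return (FILE_MIDCREATE);	/* 0-length: create in progress. */
	if (metabuf != NULL) {
		for (i = 0; i < metasize; i++)
			if (metabuf[i] != 0)
				return (FILE_OK);	/* Real meta page. */
		return (FILE_MIDCREATE);	/* All-0 meta page: mid-open. */
	}
	return (FILE_OK);
}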
B.2. You can write pages anywhere in a file and any non-existent pages are 0-filled.

It turns out that this is not true on Windows. This means that places where we do group allocation (hash) must explicitly allocate each page, because we can't count on recognizing the uninitialized pages later.

B.3. If you have proper permissions then you can always evict pages from the buffer pool.

In the brave new world, though, files can be deleted, and they may still have pages in the mpool. If you later try to evict these, you discover that the file doesn't exist. We'd get here when we had to dirty pages during a remove operation.

B.4. You can close files any time you want.

However, if the file takes part in the open/remove transaction, then we had better not close it until after the transaction commits/aborts, because we need to be able to get our hands on the dbp, and the open happened in a different transaction.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Design for recovering file create and delete in the presence of subdatabases.

Assumptions:

	Remove the O_TRUNCATE flag.

	Single-thread all open/create/delete operations. (Well, almost all; we'll optimize opens without DB_CREATE set.) The reasoning for this is that with two simultaneous open/creators, during recovery we cannot identify which transaction successfully created files and therefore cannot recover correctly.

	File system creates/deletes are synchronous.

	Once the file is open, subdatabase creates look like regular get/put operations and a meta-data page creation.

There are 4 cases to deal with:

	1. Open/create file
	2. Open/create subdatabase
	3. Delete
	4. Recovery records:
		__db_fileopen_recover
		__db_metapage_recover
		__db_delete_recover
		existing c_put and c_get routines for subdatabase creation

Note that the open/create of the file and the open/create of the subdatabase need to be in the same transaction.

1. Open/create (full file and subdb version)

	if create
		LOCK_FILEOP
		txn_begin
		log create message (open message below)
		do file system open/create
		if we did not create
			abort transaction (before going to open_only)
	if (!subdb)
		set dbp->open_txn = NULL
	else
		txn_begin a new transaction for the subdb open
	construct meta-data page
	log meta-data page (see metapage)
	write the meta-data page
		* It may be the case that btrees need to log both meta-data pages and root pages. If that is the case, I believe that we can use this same record and recovery routines for both.
	txn_commit
	UNLOCK_FILEOP

2. Delete

	LOCK_FILEOP
	txn_begin
	log delete message (delete message below)
	mv file __db.file.lsn
	txn_commit
	unlink __db.file.lsn
	UNLOCK_FILEOP

3. Recovery Routines

__db_fileopen_recover
	if (argp->name.size == 0)
		done;
	if (redo)	/* Commit */
		__os_open(argp->name, DB_OSO_CREATE, argp->mode, &fh)
		__os_closehandle(fh)
	if (undo)	/* Abort */
		if (argp->name exists)
			unlink(argp->name);

__db_metapage_recover
	if (redo)
		__os_open(argp->name, 0, 0, &fh)
		__os_lseek(meta-data page)
		__os_write(meta-data page)
		__os_closehandle(fh);
	if (undo)
		done = 0;
		if (argp->name exists)
			if (length of argp->name != 0)
				__os_open(argp->name, 0, 0, &fh)
				__os_lseek(meta-data page)
				__os_read(meta-data page)
				if (read succeeds && page lsn != current_lsn)
					done = 1
				__os_closehandle(fh);
		if (!done)
			unlink(argp->name)

__db_delete_recover
	if (redo)
		check if the backup file still exists and, if so, delete it
	if (undo)
		if (__db_appname(__db.file.lsn) exists)
			mv __db_appname(__db.file.lsn) __db_appname(file)

__db_metasub_recover
	/* This is like a normal recovery routine */
	get the meta-data page
	if (cmp_n && redo)
		copy the log page onto the page
		update the lsn
		make sure page gets put dirty
	else if (cmp_p && undo)
		update the lsn to the lsn in the log record
		make sure page gets put dirty
	if the page was modified, put it back dirty
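Read literally, the __db_fileopen_recover pseudocode above reduces to a handful of file system calls. A self-contained sketch, with POSIX calls standing in for the __os_* wrappers and an invented, simplified args struct standing in for the generated one:

#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stddef.h>

/* Invented, simplified stand-in for the auto-generated args struct. */
struct fileopen_args {
	char	*name;		/* File name. */
	size_t	 namelen;	/* 0 for memory-resident files. */
	mode_t	 mode;		/* File system mode. */
};

/*
 * Sketch of the create-recovery logic above: redo is the commit path
 * (make sure the file exists), undo is the abort path (remove it).
 */
int
fileopen_recover(const struct fileopen_args *argp, int redo)
{
	int fd;

	if (argp->namelen == 0)		/* NULL file name: nothing to do. */
		return (0);
	if (redo) {
		if ((fd = open(argp->name,
		    O_RDWR | O_CREAT, argp->mode)) == -1)
			return (-1);
		(void)close(fd);
	} else if (access(argp->name, F_OK) == 0 &&
	    unlink(argp->name) != 0)
		return (-1);
	return (0);
}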
In db.src:

# name: filename (before call to __db_appname)
# mode: file system mode
BEGIN open
DBT	name		DBT		s
ARG	mode		u_int32_t	o
END

# opcode: indicate if it is a create/delete and if it is a subdatabase
# pgsize: page size on which we're going to write the meta-data page
# pgno: page number on which to write this meta-data page
# page: the actual meta-data page
# lsn: LSN of the meta-data page -- 0 for new databases, may be non-0
#	for subdatabases.
BEGIN metapage
ARG	opcode		u_int32_t	x
DBT	name		DBT		s
ARG	pgno		db_pgno_t	d
DBT	page		DBT		s
POINTER	lsn		DB_LSN *	lu
END

# We do not need a subdatabase name here because removing a subdatabase
# name is simply a regular bt_delete operation from the master database.
# It will get logged normally.
# name: filename
BEGIN delete
DBT	name		DBT		s
END

# We also need to reclaim pages, but we can use the existing
# bt_pg_alloc routines.
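Each of these descriptions generates a log-write function and an argument struct that the matching recovery routine unmarshals from the log record. Assuming the usual shape of those generated structs (a sketch for illustration, not the actual generated code), the metapage record would hand __db_metapage_recover roughly:

#include <db.h>	/* u_int32_t, DB_TXN, DB_LSN, DBT, db_pgno_t */

/*
 * Assumed shape of the argument struct generated from the metapage
 * description above -- a sketch, not verified against gen output.
 */
typedef struct __crdel_metapage_args {
	u_int32_t	 type;		/* Record type. */
	DB_TXN		*txnid;		/* Enclosing transaction. */
	DB_LSN		 prev_lsn;	/* Previous LSN for this txn. */
	u_int32_t	 opcode;	/* Create/delete; subdatabase? */
	DBT		 name;		/* File name. */
	db_pgno_t	 pgno;		/* Page number of the meta-data page. */
	DBT		 page;		/* The meta-data page itself. */
	DB_LSN		 lsn;		/* LSN of the meta-data page. */
} __crdel_metapage_args;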
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Testing recoverability of create/delete.

These tests are unlike other tests in that they are going to require hooks in the library. The reason is that the create and delete calls are internally wrapped in a transaction, so that by the time the call returns, the transaction has already either committed or aborted. Using only that interface limits what kind of testing we can do. To match our other recovery testing efforts, we need to add hooks to trigger aborts at particular times in the create/delete path.

The general recovery testing strategy is that we wish to execute every path through every recovery routine. That means that we try to:

	catch each operation in its pre-operation state
		call the recovery function with redo
		call the recovery function with undo
	catch each operation in its post-operation state
		call the recovery function with redo
		call the recovery function with undo

In addition, there are a few critical points in the create and delete paths that we want to make sure we capture.

1. Test Structure

The test structure should be similar to the existing recovery tests. We will want to have a structure in place where we can execute different commands:

	create a file/database
	create a file that will contain subdatabases
	create a subdatabase
	remove a subdatabase (that contains valid data)
	remove a subdatabase (that does not contain any data)
	remove a file that used to contain subdatabases
	remove a file that contains a database

The tricky part is capturing the state of the world at the various points in the create/delete process.

The critical points in the create process are:

1. After we've logged the create, but before we've done anything: in db/db.c, after the open_retry, after the __crdel_fileopen_log call (and before we've called __os_open).

2. Immediately after the __os_open.

3. Immediately after each __db_log_page call:
	in bt_open.c
		log meta-data page
		log root page
	in hash.c
		log meta-data page

4. With respect to the log records above, shortly after each log write is a memp_fput. We need to do a sync after each memp_fput and trigger a point after that sync.

The critical points in the remove process are:

1. Right after the crdel_delete_log in db/db.c.

2. Right after the __os_rename call (below the crdel_delete_log).

3. After the __db_remove_callback call.

I believe these are the places where we'll need some sort of hook.

2. Adding hooks to the library.

The hooks need two components. One component is to capture the state of the database at the hook point, and the other is to trigger a txn_abort at the hook point. The second part is fairly trivial. The first part requires more thought. Let me explain what we do in a "normal" recovery test.

In a normal recovery test, we save an initial copy of the database (this copy is called init). Then we execute one or more operations. Then, right before the commit/abort, we sync the file and save another copy (the afterop copy). Finally, we call txn_commit or txn_abort, sync the file again, and save the database one last time (the final copy). Then we run recovery. The first time, this should be a no-op, because we've either committed the transaction and are checking whether to redo it, or we aborted the transaction, undid it on the abort, and are checking whether to undo it again. We then run recovery again on whatever database will force us through the path that requires work. In the commit case, this means we start with the init copy of the database and run recovery. This pushes us through all the redo paths. In the abort case, we start with the afterop copy, which pushes us through all the undo cases.

In some sense, we're asking the create/delete test to be more exhaustive by defining all the trigger points, but I think that's the correct thing to do, since the create/delete is not initiated by a user transaction.

So, what do we have to do at the hook points?

	1. Sync the file to disk.
	2. Save the file itself.
	3. Save any files named __db_backup_name(name, &backup_name, lsn). Since we may not know the right LSNs, I think we should save every file of the form __db.name.0xNNNNNNNN.0xNNNNNNNN into some temporary files from which we can restore it to run recovery.

3. Putting it all together

So, the three pieces are writing the test structure, putting in the hooks, and then writing the recovery portions so that we restore the right things that the hooks saved in order to initiate recovery.

Some of the technical issues that need to be solved are:

	How does the hook code become active (i.e., we don't want it in there normally, but it's got to be there when you configure for testing)?

	How do you (the test) tell the library that you want a particular hook to abort?

	How do you (the test) tell the library that you want the hook code doing its copies (do we really want *every* test doing these copies during testing? Maybe it's not a big deal, but maybe it is; we should at least think about it)?
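For the first two questions, one plausible shape (all of the names here are invented; this is a sketch, not a settled design) is a macro that compiles away unless the library is configured for testing, with the trigger state carried on the environment handle:

#include <errno.h>

/*
 * Sketch only: CONFIG_TEST, test_copy, test_abort, and save_state()
 * are invented names.  At each critical point in the create/delete
 * path, the hook either copies the state of the world, forces an
 * error so the enclosing transaction aborts, or does nothing.
 */
#ifdef CONFIG_TEST
#define	TEST_TRIGGER(dbenv, point, ret, label)				\
	do {								\
		if ((dbenv)->test_copy == (point))			\
			(void)save_state(dbenv);  /* Copy files. */	\
		if ((dbenv)->test_abort == (point)) {			\
			(ret) = EINVAL;	  /* Force the txn_abort. */	\
			goto label;					\
		}							\
	} while (0)
#else
#define	TEST_TRIGGER(dbenv, point, ret, label)	/* Compiled out. */
#endif

The test would then use some test-only interface to set dbenv->test_abort to one of the critical points listed above before initiating the create or delete, and would restore the copies save_state made before running recovery.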