Import changeset

author: unknown <tim@threads.polyesthetic.msg> 2001-03-04 19:42:05 -0500
committer: unknown <tim@threads.polyesthetic.msg> 2001-03-04 19:42:05 -0500
commit: ec6ae091617bdfdca9e65e8d3e65b950d234f676
tree: 9dd732e08dba156ee3d7635caedc0dc3107ecac6 /bdb/db
parent: 87d70fb598105b64b538ff6b81eef9da626255b1
25 files changed, 17969 insertions, 0 deletions
diff --git a/bdb/db/Design.fileop b/bdb/db/Design.fileop
new file mode 100644
index 00000000000..187f1ffaf22
--- /dev/null
+++ b/bdb/db/Design.fileop
@@ -0,0 +1,452 @@
+# $Id: Design.fileop,v 11.4 2000/02/19 20:57:54 bostic Exp $
+
+The design of file operation recovery.
+
+Keith has asked me to write up notes on our current status of database
+create and delete and recovery, why it's so hard, and how we've violated
+all the cornerstone assumptions on which our recovery framework is based.
+
+I am including two documents at the end of this one.   The first is the
+initial design of the recoverability of file create and delete (there is
+no talk of subdatabases there, because we didn't think we'd have to do
+anything special there).  I will annotate this document on where things
+changed.
+
+The second is the design of recd007 which is supposed to test our ability
+to recover these operations regardless of where one crashes.  This test
+is fundamentally different from our other recovery tests in the following
+manner.  Normally, the application controls transaction boundaries.
+Therefore, we can perform an operation and then decide whether to commit
+or abort it.  In the normal recovery tests, we force the database into
+each of the four possible states from a recovery perspective:
+
+	database is pre-op, undo (do nothing)
+	database is pre-op, redo
+	database is post-op, undo
+	database is post-op, redo (do nothing)
+
+By copying databases at various points and initiating txn_commit and abort
+appropriately, we can make all these things happen.  Notice that the one
+case we don't handle is where page A is in one state (e.g., pre-op) and
+page B is in another state (e.g., post-op).  I will argue that these don't
+matter because each page is recovered independently.  If anyone can poke
+holes in this, I'm interested.
+
+The problem with create/delete recovery testing is that the transaction
+is begun and ended all inside the library.  Therefore, there is never any
+point (outside the library) where we can copy files and/or initiate
+abort/commit.  In order to still put the recovery code through its paces,
+Sue designed an infrastructure that lets you tell the library where to
+make copies of things and where to suddenly inject errors so that the
+transaction gets aborted.  This level of detail allows us to push the
+create/delete recovery code through just about every recovery path
+possible (although I'm sure Mike will tell me I'm wrong when he starts to
+run code coverage tools).
+
+OK, so that's all preamble and a brief discussion of the documents I'm
+enclosing.
+
+Why was this so hard and painful and why is the code so Q@#$!% complicated?
+The following is a discussion/explanation, but to the best of my knowledge,
+the structure we have in place now works.  The key question we need to be
+asking is, "Does this need to be so complex or should we redesign
+portions to simplify it?"  At this point, there is no obvious way to simplify
+it in my book, but I may be having difficulty seeing this because my mind is
+too polluted at this point.
+
+Our overall strategy for recovery is that we do write-ahead logging;
+that is, we log an operation and make sure the log record is on disk
+before any of the data that the log record describes is on disk.
+Typically we use log sequence numbers (LSNs) to mark the data so that
+during recovery, we can look at the data and determine if it is in a
+state before a particular log record or after a particular log record.
+
+In the good old days, opens were not transaction protected, so we could
+do regular old opens during recovery and if the file existed, we opened
+it and if it didn't (or appeared corrupt), we didn't and treated it like
+a missing file.  As will be discussed below in detail, our states are much
+more complicated and recovery can't make such simplistic assumptions.
+
+Also, since we are now dealing with file system operations, we have less
+control over when they actually happen and what the state of the system
+can be.  That is, we have to write create log records synchronously, because
+the create/open system call may force a newly created (0-length) file to
+disk.  This file has to now be identified as being in the "being-created"
+state.
+
+A. We used to make a number of assumptions during recovery:
+
+1. We could call db_open at any time and one of three things would happen:
+	a) the file would be opened cleanly
+	b) the file would not exist
+	c) we would encounter an error while opening the file
+
+Case a posed no difficulty.
+In Case b, we simply spit out a warning that a file was missing and then
+	ignored all subsequent operations to that file.
+In Case c, we reported a fatal error.
+
+2.  We can always generate a warning if a file is missing.
+
+3. We never encounter NULL file names in the log.
+
+B. We also made some assumptions in the main-line library:
+
+1. If you try to open a file and it exists but is 0-length, then
+someone else is trying to open it.
+
+2. You can write pages anywhere in a file and any non-existent pages
+are 0-filled. [This breaks on Windows.]
+
+3. If you have proper permissions then you can always evict pages from
+the buffer pool.
+
+4. During open, we can close the master database handle as soon as
+we're done with it since all the rest of the activity will take place
+on the subdatabase handle.
+
+In our brave new world, most of these assumptions are no longer valid.
+Let's address them one at a time.
+
+A.1 We could call db_open at any time and one of three things would happen:
+	a) the file would be opened cleanly
+	b) the file would not exist
+	c) we would encounter an error while opening the file
+There are now additional states.  Since we are trying to make file
+operations recoverable, you can now die in the middle of such an
+operation and we have to be able to pick up the pieces.  What this
+now means is that:
+
+	* a 0-length file can be an indication of a create in-progress
+	* you can have a meta-data page but no root page (of a btree)
+	* if a file doesn't exist, it could mean that it was just about
+		to be created and needs to be rolled forward.
+	* if you encounter an error in a file (e.g., the meta-data page
+		is all 0's) you could still be in mid-open.
+
+I have now made this all work, but it required significant changes to the
+db_open code and error handling and this is the sort of change that makes
+everyone nervous.
+
+A.2.  We can always generate a warning if a file is missing.
+
+Now that we have a delete file method in the API, we need to make sure
+that we do not generate warning messages for files that don't exist if
+we see that they were explicitly deleted.
+
+This means that we need to save state during recovery, determine which
+files were missing and were not being recreated and were not deleted and
+only complain about those.
+
+A.3. We never encounter NULL file names in the log.
+
+Now that we allow transaction protection on memory-resident files, we write
+log messages for files with NULL file names.  This means that our assumption
+of always being able to call "db_open" on any log_register OPEN message found
+in the log is no longer valid.
+
+B.1. If you try to open a file and it exists but is 0-length, then
+someone else is trying to open it.
+
+As discussed for A.1, this is no longer true.  It may be instead that you
+are in the process of recovering a create.
+
+B.2. You can write pages anywhere in a file and any non-existent pages
+are 0-filled.
+
+It turns out that this is not true on Windows.  This means that places
+we do group allocation (hash) must explicitly allocate each page, because
+we can't count on recognizing the uninitialized pages later.
+
+B.3. If you have proper permissions then you can always evict pages from
+the buffer pool.
+
+In the brave new world though, files can be deleted and they may
+have pages in the mpool.  If you later try to evict these, you
+discover that the file doesn't exist.  We'd get here when we had
+to dirty pages during a remove operation.
+
+B.4. You can close files any time you want.
+
+However, if the file takes part in the open/remove transaction,
+then we had better not close it until after the transaction
+commits/aborts, because we need to be able to get our hands on the
+dbp and the open happened in a different transaction.
+
+=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
+Design for recovering file create and delete in the presence of subdatabases.
+
+Assumptions:
+	Remove the O_TRUNCATE flag.
+	Single-thread all open/create/delete operations.
+		(Well, almost all; we'll optimize opens without DB_CREATE set.)
+		The reasoning for this is that with two simultaneous
+		open/creators, during recovery, we cannot identify which
+		transaction successfully created files and therefore cannot
+		recover correctly.
+	File system creates/deletes are synchronous
+	Once the file is open, subdatabase creates look like regular
+		get/put operations and a metadata page creation.
+
+There are 4 cases to deal with:
+	1. Open/create file
+	2. Open/create subdatabase
+	3. Delete
+	4. Recovery records
+
+		__db_fileopen_recover
+		__db_metapage_recover
+		__db_delete_recover
+		existing c_put and c_get routines for subdatabase creation
+
+	Note that the open/create of the file and the open/create of the
+	subdatabase need to be in the same transaction.
+
+1. Open/create (full file and subdb version)
+
+If create
+	LOCK_FILEOP
+	txn_begin
+	log create message (open message below)
+	do file system open/create
+	if we did not create
+		abort transaction (before going to open_only)
+		if (!subdb)
+			set dbp->open_txn = NULL
+		else
+			txn_begin a new transaction for the subdb open
+
+	construct meta-data page
+	log meta-data page (see metapage)
+	write the meta-data page
+	* It may be the case that btrees need to log both meta-data pages
+	  and root pages. If that is the case, I believe that we can use
+	  this same record and recovery routines for both
+
+	txn_commit
+	UNLOCK_FILEOP
+
+2. Delete
+	LOCK_FILEOP
+	txn_begin
+	log delete message (delete message below)
+	mv file __db.file.lsn
+	txn_commit
+	unlink __db.file.lsn
+	UNLOCK_FILEOP
+
+3. Recovery Routines
+
+__db_fileopen_recover
+	if (argp->name.size == 0)
+		done;
+
+	if (redo)	/* Commit */
+		__os_open(argp->name, DB_OSO_CREATE, argp->mode, &fh)
+		__os_closehandle(fh)
+	if (undo)	/* Abort */
+		if (argp->name exists)
+			unlink(argp->name);
+
+__db_metapage_recover
+	if (redo)
+		__os_open(argp->name, 0, 0, &fh)
+		__os_lseek(meta data page)
+		__os_write(meta data page)
+		__os_closehandle(fh);
+	if (undo)
+		done = 0;
+		if (argp->name exists)
+			if (length of argp->name != 0)
+				__os_open(argp->name, 0, 0, &fh)
+				__os_lseek(meta data page)
+				__os_read(meta data page)
+				if (read succeeds && page lsn != current_lsn)
+					done = 1
+				__os_closehandle(fh);
+			if (!done)
+				unlink(argp->name)
+
+__db_delete_recover
+	if (redo)
+		Check if the backup file still exists and if so, delete it.
+
+	if (undo)
+		if (__db_appname(__db.file.lsn) exists)
+			mv __db_appname(__db.file.lsn) __db_appname(file)
+
+__db_metasub_recover
+	/* This is like a normal recovery routine */
+	Get the metadata page
+	if (cmp_n && redo)
+		copy the log page onto the page
+		update the lsn
+		make sure page gets put dirty
+	else if (cmp_p && undo)
+		update the lsn to the lsn in the log record
+		make sure page gets put dirty
+
+	if the page was modified, put it back dirty
+
+In db.src
+
+# name: filename (before call to __db_appname)
+# mode: file system mode
+BEGIN open
+DBT	name		DBT		s
+ARG	mode		u_int32_t	o
+END
+
+# opcode: indicate if it is a create/delete and if it is a subdatabase
+# pgsize: page size on which we're going to write the meta-data page
+# pgno: page number on which to write this meta-data page
+# page: the actual meta-data page
+# lsn: LSN of the meta-data page -- 0 for new databases, may be non-0
+#	for subdatabases.
+
+BEGIN metapage
+ARG	opcode		u_int32_t	x
+DBT	name		DBT		s
+ARG	pgno		db_pgno_t	d
+DBT	page		DBT		s
+POINTER	lsn		DB_LSN *	lu
+END
+
+# We do not need a subdatabase name here because removing a subdatabase
+# name is simply a regular bt_delete operation from the master database.
+# It will get logged normally.
+# name: filename
+BEGIN delete
+DBT	name		DBT		s
+END
+
+# We also need to reclaim pages, but we can use the existing
+# bt_pg_alloc routines.
+
+=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
+Testing recoverability of create/delete.
+
+These tests are unlike other tests in that they are going to
+require hooks in the library.  The reason is that the create
+and delete calls are internally wrapped in a transaction, so
+that if the call returns, the transaction has already either
+committed or aborted.  Using only that interface limits what
+kind of testing we can do.  To match our other recovery testing
+efforts, we need to add hooks to trigger aborts at particular
+times in the create/delete path.
+
+The general recovery testing strategy is that we wish to
+execute every path through every recovery routine.  That
+means that we try to:
+	catch each operation in its pre-operation state
+		call the recovery function with redo
+		call the recovery function with undo
+	catch each operation in its post-operation state
+		call the recovery function with redo
+		call the recovery function with undo
+
+In addition, there are a few critical points in the create and
+delete path that we want to make sure we capture.
+
+1. Test Structure
+
+The test structure should be similar to the existing recovery
+tests.  We will want to have a structure in place where we
+can execute different commands:
+	create a file/database
+	create a file that will contain subdatabases.
+	create a subdatabase
+	remove a subdatabase (that contains valid data)
+	remove a subdatabase (that does not contain any data)
+	remove a file that used to contain subdatabases
+	remove a file that contains a database
+
+The tricky part is capturing the state of the world at the
+various points in the create/delete process.
+
+The critical points in the create process are:
+
+	1. After we've logged the create, but before we've done anything.
+		in db/db.c
+		after the open_retry
+		after the __crdel_fileopen_log call (and before we've
+			called __os_open).
+
+	2. Immediately after the __os_open
+
+	3. Immediately after each __db_log_page call
+		in bt_open.c
+			log meta-data page
+			log root page
+		in hash.c
+			log meta-data page
+
+	4. With respect to the log records above, shortly after each
+		log write is a memp_fput.  We need to do a sync after
+		each memp_fput and trigger a point after that sync.
+
+The critical points in the remove process are:
+
+	1. Right after the crdel_delete_log in db/db.c
+
+	2. Right after the __os_rename call (below the crdel_delete_log)
+
+	3. After the __db_remove_callback call.
+
+I believe that these are the places where we'll need some sort of hook.
+
+2. Adding hooks to the library.
+
+The hooks need two components.  One component is to capture the state of
+the database at the hook point and the other is to trigger a txn_abort at
+the hook point.  The second part is fairly trivial.
+
+The first part requires more thought.  Let me explain what we do in a
+"normal" recovery test.  In a normal recovery test, we save an initial
+copy of the database (this copy is called init).  Then we execute one
+or more operations.  Then, right before the commit/abort, we sync the
+file, and save another copy (the afterop copy).  Finally, we call txn_commit
+or txn_abort, sync the file again, and save the database one last time (the
+final copy).
+
+Then we run recovery.  The first time, this should be a no-op, because
+we've either committed the transaction and are checking to redo it or
+we aborted the transaction, undid it on the abort and are checking to
+undo it again.
+
+We then run recovery again on whatever database will force us through
+the path that requires work.  In the commit case, this means we start
+with the init copy of the database and run recovery.  This pushes us
+through all the redo paths.  In the abort case, we start with the afterop
+copy which pushes us through all the undo cases.
+
+In some sense, we're asking the create/delete test to be more exhaustive
+by defining all the trigger points, but I think that's the correct thing
+to do, since the create/delete is not initiated by a user transaction.
+
+So, what do we have to do at the hook points?
+	1. sync the file to disk.
+	2. save the file itself
+	3. save any files named __db_backup_name(name, &backup_name, lsn)
+		Since we may not know the right lsns, I think we should save
+		every file of the form __db.name.0xNNNNNNNN.0xNNNNNNNN into
+		some temporary files from which we can restore it to run
+		recovery.
+
+3. Putting it all together
+
+So, the three pieces are writing the test structure, putting in the hooks
+and then writing the recovery portions so that we restore the right thing
+that the hooks saved in order to initiate recovery.
+
+Some of the technical issues that need to be solved are:
+	How does the hook code become active (i.e., we don't
+		want it in there normally, but it's got to be
+		there when you configure for testing)?
+	How do you (the test) tell the library that you want a
+		particular hook to abort?
+	How do you (the test) tell the library that you want the
+		hook code doing its copies (do we really want
+		*every* test doing these copies during testing?
+		Maybe it's not a big deal, but maybe it is; we
+		should at least think about it).
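The four recovery states listed in Design.fileop (pre-op or post-op, crossed with undo or redo) can be illustrated with a small decision function. This is a minimal sketch, not Berkeley DB code: `lsn_t`, `recop_t`, `action_t`, `lsn_cmp` and `recover_action` are hypothetical names. It assumes each page is stamped with the LSN of the last log record applied to it, and each log record remembers the LSN the page carried before the operation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified LSN: ordered by (file, offset). */
typedef struct { uint32_t file, offset; } lsn_t;

typedef enum { OP_REDO, OP_UNDO } recop_t;
typedef enum { ACT_NONE, ACT_APPLY, ACT_REVERT } action_t;

/* Return <0, 0 or >0 as a sorts before, equal to, or after b. */
static int
lsn_cmp(const lsn_t *a, const lsn_t *b)
{
	if (a->file != b->file)
		return (a->file < b->file ? -1 : 1);
	if (a->offset != b->offset)
		return (a->offset < b->offset ? -1 : 1);
	return (0);
}

/*
 * page_lsn: LSN currently stamped on the page.
 * prev_lsn: LSN the page carried before the logged operation.
 * rec_lsn:  LSN of the log record being recovered.
 */
static action_t
recover_action(recop_t op, const lsn_t *page_lsn,
    const lsn_t *prev_lsn, const lsn_t *rec_lsn)
{
	assert(page_lsn != NULL && prev_lsn != NULL && rec_lsn != NULL);

	/* Page is pre-op and we are rolling forward: apply the change. */
	if (op == OP_REDO && lsn_cmp(page_lsn, prev_lsn) == 0)
		return (ACT_APPLY);
	/* Page is post-op and we are rolling back: revert the change. */
	if (op == OP_UNDO && lsn_cmp(page_lsn, rec_lsn) == 0)
		return (ACT_REVERT);
	/* Pre-op undo and post-op redo are the two "do nothing" cases. */
	return (ACT_NONE);
}
```

Because the decision depends only on the LSN stamped on one page, each page can be examined independently, which is the property the note relies on when it argues that mixed page states do not matter. File create and delete have no page LSN to consult until a meta-data page exists, which is part of why the note describes them as harder.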
diff --git a/bdb/db/crdel.src b/bdb/db/crdel.src
new file mode 100644
index 00000000000..17c061d6887
--- /dev/null
+++ b/bdb/db/crdel.src
@@ -0,0 +1,103 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ *	Sleepycat Software.  All rights reserved.
+ *
+ *	$Id: crdel.src,v 11.12 2000/12/12 17:41:48 bostic Exp $
+ */
+
+PREFIX	crdel
+
+INCLUDE	#include "db_config.h"
+INCLUDE
+INCLUDE #ifndef NO_SYSTEM_INCLUDES
+INCLUDE #include <sys/types.h>
+INCLUDE
+INCLUDE #include <ctype.h>
+INCLUDE #include <errno.h>
+INCLUDE #include <string.h>
+INCLUDE #endif
+INCLUDE
+INCLUDE #include "db_int.h"
+INCLUDE #include "db_page.h"
+INCLUDE #include "db_dispatch.h"
+INCLUDE #include "db_am.h"
+INCLUDE #include "txn.h"
+INCLUDE
+
+/*
+ * Fileopen -- log a potential file create operation
+ *
+ * name: filename
+ * subname: sub database name
+ * mode: file system mode
+ */
+BEGIN fileopen		141
+DBT	name		DBT		s
+ARG	mode		u_int32_t	o
+END
+
+/*
+ * Metasub: log the creation of a subdatabase meta data page.
+ *
+ * fileid: identifies the file being acted upon.
+ * pgno: page number on which to write this meta-data page
+ * page: the actual meta-data page
+ * lsn: lsn of the page.
+ */
+BEGIN metasub		142
+ARG	fileid		int32_t		ld
+ARG	pgno		db_pgno_t	d
+DBT	page		DBT		s
+POINTER	lsn		DB_LSN *	lu
+END
+
+/*
+ * Metapage: log the creation of a meta data page for a new file.
+ *
+ * fileid: identifies the file being acted upon.
+ * name: file containing the page.
+ * pgno: page number on which to write this meta-data page
+ * page: the actual meta-data page
+ */
+BEGIN metapage		143
+ARG	fileid		int32_t		ld
+DBT	name		DBT		s
+ARG	pgno		db_pgno_t	d
+DBT	page		DBT		s
+END
+
+/*
+ * Delete: remove a file.
+ * Note that we don't need a special log record for subdatabase
+ * removes, because we use normal btree operations to remove them.
+ *
+ * name: name of the file being removed (relative to DBHOME).
+ */
+DEPRECATED old_delete		144
+DBT	name		DBT	s
+END
+
+/*
+ * Rename: rename a file
+ *   We do not need this for subdatabases
+ *
+ * name: name of the file being renamed (relative to DBHOME).
+ */
+BEGIN rename		145
+ARG	fileid		int32_t	ld
+DBT	name		DBT		s
+DBT	newname		DBT		s
+END
+/*
+ * Delete: remove a file.
+ * Note that we don't need a special log record for subdatabase
+ * removes, because we use normal btree operations to remove them.
+ *
+ * name: name of the file being removed (relative to DBHOME).
+ */
+BEGIN delete		146
+ARG	fileid		int32_t	ld
+DBT	name		DBT		s
+END
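The `fileopen` record described in crdel.src is flattened into a byte buffer as a fixed header (record type, transaction id, previous LSN) followed by a length-prefixed name and the mode. The following is a minimal, self-contained sketch of that layout. The type and function names (`lsn_t`, `fileopen_rec`, `put_bytes`, `get_bytes`, `fileopen_pack`, `fileopen_unpack`) are hypothetical stand-ins rather than the library's own, and the sketch assumes 32-bit fields copied in host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the library's LSN and argument structures. */
typedef struct { uint32_t file, offset; } lsn_t;

typedef struct {
	uint32_t type;		/* record type, 141 for fileopen */
	uint32_t txnid;		/* 0 when logged outside a transaction */
	lsn_t prev_lsn;		/* previous record in the transaction's chain */
	uint32_t name_size;	/* 0 for an unnamed (in-memory) file */
	const uint8_t *name;	/* name bytes, not NUL-terminated */
	uint32_t mode;		/* file system mode */
} fileopen_rec;

/* Bytes needed for a record whose name is name_size bytes long. */
static size_t
fileopen_size(uint32_t name_size)
{
	return (4 * sizeof(uint32_t) + sizeof(lsn_t) + name_size);
}

/* Copy len bytes into the buffer and return the advanced cursor. */
static uint8_t *
put_bytes(uint8_t *bp, const void *src, size_t len)
{
	if (len != 0)
		memcpy(bp, src, len);
	return (bp + len);
}

/* Copy len bytes out of the buffer and return the advanced cursor. */
static const uint8_t *
get_bytes(void *dst, const uint8_t *bp, size_t len)
{
	memcpy(dst, bp, len);
	return (bp + len);
}

/* Serialize r into buf; returns the number of bytes written. */
static size_t
fileopen_pack(uint8_t *buf, const fileopen_rec *r)
{
	uint8_t *bp = buf;

	assert(buf != NULL && r != NULL);
	bp = put_bytes(bp, &r->type, sizeof(r->type));
	bp = put_bytes(bp, &r->txnid, sizeof(r->txnid));
	bp = put_bytes(bp, &r->prev_lsn, sizeof(r->prev_lsn));
	bp = put_bytes(bp, &r->name_size, sizeof(r->name_size));
	bp = put_bytes(bp, r->name, r->name_size);
	bp = put_bytes(bp, &r->mode, sizeof(r->mode));
	return ((size_t)(bp - buf));
}

/* Decode buf into r; r->name points into buf rather than copying. */
static void
fileopen_unpack(const uint8_t *buf, fileopen_rec *r)
{
	const uint8_t *bp = buf;

	bp = get_bytes(&r->type, bp, sizeof(r->type));
	bp = get_bytes(&r->txnid, bp, sizeof(r->txnid));
	bp = get_bytes(&r->prev_lsn, bp, sizeof(r->prev_lsn));
	bp = get_bytes(&r->name_size, bp, sizeof(r->name_size));
	r->name = bp;
	bp += r->name_size;
	(void)get_bytes(&r->mode, bp, sizeof(r->mode));
}
```

A zero `name_size` still writes the four-byte length, so a record for an unnamed file remains parseable. That is the case Design.fileop calls out under assumption A.3.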
diff --git a/bdb/db/crdel_auto.c b/bdb/db/crdel_auto.c
new file mode 100644
index 00000000000..f2204410ee8
--- /dev/null
+++ b/bdb/db/crdel_auto.c
@@ -0,0 +1,900 @@
+/* Do not edit: automatically built by gen_rec.awk. */
+#include "db_config.h"
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <ctype.h>
+#include <errno.h>
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_dispatch.h"
+#include "db_am.h"
+#include "txn.h"
+
+int
+__crdel_fileopen_log(dbenv, txnid, ret_lsnp, flags,
+	name, mode)
+	DB_ENV *dbenv;
+	DB_TXN *txnid;
+	DB_LSN *ret_lsnp;
+	u_int32_t flags;
+	const DBT *name;
+	u_int32_t mode;
+{
+	DBT logrec;
+	DB_LSN *lsnp, null_lsn;
+	u_int32_t zero;
+	u_int32_t rectype, txn_num;
+	int ret;
+	u_int8_t *bp;
+
+	rectype = DB_crdel_fileopen;
+	if (txnid != NULL &&
+	    TAILQ_FIRST(&txnid->kids) != NULL &&
+	    (ret = __txn_activekids(dbenv, rectype, txnid)) != 0)
+		return (ret);
+	txn_num = txnid == NULL ? 0 : txnid->txnid;
+	if (txnid == NULL) {
+		ZERO_LSN(null_lsn);
+		lsnp = &null_lsn;
+	} else
+		lsnp = &txnid->last_lsn;
+	logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN)
+	    + sizeof(u_int32_t) + (name == NULL ? 0 : name->size)
+	    + sizeof(mode);
+	if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0)
+		return (ret);
+
+	bp = logrec.data;
+	memcpy(bp, &rectype, sizeof(rectype));
+	bp += sizeof(rectype);
+	memcpy(bp, &txn_num, sizeof(txn_num));
+	bp += sizeof(txn_num);
+	memcpy(bp, lsnp, sizeof(DB_LSN));
+	bp += sizeof(DB_LSN);
+	if (name == NULL) {
+		zero = 0;
+		memcpy(bp, &zero, sizeof(u_int32_t));
+		bp += sizeof(u_int32_t);
+	} else {
+		memcpy(bp, &name->size, sizeof(name->size));
+		bp += sizeof(name->size);
+		memcpy(bp, name->data, name->size);
+		bp += name->size;
+	}
+	memcpy(bp, &mode, sizeof(mode));
+	bp += sizeof(mode);
+	DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size);
+	ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags);
+	if (txnid != NULL)
+		txnid->last_lsn = *ret_lsnp;
+	__os_free(logrec.data, logrec.size);
+	return (ret);
+}
+
+int
+__crdel_fileopen_print(dbenv, dbtp, lsnp, notused2, notused3)
+	DB_ENV *dbenv;
+	DBT *dbtp;
+	DB_LSN *lsnp;
+	db_recops notused2;
+	void *notused3;
+{
+	__crdel_fileopen_args *argp;
+	u_int32_t i;
+	u_int ch;
+	int ret;
+
+	i = 0;
+	ch = 0;
+	notused2 = DB_TXN_ABORT;
+	notused3 = NULL;
+
+	if ((ret = __crdel_fileopen_read(dbenv, dbtp->data, &argp)) != 0)
+		return (ret);
+	printf("[%lu][%lu]crdel_fileopen: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+	    (u_long)lsnp->file,
+	    (u_long)lsnp->offset,
+	    (u_long)argp->type,
+	    (u_long)argp->txnid->txnid,
+	    (u_long)argp->prev_lsn.file,
+	    (u_long)argp->prev_lsn.offset);
+	printf("\tname: ");
+	for (i = 0; i < argp->name.size; i++) {
+		ch = ((u_int8_t *)argp->name.data)[i];
+		if (isprint(ch) || ch == 0xa)
+			putchar(ch);
+		else
+			printf("%#x ", ch);
+	}
+	printf("\n");
+	printf("\tmode: %o\n", argp->mode);
+	printf("\n");
+	__os_free(argp, 0);
+	return (0);
+}
+
+int
+__crdel_fileopen_read(dbenv, recbuf, argpp)
+	DB_ENV *dbenv;
+	void *recbuf;
+	__crdel_fileopen_args **argpp;
+{
+	__crdel_fileopen_args *argp;
+	u_int8_t *bp;
+	int ret;
+
+	ret = __os_malloc(dbenv, sizeof(__crdel_fileopen_args) +
+	    sizeof(DB_TXN), NULL, &argp);
+	if (ret != 0)
+		return (ret);
+	argp->txnid = (DB_TXN *)&argp[1];
+	bp = recbuf;
+	memcpy(&argp->type, bp, sizeof(argp->type));
+	bp += sizeof(argp->type);
+	memcpy(&argp->txnid->txnid,  bp, sizeof(argp->txnid->txnid));
+	bp += sizeof(argp->txnid->txnid);
+	memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+	bp += sizeof(DB_LSN);
+	memset(&argp->name, 0, sizeof(argp->name));
+	memcpy(&argp->name.size, bp, sizeof(u_int32_t));
+	bp += sizeof(u_int32_t);
+	argp->name.data = bp;
+	bp += argp->name.size;
+	memcpy(&argp->mode, bp, sizeof(argp->mode));
+	bp += sizeof(argp->mode);
+	*argpp = argp;
+	return (0);
+}
+
+int
+__crdel_metasub_log(dbenv, txnid, ret_lsnp, flags,
+	fileid, pgno, page, lsn)
+	DB_ENV *dbenv;
+	DB_TXN *txnid;
+	DB_LSN *ret_lsnp;
+	u_int32_t flags;
+	int32_t fileid;
+	db_pgno_t pgno;
+	const DBT *page;
+	DB_LSN * lsn;
+{
+	DBT logrec;
+	DB_LSN *lsnp, null_lsn;
+	u_int32_t zero;
+	u_int32_t rectype, txn_num;
+	int ret;
+	u_int8_t *bp;
+
+	rectype = DB_crdel_metasub;
+	if (txnid != NULL &&
+	    TAILQ_FIRST(&txnid->kids) != NULL &&
+	    (ret = __txn_activekids(dbenv, rectype, txnid)) != 0)
+		return (ret);
+	txn_num = txnid == NULL ? 0 : txnid->txnid;
+	if (txnid == NULL) {
+		ZERO_LSN(null_lsn);
+		lsnp = &null_lsn;
+	} else
+		lsnp = &txnid->last_lsn;
+	logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN)
+	    + sizeof(fileid)
+	    + sizeof(pgno)
+	    + sizeof(u_int32_t) + (page == NULL ? 0 : page->size)
+	    + sizeof(*lsn);
+	if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0)
+		return (ret);
+
+	bp = logrec.data;
+	memcpy(bp, &rectype, sizeof(rectype));
+	bp += sizeof(rectype);
+	memcpy(bp, &txn_num, sizeof(txn_num));
+	bp += sizeof(txn_num);
+	memcpy(bp, lsnp, sizeof(DB_LSN));
+	bp += sizeof(DB_LSN);
+	memcpy(bp, &fileid, sizeof(fileid));
+	bp += sizeof(fileid);
+	memcpy(bp, &pgno, sizeof(pgno));
+	bp += sizeof(pgno);
+	if (page == NULL) {
+		zero = 0;
+		memcpy(bp, &zero, sizeof(u_int32_t));
+		bp += sizeof(u_int32_t);
+	} else {
+		memcpy(bp, &page->size, sizeof(page->size));
+		bp += sizeof(page->size);
+		memcpy(bp, page->data, page->size);
+		bp += page->size;
+	}
+	if (lsn != NULL)
+		memcpy(bp, lsn, sizeof(*lsn));
+	else
+		memset(bp, 0, sizeof(*lsn));
+	bp += sizeof(*lsn);
+	DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size);
+	ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags);
+	if (txnid != NULL)
+		txnid->last_lsn = *ret_lsnp;
+	__os_free(logrec.data, logrec.size);
+	return (ret);
+}
+
+int
+__crdel_metasub_print(dbenv, dbtp, lsnp, notused2, notused3)
+	DB_ENV *dbenv;
+	DBT *dbtp;
+	DB_LSN *lsnp;
+	db_recops notused2;
+	void *notused3;
+{
+	__crdel_metasub_args *argp;
+	u_int32_t i;
+	u_int ch;
+	int ret;
+
+	i = 0;
+	ch = 0;
+	notused2 = DB_TXN_ABORT;
+	notused3 = NULL;
+
+	if ((ret = __crdel_metasub_read(dbenv, dbtp->data, &argp)) != 0)
+		return (ret);
+	printf("[%lu][%lu]crdel_metasub: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+	    (u_long)lsnp->file,
+	    (u_long)lsnp->offset,
+	    (u_long)argp->type,
+	    (u_long)argp->txnid->txnid,
+	    (u_long)argp->prev_lsn.file,
+	    (u_long)argp->prev_lsn.offset);
+	printf("\tfileid: %ld\n", (long)argp->fileid);
+	printf("\tpgno: %d\n", argp->pgno);
+	printf("\tpage: ");
+	for (i = 0; i < argp->page.size; i++) {
+		ch = ((u_int8_t *)argp->page.data)[i];
+		if (isprint(ch) || ch == 0xa)
+			putchar(ch);
+		else
+			printf("%#x ", ch);
+	}
+	printf("\n");
+	printf("\tlsn: [%lu][%lu]\n",
+	    (u_long)argp->lsn.file, (u_long)argp->lsn.offset);
+	printf("\n");
+	__os_free(argp, 0);
+	return (0);
+}
+
+int
+__crdel_metasub_read(dbenv, recbuf, argpp)
+	DB_ENV *dbenv;
+	void *recbuf;
+	__crdel_metasub_args **argpp;
+{
+	__crdel_metasub_args *argp;
+	u_int8_t *bp;
+	int ret;
+
+	ret = __os_malloc(dbenv, sizeof(__crdel_metasub_args) +
+	    sizeof(DB_TXN), NULL, &argp);
+	if (ret != 0)
+		return (ret);
+	argp->txnid = (DB_TXN *)&argp[1];
+	bp = recbuf;
+	memcpy(&argp->type, bp, sizeof(argp->type));
+	bp += sizeof(argp->type);
+	memcpy(&argp->txnid->txnid,  bp, sizeof(argp->txnid->txnid));
+	bp += sizeof(argp->txnid->txnid);
+	memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+	bp += sizeof(DB_LSN);
+	memcpy(&argp->fileid, bp, sizeof(argp->fileid));
+	bp += sizeof(argp->fileid);
+	memcpy(&argp->pgno, bp, sizeof(argp->pgno));
+	bp += sizeof(argp->pgno);
+	memset(&argp->page, 0, sizeof(argp->page));
+	memcpy(&argp->page.size, bp, sizeof(u_int32_t));
+	bp += sizeof(u_int32_t);
+	argp->page.data = bp;
+	bp += argp->page.size;
+	memcpy(&argp->lsn, bp,  sizeof(argp->lsn));
+	bp += sizeof(argp->lsn);
+	*argpp = argp;
+	return (0);
+}
+
+int
+__crdel_metapage_log(dbenv, txnid, ret_lsnp, flags,
+	fileid, name, pgno, page)
+	DB_ENV *dbenv;
+	DB_TXN *txnid;
+	DB_LSN *ret_lsnp;
+	u_int32_t flags;
+	int32_t fileid;
+	const DBT *name;
+	db_pgno_t pgno;
+	const DBT *page;
+{
+	DBT logrec;
+	DB_LSN *lsnp, null_lsn;
+	u_int32_t zero;
+	u_int32_t rectype, txn_num;
+	int ret;
+	u_int8_t *bp;
+
+	rectype = DB_crdel_metapage;
+	if (txnid != NULL &&
+	    TAILQ_FIRST(&txnid->kids) != NULL &&
+	    (ret = __txn_activekids(dbenv, rectype, txnid)) != 0)
+		return (ret);
+	txn_num = txnid == NULL ? 0 : txnid->txnid;
+	if (txnid == NULL) {
+		ZERO_LSN(null_lsn);
+		lsnp = &null_lsn;
+	} else
+		lsnp = &txnid->last_lsn;
+	logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN)
+	    + sizeof(fileid)
+	    + sizeof(u_int32_t) + (name == NULL ? 0 : name->size)
+	    + sizeof(pgno)
+	    + sizeof(u_int32_t) + (page == NULL ? 0 : page->size);
+	if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0)
+		return (ret);
+
+	bp = logrec.data;
+	memcpy(bp, &rectype, sizeof(rectype));
+	bp += sizeof(rectype);
+	memcpy(bp, &txn_num, sizeof(txn_num));
+	bp += sizeof(txn_num);
+	memcpy(bp, lsnp, sizeof(DB_LSN));
+	bp += sizeof(DB_LSN);
+	memcpy(bp, &fileid, sizeof(fileid));
+	bp += sizeof(fileid);
+	if (name == NULL) {
+		zero = 0;
+		memcpy(bp, &zero, sizeof(u_int32_t));
+		bp += sizeof(u_int32_t);
+	} else {
+		memcpy(bp, &name->size, sizeof(name->size));
+		bp += sizeof(name->size);
+		memcpy(bp, name->data, name->size);
+		bp += name->size;
+	}
+	memcpy(bp, &pgno, sizeof(pgno));
+	bp += sizeof(pgno);
+	if (page == NULL) {
+		zero = 0;
+		memcpy(bp, &zero, sizeof(u_int32_t));
+		bp += sizeof(u_int32_t);
+	} else {
+		memcpy(bp, &page->size, sizeof(page->size));
+		bp += sizeof(page->size);
+		memcpy(bp, page->data, page->size);
+		bp += page->size;
+	}
+	DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size);
+	ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags);
+	if (txnid != NULL)
+		txnid->last_lsn = *ret_lsnp;
+	__os_free(logrec.data, logrec.size);
+	return (ret);
+}
+
+int
+__crdel_metapage_print(dbenv, dbtp, lsnp, notused2, notused3)
+	DB_ENV *dbenv;
+	DBT *dbtp;
+	DB_LSN *lsnp;
+	db_recops notused2;
+	void *notused3;
+{
+	__crdel_metapage_args *argp;
+	u_int32_t i;
+	u_int ch;
+	int ret;
+
+	i = 0;
+	ch = 0;
+	notused2 = DB_TXN_ABORT;
+	notused3 = NULL;
+
+	if ((ret = __crdel_metapage_read(dbenv, dbtp->data, &argp)) != 0)
+		return (ret);
+	printf("[%lu][%lu]crdel_metapage: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+	    (u_long)lsnp->file,
+	    (u_long)lsnp->offset,
+	    (u_long)argp->type,
+	    (u_long)argp->txnid->txnid,
+	    (u_long)argp->prev_lsn.file,
+	    (u_long)argp->prev_lsn.offset);
+	printf("\tfileid: %ld\n", (long)argp->fileid);
+	printf("\tname: ");
+	for (i = 0; i < argp->name.size; i++) {
+		ch = ((u_int8_t *)argp->name.data)[i];
+		if (isprint(ch) || ch == 0xa)
+			putchar(ch);
+		else
+			printf("%#x ", ch);
+	}
+	printf("\n");
+	printf("\tpgno: %d\n", argp->pgno);
+	printf("\tpage: ");
+	for (i = 0; i < argp->page.size; i++) {
+		ch = ((u_int8_t *)argp->page.data)[i];
+		if (isprint(ch) || ch == 0xa)
+			putchar(ch);
+		else
+			printf("%#x ", ch);
+	}
+	printf("\n");
+	printf("\n");
+	__os_free(argp, 0);
+	return (0);
+}
+
+int
+__crdel_metapage_read(dbenv, recbuf, argpp)
+	DB_ENV *dbenv;
+	void *recbuf;
+	__crdel_metapage_args **argpp;
+{
+	__crdel_metapage_args *argp;
+	u_int8_t *bp;
+	int ret;
+
+	ret = __os_malloc(dbenv, sizeof(__crdel_metapage_args) +
+	    sizeof(DB_TXN), NULL, &argp);
+	if (ret != 0)
+		return (ret);
+	argp->txnid = (DB_TXN *)&argp[1];
+	bp = recbuf;
+	memcpy(&argp->type, bp, sizeof(argp->type));
+	bp += sizeof(argp->type);
+	memcpy(&argp->txnid->txnid,  bp, sizeof(argp->txnid->txnid));
+	bp += sizeof(argp->txnid->txnid);
+	memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+	bp += sizeof(DB_LSN);
+	memcpy(&argp->fileid, bp, sizeof(argp->fileid));
+	bp += sizeof(argp->fileid);
+	memset(&argp->name, 0, sizeof(argp->name));
+	memcpy(&argp->name.size, bp, sizeof(u_int32_t));
+	bp += sizeof(u_int32_t);
+	argp->name.data = bp;
+	bp += argp->name.size;
+	memcpy(&argp->pgno, bp, sizeof(argp->pgno));
+	bp += sizeof(argp->pgno);
+	memset(&argp->page, 0, sizeof(argp->page));
+	memcpy(&argp->page.size, bp, sizeof(u_int32_t));
+	bp += sizeof(u_int32_t);
+	argp->page.data = bp;
+	bp += argp->page.size;
+	*argpp = argp;
+	return (0);
+}
+
+int
+__crdel_old_delete_print(dbenv, dbtp, lsnp, notused2, notused3)
+	DB_ENV *dbenv;
+	DBT *dbtp;
+	DB_LSN *lsnp;
+	db_recops notused2;
+	void *notused3;
+{
+	__crdel_old_delete_args *argp;
+	u_int32_t i;
+	u_int ch;
+	int ret;
+
+	i = 0;
+	ch = 0;
+	notused2 = DB_TXN_ABORT;
+	notused3 = NULL;
+
+	if ((ret = __crdel_old_delete_read(dbenv, dbtp->data, &argp)) != 0)
+		return (ret);
+	printf("[%lu][%lu]crdel_old_delete: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+	    (u_long)lsnp->file,
+	    (u_long)lsnp->offset,
+	    (u_long)argp->type,
+	    (u_long)argp->txnid->txnid,
+	    (u_long)argp->prev_lsn.file,
+	    (u_long)argp->prev_lsn.offset);
+	printf("\tname: ");
+	for (i = 0; i < argp->name.size; i++) {
+		ch = ((u_int8_t *)argp->name.data)[i];
+		if (isprint(ch) || ch == 0xa)
+			putchar(ch);
+		else
+			printf("%#x ", ch);
+	}
+	printf("\n");
+	printf("\n");
+	__os_free(argp, 0);
+	return (0);
+}
+
+int
+__crdel_old_delete_read(dbenv, recbuf, argpp)
+	DB_ENV *dbenv;
+	void *recbuf;
+	__crdel_old_delete_args **argpp;
+{
+	__crdel_old_delete_args *argp;
+	u_int8_t *bp;
+	int ret;
+
+	ret = __os_malloc(dbenv, sizeof(__crdel_old_delete_args) +
+	    sizeof(DB_TXN), NULL, &argp);
+	if (ret != 0)
+		return (ret);
+	argp->txnid = (DB_TXN *)&argp[1];
+	bp = recbuf;
+	memcpy(&argp->type, bp, sizeof(argp->type));
+	bp += sizeof(argp->type);
+	memcpy(&argp->txnid->txnid,  bp, sizeof(argp->txnid->txnid));
+	bp += sizeof(argp->txnid->txnid);
+	memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+	bp += sizeof(DB_LSN);
+	memset(&argp->name, 0, sizeof(argp->name));
+	memcpy(&argp->name.size, bp, sizeof(u_int32_t));
+	bp += sizeof(u_int32_t);
+	argp->name.data = bp;
+	bp += argp->name.size;
+	*argpp = argp;
+	return (0);
+}
+
+int
+__crdel_rename_log(dbenv, txnid, ret_lsnp, flags,
+	fileid, name, newname)
+	DB_ENV *dbenv;
+	DB_TXN *txnid;
+	DB_LSN *ret_lsnp;
+	u_int32_t flags;
+	int32_t fileid;
+	const DBT *name;
+	const DBT *newname;
+{
+	DBT logrec;
+	DB_LSN *lsnp, null_lsn;
+	u_int32_t zero;
+	u_int32_t rectype, txn_num;
+	int ret;
+	u_int8_t *bp;
+
+	rectype = DB_crdel_rename;
+	if (txnid != NULL &&
+	    TAILQ_FIRST(&txnid->kids) != NULL &&
+	    (ret = __txn_activekids(dbenv, rectype, txnid)) != 0)
+		return (ret);
+	txn_num = txnid == NULL ? 0 : txnid->txnid;
+	if (txnid == NULL) {
+		ZERO_LSN(null_lsn);
+		lsnp = &null_lsn;
+	} else
+		lsnp = &txnid->last_lsn;
+	logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN)
+	    + sizeof(fileid)
+	    + sizeof(u_int32_t) + (name == NULL ? 0 : name->size)
+	    + sizeof(u_int32_t) + (newname == NULL ? 0 : newname->size);
+	if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0)
+		return (ret);
+
+	bp = logrec.data;
+	memcpy(bp, &rectype, sizeof(rectype));
+	bp += sizeof(rectype);
+	memcpy(bp, &txn_num, sizeof(txn_num));
+	bp += sizeof(txn_num);
+	memcpy(bp, lsnp, sizeof(DB_LSN));
+	bp += sizeof(DB_LSN);
+	memcpy(bp, &fileid, sizeof(fileid));
+	bp += sizeof(fileid);
+	if (name == NULL) {
+		zero = 0;
+		memcpy(bp, &zero, sizeof(u_int32_t));
+		bp += sizeof(u_int32_t);
+	} else {
+		memcpy(bp, &name->size, sizeof(name->size));
+		bp += sizeof(name->size);
+		memcpy(bp, name->data, name->size);
+		bp += name->size;
+	}
+	if (newname == NULL) {
+		zero = 0;
+		memcpy(bp, &zero, sizeof(u_int32_t));
+		bp += sizeof(u_int32_t);
+	} else {
+		memcpy(bp, &newname->size, sizeof(newname->size));
+		bp += sizeof(newname->size);
+		memcpy(bp, newname->data, newname->size);
+		bp += newname->size;
+	}
+	DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size);
+	ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags);
+	if (txnid != NULL)
+		txnid->last_lsn = *ret_lsnp;
+	__os_free(logrec.data, logrec.size);
+	return (ret);
+}
+
+int
+__crdel_rename_print(dbenv, dbtp, lsnp, notused2, notused3)
+	DB_ENV *dbenv;
+	DBT *dbtp;
+	DB_LSN *lsnp;
+	db_recops notused2;
+	void *notused3;
+{
+	__crdel_rename_args *argp;
+	u_int32_t i;
+	u_int ch;
+	int ret;
+
+	i = 0;
+	ch = 0;
+	notused2 = DB_TXN_ABORT;
+	notused3 = NULL;
+
+	if ((ret = __crdel_rename_read(dbenv, dbtp->data, &argp)) != 0)
+		return (ret);
+	printf("[%lu][%lu]crdel_rename: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+	    (u_long)lsnp->file,
+	    (u_long)lsnp->offset,
+	    (u_long)argp->type,
+	    (u_long)argp->txnid->txnid,
+	    (u_long)argp->prev_lsn.file,
+	    (u_long)argp->prev_lsn.offset);
+	printf("\tfileid: %ld\n", (long)argp->fileid);
+	printf("\tname: ");
+	for (i = 0; i < argp->name.size; i++) {
+		ch = ((u_int8_t *)argp->name.data)[i];
+		if (isprint(ch) || ch == 0xa)
+			putchar(ch);
+		else
+			printf("%#x ", ch);
+	}
+	printf("\n");
+	printf("\tnewname: ");
+	for (i = 0; i < argp->newname.size; i++) {
+		ch = ((u_int8_t *)argp->newname.data)[i];
+		if (isprint(ch) || ch == 0xa)
+			putchar(ch);
+		else
+			printf("%#x ", ch);
+	}
+	printf("\n");
+	printf("\n");
+	__os_free(argp, 0);
+	return (0);
+}
+
+int
+__crdel_rename_read(dbenv, recbuf, argpp)
+	DB_ENV *dbenv;
+	void *recbuf;
+	__crdel_rename_args **argpp;
+{
+	__crdel_rename_args *argp;
+	u_int8_t *bp;
+	int ret;
+
+	ret = __os_malloc(dbenv, sizeof(__crdel_rename_args) +
+	    sizeof(DB_TXN), NULL, &argp);
+	if (ret != 0)
+		return (ret);
+	argp->txnid = (DB_TXN *)&argp[1];
+	bp = recbuf;
+	memcpy(&argp->type, bp, sizeof(argp->type));
+	bp += sizeof(argp->type);
+	memcpy(&argp->txnid->txnid,  bp, sizeof(argp->txnid->txnid));
+	bp += sizeof(argp->txnid->txnid);
+	memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+	bp += sizeof(DB_LSN);
+	memcpy(&argp->fileid, bp, sizeof(argp->fileid));
+	bp += sizeof(argp->fileid);
+	memset(&argp->name, 0, sizeof(argp->name));
+	memcpy(&argp->name.size, bp, sizeof(u_int32_t));
+	bp += sizeof(u_int32_t);
+	argp->name.data = bp;
+	bp += argp->name.size;
+	memset(&argp->newname, 0, sizeof(argp->newname));
+	memcpy(&argp->newname.size, bp, sizeof(u_int32_t));
+	bp += sizeof(u_int32_t);
+	argp->newname.data = bp;
+	bp += argp->newname.size;
+	*argpp = argp;
+	return (0);
+}
+
+int
+__crdel_delete_log(dbenv, txnid, ret_lsnp, flags,
+	fileid, name)
+	DB_ENV *dbenv;
+	DB_TXN *txnid;
+	DB_LSN *ret_lsnp;
+	u_int32_t flags;
+	int32_t fileid;
+	const DBT *name;
+{
+	DBT logrec;
+	DB_LSN *lsnp, null_lsn;
+	u_int32_t zero;
+	u_int32_t rectype, txn_num;
+	int ret;
+	u_int8_t *bp;
+
+	rectype = DB_crdel_delete;
+	if (txnid != NULL &&
+	    TAILQ_FIRST(&txnid->kids) != NULL &&
+	    (ret = __txn_activekids(dbenv, rectype, txnid)) != 0)
+		return (ret);
+	txn_num = txnid == NULL ? 0 : txnid->txnid;
+	if (txnid == NULL) {
+		ZERO_LSN(null_lsn);
+		lsnp = &null_lsn;
+	} else
+		lsnp = &txnid->last_lsn;
+	logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN)
+	    + sizeof(fileid)
+	    + sizeof(u_int32_t) + (name == NULL ? 0 : name->size);
+	if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0)
+		return (ret);
+
+	bp = logrec.data;
+	memcpy(bp, &rectype, sizeof(rectype));
+	bp += sizeof(rectype);
+	memcpy(bp, &txn_num, sizeof(txn_num));
+	bp += sizeof(txn_num);
+	memcpy(bp, lsnp, sizeof(DB_LSN));
+	bp += sizeof(DB_LSN);
+	memcpy(bp, &fileid, sizeof(fileid));
+	bp += sizeof(fileid);
+	if (name == NULL) {
+		zero = 0;
+		memcpy(bp, &zero, sizeof(u_int32_t));
+		bp += sizeof(u_int32_t);
+	} else {
+		memcpy(bp, &name->size, sizeof(name->size));
+		bp += sizeof(name->size);
+		memcpy(bp, name->data, name->size);
+		bp += name->size;
+	}
+	DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size);
+	ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags);
+	if (txnid != NULL)
+		txnid->last_lsn = *ret_lsnp;
+	__os_free(logrec.data, logrec.size);
+	return (ret);
+}
+
+int
+__crdel_delete_print(dbenv, dbtp, lsnp, notused2, notused3)
+	DB_ENV *dbenv;
+	DBT *dbtp;
+	DB_LSN *lsnp;
+	db_recops notused2;
+	void *notused3;
+{
+	__crdel_delete_args *argp;
+	u_int32_t i;
+	u_int ch;
+	int ret;
+
+	i = 0;
+	ch = 0;
+	notused2 = DB_TXN_ABORT;
+	notused3 = NULL;
+
+	if ((ret = __crdel_delete_read(dbenv, dbtp->data, &argp)) != 0)
+		return (ret);
+	printf("[%lu][%lu]crdel_delete: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+	    (u_long)lsnp->file,
+	    (u_long)lsnp->offset,
+	    (u_long)argp->type,
+	    (u_long)argp->txnid->txnid,
+	    (u_long)argp->prev_lsn.file,
+	    (u_long)argp->prev_lsn.offset);
+	printf("\tfileid: %ld\n", (long)argp->fileid);
+	printf("\tname: ");
+	for (i = 0; i < argp->name.size; i++) {
+		ch = ((u_int8_t *)argp->name.data)[i];
+		if (isprint(ch) || ch == 0xa)
+			putchar(ch);
+		else
+			printf("%#x ", ch);
+	}
+	printf("\n");
+	printf("\n");
+	__os_free(argp, 0);
+	return (0);
+}
+
+int
+__crdel_delete_read(dbenv, recbuf, argpp)
+	DB_ENV *dbenv;
+	void *recbuf;
+	__crdel_delete_args **argpp;
+{
+	__crdel_delete_args *argp;
+	u_int8_t *bp;
+	int ret;
+
+	ret = __os_malloc(dbenv, sizeof(__crdel_delete_args) +
+	    sizeof(DB_TXN), NULL, &argp);
+	if (ret != 0)
+		return (ret);
+	argp->txnid = (DB_TXN *)&argp[1];
+	bp = recbuf;
+	memcpy(&argp->type, bp, sizeof(argp->type));
+	bp += sizeof(argp->type);
+	memcpy(&argp->txnid->txnid,  bp, sizeof(argp->txnid->txnid));
+	bp += sizeof(argp->txnid->txnid);
+	memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+	bp += sizeof(DB_LSN);
+	memcpy(&argp->fileid, bp, sizeof(argp->fileid));
+	bp += sizeof(argp->fileid);
+	memset(&argp->name, 0, sizeof(argp->name));
+	memcpy(&argp->name.size, bp, sizeof(u_int32_t));
+	bp += sizeof(u_int32_t);
+	argp->name.data = bp;
+	bp += argp->name.size;
+	*argpp = argp;
+	return (0);
+}
+
+int
+__crdel_init_print(dbenv)
+	DB_ENV *dbenv;
+{
+	int ret;
+
+	if ((ret = __db_add_recovery(dbenv,
+	    __crdel_fileopen_print, DB_crdel_fileopen)) != 0)
+		return (ret);
+	if ((ret = __db_add_recovery(dbenv,
+	    __crdel_metasub_print, DB_crdel_metasub)) != 0)
+		return (ret);
+	if ((ret = __db_add_recovery(dbenv,
+	    __crdel_metapage_print, DB_crdel_metapage)) != 0)
+		return (ret);
+	if ((ret = __db_add_recovery(dbenv,
+	    __crdel_old_delete_print, DB_crdel_old_delete)) != 0)
+		return (ret);
+	if ((ret = __db_add_recovery(dbenv,
+	    __crdel_rename_print, DB_crdel_rename)) != 0)
+		return (ret);
+	if ((ret = __db_add_recovery(dbenv,
+	    __crdel_delete_print, DB_crdel_delete)) != 0)
+		return (ret);
+	return (0);
+}
+
+int
+__crdel_init_recover(dbenv)
+	DB_ENV *dbenv;
+{
+	int ret;
+
+	if ((ret = __db_add_recovery(dbenv,
+	    __crdel_fileopen_recover, DB_crdel_fileopen)) != 0)
+		return (ret);
+	if ((ret = __db_add_recovery(dbenv,
+	    __crdel_metasub_recover, DB_crdel_metasub)) != 0)
+		return (ret);
+	if ((ret = __db_add_recovery(dbenv,
+	    __crdel_metapage_recover, DB_crdel_metapage)) != 0)
+		return (ret);
+	if ((ret = __db_add_recovery(dbenv,
+	    __deprecated_recover, DB_crdel_old_delete)) != 0)
+		return (ret);
+	if ((ret = __db_add_recovery(dbenv,
+	    __crdel_rename_recover, DB_crdel_rename)) != 0)
+		return (ret);
+	if ((ret = __db_add_recovery(dbenv,
+	    __crdel_delete_recover, DB_crdel_delete)) != 0)
+		return (ret);
+	return (0);
+}
+
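The generated *_log and *_read routines above share one record layout: a record type, a transaction id, the previous LSN, then the per-record fields, with each variable-length DBT stored as a 4-byte size followed by its bytes (a NULL DBT is stored as size 0). The following is a minimal sketch of that round trip for a rename-style record. It is not the Berkeley DB code: struct lsn, struct blob, struct rename_rec, put_blob, get_blob, rename_pack and rename_unpack are hypothetical names introduced here, and malloc/assert stand in for __os_malloc and DB_ASSERT.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for DB_LSN and DBT; not the Berkeley DB types. */
struct lsn { uint32_t file, offset; };
struct blob { uint32_t size; const void *data; };

/* Decoded record; name/newname point into the caller's record buffer. */
struct rename_rec {
	uint32_t type, txnid;
	struct lsn prev_lsn;
	int32_t fileid;
	struct blob name, newname;
};

/* Append a 4-byte length and then the bytes; a NULL blob has length 0. */
static uint8_t *
put_blob(uint8_t *bp, const struct blob *b)
{
	uint32_t size = b == NULL ? 0 : b->size;

	memcpy(bp, &size, sizeof(size));
	bp += sizeof(size);
	if (size != 0)
		memcpy(bp, b->data, size);
	return (bp + size);
}

/* Read a length-prefixed field without copying the payload. */
static const uint8_t *
get_blob(const uint8_t *bp, struct blob *b)
{
	memcpy(&b->size, bp, sizeof(b->size));
	bp += sizeof(b->size);
	b->data = bp;
	return (bp + b->size);
}

/* Build a record in malloc'd memory; the caller frees it. */
static uint8_t *
rename_pack(uint32_t type, uint32_t txnid, const struct lsn *prev,
    int32_t fileid, const struct blob *name, const struct blob *newname,
    size_t *lenp)
{
	uint8_t *buf, *bp;
	size_t len;

	len = sizeof(type) + sizeof(txnid) + sizeof(*prev) + sizeof(fileid)
	    + sizeof(uint32_t) + (name == NULL ? 0 : name->size)
	    + sizeof(uint32_t) + (newname == NULL ? 0 : newname->size);
	if ((buf = malloc(len)) == NULL)
		return (NULL);
	bp = buf;
	memcpy(bp, &type, sizeof(type));
	bp += sizeof(type);
	memcpy(bp, &txnid, sizeof(txnid));
	bp += sizeof(txnid);
	memcpy(bp, prev, sizeof(*prev));
	bp += sizeof(*prev);
	memcpy(bp, &fileid, sizeof(fileid));
	bp += sizeof(fileid);
	bp = put_blob(bp, name);
	bp = put_blob(bp, newname);
	/* Same sanity check the generated code makes with DB_ASSERT. */
	assert((size_t)(bp - buf) == len);
	*lenp = len;
	return (buf);
}

/* Decode a record produced by rename_pack. */
static void
rename_unpack(const uint8_t *bp, struct rename_rec *r)
{
	memcpy(&r->type, bp, sizeof(r->type));
	bp += sizeof(r->type);
	memcpy(&r->txnid, bp, sizeof(r->txnid));
	bp += sizeof(r->txnid);
	memcpy(&r->prev_lsn, bp, sizeof(r->prev_lsn));
	bp += sizeof(r->prev_lsn);
	memcpy(&r->fileid, bp, sizeof(r->fileid));
	bp += sizeof(r->fileid);
	bp = get_blob(bp, &r->name);
	(void)get_blob(bp, &r->newname);
}
```

As in the generated readers, the decoded name and newname are not copied; they alias the record buffer, so the buffer must outlive the decoded structure.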
diff --git a/bdb/db/crdel_rec.c b/bdb/db/crdel_rec.c
new file mode 100644
index 00000000000..495b92a0ad7
--- /dev/null
+++ b/bdb/db/crdel_rec.c
@@ -0,0 +1,646 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ *	Sleepycat Software.  All rights reserved.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: crdel_rec.c,v 11.43 2000/12/13 08:06:34 krinsky Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "log.h"
+#include "hash.h"
+#include "mp.h"
+#include "db_dispatch.h"
+
+/*
+ * __crdel_fileopen_recover --
+ *	Recovery function for fileopen.
+ *
+ * PUBLIC: int __crdel_fileopen_recover
+ * PUBLIC:   __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *));
+ */
+int
+__crdel_fileopen_recover(dbenv, dbtp, lsnp, op, info)
+	DB_ENV *dbenv;
+	DBT *dbtp;
+	DB_LSN *lsnp;
+	db_recops op;
+	void *info;
+{
+	__crdel_fileopen_args *argp;
+	DBMETA ondisk;
+	DB_FH fh;
+	size_t nr;
+	int do_unlink, ret;
+	u_int32_t b, mb, io;
+	char *real_name;
+
+	COMPQUIET(info, NULL);
+
+	real_name = NULL;
+	REC_PRINT(__crdel_fileopen_print);
+
+	if ((ret = __crdel_fileopen_read(dbenv, dbtp->data, &argp)) != 0)
+		return (ret);
+	/*
+	 * If this is an in-memory database, then the name is going to
+	 * be NULL, which looks like a 0-length name in recovery.
+	 */
+	if (argp->name.size == 0)
+		goto done;
+
+	if ((ret = __db_appname(dbenv, DB_APP_DATA,
+	    NULL, argp->name.data, 0, NULL, &real_name)) != 0)
+		goto out;
+	if (DB_REDO(op)) {
+		/*
+		 * The create committed, so we need to make sure that the file
+		 * exists.  A simple open should suffice.
+		 */
+		if ((ret = __os_open(dbenv, real_name,
+		    DB_OSO_CREATE, argp->mode, &fh)) != 0)
+			goto out;
+		if ((ret = __os_closehandle(&fh)) != 0)
+			goto out;
+	} else if (DB_UNDO(op)) {
+		/*
+		 * If the file is 0-length then it was in the process of being
+		 * created, so we should unlink it.  If it is non-0 length, then
+		 * either someone else created it and we need to leave it
+		 * untouched or we were in the process of creating it, allocated
+		 * the first page on a system that requires you to actually
+		 * write pages as you allocate them, but never got any data
+		 * on it.
+		 * If the file doesn't exist, we never got around to creating
+		 * it, so that's fine.
+		 */
+		if (__os_exists(real_name, NULL) != 0)
+			goto done;
+
+		if ((ret = __os_open(dbenv, real_name, 0, 0, &fh)) != 0)
+			goto out;
+		if ((ret = __os_ioinfo(dbenv,
+		    real_name, &fh, &mb, &b, &io)) != 0)
+			goto out;
+		do_unlink = 0;
+		if (mb != 0 || b != 0) {
+			/*
+			 * We need to read the first page
+			 * to see if it has valid data on it.
+			 */
+			if ((ret = __os_read(dbenv, &fh,
+			    &ondisk, sizeof(ondisk), &nr)) != 0 ||
+			    nr != sizeof(ondisk))
+				goto out;
+			if (ondisk.magic == 0)
+				do_unlink = 1;
+		}
+		if ((ret = __os_closehandle(&fh)) != 0)
+			goto out;
+		/* Check for 0-length and if it is, delete it. */
+		if (do_unlink || (mb == 0 && b == 0))
+			if ((ret = __os_unlink(dbenv, real_name)) != 0)
+				goto out;
+	}
+
+done:	*lsnp = argp->prev_lsn;
+	ret = 0;
+
+out:	if (argp != NULL)
+		__os_free(argp, 0);
+	if (real_name != NULL)
+		__os_freestr(real_name);
+	return (ret);
+}
+
+/*
+ * __crdel_metasub_recover --
+ *	Recovery function for metasub.
+ *
+ * PUBLIC: int __crdel_metasub_recover
+ * PUBLIC:   __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *));
+ */
+int
+__crdel_metasub_recover(dbenv, dbtp, lsnp, op, info)
+	DB_ENV *dbenv;
+	DBT *dbtp;
+	DB_LSN *lsnp;
+	db_recops op;
+	void *info;
+{
+	__crdel_metasub_args *argp;
+	DB *file_dbp;
+	DBC *dbc;
+	DB_MPOOLFILE *mpf;
+	PAGE *pagep;
+	u_int8_t *file_uid, ptype;
+	int cmp_p, modified, reopen, ret;
+
+	COMPQUIET(info, NULL);
+	REC_PRINT(__crdel_metasub_print);
+	REC_INTRO(__crdel_metasub_read, 0);
+
+	if ((ret = memp_fget(mpf, &argp->pgno, 0, &pagep)) != 0) {
+		if (DB_REDO(op)) {
+			if ((ret = memp_fget(mpf,
+			    &argp->pgno, DB_MPOOL_CREATE, &pagep)) != 0)
+				goto out;
+		} else {
+			*lsnp = argp->prev_lsn;
+			ret = 0;
+			goto out;
+		}
+	}
+
+	modified = 0;
+	reopen = 0;
+	cmp_p = log_compare(&LSN(pagep), &argp->lsn);
+	CHECK_LSN(op, cmp_p, &LSN(pagep), &argp->lsn);
+
+	if (cmp_p == 0 && DB_REDO(op)) {
+		memcpy(pagep, argp->page.data, argp->page.size);
+		LSN(pagep) = *lsnp;
+		modified = 1;
+		/*
+		 * If this is a meta-data page, then we must reopen;
+		 * if it was a root page, then we do not.
+		 */
+		ptype = ((DBMETA *)argp->page.data)->type;
+		if (ptype == P_HASHMETA || ptype == P_BTREEMETA ||
+		    ptype == P_QAMMETA)
+			reopen = 1;
+	} else if (DB_UNDO(op)) {
+		/*
+		 * We want to undo this page creation.  The page creation
+		 * happened in two parts.  First, we called __bam_new which
+		 * was logged separately. Then we wrote the meta-data onto
+		 * the page.  So long as we restore the LSN, then the recovery
+		 * for __bam_new will do everything else.
+		 * Don't bother checking the lsn on the page.  If we
+		 * are rolling back the next thing is that this page
+		 * will get freed.  Opening the subdb will have reinitialized
+		 * the page, but not the lsn.
+		 */
+		LSN(pagep) = argp->lsn;
+		modified = 1;
+	}
+	if ((ret = memp_fput(mpf, pagep, modified ? DB_MPOOL_DIRTY : 0)) != 0)
+		goto out;
+
+	/*
+	 * If we are redoing a subdatabase create, we must close and reopen the
+	 * file to be sure that we have the proper meta information in the
+	 * in-memory structures.
+	 */
+	if (reopen) {
+		/* Close cursor if it's open. */
+		if (dbc != NULL) {
+			dbc->c_close(dbc);
+			dbc = NULL;
+		}
+
+		if ((ret = __os_malloc(dbenv,
+		    DB_FILE_ID_LEN, NULL, &file_uid)) != 0)
+			goto out;
+		memcpy(file_uid, &file_dbp->fileid[0], DB_FILE_ID_LEN);
+		ret = __log_reopen_file(dbenv,
+		     NULL, argp->fileid, file_uid, argp->pgno);
+		(void)__os_free(file_uid, DB_FILE_ID_LEN);
+		if (ret != 0)
+			goto out;
+	}
+
+done:	*lsnp = argp->prev_lsn;
+	ret = 0;
+
+out:	REC_CLOSE;
+}
+
+/*
+ * __crdel_metapage_recover --
+ *	Recovery function for metapage.
+ *
+ * PUBLIC: int __crdel_metapage_recover
+ * PUBLIC:   __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *));
+ */
+int
+__crdel_metapage_recover(dbenv, dbtp, lsnp, op, info)
+	DB_ENV *dbenv;
+	DBT *dbtp;
+	DB_LSN *lsnp;
+	db_recops op;
+	void *info;
+{
+	__crdel_metapage_args *argp;
+	DB *dbp;
+	DBMETA *meta, ondisk;
+	DB_FH fh;
+	size_t nr;
+	u_int32_t b, io, mb, pagesize;
+	int is_done, ret;
+	char *real_name;
+
+	COMPQUIET(info, NULL);
+
+	real_name = NULL;
+	memset(&fh, 0, sizeof(fh));
+	REC_PRINT(__crdel_metapage_print);
+
+	if ((ret = __crdel_metapage_read(dbenv, dbtp->data, &argp)) != 0)
+		return (ret);
+
+	/*
+	 * If this is an in-memory database, then the name is going to
+	 * be NULL, which looks like a 0-length name in recovery.
+	 */
+	if (argp->name.size == 0)
+		goto done;
+
+	meta = (DBMETA *)argp->page.data;
+	__ua_memcpy(&pagesize, &meta->pagesize, sizeof(pagesize));
+
+	if ((ret = __db_appname(dbenv, DB_APP_DATA,
+	    NULL, argp->name.data, 0, NULL, &real_name)) != 0)
+		goto out;
+	if (DB_REDO(op)) {
+		if ((ret = __db_fileid_to_db(dbenv,
+		    &dbp, argp->fileid, 0)) != 0) {
+			if (ret == DB_DELETED)
+				goto done;
+			else
+				goto out;
+		}
+
+		/*
+		 * We simply read the meta-data page and, if its magic number
+		 * is 0 (never written), we write the meta-data page.
+		 */
+		if ((ret = __os_open(dbenv, real_name, 0, 0, &fh)) != 0)
+			goto out;
+		if ((ret = __os_seek(dbenv, &fh,
+		    pagesize, argp->pgno, 0, 0, DB_OS_SEEK_SET)) != 0)
+			goto out;
+		/*
+		 * If the read succeeds, then the page exists, and we need
+		 * to verify that the page has actually been written, because
+		 * on some systems (e.g., Windows) we preallocate pages because
+		 * files aren't allowed to have holes in them.  If the page
+		 * looks good then we're done.
+		 */
+		if ((ret = __os_read(dbenv, &fh, &ondisk,
+		    sizeof(ondisk), &nr)) == 0 && nr == sizeof(ondisk)) {
+			if (ondisk.magic != 0)
+				goto done;
+			if ((ret = __os_seek(dbenv, &fh,
+			    pagesize, argp->pgno, 0, 0, DB_OS_SEEK_SET)) != 0)
+				goto out;
+		}
+
+		/*
+		 * Page didn't exist, update the LSN and write a new one.
+		 * (seek pointer shouldn't have moved)
+		 */
+		__ua_memcpy(&meta->lsn, lsnp, sizeof(DB_LSN));
+		if ((ret = __os_write(dbp->dbenv, &fh,
+		    argp->page.data, argp->page.size, &nr)) != 0)
+			goto out;
+		if (nr != (size_t)argp->page.size) {
+			__db_err(dbenv, "Write failed during recovery");
+			ret = EIO;
+			goto out;
+		}
+
+		/*
+		 * We must close and reopen the file to be sure
+		 * that we have the proper meta information
+		 * in the in-memory structures.
+		 */
+
+		if ((ret = __log_reopen_file(dbenv,
+		     argp->name.data, argp->fileid,
+		     meta->uid, argp->pgno)) != 0)
+			goto out;
+
+		/* Handle will be closed on exit. */
+	} else if (DB_UNDO(op)) {
+		is_done = 0;
+
+		/* If file does not exist, there is nothing to undo. */
+		if (__os_exists(real_name, NULL) != 0)
+			goto done;
+
+		/*
+		 * Before we can look at anything on disk, we have to check
+		 * if there is a valid dbp for this, and if there is, we'd
+		 * better flush it.
+		 */
+		dbp = NULL;
+		if ((ret =
+		    __db_fileid_to_db(dbenv, &dbp, argp->fileid, 0)) == 0)
+			(void)dbp->sync(dbp, 0);
+
+		/*
+		 * We need to make sure that we do not remove a file that
+		 * someone else created.   If the file is 0-length, then we
+		 * can assume that we created it and remove it.  If it is
+		 * not 0-length, then we need to check the LSN and make
+		 * sure that it's the file we created.
+		 */
+		if ((ret = __os_open(dbenv, real_name, 0, 0, &fh)) != 0)
+			goto out;
+		if ((ret = __os_ioinfo(dbenv,
+		    real_name, &fh, &mb, &b, &io)) != 0)
+			goto out;
+		if (mb != 0 || b != 0) {
+			/* The file has something in it. */
+			if ((ret = __os_seek(dbenv, &fh,
+			    pagesize, argp->pgno, 0, 0, DB_OS_SEEK_SET)) != 0)
+				goto out;
+			if ((ret = __os_read(dbenv, &fh,
+			    &ondisk, sizeof(ondisk), &nr)) != 0)
+				goto out;
+			if (log_compare(&ondisk.lsn, lsnp) != 0)
+				is_done = 1;
+		}
+
+		/*
+		 * Must close here, because unlink with the file open fails
+		 * on some systems.
+		 */
+		if ((ret = __os_closehandle(&fh)) != 0)
+			goto out;
+
+		if (!is_done) {
+			/*
+			 * On some systems, you cannot unlink an open file so
+			 * we close the fd in the dbp here and make sure we
+			 * don't try to close it again.  First, check for a
+			 * saved_open_fhp, then close down the mpool.
+			 */
+			if (dbp != NULL && dbp->saved_open_fhp != NULL &&
+			    F_ISSET(dbp->saved_open_fhp, DB_FH_VALID) &&
+			    (ret = __os_closehandle(dbp->saved_open_fhp)) != 0)
+				goto out;
+			if (dbp != NULL && dbp->mpf != NULL) {
+				(void)__memp_fremove(dbp->mpf);
+				if ((ret = memp_fclose(dbp->mpf)) != 0)
+					goto out;
+				F_SET(dbp, DB_AM_DISCARD);
+				dbp->mpf = NULL;
+			}
+			if ((ret = __os_unlink(dbenv, real_name)) != 0)
+				goto out;
+		}
+	}
+
+done:	*lsnp = argp->prev_lsn;
+	ret = 0;
+
+out:	if (argp != NULL)
+		__os_free(argp, 0);
+	if (real_name != NULL)
+		__os_freestr(real_name);
+	if (F_ISSET(&fh, DB_FH_VALID))
+		(void)__os_closehandle(&fh);
+	return (ret);
+}
+
+/*
+ * __crdel_delete_recover --
+ *	Recovery function for delete.
+ *
+ * PUBLIC: int __crdel_delete_recover
+ * PUBLIC:   __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *));
+ */
+int
+__crdel_delete_recover(dbenv, dbtp, lsnp, op, info)
+	DB_ENV *dbenv;
+	DBT *dbtp;
+	DB_LSN *lsnp;
+	db_recops op;
+	void *info;
+{
+	DB *dbp;
+	__crdel_delete_args *argp;
+	int ret;
+	char *backup, *real_back, *real_name;
+
+	REC_PRINT(__crdel_delete_print);
+
+	backup = real_back = real_name = NULL;
+	if ((ret = __crdel_delete_read(dbenv, dbtp->data, &argp)) != 0)
+		return (ret);
+
+	if (DB_REDO(op)) {
+		/*
+		 * On a recovery, as we recreate what was going on, we
+		 * recreate the creation of the file.  And so, even though
+		 * it committed, we need to delete it.  Try to delete it,
+		 * but it is not an error if that delete fails.
+		 */
+		if ((ret = __db_appname(dbenv, DB_APP_DATA,
+		    NULL, argp->name.data, 0, NULL, &real_name)) != 0)
+			goto out;
+		if (__os_exists(real_name, NULL) == 0) {
+			/*
+			 * If a file is deleted and then recreated, it's
+			 * possible for the __os_exists call above to
+			 * return success and for us to get here, but for
+			 * the fileid we're looking for to be marked
+			 * deleted.  In that case, we needn't redo the
+			 * unlink even though the file exists, and it's
+			 * not an error.
+			 */
+			ret = __db_fileid_to_db(dbenv, &dbp, argp->fileid, 0);
+			if (ret == 0) {
+				/*
+				 * On Windows, the underlying file must be
+				 * closed to perform a remove.
+				 */
+				(void)__memp_fremove(dbp->mpf);
+				if ((ret = memp_fclose(dbp->mpf)) != 0)
+					goto out;
+				dbp->mpf = NULL;
+				if ((ret = __os_unlink(dbenv, real_name)) != 0)
+					goto out;
+			} else if (ret != DB_DELETED)
+				goto out;
+		}
+		/*
+		 * The transaction committed, so the only thing that might
+		 * be true is that the backup file is still around.  Try
+		 * to delete it, but it's not an error if that delete fails.
+		 */
+		if ((ret =  __db_backup_name(dbenv, argp->name.data,
+		    &backup, lsnp)) != 0)
+			goto out;
+		if ((ret = __db_appname(dbenv,
+		    DB_APP_DATA, NULL, backup, 0, NULL, &real_back)) != 0)
+			goto out;
+		if (__os_exists(real_back, NULL) == 0)
+			if ((ret = __os_unlink(dbenv, real_back)) != 0)
+				goto out;
+		if ((ret = __db_txnlist_delete(dbenv, info,
+		    argp->name.data, TXNLIST_INVALID_ID, 1)) != 0)
+			goto out;
+	} else if (DB_UNDO(op)) {
+		/*
+		 * Trying to undo.  File may or may not have been deleted.
+		 * Try to move the backup to the original.  If the backup
+		 * exists, then this is right.  If it doesn't exist, then
+		 * nothing will happen and that's OK.
+		 */
+		if ((ret =  __db_backup_name(dbenv, argp->name.data,
+		    &backup, lsnp)) != 0)
+			goto out;
+		if ((ret = __db_appname(dbenv,
+		    DB_APP_DATA, NULL, backup, 0, NULL, &real_back)) != 0)
+			goto out;
+		if ((ret = __db_appname(dbenv, DB_APP_DATA,
+		    NULL, argp->name.data, 0, NULL, &real_name)) != 0)
+			goto out;
+		if (__os_exists(real_back, NULL) == 0)
+			if ((ret =
+			     __os_rename(dbenv, real_back, real_name)) != 0)
+				goto out;
+	}
+
+	*lsnp = argp->prev_lsn;
+	ret = 0;
+
+out:	if (argp != NULL)
+		__os_free(argp, 0);
+	if (backup != NULL)
+		__os_freestr(backup);
+	if (real_back != NULL)
+		__os_freestr(real_back);
+	if (real_name != NULL)
+		__os_freestr(real_name);
+	return (ret);
+}
+/*
+ * __crdel_rename_recover --
+ *	Recovery function for rename.
+ *
+ * PUBLIC: int __crdel_rename_recover
+ * PUBLIC:   __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *));
+ */
+int
+__crdel_rename_recover(dbenv, dbtp, lsnp, op, info)
+	DB_ENV *dbenv;
+	DBT *dbtp;
+	DB_LSN *lsnp;
+	db_recops op;
+	void *info;
+{
+	DB *dbp;
+	__crdel_rename_args *argp;
+	char *new_name, *real_name;
+	int ret, set;
+
+	COMPQUIET(info, NULL);
+
+	REC_PRINT(__crdel_rename_print);
+
+	new_name = real_name = NULL;
+
+	if ((ret = __crdel_rename_read(dbenv, dbtp->data, &argp)) != 0)
+		return (ret);
+
+	if ((ret = __db_fileid_to_db(dbenv, &dbp, argp->fileid, 0)) != 0)
+		goto out;
+	if (DB_REDO(op)) {
+		/*
+		 * We don't use the dbp parameter to __log_filelist_update
+		 * in the rename case, so passing NULL for it is OK.
+		 */
+		if ((ret = __log_filelist_update(dbenv, NULL,
+		    argp->fileid, argp->newname.data, &set)) != 0)
+			goto out;
+		if (set != 0) {
+			if ((ret = __db_appname(dbenv, DB_APP_DATA,
+			    NULL, argp->name.data, 0, NULL, &real_name)) != 0)
+				goto out;
+			if (__os_exists(real_name, NULL) == 0) {
+				if ((ret = __db_appname(dbenv,
+				    DB_APP_DATA, NULL, argp->newname.data,
+				    0, NULL, &new_name)) != 0)
+					goto out;
+				/*
+				 * On Windows, the underlying file
+				 * must be closed to perform a rename.
+				 * The db will be closed by a
+				 * log_register record.  Rename
+				 * has exclusive access to the db.
+				 */
+				(void)__memp_fremove(dbp->mpf);
+				if ((ret = memp_fclose(dbp->mpf)) != 0)
+					goto out;
+				dbp->mpf = NULL;
+				if ((ret = __os_rename(dbenv,
+				    real_name, new_name)) != 0)
+					goto out;
+			}
+		}
+	} else {
+		/*
+		 * We don't use the dbp parameter to __log_filelist_update
+		 * in the rename case, so passing NULL for it is OK.
+		 */
+		if ((ret = __log_filelist_update(dbenv, NULL,
+		    argp->fileid, argp->name.data, &set)) != 0)
+			goto out;
+		if (set != 0) {
+			if ((ret = __db_appname(dbenv, DB_APP_DATA,
+			    NULL, argp->newname.data, 0, NULL, &new_name)) != 0)
+				goto out;
+			if (__os_exists(new_name, NULL) == 0) {
+				if ((ret = __db_appname(dbenv,
+				    DB_APP_DATA, NULL, argp->name.data,
+				    0, NULL, &real_name)) != 0)
+					goto out;
+				/*
+				 * On Windows, the underlying file
+				 * must be closed to perform a rename.
+				 * The file may have already been closed
+				 * if we are aborting the transaction.
+				 */
+				if (dbp->mpf != NULL) {
+					(void)__memp_fremove(dbp->mpf);
+					if ((ret = memp_fclose(dbp->mpf)) != 0)
+						goto out;
+					dbp->mpf = NULL;
+				}
+				if ((ret = __os_rename(dbenv,
+				    new_name, real_name)) != 0)
+					goto out;
+			}
+		}
+	}
+
+	*lsnp = argp->prev_lsn;
+	ret = 0;
+
+out:	if (argp != NULL)
+		__os_free(argp, 0);
+
+	if (new_name != NULL)
+		__os_free(new_name, 0);
+
+	if (real_name != NULL)
+		__os_free(real_name, 0);
+
+	return (ret);
+}
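The undo branch of __crdel_fileopen_recover above decides whether a file left behind by an aborted create may be removed: a missing file needs no work, a zero-length file is removed, and a file with data is removed only if its first page was never written (magic number 0). The following is a rough sketch of that decision using only standard C I/O. It is not the Berkeley DB implementation: struct meta_hdr, enum undo_result and undo_create are hypothetical names, a failed fopen is assumed to mean the file does not exist, and keeping a file that has only a partial header is a choice made for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical page header; only the magic field matters to this sketch. */
struct meta_hdr {
	uint32_t lsn_file, lsn_offset;
	uint32_t pgno;
	uint32_t magic;		/* 0 means the page was never written */
};

enum undo_result { UNDO_ERROR = -1, UNDO_KEPT = 0, UNDO_REMOVED = 1 };

/*
 * Undo a file create: remove the file only if it looks like our own
 * unfinished creation.
 */
static enum undo_result
undo_create(const char *path)
{
	struct meta_hdr hdr;
	FILE *fp;
	size_t nr;
	int failed;

	assert(path != NULL);
	/* Assumption: a failed open means the file was never created. */
	if ((fp = fopen(path, "rb")) == NULL)
		return (UNDO_KEPT);
	memset(&hdr, 0, sizeof(hdr));
	nr = fread(&hdr, 1, sizeof(hdr), fp);
	failed = ferror(fp);
	/* Close before removing; some systems cannot unlink an open file. */
	if (fclose(fp) != 0 || failed)
		return (UNDO_ERROR);
	/*
	 * Keep the file when it holds a header with a nonzero magic number
	 * (someone else's data), or a partial header we cannot judge.
	 */
	if (nr != 0 && (nr != sizeof(hdr) || hdr.magic != 0))
		return (UNDO_KEPT);
	/* Zero-length, or preallocated but never written: remove it. */
	return (remove(path) == 0 ? UNDO_REMOVED : UNDO_ERROR);
}
```

The sketch closes the handle before calling remove, for the same reason the recovery code closes its handle before __os_unlink.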
diff --git a/bdb/db/db.c b/bdb/db/db.c
new file mode 100644
index 00000000000..6e74b4b21bd
--- /dev/null
+++ b/bdb/db/db.c
@@ -0,0 +1,2325 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ *	Sleepycat Software.  All rights reserved.
+ */
+/*
+ * Copyright (c) 1990, 1993, 1994, 1995, 1996
+ *	Keith Bostic.  All rights reserved.
+ */
+/*
+ * Copyright (c) 1990, 1993, 1994, 1995
+ *	The Regents of the University of California.  All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of the University nor the names of its contributors
+ *    may be used to endorse or promote products derived from this software
+ *    without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db.c,v 11.117 2001/01/11 18:19:50 bostic Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <stddef.h>
+#include <stdlib.h>
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_shash.h"
+#include "db_swap.h"
+#include "btree.h"
+#include "db_am.h"
+#include "hash.h"
+#include "lock.h"
+#include "log.h"
+#include "mp.h"
+#include "qam.h"
+#include "common_ext.h"
+
+/* Actions that __db_master_update can take. */
+typedef enum { MU_REMOVE, MU_RENAME, MU_OPEN } mu_action;
+
+/* Flag values that __db_file_setup can return. */
+#define	DB_FILE_SETUP_CREATE	0x01
+#define	DB_FILE_SETUP_ZERO	0x02
+
+static int __db_file_setup __P((DB *,
+	       const char *, u_int32_t, int, db_pgno_t, int *));
+static int __db_master_update __P((DB *,
+	       const char *, u_int32_t,
+	       db_pgno_t *, mu_action, const char *, u_int32_t));
+static int __db_refresh __P((DB *));
+static int __db_remove_callback __P((DB *, void *));
+static int __db_set_pgsize __P((DB *, DB_FH *, char *));
+static int __db_subdb_remove __P((DB *, const char *, const char *));
+static int __db_subdb_rename __P(( DB *,
+		const char *, const char *, const char *));
+#if     CONFIG_TEST
+static void __db_makecopy __P((const char *, const char *));
+static int __db_testdocopy __P((DB *, const char *));
+static int __qam_testdocopy __P((DB *, const char *));
+#endif
+
+/*
+ * __db_open --
+ *	Main library interface to the DB access methods.
+ *
+ * PUBLIC: int __db_open __P((DB *,
+ * PUBLIC:     const char *, const char *, DBTYPE, u_int32_t, int));
+ */
+int
+__db_open(dbp, name, subdb, type, flags, mode)
+	DB *dbp;
+	const char *name, *subdb;
+	DBTYPE type;
+	u_int32_t flags;
+	int mode;
+{
+	DB_ENV *dbenv;
+	DB_LOCK open_lock;
+	DB *mdbp;
+	db_pgno_t meta_pgno;
+	u_int32_t ok_flags;
+	int ret, t_ret;
+
+	dbenv = dbp->dbenv;
+	mdbp = NULL;
+
+	/* Validate arguments. */
+#define	OKFLAGS								\
+    (DB_CREATE | DB_EXCL | DB_FCNTL_LOCKING |				\
+    DB_NOMMAP | DB_RDONLY | DB_RDWRMASTER | DB_THREAD | DB_TRUNCATE)
+	if ((ret = __db_fchk(dbenv, "DB->open", flags, OKFLAGS)) != 0)
+		return (ret);
+	if (LF_ISSET(DB_EXCL) && !LF_ISSET(DB_CREATE))
+		return (__db_ferr(dbenv, "DB->open", 1));
+	if (LF_ISSET(DB_RDONLY) && LF_ISSET(DB_CREATE))
+		return (__db_ferr(dbenv, "DB->open", 1));
+#ifdef	HAVE_VXWORKS
+	if (LF_ISSET(DB_TRUNCATE)) {
+		__db_err(dbenv, "DB_TRUNCATE unsupported in VxWorks");
+		return (__db_eopnotsup(dbenv));
+	}
+#endif
+	switch (type) {
+	case DB_UNKNOWN:
+		if (LF_ISSET(DB_CREATE|DB_TRUNCATE)) {
+			__db_err(dbenv,
+	    "%s: DB_UNKNOWN type specified with DB_CREATE or DB_TRUNCATE",
+			    name);
+			return (EINVAL);
+		}
+		ok_flags = 0;
+		break;
+	case DB_BTREE:
+		ok_flags = DB_OK_BTREE;
+		break;
+	case DB_HASH:
+		ok_flags = DB_OK_HASH;
+		break;
+	case DB_QUEUE:
+		ok_flags = DB_OK_QUEUE;
+		break;
+	case DB_RECNO:
+		ok_flags = DB_OK_RECNO;
+		break;
+	default:
+		__db_err(dbenv, "unknown type: %lu", (u_long)type);
+		return (EINVAL);
+	}
+	if (ok_flags)
+		DB_ILLEGAL_METHOD(dbp, ok_flags);
+
+	/* The environment may have been created, but never opened. */
+	if (!F_ISSET(dbenv, DB_ENV_DBLOCAL | DB_ENV_OPEN_CALLED)) {
+		__db_err(dbenv, "environment not yet opened");
+		return (EINVAL);
+	}
+
+	/*
+	 * Historically, you could pass in an environment that didn't have a
+	 * mpool, and DB would create a private one behind the scenes.  This
+	 * no longer works.
+	 */
+	if (!F_ISSET(dbenv, DB_ENV_DBLOCAL) && !MPOOL_ON(dbenv)) {
+		__db_err(dbenv, "environment did not include a memory pool.");
+		return (EINVAL);
+	}
+
+	/*
+	 * You can't specify threads during DB->open if subsystems in the
+	 * environment weren't configured with them.
+	 */
+	if (LF_ISSET(DB_THREAD) &&
+	    !F_ISSET(dbenv, DB_ENV_DBLOCAL | DB_ENV_THREAD)) {
+		__db_err(dbenv, "environment not created using DB_THREAD");
+		return (EINVAL);
+	}
+
+	/*
+	 * If the environment was configured with threads, the DB handle
+	 * must also be free-threaded, so we force the DB_THREAD flag on.
+	 * (See SR #2033 for why this is a requirement--recovery needs
+	 * to be able to grab a dbp using __db_fileid_to_dbp, and it has
+	 * no way of knowing which dbp goes with which thread, so whichever
+	 * one it finds has to be usable in any of them.)
+	 */
+	if (F_ISSET(dbenv, DB_ENV_THREAD))
+		LF_SET(DB_THREAD);
+
+	/* DB_TRUNCATE is not transaction recoverable. */
+	if (LF_ISSET(DB_TRUNCATE) && TXN_ON(dbenv)) {
+		__db_err(dbenv,
+	    "DB_TRUNCATE illegal in a transaction protected environment");
+		return (EINVAL);
+	}
+
+	/* Subdatabase checks. */
+	if (subdb != NULL) {
+		/* Subdatabases must be created in named files. */
+		if (name == NULL) {
+			__db_err(dbenv,
+		    "multiple databases cannot be created in temporary files");
+			return (EINVAL);
+		}
+
+		/* QAM can't be done as a subdatabase. */
+		if (type == DB_QUEUE) {
+			__db_err(dbenv, "Queue databases must be one-per-file");
+			return (EINVAL);
+		}
+	}
+
+	/* Convert any DB->open flags. */
+	if (LF_ISSET(DB_RDONLY))
+		F_SET(dbp, DB_AM_RDONLY);
+
+	/* Fill in the type. */
+	dbp->type = type;
+
+	/*
+	 * If we're potentially creating a database, wrap the open inside of
+	 * a transaction.
+	 */
+	if (TXN_ON(dbenv) && LF_ISSET(DB_CREATE))
+		if ((ret = __db_metabegin(dbp, &open_lock)) != 0)
+			return (ret);
+
+	/*
+	 * If we're opening a subdatabase, we have to open (and potentially
+	 * create) the main database, and then get (and potentially store)
+	 * our base page number in that database.  Then, we can finally open
+	 * the subdatabase.
+	 */
+	if (subdb == NULL)
+		meta_pgno = PGNO_BASE_MD;
+	else {
+		/*
+		 * Open the master database, optionally creating or updating
+		 * it, and retrieve the metadata page number.
+		 */
+		if ((ret =
+		    __db_master_open(dbp, name, flags, mode, &mdbp)) != 0)
+			goto err;
+
+		/* Copy the page size and file id from the master. */
+		dbp->pgsize = mdbp->pgsize;
+		F_SET(dbp, DB_AM_SUBDB);
+		memcpy(dbp->fileid, mdbp->fileid, DB_FILE_ID_LEN);
+
+		if ((ret = __db_master_update(mdbp,
+		    subdb, type, &meta_pgno, MU_OPEN, NULL, flags)) != 0)
+			goto err;
+
+		/*
+		 * Clear the exclusive open and truncation flags; they only
+		 * apply to the open of the master database.
+		 */
+		LF_CLR(DB_EXCL | DB_TRUNCATE);
+	}
+
+	ret = __db_dbopen(dbp, name, flags, mode, meta_pgno);
+
+	/*
+	 * You can open the database that describes the subdatabases in the
+	 * rest of the file read-only.  The content of each key's data is
+	 * unspecified and applications should never be adding new records
+	 * or updating existing records.  However, during recovery, we need
+	 * to open these databases R/W so we can redo/undo changes in them.
+	 * Likewise, we need to open master databases read/write during
+	 * rename and remove so we can be sure they're fully sync'ed, so
+	 * we provide an override flag for the purpose.
+	 */
+	if (subdb == NULL && !IS_RECOVERING(dbenv) && !LF_ISSET(DB_RDONLY) &&
+	    !LF_ISSET(DB_RDWRMASTER) && F_ISSET(dbp, DB_AM_SUBDB)) {
+		__db_err(dbenv,
+    "files containing multiple databases may only be opened read-only");
+		ret = EINVAL;
+		goto err;
+	}
+
+err:	/*
+	 * End any transaction, committing if we were successful, aborting
+	 * otherwise.
+	 */
+	if (TXN_ON(dbenv) && LF_ISSET(DB_CREATE))
+		if ((t_ret = __db_metaend(dbp,
+		    &open_lock, ret == 0, NULL, NULL)) != 0 && ret == 0)
+			ret = t_ret;
+
+	/* If we were successful, don't discard the file on close. */
+	if (ret == 0)
+		F_CLR(dbp, DB_AM_DISCARD);
+
+	/* If we were unsuccessful, destroy the DB handle. */
+	if (ret != 0) {
+		/* In recovery we set log_fileid early. */
+		if (IS_RECOVERING(dbenv))
+			dbp->log_fileid = DB_LOGFILEID_INVALID;
+		__db_refresh(dbp);
+	}
+
+	if (mdbp != NULL) {
+		/* If we were successful, don't discard the file on close. */
+		if (ret == 0)
+			F_CLR(mdbp, DB_AM_DISCARD);
+		if ((t_ret = mdbp->close(mdbp, 0)) != 0 && ret == 0)
+			ret = t_ret;
+	}
+
+	return (ret);
+}
+
+/*
+ * __db_dbopen --
+ *	Open a database.
+ * PUBLIC: int __db_dbopen __P((DB *, const char *, u_int32_t, int, db_pgno_t));
+ */
+int
+__db_dbopen(dbp, name, flags, mode, meta_pgno)
+	DB *dbp;
+	const char *name;
+	u_int32_t flags;
+	int mode;
+	db_pgno_t meta_pgno;
+{
+	DB_ENV *dbenv;
+	int ret, retinfo;
+
+	dbenv = dbp->dbenv;
+
+	/* Set up the underlying file. */
+	if ((ret = __db_file_setup(dbp,
+	    name, flags, mode, meta_pgno, &retinfo)) != 0)
+		return (ret);
+
+	/*
+	 * If we created the file, set the truncate flag for the mpool.  This
+	 * isn't for anything we've done, it's protection against stupid user
+	 * tricks: if the user deleted a file behind Berkeley DB's back, we
+	 * may still have pages in the mpool that match the file's "unique" ID.
+	 */
+	if (retinfo & DB_FILE_SETUP_CREATE)
+		flags |= DB_TRUNCATE;
+
+	/* Set up the underlying environment. */
+	if ((ret = __db_dbenv_setup(dbp, name, flags)) != 0)
+		return (ret);
+
+	/*
+	 * Do access method specific initialization.
+	 *
+	 * !!!
+	 * Set the open flag.  (The underlying access method open functions
+	 * may want to do things like acquire cursors, so the open flag has
+	 * to be set before calling them.)
+	 */
+	F_SET(dbp, DB_OPEN_CALLED);
+
+	if (retinfo & DB_FILE_SETUP_ZERO)
+		return (0);
+
+	switch (dbp->type) {
+	case DB_BTREE:
+		ret = __bam_open(dbp, name, meta_pgno, flags);
+		break;
+	case DB_HASH:
+		ret = __ham_open(dbp, name, meta_pgno, flags);
+		break;
+	case DB_RECNO:
+		ret = __ram_open(dbp, name, meta_pgno, flags);
+		break;
+	case DB_QUEUE:
+		ret = __qam_open(dbp, name, meta_pgno, mode, flags);
+		break;
+	case DB_UNKNOWN:
+		return (__db_unknown_type(dbp->dbenv,
+		     "__db_dbopen", dbp->type));
+		break;
+	}
+	return (ret);
+}
+
+/*
+ * __db_master_open --
+ *	Open up a handle on a master database.
+ *
+ * PUBLIC: int __db_master_open __P((DB *,
+ * PUBLIC:     const char *, u_int32_t, int, DB **));
+ */
+int
+__db_master_open(subdbp, name, flags, mode, dbpp)
+	DB *subdbp;
+	const char *name;
+	u_int32_t flags;
+	int mode;
+	DB **dbpp;
+{
+	DB *dbp;
+	int ret;
+
+	/* Open up a handle on the main database. */
+	if ((ret = db_create(&dbp, subdbp->dbenv, 0)) != 0)
+		return (ret);
+
+	/*
+	 * It's always a btree.
+	 * Run in the transaction we've created.
+	 * Set the pagesize in case we're creating a new database.
+	 * Flag that we're creating a database with subdatabases.
+	 */
+	dbp->type = DB_BTREE;
+	dbp->open_txn = subdbp->open_txn;
+	dbp->pgsize = subdbp->pgsize;
+	F_SET(dbp, DB_AM_SUBDB);
+
+	if ((ret = __db_dbopen(dbp, name, flags, mode, PGNO_BASE_MD)) != 0) {
+		if (!F_ISSET(dbp, DB_AM_DISCARD))
+			dbp->close(dbp, 0);
+		return (ret);
+	}
+
+	*dbpp = dbp;
+	return (0);
+}
+
+/*
+ * __db_master_update --
+ *	Open/create, remove or rename a subdatabase in a master database.
+ */
+static int
+__db_master_update(mdbp, subdb, type, meta_pgnop, action, newname, flags)
+	DB *mdbp;
+	const char *subdb;
+	u_int32_t type;
+	db_pgno_t *meta_pgnop;		/* may be NULL on MU_RENAME */
+	mu_action action;
+	const char *newname;
+	u_int32_t flags;
+{
+	DB_ENV *dbenv;
+	DBC *dbc, *ndbc;
+	DBT key, data, ndata;
+	PAGE *p;
+	db_pgno_t t_pgno;
+	int modify, ret, t_ret;
+
+	dbenv = mdbp->dbenv;
+	dbc = ndbc = NULL;
+	p = NULL;
+
+	/* Might we modify the master database?  If so, we'll need to lock. */
+	modify = (action != MU_OPEN || LF_ISSET(DB_CREATE)) ? 1 : 0;
+
+	memset(&key, 0, sizeof(key));
+	memset(&data, 0, sizeof(data));
+
+	/*
+	 * Open up a cursor.  If this is CDB and we might modify the master
+	 * database, make it an update cursor.
+	 */
+	if ((ret = mdbp->cursor(mdbp, mdbp->open_txn, &dbc,
+	    (CDB_LOCKING(dbenv) && modify) ? DB_WRITECURSOR : 0)) != 0)
+		goto err;
+
+	/*
+	 * Try to point the cursor at the record.
+	 *
+	 * If we're removing or potentially creating an entry, lock the page
+	 * with DB_RMW.
+	 *
+	 * !!!
+	 * We don't include the name's nul termination in the database.
+	 */
+	key.data = (char *)subdb;
+	key.size = strlen(subdb);
+	/* In the rename case, we do multiple cursor ops, so MALLOC is safer. */
+	F_SET(&data, DB_DBT_MALLOC);
+	ret = dbc->c_get(dbc, &key, &data,
+	    DB_SET | ((STD_LOCKING(dbc) && modify) ? DB_RMW : 0));
+
+	/*
+	 * What we do next--whether or not we found a record for the
+	 * specified subdatabase--depends on what the specified action is.
+	 * Handle ret appropriately as the first statement of each case.
+	 */
+	switch (action) {
+	case MU_REMOVE:
+		/*
+		 * We should have found something if we're removing it.  Note
+		 * that in the common case where the DB we're asking to remove
+		 * doesn't exist, we won't get this far;  __db_subdb_remove
+		 * will already have returned an error from __db_open.
+		 */
+		if (ret != 0)
+			goto err;
+
+		/*
+		 * Delete the subdatabase entry first;  if this fails,
+		 * we don't want to touch the actual subdb pages.
+		 */
+		if ((ret = dbc->c_del(dbc, 0)) != 0)
+			goto err;
+
+		/*
+		 * We're handling actual data, not on-page meta-data,
+		 * so it hasn't been converted to/from opposite
+		 * endian architectures.  Do it explicitly, now.
+		 */
+		memcpy(meta_pgnop, data.data, sizeof(db_pgno_t));
+		DB_NTOHL(meta_pgnop);
+		if ((ret = memp_fget(mdbp->mpf, meta_pgnop, 0, &p)) != 0)
+			goto err;
+
+		/* Free and put the page. */
+		if ((ret = __db_free(dbc, p)) != 0) {
+			p = NULL;
+			goto err;
+		}
+		p = NULL;
+		break;
+	case MU_RENAME:
+		/* We should have found something if we're renaming it. */
+		if (ret != 0)
+			goto err;
+
+		/*
+		 * Before we rename, we need to make sure we're not
+		 * overwriting another subdatabase, or else this operation
+		 * won't be undoable.  Open a second cursor and check
+		 * for the existence of newname;  it shouldn't appear under
+		 * us since we hold the metadata lock.
+		 */
+		if ((ret = mdbp->cursor(mdbp, mdbp->open_txn, &ndbc, 0)) != 0)
+			goto err;
+		DB_ASSERT(newname != NULL);
+		key.data = (void *) newname;
+		key.size = strlen(newname);
+
+		/*
+		 * We don't actually care what the meta page of the potentially-
+		 * overwritten DB is;  we just care about existence.
+		 */
+		memset(&ndata, 0, sizeof(ndata));
+		F_SET(&ndata, DB_DBT_USERMEM | DB_DBT_PARTIAL);
+
+		if ((ret = ndbc->c_get(ndbc, &key, &ndata, DB_SET)) == 0) {
+			/* A subdb called newname exists.  Bail. */
+			ret = EEXIST;
+			__db_err(dbenv, "rename: database %s exists", newname);
+			goto err;
+		} else if (ret != DB_NOTFOUND)
+			goto err;
+
+		/*
+		 * Now do the put first;  we don't want to lose our
+		 * sole reference to the subdb.  Use the second cursor
+		 * so that the first one continues to point to the old record.
+		 */
+		if ((ret = ndbc->c_put(ndbc, &key, &data, DB_KEYFIRST)) != 0)
+			goto err;
+		if ((ret = dbc->c_del(dbc, 0)) != 0) {
+			/*
+			 * If the delete fails, try to delete the record
+			 * we just put, in case we're not txn-protected.
+			 */
+			(void)ndbc->c_del(ndbc, 0);
+			goto err;
+		}
+
+		break;
+	case MU_OPEN:
+		/*
+		 * Get the subdatabase information.  If it already exists,
+		 * copy out the page number and we're done.
+		 */
+		switch (ret) {
+		case 0:
+			memcpy(meta_pgnop, data.data, sizeof(db_pgno_t));
+			DB_NTOHL(meta_pgnop);
+			goto done;
+		case DB_NOTFOUND:
+			if (LF_ISSET(DB_CREATE))
+				break;
+			/*
+			 * No db_err, it is reasonable to remove a
+			 * nonexistent db.
+			 */
+			ret = ENOENT;
+			goto err;
+		default:
+			goto err;
+		}
+
+		if ((ret = __db_new(dbc,
+		    type == DB_HASH ? P_HASHMETA : P_BTREEMETA, &p)) != 0)
+			goto err;
+		*meta_pgnop = PGNO(p);
+
+		/*
+		 * XXX
+		 * We're handling actual data, not on-page meta-data, so it
+		 * hasn't been converted to/from opposite endian architectures.
+		 * Do it explicitly, now.
+		 */
+		t_pgno = PGNO(p);
+		DB_HTONL(&t_pgno);
+		memset(&ndata, 0, sizeof(ndata));
+		ndata.data = &t_pgno;
+		ndata.size = sizeof(db_pgno_t);
+		if ((ret = dbc->c_put(dbc, &key, &ndata, DB_KEYLAST)) != 0)
+			goto err;
+		break;
+	}
+
+err:
+done:	/*
+	 * If we allocated a page: if we're successful, mark the page dirty
+	 * and return it to the cache, otherwise, discard/free it.
+	 */
+	if (p != NULL) {
+		if (ret == 0) {
+			if ((t_ret =
+			    memp_fput(mdbp->mpf, p, DB_MPOOL_DIRTY)) != 0)
+				ret = t_ret;
+			/*
+			 * Since we cannot close this file until after
+			 * transaction commit, we need to sync the dirty
+			 * pages, because we'll read these directly from
+			 * disk to open.
+			 */
+			if ((t_ret = mdbp->sync(mdbp, 0)) != 0 && ret == 0)
+				ret = t_ret;
+		} else
+			(void)__db_free(dbc, p);
+	}
+
+	/* Discard the cursor(s) and data. */
+	if (data.data != NULL)
+		__os_free(data.data, data.size);
+	if (dbc != NULL && (t_ret = dbc->c_close(dbc)) != 0 && ret == 0)
+		ret = t_ret;
+	if (ndbc != NULL && (t_ret = ndbc->c_close(ndbc)) != 0 && ret == 0)
+		ret = t_ret;
+
+	return (ret);
+}
+
+/*
+ * __db_dbenv_setup --
+ *	Set up the underlying environment during a db_open.
+ *
+ * PUBLIC: int __db_dbenv_setup __P((DB *, const char *, u_int32_t));
+ */
+int
+__db_dbenv_setup(dbp, name, flags)
+	DB *dbp;
+	const char *name;
+	u_int32_t flags;
+{
+	DB *ldbp;
+	DB_ENV *dbenv;
+	DBT pgcookie;
+	DB_MPOOL_FINFO finfo;
+	DB_PGINFO pginfo;
+	int ret;
+	u_int32_t maxid;
+
+	dbenv = dbp->dbenv;
+
+	/* If we don't yet have an environment, it's time to create it. */
+	if (!F_ISSET(dbenv, DB_ENV_OPEN_CALLED)) {
+		/* Ensure the cache holds at least DB_MINPAGECACHE pages. */
+		if (dbenv->mp_gbytes == 0 &&
+		    dbenv->mp_bytes < dbp->pgsize * DB_MINPAGECACHE &&
+		    (ret = dbenv->set_cachesize(
+		    dbenv, 0, dbp->pgsize * DB_MINPAGECACHE, 0)) != 0)
+			return (ret);
+
+		if ((ret = dbenv->open(dbenv, NULL, DB_CREATE |
+		    DB_INIT_MPOOL | DB_PRIVATE | LF_ISSET(DB_THREAD), 0)) != 0)
+			return (ret);
+	}
+
+	/* Register DB's pgin/pgout functions. */
+	if ((ret =
+	    memp_register(dbenv, DB_FTYPE_SET, __db_pgin, __db_pgout)) != 0)
+		return (ret);
+
+	/*
+	 * Open a backing file in the memory pool.
+	 *
+	 * If we need to pre- or post-process a file's pages on I/O, set the
+	 * file type.  If it's a hash file, always call the pgin and pgout
+	 * routines.  This means that hash files can never be mapped into
+	 * process memory.  If it's a btree file and requires swapping, we
+	 * need to page the file in and out.  This has to be right -- we can't
+	 * mmap files that are being paged in and out.
+	 */
+	memset(&finfo, 0, sizeof(finfo));
+	switch (dbp->type) {
+	case DB_BTREE:
+	case DB_RECNO:
+		finfo.ftype =
+		    F_ISSET(dbp, DB_AM_SWAP) ? DB_FTYPE_SET : DB_FTYPE_NOTSET;
+		finfo.clear_len = DB_PAGE_DB_LEN;
+		break;
+	case DB_HASH:
+		finfo.ftype = DB_FTYPE_SET;
+		finfo.clear_len = DB_PAGE_DB_LEN;
+		break;
+	case DB_QUEUE:
+		finfo.ftype =
+		    F_ISSET(dbp, DB_AM_SWAP) ? DB_FTYPE_SET : DB_FTYPE_NOTSET;
+		finfo.clear_len = DB_PAGE_QUEUE_LEN;
+		break;
+	case DB_UNKNOWN:
+		/*
+		 * If we're running in the verifier, our database might
+		 * be corrupt and we might not know its type--but we may
+		 * still want to be able to verify and salvage.
+		 *
+		 * If we can't identify the type, it's not going to be safe
+		 * to call __db_pgin--we pretty much have to give up all
+		 * hope of salvaging cross-endianness.  Proceed anyway;
+		 * at worst, the database will just appear more corrupt
+		 * than it actually is, but at best, we may be able
+		 * to salvage some data even with no metadata page.
+		 */
+		if (F_ISSET(dbp, DB_AM_VERIFYING)) {
+			finfo.ftype = DB_FTYPE_NOTSET;
+			finfo.clear_len = DB_PAGE_DB_LEN;
+			break;
+		}
+		return (__db_unknown_type(dbp->dbenv,
+		     "__db_dbenv_setup", dbp->type));
+	}
+	finfo.pgcookie = &pgcookie;
+	finfo.fileid = dbp->fileid;
+	finfo.lsn_offset = 0;
+
+	pginfo.db_pagesize = dbp->pgsize;
+	pginfo.needswap = F_ISSET(dbp, DB_AM_SWAP);
+	pgcookie.data = &pginfo;
+	pgcookie.size = sizeof(DB_PGINFO);
+
+	if ((ret = memp_fopen(dbenv, name,
+	    LF_ISSET(DB_RDONLY | DB_NOMMAP | DB_ODDFILESIZE | DB_TRUNCATE),
+	    0, dbp->pgsize, &finfo, &dbp->mpf)) != 0)
+		return (ret);
+
+	/*
+	 * We may need a per-thread mutex.  Allocate it from the environment
+	 * region, there's supposed to be extra space there for that purpose.
+	 */
+	if (LF_ISSET(DB_THREAD)) {
+		if ((ret = __db_mutex_alloc(
+		    dbenv, dbenv->reginfo, (MUTEX **)&dbp->mutexp)) != 0)
+			return (ret);
+		if ((ret = __db_mutex_init(
+		    dbenv, dbp->mutexp, 0, MUTEX_THREAD)) != 0) {
+			__db_mutex_free(dbenv, dbenv->reginfo, dbp->mutexp);
+			return (ret);
+		}
+	}
+
+	/* Get a log file id. */
+	if (LOGGING_ON(dbenv) && !IS_RECOVERING(dbenv) &&
+#if !defined(DEBUG_ROP)
+	    !F_ISSET(dbp, DB_AM_RDONLY) &&
+#endif
+	    (ret = log_register(dbenv, dbp, name)) != 0)
+		return (ret);
+
+	/*
+	 * Insert ourselves into the DB_ENV's dblist.  We allocate a
+	 * unique ID to each {fileid, meta page number} pair, and to
+	 * each temporary file (since they all have a zero fileid).
+	 * This ID gives us something to use to tell which DB handles
+	 * go with which databases in all the cursor adjustment
+	 * routines, where we don't want to do a lot of ugly and
+	 * expensive memcmps.
+	 */
+	MUTEX_THREAD_LOCK(dbenv, dbenv->dblist_mutexp);
+	for (maxid = 0, ldbp = LIST_FIRST(&dbenv->dblist);
+	    ldbp != NULL; ldbp = LIST_NEXT(ldbp, dblistlinks)) {
+		if (name != NULL &&
+		    memcmp(ldbp->fileid, dbp->fileid, DB_FILE_ID_LEN) == 0 &&
+		    ldbp->meta_pgno == dbp->meta_pgno)
+			break;
+		if (ldbp->adj_fileid > maxid)
+			maxid = ldbp->adj_fileid;
+	}
+
+	/*
+	 * If ldbp is NULL, we didn't find a match, or we weren't
+	 * really looking because name is NULL.  Assign the dbp an
+	 * adj_fileid one higher than the largest we found, and
+	 * insert it at the head of the master dbp list.
+	 *
+	 * If ldbp is not NULL, it is a match for our dbp.  Give dbp
+	 * the same ID that ldbp has, and add it after ldbp so they're
+	 * together in the list.
+	 */
+	if (ldbp == NULL) {
+		dbp->adj_fileid = maxid + 1;
+		LIST_INSERT_HEAD(&dbenv->dblist, dbp, dblistlinks);
+	} else {
+		dbp->adj_fileid = ldbp->adj_fileid;
+		LIST_INSERT_AFTER(ldbp, dbp, dblistlinks);
+	}
+	MUTEX_THREAD_UNLOCK(dbenv, dbenv->dblist_mutexp);
+
+	return (0);
+}
+
+/*
+ * __db_file_setup --
+ *	Setup the file or in-memory data.
+ *	Read the database metadata and resolve it with our arguments.
+ */
+static int
+__db_file_setup(dbp, name, flags, mode, meta_pgno, retflags)
+	DB *dbp;
+	const char *name;
+	u_int32_t flags;
+	int mode;
+	db_pgno_t meta_pgno;
+	int *retflags;
+{
+	DB *mdb;
+	DBT namedbt;
+	DB_ENV *dbenv;
+	DB_FH *fhp, fh;
+	DB_LSN lsn;
+	DB_TXN *txn;
+	size_t nr;
+	u_int32_t magic, oflags;
+	int ret, retry_cnt, t_ret;
+	char *real_name, mbuf[DBMETASIZE];
+
+#define	IS_SUBDB_SETUP	(meta_pgno != PGNO_BASE_MD)
+
+	dbenv = dbp->dbenv;
+	dbp->meta_pgno = meta_pgno;
+	txn = NULL;
+	*retflags = 0;
+
+	/*
+	 * If we open a file handle and our caller is doing fcntl(2) locking,
+	 * we can't close it because that would discard the caller's lock.
+	 * Save it until we close the DB handle.
+	 */
+	if (LF_ISSET(DB_FCNTL_LOCKING)) {
+		if ((ret = __os_malloc(dbenv, sizeof(*fhp), NULL, &fhp)) != 0)
+			return (ret);
+	} else
+		fhp = &fh;
+	memset(fhp, 0, sizeof(*fhp));
+
+	/*
+	 * If the file is in-memory, set up is simple.  Otherwise, do the
+	 * hard work of opening and reading the file.
+	 *
+	 * If we have a file name, try and read the first page, figure out
+	 * what type of file it is, and initialize everything we can based
+	 * on that file's meta-data page.
+	 *
+	 * !!!
+	 * There's a reason we don't push this code down into the buffer cache.
+	 * The problem is that there's no information external to the file that
+	 * we can use as a unique ID.  UNIX has dev/inode pairs, but they are
+	 * not necessarily unique after reboot, if the file was mounted via NFS.
+	 * Windows has similar problems, as the FAT filesystem doesn't maintain
+	 * dev/inode numbers across reboot.  So, we must get something from the
+	 * file we can use to ensure that, even after a reboot, the file we're
+	 * joining in the cache is the right file for us to join.  The solution
+	 * we use is to maintain a file ID that's stored in the database, and
+	 * that's why we have to open and read the file before calling into the
+	 * buffer cache.
+	 *
+	 * The secondary reason is that there's additional information that
+	 * we want to have before instantiating a file in the buffer cache:
+	 * the page size, file type (btree/hash), if swapping is required,
+	 * and flags (DB_RDONLY, DB_CREATE, DB_TRUNCATE).  We could handle
+	 * needing this information by allowing it to be set for a file in
+	 * the buffer cache even after the file has been opened, and, of
+	 * course, supporting the ability to flush a file from the cache as
+	 * necessary, e.g., if we guessed wrongly about the page size.  Given
+	 * that we have to read the file anyway to get the file ID, we might
+	 * as well get the rest, too.
+	 *
+	 * Get the real file name.
+	 */
+	if (name == NULL) {
+		F_SET(dbp, DB_AM_INMEM);
+
+		if (dbp->type == DB_UNKNOWN) {
+			__db_err(dbenv,
+			    "DBTYPE of unknown without existing file");
+			return (EINVAL);
+		}
+		real_name = NULL;
+
+		/* Set the page size if we don't have one yet. */
+		if (dbp->pgsize == 0)
+			dbp->pgsize = DB_DEF_IOSIZE;
+
+		/*
+		 * If the file is a temporary file and we're doing locking,
+		 * then we have to create a unique file ID.  We can't use our
+		 * normal dev/inode pair (or whatever this OS uses in place of
+		 * dev/inode pairs) because no backing file will be created
+		 * until the mpool cache is filled forcing the buffers to disk.
+		 * Grab a random locker ID to use as a file ID.  The created
+		 * ID must never match a potential real file ID -- we know it
+		 * won't because real file IDs contain a time stamp after the
+		 * dev/inode pair, and we're simply storing a 4-byte value.
+		 *
+		 * !!!
+		 * Store the locker in the file id structure -- we can get it
+		 * from there as necessary, and it saves having two copies.
+		 */
+		if (LOCKING_ON(dbenv) &&
+		    (ret = lock_id(dbenv, (u_int32_t *)dbp->fileid)) != 0)
+			return (ret);
+
+		return (0);
+	}
+
+	/* Get the real backing file name. */
+	if ((ret = __db_appname(dbenv,
+	    DB_APP_DATA, NULL, name, 0, NULL, &real_name)) != 0)
+		return (ret);
+
+	/*
+	 * Open the backing file.  We need to make sure that multiple processes
+	 * attempting to create the file at the same time are properly ordered
+	 * so that only one of them creates the "unique" file ID, so we open it
+	 * O_EXCL and O_CREAT so two simultaneous attempts to create the file
+	 * will return failure in one of the attempts.  If we're the one that
+	 * fails, simply retry without the O_CREAT flag, which will require the
+	 * meta-data page exist.
+	 */
+
+	/* Fill in the default file mode. */
+	if (mode == 0)
+		mode = __db_omode("rwrw--");
+
+	oflags = 0;
+	if (LF_ISSET(DB_RDONLY))
+		oflags |= DB_OSO_RDONLY;
+	if (LF_ISSET(DB_TRUNCATE))
+		oflags |= DB_OSO_TRUNC;
+
+	retry_cnt = 0;
+open_retry:
+	*retflags = 0;
+	ret = 0;
+	if (!IS_SUBDB_SETUP && LF_ISSET(DB_CREATE)) {
+		if (dbp->open_txn != NULL) {
+			/*
+			 * Start a child transaction to wrap this individual
+			 * create.
+			 */
+			if ((ret =
+			    txn_begin(dbenv, dbp->open_txn, &txn, 0)) != 0)
+				goto err_msg;
+
+			memset(&namedbt, 0, sizeof(namedbt));
+			namedbt.data = (char *)name;
+			namedbt.size = strlen(name) + 1;
+			if ((ret = __crdel_fileopen_log(dbenv, txn,
+			    &lsn, DB_FLUSH, &namedbt, mode)) != 0)
+				goto err_msg;
+		}
+		DB_TEST_RECOVERY(dbp, DB_TEST_PREOPEN, ret, name);
+		if ((ret = __os_open(dbenv, real_name,
+		    oflags | DB_OSO_CREATE | DB_OSO_EXCL, mode, fhp)) == 0) {
+			DB_TEST_RECOVERY(dbp, DB_TEST_POSTOPEN, ret, name);
+
+			/* Commit the file create. */
+			if (dbp->open_txn != NULL) {
+				if ((ret = txn_commit(txn, DB_TXN_SYNC)) != 0)
+					goto err_msg;
+				txn = NULL;
+			}
+
+			/*
+			 * We created the file.  This means that if we later
+			 * fail, we need to delete the file and if we're going
+			 * to do that, we need to trash any pages in the
+			 * memory pool.  Since we only know here that we
+			 * created the file, we're going to set the flag here
+			 * and clear it later if we commit successfully.
+			 */
+			F_SET(dbp, DB_AM_DISCARD);
+			*retflags |= DB_FILE_SETUP_CREATE;
+		} else {
+			/*
+			 * Abort the file create.  If the abort fails, report
+			 * the error returned by txn_abort(), rather than the
+			 * open error, for no particular reason.
+			 */
+			if (dbp->open_txn != NULL) {
+				if ((t_ret = txn_abort(txn)) != 0) {
+					ret = t_ret;
+					goto err_msg;
+				}
+				txn = NULL;
+			}
+
+			/*
+			 * If we were not doing an exclusive open, try again
+			 * without the create flag.
+			 */
+			if (ret == EEXIST && !LF_ISSET(DB_EXCL)) {
+				LF_CLR(DB_CREATE);
+				DB_TEST_RECOVERY(dbp,
+				    DB_TEST_POSTOPEN, ret, name);
+				goto open_retry;
+			}
+		}
+	} else
+		ret = __os_open(dbenv, real_name, oflags, mode, fhp);
+
+	/*
+	 * Be quiet if we couldn't open the file because it didn't exist
+	 * or we did not have permission; customers don't like those
+	 * messages appearing in the logs.
+	 * Otherwise, complain loudly.
+	 */
+	if (ret != 0) {
+		if (ret == EACCES || ret == ENOENT)
+			goto err;
+		goto err_msg;
+	}
+
+	/* Set the page size if we don't have one yet. */
+	if (dbp->pgsize == 0) {
+		if (IS_SUBDB_SETUP) {
+			if ((ret = __db_master_open(dbp,
+			    name, flags, mode, &mdb)) != 0)
+				goto err;
+			dbp->pgsize = mdb->pgsize;
+			(void)mdb->close(mdb, 0);
+		} else if ((ret = __db_set_pgsize(dbp, fhp, real_name)) != 0)
+			goto err;
+	}
+
+	/*
+	 * Seek to the metadata offset; if it's a master database open or a
+	 * database without subdatabases, we're seeking to 0, but that's OK.
+	 */
+	if ((ret = __os_seek(dbenv, fhp,
+	    dbp->pgsize, meta_pgno, 0, 0, DB_OS_SEEK_SET)) != 0)
+		goto err_msg;
+
+	/*
+	 * Read the metadata page.  We read DBMETASIZE bytes, which is larger
+	 * than any access method's metadata page and smaller than any disk
+	 * sector.
+	 */
+	if ((ret = __os_read(dbenv, fhp, mbuf, sizeof(mbuf), &nr)) != 0)
+		goto err_msg;
+
+	if (nr == sizeof(mbuf)) {
+		/*
+		 * Figure out what access method we're dealing with, and then
+		 * call access method specific code to check error conditions
+		 * based on conflicts between the found file and application
+		 * arguments.  A found file overrides some user information --
+		 * we don't consider it an error, for example, if the user set
+		 * an expected byte order and the found file doesn't match it.
+		 */
+		F_CLR(dbp, DB_AM_SWAP);
+		magic = ((DBMETA *)mbuf)->magic;
+
+swap_retry:	switch (magic) {
+		case DB_BTREEMAGIC:
+			if ((ret =
+			    __bam_metachk(dbp, name, (BTMETA *)mbuf)) != 0)
+				goto err;
+			break;
+		case DB_HASHMAGIC:
+			if ((ret =
+			    __ham_metachk(dbp, name, (HMETA *)mbuf)) != 0)
+				goto err;
+			break;
+		case DB_QAMMAGIC:
+			if ((ret =
+			    __qam_metachk(dbp, name, (QMETA *)mbuf)) != 0)
+				goto err;
+			break;
+		case 0:
+			/*
+			 * There are two ways we can get a 0 magic number.
+			 * If we're creating a subdatabase, then the magic
+			 * number will be 0.  We allocate a page as part of
+			 * finding out what the base page number will be for
+			 * the new subdatabase, but it's not initialized in
+			 * any way.
+			 *
+			 * The second case happens if we are in recovery
+			 * and we are going to recreate a database: it's
+			 * possible that its page was created (on systems
+			 * where pages must be created explicitly to avoid
+			 * holes in files) but is still 0.
+			 */
+			if (IS_SUBDB_SETUP) {		/* Case 1 */
+				if ((IS_RECOVERING(dbenv)
+				    && F_ISSET((DB_LOG *)
+				    dbenv->lg_handle, DBLOG_FORCE_OPEN))
+				    || ((DBMETA *)mbuf)->pgno != PGNO_INVALID)
+					goto empty;
+
+				ret = EINVAL;
+				goto err;
+			}
+							/* Case 2 */
+			if (IS_RECOVERING(dbenv)) {
+				*retflags |= DB_FILE_SETUP_ZERO;
+				goto empty;
+			}
+			goto bad_format;
+		default:
+			if (F_ISSET(dbp, DB_AM_SWAP))
+				goto bad_format;
+
+			M_32_SWAP(magic);
+			F_SET(dbp, DB_AM_SWAP);
+			goto swap_retry;
+		}
+	} else {
+		/*
+		 * Only newly created files are permitted to fail magic
+		 * number tests.
+		 */
+		if (nr != 0 || (!IS_RECOVERING(dbenv) && IS_SUBDB_SETUP))
+			goto bad_format;
+
+		/* Let the caller know that we had a 0-length file. */
+		if (!LF_ISSET(DB_CREATE | DB_TRUNCATE))
+			*retflags |= DB_FILE_SETUP_ZERO;
+
+		/*
+		 * The only way we can reach here with the DB_CREATE flag set
+		 * is if we created the file.  If that's not the case, then
+		 * either (a) someone else created the file but has not yet
+		 * written out the metadata page, or (b) we truncated the file
+		 * (DB_TRUNCATE) leaving it zero-length.  In the case of (a),
+		 * we want to sleep and give the file creator time to write
+		 * the metadata page.  In the case of (b), we want to continue.
+		 *
+		 * !!!
+		 * There's a race in the case of two processes opening the file
+		 * with the DB_TRUNCATE flag set at roughly the same time, and
+		 * they could theoretically hurt each other.  Sure hope that's
+		 * unlikely.
+		 */
+		if (!LF_ISSET(DB_CREATE | DB_TRUNCATE) &&
+		    !IS_RECOVERING(dbenv)) {
+			if (retry_cnt++ < 3) {
+				__os_sleep(dbenv, 1, 0);
+				goto open_retry;
+			}
+bad_format:		if (!IS_RECOVERING(dbenv))
+				__db_err(dbenv,
+				    "%s: unexpected file type or format", name);
+			ret = EINVAL;
+			goto err;
+		}
+
+		DB_ASSERT (dbp->type != DB_UNKNOWN);
+
+empty:		/*
+		 * The file is empty, and that's OK.  If it's not a subdatabase,
+		 * though, we do need to generate a unique file ID for it.  The
+		 * unique file ID includes a timestamp so that we can't collide
+		 * with any other files, even when the file IDs (dev/inode pair)
+		 * are reused.
+		 */
+		if (!IS_SUBDB_SETUP) {
+			if (*retflags & DB_FILE_SETUP_ZERO)
+				memset(dbp->fileid, 0, DB_FILE_ID_LEN);
+			else if ((ret = __os_fileid(dbenv,
+			    real_name, 1, dbp->fileid)) != 0)
+				goto err_msg;
+		}
+	}
+
+	if (0) {
+err_msg:	__db_err(dbenv, "%s: %s", name, db_strerror(ret));
+	}
+
+	/*
+	 * Abort any running transaction -- it can only exist if something
+	 * went wrong.
+	 */
+err:
+DB_TEST_RECOVERY_LABEL
+
+	/*
+	 * If we opened a file handle and our caller is doing fcntl(2) locking,
+	 * then we can't close it because that would discard the caller's lock.
+	 * Otherwise, close the handle.
+	 */
+	if (F_ISSET(fhp, DB_FH_VALID)) {
+		if (ret == 0 && LF_ISSET(DB_FCNTL_LOCKING))
+			dbp->saved_open_fhp = fhp;
+		else
+			if ((t_ret = __os_closehandle(fhp)) != 0 && ret == 0)
+				ret = t_ret;
+	}
+
+	/*
+	 * This must be done after the file is closed, since
+	 * txn_abort() may remove the file, and an open file
+	 * cannot be removed on Windows platforms.
+	 */
+	if (txn != NULL)
+		(void)txn_abort(txn);
+
+	if (real_name != NULL)
+		__os_freestr(real_name);
+
+	return (ret);
+}
+
+/*
+ * __db_set_pgsize --
+ *	Set the page size based on file information.
+ */
+static int
+__db_set_pgsize(dbp, fhp, name)
+	DB *dbp;
+	DB_FH *fhp;
+	char *name;
+{
+	DB_ENV *dbenv;
+	u_int32_t iopsize;
+	int ret;
+
+	dbenv = dbp->dbenv;
+
+	/*
+	 * Use the filesystem's optimum I/O size as the pagesize if a pagesize
+	 * is not specified.  Some filesystems have 64K as their optimum I/O size,
+	 * but as that results in fairly large default caches, we limit the
+	 * default pagesize to 16K.
+	 */
+	if ((ret = __os_ioinfo(dbenv, name, fhp, NULL, NULL, &iopsize)) != 0) {
+		__db_err(dbenv, "%s: %s", name, db_strerror(ret));
+		return (ret);
+	}
+	if (iopsize < 512)
+		iopsize = 512;
+	if (iopsize > 16 * 1024)
+		iopsize = 16 * 1024;
+
+	/*
+	 * Sheer paranoia, but we don't want anything that's not a power-of-2
+	 * (we rely on that for alignment of various types on the pages), and
+	 * we want a multiple of the sector size as well.
+	 */
+	OS_ROUNDOFF(iopsize, 512);
+
+	dbp->pgsize = iopsize;
+	F_SET(dbp, DB_AM_PGDEF);
+
+	return (0);
+}
+
+/*
+ * __db_close --
+ *	DB destructor.
+ *
+ * PUBLIC: int __db_close __P((DB *, u_int32_t));
+ */
+int
+__db_close(dbp, flags)
+	DB *dbp;
+	u_int32_t flags;
+{
+	DB_ENV *dbenv;
+	DBC *dbc;
+	int ret, t_ret;
+
+	ret = 0;
+
+	dbenv = dbp->dbenv;
+	PANIC_CHECK(dbenv);
+
+	/* Validate arguments. */
+	if ((ret = __db_closechk(dbp, flags)) != 0)
+		goto err;
+
+	/* If never opened, or not currently open, it's easy. */
+	if (!F_ISSET(dbp, DB_OPEN_CALLED))
+		goto never_opened;
+
+	/* Sync the underlying access method. */
+	if (!LF_ISSET(DB_NOSYNC) && !F_ISSET(dbp, DB_AM_DISCARD) &&
+	    (t_ret = dbp->sync(dbp, 0)) != 0 && ret == 0)
+		ret = t_ret;
+
+	/*
+	 * Go through the active cursors and call the cursor recycle routine,
+	 * which resolves pending operations and moves the cursors onto the
+	 * free list.  Then, walk the free list and call the cursor destroy
+	 * routine.
+	 */
+	while ((dbc = TAILQ_FIRST(&dbp->active_queue)) != NULL)
+		if ((t_ret = dbc->c_close(dbc)) != 0 && ret == 0)
+			ret = t_ret;
+	while ((dbc = TAILQ_FIRST(&dbp->free_queue)) != NULL)
+		if ((t_ret = __db_c_destroy(dbc)) != 0 && ret == 0)
+			ret = t_ret;
+
+	/*
+	 * Close any outstanding join cursors.  Join cursors destroy
+	 * themselves on close and have no separate destroy routine.
+	 */
+	while ((dbc = TAILQ_FIRST(&dbp->join_queue)) != NULL)
+		if ((t_ret = dbc->c_close(dbc)) != 0 && ret == 0)
+			ret = t_ret;
+
+	/* Remove this DB handle from the DB_ENV's dblist. */
+	MUTEX_THREAD_LOCK(dbenv, dbenv->dblist_mutexp);
+	LIST_REMOVE(dbp, dblistlinks);
+	MUTEX_THREAD_UNLOCK(dbenv, dbenv->dblist_mutexp);
+
+	/* Sync the memory pool. */
+	if (!LF_ISSET(DB_NOSYNC) && !F_ISSET(dbp, DB_AM_DISCARD) &&
+	    (t_ret = memp_fsync(dbp->mpf)) != 0 &&
+	    t_ret != DB_INCOMPLETE && ret == 0)
+		ret = t_ret;
+
+	/* Close any handle we've been holding since the open.  */
+	if (dbp->saved_open_fhp != NULL &&
+	    F_ISSET(dbp->saved_open_fhp, DB_FH_VALID) &&
+	    (t_ret = __os_closehandle(dbp->saved_open_fhp)) != 0 && ret == 0)
+		ret = t_ret;
+
+never_opened:
+	/*
+	 * Call the access method specific close functions.
+	 *
+	 * !!!
+	 * Because of where the function is called in the close process,
+	 * these routines can't do anything that would dirty pages or
+	 * otherwise affect closing down the database.
+	 */
+	if ((t_ret = __ham_db_close(dbp)) != 0 && ret == 0)
+		ret = t_ret;
+	if ((t_ret = __bam_db_close(dbp)) != 0 && ret == 0)
+		ret = t_ret;
+	if ((t_ret = __qam_db_close(dbp)) != 0 && ret == 0)
+		ret = t_ret;
+
+err:
+	/* Refresh the structure and close any local environment. */
+	if ((t_ret = __db_refresh(dbp)) != 0 && ret == 0)
+		ret = t_ret;
+	if (F_ISSET(dbenv, DB_ENV_DBLOCAL) &&
+	    --dbenv->dblocal_ref == 0 &&
+	    (t_ret = dbenv->close(dbenv, 0)) != 0 && ret == 0)
+		ret = t_ret;
+
+	memset(dbp, CLEAR_BYTE, sizeof(*dbp));
+	__os_free(dbp, sizeof(*dbp));
+
+	return (ret);
+}
+
+/*
+ * __db_refresh --
+ *	Refresh the DB structure, releasing any allocated resources.
+ */
+static int
+__db_refresh(dbp)
+	DB *dbp;
+{
+	DB_ENV *dbenv;
+	DBC *dbc;
+	int ret, t_ret;
+
+	ret = 0;
+
+	dbenv = dbp->dbenv;
+
+	/*
+	 * Go through the active cursors and call the cursor recycle routine,
+	 * which resolves pending operations and moves the cursors onto the
+	 * free list.  Then, walk the free list and call the cursor destroy
+	 * routine.
+	 */
+	while ((dbc = TAILQ_FIRST(&dbp->active_queue)) != NULL)
+		if ((t_ret = dbc->c_close(dbc)) != 0 && ret == 0)
+			ret = t_ret;
+	while ((dbc = TAILQ_FIRST(&dbp->free_queue)) != NULL)
+		if ((t_ret = __db_c_destroy(dbc)) != 0 && ret == 0)
+			ret = t_ret;
+
+	dbp->type = 0;
+
+	/* Close the memory pool file handle. */
+	if (dbp->mpf != NULL) {
+		if (F_ISSET(dbp, DB_AM_DISCARD))
+			(void)__memp_fremove(dbp->mpf);
+		if ((t_ret = memp_fclose(dbp->mpf)) != 0 && ret == 0)
+			ret = t_ret;
+		dbp->mpf = NULL;
+	}
+
+	/* Discard the thread mutex. */
+	if (dbp->mutexp != NULL) {
+		__db_mutex_free(dbenv, dbenv->reginfo, dbp->mutexp);
+		dbp->mutexp = NULL;
+	}
+
+	/* Discard the log file id. */
+	if (!IS_RECOVERING(dbenv)
+	    && dbp->log_fileid != DB_LOGFILEID_INVALID)
+		(void)log_unregister(dbenv, dbp);
+
+	F_CLR(dbp, DB_AM_DISCARD);
+	F_CLR(dbp, DB_AM_INMEM);
+	F_CLR(dbp, DB_AM_RDONLY);
+	F_CLR(dbp, DB_AM_SWAP);
+	F_CLR(dbp, DB_DBM_ERROR);
+	F_CLR(dbp, DB_OPEN_CALLED);
+
+	return (ret);
+}
+
+/*
+ * __db_remove
+ *	Remove method for DB.
+ *
+ * PUBLIC: int __db_remove __P((DB *, const char *, const char *, u_int32_t));
+ */
+int
+__db_remove(dbp, name, subdb, flags)
+	DB *dbp;
+	const char *name, *subdb;
+	u_int32_t flags;
+{
+	DBT namedbt;
+	DB_ENV *dbenv;
+	DB_LOCK remove_lock;
+	DB_LSN newlsn;
+	int ret, t_ret, (*callback_func) __P((DB *, void *));
+	char *backup, *real_back, *real_name;
+	void *cookie;
+
+	dbenv = dbp->dbenv;
+	ret = 0;
+	backup = real_back = real_name = NULL;
+
+	PANIC_CHECK(dbenv);
+	/*
+	 * Cannot use DB_ILLEGAL_AFTER_OPEN here because that returns
+	 * and we cannot return, but must deal with the error and destroy
+	 * the handle anyway.
+	 */
+	if (F_ISSET(dbp, DB_OPEN_CALLED)) {
+		ret = __db_mi_open(dbp->dbenv, "remove", 1);
+		goto err_close;
+	}
+
+	/* Validate arguments. */
+	if ((ret = __db_removechk(dbp, flags)) != 0)
+		goto err_close;
+
+	/*
+	 * Subdatabases.
+	 */
+	if (subdb != NULL) {
+		/* Subdatabases must be created in named files. */
+		if (name == NULL) {
+			__db_err(dbenv,
+		    "multiple databases cannot be created in temporary files");
+			goto err_close;
+		}
+		return (__db_subdb_remove(dbp, name, subdb));
+	}
+
+	if ((ret = dbp->open(dbp,
+	    name, NULL, DB_UNKNOWN, DB_RDWRMASTER, 0)) != 0)
+		goto err_close;
+
+	if (LOGGING_ON(dbenv) && (ret = __log_file_lock(dbp)) != 0)
+		goto err_close;
+
+	if ((ret = dbp->sync(dbp, 0)) != 0)
+		goto err_close;
+
+	/* Start the transaction and log the delete. */
+	if (TXN_ON(dbenv) && (ret = __db_metabegin(dbp, &remove_lock)) != 0)
+		goto err_close;
+
+	if (LOGGING_ON(dbenv)) {
+		memset(&namedbt, 0, sizeof(namedbt));
+		namedbt.data = (char *)name;
+		namedbt.size = strlen(name) + 1;
+
+		if ((ret = __crdel_delete_log(dbenv,
+		    dbp->open_txn, &newlsn, DB_FLUSH,
+		    dbp->log_fileid, &namedbt)) != 0) {
+			__db_err(dbenv,
+			    "%s: %s", name, db_strerror(ret));
+			goto err;
+		}
+	}
+
+	/* Find the real name of the file. */
+	if ((ret = __db_appname(dbenv,
+	    DB_APP_DATA, NULL, name, 0, NULL, &real_name)) != 0)
+		goto err;
+
+	/*
+	 * XXX
+	 * We don't bother to open the file and call __memp_fremove on the mpf.
+	 * There is a potential race here.  It is at least possible that, if
+	 * the unique filesystem ID (dev/inode pair on UNIX) is reallocated
+	 * within a second (the granularity of the fileID timestamp), a new
+	 * file open will get the same fileID as the file being "removed".
+	 * We may actually want to open the file and call __memp_fremove on
+	 * the mpf to get around this.
+	 */
+
+	/* Create name for backup file. */
+	if (TXN_ON(dbenv)) {
+		if ((ret =
+		    __db_backup_name(dbenv, name, &backup, &newlsn)) != 0)
+			goto err;
+		if ((ret = __db_appname(dbenv,
+		    DB_APP_DATA, NULL, backup, 0, NULL, &real_back)) != 0)
+			goto err;
+	}
+
+	callback_func = __db_remove_callback;
+	cookie = real_back;
+	DB_TEST_RECOVERY(dbp, DB_TEST_PRERENAME, ret, name);
+	if (dbp->db_am_remove != NULL &&
+	    (ret = dbp->db_am_remove(dbp,
+	    name, subdb, &newlsn, &callback_func, &cookie)) != 0)
+		goto err;
+	/*
+	 * On Windows, the underlying file must be closed to perform a remove.
+	 * Nothing later in __db_remove requires that it be open, and the
+	 * dbp->close closes it anyway, so we just close it early.
+	 */
+	(void)__memp_fremove(dbp->mpf);
+	if ((ret = memp_fclose(dbp->mpf)) != 0)
+		goto err;
+	dbp->mpf = NULL;
+
+	if (TXN_ON(dbenv))
+		ret = __os_rename(dbenv, real_name, real_back);
+	else
+		ret = __os_unlink(dbenv, real_name);
+
+	DB_TEST_RECOVERY(dbp, DB_TEST_POSTRENAME, ret, name);
+
+err:
+DB_TEST_RECOVERY_LABEL
+	/*
+	 * End the transaction, committing the transaction if we were
+	 * successful, aborting otherwise.
+	 */
+	if (dbp->open_txn != NULL && (t_ret = __db_metaend(dbp, &remove_lock,
+	   ret == 0, callback_func, cookie)) != 0 && ret == 0)
+		ret = t_ret;
+
+	/* FALLTHROUGH */
+
+err_close:
+	if (real_back != NULL)
+		__os_freestr(real_back);
+	if (real_name != NULL)
+		__os_freestr(real_name);
+	if (backup != NULL)
+		__os_freestr(backup);
+
+	/* We no longer have an mpool, so syncing would be disastrous. */
+	if ((t_ret = dbp->close(dbp, DB_NOSYNC)) != 0 && ret == 0)
+		ret = t_ret;
+
+	return (ret);
+}
+
+/*
+ * __db_subdb_remove --
+ *	Remove a subdatabase.
+ */
+static int
+__db_subdb_remove(dbp, name, subdb)
+	DB *dbp;
+	const char *name, *subdb;
+{
+	DB *mdbp;
+	DBC *dbc;
+	DB_ENV *dbenv;
+	DB_LOCK remove_lock;
+	db_pgno_t meta_pgno;
+	int ret, t_ret;
+
+	mdbp = NULL;
+	dbc = NULL;
+	dbenv = dbp->dbenv;
+
+	/* Start the transaction. */
+	if (TXN_ON(dbenv) && (ret = __db_metabegin(dbp, &remove_lock)) != 0)
+		goto err_close;
+
+	/*
+	 * Open the subdatabase.  We can use the user's DB handle for this
+	 * purpose, I think.
+	 */
+	if ((ret = __db_open(dbp, name, subdb, DB_UNKNOWN, 0, 0)) != 0)
+		goto err;
+
+	/* Free up the pages in the subdatabase. */
+	switch (dbp->type) {
+		case DB_BTREE:
+		case DB_RECNO:
+			if ((ret = __bam_reclaim(dbp, dbp->open_txn)) != 0)
+				goto err;
+			break;
+		case DB_HASH:
+			if ((ret = __ham_reclaim(dbp, dbp->open_txn)) != 0)
+				goto err;
+			break;
+		default:
+			ret = __db_unknown_type(dbp->dbenv,
+			     "__db_subdb_remove", dbp->type);
+			goto err;
+	}
+
+	/*
+	 * Remove the entry from the main database and free the subdatabase
+	 * metadata page.
+	 */
+	if ((ret = __db_master_open(dbp, name, 0, 0, &mdbp)) != 0)
+		goto err;
+
+	if ((ret = __db_master_update(mdbp,
+	     subdb, dbp->type, &meta_pgno, MU_REMOVE, NULL, 0)) != 0)
+		goto err;
+
+err:	/*
+	 * End the transaction, committing the transaction if we were
+	 * successful, aborting otherwise.
+	 */
+	if (dbp->open_txn != NULL && (t_ret = __db_metaend(dbp,
+	    &remove_lock, ret == 0, NULL, NULL)) != 0 && ret == 0)
+		ret = t_ret;
+
+err_close:
+	/*
+	 * Close the user's DB handle -- do this LAST to avoid smashing the
+	 * transaction information.
+	 */
+	if ((t_ret = dbp->close(dbp, 0)) != 0 && ret == 0)
+		ret = t_ret;
+
+	if (mdbp != NULL && (t_ret = mdbp->close(mdbp, 0)) != 0 && ret == 0)
+		ret = t_ret;
+
+	return (ret);
+}
+
+/*
+ * __db_rename
+ *	Rename method for DB.
+ *
+ * PUBLIC: int __db_rename __P((DB *,
+ * PUBLIC:     const char *, const char *, const char *, u_int32_t));
+ */
+int
+__db_rename(dbp, filename, subdb, newname, flags)
+	DB *dbp;
+	const char *filename, *subdb, *newname;
+	u_int32_t flags;
+{
+	DBT namedbt, newnamedbt;
+	DB_ENV *dbenv;
+	DB_LOCK remove_lock;
+	DB_LSN newlsn;
+	char *real_name, *real_newname;
+	int ret, t_ret;
+
+	dbenv = dbp->dbenv;
+	ret = 0;
+	real_name = real_newname = NULL;
+
+	PANIC_CHECK(dbenv);
+	/*
+	 * Cannot use DB_ILLEGAL_AFTER_OPEN here because that returns
+	 * and we cannot return, but must deal with the error and destroy
+	 * the handle anyway.
+	 */
+	if (F_ISSET(dbp, DB_OPEN_CALLED)) {
+		ret = __db_mi_open(dbp->dbenv, "rename", 1);
+		goto err_close;
+	}
+
+	/* Validate arguments -- has same rules as remove. */
+	if ((ret = __db_removechk(dbp, flags)) != 0)
+		goto err_close;
+
+	/*
+	 * Subdatabases.
+	 */
+	if (subdb != NULL) {
+		if (filename == NULL) {
+			__db_err(dbenv,
+		    "multiple databases cannot be created in temporary files");
+			goto err_close;
+		}
+		return (__db_subdb_rename(dbp, filename, subdb, newname));
+	}
+
+	if ((ret = dbp->open(dbp,
+	    filename, NULL, DB_UNKNOWN, DB_RDWRMASTER, 0)) != 0)
+		goto err_close;
+
+	if (LOGGING_ON(dbenv) && (ret = __log_file_lock(dbp)) != 0)
+		goto err_close;
+
+	if ((ret = dbp->sync(dbp, 0)) != 0)
+		goto err_close;
+
+	/* Start the transaction and log the rename. */
+	if (TXN_ON(dbenv) && (ret = __db_metabegin(dbp, &remove_lock)) != 0)
+		goto err_close;
+
+	if (LOGGING_ON(dbenv)) {
+		memset(&namedbt, 0, sizeof(namedbt));
+		namedbt.data = (char *)filename;
+		namedbt.size = strlen(filename) + 1;
+
+		memset(&newnamedbt, 0, sizeof(newnamedbt));
+		newnamedbt.data = (char *)newname;
+		newnamedbt.size = strlen(newname) + 1;
+
+		if ((ret = __crdel_rename_log(dbenv, dbp->open_txn,
+		    &newlsn, 0, dbp->log_fileid, &namedbt, &newnamedbt)) != 0) {
+			__db_err(dbenv, "%s: %s", filename, db_strerror(ret));
+			goto err;
+		}
+
+		if ((ret = __log_filelist_update(dbenv, dbp,
+		    dbp->log_fileid, newname, NULL)) != 0)
+			goto err;
+	}
+
+	/* Find the real name of the file. */
+	if ((ret = __db_appname(dbenv,
+	    DB_APP_DATA, NULL, filename, 0, NULL, &real_name)) != 0)
+		goto err;
+
+	/* Find the real newname of the file. */
+	if ((ret = __db_appname(dbenv,
+	    DB_APP_DATA, NULL, newname, 0, NULL, &real_newname)) != 0)
+		goto err;
+
+	/*
+	 * It is an error to rename a file over one that already exists,
+	 * as that wouldn't be transaction-safe.
+	 */
+	if (__os_exists(real_newname, NULL) == 0) {
+		ret = EEXIST;
+		__db_err(dbenv, "rename: file %s exists", real_newname);
+		goto err;
+	}
+
+	DB_TEST_RECOVERY(dbp, DB_TEST_PRERENAME, ret, filename);
+	if (dbp->db_am_rename != NULL &&
+	    (ret = dbp->db_am_rename(dbp, filename, subdb, newname)) != 0)
+		goto err;
+	/*
+	 * We have to flush the cache for a couple of reasons.  First, the
+	 * underlying MPOOLFILE maintains a "name" that unrelated processes
+	 * can use to open the file in order to flush pages, and that name
+	 * is about to be wrong.  Second, on Windows the unique file ID is
+	 * generated from the file's name, not other file information as is
+	 * the case on UNIX, and so a subsequent open of the old file name
+	 * could conceivably result in a matching "unique" file ID.
+	 */
+	if ((ret = __memp_fremove(dbp->mpf)) != 0)
+		goto err;
+
+	/*
+	 * On Windows, the underlying file must be closed to perform a rename.
+	 * Nothing later in __db_rename requires that it be open, and the call
+	 * to dbp->close closes it anyway, so we just close it early.
+	 */
+	if ((ret = memp_fclose(dbp->mpf)) != 0)
+		goto err;
+	dbp->mpf = NULL;
+
+	ret = __os_rename(dbenv, real_name, real_newname);
+	DB_TEST_RECOVERY(dbp, DB_TEST_POSTRENAME, ret, newname);
+
+DB_TEST_RECOVERY_LABEL
+err:	if (dbp->open_txn != NULL && (t_ret = __db_metaend(dbp,
+	    &remove_lock, ret == 0, NULL, NULL)) != 0 && ret == 0)
+		ret = t_ret;
+
+err_close:
+	/* We no longer have an mpool, so syncing would be disastrous. */
+	dbp->close(dbp, DB_NOSYNC);
+	if (real_name != NULL)
+		__os_freestr(real_name);
+	if (real_newname != NULL)
+		__os_freestr(real_newname);
+
+	return (ret);
+}
+
+/*
+ * __db_subdb_rename --
+ *	Rename a subdatabase.
+ */
+static int
+__db_subdb_rename(dbp, name, subdb, newname)
+	DB *dbp;
+	const char *name, *subdb, *newname;
+{
+	DB *mdbp;
+	DBC *dbc;
+	DB_ENV *dbenv;
+	DB_LOCK remove_lock;
+	int ret, t_ret;
+
+	mdbp = NULL;
+	dbc = NULL;
+	dbenv = dbp->dbenv;
+
+	/* Start the transaction. */
+	if (TXN_ON(dbenv) && (ret = __db_metabegin(dbp, &remove_lock)) != 0)
+		goto err_close;
+
+	/*
+	 * Open the subdatabase.  We can use the user's DB handle for this
+	 * purpose, I think.
+	 */
+	if ((ret = __db_open(dbp, name, subdb, DB_UNKNOWN, 0, 0)) != 0)
+		goto err;
+
+	/*
+	 * Rename the entry in the main database.
+	 */
+	if ((ret = __db_master_open(dbp, name, 0, 0, &mdbp)) != 0)
+		goto err;
+
+	if ((ret = __db_master_update(mdbp,
+	     subdb, dbp->type, NULL, MU_RENAME, newname, 0)) != 0)
+		goto err;
+
+err:	/*
+	 * End the transaction, committing the transaction if we were
+	 * successful, aborting otherwise.
+	 */
+	if (dbp->open_txn != NULL && (t_ret = __db_metaend(dbp,
+	    &remove_lock, ret == 0, NULL, NULL)) != 0 && ret == 0)
+		ret = t_ret;
+
+err_close:
+	/*
+	 * Close the user's DB handle -- do this LAST to avoid smashing the
+	 * transaction information.
+	 */
+	if ((t_ret = dbp->close(dbp, 0)) != 0 && ret == 0)
+		ret = t_ret;
+
+	if (mdbp != NULL && (t_ret = mdbp->close(mdbp, 0)) != 0 && ret == 0)
+		ret = t_ret;
+
+	return (ret);
+}
+
+/*
+ * __db_metabegin --
+ *
+ * Begin a meta-data operation.  This involves doing any required locking,
+ * potentially beginning a transaction and then telling the caller if you
+ * did or did not begin the transaction.
+ *
+ * The writing flag indicates if the caller is actually allowing creates
+ * or doing deletes (i.e., if the caller is opening and not creating, then
+ * we don't need to do any of this).
+ * PUBLIC: int __db_metabegin __P((DB *, DB_LOCK *));
+ */
+int
+__db_metabegin(dbp, lockp)
+	DB *dbp;
+	DB_LOCK *lockp;
+{
+	DB_ENV *dbenv;
+	DBT dbplock;
+	u_int32_t locker, lockval;
+	int ret;
+
+	dbenv = dbp->dbenv;
+
+	lockp->off = LOCK_INVALID;
+
+	/*
+	 * There is no single place where we can know that we are or are not
+	 * going to be creating any files and/or subdatabases, so we will
+	 * always begin a transaction when we start creating one.  If we later
+	 * discover that this was unnecessary, we will abort the transaction.
+	 * Recovery is written so that if we log a file create, but then
+	 * discover that we didn't have to do it, we recover correctly.  The
+	 * file recovery design document has details.
+	 *
+	 * We need to single thread all create and delete operations, so if we
+	 * are running with locking, we must obtain a lock. We use lock_id to
+	 * generate a unique locker id and use a handcrafted DBT as the object
+	 * on which we are locking.
+	 */
+	if (LOCKING_ON(dbenv)) {
+		if ((ret = lock_id(dbenv, &locker)) != 0)
+			return (ret);
+		lockval = 0;
+		dbplock.data = &lockval;
+		dbplock.size = sizeof(lockval);
+		if ((ret = lock_get(dbenv,
+		    locker, 0, &dbplock, DB_LOCK_WRITE, lockp)) != 0)
+			return (ret);
+	}
+
+	return (txn_begin(dbenv, NULL, &dbp->open_txn, 0));
+}
+
+/*
+ * __db_metaend --
+ *	End a meta-data operation.
+ * PUBLIC: int __db_metaend __P((DB *,
+ * PUBLIC:       DB_LOCK *, int, int (*)(DB *, void *), void *));
+ */
+int
+__db_metaend(dbp, lockp, commit, callback, cookie)
+	DB *dbp;
+	DB_LOCK *lockp;
+	int commit, (*callback) __P((DB *, void *));
+	void *cookie;
+{
+	DB_ENV *dbenv;
+	int ret, t_ret;
+
+	ret = 0;
+	dbenv = dbp->dbenv;
+
+	/* End the transaction. */
+	if (commit) {
+		if ((ret = txn_commit(dbp->open_txn, DB_TXN_SYNC)) == 0) {
+			/*
+			 * Unlink any underlying file, we've committed the
+			 * transaction.
+			 */
+			if (callback != NULL)
+				ret = callback(dbp, cookie);
+		}
+	} else if ((t_ret = txn_abort(dbp->open_txn)) && ret == 0)
+		ret = t_ret;
+
+	/* Release our lock. */
+	if (lockp->off != LOCK_INVALID &&
+	    (t_ret = lock_put(dbenv, lockp)) != 0 && ret == 0)
+		ret = t_ret;
+
+	return (ret);
+}
+
+/*
+ * __db_log_page
+ *	Log a meta-data or root page during a create operation.
+ *
+ * PUBLIC: int __db_log_page __P((DB *,
+ * PUBLIC:     const char *, DB_LSN *, db_pgno_t, PAGE *));
+ */
+int
+__db_log_page(dbp, name, lsn, pgno, page)
+	DB *dbp;
+	const char *name;
+	DB_LSN *lsn;
+	db_pgno_t pgno;
+	PAGE *page;
+{
+	DBT name_dbt, page_dbt;
+	DB_LSN new_lsn;
+	int ret;
+
+	if (dbp->open_txn == NULL)
+		return (0);
+
+	memset(&page_dbt, 0, sizeof(page_dbt));
+	page_dbt.size = dbp->pgsize;
+	page_dbt.data = page;
+	if (pgno == PGNO_BASE_MD) {
+		/*
+		 * !!!
+		 * Make sure that we properly handle a null name.  The old
+		 * Tcl sent us pathnames of the form ""; it may be the case
+		 * that the new Tcl doesn't do that, so we can get rid of
+		 * the second check here.
+		 */
+		memset(&name_dbt, 0, sizeof(name_dbt));
+		name_dbt.data = (char *)name;
+		if (name == NULL || *name == '\0')
+			name_dbt.size = 0;
+		else
+			name_dbt.size = strlen(name) + 1;
+
+		ret = __crdel_metapage_log(dbp->dbenv,
+		    dbp->open_txn, &new_lsn, DB_FLUSH,
+		    dbp->log_fileid, &name_dbt, pgno, &page_dbt);
+	} else
+		ret = __crdel_metasub_log(dbp->dbenv, dbp->open_txn,
+		    &new_lsn, 0, dbp->log_fileid, pgno, &page_dbt, lsn);
+
+	if (ret == 0)
+		page->lsn = new_lsn;
+	return (ret);
+}
+
+/*
+ * __db_backup_name
+ *	Create the backup file name for a given file.
+ *
+ * PUBLIC: int __db_backup_name __P((DB_ENV *,
+ * PUBLIC:     const char *, char **, DB_LSN *));
+ */
+#undef	BACKUP_PREFIX
+#define	BACKUP_PREFIX	"__db."
+
+#undef	MAX_LSN_TO_TEXT
+#define	MAX_LSN_TO_TEXT	21
+int
+__db_backup_name(dbenv, name, backup, lsn)
+	DB_ENV *dbenv;
+	const char *name;
+	char **backup;
+	DB_LSN *lsn;
+{
+	size_t len;
+	int plen, ret;
+	char *p, *retp;
+
+	len = strlen(name) + strlen(BACKUP_PREFIX) + MAX_LSN_TO_TEXT + 1;
+
+	if ((ret = __os_malloc(dbenv, len, NULL, &retp)) != 0)
+		return (ret);
+
+	/*
+	 * Create the name.  Backup file names are of the form:
+	 *
+	 *	__db.name.0x[lsn-file]0x[lsn-offset]
+	 *
+	 * which guarantees uniqueness.
+	 *
+	 * However, name may contain an env-relative path in it.
+	 * In that case, put the __db. after the last portion of
+	 * the pathname.
+	 */
+	if ((p = __db_rpath(name)) == NULL)
+		snprintf(retp, len,
+		    "%s%s.0x%x0x%x", BACKUP_PREFIX, name,
+		    lsn->file, lsn->offset);
+	else {
+		plen = p - name + 1;
+		p++;
+		snprintf(retp, len,
+		    "%.*s%s%s.0x%x0x%x", plen, name, BACKUP_PREFIX, p,
+		    lsn->file, lsn->offset);
+	}
+
+	*backup = retp;
+	return (0);
+}
+
+/*
+ * __db_remove_callback --
+ *	Callback function -- on file remove commit, it unlinks the backing
+ *	file.
+ */
+static int
+__db_remove_callback(dbp, cookie)
+	DB *dbp;
+	void *cookie;
+{
+	return (__os_unlink(dbp->dbenv, cookie));
+}
+
+/*
+ * __dblist_get --
+ *	Get the first element of dbenv->dblist with
+ *	dbp->adj_fileid matching adjid.
+ *
+ * PUBLIC: DB *__dblist_get __P((DB_ENV *, u_int32_t));
+ */
+DB *
+__dblist_get(dbenv, adjid)
+	DB_ENV *dbenv;
+	u_int32_t adjid;
+{
+	DB *dbp;
+
+	for (dbp = LIST_FIRST(&dbenv->dblist);
+	    dbp != NULL && dbp->adj_fileid != adjid;
+	    dbp = LIST_NEXT(dbp, dblistlinks))
+		;
+
+	return (dbp);
+}
+
+#if	CONFIG_TEST
+/*
+ * __db_testcopy
+ *	Create a copy of all backup files and our "main" DB.
+ *
+ * PUBLIC: int __db_testcopy __P((DB *, const char *));
+ */
+int
+__db_testcopy(dbp, name)
+	DB *dbp;
+	const char *name;
+{
+	if (dbp->type == DB_QUEUE)
+		return (__qam_testdocopy(dbp, name));
+	else
+		return (__db_testdocopy(dbp, name));
+}
+
+static int
+__qam_testdocopy(dbp, name)
+	DB *dbp;
+	const char *name;
+{
+	QUEUE_FILELIST *filelist, *fp;
+	char buf[256], *dir;
+	int ret;
+
+	filelist = NULL;
+	if ((ret = __db_testdocopy(dbp, name)) != 0)
+		return (ret);
+	if (dbp->mpf != NULL &&
+	    (ret = __qam_gen_filelist(dbp, &filelist)) != 0)
+		return (ret);
+
+	if (filelist == NULL)
+		return (0);
+	dir = ((QUEUE *)dbp->q_internal)->dir;
+	for (fp = filelist; fp->mpf != NULL; fp++) {
+		snprintf(buf, sizeof(buf), QUEUE_EXTENT, dir, name, fp->id);
+		if ((ret = __db_testdocopy(dbp, buf)) != 0)
+			return (ret);
+	}
+
+	__os_free(filelist, 0);
+	return (0);
+}
+
+/*
+ * __db_testdocopy
+ *	Create a copy of all backup files and our "main" DB.
+ *
+ */
+static int
+__db_testdocopy(dbp, name)
+	DB *dbp;
+	const char *name;
+{
+	size_t len;
+	int dircnt, i, ret;
+	char **namesp, *backup, *copy, *dir, *p, *real_name;
+	real_name = NULL;
+	/* Get the real backing file name. */
+	if ((ret = __db_appname(dbp->dbenv,
+	    DB_APP_DATA, NULL, name, 0, NULL, &real_name)) != 0)
+		return (ret);
+
+	copy = backup = NULL;
+	namesp = NULL;
+
+	/*
+	 * Maximum length of the file name, including the ".afterop" suffix.
+	 */
+	len = strlen(real_name) + strlen(BACKUP_PREFIX) + MAX_LSN_TO_TEXT + 9;
+
+	if ((ret = __os_malloc(dbp->dbenv, len, NULL, &copy)) != 0)
+		goto out;
+
+	if ((ret = __os_malloc(dbp->dbenv, len, NULL, &backup)) != 0)
+		goto out;
+
+	/*
+	 * First copy the file itself.
+	 */
+	snprintf(copy, len, "%s.afterop", real_name);
+	__db_makecopy(real_name, copy);
+
+	if ((ret = __os_strdup(dbp->dbenv, real_name, &dir)) != 0)
+		goto out;
+	__os_freestr(real_name);
+	real_name = NULL;
+	/*
+	 * Create the name.  Backup file names are of the form:
+	 *
+	 *	__db.name.0x[lsn-file]0x[lsn-offset]
+	 *
+	 * which guarantees uniqueness.  We want to look for the
+	 * backup name, followed by a '.0x' (so that if they have
+	 * files named, say, 'a' and 'abc' we won't match 'abc' when
+	 * looking for 'a').
+	 */
+	snprintf(backup, len, "%s%s.0x", BACKUP_PREFIX, name);
+
+	/*
+	 * We need the directory path to do the __os_dirlist.
+	 */
+	p = __db_rpath(dir);
+	if (p != NULL)
+		*p = '\0';
+	ret = __os_dirlist(dbp->dbenv, dir, &namesp, &dircnt);
+#if DIAGNOSTIC
+	/*
+	 * XXX
+	 * Restore the separator we overwrote above.  The memory guard code
+	 * uses strlen, and since we moved the end of the string earlier, it
+	 * would otherwise check one byte past the new, shorter end of string.
+	 */
+	if (p != NULL)
+		*p = '/';
+#endif
+	__os_freestr(dir);
+	if (ret != 0)
+		goto out;
+	for (i = 0; i < dircnt; i++) {
+		/*
+		 * Need to check if it is a backup file for this.
+		 * No idea what namesp[i] may be or how long, so
+		 * must use strncmp and not memcmp.  We don't want
+		 * to use strcmp either because we are only matching
+		 * the first part of the real file's name.  We don't
+		 * know its LSNs.
+		 */
+		if (strncmp(namesp[i], backup, strlen(backup)) == 0) {
+			if ((ret = __db_appname(dbp->dbenv, DB_APP_DATA,
+			    NULL, namesp[i], 0, NULL, &real_name)) != 0)
+				goto out;
+
+			/*
+			 * This should not happen.  Check that old
+			 * .afterop files aren't around.
+			 * If so, just move on.
+			 */
+			if (strstr(real_name, ".afterop") != NULL) {
+				__os_freestr(real_name);
+				real_name = NULL;
+				continue;
+			}
+			snprintf(copy, len, "%s.afterop", real_name);
+			__db_makecopy(real_name, copy);
+			__os_freestr(real_name);
+			real_name = NULL;
+		}
+	}
+out:
+	if (backup != NULL)
+		__os_freestr(backup);
+	if (copy != NULL)
+		__os_freestr(copy);
+	if (namesp != NULL)
+		__os_dirfree(namesp, dircnt);
+	if (real_name != NULL)
+		__os_freestr(real_name);
+	return (ret);
+}
+
+static void
+__db_makecopy(src, dest)
+	const char *src, *dest;
+{
+	DB_FH rfh, wfh;
+	size_t rcnt, wcnt;
+	char *buf;
+
+	memset(&rfh, 0, sizeof(rfh));
+	memset(&wfh, 0, sizeof(wfh));
+
+	if (__os_malloc(NULL, 1024, NULL, &buf) != 0)
+		return;
+
+	if (__os_open(NULL,
+	    src, DB_OSO_RDONLY, __db_omode("rw----"), &rfh) != 0)
+		goto err;
+	if (__os_open(NULL, dest,
+	    DB_OSO_CREATE | DB_OSO_TRUNC, __db_omode("rw----"), &wfh) != 0)
+		goto err;
+
+	for (;;)
+		if (__os_read(NULL, &rfh, buf, 1024, &rcnt) != 0 || rcnt == 0 ||
+		    __os_write(NULL, &wfh, buf, rcnt, &wcnt) != 0 || wcnt != rcnt)
+			break;
+
+err:	__os_free(buf, 1024);
+	if (F_ISSET(&rfh, DB_FH_VALID))
+		__os_closehandle(&rfh);
+	if (F_ISSET(&wfh, DB_FH_VALID))
+		__os_closehandle(&wfh);
+}
+#endif
diff --git a/bdb/db/db.src b/bdb/db/db.src
new file mode 100644
index 00000000000..b695e1360c5
--- /dev/null
+++ b/bdb/db/db.src
@@ -0,0 +1,178 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ *	Sleepycat Software.  All rights reserved.
+ *
+ *	$Id: db.src,v 11.8 2000/02/17 20:24:07 bostic Exp $
+ */
+
+PREFIX	db
+
+INCLUDE	#include "db_config.h"
+INCLUDE
+INCLUDE #ifndef NO_SYSTEM_INCLUDES
+INCLUDE #include <sys/types.h>
+INCLUDE
+INCLUDE #include <ctype.h>
+INCLUDE #include <errno.h>
+INCLUDE #include <string.h>
+INCLUDE #endif
+INCLUDE
+INCLUDE #include "db_int.h"
+INCLUDE #include "db_page.h"
+INCLUDE #include "db_dispatch.h"
+INCLUDE #include "db_am.h"
+INCLUDE #include "txn.h"
+INCLUDE
+
+/*
+ * addrem -- Add or remove an entry from a duplicate page.
+ *
+ * opcode:	identifies if this is an add or delete.
+ * fileid:	file identifier of the file being modified.
+ * pgno:	duplicate page number.
+ * indx:	location at which to insert or delete.
+ * nbytes:	number of bytes added/removed to/from the page.
+ * hdr:		header for the data item.
+ * dbt:		data that is deleted or is to be added.
+ * pagelsn:	former lsn of the page.
+ *
+ * If the hdr was NULL, then the dbt is a regular B_KEYDATA.
+ * If the dbt was NULL then the hdr is a complete item to be
+ * pasted on the page.
+ */
+BEGIN addrem		41
+ARG	opcode		u_int32_t	lu
+ARG	fileid		int32_t		ld
+ARG	pgno		db_pgno_t	lu
+ARG	indx		u_int32_t	lu
+ARG	nbytes		size_t		lu
+DBT	hdr		DBT		s
+DBT	dbt		DBT		s
+POINTER	pagelsn		DB_LSN *	lu
+END
+
+/*
+ * split -- Handles the split of a duplicate page.
+ *
+ * opcode:	defines whether we are splitting from or splitting onto
+ * fileid:	file identifier of the file being modified.
+ * pgno:	page number being split.
+ * pageimage:	entire page contents.
+ * pagelsn:	former lsn of the page.
+ */
+DEPRECATED split	42
+ARG	opcode		u_int32_t	lu
+ARG	fileid		int32_t		ld
+ARG	pgno		db_pgno_t	lu
+DBT	pageimage	DBT		s
+POINTER	pagelsn		DB_LSN *	lu
+END
+
+/*
+ * big -- Handles addition and deletion of big key/data items.
+ *
+ * opcode:	identifies add/remove.
+ * fileid:	file identifier of the file being modified.
+ * pgno:	page onto which data is being added/removed.
+ * prev_pgno:	the page before the one we are logging.
+ * next_pgno:	the page after the one we are logging.
+ * dbt:		data being written onto the page.
+ * pagelsn:	former lsn of the orig_page.
+ * prevlsn:	former lsn of the prev_pgno.
+ * nextlsn:	former lsn of the next_pgno. This is not currently used, but
+ *		may be used later if we actually do overwrites of big key/
+ *		data items in place.
+ */
+BEGIN big		43
+ARG	opcode		u_int32_t	lu
+ARG	fileid		int32_t		ld
+ARG	pgno		db_pgno_t	lu
+ARG	prev_pgno	db_pgno_t	lu
+ARG	next_pgno	db_pgno_t	lu
+DBT	dbt		DBT		s
+POINTER	pagelsn		DB_LSN *	lu
+POINTER	prevlsn		DB_LSN *	lu
+POINTER	nextlsn		DB_LSN *	lu
+END
+
+/*
+ * ovref -- Handles increment/decrement of overflow page reference count.
+ *
+ * fileid:	identifies the file being modified.
+ * pgno:	page number whose ref count is being incremented/decremented.
+ * adjust:	the adjustment being made.
+ * lsn:		the page's original lsn.
+ */
+BEGIN ovref		44
+ARG	fileid		int32_t		ld
+ARG	pgno		db_pgno_t	lu
+ARG	adjust		int32_t		ld
+POINTER	lsn		DB_LSN *	lu
+END
+
+/*
+ * relink -- Handles relinking around a page.
+ *
+ * opcode:	indicates if this is an addpage or delete page
+ * pgno:	the page being changed.
+ * lsn:		the page's original lsn.
+ * prev:	the previous page.
+ * lsn_prev:	the previous page's original lsn.
+ * next:	the next page.
+ * lsn_next:	the next page's original lsn.
+ */
+BEGIN relink		45
+ARG	opcode		u_int32_t	lu
+ARG	fileid		int32_t		ld
+ARG	pgno		db_pgno_t	lu
+POINTER	lsn		DB_LSN *	lu
+ARG	prev		db_pgno_t	lu
+POINTER	lsn_prev	DB_LSN *	lu
+ARG	next		db_pgno_t	lu
+POINTER	lsn_next	DB_LSN *	lu
+END
+
+/*
+ * Addpage -- Handles adding a new duplicate page onto the end of
+ * an existing duplicate page.
+ * fileid:	identifies the file being changed.
+ * pgno:	page number to which a new page is being added.
+ * lsn:		lsn of pgno
+ * nextpgno:	new page number being added.
+ * nextlsn:	lsn of nextpgno.
+ */
+DEPRECATED addpage	46
+ARG	fileid		int32_t		ld
+ARG	pgno		db_pgno_t	lu
+POINTER	lsn		DB_LSN *	lu
+ARG	nextpgno	db_pgno_t	lu
+POINTER	nextlsn		DB_LSN *	lu
+END
+
+/*
+ * Debug -- log an operation upon entering an access method.
+ * op:		Operation (cursor, c_close, c_get, c_put, c_del,
+ *		get, put, delete).
+ * fileid:	identifies the file being acted upon.
+ * key:		key parameter
+ * data:	data parameter
+ * flags:	flags parameter
+ */
+BEGIN debug		47
+DBT	op		DBT		s
+ARG	fileid		int32_t		ld
+DBT	key		DBT		s
+DBT	data		DBT		s
+ARG	arg_flags	u_int32_t	lu
+END
+
+/*
+ * noop -- do nothing, but get an LSN.
+ */
+BEGIN noop		48
+ARG	fileid		int32_t		ld
+ARG	pgno		db_pgno_t	lu
+POINTER	prevlsn		DB_LSN *	lu
+END
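
Note on the record descriptions above: judging from db_auto.c later in this changeset, each BEGIN/END block becomes a flat log record consisting of a record type, a transaction id and the previous LSN, followed by the listed fields in order, each copied with memcpy. Below is a minimal sketch of that layout for an ovref-shaped record. All `demo_*` names are hypothetical stand-ins, and `struct demo_lsn` only assumes that an LSN is a file number plus an offset; this is not the generated code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for DB_LSN: a log file number and an offset. */
struct demo_lsn {
	uint32_t file;
	uint32_t offset;
};

/* Hypothetical in-memory form of an ovref-shaped record. */
struct demo_ovref {
	uint32_t type;			/* record type */
	uint32_t txnid;			/* transaction id, 0 if none */
	struct demo_lsn prev_lsn;	/* previous record of this txn */
	int32_t fileid;
	uint32_t pgno;
	int32_t adjust;
	struct demo_lsn lsn;
};

/* Packed size: the sum of the field sizes, with no padding. */
#define	DEMO_OVREF_SIZE							\
	(2 * sizeof(uint32_t) + sizeof(struct demo_lsn) +		\
	sizeof(int32_t) + sizeof(uint32_t) + sizeof(int32_t) +		\
	sizeof(struct demo_lsn))

/* Copy one field into the buffer and return the advanced pointer. */
static uint8_t *
demo_put(uint8_t *bp, const void *src, size_t len)
{
	memcpy(bp, src, len);
	return (bp + len);
}

/* Copy one field out of the buffer and return the advanced pointer. */
static const uint8_t *
demo_get(void *dst, const uint8_t *bp, size_t len)
{
	memcpy(dst, bp, len);
	return (bp + len);
}

/* Serialize the record: header first, then the fields in spec order. */
size_t
demo_ovref_pack(const struct demo_ovref *r, uint8_t *buf)
{
	uint8_t *bp;

	bp = buf;
	bp = demo_put(bp, &r->type, sizeof(r->type));
	bp = demo_put(bp, &r->txnid, sizeof(r->txnid));
	bp = demo_put(bp, &r->prev_lsn, sizeof(r->prev_lsn));
	bp = demo_put(bp, &r->fileid, sizeof(r->fileid));
	bp = demo_put(bp, &r->pgno, sizeof(r->pgno));
	bp = demo_put(bp, &r->adjust, sizeof(r->adjust));
	bp = demo_put(bp, &r->lsn, sizeof(r->lsn));

	/* The bytes written must match the size computed up front. */
	assert((size_t)(bp - buf) == DEMO_OVREF_SIZE);
	return ((size_t)(bp - buf));
}

/* Deserialize by walking the buffer in exactly the same order. */
void
demo_ovref_unpack(const uint8_t *buf, struct demo_ovref *r)
{
	const uint8_t *bp;

	bp = buf;
	bp = demo_get(&r->type, bp, sizeof(r->type));
	bp = demo_get(&r->txnid, bp, sizeof(r->txnid));
	bp = demo_get(&r->prev_lsn, bp, sizeof(r->prev_lsn));
	bp = demo_get(&r->fileid, bp, sizeof(r->fileid));
	bp = demo_get(&r->pgno, bp, sizeof(r->pgno));
	bp = demo_get(&r->adjust, bp, sizeof(r->adjust));
	(void)demo_get(&r->lsn, bp, sizeof(r->lsn));
}
```

Copying field by field in host byte order keeps the reader and writer trivially symmetric, at the cost of a record that is only meaningful to a host with the same byte order and type sizes.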
diff --git a/bdb/db/db_am.c b/bdb/db/db_am.c
new file mode 100644
index 00000000000..2d224566904
--- /dev/null
+++ b/bdb/db/db_am.c
@@ -0,0 +1,511 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1998, 1999, 2000
+ *	Sleepycat Software.  All rights reserved.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_am.c,v 11.42 2001/01/11 18:19:50 bostic Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_shash.h"
+#include "btree.h"
+#include "hash.h"
+#include "qam.h"
+#include "lock.h"
+#include "mp.h"
+#include "txn.h"
+#include "db_am.h"
+#include "db_ext.h"
+
+/*
+ * __db_cursor --
+ *	Allocate and return a cursor.
+ *
+ * PUBLIC: int __db_cursor __P((DB *, DB_TXN *, DBC **, u_int32_t));
+ */
+int
+__db_cursor(dbp, txn, dbcp, flags)
+	DB *dbp;
+	DB_TXN *txn;
+	DBC **dbcp;
+	u_int32_t flags;
+{
+	DB_ENV *dbenv;
+	DBC *dbc;
+	db_lockmode_t mode;
+	u_int32_t op;
+	int ret;
+
+	dbenv = dbp->dbenv;
+
+	PANIC_CHECK(dbenv);
+	DB_ILLEGAL_BEFORE_OPEN(dbp, "DB->cursor");
+
+	/* Check for invalid flags. */
+	if ((ret = __db_cursorchk(dbp, flags, F_ISSET(dbp, DB_AM_RDONLY))) != 0)
+		return (ret);
+
+	if ((ret =
+	    __db_icursor(dbp, txn, dbp->type, PGNO_INVALID, 0, dbcp)) != 0)
+		return (ret);
+	dbc = *dbcp;
+
+	/*
+	 * If this is CDB, do all the locking in the interface, which is
+	 * right here.
+	 */
+	if (CDB_LOCKING(dbenv)) {
+		op = LF_ISSET(DB_OPFLAGS_MASK);
+		mode = (op == DB_WRITELOCK) ? DB_LOCK_WRITE :
+		    ((op == DB_WRITECURSOR) ? DB_LOCK_IWRITE : DB_LOCK_READ);
+		if ((ret = lock_get(dbenv, dbc->locker, 0,
+		    &dbc->lock_dbt, mode, &dbc->mylock)) != 0) {
+			(void)__db_c_close(dbc);
+			return (ret);
+		}
+		if (op == DB_WRITECURSOR)
+			F_SET(dbc, DBC_WRITECURSOR);
+		if (op == DB_WRITELOCK)
+			F_SET(dbc, DBC_WRITER);
+	}
+
+	return (0);
+}
+
+/*
+ * __db_icursor --
+ *	Internal version of __db_cursor.  If dbcp is
+ *	non-NULL it is assumed to point to an area to
+ *	initialize as a cursor.
+ *
+ * PUBLIC: int __db_icursor
+ * PUBLIC:     __P((DB *, DB_TXN *, DBTYPE, db_pgno_t, int, DBC **));
+ */
+int
+__db_icursor(dbp, txn, dbtype, root, is_opd, dbcp)
+	DB *dbp;
+	DB_TXN *txn;
+	DBTYPE dbtype;
+	db_pgno_t root;
+	int is_opd;
+	DBC **dbcp;
+{
+	DBC *dbc, *adbc;
+	DBC_INTERNAL *cp;
+	DB_ENV *dbenv;
+	int allocated, ret;
+
+	dbenv = dbp->dbenv;
+	allocated = 0;
+
+	/*
+	 * Take one from the free list if it's available.  Take only the
+	 * right type.  With off page dups we may have different kinds
+	 * of cursors on the queue for a single database.
+	 */
+	MUTEX_THREAD_LOCK(dbenv, dbp->mutexp);
+	for (dbc = TAILQ_FIRST(&dbp->free_queue);
+	    dbc != NULL; dbc = TAILQ_NEXT(dbc, links))
+		if (dbtype == dbc->dbtype) {
+			TAILQ_REMOVE(&dbp->free_queue, dbc, links);
+			dbc->flags = 0;
+			break;
+		}
+	MUTEX_THREAD_UNLOCK(dbenv, dbp->mutexp);
+
+	if (dbc == NULL) {
+		if ((ret = __os_calloc(dbp->dbenv, 1, sizeof(DBC), &dbc)) != 0)
+			return (ret);
+		allocated = 1;
+		dbc->flags = 0;
+
+		dbc->dbp = dbp;
+
+		/* Set up locking information. */
+		if (LOCKING_ON(dbenv)) {
+			/*
+			 * If we are not threaded, then there is no need to
+			 * create new locker ids.  We know that no one else
+			 * is running concurrently using this DB, so we can
+			 * take a peek at any cursors on the active queue.
+			 */
+			if (!DB_IS_THREADED(dbp) &&
+			    (adbc = TAILQ_FIRST(&dbp->active_queue)) != NULL)
+				dbc->lid = adbc->lid;
+			else
+				if ((ret = lock_id(dbenv, &dbc->lid)) != 0)
+					goto err;
+
+			memcpy(dbc->lock.fileid, dbp->fileid, DB_FILE_ID_LEN);
+			if (CDB_LOCKING(dbenv)) {
+				if (F_ISSET(dbenv, DB_ENV_CDB_ALLDB)) {
+					/*
+					 * If we are doing a single lock per
+					 * environment, set up the global
+					 * lock object just like we do to
+					 * single thread creates.
+					 */
+					DB_ASSERT(sizeof(db_pgno_t) ==
+					    sizeof(u_int32_t));
+					dbc->lock_dbt.size = sizeof(u_int32_t);
+					dbc->lock_dbt.data = &dbc->lock.pgno;
+					dbc->lock.pgno = 0;
+				} else {
+					dbc->lock_dbt.size = DB_FILE_ID_LEN;
+					dbc->lock_dbt.data = dbc->lock.fileid;
+				}
+			} else {
+				dbc->lock.type = DB_PAGE_LOCK;
+				dbc->lock_dbt.size = sizeof(dbc->lock);
+				dbc->lock_dbt.data = &dbc->lock;
+			}
+		}
+		/* Init the DBC internal structure. */
+		switch (dbtype) {
+		case DB_BTREE:
+		case DB_RECNO:
+			if ((ret = __bam_c_init(dbc, dbtype)) != 0)
+				goto err;
+			break;
+		case DB_HASH:
+			if ((ret = __ham_c_init(dbc)) != 0)
+				goto err;
+			break;
+		case DB_QUEUE:
+			if ((ret = __qam_c_init(dbc)) != 0)
+				goto err;
+			break;
+		default:
+			ret = __db_unknown_type(dbp->dbenv,
+			    "__db_icursor", dbtype);
+			goto err;
+		}
+
+		cp = dbc->internal;
+	}
+
+	/* Refresh the DBC structure. */
+	dbc->dbtype = dbtype;
+
+	if ((dbc->txn = txn) == NULL)
+		dbc->locker = dbc->lid;
+	else {
+		dbc->locker = txn->txnid;
+		txn->cursors++;
+	}
+
+	if (is_opd)
+		F_SET(dbc, DBC_OPD);
+	if (F_ISSET(dbp, DB_AM_RECOVER))
+		F_SET(dbc, DBC_RECOVER);
+
+	/* Refresh the DBC internal structure. */
+	cp = dbc->internal;
+	cp->opd = NULL;
+
+	cp->indx = 0;
+	cp->page = NULL;
+	cp->pgno = PGNO_INVALID;
+	cp->root = root;
+
+	switch (dbtype) {
+	case DB_BTREE:
+	case DB_RECNO:
+		if ((ret = __bam_c_refresh(dbc)) != 0)
+			goto err;
+		break;
+	case DB_HASH:
+	case DB_QUEUE:
+		break;
+	default:
+		ret = __db_unknown_type(dbp->dbenv, "__db_icursor", dbtype);
+		goto err;
+	}
+
+	MUTEX_THREAD_LOCK(dbenv, dbp->mutexp);
+	TAILQ_INSERT_TAIL(&dbp->active_queue, dbc, links);
+	F_SET(dbc, DBC_ACTIVE);
+	MUTEX_THREAD_UNLOCK(dbenv, dbp->mutexp);
+
+	*dbcp = dbc;
+	return (0);
+
+err:	if (allocated)
+		__os_free(dbc, sizeof(*dbc));
+	return (ret);
+}
+
+#ifdef DEBUG
+/*
+ * __db_cprint --
+ *	Display the current cursor list.
+ *
+ * PUBLIC: int __db_cprint __P((DB *));
+ */
+int
+__db_cprint(dbp)
+	DB *dbp;
+{
+	static const FN fn[] = {
+		{ DBC_ACTIVE,		"active" },
+		{ DBC_OPD,		"off-page-dup" },
+		{ DBC_RECOVER,		"recover" },
+		{ DBC_RMW,		"read-modify-write" },
+		{ DBC_WRITECURSOR,	"write cursor" },
+		{ DBC_WRITEDUP,		"internally dup'ed write cursor" },
+		{ DBC_WRITER,		"short-term write cursor" },
+		{ 0,			NULL }
+	};
+	DBC *dbc;
+	DBC_INTERNAL *cp;
+	char *s;
+
+	MUTEX_THREAD_LOCK(dbp->dbenv, dbp->mutexp);
+	for (dbc = TAILQ_FIRST(&dbp->active_queue);
+	    dbc != NULL; dbc = TAILQ_NEXT(dbc, links)) {
+		switch (dbc->dbtype) {
+		case DB_BTREE:
+			s = "btree";
+			break;
+		case DB_HASH:
+			s = "hash";
+			break;
+		case DB_RECNO:
+			s = "recno";
+			break;
+		case DB_QUEUE:
+			s = "queue";
+			break;
+		default:
+			DB_ASSERT(0);
+			return (1);
+		}
+		cp = dbc->internal;
+		fprintf(stderr, "%s/%#0lx: opd: %#0lx\n",
+		    s, P_TO_ULONG(dbc), P_TO_ULONG(cp->opd));
+		fprintf(stderr, "\ttxn: %#0lx lid: %lu locker: %lu\n",
+		    P_TO_ULONG(dbc->txn),
+		    (u_long)dbc->lid, (u_long)dbc->locker);
+		fprintf(stderr, "\troot: %lu page/index: %lu/%lu",
+		    (u_long)cp->root, (u_long)cp->pgno, (u_long)cp->indx);
+		__db_prflags(dbc->flags, fn, stderr);
+		fprintf(stderr, "\n");
+
+		if (dbp->type == DB_BTREE)
+			__bam_cprint(dbc);
+	}
+	for (dbc = TAILQ_FIRST(&dbp->free_queue);
+	    dbc != NULL; dbc = TAILQ_NEXT(dbc, links))
+		fprintf(stderr, "free: %#0lx ", P_TO_ULONG(dbc));
+	fprintf(stderr, "\n");
+	MUTEX_THREAD_UNLOCK(dbp->dbenv, dbp->mutexp);
+
+	return (0);
+}
+#endif /* DEBUG */
+
+/*
+ * __db_fd --
+ *	Return a file descriptor for flock'ing.
+ *
+ * PUBLIC: int __db_fd __P((DB *, int *));
+ */
+int
+__db_fd(dbp, fdp)
+	DB *dbp;
+	int *fdp;
+{
+	DB_FH *fhp;
+	int ret;
+
+	PANIC_CHECK(dbp->dbenv);
+	DB_ILLEGAL_BEFORE_OPEN(dbp, "DB->fd");
+
+	/*
+	 * XXX
+	 * Truly spectacular layering violation.
+	 */
+	if ((ret = __mp_xxx_fh(dbp->mpf, &fhp)) != 0)
+		return (ret);
+
+	if (F_ISSET(fhp, DB_FH_VALID)) {
+		*fdp = fhp->fd;
+		return (0);
+	} else {
+		*fdp = -1;
+		__db_err(dbp->dbenv, "DB does not have a valid file handle.");
+		return (ENOENT);
+	}
+}
+
+/*
+ * __db_get --
+ *	Return a key/data pair.
+ *
+ * PUBLIC: int __db_get __P((DB *, DB_TXN *, DBT *, DBT *, u_int32_t));
+ */
+int
+__db_get(dbp, txn, key, data, flags)
+	DB *dbp;
+	DB_TXN *txn;
+	DBT *key, *data;
+	u_int32_t flags;
+{
+	DBC *dbc;
+	int mode, ret, t_ret;
+
+	PANIC_CHECK(dbp->dbenv);
+	DB_ILLEGAL_BEFORE_OPEN(dbp, "DB->get");
+
+	if ((ret = __db_getchk(dbp, key, data, flags)) != 0)
+		return (ret);
+
+	mode = 0;
+	if (flags == DB_CONSUME || flags == DB_CONSUME_WAIT)
+		mode = DB_WRITELOCK;
+	if ((ret = dbp->cursor(dbp, txn, &dbc, mode)) != 0)
+		return (ret);
+
+	DEBUG_LREAD(dbc, txn, "__db_get", key, NULL, flags);
+
+	/*
+	 * The DBC_TRANSIENT flag indicates that we're just doing a
+	 * single operation with this cursor, and that in case of
+	 * error we don't need to restore it to its old position--we're
+	 * going to close it right away.  Thus, we can perform the get
+	 * without duplicating the cursor, saving some cycles in this
+	 * common case.
+	 */
+	F_SET(dbc, DBC_TRANSIENT);
+
+	ret = dbc->c_get(dbc, key, data,
+	    flags == 0 || flags == DB_RMW ? flags | DB_SET : flags);
+
+	if ((t_ret = __db_c_close(dbc)) != 0 && ret == 0)
+		ret = t_ret;
+
+	return (ret);
+}
+
+/*
+ * __db_put --
+ *	Store a key/data pair.
+ *
+ * PUBLIC: int __db_put __P((DB *, DB_TXN *, DBT *, DBT *, u_int32_t));
+ */
+int
+__db_put(dbp, txn, key, data, flags)
+	DB *dbp;
+	DB_TXN *txn;
+	DBT *key, *data;
+	u_int32_t flags;
+{
+	DBC *dbc;
+	DBT tdata;
+	int ret, t_ret;
+
+	PANIC_CHECK(dbp->dbenv);
+	DB_ILLEGAL_BEFORE_OPEN(dbp, "DB->put");
+
+	if ((ret = __db_putchk(dbp, key, data,
+	    flags, F_ISSET(dbp, DB_AM_RDONLY),
+	    F_ISSET(dbp, DB_AM_DUP) || F_ISSET(key, DB_DBT_DUPOK))) != 0)
+		return (ret);
+
+	DB_CHECK_TXN(dbp, txn);
+
+	if ((ret = dbp->cursor(dbp, txn, &dbc, DB_WRITELOCK)) != 0)
+		return (ret);
+
+	/*
+	 * See the comment in __db_get().
+	 *
+	 * Note that the c_get in the DB_NOOVERWRITE case is safe to
+	 * do with this flag set;  if it errors in any way other than
+	 * DB_NOTFOUND, we're going to close the cursor without doing
+	 * anything else, and if it returns DB_NOTFOUND then it's safe
+	 * to do a c_put(DB_KEYLAST) even if an access method moved the
+	 * cursor, since that's not position-dependent.
+	 */
+	F_SET(dbc, DBC_TRANSIENT);
+
+	DEBUG_LWRITE(dbc, txn, "__db_put", key, data, flags);
+
+	if (flags == DB_NOOVERWRITE) {
+		flags = 0;
+		/*
+		 * Set DB_DBT_USERMEM: this might be a threaded application,
+		 * and the flags checking would otherwise catch us.  We don't
+		 * want the actual data, so request a partial of length 0.
+		 */
+		memset(&tdata, 0, sizeof(tdata));
+		F_SET(&tdata, DB_DBT_USERMEM | DB_DBT_PARTIAL);
+
+		/*
+		 * If we're doing page-level locking, set the read-modify-write
+		 * flag, since we're going to overwrite immediately.
+		 */
+		if ((ret = dbc->c_get(dbc, key, &tdata,
+		    DB_SET | (STD_LOCKING(dbc) ? DB_RMW : 0))) == 0)
+			ret = DB_KEYEXIST;
+		else if (ret == DB_NOTFOUND)
+			ret = 0;
+	}
+	if (ret == 0)
+		ret = dbc->c_put(dbc,
+		     key, data, flags == 0 ? DB_KEYLAST : flags);
+
+	if ((t_ret = __db_c_close(dbc)) != 0 && ret == 0)
+		ret = t_ret;
+
+	return (ret);
+}
+
+/*
+ * __db_sync --
+ *	Flush the database cache.
+ *
+ * PUBLIC: int __db_sync __P((DB *, u_int32_t));
+ */
+int
+__db_sync(dbp, flags)
+	DB *dbp;
+	u_int32_t flags;
+{
+	int ret, t_ret;
+
+	PANIC_CHECK(dbp->dbenv);
+	DB_ILLEGAL_BEFORE_OPEN(dbp, "DB->sync");
+
+	if ((ret = __db_syncchk(dbp, flags)) != 0)
+		return (ret);
+
+	/* Read-only trees never need to be sync'd. */
+	if (F_ISSET(dbp, DB_AM_RDONLY))
+		return (0);
+
+	/* If it's a Recno tree, write the backing source text file. */
+	if (dbp->type == DB_RECNO)
+		ret = __ram_writeback(dbp);
+
+	/* If the tree was never backed by a database file, we're done. */
+	if (F_ISSET(dbp, DB_AM_INMEM))
+		return (ret);
+
+	/* Flush any dirty pages from the cache to the backing file. */
+	if ((t_ret = memp_fsync(dbp->mpf)) != 0 && ret == 0)
+		ret = t_ret;
+	return (ret);
+}
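
Note on the DB_NOOVERWRITE path in __db_put above: the put is preceded by a probe for the key. A hit is turned into an error, a miss is turned into success, and only then is the store performed. Below is a minimal sketch of that control flow over a toy in-memory table. Every `demo_*` and `DEMO_*` name is hypothetical; the real code goes through a cursor and takes locks, which this sketch does not attempt.

```c
#include <string.h>

#define	DEMO_NOTFOUND		(-1)	/* hypothetical error codes */
#define	DEMO_KEYEXIST		(-2)
#define	DEMO_NOSPACE		(-3)
#define	DEMO_NOOVERWRITE	1	/* hypothetical flag value */
#define	DEMO_SLOTS		8

/* Toy store: parallel arrays of borrowed key and value strings. */
struct demo_store {
	const char *keys[DEMO_SLOTS];
	const char *vals[DEMO_SLOTS];
	int n;
};

/* Return the slot holding key, or DEMO_NOTFOUND. */
static int
demo_find(const struct demo_store *s, const char *key)
{
	int i;

	for (i = 0; i < s->n; i++)
		if (strcmp(s->keys[i], key) == 0)
			return (i);
	return (DEMO_NOTFOUND);
}

/*
 * demo_store_put --
 *	Store a key/value pair.  With DEMO_NOOVERWRITE an existing key
 *	is an error; with no flags an existing value is replaced.
 */
int
demo_store_put(struct demo_store *s,
    const char *key, const char *val, int flags)
{
	int i, ret;

	ret = 0;
	i = demo_find(s, key);

	/* Probe first: a hit is an error, a miss clears the way. */
	if (flags == DEMO_NOOVERWRITE && i != DEMO_NOTFOUND)
		ret = DEMO_KEYEXIST;
	if (ret != 0)
		return (ret);

	if (i == DEMO_NOTFOUND) {
		if (s->n == DEMO_SLOTS)
			return (DEMO_NOSPACE);
		i = s->n++;
		s->keys[i] = key;
	}
	s->vals[i] = val;
	return (0);
}
```

The sketch has no concurrency, so the probe and the store cannot be separated by another writer. The read-modify-write flag on the probe in the real routine exists to cover that gap under page-level locking.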
diff --git a/bdb/db/db_auto.c b/bdb/db/db_auto.c
new file mode 100644
index 00000000000..23540adc2e6
--- /dev/null
+++ b/bdb/db/db_auto.c
@@ -0,0 +1,1270 @@
+/* Do not edit: automatically built by gen_rec.awk. */
+#include "db_config.h"
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <ctype.h>
+#include <errno.h>
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_dispatch.h"
+#include "db_am.h"
+#include "txn.h"
+
+int
+__db_addrem_log(dbenv, txnid, ret_lsnp, flags,
+	opcode, fileid, pgno, indx, nbytes, hdr,
+	dbt, pagelsn)
+	DB_ENV *dbenv;
+	DB_TXN *txnid;
+	DB_LSN *ret_lsnp;
+	u_int32_t flags;
+	u_int32_t opcode;
+	int32_t fileid;
+	db_pgno_t pgno;
+	u_int32_t indx;
+	size_t nbytes;
+	const DBT *hdr;
+	const DBT *dbt;
+	DB_LSN * pagelsn;
+{
+	DBT logrec;
+	DB_LSN *lsnp, null_lsn;
+	u_int32_t zero;
+	u_int32_t rectype, txn_num;
+	int ret;
+	u_int8_t *bp;
+
+	rectype = DB_db_addrem;
+	if (txnid != NULL &&
+	    TAILQ_FIRST(&txnid->kids) != NULL &&
+	    (ret = __txn_activekids(dbenv, rectype, txnid)) != 0)
+		return (ret);
+	txn_num = txnid == NULL ? 0 : txnid->txnid;
+	if (txnid == NULL) {
+		ZERO_LSN(null_lsn);
+		lsnp = &null_lsn;
+	} else
+		lsnp = &txnid->last_lsn;
+	logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN)
+	    + sizeof(opcode)
+	    + sizeof(fileid)
+	    + sizeof(pgno)
+	    + sizeof(indx)
+	    + sizeof(nbytes)
+	    + sizeof(u_int32_t) + (hdr == NULL ? 0 : hdr->size)
+	    + sizeof(u_int32_t) + (dbt == NULL ? 0 : dbt->size)
+	    + sizeof(*pagelsn);
+	if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0)
+		return (ret);
+
+	bp = logrec.data;
+	memcpy(bp, &rectype, sizeof(rectype));
+	bp += sizeof(rectype);
+	memcpy(bp, &txn_num, sizeof(txn_num));
+	bp += sizeof(txn_num);
+	memcpy(bp, lsnp, sizeof(DB_LSN));
+	bp += sizeof(DB_LSN);
+	memcpy(bp, &opcode, sizeof(opcode));
+	bp += sizeof(opcode);
+	memcpy(bp, &fileid, sizeof(fileid));
+	bp += sizeof(fileid);
+	memcpy(bp, &pgno, sizeof(pgno));
+	bp += sizeof(pgno);
+	memcpy(bp, &indx, sizeof(indx));
+	bp += sizeof(indx);
+	memcpy(bp, &nbytes, sizeof(nbytes));
+	bp += sizeof(nbytes);
+	if (hdr == NULL) {
+		zero = 0;
+		memcpy(bp, &zero, sizeof(u_int32_t));
+		bp += sizeof(u_int32_t);
+	} else {
+		memcpy(bp, &hdr->size, sizeof(hdr->size));
+		bp += sizeof(hdr->size);
+		memcpy(bp, hdr->data, hdr->size);
+		bp += hdr->size;
+	}
+	if (dbt == NULL) {
+		zero = 0;
+		memcpy(bp, &zero, sizeof(u_int32_t));
+		bp += sizeof(u_int32_t);
+	} else {
+		memcpy(bp, &dbt->size, sizeof(dbt->size));
+		bp += sizeof(dbt->size);
+		memcpy(bp, dbt->data, dbt->size);
+		bp += dbt->size;
+	}
+	if (pagelsn != NULL)
+		memcpy(bp, pagelsn, sizeof(*pagelsn));
+	else
+		memset(bp, 0, sizeof(*pagelsn));
+	bp += sizeof(*pagelsn);
+	DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size);
+	ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags);
+	if (txnid != NULL)
+		txnid->last_lsn = *ret_lsnp;
+	__os_free(logrec.data, logrec.size);
+	return (ret);
+}
+
+int
+__db_addrem_print(dbenv, dbtp, lsnp, notused2, notused3)
+	DB_ENV *dbenv;
+	DBT *dbtp;
+	DB_LSN *lsnp;
+	db_recops notused2;
+	void *notused3;
+{
+	__db_addrem_args *argp;
+	u_int32_t i;
+	u_int ch;
+	int ret;
+
+	i = 0;
+	ch = 0;
+	notused2 = DB_TXN_ABORT;
+	notused3 = NULL;
+
+	if ((ret = __db_addrem_read(dbenv, dbtp->data, &argp)) != 0)
+		return (ret);
+	printf("[%lu][%lu]db_addrem: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+	    (u_long)lsnp->file,
+	    (u_long)lsnp->offset,
+	    (u_long)argp->type,
+	    (u_long)argp->txnid->txnid,
+	    (u_long)argp->prev_lsn.file,
+	    (u_long)argp->prev_lsn.offset);
+	printf("\topcode: %lu\n", (u_long)argp->opcode);
+	printf("\tfileid: %ld\n", (long)argp->fileid);
+	printf("\tpgno: %lu\n", (u_long)argp->pgno);
+	printf("\tindx: %lu\n", (u_long)argp->indx);
+	printf("\tnbytes: %lu\n", (u_long)argp->nbytes);
+	printf("\thdr: ");
+	for (i = 0; i < argp->hdr.size; i++) {
+		ch = ((u_int8_t *)argp->hdr.data)[i];
+		if (isprint(ch) || ch == 0xa)
+			putchar(ch);
+		else
+			printf("%#x ", ch);
+	}
+	printf("\n");
+	printf("\tdbt: ");
+	for (i = 0; i < argp->dbt.size; i++) {
+		ch = ((u_int8_t *)argp->dbt.data)[i];
+		if (isprint(ch) || ch == 0xa)
+			putchar(ch);
+		else
+			printf("%#x ", ch);
+	}
+	printf("\n");
+	printf("\tpagelsn: [%lu][%lu]\n",
+	    (u_long)argp->pagelsn.file, (u_long)argp->pagelsn.offset);
+	printf("\n");
+	__os_free(argp, 0);
+	return (0);
+}
+
+int
+__db_addrem_read(dbenv, recbuf, argpp)
+	DB_ENV *dbenv;
+	void *recbuf;
+	__db_addrem_args **argpp;
+{
+	__db_addrem_args *argp;
+	u_int8_t *bp;
+	int ret;
+
+	ret = __os_malloc(dbenv, sizeof(__db_addrem_args) +
+	    sizeof(DB_TXN), NULL, &argp);
+	if (ret != 0)
+		return (ret);
+	argp->txnid = (DB_TXN *)&argp[1];
+	bp = recbuf;
+	memcpy(&argp->type, bp, sizeof(argp->type));
+	bp += sizeof(argp->type);
+	memcpy(&argp->txnid->txnid,  bp, sizeof(argp->txnid->txnid));
+	bp += sizeof(argp->txnid->txnid);
+	memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+	bp += sizeof(DB_LSN);
+	memcpy(&argp->opcode, bp, sizeof(argp->opcode));
+	bp += sizeof(argp->opcode);
+	memcpy(&argp->fileid, bp, sizeof(argp->fileid));
+	bp += sizeof(argp->fileid);
+	memcpy(&argp->pgno, bp, sizeof(argp->pgno));
+	bp += sizeof(argp->pgno);
+	memcpy(&argp->indx, bp, sizeof(argp->indx));
+	bp += sizeof(argp->indx);
+	memcpy(&argp->nbytes, bp, sizeof(argp->nbytes));
+	bp += sizeof(argp->nbytes);
+	memset(&argp->hdr, 0, sizeof(argp->hdr));
+	memcpy(&argp->hdr.size, bp, sizeof(u_int32_t));
+	bp += sizeof(u_int32_t);
+	argp->hdr.data = bp;
+	bp += argp->hdr.size;
+	memset(&argp->dbt, 0, sizeof(argp->dbt));
+	memcpy(&argp->dbt.size, bp, sizeof(u_int32_t));
+	bp += sizeof(u_int32_t);
+	argp->dbt.data = bp;
+	bp += argp->dbt.size;
+	memcpy(&argp->pagelsn, bp,  sizeof(argp->pagelsn));
+	bp += sizeof(argp->pagelsn);
+	*argpp = argp;
+	return (0);
+}
+
+int
+__db_split_print(dbenv, dbtp, lsnp, notused2, notused3)
+	DB_ENV *dbenv;
+	DBT *dbtp;
+	DB_LSN *lsnp;
+	db_recops notused2;
+	void *notused3;
+{
+	__db_split_args *argp;
+	u_int32_t i;
+	u_int ch;
+	int ret;
+
+	i = 0;
+	ch = 0;
+	notused2 = DB_TXN_ABORT;
+	notused3 = NULL;
+
+	if ((ret = __db_split_read(dbenv, dbtp->data, &argp)) != 0)
+		return (ret);
+	printf("[%lu][%lu]db_split: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+	    (u_long)lsnp->file,
+	    (u_long)lsnp->offset,
+	    (u_long)argp->type,
+	    (u_long)argp->txnid->txnid,
+	    (u_long)argp->prev_lsn.file,
+	    (u_long)argp->prev_lsn.offset);
+	printf("\topcode: %lu\n", (u_long)argp->opcode);
+	printf("\tfileid: %ld\n", (long)argp->fileid);
+	printf("\tpgno: %lu\n", (u_long)argp->pgno);
+	printf("\tpageimage: ");
+	for (i = 0; i < argp->pageimage.size; i++) {
+		ch = ((u_int8_t *)argp->pageimage.data)[i];
+		if (isprint(ch) || ch == 0xa)
+			putchar(ch);
+		else
+			printf("%#x ", ch);
+	}
+	printf("\n");
+	printf("\tpagelsn: [%lu][%lu]\n",
+	    (u_long)argp->pagelsn.file, (u_long)argp->pagelsn.offset);
+	printf("\n");
+	__os_free(argp, 0);
+	return (0);
+}
+
+int
+__db_split_read(dbenv, recbuf, argpp)
+	DB_ENV *dbenv;
+	void *recbuf;
+	__db_split_args **argpp;
+{
+	__db_split_args *argp;
+	u_int8_t *bp;
+	int ret;
+
+	ret = __os_malloc(dbenv, sizeof(__db_split_args) +
+	    sizeof(DB_TXN), NULL, &argp);
+	if (ret != 0)
+		return (ret);
+	argp->txnid = (DB_TXN *)&argp[1];
+	bp = recbuf;
+	memcpy(&argp->type, bp, sizeof(argp->type));
+	bp += sizeof(argp->type);
+	memcpy(&argp->txnid->txnid,  bp, sizeof(argp->txnid->txnid));
+	bp += sizeof(argp->txnid->txnid);
+	memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+	bp += sizeof(DB_LSN);
+	memcpy(&argp->opcode, bp, sizeof(argp->opcode));
+	bp += sizeof(argp->opcode);
+	memcpy(&argp->fileid, bp, sizeof(argp->fileid));
+	bp += sizeof(argp->fileid);
+	memcpy(&argp->pgno, bp, sizeof(argp->pgno));
+	bp += sizeof(argp->pgno);
+	memset(&argp->pageimage, 0, sizeof(argp->pageimage));
+	memcpy(&argp->pageimage.size, bp, sizeof(u_int32_t));
+	bp += sizeof(u_int32_t);
+	argp->pageimage.data = bp;
+	bp += argp->pageimage.size;
+	memcpy(&argp->pagelsn, bp,  sizeof(argp->pagelsn));
+	bp += sizeof(argp->pagelsn);
+	*argpp = argp;
+	return (0);
+}
+
+int
+__db_big_log(dbenv, txnid, ret_lsnp, flags,
+	opcode, fileid, pgno, prev_pgno, next_pgno, dbt,
+	pagelsn, prevlsn, nextlsn)
+	DB_ENV *dbenv;
+	DB_TXN *txnid;
+	DB_LSN *ret_lsnp;
+	u_int32_t flags;
+	u_int32_t opcode;
+	int32_t fileid;
+	db_pgno_t pgno;
+	db_pgno_t prev_pgno;
+	db_pgno_t next_pgno;
+	const DBT *dbt;
+	DB_LSN * pagelsn;
+	DB_LSN * prevlsn;
+	DB_LSN * nextlsn;
+{
+	DBT logrec;
+	DB_LSN *lsnp, null_lsn;
+	u_int32_t zero;
+	u_int32_t rectype, txn_num;
+	int ret;
+	u_int8_t *bp;
+
+	rectype = DB_db_big;
+	if (txnid != NULL &&
+	    TAILQ_FIRST(&txnid->kids) != NULL &&
+	    (ret = __txn_activekids(dbenv, rectype, txnid)) != 0)
+		return (ret);
+	txn_num = txnid == NULL ? 0 : txnid->txnid;
+	if (txnid == NULL) {
+		ZERO_LSN(null_lsn);
+		lsnp = &null_lsn;
+	} else
+		lsnp = &txnid->last_lsn;
+	logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN)
+	    + sizeof(opcode)
+	    + sizeof(fileid)
+	    + sizeof(pgno)
+	    + sizeof(prev_pgno)
+	    + sizeof(next_pgno)
+	    + sizeof(u_int32_t) + (dbt == NULL ? 0 : dbt->size)
+	    + sizeof(*pagelsn)
+	    + sizeof(*prevlsn)
+	    + sizeof(*nextlsn);
+	if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0)
+		return (ret);
+
+	bp = logrec.data;
+	memcpy(bp, &rectype, sizeof(rectype));
+	bp += sizeof(rectype);
+	memcpy(bp, &txn_num, sizeof(txn_num));
+	bp += sizeof(txn_num);
+	memcpy(bp, lsnp, sizeof(DB_LSN));
+	bp += sizeof(DB_LSN);
+	memcpy(bp, &opcode, sizeof(opcode));
+	bp += sizeof(opcode);
+	memcpy(bp, &fileid, sizeof(fileid));
+	bp += sizeof(fileid);
+	memcpy(bp, &pgno, sizeof(pgno));
+	bp += sizeof(pgno);
+	memcpy(bp, &prev_pgno, sizeof(prev_pgno));
+	bp += sizeof(prev_pgno);
+	memcpy(bp, &next_pgno, sizeof(next_pgno));
+	bp += sizeof(next_pgno);
+	if (dbt == NULL) {
+		zero = 0;
+		memcpy(bp, &zero, sizeof(u_int32_t));
+		bp += sizeof(u_int32_t);
+	} else {
+		memcpy(bp, &dbt->size, sizeof(dbt->size));
+		bp += sizeof(dbt->size);
+		memcpy(bp, dbt->data, dbt->size);
+		bp += dbt->size;
+	}
+	if (pagelsn != NULL)
+		memcpy(bp, pagelsn, sizeof(*pagelsn));
+	else
+		memset(bp, 0, sizeof(*pagelsn));
+	bp += sizeof(*pagelsn);
+	if (prevlsn != NULL)
+		memcpy(bp, prevlsn, sizeof(*prevlsn));
+	else
+		memset(bp, 0, sizeof(*prevlsn));
+	bp += sizeof(*prevlsn);
+	if (nextlsn != NULL)
+		memcpy(bp, nextlsn, sizeof(*nextlsn));
+	else
+		memset(bp, 0, sizeof(*nextlsn));
+	bp += sizeof(*nextlsn);
+	DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size);
+	ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags);
+	if (txnid != NULL)
+		txnid->last_lsn = *ret_lsnp;
+	__os_free(logrec.data, logrec.size);
+	return (ret);
+}
+
+int
+__db_big_print(dbenv, dbtp, lsnp, notused2, notused3)
+	DB_ENV *dbenv;
+	DBT *dbtp;
+	DB_LSN *lsnp;
+	db_recops notused2;
+	void *notused3;
+{
+	__db_big_args *argp;
+	u_int32_t i;
+	u_int ch;
+	int ret;
+
+	i = 0;
+	ch = 0;
+	notused2 = DB_TXN_ABORT;
+	notused3 = NULL;
+
+	if ((ret = __db_big_read(dbenv, dbtp->data, &argp)) != 0)
+		return (ret);
+	printf("[%lu][%lu]db_big: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+	    (u_long)lsnp->file,
+	    (u_long)lsnp->offset,
+	    (u_long)argp->type,
+	    (u_long)argp->txnid->txnid,
+	    (u_long)argp->prev_lsn.file,
+	    (u_long)argp->prev_lsn.offset);
+	printf("\topcode: %lu\n", (u_long)argp->opcode);
+	printf("\tfileid: %ld\n", (long)argp->fileid);
+	printf("\tpgno: %lu\n", (u_long)argp->pgno);
+	printf("\tprev_pgno: %lu\n", (u_long)argp->prev_pgno);
+	printf("\tnext_pgno: %lu\n", (u_long)argp->next_pgno);
+	printf("\tdbt: ");
+	for (i = 0; i < argp->dbt.size; i++) {
+		ch = ((u_int8_t *)argp->dbt.data)[i];
+		if (isprint(ch) || ch == 0xa)
+			putchar(ch);
+		else
+			printf("%#x ", ch);
+	}
+	printf("\n");
+	printf("\tpagelsn: [%lu][%lu]\n",
+	    (u_long)argp->pagelsn.file, (u_long)argp->pagelsn.offset);
+	printf("\tprevlsn: [%lu][%lu]\n",
+	    (u_long)argp->prevlsn.file, (u_long)argp->prevlsn.offset);
+	printf("\tnextlsn: [%lu][%lu]\n",
+	    (u_long)argp->nextlsn.file, (u_long)argp->nextlsn.offset);
+	printf("\n");
+	__os_free(argp, 0);
+	return (0);
+}
+
+int
+__db_big_read(dbenv, recbuf, argpp)
+	DB_ENV *dbenv;
+	void *recbuf;
+	__db_big_args **argpp;
+{
+	__db_big_args *argp;
+	u_int8_t *bp;
+	int ret;
+
+	ret = __os_malloc(dbenv, sizeof(__db_big_args) +
+	    sizeof(DB_TXN), NULL, &argp);
+	if (ret != 0)
+		return (ret);
+	argp->txnid = (DB_TXN *)&argp[1];
+	bp = recbuf;
+	memcpy(&argp->type, bp, sizeof(argp->type));
+	bp += sizeof(argp->type);
+	memcpy(&argp->txnid->txnid,  bp, sizeof(argp->txnid->txnid));
+	bp += sizeof(argp->txnid->txnid);
+	memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+	bp += sizeof(DB_LSN);
+	memcpy(&argp->opcode, bp, sizeof(argp->opcode));
+	bp += sizeof(argp->opcode);
+	memcpy(&argp->fileid, bp, sizeof(argp->fileid));
+	bp += sizeof(argp->fileid);
+	memcpy(&argp->pgno, bp, sizeof(argp->pgno));
+	bp += sizeof(argp->pgno);
+	memcpy(&argp->prev_pgno, bp, sizeof(argp->prev_pgno));
+	bp += sizeof(argp->prev_pgno);
+	memcpy(&argp->next_pgno, bp, sizeof(argp->next_pgno));
+	bp += sizeof(argp->next_pgno);
+	memset(&argp->dbt, 0, sizeof(argp->dbt));
+	memcpy(&argp->dbt.size, bp, sizeof(u_int32_t));
+	bp += sizeof(u_int32_t);
+	argp->dbt.data = bp;
+	bp += argp->dbt.size;
+	memcpy(&argp->pagelsn, bp,  sizeof(argp->pagelsn));
+	bp += sizeof(argp->pagelsn);
+	memcpy(&argp->prevlsn, bp,  sizeof(argp->prevlsn));
+	bp += sizeof(argp->prevlsn);
+	memcpy(&argp->nextlsn, bp,  sizeof(argp->nextlsn));
+	bp += sizeof(argp->nextlsn);
+	*argpp = argp;
+	return (0);
+}
+
+int
+__db_ovref_log(dbenv, txnid, ret_lsnp, flags,
+	fileid, pgno, adjust, lsn)
+	DB_ENV *dbenv;
+	DB_TXN *txnid;
+	DB_LSN *ret_lsnp;
+	u_int32_t flags;
+	int32_t fileid;
+	db_pgno_t pgno;
+	int32_t adjust;
+	DB_LSN * lsn;
+{
+	DBT logrec;
+	DB_LSN *lsnp, null_lsn;
+	u_int32_t rectype, txn_num;
+	int ret;
+	u_int8_t *bp;
+
+	rectype = DB_db_ovref;
+	if (txnid != NULL &&
+	    TAILQ_FIRST(&txnid->kids) != NULL &&
+	    (ret = __txn_activekids(dbenv, rectype, txnid)) != 0)
+		return (ret);
+	txn_num = txnid == NULL ? 0 : txnid->txnid;
+	if (txnid == NULL) {
+		ZERO_LSN(null_lsn);
+		lsnp = &null_lsn;
+	} else
+		lsnp = &txnid->last_lsn;
+	logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN)
+	    + sizeof(fileid)
+	    + sizeof(pgno)
+	    + sizeof(adjust)
+	    + sizeof(*lsn);
+	if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0)
+		return (ret);
+
+	bp = logrec.data;
+	memcpy(bp, &rectype, sizeof(rectype));
+	bp += sizeof(rectype);
+	memcpy(bp, &txn_num, sizeof(txn_num));
+	bp += sizeof(txn_num);
+	memcpy(bp, lsnp, sizeof(DB_LSN));
+	bp += sizeof(DB_LSN);
+	memcpy(bp, &fileid, sizeof(fileid));
+	bp += sizeof(fileid);
+	memcpy(bp, &pgno, sizeof(pgno));
+	bp += sizeof(pgno);
+	memcpy(bp, &adjust, sizeof(adjust));
+	bp += sizeof(adjust);
+	if (lsn != NULL)
+		memcpy(bp, lsn, sizeof(*lsn));
+	else
+		memset(bp, 0, sizeof(*lsn));
+	bp += sizeof(*lsn);
+	DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size);
+	ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags);
+	if (txnid != NULL)
+		txnid->last_lsn = *ret_lsnp;
+	__os_free(logrec.data, logrec.size);
+	return (ret);
+}
+
+int
+__db_ovref_print(dbenv, dbtp, lsnp, notused2, notused3)
+	DB_ENV *dbenv;
+	DBT *dbtp;
+	DB_LSN *lsnp;
+	db_recops notused2;
+	void *notused3;
+{
+	__db_ovref_args *argp;
+	u_int32_t i;
+	u_int ch;
+	int ret;
+
+	i = 0;
+	ch = 0;
+	notused2 = DB_TXN_ABORT;
+	notused3 = NULL;
+
+	if ((ret = __db_ovref_read(dbenv, dbtp->data, &argp)) != 0)
+		return (ret);
+	printf("[%lu][%lu]db_ovref: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+	    (u_long)lsnp->file,
+	    (u_long)lsnp->offset,
+	    (u_long)argp->type,
+	    (u_long)argp->txnid->txnid,
+	    (u_long)argp->prev_lsn.file,
+	    (u_long)argp->prev_lsn.offset);
+	printf("\tfileid: %ld\n", (long)argp->fileid);
+	printf("\tpgno: %lu\n", (u_long)argp->pgno);
+	printf("\tadjust: %ld\n", (long)argp->adjust);
+	printf("\tlsn: [%lu][%lu]\n",
+	    (u_long)argp->lsn.file, (u_long)argp->lsn.offset);
+	printf("\n");
+	__os_free(argp, 0);
+	return (0);
+}
+
+int
+__db_ovref_read(dbenv, recbuf, argpp)
+	DB_ENV *dbenv;
+	void *recbuf;
+	__db_ovref_args **argpp;
+{
+	__db_ovref_args *argp;
+	u_int8_t *bp;
+	int ret;
+
+	ret = __os_malloc(dbenv, sizeof(__db_ovref_args) +
+	    sizeof(DB_TXN), NULL, &argp);
+	if (ret != 0)
+		return (ret);
+	argp->txnid = (DB_TXN *)&argp[1];
+	bp = recbuf;
+	memcpy(&argp->type, bp, sizeof(argp->type));
+	bp += sizeof(argp->type);
+	memcpy(&argp->txnid->txnid,  bp, sizeof(argp->txnid->txnid));
+	bp += sizeof(argp->txnid->txnid);
+	memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+	bp += sizeof(DB_LSN);
+	memcpy(&argp->fileid, bp, sizeof(argp->fileid));
+	bp += sizeof(argp->fileid);
+	memcpy(&argp->pgno, bp, sizeof(argp->pgno));
+	bp += sizeof(argp->pgno);
+	memcpy(&argp->adjust, bp, sizeof(argp->adjust));
+	bp += sizeof(argp->adjust);
+	memcpy(&argp->lsn, bp,  sizeof(argp->lsn));
+	bp += sizeof(argp->lsn);
+	*argpp = argp;
+	return (0);
+}
+
+int
+__db_relink_log(dbenv, txnid, ret_lsnp, flags,
+	opcode, fileid, pgno, lsn, prev, lsn_prev,
+	next, lsn_next)
+	DB_ENV *dbenv;
+	DB_TXN *txnid;
+	DB_LSN *ret_lsnp;
+	u_int32_t flags;
+	u_int32_t opcode;
+	int32_t fileid;
+	db_pgno_t pgno;
+	DB_LSN * lsn;
+	db_pgno_t prev;
+	DB_LSN * lsn_prev;
+	db_pgno_t next;
+	DB_LSN * lsn_next;
+{
+	DBT logrec;
+	DB_LSN *lsnp, null_lsn;
+	u_int32_t rectype, txn_num;
+	int ret;
+	u_int8_t *bp;
+
+	rectype = DB_db_relink;
+	if (txnid != NULL &&
+	    TAILQ_FIRST(&txnid->kids) != NULL &&
+	    (ret = __txn_activekids(dbenv, rectype, txnid)) != 0)
+		return (ret);
+	txn_num = txnid == NULL ? 0 : txnid->txnid;
+	if (txnid == NULL) {
+		ZERO_LSN(null_lsn);
+		lsnp = &null_lsn;
+	} else
+		lsnp = &txnid->last_lsn;
+	logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN)
+	    + sizeof(opcode)
+	    + sizeof(fileid)
+	    + sizeof(pgno)
+	    + sizeof(*lsn)
+	    + sizeof(prev)
+	    + sizeof(*lsn_prev)
+	    + sizeof(next)
+	    + sizeof(*lsn_next);
+	if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0)
+		return (ret);
+
+	bp = logrec.data;
+	memcpy(bp, &rectype, sizeof(rectype));
+	bp += sizeof(rectype);
+	memcpy(bp, &txn_num, sizeof(txn_num));
+	bp += sizeof(txn_num);
+	memcpy(bp, lsnp, sizeof(DB_LSN));
+	bp += sizeof(DB_LSN);
+	memcpy(bp, &opcode, sizeof(opcode));
+	bp += sizeof(opcode);
+	memcpy(bp, &fileid, sizeof(fileid));
+	bp += sizeof(fileid);
+	memcpy(bp, &pgno, sizeof(pgno));
+	bp += sizeof(pgno);
+	if (lsn != NULL)
+		memcpy(bp, lsn, sizeof(*lsn));
+	else
+		memset(bp, 0, sizeof(*lsn));
+	bp += sizeof(*lsn);
+	memcpy(bp, &prev, sizeof(prev));
+	bp += sizeof(prev);
+	if (lsn_prev != NULL)
+		memcpy(bp, lsn_prev, sizeof(*lsn_prev));
+	else
+		memset(bp, 0, sizeof(*lsn_prev));
+	bp += sizeof(*lsn_prev);
+	memcpy(bp, &next, sizeof(next));
+	bp += sizeof(next);
+	if (lsn_next != NULL)
+		memcpy(bp, lsn_next, sizeof(*lsn_next));
+	else
+		memset(bp, 0, sizeof(*lsn_next));
+	bp += sizeof(*lsn_next);
+	DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size);
+	ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags);
+	if (txnid != NULL)
+		txnid->last_lsn = *ret_lsnp;
+	__os_free(logrec.data, logrec.size);
+	return (ret);
+}
+
+int
+__db_relink_print(dbenv, dbtp, lsnp, notused2, notused3)
+	DB_ENV *dbenv;
+	DBT *dbtp;
+	DB_LSN *lsnp;
+	db_recops notused2;
+	void *notused3;
+{
+	__db_relink_args *argp;
+	u_int32_t i;
+	u_int ch;
+	int ret;
+
+	i = 0;
+	ch = 0;
+	notused2 = DB_TXN_ABORT;
+	notused3 = NULL;
+
+	if ((ret = __db_relink_read(dbenv, dbtp->data, &argp)) != 0)
+		return (ret);
+	printf("[%lu][%lu]db_relink: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+	    (u_long)lsnp->file,
+	    (u_long)lsnp->offset,
+	    (u_long)argp->type,
+	    (u_long)argp->txnid->txnid,
+	    (u_long)argp->prev_lsn.file,
+	    (u_long)argp->prev_lsn.offset);
+	printf("\topcode: %lu\n", (u_long)argp->opcode);
+	printf("\tfileid: %ld\n", (long)argp->fileid);
+	printf("\tpgno: %lu\n", (u_long)argp->pgno);
+	printf("\tlsn: [%lu][%lu]\n",
+	    (u_long)argp->lsn.file, (u_long)argp->lsn.offset);
+	printf("\tprev: %lu\n", (u_long)argp->prev);
+	printf("\tlsn_prev: [%lu][%lu]\n",
+	    (u_long)argp->lsn_prev.file, (u_long)argp->lsn_prev.offset);
+	printf("\tnext: %lu\n", (u_long)argp->next);
+	printf("\tlsn_next: [%lu][%lu]\n",
+	    (u_long)argp->lsn_next.file, (u_long)argp->lsn_next.offset);
+	printf("\n");
+	__os_free(argp, 0);
+	return (0);
+}
+
+int
+__db_relink_read(dbenv, recbuf, argpp)
+	DB_ENV *dbenv;
+	void *recbuf;
+	__db_relink_args **argpp;
+{
+	__db_relink_args *argp;
+	u_int8_t *bp;
+	int ret;
+
+	ret = __os_malloc(dbenv, sizeof(__db_relink_args) +
+	    sizeof(DB_TXN), NULL, &argp);
+	if (ret != 0)
+		return (ret);
+	argp->txnid = (DB_TXN *)&argp[1];
+	bp = recbuf;
+	memcpy(&argp->type, bp, sizeof(argp->type));
+	bp += sizeof(argp->type);
+	memcpy(&argp->txnid->txnid,  bp, sizeof(argp->txnid->txnid));
+	bp += sizeof(argp->txnid->txnid);
+	memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+	bp += sizeof(DB_LSN);
+	memcpy(&argp->opcode, bp, sizeof(argp->opcode));
+	bp += sizeof(argp->opcode);
+	memcpy(&argp->fileid, bp, sizeof(argp->fileid));
+	bp += sizeof(argp->fileid);
+	memcpy(&argp->pgno, bp, sizeof(argp->pgno));
+	bp += sizeof(argp->pgno);
+	memcpy(&argp->lsn, bp,  sizeof(argp->lsn));
+	bp += sizeof(argp->lsn);
+	memcpy(&argp->prev, bp, sizeof(argp->prev));
+	bp += sizeof(argp->prev);
+	memcpy(&argp->lsn_prev, bp,  sizeof(argp->lsn_prev));
+	bp += sizeof(argp->lsn_prev);
+	memcpy(&argp->next, bp, sizeof(argp->next));
+	bp += sizeof(argp->next);
+	memcpy(&argp->lsn_next, bp,  sizeof(argp->lsn_next));
+	bp += sizeof(argp->lsn_next);
+	*argpp = argp;
+	return (0);
+}
+
+int
+__db_addpage_print(dbenv, dbtp, lsnp, notused2, notused3)
+	DB_ENV *dbenv;
+	DBT *dbtp;
+	DB_LSN *lsnp;
+	db_recops notused2;
+	void *notused3;
+{
+	__db_addpage_args *argp;
+	u_int32_t i;
+	u_int ch;
+	int ret;
+
+	i = 0;
+	ch = 0;
+	notused2 = DB_TXN_ABORT;
+	notused3 = NULL;
+
+	if ((ret = __db_addpage_read(dbenv, dbtp->data, &argp)) != 0)
+		return (ret);
+	printf("[%lu][%lu]db_addpage: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+	    (u_long)lsnp->file,
+	    (u_long)lsnp->offset,
+	    (u_long)argp->type,
+	    (u_long)argp->txnid->txnid,
+	    (u_long)argp->prev_lsn.file,
+	    (u_long)argp->prev_lsn.offset);
+	printf("\tfileid: %ld\n", (long)argp->fileid);
+	printf("\tpgno: %lu\n", (u_long)argp->pgno);
+	printf("\tlsn: [%lu][%lu]\n",
+	    (u_long)argp->lsn.file, (u_long)argp->lsn.offset);
+	printf("\tnextpgno: %lu\n", (u_long)argp->nextpgno);
+	printf("\tnextlsn: [%lu][%lu]\n",
+	    (u_long)argp->nextlsn.file, (u_long)argp->nextlsn.offset);
+	printf("\n");
+	__os_free(argp, 0);
+	return (0);
+}
+
+int
+__db_addpage_read(dbenv, recbuf, argpp)
+	DB_ENV *dbenv;
+	void *recbuf;
+	__db_addpage_args **argpp;
+{
+	__db_addpage_args *argp;
+	u_int8_t *bp;
+	int ret;
+
+	ret = __os_malloc(dbenv, sizeof(__db_addpage_args) +
+	    sizeof(DB_TXN), NULL, &argp);
+	if (ret != 0)
+		return (ret);
+	argp->txnid = (DB_TXN *)&argp[1];
+	bp = recbuf;
+	memcpy(&argp->type, bp, sizeof(argp->type));
+	bp += sizeof(argp->type);
+	memcpy(&argp->txnid->txnid,  bp, sizeof(argp->txnid->txnid));
+	bp += sizeof(argp->txnid->txnid);
+	memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+	bp += sizeof(DB_LSN);
+	memcpy(&argp->fileid, bp, sizeof(argp->fileid));
+	bp += sizeof(argp->fileid);
+	memcpy(&argp->pgno, bp, sizeof(argp->pgno));
+	bp += sizeof(argp->pgno);
+	memcpy(&argp->lsn, bp,  sizeof(argp->lsn));
+	bp += sizeof(argp->lsn);
+	memcpy(&argp->nextpgno, bp, sizeof(argp->nextpgno));
+	bp += sizeof(argp->nextpgno);
+	memcpy(&argp->nextlsn, bp,  sizeof(argp->nextlsn));
+	bp += sizeof(argp->nextlsn);
+	*argpp = argp;
+	return (0);
+}
+
+int
+__db_debug_log(dbenv, txnid, ret_lsnp, flags,
+	op, fileid, key, data, arg_flags)
+	DB_ENV *dbenv;
+	DB_TXN *txnid;
+	DB_LSN *ret_lsnp;
+	u_int32_t flags;
+	const DBT *op;
+	int32_t fileid;
+	const DBT *key;
+	const DBT *data;
+	u_int32_t arg_flags;
+{
+	DBT logrec;
+	DB_LSN *lsnp, null_lsn;
+	u_int32_t zero;
+	u_int32_t rectype, txn_num;
+	int ret;
+	u_int8_t *bp;
+
+	rectype = DB_db_debug;
+	if (txnid != NULL &&
+	    TAILQ_FIRST(&txnid->kids) != NULL &&
+	    (ret = __txn_activekids(dbenv, rectype, txnid)) != 0)
+		return (ret);
+	txn_num = txnid == NULL ? 0 : txnid->txnid;
+	if (txnid == NULL) {
+		ZERO_LSN(null_lsn);
+		lsnp = &null_lsn;
+	} else
+		lsnp = &txnid->last_lsn;
+	logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN)
+	    + sizeof(u_int32_t) + (op == NULL ? 0 : op->size)
+	    + sizeof(fileid)
+	    + sizeof(u_int32_t) + (key == NULL ? 0 : key->size)
+	    + sizeof(u_int32_t) + (data == NULL ? 0 : data->size)
+	    + sizeof(arg_flags);
+	if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0)
+		return (ret);
+
+	bp = logrec.data;
+	memcpy(bp, &rectype, sizeof(rectype));
+	bp += sizeof(rectype);
+	memcpy(bp, &txn_num, sizeof(txn_num));
+	bp += sizeof(txn_num);
+	memcpy(bp, lsnp, sizeof(DB_LSN));
+	bp += sizeof(DB_LSN);
+	if (op == NULL) {
+		zero = 0;
+		memcpy(bp, &zero, sizeof(u_int32_t));
+		bp += sizeof(u_int32_t);
+	} else {
+		memcpy(bp, &op->size, sizeof(op->size));
+		bp += sizeof(op->size);
+		memcpy(bp, op->data, op->size);
+		bp += op->size;
+	}
+	memcpy(bp, &fileid, sizeof(fileid));
+	bp += sizeof(fileid);
+	if (key == NULL) {
+		zero = 0;
+		memcpy(bp, &zero, sizeof(u_int32_t));
+		bp += sizeof(u_int32_t);
+	} else {
+		memcpy(bp, &key->size, sizeof(key->size));
+		bp += sizeof(key->size);
+		memcpy(bp, key->data, key->size);
+		bp += key->size;
+	}
+	if (data == NULL) {
+		zero = 0;
+		memcpy(bp, &zero, sizeof(u_int32_t));
+		bp += sizeof(u_int32_t);
+	} else {
+		memcpy(bp, &data->size, sizeof(data->size));
+		bp += sizeof(data->size);
+		memcpy(bp, data->data, data->size);
+		bp += data->size;
+	}
+	memcpy(bp, &arg_flags, sizeof(arg_flags));
+	bp += sizeof(arg_flags);
+	DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size);
+	ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags);
+	if (txnid != NULL)
+		txnid->last_lsn = *ret_lsnp;
+	__os_free(logrec.data, logrec.size);
+	return (ret);
+}
+
+int
+__db_debug_print(dbenv, dbtp, lsnp, notused2, notused3)
+	DB_ENV *dbenv;
+	DBT *dbtp;
+	DB_LSN *lsnp;
+	db_recops notused2;
+	void *notused3;
+{
+	__db_debug_args *argp;
+	u_int32_t i;
+	u_int ch;
+	int ret;
+
+	i = 0;
+	ch = 0;
+	notused2 = DB_TXN_ABORT;
+	notused3 = NULL;
+
+	if ((ret = __db_debug_read(dbenv, dbtp->data, &argp)) != 0)
+		return (ret);
+	printf("[%lu][%lu]db_debug: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+	    (u_long)lsnp->file,
+	    (u_long)lsnp->offset,
+	    (u_long)argp->type,
+	    (u_long)argp->txnid->txnid,
+	    (u_long)argp->prev_lsn.file,
+	    (u_long)argp->prev_lsn.offset);
+	printf("\top: ");
+	for (i = 0; i < argp->op.size; i++) {
+		ch = ((u_int8_t *)argp->op.data)[i];
+		if (isprint(ch) || ch == 0xa)
+			putchar(ch);
+		else
+			printf("%#x ", ch);
+	}
+	printf("\n");
+	printf("\tfileid: %ld\n", (long)argp->fileid);
+	printf("\tkey: ");
+	for (i = 0; i < argp->key.size; i++) {
+		ch = ((u_int8_t *)argp->key.data)[i];
+		if (isprint(ch) || ch == 0xa)
+			putchar(ch);
+		else
+			printf("%#x ", ch);
+	}
+	printf("\n");
+	printf("\tdata: ");
+	for (i = 0; i < argp->data.size; i++) {
+		ch = ((u_int8_t *)argp->data.data)[i];
+		if (isprint(ch) || ch == 0xa)
+			putchar(ch);
+		else
+			printf("%#x ", ch);
+	}
+	printf("\n");
+	printf("\targ_flags: %lu\n", (u_long)argp->arg_flags);
+	printf("\n");
+	__os_free(argp, 0);
+	return (0);
+}
+
+int
+__db_debug_read(dbenv, recbuf, argpp)
+	DB_ENV *dbenv;
+	void *recbuf;
+	__db_debug_args **argpp;
+{
+	__db_debug_args *argp;
+	u_int8_t *bp;
+	int ret;
+
+	ret = __os_malloc(dbenv, sizeof(__db_debug_args) +
+	    sizeof(DB_TXN), NULL, &argp);
+	if (ret != 0)
+		return (ret);
+	argp->txnid = (DB_TXN *)&argp[1];
+	bp = recbuf;
+	memcpy(&argp->type, bp, sizeof(argp->type));
+	bp += sizeof(argp->type);
+	memcpy(&argp->txnid->txnid,  bp, sizeof(argp->txnid->txnid));
+	bp += sizeof(argp->txnid->txnid);
+	memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+	bp += sizeof(DB_LSN);
+	memset(&argp->op, 0, sizeof(argp->op));
+	memcpy(&argp->op.size, bp, sizeof(u_int32_t));
+	bp += sizeof(u_int32_t);
+	argp->op.data = bp;
+	bp += argp->op.size;
+	memcpy(&argp->fileid, bp, sizeof(argp->fileid));
+	bp += sizeof(argp->fileid);
+	memset(&argp->key, 0, sizeof(argp->key));
+	memcpy(&argp->key.size, bp, sizeof(u_int32_t));
+	bp += sizeof(u_int32_t);
+	argp->key.data = bp;
+	bp += argp->key.size;
+	memset(&argp->data, 0, sizeof(argp->data));
+	memcpy(&argp->data.size, bp, sizeof(u_int32_t));
+	bp += sizeof(u_int32_t);
+	argp->data.data = bp;
+	bp += argp->data.size;
+	memcpy(&argp->arg_flags, bp, sizeof(argp->arg_flags));
+	bp += sizeof(argp->arg_flags);
+	*argpp = argp;
+	return (0);
+}
+
+int
+__db_noop_log(dbenv, txnid, ret_lsnp, flags,
+	fileid, pgno, prevlsn)
+	DB_ENV *dbenv;
+	DB_TXN *txnid;
+	DB_LSN *ret_lsnp;
+	u_int32_t flags;
+	int32_t fileid;
+	db_pgno_t pgno;
+	DB_LSN * prevlsn;
+{
+	DBT logrec;
+	DB_LSN *lsnp, null_lsn;
+	u_int32_t rectype, txn_num;
+	int ret;
+	u_int8_t *bp;
+
+	rectype = DB_db_noop;
+	if (txnid != NULL &&
+	    TAILQ_FIRST(&txnid->kids) != NULL &&
+	    (ret = __txn_activekids(dbenv, rectype, txnid)) != 0)
+		return (ret);
+	txn_num = txnid == NULL ? 0 : txnid->txnid;
+	if (txnid == NULL) {
+		ZERO_LSN(null_lsn);
+		lsnp = &null_lsn;
+	} else
+		lsnp = &txnid->last_lsn;
+	logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN)
+	    + sizeof(fileid)
+	    + sizeof(pgno)
+	    + sizeof(*prevlsn);
+	if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0)
+		return (ret);
+
+	bp = logrec.data;
+	memcpy(bp, &rectype, sizeof(rectype));
+	bp += sizeof(rectype);
+	memcpy(bp, &txn_num, sizeof(txn_num));
+	bp += sizeof(txn_num);
+	memcpy(bp, lsnp, sizeof(DB_LSN));
+	bp += sizeof(DB_LSN);
+	memcpy(bp, &fileid, sizeof(fileid));
+	bp += sizeof(fileid);
+	memcpy(bp, &pgno, sizeof(pgno));
+	bp += sizeof(pgno);
+	if (prevlsn != NULL)
+		memcpy(bp, prevlsn, sizeof(*prevlsn));
+	else
+		memset(bp, 0, sizeof(*prevlsn));
+	bp += sizeof(*prevlsn);
+	DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size);
+	ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags);
+	if (txnid != NULL)
+		txnid->last_lsn = *ret_lsnp;
+	__os_free(logrec.data, logrec.size);
+	return (ret);
+}
+
+int
+__db_noop_print(dbenv, dbtp, lsnp, notused2, notused3)
+	DB_ENV *dbenv;
+	DBT *dbtp;
+	DB_LSN *lsnp;
+	db_recops notused2;
+	void *notused3;
+{
+	__db_noop_args *argp;
+	u_int32_t i;
+	u_int ch;
+	int ret;
+
+	i = 0;
+	ch = 0;
+	notused2 = DB_TXN_ABORT;
+	notused3 = NULL;
+
+	if ((ret = __db_noop_read(dbenv, dbtp->data, &argp)) != 0)
+		return (ret);
+	printf("[%lu][%lu]db_noop: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+	    (u_long)lsnp->file,
+	    (u_long)lsnp->offset,
+	    (u_long)argp->type,
+	    (u_long)argp->txnid->txnid,
+	    (u_long)argp->prev_lsn.file,
+	    (u_long)argp->prev_lsn.offset);
+	printf("\tfileid: %ld\n", (long)argp->fileid);
+	printf("\tpgno: %lu\n", (u_long)argp->pgno);
+	printf("\tprevlsn: [%lu][%lu]\n",
+	    (u_long)argp->prevlsn.file, (u_long)argp->prevlsn.offset);
+	printf("\n");
+	__os_free(argp, 0);
+	return (0);
+}
+
+int
+__db_noop_read(dbenv, recbuf, argpp)
+	DB_ENV *dbenv;
+	void *recbuf;
+	__db_noop_args **argpp;
+{
+	__db_noop_args *argp;
+	u_int8_t *bp;
+	int ret;
+
+	ret = __os_malloc(dbenv, sizeof(__db_noop_args) +
+	    sizeof(DB_TXN), NULL, &argp);
+	if (ret != 0)
+		return (ret);
+	argp->txnid = (DB_TXN *)&argp[1];
+	bp = recbuf;
+	memcpy(&argp->type, bp, sizeof(argp->type));
+	bp += sizeof(argp->type);
+	memcpy(&argp->txnid->txnid,  bp, sizeof(argp->txnid->txnid));
+	bp += sizeof(argp->txnid->txnid);
+	memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+	bp += sizeof(DB_LSN);
+	memcpy(&argp->fileid, bp, sizeof(argp->fileid));
+	bp += sizeof(argp->fileid);
+	memcpy(&argp->pgno, bp, sizeof(argp->pgno));
+	bp += sizeof(argp->pgno);
+	memcpy(&argp->prevlsn, bp,  sizeof(argp->prevlsn));
+	bp += sizeof(argp->prevlsn);
+	*argpp = argp;
+	return (0);
+}
+
+int
+__db_init_print(dbenv)
+	DB_ENV *dbenv;
+{
+	int ret;
+
+	if ((ret = __db_add_recovery(dbenv,
+	    __db_addrem_print, DB_db_addrem)) != 0)
+		return (ret);
+	if ((ret = __db_add_recovery(dbenv,
+	    __db_split_print, DB_db_split)) != 0)
+		return (ret);
+	if ((ret = __db_add_recovery(dbenv,
+	    __db_big_print, DB_db_big)) != 0)
+		return (ret);
+	if ((ret = __db_add_recovery(dbenv,
+	    __db_ovref_print, DB_db_ovref)) != 0)
+		return (ret);
+	if ((ret = __db_add_recovery(dbenv,
+	    __db_relink_print, DB_db_relink)) != 0)
+		return (ret);
+	if ((ret = __db_add_recovery(dbenv,
+	    __db_addpage_print, DB_db_addpage)) != 0)
+		return (ret);
+	if ((ret = __db_add_recovery(dbenv,
+	    __db_debug_print, DB_db_debug)) != 0)
+		return (ret);
+	if ((ret = __db_add_recovery(dbenv,
+	    __db_noop_print, DB_db_noop)) != 0)
+		return (ret);
+	return (0);
+}
+
+int
+__db_init_recover(dbenv)
+	DB_ENV *dbenv;
+{
+	int ret;
+
+	if ((ret = __db_add_recovery(dbenv,
+	    __db_addrem_recover, DB_db_addrem)) != 0)
+		return (ret);
+	if ((ret = __db_add_recovery(dbenv,
+	    __deprecated_recover, DB_db_split)) != 0)
+		return (ret);
+	if ((ret = __db_add_recovery(dbenv,
+	    __db_big_recover, DB_db_big)) != 0)
+		return (ret);
+	if ((ret = __db_add_recovery(dbenv,
+	    __db_ovref_recover, DB_db_ovref)) != 0)
+		return (ret);
+	if ((ret = __db_add_recovery(dbenv,
+	    __db_relink_recover, DB_db_relink)) != 0)
+		return (ret);
+	if ((ret = __db_add_recovery(dbenv,
+	    __deprecated_recover, DB_db_addpage)) != 0)
+		return (ret);
+	if ((ret = __db_add_recovery(dbenv,
+	    __db_debug_recover, DB_db_debug)) != 0)
+		return (ret);
+	if ((ret = __db_add_recovery(dbenv,
+	    __db_noop_recover, DB_db_noop)) != 0)
+		return (ret);
+	return (0);
+}
+
diff --git a/bdb/db/db_cam.c b/bdb/db/db_cam.c
new file mode 100644
index 00000000000..708d4cbda4d
--- /dev/null
+++ b/bdb/db/db_cam.c
@@ -0,0 +1,974 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 2000
+ *	Sleepycat Software.  All rights reserved.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_cam.c,v 11.52 2001/01/18 15:11:16 bostic Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_shash.h"
+#include "lock.h"
+#include "btree.h"
+#include "hash.h"
+#include "qam.h"
+#include "txn.h"
+#include "db_ext.h"
+
+static int __db_c_cleanup __P((DBC *, DBC *, int));
+static int __db_c_idup __P((DBC *, DBC **, u_int32_t));
+static int __db_wrlock_err __P((DB_ENV *));
+
+#define	CDB_LOCKING_INIT(dbp, dbc)					\
+	/*								\
+	 * If we are running CDB, this had better be either a write	\
+	 * cursor or an immediate writer.  If it's a regular writer,	\
+	 * that means we have an IWRITE lock and we need to upgrade	\
+	 * it to a write lock.						\
+	 */								\
+	if (CDB_LOCKING((dbp)->dbenv)) {				\
+		if (!F_ISSET(dbc, DBC_WRITECURSOR | DBC_WRITER))	\
+			return (__db_wrlock_err(dbp->dbenv));		\
+									\
+		if (F_ISSET(dbc, DBC_WRITECURSOR) &&			\
+		    (ret = lock_get((dbp)->dbenv, (dbc)->locker,	\
+		    DB_LOCK_UPGRADE, &(dbc)->lock_dbt, DB_LOCK_WRITE,	\
+		    &(dbc)->mylock)) != 0)				\
+			return (ret);					\
+	}
+#define	CDB_LOCKING_DONE(dbp, dbc)					\
+	/* Release the upgraded lock. */				\
+	if (F_ISSET(dbc, DBC_WRITECURSOR))				\
+		(void)__lock_downgrade(					\
+		    (dbp)->dbenv, &(dbc)->mylock, DB_LOCK_IWRITE, 0);
+/*
+ * Copy the lock info from one cursor to another, so that locking
+ * in CDB can be done in the context of an internally-duplicated
+ * or off-page-duplicate cursor.
+ */
+#define	CDB_LOCKING_COPY(dbp, dbc_o, dbc_n)				\
+	if (CDB_LOCKING((dbp)->dbenv) &&				\
+	    F_ISSET((dbc_o), DBC_WRITECURSOR | DBC_WRITEDUP)) { \
+		memcpy(&(dbc_n)->mylock, &(dbc_o)->mylock,		\
+		    sizeof((dbc_o)->mylock));				\
+		(dbc_n)->locker = (dbc_o)->locker;			\
+	    /* This lock isn't ours to put--just discard it on close. */ \
+	    F_SET((dbc_n), DBC_WRITEDUP);				\
+	}
+
+/*
+ * __db_c_close --
+ *	Close the cursor.
+ *
+ * PUBLIC: int __db_c_close __P((DBC *));
+ */
+int
+__db_c_close(dbc)
+	DBC *dbc;
+{
+	DB *dbp;
+	DBC *opd;
+	DBC_INTERNAL *cp;
+	int ret, t_ret;
+
+	dbp = dbc->dbp;
+	ret = 0;
+
+	PANIC_CHECK(dbp->dbenv);
+
+	/*
+	 * If the cursor is already closed we have a serious problem, and we
+	 * assume that the cursor isn't on the active queue.  Don't do any of
+	 * the remaining cursor close processing.
+	 */
+	if (!F_ISSET(dbc, DBC_ACTIVE)) {
+		if (dbp != NULL)
+			__db_err(dbp->dbenv, "Closing closed cursor");
+
+		DB_ASSERT(0);
+		return (EINVAL);
+	}
+
+	cp = dbc->internal;
+	opd = cp->opd;
+
+	/*
+	 * Remove the cursor(s) from the active queue.  We may be closing two
+	 * cursors at once here, a top-level one and a lower-level, off-page
+	 * duplicate one.  The access-method specific cursor close routine must
+	 * close both of them in a single call.
+	 *
+	 * !!!
+	 * Cursors must be removed from the active queue before calling the
+	 * access specific cursor close routine; btree depends on having that
+	 * order of operations.  It must also happen before any action that
+	 * can fail and cause __db_c_close to return an error, or else calls
+	 * here from __db_close may loop indefinitely.
+	 */
+	MUTEX_THREAD_LOCK(dbp->dbenv, dbp->mutexp);
+
+	if (opd != NULL) {
+		F_CLR(opd, DBC_ACTIVE);
+		TAILQ_REMOVE(&dbp->active_queue, opd, links);
+	}
+	F_CLR(dbc, DBC_ACTIVE);
+	TAILQ_REMOVE(&dbp->active_queue, dbc, links);
+
+	MUTEX_THREAD_UNLOCK(dbp->dbenv, dbp->mutexp);
+
+	/* Call the access specific cursor close routine. */
+	if ((t_ret =
+	    dbc->c_am_close(dbc, PGNO_INVALID, NULL)) != 0 && ret == 0)
+		ret = t_ret;
+
+	/*
+	 * Release the lock after calling the access method specific close
+	 * routine; a Btree cursor may have had pending deletes.
+	 */
+	if (CDB_LOCKING(dbc->dbp->dbenv)) {
+		/*
+		 * If DBC_WRITEDUP is set, the cursor is an internally
+		 * duplicated write cursor and the lock isn't ours to put.
+		 */
+		if (!F_ISSET(dbc, DBC_WRITEDUP) &&
+		    dbc->mylock.off != LOCK_INVALID) {
+			if ((t_ret = lock_put(dbc->dbp->dbenv,
+			    &dbc->mylock)) != 0 && ret == 0)
+				ret = t_ret;
+			dbc->mylock.off = LOCK_INVALID;
+		}
+
+		/* For safety's sake, since this is going on the free queue. */
+		memset(&dbc->mylock, 0, sizeof(dbc->mylock));
+		F_CLR(dbc, DBC_WRITEDUP);
+	}
+
+	if (dbc->txn != NULL)
+		dbc->txn->cursors--;
+
+	/* Move the cursor(s) to the free queue. */
+	MUTEX_THREAD_LOCK(dbp->dbenv, dbp->mutexp);
+	if (opd != NULL) {
+		if (dbc->txn != NULL)
+			dbc->txn->cursors--;
+		TAILQ_INSERT_TAIL(&dbp->free_queue, opd, links);
+		opd = NULL;
+	}
+	TAILQ_INSERT_TAIL(&dbp->free_queue, dbc, links);
+	MUTEX_THREAD_UNLOCK(dbp->dbenv, dbp->mutexp);
+
+	return (ret);
+}
+
+/*
+ * __db_c_destroy --
+ *	Destroy the cursor, called after DBC->c_close.
+ *
+ * PUBLIC: int __db_c_destroy __P((DBC *));
+ */
+int
+__db_c_destroy(dbc)
+	DBC *dbc;
+{
+	DB *dbp;
+	DBC_INTERNAL *cp;
+	int ret;
+
+	dbp = dbc->dbp;
+	cp =  dbc->internal;
+
+	/* Remove the cursor from the free queue. */
+	MUTEX_THREAD_LOCK(dbp->dbenv, dbp->mutexp);
+	TAILQ_REMOVE(&dbp->free_queue, dbc, links);
+	MUTEX_THREAD_UNLOCK(dbp->dbenv, dbp->mutexp);
+
+	/* Free up allocated memory. */
+	if (dbc->rkey.data != NULL)
+		__os_free(dbc->rkey.data, dbc->rkey.ulen);
+	if (dbc->rdata.data != NULL)
+		__os_free(dbc->rdata.data, dbc->rdata.ulen);
+
+	/* Call the access specific cursor destroy routine. */
+	ret = dbc->c_am_destroy == NULL ? 0 : dbc->c_am_destroy(dbc);
+
+	__os_free(dbc, sizeof(*dbc));
+
+	return (ret);
+}
+
+/*
+ * __db_c_count --
+ *	Return a count of duplicate data items.
+ *
+ * PUBLIC: int __db_c_count __P((DBC *, db_recno_t *, u_int32_t));
+ */
+int
+__db_c_count(dbc, recnop, flags)
+	DBC *dbc;
+	db_recno_t *recnop;
+	u_int32_t flags;
+{
+	DB *dbp;
+	int ret;
+
+	/*
+	 * Cursor Cleanup Note:
+	 * All of the cursors passed to the underlying access methods by this
+	 * routine are not duplicated and will not be cleaned up on return.
+	 * So, pages/locks that the cursor references must be resolved by the
+	 * underlying functions.
+	 */
+	dbp = dbc->dbp;
+
+	PANIC_CHECK(dbp->dbenv);
+
+	/* Check for invalid flags. */
+	if ((ret = __db_ccountchk(dbp, flags, IS_INITIALIZED(dbc))) != 0)
+		return (ret);
+
+	switch (dbc->dbtype) {
+	case DB_QUEUE:
+	case DB_RECNO:
+		*recnop = 1;
+		break;
+	case DB_HASH:
+		if (dbc->internal->opd == NULL) {
+			if ((ret = __ham_c_count(dbc, recnop)) != 0)
+				return (ret);
+			break;
+		}
+		/* FALLTHROUGH */
+	case DB_BTREE:
+		if ((ret = __bam_c_count(dbc, recnop)) != 0)
+			return (ret);
+		break;
+	default:
+		return (__db_unknown_type(dbp->dbenv,
+		     "__db_c_count", dbp->type));
+	}
+	return (0);
+}
+
+/*
+ * __db_c_del --
+ *	Delete using a cursor.
+ *
+ * PUBLIC: int __db_c_del __P((DBC *, u_int32_t));
+ */
+int
+__db_c_del(dbc, flags)
+	DBC *dbc;
+	u_int32_t flags;
+{
+	DB *dbp;
+	DBC *opd;
+	int ret;
+
+	/*
+	 * Cursor Cleanup Note:
+	 * All of the cursors passed to the underlying access methods by this
+	 * routine are not duplicated and will not be cleaned up on return.
+	 * So, pages/locks that the cursor references must be resolved by the
+	 * underlying functions.
+	 */
+	dbp = dbc->dbp;
+
+	PANIC_CHECK(dbp->dbenv);
+	DB_CHECK_TXN(dbp, dbc->txn);
+
+	/* Check for invalid flags. */
+	if ((ret = __db_cdelchk(dbp, flags,
+	    F_ISSET(dbp, DB_AM_RDONLY), IS_INITIALIZED(dbc))) != 0)
+		return (ret);
+
+	DEBUG_LWRITE(dbc, dbc->txn, "db_c_del", NULL, NULL, flags);
+
+	CDB_LOCKING_INIT(dbp, dbc);
+
+	/*
+	 * Off-page duplicate trees are locked in the primary tree, that is,
+	 * we acquire a write lock in the primary tree and no locks in the
+	 * off-page dup tree.  If the del operation is done in an off-page
+	 * duplicate tree, call the primary cursor's upgrade routine first.
+	 */
+	opd = dbc->internal->opd;
+	if (opd == NULL)
+		ret = dbc->c_am_del(dbc);
+	else
+		if ((ret = dbc->c_am_writelock(dbc)) == 0)
+			ret = opd->c_am_del(opd);
+
+	CDB_LOCKING_DONE(dbp, dbc);
+
+	return (ret);
+}
+
+/*
+ * __db_c_dup --
+ *	Duplicate a cursor.
+ *
+ * PUBLIC: int __db_c_dup __P((DBC *, DBC **, u_int32_t));
+ */
+int
+__db_c_dup(dbc_orig, dbcp, flags)
+	DBC *dbc_orig;
+	DBC **dbcp;
+	u_int32_t flags;
+{
+	DB_ENV *dbenv;
+	DB *dbp;
+	DBC *dbc_n, *dbc_nopd;
+	int ret;
+
+	dbp = dbc_orig->dbp;
+	dbenv = dbp->dbenv;
+	dbc_n = dbc_nopd = NULL;
+
+	PANIC_CHECK(dbp->dbenv);
+
+	/*
+	 * We can never have two write cursors open in CDB, so do not
+	 * allow duplication of a write cursor.
+	 */
+	if (flags != DB_POSITIONI &&
+	    F_ISSET(dbc_orig, DBC_WRITER | DBC_WRITECURSOR)) {
+		__db_err(dbenv, "Cannot duplicate writeable cursor");
+		return (EINVAL);
+	}
+
+	/* Allocate a new cursor and initialize it. */
+	if ((ret = __db_c_idup(dbc_orig, &dbc_n, flags)) != 0)
+		goto err;
+	*dbcp = dbc_n;
+
+	/*
+	 * If we're in CDB, and this isn't an internal duplication (in which
+	 * case we're explicitly overriding CDB locking), the duplicated
+	 * cursor needs its own read lock.  (We know it's not a write cursor
+	 * because we wouldn't have made it this far;  you can't dup them.)
+	 */
+	if (CDB_LOCKING(dbenv) && flags != DB_POSITIONI) {
+		DB_ASSERT(!F_ISSET(dbc_orig, DBC_WRITER | DBC_WRITECURSOR));
+
+		if ((ret = lock_get(dbenv, dbc_n->locker, 0,
+		    &dbc_n->lock_dbt, DB_LOCK_READ, &dbc_n->mylock)) != 0) {
+			(void)__db_c_close(dbc_n);
+			return (ret);
+		}
+	}
+
+	/*
+	 * If the cursor references an off-page duplicate tree, allocate a
+	 * new cursor for that tree and initialize it.
+	 */
+	if (dbc_orig->internal->opd != NULL) {
+		if ((ret =
+		   __db_c_idup(dbc_orig->internal->opd, &dbc_nopd, flags)) != 0)
+			goto err;
+		dbc_n->internal->opd = dbc_nopd;
+	}
+
+	return (0);
+
+err:	if (dbc_n != NULL)
+		(void)dbc_n->c_close(dbc_n);
+	if (dbc_nopd != NULL)
+		(void)dbc_nopd->c_close(dbc_nopd);
+
+	return (ret);
+}
+
+/*
+ * __db_c_idup --
+ *	Internal version of __db_c_dup.
+ */
+static int
+__db_c_idup(dbc_orig, dbcp, flags)
+	DBC *dbc_orig, **dbcp;
+	u_int32_t flags;
+{
+	DB *dbp;
+	DBC *dbc_n;
+	DBC_INTERNAL *int_n, *int_orig;
+	int ret;
+
+	dbp = dbc_orig->dbp;
+	dbc_n = *dbcp;
+
+	if ((ret = __db_icursor(dbp, dbc_orig->txn, dbc_orig->dbtype,
+	    dbc_orig->internal->root, F_ISSET(dbc_orig, DBC_OPD), &dbc_n)) != 0)
+		return (ret);
+
+	dbc_n->locker = dbc_orig->locker;
+
+	/* If the user wants the cursor positioned, do it here.  */
+	if (flags == DB_POSITION || flags == DB_POSITIONI) {
+		int_n = dbc_n->internal;
+		int_orig = dbc_orig->internal;
+
+		dbc_n->flags = dbc_orig->flags;
+
+		int_n->indx = int_orig->indx;
+		int_n->pgno = int_orig->pgno;
+		int_n->root = int_orig->root;
+		int_n->lock_mode = int_orig->lock_mode;
+
+		switch (dbc_orig->dbtype) {
+		case DB_QUEUE:
+			if ((ret = __qam_c_dup(dbc_orig, dbc_n)) != 0)
+				goto err;
+			break;
+		case DB_BTREE:
+		case DB_RECNO:
+			if ((ret = __bam_c_dup(dbc_orig, dbc_n)) != 0)
+				goto err;
+			break;
+		case DB_HASH:
+			if ((ret = __ham_c_dup(dbc_orig, dbc_n)) != 0)
+				goto err;
+			break;
+		default:
+			ret = __db_unknown_type(dbp->dbenv,
+			    "__db_c_idup", dbc_orig->dbtype);
+			goto err;
+		}
+	}
+
+	/* Now take care of duping the CDB information. */
+	CDB_LOCKING_COPY(dbp, dbc_orig, dbc_n);
+
+	*dbcp = dbc_n;
+	return (0);
+
+err:	(void)dbc_n->c_close(dbc_n);
+	return (ret);
+}
+
+/*
+ * __db_c_newopd --
+ *	Create a new off-page duplicate cursor.
+ *
+ * PUBLIC: int __db_c_newopd __P((DBC *, db_pgno_t, DBC **));
+ */
+int
+__db_c_newopd(dbc_parent, root, dbcp)
+	DBC *dbc_parent;
+	db_pgno_t root;
+	DBC **dbcp;
+{
+	DB *dbp;
+	DBC *opd;
+	DBTYPE dbtype;
+	int ret;
+
+	dbp = dbc_parent->dbp;
+	dbtype = (dbp->dup_compare == NULL) ? DB_RECNO : DB_BTREE;
+
+	if ((ret = __db_icursor(dbp,
+	    dbc_parent->txn, dbtype, root, 1, &opd)) != 0)
+		return (ret);
+
+	CDB_LOCKING_COPY(dbp, dbc_parent, opd);
+
+	*dbcp = opd;
+
+	return (0);
+}
+
+/*
+ * __db_c_get --
+ *	Get using a cursor.
+ *
+ * PUBLIC: int __db_c_get __P((DBC *, DBT *, DBT *, u_int32_t));
+ */
+int
+__db_c_get(dbc_arg, key, data, flags)
+	DBC *dbc_arg;
+	DBT *key, *data;
+	u_int32_t flags;
+{
+	DB *dbp;
+	DBC *dbc, *dbc_n, *opd;
+	DBC_INTERNAL *cp, *cp_n;
+	db_pgno_t pgno;
+	u_int32_t tmp_flags, tmp_rmw;
+	u_int8_t type;
+	int ret, t_ret;
+
+	/*
+	 * Cursor Cleanup Note:
+	 * All of the cursors passed to the underlying access methods by this
+	 * routine are duplicated cursors.  On return, any referenced pages
+	 * will be discarded, and, if the cursor is not intended to be used
+	 * again, the close function will be called.  So, pages/locks that
+	 * the cursor references do not need to be resolved by the underlying
+	 * functions.
+	 */
+	dbp = dbc_arg->dbp;
+	dbc_n = NULL;
+	opd = NULL;
+
+	PANIC_CHECK(dbp->dbenv);
+
+	/* Check for invalid flags. */
+	if ((ret =
+	    __db_cgetchk(dbp, key, data, flags, IS_INITIALIZED(dbc_arg))) != 0)
+		return (ret);
+
+	/* Clear OR'd in additional bits so we can check for flag equality. */
+	tmp_rmw = LF_ISSET(DB_RMW);
+	LF_CLR(DB_RMW);
+
+	DEBUG_LREAD(dbc_arg, dbc_arg->txn, "db_c_get",
+	    flags == DB_SET || flags == DB_SET_RANGE ? key : NULL, NULL, flags);
+
+	/*
+	 * Return a cursor's record number.  It has nothing to do with the
+	 * cursor get code except that it was put into the interface.
+	 */
+	if (flags == DB_GET_RECNO)
+		return (__bam_c_rget(dbc_arg, data, flags | tmp_rmw));
+
+	if (flags == DB_CONSUME || flags == DB_CONSUME_WAIT)
+		CDB_LOCKING_INIT(dbp, dbc_arg);
+
+	/*
+	 * If we have an off-page duplicates cursor, and the operation applies
+	 * to it, perform the operation.  Duplicate the cursor and call the
+	 * underlying function.
+	 *
+	 * Off-page duplicate trees are locked in the primary tree, that is,
+	 * we acquire a write lock in the primary tree and no locks in the
+	 * off-page dup tree.  If the DB_RMW flag was specified and the get
+	 * operation is done in an off-page duplicate tree, call the primary
+	 * cursor's upgrade routine first.
+	 */
+	cp = dbc_arg->internal;
+	if (cp->opd != NULL &&
+	    (flags == DB_CURRENT || flags == DB_GET_BOTHC ||
+	    flags == DB_NEXT || flags == DB_NEXT_DUP || flags == DB_PREV)) {
+		if (tmp_rmw && (ret = dbc_arg->c_am_writelock(dbc_arg)) != 0)
+			return (ret);
+		if ((ret = __db_c_idup(cp->opd, &opd, DB_POSITIONI)) != 0)
+			return (ret);
+
+		switch (ret = opd->c_am_get(
+		    opd, key, data, flags, NULL)) {
+		case 0:
+			goto done;
+		case DB_NOTFOUND:
+			/*
+			 * Translate DB_NOTFOUND failures for the DB_NEXT and
+			 * DB_PREV operations into a subsequent operation on
+			 * the parent cursor.
+			 */
+			if (flags == DB_NEXT || flags == DB_PREV) {
+				if ((ret = opd->c_close(opd)) != 0)
+					goto err;
+				opd = NULL;
+				break;
+			}
+			goto err;
+		default:
+			goto err;
+		}
+	}
+
+	/*
+	 * Perform an operation on the main cursor.  Duplicate the cursor,
+	 * upgrade the lock as required, and call the underlying function.
+	 */
+	switch (flags) {
+	case DB_CURRENT:
+	case DB_GET_BOTHC:
+	case DB_NEXT:
+	case DB_NEXT_DUP:
+	case DB_NEXT_NODUP:
+	case DB_PREV:
+	case DB_PREV_NODUP:
+		tmp_flags = DB_POSITIONI;
+		break;
+	default:
+		tmp_flags = 0;
+		break;
+	}
+
+	/*
+	 * If this cursor is going to be closed immediately, we don't
+	 * need to take precautions to clean it up on error.
+	 */
+	if (F_ISSET(dbc_arg, DBC_TRANSIENT))
+		dbc_n = dbc_arg;
+	else if ((ret = __db_c_idup(dbc_arg, &dbc_n, tmp_flags)) != 0)
+		goto err;
+
+	if (tmp_rmw)
+		F_SET(dbc_n, DBC_RMW);
+	pgno = PGNO_INVALID;
+	ret = dbc_n->c_am_get(dbc_n, key, data, flags, &pgno);
+	if (tmp_rmw)
+		F_CLR(dbc_n, DBC_RMW);
+	if (ret != 0)
+		goto err;
+
+	cp_n = dbc_n->internal;
+
+	/*
+	 * We may be referencing a new off-page duplicates tree.  Acquire
+	 * a new cursor and call the underlying function.
+	 */
+	if (pgno != PGNO_INVALID) {
+		if ((ret = __db_c_newopd(dbc_arg, pgno, &cp_n->opd)) != 0)
+			goto err;
+
+		switch (flags) {
+		case DB_FIRST:
+		case DB_NEXT:
+		case DB_NEXT_NODUP:
+		case DB_SET:
+		case DB_SET_RECNO:
+		case DB_SET_RANGE:
+			tmp_flags = DB_FIRST;
+			break;
+		case DB_LAST:
+		case DB_PREV:
+		case DB_PREV_NODUP:
+			tmp_flags = DB_LAST;
+			break;
+		case DB_GET_BOTH:
+			tmp_flags = DB_GET_BOTH;
+			break;
+		case DB_GET_BOTHC:
+			tmp_flags = DB_GET_BOTHC;
+			break;
+		default:
+			ret =
+			    __db_unknown_flag(dbp->dbenv, "__db_c_get", flags);
+			goto err;
+		}
+		if ((ret = cp_n->opd->c_am_get(
+		    cp_n->opd, key, data, tmp_flags, NULL)) != 0)
+			goto err;
+	}
+
+done:	/*
+	 * Return a key/data item.  The only exception is that we don't return
+	 * a key if the user already gave us one, that is, if the DB_SET flag
+	 * was set.  The DB_SET flag is necessary.  In a Btree, the user's key
+	 * doesn't have to be the same as the key stored in the tree, depending
+	 * on the magic performed by the comparison function.  As we may not
+	 * have done any key-oriented operation here, the page reference may
+	 * not be valid.  Fill it in as necessary.  We don't have to worry
+	 * about any locks; the cursor must already hold the appropriate locks.
+	 *
+	 * XXX
+	 * If not a Btree and DB_SET_RANGE is set, we shouldn't return a key
+	 * either, should we?
+	 */
+	cp_n = dbc_n == NULL ? dbc_arg->internal : dbc_n->internal;
+	if (!F_ISSET(key, DB_DBT_ISSET)) {
+		if (cp_n->page == NULL && (ret =
+		    memp_fget(dbp->mpf, &cp_n->pgno, 0, &cp_n->page)) != 0)
+			goto err;
+
+		if ((ret = __db_ret(dbp, cp_n->page, cp_n->indx,
+		    key, &dbc_arg->rkey.data, &dbc_arg->rkey.ulen)) != 0)
+			goto err;
+	}
+	dbc = opd != NULL ? opd : cp_n->opd != NULL ? cp_n->opd : dbc_n;
+	if (!F_ISSET(data, DB_DBT_ISSET)) {
+		type = TYPE(dbc->internal->page);
+		ret = __db_ret(dbp, dbc->internal->page, dbc->internal->indx +
+		    (type == P_LBTREE || type == P_HASH ? O_INDX : 0),
+		    data, &dbc_arg->rdata.data, &dbc_arg->rdata.ulen);
+	}
+
+err:	/* Don't pass DB_DBT_ISSET back to application level, error or no. */
+	F_CLR(key, DB_DBT_ISSET);
+	F_CLR(data, DB_DBT_ISSET);
+
+	/* Cleanup and cursor resolution. */
+	if (opd != NULL) {
+		if ((t_ret =
+		     __db_c_cleanup(dbc_arg->internal->opd,
+		     opd, ret)) != 0 && ret == 0)
+			ret = t_ret;
+
+	}
+
+	if ((t_ret = __db_c_cleanup(dbc_arg, dbc_n, ret)) != 0 && ret == 0)
+		ret = t_ret;
+
+	if (flags == DB_CONSUME || flags == DB_CONSUME_WAIT)
+		CDB_LOCKING_DONE(dbp, dbc_arg);
+	return (ret);
+}
+
+/*
+ * __db_c_put --
+ *	Put using a cursor.
+ *
+ * PUBLIC: int __db_c_put __P((DBC *, DBT *, DBT *, u_int32_t));
+ */
+int
+__db_c_put(dbc_arg, key, data, flags)
+	DBC *dbc_arg;
+	DBT *key, *data;
+	u_int32_t flags;
+{
+	DB *dbp;
+	DBC *dbc_n, *opd;
+	db_pgno_t pgno;
+	u_int32_t tmp_flags;
+	int ret, t_ret;
+
+	/*
+	 * Cursor Cleanup Note:
+	 * All of the cursors passed to the underlying access methods by this
+	 * routine are duplicated cursors.  On return, any referenced pages
+	 * will be discarded, and, if the cursor is not intended to be used
+	 * again, the close function will be called.  So, pages/locks that
+	 * the cursor references do not need to be resolved by the underlying
+	 * functions.
+	 */
+	dbp = dbc_arg->dbp;
+	dbc_n = NULL;
+
+	PANIC_CHECK(dbp->dbenv);
+	DB_CHECK_TXN(dbp, dbc_arg->txn);
+
+	/* Check for invalid flags. */
+	if ((ret = __db_cputchk(dbp, key, data, flags,
+	    F_ISSET(dbp, DB_AM_RDONLY), IS_INITIALIZED(dbc_arg))) != 0)
+		return (ret);
+
+	DEBUG_LWRITE(dbc_arg, dbc_arg->txn, "db_c_put",
+	    flags == DB_KEYFIRST || flags == DB_KEYLAST ||
+	    flags == DB_NODUPDATA ? key : NULL, data, flags);
+
+	CDB_LOCKING_INIT(dbp, dbc_arg);
+
+	/*
+	 * If we have an off-page duplicates cursor, and the operation applies
+	 * to it, perform the operation.  Duplicate the cursor and call the
+	 * underlying function.
+	 *
+	 * Off-page duplicate trees are locked in the primary tree, that is,
+	 * we acquire a write lock in the primary tree and no locks in the
+	 * off-page dup tree.  If the put operation is done in an off-page
+	 * duplicate tree, call the primary cursor's upgrade routine first.
+	 */
+	if (dbc_arg->internal->opd != NULL &&
+	    (flags == DB_AFTER || flags == DB_BEFORE || flags == DB_CURRENT)) {
+		/*
+		 * A special case for hash off-page duplicates.  Hash doesn't
+		 * support (and is documented not to support) put operations
+		 * relative to a cursor which references an already deleted
+		 * item.  For consistency, apply the same criteria to off-page
+		 * duplicates as well.
+		 */
+		if (dbc_arg->dbtype == DB_HASH && F_ISSET(
+		    ((BTREE_CURSOR *)(dbc_arg->internal->opd->internal)),
+		    C_DELETED)) {
+			ret = DB_NOTFOUND;
+			goto err;
+		}
+
+		if ((ret = dbc_arg->c_am_writelock(dbc_arg)) != 0)
+			return (ret);
+		if ((ret = __db_c_dup(dbc_arg, &dbc_n, DB_POSITIONI)) != 0)
+			goto err;
+		opd = dbc_n->internal->opd;
+		if ((ret = opd->c_am_put(
+		    opd, key, data, flags, NULL)) != 0)
+			goto err;
+		goto done;
+	}
+
+	/*
+	 * Perform an operation on the main cursor.  Duplicate the cursor,
+	 * and call the underlying function.
+	 *
+	 * XXX: MARGO
+	 *
+	tmp_flags = flags == DB_AFTER ||
+	    flags == DB_BEFORE || flags == DB_CURRENT ? DB_POSITIONI : 0;
+	 */
+	tmp_flags = DB_POSITIONI;
+
+	/*
+	 * If this cursor is going to be closed immediately, we don't
+	 * need to take precautions to clean it up on error.
+	 */
+	if (F_ISSET(dbc_arg, DBC_TRANSIENT))
+		dbc_n = dbc_arg;
+	else if ((ret = __db_c_idup(dbc_arg, &dbc_n, tmp_flags)) != 0)
+		goto err;
+
+	pgno = PGNO_INVALID;
+	if ((ret = dbc_n->c_am_put(dbc_n, key, data, flags, &pgno)) != 0)
+		goto err;
+
+	/*
+	 * We may be referencing a new off-page duplicates tree.  Acquire
+	 * a new cursor and call the underlying function.
+	 */
+	if (pgno != PGNO_INVALID) {
+		if ((ret = __db_c_newopd(dbc_arg, pgno, &opd)) != 0)
+			goto err;
+		dbc_n->internal->opd = opd;
+
+		if ((ret = opd->c_am_put(
+		    opd, key, data, flags, NULL)) != 0)
+			goto err;
+	}
+
+done:
+err:	/* Cleanup and cursor resolution. */
+	if ((t_ret = __db_c_cleanup(dbc_arg, dbc_n, ret)) != 0 && ret == 0)
+		ret = t_ret;
+
+	CDB_LOCKING_DONE(dbp, dbc_arg);
+
+	return (ret);
+}
+
+/*
+ * __db_duperr()
+ *	Error message: we don't currently support sorted duplicate duplicates.
+ * PUBLIC: int __db_duperr __P((DB *, u_int32_t));
+ */
+int
+__db_duperr(dbp, flags)
+	DB *dbp;
+	u_int32_t flags;
+{
+	if (flags != DB_NODUPDATA)
+		__db_err(dbp->dbenv,
+		    "Duplicate data items are not supported with sorted data");
+	return (DB_KEYEXIST);
+}
+
+/*
+ * __db_c_cleanup --
+ *	Clean up duplicate cursors.
+ */
+static int
+__db_c_cleanup(dbc, dbc_n, failed)
+	DBC *dbc, *dbc_n;
+	int failed;
+{
+	DB *dbp;
+	DBC *opd;
+	DBC_INTERNAL *internal;
+	int ret, t_ret;
+
+	dbp = dbc->dbp;
+	internal = dbc->internal;
+	ret = 0;
+
+	/* Discard any pages we're holding. */
+	if (internal->page != NULL) {
+		if ((t_ret =
+		    memp_fput(dbp->mpf, internal->page, 0)) != 0 && ret == 0)
+			ret = t_ret;
+		internal->page = NULL;
+	}
+	opd = internal->opd;
+	if (opd != NULL && opd->internal->page != NULL) {
+		if ((t_ret = memp_fput(dbp->mpf,
+		     opd->internal->page, 0)) != 0 && ret == 0)
+			ret = t_ret;
+		opd->internal->page = NULL;
+	}
+
+	/*
+	 * If dbc_n is NULL, there's no internal cursor swapping to be
+	 * done and no dbc_n to close--we probably did the entire
+	 * operation on an offpage duplicate cursor.  Just return.
+	 */
+	if (dbc_n == NULL)
+		return (ret);
+
+	/*
+	 * If dbc is marked DBC_TRANSIENT, we're inside a DB->{put/get}
+	 * operation, and as an optimization we performed the operation on
+	 * the main cursor rather than on a duplicated one.  Assert
+	 * that dbc_n == dbc (i.e., that we really did skip the
+	 * duplication).  Then just do nothing--even if there was
+	 * an error, we're about to close the cursor, and the fact that we
+	 * moved it isn't a user-visible violation of our "cursor
+	 * stays put on error" rule.
+	 */
+	if (F_ISSET(dbc, DBC_TRANSIENT)) {
+		DB_ASSERT(dbc == dbc_n);
+		return (ret);
+	}
+
+	if (dbc_n->internal->page != NULL) {
+		if ((t_ret = memp_fput(dbp->mpf,
+		    dbc_n->internal->page, 0)) != 0 && ret == 0)
+			ret = t_ret;
+		dbc_n->internal->page = NULL;
+	}
+	opd = dbc_n->internal->opd;
+	if (opd != NULL && opd->internal->page != NULL) {
+		if ((t_ret = memp_fput(dbp->mpf,
+		     opd->internal->page, 0)) != 0 && ret == 0)
+			ret = t_ret;
+		opd->internal->page = NULL;
+	}
+
+	/*
+	 * If we didn't fail before entering this routine or just now when
+	 * freeing pages, swap the interesting contents of the old and new
+	 * cursors.
+	 */
+	if (!failed && ret == 0) {
+		dbc->internal = dbc_n->internal;
+		dbc_n->internal = internal;
+	}
+
+	/*
+	 * Close the cursor we don't care about anymore.  The close can fail,
+	 * but we only expect DB_LOCK_DEADLOCK failures.  This violates our
+	 * "the cursor is unchanged on error" semantics, but since all you can
+	 * do with a DB_LOCK_DEADLOCK failure is close the cursor, I believe
+	 * that's OK.
+	 *
+	 * XXX
+	 * There's no way to recover from failure to close the old cursor.
+	 * All we can do is move to the new position and return an error.
+	 *
+	 * XXX
+	 * We might want to consider adding a flag to the cursor, so that any
+	 * subsequent operations other than close just return an error?
+	 */
+	if ((t_ret = dbc_n->c_close(dbc_n)) != 0 && ret == 0)
+		ret = t_ret;
+
+	return (ret);
+}
+
+/*
+ * __db_wrlock_err -- do not have a write lock.
+ */
+static int
+__db_wrlock_err(dbenv)
+	DB_ENV *dbenv;
+{
+	__db_err(dbenv, "Write attempted on read-only cursor");
+	return (EPERM);
+}
diff --git a/bdb/db/db_conv.c b/bdb/db/db_conv.c
new file mode 100644
index 00000000000..df60be06790
--- /dev/null
+++ b/bdb/db/db_conv.c
@@ -0,0 +1,348 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ *	Sleepycat Software.  All rights reserved.
+ */
+/*
+ * Copyright (c) 1990, 1993, 1994, 1995, 1996
+ *	Keith Bostic.  All rights reserved.
+ */
+/*
+ * Copyright (c) 1990, 1993, 1994, 1995
+ *	The Regents of the University of California.  All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of the University nor the names of its contributors
+ *    may be used to endorse or promote products derived from this software
+ *    without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_conv.c,v 11.11 2000/11/30 00:58:31 ubell Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_swap.h"
+#include "db_am.h"
+#include "btree.h"
+#include "hash.h"
+#include "qam.h"
+
+/*
+ * __db_pgin --
+ *	Primary page-swap routine.
+ *
+ * PUBLIC: int __db_pgin __P((DB_ENV *, db_pgno_t, void *, DBT *));
+ */
+int
+__db_pgin(dbenv, pg, pp, cookie)
+	DB_ENV *dbenv;
+	db_pgno_t pg;
+	void *pp;
+	DBT *cookie;
+{
+	DB_PGINFO *pginfo;
+
+	pginfo = (DB_PGINFO *)cookie->data;
+
+	switch (((PAGE *)pp)->type) {
+	case P_HASH:
+	case P_HASHMETA:
+	case P_INVALID:
+		return (__ham_pgin(dbenv, pg, pp, cookie));
+	case P_BTREEMETA:
+	case P_IBTREE:
+	case P_IRECNO:
+	case P_LBTREE:
+	case P_LDUP:
+	case P_LRECNO:
+	case P_OVERFLOW:
+		return (__bam_pgin(dbenv, pg, pp, cookie));
+	case P_QAMMETA:
+	case P_QAMDATA:
+		return (__qam_pgin_out(dbenv, pg, pp, cookie));
+	default:
+		break;
+	}
+	return (__db_unknown_type(dbenv, "__db_pgin", ((PAGE *)pp)->type));
+}
+
+/*
+ * __db_pgout --
+ *	Primary page-swap routine.
+ *
+ * PUBLIC: int __db_pgout __P((DB_ENV *, db_pgno_t, void *, DBT *));
+ */
+int
+__db_pgout(dbenv, pg, pp, cookie)
+	DB_ENV *dbenv;
+	db_pgno_t pg;
+	void *pp;
+	DBT *cookie;
+{
+	DB_PGINFO *pginfo;
+
+	pginfo = (DB_PGINFO *)cookie->data;
+
+	switch (((PAGE *)pp)->type) {
+	case P_HASH:
+	case P_HASHMETA:
+	case P_INVALID:
+		return (__ham_pgout(dbenv, pg, pp, cookie));
+	case P_BTREEMETA:
+	case P_IBTREE:
+	case P_IRECNO:
+	case P_LBTREE:
+	case P_LDUP:
+	case P_LRECNO:
+	case P_OVERFLOW:
+		return (__bam_pgout(dbenv, pg, pp, cookie));
+	case P_QAMMETA:
+	case P_QAMDATA:
+		return (__qam_pgin_out(dbenv, pg, pp, cookie));
+	default:
+		break;
+	}
+	return (__db_unknown_type(dbenv, "__db_pgout", ((PAGE *)pp)->type));
+}
+
+/*
+ * __db_metaswap --
+ *	Byteswap the common part of the meta-data page.
+ *
+ * PUBLIC: void __db_metaswap __P((PAGE *));
+ */
+void
+__db_metaswap(pg)
+	PAGE *pg;
+{
+	u_int8_t *p;
+
+	p = (u_int8_t *)pg;
+
+	/* Swap the meta-data information. */
+	SWAP32(p);	/* lsn.file */
+	SWAP32(p);	/* lsn.offset */
+	SWAP32(p);	/* pgno */
+	SWAP32(p);	/* magic */
+	SWAP32(p);	/* version */
+	SWAP32(p);	/* pagesize */
+	p += 4;		/* unused, page type, unused, unused */
+	SWAP32(p);	/* free */
+	SWAP32(p);	/* alloc_lsn part 1 */
+	SWAP32(p);	/* alloc_lsn part 2 */
+	SWAP32(p);	/* cached key count */
+	SWAP32(p);	/* cached record count */
+	SWAP32(p);	/* flags */
+}
+
+/*
+ * __db_byteswap --
+ *	Byteswap a page.
+ *
+ * PUBLIC: int __db_byteswap __P((DB_ENV *, db_pgno_t, PAGE *, size_t, int));
+ */
+int
+__db_byteswap(dbenv, pg, h, pagesize, pgin)
+	DB_ENV *dbenv;
+	db_pgno_t pg;
+	PAGE *h;
+	size_t pagesize;
+	int pgin;
+{
+	BINTERNAL *bi;
+	BKEYDATA *bk;
+	BOVERFLOW *bo;
+	RINTERNAL *ri;
+	db_indx_t i, len, tmp;
+	u_int8_t *p, *end;
+
+	COMPQUIET(pg, 0);
+
+	if (pgin) {
+		M_32_SWAP(h->lsn.file);
+		M_32_SWAP(h->lsn.offset);
+		M_32_SWAP(h->pgno);
+		M_32_SWAP(h->prev_pgno);
+		M_32_SWAP(h->next_pgno);
+		M_16_SWAP(h->entries);
+		M_16_SWAP(h->hf_offset);
+	}
+
+	switch (h->type) {
+	case P_HASH:
+		for (i = 0; i < NUM_ENT(h); i++) {
+			if (pgin)
+				M_16_SWAP(h->inp[i]);
+
+			switch (HPAGE_TYPE(h, i)) {
+			case H_KEYDATA:
+				break;
+			case H_DUPLICATE:
+				len = LEN_HKEYDATA(h, pagesize, i);
+				p = HKEYDATA_DATA(P_ENTRY(h, i));
+				for (end = p + len; p < end;) {
+					if (pgin) {
+						P_16_SWAP(p);
+						memcpy(&tmp,
+						    p, sizeof(db_indx_t));
+						p += sizeof(db_indx_t);
+					} else {
+						memcpy(&tmp,
+						    p, sizeof(db_indx_t));
+						SWAP16(p);
+					}
+					p += tmp;
+					SWAP16(p);
+				}
+				break;
+			case H_OFFDUP:
+				p = HOFFPAGE_PGNO(P_ENTRY(h, i));
+				SWAP32(p);			/* pgno */
+				break;
+			case H_OFFPAGE:
+				p = HOFFPAGE_PGNO(P_ENTRY(h, i));
+				SWAP32(p);			/* pgno */
+				SWAP32(p);			/* tlen */
+				break;
+			}
+
+		}
+
+		/*
+		 * The offsets in the inp array are used to determine
+		 * the size of entries on a page; therefore they
+		 * cannot be converted until we've done all the
+		 * entries.
+		 */
+		if (!pgin)
+			for (i = 0; i < NUM_ENT(h); i++)
+				M_16_SWAP(h->inp[i]);
+		break;
+	case P_LBTREE:
+	case P_LDUP:
+	case P_LRECNO:
+		for (i = 0; i < NUM_ENT(h); i++) {
+			if (pgin)
+				M_16_SWAP(h->inp[i]);
+
+			/*
+			 * In the case of on-page duplicates, key information
+			 * should only be swapped once.
+			 */
+			if (h->type == P_LBTREE && i > 1) {
+				if (pgin) {
+					if (h->inp[i] == h->inp[i - 2])
+						continue;
+				} else {
+					M_16_SWAP(h->inp[i]);
+					if (h->inp[i] == h->inp[i - 2])
+						continue;
+					M_16_SWAP(h->inp[i]);
+				}
+			}
+
+			bk = GET_BKEYDATA(h, i);
+			switch (B_TYPE(bk->type)) {
+			case B_KEYDATA:
+				M_16_SWAP(bk->len);
+				break;
+			case B_DUPLICATE:
+			case B_OVERFLOW:
+				bo = (BOVERFLOW *)bk;
+				M_32_SWAP(bo->pgno);
+				M_32_SWAP(bo->tlen);
+				break;
+			}
+
+			if (!pgin)
+				M_16_SWAP(h->inp[i]);
+		}
+		break;
+	case P_IBTREE:
+		for (i = 0; i < NUM_ENT(h); i++) {
+			if (pgin)
+				M_16_SWAP(h->inp[i]);
+
+			bi = GET_BINTERNAL(h, i);
+			M_16_SWAP(bi->len);
+			M_32_SWAP(bi->pgno);
+			M_32_SWAP(bi->nrecs);
+
+			switch (B_TYPE(bi->type)) {
+			case B_KEYDATA:
+				break;
+			case B_DUPLICATE:
+			case B_OVERFLOW:
+				bo = (BOVERFLOW *)bi->data;
+				M_32_SWAP(bo->pgno);
+				M_32_SWAP(bo->tlen);
+				break;
+			}
+
+			if (!pgin)
+				M_16_SWAP(h->inp[i]);
+		}
+		break;
+	case P_IRECNO:
+		for (i = 0; i < NUM_ENT(h); i++) {
+			if (pgin)
+				M_16_SWAP(h->inp[i]);
+
+			ri = GET_RINTERNAL(h, i);
+			M_32_SWAP(ri->pgno);
+			M_32_SWAP(ri->nrecs);
+
+			if (!pgin)
+				M_16_SWAP(h->inp[i]);
+		}
+		break;
+	case P_OVERFLOW:
+	case P_INVALID:
+		/* Nothing to do. */
+		break;
+	default:
+		return (__db_unknown_type(dbenv, "__db_byteswap", h->type));
+	}
+
+	if (!pgin) {
+		/* Swap the header information. */
+		M_32_SWAP(h->lsn.file);
+		M_32_SWAP(h->lsn.offset);
+		M_32_SWAP(h->pgno);
+		M_32_SWAP(h->prev_pgno);
+		M_32_SWAP(h->next_pgno);
+		M_16_SWAP(h->entries);
+		M_16_SWAP(h->hf_offset);
+	}
+	return (0);
+}
diff --git a/bdb/db/db_dispatch.c b/bdb/db/db_dispatch.c
new file mode 100644
index 00000000000..c9beac401a7
--- /dev/null
+++ b/bdb/db/db_dispatch.c
@@ -0,0 +1,983 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ *	Sleepycat Software.  All rights reserved.
+ */
+/*
+ * Copyright (c) 1995, 1996
+ *	The President and Fellows of Harvard University.  All rights reserved.
+ *
+ * This code is derived from software contributed to Berkeley by
+ * Margo Seltzer.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of the University nor the names of its contributors
+ *    may be used to endorse or promote products derived from this software
+ *    without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_dispatch.c,v 11.41 2001/01/11 18:19:50 bostic Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <stddef.h>
+#include <stdlib.h>
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_dispatch.h"
+#include "db_am.h"
+#include "log_auto.h"
+#include "txn.h"
+#include "txn_auto.h"
+#include "log.h"
+
+static int __db_txnlist_find_internal __P((void *, db_txnlist_type,
+	     u_int32_t, u_int8_t [DB_FILE_ID_LEN], DB_TXNLIST **, int));
+
+/*
+ * __db_dispatch --
+ *
+ * This is the transaction dispatch function used by the db access methods.
+ * It is designed to handle the record format used by all the access
+ * methods (the one automatically generated by the db_{h,log,read}.sh
+ * scripts in the tools directory).  An application using a different
+ * recovery paradigm will supply a different dispatch function to txn_open.
+ *
+ * PUBLIC: int __db_dispatch __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *));
+ */
+int
+__db_dispatch(dbenv, db, lsnp, redo, info)
+	DB_ENV *dbenv;		/* The environment. */
+	DBT *db;		/* The log record upon which to dispatch. */
+	DB_LSN *lsnp;		/* The lsn of the record being dispatched. */
+	db_recops redo;		/* Redo this op (or undo it). */
+	void *info;
+{
+	u_int32_t rectype, txnid;
+	int make_call, ret;
+
+	memcpy(&rectype, db->data, sizeof(rectype));
+	memcpy(&txnid, (u_int8_t *)db->data + sizeof(rectype), sizeof(txnid));
+	make_call = ret = 0;
+
+	/*
+	 * If we find a record that is in the user's number space and they
+	 * have specified a recovery routine, let them handle it.  If they
+	 * didn't specify a recovery routine, then we expect that they've
+	 * followed all our rules and registered new recovery functions.
+	 */
+	switch (redo) {
+	case DB_TXN_ABORT:
+		/*
+		 * XXX
+		 * db_printlog depends on DB_TXN_ABORT not examining the TXN
+		 * list.  If that ever changes, fix db_printlog too.
+		 */
+		make_call = 1;
+		break;
+	case DB_TXN_OPENFILES:
+		if (rectype == DB_log_register)
+			return (dbenv->dtab[rectype](dbenv,
+			    db, lsnp, redo, info));
+		break;
+	case DB_TXN_BACKWARD_ROLL:
+		/*
+		 * Running full recovery in the backward pass.  If we've
+		 * seen this txnid before and added to it our commit list,
+		 * then we do nothing during this pass, unless this is a child
+		 * commit record, in which case we need to process it.  If
+		 * we've never seen it, then we call the appropriate recovery
+		 * routine.
+		 *
+		 * We need to always undo DB_db_noop records, so that we
+		 * properly handle any aborts before the file was closed.
+		 */
+		if (rectype == DB_log_register ||
+		    rectype == DB_txn_ckp || rectype == DB_db_noop
+		    || rectype == DB_txn_child || (txnid != 0 &&
+		    (ret = __db_txnlist_find(info, txnid)) != 0)) {
+			make_call = 1;
+			if (ret == DB_NOTFOUND && rectype != DB_txn_regop &&
+			    rectype != DB_txn_xa_regop && (ret =
+			    __db_txnlist_add(dbenv, info, txnid, 1)) != 0)
+				return (ret);
+		}
+		break;
+	case DB_TXN_FORWARD_ROLL:
+		/*
+		 * In the forward pass, if we haven't seen the transaction,
+		 * do nothing; otherwise, recover it.
+		 *
+		 * We need to always redo DB_db_noop records, so that we
+		 * properly handle any commits after the file was closed.
+		 */
+		if (rectype == DB_log_register ||
+		    rectype == DB_txn_ckp ||
+		    rectype == DB_db_noop ||
+		    __db_txnlist_find(info, txnid) == 0)
+			make_call = 1;
+		break;
+	default:
+		return (__db_unknown_flag(dbenv, "__db_dispatch", redo));
+	}
+
+	if (make_call) {
+		if (rectype >= DB_user_BEGIN && dbenv->tx_recover != NULL)
+			return (dbenv->tx_recover(dbenv, db, lsnp, redo));
+		else
+			return (dbenv->dtab[rectype](dbenv, db, lsnp, redo, info));
+	}
+
+	return (0);
+}
+
+/*
+ * __db_add_recovery --
+ *
+ * PUBLIC: int __db_add_recovery __P((DB_ENV *,
+ * PUBLIC:    int (*)(DB_ENV *, DBT *, DB_LSN *, db_recops, void *), u_int32_t));
+ */
+int
+__db_add_recovery(dbenv, func, ndx)
+	DB_ENV *dbenv;
+	int (*func) __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *));
+	u_int32_t ndx;
+{
+	u_int32_t i, nsize;
+	int ret;
+
+	/* Check if we have to grow the table. */
+	if (ndx >= dbenv->dtab_size) {
+		nsize = ndx + 40;
+		if ((ret = __os_realloc(dbenv,
+		    nsize * sizeof(dbenv->dtab[0]), NULL, &dbenv->dtab)) != 0)
+			return (ret);
+		for (i = dbenv->dtab_size; i < nsize; ++i)
+			dbenv->dtab[i] = NULL;
+		dbenv->dtab_size = nsize;
+	}
+
+	dbenv->dtab[ndx] = func;
+	return (0);
+}
+
+/*
+ * __deprecated_recover --
+ *	Stub routine for deprecated recovery functions.
+ *
+ * PUBLIC: int __deprecated_recover
+ * PUBLIC:     __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *));
+ */
+int
+__deprecated_recover(dbenv, dbtp, lsnp, op, info)
+	DB_ENV *dbenv;
+	DBT *dbtp;
+	DB_LSN *lsnp;
+	db_recops op;
+	void *info;
+{
+	COMPQUIET(dbenv, NULL);
+	COMPQUIET(dbtp, NULL);
+	COMPQUIET(lsnp, NULL);
+	COMPQUIET(op, 0);
+	COMPQUIET(info, NULL);
+	return (EINVAL);
+}
+
+/*
+ * __db_txnlist_init --
+ *	Initialize transaction linked list.
+ *
+ * PUBLIC: int __db_txnlist_init __P((DB_ENV *, void *));
+ */
+int
+__db_txnlist_init(dbenv, retp)
+	DB_ENV *dbenv;
+	void *retp;
+{
+	DB_TXNHEAD *headp;
+	int ret;
+
+	if ((ret = __os_malloc(dbenv, sizeof(DB_TXNHEAD), NULL, &headp)) != 0)
+		return (ret);
+
+	LIST_INIT(&headp->head);
+	headp->maxid = 0;
+	headp->generation = 1;
+
+	*(void **)retp = headp;
+	return (0);
+}
+
+/*
+ * __db_txnlist_add --
+ *	Add an element to our transaction linked list.
+ *
+ * PUBLIC: int __db_txnlist_add __P((DB_ENV *, void *, u_int32_t, int32_t));
+ */
+int
+__db_txnlist_add(dbenv, listp, txnid, aborted)
+	DB_ENV *dbenv;
+	void *listp;
+	u_int32_t txnid;
+	int32_t aborted;
+{
+	DB_TXNHEAD *hp;
+	DB_TXNLIST *elp;
+	int ret;
+
+	if ((ret = __os_malloc(dbenv, sizeof(DB_TXNLIST), NULL, &elp)) != 0)
+		return (ret);
+
+	hp = (DB_TXNHEAD *)listp;
+	LIST_INSERT_HEAD(&hp->head, elp, links);
+
+	elp->type = TXNLIST_TXNID;
+	elp->u.t.txnid = txnid;
+	elp->u.t.aborted = aborted;
+	if (txnid > hp->maxid)
+		hp->maxid = txnid;
+	elp->u.t.generation = hp->generation;
+
+	return (0);
+}
+/*
+ * __db_txnlist_remove --
+ *	Remove an element from our transaction linked list.
+ *
+ * PUBLIC: int __db_txnlist_remove __P((void *, u_int32_t));
+ */
+int
+__db_txnlist_remove(listp, txnid)
+	void *listp;
+	u_int32_t txnid;
+{
+	DB_TXNLIST *entry;
+
+	return (__db_txnlist_find_internal(listp,
+	     TXNLIST_TXNID, txnid, NULL, &entry, 1));
+}
+
+/* __db_txnlist_close --
+ *
+ *	Call this when we close a file.  It allows us to reconcile whether
+ * we have done any operations on this file with whether the file appears
+ * to have been deleted.  If you never do any operations on a file, then
+ * we assume it's OK to appear deleted.
+ *
+ * PUBLIC: int __db_txnlist_close __P((void *, int32_t, u_int32_t));
+ */
+
+int
+__db_txnlist_close(listp, lid, count)
+	void *listp;
+	int32_t lid;
+	u_int32_t count;
+{
+	DB_TXNHEAD *hp;
+	DB_TXNLIST *p;
+
+	hp = (DB_TXNHEAD *)listp;
+	for (p = LIST_FIRST(&hp->head); p != NULL; p = LIST_NEXT(p, links)) {
+		if (p->type == TXNLIST_DELETE)
+			if (lid == p->u.d.fileid &&
+			    !F_ISSET(&p->u.d, TXNLIST_FLAG_CLOSED)) {
+				p->u.d.count += count;
+				return (0);
+			}
+	}
+
+	return (0);
+}
+
+/*
+ * __db_txnlist_delete --
+ *
+ *	Record that a file was missing or deleted.  If the deleted
+ * flag is set, then we've encountered a delete of a file, else we've
+ * just encountered a file that is missing.  The lid is the log fileid
+ * and is only meaningful if deleted is not equal to 0.
+ *
+ * PUBLIC: int __db_txnlist_delete __P((DB_ENV *,
+ * PUBLIC:     void *, char *, u_int32_t, int));
+ */
+int
+__db_txnlist_delete(dbenv, listp, name, lid, deleted)
+	DB_ENV *dbenv;
+	void *listp;
+	char *name;
+	u_int32_t lid;
+	int deleted;
+{
+	DB_TXNHEAD *hp;
+	DB_TXNLIST *p;
+	int ret;
+
+	hp = (DB_TXNHEAD *)listp;
+	for (p = LIST_FIRST(&hp->head); p != NULL; p = LIST_NEXT(p, links)) {
+		if (p->type == TXNLIST_DELETE)
+			if (strcmp(name, p->u.d.fname) == 0) {
+				if (deleted)
+					F_SET(&p->u.d, TXNLIST_FLAG_DELETED);
+				else
+					F_CLR(&p->u.d, TXNLIST_FLAG_CLOSED);
+				return (0);
+			}
+	}
+
+	/* Need to add it. */
+	if ((ret = __os_malloc(dbenv, sizeof(DB_TXNLIST), NULL, &p)) != 0)
+		return (ret);
+	LIST_INSERT_HEAD(&hp->head, p, links);
+
+	p->type = TXNLIST_DELETE;
+	p->u.d.flags = 0;
+	if (deleted)
+		F_SET(&p->u.d, TXNLIST_FLAG_DELETED);
+	p->u.d.fileid = lid;
+	p->u.d.count = 0;
+	ret = __os_strdup(dbenv, name, &p->u.d.fname);
+
+	return (ret);
+}
+
+/*
+ * __db_txnlist_end --
+ *	Discard transaction linked list. Print out any error messages
+ * for deleted files.
+ *
+ * PUBLIC: void __db_txnlist_end __P((DB_ENV *, void *));
+ */
+void
+__db_txnlist_end(dbenv, listp)
+	DB_ENV *dbenv;
+	void *listp;
+{
+	DB_TXNHEAD *hp;
+	DB_TXNLIST *p;
+	DB_LOG *lp;
+
+	hp = (DB_TXNHEAD *)listp;
+	lp = (DB_LOG *)dbenv->lg_handle;
+	while (hp != NULL && (p = LIST_FIRST(&hp->head)) != NULL) {
+		LIST_REMOVE(p, links);
+		switch (p->type) {
+		case TXNLIST_DELETE:
+			/*
+			 * If we have a file that is not deleted and has
+			 * some operations, we flag the warning.  Since
+			 * the file could still be open, we need to check
+			 * the actual log table as well.
+			 */
+			if ((!F_ISSET(&p->u.d, TXNLIST_FLAG_DELETED) &&
+			    p->u.d.count != 0) ||
+			    (!F_ISSET(&p->u.d, TXNLIST_FLAG_CLOSED) &&
+			    p->u.d.fileid != (int32_t) TXNLIST_INVALID_ID &&
+			    p->u.d.fileid < lp->dbentry_cnt &&
+			    lp->dbentry[p->u.d.fileid].count != 0))
+				__db_err(dbenv, "warning: %s: %s",
+				    p->u.d.fname, db_strerror(ENOENT));
+			__os_freestr(p->u.d.fname);
+			break;
+		case TXNLIST_LSN:
+			__os_free(p->u.l.lsn_array,
+			    p->u.l.maxn * sizeof(DB_LSN));
+			break;
+		default:
+			/* Possibly an incomplete DB_TXNLIST;  just free it. */
+			break;
+		}
+		__os_free(p, sizeof(DB_TXNLIST));
+	}
+	__os_free(listp, sizeof(DB_TXNHEAD));
+}
+
+/*
+ * __db_txnlist_find --
+ *	Checks to see if a txnid with the current generation is in the
+ *	txnid list.  This returns DB_NOTFOUND if the item isn't in the
+ *	list; otherwise it returns (like __db_txnlist_find_internal) a
+ *	1 or 0 indicating if the transaction is aborted or not.  A txnid
+ *	of 0 means the record was generated while not in a transaction.
+ *
+ * PUBLIC: int __db_txnlist_find __P((void *, u_int32_t));
+ */
+int
+__db_txnlist_find(listp, txnid)
+	void *listp;
+	u_int32_t txnid;
+{
+	DB_TXNLIST *entry;
+
+	if (txnid == 0)
+		return (DB_NOTFOUND);
+	return (__db_txnlist_find_internal(listp,
+	     TXNLIST_TXNID, txnid, NULL, &entry, 0));
+}
+
+/*
+ * __db_txnlist_find_internal --
+ *	Find an entry on the transaction list.
+ * If the entry is not there or the list pointer is not initialized,
+ * we return DB_NOTFOUND.  If the item is found, we return the aborted
+ * status (1 for aborted, 0 for not aborted).  Currently we always call
+ * this with an initialized list pointer but checking for NULL keeps it general.
+ */
+static int
+__db_txnlist_find_internal(listp, type, txnid, uid, txnlistp, delete)
+	void *listp;
+	db_txnlist_type type;
+	u_int32_t txnid;
+	u_int8_t uid[DB_FILE_ID_LEN];
+	DB_TXNLIST **txnlistp;
+	int delete;
+{
+	DB_TXNHEAD *hp;
+	DB_TXNLIST *p;
+	int ret;
+
+	if ((hp = (DB_TXNHEAD *)listp) == NULL)
+		return (DB_NOTFOUND);
+
+	for (p = LIST_FIRST(&hp->head); p != NULL; p = LIST_NEXT(p, links)) {
+		if (p->type != type)
+			continue;
+		switch (type) {
+		case TXNLIST_TXNID:
+			if (p->u.t.txnid != txnid ||
+			    hp->generation != p->u.t.generation)
+				continue;
+			ret = p->u.t.aborted;
+			break;
+
+		case TXNLIST_PGNO:
+			if (memcmp(uid, p->u.p.uid, DB_FILE_ID_LEN) != 0)
+				continue;
+
+			ret = 0;
+			break;
+		default:
+			DB_ASSERT(0);
+			ret = EINVAL;
+		}
+		if (delete == 1) {
+			LIST_REMOVE(p, links);
+			__os_free(p, sizeof(DB_TXNLIST));
+		} else if (p != LIST_FIRST(&hp->head)) {
+			/* Move it to head of list. */
+			LIST_REMOVE(p, links);
+			LIST_INSERT_HEAD(&hp->head, p, links);
+		}
+		*txnlistp = p;
+		return (ret);
+	}
+
+	return (DB_NOTFOUND);
+}
+
+/*
+ * __db_txnlist_gen --
+ *	Change the current generation number.
+ *
+ * PUBLIC: void __db_txnlist_gen __P((void *, int));
+ */
+void
+__db_txnlist_gen(listp, incr)
+	void *listp;
+	int incr;
+{
+	DB_TXNHEAD *hp;
+
+	/*
+	 * During recovery generation numbers keep track of how many "restart"
+	 * checkpoints we've seen.  Restart checkpoints occur whenever we take
+	 * a checkpoint and there are no outstanding transactions.  When that
+	 * happens, we can reset transaction IDs back to 1.  It always happens
+	 * at recovery and it prevents us from exhausting the transaction ID
+	 * name space.
+	 */
+	hp = (DB_TXNHEAD *)listp;
+	hp->generation += incr;
+}
+
+#define	TXN_BUBBLE(AP, MAX) {						\
+	int __j;							\
+	DB_LSN __tmp;							\
+									\
+	for (__j = 0; __j < MAX - 1; __j++)				\
+		if (log_compare(&AP[__j], &AP[__j + 1]) < 0) {		\
+			__tmp = AP[__j];				\
+			AP[__j] = AP[__j + 1];				\
+			AP[__j + 1] = __tmp;				\
+		}							\
+}
+
+/*
+ * __db_txnlist_lsnadd --
+ *	Add to or re-sort the transaction list lsn entry.
+ * Note that since this is used during an abort, the __txn_undo
+ * code calls into the "recovery" subsystem explicitly, and there
+ * is only a single TXNLIST_LSN entry on the list.
+ *
+ * PUBLIC: int __db_txnlist_lsnadd __P((DB_ENV *, void *, DB_LSN *, u_int32_t));
+ */
+int
+__db_txnlist_lsnadd(dbenv, listp, lsnp, flags)
+	DB_ENV *dbenv;
+	void *listp;
+	DB_LSN *lsnp;
+	u_int32_t flags;
+{
+	DB_TXNHEAD *hp;
+	DB_TXNLIST *elp;
+	int i, ret;
+
+	hp = (DB_TXNHEAD *)listp;
+
+	for (elp = LIST_FIRST(&hp->head);
+	    elp != NULL; elp = LIST_NEXT(elp, links))
+		if (elp->type == TXNLIST_LSN)
+			break;
+
+	if (elp == NULL)
+		return (EINVAL);
+
+	if (LF_ISSET(TXNLIST_NEW)) {
+		if (elp->u.l.ntxns >= elp->u.l.maxn) {
+			if ((ret = __os_realloc(dbenv,
+			    2 * elp->u.l.maxn * sizeof(DB_LSN),
+			    NULL, &elp->u.l.lsn_array)) != 0)
+				return (ret);
+			elp->u.l.maxn *= 2;
+		}
+		elp->u.l.lsn_array[elp->u.l.ntxns++] = *lsnp;
+	} else
+		/* Simply replace the 0th element. */
+		elp->u.l.lsn_array[0] = *lsnp;
+
+	/*
+	 * If we just added a new entry, there may be NULL entries, so
+	 * we have to do a complete bubble sort, not just trickle a
+	 * changed entry around.
+	 */
+	for (i = 0; i < (!LF_ISSET(TXNLIST_NEW) ? 1 : elp->u.l.ntxns); i++)
+		TXN_BUBBLE(elp->u.l.lsn_array, elp->u.l.ntxns);
+
+	*lsnp = elp->u.l.lsn_array[0];
+
+	return (0);
+}
+
+/*
+ * __db_txnlist_lsnhead --
+ *	Return a pointer to the beginning of the lsn_array.
+ *
+ * PUBLIC: int __db_txnlist_lsnhead __P((void *, DB_LSN **));
+ */
+int
+__db_txnlist_lsnhead(listp, lsnpp)
+	void *listp;
+	DB_LSN **lsnpp;
+{
+	DB_TXNHEAD *hp;
+	DB_TXNLIST *elp;
+
+	hp = (DB_TXNHEAD *)listp;
+
+	for (elp = LIST_FIRST(&hp->head);
+	    elp != NULL; elp = LIST_NEXT(elp, links))
+		if (elp->type == TXNLIST_LSN)
+			break;
+
+	if (elp == NULL)
+		return (EINVAL);
+
+	*lsnpp = &elp->u.l.lsn_array[0];
+
+	return (0);
+}
+
+/*
+ * __db_txnlist_lsninit --
+ *	Initialize a transaction list with an lsn array entry.
+ *
+ * PUBLIC: int __db_txnlist_lsninit __P((DB_ENV *, DB_TXNHEAD *, DB_LSN *));
+ */
+int
+__db_txnlist_lsninit(dbenv, hp, lsnp)
+	DB_ENV *dbenv;
+	DB_TXNHEAD *hp;
+	DB_LSN *lsnp;
+{
+	DB_TXNLIST *elp;
+	int ret;
+
+	elp = NULL;
+
+	if ((ret = __os_malloc(dbenv, sizeof(DB_TXNLIST), NULL, &elp)) != 0)
+		goto err;
+	LIST_INSERT_HEAD(&hp->head, elp, links);
+
+	if ((ret = __os_malloc(dbenv,
+	    12 * sizeof(DB_LSN), NULL, &elp->u.l.lsn_array)) != 0)
+		goto err;
+	elp->type = TXNLIST_LSN;
+	elp->u.l.maxn = 12;
+	elp->u.l.ntxns = 1;
+	elp->u.l.lsn_array[0] = *lsnp;
+
+	return (0);
+
+err:	__db_txnlist_end(dbenv, hp);
+	return (ret);
+}
+
+/*
+ * __db_add_limbo -- add pages to the limbo list.
+ *	Get the file information and call pgnoadd
+ * for each page.
+ *
+ * PUBLIC: int __db_add_limbo __P((DB_ENV *,
+ * PUBLIC:      void *, int32_t, db_pgno_t, int32_t));
+ */
+int
+__db_add_limbo(dbenv, info, fileid, pgno, count)
+	DB_ENV *dbenv;
+	void *info;
+	int32_t fileid;
+	db_pgno_t pgno;
+	int32_t count;
+{
+	DB_LOG *dblp;
+	FNAME *fnp;
+	int ret;
+
+	dblp = dbenv->lg_handle;
+	if ((ret = __log_lid_to_fname(dblp, fileid, &fnp)) != 0)
+		return (ret);
+
+	do {
+		if ((ret =
+		    __db_txnlist_pgnoadd(dbenv, info, fileid, fnp->ufid,
+		    R_ADDR(&dblp->reginfo, fnp->name_off), pgno)) != 0)
+			return (ret);
+		pgno++;
+	} while (--count != 0);
+
+	return (0);
+}
+
+/*
+ * __db_do_the_limbo -- move pages from limbo to free.
+ *
+ * If we are in recovery we add things to the free list without
+ * logging because we want to incrementally apply logs that
+ * may be generated on another copy of this environment.
+ * Otherwise we just call __db_free to put the pages on
+ * the free list and log the activity.
+ *
+ * PUBLIC: int __db_do_the_limbo __P((DB_ENV *, DB_TXNHEAD *));
+ */
+int
+__db_do_the_limbo(dbenv, hp)
+	DB_ENV *dbenv;
+	DB_TXNHEAD *hp;
+{
+	DB *dbp;
+	DBC *dbc;
+	DBMETA *meta;
+	DB_TXN *txn;
+	DB_TXNLIST *elp;
+	PAGE *pagep;
+	db_pgno_t last_pgno, pgno;
+	int i, in_recover, put_page, ret, t_ret;
+
+	dbp = NULL;
+	dbc = NULL;
+	txn = NULL;
+	ret = 0;
+
+	/* Are we in recovery? */
+	in_recover = F_ISSET((DB_LOG *)dbenv->lg_handle, DBLOG_RECOVER);
+
+	for (elp = LIST_FIRST(&hp->head);
+	    elp != NULL; elp = LIST_NEXT(elp, links)) {
+		if (elp->type != TXNLIST_PGNO)
+			continue;
+
+		if (in_recover) {
+			if ((ret = db_create(&dbp, dbenv, 0)) != 0)
+				goto err;
+
+			/*
+			 * It is ok if the file is no longer there.
+			 */
+			dbp->type = DB_UNKNOWN;
+			ret = __db_dbopen(dbp,
+			    elp->u.p.fname, 0, __db_omode("rw----"), 0);
+		} else {
+			/*
+			 * If we are in transaction undo, then we know
+			 * the fileid is still correct.
+			 */
+			if ((ret =
+			    __db_fileid_to_db(dbenv, &dbp,
+			    elp->u.p.fileid, 0)) != 0 && ret != DB_DELETED)
+				goto err;
+			/* File is being destroyed. */
+			if (F_ISSET(dbp, DB_AM_DISCARD))
+				ret = DB_DELETED;
+		}
+		/*
+		 * Verify that we are opening the same file that we were
+		 * referring to when we wrote this log record.
+		 */
+		if (ret == 0 &&
+		    memcmp(elp->u.p.uid, dbp->fileid, DB_FILE_ID_LEN) == 0) {
+			last_pgno = PGNO_INVALID;
+			if (in_recover) {
+				pgno = PGNO_BASE_MD;
+				if ((ret = memp_fget(dbp->mpf,
+				    &pgno, 0, (PAGE **)&meta)) != 0)
+					goto err;
+				last_pgno = meta->free;
+				/*
+				 * Check to see if the head of the free
+				 * list is any of the pages we are about
+				 * to link in.  We could have crashed
+				 * after linking them in and before writing
+				 * a checkpoint.
+				 * It may not be the last one since
+				 * any page may get reallocated before here.
+				 */
+				if (last_pgno != PGNO_INVALID)
+					for (i = 0; i < elp->u.p.nentries; i++)
+						if (last_pgno
+						     == elp->u.p.pgno_array[i])
+							goto done_it;
+			}
+
+			for (i = 0; i < elp->u.p.nentries; i++) {
+				pgno = elp->u.p.pgno_array[i];
+				if ((ret = memp_fget(dbp->mpf,
+				    &pgno, DB_MPOOL_CREATE, &pagep)) != 0)
+					goto err;
+
+				put_page = 1;
+				if (IS_ZERO_LSN(LSN(pagep))) {
+					P_INIT(pagep, dbp->pgsize,
+					    pgno, PGNO_INVALID,
+					    last_pgno, 0, P_INVALID);
+
+					if (in_recover) {
+						LSN(pagep) = LSN(meta);
+						last_pgno = pgno;
+					} else {
+						/*
+						 * Starting the transaction
+						 * is postponed until we know
+						 * we have something to do.
+						 */
+						if (txn == NULL &&
+						    (ret = txn_begin(dbenv,
+						    NULL, &txn, 0)) != 0)
+							goto err;
+
+						if (dbc == NULL &&
+						    (ret = dbp->cursor(dbp,
+						    txn, &dbc, 0)) != 0)
+							goto err;
+						/* Turn off locking. */
+						F_SET(dbc, DBC_COMPENSATE);
+
+						/* __db_free puts the page. */
+						if ((ret =
+						    __db_free(dbc, pagep)) != 0)
+							goto err;
+						put_page = 0;
+					}
+				}
+
+				if (put_page == 1 &&
+				    (ret = memp_fput(dbp->mpf,
+				    pagep, DB_MPOOL_DIRTY)) != 0)
+					goto err;
+			}
+			if (in_recover) {
+				if (last_pgno == meta->free) {
+done_it:
+					if ((ret =
+					    memp_fput(dbp->mpf, meta, 0)) != 0)
+						goto err;
+				} else {
+					/*
+					 * Flush the new free list then
+					 * update the metapage.  This is
+					 * unlogged so we cannot have the
+					 * metapage pointing at pages that
+					 * are not on disk.
+					 */
+					dbp->sync(dbp, 0);
+					meta->free = last_pgno;
+					if ((ret = memp_fput(dbp->mpf,
+					    meta, DB_MPOOL_DIRTY)) != 0)
+						goto err;
+				}
+			}
+			if (dbc != NULL && (ret = dbc->c_close(dbc)) != 0)
+				goto err;
+			dbc = NULL;
+		}
+		if (in_recover && (t_ret = dbp->close(dbp, 0)) != 0 && ret == 0)
+			ret = t_ret;
+		dbp = NULL;
+		__os_free(elp->u.p.fname, 0);
+		__os_free(elp->u.p.pgno_array, 0);
+		if (ret == ENOENT)
+			ret = 0;
+		else if (ret != 0)
+			goto err;
+	}
+
+	if (txn != NULL) {
+		ret = txn_commit(txn, 0);
+		txn = NULL;
+	}
+err:
+	if (dbc != NULL)
+		(void)dbc->c_close(dbc);
+	if (in_recover && dbp != NULL)
+		(void)dbp->close(dbp, 0);
+	if (txn != NULL)
+		(void)txn_abort(txn);
+	return (ret);
+
+}
+
+#define	DB_TXNLIST_MAX_PGNO	8 /* A nice even number. */
+
+/*
+ * __db_txnlist_pgnoadd --
+ *	Find the txnlist entry for a file and add this pgno,
+ * or add the list entry for the file and then add the pgno.
+ *
+ * PUBLIC: int __db_txnlist_pgnoadd __P((DB_ENV *, DB_TXNHEAD *,
+ * PUBLIC:      int32_t, u_int8_t [DB_FILE_ID_LEN], char *, db_pgno_t));
+ */
+int
+__db_txnlist_pgnoadd(dbenv, hp, fileid, uid, fname, pgno)
+	DB_ENV *dbenv;
+	DB_TXNHEAD *hp;
+	int32_t fileid;
+	u_int8_t uid[DB_FILE_ID_LEN];
+	char *fname;
+	db_pgno_t pgno;
+{
+	DB_TXNLIST *elp;
+	int len, ret;
+
+	elp = NULL;
+
+	if (__db_txnlist_find_internal(hp, TXNLIST_PGNO, 0, uid, &elp, 0) != 0) {
+		if ((ret =
+		    __os_malloc(dbenv, sizeof(DB_TXNLIST), NULL, &elp)) != 0)
+			goto err;
+		LIST_INSERT_HEAD(&hp->head, elp, links);
+		elp->u.p.fileid = fileid;
+		memcpy(elp->u.p.uid, uid, DB_FILE_ID_LEN);
+
+		len = strlen(fname) + 1;
+		if ((ret = __os_malloc(dbenv, len, NULL, &elp->u.p.fname)) != 0)
+			goto err;
+		memcpy(elp->u.p.fname, fname, len);
+
+		elp->u.p.maxentry = 0;
+		elp->type = TXNLIST_PGNO;
+		if ((ret = __os_malloc(dbenv,
+		    8 * sizeof(db_pgno_t), NULL, &elp->u.p.pgno_array)) != 0)
+			goto err;
+		elp->u.p.maxentry = DB_TXNLIST_MAX_PGNO;
+		elp->u.p.nentries = 0;
+	} else if (elp->u.p.nentries == elp->u.p.maxentry) {
+		elp->u.p.maxentry <<= 1;
+		if ((ret = __os_realloc(dbenv, elp->u.p.maxentry *
+		    sizeof(db_pgno_t), NULL, &elp->u.p.pgno_array)) != 0)
+			goto err;
+	}
+
+	elp->u.p.pgno_array[elp->u.p.nentries++] = pgno;
+
+	return (0);
+
+err:	__db_txnlist_end(dbenv, hp);
+	return (ret);
+}
+
+#ifdef DEBUG
+/*
+ * __db_txnlist_print --
+ *	Print out the transaction list.
+ *
+ * PUBLIC: void __db_txnlist_print __P((void *));
+ */
+void
+__db_txnlist_print(listp)
+	void *listp;
+{
+	DB_TXNHEAD *hp;
+	DB_TXNLIST *p;
+
+	hp = (DB_TXNHEAD *)listp;
+
+	printf("Maxid: %lu Generation: %lu\n",
+	    (u_long)hp->maxid, (u_long)hp->generation);
+	for (p = LIST_FIRST(&hp->head); p != NULL; p = LIST_NEXT(p, links)) {
+		switch (p->type) {
+		case TXNLIST_TXNID:
+			printf("TXNID: %lu(%lu)\n",
+			    (u_long)p->u.t.txnid, (u_long)p->u.t.generation);
+			break;
+		case TXNLIST_DELETE:
+			printf("FILE: %s id=%d ops=%d %s %s\n",
+			    p->u.d.fname, p->u.d.fileid, p->u.d.count,
+			    F_ISSET(&p->u.d, TXNLIST_FLAG_DELETED) ?
+			    "(deleted)" : "(missing)",
+			    F_ISSET(&p->u.d, TXNLIST_FLAG_CLOSED) ?
+			    "(closed)" : "(open)");
+
+			break;
+		default:
+			printf("Unrecognized type: %d\n", p->type);
+			break;
+		}
+	}
+}
+#endif
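
Note on the LSN array ordering used by __db_txnlist_lsnadd above: the
following is a minimal, standalone sketch of why replacing element 0 needs
only one bubble pass while appending a new element needs a full sort.  It is
not part of the imported source.  The names struct lsn, lsn_cmp, bubble_pass,
replace_head and append are hypothetical stand-ins (for DB_LSN, log_compare
and TXN_BUBBLE respectively), and the (file, offset) layout is an assumption
made for illustration only.

```c
#include <assert.h>

/* Hypothetical stand-in for DB_LSN: ordered by file, then offset. */
struct lsn {
	unsigned file;
	unsigned offset;
};

/* Three-way comparison, in the spirit of log_compare (assumed ordering). */
static int
lsn_cmp(const struct lsn *a, const struct lsn *b)
{
	if (a->file != b->file)
		return (a->file < b->file ? -1 : 1);
	if (a->offset != b->offset)
		return (a->offset < b->offset ? -1 : 1);
	return (0);
}

/* One forward pass: swap neighbours so the larger LSN moves toward index 0. */
static void
bubble_pass(struct lsn *a, int n)
{
	struct lsn tmp;
	int j;

	for (j = 0; j < n - 1; j++)
		if (lsn_cmp(&a[j], &a[j + 1]) < 0) {
			tmp = a[j];
			a[j] = a[j + 1];
			a[j + 1] = tmp;
		}
}

/*
 * Replace element 0 of an array already sorted in descending order.
 * A single pass trickles the new value down to its place.
 */
static void
replace_head(struct lsn *a, int n, struct lsn v)
{
	assert(n > 0);
	a[0] = v;
	bubble_pass(a, n);
}

/*
 * Append a value at the end.  One pass moves it forward by at most one
 * slot, so up to n passes are needed to restore descending order.
 */
static void
append(struct lsn *a, int *np, int cap, struct lsn v)
{
	int i;

	assert(*np < cap);
	a[(*np)++] = v;
	for (i = 0; i < *np; i++)
		bubble_pass(a, *np);
}
```

After either call the largest LSN sits at index 0, which is the value
__db_txnlist_lsnadd hands back to its caller through lsnp.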
diff --git a/bdb/db/db_dup.c b/bdb/db/db_dup.c
new file mode 100644
index 00000000000..6d8b2df9518
--- /dev/null
+++ b/bdb/db/db_dup.c
@@ -0,0 +1,275 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ *	Sleepycat Software.  All rights reserved.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_dup.c,v 11.18 2000/11/30 00:58:32 ubell Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_shash.h"
+#include "btree.h"
+#include "hash.h"
+#include "lock.h"
+#include "db_am.h"
+
+/*
+ * __db_ditem --
+ *	Remove an item from a page.
+ *
+ * PUBLIC:  int __db_ditem __P((DBC *, PAGE *, u_int32_t, u_int32_t));
+ */
+int
+__db_ditem(dbc, pagep, indx, nbytes)
+	DBC *dbc;
+	PAGE *pagep;
+	u_int32_t indx, nbytes;
+{
+	DB *dbp;
+	DBT ldbt;
+	db_indx_t cnt, offset;
+	int ret;
+	u_int8_t *from;
+
+	dbp = dbc->dbp;
+	if (DB_LOGGING(dbc)) {
+		ldbt.data = P_ENTRY(pagep, indx);
+		ldbt.size = nbytes;
+		if ((ret = __db_addrem_log(dbp->dbenv, dbc->txn,
+		    &LSN(pagep), 0, DB_REM_DUP, dbp->log_fileid, PGNO(pagep),
+		    (u_int32_t)indx, nbytes, &ldbt, NULL, &LSN(pagep))) != 0)
+			return (ret);
+	}
+
+	/*
+	 * If there's only a single item on the page, we don't have to
+	 * work hard.
+	 */
+	if (NUM_ENT(pagep) == 1) {
+		NUM_ENT(pagep) = 0;
+		HOFFSET(pagep) = dbp->pgsize;
+		return (0);
+	}
+
+	/*
+	 * Pack the remaining key/data items at the end of the page.  Use
+	 * memmove(3), the regions may overlap.
+	 */
+	from = (u_int8_t *)pagep + HOFFSET(pagep);
+	memmove(from + nbytes, from, pagep->inp[indx] - HOFFSET(pagep));
+	HOFFSET(pagep) += nbytes;
+
+	/* Adjust the indices' offsets. */
+	offset = pagep->inp[indx];
+	for (cnt = 0; cnt < NUM_ENT(pagep); ++cnt)
+		if (pagep->inp[cnt] < offset)
+			pagep->inp[cnt] += nbytes;
+
+	/* Shift the indices down. */
+	--NUM_ENT(pagep);
+	if (indx != NUM_ENT(pagep))
+		memmove(&pagep->inp[indx], &pagep->inp[indx + 1],
+		    sizeof(db_indx_t) * (NUM_ENT(pagep) - indx));
+
+	return (0);
+}
+
+/*
+ * __db_pitem --
+ *	Put an item on a page.
+ *
+ * PUBLIC: int __db_pitem
+ * PUBLIC:     __P((DBC *, PAGE *, u_int32_t, u_int32_t, DBT *, DBT *));
+ */
+int
+__db_pitem(dbc, pagep, indx, nbytes, hdr, data)
+	DBC *dbc;
+	PAGE *pagep;
+	u_int32_t indx;
+	u_int32_t nbytes;
+	DBT *hdr, *data;
+{
+	DB *dbp;
+	BKEYDATA bk;
+	DBT thdr;
+	int ret;
+	u_int8_t *p;
+
+	if (nbytes > P_FREESPACE(pagep)) {
+		DB_ASSERT(nbytes <= P_FREESPACE(pagep));
+		return (EINVAL);
+	}
+	/*
+	 * Put a single item onto a page.  The logic figuring out where to
+	 * insert and whether it fits is handled in the caller.  All we do
+	 * here is manage the page shuffling.  We cheat a little bit in that
+	 * we don't want to copy the dbt on a normal put twice.  If hdr is
+	 * NULL, we create a BKEYDATA structure on the page, otherwise, just
+	 * copy the caller's information onto the page.
+	 *
+	 * This routine is also used to put entries onto the page where the
+	 * entry is pre-built, e.g., during recovery.  In this case, the hdr
+	 * will point to the entry, and the data argument will be NULL.
+	 *
+	 * !!!
+	 * There's a tremendous potential for off-by-one errors here, since
+	 * the passed in header sizes must be adjusted for the structure's
+	 * placeholder for the trailing variable-length data field.
+	 */
+	dbp = dbc->dbp;
+	if (DB_LOGGING(dbc))
+		if ((ret = __db_addrem_log(dbp->dbenv, dbc->txn,
+		    &LSN(pagep), 0, DB_ADD_DUP, dbp->log_fileid, PGNO(pagep),
+		    (u_int32_t)indx, nbytes, hdr, data, &LSN(pagep))) != 0)
+			return (ret);
+
+	if (hdr == NULL) {
+		B_TSET(bk.type, B_KEYDATA, 0);
+		bk.len = data == NULL ? 0 : data->size;
+
+		thdr.data = &bk;
+		thdr.size = SSZA(BKEYDATA, data);
+		hdr = &thdr;
+	}
+
+	/* Adjust the index table, then put the item on the page. */
+	if (indx != NUM_ENT(pagep))
+		memmove(&pagep->inp[indx + 1], &pagep->inp[indx],
+		    sizeof(db_indx_t) * (NUM_ENT(pagep) - indx));
+	HOFFSET(pagep) -= nbytes;
+	pagep->inp[indx] = HOFFSET(pagep);
+	++NUM_ENT(pagep);
+
+	p = P_ENTRY(pagep, indx);
+	memcpy(p, hdr->data, hdr->size);
+	if (data != NULL)
+		memcpy(p + hdr->size, data->data, data->size);
+
+	return (0);
+}
+
+/*
+ * __db_relink --
+ *	Relink around a deleted page.
+ *
+ * PUBLIC: int __db_relink __P((DBC *, u_int32_t, PAGE *, PAGE **, int));
+ */
+int
+__db_relink(dbc, add_rem, pagep, new_next, needlock)
+	DBC *dbc;
+	u_int32_t add_rem;
+	PAGE *pagep, **new_next;
+	int needlock;
+{
+	DB *dbp;
+	PAGE *np, *pp;
+	DB_LOCK npl, ppl;
+	DB_LSN *nlsnp, *plsnp, ret_lsn;
+	int ret;
+
+	ret = 0;
+	np = pp = NULL;
+	npl.off = ppl.off = LOCK_INVALID;
+	nlsnp = plsnp = NULL;
+	dbp = dbc->dbp;
+
+	/*
+	 * Retrieve and lock the one/two pages.  For a remove, we may need
+	 * two pages (the before and after).  For an add, we only need one
+	 * because the split took care of the prev.
+	 */
+	if (pagep->next_pgno != PGNO_INVALID) {
+		if (needlock && (ret = __db_lget(dbc,
+		    0, pagep->next_pgno, DB_LOCK_WRITE, 0, &npl)) != 0)
+			goto err;
+		if ((ret = memp_fget(dbp->mpf,
+		    &pagep->next_pgno, 0, &np)) != 0) {
+			(void)__db_pgerr(dbp, pagep->next_pgno);
+			goto err;
+		}
+		nlsnp = &np->lsn;
+	}
+	if (add_rem == DB_REM_PAGE && pagep->prev_pgno != PGNO_INVALID) {
+		if (needlock && (ret = __db_lget(dbc,
+		    0, pagep->prev_pgno, DB_LOCK_WRITE, 0, &ppl)) != 0)
+			goto err;
+		if ((ret = memp_fget(dbp->mpf,
+		    &pagep->prev_pgno, 0, &pp)) != 0) {
+			(void)__db_pgerr(dbp, pagep->prev_pgno);
+			goto err;
+		}
+		plsnp = &pp->lsn;
+	}
+
+	/* Log the change. */
+	if (DB_LOGGING(dbc)) {
+		if ((ret = __db_relink_log(dbp->dbenv, dbc->txn,
+		    &ret_lsn, 0, add_rem, dbp->log_fileid,
+		    pagep->pgno, &pagep->lsn,
+		    pagep->prev_pgno, plsnp, pagep->next_pgno, nlsnp)) != 0)
+			goto err;
+		if (np != NULL)
+			np->lsn = ret_lsn;
+		if (pp != NULL)
+			pp->lsn = ret_lsn;
+		if (add_rem == DB_REM_PAGE)
+			pagep->lsn = ret_lsn;
+	}
+
+	/*
+	 * Modify and release the two pages.
+	 *
+	 * !!!
+	 * The parameter new_next gets set to the page following the page we
+	 * are removing.  If there is no following page, then new_next gets
+	 * set to NULL.
+	 */
+	if (np != NULL) {
+		if (add_rem == DB_ADD_PAGE)
+			np->prev_pgno = pagep->pgno;
+		else
+			np->prev_pgno = pagep->prev_pgno;
+		if (new_next == NULL)
+			ret = memp_fput(dbp->mpf, np, DB_MPOOL_DIRTY);
+		else {
+			*new_next = np;
+			ret = memp_fset(dbp->mpf, np, DB_MPOOL_DIRTY);
+		}
+		if (ret != 0)
+			goto err;
+		if (needlock)
+			(void)__TLPUT(dbc, npl);
+	} else if (new_next != NULL)
+		*new_next = NULL;
+
+	if (pp != NULL) {
+		pp->next_pgno = pagep->next_pgno;
+		if ((ret = memp_fput(dbp->mpf, pp, DB_MPOOL_DIRTY)) != 0)
+			goto err;
+		if (needlock)
+			(void)__TLPUT(dbc, ppl);
+	}
+	return (0);
+
+err:	if (np != NULL)
+		(void)memp_fput(dbp->mpf, np, 0);
+	if (needlock && npl.off != LOCK_INVALID)
+		(void)__TLPUT(dbc, npl);
+	if (pp != NULL)
+		(void)memp_fput(dbp->mpf, pp, 0);
+	if (needlock && ppl.off != LOCK_INVALID)
+		(void)__TLPUT(dbc, ppl);
+	return (ret);
+}
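
Note on the page shuffling done by __db_pitem and __db_ditem above: the
following is a simplified, standalone model of a slotted page, offered as a
sketch rather than as the library's layout.  struct page, page_init, page_put
and page_del are hypothetical names.  Unlike a real PAGE, the index array and
the data bytes live in separate fixed-size arrays here, and nothing is
logged.

```c
#include <assert.h>
#include <string.h>

#define	PGSIZE	64	/* Toy page size (assumption). */
#define	MAXENT	8	/* Toy index capacity (assumption). */

struct page {
	unsigned short nent;		/* Number of entries. */
	unsigned short hoffset;		/* Lowest data byte in use. */
	unsigned short inp[MAXENT];	/* Per-entry offsets into data. */
	unsigned char data[PGSIZE];	/* Items packed at the high end. */
};

static void
page_init(struct page *p)
{
	p->nent = 0;
	p->hoffset = PGSIZE;
}

/* Insert an item at index indx: open an index slot, then copy the bytes. */
static int
page_put(struct page *p, unsigned indx, const void *item, unsigned short nbytes)
{
	if (p->nent == MAXENT || indx > p->nent || nbytes > p->hoffset)
		return (-1);
	if (indx != p->nent)
		memmove(&p->inp[indx + 1], &p->inp[indx],
		    sizeof(p->inp[0]) * (p->nent - indx));
	p->hoffset -= nbytes;
	p->inp[indx] = p->hoffset;
	++p->nent;
	memcpy(p->data + p->hoffset, item, nbytes);
	return (0);
}

/* Remove the nbytes-long item at index indx and repack the page. */
static void
page_del(struct page *p, unsigned indx, unsigned short nbytes)
{
	unsigned short cnt, offset;

	assert(indx < p->nent);
	if (p->nent == 1) {		/* Single item: just reset the page. */
		page_init(p);
		return;
	}
	/* Slide everything stored below the item up over it. */
	memmove(p->data + p->hoffset + nbytes,
	    p->data + p->hoffset, p->inp[indx] - p->hoffset);
	p->hoffset += nbytes;

	/* Items that lived below the removed one moved up by nbytes. */
	offset = p->inp[indx];
	for (cnt = 0; cnt < p->nent; ++cnt)
		if (p->inp[cnt] < offset)
			p->inp[cnt] += nbytes;

	/* Close the gap in the index array. */
	--p->nent;
	if (indx != p->nent)
		memmove(&p->inp[indx], &p->inp[indx + 1],
		    sizeof(p->inp[0]) * (p->nent - indx));
}
```

As in __db_ditem, memmove is used in page_del because the source and
destination byte ranges can overlap.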
diff --git a/bdb/db/db_iface.c b/bdb/db/db_iface.c
new file mode 100644
index 00000000000..3548a2527bb
--- /dev/null
+++ b/bdb/db/db_iface.c
@@ -0,0 +1,687 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ *	Sleepycat Software.  All rights reserved.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_iface.c,v 11.34 2001/01/11 18:19:51 bostic Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <errno.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_am.h"
+#include "btree.h"
+
+static int __db_curinval __P((const DB_ENV *));
+static int __db_rdonly __P((const DB_ENV *, const char *));
+static int __dbt_ferr __P((const DB *, const char *, const DBT *, int));
+
+/*
+ * __db_cursorchk --
+ *	Common cursor argument checking routine.
+ *
+ * PUBLIC: int __db_cursorchk __P((const DB *, u_int32_t, int));
+ */
+int
+__db_cursorchk(dbp, flags, isrdonly)
+	const DB *dbp;
+	u_int32_t flags;
+	int isrdonly;
+{
+	/* Check for invalid function flags. */
+	switch (flags) {
+	case 0:
+		break;
+	case DB_WRITECURSOR:
+		if (isrdonly)
+			return (__db_rdonly(dbp->dbenv, "DB->cursor"));
+		if (!CDB_LOCKING(dbp->dbenv))
+			return (__db_ferr(dbp->dbenv, "DB->cursor", 0));
+		break;
+	case DB_WRITELOCK:
+		if (isrdonly)
+			return (__db_rdonly(dbp->dbenv, "DB->cursor"));
+		break;
+	default:
+		return (__db_ferr(dbp->dbenv, "DB->cursor", 0));
+	}
+
+	return (0);
+}
+
+/*
+ * __db_ccountchk --
+ *	Common cursor count argument checking routine.
+ *
+ * PUBLIC: int __db_ccountchk __P((const DB *, u_int32_t, int));
+ */
+int
+__db_ccountchk(dbp, flags, isvalid)
+	const DB *dbp;
+	u_int32_t flags;
+	int isvalid;
+{
+	/* Check for invalid function flags. */
+	switch (flags) {
+	case 0:
+		break;
+	default:
+		return (__db_ferr(dbp->dbenv, "DBcursor->c_count", 0));
+	}
+
+	/*
+	 * The cursor must be initialized, return EINVAL for an invalid cursor,
+	 * otherwise 0.
+	 */
+	return (isvalid ? 0 : __db_curinval(dbp->dbenv));
+}
+
+/*
+ * __db_cdelchk --
+ *	Common cursor delete argument checking routine.
+ *
+ * PUBLIC: int __db_cdelchk __P((const DB *, u_int32_t, int, int));
+ */
+int
+__db_cdelchk(dbp, flags, isrdonly, isvalid)
+	const DB *dbp;
+	u_int32_t flags;
+	int isrdonly, isvalid;
+{
+	/* Check for changes to a read-only tree. */
+	if (isrdonly)
+		return (__db_rdonly(dbp->dbenv, "c_del"));
+
+	/* Check for invalid function flags. */
+	switch (flags) {
+	case 0:
+		break;
+	default:
+		return (__db_ferr(dbp->dbenv, "DBcursor->c_del", 0));
+	}
+
+	/*
+	 * The cursor must be initialized, return EINVAL for an invalid cursor,
+	 * otherwise 0.
+	 */
+	return (isvalid ? 0 : __db_curinval(dbp->dbenv));
+}
+
+/*
+ * __db_cgetchk --
+ *	Common cursor get argument checking routine.
+ *
+ * PUBLIC: int __db_cgetchk __P((const DB *, DBT *, DBT *, u_int32_t, int));
+ */
+int
+__db_cgetchk(dbp, key, data, flags, isvalid)
+	const DB *dbp;
+	DBT *key, *data;
+	u_int32_t flags;
+	int isvalid;
+{
+	int ret;
+
+	/*
+	 * Check for read-modify-write validity.  DB_RMW doesn't make sense
+	 * with CDB cursors since if you're going to write the cursor, you
+	 * had to create it with DB_WRITECURSOR.  Regardless, we check for
+	 * LOCKING_ON and not STD_LOCKING, as we don't want to disallow it.
+	 * If this changes, confirm that DB does not itself set the DB_RMW
+	 * flag in a path where CDB may have been configured.
+	 */
+	if (LF_ISSET(DB_RMW)) {
+		if (!LOCKING_ON(dbp->dbenv)) {
+			__db_err(dbp->dbenv,
+			    "the DB_RMW flag requires locking");
+			return (EINVAL);
+		}
+		LF_CLR(DB_RMW);
+	}
+
+	/* Check for invalid function flags. */
+	switch (flags) {
+	case DB_CONSUME:
+	case DB_CONSUME_WAIT:
+		if (dbp->type != DB_QUEUE)
+			goto err;
+		break;
+	case DB_CURRENT:
+	case DB_FIRST:
+	case DB_GET_BOTH:
+	case DB_LAST:
+	case DB_NEXT:
+	case DB_NEXT_DUP:
+	case DB_NEXT_NODUP:
+	case DB_PREV:
+	case DB_PREV_NODUP:
+	case DB_SET:
+	case DB_SET_RANGE:
+		break;
+	case DB_GET_BOTHC:
+		if (dbp->type == DB_QUEUE)
+			goto err;
+		break;
+	case DB_GET_RECNO:
+		if (!F_ISSET(dbp, DB_BT_RECNUM))
+			goto err;
+		break;
+	case DB_SET_RECNO:
+		if (!F_ISSET(dbp, DB_BT_RECNUM))
+			goto err;
+		break;
+	default:
+err:		return (__db_ferr(dbp->dbenv, "DBcursor->c_get", 0));
+	}
+
+	/* Check for invalid key/data flags. */
+	if ((ret = __dbt_ferr(dbp, "key", key, 0)) != 0)
+		return (ret);
+	if ((ret = __dbt_ferr(dbp, "data", data, 0)) != 0)
+		return (ret);
+
+	/*
+	 * The cursor must be initialized for DB_CURRENT or DB_NEXT_DUP,
+	 * return EINVAL for an invalid cursor, otherwise 0.
+	 */
+	if (isvalid || (flags != DB_CURRENT && flags != DB_NEXT_DUP))
+		return (0);
+
+	return (__db_curinval(dbp->dbenv));
+}
+
+/*
+ * __db_cputchk --
+ *	Common cursor put argument checking routine.
+ *
+ * PUBLIC: int __db_cputchk __P((const DB *,
+ * PUBLIC:    const DBT *, DBT *, u_int32_t, int, int));
+ */
+int
+__db_cputchk(dbp, key, data, flags, isrdonly, isvalid)
+	const DB *dbp;
+	const DBT *key;
+	DBT *data;
+	u_int32_t flags;
+	int isrdonly, isvalid;
+{
+	int key_flags, ret;
+
+	key_flags = 0;
+
+	/* Check for changes to a read-only tree. */
+	if (isrdonly)
+		return (__db_rdonly(dbp->dbenv, "c_put"));
+
+	/* Check for invalid function flags. */
+	switch (flags) {
+	case DB_AFTER:
+	case DB_BEFORE:
+		switch (dbp->type) {
+		case DB_BTREE:
+		case DB_HASH:		/* Only with unsorted duplicates. */
+			if (!F_ISSET(dbp, DB_AM_DUP))
+				goto err;
+			if (dbp->dup_compare != NULL)
+				goto err;
+			break;
+		case DB_QUEUE:		/* Not permitted. */
+			goto err;
+		case DB_RECNO:		/* Only with mutable record numbers. */
+			if (!F_ISSET(dbp, DB_RE_RENUMBER))
+				goto err;
+			key_flags = 1;
+			break;
+		default:
+			goto err;
+		}
+		break;
+	case DB_CURRENT:
+		/*
+		 * If there is a comparison function, doing a DB_CURRENT
+		 * must not change the part of the data item that is used
+		 * for the comparison.
+		 */
+		break;
+	case DB_NODUPDATA:
+		if (!F_ISSET(dbp, DB_AM_DUPSORT))
+			goto err;
+		/* FALLTHROUGH */
+	case DB_KEYFIRST:
+	case DB_KEYLAST:
+		if (dbp->type == DB_QUEUE || dbp->type == DB_RECNO)
+			goto err;
+		key_flags = 1;
+		break;
+	default:
+err:		return (__db_ferr(dbp->dbenv, "DBcursor->c_put", 0));
+	}
+
+	/* Check for invalid key/data flags. */
+	if (key_flags && (ret = __dbt_ferr(dbp, "key", key, 0)) != 0)
+		return (ret);
+	if ((ret = __dbt_ferr(dbp, "data", data, 0)) != 0)
+		return (ret);
+
+	/*
+	 * The cursor must be initialized unless the flag is DB_KEYFIRST,
+	 * DB_KEYLAST or DB_NODUPDATA; return EINVAL if it is not, else 0.
+	 */
+	if (isvalid || flags == DB_KEYFIRST ||
+	    flags == DB_KEYLAST || flags == DB_NODUPDATA)
+		return (0);
+
+	return (__db_curinval(dbp->dbenv));
+}
+
+/*
+ * __db_closechk --
+ *	DB->close flag check.
+ *
+ * PUBLIC: int __db_closechk __P((const DB *, u_int32_t));
+ */
+int
+__db_closechk(dbp, flags)
+	const DB *dbp;
+	u_int32_t flags;
+{
+	/* Check for invalid function flags. */
+	switch (flags) {
+	case 0:
+	case DB_NOSYNC:
+		break;
+	default:
+		return (__db_ferr(dbp->dbenv, "DB->close", 0));
+	}
+
+	return (0);
+}
+
+/*
+ * __db_delchk --
+ *	Common delete argument checking routine.
+ *
+ * PUBLIC: int __db_delchk __P((const DB *, DBT *, u_int32_t, int));
+ */
+int
+__db_delchk(dbp, key, flags, isrdonly)
+	const DB *dbp;
+	DBT *key;
+	u_int32_t flags;
+	int isrdonly;
+{
+	COMPQUIET(key, NULL);
+
+	/* Check for changes to a read-only tree. */
+	if (isrdonly)
+		return (__db_rdonly(dbp->dbenv, "delete"));
+
+	/* Check for invalid function flags. */
+	switch (flags) {
+	case 0:
+		break;
+	default:
+		return (__db_ferr(dbp->dbenv, "DB->del", 0));
+	}
+
+	return (0);
+}
+
+/*
+ * __db_getchk --
+ *	Common get argument checking routine.
+ *
+ * PUBLIC: int __db_getchk __P((const DB *, const DBT *, DBT *, u_int32_t));
+ */
+int
+__db_getchk(dbp, key, data, flags)
+	const DB *dbp;
+	const DBT *key;
+	DBT *data;
+	u_int32_t flags;
+{
+	int ret;
+
+	/*
+	 * Check for read-modify-write validity.  DB_RMW doesn't make sense
+	 * with CDB cursors since if you're going to write the cursor, you
+	 * had to create it with DB_WRITECURSOR.  Regardless, we check for
+	 * LOCKING_ON and not STD_LOCKING, as we don't want to disallow it.
+	 * If this changes, confirm that DB does not itself set the DB_RMW
+	 * flag in a path where CDB may have been configured.
+	 */
+	if (LF_ISSET(DB_RMW)) {
+		if (!LOCKING_ON(dbp->dbenv)) {
+			__db_err(dbp->dbenv,
+			    "the DB_RMW flag requires locking");
+			return (EINVAL);
+		}
+		LF_CLR(DB_RMW);
+	}
+
+	/* Check for invalid function flags. */
+	switch (flags) {
+	case 0:
+	case DB_GET_BOTH:
+		break;
+	case DB_SET_RECNO:
+		if (!F_ISSET(dbp, DB_BT_RECNUM))
+			goto err;
+		break;
+	case DB_CONSUME:
+	case DB_CONSUME_WAIT:
+		if (dbp->type == DB_QUEUE)
+			break;
+		/* FALLTHROUGH */
+	default:
+err:		return (__db_ferr(dbp->dbenv, "DB->get", 0));
+	}
+
+	/* Check for invalid key/data flags. */
+	if ((ret = __dbt_ferr(dbp, "key", key, flags == DB_SET_RECNO)) != 0)
+		return (ret);
+	if ((ret = __dbt_ferr(dbp, "data", data, 1)) != 0)
+		return (ret);
+
+	return (0);
+}
+
+/*
+ * __db_joinchk --
+ *	Common join argument checking routine.
+ *
+ * PUBLIC: int __db_joinchk __P((const DB *, DBC * const *, u_int32_t));
+ */
+int
+__db_joinchk(dbp, curslist, flags)
+	const DB *dbp;
+	DBC * const *curslist;
+	u_int32_t flags;
+{
+	DB_TXN *txn;
+	int i;
+
+	switch (flags) {
+	case 0:
+	case DB_JOIN_NOSORT:
+		break;
+	default:
+		return (__db_ferr(dbp->dbenv, "DB->join", 0));
+	}
+
+	if (curslist == NULL || curslist[0] == NULL) {
+		__db_err(dbp->dbenv,
+	    "At least one secondary cursor must be specified to DB->join");
+		return (EINVAL);
+	}
+
+	txn = curslist[0]->txn;
+	for (i = 1; curslist[i] != NULL; i++)
+		if (curslist[i]->txn != txn) {
+			__db_err(dbp->dbenv,
+		    "All secondary cursors must share the same transaction");
+			return (EINVAL);
+		}
+
+	return (0);
+}
+
+/*
+ * __db_joingetchk --
+ *	Common join_get argument checking routine.
+ *
+ * PUBLIC: int __db_joingetchk __P((const DB *, DBT *, u_int32_t));
+ */
+int
+__db_joingetchk(dbp, key, flags)
+	const DB *dbp;
+	DBT *key;
+	u_int32_t flags;
+{
+
+	if (LF_ISSET(DB_RMW)) {
+		if (!LOCKING_ON(dbp->dbenv)) {
+			__db_err(dbp->dbenv,
+			    "the DB_RMW flag requires locking");
+			return (EINVAL);
+		}
+		LF_CLR(DB_RMW);
+	}
+
+	switch (flags) {
+	case 0:
+	case DB_JOIN_ITEM:
+		break;
+	default:
+		return (__db_ferr(dbp->dbenv, "DBcursor->c_get", 0));
+	}
+
+	/*
+	 * A partial get of the key of a join cursor doesn't make much sense;
+	 * the entire key is necessary to query the primary database
+	 * and find the datum, and so regardless of the size of the key
+	 * it would not be a performance improvement.  Since it would require
+	 * special handling, we simply disallow it.
+	 *
+	 * A partial get of the data, however, potentially makes sense (if
+	 * all possible data are a predictable large structure, for instance)
+	 * and causes us no headaches, so we permit it.
+	 */
+	if (F_ISSET(key, DB_DBT_PARTIAL)) {
+		__db_err(dbp->dbenv,
+		    "DB_DBT_PARTIAL may not be set on key during join_get");
+		return (EINVAL);
+	}
+
+	return (0);
+}
+
+/*
+ * __db_putchk --
+ *	Common put argument checking routine.
+ *
+ * PUBLIC: int __db_putchk
+ * PUBLIC:    __P((const DB *, DBT *, const DBT *, u_int32_t, int, int));
+ */
+int
+__db_putchk(dbp, key, data, flags, isrdonly, isdup)
+	const DB *dbp;
+	DBT *key;
+	const DBT *data;
+	u_int32_t flags;
+	int isrdonly, isdup;
+{
+	int ret;
+
+	/* Check for changes to a read-only tree. */
+	if (isrdonly)
+		return (__db_rdonly(dbp->dbenv, "put"));
+
+	/* Check for invalid function flags. */
+	switch (flags) {
+	case 0:
+	case DB_NOOVERWRITE:
+		break;
+	case DB_APPEND:
+		if (dbp->type != DB_RECNO && dbp->type != DB_QUEUE)
+			goto err;
+		break;
+	case DB_NODUPDATA:
+		if (F_ISSET(dbp, DB_AM_DUPSORT))
+			break;
+		/* FALLTHROUGH */
+	default:
+err:		return (__db_ferr(dbp->dbenv, "DB->put", 0));
+	}
+
+	/* Check for invalid key/data flags. */
+	if ((ret = __dbt_ferr(dbp, "key", key, 0)) != 0)
+		return (ret);
+	if ((ret = __dbt_ferr(dbp, "data", data, 0)) != 0)
+		return (ret);
+
+	/* Check for partial puts in the presence of duplicates. */
+	if (isdup && F_ISSET(data, DB_DBT_PARTIAL)) {
+		__db_err(dbp->dbenv,
+"a partial put in the presence of duplicates requires a cursor operation");
+		return (EINVAL);
+	}
+
+	return (0);
+}
+
+/*
+ * __db_removechk --
+ *	DB->remove flag check.
+ *
+ * PUBLIC: int __db_removechk __P((const DB *, u_int32_t));
+ */
+int
+__db_removechk(dbp, flags)
+	const DB *dbp;
+	u_int32_t flags;
+{
+	/* Check for invalid function flags. */
+	switch (flags) {
+	case 0:
+		break;
+	default:
+		return (__db_ferr(dbp->dbenv, "DB->remove", 0));
+	}
+
+	return (0);
+}
+
+/*
+ * __db_statchk --
+ *	Common stat argument checking routine.
+ *
+ * PUBLIC: int __db_statchk __P((const DB *, u_int32_t));
+ */
+int
+__db_statchk(dbp, flags)
+	const DB *dbp;
+	u_int32_t flags;
+{
+	/* Check for invalid function flags. */
+	switch (flags) {
+	case 0:
+	case DB_CACHED_COUNTS:
+		break;
+	case DB_RECORDCOUNT:
+		if (dbp->type == DB_RECNO)
+			break;
+		if (dbp->type == DB_BTREE && F_ISSET(dbp, DB_BT_RECNUM))
+			break;
+		goto err;
+	default:
+err:		return (__db_ferr(dbp->dbenv, "DB->stat", 0));
+	}
+
+	return (0);
+}
+
+/*
+ * __db_syncchk --
+ *	Common sync argument checking routine.
+ *
+ * PUBLIC: int __db_syncchk __P((const DB *, u_int32_t));
+ */
+int
+__db_syncchk(dbp, flags)
+	const DB *dbp;
+	u_int32_t flags;
+{
+	/* Check for invalid function flags. */
+	switch (flags) {
+	case 0:
+		break;
+	default:
+		return (__db_ferr(dbp->dbenv, "DB->sync", 0));
+	}
+
+	return (0);
+}
+
+/*
+ * __dbt_ferr --
+ *	Check a DBT for flag errors.
+ */
+static int
+__dbt_ferr(dbp, name, dbt, check_thread)
+	const DB *dbp;
+	const char *name;
+	const DBT *dbt;
+	int check_thread;
+{
+	DB_ENV *dbenv;
+	int ret;
+
+	dbenv = dbp->dbenv;
+
+	/*
+	 * Check for invalid DBT flags.  We allow any of the flags to be
+	 * specified to any DB or DBcursor call so that applications can
+	 * set DB_DBT_MALLOC when retrieving a data item from a secondary
+	 * database and then specify that same DBT as a key to a primary
+	 * database, without having to clear flags.
+	 */
+	if ((ret = __db_fchk(dbenv, name, dbt->flags,
+	    DB_DBT_MALLOC | DB_DBT_DUPOK |
+	    DB_DBT_REALLOC | DB_DBT_USERMEM | DB_DBT_PARTIAL)) != 0)
+		return (ret);
+	switch (F_ISSET(dbt, DB_DBT_MALLOC | DB_DBT_REALLOC | DB_DBT_USERMEM)) {
+	case 0:
+	case DB_DBT_MALLOC:
+	case DB_DBT_REALLOC:
+	case DB_DBT_USERMEM:
+		break;
+	default:
+		return (__db_ferr(dbenv, name, 1));
+	}
+
+	if (check_thread && DB_IS_THREADED(dbp) &&
+	    !F_ISSET(dbt, DB_DBT_MALLOC | DB_DBT_REALLOC | DB_DBT_USERMEM)) {
+		__db_err(dbenv,
+		    "DB_THREAD mandates memory allocation flag on DBT %s",
+		    name);
+		return (EINVAL);
+	}
+	return (0);
+}
+
+/*
+ * __db_rdonly --
+ *	Common readonly message.
+ */
+static int
+__db_rdonly(dbenv, name)
+	const DB_ENV *dbenv;
+	const char *name;
+{
+	__db_err(dbenv, "%s: attempt to modify a read-only tree", name);
+	return (EACCES);
+}
+
+/*
+ * __db_curinval --
+ *	Report that a cursor is in an invalid state.
+ */
+static int
+__db_curinval(dbenv)
+	const DB_ENV *dbenv;
+{
+	__db_err(dbenv,
+	    "Cursor position must be set before performing this operation");
+	return (EINVAL);
+}
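
The DBT flag check in __dbt_ferr above relies on a small idiom: mask out the memory-management flags and switch on the result, so that zero or exactly one flag is accepted and any combination is rejected. The following is a minimal sketch of that idiom only, not the library's code. The EX_DBT_* constants and ex_* functions are hypothetical stand-ins, and their values are illustrative rather than taken from db.h.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-ins for the DB_DBT_* flags; the values are
 * illustrative and are not taken from db.h.
 */
#define	EX_DBT_MALLOC	0x01
#define	EX_DBT_REALLOC	0x02
#define	EX_DBT_USERMEM	0x04
#define	EX_DBT_PARTIAL	0x08
#define	EX_DBT_MEMFLAGS	(EX_DBT_MALLOC | EX_DBT_REALLOC | EX_DBT_USERMEM)

/*
 * ex_dbt_memflags_ok --
 *	Return 1 if at most one memory-management flag is set, else 0.
 */
int
ex_dbt_memflags_ok(uint32_t flags)
{
	switch (flags & EX_DBT_MEMFLAGS) {
	case 0:
	case EX_DBT_MALLOC:
	case EX_DBT_REALLOC:
	case EX_DBT_USERMEM:
		return (1);
	default:
		return (0);
	}
}

/*
 * ex_dbt_thread_ok --
 *	Return 1 unless a threaded handle is given a DBT that carries
 *	no memory-management flag at all.
 */
int
ex_dbt_thread_ok(uint32_t flags, int threaded)
{
	return (!threaded || (flags & EX_DBT_MEMFLAGS) != 0);
}

/*
 * ex_dbt_selfcheck --
 *	Usage example: the combinations the two checks accept and reject.
 */
void
ex_dbt_selfcheck(void)
{
	assert(ex_dbt_memflags_ok(0));
	assert(ex_dbt_memflags_ok(EX_DBT_USERMEM | EX_DBT_PARTIAL));
	assert(!ex_dbt_memflags_ok(EX_DBT_MALLOC | EX_DBT_USERMEM));
	assert(ex_dbt_thread_ok(EX_DBT_MALLOC, 1));
	assert(!ex_dbt_thread_ok(EX_DBT_PARTIAL, 1));
	assert(ex_dbt_thread_ok(0, 0));
}
```

Switching on the masked value keeps the list of legal combinations visible in one place, which is probably why the original prefers it to a bit-counting test.
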
diff --git a/bdb/db/db_join.c b/bdb/db/db_join.c
new file mode 100644
index 00000000000..881dedde0fc
--- /dev/null
+++ b/bdb/db/db_join.c
@@ -0,0 +1,730 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1998, 1999, 2000
+ *	Sleepycat Software.  All rights reserved.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_join.c,v 11.31 2000/12/20 22:41:54 krinsky Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <stdlib.h>
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_join.h"
+#include "db_am.h"
+#include "btree.h"
+
+static int __db_join_close __P((DBC *));
+static int __db_join_cmp __P((const void *, const void *));
+static int __db_join_del __P((DBC *, u_int32_t));
+static int __db_join_get __P((DBC *, DBT *, DBT *, u_int32_t));
+static int __db_join_getnext __P((DBC *, DBT *, DBT *, u_int32_t));
+static int __db_join_put __P((DBC *, DBT *, DBT *, u_int32_t));
+
+/*
+ * Check to see if the Nth secondary cursor of join cursor jc is pointing
+ * to a sorted duplicate set.
+ */
+#define	SORTED_SET(jc, n)   ((jc)->j_curslist[(n)]->dbp->dup_compare != NULL)
+
+/*
+ * This is the duplicate-assisted join functionality.  Right now we're
+ * going to write it such that we return one item at a time, although
+ * I think we may need to optimize it to return them all at once.
+ * It should be easier to get it working this way, and I believe that
+ * changing it should be fairly straightforward.
+ *
+ * We optimize the join by sorting cursors from smallest to largest
+ * cardinality.  In most cases, this is indeed optimal.  However, if
+ * a cursor with large cardinality has very few data in common with the
+ * first cursor, it is possible that the join will be made faster by
+ * putting it earlier in the cursor list.  Since we have no way to detect
+ * cases like this, we simply provide a flag, DB_JOIN_NOSORT, which retains
+ * the sort order specified by the caller, who may know more about the
+ * structure of the data.
+ *
+ * The first cursor moves sequentially through the duplicate set while
+ * the others search explicitly for the duplicate in question.
+ *
+ */
+
+/*
+ * __db_join --
+ *	This is the interface to the duplicate-assisted join functionality.
+ * In the same way that cursors mark a position in a database, a cursor
+ * can mark a position in a join.  While most cursors are created by the
+ * cursor method of a DB, join cursors are created through an explicit
+ * call to DB->join.
+ *
+ * The curslist is an array of existing, initialized cursors and primary
+ * is the DB of the primary file.  The data item that joins all the
+ * cursors in the curslist is used as the key into the primary and that
+ * key and data are returned.  When no more items are left in the join
+ * set, the c_get operation on the join cursor will return DB_NOTFOUND.
+ *
+ * PUBLIC: int __db_join __P((DB *, DBC **, DBC **, u_int32_t));
+ */
+int
+__db_join(primary, curslist, dbcp, flags)
+	DB *primary;
+	DBC **curslist, **dbcp;
+	u_int32_t flags;
+{
+	DB_ENV *dbenv;
+	DBC *dbc;
+	JOIN_CURSOR *jc;
+	int ret;
+	u_int32_t i, ncurs, nslots;
+
+	COMPQUIET(nslots, 0);
+
+	PANIC_CHECK(primary->dbenv);
+
+	if ((ret = __db_joinchk(primary, curslist, flags)) != 0)
+		return (ret);
+
+	dbc = NULL;
+	jc = NULL;
+	dbenv = primary->dbenv;
+
+	if ((ret = __os_calloc(dbenv, 1, sizeof(DBC), &dbc)) != 0)
+		goto err;
+
+	if ((ret = __os_calloc(dbenv,
+	    1, sizeof(JOIN_CURSOR), &jc)) != 0)
+		goto err;
+
+	if ((ret = __os_malloc(dbenv, 256, NULL, &jc->j_key.data)) != 0)
+		goto err;
+	jc->j_key.ulen = 256;
+	F_SET(&jc->j_key, DB_DBT_USERMEM);
+
+	for (jc->j_curslist = curslist;
+	    *jc->j_curslist != NULL; jc->j_curslist++)
+		;
+
+	/*
+	 * The number of cursor slots we allocate is one greater than
+	 * the number of cursors involved in the join, because the
+	 * list is NULL-terminated.
+	 */
+	ncurs = jc->j_curslist - curslist;
+	nslots = ncurs + 1;
+
+	/*
+	 * !!! -- A note on the various lists hanging off jc.
+	 *
+	 * j_curslist is the initial NULL-terminated list of cursors passed
+	 * into __db_join.  The original cursors are not modified; pristine
+	 * copies are required because, in databases with unsorted dups, we
+	 * must reset all of the secondary cursors after the first each
+	 * time the first one is incremented, or else we will lose data
+	 * which happen to be sorted differently in two different cursors.
+	 *
+	 * j_workcurs is where we put those copies that we're planning to
+	 * work with.  They're lazily c_dup'ed from j_curslist as we need
+	 * them, and closed when the join cursor is closed or when we need
+	 * to reset them to their original values (in which case we just
+	 * c_dup afresh).
+	 *
+	 * j_fdupcurs is an array of cursors which point to the first
+	 * duplicate in the duplicate set that contains the data value
+	 * we're currently interested in.  We need this to make
+	 * __db_join_get correctly return duplicate duplicates;  i.e., if a
+	 * given data value occurs twice in the set belonging to cursor #2,
+	 * and thrice in the set belonging to cursor #3, and once in all
+	 * the other cursors, successive calls to __db_join_get need to
+	 * return that data item six times.  To make this happen, each time
+	 * cursor N is allowed to advance to a new datum, all cursors M
+	 * such that M > N have to be reset to the first duplicate with
+	 * that datum, so __db_join_get will return all the dup-dups again.
+	 * We could just reset them to the original cursor from j_curslist,
+	 * but that would be a bit slower in the unsorted case and a LOT
+	 * slower in the sorted one.
+	 *
+	 * j_exhausted is a list of boolean values which represent
+	 * whether or not their corresponding cursors are "exhausted",
+	 * i.e. whether the datum under the corresponding cursor has
+	 * been found not to exist in any unreturned combinations of
+	 * later secondary cursors, in which case they are ready to be
+	 * incremented.
+	 */
+
+	/* We don't want to free regions whose callocs have failed. */
+	jc->j_curslist = NULL;
+	jc->j_workcurs = NULL;
+	jc->j_fdupcurs = NULL;
+	jc->j_exhausted = NULL;
+
+	if ((ret = __os_calloc(dbenv, nslots, sizeof(DBC *),
+	    &jc->j_curslist)) != 0)
+		goto err;
+	if ((ret = __os_calloc(dbenv, nslots, sizeof(DBC *),
+	    &jc->j_workcurs)) != 0)
+		goto err;
+	if ((ret = __os_calloc(dbenv, nslots, sizeof(DBC *),
+	    &jc->j_fdupcurs)) != 0)
+		goto err;
+	if ((ret = __os_calloc(dbenv, nslots, sizeof(u_int8_t),
+	    &jc->j_exhausted)) != 0)
+		goto err;
+	for (i = 0; curslist[i] != NULL; i++) {
+		jc->j_curslist[i] = curslist[i];
+		jc->j_workcurs[i] = NULL;
+		jc->j_fdupcurs[i] = NULL;
+		jc->j_exhausted[i] = 0;
+	}
+	jc->j_ncurs = ncurs;
+
+	/*
+	 * If DB_JOIN_NOSORT is not set, optimize secondary cursors by
+	 * sorting in order of increasing cardinality.
+	 */
+	if (!LF_ISSET(DB_JOIN_NOSORT))
+		qsort(jc->j_curslist, ncurs, sizeof(DBC *), __db_join_cmp);
+
+	/*
+	 * We never need to reset the 0th cursor, so there's no
+	 * solid reason to use workcurs[0] rather than curslist[0] in
+	 * join_get.  Nonetheless, it feels cleaner to do it for symmetry,
+	 * and this is the most logical place to copy it.
+	 *
+	 * !!!
+	 * There's no need to close the new cursor if we goto err, but only
+	 * because this is the last thing that can fail.  Modifiers of this
+	 * function, beware!
+	 */
+	if ((ret = jc->j_curslist[0]->c_dup(jc->j_curslist[0], jc->j_workcurs,
+	    DB_POSITIONI)) != 0)
+		goto err;
+
+	dbc->c_close = __db_join_close;
+	dbc->c_del = __db_join_del;
+	dbc->c_get = __db_join_get;
+	dbc->c_put = __db_join_put;
+	dbc->internal = (DBC_INTERNAL *) jc;
+	dbc->dbp = primary;
+	jc->j_primary = primary;
+
+	*dbcp = dbc;
+
+	MUTEX_THREAD_LOCK(dbenv, primary->mutexp);
+	TAILQ_INSERT_TAIL(&primary->join_queue, dbc, links);
+	MUTEX_THREAD_UNLOCK(dbenv, primary->mutexp);
+
+	return (0);
+
+err:	if (jc != NULL) {
+		if (jc->j_curslist != NULL)
+			__os_free(jc->j_curslist, nslots * sizeof(DBC *));
+		if (jc->j_workcurs != NULL) {
+			if (jc->j_workcurs[0] != NULL)
+				__os_free(jc->j_workcurs[0], sizeof(DBC));
+			__os_free(jc->j_workcurs, nslots * sizeof(DBC *));
+		}
+		if (jc->j_fdupcurs != NULL)
+			__os_free(jc->j_fdupcurs, nslots * sizeof(DBC *));
+		if (jc->j_exhausted != NULL)
+			__os_free(jc->j_exhausted, nslots * sizeof(u_int8_t));
+		__os_free(jc, sizeof(JOIN_CURSOR));
+	}
+	if (dbc != NULL)
+		__os_free(dbc, sizeof(DBC));
+	return (ret);
+}
+
+static int
+__db_join_put(dbc, key, data, flags)
+	DBC *dbc;
+	DBT *key;
+	DBT *data;
+	u_int32_t flags;
+{
+	PANIC_CHECK(dbc->dbp->dbenv);
+
+	COMPQUIET(key, NULL);
+	COMPQUIET(data, NULL);
+	COMPQUIET(flags, 0);
+	return (EINVAL);
+}
+
+static int
+__db_join_del(dbc, flags)
+	DBC *dbc;
+	u_int32_t flags;
+{
+	PANIC_CHECK(dbc->dbp->dbenv);
+
+	COMPQUIET(flags, 0);
+	return (EINVAL);
+}
+
+static int
+__db_join_get(dbc, key_arg, data_arg, flags)
+	DBC *dbc;
+	DBT *key_arg, *data_arg;
+	u_int32_t flags;
+{
+	DBT *key_n, key_n_mem;
+	DB *dbp;
+	DBC *cp;
+	JOIN_CURSOR *jc;
+	int ret;
+	u_int32_t i, j, operation;
+
+	dbp = dbc->dbp;
+	jc = (JOIN_CURSOR *)dbc->internal;
+
+	PANIC_CHECK(dbp->dbenv);
+
+	operation = LF_ISSET(DB_OPFLAGS_MASK);
+
+	if ((ret = __db_joingetchk(dbp, key_arg, flags)) != 0)
+		return (ret);
+
+	/*
+	 * Since we are fetching the key as a datum in the secondary indices,
+	 * we must be careful of caller-specified DB_DBT_* memory
+	 * management flags.  If necessary, use a stack-allocated DBT;
+	 * we'll appropriately copy and/or allocate the data later.
+	 */
+	if (F_ISSET(key_arg, DB_DBT_USERMEM) ||
+	    F_ISSET(key_arg, DB_DBT_MALLOC)) {
+		/* We just use the default buffer;  no need to go malloc. */
+		key_n = &key_n_mem;
+		memset(key_n, 0, sizeof(DBT));
+	} else {
+		/*
+		 * Either DB_DBT_REALLOC or the default buffer will work
+		 * fine if we have to reuse it, as we do.
+		 */
+		key_n = key_arg;
+	}
+
+	/*
+	 * If our last attempt to do a get on the primary key failed,
+	 * short-circuit the join and try again with the same key.
+	 */
+	if (F_ISSET(jc, JOIN_RETRY))
+		goto samekey;
+	F_CLR(jc, JOIN_RETRY);
+
+retry:	ret = jc->j_workcurs[0]->c_get(jc->j_workcurs[0],
+	    &jc->j_key, key_n, jc->j_exhausted[0] ? DB_NEXT_DUP : DB_CURRENT);
+
+	if (ret == ENOMEM) {
+		jc->j_key.ulen <<= 1;
+		if ((ret = __os_realloc(dbp->dbenv,
+		    jc->j_key.ulen, NULL, &jc->j_key.data)) != 0)
+			goto mem_err;
+		goto retry;
+	}
+
+	/*
+	 * If ret == DB_NOTFOUND, we're out of elements of the first
+	 * secondary cursor.  This is how we finally finish the join
+	 * if all goes well.
+	 */
+	if (ret != 0)
+		goto err;
+
+	/*
+	 * If jc->j_exhausted[0] == 1, we've just advanced the first cursor,
+	 * and we're going to want to advance all the cursors that point to
+	 * the first member of a duplicate duplicate set (j_fdupcurs[1..N]).
+	 * Close all the cursors in j_fdupcurs;  we'll reopen them the
+	 * first time through the upcoming loop.
+	 */
+	for (i = 1; i < jc->j_ncurs; i++) {
+		if (jc->j_fdupcurs[i] != NULL &&
+		    (ret = jc->j_fdupcurs[i]->c_close(jc->j_fdupcurs[i])) != 0)
+			goto err;
+		jc->j_fdupcurs[i] = NULL;
+	}
+
+	/*
+	 * If jc->j_curslist[1] == NULL, we have only one cursor in the join.
+	 * Thus, we can safely increment that one cursor on each call
+	 * to __db_join_get, and we signal this by setting jc->j_exhausted[0]
+	 * right away.
+	 *
+	 * Otherwise, reset jc->j_exhausted[0] to 0, so that we don't
+	 * increment it until we know we're ready to.
+	 */
+	if (jc->j_curslist[1] == NULL)
+		jc->j_exhausted[0] = 1;
+	else
+		jc->j_exhausted[0] = 0;
+
+	/* We have the first element; now look for it in the other cursors. */
+	for (i = 1; i < jc->j_ncurs; i++) {
+		DB_ASSERT(jc->j_curslist[i] != NULL);
+		if (jc->j_workcurs[i] == NULL)
+			/* If this is NULL, we need to dup curslist into it. */
+			if ((ret = jc->j_curslist[i]->c_dup(
+			    jc->j_curslist[i], jc->j_workcurs + i,
+			    DB_POSITIONI)) != 0)
+				goto err;
+
+retry2:		cp = jc->j_workcurs[i];
+
+		if ((ret = __db_join_getnext(cp, &jc->j_key, key_n,
+			    jc->j_exhausted[i])) == DB_NOTFOUND) {
+			/*
+			 * jc->j_workcurs[i] has no more of the datum we're
+			 * interested in.  Go back one cursor and get
+			 * a new dup.  We can't just move to a new
+			 * element of the outer relation, because that way
+			 * we might miss duplicate duplicates in cursor i-1.
+			 *
+			 * If this takes us back to the first cursor,
+			 * -then- we can move to a new element of the outer
+			 * relation.
+			 */
+			--i;
+			jc->j_exhausted[i] = 1;
+
+			if (i == 0) {
+				for (j = 1; jc->j_workcurs[j] != NULL; j++) {
+					/*
+					 * We're moving to a new element of
+					 * the first secondary cursor.  If
+					 * that cursor is sorted, then any
+					 * other sorted cursors can be safely
+					 * reset to the first duplicate
+					 * duplicate in the current set if we
+					 * have a pointer to it (we can't just
+					 * leave them be, or we'll miss
+					 * duplicate duplicates in the outer
+					 * relation).
+					 *
+					 * If the first cursor is unsorted, or
+					 * if cursor j is unsorted, we can
+					 * make no assumptions about what
+					 * we're looking for next or where it
+					 * will be, so we reset to the very
+					 * beginning (setting workcurs NULL
+					 * will achieve this next go-round).
+					 *
+					 * XXX: This is likely to break
+					 * horribly if any two cursors are
+					 * both sorted, but have different
+					 * specified sort functions.  For
+					 * now, we dismiss this as pathology
+					 * and let strange things happen--we
+					 * can't make rope childproof.
+					 */
+					if ((ret = jc->j_workcurs[j]->c_close(
+					    jc->j_workcurs[j])) != 0)
+						goto err;
+					if (!SORTED_SET(jc, 0) ||
+					    !SORTED_SET(jc, j) ||
+					    jc->j_fdupcurs[j] == NULL)
+						/*
+						 * Unsafe conditions;
+						 * reset fully.
+						 */
+						jc->j_workcurs[j] = NULL;
+					else
+						/* Partial reset suffices. */
+						if ((ret = jc->j_fdupcurs[j]->c_dup(
+						    jc->j_fdupcurs[j],
+						    &jc->j_workcurs[j],
+						    DB_POSITIONI)) != 0)
+							goto err;
+					jc->j_exhausted[j] = 0;
+				}
+				goto retry;
+				/* NOTREACHED */
+			}
+
+			/*
+			 * We're about to advance the cursor and need to
+			 * reset all of the workcurs[j] where j>i, so that
+			 * we don't miss any duplicate duplicates.
+			 */
+			for (j = i + 1;
+			    jc->j_workcurs[j] != NULL;
+			    j++) {
+				if ((ret = jc->j_workcurs[j]->c_close(
+				    jc->j_workcurs[j])) != 0)
+					goto err;
+				jc->j_exhausted[j] = 0;
+				if (jc->j_fdupcurs[j] == NULL)
+					jc->j_workcurs[j] = NULL;
+				else if ((ret =
+				    jc->j_fdupcurs[j]->c_dup(
+				    jc->j_fdupcurs[j], &jc->j_workcurs[j],
+				    DB_POSITIONI)) != 0)
+					goto err;
+			}
+			goto retry2;
+			/* NOTREACHED */
+		}
+
+		if (ret == ENOMEM) {
+			jc->j_key.ulen <<= 1;
+			if ((ret = __os_realloc(dbp->dbenv, jc->j_key.ulen,
+			    NULL, &jc->j_key.data)) != 0) {
+mem_err:			__db_err(dbp->dbenv,
+				    "Allocation failed for join key, len = %lu",
+				    (u_long)jc->j_key.ulen);
+				goto err;
+			}
+			goto retry2;
+		}
+
+		if (ret != 0)
+			goto err;
+
+		/*
+		 * If we made it this far, we've found a matching
+		 * datum in cursor i.  Mark the current cursor
+		 * unexhausted, so we don't miss any duplicate
+		 * duplicates the next go-round--unless this is the
+		 * very last cursor, in which case there are none to
+		 * miss, and we'll need that exhausted flag to finally
+		 * get a DB_NOTFOUND and move on to the next datum in
+		 * the outermost cursor.
+		 */
+		if (i + 1 != jc->j_ncurs)
+			jc->j_exhausted[i] = 0;
+		else
+			jc->j_exhausted[i] = 1;
+
+		/*
+		 * If jc->j_fdupcurs[i] is NULL and the ith cursor's dups are
+		 * sorted, then we're here for the first time since advancing
+		 * cursor 0, and we have a new datum of interest.
+		 * jc->j_workcurs[i] points to the beginning of a set of
+		 * duplicate duplicates;  store this into jc->j_fdupcurs[i].
+		 */
+		if (SORTED_SET(jc, i) && jc->j_fdupcurs[i] == NULL && (ret =
+		    cp->c_dup(cp, &jc->j_fdupcurs[i], DB_POSITIONI)) != 0)
+			goto err;
+
+	}
+
+err:	if (ret != 0)
+		return (ret);
+
+	if (0) {
+samekey:	/*
+		 * Get the key we tried and failed to return last time;
+		 * it should be the current datum of all the secondary cursors.
+		 */
+		if ((ret = jc->j_workcurs[0]->c_get(jc->j_workcurs[0],
+		    &jc->j_key, key_n, DB_CURRENT)) != 0)
+			return (ret);
+		F_CLR(jc, JOIN_RETRY);
+	}
+
+	/*
+	 * ret == 0;  we have a key to return.
+	 *
+	 * If DB_DBT_USERMEM or DB_DBT_MALLOC is set, we need to
+	 * copy it back into the dbt we were given for the key;
+	 * call __db_retcopy.
+	 *
+	 * Otherwise, assert that we do not in fact need to copy anything
+	 * and simply proceed.
+	 */
+	if (F_ISSET(key_arg, DB_DBT_USERMEM) ||
+	    F_ISSET(key_arg, DB_DBT_MALLOC)) {
+		/*
+		 * We need to copy the key back into our original
+		 * datum.  Do so.
+		 */
+		if ((ret = __db_retcopy(dbp,
+		    key_arg, key_n->data, key_n->size, NULL, NULL)) != 0) {
+			/*
+			 * The retcopy failed, most commonly because we
+			 * have a user buffer for the key which is too small.
+			 * Set things up to retry next time, and return.
+			 */
+			F_SET(jc, JOIN_RETRY);
+			return (ret);
+		}
+	} else
+		DB_ASSERT(key_n == key_arg);
+
+	/*
+	 * If DB_JOIN_ITEM is set, we return the key alone;  otherwise we
+	 * do the lookup in the primary and then return both the key and
+	 * the data.
+	 *
+	 * Note that we use key_arg here;  it is safe (and appropriate)
+	 * to do so.
+	 */
+	if (operation == DB_JOIN_ITEM)
+		return (0);
+
+	if ((ret = jc->j_primary->get(jc->j_primary,
+	    jc->j_curslist[0]->txn, key_arg, data_arg, 0)) != 0)
+		/*
+		 * The get on the primary failed, most commonly because we're
+		 * using a user buffer that's not big enough.  Flag our
+		 * failure so we can return the same key next time.
+		 */
+		F_SET(jc, JOIN_RETRY);
+
+	return (ret);
+}
+
+static int
+__db_join_close(dbc)
+	DBC *dbc;
+{
+	DB *dbp;
+	JOIN_CURSOR *jc;
+	int ret, t_ret;
+	u_int32_t i;
+
+	jc = (JOIN_CURSOR *)dbc->internal;
+	dbp = dbc->dbp;
+	ret = t_ret = 0;
+
+	/*
+	 * Remove from active list of join cursors.  Note that this
+	 * must happen before any action that can fail and return, or else
+	 * __db_close may loop indefinitely.
+	 */
+	MUTEX_THREAD_LOCK(dbp->dbenv, dbp->mutexp);
+	TAILQ_REMOVE(&dbp->join_queue, dbc, links);
+	MUTEX_THREAD_UNLOCK(dbp->dbenv, dbp->mutexp);
+
+	PANIC_CHECK(dbc->dbp->dbenv);
+
+	/*
+	 * Close any open scratch cursors.  In each case, there may
+	 * not be as many outstanding as there are cursors in
+	 * curslist, but we want to close whatever's there.
+	 *
+	 * If any close fails, there's no reason not to close everything else;
+	 * we'll just return the error code of the last one to fail.  There's
+	 * not much the caller can do anyway, since these cursors only exist
+	 * hanging off a db-internal data structure that they shouldn't be
+	 * mucking with.
+	 */
+	for (i = 0; i < jc->j_ncurs; i++) {
+		if (jc->j_workcurs[i] != NULL && (t_ret =
+		    jc->j_workcurs[i]->c_close(jc->j_workcurs[i])) != 0)
+			ret = t_ret;
+		if (jc->j_fdupcurs[i] != NULL && (t_ret =
+		    jc->j_fdupcurs[i]->c_close(jc->j_fdupcurs[i])) != 0)
+			ret = t_ret;
+	}
+
+	__os_free(jc->j_exhausted, 0);
+	__os_free(jc->j_curslist, 0);
+	__os_free(jc->j_workcurs, 0);
+	__os_free(jc->j_fdupcurs, 0);
+	__os_free(jc->j_key.data, jc->j_key.ulen);
+	__os_free(jc, sizeof(JOIN_CURSOR));
+	__os_free(dbc, sizeof(DBC));
+
+	return (ret);
+}
+
+/*
+ * __db_join_getnext --
+ *	This function replaces the DBC_CONTINUE and DBC_KEYSET
+ *	functionality inside the various cursor get routines.
+ *
+ *	If exhausted == 0, we're not done with the current datum;
+ *	return it if it matches "matching", otherwise search
+ *	using DB_GET_BOTHC (which is faster than iteratively doing
+ *	DB_NEXT_DUP) forward until we find one that does.
+ *
+ *	If exhausted == 1, we are done with the current datum, so just
+ *	leap forward to searching NEXT_DUPs.
+ *
+ *	If no matching datum exists, returns DB_NOTFOUND, else 0.
+ */
+static int
+__db_join_getnext(dbc, key, data, exhausted)
+	DBC *dbc;
+	DBT *key, *data;
+	u_int32_t exhausted;
+{
+	int ret, cmp;
+	DB *dbp;
+	DBT ldata;
+	int (*func) __P((DB *, const DBT *, const DBT *));
+
+	dbp = dbc->dbp;
+	func = (dbp->dup_compare == NULL) ? __bam_defcmp : dbp->dup_compare;
+
+	switch (exhausted) {
+	case 0:
+		memset(&ldata, 0, sizeof(DBT));
+		/* We don't want to step on data->data;  malloc. */
+		F_SET(&ldata, DB_DBT_MALLOC);
+		if ((ret = dbc->c_get(dbc, key, &ldata, DB_CURRENT)) != 0)
+			break;
+		cmp = func(dbp, data, &ldata);
+		if (cmp == 0) {
+			/*
+			 * We have to return the real data value.  Copy
+			 * it into data, then free the buffer we malloc'ed
+			 * above.
+			 */
+			ret = __db_retcopy(dbp, data, ldata.data,
+			    ldata.size, &data->data, &data->size);
+			/* Free the buffer on success or failure. */
+			__os_free(ldata.data, 0);
+			return (ret);
+		}
+
+		/*
+		 * Didn't match--we want to fall through and search future
+		 * dups.  We just forget about ldata and free
+		 * its buffer--data contains the value we're searching for.
+		 */
+		__os_free(ldata.data, 0);
+		/* FALLTHROUGH */
+	case 1:
+		ret = dbc->c_get(dbc, key, data, DB_GET_BOTHC);
+		break;
+	default:
+		ret = EINVAL;
+		break;
+	}
+
+	return (ret);
+}
+
+/*
+ * __db_join_cmp --
+ *	Comparison function for sorting DBCs in cardinality order.
+ */
+
+static int
+__db_join_cmp(a, b)
+	const void *a, *b;
+{
+	DBC *dbca, *dbcb;
+	db_recno_t counta, countb;
+
+	/* In case c_count fails, pretend cursors are equal. */
+	counta = countb = 0;
+
+	dbca = *((DBC * const *)a);
+	dbcb = *((DBC * const *)b);
+
+	if (dbca->c_count(dbca, &counta, 0) != 0 ||
+	    dbcb->c_count(dbcb, &countb, 0) != 0)
+		return (0);
+
+	return ((counta > countb) - (counta < countb));
+}
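
Ordering the secondary cursors by cardinality with qsort needs a three-way comparator over unsigned counts. The following is a minimal sketch of such a comparator, not the library's code. It assumes a hypothetical struct ex_cursor that carries its duplicate count directly instead of obtaining it through a c_count call. Comparing rather than subtracting is a choice made in this sketch, so that a very large unsigned difference cannot wrap into the wrong sign when converted to int.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a cursor and its duplicate count. */
struct ex_cursor {
	const char *name;
	uint32_t count;		/* what a c_count-style call would report */
};

/*
 * ex_card_cmp --
 *	Three-way comparison by count, for use with qsort on an array
 *	of cursor pointers.
 */
int
ex_card_cmp(const void *a, const void *b)
{
	const struct ex_cursor *ca, *cb;

	ca = *(const struct ex_cursor * const *)a;
	cb = *(const struct ex_cursor * const *)b;

	/* Each comparison yields 0 or 1, so the result is -1, 0 or 1. */
	return ((ca->count > cb->count) - (ca->count < cb->count));
}

/*
 * ex_sort_by_cardinality --
 *	Sort an array of n cursor pointers, smallest count first.
 */
void
ex_sort_by_cardinality(struct ex_cursor **list, size_t n)
{
	assert(list != NULL || n == 0);
	if (n > 1)
		qsort(list, n, sizeof(list[0]), ex_card_cmp);
}
```

With counts of 3000000000, 0 and 7, the unsigned difference 0 - 3000000000 wraps around to a positive value, so a subtracting comparator would place the empty cursor after the largest one; the comparison form orders them 0, 7, 3000000000.
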
diff --git a/bdb/db/db_meta.c b/bdb/db/db_meta.c
new file mode 100644
index 00000000000..5b57c369454
--- /dev/null
+++ b/bdb/db/db_meta.c
@@ -0,0 +1,309 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ *	Sleepycat Software.  All rights reserved.
+ */
+/*
+ * Copyright (c) 1990, 1993, 1994, 1995, 1996
+ *	Keith Bostic.  All rights reserved.
+ */
+/*
+ * Copyright (c) 1990, 1993, 1994, 1995
+ *	The Regents of the University of California.  All rights reserved.
+ *
+ * This code is derived from software contributed to Berkeley by
+ * Mike Olson.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of the University nor the names of its contributors
+ *    may be used to endorse or promote products derived from this software
+ *    without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_meta.c,v 11.26 2001/01/16 21:57:19 ubell Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_shash.h"
+#include "lock.h"
+#include "txn.h"
+#include "db_am.h"
+#include "btree.h"
+
+/*
+ * __db_new --
+ *	Get a new page, preferably from the freelist.
+ *
+ * PUBLIC: int __db_new __P((DBC *, u_int32_t, PAGE **));
+ */
+int
+__db_new(dbc, type, pagepp)
+	DBC *dbc;
+	u_int32_t type;
+	PAGE **pagepp;
+{
+	DBMETA *meta;
+	DB *dbp;
+	DB_LOCK metalock;
+	PAGE *h;
+	db_pgno_t pgno;
+	int ret;
+
+	dbp = dbc->dbp;
+	meta = NULL;
+	h = NULL;
+
+	pgno = PGNO_BASE_MD;
+	if ((ret = __db_lget(dbc,
+	    LCK_ALWAYS, pgno, DB_LOCK_WRITE, 0, &metalock)) != 0)
+		goto err;
+	if ((ret = memp_fget(dbp->mpf, &pgno, 0, (PAGE **)&meta)) != 0)
+		goto err;
+
+	if (meta->free == PGNO_INVALID) {
+		if ((ret = memp_fget(dbp->mpf, &pgno, DB_MPOOL_NEW, &h)) != 0)
+			goto err;
+		ZERO_LSN(h->lsn);
+		h->pgno = pgno;
+	} else {
+		pgno = meta->free;
+		if ((ret = memp_fget(dbp->mpf, &pgno, 0, &h)) != 0)
+			goto err;
+		meta->free = h->next_pgno;
+		(void)memp_fset(dbp->mpf, (PAGE *)meta, DB_MPOOL_DIRTY);
+	}
+
+	DB_ASSERT(TYPE(h) == P_INVALID);
+
+	if (TYPE(h) != P_INVALID)
+		return (__db_panic(dbp->dbenv, EINVAL));
+
+	/* Log the change. */
+	if (DB_LOGGING(dbc)) {
+		if ((ret = __db_pg_alloc_log(dbp->dbenv,
+		    dbc->txn, &LSN(meta), 0, dbp->log_fileid,
+		    &LSN(meta), &h->lsn, h->pgno,
+		    (u_int32_t)type, meta->free)) != 0)
+			goto err;
+		LSN(h) = LSN(meta);
+	}
+
+	(void)memp_fput(dbp->mpf, (PAGE *)meta, DB_MPOOL_DIRTY);
+	(void)__TLPUT(dbc, metalock);
+
+	P_INIT(h, dbp->pgsize, h->pgno, PGNO_INVALID, PGNO_INVALID, 0, type);
+	*pagepp = h;
+	return (0);
+
+err:	if (h != NULL)
+		(void)memp_fput(dbp->mpf, h, 0);
+	if (meta != NULL)
+		(void)memp_fput(dbp->mpf, meta, 0);
+	(void)__TLPUT(dbc, metalock);
+	return (ret);
+}
+
+/*
+ * __db_free --
+ *	Add a page to the head of the freelist.
+ *
+ * PUBLIC: int __db_free __P((DBC *, PAGE *));
+ */
+int
+__db_free(dbc, h)
+	DBC *dbc;
+	PAGE *h;
+{
+	DBMETA *meta;
+	DB *dbp;
+	DBT ldbt;
+	DB_LOCK metalock;
+	db_pgno_t pgno;
+	u_int32_t dirty_flag;
+	int ret, t_ret;
+
+	dbp = dbc->dbp;
+
+	/*
+	 * Retrieve the metadata page and insert the page at the head of
+	 * the free list.  If either the lock get or page get routines
+	 * fail, then we need to put the page with which we were called
+	 * back because our caller assumes we take care of it.
+	 */
+	dirty_flag = 0;
+	pgno = PGNO_BASE_MD;
+	if ((ret = __db_lget(dbc,
+	     LCK_ALWAYS, pgno, DB_LOCK_WRITE, 0, &metalock)) != 0)
+		goto err;
+	if ((ret = memp_fget(dbp->mpf, &pgno, 0, (PAGE **)&meta)) != 0) {
+		(void)__TLPUT(dbc, metalock);
+		goto err;
+	}
+
+	DB_ASSERT(h->pgno != meta->free);
+	/* Log the change. */
+	if (DB_LOGGING(dbc)) {
+		memset(&ldbt, 0, sizeof(ldbt));
+		ldbt.data = h;
+		ldbt.size = P_OVERHEAD;
+		if ((ret = __db_pg_free_log(dbp->dbenv,
+		    dbc->txn, &LSN(meta), 0, dbp->log_fileid, h->pgno,
+		    &LSN(meta), &ldbt, meta->free)) != 0) {
+			(void)memp_fput(dbp->mpf, (PAGE *)meta, 0);
+			(void)__TLPUT(dbc, metalock);
+			goto err;
+		}
+		LSN(h) = LSN(meta);
+	}
+
+	P_INIT(h, dbp->pgsize, h->pgno, PGNO_INVALID, meta->free, 0, P_INVALID);
+
+	meta->free = h->pgno;
+
+	/* Discard the metadata page. */
+	if ((t_ret = memp_fput(dbp->mpf,
+	    (PAGE *)meta, DB_MPOOL_DIRTY)) != 0 && ret == 0)
+		ret = t_ret;
+	if ((t_ret = __TLPUT(dbc, metalock)) != 0 && ret == 0)
+		ret = t_ret;
+
+	/* Discard the caller's page reference. */
+	dirty_flag = DB_MPOOL_DIRTY;
+err:	if ((t_ret = memp_fput(dbp->mpf, h, dirty_flag)) != 0 && ret == 0)
+		ret = t_ret;
+
+	/*
+	 * XXX
+	 * We have to unlock the caller's page in the caller!
+	 */
+	return (ret);
+}
+
+#ifdef DEBUG
+/*
+ * __db_lprint --
+ *	Print out the list of locks currently held by a cursor.
+ *
+ * PUBLIC: int __db_lprint __P((DBC *));
+ */
+int
+__db_lprint(dbc)
+	DBC *dbc;
+{
+	DB *dbp;
+	DB_LOCKREQ req;
+
+	dbp = dbc->dbp;
+
+	if (LOCKING_ON(dbp->dbenv)) {
+		req.op = DB_LOCK_DUMP;
+		lock_vec(dbp->dbenv, dbc->locker, 0, &req, 1, NULL);
+	}
+	return (0);
+}
+#endif
+
+/*
+ * __db_lget --
+ *	The standard lock get call.
+ *
+ * PUBLIC: int __db_lget __P((DBC *,
+ * PUBLIC:     int, db_pgno_t, db_lockmode_t, int, DB_LOCK *));
+ */
+int
+__db_lget(dbc, flags, pgno, mode, lkflags, lockp)
+	DBC *dbc;
+	int flags, lkflags;
+	db_pgno_t pgno;
+	db_lockmode_t mode;
+	DB_LOCK *lockp;
+{
+	DB *dbp;
+	DB_ENV *dbenv;
+	DB_LOCKREQ couple[2], *reqp;
+	int ret;
+
+	dbp = dbc->dbp;
+	dbenv = dbp->dbenv;
+
+	/*
+	 * We do not always check if we're configured for locking before
+	 * calling __db_lget to acquire the lock.
+	 */
+	if (CDB_LOCKING(dbenv)
+	    || !LOCKING_ON(dbenv) || F_ISSET(dbc, DBC_COMPENSATE)
+	    || (!LF_ISSET(LCK_ROLLBACK) && F_ISSET(dbc, DBC_RECOVER))
+	    || (!LF_ISSET(LCK_ALWAYS) && F_ISSET(dbc, DBC_OPD))) {
+		lockp->off = LOCK_INVALID;
+		return (0);
+	}
+
+	dbc->lock.pgno = pgno;
+	if (lkflags & DB_LOCK_RECORD)
+		dbc->lock.type = DB_RECORD_LOCK;
+	else
+		dbc->lock.type = DB_PAGE_LOCK;
+	lkflags &= ~DB_LOCK_RECORD;
+
+	/*
+	 * If the transaction enclosing this cursor has DB_LOCK_NOWAIT set,
+	 * pass that along to the lock call.
+	 */
+	if (DB_NONBLOCK(dbc))
+		lkflags |= DB_LOCK_NOWAIT;
+
+	/*
+	 * If the object is not currently locked, acquire the lock and
+	 * return; otherwise, lock couple.
+	 */
+	if (LF_ISSET(LCK_COUPLE)) {
+		couple[0].op = DB_LOCK_GET;
+		couple[0].obj = &dbc->lock_dbt;
+		couple[0].mode = mode;
+		couple[1].op = DB_LOCK_PUT;
+		couple[1].lock = *lockp;
+
+		ret = lock_vec(dbenv,
+		    dbc->locker, lkflags, couple, 2, &reqp);
+		if (ret == 0 || reqp == &couple[1])
+			*lockp = couple[0].lock;
+	} else {
+		ret = lock_get(dbenv,
+		    dbc->locker, lkflags, &dbc->lock_dbt, mode, lockp);
+
+		if (ret != 0)
+			lockp->off = LOCK_INVALID;
+	}
+
+	return (ret);
+}
diff --git a/bdb/db/db_method.c b/bdb/db/db_method.c
new file mode 100644
index 00000000000..01568a6e144
--- /dev/null
+++ b/bdb/db/db_method.c
@@ -0,0 +1,629 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1999, 2000
+ *	Sleepycat Software.  All rights reserved.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_method.c,v 11.36 2000/12/21 09:17:04 krinsky Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#ifdef HAVE_RPC
+#include <rpc/rpc.h>
+#endif
+
+#include <string.h>
+#endif
+
+#ifdef HAVE_RPC
+#include "db_server.h"
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_am.h"
+#include "btree.h"
+#include "hash.h"
+#include "qam.h"
+#include "xa.h"
+#include "xa_ext.h"
+
+#ifdef HAVE_RPC
+#include "gen_client_ext.h"
+#include "rpc_client_ext.h"
+#endif
+
+static int  __db_get_byteswapped __P((DB *));
+static DBTYPE
+	    __db_get_type __P((DB *));
+static int  __db_init __P((DB *, u_int32_t));
+static int  __db_key_range
+		__P((DB *, DB_TXN *, DBT *, DB_KEY_RANGE *, u_int32_t));
+static int  __db_set_append_recno __P((DB *, int (*)(DB *, DBT *, db_recno_t)));
+static int  __db_set_cachesize __P((DB *, u_int32_t, u_int32_t, int));
+static int  __db_set_dup_compare
+		__P((DB *, int (*)(DB *, const DBT *, const DBT *)));
+static void __db_set_errcall __P((DB *, void (*)(const char *, char *)));
+static void __db_set_errfile __P((DB *, FILE *));
+static int  __db_set_feedback __P((DB *, void (*)(DB *, int, int)));
+static int  __db_set_flags __P((DB *, u_int32_t));
+static int  __db_set_lorder __P((DB *, int));
+static int  __db_set_malloc __P((DB *, void *(*)(size_t)));
+static int  __db_set_pagesize __P((DB *, u_int32_t));
+static int  __db_set_realloc __P((DB *, void *(*)(void *, size_t)));
+static void __db_set_errpfx __P((DB *, const char *));
+static int  __db_set_paniccall __P((DB *, void (*)(DB_ENV *, int)));
+static void __dbh_err __P((DB *, int, const char *, ...));
+static void __dbh_errx __P((DB *, const char *, ...));
+
+/*
+ * db_create --
+ *	DB constructor.
+ */
+int
+db_create(dbpp, dbenv, flags)
+	DB **dbpp;
+	DB_ENV *dbenv;
+	u_int32_t flags;
+{
+	DB *dbp;
+	int ret;
+
+	/* Check for invalid function flags. */
+	switch (flags) {
+	case 0:
+		break;
+	case DB_XA_CREATE:
+		if (dbenv != NULL) {
+			__db_err(dbenv,
+		"XA applications may not specify an environment to db_create");
+			return (EINVAL);
+		}
+
+		/*
+		 * If it's an XA database, open it within the XA environment,
+		 * taken from the global list of environments.  (When the XA
+		 * transaction manager called our xa_start() routine the
+		 * "current" environment was moved to the start of the list.)
+		 */
+		dbenv = TAILQ_FIRST(&DB_GLOBAL(db_envq));
+		break;
+	default:
+		return (__db_ferr(dbenv, "db_create", 0));
+	}
+
+	/* Allocate the DB. */
+	if ((ret = __os_calloc(dbenv, 1, sizeof(*dbp), &dbp)) != 0)
+		return (ret);
+#ifdef HAVE_RPC
+	if (dbenv != NULL && dbenv->cl_handle != NULL)
+		ret = __dbcl_init(dbp, dbenv, flags);
+	else
+#endif
+		ret = __db_init(dbp, flags);
+	if (ret != 0) {
+		__os_free(dbp, sizeof(*dbp));
+		return (ret);
+	}
+
+	/* If we don't have an environment yet, allocate a local one. */
+	if (dbenv == NULL) {
+		if ((ret = db_env_create(&dbenv, 0)) != 0) {
+			__os_free(dbp, sizeof(*dbp));
+			return (ret);
+		}
+		dbenv->dblocal_ref = 0;
+		F_SET(dbenv, DB_ENV_DBLOCAL);
+	}
+	if (F_ISSET(dbenv, DB_ENV_DBLOCAL))
+		++dbenv->dblocal_ref;
+
+	dbp->dbenv = dbenv;
+
+	*dbpp = dbp;
+	return (0);
+}
+
+/*
+ * __db_init --
+ *	Initialize a DB structure.
+ */
+static int
+__db_init(dbp, flags)
+	DB *dbp;
+	u_int32_t flags;
+{
+	int ret;
+
+	dbp->log_fileid = DB_LOGFILEID_INVALID;
+
+	TAILQ_INIT(&dbp->free_queue);
+	TAILQ_INIT(&dbp->active_queue);
+	TAILQ_INIT(&dbp->join_queue);
+
+	FLD_SET(dbp->am_ok,
+	    DB_OK_BTREE | DB_OK_HASH | DB_OK_QUEUE | DB_OK_RECNO);
+
+	dbp->close = __db_close;
+	dbp->cursor = __db_cursor;
+	dbp->del = NULL;		/* !!! Must be set by access method. */
+	dbp->err = __dbh_err;
+	dbp->errx = __dbh_errx;
+	dbp->fd = __db_fd;
+	dbp->get = __db_get;
+	dbp->get_byteswapped = __db_get_byteswapped;
+	dbp->get_type = __db_get_type;
+	dbp->join = __db_join;
+	dbp->key_range = __db_key_range;
+	dbp->open = __db_open;
+	dbp->put = __db_put;
+	dbp->remove = __db_remove;
+	dbp->rename = __db_rename;
+	dbp->set_append_recno = __db_set_append_recno;
+	dbp->set_cachesize = __db_set_cachesize;
+	dbp->set_dup_compare = __db_set_dup_compare;
+	dbp->set_errcall = __db_set_errcall;
+	dbp->set_errfile = __db_set_errfile;
+	dbp->set_errpfx = __db_set_errpfx;
+	dbp->set_feedback = __db_set_feedback;
+	dbp->set_flags = __db_set_flags;
+	dbp->set_lorder = __db_set_lorder;
+	dbp->set_malloc = __db_set_malloc;
+	dbp->set_pagesize = __db_set_pagesize;
+	dbp->set_paniccall = __db_set_paniccall;
+	dbp->set_realloc = __db_set_realloc;
+	dbp->stat = NULL;		/* !!! Must be set by access method. */
+	dbp->sync = __db_sync;
+	dbp->upgrade = __db_upgrade;
+	dbp->verify = __db_verify;
+					/* Access method specific. */
+	if ((ret = __bam_db_create(dbp)) != 0)
+		return (ret);
+	if ((ret = __ham_db_create(dbp)) != 0)
+		return (ret);
+	if ((ret = __qam_db_create(dbp)) != 0)
+		return (ret);
+
+	/*
+	 * XA specific: must be last, as we replace methods set by the
+	 * access methods.
+	 */
+	if (LF_ISSET(DB_XA_CREATE) && (ret = __db_xa_create(dbp)) != 0)
+		return (ret);
+
+	return (0);
+}
+
+/*
+ * __dbh_am_chk --
+ *	Error if an unreasonable method is called.
+ *
+ * PUBLIC: int __dbh_am_chk __P((DB *, u_int32_t));
+ */
+int
+__dbh_am_chk(dbp, flags)
+	DB *dbp;
+	u_int32_t flags;
+{
+	/*
+	 * We start out allowing any access methods to be called, and as the
+	 * application calls the methods the options become restricted.  The
+	 * idea is to quit as soon as an illegal method combination is called.
+	 */
+	if ((LF_ISSET(DB_OK_BTREE) && FLD_ISSET(dbp->am_ok, DB_OK_BTREE)) ||
+	    (LF_ISSET(DB_OK_HASH) && FLD_ISSET(dbp->am_ok, DB_OK_HASH)) ||
+	    (LF_ISSET(DB_OK_QUEUE) && FLD_ISSET(dbp->am_ok, DB_OK_QUEUE)) ||
+	    (LF_ISSET(DB_OK_RECNO) && FLD_ISSET(dbp->am_ok, DB_OK_RECNO))) {
+		FLD_CLR(dbp->am_ok, ~flags);
+		return (0);
+	}
+
+	__db_err(dbp->dbenv,
+    "call implies an access method which is inconsistent with previous calls");
+	return (EINVAL);
+}
+
+/*
+ * __dbh_err --
+ *	Error message, including the standard error string.
+ */
+static void
+#ifdef __STDC__
+__dbh_err(DB *dbp, int error, const char *fmt, ...)
+#else
+__dbh_err(dbp, error, fmt, va_alist)
+	DB *dbp;
+	int error;
+	const char *fmt;
+	va_dcl
+#endif
+{
+	va_list ap;
+
+#ifdef __STDC__
+	va_start(ap, fmt);
+#else
+	va_start(ap);
+#endif
+	__db_real_err(dbp->dbenv, error, 1, 1, fmt, ap);
+
+	va_end(ap);
+}
+
+/*
+ * __dbh_errx --
+ *	Error message.
+ */
+static void
+#ifdef __STDC__
+__dbh_errx(DB *dbp, const char *fmt, ...)
+#else
+__dbh_errx(dbp, fmt, va_alist)
+	DB *dbp;
+	const char *fmt;
+	va_dcl
+#endif
+{
+	va_list ap;
+
+#ifdef __STDC__
+	va_start(ap, fmt);
+#else
+	va_start(ap);
+#endif
+	__db_real_err(dbp->dbenv, 0, 0, 1, fmt, ap);
+
+	va_end(ap);
+}
+
+/*
+ * __db_get_byteswapped --
+ *	Return whether the database requires byte swapping.
+ */
+static int
+__db_get_byteswapped(dbp)
+	DB *dbp;
+{
+	DB_ILLEGAL_BEFORE_OPEN(dbp, "get_byteswapped");
+
+	return (F_ISSET(dbp, DB_AM_SWAP) ? 1 : 0);
+}
+
+/*
+ * __db_get_type --
+ *	Return type of underlying database.
+ */
+static DBTYPE
+__db_get_type(dbp)
+	DB *dbp;
+{
+	DB_ILLEGAL_BEFORE_OPEN(dbp, "get_type");
+
+	return (dbp->type);
+}
+
+/*
+ * __db_key_range --
+ *	Return proportion of keys above and below given key.
+ */
+static int
+__db_key_range(dbp, txn, key, kr, flags)
+	DB *dbp;
+	DB_TXN *txn;
+	DBT *key;
+	DB_KEY_RANGE *kr;
+	u_int32_t flags;
+{
+	COMPQUIET(txn, NULL);
+	COMPQUIET(key, NULL);
+	COMPQUIET(kr, NULL);
+	COMPQUIET(flags, 0);
+
+	DB_ILLEGAL_BEFORE_OPEN(dbp, "key_range");
+	DB_ILLEGAL_METHOD(dbp, DB_OK_BTREE);
+
+	return (EINVAL);
+}
+
+/*
+ * __db_set_append_recno --
+ *	Set record number append routine.
+ */
+static int
+__db_set_append_recno(dbp, func)
+	DB *dbp;
+	int (*func) __P((DB *, DBT *, db_recno_t));
+{
+	DB_ILLEGAL_AFTER_OPEN(dbp, "set_append_recno");
+	DB_ILLEGAL_METHOD(dbp, DB_OK_QUEUE | DB_OK_RECNO);
+
+	dbp->db_append_recno = func;
+
+	return (0);
+}
+
+/*
+ * __db_set_cachesize --
+ *	Set underlying cache size.
+ */
+static int
+__db_set_cachesize(dbp, cache_gbytes, cache_bytes, ncache)
+	DB *dbp;
+	u_int32_t cache_gbytes, cache_bytes;
+	int ncache;
+{
+	DB_ILLEGAL_IN_ENV(dbp, "set_cachesize");
+	DB_ILLEGAL_AFTER_OPEN(dbp, "set_cachesize");
+
+	return (dbp->dbenv->set_cachesize(
+	    dbp->dbenv, cache_gbytes, cache_bytes, ncache));
+}
+
+/*
+ * __db_set_dup_compare --
+ *	Set duplicate comparison routine.
+ */
+static int
+__db_set_dup_compare(dbp, func)
+	DB *dbp;
+	int (*func) __P((DB *, const DBT *, const DBT *));
+{
+	DB_ILLEGAL_AFTER_OPEN(dbp, "set_dup_compare");
+	DB_ILLEGAL_METHOD(dbp, DB_OK_BTREE | DB_OK_HASH);
+
+	dbp->dup_compare = func;
+
+	return (0);
+}
+
+static void
+__db_set_errcall(dbp, errcall)
+	DB *dbp;
+	void (*errcall) __P((const char *, char *));
+{
+	dbp->dbenv->set_errcall(dbp->dbenv, errcall);
+}
+
+static void
+__db_set_errfile(dbp, errfile)
+	DB *dbp;
+	FILE *errfile;
+{
+	dbp->dbenv->set_errfile(dbp->dbenv, errfile);
+}
+
+static void
+__db_set_errpfx(dbp, errpfx)
+	DB *dbp;
+	const char *errpfx;
+{
+	dbp->dbenv->set_errpfx(dbp->dbenv, errpfx);
+}
+
+static int
+__db_set_feedback(dbp, feedback)
+	DB *dbp;
+	void (*feedback) __P((DB *, int, int));
+{
+	dbp->db_feedback = feedback;
+	return (0);
+}
+
+static int
+__db_set_flags(dbp, flags)
+	DB *dbp;
+	u_int32_t flags;
+{
+	int ret;
+
+	/*
+	 * !!!
+	 * The hash access method only takes two flags: DB_DUP and DB_DUPSORT.
+	 * The Btree access method uses them for the same purposes, and so we
+	 * resolve them there.
+	 *
+	 * The queue access method takes no flags.
+	 */
+	if ((ret = __bam_set_flags(dbp, &flags)) != 0)
+		return (ret);
+	if ((ret = __ram_set_flags(dbp, &flags)) != 0)
+		return (ret);
+
+	return (flags == 0 ? 0 : __db_ferr(dbp->dbenv, "DB->set_flags", 0));
+}
+
+static int
+__db_set_lorder(dbp, db_lorder)
+	DB *dbp;
+	int db_lorder;
+{
+	int ret;
+
+	DB_ILLEGAL_AFTER_OPEN(dbp, "set_lorder");
+
+	/* Flag if the specified byte order requires swapping. */
+	switch (ret = __db_byteorder(dbp->dbenv, db_lorder)) {
+	case 0:
+		F_CLR(dbp, DB_AM_SWAP);
+		break;
+	case DB_SWAPBYTES:
+		F_SET(dbp, DB_AM_SWAP);
+		break;
+	default:
+		return (ret);
+		/* NOTREACHED */
+	}
+	return (0);
+}
+
+static int
+__db_set_malloc(dbp, func)
+	DB *dbp;
+	void *(*func) __P((size_t));
+{
+	DB_ILLEGAL_AFTER_OPEN(dbp, "set_malloc");
+
+	dbp->db_malloc = func;
+	return (0);
+}
+
+static int
+__db_set_pagesize(dbp, db_pagesize)
+	DB *dbp;
+	u_int32_t db_pagesize;
+{
+	DB_ILLEGAL_AFTER_OPEN(dbp, "set_pagesize");
+
+	if (db_pagesize < DB_MIN_PGSIZE) {
+		__db_err(dbp->dbenv, "page sizes may not be smaller than %lu",
+		    (u_long)DB_MIN_PGSIZE);
+		return (EINVAL);
+	}
+	if (db_pagesize > DB_MAX_PGSIZE) {
+		__db_err(dbp->dbenv, "page sizes may not be larger than %lu",
+		    (u_long)DB_MAX_PGSIZE);
+		return (EINVAL);
+	}
+
+	/*
+	 * We don't want anything that's not a power-of-2, as we rely on that
+	 * for alignment of various types on the pages.
+	 */
+	if ((u_int32_t)1 << __db_log2(db_pagesize) != db_pagesize) {
+		__db_err(dbp->dbenv, "page sizes must be a power-of-2");
+		return (EINVAL);
+	}
+
+	/*
+	 * XXX
+	 * Should we be checking for a page size that's not a multiple of 512,
+	 * so that we never try to write less than a disk sector?
+	 */
+	dbp->pgsize = db_pagesize;
+
+	return (0);
+}
+
+static int
+__db_set_realloc(dbp, func)
+	DB *dbp;
+	void *(*func) __P((void *, size_t));
+{
+	DB_ILLEGAL_AFTER_OPEN(dbp, "set_realloc");
+
+	dbp->db_realloc = func;
+	return (0);
+}
+
+static int
+__db_set_paniccall(dbp, paniccall)
+	DB *dbp;
+	void (*paniccall) __P((DB_ENV *, int));
+{
+	return (dbp->dbenv->set_paniccall(dbp->dbenv, paniccall));
+}
+
+#ifdef HAVE_RPC
+/*
+ * __dbcl_init --
+ *	Initialize a DB structure on the server.
+ *
+ * PUBLIC: #ifdef HAVE_RPC
+ * PUBLIC: int __dbcl_init __P((DB *, DB_ENV *, u_int32_t));
+ * PUBLIC: #endif
+ */
+int
+__dbcl_init(dbp, dbenv, flags)
+	DB *dbp;
+	DB_ENV *dbenv;
+	u_int32_t flags;
+{
+	CLIENT *cl;
+	__db_create_reply *replyp;
+	__db_create_msg req;
+	int ret;
+
+	TAILQ_INIT(&dbp->free_queue);
+	TAILQ_INIT(&dbp->active_queue);
+	/* !!!
+	 * Note that we don't need to initialize the join_queue;  it's
+	 * not used in RPC clients.  See the comment in __dbcl_db_join_ret().
+	 */
+
+	dbp->close = __dbcl_db_close;
+	dbp->cursor = __dbcl_db_cursor;
+	dbp->del = __dbcl_db_del;
+	dbp->err = __dbh_err;
+	dbp->errx = __dbh_errx;
+	dbp->fd = __dbcl_db_fd;
+	dbp->get = __dbcl_db_get;
+	dbp->get_byteswapped = __dbcl_db_swapped;
+	dbp->get_type = __db_get_type;
+	dbp->join = __dbcl_db_join;
+	dbp->key_range = __dbcl_db_key_range;
+	dbp->open = __dbcl_db_open;
+	dbp->put = __dbcl_db_put;
+	dbp->remove = __dbcl_db_remove;
+	dbp->rename = __dbcl_db_rename;
+	dbp->set_append_recno = __dbcl_db_set_append_recno;
+	dbp->set_cachesize = __dbcl_db_cachesize;
+	dbp->set_dup_compare = NULL;
+	dbp->set_errcall = __db_set_errcall;
+	dbp->set_errfile = __db_set_errfile;
+	dbp->set_errpfx = __db_set_errpfx;
+	dbp->set_feedback = __dbcl_db_feedback;
+	dbp->set_flags = __dbcl_db_flags;
+	dbp->set_lorder = __dbcl_db_lorder;
+	dbp->set_malloc = __dbcl_db_malloc;
+	dbp->set_pagesize = __dbcl_db_pagesize;
+	dbp->set_paniccall = __dbcl_db_panic;
+	dbp->set_q_extentsize = __dbcl_db_extentsize;
+	dbp->set_realloc = __dbcl_db_realloc;
+	dbp->stat = __dbcl_db_stat;
+	dbp->sync = __dbcl_db_sync;
+	dbp->upgrade = __dbcl_db_upgrade;
+
+	/*
+	 * Set all the method specific functions to client funcs as well.
+	 */
+	dbp->set_bt_compare = __dbcl_db_bt_compare;
+	dbp->set_bt_maxkey = __dbcl_db_bt_maxkey;
+	dbp->set_bt_minkey = __dbcl_db_bt_minkey;
+	dbp->set_bt_prefix = __dbcl_db_bt_prefix;
+	dbp->set_h_ffactor = __dbcl_db_h_ffactor;
+	dbp->set_h_hash = __dbcl_db_h_hash;
+	dbp->set_h_nelem = __dbcl_db_h_nelem;
+	dbp->set_re_delim = __dbcl_db_re_delim;
+	dbp->set_re_len = __dbcl_db_re_len;
+	dbp->set_re_pad = __dbcl_db_re_pad;
+	dbp->set_re_source = __dbcl_db_re_source;
+/*
+	dbp->set_q_extentsize = __dbcl_db_q_extentsize;
+*/
+
+	cl = (CLIENT *)dbenv->cl_handle;
+	req.flags = flags;
+	req.envpcl_id = dbenv->cl_id;
+
+	/*
+	 * CALL THE SERVER
+	 */
+	replyp = __db_db_create_1(&req, cl);
+	if (replyp == NULL) {
+		__db_err(dbenv, "%s", clnt_sperror(cl, "Berkeley DB"));
+		return (DB_NOSERVER);
+	}
+
+	if ((ret = replyp->status) != 0)
+		return (ret);
+
+	dbp->cl_id = replyp->dbpcl_id;
+	return (0);
+}
+#endif
diff --git a/bdb/db/db_overflow.c b/bdb/db/db_overflow.c
new file mode 100644
index 00000000000..54f0a03aafe
--- /dev/null
+++ b/bdb/db/db_overflow.c
@@ -0,0 +1,681 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ *	Sleepycat Software.  All rights reserved.
+ */
+/*
+ * Copyright (c) 1990, 1993, 1994, 1995, 1996
+ *	Keith Bostic.  All rights reserved.
+ */
+/*
+ * Copyright (c) 1990, 1993, 1994, 1995
+ *	The Regents of the University of California.  All rights reserved.
+ *
+ * This code is derived from software contributed to Berkeley by
+ * Mike Olson.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of the University nor the names of its contributors
+ *    may be used to endorse or promote products derived from this software
+ *    without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_overflow.c,v 11.21 2000/11/30 00:58:32 ubell Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_am.h"
+#include "db_verify.h"
+
+/*
+ * Big key/data code.
+ *
+ * Big key and data entries are stored on linked lists of pages.  The initial
+ * reference is a structure with the total length of the item and the page
+ * number where it begins.  Each entry in the linked list contains a pointer
+ * to the next page of data, and so on.
+ */
+
+/*
+ * __db_goff --
+ *	Get an offpage item.
+ *
+ * PUBLIC: int __db_goff __P((DB *, DBT *,
+ * PUBLIC:     u_int32_t, db_pgno_t, void **, u_int32_t *));
+ */
+int
+__db_goff(dbp, dbt, tlen, pgno, bpp, bpsz)
+	DB *dbp;
+	DBT *dbt;
+	u_int32_t tlen;
+	db_pgno_t pgno;
+	void **bpp;
+	u_int32_t *bpsz;
+{
+	DB_ENV *dbenv;
+	PAGE *h;
+	db_indx_t bytes;
+	u_int32_t curoff, needed, start;
+	u_int8_t *p, *src;
+	int ret;
+
+	dbenv = dbp->dbenv;
+
+	/*
+	 * Check if the buffer is big enough; if it is not and we are
+	 * allowed to malloc space, then we'll malloc it.  If we are
+	 * not (DB_DBT_USERMEM), then we'll set the dbt and return
+	 * appropriately.
+	 */
+	if (F_ISSET(dbt, DB_DBT_PARTIAL)) {
+		start = dbt->doff;
+		needed = dbt->dlen;
+	} else {
+		start = 0;
+		needed = tlen;
+	}
+
+	/* Allocate any necessary memory. */
+	if (F_ISSET(dbt, DB_DBT_USERMEM)) {
+		if (needed > dbt->ulen) {
+			dbt->size = needed;
+			return (ENOMEM);
+		}
+	} else if (F_ISSET(dbt, DB_DBT_MALLOC)) {
+		if ((ret = __os_malloc(dbenv,
+		    needed, dbp->db_malloc, &dbt->data)) != 0)
+			return (ret);
+	} else if (F_ISSET(dbt, DB_DBT_REALLOC)) {
+		if ((ret = __os_realloc(dbenv,
+		    needed, dbp->db_realloc, &dbt->data)) != 0)
+			return (ret);
+	} else if (*bpsz == 0 || *bpsz < needed) {
+		if ((ret = __os_realloc(dbenv, needed, NULL, bpp)) != 0)
+			return (ret);
+		*bpsz = needed;
+		dbt->data = *bpp;
+	} else
+		dbt->data = *bpp;
+
+	/*
+	 * Step through the linked list of pages, copying the data on each
+	 * one into the buffer.  Never copy more than the total data length.
+	 */
+	dbt->size = needed;
+	for (curoff = 0, p = dbt->data; pgno != PGNO_INVALID && needed > 0;) {
+		if ((ret = memp_fget(dbp->mpf, &pgno, 0, &h)) != 0) {
+			(void)__db_pgerr(dbp, pgno);
+			return (ret);
+		}
+		/* Check if we need any bytes from this page. */
+		if (curoff + OV_LEN(h) >= start) {
+			src = (u_int8_t *)h + P_OVERHEAD;
+			bytes = OV_LEN(h);
+			if (start > curoff) {
+				src += start - curoff;
+				bytes -= start - curoff;
+			}
+			if (bytes > needed)
+				bytes = needed;
+			memcpy(p, src, bytes);
+			p += bytes;
+			needed -= bytes;
+		}
+		curoff += OV_LEN(h);
+		pgno = h->next_pgno;
+		(void)memp_fput(dbp->mpf, h, 0);
+	}
+	return (0);
+}
+
+/*
+ * __db_poff --
+ *	Put an offpage item.
+ *
+ * PUBLIC: int __db_poff __P((DBC *, const DBT *, db_pgno_t *));
+ */
+int
+__db_poff(dbc, dbt, pgnop)
+	DBC *dbc;
+	const DBT *dbt;
+	db_pgno_t *pgnop;
+{
+	DB *dbp;
+	PAGE *pagep, *lastp;
+	DB_LSN new_lsn, null_lsn;
+	DBT tmp_dbt;
+	db_indx_t pagespace;
+	u_int32_t sz;
+	u_int8_t *p;
+	int ret;
+
+	/*
+	 * Allocate pages and copy the key/data item into them.  Calculate the
+	 * number of bytes we get for pages we fill completely with a single
+	 * item.
+	 */
+	dbp = dbc->dbp;
+	pagespace = P_MAXSPACE(dbp->pgsize);
+
+	lastp = NULL;
+	for (p = dbt->data,
+	    sz = dbt->size; sz > 0; p += pagespace, sz -= pagespace) {
+		/*
+		 * Reduce pagespace so we terminate the loop correctly and
+		 * don't copy too much data.
+		 */
+		if (sz < pagespace)
+			pagespace = sz;
+
+		/*
+		 * Allocate and initialize a new page and copy all or part of
+		 * the item onto the page.  If sz is less than pagespace, we
+		 * have a partial record.
+		 */
+		if ((ret = __db_new(dbc, P_OVERFLOW, &pagep)) != 0)
+			return (ret);
+		if (DB_LOGGING(dbc)) {
+			tmp_dbt.data = p;
+			tmp_dbt.size = pagespace;
+			ZERO_LSN(null_lsn);
+			if ((ret = __db_big_log(dbp->dbenv, dbc->txn,
+			    &new_lsn, 0, DB_ADD_BIG, dbp->log_fileid,
+			    PGNO(pagep), lastp ? PGNO(lastp) : PGNO_INVALID,
+			    PGNO_INVALID, &tmp_dbt, &LSN(pagep),
+			    lastp == NULL ? &null_lsn : &LSN(lastp),
+			    &null_lsn)) != 0)
+				return (ret);
+
+			/* Move lsn onto page. */
+			if (lastp)
+				LSN(lastp) = new_lsn;
+			LSN(pagep) = new_lsn;
+		}
+
+		P_INIT(pagep, dbp->pgsize,
+		    PGNO(pagep), PGNO_INVALID, PGNO_INVALID, 0, P_OVERFLOW);
+		OV_LEN(pagep) = pagespace;
+		OV_REF(pagep) = 1;
+		memcpy((u_int8_t *)pagep + P_OVERHEAD, p, pagespace);
+
+		/*
+		 * If this is the first entry, update the user's info.
+		 * Otherwise, update the entry on the last page filled
+		 * in and release that page.
+		 */
+		if (lastp == NULL)
+			*pgnop = PGNO(pagep);
+		else {
+			lastp->next_pgno = PGNO(pagep);
+			pagep->prev_pgno = PGNO(lastp);
+			(void)memp_fput(dbp->mpf, lastp, DB_MPOOL_DIRTY);
+		}
+		lastp = pagep;
+	}
+	(void)memp_fput(dbp->mpf, lastp, DB_MPOOL_DIRTY);
+	return (0);
+}
+
+/*
+ * __db_ovref --
+ *	Increment/decrement the reference count on an overflow page.
+ *
+ * PUBLIC: int __db_ovref __P((DBC *, db_pgno_t, int32_t));
+ */
+int
+__db_ovref(dbc, pgno, adjust)
+	DBC *dbc;
+	db_pgno_t pgno;
+	int32_t adjust;
+{
+	DB *dbp;
+	PAGE *h;
+	int ret;
+
+	dbp = dbc->dbp;
+	if ((ret = memp_fget(dbp->mpf, &pgno, 0, &h)) != 0) {
+		(void)__db_pgerr(dbp, pgno);
+		return (ret);
+	}
+
+	if (DB_LOGGING(dbc))
+		if ((ret = __db_ovref_log(dbp->dbenv, dbc->txn,
+		    &LSN(h), 0, dbp->log_fileid, h->pgno, adjust,
+		    &LSN(h))) != 0)
+			return (ret);
+	OV_REF(h) += adjust;
+
+	(void)memp_fput(dbp->mpf, h, DB_MPOOL_DIRTY);
+	return (0);
+}
+
+/*
+ * __db_doff --
+ *	Delete an offpage chain of overflow pages.
+ *
+ * PUBLIC: int __db_doff __P((DBC *, db_pgno_t));
+ */
+int
+__db_doff(dbc, pgno)
+	DBC *dbc;
+	db_pgno_t pgno;
+{
+	DB *dbp;
+	PAGE *pagep;
+	DB_LSN null_lsn;
+	DBT tmp_dbt;
+	int ret;
+
+	dbp = dbc->dbp;
+	do {
+		if ((ret = memp_fget(dbp->mpf, &pgno, 0, &pagep)) != 0) {
+			(void)__db_pgerr(dbp, pgno);
+			return (ret);
+		}
+
+		DB_ASSERT(TYPE(pagep) == P_OVERFLOW);
+		/*
+		 * If it's referenced by more than one key/data item,
+		 * decrement the reference count and return.
+		 */
+		if (OV_REF(pagep) > 1) {
+			(void)memp_fput(dbp->mpf, pagep, 0);
+			return (__db_ovref(dbc, pgno, -1));
+		}
+
+		if (DB_LOGGING(dbc)) {
+			tmp_dbt.data = (u_int8_t *)pagep + P_OVERHEAD;
+			tmp_dbt.size = OV_LEN(pagep);
+			ZERO_LSN(null_lsn);
+			if ((ret = __db_big_log(dbp->dbenv, dbc->txn,
+			    &LSN(pagep), 0, DB_REM_BIG, dbp->log_fileid,
+			    PGNO(pagep), PREV_PGNO(pagep), NEXT_PGNO(pagep),
+			    &tmp_dbt, &LSN(pagep), &null_lsn, &null_lsn)) != 0)
+				return (ret);
+		}
+		pgno = pagep->next_pgno;
+		if ((ret = __db_free(dbc, pagep)) != 0)
+			return (ret);
+	} while (pgno != PGNO_INVALID);
+
+	return (0);
+}
+
+/*
+ * __db_moff --
+ *	Match on overflow pages.
+ *
+ * Given a starting page number and a key, return <0, 0, >0 to indicate if the
+ * key on the page is less than, equal to or greater than the key specified.
+ * We optimize this by doing chunk-at-a-time comparison unless the user has
+ * specified a comparison function.  In that case, we need to materialize
+ * the entire object and call their comparison routine.
+ *
+ * PUBLIC: int __db_moff __P((DB *, const DBT *, db_pgno_t, u_int32_t,
+ * PUBLIC:     int (*)(DB *, const DBT *, const DBT *), int *));
+ */
+int
+__db_moff(dbp, dbt, pgno, tlen, cmpfunc, cmpp)
+	DB *dbp;
+	const DBT *dbt;
+	db_pgno_t pgno;
+	u_int32_t tlen;
+	int (*cmpfunc) __P((DB *, const DBT *, const DBT *)), *cmpp;
+{
+	PAGE *pagep;
+	DBT local_dbt;
+	void *buf;
+	u_int32_t bufsize, cmp_bytes, key_left;
+	u_int8_t *p1, *p2;
+	int ret;
+
+	/*
+	 * If there is a user-specified comparison function, build a
+	 * contiguous copy of the key, and call it.
+	 */
+	if (cmpfunc != NULL) {
+		memset(&local_dbt, 0, sizeof(local_dbt));
+		buf = NULL;
+		bufsize = 0;
+
+		if ((ret = __db_goff(dbp,
+		    &local_dbt, tlen, pgno, &buf, &bufsize)) != 0)
+			return (ret);
+		/* Pass the key as the first argument */
+		*cmpp = cmpfunc(dbp, dbt, &local_dbt);
+		__os_free(buf, bufsize);
+		return (0);
+	}
+
+	/* While there is data left in both keys to compare. */
+	for (*cmpp = 0, p1 = dbt->data,
+	    key_left = dbt->size; key_left > 0 && pgno != PGNO_INVALID;) {
+		if ((ret = memp_fget(dbp->mpf, &pgno, 0, &pagep)) != 0)
+			return (ret);
+
+		cmp_bytes = OV_LEN(pagep) < key_left ? OV_LEN(pagep) : key_left;
+		tlen -= cmp_bytes;
+		key_left -= cmp_bytes;
+		for (p2 =
+		    (u_int8_t *)pagep + P_OVERHEAD; cmp_bytes-- > 0; ++p1, ++p2)
+			if (*p1 != *p2) {
+				*cmpp = (long)*p1 - (long)*p2;
+				break;
+			}
+		pgno = NEXT_PGNO(pagep);
+		if ((ret = memp_fput(dbp->mpf, pagep, 0)) != 0)
+			return (ret);
+		if (*cmpp != 0)
+			return (0);
+	}
+	if (key_left > 0)		/* DBT is longer than the page key. */
+		*cmpp = 1;
+	else if (tlen > 0)		/* DBT is shorter than the page key. */
+		*cmpp = -1;
+	else
+		*cmpp = 0;
+
+	return (0);
+}
+
+/*
+ * __db_vrfy_overflow --
+ *	Verify overflow page.
+ *
+ * PUBLIC: int __db_vrfy_overflow __P((DB *, VRFY_DBINFO *, PAGE *, db_pgno_t,
+ * PUBLIC:     u_int32_t));
+ */
+int
+__db_vrfy_overflow(dbp, vdp, h, pgno, flags)
+	DB *dbp;
+	VRFY_DBINFO *vdp;
+	PAGE *h;
+	db_pgno_t pgno;
+	u_int32_t flags;
+{
+	VRFY_PAGEINFO *pip;
+	int isbad, ret, t_ret;
+
+	isbad = 0;
+	if ((ret = __db_vrfy_getpageinfo(vdp, pgno, &pip)) != 0)
+		return (ret);
+
+	if ((ret = __db_vrfy_datapage(dbp, vdp, h, pgno, flags)) != 0) {
+		if (ret == DB_VERIFY_BAD)
+			isbad = 1;
+		else
+			goto err;
+	}
+
+	pip->refcount = OV_REF(h);
+	if (pip->refcount < 1) {
+		EPRINT((dbp->dbenv,
+		    "Overflow page %lu has zero reference count",
+		    (u_long)pgno));
+		isbad = 1;
+	}
+
+	/* Just store for now. */
+	pip->olen = HOFFSET(h);
+
+err:	if ((t_ret = __db_vrfy_putpageinfo(vdp, pip)) != 0 && ret == 0)
+		ret = t_ret;
+	return ((ret == 0 && isbad == 1) ? DB_VERIFY_BAD : ret);
+}
+
+/*
+ * __db_vrfy_ovfl_structure --
+ *	Walk a list of overflow pages, avoiding cycles and marking
+ *	pages seen.
+ *
+ * PUBLIC: int __db_vrfy_ovfl_structure
+ * PUBLIC:     __P((DB *, VRFY_DBINFO *, db_pgno_t, u_int32_t, u_int32_t));
+ */
+int
+__db_vrfy_ovfl_structure(dbp, vdp, pgno, tlen, flags)
+	DB *dbp;
+	VRFY_DBINFO *vdp;
+	db_pgno_t pgno;
+	u_int32_t tlen;
+	u_int32_t flags;
+{
+	DB *pgset;
+	VRFY_PAGEINFO *pip;
+	db_pgno_t next, prev;
+	int isbad, p, ret, t_ret;
+	u_int32_t refcount;
+
+	pgset = vdp->pgset;
+	DB_ASSERT(pgset != NULL);
+	isbad = 0;
+
+	/* This shouldn't happen, but just to be sure. */
+	if (!IS_VALID_PGNO(pgno))
+		return (DB_VERIFY_BAD);
+
+	/*
+	 * Check the first prev_pgno;  it ought to be PGNO_INVALID,
+	 * since there's no prev page.
+	 */
+	if ((ret = __db_vrfy_getpageinfo(vdp, pgno, &pip)) != 0)
+		return (ret);
+
+	/* The refcount is stored on the first overflow page. */
+	refcount = pip->refcount;
+
+	if (pip->type != P_OVERFLOW) {
+		EPRINT((dbp->dbenv,
+		    "Overflow page %lu of invalid type %lu",
+		    (u_long)pgno, (u_long)pip->type));
+		ret = DB_VERIFY_BAD;
+		goto err;		/* Unsafe to continue. */
+	}
+
+	prev = pip->prev_pgno;
+	if (prev != PGNO_INVALID) {
+		EPRINT((dbp->dbenv,
+		    "First overflow page %lu has a prev_pgno", (u_long)pgno));
+		isbad = 1;
+	}
+
+	for (;;) {
+		/*
+		 * This is slightly gross.  Btree leaf pages reference
+		 * individual overflow trees multiple times if the overflow page
+		 * is the key to a duplicate set.  The reference count does not
+		 * reflect this multiple referencing.  Thus, if this is called
+		 * during the structure verification of a btree leaf page, we
+		 * check to see whether we've seen it from a leaf page before
+		 * and, if we have, adjust our count of how often we've seen it
+		 * accordingly.
+		 *
+		 * (This will screw up if it's actually referenced--and
+		 * correctly refcounted--from two different leaf pages, but
+		 * that's a very unlikely brokenness that we're not checking for
+		 * anyway.)
+		 */
+
+		if (LF_ISSET(ST_OVFL_LEAF)) {
+			if (F_ISSET(pip, VRFY_OVFL_LEAFSEEN)) {
+				if ((ret =
+				    __db_vrfy_pgset_dec(pgset, pgno)) != 0)
+					goto err;
+			} else
+				F_SET(pip, VRFY_OVFL_LEAFSEEN);
+		}
+
+		if ((ret = __db_vrfy_pgset_get(pgset, pgno, &p)) != 0)
+			goto err;
+
+		/*
+		 * We may have seen this elsewhere, if the overflow entry
+		 * has been promoted to an internal page.
+		 */
+		if ((u_int32_t)p > refcount) {
+			EPRINT((dbp->dbenv,
+			    "Page %lu encountered twice in overflow traversal",
+			    (u_long)pgno));
+			ret = DB_VERIFY_BAD;
+			goto err;
+		}
+		if ((ret = __db_vrfy_pgset_inc(pgset, pgno)) != 0)
+			goto err;
+
+		/* Keep a running tally of how much of the item we've seen. */
+		tlen -= pip->olen;
+
+		/* Send feedback to the application about our progress. */
+		if (!LF_ISSET(DB_SALVAGE))
+			__db_vrfy_struct_feedback(dbp, vdp);
+
+		next = pip->next_pgno;
+
+		/* Are we there yet? */
+		if (next == PGNO_INVALID)
+			break;
+
+		/*
+		 * We've already checked this when we saved it, but just
+		 * to be sure...
+		 */
+		if (!IS_VALID_PGNO(next)) {
+			DB_ASSERT(0);
+			EPRINT((dbp->dbenv,
+			    "Overflow page %lu has bad next_pgno",
+			    (u_long)pgno));
+			ret = DB_VERIFY_BAD;
+			goto err;
+		}
+
+		if ((ret = __db_vrfy_putpageinfo(vdp, pip)) != 0 ||
+		    (ret = __db_vrfy_getpageinfo(vdp, next, &pip)) != 0)
+			return (ret);
+		if (pip->prev_pgno != pgno) {
+			EPRINT((dbp->dbenv,
+			    "Overflow page %lu has bogus prev_pgno value",
+			    (u_long)next));
+			isbad = 1;
+			/*
+			 * It's safe to continue because we have separate
+			 * cycle detection.
+			 */
+		}
+
+		pgno = next;
+	}
+
+	if (tlen > 0) {
+		isbad = 1;
+		EPRINT((dbp->dbenv,
+		    "Overflow item incomplete on page %lu", (u_long)pgno));
+	}
+
+err:	if ((t_ret = __db_vrfy_putpageinfo(vdp, pip)) != 0 && ret == 0)
+		ret = t_ret;
+	return ((ret == 0 && isbad == 1) ? DB_VERIFY_BAD : ret);
+}
+
+/*
+ * __db_safe_goff --
+ *	Get an overflow item, very carefully, from an untrusted database,
+ *	in the context of the salvager.
+ *
+ * PUBLIC: int __db_safe_goff __P((DB *, VRFY_DBINFO *, db_pgno_t,
+ * PUBLIC:     DBT *, void **, u_int32_t));
+ */
+int
+__db_safe_goff(dbp, vdp, pgno, dbt, buf, flags)
+	DB *dbp;
+	VRFY_DBINFO *vdp;
+	db_pgno_t pgno;
+	DBT *dbt;
+	void **buf;
+	u_int32_t flags;
+{
+	PAGE *h;
+	int ret, err_ret;
+	u_int32_t bytesgot, bytes;
+	u_int8_t *src, *dest;
+
+	ret = DB_VERIFY_BAD;
+	err_ret = 0;
+	bytesgot = bytes = 0;
+
+	while ((pgno != PGNO_INVALID) && (IS_VALID_PGNO(pgno))) {
+		/*
+		 * Mark that we're looking at this page;  if we've seen it
+		 * already, quit.
+		 */
+		if ((ret = __db_salvage_markdone(vdp, pgno)) != 0)
+			break;
+
+		if ((ret = memp_fget(dbp->mpf, &pgno, 0, &h)) != 0)
+			break;
+
+		/*
+		 * Make sure it's really an overflow page, unless we're
+		 * being aggressive, in which case we pretend it is.
+		 */
+		if (!LF_ISSET(DB_AGGRESSIVE) && TYPE(h) != P_OVERFLOW) {
+			(void)memp_fput(dbp->mpf, h, 0);
+			ret = DB_VERIFY_BAD;
+			break;
+		}
+
+		src = (u_int8_t *)h + P_OVERHEAD;
+		bytes = OV_LEN(h);
+
+		if (bytes + P_OVERHEAD > dbp->pgsize)
+			bytes = dbp->pgsize - P_OVERHEAD;
+
+		if ((ret = __os_realloc(dbp->dbenv,
+		    bytesgot + bytes, 0, buf)) != 0) {
+			(void)memp_fput(dbp->mpf, h, 0);
+			break;
+		}
+
+		dest = (u_int8_t *)*buf + bytesgot;
+		bytesgot += bytes;
+
+		memcpy(dest, src, bytes);
+
+		pgno = NEXT_PGNO(h);
+		/* Not much we can do here--we don't want to quit. */
+		if ((ret = memp_fput(dbp->mpf, h, 0)) != 0)
+			err_ret = ret;
+	}
+
+	if (ret == 0) {
+		dbt->size = bytesgot;
+		dbt->data = *buf;
+	}
+
+	return ((err_ret != 0 && ret == 0) ? err_ret : ret);
+}
diff --git a/bdb/db/db_pr.c b/bdb/db/db_pr.c
new file mode 100644
index 00000000000..cb977cadfda
--- /dev/null
+++ b/bdb/db/db_pr.c
@@ -0,0 +1,1284 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ *	Sleepycat Software.  All rights reserved.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_pr.c,v 11.46 2001/01/22 17:25:06 krinsky Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <ctype.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "btree.h"
+#include "hash.h"
+#include "qam.h"
+#include "db_am.h"
+#include "db_verify.h"
+
+static int	 __db_bmeta __P((DB *, FILE *, BTMETA *, u_int32_t));
+static int	 __db_hmeta __P((DB *, FILE *, HMETA *, u_int32_t));
+static void	 __db_meta __P((DB *, DBMETA *, FILE *, FN const *, u_int32_t));
+static const char	*__db_dbtype_to_string __P((DB *));
+static void	 __db_prdb __P((DB *, FILE *, u_int32_t));
+static FILE	*__db_prinit __P((FILE *));
+static void	 __db_proff __P((void *));
+static int	 __db_prtree __P((DB *, u_int32_t));
+static void	 __db_psize __P((DB *));
+static int	 __db_qmeta __P((DB *, FILE *, QMETA *, u_int32_t));
+
+/*
+ * 64K is the maximum page size, so by default we check for offsets larger
+ * than that, and, where possible, we refine the test.
+ */
+#define	PSIZE_BOUNDARY	(64 * 1024 + 1)
+static size_t set_psize = PSIZE_BOUNDARY;
+
+static FILE *set_fp;				/* Output file stream. */
+
+/*
+ * __db_loadme --
+ *	A nice place to put a breakpoint.
+ *
+ * PUBLIC: void __db_loadme __P((void));
+ */
+void
+__db_loadme()
+{
+	getpid();
+}
+
+/*
+ * __db_dump --
+ *	Dump the tree to a file.
+ *
+ * PUBLIC: int __db_dump __P((DB *, char *, char *));
+ */
+int
+__db_dump(dbp, op, name)
+	DB *dbp;
+	char *op, *name;
+{
+	FILE *fp, *save_fp;
+	u_int32_t flags;
+
+	COMPQUIET(save_fp, NULL);
+
+	if (set_psize == PSIZE_BOUNDARY)
+		__db_psize(dbp);
+
+	if (name != NULL) {
+		if ((fp = fopen(name, "w")) == NULL)
+			return (__os_get_errno());
+		save_fp = set_fp;
+		set_fp = fp;
+	} else
+		fp = __db_prinit(NULL);
+
+	for (flags = 0; *op != '\0'; ++op)
+		switch (*op) {
+		case 'a':
+			LF_SET(DB_PR_PAGE);
+			break;
+		case 'h':
+			break;
+		case 'r':
+			LF_SET(DB_PR_RECOVERYTEST);
+			break;
+		default:
+			if (name != NULL) {
+				(void)fclose(fp);
+				set_fp = save_fp;
+			}
+			return (EINVAL);
+		}
+
+	__db_prdb(dbp, fp, flags);
+
+	fprintf(fp, "%s\n", DB_LINE);
+
+	(void)__db_prtree(dbp, flags);
+
+	fflush(fp);
+
+	if (name != NULL) {
+		fclose(fp);
+		set_fp = save_fp;
+	}
+	return (0);
+}
+
+/*
+ * __db_prdb --
+ *	Print out the DB structure information.
+ */
+static void
+__db_prdb(dbp, fp, flags)
+	DB *dbp;
+	FILE *fp;
+	u_int32_t flags;
+{
+	static const FN fn[] = {
+		{ DB_AM_DISCARD,	"discard cached pages" },
+		{ DB_AM_DUP,		"duplicates" },
+		{ DB_AM_INMEM,		"in-memory" },
+		{ DB_AM_PGDEF,		"default page size" },
+		{ DB_AM_RDONLY,		"read-only" },
+		{ DB_AM_SUBDB,		"multiple-databases" },
+		{ DB_AM_SWAP,		"needswap" },
+		{ DB_BT_RECNUM,		"btree:recnum" },
+		{ DB_BT_REVSPLIT,	"btree:no reverse split" },
+		{ DB_DBM_ERROR,		"dbm/ndbm error" },
+		{ DB_OPEN_CALLED,	"DB->open called" },
+		{ DB_RE_DELIMITER,	"recno:delimiter" },
+		{ DB_RE_FIXEDLEN,	"recno:fixed-length" },
+		{ DB_RE_PAD,		"recno:pad" },
+		{ DB_RE_RENUMBER,	"recno:renumber" },
+		{ DB_RE_SNAPSHOT,	"recno:snapshot" },
+		{ 0,			NULL }
+	};
+	BTREE *bt;
+	HASH *h;
+	QUEUE *q;
+
+	COMPQUIET(flags, 0);
+
+	fprintf(fp,
+	    "In-memory DB structure:\n%s: %#lx",
+	    __db_dbtype_to_string(dbp), (u_long)dbp->flags);
+	__db_prflags(dbp->flags, fn, fp);
+	fprintf(fp, "\n");
+
+	switch (dbp->type) {
+	case DB_BTREE:
+	case DB_RECNO:
+		bt = dbp->bt_internal;
+		fprintf(fp, "bt_meta: %lu bt_root: %lu\n",
+		    (u_long)bt->bt_meta, (u_long)bt->bt_root);
+		fprintf(fp, "bt_maxkey: %lu bt_minkey: %lu\n",
+		    (u_long)bt->bt_maxkey, (u_long)bt->bt_minkey);
+		fprintf(fp, "bt_compare: %#lx bt_prefix: %#lx\n",
+		    (u_long)bt->bt_compare, (u_long)bt->bt_prefix);
+		fprintf(fp, "bt_lpgno: %lu\n", (u_long)bt->bt_lpgno);
+		if (dbp->type == DB_RECNO) {
+			fprintf(fp,
+		    "re_pad: %#lx re_delim: %#lx re_len: %lu re_source: %s\n",
+			    (u_long)bt->re_pad, (u_long)bt->re_delim,
+			    (u_long)bt->re_len,
+			    bt->re_source == NULL ? "" : bt->re_source);
+			fprintf(fp, "re_modified: %d re_eof: %d re_last: %lu\n",
+			    bt->re_modified, bt->re_eof, (u_long)bt->re_last);
+		}
+		break;
+	case DB_HASH:
+		h = dbp->h_internal;
+		fprintf(fp, "meta_pgno: %lu\n", (u_long)h->meta_pgno);
+		fprintf(fp, "h_ffactor: %lu\n", (u_long)h->h_ffactor);
+		fprintf(fp, "h_nelem: %lu\n", (u_long)h->h_nelem);
+		fprintf(fp, "h_hash: %#lx\n", (u_long)h->h_hash);
+		break;
+	case DB_QUEUE:
+		q = dbp->q_internal;
+		fprintf(fp, "q_meta: %lu\n", (u_long)q->q_meta);
+		fprintf(fp, "q_root: %lu\n", (u_long)q->q_root);
+		fprintf(fp, "re_pad: %#lx re_len: %lu\n",
+		    (u_long)q->re_pad, (u_long)q->re_len);
+		fprintf(fp, "rec_page: %lu\n", (u_long)q->rec_page);
+		fprintf(fp, "page_ext: %lu\n", (u_long)q->page_ext);
+		break;
+	default:
+		break;
+	}
+}
+
+/*
+ * __db_prtree --
+ *	Print out the entire tree.
+ */
+static int
+__db_prtree(dbp, flags)
+	DB *dbp;
+	u_int32_t flags;
+{
+	PAGE *h;
+	db_pgno_t i, last;
+	int ret;
+
+	if (set_psize == PSIZE_BOUNDARY)
+		__db_psize(dbp);
+
+	if (dbp->type == DB_QUEUE) {
+		ret = __db_prqueue(dbp, flags);
+		goto done;
+	}
+
+	/* Find out the page number of the last page in the database. */
+	if ((ret = memp_fget(dbp->mpf, &last, DB_MPOOL_LAST, &h)) != 0)
+		return (ret);
+	if ((ret = memp_fput(dbp->mpf, h, 0)) != 0)
+		return (ret);
+
+	/* Dump each page. */
+	for (i = 0; i <= last; ++i) {
+		if ((ret = memp_fget(dbp->mpf, &i, 0, &h)) != 0)
+			return (ret);
+		(void)__db_prpage(dbp, h, flags);
+		if ((ret = memp_fput(dbp->mpf, h, 0)) != 0)
+			return (ret);
+	}
+
+done:
+	(void)fflush(__db_prinit(NULL));
+	return (0);
+}
+
+/*
+ * __db_meta --
+ *	Print out common metadata information.
+ */
+static void
+__db_meta(dbp, dbmeta, fp, fn, flags)
+	DB *dbp;
+	DBMETA *dbmeta;
+	FILE *fp;
+	FN const *fn;
+	u_int32_t flags;
+{
+	PAGE *h;
+	int cnt;
+	db_pgno_t pgno;
+	u_int8_t *p;
+	int ret;
+	const char *sep;
+
+	fprintf(fp, "\tmagic: %#lx\n", (u_long)dbmeta->magic);
+	fprintf(fp, "\tversion: %lu\n", (u_long)dbmeta->version);
+	fprintf(fp, "\tpagesize: %lu\n", (u_long)dbmeta->pagesize);
+	fprintf(fp, "\ttype: %lu\n", (u_long)dbmeta->type);
+	fprintf(fp, "\tkeys: %lu\trecords: %lu\n",
+	    (u_long)dbmeta->key_count, (u_long)dbmeta->record_count);
+
+	if (!LF_ISSET(DB_PR_RECOVERYTEST)) {
+		/*
+		 * If we're doing recovery testing, don't display the free
+		 * list; it may have changed, and that would make the dump
+		 * diff fail.
+		 */
+		fprintf(fp, "\tfree list: %lu", (u_long)dbmeta->free);
+		for (pgno = dbmeta->free,
+		    cnt = 0, sep = ", "; pgno != PGNO_INVALID;) {
+			if ((ret = memp_fget(dbp->mpf, &pgno, 0, &h)) != 0) {
+				fprintf(fp,
+			    "Unable to retrieve free-list page: %lu: %s\n",
+				    (u_long)pgno, db_strerror(ret));
+				break;
+			}
+			pgno = h->next_pgno;
+			(void)memp_fput(dbp->mpf, h, 0);
+			fprintf(fp, "%s%lu", sep, (u_long)pgno);
+			if (++cnt % 10 == 0) {
+				fprintf(fp, "\n");
+				cnt = 0;
+				sep = "\t";
+			} else
+				sep = ", ";
+		}
+		fprintf(fp, "\n");
+	}
+
+	if (fn != NULL) {
+		fprintf(fp, "\tflags: %#lx", (u_long)dbmeta->flags);
+		__db_prflags(dbmeta->flags, fn, fp);
+		fprintf(fp, "\n");
+	}
+
+	fprintf(fp, "\tuid: ");
+	for (p = (u_int8_t *)dbmeta->uid,
+	    cnt = 0; cnt < DB_FILE_ID_LEN; ++cnt) {
+		fprintf(fp, "%x", *p++);
+		if (cnt < DB_FILE_ID_LEN - 1)
+			fprintf(fp, " ");
+	}
+	fprintf(fp, "\n");
+}
+
+/*
+ * __db_bmeta --
+ *	Print out the btree meta-data page.
+ */
+static int
+__db_bmeta(dbp, fp, h, flags)
+	DB *dbp;
+	FILE *fp;
+	BTMETA *h;
+	u_int32_t flags;
+{
+	static const FN mfn[] = {
+		{ BTM_DUP,	"duplicates" },
+		{ BTM_RECNO,	"recno" },
+		{ BTM_RECNUM,	"btree:recnum" },
+		{ BTM_FIXEDLEN,	"recno:fixed-length" },
+		{ BTM_RENUMBER,	"recno:renumber" },
+		{ BTM_SUBDB,	"multiple-databases" },
+		{ 0,		NULL }
+	};
+
+	__db_meta(dbp, (DBMETA *)h, fp, mfn, flags);
+
+	fprintf(fp, "\tmaxkey: %lu minkey: %lu\n",
+	    (u_long)h->maxkey, (u_long)h->minkey);
+	if (dbp->type == DB_RECNO)
+		fprintf(fp, "\tre_len: %lu re_pad: %#lx\n",
+		    (u_long)h->re_len, (u_long)h->re_pad);
+	fprintf(fp, "\troot: %lu\n", (u_long)h->root);
+
+	return (0);
+}
+
+/*
+ * __db_hmeta --
+ *	Print out the hash meta-data page.
+ */
+static int
+__db_hmeta(dbp, fp, h, flags)
+	DB *dbp;
+	FILE *fp;
+	HMETA *h;
+	u_int32_t flags;
+{
+	static const FN mfn[] = {
+		{ DB_HASH_DUP,	 "duplicates" },
+		{ DB_HASH_SUBDB, "multiple-databases" },
+		{ 0,		 NULL }
+	};
+	int i;
+
+	__db_meta(dbp, (DBMETA *)h, fp, mfn, flags);
+
+	fprintf(fp, "\tmax_bucket: %lu\n", (u_long)h->max_bucket);
+	fprintf(fp, "\thigh_mask: %#lx\n", (u_long)h->high_mask);
+	fprintf(fp, "\tlow_mask:  %#lx\n", (u_long)h->low_mask);
+	fprintf(fp, "\tffactor: %lu\n", (u_long)h->ffactor);
+	fprintf(fp, "\tnelem: %lu\n", (u_long)h->nelem);
+	fprintf(fp, "\th_charkey: %#lx\n", (u_long)h->h_charkey);
+	fprintf(fp, "\tspare points: ");
+	for (i = 0; i < NCACHED; i++)
+		fprintf(fp, "%lu ", (u_long)h->spares[i]);
+	fprintf(fp, "\n");
+
+	return (0);
+}
+
+/*
+ * __db_qmeta --
+ *	Print out the queue meta-data page.
+ */
+static int
+__db_qmeta(dbp, fp, h, flags)
+	DB *dbp;
+	FILE *fp;
+	QMETA *h;
+	u_int32_t flags;
+{
+	__db_meta(dbp, (DBMETA *)h, fp, NULL, flags);
+
+	fprintf(fp, "\tfirst_recno: %lu\n", (u_long)h->first_recno);
+	fprintf(fp, "\tcur_recno: %lu\n", (u_long)h->cur_recno);
+	fprintf(fp, "\tre_len: %lu re_pad: %#lx\n",
+	    (u_long)h->re_len, (u_long)h->re_pad);
+	fprintf(fp, "\trec_page: %lu\n", (u_long)h->rec_page);
+	fprintf(fp, "\tpage_ext: %lu\n", (u_long)h->page_ext);
+
+	return (0);
+}
+
+/*
+ * __db_prnpage --
+ *	Print out a specific page.
+ *
+ * PUBLIC: int __db_prnpage __P((DB *, db_pgno_t));
+ */
+int
+__db_prnpage(dbp, pgno)
+	DB *dbp;
+	db_pgno_t pgno;
+{
+	PAGE *h;
+	int ret;
+
+	if (set_psize == PSIZE_BOUNDARY)
+		__db_psize(dbp);
+
+	if ((ret = memp_fget(dbp->mpf, &pgno, 0, &h)) != 0)
+		return (ret);
+
+	ret = __db_prpage(dbp, h, DB_PR_PAGE);
+	(void)fflush(__db_prinit(NULL));
+
+	(void)memp_fput(dbp->mpf, h, 0);
+	return (ret);
+}
+
+/*
+ * __db_prpage --
+ *	Print out a page.
+ *
+ * PUBLIC: int __db_prpage __P((DB *, PAGE *, u_int32_t));
+ */
+int
+__db_prpage(dbp, h, flags)
+	DB *dbp;
+	PAGE *h;
+	u_int32_t flags;
+{
+	BINTERNAL *bi;
+	BKEYDATA *bk;
+	BTREE *t;
+	FILE *fp;
+	HOFFPAGE a_hkd;
+	QAMDATA *qp, *qep;
+	RINTERNAL *ri;
+	db_indx_t dlen, len, i;
+	db_pgno_t pgno;
+	db_recno_t recno;
+	int deleted, ret;
+	const char *s;
+	u_int32_t qlen;
+	u_int8_t *ep, *hk, *p;
+	void *sp;
+
+	fp = __db_prinit(NULL);
+
+	/*
+	 * If we're doing recovery testing and this page is P_INVALID,
+	 * assume it's a page that's on the free list, and don't display it.
+	 */
+	if (LF_ISSET(DB_PR_RECOVERYTEST) && TYPE(h) == P_INVALID)
+		return (0);
+
+	s = __db_pagetype_to_string(TYPE(h));
+	if (s == NULL) {
+		fprintf(fp, "ILLEGAL PAGE TYPE: page: %lu type: %lu\n",
+		    (u_long)h->pgno, (u_long)TYPE(h));
+		return (1);
+	}
+
+	/* Page number, page type. */
+	fprintf(fp, "page %lu: %s level: %lu",
+	    (u_long)h->pgno, s, (u_long)h->level);
+
+	/* Record count. */
+	if (TYPE(h) == P_IBTREE ||
+	    TYPE(h) == P_IRECNO || (TYPE(h) == P_LRECNO &&
+	    h->pgno == ((BTREE *)dbp->bt_internal)->bt_root))
+		fprintf(fp, " records: %lu", (u_long)RE_NREC(h));
+
+	/* LSN. */
+	if (!LF_ISSET(DB_PR_RECOVERYTEST))
+		fprintf(fp, " (lsn.file: %lu lsn.offset: %lu)\n",
+		    (u_long)LSN(h).file, (u_long)LSN(h).offset);
+
+	switch (TYPE(h)) {
+	case P_BTREEMETA:
+		return (__db_bmeta(dbp, fp, (BTMETA *)h, flags));
+	case P_HASHMETA:
+		return (__db_hmeta(dbp, fp, (HMETA *)h, flags));
+	case P_QAMMETA:
+		return (__db_qmeta(dbp, fp, (QMETA *)h, flags));
+	case P_QAMDATA:				/* Should be meta->start. */
+		if (!LF_ISSET(DB_PR_PAGE))
+			return (0);
+
+		qlen = ((QUEUE *)dbp->q_internal)->re_len;
+		recno = (h->pgno - 1) * QAM_RECNO_PER_PAGE(dbp) + 1;
+		i = 0;
+		qep = (QAMDATA *)((u_int8_t *)h + set_psize - qlen);
+		for (qp = QAM_GET_RECORD(dbp, h, i); qp < qep;
+		    recno++, i++, qp = QAM_GET_RECORD(dbp, h, i)) {
+			if (!F_ISSET(qp, QAM_SET))
+				continue;
+
+			fprintf(fp, "%s",
+			    F_ISSET(qp, QAM_VALID) ? "\t" : "       D");
+			fprintf(fp, "[%03lu] %4lu ",
+			    (u_long)recno, (u_long)qp - (u_long)h);
+			__db_pr(qp->data, qlen);
+		}
+		return (0);
+	}
+
+	/* LSN. */
+	if (LF_ISSET(DB_PR_RECOVERYTEST))
+		fprintf(fp, " (lsn.file: %lu lsn.offset: %lu)\n",
+		    (u_long)LSN(h).file, (u_long)LSN(h).offset);
+
+	t = dbp->bt_internal;
+
+	s = "\t";
+	if (TYPE(h) != P_IBTREE && TYPE(h) != P_IRECNO) {
+		fprintf(fp, "%sprev: %4lu next: %4lu",
+		    s, (u_long)PREV_PGNO(h), (u_long)NEXT_PGNO(h));
+		s = " ";
+	}
+	if (TYPE(h) == P_OVERFLOW) {
+		fprintf(fp, "%sref cnt: %4lu ", s, (u_long)OV_REF(h));
+		__db_pr((u_int8_t *)h + P_OVERHEAD, OV_LEN(h));
+		return (0);
+	}
+	fprintf(fp, "%sentries: %4lu", s, (u_long)NUM_ENT(h));
+	fprintf(fp, " offset: %4lu\n", (u_long)HOFFSET(h));
+
+	if (TYPE(h) == P_INVALID || !LF_ISSET(DB_PR_PAGE))
+		return (0);
+
+	ret = 0;
+	for (i = 0; i < NUM_ENT(h); i++) {
+		if (P_ENTRY(h, i) - (u_int8_t *)h < P_OVERHEAD ||
+		    (size_t)(P_ENTRY(h, i) - (u_int8_t *)h) >= set_psize) {
+			fprintf(fp,
+			    "ILLEGAL PAGE OFFSET: indx: %lu of %lu\n",
+			    (u_long)i, (u_long)h->inp[i]);
+			ret = EINVAL;
+			continue;
+		}
+		deleted = 0;
+		switch (TYPE(h)) {
+		case P_HASH:
+		case P_IBTREE:
+		case P_IRECNO:
+			sp = P_ENTRY(h, i);
+			break;
+		case P_LBTREE:
+			sp = P_ENTRY(h, i);
+			deleted = i % 2 == 0 &&
+			    B_DISSET(GET_BKEYDATA(h, i + O_INDX)->type);
+			break;
+		case P_LDUP:
+		case P_LRECNO:
+			sp = P_ENTRY(h, i);
+			deleted = B_DISSET(GET_BKEYDATA(h, i)->type);
+			break;
+		default:
+			fprintf(fp,
+			    "ILLEGAL PAGE ITEM: %lu\n", (u_long)TYPE(h));
+			ret = EINVAL;
+			continue;
+		}
+		fprintf(fp, "%s", deleted ? "       D" : "\t");
+		fprintf(fp, "[%03lu] %4lu ", (u_long)i, (u_long)h->inp[i]);
+		switch (TYPE(h)) {
+		case P_HASH:
+			hk = sp;
+			switch (HPAGE_PTYPE(hk)) {
+			case H_OFFDUP:
+				memcpy(&pgno,
+				    HOFFDUP_PGNO(hk), sizeof(db_pgno_t));
+				fprintf(fp,
+				    "%4lu [offpage dups]\n", (u_long)pgno);
+				break;
+			case H_DUPLICATE:
+				/*
+				 * If this is the first item on a page, then
+				 * we cannot figure out how long it is, so
+				 * we only print the first one in the duplicate
+				 * set.
+				 */
+				if (i != 0)
+					len = LEN_HKEYDATA(h, 0, i);
+				else
+					len = 1;
+
+				fprintf(fp, "Duplicates:\n");
+				for (p = HKEYDATA_DATA(hk),
+				    ep = p + len; p < ep;) {
+					memcpy(&dlen, p, sizeof(db_indx_t));
+					p += sizeof(db_indx_t);
+					fprintf(fp, "\t\t");
+					__db_pr(p, dlen);
+					p += sizeof(db_indx_t) + dlen;
+				}
+				break;
+			case H_KEYDATA:
+				__db_pr(HKEYDATA_DATA(hk),
+				    LEN_HKEYDATA(h, i == 0 ? set_psize : 0, i));
+				break;
+			case H_OFFPAGE:
+				memcpy(&a_hkd, hk, HOFFPAGE_SIZE);
+				fprintf(fp,
+				    "overflow: total len: %4lu page: %4lu\n",
+				    (u_long)a_hkd.tlen, (u_long)a_hkd.pgno);
+				break;
+			}
+			break;
+		case P_IBTREE:
+			bi = sp;
+			fprintf(fp, "count: %4lu pgno: %4lu type: %4lu",
+			    (u_long)bi->nrecs, (u_long)bi->pgno,
+			    (u_long)bi->type);
+			switch (B_TYPE(bi->type)) {
+			case B_KEYDATA:
+				__db_pr(bi->data, bi->len);
+				break;
+			case B_DUPLICATE:
+			case B_OVERFLOW:
+				__db_proff(bi->data);
+				break;
+			default:
+				fprintf(fp, "ILLEGAL BINTERNAL TYPE: %lu\n",
+				    (u_long)B_TYPE(bi->type));
+				ret = EINVAL;
+				break;
+			}
+			break;
+		case P_IRECNO:
+			ri = sp;
+			fprintf(fp, "entries %4lu pgno %4lu\n",
+			    (u_long)ri->nrecs, (u_long)ri->pgno);
+			break;
+		case P_LBTREE:
+		case P_LDUP:
+		case P_LRECNO:
+			bk = sp;
+			switch (B_TYPE(bk->type)) {
+			case B_KEYDATA:
+				__db_pr(bk->data, bk->len);
+				break;
+			case B_DUPLICATE:
+			case B_OVERFLOW:
+				__db_proff(bk);
+				break;
+			default:
+				fprintf(fp,
+			    "ILLEGAL DUPLICATE/LBTREE/LRECNO TYPE: %lu\n",
+				    (u_long)B_TYPE(bk->type));
+				ret = EINVAL;
+				break;
+			}
+			break;
+		}
+	}
+	(void)fflush(fp);
+	return (ret);
+}
+
+/*
+ * __db_pr --
+ *	Print out a data element.
+ *
+ * PUBLIC: void __db_pr __P((u_int8_t *, u_int32_t));
+ */
+void
+__db_pr(p, len)
+	u_int8_t *p;
+	u_int32_t len;
+{
+	FILE *fp;
+	u_int lastch;
+	int i;
+
+	fp = __db_prinit(NULL);
+
+	fprintf(fp, "len: %3lu", (u_long)len);
+	lastch = '.';
+	if (len != 0) {
+		fprintf(fp, " data: ");
+		for (i = len <= 20 ? len : 20; i > 0; --i, ++p) {
+			lastch = *p;
+			if (isprint((int)*p) || *p == '\n')
+				fprintf(fp, "%c", *p);
+			else
+				fprintf(fp, "0x%.2x", (u_int)*p);
+		}
+		if (len > 20) {
+			fprintf(fp, "...");
+			lastch = '.';
+		}
+	}
+	if (lastch != '\n')
+		fprintf(fp, "\n");
+}
+
+/*
+ * __db_prdbt --
+ *	Print out a DBT data element.
+ *
+ * PUBLIC: int __db_prdbt __P((DBT *, int, const char *, void *,
+ * PUBLIC:     int (*)(void *, const void *), int, VRFY_DBINFO *));
+ */
+int
+__db_prdbt(dbtp, checkprint, prefix, handle, callback, is_recno, vdp)
+	DBT *dbtp;
+	int checkprint;
+	const char *prefix;
+	void *handle;
+	int (*callback) __P((void *, const void *));
+	int is_recno;
+	VRFY_DBINFO *vdp;
+{
+	static const char hex[] = "0123456789abcdef";
+	db_recno_t recno;
+	u_int32_t len;
+	int ret;
+#define	DBTBUFLEN	100
+	char *p, *hp, buf[DBTBUFLEN], hbuf[DBTBUFLEN];
+
+	if (vdp != NULL) {
+		/*
+		 * If vdp is non-NULL, we might be the first key in the
+		 * "fake" subdatabase used for key/data pairs we can't
+		 * associate with a known subdb.
+		 *
+		 * Check and clear the SALVAGE_PRINTHEADER flag;  if
+		 * it was set, print a subdatabase header.
+		 */
+		if (F_ISSET(vdp, SALVAGE_PRINTHEADER))
+			(void)__db_prheader(NULL, "__OTHER__", 0, 0,
+			    handle, callback, vdp, 0);
+		F_CLR(vdp, SALVAGE_PRINTHEADER);
+		F_SET(vdp, SALVAGE_PRINTFOOTER);
+	}
+
+	/*
+	 * !!!
+	 * This is the routine that dumps out items in the format used by
+	 * db_dump(1) and db_load(1).  This means that the format cannot
+	 * change.
+	 */
+	if (prefix != NULL && (ret = callback(handle, prefix)) != 0)
+		return (ret);
+	if (is_recno) {
+		/*
+		 * We're printing a record number, and this has to be done
+		 * in a platform-independent way.  So we use the numeral in
+		 * straight ASCII.
+		 */
+		__ua_memcpy(&recno, dbtp->data, sizeof(recno));
+		snprintf(buf, DBTBUFLEN, "%lu", (u_long)recno);
+
+		/* If we're printing data as hex, print keys as hex too. */
+		if (!checkprint) {
+			for (len = strlen(buf), p = buf, hp = hbuf;
+			    len-- > 0; ++p) {
+				*hp++ = hex[(u_int8_t)(*p & 0xf0) >> 4];
+				*hp++ = hex[*p & 0x0f];
+			}
+			*hp = '\0';
+			ret = callback(handle, hbuf);
+		} else
+			ret = callback(handle, buf);
+
+		if (ret != 0)
+			return (ret);
+	} else if (checkprint) {
+		for (len = dbtp->size, p = dbtp->data; len--; ++p)
+			if (isprint((int)*p)) {
+				if (*p == '\\' &&
+				    (ret = callback(handle, "\\")) != 0)
+					return (ret);
+				snprintf(buf, DBTBUFLEN, "%c", *p);
+				if ((ret = callback(handle, buf)) != 0)
+					return (ret);
+			} else {
+				snprintf(buf, DBTBUFLEN, "\\%c%c",
+				    hex[(u_int8_t)(*p & 0xf0) >> 4],
+				    hex[*p & 0x0f]);
+				if ((ret = callback(handle, buf)) != 0)
+					return (ret);
+			}
+	} else
+		for (len = dbtp->size, p = dbtp->data; len--; ++p) {
+			snprintf(buf, DBTBUFLEN, "%c%c",
+			    hex[(u_int8_t)(*p & 0xf0) >> 4],
+			    hex[*p & 0x0f]);
+			if ((ret = callback(handle, buf)) != 0)
+				return (ret);
+		}
+
+	return (callback(handle, "\n"));
+}
+
+/*
+ * __db_proff --
+ *	Print out an off-page element.
+ */
+static void
+__db_proff(vp)
+	void *vp;
+{
+	FILE *fp;
+	BOVERFLOW *bo;
+
+	fp = __db_prinit(NULL);
+
+	bo = vp;
+	switch (B_TYPE(bo->type)) {
+	case B_OVERFLOW:
+		fprintf(fp, "overflow: total len: %4lu page: %4lu\n",
+		    (u_long)bo->tlen, (u_long)bo->pgno);
+		break;
+	case B_DUPLICATE:
+		fprintf(fp, "duplicate: page: %4lu\n", (u_long)bo->pgno);
+		break;
+	}
+}
+
+/*
+ * __db_prflags --
+ *	Print out flags values.
+ *
+ * PUBLIC: void __db_prflags __P((u_int32_t, const FN *, FILE *));
+ */
+void
+__db_prflags(flags, fn, fp)
+	u_int32_t flags;
+	FN const *fn;
+	FILE *fp;
+{
+	const FN *fnp;
+	int found;
+	const char *sep;
+
+	sep = " (";
+	for (found = 0, fnp = fn; fnp->mask != 0; ++fnp)
+		if (LF_ISSET(fnp->mask)) {
+			fprintf(fp, "%s%s", sep, fnp->name);
+			sep = ", ";
+			found = 1;
+		}
+	if (found)
+		fprintf(fp, ")");
+}
+
+/*
+ * __db_prinit --
+ *	Initialize tree printing routines.
+ */
+static FILE *
+__db_prinit(fp)
+	FILE *fp;
+{
+	if (set_fp == NULL)
+		set_fp = fp == NULL ? stdout : fp;
+	return (set_fp);
+}
+
+/*
+ * __db_psize --
+ *	Get the page size.
+ */
+static void
+__db_psize(dbp)
+	DB *dbp;
+{
+	DBMETA *mp;
+	db_pgno_t pgno;
+
+	set_psize = PSIZE_BOUNDARY - 1;
+
+	pgno = PGNO_BASE_MD;
+	if (memp_fget(dbp->mpf, &pgno, 0, &mp) != 0)
+		return;
+
+	switch (mp->magic) {
+	case DB_BTREEMAGIC:
+	case DB_HASHMAGIC:
+	case DB_QAMMAGIC:
+		set_psize = mp->pagesize;
+		break;
+	}
+	(void)memp_fput(dbp->mpf, mp, 0);
+}
+
+/*
+ * __db_dbtype_to_string --
+ *	Return the name of the database type.
+ */
+static const char *
+__db_dbtype_to_string(dbp)
+	DB *dbp;
+{
+	switch (dbp->type) {
+	case DB_BTREE:
+		return ("btree");
+	case DB_HASH:
+		return ("hash");
+	case DB_RECNO:
+		return ("recno");
+	case DB_QUEUE:
+		return ("queue");
+	default:
+		return ("UNKNOWN TYPE");
+	}
+	/* NOTREACHED */
+}
+
+/*
+ * __db_pagetype_to_string --
+ *	Return the name of the specified page type.
+ *
+ * PUBLIC: const char *__db_pagetype_to_string __P((u_int32_t));
+ */
+const char *
+__db_pagetype_to_string(type)
+	u_int32_t type;
+{
+	const char *s;
+
+	s = NULL;
+	switch (type) {
+	case P_BTREEMETA:
+		s = "btree metadata";
+		break;
+	case P_LDUP:
+		s = "duplicate";
+		break;
+	case P_HASH:
+		s = "hash";
+		break;
+	case P_HASHMETA:
+		s = "hash metadata";
+		break;
+	case P_IBTREE:
+		s = "btree internal";
+		break;
+	case P_INVALID:
+		s = "invalid";
+		break;
+	case P_IRECNO:
+		s = "recno internal";
+		break;
+	case P_LBTREE:
+		s = "btree leaf";
+		break;
+	case P_LRECNO:
+		s = "recno leaf";
+		break;
+	case P_OVERFLOW:
+		s = "overflow";
+		break;
+	case P_QAMMETA:
+		s = "queue metadata";
+		break;
+	case P_QAMDATA:
+		s = "queue";
+		break;
+	default:
+		/* Just return a NULL. */
+		break;
+	}
+	return (s);
+}
+
+/*
+ * __db_prheader --
+ *	Write out header information in the format expected by db_load.
+ *
+ * PUBLIC: int	__db_prheader __P((DB *, char *, int, int, void *,
+ * PUBLIC:     int (*)(void *, const void *), VRFY_DBINFO *, db_pgno_t));
+ */
+int
+__db_prheader(dbp, subname, pflag, keyflag, handle, callback, vdp, meta_pgno)
+	DB *dbp;
+	char *subname;
+	int pflag, keyflag;
+	void *handle;
+	int (*callback) __P((void *, const void *));
+	VRFY_DBINFO *vdp;
+	db_pgno_t meta_pgno;
+{
+	DB_BTREE_STAT *btsp;
+	DB_ENV *dbenv;
+	DB_HASH_STAT *hsp;
+	DB_QUEUE_STAT *qsp;
+	VRFY_PAGEINFO *pip;
+	char *buf;
+	int buflen, ret, t_ret;
+	u_int32_t dbtype;
+
+	btsp = NULL;
+	hsp = NULL;
+	qsp = NULL;
+	ret = 0;
+	buf = NULL;
+	COMPQUIET(buflen, 0);
+
+	if (dbp == NULL)
+		dbenv = NULL;
+	else
+		dbenv = dbp->dbenv;
+
+	/*
+	 * If we've been passed a verifier statistics object, use
+	 * that;  we're being called in a context where dbp->stat
+	 * is unsafe.
+	 */
+	if (vdp != NULL) {
+		if ((ret = __db_vrfy_getpageinfo(vdp, meta_pgno, &pip)) != 0)
+			return (ret);
+	} else
+		pip = NULL;
+
+	/*
+	 * If dbp is NULL, we're being called from inside __db_prdbt,
+	 * and this is a special subdatabase for "lost" items.  Make it a btree.
+	 * Otherwise, set dbtype to the appropriate type for the specified
+	 * meta page, or the type of the dbp.
+	 */
+	if (dbp == NULL)
+		dbtype = DB_BTREE;
+	else if (pip != NULL)
+		switch (pip->type) {
+		case P_BTREEMETA:
+			if (F_ISSET(pip, VRFY_IS_RECNO))
+				dbtype = DB_RECNO;
+			else
+				dbtype = DB_BTREE;
+			break;
+		case P_HASHMETA:
+			dbtype = DB_HASH;
+			break;
+		case P_QAMMETA:
+			dbtype = DB_QUEUE;
+			break;
+		default:
+			/*
+			 * If the meta page is of a bogus type, it's
+			 * because we have a badly corrupt database.
+			 * (We must be in the verifier for pip to be non-NULL.)
+			 * Pretend we're a Btree and salvage what we can.
+			 */
+			DB_ASSERT(F_ISSET(dbp, DB_AM_VERIFYING));
+			dbtype = DB_BTREE;
+			break;
+		}
+	else
+		dbtype = dbp->type;
+
+	if ((ret = callback(handle, "VERSION=3\n")) != 0)
+		goto err;
+	if (pflag) {
+		if ((ret = callback(handle, "format=print\n")) != 0)
+			goto err;
+	} else if ((ret = callback(handle, "format=bytevalue\n")) != 0)
+		goto err;
+
+	/*
+	 * 64 bytes is long enough, as a minimum bound, for any of the
+	 * fields besides subname.  Subname can be anything, and so
+	 * 64 + subname is big enough for all the things we need to print here.
+	 */
+	buflen = 64 + ((subname != NULL) ? strlen(subname) : 0);
+	if ((ret = __os_malloc(dbenv, buflen, NULL, &buf)) != 0)
+		goto err;
+	if (subname != NULL) {
+		snprintf(buf, buflen, "database=%s\n", subname);
+		if ((ret = callback(handle, buf)) != 0)
+			goto err;
+	}
+	switch (dbtype) {
+	case DB_BTREE:
+		if ((ret = callback(handle, "type=btree\n")) != 0)
+			goto err;
+		if (pip != NULL) {
+			if (F_ISSET(pip, VRFY_HAS_RECNUMS))
+				if ((ret =
+				    callback(handle, "recnum=1\n")) != 0)
+					goto err;
+			if (pip->bt_maxkey != 0) {
+				snprintf(buf, buflen,
+				    "bt_maxkey=%lu\n", (u_long)pip->bt_maxkey);
+				if ((ret = callback(handle, buf)) != 0)
+					goto err;
+			}
+			if (pip->bt_minkey != 0 &&
+			    pip->bt_minkey != DEFMINKEYPAGE) {
+				snprintf(buf, buflen,
+				    "bt_minkey=%lu\n", (u_long)pip->bt_minkey);
+				if ((ret = callback(handle, buf)) != 0)
+					goto err;
+			}
+			break;
+		}
+		if ((ret = dbp->stat(dbp, &btsp, NULL, 0)) != 0) {
+			dbp->err(dbp, ret, "DB->stat");
+			goto err;
+		}
+		if (F_ISSET(dbp, DB_BT_RECNUM))
+			if ((ret = callback(handle, "recnum=1\n")) != 0)
+				goto err;
+		if (btsp->bt_maxkey != 0) {
+			snprintf(buf, buflen,
+			    "bt_maxkey=%lu\n", (u_long)btsp->bt_maxkey);
+			if ((ret = callback(handle, buf)) != 0)
+				goto err;
+		}
+		if (btsp->bt_minkey != 0 && btsp->bt_minkey != DEFMINKEYPAGE) {
+			snprintf(buf, buflen,
+			    "bt_minkey=%lu\n", (u_long)btsp->bt_minkey);
+			if ((ret = callback(handle, buf)) != 0)
+				goto err;
+		}
+		break;
+	case DB_HASH:
+		if ((ret = callback(handle, "type=hash\n")) != 0)
+			goto err;
+		if (pip != NULL) {
+			if (pip->h_ffactor != 0) {
+				snprintf(buf, buflen,
+				    "h_ffactor=%lu\n", (u_long)pip->h_ffactor);
+				if ((ret = callback(handle, buf)) != 0)
+					goto err;
+			}
+			if (pip->h_nelem != 0) {
+				snprintf(buf, buflen,
+				    "h_nelem=%lu\n", (u_long)pip->h_nelem);
+				if ((ret = callback(handle, buf)) != 0)
+					goto err;
+			}
+			break;
+		}
+		if ((ret = dbp->stat(dbp, &hsp, NULL, 0)) != 0) {
+			dbp->err(dbp, ret, "DB->stat");
+			goto err;
+		}
+		if (hsp->hash_ffactor != 0) {
+			snprintf(buf, buflen,
+			    "h_ffactor=%lu\n", (u_long)hsp->hash_ffactor);
+			if ((ret = callback(handle, buf)) != 0)
+				goto err;
+		}
+		if (hsp->hash_nelem != 0 || hsp->hash_nkeys != 0) {
+			snprintf(buf, buflen, "h_nelem=%lu\n",
+			    hsp->hash_nelem > hsp->hash_nkeys ?
+			    (u_long)hsp->hash_nelem : (u_long)hsp->hash_nkeys);
+			if ((ret = callback(handle, buf)) != 0)
+				goto err;
+		}
+		break;
+	case DB_QUEUE:
+		if ((ret = callback(handle, "type=queue\n")) != 0)
+			goto err;
+		if (vdp != NULL) {
+			snprintf(buf,
+			    buflen, "re_len=%lu\n", (u_long)vdp->re_len);
+			if ((ret = callback(handle, buf)) != 0)
+				goto err;
+			break;
+		}
+		if ((ret = dbp->stat(dbp, &qsp, NULL, 0)) != 0) {
+			dbp->err(dbp, ret, "DB->stat");
+			goto err;
+		}
+		snprintf(buf, buflen, "re_len=%lu\n", (u_long)qsp->qs_re_len);
+		if ((ret = callback(handle, buf)) != 0)
+			goto err;
+		if (qsp->qs_re_pad != 0 && qsp->qs_re_pad != ' ') {
+			snprintf(buf, buflen, "re_pad=%#x\n", qsp->qs_re_pad);
+			if ((ret = callback(handle, buf)) != 0)
+				goto err;
+		}
+		break;
+	case DB_RECNO:
+		if ((ret = callback(handle, "type=recno\n")) != 0)
+			goto err;
+		if (pip != NULL) {
+			if (F_ISSET(pip, VRFY_IS_RRECNO))
+				if ((ret =
+				    callback(handle, "renumber=1\n")) != 0)
+					goto err;
+			if (pip->re_len > 0) {
+				snprintf(buf, buflen,
+				    "re_len=%lu\n", (u_long)pip->re_len);
+				if ((ret = callback(handle, buf)) != 0)
+					goto err;
+			}
+			break;
+		}
+		if ((ret = dbp->stat(dbp, &btsp, NULL, 0)) != 0) {
+			dbp->err(dbp, ret, "DB->stat");
+			goto err;
+		}
+		if (F_ISSET(dbp, DB_RE_RENUMBER))
+			if ((ret = callback(handle, "renumber=1\n")) != 0)
+				goto err;
+		if (F_ISSET(dbp, DB_RE_FIXEDLEN)) {
+			snprintf(buf, buflen,
+			    "re_len=%lu\n", (u_long)btsp->bt_re_len);
+			if ((ret = callback(handle, buf)) != 0)
+				goto err;
+		}
+		if (btsp->bt_re_pad != 0 && btsp->bt_re_pad != ' ') {
+			snprintf(buf, buflen, "re_pad=%#x\n", btsp->bt_re_pad);
+			if ((ret = callback(handle, buf)) != 0)
+				goto err;
+		}
+		break;
+	case DB_UNKNOWN:
+		DB_ASSERT(0);			/* Impossible. */
+		__db_err(dbp->dbenv, "Impossible DB type in __db_prheader");
+		ret = EINVAL;
+		goto err;
+	}
+
+	if (pip != NULL) {
+		if (F_ISSET(pip, VRFY_HAS_DUPS))
+			if ((ret = callback(handle, "duplicates=1\n")) != 0)
+				goto err;
+		if (F_ISSET(pip, VRFY_HAS_DUPSORT))
+			if ((ret = callback(handle, "dupsort=1\n")) != 0)
+				goto err;
+		/* We should handle page size. XXX */
+	} else {
+		if (F_ISSET(dbp, DB_AM_DUP))
+			if ((ret = callback(handle, "duplicates=1\n")) != 0)
+				goto err;
+		if (F_ISSET(dbp, DB_AM_DUPSORT))
+			if ((ret = callback(handle, "dupsort=1\n")) != 0)
+				goto err;
+		if (!F_ISSET(dbp, DB_AM_PGDEF)) {
+			snprintf(buf, buflen,
+			    "db_pagesize=%lu\n", (u_long)dbp->pgsize);
+			if ((ret = callback(handle, buf)) != 0)
+				goto err;
+		}
+	}
+
+	if (keyflag && (ret = callback(handle, "keys=1\n")) != 0)
+		goto err;
+
+	ret = callback(handle, "HEADER=END\n");
+
+err:	if (pip != NULL &&
+	    (t_ret = __db_vrfy_putpageinfo(vdp, pip)) != 0 && ret == 0)
+		ret = t_ret;
+	if (btsp != NULL)
+		__os_free(btsp, 0);
+	if (hsp != NULL)
+		__os_free(hsp, 0);
+	if (qsp != NULL)
+		__os_free(qsp, 0);
+	if (buf != NULL)
+		__os_free(buf, buflen);
+
+	return (ret);
+}
+
+/*
+ * __db_prfooter --
+ *	Print the footer that marks the end of a DB dump.  This is trivial,
+ *	but for consistency's sake we don't want to put its literal contents
+ *	in multiple places.
+ *
+ * PUBLIC: int __db_prfooter __P((void *, int (*)(void *, const void *)));
+ */
+int
+__db_prfooter(handle, callback)
+	void *handle;
+	int (*callback) __P((void *, const void *));
+{
+	return (callback(handle, "DATA=END\n"));
+}
diff --git a/bdb/db/db_rec.c b/bdb/db/db_rec.c
new file mode 100644
index 00000000000..998d074290d
--- /dev/null
+++ b/bdb/db/db_rec.c
@@ -0,0 +1,529 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ *	Sleepycat Software.  All rights reserved.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_rec.c,v 11.10 2000/08/03 15:32:19 ubell Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "log.h"
+#include "hash.h"
+
+/*
+ * PUBLIC: int __db_addrem_recover
+ * PUBLIC:    __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *));
+ *
+ * This log record is written whenever we add or remove a duplicate
+ * to/from a duplicate page.  Redo reapplies it; undo does the opposite.
+ */
+int
+__db_addrem_recover(dbenv, dbtp, lsnp, op, info)
+	DB_ENV *dbenv;
+	DBT *dbtp;
+	DB_LSN *lsnp;
+	db_recops op;
+	void *info;
+{
+	__db_addrem_args *argp;
+	DB *file_dbp;
+	DBC *dbc;
+	DB_MPOOLFILE *mpf;
+	PAGE *pagep;
+	u_int32_t change;
+	int cmp_n, cmp_p, ret;
+
+	COMPQUIET(info, NULL);
+	REC_PRINT(__db_addrem_print);
+	REC_INTRO(__db_addrem_read, 1);
+
+	if ((ret = memp_fget(mpf, &argp->pgno, 0, &pagep)) != 0) {
+		if (DB_UNDO(op)) {
+			/*
+			 * We are undoing and the page doesn't exist.  That
+			 * is equivalent to having a pagelsn of 0, so we
+			 * would not have to undo anything.  In this case,
+			 * don't bother creating a page.
+			 */
+			goto done;
+		} else
+			if ((ret = memp_fget(mpf,
+			    &argp->pgno, DB_MPOOL_CREATE, &pagep)) != 0)
+				goto out;
+	}
+
+	cmp_n = log_compare(lsnp, &LSN(pagep));
+	cmp_p = log_compare(&LSN(pagep), &argp->pagelsn);
+	CHECK_LSN(op, cmp_p, &LSN(pagep), &argp->pagelsn);
+	change = 0;
+	if ((cmp_p == 0 && DB_REDO(op) && argp->opcode == DB_ADD_DUP) ||
+	    (cmp_n == 0 && DB_UNDO(op) && argp->opcode == DB_REM_DUP)) {
+
+		/* Need to redo an add, or undo a delete. */
+		if ((ret = __db_pitem(dbc, pagep, argp->indx, argp->nbytes,
+		    argp->hdr.size == 0 ? NULL : &argp->hdr,
+		    argp->dbt.size == 0 ? NULL : &argp->dbt)) != 0)
+			goto out;
+
+		change = DB_MPOOL_DIRTY;
+
+	} else if ((cmp_n == 0 && DB_UNDO(op) && argp->opcode == DB_ADD_DUP) ||
+	    (cmp_p == 0 && DB_REDO(op) && argp->opcode == DB_REM_DUP)) {
+		/* Need to undo an add, or redo a delete. */
+		if ((ret = __db_ditem(dbc,
+		    pagep, argp->indx, argp->nbytes)) != 0)
+			goto out;
+		change = DB_MPOOL_DIRTY;
+	}
+
+	if (change) {
+		if (DB_REDO(op))
+			LSN(pagep) = *lsnp;
+		else
+			LSN(pagep) = argp->pagelsn;
+	}
+
+	if ((ret = memp_fput(mpf, pagep, change)) != 0)
+		goto out;
+
+done:	*lsnp = argp->prev_lsn;
+	ret = 0;
+
+out:	REC_CLOSE;
+}
+
+/*
+ * PUBLIC: int __db_big_recover
+ * PUBLIC:     __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *));
+ */
+int
+__db_big_recover(dbenv, dbtp, lsnp, op, info)
+	DB_ENV *dbenv;
+	DBT *dbtp;
+	DB_LSN *lsnp;
+	db_recops op;
+	void *info;
+{
+	__db_big_args *argp;
+	DB *file_dbp;
+	DBC *dbc;
+	DB_MPOOLFILE *mpf;
+	PAGE *pagep;
+	u_int32_t change;
+	int cmp_n, cmp_p, ret;
+
+	COMPQUIET(info, NULL);
+	REC_PRINT(__db_big_print);
+	REC_INTRO(__db_big_read, 1);
+
+	if ((ret = memp_fget(mpf, &argp->pgno, 0, &pagep)) != 0) {
+		if (DB_UNDO(op)) {
+			/*
+			 * We are undoing and the page doesn't exist.  That
+			 * is equivalent to having a pagelsn of 0, so we
+			 * would not have to undo anything.  In this case,
+			 * don't bother creating a page.
+			 */
+			ret = 0;
+			goto ppage;
+		} else
+			if ((ret = memp_fget(mpf,
+			    &argp->pgno, DB_MPOOL_CREATE, &pagep)) != 0)
+				goto out;
+	}
+
+	/*
+	 * There are three pages we need to check.  The one on which we are
+	 * adding data, the previous one whose next_pointer may have
+	 * been updated, and the next one whose prev_pointer may have
+	 * been updated.
+	 */
+	cmp_n = log_compare(lsnp, &LSN(pagep));
+	cmp_p = log_compare(&LSN(pagep), &argp->pagelsn);
+	CHECK_LSN(op, cmp_p, &LSN(pagep), &argp->pagelsn);
+	change = 0;
+	if ((cmp_p == 0 && DB_REDO(op) && argp->opcode == DB_ADD_BIG) ||
+	    (cmp_n == 0 && DB_UNDO(op) && argp->opcode == DB_REM_BIG)) {
+		/* We are either redo-ing an add, or undoing a delete. */
+		P_INIT(pagep, file_dbp->pgsize, argp->pgno, argp->prev_pgno,
+			argp->next_pgno, 0, P_OVERFLOW);
+		OV_LEN(pagep) = argp->dbt.size;
+		OV_REF(pagep) = 1;
+		memcpy((u_int8_t *)pagep + P_OVERHEAD, argp->dbt.data,
+		    argp->dbt.size);
+		PREV_PGNO(pagep) = argp->prev_pgno;
+		change = DB_MPOOL_DIRTY;
+	} else if ((cmp_n == 0 && DB_UNDO(op) && argp->opcode == DB_ADD_BIG) ||
+	    (cmp_p == 0 && DB_REDO(op) && argp->opcode == DB_REM_BIG)) {
+		/*
+		 * We are either undo-ing an add or redo-ing a delete.
+		 * The page is about to be reclaimed in either case, so
+		 * there really isn't anything to do here.
+		 */
+		change = DB_MPOOL_DIRTY;
+	}
+	if (change)
+		LSN(pagep) = DB_REDO(op) ? *lsnp : argp->pagelsn;
+
+	if ((ret = memp_fput(mpf, pagep, change)) != 0)
+		goto out;
+
+	/* Now check the previous page. */
+ppage:	if (argp->prev_pgno != PGNO_INVALID) {
+		change = 0;
+		if ((ret = memp_fget(mpf, &argp->prev_pgno, 0, &pagep)) != 0) {
+			if (DB_UNDO(op)) {
+				/*
+				 * We are undoing and the page doesn't exist.
+				 * That is equivalent to having a pagelsn of 0,
+				 * so we would not have to undo anything.  In
+				 * this case, don't bother creating a page.
+				 */
+				*lsnp = argp->prev_lsn;
+				ret = 0;
+				goto npage;
+			} else
+				if ((ret = memp_fget(mpf, &argp->prev_pgno,
+				    DB_MPOOL_CREATE, &pagep)) != 0)
+					goto out;
+		}
+
+		cmp_n = log_compare(lsnp, &LSN(pagep));
+		cmp_p = log_compare(&LSN(pagep), &argp->prevlsn);
+		CHECK_LSN(op, cmp_p, &LSN(pagep), &argp->prevlsn);
+
+		if ((cmp_p == 0 && DB_REDO(op) && argp->opcode == DB_ADD_BIG) ||
+		    (cmp_n == 0 && DB_UNDO(op) && argp->opcode == DB_REM_BIG)) {
+			/* Redo add, undo delete. */
+			NEXT_PGNO(pagep) = argp->pgno;
+			change = DB_MPOOL_DIRTY;
+		} else if ((cmp_n == 0 &&
+		    DB_UNDO(op) && argp->opcode == DB_ADD_BIG) ||
+		    (cmp_p == 0 && DB_REDO(op) && argp->opcode == DB_REM_BIG)) {
+			/* Redo delete, undo add. */
+			NEXT_PGNO(pagep) = argp->next_pgno;
+			change = DB_MPOOL_DIRTY;
+		}
+		if (change)
+			LSN(pagep) = DB_REDO(op) ? *lsnp : argp->prevlsn;
+		if ((ret = memp_fput(mpf, pagep, change)) != 0)
+			goto out;
+	}
+
+	/* Now check the next page.  Can only be set on a delete. */
+npage:	if (argp->next_pgno != PGNO_INVALID) {
+		change = 0;
+		if ((ret = memp_fget(mpf, &argp->next_pgno, 0, &pagep)) != 0) {
+			if (DB_UNDO(op)) {
+				/*
+				 * We are undoing and the page doesn't exist.
+				 * That is equivalent to having a pagelsn of 0,
+				 * so we would not have to undo anything.  In
+				 * this case, don't bother creating a page.
+				 */
+				goto done;
+			} else
+				if ((ret = memp_fget(mpf, &argp->next_pgno,
+				    DB_MPOOL_CREATE, &pagep)) != 0)
+					goto out;
+		}
+
+		cmp_n = log_compare(lsnp, &LSN(pagep));
+		cmp_p = log_compare(&LSN(pagep), &argp->nextlsn);
+		CHECK_LSN(op, cmp_p, &LSN(pagep), &argp->nextlsn);
+		if (cmp_p == 0 && DB_REDO(op)) {
+			PREV_PGNO(pagep) = PGNO_INVALID;
+			change = DB_MPOOL_DIRTY;
+		} else if (cmp_n == 0 && DB_UNDO(op)) {
+			PREV_PGNO(pagep) = argp->pgno;
+			change = DB_MPOOL_DIRTY;
+		}
+		if (change)
+			LSN(pagep) = DB_REDO(op) ? *lsnp : argp->nextlsn;
+		if ((ret = memp_fput(mpf, pagep, change)) != 0)
+			goto out;
+	}
+
+done:	*lsnp = argp->prev_lsn;
+	ret = 0;
+
+out:	REC_CLOSE;
+}
+
+/*
+ * __db_ovref_recover --
+ *	Recovery function for __db_ovref().
+ *
+ * PUBLIC: int __db_ovref_recover __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *));
+ */
+int
+__db_ovref_recover(dbenv, dbtp, lsnp, op, info)
+	DB_ENV *dbenv;
+	DBT *dbtp;
+	DB_LSN *lsnp;
+	db_recops op;
+	void *info;
+{
+	__db_ovref_args *argp;
+	DB *file_dbp;
+	DBC *dbc;
+	DB_MPOOLFILE *mpf;
+	PAGE *pagep;
+	int cmp, modified, ret;
+
+	COMPQUIET(info, NULL);
+	REC_PRINT(__db_ovref_print);
+	REC_INTRO(__db_ovref_read, 1);
+
+	if ((ret = memp_fget(mpf, &argp->pgno, 0, &pagep)) != 0) {
+		if (DB_UNDO(op))
+			goto done;
+		(void)__db_pgerr(file_dbp, argp->pgno);
+		goto out;
+	}
+
+	modified = 0;
+	cmp = log_compare(&LSN(pagep), &argp->lsn);
+	CHECK_LSN(op, cmp, &LSN(pagep), &argp->lsn);
+	if (cmp == 0 && DB_REDO(op)) {
+		/* Need to redo update described. */
+		OV_REF(pagep) += argp->adjust;
+
+		pagep->lsn = *lsnp;
+		modified = 1;
+	} else if (log_compare(lsnp, &LSN(pagep)) == 0 && DB_UNDO(op)) {
+		/* Need to undo update described. */
+		OV_REF(pagep) -= argp->adjust;
+
+		pagep->lsn = argp->lsn;
+		modified = 1;
+	}
+	if ((ret = memp_fput(mpf, pagep, modified ? DB_MPOOL_DIRTY : 0)) != 0)
+		goto out;
+
+done:	*lsnp = argp->prev_lsn;
+	ret = 0;
+
+out:	REC_CLOSE;
+}
+
+/*
+ * __db_relink_recover --
+ *	Recovery function for relink.
+ *
+ * PUBLIC: int __db_relink_recover
+ * PUBLIC:   __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *));
+ */
+int
+__db_relink_recover(dbenv, dbtp, lsnp, op, info)
+	DB_ENV *dbenv;
+	DBT *dbtp;
+	DB_LSN *lsnp;
+	db_recops op;
+	void *info;
+{
+	__db_relink_args *argp;
+	DB *file_dbp;
+	DBC *dbc;
+	DB_MPOOLFILE *mpf;
+	PAGE *pagep;
+	int cmp_n, cmp_p, modified, ret;
+
+	COMPQUIET(info, NULL);
+	REC_PRINT(__db_relink_print);
+	REC_INTRO(__db_relink_read, 1);
+
+	/*
+	 * There are up to three pages we need to check -- the page, and the
+	 * previous and next pages, if they existed.  For a page add operation,
+	 * the current page is the result of a split and is being recovered
+	 * elsewhere, so all we need do is recover the next page.
+	 */
+	if ((ret = memp_fget(mpf, &argp->pgno, 0, &pagep)) != 0) {
+		if (DB_REDO(op)) {
+			(void)__db_pgerr(file_dbp, argp->pgno);
+			goto out;
+		}
+		goto next2;
+	}
+	modified = 0;
+	if (argp->opcode == DB_ADD_PAGE)
+		goto next1;
+
+	cmp_p = log_compare(&LSN(pagep), &argp->lsn);
+	CHECK_LSN(op, cmp_p, &LSN(pagep), &argp->lsn);
+	if (cmp_p == 0 && DB_REDO(op)) {
+		/* Redo the relink. */
+		pagep->lsn = *lsnp;
+		modified = 1;
+	} else if (log_compare(lsnp, &LSN(pagep)) == 0 && DB_UNDO(op)) {
+		/* Undo the relink. */
+		pagep->next_pgno = argp->next;
+		pagep->prev_pgno = argp->prev;
+
+		pagep->lsn = argp->lsn;
+		modified = 1;
+	}
+next1:	if ((ret = memp_fput(mpf, pagep, modified ? DB_MPOOL_DIRTY : 0)) != 0)
+		goto out;
+
+next2:	if ((ret = memp_fget(mpf, &argp->next, 0, &pagep)) != 0) {
+		if (DB_REDO(op)) {
+			(void)__db_pgerr(file_dbp, argp->next);
+			goto out;
+		}
+		goto prev;
+	}
+	modified = 0;
+	cmp_n = log_compare(lsnp, &LSN(pagep));
+	cmp_p = log_compare(&LSN(pagep), &argp->lsn_next);
+	CHECK_LSN(op, cmp_p, &LSN(pagep), &argp->lsn_next);
+	if ((argp->opcode == DB_REM_PAGE && cmp_p == 0 && DB_REDO(op)) ||
+	    (argp->opcode == DB_ADD_PAGE && cmp_n == 0 && DB_UNDO(op))) {
+		/* Redo the remove or undo the add. */
+		pagep->prev_pgno = argp->prev;
+
+		modified = 1;
+	} else if ((argp->opcode == DB_REM_PAGE && cmp_n == 0 && DB_UNDO(op)) ||
+	    (argp->opcode == DB_ADD_PAGE && cmp_p == 0 && DB_REDO(op))) {
+		/* Undo the remove or redo the add. */
+		pagep->prev_pgno = argp->pgno;
+
+		modified = 1;
+	}
+	if (modified == 1) {
+		if (DB_UNDO(op))
+			pagep->lsn = argp->lsn_next;
+		else
+			pagep->lsn = *lsnp;
+	}
+	if ((ret = memp_fput(mpf, pagep, modified ? DB_MPOOL_DIRTY : 0)) != 0)
+		goto out;
+	if (argp->opcode == DB_ADD_PAGE)
+		goto done;
+
+prev:	if ((ret = memp_fget(mpf, &argp->prev, 0, &pagep)) != 0) {
+		if (DB_REDO(op)) {
+			(void)__db_pgerr(file_dbp, argp->prev);
+			goto out;
+		}
+		goto done;
+	}
+	modified = 0;
+	cmp_p = log_compare(&LSN(pagep), &argp->lsn_prev);
+	CHECK_LSN(op, cmp_p, &LSN(pagep), &argp->lsn_prev);
+	if (cmp_p == 0 && DB_REDO(op)) {
+		/* Redo the relink. */
+		pagep->next_pgno = argp->next;
+
+		modified = 1;
+	} else if (log_compare(lsnp, &LSN(pagep)) == 0 && DB_UNDO(op)) {
+		/* Undo the relink. */
+		pagep->next_pgno = argp->pgno;
+
+		modified = 1;
+	}
+	if (modified == 1) {
+		if (DB_UNDO(op))
+			pagep->lsn = argp->lsn_prev;
+		else
+			pagep->lsn = *lsnp;
+	}
+	if ((ret = memp_fput(mpf, pagep, modified ? DB_MPOOL_DIRTY : 0)) != 0)
+		goto out;
+
+done:	*lsnp = argp->prev_lsn;
+	ret = 0;
+
+out:	REC_CLOSE;
+}
+
+/*
+ * __db_debug_recover --
+ *	Recovery function for debug.
+ *
+ * PUBLIC: int __db_debug_recover __P((DB_ENV *,
+ * PUBLIC:     DBT *, DB_LSN *, db_recops, void *));
+ */
+int
+__db_debug_recover(dbenv, dbtp, lsnp, op, info)
+	DB_ENV *dbenv;
+	DBT *dbtp;
+	DB_LSN *lsnp;
+	db_recops op;
+	void *info;
+{
+	__db_debug_args *argp;
+	int ret;
+
+	COMPQUIET(op, 0);
+	COMPQUIET(dbenv, NULL);
+	COMPQUIET(info, NULL);
+
+	REC_PRINT(__db_debug_print);
+	REC_NOOP_INTRO(__db_debug_read);
+
+	*lsnp = argp->prev_lsn;
+	ret = 0;
+
+	REC_NOOP_CLOSE;
+}
+
+/*
+ * __db_noop_recover --
+ *	Recovery function for noop.
+ *
+ * PUBLIC: int __db_noop_recover __P((DB_ENV *,
+ * PUBLIC:      DBT *, DB_LSN *, db_recops, void *));
+ */
+int
+__db_noop_recover(dbenv, dbtp, lsnp, op, info)
+	DB_ENV *dbenv;
+	DBT *dbtp;
+	DB_LSN *lsnp;
+	db_recops op;
+	void *info;
+{
+	__db_noop_args *argp;
+	DB *file_dbp;
+	DBC *dbc;
+	DB_MPOOLFILE *mpf;
+	PAGE *pagep;
+	u_int32_t change;
+	int cmp_n, cmp_p, ret;
+
+	COMPQUIET(info, NULL);
+	REC_PRINT(__db_noop_print);
+	REC_INTRO(__db_noop_read, 0);
+
+	if ((ret = memp_fget(mpf, &argp->pgno, 0, &pagep)) != 0)
+		goto out;
+
+	cmp_n = log_compare(lsnp, &LSN(pagep));
+	cmp_p = log_compare(&LSN(pagep), &argp->prevlsn);
+	CHECK_LSN(op, cmp_p, &LSN(pagep), &argp->prevlsn);
+	change = 0;
+	if (cmp_p == 0 && DB_REDO(op)) {
+		LSN(pagep) = *lsnp;
+		change = DB_MPOOL_DIRTY;
+	} else if (cmp_n == 0 && DB_UNDO(op)) {
+		LSN(pagep) = argp->prevlsn;
+		change = DB_MPOOL_DIRTY;
+	}
+	ret = memp_fput(mpf, pagep, change);
+
+done:	*lsnp = argp->prev_lsn;
+out:	REC_CLOSE;
+}
diff --git a/bdb/db/db_reclaim.c b/bdb/db/db_reclaim.c
new file mode 100644
index 00000000000..739f348407d
--- /dev/null
+++ b/bdb/db/db_reclaim.c
@@ -0,0 +1,134 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ *	Sleepycat Software.  All rights reserved.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_reclaim.c,v 11.5 2000/04/07 14:26:58 bostic Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_am.h"
+
+/*
+ * Assume that we enter with a valid pgno.  We traverse a set of
+ * duplicate pages.  The format of the callback routine is:
+ * callback(dbp, page, cookie, did_put).  did_put is an output
+ * value that will be set to 1 by the callback routine if it
+ * already put the page back.  Otherwise, this routine must
+ * put the page.
+ *
+ * PUBLIC: int __db_traverse_dup __P((DB *,
+ * PUBLIC:    db_pgno_t, int (*)(DB *, PAGE *, void *, int *), void *));
+ */
+int
+__db_traverse_dup(dbp, pgno, callback, cookie)
+	DB *dbp;
+	db_pgno_t pgno;
+	int (*callback) __P((DB *, PAGE *, void *, int *));
+	void *cookie;
+{
+	PAGE *p;
+	int did_put, i, opgno, ret;
+
+	do {
+		did_put = 0;
+		if ((ret = memp_fget(dbp->mpf, &pgno, 0, &p)) != 0)
+			return (ret);
+		pgno = NEXT_PGNO(p);
+
+		for (i = 0; i < NUM_ENT(p); i++) {
+			if (B_TYPE(GET_BKEYDATA(p, i)->type) == B_OVERFLOW) {
+				opgno = GET_BOVERFLOW(p, i)->pgno;
+				if ((ret = __db_traverse_big(dbp,
+				    opgno, callback, cookie)) != 0)
+					goto err;
+			}
+		}
+
+		if ((ret = callback(dbp, p, cookie, &did_put)) != 0)
+			goto err;
+
+		if (!did_put)
+			if ((ret = memp_fput(dbp->mpf, p, 0)) != 0)
+				return (ret);
+	} while (pgno != PGNO_INVALID);
+
+	if (0) {
+err:		if (did_put == 0)
+			(void)memp_fput(dbp->mpf, p, 0);
+	}
+	return (ret);
+}
+
+/*
+ * __db_traverse_big
+ *	Traverse a chain of overflow pages and call the callback routine
+ * on each one.  The calling convention for the callback is:
+ *	callback(dbp, page, cookie, did_put),
+ * where did_put is a return value indicating if the page in question has
+ * already been returned to the mpool.
+ *
+ * PUBLIC: int __db_traverse_big __P((DB *,
+ * PUBLIC:     db_pgno_t, int (*)(DB *, PAGE *, void *, int *), void *));
+ */
+int
+__db_traverse_big(dbp, pgno, callback, cookie)
+	DB *dbp;
+	db_pgno_t pgno;
+	int (*callback) __P((DB *, PAGE *, void *, int *));
+	void *cookie;
+{
+	PAGE *p;
+	int did_put, ret;
+
+	do {
+		did_put = 0;
+		if ((ret = memp_fget(dbp->mpf, &pgno, 0, &p)) != 0)
+			return (ret);
+		pgno = NEXT_PGNO(p);
+		if ((ret = callback(dbp, p, cookie, &did_put)) == 0 &&
+		    !did_put)
+			ret = memp_fput(dbp->mpf, p, 0);
+	} while (ret == 0 && pgno != PGNO_INVALID);
+
+	return (ret);
+}
+
+/*
+ * __db_reclaim_callback
+ * This is the callback routine used during a delete of a subdatabase.
+ * We are traversing a btree or hash table and trying to free all the
+ * pages.  Since they share common code for duplicates and overflow
+ * items, we traverse them identically and use this routine to do the
+ * actual free.  This is a callback because hash uses the same
+ * traversal code for statistics gathering.
+ *
+ * PUBLIC: int __db_reclaim_callback __P((DB *, PAGE *, void *, int *));
+ */
+int
+__db_reclaim_callback(dbp, p, cookie, putp)
+	DB *dbp;
+	PAGE *p;
+	void *cookie;
+	int *putp;
+{
+	int ret;
+
+	COMPQUIET(dbp, NULL);
+
+	if ((ret = __db_free(cookie, p)) != 0)
+		return (ret);
+	*putp = 1;
+
+	return (0);
+}
diff --git a/bdb/db/db_ret.c b/bdb/db/db_ret.c
new file mode 100644
index 00000000000..0782de3e450
--- /dev/null
+++ b/bdb/db/db_ret.c
@@ -0,0 +1,160 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ *	Sleepycat Software.  All rights reserved.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_ret.c,v 11.12 2000/11/30 00:58:33 ubell Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "btree.h"
+#include "db_am.h"
+
+/*
+ * __db_ret --
+ *	Build return DBT.
+ *
+ * PUBLIC: int __db_ret __P((DB *,
+ * PUBLIC:    PAGE *, u_int32_t, DBT *, void **, u_int32_t *));
+ */
+int
+__db_ret(dbp, h, indx, dbt, memp, memsize)
+	DB *dbp;
+	PAGE *h;
+	u_int32_t indx;
+	DBT *dbt;
+	void **memp;
+	u_int32_t *memsize;
+{
+	BKEYDATA *bk;
+	HOFFPAGE ho;
+	BOVERFLOW *bo;
+	u_int32_t len;
+	u_int8_t *hk;
+	void *data;
+
+	switch (TYPE(h)) {
+	case P_HASH:
+		hk = P_ENTRY(h, indx);
+		if (HPAGE_PTYPE(hk) == H_OFFPAGE) {
+			memcpy(&ho, hk, sizeof(HOFFPAGE));
+			return (__db_goff(dbp, dbt,
+			    ho.tlen, ho.pgno, memp, memsize));
+		}
+		len = LEN_HKEYDATA(h, dbp->pgsize, indx);
+		data = HKEYDATA_DATA(hk);
+		break;
+	case P_LBTREE:
+	case P_LDUP:
+	case P_LRECNO:
+		bk = GET_BKEYDATA(h, indx);
+		if (B_TYPE(bk->type) == B_OVERFLOW) {
+			bo = (BOVERFLOW *)bk;
+			return (__db_goff(dbp, dbt,
+			    bo->tlen, bo->pgno, memp, memsize));
+		}
+		len = bk->len;
+		data = bk->data;
+		break;
+	default:
+		return (__db_pgfmt(dbp, h->pgno));
+	}
+
+	return (__db_retcopy(dbp, dbt, data, len, memp, memsize));
+}
+
+/*
+ * __db_retcopy --
+ *	Copy the returned data into the user's DBT, handling special flags.
+ *
+ * PUBLIC: int __db_retcopy __P((DB *, DBT *,
+ * PUBLIC:    void *, u_int32_t, void **, u_int32_t *));
+ */
+int
+__db_retcopy(dbp, dbt, data, len, memp, memsize)
+	DB *dbp;
+	DBT *dbt;
+	void *data;
+	u_int32_t len;
+	void **memp;
+	u_int32_t *memsize;
+{
+	DB_ENV *dbenv;
+	int ret;
+
+	dbenv = dbp == NULL ? NULL : dbp->dbenv;
+
+	/* If returning a partial record, reset the length. */
+	if (F_ISSET(dbt, DB_DBT_PARTIAL)) {
+		data = (u_int8_t *)data + dbt->doff;
+		if (len > dbt->doff) {
+			len -= dbt->doff;
+			if (len > dbt->dlen)
+				len = dbt->dlen;
+		} else
+			len = 0;
+	}
+
+	/*
+	 * Return the length of the returned record in the DBT size field.
+	 * This satisfies the requirement that if we're using user memory
+	 * and insufficient memory was provided, return the amount necessary
+	 * in the size field.
+	 */
+	dbt->size = len;
+
+	/*
+	 * Allocate memory to be owned by the application: DB_DBT_MALLOC,
+	 * DB_DBT_REALLOC.
+	 *
+	 * !!!
+	 * We always allocate memory, even if we're copying out 0 bytes. This
+	 * guarantees consistency, i.e., the application can always free memory
+	 * without concern as to how many bytes of the record were requested.
+	 *
+	 * Use the memory specified by the application: DB_DBT_USERMEM.
+	 *
+	 * !!!
+	 * If the length we're going to copy is 0, the application-supplied
+	 * memory pointer is allowed to be NULL.
+	 */
+	if (F_ISSET(dbt, DB_DBT_MALLOC)) {
+		if ((ret = __os_malloc(dbenv, len,
+		    dbp == NULL ? NULL : dbp->db_malloc, &dbt->data)) != 0)
+			return (ret);
+	} else if (F_ISSET(dbt, DB_DBT_REALLOC)) {
+		if ((ret = __os_realloc(dbenv, len,
+		    dbp == NULL ? NULL : dbp->db_realloc, &dbt->data)) != 0)
+			return (ret);
+	} else if (F_ISSET(dbt, DB_DBT_USERMEM)) {
+		if (len != 0 && (dbt->data == NULL || dbt->ulen < len))
+			return (ENOMEM);
+	} else if (memp == NULL || memsize == NULL) {
+		return (EINVAL);
+	} else {
+		if (len != 0 && (*memsize == 0 || *memsize < len)) {
+			if ((ret = __os_realloc(dbenv, len, NULL, memp)) != 0) {
+				*memsize = 0;
+				return (ret);
+			}
+			*memsize = len;
+		}
+		dbt->data = *memp;
+	}
+
+	if (len != 0)
+		memcpy(dbt->data, data, len);
+	return (0);
+}
diff --git a/bdb/db/db_upg.c b/bdb/db/db_upg.c
new file mode 100644
index 00000000000..d8573146ad6
--- /dev/null
+++ b/bdb/db/db_upg.c
@@ -0,0 +1,338 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ *	Sleepycat Software.  All rights reserved.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_upg.c,v 11.20 2000/12/12 17:35:30 bostic Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_swap.h"
+#include "btree.h"
+#include "hash.h"
+#include "qam.h"
+
+static int (* const func_31_list[P_PAGETYPE_MAX])
+    __P((DB *, char *, u_int32_t, DB_FH *, PAGE *, int *)) = {
+	NULL,			/* P_INVALID */
+	NULL,			/* __P_DUPLICATE */
+	__ham_31_hash,		/* P_HASH */
+	NULL,			/* P_IBTREE */
+	NULL,			/* P_IRECNO */
+	__bam_31_lbtree,	/* P_LBTREE */
+	NULL,			/* P_LRECNO */
+	NULL,			/* P_OVERFLOW */
+	__ham_31_hashmeta,	/* P_HASHMETA */
+	__bam_31_btreemeta,	/* P_BTREEMETA */
+};
+
+static int __db_page_pass __P((DB *, char *, u_int32_t, int (* const [])
+	       (DB *, char *, u_int32_t, DB_FH *, PAGE *, int *), DB_FH *));
+
+/*
+ * __db_upgrade --
+ *	Upgrade an existing database.
+ *
+ * PUBLIC: int __db_upgrade __P((DB *, const char *, u_int32_t));
+ */
+int
+__db_upgrade(dbp, fname, flags)
+	DB *dbp;
+	const char *fname;
+	u_int32_t flags;
+{
+	DB_ENV *dbenv;
+	DB_FH fh;
+	size_t n;
+	int ret, t_ret;
+	u_int8_t mbuf[256];
+	char *real_name;
+
+	dbenv = dbp->dbenv;
+
+	/* Validate arguments. */
+	if ((ret = __db_fchk(dbenv, "DB->upgrade", flags, DB_DUPSORT)) != 0)
+		return (ret);
+
+	/* Get the real backing file name. */
+	if ((ret = __db_appname(dbenv,
+	    DB_APP_DATA, NULL, fname, 0, NULL, &real_name)) != 0)
+		return (ret);
+
+	/* Open the file. */
+	if ((ret = __os_open(dbenv, real_name, 0, 0, &fh)) != 0) {
+		__db_err(dbenv, "%s: %s", real_name, db_strerror(ret));
+		return (ret);
+	}
+
+	/* Initialize the feedback. */
+	if (dbp->db_feedback != NULL)
+		dbp->db_feedback(dbp, DB_UPGRADE, 0);
+
+	/*
+	 * Read the metadata page.  We read 256 bytes, which is larger than
+	 * any access method's metadata page and smaller than any disk sector.
+	 */
+	if ((ret = __os_read(dbenv, &fh, mbuf, sizeof(mbuf), &n)) != 0)
+		goto err;
+
+	switch (((DBMETA *)mbuf)->magic) {
+	case DB_BTREEMAGIC:
+		switch (((DBMETA *)mbuf)->version) {
+		case 6:
+			/*
+			 * Before V7 not all pages had page types, so we do the
+			 * single meta-data page by hand.
+			 */
+			if ((ret =
+			    __bam_30_btreemeta(dbp, real_name, mbuf)) != 0)
+				goto err;
+			if ((ret = __os_seek(dbenv,
+			    &fh, 0, 0, 0, 0, DB_OS_SEEK_SET)) != 0)
+				goto err;
+			if ((ret = __os_write(dbenv, &fh, mbuf, 256, &n)) != 0)
+				goto err;
+			/* FALLTHROUGH */
+		case 7:
+			/*
+			 * We need the page size to do more.  Rip it out of
+			 * the meta-data page.
+			 */
+			memcpy(&dbp->pgsize, mbuf + 20, sizeof(u_int32_t));
+
+			if ((ret = __db_page_pass(
+			    dbp, real_name, flags, func_31_list, &fh)) != 0)
+				goto err;
+			/* FALLTHROUGH */
+		case 8:
+			break;
+		default:
+			__db_err(dbenv, "%s: unsupported btree version: %lu",
+			    real_name, (u_long)((DBMETA *)mbuf)->version);
+			ret = DB_OLD_VERSION;
+			goto err;
+		}
+		break;
+	case DB_HASHMAGIC:
+		switch (((DBMETA *)mbuf)->version) {
+		case 4:
+		case 5:
+			/*
+			 * Before V6 not all pages had page types, so we do the
+			 * single meta-data page by hand.
+			 */
+			if ((ret =
+			    __ham_30_hashmeta(dbp, real_name, mbuf)) != 0)
+				goto err;
+			if ((ret = __os_seek(dbenv,
+			    &fh, 0, 0, 0, 0, DB_OS_SEEK_SET)) != 0)
+				goto err;
+			if ((ret = __os_write(dbenv, &fh, mbuf, 256, &n)) != 0)
+				goto err;
+
+			/*
+			 * Before V6, we created hash pages one by one as they
+			 * were needed, using hashhdr.ovfl_point to reserve
+			 * a block of page numbers for them.  A consequence
+			 * of this was that, if no overflow pages had been
+			 * created, the current doubling might extend past
+			 * the end of the database file.
+			 *
+			 * In DB 3.X, we now create all the hash pages
+			 * belonging to a doubling atomically;  it's not
+			 * safe to just save them for later, because when
+			 * we create an overflow page we'll just create
+			 * a new last page (whatever that may be).  Grow
+			 * the database to the end of the current doubling.
+			 */
+			if ((ret =
+			    __ham_30_sizefix(dbp, &fh, real_name, mbuf)) != 0)
+				goto err;
+			/* FALLTHROUGH */
+		case 6:
+			/*
+			 * We need the page size to do more.  Rip it out of
+			 * the meta-data page.
+			 */
+			memcpy(&dbp->pgsize, mbuf + 20, sizeof(u_int32_t));
+
+			if ((ret = __db_page_pass(
+			    dbp, real_name, flags, func_31_list, &fh)) != 0)
+				goto err;
+			/* FALLTHROUGH */
+		case 7:
+			break;
+		default:
+			__db_err(dbenv, "%s: unsupported hash version: %lu",
+			    real_name, (u_long)((DBMETA *)mbuf)->version);
+			ret = DB_OLD_VERSION;
+			goto err;
+		}
+		break;
+	case DB_QAMMAGIC:
+		switch (((DBMETA *)mbuf)->version) {
+		case 1:
+			/*
+			 * If we're in a Queue database, the only page that
+			 * needs upgrading is the meta-data page, so don't
+			 * bother with a full pass.
+			 */
+			if ((ret = __qam_31_qammeta(dbp, real_name, mbuf)) != 0)
+				goto err;
+			/* FALLTHROUGH */
+		case 2:
+			if ((ret = __qam_32_qammeta(dbp, real_name, mbuf)) != 0)
+				goto err;
+			if ((ret = __os_seek(dbenv,
+			    &fh, 0, 0, 0, 0, DB_OS_SEEK_SET)) != 0)
+				goto err;
+			if ((ret = __os_write(dbenv, &fh, mbuf, 256, &n)) != 0)
+				goto err;
+			/* FALLTHROUGH */
+		case 3:
+			break;
+		default:
+			__db_err(dbenv, "%s: unsupported queue version: %lu",
+			    real_name, (u_long)((DBMETA *)mbuf)->version);
+			ret = DB_OLD_VERSION;
+			goto err;
+		}
+		break;
+	default:
+		M_32_SWAP(((DBMETA *)mbuf)->magic);
+		switch (((DBMETA *)mbuf)->magic) {
+		case DB_BTREEMAGIC:
+		case DB_HASHMAGIC:
+		case DB_QAMMAGIC:
+			__db_err(dbenv,
+		"%s: DB->upgrade only supported on native byte-order systems",
+			    real_name);
+			break;
+		default:
+			__db_err(dbenv,
+			    "%s: unrecognized file type", real_name);
+			break;
+		}
+		ret = EINVAL;
+		goto err;
+	}
+
+	ret = __os_fsync(dbenv, &fh);
+
+err:	if ((t_ret = __os_closehandle(&fh)) != 0 && ret == 0)
+		ret = t_ret;
+	__os_freestr(real_name);
+
+	/* We're done. */
+	if (dbp->db_feedback != NULL)
+		dbp->db_feedback(dbp, DB_UPGRADE, 100);
+
+	return (ret);
+}
+
+/*
+ * __db_page_pass --
+ *	Walk the pages of the database, upgrading whatever needs it.
+ */
+static int
+__db_page_pass(dbp, real_name, flags, fl, fhp)
+	DB *dbp;
+	char *real_name;
+	u_int32_t flags;
+	int (* const fl[P_PAGETYPE_MAX])
+	    __P((DB *, char *, u_int32_t, DB_FH *, PAGE *, int *));
+	DB_FH *fhp;
+{
+	DB_ENV *dbenv;
+	PAGE *page;
+	db_pgno_t i, pgno_last;
+	size_t n;
+	int dirty, ret;
+
+	dbenv = dbp->dbenv;
+
+	/* Determine the last page of the file. */
+	if ((ret = __db_lastpgno(dbp, real_name, fhp, &pgno_last)) != 0)
+		return (ret);
+
+	/* Allocate memory for a single page. */
+	if ((ret = __os_malloc(dbenv, dbp->pgsize, NULL, &page)) != 0)
+		return (ret);
+
+	/* Walk the file, calling the underlying conversion functions. */
+	for (i = 0; i < pgno_last; ++i) {
+		if (dbp->db_feedback != NULL)
+			dbp->db_feedback(dbp, DB_UPGRADE, (i * 100)/pgno_last);
+		if ((ret = __os_seek(dbenv,
+		    fhp, dbp->pgsize, i, 0, 0, DB_OS_SEEK_SET)) != 0)
+			break;
+		if ((ret = __os_read(dbenv, fhp, page, dbp->pgsize, &n)) != 0)
+			break;
+		dirty = 0;
+		if (fl[TYPE(page)] != NULL && (ret = fl[TYPE(page)]
+		    (dbp, real_name, flags, fhp, page, &dirty)) != 0)
+			break;
+		if (dirty) {
+			if ((ret = __os_seek(dbenv,
+			    fhp, dbp->pgsize, i, 0, 0, DB_OS_SEEK_SET)) != 0)
+				break;
+			if ((ret = __os_write(dbenv,
+			    fhp, page, dbp->pgsize, &n)) != 0)
+				break;
+		}
+	}
+
+	__os_free(page, dbp->pgsize);
+	return (ret);
+}
+
+/*
+ * __db_lastpgno --
+ *	Return the current last page number of the file.
+ *
+ * PUBLIC: int __db_lastpgno __P((DB *, char *, DB_FH *, db_pgno_t *));
+ */
+int
+__db_lastpgno(dbp, real_name, fhp, pgno_lastp)
+	DB *dbp;
+	char *real_name;
+	DB_FH *fhp;
+	db_pgno_t *pgno_lastp;
+{
+	DB_ENV *dbenv;
+	db_pgno_t pgno_last;
+	u_int32_t mbytes, bytes;
+	int ret;
+
+	dbenv = dbp->dbenv;
+
+	if ((ret = __os_ioinfo(dbenv,
+	    real_name, fhp, &mbytes, &bytes, NULL)) != 0) {
+		__db_err(dbenv, "%s: %s", real_name, db_strerror(ret));
+		return (ret);
+	}
+
+	/* Page sizes have to be a power-of-two. */
+	if (bytes % dbp->pgsize != 0) {
+		__db_err(dbenv,
+		    "%s: file size not a multiple of the pagesize", real_name);
+		return (EINVAL);
+	}
+	pgno_last = mbytes * (MEGABYTE / dbp->pgsize);
+	pgno_last += bytes / dbp->pgsize;
+
+	*pgno_lastp = pgno_last;
+	return (0);
+}
diff --git a/bdb/db/db_upg_opd.c b/bdb/db/db_upg_opd.c
new file mode 100644
index 00000000000..a7be784afb8
--- /dev/null
+++ b/bdb/db/db_upg_opd.c
@@ -0,0 +1,353 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ *	Sleepycat Software.  All rights reserved.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_upg_opd.c,v 11.9 2000/11/30 00:58:33 ubell Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_swap.h"
+#include "btree.h"
+#include "hash.h"
+#include "qam.h"
+
+static int __db_build_bi __P((DB *, DB_FH *, PAGE *, PAGE *, u_int32_t, int *));
+static int __db_build_ri __P((DB *, DB_FH *, PAGE *, PAGE *, u_int32_t, int *));
+static int __db_up_ovref __P((DB *, DB_FH *, db_pgno_t));
+
+#define	GET_PAGE(dbp, fhp, pgno, page) {				\
+	if ((ret = __os_seek(dbp->dbenv,				\
+	    fhp, (dbp)->pgsize, pgno, 0, 0, DB_OS_SEEK_SET)) != 0)	\
+		goto err;						\
+	if ((ret = __os_read(dbp->dbenv,				\
+	    fhp, page, (dbp)->pgsize, &n)) != 0)			\
+		goto err;						\
+}
+#define	PUT_PAGE(dbp, fhp, pgno, page) {				\
+	if ((ret = __os_seek(dbp->dbenv,				\
+	    fhp, (dbp)->pgsize, pgno, 0, 0, DB_OS_SEEK_SET)) != 0)	\
+		goto err;						\
+	if ((ret = __os_write(dbp->dbenv,				\
+	    fhp, page, (dbp)->pgsize, &n)) != 0)			\
+		goto err;						\
+}
+
+/*
+ * __db_31_offdup --
+ *	Convert 3.0 off-page duplicates to 3.1 off-page duplicates.
+ *
+ * PUBLIC: int __db_31_offdup __P((DB *, char *, DB_FH *, int, db_pgno_t *));
+ */
+int
+__db_31_offdup(dbp, real_name, fhp, sorted, pgnop)
+	DB *dbp;
+	char *real_name;
+	DB_FH *fhp;
+	int sorted;
+	db_pgno_t *pgnop;
+{
+	PAGE *ipage, *page;
+	db_indx_t indx;
+	db_pgno_t cur_cnt, i, next_cnt, pgno, *pgno_cur, pgno_last;
+	db_pgno_t *pgno_next, pgno_max, *tmp;
+	db_recno_t nrecs;
+	size_t n;
+	int level, nomem, ret;
+
+	ipage = page = NULL;
+	pgno_cur = pgno_next = NULL;
+
+	/* Allocate room to hold a page. */
+	if ((ret = __os_malloc(dbp->dbenv, dbp->pgsize, NULL, &page)) != 0)
+		goto err;
+
+	/*
+	 * Walk the chain of 3.0 off-page duplicates.  Each one is converted
+	 * in place to a 3.1 off-page duplicate page.  If the duplicates are
+	 * sorted, they are converted to a Btree leaf page, otherwise to a
+	 * Recno leaf page.
+	 */
+	for (nrecs = 0, cur_cnt = pgno_max = 0,
+	    pgno = *pgnop; pgno != PGNO_INVALID;) {
+		if (pgno_max == cur_cnt) {
+			pgno_max += 20;
+			if ((ret = __os_realloc(dbp->dbenv, pgno_max *
+			    sizeof(db_pgno_t), NULL, &pgno_cur)) != 0)
+				goto err;
+		}
+		pgno_cur[cur_cnt++] = pgno;
+
+		GET_PAGE(dbp, fhp, pgno, page);
+		nrecs += NUM_ENT(page);
+		LEVEL(page) = LEAFLEVEL;
+		TYPE(page) = sorted ? P_LDUP : P_LRECNO;
+		/*
+		 * !!!
+		 * DB didn't zero the LSNs on off-page duplicates pages.
+		 */
+		ZERO_LSN(LSN(page));
+		PUT_PAGE(dbp, fhp, pgno, page);
+
+		pgno = NEXT_PGNO(page);
+	}
+
+	/* One page is easy; more than one needs internal pages above them. */
+	if (cur_cnt > 1) {
+		/*
+		 * pgno_cur is the list of pages we just converted.  We're
+		 * going to walk that list, but we'll need to create a new
+		 * list while we do so.
+		 */
+		if ((ret = __os_malloc(dbp->dbenv,
+		    cur_cnt * sizeof(db_pgno_t), NULL, &pgno_next)) != 0)
+			goto err;
+
+		/* Figure out where we can start allocating new pages. */
+		if ((ret = __db_lastpgno(dbp, real_name, fhp, &pgno_last)) != 0)
+			goto err;
+
+		/* Allocate room for an internal page. */
+		if ((ret = __os_malloc(dbp->dbenv,
+		    dbp->pgsize, NULL, &ipage)) != 0)
+			goto err;
+		PGNO(ipage) = PGNO_INVALID;
+	}
+
+	/*
+	 * Repeatedly walk the list of pages, building internal pages, until
+	 * there's only one page at a level.
+	 */
+	for (level = LEAFLEVEL + 1; cur_cnt > 1; ++level) {
+		for (indx = 0, i = next_cnt = 0; i < cur_cnt;) {
+			if (indx == 0) {
+				P_INIT(ipage, dbp->pgsize, pgno_last,
+				    PGNO_INVALID, PGNO_INVALID,
+				    level, sorted ? P_IBTREE : P_IRECNO);
+				ZERO_LSN(LSN(ipage));
+
+				pgno_next[next_cnt++] = pgno_last++;
+			}
+
+			GET_PAGE(dbp, fhp, pgno_cur[i], page);
+
+			/*
+			 * If the duplicates are sorted, put the first item on
+			 * the lower-level page onto a Btree internal page. If
+			 * the duplicates are not sorted, create an internal
+			 * Recno structure on the page.  If either case doesn't
+			 * fit, push out the current page and start a new one.
+			 */
+			nomem = 0;
+			if (sorted) {
+				if ((ret = __db_build_bi(
+				    dbp, fhp, ipage, page, indx, &nomem)) != 0)
+					goto err;
+			} else
+				if ((ret = __db_build_ri(
+				    dbp, fhp, ipage, page, indx, &nomem)) != 0)
+					goto err;
+			if (nomem) {
+				indx = 0;
+				PUT_PAGE(dbp, fhp, PGNO(ipage), ipage);
+			} else {
+				++indx;
+				++NUM_ENT(ipage);
+				++i;
+			}
+		}
+
+		/*
+		 * Push out the last internal page.  Set the top-level record
+		 * count if we've reached the top.
+		 */
+		if (next_cnt == 1)
+			RE_NREC_SET(ipage, nrecs);
+		PUT_PAGE(dbp, fhp, PGNO(ipage), ipage);
+
+		/* Swap the current and next page number arrays. */
+		cur_cnt = next_cnt;
+		tmp = pgno_cur;
+		pgno_cur = pgno_next;
+		pgno_next = tmp;
+	}
+
+	*pgnop = pgno_cur[0];
+
+err:	if (pgno_cur != NULL)
+		__os_free(pgno_cur, 0);
+	if (pgno_next != NULL)
+		__os_free(pgno_next, 0);
+	if (ipage != NULL)
+		__os_free(ipage, dbp->pgsize);
+	if (page != NULL)
+		__os_free(page, dbp->pgsize);
+
+	return (ret);
+}
+
+/*
+ * __db_build_bi --
+ *	Build a BINTERNAL entry for a parent page.
+ */
+static int
+__db_build_bi(dbp, fhp, ipage, page, indx, nomemp)
+	DB *dbp;
+	DB_FH *fhp;
+	PAGE *ipage, *page;
+	u_int32_t indx;
+	int *nomemp;
+{
+	BINTERNAL bi, *child_bi;
+	BKEYDATA *child_bk;
+	u_int8_t *p;
+	int ret;
+
+	switch (TYPE(page)) {
+	case P_IBTREE:
+		child_bi = GET_BINTERNAL(page, 0);
+		if (P_FREESPACE(ipage) < BINTERNAL_PSIZE(child_bi->len)) {
+			*nomemp = 1;
+			return (0);
+		}
+		ipage->inp[indx] =
+		     HOFFSET(ipage) -= BINTERNAL_SIZE(child_bi->len);
+		p = P_ENTRY(ipage, indx);
+
+		bi.len = child_bi->len;
+		B_TSET(bi.type, child_bi->type, 0);
+		bi.pgno = PGNO(page);
+		bi.nrecs = __bam_total(page);
+		memcpy(p, &bi, SSZA(BINTERNAL, data));
+		p += SSZA(BINTERNAL, data);
+		memcpy(p, child_bi->data, child_bi->len);
+
+		/* Increment the overflow ref count. */
+		if (B_TYPE(child_bi->type) == B_OVERFLOW)
+			if ((ret = __db_up_ovref(dbp, fhp,
+			    ((BOVERFLOW *)(child_bi->data))->pgno)) != 0)
+				return (ret);
+		break;
+	case P_LDUP:
+		child_bk = GET_BKEYDATA(page, 0);
+		switch (B_TYPE(child_bk->type)) {
+		case B_KEYDATA:
+			if (P_FREESPACE(ipage) <
+			    BINTERNAL_PSIZE(child_bk->len)) {
+				*nomemp = 1;
+				return (0);
+			}
+			ipage->inp[indx] =
+			    HOFFSET(ipage) -= BINTERNAL_SIZE(child_bk->len);
+			p = P_ENTRY(ipage, indx);
+
+			bi.len = child_bk->len;
+			B_TSET(bi.type, child_bk->type, 0);
+			bi.pgno = PGNO(page);
+			bi.nrecs = __bam_total(page);
+			memcpy(p, &bi, SSZA(BINTERNAL, data));
+			p += SSZA(BINTERNAL, data);
+			memcpy(p, child_bk->data, child_bk->len);
+			break;
+		case B_OVERFLOW:
+			if (P_FREESPACE(ipage) <
+			    BINTERNAL_PSIZE(BOVERFLOW_SIZE)) {
+				*nomemp = 1;
+				return (0);
+			}
+			ipage->inp[indx] =
+			    HOFFSET(ipage) -= BINTERNAL_SIZE(BOVERFLOW_SIZE);
+			p = P_ENTRY(ipage, indx);
+
+			bi.len = BOVERFLOW_SIZE;
+			B_TSET(bi.type, child_bk->type, 0);
+			bi.pgno = PGNO(page);
+			bi.nrecs = __bam_total(page);
+			memcpy(p, &bi, SSZA(BINTERNAL, data));
+			p += SSZA(BINTERNAL, data);
+			memcpy(p, child_bk, BOVERFLOW_SIZE);
+
+			/* Increment the overflow ref count. */
+			if ((ret = __db_up_ovref(dbp, fhp,
+			    ((BOVERFLOW *)child_bk)->pgno)) != 0)
+				return (ret);
+			break;
+		default:
+			return (__db_pgfmt(dbp, PGNO(page)));
+		}
+		break;
+	default:
+		return (__db_pgfmt(dbp, PGNO(page)));
+	}
+
+	return (0);
+}
+
+/*
+ * __db_build_ri --
+ *	Build a RINTERNAL entry for an internal parent page.
+ */
+static int
+__db_build_ri(dbp, fhp, ipage, page, indx, nomemp)
+	DB *dbp;
+	DB_FH *fhp;
+	PAGE *ipage, *page;
+	u_int32_t indx;
+	int *nomemp;
+{
+	RINTERNAL ri;
+
+	COMPQUIET(dbp, NULL);
+	COMPQUIET(fhp, NULL);
+
+	if (P_FREESPACE(ipage) < RINTERNAL_PSIZE) {
+		*nomemp = 1;
+		return (0);
+	}
+
+	ri.pgno = PGNO(page);
+	ri.nrecs = __bam_total(page);
+	ipage->inp[indx] = HOFFSET(ipage) -= RINTERNAL_SIZE;
+	memcpy(P_ENTRY(ipage, indx), &ri, RINTERNAL_SIZE);
+
+	return (0);
+}
+
+/*
+ * __db_up_ovref --
+ *	Increment the reference count on an overflow page.
+ */
+static int
+__db_up_ovref(dbp, fhp, pgno)
+	DB *dbp;
+	DB_FH *fhp;
+	db_pgno_t pgno;
+{
+	PAGE *page;
+	size_t n;
+	int ret;
+
+	/* Allocate room to hold a page. */
+	if ((ret = __os_malloc(dbp->dbenv, dbp->pgsize, NULL, &page)) != 0)
+		return (ret);
+
+	GET_PAGE(dbp, fhp, pgno, page);
+	++OV_REF(page);
+	PUT_PAGE(dbp, fhp, pgno, page);
+
+err:	__os_free(page, dbp->pgsize);
+
+	return (ret);
+}
diff --git a/bdb/db/db_vrfy.c b/bdb/db/db_vrfy.c
new file mode 100644
index 00000000000..3509e05e91f
--- /dev/null
+++ b/bdb/db/db_vrfy.c
@@ -0,0 +1,2340 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 2000
+ *	Sleepycat Software.  All rights reserved.
+ *
+ * $Id: db_vrfy.c,v 1.53 2001/01/11 18:19:51 bostic Exp $
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_vrfy.c,v 1.53 2001/01/11 18:19:51 bostic Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_swap.h"
+#include "db_verify.h"
+#include "db_ext.h"
+#include "btree.h"
+#include "hash.h"
+#include "qam.h"
+
+static int  __db_guesspgsize __P((DB_ENV *, DB_FH *));
+static int  __db_is_valid_magicno __P((u_int32_t, DBTYPE *));
+static int  __db_is_valid_pagetype __P((u_int32_t));
+static int  __db_meta2pgset
+		__P((DB *, VRFY_DBINFO *, db_pgno_t, u_int32_t, DB *));
+static int  __db_salvage_subdbs
+		__P((DB *, VRFY_DBINFO *, void *,
+		int (*)(void *, const void *), u_int32_t, int *));
+static int  __db_salvage_unknowns
+		__P((DB *, VRFY_DBINFO *, void *,
+		int (*)(void *, const void *), u_int32_t));
+static int  __db_vrfy_common
+		__P((DB *, VRFY_DBINFO *, PAGE *, db_pgno_t, u_int32_t));
+static int  __db_vrfy_freelist __P((DB *, VRFY_DBINFO *, db_pgno_t, u_int32_t));
+static int  __db_vrfy_invalid
+		__P((DB *, VRFY_DBINFO *, PAGE *, db_pgno_t, u_int32_t));
+static int  __db_vrfy_orderchkonly __P((DB *,
+		VRFY_DBINFO *, const char *, const char *, u_int32_t));
+static int  __db_vrfy_pagezero __P((DB *, VRFY_DBINFO *, DB_FH *, u_int32_t));
+static int  __db_vrfy_subdbs
+		__P((DB *, VRFY_DBINFO *, const char *, u_int32_t));
+static int  __db_vrfy_structure
+		__P((DB *, VRFY_DBINFO *, const char *, db_pgno_t, u_int32_t));
+static int  __db_vrfy_walkpages
+		__P((DB *, VRFY_DBINFO *, void *, int (*)(void *, const void *),
+		u_int32_t));
+
+/*
+ * This is the code for DB->verify, the DB database consistency checker.
+ * For now, it checks all subdatabases in a database, and verifies
+ * everything it knows how to (i.e. it's all-or-nothing, and one can't
+ * check only for a subset of possible problems).
+ */
+
+/*
+ * __db_verify --
+ *	Walk the entire file page-by-page, either verifying with or without
+ *	dumping in db_dump -d format, or DB_SALVAGE-ing whatever key/data
+ *	pairs can be found and dumping them in standard (db_load-ready)
+ *	dump format.
+ *
+ *	(Salvaging isn't really a verification operation, but we put it
+ *	here anyway because it requires essentially identical top-level
+ *	code.)
+ *
+ *	flags may be 0, DB_NOORDERCHK, DB_ORDERCHKONLY, or DB_SALVAGE
+ *	(and optionally DB_AGGRESSIVE).
+ *
+ *	__db_verify itself is simply a wrapper to __db_verify_internal,
+ *	which lets us pass appropriate equivalents to FILE * in from the
+ *	non-C APIs.
+ *
+ * PUBLIC: int __db_verify
+ * PUBLIC:     __P((DB *, const char *, const char *, FILE *, u_int32_t));
+ */
+int
+__db_verify(dbp, file, database, outfile, flags)
+	DB *dbp;
+	const char *file, *database;
+	FILE *outfile;
+	u_int32_t flags;
+{
+
+	return (__db_verify_internal(dbp,
+	    file, database, outfile, __db_verify_callback, flags));
+}
+
+/*
+ * __db_verify_callback --
+ *	Callback function for using pr_* functions from C.
+ *
+ * PUBLIC: int  __db_verify_callback __P((void *, const void *));
+ */
+int
+__db_verify_callback(handle, str_arg)
+	void *handle;
+	const void *str_arg;
+{
+	char *str;
+	FILE *f;
+
+	str = (char *)str_arg;
+	f = (FILE *)handle;
+
+	if (fprintf(f, "%s", str) != (int)strlen(str))
+		return (EIO);
+
+	return (0);
+}
+
+/*
+ * __db_verify_internal --
+ *	Inner meat of __db_verify.
+ *
+ * PUBLIC: int __db_verify_internal __P((DB *, const char *,
+ * PUBLIC:     const char *, void *, int (*)(void *, const void *), u_int32_t));
+ */
+int
+__db_verify_internal(dbp_orig, name, subdb, handle, callback, flags)
+	DB *dbp_orig;
+	const char *name, *subdb;
+	void *handle;
+	int (*callback) __P((void *, const void *));
+	u_int32_t flags;
+{
+	DB *dbp;
+	DB_ENV *dbenv;
+	DB_FH fh, *fhp;
+	PAGE *h;
+	VRFY_DBINFO *vdp;
+	db_pgno_t last;
+	int has, ret, isbad;
+	char *real_name;
+
+	dbenv = dbp_orig->dbenv;
+	vdp = NULL;
+	real_name = NULL;
+	ret = isbad = has = 0;
+
+	memset(&fh, 0, sizeof(fh));
+	fhp = &fh;
+
+	PANIC_CHECK(dbenv);
+	DB_ILLEGAL_AFTER_OPEN(dbp_orig, "verify");
+
+#define	OKFLAGS (DB_AGGRESSIVE | DB_NOORDERCHK | DB_ORDERCHKONLY | DB_SALVAGE)
+	if ((ret = __db_fchk(dbenv, "DB->verify", flags, OKFLAGS)) != 0)
+		return (ret);
+
+	/*
+	 * DB_SALVAGE is mutually exclusive with the other flags except
+	 * DB_AGGRESSIVE.
+	 */
+	if (LF_ISSET(DB_SALVAGE) &&
+	    (flags & ~DB_AGGRESSIVE) != DB_SALVAGE)
+		return (__db_ferr(dbenv, "DB->verify", 1));
+
+	if (LF_ISSET(DB_ORDERCHKONLY) && flags != DB_ORDERCHKONLY)
+		return (__db_ferr(dbenv, "DB->verify", 1));
+
+	if (LF_ISSET(DB_ORDERCHKONLY) && subdb == NULL) {
+		__db_err(dbenv, "DB_ORDERCHKONLY requires a database name");
+		return (EINVAL);
+	}
+
+	/*
+	 * Forbid working in an environment that uses transactions, logging,
+	 * or locking;  we're going to be looking at the file freely, and
+	 * while we're not going to modify it, we aren't obeying locking
+	 * conventions either.
+	 */
+	if (TXN_ON(dbenv) || LOCKING_ON(dbenv) || LOGGING_ON(dbenv)) {
+		dbp_orig->errx(dbp_orig,
+	    "verify may not be used with transactions, logging, or locking");
+		return (EINVAL);
+		/* NOTREACHED */
+	}
+
+	/* Create a dbp to use internally, which we can close at our leisure. */
+	if ((ret = db_create(&dbp, dbenv, 0)) != 0)
+		return (ret);
+
+	F_SET(dbp, DB_AM_VERIFYING);
+
+	/* Copy the supplied pagesize, which we use if the file one is bogus. */
+	if (dbp_orig->pgsize >= DB_MIN_PGSIZE &&
+	    dbp_orig->pgsize <= DB_MAX_PGSIZE)
+		dbp->set_pagesize(dbp, dbp_orig->pgsize);
+
+	/* Copy the feedback function, if present, and initialize it. */
+	if (!LF_ISSET(DB_SALVAGE) && dbp_orig->db_feedback != NULL) {
+		dbp->set_feedback(dbp, dbp_orig->db_feedback);
+		dbp->db_feedback(dbp, DB_VERIFY, 0);
+	}
+
+	/*
+	 * Copy the comparison and hashing functions.  Note that
+	 * even if the database is not a hash or btree, the respective
+	 * internal structures will have been initialized.
+	 */
+	if (dbp_orig->dup_compare != NULL &&
+	    (ret = dbp->set_dup_compare(dbp, dbp_orig->dup_compare)) != 0)
+		goto err;
+	if (((BTREE *)dbp_orig->bt_internal)->bt_compare != NULL &&
+	    (ret = dbp->set_bt_compare(dbp,
+	    ((BTREE *)dbp_orig->bt_internal)->bt_compare)) != 0)
+		goto err;
+	if (((HASH *)dbp_orig->h_internal)->h_hash != NULL &&
+	    (ret = dbp->set_h_hash(dbp,
+	    ((HASH *)dbp_orig->h_internal)->h_hash)) != 0)
+		goto err;
+
+	/*
+	 * We don't know how large the cache is, and if the database
+	 * in question uses a small page size--which we don't know
+	 * yet!--it may be uncomfortably small for the default page
+	 * size [#2143].  However, the things we need temporary
+	 * databases for in dbinfo are largely tiny, so using a
+	 * 1024-byte pagesize is probably not going to be a big hit,
+	 * and will make us fit better into small spaces.
+	 */
+	if ((ret = __db_vrfy_dbinfo_create(dbenv, 1024, &vdp)) != 0)
+		goto err;
+
+	/* Find the real name of the file. */
+	if ((ret = __db_appname(dbenv,
+	    DB_APP_DATA, NULL, name, 0, NULL, &real_name)) != 0)
+		goto err;
+
+	/*
+	 * Our first order of business is to verify page 0, which is
+	 * the metadata page for the master database of subdatabases
+	 * or of the only database in the file.  We want to do this by hand
+	 * rather than just calling __db_open in case it's corrupt--various
+	 * things in __db_open might act funny.
+	 *
+	 * Once we know the metadata page is healthy, I believe that it's
+	 * safe to open the database normally and then use the page swapping
+	 * code, which makes life easier.
+	 */
+	if ((ret = __os_open(dbenv, real_name, DB_OSO_RDONLY, 0444, fhp)) != 0)
+		goto err;
+
+	/* Verify the metadata page 0; set pagesize and type. */
+	if ((ret = __db_vrfy_pagezero(dbp, vdp, fhp, flags)) != 0) {
+		if (ret == DB_VERIFY_BAD)
+			isbad = 1;
+		else
+			goto err;
+	}
+
+	/*
+	 * We can assume at this point that dbp->pgsize and dbp->type are
+	 * set correctly, or at least as well as they can be, and that
+	 * locking, logging, and txns are not in use.  Thus we can trust
+	 * the memp code not to look at the page, and thus to be safe
+	 * enough to use.
+	 *
+	 * The dbp is not open, but the file is open in the fhp, and we
+	 * cannot assume that __db_open is safe.  Call __db_dbenv_setup,
+	 * the [safe] part of __db_open that initializes the environment--
+	 * and the mpool--manually.
+	 */
+	if ((ret = __db_dbenv_setup(dbp,
+	    name, DB_ODDFILESIZE | DB_RDONLY)) != 0)
+		goto err;
+
+	/* Mark the dbp as opened, so that we correctly handle its close. */
+	F_SET(dbp, DB_OPEN_CALLED);
+
+	/*
+	 * Find out the page number of the last page in the database.
+	 *
+	 * XXX: This currently fails if the last page is of bad type,
+	 * because it calls __db_pgin and that pukes.  This is bad.
+	 */
+	if ((ret = memp_fget(dbp->mpf, &last, DB_MPOOL_LAST, &h)) != 0)
+		goto err;
+	if ((ret = memp_fput(dbp->mpf, h, 0)) != 0)
+		goto err;
+
+	vdp->last_pgno = last;
+
+	/*
+	 * DB_ORDERCHKONLY is a special case;  our file consists of
+	 * several subdatabases, which use different hash, bt_compare,
+	 * and/or dup_compare functions.  Consequently, we couldn't verify
+	 * sorting and hashing simply by calling DB->verify() on the file.
+	 * DB_ORDERCHKONLY allows us to come back and check those things;  it
+	 * requires a subdatabase, and assumes that everything but that
+	 * database's sorting/hashing is correct.
+	 */
+	if (LF_ISSET(DB_ORDERCHKONLY)) {
+		ret = __db_vrfy_orderchkonly(dbp, vdp, name, subdb, flags);
+		goto done;
+	}
+
+	/*
+	 * When salvaging, we use a db to keep track of whether we've seen a
+	 * given overflow or dup page in the course of traversing normal data.
+	 * If in the end we have not, we assume its key got lost and print it
+	 * with key "UNKNOWN".
+	 */
+	if (LF_ISSET(DB_SALVAGE)) {
+		if ((ret = __db_salvage_init(vdp)) != 0)
+			goto err;
+
+		/*
+		 * If we're not being aggressive, attempt to crack subdbs.
+		 * "has" will indicate whether the attempt has succeeded
+		 * (even in part), meaning that we have some semblance of
+		 * subdbs;  on the walkpages pass, we print out
+		 * whichever data pages we have not seen.
+		 */
+		has = 0;
+		if (!LF_ISSET(DB_AGGRESSIVE) && (__db_salvage_subdbs(dbp,
+		    vdp, handle, callback, flags, &has)) != 0)
+			isbad = 1;
+
+		/*
+		 * If we have subdatabases, we need to signal that if
+		 * any keys are found that don't belong to a subdatabase,
+		 * they'll need to have an "__OTHER__" subdatabase header
+		 * printed first.  Flag this.  Else, print a header for
+		 * the normal, non-subdb database.
+		 */
+		if (has == 1)
+			F_SET(vdp, SALVAGE_PRINTHEADER);
+		else if ((ret = __db_prheader(dbp,
+		    NULL, 0, 0, handle, callback, vdp, PGNO_BASE_MD)) != 0)
+			goto err;
+	}
+
+	if ((ret =
+	    __db_vrfy_walkpages(dbp, vdp, handle, callback, flags)) != 0) {
+		if (ret == DB_VERIFY_BAD)
+			isbad = 1;
+		else if (ret != 0)
+			goto err;
+	}
+
+	/* If we're verifying, verify inter-page structure. */
+	if (!LF_ISSET(DB_SALVAGE) && isbad == 0)
+		if ((ret =
+		    __db_vrfy_structure(dbp, vdp, name, 0, flags)) != 0) {
+			if (ret == DB_VERIFY_BAD)
+				isbad = 1;
+			else if (ret != 0)
+				goto err;
+		}
+
+	/*
+	 * If we're salvaging, output with key UNKNOWN any overflow or dup pages
+	 * we haven't been able to put in context.  Then destroy the salvager's
+	 * state-saving database.
+	 */
+	if (LF_ISSET(DB_SALVAGE)) {
+		if ((ret = __db_salvage_unknowns(dbp,
+		    vdp, handle, callback, flags)) != 0)
+			isbad = 1;
+		/* No return value, since there's little we can do. */
+		__db_salvage_destroy(vdp);
+	}
+
+	if (0) {
+err:		(void)__db_err(dbenv, "%s: %s", name, db_strerror(ret));
+	}
+
+	if (LF_ISSET(DB_SALVAGE) &&
+	    (has == 0 || F_ISSET(vdp, SALVAGE_PRINTFOOTER)))
+		(void)__db_prfooter(handle, callback);
+
+	/* Send feedback that we're done. */
+done:	if (!LF_ISSET(DB_SALVAGE) && dbp->db_feedback != NULL)
+		dbp->db_feedback(dbp, DB_VERIFY, 100);
+
+	if (F_ISSET(fhp, DB_FH_VALID))
+		(void)__os_closehandle(fhp);
+	if (dbp)
+		(void)dbp->close(dbp, 0);
+	if (vdp)
+		(void)__db_vrfy_dbinfo_destroy(vdp);
+	if (real_name)
+		__os_freestr(real_name);
+
+	if ((ret == 0 && isbad == 1) || ret == DB_VERIFY_FATAL)
+		ret = DB_VERIFY_BAD;
+
+	return (ret);
+}
+
+/*
+ * __db_vrfy_pagezero --
+ *	Verify the master metadata page.  Use seek, read, and a local buffer
+ *	rather than the DB paging code, for safety.
+ *
+ *	Must correctly (or best-guess) set dbp->type and dbp->pgsize.
+ */
+static int
+__db_vrfy_pagezero(dbp, vdp, fhp, flags)
+	DB *dbp;
+	VRFY_DBINFO *vdp;
+	DB_FH *fhp;
+	u_int32_t flags;
+{
+	DBMETA *meta;
+	DB_ENV *dbenv;
+	VRFY_PAGEINFO *pip;
+	db_pgno_t freelist;
+	size_t nr;
+	int t_ret, ret, swapped;
+	u_int8_t mbuf[DBMETASIZE];
+
+	swapped = ret = t_ret = 0;
+	freelist = 0;
+	dbenv = dbp->dbenv;
+	meta = (DBMETA *)mbuf;
+	dbp->type = DB_UNKNOWN;
+
+	/*
+	 * Seek to the metadata page.  Note that if we're just starting a
+	 * verification, dbp->pgsize may be zero;  this is okay, as we want
+	 * page zero anyway and 0*0 == 0.
+	 */
+	if ((ret = __os_seek(dbenv, fhp, 0, 0, 0, 0, DB_OS_SEEK_SET)) != 0)
+		goto err;
+
+	if ((ret = __os_read(dbenv, fhp, mbuf, DBMETASIZE, &nr)) != 0)
+		goto err;
+
+	if (nr != DBMETASIZE) {
+		EPRINT((dbp->dbenv,
+		    "Incomplete metadata page %lu", (u_long)PGNO_BASE_MD));
+		t_ret = DB_VERIFY_FATAL;
+		goto err;
+	}
+
+	/*
+	 * Check all of the fields that we can.
+	 */
+
+	/* 08-11: Current page number.  Must == pgno. */
+	/* Note that endianness doesn't matter--it's zero. */
+	if (meta->pgno != PGNO_BASE_MD) {
+		EPRINT((dbp->dbenv, "Bad pgno: was %lu, should be %lu",
+		    (u_long)meta->pgno, (u_long)PGNO_BASE_MD));
+		ret = DB_VERIFY_BAD;
+	}
+
+	/* 12-15: Magic number.  Must be one of valid set. */
+	if (__db_is_valid_magicno(meta->magic, &dbp->type))
+		swapped = 0;
+	else {
+		M_32_SWAP(meta->magic);
+		if (__db_is_valid_magicno(meta->magic,
+		    &dbp->type))
+			swapped = 1;
+		else {
+			EPRINT((dbp->dbenv,
+			    "Bad magic number: %lu", (u_long)meta->magic));
+			ret = DB_VERIFY_BAD;
+		}
+	}
+
+	/*
+	 * 16-19: Version.  Must be current;  for now, we
+	 * don't support verification of old versions.
+	 */
+	if (swapped)
+		M_32_SWAP(meta->version);
+	if ((dbp->type == DB_BTREE && meta->version != DB_BTREEVERSION) ||
+	    (dbp->type == DB_HASH && meta->version != DB_HASHVERSION) ||
+	    (dbp->type == DB_QUEUE && meta->version != DB_QAMVERSION)) {
+		ret = DB_VERIFY_BAD;
+		EPRINT((dbp->dbenv, "%s%s", "Old or incorrect DB ",
+		    "version; extraneous errors may result"));
+	}
+
+	/*
+	 * 20-23: Pagesize.  Must be power of two,
+	 * at least 512, and no greater than 64K.
+	 */
+	if (swapped)
+		M_32_SWAP(meta->pagesize);
+	if (IS_VALID_PAGESIZE(meta->pagesize))
+		dbp->pgsize = meta->pagesize;
+	else {
+		EPRINT((dbp->dbenv,
+		    "Bad page size: %lu", (u_long)meta->pagesize));
+		ret = DB_VERIFY_BAD;
+
+		/*
+		 * Now try to settle on a pagesize to use.
+		 * If the user-supplied one is reasonable,
+		 * use it;  else, guess.
+		 */
+		if (!IS_VALID_PAGESIZE(dbp->pgsize))
+			dbp->pgsize = __db_guesspgsize(dbenv, fhp);
+	}
+
+	/*
+	 * 25: Page type.  Must be correct for dbp->type,
+	 * which is by now set as well as it can be.
+	 */
+	/* Needs no swapping--only one byte! */
+	if ((dbp->type == DB_BTREE && meta->type != P_BTREEMETA) ||
+	    (dbp->type == DB_HASH && meta->type != P_HASHMETA) ||
+	    (dbp->type == DB_QUEUE && meta->type != P_QAMMETA)) {
+		ret = DB_VERIFY_BAD;
+		EPRINT((dbp->dbenv, "Bad page type: %lu", (u_long)meta->type));
+	}
+
+	/*
+	 * 28-31: Free list page number.
+	 * We'll verify its sensibility when we do inter-page
+	 * verification later;  for now, just store it.
+	 */
+	if (swapped)
+		M_32_SWAP(meta->free);
+	freelist = meta->free;
+
+	/*
+	 * Initialize vdp->pages to fit a single pageinfo structure for
+	 * this one page.  We'll realloc later when we know how many
+	 * pages there are.
+	 */
+	if ((t_ret = __db_vrfy_getpageinfo(vdp, PGNO_BASE_MD, &pip)) != 0)
+		return (t_ret);
+	pip->pgno = PGNO_BASE_MD;
+	pip->type = meta->type;
+
+	/*
+	 * Signal that we still have to check the info specific to
+	 * a given type of meta page.
+	 */
+	F_SET(pip, VRFY_INCOMPLETE);
+
+	pip->free = freelist;
+
+	if ((t_ret = __db_vrfy_putpageinfo(vdp, pip)) != 0)
+		return (t_ret);
+
+	/* Set up the dbp's fileid.  We don't use the regular open path. */
+	memcpy(dbp->fileid, meta->uid, DB_FILE_ID_LEN);
+
+	if (0) {
+err:		__db_err(dbenv, "%s", db_strerror(ret));
+	}
+
+	if (swapped == 1)
+		F_SET(dbp, DB_AM_SWAP);
+	if (t_ret != 0)
+		ret = t_ret;
+	return (ret);
+}
+
+/*
+ * __db_vrfy_walkpages --
+ *	Main loop of the verifier/salvager.  Walks through,
+ *	page by page, and verifies all pages and/or prints all data pages.
+ */
+static int
+__db_vrfy_walkpages(dbp, vdp, handle, callback, flags)
+	DB *dbp;
+	VRFY_DBINFO *vdp;
+	void *handle;
+	int (*callback) __P((void *, const void *));
+	u_int32_t flags;
+{
+	DB_ENV *dbenv;
+	PAGE *h;
+	db_pgno_t i;
+	int ret, t_ret, isbad;
+
+	ret = isbad = t_ret = 0;
+	dbenv = dbp->dbenv;
+
+	if ((ret = __db_fchk(dbenv,
+	    "__db_vrfy_walkpages", flags, OKFLAGS)) != 0)
+		return (ret);
+
+	for (i = 0; i <= vdp->last_pgno; i++) {
+		/*
+		 * If DB_SALVAGE is set, we inspect our database of
+		 * completed pages, and skip any we've already printed in
+		 * the subdb pass.
+		 */
+		if (LF_ISSET(DB_SALVAGE) && (__db_salvage_isdone(vdp, i) != 0))
+			continue;
+
+		/* If an individual page get fails, keep going. */
+		if ((t_ret = memp_fget(dbp->mpf, &i, 0, &h)) != 0) {
+			if (ret == 0)
+				ret = t_ret;
+			continue;
+		}
+
+		if (LF_ISSET(DB_SALVAGE)) {
+			/*
+			 * We pretty much don't want to quit unless a
+			 * bomb hits.  May as well return that something
+			 * was screwy, however.
+			 */
+			if ((t_ret = __db_salvage(dbp,
+			    vdp, i, h, handle, callback, flags)) != 0) {
+				if (ret == 0)
+					ret = t_ret;
+				isbad = 1;
+			}
+		} else {
+			/*
+			 * Verify info common to all page
+			 * types.
+			 */
+			if (i != PGNO_BASE_MD)
+				if ((t_ret = __db_vrfy_common(dbp,
+				    vdp, h, i, flags)) == DB_VERIFY_BAD)
+					isbad = 1;
+
+			switch (TYPE(h)) {
+			case P_INVALID:
+				t_ret = __db_vrfy_invalid(dbp,
+				    vdp, h, i, flags);
+				break;
+			case __P_DUPLICATE:
+				isbad = 1;
+				EPRINT((dbp->dbenv,
+				    "Old-style duplicate page: %lu",
+				    (u_long)i));
+				break;
+			case P_HASH:
+				t_ret = __ham_vrfy(dbp,
+				    vdp, h, i, flags);
+				break;
+			case P_IBTREE:
+			case P_IRECNO:
+			case P_LBTREE:
+			case P_LDUP:
+				t_ret = __bam_vrfy(dbp,
+				    vdp, h, i, flags);
+				break;
+			case P_LRECNO:
+				t_ret = __ram_vrfy_leaf(dbp,
+				    vdp, h, i, flags);
+				break;
+			case P_OVERFLOW:
+				t_ret = __db_vrfy_overflow(dbp,
+				    vdp, h, i, flags);
+				break;
+			case P_HASHMETA:
+				t_ret = __ham_vrfy_meta(dbp,
+				    vdp, (HMETA *)h, i, flags);
+				break;
+			case P_BTREEMETA:
+				t_ret = __bam_vrfy_meta(dbp,
+				    vdp, (BTMETA *)h, i, flags);
+				break;
+			case P_QAMMETA:
+				t_ret = __qam_vrfy_meta(dbp,
+				    vdp, (QMETA *)h, i, flags);
+				break;
+			case P_QAMDATA:
+				t_ret = __qam_vrfy_data(dbp,
+				    vdp, (QPAGE *)h, i, flags);
+				break;
+			default:
+				EPRINT((dbp->dbenv,
+				    "Unknown page type: %lu", (u_long)TYPE(h)));
+				isbad = 1;
+				break;
+			}
+
+			/*
+			 * Set up error return.
+			 */
+			if (t_ret == DB_VERIFY_BAD)
+				isbad = 1;
+			else if (t_ret == DB_VERIFY_FATAL)
+				goto err;
+			else
+				ret = t_ret;
+
+			/*
+			 * Provide feedback to the application about our
+			 * progress.  The range 0-50% comes from the fact
+			 * that this is the first of two passes through the
+			 * database (front-to-back, then top-to-bottom).
+			 */
+			if (dbp->db_feedback != NULL)
+				dbp->db_feedback(dbp, DB_VERIFY,
+				    (i + 1) * 50 / (vdp->last_pgno + 1));
+		}
+
+		if ((t_ret = memp_fput(dbp->mpf, h, 0)) != 0 && ret == 0)
+			ret = t_ret;
+	}
+
+	if (0) {
+err:		if ((t_ret = memp_fput(dbp->mpf, h, 0)) != 0)
+			return (ret == 0 ? t_ret : ret);
+		return (DB_VERIFY_BAD);
+	}
+
+	return ((isbad == 1 && ret == 0) ? DB_VERIFY_BAD : ret);
+}
+
+/*
+ * __db_vrfy_structure --
+ *	After a beginning-to-end walk through the database has been
+ *	completed, put together the information that has been collected
+ *	to verify the overall database structure.
+ *
+ *	Should only be called if we want to do a database verification,
+ *	i.e. if DB_SALVAGE is not set.
+ */
+static int
+__db_vrfy_structure(dbp, vdp, dbname, meta_pgno, flags)
+	DB *dbp;
+	VRFY_DBINFO *vdp;
+	const char *dbname;
+	db_pgno_t meta_pgno;
+	u_int32_t flags;
+{
+	DB *pgset;
+	DB_ENV *dbenv;
+	VRFY_PAGEINFO *pip;
+	db_pgno_t i;
+	int ret, isbad, hassubs, p;
+
+	isbad = 0;
+	pip = NULL;
+	dbenv = dbp->dbenv;
+	pgset = vdp->pgset;
+
+	if ((ret = __db_fchk(dbenv, "DB->verify", flags, OKFLAGS)) != 0)
+		return (ret);
+	if (LF_ISSET(DB_SALVAGE)) {
+		__db_err(dbenv, "__db_vrfy_structure called with DB_SALVAGE");
+		return (EINVAL);
+	}
+
+	/*
+	 * Providing feedback here is tricky;  in most situations,
+	 * we fetch each page one more time, but we do so in a top-down
+	 * order that depends on the access method.  Worse, we do this
+	 * recursively in btree, such that on any call where we're traversing
+	 * a subtree we don't know where that subtree is in the whole database;
+	 * worse still, any given database may be one of several subdbs.
+	 *
+	 * The solution is to decrement a counter vdp->pgs_remaining each time
+	 * we verify (and call feedback on) a page.  We may over- or
+	 * under-count, but the structure feedback function will ensure that we
+	 * never give a percentage under 50 or over 100.  (The first pass
+	 * covered the range 0-50%.)
+	 */
+	if (dbp->db_feedback != NULL)
+		vdp->pgs_remaining = vdp->last_pgno + 1;
+
+	/*
+	 * Call the appropriate function to downwards-traverse the db type.
+	 */
+	switch(dbp->type) {
+	case DB_BTREE:
+	case DB_RECNO:
+		if ((ret = __bam_vrfy_structure(dbp, vdp, 0, flags)) != 0) {
+			if (ret == DB_VERIFY_BAD)
+				isbad = 1;
+			else
+				goto err;
+		}
+
+		/*
+		 * If we have subdatabases and we know that the database is,
+		 * thus far, sound, it's safe to walk the tree of subdatabases.
+		 * Do so, and verify the structure of the databases within.
+		 */
+		if ((ret = __db_vrfy_getpageinfo(vdp, 0, &pip)) != 0)
+			goto err;
+		hassubs = F_ISSET(pip, VRFY_HAS_SUBDBS);
+		if ((ret = __db_vrfy_putpageinfo(vdp, pip)) != 0)
+			goto err;
+
+		if (isbad == 0 && hassubs)
+			if ((ret =
+			    __db_vrfy_subdbs(dbp, vdp, dbname, flags)) != 0) {
+				if (ret == DB_VERIFY_BAD)
+					isbad = 1;
+				else
+					goto err;
+			}
+		break;
+	case DB_HASH:
+		if ((ret = __ham_vrfy_structure(dbp, vdp, 0, flags)) != 0) {
+			if (ret == DB_VERIFY_BAD)
+				isbad = 1;
+			else
+				goto err;
+		}
+		break;
+	case DB_QUEUE:
+		if ((ret = __qam_vrfy_structure(dbp, vdp, flags)) != 0) {
+			if (ret == DB_VERIFY_BAD)
+				isbad = 1;
+		}
+
+		/*
+		 * Queue pages may be unreferenced and totally zeroed, if
+		 * they're empty;  queue doesn't have much structure, so
+		 * this is unlikely to be wrong in any troublesome sense.
+		 * Skip to "err".
+		 */
+		goto err;
+		/* NOTREACHED */
+	default:
+		/* This should only happen if the verifier is somehow broken. */
+		DB_ASSERT(0);
+		ret = EINVAL;
+		goto err;
+		/* NOTREACHED */
+	}
+
+	/* Walk free list. */
+	if ((ret =
+	    __db_vrfy_freelist(dbp, vdp, meta_pgno, flags)) == DB_VERIFY_BAD)
+		isbad = 1;
+
+	/*
+	 * If structure checks up until now have failed, it's likely that
+	 * checking what pages have been missed will result in oodles of
+	 * extraneous error messages being EPRINTed.  Skip to the end
+	 * if this is the case;  we're going to be printing at least one
+	 * error anyway, and probably all the more salient ones.
+	 */
+	if (ret != 0 || isbad == 1)
+		goto err;
+
+	/*
+	 * Make sure no page has been missed and that no page is still marked
+	 * "all zeroes" (only certain hash pages can be, and they're unmarked
+	 * in __ham_vrfy_structure).
+	 */
+	for (i = 0; i < vdp->last_pgno + 1; i++) {
+		if ((ret = __db_vrfy_getpageinfo(vdp, i, &pip)) != 0)
+			goto err;
+		if ((ret = __db_vrfy_pgset_get(pgset, i, &p)) != 0)
+			goto err;
+		if (p == 0) {
+			EPRINT((dbp->dbenv,
+			    "Unreferenced page %lu", (u_long)i));
+			isbad = 1;
+		}
+
+		if (F_ISSET(pip, VRFY_IS_ALLZEROES)) {
+			EPRINT((dbp->dbenv,
+			    "Totally zeroed page %lu", (u_long)i));
+			isbad = 1;
+		}
+		if ((ret = __db_vrfy_putpageinfo(vdp, pip)) != 0)
+			goto err;
+		pip = NULL;
+	}
+
+err:	if (pip != NULL)
+		(void)__db_vrfy_putpageinfo(vdp, pip);
+
+	return ((isbad == 1 && ret == 0) ? DB_VERIFY_BAD : ret);
+}
+
+/*
+ * __db_is_valid_pagetype
+ */
+static int
+__db_is_valid_pagetype(type)
+	u_int32_t type;
+{
+	switch (type) {
+	case P_INVALID:			/* Order matches ordinal value. */
+	case P_HASH:
+	case P_IBTREE:
+	case P_IRECNO:
+	case P_LBTREE:
+	case P_LRECNO:
+	case P_OVERFLOW:
+	case P_HASHMETA:
+	case P_BTREEMETA:
+	case P_QAMMETA:
+	case P_QAMDATA:
+	case P_LDUP:
+		return (1);
+	}
+	return (0);
+}
+
+/*
+ * __db_is_valid_magicno
+ */
+static int
+__db_is_valid_magicno(magic, typep)
+	u_int32_t magic;
+	DBTYPE *typep;
+{
+	switch (magic) {
+	case DB_BTREEMAGIC:
+		*typep = DB_BTREE;
+		return (1);
+	case DB_HASHMAGIC:
+		*typep = DB_HASH;
+		return (1);
+	case DB_QAMMAGIC:
+		*typep = DB_QUEUE;
+		return (1);
+	}
+	*typep = DB_UNKNOWN;
+	return (0);
+}
+
+/*
+ * __db_vrfy_common --
+ *	Verify info common to all page types.
+ */
+static int
+__db_vrfy_common(dbp, vdp, h, pgno, flags)
+	DB *dbp;
+	VRFY_DBINFO *vdp;
+	PAGE *h;
+	db_pgno_t pgno;
+	u_int32_t flags;
+{
+	VRFY_PAGEINFO *pip;
+	int ret, t_ret;
+	u_int8_t *p;
+
+	if ((ret = __db_vrfy_getpageinfo(vdp, pgno, &pip)) != 0)
+		return (ret);
+
+	pip->pgno = pgno;
+	F_CLR(pip, VRFY_IS_ALLZEROES);
+
+	/*
+	 * Hash expands the table by leaving some pages between the
+	 * old last and the new last totally zeroed.  Its pgin function
+	 * should fix things, but we might not be using that (e.g. if
+	 * we're a subdatabase).
+	 *
+	 * Queue will create sparse files if sparse record numbers are used.
+	 */
+	if (pgno != 0 && PGNO(h) == 0) {
+		for (p = (u_int8_t *)h; p < (u_int8_t *)h + dbp->pgsize; p++)
+			if (*p != 0) {
+				EPRINT((dbp->dbenv,
+				    "Page %lu should be zeroed and is not",
+				    (u_long)pgno));
+				ret = DB_VERIFY_BAD;
+				goto err;
+			}
+		/*
+		 * It's totally zeroed;  mark it as a hash, and we'll
+		 * check that that makes sense structurally later.
+		 * (The queue verification doesn't care, since queues
+		 * don't really have much in the way of structure.)
+		 */
+		pip->type = P_HASH;
+		F_SET(pip, VRFY_IS_ALLZEROES);
+		ret = 0;
+		goto err;	/* well, not really an err. */
+	}
+
+	if (PGNO(h) != pgno) {
+		EPRINT((dbp->dbenv,
+		    "Bad page number: %lu should be %lu",
+		    (u_long)h->pgno, (u_long)pgno));
+		ret = DB_VERIFY_BAD;
+	}
+
+	if (!__db_is_valid_pagetype(h->type)) {
+		EPRINT((dbp->dbenv, "Bad page type: %lu", (u_long)h->type));
+		ret = DB_VERIFY_BAD;
+	}
+	pip->type = h->type;
+
+err:	if ((t_ret = __db_vrfy_putpageinfo(vdp, pip)) != 0 && ret == 0)
+		ret = t_ret;
+
+	return (ret);
+}
+
+/*
+ * __db_vrfy_invalid --
+ *	Verify P_INVALID page.
+ *	(Yes, there's not much to do here.)
+ */
+static int
+__db_vrfy_invalid(dbp, vdp, h, pgno, flags)
+	DB *dbp;
+	VRFY_DBINFO *vdp;
+	PAGE *h;
+	db_pgno_t pgno;
+	u_int32_t flags;
+{
+	VRFY_PAGEINFO *pip;
+	int ret, t_ret;
+
+	if ((ret = __db_vrfy_getpageinfo(vdp, pgno, &pip)) != 0)
+		return (ret);
+	pip->next_pgno = pip->prev_pgno = 0;
+
+	if (!IS_VALID_PGNO(NEXT_PGNO(h))) {
+		EPRINT((dbp->dbenv,
+		    "Invalid next_pgno %lu on page %lu",
+		    (u_long)NEXT_PGNO(h), (u_long)pgno));
+		ret = DB_VERIFY_BAD;
+	} else
+		pip->next_pgno = NEXT_PGNO(h);
+
+	if ((t_ret = __db_vrfy_putpageinfo(vdp, pip)) != 0 && ret == 0)
+		ret = t_ret;
+	return (ret);
+}
+
+/*
+ * __db_vrfy_datapage --
+ *	Verify elements common to data pages (P_HASH, P_LBTREE,
+ *	P_IBTREE, P_IRECNO, P_LRECNO, P_OVERFLOW, P_DUPLICATE)--i.e.,
+ *	those defined in the PAGE structure.
+ *
+ *	Called from each of the per-page routines, after the
+ *	all-page-type-common elements of pip have been verified and filled
+ *	in.
+ *
+ * PUBLIC: int __db_vrfy_datapage
+ * PUBLIC:     __P((DB *, VRFY_DBINFO *, PAGE *, db_pgno_t, u_int32_t));
+ */
+int
+__db_vrfy_datapage(dbp, vdp, h, pgno, flags)
+	DB *dbp;
+	VRFY_DBINFO *vdp;
+	PAGE *h;
+	db_pgno_t pgno;
+	u_int32_t flags;
+{
+	VRFY_PAGEINFO *pip;
+	int isbad, ret, t_ret;
+
+	if ((ret = __db_vrfy_getpageinfo(vdp, pgno, &pip)) != 0)
+		return (ret);
+	isbad = 0;
+
+	/*
+	 * prev_pgno and next_pgno:  store for inter-page checks,
+	 * verify that they point to actual pages and not to self.
+	 *
+	 * !!!
+	 * Internal btree pages do not maintain these fields (indeed,
+	 * they overload them).  Skip.
+	 */
+	if (TYPE(h) != P_IBTREE && TYPE(h) != P_IRECNO) {
+		if (!IS_VALID_PGNO(PREV_PGNO(h)) || PREV_PGNO(h) == pip->pgno) {
+			isbad = 1;
+			EPRINT((dbp->dbenv, "Page %lu: Invalid prev_pgno %lu",
+			    (u_long)pip->pgno, (u_long)PREV_PGNO(h)));
+		}
+		if (!IS_VALID_PGNO(NEXT_PGNO(h)) || NEXT_PGNO(h) == pip->pgno) {
+			isbad = 1;
+			EPRINT((dbp->dbenv, "Page %lu: Invalid next_pgno %lu",
+			    (u_long)pip->pgno, (u_long)NEXT_PGNO(h)));
+		}
+		pip->prev_pgno = PREV_PGNO(h);
+		pip->next_pgno = NEXT_PGNO(h);
+	}
+
+	/*
+	 * Verify the number of entries on the page.
+	 * There is no good way to determine if this is accurate;  the
+	 * best we can do is verify that it's not more than can, in theory,
+	 * fit on the page.  Then, we make sure there are at least
+	 * this many valid elements in inp[], and hope that this catches
+	 * most cases.
+	 */
+	if (TYPE(h) != P_OVERFLOW) {
+		if (BKEYDATA_PSIZE(0) * NUM_ENT(h) > dbp->pgsize) {
+			isbad = 1;
+			EPRINT((dbp->dbenv,
+			    "Page %lu: Too many entries: %lu",
+			    (u_long)pgno, (u_long)NUM_ENT(h)));
+		}
+		pip->entries = NUM_ENT(h);
+	}
+
+	/*
+	 * btree level.  Should be zero unless we're a btree;
+	 * if we are a btree, should be between LEAFLEVEL and MAXBTREELEVEL,
+	 * and we need to save it off.
+	 */
+	switch (TYPE(h)) {
+	case P_IBTREE:
+	case P_IRECNO:
+		if (LEVEL(h) < LEAFLEVEL + 1 || LEVEL(h) > MAXBTREELEVEL) {
+			isbad = 1;
+			EPRINT((dbp->dbenv, "Bad btree level %lu on page %lu",
+			    (u_long)LEVEL(h), (u_long)pgno));
+		}
+		pip->bt_level = LEVEL(h);
+		break;
+	case P_LBTREE:
+	case P_LDUP:
+	case P_LRECNO:
+		if (LEVEL(h) != LEAFLEVEL) {
+			isbad = 1;
+			EPRINT((dbp->dbenv,
+			    "Btree leaf page %lu has incorrect level %lu",
+			    (u_long)pgno, (u_long)LEVEL(h)));
+		}
+		break;
+	default:
+		if (LEVEL(h) != 0) {
+			isbad = 1;
+			EPRINT((dbp->dbenv,
+			    "Nonzero level %lu in non-btree database page %lu",
+			    (u_long)LEVEL(h), (u_long)pgno));
+		}
+		break;
+	}
+
+	/*
+	 * Even though inp[] occurs in all PAGEs, we look at it in the
+	 * access-method-specific code, since btree and hash treat
+	 * item lengths very differently, and one of the most important
+	 * things we want to verify is that the data--as specified
+	 * by offset and length--cover the right part of the page
+	 * without overlaps, gaps, or violations of the page boundary.
+	 */
+	if ((t_ret = __db_vrfy_putpageinfo(vdp, pip)) != 0 && ret == 0)
+		ret = t_ret;
+
+	return ((ret == 0 && isbad == 1) ? DB_VERIFY_BAD : ret);
+}
+
+/*
+ * __db_vrfy_meta --
+ *	Verify the access-method common parts of a meta page, using
+ *	normal mpool routines.
+ *
+ * PUBLIC: int __db_vrfy_meta
+ * PUBLIC:     __P((DB *, VRFY_DBINFO *, DBMETA *, db_pgno_t, u_int32_t));
+ */
+int
+__db_vrfy_meta(dbp, vdp, meta, pgno, flags)
+	DB *dbp;
+	VRFY_DBINFO *vdp;
+	DBMETA *meta;
+	db_pgno_t pgno;
+	u_int32_t flags;
+{
+	DBTYPE dbtype, magtype;
+	VRFY_PAGEINFO *pip;
+	int isbad, ret, t_ret;
+
+	isbad = 0;
+	if ((ret = __db_vrfy_getpageinfo(vdp, pgno, &pip)) != 0)
+		return (ret);
+
+	/* type plausible for a meta page */
+	switch (meta->type) {
+	case P_BTREEMETA:
+		dbtype = DB_BTREE;
+		break;
+	case P_HASHMETA:
+		dbtype = DB_HASH;
+		break;
+	case P_QAMMETA:
+		dbtype = DB_QUEUE;
+		break;
+	default:
+		/* The verifier should never let us get here. */
+		DB_ASSERT(0);
+		ret = EINVAL;
+		goto err;
+	}
+
+	/* magic number valid */
+	if (!__db_is_valid_magicno(meta->magic, &magtype)) {
+		isbad = 1;
+		EPRINT((dbp->dbenv,
+		    "Magic number invalid on page %lu", (u_long)pgno));
+	}
+	if (magtype != dbtype) {
+		isbad = 1;
+		EPRINT((dbp->dbenv,
+		    "Magic number does not match type of page %lu",
+		    (u_long)pgno));
+	}
+
+	/* version */
+	if ((dbtype == DB_BTREE && meta->version != DB_BTREEVERSION) ||
+	    (dbtype == DB_HASH && meta->version != DB_HASHVERSION) ||
+	    (dbtype == DB_QUEUE && meta->version != DB_QAMVERSION)) {
+		isbad = 1;
+		EPRINT((dbp->dbenv, "%s%s", "Old or incorrect DB ",
+		    "version; extraneous errors may result"));
+	}
+
+	/* pagesize */
+	if (meta->pagesize != dbp->pgsize) {
+		isbad = 1;
+		EPRINT((dbp->dbenv,
+		    "Invalid pagesize %lu on page %lu",
+		    (u_long)meta->pagesize, (u_long)pgno));
+	}
+
+	/* free list */
+	/*
+	 * If this is not the main, master-database meta page, it
+	 * should not have a free list.
+	 */
+	if (pgno != PGNO_BASE_MD && meta->free != PGNO_INVALID) {
+		isbad = 1;
+		EPRINT((dbp->dbenv,
+		    "Nonempty free list on subdatabase metadata page %lu",
+		    (u_long)pgno));
+	}
+
+	/* Can correctly be PGNO_INVALID--that's just the end of the list. */
+	if (meta->free != PGNO_INVALID && IS_VALID_PGNO(meta->free))
+		pip->free = meta->free;
+	else if (!IS_VALID_PGNO(meta->free)) {
+		isbad = 1;
+		EPRINT((dbp->dbenv,
+		    "Nonsensical free list pgno %lu on page %lu",
+		    (u_long)meta->free, (u_long)pgno));
+	}
+
+	/*
+	 * We have now verified the common fields of the metadata page.
+	 * Clear the flag that told us they had been incompletely checked.
+	 */
+	F_CLR(pip, VRFY_INCOMPLETE);
+
+err:	if ((t_ret = __db_vrfy_putpageinfo(vdp, pip)) != 0 && ret == 0)
+		ret = t_ret;
+
+	return ((ret == 0 && isbad == 1) ? DB_VERIFY_BAD : ret);
+}
+
+/*
+ * __db_vrfy_freelist --
+ *	Walk free list, checking off pages and verifying absence of
+ *	loops.
+ */
+static int
+__db_vrfy_freelist(dbp, vdp, meta, flags)
+	DB *dbp;
+	VRFY_DBINFO *vdp;
+	db_pgno_t meta;
+	u_int32_t flags;
+{
+	DB *pgset;
+	VRFY_PAGEINFO *pip;
+	db_pgno_t pgno;
+	int p, ret, t_ret;
+
+	pgset = vdp->pgset;
+	DB_ASSERT(pgset != NULL);
+
+	if ((ret = __db_vrfy_getpageinfo(vdp, meta, &pip)) != 0)
+		return (ret);
+	for (pgno = pip->free; pgno != PGNO_INVALID; pgno = pip->next_pgno) {
+		if ((ret = __db_vrfy_putpageinfo(vdp, pip)) != 0)
+			return (ret);
+
+		/* This shouldn't happen, but just in case. */
+		if (!IS_VALID_PGNO(pgno)) {
+			EPRINT((dbp->dbenv,
+			    "Invalid next_pgno on free list page %lu",
+			    (u_long)pgno));
+			return (DB_VERIFY_BAD);
+		}
+
+		/* Detect cycles. */
+		if ((ret = __db_vrfy_pgset_get(pgset, pgno, &p)) != 0)
+			return (ret);
+		if (p != 0) {
+			EPRINT((dbp->dbenv,
+			    "Page %lu encountered a second time on free list",
+			    (u_long)pgno));
+			return (DB_VERIFY_BAD);
+		}
+		if ((ret = __db_vrfy_pgset_inc(pgset, pgno)) != 0)
+			return (ret);
+
+		if ((ret = __db_vrfy_getpageinfo(vdp, pgno, &pip)) != 0)
+			return (ret);
+
+		if (pip->type != P_INVALID) {
+			EPRINT((dbp->dbenv,
+			    "Non-invalid page %lu on free list", (u_long)pgno));
+			ret = DB_VERIFY_BAD;	  /* unsafe to continue */
+			break;
+		}
+	}
+
+	if ((t_ret = __db_vrfy_putpageinfo(vdp, pip)) != 0)
+		ret = t_ret;
+	return (ret);
+}
+
+/*
+ * __db_vrfy_subdbs --
+ *	Walk the known-safe master database of subdbs with a cursor,
+ *	verifying the structure of each subdatabase we encounter.
+ */
+static int
+__db_vrfy_subdbs(dbp, vdp, dbname, flags)
+	DB *dbp;
+	VRFY_DBINFO *vdp;
+	const char *dbname;
+	u_int32_t flags;
+{
+	DB *mdbp;
+	DBC *dbc;
+	DBT key, data;
+	VRFY_PAGEINFO *pip;
+	db_pgno_t meta_pgno;
+	int ret, t_ret, isbad;
+	u_int8_t type;
+
+	isbad = 0;
+	dbc = NULL;
+
+	if ((ret = __db_master_open(dbp, dbname, DB_RDONLY, 0, &mdbp)) != 0)
+		return (ret);
+
+	if ((ret =
+	    __db_icursor(mdbp, NULL, DB_BTREE, PGNO_INVALID, 0, &dbc)) != 0)
+		goto err;
+
+	memset(&key, 0, sizeof(key));
+	memset(&data, 0, sizeof(data));
+	while ((ret = dbc->c_get(dbc, &key, &data, DB_NEXT)) == 0) {
+		if (data.size != sizeof(db_pgno_t)) {
+			EPRINT((dbp->dbenv, "Database entry of invalid size"));
+			isbad = 1;
+			goto err;
+		}
+		memcpy(&meta_pgno, data.data, data.size);
+		/*
+		 * Subdatabase meta pgnos are stored in network byte
+		 * order for cross-endian compatibility.  Swap if appropriate.
+		 */
+		DB_NTOHL(&meta_pgno);
+		if (meta_pgno == PGNO_INVALID || meta_pgno > vdp->last_pgno) {
+			EPRINT((dbp->dbenv,
+			    "Database entry references invalid page %lu",
+			    (u_long)meta_pgno));
+			isbad = 1;
+			goto err;
+		}
+		if ((ret = __db_vrfy_getpageinfo(vdp, meta_pgno, &pip)) != 0)
+			goto err;
+		type = pip->type;
+		if ((ret = __db_vrfy_putpageinfo(vdp, pip)) != 0)
+			goto err;
+		switch (type) {
+		case P_BTREEMETA:
+			if ((ret = __bam_vrfy_structure(
+			    dbp, vdp, meta_pgno, flags)) != 0) {
+				if (ret == DB_VERIFY_BAD)
+					isbad = 1;
+				else
+					goto err;
+			}
+			break;
+		case P_HASHMETA:
+			if ((ret = __ham_vrfy_structure(
+			    dbp, vdp, meta_pgno, flags)) != 0) {
+				if (ret == DB_VERIFY_BAD)
+					isbad = 1;
+				else
+					goto err;
+			}
+			break;
+		case P_QAMMETA:
+		default:
+			EPRINT((dbp->dbenv,
+	    "Database entry references page %lu of invalid type %lu",
+			    (u_long)meta_pgno, (u_long)type));
+			ret = DB_VERIFY_BAD;
+			goto err;
+			/* NOTREACHED */
+		}
+	}
+
+	if (ret == DB_NOTFOUND)
+		ret = 0;
+
+err:	if (dbc != NULL && (t_ret = __db_c_close(dbc)) != 0 && ret == 0)
+		ret = t_ret;
+
+	if ((t_ret = mdbp->close(mdbp, 0)) != 0 && ret == 0)
+		ret = t_ret;
+
+	return ((ret == 0 && isbad == 1) ? DB_VERIFY_BAD : ret);
+}
+
+/*
+ * __db_vrfy_struct_feedback --
+ *	Provide feedback during top-down database structure traversal.
+ *	(See comment at the beginning of __db_vrfy_structure.)
+ *
+ * PUBLIC: int __db_vrfy_struct_feedback __P((DB *, VRFY_DBINFO *));
+ */
+int
+__db_vrfy_struct_feedback(dbp, vdp)
+	DB *dbp;
+	VRFY_DBINFO *vdp;
+{
+	int progress;
+
+	if (dbp->db_feedback == NULL)
+		return (0);
+
+	if (vdp->pgs_remaining > 0)
+		vdp->pgs_remaining--;
+
+	/* Don't allow a feedback call of 100 until we're really done. */
+	progress = 100 - (vdp->pgs_remaining * 50 / (vdp->last_pgno + 1));
+	dbp->db_feedback(dbp, DB_VERIFY, progress == 100 ? 99 : progress);
+
+	return (0);
+}
+
+/*
+ * __db_vrfy_orderchkonly --
+ *	Do a sort-order/hashing check on a known-otherwise-good subdb.
+ */
+static int
+__db_vrfy_orderchkonly(dbp, vdp, name, subdb, flags)
+	DB *dbp;
+	VRFY_DBINFO *vdp;
+	const char *name, *subdb;
+	u_int32_t flags;
+{
+	BTMETA *btmeta;
+	DB *mdbp, *pgset;
+	DBC *pgsc;
+	DBT key, data;
+	HASH *h_internal;
+	HMETA *hmeta;
+	PAGE *h, *currpg;
+	db_pgno_t meta_pgno, p, pgno;
+	u_int32_t bucket;
+	int t_ret, ret;
+
+	currpg = h = NULL;
+	pgsc = NULL;
+	pgset = NULL;
+
+	LF_CLR(DB_NOORDERCHK);
+
+	/* Open the master database and get the meta_pgno for the subdb. */
+	if ((ret = db_create(&mdbp, NULL, 0)) != 0)
+		return (ret);
+	if ((ret = __db_master_open(dbp, name, DB_RDONLY, 0, &mdbp)) != 0)
+		goto err;
+
+	memset(&key, 0, sizeof(key));
+	key.data = (void *)subdb;
+	memset(&data, 0, sizeof(data));
+	if ((ret = dbp->get(dbp, NULL, &key, &data, 0)) != 0)
+		goto err;
+
+	if (data.size != sizeof(db_pgno_t)) {
+		EPRINT((dbp->dbenv, "Database entry of invalid size"));
+		ret = DB_VERIFY_BAD;
+		goto err;
+	}
+
+	memcpy(&meta_pgno, data.data, data.size);
+
+	if ((ret = memp_fget(dbp->mpf, &meta_pgno, 0, &h)) != 0)
+		goto err;
+
+	if ((ret = __db_vrfy_pgset(dbp->dbenv, dbp->pgsize, &pgset)) != 0)
+		goto err;
+
+	switch (TYPE(h)) {
+	case P_BTREEMETA:
+		btmeta = (BTMETA *)h;
+		if (F_ISSET(&btmeta->dbmeta, BTM_RECNO)) {
+			/* Recnos have no order to check. */
+			ret = 0;
+			goto err;
+		}
+		if ((ret =
+		    __db_meta2pgset(dbp, vdp, meta_pgno, flags, pgset)) != 0)
+			goto err;
+		if ((ret = pgset->cursor(pgset, NULL, &pgsc, 0)) != 0)
+			goto err;
+		while ((ret = __db_vrfy_pgset_next(pgsc, &p)) == 0) {
+			if ((ret = memp_fget(dbp->mpf, &p, 0, &currpg)) != 0)
+				goto err;
+			if ((ret = __bam_vrfy_itemorder(dbp,
+			    NULL, currpg, p, NUM_ENT(currpg), 1,
+			    F_ISSET(&btmeta->dbmeta, BTM_DUP), flags)) != 0)
+				goto err;
+			if ((ret = memp_fput(dbp->mpf, currpg, 0)) != 0)
+				goto err;
+			currpg = NULL;
+		}
+		if ((ret = pgsc->c_close(pgsc)) != 0)
+			goto err;
+		break;
+	case P_HASHMETA:
+		hmeta = (HMETA *)h;
+		h_internal = (HASH *)dbp->h_internal;
+		/*
+		 * Make sure h_charkey is right.
+		 */
+		if (h_internal == NULL || h_internal->h_hash == NULL) {
+			EPRINT((dbp->dbenv,
+		    "DB_ORDERCHKONLY requires that a hash function be set"));
+			ret = DB_VERIFY_BAD;
+			goto err;
+		}
+		if (hmeta->h_charkey !=
+		    h_internal->h_hash(dbp, CHARKEY, sizeof(CHARKEY))) {
+			EPRINT((dbp->dbenv,
+			    "Incorrect hash function for database"));
+			ret = DB_VERIFY_BAD;
+			goto err;
+		}
+
+		/*
+		 * For each bucket, verify hashing on each page in the
+		 * corresponding chain of pages.
+		 */
+		for (bucket = 0; bucket <= hmeta->max_bucket; bucket++) {
+			pgno = BS_TO_PAGE(bucket, hmeta->spares);
+			while (pgno != PGNO_INVALID) {
+				if ((ret = memp_fget(dbp->mpf,
+				    &pgno, 0, &currpg)) != 0)
+					goto err;
+				if ((ret = __ham_vrfy_hashing(dbp,
+				    NUM_ENT(currpg), hmeta, bucket, pgno,
+				    flags, h_internal->h_hash)) != 0)
+					goto err;
+				pgno = NEXT_PGNO(currpg);
+				if ((ret = memp_fput(dbp->mpf, currpg, 0)) != 0)
+					goto err;
+				currpg = NULL;
+			}
+		}
+		break;
+	default:
+		EPRINT((dbp->dbenv, "Database meta page %lu of bad type %lu",
+		    (u_long)meta_pgno, (u_long)TYPE(h)));
+		ret = DB_VERIFY_BAD;
+		break;
+	}
+
+err:	if (pgsc != NULL)
+		(void)pgsc->c_close(pgsc);
+	if (pgset != NULL)
+		(void)pgset->close(pgset, 0);
+	if (h != NULL && (t_ret = memp_fput(dbp->mpf, h, 0)) != 0)
+		ret = t_ret;
+	if (currpg != NULL && (t_ret = memp_fput(dbp->mpf, currpg, 0)) != 0)
+		ret = t_ret;
+	if ((t_ret = mdbp->close(mdbp, 0)) != 0)
+		ret = t_ret;
+	return (ret);
+}
+
+/*
+ * __db_salvage --
+ *	Walk through a page, salvaging all likely or plausible (w/
+ *	DB_AGGRESSIVE) key/data pairs.
+ *
+ * PUBLIC: int __db_salvage __P((DB *, VRFY_DBINFO *, db_pgno_t, PAGE *,
+ * PUBLIC:     void *, int (*)(void *, const void *), u_int32_t));
+ */
+int
+__db_salvage(dbp, vdp, pgno, h, handle, callback, flags)
+	DB *dbp;
+	VRFY_DBINFO *vdp;
+	db_pgno_t pgno;
+	PAGE *h;
+	void *handle;
+	int (*callback) __P((void *, const void *));
+	u_int32_t flags;
+{
+	DB_ASSERT(LF_ISSET(DB_SALVAGE));
+
+	/* If we got this page in the subdb pass, we can safely skip it. */
+	if (__db_salvage_isdone(vdp, pgno))
+		return (0);
+
+	switch (TYPE(h)) {
+	case P_HASH:
+		return (__ham_salvage(dbp,
+		    vdp, pgno, h, handle, callback, flags));
+		/* NOTREACHED */
+	case P_LBTREE:
+		return (__bam_salvage(dbp,
+		    vdp, pgno, P_LBTREE, h, handle, callback, NULL, flags));
+		/* NOTREACHED */
+	case P_LDUP:
+		return (__db_salvage_markneeded(vdp, pgno, SALVAGE_LDUP));
+		/* NOTREACHED */
+	case P_OVERFLOW:
+		return (__db_salvage_markneeded(vdp, pgno, SALVAGE_OVERFLOW));
+		/* NOTREACHED */
+	case P_LRECNO:
+		/*
+		 * Recnos are tricky -- they may represent dup pages, or
+		 * they may be subdatabase/regular database pages in their
+		 * own right.  If the former, they need to be printed with a
+		 * key, preferably when we hit the corresponding datum in
+		 * a btree/hash page.  If the latter, there is no key.
+		 *
+		 * If a database is sufficiently frotzed, we're not going
+		 * to be able to get this right, so we best-guess:  just
+		 * mark it needed now, and if we're really a normal recno
+		 * database page, the "unknowns" pass will pick us up.
+		 */
+		return (__db_salvage_markneeded(vdp, pgno, SALVAGE_LRECNO));
+		/* NOTREACHED */
+	case P_IBTREE:
+	case P_INVALID:
+	case P_IRECNO:
+	case __P_DUPLICATE:
+	default:
+		/* XXX: Should we be more aggressive here? */
+		break;
+	}
+	return (0);
+}
+
+/*
+ * __db_salvage_unknowns --
+ *	Walk through the salvager database, printing with key "UNKNOWN"
+ *	any pages we haven't dealt with.
+ */
+static int
+__db_salvage_unknowns(dbp, vdp, handle, callback, flags)
+	DB *dbp;
+	VRFY_DBINFO *vdp;
+	void *handle;
+	int (*callback) __P((void *, const void *));
+	u_int32_t flags;
+{
+	DBT unkdbt, key, *dbt;
+	PAGE *h;
+	db_pgno_t pgno;
+	u_int32_t pgtype;
+	int ret, err_ret;
+	void *ovflbuf;
+
+	memset(&unkdbt, 0, sizeof(DBT));
+	unkdbt.size = strlen("UNKNOWN") + 1;
+	unkdbt.data = "UNKNOWN";
+
+	if ((ret = __os_malloc(dbp->dbenv, dbp->pgsize, 0, &ovflbuf)) != 0)
+		return (ret);
+
+	err_ret = 0;
+	while ((ret = __db_salvage_getnext(vdp, &pgno, &pgtype)) == 0) {
+		dbt = NULL;
+
+		if ((ret = memp_fget(dbp->mpf, &pgno, 0, &h)) != 0) {
+			err_ret = ret;
+			continue;
+		}
+
+		switch (pgtype) {
+		case SALVAGE_LDUP:
+		case SALVAGE_LRECNODUP:
+			dbt = &unkdbt;
+			/* FALLTHROUGH */
+		case SALVAGE_LBTREE:
+		case SALVAGE_LRECNO:
+			if ((ret = __bam_salvage(dbp, vdp, pgno, pgtype,
+			    h, handle, callback, dbt, flags)) != 0)
+				err_ret = ret;
+			break;
+		case SALVAGE_OVERFLOW:
+			/*
+			 * XXX:
+			 * This may generate multiple "UNKNOWN" keys in
+			 * a database with no dups.  What to do?
+			 */
+			if ((ret = __db_safe_goff(dbp,
+			    vdp, pgno, &key, &ovflbuf, flags)) != 0) {
+				err_ret = ret;
+				continue;
+			}
+			if ((ret = __db_prdbt(&key,
+			    0, " ", handle, callback, 0, NULL)) != 0) {
+				err_ret = ret;
+				continue;
+			}
+			if ((ret = __db_prdbt(&unkdbt,
+			    0, " ", handle, callback, 0, NULL)) != 0)
+				err_ret = ret;
+			break;
+		case SALVAGE_HASH:
+			if ((ret = __ham_salvage(
+			    dbp, vdp, pgno, h, handle, callback, flags)) != 0)
+				err_ret = ret;
+			break;
+		case SALVAGE_INVALID:
+		case SALVAGE_IGNORE:
+		default:
+			/*
+			 * Shouldn't happen, but if it does, just do what the
+			 * nice man says.
+			 */
+			DB_ASSERT(0);
+			break;
+		}
+		if ((ret = memp_fput(dbp->mpf, h, 0)) != 0)
+			err_ret = ret;
+	}
+
+	__os_free(ovflbuf, 0);
+
+	if (err_ret != 0 && ret == 0)
+		ret = err_ret;
+
+	return (ret == DB_NOTFOUND ? 0 : ret);
+}
+
+/*
+ * Offset of the ith inp array entry, which we can compare to the offset
+ * the entry stores.
+ */
+#define	INP_OFFSET(h, i)	\
+    ((db_indx_t)((u_int8_t *)((h)->inp + (i)) - (u_int8_t *)(h)))
+
+/*
+ * __db_vrfy_inpitem --
+ *	Verify that a single entry in the inp array is sane, and update
+ *	the high water mark and current item offset.  (The former of these is
+ *	used for state information between calls, and is required;  it must
+ *	be initialized to the pagesize before the first call.)
+ *
+ *	Returns DB_VERIFY_FATAL if inp has collided with the data,
+ *	since verification can't continue from there;  returns DB_VERIFY_BAD
+ *	if anything else is wrong.
+ *
+ * PUBLIC: int __db_vrfy_inpitem __P((DB *, PAGE *,
+ * PUBLIC:     db_pgno_t, u_int32_t, int, u_int32_t, u_int32_t *, u_int32_t *));
+ */
+int
+__db_vrfy_inpitem(dbp, h, pgno, i, is_btree, flags, himarkp, offsetp)
+	DB *dbp;
+	PAGE *h;
+	db_pgno_t pgno;
+	u_int32_t i;
+	int is_btree;
+	u_int32_t flags, *himarkp, *offsetp;
+{
+	BKEYDATA *bk;
+	db_indx_t offset, len;
+
+	DB_ASSERT(himarkp != NULL);
+
+	/*
+	 * Check that the inp array, which grows from the beginning of the
+	 * page forward, has not collided with the data, which grow from the
+	 * end of the page backward.
+	 */
+	if (h->inp + i >= (db_indx_t *)((u_int8_t *)h + *himarkp)) {
+		/* We've collided with the data.  We need to bail. */
+		EPRINT((dbp->dbenv,
+		    "Page %lu entries listing %lu overlaps data",
+		    (u_long)pgno, (u_long)i));
+		return (DB_VERIFY_FATAL);
+	}
+
+	offset = h->inp[i];
+
+	/*
+	 * Check that the item offset is reasonable:  it points somewhere
+	 * after the inp array and before the end of the page.
+	 */
+	if (offset <= INP_OFFSET(h, i) || offset > dbp->pgsize) {
+		EPRINT((dbp->dbenv,
+		    "Bad offset %lu at page %lu index %lu",
+		    (u_long)offset, (u_long)pgno, (u_long)i));
+		return (DB_VERIFY_BAD);
+	}
+
+	/* Update the high-water mark (what HOFFSET should be) */
+	if (offset < *himarkp)
+		*himarkp = offset;
+
+	if (is_btree) {
+		/*
+		 * Check that the item length remains on-page.
+		 */
+		bk = GET_BKEYDATA(h, i);
+
+		/*
+		 * We need to verify the type of the item here;
+		 * we can't simply assume that it will be one of the
+		 * expected three.  If it's not a recognizable type,
+		 * it can't be considered to have a verifiable
+		 * length, so it's not possible to certify it as safe.
+		 */
+		switch (B_TYPE(bk->type)) {
+		case B_KEYDATA:
+			len = bk->len;
+			break;
+		case B_DUPLICATE:
+		case B_OVERFLOW:
+			len = BOVERFLOW_SIZE;
+			break;
+		default:
+			EPRINT((dbp->dbenv,
+			    "Item %lu on page %lu of unrecognizable type",
+			    (u_long)i, (u_long)pgno));
+			return (DB_VERIFY_BAD);
+		}
+
+		if ((size_t)(offset + len) > dbp->pgsize) {
+			EPRINT((dbp->dbenv,
+			    "Item %lu on page %lu extends past page boundary",
+			    (u_long)i, (u_long)pgno));
+			return (DB_VERIFY_BAD);
+		}
+	}
+
+	if (offsetp != NULL)
+		*offsetp = offset;
+	return (0);
+}
+
+/*
+ * __db_vrfy_duptype --
+ *	Given a page number and a set of flags to __bam_vrfy_subtree,
+ *	verify that the dup tree type is correct--i.e., it's a recno
+ *	if DUPSORT is not set and a btree if it is.
+ *
+ * PUBLIC: int __db_vrfy_duptype
+ * PUBLIC:     __P((DB *, VRFY_DBINFO *, db_pgno_t, u_int32_t));
+ */
+int
+__db_vrfy_duptype(dbp, vdp, pgno, flags)
+	DB *dbp;
+	VRFY_DBINFO *vdp;
+	db_pgno_t pgno;
+	u_int32_t flags;
+{
+	VRFY_PAGEINFO *pip;
+	int ret, isbad;
+
+	isbad = 0;
+
+	if ((ret = __db_vrfy_getpageinfo(vdp, pgno, &pip)) != 0)
+		return (ret);
+
+	switch (pip->type) {
+	case P_IBTREE:
+	case P_LDUP:
+		if (!LF_ISSET(ST_DUPSORT)) {
+			EPRINT((dbp->dbenv,
+	    "Sorted duplicate set at page %lu in unsorted-dup database",
+			    (u_long)pgno));
+			isbad = 1;
+		}
+		break;
+	case P_IRECNO:
+	case P_LRECNO:
+		if (LF_ISSET(ST_DUPSORT)) {
+			EPRINT((dbp->dbenv,
+	    "Unsorted duplicate set at page %lu in sorted-dup database",
+			    (u_long)pgno));
+			isbad = 1;
+		}
+		break;
+	default:
+		EPRINT((dbp->dbenv,
+		    "Duplicate page %lu of inappropriate type %lu",
+		    (u_long)pgno, (u_long)pip->type));
+		isbad = 1;
+		break;
+	}
+
+	if ((ret = __db_vrfy_putpageinfo(vdp, pip)) != 0)
+		return (ret);
+	return (isbad == 1 ? DB_VERIFY_BAD : 0);
+}
+
+/*
+ * __db_salvage_duptree --
+ *	Attempt to salvage a given duplicate tree, given its alleged root.
+ *
+ *	The key that corresponds to this dup set has been passed to us
+ *	in DBT *key.  Because data items follow keys, though, it has been
+ *	printed once already.
+ *
+ *	The basic idea here is that pgno ought to be a P_LDUP, a P_LRECNO, a
+ *	P_IBTREE, or a P_IRECNO.  If it's an internal page, use the verifier
+ *	functions to make sure it's safe;  if it's not, we simply bail and the
+ *	data will have to be printed with no key later on.  If it is safe,
+ *	recurse on each of its children.
+ *
+ *	Whether or not it's safe, if it's a leaf page, __bam_salvage it.
+ *
+ *	At all times, use the DB hanging off vdp to mark and check what we've
+ *	done, so each page gets printed exactly once and we don't get caught
+ *	in any cycles.
+ *
+ * PUBLIC: int __db_salvage_duptree __P((DB *, VRFY_DBINFO *, db_pgno_t,
+ * PUBLIC:     DBT *, void *, int (*)(void *, const void *), u_int32_t));
+ */
+int
+__db_salvage_duptree(dbp, vdp, pgno, key, handle, callback, flags)
+	DB *dbp;
+	VRFY_DBINFO *vdp;
+	db_pgno_t pgno;
+	DBT *key;
+	void *handle;
+	int (*callback) __P((void *, const void *));
+	u_int32_t flags;
+{
+	PAGE *h;
+	int ret, t_ret;
+
+	if (pgno == PGNO_INVALID || !IS_VALID_PGNO(pgno))
+		return (DB_VERIFY_BAD);
+
+	/* We have a plausible page.  Try it. */
+	if ((ret = memp_fget(dbp->mpf, &pgno, 0, &h)) != 0)
+		return (ret);
+
+	switch (TYPE(h)) {
+	case P_IBTREE:
+	case P_IRECNO:
+		if ((ret = __db_vrfy_common(dbp, vdp, h, pgno, flags)) != 0)
+			goto err;
+		if ((ret = __bam_vrfy(dbp,
+		    vdp, h, pgno, flags | DB_NOORDERCHK)) != 0 ||
+		    (ret = __db_salvage_markdone(vdp, pgno)) != 0)
+			goto err;
+		/*
+		 * We have a known-healthy internal page.  Walk it.
+		 */
+		if ((ret = __bam_salvage_walkdupint(dbp, vdp, h, key,
+		    handle, callback, flags)) != 0)
+			goto err;
+		break;
+	case P_LRECNO:
+	case P_LDUP:
+		if ((ret = __bam_salvage(dbp,
+		    vdp, pgno, TYPE(h), h, handle, callback, key, flags)) != 0)
+			goto err;
+		break;
+	default:
+		ret = DB_VERIFY_BAD;
+		goto err;
+		/* NOTREACHED */
+	}
+
+err:	if ((t_ret = memp_fput(dbp->mpf, h, 0)) != 0 && ret == 0)
+		ret = t_ret;
+	return (ret);
+}
+
+/*
+ * __db_salvage_subdbs --
+ *	Check and see if this database has subdbs;  if so, try to salvage
+ *	them independently.
+ */
+static int
+__db_salvage_subdbs(dbp, vdp, handle, callback, flags, hassubsp)
+	DB *dbp;
+	VRFY_DBINFO *vdp;
+	void *handle;
+	int (*callback) __P((void *, const void *));
+	u_int32_t flags;
+	int *hassubsp;
+{
+	BTMETA *btmeta;
+	DB *pgset;
+	DBC *pgsc;
+	PAGE *h;
+	db_pgno_t p, meta_pgno;
+	int ret, err_ret;
+
+	err_ret = 0;
+	pgsc = NULL;
+	pgset = NULL;
+
+	meta_pgno = PGNO_BASE_MD;
+	if ((ret = memp_fget(dbp->mpf, &meta_pgno, 0, &h)) != 0)
+		return (ret);
+
+	if (TYPE(h) == P_BTREEMETA)
+		btmeta = (BTMETA *)h;
+	else {
+		/* Not a btree metadata, ergo no subdbs, so just return. */
+		ret = 0;
+		goto err;
+	}
+
+	/* If it's not a safe page, bail on the attempt. */
+	if ((ret = __db_vrfy_common(dbp, vdp, h, PGNO_BASE_MD, flags)) != 0 ||
+	   (ret = __bam_vrfy_meta(dbp, vdp, btmeta, PGNO_BASE_MD, flags)) != 0)
+		goto err;
+
+	if (!F_ISSET(&btmeta->dbmeta, BTM_SUBDB)) {
+		/* No subdbs, just return. */
+		ret = 0;
+		goto err;
+	}
+
+	/* We think we've got subdbs.  Mark it so. */
+	*hassubsp = 1;
+
+	if ((ret = memp_fput(dbp->mpf, h, 0)) != 0)
+		return (ret);
+
+	/*
+	 * We have subdbs.  Try to crack them.
+	 *
+	 * To do so, get a set of leaf pages in the master
+	 * database, and then walk each of the valid ones, salvaging
+	 * subdbs as we go.  If any prove invalid, just drop them;  we'll
+	 * pick them up on a later pass.
+	 */
+	if ((ret = __db_vrfy_pgset(dbp->dbenv, dbp->pgsize, &pgset)) != 0)
+		return (ret);
+	if ((ret =
+	    __db_meta2pgset(dbp, vdp, PGNO_BASE_MD, flags, pgset)) != 0)
+		goto err;
+
+	if ((ret = pgset->cursor(pgset, NULL, &pgsc, 0)) != 0)
+		goto err;
+	while ((ret = __db_vrfy_pgset_next(pgsc, &p)) == 0) {
+		if ((ret = memp_fget(dbp->mpf, &p, 0, &h)) != 0) {
+			err_ret = ret;
+			continue;
+		}
+		if ((ret = __db_vrfy_common(dbp, vdp, h, p, flags)) != 0 ||
+		    (ret = __bam_vrfy(dbp,
+		    vdp, h, p, flags | DB_NOORDERCHK)) != 0)
+			goto nextpg;
+		if (TYPE(h) != P_LBTREE)
+			goto nextpg;
+		else if ((ret = __db_salvage_subdbpg(
+		    dbp, vdp, h, handle, callback, flags)) != 0)
+			err_ret = ret;
+nextpg:		if ((ret = memp_fput(dbp->mpf, h, 0)) != 0)
+			err_ret = ret;
+	}
+
+	if (ret != DB_NOTFOUND)
+		goto err;
+	if ((ret = pgsc->c_close(pgsc)) != 0)
+		goto err;
+
+	ret = pgset->close(pgset, 0);
+	return ((ret == 0 && err_ret != 0) ? err_ret : ret);
+
+	/* NOTREACHED */
+
+err:	if (pgsc != NULL)
+		(void)pgsc->c_close(pgsc);
+	if (pgset != NULL)
+		(void)pgset->close(pgset, 0);
+	(void)memp_fput(dbp->mpf, h, 0);
+	return (ret);
+}
+
+/*
+ * __db_salvage_subdbpg --
+ *	Given a known-good leaf page in the master database, salvage all
+ *	leaf pages corresponding to each subdb.
+ *
+ * PUBLIC: int __db_salvage_subdbpg
+ * PUBLIC:     __P((DB *, VRFY_DBINFO *, PAGE *, void *,
+ * PUBLIC:     int (*)(void *, const void *), u_int32_t));
+ */
+int
+__db_salvage_subdbpg(dbp, vdp, master, handle, callback, flags)
+	DB *dbp;
+	VRFY_DBINFO *vdp;
+	PAGE *master;
+	void *handle;
+	int (*callback) __P((void *, const void *));
+	u_int32_t flags;
+{
+	BKEYDATA *bkkey, *bkdata;
+	BOVERFLOW *bo;
+	DB *pgset;
+	DBC *pgsc;
+	DBT key;
+	PAGE *subpg;
+	db_indx_t i;
+	db_pgno_t meta_pgno, p;
+	int ret, err_ret, t_ret;
+	char *subdbname;
+
+	ret = err_ret = 0;
+	subdbname = NULL;
+
+	if ((ret = __db_vrfy_pgset(dbp->dbenv, dbp->pgsize, &pgset)) != 0)
+		return (ret);
+
+	/*
+	 * For each entry, get and salvage the set of pages
+	 * corresponding to that entry.
+	 */
+	for (i = 0; i < NUM_ENT(master); i += P_INDX) {
+		bkkey = GET_BKEYDATA(master, i);
+		bkdata = GET_BKEYDATA(master, i + O_INDX);
+
+		/* Get the subdatabase name. */
+		if (B_TYPE(bkkey->type) == B_OVERFLOW) {
+			/*
+			 * We can, in principle anyway, have a subdb
+			 * name so long it overflows.  Ick.
+			 */
+			bo = (BOVERFLOW *)bkkey;
+			if ((ret = __db_safe_goff(dbp, vdp, bo->pgno, &key,
+			    (void **)&subdbname, flags)) != 0) {
+				err_ret = DB_VERIFY_BAD;
+				continue;
+			}
+
+			/* Nul-terminate it. */
+			if ((ret = __os_realloc(dbp->dbenv,
+			    key.size + 1, NULL, &subdbname)) != 0)
+				goto err;
+			subdbname[key.size] = '\0';
+		} else if (B_TYPE(bkkey->type) == B_KEYDATA) {
+			if ((ret = __os_realloc(dbp->dbenv,
+			    bkkey->len + 1, NULL, &subdbname)) != 0)
+				goto err;
+			memcpy(subdbname, bkkey->data, bkkey->len);
+			subdbname[bkkey->len] = '\0';
+		}
+
+		/* Get the corresponding pgno. */
+		if (bkdata->len != sizeof(db_pgno_t)) {
+			err_ret = DB_VERIFY_BAD;
+			continue;
+		}
+		memcpy(&meta_pgno, bkdata->data, sizeof(db_pgno_t));
+
+		/* If we can't get the subdb meta page, just skip the subdb. */
+		if (!IS_VALID_PGNO(meta_pgno) ||
+		    (ret = memp_fget(dbp->mpf, &meta_pgno, 0, &subpg)) != 0) {
+			err_ret = ret;
+			continue;
+		}
+
+		/*
+		 * Verify the subdatabase meta page.  This has two functions.
+		 * First, if it's bad, we have no choice but to skip the subdb
+		 * and let the pages just get printed on a later pass.  Second,
+		 * the access-method-specific meta verification routines record
+		 * the various state info (such as the presence of dups)
+		 * that we need for __db_prheader().
+		 */
+		if ((ret =
+		    __db_vrfy_common(dbp, vdp, subpg, meta_pgno, flags)) != 0) {
+			err_ret = ret;
+			(void)memp_fput(dbp->mpf, subpg, 0);
+			continue;
+		}
+		switch (TYPE(subpg)) {
+		case P_BTREEMETA:
+			if ((ret = __bam_vrfy_meta(dbp,
+			    vdp, (BTMETA *)subpg, meta_pgno, flags)) != 0) {
+				err_ret = ret;
+				(void)memp_fput(dbp->mpf, subpg, 0);
+				continue;
+			}
+			break;
+		case P_HASHMETA:
+			if ((ret = __ham_vrfy_meta(dbp,
+			    vdp, (HMETA *)subpg, meta_pgno, flags)) != 0) {
+				err_ret = ret;
+				(void)memp_fput(dbp->mpf, subpg, 0);
+				continue;
+			}
+			break;
+		default:
+			/* This isn't an appropriate page;  skip this subdb. */
+			err_ret = DB_VERIFY_BAD;
+			continue;
+			/* NOTREACHED */
+		}
+
+		if ((ret = memp_fput(dbp->mpf, subpg, 0)) != 0) {
+			err_ret = ret;
+			continue;
+		}
+
+		/* Print a subdatabase header. */
+		if ((ret = __db_prheader(dbp,
+		    subdbname, 0, 0, handle, callback, vdp, meta_pgno)) != 0)
+			goto err;
+
+		if ((ret = __db_meta2pgset(dbp, vdp, meta_pgno,
+		    flags, pgset)) != 0) {
+			err_ret = ret;
+			continue;
+		}
+
+		if ((ret = pgset->cursor(pgset, NULL, &pgsc, 0)) != 0)
+			goto err;
+		while ((ret = __db_vrfy_pgset_next(pgsc, &p)) == 0) {
+			if ((ret = memp_fget(dbp->mpf, &p, 0, &subpg)) != 0) {
+				err_ret = ret;
+				continue;
+			}
+			if ((ret = __db_salvage(dbp, vdp, p, subpg,
+			    handle, callback, flags)) != 0)
+				err_ret = ret;
+			if ((ret = memp_fput(dbp->mpf, subpg, 0)) != 0)
+				err_ret = ret;
+		}
+
+		if (ret != DB_NOTFOUND)
+			goto err;
+
+		if ((ret = pgsc->c_close(pgsc)) != 0)
+			goto err;
+		if ((ret = __db_prfooter(handle, callback)) != 0)
+			goto err;
+	}
+err:	if (subdbname)
+		__os_free(subdbname, 0);
+
+	if ((t_ret = pgset->close(pgset, 0)) != 0)
+		ret = t_ret;
+
+	if ((t_ret = __db_salvage_markdone(vdp, PGNO(master))) != 0)
+		return (t_ret);
+
+	return ((err_ret != 0) ? err_ret : ret);
+}
+
+/*
+ * __db_meta2pgset --
+ *	Given a known-safe meta page number, return the set of pages
+ *	corresponding to the database it represents.  Return DB_VERIFY_BAD if
+ *	it's not a suitable meta page or is invalid.
+ */
+static int
+__db_meta2pgset(dbp, vdp, pgno, flags, pgset)
+	DB *dbp;
+	VRFY_DBINFO *vdp;
+	db_pgno_t pgno;
+	u_int32_t flags;
+	DB *pgset;
+{
+	PAGE *h;
+	int ret, t_ret;
+
+	if ((ret = memp_fget(dbp->mpf, &pgno, 0, &h)) != 0)
+		return (ret);
+
+	switch (TYPE(h)) {
+	case P_BTREEMETA:
+		ret = __bam_meta2pgset(dbp, vdp, (BTMETA *)h, flags, pgset);
+		break;
+	case P_HASHMETA:
+		ret = __ham_meta2pgset(dbp, vdp, (HMETA *)h, flags, pgset);
+		break;
+	default:
+		ret = DB_VERIFY_BAD;
+		break;
+	}
+
+	if ((t_ret = memp_fput(dbp->mpf, h, 0)) != 0)
+		return (t_ret);
+	return (ret);
+}
+
+/*
+ * __db_guesspgsize --
+ *	Try to guess what the pagesize is if the one on the meta page
+ *	and the one in the db are invalid.
+ */
+static int
+__db_guesspgsize(dbenv, fhp)
+	DB_ENV *dbenv;
+	DB_FH *fhp;
+{
+	db_pgno_t i;
+	size_t nr;
+	u_int32_t guess;
+	u_int8_t type;
+	int ret;
+
+	for (guess = DB_MAX_PGSIZE; guess >= DB_MIN_PGSIZE; guess >>= 1) {
+		/*
+		 * We try to read three pages ahead after the first one
+		 * and make sure we have plausible types for all of them.
+		 * If the seeks fail, continue with a smaller size;
+		 * we're probably just looking past the end of the database.
+		 * If they succeed and the types are reasonable, also continue
+		 * with a size smaller;  we may be looking at pages N,
+		 * 2N, and 3N for some N > 1.
+		 *
+		 * As soon as we hit an invalid type, we stop and return
+		 * our previous guess; that last one was probably the page size.
+		 */
+		for (i = 1; i <= 3; i++) {
+			if ((ret = __os_seek(dbenv, fhp, guess,
+			    i, SSZ(DBMETA, type), 0, DB_OS_SEEK_SET)) != 0)
+				break;
+			if ((ret = __os_read(dbenv,
+			    fhp, &type, 1, &nr)) != 0 || nr == 0)
+				break;
+			if (type == P_INVALID || type >= P_PAGETYPE_MAX)
+				return (guess << 1);
+		}
+	}
+
+	/*
+	 * If we're just totally confused--the corruption takes up most of the
+	 * beginning pages of the database--go with the default size.
+	 */
+	return (DB_DEF_IOSIZE);
+}
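
The page-size guessing loop in __db_guesspgsize above can be sketched against an in-memory buffer instead of a file handle. This is only an illustration under stated assumptions: the sk_ names, the constants (including the type-byte offset and the range of valid page types), and the bounds check that stands in for a failed seek or short read are all hypothetical and are not taken from the Berkeley DB headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical constants for this sketch only. */
#define SK_MIN_PGSIZE	512u
#define SK_MAX_PGSIZE	65536u
#define SK_DEF_PGSIZE	8192u
#define SK_TYPE_OFFSET	25u	/* assumed offset of the type byte in a page */
#define SK_TYPE_INVALID	0u
#define SK_TYPE_MAX	14u	/* assumed: valid types are 1..13 */

/* Test helper: stamp a valid type byte into every pgsize-sized page. */
static void
sk_fill_pages(uint8_t *file, size_t len, uint32_t pgsize, uint8_t type)
{
	size_t off;

	for (off = SK_TYPE_OFFSET; off < len; off += pgsize)
		file[off] = type;
}

/*
 * Try page sizes from largest to smallest.  For each guess, look at the
 * type byte of pages 1..3.  Running off the end of the buffer means the
 * guess is too big (or the file is small), so try a smaller one.  The
 * first implausible type byte means the guess has become too small, so
 * the previous (doubled) guess is returned.
 */
static uint32_t
sk_guess_pgsize(const uint8_t *file, size_t len)
{
	uint32_t guess;
	size_t off;
	unsigned int i;
	uint8_t type;

	assert(file != NULL);

	for (guess = SK_MAX_PGSIZE; guess >= SK_MIN_PGSIZE; guess >>= 1)
		for (i = 1; i <= 3; i++) {
			off = (size_t)guess * i + SK_TYPE_OFFSET;
			if (off >= len)
				break;		/* Past the end. */
			type = file[off];
			if (type == SK_TYPE_INVALID || type >= SK_TYPE_MAX)
				return (guess << 1);
		}

	/* Nothing conclusive; fall back to a default. */
	return (SK_DEF_PGSIZE);
}
```

With eight 4096-byte pages whose type bytes are valid and whose remaining bytes are zero, the guesses 16384, 8192 and 4096 all land on valid type bytes; the guess 2048 lands in the middle of a page, sees a zero, and the function returns 4096.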
diff --git a/bdb/db/db_vrfyutil.c b/bdb/db/db_vrfyutil.c
new file mode 100644
index 00000000000..89dccdcc760
--- /dev/null
+++ b/bdb/db/db_vrfyutil.c
@@ -0,0 +1,830 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 2000
+ *	Sleepycat Software.  All rights reserved.
+ *
+ * $Id: db_vrfyutil.c,v 11.11 2000/11/28 21:36:04 bostic Exp $
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_vrfyutil.c,v 11.11 2000/11/28 21:36:04 bostic Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_verify.h"
+#include "db_ext.h"
+
+static int __db_vrfy_pgset_iinc __P((DB *, db_pgno_t, int));
+
+/*
+ * __db_vrfy_dbinfo_create --
+ *	Allocate and initialize a VRFY_DBINFO structure.
+ *
+ * PUBLIC: int __db_vrfy_dbinfo_create
+ * PUBLIC:     __P((DB_ENV *, u_int32_t, VRFY_DBINFO **));
+ */
+int
+__db_vrfy_dbinfo_create (dbenv, pgsize, vdpp)
+	DB_ENV *dbenv;
+	u_int32_t pgsize;
+	VRFY_DBINFO **vdpp;
+{
+	DB *cdbp, *pgdbp, *pgset;
+	VRFY_DBINFO *vdp;
+	int ret;
+
+	vdp = NULL;
+	cdbp = pgdbp = pgset = NULL;
+
+	if ((ret = __os_calloc(NULL,
+	    1, sizeof(VRFY_DBINFO), (void **)&vdp)) != 0)
+		goto err;
+
+	if ((ret = db_create(&cdbp, dbenv, 0)) != 0)
+		goto err;
+
+	if ((ret = cdbp->set_flags(cdbp, DB_DUP | DB_DUPSORT)) != 0)
+		goto err;
+
+	if ((ret = cdbp->set_pagesize(cdbp, pgsize)) != 0)
+		goto err;
+
+	if ((ret =
+	    cdbp->open(cdbp, NULL, NULL, DB_BTREE, DB_CREATE, 0600)) != 0)
+		goto err;
+
+	if ((ret = db_create(&pgdbp, dbenv, 0)) != 0)
+		goto err;
+
+	if ((ret = pgdbp->set_pagesize(pgdbp, pgsize)) != 0)
+		goto err;
+
+	if ((ret =
+	    pgdbp->open(pgdbp, NULL, NULL, DB_BTREE, DB_CREATE, 0600)) != 0)
+		goto err;
+
+	if ((ret = __db_vrfy_pgset(dbenv, pgsize, &pgset)) != 0)
+		goto err;
+
+	LIST_INIT(&vdp->subdbs);
+	LIST_INIT(&vdp->activepips);
+
+	vdp->cdbp = cdbp;
+	vdp->pgdbp = pgdbp;
+	vdp->pgset = pgset;
+	*vdpp = vdp;
+	return (0);
+
+err:	if (cdbp != NULL)
+		(void)cdbp->close(cdbp, 0);
+	if (pgdbp != NULL)
+		(void)pgdbp->close(pgdbp, 0);
+	if (vdp != NULL)
+		__os_free(vdp, sizeof(VRFY_DBINFO));
+	return (ret);
+}
+
+/*
+ * __db_vrfy_dbinfo_destroy --
+ *	Destructor for VRFY_DBINFO.  Destroys VRFY_PAGEINFOs and deallocates
+ *	structure.
+ *
+ * PUBLIC: int __db_vrfy_dbinfo_destroy __P((VRFY_DBINFO *));
+ */
+int
+__db_vrfy_dbinfo_destroy(vdp)
+	VRFY_DBINFO *vdp;
+{
+	VRFY_CHILDINFO *c, *d;
+	int t_ret, ret;
+
+	ret = 0;
+
+	for (c = LIST_FIRST(&vdp->subdbs); c != NULL; c = d) {
+		d = LIST_NEXT(c, links);
+		__os_free(c, 0);
+	}
+
+	if ((t_ret = vdp->pgdbp->close(vdp->pgdbp, 0)) != 0)
+		ret = t_ret;
+
+	if ((t_ret = vdp->cdbp->close(vdp->cdbp, 0)) != 0 && ret == 0)
+		ret = t_ret;
+
+	if ((t_ret = vdp->pgset->close(vdp->pgset, 0)) != 0 && ret == 0)
+		ret = t_ret;
+
+	DB_ASSERT(LIST_FIRST(&vdp->activepips) == NULL);
+
+	__os_free(vdp, sizeof(VRFY_DBINFO));
+	return (ret);
+}
+
+/*
+ * __db_vrfy_getpageinfo --
+ *	Get a PAGEINFO structure for a given page, creating it if necessary.
+ *
+ * PUBLIC: int __db_vrfy_getpageinfo
+ * PUBLIC:     __P((VRFY_DBINFO *, db_pgno_t, VRFY_PAGEINFO **));
+ */
+int
+__db_vrfy_getpageinfo(vdp, pgno, pipp)
+	VRFY_DBINFO *vdp;
+	db_pgno_t pgno;
+	VRFY_PAGEINFO **pipp;
+{
+	DBT key, data;
+	DB *pgdbp;
+	VRFY_PAGEINFO *pip;
+	int ret;
+
+	/*
+	 * We want a page info struct.  There are three places to get it from,
+	 * in decreasing order of preference:
+	 *
+	 * 1. vdp->activepips.  If it's already "checked out" (that is, we're
+	 *	already using it), we return the same exact structure with a
+	 *	bumped refcount.  This is necessary because this code is
+	 *	replacing array accesses, and it's common for f() to make some
+	 *	changes to a pip, and then call g() and h() which each make
+	 *	changes to the same pip.  vdps are never shared between threads
+	 *	(they're never returned to the application), so this is safe.
+	 * 2. The pgdbp.  It's not in memory, but it's in the database, so
+	 *	get it, give it a refcount of 1, and stick it on activepips.
+	 * 3. malloc.  It doesn't exist yet;  create it, then stick it on
+	 *	activepips.  We'll put it in the database when we putpageinfo
+	 *	later.
+	 */
+
+	/* Case 1. */
+	for (pip = LIST_FIRST(&vdp->activepips); pip != NULL;
+	    pip = LIST_NEXT(pip, links))
+		if (pip->pgno == pgno)
+			/* Found it. */
+			goto found;
+
+	/* Case 2. */
+	pgdbp = vdp->pgdbp;
+	memset(&key, 0, sizeof(DBT));
+	memset(&data, 0, sizeof(DBT));
+	F_SET(&data, DB_DBT_MALLOC);
+	key.data = &pgno;
+	key.size = sizeof(db_pgno_t);
+
+	if ((ret = pgdbp->get(pgdbp, NULL, &key, &data, 0)) == 0) {
+		/* Found it. */
+		DB_ASSERT(data.size == sizeof(VRFY_PAGEINFO));
+		pip = data.data;
+		DB_ASSERT(pip->pi_refcount == 0);
+		LIST_INSERT_HEAD(&vdp->activepips, pip, links);
+		goto found;
+	} else if (ret != DB_NOTFOUND)	/* Something nasty happened. */
+		return (ret);
+
+	/* Case 3 */
+	if ((ret = __db_vrfy_pageinfo_create(&pip)) != 0)
+		return (ret);
+
+	LIST_INSERT_HEAD(&vdp->activepips, pip, links);
+found:	pip->pi_refcount++;
+
+	*pipp = pip;
+
+	DB_ASSERT(pip->pi_refcount > 0);
+	return (0);
+}
+
+/*
+ * __db_vrfy_putpageinfo --
+ *	Put back a VRFY_PAGEINFO that we're done with.
+ *
+ * PUBLIC: int __db_vrfy_putpageinfo __P((VRFY_DBINFO *, VRFY_PAGEINFO *));
+ */
+int
+__db_vrfy_putpageinfo(vdp, pip)
+	VRFY_DBINFO *vdp;
+	VRFY_PAGEINFO *pip;
+{
+	DBT key, data;
+	DB *pgdbp;
+	VRFY_PAGEINFO *p;
+	int ret;
+#ifdef DIAGNOSTIC
+	int found;
+
+	found = 0;
+#endif
+
+	if (--pip->pi_refcount > 0)
+		return (0);
+
+	pgdbp = vdp->pgdbp;
+	memset(&key, 0, sizeof(DBT));
+	memset(&data, 0, sizeof(DBT));
+
+	key.data = &pip->pgno;
+	key.size = sizeof(db_pgno_t);
+	data.data = pip;
+	data.size = sizeof(VRFY_PAGEINFO);
+
+	if ((ret = pgdbp->put(pgdbp, NULL, &key, &data, 0)) != 0)
+		return (ret);
+
+	for (p = LIST_FIRST(&vdp->activepips); p != NULL;
+	    p = LIST_NEXT(p, links))
+		if (p == pip) {
+#ifdef DIAGNOSTIC
+			found++;
+#endif
+			DB_ASSERT(p->pi_refcount == 0);
+			LIST_REMOVE(p, links);
+			break;
+		}
+#ifdef DIAGNOSTIC
+	DB_ASSERT(found == 1);
+#endif
+
+	DB_ASSERT(pip->pi_refcount == 0);
+	__os_free(pip, 0);
+	return (0);
+}
+
+/*
+ * __db_vrfy_pgset --
+ *	Create a temporary database for storing sets of page numbers.
+ *	(A mapping from page number to int, used by the *_meta2pgset functions,
+ *	as well as for keeping track of which pages the verifier has seen.)
+ *
+ * PUBLIC: int __db_vrfy_pgset __P((DB_ENV *, u_int32_t, DB **));
+ */
+int
+__db_vrfy_pgset(dbenv, pgsize, dbpp)
+	DB_ENV *dbenv;
+	u_int32_t pgsize;
+	DB **dbpp;
+{
+	DB *dbp;
+	int ret;
+
+	if ((ret = db_create(&dbp, dbenv, 0)) != 0)
+		return (ret);
+	if ((ret = dbp->set_pagesize(dbp, pgsize)) != 0)
+		goto err;
+	if ((ret = dbp->open(dbp, NULL, NULL, DB_BTREE, DB_CREATE, 0600)) == 0)
+		*dbpp = dbp;
+	else
+err:		(void)dbp->close(dbp, 0);
+
+	return (ret);
+}
+
+/*
+ * __db_vrfy_pgset_get --
+ *	Get the value associated in a page set with a given pgno.  Return
+ *	a 0 value (and succeed) if we've never heard of this page.
+ *
+ * PUBLIC: int __db_vrfy_pgset_get __P((DB *, db_pgno_t, int *));
+ */
+int
+__db_vrfy_pgset_get(dbp, pgno, valp)
+	DB *dbp;
+	db_pgno_t pgno;
+	int *valp;
+{
+	DBT key, data;
+	int ret, val;
+
+	memset(&key, 0, sizeof(DBT));
+	memset(&data, 0, sizeof(DBT));
+
+	key.data = &pgno;
+	key.size = sizeof(db_pgno_t);
+	data.data = &val;
+	data.ulen = sizeof(int);
+	F_SET(&data, DB_DBT_USERMEM);
+
+	if ((ret = dbp->get(dbp, NULL, &key, &data, 0)) == 0) {
+		DB_ASSERT(data.size == sizeof(int));
+		memcpy(&val, data.data, sizeof(int));
+	} else if (ret == DB_NOTFOUND)
+		val = 0;
+	else
+		return (ret);
+
+	*valp = val;
+	return (0);
+}
+
+/*
+ * __db_vrfy_pgset_inc --
+ *	Increment the value associated with a pgno by 1.
+ *
+ * PUBLIC: int __db_vrfy_pgset_inc __P((DB *, db_pgno_t));
+ */
+int
+__db_vrfy_pgset_inc(dbp, pgno)
+	DB *dbp;
+	db_pgno_t pgno;
+{
+
+	return (__db_vrfy_pgset_iinc(dbp, pgno, 1));
+}
+
+/*
+ * __db_vrfy_pgset_dec --
+ *	Decrement the value associated with a pgno by 1.
+ *
+ * PUBLIC: int __db_vrfy_pgset_dec __P((DB *, db_pgno_t));
+ */
+int
+__db_vrfy_pgset_dec(dbp, pgno)
+	DB *dbp;
+	db_pgno_t pgno;
+{
+
+	return (__db_vrfy_pgset_iinc(dbp, pgno, -1));
+}
+
+/*
+ * __db_vrfy_pgset_iinc --
+ *	Increment the value associated with a pgno by i.
+ *
+ */
+static int
+__db_vrfy_pgset_iinc(dbp, pgno, i)
+	DB *dbp;
+	db_pgno_t pgno;
+	int i;
+{
+	DBT key, data;
+	int ret;
+	int val;
+
+	memset(&key, 0, sizeof(DBT));
+	memset(&data, 0, sizeof(DBT));
+
+	val = 0;
+
+	key.data = &pgno;
+	key.size = sizeof(db_pgno_t);
+	data.data = &val;
+	data.ulen = sizeof(int);
+	F_SET(&data, DB_DBT_USERMEM);
+
+	if ((ret = dbp->get(dbp, NULL, &key, &data, 0)) == 0) {
+		DB_ASSERT(data.size == sizeof(int));
+		memcpy(&val, data.data, sizeof(int));
+	} else if (ret != DB_NOTFOUND)
+		return (ret);
+
+	data.size = sizeof(int);
+	val += i;
+
+	return (dbp->put(dbp, NULL, &key, &data, 0));
+}
+
+/*
+ * __db_vrfy_pgset_next --
+ *	Given a cursor open in a pgset database, get the next page in the
+ *	set.
+ *
+ * PUBLIC: int __db_vrfy_pgset_next __P((DBC *, db_pgno_t *));
+ */
+int
+__db_vrfy_pgset_next(dbc, pgnop)
+	DBC *dbc;
+	db_pgno_t *pgnop;
+{
+	DBT key, data;
+	db_pgno_t pgno;
+	int ret;
+
+	memset(&key, 0, sizeof(DBT));
+	memset(&data, 0, sizeof(DBT));
+	/* We don't care about the data, just the keys. */
+	F_SET(&data, DB_DBT_USERMEM | DB_DBT_PARTIAL);
+	F_SET(&key, DB_DBT_USERMEM);
+	key.data = &pgno;
+	key.ulen = sizeof(db_pgno_t);
+
+	if ((ret = dbc->c_get(dbc, &key, &data, DB_NEXT)) != 0)
+		return (ret);
+
+	DB_ASSERT(key.size == sizeof(db_pgno_t));
+	*pgnop = pgno;
+
+	return (0);
+}
+
+/*
+ * __db_vrfy_childcursor --
+ *	Create a cursor to walk the child list with.  The new cursor is
+ *	returned in the second argument.
+ *
+ * PUBLIC: int __db_vrfy_childcursor __P((VRFY_DBINFO *, DBC **));
+ */
+int
+__db_vrfy_childcursor(vdp, dbcp)
+	VRFY_DBINFO *vdp;
+	DBC **dbcp;
+{
+	DB *cdbp;
+	DBC *dbc;
+	int ret;
+
+	cdbp = vdp->cdbp;
+
+	if ((ret = cdbp->cursor(cdbp, NULL, &dbc, 0)) == 0)
+		*dbcp = dbc;
+
+	return (ret);
+}
+
+/*
+ * __db_vrfy_childput --
+ *	Add a child structure to the set for a given page.
+ *
+ * PUBLIC: int __db_vrfy_childput
+ * PUBLIC:     __P((VRFY_DBINFO *, db_pgno_t, VRFY_CHILDINFO *));
+ */
+int
+__db_vrfy_childput(vdp, pgno, cip)
+	VRFY_DBINFO *vdp;
+	db_pgno_t pgno;
+	VRFY_CHILDINFO *cip;
+{
+	DBT key, data;
+	DB *cdbp;
+	int ret;
+
+	cdbp = vdp->cdbp;
+	memset(&key, 0, sizeof(DBT));
+	memset(&data, 0, sizeof(DBT));
+
+	key.data = &pgno;
+	key.size = sizeof(db_pgno_t);
+
+	data.data = cip;
+	data.size = sizeof(VRFY_CHILDINFO);
+
+	/*
+	 * Don't add duplicate (data) entries for a given child, and accept
+	 * DB_KEYEXIST as a successful return;  we only need to verify
+	 * each child once, even if a child (such as an overflow key) is
+	 * multiply referenced.
+	 */
+	ret = cdbp->put(cdbp, NULL, &key, &data, DB_NODUPDATA);
+	return (ret == DB_KEYEXIST ? 0 : ret);
+}
+
+/*
+ * __db_vrfy_ccset --
+ *	Sets a cursor created with __db_vrfy_childcursor to the first
+ *	child of the given pgno, and returns it in the third arg.
+ *
+ * PUBLIC: int __db_vrfy_ccset __P((DBC *, db_pgno_t, VRFY_CHILDINFO **));
+ */
+int
+__db_vrfy_ccset(dbc, pgno, cipp)
+	DBC *dbc;
+	db_pgno_t pgno;
+	VRFY_CHILDINFO **cipp;
+{
+	DBT key, data;
+	int ret;
+
+	memset(&key, 0, sizeof(DBT));
+	memset(&data, 0, sizeof(DBT));
+
+	key.data = &pgno;
+	key.size = sizeof(db_pgno_t);
+
+	if ((ret = dbc->c_get(dbc, &key, &data, DB_SET)) != 0)
+		return (ret);
+
+	DB_ASSERT(data.size == sizeof(VRFY_CHILDINFO));
+	*cipp = (VRFY_CHILDINFO *)data.data;
+
+	return (0);
+}
+
+/*
+ * __db_vrfy_ccnext --
+ *	Gets the next child of the given cursor created with
+ *	__db_vrfy_childcursor, and returns it in the memory provided in the
+ *	second arg.
+ *
+ * PUBLIC: int __db_vrfy_ccnext __P((DBC *, VRFY_CHILDINFO **));
+ */
+int
+__db_vrfy_ccnext(dbc, cipp)
+	DBC *dbc;
+	VRFY_CHILDINFO **cipp;
+{
+	DBT key, data;
+	int ret;
+
+	memset(&key, 0, sizeof(DBT));
+	memset(&data, 0, sizeof(DBT));
+
+	if ((ret = dbc->c_get(dbc, &key, &data, DB_NEXT_DUP)) != 0)
+		return (ret);
+
+	DB_ASSERT(data.size == sizeof(VRFY_CHILDINFO));
+	*cipp = (VRFY_CHILDINFO *)data.data;
+
+	return (0);
+}
+
+/*
+ * __db_vrfy_ccclose --
+ *	Closes the cursor created with __db_vrfy_childcursor.
+ *
+ *	This doesn't actually do anything interesting now, but it's
+ *	not inconceivable that we might change the internal database usage
+ *	and keep the interfaces the same, and a function call here or there
+ *	seldom hurts anyone.
+ *
+ * PUBLIC: int __db_vrfy_ccclose __P((DBC *));
+ */
+int
+__db_vrfy_ccclose(dbc)
+	DBC *dbc;
+{
+
+	return (dbc->c_close(dbc));
+}
+
+/*
+ * __db_vrfy_pageinfo_create --
+ *	Constructor for VRFY_PAGEINFO;  allocates and initializes.
+ *
+ * PUBLIC: int __db_vrfy_pageinfo_create __P((VRFY_PAGEINFO **));
+ */
+int
+__db_vrfy_pageinfo_create(pgipp)
+	VRFY_PAGEINFO **pgipp;
+{
+	VRFY_PAGEINFO *pgip;
+	int ret;
+
+	if ((ret = __os_calloc(NULL,
+	    1, sizeof(VRFY_PAGEINFO), (void **)&pgip)) != 0)
+		return (ret);
+
+	DB_ASSERT(pgip->pi_refcount == 0);
+
+	*pgipp = pgip;
+	return (0);
+}
+
+/*
+ * __db_salvage_init --
+ *	Set up salvager database.
+ *
+ * PUBLIC: int  __db_salvage_init __P((VRFY_DBINFO *));
+ */
+int
+__db_salvage_init(vdp)
+	VRFY_DBINFO *vdp;
+{
+	DB *dbp;
+	int ret;
+
+	if ((ret = db_create(&dbp, NULL, 0)) != 0)
+		return (ret);
+
+	if ((ret = dbp->set_pagesize(dbp, 1024)) != 0)
+		goto err;
+
+	if ((ret = dbp->open(dbp, NULL, NULL, DB_BTREE, DB_CREATE, 0)) != 0)
+		goto err;
+
+	vdp->salvage_pages = dbp;
+	return (0);
+
+err:	(void)dbp->close(dbp, 0);
+	return (ret);
+}
+
+/*
+ * __db_salvage_destroy --
+ *	Close salvager database.
+ * PUBLIC: void  __db_salvage_destroy __P((VRFY_DBINFO *));
+ */
+void
+__db_salvage_destroy(vdp)
+	VRFY_DBINFO *vdp;
+{
+	(void)vdp->salvage_pages->close(vdp->salvage_pages, 0);
+}
+
+/*
+ * __db_salvage_getnext --
+ *	Get the next (first) unprinted page in the database of pages we still
+ *	need to print.  Delete entries for any already-printed pages we
+ *	encounter in this search, as well as the page we're returning.
+ *
+ * PUBLIC: int __db_salvage_getnext
+ * PUBLIC:     __P((VRFY_DBINFO *, db_pgno_t *, u_int32_t *));
+ */
+int
+__db_salvage_getnext(vdp, pgnop, pgtypep)
+	VRFY_DBINFO *vdp;
+	db_pgno_t *pgnop;
+	u_int32_t *pgtypep;
+{
+	DB *dbp;
+	DBC *dbc;
+	DBT key, data;
+	int ret;
+	u_int32_t pgtype;
+
+	dbp = vdp->salvage_pages;
+
+	memset(&key, 0, sizeof(DBT));
+	memset(&data, 0, sizeof(DBT));
+
+	if ((ret = dbp->cursor(dbp, NULL, &dbc, 0)) != 0)
+		return (ret);
+
+	while ((ret = dbc->c_get(dbc, &key, &data, DB_NEXT)) == 0) {
+		DB_ASSERT(data.size == sizeof(u_int32_t));
+		memcpy(&pgtype, data.data, sizeof(pgtype));
+
+		if ((ret = dbc->c_del(dbc, 0)) != 0)
+			goto err;
+		if (pgtype != SALVAGE_IGNORE)
+			goto found;
+	}
+
+	/* No more entries--ret probably equals DB_NOTFOUND. */
+
+	if (0) {
+found:		DB_ASSERT(key.size == sizeof(db_pgno_t));
+		DB_ASSERT(data.size == sizeof(u_int32_t));
+
+		*pgnop = *(db_pgno_t *)key.data;
+		*pgtypep = *(u_int32_t *)data.data;
+	}
+
+err:	(void)dbc->c_close(dbc);
+	return (ret);
+}
+
+/*
+ * __db_salvage_isdone --
+ *	Return whether or not the given pgno is already marked
+ *	SALVAGE_IGNORE (meaning that we don't need to print it again).
+ *
+ *	Returns DB_KEYEXIST if it is marked, 0 if not, or another error on
+ *	error.
+ *
+ * PUBLIC: int __db_salvage_isdone __P((VRFY_DBINFO *, db_pgno_t));
+ */
+int
+__db_salvage_isdone(vdp, pgno)
+	VRFY_DBINFO *vdp;
+	db_pgno_t pgno;
+{
+	DBT key, data;
+	DB *dbp;
+	int ret;
+	u_int32_t currtype;
+
+	dbp = vdp->salvage_pages;
+
+	memset(&key, 0, sizeof(DBT));
+	memset(&data, 0, sizeof(DBT));
+
+	currtype = SALVAGE_INVALID;
+	data.data = &currtype;
+	data.ulen = sizeof(u_int32_t);
+	data.flags = DB_DBT_USERMEM;
+
+	key.data = &pgno;
+	key.size = sizeof(db_pgno_t);
+
+	/*
+	 * Put an entry for this page, with pgno as key and type as data,
+	 * unless it's already there and is marked done.
+	 * If it's there and is marked anything else, that's fine--we
+	 * want to mark it done.
+	 */
+	ret = dbp->get(dbp, NULL, &key, &data, 0);
+	if (ret == 0) {
+		/*
+		 * The key's already here.  Check and see if it's already
+		 * marked done.  If it is, return DB_KEYEXIST.  If it's not,
+		 * return 0.
+		 */
+		if (currtype == SALVAGE_IGNORE)
+			return (DB_KEYEXIST);
+		else
+			return (0);
+	} else if (ret != DB_NOTFOUND)
+		return (ret);
+
+	/* The pgno is not yet marked anything; return 0. */
+	return (0);
+}
+
+/*
+ * __db_salvage_markdone --
+ *	Mark a given page as done.
+ *
+ * PUBLIC: int __db_salvage_markdone __P((VRFY_DBINFO *, db_pgno_t));
+ */
+int
+__db_salvage_markdone(vdp, pgno)
+	VRFY_DBINFO *vdp;
+	db_pgno_t pgno;
+{
+	DBT key, data;
+	DB *dbp;
+	int pgtype, ret;
+	u_int32_t currtype;
+
+	pgtype = SALVAGE_IGNORE;
+	dbp = vdp->salvage_pages;
+
+	memset(&key, 0, sizeof(DBT));
+	memset(&data, 0, sizeof(DBT));
+
+	currtype = SALVAGE_INVALID;
+	data.data = &currtype;
+	data.ulen = sizeof(u_int32_t);
+	data.flags = DB_DBT_USERMEM;
+
+	key.data = &pgno;
+	key.size = sizeof(db_pgno_t);
+
+	/*
+	 * Put an entry for this page, with pgno as key and type as data,
+	 * unless it's already there and is marked done.
+	 * If it's there and is marked anything else, that's fine--we
+	 * want to mark it done, but db_salvage_isdone only lets
+	 * us know if it's marked IGNORE.
+	 *
+	 * We don't want to return DB_KEYEXIST, though;  this will
+	 * likely get passed up all the way and make no sense to the
+	 * application.  Instead, use DB_VERIFY_BAD to indicate that
+	 * we've seen this page already--it probably indicates a
+	 * multiply-linked page.
+	 */
+	if ((ret = __db_salvage_isdone(vdp, pgno)) != 0)
+		return (ret == DB_KEYEXIST ? DB_VERIFY_BAD : ret);
+
+	data.size = sizeof(u_int32_t);
+	data.data = &pgtype;
+
+	return (dbp->put(dbp, NULL, &key, &data, 0));
+}
+
+/*
+ * __db_salvage_markneeded --
+ *	If it has not yet been printed, make note of the fact that a page
+ *	must be dealt with later.
+ *
+ * PUBLIC: int __db_salvage_markneeded
+ * PUBLIC:     __P((VRFY_DBINFO *, db_pgno_t, u_int32_t));
+ */
+int
+__db_salvage_markneeded(vdp, pgno, pgtype)
+	VRFY_DBINFO *vdp;
+	db_pgno_t pgno;
+	u_int32_t pgtype;
+{
+	DB *dbp;
+	DBT key, data;
+	int ret;
+
+	dbp = vdp->salvage_pages;
+
+	memset(&key, 0, sizeof(DBT));
+	memset(&data, 0, sizeof(DBT));
+
+	key.data = &pgno;
+	key.size = sizeof(db_pgno_t);
+
+	data.data = &pgtype;
+	data.size = sizeof(u_int32_t);
+
+	/*
+	 * Put an entry for this page, with pgno as key and type as data,
+	 * unless it's already there, in which case it's presumably
+	 * already been marked done.
+	 */
+	ret = dbp->put(dbp, NULL, &key, &data, DB_NOOVERWRITE);
+	return (ret == DB_KEYEXIST ? 0 : ret);
+}
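
The salvage bookkeeping above (__db_salvage_markneeded, __db_salvage_markdone, __db_salvage_getnext) amounts to a small per-page state machine. The following is a minimal sketch of that idea, assuming a fixed-size array in place of the temporary btree; every sk_ name, the state values, and the return codes are hypothetical stand-ins rather than the library's own.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SK_NPAGES	16u
#define SK_ABSENT	0u	/* no entry for this page */
#define SK_IGNORE	1u	/* stand-in for SALVAGE_IGNORE: already printed */
#define SK_LEAF		2u	/* stand-in for a "needs printing" page type */

struct sk_salvage {
	uint32_t state[SK_NPAGES];	/* indexed by page number */
};

static void
sk_init(struct sk_salvage *s)
{
	memset(s, 0, sizeof(*s));
}

/* Note that a page must be printed later, unless it already has an entry. */
static int
sk_markneeded(struct sk_salvage *s, uint32_t pgno, uint32_t type)
{
	assert(pgno < SK_NPAGES && type > SK_IGNORE);
	if (s->state[pgno] == SK_ABSENT)
		s->state[pgno] = type;
	return (0);
}

/*
 * Mark a page as printed.  A second attempt fails with -1, which here
 * plays the role of DB_VERIFY_BAD: the page was probably linked twice.
 */
static int
sk_markdone(struct sk_salvage *s, uint32_t pgno)
{
	assert(pgno < SK_NPAGES);
	if (s->state[pgno] == SK_IGNORE)
		return (-1);
	s->state[pgno] = SK_IGNORE;
	return (0);
}

/*
 * Return the lowest-numbered page that still needs printing, deleting
 * every entry visited on the way, including the one returned.
 * Returns 1 (standing in for DB_NOTFOUND) when nothing is left.
 */
static int
sk_getnext(struct sk_salvage *s, uint32_t *pgnop, uint32_t *typep)
{
	uint32_t p, t;

	for (p = 0; p < SK_NPAGES; p++) {
		if ((t = s->state[p]) == SK_ABSENT)
			continue;
		s->state[p] = SK_ABSENT;
		if (t != SK_IGNORE) {
			*pgnop = p;
			*typep = t;
			return (0);
		}
	}
	return (1);
}
```

In this sketch, marking page 3 needed and then done leaves an ignore entry, so a later sk_markneeded on page 3 is a no-op and sk_getnext skips (and deletes) that entry before returning the next page that still needs printing.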