summaryrefslogtreecommitdiff
path: root/bdb/db
diff options
context:
space:
mode:
authorunknown <tim@threads.polyesthetic.msg>2001-03-04 19:42:05 -0500
committerunknown <tim@threads.polyesthetic.msg>2001-03-04 19:42:05 -0500
commitec6ae091617bdfdca9e65e8d3e65b950d234f676 (patch)
tree9dd732e08dba156ee3d7635caedc0dc3107ecac6 /bdb/db
parent87d70fb598105b64b538ff6b81eef9da626255b1 (diff)
downloadmariadb-git-ec6ae091617bdfdca9e65e8d3e65b950d234f676.tar.gz
Import changeset
Diffstat (limited to 'bdb/db')
-rw-r--r--bdb/db/Design.fileop452
-rw-r--r--bdb/db/crdel.src103
-rw-r--r--bdb/db/crdel_auto.c900
-rw-r--r--bdb/db/crdel_rec.c646
-rw-r--r--bdb/db/db.c2325
-rw-r--r--bdb/db/db.src178
-rw-r--r--bdb/db/db_am.c511
-rw-r--r--bdb/db/db_auto.c1270
-rw-r--r--bdb/db/db_cam.c974
-rw-r--r--bdb/db/db_conv.c348
-rw-r--r--bdb/db/db_dispatch.c983
-rw-r--r--bdb/db/db_dup.c275
-rw-r--r--bdb/db/db_iface.c687
-rw-r--r--bdb/db/db_join.c730
-rw-r--r--bdb/db/db_meta.c309
-rw-r--r--bdb/db/db_method.c629
-rw-r--r--bdb/db/db_overflow.c681
-rw-r--r--bdb/db/db_pr.c1284
-rw-r--r--bdb/db/db_rec.c529
-rw-r--r--bdb/db/db_reclaim.c134
-rw-r--r--bdb/db/db_ret.c160
-rw-r--r--bdb/db/db_upg.c338
-rw-r--r--bdb/db/db_upg_opd.c353
-rw-r--r--bdb/db/db_vrfy.c2340
-rw-r--r--bdb/db/db_vrfyutil.c830
25 files changed, 17969 insertions, 0 deletions
diff --git a/bdb/db/Design.fileop b/bdb/db/Design.fileop
new file mode 100644
index 00000000000..187f1ffaf22
--- /dev/null
+++ b/bdb/db/Design.fileop
@@ -0,0 +1,452 @@
+# $Id: Design.fileop,v 11.4 2000/02/19 20:57:54 bostic Exp $
+
+The design of file operation recovery.
+
+Keith has asked me to write up notes on our current status of database
+create and delete and recovery, why it's so hard, and how we've violated
+all the cornerstone assumptions on which our recovery framework is based.
+
+I am including two documents at the end of this one. The first is the
+initial design of the recoverability of file create and delete (there is
+no talk of subdatabases there, because we didn't think we'd have to do
+anything special there). I will annotate this document on where things
+changed.
+
+The second is the design of recd007 which is supposed to test our ability
+to recover these operations regardless of where one crashes. This test
+is fundamentally different from our other recovery tests in the following
+manner. Normally, the application controls transaction boundaries.
+Therefore, we can perform an operation and then decide whether to commit
+or abort it. In the normal recovery tests, we force the database into
+each of the four possible states from a recovery perspective:
+
+ database is pre-op, undo (do nothing)
+ database is pre-op, redo
+ database is post-op, undo
+ database is post-op, redo (do nothing)
+
+By copying databases at various points and initiating txn_commit and abort
+appropriately, we can make all these things happen. Notice that the one
+case we don't handle is where page A is in one state (e.g., pre-op) and
+page B is in another state (e.g., post-op). I will argue that these don't
+matter because each page is recovered independently. If anyone can poke
+holes in this, I'm interested.
+
+The problem with create/delete recovery testing is that the transaction
+is begun and ended all inside the library. Therefore, there is never any
+point (outside the library) where we can copy files and or initiate
+abort/commit. In order to still put the recovery code through its paces,
+Sue designed an infrastructure that lets you tell the library where to
+make copies of things and where to suddenly inject errors so that the
+transaction gets aborted. This level of detail allows us to push the
+create/delete recovery code through just about every recovery path
+possible (although I'm sure Mike will tell me I'm wrong when he starts to
+run code coverage tools).
+
+OK, so that's all preamble and a brief discussion of the documents I'm
+enclosing.
+
+Why was this so hard and painful and why is the code so Q@#$!% complicated?
+The following is a discussion/explanation, but to the best of my knowledge,
+the structure we have in place now works. The key question we need to be
+asking is, "Does this need to have to be so complex or should we redesign
+portions to simplify it?" At this point, there is no obvious way to simplify
+it in my book, but I may be having difficulty seeing this because my mind is
+too polluted at this point.
+
+Our overall strategy for recovery is that we do write-ahead logging,
+that is we log an operation and make sure it is on disk before any
+data corresponding to the data that log record describes is on disk.
+Typically we use log sequence numbers (LSNs) to mark the data so that
+during recovery, we can look at the data and determine if it is in a
+state before a particular log record or after a particular log record.
+
+In the good old days, opens were not transaction protected, so we could
+do regular old opens during recovery and if the file existed, we opened
+it and if it didn't (or appeared corrupt), we didn't and treated it like
+a missing file. As will be discussed below in detail, our states are much
+more complicated and recovery can't make such simplistic assumptions.
+
+Also, since we are now dealing with file system operations, we have less
+control about when they actually happen and what the state of the system
+can be. That is, we have to write create log records synchronously, because
+the create/open system call may force a newly created (0-length) file to
+disk. This file has to now be identified as being in the "being-created"
+state.
+
+A. We used to make a number of assumptions during recovery:
+
+1. We could call db_open at any time and one of three things would happen:
+ a) the file would be opened cleanly
+ b) the file would not exist
+ c) we would encounter an error while opening the file
+
+Case a posed no difficulty.
+In Case b, we simply spit out a warning that a file was missing and then
+ ignored all subsequent operations to that file.
+In Case c, we reported a fatal error.
+
+2. We can always generate a warning if a file is missing.
+
+3. We never encounter NULL file names in the log.
+
+B. We also made some assumptions in the main-line library:
+
+1. If you try to open a file and it exists but is 0-length, then
+someone else is trying to open it.
+
+2. You can write pages anywhere in a file and any non-existent pages
+are 0-filled. [This breaks on Windows.]
+
+3. If you have proper permissions then you can always evict pages from
+the buffer pool.
+
+4. During open, we can close the master database handle as soon as
+we're done with it since all the rest of the activity will take place
+on the subdatabase handle.
+
+In our brave new world, most of these assumptions are no longer valid.
+Let's address them one at a time.
+
+A.1 We could call db_open at any time and one of three things would happen:
+ a) the file would be opened cleanly
+ b) the file would not exist
+ c) we would encounter an error while opening the file
+There are now additional states. Since we are trying to make file
+operations recoverable, you can now die in the middle of such an
+operation and we have to be able to pick up the pieces. What this
+now means is that:
+
+ * a 0-length file can be an indication of a create in-progress
+ * you can have a meta-data page but no root page (of a btree)
+ * if a file doesn't exist, it could mean that it was just about
+ to be created and needs to be rolled forward.
+ * if you encounter an error in a file (e.g., the meta-data page
+ is all 0's) you could still be in mid-open.
+
+I have now made this all work, but it required significant changes to the
+db_open code and error handling and this is the sort of change that makes
+everyone nervous.
+
+A.2. We can always generate a warning if a file is missing.
+
+Now that we have a delete file method in the API, we need to make sure
+that we do not generate warning messages for files that don't exist if
+we see that they were explicitly deleted.
+
+This means that we need to save state during recovery, determine which
+files were missing and were not being recreated and were not deleted and
+only complain about those.
+
+A.3. We never encounter NULL file names in the log.
+
+Now that we allow tranaction protection on memory-resident files, we write
+log messages for files with NULL file names. This means that our assumption
+of always being able to call "db_open" on any log_register OPEN message found
+in the log is no longer valid.
+
+B.1. If you try to open a file and it exists but is 0-length, then
+someone else is trying to open it.
+
+As discussed for A.1, this is no longer true. It may be instead that you
+are in the process of recovering a create.
+
+B.2. You can write pages anywhere in a file and any non-existent pages
+are 0-filled.
+
+It turns out that this is not true on Windows. This means that places
+we do group allocation (hash) must explicitly allocate each page, because
+we can't count on recognizing the uninitialized pages later.
+
+B.3. If you have proper permissions then you can always evict pages from
+the buffer pool.
+
+In the brave new world though, files can be deleted and they may
+have pages in the mpool. If you later try to evict these, you
+discover that the file doesn't exist. We'd get here when we had
+to dirty pages during a remove operation.
+
+B.4. You can close files any time you want.
+
+However, if the file takes part in the open/remove transaction,
+then we had better not close it until after the transaction
+commits/aborts, because we need to be able to get our hands on the
+dbp and the open happened in a different transaction.
+
+=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
+Design for recovering file create and delete in the presence of subdatabases.
+
+Assumptions:
+ Remove the O_TRUNCATE flag.
+ Single-thread all open/create/delete operations.
+ (Well, almost all; we'll optimize opens without DB_CREATE set.)
+ The reasoning for this is that with two simultaneous
+ open/creaters, during recovery, we cannot identify which
+ transaction successfully created files and therefore cannot
+ recovery correctly.
+ File system creates/deletes are synchronous
+ Once the file is open, subdatabase creates look like regular
+ get/put operations and a metadata page creation.
+
+There are 4 cases to deal with:
+ 1. Open/create file
+ 2. Open/create subdatabase
+ 3. Delete
+ 4. Recovery records
+
+ __db_fileopen_recover
+ __db_metapage_recover
+ __db_delete_recover
+ existing c_put and c_get routines for subdatabase creation
+
+ Note that the open/create of the file and the open/create of the
+ subdatabase need to be in the same transaction.
+
+1. Open/create (full file and subdb version)
+
+If create
+ LOCK_FILEOP
+ txn_begin
+ log create message (open message below)
+ do file system open/create
+ if we did not create
+ abort transaction (before going to open_only)
+ if (!subdb)
+ set dbp->open_txn = NULL
+ else
+ txn_begin a new transaction for the subdb open
+
+ construct meta-data page
+ log meta-data page (see metapage)
+ write the meta-data page
+ * It may be the case that btrees need to log both meta-data pages
+ and root pages. If that is the case, I believe that we can use
+ this same record and recovery routines for both
+
+ txn_commit
+ UNLOCK_FILEOP
+
+2. Delete
+ LOCK_FILEOP
+ txn_begin
+ log delete message (delete message below)
+ mv file __db.file.lsn
+ txn_commit
+ unlink __db.file.lsn
+ UNLOCK_FILEOP
+
+3. Recovery Routines
+
+__db_fileopen_recover
+ if (argp->name.size == 0
+ done;
+
+ if (redo) /* Commit */
+ __os_open(argp->name, DB_OSO_CREATE, argp->mode, &fh)
+ __os_closehandle(fh)
+ if (undo) /* Abort */
+ if (argp->name exists)
+ unlink(argp->name);
+
+__db_metapage_recover
+ if (redo)
+ __os_open(argp->name, 0, 0, &fh)
+ __os_lseek(meta data page)
+ __os_write(meta data page)
+ __os_closehandle(fh);
+ if (undo)
+ done = 0;
+ if (argp->name exists)
+ if (length of argp->name != 0)
+ __os_open(argp->name, 0, 0, &fh)
+ __os_lseek(meta data page)
+ __os_read(meta data page)
+ if (read succeeds && page lsn != current_lsn)
+ done = 1
+ __os_closehandle(fh);
+ if (!done)
+ unlink(argp->name)
+
+__db_delete_recover
+ if (redo)
+ Check if the backup file still exists and if so, delete it.
+
+ if (undo)
+ if (__db_appname(__db.file.lsn exists))
+ mv __db_appname(__db.file.lsn) __db_appname(file)
+
+__db_metasub_recover
+ /* This is like a normal recovery routine */
+ Get the metadata page
+ if (cmp_n && redo)
+ copy the log page onto the page
+ update the lsn
+ make sure page gets put dirty
+ else if (cmp_p && undo)
+ update the lsn to the lsn in the log record
+ make sure page gets put dirty
+
+ if the page was modified, put it back dirty
+
+In db.src
+
+# name: filename (before call to __db_appname)
+# mode: file system mode
+BEGIN open
+DBT name DBT s
+ARG mode u_int32_t o
+END
+
+# opcode: indicate if it is a create/delete and if it is a subdatabase
+# pgsize: page size on which we're going to write the meta-data page
+# pgno: page number on which to write this meta-data page
+# page: the actual meta-data page
+# lsn: LSN of the meta-data page -- 0 for new databases, may be non-0
+# for subdatabases.
+
+BEGIN metapage
+ARG opcode u_int32_t x
+DBT name DBT s
+ARG pgno db_pgno_t d
+DBT page DBT s
+POINTER lsn DB_LSN * lu
+END
+
+# We do not need a subdatabase name here because removing a subdatabase
+# name is simply a regular bt_delete operation from the master database.
+# It will get logged normally.
+# name: filename
+BEGIN delete
+DBT name DBT s
+END
+
+# We also need to reclaim pages, but we can use the existing
+# bt_pg_alloc routines.
+
+=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
+Testing recoverability of create/delete.
+
+These tests are unlike other tests in that they are going to
+require hooks in the library. The reason is that the create
+and delete calls are internally wrapped in a transaction, so
+that if the call returns, the transaction has already either
+commited or aborted. Using only that interface limits what
+kind of testing we can do. To match our other recovery testing
+efforts, we need to add hooks to trigger aborts at particular
+times in the create/delete path.
+
+The general recovery testing strategy is that we wish to
+execute every path through every recovery routine. That
+means that we try to:
+ catch each operation in its pre-operation state
+ call the recovery function with redo
+ call the recovery function with undo
+ catch each operation in its post-operation state
+ call the recovery function with redo
+ call the recovery function with undo
+
+In addition, there are a few critical points in the create and
+delete path that we want to make sure we capture.
+
+1. Test Structure
+
+The test structure should be similar to the existing recovery
+tests. We will want to have a structure in place where we
+can execute different commands:
+ create a file/database
+ create a file that will contain subdatabases.
+ create a subdatabase
+ remove a subdatabase (that contains valid data)
+ remove a subdatabase (that does not contain any data)
+ remove a file that used to contain subdatabases
+ remove a file that contains a database
+
+The tricky part is capturing the state of the world at the
+various points in the create/delete process.
+
+The critical points in the create process are:
+
+ 1. After we've logged the create, but before we've done anything.
+ in db/db.c
+ after the open_retry
+ after the __crdel_fileopen_log call (and before we've
+ called __os_open).
+
+ 2. Immediately after the __os_open
+
+ 3. Immediately after each __db_log_page call
+ in bt_open.c
+ log meta-data page
+ log root page
+ in hash.c
+ log meta-data page
+
+ 4. With respect to the log records above, shortly after each
+ log write is an memp_fput. We need to do a sync after
+ each memp_fput and trigger a point after that sync.
+
+The critical points in the remove process are:
+
+ 1. Right after the crdel_delete_log in db/db.c
+
+ 2. Right after the __os_rename call (below the crdel_delete_log)
+
+ 3. After the __db_remove_callback call.
+
+I believe that there are the places where we'll need some sort of hook.
+
+2. Adding hooks to the library.
+
+The hooks need two components. One component is to capture the state of
+the database at the hook point and the other is to trigger a txn_abort at
+the hook point. The second part is fairly trivial.
+
+The first part requires more thought. Let me explain what we do in a
+"normal" recovery test. In a normal recovery test, we save an intial
+copy of the database (this copy is called init). Then we execute one
+or more operations. Then, right before the commit/abort, we sync the
+file, and save another copy (the afterop copy). Finally, we call txn_commit
+or txn_abort, sync the file again, and save the database one last time (the
+final copy).
+
+Then we run recovery. The first time, this should be a no-op, because
+we've either committed the transaction and are checking to redo it or
+we aborted the transaction, undid it on the abort and are checking to
+undo it again.
+
+We then run recovery again on whatever database will force us through
+the path that requires work. In the commit case, this means we start
+with the init copy of the database and run recovery. This pushes us
+through all the redo paths. In the abort case, we start with the afterop
+copy which pushes us through all the undo cases.
+
+In some sense, we're asking the create/delete test to be more exhaustive
+by defining all the trigger points, but I think that's the correct thing
+to do, since the create/delete is not initiated by a user transaction.
+
+So, what do we have to do at the hook points?
+ 1. sync the file to disk.
+ 2. save the file itself
+ 3. save any files named __db_backup_name(name, &backup_name, lsn)
+ Since we may not know the right lsns, I think we should save
+ every file of the form __db.name.0xNNNNNNNN.0xNNNNNNNN into
+ some temporary files from which we can restore it to run
+ recovery.
+
+3. Putting it all together
+
+So, the three pieces are writing the test structure, putting in the hooks
+and then writing the recovery portions so that we restore the right thing
+that the hooks saved in order to initiate recovery.
+
+Some of the technical issues that need to be solved are:
+ How does the hook code become active (i.e., we don't
+ want it in there normally, but it's got to be
+ there when you configure for testing)?
+ How do you (the test) tell the library that you want a
+ particular hook to abort?
+ How do you (the test) tell the library that you want the
+ hook code doing its copies (do we really want
+ *every* test doing these copies during testing?
+ Maybe it's not a big deal, but maybe it is; we
+ should at least think about it).
diff --git a/bdb/db/crdel.src b/bdb/db/crdel.src
new file mode 100644
index 00000000000..17c061d6887
--- /dev/null
+++ b/bdb/db/crdel.src
@@ -0,0 +1,103 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ * Sleepycat Software. All rights reserved.
+ *
+ * $Id: crdel.src,v 11.12 2000/12/12 17:41:48 bostic Exp $
+ */
+
+PREFIX crdel
+
+INCLUDE #include "db_config.h"
+INCLUDE
+INCLUDE #ifndef NO_SYSTEM_INCLUDES
+INCLUDE #include <sys/types.h>
+INCLUDE
+INCLUDE #include <ctype.h>
+INCLUDE #include <errno.h>
+INCLUDE #include <string.h>
+INCLUDE #endif
+INCLUDE
+INCLUDE #include "db_int.h"
+INCLUDE #include "db_page.h"
+INCLUDE #include "db_dispatch.h"
+INCLUDE #include "db_am.h"
+INCLUDE #include "txn.h"
+INCLUDE
+
+/*
+ * Fileopen -- log a potential file create operation
+ *
+ * name: filename
+ * subname: sub database name
+ * mode: file system mode
+ */
+BEGIN fileopen 141
+DBT name DBT s
+ARG mode u_int32_t o
+END
+
+/*
+ * Metasub: log the creation of a subdatabase meta data page.
+ *
+ * fileid: identifies the file being acted upon.
+ * pgno: page number on which to write this meta-data page
+ * page: the actual meta-data page
+ * lsn: lsn of the page.
+ */
+BEGIN metasub 142
+ARG fileid int32_t ld
+ARG pgno db_pgno_t d
+DBT page DBT s
+POINTER lsn DB_LSN * lu
+END
+
+/*
+ * Metapage: log the creation of a meta data page for a new file.
+ *
+ * fileid: identifies the file being acted upon.
+ * name: file containing the page.
+ * pgno: page number on which to write this meta-data page
+ * page: the actual meta-data page
+ */
+BEGIN metapage 143
+ARG fileid int32_t ld
+DBT name DBT s
+ARG pgno db_pgno_t d
+DBT page DBT s
+END
+
+/*
+ * Delete: remove a file.
+ * Note that we don't need a special log record for subdatabase
+ * removes, because we use normal btree operations to remove them.
+ *
+ * name: name of the file being removed (relative to DBHOME).
+ */
+DEPRECATED old_delete 144
+DBT name DBT s
+END
+
+/*
+ * Rename: rename a file
+ * We do not need this for subdatabases
+ *
+ * name: name of the file being removed (relative to DBHOME).
+ */
+BEGIN rename 145
+ARG fileid int32_t ld
+DBT name DBT s
+DBT newname DBT s
+END
+/*
+ * Delete: remove a file.
+ * Note that we don't need a special log record for subdatabase
+ * removes, because we use normal btree operations to remove them.
+ *
+ * name: name of the file being removed (relative to DBHOME).
+ */
+BEGIN delete 146
+ARG fileid int32_t ld
+DBT name DBT s
+END
diff --git a/bdb/db/crdel_auto.c b/bdb/db/crdel_auto.c
new file mode 100644
index 00000000000..f2204410ee8
--- /dev/null
+++ b/bdb/db/crdel_auto.c
@@ -0,0 +1,900 @@
+/* Do not edit: automatically built by gen_rec.awk. */
+#include "db_config.h"
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <ctype.h>
+#include <errno.h>
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_dispatch.h"
+#include "db_am.h"
+#include "txn.h"
+
+int
+__crdel_fileopen_log(dbenv, txnid, ret_lsnp, flags,
+ name, mode)
+ DB_ENV *dbenv;
+ DB_TXN *txnid;
+ DB_LSN *ret_lsnp;
+ u_int32_t flags;
+ const DBT *name;
+ u_int32_t mode;
+{
+ DBT logrec;
+ DB_LSN *lsnp, null_lsn;
+ u_int32_t zero;
+ u_int32_t rectype, txn_num;
+ int ret;
+ u_int8_t *bp;
+
+ rectype = DB_crdel_fileopen;
+ if (txnid != NULL &&
+ TAILQ_FIRST(&txnid->kids) != NULL &&
+ (ret = __txn_activekids(dbenv, rectype, txnid)) != 0)
+ return (ret);
+ txn_num = txnid == NULL ? 0 : txnid->txnid;
+ if (txnid == NULL) {
+ ZERO_LSN(null_lsn);
+ lsnp = &null_lsn;
+ } else
+ lsnp = &txnid->last_lsn;
+ logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN)
+ + sizeof(u_int32_t) + (name == NULL ? 0 : name->size)
+ + sizeof(mode);
+ if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0)
+ return (ret);
+
+ bp = logrec.data;
+ memcpy(bp, &rectype, sizeof(rectype));
+ bp += sizeof(rectype);
+ memcpy(bp, &txn_num, sizeof(txn_num));
+ bp += sizeof(txn_num);
+ memcpy(bp, lsnp, sizeof(DB_LSN));
+ bp += sizeof(DB_LSN);
+ if (name == NULL) {
+ zero = 0;
+ memcpy(bp, &zero, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ } else {
+ memcpy(bp, &name->size, sizeof(name->size));
+ bp += sizeof(name->size);
+ memcpy(bp, name->data, name->size);
+ bp += name->size;
+ }
+ memcpy(bp, &mode, sizeof(mode));
+ bp += sizeof(mode);
+ DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size);
+ ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags);
+ if (txnid != NULL)
+ txnid->last_lsn = *ret_lsnp;
+ __os_free(logrec.data, logrec.size);
+ return (ret);
+}
+
+int
+__crdel_fileopen_print(dbenv, dbtp, lsnp, notused2, notused3)
+ DB_ENV *dbenv;
+ DBT *dbtp;
+ DB_LSN *lsnp;
+ db_recops notused2;
+ void *notused3;
+{
+ __crdel_fileopen_args *argp;
+ u_int32_t i;
+ u_int ch;
+ int ret;
+
+ i = 0;
+ ch = 0;
+ notused2 = DB_TXN_ABORT;
+ notused3 = NULL;
+
+ if ((ret = __crdel_fileopen_read(dbenv, dbtp->data, &argp)) != 0)
+ return (ret);
+ printf("[%lu][%lu]crdel_fileopen: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+ (u_long)lsnp->file,
+ (u_long)lsnp->offset,
+ (u_long)argp->type,
+ (u_long)argp->txnid->txnid,
+ (u_long)argp->prev_lsn.file,
+ (u_long)argp->prev_lsn.offset);
+ printf("\tname: ");
+ for (i = 0; i < argp->name.size; i++) {
+ ch = ((u_int8_t *)argp->name.data)[i];
+ if (isprint(ch) || ch == 0xa)
+ putchar(ch);
+ else
+ printf("%#x ", ch);
+ }
+ printf("\n");
+ printf("\tmode: %o\n", argp->mode);
+ printf("\n");
+ __os_free(argp, 0);
+ return (0);
+}
+
+int
+__crdel_fileopen_read(dbenv, recbuf, argpp)
+ DB_ENV *dbenv;
+ void *recbuf;
+ __crdel_fileopen_args **argpp;
+{
+ __crdel_fileopen_args *argp;
+ u_int8_t *bp;
+ int ret;
+
+ ret = __os_malloc(dbenv, sizeof(__crdel_fileopen_args) +
+ sizeof(DB_TXN), NULL, &argp);
+ if (ret != 0)
+ return (ret);
+ argp->txnid = (DB_TXN *)&argp[1];
+ bp = recbuf;
+ memcpy(&argp->type, bp, sizeof(argp->type));
+ bp += sizeof(argp->type);
+ memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid));
+ bp += sizeof(argp->txnid->txnid);
+ memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+ bp += sizeof(DB_LSN);
+ memset(&argp->name, 0, sizeof(argp->name));
+ memcpy(&argp->name.size, bp, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ argp->name.data = bp;
+ bp += argp->name.size;
+ memcpy(&argp->mode, bp, sizeof(argp->mode));
+ bp += sizeof(argp->mode);
+ *argpp = argp;
+ return (0);
+}
+
+int
+__crdel_metasub_log(dbenv, txnid, ret_lsnp, flags,
+ fileid, pgno, page, lsn)
+ DB_ENV *dbenv;
+ DB_TXN *txnid;
+ DB_LSN *ret_lsnp;
+ u_int32_t flags;
+ int32_t fileid;
+ db_pgno_t pgno;
+ const DBT *page;
+ DB_LSN * lsn;
+{
+ DBT logrec;
+ DB_LSN *lsnp, null_lsn;
+ u_int32_t zero;
+ u_int32_t rectype, txn_num;
+ int ret;
+ u_int8_t *bp;
+
+ rectype = DB_crdel_metasub;
+ if (txnid != NULL &&
+ TAILQ_FIRST(&txnid->kids) != NULL &&
+ (ret = __txn_activekids(dbenv, rectype, txnid)) != 0)
+ return (ret);
+ txn_num = txnid == NULL ? 0 : txnid->txnid;
+ if (txnid == NULL) {
+ ZERO_LSN(null_lsn);
+ lsnp = &null_lsn;
+ } else
+ lsnp = &txnid->last_lsn;
+ logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN)
+ + sizeof(fileid)
+ + sizeof(pgno)
+ + sizeof(u_int32_t) + (page == NULL ? 0 : page->size)
+ + sizeof(*lsn);
+ if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0)
+ return (ret);
+
+ bp = logrec.data;
+ memcpy(bp, &rectype, sizeof(rectype));
+ bp += sizeof(rectype);
+ memcpy(bp, &txn_num, sizeof(txn_num));
+ bp += sizeof(txn_num);
+ memcpy(bp, lsnp, sizeof(DB_LSN));
+ bp += sizeof(DB_LSN);
+ memcpy(bp, &fileid, sizeof(fileid));
+ bp += sizeof(fileid);
+ memcpy(bp, &pgno, sizeof(pgno));
+ bp += sizeof(pgno);
+ if (page == NULL) {
+ zero = 0;
+ memcpy(bp, &zero, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ } else {
+ memcpy(bp, &page->size, sizeof(page->size));
+ bp += sizeof(page->size);
+ memcpy(bp, page->data, page->size);
+ bp += page->size;
+ }
+ if (lsn != NULL)
+ memcpy(bp, lsn, sizeof(*lsn));
+ else
+ memset(bp, 0, sizeof(*lsn));
+ bp += sizeof(*lsn);
+ DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size);
+ ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags);
+ if (txnid != NULL)
+ txnid->last_lsn = *ret_lsnp;
+ __os_free(logrec.data, logrec.size);
+ return (ret);
+}
+
+int
+__crdel_metasub_print(dbenv, dbtp, lsnp, notused2, notused3)
+ DB_ENV *dbenv;
+ DBT *dbtp;
+ DB_LSN *lsnp;
+ db_recops notused2;
+ void *notused3;
+{
+ __crdel_metasub_args *argp;
+ u_int32_t i;
+ u_int ch;
+ int ret;
+
+ i = 0;
+ ch = 0;
+ notused2 = DB_TXN_ABORT;
+ notused3 = NULL;
+
+ if ((ret = __crdel_metasub_read(dbenv, dbtp->data, &argp)) != 0)
+ return (ret);
+ printf("[%lu][%lu]crdel_metasub: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+ (u_long)lsnp->file,
+ (u_long)lsnp->offset,
+ (u_long)argp->type,
+ (u_long)argp->txnid->txnid,
+ (u_long)argp->prev_lsn.file,
+ (u_long)argp->prev_lsn.offset);
+ printf("\tfileid: %ld\n", (long)argp->fileid);
+ printf("\tpgno: %d\n", argp->pgno);
+ printf("\tpage: ");
+ for (i = 0; i < argp->page.size; i++) {
+ ch = ((u_int8_t *)argp->page.data)[i];
+ if (isprint(ch) || ch == 0xa)
+ putchar(ch);
+ else
+ printf("%#x ", ch);
+ }
+ printf("\n");
+ printf("\tlsn: [%lu][%lu]\n",
+ (u_long)argp->lsn.file, (u_long)argp->lsn.offset);
+ printf("\n");
+ __os_free(argp, 0);
+ return (0);
+}
+
+int
+__crdel_metasub_read(dbenv, recbuf, argpp)
+ DB_ENV *dbenv;
+ void *recbuf;
+ __crdel_metasub_args **argpp;
+{
+ __crdel_metasub_args *argp;
+ u_int8_t *bp;
+ int ret;
+
+ ret = __os_malloc(dbenv, sizeof(__crdel_metasub_args) +
+ sizeof(DB_TXN), NULL, &argp);
+ if (ret != 0)
+ return (ret);
+ argp->txnid = (DB_TXN *)&argp[1];
+ bp = recbuf;
+ memcpy(&argp->type, bp, sizeof(argp->type));
+ bp += sizeof(argp->type);
+ memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid));
+ bp += sizeof(argp->txnid->txnid);
+ memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+ bp += sizeof(DB_LSN);
+ memcpy(&argp->fileid, bp, sizeof(argp->fileid));
+ bp += sizeof(argp->fileid);
+ memcpy(&argp->pgno, bp, sizeof(argp->pgno));
+ bp += sizeof(argp->pgno);
+ memset(&argp->page, 0, sizeof(argp->page));
+ memcpy(&argp->page.size, bp, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ argp->page.data = bp;
+ bp += argp->page.size;
+ memcpy(&argp->lsn, bp, sizeof(argp->lsn));
+ bp += sizeof(argp->lsn);
+ *argpp = argp;
+ return (0);
+}
+
+int
+__crdel_metapage_log(dbenv, txnid, ret_lsnp, flags,
+ fileid, name, pgno, page)
+ DB_ENV *dbenv;
+ DB_TXN *txnid;
+ DB_LSN *ret_lsnp;
+ u_int32_t flags;
+ int32_t fileid;
+ const DBT *name;
+ db_pgno_t pgno;
+ const DBT *page;
+{
+ DBT logrec;
+ DB_LSN *lsnp, null_lsn;
+ u_int32_t zero;
+ u_int32_t rectype, txn_num;
+ int ret;
+ u_int8_t *bp;
+
+ rectype = DB_crdel_metapage;
+ if (txnid != NULL &&
+ TAILQ_FIRST(&txnid->kids) != NULL &&
+ (ret = __txn_activekids(dbenv, rectype, txnid)) != 0)
+ return (ret);
+ txn_num = txnid == NULL ? 0 : txnid->txnid;
+ if (txnid == NULL) {
+ ZERO_LSN(null_lsn);
+ lsnp = &null_lsn;
+ } else
+ lsnp = &txnid->last_lsn;
+ logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN)
+ + sizeof(fileid)
+ + sizeof(u_int32_t) + (name == NULL ? 0 : name->size)
+ + sizeof(pgno)
+ + sizeof(u_int32_t) + (page == NULL ? 0 : page->size);
+ if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0)
+ return (ret);
+
+ bp = logrec.data;
+ memcpy(bp, &rectype, sizeof(rectype));
+ bp += sizeof(rectype);
+ memcpy(bp, &txn_num, sizeof(txn_num));
+ bp += sizeof(txn_num);
+ memcpy(bp, lsnp, sizeof(DB_LSN));
+ bp += sizeof(DB_LSN);
+ memcpy(bp, &fileid, sizeof(fileid));
+ bp += sizeof(fileid);
+ if (name == NULL) {
+ zero = 0;
+ memcpy(bp, &zero, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ } else {
+ memcpy(bp, &name->size, sizeof(name->size));
+ bp += sizeof(name->size);
+ memcpy(bp, name->data, name->size);
+ bp += name->size;
+ }
+ memcpy(bp, &pgno, sizeof(pgno));
+ bp += sizeof(pgno);
+ if (page == NULL) {
+ zero = 0;
+ memcpy(bp, &zero, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ } else {
+ memcpy(bp, &page->size, sizeof(page->size));
+ bp += sizeof(page->size);
+ memcpy(bp, page->data, page->size);
+ bp += page->size;
+ }
+ DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size);
+ ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags);
+ if (txnid != NULL)
+ txnid->last_lsn = *ret_lsnp;
+ __os_free(logrec.data, logrec.size);
+ return (ret);
+}
+
+int
+__crdel_metapage_print(dbenv, dbtp, lsnp, notused2, notused3)
+ DB_ENV *dbenv;
+ DBT *dbtp;
+ DB_LSN *lsnp;
+ db_recops notused2;
+ void *notused3;
+{
+ __crdel_metapage_args *argp;
+ u_int32_t i;
+ u_int ch;
+ int ret;
+
+ i = 0;
+ ch = 0;
+ notused2 = DB_TXN_ABORT;
+ notused3 = NULL;
+
+ if ((ret = __crdel_metapage_read(dbenv, dbtp->data, &argp)) != 0)
+ return (ret);
+ printf("[%lu][%lu]crdel_metapage: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+ (u_long)lsnp->file,
+ (u_long)lsnp->offset,
+ (u_long)argp->type,
+ (u_long)argp->txnid->txnid,
+ (u_long)argp->prev_lsn.file,
+ (u_long)argp->prev_lsn.offset);
+ printf("\tfileid: %ld\n", (long)argp->fileid);
+ printf("\tname: ");
+ for (i = 0; i < argp->name.size; i++) {
+ ch = ((u_int8_t *)argp->name.data)[i];
+ if (isprint(ch) || ch == 0xa)
+ putchar(ch);
+ else
+ printf("%#x ", ch);
+ }
+ printf("\n");
+ printf("\tpgno: %d\n", argp->pgno);
+ printf("\tpage: ");
+ for (i = 0; i < argp->page.size; i++) {
+ ch = ((u_int8_t *)argp->page.data)[i];
+ if (isprint(ch) || ch == 0xa)
+ putchar(ch);
+ else
+ printf("%#x ", ch);
+ }
+ printf("\n");
+ printf("\n");
+ __os_free(argp, 0);
+ return (0);
+}
+
+int
+__crdel_metapage_read(dbenv, recbuf, argpp)
+ DB_ENV *dbenv;
+ void *recbuf;
+ __crdel_metapage_args **argpp;
+{
+ __crdel_metapage_args *argp;
+ u_int8_t *bp;
+ int ret;
+
+ ret = __os_malloc(dbenv, sizeof(__crdel_metapage_args) +
+ sizeof(DB_TXN), NULL, &argp);
+ if (ret != 0)
+ return (ret);
+ argp->txnid = (DB_TXN *)&argp[1];
+ bp = recbuf;
+ memcpy(&argp->type, bp, sizeof(argp->type));
+ bp += sizeof(argp->type);
+ memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid));
+ bp += sizeof(argp->txnid->txnid);
+ memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+ bp += sizeof(DB_LSN);
+ memcpy(&argp->fileid, bp, sizeof(argp->fileid));
+ bp += sizeof(argp->fileid);
+ memset(&argp->name, 0, sizeof(argp->name));
+ memcpy(&argp->name.size, bp, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ argp->name.data = bp;
+ bp += argp->name.size;
+ memcpy(&argp->pgno, bp, sizeof(argp->pgno));
+ bp += sizeof(argp->pgno);
+ memset(&argp->page, 0, sizeof(argp->page));
+ memcpy(&argp->page.size, bp, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ argp->page.data = bp;
+ bp += argp->page.size;
+ *argpp = argp;
+ return (0);
+}
+
+int
+__crdel_old_delete_print(dbenv, dbtp, lsnp, notused2, notused3)
+ DB_ENV *dbenv;
+ DBT *dbtp;
+ DB_LSN *lsnp;
+ db_recops notused2;
+ void *notused3;
+{
+ __crdel_old_delete_args *argp;
+ u_int32_t i;
+ u_int ch;
+ int ret;
+
+ i = 0;
+ ch = 0;
+ notused2 = DB_TXN_ABORT;
+ notused3 = NULL;
+
+ if ((ret = __crdel_old_delete_read(dbenv, dbtp->data, &argp)) != 0)
+ return (ret);
+ printf("[%lu][%lu]crdel_old_delete: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+ (u_long)lsnp->file,
+ (u_long)lsnp->offset,
+ (u_long)argp->type,
+ (u_long)argp->txnid->txnid,
+ (u_long)argp->prev_lsn.file,
+ (u_long)argp->prev_lsn.offset);
+ printf("\tname: ");
+ for (i = 0; i < argp->name.size; i++) {
+ ch = ((u_int8_t *)argp->name.data)[i];
+ if (isprint(ch) || ch == 0xa)
+ putchar(ch);
+ else
+ printf("%#x ", ch);
+ }
+ printf("\n");
+ printf("\n");
+ __os_free(argp, 0);
+ return (0);
+}
+
+int
+__crdel_old_delete_read(dbenv, recbuf, argpp)
+ DB_ENV *dbenv;
+ void *recbuf;
+ __crdel_old_delete_args **argpp;
+{
+ __crdel_old_delete_args *argp;
+ u_int8_t *bp;
+ int ret;
+
+ ret = __os_malloc(dbenv, sizeof(__crdel_old_delete_args) +
+ sizeof(DB_TXN), NULL, &argp);
+ if (ret != 0)
+ return (ret);
+ argp->txnid = (DB_TXN *)&argp[1];
+ bp = recbuf;
+ memcpy(&argp->type, bp, sizeof(argp->type));
+ bp += sizeof(argp->type);
+ memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid));
+ bp += sizeof(argp->txnid->txnid);
+ memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+ bp += sizeof(DB_LSN);
+ memset(&argp->name, 0, sizeof(argp->name));
+ memcpy(&argp->name.size, bp, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ argp->name.data = bp;
+ bp += argp->name.size;
+ *argpp = argp;
+ return (0);
+}
+
+int
+__crdel_rename_log(dbenv, txnid, ret_lsnp, flags,
+ fileid, name, newname)
+ DB_ENV *dbenv;
+ DB_TXN *txnid;
+ DB_LSN *ret_lsnp;
+ u_int32_t flags;
+ int32_t fileid;
+ const DBT *name;
+ const DBT *newname;
+{
+ DBT logrec;
+ DB_LSN *lsnp, null_lsn;
+ u_int32_t zero;
+ u_int32_t rectype, txn_num;
+ int ret;
+ u_int8_t *bp;
+
+ rectype = DB_crdel_rename;
+ if (txnid != NULL &&
+ TAILQ_FIRST(&txnid->kids) != NULL &&
+ (ret = __txn_activekids(dbenv, rectype, txnid)) != 0)
+ return (ret);
+ txn_num = txnid == NULL ? 0 : txnid->txnid;
+ if (txnid == NULL) {
+ ZERO_LSN(null_lsn);
+ lsnp = &null_lsn;
+ } else
+ lsnp = &txnid->last_lsn;
+ logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN)
+ + sizeof(fileid)
+ + sizeof(u_int32_t) + (name == NULL ? 0 : name->size)
+ + sizeof(u_int32_t) + (newname == NULL ? 0 : newname->size);
+ if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0)
+ return (ret);
+
+ bp = logrec.data;
+ memcpy(bp, &rectype, sizeof(rectype));
+ bp += sizeof(rectype);
+ memcpy(bp, &txn_num, sizeof(txn_num));
+ bp += sizeof(txn_num);
+ memcpy(bp, lsnp, sizeof(DB_LSN));
+ bp += sizeof(DB_LSN);
+ memcpy(bp, &fileid, sizeof(fileid));
+ bp += sizeof(fileid);
+ if (name == NULL) {
+ zero = 0;
+ memcpy(bp, &zero, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ } else {
+ memcpy(bp, &name->size, sizeof(name->size));
+ bp += sizeof(name->size);
+ memcpy(bp, name->data, name->size);
+ bp += name->size;
+ }
+ if (newname == NULL) {
+ zero = 0;
+ memcpy(bp, &zero, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ } else {
+ memcpy(bp, &newname->size, sizeof(newname->size));
+ bp += sizeof(newname->size);
+ memcpy(bp, newname->data, newname->size);
+ bp += newname->size;
+ }
+ DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size);
+ ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags);
+ if (txnid != NULL)
+ txnid->last_lsn = *ret_lsnp;
+ __os_free(logrec.data, logrec.size);
+ return (ret);
+}
+
+int
+__crdel_rename_print(dbenv, dbtp, lsnp, notused2, notused3)
+ DB_ENV *dbenv;
+ DBT *dbtp;
+ DB_LSN *lsnp;
+ db_recops notused2;
+ void *notused3;
+{
+ __crdel_rename_args *argp;
+ u_int32_t i;
+ u_int ch;
+ int ret;
+
+ i = 0;
+ ch = 0;
+ notused2 = DB_TXN_ABORT;
+ notused3 = NULL;
+
+ if ((ret = __crdel_rename_read(dbenv, dbtp->data, &argp)) != 0)
+ return (ret);
+ printf("[%lu][%lu]crdel_rename: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+ (u_long)lsnp->file,
+ (u_long)lsnp->offset,
+ (u_long)argp->type,
+ (u_long)argp->txnid->txnid,
+ (u_long)argp->prev_lsn.file,
+ (u_long)argp->prev_lsn.offset);
+ printf("\tfileid: %ld\n", (long)argp->fileid);
+ printf("\tname: ");
+ for (i = 0; i < argp->name.size; i++) {
+ ch = ((u_int8_t *)argp->name.data)[i];
+ if (isprint(ch) || ch == 0xa)
+ putchar(ch);
+ else
+ printf("%#x ", ch);
+ }
+ printf("\n");
+ printf("\tnewname: ");
+ for (i = 0; i < argp->newname.size; i++) {
+ ch = ((u_int8_t *)argp->newname.data)[i];
+ if (isprint(ch) || ch == 0xa)
+ putchar(ch);
+ else
+ printf("%#x ", ch);
+ }
+ printf("\n");
+ printf("\n");
+ __os_free(argp, 0);
+ return (0);
+}
+
+int
+__crdel_rename_read(dbenv, recbuf, argpp)
+ DB_ENV *dbenv;
+ void *recbuf;
+ __crdel_rename_args **argpp;
+{
+ __crdel_rename_args *argp;
+ u_int8_t *bp;
+ int ret;
+
+ ret = __os_malloc(dbenv, sizeof(__crdel_rename_args) +
+ sizeof(DB_TXN), NULL, &argp);
+ if (ret != 0)
+ return (ret);
+ argp->txnid = (DB_TXN *)&argp[1];
+ bp = recbuf;
+ memcpy(&argp->type, bp, sizeof(argp->type));
+ bp += sizeof(argp->type);
+ memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid));
+ bp += sizeof(argp->txnid->txnid);
+ memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+ bp += sizeof(DB_LSN);
+ memcpy(&argp->fileid, bp, sizeof(argp->fileid));
+ bp += sizeof(argp->fileid);
+ memset(&argp->name, 0, sizeof(argp->name));
+ memcpy(&argp->name.size, bp, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ argp->name.data = bp;
+ bp += argp->name.size;
+ memset(&argp->newname, 0, sizeof(argp->newname));
+ memcpy(&argp->newname.size, bp, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ argp->newname.data = bp;
+ bp += argp->newname.size;
+ *argpp = argp;
+ return (0);
+}
+
+int
+__crdel_delete_log(dbenv, txnid, ret_lsnp, flags,
+ fileid, name)
+ DB_ENV *dbenv;
+ DB_TXN *txnid;
+ DB_LSN *ret_lsnp;
+ u_int32_t flags;
+ int32_t fileid;
+ const DBT *name;
+{
+ DBT logrec;
+ DB_LSN *lsnp, null_lsn;
+ u_int32_t zero;
+ u_int32_t rectype, txn_num;
+ int ret;
+ u_int8_t *bp;
+
+ rectype = DB_crdel_delete;
+ if (txnid != NULL &&
+ TAILQ_FIRST(&txnid->kids) != NULL &&
+ (ret = __txn_activekids(dbenv, rectype, txnid)) != 0)
+ return (ret);
+ txn_num = txnid == NULL ? 0 : txnid->txnid;
+ if (txnid == NULL) {
+ ZERO_LSN(null_lsn);
+ lsnp = &null_lsn;
+ } else
+ lsnp = &txnid->last_lsn;
+ logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN)
+ + sizeof(fileid)
+ + sizeof(u_int32_t) + (name == NULL ? 0 : name->size);
+ if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0)
+ return (ret);
+
+ bp = logrec.data;
+ memcpy(bp, &rectype, sizeof(rectype));
+ bp += sizeof(rectype);
+ memcpy(bp, &txn_num, sizeof(txn_num));
+ bp += sizeof(txn_num);
+ memcpy(bp, lsnp, sizeof(DB_LSN));
+ bp += sizeof(DB_LSN);
+ memcpy(bp, &fileid, sizeof(fileid));
+ bp += sizeof(fileid);
+ if (name == NULL) {
+ zero = 0;
+ memcpy(bp, &zero, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ } else {
+ memcpy(bp, &name->size, sizeof(name->size));
+ bp += sizeof(name->size);
+ memcpy(bp, name->data, name->size);
+ bp += name->size;
+ }
+ DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size);
+ ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags);
+ if (txnid != NULL)
+ txnid->last_lsn = *ret_lsnp;
+ __os_free(logrec.data, logrec.size);
+ return (ret);
+}
+
+int
+__crdel_delete_print(dbenv, dbtp, lsnp, notused2, notused3)
+ DB_ENV *dbenv;
+ DBT *dbtp;
+ DB_LSN *lsnp;
+ db_recops notused2;
+ void *notused3;
+{
+ __crdel_delete_args *argp;
+ u_int32_t i;
+ u_int ch;
+ int ret;
+
+ i = 0;
+ ch = 0;
+ notused2 = DB_TXN_ABORT;
+ notused3 = NULL;
+
+ if ((ret = __crdel_delete_read(dbenv, dbtp->data, &argp)) != 0)
+ return (ret);
+ printf("[%lu][%lu]crdel_delete: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+ (u_long)lsnp->file,
+ (u_long)lsnp->offset,
+ (u_long)argp->type,
+ (u_long)argp->txnid->txnid,
+ (u_long)argp->prev_lsn.file,
+ (u_long)argp->prev_lsn.offset);
+ printf("\tfileid: %ld\n", (long)argp->fileid);
+ printf("\tname: ");
+ for (i = 0; i < argp->name.size; i++) {
+ ch = ((u_int8_t *)argp->name.data)[i];
+ if (isprint(ch) || ch == 0xa)
+ putchar(ch);
+ else
+ printf("%#x ", ch);
+ }
+ printf("\n");
+ printf("\n");
+ __os_free(argp, 0);
+ return (0);
+}
+
+int
+__crdel_delete_read(dbenv, recbuf, argpp)
+ DB_ENV *dbenv;
+ void *recbuf;
+ __crdel_delete_args **argpp;
+{
+ __crdel_delete_args *argp;
+ u_int8_t *bp;
+ int ret;
+
+ ret = __os_malloc(dbenv, sizeof(__crdel_delete_args) +
+ sizeof(DB_TXN), NULL, &argp);
+ if (ret != 0)
+ return (ret);
+ argp->txnid = (DB_TXN *)&argp[1];
+ bp = recbuf;
+ memcpy(&argp->type, bp, sizeof(argp->type));
+ bp += sizeof(argp->type);
+ memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid));
+ bp += sizeof(argp->txnid->txnid);
+ memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+ bp += sizeof(DB_LSN);
+ memcpy(&argp->fileid, bp, sizeof(argp->fileid));
+ bp += sizeof(argp->fileid);
+ memset(&argp->name, 0, sizeof(argp->name));
+ memcpy(&argp->name.size, bp, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ argp->name.data = bp;
+ bp += argp->name.size;
+ *argpp = argp;
+ return (0);
+}
+
+int
+__crdel_init_print(dbenv)
+ DB_ENV *dbenv;
+{
+ int ret;
+
+ if ((ret = __db_add_recovery(dbenv,
+ __crdel_fileopen_print, DB_crdel_fileopen)) != 0)
+ return (ret);
+ if ((ret = __db_add_recovery(dbenv,
+ __crdel_metasub_print, DB_crdel_metasub)) != 0)
+ return (ret);
+ if ((ret = __db_add_recovery(dbenv,
+ __crdel_metapage_print, DB_crdel_metapage)) != 0)
+ return (ret);
+ if ((ret = __db_add_recovery(dbenv,
+ __crdel_old_delete_print, DB_crdel_old_delete)) != 0)
+ return (ret);
+ if ((ret = __db_add_recovery(dbenv,
+ __crdel_rename_print, DB_crdel_rename)) != 0)
+ return (ret);
+ if ((ret = __db_add_recovery(dbenv,
+ __crdel_delete_print, DB_crdel_delete)) != 0)
+ return (ret);
+ return (0);
+}
+
+int
+__crdel_init_recover(dbenv)
+ DB_ENV *dbenv;
+{
+ int ret;
+
+ if ((ret = __db_add_recovery(dbenv,
+ __crdel_fileopen_recover, DB_crdel_fileopen)) != 0)
+ return (ret);
+ if ((ret = __db_add_recovery(dbenv,
+ __crdel_metasub_recover, DB_crdel_metasub)) != 0)
+ return (ret);
+ if ((ret = __db_add_recovery(dbenv,
+ __crdel_metapage_recover, DB_crdel_metapage)) != 0)
+ return (ret);
+ if ((ret = __db_add_recovery(dbenv,
+ __deprecated_recover, DB_crdel_old_delete)) != 0)
+ return (ret);
+ if ((ret = __db_add_recovery(dbenv,
+ __crdel_rename_recover, DB_crdel_rename)) != 0)
+ return (ret);
+ if ((ret = __db_add_recovery(dbenv,
+ __crdel_delete_recover, DB_crdel_delete)) != 0)
+ return (ret);
+ return (0);
+}
+
diff --git a/bdb/db/crdel_rec.c b/bdb/db/crdel_rec.c
new file mode 100644
index 00000000000..495b92a0ad7
--- /dev/null
+++ b/bdb/db/crdel_rec.c
@@ -0,0 +1,646 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ * Sleepycat Software. All rights reserved.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: crdel_rec.c,v 11.43 2000/12/13 08:06:34 krinsky Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "log.h"
+#include "hash.h"
+#include "mp.h"
+#include "db_dispatch.h"
+
+/*
+ * __crdel_fileopen_recover --
+ * Recovery function for fileopen.
+ *
+ * PUBLIC: int __crdel_fileopen_recover
+ * PUBLIC: __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *));
+ */
+int
+__crdel_fileopen_recover(dbenv, dbtp, lsnp, op, info)
+ DB_ENV *dbenv;
+ DBT *dbtp;
+ DB_LSN *lsnp;
+ db_recops op;
+ void *info;
+{
+ __crdel_fileopen_args *argp;
+ DBMETA ondisk;
+ DB_FH fh;
+ size_t nr;
+ int do_unlink, ret;
+ u_int32_t b, mb, io;
+ char *real_name;
+
+ COMPQUIET(info, NULL);
+
+ real_name = NULL;
+ REC_PRINT(__crdel_fileopen_print);
+
+ if ((ret = __crdel_fileopen_read(dbenv, dbtp->data, &argp)) != 0)
+ goto out;
+ /*
+ * If this is an in-memory database, then the name is going to
+ * be NULL, which looks like a 0-length name in recovery.
+ */
+ if (argp->name.size == 0)
+ goto done;
+
+ if ((ret = __db_appname(dbenv, DB_APP_DATA,
+ NULL, argp->name.data, 0, NULL, &real_name)) != 0)
+ goto out;
+ if (DB_REDO(op)) {
+ /*
+ * The create commited, so we need to make sure that the file
+ * exists. A simple open should suffice.
+ */
+ if ((ret = __os_open(dbenv, real_name,
+ DB_OSO_CREATE, argp->mode, &fh)) != 0)
+ goto out;
+ if ((ret = __os_closehandle(&fh)) != 0)
+ goto out;
+ } else if (DB_UNDO(op)) {
+ /*
+ * If the file is 0-length then it was in the process of being
+ * created, so we should unlink it. If it is non-0 length, then
+ * either someone else created it and we need to leave it
+ * untouched or we were in the process of creating it, allocated
+ * the first page on a system that requires you to actually
+ * write pages as you allocate them, but never got any data
+ * on it.
+ * If the file doesn't exist, we never got around to creating
+ * it, so that's fine.
+ */
+ if (__os_exists(real_name, NULL) != 0)
+ goto done;
+
+ if ((ret = __os_open(dbenv, real_name, 0, 0, &fh)) != 0)
+ goto out;
+ if ((ret = __os_ioinfo(dbenv,
+ real_name, &fh, &mb, &b, &io)) != 0)
+ goto out;
+ do_unlink = 0;
+ if (mb != 0 || b != 0) {
+ /*
+ * We need to read the first page
+ * to see if its got valid data on it.
+ */
+ if ((ret = __os_read(dbenv, &fh,
+ &ondisk, sizeof(ondisk), &nr)) != 0 ||
+ nr != sizeof(ondisk))
+ goto out;
+ if (ondisk.magic == 0)
+ do_unlink = 1;
+ }
+ if ((ret = __os_closehandle(&fh)) != 0)
+ goto out;
+ /* Check for 0-length and if it is, delete it. */
+ if (do_unlink || (mb == 0 && b == 0))
+ if ((ret = __os_unlink(dbenv, real_name)) != 0)
+ goto out;
+ }
+
+done: *lsnp = argp->prev_lsn;
+ ret = 0;
+
+out: if (argp != NULL)
+ __os_free(argp, 0);
+ if (real_name != NULL)
+ __os_freestr(real_name);
+ return (ret);
+}
+
+/*
+ * __crdel_metasub_recover --
+ * Recovery function for metasub.
+ *
+ * PUBLIC: int __crdel_metasub_recover
+ * PUBLIC: __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *));
+ */
+int
+__crdel_metasub_recover(dbenv, dbtp, lsnp, op, info)
+ DB_ENV *dbenv;
+ DBT *dbtp;
+ DB_LSN *lsnp;
+ db_recops op;
+ void *info;
+{
+ __crdel_metasub_args *argp;
+ DB *file_dbp;
+ DBC *dbc;
+ DB_MPOOLFILE *mpf;
+ PAGE *pagep;
+ u_int8_t *file_uid, ptype;
+ int cmp_p, modified, reopen, ret;
+
+ COMPQUIET(info, NULL);
+ REC_PRINT(__crdel_metasub_print);
+ REC_INTRO(__crdel_metasub_read, 0);
+
+ if ((ret = memp_fget(mpf, &argp->pgno, 0, &pagep)) != 0) {
+ if (DB_REDO(op)) {
+ if ((ret = memp_fget(mpf,
+ &argp->pgno, DB_MPOOL_CREATE, &pagep)) != 0)
+ goto out;
+ } else {
+ *lsnp = argp->prev_lsn;
+ ret = 0;
+ goto out;
+ }
+ }
+
+ modified = 0;
+ reopen = 0;
+ cmp_p = log_compare(&LSN(pagep), &argp->lsn);
+ CHECK_LSN(op, cmp_p, &LSN(pagep), &argp->lsn);
+
+ if (cmp_p == 0 && DB_REDO(op)) {
+ memcpy(pagep, argp->page.data, argp->page.size);
+ LSN(pagep) = *lsnp;
+ modified = 1;
+ /*
+ * If this is a meta-data page, then we must reopen;
+ * if it was a root page, then we do not.
+ */
+ ptype = ((DBMETA *)argp->page.data)->type;
+ if (ptype == P_HASHMETA || ptype == P_BTREEMETA ||
+ ptype == P_QAMMETA)
+ reopen = 1;
+ } else if (DB_UNDO(op)) {
+ /*
+ * We want to undo this page creation. The page creation
+ * happened in two parts. First, we called __bam_new which
+ * was logged separately. Then we wrote the meta-data onto
+ * the page. So long as we restore the LSN, then the recovery
+ * for __bam_new will do everything else.
+ * Don't bother checking the lsn on the page. If we
+ * are rolling back the next thing is that this page
+ * will get freed. Opening the subdb will have reinitialized
+ * the page, but not the lsn.
+ */
+ LSN(pagep) = argp->lsn;
+ modified = 1;
+ }
+ if ((ret = memp_fput(mpf, pagep, modified ? DB_MPOOL_DIRTY : 0)) != 0)
+ goto out;
+
+ /*
+ * If we are redoing a subdatabase create, we must close and reopen the
+ * file to be sure that we have the proper meta information in the
+ * in-memory structures
+ */
+ if (reopen) {
+ /* Close cursor if it's open. */
+ if (dbc != NULL) {
+ dbc->c_close(dbc);
+ dbc = NULL;
+ }
+
+ if ((ret = __os_malloc(dbenv,
+ DB_FILE_ID_LEN, NULL, &file_uid)) != 0)
+ goto out;
+ memcpy(file_uid, &file_dbp->fileid[0], DB_FILE_ID_LEN);
+ ret = __log_reopen_file(dbenv,
+ NULL, argp->fileid, file_uid, argp->pgno);
+ (void)__os_free(file_uid, DB_FILE_ID_LEN);
+ if (ret != 0)
+ goto out;
+ }
+
+done: *lsnp = argp->prev_lsn;
+ ret = 0;
+
+out: REC_CLOSE;
+}
+
+/*
+ * __crdel_metapage_recover --
+ * Recovery function for metapage.
+ *
+ * PUBLIC: int __crdel_metapage_recover
+ * PUBLIC: __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *));
+ */
+int
+__crdel_metapage_recover(dbenv, dbtp, lsnp, op, info)
+ DB_ENV *dbenv;
+ DBT *dbtp;
+ DB_LSN *lsnp;
+ db_recops op;
+ void *info;
+{
+ __crdel_metapage_args *argp;
+ DB *dbp;
+ DBMETA *meta, ondisk;
+ DB_FH fh;
+ size_t nr;
+ u_int32_t b, io, mb, pagesize;
+ int is_done, ret;
+ char *real_name;
+
+ COMPQUIET(info, NULL);
+
+ real_name = NULL;
+ memset(&fh, 0, sizeof(fh));
+ REC_PRINT(__crdel_metapage_print);
+
+ if ((ret = __crdel_metapage_read(dbenv, dbtp->data, &argp)) != 0)
+ goto out;
+
+ /*
+ * If this is an in-memory database, then the name is going to
+ * be NULL, which looks like a 0-length name in recovery.
+ */
+ if (argp->name.size == 0)
+ goto done;
+
+ meta = (DBMETA *)argp->page.data;
+ __ua_memcpy(&pagesize, &meta->pagesize, sizeof(pagesize));
+
+ if ((ret = __db_appname(dbenv, DB_APP_DATA,
+ NULL, argp->name.data, 0, NULL, &real_name)) != 0)
+ goto out;
+ if (DB_REDO(op)) {
+ if ((ret = __db_fileid_to_db(dbenv,
+ &dbp, argp->fileid, 0)) != 0) {
+ if (ret == DB_DELETED)
+ goto done;
+ else
+ goto out;
+ }
+
+ /*
+ * We simply read the first page and if the LSN is 0, we
+ * write the meta-data page.
+ */
+ if ((ret = __os_open(dbenv, real_name, 0, 0, &fh)) != 0)
+ goto out;
+ if ((ret = __os_seek(dbenv, &fh,
+ pagesize, argp->pgno, 0, 0, DB_OS_SEEK_SET)) != 0)
+ goto out;
+ /*
+ * If the read succeeds then the page exists, then we need
+ * to vrify that the page has actually been written, because
+ * on some systems (e.g., Windows) we preallocate pages because
+ * files aren't allowed to have holes in them. If the page
+ * looks good then we're done.
+ */
+ if ((ret = __os_read(dbenv, &fh, &ondisk,
+ sizeof(ondisk), &nr)) == 0 && nr == sizeof(ondisk)) {
+ if (ondisk.magic != 0)
+ goto done;
+ if ((ret = __os_seek(dbenv, &fh,
+ pagesize, argp->pgno, 0, 0, DB_OS_SEEK_SET)) != 0)
+ goto out;
+ }
+
+ /*
+ * Page didn't exist, update the LSN and write a new one.
+ * (seek pointer shouldn't have moved)
+ */
+ __ua_memcpy(&meta->lsn, lsnp, sizeof(DB_LSN));
+ if ((ret = __os_write(dbp->dbenv, &fh,
+ argp->page.data, argp->page.size, &nr)) != 0)
+ goto out;
+ if (nr != (size_t)argp->page.size) {
+ __db_err(dbenv, "Write failed during recovery");
+ ret = EIO;
+ goto out;
+ }
+
+ /*
+ * We must close and reopen the file to be sure
+ * that we have the proper meta information
+ * in the in memory structures
+ */
+
+ if ((ret = __log_reopen_file(dbenv,
+ argp->name.data, argp->fileid,
+ meta->uid, argp->pgno)) != 0)
+ goto out;
+
+ /* Handle will be closed on exit. */
+ } else if (DB_UNDO(op)) {
+ is_done = 0;
+
+ /* If file does not exist, there is nothing to undo. */
+ if (__os_exists(real_name, NULL) != 0)
+ goto done;
+
+ /*
+ * Before we can look at anything on disk, we have to check
+ * if there is a valid dbp for this, and if there is, we'd
+ * better flush it.
+ */
+ dbp = NULL;
+ if ((ret =
+ __db_fileid_to_db(dbenv, &dbp, argp->fileid, 0)) == 0)
+ (void)dbp->sync(dbp, 0);
+
+ /*
+ * We need to make sure that we do not remove a file that
+ * someone else created. If the file is 0-length, then we
+ * can assume that we created it and remove it. If it is
+ * not 0-length, then we need to check the LSN and make
+ * sure that it's the file we created.
+ */
+ if ((ret = __os_open(dbenv, real_name, 0, 0, &fh)) != 0)
+ goto out;
+ if ((ret = __os_ioinfo(dbenv,
+ real_name, &fh, &mb, &b, &io)) != 0)
+ goto out;
+ if (mb != 0 || b != 0) {
+ /* The file has something in it. */
+ if ((ret = __os_seek(dbenv, &fh,
+ pagesize, argp->pgno, 0, 0, DB_OS_SEEK_SET)) != 0)
+ goto out;
+ if ((ret = __os_read(dbenv, &fh,
+ &ondisk, sizeof(ondisk), &nr)) != 0)
+ goto out;
+ if (log_compare(&ondisk.lsn, lsnp) != 0)
+ is_done = 1;
+ }
+
+ /*
+ * Must close here, because unlink with the file open fails
+ * on some systems.
+ */
+ if ((ret = __os_closehandle(&fh)) != 0)
+ goto out;
+
+ if (!is_done) {
+ /*
+ * On some systems, you cannot unlink an open file so
+ * we close the fd in the dbp here and make sure we
+ * don't try to close it again. First, check for a
+ * saved_open_fhp, then close down the mpool.
+ */
+ if (dbp != NULL && dbp->saved_open_fhp != NULL &&
+ F_ISSET(dbp->saved_open_fhp, DB_FH_VALID) &&
+ (ret = __os_closehandle(dbp->saved_open_fhp)) != 0)
+ goto out;
+ if (dbp != NULL && dbp->mpf != NULL) {
+ (void)__memp_fremove(dbp->mpf);
+ if ((ret = memp_fclose(dbp->mpf)) != 0)
+ goto out;
+ F_SET(dbp, DB_AM_DISCARD);
+ dbp->mpf = NULL;
+ }
+ if ((ret = __os_unlink(dbenv, real_name)) != 0)
+ goto out;
+ }
+ }
+
+done: *lsnp = argp->prev_lsn;
+ ret = 0;
+
+out: if (argp != NULL)
+ __os_free(argp, 0);
+ if (real_name != NULL)
+ __os_freestr(real_name);
+ if (F_ISSET(&fh, DB_FH_VALID))
+ (void)__os_closehandle(&fh);
+ return (ret);
+}
+
+/*
+ * __crdel_delete_recover --
+ * Recovery function for delete.
+ *
+ * PUBLIC: int __crdel_delete_recover
+ * PUBLIC: __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *));
+ */
+int
+__crdel_delete_recover(dbenv, dbtp, lsnp, op, info)
+ DB_ENV *dbenv;
+ DBT *dbtp;
+ DB_LSN *lsnp;
+ db_recops op;
+ void *info;
+{
+ DB *dbp;
+ __crdel_delete_args *argp;
+ int ret;
+ char *backup, *real_back, *real_name;
+
+ REC_PRINT(__crdel_delete_print);
+
+ backup = real_back = real_name = NULL;
+ if ((ret = __crdel_delete_read(dbenv, dbtp->data, &argp)) != 0)
+ goto out;
+
+ if (DB_REDO(op)) {
+ /*
+ * On a recovery, as we recreate what was going on, we
+ * recreate the creation of the file. And so, even though
+ * it committed, we need to delete it. Try to delete it,
+ * but it is not an error if that delete fails.
+ */
+ if ((ret = __db_appname(dbenv, DB_APP_DATA,
+ NULL, argp->name.data, 0, NULL, &real_name)) != 0)
+ goto out;
+ if (__os_exists(real_name, NULL) == 0) {
+ /*
+ * If a file is deleted and then recreated, it's
+ * possible for the __os_exists call above to
+ * return success and for us to get here, but for
+ * the fileid we're looking for to be marked
+ * deleted. In that case, we needn't redo the
+ * unlink even though the file exists, and it's
+ * not an error.
+ */
+ ret = __db_fileid_to_db(dbenv, &dbp, argp->fileid, 0);
+ if (ret == 0) {
+ /*
+ * On Windows, the underlying file must be
+ * closed to perform a remove.
+ */
+ (void)__memp_fremove(dbp->mpf);
+ if ((ret = memp_fclose(dbp->mpf)) != 0)
+ goto out;
+ dbp->mpf = NULL;
+ if ((ret = __os_unlink(dbenv, real_name)) != 0)
+ goto out;
+ } else if (ret != DB_DELETED)
+ goto out;
+ }
+ /*
+ * The transaction committed, so the only thing that might
+ * be true is that the backup file is still around. Try
+ * to delete it, but it's not an error if that delete fails.
+ */
+ if ((ret = __db_backup_name(dbenv, argp->name.data,
+ &backup, lsnp)) != 0)
+ goto out;
+ if ((ret = __db_appname(dbenv,
+ DB_APP_DATA, NULL, backup, 0, NULL, &real_back)) != 0)
+ goto out;
+ if (__os_exists(real_back, NULL) == 0)
+ if ((ret = __os_unlink(dbenv, real_back)) != 0)
+ goto out;
+ if ((ret = __db_txnlist_delete(dbenv, info,
+ argp->name.data, TXNLIST_INVALID_ID, 1)) != 0)
+ goto out;
+ } else if (DB_UNDO(op)) {
+ /*
+ * Trying to undo. File may or may not have been deleted.
+ * Try to move the backup to the original. If the backup
+ * exists, then this is right. If it doesn't exist, then
+ * nothing will happen and that's OK.
+ */
+ if ((ret = __db_backup_name(dbenv, argp->name.data,
+ &backup, lsnp)) != 0)
+ goto out;
+ if ((ret = __db_appname(dbenv,
+ DB_APP_DATA, NULL, backup, 0, NULL, &real_back)) != 0)
+ goto out;
+ if ((ret = __db_appname(dbenv, DB_APP_DATA,
+ NULL, argp->name.data, 0, NULL, &real_name)) != 0)
+ goto out;
+ if (__os_exists(real_back, NULL) == 0)
+ if ((ret =
+ __os_rename(dbenv, real_back, real_name)) != 0)
+ goto out;
+ }
+
+ *lsnp = argp->prev_lsn;
+ ret = 0;
+
+out: if (argp != NULL)
+ __os_free(argp, 0);
+ if (backup != NULL)
+ __os_freestr(backup);
+ if (real_back != NULL)
+ __os_freestr(real_back);
+ if (real_name != NULL)
+ __os_freestr(real_name);
+ return (ret);
+}
+/*
+ * __crdel_rename_recover --
+ * Recovery function for rename.
+ *
+ * PUBLIC: int __crdel_rename_recover
+ * PUBLIC: __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *));
+ */
+int
+__crdel_rename_recover(dbenv, dbtp, lsnp, op, info)
+ DB_ENV *dbenv;
+ DBT *dbtp;
+ DB_LSN *lsnp;
+ db_recops op;
+ void *info;
+{
+ DB *dbp;
+ __crdel_rename_args *argp;
+ char *new_name, *real_name;
+ int ret, set;
+
+ COMPQUIET(info, NULL);
+
+ REC_PRINT(__crdel_rename_print);
+
+ new_name = real_name = NULL;
+
+ if ((ret = __crdel_rename_read(dbenv, dbtp->data, &argp)) != 0)
+ goto out;
+
+ if ((ret = __db_fileid_to_db(dbenv, &dbp, argp->fileid, 0)) != 0)
+ goto out;
+ if (DB_REDO(op)) {
+ /*
+ * We don't use the dbp parameter to __log_filelist_update
+ * in the rename case, so passing NULL for it is OK.
+ */
+ if ((ret = __log_filelist_update(dbenv, NULL,
+ argp->fileid, argp->newname.data, &set)) != 0)
+ goto out;
+ if (set != 0) {
+ if ((ret = __db_appname(dbenv, DB_APP_DATA,
+ NULL, argp->name.data, 0, NULL, &real_name)) != 0)
+ goto out;
+ if (__os_exists(real_name, NULL) == 0) {
+ if ((ret = __db_appname(dbenv,
+ DB_APP_DATA, NULL, argp->newname.data,
+ 0, NULL, &new_name)) != 0)
+ goto out;
+ /*
+ * On Windows, the underlying file
+ * must be closed to perform a remove.
+ * The db will be closed by a
+ * log_register record. Rename
+ * has exclusive access to the db.
+ */
+ (void)__memp_fremove(dbp->mpf);
+ if ((ret = memp_fclose(dbp->mpf)) != 0)
+ goto out;
+ dbp->mpf = NULL;
+ if ((ret = __os_rename(dbenv,
+ real_name, new_name)) != 0)
+ goto out;
+ }
+ }
+ } else {
+ /*
+ * We don't use the dbp parameter to __log_filelist_update
+ * in the rename case, so passing NULL for it is OK.
+ */
+ if ((ret = __log_filelist_update(dbenv, NULL,
+ argp->fileid, argp->name.data, &set)) != 0)
+ goto out;
+ if (set != 0) {
+ if ((ret = __db_appname(dbenv, DB_APP_DATA,
+ NULL, argp->newname.data, 0, NULL, &new_name)) != 0)
+ goto out;
+ if (__os_exists(new_name, NULL) == 0) {
+ if ((ret = __db_appname(dbenv,
+ DB_APP_DATA, NULL, argp->name.data,
+ 0, NULL, &real_name)) != 0)
+ goto out;
+ /*
+ * On Windows, the underlying file
+ * must be closed to perform a remove.
+ * The file may have already been closed
+ * if we are aborting the transaction.
+ */
+ if (dbp->mpf != NULL) {
+ (void)__memp_fremove(dbp->mpf);
+ if ((ret = memp_fclose(dbp->mpf)) != 0)
+ goto out;
+ dbp->mpf = NULL;
+ }
+ if ((ret = __os_rename(dbenv,
+ new_name, real_name)) != 0)
+ goto out;
+ }
+ }
+ }
+
+ *lsnp = argp->prev_lsn;
+ ret = 0;
+
+out: if (argp != NULL)
+ __os_free(argp, 0);
+
+ if (new_name != NULL)
+ __os_free(new_name, 0);
+
+ if (real_name != NULL)
+ __os_free(real_name, 0);
+
+ return (ret);
+}
diff --git a/bdb/db/db.c b/bdb/db/db.c
new file mode 100644
index 00000000000..6e74b4b21bd
--- /dev/null
+++ b/bdb/db/db.c
@@ -0,0 +1,2325 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ * Sleepycat Software. All rights reserved.
+ */
+/*
+ * Copyright (c) 1990, 1993, 1994, 1995, 1996
+ * Keith Bostic. All rights reserved.
+ */
+/*
+ * Copyright (c) 1990, 1993, 1994, 1995
+ * The Regents of the University of California. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of the University nor the names of its contributors
+ * may be used to endorse or promote products derived from this software
+ * without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db.c,v 11.117 2001/01/11 18:19:50 bostic Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <stddef.h>
+#include <stdlib.h>
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_shash.h"
+#include "db_swap.h"
+#include "btree.h"
+#include "db_am.h"
+#include "hash.h"
+#include "lock.h"
+#include "log.h"
+#include "mp.h"
+#include "qam.h"
+#include "common_ext.h"
+
+/* Actions that __db_master_update can take. */
+typedef enum { MU_REMOVE, MU_RENAME, MU_OPEN } mu_action;
+
+/* Flag values that __db_file_setup can return. */
+#define DB_FILE_SETUP_CREATE 0x01
+#define DB_FILE_SETUP_ZERO 0x02
+
+static int __db_file_setup __P((DB *,
+ const char *, u_int32_t, int, db_pgno_t, int *));
+static int __db_master_update __P((DB *,
+ const char *, u_int32_t,
+ db_pgno_t *, mu_action, const char *, u_int32_t));
+static int __db_refresh __P((DB *));
+static int __db_remove_callback __P((DB *, void *));
+static int __db_set_pgsize __P((DB *, DB_FH *, char *));
+static int __db_subdb_remove __P((DB *, const char *, const char *));
+static int __db_subdb_rename __P(( DB *,
+ const char *, const char *, const char *));
+#if CONFIG_TEST
+static void __db_makecopy __P((const char *, const char *));
+static int __db_testdocopy __P((DB *, const char *));
+static int __qam_testdocopy __P((DB *, const char *));
+#endif
+
+/*
+ * __db_open --
+ * Main library interface to the DB access methods.
+ *
+ * PUBLIC: int __db_open __P((DB *,
+ * PUBLIC: const char *, const char *, DBTYPE, u_int32_t, int));
+ */
+int
+__db_open(dbp, name, subdb, type, flags, mode)
+ DB *dbp;
+ const char *name, *subdb;
+ DBTYPE type;
+ u_int32_t flags;
+ int mode;
+{
+ DB_ENV *dbenv;
+ DB_LOCK open_lock;
+ DB *mdbp;
+ db_pgno_t meta_pgno;
+ u_int32_t ok_flags;
+ int ret, t_ret;
+
+ dbenv = dbp->dbenv;
+ mdbp = NULL;
+
+ /* Validate arguments. */
+#define OKFLAGS \
+ (DB_CREATE | DB_EXCL | DB_FCNTL_LOCKING | \
+ DB_NOMMAP | DB_RDONLY | DB_RDWRMASTER | DB_THREAD | DB_TRUNCATE)
+ if ((ret = __db_fchk(dbenv, "DB->open", flags, OKFLAGS)) != 0)
+ return (ret);
+ if (LF_ISSET(DB_EXCL) && !LF_ISSET(DB_CREATE))
+ return (__db_ferr(dbenv, "DB->open", 1));
+ if (LF_ISSET(DB_RDONLY) && LF_ISSET(DB_CREATE))
+ return (__db_ferr(dbenv, "DB->open", 1));
+#ifdef HAVE_VXWORKS
+ if (LF_ISSET(DB_TRUNCATE)) {
+ __db_err(dbenv, "DB_TRUNCATE unsupported in VxWorks");
+ return (__db_eopnotsup(dbenv));
+ }
+#endif
+ switch (type) {
+ case DB_UNKNOWN:
+ if (LF_ISSET(DB_CREATE|DB_TRUNCATE)) {
+ __db_err(dbenv,
+ "%s: DB_UNKNOWN type specified with DB_CREATE or DB_TRUNCATE",
+ name);
+ return (EINVAL);
+ }
+ ok_flags = 0;
+ break;
+ case DB_BTREE:
+ ok_flags = DB_OK_BTREE;
+ break;
+ case DB_HASH:
+ ok_flags = DB_OK_HASH;
+ break;
+ case DB_QUEUE:
+ ok_flags = DB_OK_QUEUE;
+ break;
+ case DB_RECNO:
+ ok_flags = DB_OK_RECNO;
+ break;
+ default:
+ __db_err(dbenv, "unknown type: %lu", (u_long)type);
+ return (EINVAL);
+ }
+ if (ok_flags)
+ DB_ILLEGAL_METHOD(dbp, ok_flags);
+
+ /* The environment may have been created, but never opened. */
+ if (!F_ISSET(dbenv, DB_ENV_DBLOCAL | DB_ENV_OPEN_CALLED)) {
+ __db_err(dbenv, "environment not yet opened");
+ return (EINVAL);
+ }
+
+ /*
+ * Historically, you could pass in an environment that didn't have a
+ * mpool, and DB would create a private one behind the scenes. This
+ * no longer works.
+ */
+ if (!F_ISSET(dbenv, DB_ENV_DBLOCAL) && !MPOOL_ON(dbenv)) {
+ __db_err(dbenv, "environment did not include a memory pool.");
+ return (EINVAL);
+ }
+
+ /*
+ * You can't specify threads during DB->open if subsystems in the
+ * environment weren't configured with them.
+ */
+ if (LF_ISSET(DB_THREAD) &&
+ !F_ISSET(dbenv, DB_ENV_DBLOCAL | DB_ENV_THREAD)) {
+ __db_err(dbenv, "environment not created using DB_THREAD");
+ return (EINVAL);
+ }
+
+ /*
+ * If the environment was configured with threads, the DB handle
+ * must also be free-threaded, so we force the DB_THREAD flag on.
+ * (See SR #2033 for why this is a requirement--recovery needs
+ * to be able to grab a dbp using __db_fileid_to_dbp, and it has
+ * no way of knowing which dbp goes with which thread, so whichever
+ * one it finds has to be usable in any of them.)
+ */
+ if (F_ISSET(dbenv, DB_ENV_THREAD))
+ LF_SET(DB_THREAD);
+
+ /* DB_TRUNCATE is not transaction recoverable. */
+ if (LF_ISSET(DB_TRUNCATE) && TXN_ON(dbenv)) {
+ __db_err(dbenv,
+ "DB_TRUNCATE illegal in a transaction protected environment");
+ return (EINVAL);
+ }
+
+ /* Subdatabase checks. */
+ if (subdb != NULL) {
+ /* Subdatabases must be created in named files. */
+ if (name == NULL) {
+ __db_err(dbenv,
+ "multiple databases cannot be created in temporary files");
+ return (EINVAL);
+ }
+
+ /* QAM can't be done as a subdatabase. */
+ if (type == DB_QUEUE) {
+ __db_err(dbenv, "Queue databases must be one-per-file");
+ return (EINVAL);
+ }
+ }
+
+ /* Convert any DB->open flags. */
+ if (LF_ISSET(DB_RDONLY))
+ F_SET(dbp, DB_AM_RDONLY);
+
+ /* Fill in the type. */
+ dbp->type = type;
+
+ /*
+ * If we're potentially creating a database, wrap the open inside of
+ * a transaction.
+ */
+ if (TXN_ON(dbenv) && LF_ISSET(DB_CREATE))
+ if ((ret = __db_metabegin(dbp, &open_lock)) != 0)
+ return (ret);
+
+ /*
+ * If we're opening a subdatabase, we have to open (and potentially
+ * create) the main database, and then get (and potentially store)
+ * our base page number in that database. Then, we can finally open
+ * the subdatabase.
+ */
+ if (subdb == NULL)
+ meta_pgno = PGNO_BASE_MD;
+ else {
+ /*
+ * Open the master database, optionally creating or updating
+ * it, and retrieve the metadata page number.
+ */
+ if ((ret =
+ __db_master_open(dbp, name, flags, mode, &mdbp)) != 0)
+ goto err;
+
+ /* Copy the page size and file id from the master. */
+ dbp->pgsize = mdbp->pgsize;
+ F_SET(dbp, DB_AM_SUBDB);
+ memcpy(dbp->fileid, mdbp->fileid, DB_FILE_ID_LEN);
+
+ if ((ret = __db_master_update(mdbp,
+ subdb, type, &meta_pgno, MU_OPEN, NULL, flags)) != 0)
+ goto err;
+
+ /*
+ * Clear the exclusive open and truncation flags, they only
+ * apply to the open of the master database.
+ */
+ LF_CLR(DB_EXCL | DB_TRUNCATE);
+ }
+
+ ret = __db_dbopen(dbp, name, flags, mode, meta_pgno);
+
+ /*
+ * You can open the database that describes the subdatabases in the
+ * rest of the file read-only. The content of each key's data is
+ * unspecified and applications should never be adding new records
+ * or updating existing records. However, during recovery, we need
+ * to open these databases R/W so we can redo/undo changes in them.
+ * Likewise, we need to open master databases read/write during
+ * rename and remove so we can be sure they're fully sync'ed, so
+ * we provide an override flag for the purpose.
+ */
+ if (subdb == NULL && !IS_RECOVERING(dbenv) && !LF_ISSET(DB_RDONLY) &&
+ !LF_ISSET(DB_RDWRMASTER) && F_ISSET(dbp, DB_AM_SUBDB)) {
+ __db_err(dbenv,
+ "files containing multiple databases may only be opened read-only");
+ ret = EINVAL;
+ goto err;
+ }
+
+err: /*
+ * End any transaction, committing if we were successful, aborting
+ * otherwise.
+ */
+ if (TXN_ON(dbenv) && LF_ISSET(DB_CREATE))
+ if ((t_ret = __db_metaend(dbp,
+ &open_lock, ret == 0, NULL, NULL)) != 0 && ret == 0)
+ ret = t_ret;
+
+ /* If we were successful, don't discard the file on close. */
+ if (ret == 0)
+ F_CLR(dbp, DB_AM_DISCARD);
+
+ /* If we were unsuccessful, destroy the DB handle. */
+ if (ret != 0) {
+ /* In recovery we set log_fileid early. */
+ if (IS_RECOVERING(dbenv))
+ dbp->log_fileid = DB_LOGFILEID_INVALID;
+ __db_refresh(dbp);
+ }
+
+ if (mdbp != NULL) {
+ /* If we were successful, don't discard the file on close. */
+ if (ret == 0)
+ F_CLR(mdbp, DB_AM_DISCARD);
+ if ((t_ret = mdbp->close(mdbp, 0)) != 0 && ret == 0)
+ ret = t_ret;
+ }
+
+ return (ret);
+}
+
+/*
+ * __db_dbopen --
+ * Open a database.
+ * PUBLIC: int __db_dbopen __P((DB *, const char *, u_int32_t, int, db_pgno_t));
+ */
+int
+__db_dbopen(dbp, name, flags, mode, meta_pgno)
+ DB *dbp;
+ const char *name;
+ u_int32_t flags;
+ int mode;
+ db_pgno_t meta_pgno;
+{
+ DB_ENV *dbenv;
+ int ret, retinfo;
+
+ dbenv = dbp->dbenv;
+
+ /* Set up the underlying file. */
+ if ((ret = __db_file_setup(dbp,
+ name, flags, mode, meta_pgno, &retinfo)) != 0)
+ return (ret);
+
+ /*
+ * If we created the file, set the truncate flag for the mpool. This
+ * isn't for anything we've done, it's protection against stupid user
+ * tricks: if the user deleted a file behind Berkeley DB's back, we
+ * may still have pages in the mpool that match the file's "unique" ID.
+ */
+ if (retinfo & DB_FILE_SETUP_CREATE)
+ flags |= DB_TRUNCATE;
+
+ /* Set up the underlying environment. */
+ if ((ret = __db_dbenv_setup(dbp, name, flags)) != 0)
+ return (ret);
+
+ /*
+ * Do access method specific initialization.
+ *
+ * !!!
+ * Set the open flag. (The underlying access method open functions
+ * may want to do things like acquire cursors, so the open flag has
+ * to be set before calling them.)
+ */
+ F_SET(dbp, DB_OPEN_CALLED);
+
+ if (retinfo & DB_FILE_SETUP_ZERO)
+ return (0);
+
+ switch (dbp->type) {
+ case DB_BTREE:
+ ret = __bam_open(dbp, name, meta_pgno, flags);
+ break;
+ case DB_HASH:
+ ret = __ham_open(dbp, name, meta_pgno, flags);
+ break;
+ case DB_RECNO:
+ ret = __ram_open(dbp, name, meta_pgno, flags);
+ break;
+ case DB_QUEUE:
+ ret = __qam_open(dbp, name, meta_pgno, mode, flags);
+ break;
+ case DB_UNKNOWN:
+ return (__db_unknown_type(dbp->dbenv,
+ "__db_dbopen", dbp->type));
+ break;
+ }
+ return (ret);
+}
+
+/*
+ * __db_master_open --
+ * Open up a handle on a master database.
+ *
+ * PUBLIC: int __db_master_open __P((DB *,
+ * PUBLIC: const char *, u_int32_t, int, DB **));
+ */
+int
+__db_master_open(subdbp, name, flags, mode, dbpp)
+ DB *subdbp;
+ const char *name;
+ u_int32_t flags;
+ int mode;
+ DB **dbpp;
+{
+ DB *dbp;
+ int ret;
+
+ /* Open up a handle on the main database. */
+ if ((ret = db_create(&dbp, subdbp->dbenv, 0)) != 0)
+ return (ret);
+
+ /*
+ * It's always a btree.
+ * Run in the transaction we've created.
+ * Set the pagesize in case we're creating a new database.
+ * Flag that we're creating a database with subdatabases.
+ */
+ dbp->type = DB_BTREE;
+ dbp->open_txn = subdbp->open_txn;
+ dbp->pgsize = subdbp->pgsize;
+ F_SET(dbp, DB_AM_SUBDB);
+
+ if ((ret = __db_dbopen(dbp, name, flags, mode, PGNO_BASE_MD)) != 0) {
+ if (!F_ISSET(dbp, DB_AM_DISCARD))
+ dbp->close(dbp, 0);
+ return (ret);
+ }
+
+ *dbpp = dbp;
+ return (0);
+}
+
+/*
+ * __db_master_update --
+ * Add/Remove a subdatabase from a master database.
+ */
+static int
+__db_master_update(mdbp, subdb, type, meta_pgnop, action, newname, flags)
+ DB *mdbp;
+ const char *subdb;
+ u_int32_t type;
+ db_pgno_t *meta_pgnop; /* may be NULL on MU_RENAME */
+ mu_action action;
+ const char *newname;
+ u_int32_t flags;
+{
+ DB_ENV *dbenv;
+ DBC *dbc, *ndbc;
+ DBT key, data, ndata;
+ PAGE *p;
+ db_pgno_t t_pgno;
+ int modify, ret, t_ret;
+
+ dbenv = mdbp->dbenv;
+ dbc = ndbc = NULL;
+ p = NULL;
+
+ /* Might we modify the master database? If so, we'll need to lock. */
+ modify = (action != MU_OPEN || LF_ISSET(DB_CREATE)) ? 1 : 0;
+
+ memset(&key, 0, sizeof(key));
+ memset(&data, 0, sizeof(data));
+
+ /*
+ * Open up a cursor. If this is CDB and we're creating the database,
+ * make it an update cursor.
+ */
+ if ((ret = mdbp->cursor(mdbp, mdbp->open_txn, &dbc,
+ (CDB_LOCKING(dbenv) && modify) ? DB_WRITECURSOR : 0)) != 0)
+ goto err;
+
+ /*
+ * Try to point the cursor at the record.
+ *
+ * If we're removing or potentially creating an entry, lock the page
+ * with DB_RMW.
+ *
+ * !!!
+ * We don't include the name's nul termination in the database.
+ */
+ key.data = (char *)subdb;
+ key.size = strlen(subdb);
+ /* In the rename case, we do multiple cursor ops, so MALLOC is safer. */
+ F_SET(&data, DB_DBT_MALLOC);
+ ret = dbc->c_get(dbc, &key, &data,
+ DB_SET | ((STD_LOCKING(dbc) && modify) ? DB_RMW : 0));
+
+ /*
+ * What we do next--whether or not we found a record for the
+ * specified subdatabase--depends on what the specified action is.
+ * Handle ret appropriately as the first statement of each case.
+ */
+ switch (action) {
+ case MU_REMOVE:
+ /*
+ * We should have found something if we're removing it. Note
+ * that in the common case where the DB we're asking to remove
+ * doesn't exist, we won't get this far; __db_subdb_remove
+ * will already have returned an error from __db_open.
+ */
+ if (ret != 0)
+ goto err;
+
+ /*
+ * Delete the subdatabase entry first; if this fails,
+ * we don't want to touch the actual subdb pages.
+ */
+ if ((ret = dbc->c_del(dbc, 0)) != 0)
+ goto err;
+
+ /*
+ * We're handling actual data, not on-page meta-data,
+ * so it hasn't been converted to/from opposite
+ * endian architectures. Do it explicitly, now.
+ */
+ memcpy(meta_pgnop, data.data, sizeof(db_pgno_t));
+ DB_NTOHL(meta_pgnop);
+ if ((ret = memp_fget(mdbp->mpf, meta_pgnop, 0, &p)) != 0)
+ goto err;
+
+ /* Free and put the page. */
+ if ((ret = __db_free(dbc, p)) != 0) {
+ p = NULL;
+ goto err;
+ }
+ p = NULL;
+ break;
+ case MU_RENAME:
+ /* We should have found something if we're renaming it. */
+ if (ret != 0)
+ goto err;
+
+ /*
+ * Before we rename, we need to make sure we're not
+ * overwriting another subdatabase, or else this operation
+ * won't be undoable. Open a second cursor and check
+ * for the existence of newname; it shouldn't appear under
+ * us since we hold the metadata lock.
+ */
+ if ((ret = mdbp->cursor(mdbp, mdbp->open_txn, &ndbc, 0)) != 0)
+ goto err;
+ DB_ASSERT(newname != NULL);
+ key.data = (void *) newname;
+ key.size = strlen(newname);
+
+ /*
+ * We don't actually care what the meta page of the potentially-
+ * overwritten DB is; we just care about existence.
+ */
+ memset(&ndata, 0, sizeof(ndata));
+ F_SET(&ndata, DB_DBT_USERMEM | DB_DBT_PARTIAL);
+
+ if ((ret = ndbc->c_get(ndbc, &key, &ndata, DB_SET)) == 0) {
+ /* A subdb called newname exists. Bail. */
+ ret = EEXIST;
+ __db_err(dbenv, "rename: database %s exists", newname);
+ goto err;
+ } else if (ret != DB_NOTFOUND)
+ goto err;
+
+ /*
+ * Now do the put first; we don't want to lose our
+ * sole reference to the subdb. Use the second cursor
+ * so that the first one continues to point to the old record.
+ */
+ if ((ret = ndbc->c_put(ndbc, &key, &data, DB_KEYFIRST)) != 0)
+ goto err;
+ if ((ret = dbc->c_del(dbc, 0)) != 0) {
+ /*
+ * If the delete fails, try to delete the record
+ * we just put, in case we're not txn-protected.
+ */
+ (void)ndbc->c_del(ndbc, 0);
+ goto err;
+ }
+
+ break;
+ case MU_OPEN:
+ /*
+ * Get the subdatabase information. If it already exists,
+ * copy out the page number and we're done.
+ */
+ switch (ret) {
+ case 0:
+ memcpy(meta_pgnop, data.data, sizeof(db_pgno_t));
+ DB_NTOHL(meta_pgnop);
+ goto done;
+ case DB_NOTFOUND:
+ if (LF_ISSET(DB_CREATE))
+ break;
+ /*
+ * No db_err, it is reasonable to remove a
+ * nonexistent db.
+ */
+ ret = ENOENT;
+ goto err;
+ default:
+ goto err;
+ }
+
+ if ((ret = __db_new(dbc,
+ type == DB_HASH ? P_HASHMETA : P_BTREEMETA, &p)) != 0)
+ goto err;
+ *meta_pgnop = PGNO(p);
+
+ /*
+ * XXX
+ * We're handling actual data, not on-page meta-data, so it
+ * hasn't been converted to/from opposite endian architectures.
+ * Do it explicitly, now.
+ */
+ t_pgno = PGNO(p);
+ DB_HTONL(&t_pgno);
+ memset(&ndata, 0, sizeof(ndata));
+ ndata.data = &t_pgno;
+ ndata.size = sizeof(db_pgno_t);
+ if ((ret = dbc->c_put(dbc, &key, &ndata, DB_KEYLAST)) != 0)
+ goto err;
+ break;
+ }
+
+err:
+done: /*
+ * If we allocated a page: if we're successful, mark the page dirty
+ * and return it to the cache, otherwise, discard/free it.
+ */
+ if (p != NULL) {
+ if (ret == 0) {
+ if ((t_ret =
+ memp_fput(mdbp->mpf, p, DB_MPOOL_DIRTY)) != 0)
+ ret = t_ret;
+ /*
+ * Since we cannot close this file until after
+ * transaction commit, we need to sync the dirty
+ * pages, because we'll read these directly from
+ * disk to open.
+ */
+ if ((t_ret = mdbp->sync(mdbp, 0)) != 0 && ret == 0)
+ ret = t_ret;
+ } else
+ (void)__db_free(dbc, p);
+ }
+
+ /* Discard the cursor(s) and data. */
+ if (data.data != NULL)
+ __os_free(data.data, data.size);
+ if (dbc != NULL && (t_ret = dbc->c_close(dbc)) != 0 && ret == 0)
+ ret = t_ret;
+ if (ndbc != NULL && (t_ret = ndbc->c_close(ndbc)) != 0 && ret == 0)
+ ret = t_ret;
+
+ return (ret);
+}
+
+/*
+ * __db_dbenv_setup --
+ * Set up the underlying environment during a db_open.
+ *
+ * PUBLIC: int __db_dbenv_setup __P((DB *, const char *, u_int32_t));
+ */
+int
+__db_dbenv_setup(dbp, name, flags)
+ DB *dbp;
+ const char *name;
+ u_int32_t flags;
+{
+ DB *ldbp;
+ DB_ENV *dbenv;
+ DBT pgcookie;
+ DB_MPOOL_FINFO finfo;
+ DB_PGINFO pginfo;
+ int ret;
+ u_int32_t maxid;
+
+ dbenv = dbp->dbenv;
+
+ /* If we don't yet have an environment, it's time to create it. */
+ if (!F_ISSET(dbenv, DB_ENV_OPEN_CALLED)) {
+ /* Make sure we have at least DB_MINCACHE pages in our cache. */
+ if (dbenv->mp_gbytes == 0 &&
+ dbenv->mp_bytes < dbp->pgsize * DB_MINPAGECACHE &&
+ (ret = dbenv->set_cachesize(
+ dbenv, 0, dbp->pgsize * DB_MINPAGECACHE, 0)) != 0)
+ return (ret);
+
+ if ((ret = dbenv->open(dbenv, NULL, DB_CREATE |
+ DB_INIT_MPOOL | DB_PRIVATE | LF_ISSET(DB_THREAD), 0)) != 0)
+ return (ret);
+ }
+
+ /* Register DB's pgin/pgout functions. */
+ if ((ret =
+ memp_register(dbenv, DB_FTYPE_SET, __db_pgin, __db_pgout)) != 0)
+ return (ret);
+
+ /*
+ * Open a backing file in the memory pool.
+ *
+ * If we need to pre- or post-process a file's pages on I/O, set the
+ * file type. If it's a hash file, always call the pgin and pgout
+ * routines. This means that hash files can never be mapped into
+ * process memory. If it's a btree file and requires swapping, we
+ * need to page the file in and out. This has to be right -- we can't
+ * mmap files that are being paged in and out.
+ */
+ memset(&finfo, 0, sizeof(finfo));
+ switch (dbp->type) {
+ case DB_BTREE:
+ case DB_RECNO:
+ finfo.ftype =
+ F_ISSET(dbp, DB_AM_SWAP) ? DB_FTYPE_SET : DB_FTYPE_NOTSET;
+ finfo.clear_len = DB_PAGE_DB_LEN;
+ break;
+ case DB_HASH:
+ finfo.ftype = DB_FTYPE_SET;
+ finfo.clear_len = DB_PAGE_DB_LEN;
+ break;
+ case DB_QUEUE:
+ finfo.ftype =
+ F_ISSET(dbp, DB_AM_SWAP) ? DB_FTYPE_SET : DB_FTYPE_NOTSET;
+ finfo.clear_len = DB_PAGE_QUEUE_LEN;
+ break;
+ case DB_UNKNOWN:
+ /*
+ * If we're running in the verifier, our database might
+ * be corrupt and we might not know its type--but we may
+ * still want to be able to verify and salvage.
+ *
+ * If we can't identify the type, it's not going to be safe
+ * to call __db_pgin--we pretty much have to give up all
+ * hope of salvaging cross-endianness. Proceed anyway;
+ * at worst, the database will just appear more corrupt
+ * than it actually is, but at best, we may be able
+ * to salvage some data even with no metadata page.
+ */
+ if (F_ISSET(dbp, DB_AM_VERIFYING)) {
+ finfo.ftype = DB_FTYPE_NOTSET;
+ finfo.clear_len = DB_PAGE_DB_LEN;
+ break;
+ }
+ return (__db_unknown_type(dbp->dbenv,
+ "__db_dbenv_setup", dbp->type));
+ }
+ finfo.pgcookie = &pgcookie;
+ finfo.fileid = dbp->fileid;
+ finfo.lsn_offset = 0;
+
+ pginfo.db_pagesize = dbp->pgsize;
+ pginfo.needswap = F_ISSET(dbp, DB_AM_SWAP);
+ pgcookie.data = &pginfo;
+ pgcookie.size = sizeof(DB_PGINFO);
+
+ if ((ret = memp_fopen(dbenv, name,
+ LF_ISSET(DB_RDONLY | DB_NOMMAP | DB_ODDFILESIZE | DB_TRUNCATE),
+ 0, dbp->pgsize, &finfo, &dbp->mpf)) != 0)
+ return (ret);
+
+ /*
+ * We may need a per-thread mutex. Allocate it from the environment
+ * region, there's supposed to be extra space there for that purpose.
+ */
+ if (LF_ISSET(DB_THREAD)) {
+ if ((ret = __db_mutex_alloc(
+ dbenv, dbenv->reginfo, (MUTEX **)&dbp->mutexp)) != 0)
+ return (ret);
+ if ((ret = __db_mutex_init(
+ dbenv, dbp->mutexp, 0, MUTEX_THREAD)) != 0) {
+ __db_mutex_free(dbenv, dbenv->reginfo, dbp->mutexp);
+ return (ret);
+ }
+ }
+
+ /* Get a log file id. */
+ if (LOGGING_ON(dbenv) && !IS_RECOVERING(dbenv) &&
+#if !defined(DEBUG_ROP)
+ !F_ISSET(dbp, DB_AM_RDONLY) &&
+#endif
+ (ret = log_register(dbenv, dbp, name)) != 0)
+ return (ret);
+
+ /*
+ * Insert ourselves into the DB_ENV's dblist. We allocate a
+ * unique ID to each {fileid, meta page number} pair, and to
+ * each temporary file (since they all have a zero fileid).
+ * This ID gives us something to use to tell which DB handles
+ * go with which databases in all the cursor adjustment
+ * routines, where we don't want to do a lot of ugly and
+ * expensive memcmps.
+ */
+ MUTEX_THREAD_LOCK(dbenv, dbenv->dblist_mutexp);
+ for (maxid = 0, ldbp = LIST_FIRST(&dbenv->dblist);
+ ldbp != NULL; ldbp = LIST_NEXT(dbp, dblistlinks)) {
+ if (name != NULL &&
+ memcmp(ldbp->fileid, dbp->fileid, DB_FILE_ID_LEN) == 0 &&
+ ldbp->meta_pgno == dbp->meta_pgno)
+ break;
+ if (ldbp->adj_fileid > maxid)
+ maxid = ldbp->adj_fileid;
+ }
+
+ /*
+ * If ldbp is NULL, we didn't find a match, or we weren't
+ * really looking because name is NULL. Assign the dbp an
+ * adj_fileid one higher than the largest we found, and
+ * insert it at the head of the master dbp list.
+ *
+ * If ldbp is not NULL, it is a match for our dbp. Give dbp
+ * the same ID that ldbp has, and add it after ldbp so they're
+ * together in the list.
+ */
+ if (ldbp == NULL) {
+ dbp->adj_fileid = maxid + 1;
+ LIST_INSERT_HEAD(&dbenv->dblist, dbp, dblistlinks);
+ } else {
+ dbp->adj_fileid = ldbp->adj_fileid;
+ LIST_INSERT_AFTER(ldbp, dbp, dblistlinks);
+ }
+ MUTEX_THREAD_UNLOCK(dbenv, dbenv->dblist_mutexp);
+
+ return (0);
+}
+
+/*
+ * __db_file_setup --
+ * Setup the file or in-memory data.
+ * Read the database metadata and resolve it with our arguments.
+ */
+static int
+__db_file_setup(dbp, name, flags, mode, meta_pgno, retflags)
+ DB *dbp;
+ const char *name;
+ u_int32_t flags;
+ int mode;
+ db_pgno_t meta_pgno;
+ int *retflags;
+{
+ DB *mdb;
+ DBT namedbt;
+ DB_ENV *dbenv;
+ DB_FH *fhp, fh;
+ DB_LSN lsn;
+ DB_TXN *txn;
+ size_t nr;
+ u_int32_t magic, oflags;
+ int ret, retry_cnt, t_ret;
+ char *real_name, mbuf[DBMETASIZE];
+
+#define IS_SUBDB_SETUP (meta_pgno != PGNO_BASE_MD)
+
+ dbenv = dbp->dbenv;
+ dbp->meta_pgno = meta_pgno;
+ txn = NULL;
+ *retflags = 0;
+
+ /*
+ * If we open a file handle and our caller is doing fcntl(2) locking,
+ * we can't close it because that would discard the caller's lock.
+ * Save it until we close the DB handle.
+ */
+ if (LF_ISSET(DB_FCNTL_LOCKING)) {
+ if ((ret = __os_malloc(dbenv, sizeof(*fhp), NULL, &fhp)) != 0)
+ return (ret);
+ } else
+ fhp = &fh;
+ memset(fhp, 0, sizeof(*fhp));
+
+ /*
+ * If the file is in-memory, set up is simple. Otherwise, do the
+ * hard work of opening and reading the file.
+ *
+ * If we have a file name, try and read the first page, figure out
+ * what type of file it is, and initialize everything we can based
+ * on that file's meta-data page.
+ *
+ * !!!
+ * There's a reason we don't push this code down into the buffer cache.
+ * The problem is that there's no information external to the file that
+ * we can use as a unique ID. UNIX has dev/inode pairs, but they are
+ * not necessarily unique after reboot, if the file was mounted via NFS.
+ * Windows has similar problems, as the FAT filesystem doesn't maintain
+ * dev/inode numbers across reboot. So, we must get something from the
+ * file we can use to ensure that, even after a reboot, the file we're
+ * joining in the cache is the right file for us to join. The solution
+ * we use is to maintain a file ID that's stored in the database, and
+ * that's why we have to open and read the file before calling into the
+ * buffer cache.
+ *
+ * The secondary reason is that there's additional information that
+ * we want to have before instantiating a file in the buffer cache:
+ * the page size, file type (btree/hash), if swapping is required,
+ * and flags (DB_RDONLY, DB_CREATE, DB_TRUNCATE). We could handle
+ * needing this information by allowing it to be set for a file in
+ * the buffer cache even after the file has been opened, and, of
+ * course, supporting the ability to flush a file from the cache as
+ * necessary, e.g., if we guessed wrongly about the page size. Given
+ * that we have to read the file anyway to get the file ID, we might
+ * as well get the rest, too.
+ *
+ * Get the real file name.
+ */
+ if (name == NULL) {
+ F_SET(dbp, DB_AM_INMEM);
+
+ if (dbp->type == DB_UNKNOWN) {
+ __db_err(dbenv,
+ "DBTYPE of unknown without existing file");
+ return (EINVAL);
+ }
+ real_name = NULL;
+
+ /* Set the page size if we don't have one yet. */
+ if (dbp->pgsize == 0)
+ dbp->pgsize = DB_DEF_IOSIZE;
+
+ /*
+ * If the file is a temporary file and we're doing locking,
+ * then we have to create a unique file ID. We can't use our
+ * normal dev/inode pair (or whatever this OS uses in place of
+ * dev/inode pairs) because no backing file will be created
+ * until the mpool cache is filled forcing the buffers to disk.
+ * Grab a random locker ID to use as a file ID. The created
+ * ID must never match a potential real file ID -- we know it
+ * won't because real file IDs contain a time stamp after the
+ * dev/inode pair, and we're simply storing a 4-byte value.
+ *
+ * !!!
+ * Store the locker in the file id structure -- we can get it
+ * from there as necessary, and it saves having two copies.
+ */
+ if (LOCKING_ON(dbenv) &&
+ (ret = lock_id(dbenv, (u_int32_t *)dbp->fileid)) != 0)
+ return (ret);
+
+ return (0);
+ }
+
+ /* Get the real backing file name. */
+ if ((ret = __db_appname(dbenv,
+ DB_APP_DATA, NULL, name, 0, NULL, &real_name)) != 0)
+ return (ret);
+
+ /*
+ * Open the backing file. We need to make sure that multiple processes
+ * attempting to create the file at the same time are properly ordered
+ * so that only one of them creates the "unique" file ID, so we open it
+ * O_EXCL and O_CREAT so two simultaneous attempts to create the region
+ * will return failure in one of the attempts. If we're the one that
+ * fails, simply retry without the O_CREAT flag, which will require the
+ * meta-data page exist.
+ */
+
+ /* Fill in the default file mode. */
+ if (mode == 0)
+ mode = __db_omode("rwrw--");
+
+ oflags = 0;
+ if (LF_ISSET(DB_RDONLY))
+ oflags |= DB_OSO_RDONLY;
+ if (LF_ISSET(DB_TRUNCATE))
+ oflags |= DB_OSO_TRUNC;
+
+ retry_cnt = 0;
+open_retry:
+ *retflags = 0;
+ ret = 0;
+ if (!IS_SUBDB_SETUP && LF_ISSET(DB_CREATE)) {
+ if (dbp->open_txn != NULL) {
+ /*
+ * Start a child transaction to wrap this individual
+ * create.
+ */
+ if ((ret =
+ txn_begin(dbenv, dbp->open_txn, &txn, 0)) != 0)
+ goto err_msg;
+
+ memset(&namedbt, 0, sizeof(namedbt));
+ namedbt.data = (char *)name;
+ namedbt.size = strlen(name) + 1;
+ if ((ret = __crdel_fileopen_log(dbenv, txn,
+ &lsn, DB_FLUSH, &namedbt, mode)) != 0)
+ goto err_msg;
+ }
+ DB_TEST_RECOVERY(dbp, DB_TEST_PREOPEN, ret, name);
+ if ((ret = __os_open(dbenv, real_name,
+ oflags | DB_OSO_CREATE | DB_OSO_EXCL, mode, fhp)) == 0) {
+ DB_TEST_RECOVERY(dbp, DB_TEST_POSTOPEN, ret, name);
+
+ /* Commit the file create. */
+ if (dbp->open_txn != NULL) {
+ if ((ret = txn_commit(txn, DB_TXN_SYNC)) != 0)
+ goto err_msg;
+ txn = NULL;
+ }
+
+ /*
+ * We created the file. This means that if we later
+ * fail, we need to delete the file and if we're going
+ * to do that, we need to trash any pages in the
+ * memory pool. Since we only know here that we
+ * created the file, we're going to set the flag here
+ * and clear it later if we commit successfully.
+ */
+ F_SET(dbp, DB_AM_DISCARD);
+ *retflags |= DB_FILE_SETUP_CREATE;
+ } else {
+ /*
+ * Abort the file create. If the abort fails, report
+ * the error returned by txn_abort(), rather than the
+ * open error, for no particular reason.
+ */
+ if (dbp->open_txn != NULL) {
+ if ((t_ret = txn_abort(txn)) != 0) {
+ ret = t_ret;
+ goto err_msg;
+ }
+ txn = NULL;
+ }
+
+ /*
+ * If we were not doing an exclusive open, try again
+ * without the create flag.
+ */
+ if (ret == EEXIST && !LF_ISSET(DB_EXCL)) {
+ LF_CLR(DB_CREATE);
+ DB_TEST_RECOVERY(dbp,
+ DB_TEST_POSTOPEN, ret, name);
+ goto open_retry;
+ }
+ }
+ } else
+ ret = __os_open(dbenv, real_name, oflags, mode, fhp);
+
+ /*
+ * Be quiet if we couldn't open the file because it didn't exist
+ * or we did not have permission,
+ * the customers don't like those messages appearing in the logs.
+ * Otherwise, complain loudly.
+ */
+ if (ret != 0) {
+ if (ret == EACCES || ret == ENOENT)
+ goto err;
+ goto err_msg;
+ }
+
+ /* Set the page size if we don't have one yet. */
+ if (dbp->pgsize == 0) {
+ if (IS_SUBDB_SETUP) {
+ if ((ret = __db_master_open(dbp,
+ name, flags, mode, &mdb)) != 0)
+ goto err;
+ dbp->pgsize = mdb->pgsize;
+ (void)mdb->close(mdb, 0);
+ } else if ((ret = __db_set_pgsize(dbp, fhp, real_name)) != 0)
+ goto err;
+ }
+
+ /*
+ * Seek to the metadata offset; if it's a master database open or a
+ * database without subdatabases, we're seeking to 0, but that's OK.
+ */
+ if ((ret = __os_seek(dbenv, fhp,
+ dbp->pgsize, meta_pgno, 0, 0, DB_OS_SEEK_SET)) != 0)
+ goto err_msg;
+
+ /*
+ * Read the metadata page. We read DBMETASIZE bytes, which is larger
+ * than any access method's metadata page and smaller than any disk
+ * sector.
+ */
+ if ((ret = __os_read(dbenv, fhp, mbuf, sizeof(mbuf), &nr)) != 0)
+ goto err_msg;
+
+ if (nr == sizeof(mbuf)) {
+ /*
+ * Figure out what access method we're dealing with, and then
+ * call access method specific code to check error conditions
+ * based on conflicts between the found file and application
+ * arguments. A found file overrides some user information --
+ * we don't consider it an error, for example, if the user set
+ * an expected byte order and the found file doesn't match it.
+ */
+ F_CLR(dbp, DB_AM_SWAP);
+ magic = ((DBMETA *)mbuf)->magic;
+
+swap_retry: switch (magic) {
+ case DB_BTREEMAGIC:
+ if ((ret =
+ __bam_metachk(dbp, name, (BTMETA *)mbuf)) != 0)
+ goto err;
+ break;
+ case DB_HASHMAGIC:
+ if ((ret =
+ __ham_metachk(dbp, name, (HMETA *)mbuf)) != 0)
+ goto err;
+ break;
+ case DB_QAMMAGIC:
+ if ((ret =
+ __qam_metachk(dbp, name, (QMETA *)mbuf)) != 0)
+ goto err;
+ break;
+ case 0:
+ /*
+ * There are two ways we can get a 0 magic number.
+ * If we're creating a subdatabase, then the magic
+ * number will be 0. We allocate a page as part of
+ * finding out what the base page number will be for
+ * the new subdatabase, but it's not initialized in
+ * any way.
+ *
+ * The second case happens if we are in recovery
+ * and we are going to recreate a database, it's
+ * possible that it's page was created (on systems
+ * where pages must be created explicitly to avoid
+ * holes in files) but is still 0.
+ */
+ if (IS_SUBDB_SETUP) { /* Case 1 */
+ if ((IS_RECOVERING(dbenv)
+ && F_ISSET((DB_LOG *)
+ dbenv->lg_handle, DBLOG_FORCE_OPEN))
+ || ((DBMETA *)mbuf)->pgno != PGNO_INVALID)
+ goto empty;
+
+ ret = EINVAL;
+ goto err;
+ }
+ /* Case 2 */
+ if (IS_RECOVERING(dbenv)) {
+ *retflags |= DB_FILE_SETUP_ZERO;
+ goto empty;
+ }
+ goto bad_format;
+ default:
+ if (F_ISSET(dbp, DB_AM_SWAP))
+ goto bad_format;
+
+ M_32_SWAP(magic);
+ F_SET(dbp, DB_AM_SWAP);
+ goto swap_retry;
+ }
+ } else {
+ /*
+ * Only newly created files are permitted to fail magic
+ * number tests.
+ */
+ if (nr != 0 || (!IS_RECOVERING(dbenv) && IS_SUBDB_SETUP))
+ goto bad_format;
+
+ /* Let the caller know that we had a 0-length file. */
+ if (!LF_ISSET(DB_CREATE | DB_TRUNCATE))
+ *retflags |= DB_FILE_SETUP_ZERO;
+
+ /*
+ * The only way we can reach here with the DB_CREATE flag set
+ * is if we created the file. If that's not the case, then
+ * either (a) someone else created the file but has not yet
+ * written out the metadata page, or (b) we truncated the file
+ * (DB_TRUNCATE) leaving it zero-length. In the case of (a),
+ * we want to sleep and give the file creator time to write
+ * the metadata page. In the case of (b), we want to continue.
+ *
+ * !!!
+ * There's a race in the case of two processes opening the file
+ * with the DB_TRUNCATE flag set at roughly the same time, and
+ * they could theoretically hurt each other. Sure hope that's
+ * unlikely.
+ */
+ if (!LF_ISSET(DB_CREATE | DB_TRUNCATE) &&
+ !IS_RECOVERING(dbenv)) {
+ if (retry_cnt++ < 3) {
+ __os_sleep(dbenv, 1, 0);
+ goto open_retry;
+ }
+bad_format: if (!IS_RECOVERING(dbenv))
+ __db_err(dbenv,
+ "%s: unexpected file type or format", name);
+ ret = EINVAL;
+ goto err;
+ }
+
+ DB_ASSERT (dbp->type != DB_UNKNOWN);
+
+empty: /*
+ * The file is empty, and that's OK. If it's not a subdatabase,
+ * though, we do need to generate a unique file ID for it. The
+ * unique file ID includes a timestamp so that we can't collide
+ * with any other files, even when the file IDs (dev/inode pair)
+ * are reused.
+ */
+ if (!IS_SUBDB_SETUP) {
+ if (*retflags & DB_FILE_SETUP_ZERO)
+ memset(dbp->fileid, 0, DB_FILE_ID_LEN);
+ else if ((ret = __os_fileid(dbenv,
+ real_name, 1, dbp->fileid)) != 0)
+ goto err_msg;
+ }
+ }
+
+ if (0) {
+err_msg: __db_err(dbenv, "%s: %s", name, db_strerror(ret));
+ }
+
+ /*
+ * Abort any running transaction -- it can only exist if something
+ * went wrong.
+ */
+err:
+DB_TEST_RECOVERY_LABEL
+
+ /*
+ * If we opened a file handle and our caller is doing fcntl(2) locking,
+ * then we can't close it because that would discard the caller's lock.
+ * Otherwise, close the handle.
+ */
+ if (F_ISSET(fhp, DB_FH_VALID)) {
+ if (ret == 0 && LF_ISSET(DB_FCNTL_LOCKING))
+ dbp->saved_open_fhp = fhp;
+ else
+ if ((t_ret = __os_closehandle(fhp)) != 0 && ret == 0)
+ ret = t_ret;
+ }
+
+ /*
+ * This must be done after the file is closed, since
+ * txn_abort() may remove the file, and an open file
+ * cannot be removed on a Windows platforms.
+ */
+ if (txn != NULL)
+ (void)txn_abort(txn);
+
+ if (real_name != NULL)
+ __os_freestr(real_name);
+
+ return (ret);
+}
+
+/*
+ * __db_set_pgsize --
+ * Set the page size based on file information.
+ */
+static int
+__db_set_pgsize(dbp, fhp, name)
+ DB *dbp;
+ DB_FH *fhp;
+ char *name;
+{
+ DB_ENV *dbenv;
+ u_int32_t iopsize;
+ int ret;
+
+ dbenv = dbp->dbenv;
+
+ /*
+ * Use the filesystem's optimum I/O size as the pagesize if a pagesize
+ * not specified. Some filesystems have 64K as their optimum I/O size,
+ * but as that results in fairly large default caches, we limit the
+ * default pagesize to 16K.
+ */
+ if ((ret = __os_ioinfo(dbenv, name, fhp, NULL, NULL, &iopsize)) != 0) {
+ __db_err(dbenv, "%s: %s", name, db_strerror(ret));
+ return (ret);
+ }
+ if (iopsize < 512)
+ iopsize = 512;
+ if (iopsize > 16 * 1024)
+ iopsize = 16 * 1024;
+
+ /*
+ * Sheer paranoia, but we don't want anything that's not a power-of-2
+ * (we rely on that for alignment of various types on the pages), and
+ * we want a multiple of the sector size as well.
+ */
+ OS_ROUNDOFF(iopsize, 512);
+
+ dbp->pgsize = iopsize;
+ F_SET(dbp, DB_AM_PGDEF);
+
+ return (0);
+}
+
+/*
+ * __db_close --
+ * DB destructor.
+ *
+ * PUBLIC: int __db_close __P((DB *, u_int32_t));
+ */
+int
+__db_close(dbp, flags)
+ DB *dbp;
+ u_int32_t flags;
+{
+ DB_ENV *dbenv;
+ DBC *dbc;
+ int ret, t_ret;
+
+ ret = 0;
+
+ dbenv = dbp->dbenv;
+ PANIC_CHECK(dbenv);
+
+ /* Validate arguments. */
+ if ((ret = __db_closechk(dbp, flags)) != 0)
+ goto err;
+
+ /* If never opened, or not currently open, it's easy. */
+ if (!F_ISSET(dbp, DB_OPEN_CALLED))
+ goto never_opened;
+
+ /* Sync the underlying access method. */
+ if (!LF_ISSET(DB_NOSYNC) && !F_ISSET(dbp, DB_AM_DISCARD) &&
+ (t_ret = dbp->sync(dbp, 0)) != 0 && ret == 0)
+ ret = t_ret;
+
+ /*
+ * Go through the active cursors and call the cursor recycle routine,
+ * which resolves pending operations and moves the cursors onto the
+ * free list. Then, walk the free list and call the cursor destroy
+ * routine.
+ */
+ while ((dbc = TAILQ_FIRST(&dbp->active_queue)) != NULL)
+ if ((t_ret = dbc->c_close(dbc)) != 0 && ret == 0)
+ ret = t_ret;
+ while ((dbc = TAILQ_FIRST(&dbp->free_queue)) != NULL)
+ if ((t_ret = __db_c_destroy(dbc)) != 0 && ret == 0)
+ ret = t_ret;
+
+ /*
+ * Close any outstanding join cursors. Join cursors destroy
+ * themselves on close and have no separate destroy routine.
+ */
+ while ((dbc = TAILQ_FIRST(&dbp->join_queue)) != NULL)
+ if ((t_ret = dbc->c_close(dbc)) != 0 && ret == 0)
+ ret = t_ret;
+
+ /* Remove this DB handle from the DB_ENV's dblist. */
+ MUTEX_THREAD_LOCK(dbenv, dbenv->dblist_mutexp);
+ LIST_REMOVE(dbp, dblistlinks);
+ MUTEX_THREAD_UNLOCK(dbenv, dbenv->dblist_mutexp);
+
+ /* Sync the memory pool. */
+ if (!LF_ISSET(DB_NOSYNC) && !F_ISSET(dbp, DB_AM_DISCARD) &&
+ (t_ret = memp_fsync(dbp->mpf)) != 0 &&
+ t_ret != DB_INCOMPLETE && ret == 0)
+ ret = t_ret;
+
+ /* Close any handle we've been holding since the open. */
+ if (dbp->saved_open_fhp != NULL &&
+ F_ISSET(dbp->saved_open_fhp, DB_FH_VALID) &&
+ (t_ret = __os_closehandle(dbp->saved_open_fhp)) != 0 && ret == 0)
+ ret = t_ret;
+
+never_opened:
+ /*
+ * Call the access specific close function.
+ *
+ * !!!
+ * Because of where the function is called in the close process,
+ * these routines can't do anything that would dirty pages or
+ * otherwise affect closing down the database.
+ */
+ if ((t_ret = __ham_db_close(dbp)) != 0 && ret == 0)
+ ret = t_ret;
+ if ((t_ret = __bam_db_close(dbp)) != 0 && ret == 0)
+ ret = t_ret;
+ if ((t_ret = __qam_db_close(dbp)) != 0 && ret == 0)
+ ret = t_ret;
+
+err:
+ /* Refresh the structure and close any local environment. */
+ if ((t_ret = __db_refresh(dbp)) != 0 && ret == 0)
+ ret = t_ret;
+ if (F_ISSET(dbenv, DB_ENV_DBLOCAL) &&
+ --dbenv->dblocal_ref == 0 &&
+ (t_ret = dbenv->close(dbenv, 0)) != 0 && ret == 0)
+ ret = t_ret;
+
+ memset(dbp, CLEAR_BYTE, sizeof(*dbp));
+ __os_free(dbp, sizeof(*dbp));
+
+ return (ret);
+}
+
+/*
+ * __db_refresh --
+ * Refresh the DB structure, releasing any allocated resources.
+ */
+static int
+__db_refresh(dbp)
+ DB *dbp;
+{
+ DB_ENV *dbenv;
+ DBC *dbc;
+ int ret, t_ret;
+
+ ret = 0;
+
+ dbenv = dbp->dbenv;
+
+ /*
+ * Go through the active cursors and call the cursor recycle routine,
+ * which resolves pending operations and moves the cursors onto the
+ * free list. Then, walk the free list and call the cursor destroy
+ * routine.
+ */
+ while ((dbc = TAILQ_FIRST(&dbp->active_queue)) != NULL)
+ if ((t_ret = dbc->c_close(dbc)) != 0 && ret == 0)
+ ret = t_ret;
+ while ((dbc = TAILQ_FIRST(&dbp->free_queue)) != NULL)
+ if ((t_ret = __db_c_destroy(dbc)) != 0 && ret == 0)
+ ret = t_ret;
+
+ dbp->type = 0;
+
+ /* Close the memory pool file handle. */
+ if (dbp->mpf != NULL) {
+ if (F_ISSET(dbp, DB_AM_DISCARD))
+ (void)__memp_fremove(dbp->mpf);
+ if ((t_ret = memp_fclose(dbp->mpf)) != 0 && ret == 0)
+ ret = t_ret;
+ dbp->mpf = NULL;
+ }
+
+ /* Discard the thread mutex. */
+ if (dbp->mutexp != NULL) {
+ __db_mutex_free(dbenv, dbenv->reginfo, dbp->mutexp);
+ dbp->mutexp = NULL;
+ }
+
+ /* Discard the log file id. */
+ if (!IS_RECOVERING(dbenv)
+ && dbp->log_fileid != DB_LOGFILEID_INVALID)
+ (void)log_unregister(dbenv, dbp);
+
+ F_CLR(dbp, DB_AM_DISCARD);
+ F_CLR(dbp, DB_AM_INMEM);
+ F_CLR(dbp, DB_AM_RDONLY);
+ F_CLR(dbp, DB_AM_SWAP);
+ F_CLR(dbp, DB_DBM_ERROR);
+ F_CLR(dbp, DB_OPEN_CALLED);
+
+ return (ret);
+}
+
+/*
+ * __db_remove
+ * Remove method for DB.
+ *
+ * PUBLIC: int __db_remove __P((DB *, const char *, const char *, u_int32_t));
+ */
+int
+__db_remove(dbp, name, subdb, flags)
+ DB *dbp;
+ const char *name, *subdb;
+ u_int32_t flags;
+{
+ DBT namedbt;
+ DB_ENV *dbenv;
+ DB_LOCK remove_lock;
+ DB_LSN newlsn;
+ int ret, t_ret, (*callback_func) __P((DB *, void *));
+ char *backup, *real_back, *real_name;
+ void *cookie;
+
+ dbenv = dbp->dbenv;
+ ret = 0;
+ backup = real_back = real_name = NULL;
+
+ PANIC_CHECK(dbenv);
+ /*
+ * Cannot use DB_ILLEGAL_AFTER_OPEN here because that returns
+ * and we cannot return, but must deal with the error and destroy
+ * the handle anyway.
+ */
+ if (F_ISSET(dbp, DB_OPEN_CALLED)) {
+ ret = __db_mi_open(dbp->dbenv, "remove", 1);
+ goto err_close;
+ }
+
+ /* Validate arguments. */
+ if ((ret = __db_removechk(dbp, flags)) != 0)
+ goto err_close;
+
+ /*
+ * Subdatabases.
+ */
+ if (subdb != NULL) {
+ /* Subdatabases must be created in named files. */
+ if (name == NULL) {
+ __db_err(dbenv,
+ "multiple databases cannot be created in temporary files");
+ goto err_close;
+ }
+ return (__db_subdb_remove(dbp, name, subdb));
+ }
+
+ if ((ret = dbp->open(dbp,
+ name, NULL, DB_UNKNOWN, DB_RDWRMASTER, 0)) != 0)
+ goto err_close;
+
+ if (LOGGING_ON(dbenv) && (ret = __log_file_lock(dbp)) != 0)
+ goto err_close;
+
+ if ((ret = dbp->sync(dbp, 0)) != 0)
+ goto err_close;
+
+ /* Start the transaction and log the delete. */
+ if (TXN_ON(dbenv) && (ret = __db_metabegin(dbp, &remove_lock)) != 0)
+ goto err_close;
+
+ if (LOGGING_ON(dbenv)) {
+ memset(&namedbt, 0, sizeof(namedbt));
+ namedbt.data = (char *)name;
+ namedbt.size = strlen(name) + 1;
+
+ if ((ret = __crdel_delete_log(dbenv,
+ dbp->open_txn, &newlsn, DB_FLUSH,
+ dbp->log_fileid, &namedbt)) != 0) {
+ __db_err(dbenv,
+ "%s: %s", name, db_strerror(ret));
+ goto err;
+ }
+ }
+
+ /* Find the real name of the file. */
+ if ((ret = __db_appname(dbenv,
+ DB_APP_DATA, NULL, name, 0, NULL, &real_name)) != 0)
+ goto err;
+
+ /*
+ * XXX
+ * We don't bother to open the file and call __memp_fremove on the mpf.
+ * There is a potential race here. It is at least possible that, if
+ * the unique filesystem ID (dev/inode pair on UNIX) is reallocated
+ * within a second (the granularity of the fileID timestamp), a new
+ * file open will get the same fileID as the file being "removed".
+ * We may actually want to open the file and call __memp_fremove on
+ * the mpf to get around this.
+ */
+
+ /* Create name for backup file. */
+ if (TXN_ON(dbenv)) {
+ if ((ret =
+ __db_backup_name(dbenv, name, &backup, &newlsn)) != 0)
+ goto err;
+ if ((ret = __db_appname(dbenv,
+ DB_APP_DATA, NULL, backup, 0, NULL, &real_back)) != 0)
+ goto err;
+ }
+
+ callback_func = __db_remove_callback;
+ cookie = real_back;
+ DB_TEST_RECOVERY(dbp, DB_TEST_PRERENAME, ret, name);
+ if (dbp->db_am_remove != NULL &&
+ (ret = dbp->db_am_remove(dbp,
+ name, subdb, &newlsn, &callback_func, &cookie)) != 0)
+ goto err;
+ /*
+ * On Windows, the underlying file must be closed to perform a remove.
+ * Nothing later in __db_remove requires that it be open, and the
+ * dbp->close closes it anyway, so we just close it early.
+ */
+ (void)__memp_fremove(dbp->mpf);
+ if ((ret = memp_fclose(dbp->mpf)) != 0)
+ goto err;
+ dbp->mpf = NULL;
+
+ if (TXN_ON(dbenv))
+ ret = __os_rename(dbenv, real_name, real_back);
+ else
+ ret = __os_unlink(dbenv, real_name);
+
+ DB_TEST_RECOVERY(dbp, DB_TEST_POSTRENAME, ret, name);
+
+err:
+DB_TEST_RECOVERY_LABEL
+ /*
+ * End the transaction, committing the transaction if we were
+ * successful, aborting otherwise.
+ */
+ if (dbp->open_txn != NULL && (t_ret = __db_metaend(dbp, &remove_lock,
+ ret == 0, callback_func, cookie)) != 0 && ret == 0)
+ ret = t_ret;
+
+ /* FALLTHROUGH */
+
+err_close:
+ if (real_back != NULL)
+ __os_freestr(real_back);
+ if (real_name != NULL)
+ __os_freestr(real_name);
+ if (backup != NULL)
+ __os_freestr(backup);
+
+ /* We no longer have an mpool, so syncing would be disastrous. */
+ if ((t_ret = dbp->close(dbp, DB_NOSYNC)) != 0 && ret == 0)
+ ret = t_ret;
+
+ return (ret);
+}
+
+/*
+ * __db_subdb_remove --
+ * Remove a subdatabase.
+ */
+static int
+__db_subdb_remove(dbp, name, subdb)
+ DB *dbp;
+ const char *name, *subdb;
+{
+ DB *mdbp;
+ DBC *dbc;
+ DB_ENV *dbenv;
+ DB_LOCK remove_lock;
+ db_pgno_t meta_pgno;
+ int ret, t_ret;
+
+ mdbp = NULL;
+ dbc = NULL;
+ dbenv = dbp->dbenv;
+
+ /* Start the transaction. */
+ if (TXN_ON(dbenv) && (ret = __db_metabegin(dbp, &remove_lock)) != 0)
+ goto err_close;
+
+ /*
+ * Open the subdatabase. We can use the user's DB handle for this
+ * purpose, I think.
+ */
+ if ((ret = __db_open(dbp, name, subdb, DB_UNKNOWN, 0, 0)) != 0)
+ goto err;
+
+ /* Free up the pages in the subdatabase. */
+ switch (dbp->type) {
+ case DB_BTREE:
+ case DB_RECNO:
+ if ((ret = __bam_reclaim(dbp, dbp->open_txn)) != 0)
+ goto err;
+ break;
+ case DB_HASH:
+ if ((ret = __ham_reclaim(dbp, dbp->open_txn)) != 0)
+ goto err;
+ break;
+ default:
+ ret = __db_unknown_type(dbp->dbenv,
+ "__db_subdb_remove", dbp->type);
+ goto err;
+ }
+
+ /*
+ * Remove the entry from the main database and free the subdatabase
+ * metadata page.
+ */
+ if ((ret = __db_master_open(dbp, name, 0, 0, &mdbp)) != 0)
+ goto err;
+
+ if ((ret = __db_master_update(mdbp,
+ subdb, dbp->type, &meta_pgno, MU_REMOVE, NULL, 0)) != 0)
+ goto err;
+
+err: /*
+ * End the transaction, committing the transaction if we were
+ * successful, aborting otherwise.
+ */
+ if (dbp->open_txn != NULL && (t_ret = __db_metaend(dbp,
+ &remove_lock, ret == 0, NULL, NULL)) != 0 && ret == 0)
+ ret = t_ret;
+
+err_close:
+ /*
+ * Close the user's DB handle -- do this LAST to avoid smashing the
+ * the transaction information.
+ */
+ if ((t_ret = dbp->close(dbp, 0)) != 0 && ret == 0)
+ ret = t_ret;
+
+ if (mdbp != NULL && (t_ret = mdbp->close(mdbp, 0)) != 0 && ret == 0)
+ ret = t_ret;
+
+ return (ret);
+}
+
+/*
+ * __db_rename
+ * Rename method for DB.
+ *
+ * PUBLIC: int __db_rename __P((DB *,
+ * PUBLIC: const char *, const char *, const char *, u_int32_t));
+ */
+int
+__db_rename(dbp, filename, subdb, newname, flags)
+ DB *dbp;
+ const char *filename, *subdb, *newname;
+ u_int32_t flags;
+{
+ DBT namedbt, newnamedbt;
+ DB_ENV *dbenv;
+ DB_LOCK remove_lock;
+ DB_LSN newlsn;
+ char *real_name, *real_newname;
+ int ret, t_ret;
+
+ dbenv = dbp->dbenv;
+ ret = 0;
+ real_name = real_newname = NULL;
+
+ PANIC_CHECK(dbenv);
+ /*
+ * Cannot use DB_ILLEGAL_AFTER_OPEN here because that returns
+ * and we cannot return, but must deal with the error and destroy
+ * the handle anyway.
+ */
+ if (F_ISSET(dbp, DB_OPEN_CALLED)) {
+ ret = __db_mi_open(dbp->dbenv, "rename", 1);
+ goto err_close;
+ }
+
+ /* Validate arguments -- has same rules as remove. */
+ if ((ret = __db_removechk(dbp, flags)) != 0)
+ goto err_close;
+
+ /*
+ * Subdatabases.
+ */
+ if (subdb != NULL) {
+ if (filename == NULL) {
+ __db_err(dbenv,
+ "multiple databases cannot be created in temporary files");
+ goto err_close;
+ }
+ return (__db_subdb_rename(dbp, filename, subdb, newname));
+ }
+
+ if ((ret = dbp->open(dbp,
+ filename, NULL, DB_UNKNOWN, DB_RDWRMASTER, 0)) != 0)
+ goto err_close;
+
+ if (LOGGING_ON(dbenv) && (ret = __log_file_lock(dbp)) != 0)
+ goto err_close;
+
+ if ((ret = dbp->sync(dbp, 0)) != 0)
+ goto err_close;
+
+ /* Start the transaction and log the rename. */
+ if (TXN_ON(dbenv) && (ret = __db_metabegin(dbp, &remove_lock)) != 0)
+ goto err_close;
+
+ if (LOGGING_ON(dbenv)) {
+ memset(&namedbt, 0, sizeof(namedbt));
+ namedbt.data = (char *)filename;
+ namedbt.size = strlen(filename) + 1;
+
+ memset(&newnamedbt, 0, sizeof(namedbt));
+ newnamedbt.data = (char *)newname;
+ newnamedbt.size = strlen(newname) + 1;
+
+ if ((ret = __crdel_rename_log(dbenv, dbp->open_txn,
+ &newlsn, 0, dbp->log_fileid, &namedbt, &newnamedbt)) != 0) {
+ __db_err(dbenv, "%s: %s", filename, db_strerror(ret));
+ goto err;
+ }
+
+ if ((ret = __log_filelist_update(dbenv, dbp,
+ dbp->log_fileid, newname, NULL)) != 0)
+ goto err;
+ }
+
+ /* Find the real name of the file. */
+ if ((ret = __db_appname(dbenv,
+ DB_APP_DATA, NULL, filename, 0, NULL, &real_name)) != 0)
+ goto err;
+
+ /* Find the real newname of the file. */
+ if ((ret = __db_appname(dbenv,
+ DB_APP_DATA, NULL, newname, 0, NULL, &real_newname)) != 0)
+ goto err;
+
+ /*
+ * It is an error to rename a file over one that already exists,
+ * as that wouldn't be transaction-safe.
+ */
+ if (__os_exists(real_newname, NULL) == 0) {
+ ret = EEXIST;
+ __db_err(dbenv, "rename: file %s exists", real_newname);
+ goto err;
+ }
+
+ DB_TEST_RECOVERY(dbp, DB_TEST_PRERENAME, ret, filename);
+ if (dbp->db_am_rename != NULL &&
+ (ret = dbp->db_am_rename(dbp, filename, subdb, newname)) != 0)
+ goto err;
+ /*
+ * We have to flush the cache for a couple of reasons. First, the
+ * underlying MPOOLFILE maintains a "name" that unrelated processes
+ * can use to open the file in order to flush pages, and that name
+ * is about to be wrong. Second, on Windows the unique file ID is
+ * generated from the file's name, not other file information as is
+ * the case on UNIX, and so a subsequent open of the old file name
+ * could conceivably result in a matching "unique" file ID.
+ */
+ if ((ret = __memp_fremove(dbp->mpf)) != 0)
+ goto err;
+
+ /*
+ * On Windows, the underlying file must be closed to perform a rename.
+ * Nothing later in __db_rename requires that it be open, and the call
+ * to dbp->close closes it anyway, so we just close it early.
+ */
+ if ((ret = memp_fclose(dbp->mpf)) != 0)
+ goto err;
+ dbp->mpf = NULL;
+
+ ret = __os_rename(dbenv, real_name, real_newname);
+ DB_TEST_RECOVERY(dbp, DB_TEST_POSTRENAME, ret, newname);
+
+DB_TEST_RECOVERY_LABEL
+err: if (dbp->open_txn != NULL && (t_ret = __db_metaend(dbp,
+ &remove_lock, ret == 0, NULL, NULL)) != 0 && ret == 0)
+ ret = t_ret;
+
+err_close:
+ /* We no longer have an mpool, so syncing would be disastrous. */
+ dbp->close(dbp, DB_NOSYNC);
+ if (real_name != NULL)
+ __os_freestr(real_name);
+ if (real_newname != NULL)
+ __os_freestr(real_newname);
+
+ return (ret);
+}
+
+/*
+ * __db_subdb_rename --
+ * Rename a subdatabase.
+ */
+static int
+__db_subdb_rename(dbp, name, subdb, newname)
+ DB *dbp;
+ const char *name, *subdb, *newname;
+{
+ DB *mdbp;
+ DBC *dbc;
+ DB_ENV *dbenv;
+ DB_LOCK remove_lock;
+ int ret, t_ret;
+
+ mdbp = NULL;
+ dbc = NULL;
+ dbenv = dbp->dbenv;
+
+ /* Start the transaction. */
+ if (TXN_ON(dbenv) && (ret = __db_metabegin(dbp, &remove_lock)) != 0)
+ goto err_close;
+
+ /*
+ * Open the subdatabase. We can use the user's DB handle for this
+ * purpose, I think.
+ */
+ if ((ret = __db_open(dbp, name, subdb, DB_UNKNOWN, 0, 0)) != 0)
+ goto err;
+
+ /*
+ * Rename the entry in the main database.
+ */
+ if ((ret = __db_master_open(dbp, name, 0, 0, &mdbp)) != 0)
+ goto err;
+
+ if ((ret = __db_master_update(mdbp,
+ subdb, dbp->type, NULL, MU_RENAME, newname, 0)) != 0)
+ goto err;
+
+err: /*
+ * End the transaction, committing the transaction if we were
+ * successful, aborting otherwise.
+ */
+ if (dbp->open_txn != NULL && (t_ret = __db_metaend(dbp,
+ &remove_lock, ret == 0, NULL, NULL)) != 0 && ret == 0)
+ ret = t_ret;
+
+err_close:
+ /*
+ * Close the user's DB handle -- do this LAST to avoid smashing the
+ * the transaction information.
+ */
+ if ((t_ret = dbp->close(dbp, 0)) != 0 && ret == 0)
+ ret = t_ret;
+
+ if (mdbp != NULL && (t_ret = mdbp->close(mdbp, 0)) != 0 && ret == 0)
+ ret = t_ret;
+
+ return (ret);
+}
+
+/*
+ * __db_metabegin --
+ *
+ * Begin a meta-data operation. This involves doing any required locking,
+ * potentially beginning a transaction and then telling the caller if you
+ * did or did not begin the transaction.
+ *
+ * The writing flag indicates if the caller is actually allowing creates
+ * or doing deletes (i.e., if the caller is opening and not creating, then
+ * we don't need to do any of this).
+ * PUBLIC: int __db_metabegin __P((DB *, DB_LOCK *));
+ */
+int
+__db_metabegin(dbp, lockp)
+ DB *dbp;
+ DB_LOCK *lockp;
+{
+ DB_ENV *dbenv;
+ DBT dbplock;
+ u_int32_t locker, lockval;
+ int ret;
+
+ dbenv = dbp->dbenv;
+
+ lockp->off = LOCK_INVALID;
+
+ /*
+ * There is no single place where we can know that we are or are not
+ * going to be creating any files and/or subdatabases, so we will
+ * always begin a tranasaction when we start creating one. If we later
+ * discover that this was unnecessary, we will abort the transaction.
+ * Recovery is written so that if we log a file create, but then
+ * discover that we didn't have to do it, we recover correctly. The
+ * file recovery design document has details.
+ *
+ * We need to single thread all create and delete operations, so if we
+ * are running with locking, we must obtain a lock. We use lock_id to
+ * generate a unique locker id and use a handcrafted DBT as the object
+ * on which we are locking.
+ */
+ if (LOCKING_ON(dbenv)) {
+ if ((ret = lock_id(dbenv, &locker)) != 0)
+ return (ret);
+ lockval = 0;
+ dbplock.data = &lockval;
+ dbplock.size = sizeof(lockval);
+ if ((ret = lock_get(dbenv,
+ locker, 0, &dbplock, DB_LOCK_WRITE, lockp)) != 0)
+ return (ret);
+ }
+
+ return (txn_begin(dbenv, NULL, &dbp->open_txn, 0));
+}
+
+/*
+ * __db_metaend --
+ * End a meta-data operation.
+ * PUBLIC: int __db_metaend __P((DB *,
+ * PUBLIC: DB_LOCK *, int, int (*)(DB *, void *), void *));
+ */
+int
+__db_metaend(dbp, lockp, commit, callback, cookie)
+ DB *dbp;
+ DB_LOCK *lockp;
+ int commit, (*callback) __P((DB *, void *));
+ void *cookie;
+{
+ DB_ENV *dbenv;
+ int ret, t_ret;
+
+ ret = 0;
+ dbenv = dbp->dbenv;
+
+ /* End the transaction. */
+ if (commit) {
+ if ((ret = txn_commit(dbp->open_txn, DB_TXN_SYNC)) == 0) {
+ /*
+ * Unlink any underlying file, we've committed the
+ * transaction.
+ */
+ if (callback != NULL)
+ ret = callback(dbp, cookie);
+ }
+ } else if ((t_ret = txn_abort(dbp->open_txn)) && ret == 0)
+ ret = t_ret;
+
+ /* Release our lock. */
+ if (lockp->off != LOCK_INVALID &&
+ (t_ret = lock_put(dbenv, lockp)) != 0 && ret == 0)
+ ret = t_ret;
+
+ return (ret);
+}
+
+/*
+ * __db_log_page
+ * Log a meta-data or root page during a create operation.
+ *
+ * PUBLIC: int __db_log_page __P((DB *,
+ * PUBLIC: const char *, DB_LSN *, db_pgno_t, PAGE *));
+ */
+int
+__db_log_page(dbp, name, lsn, pgno, page)
+ DB *dbp;
+ const char *name;
+ DB_LSN *lsn;
+ db_pgno_t pgno;
+ PAGE *page;
+{
+ DBT name_dbt, page_dbt;
+ DB_LSN new_lsn;
+ int ret;
+
+ if (dbp->open_txn == NULL)
+ return (0);
+
+ memset(&page_dbt, 0, sizeof(page_dbt));
+ page_dbt.size = dbp->pgsize;
+ page_dbt.data = page;
+ if (pgno == PGNO_BASE_MD) {
+ /*
+ * !!!
+ * Make sure that we properly handle a null name. The old
+ * Tcl sent us pathnames of the form ""; it may be the case
+ * that the new Tcl doesn't do that, so we can get rid of
+ * the second check here.
+ */
+ memset(&name_dbt, 0, sizeof(name_dbt));
+ name_dbt.data = (char *)name;
+ if (name == NULL || *name == '\0')
+ name_dbt.size = 0;
+ else
+ name_dbt.size = strlen(name) + 1;
+
+ ret = __crdel_metapage_log(dbp->dbenv,
+ dbp->open_txn, &new_lsn, DB_FLUSH,
+ dbp->log_fileid, &name_dbt, pgno, &page_dbt);
+ } else
+ ret = __crdel_metasub_log(dbp->dbenv, dbp->open_txn,
+ &new_lsn, 0, dbp->log_fileid, pgno, &page_dbt, lsn);
+
+ if (ret == 0)
+ page->lsn = new_lsn;
+ return (ret);
+}
+
+/*
+ * __db_backup_name
+ * Create the backup file name for a given file.
+ *
+ * PUBLIC: int __db_backup_name __P((DB_ENV *,
+ * PUBLIC: const char *, char **, DB_LSN *));
+ */
+#undef BACKUP_PREFIX
+#define BACKUP_PREFIX "__db."
+
+#undef MAX_LSN_TO_TEXT
+#define MAX_LSN_TO_TEXT 21
+int
+__db_backup_name(dbenv, name, backup, lsn)
+ DB_ENV *dbenv;
+ const char *name;
+ char **backup;
+ DB_LSN *lsn;
+{
+ size_t len;
+ int plen, ret;
+ char *p, *retp;
+
+ len = strlen(name) + strlen(BACKUP_PREFIX) + MAX_LSN_TO_TEXT + 1;
+
+ if ((ret = __os_malloc(dbenv, len, NULL, &retp)) != 0)
+ return (ret);
+
+ /*
+ * Create the name. Backup file names are of the form:
+ *
+ * __db.name.0x[lsn-file].0x[lsn-offset]
+ *
+ * which guarantees uniqueness.
+ *
+ * However, name may contain an env-relative path in it.
+ * In that case, put the __db. after the last portion of
+ * the pathname.
+ */
+ if ((p = __db_rpath(name)) == NULL)
+ snprintf(retp, len,
+ "%s%s.0x%x0x%x", BACKUP_PREFIX, name,
+ lsn->file, lsn->offset);
+ else {
+ plen = p - name + 1;
+ p++;
+ snprintf(retp, len,
+ "%.*s%s%s.0x%x0x%x", plen, name, BACKUP_PREFIX, p,
+ lsn->file, lsn->offset);
+ }
+
+ *backup = retp;
+ return (0);
+}
+
+/*
+ * __db_remove_callback --
+ * Callback function -- on file remove commit, it unlinks the backing
+ * file.
+ */
+static int
+__db_remove_callback(dbp, cookie)
+ DB *dbp;
+ void *cookie;
+{
+ return (__os_unlink(dbp->dbenv, cookie));
+}
+
+/*
+ * __dblist_get --
+ * Get the first element of dbenv->dblist with
+ * dbp->adj_fileid matching adjid.
+ *
+ * PUBLIC: DB *__dblist_get __P((DB_ENV *, u_int32_t));
+ */
+DB *
+__dblist_get(dbenv, adjid)
+ DB_ENV *dbenv;
+ u_int32_t adjid;
+{
+ DB *dbp;
+
+ for (dbp = LIST_FIRST(&dbenv->dblist);
+ dbp != NULL && dbp->adj_fileid != adjid;
+ dbp = LIST_NEXT(dbp, dblistlinks))
+ ;
+
+ return (dbp);
+}
+
+#if CONFIG_TEST
+/*
+ * __db_testcopy
+ * Create a copy of all backup files and our "main" DB.
+ *
+ * PUBLIC: int __db_testcopy __P((DB *, const char *));
+ */
+int
+__db_testcopy(dbp, name)
+ DB *dbp;
+ const char *name;
+{
+ if (dbp->type == DB_QUEUE)
+ return (__qam_testdocopy(dbp, name));
+ else
+ return (__db_testdocopy(dbp, name));
+}
+
+static int
+__qam_testdocopy(dbp, name)
+ DB *dbp;
+ const char *name;
+{
+ QUEUE_FILELIST *filelist, *fp;
+ char buf[256], *dir;
+ int ret;
+
+ filelist = NULL;
+ if ((ret = __db_testdocopy(dbp, name)) != 0)
+ return (ret);
+ if (dbp->mpf != NULL &&
+ (ret = __qam_gen_filelist(dbp, &filelist)) != 0)
+ return (ret);
+
+ if (filelist == NULL)
+ return (0);
+ dir = ((QUEUE *)dbp->q_internal)->dir;
+ for (fp = filelist; fp->mpf != NULL; fp++) {
+ snprintf(buf, sizeof(buf), QUEUE_EXTENT, dir, name, fp->id);
+ if ((ret = __db_testdocopy(dbp, buf)) != 0)
+ return (ret);
+ }
+
+ __os_free(filelist, 0);
+ return (0);
+}
+
+/*
+ * __db_testdocopy
+ * Create a copy of all backup files and our "main" DB.
+ *
+ */
+static int
+__db_testdocopy(dbp, name)
+ DB *dbp;
+ const char *name;
+{
+ size_t len;
+ int dircnt, i, ret;
+ char **namesp, *backup, *copy, *dir, *p, *real_name;
+ real_name = NULL;
+ /* Get the real backing file name. */
+ if ((ret = __db_appname(dbp->dbenv,
+ DB_APP_DATA, NULL, name, 0, NULL, &real_name)) != 0)
+ return (ret);
+
+ copy = backup = NULL;
+ namesp = NULL;
+
+ /*
+ * Maximum size of file, including adding a ".afterop".
+ */
+ len = strlen(real_name) + strlen(BACKUP_PREFIX) + MAX_LSN_TO_TEXT + 9;
+
+ if ((ret = __os_malloc(dbp->dbenv, len, NULL, &copy)) != 0)
+ goto out;
+
+ if ((ret = __os_malloc(dbp->dbenv, len, NULL, &backup)) != 0)
+ goto out;
+
+ /*
+ * First copy the file itself.
+ */
+ snprintf(copy, len, "%s.afterop", real_name);
+ __db_makecopy(real_name, copy);
+
+ if ((ret = __os_strdup(dbp->dbenv, real_name, &dir)) != 0)
+ goto out;
+ __os_freestr(real_name);
+ real_name = NULL;
+ /*
+ * Create the name. Backup file names are of the form:
+ *
+ * __db.name.0x[lsn-file].0x[lsn-offset]
+ *
+ * which guarantees uniqueness. We want to look for the
+ * backup name, followed by a '.0x' (so that if they have
+ * files named, say, 'a' and 'abc' we won't match 'abc' when
+ * looking for 'a'.
+ */
+ snprintf(backup, len, "%s%s.0x", BACKUP_PREFIX, name);
+
+ /*
+ * We need the directory path to do the __os_dirlist.
+ */
+ p = __db_rpath(dir);
+ if (p != NULL)
+ *p = '\0';
+ ret = __os_dirlist(dbp->dbenv, dir, &namesp, &dircnt);
+#if DIAGNOSTIC
+ /*
+ * XXX
+ * To get the memory guard code to work because it uses strlen and we
+ * just moved the end of the string somewhere sooner. This causes the
+ * guard code to fail because it looks at one byte past the end of the
+ * string.
+ */
+ *p = '/';
+#endif
+ __os_freestr(dir);
+ if (ret != 0)
+ goto out;
+ for (i = 0; i < dircnt; i++) {
+ /*
+ * Need to check if it is a backup file for this.
+ * No idea what namesp[i] may be or how long, so
+ * must use strncmp and not memcmp. We don't want
+ * to use strcmp either because we are only matching
+ * the first part of the real file's name. We don't
+ * know its LSN's.
+ */
+ if (strncmp(namesp[i], backup, strlen(backup)) == 0) {
+ if ((ret = __db_appname(dbp->dbenv, DB_APP_DATA,
+ NULL, namesp[i], 0, NULL, &real_name)) != 0)
+ goto out;
+
+ /*
+ * This should not happen. Check that old
+ * .afterop files aren't around.
+ * If so, just move on.
+ */
+ if (strstr(real_name, ".afterop") != NULL) {
+ __os_freestr(real_name);
+ real_name = NULL;
+ continue;
+ }
+ snprintf(copy, len, "%s.afterop", real_name);
+ __db_makecopy(real_name, copy);
+ __os_freestr(real_name);
+ real_name = NULL;
+ }
+ }
+out:
+ if (backup != NULL)
+ __os_freestr(backup);
+ if (copy != NULL)
+ __os_freestr(copy);
+ if (namesp != NULL)
+ __os_dirfree(namesp, dircnt);
+ if (real_name != NULL)
+ __os_freestr(real_name);
+ return (ret);
+}
+
+static void
+__db_makecopy(src, dest)
+ const char *src, *dest;
+{
+ DB_FH rfh, wfh;
+ size_t rcnt, wcnt;
+ char *buf;
+
+ memset(&rfh, 0, sizeof(rfh));
+ memset(&wfh, 0, sizeof(wfh));
+
+ if (__os_malloc(NULL, 1024, NULL, &buf) != 0)
+ return;
+
+ if (__os_open(NULL,
+ src, DB_OSO_RDONLY, __db_omode("rw----"), &rfh) != 0)
+ goto err;
+ if (__os_open(NULL, dest,
+ DB_OSO_CREATE | DB_OSO_TRUNC, __db_omode("rw----"), &wfh) != 0)
+ goto err;
+
+ for (;;)
+ if (__os_read(NULL, &rfh, buf, 1024, &rcnt) < 0 || rcnt == 0 ||
+ __os_write(NULL, &wfh, buf, rcnt, &wcnt) < 0 || wcnt != rcnt)
+ break;
+
+err: __os_free(buf, 1024);
+ if (F_ISSET(&rfh, DB_FH_VALID))
+ __os_closehandle(&rfh);
+ if (F_ISSET(&wfh, DB_FH_VALID))
+ __os_closehandle(&wfh);
+}
+#endif
diff --git a/bdb/db/db.src b/bdb/db/db.src
new file mode 100644
index 00000000000..b695e1360c5
--- /dev/null
+++ b/bdb/db/db.src
@@ -0,0 +1,178 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ * Sleepycat Software. All rights reserved.
+ *
+ * $Id: db.src,v 11.8 2000/02/17 20:24:07 bostic Exp $
+ */
+
+PREFIX db
+
+INCLUDE #include "db_config.h"
+INCLUDE
+INCLUDE #ifndef NO_SYSTEM_INCLUDES
+INCLUDE #include <sys/types.h>
+INCLUDE
+INCLUDE #include <ctype.h>
+INCLUDE #include <errno.h>
+INCLUDE #include <string.h>
+INCLUDE #endif
+INCLUDE
+INCLUDE #include "db_int.h"
+INCLUDE #include "db_page.h"
+INCLUDE #include "db_dispatch.h"
+INCLUDE #include "db_am.h"
+INCLUDE #include "txn.h"
+INCLUDE
+
+/*
+ * addrem -- Add or remove an entry from a duplicate page.
+ *
+ * opcode: identifies if this is an add or delete.
+ * fileid: file identifier of the file being modified.
+ * pgno: duplicate page number.
+ * indx: location at which to insert or delete.
+ * nbytes: number of bytes added/removed to/from the page.
+ * hdr: header for the data item.
+ * dbt: data that is deleted or is to be added.
+ * pagelsn: former lsn of the page.
+ *
+ * If the hdr was NULL then, the dbt is a regular B_KEYDATA.
+ * If the dbt was NULL then the hdr is a complete item to be
+ * pasted on the page.
+ */
+BEGIN addrem 41
+ARG opcode u_int32_t lu
+ARG fileid int32_t ld
+ARG pgno db_pgno_t lu
+ARG indx u_int32_t lu
+ARG nbytes size_t lu
+DBT hdr DBT s
+DBT dbt DBT s
+POINTER pagelsn DB_LSN * lu
+END
+
+/*
+ * split -- Handles the split of a duplicate page.
+ *
+ * opcode: defines whether we are splitting from or splitting onto
+ * fileid: file identifier of the file being modified.
+ * pgno: page number being split.
+ * pageimage: entire page contents.
+ * pagelsn: former lsn of the page.
+ */
+DEPRECATED split 42
+ARG opcode u_int32_t lu
+ARG fileid int32_t ld
+ARG pgno db_pgno_t lu
+DBT pageimage DBT s
+POINTER pagelsn DB_LSN * lu
+END
+
+/*
+ * big -- Handles addition and deletion of big key/data items.
+ *
+ * opcode: identifies get/put.
+ * fileid: file identifier of the file being modified.
+ * pgno: page onto which data is being added/removed.
+ * prev_pgno: the page before the one we are logging.
+ * next_pgno: the page after the one we are logging.
+ * dbt: data being written onto the page.
+ * pagelsn: former lsn of the orig_page.
+ * prevlsn: former lsn of the prev_pgno.
+ * nextlsn: former lsn of the next_pgno. This is not currently used, but
+ * may be used later if we actually do overwrites of big key/
+ * data items in place.
+ */
+BEGIN big 43
+ARG opcode u_int32_t lu
+ARG fileid int32_t ld
+ARG pgno db_pgno_t lu
+ARG prev_pgno db_pgno_t lu
+ARG next_pgno db_pgno_t lu
+DBT dbt DBT s
+POINTER pagelsn DB_LSN * lu
+POINTER prevlsn DB_LSN * lu
+POINTER nextlsn DB_LSN * lu
+END
+
+/*
+ * ovref -- Handles increment/decrement of overflow page reference count.
+ *
+ * fileid: identifies the file being modified.
+ * pgno: page number whose ref count is being incremented/decremented.
+ * adjust: the adjustment being made.
+ * lsn: the page's original lsn.
+ */
+BEGIN ovref 44
+ARG fileid int32_t ld
+ARG pgno db_pgno_t lu
+ARG adjust int32_t ld
+POINTER lsn DB_LSN * lu
+END
+
+/*
+ * relink -- Handles relinking around a page.
+ *
+ * opcode: indicates if this is an addpage or delete page
+ * pgno: the page being changed.
+ * lsn the page's original lsn.
+ * prev: the previous page.
+ * lsn_prev: the previous page's original lsn.
+ * next: the next page.
+ * lsn_next: the previous page's original lsn.
+ */
+BEGIN relink 45
+ARG opcode u_int32_t lu
+ARG fileid int32_t ld
+ARG pgno db_pgno_t lu
+POINTER lsn DB_LSN * lu
+ARG prev db_pgno_t lu
+POINTER lsn_prev DB_LSN * lu
+ARG next db_pgno_t lu
+POINTER lsn_next DB_LSN * lu
+END
+
+/*
+ * Addpage -- Handles adding a new duplicate page onto the end of
+ * an existing duplicate page.
+ * fileid: identifies the file being changed.
+ * pgno: page number to which a new page is being added.
+ * lsn: lsn of pgno
+ * nextpgno: new page number being added.
+ * nextlsn: lsn of nextpgno;
+ */
+DEPRECATED addpage 46
+ARG fileid int32_t ld
+ARG pgno db_pgno_t lu
+POINTER lsn DB_LSN * lu
+ARG nextpgno db_pgno_t lu
+POINTER nextlsn DB_LSN * lu
+END
+
+/*
+ * Debug -- log an operation upon entering an access method.
+ * op: Operation (cursor, c_close, c_get, c_put, c_del,
+ * get, put, delete).
+ * fileid: identifies the file being acted upon.
+ * key: key paramater
+ * data: data parameter
+ * flags: flags parameter
+ */
+BEGIN debug 47
+DBT op DBT s
+ARG fileid int32_t ld
+DBT key DBT s
+DBT data DBT s
+ARG arg_flags u_int32_t lu
+END
+
+/*
+ * noop -- do nothing, but get an LSN.
+ */
+BEGIN noop 48
+ARG fileid int32_t ld
+ARG pgno db_pgno_t lu
+POINTER prevlsn DB_LSN * lu
+END
diff --git a/bdb/db/db_am.c b/bdb/db/db_am.c
new file mode 100644
index 00000000000..2d224566904
--- /dev/null
+++ b/bdb/db/db_am.c
@@ -0,0 +1,511 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1998, 1999, 2000
+ * Sleepycat Software. All rights reserved.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_am.c,v 11.42 2001/01/11 18:19:50 bostic Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_shash.h"
+#include "btree.h"
+#include "hash.h"
+#include "qam.h"
+#include "lock.h"
+#include "mp.h"
+#include "txn.h"
+#include "db_am.h"
+#include "db_ext.h"
+
+/*
+ * __db_cursor --
+ * Allocate and return a cursor.
+ *
+ * PUBLIC: int __db_cursor __P((DB *, DB_TXN *, DBC **, u_int32_t));
+ */
+int
+__db_cursor(dbp, txn, dbcp, flags)
+ DB *dbp;
+ DB_TXN *txn;
+ DBC **dbcp;
+ u_int32_t flags;
+{
+ DB_ENV *dbenv;
+ DBC *dbc;
+ db_lockmode_t mode;
+ u_int32_t op;
+ int ret;
+
+ dbenv = dbp->dbenv;
+
+ PANIC_CHECK(dbenv);
+ DB_ILLEGAL_BEFORE_OPEN(dbp, "DB->cursor");
+
+ /* Check for invalid flags. */
+ if ((ret = __db_cursorchk(dbp, flags, F_ISSET(dbp, DB_AM_RDONLY))) != 0)
+ return (ret);
+
+ if ((ret =
+ __db_icursor(dbp, txn, dbp->type, PGNO_INVALID, 0, dbcp)) != 0)
+ return (ret);
+ dbc = *dbcp;
+
+ /*
+ * If this is CDB, do all the locking in the interface, which is
+ * right here.
+ */
+ if (CDB_LOCKING(dbenv)) {
+ op = LF_ISSET(DB_OPFLAGS_MASK);
+ mode = (op == DB_WRITELOCK) ? DB_LOCK_WRITE :
+ ((op == DB_WRITECURSOR) ? DB_LOCK_IWRITE : DB_LOCK_READ);
+ if ((ret = lock_get(dbenv, dbc->locker, 0,
+ &dbc->lock_dbt, mode, &dbc->mylock)) != 0) {
+ (void)__db_c_close(dbc);
+ return (ret);
+ }
+ if (op == DB_WRITECURSOR)
+ F_SET(dbc, DBC_WRITECURSOR);
+ if (op == DB_WRITELOCK)
+ F_SET(dbc, DBC_WRITER);
+ }
+
+ return (0);
+}
+
+/*
+ * __db_icursor --
+ * Internal version of __db_cursor. If dbcp is
+ * non-NULL it is assumed to point to an area to
+ * initialize as a cursor.
+ *
+ * PUBLIC: int __db_icursor
+ * PUBLIC: __P((DB *, DB_TXN *, DBTYPE, db_pgno_t, int, DBC **));
+ */
+int
+__db_icursor(dbp, txn, dbtype, root, is_opd, dbcp)
+ DB *dbp;
+ DB_TXN *txn;
+ DBTYPE dbtype;
+ db_pgno_t root;
+ int is_opd;
+ DBC **dbcp;
+{
+ DBC *dbc, *adbc;
+ DBC_INTERNAL *cp;
+ DB_ENV *dbenv;
+ int allocated, ret;
+
+ dbenv = dbp->dbenv;
+ allocated = 0;
+
+ /*
+ * Take one from the free list if it's available. Take only the
+ * right type. With off page dups we may have different kinds
+ * of cursors on the queue for a single database.
+ */
+ MUTEX_THREAD_LOCK(dbenv, dbp->mutexp);
+ for (dbc = TAILQ_FIRST(&dbp->free_queue);
+ dbc != NULL; dbc = TAILQ_NEXT(dbc, links))
+ if (dbtype == dbc->dbtype) {
+ TAILQ_REMOVE(&dbp->free_queue, dbc, links);
+ dbc->flags = 0;
+ break;
+ }
+ MUTEX_THREAD_UNLOCK(dbenv, dbp->mutexp);
+
+ if (dbc == NULL) {
+ if ((ret = __os_calloc(dbp->dbenv, 1, sizeof(DBC), &dbc)) != 0)
+ return (ret);
+ allocated = 1;
+ dbc->flags = 0;
+
+ dbc->dbp = dbp;
+
+ /* Set up locking information. */
+ if (LOCKING_ON(dbenv)) {
+ /*
+ * If we are not threaded, then there is no need to
+ * create new locker ids. We know that no one else
+ * is running concurrently using this DB, so we can
+ * take a peek at any cursors on the active queue.
+ */
+ if (!DB_IS_THREADED(dbp) &&
+ (adbc = TAILQ_FIRST(&dbp->active_queue)) != NULL)
+ dbc->lid = adbc->lid;
+ else
+ if ((ret = lock_id(dbenv, &dbc->lid)) != 0)
+ goto err;
+
+ memcpy(dbc->lock.fileid, dbp->fileid, DB_FILE_ID_LEN);
+ if (CDB_LOCKING(dbenv)) {
+ if (F_ISSET(dbenv, DB_ENV_CDB_ALLDB)) {
+ /*
+ * If we are doing a single lock per
+ * environment, set up the global
+ * lock object just like we do to
+ * single thread creates.
+ */
+ DB_ASSERT(sizeof(db_pgno_t) ==
+ sizeof(u_int32_t));
+ dbc->lock_dbt.size = sizeof(u_int32_t);
+ dbc->lock_dbt.data = &dbc->lock.pgno;
+ dbc->lock.pgno = 0;
+ } else {
+ dbc->lock_dbt.size = DB_FILE_ID_LEN;
+ dbc->lock_dbt.data = dbc->lock.fileid;
+ }
+ } else {
+ dbc->lock.type = DB_PAGE_LOCK;
+ dbc->lock_dbt.size = sizeof(dbc->lock);
+ dbc->lock_dbt.data = &dbc->lock;
+ }
+ }
+ /* Init the DBC internal structure. */
+ switch (dbtype) {
+ case DB_BTREE:
+ case DB_RECNO:
+ if ((ret = __bam_c_init(dbc, dbtype)) != 0)
+ goto err;
+ break;
+ case DB_HASH:
+ if ((ret = __ham_c_init(dbc)) != 0)
+ goto err;
+ break;
+ case DB_QUEUE:
+ if ((ret = __qam_c_init(dbc)) != 0)
+ goto err;
+ break;
+ default:
+ ret = __db_unknown_type(dbp->dbenv,
+ "__db_icursor", dbtype);
+ goto err;
+ }
+
+ cp = dbc->internal;
+ }
+
+ /* Refresh the DBC structure. */
+ dbc->dbtype = dbtype;
+
+ if ((dbc->txn = txn) == NULL)
+ dbc->locker = dbc->lid;
+ else {
+ dbc->locker = txn->txnid;
+ txn->cursors++;
+ }
+
+ if (is_opd)
+ F_SET(dbc, DBC_OPD);
+ if (F_ISSET(dbp, DB_AM_RECOVER))
+ F_SET(dbc, DBC_RECOVER);
+
+ /* Refresh the DBC internal structure. */
+ cp = dbc->internal;
+ cp->opd = NULL;
+
+ cp->indx = 0;
+ cp->page = NULL;
+ cp->pgno = PGNO_INVALID;
+ cp->root = root;
+
+ switch (dbtype) {
+ case DB_BTREE:
+ case DB_RECNO:
+ if ((ret = __bam_c_refresh(dbc)) != 0)
+ goto err;
+ break;
+ case DB_HASH:
+ case DB_QUEUE:
+ break;
+ default:
+ ret = __db_unknown_type(dbp->dbenv, "__db_icursor", dbp->type);
+ goto err;
+ }
+
+ MUTEX_THREAD_LOCK(dbenv, dbp->mutexp);
+ TAILQ_INSERT_TAIL(&dbp->active_queue, dbc, links);
+ F_SET(dbc, DBC_ACTIVE);
+ MUTEX_THREAD_UNLOCK(dbenv, dbp->mutexp);
+
+ *dbcp = dbc;
+ return (0);
+
+err: if (allocated)
+ __os_free(dbc, sizeof(*dbc));
+ return (ret);
+}
+
+#ifdef DEBUG
+/*
+ * __db_cprint --
+ * Display the current cursor list.
+ *
+ * PUBLIC: int __db_cprint __P((DB *));
+ */
+int
+__db_cprint(dbp)
+ DB *dbp;
+{
+ static const FN fn[] = {
+ { DBC_ACTIVE, "active" },
+ { DBC_OPD, "off-page-dup" },
+ { DBC_RECOVER, "recover" },
+ { DBC_RMW, "read-modify-write" },
+ { DBC_WRITECURSOR, "write cursor" },
+ { DBC_WRITEDUP, "internally dup'ed write cursor" },
+ { DBC_WRITER, "short-term write cursor" },
+ { 0, NULL }
+ };
+ DBC *dbc;
+ DBC_INTERNAL *cp;
+ char *s;
+
+ MUTEX_THREAD_LOCK(dbp->dbenv, dbp->mutexp);
+ for (dbc = TAILQ_FIRST(&dbp->active_queue);
+ dbc != NULL; dbc = TAILQ_NEXT(dbc, links)) {
+ switch (dbc->dbtype) {
+ case DB_BTREE:
+ s = "btree";
+ break;
+ case DB_HASH:
+ s = "hash";
+ break;
+ case DB_RECNO:
+ s = "recno";
+ break;
+ case DB_QUEUE:
+ s = "queue";
+ break;
+ default:
+ DB_ASSERT(0);
+ return (1);
+ }
+ cp = dbc->internal;
+ fprintf(stderr, "%s/%#0lx: opd: %#0lx\n",
+ s, P_TO_ULONG(dbc), P_TO_ULONG(cp->opd));
+ fprintf(stderr, "\ttxn: %#0lx lid: %lu locker: %lu\n",
+ P_TO_ULONG(dbc->txn),
+ (u_long)dbc->lid, (u_long)dbc->locker);
+ fprintf(stderr, "\troot: %lu page/index: %lu/%lu",
+ (u_long)cp->root, (u_long)cp->pgno, (u_long)cp->indx);
+ __db_prflags(dbc->flags, fn, stderr);
+ fprintf(stderr, "\n");
+
+ if (dbp->type == DB_BTREE)
+ __bam_cprint(dbc);
+ }
+ for (dbc = TAILQ_FIRST(&dbp->free_queue);
+ dbc != NULL; dbc = TAILQ_NEXT(dbc, links))
+ fprintf(stderr, "free: %#0lx ", P_TO_ULONG(dbc));
+ fprintf(stderr, "\n");
+ MUTEX_THREAD_UNLOCK(dbp->dbenv, dbp->mutexp);
+
+ return (0);
+}
+#endif /* DEBUG */
+
+/*
+ * db_fd --
+ * Return a file descriptor for flock'ing.
+ *
+ * PUBLIC: int __db_fd __P((DB *, int *));
+ */
+int
+__db_fd(dbp, fdp)
+ DB *dbp;
+ int *fdp;
+{
+ DB_FH *fhp;
+ int ret;
+
+ PANIC_CHECK(dbp->dbenv);
+ DB_ILLEGAL_BEFORE_OPEN(dbp, "DB->fd");
+
+ /*
+ * XXX
+ * Truly spectacular layering violation.
+ */
+ if ((ret = __mp_xxx_fh(dbp->mpf, &fhp)) != 0)
+ return (ret);
+
+ if (F_ISSET(fhp, DB_FH_VALID)) {
+ *fdp = fhp->fd;
+ return (0);
+ } else {
+ *fdp = -1;
+ __db_err(dbp->dbenv, "DB does not have a valid file handle.");
+ return (ENOENT);
+ }
+}
+
+/*
+ * __db_get --
+ * Return a key/data pair.
+ *
+ * PUBLIC: int __db_get __P((DB *, DB_TXN *, DBT *, DBT *, u_int32_t));
+ */
+int
+__db_get(dbp, txn, key, data, flags)
+ DB *dbp;
+ DB_TXN *txn;
+ DBT *key, *data;
+ u_int32_t flags;
+{
+ DBC *dbc;
+ int mode, ret, t_ret;
+
+ PANIC_CHECK(dbp->dbenv);
+ DB_ILLEGAL_BEFORE_OPEN(dbp, "DB->get");
+
+ if ((ret = __db_getchk(dbp, key, data, flags)) != 0)
+ return (ret);
+
+ mode = 0;
+ if (flags == DB_CONSUME || flags == DB_CONSUME_WAIT)
+ mode = DB_WRITELOCK;
+ if ((ret = dbp->cursor(dbp, txn, &dbc, mode)) != 0)
+ return (ret);
+
+ DEBUG_LREAD(dbc, txn, "__db_get", key, NULL, flags);
+
+ /*
+ * The DBC_TRANSIENT flag indicates that we're just doing a
+ * single operation with this cursor, and that in case of
+ * error we don't need to restore it to its old position--we're
+ * going to close it right away. Thus, we can perform the get
+ * without duplicating the cursor, saving some cycles in this
+ * common case.
+ */
+ F_SET(dbc, DBC_TRANSIENT);
+
+ ret = dbc->c_get(dbc, key, data,
+ flags == 0 || flags == DB_RMW ? flags | DB_SET : flags);
+
+ if ((t_ret = __db_c_close(dbc)) != 0 && ret == 0)
+ ret = t_ret;
+
+ return (ret);
+}
+
+/*
+ * __db_put --
+ * Store a key/data pair.
+ *
+ * PUBLIC: int __db_put __P((DB *, DB_TXN *, DBT *, DBT *, u_int32_t));
+ */
+int
+__db_put(dbp, txn, key, data, flags)
+ DB *dbp;
+ DB_TXN *txn;
+ DBT *key, *data;
+ u_int32_t flags;
+{
+ DBC *dbc;
+ DBT tdata;
+ int ret, t_ret;
+
+ PANIC_CHECK(dbp->dbenv);
+ DB_ILLEGAL_BEFORE_OPEN(dbp, "DB->put");
+
+ if ((ret = __db_putchk(dbp, key, data,
+ flags, F_ISSET(dbp, DB_AM_RDONLY),
+ F_ISSET(dbp, DB_AM_DUP) || F_ISSET(key, DB_DBT_DUPOK))) != 0)
+ return (ret);
+
+ DB_CHECK_TXN(dbp, txn);
+
+ if ((ret = dbp->cursor(dbp, txn, &dbc, DB_WRITELOCK)) != 0)
+ return (ret);
+
+ /*
+ * See the comment in __db_get().
+ *
+ * Note that the c_get in the DB_NOOVERWRITE case is safe to
+ * do with this flag set; if it errors in any way other than
+ * DB_NOTFOUND, we're going to close the cursor without doing
+ * anything else, and if it returns DB_NOTFOUND then it's safe
+ * to do a c_put(DB_KEYLAST) even if an access method moved the
+ * cursor, since that's not position-dependent.
+ */
+ F_SET(dbc, DBC_TRANSIENT);
+
+ DEBUG_LWRITE(dbc, txn, "__db_put", key, data, flags);
+
+ if (flags == DB_NOOVERWRITE) {
+ flags = 0;
+ /*
+ * Set DB_DBT_USERMEM, this might be a threaded application and
+ * the flags checking will catch us. We don't want the actual
+ * data, so request a partial of length 0.
+ */
+ memset(&tdata, 0, sizeof(tdata));
+ F_SET(&tdata, DB_DBT_USERMEM | DB_DBT_PARTIAL);
+
+ /*
+ * If we're doing page-level locking, set the read-modify-write
+ * flag, we're going to overwrite immediately.
+ */
+ if ((ret = dbc->c_get(dbc, key, &tdata,
+ DB_SET | (STD_LOCKING(dbc) ? DB_RMW : 0))) == 0)
+ ret = DB_KEYEXIST;
+ else if (ret == DB_NOTFOUND)
+ ret = 0;
+ }
+ if (ret == 0)
+ ret = dbc->c_put(dbc,
+ key, data, flags == 0 ? DB_KEYLAST : flags);
+
+ if ((t_ret = __db_c_close(dbc)) != 0 && ret == 0)
+ ret = t_ret;
+
+ return (ret);
+}
+
+/*
+ * __db_sync --
+ * Flush the database cache.
+ *
+ * PUBLIC: int __db_sync __P((DB *, u_int32_t));
+ */
+int
+__db_sync(dbp, flags)
+ DB *dbp;
+ u_int32_t flags;
+{
+ int ret, t_ret;
+
+ PANIC_CHECK(dbp->dbenv);
+ DB_ILLEGAL_BEFORE_OPEN(dbp, "DB->sync");
+
+ if ((ret = __db_syncchk(dbp, flags)) != 0)
+ return (ret);
+
+ /* Read-only trees never need to be sync'd. */
+ if (F_ISSET(dbp, DB_AM_RDONLY))
+ return (0);
+
+ /* If it's a Recno tree, write the backing source text file. */
+ if (dbp->type == DB_RECNO)
+ ret = __ram_writeback(dbp);
+
+ /* If the tree was never backed by a database file, we're done. */
+ if (F_ISSET(dbp, DB_AM_INMEM))
+ return (0);
+
+ /* Flush any dirty pages from the cache to the backing file. */
+ if ((t_ret = memp_fsync(dbp->mpf)) != 0 && ret == 0)
+ ret = t_ret;
+ return (ret);
+}
diff --git a/bdb/db/db_auto.c b/bdb/db/db_auto.c
new file mode 100644
index 00000000000..23540adc2e6
--- /dev/null
+++ b/bdb/db/db_auto.c
@@ -0,0 +1,1270 @@
+/* Do not edit: automatically built by gen_rec.awk. */
+#include "db_config.h"
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <ctype.h>
+#include <errno.h>
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_dispatch.h"
+#include "db_am.h"
+#include "txn.h"
+
+int
+__db_addrem_log(dbenv, txnid, ret_lsnp, flags,
+ opcode, fileid, pgno, indx, nbytes, hdr,
+ dbt, pagelsn)
+ DB_ENV *dbenv;
+ DB_TXN *txnid;
+ DB_LSN *ret_lsnp;
+ u_int32_t flags;
+ u_int32_t opcode;
+ int32_t fileid;
+ db_pgno_t pgno;
+ u_int32_t indx;
+ size_t nbytes;
+ const DBT *hdr;
+ const DBT *dbt;
+ DB_LSN * pagelsn;
+{
+ DBT logrec;
+ DB_LSN *lsnp, null_lsn;
+ u_int32_t zero;
+ u_int32_t rectype, txn_num;
+ int ret;
+ u_int8_t *bp;
+
+ rectype = DB_db_addrem;
+ if (txnid != NULL &&
+ TAILQ_FIRST(&txnid->kids) != NULL &&
+ (ret = __txn_activekids(dbenv, rectype, txnid)) != 0)
+ return (ret);
+ txn_num = txnid == NULL ? 0 : txnid->txnid;
+ if (txnid == NULL) {
+ ZERO_LSN(null_lsn);
+ lsnp = &null_lsn;
+ } else
+ lsnp = &txnid->last_lsn;
+ logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN)
+ + sizeof(opcode)
+ + sizeof(fileid)
+ + sizeof(pgno)
+ + sizeof(indx)
+ + sizeof(nbytes)
+ + sizeof(u_int32_t) + (hdr == NULL ? 0 : hdr->size)
+ + sizeof(u_int32_t) + (dbt == NULL ? 0 : dbt->size)
+ + sizeof(*pagelsn);
+ if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0)
+ return (ret);
+
+ bp = logrec.data;
+ memcpy(bp, &rectype, sizeof(rectype));
+ bp += sizeof(rectype);
+ memcpy(bp, &txn_num, sizeof(txn_num));
+ bp += sizeof(txn_num);
+ memcpy(bp, lsnp, sizeof(DB_LSN));
+ bp += sizeof(DB_LSN);
+ memcpy(bp, &opcode, sizeof(opcode));
+ bp += sizeof(opcode);
+ memcpy(bp, &fileid, sizeof(fileid));
+ bp += sizeof(fileid);
+ memcpy(bp, &pgno, sizeof(pgno));
+ bp += sizeof(pgno);
+ memcpy(bp, &indx, sizeof(indx));
+ bp += sizeof(indx);
+ memcpy(bp, &nbytes, sizeof(nbytes));
+ bp += sizeof(nbytes);
+ if (hdr == NULL) {
+ zero = 0;
+ memcpy(bp, &zero, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ } else {
+ memcpy(bp, &hdr->size, sizeof(hdr->size));
+ bp += sizeof(hdr->size);
+ memcpy(bp, hdr->data, hdr->size);
+ bp += hdr->size;
+ }
+ if (dbt == NULL) {
+ zero = 0;
+ memcpy(bp, &zero, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ } else {
+ memcpy(bp, &dbt->size, sizeof(dbt->size));
+ bp += sizeof(dbt->size);
+ memcpy(bp, dbt->data, dbt->size);
+ bp += dbt->size;
+ }
+ if (pagelsn != NULL)
+ memcpy(bp, pagelsn, sizeof(*pagelsn));
+ else
+ memset(bp, 0, sizeof(*pagelsn));
+ bp += sizeof(*pagelsn);
+ DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size);
+ ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags);
+ if (txnid != NULL)
+ txnid->last_lsn = *ret_lsnp;
+ __os_free(logrec.data, logrec.size);
+ return (ret);
+}
+
+int
+__db_addrem_print(dbenv, dbtp, lsnp, notused2, notused3)
+ DB_ENV *dbenv;
+ DBT *dbtp;
+ DB_LSN *lsnp;
+ db_recops notused2;
+ void *notused3;
+{
+ __db_addrem_args *argp;
+ u_int32_t i;
+ u_int ch;
+ int ret;
+
+ i = 0;
+ ch = 0;
+ notused2 = DB_TXN_ABORT;
+ notused3 = NULL;
+
+ if ((ret = __db_addrem_read(dbenv, dbtp->data, &argp)) != 0)
+ return (ret);
+ printf("[%lu][%lu]db_addrem: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+ (u_long)lsnp->file,
+ (u_long)lsnp->offset,
+ (u_long)argp->type,
+ (u_long)argp->txnid->txnid,
+ (u_long)argp->prev_lsn.file,
+ (u_long)argp->prev_lsn.offset);
+ printf("\topcode: %lu\n", (u_long)argp->opcode);
+ printf("\tfileid: %ld\n", (long)argp->fileid);
+ printf("\tpgno: %lu\n", (u_long)argp->pgno);
+ printf("\tindx: %lu\n", (u_long)argp->indx);
+ printf("\tnbytes: %lu\n", (u_long)argp->nbytes);
+ printf("\thdr: ");
+ for (i = 0; i < argp->hdr.size; i++) {
+ ch = ((u_int8_t *)argp->hdr.data)[i];
+ if (isprint(ch) || ch == 0xa)
+ putchar(ch);
+ else
+ printf("%#x ", ch);
+ }
+ printf("\n");
+ printf("\tdbt: ");
+ for (i = 0; i < argp->dbt.size; i++) {
+ ch = ((u_int8_t *)argp->dbt.data)[i];
+ if (isprint(ch) || ch == 0xa)
+ putchar(ch);
+ else
+ printf("%#x ", ch);
+ }
+ printf("\n");
+ printf("\tpagelsn: [%lu][%lu]\n",
+ (u_long)argp->pagelsn.file, (u_long)argp->pagelsn.offset);
+ printf("\n");
+ __os_free(argp, 0);
+ return (0);
+}
+
+int
+__db_addrem_read(dbenv, recbuf, argpp)
+ DB_ENV *dbenv;
+ void *recbuf;
+ __db_addrem_args **argpp;
+{
+ __db_addrem_args *argp;
+ u_int8_t *bp;
+ int ret;
+
+ ret = __os_malloc(dbenv, sizeof(__db_addrem_args) +
+ sizeof(DB_TXN), NULL, &argp);
+ if (ret != 0)
+ return (ret);
+ argp->txnid = (DB_TXN *)&argp[1];
+ bp = recbuf;
+ memcpy(&argp->type, bp, sizeof(argp->type));
+ bp += sizeof(argp->type);
+ memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid));
+ bp += sizeof(argp->txnid->txnid);
+ memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+ bp += sizeof(DB_LSN);
+ memcpy(&argp->opcode, bp, sizeof(argp->opcode));
+ bp += sizeof(argp->opcode);
+ memcpy(&argp->fileid, bp, sizeof(argp->fileid));
+ bp += sizeof(argp->fileid);
+ memcpy(&argp->pgno, bp, sizeof(argp->pgno));
+ bp += sizeof(argp->pgno);
+ memcpy(&argp->indx, bp, sizeof(argp->indx));
+ bp += sizeof(argp->indx);
+ memcpy(&argp->nbytes, bp, sizeof(argp->nbytes));
+ bp += sizeof(argp->nbytes);
+ memset(&argp->hdr, 0, sizeof(argp->hdr));
+ memcpy(&argp->hdr.size, bp, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ argp->hdr.data = bp;
+ bp += argp->hdr.size;
+ memset(&argp->dbt, 0, sizeof(argp->dbt));
+ memcpy(&argp->dbt.size, bp, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ argp->dbt.data = bp;
+ bp += argp->dbt.size;
+ memcpy(&argp->pagelsn, bp, sizeof(argp->pagelsn));
+ bp += sizeof(argp->pagelsn);
+ *argpp = argp;
+ return (0);
+}
+
+int
+__db_split_print(dbenv, dbtp, lsnp, notused2, notused3)
+ DB_ENV *dbenv;
+ DBT *dbtp;
+ DB_LSN *lsnp;
+ db_recops notused2;
+ void *notused3;
+{
+ __db_split_args *argp;
+ u_int32_t i;
+ u_int ch;
+ int ret;
+
+ i = 0;
+ ch = 0;
+ notused2 = DB_TXN_ABORT;
+ notused3 = NULL;
+
+ if ((ret = __db_split_read(dbenv, dbtp->data, &argp)) != 0)
+ return (ret);
+ printf("[%lu][%lu]db_split: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+ (u_long)lsnp->file,
+ (u_long)lsnp->offset,
+ (u_long)argp->type,
+ (u_long)argp->txnid->txnid,
+ (u_long)argp->prev_lsn.file,
+ (u_long)argp->prev_lsn.offset);
+ printf("\topcode: %lu\n", (u_long)argp->opcode);
+ printf("\tfileid: %ld\n", (long)argp->fileid);
+ printf("\tpgno: %lu\n", (u_long)argp->pgno);
+ printf("\tpageimage: ");
+ for (i = 0; i < argp->pageimage.size; i++) {
+ ch = ((u_int8_t *)argp->pageimage.data)[i];
+ if (isprint(ch) || ch == 0xa)
+ putchar(ch);
+ else
+ printf("%#x ", ch);
+ }
+ printf("\n");
+ printf("\tpagelsn: [%lu][%lu]\n",
+ (u_long)argp->pagelsn.file, (u_long)argp->pagelsn.offset);
+ printf("\n");
+ __os_free(argp, 0);
+ return (0);
+}
+
+int
+__db_split_read(dbenv, recbuf, argpp)
+ DB_ENV *dbenv;
+ void *recbuf;
+ __db_split_args **argpp;
+{
+ __db_split_args *argp;
+ u_int8_t *bp;
+ int ret;
+
+ ret = __os_malloc(dbenv, sizeof(__db_split_args) +
+ sizeof(DB_TXN), NULL, &argp);
+ if (ret != 0)
+ return (ret);
+ argp->txnid = (DB_TXN *)&argp[1];
+ bp = recbuf;
+ memcpy(&argp->type, bp, sizeof(argp->type));
+ bp += sizeof(argp->type);
+ memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid));
+ bp += sizeof(argp->txnid->txnid);
+ memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+ bp += sizeof(DB_LSN);
+ memcpy(&argp->opcode, bp, sizeof(argp->opcode));
+ bp += sizeof(argp->opcode);
+ memcpy(&argp->fileid, bp, sizeof(argp->fileid));
+ bp += sizeof(argp->fileid);
+ memcpy(&argp->pgno, bp, sizeof(argp->pgno));
+ bp += sizeof(argp->pgno);
+ memset(&argp->pageimage, 0, sizeof(argp->pageimage));
+ memcpy(&argp->pageimage.size, bp, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ argp->pageimage.data = bp;
+ bp += argp->pageimage.size;
+ memcpy(&argp->pagelsn, bp, sizeof(argp->pagelsn));
+ bp += sizeof(argp->pagelsn);
+ *argpp = argp;
+ return (0);
+}
+
+int
+__db_big_log(dbenv, txnid, ret_lsnp, flags,
+ opcode, fileid, pgno, prev_pgno, next_pgno, dbt,
+ pagelsn, prevlsn, nextlsn)
+ DB_ENV *dbenv;
+ DB_TXN *txnid;
+ DB_LSN *ret_lsnp;
+ u_int32_t flags;
+ u_int32_t opcode;
+ int32_t fileid;
+ db_pgno_t pgno;
+ db_pgno_t prev_pgno;
+ db_pgno_t next_pgno;
+ const DBT *dbt;
+ DB_LSN * pagelsn;
+ DB_LSN * prevlsn;
+ DB_LSN * nextlsn;
+{
+ DBT logrec;
+ DB_LSN *lsnp, null_lsn;
+ u_int32_t zero;
+ u_int32_t rectype, txn_num;
+ int ret;
+ u_int8_t *bp;
+
+ rectype = DB_db_big;
+ if (txnid != NULL &&
+ TAILQ_FIRST(&txnid->kids) != NULL &&
+ (ret = __txn_activekids(dbenv, rectype, txnid)) != 0)
+ return (ret);
+ txn_num = txnid == NULL ? 0 : txnid->txnid;
+ if (txnid == NULL) {
+ ZERO_LSN(null_lsn);
+ lsnp = &null_lsn;
+ } else
+ lsnp = &txnid->last_lsn;
+ logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN)
+ + sizeof(opcode)
+ + sizeof(fileid)
+ + sizeof(pgno)
+ + sizeof(prev_pgno)
+ + sizeof(next_pgno)
+ + sizeof(u_int32_t) + (dbt == NULL ? 0 : dbt->size)
+ + sizeof(*pagelsn)
+ + sizeof(*prevlsn)
+ + sizeof(*nextlsn);
+ if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0)
+ return (ret);
+
+ bp = logrec.data;
+ memcpy(bp, &rectype, sizeof(rectype));
+ bp += sizeof(rectype);
+ memcpy(bp, &txn_num, sizeof(txn_num));
+ bp += sizeof(txn_num);
+ memcpy(bp, lsnp, sizeof(DB_LSN));
+ bp += sizeof(DB_LSN);
+ memcpy(bp, &opcode, sizeof(opcode));
+ bp += sizeof(opcode);
+ memcpy(bp, &fileid, sizeof(fileid));
+ bp += sizeof(fileid);
+ memcpy(bp, &pgno, sizeof(pgno));
+ bp += sizeof(pgno);
+ memcpy(bp, &prev_pgno, sizeof(prev_pgno));
+ bp += sizeof(prev_pgno);
+ memcpy(bp, &next_pgno, sizeof(next_pgno));
+ bp += sizeof(next_pgno);
+ if (dbt == NULL) {
+ zero = 0;
+ memcpy(bp, &zero, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ } else {
+ memcpy(bp, &dbt->size, sizeof(dbt->size));
+ bp += sizeof(dbt->size);
+ memcpy(bp, dbt->data, dbt->size);
+ bp += dbt->size;
+ }
+ if (pagelsn != NULL)
+ memcpy(bp, pagelsn, sizeof(*pagelsn));
+ else
+ memset(bp, 0, sizeof(*pagelsn));
+ bp += sizeof(*pagelsn);
+ if (prevlsn != NULL)
+ memcpy(bp, prevlsn, sizeof(*prevlsn));
+ else
+ memset(bp, 0, sizeof(*prevlsn));
+ bp += sizeof(*prevlsn);
+ if (nextlsn != NULL)
+ memcpy(bp, nextlsn, sizeof(*nextlsn));
+ else
+ memset(bp, 0, sizeof(*nextlsn));
+ bp += sizeof(*nextlsn);
+ DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size);
+ ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags);
+ if (txnid != NULL)
+ txnid->last_lsn = *ret_lsnp;
+ __os_free(logrec.data, logrec.size);
+ return (ret);
+}
+
+int
+__db_big_print(dbenv, dbtp, lsnp, notused2, notused3)
+ DB_ENV *dbenv;
+ DBT *dbtp;
+ DB_LSN *lsnp;
+ db_recops notused2;
+ void *notused3;
+{
+ __db_big_args *argp;
+ u_int32_t i;
+ u_int ch;
+ int ret;
+
+ i = 0;
+ ch = 0;
+ notused2 = DB_TXN_ABORT;
+ notused3 = NULL;
+
+ if ((ret = __db_big_read(dbenv, dbtp->data, &argp)) != 0)
+ return (ret);
+ printf("[%lu][%lu]db_big: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+ (u_long)lsnp->file,
+ (u_long)lsnp->offset,
+ (u_long)argp->type,
+ (u_long)argp->txnid->txnid,
+ (u_long)argp->prev_lsn.file,
+ (u_long)argp->prev_lsn.offset);
+ printf("\topcode: %lu\n", (u_long)argp->opcode);
+ printf("\tfileid: %ld\n", (long)argp->fileid);
+ printf("\tpgno: %lu\n", (u_long)argp->pgno);
+ printf("\tprev_pgno: %lu\n", (u_long)argp->prev_pgno);
+ printf("\tnext_pgno: %lu\n", (u_long)argp->next_pgno);
+ printf("\tdbt: ");
+ for (i = 0; i < argp->dbt.size; i++) {
+ ch = ((u_int8_t *)argp->dbt.data)[i];
+ if (isprint(ch) || ch == 0xa)
+ putchar(ch);
+ else
+ printf("%#x ", ch);
+ }
+ printf("\n");
+ printf("\tpagelsn: [%lu][%lu]\n",
+ (u_long)argp->pagelsn.file, (u_long)argp->pagelsn.offset);
+ printf("\tprevlsn: [%lu][%lu]\n",
+ (u_long)argp->prevlsn.file, (u_long)argp->prevlsn.offset);
+ printf("\tnextlsn: [%lu][%lu]\n",
+ (u_long)argp->nextlsn.file, (u_long)argp->nextlsn.offset);
+ printf("\n");
+ __os_free(argp, 0);
+ return (0);
+}
+
+int
+__db_big_read(dbenv, recbuf, argpp)
+ DB_ENV *dbenv;
+ void *recbuf;
+ __db_big_args **argpp;
+{
+ __db_big_args *argp;
+ u_int8_t *bp;
+ int ret;
+
+ ret = __os_malloc(dbenv, sizeof(__db_big_args) +
+ sizeof(DB_TXN), NULL, &argp);
+ if (ret != 0)
+ return (ret);
+ argp->txnid = (DB_TXN *)&argp[1];
+ bp = recbuf;
+ memcpy(&argp->type, bp, sizeof(argp->type));
+ bp += sizeof(argp->type);
+ memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid));
+ bp += sizeof(argp->txnid->txnid);
+ memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+ bp += sizeof(DB_LSN);
+ memcpy(&argp->opcode, bp, sizeof(argp->opcode));
+ bp += sizeof(argp->opcode);
+ memcpy(&argp->fileid, bp, sizeof(argp->fileid));
+ bp += sizeof(argp->fileid);
+ memcpy(&argp->pgno, bp, sizeof(argp->pgno));
+ bp += sizeof(argp->pgno);
+ memcpy(&argp->prev_pgno, bp, sizeof(argp->prev_pgno));
+ bp += sizeof(argp->prev_pgno);
+ memcpy(&argp->next_pgno, bp, sizeof(argp->next_pgno));
+ bp += sizeof(argp->next_pgno);
+ memset(&argp->dbt, 0, sizeof(argp->dbt));
+ memcpy(&argp->dbt.size, bp, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ argp->dbt.data = bp;
+ bp += argp->dbt.size;
+ memcpy(&argp->pagelsn, bp, sizeof(argp->pagelsn));
+ bp += sizeof(argp->pagelsn);
+ memcpy(&argp->prevlsn, bp, sizeof(argp->prevlsn));
+ bp += sizeof(argp->prevlsn);
+ memcpy(&argp->nextlsn, bp, sizeof(argp->nextlsn));
+ bp += sizeof(argp->nextlsn);
+ *argpp = argp;
+ return (0);
+}
+
+int
+__db_ovref_log(dbenv, txnid, ret_lsnp, flags,
+ fileid, pgno, adjust, lsn)
+ DB_ENV *dbenv;
+ DB_TXN *txnid;
+ DB_LSN *ret_lsnp;
+ u_int32_t flags;
+ int32_t fileid;
+ db_pgno_t pgno;
+ int32_t adjust;
+ DB_LSN * lsn;
+{
+ DBT logrec;
+ DB_LSN *lsnp, null_lsn;
+ u_int32_t rectype, txn_num;
+ int ret;
+ u_int8_t *bp;
+
+ rectype = DB_db_ovref;
+ if (txnid != NULL &&
+ TAILQ_FIRST(&txnid->kids) != NULL &&
+ (ret = __txn_activekids(dbenv, rectype, txnid)) != 0)
+ return (ret);
+ txn_num = txnid == NULL ? 0 : txnid->txnid;
+ if (txnid == NULL) {
+ ZERO_LSN(null_lsn);
+ lsnp = &null_lsn;
+ } else
+ lsnp = &txnid->last_lsn;
+ logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN)
+ + sizeof(fileid)
+ + sizeof(pgno)
+ + sizeof(adjust)
+ + sizeof(*lsn);
+ if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0)
+ return (ret);
+
+ bp = logrec.data;
+ memcpy(bp, &rectype, sizeof(rectype));
+ bp += sizeof(rectype);
+ memcpy(bp, &txn_num, sizeof(txn_num));
+ bp += sizeof(txn_num);
+ memcpy(bp, lsnp, sizeof(DB_LSN));
+ bp += sizeof(DB_LSN);
+ memcpy(bp, &fileid, sizeof(fileid));
+ bp += sizeof(fileid);
+ memcpy(bp, &pgno, sizeof(pgno));
+ bp += sizeof(pgno);
+ memcpy(bp, &adjust, sizeof(adjust));
+ bp += sizeof(adjust);
+ if (lsn != NULL)
+ memcpy(bp, lsn, sizeof(*lsn));
+ else
+ memset(bp, 0, sizeof(*lsn));
+ bp += sizeof(*lsn);
+ DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size);
+ ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags);
+ if (txnid != NULL)
+ txnid->last_lsn = *ret_lsnp;
+ __os_free(logrec.data, logrec.size);
+ return (ret);
+}
+
+int
+__db_ovref_print(dbenv, dbtp, lsnp, notused2, notused3)
+ DB_ENV *dbenv;
+ DBT *dbtp;
+ DB_LSN *lsnp;
+ db_recops notused2;
+ void *notused3;
+{
+ __db_ovref_args *argp;
+ u_int32_t i;
+ u_int ch;
+ int ret;
+
+ i = 0;
+ ch = 0;
+ notused2 = DB_TXN_ABORT;
+ notused3 = NULL;
+
+ if ((ret = __db_ovref_read(dbenv, dbtp->data, &argp)) != 0)
+ return (ret);
+ printf("[%lu][%lu]db_ovref: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+ (u_long)lsnp->file,
+ (u_long)lsnp->offset,
+ (u_long)argp->type,
+ (u_long)argp->txnid->txnid,
+ (u_long)argp->prev_lsn.file,
+ (u_long)argp->prev_lsn.offset);
+ printf("\tfileid: %ld\n", (long)argp->fileid);
+ printf("\tpgno: %lu\n", (u_long)argp->pgno);
+ printf("\tadjust: %ld\n", (long)argp->adjust);
+ printf("\tlsn: [%lu][%lu]\n",
+ (u_long)argp->lsn.file, (u_long)argp->lsn.offset);
+ printf("\n");
+ __os_free(argp, 0);
+ return (0);
+}
+
+int
+__db_ovref_read(dbenv, recbuf, argpp)
+ DB_ENV *dbenv;
+ void *recbuf;
+ __db_ovref_args **argpp;
+{
+ __db_ovref_args *argp;
+ u_int8_t *bp;
+ int ret;
+
+ ret = __os_malloc(dbenv, sizeof(__db_ovref_args) +
+ sizeof(DB_TXN), NULL, &argp);
+ if (ret != 0)
+ return (ret);
+ argp->txnid = (DB_TXN *)&argp[1];
+ bp = recbuf;
+ memcpy(&argp->type, bp, sizeof(argp->type));
+ bp += sizeof(argp->type);
+ memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid));
+ bp += sizeof(argp->txnid->txnid);
+ memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+ bp += sizeof(DB_LSN);
+ memcpy(&argp->fileid, bp, sizeof(argp->fileid));
+ bp += sizeof(argp->fileid);
+ memcpy(&argp->pgno, bp, sizeof(argp->pgno));
+ bp += sizeof(argp->pgno);
+ memcpy(&argp->adjust, bp, sizeof(argp->adjust));
+ bp += sizeof(argp->adjust);
+ memcpy(&argp->lsn, bp, sizeof(argp->lsn));
+ bp += sizeof(argp->lsn);
+ *argpp = argp;
+ return (0);
+}
+
+int
+__db_relink_log(dbenv, txnid, ret_lsnp, flags,
+ opcode, fileid, pgno, lsn, prev, lsn_prev,
+ next, lsn_next)
+ DB_ENV *dbenv;
+ DB_TXN *txnid;
+ DB_LSN *ret_lsnp;
+ u_int32_t flags;
+ u_int32_t opcode;
+ int32_t fileid;
+ db_pgno_t pgno;
+ DB_LSN * lsn;
+ db_pgno_t prev;
+ DB_LSN * lsn_prev;
+ db_pgno_t next;
+ DB_LSN * lsn_next;
+{
+ DBT logrec;
+ DB_LSN *lsnp, null_lsn;
+ u_int32_t rectype, txn_num;
+ int ret;
+ u_int8_t *bp;
+
+ rectype = DB_db_relink;
+ if (txnid != NULL &&
+ TAILQ_FIRST(&txnid->kids) != NULL &&
+ (ret = __txn_activekids(dbenv, rectype, txnid)) != 0)
+ return (ret);
+ txn_num = txnid == NULL ? 0 : txnid->txnid;
+ if (txnid == NULL) {
+ ZERO_LSN(null_lsn);
+ lsnp = &null_lsn;
+ } else
+ lsnp = &txnid->last_lsn;
+ logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN)
+ + sizeof(opcode)
+ + sizeof(fileid)
+ + sizeof(pgno)
+ + sizeof(*lsn)
+ + sizeof(prev)
+ + sizeof(*lsn_prev)
+ + sizeof(next)
+ + sizeof(*lsn_next);
+ if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0)
+ return (ret);
+
+ bp = logrec.data;
+ memcpy(bp, &rectype, sizeof(rectype));
+ bp += sizeof(rectype);
+ memcpy(bp, &txn_num, sizeof(txn_num));
+ bp += sizeof(txn_num);
+ memcpy(bp, lsnp, sizeof(DB_LSN));
+ bp += sizeof(DB_LSN);
+ memcpy(bp, &opcode, sizeof(opcode));
+ bp += sizeof(opcode);
+ memcpy(bp, &fileid, sizeof(fileid));
+ bp += sizeof(fileid);
+ memcpy(bp, &pgno, sizeof(pgno));
+ bp += sizeof(pgno);
+ if (lsn != NULL)
+ memcpy(bp, lsn, sizeof(*lsn));
+ else
+ memset(bp, 0, sizeof(*lsn));
+ bp += sizeof(*lsn);
+ memcpy(bp, &prev, sizeof(prev));
+ bp += sizeof(prev);
+ if (lsn_prev != NULL)
+ memcpy(bp, lsn_prev, sizeof(*lsn_prev));
+ else
+ memset(bp, 0, sizeof(*lsn_prev));
+ bp += sizeof(*lsn_prev);
+ memcpy(bp, &next, sizeof(next));
+ bp += sizeof(next);
+ if (lsn_next != NULL)
+ memcpy(bp, lsn_next, sizeof(*lsn_next));
+ else
+ memset(bp, 0, sizeof(*lsn_next));
+ bp += sizeof(*lsn_next);
+ DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size);
+ ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags);
+ if (txnid != NULL)
+ txnid->last_lsn = *ret_lsnp;
+ __os_free(logrec.data, logrec.size);
+ return (ret);
+}
+
+int
+__db_relink_print(dbenv, dbtp, lsnp, notused2, notused3)
+ DB_ENV *dbenv;
+ DBT *dbtp;
+ DB_LSN *lsnp;
+ db_recops notused2;
+ void *notused3;
+{
+ __db_relink_args *argp;
+ u_int32_t i;
+ u_int ch;
+ int ret;
+
+ i = 0;
+ ch = 0;
+ notused2 = DB_TXN_ABORT;
+ notused3 = NULL;
+
+ if ((ret = __db_relink_read(dbenv, dbtp->data, &argp)) != 0)
+ return (ret);
+ printf("[%lu][%lu]db_relink: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+ (u_long)lsnp->file,
+ (u_long)lsnp->offset,
+ (u_long)argp->type,
+ (u_long)argp->txnid->txnid,
+ (u_long)argp->prev_lsn.file,
+ (u_long)argp->prev_lsn.offset);
+ printf("\topcode: %lu\n", (u_long)argp->opcode);
+ printf("\tfileid: %ld\n", (long)argp->fileid);
+ printf("\tpgno: %lu\n", (u_long)argp->pgno);
+ printf("\tlsn: [%lu][%lu]\n",
+ (u_long)argp->lsn.file, (u_long)argp->lsn.offset);
+ printf("\tprev: %lu\n", (u_long)argp->prev);
+ printf("\tlsn_prev: [%lu][%lu]\n",
+ (u_long)argp->lsn_prev.file, (u_long)argp->lsn_prev.offset);
+ printf("\tnext: %lu\n", (u_long)argp->next);
+ printf("\tlsn_next: [%lu][%lu]\n",
+ (u_long)argp->lsn_next.file, (u_long)argp->lsn_next.offset);
+ printf("\n");
+ __os_free(argp, 0);
+ return (0);
+}
+
+int
+__db_relink_read(dbenv, recbuf, argpp)
+ DB_ENV *dbenv;
+ void *recbuf;
+ __db_relink_args **argpp;
+{
+ __db_relink_args *argp;
+ u_int8_t *bp;
+ int ret;
+
+ ret = __os_malloc(dbenv, sizeof(__db_relink_args) +
+ sizeof(DB_TXN), NULL, &argp);
+ if (ret != 0)
+ return (ret);
+ argp->txnid = (DB_TXN *)&argp[1];
+ bp = recbuf;
+ memcpy(&argp->type, bp, sizeof(argp->type));
+ bp += sizeof(argp->type);
+ memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid));
+ bp += sizeof(argp->txnid->txnid);
+ memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+ bp += sizeof(DB_LSN);
+ memcpy(&argp->opcode, bp, sizeof(argp->opcode));
+ bp += sizeof(argp->opcode);
+ memcpy(&argp->fileid, bp, sizeof(argp->fileid));
+ bp += sizeof(argp->fileid);
+ memcpy(&argp->pgno, bp, sizeof(argp->pgno));
+ bp += sizeof(argp->pgno);
+ memcpy(&argp->lsn, bp, sizeof(argp->lsn));
+ bp += sizeof(argp->lsn);
+ memcpy(&argp->prev, bp, sizeof(argp->prev));
+ bp += sizeof(argp->prev);
+ memcpy(&argp->lsn_prev, bp, sizeof(argp->lsn_prev));
+ bp += sizeof(argp->lsn_prev);
+ memcpy(&argp->next, bp, sizeof(argp->next));
+ bp += sizeof(argp->next);
+ memcpy(&argp->lsn_next, bp, sizeof(argp->lsn_next));
+ bp += sizeof(argp->lsn_next);
+ *argpp = argp;
+ return (0);
+}
+
+int
+__db_addpage_print(dbenv, dbtp, lsnp, notused2, notused3)
+ DB_ENV *dbenv;
+ DBT *dbtp;
+ DB_LSN *lsnp;
+ db_recops notused2;
+ void *notused3;
+{
+ __db_addpage_args *argp;
+ u_int32_t i;
+ u_int ch;
+ int ret;
+
+ i = 0;
+ ch = 0;
+ notused2 = DB_TXN_ABORT;
+ notused3 = NULL;
+
+ if ((ret = __db_addpage_read(dbenv, dbtp->data, &argp)) != 0)
+ return (ret);
+ printf("[%lu][%lu]db_addpage: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+ (u_long)lsnp->file,
+ (u_long)lsnp->offset,
+ (u_long)argp->type,
+ (u_long)argp->txnid->txnid,
+ (u_long)argp->prev_lsn.file,
+ (u_long)argp->prev_lsn.offset);
+ printf("\tfileid: %ld\n", (long)argp->fileid);
+ printf("\tpgno: %lu\n", (u_long)argp->pgno);
+ printf("\tlsn: [%lu][%lu]\n",
+ (u_long)argp->lsn.file, (u_long)argp->lsn.offset);
+ printf("\tnextpgno: %lu\n", (u_long)argp->nextpgno);
+ printf("\tnextlsn: [%lu][%lu]\n",
+ (u_long)argp->nextlsn.file, (u_long)argp->nextlsn.offset);
+ printf("\n");
+ __os_free(argp, 0);
+ return (0);
+}
+
+int
+__db_addpage_read(dbenv, recbuf, argpp)
+ DB_ENV *dbenv;
+ void *recbuf;
+ __db_addpage_args **argpp;
+{
+ __db_addpage_args *argp;
+ u_int8_t *bp;
+ int ret;
+
+ ret = __os_malloc(dbenv, sizeof(__db_addpage_args) +
+ sizeof(DB_TXN), NULL, &argp);
+ if (ret != 0)
+ return (ret);
+ argp->txnid = (DB_TXN *)&argp[1];
+ bp = recbuf;
+ memcpy(&argp->type, bp, sizeof(argp->type));
+ bp += sizeof(argp->type);
+ memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid));
+ bp += sizeof(argp->txnid->txnid);
+ memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+ bp += sizeof(DB_LSN);
+ memcpy(&argp->fileid, bp, sizeof(argp->fileid));
+ bp += sizeof(argp->fileid);
+ memcpy(&argp->pgno, bp, sizeof(argp->pgno));
+ bp += sizeof(argp->pgno);
+ memcpy(&argp->lsn, bp, sizeof(argp->lsn));
+ bp += sizeof(argp->lsn);
+ memcpy(&argp->nextpgno, bp, sizeof(argp->nextpgno));
+ bp += sizeof(argp->nextpgno);
+ memcpy(&argp->nextlsn, bp, sizeof(argp->nextlsn));
+ bp += sizeof(argp->nextlsn);
+ *argpp = argp;
+ return (0);
+}
+
+int
+__db_debug_log(dbenv, txnid, ret_lsnp, flags,
+ op, fileid, key, data, arg_flags)
+ DB_ENV *dbenv;
+ DB_TXN *txnid;
+ DB_LSN *ret_lsnp;
+ u_int32_t flags;
+ const DBT *op;
+ int32_t fileid;
+ const DBT *key;
+ const DBT *data;
+ u_int32_t arg_flags;
+{
+ DBT logrec;
+ DB_LSN *lsnp, null_lsn;
+ u_int32_t zero;
+ u_int32_t rectype, txn_num;
+ int ret;
+ u_int8_t *bp;
+
+ rectype = DB_db_debug;
+ if (txnid != NULL &&
+ TAILQ_FIRST(&txnid->kids) != NULL &&
+ (ret = __txn_activekids(dbenv, rectype, txnid)) != 0)
+ return (ret);
+ txn_num = txnid == NULL ? 0 : txnid->txnid;
+ if (txnid == NULL) {
+ ZERO_LSN(null_lsn);
+ lsnp = &null_lsn;
+ } else
+ lsnp = &txnid->last_lsn;
+ logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN)
+ + sizeof(u_int32_t) + (op == NULL ? 0 : op->size)
+ + sizeof(fileid)
+ + sizeof(u_int32_t) + (key == NULL ? 0 : key->size)
+ + sizeof(u_int32_t) + (data == NULL ? 0 : data->size)
+ + sizeof(arg_flags);
+ if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0)
+ return (ret);
+
+ bp = logrec.data;
+ memcpy(bp, &rectype, sizeof(rectype));
+ bp += sizeof(rectype);
+ memcpy(bp, &txn_num, sizeof(txn_num));
+ bp += sizeof(txn_num);
+ memcpy(bp, lsnp, sizeof(DB_LSN));
+ bp += sizeof(DB_LSN);
+ if (op == NULL) {
+ zero = 0;
+ memcpy(bp, &zero, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ } else {
+ memcpy(bp, &op->size, sizeof(op->size));
+ bp += sizeof(op->size);
+ memcpy(bp, op->data, op->size);
+ bp += op->size;
+ }
+ memcpy(bp, &fileid, sizeof(fileid));
+ bp += sizeof(fileid);
+ if (key == NULL) {
+ zero = 0;
+ memcpy(bp, &zero, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ } else {
+ memcpy(bp, &key->size, sizeof(key->size));
+ bp += sizeof(key->size);
+ memcpy(bp, key->data, key->size);
+ bp += key->size;
+ }
+ if (data == NULL) {
+ zero = 0;
+ memcpy(bp, &zero, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ } else {
+ memcpy(bp, &data->size, sizeof(data->size));
+ bp += sizeof(data->size);
+ memcpy(bp, data->data, data->size);
+ bp += data->size;
+ }
+ memcpy(bp, &arg_flags, sizeof(arg_flags));
+ bp += sizeof(arg_flags);
+ DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size);
+ ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags);
+ if (txnid != NULL)
+ txnid->last_lsn = *ret_lsnp;
+ __os_free(logrec.data, logrec.size);
+ return (ret);
+}
+
+int
+__db_debug_print(dbenv, dbtp, lsnp, notused2, notused3)
+ DB_ENV *dbenv;
+ DBT *dbtp;
+ DB_LSN *lsnp;
+ db_recops notused2;
+ void *notused3;
+{
+ __db_debug_args *argp;
+ u_int32_t i;
+ u_int ch;
+ int ret;
+
+ i = 0;
+ ch = 0;
+ notused2 = DB_TXN_ABORT;
+ notused3 = NULL;
+
+ if ((ret = __db_debug_read(dbenv, dbtp->data, &argp)) != 0)
+ return (ret);
+ printf("[%lu][%lu]db_debug: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+ (u_long)lsnp->file,
+ (u_long)lsnp->offset,
+ (u_long)argp->type,
+ (u_long)argp->txnid->txnid,
+ (u_long)argp->prev_lsn.file,
+ (u_long)argp->prev_lsn.offset);
+ printf("\top: ");
+ for (i = 0; i < argp->op.size; i++) {
+ ch = ((u_int8_t *)argp->op.data)[i];
+ if (isprint(ch) || ch == 0xa)
+ putchar(ch);
+ else
+ printf("%#x ", ch);
+ }
+ printf("\n");
+ printf("\tfileid: %ld\n", (long)argp->fileid);
+ printf("\tkey: ");
+ for (i = 0; i < argp->key.size; i++) {
+ ch = ((u_int8_t *)argp->key.data)[i];
+ if (isprint(ch) || ch == 0xa)
+ putchar(ch);
+ else
+ printf("%#x ", ch);
+ }
+ printf("\n");
+ printf("\tdata: ");
+ for (i = 0; i < argp->data.size; i++) {
+ ch = ((u_int8_t *)argp->data.data)[i];
+ if (isprint(ch) || ch == 0xa)
+ putchar(ch);
+ else
+ printf("%#x ", ch);
+ }
+ printf("\n");
+ printf("\targ_flags: %lu\n", (u_long)argp->arg_flags);
+ printf("\n");
+ __os_free(argp, 0);
+ return (0);
+}
+
+int
+__db_debug_read(dbenv, recbuf, argpp)
+ DB_ENV *dbenv;
+ void *recbuf;
+ __db_debug_args **argpp;
+{
+ __db_debug_args *argp;
+ u_int8_t *bp;
+ int ret;
+
+ ret = __os_malloc(dbenv, sizeof(__db_debug_args) +
+ sizeof(DB_TXN), NULL, &argp);
+ if (ret != 0)
+ return (ret);
+ argp->txnid = (DB_TXN *)&argp[1];
+ bp = recbuf;
+ memcpy(&argp->type, bp, sizeof(argp->type));
+ bp += sizeof(argp->type);
+ memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid));
+ bp += sizeof(argp->txnid->txnid);
+ memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+ bp += sizeof(DB_LSN);
+ memset(&argp->op, 0, sizeof(argp->op));
+ memcpy(&argp->op.size, bp, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ argp->op.data = bp;
+ bp += argp->op.size;
+ memcpy(&argp->fileid, bp, sizeof(argp->fileid));
+ bp += sizeof(argp->fileid);
+ memset(&argp->key, 0, sizeof(argp->key));
+ memcpy(&argp->key.size, bp, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ argp->key.data = bp;
+ bp += argp->key.size;
+ memset(&argp->data, 0, sizeof(argp->data));
+ memcpy(&argp->data.size, bp, sizeof(u_int32_t));
+ bp += sizeof(u_int32_t);
+ argp->data.data = bp;
+ bp += argp->data.size;
+ memcpy(&argp->arg_flags, bp, sizeof(argp->arg_flags));
+ bp += sizeof(argp->arg_flags);
+ *argpp = argp;
+ return (0);
+}
+
+int
+__db_noop_log(dbenv, txnid, ret_lsnp, flags,
+ fileid, pgno, prevlsn)
+ DB_ENV *dbenv;
+ DB_TXN *txnid;
+ DB_LSN *ret_lsnp;
+ u_int32_t flags;
+ int32_t fileid;
+ db_pgno_t pgno;
+ DB_LSN * prevlsn;
+{
+ DBT logrec;
+ DB_LSN *lsnp, null_lsn;
+ u_int32_t rectype, txn_num;
+ int ret;
+ u_int8_t *bp;
+
+ rectype = DB_db_noop;
+ if (txnid != NULL &&
+ TAILQ_FIRST(&txnid->kids) != NULL &&
+ (ret = __txn_activekids(dbenv, rectype, txnid)) != 0)
+ return (ret);
+ txn_num = txnid == NULL ? 0 : txnid->txnid;
+ if (txnid == NULL) {
+ ZERO_LSN(null_lsn);
+ lsnp = &null_lsn;
+ } else
+ lsnp = &txnid->last_lsn;
+ logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN)
+ + sizeof(fileid)
+ + sizeof(pgno)
+ + sizeof(*prevlsn);
+ if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0)
+ return (ret);
+
+ bp = logrec.data;
+ memcpy(bp, &rectype, sizeof(rectype));
+ bp += sizeof(rectype);
+ memcpy(bp, &txn_num, sizeof(txn_num));
+ bp += sizeof(txn_num);
+ memcpy(bp, lsnp, sizeof(DB_LSN));
+ bp += sizeof(DB_LSN);
+ memcpy(bp, &fileid, sizeof(fileid));
+ bp += sizeof(fileid);
+ memcpy(bp, &pgno, sizeof(pgno));
+ bp += sizeof(pgno);
+ if (prevlsn != NULL)
+ memcpy(bp, prevlsn, sizeof(*prevlsn));
+ else
+ memset(bp, 0, sizeof(*prevlsn));
+ bp += sizeof(*prevlsn);
+ DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size);
+ ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags);
+ if (txnid != NULL)
+ txnid->last_lsn = *ret_lsnp;
+ __os_free(logrec.data, logrec.size);
+ return (ret);
+}
+
+int
+__db_noop_print(dbenv, dbtp, lsnp, notused2, notused3)
+ DB_ENV *dbenv;
+ DBT *dbtp;
+ DB_LSN *lsnp;
+ db_recops notused2;
+ void *notused3;
+{
+ __db_noop_args *argp;
+ u_int32_t i;
+ u_int ch;
+ int ret;
+
+ i = 0;
+ ch = 0;
+ notused2 = DB_TXN_ABORT;
+ notused3 = NULL;
+
+ if ((ret = __db_noop_read(dbenv, dbtp->data, &argp)) != 0)
+ return (ret);
+ printf("[%lu][%lu]db_noop: rec: %lu txnid %lx prevlsn [%lu][%lu]\n",
+ (u_long)lsnp->file,
+ (u_long)lsnp->offset,
+ (u_long)argp->type,
+ (u_long)argp->txnid->txnid,
+ (u_long)argp->prev_lsn.file,
+ (u_long)argp->prev_lsn.offset);
+ printf("\tfileid: %ld\n", (long)argp->fileid);
+ printf("\tpgno: %lu\n", (u_long)argp->pgno);
+ printf("\tprevlsn: [%lu][%lu]\n",
+ (u_long)argp->prevlsn.file, (u_long)argp->prevlsn.offset);
+ printf("\n");
+ __os_free(argp, 0);
+ return (0);
+}
+
+int
+__db_noop_read(dbenv, recbuf, argpp)
+ DB_ENV *dbenv;
+ void *recbuf;
+ __db_noop_args **argpp;
+{
+ __db_noop_args *argp;
+ u_int8_t *bp;
+ int ret;
+
+ ret = __os_malloc(dbenv, sizeof(__db_noop_args) +
+ sizeof(DB_TXN), NULL, &argp);
+ if (ret != 0)
+ return (ret);
+ argp->txnid = (DB_TXN *)&argp[1];
+ bp = recbuf;
+ memcpy(&argp->type, bp, sizeof(argp->type));
+ bp += sizeof(argp->type);
+ memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid));
+ bp += sizeof(argp->txnid->txnid);
+ memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN));
+ bp += sizeof(DB_LSN);
+ memcpy(&argp->fileid, bp, sizeof(argp->fileid));
+ bp += sizeof(argp->fileid);
+ memcpy(&argp->pgno, bp, sizeof(argp->pgno));
+ bp += sizeof(argp->pgno);
+ memcpy(&argp->prevlsn, bp, sizeof(argp->prevlsn));
+ bp += sizeof(argp->prevlsn);
+ *argpp = argp;
+ return (0);
+}
+
+int
+__db_init_print(dbenv)
+ DB_ENV *dbenv;
+{
+ int ret;
+
+ if ((ret = __db_add_recovery(dbenv,
+ __db_addrem_print, DB_db_addrem)) != 0)
+ return (ret);
+ if ((ret = __db_add_recovery(dbenv,
+ __db_split_print, DB_db_split)) != 0)
+ return (ret);
+ if ((ret = __db_add_recovery(dbenv,
+ __db_big_print, DB_db_big)) != 0)
+ return (ret);
+ if ((ret = __db_add_recovery(dbenv,
+ __db_ovref_print, DB_db_ovref)) != 0)
+ return (ret);
+ if ((ret = __db_add_recovery(dbenv,
+ __db_relink_print, DB_db_relink)) != 0)
+ return (ret);
+ if ((ret = __db_add_recovery(dbenv,
+ __db_addpage_print, DB_db_addpage)) != 0)
+ return (ret);
+ if ((ret = __db_add_recovery(dbenv,
+ __db_debug_print, DB_db_debug)) != 0)
+ return (ret);
+ if ((ret = __db_add_recovery(dbenv,
+ __db_noop_print, DB_db_noop)) != 0)
+ return (ret);
+ return (0);
+}
+
+int
+__db_init_recover(dbenv)
+ DB_ENV *dbenv;
+{
+ int ret;
+
+ if ((ret = __db_add_recovery(dbenv,
+ __db_addrem_recover, DB_db_addrem)) != 0)
+ return (ret);
+ if ((ret = __db_add_recovery(dbenv,
+ __deprecated_recover, DB_db_split)) != 0)
+ return (ret);
+ if ((ret = __db_add_recovery(dbenv,
+ __db_big_recover, DB_db_big)) != 0)
+ return (ret);
+ if ((ret = __db_add_recovery(dbenv,
+ __db_ovref_recover, DB_db_ovref)) != 0)
+ return (ret);
+ if ((ret = __db_add_recovery(dbenv,
+ __db_relink_recover, DB_db_relink)) != 0)
+ return (ret);
+ if ((ret = __db_add_recovery(dbenv,
+ __deprecated_recover, DB_db_addpage)) != 0)
+ return (ret);
+ if ((ret = __db_add_recovery(dbenv,
+ __db_debug_recover, DB_db_debug)) != 0)
+ return (ret);
+ if ((ret = __db_add_recovery(dbenv,
+ __db_noop_recover, DB_db_noop)) != 0)
+ return (ret);
+ return (0);
+}
+
diff --git a/bdb/db/db_cam.c b/bdb/db/db_cam.c
new file mode 100644
index 00000000000..708d4cbda4d
--- /dev/null
+++ b/bdb/db/db_cam.c
@@ -0,0 +1,974 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 2000
+ * Sleepycat Software. All rights reserved.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_cam.c,v 11.52 2001/01/18 15:11:16 bostic Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_shash.h"
+#include "lock.h"
+#include "btree.h"
+#include "hash.h"
+#include "qam.h"
+#include "txn.h"
+#include "db_ext.h"
+
+static int __db_c_cleanup __P((DBC *, DBC *, int));
+static int __db_c_idup __P((DBC *, DBC **, u_int32_t));
+static int __db_wrlock_err __P((DB_ENV *));
+
+#define CDB_LOCKING_INIT(dbp, dbc) \
+ /* \
+ * If we are running CDB, this had better be either a write \
+ * cursor or an immediate writer. If it's a regular writer, \
+ * that means we have an IWRITE lock and we need to upgrade \
+ * it to a write lock. \
+ */ \
+ if (CDB_LOCKING((dbp)->dbenv)) { \
+ if (!F_ISSET(dbc, DBC_WRITECURSOR | DBC_WRITER)) \
+ return (__db_wrlock_err(dbp->dbenv)); \
+ \
+ if (F_ISSET(dbc, DBC_WRITECURSOR) && \
+ (ret = lock_get((dbp)->dbenv, (dbc)->locker, \
+ DB_LOCK_UPGRADE, &(dbc)->lock_dbt, DB_LOCK_WRITE, \
+ &(dbc)->mylock)) != 0) \
+ return (ret); \
+ }
+#define CDB_LOCKING_DONE(dbp, dbc) \
+ /* Release the upgraded lock. */ \
+ if (F_ISSET(dbc, DBC_WRITECURSOR)) \
+ (void)__lock_downgrade( \
+ (dbp)->dbenv, &(dbc)->mylock, DB_LOCK_IWRITE, 0);
+/*
+ * Copy the lock info from one cursor to another, so that locking
+ * in CDB can be done in the context of an internally-duplicated
+ * or off-page-duplicate cursor.
+ */
+#define CDB_LOCKING_COPY(dbp, dbc_o, dbc_n) \
+ if (CDB_LOCKING((dbp)->dbenv) && \
+ F_ISSET((dbc_o), DBC_WRITECURSOR | DBC_WRITEDUP)) { \
+ memcpy(&(dbc_n)->mylock, &(dbc_o)->mylock, \
+ sizeof((dbc_o)->mylock)); \
+ (dbc_n)->locker = (dbc_o)->locker; \
+ /* This lock isn't ours to put--just discard it on close. */ \
+ F_SET((dbc_n), DBC_WRITEDUP); \
+ }
+
+/*
+ * __db_c_close --
+ * Close the cursor.
+ *
+ * PUBLIC: int __db_c_close __P((DBC *));
+ */
+int
+__db_c_close(dbc)
+ DBC *dbc;
+{
+ DB *dbp;
+ DBC *opd;
+ DBC_INTERNAL *cp;
+ int ret, t_ret;
+
+ dbp = dbc->dbp;
+ ret = 0;
+
+ PANIC_CHECK(dbp->dbenv);
+
+ /*
+ * If the cursor is already closed we have a serious problem, and we
+ * assume that the cursor isn't on the active queue. Don't do any of
+ * the remaining cursor close processing.
+ */
+ if (!F_ISSET(dbc, DBC_ACTIVE)) {
+ if (dbp != NULL)
+ __db_err(dbp->dbenv, "Closing closed cursor");
+
+ DB_ASSERT(0);
+ return (EINVAL);
+ }
+
+ cp = dbc->internal;
+ opd = cp->opd;
+
+ /*
+ * Remove the cursor(s) from the active queue. We may be closing two
+ * cursors at once here, a top-level one and a lower-level, off-page
+ * duplicate one. The acess-method specific cursor close routine must
+ * close both of them in a single call.
+ *
+ * !!!
+ * Cursors must be removed from the active queue before calling the
+ * access specific cursor close routine, btree depends on having that
+ * order of operations. It must also happen before any action that
+ * can fail and cause __db_c_close to return an error, or else calls
+ * here from __db_close may loop indefinitely.
+ */
+ MUTEX_THREAD_LOCK(dbp->dbenv, dbp->mutexp);
+
+ if (opd != NULL) {
+ F_CLR(opd, DBC_ACTIVE);
+ TAILQ_REMOVE(&dbp->active_queue, opd, links);
+ }
+ F_CLR(dbc, DBC_ACTIVE);
+ TAILQ_REMOVE(&dbp->active_queue, dbc, links);
+
+ MUTEX_THREAD_UNLOCK(dbp->dbenv, dbp->mutexp);
+
+ /* Call the access specific cursor close routine. */
+ if ((t_ret =
+ dbc->c_am_close(dbc, PGNO_INVALID, NULL)) != 0 && ret == 0)
+ ret = t_ret;
+
+ /*
+ * Release the lock after calling the access method specific close
+ * routine, a Btree cursor may have had pending deletes.
+ */
+ if (CDB_LOCKING(dbc->dbp->dbenv)) {
+ /*
+ * If DBC_WRITEDUP is set, the cursor is an internally
+ * duplicated write cursor and the lock isn't ours to put.
+ */
+ if (!F_ISSET(dbc, DBC_WRITEDUP) &&
+ dbc->mylock.off != LOCK_INVALID) {
+ if ((t_ret = lock_put(dbc->dbp->dbenv,
+ &dbc->mylock)) != 0 && ret == 0)
+ ret = t_ret;
+ dbc->mylock.off = LOCK_INVALID;
+ }
+
+ /* For safety's sake, since this is going on the free queue. */
+ memset(&dbc->mylock, 0, sizeof(dbc->mylock));
+ F_CLR(dbc, DBC_WRITEDUP);
+ }
+
+ if (dbc->txn != NULL)
+ dbc->txn->cursors--;
+
+ /* Move the cursor(s) to the free queue. */
+ MUTEX_THREAD_LOCK(dbp->dbenv, dbp->mutexp);
+ if (opd != NULL) {
+ if (dbc->txn != NULL)
+ dbc->txn->cursors--;
+ TAILQ_INSERT_TAIL(&dbp->free_queue, opd, links);
+ opd = NULL;
+ }
+ TAILQ_INSERT_TAIL(&dbp->free_queue, dbc, links);
+ MUTEX_THREAD_UNLOCK(dbp->dbenv, dbp->mutexp);
+
+ return (ret);
+}
+
+/*
+ * __db_c_destroy --
+ * Destroy the cursor, called after DBC->c_close.
+ *
+ * PUBLIC: int __db_c_destroy __P((DBC *));
+ */
+int
+__db_c_destroy(dbc)
+ DBC *dbc;
+{
+ DB *dbp;
+ DBC_INTERNAL *cp;
+ int ret;
+
+ dbp = dbc->dbp;
+ cp = dbc->internal;
+
+ /* Remove the cursor from the free queue. */
+ MUTEX_THREAD_LOCK(dbp->dbenv, dbp->mutexp);
+ TAILQ_REMOVE(&dbp->free_queue, dbc, links);
+ MUTEX_THREAD_UNLOCK(dbp->dbenv, dbp->mutexp);
+
+ /* Free up allocated memory. */
+ if (dbc->rkey.data != NULL)
+ __os_free(dbc->rkey.data, dbc->rkey.ulen);
+ if (dbc->rdata.data != NULL)
+ __os_free(dbc->rdata.data, dbc->rdata.ulen);
+
+ /* Call the access specific cursor destroy routine. */
+ ret = dbc->c_am_destroy == NULL ? 0 : dbc->c_am_destroy(dbc);
+
+ __os_free(dbc, sizeof(*dbc));
+
+ return (ret);
+}
+
+/*
+ * __db_c_count --
+ * Return a count of duplicate data items.
+ *
+ * PUBLIC: int __db_c_count __P((DBC *, db_recno_t *, u_int32_t));
+ */
+int
+__db_c_count(dbc, recnop, flags)
+ DBC *dbc;
+ db_recno_t *recnop;
+ u_int32_t flags;
+{
+ DB *dbp;
+ int ret;
+
+ /*
+ * Cursor Cleanup Note:
+ * All of the cursors passed to the underlying access methods by this
+ * routine are not duplicated and will not be cleaned up on return.
+ * So, pages/locks that the cursor references must be resolved by the
+ * underlying functions.
+ */
+ dbp = dbc->dbp;
+
+ PANIC_CHECK(dbp->dbenv);
+
+ /* Check for invalid flags. */
+ if ((ret = __db_ccountchk(dbp, flags, IS_INITIALIZED(dbc))) != 0)
+ return (ret);
+
+ switch (dbc->dbtype) {
+ case DB_QUEUE:
+ case DB_RECNO:
+ *recnop = 1;
+ break;
+ case DB_HASH:
+ if (dbc->internal->opd == NULL) {
+ if ((ret = __ham_c_count(dbc, recnop)) != 0)
+ return (ret);
+ break;
+ }
+ /* FALLTHROUGH */
+ case DB_BTREE:
+ if ((ret = __bam_c_count(dbc, recnop)) != 0)
+ return (ret);
+ break;
+ default:
+ return (__db_unknown_type(dbp->dbenv,
+ "__db_c_count", dbp->type));
+ }
+ return (0);
+}
+
+/*
+ * __db_c_del --
+ * Delete using a cursor.
+ *
+ * PUBLIC: int __db_c_del __P((DBC *, u_int32_t));
+ */
+int
+__db_c_del(dbc, flags)
+ DBC *dbc;
+ u_int32_t flags;
+{
+ DB *dbp;
+ DBC *opd;
+ int ret;
+
+ /*
+ * Cursor Cleanup Note:
+ * All of the cursors passed to the underlying access methods by this
+ * routine are not duplicated and will not be cleaned up on return.
+ * So, pages/locks that the cursor references must be resolved by the
+ * underlying functions.
+ */
+ dbp = dbc->dbp;
+
+ PANIC_CHECK(dbp->dbenv);
+ DB_CHECK_TXN(dbp, dbc->txn);
+
+ /* Check for invalid flags. */
+ if ((ret = __db_cdelchk(dbp, flags,
+ F_ISSET(dbp, DB_AM_RDONLY), IS_INITIALIZED(dbc))) != 0)
+ return (ret);
+
+ DEBUG_LWRITE(dbc, dbc->txn, "db_c_del", NULL, NULL, flags);
+
+ CDB_LOCKING_INIT(dbp, dbc);
+
+ /*
+ * Off-page duplicate trees are locked in the primary tree, that is,
+ * we acquire a write lock in the primary tree and no locks in the
+ * off-page dup tree. If the del operation is done in an off-page
+ * duplicate tree, call the primary cursor's upgrade routine first.
+ */
+ opd = dbc->internal->opd;
+ if (opd == NULL)
+ ret = dbc->c_am_del(dbc);
+ else
+ if ((ret = dbc->c_am_writelock(dbc)) == 0)
+ ret = opd->c_am_del(opd);
+
+ CDB_LOCKING_DONE(dbp, dbc);
+
+ return (ret);
+}
+
+/*
+ * __db_c_dup --
+ * Duplicate a cursor
+ *
+ * PUBLIC: int __db_c_dup __P((DBC *, DBC **, u_int32_t));
+ */
+int
+__db_c_dup(dbc_orig, dbcp, flags)
+ DBC *dbc_orig;
+ DBC **dbcp;
+ u_int32_t flags;
+{
+ DB_ENV *dbenv;
+ DB *dbp;
+ DBC *dbc_n, *dbc_nopd;
+ int ret;
+
+ dbp = dbc_orig->dbp;
+ dbenv = dbp->dbenv;
+ dbc_n = dbc_nopd = NULL;
+
+ PANIC_CHECK(dbp->dbenv);
+
+ /*
+ * We can never have two write cursors open in CDB, so do not
+ * allow duplication of a write cursor.
+ */
+ if (flags != DB_POSITIONI &&
+ F_ISSET(dbc_orig, DBC_WRITER | DBC_WRITECURSOR)) {
+ __db_err(dbenv, "Cannot duplicate writeable cursor");
+ return (EINVAL);
+ }
+
+ /* Allocate a new cursor and initialize it. */
+ if ((ret = __db_c_idup(dbc_orig, &dbc_n, flags)) != 0)
+ goto err;
+ *dbcp = dbc_n;
+
+ /*
+ * If we're in CDB, and this isn't an internal duplication (in which
+ * case we're explicitly overriding CDB locking), the duplicated
+ * cursor needs its own read lock. (We know it's not a write cursor
+ * because we wouldn't have made it this far; you can't dup them.)
+ */
+ if (CDB_LOCKING(dbenv) && flags != DB_POSITIONI) {
+ DB_ASSERT(!F_ISSET(dbc_orig, DBC_WRITER | DBC_WRITECURSOR));
+
+ if ((ret = lock_get(dbenv, dbc_n->locker, 0,
+ &dbc_n->lock_dbt, DB_LOCK_READ, &dbc_n->mylock)) != 0) {
+ (void)__db_c_close(dbc_n);
+ return (ret);
+ }
+ }
+
+ /*
+ * If the cursor references an off-page duplicate tree, allocate a
+ * new cursor for that tree and initialize it.
+ */
+ if (dbc_orig->internal->opd != NULL) {
+ if ((ret =
+ __db_c_idup(dbc_orig->internal->opd, &dbc_nopd, flags)) != 0)
+ goto err;
+ dbc_n->internal->opd = dbc_nopd;
+ }
+
+ return (0);
+
+err: if (dbc_n != NULL)
+ (void)dbc_n->c_close(dbc_n);
+ if (dbc_nopd != NULL)
+ (void)dbc_nopd->c_close(dbc_nopd);
+
+ return (ret);
+}
+
+/*
+ * __db_c_idup --
+ * Internal version of __db_c_dup.
+ */
+static int
+__db_c_idup(dbc_orig, dbcp, flags)
+ DBC *dbc_orig, **dbcp;
+ u_int32_t flags;
+{
+ DB *dbp;
+ DBC *dbc_n;
+ DBC_INTERNAL *int_n, *int_orig;
+ int ret;
+
+ dbp = dbc_orig->dbp;
+ dbc_n = *dbcp;
+
+ if ((ret = __db_icursor(dbp, dbc_orig->txn, dbc_orig->dbtype,
+ dbc_orig->internal->root, F_ISSET(dbc_orig, DBC_OPD), &dbc_n)) != 0)
+ return (ret);
+
+ dbc_n->locker = dbc_orig->locker;
+
+ /* If the user wants the cursor positioned, do it here. */
+ if (flags == DB_POSITION || flags == DB_POSITIONI) {
+ int_n = dbc_n->internal;
+ int_orig = dbc_orig->internal;
+
+ dbc_n->flags = dbc_orig->flags;
+
+ int_n->indx = int_orig->indx;
+ int_n->pgno = int_orig->pgno;
+ int_n->root = int_orig->root;
+ int_n->lock_mode = int_orig->lock_mode;
+
+ switch (dbc_orig->dbtype) {
+ case DB_QUEUE:
+ if ((ret = __qam_c_dup(dbc_orig, dbc_n)) != 0)
+ goto err;
+ break;
+ case DB_BTREE:
+ case DB_RECNO:
+ if ((ret = __bam_c_dup(dbc_orig, dbc_n)) != 0)
+ goto err;
+ break;
+ case DB_HASH:
+ if ((ret = __ham_c_dup(dbc_orig, dbc_n)) != 0)
+ goto err;
+ break;
+ default:
+ ret = __db_unknown_type(dbp->dbenv,
+ "__db_c_idup", dbc_orig->dbtype);
+ goto err;
+ }
+ }
+
+ /* Now take care of duping the CDB information. */
+ CDB_LOCKING_COPY(dbp, dbc_orig, dbc_n);
+
+ *dbcp = dbc_n;
+ return (0);
+
+err: (void)dbc_n->c_close(dbc_n);
+ return (ret);
+}
+
+/*
+ * __db_c_newopd --
+ * Create a new off-page duplicate cursor.
+ *
+ * PUBLIC: int __db_c_newopd __P((DBC *, db_pgno_t, DBC **));
+ */
+int
+__db_c_newopd(dbc_parent, root, dbcp)
+ DBC *dbc_parent;
+ db_pgno_t root;
+ DBC **dbcp;
+{
+ DB *dbp;
+ DBC *opd;
+ DBTYPE dbtype;
+ int ret;
+
+ dbp = dbc_parent->dbp;
+ dbtype = (dbp->dup_compare == NULL) ? DB_RECNO : DB_BTREE;
+
+ if ((ret = __db_icursor(dbp,
+ dbc_parent->txn, dbtype, root, 1, &opd)) != 0)
+ return (ret);
+
+ CDB_LOCKING_COPY(dbp, dbc_parent, opd);
+
+ *dbcp = opd;
+
+ return (0);
+}
+
+/*
+ * __db_c_get --
+ * Get using a cursor.
+ *
+ * PUBLIC: int __db_c_get __P((DBC *, DBT *, DBT *, u_int32_t));
+ */
+int
+__db_c_get(dbc_arg, key, data, flags)
+ DBC *dbc_arg;
+ DBT *key, *data;
+ u_int32_t flags;
+{
+ DB *dbp;
+ DBC *dbc, *dbc_n, *opd;
+ DBC_INTERNAL *cp, *cp_n;
+ db_pgno_t pgno;
+ u_int32_t tmp_flags, tmp_rmw;
+ u_int8_t type;
+ int ret, t_ret;
+
+ /*
+ * Cursor Cleanup Note:
+ * All of the cursors passed to the underlying access methods by this
+ * routine are duplicated cursors. On return, any referenced pages
+ * will be discarded, and, if the cursor is not intended to be used
+ * again, the close function will be called. So, pages/locks that
+ * the cursor references do not need to be resolved by the underlying
+ * functions.
+ */
+ dbp = dbc_arg->dbp;
+ dbc_n = NULL;
+ opd = NULL;
+
+ PANIC_CHECK(dbp->dbenv);
+
+ /* Check for invalid flags. */
+ if ((ret =
+ __db_cgetchk(dbp, key, data, flags, IS_INITIALIZED(dbc_arg))) != 0)
+ return (ret);
+
+ /* Clear OR'd in additional bits so we can check for flag equality. */
+ tmp_rmw = LF_ISSET(DB_RMW);
+ LF_CLR(DB_RMW);
+
+ DEBUG_LREAD(dbc_arg, dbc_arg->txn, "db_c_get",
+ flags == DB_SET || flags == DB_SET_RANGE ? key : NULL, NULL, flags);
+
+ /*
+ * Return a cursor's record number. It has nothing to do with the
+ * cursor get code except that it was put into the interface.
+ */
+ if (flags == DB_GET_RECNO)
+ return (__bam_c_rget(dbc_arg, data, flags | tmp_rmw));
+
+ if (flags == DB_CONSUME || flags == DB_CONSUME_WAIT)
+ CDB_LOCKING_INIT(dbp, dbc_arg);
+
+ /*
+ * If we have an off-page duplicates cursor, and the operation applies
+ * to it, perform the operation. Duplicate the cursor and call the
+ * underlying function.
+ *
+ * Off-page duplicate trees are locked in the primary tree, that is,
+ * we acquire a write lock in the primary tree and no locks in the
+ * off-page dup tree. If the DB_RMW flag was specified and the get
+ * operation is done in an off-page duplicate tree, call the primary
+ * cursor's upgrade routine first.
+ */
+ cp = dbc_arg->internal;
+ if (cp->opd != NULL &&
+ (flags == DB_CURRENT || flags == DB_GET_BOTHC ||
+ flags == DB_NEXT || flags == DB_NEXT_DUP || flags == DB_PREV)) {
+ if (tmp_rmw && (ret = dbc_arg->c_am_writelock(dbc_arg)) != 0)
+ return (ret);
+ if ((ret = __db_c_idup(cp->opd, &opd, DB_POSITIONI)) != 0)
+ return (ret);
+
+ switch (ret = opd->c_am_get(
+ opd, key, data, flags, NULL)) {
+ case 0:
+ goto done;
+ case DB_NOTFOUND:
+ /*
+ * Translate DB_NOTFOUND failures for the DB_NEXT and
+ * DB_PREV operations into a subsequent operation on
+ * the parent cursor.
+ */
+ if (flags == DB_NEXT || flags == DB_PREV) {
+ if ((ret = opd->c_close(opd)) != 0)
+ goto err;
+ opd = NULL;
+ break;
+ }
+ goto err;
+ default:
+ goto err;
+ }
+ }
+
+ /*
+ * Perform an operation on the main cursor. Duplicate the cursor,
+ * upgrade the lock as required, and call the underlying function.
+ */
+ switch (flags) {
+ case DB_CURRENT:
+ case DB_GET_BOTHC:
+ case DB_NEXT:
+ case DB_NEXT_DUP:
+ case DB_NEXT_NODUP:
+ case DB_PREV:
+ case DB_PREV_NODUP:
+ tmp_flags = DB_POSITIONI;
+ break;
+ default:
+ tmp_flags = 0;
+ break;
+ }
+
+ /*
+ * If this cursor is going to be closed immediately, we don't
+ * need to take precautions to clean it up on error.
+ */
+ if (F_ISSET(dbc_arg, DBC_TRANSIENT))
+ dbc_n = dbc_arg;
+ else if ((ret = __db_c_idup(dbc_arg, &dbc_n, tmp_flags)) != 0)
+ goto err;
+
+ if (tmp_rmw)
+ F_SET(dbc_n, DBC_RMW);
+ pgno = PGNO_INVALID;
+ ret = dbc_n->c_am_get(dbc_n, key, data, flags, &pgno);
+ if (tmp_rmw)
+ F_CLR(dbc_n, DBC_RMW);
+ if (ret != 0)
+ goto err;
+
+ cp_n = dbc_n->internal;
+
+ /*
+ * We may be referencing a new off-page duplicates tree. Acquire
+ * a new cursor and call the underlying function.
+ */
+ if (pgno != PGNO_INVALID) {
+ if ((ret = __db_c_newopd(dbc_arg, pgno, &cp_n->opd)) != 0)
+ goto err;
+
+ switch (flags) {
+ case DB_FIRST:
+ case DB_NEXT:
+ case DB_NEXT_NODUP:
+ case DB_SET:
+ case DB_SET_RECNO:
+ case DB_SET_RANGE:
+ tmp_flags = DB_FIRST;
+ break;
+ case DB_LAST:
+ case DB_PREV:
+ case DB_PREV_NODUP:
+ tmp_flags = DB_LAST;
+ break;
+ case DB_GET_BOTH:
+ tmp_flags = DB_GET_BOTH;
+ break;
+ case DB_GET_BOTHC:
+ tmp_flags = DB_GET_BOTHC;
+ break;
+ default:
+ ret =
+ __db_unknown_flag(dbp->dbenv, "__db_c_get", flags);
+ goto err;
+ }
+ if ((ret = cp_n->opd->c_am_get(
+ cp_n->opd, key, data, tmp_flags, NULL)) != 0)
+ goto err;
+ }
+
+done: /*
+ * Return a key/data item. The only exception is that we don't return
+ * a key if the user already gave us one, that is, if the DB_SET flag
+ * was set. The DB_SET flag is necessary. In a Btree, the user's key
+ * doesn't have to be the same as the key stored the tree, depending on
+ * the magic performed by the comparison function. As we may not have
+ * done any key-oriented operation here, the page reference may not be
+ * valid. Fill it in as necessary. We don't have to worry about any
+ * locks, the cursor must already be holding appropriate locks.
+ *
+ * XXX
+ * If not a Btree and DB_SET_RANGE is set, we shouldn't return a key
+ * either, should we?
+ */
+ cp_n = dbc_n == NULL ? dbc_arg->internal : dbc_n->internal;
+ if (!F_ISSET(key, DB_DBT_ISSET)) {
+ if (cp_n->page == NULL && (ret =
+ memp_fget(dbp->mpf, &cp_n->pgno, 0, &cp_n->page)) != 0)
+ goto err;
+
+ if ((ret = __db_ret(dbp, cp_n->page, cp_n->indx,
+ key, &dbc_arg->rkey.data, &dbc_arg->rkey.ulen)) != 0)
+ goto err;
+ }
+ dbc = opd != NULL ? opd : cp_n->opd != NULL ? cp_n->opd : dbc_n;
+ if (!F_ISSET(data, DB_DBT_ISSET)) {
+ type = TYPE(dbc->internal->page);
+ ret = __db_ret(dbp, dbc->internal->page, dbc->internal->indx +
+ (type == P_LBTREE || type == P_HASH ? O_INDX : 0),
+ data, &dbc_arg->rdata.data, &dbc_arg->rdata.ulen);
+ }
+
+err: /* Don't pass DB_DBT_ISSET back to application level, error or no. */
+ F_CLR(key, DB_DBT_ISSET);
+ F_CLR(data, DB_DBT_ISSET);
+
+ /* Cleanup and cursor resolution. */
+ if (opd != NULL) {
+ if ((t_ret =
+ __db_c_cleanup(dbc_arg->internal->opd,
+ opd, ret)) != 0 && ret == 0)
+ ret = t_ret;
+
+ }
+
+ if ((t_ret = __db_c_cleanup(dbc_arg, dbc_n, ret)) != 0 && ret == 0)
+ ret = t_ret;
+
+ if (flags == DB_CONSUME || flags == DB_CONSUME_WAIT)
+ CDB_LOCKING_DONE(dbp, dbc_arg);
+ return (ret);
+}
+
+/*
+ * __db_c_put --
+ * Put using a cursor.
+ *
+ * PUBLIC: int __db_c_put __P((DBC *, DBT *, DBT *, u_int32_t));
+ */
+int
+__db_c_put(dbc_arg, key, data, flags)
+ DBC *dbc_arg;
+ DBT *key, *data;
+ u_int32_t flags;
+{
+ DB *dbp;
+ DBC *dbc_n, *opd;
+ db_pgno_t pgno;
+ u_int32_t tmp_flags;
+ int ret, t_ret;
+
+ /*
+ * Cursor Cleanup Note:
+ * All of the cursors passed to the underlying access methods by this
+ * routine are duplicated cursors. On return, any referenced pages
+ * will be discarded, and, if the cursor is not intended to be used
+ * again, the close function will be called. So, pages/locks that
+ * the cursor references do not need to be resolved by the underlying
+ * functions.
+ */
+ dbp = dbc_arg->dbp;
+ dbc_n = NULL;
+
+ PANIC_CHECK(dbp->dbenv);
+ DB_CHECK_TXN(dbp, dbc_arg->txn);
+
+ /* Check for invalid flags. */
+ if ((ret = __db_cputchk(dbp, key, data, flags,
+ F_ISSET(dbp, DB_AM_RDONLY), IS_INITIALIZED(dbc_arg))) != 0)
+ return (ret);
+
+ DEBUG_LWRITE(dbc_arg, dbc_arg->txn, "db_c_put",
+ flags == DB_KEYFIRST || flags == DB_KEYLAST ||
+ flags == DB_NODUPDATA ? key : NULL, data, flags);
+
+ CDB_LOCKING_INIT(dbp, dbc_arg);
+
+ /*
+ * If we have an off-page duplicates cursor, and the operation applies
+ * to it, perform the operation. Duplicate the cursor and call the
+ * underlying function.
+ *
+ * Off-page duplicate trees are locked in the primary tree, that is,
+ * we acquire a write lock in the primary tree and no locks in the
+ * off-page dup tree. If the put operation is done in an off-page
+ * duplicate tree, call the primary cursor's upgrade routine first.
+ */
+ if (dbc_arg->internal->opd != NULL &&
+ (flags == DB_AFTER || flags == DB_BEFORE || flags == DB_CURRENT)) {
+ /*
+ * A special case for hash off-page duplicates. Hash doesn't
+ * support (and is documented not to support) put operations
+ * relative to a cursor which references an already deleted
+ * item. For consistency, apply the same criteria to off-page
+ * duplicates as well.
+ */
+ if (dbc_arg->dbtype == DB_HASH && F_ISSET(
+ ((BTREE_CURSOR *)(dbc_arg->internal->opd->internal)),
+ C_DELETED)) {
+ ret = DB_NOTFOUND;
+ goto err;
+ }
+
+ if ((ret = dbc_arg->c_am_writelock(dbc_arg)) != 0)
+ return (ret);
+ if ((ret = __db_c_dup(dbc_arg, &dbc_n, DB_POSITIONI)) != 0)
+ goto err;
+ opd = dbc_n->internal->opd;
+ if ((ret = opd->c_am_put(
+ opd, key, data, flags, NULL)) != 0)
+ goto err;
+ goto done;
+ }
+
+ /*
+ * Perform an operation on the main cursor. Duplicate the cursor,
+ * and call the underlying function.
+ *
+ * XXX: MARGO
+ *
+ tmp_flags = flags == DB_AFTER ||
+ flags == DB_BEFORE || flags == DB_CURRENT ? DB_POSITIONI : 0;
+ */
+ tmp_flags = DB_POSITIONI;
+
+ /*
+ * If this cursor is going to be closed immediately, we don't
+ * need to take precautions to clean it up on error.
+ */
+ if (F_ISSET(dbc_arg, DBC_TRANSIENT))
+ dbc_n = dbc_arg;
+ else if ((ret = __db_c_idup(dbc_arg, &dbc_n, tmp_flags)) != 0)
+ goto err;
+
+ pgno = PGNO_INVALID;
+ if ((ret = dbc_n->c_am_put(dbc_n, key, data, flags, &pgno)) != 0)
+ goto err;
+
+ /*
+ * We may be referencing a new off-page duplicates tree. Acquire
+ * a new cursor and call the underlying function.
+ */
+ if (pgno != PGNO_INVALID) {
+ if ((ret = __db_c_newopd(dbc_arg, pgno, &opd)) != 0)
+ goto err;
+ dbc_n->internal->opd = opd;
+
+ if ((ret = opd->c_am_put(
+ opd, key, data, flags, NULL)) != 0)
+ goto err;
+ }
+
+done:
+err: /* Cleanup and cursor resolution. */
+ if ((t_ret = __db_c_cleanup(dbc_arg, dbc_n, ret)) != 0 && ret == 0)
+ ret = t_ret;
+
+ CDB_LOCKING_DONE(dbp, dbc_arg);
+
+ return (ret);
+}
+
+/*
+ * __db_duperr()
+ * Error message: we don't currently support sorted duplicate duplicates.
+ * PUBLIC: int __db_duperr __P((DB *, u_int32_t));
+ */
+int
+__db_duperr(dbp, flags)
+ DB *dbp;
+ u_int32_t flags;
+{
+ if (flags != DB_NODUPDATA)
+ __db_err(dbp->dbenv,
+ "Duplicate data items are not supported with sorted data");
+ return (DB_KEYEXIST);
+}
+
+/*
+ * __db_c_cleanup --
+ * Clean up duplicate cursors.
+ */
+static int
+__db_c_cleanup(dbc, dbc_n, failed)
+ DBC *dbc, *dbc_n;
+ int failed;
+{
+ DB *dbp;
+ DBC *opd;
+ DBC_INTERNAL *internal;
+ int ret, t_ret;
+
+ dbp = dbc->dbp;
+ internal = dbc->internal;
+ ret = 0;
+
+ /* Discard any pages we're holding. */
+ if (internal->page != NULL) {
+ if ((t_ret =
+ memp_fput(dbp->mpf, internal->page, 0)) != 0 && ret == 0)
+ ret = t_ret;
+ internal->page = NULL;
+ }
+ opd = internal->opd;
+ if (opd != NULL && opd->internal->page != NULL) {
+ if ((t_ret = memp_fput(dbp->mpf,
+ opd->internal->page, 0)) != 0 && ret == 0)
+ ret = t_ret;
+ opd->internal->page = NULL;
+ }
+
+ /*
+ * If dbc_n is NULL, there's no internal cursor swapping to be
+ * done and no dbc_n to close--we probably did the entire
+ * operation on an offpage duplicate cursor. Just return.
+ */
+ if (dbc_n == NULL)
+ return (ret);
+
+ /*
+ * If dbc is marked DBC_TRANSIENT, we're inside a DB->{put/get}
+ * operation, and as an optimization we performed the operation on
+ * the main cursor rather than on a duplicated one. Assert
+ * that dbc_n == dbc (i.e., that we really did skip the
+ * duplication). Then just do nothing--even if there was
+ * an error, we're about to close the cursor, and the fact that we
+ * moved it isn't a user-visible violation of our "cursor
+ * stays put on error" rule.
+ */
+ if (F_ISSET(dbc, DBC_TRANSIENT)) {
+ DB_ASSERT(dbc == dbc_n);
+ return (ret);
+ }
+
+ if (dbc_n->internal->page != NULL) {
+ if ((t_ret = memp_fput(dbp->mpf,
+ dbc_n->internal->page, 0)) != 0 && ret == 0)
+ ret = t_ret;
+ dbc_n->internal->page = NULL;
+ }
+ opd = dbc_n->internal->opd;
+ if (opd != NULL && opd->internal->page != NULL) {
+ if ((t_ret = memp_fput(dbp->mpf,
+ opd->internal->page, 0)) != 0 && ret == 0)
+ ret = t_ret;
+ opd->internal->page = NULL;
+ }
+
+ /*
+ * If we didn't fail before entering this routine or just now when
+ * freeing pages, swap the interesting contents of the old and new
+ * cursors.
+ */
+ if (!failed && ret == 0) {
+ dbc->internal = dbc_n->internal;
+ dbc_n->internal = internal;
+ }
+
+ /*
+ * Close the cursor we don't care about anymore. The close can fail,
+ * but we only expect DB_LOCK_DEADLOCK failures. This violates our
+ * "the cursor is unchanged on error" semantics, but since all you can
+ * do with a DB_LOCK_DEADLOCK failure is close the cursor, I believe
+ * that's OK.
+ *
+ * XXX
+ * There's no way to recover from failure to close the old cursor.
+ * All we can do is move to the new position and return an error.
+ *
+ * XXX
+ * We might want to consider adding a flag to the cursor, so that any
+ * subsequent operations other than close just return an error?
+ */
+ if ((t_ret = dbc_n->c_close(dbc_n)) != 0 && ret == 0)
+ ret = t_ret;
+
+ return (ret);
+}
+
+/*
+ * __db_wrlock_err -- do not have a write lock.
+ */
+static int
+__db_wrlock_err(dbenv)
+ DB_ENV *dbenv;
+{
+ __db_err(dbenv, "Write attempted on read-only cursor");
+ return (EPERM);
+}
diff --git a/bdb/db/db_conv.c b/bdb/db/db_conv.c
new file mode 100644
index 00000000000..df60be06790
--- /dev/null
+++ b/bdb/db/db_conv.c
@@ -0,0 +1,348 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ * Sleepycat Software. All rights reserved.
+ */
+/*
+ * Copyright (c) 1990, 1993, 1994, 1995, 1996
+ * Keith Bostic. All rights reserved.
+ */
+/*
+ * Copyright (c) 1990, 1993, 1994, 1995
+ * The Regents of the University of California. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of the University nor the names of its contributors
+ * may be used to endorse or promote products derived from this software
+ * without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_conv.c,v 11.11 2000/11/30 00:58:31 ubell Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_swap.h"
+#include "db_am.h"
+#include "btree.h"
+#include "hash.h"
+#include "qam.h"
+
+/*
+ * __db_pgin --
+ * Primary page-swap routine.
+ *
+ * PUBLIC: int __db_pgin __P((DB_ENV *, db_pgno_t, void *, DBT *));
+ */
+int
+__db_pgin(dbenv, pg, pp, cookie)
+ DB_ENV *dbenv;
+ db_pgno_t pg;
+ void *pp;
+ DBT *cookie;
+{
+ DB_PGINFO *pginfo;
+
+ pginfo = (DB_PGINFO *)cookie->data;
+
+ switch (((PAGE *)pp)->type) {
+ case P_HASH:
+ case P_HASHMETA:
+ case P_INVALID:
+ return (__ham_pgin(dbenv, pg, pp, cookie));
+ case P_BTREEMETA:
+ case P_IBTREE:
+ case P_IRECNO:
+ case P_LBTREE:
+ case P_LDUP:
+ case P_LRECNO:
+ case P_OVERFLOW:
+ return (__bam_pgin(dbenv, pg, pp, cookie));
+ case P_QAMMETA:
+ case P_QAMDATA:
+ return (__qam_pgin_out(dbenv, pg, pp, cookie));
+ default:
+ break;
+ }
+ return (__db_unknown_type(dbenv, "__db_pgin", ((PAGE *)pp)->type));
+}
+
+/*
+ * __db_pgout --
+ * Primary page-swap routine.
+ *
+ * PUBLIC: int __db_pgout __P((DB_ENV *, db_pgno_t, void *, DBT *));
+ */
+int
+__db_pgout(dbenv, pg, pp, cookie)
+ DB_ENV *dbenv;
+ db_pgno_t pg;
+ void *pp;
+ DBT *cookie;
+{
+ DB_PGINFO *pginfo;
+
+ pginfo = (DB_PGINFO *)cookie->data;
+
+ switch (((PAGE *)pp)->type) {
+ case P_HASH:
+ case P_HASHMETA:
+ case P_INVALID:
+ return (__ham_pgout(dbenv, pg, pp, cookie));
+ case P_BTREEMETA:
+ case P_IBTREE:
+ case P_IRECNO:
+ case P_LBTREE:
+ case P_LDUP:
+ case P_LRECNO:
+ case P_OVERFLOW:
+ return (__bam_pgout(dbenv, pg, pp, cookie));
+ case P_QAMMETA:
+ case P_QAMDATA:
+ return (__qam_pgin_out(dbenv, pg, pp, cookie));
+ default:
+ break;
+ }
+ return (__db_unknown_type(dbenv, "__db_pgout", ((PAGE *)pp)->type));
+}
+
+/*
+ * __db_metaswap --
+ * Byteswap the common part of the meta-data page.
+ *
+ * PUBLIC: void __db_metaswap __P((PAGE *));
+ */
+void
+__db_metaswap(pg)
+ PAGE *pg;
+{
+ u_int8_t *p;
+
+ p = (u_int8_t *)pg;
+
+ /* Swap the meta-data information. */
+ SWAP32(p); /* lsn.file */
+ SWAP32(p); /* lsn.offset */
+ SWAP32(p); /* pgno */
+ SWAP32(p); /* magic */
+ SWAP32(p); /* version */
+ SWAP32(p); /* pagesize */
+ p += 4; /* unused, page type, unused, unused */
+ SWAP32(p); /* free */
+ SWAP32(p); /* alloc_lsn part 1 */
+ SWAP32(p); /* alloc_lsn part 2 */
+ SWAP32(p); /* cached key count */
+ SWAP32(p); /* cached record count */
+ SWAP32(p); /* flags */
+}
+
+/*
+ * __db_byteswap --
+ * Byteswap a page.
+ *
+ * PUBLIC: int __db_byteswap __P((DB_ENV *, db_pgno_t, PAGE *, size_t, int));
+ */
+int
+__db_byteswap(dbenv, pg, h, pagesize, pgin)
+ DB_ENV *dbenv;
+ db_pgno_t pg;
+ PAGE *h;
+ size_t pagesize;
+ int pgin;
+{
+ BINTERNAL *bi;
+ BKEYDATA *bk;
+ BOVERFLOW *bo;
+ RINTERNAL *ri;
+ db_indx_t i, len, tmp;
+ u_int8_t *p, *end;
+
+ COMPQUIET(pg, 0);
+
+ if (pgin) {
+ M_32_SWAP(h->lsn.file);
+ M_32_SWAP(h->lsn.offset);
+ M_32_SWAP(h->pgno);
+ M_32_SWAP(h->prev_pgno);
+ M_32_SWAP(h->next_pgno);
+ M_16_SWAP(h->entries);
+ M_16_SWAP(h->hf_offset);
+ }
+
+ switch (h->type) {
+ case P_HASH:
+ for (i = 0; i < NUM_ENT(h); i++) {
+ if (pgin)
+ M_16_SWAP(h->inp[i]);
+
+ switch (HPAGE_TYPE(h, i)) {
+ case H_KEYDATA:
+ break;
+ case H_DUPLICATE:
+ len = LEN_HKEYDATA(h, pagesize, i);
+ p = HKEYDATA_DATA(P_ENTRY(h, i));
+ for (end = p + len; p < end;) {
+ if (pgin) {
+ P_16_SWAP(p);
+ memcpy(&tmp,
+ p, sizeof(db_indx_t));
+ p += sizeof(db_indx_t);
+ } else {
+ memcpy(&tmp,
+ p, sizeof(db_indx_t));
+ SWAP16(p);
+ }
+ p += tmp;
+ SWAP16(p);
+ }
+ break;
+ case H_OFFDUP:
+ p = HOFFPAGE_PGNO(P_ENTRY(h, i));
+ SWAP32(p); /* pgno */
+ break;
+ case H_OFFPAGE:
+ p = HOFFPAGE_PGNO(P_ENTRY(h, i));
+ SWAP32(p); /* pgno */
+ SWAP32(p); /* tlen */
+ break;
+ }
+
+ }
+
+ /*
+ * The offsets in the inp array are used to determine
+ * the size of entries on a page; therefore they
+ * cannot be converted until we've done all the
+ * entries.
+ */
+ if (!pgin)
+ for (i = 0; i < NUM_ENT(h); i++)
+ M_16_SWAP(h->inp[i]);
+ break;
+ case P_LBTREE:
+ case P_LDUP:
+ case P_LRECNO:
+ for (i = 0; i < NUM_ENT(h); i++) {
+ if (pgin)
+ M_16_SWAP(h->inp[i]);
+
+ /*
+ * In the case of on-page duplicates, key information
+ * should only be swapped once.
+ */
+ if (h->type == P_LBTREE && i > 1) {
+ if (pgin) {
+ if (h->inp[i] == h->inp[i - 2])
+ continue;
+ } else {
+ M_16_SWAP(h->inp[i]);
+ if (h->inp[i] == h->inp[i - 2])
+ continue;
+ M_16_SWAP(h->inp[i]);
+ }
+ }
+
+ bk = GET_BKEYDATA(h, i);
+ switch (B_TYPE(bk->type)) {
+ case B_KEYDATA:
+ M_16_SWAP(bk->len);
+ break;
+ case B_DUPLICATE:
+ case B_OVERFLOW:
+ bo = (BOVERFLOW *)bk;
+ M_32_SWAP(bo->pgno);
+ M_32_SWAP(bo->tlen);
+ break;
+ }
+
+ if (!pgin)
+ M_16_SWAP(h->inp[i]);
+ }
+ break;
+ case P_IBTREE:
+ for (i = 0; i < NUM_ENT(h); i++) {
+ if (pgin)
+ M_16_SWAP(h->inp[i]);
+
+ bi = GET_BINTERNAL(h, i);
+ M_16_SWAP(bi->len);
+ M_32_SWAP(bi->pgno);
+ M_32_SWAP(bi->nrecs);
+
+ switch (B_TYPE(bi->type)) {
+ case B_KEYDATA:
+ break;
+ case B_DUPLICATE:
+ case B_OVERFLOW:
+ bo = (BOVERFLOW *)bi->data;
+ M_32_SWAP(bo->pgno);
+ M_32_SWAP(bo->tlen);
+ break;
+ }
+
+ if (!pgin)
+ M_16_SWAP(h->inp[i]);
+ }
+ break;
+ case P_IRECNO:
+ for (i = 0; i < NUM_ENT(h); i++) {
+ if (pgin)
+ M_16_SWAP(h->inp[i]);
+
+ ri = GET_RINTERNAL(h, i);
+ M_32_SWAP(ri->pgno);
+ M_32_SWAP(ri->nrecs);
+
+ if (!pgin)
+ M_16_SWAP(h->inp[i]);
+ }
+ break;
+ case P_OVERFLOW:
+ case P_INVALID:
+ /* Nothing to do. */
+ break;
+ default:
+ return (__db_unknown_type(dbenv, "__db_byteswap", h->type));
+ }
+
+ if (!pgin) {
+ /* Swap the header information. */
+ M_32_SWAP(h->lsn.file);
+ M_32_SWAP(h->lsn.offset);
+ M_32_SWAP(h->pgno);
+ M_32_SWAP(h->prev_pgno);
+ M_32_SWAP(h->next_pgno);
+ M_16_SWAP(h->entries);
+ M_16_SWAP(h->hf_offset);
+ }
+ return (0);
+}
diff --git a/bdb/db/db_dispatch.c b/bdb/db/db_dispatch.c
new file mode 100644
index 00000000000..c9beac401a7
--- /dev/null
+++ b/bdb/db/db_dispatch.c
@@ -0,0 +1,983 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ * Sleepycat Software. All rights reserved.
+ */
+/*
+ * Copyright (c) 1995, 1996
+ * The President and Fellows of Harvard University. All rights reserved.
+ *
+ * This code is derived from software contributed to Berkeley by
+ * Margo Seltzer.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of the University nor the names of its contributors
+ * may be used to endorse or promote products derived from this software
+ * without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_dispatch.c,v 11.41 2001/01/11 18:19:50 bostic Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <stddef.h>
+#include <stdlib.h>
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_dispatch.h"
+#include "db_am.h"
+#include "log_auto.h"
+#include "txn.h"
+#include "txn_auto.h"
+#include "log.h"
+
+static int __db_txnlist_find_internal __P((void *, db_txnlist_type,
+ u_int32_t, u_int8_t [DB_FILE_ID_LEN], DB_TXNLIST **, int));
+
+/*
+ * __db_dispatch --
+ *
+ * This is the transaction dispatch function used by the db access methods.
+ * It is designed to handle the record format used by all the access
+ * methods (the one automatically generated by the db_{h,log,read}.sh
+ * scripts in the tools directory). An application using a different
+ * recovery paradigm will supply a different dispatch function to txn_open.
+ *
+ * PUBLIC: int __db_dispatch __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *));
+ */
+int
+__db_dispatch(dbenv, db, lsnp, redo, info)
+ DB_ENV *dbenv; /* The environment. */
+ DBT *db; /* The log record upon which to dispatch. */
+ DB_LSN *lsnp; /* The lsn of the record being dispatched. */
+ db_recops redo; /* Redo this op (or undo it). */
+ void *info;
+{
+ u_int32_t rectype, txnid;
+ int make_call, ret;
+
+ memcpy(&rectype, db->data, sizeof(rectype));
+ memcpy(&txnid, (u_int8_t *)db->data + sizeof(rectype), sizeof(txnid));
+ make_call = ret = 0;
+
+ /*
+ * If we find a record that is in the user's number space and they
+ * have specified a recovery routine, let them handle it. If they
+ * didn't specify a recovery routine, then we expect that they've
+ * followed all our rules and registered new recovery functions.
+ */
+ switch (redo) {
+ case DB_TXN_ABORT:
+ /*
+ * XXX
+ * db_printlog depends on DB_TXN_ABORT not examining the TXN
+ * list. If that ever changes, fix db_printlog too.
+ */
+ make_call = 1;
+ break;
+ case DB_TXN_OPENFILES:
+ if (rectype == DB_log_register)
+ return (dbenv->dtab[rectype](dbenv,
+ db, lsnp, redo, info));
+ break;
+ case DB_TXN_BACKWARD_ROLL:
+ /*
+ * Running full recovery in the backward pass. If we've
+ * seen this txnid before and added to it our commit list,
+ * then we do nothing during this pass, unless this is a child
+ * commit record, in which case we need to process it. If
+ * we've never seen it, then we call the appropriate recovery
+ * routine.
+ *
+ * We need to always undo DB_db_noop records, so that we
+ * properly handle any aborts before the file was closed.
+ */
+ if (rectype == DB_log_register ||
+ rectype == DB_txn_ckp || rectype == DB_db_noop
+ || rectype == DB_txn_child || (txnid != 0 &&
+ (ret = __db_txnlist_find(info, txnid)) != 0)) {
+ make_call = 1;
+ if (ret == DB_NOTFOUND && rectype != DB_txn_regop &&
+ rectype != DB_txn_xa_regop && (ret =
+ __db_txnlist_add(dbenv, info, txnid, 1)) != 0)
+ return (ret);
+ }
+ break;
+ case DB_TXN_FORWARD_ROLL:
+ /*
+ * In the forward pass, if we haven't seen the transaction,
+ * do nothing, else recovery it.
+ *
+ * We need to always redo DB_db_noop records, so that we
+ * properly handle any commits after the file was closed.
+ */
+ if (rectype == DB_log_register ||
+ rectype == DB_txn_ckp ||
+ rectype == DB_db_noop ||
+ __db_txnlist_find(info, txnid) == 0)
+ make_call = 1;
+ break;
+ default:
+ return (__db_unknown_flag(dbenv, "__db_dispatch", redo));
+ }
+
+ if (make_call) {
+ if (rectype >= DB_user_BEGIN && dbenv->tx_recover != NULL)
+ return (dbenv->tx_recover(dbenv, db, lsnp, redo));
+ else
+ return (dbenv->dtab[rectype](dbenv, db, lsnp, redo, info));
+ }
+
+ return (0);
+}
+
+/*
+ * __db_add_recovery --
+ *
+ * PUBLIC: int __db_add_recovery __P((DB_ENV *,
+ * PUBLIC: int (*)(DB_ENV *, DBT *, DB_LSN *, db_recops, void *), u_int32_t));
+ */
+int
+__db_add_recovery(dbenv, func, ndx)
+ DB_ENV *dbenv;
+ int (*func) __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *));
+ u_int32_t ndx;
+{
+ u_int32_t i, nsize;
+ int ret;
+
+ /* Check if we have to grow the table. */
+ if (ndx >= dbenv->dtab_size) {
+ nsize = ndx + 40;
+ if ((ret = __os_realloc(dbenv,
+ nsize * sizeof(dbenv->dtab[0]), NULL, &dbenv->dtab)) != 0)
+ return (ret);
+ for (i = dbenv->dtab_size; i < nsize; ++i)
+ dbenv->dtab[i] = NULL;
+ dbenv->dtab_size = nsize;
+ }
+
+ dbenv->dtab[ndx] = func;
+ return (0);
+}
+
+/*
+ * __deprecated_recover --
+ * Stub routine for deprecated recovery functions.
+ *
+ * PUBLIC: int __deprecated_recover
+ * PUBLIC: __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *));
+ */
+int
+__deprecated_recover(dbenv, dbtp, lsnp, op, info)
+ DB_ENV *dbenv;
+ DBT *dbtp;
+ DB_LSN *lsnp;
+ db_recops op;
+ void *info;
+{
+ COMPQUIET(dbenv, NULL);
+ COMPQUIET(dbtp, NULL);
+ COMPQUIET(lsnp, NULL);
+ COMPQUIET(op, 0);
+ COMPQUIET(info, NULL);
+ return (EINVAL);
+}
+
+/*
+ * __db_txnlist_init --
+ * Initialize transaction linked list.
+ *
+ * PUBLIC: int __db_txnlist_init __P((DB_ENV *, void *));
+ */
+int
+__db_txnlist_init(dbenv, retp)
+ DB_ENV *dbenv;
+ void *retp;
+{
+ DB_TXNHEAD *headp;
+ int ret;
+
+ if ((ret = __os_malloc(dbenv, sizeof(DB_TXNHEAD), NULL, &headp)) != 0)
+ return (ret);
+
+ LIST_INIT(&headp->head);
+ headp->maxid = 0;
+ headp->generation = 1;
+
+ *(void **)retp = headp;
+ return (0);
+}
+
+/*
+ * __db_txnlist_add --
+ * Add an element to our transaction linked list.
+ *
+ * PUBLIC: int __db_txnlist_add __P((DB_ENV *, void *, u_int32_t, int32_t));
+ */
+int
+__db_txnlist_add(dbenv, listp, txnid, aborted)
+ DB_ENV *dbenv;
+ void *listp;
+ u_int32_t txnid;
+ int32_t aborted;
+{
+ DB_TXNHEAD *hp;
+ DB_TXNLIST *elp;
+ int ret;
+
+ if ((ret = __os_malloc(dbenv, sizeof(DB_TXNLIST), NULL, &elp)) != 0)
+ return (ret);
+
+ hp = (DB_TXNHEAD *)listp;
+ LIST_INSERT_HEAD(&hp->head, elp, links);
+
+ elp->type = TXNLIST_TXNID;
+ elp->u.t.txnid = txnid;
+ elp->u.t.aborted = aborted;
+ if (txnid > hp->maxid)
+ hp->maxid = txnid;
+ elp->u.t.generation = hp->generation;
+
+ return (0);
+}
+/*
+ * __db_txnlist_remove --
+ * Remove an element from our transaction linked list.
+ *
+ * PUBLIC: int __db_txnlist_remove __P((void *, u_int32_t));
+ */
+int
+__db_txnlist_remove(listp, txnid)
+ void *listp;
+ u_int32_t txnid;
+{
+ DB_TXNLIST *entry;
+
+ return (__db_txnlist_find_internal(listp,
+ TXNLIST_TXNID, txnid, NULL, &entry, 1));
+}
+
+/* __db_txnlist_close --
+ *
+ * Call this when we close a file. It allows us to reconcile whether
+ * we have done any operations on this file with whether the file appears
+ * to have been deleted. If you never do any operations on a file, then
+ * we assume it's OK to appear deleted.
+ *
+ * PUBLIC: int __db_txnlist_close __P((void *, int32_t, u_int32_t));
+ */
+
+int
+__db_txnlist_close(listp, lid, count)
+ void *listp;
+ int32_t lid;
+ u_int32_t count;
+{
+ DB_TXNHEAD *hp;
+ DB_TXNLIST *p;
+
+ hp = (DB_TXNHEAD *)listp;
+ for (p = LIST_FIRST(&hp->head); p != NULL; p = LIST_NEXT(p, links)) {
+ if (p->type == TXNLIST_DELETE)
+ if (lid == p->u.d.fileid &&
+ !F_ISSET(&p->u.d, TXNLIST_FLAG_CLOSED)) {
+ p->u.d.count += count;
+ return (0);
+ }
+ }
+
+ return (0);
+}
+
+/*
+ * __db_txnlist_delete --
+ *
+ * Record that a file was missing or deleted. If the deleted
+ * flag is set, then we've encountered a delete of a file, else we've
+ * just encountered a file that is missing. The lid is the log fileid
+ * and is only meaningful if deleted is not equal to 0.
+ *
+ * PUBLIC: int __db_txnlist_delete __P((DB_ENV *,
+ * PUBLIC: void *, char *, u_int32_t, int));
+ */
+int
+__db_txnlist_delete(dbenv, listp, name, lid, deleted)
+ DB_ENV *dbenv;
+ void *listp;
+ char *name;
+ u_int32_t lid;
+ int deleted;
+{
+ DB_TXNHEAD *hp;
+ DB_TXNLIST *p;
+ int ret;
+
+ hp = (DB_TXNHEAD *)listp;
+ for (p = LIST_FIRST(&hp->head); p != NULL; p = LIST_NEXT(p, links)) {
+ if (p->type == TXNLIST_DELETE)
+ if (strcmp(name, p->u.d.fname) == 0) {
+ if (deleted)
+ F_SET(&p->u.d, TXNLIST_FLAG_DELETED);
+ else
+ F_CLR(&p->u.d, TXNLIST_FLAG_CLOSED);
+ return (0);
+ }
+ }
+
+ /* Need to add it. */
+ if ((ret = __os_malloc(dbenv, sizeof(DB_TXNLIST), NULL, &p)) != 0)
+ return (ret);
+ LIST_INSERT_HEAD(&hp->head, p, links);
+
+ p->type = TXNLIST_DELETE;
+ p->u.d.flags = 0;
+ if (deleted)
+ F_SET(&p->u.d, TXNLIST_FLAG_DELETED);
+ p->u.d.fileid = lid;
+ p->u.d.count = 0;
+ ret = __os_strdup(dbenv, name, &p->u.d.fname);
+
+ return (ret);
+}
+
+/*
+ * __db_txnlist_end --
+ * Discard transaction linked list. Print out any error messages
+ * for deleted files.
+ *
+ * PUBLIC: void __db_txnlist_end __P((DB_ENV *, void *));
+ */
+void
+__db_txnlist_end(dbenv, listp)
+ DB_ENV *dbenv;
+ void *listp;
+{
+ DB_TXNHEAD *hp;
+ DB_TXNLIST *p;
+ DB_LOG *lp;
+
+ hp = (DB_TXNHEAD *)listp;
+ lp = (DB_LOG *)dbenv->lg_handle;
+ while (hp != NULL && (p = LIST_FIRST(&hp->head)) != NULL) {
+ LIST_REMOVE(p, links);
+ switch (p->type) {
+ case TXNLIST_DELETE:
+ /*
+ * If we have a file that is not deleted and has
+ * some operations, we flag the warning. Since
+ * the file could still be open, we need to check
+ * the actual log table as well.
+ */
+ if ((!F_ISSET(&p->u.d, TXNLIST_FLAG_DELETED) &&
+ p->u.d.count != 0) ||
+ (!F_ISSET(&p->u.d, TXNLIST_FLAG_CLOSED) &&
+ p->u.d.fileid != (int32_t) TXNLIST_INVALID_ID &&
+ p->u.d.fileid < lp->dbentry_cnt &&
+ lp->dbentry[p->u.d.fileid].count != 0))
+ __db_err(dbenv, "warning: %s: %s",
+ p->u.d.fname, db_strerror(ENOENT));
+ __os_freestr(p->u.d.fname);
+ break;
+ case TXNLIST_LSN:
+ __os_free(p->u.l.lsn_array,
+ p->u.l.maxn * sizeof(DB_LSN));
+ break;
+ default:
+ /* Possibly an incomplete DB_TXNLIST; just free it. */
+ break;
+ }
+ __os_free(p, sizeof(DB_TXNLIST));
+ }
+ __os_free(listp, sizeof(DB_TXNHEAD));
+}
+
+/*
+ * __db_txnlist_find --
+ * Checks to see if a txnid with the current generation is in the
+ * txnid list. This returns DB_NOTFOUND if the item isn't in the
+ * list otherwise it returns (like __db_txnlist_find_internal) a
+ * 1 or 0 indicating if the transaction is aborted or not. A txnid
+ * of 0 means the record was generated while not in a transaction.
+ *
+ * PUBLIC: int __db_txnlist_find __P((void *, u_int32_t));
+ */
+int
+__db_txnlist_find(listp, txnid)
+ void *listp;
+ u_int32_t txnid;
+{
+ DB_TXNLIST *entry;
+
+ if (txnid == 0)
+ return (DB_NOTFOUND);
+ return (__db_txnlist_find_internal(listp,
+ TXNLIST_TXNID, txnid, NULL, &entry, 0));
+}
+
+/*
+ * __db_txnlist_find_internal --
+ * Find an entry on the transaction list.
+ * If the entry is not there or the list pointeris not initialized
+ * we return DB_NOTFOUND. If the item is found, we return the aborted
+ * status (1 for aborted, 0 for not aborted). Currently we always call
+ * this with an initialized list pointer but checking for NULL keeps it general.
+ */
+static int
+__db_txnlist_find_internal(listp, type, txnid, uid, txnlistp, delete)
+ void *listp;
+ db_txnlist_type type;
+ u_int32_t txnid;
+ u_int8_t uid[DB_FILE_ID_LEN];
+ DB_TXNLIST **txnlistp;
+ int delete;
+{
+ DB_TXNHEAD *hp;
+ DB_TXNLIST *p;
+ int ret;
+
+ if ((hp = (DB_TXNHEAD *)listp) == NULL)
+ return (DB_NOTFOUND);
+
+ for (p = LIST_FIRST(&hp->head); p != NULL; p = LIST_NEXT(p, links)) {
+ if (p->type != type)
+ continue;
+ switch (type) {
+ case TXNLIST_TXNID:
+ if (p->u.t.txnid != txnid ||
+ hp->generation != p->u.t.generation)
+ continue;
+ ret = p->u.t.aborted;
+ break;
+
+ case TXNLIST_PGNO:
+ if (memcmp(uid, p->u.p.uid, DB_FILE_ID_LEN) != 0)
+ continue;
+
+ ret = 0;
+ break;
+ default:
+ DB_ASSERT(0);
+ ret = EINVAL;
+ }
+ if (delete == 1) {
+ LIST_REMOVE(p, links);
+ __os_free(p, sizeof(DB_TXNLIST));
+ } else if (p != LIST_FIRST(&hp->head)) {
+ /* Move it to head of list. */
+ LIST_REMOVE(p, links);
+ LIST_INSERT_HEAD(&hp->head, p, links);
+ }
+ *txnlistp = p;
+ return (ret);
+ }
+
+ return (DB_NOTFOUND);
+}
+
+/*
+ * __db_txnlist_gen --
+ * Change the current generation number.
+ *
+ * PUBLIC: void __db_txnlist_gen __P((void *, int));
+ */
+void
+__db_txnlist_gen(listp, incr)
+ void *listp;
+ int incr;
+{
+ DB_TXNHEAD *hp;
+
+ /*
+ * During recovery generation numbers keep track of how many "restart"
+ * checkpoints we've seen. Restart checkpoints occur whenever we take
+ * a checkpoint and there are no outstanding transactions. When that
+ * happens, we can reset transaction IDs back to 1. It always happens
+ * at recovery and it prevents us from exhausting the transaction IDs
+ * name space.
+ */
+ hp = (DB_TXNHEAD *)listp;
+ hp->generation += incr;
+}
+
+#define TXN_BUBBLE(AP, MAX) { \
+ int __j; \
+ DB_LSN __tmp; \
+ \
+ for (__j = 0; __j < MAX - 1; __j++) \
+ if (log_compare(&AP[__j], &AP[__j + 1]) < 0) { \
+ __tmp = AP[__j]; \
+ AP[__j] = AP[__j + 1]; \
+ AP[__j + 1] = __tmp; \
+ } \
+}
+
+/*
+ * __db_txnlist_lsnadd --
+ * Add to or re-sort the transaction list lsn entry.
+ * Note that since this is used during an abort, the __txn_undo
+ * code calls into the "recovery" subsystem explicitly, and there
+ * is only a single TXNLIST_LSN entry on the list.
+ *
+ * PUBLIC: int __db_txnlist_lsnadd __P((DB_ENV *, void *, DB_LSN *, u_int32_t));
+ */
+int
+__db_txnlist_lsnadd(dbenv, listp, lsnp, flags)
+ DB_ENV *dbenv;
+ void *listp;
+ DB_LSN *lsnp;
+ u_int32_t flags;
+{
+ DB_TXNHEAD *hp;
+ DB_TXNLIST *elp;
+ int i, ret;
+
+ hp = (DB_TXNHEAD *)listp;
+
+ for (elp = LIST_FIRST(&hp->head);
+ elp != NULL; elp = LIST_NEXT(elp, links))
+ if (elp->type == TXNLIST_LSN)
+ break;
+
+ if (elp == NULL)
+ return (EINVAL);
+
+ if (LF_ISSET(TXNLIST_NEW)) {
+ if (elp->u.l.ntxns >= elp->u.l.maxn) {
+ if ((ret = __os_realloc(dbenv,
+ 2 * elp->u.l.maxn * sizeof(DB_LSN),
+ NULL, &elp->u.l.lsn_array)) != 0)
+ return (ret);
+ elp->u.l.maxn *= 2;
+ }
+ elp->u.l.lsn_array[elp->u.l.ntxns++] = *lsnp;
+ } else
+ /* Simply replace the 0th element. */
+ elp->u.l.lsn_array[0] = *lsnp;
+
+ /*
+ * If we just added a new entry and there may be NULL
+ * entries, so we have to do a complete bubble sort,
+ * not just trickle a changed entry around.
+ */
+ for (i = 0; i < (!LF_ISSET(TXNLIST_NEW) ? 1 : elp->u.l.ntxns); i++)
+ TXN_BUBBLE(elp->u.l.lsn_array, elp->u.l.ntxns);
+
+ *lsnp = elp->u.l.lsn_array[0];
+
+ return (0);
+}
+
+/*
+ * __db_txnlist_lsnhead --
+ * Return a pointer to the beginning of the lsn_array.
+ *
+ * PUBLIC: int __db_txnlist_lsnhead __P((void *, DB_LSN **));
+ */
+int
+__db_txnlist_lsnhead(listp, lsnpp)
+ void *listp;
+ DB_LSN **lsnpp;
+{
+ DB_TXNHEAD *hp;
+ DB_TXNLIST *elp;
+
+ hp = (DB_TXNHEAD *)listp;
+
+ for (elp = LIST_FIRST(&hp->head);
+ elp != NULL; elp = LIST_NEXT(elp, links))
+ if (elp->type == TXNLIST_LSN)
+ break;
+
+ if (elp == NULL)
+ return (EINVAL);
+
+ *lsnpp = &elp->u.l.lsn_array[0];
+
+ return (0);
+}
+
+/*
+ * __db_txnlist_lsninit --
+ * Initialize a transaction list with an lsn array entry.
+ *
+ * PUBLIC: int __db_txnlist_lsninit __P((DB_ENV *, DB_TXNHEAD *, DB_LSN *));
+ */
+int
+__db_txnlist_lsninit(dbenv, hp, lsnp)
+ DB_ENV *dbenv;
+ DB_TXNHEAD *hp;
+ DB_LSN *lsnp;
+{
+ DB_TXNLIST *elp;
+ int ret;
+
+ elp = NULL;
+
+ if ((ret = __os_malloc(dbenv, sizeof(DB_TXNLIST), NULL, &elp)) != 0)
+ goto err;
+ LIST_INSERT_HEAD(&hp->head, elp, links);
+
+ if ((ret = __os_malloc(dbenv,
+ 12 * sizeof(DB_LSN), NULL, &elp->u.l.lsn_array)) != 0)
+ goto err;
+ elp->type = TXNLIST_LSN;
+ elp->u.l.maxn = 12;
+ elp->u.l.ntxns = 1;
+ elp->u.l.lsn_array[0] = *lsnp;
+
+ return (0);
+
+err: __db_txnlist_end(dbenv, hp);
+ return (ret);
+}
+
+/*
+ * __db_add_limbo -- add pages to the limbo list.
+ * Get the file information and call pgnoadd
+ * for each page.
+ *
+ * PUBLIC: int __db_add_limbo __P((DB_ENV *,
+ * PUBLIC: void *, int32_t, db_pgno_t, int32_t));
+ */
+int
+__db_add_limbo(dbenv, info, fileid, pgno, count)
+ DB_ENV *dbenv;
+ void *info;
+ int32_t fileid;
+ db_pgno_t pgno;
+ int32_t count;
+{
+ DB_LOG *dblp;
+ FNAME *fnp;
+ int ret;
+
+ dblp = dbenv->lg_handle;
+ if ((ret = __log_lid_to_fname(dblp, fileid, &fnp)) != 0)
+ return (ret);
+
+ do {
+ if ((ret =
+ __db_txnlist_pgnoadd(dbenv, info, fileid, fnp->ufid,
+ R_ADDR(&dblp->reginfo, fnp->name_off), pgno)) != 0)
+ return (ret);
+ pgno++;
+ } while (--count != 0);
+
+ return (0);
+}
+
+/*
+ * __db_do_the_limbo -- move pages from limbo to free.
+ *
+ * If we are in recovery we add things to the free list without
+ * logging becasue we want to incrementaly apply logs that
+ * may be generated on another copy of this environment.
+ * Otherwise we just call __db_free to put the pages on
+ * the free list and log the activity.
+ *
+ * PUBLIC: int __db_do_the_limbo __P((DB_ENV *, DB_TXNHEAD *));
+ */
+int
+__db_do_the_limbo(dbenv, hp)
+ DB_ENV *dbenv;
+ DB_TXNHEAD *hp;
+{
+ DB *dbp;
+ DBC *dbc;
+ DBMETA *meta;
+ DB_TXN *txn;
+ DB_TXNLIST *elp;
+ PAGE *pagep;
+ db_pgno_t last_pgno, pgno;
+ int i, in_recover, put_page, ret, t_ret;
+
+ dbp = NULL;
+ dbc = NULL;
+ txn = NULL;
+ ret = 0;
+
+ /* Are we in recovery? */
+ in_recover = F_ISSET((DB_LOG *)dbenv->lg_handle, DBLOG_RECOVER);
+
+ for (elp = LIST_FIRST(&hp->head);
+ elp != NULL; elp = LIST_NEXT(elp, links)) {
+ if (elp->type != TXNLIST_PGNO)
+ continue;
+
+ if (in_recover) {
+ if ((ret = db_create(&dbp, dbenv, 0)) != 0)
+ goto err;
+
+ /*
+ * It is ok if the file is nolonger there.
+ */
+ dbp->type = DB_UNKNOWN;
+ ret = __db_dbopen(dbp,
+ elp->u.p.fname, 0, __db_omode("rw----"), 0);
+ } else {
+ /*
+ * If we are in transaction undo, then we know
+ * the fileid is still correct.
+ */
+ if ((ret =
+ __db_fileid_to_db(dbenv, &dbp,
+ elp->u.p.fileid, 0)) != 0 && ret != DB_DELETED)
+ goto err;
+ /* File is being destroyed. */
+ if (F_ISSET(dbp, DB_AM_DISCARD))
+ ret = DB_DELETED;
+ }
+ /*
+ * Verify that we are opening the same file that we were
+ * referring to when we wrote this log record.
+ */
+ if (ret == 0 &&
+ memcmp(elp->u.p.uid, dbp->fileid, DB_FILE_ID_LEN) == 0) {
+ last_pgno = PGNO_INVALID;
+ if (in_recover) {
+ pgno = PGNO_BASE_MD;
+ if ((ret = memp_fget(dbp->mpf,
+ &pgno, 0, (PAGE **)&meta)) != 0)
+ goto err;
+ last_pgno = meta->free;
+ /*
+ * Check to see if the head of the free
+ * list is any of the pages we are about
+ * to link in. We could have crashed
+ * after linking them in and before writing
+ * a checkpoint.
+ * It may not be the last one since
+ * any page may get reallocated before here.
+ */
+ if (last_pgno != PGNO_INVALID)
+ for (i = 0; i < elp->u.p.nentries; i++)
+ if (last_pgno
+ == elp->u.p.pgno_array[i])
+ goto done_it;
+ }
+
+ for (i = 0; i < elp->u.p.nentries; i++) {
+ pgno = elp->u.p.pgno_array[i];
+ if ((ret = memp_fget(dbp->mpf,
+ &pgno, DB_MPOOL_CREATE, &pagep)) != 0)
+ goto err;
+
+ put_page = 1;
+ if (IS_ZERO_LSN(LSN(pagep))) {
+ P_INIT(pagep, dbp->pgsize,
+ pgno, PGNO_INVALID,
+ last_pgno, 0, P_INVALID);
+
+ if (in_recover) {
+ LSN(pagep) = LSN(meta);
+ last_pgno = pgno;
+ } else {
+ /*
+ * Starting the transaction
+ * is postponed until we know
+ * we have something to do.
+ */
+ if (txn == NULL &&
+ (ret = txn_begin(dbenv,
+ NULL, &txn, 0)) != 0)
+ goto err;
+
+ if (dbc == NULL &&
+ (ret = dbp->cursor(dbp,
+ txn, &dbc, 0)) != 0)
+ goto err;
+ /* Turn off locking. */
+ F_SET(dbc, DBC_COMPENSATE);
+
+ /* __db_free puts the page. */
+ if ((ret =
+ __db_free(dbc, pagep)) != 0)
+ goto err;
+ put_page = 0;
+ }
+ }
+
+ if (put_page == 1 &&
+ (ret = memp_fput(dbp->mpf,
+ pagep, DB_MPOOL_DIRTY)) != 0)
+ goto err;
+ }
+ if (in_recover) {
+ if (last_pgno == meta->free) {
+done_it:
+ if ((ret =
+ memp_fput(dbp->mpf, meta, 0)) != 0)
+ goto err;
+ } else {
+ /*
+ * Flush the new free list then
+ * update the metapage. This is
+ * unlogged so we cannot have the
+ * metapage pointing at pages that
+ * are not on disk.
+ */
+ dbp->sync(dbp, 0);
+ meta->free = last_pgno;
+ if ((ret = memp_fput(dbp->mpf,
+ meta, DB_MPOOL_DIRTY)) != 0)
+ goto err;
+ }
+ }
+ if (dbc != NULL && (ret = dbc->c_close(dbc)) != 0)
+ goto err;
+ dbc = NULL;
+ }
+ if (in_recover && (t_ret = dbp->close(dbp, 0)) != 0 && ret == 0)
+ ret = t_ret;
+ dbp = NULL;
+ __os_free(elp->u.p.fname, 0);
+ __os_free(elp->u.p.pgno_array, 0);
+ if (ret == ENOENT)
+ ret = 0;
+ else if (ret != 0)
+ goto err;
+ }
+
+ if (txn != NULL) {
+ ret = txn_commit(txn, 0);
+ txn = NULL;
+ }
+err:
+ if (dbc != NULL)
+ (void)dbc->c_close(dbc);
+ if (in_recover && dbp != NULL)
+ (void)dbp->close(dbp, 0);
+ if (txn != NULL)
+ (void)txn_abort(txn);
+ return (ret);
+
+}
+
+#define DB_TXNLIST_MAX_PGNO 8 /* A nice even number. */
+
+/*
+ * __db_txnlist_pgnoadd --
+ * Find the txnlist entry for a file and add this pgno,
+ * or add the list entry for the file and then add the pgno.
+ *
+ * PUBLIC: int __db_txnlist_pgnoadd __P((DB_ENV *, DB_TXNHEAD *,
+ * PUBLIC: int32_t, u_int8_t [DB_FILE_ID_LEN], char *, db_pgno_t));
+ */
+int
+__db_txnlist_pgnoadd(dbenv, hp, fileid, uid, fname, pgno)
+ DB_ENV *dbenv;
+ DB_TXNHEAD *hp;
+ int32_t fileid;
+ u_int8_t uid[DB_FILE_ID_LEN];
+ char *fname;
+ db_pgno_t pgno;
+{
+ DB_TXNLIST *elp;
+ int len, ret;
+
+ elp = NULL;
+
+ if (__db_txnlist_find_internal(hp, TXNLIST_PGNO, 0, uid, &elp, 0) != 0) {
+ if ((ret =
+ __os_malloc(dbenv, sizeof(DB_TXNLIST), NULL, &elp)) != 0)
+ goto err;
+ LIST_INSERT_HEAD(&hp->head, elp, links);
+ elp->u.p.fileid = fileid;
+ memcpy(elp->u.p.uid, uid, DB_FILE_ID_LEN);
+
+ len = strlen(fname) + 1;
+ if ((ret = __os_malloc(dbenv, len, NULL, &elp->u.p.fname)) != 0)
+ goto err;
+ memcpy(elp->u.p.fname, fname, len);
+
+ elp->u.p.maxentry = 0;
+ elp->type = TXNLIST_PGNO;
+ if ((ret = __os_malloc(dbenv,
+ 8 * sizeof(db_pgno_t), NULL, &elp->u.p.pgno_array)) != 0)
+ goto err;
+ elp->u.p.maxentry = DB_TXNLIST_MAX_PGNO;
+ elp->u.p.nentries = 0;
+ } else if (elp->u.p.nentries == elp->u.p.maxentry) {
+ elp->u.p.maxentry <<= 1;
+ if ((ret = __os_realloc(dbenv, elp->u.p.maxentry *
+ sizeof(db_pgno_t), NULL, &elp->u.p.pgno_array)) != 0)
+ goto err;
+ }
+
+ elp->u.p.pgno_array[elp->u.p.nentries++] = pgno;
+
+ return (0);
+
+err: __db_txnlist_end(dbenv, hp);
+ return (ret);
+}
+
+#ifdef DEBUG
+/*
+ * __db_txnlist_print --
+ * Print out the transaction list.
+ *
+ * PUBLIC: void __db_txnlist_print __P((void *));
+ */
+void
+__db_txnlist_print(listp)
+ void *listp;
+{
+ DB_TXNHEAD *hp;
+ DB_TXNLIST *p;
+
+ hp = (DB_TXNHEAD *)listp;
+
+ printf("Maxid: %lu Generation: %lu\n",
+ (u_long)hp->maxid, (u_long)hp->generation);
+ for (p = LIST_FIRST(&hp->head); p != NULL; p = LIST_NEXT(p, links)) {
+ switch (p->type) {
+ case TXNLIST_TXNID:
+ printf("TXNID: %lu(%lu)\n",
+ (u_long)p->u.t.txnid, (u_long)p->u.t.generation);
+ break;
+ case TXNLIST_DELETE:
+ printf("FILE: %s id=%d ops=%d %s %s\n",
+ p->u.d.fname, p->u.d.fileid, p->u.d.count,
+ F_ISSET(&p->u.d, TXNLIST_FLAG_DELETED) ?
+ "(deleted)" : "(missing)",
+ F_ISSET(&p->u.d, TXNLIST_FLAG_CLOSED) ?
+ "(closed)" : "(open)");
+
+ break;
+ default:
+ printf("Unrecognized type: %d\n", p->type);
+ break;
+ }
+ }
+}
+#endif
diff --git a/bdb/db/db_dup.c b/bdb/db/db_dup.c
new file mode 100644
index 00000000000..6d8b2df9518
--- /dev/null
+++ b/bdb/db/db_dup.c
@@ -0,0 +1,275 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ * Sleepycat Software. All rights reserved.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_dup.c,v 11.18 2000/11/30 00:58:32 ubell Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_shash.h"
+#include "btree.h"
+#include "hash.h"
+#include "lock.h"
+#include "db_am.h"
+
+/*
+ * __db_ditem --
+ * Remove an item from a page.
+ *
+ * PUBLIC: int __db_ditem __P((DBC *, PAGE *, u_int32_t, u_int32_t));
+ */
+int
+__db_ditem(dbc, pagep, indx, nbytes)
+ DBC *dbc;
+ PAGE *pagep;
+ u_int32_t indx, nbytes;
+{
+ DB *dbp;
+ DBT ldbt;
+ db_indx_t cnt, offset;
+ int ret;
+ u_int8_t *from;
+
+ dbp = dbc->dbp;
+ if (DB_LOGGING(dbc)) {
+ ldbt.data = P_ENTRY(pagep, indx);
+ ldbt.size = nbytes;
+ if ((ret = __db_addrem_log(dbp->dbenv, dbc->txn,
+ &LSN(pagep), 0, DB_REM_DUP, dbp->log_fileid, PGNO(pagep),
+ (u_int32_t)indx, nbytes, &ldbt, NULL, &LSN(pagep))) != 0)
+ return (ret);
+ }
+
+ /*
+ * If there's only a single item on the page, we don't have to
+ * work hard.
+ */
+ if (NUM_ENT(pagep) == 1) {
+ NUM_ENT(pagep) = 0;
+ HOFFSET(pagep) = dbp->pgsize;
+ return (0);
+ }
+
+ /*
+ * Pack the remaining key/data items at the end of the page. Use
+ * memmove(3), the regions may overlap.
+ */
+ from = (u_int8_t *)pagep + HOFFSET(pagep);
+ memmove(from + nbytes, from, pagep->inp[indx] - HOFFSET(pagep));
+ HOFFSET(pagep) += nbytes;
+
+ /* Adjust the indices' offsets. */
+ offset = pagep->inp[indx];
+ for (cnt = 0; cnt < NUM_ENT(pagep); ++cnt)
+ if (pagep->inp[cnt] < offset)
+ pagep->inp[cnt] += nbytes;
+
+ /* Shift the indices down. */
+ --NUM_ENT(pagep);
+ if (indx != NUM_ENT(pagep))
+ memmove(&pagep->inp[indx], &pagep->inp[indx + 1],
+ sizeof(db_indx_t) * (NUM_ENT(pagep) - indx));
+
+ return (0);
+}
+
+/*
+ * __db_pitem --
+ * Put an item on a page.
+ *
+ * PUBLIC: int __db_pitem
+ * PUBLIC: __P((DBC *, PAGE *, u_int32_t, u_int32_t, DBT *, DBT *));
+ */
+int
+__db_pitem(dbc, pagep, indx, nbytes, hdr, data)
+ DBC *dbc;
+ PAGE *pagep;
+ u_int32_t indx;
+ u_int32_t nbytes;
+ DBT *hdr, *data;
+{
+ DB *dbp;
+ BKEYDATA bk;
+ DBT thdr;
+ int ret;
+ u_int8_t *p;
+
+ if (nbytes > P_FREESPACE(pagep)) {
+ DB_ASSERT(nbytes <= P_FREESPACE(pagep));
+ return (EINVAL);
+ }
+ /*
+ * Put a single item onto a page. The logic figuring out where to
+ * insert and whether it fits is handled in the caller. All we do
+ * here is manage the page shuffling. We cheat a little bit in that
+ * we don't want to copy the dbt on a normal put twice. If hdr is
+ * NULL, we create a BKEYDATA structure on the page, otherwise, just
+ * copy the caller's information onto the page.
+ *
+ * This routine is also used to put entries onto the page where the
+ * entry is pre-built, e.g., during recovery. In this case, the hdr
+ * will point to the entry, and the data argument will be NULL.
+ *
+ * !!!
+ * There's a tremendous potential for off-by-one errors here, since
+ * the passed in header sizes must be adjusted for the structure's
+ * placeholder for the trailing variable-length data field.
+ */
+ dbp = dbc->dbp;
+ if (DB_LOGGING(dbc))
+ if ((ret = __db_addrem_log(dbp->dbenv, dbc->txn,
+ &LSN(pagep), 0, DB_ADD_DUP, dbp->log_fileid, PGNO(pagep),
+ (u_int32_t)indx, nbytes, hdr, data, &LSN(pagep))) != 0)
+ return (ret);
+
+ if (hdr == NULL) {
+ B_TSET(bk.type, B_KEYDATA, 0);
+ bk.len = data == NULL ? 0 : data->size;
+
+ thdr.data = &bk;
+ thdr.size = SSZA(BKEYDATA, data);
+ hdr = &thdr;
+ }
+
+ /* Adjust the index table, then put the item on the page. */
+ if (indx != NUM_ENT(pagep))
+ memmove(&pagep->inp[indx + 1], &pagep->inp[indx],
+ sizeof(db_indx_t) * (NUM_ENT(pagep) - indx));
+ HOFFSET(pagep) -= nbytes;
+ pagep->inp[indx] = HOFFSET(pagep);
+ ++NUM_ENT(pagep);
+
+ p = P_ENTRY(pagep, indx);
+ memcpy(p, hdr->data, hdr->size);
+ if (data != NULL)
+ memcpy(p + hdr->size, data->data, data->size);
+
+ return (0);
+}
+
+/*
+ * __db_relink --
+ * Relink around a deleted page.
+ *
+ * PUBLIC: int __db_relink __P((DBC *, u_int32_t, PAGE *, PAGE **, int));
+ */
+int
+__db_relink(dbc, add_rem, pagep, new_next, needlock)
+ DBC *dbc;
+ u_int32_t add_rem;
+ PAGE *pagep, **new_next;
+ int needlock;
+{
+ DB *dbp;
+ PAGE *np, *pp;
+ DB_LOCK npl, ppl;
+ DB_LSN *nlsnp, *plsnp, ret_lsn;
+ int ret;
+
+ ret = 0;
+ np = pp = NULL;
+ npl.off = ppl.off = LOCK_INVALID;
+ nlsnp = plsnp = NULL;
+ dbp = dbc->dbp;
+
+ /*
+ * Retrieve and lock the one/two pages. For a remove, we may need
+ * two pages (the before and after). For an add, we only need one
+ * because, the split took care of the prev.
+ */
+ if (pagep->next_pgno != PGNO_INVALID) {
+ if (needlock && (ret = __db_lget(dbc,
+ 0, pagep->next_pgno, DB_LOCK_WRITE, 0, &npl)) != 0)
+ goto err;
+ if ((ret = memp_fget(dbp->mpf,
+ &pagep->next_pgno, 0, &np)) != 0) {
+ (void)__db_pgerr(dbp, pagep->next_pgno);
+ goto err;
+ }
+ nlsnp = &np->lsn;
+ }
+ if (add_rem == DB_REM_PAGE && pagep->prev_pgno != PGNO_INVALID) {
+ if (needlock && (ret = __db_lget(dbc,
+ 0, pagep->prev_pgno, DB_LOCK_WRITE, 0, &ppl)) != 0)
+ goto err;
+ if ((ret = memp_fget(dbp->mpf,
+ &pagep->prev_pgno, 0, &pp)) != 0) {
+ (void)__db_pgerr(dbp, pagep->next_pgno);
+ goto err;
+ }
+ plsnp = &pp->lsn;
+ }
+
+ /* Log the change. */
+ if (DB_LOGGING(dbc)) {
+ if ((ret = __db_relink_log(dbp->dbenv, dbc->txn,
+ &ret_lsn, 0, add_rem, dbp->log_fileid,
+ pagep->pgno, &pagep->lsn,
+ pagep->prev_pgno, plsnp, pagep->next_pgno, nlsnp)) != 0)
+ goto err;
+ if (np != NULL)
+ np->lsn = ret_lsn;
+ if (pp != NULL)
+ pp->lsn = ret_lsn;
+ if (add_rem == DB_REM_PAGE)
+ pagep->lsn = ret_lsn;
+ }
+
+ /*
+ * Modify and release the two pages.
+ *
+ * !!!
+ * The parameter new_next gets set to the page following the page we
+ * are removing. If there is no following page, then new_next gets
+ * set to NULL.
+ */
+ if (np != NULL) {
+ if (add_rem == DB_ADD_PAGE)
+ np->prev_pgno = pagep->pgno;
+ else
+ np->prev_pgno = pagep->prev_pgno;
+ if (new_next == NULL)
+ ret = memp_fput(dbp->mpf, np, DB_MPOOL_DIRTY);
+ else {
+ *new_next = np;
+ ret = memp_fset(dbp->mpf, np, DB_MPOOL_DIRTY);
+ }
+ if (ret != 0)
+ goto err;
+ if (needlock)
+ (void)__TLPUT(dbc, npl);
+ } else if (new_next != NULL)
+ *new_next = NULL;
+
+ if (pp != NULL) {
+ pp->next_pgno = pagep->next_pgno;
+ if ((ret = memp_fput(dbp->mpf, pp, DB_MPOOL_DIRTY)) != 0)
+ goto err;
+ if (needlock)
+ (void)__TLPUT(dbc, ppl);
+ }
+ return (0);
+
+err: if (np != NULL)
+ (void)memp_fput(dbp->mpf, np, 0);
+ if (needlock && npl.off != LOCK_INVALID)
+ (void)__TLPUT(dbc, npl);
+ if (pp != NULL)
+ (void)memp_fput(dbp->mpf, pp, 0);
+ if (needlock && ppl.off != LOCK_INVALID)
+ (void)__TLPUT(dbc, ppl);
+ return (ret);
+}
diff --git a/bdb/db/db_iface.c b/bdb/db/db_iface.c
new file mode 100644
index 00000000000..3548a2527bb
--- /dev/null
+++ b/bdb/db/db_iface.c
@@ -0,0 +1,687 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ * Sleepycat Software. All rights reserved.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_iface.c,v 11.34 2001/01/11 18:19:51 bostic Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <errno.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_am.h"
+#include "btree.h"
+
+static int __db_curinval __P((const DB_ENV *));
+static int __db_rdonly __P((const DB_ENV *, const char *));
+static int __dbt_ferr __P((const DB *, const char *, const DBT *, int));
+
+/*
+ * __db_cursorchk --
+ * Common cursor argument checking routine.
+ *
+ * PUBLIC: int __db_cursorchk __P((const DB *, u_int32_t, int));
+ */
+int
+__db_cursorchk(dbp, flags, isrdonly)
+ const DB *dbp;
+ u_int32_t flags;
+ int isrdonly;
+{
+ /* Check for invalid function flags. */
+ switch (flags) {
+ case 0:
+ break;
+ case DB_WRITECURSOR:
+ if (isrdonly)
+ return (__db_rdonly(dbp->dbenv, "DB->cursor"));
+ if (!CDB_LOCKING(dbp->dbenv))
+ return (__db_ferr(dbp->dbenv, "DB->cursor", 0));
+ break;
+ case DB_WRITELOCK:
+ if (isrdonly)
+ return (__db_rdonly(dbp->dbenv, "DB->cursor"));
+ break;
+ default:
+ return (__db_ferr(dbp->dbenv, "DB->cursor", 0));
+ }
+
+ return (0);
+}
+
+/*
+ * __db_ccountchk --
+ * Common cursor count argument checking routine.
+ *
+ * PUBLIC: int __db_ccountchk __P((const DB *, u_int32_t, int));
+ */
+int
+__db_ccountchk(dbp, flags, isvalid)
+ const DB *dbp;
+ u_int32_t flags;
+ int isvalid;
+{
+ /* Check for invalid function flags. */
+ switch (flags) {
+ case 0:
+ break;
+ default:
+ return (__db_ferr(dbp->dbenv, "DBcursor->c_count", 0));
+ }
+
+ /*
+ * The cursor must be initialized, return EINVAL for an invalid cursor,
+ * otherwise 0.
+ */
+ return (isvalid ? 0 : __db_curinval(dbp->dbenv));
+}
+
+/*
+ * __db_cdelchk --
+ * Common cursor delete argument checking routine.
+ *
+ * PUBLIC: int __db_cdelchk __P((const DB *, u_int32_t, int, int));
+ */
+int
+__db_cdelchk(dbp, flags, isrdonly, isvalid)
+ const DB *dbp;
+ u_int32_t flags;
+ int isrdonly, isvalid;
+{
+ /* Check for changes to a read-only tree. */
+ if (isrdonly)
+ return (__db_rdonly(dbp->dbenv, "c_del"));
+
+ /* Check for invalid function flags. */
+ switch (flags) {
+ case 0:
+ break;
+ default:
+ return (__db_ferr(dbp->dbenv, "DBcursor->c_del", 0));
+ }
+
+ /*
+ * The cursor must be initialized, return EINVAL for an invalid cursor,
+ * otherwise 0.
+ */
+ return (isvalid ? 0 : __db_curinval(dbp->dbenv));
+}
+
+/*
+ * __db_cgetchk --
+ * Common cursor get argument checking routine.
+ *
+ * PUBLIC: int __db_cgetchk __P((const DB *, DBT *, DBT *, u_int32_t, int));
+ */
+int
+__db_cgetchk(dbp, key, data, flags, isvalid)
+ const DB *dbp;
+ DBT *key, *data;
+ u_int32_t flags;
+ int isvalid;
+{
+ int ret;
+
+ /*
+ * Check for read-modify-write validity. DB_RMW doesn't make sense
+ * with CDB cursors since if you're going to write the cursor, you
+ * had to create it with DB_WRITECURSOR. Regardless, we check for
+ * LOCKING_ON and not STD_LOCKING, as we don't want to disallow it.
+ * If this changes, confirm that DB does not itself set the DB_RMW
+ * flag in a path where CDB may have been configured.
+ */
+ if (LF_ISSET(DB_RMW)) {
+ if (!LOCKING_ON(dbp->dbenv)) {
+ __db_err(dbp->dbenv,
+ "the DB_RMW flag requires locking");
+ return (EINVAL);
+ }
+ LF_CLR(DB_RMW);
+ }
+
+ /* Check for invalid function flags. */
+ switch (flags) {
+ case DB_CONSUME:
+ case DB_CONSUME_WAIT:
+ if (dbp->type != DB_QUEUE)
+ goto err;
+ break;
+ case DB_CURRENT:
+ case DB_FIRST:
+ case DB_GET_BOTH:
+ case DB_LAST:
+ case DB_NEXT:
+ case DB_NEXT_DUP:
+ case DB_NEXT_NODUP:
+ case DB_PREV:
+ case DB_PREV_NODUP:
+ case DB_SET:
+ case DB_SET_RANGE:
+ break;
+ case DB_GET_BOTHC:
+ if (dbp->type == DB_QUEUE)
+ goto err;
+ break;
+ case DB_GET_RECNO:
+ if (!F_ISSET(dbp, DB_BT_RECNUM))
+ goto err;
+ break;
+ case DB_SET_RECNO:
+ if (!F_ISSET(dbp, DB_BT_RECNUM))
+ goto err;
+ break;
+ default:
+err: return (__db_ferr(dbp->dbenv, "DBcursor->c_get", 0));
+ }
+
+ /* Check for invalid key/data flags. */
+ if ((ret = __dbt_ferr(dbp, "key", key, 0)) != 0)
+ return (ret);
+ if ((ret = __dbt_ferr(dbp, "data", data, 0)) != 0)
+ return (ret);
+
+ /*
+ * The cursor must be initialized for DB_CURRENT or DB_NEXT_DUP,
+ * return EINVAL for an invalid cursor, otherwise 0.
+ */
+ if (isvalid || (flags != DB_CURRENT && flags != DB_NEXT_DUP))
+ return (0);
+
+ return (__db_curinval(dbp->dbenv));
+}
+
+/*
+ * __db_cputchk --
+ * Common cursor put argument checking routine.
+ *
+ * PUBLIC: int __db_cputchk __P((const DB *,
+ * PUBLIC: const DBT *, DBT *, u_int32_t, int, int));
+ */
+int
+__db_cputchk(dbp, key, data, flags, isrdonly, isvalid)
+ const DB *dbp;
+ const DBT *key;
+ DBT *data;
+ u_int32_t flags;
+ int isrdonly, isvalid;
+{
+ int key_flags, ret;
+
+ key_flags = 0;
+
+ /* Check for changes to a read-only tree. */
+ if (isrdonly)
+ return (__db_rdonly(dbp->dbenv, "c_put"));
+
+ /* Check for invalid function flags. */
+ switch (flags) {
+ case DB_AFTER:
+ case DB_BEFORE:
+ switch (dbp->type) {
+ case DB_BTREE:
+ case DB_HASH: /* Only with unsorted duplicates. */
+ if (!F_ISSET(dbp, DB_AM_DUP))
+ goto err;
+ if (dbp->dup_compare != NULL)
+ goto err;
+ break;
+ case DB_QUEUE: /* Not permitted. */
+ goto err;
+ case DB_RECNO: /* Only with mutable record numbers. */
+ if (!F_ISSET(dbp, DB_RE_RENUMBER))
+ goto err;
+ key_flags = 1;
+ break;
+ default:
+ goto err;
+ }
+ break;
+ case DB_CURRENT:
+ /*
+ * If there is a comparison function, doing a DB_CURRENT
+ * must not change the part of the data item that is used
+ * for the comparison.
+ */
+ break;
+ case DB_NODUPDATA:
+ if (!F_ISSET(dbp, DB_AM_DUPSORT))
+ goto err;
+ /* FALLTHROUGH */
+ case DB_KEYFIRST:
+ case DB_KEYLAST:
+ if (dbp->type == DB_QUEUE || dbp->type == DB_RECNO)
+ goto err;
+ key_flags = 1;
+ break;
+ default:
+err: return (__db_ferr(dbp->dbenv, "DBcursor->c_put", 0));
+ }
+
+ /* Check for invalid key/data flags. */
+ if (key_flags && (ret = __dbt_ferr(dbp, "key", key, 0)) != 0)
+ return (ret);
+ if ((ret = __dbt_ferr(dbp, "data", data, 0)) != 0)
+ return (ret);
+
+ /*
+ * The cursor must be initialized for anything other than DB_KEYFIRST
+ * and DB_KEYLAST, return EINVAL for an invalid cursor, otherwise 0.
+ */
+ if (isvalid || flags == DB_KEYFIRST ||
+ flags == DB_KEYLAST || flags == DB_NODUPDATA)
+ return (0);
+
+ return (__db_curinval(dbp->dbenv));
+}
+
+/*
+ * __db_closechk --
+ * DB->close flag check.
+ *
+ * PUBLIC: int __db_closechk __P((const DB *, u_int32_t));
+ */
+int
+__db_closechk(dbp, flags)
+ const DB *dbp;
+ u_int32_t flags;
+{
+ /* Check for invalid function flags. */
+ switch (flags) {
+ case 0:
+ case DB_NOSYNC:
+ break;
+ default:
+ return (__db_ferr(dbp->dbenv, "DB->close", 0));
+ }
+
+ return (0);
+}
+
+/*
+ * __db_delchk --
+ * Common delete argument checking routine.
+ *
+ * PUBLIC: int __db_delchk __P((const DB *, DBT *, u_int32_t, int));
+ */
+int
+__db_delchk(dbp, key, flags, isrdonly)
+ const DB *dbp;
+ DBT *key;
+ u_int32_t flags;
+ int isrdonly;
+{
+ COMPQUIET(key, NULL);
+
+ /* Check for changes to a read-only tree. */
+ if (isrdonly)
+ return (__db_rdonly(dbp->dbenv, "delete"));
+
+ /* Check for invalid function flags. */
+ switch (flags) {
+ case 0:
+ break;
+ default:
+ return (__db_ferr(dbp->dbenv, "DB->del", 0));
+ }
+
+ return (0);
+}
+
+/*
+ * __db_getchk --
+ * Common get argument checking routine.
+ *
+ * PUBLIC: int __db_getchk __P((const DB *, const DBT *, DBT *, u_int32_t));
+ */
+int
+__db_getchk(dbp, key, data, flags)
+ const DB *dbp;
+ const DBT *key;
+ DBT *data;
+ u_int32_t flags;
+{
+ int ret;
+
+ /*
+ * Check for read-modify-write validity. DB_RMW doesn't make sense
+ * with CDB cursors since if you're going to write the cursor, you
+ * had to create it with DB_WRITECURSOR. Regardless, we check for
+ * LOCKING_ON and not STD_LOCKING, as we don't want to disallow it.
+ * If this changes, confirm that DB does not itself set the DB_RMW
+ * flag in a path where CDB may have been configured.
+ */
+ if (LF_ISSET(DB_RMW)) {
+ if (!LOCKING_ON(dbp->dbenv)) {
+ __db_err(dbp->dbenv,
+ "the DB_RMW flag requires locking");
+ return (EINVAL);
+ }
+ LF_CLR(DB_RMW);
+ }
+
+ /* Check for invalid function flags. */
+ switch (flags) {
+ case 0:
+ case DB_GET_BOTH:
+ break;
+ case DB_SET_RECNO:
+ if (!F_ISSET(dbp, DB_BT_RECNUM))
+ goto err;
+ break;
+ case DB_CONSUME:
+ case DB_CONSUME_WAIT:
+ if (dbp->type == DB_QUEUE)
+ break;
+ /* Fall through */
+ default:
+err: return (__db_ferr(dbp->dbenv, "DB->get", 0));
+ }
+
+ /* Check for invalid key/data flags. */
+ if ((ret = __dbt_ferr(dbp, "key", key, flags == DB_SET_RECNO)) != 0)
+ return (ret);
+ if ((ret = __dbt_ferr(dbp, "data", data, 1)) != 0)
+ return (ret);
+
+ return (0);
+}
+
+/*
+ * __db_joinchk --
+ * Common join argument checking routine.
+ *
+ * PUBLIC: int __db_joinchk __P((const DB *, DBC * const *, u_int32_t));
+ */
+int
+__db_joinchk(dbp, curslist, flags)
+ const DB *dbp;
+ DBC * const *curslist;
+ u_int32_t flags;
+{
+ DB_TXN *txn;
+ int i;
+
+ switch (flags) {
+ case 0:
+ case DB_JOIN_NOSORT:
+ break;
+ default:
+ return (__db_ferr(dbp->dbenv, "DB->join", 0));
+ }
+
+ if (curslist == NULL || curslist[0] == NULL) {
+ __db_err(dbp->dbenv,
+ "At least one secondary cursor must be specified to DB->join");
+ return (EINVAL);
+ }
+
+ txn = curslist[0]->txn;
+ for (i = 1; curslist[i] != NULL; i++)
+ if (curslist[i]->txn != txn) {
+ __db_err(dbp->dbenv,
+ "All secondary cursors must share the same transaction");
+ return (EINVAL);
+ }
+
+ return (0);
+}
+
+/*
+ * __db_joingetchk --
+ * Common join_get argument checking routine.
+ *
+ * PUBLIC: int __db_joingetchk __P((const DB *, DBT *, u_int32_t));
+ */
+int
+__db_joingetchk(dbp, key, flags)
+ const DB *dbp;
+ DBT *key;
+ u_int32_t flags;
+{
+
+ if (LF_ISSET(DB_RMW)) {
+ if (!LOCKING_ON(dbp->dbenv)) {
+ __db_err(dbp->dbenv,
+ "the DB_RMW flag requires locking");
+ return (EINVAL);
+ }
+ LF_CLR(DB_RMW);
+ }
+
+ switch (flags) {
+ case 0:
+ case DB_JOIN_ITEM:
+ break;
+ default:
+ return (__db_ferr(dbp->dbenv, "DBcursor->c_get", 0));
+ }
+
+ /*
+ * A partial get of the key of a join cursor don't make much sense;
+ * the entire key is necessary to query the primary database
+ * and find the datum, and so regardless of the size of the key
+ * it would not be a performance improvement. Since it would require
+ * special handling, we simply disallow it.
+ *
+ * A partial get of the data, however, potentially makes sense (if
+ * all possible data are a predictable large structure, for instance)
+ * and causes us no headaches, so we permit it.
+ */
+ if (F_ISSET(key, DB_DBT_PARTIAL)) {
+ __db_err(dbp->dbenv,
+ "DB_DBT_PARTIAL may not be set on key during join_get");
+ return (EINVAL);
+ }
+
+ return (0);
+}
+
+/*
+ * __db_putchk --
+ * Common put argument checking routine.
+ *
+ * PUBLIC: int __db_putchk
+ * PUBLIC: __P((const DB *, DBT *, const DBT *, u_int32_t, int, int));
+ */
+int
+__db_putchk(dbp, key, data, flags, isrdonly, isdup)
+ const DB *dbp;
+ DBT *key;
+ const DBT *data;
+ u_int32_t flags;
+ int isrdonly, isdup;
+{
+ int ret;
+
+ /* Check for changes to a read-only tree. */
+ if (isrdonly)
+ return (__db_rdonly(dbp->dbenv, "put"));
+
+ /* Check for invalid function flags. */
+ switch (flags) {
+ case 0:
+ case DB_NOOVERWRITE:
+ break;
+ case DB_APPEND:
+ if (dbp->type != DB_RECNO && dbp->type != DB_QUEUE)
+ goto err;
+ break;
+ case DB_NODUPDATA:
+ if (F_ISSET(dbp, DB_AM_DUPSORT))
+ break;
+ /* FALLTHROUGH */
+ default:
+err: return (__db_ferr(dbp->dbenv, "DB->put", 0));
+ }
+
+ /* Check for invalid key/data flags. */
+ if ((ret = __dbt_ferr(dbp, "key", key, 0)) != 0)
+ return (ret);
+ if ((ret = __dbt_ferr(dbp, "data", data, 0)) != 0)
+ return (ret);
+
+ /* Check for partial puts in the presence of duplicates. */
+ if (isdup && F_ISSET(data, DB_DBT_PARTIAL)) {
+ __db_err(dbp->dbenv,
+"a partial put in the presence of duplicates requires a cursor operation");
+ return (EINVAL);
+ }
+
+ return (0);
+}
+
+/*
+ * __db_removechk --
+ * DB->remove flag check.
+ *
+ * PUBLIC: int __db_removechk __P((const DB *, u_int32_t));
+ */
+int
+__db_removechk(dbp, flags)
+ const DB *dbp;
+ u_int32_t flags;
+{
+ /* Check for invalid function flags. */
+ switch (flags) {
+ case 0:
+ break;
+ default:
+ return (__db_ferr(dbp->dbenv, "DB->remove", 0));
+ }
+
+ return (0);
+}
+
+/*
+ * __db_statchk --
+ * Common stat argument checking routine.
+ *
+ * PUBLIC: int __db_statchk __P((const DB *, u_int32_t));
+ */
+int
+__db_statchk(dbp, flags)
+ const DB *dbp;
+ u_int32_t flags;
+{
+ /* Check for invalid function flags. */
+ switch (flags) {
+ case 0:
+ case DB_CACHED_COUNTS:
+ break;
+ case DB_RECORDCOUNT:
+ if (dbp->type == DB_RECNO)
+ break;
+ if (dbp->type == DB_BTREE && F_ISSET(dbp, DB_BT_RECNUM))
+ break;
+ goto err;
+ default:
+err: return (__db_ferr(dbp->dbenv, "DB->stat", 0));
+ }
+
+ return (0);
+}
+
+/*
+ * __db_syncchk --
+ * Common sync argument checking routine.
+ *
+ * PUBLIC: int __db_syncchk __P((const DB *, u_int32_t));
+ */
+int
+__db_syncchk(dbp, flags)
+ const DB *dbp;
+ u_int32_t flags;
+{
+ /* Check for invalid function flags. */
+ switch (flags) {
+ case 0:
+ break;
+ default:
+ return (__db_ferr(dbp->dbenv, "DB->sync", 0));
+ }
+
+ return (0);
+}
+
+/*
+ * __dbt_ferr --
+ * Check a DBT for flag errors.
+ */
+static int
+__dbt_ferr(dbp, name, dbt, check_thread)
+ const DB *dbp;
+ const char *name;
+ const DBT *dbt;
+ int check_thread;
+{
+ DB_ENV *dbenv;
+ int ret;
+
+ dbenv = dbp->dbenv;
+
+ /*
+ * Check for invalid DBT flags. We allow any of the flags to be
+ * specified to any DB or DBcursor call so that applications can
+ * set DB_DBT_MALLOC when retrieving a data item from a secondary
+ * database and then specify that same DBT as a key to a primary
+ * database, without having to clear flags.
+ */
+ if ((ret = __db_fchk(dbenv, name, dbt->flags,
+ DB_DBT_MALLOC | DB_DBT_DUPOK |
+ DB_DBT_REALLOC | DB_DBT_USERMEM | DB_DBT_PARTIAL)) != 0)
+ return (ret);
+ switch (F_ISSET(dbt, DB_DBT_MALLOC | DB_DBT_REALLOC | DB_DBT_USERMEM)) {
+ case 0:
+ case DB_DBT_MALLOC:
+ case DB_DBT_REALLOC:
+ case DB_DBT_USERMEM:
+ break;
+ default:
+ return (__db_ferr(dbenv, name, 1));
+ }
+
+ if (check_thread && DB_IS_THREADED(dbp) &&
+ !F_ISSET(dbt, DB_DBT_MALLOC | DB_DBT_REALLOC | DB_DBT_USERMEM)) {
+ __db_err(dbenv,
+ "DB_THREAD mandates memory allocation flag on DBT %s",
+ name);
+ return (EINVAL);
+ }
+ return (0);
+}
+
+/*
+ * __db_rdonly --
+ * Common readonly message.
+ */
+static int
+__db_rdonly(dbenv, name)
+ const DB_ENV *dbenv;
+ const char *name;
+{
+ __db_err(dbenv, "%s: attempt to modify a read-only tree", name);
+ return (EACCES);
+}
+
+/*
+ * __db_curinval
+ * Report that a cursor is in an invalid state.
+ */
+static int
+__db_curinval(dbenv)
+ const DB_ENV *dbenv;
+{
+ __db_err(dbenv,
+ "Cursor position must be set before performing this operation");
+ return (EINVAL);
+}
diff --git a/bdb/db/db_join.c b/bdb/db/db_join.c
new file mode 100644
index 00000000000..881dedde0fc
--- /dev/null
+++ b/bdb/db/db_join.c
@@ -0,0 +1,730 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1998, 1999, 2000
+ * Sleepycat Software. All rights reserved.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_join.c,v 11.31 2000/12/20 22:41:54 krinsky Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <stdlib.h>
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_join.h"
+#include "db_am.h"
+#include "btree.h"
+
+static int __db_join_close __P((DBC *));
+static int __db_join_cmp __P((const void *, const void *));
+static int __db_join_del __P((DBC *, u_int32_t));
+static int __db_join_get __P((DBC *, DBT *, DBT *, u_int32_t));
+static int __db_join_getnext __P((DBC *, DBT *, DBT *, u_int32_t));
+static int __db_join_put __P((DBC *, DBT *, DBT *, u_int32_t));
+
+/*
+ * Check to see if the Nth secondary cursor of join cursor jc is pointing
+ * to a sorted duplicate set.
+ */
+#define SORTED_SET(jc, n) ((jc)->j_curslist[(n)]->dbp->dup_compare != NULL)
+
+/*
+ * This is the duplicate-assisted join functionality. Right now we're
+ * going to write it such that we return one item at a time, although
+ * I think we may need to optimize it to return them all at once.
+ * It should be easier to get it working this way, and I believe that
+ * changing it should be fairly straightforward.
+ *
+ * We optimize the join by sorting cursors from smallest to largest
+ * cardinality. In most cases, this is indeed optimal. However, if
+ * a cursor with large cardinality has very few data in common with the
+ * first cursor, it is possible that the join will be made faster by
+ * putting it earlier in the cursor list. Since we have no way to detect
+ * cases like this, we simply provide a flag, DB_JOIN_NOSORT, which retains
+ * the sort order specified by the caller, who may know more about the
+ * structure of the data.
+ *
+ * The first cursor moves sequentially through the duplicate set while
+ * the others search explicitly for the duplicate in question.
+ *
+ */
+
+/*
+ * __db_join --
+ * This is the interface to the duplicate-assisted join functionality.
+ * In the same way that cursors mark a position in a database, a cursor
+ * can mark a position in a join. While most cursors are created by the
+ * cursor method of a DB, join cursors are created through an explicit
+ * call to DB->join.
+ *
+ * The curslist is an array of existing, intialized cursors and primary
+ * is the DB of the primary file. The data item that joins all the
+ * cursors in the curslist is used as the key into the primary and that
+ * key and data are returned. When no more items are left in the join
+ * set, the c_next operation off the join cursor will return DB_NOTFOUND.
+ *
+ * PUBLIC: int __db_join __P((DB *, DBC **, DBC **, u_int32_t));
+ */
+int
+__db_join(primary, curslist, dbcp, flags)
+ DB *primary;
+ DBC **curslist, **dbcp;
+ u_int32_t flags;
+{
+ DB_ENV *dbenv;
+ DBC *dbc;
+ JOIN_CURSOR *jc;
+ int ret;
+ u_int32_t i, ncurs, nslots;
+
+ COMPQUIET(nslots, 0);
+
+ PANIC_CHECK(primary->dbenv);
+
+ if ((ret = __db_joinchk(primary, curslist, flags)) != 0)
+ return (ret);
+
+ dbc = NULL;
+ jc = NULL;
+ dbenv = primary->dbenv;
+
+ if ((ret = __os_calloc(dbenv, 1, sizeof(DBC), &dbc)) != 0)
+ goto err;
+
+ if ((ret = __os_calloc(dbenv,
+ 1, sizeof(JOIN_CURSOR), &jc)) != 0)
+ goto err;
+
+ if ((ret = __os_malloc(dbenv, 256, NULL, &jc->j_key.data)) != 0)
+ goto err;
+ jc->j_key.ulen = 256;
+ F_SET(&jc->j_key, DB_DBT_USERMEM);
+
+ for (jc->j_curslist = curslist;
+ *jc->j_curslist != NULL; jc->j_curslist++)
+ ;
+
+ /*
+ * The number of cursor slots we allocate is one greater than
+ * the number of cursors involved in the join, because the
+ * list is NULL-terminated.
+ */
+ ncurs = jc->j_curslist - curslist;
+ nslots = ncurs + 1;
+
+ /*
+ * !!! -- A note on the various lists hanging off jc.
+ *
+ * j_curslist is the initial NULL-terminated list of cursors passed
+ * into __db_join. The original cursors are not modified; pristine
+ * copies are required because, in databases with unsorted dups, we
+ * must reset all of the secondary cursors after the first each
+ * time the first one is incremented, or else we will lose data
+ * which happen to be sorted differently in two different cursors.
+ *
+ * j_workcurs is where we put those copies that we're planning to
+ * work with. They're lazily c_dup'ed from j_curslist as we need
+ * them, and closed when the join cursor is closed or when we need
+ * to reset them to their original values (in which case we just
+ * c_dup afresh).
+ *
+ * j_fdupcurs is an array of cursors which point to the first
+ * duplicate in the duplicate set that contains the data value
+ * we're currently interested in. We need this to make
+ * __db_join_get correctly return duplicate duplicates; i.e., if a
+ * given data value occurs twice in the set belonging to cursor #2,
+ * and thrice in the set belonging to cursor #3, and once in all
+ * the other cursors, successive calls to __db_join_get need to
+ * return that data item six times. To make this happen, each time
+ * cursor N is allowed to advance to a new datum, all cursors M
+ * such that M > N have to be reset to the first duplicate with
+ * that datum, so __db_join_get will return all the dup-dups again.
+ * We could just reset them to the original cursor from j_curslist,
+ * but that would be a bit slower in the unsorted case and a LOT
+ * slower in the sorted one.
+ *
+ * j_exhausted is a list of boolean values which represent
+ * whether or not their corresponding cursors are "exhausted",
+ * i.e. whether the datum under the corresponding cursor has
+ * been found not to exist in any unreturned combinations of
+ * later secondary cursors, in which case they are ready to be
+ * incremented.
+ */
+
+ /* We don't want to free regions whose callocs have failed. */
+ jc->j_curslist = NULL;
+ jc->j_workcurs = NULL;
+ jc->j_fdupcurs = NULL;
+ jc->j_exhausted = NULL;
+
+ if ((ret = __os_calloc(dbenv, nslots, sizeof(DBC *),
+ &jc->j_curslist)) != 0)
+ goto err;
+ if ((ret = __os_calloc(dbenv, nslots, sizeof(DBC *),
+ &jc->j_workcurs)) != 0)
+ goto err;
+ if ((ret = __os_calloc(dbenv, nslots, sizeof(DBC *),
+ &jc->j_fdupcurs)) != 0)
+ goto err;
+ if ((ret = __os_calloc(dbenv, nslots, sizeof(u_int8_t),
+ &jc->j_exhausted)) != 0)
+ goto err;
+ for (i = 0; curslist[i] != NULL; i++) {
+ jc->j_curslist[i] = curslist[i];
+ jc->j_workcurs[i] = NULL;
+ jc->j_fdupcurs[i] = NULL;
+ jc->j_exhausted[i] = 0;
+ }
+ jc->j_ncurs = ncurs;
+
+ /*
+ * If DB_JOIN_NOSORT is not set, optimize secondary cursors by
+ * sorting in order of increasing cardinality.
+ */
+ if (!LF_ISSET(DB_JOIN_NOSORT))
+ qsort(jc->j_curslist, ncurs, sizeof(DBC *), __db_join_cmp);
+
+ /*
+ * We never need to reset the 0th cursor, so there's no
+ * solid reason to use workcurs[0] rather than curslist[0] in
+ * join_get. Nonetheless, it feels cleaner to do it for symmetry,
+ * and this is the most logical place to copy it.
+ *
+ * !!!
+ * There's no need to close the new cursor if we goto err only
+ * because this is the last thing that can fail. Modifier of this
+ * function beware!
+ */
+ if ((ret = jc->j_curslist[0]->c_dup(jc->j_curslist[0], jc->j_workcurs,
+ DB_POSITIONI)) != 0)
+ goto err;
+
+ dbc->c_close = __db_join_close;
+ dbc->c_del = __db_join_del;
+ dbc->c_get = __db_join_get;
+ dbc->c_put = __db_join_put;
+ dbc->internal = (DBC_INTERNAL *) jc;
+ dbc->dbp = primary;
+ jc->j_primary = primary;
+
+ *dbcp = dbc;
+
+ MUTEX_THREAD_LOCK(dbenv, primary->mutexp);
+ TAILQ_INSERT_TAIL(&primary->join_queue, dbc, links);
+ MUTEX_THREAD_UNLOCK(dbenv, primary->mutexp);
+
+ return (0);
+
+err: if (jc != NULL) {
+ if (jc->j_curslist != NULL)
+ __os_free(jc->j_curslist, nslots * sizeof(DBC *));
+ if (jc->j_workcurs != NULL) {
+ if (jc->j_workcurs[0] != NULL)
+ __os_free(jc->j_workcurs[0], sizeof(DBC));
+ __os_free(jc->j_workcurs, nslots * sizeof(DBC *));
+ }
+ if (jc->j_fdupcurs != NULL)
+ __os_free(jc->j_fdupcurs, nslots * sizeof(DBC *));
+ if (jc->j_exhausted != NULL)
+ __os_free(jc->j_exhausted, nslots * sizeof(u_int8_t));
+ __os_free(jc, sizeof(JOIN_CURSOR));
+ }
+ if (dbc != NULL)
+ __os_free(dbc, sizeof(DBC));
+ return (ret);
+}
+
+static int
+__db_join_put(dbc, key, data, flags)
+ DBC *dbc;
+ DBT *key;
+ DBT *data;
+ u_int32_t flags;
+{
+ PANIC_CHECK(dbc->dbp->dbenv);
+
+ COMPQUIET(key, NULL);
+ COMPQUIET(data, NULL);
+ COMPQUIET(flags, 0);
+ return (EINVAL);
+}
+
+static int
+__db_join_del(dbc, flags)
+ DBC *dbc;
+ u_int32_t flags;
+{
+ PANIC_CHECK(dbc->dbp->dbenv);
+
+ COMPQUIET(flags, 0);
+ return (EINVAL);
+}
+
+static int
+__db_join_get(dbc, key_arg, data_arg, flags)
+ DBC *dbc;
+ DBT *key_arg, *data_arg;
+ u_int32_t flags;
+{
+ DBT *key_n, key_n_mem;
+ DB *dbp;
+ DBC *cp;
+ JOIN_CURSOR *jc;
+ int ret;
+ u_int32_t i, j, operation;
+
+ dbp = dbc->dbp;
+ jc = (JOIN_CURSOR *)dbc->internal;
+
+ PANIC_CHECK(dbp->dbenv);
+
+ operation = LF_ISSET(DB_OPFLAGS_MASK);
+
+ if ((ret = __db_joingetchk(dbp, key_arg, flags)) != 0)
+ return (ret);
+
+ /*
+ * Since we are fetching the key as a datum in the secondary indices,
+ * we must be careful of caller-specified DB_DBT_* memory
+ * management flags. If necessary, use a stack-allocated DBT;
+ * we'll appropriately copy and/or allocate the data later.
+ */
+ if (F_ISSET(key_arg, DB_DBT_USERMEM) ||
+ F_ISSET(key_arg, DB_DBT_MALLOC)) {
+ /* We just use the default buffer; no need to go malloc. */
+ key_n = &key_n_mem;
+ memset(key_n, 0, sizeof(DBT));
+ } else {
+ /*
+ * Either DB_DBT_REALLOC or the default buffer will work
+ * fine if we have to reuse it, as we do.
+ */
+ key_n = key_arg;
+ }
+
+ /*
+ * If our last attempt to do a get on the primary key failed,
+ * short-circuit the join and try again with the same key.
+ */
+ if (F_ISSET(jc, JOIN_RETRY))
+ goto samekey;
+ F_CLR(jc, JOIN_RETRY);
+
+retry: ret = jc->j_workcurs[0]->c_get(jc->j_workcurs[0],
+ &jc->j_key, key_n, jc->j_exhausted[0] ? DB_NEXT_DUP : DB_CURRENT);
+
+ if (ret == ENOMEM) {
+ jc->j_key.ulen <<= 1;
+ if ((ret = __os_realloc(dbp->dbenv,
+ jc->j_key.ulen, NULL, &jc->j_key.data)) != 0)
+ goto mem_err;
+ goto retry;
+ }
+
+ /*
+ * If ret == DB_NOTFOUND, we're out of elements of the first
+ * secondary cursor. This is how we finally finish the join
+ * if all goes well.
+ */
+ if (ret != 0)
+ goto err;
+
+ /*
+ * If jc->j_exhausted[0] == 1, we've just advanced the first cursor,
+ * and we're going to want to advance all the cursors that point to
+ * the first member of a duplicate duplicate set (j_fdupcurs[1..N]).
+ * Close all the cursors in j_fdupcurs; we'll reopen them the
+ * first time through the upcoming loop.
+ */
+ for (i = 1; i < jc->j_ncurs; i++) {
+ if (jc->j_fdupcurs[i] != NULL &&
+ (ret = jc->j_fdupcurs[i]->c_close(jc->j_fdupcurs[i])) != 0)
+ goto err;
+ jc->j_fdupcurs[i] = NULL;
+ }
+
+ /*
+ * If jc->j_curslist[1] == NULL, we have only one cursor in the join.
+ * Thus, we can safely increment that one cursor on each call
+ * to __db_join_get, and we signal this by setting jc->j_exhausted[0]
+ * right away.
+ *
+ * Otherwise, reset jc->j_exhausted[0] to 0, so that we don't
+ * increment it until we know we're ready to.
+ */
+ if (jc->j_curslist[1] == NULL)
+ jc->j_exhausted[0] = 1;
+ else
+ jc->j_exhausted[0] = 0;
+
+ /* We have the first element; now look for it in the other cursors. */
+ for (i = 1; i < jc->j_ncurs; i++) {
+ DB_ASSERT(jc->j_curslist[i] != NULL);
+ if (jc->j_workcurs[i] == NULL)
+ /* If this is NULL, we need to dup curslist into it. */
+ if ((ret = jc->j_curslist[i]->c_dup(
+ jc->j_curslist[i], jc->j_workcurs + i,
+ DB_POSITIONI)) != 0)
+ goto err;
+
+retry2: cp = jc->j_workcurs[i];
+
+ if ((ret = __db_join_getnext(cp, &jc->j_key, key_n,
+ jc->j_exhausted[i])) == DB_NOTFOUND) {
+ /*
+ * jc->j_workcurs[i] has no more of the datum we're
+ * interested in. Go back one cursor and get
+ * a new dup. We can't just move to a new
+ * element of the outer relation, because that way
+ * we might miss duplicate duplicates in cursor i-1.
+ *
+ * If this takes us back to the first cursor,
+ * -then- we can move to a new element of the outer
+ * relation.
+ */
+ --i;
+ jc->j_exhausted[i] = 1;
+
+ if (i == 0) {
+ for (j = 1; jc->j_workcurs[j] != NULL; j++) {
+ /*
+ * We're moving to a new element of
+ * the first secondary cursor. If
+ * that cursor is sorted, then any
+ * other sorted cursors can be safely
+ * reset to the first duplicate
+ * duplicate in the current set if we
+ * have a pointer to it (we can't just
+ * leave them be, or we'll miss
+ * duplicate duplicates in the outer
+ * relation).
+ *
+ * If the first cursor is unsorted, or
+ * if cursor j is unsorted, we can
+ * make no assumptions about what
+ * we're looking for next or where it
+ * will be, so we reset to the very
+ * beginning (setting workcurs NULL
+ * will achieve this next go-round).
+ *
+ * XXX: This is likely to break
+ * horribly if any two cursors are
+ * both sorted, but have different
+ * specified sort functions. For,
+ * now, we dismiss this as pathology
+ * and let strange things happen--we
+ * can't make rope childproof.
+ */
+ if ((ret = jc->j_workcurs[j]->c_close(
+ jc->j_workcurs[j])) != 0)
+ goto err;
+ if (!SORTED_SET(jc, 0) ||
+ !SORTED_SET(jc, j) ||
+ jc->j_fdupcurs[j] == NULL)
+ /*
+ * Unsafe conditions;
+ * reset fully.
+ */
+ jc->j_workcurs[j] = NULL;
+ else
+ /* Partial reset suffices. */
+ if ((jc->j_fdupcurs[j]->c_dup(
+ jc->j_fdupcurs[j],
+ &jc->j_workcurs[j],
+ DB_POSITIONI)) != 0)
+ goto err;
+ jc->j_exhausted[j] = 0;
+ }
+ goto retry;
+ /* NOTREACHED */
+ }
+
+ /*
+ * We're about to advance the cursor and need to
+ * reset all of the workcurs[j] where j>i, so that
+ * we don't miss any duplicate duplicates.
+ */
+ for (j = i + 1;
+ jc->j_workcurs[j] != NULL;
+ j++) {
+ if ((ret = jc->j_workcurs[j]->c_close(
+ jc->j_workcurs[j])) != 0)
+ goto err;
+ jc->j_exhausted[j] = 0;
+ if (jc->j_fdupcurs[j] != NULL &&
+ (ret = jc->j_fdupcurs[j]->c_dup(
+ jc->j_fdupcurs[j], &jc->j_workcurs[j],
+ DB_POSITIONI)) != 0)
+ goto err;
+ else
+ jc->j_workcurs[j] = NULL;
+ }
+ goto retry2;
+ /* NOTREACHED */
+ }
+
+ if (ret == ENOMEM) {
+ jc->j_key.ulen <<= 1;
+ if ((ret = __os_realloc(dbp->dbenv, jc->j_key.ulen,
+ NULL, &jc->j_key.data)) != 0) {
+mem_err: __db_err(dbp->dbenv,
+ "Allocation failed for join key, len = %lu",
+ (u_long)jc->j_key.ulen);
+ goto err;
+ }
+ goto retry2;
+ }
+
+ if (ret != 0)
+ goto err;
+
+ /*
+ * If we made it this far, we've found a matching
+ * datum in cursor i. Mark the current cursor
+ * unexhausted, so we don't miss any duplicate
+ * duplicates the next go-round--unless this is the
+ * very last cursor, in which case there are none to
+ * miss, and we'll need that exhausted flag to finally
+ * get a DB_NOTFOUND and move on to the next datum in
+ * the outermost cursor.
+ */
+ if (i + 1 != jc->j_ncurs)
+ jc->j_exhausted[i] = 0;
+ else
+ jc->j_exhausted[i] = 1;
+
+ /*
+ * If jc->j_fdupcurs[i] is NULL and the ith cursor's dups are
+ * sorted, then we're here for the first time since advancing
+ * cursor 0, and we have a new datum of interest.
+ * jc->j_workcurs[i] points to the beginning of a set of
+ * duplicate duplicates; store this into jc->j_fdupcurs[i].
+ */
+ if (SORTED_SET(jc, i) && jc->j_fdupcurs[i] == NULL && (ret =
+ cp->c_dup(cp, &jc->j_fdupcurs[i], DB_POSITIONI)) != 0)
+ goto err;
+
+ }
+
+err: if (ret != 0)
+ return (ret);
+
+ if (0) {
+samekey: /*
+ * Get the key we tried and failed to return last time;
+ * it should be the current datum of all the secondary cursors.
+ */
+ if ((ret = jc->j_workcurs[0]->c_get(jc->j_workcurs[0],
+ &jc->j_key, key_n, DB_CURRENT)) != 0)
+ return (ret);
+ F_CLR(jc, JOIN_RETRY);
+ }
+
+ /*
+ * ret == 0; we have a key to return.
+ *
+ * If DB_DBT_USERMEM or DB_DBT_MALLOC is set, we need to
+ * copy it back into the dbt we were given for the key;
+ * call __db_retcopy.
+ *
+ * Otherwise, assert that we do not in fact need to copy anything
+ * and simply proceed.
+ */
+ if (F_ISSET(key_arg, DB_DBT_USERMEM) ||
+ F_ISSET(key_arg, DB_DBT_MALLOC)) {
+ /*
+ * We need to copy the key back into our original
+ * datum. Do so.
+ */
+ if ((ret = __db_retcopy(dbp,
+ key_arg, key_n->data, key_n->size, NULL, NULL)) != 0) {
+ /*
+ * The retcopy failed, most commonly because we
+ * have a user buffer for the key which is too small.
+ * Set things up to retry next time, and return.
+ */
+ F_SET(jc, JOIN_RETRY);
+ return (ret);
+ }
+ } else
+ DB_ASSERT(key_n == key_arg);
+
+ /*
+ * If DB_JOIN_ITEM is
+ * set, we return it; otherwise we do the lookup in the
+ * primary and then return.
+ *
+ * Note that we use key_arg here; it is safe (and appropriate)
+ * to do so.
+ */
+ if (operation == DB_JOIN_ITEM)
+ return (0);
+
+ if ((ret = jc->j_primary->get(jc->j_primary,
+ jc->j_curslist[0]->txn, key_arg, data_arg, 0)) != 0)
+ /*
+ * The get on the primary failed, most commonly because we're
+ * using a user buffer that's not big enough. Flag our
+ * failure so we can return the same key next time.
+ */
+ F_SET(jc, JOIN_RETRY);
+
+ return (ret);
+}
+
+static int
+__db_join_close(dbc)
+ DBC *dbc;
+{
+ DB *dbp;
+ JOIN_CURSOR *jc;
+ int ret, t_ret;
+ u_int32_t i;
+
+ jc = (JOIN_CURSOR *)dbc->internal;
+ dbp = dbc->dbp;
+ ret = t_ret = 0;
+
+ /*
+ * Remove from active list of join cursors. Note that this
+ * must happen before any action that can fail and return, or else
+ * __db_close may loop indefinitely.
+ */
+ MUTEX_THREAD_LOCK(dbp->dbenv, dbp->mutexp);
+ TAILQ_REMOVE(&dbp->join_queue, dbc, links);
+ MUTEX_THREAD_UNLOCK(dbp->dbenv, dbp->mutexp);
+
+ PANIC_CHECK(dbc->dbp->dbenv);
+
+ /*
+ * Close any open scratch cursors. In each case, there may
+ * not be as many outstanding as there are cursors in
+ * curslist, but we want to close whatever's there.
+ *
+ * If any close fails, there's no reason not to close everything else;
+ * we'll just return the error code of the last one to fail. There's
+ * not much the caller can do anyway, since these cursors only exist
+ * hanging off a db-internal data structure that they shouldn't be
+ * mucking with.
+ */
+ for (i = 0; i < jc->j_ncurs; i++) {
+ if (jc->j_workcurs[i] != NULL && (t_ret =
+ jc->j_workcurs[i]->c_close(jc->j_workcurs[i])) != 0)
+ ret = t_ret;
+ if (jc->j_fdupcurs[i] != NULL && (t_ret =
+ jc->j_fdupcurs[i]->c_close(jc->j_fdupcurs[i])) != 0)
+ ret = t_ret;
+ }
+
+ __os_free(jc->j_exhausted, 0);
+ __os_free(jc->j_curslist, 0);
+ __os_free(jc->j_workcurs, 0);
+ __os_free(jc->j_fdupcurs, 0);
+ __os_free(jc->j_key.data, jc->j_key.ulen);
+ __os_free(jc, sizeof(JOIN_CURSOR));
+ __os_free(dbc, sizeof(DBC));
+
+ return (ret);
+}
+
+/*
+ * __db_join_getnext --
+ * This function replaces the DBC_CONTINUE and DBC_KEYSET
+ * functionality inside the various cursor get routines.
+ *
+ * If exhausted == 0, we're not done with the current datum;
+ * return it if it matches "matching", otherwise search
+ * using DB_GET_BOTHC (which is faster than iteratively doing
+ * DB_NEXT_DUP) forward until we find one that does.
+ *
+ * If exhausted == 1, we are done with the current datum, so just
+ * leap forward to searching NEXT_DUPs.
+ *
+ * If no matching datum exists, returns DB_NOTFOUND, else 0.
+ */
+static int
+__db_join_getnext(dbc, key, data, exhausted)
+ DBC *dbc;
+ DBT *key, *data;
+ u_int32_t exhausted;
+{
+ int ret, cmp;
+ DB *dbp;
+ DBT ldata;
+ int (*func) __P((DB *, const DBT *, const DBT *));
+
+ dbp = dbc->dbp;
+ func = (dbp->dup_compare == NULL) ? __bam_defcmp : dbp->dup_compare;
+
+ switch (exhausted) {
+ case 0:
+ memset(&ldata, 0, sizeof(DBT));
+ /* We don't want to step on data->data; malloc. */
+ F_SET(&ldata, DB_DBT_MALLOC);
+ if ((ret = dbc->c_get(dbc, key, &ldata, DB_CURRENT)) != 0)
+ break;
+ cmp = func(dbp, data, &ldata);
+ if (cmp == 0) {
+ /*
+ * We have to return the real data value. Copy
+ * it into data, then free the buffer we malloc'ed
+ * above.
+ */
+ if ((ret = __db_retcopy(dbp, data, ldata.data,
+ ldata.size, &data->data, &data->size)) != 0)
+ return (ret);
+ __os_free(ldata.data, 0);
+ return (0);
+ }
+
+ /*
+ * Didn't match--we want to fall through and search future
+ * dups. We just forget about ldata and free
+ * its buffer--data contains the value we're searching for.
+ */
+ __os_free(ldata.data, 0);
+ /* FALLTHROUGH */
+ case 1:
+ ret = dbc->c_get(dbc, key, data, DB_GET_BOTHC);
+ break;
+ default:
+ ret = EINVAL;
+ break;
+ }
+
+ return (ret);
+}
+
+/*
+ * __db_join_cmp --
+ * Comparison function for sorting DBCs in cardinality order.
+ */
+
+static int
+__db_join_cmp(a, b)
+ const void *a, *b;
+{
+ DBC *dbca, *dbcb;
+ db_recno_t counta, countb;
+
+ /* In case c_count fails, pretend cursors are equal. */
+ counta = countb = 0;
+
+ dbca = *((DBC * const *)a);
+ dbcb = *((DBC * const *)b);
+
+ if (dbca->c_count(dbca, &counta, 0) != 0 ||
+ dbcb->c_count(dbcb, &countb, 0) != 0)
+ return (0);
+
+ return (counta - countb);
+}
diff --git a/bdb/db/db_meta.c b/bdb/db/db_meta.c
new file mode 100644
index 00000000000..5b57c369454
--- /dev/null
+++ b/bdb/db/db_meta.c
@@ -0,0 +1,309 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ * Sleepycat Software. All rights reserved.
+ */
+/*
+ * Copyright (c) 1990, 1993, 1994, 1995, 1996
+ * Keith Bostic. All rights reserved.
+ */
+/*
+ * Copyright (c) 1990, 1993, 1994, 1995
+ * The Regents of the University of California. All rights reserved.
+ *
+ * This code is derived from software contributed to Berkeley by
+ * Mike Olson.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of the University nor the names of its contributors
+ * may be used to endorse or promote products derived from this software
+ * without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_meta.c,v 11.26 2001/01/16 21:57:19 ubell Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_shash.h"
+#include "lock.h"
+#include "txn.h"
+#include "db_am.h"
+#include "btree.h"
+
+/*
+ * __db_new --
+ * Get a new page, preferably from the freelist.
+ *
+ * PUBLIC: int __db_new __P((DBC *, u_int32_t, PAGE **));
+ */
+int
+__db_new(dbc, type, pagepp)
+ DBC *dbc;
+ u_int32_t type;
+ PAGE **pagepp;
+{
+ DBMETA *meta;
+ DB *dbp;
+ DB_LOCK metalock;
+ PAGE *h;
+ db_pgno_t pgno;
+ int ret;
+
+ dbp = dbc->dbp;
+ meta = NULL;
+ h = NULL;
+
+ pgno = PGNO_BASE_MD;
+ if ((ret = __db_lget(dbc,
+ LCK_ALWAYS, pgno, DB_LOCK_WRITE, 0, &metalock)) != 0)
+ goto err;
+ if ((ret = memp_fget(dbp->mpf, &pgno, 0, (PAGE **)&meta)) != 0)
+ goto err;
+
+ if (meta->free == PGNO_INVALID) {
+ if ((ret = memp_fget(dbp->mpf, &pgno, DB_MPOOL_NEW, &h)) != 0)
+ goto err;
+ ZERO_LSN(h->lsn);
+ h->pgno = pgno;
+ } else {
+ pgno = meta->free;
+ if ((ret = memp_fget(dbp->mpf, &pgno, 0, &h)) != 0)
+ goto err;
+ meta->free = h->next_pgno;
+ (void)memp_fset(dbp->mpf, (PAGE *)meta, DB_MPOOL_DIRTY);
+ }
+
+ DB_ASSERT(TYPE(h) == P_INVALID);
+
+ if (TYPE(h) != P_INVALID)
+ return (__db_panic(dbp->dbenv, EINVAL));
+
+ /* Log the change. */
+ if (DB_LOGGING(dbc)) {
+ if ((ret = __db_pg_alloc_log(dbp->dbenv,
+ dbc->txn, &LSN(meta), 0, dbp->log_fileid,
+ &LSN(meta), &h->lsn, h->pgno,
+ (u_int32_t)type, meta->free)) != 0)
+ goto err;
+ LSN(h) = LSN(meta);
+ }
+
+ (void)memp_fput(dbp->mpf, (PAGE *)meta, DB_MPOOL_DIRTY);
+ (void)__TLPUT(dbc, metalock);
+
+ P_INIT(h, dbp->pgsize, h->pgno, PGNO_INVALID, PGNO_INVALID, 0, type);
+ *pagepp = h;
+ return (0);
+
+err: if (h != NULL)
+ (void)memp_fput(dbp->mpf, h, 0);
+ if (meta != NULL)
+ (void)memp_fput(dbp->mpf, meta, 0);
+ (void)__TLPUT(dbc, metalock);
+ return (ret);
+}
+
+/*
+ * __db_free --
+ * Add a page to the head of the freelist.
+ *
+ * PUBLIC: int __db_free __P((DBC *, PAGE *));
+ */
+int
+__db_free(dbc, h)
+ DBC *dbc;
+ PAGE *h;
+{
+ DBMETA *meta;
+ DB *dbp;
+ DBT ldbt;
+ DB_LOCK metalock;
+ db_pgno_t pgno;
+ u_int32_t dirty_flag;
+ int ret, t_ret;
+
+ dbp = dbc->dbp;
+
+ /*
+ * Retrieve the metadata page and insert the page at the head of
+ * the free list. If either the lock get or page get routines
+ * fail, then we need to put the page with which we were called
+ * back because our caller assumes we take care of it.
+ */
+ dirty_flag = 0;
+ pgno = PGNO_BASE_MD;
+ if ((ret = __db_lget(dbc,
+ LCK_ALWAYS, pgno, DB_LOCK_WRITE, 0, &metalock)) != 0)
+ goto err;
+ if ((ret = memp_fget(dbp->mpf, &pgno, 0, (PAGE **)&meta)) != 0) {
+ (void)__TLPUT(dbc, metalock);
+ goto err;
+ }
+
+ DB_ASSERT(h->pgno != meta->free);
+ /* Log the change. */
+ if (DB_LOGGING(dbc)) {
+ memset(&ldbt, 0, sizeof(ldbt));
+ ldbt.data = h;
+ ldbt.size = P_OVERHEAD;
+ if ((ret = __db_pg_free_log(dbp->dbenv,
+ dbc->txn, &LSN(meta), 0, dbp->log_fileid, h->pgno,
+ &LSN(meta), &ldbt, meta->free)) != 0) {
+ (void)memp_fput(dbp->mpf, (PAGE *)meta, 0);
+ (void)__TLPUT(dbc, metalock);
+ return (ret);
+ }
+ LSN(h) = LSN(meta);
+ }
+
+ P_INIT(h, dbp->pgsize, h->pgno, PGNO_INVALID, meta->free, 0, P_INVALID);
+
+ meta->free = h->pgno;
+
+ /* Discard the metadata page. */
+ if ((t_ret = memp_fput(dbp->mpf,
+ (PAGE *)meta, DB_MPOOL_DIRTY)) != 0 && ret == 0)
+ ret = t_ret;
+ if ((t_ret = __TLPUT(dbc, metalock)) != 0 && ret == 0)
+ ret = t_ret;
+
+ /* Discard the caller's page reference. */
+ dirty_flag = DB_MPOOL_DIRTY;
+err: if ((t_ret = memp_fput(dbp->mpf, h, dirty_flag)) != 0 && ret == 0)
+ ret = t_ret;
+
+ /*
+ * XXX
+ * We have to unlock the caller's page in the caller!
+ */
+ return (ret);
+}
+
+#ifdef DEBUG
+/*
+ * __db_lprint --
+ * Print out the list of locks currently held by a cursor.
+ *
+ * PUBLIC: int __db_lprint __P((DBC *));
+ */
+int
+__db_lprint(dbc)
+ DBC *dbc;
+{
+ DB *dbp;
+ DB_LOCKREQ req;
+
+ dbp = dbc->dbp;
+
+ if (LOCKING_ON(dbp->dbenv)) {
+ req.op = DB_LOCK_DUMP;
+ lock_vec(dbp->dbenv, dbc->locker, 0, &req, 1, NULL);
+ }
+ return (0);
+}
+#endif
+
+/*
+ * __db_lget --
+ * The standard lock get call.
+ *
+ * PUBLIC: int __db_lget __P((DBC *,
+ * PUBLIC: int, db_pgno_t, db_lockmode_t, int, DB_LOCK *));
+ */
+int
+__db_lget(dbc, flags, pgno, mode, lkflags, lockp)
+ DBC *dbc;
+ int flags, lkflags;
+ db_pgno_t pgno;
+ db_lockmode_t mode;
+ DB_LOCK *lockp;
+{
+ DB *dbp;
+ DB_ENV *dbenv;
+ DB_LOCKREQ couple[2], *reqp;
+ int ret;
+
+ dbp = dbc->dbp;
+ dbenv = dbp->dbenv;
+
+ /*
+ * We do not always check if we're configured for locking before
+ * calling __db_lget to acquire the lock.
+ */
+ if (CDB_LOCKING(dbenv)
+ || !LOCKING_ON(dbenv) || F_ISSET(dbc, DBC_COMPENSATE)
+ || (!LF_ISSET(LCK_ROLLBACK) && F_ISSET(dbc, DBC_RECOVER))
+ || (!LF_ISSET(LCK_ALWAYS) && F_ISSET(dbc, DBC_OPD))) {
+ lockp->off = LOCK_INVALID;
+ return (0);
+ }
+
+ dbc->lock.pgno = pgno;
+ if (lkflags & DB_LOCK_RECORD)
+ dbc->lock.type = DB_RECORD_LOCK;
+ else
+ dbc->lock.type = DB_PAGE_LOCK;
+ lkflags &= ~DB_LOCK_RECORD;
+
+ /*
+ * If the transaction enclosing this cursor has DB_LOCK_NOWAIT set,
+ * pass that along to the lock call.
+ */
+ if (DB_NONBLOCK(dbc))
+ lkflags |= DB_LOCK_NOWAIT;
+
+ /*
+ * If the object not currently locked, acquire the lock and return,
+ * otherwise, lock couple.
+ */
+ if (LF_ISSET(LCK_COUPLE)) {
+ couple[0].op = DB_LOCK_GET;
+ couple[0].obj = &dbc->lock_dbt;
+ couple[0].mode = mode;
+ couple[1].op = DB_LOCK_PUT;
+ couple[1].lock = *lockp;
+
+ ret = lock_vec(dbenv,
+ dbc->locker, lkflags, couple, 2, &reqp);
+ if (ret == 0 || reqp == &couple[1])
+ *lockp = couple[0].lock;
+ } else {
+ ret = lock_get(dbenv,
+ dbc->locker, lkflags, &dbc->lock_dbt, mode, lockp);
+
+ if (ret != 0)
+ lockp->off = LOCK_INVALID;
+ }
+
+ return (ret);
+}
diff --git a/bdb/db/db_method.c b/bdb/db/db_method.c
new file mode 100644
index 00000000000..01568a6e144
--- /dev/null
+++ b/bdb/db/db_method.c
@@ -0,0 +1,629 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1999, 2000
+ * Sleepycat Software. All rights reserved.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_method.c,v 11.36 2000/12/21 09:17:04 krinsky Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#ifdef HAVE_RPC
+#include <rpc/rpc.h>
+#endif
+
+#include <string.h>
+#endif
+
+#ifdef HAVE_RPC
+#include "db_server.h"
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_am.h"
+#include "btree.h"
+#include "hash.h"
+#include "qam.h"
+#include "xa.h"
+#include "xa_ext.h"
+
+#ifdef HAVE_RPC
+#include "gen_client_ext.h"
+#include "rpc_client_ext.h"
+#endif
+
+static int __db_get_byteswapped __P((DB *));
+static DBTYPE
+ __db_get_type __P((DB *));
+static int __db_init __P((DB *, u_int32_t));
+static int __db_key_range
+ __P((DB *, DB_TXN *, DBT *, DB_KEY_RANGE *, u_int32_t));
+static int __db_set_append_recno __P((DB *, int (*)(DB *, DBT *, db_recno_t)));
+static int __db_set_cachesize __P((DB *, u_int32_t, u_int32_t, int));
+static int __db_set_dup_compare
+ __P((DB *, int (*)(DB *, const DBT *, const DBT *)));
+static void __db_set_errcall __P((DB *, void (*)(const char *, char *)));
+static void __db_set_errfile __P((DB *, FILE *));
+static int __db_set_feedback __P((DB *, void (*)(DB *, int, int)));
+static int __db_set_flags __P((DB *, u_int32_t));
+static int __db_set_lorder __P((DB *, int));
+static int __db_set_malloc __P((DB *, void *(*)(size_t)));
+static int __db_set_pagesize __P((DB *, u_int32_t));
+static int __db_set_realloc __P((DB *, void *(*)(void *, size_t)));
+static void __db_set_errpfx __P((DB *, const char *));
+static int __db_set_paniccall __P((DB *, void (*)(DB_ENV *, int)));
+static void __dbh_err __P((DB *, int, const char *, ...));
+static void __dbh_errx __P((DB *, const char *, ...));
+
+/*
+ * db_create --
+ * DB constructor.
+ */
+int
+db_create(dbpp, dbenv, flags)
+ DB **dbpp;
+ DB_ENV *dbenv;
+ u_int32_t flags;
+{
+ DB *dbp;
+ int ret;
+
+ /* Check for invalid function flags. */
+ switch (flags) {
+ case 0:
+ break;
+ case DB_XA_CREATE:
+ if (dbenv != NULL) {
+ __db_err(dbenv,
+ "XA applications may not specify an environment to db_create");
+ return (EINVAL);
+ }
+
+ /*
+ * If it's an XA database, open it within the XA environment,
+ * taken from the global list of environments. (When the XA
+ * transaction manager called our xa_start() routine the
+ * "current" environment was moved to the start of the list.
+ */
+ dbenv = TAILQ_FIRST(&DB_GLOBAL(db_envq));
+ break;
+ default:
+ return (__db_ferr(dbenv, "db_create", 0));
+ }
+
+ /* Allocate the DB. */
+ if ((ret = __os_calloc(dbenv, 1, sizeof(*dbp), &dbp)) != 0)
+ return (ret);
+#ifdef HAVE_RPC
+ if (dbenv != NULL && dbenv->cl_handle != NULL)
+ ret = __dbcl_init(dbp, dbenv, flags);
+ else
+#endif
+ ret = __db_init(dbp, flags);
+ if (ret != 0) {
+ __os_free(dbp, sizeof(*dbp));
+ return (ret);
+ }
+
+ /* If we don't have an environment yet, allocate a local one. */
+ if (dbenv == NULL) {
+ if ((ret = db_env_create(&dbenv, 0)) != 0) {
+ __os_free(dbp, sizeof(*dbp));
+ return (ret);
+ }
+ dbenv->dblocal_ref = 0;
+ F_SET(dbenv, DB_ENV_DBLOCAL);
+ }
+ if (F_ISSET(dbenv, DB_ENV_DBLOCAL))
+ ++dbenv->dblocal_ref;
+
+ dbp->dbenv = dbenv;
+
+ *dbpp = dbp;
+ return (0);
+}
+
+/*
+ * __db_init --
+ * Initialize a DB structure.
+ */
+static int
+__db_init(dbp, flags)
+ DB *dbp;
+ u_int32_t flags;
+{
+ int ret;
+
+ dbp->log_fileid = DB_LOGFILEID_INVALID;
+
+ TAILQ_INIT(&dbp->free_queue);
+ TAILQ_INIT(&dbp->active_queue);
+ TAILQ_INIT(&dbp->join_queue);
+
+ FLD_SET(dbp->am_ok,
+ DB_OK_BTREE | DB_OK_HASH | DB_OK_QUEUE | DB_OK_RECNO);
+
+ dbp->close = __db_close;
+ dbp->cursor = __db_cursor;
+ dbp->del = NULL; /* !!! Must be set by access method. */
+ dbp->err = __dbh_err;
+ dbp->errx = __dbh_errx;
+ dbp->fd = __db_fd;
+ dbp->get = __db_get;
+ dbp->get_byteswapped = __db_get_byteswapped;
+ dbp->get_type = __db_get_type;
+ dbp->join = __db_join;
+ dbp->key_range = __db_key_range;
+ dbp->open = __db_open;
+ dbp->put = __db_put;
+ dbp->remove = __db_remove;
+ dbp->rename = __db_rename;
+ dbp->set_append_recno = __db_set_append_recno;
+ dbp->set_cachesize = __db_set_cachesize;
+ dbp->set_dup_compare = __db_set_dup_compare;
+ dbp->set_errcall = __db_set_errcall;
+ dbp->set_errfile = __db_set_errfile;
+ dbp->set_errpfx = __db_set_errpfx;
+ dbp->set_feedback = __db_set_feedback;
+ dbp->set_flags = __db_set_flags;
+ dbp->set_lorder = __db_set_lorder;
+ dbp->set_malloc = __db_set_malloc;
+ dbp->set_pagesize = __db_set_pagesize;
+ dbp->set_paniccall = __db_set_paniccall;
+ dbp->set_realloc = __db_set_realloc;
+ dbp->stat = NULL; /* !!! Must be set by access method. */
+ dbp->sync = __db_sync;
+ dbp->upgrade = __db_upgrade;
+ dbp->verify = __db_verify;
+ /* Access method specific. */
+ if ((ret = __bam_db_create(dbp)) != 0)
+ return (ret);
+ if ((ret = __ham_db_create(dbp)) != 0)
+ return (ret);
+ if ((ret = __qam_db_create(dbp)) != 0)
+ return (ret);
+
+ /*
+ * XA specific: must be last, as we replace methods set by the
+ * access methods.
+ */
+ if (LF_ISSET(DB_XA_CREATE) && (ret = __db_xa_create(dbp)) != 0)
+ return (ret);
+
+ return (0);
+}
+
+/*
+ * __dbh_am_chk --
+ * Error if an unreasonable method is called.
+ *
+ * PUBLIC: int __dbh_am_chk __P((DB *, u_int32_t));
+ */
+int
+__dbh_am_chk(dbp, flags)
+ DB *dbp;
+ u_int32_t flags;
+{
+ /*
+ * We start out allowing any access methods to be called, and as the
+ * application calls the methods the options become restricted. The
+ * idea is to quit as soon as an illegal method combination is called.
+ */
+ if ((LF_ISSET(DB_OK_BTREE) && FLD_ISSET(dbp->am_ok, DB_OK_BTREE)) ||
+ (LF_ISSET(DB_OK_HASH) && FLD_ISSET(dbp->am_ok, DB_OK_HASH)) ||
+ (LF_ISSET(DB_OK_QUEUE) && FLD_ISSET(dbp->am_ok, DB_OK_QUEUE)) ||
+ (LF_ISSET(DB_OK_RECNO) && FLD_ISSET(dbp->am_ok, DB_OK_RECNO))) {
+ FLD_CLR(dbp->am_ok, ~flags);
+ return (0);
+ }
+
+ __db_err(dbp->dbenv,
+ "call implies an access method which is inconsistent with previous calls");
+ return (EINVAL);
+}
+
+/*
+ * __dbh_err --
+ * Error message, including the standard error string.
+ */
+static void
+#ifdef __STDC__
+__dbh_err(DB *dbp, int error, const char *fmt, ...)
+#else
+__dbh_err(dbp, error, fmt, va_alist)
+ DB *dbp;
+ int error;
+ const char *fmt;
+ va_dcl
+#endif
+{
+ va_list ap;
+
+#ifdef __STDC__
+ va_start(ap, fmt);
+#else
+ va_start(ap);
+#endif
+ __db_real_err(dbp->dbenv, error, 1, 1, fmt, ap);
+
+ va_end(ap);
+}
+
+/*
+ * __dbh_errx --
+ * Error message.
+ */
+static void
+#ifdef __STDC__
+__dbh_errx(DB *dbp, const char *fmt, ...)
+#else
+__dbh_errx(dbp, fmt, va_alist)
+ DB *dbp;
+ const char *fmt;
+ va_dcl
+#endif
+{
+ va_list ap;
+
+#ifdef __STDC__
+ va_start(ap, fmt);
+#else
+ va_start(ap);
+#endif
+ __db_real_err(dbp->dbenv, 0, 0, 1, fmt, ap);
+
+ va_end(ap);
+}
+
+/*
+ * __db_get_byteswapped --
+ * Return if database requires byte swapping.
+ */
+static int
+__db_get_byteswapped(dbp)
+ DB *dbp;
+{
+ DB_ILLEGAL_BEFORE_OPEN(dbp, "get_byteswapped");
+
+ return (F_ISSET(dbp, DB_AM_SWAP) ? 1 : 0);
+}
+
+/*
+ * __db_get_type --
+ * Return type of underlying database.
+ */
+static DBTYPE
+__db_get_type(dbp)
+ DB *dbp;
+{
+ DB_ILLEGAL_BEFORE_OPEN(dbp, "get_type");
+
+ return (dbp->type);
+}
+
+/*
+ * __db_key_range --
+ * Return proportion of keys above and below given key.
+ */
+static int
+__db_key_range(dbp, txn, key, kr, flags)
+ DB *dbp;
+ DB_TXN *txn;
+ DBT *key;
+ DB_KEY_RANGE *kr;
+ u_int32_t flags;
+{
+ COMPQUIET(txn, NULL);
+ COMPQUIET(key, NULL);
+ COMPQUIET(kr, NULL);
+ COMPQUIET(flags, 0);
+
+ DB_ILLEGAL_BEFORE_OPEN(dbp, "key_range");
+ DB_ILLEGAL_METHOD(dbp, DB_OK_BTREE);
+
+ return (EINVAL);
+}
+
+/*
+ * __db_set_append_recno --
+ * Set record number append routine.
+ */
+static int
+__db_set_append_recno(dbp, func)
+ DB *dbp;
+ int (*func) __P((DB *, DBT *, db_recno_t));
+{
+ DB_ILLEGAL_AFTER_OPEN(dbp, "set_append_recno");
+ DB_ILLEGAL_METHOD(dbp, DB_OK_QUEUE | DB_OK_RECNO);
+
+ dbp->db_append_recno = func;
+
+ return (0);
+}
+
+/*
+ * __db_set_cachesize --
+ * Set underlying cache size.
+ */
+static int
+__db_set_cachesize(dbp, cache_gbytes, cache_bytes, ncache)
+ DB *dbp;
+ u_int32_t cache_gbytes, cache_bytes;
+ int ncache;
+{
+ DB_ILLEGAL_IN_ENV(dbp, "set_cachesize");
+ DB_ILLEGAL_AFTER_OPEN(dbp, "set_cachesize");
+
+ return (dbp->dbenv->set_cachesize(
+ dbp->dbenv, cache_gbytes, cache_bytes, ncache));
+}
+
+/*
+ * __db_set_dup_compare --
+ * Set duplicate comparison routine.
+ */
+static int
+__db_set_dup_compare(dbp, func)
+ DB *dbp;
+ int (*func) __P((DB *, const DBT *, const DBT *));
+{
+ DB_ILLEGAL_AFTER_OPEN(dbp, "dup_compare");
+ DB_ILLEGAL_METHOD(dbp, DB_OK_BTREE | DB_OK_HASH);
+
+ dbp->dup_compare = func;
+
+ return (0);
+}
+
+static void
+__db_set_errcall(dbp, errcall)
+ DB *dbp;
+ void (*errcall) __P((const char *, char *));
+{
+ dbp->dbenv->set_errcall(dbp->dbenv, errcall);
+}
+
+static void
+__db_set_errfile(dbp, errfile)
+ DB *dbp;
+ FILE *errfile;
+{
+ dbp->dbenv->set_errfile(dbp->dbenv, errfile);
+}
+
+static void
+__db_set_errpfx(dbp, errpfx)
+ DB *dbp;
+ const char *errpfx;
+{
+ dbp->dbenv->set_errpfx(dbp->dbenv, errpfx);
+}
+
+static int
+__db_set_feedback(dbp, feedback)
+ DB *dbp;
+ void (*feedback) __P((DB *, int, int));
+{
+ dbp->db_feedback = feedback;
+ return (0);
+}
+
+static int
+__db_set_flags(dbp, flags)
+ DB *dbp;
+ u_int32_t flags;
+{
+ int ret;
+
+ /*
+ * !!!
+ * The hash access method only takes two flags: DB_DUP and DB_DUPSORT.
+ * The Btree access method uses them for the same purposes, and so we
+ * resolve them there.
+ *
+ * The queue access method takes no flags.
+ */
+ if ((ret = __bam_set_flags(dbp, &flags)) != 0)
+ return (ret);
+ if ((ret = __ram_set_flags(dbp, &flags)) != 0)
+ return (ret);
+
+ return (flags == 0 ? 0 : __db_ferr(dbp->dbenv, "DB->set_flags", 0));
+}
+
+static int
+__db_set_lorder(dbp, db_lorder)
+ DB *dbp;
+ int db_lorder;
+{
+ int ret;
+
+ DB_ILLEGAL_AFTER_OPEN(dbp, "set_lorder");
+
+ /* Flag if the specified byte order requires swapping. */
+ switch (ret = __db_byteorder(dbp->dbenv, db_lorder)) {
+ case 0:
+ F_CLR(dbp, DB_AM_SWAP);
+ break;
+ case DB_SWAPBYTES:
+ F_SET(dbp, DB_AM_SWAP);
+ break;
+ default:
+ return (ret);
+ /* NOTREACHED */
+ }
+ return (0);
+}
+
+static int
+__db_set_malloc(dbp, func)
+ DB *dbp;
+ void *(*func) __P((size_t));
+{
+ DB_ILLEGAL_AFTER_OPEN(dbp, "set_malloc");
+
+ dbp->db_malloc = func;
+ return (0);
+}
+
+static int
+__db_set_pagesize(dbp, db_pagesize)
+ DB *dbp;
+ u_int32_t db_pagesize;
+{
+ DB_ILLEGAL_AFTER_OPEN(dbp, "set_pagesize");
+
+ if (db_pagesize < DB_MIN_PGSIZE) {
+ __db_err(dbp->dbenv, "page sizes may not be smaller than %lu",
+ (u_long)DB_MIN_PGSIZE);
+ return (EINVAL);
+ }
+ if (db_pagesize > DB_MAX_PGSIZE) {
+ __db_err(dbp->dbenv, "page sizes may not be larger than %lu",
+ (u_long)DB_MAX_PGSIZE);
+ return (EINVAL);
+ }
+
+ /*
+ * We don't want anything that's not a power-of-2, as we rely on that
+ * for alignment of various types on the pages.
+ */
+ if ((u_int32_t)1 << __db_log2(db_pagesize) != db_pagesize) {
+ __db_err(dbp->dbenv, "page sizes must be a power-of-2");
+ return (EINVAL);
+ }
+
+ /*
+ * XXX
+ * Should we be checking for a page size that's not a multiple of 512,
+ * so that we never try and write less than a disk sector?
+ */
+ dbp->pgsize = db_pagesize;
+
+ return (0);
+}
+
+static int
+__db_set_realloc(dbp, func)
+ DB *dbp;
+ void *(*func) __P((void *, size_t));
+{
+ DB_ILLEGAL_AFTER_OPEN(dbp, "set_realloc");
+
+ dbp->db_realloc = func;
+ return (0);
+}
+
+static int
+__db_set_paniccall(dbp, paniccall)
+ DB *dbp;
+ void (*paniccall) __P((DB_ENV *, int));
+{
+ return (dbp->dbenv->set_paniccall(dbp->dbenv, paniccall));
+}
+
+#ifdef HAVE_RPC
+/*
+ * __dbcl_init --
+ * Initialize a DB structure on the server.
+ *
+ * PUBLIC: #ifdef HAVE_RPC
+ * PUBLIC: int __dbcl_init __P((DB *, DB_ENV *, u_int32_t));
+ * PUBLIC: #endif
+ */
+int
+__dbcl_init(dbp, dbenv, flags)
+ DB *dbp;
+ DB_ENV *dbenv;
+ u_int32_t flags;
+{
+ CLIENT *cl;
+ __db_create_reply *replyp;
+ __db_create_msg req;
+ int ret;
+
+ TAILQ_INIT(&dbp->free_queue);
+ TAILQ_INIT(&dbp->active_queue);
+ /* !!!
+ * Note that we don't need to initialize the join_queue; it's
+ * not used in RPC clients. See the comment in __dbcl_db_join_ret().
+ */
+
+ dbp->close = __dbcl_db_close;
+ dbp->cursor = __dbcl_db_cursor;
+ dbp->del = __dbcl_db_del;
+ dbp->err = __dbh_err;
+ dbp->errx = __dbh_errx;
+ dbp->fd = __dbcl_db_fd;
+ dbp->get = __dbcl_db_get;
+ dbp->get_byteswapped = __dbcl_db_swapped;
+ dbp->get_type = __db_get_type;
+ dbp->join = __dbcl_db_join;
+ dbp->key_range = __dbcl_db_key_range;
+ dbp->open = __dbcl_db_open;
+ dbp->put = __dbcl_db_put;
+ dbp->remove = __dbcl_db_remove;
+ dbp->rename = __dbcl_db_rename;
+ dbp->set_append_recno = __dbcl_db_set_append_recno;
+ dbp->set_cachesize = __dbcl_db_cachesize;
+ dbp->set_dup_compare = NULL;
+ dbp->set_errcall = __db_set_errcall;
+ dbp->set_errfile = __db_set_errfile;
+ dbp->set_errpfx = __db_set_errpfx;
+ dbp->set_feedback = __dbcl_db_feedback;
+ dbp->set_flags = __dbcl_db_flags;
+ dbp->set_lorder = __dbcl_db_lorder;
+ dbp->set_malloc = __dbcl_db_malloc;
+ dbp->set_pagesize = __dbcl_db_pagesize;
+ dbp->set_paniccall = __dbcl_db_panic;
+ dbp->set_q_extentsize = __dbcl_db_extentsize;
+ dbp->set_realloc = __dbcl_db_realloc;
+ dbp->stat = __dbcl_db_stat;
+ dbp->sync = __dbcl_db_sync;
+ dbp->upgrade = __dbcl_db_upgrade;
+
+ /*
+ * Set all the method specific functions to client funcs as well.
+ */
+ dbp->set_bt_compare = __dbcl_db_bt_compare;
+ dbp->set_bt_maxkey = __dbcl_db_bt_maxkey;
+ dbp->set_bt_minkey = __dbcl_db_bt_minkey;
+ dbp->set_bt_prefix = __dbcl_db_bt_prefix;
+ dbp->set_h_ffactor = __dbcl_db_h_ffactor;
+ dbp->set_h_hash = __dbcl_db_h_hash;
+ dbp->set_h_nelem = __dbcl_db_h_nelem;
+ dbp->set_re_delim = __dbcl_db_re_delim;
+ dbp->set_re_len = __dbcl_db_re_len;
+ dbp->set_re_pad = __dbcl_db_re_pad;
+ dbp->set_re_source = __dbcl_db_re_source;
+/*
+ dbp->set_q_extentsize = __dbcl_db_q_extentsize;
+*/
+
+ cl = (CLIENT *)dbenv->cl_handle;
+ req.flags = flags;
+ req.envpcl_id = dbenv->cl_id;
+
+ /*
+ * CALL THE SERVER
+ */
+ replyp = __db_db_create_1(&req, cl);
+ if (replyp == NULL) {
+ __db_err(dbenv, clnt_sperror(cl, "Berkeley DB"));
+ return (DB_NOSERVER);
+ }
+
+ if ((ret = replyp->status) != 0)
+ return (ret);
+
+ dbp->cl_id = replyp->dbpcl_id;
+ return (0);
+}
+#endif
diff --git a/bdb/db/db_overflow.c b/bdb/db/db_overflow.c
new file mode 100644
index 00000000000..54f0a03aafe
--- /dev/null
+++ b/bdb/db/db_overflow.c
@@ -0,0 +1,681 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ * Sleepycat Software. All rights reserved.
+ */
+/*
+ * Copyright (c) 1990, 1993, 1994, 1995, 1996
+ * Keith Bostic. All rights reserved.
+ */
+/*
+ * Copyright (c) 1990, 1993, 1994, 1995
+ * The Regents of the University of California. All rights reserved.
+ *
+ * This code is derived from software contributed to Berkeley by
+ * Mike Olson.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of the University nor the names of its contributors
+ * may be used to endorse or promote products derived from this software
+ * without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_overflow.c,v 11.21 2000/11/30 00:58:32 ubell Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_am.h"
+#include "db_verify.h"
+
+/*
+ * Big key/data code.
+ *
+ * Big key and data entries are stored on linked lists of pages. The initial
+ * reference is a structure with the total length of the item and the page
+ * number where it begins. Each entry in the linked list contains a pointer
+ * to the next page of data, and so on.
+ */
+
+/*
+ * __db_goff --
+ * Get an offpage item.
+ *
+ * PUBLIC: int __db_goff __P((DB *, DBT *,
+ * PUBLIC: u_int32_t, db_pgno_t, void **, u_int32_t *));
+ */
+int
+__db_goff(dbp, dbt, tlen, pgno, bpp, bpsz)
+ DB *dbp;
+ DBT *dbt;
+ u_int32_t tlen;
+ db_pgno_t pgno;
+ void **bpp;
+ u_int32_t *bpsz;
+{
+ DB_ENV *dbenv;
+ PAGE *h;
+ db_indx_t bytes;
+ u_int32_t curoff, needed, start;
+ u_int8_t *p, *src;
+ int ret;
+
+ dbenv = dbp->dbenv;
+
+ /*
+ * Check if the buffer is big enough; if it is not and we are
+ * allowed to malloc space, then we'll malloc it. If we are
+ * not (DB_DBT_USERMEM), then we'll set the dbt and return
+ * appropriately.
+ */
+ if (F_ISSET(dbt, DB_DBT_PARTIAL)) {
+ start = dbt->doff;
+ needed = dbt->dlen;
+ } else {
+ start = 0;
+ needed = tlen;
+ }
+
+ /* Allocate any necessary memory. */
+ if (F_ISSET(dbt, DB_DBT_USERMEM)) {
+ if (needed > dbt->ulen) {
+ dbt->size = needed;
+ return (ENOMEM);
+ }
+ } else if (F_ISSET(dbt, DB_DBT_MALLOC)) {
+ if ((ret = __os_malloc(dbenv,
+ needed, dbp->db_malloc, &dbt->data)) != 0)
+ return (ret);
+ } else if (F_ISSET(dbt, DB_DBT_REALLOC)) {
+ if ((ret = __os_realloc(dbenv,
+ needed, dbp->db_realloc, &dbt->data)) != 0)
+ return (ret);
+ } else if (*bpsz == 0 || *bpsz < needed) {
+ if ((ret = __os_realloc(dbenv, needed, NULL, bpp)) != 0)
+ return (ret);
+ *bpsz = needed;
+ dbt->data = *bpp;
+ } else
+ dbt->data = *bpp;
+
+ /*
+ * Step through the linked list of pages, copying the data on each
+ * one into the buffer. Never copy more than the total data length.
+ */
+ dbt->size = needed;
+ for (curoff = 0, p = dbt->data; pgno != PGNO_INVALID && needed > 0;) {
+ if ((ret = memp_fget(dbp->mpf, &pgno, 0, &h)) != 0) {
+ (void)__db_pgerr(dbp, pgno);
+ return (ret);
+ }
+ /* Check if we need any bytes from this page. */
+ if (curoff + OV_LEN(h) >= start) {
+ src = (u_int8_t *)h + P_OVERHEAD;
+ bytes = OV_LEN(h);
+ if (start > curoff) {
+ src += start - curoff;
+ bytes -= start - curoff;
+ }
+ if (bytes > needed)
+ bytes = needed;
+ memcpy(p, src, bytes);
+ p += bytes;
+ needed -= bytes;
+ }
+ curoff += OV_LEN(h);
+ pgno = h->next_pgno;
+ memp_fput(dbp->mpf, h, 0);
+ }
+ return (0);
+}
+
+/*
+ * __db_poff --
+ * Put an offpage item.
+ *
+ * PUBLIC: int __db_poff __P((DBC *, const DBT *, db_pgno_t *));
+ */
+int
+__db_poff(dbc, dbt, pgnop)
+ DBC *dbc;
+ const DBT *dbt;
+ db_pgno_t *pgnop;
+{
+ DB *dbp;
+ PAGE *pagep, *lastp;
+ DB_LSN new_lsn, null_lsn;
+ DBT tmp_dbt;
+ db_indx_t pagespace;
+ u_int32_t sz;
+ u_int8_t *p;
+ int ret;
+
+ /*
+ * Allocate pages and copy the key/data item into them. Calculate the
+ * number of bytes we get for pages we fill completely with a single
+ * item.
+ */
+ dbp = dbc->dbp;
+ pagespace = P_MAXSPACE(dbp->pgsize);
+
+ lastp = NULL;
+ for (p = dbt->data,
+ sz = dbt->size; sz > 0; p += pagespace, sz -= pagespace) {
+ /*
+ * Reduce pagespace so we terminate the loop correctly and
+ * don't copy too much data.
+ */
+ if (sz < pagespace)
+ pagespace = sz;
+
+ /*
+ * Allocate and initialize a new page and copy all or part of
+ * the item onto the page. If sz is less than pagespace, we
+ * have a partial record.
+ */
+ if ((ret = __db_new(dbc, P_OVERFLOW, &pagep)) != 0)
+ return (ret);
+ if (DB_LOGGING(dbc)) {
+ tmp_dbt.data = p;
+ tmp_dbt.size = pagespace;
+ ZERO_LSN(null_lsn);
+ if ((ret = __db_big_log(dbp->dbenv, dbc->txn,
+ &new_lsn, 0, DB_ADD_BIG, dbp->log_fileid,
+ PGNO(pagep), lastp ? PGNO(lastp) : PGNO_INVALID,
+ PGNO_INVALID, &tmp_dbt, &LSN(pagep),
+ lastp == NULL ? &null_lsn : &LSN(lastp),
+ &null_lsn)) != 0)
+ return (ret);
+
+ /* Move lsn onto page. */
+ if (lastp)
+ LSN(lastp) = new_lsn;
+ LSN(pagep) = new_lsn;
+ }
+
+ P_INIT(pagep, dbp->pgsize,
+ PGNO(pagep), PGNO_INVALID, PGNO_INVALID, 0, P_OVERFLOW);
+ OV_LEN(pagep) = pagespace;
+ OV_REF(pagep) = 1;
+ memcpy((u_int8_t *)pagep + P_OVERHEAD, p, pagespace);
+
+ /*
+ * If this is the first entry, update the user's info.
+ * Otherwise, update the entry on the last page filled
+ * in and release that page.
+ */
+ if (lastp == NULL)
+ *pgnop = PGNO(pagep);
+ else {
+ lastp->next_pgno = PGNO(pagep);
+ pagep->prev_pgno = PGNO(lastp);
+ (void)memp_fput(dbp->mpf, lastp, DB_MPOOL_DIRTY);
+ }
+ lastp = pagep;
+ }
+ (void)memp_fput(dbp->mpf, lastp, DB_MPOOL_DIRTY);
+ return (0);
+}
+
+/*
+ * __db_ovref --
+ * Increment/decrement the reference count on an overflow page.
+ *
+ * PUBLIC: int __db_ovref __P((DBC *, db_pgno_t, int32_t));
+ */
+int
+__db_ovref(dbc, pgno, adjust)
+ DBC *dbc;
+ db_pgno_t pgno;
+ int32_t adjust;
+{
+ DB *dbp;
+ PAGE *h;
+ int ret;
+
+ dbp = dbc->dbp;
+ if ((ret = memp_fget(dbp->mpf, &pgno, 0, &h)) != 0) {
+ (void)__db_pgerr(dbp, pgno);
+ return (ret);
+ }
+
+ if (DB_LOGGING(dbc))
+ if ((ret = __db_ovref_log(dbp->dbenv, dbc->txn,
+ &LSN(h), 0, dbp->log_fileid, h->pgno, adjust,
+ &LSN(h))) != 0)
+ return (ret);
+ OV_REF(h) += adjust;
+
+ (void)memp_fput(dbp->mpf, h, DB_MPOOL_DIRTY);
+ return (0);
+}
+
+/*
+ * __db_doff --
+ * Delete an offpage chain of overflow pages.
+ *
+ * PUBLIC: int __db_doff __P((DBC *, db_pgno_t));
+ */
+int
+__db_doff(dbc, pgno)
+ DBC *dbc;
+ db_pgno_t pgno;
+{
+ DB *dbp;
+ PAGE *pagep;
+ DB_LSN null_lsn;
+ DBT tmp_dbt;
+ int ret;
+
+ dbp = dbc->dbp;
+ do {
+ if ((ret = memp_fget(dbp->mpf, &pgno, 0, &pagep)) != 0) {
+ (void)__db_pgerr(dbp, pgno);
+ return (ret);
+ }
+
+ DB_ASSERT(TYPE(pagep) == P_OVERFLOW);
+ /*
+ * If it's referenced by more than one key/data item,
+ * decrement the reference count and return.
+ */
+ if (OV_REF(pagep) > 1) {
+ (void)memp_fput(dbp->mpf, pagep, 0);
+ return (__db_ovref(dbc, pgno, -1));
+ }
+
+ if (DB_LOGGING(dbc)) {
+ tmp_dbt.data = (u_int8_t *)pagep + P_OVERHEAD;
+ tmp_dbt.size = OV_LEN(pagep);
+ ZERO_LSN(null_lsn);
+ if ((ret = __db_big_log(dbp->dbenv, dbc->txn,
+ &LSN(pagep), 0, DB_REM_BIG, dbp->log_fileid,
+ PGNO(pagep), PREV_PGNO(pagep), NEXT_PGNO(pagep),
+ &tmp_dbt, &LSN(pagep), &null_lsn, &null_lsn)) != 0)
+ return (ret);
+ }
+ pgno = pagep->next_pgno;
+ if ((ret = __db_free(dbc, pagep)) != 0)
+ return (ret);
+ } while (pgno != PGNO_INVALID);
+
+ return (0);
+}
+
+/*
+ * __db_moff --
+ * Match on overflow pages.
+ *
+ * Given a starting page number and a key, return <0, 0, >0 to indicate if the
+ * key on the page is less than, equal to or greater than the key specified.
+ * We optimize this by doing chunk at a time comparison unless the user has
+ * specified a comparison function. In this case, we need to materialize
+ * the entire object and call their comparison routine.
+ *
+ * PUBLIC: int __db_moff __P((DB *, const DBT *, db_pgno_t, u_int32_t,
+ * PUBLIC: int (*)(DB *, const DBT *, const DBT *), int *));
+ */
+int
+__db_moff(dbp, dbt, pgno, tlen, cmpfunc, cmpp)
+ DB *dbp;
+ const DBT *dbt;
+ db_pgno_t pgno;
+ u_int32_t tlen;
+ int (*cmpfunc) __P((DB *, const DBT *, const DBT *)), *cmpp;
+{
+ PAGE *pagep;
+ DBT local_dbt;
+ void *buf;
+ u_int32_t bufsize, cmp_bytes, key_left;
+ u_int8_t *p1, *p2;
+ int ret;
+
+ /*
+ * If there is a user-specified comparison function, build a
+ * contiguous copy of the key, and call it.
+ */
+ if (cmpfunc != NULL) {
+ memset(&local_dbt, 0, sizeof(local_dbt));
+ buf = NULL;
+ bufsize = 0;
+
+ if ((ret = __db_goff(dbp,
+ &local_dbt, tlen, pgno, &buf, &bufsize)) != 0)
+ return (ret);
+ /* Pass the key as the first argument */
+ *cmpp = cmpfunc(dbp, dbt, &local_dbt);
+ __os_free(buf, bufsize);
+ return (0);
+ }
+
+ /* While there are both keys to compare. */
+ for (*cmpp = 0, p1 = dbt->data,
+ key_left = dbt->size; key_left > 0 && pgno != PGNO_INVALID;) {
+ if ((ret = memp_fget(dbp->mpf, &pgno, 0, &pagep)) != 0)
+ return (ret);
+
+ cmp_bytes = OV_LEN(pagep) < key_left ? OV_LEN(pagep) : key_left;
+ tlen -= cmp_bytes;
+ key_left -= cmp_bytes;
+ for (p2 =
+ (u_int8_t *)pagep + P_OVERHEAD; cmp_bytes-- > 0; ++p1, ++p2)
+ if (*p1 != *p2) {
+ *cmpp = (long)*p1 - (long)*p2;
+ break;
+ }
+ pgno = NEXT_PGNO(pagep);
+ if ((ret = memp_fput(dbp->mpf, pagep, 0)) != 0)
+ return (ret);
+ if (*cmpp != 0)
+ return (0);
+ }
+ if (key_left > 0) /* DBT is longer than the page key. */
+ *cmpp = 1;
+ else if (tlen > 0) /* DBT is shorter than the page key. */
+ *cmpp = -1;
+ else
+ *cmpp = 0;
+
+ return (0);
+}
+
+/*
+ * __db_vrfy_overflow --
+ * Verify overflow page.
+ *
+ * PUBLIC: int __db_vrfy_overflow __P((DB *, VRFY_DBINFO *, PAGE *, db_pgno_t,
+ * PUBLIC: u_int32_t));
+ */
+int
+__db_vrfy_overflow(dbp, vdp, h, pgno, flags)
+ DB *dbp;
+ VRFY_DBINFO *vdp;
+ PAGE *h;
+ db_pgno_t pgno;
+ u_int32_t flags;
+{
+ VRFY_PAGEINFO *pip;
+ int isbad, ret, t_ret;
+
+ isbad = 0;
+ if ((ret = __db_vrfy_getpageinfo(vdp, pgno, &pip)) != 0)
+ return (ret);
+
+ if ((ret = __db_vrfy_datapage(dbp, vdp, h, pgno, flags)) != 0) {
+ if (ret == DB_VERIFY_BAD)
+ isbad = 1;
+ else
+ goto err;
+ }
+
+ pip->refcount = OV_REF(h);
+ if (pip->refcount < 1) {
+ EPRINT((dbp->dbenv,
+ "Overflow page %lu has zero reference count",
+ (u_long)pgno));
+ isbad = 1;
+ }
+
+ /* Just store for now. */
+ pip->olen = HOFFSET(h);
+
+err: if ((t_ret = __db_vrfy_putpageinfo(vdp, pip)) != 0)
+ ret = t_ret;
+ return ((ret == 0 && isbad == 1) ? DB_VERIFY_BAD : ret);
+}
+
+/*
+ * __db_vrfy_ovfl_structure --
+ * Walk a list of overflow pages, avoiding cycles and marking
+ * pages seen.
+ *
+ * PUBLIC: int __db_vrfy_ovfl_structure
+ * PUBLIC: __P((DB *, VRFY_DBINFO *, db_pgno_t, u_int32_t, u_int32_t));
+ */
+int
+__db_vrfy_ovfl_structure(dbp, vdp, pgno, tlen, flags)
+ DB *dbp;
+ VRFY_DBINFO *vdp;
+ db_pgno_t pgno;
+ u_int32_t tlen;
+ u_int32_t flags;
+{
+ DB *pgset;
+ VRFY_PAGEINFO *pip;
+ db_pgno_t next, prev;
+ int isbad, p, ret, t_ret;
+ u_int32_t refcount;
+
+ pgset = vdp->pgset;
+ DB_ASSERT(pgset != NULL);
+ isbad = 0;
+
+ /* This shouldn't happen, but just to be sure. */
+ if (!IS_VALID_PGNO(pgno))
+ return (DB_VERIFY_BAD);
+
+ /*
+ * Check the first prev_pgno; it ought to be PGNO_INVALID,
+ * since there's no prev page.
+ */
+ if ((ret = __db_vrfy_getpageinfo(vdp, pgno, &pip)) != 0)
+ return (ret);
+
+ /* The refcount is stored on the first overflow page. */
+ refcount = pip->refcount;
+
+ if (pip->type != P_OVERFLOW) {
+ EPRINT((dbp->dbenv,
+ "Overflow page %lu of invalid type",
+ (u_long)pgno, (u_long)pip->type));
+ ret = DB_VERIFY_BAD;
+ goto err; /* Unsafe to continue. */
+ }
+
+ prev = pip->prev_pgno;
+ if (prev != PGNO_INVALID) {
+ EPRINT((dbp->dbenv,
+ "First overflow page %lu has a prev_pgno", (u_long)pgno));
+ isbad = 1;
+ }
+
+ for (;;) {
+ /*
+ * This is slightly gross. Btree leaf pages reference
+ * individual overflow trees multiple times if the overflow page
+ * is the key to a duplicate set. The reference count does not
+ * reflect this multiple referencing. Thus, if this is called
+ * during the structure verification of a btree leaf page, we
+ * check to see whether we've seen it from a leaf page before
+ * and, if we have, adjust our count of how often we've seen it
+ * accordingly.
+ *
+ * (This will screw up if it's actually referenced--and
+ * correctly refcounted--from two different leaf pages, but
+ * that's a very unlikely brokenness that we're not checking for
+ * anyway.)
+ */
+
+ if (LF_ISSET(ST_OVFL_LEAF)) {
+ if (F_ISSET(pip, VRFY_OVFL_LEAFSEEN)) {
+ if ((ret =
+ __db_vrfy_pgset_dec(pgset, pgno)) != 0)
+ goto err;
+ } else
+ F_SET(pip, VRFY_OVFL_LEAFSEEN);
+ }
+
+ if ((ret = __db_vrfy_pgset_get(pgset, pgno, &p)) != 0)
+ goto err;
+
+ /*
+ * We may have seen this elsewhere, if the overflow entry
+ * has been promoted to an internal page.
+ */
+ if ((u_int32_t)p > refcount) {
+ EPRINT((dbp->dbenv,
+ "Page %lu encountered twice in overflow traversal",
+ (u_long)pgno));
+ ret = DB_VERIFY_BAD;
+ goto err;
+ }
+ if ((ret = __db_vrfy_pgset_inc(pgset, pgno)) != 0)
+ goto err;
+
+ /* Keep a running tab on how much of the item we've seen. */
+ tlen -= pip->olen;
+
+ /* Send feedback to the application about our progress. */
+ if (!LF_ISSET(DB_SALVAGE))
+ __db_vrfy_struct_feedback(dbp, vdp);
+
+ next = pip->next_pgno;
+
+ /* Are we there yet? */
+ if (next == PGNO_INVALID)
+ break;
+
+ /*
+ * We've already checked this when we saved it, but just
+ * to be sure...
+ */
+ if (!IS_VALID_PGNO(next)) {
+ DB_ASSERT(0);
+ EPRINT((dbp->dbenv,
+ "Overflow page %lu has bad next_pgno",
+ (u_long)pgno));
+ ret = DB_VERIFY_BAD;
+ goto err;
+ }
+
+ if ((ret = __db_vrfy_putpageinfo(vdp, pip)) != 0 ||
+ (ret = __db_vrfy_getpageinfo(vdp, next, &pip)) != 0)
+ return (ret);
+ if (pip->prev_pgno != pgno) {
+ EPRINT((dbp->dbenv,
+ "Overflow page %lu has bogus prev_pgno value",
+ (u_long)next));
+ isbad = 1;
+ /*
+ * It's safe to continue because we have separate
+ * cycle detection.
+ */
+ }
+
+ pgno = next;
+ }
+
+ if (tlen > 0) {
+ isbad = 1;
+ EPRINT((dbp->dbenv,
+ "Overflow item incomplete on page %lu", (u_long)pgno));
+ }
+
+err: if ((t_ret = __db_vrfy_putpageinfo(vdp, pip)) != 0 && ret == 0)
+ ret = t_ret;
+ return ((ret == 0 && isbad == 1) ? DB_VERIFY_BAD : ret);
+}
+
+/*
+ * __db_safe_goff --
+ * Get an overflow item, very carefully, from an untrusted database,
+ * in the context of the salvager.
+ *
+ * PUBLIC: int __db_safe_goff __P((DB *, VRFY_DBINFO *, db_pgno_t,
+ * PUBLIC: DBT *, void **, u_int32_t));
+ */
+int
+__db_safe_goff(dbp, vdp, pgno, dbt, buf, flags)
+ DB *dbp;
+ VRFY_DBINFO *vdp;
+ db_pgno_t pgno;
+ DBT *dbt;
+ void **buf;
+ u_int32_t flags;
+{
+ PAGE *h;
+ int ret, err_ret;
+ u_int32_t bytesgot, bytes;
+ u_int8_t *src, *dest;
+
+ ret = DB_VERIFY_BAD;
+ err_ret = 0;
+ bytesgot = bytes = 0;
+
+ while ((pgno != PGNO_INVALID) && (IS_VALID_PGNO(pgno))) {
+ /*
+ * Mark that we're looking at this page; if we've seen it
+ * already, quit.
+ */
+ if ((ret = __db_salvage_markdone(vdp, pgno)) != 0)
+ break;
+
+ if ((ret = memp_fget(dbp->mpf, &pgno, 0, &h)) != 0)
+ break;
+
+ /*
+ * Make sure it's really an overflow page, unless we're
+ * being aggressive, in which case we pretend it is.
+ */
+ if (!LF_ISSET(DB_AGGRESSIVE) && TYPE(h) != P_OVERFLOW) {
+ ret = DB_VERIFY_BAD;
+ break;
+ }
+
+ src = (u_int8_t *)h + P_OVERHEAD;
+ bytes = OV_LEN(h);
+
+ if (bytes + P_OVERHEAD > dbp->pgsize)
+ bytes = dbp->pgsize - P_OVERHEAD;
+
+ if ((ret = __os_realloc(dbp->dbenv,
+ bytesgot + bytes, 0, buf)) != 0)
+ break;
+
+ dest = (u_int8_t *)*buf + bytesgot;
+ bytesgot += bytes;
+
+ memcpy(dest, src, bytes);
+
+ pgno = NEXT_PGNO(h);
+ /* Not much we can do here--we don't want to quit. */
+ if ((ret = memp_fput(dbp->mpf, h, 0)) != 0)
+ err_ret = ret;
+ }
+
+ if (ret == 0) {
+ dbt->size = bytesgot;
+ dbt->data = *buf;
+ }
+
+ return ((err_ret != 0 && ret == 0) ? err_ret : ret);
+}
diff --git a/bdb/db/db_pr.c b/bdb/db/db_pr.c
new file mode 100644
index 00000000000..cb977cadfda
--- /dev/null
+++ b/bdb/db/db_pr.c
@@ -0,0 +1,1284 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ * Sleepycat Software. All rights reserved.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_pr.c,v 11.46 2001/01/22 17:25:06 krinsky Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <ctype.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "btree.h"
+#include "hash.h"
+#include "qam.h"
+#include "db_am.h"
+#include "db_verify.h"
+
+static int __db_bmeta __P((DB *, FILE *, BTMETA *, u_int32_t));
+static int __db_hmeta __P((DB *, FILE *, HMETA *, u_int32_t));
+static void __db_meta __P((DB *, DBMETA *, FILE *, FN const *, u_int32_t));
+static const char *__db_dbtype_to_string __P((DB *));
+static void __db_prdb __P((DB *, FILE *, u_int32_t));
+static FILE *__db_prinit __P((FILE *));
+static void __db_proff __P((void *));
+static int __db_prtree __P((DB *, u_int32_t));
+static void __db_psize __P((DB *));
+static int __db_qmeta __P((DB *, FILE *, QMETA *, u_int32_t));
+
+/*
+ * 64K is the maximum page size, so by default we check for offsets larger
+ * than that, and, where possible, we refine the test.
+ */
+#define PSIZE_BOUNDARY (64 * 1024 + 1)
+static size_t set_psize = PSIZE_BOUNDARY;
+
+static FILE *set_fp; /* Output file descriptor. */
+
+/*
+ * __db_loadme --
+ * A nice place to put a breakpoint.
+ *
+ * PUBLIC: void __db_loadme __P((void));
+ */
+void
+__db_loadme()
+{
+ getpid();
+}
+
+/*
+ * __db_dump --
+ * Dump the tree to a file.
+ *
+ * PUBLIC: int __db_dump __P((DB *, char *, char *));
+ */
+int
+__db_dump(dbp, op, name)
+ DB *dbp;
+ char *op, *name;
+{
+ FILE *fp, *save_fp;
+ u_int32_t flags;
+
+ COMPQUIET(save_fp, NULL);
+
+ if (set_psize == PSIZE_BOUNDARY)
+ __db_psize(dbp);
+
+ if (name != NULL) {
+ if ((fp = fopen(name, "w")) == NULL)
+ return (__os_get_errno());
+ save_fp = set_fp;
+ set_fp = fp;
+ } else
+ fp = __db_prinit(NULL);
+
+ for (flags = 0; *op != '\0'; ++op)
+ switch (*op) {
+ case 'a':
+ LF_SET(DB_PR_PAGE);
+ break;
+ case 'h':
+ break;
+ case 'r':
+ LF_SET(DB_PR_RECOVERYTEST);
+ break;
+ default:
+ return (EINVAL);
+ }
+
+ __db_prdb(dbp, fp, flags);
+
+ fprintf(fp, "%s\n", DB_LINE);
+
+ (void)__db_prtree(dbp, flags);
+
+ fflush(fp);
+
+ if (name != NULL) {
+ fclose(fp);
+ set_fp = save_fp;
+ }
+ return (0);
+}
+
+/*
+ * __db_prdb --
+ * Print out the DB structure information.
+ */
+static void
+__db_prdb(dbp, fp, flags)
+ DB *dbp;
+ FILE *fp;
+ u_int32_t flags;
+{
+ static const FN fn[] = {
+ { DB_AM_DISCARD, "discard cached pages" },
+ { DB_AM_DUP, "duplicates" },
+ { DB_AM_INMEM, "in-memory" },
+ { DB_AM_PGDEF, "default page size" },
+ { DB_AM_RDONLY, "read-only" },
+ { DB_AM_SUBDB, "multiple-databases" },
+ { DB_AM_SWAP, "needswap" },
+ { DB_BT_RECNUM, "btree:recnum" },
+ { DB_BT_REVSPLIT, "btree:no reverse split" },
+ { DB_DBM_ERROR, "dbm/ndbm error" },
+ { DB_OPEN_CALLED, "DB->open called" },
+ { DB_RE_DELIMITER, "recno:delimiter" },
+ { DB_RE_FIXEDLEN, "recno:fixed-length" },
+ { DB_RE_PAD, "recno:pad" },
+ { DB_RE_RENUMBER, "recno:renumber" },
+ { DB_RE_SNAPSHOT, "recno:snapshot" },
+ { 0, NULL }
+ };
+ BTREE *bt;
+ HASH *h;
+ QUEUE *q;
+
+ COMPQUIET(flags, 0);
+
+ fprintf(fp,
+ "In-memory DB structure:\n%s: %#lx",
+ __db_dbtype_to_string(dbp), (u_long)dbp->flags);
+ __db_prflags(dbp->flags, fn, fp);
+ fprintf(fp, "\n");
+
+ switch (dbp->type) {
+ case DB_BTREE:
+ case DB_RECNO:
+ bt = dbp->bt_internal;
+ fprintf(fp, "bt_meta: %lu bt_root: %lu\n",
+ (u_long)bt->bt_meta, (u_long)bt->bt_root);
+ fprintf(fp, "bt_maxkey: %lu bt_minkey: %lu\n",
+ (u_long)bt->bt_maxkey, (u_long)bt->bt_minkey);
+ fprintf(fp, "bt_compare: %#lx bt_prefix: %#lx\n",
+ (u_long)bt->bt_compare, (u_long)bt->bt_prefix);
+ fprintf(fp, "bt_lpgno: %lu\n", (u_long)bt->bt_lpgno);
+ if (dbp->type == DB_RECNO) {
+ fprintf(fp,
+ "re_pad: %#lx re_delim: %#lx re_len: %lu re_source: %s\n",
+ (u_long)bt->re_pad, (u_long)bt->re_delim,
+ (u_long)bt->re_len,
+ bt->re_source == NULL ? "" : bt->re_source);
+ fprintf(fp, "re_modified: %d re_eof: %d re_last: %lu\n",
+ bt->re_modified, bt->re_eof, (u_long)bt->re_last);
+ }
+ break;
+ case DB_HASH:
+ h = dbp->h_internal;
+ fprintf(fp, "meta_pgno: %lu\n", (u_long)h->meta_pgno);
+ fprintf(fp, "h_ffactor: %lu\n", (u_long)h->h_ffactor);
+ fprintf(fp, "h_nelem: %lu\n", (u_long)h->h_nelem);
+ fprintf(fp, "h_hash: %#lx\n", (u_long)h->h_hash);
+ break;
+ case DB_QUEUE:
+ q = dbp->q_internal;
+ fprintf(fp, "q_meta: %lu\n", (u_long)q->q_meta);
+ fprintf(fp, "q_root: %lu\n", (u_long)q->q_root);
+ fprintf(fp, "re_pad: %#lx re_len: %lu\n",
+ (u_long)q->re_pad, (u_long)q->re_len);
+ fprintf(fp, "rec_page: %lu\n", (u_long)q->rec_page);
+ fprintf(fp, "page_ext: %lu\n", (u_long)q->page_ext);
+ break;
+ default:
+ break;
+ }
+}
+
+/*
+ * __db_prtree --
+ * Print out the entire tree.
+ */
+static int
+__db_prtree(dbp, flags)
+ DB *dbp;
+ u_int32_t flags;
+{
+ PAGE *h;
+ db_pgno_t i, last;
+ int ret;
+
+ if (set_psize == PSIZE_BOUNDARY)
+ __db_psize(dbp);
+
+ if (dbp->type == DB_QUEUE) {
+ ret = __db_prqueue(dbp, flags);
+ goto done;
+ }
+
+ /* Find out the page number of the last page in the database. */
+ if ((ret = memp_fget(dbp->mpf, &last, DB_MPOOL_LAST, &h)) != 0)
+ return (ret);
+ if ((ret = memp_fput(dbp->mpf, h, 0)) != 0)
+ return (ret);
+
+ /* Dump each page. */
+ for (i = 0; i <= last; ++i) {
+ if ((ret = memp_fget(dbp->mpf, &i, 0, &h)) != 0)
+ return (ret);
+ (void)__db_prpage(dbp, h, flags);
+ if ((ret = memp_fput(dbp->mpf, h, 0)) != 0)
+ return (ret);
+ }
+
+done:
+ (void)fflush(__db_prinit(NULL));
+ return (0);
+}
+
+/*
+ * __db_meta --
+ * Print out common metadata information.
+ */
+static void
+__db_meta(dbp, dbmeta, fp, fn, flags)
+ DB *dbp;
+ DBMETA *dbmeta;
+ FILE *fp;
+ FN const *fn;
+ u_int32_t flags;
+{
+ PAGE *h;
+ int cnt;
+ db_pgno_t pgno;
+ u_int8_t *p;
+ int ret;
+ const char *sep;
+
+ fprintf(fp, "\tmagic: %#lx\n", (u_long)dbmeta->magic);
+ fprintf(fp, "\tversion: %lu\n", (u_long)dbmeta->version);
+ fprintf(fp, "\tpagesize: %lu\n", (u_long)dbmeta->pagesize);
+ fprintf(fp, "\ttype: %lu\n", (u_long)dbmeta->type);
+ fprintf(fp, "\tkeys: %lu\trecords: %lu\n",
+ (u_long)dbmeta->key_count, (u_long)dbmeta->record_count);
+
+ if (!LF_ISSET(DB_PR_RECOVERYTEST)) {
+ /*
+ * If we're doing recovery testing, don't display the free
+ * list, it may have changed and that makes the dump diff
+ * not work.
+ */
+ fprintf(fp, "\tfree list: %lu", (u_long)dbmeta->free);
+ for (pgno = dbmeta->free,
+ cnt = 0, sep = ", "; pgno != PGNO_INVALID;) {
+ if ((ret = memp_fget(dbp->mpf, &pgno, 0, &h)) != 0) {
+ fprintf(fp,
+ "Unable to retrieve free-list page: %lu: %s\n",
+ (u_long)pgno, db_strerror(ret));
+ break;
+ }
+ pgno = h->next_pgno;
+ (void)memp_fput(dbp->mpf, h, 0);
+ fprintf(fp, "%s%lu", sep, (u_long)pgno);
+ if (++cnt % 10 == 0) {
+ fprintf(fp, "\n");
+ cnt = 0;
+ sep = "\t";
+ } else
+ sep = ", ";
+ }
+ fprintf(fp, "\n");
+ }
+
+ if (fn != NULL) {
+ fprintf(fp, "\tflags: %#lx", (u_long)dbmeta->flags);
+ __db_prflags(dbmeta->flags, fn, fp);
+ fprintf(fp, "\n");
+ }
+
+ fprintf(fp, "\tuid: ");
+ for (p = (u_int8_t *)dbmeta->uid,
+ cnt = 0; cnt < DB_FILE_ID_LEN; ++cnt) {
+ fprintf(fp, "%x", *p++);
+ if (cnt < DB_FILE_ID_LEN - 1)
+ fprintf(fp, " ");
+ }
+ fprintf(fp, "\n");
+}
+
+/*
+ * __db_bmeta --
+ * Print out the btree meta-data page.
+ */
+static int
+__db_bmeta(dbp, fp, h, flags)
+ DB *dbp;
+ FILE *fp;
+ BTMETA *h;
+ u_int32_t flags;
+{
+ static const FN mfn[] = {
+ { BTM_DUP, "duplicates" },
+ { BTM_RECNO, "recno" },
+ { BTM_RECNUM, "btree:recnum" },
+ { BTM_FIXEDLEN, "recno:fixed-length" },
+ { BTM_RENUMBER, "recno:renumber" },
+ { BTM_SUBDB, "multiple-databases" },
+ { 0, NULL }
+ };
+
+ __db_meta(dbp, (DBMETA *)h, fp, mfn, flags);
+
+ fprintf(fp, "\tmaxkey: %lu minkey: %lu\n",
+ (u_long)h->maxkey, (u_long)h->minkey);
+ if (dbp->type == DB_RECNO)
+ fprintf(fp, "\tre_len: %#lx re_pad: %lu\n",
+ (u_long)h->re_len, (u_long)h->re_pad);
+ fprintf(fp, "\troot: %lu\n", (u_long)h->root);
+
+ return (0);
+}
+
+/*
+ * __db_hmeta --
+ * Print out the hash meta-data page.
+ */
+static int
+__db_hmeta(dbp, fp, h, flags)
+ DB *dbp;
+ FILE *fp;
+ HMETA *h;
+ u_int32_t flags;
+{
+ static const FN mfn[] = {
+ { DB_HASH_DUP, "duplicates" },
+ { DB_HASH_SUBDB, "multiple-databases" },
+ { 0, NULL }
+ };
+ int i;
+
+ __db_meta(dbp, (DBMETA *)h, fp, mfn, flags);
+
+ fprintf(fp, "\tmax_bucket: %lu\n", (u_long)h->max_bucket);
+ fprintf(fp, "\thigh_mask: %#lx\n", (u_long)h->high_mask);
+ fprintf(fp, "\tlow_mask: %#lx\n", (u_long)h->low_mask);
+ fprintf(fp, "\tffactor: %lu\n", (u_long)h->ffactor);
+ fprintf(fp, "\tnelem: %lu\n", (u_long)h->nelem);
+ fprintf(fp, "\th_charkey: %#lx\n", (u_long)h->h_charkey);
+ fprintf(fp, "\tspare points: ");
+ for (i = 0; i < NCACHED; i++)
+ fprintf(fp, "%lu ", (u_long)h->spares[i]);
+ fprintf(fp, "\n");
+
+ return (0);
+}
+
+/*
+ * __db_qmeta --
+ * Print out the queue meta-data page.
+ */
+static int
+__db_qmeta(dbp, fp, h, flags)
+ DB *dbp;
+ FILE *fp;
+ QMETA *h;
+ u_int32_t flags;
+{
+ __db_meta(dbp, (DBMETA *)h, fp, NULL, flags);
+
+ fprintf(fp, "\tfirst_recno: %lu\n", (u_long)h->first_recno);
+ fprintf(fp, "\tcur_recno: %lu\n", (u_long)h->cur_recno);
+ fprintf(fp, "\tre_len: %#lx re_pad: %lu\n",
+ (u_long)h->re_len, (u_long)h->re_pad);
+ fprintf(fp, "\trec_page: %lu\n", (u_long)h->rec_page);
+ fprintf(fp, "\tpage_ext: %lu\n", (u_long)h->page_ext);
+
+ return (0);
+}
+
+/*
+ * __db_prnpage
+ * -- Print out a specific page.
+ *
+ * PUBLIC: int __db_prnpage __P((DB *, db_pgno_t));
+ */
+int
+__db_prnpage(dbp, pgno)
+ DB *dbp;
+ db_pgno_t pgno;
+{
+ PAGE *h;
+ int ret;
+
+ if (set_psize == PSIZE_BOUNDARY)
+ __db_psize(dbp);
+
+ if ((ret = memp_fget(dbp->mpf, &pgno, 0, &h)) != 0)
+ return (ret);
+
+ ret = __db_prpage(dbp, h, DB_PR_PAGE);
+ (void)fflush(__db_prinit(NULL));
+
+ (void)memp_fput(dbp->mpf, h, 0);
+ return (ret);
+}
+
+/*
+ * __db_prpage
+ * -- Print out a page.
+ *
+ * PUBLIC: int __db_prpage __P((DB *, PAGE *, u_int32_t));
+ */
+int
+__db_prpage(dbp, h, flags)
+ DB *dbp;
+ PAGE *h;
+ u_int32_t flags;
+{
+ BINTERNAL *bi;
+ BKEYDATA *bk;
+ BTREE *t;
+ FILE *fp;
+ HOFFPAGE a_hkd;
+ QAMDATA *qp, *qep;
+ RINTERNAL *ri;
+ db_indx_t dlen, len, i;
+ db_pgno_t pgno;
+ db_recno_t recno;
+ int deleted, ret;
+ const char *s;
+ u_int32_t qlen;
+ u_int8_t *ep, *hk, *p;
+ void *sp;
+
+ fp = __db_prinit(NULL);
+
+ /*
+ * If we're doing recovery testing and this page is P_INVALID,
+ * assume it's a page that's on the free list, and don't display it.
+ */
+ if (LF_ISSET(DB_PR_RECOVERYTEST) && TYPE(h) == P_INVALID)
+ return (0);
+
+ s = __db_pagetype_to_string(TYPE(h));
+ if (s == NULL) {
+ fprintf(fp, "ILLEGAL PAGE TYPE: page: %lu type: %lu\n",
+ (u_long)h->pgno, (u_long)TYPE(h));
+ return (1);
+ }
+
+ /* Page number, page type. */
+ fprintf(fp, "page %lu: %s level: %lu",
+ (u_long)h->pgno, s, (u_long)h->level);
+
+ /* Record count. */
+ if (TYPE(h) == P_IBTREE ||
+ TYPE(h) == P_IRECNO || (TYPE(h) == P_LRECNO &&
+ h->pgno == ((BTREE *)dbp->bt_internal)->bt_root))
+ fprintf(fp, " records: %lu", (u_long)RE_NREC(h));
+
+ /* LSN. */
+ if (!LF_ISSET(DB_PR_RECOVERYTEST))
+ fprintf(fp, " (lsn.file: %lu lsn.offset: %lu)\n",
+ (u_long)LSN(h).file, (u_long)LSN(h).offset);
+
+ switch (TYPE(h)) {
+ case P_BTREEMETA:
+ return (__db_bmeta(dbp, fp, (BTMETA *)h, flags));
+ case P_HASHMETA:
+ return (__db_hmeta(dbp, fp, (HMETA *)h, flags));
+ case P_QAMMETA:
+ return (__db_qmeta(dbp, fp, (QMETA *)h, flags));
+ case P_QAMDATA: /* Should be meta->start. */
+ if (!LF_ISSET(DB_PR_PAGE))
+ return (0);
+
+ qlen = ((QUEUE *)dbp->q_internal)->re_len;
+ recno = (h->pgno - 1) * QAM_RECNO_PER_PAGE(dbp) + 1;
+ i = 0;
+ qep = (QAMDATA *)((u_int8_t *)h + set_psize - qlen);
+ for (qp = QAM_GET_RECORD(dbp, h, i); qp < qep;
+ recno++, i++, qp = QAM_GET_RECORD(dbp, h, i)) {
+ if (!F_ISSET(qp, QAM_SET))
+ continue;
+
+ fprintf(fp, "%s",
+ F_ISSET(qp, QAM_VALID) ? "\t" : " D");
+ fprintf(fp, "[%03lu] %4lu ",
+ (u_long)recno, (u_long)qp - (u_long)h);
+ __db_pr(qp->data, qlen);
+ }
+ return (0);
+ }
+
+ /* LSN. */
+ if (LF_ISSET(DB_PR_RECOVERYTEST))
+ fprintf(fp, " (lsn.file: %lu lsn.offset: %lu)\n",
+ (u_long)LSN(h).file, (u_long)LSN(h).offset);
+
+ t = dbp->bt_internal;
+
+ s = "\t";
+ if (TYPE(h) != P_IBTREE && TYPE(h) != P_IRECNO) {
+ fprintf(fp, "%sprev: %4lu next: %4lu",
+ s, (u_long)PREV_PGNO(h), (u_long)NEXT_PGNO(h));
+ s = " ";
+ }
+ if (TYPE(h) == P_OVERFLOW) {
+ fprintf(fp, "%sref cnt: %4lu ", s, (u_long)OV_REF(h));
+ __db_pr((u_int8_t *)h + P_OVERHEAD, OV_LEN(h));
+ return (0);
+ }
+ fprintf(fp, "%sentries: %4lu", s, (u_long)NUM_ENT(h));
+ fprintf(fp, " offset: %4lu\n", (u_long)HOFFSET(h));
+
+ if (TYPE(h) == P_INVALID || !LF_ISSET(DB_PR_PAGE))
+ return (0);
+
+ ret = 0;
+ for (i = 0; i < NUM_ENT(h); i++) {
+ if (P_ENTRY(h, i) - (u_int8_t *)h < P_OVERHEAD ||
+ (size_t)(P_ENTRY(h, i) - (u_int8_t *)h) >= set_psize) {
+ fprintf(fp,
+ "ILLEGAL PAGE OFFSET: indx: %lu of %lu\n",
+ (u_long)i, (u_long)h->inp[i]);
+ ret = EINVAL;
+ continue;
+ }
+ deleted = 0;
+ switch (TYPE(h)) {
+ case P_HASH:
+ case P_IBTREE:
+ case P_IRECNO:
+ sp = P_ENTRY(h, i);
+ break;
+ case P_LBTREE:
+ sp = P_ENTRY(h, i);
+ deleted = i % 2 == 0 &&
+ B_DISSET(GET_BKEYDATA(h, i + O_INDX)->type);
+ break;
+ case P_LDUP:
+ case P_LRECNO:
+ sp = P_ENTRY(h, i);
+ deleted = B_DISSET(GET_BKEYDATA(h, i)->type);
+ break;
+ default:
+ fprintf(fp,
+ "ILLEGAL PAGE ITEM: %lu\n", (u_long)TYPE(h));
+ ret = EINVAL;
+ continue;
+ }
+ fprintf(fp, "%s", deleted ? " D" : "\t");
+ fprintf(fp, "[%03lu] %4lu ", (u_long)i, (u_long)h->inp[i]);
+ switch (TYPE(h)) {
+ case P_HASH:
+ hk = sp;
+ switch (HPAGE_PTYPE(hk)) {
+ case H_OFFDUP:
+ memcpy(&pgno,
+ HOFFDUP_PGNO(hk), sizeof(db_pgno_t));
+ fprintf(fp,
+ "%4lu [offpage dups]\n", (u_long)pgno);
+ break;
+ case H_DUPLICATE:
+ /*
+ * If this is the first item on a page, then
+ * we cannot figure out how long it is, so
+ * we only print the first one in the duplicate
+ * set.
+ */
+ if (i != 0)
+ len = LEN_HKEYDATA(h, 0, i);
+ else
+ len = 1;
+
+ fprintf(fp, "Duplicates:\n");
+ for (p = HKEYDATA_DATA(hk),
+ ep = p + len; p < ep;) {
+ memcpy(&dlen, p, sizeof(db_indx_t));
+ p += sizeof(db_indx_t);
+ fprintf(fp, "\t\t");
+ __db_pr(p, dlen);
+ p += sizeof(db_indx_t) + dlen;
+ }
+ break;
+ case H_KEYDATA:
+ __db_pr(HKEYDATA_DATA(hk),
+ LEN_HKEYDATA(h, i == 0 ? set_psize : 0, i));
+ break;
+ case H_OFFPAGE:
+ memcpy(&a_hkd, hk, HOFFPAGE_SIZE);
+ fprintf(fp,
+ "overflow: total len: %4lu page: %4lu\n",
+ (u_long)a_hkd.tlen, (u_long)a_hkd.pgno);
+ break;
+ }
+ break;
+ case P_IBTREE:
+ bi = sp;
+ fprintf(fp, "count: %4lu pgno: %4lu type: %4lu",
+ (u_long)bi->nrecs, (u_long)bi->pgno,
+ (u_long)bi->type);
+ switch (B_TYPE(bi->type)) {
+ case B_KEYDATA:
+ __db_pr(bi->data, bi->len);
+ break;
+ case B_DUPLICATE:
+ case B_OVERFLOW:
+ __db_proff(bi->data);
+ break;
+ default:
+ fprintf(fp, "ILLEGAL BINTERNAL TYPE: %lu\n",
+ (u_long)B_TYPE(bi->type));
+ ret = EINVAL;
+ break;
+ }
+ break;
+ case P_IRECNO:
+ ri = sp;
+ fprintf(fp, "entries %4lu pgno %4lu\n",
+ (u_long)ri->nrecs, (u_long)ri->pgno);
+ break;
+ case P_LBTREE:
+ case P_LDUP:
+ case P_LRECNO:
+ bk = sp;
+ switch (B_TYPE(bk->type)) {
+ case B_KEYDATA:
+ __db_pr(bk->data, bk->len);
+ break;
+ case B_DUPLICATE:
+ case B_OVERFLOW:
+ __db_proff(bk);
+ break;
+ default:
+ fprintf(fp,
+ "ILLEGAL DUPLICATE/LBTREE/LRECNO TYPE: %lu\n",
+ (u_long)B_TYPE(bk->type));
+ ret = EINVAL;
+ break;
+ }
+ break;
+ }
+ }
+ (void)fflush(fp);
+ return (ret);
+}
+
+/*
+ * __db_pr --
+ * Print out a data element.
+ *
+ * PUBLIC: void __db_pr __P((u_int8_t *, u_int32_t));
+ */
+void
+__db_pr(p, len)
+ u_int8_t *p;
+ u_int32_t len;
+{
+ FILE *fp;
+ u_int lastch;
+ int i;
+
+ fp = __db_prinit(NULL);
+
+ fprintf(fp, "len: %3lu", (u_long)len);
+ lastch = '.';
+ if (len != 0) {
+ fprintf(fp, " data: ");
+ for (i = len <= 20 ? len : 20; i > 0; --i, ++p) {
+ lastch = *p;
+ if (isprint((int)*p) || *p == '\n')
+ fprintf(fp, "%c", *p);
+ else
+ fprintf(fp, "0x%.2x", (u_int)*p);
+ }
+ if (len > 20) {
+ fprintf(fp, "...");
+ lastch = '.';
+ }
+ }
+ if (lastch != '\n')
+ fprintf(fp, "\n");
+}
+
+/*
+ * __db_prdbt --
+ * Print out a DBT data element.
+ *
+ * PUBLIC: int __db_prdbt __P((DBT *, int, const char *, void *,
+ * PUBLIC: int (*)(void *, const void *), int, VRFY_DBINFO *));
+ */
+int
+__db_prdbt(dbtp, checkprint, prefix, handle, callback, is_recno, vdp)
+ DBT *dbtp;
+ int checkprint;
+ const char *prefix;
+ void *handle;
+ int (*callback) __P((void *, const void *));
+ int is_recno;
+ VRFY_DBINFO *vdp;
+{
+ static const char hex[] = "0123456789abcdef";
+ db_recno_t recno;
+ u_int32_t len;
+ int ret;
+#define DBTBUFLEN 100
+ char *p, *hp, buf[DBTBUFLEN], hbuf[DBTBUFLEN];
+
+ if (vdp != NULL) {
+ /*
+ * If vdp is non-NULL, we might be the first key in the
+ * "fake" subdatabase used for key/data pairs we can't
+ * associate with a known subdb.
+ *
+ * Check and clear the SALVAGE_PRINTHEADER flag; if
+ * it was set, print a subdatabase header.
+ */
+ if (F_ISSET(vdp, SALVAGE_PRINTHEADER))
+ (void)__db_prheader(NULL, "__OTHER__", 0, 0,
+ handle, callback, vdp, 0);
+ F_CLR(vdp, SALVAGE_PRINTHEADER);
+ F_SET(vdp, SALVAGE_PRINTFOOTER);
+ }
+
+ /*
+ * !!!
+ * This routine is the routine that dumps out items in the format
+ * used by db_dump(1) and db_load(1). This means that the format
+ * cannot change.
+ */
+ if (prefix != NULL && (ret = callback(handle, prefix)) != 0)
+ return (ret);
+ if (is_recno) {
+ /*
+ * We're printing a record number, and this has to be done
+ * in a platform-independent way. So we use the numeral in
+ * straight ASCII.
+ */
+ __ua_memcpy(&recno, dbtp->data, sizeof(recno));
+ snprintf(buf, DBTBUFLEN, "%lu", (u_long)recno);
+
+ /* If we're printing data as hex, print keys as hex too. */
+ if (!checkprint) {
+ for (len = strlen(buf), p = buf, hp = hbuf;
+ len-- > 0; ++p) {
+ *hp++ = hex[(u_int8_t)(*p & 0xf0) >> 4];
+ *hp++ = hex[*p & 0x0f];
+ }
+ *hp = '\0';
+ ret = callback(handle, hbuf);
+ } else
+ ret = callback(handle, buf);
+
+ if (ret != 0)
+ return (ret);
+ } else if (checkprint) {
+ for (len = dbtp->size, p = dbtp->data; len--; ++p)
+ if (isprint((int)*p)) {
+ if (*p == '\\' &&
+ (ret = callback(handle, "\\")) != 0)
+ return (ret);
+ snprintf(buf, DBTBUFLEN, "%c", *p);
+ if ((ret = callback(handle, buf)) != 0)
+ return (ret);
+ } else {
+ snprintf(buf, DBTBUFLEN, "\\%c%c",
+ hex[(u_int8_t)(*p & 0xf0) >> 4],
+ hex[*p & 0x0f]);
+ if ((ret = callback(handle, buf)) != 0)
+ return (ret);
+ }
+ } else
+ for (len = dbtp->size, p = dbtp->data; len--; ++p) {
+ snprintf(buf, DBTBUFLEN, "%c%c",
+ hex[(u_int8_t)(*p & 0xf0) >> 4],
+ hex[*p & 0x0f]);
+ if ((ret = callback(handle, buf)) != 0)
+ return (ret);
+ }
+
+ return (callback(handle, "\n"));
+}
+
+/*
+ * __db_proff --
+ * Print out an off-page element.
+ */
+static void
+__db_proff(vp)
+ void *vp;
+{
+ FILE *fp;
+ BOVERFLOW *bo;
+
+ fp = __db_prinit(NULL);
+
+ bo = vp;
+ switch (B_TYPE(bo->type)) {
+ case B_OVERFLOW:
+ fprintf(fp, "overflow: total len: %4lu page: %4lu\n",
+ (u_long)bo->tlen, (u_long)bo->pgno);
+ break;
+ case B_DUPLICATE:
+ fprintf(fp, "duplicate: page: %4lu\n", (u_long)bo->pgno);
+ break;
+ }
+}
+
+/*
+ * __db_prflags --
+ * Print out flags values.
+ *
+ * PUBLIC: void __db_prflags __P((u_int32_t, const FN *, FILE *));
+ */
+void
+__db_prflags(flags, fn, fp)
+ u_int32_t flags;
+ FN const *fn;
+ FILE *fp;
+{
+ const FN *fnp;
+ int found;
+ const char *sep;
+
+ sep = " (";
+ for (found = 0, fnp = fn; fnp->mask != 0; ++fnp)
+ if (LF_ISSET(fnp->mask)) {
+ fprintf(fp, "%s%s", sep, fnp->name);
+ sep = ", ";
+ found = 1;
+ }
+ if (found)
+ fprintf(fp, ")");
+}
+
+/*
+ * __db_prinit --
+ * Initialize tree printing routines.
+ */
+static FILE *
+__db_prinit(fp)
+ FILE *fp;
+{
+ if (set_fp == NULL)
+ set_fp = fp == NULL ? stdout : fp;
+ return (set_fp);
+}
+
+/*
+ * __db_psize --
+ * Get the page size.
+ */
+static void
+__db_psize(dbp)
+ DB *dbp;
+{
+ DBMETA *mp;
+ db_pgno_t pgno;
+
+ set_psize = PSIZE_BOUNDARY - 1;
+
+ pgno = PGNO_BASE_MD;
+ if (memp_fget(dbp->mpf, &pgno, 0, &mp) != 0)
+ return;
+
+ switch (mp->magic) {
+ case DB_BTREEMAGIC:
+ case DB_HASHMAGIC:
+ case DB_QAMMAGIC:
+ set_psize = mp->pagesize;
+ break;
+ }
+ (void)memp_fput(dbp->mpf, mp, 0);
+}
+
+/*
+ * __db_dbtype_to_string --
+ * Return the name of the database type.
+ */
+static const char *
+__db_dbtype_to_string(dbp)
+ DB *dbp;
+{
+ switch (dbp->type) {
+ case DB_BTREE:
+ return ("btree");
+ case DB_HASH:
+ return ("hash");
+ break;
+ case DB_RECNO:
+ return ("recno");
+ break;
+ case DB_QUEUE:
+ return ("queue");
+ default:
+ return ("UNKNOWN TYPE");
+ }
+ /* NOTREACHED */
+}
+
+/*
+ * __db_pagetype_to_string --
+ * Return the name of the specified page type.
+ *
+ * PUBLIC: const char *__db_pagetype_to_string __P((u_int32_t));
+ */
+const char *
+__db_pagetype_to_string(type)
+ u_int32_t type;
+{
+ char *s;
+
+ s = NULL;
+ switch (type) {
+ case P_BTREEMETA:
+ s = "btree metadata";
+ break;
+ case P_LDUP:
+ s = "duplicate";
+ break;
+ case P_HASH:
+ s = "hash";
+ break;
+ case P_HASHMETA:
+ s = "hash metadata";
+ break;
+ case P_IBTREE:
+ s = "btree internal";
+ break;
+ case P_INVALID:
+ s = "invalid";
+ break;
+ case P_IRECNO:
+ s = "recno internal";
+ break;
+ case P_LBTREE:
+ s = "btree leaf";
+ break;
+ case P_LRECNO:
+ s = "recno leaf";
+ break;
+ case P_OVERFLOW:
+ s = "overflow";
+ break;
+ case P_QAMMETA:
+ s = "queue metadata";
+ break;
+ case P_QAMDATA:
+ s = "queue";
+ break;
+ default:
+ /* Just return a NULL. */
+ break;
+ }
+ return (s);
+}
+
+/*
+ * __db_prheader --
+ * Write out header information in the format expected by db_load.
+ *
+ * PUBLIC: int __db_prheader __P((DB *, char *, int, int, void *,
+ * PUBLIC: int (*)(void *, const void *), VRFY_DBINFO *, db_pgno_t));
+ */
+int
+__db_prheader(dbp, subname, pflag, keyflag, handle, callback, vdp, meta_pgno)
+ DB *dbp;
+ char *subname;
+ int pflag, keyflag;
+ void *handle;
+ int (*callback) __P((void *, const void *));
+ VRFY_DBINFO *vdp;
+ db_pgno_t meta_pgno;
+{
+ DB_BTREE_STAT *btsp;
+ DB_ENV *dbenv;
+ DB_HASH_STAT *hsp;
+ DB_QUEUE_STAT *qsp;
+ VRFY_PAGEINFO *pip;
+ char *buf;
+ int buflen, ret, t_ret;
+ u_int32_t dbtype;
+
+ btsp = NULL;
+ hsp = NULL;
+ qsp = NULL;
+ ret = 0;
+ buf = NULL;
+ COMPQUIET(buflen, 0);
+
+ if (dbp == NULL)
+ dbenv = NULL;
+ else
+ dbenv = dbp->dbenv;
+
+ /*
+ * If we've been passed a verifier statistics object, use
+ * that; we're being called in a context where dbp->stat
+ * is unsafe.
+ */
+ if (vdp != NULL) {
+ if ((ret = __db_vrfy_getpageinfo(vdp, meta_pgno, &pip)) != 0)
+ return (ret);
+ } else
+ pip = NULL;
+
+ /*
+ * If dbp is NULL, we're being called from inside __db_prdbt,
+ * and this is a special subdatabase for "lost" items. Make it a btree.
+ * Otherwise, set dbtype to the appropriate type for the specified
+ * meta page, or the type of the dbp.
+ */
+ if (dbp == NULL)
+ dbtype = DB_BTREE;
+ else if (pip != NULL)
+ switch (pip->type) {
+ case P_BTREEMETA:
+ if (F_ISSET(pip, VRFY_IS_RECNO))
+ dbtype = DB_RECNO;
+ else
+ dbtype = DB_BTREE;
+ break;
+ case P_HASHMETA:
+ dbtype = DB_HASH;
+ break;
+ default:
+ /*
+ * If the meta page is of a bogus type, it's
+ * because we have a badly corrupt database.
+ * (We must be in the verifier for pip to be non-NULL.)
+ * Pretend we're a Btree and salvage what we can.
+ */
+ DB_ASSERT(F_ISSET(dbp, DB_AM_VERIFYING));
+ dbtype = DB_BTREE;
+ break;
+ }
+ else
+ dbtype = dbp->type;
+
+ if ((ret = callback(handle, "VERSION=3\n")) != 0)
+ goto err;
+ if (pflag) {
+ if ((ret = callback(handle, "format=print\n")) != 0)
+ goto err;
+ } else if ((ret = callback(handle, "format=bytevalue\n")) != 0)
+ goto err;
+
+ /*
+ * 64 bytes is long enough, as a minimum bound, for any of the
+ * fields besides subname. Subname can be anything, and so
+ * 64 + subname is big enough for all the things we need to print here.
+ */
+ buflen = 64 + ((subname != NULL) ? strlen(subname) : 0);
+ if ((ret = __os_malloc(dbenv, buflen, NULL, &buf)) != 0)
+ goto err;
+ if (subname != NULL) {
+ snprintf(buf, buflen, "database=%s\n", subname);
+ if ((ret = callback(handle, buf)) != 0)
+ goto err;
+ }
+ switch (dbtype) {
+ case DB_BTREE:
+ if ((ret = callback(handle, "type=btree\n")) != 0)
+ goto err;
+ if (pip != NULL) {
+ if (F_ISSET(pip, VRFY_HAS_RECNUMS))
+ if ((ret =
+ callback(handle, "recnum=1\n")) != 0)
+ goto err;
+ if (pip->bt_maxkey != 0) {
+ snprintf(buf, buflen,
+ "bt_maxkey=%lu\n", (u_long)pip->bt_maxkey);
+ if ((ret = callback(handle, buf)) != 0)
+ goto err;
+ }
+ if (pip->bt_minkey != 0 &&
+ pip->bt_minkey != DEFMINKEYPAGE) {
+ snprintf(buf, buflen,
+ "bt_minkey=%lu\n", (u_long)pip->bt_minkey);
+ if ((ret = callback(handle, buf)) != 0)
+ goto err;
+ }
+ break;
+ }
+ if ((ret = dbp->stat(dbp, &btsp, NULL, 0)) != 0) {
+ dbp->err(dbp, ret, "DB->stat");
+ goto err;
+ }
+ if (F_ISSET(dbp, DB_BT_RECNUM))
+ if ((ret = callback(handle, "recnum=1\n")) != 0)
+ goto err;
+ if (btsp->bt_maxkey != 0) {
+ snprintf(buf, buflen,
+ "bt_maxkey=%lu\n", (u_long)btsp->bt_maxkey);
+ if ((ret = callback(handle, buf)) != 0)
+ goto err;
+ }
+ if (btsp->bt_minkey != 0 && btsp->bt_minkey != DEFMINKEYPAGE) {
+ snprintf(buf, buflen,
+ "bt_minkey=%lu\n", (u_long)btsp->bt_minkey);
+ if ((ret = callback(handle, buf)) != 0)
+ goto err;
+ }
+ break;
+ case DB_HASH:
+ if ((ret = callback(handle, "type=hash\n")) != 0)
+ goto err;
+ if (pip != NULL) {
+ if (pip->h_ffactor != 0) {
+ snprintf(buf, buflen,
+ "h_ffactor=%lu\n", (u_long)pip->h_ffactor);
+ if ((ret = callback(handle, buf)) != 0)
+ goto err;
+ }
+ if (pip->h_nelem != 0) {
+ snprintf(buf, buflen,
+ "h_nelem=%lu\n", (u_long)pip->h_nelem);
+ if ((ret = callback(handle, buf)) != 0)
+ goto err;
+ }
+ break;
+ }
+ if ((ret = dbp->stat(dbp, &hsp, NULL, 0)) != 0) {
+ dbp->err(dbp, ret, "DB->stat");
+ goto err;
+ }
+ if (hsp->hash_ffactor != 0) {
+ snprintf(buf, buflen,
+ "h_ffactor=%lu\n", (u_long)hsp->hash_ffactor);
+ if ((ret = callback(handle, buf)) != 0)
+ goto err;
+ }
+ if (hsp->hash_nelem != 0 || hsp->hash_nkeys != 0) {
+ snprintf(buf, buflen, "h_nelem=%lu\n",
+ hsp->hash_nelem > hsp->hash_nkeys ?
+ (u_long)hsp->hash_nelem : (u_long)hsp->hash_nkeys);
+ if ((ret = callback(handle, buf)) != 0)
+ goto err;
+ }
+ break;
+ case DB_QUEUE:
+ if ((ret = callback(handle, "type=queue\n")) != 0)
+ goto err;
+ if (vdp != NULL) {
+ snprintf(buf,
+ buflen, "re_len=%lu\n", (u_long)vdp->re_len);
+ if ((ret = callback(handle, buf)) != 0)
+ goto err;
+ break;
+ }
+ if ((ret = dbp->stat(dbp, &qsp, NULL, 0)) != 0) {
+ dbp->err(dbp, ret, "DB->stat");
+ goto err;
+ }
+ snprintf(buf, buflen, "re_len=%lu\n", (u_long)qsp->qs_re_len);
+ if (qsp->qs_re_pad != 0 && qsp->qs_re_pad != ' ')
+ snprintf(buf, buflen, "re_pad=%#x\n", qsp->qs_re_pad);
+ if ((ret = callback(handle, buf)) != 0)
+ goto err;
+ break;
+ case DB_RECNO:
+ if ((ret = callback(handle, "type=recno\n")) != 0)
+ goto err;
+ if (pip != NULL) {
+ if (F_ISSET(pip, VRFY_IS_RRECNO))
+ if ((ret =
+ callback(handle, "renumber=1\n")) != 0)
+ goto err;
+ if (pip->re_len > 0) {
+ snprintf(buf, buflen,
+ "re_len=%lu\n", (u_long)pip->re_len);
+ if ((ret = callback(handle, buf)) != 0)
+ goto err;
+ }
+ break;
+ }
+ if ((ret = dbp->stat(dbp, &btsp, NULL, 0)) != 0) {
+ dbp->err(dbp, ret, "DB->stat");
+ goto err;
+ }
+ if (F_ISSET(dbp, DB_RE_RENUMBER))
+ if ((ret = callback(handle, "renumber=1\n")) != 0)
+ goto err;
+ if (F_ISSET(dbp, DB_RE_FIXEDLEN)) {
+ snprintf(buf, buflen,
+ "re_len=%lu\n", (u_long)btsp->bt_re_len);
+ if ((ret = callback(handle, buf)) != 0)
+ goto err;
+ }
+ if (btsp->bt_re_pad != 0 && btsp->bt_re_pad != ' ') {
+ snprintf(buf, buflen, "re_pad=%#x\n", btsp->bt_re_pad);
+ if ((ret = callback(handle, buf)) != 0)
+ goto err;
+ }
+ break;
+ case DB_UNKNOWN:
+ DB_ASSERT(0); /* Impossible. */
+ __db_err(dbp->dbenv, "Impossible DB type in __db_prheader");
+ ret = EINVAL;
+ goto err;
+ }
+
+ if (pip != NULL) {
+ if (F_ISSET(pip, VRFY_HAS_DUPS))
+ if ((ret = callback(handle, "duplicates=1\n")) != 0)
+ goto err;
+ if (F_ISSET(pip, VRFY_HAS_DUPSORT))
+ if ((ret = callback(handle, "dupsort=1\n")) != 0)
+ goto err;
+ /* We should handle page size. XXX */
+ } else {
+ if (F_ISSET(dbp, DB_AM_DUP))
+ if ((ret = callback(handle, "duplicates=1\n")) != 0)
+ goto err;
+ if (F_ISSET(dbp, DB_AM_DUPSORT))
+ if ((ret = callback(handle, "dupsort=1\n")) != 0)
+ goto err;
+ if (!F_ISSET(dbp, DB_AM_PGDEF)) {
+ snprintf(buf, buflen,
+ "db_pagesize=%lu\n", (u_long)dbp->pgsize);
+ if ((ret = callback(handle, buf)) != 0)
+ goto err;
+ }
+ }
+
+ if (keyflag && (ret = callback(handle, "keys=1\n")) != 0)
+ goto err;
+
+ ret = callback(handle, "HEADER=END\n");
+
+err: if (pip != NULL &&
+ (t_ret = __db_vrfy_putpageinfo(vdp, pip)) != 0 && ret == 0)
+ ret = t_ret;
+ if (btsp != NULL)
+ __os_free(btsp, 0);
+ if (hsp != NULL)
+ __os_free(hsp, 0);
+ if (qsp != NULL)
+ __os_free(qsp, 0);
+ if (buf != NULL)
+ __os_free(buf, buflen);
+
+ return (ret);
+}
+
+/*
+ * __db_prfooter --
+ * Print the footer that marks the end of a DB dump. This is trivial,
+ * but for consistency's sake we don't want to put its literal contents
+ * in multiple places.
+ *
+ * PUBLIC: int __db_prfooter __P((void *, int (*)(void *, const void *)));
+ */
+int
+__db_prfooter(handle, callback)
+ void *handle;
+ int (*callback) __P((void *, const void *));
+{
+ return (callback(handle, "DATA=END\n"));
+}
diff --git a/bdb/db/db_rec.c b/bdb/db/db_rec.c
new file mode 100644
index 00000000000..998d074290d
--- /dev/null
+++ b/bdb/db/db_rec.c
@@ -0,0 +1,529 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ * Sleepycat Software. All rights reserved.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_rec.c,v 11.10 2000/08/03 15:32:19 ubell Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "log.h"
+#include "hash.h"
+
+/*
+ * PUBLIC: int __db_addrem_recover
+ * PUBLIC: __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *));
+ *
+ * This log message is generated whenever we add or remove a duplicate
+ * to/from a duplicate page. On recover, we just do the opposite.
+ */
+int
+__db_addrem_recover(dbenv, dbtp, lsnp, op, info)
+ DB_ENV *dbenv;
+ DBT *dbtp;
+ DB_LSN *lsnp;
+ db_recops op;
+ void *info;
+{
+ __db_addrem_args *argp;
+ DB *file_dbp;
+ DBC *dbc;
+ DB_MPOOLFILE *mpf;
+ PAGE *pagep;
+ u_int32_t change;
+ int cmp_n, cmp_p, ret;
+
+ COMPQUIET(info, NULL);
+ REC_PRINT(__db_addrem_print);
+ REC_INTRO(__db_addrem_read, 1);
+
+ if ((ret = memp_fget(mpf, &argp->pgno, 0, &pagep)) != 0) {
+ if (DB_UNDO(op)) {
+ /*
+ * We are undoing and the page doesn't exist. That
+ * is equivalent to having a pagelsn of 0, so we
+ * would not have to undo anything. In this case,
+ * don't bother creating a page.
+ */
+ goto done;
+ } else
+ if ((ret = memp_fget(mpf,
+ &argp->pgno, DB_MPOOL_CREATE, &pagep)) != 0)
+ goto out;
+ }
+
+ cmp_n = log_compare(lsnp, &LSN(pagep));
+ cmp_p = log_compare(&LSN(pagep), &argp->pagelsn);
+ CHECK_LSN(op, cmp_p, &LSN(pagep), &argp->pagelsn);
+ change = 0;
+ if ((cmp_p == 0 && DB_REDO(op) && argp->opcode == DB_ADD_DUP) ||
+ (cmp_n == 0 && DB_UNDO(op) && argp->opcode == DB_REM_DUP)) {
+
+ /* Need to redo an add, or undo a delete. */
+ if ((ret = __db_pitem(dbc, pagep, argp->indx, argp->nbytes,
+ argp->hdr.size == 0 ? NULL : &argp->hdr,
+ argp->dbt.size == 0 ? NULL : &argp->dbt)) != 0)
+ goto out;
+
+ change = DB_MPOOL_DIRTY;
+
+ } else if ((cmp_n == 0 && DB_UNDO(op) && argp->opcode == DB_ADD_DUP) ||
+ (cmp_p == 0 && DB_REDO(op) && argp->opcode == DB_REM_DUP)) {
+ /* Need to undo an add, or redo a delete. */
+ if ((ret = __db_ditem(dbc,
+ pagep, argp->indx, argp->nbytes)) != 0)
+ goto out;
+ change = DB_MPOOL_DIRTY;
+ }
+
+ if (change) {
+ if (DB_REDO(op))
+ LSN(pagep) = *lsnp;
+ else
+ LSN(pagep) = argp->pagelsn;
+ }
+
+ if ((ret = memp_fput(mpf, pagep, change)) != 0)
+ goto out;
+
+done: *lsnp = argp->prev_lsn;
+ ret = 0;
+
+out: REC_CLOSE;
+}
+
+/*
+ * PUBLIC: int __db_big_recover
+ * PUBLIC: __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *));
+ */
+int
+__db_big_recover(dbenv, dbtp, lsnp, op, info)
+ DB_ENV *dbenv;
+ DBT *dbtp;
+ DB_LSN *lsnp;
+ db_recops op;
+ void *info;
+{
+ __db_big_args *argp;
+ DB *file_dbp;
+ DBC *dbc;
+ DB_MPOOLFILE *mpf;
+ PAGE *pagep;
+ u_int32_t change;
+ int cmp_n, cmp_p, ret;
+
+ COMPQUIET(info, NULL);
+ REC_PRINT(__db_big_print);
+ REC_INTRO(__db_big_read, 1);
+
+ if ((ret = memp_fget(mpf, &argp->pgno, 0, &pagep)) != 0) {
+ if (DB_UNDO(op)) {
+ /*
+ * We are undoing and the page doesn't exist. That
+ * is equivalent to having a pagelsn of 0, so we
+ * would not have to undo anything. In this case,
+ * don't bother creating a page.
+ */
+ ret = 0;
+ goto ppage;
+ } else
+ if ((ret = memp_fget(mpf,
+ &argp->pgno, DB_MPOOL_CREATE, &pagep)) != 0)
+ goto out;
+ }
+
+ /*
+ * There are three pages we need to check. The one on which we are
+ * adding data, the previous one whose next_pointer may have
+ * been updated, and the next one whose prev_pointer may have
+ * been updated.
+ */
+ cmp_n = log_compare(lsnp, &LSN(pagep));
+ cmp_p = log_compare(&LSN(pagep), &argp->pagelsn);
+ CHECK_LSN(op, cmp_p, &LSN(pagep), &argp->pagelsn);
+ change = 0;
+ if ((cmp_p == 0 && DB_REDO(op) && argp->opcode == DB_ADD_BIG) ||
+ (cmp_n == 0 && DB_UNDO(op) && argp->opcode == DB_REM_BIG)) {
+ /* We are either redo-ing an add, or undoing a delete. */
+ P_INIT(pagep, file_dbp->pgsize, argp->pgno, argp->prev_pgno,
+ argp->next_pgno, 0, P_OVERFLOW);
+ OV_LEN(pagep) = argp->dbt.size;
+ OV_REF(pagep) = 1;
+ memcpy((u_int8_t *)pagep + P_OVERHEAD, argp->dbt.data,
+ argp->dbt.size);
+ PREV_PGNO(pagep) = argp->prev_pgno;
+ change = DB_MPOOL_DIRTY;
+ } else if ((cmp_n == 0 && DB_UNDO(op) && argp->opcode == DB_ADD_BIG) ||
+ (cmp_p == 0 && DB_REDO(op) && argp->opcode == DB_REM_BIG)) {
+ /*
+ * We are either undo-ing an add or redo-ing a delete.
+ * The page is about to be reclaimed in either case, so
+ * there really isn't anything to do here.
+ */
+ change = DB_MPOOL_DIRTY;
+ }
+ if (change)
+ LSN(pagep) = DB_REDO(op) ? *lsnp : argp->pagelsn;
+
+ if ((ret = memp_fput(mpf, pagep, change)) != 0)
+ goto out;
+
+ /* Now check the previous page. */
+ppage: if (argp->prev_pgno != PGNO_INVALID) {
+ change = 0;
+ if ((ret = memp_fget(mpf, &argp->prev_pgno, 0, &pagep)) != 0) {
+ if (DB_UNDO(op)) {
+ /*
+ * We are undoing and the page doesn't exist.
+ * That is equivalent to having a pagelsn of 0,
+ * so we would not have to undo anything. In
+ * this case, don't bother creating a page.
+ */
+ *lsnp = argp->prev_lsn;
+ ret = 0;
+ goto npage;
+ } else
+ if ((ret = memp_fget(mpf, &argp->prev_pgno,
+ DB_MPOOL_CREATE, &pagep)) != 0)
+ goto out;
+ }
+
+ cmp_n = log_compare(lsnp, &LSN(pagep));
+ cmp_p = log_compare(&LSN(pagep), &argp->prevlsn);
+ CHECK_LSN(op, cmp_p, &LSN(pagep), &argp->prevlsn);
+
+ if ((cmp_p == 0 && DB_REDO(op) && argp->opcode == DB_ADD_BIG) ||
+ (cmp_n == 0 && DB_UNDO(op) && argp->opcode == DB_REM_BIG)) {
+ /* Redo add, undo delete. */
+ NEXT_PGNO(pagep) = argp->pgno;
+ change = DB_MPOOL_DIRTY;
+ } else if ((cmp_n == 0 &&
+ DB_UNDO(op) && argp->opcode == DB_ADD_BIG) ||
+ (cmp_p == 0 && DB_REDO(op) && argp->opcode == DB_REM_BIG)) {
+ /* Redo delete, undo add. */
+ NEXT_PGNO(pagep) = argp->next_pgno;
+ change = DB_MPOOL_DIRTY;
+ }
+ if (change)
+ LSN(pagep) = DB_REDO(op) ? *lsnp : argp->prevlsn;
+ if ((ret = memp_fput(mpf, pagep, change)) != 0)
+ goto out;
+ }
+
+ /* Now check the next page. Can only be set on a delete. */
+npage: if (argp->next_pgno != PGNO_INVALID) {
+ change = 0;
+ if ((ret = memp_fget(mpf, &argp->next_pgno, 0, &pagep)) != 0) {
+ if (DB_UNDO(op)) {
+ /*
+ * We are undoing and the page doesn't exist.
+ * That is equivalent to having a pagelsn of 0,
+ * so we would not have to undo anything. In
+ * this case, don't bother creating a page.
+ */
+ goto done;
+ } else
+ if ((ret = memp_fget(mpf, &argp->next_pgno,
+ DB_MPOOL_CREATE, &pagep)) != 0)
+ goto out;
+ }
+
+ cmp_n = log_compare(lsnp, &LSN(pagep));
+ cmp_p = log_compare(&LSN(pagep), &argp->nextlsn);
+ CHECK_LSN(op, cmp_p, &LSN(pagep), &argp->nextlsn);
+ if (cmp_p == 0 && DB_REDO(op)) {
+ PREV_PGNO(pagep) = PGNO_INVALID;
+ change = DB_MPOOL_DIRTY;
+ } else if (cmp_n == 0 && DB_UNDO(op)) {
+ PREV_PGNO(pagep) = argp->pgno;
+ change = DB_MPOOL_DIRTY;
+ }
+ if (change)
+ LSN(pagep) = DB_REDO(op) ? *lsnp : argp->nextlsn;
+ if ((ret = memp_fput(mpf, pagep, change)) != 0)
+ goto out;
+ }
+
+done: *lsnp = argp->prev_lsn;
+ ret = 0;
+
+out: REC_CLOSE;
+}
+
+/*
+ * __db_ovref_recover --
+ * Recovery function for __db_ovref().
+ *
+ * PUBLIC: int __db_ovref_recover __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *));
+ */
+int
+__db_ovref_recover(dbenv, dbtp, lsnp, op, info)
+ DB_ENV *dbenv;
+ DBT *dbtp;
+ DB_LSN *lsnp;
+ db_recops op;
+ void *info;
+{
+ __db_ovref_args *argp;
+ DB *file_dbp;
+ DBC *dbc;
+ DB_MPOOLFILE *mpf;
+ PAGE *pagep;
+ int cmp, modified, ret;
+
+ COMPQUIET(info, NULL);
+ REC_PRINT(__db_ovref_print);
+ REC_INTRO(__db_ovref_read, 1);
+
+ if ((ret = memp_fget(mpf, &argp->pgno, 0, &pagep)) != 0) {
+ if (DB_UNDO(op))
+ goto done;
+ (void)__db_pgerr(file_dbp, argp->pgno);
+ goto out;
+ }
+
+ modified = 0;
+ cmp = log_compare(&LSN(pagep), &argp->lsn);
+ CHECK_LSN(op, cmp, &LSN(pagep), &argp->lsn);
+ if (cmp == 0 && DB_REDO(op)) {
+ /* Need to redo update described. */
+ OV_REF(pagep) += argp->adjust;
+
+ pagep->lsn = *lsnp;
+ modified = 1;
+ } else if (log_compare(lsnp, &LSN(pagep)) == 0 && DB_UNDO(op)) {
+ /* Need to undo update described. */
+ OV_REF(pagep) -= argp->adjust;
+
+ pagep->lsn = argp->lsn;
+ modified = 1;
+ }
+ if ((ret = memp_fput(mpf, pagep, modified ? DB_MPOOL_DIRTY : 0)) != 0)
+ goto out;
+
+done: *lsnp = argp->prev_lsn;
+ ret = 0;
+
+out: REC_CLOSE;
+}
+
+/*
+ * __db_relink_recover --
+ * Recovery function for relink.
+ *
+ * PUBLIC: int __db_relink_recover
+ * PUBLIC: __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *));
+ */
+int
+__db_relink_recover(dbenv, dbtp, lsnp, op, info)
+ DB_ENV *dbenv;
+ DBT *dbtp;
+ DB_LSN *lsnp;
+ db_recops op;
+ void *info;
+{
+ __db_relink_args *argp;
+ DB *file_dbp;
+ DBC *dbc;
+ DB_MPOOLFILE *mpf;
+ PAGE *pagep;
+ int cmp_n, cmp_p, modified, ret;
+
+ COMPQUIET(info, NULL);
+ REC_PRINT(__db_relink_print);
+ REC_INTRO(__db_relink_read, 1);
+
+ /*
+ * There are up to three pages we need to check -- the page, and the
+ * previous and next pages, if they existed. For a page add operation,
+ * the current page is the result of a split and is being recovered
+ * elsewhere, so all we need do is recover the next page.
+ */
+ if ((ret = memp_fget(mpf, &argp->pgno, 0, &pagep)) != 0) {
+ if (DB_REDO(op)) {
+ (void)__db_pgerr(file_dbp, argp->pgno);
+ goto out;
+ }
+ goto next2;
+ }
+ modified = 0;
+ if (argp->opcode == DB_ADD_PAGE)
+ goto next1;
+
+ cmp_p = log_compare(&LSN(pagep), &argp->lsn);
+ CHECK_LSN(op, cmp_p, &LSN(pagep), &argp->lsn);
+ if (cmp_p == 0 && DB_REDO(op)) {
+ /* Redo the relink. */
+ pagep->lsn = *lsnp;
+ modified = 1;
+ } else if (log_compare(lsnp, &LSN(pagep)) == 0 && DB_UNDO(op)) {
+ /* Undo the relink. */
+ pagep->next_pgno = argp->next;
+ pagep->prev_pgno = argp->prev;
+
+ pagep->lsn = argp->lsn;
+ modified = 1;
+ }
+next1: if ((ret = memp_fput(mpf, pagep, modified ? DB_MPOOL_DIRTY : 0)) != 0)
+ goto out;
+
+next2: if ((ret = memp_fget(mpf, &argp->next, 0, &pagep)) != 0) {
+ if (DB_REDO(op)) {
+ (void)__db_pgerr(file_dbp, argp->next);
+ goto out;
+ }
+ goto prev;
+ }
+ modified = 0;
+ cmp_n = log_compare(lsnp, &LSN(pagep));
+ cmp_p = log_compare(&LSN(pagep), &argp->lsn_next);
+ CHECK_LSN(op, cmp_p, &LSN(pagep), &argp->lsn_next);
+ if ((argp->opcode == DB_REM_PAGE && cmp_p == 0 && DB_REDO(op)) ||
+ (argp->opcode == DB_ADD_PAGE && cmp_n == 0 && DB_UNDO(op))) {
+ /* Redo the remove or undo the add. */
+ pagep->prev_pgno = argp->prev;
+
+ modified = 1;
+ } else if ((argp->opcode == DB_REM_PAGE && cmp_n == 0 && DB_UNDO(op)) ||
+ (argp->opcode == DB_ADD_PAGE && cmp_p == 0 && DB_REDO(op))) {
+ /* Undo the remove or redo the add. */
+ pagep->prev_pgno = argp->pgno;
+
+ modified = 1;
+ }
+ if (modified == 1) {
+ if (DB_UNDO(op))
+ pagep->lsn = argp->lsn_next;
+ else
+ pagep->lsn = *lsnp;
+ }
+ if ((ret = memp_fput(mpf, pagep, modified ? DB_MPOOL_DIRTY : 0)) != 0)
+ goto out;
+ if (argp->opcode == DB_ADD_PAGE)
+ goto done;
+
+prev: if ((ret = memp_fget(mpf, &argp->prev, 0, &pagep)) != 0) {
+ if (DB_REDO(op)) {
+ (void)__db_pgerr(file_dbp, argp->prev);
+ goto out;
+ }
+ goto done;
+ }
+ modified = 0;
+ cmp_p = log_compare(&LSN(pagep), &argp->lsn_prev);
+ CHECK_LSN(op, cmp_p, &LSN(pagep), &argp->lsn_prev);
+ if (cmp_p == 0 && DB_REDO(op)) {
+ /* Redo the relink. */
+ pagep->next_pgno = argp->next;
+
+ modified = 1;
+ } else if (log_compare(lsnp, &LSN(pagep)) == 0 && DB_UNDO(op)) {
+ /* Undo the relink. */
+ pagep->next_pgno = argp->pgno;
+
+ modified = 1;
+ }
+ if (modified == 1) {
+ if (DB_UNDO(op))
+ pagep->lsn = argp->lsn_prev;
+ else
+ pagep->lsn = *lsnp;
+ }
+ if ((ret = memp_fput(mpf, pagep, modified ? DB_MPOOL_DIRTY : 0)) != 0)
+ goto out;
+
+done: *lsnp = argp->prev_lsn;
+ ret = 0;
+
+out: REC_CLOSE;
+}
+
+/*
+ * __db_debug_recover --
+ * Recovery function for debug.
+ *
+ * PUBLIC: int __db_debug_recover __P((DB_ENV *,
+ * PUBLIC: DBT *, DB_LSN *, db_recops, void *));
+ */
+int
+__db_debug_recover(dbenv, dbtp, lsnp, op, info)
+ DB_ENV *dbenv;
+ DBT *dbtp;
+ DB_LSN *lsnp;
+ db_recops op;
+ void *info;
+{
+ __db_debug_args *argp;
+ int ret;
+
+ COMPQUIET(op, 0);
+ COMPQUIET(dbenv, NULL);
+ COMPQUIET(info, NULL);
+
+ REC_PRINT(__db_debug_print);
+ REC_NOOP_INTRO(__db_debug_read);
+
+ *lsnp = argp->prev_lsn;
+ ret = 0;
+
+ REC_NOOP_CLOSE;
+}
+
+/*
+ * __db_noop_recover --
+ * Recovery function for noop.
+ *
+ * PUBLIC: int __db_noop_recover __P((DB_ENV *,
+ * PUBLIC: DBT *, DB_LSN *, db_recops, void *));
+ */
+int
+__db_noop_recover(dbenv, dbtp, lsnp, op, info)
+ DB_ENV *dbenv;
+ DBT *dbtp;
+ DB_LSN *lsnp;
+ db_recops op;
+ void *info;
+{
+ __db_noop_args *argp;
+ DB *file_dbp;
+ DBC *dbc;
+ DB_MPOOLFILE *mpf;
+ PAGE *pagep;
+ u_int32_t change;
+ int cmp_n, cmp_p, ret;
+
+ COMPQUIET(info, NULL);
+ REC_PRINT(__db_noop_print);
+ REC_INTRO(__db_noop_read, 0);
+
+ if ((ret = memp_fget(mpf, &argp->pgno, 0, &pagep)) != 0)
+ goto out;
+
+ cmp_n = log_compare(lsnp, &LSN(pagep));
+ cmp_p = log_compare(&LSN(pagep), &argp->prevlsn);
+ CHECK_LSN(op, cmp_p, &LSN(pagep), &argp->prevlsn);
+ change = 0;
+ if (cmp_p == 0 && DB_REDO(op)) {
+ LSN(pagep) = *lsnp;
+ change = DB_MPOOL_DIRTY;
+ } else if (cmp_n == 0 && DB_UNDO(op)) {
+ LSN(pagep) = argp->prevlsn;
+ change = DB_MPOOL_DIRTY;
+ }
+ ret = memp_fput(mpf, pagep, change);
+
+done: *lsnp = argp->prev_lsn;
+out: REC_CLOSE;
+}
diff --git a/bdb/db/db_reclaim.c b/bdb/db/db_reclaim.c
new file mode 100644
index 00000000000..739f348407d
--- /dev/null
+++ b/bdb/db/db_reclaim.c
@@ -0,0 +1,134 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ * Sleepycat Software. All rights reserved.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_reclaim.c,v 11.5 2000/04/07 14:26:58 bostic Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_am.h"
+
+/*
+ * Assume that we enter with a valid pgno. We traverse a set of
+ * duplicate pages. The format of the callback routine is:
+ * callback(dbp, page, cookie, did_put). did_put is an output
+ * value that will be set to 1 by the callback routine if it
+ * already put the page back. Otherwise, this routine must
+ * put the page.
+ *
+ * PUBLIC: int __db_traverse_dup __P((DB *,
+ * PUBLIC: db_pgno_t, int (*)(DB *, PAGE *, void *, int *), void *));
+ */
+int
+__db_traverse_dup(dbp, pgno, callback, cookie)
+ DB *dbp;
+ db_pgno_t pgno;
+ int (*callback) __P((DB *, PAGE *, void *, int *));
+ void *cookie;
+{
+ PAGE *p;
+ int did_put, i, opgno, ret;
+
+ do {
+ did_put = 0;
+ if ((ret = memp_fget(dbp->mpf, &pgno, 0, &p)) != 0)
+ return (ret);
+ pgno = NEXT_PGNO(p);
+
+ for (i = 0; i < NUM_ENT(p); i++) {
+ if (B_TYPE(GET_BKEYDATA(p, i)->type) == B_OVERFLOW) {
+ opgno = GET_BOVERFLOW(p, i)->pgno;
+ if ((ret = __db_traverse_big(dbp,
+ opgno, callback, cookie)) != 0)
+ goto err;
+ }
+ }
+
+ if ((ret = callback(dbp, p, cookie, &did_put)) != 0)
+ goto err;
+
+ if (!did_put)
+ if ((ret = memp_fput(dbp->mpf, p, 0)) != 0)
+ return (ret);
+ } while (pgno != PGNO_INVALID);
+
+ if (0) {
+err: if (did_put == 0)
+ (void)memp_fput(dbp->mpf, p, 0);
+ }
+ return (ret);
+}
+
+/*
+ * __db_traverse_big
+ * Traverse a chain of overflow pages and call the callback routine
+ * on each one. The calling convention for the callback is:
+ * callback(dbp, page, cookie, did_put),
+ * where did_put is a return value indicating if the page in question has
+ * already been returned to the mpool.
+ *
+ * PUBLIC: int __db_traverse_big __P((DB *,
+ * PUBLIC: db_pgno_t, int (*)(DB *, PAGE *, void *, int *), void *));
+ */
+int
+__db_traverse_big(dbp, pgno, callback, cookie)
+ DB *dbp;
+ db_pgno_t pgno;
+ int (*callback) __P((DB *, PAGE *, void *, int *));
+ void *cookie;
+{
+ PAGE *p;
+ int did_put, ret;
+
+ do {
+ did_put = 0;
+ if ((ret = memp_fget(dbp->mpf, &pgno, 0, &p)) != 0)
+ return (ret);
+ pgno = NEXT_PGNO(p);
+ if ((ret = callback(dbp, p, cookie, &did_put)) == 0 &&
+ !did_put)
+ ret = memp_fput(dbp->mpf, p, 0);
+ } while (ret == 0 && pgno != PGNO_INVALID);
+
+ return (ret);
+}
+
+/*
+ * __db_reclaim_callback
+ * This is the callback routine used during a delete of a subdatabase.
+ * we are traversing a btree or hash table and trying to free all the
+ * pages. Since they share common code for duplicates and overflow
+ * items, we traverse them identically and use this routine to do the
+ * actual free. The reason that this is callback is because hash uses
+ * the same traversal code for statistics gathering.
+ *
+ * PUBLIC: int __db_reclaim_callback __P((DB *, PAGE *, void *, int *));
+ */
+int
+__db_reclaim_callback(dbp, p, cookie, putp)
+ DB *dbp;
+ PAGE *p;
+ void *cookie;
+ int *putp;
+{
+ int ret;
+
+ COMPQUIET(dbp, NULL);
+
+ if ((ret = __db_free(cookie, p)) != 0)
+ return (ret);
+ *putp = 1;
+
+ return (0);
+}
diff --git a/bdb/db/db_ret.c b/bdb/db/db_ret.c
new file mode 100644
index 00000000000..0782de3e450
--- /dev/null
+++ b/bdb/db/db_ret.c
@@ -0,0 +1,160 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ * Sleepycat Software. All rights reserved.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_ret.c,v 11.12 2000/11/30 00:58:33 ubell Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "btree.h"
+#include "db_am.h"
+
+/*
+ * __db_ret --
+ * Build return DBT.
+ *
+ * PUBLIC: int __db_ret __P((DB *,
+ * PUBLIC: PAGE *, u_int32_t, DBT *, void **, u_int32_t *));
+ */
+int
+__db_ret(dbp, h, indx, dbt, memp, memsize)
+ DB *dbp;
+ PAGE *h;
+ u_int32_t indx;
+ DBT *dbt;
+ void **memp;
+ u_int32_t *memsize;
+{
+ BKEYDATA *bk;
+ HOFFPAGE ho;
+ BOVERFLOW *bo;
+ u_int32_t len;
+ u_int8_t *hk;
+ void *data;
+
+ switch (TYPE(h)) {
+ case P_HASH:
+ hk = P_ENTRY(h, indx);
+ if (HPAGE_PTYPE(hk) == H_OFFPAGE) {
+ memcpy(&ho, hk, sizeof(HOFFPAGE));
+ return (__db_goff(dbp, dbt,
+ ho.tlen, ho.pgno, memp, memsize));
+ }
+ len = LEN_HKEYDATA(h, dbp->pgsize, indx);
+ data = HKEYDATA_DATA(hk);
+ break;
+ case P_LBTREE:
+ case P_LDUP:
+ case P_LRECNO:
+ bk = GET_BKEYDATA(h, indx);
+ if (B_TYPE(bk->type) == B_OVERFLOW) {
+ bo = (BOVERFLOW *)bk;
+ return (__db_goff(dbp, dbt,
+ bo->tlen, bo->pgno, memp, memsize));
+ }
+ len = bk->len;
+ data = bk->data;
+ break;
+ default:
+ return (__db_pgfmt(dbp, h->pgno));
+ }
+
+ return (__db_retcopy(dbp, dbt, data, len, memp, memsize));
+}
+
+/*
+ * __db_retcopy --
+ * Copy the returned data into the user's DBT, handling special flags.
+ *
+ * PUBLIC: int __db_retcopy __P((DB *, DBT *,
+ * PUBLIC: void *, u_int32_t, void **, u_int32_t *));
+ */
+int
+__db_retcopy(dbp, dbt, data, len, memp, memsize)
+ DB *dbp;
+ DBT *dbt;
+ void *data;
+ u_int32_t len;
+ void **memp;
+ u_int32_t *memsize;
+{
+ DB_ENV *dbenv;
+ int ret;
+
+ dbenv = dbp == NULL ? NULL : dbp->dbenv;
+
+ /* If returning a partial record, reset the length. */
+ if (F_ISSET(dbt, DB_DBT_PARTIAL)) {
+ data = (u_int8_t *)data + dbt->doff;
+ if (len > dbt->doff) {
+ len -= dbt->doff;
+ if (len > dbt->dlen)
+ len = dbt->dlen;
+ } else
+ len = 0;
+ }
+
+ /*
+ * Return the length of the returned record in the DBT size field.
+ * This satisfies the requirement that if we're using user memory
+ * and insufficient memory was provided, return the amount necessary
+ * in the size field.
+ */
+ dbt->size = len;
+
+ /*
+ * Allocate memory to be owned by the application: DB_DBT_MALLOC,
+ * DB_DBT_REALLOC.
+ *
+ * !!!
+ * We always allocate memory, even if we're copying out 0 bytes. This
+ * guarantees consistency, i.e., the application can always free memory
+ * without concern as to how many bytes of the record were requested.
+ *
+ * Use the memory specified by the application: DB_DBT_USERMEM.
+ *
+ * !!!
+ * If the length we're going to copy is 0, the application-supplied
+ * memory pointer is allowed to be NULL.
+ */
+ if (F_ISSET(dbt, DB_DBT_MALLOC)) {
+ if ((ret = __os_malloc(dbenv, len,
+ dbp == NULL ? NULL : dbp->db_malloc, &dbt->data)) != 0)
+ return (ret);
+ } else if (F_ISSET(dbt, DB_DBT_REALLOC)) {
+ if ((ret = __os_realloc(dbenv, len,
+ dbp == NULL ? NULL : dbp->db_realloc, &dbt->data)) != 0)
+ return (ret);
+ } else if (F_ISSET(dbt, DB_DBT_USERMEM)) {
+ if (len != 0 && (dbt->data == NULL || dbt->ulen < len))
+ return (ENOMEM);
+ } else if (memp == NULL || memsize == NULL) {
+ return (EINVAL);
+ } else {
+ if (len != 0 && (*memsize == 0 || *memsize < len)) {
+ if ((ret = __os_realloc(dbenv, len, NULL, memp)) != 0) {
+ *memsize = 0;
+ return (ret);
+ }
+ *memsize = len;
+ }
+ dbt->data = *memp;
+ }
+
+ if (len != 0)
+ memcpy(dbt->data, data, len);
+ return (0);
+}
diff --git a/bdb/db/db_upg.c b/bdb/db/db_upg.c
new file mode 100644
index 00000000000..d8573146ad6
--- /dev/null
+++ b/bdb/db/db_upg.c
@@ -0,0 +1,338 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ * Sleepycat Software. All rights reserved.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_upg.c,v 11.20 2000/12/12 17:35:30 bostic Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_swap.h"
+#include "btree.h"
+#include "hash.h"
+#include "qam.h"
+
+static int (* const func_31_list[P_PAGETYPE_MAX])
+ __P((DB *, char *, u_int32_t, DB_FH *, PAGE *, int *)) = {
+ NULL, /* P_INVALID */
+ NULL, /* __P_DUPLICATE */
+ __ham_31_hash, /* P_HASH */
+ NULL, /* P_IBTREE */
+ NULL, /* P_IRECNO */
+ __bam_31_lbtree, /* P_LBTREE */
+ NULL, /* P_LRECNO */
+ NULL, /* P_OVERFLOW */
+ __ham_31_hashmeta, /* P_HASHMETA */
+ __bam_31_btreemeta, /* P_BTREEMETA */
+};
+
+static int __db_page_pass __P((DB *, char *, u_int32_t, int (* const [])
+ (DB *, char *, u_int32_t, DB_FH *, PAGE *, int *), DB_FH *));
+
+/*
+ * __db_upgrade --
+ * Upgrade an existing database.
+ *
+ * PUBLIC: int __db_upgrade __P((DB *, const char *, u_int32_t));
+ */
+int
+__db_upgrade(dbp, fname, flags)
+ DB *dbp;
+ const char *fname;
+ u_int32_t flags;
+{
+ DB_ENV *dbenv;
+ DB_FH fh;
+ size_t n;
+ int ret, t_ret;
+ u_int8_t mbuf[256];
+ char *real_name;
+
+ dbenv = dbp->dbenv;
+
+ /* Validate arguments. */
+ if ((ret = __db_fchk(dbenv, "DB->upgrade", flags, DB_DUPSORT)) != 0)
+ return (ret);
+
+ /* Get the real backing file name. */
+ if ((ret = __db_appname(dbenv,
+ DB_APP_DATA, NULL, fname, 0, NULL, &real_name)) != 0)
+ return (ret);
+
+ /* Open the file. */
+ if ((ret = __os_open(dbenv, real_name, 0, 0, &fh)) != 0) {
+ __db_err(dbenv, "%s: %s", real_name, db_strerror(ret));
+ return (ret);
+ }
+
+ /* Initialize the feedback. */
+ if (dbp->db_feedback != NULL)
+ dbp->db_feedback(dbp, DB_UPGRADE, 0);
+
+ /*
+ * Read the metadata page. We read 256 bytes, which is larger than
+ * any access method's metadata page and smaller than any disk sector.
+ */
+ if ((ret = __os_read(dbenv, &fh, mbuf, sizeof(mbuf), &n)) != 0)
+ goto err;
+
+ switch (((DBMETA *)mbuf)->magic) {
+ case DB_BTREEMAGIC:
+ switch (((DBMETA *)mbuf)->version) {
+ case 6:
+ /*
+ * Before V7 not all pages had page types, so we do the
+ * single meta-data page by hand.
+ */
+ if ((ret =
+ __bam_30_btreemeta(dbp, real_name, mbuf)) != 0)
+ goto err;
+ if ((ret = __os_seek(dbenv,
+ &fh, 0, 0, 0, 0, DB_OS_SEEK_SET)) != 0)
+ goto err;
+ if ((ret = __os_write(dbenv, &fh, mbuf, 256, &n)) != 0)
+ goto err;
+ /* FALLTHROUGH */
+ case 7:
+ /*
+ * We need the page size to do more. Rip it out of
+ * the meta-data page.
+ */
+ memcpy(&dbp->pgsize, mbuf + 20, sizeof(u_int32_t));
+
+ if ((ret = __db_page_pass(
+ dbp, real_name, flags, func_31_list, &fh)) != 0)
+ goto err;
+ /* FALLTHROUGH */
+ case 8:
+ break;
+ default:
+ __db_err(dbenv, "%s: unsupported btree version: %lu",
+ real_name, (u_long)((DBMETA *)mbuf)->version);
+ ret = DB_OLD_VERSION;
+ goto err;
+ }
+ break;
+ case DB_HASHMAGIC:
+ switch (((DBMETA *)mbuf)->version) {
+ case 4:
+ case 5:
+ /*
+ * Before V6 not all pages had page types, so we do the
+ * single meta-data page by hand.
+ */
+ if ((ret =
+ __ham_30_hashmeta(dbp, real_name, mbuf)) != 0)
+ goto err;
+ if ((ret = __os_seek(dbenv,
+ &fh, 0, 0, 0, 0, DB_OS_SEEK_SET)) != 0)
+ goto err;
+ if ((ret = __os_write(dbenv, &fh, mbuf, 256, &n)) != 0)
+ goto err;
+
+ /*
+ * Before V6, we created hash pages one by one as they
+ * were needed, using hashhdr.ovfl_point to reserve
+ * a block of page numbers for them. A consequence
+ * of this was that, if no overflow pages had been
+ * created, the current doubling might extend past
+ * the end of the database file.
+ *
+ * In DB 3.X, we now create all the hash pages
+ * belonging to a doubling atomicly; it's not
+ * safe to just save them for later, because when
+ * we create an overflow page we'll just create
+ * a new last page (whatever that may be). Grow
+ * the database to the end of the current doubling.
+ */
+ if ((ret =
+ __ham_30_sizefix(dbp, &fh, real_name, mbuf)) != 0)
+ goto err;
+ /* FALLTHROUGH */
+ case 6:
+ /*
+ * We need the page size to do more. Rip it out of
+ * the meta-data page.
+ */
+ memcpy(&dbp->pgsize, mbuf + 20, sizeof(u_int32_t));
+
+ if ((ret = __db_page_pass(
+ dbp, real_name, flags, func_31_list, &fh)) != 0)
+ goto err;
+ /* FALLTHROUGH */
+ case 7:
+ break;
+ default:
+ __db_err(dbenv, "%s: unsupported hash version: %lu",
+ real_name, (u_long)((DBMETA *)mbuf)->version);
+ ret = DB_OLD_VERSION;
+ goto err;
+ }
+ break;
+ case DB_QAMMAGIC:
+ switch (((DBMETA *)mbuf)->version) {
+ case 1:
+ /*
+ * If we're in a Queue database, the only page that
+ * needs upgrading is the meta-database page, don't
+ * bother with a full pass.
+ */
+ if ((ret = __qam_31_qammeta(dbp, real_name, mbuf)) != 0)
+ return (ret);
+ /* FALLTHROUGH */
+ case 2:
+ if ((ret = __qam_32_qammeta(dbp, real_name, mbuf)) != 0)
+ return (ret);
+ if ((ret = __os_seek(dbenv,
+ &fh, 0, 0, 0, 0, DB_OS_SEEK_SET)) != 0)
+ goto err;
+ if ((ret = __os_write(dbenv, &fh, mbuf, 256, &n)) != 0)
+ goto err;
+ /* FALLTHROUGH */
+ case 3:
+ break;
+ default:
+ __db_err(dbenv, "%s: unsupported queue version: %lu",
+ real_name, (u_long)((DBMETA *)mbuf)->version);
+ ret = DB_OLD_VERSION;
+ goto err;
+ }
+ break;
+ default:
+ M_32_SWAP(((DBMETA *)mbuf)->magic);
+ switch (((DBMETA *)mbuf)->magic) {
+ case DB_BTREEMAGIC:
+ case DB_HASHMAGIC:
+ case DB_QAMMAGIC:
+ __db_err(dbenv,
+ "%s: DB->upgrade only supported on native byte-order systems",
+ real_name);
+ break;
+ default:
+ __db_err(dbenv,
+ "%s: unrecognized file type", real_name);
+ break;
+ }
+ ret = EINVAL;
+ goto err;
+ }
+
+ ret = __os_fsync(dbenv, &fh);
+
+err: if ((t_ret = __os_closehandle(&fh)) != 0 && ret == 0)
+ ret = t_ret;
+ __os_freestr(real_name);
+
+ /* We're done. */
+ if (dbp->db_feedback != NULL)
+ dbp->db_feedback(dbp, DB_UPGRADE, 100);
+
+ return (ret);
+}
+
+/*
+ * __db_page_pass --
+ * Walk the pages of the database, upgrading whatever needs it.
+ */
+static int
+__db_page_pass(dbp, real_name, flags, fl, fhp)
+ DB *dbp;
+ char *real_name;
+ u_int32_t flags;
+ int (* const fl[P_PAGETYPE_MAX])
+ __P((DB *, char *, u_int32_t, DB_FH *, PAGE *, int *));
+ DB_FH *fhp;
+{
+ DB_ENV *dbenv;
+ PAGE *page;
+ db_pgno_t i, pgno_last;
+ size_t n;
+ int dirty, ret;
+
+ dbenv = dbp->dbenv;
+
+ /* Determine the last page of the file. */
+ if ((ret = __db_lastpgno(dbp, real_name, fhp, &pgno_last)) != 0)
+ return (ret);
+
+ /* Allocate memory for a single page. */
+ if ((ret = __os_malloc(dbenv, dbp->pgsize, NULL, &page)) != 0)
+ return (ret);
+
+ /* Walk the file, calling the underlying conversion functions. */
+ for (i = 0; i < pgno_last; ++i) {
+ if (dbp->db_feedback != NULL)
+ dbp->db_feedback(dbp, DB_UPGRADE, (i * 100)/pgno_last);
+ if ((ret = __os_seek(dbenv,
+ fhp, dbp->pgsize, i, 0, 0, DB_OS_SEEK_SET)) != 0)
+ break;
+ if ((ret = __os_read(dbenv, fhp, page, dbp->pgsize, &n)) != 0)
+ break;
+ dirty = 0;
+ if (fl[TYPE(page)] != NULL && (ret = fl[TYPE(page)]
+ (dbp, real_name, flags, fhp, page, &dirty)) != 0)
+ break;
+ if (dirty) {
+ if ((ret = __os_seek(dbenv,
+ fhp, dbp->pgsize, i, 0, 0, DB_OS_SEEK_SET)) != 0)
+ break;
+ if ((ret = __os_write(dbenv,
+ fhp, page, dbp->pgsize, &n)) != 0)
+ break;
+ }
+ }
+
+ __os_free(page, dbp->pgsize);
+ return (ret);
+}
+
+/*
+ * __db_lastpgno --
+ * Return the current last page number of the file.
+ *
+ * PUBLIC: int __db_lastpgno __P((DB *, char *, DB_FH *, db_pgno_t *));
+ */
+int
+__db_lastpgno(dbp, real_name, fhp, pgno_lastp)
+ DB *dbp;
+ char *real_name;
+ DB_FH *fhp;
+ db_pgno_t *pgno_lastp;
+{
+ DB_ENV *dbenv;
+ db_pgno_t pgno_last;
+ u_int32_t mbytes, bytes;
+ int ret;
+
+ dbenv = dbp->dbenv;
+
+ if ((ret = __os_ioinfo(dbenv,
+ real_name, fhp, &mbytes, &bytes, NULL)) != 0) {
+ __db_err(dbenv, "%s: %s", real_name, db_strerror(ret));
+ return (ret);
+ }
+
+ /* Page sizes have to be a power-of-two. */
+ if (bytes % dbp->pgsize != 0) {
+ __db_err(dbenv,
+ "%s: file size not a multiple of the pagesize", real_name);
+ return (EINVAL);
+ }
+ pgno_last = mbytes * (MEGABYTE / dbp->pgsize);
+ pgno_last += bytes / dbp->pgsize;
+
+ *pgno_lastp = pgno_last;
+ return (0);
+}
diff --git a/bdb/db/db_upg_opd.c b/bdb/db/db_upg_opd.c
new file mode 100644
index 00000000000..a7be784afb8
--- /dev/null
+++ b/bdb/db/db_upg_opd.c
@@ -0,0 +1,353 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ * Sleepycat Software. All rights reserved.
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_upg_opd.c,v 11.9 2000/11/30 00:58:33 ubell Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_swap.h"
+#include "btree.h"
+#include "hash.h"
+#include "qam.h"
+
+static int __db_build_bi __P((DB *, DB_FH *, PAGE *, PAGE *, u_int32_t, int *));
+static int __db_build_ri __P((DB *, DB_FH *, PAGE *, PAGE *, u_int32_t, int *));
+static int __db_up_ovref __P((DB *, DB_FH *, db_pgno_t));
+
+#define GET_PAGE(dbp, fhp, pgno, page) { \
+ if ((ret = __os_seek(dbp->dbenv, \
+ fhp, (dbp)->pgsize, pgno, 0, 0, DB_OS_SEEK_SET)) != 0) \
+ goto err; \
+ if ((ret = __os_read(dbp->dbenv, \
+ fhp, page, (dbp)->pgsize, &n)) != 0) \
+ goto err; \
+}
+#define PUT_PAGE(dbp, fhp, pgno, page) { \
+ if ((ret = __os_seek(dbp->dbenv, \
+ fhp, (dbp)->pgsize, pgno, 0, 0, DB_OS_SEEK_SET)) != 0) \
+ goto err; \
+ if ((ret = __os_write(dbp->dbenv, \
+ fhp, page, (dbp)->pgsize, &n)) != 0) \
+ goto err; \
+}
+
+/*
+ * __db_31_offdup --
+ * Convert 3.0 off-page duplicates to 3.1 off-page duplicates.
+ *
+ * PUBLIC: int __db_31_offdup __P((DB *, char *, DB_FH *, int, db_pgno_t *));
+ */
+int
+__db_31_offdup(dbp, real_name, fhp, sorted, pgnop)
+ DB *dbp;
+ char *real_name;
+ DB_FH *fhp;
+ int sorted;
+ db_pgno_t *pgnop;
+{
+ PAGE *ipage, *page;
+ db_indx_t indx;
+ db_pgno_t cur_cnt, i, next_cnt, pgno, *pgno_cur, pgno_last;
+ db_pgno_t *pgno_next, pgno_max, *tmp;
+ db_recno_t nrecs;
+ size_t n;
+ int level, nomem, ret;
+
+ ipage = page = NULL;
+ pgno_cur = pgno_next = NULL;
+
+ /* Allocate room to hold a page. */
+ if ((ret = __os_malloc(dbp->dbenv, dbp->pgsize, NULL, &page)) != 0)
+ goto err;
+
+ /*
+ * Walk the chain of 3.0 off-page duplicates. Each one is converted
+ * in place to a 3.1 off-page duplicate page. If the duplicates are
+ * sorted, they are converted to a Btree leaf page, otherwise to a
+ * Recno leaf page.
+ */
+ for (nrecs = 0, cur_cnt = pgno_max = 0,
+ pgno = *pgnop; pgno != PGNO_INVALID;) {
+ if (pgno_max == cur_cnt) {
+ pgno_max += 20;
+ if ((ret = __os_realloc(dbp->dbenv, pgno_max *
+ sizeof(db_pgno_t), NULL, &pgno_cur)) != 0)
+ goto err;
+ }
+ pgno_cur[cur_cnt++] = pgno;
+
+ GET_PAGE(dbp, fhp, pgno, page);
+ nrecs += NUM_ENT(page);
+ LEVEL(page) = LEAFLEVEL;
+ TYPE(page) = sorted ? P_LDUP : P_LRECNO;
+ /*
+ * !!!
+ * DB didn't zero the LSNs on off-page duplicates pages.
+ */
+ ZERO_LSN(LSN(page));
+ PUT_PAGE(dbp, fhp, pgno, page);
+
+ pgno = NEXT_PGNO(page);
+ }
+
+ /* If we only have a single page, it's easy. */
+ if (cur_cnt > 1) {
+ /*
+ * pgno_cur is the list of pages we just converted. We're
+ * going to walk that list, but we'll need to create a new
+ * list while we do so.
+ */
+ if ((ret = __os_malloc(dbp->dbenv,
+ cur_cnt * sizeof(db_pgno_t), NULL, &pgno_next)) != 0)
+ goto err;
+
+ /* Figure out where we can start allocating new pages. */
+ if ((ret = __db_lastpgno(dbp, real_name, fhp, &pgno_last)) != 0)
+ goto err;
+
+ /* Allocate room for an internal page. */
+ if ((ret = __os_malloc(dbp->dbenv,
+ dbp->pgsize, NULL, &ipage)) != 0)
+ goto err;
+ PGNO(ipage) = PGNO_INVALID;
+ }
+
+ /*
+ * Repeatedly walk the list of pages, building internal pages, until
+ * there's only one page at a level.
+ */
+ for (level = LEAFLEVEL + 1; cur_cnt > 1; ++level) {
+ for (indx = 0, i = next_cnt = 0; i < cur_cnt;) {
+ if (indx == 0) {
+ P_INIT(ipage, dbp->pgsize, pgno_last,
+ PGNO_INVALID, PGNO_INVALID,
+ level, sorted ? P_IBTREE : P_IRECNO);
+ ZERO_LSN(LSN(ipage));
+
+ pgno_next[next_cnt++] = pgno_last++;
+ }
+
+ GET_PAGE(dbp, fhp, pgno_cur[i], page);
+
+ /*
+ * If the duplicates are sorted, put the first item on
+ * the lower-level page onto a Btree internal page. If
+ * the duplicates are not sorted, create an internal
+ * Recno structure on the page. If either case doesn't
+ * fit, push out the current page and start a new one.
+ */
+ nomem = 0;
+ if (sorted) {
+ if ((ret = __db_build_bi(
+ dbp, fhp, ipage, page, indx, &nomem)) != 0)
+ goto err;
+ } else
+ if ((ret = __db_build_ri(
+ dbp, fhp, ipage, page, indx, &nomem)) != 0)
+ goto err;
+ if (nomem) {
+ indx = 0;
+ PUT_PAGE(dbp, fhp, PGNO(ipage), ipage);
+ } else {
+ ++indx;
+ ++NUM_ENT(ipage);
+ ++i;
+ }
+ }
+
+ /*
+ * Push out the last internal page. Set the top-level record
+ * count if we've reached the top.
+ */
+ if (next_cnt == 1)
+ RE_NREC_SET(ipage, nrecs);
+ PUT_PAGE(dbp, fhp, PGNO(ipage), ipage);
+
+ /* Swap the current and next page number arrays. */
+ cur_cnt = next_cnt;
+ tmp = pgno_cur;
+ pgno_cur = pgno_next;
+ pgno_next = tmp;
+ }
+
+ *pgnop = pgno_cur[0];
+
+err: if (pgno_cur != NULL)
+ __os_free(pgno_cur, 0);
+ if (pgno_next != NULL)
+ __os_free(pgno_next, 0);
+ if (ipage != NULL)
+ __os_free(ipage, dbp->pgsize);
+ if (page != NULL)
+ __os_free(page, dbp->pgsize);
+
+ return (ret);
+}
+
+/*
+ * __db_build_bi --
+ * Build a BINTERNAL entry for a parent page.
+ */
+static int
+__db_build_bi(dbp, fhp, ipage, page, indx, nomemp)
+ DB *dbp;
+ DB_FH *fhp;
+ PAGE *ipage, *page;
+ u_int32_t indx;
+ int *nomemp;
+{
+ BINTERNAL bi, *child_bi;
+ BKEYDATA *child_bk;
+ u_int8_t *p;
+ int ret;
+
+ switch (TYPE(page)) {
+ case P_IBTREE:
+ child_bi = GET_BINTERNAL(page, 0);
+ if (P_FREESPACE(ipage) < BINTERNAL_PSIZE(child_bi->len)) {
+ *nomemp = 1;
+ return (0);
+ }
+ ipage->inp[indx] =
+ HOFFSET(ipage) -= BINTERNAL_SIZE(child_bi->len);
+ p = P_ENTRY(ipage, indx);
+
+ bi.len = child_bi->len;
+ B_TSET(bi.type, child_bi->type, 0);
+ bi.pgno = PGNO(page);
+ bi.nrecs = __bam_total(page);
+ memcpy(p, &bi, SSZA(BINTERNAL, data));
+ p += SSZA(BINTERNAL, data);
+ memcpy(p, child_bi->data, child_bi->len);
+
+ /* Increment the overflow ref count. */
+ if (B_TYPE(child_bi->type) == B_OVERFLOW)
+ if ((ret = __db_up_ovref(dbp, fhp,
+ ((BOVERFLOW *)(child_bi->data))->pgno)) != 0)
+ return (ret);
+ break;
+ case P_LDUP:
+ child_bk = GET_BKEYDATA(page, 0);
+ switch (B_TYPE(child_bk->type)) {
+ case B_KEYDATA:
+ if (P_FREESPACE(ipage) <
+ BINTERNAL_PSIZE(child_bk->len)) {
+ *nomemp = 1;
+ return (0);
+ }
+ ipage->inp[indx] =
+ HOFFSET(ipage) -= BINTERNAL_SIZE(child_bk->len);
+ p = P_ENTRY(ipage, indx);
+
+ bi.len = child_bk->len;
+ B_TSET(bi.type, child_bk->type, 0);
+ bi.pgno = PGNO(page);
+ bi.nrecs = __bam_total(page);
+ memcpy(p, &bi, SSZA(BINTERNAL, data));
+ p += SSZA(BINTERNAL, data);
+ memcpy(p, child_bk->data, child_bk->len);
+ break;
+ case B_OVERFLOW:
+ if (P_FREESPACE(ipage) <
+ BINTERNAL_PSIZE(BOVERFLOW_SIZE)) {
+ *nomemp = 1;
+ return (0);
+ }
+ ipage->inp[indx] =
+ HOFFSET(ipage) -= BINTERNAL_SIZE(BOVERFLOW_SIZE);
+ p = P_ENTRY(ipage, indx);
+
+ bi.len = BOVERFLOW_SIZE;
+ B_TSET(bi.type, child_bk->type, 0);
+ bi.pgno = PGNO(page);
+ bi.nrecs = __bam_total(page);
+ memcpy(p, &bi, SSZA(BINTERNAL, data));
+ p += SSZA(BINTERNAL, data);
+ memcpy(p, child_bk, BOVERFLOW_SIZE);
+
+ /* Increment the overflow ref count. */
+ if ((ret = __db_up_ovref(dbp, fhp,
+ ((BOVERFLOW *)child_bk)->pgno)) != 0)
+ return (ret);
+ break;
+ default:
+ return (__db_pgfmt(dbp, PGNO(page)));
+ }
+ break;
+ default:
+ return (__db_pgfmt(dbp, PGNO(page)));
+ }
+
+ return (0);
+}
+
+/*
+ * __db_build_ri --
+ * Build a RINTERNAL entry for an internal parent page.
+ */
+static int
+__db_build_ri(dbp, fhp, ipage, page, indx, nomemp)
+ DB *dbp;
+ DB_FH *fhp;
+ PAGE *ipage, *page;
+ u_int32_t indx;
+ int *nomemp;
+{
+ RINTERNAL ri;
+
+ COMPQUIET(dbp, NULL);
+ COMPQUIET(fhp, NULL);
+
+ if (P_FREESPACE(ipage) < RINTERNAL_PSIZE) {
+ *nomemp = 1;
+ return (0);
+ }
+
+ ri.pgno = PGNO(page);
+ ri.nrecs = __bam_total(page);
+ ipage->inp[indx] = HOFFSET(ipage) -= RINTERNAL_SIZE;
+ memcpy(P_ENTRY(ipage, indx), &ri, RINTERNAL_SIZE);
+
+ return (0);
+}
+
+/*
+ * __db_up_ovref --
+ * Increment/decrement the reference count on an overflow page.
+ */
+static int
+__db_up_ovref(dbp, fhp, pgno)
+ DB *dbp;
+ DB_FH *fhp;
+ db_pgno_t pgno;
+{
+ PAGE *page;
+ size_t n;
+ int ret;
+
+ /* Allocate room to hold a page. */
+ if ((ret = __os_malloc(dbp->dbenv, dbp->pgsize, NULL, &page)) != 0)
+ return (ret);
+
+ GET_PAGE(dbp, fhp, pgno, page);
+ ++OV_REF(page);
+ PUT_PAGE(dbp, fhp, pgno, page);
+
+err: __os_free(page, dbp->pgsize);
+
+ return (ret);
+}
diff --git a/bdb/db/db_vrfy.c b/bdb/db/db_vrfy.c
new file mode 100644
index 00000000000..3509e05e91f
--- /dev/null
+++ b/bdb/db/db_vrfy.c
@@ -0,0 +1,2340 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 2000
+ * Sleepycat Software. All rights reserved.
+ *
+ * $Id: db_vrfy.c,v 1.53 2001/01/11 18:19:51 bostic Exp $
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_vrfy.c,v 1.53 2001/01/11 18:19:51 bostic Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_swap.h"
+#include "db_verify.h"
+#include "db_ext.h"
+#include "btree.h"
+#include "hash.h"
+#include "qam.h"
+
+static int __db_guesspgsize __P((DB_ENV *, DB_FH *));
+static int __db_is_valid_magicno __P((u_int32_t, DBTYPE *));
+static int __db_is_valid_pagetype __P((u_int32_t));
+static int __db_meta2pgset
+ __P((DB *, VRFY_DBINFO *, db_pgno_t, u_int32_t, DB *));
+static int __db_salvage_subdbs
+ __P((DB *, VRFY_DBINFO *, void *,
+ int(*)(void *, const void *), u_int32_t, int *));
+static int __db_salvage_unknowns
+ __P((DB *, VRFY_DBINFO *, void *,
+ int (*)(void *, const void *), u_int32_t));
+static int __db_vrfy_common
+ __P((DB *, VRFY_DBINFO *, PAGE *, db_pgno_t, u_int32_t));
+static int __db_vrfy_freelist __P((DB *, VRFY_DBINFO *, db_pgno_t, u_int32_t));
+static int __db_vrfy_invalid
+ __P((DB *, VRFY_DBINFO *, PAGE *, db_pgno_t, u_int32_t));
+static int __db_vrfy_orderchkonly __P((DB *,
+ VRFY_DBINFO *, const char *, const char *, u_int32_t));
+static int __db_vrfy_pagezero __P((DB *, VRFY_DBINFO *, DB_FH *, u_int32_t));
+static int __db_vrfy_subdbs
+ __P((DB *, VRFY_DBINFO *, const char *, u_int32_t));
+static int __db_vrfy_structure
+ __P((DB *, VRFY_DBINFO *, const char *, db_pgno_t, u_int32_t));
+static int __db_vrfy_walkpages
+ __P((DB *, VRFY_DBINFO *, void *, int (*)(void *, const void *),
+ u_int32_t));
+
+/*
+ * This is the code for DB->verify, the DB database consistency checker.
+ * For now, it checks all subdatabases in a database, and verifies
+ * everything it knows how to (i.e. it's all-or-nothing, and one can't
+ * check only for a subset of possible problems).
+ */
+
+/*
+ * __db_verify --
+ * Walk the entire file page-by-page, either verifying with or without
+ * dumping in db_dump -d format, or DB_SALVAGE-ing whatever key/data
+ * pairs can be found and dumping them in standard (db_load-ready)
+ * dump format.
+ *
+ * (Salvaging isn't really a verification operation, but we put it
+ * here anyway because it requires essentially identical top-level
+ * code.)
+ *
+ * flags may be 0, DB_NOORDERCHK, DB_ORDERCHKONLY, or DB_SALVAGE
+ * (and optionally DB_AGGRESSIVE).
+ *
+ * __db_verify itself is simply a wrapper to __db_verify_internal,
+ * which lets us pass appropriate equivalents to FILE * in from the
+ * non-C APIs.
+ *
+ * PUBLIC: int __db_verify
+ * PUBLIC: __P((DB *, const char *, const char *, FILE *, u_int32_t));
+ */
+int
+__db_verify(dbp, file, database, outfile, flags)
+ DB *dbp;
+ const char *file, *database;
+ FILE *outfile;
+ u_int32_t flags;
+{
+
+ return (__db_verify_internal(dbp,
+ file, database, outfile, __db_verify_callback, flags));
+}
+
+/*
+ * __db_verify_callback --
+ * Callback function for using pr_* functions from C.
+ *
+ * PUBLIC: int __db_verify_callback __P((void *, const void *));
+ */
+int
+__db_verify_callback(handle, str_arg)
+ void *handle;
+ const void *str_arg;
+{
+ char *str;
+ FILE *f;
+
+ str = (char *)str_arg;
+ f = (FILE *)handle;
+
+ if (fprintf(f, "%s", str) != (int)strlen(str))
+ return (EIO);
+
+ return (0);
+}
+
+/*
+ * __db_verify_internal --
+ * Inner meat of __db_verify.
+ *
+ * PUBLIC: int __db_verify_internal __P((DB *, const char *,
+ * PUBLIC: const char *, void *, int (*)(void *, const void *), u_int32_t));
+ */
+int
+__db_verify_internal(dbp_orig, name, subdb, handle, callback, flags)
+ DB *dbp_orig;
+ const char *name, *subdb;
+ void *handle;
+ int (*callback) __P((void *, const void *));
+ u_int32_t flags;
+{
+ DB *dbp;
+ DB_ENV *dbenv;
+ DB_FH fh, *fhp;
+ PAGE *h;
+ VRFY_DBINFO *vdp;
+ db_pgno_t last;
+ int has, ret, isbad;
+ char *real_name;
+
+ dbenv = dbp_orig->dbenv;
+ vdp = NULL;
+ real_name = NULL;
+ ret = isbad = 0;
+
+ memset(&fh, 0, sizeof(fh));
+ fhp = &fh;
+
+ PANIC_CHECK(dbenv);
+ DB_ILLEGAL_AFTER_OPEN(dbp_orig, "verify");
+
+#define OKFLAGS (DB_AGGRESSIVE | DB_NOORDERCHK | DB_ORDERCHKONLY | DB_SALVAGE)
+ if ((ret = __db_fchk(dbenv, "DB->verify", flags, OKFLAGS)) != 0)
+ return (ret);
+
+ /*
+ * DB_SALVAGE is mutually exclusive with the other flags except
+ * DB_AGGRESSIVE.
+ */
+ if (LF_ISSET(DB_SALVAGE) &&
+ (flags & ~DB_AGGRESSIVE) != DB_SALVAGE)
+ return (__db_ferr(dbenv, "__db_verify", 1));
+
+ if (LF_ISSET(DB_ORDERCHKONLY) && flags != DB_ORDERCHKONLY)
+ return (__db_ferr(dbenv, "__db_verify", 1));
+
+ if (LF_ISSET(DB_ORDERCHKONLY) && subdb == NULL) {
+ __db_err(dbenv, "DB_ORDERCHKONLY requires a database name");
+ return (EINVAL);
+ }
+
+ /*
+ * Forbid working in an environment that uses transactions or
+ * locking; we're going to be looking at the file freely,
+ * and while we're not going to modify it, we aren't obeying
+ * locking conventions either.
+ */
+ if (TXN_ON(dbenv) || LOCKING_ON(dbenv) || LOGGING_ON(dbenv)) {
+ dbp_orig->errx(dbp_orig,
+ "verify may not be used with transactions, logging, or locking");
+ return (EINVAL);
+ /* NOTREACHED */
+ }
+
+ /* Create a dbp to use internally, which we can close at our leisure. */
+ if ((ret = db_create(&dbp, dbenv, 0)) != 0)
+ goto err;
+
+ F_SET(dbp, DB_AM_VERIFYING);
+
+ /* Copy the supplied pagesize, which we use if the file one is bogus. */
+ if (dbp_orig->pgsize >= DB_MIN_PGSIZE &&
+ dbp_orig->pgsize <= DB_MAX_PGSIZE)
+ dbp->set_pagesize(dbp, dbp_orig->pgsize);
+
+ /* Copy the feedback function, if present, and initialize it. */
+ if (!LF_ISSET(DB_SALVAGE) && dbp_orig->db_feedback != NULL) {
+ dbp->set_feedback(dbp, dbp_orig->db_feedback);
+ dbp->db_feedback(dbp, DB_VERIFY, 0);
+ }
+
+ /*
+ * Copy the comparison and hashing functions. Note that
+ * even if the database is not a hash or btree, the respective
+ * internal structures will have been initialized.
+ */
+ if (dbp_orig->dup_compare != NULL &&
+ (ret = dbp->set_dup_compare(dbp, dbp_orig->dup_compare)) != 0)
+ goto err;
+ if (((BTREE *)dbp_orig->bt_internal)->bt_compare != NULL &&
+ (ret = dbp->set_bt_compare(dbp,
+ ((BTREE *)dbp_orig->bt_internal)->bt_compare)) != 0)
+ goto err;
+ if (((HASH *)dbp_orig->h_internal)->h_hash != NULL &&
+ (ret = dbp->set_h_hash(dbp,
+ ((HASH *)dbp_orig->h_internal)->h_hash)) != 0)
+ goto err;
+
+ /*
+ * We don't know how large the cache is, and if the database
+ * in question uses a small page size--which we don't know
+ * yet!--it may be uncomfortably small for the default page
+ * size [#2143]. However, the things we need temporary
+ * databases for in dbinfo are largely tiny, so using a
+ * 1024-byte pagesize is probably not going to be a big hit,
+ * and will make us fit better into small spaces.
+ */
+ if ((ret = __db_vrfy_dbinfo_create(dbenv, 1024, &vdp)) != 0)
+ goto err;
+
+ /* Find the real name of the file. */
+ if ((ret = __db_appname(dbenv,
+ DB_APP_DATA, NULL, name, 0, NULL, &real_name)) != 0)
+ goto err;
+
+ /*
+ * Our first order of business is to verify page 0, which is
+ * the metadata page for the master database of subdatabases
+ * or of the only database in the file. We want to do this by hand
+ * rather than just calling __db_open in case it's corrupt--various
+ * things in __db_open might act funny.
+ *
+ * Once we know the metadata page is healthy, I believe that it's
+ * safe to open the database normally and then use the page swapping
+ * code, which makes life easier.
+ */
+ if ((ret = __os_open(dbenv, real_name, DB_OSO_RDONLY, 0444, fhp)) != 0)
+ goto err;
+
+ /* Verify the metadata page 0; set pagesize and type. */
+ if ((ret = __db_vrfy_pagezero(dbp, vdp, fhp, flags)) != 0) {
+ if (ret == DB_VERIFY_BAD)
+ isbad = 1;
+ else
+ goto err;
+ }
+
+ /*
+ * We can assume at this point that dbp->pagesize and dbp->type are
+ * set correctly, or at least as well as they can be, and that
+ * locking, logging, and txns are not in use. Thus we can trust
+ * the memp code not to look at the page, and thus to be safe
+ * enough to use.
+ *
+ * The dbp is not open, but the file is open in the fhp, and we
+ * cannot assume that __db_open is safe. Call __db_dbenv_setup,
+ * the [safe] part of __db_open that initializes the environment--
+ * and the mpool--manually.
+ */
+ if ((ret = __db_dbenv_setup(dbp,
+ name, DB_ODDFILESIZE | DB_RDONLY)) != 0)
+ return (ret);
+
+ /* Mark the dbp as opened, so that we correctly handle its close. */
+ F_SET(dbp, DB_OPEN_CALLED);
+
+ /*
+ * Find out the page number of the last page in the database.
+ *
+ * XXX: This currently fails if the last page is of bad type,
+ * because it calls __db_pgin and that pukes. This is bad.
+ */
+ if ((ret = memp_fget(dbp->mpf, &last, DB_MPOOL_LAST, &h)) != 0)
+ goto err;
+ if ((ret = memp_fput(dbp->mpf, h, 0)) != 0)
+ goto err;
+
+ vdp->last_pgno = last;
+
+ /*
+ * DB_ORDERCHKONLY is a special case; our file consists of
+ * several subdatabases, which use different hash, bt_compare,
+ * and/or dup_compare functions. Consequently, we couldn't verify
+ * sorting and hashing simply by calling DB->verify() on the file.
+ * DB_ORDERCHKONLY allows us to come back and check those things; it
+ * requires a subdatabase, and assumes that everything but that
+ * database's sorting/hashing is correct.
+ */
+ if (LF_ISSET(DB_ORDERCHKONLY)) {
+ ret = __db_vrfy_orderchkonly(dbp, vdp, name, subdb, flags);
+ goto done;
+ }
+
+ /*
+ * When salvaging, we use a db to keep track of whether we've seen a
+ * given overflow or dup page in the course of traversing normal data.
+ * If in the end we have not, we assume its key got lost and print it
+ * with key "UNKNOWN".
+ */
+ if (LF_ISSET(DB_SALVAGE)) {
+ if ((ret = __db_salvage_init(vdp)) != 0)
+ return (ret);
+
+ /*
+ * If we're not being aggressive, attempt to crack subdbs.
+ * "has" will indicate whether the attempt has succeeded
+ * (even in part), meaning that we have some semblance of
+ * subdbs; on the walkpages pass, we print out
+ * whichever data pages we have not seen.
+ */
+ has = 0;
+ if (!LF_ISSET(DB_AGGRESSIVE) && (__db_salvage_subdbs(dbp,
+ vdp, handle, callback, flags, &has)) != 0)
+ isbad = 1;
+
+ /*
+ * If we have subdatabases, we need to signal that if
+ * any keys are found that don't belong to a subdatabase,
+ * they'll need to have an "__OTHER__" subdatabase header
+ * printed first. Flag this. Else, print a header for
+ * the normal, non-subdb database.
+ */
+ if (has == 1)
+ F_SET(vdp, SALVAGE_PRINTHEADER);
+ else if ((ret = __db_prheader(dbp,
+ NULL, 0, 0, handle, callback, vdp, PGNO_BASE_MD)) != 0)
+ goto err;
+ }
+
+ if ((ret =
+ __db_vrfy_walkpages(dbp, vdp, handle, callback, flags)) != 0) {
+ if (ret == DB_VERIFY_BAD)
+ isbad = 1;
+ else if (ret != 0)
+ goto err;
+ }
+
+ /* If we're verifying, verify inter-page structure. */
+ if (!LF_ISSET(DB_SALVAGE) && isbad == 0)
+ if ((ret =
+ __db_vrfy_structure(dbp, vdp, name, 0, flags)) != 0) {
+ if (ret == DB_VERIFY_BAD)
+ isbad = 1;
+ else if (ret != 0)
+ goto err;
+ }
+
+ /*
+ * If we're salvaging, output with key UNKNOWN any overflow or dup pages
+ * we haven't been able to put in context. Then destroy the salvager's
+ * state-saving database.
+ */
+ if (LF_ISSET(DB_SALVAGE)) {
+ if ((ret = __db_salvage_unknowns(dbp,
+ vdp, handle, callback, flags)) != 0)
+ isbad = 1;
+ /* No return value, since there's little we can do. */
+ __db_salvage_destroy(vdp);
+ }
+
+ if (0) {
+err: (void)__db_err(dbenv, "%s: %s", name, db_strerror(ret));
+ }
+
+ if (LF_ISSET(DB_SALVAGE) &&
+ (has == 0 || F_ISSET(vdp, SALVAGE_PRINTFOOTER)))
+ (void)__db_prfooter(handle, callback);
+
+ /* Send feedback that we're done. */
+done: if (!LF_ISSET(DB_SALVAGE) && dbp->db_feedback != NULL)
+ dbp->db_feedback(dbp, DB_VERIFY, 100);
+
+ if (F_ISSET(fhp, DB_FH_VALID))
+ (void)__os_closehandle(fhp);
+ if (dbp)
+ (void)dbp->close(dbp, 0);
+ if (vdp)
+ (void)__db_vrfy_dbinfo_destroy(vdp);
+ if (real_name)
+ __os_freestr(real_name);
+
+ if ((ret == 0 && isbad == 1) || ret == DB_VERIFY_FATAL)
+ ret = DB_VERIFY_BAD;
+
+ return (ret);
+}
+
+/*
+ * __db_vrfy_pagezero --
+ * Verify the master metadata page. Use seek, read, and a local buffer
+ * rather than the DB paging code, for safety.
+ *
+ * Must correctly (or best-guess) set dbp->type and dbp->pagesize.
+ */
+static int
+__db_vrfy_pagezero(dbp, vdp, fhp, flags)
+ DB *dbp;
+ VRFY_DBINFO *vdp;
+ DB_FH *fhp;
+ u_int32_t flags;
+{
+ DBMETA *meta;
+ DB_ENV *dbenv;
+ VRFY_PAGEINFO *pip;
+ db_pgno_t freelist;
+ int t_ret, ret, nr, swapped;
+ u_int8_t mbuf[DBMETASIZE];
+
+ swapped = ret = t_ret = 0;
+ freelist = 0;
+ dbenv = dbp->dbenv;
+ meta = (DBMETA *)mbuf;
+ dbp->type = DB_UNKNOWN;
+
+ /*
+ * Seek to the metadata page.
+ * Note that if we're just starting a verification, dbp->pgsize
+ * may be zero; this is okay, as we want page zero anyway and
+ * 0*0 == 0.
+ */
+ if ((ret = __os_seek(dbenv, fhp, 0, 0, 0, 0, DB_OS_SEEK_SET)) != 0)
+ goto err;
+
+ if ((ret = __os_read(dbenv, fhp, mbuf, DBMETASIZE, (size_t *)&nr)) != 0)
+ goto err;
+
+ if (nr != DBMETASIZE) {
+ EPRINT((dbp->dbenv,
+ "Incomplete metadata page %lu", (u_long)PGNO_BASE_MD));
+ t_ret = DB_VERIFY_FATAL;
+ goto err;
+ }
+
+ /*
+ * Check all of the fields that we can.
+ */
+
+ /* 08-11: Current page number. Must == pgno. */
+ /* Note that endianness doesn't matter--it's zero. */
+ if (meta->pgno != PGNO_BASE_MD) {
+ EPRINT((dbp->dbenv, "Bad pgno: was %lu, should be %lu",
+ (u_long)meta->pgno, (u_long)PGNO_BASE_MD));
+ ret = DB_VERIFY_BAD;
+ }
+
+ /* 12-15: Magic number. Must be one of valid set. */
+ if (__db_is_valid_magicno(meta->magic, &dbp->type))
+ swapped = 0;
+ else {
+ M_32_SWAP(meta->magic);
+ if (__db_is_valid_magicno(meta->magic,
+ &dbp->type))
+ swapped = 1;
+ else {
+ EPRINT((dbp->dbenv,
+ "Bad magic number: %lu", (u_long)meta->magic));
+ ret = DB_VERIFY_BAD;
+ }
+ }
+
+ /*
+ * 16-19: Version. Must be current; for now, we
+ * don't support verification of old versions.
+ */
+ if (swapped)
+ M_32_SWAP(meta->version);
+ if ((dbp->type == DB_BTREE && meta->version != DB_BTREEVERSION) ||
+ (dbp->type == DB_HASH && meta->version != DB_HASHVERSION) ||
+ (dbp->type == DB_QUEUE && meta->version != DB_QAMVERSION)) {
+ ret = DB_VERIFY_BAD;
+ EPRINT((dbp->dbenv, "%s%s", "Old or incorrect DB ",
+ "version; extraneous errors may result"));
+ }
+
+ /*
+ * 20-23: Pagesize. Must be power of two,
+ * greater than 512, and less than 64K.
+ */
+ if (swapped)
+ M_32_SWAP(meta->pagesize);
+ if (IS_VALID_PAGESIZE(meta->pagesize))
+ dbp->pgsize = meta->pagesize;
+ else {
+ EPRINT((dbp->dbenv,
+ "Bad page size: %lu", (u_long)meta->pagesize));
+ ret = DB_VERIFY_BAD;
+
+ /*
+ * Now try to settle on a pagesize to use.
+ * If the user-supplied one is reasonable,
+ * use it; else, guess.
+ */
+ if (!IS_VALID_PAGESIZE(dbp->pgsize))
+ dbp->pgsize = __db_guesspgsize(dbenv, fhp);
+ }
+
+ /*
+ * 25: Page type. Must be correct for dbp->type,
+ * which is by now set as well as it can be.
+ */
+ /* Needs no swapping--only one byte! */
+ if ((dbp->type == DB_BTREE && meta->type != P_BTREEMETA) ||
+ (dbp->type == DB_HASH && meta->type != P_HASHMETA) ||
+ (dbp->type == DB_QUEUE && meta->type != P_QAMMETA)) {
+ ret = DB_VERIFY_BAD;
+ EPRINT((dbp->dbenv, "Bad page type: %lu", (u_long)meta->type));
+ }
+
+ /*
+ * 28-31: Free list page number.
+ * We'll verify its sensibility when we do inter-page
+ * verification later; for now, just store it.
+ */
+ if (swapped)
+ M_32_SWAP(meta->free);
+ freelist = meta->free;
+
+ /*
+ * Initialize vdp->pages to fit a single pageinfo structure for
+ * this one page. We'll realloc later when we know how many
+ * pages there are.
+ */
+ if ((ret = __db_vrfy_getpageinfo(vdp, PGNO_BASE_MD, &pip)) != 0)
+ return (ret);
+ pip->pgno = PGNO_BASE_MD;
+ pip->type = meta->type;
+
+ /*
+ * Signal that we still have to check the info specific to
+ * a given type of meta page.
+ */
+ F_SET(pip, VRFY_INCOMPLETE);
+
+ pip->free = freelist;
+
+ if ((ret = __db_vrfy_putpageinfo(vdp, pip)) != 0)
+ return (ret);
+
+ /* Set up the dbp's fileid. We don't use the regular open path. */
+ memcpy(dbp->fileid, meta->uid, DB_FILE_ID_LEN);
+
+ if (0) {
+err: __db_err(dbenv, "%s", db_strerror(ret));
+ }
+
+ if (swapped == 1)
+ F_SET(dbp, DB_AM_SWAP);
+ if (t_ret != 0)
+ ret = t_ret;
+ return (ret);
+}
+
+/*
+ * __db_vrfy_walkpages --
+ * Main loop of the verifier/salvager. Walks through,
+ * page by page, and verifies all pages and/or prints all data pages.
+ */
+static int
+__db_vrfy_walkpages(dbp, vdp, handle, callback, flags)
+ DB *dbp;
+ VRFY_DBINFO *vdp;
+ void *handle;
+ int (*callback) __P((void *, const void *));
+ u_int32_t flags;
+{
+ DB_ENV *dbenv;
+ PAGE *h;
+ db_pgno_t i;
+ int ret, t_ret, isbad;
+
+ ret = isbad = t_ret = 0;
+ dbenv = dbp->dbenv;
+
+ if ((ret = __db_fchk(dbenv,
+ "__db_vrfy_walkpages", flags, OKFLAGS)) != 0)
+ return (ret);
+
+ for (i = 0; i <= vdp->last_pgno; i++) {
+ /*
+ * If DB_SALVAGE is set, we inspect our database of
+ * completed pages, and skip any we've already printed in
+ * the subdb pass.
+ */
+ if (LF_ISSET(DB_SALVAGE) && (__db_salvage_isdone(vdp, i) != 0))
+ continue;
+
+ /* If an individual page get fails, keep going. */
+ if ((t_ret = memp_fget(dbp->mpf, &i, 0, &h)) != 0) {
+ if (ret == 0)
+ ret = t_ret;
+ continue;
+ }
+
+ if (LF_ISSET(DB_SALVAGE)) {
+ /*
+ * We pretty much don't want to quit unless a
+ * bomb hits. May as well return that something
+ * was screwy, however.
+ */
+ if ((t_ret = __db_salvage(dbp,
+ vdp, i, h, handle, callback, flags)) != 0) {
+ if (ret == 0)
+ ret = t_ret;
+ isbad = 1;
+ }
+ } else {
+ /*
+ * Verify info common to all page
+ * types.
+ */
+ if (i != PGNO_BASE_MD)
+ if ((t_ret = __db_vrfy_common(dbp,
+ vdp, h, i, flags)) == DB_VERIFY_BAD)
+ isbad = 1;
+
+ switch (TYPE(h)) {
+ case P_INVALID:
+ t_ret = __db_vrfy_invalid(dbp,
+ vdp, h, i, flags);
+ break;
+ case __P_DUPLICATE:
+ isbad = 1;
+ EPRINT((dbp->dbenv,
+ "Old-style duplicate page: %lu",
+ (u_long)i));
+ break;
+ case P_HASH:
+ t_ret = __ham_vrfy(dbp,
+ vdp, h, i, flags);
+ break;
+ case P_IBTREE:
+ case P_IRECNO:
+ case P_LBTREE:
+ case P_LDUP:
+ t_ret = __bam_vrfy(dbp,
+ vdp, h, i, flags);
+ break;
+ case P_LRECNO:
+ t_ret = __ram_vrfy_leaf(dbp,
+ vdp, h, i, flags);
+ break;
+ case P_OVERFLOW:
+ t_ret = __db_vrfy_overflow(dbp,
+ vdp, h, i, flags);
+ break;
+ case P_HASHMETA:
+ t_ret = __ham_vrfy_meta(dbp,
+ vdp, (HMETA *)h, i, flags);
+ break;
+ case P_BTREEMETA:
+ t_ret = __bam_vrfy_meta(dbp,
+ vdp, (BTMETA *)h, i, flags);
+ break;
+ case P_QAMMETA:
+ t_ret = __qam_vrfy_meta(dbp,
+ vdp, (QMETA *)h, i, flags);
+ break;
+ case P_QAMDATA:
+ t_ret = __qam_vrfy_data(dbp,
+ vdp, (QPAGE *)h, i, flags);
+ break;
+ default:
+ EPRINT((dbp->dbenv,
+ "Unknown page type: %lu", (u_long)TYPE(h)));
+ isbad = 1;
+ break;
+ }
+
+ /*
+ * Set up error return.
+ */
+ if (t_ret == DB_VERIFY_BAD)
+ isbad = 1;
+ else if (t_ret == DB_VERIFY_FATAL)
+ goto err;
+ else
+ ret = t_ret;
+
+ /*
+ * Provide feedback to the application about our
+ * progress. The range 0-50% comes from the fact
+ * that this is the first of two passes through the
+ * database (front-to-back, then top-to-bottom).
+ */
+ if (dbp->db_feedback != NULL)
+ dbp->db_feedback(dbp, DB_VERIFY,
+ (i + 1) * 50 / (vdp->last_pgno + 1));
+ }
+
+ if ((t_ret = memp_fput(dbp->mpf, h, 0)) != 0 && ret == 0)
+ ret = t_ret;
+ }
+
+ if (0) {
+err: if ((t_ret = memp_fput(dbp->mpf, h, 0)) != 0)
+ return (ret == 0 ? t_ret : ret);
+ return (DB_VERIFY_BAD);
+ }
+
+ return ((isbad == 1 && ret == 0) ? DB_VERIFY_BAD : ret);
+}
+
+/*
+ * __db_vrfy_structure--
+ * After a beginning-to-end walk through the database has been
+ * completed, put together the information that has been collected
+ * to verify the overall database structure.
+ *
+ * Should only be called if we want to do a database verification,
+ * i.e. if DB_SALVAGE is not set.
+ */
+static int
+__db_vrfy_structure(dbp, vdp, dbname, meta_pgno, flags)
+ DB *dbp;
+ VRFY_DBINFO *vdp;
+ const char *dbname;
+ db_pgno_t meta_pgno;
+ u_int32_t flags;
+{
+ DB *pgset;
+ DB_ENV *dbenv;
+ VRFY_PAGEINFO *pip;
+ db_pgno_t i;
+ int ret, isbad, hassubs, p;
+
+ isbad = 0;
+ pip = NULL;
+ dbenv = dbp->dbenv;
+ pgset = vdp->pgset;
+
+ if ((ret = __db_fchk(dbenv, "DB->verify", flags, OKFLAGS)) != 0)
+ return (ret);
+ if (LF_ISSET(DB_SALVAGE)) {
+ __db_err(dbenv, "__db_vrfy_structure called with DB_SALVAGE");
+ return (EINVAL);
+ }
+
+ /*
+ * Providing feedback here is tricky; in most situations,
+ * we fetch each page one more time, but we do so in a top-down
+ * order that depends on the access method. Worse, we do this
+ * recursively in btree, such that on any call where we're traversing
+ * a subtree we don't know where that subtree is in the whole database;
+ * worse still, any given database may be one of several subdbs.
+ *
+ * The solution is to decrement a counter vdp->pgs_remaining each time
+ * we verify (and call feedback on) a page. We may over- or
+ * under-count, but the structure feedback function will ensure that we
+ * never give a percentage under 50 or over 100. (The first pass
+ * covered the range 0-50%.)
+ */
+ if (dbp->db_feedback != NULL)
+ vdp->pgs_remaining = vdp->last_pgno + 1;
+
+ /*
+ * Call the appropriate function to downwards-traverse the db type.
+ */
+ switch(dbp->type) {
+ case DB_BTREE:
+ case DB_RECNO:
+ if ((ret = __bam_vrfy_structure(dbp, vdp, 0, flags)) != 0) {
+ if (ret == DB_VERIFY_BAD)
+ isbad = 1;
+ else
+ goto err;
+ }
+
+ /*
+ * If we have subdatabases and we know that the database is,
+ * thus far, sound, it's safe to walk the tree of subdatabases.
+ * Do so, and verify the structure of the databases within.
+ */
+ if ((ret = __db_vrfy_getpageinfo(vdp, 0, &pip)) != 0)
+ goto err;
+ hassubs = F_ISSET(pip, VRFY_HAS_SUBDBS);
+ if ((ret = __db_vrfy_putpageinfo(vdp, pip)) != 0)
+ goto err;
+
+ if (isbad == 0 && hassubs)
+ if ((ret =
+ __db_vrfy_subdbs(dbp, vdp, dbname, flags)) != 0) {
+ if (ret == DB_VERIFY_BAD)
+ isbad = 1;
+ else
+ goto err;
+ }
+ break;
+ case DB_HASH:
+ if ((ret = __ham_vrfy_structure(dbp, vdp, 0, flags)) != 0) {
+ if (ret == DB_VERIFY_BAD)
+ isbad = 1;
+ else
+ goto err;
+ }
+ break;
+ case DB_QUEUE:
+ if ((ret = __qam_vrfy_structure(dbp, vdp, flags)) != 0) {
+ if (ret == DB_VERIFY_BAD)
+ isbad = 1;
+ }
+
+ /*
+ * Queue pages may be unreferenced and totally zeroed, if
+ * they're empty; queue doesn't have much structure, so
+ * this is unlikely to be wrong in any troublesome sense.
+ * Skip to "err".
+ */
+ goto err;
+ /* NOTREACHED */
+ default:
+ /* This should only happen if the verifier is somehow broken. */
+ DB_ASSERT(0);
+ ret = EINVAL;
+ goto err;
+ /* NOTREACHED */
+ }
+
+ /* Walk free list. */
+ if ((ret =
+ __db_vrfy_freelist(dbp, vdp, meta_pgno, flags)) == DB_VERIFY_BAD)
+ isbad = 1;
+
+ /*
+ * If structure checks up until now have failed, it's likely that
+ * checking what pages have been missed will result in oodles of
+ * extraneous error messages being EPRINTed. Skip to the end
+ * if this is the case; we're going to be printing at least one
+ * error anyway, and probably all the more salient ones.
+ */
+ if (ret != 0 || isbad == 1)
+ goto err;
+
+ /*
+ * Make sure no page has been missed and that no page is still marked
+ * "all zeroes" (only certain hash pages can be, and they're unmarked
+ * in __ham_vrfy_structure).
+ */
+ for (i = 0; i < vdp->last_pgno + 1; i++) {
+ if ((ret = __db_vrfy_getpageinfo(vdp, i, &pip)) != 0)
+ goto err;
+ if ((ret = __db_vrfy_pgset_get(pgset, i, &p)) != 0)
+ goto err;
+ if (p == 0) {
+ EPRINT((dbp->dbenv,
+ "Unreferenced page %lu", (u_long)i));
+ isbad = 1;
+ }
+
+ if (F_ISSET(pip, VRFY_IS_ALLZEROES)) {
+ EPRINT((dbp->dbenv,
+ "Totally zeroed page %lu", (u_long)i));
+ isbad = 1;
+ }
+ if ((ret = __db_vrfy_putpageinfo(vdp, pip)) != 0)
+ goto err;
+ pip = NULL;
+ }
+
+err: if (pip != NULL)
+ (void)__db_vrfy_putpageinfo(vdp, pip);
+
+ return ((isbad == 1 && ret == 0) ? DB_VERIFY_BAD : ret);
+}
+
+/*
+ * __db_is_valid_pagetype
+ */
+static int
+__db_is_valid_pagetype(type)
+ u_int32_t type;
+{
+ switch (type) {
+ case P_INVALID: /* Order matches ordinal value. */
+ case P_HASH:
+ case P_IBTREE:
+ case P_IRECNO:
+ case P_LBTREE:
+ case P_LRECNO:
+ case P_OVERFLOW:
+ case P_HASHMETA:
+ case P_BTREEMETA:
+ case P_QAMMETA:
+ case P_QAMDATA:
+ case P_LDUP:
+ return (1);
+ }
+ return (0);
+}
+
+/*
+ * __db_is_valid_magicno
+ */
+static int
+__db_is_valid_magicno(magic, typep)
+ u_int32_t magic;
+ DBTYPE *typep;
+{
+ switch (magic) {
+ case DB_BTREEMAGIC:
+ *typep = DB_BTREE;
+ return (1);
+ case DB_HASHMAGIC:
+ *typep = DB_HASH;
+ return (1);
+ case DB_QAMMAGIC:
+ *typep = DB_QUEUE;
+ return (1);
+ }
+ *typep = DB_UNKNOWN;
+ return (0);
+}
+
+/*
+ * __db_vrfy_common --
+ * Verify info common to all page types.
+ */
+static int
+__db_vrfy_common(dbp, vdp, h, pgno, flags)
+ DB *dbp;
+ VRFY_DBINFO *vdp;
+ PAGE *h;
+ db_pgno_t pgno;
+ u_int32_t flags;
+{
+ VRFY_PAGEINFO *pip;
+ int ret, t_ret;
+ u_int8_t *p;
+
+ if ((ret = __db_vrfy_getpageinfo(vdp, pgno, &pip)) != 0)
+ return (ret);
+
+ pip->pgno = pgno;
+ F_CLR(pip, VRFY_IS_ALLZEROES);
+
+ /*
+ * Hash expands the table by leaving some pages between the
+ * old last and the new last totally zeroed. Its pgin function
+ * should fix things, but we might not be using that (e.g. if
+ * we're a subdatabase).
+ *
+ * Queue will create sparse files if sparse record numbers are used.
+ */
+ if (pgno != 0 && PGNO(h) == 0) {
+ for (p = (u_int8_t *)h; p < (u_int8_t *)h + dbp->pgsize; p++)
+ if (*p != 0) {
+ EPRINT((dbp->dbenv,
+ "Page %lu should be zeroed and is not",
+ (u_long)pgno));
+ ret = DB_VERIFY_BAD;
+ goto err;
+ }
+ /*
+ * It's totally zeroed; mark it as a hash, and we'll
+ * check that that makes sense structurally later.
+ * (The queue verification doesn't care, since queues
+ * don't really have much in the way of structure.)
+ */
+ pip->type = P_HASH;
+ F_SET(pip, VRFY_IS_ALLZEROES);
+ ret = 0;
+ goto err; /* well, not really an err. */
+ }
+
+ if (PGNO(h) != pgno) {
+ EPRINT((dbp->dbenv,
+ "Bad page number: %lu should be %lu",
+ (u_long)h->pgno, (u_long)pgno));
+ ret = DB_VERIFY_BAD;
+ }
+
+ if (!__db_is_valid_pagetype(h->type)) {
+ EPRINT((dbp->dbenv, "Bad page type: %lu", (u_long)h->type));
+ ret = DB_VERIFY_BAD;
+ }
+ pip->type = h->type;
+
+err: if ((t_ret = __db_vrfy_putpageinfo(vdp, pip)) != 0 && ret == 0)
+ ret = t_ret;
+
+ return (ret);
+}
+
+/*
+ * __db_vrfy_invalid --
+ * Verify P_INVALID page.
+ * (Yes, there's not much to do here.)
+ */
+static int
+__db_vrfy_invalid(dbp, vdp, h, pgno, flags)
+ DB *dbp;
+ VRFY_DBINFO *vdp;
+ PAGE *h;
+ db_pgno_t pgno;
+ u_int32_t flags;
+{
+ VRFY_PAGEINFO *pip;
+ int ret, t_ret;
+
+ if ((ret = __db_vrfy_getpageinfo(vdp, pgno, &pip)) != 0)
+ return (ret);
+ pip->next_pgno = pip->prev_pgno = 0;
+
+ if (!IS_VALID_PGNO(NEXT_PGNO(h))) {
+ EPRINT((dbp->dbenv,
+ "Invalid next_pgno %lu on page %lu",
+ (u_long)NEXT_PGNO(h), (u_long)pgno));
+ ret = DB_VERIFY_BAD;
+ } else
+ pip->next_pgno = NEXT_PGNO(h);
+
+ if ((t_ret = __db_vrfy_putpageinfo(vdp, pip)) != 0 && ret == 0)
+ ret = t_ret;
+ return (ret);
+}
+
+/*
+ * __db_vrfy_datapage --
+ * Verify elements common to data pages (P_HASH, P_LBTREE,
+ * P_IBTREE, P_IRECNO, P_LRECNO, P_OVERFLOW, P_DUPLICATE)--i.e.,
+ * those defined in the PAGE structure.
+ *
+ * Called from each of the per-page routines, after the
+ * all-page-type-common elements of pip have been verified and filled
+ * in.
+ *
+ * PUBLIC: int __db_vrfy_datapage
+ * PUBLIC: __P((DB *, VRFY_DBINFO *, PAGE *, db_pgno_t, u_int32_t));
+ */
+int
+__db_vrfy_datapage(dbp, vdp, h, pgno, flags)
+ DB *dbp;
+ VRFY_DBINFO *vdp;
+ PAGE *h;
+ db_pgno_t pgno;
+ u_int32_t flags;
+{
+ VRFY_PAGEINFO *pip;
+ int isbad, ret, t_ret;
+
+ if ((ret = __db_vrfy_getpageinfo(vdp, pgno, &pip)) != 0)
+ return (ret);
+ isbad = 0;
+
+ /*
+ * prev_pgno and next_pgno: store for inter-page checks,
+ * verify that they point to actual pages and not to self.
+ *
+ * !!!
+ * Internal btree pages do not maintain these fields (indeed,
+ * they overload them). Skip.
+ */
+ if (TYPE(h) != P_IBTREE && TYPE(h) != P_IRECNO) {
+ if (!IS_VALID_PGNO(PREV_PGNO(h)) || PREV_PGNO(h) == pip->pgno) {
+ isbad = 1;
+ EPRINT((dbp->dbenv, "Page %lu: Invalid prev_pgno %lu",
+ (u_long)pip->pgno, (u_long)PREV_PGNO(h)));
+ }
+ if (!IS_VALID_PGNO(NEXT_PGNO(h)) || NEXT_PGNO(h) == pip->pgno) {
+ isbad = 1;
+ EPRINT((dbp->dbenv, "Page %lu: Invalid next_pgno %lu",
+ (u_long)pip->pgno, (u_long)NEXT_PGNO(h)));
+ }
+ pip->prev_pgno = PREV_PGNO(h);
+ pip->next_pgno = NEXT_PGNO(h);
+ }
+
+ /*
+ * Verify the number of entries on the page.
+ * There is no good way to determine if this is accurate; the
+ * best we can do is verify that it's not more than can, in theory,
+ * fit on the page. Then, we make sure there are at least
+ * this many valid elements in inp[], and hope that this catches
+ * most cases.
+ */
+ if (TYPE(h) != P_OVERFLOW) {
+ if (BKEYDATA_PSIZE(0) * NUM_ENT(h) > dbp->pgsize) {
+ isbad = 1;
+ EPRINT((dbp->dbenv,
+ "Page %lu: Too many entries: %lu",
+ (u_long)pgno, (u_long)NUM_ENT(h)));
+ }
+ pip->entries = NUM_ENT(h);
+ }
+
+ /*
+ * btree level. Should be zero unless we're a btree;
+ * if we are a btree, should be between LEAFLEVEL and MAXBTREELEVEL,
+ * and we need to save it off.
+ */
+ switch (TYPE(h)) {
+ case P_IBTREE:
+ case P_IRECNO:
+ if (LEVEL(h) < LEAFLEVEL + 1 || LEVEL(h) > MAXBTREELEVEL) {
+ isbad = 1;
+ EPRINT((dbp->dbenv, "Bad btree level %lu on page %lu",
+ (u_long)LEVEL(h), (u_long)pgno));
+ }
+ pip->bt_level = LEVEL(h);
+ break;
+ case P_LBTREE:
+ case P_LDUP:
+ case P_LRECNO:
+ if (LEVEL(h) != LEAFLEVEL) {
+ isbad = 1;
+ EPRINT((dbp->dbenv,
+ "Btree leaf page %lu has incorrect level %lu",
+ (u_long)pgno, (u_long)LEVEL(h)));
+ }
+ break;
+ default:
+ if (LEVEL(h) != 0) {
+ isbad = 1;
+ EPRINT((dbp->dbenv,
+ "Nonzero level %lu in non-btree database page %lu",
+ (u_long)LEVEL(h), (u_long)pgno));
+ }
+ break;
+ }
+
+ /*
+ * Even though inp[] occurs in all PAGEs, we look at it in the
+ * access-method-specific code, since btree and hash treat
+ * item lengths very differently, and one of the most important
+ * things we want to verify is that the data--as specified
+ * by offset and length--cover the right part of the page
+ * without overlaps, gaps, or violations of the page boundary.
+ */
+ if ((t_ret = __db_vrfy_putpageinfo(vdp, pip)) != 0 && ret == 0)
+ ret = t_ret;
+
+ return ((ret == 0 && isbad == 1) ? DB_VERIFY_BAD : ret);
+}
+
+/*
+ * __db_vrfy_meta--
+ * Verify the access-method common parts of a meta page, using
+ * normal mpool routines.
+ *
+ * PUBLIC: int __db_vrfy_meta
+ * PUBLIC: __P((DB *, VRFY_DBINFO *, DBMETA *, db_pgno_t, u_int32_t));
+ */
+int
+__db_vrfy_meta(dbp, vdp, meta, pgno, flags)
+ DB *dbp;
+ VRFY_DBINFO *vdp;
+ DBMETA *meta;
+ db_pgno_t pgno;
+ u_int32_t flags;
+{
+ DBTYPE dbtype, magtype;
+ VRFY_PAGEINFO *pip;
+ int isbad, ret, t_ret;
+
+ isbad = 0;
+ if ((ret = __db_vrfy_getpageinfo(vdp, pgno, &pip)) != 0)
+ return (ret);
+
+ /* type plausible for a meta page */
+ switch (meta->type) {
+ case P_BTREEMETA:
+ dbtype = DB_BTREE;
+ break;
+ case P_HASHMETA:
+ dbtype = DB_HASH;
+ break;
+ case P_QAMMETA:
+ dbtype = DB_QUEUE;
+ break;
+ default:
+ /* The verifier should never let us get here. */
+ DB_ASSERT(0);
+ ret = EINVAL;
+ goto err;
+ }
+
+ /* magic number valid */
+ if (!__db_is_valid_magicno(meta->magic, &magtype)) {
+ isbad = 1;
+ EPRINT((dbp->dbenv,
+ "Magic number invalid on page %lu", (u_long)pgno));
+ }
+ if (magtype != dbtype) {
+ isbad = 1;
+ EPRINT((dbp->dbenv,
+ "Magic number does not match type of page %lu",
+ (u_long)pgno));
+ }
+
+ /* version */
+ if ((dbtype == DB_BTREE && meta->version != DB_BTREEVERSION) ||
+ (dbtype == DB_HASH && meta->version != DB_HASHVERSION) ||
+ (dbtype == DB_QUEUE && meta->version != DB_QAMVERSION)) {
+ isbad = 1;
+ EPRINT((dbp->dbenv, "%s%s", "Old of incorrect DB ",
+ "version; extraneous errors may result"));
+ }
+
+ /* pagesize */
+ if (meta->pagesize != dbp->pgsize) {
+ isbad = 1;
+ EPRINT((dbp->dbenv,
+ "Invalid pagesize %lu on page %lu",
+ (u_long)meta->pagesize, (u_long)pgno));
+ }
+
+ /* free list */
+ /*
+ * If this is not the main, master-database meta page, it
+ * should not have a free list.
+ */
+ if (pgno != PGNO_BASE_MD && meta->free != PGNO_INVALID) {
+ isbad = 1;
+ EPRINT((dbp->dbenv,
+ "Nonempty free list on subdatabase metadata page %lu",
+ pgno));
+ }
+
+ /* Can correctly be PGNO_INVALID--that's just the end of the list. */
+ if (meta->free != PGNO_INVALID && IS_VALID_PGNO(meta->free))
+ pip->free = meta->free;
+ else if (!IS_VALID_PGNO(meta->free)) {
+ isbad = 1;
+ EPRINT((dbp->dbenv,
+ "Nonsensical free list pgno %lu on page %lu",
+ (u_long)meta->free, (u_long)pgno));
+ }
+
+ /*
+ * We have now verified the common fields of the metadata page.
+ * Clear the flag that told us they had been incompletely checked.
+ */
+ F_CLR(pip, VRFY_INCOMPLETE);
+
+err: if ((t_ret = __db_vrfy_putpageinfo(vdp, pip)) != 0 && ret == 0)
+ ret = t_ret;
+
+ return ((ret == 0 && isbad == 1) ? DB_VERIFY_BAD : ret);
+}
+
+/*
+ * __db_vrfy_freelist --
+ * Walk free list, checking off pages and verifying absence of
+ * loops.
+ */
+static int
+__db_vrfy_freelist(dbp, vdp, meta, flags)
+ DB *dbp;
+ VRFY_DBINFO *vdp;
+ db_pgno_t meta;
+ u_int32_t flags;
+{
+ DB *pgset;
+ VRFY_PAGEINFO *pip;
+ db_pgno_t pgno;
+ int p, ret, t_ret;
+
+ pgset = vdp->pgset;
+ DB_ASSERT(pgset != NULL);
+
+ if ((ret = __db_vrfy_getpageinfo(vdp, meta, &pip)) != 0)
+ return (ret);
+ for (pgno = pip->free; pgno != PGNO_INVALID; pgno = pip->next_pgno) {
+ if ((ret = __db_vrfy_putpageinfo(vdp, pip)) != 0)
+ return (ret);
+
+ /* This shouldn't happen, but just in case. */
+ if (!IS_VALID_PGNO(pgno)) {
+ EPRINT((dbp->dbenv,
+ "Invalid next_pgno on free list page %lu",
+ (u_long)pgno));
+ return (DB_VERIFY_BAD);
+ }
+
+ /* Detect cycles. */
+ if ((ret = __db_vrfy_pgset_get(pgset, pgno, &p)) != 0)
+ return (ret);
+ if (p != 0) {
+ EPRINT((dbp->dbenv,
+ "Page %lu encountered a second time on free list",
+ (u_long)pgno));
+ return (DB_VERIFY_BAD);
+ }
+ if ((ret = __db_vrfy_pgset_inc(pgset, pgno)) != 0)
+ return (ret);
+
+ if ((ret = __db_vrfy_getpageinfo(vdp, pgno, &pip)) != 0)
+ return (ret);
+
+ if (pip->type != P_INVALID) {
+ EPRINT((dbp->dbenv,
+ "Non-invalid page %lu on free list", (u_long)pgno));
+ ret = DB_VERIFY_BAD; /* unsafe to continue */
+ break;
+ }
+ }
+
+ if ((t_ret = __db_vrfy_putpageinfo(vdp, pip)) != 0)
+ ret = t_ret;
+ return (ret);
+}
+
+/*
+ * __db_vrfy_subdbs --
+ * Walk the known-safe master database of subdbs with a cursor,
+ * verifying the structure of each subdatabase we encounter.
+ */
+static int
+__db_vrfy_subdbs(dbp, vdp, dbname, flags)
+ DB *dbp;
+ VRFY_DBINFO *vdp;
+ const char *dbname;
+ u_int32_t flags;
+{
+ DB *mdbp;
+ DBC *dbc;
+ DBT key, data;
+ VRFY_PAGEINFO *pip;
+ db_pgno_t meta_pgno;
+ int ret, t_ret, isbad;
+ u_int8_t type;
+
+ isbad = 0;
+ dbc = NULL;
+
+ if ((ret = __db_master_open(dbp, dbname, DB_RDONLY, 0, &mdbp)) != 0)
+ return (ret);
+
+ if ((ret =
+ __db_icursor(mdbp, NULL, DB_BTREE, PGNO_INVALID, 0, &dbc)) != 0)
+ goto err;
+
+ memset(&key, 0, sizeof(key));
+ memset(&data, 0, sizeof(data));
+ while ((ret = dbc->c_get(dbc, &key, &data, DB_NEXT)) == 0) {
+ if (data.size != sizeof(db_pgno_t)) {
+ EPRINT((dbp->dbenv, "Database entry of invalid size"));
+ isbad = 1;
+ goto err;
+ }
+ memcpy(&meta_pgno, data.data, data.size);
+ /*
+ * Subdatabase meta pgnos are stored in network byte
+ * order for cross-endian compatibility. Swap if appropriate.
+ */
+ DB_NTOHL(&meta_pgno);
+ if (meta_pgno == PGNO_INVALID || meta_pgno > vdp->last_pgno) {
+ EPRINT((dbp->dbenv,
+ "Database entry references invalid page %lu",
+ (u_long)meta_pgno));
+ isbad = 1;
+ goto err;
+ }
+ if ((ret = __db_vrfy_getpageinfo(vdp, meta_pgno, &pip)) != 0)
+ goto err;
+ type = pip->type;
+ if ((ret = __db_vrfy_putpageinfo(vdp, pip)) != 0)
+ goto err;
+ switch (type) {
+ case P_BTREEMETA:
+ if ((ret = __bam_vrfy_structure(
+ dbp, vdp, meta_pgno, flags)) != 0) {
+ if (ret == DB_VERIFY_BAD)
+ isbad = 1;
+ else
+ goto err;
+ }
+ break;
+ case P_HASHMETA:
+ if ((ret = __ham_vrfy_structure(
+ dbp, vdp, meta_pgno, flags)) != 0) {
+ if (ret == DB_VERIFY_BAD)
+ isbad = 1;
+ else
+ goto err;
+ }
+ break;
+ case P_QAMMETA:
+ default:
+ EPRINT((dbp->dbenv,
+ "Database entry references page %lu of invalid type %lu",
+ (u_long)meta_pgno, (u_long)type));
+ ret = DB_VERIFY_BAD;
+ goto err;
+ /* NOTREACHED */
+ }
+ }
+
+ if (ret == DB_NOTFOUND)
+ ret = 0;
+
+err: if (dbc != NULL && (t_ret = __db_c_close(dbc)) != 0 && ret == 0)
+ ret = t_ret;
+
+ if ((t_ret = mdbp->close(mdbp, 0)) != 0 && ret == 0)
+ ret = t_ret;
+
+ return ((ret == 0 && isbad == 1) ? DB_VERIFY_BAD : ret);
+}
+
+/*
+ * __db_vrfy_struct_feedback --
+ * Provide feedback during top-down database structure traversal.
+ * (See comment at the beginning of __db_vrfy_structure.)
+ *
+ * PUBLIC: int __db_vrfy_struct_feedback __P((DB *, VRFY_DBINFO *));
+ */
+int
+__db_vrfy_struct_feedback(dbp, vdp)
+ DB *dbp;
+ VRFY_DBINFO *vdp;
+{
+ int progress;
+
+ if (dbp->db_feedback == NULL)
+ return (0);
+
+ if (vdp->pgs_remaining > 0)
+ vdp->pgs_remaining--;
+
+ /* Don't allow a feedback call of 100 until we're really done. */
+ progress = 100 - (vdp->pgs_remaining * 50 / (vdp->last_pgno + 1));
+ dbp->db_feedback(dbp, DB_VERIFY, progress == 100 ? 99 : progress);
+
+ return (0);
+}
+
+/*
+ * __db_vrfy_orderchkonly --
+ * Do an sort-order/hashing check on a known-otherwise-good subdb.
+ */
+static int
+__db_vrfy_orderchkonly(dbp, vdp, name, subdb, flags)
+ DB *dbp;
+ VRFY_DBINFO *vdp;
+ const char *name, *subdb;
+ u_int32_t flags;
+{
+ BTMETA *btmeta;
+ DB *mdbp, *pgset;
+ DBC *pgsc;
+ DBT key, data;
+ HASH *h_internal;
+ HMETA *hmeta;
+ PAGE *h, *currpg;
+ db_pgno_t meta_pgno, p, pgno;
+ u_int32_t bucket;
+ int t_ret, ret;
+
+ currpg = h = NULL;
+ pgsc = NULL;
+ pgset = NULL;
+
+ LF_CLR(DB_NOORDERCHK);
+
+ /* Open the master database and get the meta_pgno for the subdb. */
+ if ((ret = db_create(&mdbp, NULL, 0)) != 0)
+ return (ret);
+ if ((ret = __db_master_open(dbp, name, DB_RDONLY, 0, &mdbp)) != 0)
+ goto err;
+
+ memset(&key, 0, sizeof(key));
+ key.data = (void *)subdb;
+ memset(&data, 0, sizeof(data));
+ if ((ret = dbp->get(dbp, NULL, &key, &data, 0)) != 0)
+ goto err;
+
+ if (data.size != sizeof(db_pgno_t)) {
+ EPRINT((dbp->dbenv, "Database entry of invalid size"));
+ ret = DB_VERIFY_BAD;
+ goto err;
+ }
+
+ memcpy(&meta_pgno, data.data, data.size);
+
+ if ((ret = memp_fget(dbp->mpf, &meta_pgno, 0, &h)) != 0)
+ goto err;
+
+ if ((ret = __db_vrfy_pgset(dbp->dbenv, dbp->pgsize, &pgset)) != 0)
+ goto err;
+
+ switch (TYPE(h)) {
+ case P_BTREEMETA:
+ btmeta = (BTMETA *)h;
+ if (F_ISSET(&btmeta->dbmeta, BTM_RECNO)) {
+ /* Recnos have no order to check. */
+ ret = 0;
+ goto err;
+ }
+ if ((ret =
+ __db_meta2pgset(dbp, vdp, meta_pgno, flags, pgset)) != 0)
+ goto err;
+ if ((ret = pgset->cursor(pgset, NULL, &pgsc, 0)) != 0)
+ goto err;
+ while ((ret = __db_vrfy_pgset_next(pgsc, &p)) == 0) {
+ if ((ret = memp_fget(dbp->mpf, &p, 0, &currpg)) != 0)
+ goto err;
+ if ((ret = __bam_vrfy_itemorder(dbp,
+ NULL, currpg, p, NUM_ENT(currpg), 1,
+ F_ISSET(&btmeta->dbmeta, BTM_DUP), flags)) != 0)
+ goto err;
+ if ((ret = memp_fput(dbp->mpf, currpg, 0)) != 0)
+ goto err;
+ currpg = NULL;
+ }
+ if ((ret = pgsc->c_close(pgsc)) != 0)
+ goto err;
+ break;
+ case P_HASHMETA:
+ hmeta = (HMETA *)h;
+ h_internal = (HASH *)dbp->h_internal;
+ /*
+ * Make sure h_charkey is right.
+ */
+ if (h_internal == NULL || h_internal->h_hash == NULL) {
+ EPRINT((dbp->dbenv,
+ "DB_ORDERCHKONLY requires that a hash function be set"));
+ ret = DB_VERIFY_BAD;
+ goto err;
+ }
+ if (hmeta->h_charkey !=
+ h_internal->h_hash(dbp, CHARKEY, sizeof(CHARKEY))) {
+ EPRINT((dbp->dbenv,
+ "Incorrect hash function for database"));
+ ret = DB_VERIFY_BAD;
+ goto err;
+ }
+
+ /*
+ * Foreach bucket, verify hashing on each page in the
+ * corresponding chain of pages.
+ */
+ for (bucket = 0; bucket <= hmeta->max_bucket; bucket++) {
+ pgno = BS_TO_PAGE(bucket, hmeta->spares);
+ while (pgno != PGNO_INVALID) {
+ if ((ret = memp_fget(dbp->mpf,
+ &pgno, 0, &currpg)) != 0)
+ goto err;
+ if ((ret = __ham_vrfy_hashing(dbp,
+ NUM_ENT(currpg),hmeta, bucket, pgno,
+ flags, h_internal->h_hash)) != 0)
+ goto err;
+ pgno = NEXT_PGNO(currpg);
+ if ((ret = memp_fput(dbp->mpf, currpg, 0)) != 0)
+ goto err;
+ currpg = NULL;
+ }
+ }
+ break;
+ default:
+ EPRINT((dbp->dbenv, "Database meta page %lu of bad type %lu",
+ (u_long)meta_pgno, (u_long)TYPE(h)));
+ ret = DB_VERIFY_BAD;
+ break;
+ }
+
+err: if (pgsc != NULL)
+ (void)pgsc->c_close(pgsc);
+ if (pgset != NULL)
+ (void)pgset->close(pgset, 0);
+ if (h != NULL && (t_ret = memp_fput(dbp->mpf, h, 0)) != 0)
+ ret = t_ret;
+ if (currpg != NULL && (t_ret = memp_fput(dbp->mpf, currpg, 0)) != 0)
+ ret = t_ret;
+ if ((t_ret = mdbp->close(mdbp, 0)) != 0)
+ ret = t_ret;
+ return (ret);
+}
+
+/*
+ * __db_salvage --
+ * Walk through a page, salvaging all likely or plausible (w/
+ * DB_AGGRESSIVE) key/data pairs.
+ *
+ * PUBLIC: int __db_salvage __P((DB *, VRFY_DBINFO *, db_pgno_t, PAGE *,
+ * PUBLIC: void *, int (*)(void *, const void *), u_int32_t));
+ */
+int
+__db_salvage(dbp, vdp, pgno, h, handle, callback, flags)
+ DB *dbp;
+ VRFY_DBINFO *vdp;
+ db_pgno_t pgno;
+ PAGE *h;
+ void *handle;
+ int (*callback) __P((void *, const void *));
+ u_int32_t flags;
+{
+ DB_ASSERT(LF_ISSET(DB_SALVAGE));
+
+ /* If we got this page in the subdb pass, we can safely skip it. */
+ if (__db_salvage_isdone(vdp, pgno))
+ return (0);
+
+ switch (TYPE(h)) {
+ case P_HASH:
+ return (__ham_salvage(dbp,
+ vdp, pgno, h, handle, callback, flags));
+ /* NOTREACHED */
+ case P_LBTREE:
+ return (__bam_salvage(dbp,
+ vdp, pgno, P_LBTREE, h, handle, callback, NULL, flags));
+ /* NOTREACHED */
+ case P_LDUP:
+ return (__db_salvage_markneeded(vdp, pgno, SALVAGE_LDUP));
+ /* NOTREACHED */
+ case P_OVERFLOW:
+ return (__db_salvage_markneeded(vdp, pgno, SALVAGE_OVERFLOW));
+ /* NOTREACHED */
+ case P_LRECNO:
+ /*
+ * Recnos are tricky -- they may represent dup pages, or
+ * they may be subdatabase/regular database pages in their
+ * own right. If the former, they need to be printed with a
+ * key, preferably when we hit the corresponding datum in
+ * a btree/hash page. If the latter, there is no key.
+ *
+ * If a database is sufficiently frotzed, we're not going
+ * to be able to get this right, so we best-guess: just
+ * mark it needed now, and if we're really a normal recno
+ * database page, the "unknowns" pass will pick us up.
+ */
+ return (__db_salvage_markneeded(vdp, pgno, SALVAGE_LRECNO));
+ /* NOTREACHED */
+ case P_IBTREE:
+ case P_INVALID:
+ case P_IRECNO:
+ case __P_DUPLICATE:
+ default:
+ /* XXX: Should we be more aggressive here? */
+ break;
+ }
+ return (0);
+}
+
+/*
+ * __db_salvage_unknowns --
+ * Walk through the salvager database, printing with key "UNKNOWN"
+ * any pages we haven't dealt with.
+ */
+static int
+__db_salvage_unknowns(dbp, vdp, handle, callback, flags)
+ DB *dbp;
+ VRFY_DBINFO *vdp;
+ void *handle;
+ int (*callback) __P((void *, const void *));
+ u_int32_t flags;
+{
+ DBT unkdbt, key, *dbt;
+ PAGE *h;
+ db_pgno_t pgno;
+ u_int32_t pgtype;
+ int ret, err_ret;
+ void *ovflbuf;
+
+ memset(&unkdbt, 0, sizeof(DBT));
+ unkdbt.size = strlen("UNKNOWN") + 1;
+ unkdbt.data = "UNKNOWN";
+
+ if ((ret = __os_malloc(dbp->dbenv, dbp->pgsize, 0, &ovflbuf)) != 0)
+ return (ret);
+
+ err_ret = 0;
+ while ((ret = __db_salvage_getnext(vdp, &pgno, &pgtype)) == 0) {
+ dbt = NULL;
+
+ if ((ret = memp_fget(dbp->mpf, &pgno, 0, &h)) != 0) {
+ err_ret = ret;
+ continue;
+ }
+
+ switch (pgtype) {
+ case SALVAGE_LDUP:
+ case SALVAGE_LRECNODUP:
+ dbt = &unkdbt;
+ /* FALLTHROUGH */
+ case SALVAGE_LBTREE:
+ case SALVAGE_LRECNO:
+ if ((ret = __bam_salvage(dbp, vdp, pgno, pgtype,
+ h, handle, callback, dbt, flags)) != 0)
+ err_ret = ret;
+ break;
+ case SALVAGE_OVERFLOW:
+ /*
+ * XXX:
+ * This may generate multiple "UNKNOWN" keys in
+ * a database with no dups. What to do?
+ */
+ if ((ret = __db_safe_goff(dbp,
+ vdp, pgno, &key, &ovflbuf, flags)) != 0) {
+ err_ret = ret;
+ continue;
+ }
+ if ((ret = __db_prdbt(&key,
+ 0, " ", handle, callback, 0, NULL)) != 0) {
+ err_ret = ret;
+ continue;
+ }
+ if ((ret = __db_prdbt(&unkdbt,
+ 0, " ", handle, callback, 0, NULL)) != 0)
+ err_ret = ret;
+ break;
+ case SALVAGE_HASH:
+ if ((ret = __ham_salvage(
+ dbp, vdp, pgno, h, handle, callback, flags)) != 0)
+ err_ret = ret;
+ break;
+ case SALVAGE_INVALID:
+ case SALVAGE_IGNORE:
+ default:
+ /*
+ * Shouldn't happen, but if it does, just do what the
+ * nice man says.
+ */
+ DB_ASSERT(0);
+ break;
+ }
+ if ((ret = memp_fput(dbp->mpf, h, 0)) != 0)
+ err_ret = ret;
+ }
+
+ __os_free(ovflbuf, 0);
+
+ if (err_ret != 0 && ret == 0)
+ ret = err_ret;
+
+ return (ret == DB_NOTFOUND ? 0 : ret);
+}
+
+/*
+ * Offset of the ith inp array entry, which we can compare to the offset
+ * the entry stores.
+ */
+#define INP_OFFSET(h, i) \
+ ((db_indx_t)((u_int8_t *)(h)->inp + (i) - (u_int8_t *)(h)))
+
+/*
+ * __db_vrfy_inpitem --
+ * Verify that a single entry in the inp array is sane, and update
+ * the high water mark and current item offset. (The former of these is
+ * used for state information between calls, and is required; it must
+ * be initialized to the pagesize before the first call.)
+ *
+ * Returns DB_VERIFY_FATAL if inp has collided with the data,
+ * since verification can't continue from there; returns DB_VERIFY_BAD
+ * if anything else is wrong.
+ *
+ * PUBLIC: int __db_vrfy_inpitem __P((DB *, PAGE *,
+ * PUBLIC: db_pgno_t, u_int32_t, int, u_int32_t, u_int32_t *, u_int32_t *));
+ */
+int
+__db_vrfy_inpitem(dbp, h, pgno, i, is_btree, flags, himarkp, offsetp)
+ DB *dbp;
+ PAGE *h;
+ db_pgno_t pgno;
+ u_int32_t i;
+ int is_btree;
+ u_int32_t flags, *himarkp, *offsetp;
+{
+ BKEYDATA *bk;
+ db_indx_t offset, len;
+
+ DB_ASSERT(himarkp != NULL);
+
+ /*
+ * Check that the inp array, which grows from the beginning of the
+ * page forward, has not collided with the data, which grow from the
+ * end of the page backward.
+ */
+ if (h->inp + i >= (db_indx_t *)((u_int8_t *)h + *himarkp)) {
+ /* We've collided with the data. We need to bail. */
+ EPRINT((dbp->dbenv,
+ "Page %lu entries listing %lu overlaps data",
+ (u_long)pgno, (u_long)i));
+ return (DB_VERIFY_FATAL);
+ }
+
+ offset = h->inp[i];
+
+ /*
+ * Check that the item offset is reasonable: it points somewhere
+ * after the inp array and before the end of the page.
+ */
+ if (offset <= INP_OFFSET(h, i) || offset > dbp->pgsize) {
+ EPRINT((dbp->dbenv,
+ "Bad offset %lu at page %lu index %lu",
+ (u_long)offset, (u_long)pgno, (u_long)i));
+ return (DB_VERIFY_BAD);
+ }
+
+ /* Update the high-water mark (what HOFFSET should be) */
+ if (offset < *himarkp)
+ *himarkp = offset;
+
+ if (is_btree) {
+ /*
+ * Check that the item length remains on-page.
+ */
+ bk = GET_BKEYDATA(h, i);
+
+ /*
+ * We need to verify the type of the item here;
+ * we can't simply assume that it will be one of the
+ * expected three. If it's not a recognizable type,
+ * it can't be considered to have a verifiable
+ * length, so it's not possible to certify it as safe.
+ */
+ switch (B_TYPE(bk->type)) {
+ case B_KEYDATA:
+ len = bk->len;
+ break;
+ case B_DUPLICATE:
+ case B_OVERFLOW:
+ len = BOVERFLOW_SIZE;
+ break;
+ default:
+ EPRINT((dbp->dbenv,
+ "Item %lu on page %lu of unrecognizable type",
+ i, pgno));
+ return (DB_VERIFY_BAD);
+ }
+
+ if ((size_t)(offset + len) > dbp->pgsize) {
+ EPRINT((dbp->dbenv,
+ "Item %lu on page %lu extends past page boundary",
+ (u_long)i, (u_long)pgno));
+ return (DB_VERIFY_BAD);
+ }
+ }
+
+ if (offsetp != NULL)
+ *offsetp = offset;
+ return (0);
+}
+
+/*
+ * __db_vrfy_duptype--
+ * Given a page number and a set of flags to __bam_vrfy_subtree,
+ * verify that the dup tree type is correct--i.e., it's a recno
+ * if DUPSORT is not set and a btree if it is.
+ *
+ * PUBLIC: int __db_vrfy_duptype
+ * PUBLIC: __P((DB *, VRFY_DBINFO *, db_pgno_t, u_int32_t));
+ */
+int
+__db_vrfy_duptype(dbp, vdp, pgno, flags)
+ DB *dbp;
+ VRFY_DBINFO *vdp;
+ db_pgno_t pgno;
+ u_int32_t flags;
+{
+ VRFY_PAGEINFO *pip;
+ int ret, isbad;
+
+ isbad = 0;
+
+ if ((ret = __db_vrfy_getpageinfo(vdp, pgno, &pip)) != 0)
+ return (ret);
+
+ switch (pip->type) {
+ case P_IBTREE:
+ case P_LDUP:
+ if (!LF_ISSET(ST_DUPSORT)) {
+ EPRINT((dbp->dbenv,
+ "Sorted duplicate set at page %lu in unsorted-dup database",
+ (u_long)pgno));
+ isbad = 1;
+ }
+ break;
+ case P_IRECNO:
+ case P_LRECNO:
+ if (LF_ISSET(ST_DUPSORT)) {
+ EPRINT((dbp->dbenv,
+ "Unsorted duplicate set at page %lu in sorted-dup database",
+ (u_long)pgno));
+ isbad = 1;
+ }
+ break;
+ default:
+ EPRINT((dbp->dbenv,
+ "Duplicate page %lu of inappropriate type %lu",
+ (u_long)pgno, (u_long)pip->type));
+ isbad = 1;
+ break;
+ }
+
+ if ((ret = __db_vrfy_putpageinfo(vdp, pip)) != 0)
+ return (ret);
+ return (isbad == 1 ? DB_VERIFY_BAD : 0);
+}
+
+/*
+ * __db_salvage_duptree --
+ * Attempt to salvage a given duplicate tree, given its alleged root.
+ *
+ * The key that corresponds to this dup set has been passed to us
+ * in DBT *key. Because data items follow keys, though, it has been
+ * printed once already.
+ *
+ * The basic idea here is that pgno ought to be a P_LDUP, a P_LRECNO, a
+ * P_IBTREE, or a P_IRECNO. If it's an internal page, use the verifier
+ * functions to make sure it's safe; if it's not, we simply bail and the
+ * data will have to be printed with no key later on. if it is safe,
+ * recurse on each of its children.
+ *
+ * Whether or not it's safe, if it's a leaf page, __bam_salvage it.
+ *
+ * At all times, use the DB hanging off vdp to mark and check what we've
+ * done, so each page gets printed exactly once and we don't get caught
+ * in any cycles.
+ *
+ * PUBLIC: int __db_salvage_duptree __P((DB *, VRFY_DBINFO *, db_pgno_t,
+ * PUBLIC: DBT *, void *, int (*)(void *, const void *), u_int32_t));
+ */
+int
+__db_salvage_duptree(dbp, vdp, pgno, key, handle, callback, flags)
+ DB *dbp;
+ VRFY_DBINFO *vdp;
+ db_pgno_t pgno;
+ DBT *key;
+ void *handle;
+ int (*callback) __P((void *, const void *));
+ u_int32_t flags;
+{
+ PAGE *h;
+ int ret, t_ret;
+
+ if (pgno == PGNO_INVALID || !IS_VALID_PGNO(pgno))
+ return (DB_VERIFY_BAD);
+
+ /* We have a plausible page. Try it. */
+ if ((ret = memp_fget(dbp->mpf, &pgno, 0, &h)) != 0)
+ return (ret);
+
+ switch (TYPE(h)) {
+ case P_IBTREE:
+ case P_IRECNO:
+ if ((ret = __db_vrfy_common(dbp, vdp, h, pgno, flags)) != 0)
+ goto err;
+ if ((ret = __bam_vrfy(dbp,
+ vdp, h, pgno, flags | DB_NOORDERCHK)) != 0 ||
+ (ret = __db_salvage_markdone(vdp, pgno)) != 0)
+ goto err;
+ /*
+ * We have a known-healthy internal page. Walk it.
+ */
+ if ((ret = __bam_salvage_walkdupint(dbp, vdp, h, key,
+ handle, callback, flags)) != 0)
+ goto err;
+ break;
+ case P_LRECNO:
+ case P_LDUP:
+ if ((ret = __bam_salvage(dbp,
+ vdp, pgno, TYPE(h), h, handle, callback, key, flags)) != 0)
+ goto err;
+ break;
+ default:
+ ret = DB_VERIFY_BAD;
+ goto err;
+ /* NOTREACHED */
+ }
+
+err: if ((t_ret = memp_fput(dbp->mpf, h, 0)) != 0 && ret == 0)
+ ret = t_ret;
+ return (ret);
+}
+
+/*
+ * __db_salvage_subdbs --
+ * Check and see if this database has subdbs; if so, try to salvage
+ * them independently.
+ */
+static int
+__db_salvage_subdbs(dbp, vdp, handle, callback, flags, hassubsp)
+ DB *dbp;
+ VRFY_DBINFO *vdp;
+ void *handle;
+ int (*callback) __P((void *, const void *));
+ u_int32_t flags;
+ int *hassubsp;
+{
+ BTMETA *btmeta;
+ DB *pgset;
+ DBC *pgsc;
+ PAGE *h;
+ db_pgno_t p, meta_pgno;
+ int ret, err_ret;
+
+ err_ret = 0;
+ pgsc = NULL;
+ pgset = NULL;
+
+ meta_pgno = PGNO_BASE_MD;
+ if ((ret = memp_fget(dbp->mpf, &meta_pgno, 0, &h)) != 0)
+ return (ret);
+
+ if (TYPE(h) == P_BTREEMETA)
+ btmeta = (BTMETA *)h;
+ else {
+ /* Not a btree metadata, ergo no subdbs, so just return. */
+ ret = 0;
+ goto err;
+ }
+
+ /* If it's not a safe page, bail on the attempt. */
+ if ((ret = __db_vrfy_common(dbp, vdp, h, PGNO_BASE_MD, flags)) != 0 ||
+ (ret = __bam_vrfy_meta(dbp, vdp, btmeta, PGNO_BASE_MD, flags)) != 0)
+ goto err;
+
+ if (!F_ISSET(&btmeta->dbmeta, BTM_SUBDB)) {
+ /* No subdbs, just return. */
+ ret = 0;
+ goto err;
+ }
+
+ /* We think we've got subdbs. Mark it so. */
+ *hassubsp = 1;
+
+ if ((ret = memp_fput(dbp->mpf, h, 0)) != 0)
+ return (ret);
+
+ /*
+ * We have subdbs. Try to crack them.
+ *
+ * To do so, get a set of leaf pages in the master
+ * database, and then walk each of the valid ones, salvaging
+ * subdbs as we go. If any prove invalid, just drop them; we'll
+ * pick them up on a later pass.
+ */
+ if ((ret = __db_vrfy_pgset(dbp->dbenv, dbp->pgsize, &pgset)) != 0)
+ return (ret);
+ if ((ret =
+ __db_meta2pgset(dbp, vdp, PGNO_BASE_MD, flags, pgset)) != 0)
+ goto err;
+
+ if ((ret = pgset->cursor(pgset, NULL, &pgsc, 0)) != 0)
+ goto err;
+ while ((ret = __db_vrfy_pgset_next(pgsc, &p)) == 0) {
+ if ((ret = memp_fget(dbp->mpf, &p, 0, &h)) != 0) {
+ err_ret = ret;
+ continue;
+ }
+ if ((ret = __db_vrfy_common(dbp, vdp, h, p, flags)) != 0 ||
+ (ret = __bam_vrfy(dbp,
+ vdp, h, p, flags | DB_NOORDERCHK)) != 0)
+ goto nextpg;
+ if (TYPE(h) != P_LBTREE)
+ goto nextpg;
+ else if ((ret = __db_salvage_subdbpg(
+ dbp, vdp, h, handle, callback, flags)) != 0)
+ err_ret = ret;
+nextpg: if ((ret = memp_fput(dbp->mpf, h, 0)) != 0)
+ err_ret = ret;
+ }
+
+ if (ret != DB_NOTFOUND)
+ goto err;
+ if ((ret = pgsc->c_close(pgsc)) != 0)
+ goto err;
+
+ ret = pgset->close(pgset, 0);
+ return ((ret == 0 && err_ret != 0) ? err_ret : ret);
+
+ /* NOTREACHED */
+
+err: if (pgsc != NULL)
+ (void)pgsc->c_close(pgsc);
+ if (pgset != NULL)
+ (void)pgset->close(pgset, 0);
+ (void)memp_fput(dbp->mpf, h, 0);
+ return (ret);
+}
+
+/*
+ * __db_salvage_subdbpg --
+ * Given a known-good leaf page in the master database, salvage all
+ * leaf pages corresponding to each subdb.
+ *
+ * PUBLIC: int __db_salvage_subdbpg
+ * PUBLIC: __P((DB *, VRFY_DBINFO *, PAGE *, void *,
+ * PUBLIC: int (*)(void *, const void *), u_int32_t));
+ */
+int
+__db_salvage_subdbpg(dbp, vdp, master, handle, callback, flags)
+ DB *dbp;
+ VRFY_DBINFO *vdp;
+ PAGE *master;
+ void *handle;
+ int (*callback) __P((void *, const void *));
+ u_int32_t flags;
+{
+ BKEYDATA *bkkey, *bkdata;
+ BOVERFLOW *bo;
+ DB *pgset;
+ DBC *pgsc;
+ DBT key;
+ PAGE *subpg;
+ db_indx_t i;
+ db_pgno_t meta_pgno, p;
+ int ret, err_ret, t_ret;
+ char *subdbname;
+
+ ret = err_ret = 0;
+ subdbname = NULL;
+
+ if ((ret = __db_vrfy_pgset(dbp->dbenv, dbp->pgsize, &pgset)) != 0)
+ return (ret);
+
+ /*
+ * For each entry, get and salvage the set of pages
+ * corresponding to that entry.
+ */
+ for (i = 0; i < NUM_ENT(master); i += P_INDX) {
+ bkkey = GET_BKEYDATA(master, i);
+ bkdata = GET_BKEYDATA(master, i + O_INDX);
+
+ /* Get the subdatabase name. */
+ if (B_TYPE(bkkey->type) == B_OVERFLOW) {
+ /*
+ * We can, in principle anyway, have a subdb
+ * name so long it overflows. Ick.
+ */
+ bo = (BOVERFLOW *)bkkey;
+ if ((ret = __db_safe_goff(dbp, vdp, bo->pgno, &key,
+ (void **)&subdbname, flags)) != 0) {
+ err_ret = DB_VERIFY_BAD;
+ continue;
+ }
+
+ /* Nul-terminate it. */
+ if ((ret = __os_realloc(dbp->dbenv,
+ key.size + 1, NULL, &subdbname)) != 0)
+ goto err;
+ subdbname[key.size] = '\0';
+ } else if (B_TYPE(bkkey->type == B_KEYDATA)) {
+ if ((ret = __os_realloc(dbp->dbenv,
+ bkkey->len + 1, NULL, &subdbname)) != 0)
+ goto err;
+ memcpy(subdbname, bkkey->data, bkkey->len);
+ subdbname[bkkey->len] = '\0';
+ }
+
+ /* Get the corresponding pgno. */
+ if (bkdata->len != sizeof(db_pgno_t)) {
+ err_ret = DB_VERIFY_BAD;
+ continue;
+ }
+ memcpy(&meta_pgno, bkdata->data, sizeof(db_pgno_t));
+
+ /* If we can't get the subdb meta page, just skip the subdb. */
+ if (!IS_VALID_PGNO(meta_pgno) ||
+ (ret = memp_fget(dbp->mpf, &meta_pgno, 0, &subpg)) != 0) {
+ err_ret = ret;
+ continue;
+ }
+
+ /*
+ * Verify the subdatabase meta page. This has two functions.
+ * First, if it's bad, we have no choice but to skip the subdb
+ * and let the pages just get printed on a later pass. Second,
+ * the access-method-specific meta verification routines record
+ * the various state info (such as the presence of dups)
+ * that we need for __db_prheader().
+ */
+ if ((ret =
+ __db_vrfy_common(dbp, vdp, subpg, meta_pgno, flags)) != 0) {
+ err_ret = ret;
+ (void)memp_fput(dbp->mpf, subpg, 0);
+ continue;
+ }
+ switch (TYPE(subpg)) {
+ case P_BTREEMETA:
+ if ((ret = __bam_vrfy_meta(dbp,
+ vdp, (BTMETA *)subpg, meta_pgno, flags)) != 0) {
+ err_ret = ret;
+ (void)memp_fput(dbp->mpf, subpg, 0);
+ continue;
+ }
+ break;
+ case P_HASHMETA:
+ if ((ret = __ham_vrfy_meta(dbp,
+ vdp, (HMETA *)subpg, meta_pgno, flags)) != 0) {
+ err_ret = ret;
+ (void)memp_fput(dbp->mpf, subpg, 0);
+ continue;
+ }
+ break;
+ default:
+ /* This isn't an appropriate page; skip this subdb. */
+ err_ret = DB_VERIFY_BAD;
+ continue;
+ /* NOTREACHED */
+ }
+
+ if ((ret = memp_fput(dbp->mpf, subpg, 0)) != 0) {
+ err_ret = ret;
+ continue;
+ }
+
+ /* Print a subdatabase header. */
+ if ((ret = __db_prheader(dbp,
+ subdbname, 0, 0, handle, callback, vdp, meta_pgno)) != 0)
+ goto err;
+
+ if ((ret = __db_meta2pgset(dbp, vdp, meta_pgno,
+ flags, pgset)) != 0) {
+ err_ret = ret;
+ continue;
+ }
+
+ if ((ret = pgset->cursor(pgset, NULL, &pgsc, 0)) != 0)
+ goto err;
+ while ((ret = __db_vrfy_pgset_next(pgsc, &p)) == 0) {
+ if ((ret = memp_fget(dbp->mpf, &p, 0, &subpg)) != 0) {
+ err_ret = ret;
+ continue;
+ }
+ if ((ret = __db_salvage(dbp, vdp, p, subpg,
+ handle, callback, flags)) != 0)
+ err_ret = ret;
+ if ((ret = memp_fput(dbp->mpf, subpg, 0)) != 0)
+ err_ret = ret;
+ }
+
+ if (ret != DB_NOTFOUND)
+ goto err;
+
+ if ((ret = pgsc->c_close(pgsc)) != 0)
+ goto err;
+ if ((ret = __db_prfooter(handle, callback)) != 0)
+ goto err;
+ }
+err: if (subdbname)
+ __os_free(subdbname, 0);
+
+ if ((t_ret = pgset->close(pgset, 0)) != 0)
+ ret = t_ret;
+
+ if ((t_ret = __db_salvage_markdone(vdp, PGNO(master))) != 0)
+ return (t_ret);
+
+ return ((err_ret != 0) ? err_ret : ret);
+}
+
+/*
+ * __db_meta2pgset --
+ * Given a known-safe meta page number, return the set of pages
+ * corresponding to the database it represents. Return DB_VERIFY_BAD if
+ * it's not a suitable meta page or is invalid.
+ */
+static int
+__db_meta2pgset(dbp, vdp, pgno, flags, pgset)
+ DB *dbp;
+ VRFY_DBINFO *vdp;
+ db_pgno_t pgno;
+ u_int32_t flags;
+ DB *pgset;
+{
+ PAGE *h;
+ int ret, t_ret;
+
+ if ((ret = memp_fget(dbp->mpf, &pgno, 0, &h)) != 0)
+ return (ret);
+
+ switch (TYPE(h)) {
+ case P_BTREEMETA:
+ ret = __bam_meta2pgset(dbp, vdp, (BTMETA *)h, flags, pgset);
+ break;
+ case P_HASHMETA:
+ ret = __ham_meta2pgset(dbp, vdp, (HMETA *)h, flags, pgset);
+ break;
+ default:
+ ret = DB_VERIFY_BAD;
+ break;
+ }
+
+ if ((t_ret = memp_fput(dbp->mpf, h, 0)) != 0)
+ return (t_ret);
+ return (ret);
+}
+
+/*
+ * __db_guesspgsize --
+ * Try to guess what the pagesize is if the one on the meta page
+ * and the one in the db are invalid.
+ */
+static int
+__db_guesspgsize(dbenv, fhp)
+ DB_ENV *dbenv;
+ DB_FH *fhp;
+{
+ db_pgno_t i;
+ size_t nr;
+ u_int32_t guess;
+ u_int8_t type;
+ int ret;
+
+ for (guess = DB_MAX_PGSIZE; guess >= DB_MIN_PGSIZE; guess >>= 1) {
+ /*
+ * We try to read three pages ahead after the first one
+ * and make sure we have plausible types for all of them.
+ * If the seeks fail, continue with a smaller size;
+ * we're probably just looking past the end of the database.
+ * If they succeed and the types are reasonable, also continue
+ * with a size smaller; we may be looking at pages N,
+ * 2N, and 3N for some N > 1.
+ *
+ * As soon as we hit an invalid type, we stop and return
+ * our previous guess; that last one was probably the page size.
+ */
+ for (i = 1; i <= 3; i++) {
+ if ((ret = __os_seek(dbenv, fhp, guess,
+ i, SSZ(DBMETA, type), 0, DB_OS_SEEK_SET)) != 0)
+ break;
+ if ((ret = __os_read(dbenv,
+ fhp, &type, 1, &nr)) != 0 || nr == 0)
+ break;
+ if (type == P_INVALID || type >= P_PAGETYPE_MAX)
+ return (guess << 1);
+ }
+ }
+
+ /*
+ * If we're just totally confused--the corruption takes up most of the
+ * beginning pages of the database--go with the default size.
+ */
+ return (DB_DEF_IOSIZE);
+}
diff --git a/bdb/db/db_vrfyutil.c b/bdb/db/db_vrfyutil.c
new file mode 100644
index 00000000000..89dccdcc760
--- /dev/null
+++ b/bdb/db/db_vrfyutil.c
@@ -0,0 +1,830 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 2000
+ * Sleepycat Software. All rights reserved.
+ *
+ * $Id: db_vrfyutil.c,v 11.11 2000/11/28 21:36:04 bostic Exp $
+ */
+
+#include "db_config.h"
+
+#ifndef lint
+static const char revid[] = "$Id: db_vrfyutil.c,v 11.11 2000/11/28 21:36:04 bostic Exp $";
+#endif /* not lint */
+
+#ifndef NO_SYSTEM_INCLUDES
+#include <sys/types.h>
+
+#include <string.h>
+#endif
+
+#include "db_int.h"
+#include "db_page.h"
+#include "db_verify.h"
+#include "db_ext.h"
+
+static int __db_vrfy_pgset_iinc __P((DB *, db_pgno_t, int));
+
+/*
+ * __db_vrfy_dbinfo_create --
+ * Allocate and initialize a VRFY_DBINFO structure.
+ *
+ * PUBLIC: int __db_vrfy_dbinfo_create
+ * PUBLIC: __P((DB_ENV *, u_int32_t, VRFY_DBINFO **));
+ */
+int
+__db_vrfy_dbinfo_create (dbenv, pgsize, vdpp)
+ DB_ENV *dbenv;
+ u_int32_t pgsize;
+ VRFY_DBINFO **vdpp;
+{
+ DB *cdbp, *pgdbp, *pgset;
+ VRFY_DBINFO *vdp;
+ int ret;
+
+ vdp = NULL;
+ cdbp = pgdbp = pgset = NULL;
+
+ if ((ret = __os_calloc(NULL,
+ 1, sizeof(VRFY_DBINFO), (void **)&vdp)) != 0)
+ goto err;
+
+ if ((ret = db_create(&cdbp, dbenv, 0)) != 0)
+ goto err;
+
+ if ((ret = cdbp->set_flags(cdbp, DB_DUP | DB_DUPSORT)) != 0)
+ goto err;
+
+ if ((ret = cdbp->set_pagesize(cdbp, pgsize)) != 0)
+ goto err;
+
+ if ((ret =
+ cdbp->open(cdbp, NULL, NULL, DB_BTREE, DB_CREATE, 0600)) != 0)
+ goto err;
+
+ if ((ret = db_create(&pgdbp, dbenv, 0)) != 0)
+ goto err;
+
+ if ((ret = pgdbp->set_pagesize(pgdbp, pgsize)) != 0)
+ goto err;
+
+ if ((ret =
+ pgdbp->open(pgdbp, NULL, NULL, DB_BTREE, DB_CREATE, 0600)) != 0)
+ goto err;
+
+ if ((ret = __db_vrfy_pgset(dbenv, pgsize, &pgset)) != 0)
+ goto err;
+
+ LIST_INIT(&vdp->subdbs);
+ LIST_INIT(&vdp->activepips);
+
+ vdp->cdbp = cdbp;
+ vdp->pgdbp = pgdbp;
+ vdp->pgset = pgset;
+ *vdpp = vdp;
+ return (0);
+
+err: if (cdbp != NULL)
+ (void)cdbp->close(cdbp, 0);
+ if (pgdbp != NULL)
+ (void)pgdbp->close(pgdbp, 0);
+ if (vdp != NULL)
+ __os_free(vdp, sizeof(VRFY_DBINFO));
+ return (ret);
+}
+
+/*
+ * __db_vrfy_dbinfo_destroy --
+ * Destructor for VRFY_DBINFO. Destroys VRFY_PAGEINFOs and deallocates
+ * structure.
+ *
+ * PUBLIC: int __db_vrfy_dbinfo_destroy __P((VRFY_DBINFO *));
+ */
+int
+__db_vrfy_dbinfo_destroy(vdp)
+ VRFY_DBINFO *vdp;
+{
+ VRFY_CHILDINFO *c, *d;
+ int t_ret, ret;
+
+ ret = 0;
+
+ for (c = LIST_FIRST(&vdp->subdbs); c != NULL; c = d) {
+ d = LIST_NEXT(c, links);
+ __os_free(c, 0);
+ }
+
+ if ((t_ret = vdp->pgdbp->close(vdp->pgdbp, 0)) != 0)
+ ret = t_ret;
+
+ if ((t_ret = vdp->cdbp->close(vdp->cdbp, 0)) != 0 && ret == 0)
+ ret = t_ret;
+
+ if ((t_ret = vdp->pgset->close(vdp->pgset, 0)) != 0 && ret == 0)
+ ret = t_ret;
+
+ DB_ASSERT(LIST_FIRST(&vdp->activepips) == NULL);
+
+ __os_free(vdp, sizeof(VRFY_DBINFO));
+ return (ret);
+}
+
+/*
+ * __db_vrfy_getpageinfo --
+ * Get a PAGEINFO structure for a given page, creating it if necessary.
+ *
+ * PUBLIC: int __db_vrfy_getpageinfo
+ * PUBLIC: __P((VRFY_DBINFO *, db_pgno_t, VRFY_PAGEINFO **));
+ */
+int
+__db_vrfy_getpageinfo(vdp, pgno, pipp)
+ VRFY_DBINFO *vdp;
+ db_pgno_t pgno;
+ VRFY_PAGEINFO **pipp;
+{
+ DBT key, data;
+ DB *pgdbp;
+ VRFY_PAGEINFO *pip;
+ int ret;
+
+ /*
+ * We want a page info struct. There are three places to get it from,
+ * in decreasing order of preference:
+ *
+ * 1. vdp->activepips. If it's already "checked out", we're
+ * already using it, we return the same exact structure with a
+ * bumped refcount. This is necessary because this code is
+ * replacing array accesses, and it's common for f() to make some
+ * changes to a pip, and then call g() and h() which each make
+ * changes to the same pip. vdps are never shared between threads
+ * (they're never returned to the application), so this is safe.
+ * 2. The pgdbp. It's not in memory, but it's in the database, so
+ * get it, give it a refcount of 1, and stick it on activepips.
+ * 3. malloc. It doesn't exist yet; create it, then stick it on
+ * activepips. We'll put it in the database when we putpageinfo
+ * later.
+ */
+
+ /* Case 1. */
+ for (pip = LIST_FIRST(&vdp->activepips); pip != NULL;
+ pip = LIST_NEXT(pip, links))
+ if (pip->pgno == pgno)
+ /* Found it. */
+ goto found;
+
+ /* Case 2. */
+ pgdbp = vdp->pgdbp;
+ memset(&key, 0, sizeof(DBT));
+ memset(&data, 0, sizeof(DBT));
+ F_SET(&data, DB_DBT_MALLOC);
+ key.data = &pgno;
+ key.size = sizeof(db_pgno_t);
+
+ if ((ret = pgdbp->get(pgdbp, NULL, &key, &data, 0)) == 0) {
+ /* Found it. */
+ DB_ASSERT(data.size = sizeof(VRFY_PAGEINFO));
+ pip = data.data;
+ DB_ASSERT(pip->pi_refcount == 0);
+ LIST_INSERT_HEAD(&vdp->activepips, pip, links);
+ goto found;
+ } else if (ret != DB_NOTFOUND) /* Something nasty happened. */
+ return (ret);
+
+ /* Case 3 */
+ if ((ret = __db_vrfy_pageinfo_create(&pip)) != 0)
+ return (ret);
+
+ LIST_INSERT_HEAD(&vdp->activepips, pip, links);
+found: pip->pi_refcount++;
+
+ *pipp = pip;
+
+ DB_ASSERT(pip->pi_refcount > 0);
+ return (0);
+}
+
+/*
+ * __db_vrfy_putpageinfo --
+ * Put back a VRFY_PAGEINFO that we're done with.
+ *
+ * PUBLIC: int __db_vrfy_putpageinfo __P((VRFY_DBINFO *, VRFY_PAGEINFO *));
+ */
+int
+__db_vrfy_putpageinfo(vdp, pip)
+ VRFY_DBINFO *vdp;
+ VRFY_PAGEINFO *pip;
+{
+ DBT key, data;
+ DB *pgdbp;
+ VRFY_PAGEINFO *p;
+ int ret;
+#ifdef DIAGNOSTIC
+ int found;
+
+ found = 0;
+#endif
+
+ if (--pip->pi_refcount > 0)
+ return (0);
+
+ pgdbp = vdp->pgdbp;
+ memset(&key, 0, sizeof(DBT));
+ memset(&data, 0, sizeof(DBT));
+
+ key.data = &pip->pgno;
+ key.size = sizeof(db_pgno_t);
+ data.data = pip;
+ data.size = sizeof(VRFY_PAGEINFO);
+
+ if ((ret = pgdbp->put(pgdbp, NULL, &key, &data, 0)) != 0)
+ return (ret);
+
+ for (p = LIST_FIRST(&vdp->activepips); p != NULL;
+ p = LIST_NEXT(p, links))
+ if (p == pip) {
+#ifdef DIAGNOSTIC
+ found++;
+#endif
+ DB_ASSERT(p->pi_refcount == 0);
+ LIST_REMOVE(p, links);
+ break;
+ }
+#ifdef DIAGNOSTIC
+ DB_ASSERT(found == 1);
+#endif
+
+ DB_ASSERT(pip->pi_refcount == 0);
+ __os_free(pip, 0);
+ return (0);
+}
+
+/*
+ * __db_vrfy_pgset --
+ * Create a temporary database for the storing of sets of page numbers.
+ * (A mapping from page number to int, used by the *_meta2pgset functions,
+ * as well as for keeping track of which pages the verifier has seen.)
+ *
+ * PUBLIC: int __db_vrfy_pgset __P((DB_ENV *, u_int32_t, DB **));
+ */
+int
+__db_vrfy_pgset(dbenv, pgsize, dbpp)
+ DB_ENV *dbenv;
+ u_int32_t pgsize;
+ DB **dbpp;
+{
+ DB *dbp;
+ int ret;
+
+ if ((ret = db_create(&dbp, dbenv, 0)) != 0)
+ return (ret);
+ if ((ret = dbp->set_pagesize(dbp, pgsize)) != 0)
+ goto err;
+ if ((ret = dbp->open(dbp, NULL, NULL, DB_BTREE, DB_CREATE, 0600)) == 0)
+ *dbpp = dbp;
+ else
+err: (void)dbp->close(dbp, 0);
+
+ return (ret);
+}
+
+/*
+ * __db_vrfy_pgset_get --
+ * Get the value associated in a page set with a given pgno. Return
+ * a 0 value (and succeed) if we've never heard of this page.
+ *
+ * PUBLIC: int __db_vrfy_pgset_get __P((DB *, db_pgno_t, int *));
+ */
+int
+__db_vrfy_pgset_get(dbp, pgno, valp)
+ DB *dbp;
+ db_pgno_t pgno;
+ int *valp;
+{
+ DBT key, data;
+ int ret, val;
+
+ memset(&key, 0, sizeof(DBT));
+ memset(&data, 0, sizeof(DBT));
+
+ key.data = &pgno;
+ key.size = sizeof(db_pgno_t);
+ data.data = &val;
+ data.ulen = sizeof(int);
+ F_SET(&data, DB_DBT_USERMEM);
+
+ if ((ret = dbp->get(dbp, NULL, &key, &data, 0)) == 0) {
+ DB_ASSERT(data.size = sizeof(int));
+ memcpy(&val, data.data, sizeof(int));
+ } else if (ret == DB_NOTFOUND)
+ val = 0;
+ else
+ return (ret);
+
+ *valp = val;
+ return (0);
+}
+
+/*
+ * __db_vrfy_pgset_inc --
+ * Increment the value associated with a pgno by 1.
+ *
+ * PUBLIC: int __db_vrfy_pgset_inc __P((DB *, db_pgno_t));
+ */
+int
+__db_vrfy_pgset_inc(dbp, pgno)
+ DB *dbp;
+ db_pgno_t pgno;
+{
+
+ return (__db_vrfy_pgset_iinc(dbp, pgno, 1));
+}
+
+/*
+ * __db_vrfy_pgset_dec --
+ * Increment the value associated with a pgno by 1.
+ *
+ * PUBLIC: int __db_vrfy_pgset_dec __P((DB *, db_pgno_t));
+ */
+int
+__db_vrfy_pgset_dec(dbp, pgno)
+ DB *dbp;
+ db_pgno_t pgno;
+{
+
+ return (__db_vrfy_pgset_iinc(dbp, pgno, -1));
+}
+
+/*
+ * __db_vrfy_pgset_iinc --
+ * Increment the value associated with a pgno by i.
+ *
+ */
+static int
+__db_vrfy_pgset_iinc(dbp, pgno, i)
+ DB *dbp;
+ db_pgno_t pgno;
+ int i;
+{
+ DBT key, data;
+ int ret;
+ int val;
+
+ memset(&key, 0, sizeof(DBT));
+ memset(&data, 0, sizeof(DBT));
+
+ val = 0;
+
+ key.data = &pgno;
+ key.size = sizeof(db_pgno_t);
+ data.data = &val;
+ data.ulen = sizeof(int);
+ F_SET(&data, DB_DBT_USERMEM);
+
+ if ((ret = dbp->get(dbp, NULL, &key, &data, 0)) == 0) {
+ DB_ASSERT(data.size = sizeof(int));
+ memcpy(&val, data.data, sizeof(int));
+ } else if (ret != DB_NOTFOUND)
+ return (ret);
+
+ data.size = sizeof(int);
+ val += i;
+
+ return (dbp->put(dbp, NULL, &key, &data, 0));
+}
+
+/*
+ * __db_vrfy_pgset_next --
+ * Given a cursor open in a pgset database, get the next page in the
+ * set.
+ *
+ * PUBLIC: int __db_vrfy_pgset_next __P((DBC *, db_pgno_t *));
+ */
+int
+__db_vrfy_pgset_next(dbc, pgnop)
+ DBC *dbc;
+ db_pgno_t *pgnop;
+{
+ DBT key, data;
+ db_pgno_t pgno;
+ int ret;
+
+ memset(&key, 0, sizeof(DBT));
+ memset(&data, 0, sizeof(DBT));
+ /* We don't care about the data, just the keys. */
+ F_SET(&data, DB_DBT_USERMEM | DB_DBT_PARTIAL);
+ F_SET(&key, DB_DBT_USERMEM);
+ key.data = &pgno;
+ key.ulen = sizeof(db_pgno_t);
+
+ if ((ret = dbc->c_get(dbc, &key, &data, DB_NEXT)) != 0)
+ return (ret);
+
+ DB_ASSERT(key.size == sizeof(db_pgno_t));
+ *pgnop = pgno;
+
+ return (0);
+}
+
+/*
+ * __db_vrfy_childcursor --
+ * Create a cursor to walk the child list with. Returns with a nonzero
+ * final argument if the specified page has no children.
+ *
+ * PUBLIC: int __db_vrfy_childcursor __P((VRFY_DBINFO *, DBC **));
+ */
+int
+__db_vrfy_childcursor(vdp, dbcp)
+ VRFY_DBINFO *vdp;
+ DBC **dbcp;
+{
+ DB *cdbp;
+ DBC *dbc;
+ int ret;
+
+ cdbp = vdp->cdbp;
+
+ if ((ret = cdbp->cursor(cdbp, NULL, &dbc, 0)) == 0)
+ *dbcp = dbc;
+
+ return (ret);
+}
+
+/*
+ * __db_vrfy_childput --
+ * Add a child structure to the set for a given page.
+ *
+ * PUBLIC: int __db_vrfy_childput
+ * PUBLIC: __P((VRFY_DBINFO *, db_pgno_t, VRFY_CHILDINFO *));
+ */
+int
+__db_vrfy_childput(vdp, pgno, cip)
+ VRFY_DBINFO *vdp;
+ db_pgno_t pgno;
+ VRFY_CHILDINFO *cip;
+{
+ DBT key, data;
+ DB *cdbp;
+ int ret;
+
+ cdbp = vdp->cdbp;
+ memset(&key, 0, sizeof(DBT));
+ memset(&data, 0, sizeof(DBT));
+
+ key.data = &pgno;
+ key.size = sizeof(db_pgno_t);
+
+ data.data = cip;
+ data.size = sizeof(VRFY_CHILDINFO);
+
+ /*
+ * Don't add duplicate (data) entries for a given child, and accept
+ * DB_KEYEXIST as a successful return; we only need to verify
+ * each child once, even if a child (such as an overflow key) is
+ * multiply referenced.
+ */
+ ret = cdbp->put(cdbp, NULL, &key, &data, DB_NODUPDATA);
+ return (ret == DB_KEYEXIST ? 0 : ret);
+}
+
+/*
+ * __db_vrfy_ccset --
+ * Sets a cursor created with __db_vrfy_childcursor to the first
+ * child of the given pgno, and returns it in the third arg.
+ *
+ * PUBLIC: int __db_vrfy_ccset __P((DBC *, db_pgno_t, VRFY_CHILDINFO **));
+ */
+int
+__db_vrfy_ccset(dbc, pgno, cipp)
+ DBC *dbc;
+ db_pgno_t pgno;
+ VRFY_CHILDINFO **cipp;
+{
+ DBT key, data;
+ int ret;
+
+ memset(&key, 0, sizeof(DBT));
+ memset(&data, 0, sizeof(DBT));
+
+ key.data = &pgno;
+ key.size = sizeof(db_pgno_t);
+
+ if ((ret = dbc->c_get(dbc, &key, &data, DB_SET)) != 0)
+ return (ret);
+
+ DB_ASSERT(data.size == sizeof(VRFY_CHILDINFO));
+ *cipp = (VRFY_CHILDINFO *)data.data;
+
+ return (0);
+}
+
+/*
+ * __db_vrfy_ccnext --
+ * Gets the next child of the given cursor created with
+ * __db_vrfy_childcursor, and returns it in the memory provided in the
+ * second arg.
+ *
+ * PUBLIC: int __db_vrfy_ccnext __P((DBC *, VRFY_CHILDINFO **));
+ */
+int
+__db_vrfy_ccnext(dbc, cipp)
+ DBC *dbc;
+ VRFY_CHILDINFO **cipp;
+{
+ DBT key, data;
+ int ret;
+
+ memset(&key, 0, sizeof(DBT));
+ memset(&data, 0, sizeof(DBT));
+
+ if ((ret = dbc->c_get(dbc, &key, &data, DB_NEXT_DUP)) != 0)
+ return (ret);
+
+ DB_ASSERT(data.size == sizeof(VRFY_CHILDINFO));
+ *cipp = (VRFY_CHILDINFO *)data.data;
+
+ return (0);
+}
+
+/*
+ * __db_vrfy_ccclose --
+ * Closes the cursor created with __db_vrfy_childcursor.
+ *
+ * This doesn't actually do anything interesting now, but it's
+ * not inconceivable that we might change the internal database usage
+ * and keep the interfaces the same, and a function call here or there
+ * seldom hurts anyone.
+ *
+ * PUBLIC: int __db_vrfy_ccclose __P((DBC *));
+ */
+int
+__db_vrfy_ccclose(dbc)
+ DBC *dbc;
+{
+
+ return (dbc->c_close(dbc));
+}
+
+/*
+ * __db_vrfy_pageinfo_create --
+ * Constructor for VRFY_PAGEINFO; allocates and initializes.
+ *
+ * PUBLIC: int __db_vrfy_pageinfo_create __P((VRFY_PAGEINFO **));
+ */
+int
+__db_vrfy_pageinfo_create(pgipp)
+ VRFY_PAGEINFO **pgipp;
+{
+ VRFY_PAGEINFO *pgip;
+ int ret;
+
+ if ((ret = __os_calloc(NULL,
+ 1, sizeof(VRFY_PAGEINFO), (void **)&pgip)) != 0)
+ return (ret);
+
+ DB_ASSERT(pgip->pi_refcount == 0);
+
+ *pgipp = pgip;
+ return (0);
+}
+
+/*
+ * __db_salvage_init --
+ * Set up salvager database.
+ *
+ * PUBLIC: int __db_salvage_init __P((VRFY_DBINFO *));
+ */
+int
+__db_salvage_init(vdp)
+ VRFY_DBINFO *vdp;
+{
+ DB *dbp;
+ int ret;
+
+ if ((ret = db_create(&dbp, NULL, 0)) != 0)
+ return (ret);
+
+ if ((ret = dbp->set_pagesize(dbp, 1024)) != 0)
+ goto err;
+
+ if ((ret = dbp->open(dbp, NULL, NULL, DB_BTREE, DB_CREATE, 0)) != 0)
+ goto err;
+
+ vdp->salvage_pages = dbp;
+ return (0);
+
+err: (void)dbp->close(dbp, 0);
+ return (ret);
+}
+
+/*
+ * __db_salvage_destroy --
+ * Close salvager database.
+ * PUBLIC: void __db_salvage_destroy __P((VRFY_DBINFO *));
+ */
+void
+__db_salvage_destroy(vdp)
+ VRFY_DBINFO *vdp;
+{
+ (void)vdp->salvage_pages->close(vdp->salvage_pages, 0);
+}
+
+/*
+ * __db_salvage_getnext --
+ * Get the next (first) unprinted page in the database of pages we need to
+ * print still. Delete entries for any already-printed pages we encounter
+ * in this search, as well as the page we're returning.
+ *
+ * PUBLIC: int __db_salvage_getnext
+ * PUBLIC: __P((VRFY_DBINFO *, db_pgno_t *, u_int32_t *));
+ */
+int
+__db_salvage_getnext(vdp, pgnop, pgtypep)
+ VRFY_DBINFO *vdp;
+ db_pgno_t *pgnop;
+ u_int32_t *pgtypep;
+{
+ DB *dbp;
+ DBC *dbc;
+ DBT key, data;
+ int ret;
+ u_int32_t pgtype;
+
+ dbp = vdp->salvage_pages;
+
+ memset(&key, 0, sizeof(DBT));
+ memset(&data, 0, sizeof(DBT));
+
+ if ((ret = dbp->cursor(dbp, NULL, &dbc, 0)) != 0)
+ return (ret);
+
+ while ((ret = dbc->c_get(dbc, &key, &data, DB_NEXT)) == 0) {
+ DB_ASSERT(data.size == sizeof(u_int32_t));
+ memcpy(&pgtype, data.data, sizeof(pgtype));
+
+ if ((ret = dbc->c_del(dbc, 0)) != 0)
+ goto err;
+ if (pgtype != SALVAGE_IGNORE)
+ goto found;
+ }
+
+ /* No more entries--ret probably equals DB_NOTFOUND. */
+
+ if (0) {
+found: DB_ASSERT(key.size == sizeof(db_pgno_t));
+ DB_ASSERT(data.size == sizeof(u_int32_t));
+
+ *pgnop = *(db_pgno_t *)key.data;
+ *pgtypep = *(u_int32_t *)data.data;
+ }
+
+err: (void)dbc->c_close(dbc);
+ return (ret);
+}
+
+/*
+ * __db_salvage_isdone --
+ * Return whether or not the given pgno is already marked
+ * SALVAGE_IGNORE (meaning that we don't need to print it again).
+ *
+ * Returns DB_KEYEXIST if it is marked, 0 if not, or another error on
+ * error.
+ *
+ * PUBLIC: int __db_salvage_isdone __P((VRFY_DBINFO *, db_pgno_t));
+ */
+int
+__db_salvage_isdone(vdp, pgno)
+ VRFY_DBINFO *vdp;
+ db_pgno_t pgno;
+{
+ DBT key, data;
+ DB *dbp;
+ int ret;
+ u_int32_t currtype;
+
+ dbp = vdp->salvage_pages;
+
+ memset(&key, 0, sizeof(DBT));
+ memset(&data, 0, sizeof(DBT));
+
+ currtype = SALVAGE_INVALID;
+ data.data = &currtype;
+ data.ulen = sizeof(u_int32_t);
+ data.flags = DB_DBT_USERMEM;
+
+ key.data = &pgno;
+ key.size = sizeof(db_pgno_t);
+
+ /*
+ * Put an entry for this page, with pgno as key and type as data,
+ * unless it's already there and is marked done.
+ * If it's there and is marked anything else, that's fine--we
+ * want to mark it done.
+ */
+ ret = dbp->get(dbp, NULL, &key, &data, 0);
+ if (ret == 0) {
+ /*
+ * The key's already here. Check and see if it's already
+ * marked done. If it is, return DB_KEYEXIST. If it's not,
+ * return 0.
+ */
+ if (currtype == SALVAGE_IGNORE)
+ return (DB_KEYEXIST);
+ else
+ return (0);
+ } else if (ret != DB_NOTFOUND)
+ return (ret);
+
+ /* The pgno is not yet marked anything; return 0. */
+ return (0);
+}
+
+/*
+ * __db_salvage_markdone --
+ * Mark as done a given page.
+ *
+ * PUBLIC: int __db_salvage_markdone __P((VRFY_DBINFO *, db_pgno_t));
+ */
+int
+__db_salvage_markdone(vdp, pgno)
+ VRFY_DBINFO *vdp;
+ db_pgno_t pgno;
+{
+ DBT key, data;
+ DB *dbp;
+ int pgtype, ret;
+ u_int32_t currtype;
+
+ pgtype = SALVAGE_IGNORE;
+ dbp = vdp->salvage_pages;
+
+ memset(&key, 0, sizeof(DBT));
+ memset(&data, 0, sizeof(DBT));
+
+ currtype = SALVAGE_INVALID;
+ data.data = &currtype;
+ data.ulen = sizeof(u_int32_t);
+ data.flags = DB_DBT_USERMEM;
+
+ key.data = &pgno;
+ key.size = sizeof(db_pgno_t);
+
+ /*
+ * Put an entry for this page, with pgno as key and type as data,
+ * unless it's already there and is marked done.
+ * If it's there and is marked anything else, that's fine--we
+ * want to mark it done, but db_salvage_isdone only lets
+ * us know if it's marked IGNORE.
+ *
+ * We don't want to return DB_KEYEXIST, though; this will
+ * likely get passed up all the way and make no sense to the
+ * application. Instead, use DB_VERIFY_BAD to indicate that
+ * we've seen this page already--it probably indicates a
+ * multiply-linked page.
+ */
+ if ((ret = __db_salvage_isdone(vdp, pgno)) != 0)
+ return (ret == DB_KEYEXIST ? DB_VERIFY_BAD : ret);
+
+ data.size = sizeof(u_int32_t);
+ data.data = &pgtype;
+
+ return (dbp->put(dbp, NULL, &key, &data, 0));
+}
+
+/*
+ * __db_salvage_markneeded --
+ * If it has not yet been printed, make note of the fact that a page
+ * must be dealt with later.
+ *
+ * PUBLIC: int __db_salvage_markneeded
+ * PUBLIC: __P((VRFY_DBINFO *, db_pgno_t, u_int32_t));
+ */
+int
+__db_salvage_markneeded(vdp, pgno, pgtype)
+ VRFY_DBINFO *vdp;
+ db_pgno_t pgno;
+ u_int32_t pgtype;
+{
+ DB *dbp;
+ DBT key, data;
+ int ret;
+
+ dbp = vdp->salvage_pages;
+
+ memset(&key, 0, sizeof(DBT));
+ memset(&data, 0, sizeof(DBT));
+
+ key.data = &pgno;
+ key.size = sizeof(db_pgno_t);
+
+ data.data = &pgtype;
+ data.size = sizeof(u_int32_t);
+
+ /*
+ * Put an entry for this page, with pgno as key and type as data,
+ * unless it's already there, in which case it's presumably
+ * already been marked done.
+ */
+ ret = dbp->put(dbp, NULL, &key, &data, DB_NOOVERWRITE);
+ return (ret == DB_KEYEXIST ? 0 : ret);
+}