author | unknown <tim@threads.polyesthetic.msg> | 2001-03-04 19:42:05 -0500 |
---|---|---|
committer | unknown <tim@threads.polyesthetic.msg> | 2001-03-04 19:42:05 -0500 |
commit | ec6ae091617bdfdca9e65e8d3e65b950d234f676 (patch) | |
tree | 9dd732e08dba156ee3d7635caedc0dc3107ecac6 /bdb/db | |
parent | 87d70fb598105b64b538ff6b81eef9da626255b1 (diff) | |
Import changeset
Diffstat (limited to 'bdb/db')
-rw-r--r-- | bdb/db/Design.fileop | 452 |
-rw-r--r-- | bdb/db/crdel.src | 103 |
-rw-r--r-- | bdb/db/crdel_auto.c | 900 |
-rw-r--r-- | bdb/db/crdel_rec.c | 646 |
-rw-r--r-- | bdb/db/db.c | 2325 |
-rw-r--r-- | bdb/db/db.src | 178 |
-rw-r--r-- | bdb/db/db_am.c | 511 |
-rw-r--r-- | bdb/db/db_auto.c | 1270 |
-rw-r--r-- | bdb/db/db_cam.c | 974 |
-rw-r--r-- | bdb/db/db_conv.c | 348 |
-rw-r--r-- | bdb/db/db_dispatch.c | 983 |
-rw-r--r-- | bdb/db/db_dup.c | 275 |
-rw-r--r-- | bdb/db/db_iface.c | 687 |
-rw-r--r-- | bdb/db/db_join.c | 730 |
-rw-r--r-- | bdb/db/db_meta.c | 309 |
-rw-r--r-- | bdb/db/db_method.c | 629 |
-rw-r--r-- | bdb/db/db_overflow.c | 681 |
-rw-r--r-- | bdb/db/db_pr.c | 1284 |
-rw-r--r-- | bdb/db/db_rec.c | 529 |
-rw-r--r-- | bdb/db/db_reclaim.c | 134 |
-rw-r--r-- | bdb/db/db_ret.c | 160 |
-rw-r--r-- | bdb/db/db_upg.c | 338 |
-rw-r--r-- | bdb/db/db_upg_opd.c | 353 |
-rw-r--r-- | bdb/db/db_vrfy.c | 2340 |
-rw-r--r-- | bdb/db/db_vrfyutil.c | 830 |
25 files changed, 17969 insertions, 0 deletions
diff --git a/bdb/db/Design.fileop b/bdb/db/Design.fileop new file mode 100644 index 00000000000..187f1ffaf22 --- /dev/null +++ b/bdb/db/Design.fileop @@ -0,0 +1,452 @@ +# $Id: Design.fileop,v 11.4 2000/02/19 20:57:54 bostic Exp $ + +The design of file operation recovery. + +Keith has asked me to write up notes on our current status of database +create and delete and recovery, why it's so hard, and how we've violated +all the cornerstone assumptions on which our recovery framework is based. + +I am including two documents at the end of this one. The first is the +initial design of the recoverability of file create and delete (there is +no talk of subdatabases there, because we didn't think we'd have to do +anything special there). I will annotate this document on where things +changed. + +The second is the design of recd007 which is supposed to test our ability +to recover these operations regardless of where one crashes. This test +is fundamentally different from our other recovery tests in the following +manner. Normally, the application controls transaction boundaries. +Therefore, we can perform an operation and then decide whether to commit +or abort it. In the normal recovery tests, we force the database into +each of the four possible states from a recovery perspective: + + database is pre-op, undo (do nothing) + database is pre-op, redo + database is post-op, undo + database is post-op, redo (do nothing) + +By copying databases at various points and initiating txn_commit and abort +appropriately, we can make all these things happen. Notice that the one +case we don't handle is where page A is in one state (e.g., pre-op) and +page B is in another state (e.g., post-op). I will argue that these don't +matter because each page is recovered independently. If anyone can poke +holes in this, I'm interested. + +The problem with create/delete recovery testing is that the transaction +is begun and ended all inside the library. 
Therefore, there is never any +point (outside the library) where we can copy files and/or initiate +abort/commit. In order to still put the recovery code through its paces, +Sue designed an infrastructure that lets you tell the library where to +make copies of things and where to suddenly inject errors so that the +transaction gets aborted. This level of detail allows us to push the +create/delete recovery code through just about every recovery path +possible (although I'm sure Mike will tell me I'm wrong when he starts to +run code coverage tools). + +OK, so that's all preamble and a brief discussion of the documents I'm +enclosing. + +Why was this so hard and painful and why is the code so Q@#$!% complicated? +The following is a discussion/explanation, but to the best of my knowledge, +the structure we have in place now works. The key question we need to be +asking is, "Does this need to be so complex or should we redesign +portions to simplify it?" At this point, there is no obvious way to simplify +it in my book, but I may be having difficulty seeing this because my mind is +too polluted at this point. + +Our overall strategy for recovery is that we do write-ahead logging, +that is, we log an operation and make sure it is on disk before any +data corresponding to the data that log record describes is on disk. +Typically we use log sequence numbers (LSNs) to mark the data so that +during recovery, we can look at the data and determine if it is in a +state before a particular log record or after a particular log record. + +In the good old days, opens were not transaction protected, so we could +do regular old opens during recovery and if the file existed, we opened +it and if it didn't (or appeared corrupt), we didn't and treated it like +a missing file. As will be discussed below in detail, our states are much +more complicated and recovery can't make such simplistic assumptions.
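As a rough illustration of the LSN check described above (not code from this changeset), the four recovery states can be told apart by comparing the LSN stamped on a page with the LSNs carried in a log record. This is a minimal sketch under stated assumptions: `lsn_t`, `lsn_cmp`, `recovery_action` and the enum names are hypothetical, and the record is assumed to carry the page's previous LSN. Berkeley DB's own recovery routines work on `DB_LSN` values and are more involved.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a log sequence number: (log file, offset). */
typedef struct { uint32_t file; uint32_t offset; } lsn_t;

typedef enum { OP_REDO, OP_UNDO } recop_t;
typedef enum { ACT_NONE, ACT_APPLY, ACT_REVERT } action_t;

/* Order two LSNs: negative, zero, or positive, like strcmp. */
int lsn_cmp(lsn_t a, lsn_t b)
{
	if (a.file != b.file)
		return (a.file < b.file ? -1 : 1);
	if (a.offset != b.offset)
		return (a.offset < b.offset ? -1 : 1);
	return (0);
}

/*
 * page_lsn: LSN currently stamped on the page.
 * rec_lsn:  LSN of the log record being processed.
 * prev_lsn: LSN the page carried before the logged operation
 *           (assumed here to be stored in the log record).
 */
action_t recovery_action(lsn_t page_lsn, lsn_t rec_lsn, lsn_t prev_lsn,
    recop_t op)
{
	int pre_op = lsn_cmp(page_lsn, prev_lsn) == 0;
	int post_op = lsn_cmp(page_lsn, rec_lsn) == 0;

	if (op == OP_REDO && pre_op)
		return (ACT_APPLY);	/* database is pre-op, redo */
	if (op == OP_UNDO && post_op)
		return (ACT_REVERT);	/* database is post-op, undo */
	return (ACT_NONE);		/* pre-op undo or post-op redo */
}
```

Only two of the four combinations require work; the other two are the "do nothing" cases from the list of states given earlier in the design notes.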
+ +Also, since we are now dealing with file system operations, we have less +control about when they actually happen and what the state of the system +can be. That is, we have to write create log records synchronously, because +the create/open system call may force a newly created (0-length) file to +disk. This file has to now be identified as being in the "being-created" +state. + +A. We used to make a number of assumptions during recovery: + +1. We could call db_open at any time and one of three things would happen: + a) the file would be opened cleanly + b) the file would not exist + c) we would encounter an error while opening the file + +Case a posed no difficulty. +In Case b, we simply spit out a warning that a file was missing and then + ignored all subsequent operations to that file. +In Case c, we reported a fatal error. + +2. We can always generate a warning if a file is missing. + +3. We never encounter NULL file names in the log. + +B. We also made some assumptions in the main-line library: + +1. If you try to open a file and it exists but is 0-length, then +someone else is trying to open it. + +2. You can write pages anywhere in a file and any non-existent pages +are 0-filled. [This breaks on Windows.] + +3. If you have proper permissions then you can always evict pages from +the buffer pool. + +4. During open, we can close the master database handle as soon as +we're done with it since all the rest of the activity will take place +on the subdatabase handle. + +In our brave new world, most of these assumptions are no longer valid. +Let's address them one at a time. + +A.1 We could call db_open at any time and one of three things would happen: + a) the file would be opened cleanly + b) the file would not exist + c) we would encounter an error while opening the file +There are now additional states. Since we are trying to make file +operations recoverable, you can now die in the middle of such an +operation and we have to be able to pick up the pieces. 
What this +now means is that: + + * a 0-length file can be an indication of a create in-progress + * you can have a meta-data page but no root page (of a btree) + * if a file doesn't exist, it could mean that it was just about + to be created and needs to be rolled forward. + * if you encounter an error in a file (e.g., the meta-data page + is all 0's) you could still be in mid-open. + +I have now made this all work, but it required significant changes to the +db_open code and error handling and this is the sort of change that makes +everyone nervous. + +A.2. We can always generate a warning if a file is missing. + +Now that we have a delete file method in the API, we need to make sure +that we do not generate warning messages for files that don't exist if +we see that they were explicitly deleted. + +This means that we need to save state during recovery, determine which +files were missing and were not being recreated and were not deleted and +only complain about those. + +A.3. We never encounter NULL file names in the log. + +Now that we allow transaction protection on memory-resident files, we write +log messages for files with NULL file names. This means that our assumption +of always being able to call "db_open" on any log_register OPEN message found +in the log is no longer valid. + +B.1. If you try to open a file and it exists but is 0-length, then +someone else is trying to open it. + +As discussed for A.1, this is no longer true. It may be instead that you +are in the process of recovering a create. + +B.2. You can write pages anywhere in a file and any non-existent pages +are 0-filled. + +It turns out that this is not true on Windows. This means that places +we do group allocation (hash) must explicitly allocate each page, because +we can't count on recognizing the uninitialized pages later. + +B.3. If you have proper permissions then you can always evict pages from +the buffer pool.
+ +In the brave new world though, files can be deleted and they may +have pages in the mpool. If you later try to evict these, you +discover that the file doesn't exist. We'd get here when we had +to dirty pages during a remove operation. + +B.4. You can close files any time you want. + +However, if the file takes part in the open/remove transaction, +then we had better not close it until after the transaction +commits/aborts, because we need to be able to get our hands on the +dbp and the open happened in a different transaction. + +=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- +Design for recovering file create and delete in the presence of subdatabases. + +Assumptions: + Remove the O_TRUNCATE flag. + Single-thread all open/create/delete operations. + (Well, almost all; we'll optimize opens without DB_CREATE set.) + The reasoning for this is that with two simultaneous + open/creators, during recovery, we cannot identify which + transaction successfully created files and therefore cannot + recover correctly. + File system creates/deletes are synchronous + Once the file is open, subdatabase creates look like regular + get/put operations and a metadata page creation. + +There are 4 cases to deal with: + 1. Open/create file + 2. Open/create subdatabase + 3. Delete + 4. Recovery records + + __db_fileopen_recover + __db_metapage_recover + __db_delete_recover + existing c_put and c_get routines for subdatabase creation + + Note that the open/create of the file and the open/create of the + subdatabase need to be in the same transaction. + +1.
Open/create (full file and subdb version) + +If create + LOCK_FILEOP + txn_begin + log create message (open message below) + do file system open/create + if we did not create + abort transaction (before going to open_only) + if (!subdb) + set dbp->open_txn = NULL + else + txn_begin a new transaction for the subdb open + + construct meta-data page + log meta-data page (see metapage) + write the meta-data page + * It may be the case that btrees need to log both meta-data pages + and root pages. If that is the case, I believe that we can use + this same record and recovery routines for both + + txn_commit + UNLOCK_FILEOP + +2. Delete + LOCK_FILEOP + txn_begin + log delete message (delete message below) + mv file __db.file.lsn + txn_commit + unlink __db.file.lsn + UNLOCK_FILEOP + +3. Recovery Routines + +__db_fileopen_recover + if (argp->name.size == 0) + done; + + if (redo) /* Commit */ + __os_open(argp->name, DB_OSO_CREATE, argp->mode, &fh) + __os_closehandle(fh) + if (undo) /* Abort */ + if (argp->name exists) + unlink(argp->name); + +__db_metapage_recover + if (redo) + __os_open(argp->name, 0, 0, &fh) + __os_lseek(meta data page) + __os_write(meta data page) + __os_closehandle(fh); + if (undo) + done = 0; + if (argp->name exists) + if (length of argp->name != 0) + __os_open(argp->name, 0, 0, &fh) + __os_lseek(meta data page) + __os_read(meta data page) + if (read succeeds && page lsn != current_lsn) + done = 1 + __os_closehandle(fh); + if (!done) + unlink(argp->name) + +__db_delete_recover + if (redo) + Check if the backup file still exists and if so, delete it.
+ + if (undo) + if (__db_appname(__db.file.lsn exists)) + mv __db_appname(__db.file.lsn) __db_appname(file) + +__db_metasub_recover + /* This is like a normal recovery routine */ + Get the metadata page + if (cmp_n && redo) + copy the log page onto the page + update the lsn + make sure page gets put dirty + else if (cmp_p && undo) + update the lsn to the lsn in the log record + make sure page gets put dirty + + if the page was modified, put it back dirty + +In db.src + +# name: filename (before call to __db_appname) +# mode: file system mode +BEGIN open +DBT name DBT s +ARG mode u_int32_t o +END + +# opcode: indicate if it is a create/delete and if it is a subdatabase +# pgsize: page size on which we're going to write the meta-data page +# pgno: page number on which to write this meta-data page +# page: the actual meta-data page +# lsn: LSN of the meta-data page -- 0 for new databases, may be non-0 +# for subdatabases. + +BEGIN metapage +ARG opcode u_int32_t x +DBT name DBT s +ARG pgno db_pgno_t d +DBT page DBT s +POINTER lsn DB_LSN * lu +END + +# We do not need a subdatabase name here because removing a subdatabase +# name is simply a regular bt_delete operation from the master database. +# It will get logged normally. +# name: filename +BEGIN delete +DBT name DBT s +END + +# We also need to reclaim pages, but we can use the existing +# bt_pg_alloc routines. + +=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- +Testing recoverability of create/delete. + +These tests are unlike other tests in that they are going to +require hooks in the library. The reason is that the create +and delete calls are internally wrapped in a transaction, so +that if the call returns, the transaction has already either +committed or aborted. Using only that interface limits what +kind of testing we can do. To match our other recovery testing +efforts, we need to add hooks to trigger aborts at particular +times in the create/delete path.
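To make the idea of abort hooks concrete, here is a minimal sketch, assuming a test-only global that names the point at which the library should give up. Everything in it is hypothetical (`hook_t`, `test_abort_at`, `hook_fires`, `simulated_create`); it is not the mechanism used by the imported sources, only one way such a hook could be wired.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hook points along the create path. */
typedef enum {
	HOOK_NONE = 0,
	HOOK_POST_LOG,		/* create logged, file not yet opened */
	HOOK_POST_OPEN,		/* immediately after the file system open */
	HOOK_POST_LOGMETA,	/* meta-data page logged */
	HOOK_POST_SYNC		/* page written and synced */
} hook_t;

/* Set by the test harness; HOOK_NONE disables every hook. */
hook_t test_abort_at = HOOK_NONE;

/* Return nonzero if the operation should abort at this point. */
int hook_fires(hook_t here)
{
	return (test_abort_at != HOOK_NONE && test_abort_at == here);
}

/*
 * Walk a simulated create and report the hook at which it stopped,
 * or HOOK_NONE if it ran to completion.
 */
hook_t simulated_create(void)
{
	static const hook_t path[] = {
		HOOK_POST_LOG, HOOK_POST_OPEN,
		HOOK_POST_LOGMETA, HOOK_POST_SYNC
	};
	size_t i;

	for (i = 0; i < sizeof(path) / sizeof(path[0]); i++)
		if (hook_fires(path[i]))
			return (path[i]);	/* a real caller would txn_abort here */
	return (HOOK_NONE);			/* a real caller would txn_commit here */
}
```

A test would set `test_abort_at`, run the operation, and then run recovery against whatever state was left behind; leaving the variable at `HOOK_NONE` keeps normal runs unaffected.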
+ +The general recovery testing strategy is that we wish to +execute every path through every recovery routine. That +means that we try to: + catch each operation in its pre-operation state + call the recovery function with redo + call the recovery function with undo + catch each operation in its post-operation state + call the recovery function with redo + call the recovery function with undo + +In addition, there are a few critical points in the create and +delete path that we want to make sure we capture. + +1. Test Structure + +The test structure should be similar to the existing recovery +tests. We will want to have a structure in place where we +can execute different commands: + create a file/database + create a file that will contain subdatabases. + create a subdatabase + remove a subdatabase (that contains valid data) + remove a subdatabase (that does not contain any data) + remove a file that used to contain subdatabases + remove a file that contains a database + +The tricky part is capturing the state of the world at the +various points in the create/delete process. + +The critical points in the create process are: + + 1. After we've logged the create, but before we've done anything. + in db/db.c + after the open_retry + after the __crdel_fileopen_log call (and before we've + called __os_open). + + 2. Immediately after the __os_open + + 3. Immediately after each __db_log_page call + in bt_open.c + log meta-data page + log root page + in hash.c + log meta-data page + + 4. With respect to the log records above, shortly after each + log write is a memp_fput. We need to do a sync after + each memp_fput and trigger a point after that sync. + +The critical points in the remove process are: + + 1. Right after the crdel_delete_log in db/db.c + + 2. Right after the __os_rename call (below the crdel_delete_log) + + 3. After the __db_remove_callback call. + +I believe that these are the places where we'll need some sort of hook. + +2. Adding hooks to the library.
+ +The hooks need two components. One component is to capture the state of +the database at the hook point and the other is to trigger a txn_abort at +the hook point. The second part is fairly trivial. + +The first part requires more thought. Let me explain what we do in a +"normal" recovery test. In a normal recovery test, we save an initial +copy of the database (this copy is called init). Then we execute one +or more operations. Then, right before the commit/abort, we sync the +file, and save another copy (the afterop copy). Finally, we call txn_commit +or txn_abort, sync the file again, and save the database one last time (the +final copy). + +Then we run recovery. The first time, this should be a no-op, because +we've either committed the transaction and are checking to redo it or +we aborted the transaction, undid it on the abort and are checking to +undo it again. + +We then run recovery again on whatever database will force us through +the path that requires work. In the commit case, this means we start +with the init copy of the database and run recovery. This pushes us +through all the redo paths. In the abort case, we start with the afterop +copy which pushes us through all the undo cases. + +In some sense, we're asking the create/delete test to be more exhaustive +by defining all the trigger points, but I think that's the correct thing +to do, since the create/delete is not initiated by a user transaction. + +So, what do we have to do at the hook points? + 1. sync the file to disk. + 2. save the file itself + 3. save any files named __db_backup_name(name, &backup_name, lsn) + Since we may not know the right lsns, I think we should save + every file of the form __db.name.0xNNNNNNNN.0xNNNNNNNN into + some temporary files from which we can restore it to run + recovery. + +3.
Putting it all together + +So, the three pieces are writing the test structure, putting in the hooks +and then writing the recovery portions so that we restore the right thing +that the hooks saved in order to initiate recovery. + +Some of the technical issues that need to be solved are: + How does the hook code become active (i.e., we don't + want it in there normally, but it's got to be + there when you configure for testing)? + How do you (the test) tell the library that you want a + particular hook to abort? + How do you (the test) tell the library that you want the + hook code doing its copies (do we really want + *every* test doing these copies during testing? + Maybe it's not a big deal, but maybe it is; we + should at least think about it). diff --git a/bdb/db/crdel.src b/bdb/db/crdel.src new file mode 100644 index 00000000000..17c061d6887 --- /dev/null +++ b/bdb/db/crdel.src @@ -0,0 +1,103 @@ +/*- + * See the file LICENSE for redistribution information. + * + * Copyright (c) 1996, 1997, 1998, 1999, 2000 + * Sleepycat Software. All rights reserved. + * + * $Id: crdel.src,v 11.12 2000/12/12 17:41:48 bostic Exp $ + */ + +PREFIX crdel + +INCLUDE #include "db_config.h" +INCLUDE +INCLUDE #ifndef NO_SYSTEM_INCLUDES +INCLUDE #include <sys/types.h> +INCLUDE +INCLUDE #include <ctype.h> +INCLUDE #include <errno.h> +INCLUDE #include <string.h> +INCLUDE #endif +INCLUDE +INCLUDE #include "db_int.h" +INCLUDE #include "db_page.h" +INCLUDE #include "db_dispatch.h" +INCLUDE #include "db_am.h" +INCLUDE #include "txn.h" +INCLUDE + +/* + * Fileopen -- log a potential file create operation + * + * name: filename + * subname: sub database name + * mode: file system mode + */ +BEGIN fileopen 141 +DBT name DBT s +ARG mode u_int32_t o +END + +/* + * Metasub: log the creation of a subdatabase meta data page. + * + * fileid: identifies the file being acted upon. + * pgno: page number on which to write this meta-data page + * page: the actual meta-data page + * lsn: lsn of the page. 
+ */ +BEGIN metasub 142 +ARG fileid int32_t ld +ARG pgno db_pgno_t d +DBT page DBT s +POINTER lsn DB_LSN * lu +END + +/* + * Metapage: log the creation of a meta data page for a new file. + * + * fileid: identifies the file being acted upon. + * name: file containing the page. + * pgno: page number on which to write this meta-data page + * page: the actual meta-data page + */ +BEGIN metapage 143 +ARG fileid int32_t ld +DBT name DBT s +ARG pgno db_pgno_t d +DBT page DBT s +END + +/* + * Delete: remove a file. + * Note that we don't need a special log record for subdatabase + * removes, because we use normal btree operations to remove them. + * + * name: name of the file being removed (relative to DBHOME). + */ +DEPRECATED old_delete 144 +DBT name DBT s +END + +/* + * Rename: rename a file + * We do not need this for subdatabases + * + * name: name of the file being removed (relative to DBHOME). + */ +BEGIN rename 145 +ARG fileid int32_t ld +DBT name DBT s +DBT newname DBT s +END +/* + * Delete: remove a file. + * Note that we don't need a special log record for subdatabase + * removes, because we use normal btree operations to remove them. + * + * name: name of the file being removed (relative to DBHOME). + */ +BEGIN delete 146 +ARG fileid int32_t ld +DBT name DBT s +END diff --git a/bdb/db/crdel_auto.c b/bdb/db/crdel_auto.c new file mode 100644 index 00000000000..f2204410ee8 --- /dev/null +++ b/bdb/db/crdel_auto.c @@ -0,0 +1,900 @@ +/* Do not edit: automatically built by gen_rec.awk. 
*/ +#include "db_config.h" + +#ifndef NO_SYSTEM_INCLUDES +#include <sys/types.h> + +#include <ctype.h> +#include <errno.h> +#include <string.h> +#endif + +#include "db_int.h" +#include "db_page.h" +#include "db_dispatch.h" +#include "db_am.h" +#include "txn.h" + +int +__crdel_fileopen_log(dbenv, txnid, ret_lsnp, flags, + name, mode) + DB_ENV *dbenv; + DB_TXN *txnid; + DB_LSN *ret_lsnp; + u_int32_t flags; + const DBT *name; + u_int32_t mode; +{ + DBT logrec; + DB_LSN *lsnp, null_lsn; + u_int32_t zero; + u_int32_t rectype, txn_num; + int ret; + u_int8_t *bp; + + rectype = DB_crdel_fileopen; + if (txnid != NULL && + TAILQ_FIRST(&txnid->kids) != NULL && + (ret = __txn_activekids(dbenv, rectype, txnid)) != 0) + return (ret); + txn_num = txnid == NULL ? 0 : txnid->txnid; + if (txnid == NULL) { + ZERO_LSN(null_lsn); + lsnp = &null_lsn; + } else + lsnp = &txnid->last_lsn; + logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN) + + sizeof(u_int32_t) + (name == NULL ? 0 : name->size) + + sizeof(mode); + if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0) + return (ret); + + bp = logrec.data; + memcpy(bp, &rectype, sizeof(rectype)); + bp += sizeof(rectype); + memcpy(bp, &txn_num, sizeof(txn_num)); + bp += sizeof(txn_num); + memcpy(bp, lsnp, sizeof(DB_LSN)); + bp += sizeof(DB_LSN); + if (name == NULL) { + zero = 0; + memcpy(bp, &zero, sizeof(u_int32_t)); + bp += sizeof(u_int32_t); + } else { + memcpy(bp, &name->size, sizeof(name->size)); + bp += sizeof(name->size); + memcpy(bp, name->data, name->size); + bp += name->size; + } + memcpy(bp, &mode, sizeof(mode)); + bp += sizeof(mode); + DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size); + ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags); + if (txnid != NULL) + txnid->last_lsn = *ret_lsnp; + __os_free(logrec.data, logrec.size); + return (ret); +} + +int +__crdel_fileopen_print(dbenv, dbtp, lsnp, notused2, notused3) + DB_ENV *dbenv; + DBT *dbtp; + DB_LSN *lsnp; + db_recops 
notused2; + void *notused3; +{ + __crdel_fileopen_args *argp; + u_int32_t i; + u_int ch; + int ret; + + i = 0; + ch = 0; + notused2 = DB_TXN_ABORT; + notused3 = NULL; + + if ((ret = __crdel_fileopen_read(dbenv, dbtp->data, &argp)) != 0) + return (ret); + printf("[%lu][%lu]crdel_fileopen: rec: %lu txnid %lx prevlsn [%lu][%lu]\n", + (u_long)lsnp->file, + (u_long)lsnp->offset, + (u_long)argp->type, + (u_long)argp->txnid->txnid, + (u_long)argp->prev_lsn.file, + (u_long)argp->prev_lsn.offset); + printf("\tname: "); + for (i = 0; i < argp->name.size; i++) { + ch = ((u_int8_t *)argp->name.data)[i]; + if (isprint(ch) || ch == 0xa) + putchar(ch); + else + printf("%#x ", ch); + } + printf("\n"); + printf("\tmode: %o\n", argp->mode); + printf("\n"); + __os_free(argp, 0); + return (0); +} + +int +__crdel_fileopen_read(dbenv, recbuf, argpp) + DB_ENV *dbenv; + void *recbuf; + __crdel_fileopen_args **argpp; +{ + __crdel_fileopen_args *argp; + u_int8_t *bp; + int ret; + + ret = __os_malloc(dbenv, sizeof(__crdel_fileopen_args) + + sizeof(DB_TXN), NULL, &argp); + if (ret != 0) + return (ret); + argp->txnid = (DB_TXN *)&argp[1]; + bp = recbuf; + memcpy(&argp->type, bp, sizeof(argp->type)); + bp += sizeof(argp->type); + memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid)); + bp += sizeof(argp->txnid->txnid); + memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN)); + bp += sizeof(DB_LSN); + memset(&argp->name, 0, sizeof(argp->name)); + memcpy(&argp->name.size, bp, sizeof(u_int32_t)); + bp += sizeof(u_int32_t); + argp->name.data = bp; + bp += argp->name.size; + memcpy(&argp->mode, bp, sizeof(argp->mode)); + bp += sizeof(argp->mode); + *argpp = argp; + return (0); +} + +int +__crdel_metasub_log(dbenv, txnid, ret_lsnp, flags, + fileid, pgno, page, lsn) + DB_ENV *dbenv; + DB_TXN *txnid; + DB_LSN *ret_lsnp; + u_int32_t flags; + int32_t fileid; + db_pgno_t pgno; + const DBT *page; + DB_LSN * lsn; +{ + DBT logrec; + DB_LSN *lsnp, null_lsn; + u_int32_t zero; + u_int32_t rectype, txn_num; + int 
ret; + u_int8_t *bp; + + rectype = DB_crdel_metasub; + if (txnid != NULL && + TAILQ_FIRST(&txnid->kids) != NULL && + (ret = __txn_activekids(dbenv, rectype, txnid)) != 0) + return (ret); + txn_num = txnid == NULL ? 0 : txnid->txnid; + if (txnid == NULL) { + ZERO_LSN(null_lsn); + lsnp = &null_lsn; + } else + lsnp = &txnid->last_lsn; + logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN) + + sizeof(fileid) + + sizeof(pgno) + + sizeof(u_int32_t) + (page == NULL ? 0 : page->size) + + sizeof(*lsn); + if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0) + return (ret); + + bp = logrec.data; + memcpy(bp, &rectype, sizeof(rectype)); + bp += sizeof(rectype); + memcpy(bp, &txn_num, sizeof(txn_num)); + bp += sizeof(txn_num); + memcpy(bp, lsnp, sizeof(DB_LSN)); + bp += sizeof(DB_LSN); + memcpy(bp, &fileid, sizeof(fileid)); + bp += sizeof(fileid); + memcpy(bp, &pgno, sizeof(pgno)); + bp += sizeof(pgno); + if (page == NULL) { + zero = 0; + memcpy(bp, &zero, sizeof(u_int32_t)); + bp += sizeof(u_int32_t); + } else { + memcpy(bp, &page->size, sizeof(page->size)); + bp += sizeof(page->size); + memcpy(bp, page->data, page->size); + bp += page->size; + } + if (lsn != NULL) + memcpy(bp, lsn, sizeof(*lsn)); + else + memset(bp, 0, sizeof(*lsn)); + bp += sizeof(*lsn); + DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size); + ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags); + if (txnid != NULL) + txnid->last_lsn = *ret_lsnp; + __os_free(logrec.data, logrec.size); + return (ret); +} + +int +__crdel_metasub_print(dbenv, dbtp, lsnp, notused2, notused3) + DB_ENV *dbenv; + DBT *dbtp; + DB_LSN *lsnp; + db_recops notused2; + void *notused3; +{ + __crdel_metasub_args *argp; + u_int32_t i; + u_int ch; + int ret; + + i = 0; + ch = 0; + notused2 = DB_TXN_ABORT; + notused3 = NULL; + + if ((ret = __crdel_metasub_read(dbenv, dbtp->data, &argp)) != 0) + return (ret); + printf("[%lu][%lu]crdel_metasub: rec: %lu txnid %lx prevlsn [%lu][%lu]\n", + 
(u_long)lsnp->file, + (u_long)lsnp->offset, + (u_long)argp->type, + (u_long)argp->txnid->txnid, + (u_long)argp->prev_lsn.file, + (u_long)argp->prev_lsn.offset); + printf("\tfileid: %ld\n", (long)argp->fileid); + printf("\tpgno: %d\n", argp->pgno); + printf("\tpage: "); + for (i = 0; i < argp->page.size; i++) { + ch = ((u_int8_t *)argp->page.data)[i]; + if (isprint(ch) || ch == 0xa) + putchar(ch); + else + printf("%#x ", ch); + } + printf("\n"); + printf("\tlsn: [%lu][%lu]\n", + (u_long)argp->lsn.file, (u_long)argp->lsn.offset); + printf("\n"); + __os_free(argp, 0); + return (0); +} + +int +__crdel_metasub_read(dbenv, recbuf, argpp) + DB_ENV *dbenv; + void *recbuf; + __crdel_metasub_args **argpp; +{ + __crdel_metasub_args *argp; + u_int8_t *bp; + int ret; + + ret = __os_malloc(dbenv, sizeof(__crdel_metasub_args) + + sizeof(DB_TXN), NULL, &argp); + if (ret != 0) + return (ret); + argp->txnid = (DB_TXN *)&argp[1]; + bp = recbuf; + memcpy(&argp->type, bp, sizeof(argp->type)); + bp += sizeof(argp->type); + memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid)); + bp += sizeof(argp->txnid->txnid); + memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN)); + bp += sizeof(DB_LSN); + memcpy(&argp->fileid, bp, sizeof(argp->fileid)); + bp += sizeof(argp->fileid); + memcpy(&argp->pgno, bp, sizeof(argp->pgno)); + bp += sizeof(argp->pgno); + memset(&argp->page, 0, sizeof(argp->page)); + memcpy(&argp->page.size, bp, sizeof(u_int32_t)); + bp += sizeof(u_int32_t); + argp->page.data = bp; + bp += argp->page.size; + memcpy(&argp->lsn, bp, sizeof(argp->lsn)); + bp += sizeof(argp->lsn); + *argpp = argp; + return (0); +} + +int +__crdel_metapage_log(dbenv, txnid, ret_lsnp, flags, + fileid, name, pgno, page) + DB_ENV *dbenv; + DB_TXN *txnid; + DB_LSN *ret_lsnp; + u_int32_t flags; + int32_t fileid; + const DBT *name; + db_pgno_t pgno; + const DBT *page; +{ + DBT logrec; + DB_LSN *lsnp, null_lsn; + u_int32_t zero; + u_int32_t rectype, txn_num; + int ret; + u_int8_t *bp; + + rectype = 
DB_crdel_metapage; + if (txnid != NULL && + TAILQ_FIRST(&txnid->kids) != NULL && + (ret = __txn_activekids(dbenv, rectype, txnid)) != 0) + return (ret); + txn_num = txnid == NULL ? 0 : txnid->txnid; + if (txnid == NULL) { + ZERO_LSN(null_lsn); + lsnp = &null_lsn; + } else + lsnp = &txnid->last_lsn; + logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN) + + sizeof(fileid) + + sizeof(u_int32_t) + (name == NULL ? 0 : name->size) + + sizeof(pgno) + + sizeof(u_int32_t) + (page == NULL ? 0 : page->size); + if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0) + return (ret); + + bp = logrec.data; + memcpy(bp, &rectype, sizeof(rectype)); + bp += sizeof(rectype); + memcpy(bp, &txn_num, sizeof(txn_num)); + bp += sizeof(txn_num); + memcpy(bp, lsnp, sizeof(DB_LSN)); + bp += sizeof(DB_LSN); + memcpy(bp, &fileid, sizeof(fileid)); + bp += sizeof(fileid); + if (name == NULL) { + zero = 0; + memcpy(bp, &zero, sizeof(u_int32_t)); + bp += sizeof(u_int32_t); + } else { + memcpy(bp, &name->size, sizeof(name->size)); + bp += sizeof(name->size); + memcpy(bp, name->data, name->size); + bp += name->size; + } + memcpy(bp, &pgno, sizeof(pgno)); + bp += sizeof(pgno); + if (page == NULL) { + zero = 0; + memcpy(bp, &zero, sizeof(u_int32_t)); + bp += sizeof(u_int32_t); + } else { + memcpy(bp, &page->size, sizeof(page->size)); + bp += sizeof(page->size); + memcpy(bp, page->data, page->size); + bp += page->size; + } + DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size); + ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags); + if (txnid != NULL) + txnid->last_lsn = *ret_lsnp; + __os_free(logrec.data, logrec.size); + return (ret); +} + +int +__crdel_metapage_print(dbenv, dbtp, lsnp, notused2, notused3) + DB_ENV *dbenv; + DBT *dbtp; + DB_LSN *lsnp; + db_recops notused2; + void *notused3; +{ + __crdel_metapage_args *argp; + u_int32_t i; + u_int ch; + int ret; + + i = 0; + ch = 0; + notused2 = DB_TXN_ABORT; + notused3 = NULL; + + if ((ret = 
__crdel_metapage_read(dbenv, dbtp->data, &argp)) != 0) + return (ret); + printf("[%lu][%lu]crdel_metapage: rec: %lu txnid %lx prevlsn [%lu][%lu]\n", + (u_long)lsnp->file, + (u_long)lsnp->offset, + (u_long)argp->type, + (u_long)argp->txnid->txnid, + (u_long)argp->prev_lsn.file, + (u_long)argp->prev_lsn.offset); + printf("\tfileid: %ld\n", (long)argp->fileid); + printf("\tname: "); + for (i = 0; i < argp->name.size; i++) { + ch = ((u_int8_t *)argp->name.data)[i]; + if (isprint(ch) || ch == 0xa) + putchar(ch); + else + printf("%#x ", ch); + } + printf("\n"); + printf("\tpgno: %d\n", argp->pgno); + printf("\tpage: "); + for (i = 0; i < argp->page.size; i++) { + ch = ((u_int8_t *)argp->page.data)[i]; + if (isprint(ch) || ch == 0xa) + putchar(ch); + else + printf("%#x ", ch); + } + printf("\n"); + printf("\n"); + __os_free(argp, 0); + return (0); +} + +int +__crdel_metapage_read(dbenv, recbuf, argpp) + DB_ENV *dbenv; + void *recbuf; + __crdel_metapage_args **argpp; +{ + __crdel_metapage_args *argp; + u_int8_t *bp; + int ret; + + ret = __os_malloc(dbenv, sizeof(__crdel_metapage_args) + + sizeof(DB_TXN), NULL, &argp); + if (ret != 0) + return (ret); + argp->txnid = (DB_TXN *)&argp[1]; + bp = recbuf; + memcpy(&argp->type, bp, sizeof(argp->type)); + bp += sizeof(argp->type); + memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid)); + bp += sizeof(argp->txnid->txnid); + memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN)); + bp += sizeof(DB_LSN); + memcpy(&argp->fileid, bp, sizeof(argp->fileid)); + bp += sizeof(argp->fileid); + memset(&argp->name, 0, sizeof(argp->name)); + memcpy(&argp->name.size, bp, sizeof(u_int32_t)); + bp += sizeof(u_int32_t); + argp->name.data = bp; + bp += argp->name.size; + memcpy(&argp->pgno, bp, sizeof(argp->pgno)); + bp += sizeof(argp->pgno); + memset(&argp->page, 0, sizeof(argp->page)); + memcpy(&argp->page.size, bp, sizeof(u_int32_t)); + bp += sizeof(u_int32_t); + argp->page.data = bp; + bp += argp->page.size; + *argpp = argp; + return (0); +} + +int 
+__crdel_old_delete_print(dbenv, dbtp, lsnp, notused2, notused3) + DB_ENV *dbenv; + DBT *dbtp; + DB_LSN *lsnp; + db_recops notused2; + void *notused3; +{ + __crdel_old_delete_args *argp; + u_int32_t i; + u_int ch; + int ret; + + i = 0; + ch = 0; + notused2 = DB_TXN_ABORT; + notused3 = NULL; + + if ((ret = __crdel_old_delete_read(dbenv, dbtp->data, &argp)) != 0) + return (ret); + printf("[%lu][%lu]crdel_old_delete: rec: %lu txnid %lx prevlsn [%lu][%lu]\n", + (u_long)lsnp->file, + (u_long)lsnp->offset, + (u_long)argp->type, + (u_long)argp->txnid->txnid, + (u_long)argp->prev_lsn.file, + (u_long)argp->prev_lsn.offset); + printf("\tname: "); + for (i = 0; i < argp->name.size; i++) { + ch = ((u_int8_t *)argp->name.data)[i]; + if (isprint(ch) || ch == 0xa) + putchar(ch); + else + printf("%#x ", ch); + } + printf("\n"); + printf("\n"); + __os_free(argp, 0); + return (0); +} + +int +__crdel_old_delete_read(dbenv, recbuf, argpp) + DB_ENV *dbenv; + void *recbuf; + __crdel_old_delete_args **argpp; +{ + __crdel_old_delete_args *argp; + u_int8_t *bp; + int ret; + + ret = __os_malloc(dbenv, sizeof(__crdel_old_delete_args) + + sizeof(DB_TXN), NULL, &argp); + if (ret != 0) + return (ret); + argp->txnid = (DB_TXN *)&argp[1]; + bp = recbuf; + memcpy(&argp->type, bp, sizeof(argp->type)); + bp += sizeof(argp->type); + memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid)); + bp += sizeof(argp->txnid->txnid); + memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN)); + bp += sizeof(DB_LSN); + memset(&argp->name, 0, sizeof(argp->name)); + memcpy(&argp->name.size, bp, sizeof(u_int32_t)); + bp += sizeof(u_int32_t); + argp->name.data = bp; + bp += argp->name.size; + *argpp = argp; + return (0); +} + +int +__crdel_rename_log(dbenv, txnid, ret_lsnp, flags, + fileid, name, newname) + DB_ENV *dbenv; + DB_TXN *txnid; + DB_LSN *ret_lsnp; + u_int32_t flags; + int32_t fileid; + const DBT *name; + const DBT *newname; +{ + DBT logrec; + DB_LSN *lsnp, null_lsn; + u_int32_t zero; + u_int32_t rectype, txn_num; 
+ int ret; + u_int8_t *bp; + + rectype = DB_crdel_rename; + if (txnid != NULL && + TAILQ_FIRST(&txnid->kids) != NULL && + (ret = __txn_activekids(dbenv, rectype, txnid)) != 0) + return (ret); + txn_num = txnid == NULL ? 0 : txnid->txnid; + if (txnid == NULL) { + ZERO_LSN(null_lsn); + lsnp = &null_lsn; + } else + lsnp = &txnid->last_lsn; + logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN) + + sizeof(fileid) + + sizeof(u_int32_t) + (name == NULL ? 0 : name->size) + + sizeof(u_int32_t) + (newname == NULL ? 0 : newname->size); + if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0) + return (ret); + + bp = logrec.data; + memcpy(bp, &rectype, sizeof(rectype)); + bp += sizeof(rectype); + memcpy(bp, &txn_num, sizeof(txn_num)); + bp += sizeof(txn_num); + memcpy(bp, lsnp, sizeof(DB_LSN)); + bp += sizeof(DB_LSN); + memcpy(bp, &fileid, sizeof(fileid)); + bp += sizeof(fileid); + if (name == NULL) { + zero = 0; + memcpy(bp, &zero, sizeof(u_int32_t)); + bp += sizeof(u_int32_t); + } else { + memcpy(bp, &name->size, sizeof(name->size)); + bp += sizeof(name->size); + memcpy(bp, name->data, name->size); + bp += name->size; + } + if (newname == NULL) { + zero = 0; + memcpy(bp, &zero, sizeof(u_int32_t)); + bp += sizeof(u_int32_t); + } else { + memcpy(bp, &newname->size, sizeof(newname->size)); + bp += sizeof(newname->size); + memcpy(bp, newname->data, newname->size); + bp += newname->size; + } + DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size); + ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags); + if (txnid != NULL) + txnid->last_lsn = *ret_lsnp; + __os_free(logrec.data, logrec.size); + return (ret); +} + +int +__crdel_rename_print(dbenv, dbtp, lsnp, notused2, notused3) + DB_ENV *dbenv; + DBT *dbtp; + DB_LSN *lsnp; + db_recops notused2; + void *notused3; +{ + __crdel_rename_args *argp; + u_int32_t i; + u_int ch; + int ret; + + i = 0; + ch = 0; + notused2 = DB_TXN_ABORT; + notused3 = NULL; + + if ((ret = __crdel_rename_read(dbenv, 
dbtp->data, &argp)) != 0) + return (ret); + printf("[%lu][%lu]crdel_rename: rec: %lu txnid %lx prevlsn [%lu][%lu]\n", + (u_long)lsnp->file, + (u_long)lsnp->offset, + (u_long)argp->type, + (u_long)argp->txnid->txnid, + (u_long)argp->prev_lsn.file, + (u_long)argp->prev_lsn.offset); + printf("\tfileid: %ld\n", (long)argp->fileid); + printf("\tname: "); + for (i = 0; i < argp->name.size; i++) { + ch = ((u_int8_t *)argp->name.data)[i]; + if (isprint(ch) || ch == 0xa) + putchar(ch); + else + printf("%#x ", ch); + } + printf("\n"); + printf("\tnewname: "); + for (i = 0; i < argp->newname.size; i++) { + ch = ((u_int8_t *)argp->newname.data)[i]; + if (isprint(ch) || ch == 0xa) + putchar(ch); + else + printf("%#x ", ch); + } + printf("\n"); + printf("\n"); + __os_free(argp, 0); + return (0); +} + +int +__crdel_rename_read(dbenv, recbuf, argpp) + DB_ENV *dbenv; + void *recbuf; + __crdel_rename_args **argpp; +{ + __crdel_rename_args *argp; + u_int8_t *bp; + int ret; + + ret = __os_malloc(dbenv, sizeof(__crdel_rename_args) + + sizeof(DB_TXN), NULL, &argp); + if (ret != 0) + return (ret); + argp->txnid = (DB_TXN *)&argp[1]; + bp = recbuf; + memcpy(&argp->type, bp, sizeof(argp->type)); + bp += sizeof(argp->type); + memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid)); + bp += sizeof(argp->txnid->txnid); + memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN)); + bp += sizeof(DB_LSN); + memcpy(&argp->fileid, bp, sizeof(argp->fileid)); + bp += sizeof(argp->fileid); + memset(&argp->name, 0, sizeof(argp->name)); + memcpy(&argp->name.size, bp, sizeof(u_int32_t)); + bp += sizeof(u_int32_t); + argp->name.data = bp; + bp += argp->name.size; + memset(&argp->newname, 0, sizeof(argp->newname)); + memcpy(&argp->newname.size, bp, sizeof(u_int32_t)); + bp += sizeof(u_int32_t); + argp->newname.data = bp; + bp += argp->newname.size; + *argpp = argp; + return (0); +} + +int +__crdel_delete_log(dbenv, txnid, ret_lsnp, flags, + fileid, name) + DB_ENV *dbenv; + DB_TXN *txnid; + DB_LSN *ret_lsnp; + 
u_int32_t flags; + int32_t fileid; + const DBT *name; +{ + DBT logrec; + DB_LSN *lsnp, null_lsn; + u_int32_t zero; + u_int32_t rectype, txn_num; + int ret; + u_int8_t *bp; + + rectype = DB_crdel_delete; + if (txnid != NULL && + TAILQ_FIRST(&txnid->kids) != NULL && + (ret = __txn_activekids(dbenv, rectype, txnid)) != 0) + return (ret); + txn_num = txnid == NULL ? 0 : txnid->txnid; + if (txnid == NULL) { + ZERO_LSN(null_lsn); + lsnp = &null_lsn; + } else + lsnp = &txnid->last_lsn; + logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN) + + sizeof(fileid) + + sizeof(u_int32_t) + (name == NULL ? 0 : name->size); + if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0) + return (ret); + + bp = logrec.data; + memcpy(bp, &rectype, sizeof(rectype)); + bp += sizeof(rectype); + memcpy(bp, &txn_num, sizeof(txn_num)); + bp += sizeof(txn_num); + memcpy(bp, lsnp, sizeof(DB_LSN)); + bp += sizeof(DB_LSN); + memcpy(bp, &fileid, sizeof(fileid)); + bp += sizeof(fileid); + if (name == NULL) { + zero = 0; + memcpy(bp, &zero, sizeof(u_int32_t)); + bp += sizeof(u_int32_t); + } else { + memcpy(bp, &name->size, sizeof(name->size)); + bp += sizeof(name->size); + memcpy(bp, name->data, name->size); + bp += name->size; + } + DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size); + ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags); + if (txnid != NULL) + txnid->last_lsn = *ret_lsnp; + __os_free(logrec.data, logrec.size); + return (ret); +} + +int +__crdel_delete_print(dbenv, dbtp, lsnp, notused2, notused3) + DB_ENV *dbenv; + DBT *dbtp; + DB_LSN *lsnp; + db_recops notused2; + void *notused3; +{ + __crdel_delete_args *argp; + u_int32_t i; + u_int ch; + int ret; + + i = 0; + ch = 0; + notused2 = DB_TXN_ABORT; + notused3 = NULL; + + if ((ret = __crdel_delete_read(dbenv, dbtp->data, &argp)) != 0) + return (ret); + printf("[%lu][%lu]crdel_delete: rec: %lu txnid %lx prevlsn [%lu][%lu]\n", + (u_long)lsnp->file, + (u_long)lsnp->offset, + (u_long)argp->type, + 
(u_long)argp->txnid->txnid, + (u_long)argp->prev_lsn.file, + (u_long)argp->prev_lsn.offset); + printf("\tfileid: %ld\n", (long)argp->fileid); + printf("\tname: "); + for (i = 0; i < argp->name.size; i++) { + ch = ((u_int8_t *)argp->name.data)[i]; + if (isprint(ch) || ch == 0xa) + putchar(ch); + else + printf("%#x ", ch); + } + printf("\n"); + printf("\n"); + __os_free(argp, 0); + return (0); +} + +int +__crdel_delete_read(dbenv, recbuf, argpp) + DB_ENV *dbenv; + void *recbuf; + __crdel_delete_args **argpp; +{ + __crdel_delete_args *argp; + u_int8_t *bp; + int ret; + + ret = __os_malloc(dbenv, sizeof(__crdel_delete_args) + + sizeof(DB_TXN), NULL, &argp); + if (ret != 0) + return (ret); + argp->txnid = (DB_TXN *)&argp[1]; + bp = recbuf; + memcpy(&argp->type, bp, sizeof(argp->type)); + bp += sizeof(argp->type); + memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid)); + bp += sizeof(argp->txnid->txnid); + memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN)); + bp += sizeof(DB_LSN); + memcpy(&argp->fileid, bp, sizeof(argp->fileid)); + bp += sizeof(argp->fileid); + memset(&argp->name, 0, sizeof(argp->name)); + memcpy(&argp->name.size, bp, sizeof(u_int32_t)); + bp += sizeof(u_int32_t); + argp->name.data = bp; + bp += argp->name.size; + *argpp = argp; + return (0); +} + +int +__crdel_init_print(dbenv) + DB_ENV *dbenv; +{ + int ret; + + if ((ret = __db_add_recovery(dbenv, + __crdel_fileopen_print, DB_crdel_fileopen)) != 0) + return (ret); + if ((ret = __db_add_recovery(dbenv, + __crdel_metasub_print, DB_crdel_metasub)) != 0) + return (ret); + if ((ret = __db_add_recovery(dbenv, + __crdel_metapage_print, DB_crdel_metapage)) != 0) + return (ret); + if ((ret = __db_add_recovery(dbenv, + __crdel_old_delete_print, DB_crdel_old_delete)) != 0) + return (ret); + if ((ret = __db_add_recovery(dbenv, + __crdel_rename_print, DB_crdel_rename)) != 0) + return (ret); + if ((ret = __db_add_recovery(dbenv, + __crdel_delete_print, DB_crdel_delete)) != 0) + return (ret); + return (0); +} + +int 
+__crdel_init_recover(dbenv) + DB_ENV *dbenv; +{ + int ret; + + if ((ret = __db_add_recovery(dbenv, + __crdel_fileopen_recover, DB_crdel_fileopen)) != 0) + return (ret); + if ((ret = __db_add_recovery(dbenv, + __crdel_metasub_recover, DB_crdel_metasub)) != 0) + return (ret); + if ((ret = __db_add_recovery(dbenv, + __crdel_metapage_recover, DB_crdel_metapage)) != 0) + return (ret); + if ((ret = __db_add_recovery(dbenv, + __deprecated_recover, DB_crdel_old_delete)) != 0) + return (ret); + if ((ret = __db_add_recovery(dbenv, + __crdel_rename_recover, DB_crdel_rename)) != 0) + return (ret); + if ((ret = __db_add_recovery(dbenv, + __crdel_delete_recover, DB_crdel_delete)) != 0) + return (ret); + return (0); +} + diff --git a/bdb/db/crdel_rec.c b/bdb/db/crdel_rec.c new file mode 100644 index 00000000000..495b92a0ad7 --- /dev/null +++ b/bdb/db/crdel_rec.c @@ -0,0 +1,646 @@ +/*- + * See the file LICENSE for redistribution information. + * + * Copyright (c) 1996, 1997, 1998, 1999, 2000 + * Sleepycat Software. All rights reserved. + */ + +#include "db_config.h" + +#ifndef lint +static const char revid[] = "$Id: crdel_rec.c,v 11.43 2000/12/13 08:06:34 krinsky Exp $"; +#endif /* not lint */ + +#ifndef NO_SYSTEM_INCLUDES +#include <sys/types.h> + +#include <string.h> +#endif + +#include "db_int.h" +#include "db_page.h" +#include "log.h" +#include "hash.h" +#include "mp.h" +#include "db_dispatch.h" + +/* + * __crdel_fileopen_recover -- + * Recovery function for fileopen. 
+ * + * PUBLIC: int __crdel_fileopen_recover + * PUBLIC: __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *)); + */ +int +__crdel_fileopen_recover(dbenv, dbtp, lsnp, op, info) + DB_ENV *dbenv; + DBT *dbtp; + DB_LSN *lsnp; + db_recops op; + void *info; +{ + __crdel_fileopen_args *argp; + DBMETA ondisk; + DB_FH fh; + size_t nr; + int do_unlink, ret; + u_int32_t b, mb, io; + char *real_name; + + COMPQUIET(info, NULL); + + real_name = NULL; + REC_PRINT(__crdel_fileopen_print); + + if ((ret = __crdel_fileopen_read(dbenv, dbtp->data, &argp)) != 0) + goto out; + /* + * If this is an in-memory database, then the name is going to + * be NULL, which looks like a 0-length name in recovery. + */ + if (argp->name.size == 0) + goto done; + + if ((ret = __db_appname(dbenv, DB_APP_DATA, + NULL, argp->name.data, 0, NULL, &real_name)) != 0) + goto out; + if (DB_REDO(op)) { + /* + * The create committed, so we need to make sure that the file + * exists. A simple open should suffice. + */ + if ((ret = __os_open(dbenv, real_name, + DB_OSO_CREATE, argp->mode, &fh)) != 0) + goto out; + if ((ret = __os_closehandle(&fh)) != 0) + goto out; + } else if (DB_UNDO(op)) { + /* + * If the file is 0-length then it was in the process of being + * created, so we should unlink it. If it is non-0 length, then + * either someone else created it and we need to leave it + * untouched or we were in the process of creating it, allocated + * the first page on a system that requires you to actually + * write pages as you allocate them, but never got any data + * on it. + * If the file doesn't exist, we never got around to creating + * it, so that's fine. + */ + if (__os_exists(real_name, NULL) != 0) + goto done; + + if ((ret = __os_open(dbenv, real_name, 0, 0, &fh)) != 0) + goto out; + if ((ret = __os_ioinfo(dbenv, + real_name, &fh, &mb, &b, &io)) != 0) + goto out; + do_unlink = 0; + if (mb != 0 || b != 0) { + /* + * We need to read the first page + * to see if it's got valid data on it. 
+ */ + if ((ret = __os_read(dbenv, &fh, + &ondisk, sizeof(ondisk), &nr)) != 0 || + nr != sizeof(ondisk)) + goto out; + if (ondisk.magic == 0) + do_unlink = 1; + } + if ((ret = __os_closehandle(&fh)) != 0) + goto out; + /* Check for 0-length and if it is, delete it. */ + if (do_unlink || (mb == 0 && b == 0)) + if ((ret = __os_unlink(dbenv, real_name)) != 0) + goto out; + } + +done: *lsnp = argp->prev_lsn; + ret = 0; + +out: if (argp != NULL) + __os_free(argp, 0); + if (real_name != NULL) + __os_freestr(real_name); + return (ret); +} + +/* + * __crdel_metasub_recover -- + * Recovery function for metasub. + * + * PUBLIC: int __crdel_metasub_recover + * PUBLIC: __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *)); + */ +int +__crdel_metasub_recover(dbenv, dbtp, lsnp, op, info) + DB_ENV *dbenv; + DBT *dbtp; + DB_LSN *lsnp; + db_recops op; + void *info; +{ + __crdel_metasub_args *argp; + DB *file_dbp; + DBC *dbc; + DB_MPOOLFILE *mpf; + PAGE *pagep; + u_int8_t *file_uid, ptype; + int cmp_p, modified, reopen, ret; + + COMPQUIET(info, NULL); + REC_PRINT(__crdel_metasub_print); + REC_INTRO(__crdel_metasub_read, 0); + + if ((ret = memp_fget(mpf, &argp->pgno, 0, &pagep)) != 0) { + if (DB_REDO(op)) { + if ((ret = memp_fget(mpf, + &argp->pgno, DB_MPOOL_CREATE, &pagep)) != 0) + goto out; + } else { + *lsnp = argp->prev_lsn; + ret = 0; + goto out; + } + } + + modified = 0; + reopen = 0; + cmp_p = log_compare(&LSN(pagep), &argp->lsn); + CHECK_LSN(op, cmp_p, &LSN(pagep), &argp->lsn); + + if (cmp_p == 0 && DB_REDO(op)) { + memcpy(pagep, argp->page.data, argp->page.size); + LSN(pagep) = *lsnp; + modified = 1; + /* + * If this is a meta-data page, then we must reopen; + * if it was a root page, then we do not. + */ + ptype = ((DBMETA *)argp->page.data)->type; + if (ptype == P_HASHMETA || ptype == P_BTREEMETA || + ptype == P_QAMMETA) + reopen = 1; + } else if (DB_UNDO(op)) { + /* + * We want to undo this page creation. The page creation + * happened in two parts. 
First, we called __bam_new which + * was logged separately. Then we wrote the meta-data onto + * the page. So long as we restore the LSN, then the recovery + * for __bam_new will do everything else. + * Don't bother checking the lsn on the page. If we + * are rolling back the next thing is that this page + * will get freed. Opening the subdb will have reinitialized + * the page, but not the lsn. + */ + LSN(pagep) = argp->lsn; + modified = 1; + } + if ((ret = memp_fput(mpf, pagep, modified ? DB_MPOOL_DIRTY : 0)) != 0) + goto out; + + /* + * If we are redoing a subdatabase create, we must close and reopen the + * file to be sure that we have the proper meta information in the + * in-memory structures + */ + if (reopen) { + /* Close cursor if it's open. */ + if (dbc != NULL) { + dbc->c_close(dbc); + dbc = NULL; + } + + if ((ret = __os_malloc(dbenv, + DB_FILE_ID_LEN, NULL, &file_uid)) != 0) + goto out; + memcpy(file_uid, &file_dbp->fileid[0], DB_FILE_ID_LEN); + ret = __log_reopen_file(dbenv, + NULL, argp->fileid, file_uid, argp->pgno); + (void)__os_free(file_uid, DB_FILE_ID_LEN); + if (ret != 0) + goto out; + } + +done: *lsnp = argp->prev_lsn; + ret = 0; + +out: REC_CLOSE; +} + +/* + * __crdel_metapage_recover -- + * Recovery function for metapage. 
+ * + * PUBLIC: int __crdel_metapage_recover + * PUBLIC: __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *)); + */ +int +__crdel_metapage_recover(dbenv, dbtp, lsnp, op, info) + DB_ENV *dbenv; + DBT *dbtp; + DB_LSN *lsnp; + db_recops op; + void *info; +{ + __crdel_metapage_args *argp; + DB *dbp; + DBMETA *meta, ondisk; + DB_FH fh; + size_t nr; + u_int32_t b, io, mb, pagesize; + int is_done, ret; + char *real_name; + + COMPQUIET(info, NULL); + + real_name = NULL; + memset(&fh, 0, sizeof(fh)); + REC_PRINT(__crdel_metapage_print); + + if ((ret = __crdel_metapage_read(dbenv, dbtp->data, &argp)) != 0) + goto out; + + /* + * If this is an in-memory database, then the name is going to + * be NULL, which looks like a 0-length name in recovery. + */ + if (argp->name.size == 0) + goto done; + + meta = (DBMETA *)argp->page.data; + __ua_memcpy(&pagesize, &meta->pagesize, sizeof(pagesize)); + + if ((ret = __db_appname(dbenv, DB_APP_DATA, + NULL, argp->name.data, 0, NULL, &real_name)) != 0) + goto out; + if (DB_REDO(op)) { + if ((ret = __db_fileid_to_db(dbenv, + &dbp, argp->fileid, 0)) != 0) { + if (ret == DB_DELETED) + goto done; + else + goto out; + } + + /* + * We simply read the first page and if the LSN is 0, we + * write the meta-data page. + */ + if ((ret = __os_open(dbenv, real_name, 0, 0, &fh)) != 0) + goto out; + if ((ret = __os_seek(dbenv, &fh, + pagesize, argp->pgno, 0, 0, DB_OS_SEEK_SET)) != 0) + goto out; + /* + * If the read succeeds then the page exists, then we need + * to verify that the page has actually been written, because + * on some systems (e.g., Windows) we preallocate pages because + * files aren't allowed to have holes in them. If the page + * looks good then we're done. 
+ */ + if ((ret = __os_read(dbenv, &fh, &ondisk, + sizeof(ondisk), &nr)) == 0 && nr == sizeof(ondisk)) { + if (ondisk.magic != 0) + goto done; + if ((ret = __os_seek(dbenv, &fh, + pagesize, argp->pgno, 0, 0, DB_OS_SEEK_SET)) != 0) + goto out; + } + + /* + * Page didn't exist, update the LSN and write a new one. + * (seek pointer shouldn't have moved) + */ + __ua_memcpy(&meta->lsn, lsnp, sizeof(DB_LSN)); + if ((ret = __os_write(dbp->dbenv, &fh, + argp->page.data, argp->page.size, &nr)) != 0) + goto out; + if (nr != (size_t)argp->page.size) { + __db_err(dbenv, "Write failed during recovery"); + ret = EIO; + goto out; + } + + /* + * We must close and reopen the file to be sure + * that we have the proper meta information + * in the in memory structures + */ + + if ((ret = __log_reopen_file(dbenv, + argp->name.data, argp->fileid, + meta->uid, argp->pgno)) != 0) + goto out; + + /* Handle will be closed on exit. */ + } else if (DB_UNDO(op)) { + is_done = 0; + + /* If file does not exist, there is nothing to undo. */ + if (__os_exists(real_name, NULL) != 0) + goto done; + + /* + * Before we can look at anything on disk, we have to check + * if there is a valid dbp for this, and if there is, we'd + * better flush it. + */ + dbp = NULL; + if ((ret = + __db_fileid_to_db(dbenv, &dbp, argp->fileid, 0)) == 0) + (void)dbp->sync(dbp, 0); + + /* + * We need to make sure that we do not remove a file that + * someone else created. If the file is 0-length, then we + * can assume that we created it and remove it. If it is + * not 0-length, then we need to check the LSN and make + * sure that it's the file we created. + */ + if ((ret = __os_open(dbenv, real_name, 0, 0, &fh)) != 0) + goto out; + if ((ret = __os_ioinfo(dbenv, + real_name, &fh, &mb, &b, &io)) != 0) + goto out; + if (mb != 0 || b != 0) { + /* The file has something in it. 
*/ + if ((ret = __os_seek(dbenv, &fh, + pagesize, argp->pgno, 0, 0, DB_OS_SEEK_SET)) != 0) + goto out; + if ((ret = __os_read(dbenv, &fh, + &ondisk, sizeof(ondisk), &nr)) != 0) + goto out; + if (log_compare(&ondisk.lsn, lsnp) != 0) + is_done = 1; + } + + /* + * Must close here, because unlink with the file open fails + * on some systems. + */ + if ((ret = __os_closehandle(&fh)) != 0) + goto out; + + if (!is_done) { + /* + * On some systems, you cannot unlink an open file so + * we close the fd in the dbp here and make sure we + * don't try to close it again. First, check for a + * saved_open_fhp, then close down the mpool. + */ + if (dbp != NULL && dbp->saved_open_fhp != NULL && + F_ISSET(dbp->saved_open_fhp, DB_FH_VALID) && + (ret = __os_closehandle(dbp->saved_open_fhp)) != 0) + goto out; + if (dbp != NULL && dbp->mpf != NULL) { + (void)__memp_fremove(dbp->mpf); + if ((ret = memp_fclose(dbp->mpf)) != 0) + goto out; + F_SET(dbp, DB_AM_DISCARD); + dbp->mpf = NULL; + } + if ((ret = __os_unlink(dbenv, real_name)) != 0) + goto out; + } + } + +done: *lsnp = argp->prev_lsn; + ret = 0; + +out: if (argp != NULL) + __os_free(argp, 0); + if (real_name != NULL) + __os_freestr(real_name); + if (F_ISSET(&fh, DB_FH_VALID)) + (void)__os_closehandle(&fh); + return (ret); +} + +/* + * __crdel_delete_recover -- + * Recovery function for delete. + * + * PUBLIC: int __crdel_delete_recover + * PUBLIC: __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *)); + */ +int +__crdel_delete_recover(dbenv, dbtp, lsnp, op, info) + DB_ENV *dbenv; + DBT *dbtp; + DB_LSN *lsnp; + db_recops op; + void *info; +{ + DB *dbp; + __crdel_delete_args *argp; + int ret; + char *backup, *real_back, *real_name; + + REC_PRINT(__crdel_delete_print); + + backup = real_back = real_name = NULL; + if ((ret = __crdel_delete_read(dbenv, dbtp->data, &argp)) != 0) + goto out; + + if (DB_REDO(op)) { + /* + * On a recovery, as we recreate what was going on, we + * recreate the creation of the file. 
And so, even though + * it committed, we need to delete it. Try to delete it, + * but it is not an error if that delete fails. + */ + if ((ret = __db_appname(dbenv, DB_APP_DATA, + NULL, argp->name.data, 0, NULL, &real_name)) != 0) + goto out; + if (__os_exists(real_name, NULL) == 0) { + /* + * If a file is deleted and then recreated, it's + * possible for the __os_exists call above to + * return success and for us to get here, but for + * the fileid we're looking for to be marked + * deleted. In that case, we needn't redo the + * unlink even though the file exists, and it's + * not an error. + */ + ret = __db_fileid_to_db(dbenv, &dbp, argp->fileid, 0); + if (ret == 0) { + /* + * On Windows, the underlying file must be + * closed to perform a remove. + */ + (void)__memp_fremove(dbp->mpf); + if ((ret = memp_fclose(dbp->mpf)) != 0) + goto out; + dbp->mpf = NULL; + if ((ret = __os_unlink(dbenv, real_name)) != 0) + goto out; + } else if (ret != DB_DELETED) + goto out; + } + /* + * The transaction committed, so the only thing that might + * be true is that the backup file is still around. Try + * to delete it, but it's not an error if that delete fails. + */ + if ((ret = __db_backup_name(dbenv, argp->name.data, + &backup, lsnp)) != 0) + goto out; + if ((ret = __db_appname(dbenv, + DB_APP_DATA, NULL, backup, 0, NULL, &real_back)) != 0) + goto out; + if (__os_exists(real_back, NULL) == 0) + if ((ret = __os_unlink(dbenv, real_back)) != 0) + goto out; + if ((ret = __db_txnlist_delete(dbenv, info, + argp->name.data, TXNLIST_INVALID_ID, 1)) != 0) + goto out; + } else if (DB_UNDO(op)) { + /* + * Trying to undo. File may or may not have been deleted. + * Try to move the backup to the original. If the backup + * exists, then this is right. If it doesn't exist, then + * nothing will happen and that's OK. 
+ */ + if ((ret = __db_backup_name(dbenv, argp->name.data, + &backup, lsnp)) != 0) + goto out; + if ((ret = __db_appname(dbenv, + DB_APP_DATA, NULL, backup, 0, NULL, &real_back)) != 0) + goto out; + if ((ret = __db_appname(dbenv, DB_APP_DATA, + NULL, argp->name.data, 0, NULL, &real_name)) != 0) + goto out; + if (__os_exists(real_back, NULL) == 0) + if ((ret = + __os_rename(dbenv, real_back, real_name)) != 0) + goto out; + } + + *lsnp = argp->prev_lsn; + ret = 0; + +out: if (argp != NULL) + __os_free(argp, 0); + if (backup != NULL) + __os_freestr(backup); + if (real_back != NULL) + __os_freestr(real_back); + if (real_name != NULL) + __os_freestr(real_name); + return (ret); +} +/* + * __crdel_rename_recover -- + * Recovery function for rename. + * + * PUBLIC: int __crdel_rename_recover + * PUBLIC: __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *)); + */ +int +__crdel_rename_recover(dbenv, dbtp, lsnp, op, info) + DB_ENV *dbenv; + DBT *dbtp; + DB_LSN *lsnp; + db_recops op; + void *info; +{ + DB *dbp; + __crdel_rename_args *argp; + char *new_name, *real_name; + int ret, set; + + COMPQUIET(info, NULL); + + REC_PRINT(__crdel_rename_print); + + new_name = real_name = NULL; + + if ((ret = __crdel_rename_read(dbenv, dbtp->data, &argp)) != 0) + goto out; + + if ((ret = __db_fileid_to_db(dbenv, &dbp, argp->fileid, 0)) != 0) + goto out; + if (DB_REDO(op)) { + /* + * We don't use the dbp parameter to __log_filelist_update + * in the rename case, so passing NULL for it is OK. + */ + if ((ret = __log_filelist_update(dbenv, NULL, + argp->fileid, argp->newname.data, &set)) != 0) + goto out; + if (set != 0) { + if ((ret = __db_appname(dbenv, DB_APP_DATA, + NULL, argp->name.data, 0, NULL, &real_name)) != 0) + goto out; + if (__os_exists(real_name, NULL) == 0) { + if ((ret = __db_appname(dbenv, + DB_APP_DATA, NULL, argp->newname.data, + 0, NULL, &new_name)) != 0) + goto out; + /* + * On Windows, the underlying file + * must be closed to perform a remove. 
+ * The db will be closed by a + * log_register record. Rename + * has exclusive access to the db. + */ + (void)__memp_fremove(dbp->mpf); + if ((ret = memp_fclose(dbp->mpf)) != 0) + goto out; + dbp->mpf = NULL; + if ((ret = __os_rename(dbenv, + real_name, new_name)) != 0) + goto out; + } + } + } else { + /* + * We don't use the dbp parameter to __log_filelist_update + * in the rename case, so passing NULL for it is OK. + */ + if ((ret = __log_filelist_update(dbenv, NULL, + argp->fileid, argp->name.data, &set)) != 0) + goto out; + if (set != 0) { + if ((ret = __db_appname(dbenv, DB_APP_DATA, + NULL, argp->newname.data, 0, NULL, &new_name)) != 0) + goto out; + if (__os_exists(new_name, NULL) == 0) { + if ((ret = __db_appname(dbenv, + DB_APP_DATA, NULL, argp->name.data, + 0, NULL, &real_name)) != 0) + goto out; + /* + * On Windows, the underlying file + * must be closed to perform a remove. + * The file may have already been closed + * if we are aborting the transaction. + */ + if (dbp->mpf != NULL) { + (void)__memp_fremove(dbp->mpf); + if ((ret = memp_fclose(dbp->mpf)) != 0) + goto out; + dbp->mpf = NULL; + } + if ((ret = __os_rename(dbenv, + new_name, real_name)) != 0) + goto out; + } + } + } + + *lsnp = argp->prev_lsn; + ret = 0; + +out: if (argp != NULL) + __os_free(argp, 0); + + if (new_name != NULL) + __os_free(new_name, 0); + + if (real_name != NULL) + __os_free(real_name, 0); + + return (ret); +} diff --git a/bdb/db/db.c b/bdb/db/db.c new file mode 100644 index 00000000000..6e74b4b21bd --- /dev/null +++ b/bdb/db/db.c @@ -0,0 +1,2325 @@ +/*- + * See the file LICENSE for redistribution information. + * + * Copyright (c) 1996, 1997, 1998, 1999, 2000 + * Sleepycat Software. All rights reserved. + */ +/* + * Copyright (c) 1990, 1993, 1994, 1995, 1996 + * Keith Bostic. All rights reserved. + */ +/* + * Copyright (c) 1990, 1993, 1994, 1995 + * The Regents of the University of California. All rights reserved. 
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 3. Neither the name of the University nor the names of its contributors + * may be used to endorse or promote products derived from this software + * without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. 
+ */ + +#include "db_config.h" + +#ifndef lint +static const char revid[] = "$Id: db.c,v 11.117 2001/01/11 18:19:50 bostic Exp $"; +#endif /* not lint */ + +#ifndef NO_SYSTEM_INCLUDES +#include <sys/types.h> + +#include <stddef.h> +#include <stdlib.h> +#include <string.h> +#endif + +#include "db_int.h" +#include "db_page.h" +#include "db_shash.h" +#include "db_swap.h" +#include "btree.h" +#include "db_am.h" +#include "hash.h" +#include "lock.h" +#include "log.h" +#include "mp.h" +#include "qam.h" +#include "common_ext.h" + +/* Actions that __db_master_update can take. */ +typedef enum { MU_REMOVE, MU_RENAME, MU_OPEN } mu_action; + +/* Flag values that __db_file_setup can return. */ +#define DB_FILE_SETUP_CREATE 0x01 +#define DB_FILE_SETUP_ZERO 0x02 + +static int __db_file_setup __P((DB *, + const char *, u_int32_t, int, db_pgno_t, int *)); +static int __db_master_update __P((DB *, + const char *, u_int32_t, + db_pgno_t *, mu_action, const char *, u_int32_t)); +static int __db_refresh __P((DB *)); +static int __db_remove_callback __P((DB *, void *)); +static int __db_set_pgsize __P((DB *, DB_FH *, char *)); +static int __db_subdb_remove __P((DB *, const char *, const char *)); +static int __db_subdb_rename __P(( DB *, + const char *, const char *, const char *)); +#if CONFIG_TEST +static void __db_makecopy __P((const char *, const char *)); +static int __db_testdocopy __P((DB *, const char *)); +static int __qam_testdocopy __P((DB *, const char *)); +#endif + +/* + * __db_open -- + * Main library interface to the DB access methods. + * + * PUBLIC: int __db_open __P((DB *, + * PUBLIC: const char *, const char *, DBTYPE, u_int32_t, int)); + */ +int +__db_open(dbp, name, subdb, type, flags, mode) + DB *dbp; + const char *name, *subdb; + DBTYPE type; + u_int32_t flags; + int mode; +{ + DB_ENV *dbenv; + DB_LOCK open_lock; + DB *mdbp; + db_pgno_t meta_pgno; + u_int32_t ok_flags; + int ret, t_ret; + + dbenv = dbp->dbenv; + mdbp = NULL; + + /* Validate arguments. 
*/ +#define OKFLAGS \ + (DB_CREATE | DB_EXCL | DB_FCNTL_LOCKING | \ + DB_NOMMAP | DB_RDONLY | DB_RDWRMASTER | DB_THREAD | DB_TRUNCATE) + if ((ret = __db_fchk(dbenv, "DB->open", flags, OKFLAGS)) != 0) + return (ret); + if (LF_ISSET(DB_EXCL) && !LF_ISSET(DB_CREATE)) + return (__db_ferr(dbenv, "DB->open", 1)); + if (LF_ISSET(DB_RDONLY) && LF_ISSET(DB_CREATE)) + return (__db_ferr(dbenv, "DB->open", 1)); +#ifdef HAVE_VXWORKS + if (LF_ISSET(DB_TRUNCATE)) { + __db_err(dbenv, "DB_TRUNCATE unsupported in VxWorks"); + return (__db_eopnotsup(dbenv)); + } +#endif + switch (type) { + case DB_UNKNOWN: + if (LF_ISSET(DB_CREATE|DB_TRUNCATE)) { + __db_err(dbenv, + "%s: DB_UNKNOWN type specified with DB_CREATE or DB_TRUNCATE", + name); + return (EINVAL); + } + ok_flags = 0; + break; + case DB_BTREE: + ok_flags = DB_OK_BTREE; + break; + case DB_HASH: + ok_flags = DB_OK_HASH; + break; + case DB_QUEUE: + ok_flags = DB_OK_QUEUE; + break; + case DB_RECNO: + ok_flags = DB_OK_RECNO; + break; + default: + __db_err(dbenv, "unknown type: %lu", (u_long)type); + return (EINVAL); + } + if (ok_flags) + DB_ILLEGAL_METHOD(dbp, ok_flags); + + /* The environment may have been created, but never opened. */ + if (!F_ISSET(dbenv, DB_ENV_DBLOCAL | DB_ENV_OPEN_CALLED)) { + __db_err(dbenv, "environment not yet opened"); + return (EINVAL); + } + + /* + * Historically, you could pass in an environment that didn't have a + * mpool, and DB would create a private one behind the scenes. This + * no longer works. + */ + if (!F_ISSET(dbenv, DB_ENV_DBLOCAL) && !MPOOL_ON(dbenv)) { + __db_err(dbenv, "environment did not include a memory pool."); + return (EINVAL); + } + + /* + * You can't specify threads during DB->open if subsystems in the + * environment weren't configured with them. 
+ */ + if (LF_ISSET(DB_THREAD) && + !F_ISSET(dbenv, DB_ENV_DBLOCAL | DB_ENV_THREAD)) { + __db_err(dbenv, "environment not created using DB_THREAD"); + return (EINVAL); + } + + /* + * If the environment was configured with threads, the DB handle + * must also be free-threaded, so we force the DB_THREAD flag on. + * (See SR #2033 for why this is a requirement--recovery needs + * to be able to grab a dbp using __db_fileid_to_dbp, and it has + * no way of knowing which dbp goes with which thread, so whichever + * one it finds has to be usable in any of them.) + */ + if (F_ISSET(dbenv, DB_ENV_THREAD)) + LF_SET(DB_THREAD); + + /* DB_TRUNCATE is not transaction recoverable. */ + if (LF_ISSET(DB_TRUNCATE) && TXN_ON(dbenv)) { + __db_err(dbenv, + "DB_TRUNCATE illegal in a transaction protected environment"); + return (EINVAL); + } + + /* Subdatabase checks. */ + if (subdb != NULL) { + /* Subdatabases must be created in named files. */ + if (name == NULL) { + __db_err(dbenv, + "multiple databases cannot be created in temporary files"); + return (EINVAL); + } + + /* QAM can't be done as a subdatabase. */ + if (type == DB_QUEUE) { + __db_err(dbenv, "Queue databases must be one-per-file"); + return (EINVAL); + } + } + + /* Convert any DB->open flags. */ + if (LF_ISSET(DB_RDONLY)) + F_SET(dbp, DB_AM_RDONLY); + + /* Fill in the type. */ + dbp->type = type; + + /* + * If we're potentially creating a database, wrap the open inside of + * a transaction. + */ + if (TXN_ON(dbenv) && LF_ISSET(DB_CREATE)) + if ((ret = __db_metabegin(dbp, &open_lock)) != 0) + return (ret); + + /* + * If we're opening a subdatabase, we have to open (and potentially + * create) the main database, and then get (and potentially store) + * our base page number in that database. Then, we can finally open + * the subdatabase. + */ + if (subdb == NULL) + meta_pgno = PGNO_BASE_MD; + else { + /* + * Open the master database, optionally creating or updating + * it, and retrieve the metadata page number. 
+ */ + if ((ret = + __db_master_open(dbp, name, flags, mode, &mdbp)) != 0) + goto err; + + /* Copy the page size and file id from the master. */ + dbp->pgsize = mdbp->pgsize; + F_SET(dbp, DB_AM_SUBDB); + memcpy(dbp->fileid, mdbp->fileid, DB_FILE_ID_LEN); + + if ((ret = __db_master_update(mdbp, + subdb, type, &meta_pgno, MU_OPEN, NULL, flags)) != 0) + goto err; + + /* + * Clear the exclusive open and truncation flags, they only + * apply to the open of the master database. + */ + LF_CLR(DB_EXCL | DB_TRUNCATE); + } + + ret = __db_dbopen(dbp, name, flags, mode, meta_pgno); + + /* + * You can open the database that describes the subdatabases in the + * rest of the file read-only. The content of each key's data is + * unspecified and applications should never be adding new records + * or updating existing records. However, during recovery, we need + * to open these databases R/W so we can redo/undo changes in them. + * Likewise, we need to open master databases read/write during + * rename and remove so we can be sure they're fully sync'ed, so + * we provide an override flag for the purpose. + */ + if (subdb == NULL && !IS_RECOVERING(dbenv) && !LF_ISSET(DB_RDONLY) && + !LF_ISSET(DB_RDWRMASTER) && F_ISSET(dbp, DB_AM_SUBDB)) { + __db_err(dbenv, + "files containing multiple databases may only be opened read-only"); + ret = EINVAL; + goto err; + } + +err: /* + * End any transaction, committing if we were successful, aborting + * otherwise. + */ + if (TXN_ON(dbenv) && LF_ISSET(DB_CREATE)) + if ((t_ret = __db_metaend(dbp, + &open_lock, ret == 0, NULL, NULL)) != 0 && ret == 0) + ret = t_ret; + + /* If we were successful, don't discard the file on close. */ + if (ret == 0) + F_CLR(dbp, DB_AM_DISCARD); + + /* If we were unsuccessful, destroy the DB handle. */ + if (ret != 0) { + /* In recovery we set log_fileid early. 
*/ + if (IS_RECOVERING(dbenv)) + dbp->log_fileid = DB_LOGFILEID_INVALID; + __db_refresh(dbp); + } + + if (mdbp != NULL) { + /* If we were successful, don't discard the file on close. */ + if (ret == 0) + F_CLR(mdbp, DB_AM_DISCARD); + if ((t_ret = mdbp->close(mdbp, 0)) != 0 && ret == 0) + ret = t_ret; + } + + return (ret); +} + +/* + * __db_dbopen -- + * Open a database. + * PUBLIC: int __db_dbopen __P((DB *, const char *, u_int32_t, int, db_pgno_t)); + */ +int +__db_dbopen(dbp, name, flags, mode, meta_pgno) + DB *dbp; + const char *name; + u_int32_t flags; + int mode; + db_pgno_t meta_pgno; +{ + DB_ENV *dbenv; + int ret, retinfo; + + dbenv = dbp->dbenv; + + /* Set up the underlying file. */ + if ((ret = __db_file_setup(dbp, + name, flags, mode, meta_pgno, &retinfo)) != 0) + return (ret); + + /* + * If we created the file, set the truncate flag for the mpool. This + * isn't for anything we've done, it's protection against stupid user + * tricks: if the user deleted a file behind Berkeley DB's back, we + * may still have pages in the mpool that match the file's "unique" ID. + */ + if (retinfo & DB_FILE_SETUP_CREATE) + flags |= DB_TRUNCATE; + + /* Set up the underlying environment. */ + if ((ret = __db_dbenv_setup(dbp, name, flags)) != 0) + return (ret); + + /* + * Do access method specific initialization. + * + * !!! + * Set the open flag. (The underlying access method open functions + * may want to do things like acquire cursors, so the open flag has + * to be set before calling them.) 
+ */ + F_SET(dbp, DB_OPEN_CALLED); + + if (retinfo & DB_FILE_SETUP_ZERO) + return (0); + + switch (dbp->type) { + case DB_BTREE: + ret = __bam_open(dbp, name, meta_pgno, flags); + break; + case DB_HASH: + ret = __ham_open(dbp, name, meta_pgno, flags); + break; + case DB_RECNO: + ret = __ram_open(dbp, name, meta_pgno, flags); + break; + case DB_QUEUE: + ret = __qam_open(dbp, name, meta_pgno, mode, flags); + break; + case DB_UNKNOWN: + return (__db_unknown_type(dbp->dbenv, + "__db_dbopen", dbp->type)); + break; + } + return (ret); +} + +/* + * __db_master_open -- + * Open up a handle on a master database. + * + * PUBLIC: int __db_master_open __P((DB *, + * PUBLIC: const char *, u_int32_t, int, DB **)); + */ +int +__db_master_open(subdbp, name, flags, mode, dbpp) + DB *subdbp; + const char *name; + u_int32_t flags; + int mode; + DB **dbpp; +{ + DB *dbp; + int ret; + + /* Open up a handle on the main database. */ + if ((ret = db_create(&dbp, subdbp->dbenv, 0)) != 0) + return (ret); + + /* + * It's always a btree. + * Run in the transaction we've created. + * Set the pagesize in case we're creating a new database. + * Flag that we're creating a database with subdatabases. + */ + dbp->type = DB_BTREE; + dbp->open_txn = subdbp->open_txn; + dbp->pgsize = subdbp->pgsize; + F_SET(dbp, DB_AM_SUBDB); + + if ((ret = __db_dbopen(dbp, name, flags, mode, PGNO_BASE_MD)) != 0) { + if (!F_ISSET(dbp, DB_AM_DISCARD)) + dbp->close(dbp, 0); + return (ret); + } + + *dbpp = dbp; + return (0); +} + +/* + * __db_master_update -- + * Add/Remove a subdatabase from a master database. 
+ */ +static int +__db_master_update(mdbp, subdb, type, meta_pgnop, action, newname, flags) + DB *mdbp; + const char *subdb; + u_int32_t type; + db_pgno_t *meta_pgnop; /* may be NULL on MU_RENAME */ + mu_action action; + const char *newname; + u_int32_t flags; +{ + DB_ENV *dbenv; + DBC *dbc, *ndbc; + DBT key, data, ndata; + PAGE *p; + db_pgno_t t_pgno; + int modify, ret, t_ret; + + dbenv = mdbp->dbenv; + dbc = ndbc = NULL; + p = NULL; + + /* Might we modify the master database? If so, we'll need to lock. */ + modify = (action != MU_OPEN || LF_ISSET(DB_CREATE)) ? 1 : 0; + + memset(&key, 0, sizeof(key)); + memset(&data, 0, sizeof(data)); + + /* + * Open up a cursor. If this is CDB and we're creating the database, + * make it an update cursor. + */ + if ((ret = mdbp->cursor(mdbp, mdbp->open_txn, &dbc, + (CDB_LOCKING(dbenv) && modify) ? DB_WRITECURSOR : 0)) != 0) + goto err; + + /* + * Try to point the cursor at the record. + * + * If we're removing or potentially creating an entry, lock the page + * with DB_RMW. + * + * !!! + * We don't include the name's nul termination in the database. + */ + key.data = (char *)subdb; + key.size = strlen(subdb); + /* In the rename case, we do multiple cursor ops, so MALLOC is safer. */ + F_SET(&data, DB_DBT_MALLOC); + ret = dbc->c_get(dbc, &key, &data, + DB_SET | ((STD_LOCKING(dbc) && modify) ? DB_RMW : 0)); + + /* + * What we do next--whether or not we found a record for the + * specified subdatabase--depends on what the specified action is. + * Handle ret appropriately as the first statement of each case. + */ + switch (action) { + case MU_REMOVE: + /* + * We should have found something if we're removing it. Note + * that in the common case where the DB we're asking to remove + * doesn't exist, we won't get this far; __db_subdb_remove + * will already have returned an error from __db_open. 
+ */ + if (ret != 0) + goto err; + + /* + * Delete the subdatabase entry first; if this fails, + * we don't want to touch the actual subdb pages. + */ + if ((ret = dbc->c_del(dbc, 0)) != 0) + goto err; + + /* + * We're handling actual data, not on-page meta-data, + * so it hasn't been converted to/from opposite + * endian architectures. Do it explicitly, now. + */ + memcpy(meta_pgnop, data.data, sizeof(db_pgno_t)); + DB_NTOHL(meta_pgnop); + if ((ret = memp_fget(mdbp->mpf, meta_pgnop, 0, &p)) != 0) + goto err; + + /* Free and put the page. */ + if ((ret = __db_free(dbc, p)) != 0) { + p = NULL; + goto err; + } + p = NULL; + break; + case MU_RENAME: + /* We should have found something if we're renaming it. */ + if (ret != 0) + goto err; + + /* + * Before we rename, we need to make sure we're not + * overwriting another subdatabase, or else this operation + * won't be undoable. Open a second cursor and check + * for the existence of newname; it shouldn't appear under + * us since we hold the metadata lock. + */ + if ((ret = mdbp->cursor(mdbp, mdbp->open_txn, &ndbc, 0)) != 0) + goto err; + DB_ASSERT(newname != NULL); + key.data = (void *) newname; + key.size = strlen(newname); + + /* + * We don't actually care what the meta page of the potentially- + * overwritten DB is; we just care about existence. + */ + memset(&ndata, 0, sizeof(ndata)); + F_SET(&ndata, DB_DBT_USERMEM | DB_DBT_PARTIAL); + + if ((ret = ndbc->c_get(ndbc, &key, &ndata, DB_SET)) == 0) { + /* A subdb called newname exists. Bail. */ + ret = EEXIST; + __db_err(dbenv, "rename: database %s exists", newname); + goto err; + } else if (ret != DB_NOTFOUND) + goto err; + + /* + * Now do the put first; we don't want to lose our + * sole reference to the subdb. Use the second cursor + * so that the first one continues to point to the old record. 
+ */ + if ((ret = ndbc->c_put(ndbc, &key, &data, DB_KEYFIRST)) != 0) + goto err; + if ((ret = dbc->c_del(dbc, 0)) != 0) { + /* + * If the delete fails, try to delete the record + * we just put, in case we're not txn-protected. + */ + (void)ndbc->c_del(ndbc, 0); + goto err; + } + + break; + case MU_OPEN: + /* + * Get the subdatabase information. If it already exists, + * copy out the page number and we're done. + */ + switch (ret) { + case 0: + memcpy(meta_pgnop, data.data, sizeof(db_pgno_t)); + DB_NTOHL(meta_pgnop); + goto done; + case DB_NOTFOUND: + if (LF_ISSET(DB_CREATE)) + break; + /* + * No db_err, it is reasonable to remove a + * nonexistent db. + */ + ret = ENOENT; + goto err; + default: + goto err; + } + + if ((ret = __db_new(dbc, + type == DB_HASH ? P_HASHMETA : P_BTREEMETA, &p)) != 0) + goto err; + *meta_pgnop = PGNO(p); + + /* + * XXX + * We're handling actual data, not on-page meta-data, so it + * hasn't been converted to/from opposite endian architectures. + * Do it explicitly, now. + */ + t_pgno = PGNO(p); + DB_HTONL(&t_pgno); + memset(&ndata, 0, sizeof(ndata)); + ndata.data = &t_pgno; + ndata.size = sizeof(db_pgno_t); + if ((ret = dbc->c_put(dbc, &key, &ndata, DB_KEYLAST)) != 0) + goto err; + break; + } + +err: +done: /* + * If we allocated a page: if we're successful, mark the page dirty + * and return it to the cache, otherwise, discard/free it. + */ + if (p != NULL) { + if (ret == 0) { + if ((t_ret = + memp_fput(mdbp->mpf, p, DB_MPOOL_DIRTY)) != 0) + ret = t_ret; + /* + * Since we cannot close this file until after + * transaction commit, we need to sync the dirty + * pages, because we'll read these directly from + * disk to open. + */ + if ((t_ret = mdbp->sync(mdbp, 0)) != 0 && ret == 0) + ret = t_ret; + } else + (void)__db_free(dbc, p); + } + + /* Discard the cursor(s) and data. 
*/ + if (data.data != NULL) + __os_free(data.data, data.size); + if (dbc != NULL && (t_ret = dbc->c_close(dbc)) != 0 && ret == 0) + ret = t_ret; + if (ndbc != NULL && (t_ret = ndbc->c_close(ndbc)) != 0 && ret == 0) + ret = t_ret; + + return (ret); +} + +/* + * __db_dbenv_setup -- + * Set up the underlying environment during a db_open. + * + * PUBLIC: int __db_dbenv_setup __P((DB *, const char *, u_int32_t)); + */ +int +__db_dbenv_setup(dbp, name, flags) + DB *dbp; + const char *name; + u_int32_t flags; +{ + DB *ldbp; + DB_ENV *dbenv; + DBT pgcookie; + DB_MPOOL_FINFO finfo; + DB_PGINFO pginfo; + int ret; + u_int32_t maxid; + + dbenv = dbp->dbenv; + + /* If we don't yet have an environment, it's time to create it. */ + if (!F_ISSET(dbenv, DB_ENV_OPEN_CALLED)) { + /* Make sure we have at least DB_MINPAGECACHE pages in our cache. */ + if (dbenv->mp_gbytes == 0 && + dbenv->mp_bytes < dbp->pgsize * DB_MINPAGECACHE && + (ret = dbenv->set_cachesize( + dbenv, 0, dbp->pgsize * DB_MINPAGECACHE, 0)) != 0) + return (ret); + + if ((ret = dbenv->open(dbenv, NULL, DB_CREATE | + DB_INIT_MPOOL | DB_PRIVATE | LF_ISSET(DB_THREAD), 0)) != 0) + return (ret); + } + + /* Register DB's pgin/pgout functions. */ + if ((ret = + memp_register(dbenv, DB_FTYPE_SET, __db_pgin, __db_pgout)) != 0) + return (ret); + + /* + * Open a backing file in the memory pool. + * + * If we need to pre- or post-process a file's pages on I/O, set the + * file type. If it's a hash file, always call the pgin and pgout + * routines. This means that hash files can never be mapped into + * process memory. If it's a btree file and requires swapping, we + * need to page the file in and out. This has to be right -- we can't + * mmap files that are being paged in and out. + */ + memset(&finfo, 0, sizeof(finfo)); + switch (dbp->type) { + case DB_BTREE: + case DB_RECNO: + finfo.ftype = + F_ISSET(dbp, DB_AM_SWAP) ?
DB_FTYPE_SET : DB_FTYPE_NOTSET; + finfo.clear_len = DB_PAGE_DB_LEN; + break; + case DB_HASH: + finfo.ftype = DB_FTYPE_SET; + finfo.clear_len = DB_PAGE_DB_LEN; + break; + case DB_QUEUE: + finfo.ftype = + F_ISSET(dbp, DB_AM_SWAP) ? DB_FTYPE_SET : DB_FTYPE_NOTSET; + finfo.clear_len = DB_PAGE_QUEUE_LEN; + break; + case DB_UNKNOWN: + /* + * If we're running in the verifier, our database might + * be corrupt and we might not know its type--but we may + * still want to be able to verify and salvage. + * + * If we can't identify the type, it's not going to be safe + * to call __db_pgin--we pretty much have to give up all + * hope of salvaging cross-endianness. Proceed anyway; + * at worst, the database will just appear more corrupt + * than it actually is, but at best, we may be able + * to salvage some data even with no metadata page. + */ + if (F_ISSET(dbp, DB_AM_VERIFYING)) { + finfo.ftype = DB_FTYPE_NOTSET; + finfo.clear_len = DB_PAGE_DB_LEN; + break; + } + return (__db_unknown_type(dbp->dbenv, + "__db_dbenv_setup", dbp->type)); + } + finfo.pgcookie = &pgcookie; + finfo.fileid = dbp->fileid; + finfo.lsn_offset = 0; + + pginfo.db_pagesize = dbp->pgsize; + pginfo.needswap = F_ISSET(dbp, DB_AM_SWAP); + pgcookie.data = &pginfo; + pgcookie.size = sizeof(DB_PGINFO); + + if ((ret = memp_fopen(dbenv, name, + LF_ISSET(DB_RDONLY | DB_NOMMAP | DB_ODDFILESIZE | DB_TRUNCATE), + 0, dbp->pgsize, &finfo, &dbp->mpf)) != 0) + return (ret); + + /* + * We may need a per-thread mutex. Allocate it from the environment + * region, there's supposed to be extra space there for that purpose. + */ + if (LF_ISSET(DB_THREAD)) { + if ((ret = __db_mutex_alloc( + dbenv, dbenv->reginfo, (MUTEX **)&dbp->mutexp)) != 0) + return (ret); + if ((ret = __db_mutex_init( + dbenv, dbp->mutexp, 0, MUTEX_THREAD)) != 0) { + __db_mutex_free(dbenv, dbenv->reginfo, dbp->mutexp); + return (ret); + } + } + + /* Get a log file id. 
*/ + if (LOGGING_ON(dbenv) && !IS_RECOVERING(dbenv) && +#if !defined(DEBUG_ROP) + !F_ISSET(dbp, DB_AM_RDONLY) && +#endif + (ret = log_register(dbenv, dbp, name)) != 0) + return (ret); + + /* + * Insert ourselves into the DB_ENV's dblist. We allocate a + * unique ID to each {fileid, meta page number} pair, and to + * each temporary file (since they all have a zero fileid). + * This ID gives us something to use to tell which DB handles + * go with which databases in all the cursor adjustment + * routines, where we don't want to do a lot of ugly and + * expensive memcmps. + */ + MUTEX_THREAD_LOCK(dbenv, dbenv->dblist_mutexp); + for (maxid = 0, ldbp = LIST_FIRST(&dbenv->dblist); + ldbp != NULL; ldbp = LIST_NEXT(ldbp, dblistlinks)) { + if (name != NULL && + memcmp(ldbp->fileid, dbp->fileid, DB_FILE_ID_LEN) == 0 && + ldbp->meta_pgno == dbp->meta_pgno) + break; + if (ldbp->adj_fileid > maxid) + maxid = ldbp->adj_fileid; + } + + /* + * If ldbp is NULL, we didn't find a match, or we weren't + * really looking because name is NULL. Assign the dbp an + * adj_fileid one higher than the largest we found, and + * insert it at the head of the master dbp list. + * + * If ldbp is not NULL, it is a match for our dbp. Give dbp + * the same ID that ldbp has, and add it after ldbp so they're + * together in the list. + */ + if (ldbp == NULL) { + dbp->adj_fileid = maxid + 1; + LIST_INSERT_HEAD(&dbenv->dblist, dbp, dblistlinks); + } else { + dbp->adj_fileid = ldbp->adj_fileid; + LIST_INSERT_AFTER(ldbp, dbp, dblistlinks); + } + MUTEX_THREAD_UNLOCK(dbenv, dbenv->dblist_mutexp); + + return (0); +} + +/* + * __db_file_setup -- + * Setup the file or in-memory data. + * Read the database metadata and resolve it with our arguments.
+ */ +static int +__db_file_setup(dbp, name, flags, mode, meta_pgno, retflags) + DB *dbp; + const char *name; + u_int32_t flags; + int mode; + db_pgno_t meta_pgno; + int *retflags; +{ + DB *mdb; + DBT namedbt; + DB_ENV *dbenv; + DB_FH *fhp, fh; + DB_LSN lsn; + DB_TXN *txn; + size_t nr; + u_int32_t magic, oflags; + int ret, retry_cnt, t_ret; + char *real_name, mbuf[DBMETASIZE]; + +#define IS_SUBDB_SETUP (meta_pgno != PGNO_BASE_MD) + + dbenv = dbp->dbenv; + dbp->meta_pgno = meta_pgno; + txn = NULL; + *retflags = 0; + + /* + * If we open a file handle and our caller is doing fcntl(2) locking, + * we can't close it because that would discard the caller's lock. + * Save it until we close the DB handle. + */ + if (LF_ISSET(DB_FCNTL_LOCKING)) { + if ((ret = __os_malloc(dbenv, sizeof(*fhp), NULL, &fhp)) != 0) + return (ret); + } else + fhp = &fh; + memset(fhp, 0, sizeof(*fhp)); + + /* + * If the file is in-memory, set up is simple. Otherwise, do the + * hard work of opening and reading the file. + * + * If we have a file name, try and read the first page, figure out + * what type of file it is, and initialize everything we can based + * on that file's meta-data page. + * + * !!! + * There's a reason we don't push this code down into the buffer cache. + * The problem is that there's no information external to the file that + * we can use as a unique ID. UNIX has dev/inode pairs, but they are + * not necessarily unique after reboot, if the file was mounted via NFS. + * Windows has similar problems, as the FAT filesystem doesn't maintain + * dev/inode numbers across reboot. So, we must get something from the + * file we can use to ensure that, even after a reboot, the file we're + * joining in the cache is the right file for us to join. The solution + * we use is to maintain a file ID that's stored in the database, and + * that's why we have to open and read the file before calling into the + * buffer cache. 
+ * + * The secondary reason is that there's additional information that + * we want to have before instantiating a file in the buffer cache: + * the page size, file type (btree/hash), if swapping is required, + * and flags (DB_RDONLY, DB_CREATE, DB_TRUNCATE). We could handle + * needing this information by allowing it to be set for a file in + * the buffer cache even after the file has been opened, and, of + * course, supporting the ability to flush a file from the cache as + * necessary, e.g., if we guessed wrongly about the page size. Given + * that we have to read the file anyway to get the file ID, we might + * as well get the rest, too. + * + * Get the real file name. + */ + if (name == NULL) { + F_SET(dbp, DB_AM_INMEM); + + if (dbp->type == DB_UNKNOWN) { + __db_err(dbenv, + "DBTYPE of unknown without existing file"); + return (EINVAL); + } + real_name = NULL; + + /* Set the page size if we don't have one yet. */ + if (dbp->pgsize == 0) + dbp->pgsize = DB_DEF_IOSIZE; + + /* + * If the file is a temporary file and we're doing locking, + * then we have to create a unique file ID. We can't use our + * normal dev/inode pair (or whatever this OS uses in place of + * dev/inode pairs) because no backing file will be created + * until the mpool cache is filled forcing the buffers to disk. + * Grab a random locker ID to use as a file ID. The created + * ID must never match a potential real file ID -- we know it + * won't because real file IDs contain a time stamp after the + * dev/inode pair, and we're simply storing a 4-byte value. + * + * !!! + * Store the locker in the file id structure -- we can get it + * from there as necessary, and it saves having two copies. + */ + if (LOCKING_ON(dbenv) && + (ret = lock_id(dbenv, (u_int32_t *)dbp->fileid)) != 0) + return (ret); + + return (0); + } + + /* Get the real backing file name. */ + if ((ret = __db_appname(dbenv, + DB_APP_DATA, NULL, name, 0, NULL, &real_name)) != 0) + return (ret); + + /* + * Open the backing file. 
We need to make sure that multiple processes + * attempting to create the file at the same time are properly ordered + * so that only one of them creates the "unique" file ID, so we open it + * O_EXCL and O_CREAT so two simultaneous attempts to create the region + * will return failure in one of the attempts. If we're the one that + * fails, simply retry without the O_CREAT flag, which will require the + * meta-data page exist. + */ + + /* Fill in the default file mode. */ + if (mode == 0) + mode = __db_omode("rwrw--"); + + oflags = 0; + if (LF_ISSET(DB_RDONLY)) + oflags |= DB_OSO_RDONLY; + if (LF_ISSET(DB_TRUNCATE)) + oflags |= DB_OSO_TRUNC; + + retry_cnt = 0; +open_retry: + *retflags = 0; + ret = 0; + if (!IS_SUBDB_SETUP && LF_ISSET(DB_CREATE)) { + if (dbp->open_txn != NULL) { + /* + * Start a child transaction to wrap this individual + * create. + */ + if ((ret = + txn_begin(dbenv, dbp->open_txn, &txn, 0)) != 0) + goto err_msg; + + memset(&namedbt, 0, sizeof(namedbt)); + namedbt.data = (char *)name; + namedbt.size = strlen(name) + 1; + if ((ret = __crdel_fileopen_log(dbenv, txn, + &lsn, DB_FLUSH, &namedbt, mode)) != 0) + goto err_msg; + } + DB_TEST_RECOVERY(dbp, DB_TEST_PREOPEN, ret, name); + if ((ret = __os_open(dbenv, real_name, + oflags | DB_OSO_CREATE | DB_OSO_EXCL, mode, fhp)) == 0) { + DB_TEST_RECOVERY(dbp, DB_TEST_POSTOPEN, ret, name); + + /* Commit the file create. */ + if (dbp->open_txn != NULL) { + if ((ret = txn_commit(txn, DB_TXN_SYNC)) != 0) + goto err_msg; + txn = NULL; + } + + /* + * We created the file. This means that if we later + * fail, we need to delete the file and if we're going + * to do that, we need to trash any pages in the + * memory pool. Since we only know here that we + * created the file, we're going to set the flag here + * and clear it later if we commit successfully. + */ + F_SET(dbp, DB_AM_DISCARD); + *retflags |= DB_FILE_SETUP_CREATE; + } else { + /* + * Abort the file create. 
If the abort fails, report + * the error returned by txn_abort(), rather than the + * open error, for no particular reason. + */ + if (dbp->open_txn != NULL) { + if ((t_ret = txn_abort(txn)) != 0) { + ret = t_ret; + goto err_msg; + } + txn = NULL; + } + + /* + * If we were not doing an exclusive open, try again + * without the create flag. + */ + if (ret == EEXIST && !LF_ISSET(DB_EXCL)) { + LF_CLR(DB_CREATE); + DB_TEST_RECOVERY(dbp, + DB_TEST_POSTOPEN, ret, name); + goto open_retry; + } + } + } else + ret = __os_open(dbenv, real_name, oflags, mode, fhp); + + /* + * Be quiet if we couldn't open the file because it didn't exist + * or we did not have permission, + * the customers don't like those messages appearing in the logs. + * Otherwise, complain loudly. + */ + if (ret != 0) { + if (ret == EACCES || ret == ENOENT) + goto err; + goto err_msg; + } + + /* Set the page size if we don't have one yet. */ + if (dbp->pgsize == 0) { + if (IS_SUBDB_SETUP) { + if ((ret = __db_master_open(dbp, + name, flags, mode, &mdb)) != 0) + goto err; + dbp->pgsize = mdb->pgsize; + (void)mdb->close(mdb, 0); + } else if ((ret = __db_set_pgsize(dbp, fhp, real_name)) != 0) + goto err; + } + + /* + * Seek to the metadata offset; if it's a master database open or a + * database without subdatabases, we're seeking to 0, but that's OK. + */ + if ((ret = __os_seek(dbenv, fhp, + dbp->pgsize, meta_pgno, 0, 0, DB_OS_SEEK_SET)) != 0) + goto err_msg; + + /* + * Read the metadata page. We read DBMETASIZE bytes, which is larger + * than any access method's metadata page and smaller than any disk + * sector. + */ + if ((ret = __os_read(dbenv, fhp, mbuf, sizeof(mbuf), &nr)) != 0) + goto err_msg; + + if (nr == sizeof(mbuf)) { + /* + * Figure out what access method we're dealing with, and then + * call access method specific code to check error conditions + * based on conflicts between the found file and application + * arguments. 
A found file overrides some user information -- + * we don't consider it an error, for example, if the user set + * an expected byte order and the found file doesn't match it. + */ + F_CLR(dbp, DB_AM_SWAP); + magic = ((DBMETA *)mbuf)->magic; + +swap_retry: switch (magic) { + case DB_BTREEMAGIC: + if ((ret = + __bam_metachk(dbp, name, (BTMETA *)mbuf)) != 0) + goto err; + break; + case DB_HASHMAGIC: + if ((ret = + __ham_metachk(dbp, name, (HMETA *)mbuf)) != 0) + goto err; + break; + case DB_QAMMAGIC: + if ((ret = + __qam_metachk(dbp, name, (QMETA *)mbuf)) != 0) + goto err; + break; + case 0: + /* + * There are two ways we can get a 0 magic number. + * If we're creating a subdatabase, then the magic + * number will be 0. We allocate a page as part of + * finding out what the base page number will be for + * the new subdatabase, but it's not initialized in + * any way. + * + * The second case happens if we are in recovery + * and we are going to recreate a database, it's + * possible that its page was created (on systems + * where pages must be created explicitly to avoid + * holes in files) but is still 0. + */ + if (IS_SUBDB_SETUP) { /* Case 1 */ + if ((IS_RECOVERING(dbenv) + && F_ISSET((DB_LOG *) + dbenv->lg_handle, DBLOG_FORCE_OPEN)) + || ((DBMETA *)mbuf)->pgno != PGNO_INVALID) + goto empty; + + ret = EINVAL; + goto err; + } + /* Case 2 */ + if (IS_RECOVERING(dbenv)) { + *retflags |= DB_FILE_SETUP_ZERO; + goto empty; + } + goto bad_format; + default: + if (F_ISSET(dbp, DB_AM_SWAP)) + goto bad_format; + + M_32_SWAP(magic); + F_SET(dbp, DB_AM_SWAP); + goto swap_retry; + } + } else { + /* + * Only newly created files are permitted to fail magic + * number tests. + */ + if (nr != 0 || (!IS_RECOVERING(dbenv) && IS_SUBDB_SETUP)) + goto bad_format; + + /* Let the caller know that we had a 0-length file.
*/ + if (!LF_ISSET(DB_CREATE | DB_TRUNCATE)) + *retflags |= DB_FILE_SETUP_ZERO; + + /* + * The only way we can reach here with the DB_CREATE flag set + * is if we created the file. If that's not the case, then + * either (a) someone else created the file but has not yet + * written out the metadata page, or (b) we truncated the file + * (DB_TRUNCATE) leaving it zero-length. In the case of (a), + * we want to sleep and give the file creator time to write + * the metadata page. In the case of (b), we want to continue. + * + * !!! + * There's a race in the case of two processes opening the file + * with the DB_TRUNCATE flag set at roughly the same time, and + * they could theoretically hurt each other. Sure hope that's + * unlikely. + */ + if (!LF_ISSET(DB_CREATE | DB_TRUNCATE) && + !IS_RECOVERING(dbenv)) { + if (retry_cnt++ < 3) { + __os_sleep(dbenv, 1, 0); + goto open_retry; + } +bad_format: if (!IS_RECOVERING(dbenv)) + __db_err(dbenv, + "%s: unexpected file type or format", name); + ret = EINVAL; + goto err; + } + + DB_ASSERT (dbp->type != DB_UNKNOWN); + +empty: /* + * The file is empty, and that's OK. If it's not a subdatabase, + * though, we do need to generate a unique file ID for it. The + * unique file ID includes a timestamp so that we can't collide + * with any other files, even when the file IDs (dev/inode pair) + * are reused. + */ + if (!IS_SUBDB_SETUP) { + if (*retflags & DB_FILE_SETUP_ZERO) + memset(dbp->fileid, 0, DB_FILE_ID_LEN); + else if ((ret = __os_fileid(dbenv, + real_name, 1, dbp->fileid)) != 0) + goto err_msg; + } + } + + if (0) { +err_msg: __db_err(dbenv, "%s: %s", name, db_strerror(ret)); + } + + /* + * Abort any running transaction -- it can only exist if something + * went wrong. + */ +err: +DB_TEST_RECOVERY_LABEL + + /* + * If we opened a file handle and our caller is doing fcntl(2) locking, + * then we can't close it because that would discard the caller's lock. + * Otherwise, close the handle. 
+ */ + if (F_ISSET(fhp, DB_FH_VALID)) { + if (ret == 0 && LF_ISSET(DB_FCNTL_LOCKING)) + dbp->saved_open_fhp = fhp; + else + if ((t_ret = __os_closehandle(fhp)) != 0 && ret == 0) + ret = t_ret; + } + + /* + * This must be done after the file is closed, since + * txn_abort() may remove the file, and an open file + * cannot be removed on Windows platforms. + */ + if (txn != NULL) + (void)txn_abort(txn); + + if (real_name != NULL) + __os_freestr(real_name); + + return (ret); +} + +/* + * __db_set_pgsize -- + * Set the page size based on file information. + */ +static int +__db_set_pgsize(dbp, fhp, name) + DB *dbp; + DB_FH *fhp; + char *name; +{ + DB_ENV *dbenv; + u_int32_t iopsize; + int ret; + + dbenv = dbp->dbenv; + + /* + * Use the filesystem's optimum I/O size as the pagesize if a pagesize + * is not specified. Some filesystems have 64K as their optimum I/O size, + * but as that results in fairly large default caches, we limit the + * default pagesize to 16K. + */ + if ((ret = __os_ioinfo(dbenv, name, fhp, NULL, NULL, &iopsize)) != 0) { + __db_err(dbenv, "%s: %s", name, db_strerror(ret)); + return (ret); + } + if (iopsize < 512) + iopsize = 512; + if (iopsize > 16 * 1024) + iopsize = 16 * 1024; + + /* + * Sheer paranoia, but we don't want anything that's not a power-of-2 + * (we rely on that for alignment of various types on the pages), and + * we want a multiple of the sector size as well. + */ + OS_ROUNDOFF(iopsize, 512); + + dbp->pgsize = iopsize; + F_SET(dbp, DB_AM_PGDEF); + + return (0); +} + +/* + * __db_close -- + * DB destructor. + * + * PUBLIC: int __db_close __P((DB *, u_int32_t)); + */ +int +__db_close(dbp, flags) + DB *dbp; + u_int32_t flags; +{ + DB_ENV *dbenv; + DBC *dbc; + int ret, t_ret; + + ret = 0; + + dbenv = dbp->dbenv; + PANIC_CHECK(dbenv); + + /* Validate arguments. */ + if ((ret = __db_closechk(dbp, flags)) != 0) + goto err; + + /* If never opened, or not currently open, it's easy.
*/ + if (!F_ISSET(dbp, DB_OPEN_CALLED)) + goto never_opened; + + /* Sync the underlying access method. */ + if (!LF_ISSET(DB_NOSYNC) && !F_ISSET(dbp, DB_AM_DISCARD) && + (t_ret = dbp->sync(dbp, 0)) != 0 && ret == 0) + ret = t_ret; + + /* + * Go through the active cursors and call the cursor recycle routine, + * which resolves pending operations and moves the cursors onto the + * free list. Then, walk the free list and call the cursor destroy + * routine. + */ + while ((dbc = TAILQ_FIRST(&dbp->active_queue)) != NULL) + if ((t_ret = dbc->c_close(dbc)) != 0 && ret == 0) + ret = t_ret; + while ((dbc = TAILQ_FIRST(&dbp->free_queue)) != NULL) + if ((t_ret = __db_c_destroy(dbc)) != 0 && ret == 0) + ret = t_ret; + + /* + * Close any outstanding join cursors. Join cursors destroy + * themselves on close and have no separate destroy routine. + */ + while ((dbc = TAILQ_FIRST(&dbp->join_queue)) != NULL) + if ((t_ret = dbc->c_close(dbc)) != 0 && ret == 0) + ret = t_ret; + + /* Remove this DB handle from the DB_ENV's dblist. */ + MUTEX_THREAD_LOCK(dbenv, dbenv->dblist_mutexp); + LIST_REMOVE(dbp, dblistlinks); + MUTEX_THREAD_UNLOCK(dbenv, dbenv->dblist_mutexp); + + /* Sync the memory pool. */ + if (!LF_ISSET(DB_NOSYNC) && !F_ISSET(dbp, DB_AM_DISCARD) && + (t_ret = memp_fsync(dbp->mpf)) != 0 && + t_ret != DB_INCOMPLETE && ret == 0) + ret = t_ret; + + /* Close any handle we've been holding since the open. */ + if (dbp->saved_open_fhp != NULL && + F_ISSET(dbp->saved_open_fhp, DB_FH_VALID) && + (t_ret = __os_closehandle(dbp->saved_open_fhp)) != 0 && ret == 0) + ret = t_ret; + +never_opened: + /* + * Call the access specific close function. + * + * !!! + * Because of where the function is called in the close process, + * these routines can't do anything that would dirty pages or + * otherwise affect closing down the database. 
+ */ + if ((t_ret = __ham_db_close(dbp)) != 0 && ret == 0) + ret = t_ret; + if ((t_ret = __bam_db_close(dbp)) != 0 && ret == 0) + ret = t_ret; + if ((t_ret = __qam_db_close(dbp)) != 0 && ret == 0) + ret = t_ret; + +err: + /* Refresh the structure and close any local environment. */ + if ((t_ret = __db_refresh(dbp)) != 0 && ret == 0) + ret = t_ret; + if (F_ISSET(dbenv, DB_ENV_DBLOCAL) && + --dbenv->dblocal_ref == 0 && + (t_ret = dbenv->close(dbenv, 0)) != 0 && ret == 0) + ret = t_ret; + + memset(dbp, CLEAR_BYTE, sizeof(*dbp)); + __os_free(dbp, sizeof(*dbp)); + + return (ret); +} + +/* + * __db_refresh -- + * Refresh the DB structure, releasing any allocated resources. + */ +static int +__db_refresh(dbp) + DB *dbp; +{ + DB_ENV *dbenv; + DBC *dbc; + int ret, t_ret; + + ret = 0; + + dbenv = dbp->dbenv; + + /* + * Go through the active cursors and call the cursor recycle routine, + * which resolves pending operations and moves the cursors onto the + * free list. Then, walk the free list and call the cursor destroy + * routine. + */ + while ((dbc = TAILQ_FIRST(&dbp->active_queue)) != NULL) + if ((t_ret = dbc->c_close(dbc)) != 0 && ret == 0) + ret = t_ret; + while ((dbc = TAILQ_FIRST(&dbp->free_queue)) != NULL) + if ((t_ret = __db_c_destroy(dbc)) != 0 && ret == 0) + ret = t_ret; + + dbp->type = 0; + + /* Close the memory pool file handle. */ + if (dbp->mpf != NULL) { + if (F_ISSET(dbp, DB_AM_DISCARD)) + (void)__memp_fremove(dbp->mpf); + if ((t_ret = memp_fclose(dbp->mpf)) != 0 && ret == 0) + ret = t_ret; + dbp->mpf = NULL; + } + + /* Discard the thread mutex. */ + if (dbp->mutexp != NULL) { + __db_mutex_free(dbenv, dbenv->reginfo, dbp->mutexp); + dbp->mutexp = NULL; + } + + /* Discard the log file id. 
*/ + if (!IS_RECOVERING(dbenv) + && dbp->log_fileid != DB_LOGFILEID_INVALID) + (void)log_unregister(dbenv, dbp); + + F_CLR(dbp, DB_AM_DISCARD); + F_CLR(dbp, DB_AM_INMEM); + F_CLR(dbp, DB_AM_RDONLY); + F_CLR(dbp, DB_AM_SWAP); + F_CLR(dbp, DB_DBM_ERROR); + F_CLR(dbp, DB_OPEN_CALLED); + + return (ret); +} + +/* + * __db_remove + * Remove method for DB. + * + * PUBLIC: int __db_remove __P((DB *, const char *, const char *, u_int32_t)); + */ +int +__db_remove(dbp, name, subdb, flags) + DB *dbp; + const char *name, *subdb; + u_int32_t flags; +{ + DBT namedbt; + DB_ENV *dbenv; + DB_LOCK remove_lock; + DB_LSN newlsn; + int ret, t_ret, (*callback_func) __P((DB *, void *)); + char *backup, *real_back, *real_name; + void *cookie; + + dbenv = dbp->dbenv; + ret = 0; + backup = real_back = real_name = NULL; + + PANIC_CHECK(dbenv); + /* + * Cannot use DB_ILLEGAL_AFTER_OPEN here because that returns + * and we cannot return, but must deal with the error and destroy + * the handle anyway. + */ + if (F_ISSET(dbp, DB_OPEN_CALLED)) { + ret = __db_mi_open(dbp->dbenv, "remove", 1); + goto err_close; + } + + /* Validate arguments. */ + if ((ret = __db_removechk(dbp, flags)) != 0) + goto err_close; + + /* + * Subdatabases. + */ + if (subdb != NULL) { + /* Subdatabases must be created in named files. */ + if (name == NULL) { + __db_err(dbenv, + "multiple databases cannot be created in temporary files"); + goto err_close; + } + return (__db_subdb_remove(dbp, name, subdb)); + } + + if ((ret = dbp->open(dbp, + name, NULL, DB_UNKNOWN, DB_RDWRMASTER, 0)) != 0) + goto err_close; + + if (LOGGING_ON(dbenv) && (ret = __log_file_lock(dbp)) != 0) + goto err_close; + + if ((ret = dbp->sync(dbp, 0)) != 0) + goto err_close; + + /* Start the transaction and log the delete. 
*/ + if (TXN_ON(dbenv) && (ret = __db_metabegin(dbp, &remove_lock)) != 0) + goto err_close; + + if (LOGGING_ON(dbenv)) { + memset(&namedbt, 0, sizeof(namedbt)); + namedbt.data = (char *)name; + namedbt.size = strlen(name) + 1; + + if ((ret = __crdel_delete_log(dbenv, + dbp->open_txn, &newlsn, DB_FLUSH, + dbp->log_fileid, &namedbt)) != 0) { + __db_err(dbenv, + "%s: %s", name, db_strerror(ret)); + goto err; + } + } + + /* Find the real name of the file. */ + if ((ret = __db_appname(dbenv, + DB_APP_DATA, NULL, name, 0, NULL, &real_name)) != 0) + goto err; + + /* + * XXX + * We don't bother to open the file and call __memp_fremove on the mpf. + * There is a potential race here. It is at least possible that, if + * the unique filesystem ID (dev/inode pair on UNIX) is reallocated + * within a second (the granularity of the fileID timestamp), a new + * file open will get the same fileID as the file being "removed". + * We may actually want to open the file and call __memp_fremove on + * the mpf to get around this. + */ + + /* Create name for backup file. */ + if (TXN_ON(dbenv)) { + if ((ret = + __db_backup_name(dbenv, name, &backup, &newlsn)) != 0) + goto err; + if ((ret = __db_appname(dbenv, + DB_APP_DATA, NULL, backup, 0, NULL, &real_back)) != 0) + goto err; + } + + callback_func = __db_remove_callback; + cookie = real_back; + DB_TEST_RECOVERY(dbp, DB_TEST_PRERENAME, ret, name); + if (dbp->db_am_remove != NULL && + (ret = dbp->db_am_remove(dbp, + name, subdb, &newlsn, &callback_func, &cookie)) != 0) + goto err; + /* + * On Windows, the underlying file must be closed to perform a remove. + * Nothing later in __db_remove requires that it be open, and the + * dbp->close closes it anyway, so we just close it early. 
+ */ + (void)__memp_fremove(dbp->mpf); + if ((ret = memp_fclose(dbp->mpf)) != 0) + goto err; + dbp->mpf = NULL; + + if (TXN_ON(dbenv)) + ret = __os_rename(dbenv, real_name, real_back); + else + ret = __os_unlink(dbenv, real_name); + + DB_TEST_RECOVERY(dbp, DB_TEST_POSTRENAME, ret, name); + +err: +DB_TEST_RECOVERY_LABEL + /* + * End the transaction, committing the transaction if we were + * successful, aborting otherwise. + */ + if (dbp->open_txn != NULL && (t_ret = __db_metaend(dbp, &remove_lock, + ret == 0, callback_func, cookie)) != 0 && ret == 0) + ret = t_ret; + + /* FALLTHROUGH */ + +err_close: + if (real_back != NULL) + __os_freestr(real_back); + if (real_name != NULL) + __os_freestr(real_name); + if (backup != NULL) + __os_freestr(backup); + + /* We no longer have an mpool, so syncing would be disastrous. */ + if ((t_ret = dbp->close(dbp, DB_NOSYNC)) != 0 && ret == 0) + ret = t_ret; + + return (ret); +} + +/* + * __db_subdb_remove -- + * Remove a subdatabase. + */ +static int +__db_subdb_remove(dbp, name, subdb) + DB *dbp; + const char *name, *subdb; +{ + DB *mdbp; + DBC *dbc; + DB_ENV *dbenv; + DB_LOCK remove_lock; + db_pgno_t meta_pgno; + int ret, t_ret; + + mdbp = NULL; + dbc = NULL; + dbenv = dbp->dbenv; + + /* Start the transaction. */ + if (TXN_ON(dbenv) && (ret = __db_metabegin(dbp, &remove_lock)) != 0) + goto err_close; + + /* + * Open the subdatabase. We can use the user's DB handle for this + * purpose, I think. + */ + if ((ret = __db_open(dbp, name, subdb, DB_UNKNOWN, 0, 0)) != 0) + goto err; + + /* Free up the pages in the subdatabase. 
*/ + switch (dbp->type) { + case DB_BTREE: + case DB_RECNO: + if ((ret = __bam_reclaim(dbp, dbp->open_txn)) != 0) + goto err; + break; + case DB_HASH: + if ((ret = __ham_reclaim(dbp, dbp->open_txn)) != 0) + goto err; + break; + default: + ret = __db_unknown_type(dbp->dbenv, + "__db_subdb_remove", dbp->type); + goto err; + } + + /* + * Remove the entry from the main database and free the subdatabase + * metadata page. + */ + if ((ret = __db_master_open(dbp, name, 0, 0, &mdbp)) != 0) + goto err; + + if ((ret = __db_master_update(mdbp, + subdb, dbp->type, &meta_pgno, MU_REMOVE, NULL, 0)) != 0) + goto err; + +err: /* + * End the transaction, committing the transaction if we were + * successful, aborting otherwise. + */ + if (dbp->open_txn != NULL && (t_ret = __db_metaend(dbp, + &remove_lock, ret == 0, NULL, NULL)) != 0 && ret == 0) + ret = t_ret; + +err_close: + /* + * Close the user's DB handle -- do this LAST to avoid smashing the + * the transaction information. + */ + if ((t_ret = dbp->close(dbp, 0)) != 0 && ret == 0) + ret = t_ret; + + if (mdbp != NULL && (t_ret = mdbp->close(mdbp, 0)) != 0 && ret == 0) + ret = t_ret; + + return (ret); +} + +/* + * __db_rename + * Rename method for DB. + * + * PUBLIC: int __db_rename __P((DB *, + * PUBLIC: const char *, const char *, const char *, u_int32_t)); + */ +int +__db_rename(dbp, filename, subdb, newname, flags) + DB *dbp; + const char *filename, *subdb, *newname; + u_int32_t flags; +{ + DBT namedbt, newnamedbt; + DB_ENV *dbenv; + DB_LOCK remove_lock; + DB_LSN newlsn; + char *real_name, *real_newname; + int ret, t_ret; + + dbenv = dbp->dbenv; + ret = 0; + real_name = real_newname = NULL; + + PANIC_CHECK(dbenv); + /* + * Cannot use DB_ILLEGAL_AFTER_OPEN here because that returns + * and we cannot return, but must deal with the error and destroy + * the handle anyway. 
+ */ + if (F_ISSET(dbp, DB_OPEN_CALLED)) { + ret = __db_mi_open(dbp->dbenv, "rename", 1); + goto err_close; + } + + /* Validate arguments -- has same rules as remove. */ + if ((ret = __db_removechk(dbp, flags)) != 0) + goto err_close; + + /* + * Subdatabases. + */ + if (subdb != NULL) { + if (filename == NULL) { + __db_err(dbenv, + "multiple databases cannot be created in temporary files"); + goto err_close; + } + return (__db_subdb_rename(dbp, filename, subdb, newname)); + } + + if ((ret = dbp->open(dbp, + filename, NULL, DB_UNKNOWN, DB_RDWRMASTER, 0)) != 0) + goto err_close; + + if (LOGGING_ON(dbenv) && (ret = __log_file_lock(dbp)) != 0) + goto err_close; + + if ((ret = dbp->sync(dbp, 0)) != 0) + goto err_close; + + /* Start the transaction and log the rename. */ + if (TXN_ON(dbenv) && (ret = __db_metabegin(dbp, &remove_lock)) != 0) + goto err_close; + + if (LOGGING_ON(dbenv)) { + memset(&namedbt, 0, sizeof(namedbt)); + namedbt.data = (char *)filename; + namedbt.size = strlen(filename) + 1; + + memset(&newnamedbt, 0, sizeof(namedbt)); + newnamedbt.data = (char *)newname; + newnamedbt.size = strlen(newname) + 1; + + if ((ret = __crdel_rename_log(dbenv, dbp->open_txn, + &newlsn, 0, dbp->log_fileid, &namedbt, &newnamedbt)) != 0) { + __db_err(dbenv, "%s: %s", filename, db_strerror(ret)); + goto err; + } + + if ((ret = __log_filelist_update(dbenv, dbp, + dbp->log_fileid, newname, NULL)) != 0) + goto err; + } + + /* Find the real name of the file. */ + if ((ret = __db_appname(dbenv, + DB_APP_DATA, NULL, filename, 0, NULL, &real_name)) != 0) + goto err; + + /* Find the real newname of the file. */ + if ((ret = __db_appname(dbenv, + DB_APP_DATA, NULL, newname, 0, NULL, &real_newname)) != 0) + goto err; + + /* + * It is an error to rename a file over one that already exists, + * as that wouldn't be transaction-safe. 
+ */ + if (__os_exists(real_newname, NULL) == 0) { + ret = EEXIST; + __db_err(dbenv, "rename: file %s exists", real_newname); + goto err; + } + + DB_TEST_RECOVERY(dbp, DB_TEST_PRERENAME, ret, filename); + if (dbp->db_am_rename != NULL && + (ret = dbp->db_am_rename(dbp, filename, subdb, newname)) != 0) + goto err; + /* + * We have to flush the cache for a couple of reasons. First, the + * underlying MPOOLFILE maintains a "name" that unrelated processes + * can use to open the file in order to flush pages, and that name + * is about to be wrong. Second, on Windows the unique file ID is + * generated from the file's name, not other file information as is + * the case on UNIX, and so a subsequent open of the old file name + * could conceivably result in a matching "unique" file ID. + */ + if ((ret = __memp_fremove(dbp->mpf)) != 0) + goto err; + + /* + * On Windows, the underlying file must be closed to perform a rename. + * Nothing later in __db_rename requires that it be open, and the call + * to dbp->close closes it anyway, so we just close it early. + */ + if ((ret = memp_fclose(dbp->mpf)) != 0) + goto err; + dbp->mpf = NULL; + + ret = __os_rename(dbenv, real_name, real_newname); + DB_TEST_RECOVERY(dbp, DB_TEST_POSTRENAME, ret, newname); + +DB_TEST_RECOVERY_LABEL +err: if (dbp->open_txn != NULL && (t_ret = __db_metaend(dbp, + &remove_lock, ret == 0, NULL, NULL)) != 0 && ret == 0) + ret = t_ret; + +err_close: + /* We no longer have an mpool, so syncing would be disastrous. */ + dbp->close(dbp, DB_NOSYNC); + if (real_name != NULL) + __os_freestr(real_name); + if (real_newname != NULL) + __os_freestr(real_newname); + + return (ret); +} + +/* + * __db_subdb_rename -- + * Rename a subdatabase. + */ +static int +__db_subdb_rename(dbp, name, subdb, newname) + DB *dbp; + const char *name, *subdb, *newname; +{ + DB *mdbp; + DBC *dbc; + DB_ENV *dbenv; + DB_LOCK remove_lock; + int ret, t_ret; + + mdbp = NULL; + dbc = NULL; + dbenv = dbp->dbenv; + + /* Start the transaction. 
*/ + if (TXN_ON(dbenv) && (ret = __db_metabegin(dbp, &remove_lock)) != 0) + goto err_close; + + /* + * Open the subdatabase. We can use the user's DB handle for this + * purpose, I think. + */ + if ((ret = __db_open(dbp, name, subdb, DB_UNKNOWN, 0, 0)) != 0) + goto err; + + /* + * Rename the entry in the main database. + */ + if ((ret = __db_master_open(dbp, name, 0, 0, &mdbp)) != 0) + goto err; + + if ((ret = __db_master_update(mdbp, + subdb, dbp->type, NULL, MU_RENAME, newname, 0)) != 0) + goto err; + +err: /* + * End the transaction, committing the transaction if we were + * successful, aborting otherwise. + */ + if (dbp->open_txn != NULL && (t_ret = __db_metaend(dbp, + &remove_lock, ret == 0, NULL, NULL)) != 0 && ret == 0) + ret = t_ret; + +err_close: + /* + * Close the user's DB handle -- do this LAST to avoid smashing the + * transaction information. + */ + if ((t_ret = dbp->close(dbp, 0)) != 0 && ret == 0) + ret = t_ret; + + if (mdbp != NULL && (t_ret = mdbp->close(mdbp, 0)) != 0 && ret == 0) + ret = t_ret; + + return (ret); +} + +/* + * __db_metabegin -- + * + * Begin a meta-data operation. This involves doing any required locking, + * potentially beginning a transaction and then telling the caller if you + * did or did not begin the transaction. + * + * The writing flag indicates if the caller is actually allowing creates + * or doing deletes (i.e., if the caller is opening and not creating, then + * we don't need to do any of this). + * PUBLIC: int __db_metabegin __P((DB *, DB_LOCK *)); + */ +int +__db_metabegin(dbp, lockp) + DB *dbp; + DB_LOCK *lockp; +{ + DB_ENV *dbenv; + DBT dbplock; + u_int32_t locker, lockval; + int ret; + + dbenv = dbp->dbenv; + + lockp->off = LOCK_INVALID; + + /* + * There is no single place where we can know that we are or are not + * going to be creating any files and/or subdatabases, so we will + * always begin a transaction when we start creating one.
If we later + * discover that this was unnecessary, we will abort the transaction. + * Recovery is written so that if we log a file create, but then + * discover that we didn't have to do it, we recover correctly. The + * file recovery design document has details. + * + * We need to single thread all create and delete operations, so if we + * are running with locking, we must obtain a lock. We use lock_id to + * generate a unique locker id and use a handcrafted DBT as the object + * on which we are locking. + */ + if (LOCKING_ON(dbenv)) { + if ((ret = lock_id(dbenv, &locker)) != 0) + return (ret); + lockval = 0; + dbplock.data = &lockval; + dbplock.size = sizeof(lockval); + if ((ret = lock_get(dbenv, + locker, 0, &dbplock, DB_LOCK_WRITE, lockp)) != 0) + return (ret); + } + + return (txn_begin(dbenv, NULL, &dbp->open_txn, 0)); +} + +/* + * __db_metaend -- + * End a meta-data operation. + * PUBLIC: int __db_metaend __P((DB *, + * PUBLIC: DB_LOCK *, int, int (*)(DB *, void *), void *)); + */ +int +__db_metaend(dbp, lockp, commit, callback, cookie) + DB *dbp; + DB_LOCK *lockp; + int commit, (*callback) __P((DB *, void *)); + void *cookie; +{ + DB_ENV *dbenv; + int ret, t_ret; + + ret = 0; + dbenv = dbp->dbenv; + + /* End the transaction. */ + if (commit) { + if ((ret = txn_commit(dbp->open_txn, DB_TXN_SYNC)) == 0) { + /* + * Unlink any underlying file, we've committed the + * transaction. + */ + if (callback != NULL) + ret = callback(dbp, cookie); + } + } else if ((t_ret = txn_abort(dbp->open_txn)) && ret == 0) + ret = t_ret; + + /* Release our lock. */ + if (lockp->off != LOCK_INVALID && + (t_ret = lock_put(dbenv, lockp)) != 0 && ret == 0) + ret = t_ret; + + return (ret); +} + +/* + * __db_log_page + * Log a meta-data or root page during a create operation. 
+ * + * PUBLIC: int __db_log_page __P((DB *, + * PUBLIC: const char *, DB_LSN *, db_pgno_t, PAGE *)); + */ +int +__db_log_page(dbp, name, lsn, pgno, page) + DB *dbp; + const char *name; + DB_LSN *lsn; + db_pgno_t pgno; + PAGE *page; +{ + DBT name_dbt, page_dbt; + DB_LSN new_lsn; + int ret; + + if (dbp->open_txn == NULL) + return (0); + + memset(&page_dbt, 0, sizeof(page_dbt)); + page_dbt.size = dbp->pgsize; + page_dbt.data = page; + if (pgno == PGNO_BASE_MD) { + /* + * !!! + * Make sure that we properly handle a null name. The old + * Tcl sent us pathnames of the form ""; it may be the case + * that the new Tcl doesn't do that, so we can get rid of + * the second check here. + */ + memset(&name_dbt, 0, sizeof(name_dbt)); + name_dbt.data = (char *)name; + if (name == NULL || *name == '\0') + name_dbt.size = 0; + else + name_dbt.size = strlen(name) + 1; + + ret = __crdel_metapage_log(dbp->dbenv, + dbp->open_txn, &new_lsn, DB_FLUSH, + dbp->log_fileid, &name_dbt, pgno, &page_dbt); + } else + ret = __crdel_metasub_log(dbp->dbenv, dbp->open_txn, + &new_lsn, 0, dbp->log_fileid, pgno, &page_dbt, lsn); + + if (ret == 0) + page->lsn = new_lsn; + return (ret); +} + +/* + * __db_backup_name + * Create the backup file name for a given file. + * + * PUBLIC: int __db_backup_name __P((DB_ENV *, + * PUBLIC: const char *, char **, DB_LSN *)); + */ +#undef BACKUP_PREFIX +#define BACKUP_PREFIX "__db." + +#undef MAX_LSN_TO_TEXT +#define MAX_LSN_TO_TEXT 21 +int +__db_backup_name(dbenv, name, backup, lsn) + DB_ENV *dbenv; + const char *name; + char **backup; + DB_LSN *lsn; +{ + size_t len; + int plen, ret; + char *p, *retp; + + len = strlen(name) + strlen(BACKUP_PREFIX) + MAX_LSN_TO_TEXT + 1; + + if ((ret = __os_malloc(dbenv, len, NULL, &retp)) != 0) + return (ret); + + /* + * Create the name. Backup file names are of the form: + * + * __db.name.0x[lsn-file].0x[lsn-offset] + * + * which guarantees uniqueness. + * + * However, name may contain an env-relative path in it. 
+ * In that case, put the __db. after the last portion of + * the pathname. + */ + if ((p = __db_rpath(name)) == NULL) + snprintf(retp, len, + "%s%s.0x%x0x%x", BACKUP_PREFIX, name, + lsn->file, lsn->offset); + else { + plen = p - name + 1; + p++; + snprintf(retp, len, + "%.*s%s%s.0x%x0x%x", plen, name, BACKUP_PREFIX, p, + lsn->file, lsn->offset); + } + + *backup = retp; + return (0); +} + +/* + * __db_remove_callback -- + * Callback function -- on file remove commit, it unlinks the backing + * file. + */ +static int +__db_remove_callback(dbp, cookie) + DB *dbp; + void *cookie; +{ + return (__os_unlink(dbp->dbenv, cookie)); +} + +/* + * __dblist_get -- + * Get the first element of dbenv->dblist with + * dbp->adj_fileid matching adjid. + * + * PUBLIC: DB *__dblist_get __P((DB_ENV *, u_int32_t)); + */ +DB * +__dblist_get(dbenv, adjid) + DB_ENV *dbenv; + u_int32_t adjid; +{ + DB *dbp; + + for (dbp = LIST_FIRST(&dbenv->dblist); + dbp != NULL && dbp->adj_fileid != adjid; + dbp = LIST_NEXT(dbp, dblistlinks)) + ; + + return (dbp); +} + +#if CONFIG_TEST +/* + * __db_testcopy + * Create a copy of all backup files and our "main" DB. 
+ * + * PUBLIC: int __db_testcopy __P((DB *, const char *)); + */ +int +__db_testcopy(dbp, name) + DB *dbp; + const char *name; +{ + if (dbp->type == DB_QUEUE) + return (__qam_testdocopy(dbp, name)); + else + return (__db_testdocopy(dbp, name)); +} + +static int +__qam_testdocopy(dbp, name) + DB *dbp; + const char *name; +{ + QUEUE_FILELIST *filelist, *fp; + char buf[256], *dir; + int ret; + + filelist = NULL; + if ((ret = __db_testdocopy(dbp, name)) != 0) + return (ret); + if (dbp->mpf != NULL && + (ret = __qam_gen_filelist(dbp, &filelist)) != 0) + return (ret); + + if (filelist == NULL) + return (0); + dir = ((QUEUE *)dbp->q_internal)->dir; + for (fp = filelist; fp->mpf != NULL; fp++) { + snprintf(buf, sizeof(buf), QUEUE_EXTENT, dir, name, fp->id); + if ((ret = __db_testdocopy(dbp, buf)) != 0) + return (ret); + } + + __os_free(filelist, 0); + return (0); +} + +/* + * __db_testdocopy + * Create a copy of all backup files and our "main" DB. + * + */ +static int +__db_testdocopy(dbp, name) + DB *dbp; + const char *name; +{ + size_t len; + int dircnt, i, ret; + char **namesp, *backup, *copy, *dir, *p, *real_name; + real_name = NULL; + /* Get the real backing file name. */ + if ((ret = __db_appname(dbp->dbenv, + DB_APP_DATA, NULL, name, 0, NULL, &real_name)) != 0) + return (ret); + + copy = backup = NULL; + namesp = NULL; + + /* + * Maximum size of file, including adding a ".afterop". + */ + len = strlen(real_name) + strlen(BACKUP_PREFIX) + MAX_LSN_TO_TEXT + 9; + + if ((ret = __os_malloc(dbp->dbenv, len, NULL, &copy)) != 0) + goto out; + + if ((ret = __os_malloc(dbp->dbenv, len, NULL, &backup)) != 0) + goto out; + + /* + * First copy the file itself. + */ + snprintf(copy, len, "%s.afterop", real_name); + __db_makecopy(real_name, copy); + + if ((ret = __os_strdup(dbp->dbenv, real_name, &dir)) != 0) + goto out; + __os_freestr(real_name); + real_name = NULL; + /* + * Create the name.
Backup file names are of the form: + * + * __db.name.0x[lsn-file].0x[lsn-offset] + * + * which guarantees uniqueness. We want to look for the + * backup name, followed by a '.0x' (so that if they have + * files named, say, 'a' and 'abc' we won't match 'abc' when + * looking for 'a'. + */ + snprintf(backup, len, "%s%s.0x", BACKUP_PREFIX, name); + + /* + * We need the directory path to do the __os_dirlist. + */ + p = __db_rpath(dir); + if (p != NULL) + *p = '\0'; + ret = __os_dirlist(dbp->dbenv, dir, &namesp, &dircnt); +#if DIAGNOSTIC + /* + * XXX + * To get the memory guard code to work because it uses strlen and we + * just moved the end of the string somewhere sooner. This causes the + * guard code to fail because it looks at one byte past the end of the + * string. + */ + *p = '/'; +#endif + __os_freestr(dir); + if (ret != 0) + goto out; + for (i = 0; i < dircnt; i++) { + /* + * Need to check if it is a backup file for this. + * No idea what namesp[i] may be or how long, so + * must use strncmp and not memcmp. We don't want + * to use strcmp either because we are only matching + * the first part of the real file's name. We don't + * know its LSN's. + */ + if (strncmp(namesp[i], backup, strlen(backup)) == 0) { + if ((ret = __db_appname(dbp->dbenv, DB_APP_DATA, + NULL, namesp[i], 0, NULL, &real_name)) != 0) + goto out; + + /* + * This should not happen. Check that old + * .afterop files aren't around. + * If so, just move on. 
+ */ + if (strstr(real_name, ".afterop") != NULL) { + __os_freestr(real_name); + real_name = NULL; + continue; + } + snprintf(copy, len, "%s.afterop", real_name); + __db_makecopy(real_name, copy); + __os_freestr(real_name); + real_name = NULL; + } + } +out: + if (backup != NULL) + __os_freestr(backup); + if (copy != NULL) + __os_freestr(copy); + if (namesp != NULL) + __os_dirfree(namesp, dircnt); + if (real_name != NULL) + __os_freestr(real_name); + return (ret); +} + +static void +__db_makecopy(src, dest) + const char *src, *dest; +{ + DB_FH rfh, wfh; + size_t rcnt, wcnt; + char *buf; + + memset(&rfh, 0, sizeof(rfh)); + memset(&wfh, 0, sizeof(wfh)); + + if (__os_malloc(NULL, 1024, NULL, &buf) != 0) + return; + + if (__os_open(NULL, + src, DB_OSO_RDONLY, __db_omode("rw----"), &rfh) != 0) + goto err; + if (__os_open(NULL, dest, + DB_OSO_CREATE | DB_OSO_TRUNC, __db_omode("rw----"), &wfh) != 0) + goto err; + + for (;;) + if (__os_read(NULL, &rfh, buf, 1024, &rcnt) < 0 || rcnt == 0 || + __os_write(NULL, &wfh, buf, rcnt, &wcnt) < 0 || wcnt != rcnt) + break; + +err: __os_free(buf, 1024); + if (F_ISSET(&rfh, DB_FH_VALID)) + __os_closehandle(&rfh); + if (F_ISSET(&wfh, DB_FH_VALID)) + __os_closehandle(&wfh); +} +#endif diff --git a/bdb/db/db.src b/bdb/db/db.src new file mode 100644 index 00000000000..b695e1360c5 --- /dev/null +++ b/bdb/db/db.src @@ -0,0 +1,178 @@ +/*- + * See the file LICENSE for redistribution information. + * + * Copyright (c) 1996, 1997, 1998, 1999, 2000 + * Sleepycat Software. All rights reserved. 
+ * + * $Id: db.src,v 11.8 2000/02/17 20:24:07 bostic Exp $ + */ + +PREFIX db + +INCLUDE #include "db_config.h" +INCLUDE +INCLUDE #ifndef NO_SYSTEM_INCLUDES +INCLUDE #include <sys/types.h> +INCLUDE +INCLUDE #include <ctype.h> +INCLUDE #include <errno.h> +INCLUDE #include <string.h> +INCLUDE #endif +INCLUDE +INCLUDE #include "db_int.h" +INCLUDE #include "db_page.h" +INCLUDE #include "db_dispatch.h" +INCLUDE #include "db_am.h" +INCLUDE #include "txn.h" +INCLUDE + +/* + * addrem -- Add or remove an entry from a duplicate page. + * + * opcode: identifies if this is an add or delete. + * fileid: file identifier of the file being modified. + * pgno: duplicate page number. + * indx: location at which to insert or delete. + * nbytes: number of bytes added/removed to/from the page. + * hdr: header for the data item. + * dbt: data that is deleted or is to be added. + * pagelsn: former lsn of the page. + * + * If the hdr was NULL then, the dbt is a regular B_KEYDATA. + * If the dbt was NULL then the hdr is a complete item to be + * pasted on the page. + */ +BEGIN addrem 41 +ARG opcode u_int32_t lu +ARG fileid int32_t ld +ARG pgno db_pgno_t lu +ARG indx u_int32_t lu +ARG nbytes size_t lu +DBT hdr DBT s +DBT dbt DBT s +POINTER pagelsn DB_LSN * lu +END + +/* + * split -- Handles the split of a duplicate page. + * + * opcode: defines whether we are splitting from or splitting onto + * fileid: file identifier of the file being modified. + * pgno: page number being split. + * pageimage: entire page contents. + * pagelsn: former lsn of the page. + */ +DEPRECATED split 42 +ARG opcode u_int32_t lu +ARG fileid int32_t ld +ARG pgno db_pgno_t lu +DBT pageimage DBT s +POINTER pagelsn DB_LSN * lu +END + +/* + * big -- Handles addition and deletion of big key/data items. + * + * opcode: identifies get/put. + * fileid: file identifier of the file being modified. + * pgno: page onto which data is being added/removed. + * prev_pgno: the page before the one we are logging. 
+ * next_pgno: the page after the one we are logging. + * dbt: data being written onto the page. + * pagelsn: former lsn of the orig_page. + * prevlsn: former lsn of the prev_pgno. + * nextlsn: former lsn of the next_pgno. This is not currently used, but + * may be used later if we actually do overwrites of big key/ + * data items in place. + */ +BEGIN big 43 +ARG opcode u_int32_t lu +ARG fileid int32_t ld +ARG pgno db_pgno_t lu +ARG prev_pgno db_pgno_t lu +ARG next_pgno db_pgno_t lu +DBT dbt DBT s +POINTER pagelsn DB_LSN * lu +POINTER prevlsn DB_LSN * lu +POINTER nextlsn DB_LSN * lu +END + +/* + * ovref -- Handles increment/decrement of overflow page reference count. + * + * fileid: identifies the file being modified. + * pgno: page number whose ref count is being incremented/decremented. + * adjust: the adjustment being made. + * lsn: the page's original lsn. + */ +BEGIN ovref 44 +ARG fileid int32_t ld +ARG pgno db_pgno_t lu +ARG adjust int32_t ld +POINTER lsn DB_LSN * lu +END + +/* + * relink -- Handles relinking around a page. + * + * opcode: indicates if this is an addpage or delete page + * pgno: the page being changed. + * lsn: the page's original lsn. + * prev: the previous page. + * lsn_prev: the previous page's original lsn. + * next: the next page. + * lsn_next: the next page's original lsn. + */ +BEGIN relink 45 +ARG opcode u_int32_t lu +ARG fileid int32_t ld +ARG pgno db_pgno_t lu +POINTER lsn DB_LSN * lu +ARG prev db_pgno_t lu +POINTER lsn_prev DB_LSN * lu +ARG next db_pgno_t lu +POINTER lsn_next DB_LSN * lu +END + +/* + * Addpage -- Handles adding a new duplicate page onto the end of + * an existing duplicate page. + * fileid: identifies the file being changed. + * pgno: page number to which a new page is being added. + * lsn: lsn of pgno + * nextpgno: new page number being added.
+ * nextlsn: lsn of nextpgno; + */ +DEPRECATED addpage 46 +ARG fileid int32_t ld +ARG pgno db_pgno_t lu +POINTER lsn DB_LSN * lu +ARG nextpgno db_pgno_t lu +POINTER nextlsn DB_LSN * lu +END + +/* + * Debug -- log an operation upon entering an access method. + * op: Operation (cursor, c_close, c_get, c_put, c_del, + * get, put, delete). + * fileid: identifies the file being acted upon. + * key: key parameter + * data: data parameter + * flags: flags parameter + */ +BEGIN debug 47 +DBT op DBT s +ARG fileid int32_t ld +DBT key DBT s +DBT data DBT s +ARG arg_flags u_int32_t lu +END + +/* + * noop -- do nothing, but get an LSN. + */ +BEGIN noop 48 +ARG fileid int32_t ld +ARG pgno db_pgno_t lu +POINTER prevlsn DB_LSN * lu +END diff --git a/bdb/db/db_am.c b/bdb/db/db_am.c new file mode 100644 index 00000000000..2d224566904 --- /dev/null +++ b/bdb/db/db_am.c @@ -0,0 +1,511 @@ +/*- + * See the file LICENSE for redistribution information. + * + * Copyright (c) 1998, 1999, 2000 + * Sleepycat Software. All rights reserved. + */ + +#include "db_config.h" + +#ifndef lint +static const char revid[] = "$Id: db_am.c,v 11.42 2001/01/11 18:19:50 bostic Exp $"; +#endif /* not lint */ + +#ifndef NO_SYSTEM_INCLUDES +#include <sys/types.h> + +#include <string.h> +#endif + +#include "db_int.h" +#include "db_page.h" +#include "db_shash.h" +#include "btree.h" +#include "hash.h" +#include "qam.h" +#include "lock.h" +#include "mp.h" +#include "txn.h" +#include "db_am.h" +#include "db_ext.h" + +/* + * __db_cursor -- + * Allocate and return a cursor. + * + * PUBLIC: int __db_cursor __P((DB *, DB_TXN *, DBC **, u_int32_t)); + */ +int +__db_cursor(dbp, txn, dbcp, flags) + DB *dbp; + DB_TXN *txn; + DBC **dbcp; + u_int32_t flags; +{ + DB_ENV *dbenv; + DBC *dbc; + db_lockmode_t mode; + u_int32_t op; + int ret; + + dbenv = dbp->dbenv; + + PANIC_CHECK(dbenv); + DB_ILLEGAL_BEFORE_OPEN(dbp, "DB->cursor"); + + /* Check for invalid flags.
*/ + if ((ret = __db_cursorchk(dbp, flags, F_ISSET(dbp, DB_AM_RDONLY))) != 0) + return (ret); + + if ((ret = + __db_icursor(dbp, txn, dbp->type, PGNO_INVALID, 0, dbcp)) != 0) + return (ret); + dbc = *dbcp; + + /* + * If this is CDB, do all the locking in the interface, which is + * right here. + */ + if (CDB_LOCKING(dbenv)) { + op = LF_ISSET(DB_OPFLAGS_MASK); + mode = (op == DB_WRITELOCK) ? DB_LOCK_WRITE : + ((op == DB_WRITECURSOR) ? DB_LOCK_IWRITE : DB_LOCK_READ); + if ((ret = lock_get(dbenv, dbc->locker, 0, + &dbc->lock_dbt, mode, &dbc->mylock)) != 0) { + (void)__db_c_close(dbc); + return (ret); + } + if (op == DB_WRITECURSOR) + F_SET(dbc, DBC_WRITECURSOR); + if (op == DB_WRITELOCK) + F_SET(dbc, DBC_WRITER); + } + + return (0); +} + +/* + * __db_icursor -- + * Internal version of __db_cursor. If dbcp is + * non-NULL it is assumed to point to an area to + * initialize as a cursor. + * + * PUBLIC: int __db_icursor + * PUBLIC: __P((DB *, DB_TXN *, DBTYPE, db_pgno_t, int, DBC **)); + */ +int +__db_icursor(dbp, txn, dbtype, root, is_opd, dbcp) + DB *dbp; + DB_TXN *txn; + DBTYPE dbtype; + db_pgno_t root; + int is_opd; + DBC **dbcp; +{ + DBC *dbc, *adbc; + DBC_INTERNAL *cp; + DB_ENV *dbenv; + int allocated, ret; + + dbenv = dbp->dbenv; + allocated = 0; + + /* + * Take one from the free list if it's available. Take only the + * right type. With off page dups we may have different kinds + * of cursors on the queue for a single database. + */ + MUTEX_THREAD_LOCK(dbenv, dbp->mutexp); + for (dbc = TAILQ_FIRST(&dbp->free_queue); + dbc != NULL; dbc = TAILQ_NEXT(dbc, links)) + if (dbtype == dbc->dbtype) { + TAILQ_REMOVE(&dbp->free_queue, dbc, links); + dbc->flags = 0; + break; + } + MUTEX_THREAD_UNLOCK(dbenv, dbp->mutexp); + + if (dbc == NULL) { + if ((ret = __os_calloc(dbp->dbenv, 1, sizeof(DBC), &dbc)) != 0) + return (ret); + allocated = 1; + dbc->flags = 0; + + dbc->dbp = dbp; + + /* Set up locking information. 
*/ + if (LOCKING_ON(dbenv)) { + /* + * If we are not threaded, then there is no need to + * create new locker ids. We know that no one else + * is running concurrently using this DB, so we can + * take a peek at any cursors on the active queue. + */ + if (!DB_IS_THREADED(dbp) && + (adbc = TAILQ_FIRST(&dbp->active_queue)) != NULL) + dbc->lid = adbc->lid; + else + if ((ret = lock_id(dbenv, &dbc->lid)) != 0) + goto err; + + memcpy(dbc->lock.fileid, dbp->fileid, DB_FILE_ID_LEN); + if (CDB_LOCKING(dbenv)) { + if (F_ISSET(dbenv, DB_ENV_CDB_ALLDB)) { + /* + * If we are doing a single lock per + * environment, set up the global + * lock object just like we do to + * single thread creates. + */ + DB_ASSERT(sizeof(db_pgno_t) == + sizeof(u_int32_t)); + dbc->lock_dbt.size = sizeof(u_int32_t); + dbc->lock_dbt.data = &dbc->lock.pgno; + dbc->lock.pgno = 0; + } else { + dbc->lock_dbt.size = DB_FILE_ID_LEN; + dbc->lock_dbt.data = dbc->lock.fileid; + } + } else { + dbc->lock.type = DB_PAGE_LOCK; + dbc->lock_dbt.size = sizeof(dbc->lock); + dbc->lock_dbt.data = &dbc->lock; + } + } + /* Init the DBC internal structure. */ + switch (dbtype) { + case DB_BTREE: + case DB_RECNO: + if ((ret = __bam_c_init(dbc, dbtype)) != 0) + goto err; + break; + case DB_HASH: + if ((ret = __ham_c_init(dbc)) != 0) + goto err; + break; + case DB_QUEUE: + if ((ret = __qam_c_init(dbc)) != 0) + goto err; + break; + default: + ret = __db_unknown_type(dbp->dbenv, + "__db_icursor", dbtype); + goto err; + } + + cp = dbc->internal; + } + + /* Refresh the DBC structure. */ + dbc->dbtype = dbtype; + + if ((dbc->txn = txn) == NULL) + dbc->locker = dbc->lid; + else { + dbc->locker = txn->txnid; + txn->cursors++; + } + + if (is_opd) + F_SET(dbc, DBC_OPD); + if (F_ISSET(dbp, DB_AM_RECOVER)) + F_SET(dbc, DBC_RECOVER); + + /* Refresh the DBC internal structure. 
*/ + cp = dbc->internal; + cp->opd = NULL; + + cp->indx = 0; + cp->page = NULL; + cp->pgno = PGNO_INVALID; + cp->root = root; + + switch (dbtype) { + case DB_BTREE: + case DB_RECNO: + if ((ret = __bam_c_refresh(dbc)) != 0) + goto err; + break; + case DB_HASH: + case DB_QUEUE: + break; + default: + ret = __db_unknown_type(dbp->dbenv, "__db_icursor", dbp->type); + goto err; + } + + MUTEX_THREAD_LOCK(dbenv, dbp->mutexp); + TAILQ_INSERT_TAIL(&dbp->active_queue, dbc, links); + F_SET(dbc, DBC_ACTIVE); + MUTEX_THREAD_UNLOCK(dbenv, dbp->mutexp); + + *dbcp = dbc; + return (0); + +err: if (allocated) + __os_free(dbc, sizeof(*dbc)); + return (ret); +} + +#ifdef DEBUG +/* + * __db_cprint -- + * Display the current cursor list. + * + * PUBLIC: int __db_cprint __P((DB *)); + */ +int +__db_cprint(dbp) + DB *dbp; +{ + static const FN fn[] = { + { DBC_ACTIVE, "active" }, + { DBC_OPD, "off-page-dup" }, + { DBC_RECOVER, "recover" }, + { DBC_RMW, "read-modify-write" }, + { DBC_WRITECURSOR, "write cursor" }, + { DBC_WRITEDUP, "internally dup'ed write cursor" }, + { DBC_WRITER, "short-term write cursor" }, + { 0, NULL } + }; + DBC *dbc; + DBC_INTERNAL *cp; + char *s; + + MUTEX_THREAD_LOCK(dbp->dbenv, dbp->mutexp); + for (dbc = TAILQ_FIRST(&dbp->active_queue); + dbc != NULL; dbc = TAILQ_NEXT(dbc, links)) { + switch (dbc->dbtype) { + case DB_BTREE: + s = "btree"; + break; + case DB_HASH: + s = "hash"; + break; + case DB_RECNO: + s = "recno"; + break; + case DB_QUEUE: + s = "queue"; + break; + default: + DB_ASSERT(0); + return (1); + } + cp = dbc->internal; + fprintf(stderr, "%s/%#0lx: opd: %#0lx\n", + s, P_TO_ULONG(dbc), P_TO_ULONG(cp->opd)); + fprintf(stderr, "\ttxn: %#0lx lid: %lu locker: %lu\n", + P_TO_ULONG(dbc->txn), + (u_long)dbc->lid, (u_long)dbc->locker); + fprintf(stderr, "\troot: %lu page/index: %lu/%lu", + (u_long)cp->root, (u_long)cp->pgno, (u_long)cp->indx); + __db_prflags(dbc->flags, fn, stderr); + fprintf(stderr, "\n"); + + if (dbp->type == DB_BTREE) + __bam_cprint(dbc); + 
} + for (dbc = TAILQ_FIRST(&dbp->free_queue); + dbc != NULL; dbc = TAILQ_NEXT(dbc, links)) + fprintf(stderr, "free: %#0lx ", P_TO_ULONG(dbc)); + fprintf(stderr, "\n"); + MUTEX_THREAD_UNLOCK(dbp->dbenv, dbp->mutexp); + + return (0); +} +#endif /* DEBUG */ + +/* + * db_fd -- + * Return a file descriptor for flock'ing. + * + * PUBLIC: int __db_fd __P((DB *, int *)); + */ +int +__db_fd(dbp, fdp) + DB *dbp; + int *fdp; +{ + DB_FH *fhp; + int ret; + + PANIC_CHECK(dbp->dbenv); + DB_ILLEGAL_BEFORE_OPEN(dbp, "DB->fd"); + + /* + * XXX + * Truly spectacular layering violation. + */ + if ((ret = __mp_xxx_fh(dbp->mpf, &fhp)) != 0) + return (ret); + + if (F_ISSET(fhp, DB_FH_VALID)) { + *fdp = fhp->fd; + return (0); + } else { + *fdp = -1; + __db_err(dbp->dbenv, "DB does not have a valid file handle."); + return (ENOENT); + } +} + +/* + * __db_get -- + * Return a key/data pair. + * + * PUBLIC: int __db_get __P((DB *, DB_TXN *, DBT *, DBT *, u_int32_t)); + */ +int +__db_get(dbp, txn, key, data, flags) + DB *dbp; + DB_TXN *txn; + DBT *key, *data; + u_int32_t flags; +{ + DBC *dbc; + int mode, ret, t_ret; + + PANIC_CHECK(dbp->dbenv); + DB_ILLEGAL_BEFORE_OPEN(dbp, "DB->get"); + + if ((ret = __db_getchk(dbp, key, data, flags)) != 0) + return (ret); + + mode = 0; + if (flags == DB_CONSUME || flags == DB_CONSUME_WAIT) + mode = DB_WRITELOCK; + if ((ret = dbp->cursor(dbp, txn, &dbc, mode)) != 0) + return (ret); + + DEBUG_LREAD(dbc, txn, "__db_get", key, NULL, flags); + + /* + * The DBC_TRANSIENT flag indicates that we're just doing a + * single operation with this cursor, and that in case of + * error we don't need to restore it to its old position--we're + * going to close it right away. Thus, we can perform the get + * without duplicating the cursor, saving some cycles in this + * common case. + */ + F_SET(dbc, DBC_TRANSIENT); + + ret = dbc->c_get(dbc, key, data, + flags == 0 || flags == DB_RMW ? 
flags | DB_SET : flags); + + if ((t_ret = __db_c_close(dbc)) != 0 && ret == 0) + ret = t_ret; + + return (ret); +} + +/* + * __db_put -- + * Store a key/data pair. + * + * PUBLIC: int __db_put __P((DB *, DB_TXN *, DBT *, DBT *, u_int32_t)); + */ +int +__db_put(dbp, txn, key, data, flags) + DB *dbp; + DB_TXN *txn; + DBT *key, *data; + u_int32_t flags; +{ + DBC *dbc; + DBT tdata; + int ret, t_ret; + + PANIC_CHECK(dbp->dbenv); + DB_ILLEGAL_BEFORE_OPEN(dbp, "DB->put"); + + if ((ret = __db_putchk(dbp, key, data, + flags, F_ISSET(dbp, DB_AM_RDONLY), + F_ISSET(dbp, DB_AM_DUP) || F_ISSET(key, DB_DBT_DUPOK))) != 0) + return (ret); + + DB_CHECK_TXN(dbp, txn); + + if ((ret = dbp->cursor(dbp, txn, &dbc, DB_WRITELOCK)) != 0) + return (ret); + + /* + * See the comment in __db_get(). + * + * Note that the c_get in the DB_NOOVERWRITE case is safe to + * do with this flag set; if it errors in any way other than + * DB_NOTFOUND, we're going to close the cursor without doing + * anything else, and if it returns DB_NOTFOUND then it's safe + * to do a c_put(DB_KEYLAST) even if an access method moved the + * cursor, since that's not position-dependent. + */ + F_SET(dbc, DBC_TRANSIENT); + + DEBUG_LWRITE(dbc, txn, "__db_put", key, data, flags); + + if (flags == DB_NOOVERWRITE) { + flags = 0; + /* + * Set DB_DBT_USERMEM, this might be a threaded application and + * the flags checking will catch us. We don't want the actual + * data, so request a partial of length 0. + */ + memset(&tdata, 0, sizeof(tdata)); + F_SET(&tdata, DB_DBT_USERMEM | DB_DBT_PARTIAL); + + /* + * If we're doing page-level locking, set the read-modify-write + * flag, we're going to overwrite immediately. + */ + if ((ret = dbc->c_get(dbc, key, &tdata, + DB_SET | (STD_LOCKING(dbc) ? DB_RMW : 0))) == 0) + ret = DB_KEYEXIST; + else if (ret == DB_NOTFOUND) + ret = 0; + } + if (ret == 0) + ret = dbc->c_put(dbc, + key, data, flags == 0 ? 
DB_KEYLAST : flags); + + if ((t_ret = __db_c_close(dbc)) != 0 && ret == 0) + ret = t_ret; + + return (ret); +} + +/* + * __db_sync -- + * Flush the database cache. + * + * PUBLIC: int __db_sync __P((DB *, u_int32_t)); + */ +int +__db_sync(dbp, flags) + DB *dbp; + u_int32_t flags; +{ + int ret, t_ret; + + PANIC_CHECK(dbp->dbenv); + DB_ILLEGAL_BEFORE_OPEN(dbp, "DB->sync"); + + if ((ret = __db_syncchk(dbp, flags)) != 0) + return (ret); + + /* Read-only trees never need to be sync'd. */ + if (F_ISSET(dbp, DB_AM_RDONLY)) + return (0); + + /* If it's a Recno tree, write the backing source text file. */ + if (dbp->type == DB_RECNO) + ret = __ram_writeback(dbp); + + /* If the tree was never backed by a database file, we're done. */ + if (F_ISSET(dbp, DB_AM_INMEM)) + return (0); + + /* Flush any dirty pages from the cache to the backing file. */ + if ((t_ret = memp_fsync(dbp->mpf)) != 0 && ret == 0) + ret = t_ret; + return (ret); +} diff --git a/bdb/db/db_auto.c b/bdb/db/db_auto.c new file mode 100644 index 00000000000..23540adc2e6 --- /dev/null +++ b/bdb/db/db_auto.c @@ -0,0 +1,1270 @@ +/* Do not edit: automatically built by gen_rec.awk. 
*/ +#include "db_config.h" + +#ifndef NO_SYSTEM_INCLUDES +#include <sys/types.h> + +#include <ctype.h> +#include <errno.h> +#include <string.h> +#endif + +#include "db_int.h" +#include "db_page.h" +#include "db_dispatch.h" +#include "db_am.h" +#include "txn.h" + +int +__db_addrem_log(dbenv, txnid, ret_lsnp, flags, + opcode, fileid, pgno, indx, nbytes, hdr, + dbt, pagelsn) + DB_ENV *dbenv; + DB_TXN *txnid; + DB_LSN *ret_lsnp; + u_int32_t flags; + u_int32_t opcode; + int32_t fileid; + db_pgno_t pgno; + u_int32_t indx; + size_t nbytes; + const DBT *hdr; + const DBT *dbt; + DB_LSN * pagelsn; +{ + DBT logrec; + DB_LSN *lsnp, null_lsn; + u_int32_t zero; + u_int32_t rectype, txn_num; + int ret; + u_int8_t *bp; + + rectype = DB_db_addrem; + if (txnid != NULL && + TAILQ_FIRST(&txnid->kids) != NULL && + (ret = __txn_activekids(dbenv, rectype, txnid)) != 0) + return (ret); + txn_num = txnid == NULL ? 0 : txnid->txnid; + if (txnid == NULL) { + ZERO_LSN(null_lsn); + lsnp = &null_lsn; + } else + lsnp = &txnid->last_lsn; + logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN) + + sizeof(opcode) + + sizeof(fileid) + + sizeof(pgno) + + sizeof(indx) + + sizeof(nbytes) + + sizeof(u_int32_t) + (hdr == NULL ? 0 : hdr->size) + + sizeof(u_int32_t) + (dbt == NULL ? 
0 : dbt->size) + + sizeof(*pagelsn); + if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0) + return (ret); + + bp = logrec.data; + memcpy(bp, &rectype, sizeof(rectype)); + bp += sizeof(rectype); + memcpy(bp, &txn_num, sizeof(txn_num)); + bp += sizeof(txn_num); + memcpy(bp, lsnp, sizeof(DB_LSN)); + bp += sizeof(DB_LSN); + memcpy(bp, &opcode, sizeof(opcode)); + bp += sizeof(opcode); + memcpy(bp, &fileid, sizeof(fileid)); + bp += sizeof(fileid); + memcpy(bp, &pgno, sizeof(pgno)); + bp += sizeof(pgno); + memcpy(bp, &indx, sizeof(indx)); + bp += sizeof(indx); + memcpy(bp, &nbytes, sizeof(nbytes)); + bp += sizeof(nbytes); + if (hdr == NULL) { + zero = 0; + memcpy(bp, &zero, sizeof(u_int32_t)); + bp += sizeof(u_int32_t); + } else { + memcpy(bp, &hdr->size, sizeof(hdr->size)); + bp += sizeof(hdr->size); + memcpy(bp, hdr->data, hdr->size); + bp += hdr->size; + } + if (dbt == NULL) { + zero = 0; + memcpy(bp, &zero, sizeof(u_int32_t)); + bp += sizeof(u_int32_t); + } else { + memcpy(bp, &dbt->size, sizeof(dbt->size)); + bp += sizeof(dbt->size); + memcpy(bp, dbt->data, dbt->size); + bp += dbt->size; + } + if (pagelsn != NULL) + memcpy(bp, pagelsn, sizeof(*pagelsn)); + else + memset(bp, 0, sizeof(*pagelsn)); + bp += sizeof(*pagelsn); + DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size); + ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags); + if (txnid != NULL) + txnid->last_lsn = *ret_lsnp; + __os_free(logrec.data, logrec.size); + return (ret); +} + +int +__db_addrem_print(dbenv, dbtp, lsnp, notused2, notused3) + DB_ENV *dbenv; + DBT *dbtp; + DB_LSN *lsnp; + db_recops notused2; + void *notused3; +{ + __db_addrem_args *argp; + u_int32_t i; + u_int ch; + int ret; + + i = 0; + ch = 0; + notused2 = DB_TXN_ABORT; + notused3 = NULL; + + if ((ret = __db_addrem_read(dbenv, dbtp->data, &argp)) != 0) + return (ret); + printf("[%lu][%lu]db_addrem: rec: %lu txnid %lx prevlsn [%lu][%lu]\n", + (u_long)lsnp->file, + (u_long)lsnp->offset, + 
(u_long)argp->type, + (u_long)argp->txnid->txnid, + (u_long)argp->prev_lsn.file, + (u_long)argp->prev_lsn.offset); + printf("\topcode: %lu\n", (u_long)argp->opcode); + printf("\tfileid: %ld\n", (long)argp->fileid); + printf("\tpgno: %lu\n", (u_long)argp->pgno); + printf("\tindx: %lu\n", (u_long)argp->indx); + printf("\tnbytes: %lu\n", (u_long)argp->nbytes); + printf("\thdr: "); + for (i = 0; i < argp->hdr.size; i++) { + ch = ((u_int8_t *)argp->hdr.data)[i]; + if (isprint(ch) || ch == 0xa) + putchar(ch); + else + printf("%#x ", ch); + } + printf("\n"); + printf("\tdbt: "); + for (i = 0; i < argp->dbt.size; i++) { + ch = ((u_int8_t *)argp->dbt.data)[i]; + if (isprint(ch) || ch == 0xa) + putchar(ch); + else + printf("%#x ", ch); + } + printf("\n"); + printf("\tpagelsn: [%lu][%lu]\n", + (u_long)argp->pagelsn.file, (u_long)argp->pagelsn.offset); + printf("\n"); + __os_free(argp, 0); + return (0); +} + +int +__db_addrem_read(dbenv, recbuf, argpp) + DB_ENV *dbenv; + void *recbuf; + __db_addrem_args **argpp; +{ + __db_addrem_args *argp; + u_int8_t *bp; + int ret; + + ret = __os_malloc(dbenv, sizeof(__db_addrem_args) + + sizeof(DB_TXN), NULL, &argp); + if (ret != 0) + return (ret); + argp->txnid = (DB_TXN *)&argp[1]; + bp = recbuf; + memcpy(&argp->type, bp, sizeof(argp->type)); + bp += sizeof(argp->type); + memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid)); + bp += sizeof(argp->txnid->txnid); + memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN)); + bp += sizeof(DB_LSN); + memcpy(&argp->opcode, bp, sizeof(argp->opcode)); + bp += sizeof(argp->opcode); + memcpy(&argp->fileid, bp, sizeof(argp->fileid)); + bp += sizeof(argp->fileid); + memcpy(&argp->pgno, bp, sizeof(argp->pgno)); + bp += sizeof(argp->pgno); + memcpy(&argp->indx, bp, sizeof(argp->indx)); + bp += sizeof(argp->indx); + memcpy(&argp->nbytes, bp, sizeof(argp->nbytes)); + bp += sizeof(argp->nbytes); + memset(&argp->hdr, 0, sizeof(argp->hdr)); + memcpy(&argp->hdr.size, bp, sizeof(u_int32_t)); + bp += 
sizeof(u_int32_t); + argp->hdr.data = bp; + bp += argp->hdr.size; + memset(&argp->dbt, 0, sizeof(argp->dbt)); + memcpy(&argp->dbt.size, bp, sizeof(u_int32_t)); + bp += sizeof(u_int32_t); + argp->dbt.data = bp; + bp += argp->dbt.size; + memcpy(&argp->pagelsn, bp, sizeof(argp->pagelsn)); + bp += sizeof(argp->pagelsn); + *argpp = argp; + return (0); +} + +int +__db_split_print(dbenv, dbtp, lsnp, notused2, notused3) + DB_ENV *dbenv; + DBT *dbtp; + DB_LSN *lsnp; + db_recops notused2; + void *notused3; +{ + __db_split_args *argp; + u_int32_t i; + u_int ch; + int ret; + + i = 0; + ch = 0; + notused2 = DB_TXN_ABORT; + notused3 = NULL; + + if ((ret = __db_split_read(dbenv, dbtp->data, &argp)) != 0) + return (ret); + printf("[%lu][%lu]db_split: rec: %lu txnid %lx prevlsn [%lu][%lu]\n", + (u_long)lsnp->file, + (u_long)lsnp->offset, + (u_long)argp->type, + (u_long)argp->txnid->txnid, + (u_long)argp->prev_lsn.file, + (u_long)argp->prev_lsn.offset); + printf("\topcode: %lu\n", (u_long)argp->opcode); + printf("\tfileid: %ld\n", (long)argp->fileid); + printf("\tpgno: %lu\n", (u_long)argp->pgno); + printf("\tpageimage: "); + for (i = 0; i < argp->pageimage.size; i++) { + ch = ((u_int8_t *)argp->pageimage.data)[i]; + if (isprint(ch) || ch == 0xa) + putchar(ch); + else + printf("%#x ", ch); + } + printf("\n"); + printf("\tpagelsn: [%lu][%lu]\n", + (u_long)argp->pagelsn.file, (u_long)argp->pagelsn.offset); + printf("\n"); + __os_free(argp, 0); + return (0); +} + +int +__db_split_read(dbenv, recbuf, argpp) + DB_ENV *dbenv; + void *recbuf; + __db_split_args **argpp; +{ + __db_split_args *argp; + u_int8_t *bp; + int ret; + + ret = __os_malloc(dbenv, sizeof(__db_split_args) + + sizeof(DB_TXN), NULL, &argp); + if (ret != 0) + return (ret); + argp->txnid = (DB_TXN *)&argp[1]; + bp = recbuf; + memcpy(&argp->type, bp, sizeof(argp->type)); + bp += sizeof(argp->type); + memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid)); + bp += sizeof(argp->txnid->txnid); + memcpy(&argp->prev_lsn, bp, 
sizeof(DB_LSN)); + bp += sizeof(DB_LSN); + memcpy(&argp->opcode, bp, sizeof(argp->opcode)); + bp += sizeof(argp->opcode); + memcpy(&argp->fileid, bp, sizeof(argp->fileid)); + bp += sizeof(argp->fileid); + memcpy(&argp->pgno, bp, sizeof(argp->pgno)); + bp += sizeof(argp->pgno); + memset(&argp->pageimage, 0, sizeof(argp->pageimage)); + memcpy(&argp->pageimage.size, bp, sizeof(u_int32_t)); + bp += sizeof(u_int32_t); + argp->pageimage.data = bp; + bp += argp->pageimage.size; + memcpy(&argp->pagelsn, bp, sizeof(argp->pagelsn)); + bp += sizeof(argp->pagelsn); + *argpp = argp; + return (0); +} + +int +__db_big_log(dbenv, txnid, ret_lsnp, flags, + opcode, fileid, pgno, prev_pgno, next_pgno, dbt, + pagelsn, prevlsn, nextlsn) + DB_ENV *dbenv; + DB_TXN *txnid; + DB_LSN *ret_lsnp; + u_int32_t flags; + u_int32_t opcode; + int32_t fileid; + db_pgno_t pgno; + db_pgno_t prev_pgno; + db_pgno_t next_pgno; + const DBT *dbt; + DB_LSN * pagelsn; + DB_LSN * prevlsn; + DB_LSN * nextlsn; +{ + DBT logrec; + DB_LSN *lsnp, null_lsn; + u_int32_t zero; + u_int32_t rectype, txn_num; + int ret; + u_int8_t *bp; + + rectype = DB_db_big; + if (txnid != NULL && + TAILQ_FIRST(&txnid->kids) != NULL && + (ret = __txn_activekids(dbenv, rectype, txnid)) != 0) + return (ret); + txn_num = txnid == NULL ? 0 : txnid->txnid; + if (txnid == NULL) { + ZERO_LSN(null_lsn); + lsnp = &null_lsn; + } else + lsnp = &txnid->last_lsn; + logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN) + + sizeof(opcode) + + sizeof(fileid) + + sizeof(pgno) + + sizeof(prev_pgno) + + sizeof(next_pgno) + + sizeof(u_int32_t) + (dbt == NULL ? 
0 : dbt->size) + + sizeof(*pagelsn) + + sizeof(*prevlsn) + + sizeof(*nextlsn); + if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0) + return (ret); + + bp = logrec.data; + memcpy(bp, &rectype, sizeof(rectype)); + bp += sizeof(rectype); + memcpy(bp, &txn_num, sizeof(txn_num)); + bp += sizeof(txn_num); + memcpy(bp, lsnp, sizeof(DB_LSN)); + bp += sizeof(DB_LSN); + memcpy(bp, &opcode, sizeof(opcode)); + bp += sizeof(opcode); + memcpy(bp, &fileid, sizeof(fileid)); + bp += sizeof(fileid); + memcpy(bp, &pgno, sizeof(pgno)); + bp += sizeof(pgno); + memcpy(bp, &prev_pgno, sizeof(prev_pgno)); + bp += sizeof(prev_pgno); + memcpy(bp, &next_pgno, sizeof(next_pgno)); + bp += sizeof(next_pgno); + if (dbt == NULL) { + zero = 0; + memcpy(bp, &zero, sizeof(u_int32_t)); + bp += sizeof(u_int32_t); + } else { + memcpy(bp, &dbt->size, sizeof(dbt->size)); + bp += sizeof(dbt->size); + memcpy(bp, dbt->data, dbt->size); + bp += dbt->size; + } + if (pagelsn != NULL) + memcpy(bp, pagelsn, sizeof(*pagelsn)); + else + memset(bp, 0, sizeof(*pagelsn)); + bp += sizeof(*pagelsn); + if (prevlsn != NULL) + memcpy(bp, prevlsn, sizeof(*prevlsn)); + else + memset(bp, 0, sizeof(*prevlsn)); + bp += sizeof(*prevlsn); + if (nextlsn != NULL) + memcpy(bp, nextlsn, sizeof(*nextlsn)); + else + memset(bp, 0, sizeof(*nextlsn)); + bp += sizeof(*nextlsn); + DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size); + ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags); + if (txnid != NULL) + txnid->last_lsn = *ret_lsnp; + __os_free(logrec.data, logrec.size); + return (ret); +} + +int +__db_big_print(dbenv, dbtp, lsnp, notused2, notused3) + DB_ENV *dbenv; + DBT *dbtp; + DB_LSN *lsnp; + db_recops notused2; + void *notused3; +{ + __db_big_args *argp; + u_int32_t i; + u_int ch; + int ret; + + i = 0; + ch = 0; + notused2 = DB_TXN_ABORT; + notused3 = NULL; + + if ((ret = __db_big_read(dbenv, dbtp->data, &argp)) != 0) + return (ret); + printf("[%lu][%lu]db_big: rec: %lu txnid %lx prevlsn 
[%lu][%lu]\n", + (u_long)lsnp->file, + (u_long)lsnp->offset, + (u_long)argp->type, + (u_long)argp->txnid->txnid, + (u_long)argp->prev_lsn.file, + (u_long)argp->prev_lsn.offset); + printf("\topcode: %lu\n", (u_long)argp->opcode); + printf("\tfileid: %ld\n", (long)argp->fileid); + printf("\tpgno: %lu\n", (u_long)argp->pgno); + printf("\tprev_pgno: %lu\n", (u_long)argp->prev_pgno); + printf("\tnext_pgno: %lu\n", (u_long)argp->next_pgno); + printf("\tdbt: "); + for (i = 0; i < argp->dbt.size; i++) { + ch = ((u_int8_t *)argp->dbt.data)[i]; + if (isprint(ch) || ch == 0xa) + putchar(ch); + else + printf("%#x ", ch); + } + printf("\n"); + printf("\tpagelsn: [%lu][%lu]\n", + (u_long)argp->pagelsn.file, (u_long)argp->pagelsn.offset); + printf("\tprevlsn: [%lu][%lu]\n", + (u_long)argp->prevlsn.file, (u_long)argp->prevlsn.offset); + printf("\tnextlsn: [%lu][%lu]\n", + (u_long)argp->nextlsn.file, (u_long)argp->nextlsn.offset); + printf("\n"); + __os_free(argp, 0); + return (0); +} + +int +__db_big_read(dbenv, recbuf, argpp) + DB_ENV *dbenv; + void *recbuf; + __db_big_args **argpp; +{ + __db_big_args *argp; + u_int8_t *bp; + int ret; + + ret = __os_malloc(dbenv, sizeof(__db_big_args) + + sizeof(DB_TXN), NULL, &argp); + if (ret != 0) + return (ret); + argp->txnid = (DB_TXN *)&argp[1]; + bp = recbuf; + memcpy(&argp->type, bp, sizeof(argp->type)); + bp += sizeof(argp->type); + memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid)); + bp += sizeof(argp->txnid->txnid); + memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN)); + bp += sizeof(DB_LSN); + memcpy(&argp->opcode, bp, sizeof(argp->opcode)); + bp += sizeof(argp->opcode); + memcpy(&argp->fileid, bp, sizeof(argp->fileid)); + bp += sizeof(argp->fileid); + memcpy(&argp->pgno, bp, sizeof(argp->pgno)); + bp += sizeof(argp->pgno); + memcpy(&argp->prev_pgno, bp, sizeof(argp->prev_pgno)); + bp += sizeof(argp->prev_pgno); + memcpy(&argp->next_pgno, bp, sizeof(argp->next_pgno)); + bp += sizeof(argp->next_pgno); + memset(&argp->dbt, 0, 
sizeof(argp->dbt)); + memcpy(&argp->dbt.size, bp, sizeof(u_int32_t)); + bp += sizeof(u_int32_t); + argp->dbt.data = bp; + bp += argp->dbt.size; + memcpy(&argp->pagelsn, bp, sizeof(argp->pagelsn)); + bp += sizeof(argp->pagelsn); + memcpy(&argp->prevlsn, bp, sizeof(argp->prevlsn)); + bp += sizeof(argp->prevlsn); + memcpy(&argp->nextlsn, bp, sizeof(argp->nextlsn)); + bp += sizeof(argp->nextlsn); + *argpp = argp; + return (0); +} + +int +__db_ovref_log(dbenv, txnid, ret_lsnp, flags, + fileid, pgno, adjust, lsn) + DB_ENV *dbenv; + DB_TXN *txnid; + DB_LSN *ret_lsnp; + u_int32_t flags; + int32_t fileid; + db_pgno_t pgno; + int32_t adjust; + DB_LSN * lsn; +{ + DBT logrec; + DB_LSN *lsnp, null_lsn; + u_int32_t rectype, txn_num; + int ret; + u_int8_t *bp; + + rectype = DB_db_ovref; + if (txnid != NULL && + TAILQ_FIRST(&txnid->kids) != NULL && + (ret = __txn_activekids(dbenv, rectype, txnid)) != 0) + return (ret); + txn_num = txnid == NULL ? 0 : txnid->txnid; + if (txnid == NULL) { + ZERO_LSN(null_lsn); + lsnp = &null_lsn; + } else + lsnp = &txnid->last_lsn; + logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN) + + sizeof(fileid) + + sizeof(pgno) + + sizeof(adjust) + + sizeof(*lsn); + if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0) + return (ret); + + bp = logrec.data; + memcpy(bp, &rectype, sizeof(rectype)); + bp += sizeof(rectype); + memcpy(bp, &txn_num, sizeof(txn_num)); + bp += sizeof(txn_num); + memcpy(bp, lsnp, sizeof(DB_LSN)); + bp += sizeof(DB_LSN); + memcpy(bp, &fileid, sizeof(fileid)); + bp += sizeof(fileid); + memcpy(bp, &pgno, sizeof(pgno)); + bp += sizeof(pgno); + memcpy(bp, &adjust, sizeof(adjust)); + bp += sizeof(adjust); + if (lsn != NULL) + memcpy(bp, lsn, sizeof(*lsn)); + else + memset(bp, 0, sizeof(*lsn)); + bp += sizeof(*lsn); + DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size); + ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags); + if (txnid != NULL) + txnid->last_lsn = *ret_lsnp; + 
__os_free(logrec.data, logrec.size); + return (ret); +} + +int +__db_ovref_print(dbenv, dbtp, lsnp, notused2, notused3) + DB_ENV *dbenv; + DBT *dbtp; + DB_LSN *lsnp; + db_recops notused2; + void *notused3; +{ + __db_ovref_args *argp; + u_int32_t i; + u_int ch; + int ret; + + i = 0; + ch = 0; + notused2 = DB_TXN_ABORT; + notused3 = NULL; + + if ((ret = __db_ovref_read(dbenv, dbtp->data, &argp)) != 0) + return (ret); + printf("[%lu][%lu]db_ovref: rec: %lu txnid %lx prevlsn [%lu][%lu]\n", + (u_long)lsnp->file, + (u_long)lsnp->offset, + (u_long)argp->type, + (u_long)argp->txnid->txnid, + (u_long)argp->prev_lsn.file, + (u_long)argp->prev_lsn.offset); + printf("\tfileid: %ld\n", (long)argp->fileid); + printf("\tpgno: %lu\n", (u_long)argp->pgno); + printf("\tadjust: %ld\n", (long)argp->adjust); + printf("\tlsn: [%lu][%lu]\n", + (u_long)argp->lsn.file, (u_long)argp->lsn.offset); + printf("\n"); + __os_free(argp, 0); + return (0); +} + +int +__db_ovref_read(dbenv, recbuf, argpp) + DB_ENV *dbenv; + void *recbuf; + __db_ovref_args **argpp; +{ + __db_ovref_args *argp; + u_int8_t *bp; + int ret; + + ret = __os_malloc(dbenv, sizeof(__db_ovref_args) + + sizeof(DB_TXN), NULL, &argp); + if (ret != 0) + return (ret); + argp->txnid = (DB_TXN *)&argp[1]; + bp = recbuf; + memcpy(&argp->type, bp, sizeof(argp->type)); + bp += sizeof(argp->type); + memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid)); + bp += sizeof(argp->txnid->txnid); + memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN)); + bp += sizeof(DB_LSN); + memcpy(&argp->fileid, bp, sizeof(argp->fileid)); + bp += sizeof(argp->fileid); + memcpy(&argp->pgno, bp, sizeof(argp->pgno)); + bp += sizeof(argp->pgno); + memcpy(&argp->adjust, bp, sizeof(argp->adjust)); + bp += sizeof(argp->adjust); + memcpy(&argp->lsn, bp, sizeof(argp->lsn)); + bp += sizeof(argp->lsn); + *argpp = argp; + return (0); +} + +int +__db_relink_log(dbenv, txnid, ret_lsnp, flags, + opcode, fileid, pgno, lsn, prev, lsn_prev, + next, lsn_next) + DB_ENV *dbenv; + 
DB_TXN *txnid; + DB_LSN *ret_lsnp; + u_int32_t flags; + u_int32_t opcode; + int32_t fileid; + db_pgno_t pgno; + DB_LSN * lsn; + db_pgno_t prev; + DB_LSN * lsn_prev; + db_pgno_t next; + DB_LSN * lsn_next; +{ + DBT logrec; + DB_LSN *lsnp, null_lsn; + u_int32_t rectype, txn_num; + int ret; + u_int8_t *bp; + + rectype = DB_db_relink; + if (txnid != NULL && + TAILQ_FIRST(&txnid->kids) != NULL && + (ret = __txn_activekids(dbenv, rectype, txnid)) != 0) + return (ret); + txn_num = txnid == NULL ? 0 : txnid->txnid; + if (txnid == NULL) { + ZERO_LSN(null_lsn); + lsnp = &null_lsn; + } else + lsnp = &txnid->last_lsn; + logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN) + + sizeof(opcode) + + sizeof(fileid) + + sizeof(pgno) + + sizeof(*lsn) + + sizeof(prev) + + sizeof(*lsn_prev) + + sizeof(next) + + sizeof(*lsn_next); + if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0) + return (ret); + + bp = logrec.data; + memcpy(bp, &rectype, sizeof(rectype)); + bp += sizeof(rectype); + memcpy(bp, &txn_num, sizeof(txn_num)); + bp += sizeof(txn_num); + memcpy(bp, lsnp, sizeof(DB_LSN)); + bp += sizeof(DB_LSN); + memcpy(bp, &opcode, sizeof(opcode)); + bp += sizeof(opcode); + memcpy(bp, &fileid, sizeof(fileid)); + bp += sizeof(fileid); + memcpy(bp, &pgno, sizeof(pgno)); + bp += sizeof(pgno); + if (lsn != NULL) + memcpy(bp, lsn, sizeof(*lsn)); + else + memset(bp, 0, sizeof(*lsn)); + bp += sizeof(*lsn); + memcpy(bp, &prev, sizeof(prev)); + bp += sizeof(prev); + if (lsn_prev != NULL) + memcpy(bp, lsn_prev, sizeof(*lsn_prev)); + else + memset(bp, 0, sizeof(*lsn_prev)); + bp += sizeof(*lsn_prev); + memcpy(bp, &next, sizeof(next)); + bp += sizeof(next); + if (lsn_next != NULL) + memcpy(bp, lsn_next, sizeof(*lsn_next)); + else + memset(bp, 0, sizeof(*lsn_next)); + bp += sizeof(*lsn_next); + DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size); + ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags); + if (txnid != NULL) + txnid->last_lsn = *ret_lsnp; + 
__os_free(logrec.data, logrec.size); + return (ret); +} + +int +__db_relink_print(dbenv, dbtp, lsnp, notused2, notused3) + DB_ENV *dbenv; + DBT *dbtp; + DB_LSN *lsnp; + db_recops notused2; + void *notused3; +{ + __db_relink_args *argp; + u_int32_t i; + u_int ch; + int ret; + + i = 0; + ch = 0; + notused2 = DB_TXN_ABORT; + notused3 = NULL; + + if ((ret = __db_relink_read(dbenv, dbtp->data, &argp)) != 0) + return (ret); + printf("[%lu][%lu]db_relink: rec: %lu txnid %lx prevlsn [%lu][%lu]\n", + (u_long)lsnp->file, + (u_long)lsnp->offset, + (u_long)argp->type, + (u_long)argp->txnid->txnid, + (u_long)argp->prev_lsn.file, + (u_long)argp->prev_lsn.offset); + printf("\topcode: %lu\n", (u_long)argp->opcode); + printf("\tfileid: %ld\n", (long)argp->fileid); + printf("\tpgno: %lu\n", (u_long)argp->pgno); + printf("\tlsn: [%lu][%lu]\n", + (u_long)argp->lsn.file, (u_long)argp->lsn.offset); + printf("\tprev: %lu\n", (u_long)argp->prev); + printf("\tlsn_prev: [%lu][%lu]\n", + (u_long)argp->lsn_prev.file, (u_long)argp->lsn_prev.offset); + printf("\tnext: %lu\n", (u_long)argp->next); + printf("\tlsn_next: [%lu][%lu]\n", + (u_long)argp->lsn_next.file, (u_long)argp->lsn_next.offset); + printf("\n"); + __os_free(argp, 0); + return (0); +} + +int +__db_relink_read(dbenv, recbuf, argpp) + DB_ENV *dbenv; + void *recbuf; + __db_relink_args **argpp; +{ + __db_relink_args *argp; + u_int8_t *bp; + int ret; + + ret = __os_malloc(dbenv, sizeof(__db_relink_args) + + sizeof(DB_TXN), NULL, &argp); + if (ret != 0) + return (ret); + argp->txnid = (DB_TXN *)&argp[1]; + bp = recbuf; + memcpy(&argp->type, bp, sizeof(argp->type)); + bp += sizeof(argp->type); + memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid)); + bp += sizeof(argp->txnid->txnid); + memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN)); + bp += sizeof(DB_LSN); + memcpy(&argp->opcode, bp, sizeof(argp->opcode)); + bp += sizeof(argp->opcode); + memcpy(&argp->fileid, bp, sizeof(argp->fileid)); + bp += sizeof(argp->fileid); + 
memcpy(&argp->pgno, bp, sizeof(argp->pgno)); + bp += sizeof(argp->pgno); + memcpy(&argp->lsn, bp, sizeof(argp->lsn)); + bp += sizeof(argp->lsn); + memcpy(&argp->prev, bp, sizeof(argp->prev)); + bp += sizeof(argp->prev); + memcpy(&argp->lsn_prev, bp, sizeof(argp->lsn_prev)); + bp += sizeof(argp->lsn_prev); + memcpy(&argp->next, bp, sizeof(argp->next)); + bp += sizeof(argp->next); + memcpy(&argp->lsn_next, bp, sizeof(argp->lsn_next)); + bp += sizeof(argp->lsn_next); + *argpp = argp; + return (0); +} + +int +__db_addpage_print(dbenv, dbtp, lsnp, notused2, notused3) + DB_ENV *dbenv; + DBT *dbtp; + DB_LSN *lsnp; + db_recops notused2; + void *notused3; +{ + __db_addpage_args *argp; + u_int32_t i; + u_int ch; + int ret; + + i = 0; + ch = 0; + notused2 = DB_TXN_ABORT; + notused3 = NULL; + + if ((ret = __db_addpage_read(dbenv, dbtp->data, &argp)) != 0) + return (ret); + printf("[%lu][%lu]db_addpage: rec: %lu txnid %lx prevlsn [%lu][%lu]\n", + (u_long)lsnp->file, + (u_long)lsnp->offset, + (u_long)argp->type, + (u_long)argp->txnid->txnid, + (u_long)argp->prev_lsn.file, + (u_long)argp->prev_lsn.offset); + printf("\tfileid: %ld\n", (long)argp->fileid); + printf("\tpgno: %lu\n", (u_long)argp->pgno); + printf("\tlsn: [%lu][%lu]\n", + (u_long)argp->lsn.file, (u_long)argp->lsn.offset); + printf("\tnextpgno: %lu\n", (u_long)argp->nextpgno); + printf("\tnextlsn: [%lu][%lu]\n", + (u_long)argp->nextlsn.file, (u_long)argp->nextlsn.offset); + printf("\n"); + __os_free(argp, 0); + return (0); +} + +int +__db_addpage_read(dbenv, recbuf, argpp) + DB_ENV *dbenv; + void *recbuf; + __db_addpage_args **argpp; +{ + __db_addpage_args *argp; + u_int8_t *bp; + int ret; + + ret = __os_malloc(dbenv, sizeof(__db_addpage_args) + + sizeof(DB_TXN), NULL, &argp); + if (ret != 0) + return (ret); + argp->txnid = (DB_TXN *)&argp[1]; + bp = recbuf; + memcpy(&argp->type, bp, sizeof(argp->type)); + bp += sizeof(argp->type); + memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid)); + bp += 
sizeof(argp->txnid->txnid); + memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN)); + bp += sizeof(DB_LSN); + memcpy(&argp->fileid, bp, sizeof(argp->fileid)); + bp += sizeof(argp->fileid); + memcpy(&argp->pgno, bp, sizeof(argp->pgno)); + bp += sizeof(argp->pgno); + memcpy(&argp->lsn, bp, sizeof(argp->lsn)); + bp += sizeof(argp->lsn); + memcpy(&argp->nextpgno, bp, sizeof(argp->nextpgno)); + bp += sizeof(argp->nextpgno); + memcpy(&argp->nextlsn, bp, sizeof(argp->nextlsn)); + bp += sizeof(argp->nextlsn); + *argpp = argp; + return (0); +} + +int +__db_debug_log(dbenv, txnid, ret_lsnp, flags, + op, fileid, key, data, arg_flags) + DB_ENV *dbenv; + DB_TXN *txnid; + DB_LSN *ret_lsnp; + u_int32_t flags; + const DBT *op; + int32_t fileid; + const DBT *key; + const DBT *data; + u_int32_t arg_flags; +{ + DBT logrec; + DB_LSN *lsnp, null_lsn; + u_int32_t zero; + u_int32_t rectype, txn_num; + int ret; + u_int8_t *bp; + + rectype = DB_db_debug; + if (txnid != NULL && + TAILQ_FIRST(&txnid->kids) != NULL && + (ret = __txn_activekids(dbenv, rectype, txnid)) != 0) + return (ret); + txn_num = txnid == NULL ? 0 : txnid->txnid; + if (txnid == NULL) { + ZERO_LSN(null_lsn); + lsnp = &null_lsn; + } else + lsnp = &txnid->last_lsn; + logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN) + + sizeof(u_int32_t) + (op == NULL ? 0 : op->size) + + sizeof(fileid) + + sizeof(u_int32_t) + (key == NULL ? 0 : key->size) + + sizeof(u_int32_t) + (data == NULL ? 
0 : data->size) + + sizeof(arg_flags); + if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0) + return (ret); + + bp = logrec.data; + memcpy(bp, &rectype, sizeof(rectype)); + bp += sizeof(rectype); + memcpy(bp, &txn_num, sizeof(txn_num)); + bp += sizeof(txn_num); + memcpy(bp, lsnp, sizeof(DB_LSN)); + bp += sizeof(DB_LSN); + if (op == NULL) { + zero = 0; + memcpy(bp, &zero, sizeof(u_int32_t)); + bp += sizeof(u_int32_t); + } else { + memcpy(bp, &op->size, sizeof(op->size)); + bp += sizeof(op->size); + memcpy(bp, op->data, op->size); + bp += op->size; + } + memcpy(bp, &fileid, sizeof(fileid)); + bp += sizeof(fileid); + if (key == NULL) { + zero = 0; + memcpy(bp, &zero, sizeof(u_int32_t)); + bp += sizeof(u_int32_t); + } else { + memcpy(bp, &key->size, sizeof(key->size)); + bp += sizeof(key->size); + memcpy(bp, key->data, key->size); + bp += key->size; + } + if (data == NULL) { + zero = 0; + memcpy(bp, &zero, sizeof(u_int32_t)); + bp += sizeof(u_int32_t); + } else { + memcpy(bp, &data->size, sizeof(data->size)); + bp += sizeof(data->size); + memcpy(bp, data->data, data->size); + bp += data->size; + } + memcpy(bp, &arg_flags, sizeof(arg_flags)); + bp += sizeof(arg_flags); + DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size); + ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags); + if (txnid != NULL) + txnid->last_lsn = *ret_lsnp; + __os_free(logrec.data, logrec.size); + return (ret); +} + +int +__db_debug_print(dbenv, dbtp, lsnp, notused2, notused3) + DB_ENV *dbenv; + DBT *dbtp; + DB_LSN *lsnp; + db_recops notused2; + void *notused3; +{ + __db_debug_args *argp; + u_int32_t i; + u_int ch; + int ret; + + i = 0; + ch = 0; + notused2 = DB_TXN_ABORT; + notused3 = NULL; + + if ((ret = __db_debug_read(dbenv, dbtp->data, &argp)) != 0) + return (ret); + printf("[%lu][%lu]db_debug: rec: %lu txnid %lx prevlsn [%lu][%lu]\n", + (u_long)lsnp->file, + (u_long)lsnp->offset, + (u_long)argp->type, + (u_long)argp->txnid->txnid, + 
(u_long)argp->prev_lsn.file, + (u_long)argp->prev_lsn.offset); + printf("\top: "); + for (i = 0; i < argp->op.size; i++) { + ch = ((u_int8_t *)argp->op.data)[i]; + if (isprint(ch) || ch == 0xa) + putchar(ch); + else + printf("%#x ", ch); + } + printf("\n"); + printf("\tfileid: %ld\n", (long)argp->fileid); + printf("\tkey: "); + for (i = 0; i < argp->key.size; i++) { + ch = ((u_int8_t *)argp->key.data)[i]; + if (isprint(ch) || ch == 0xa) + putchar(ch); + else + printf("%#x ", ch); + } + printf("\n"); + printf("\tdata: "); + for (i = 0; i < argp->data.size; i++) { + ch = ((u_int8_t *)argp->data.data)[i]; + if (isprint(ch) || ch == 0xa) + putchar(ch); + else + printf("%#x ", ch); + } + printf("\n"); + printf("\targ_flags: %lu\n", (u_long)argp->arg_flags); + printf("\n"); + __os_free(argp, 0); + return (0); +} + +int +__db_debug_read(dbenv, recbuf, argpp) + DB_ENV *dbenv; + void *recbuf; + __db_debug_args **argpp; +{ + __db_debug_args *argp; + u_int8_t *bp; + int ret; + + ret = __os_malloc(dbenv, sizeof(__db_debug_args) + + sizeof(DB_TXN), NULL, &argp); + if (ret != 0) + return (ret); + argp->txnid = (DB_TXN *)&argp[1]; + bp = recbuf; + memcpy(&argp->type, bp, sizeof(argp->type)); + bp += sizeof(argp->type); + memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid)); + bp += sizeof(argp->txnid->txnid); + memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN)); + bp += sizeof(DB_LSN); + memset(&argp->op, 0, sizeof(argp->op)); + memcpy(&argp->op.size, bp, sizeof(u_int32_t)); + bp += sizeof(u_int32_t); + argp->op.data = bp; + bp += argp->op.size; + memcpy(&argp->fileid, bp, sizeof(argp->fileid)); + bp += sizeof(argp->fileid); + memset(&argp->key, 0, sizeof(argp->key)); + memcpy(&argp->key.size, bp, sizeof(u_int32_t)); + bp += sizeof(u_int32_t); + argp->key.data = bp; + bp += argp->key.size; + memset(&argp->data, 0, sizeof(argp->data)); + memcpy(&argp->data.size, bp, sizeof(u_int32_t)); + bp += sizeof(u_int32_t); + argp->data.data = bp; + bp += argp->data.size; + 
memcpy(&argp->arg_flags, bp, sizeof(argp->arg_flags)); + bp += sizeof(argp->arg_flags); + *argpp = argp; + return (0); +} + +int +__db_noop_log(dbenv, txnid, ret_lsnp, flags, + fileid, pgno, prevlsn) + DB_ENV *dbenv; + DB_TXN *txnid; + DB_LSN *ret_lsnp; + u_int32_t flags; + int32_t fileid; + db_pgno_t pgno; + DB_LSN * prevlsn; +{ + DBT logrec; + DB_LSN *lsnp, null_lsn; + u_int32_t rectype, txn_num; + int ret; + u_int8_t *bp; + + rectype = DB_db_noop; + if (txnid != NULL && + TAILQ_FIRST(&txnid->kids) != NULL && + (ret = __txn_activekids(dbenv, rectype, txnid)) != 0) + return (ret); + txn_num = txnid == NULL ? 0 : txnid->txnid; + if (txnid == NULL) { + ZERO_LSN(null_lsn); + lsnp = &null_lsn; + } else + lsnp = &txnid->last_lsn; + logrec.size = sizeof(rectype) + sizeof(txn_num) + sizeof(DB_LSN) + + sizeof(fileid) + + sizeof(pgno) + + sizeof(*prevlsn); + if ((ret = __os_malloc(dbenv, logrec.size, NULL, &logrec.data)) != 0) + return (ret); + + bp = logrec.data; + memcpy(bp, &rectype, sizeof(rectype)); + bp += sizeof(rectype); + memcpy(bp, &txn_num, sizeof(txn_num)); + bp += sizeof(txn_num); + memcpy(bp, lsnp, sizeof(DB_LSN)); + bp += sizeof(DB_LSN); + memcpy(bp, &fileid, sizeof(fileid)); + bp += sizeof(fileid); + memcpy(bp, &pgno, sizeof(pgno)); + bp += sizeof(pgno); + if (prevlsn != NULL) + memcpy(bp, prevlsn, sizeof(*prevlsn)); + else + memset(bp, 0, sizeof(*prevlsn)); + bp += sizeof(*prevlsn); + DB_ASSERT((u_int32_t)(bp - (u_int8_t *)logrec.data) == logrec.size); + ret = log_put(dbenv, ret_lsnp, (DBT *)&logrec, flags); + if (txnid != NULL) + txnid->last_lsn = *ret_lsnp; + __os_free(logrec.data, logrec.size); + return (ret); +} + +int +__db_noop_print(dbenv, dbtp, lsnp, notused2, notused3) + DB_ENV *dbenv; + DBT *dbtp; + DB_LSN *lsnp; + db_recops notused2; + void *notused3; +{ + __db_noop_args *argp; + u_int32_t i; + u_int ch; + int ret; + + i = 0; + ch = 0; + notused2 = DB_TXN_ABORT; + notused3 = NULL; + + if ((ret = __db_noop_read(dbenv, dbtp->data, &argp)) != 0) + 
return (ret); + printf("[%lu][%lu]db_noop: rec: %lu txnid %lx prevlsn [%lu][%lu]\n", + (u_long)lsnp->file, + (u_long)lsnp->offset, + (u_long)argp->type, + (u_long)argp->txnid->txnid, + (u_long)argp->prev_lsn.file, + (u_long)argp->prev_lsn.offset); + printf("\tfileid: %ld\n", (long)argp->fileid); + printf("\tpgno: %lu\n", (u_long)argp->pgno); + printf("\tprevlsn: [%lu][%lu]\n", + (u_long)argp->prevlsn.file, (u_long)argp->prevlsn.offset); + printf("\n"); + __os_free(argp, 0); + return (0); +} + +int +__db_noop_read(dbenv, recbuf, argpp) + DB_ENV *dbenv; + void *recbuf; + __db_noop_args **argpp; +{ + __db_noop_args *argp; + u_int8_t *bp; + int ret; + + ret = __os_malloc(dbenv, sizeof(__db_noop_args) + + sizeof(DB_TXN), NULL, &argp); + if (ret != 0) + return (ret); + argp->txnid = (DB_TXN *)&argp[1]; + bp = recbuf; + memcpy(&argp->type, bp, sizeof(argp->type)); + bp += sizeof(argp->type); + memcpy(&argp->txnid->txnid, bp, sizeof(argp->txnid->txnid)); + bp += sizeof(argp->txnid->txnid); + memcpy(&argp->prev_lsn, bp, sizeof(DB_LSN)); + bp += sizeof(DB_LSN); + memcpy(&argp->fileid, bp, sizeof(argp->fileid)); + bp += sizeof(argp->fileid); + memcpy(&argp->pgno, bp, sizeof(argp->pgno)); + bp += sizeof(argp->pgno); + memcpy(&argp->prevlsn, bp, sizeof(argp->prevlsn)); + bp += sizeof(argp->prevlsn); + *argpp = argp; + return (0); +} + +int +__db_init_print(dbenv) + DB_ENV *dbenv; +{ + int ret; + + if ((ret = __db_add_recovery(dbenv, + __db_addrem_print, DB_db_addrem)) != 0) + return (ret); + if ((ret = __db_add_recovery(dbenv, + __db_split_print, DB_db_split)) != 0) + return (ret); + if ((ret = __db_add_recovery(dbenv, + __db_big_print, DB_db_big)) != 0) + return (ret); + if ((ret = __db_add_recovery(dbenv, + __db_ovref_print, DB_db_ovref)) != 0) + return (ret); + if ((ret = __db_add_recovery(dbenv, + __db_relink_print, DB_db_relink)) != 0) + return (ret); + if ((ret = __db_add_recovery(dbenv, + __db_addpage_print, DB_db_addpage)) != 0) + return (ret); + if ((ret = 
__db_add_recovery(dbenv, + __db_debug_print, DB_db_debug)) != 0) + return (ret); + if ((ret = __db_add_recovery(dbenv, + __db_noop_print, DB_db_noop)) != 0) + return (ret); + return (0); +} + +int +__db_init_recover(dbenv) + DB_ENV *dbenv; +{ + int ret; + + if ((ret = __db_add_recovery(dbenv, + __db_addrem_recover, DB_db_addrem)) != 0) + return (ret); + if ((ret = __db_add_recovery(dbenv, + __deprecated_recover, DB_db_split)) != 0) + return (ret); + if ((ret = __db_add_recovery(dbenv, + __db_big_recover, DB_db_big)) != 0) + return (ret); + if ((ret = __db_add_recovery(dbenv, + __db_ovref_recover, DB_db_ovref)) != 0) + return (ret); + if ((ret = __db_add_recovery(dbenv, + __db_relink_recover, DB_db_relink)) != 0) + return (ret); + if ((ret = __db_add_recovery(dbenv, + __deprecated_recover, DB_db_addpage)) != 0) + return (ret); + if ((ret = __db_add_recovery(dbenv, + __db_debug_recover, DB_db_debug)) != 0) + return (ret); + if ((ret = __db_add_recovery(dbenv, + __db_noop_recover, DB_db_noop)) != 0) + return (ret); + return (0); +} + diff --git a/bdb/db/db_cam.c b/bdb/db/db_cam.c new file mode 100644 index 00000000000..708d4cbda4d --- /dev/null +++ b/bdb/db/db_cam.c @@ -0,0 +1,974 @@ +/*- + * See the file LICENSE for redistribution information. + * + * Copyright (c) 2000 + * Sleepycat Software. All rights reserved. 
+ */ + +#include "db_config.h" + +#ifndef lint +static const char revid[] = "$Id: db_cam.c,v 11.52 2001/01/18 15:11:16 bostic Exp $"; +#endif /* not lint */ + +#ifndef NO_SYSTEM_INCLUDES +#include <sys/types.h> + +#include <string.h> +#endif + +#include "db_int.h" +#include "db_page.h" +#include "db_shash.h" +#include "lock.h" +#include "btree.h" +#include "hash.h" +#include "qam.h" +#include "txn.h" +#include "db_ext.h" + +static int __db_c_cleanup __P((DBC *, DBC *, int)); +static int __db_c_idup __P((DBC *, DBC **, u_int32_t)); +static int __db_wrlock_err __P((DB_ENV *)); + +#define CDB_LOCKING_INIT(dbp, dbc) \ + /* \ + * If we are running CDB, this had better be either a write \ + * cursor or an immediate writer. If it's a regular writer, \ + * that means we have an IWRITE lock and we need to upgrade \ + * it to a write lock. \ + */ \ + if (CDB_LOCKING((dbp)->dbenv)) { \ + if (!F_ISSET(dbc, DBC_WRITECURSOR | DBC_WRITER)) \ + return (__db_wrlock_err(dbp->dbenv)); \ + \ + if (F_ISSET(dbc, DBC_WRITECURSOR) && \ + (ret = lock_get((dbp)->dbenv, (dbc)->locker, \ + DB_LOCK_UPGRADE, &(dbc)->lock_dbt, DB_LOCK_WRITE, \ + &(dbc)->mylock)) != 0) \ + return (ret); \ + } +#define CDB_LOCKING_DONE(dbp, dbc) \ + /* Release the upgraded lock. */ \ + if (F_ISSET(dbc, DBC_WRITECURSOR)) \ + (void)__lock_downgrade( \ + (dbp)->dbenv, &(dbc)->mylock, DB_LOCK_IWRITE, 0); +/* + * Copy the lock info from one cursor to another, so that locking + * in CDB can be done in the context of an internally-duplicated + * or off-page-duplicate cursor. + */ +#define CDB_LOCKING_COPY(dbp, dbc_o, dbc_n) \ + if (CDB_LOCKING((dbp)->dbenv) && \ + F_ISSET((dbc_o), DBC_WRITECURSOR | DBC_WRITEDUP)) { \ + memcpy(&(dbc_n)->mylock, &(dbc_o)->mylock, \ + sizeof((dbc_o)->mylock)); \ + (dbc_n)->locker = (dbc_o)->locker; \ + /* This lock isn't ours to put--just discard it on close. */ \ + F_SET((dbc_n), DBC_WRITEDUP); \ + } + +/* + * __db_c_close -- + * Close the cursor. 
+ * + * PUBLIC: int __db_c_close __P((DBC *)); + */ +int +__db_c_close(dbc) + DBC *dbc; +{ + DB *dbp; + DBC *opd; + DBC_INTERNAL *cp; + int ret, t_ret; + + dbp = dbc->dbp; + ret = 0; + + PANIC_CHECK(dbp->dbenv); + + /* + * If the cursor is already closed we have a serious problem, and we + * assume that the cursor isn't on the active queue. Don't do any of + * the remaining cursor close processing. + */ + if (!F_ISSET(dbc, DBC_ACTIVE)) { + if (dbp != NULL) + __db_err(dbp->dbenv, "Closing closed cursor"); + + DB_ASSERT(0); + return (EINVAL); + } + + cp = dbc->internal; + opd = cp->opd; + + /* + * Remove the cursor(s) from the active queue. We may be closing two + * cursors at once here, a top-level one and a lower-level, off-page + * duplicate one. The access-method specific cursor close routine must + * close both of them in a single call. + * + * !!! + * Cursors must be removed from the active queue before calling the + * access specific cursor close routine, btree depends on having that + * order of operations. It must also happen before any action that + * can fail and cause __db_c_close to return an error, or else calls + * here from __db_close may loop indefinitely. + */ + MUTEX_THREAD_LOCK(dbp->dbenv, dbp->mutexp); + + if (opd != NULL) { + F_CLR(opd, DBC_ACTIVE); + TAILQ_REMOVE(&dbp->active_queue, opd, links); + } + F_CLR(dbc, DBC_ACTIVE); + TAILQ_REMOVE(&dbp->active_queue, dbc, links); + + MUTEX_THREAD_UNLOCK(dbp->dbenv, dbp->mutexp); + + /* Call the access specific cursor close routine. */ + if ((t_ret = + dbc->c_am_close(dbc, PGNO_INVALID, NULL)) != 0 && ret == 0) + ret = t_ret; + + /* + * Release the lock after calling the access method specific close + * routine, a Btree cursor may have had pending deletes. + */ + if (CDB_LOCKING(dbc->dbp->dbenv)) { + /* + * If DBC_WRITEDUP is set, the cursor is an internally + * duplicated write cursor and the lock isn't ours to put.
+ */ + if (!F_ISSET(dbc, DBC_WRITEDUP) && + dbc->mylock.off != LOCK_INVALID) { + if ((t_ret = lock_put(dbc->dbp->dbenv, + &dbc->mylock)) != 0 && ret == 0) + ret = t_ret; + dbc->mylock.off = LOCK_INVALID; + } + + /* For safety's sake, since this is going on the free queue. */ + memset(&dbc->mylock, 0, sizeof(dbc->mylock)); + F_CLR(dbc, DBC_WRITEDUP); + } + + if (dbc->txn != NULL) + dbc->txn->cursors--; + + /* Move the cursor(s) to the free queue. */ + MUTEX_THREAD_LOCK(dbp->dbenv, dbp->mutexp); + if (opd != NULL) { + if (dbc->txn != NULL) + dbc->txn->cursors--; + TAILQ_INSERT_TAIL(&dbp->free_queue, opd, links); + opd = NULL; + } + TAILQ_INSERT_TAIL(&dbp->free_queue, dbc, links); + MUTEX_THREAD_UNLOCK(dbp->dbenv, dbp->mutexp); + + return (ret); +} + +/* + * __db_c_destroy -- + * Destroy the cursor, called after DBC->c_close. + * + * PUBLIC: int __db_c_destroy __P((DBC *)); + */ +int +__db_c_destroy(dbc) + DBC *dbc; +{ + DB *dbp; + DBC_INTERNAL *cp; + int ret; + + dbp = dbc->dbp; + cp = dbc->internal; + + /* Remove the cursor from the free queue. */ + MUTEX_THREAD_LOCK(dbp->dbenv, dbp->mutexp); + TAILQ_REMOVE(&dbp->free_queue, dbc, links); + MUTEX_THREAD_UNLOCK(dbp->dbenv, dbp->mutexp); + + /* Free up allocated memory. */ + if (dbc->rkey.data != NULL) + __os_free(dbc->rkey.data, dbc->rkey.ulen); + if (dbc->rdata.data != NULL) + __os_free(dbc->rdata.data, dbc->rdata.ulen); + + /* Call the access specific cursor destroy routine. */ + ret = dbc->c_am_destroy == NULL ? 0 : dbc->c_am_destroy(dbc); + + __os_free(dbc, sizeof(*dbc)); + + return (ret); +} + +/* + * __db_c_count -- + * Return a count of duplicate data items. 
+ * + * PUBLIC: int __db_c_count __P((DBC *, db_recno_t *, u_int32_t)); + */ +int +__db_c_count(dbc, recnop, flags) + DBC *dbc; + db_recno_t *recnop; + u_int32_t flags; +{ + DB *dbp; + int ret; + + /* + * Cursor Cleanup Note: + * All of the cursors passed to the underlying access methods by this + * routine are not duplicated and will not be cleaned up on return. + * So, pages/locks that the cursor references must be resolved by the + * underlying functions. + */ + dbp = dbc->dbp; + + PANIC_CHECK(dbp->dbenv); + + /* Check for invalid flags. */ + if ((ret = __db_ccountchk(dbp, flags, IS_INITIALIZED(dbc))) != 0) + return (ret); + + switch (dbc->dbtype) { + case DB_QUEUE: + case DB_RECNO: + *recnop = 1; + break; + case DB_HASH: + if (dbc->internal->opd == NULL) { + if ((ret = __ham_c_count(dbc, recnop)) != 0) + return (ret); + break; + } + /* FALLTHROUGH */ + case DB_BTREE: + if ((ret = __bam_c_count(dbc, recnop)) != 0) + return (ret); + break; + default: + return (__db_unknown_type(dbp->dbenv, + "__db_c_count", dbp->type)); + } + return (0); +} + +/* + * __db_c_del -- + * Delete using a cursor. + * + * PUBLIC: int __db_c_del __P((DBC *, u_int32_t)); + */ +int +__db_c_del(dbc, flags) + DBC *dbc; + u_int32_t flags; +{ + DB *dbp; + DBC *opd; + int ret; + + /* + * Cursor Cleanup Note: + * All of the cursors passed to the underlying access methods by this + * routine are not duplicated and will not be cleaned up on return. + * So, pages/locks that the cursor references must be resolved by the + * underlying functions. + */ + dbp = dbc->dbp; + + PANIC_CHECK(dbp->dbenv); + DB_CHECK_TXN(dbp, dbc->txn); + + /* Check for invalid flags. 
*/ + if ((ret = __db_cdelchk(dbp, flags, + F_ISSET(dbp, DB_AM_RDONLY), IS_INITIALIZED(dbc))) != 0) + return (ret); + + DEBUG_LWRITE(dbc, dbc->txn, "db_c_del", NULL, NULL, flags); + + CDB_LOCKING_INIT(dbp, dbc); + + /* + * Off-page duplicate trees are locked in the primary tree, that is, + * we acquire a write lock in the primary tree and no locks in the + * off-page dup tree. If the del operation is done in an off-page + * duplicate tree, call the primary cursor's upgrade routine first. + */ + opd = dbc->internal->opd; + if (opd == NULL) + ret = dbc->c_am_del(dbc); + else + if ((ret = dbc->c_am_writelock(dbc)) == 0) + ret = opd->c_am_del(opd); + + CDB_LOCKING_DONE(dbp, dbc); + + return (ret); +} + +/* + * __db_c_dup -- + * Duplicate a cursor + * + * PUBLIC: int __db_c_dup __P((DBC *, DBC **, u_int32_t)); + */ +int +__db_c_dup(dbc_orig, dbcp, flags) + DBC *dbc_orig; + DBC **dbcp; + u_int32_t flags; +{ + DB_ENV *dbenv; + DB *dbp; + DBC *dbc_n, *dbc_nopd; + int ret; + + dbp = dbc_orig->dbp; + dbenv = dbp->dbenv; + dbc_n = dbc_nopd = NULL; + + PANIC_CHECK(dbp->dbenv); + + /* + * We can never have two write cursors open in CDB, so do not + * allow duplication of a write cursor. + */ + if (flags != DB_POSITIONI && + F_ISSET(dbc_orig, DBC_WRITER | DBC_WRITECURSOR)) { + __db_err(dbenv, "Cannot duplicate writeable cursor"); + return (EINVAL); + } + + /* Allocate a new cursor and initialize it. */ + if ((ret = __db_c_idup(dbc_orig, &dbc_n, flags)) != 0) + goto err; + *dbcp = dbc_n; + + /* + * If we're in CDB, and this isn't an internal duplication (in which + * case we're explicitly overriding CDB locking), the duplicated + * cursor needs its own read lock. (We know it's not a write cursor + * because we wouldn't have made it this far; you can't dup them.) 
+ */ + if (CDB_LOCKING(dbenv) && flags != DB_POSITIONI) { + DB_ASSERT(!F_ISSET(dbc_orig, DBC_WRITER | DBC_WRITECURSOR)); + + if ((ret = lock_get(dbenv, dbc_n->locker, 0, + &dbc_n->lock_dbt, DB_LOCK_READ, &dbc_n->mylock)) != 0) { + (void)__db_c_close(dbc_n); + return (ret); + } + } + + /* + * If the cursor references an off-page duplicate tree, allocate a + * new cursor for that tree and initialize it. + */ + if (dbc_orig->internal->opd != NULL) { + if ((ret = + __db_c_idup(dbc_orig->internal->opd, &dbc_nopd, flags)) != 0) + goto err; + dbc_n->internal->opd = dbc_nopd; + } + + return (0); + +err: if (dbc_n != NULL) + (void)dbc_n->c_close(dbc_n); + if (dbc_nopd != NULL) + (void)dbc_nopd->c_close(dbc_nopd); + + return (ret); +} + +/* + * __db_c_idup -- + * Internal version of __db_c_dup. + */ +static int +__db_c_idup(dbc_orig, dbcp, flags) + DBC *dbc_orig, **dbcp; + u_int32_t flags; +{ + DB *dbp; + DBC *dbc_n; + DBC_INTERNAL *int_n, *int_orig; + int ret; + + dbp = dbc_orig->dbp; + dbc_n = *dbcp; + + if ((ret = __db_icursor(dbp, dbc_orig->txn, dbc_orig->dbtype, + dbc_orig->internal->root, F_ISSET(dbc_orig, DBC_OPD), &dbc_n)) != 0) + return (ret); + + dbc_n->locker = dbc_orig->locker; + + /* If the user wants the cursor positioned, do it here. 
*/ + if (flags == DB_POSITION || flags == DB_POSITIONI) { + int_n = dbc_n->internal; + int_orig = dbc_orig->internal; + + dbc_n->flags = dbc_orig->flags; + + int_n->indx = int_orig->indx; + int_n->pgno = int_orig->pgno; + int_n->root = int_orig->root; + int_n->lock_mode = int_orig->lock_mode; + + switch (dbc_orig->dbtype) { + case DB_QUEUE: + if ((ret = __qam_c_dup(dbc_orig, dbc_n)) != 0) + goto err; + break; + case DB_BTREE: + case DB_RECNO: + if ((ret = __bam_c_dup(dbc_orig, dbc_n)) != 0) + goto err; + break; + case DB_HASH: + if ((ret = __ham_c_dup(dbc_orig, dbc_n)) != 0) + goto err; + break; + default: + ret = __db_unknown_type(dbp->dbenv, + "__db_c_idup", dbc_orig->dbtype); + goto err; + } + } + + /* Now take care of duping the CDB information. */ + CDB_LOCKING_COPY(dbp, dbc_orig, dbc_n); + + *dbcp = dbc_n; + return (0); + +err: (void)dbc_n->c_close(dbc_n); + return (ret); +} + +/* + * __db_c_newopd -- + * Create a new off-page duplicate cursor. + * + * PUBLIC: int __db_c_newopd __P((DBC *, db_pgno_t, DBC **)); + */ +int +__db_c_newopd(dbc_parent, root, dbcp) + DBC *dbc_parent; + db_pgno_t root; + DBC **dbcp; +{ + DB *dbp; + DBC *opd; + DBTYPE dbtype; + int ret; + + dbp = dbc_parent->dbp; + dbtype = (dbp->dup_compare == NULL) ? DB_RECNO : DB_BTREE; + + if ((ret = __db_icursor(dbp, + dbc_parent->txn, dbtype, root, 1, &opd)) != 0) + return (ret); + + CDB_LOCKING_COPY(dbp, dbc_parent, opd); + + *dbcp = opd; + + return (0); +} + +/* + * __db_c_get -- + * Get using a cursor. + * + * PUBLIC: int __db_c_get __P((DBC *, DBT *, DBT *, u_int32_t)); + */ +int +__db_c_get(dbc_arg, key, data, flags) + DBC *dbc_arg; + DBT *key, *data; + u_int32_t flags; +{ + DB *dbp; + DBC *dbc, *dbc_n, *opd; + DBC_INTERNAL *cp, *cp_n; + db_pgno_t pgno; + u_int32_t tmp_flags, tmp_rmw; + u_int8_t type; + int ret, t_ret; + + /* + * Cursor Cleanup Note: + * All of the cursors passed to the underlying access methods by this + * routine are duplicated cursors. 
On return, any referenced pages + * will be discarded, and, if the cursor is not intended to be used + * again, the close function will be called. So, pages/locks that + * the cursor references do not need to be resolved by the underlying + * functions. + */ + dbp = dbc_arg->dbp; + dbc_n = NULL; + opd = NULL; + + PANIC_CHECK(dbp->dbenv); + + /* Check for invalid flags. */ + if ((ret = + __db_cgetchk(dbp, key, data, flags, IS_INITIALIZED(dbc_arg))) != 0) + return (ret); + + /* Clear OR'd in additional bits so we can check for flag equality. */ + tmp_rmw = LF_ISSET(DB_RMW); + LF_CLR(DB_RMW); + + DEBUG_LREAD(dbc_arg, dbc_arg->txn, "db_c_get", + flags == DB_SET || flags == DB_SET_RANGE ? key : NULL, NULL, flags); + + /* + * Return a cursor's record number. It has nothing to do with the + * cursor get code except that it was put into the interface. + */ + if (flags == DB_GET_RECNO) + return (__bam_c_rget(dbc_arg, data, flags | tmp_rmw)); + + if (flags == DB_CONSUME || flags == DB_CONSUME_WAIT) + CDB_LOCKING_INIT(dbp, dbc_arg); + + /* + * If we have an off-page duplicates cursor, and the operation applies + * to it, perform the operation. Duplicate the cursor and call the + * underlying function. + * + * Off-page duplicate trees are locked in the primary tree, that is, + * we acquire a write lock in the primary tree and no locks in the + * off-page dup tree. If the DB_RMW flag was specified and the get + * operation is done in an off-page duplicate tree, call the primary + * cursor's upgrade routine first. 
+ */ + cp = dbc_arg->internal; + if (cp->opd != NULL && + (flags == DB_CURRENT || flags == DB_GET_BOTHC || + flags == DB_NEXT || flags == DB_NEXT_DUP || flags == DB_PREV)) { + if (tmp_rmw && (ret = dbc_arg->c_am_writelock(dbc_arg)) != 0) + return (ret); + if ((ret = __db_c_idup(cp->opd, &opd, DB_POSITIONI)) != 0) + return (ret); + + switch (ret = opd->c_am_get( + opd, key, data, flags, NULL)) { + case 0: + goto done; + case DB_NOTFOUND: + /* + * Translate DB_NOTFOUND failures for the DB_NEXT and + * DB_PREV operations into a subsequent operation on + * the parent cursor. + */ + if (flags == DB_NEXT || flags == DB_PREV) { + if ((ret = opd->c_close(opd)) != 0) + goto err; + opd = NULL; + break; + } + goto err; + default: + goto err; + } + } + + /* + * Perform an operation on the main cursor. Duplicate the cursor, + * upgrade the lock as required, and call the underlying function. + */ + switch (flags) { + case DB_CURRENT: + case DB_GET_BOTHC: + case DB_NEXT: + case DB_NEXT_DUP: + case DB_NEXT_NODUP: + case DB_PREV: + case DB_PREV_NODUP: + tmp_flags = DB_POSITIONI; + break; + default: + tmp_flags = 0; + break; + } + + /* + * If this cursor is going to be closed immediately, we don't + * need to take precautions to clean it up on error. + */ + if (F_ISSET(dbc_arg, DBC_TRANSIENT)) + dbc_n = dbc_arg; + else if ((ret = __db_c_idup(dbc_arg, &dbc_n, tmp_flags)) != 0) + goto err; + + if (tmp_rmw) + F_SET(dbc_n, DBC_RMW); + pgno = PGNO_INVALID; + ret = dbc_n->c_am_get(dbc_n, key, data, flags, &pgno); + if (tmp_rmw) + F_CLR(dbc_n, DBC_RMW); + if (ret != 0) + goto err; + + cp_n = dbc_n->internal; + + /* + * We may be referencing a new off-page duplicates tree. Acquire + * a new cursor and call the underlying function. 
+ */ + if (pgno != PGNO_INVALID) { + if ((ret = __db_c_newopd(dbc_arg, pgno, &cp_n->opd)) != 0) + goto err; + + switch (flags) { + case DB_FIRST: + case DB_NEXT: + case DB_NEXT_NODUP: + case DB_SET: + case DB_SET_RECNO: + case DB_SET_RANGE: + tmp_flags = DB_FIRST; + break; + case DB_LAST: + case DB_PREV: + case DB_PREV_NODUP: + tmp_flags = DB_LAST; + break; + case DB_GET_BOTH: + tmp_flags = DB_GET_BOTH; + break; + case DB_GET_BOTHC: + tmp_flags = DB_GET_BOTHC; + break; + default: + ret = + __db_unknown_flag(dbp->dbenv, "__db_c_get", flags); + goto err; + } + if ((ret = cp_n->opd->c_am_get( + cp_n->opd, key, data, tmp_flags, NULL)) != 0) + goto err; + } + +done: /* + * Return a key/data item. The only exception is that we don't return + * a key if the user already gave us one, that is, if the DB_SET flag + * was set. The DB_SET flag is necessary. In a Btree, the user's key + * doesn't have to be the same as the key stored in the tree, depending on + * the magic performed by the comparison function. As we may not have + * done any key-oriented operation here, the page reference may not be + * valid. Fill it in as necessary. We don't have to worry about any + * locks, the cursor must already be holding appropriate locks. + * + * XXX + * If not a Btree and DB_SET_RANGE is set, we shouldn't return a key + * either, should we? + */ + cp_n = dbc_n == NULL ? dbc_arg->internal : dbc_n->internal; + if (!F_ISSET(key, DB_DBT_ISSET)) { + if (cp_n->page == NULL && (ret = + memp_fget(dbp->mpf, &cp_n->pgno, 0, &cp_n->page)) != 0) + goto err; + + if ((ret = __db_ret(dbp, cp_n->page, cp_n->indx, + key, &dbc_arg->rkey.data, &dbc_arg->rkey.ulen)) != 0) + goto err; + } + dbc = opd != NULL ? opd : cp_n->opd != NULL ? cp_n->opd : dbc_n; + if (!F_ISSET(data, DB_DBT_ISSET)) { + type = TYPE(dbc->internal->page); + ret = __db_ret(dbp, dbc->internal->page, dbc->internal->indx + + (type == P_LBTREE || type == P_HASH ? 
O_INDX : 0), + data, &dbc_arg->rdata.data, &dbc_arg->rdata.ulen); + } + +err: /* Don't pass DB_DBT_ISSET back to application level, error or no. */ + F_CLR(key, DB_DBT_ISSET); + F_CLR(data, DB_DBT_ISSET); + + /* Cleanup and cursor resolution. */ + if (opd != NULL) { + if ((t_ret = + __db_c_cleanup(dbc_arg->internal->opd, + opd, ret)) != 0 && ret == 0) + ret = t_ret; + + } + + if ((t_ret = __db_c_cleanup(dbc_arg, dbc_n, ret)) != 0 && ret == 0) + ret = t_ret; + + if (flags == DB_CONSUME || flags == DB_CONSUME_WAIT) + CDB_LOCKING_DONE(dbp, dbc_arg); + return (ret); +} + +/* + * __db_c_put -- + * Put using a cursor. + * + * PUBLIC: int __db_c_put __P((DBC *, DBT *, DBT *, u_int32_t)); + */ +int +__db_c_put(dbc_arg, key, data, flags) + DBC *dbc_arg; + DBT *key, *data; + u_int32_t flags; +{ + DB *dbp; + DBC *dbc_n, *opd; + db_pgno_t pgno; + u_int32_t tmp_flags; + int ret, t_ret; + + /* + * Cursor Cleanup Note: + * All of the cursors passed to the underlying access methods by this + * routine are duplicated cursors. On return, any referenced pages + * will be discarded, and, if the cursor is not intended to be used + * again, the close function will be called. So, pages/locks that + * the cursor references do not need to be resolved by the underlying + * functions. + */ + dbp = dbc_arg->dbp; + dbc_n = NULL; + + PANIC_CHECK(dbp->dbenv); + DB_CHECK_TXN(dbp, dbc_arg->txn); + + /* Check for invalid flags. */ + if ((ret = __db_cputchk(dbp, key, data, flags, + F_ISSET(dbp, DB_AM_RDONLY), IS_INITIALIZED(dbc_arg))) != 0) + return (ret); + + DEBUG_LWRITE(dbc_arg, dbc_arg->txn, "db_c_put", + flags == DB_KEYFIRST || flags == DB_KEYLAST || + flags == DB_NODUPDATA ? key : NULL, data, flags); + + CDB_LOCKING_INIT(dbp, dbc_arg); + + /* + * If we have an off-page duplicates cursor, and the operation applies + * to it, perform the operation. Duplicate the cursor and call the + * underlying function. 
+ * + * Off-page duplicate trees are locked in the primary tree, that is, + * we acquire a write lock in the primary tree and no locks in the + * off-page dup tree. If the put operation is done in an off-page + * duplicate tree, call the primary cursor's upgrade routine first. + */ + if (dbc_arg->internal->opd != NULL && + (flags == DB_AFTER || flags == DB_BEFORE || flags == DB_CURRENT)) { + /* + * A special case for hash off-page duplicates. Hash doesn't + * support (and is documented not to support) put operations + * relative to a cursor which references an already deleted + * item. For consistency, apply the same criteria to off-page + * duplicates as well. + */ + if (dbc_arg->dbtype == DB_HASH && F_ISSET( + ((BTREE_CURSOR *)(dbc_arg->internal->opd->internal)), + C_DELETED)) { + ret = DB_NOTFOUND; + goto err; + } + + if ((ret = dbc_arg->c_am_writelock(dbc_arg)) != 0) + return (ret); + if ((ret = __db_c_dup(dbc_arg, &dbc_n, DB_POSITIONI)) != 0) + goto err; + opd = dbc_n->internal->opd; + if ((ret = opd->c_am_put( + opd, key, data, flags, NULL)) != 0) + goto err; + goto done; + } + + /* + * Perform an operation on the main cursor. Duplicate the cursor, + * and call the underlying function. + * + * XXX: MARGO + * + tmp_flags = flags == DB_AFTER || + flags == DB_BEFORE || flags == DB_CURRENT ? DB_POSITIONI : 0; + */ + tmp_flags = DB_POSITIONI; + + /* + * If this cursor is going to be closed immediately, we don't + * need to take precautions to clean it up on error. + */ + if (F_ISSET(dbc_arg, DBC_TRANSIENT)) + dbc_n = dbc_arg; + else if ((ret = __db_c_idup(dbc_arg, &dbc_n, tmp_flags)) != 0) + goto err; + + pgno = PGNO_INVALID; + if ((ret = dbc_n->c_am_put(dbc_n, key, data, flags, &pgno)) != 0) + goto err; + + /* + * We may be referencing a new off-page duplicates tree. Acquire + * a new cursor and call the underlying function. 
+ */ + if (pgno != PGNO_INVALID) { + if ((ret = __db_c_newopd(dbc_arg, pgno, &opd)) != 0) + goto err; + dbc_n->internal->opd = opd; + + if ((ret = opd->c_am_put( + opd, key, data, flags, NULL)) != 0) + goto err; + } + +done: +err: /* Cleanup and cursor resolution. */ + if ((t_ret = __db_c_cleanup(dbc_arg, dbc_n, ret)) != 0 && ret == 0) + ret = t_ret; + + CDB_LOCKING_DONE(dbp, dbc_arg); + + return (ret); +} + +/* + * __db_duperr() + * Error message: we don't currently support sorted duplicate duplicates. + * PUBLIC: int __db_duperr __P((DB *, u_int32_t)); + */ +int +__db_duperr(dbp, flags) + DB *dbp; + u_int32_t flags; +{ + if (flags != DB_NODUPDATA) + __db_err(dbp->dbenv, + "Duplicate data items are not supported with sorted data"); + return (DB_KEYEXIST); +} + +/* + * __db_c_cleanup -- + * Clean up duplicate cursors. + */ +static int +__db_c_cleanup(dbc, dbc_n, failed) + DBC *dbc, *dbc_n; + int failed; +{ + DB *dbp; + DBC *opd; + DBC_INTERNAL *internal; + int ret, t_ret; + + dbp = dbc->dbp; + internal = dbc->internal; + ret = 0; + + /* Discard any pages we're holding. */ + if (internal->page != NULL) { + if ((t_ret = + memp_fput(dbp->mpf, internal->page, 0)) != 0 && ret == 0) + ret = t_ret; + internal->page = NULL; + } + opd = internal->opd; + if (opd != NULL && opd->internal->page != NULL) { + if ((t_ret = memp_fput(dbp->mpf, + opd->internal->page, 0)) != 0 && ret == 0) + ret = t_ret; + opd->internal->page = NULL; + } + + /* + * If dbc_n is NULL, there's no internal cursor swapping to be + * done and no dbc_n to close--we probably did the entire + * operation on an offpage duplicate cursor. Just return. + */ + if (dbc_n == NULL) + return (ret); + + /* + * If dbc is marked DBC_TRANSIENT, we're inside a DB->{put/get} + * operation, and as an optimization we performed the operation on + * the main cursor rather than on a duplicated one. Assert + * that dbc_n == dbc (i.e., that we really did skip the + * duplication). 
Then just do nothing--even if there was + * an error, we're about to close the cursor, and the fact that we + * moved it isn't a user-visible violation of our "cursor + * stays put on error" rule. + */ + if (F_ISSET(dbc, DBC_TRANSIENT)) { + DB_ASSERT(dbc == dbc_n); + return (ret); + } + + if (dbc_n->internal->page != NULL) { + if ((t_ret = memp_fput(dbp->mpf, + dbc_n->internal->page, 0)) != 0 && ret == 0) + ret = t_ret; + dbc_n->internal->page = NULL; + } + opd = dbc_n->internal->opd; + if (opd != NULL && opd->internal->page != NULL) { + if ((t_ret = memp_fput(dbp->mpf, + opd->internal->page, 0)) != 0 && ret == 0) + ret = t_ret; + opd->internal->page = NULL; + } + + /* + * If we didn't fail before entering this routine or just now when + * freeing pages, swap the interesting contents of the old and new + * cursors. + */ + if (!failed && ret == 0) { + dbc->internal = dbc_n->internal; + dbc_n->internal = internal; + } + + /* + * Close the cursor we don't care about anymore. The close can fail, + * but we only expect DB_LOCK_DEADLOCK failures. This violates our + * "the cursor is unchanged on error" semantics, but since all you can + * do with a DB_LOCK_DEADLOCK failure is close the cursor, I believe + * that's OK. + * + * XXX + * There's no way to recover from failure to close the old cursor. + * All we can do is move to the new position and return an error. + * + * XXX + * We might want to consider adding a flag to the cursor, so that any + * subsequent operations other than close just return an error? + */ + if ((t_ret = dbc_n->c_close(dbc_n)) != 0 && ret == 0) + ret = t_ret; + + return (ret); +} + +/* + * __db_wrlock_err -- do not have a write lock. 
+ */ +static int +__db_wrlock_err(dbenv) + DB_ENV *dbenv; +{ + __db_err(dbenv, "Write attempted on read-only cursor"); + return (EPERM); +} diff --git a/bdb/db/db_conv.c b/bdb/db/db_conv.c new file mode 100644 index 00000000000..df60be06790 --- /dev/null +++ b/bdb/db/db_conv.c @@ -0,0 +1,348 @@ +/*- + * See the file LICENSE for redistribution information. + * + * Copyright (c) 1996, 1997, 1998, 1999, 2000 + * Sleepycat Software. All rights reserved. + */ +/* + * Copyright (c) 1990, 1993, 1994, 1995, 1996 + * Keith Bostic. All rights reserved. + */ +/* + * Copyright (c) 1990, 1993, 1994, 1995 + * The Regents of the University of California. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 3. Neither the name of the University nor the names of its contributors + * may be used to endorse or promote products derived from this software + * without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. 
IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + */ + +#include "db_config.h" + +#ifndef lint +static const char revid[] = "$Id: db_conv.c,v 11.11 2000/11/30 00:58:31 ubell Exp $"; +#endif /* not lint */ + +#ifndef NO_SYSTEM_INCLUDES +#include <sys/types.h> + +#include <string.h> +#endif + +#include "db_int.h" +#include "db_page.h" +#include "db_swap.h" +#include "db_am.h" +#include "btree.h" +#include "hash.h" +#include "qam.h" + +/* + * __db_pgin -- + * Primary page-swap routine. + * + * PUBLIC: int __db_pgin __P((DB_ENV *, db_pgno_t, void *, DBT *)); + */ +int +__db_pgin(dbenv, pg, pp, cookie) + DB_ENV *dbenv; + db_pgno_t pg; + void *pp; + DBT *cookie; +{ + DB_PGINFO *pginfo; + + pginfo = (DB_PGINFO *)cookie->data; + + switch (((PAGE *)pp)->type) { + case P_HASH: + case P_HASHMETA: + case P_INVALID: + return (__ham_pgin(dbenv, pg, pp, cookie)); + case P_BTREEMETA: + case P_IBTREE: + case P_IRECNO: + case P_LBTREE: + case P_LDUP: + case P_LRECNO: + case P_OVERFLOW: + return (__bam_pgin(dbenv, pg, pp, cookie)); + case P_QAMMETA: + case P_QAMDATA: + return (__qam_pgin_out(dbenv, pg, pp, cookie)); + default: + break; + } + return (__db_unknown_type(dbenv, "__db_pgin", ((PAGE *)pp)->type)); +} + +/* + * __db_pgout -- + * Primary page-swap routine. 
+ * + * PUBLIC: int __db_pgout __P((DB_ENV *, db_pgno_t, void *, DBT *)); + */ +int +__db_pgout(dbenv, pg, pp, cookie) + DB_ENV *dbenv; + db_pgno_t pg; + void *pp; + DBT *cookie; +{ + DB_PGINFO *pginfo; + + pginfo = (DB_PGINFO *)cookie->data; + + switch (((PAGE *)pp)->type) { + case P_HASH: + case P_HASHMETA: + case P_INVALID: + return (__ham_pgout(dbenv, pg, pp, cookie)); + case P_BTREEMETA: + case P_IBTREE: + case P_IRECNO: + case P_LBTREE: + case P_LDUP: + case P_LRECNO: + case P_OVERFLOW: + return (__bam_pgout(dbenv, pg, pp, cookie)); + case P_QAMMETA: + case P_QAMDATA: + return (__qam_pgin_out(dbenv, pg, pp, cookie)); + default: + break; + } + return (__db_unknown_type(dbenv, "__db_pgout", ((PAGE *)pp)->type)); +} + +/* + * __db_metaswap -- + * Byteswap the common part of the meta-data page. + * + * PUBLIC: void __db_metaswap __P((PAGE *)); + */ +void +__db_metaswap(pg) + PAGE *pg; +{ + u_int8_t *p; + + p = (u_int8_t *)pg; + + /* Swap the meta-data information. */ + SWAP32(p); /* lsn.file */ + SWAP32(p); /* lsn.offset */ + SWAP32(p); /* pgno */ + SWAP32(p); /* magic */ + SWAP32(p); /* version */ + SWAP32(p); /* pagesize */ + p += 4; /* unused, page type, unused, unused */ + SWAP32(p); /* free */ + SWAP32(p); /* alloc_lsn part 1 */ + SWAP32(p); /* alloc_lsn part 2 */ + SWAP32(p); /* cached key count */ + SWAP32(p); /* cached record count */ + SWAP32(p); /* flags */ +} + +/* + * __db_byteswap -- + * Byteswap a page. 
+ * + * PUBLIC: int __db_byteswap __P((DB_ENV *, db_pgno_t, PAGE *, size_t, int)); + */ +int +__db_byteswap(dbenv, pg, h, pagesize, pgin) + DB_ENV *dbenv; + db_pgno_t pg; + PAGE *h; + size_t pagesize; + int pgin; +{ + BINTERNAL *bi; + BKEYDATA *bk; + BOVERFLOW *bo; + RINTERNAL *ri; + db_indx_t i, len, tmp; + u_int8_t *p, *end; + + COMPQUIET(pg, 0); + + if (pgin) { + M_32_SWAP(h->lsn.file); + M_32_SWAP(h->lsn.offset); + M_32_SWAP(h->pgno); + M_32_SWAP(h->prev_pgno); + M_32_SWAP(h->next_pgno); + M_16_SWAP(h->entries); + M_16_SWAP(h->hf_offset); + } + + switch (h->type) { + case P_HASH: + for (i = 0; i < NUM_ENT(h); i++) { + if (pgin) + M_16_SWAP(h->inp[i]); + + switch (HPAGE_TYPE(h, i)) { + case H_KEYDATA: + break; + case H_DUPLICATE: + len = LEN_HKEYDATA(h, pagesize, i); + p = HKEYDATA_DATA(P_ENTRY(h, i)); + for (end = p + len; p < end;) { + if (pgin) { + P_16_SWAP(p); + memcpy(&tmp, + p, sizeof(db_indx_t)); + p += sizeof(db_indx_t); + } else { + memcpy(&tmp, + p, sizeof(db_indx_t)); + SWAP16(p); + } + p += tmp; + SWAP16(p); + } + break; + case H_OFFDUP: + p = HOFFPAGE_PGNO(P_ENTRY(h, i)); + SWAP32(p); /* pgno */ + break; + case H_OFFPAGE: + p = HOFFPAGE_PGNO(P_ENTRY(h, i)); + SWAP32(p); /* pgno */ + SWAP32(p); /* tlen */ + break; + } + + } + + /* + * The offsets in the inp array are used to determine + * the size of entries on a page; therefore they + * cannot be converted until we've done all the + * entries. + */ + if (!pgin) + for (i = 0; i < NUM_ENT(h); i++) + M_16_SWAP(h->inp[i]); + break; + case P_LBTREE: + case P_LDUP: + case P_LRECNO: + for (i = 0; i < NUM_ENT(h); i++) { + if (pgin) + M_16_SWAP(h->inp[i]); + + /* + * In the case of on-page duplicates, key information + * should only be swapped once. 
+ */ + if (h->type == P_LBTREE && i > 1) { + if (pgin) { + if (h->inp[i] == h->inp[i - 2]) + continue; + } else { + M_16_SWAP(h->inp[i]); + if (h->inp[i] == h->inp[i - 2]) + continue; + M_16_SWAP(h->inp[i]); + } + } + + bk = GET_BKEYDATA(h, i); + switch (B_TYPE(bk->type)) { + case B_KEYDATA: + M_16_SWAP(bk->len); + break; + case B_DUPLICATE: + case B_OVERFLOW: + bo = (BOVERFLOW *)bk; + M_32_SWAP(bo->pgno); + M_32_SWAP(bo->tlen); + break; + } + + if (!pgin) + M_16_SWAP(h->inp[i]); + } + break; + case P_IBTREE: + for (i = 0; i < NUM_ENT(h); i++) { + if (pgin) + M_16_SWAP(h->inp[i]); + + bi = GET_BINTERNAL(h, i); + M_16_SWAP(bi->len); + M_32_SWAP(bi->pgno); + M_32_SWAP(bi->nrecs); + + switch (B_TYPE(bi->type)) { + case B_KEYDATA: + break; + case B_DUPLICATE: + case B_OVERFLOW: + bo = (BOVERFLOW *)bi->data; + M_32_SWAP(bo->pgno); + M_32_SWAP(bo->tlen); + break; + } + + if (!pgin) + M_16_SWAP(h->inp[i]); + } + break; + case P_IRECNO: + for (i = 0; i < NUM_ENT(h); i++) { + if (pgin) + M_16_SWAP(h->inp[i]); + + ri = GET_RINTERNAL(h, i); + M_32_SWAP(ri->pgno); + M_32_SWAP(ri->nrecs); + + if (!pgin) + M_16_SWAP(h->inp[i]); + } + break; + case P_OVERFLOW: + case P_INVALID: + /* Nothing to do. */ + break; + default: + return (__db_unknown_type(dbenv, "__db_byteswap", h->type)); + } + + if (!pgin) { + /* Swap the header information. */ + M_32_SWAP(h->lsn.file); + M_32_SWAP(h->lsn.offset); + M_32_SWAP(h->pgno); + M_32_SWAP(h->prev_pgno); + M_32_SWAP(h->next_pgno); + M_16_SWAP(h->entries); + M_16_SWAP(h->hf_offset); + } + return (0); +} diff --git a/bdb/db/db_dispatch.c b/bdb/db/db_dispatch.c new file mode 100644 index 00000000000..c9beac401a7 --- /dev/null +++ b/bdb/db/db_dispatch.c @@ -0,0 +1,983 @@ +/*- + * See the file LICENSE for redistribution information. + * + * Copyright (c) 1996, 1997, 1998, 1999, 2000 + * Sleepycat Software. All rights reserved. + */ +/* + * Copyright (c) 1995, 1996 + * The President and Fellows of Harvard University. All rights reserved. 
+ * + * This code is derived from software contributed to Berkeley by + * Margo Seltzer. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 3. Neither the name of the University nor the names of its contributors + * may be used to endorse or promote products derived from this software + * without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. 
+ */ + +#include "db_config.h" + +#ifndef lint +static const char revid[] = "$Id: db_dispatch.c,v 11.41 2001/01/11 18:19:50 bostic Exp $"; +#endif /* not lint */ + +#ifndef NO_SYSTEM_INCLUDES +#include <sys/types.h> + +#include <stddef.h> +#include <stdlib.h> +#include <string.h> +#endif + +#include "db_int.h" +#include "db_page.h" +#include "db_dispatch.h" +#include "db_am.h" +#include "log_auto.h" +#include "txn.h" +#include "txn_auto.h" +#include "log.h" + +static int __db_txnlist_find_internal __P((void *, db_txnlist_type, + u_int32_t, u_int8_t [DB_FILE_ID_LEN], DB_TXNLIST **, int)); + +/* + * __db_dispatch -- + * + * This is the transaction dispatch function used by the db access methods. + * It is designed to handle the record format used by all the access + * methods (the one automatically generated by the db_{h,log,read}.sh + * scripts in the tools directory). An application using a different + * recovery paradigm will supply a different dispatch function to txn_open. + * + * PUBLIC: int __db_dispatch __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *)); + */ +int +__db_dispatch(dbenv, db, lsnp, redo, info) + DB_ENV *dbenv; /* The environment. */ + DBT *db; /* The log record upon which to dispatch. */ + DB_LSN *lsnp; /* The lsn of the record being dispatched. */ + db_recops redo; /* Redo this op (or undo it). */ + void *info; +{ + u_int32_t rectype, txnid; + int make_call, ret; + + memcpy(&rectype, db->data, sizeof(rectype)); + memcpy(&txnid, (u_int8_t *)db->data + sizeof(rectype), sizeof(txnid)); + make_call = ret = 0; + + /* + * If we find a record that is in the user's number space and they + * have specified a recovery routine, let them handle it. If they + * didn't specify a recovery routine, then we expect that they've + * followed all our rules and registered new recovery functions. + */ + switch (redo) { + case DB_TXN_ABORT: + /* + * XXX + * db_printlog depends on DB_TXN_ABORT not examining the TXN + * list. If that ever changes, fix db_printlog too. 
+ */ + make_call = 1; + break; + case DB_TXN_OPENFILES: + if (rectype == DB_log_register) + return (dbenv->dtab[rectype](dbenv, + db, lsnp, redo, info)); + break; + case DB_TXN_BACKWARD_ROLL: + /* + * Running full recovery in the backward pass. If we've + * seen this txnid before and added it to our commit list, + * then we do nothing during this pass, unless this is a child + * commit record, in which case we need to process it. If + * we've never seen it, then we call the appropriate recovery + * routine. + * + * We need to always undo DB_db_noop records, so that we + * properly handle any aborts before the file was closed. + */ + if (rectype == DB_log_register || + rectype == DB_txn_ckp || rectype == DB_db_noop + || rectype == DB_txn_child || (txnid != 0 && + (ret = __db_txnlist_find(info, txnid)) != 0)) { + make_call = 1; + if (ret == DB_NOTFOUND && rectype != DB_txn_regop && + rectype != DB_txn_xa_regop && (ret = + __db_txnlist_add(dbenv, info, txnid, 1)) != 0) + return (ret); + } + break; + case DB_TXN_FORWARD_ROLL: + /* + * In the forward pass, if we haven't seen the transaction, + * do nothing, else recover it. + * + * We need to always redo DB_db_noop records, so that we + * properly handle any commits after the file was closed.
+ */ + if (rectype == DB_log_register || + rectype == DB_txn_ckp || + rectype == DB_db_noop || + __db_txnlist_find(info, txnid) == 0) + make_call = 1; + break; + default: + return (__db_unknown_flag(dbenv, "__db_dispatch", redo)); + } + + if (make_call) { + if (rectype >= DB_user_BEGIN && dbenv->tx_recover != NULL) + return (dbenv->tx_recover(dbenv, db, lsnp, redo)); + else + return (dbenv->dtab[rectype](dbenv, db, lsnp, redo, info)); + } + + return (0); +} + +/* + * __db_add_recovery -- + * + * PUBLIC: int __db_add_recovery __P((DB_ENV *, + * PUBLIC: int (*)(DB_ENV *, DBT *, DB_LSN *, db_recops, void *), u_int32_t)); + */ +int +__db_add_recovery(dbenv, func, ndx) + DB_ENV *dbenv; + int (*func) __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *)); + u_int32_t ndx; +{ + u_int32_t i, nsize; + int ret; + + /* Check if we have to grow the table. */ + if (ndx >= dbenv->dtab_size) { + nsize = ndx + 40; + if ((ret = __os_realloc(dbenv, + nsize * sizeof(dbenv->dtab[0]), NULL, &dbenv->dtab)) != 0) + return (ret); + for (i = dbenv->dtab_size; i < nsize; ++i) + dbenv->dtab[i] = NULL; + dbenv->dtab_size = nsize; + } + + dbenv->dtab[ndx] = func; + return (0); +} + +/* + * __deprecated_recover -- + * Stub routine for deprecated recovery functions. + * + * PUBLIC: int __deprecated_recover + * PUBLIC: __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *)); + */ +int +__deprecated_recover(dbenv, dbtp, lsnp, op, info) + DB_ENV *dbenv; + DBT *dbtp; + DB_LSN *lsnp; + db_recops op; + void *info; +{ + COMPQUIET(dbenv, NULL); + COMPQUIET(dbtp, NULL); + COMPQUIET(lsnp, NULL); + COMPQUIET(op, 0); + COMPQUIET(info, NULL); + return (EINVAL); +} + +/* + * __db_txnlist_init -- + * Initialize transaction linked list. 
+ * + * PUBLIC: int __db_txnlist_init __P((DB_ENV *, void *)); + */ +int +__db_txnlist_init(dbenv, retp) + DB_ENV *dbenv; + void *retp; +{ + DB_TXNHEAD *headp; + int ret; + + if ((ret = __os_malloc(dbenv, sizeof(DB_TXNHEAD), NULL, &headp)) != 0) + return (ret); + + LIST_INIT(&headp->head); + headp->maxid = 0; + headp->generation = 1; + + *(void **)retp = headp; + return (0); +} + +/* + * __db_txnlist_add -- + * Add an element to our transaction linked list. + * + * PUBLIC: int __db_txnlist_add __P((DB_ENV *, void *, u_int32_t, int32_t)); + */ +int +__db_txnlist_add(dbenv, listp, txnid, aborted) + DB_ENV *dbenv; + void *listp; + u_int32_t txnid; + int32_t aborted; +{ + DB_TXNHEAD *hp; + DB_TXNLIST *elp; + int ret; + + if ((ret = __os_malloc(dbenv, sizeof(DB_TXNLIST), NULL, &elp)) != 0) + return (ret); + + hp = (DB_TXNHEAD *)listp; + LIST_INSERT_HEAD(&hp->head, elp, links); + + elp->type = TXNLIST_TXNID; + elp->u.t.txnid = txnid; + elp->u.t.aborted = aborted; + if (txnid > hp->maxid) + hp->maxid = txnid; + elp->u.t.generation = hp->generation; + + return (0); +} +/* + * __db_txnlist_remove -- + * Remove an element from our transaction linked list. + * + * PUBLIC: int __db_txnlist_remove __P((void *, u_int32_t)); + */ +int +__db_txnlist_remove(listp, txnid) + void *listp; + u_int32_t txnid; +{ + DB_TXNLIST *entry; + + return (__db_txnlist_find_internal(listp, + TXNLIST_TXNID, txnid, NULL, &entry, 1)); +} + +/* __db_txnlist_close -- + * + * Call this when we close a file. It allows us to reconcile whether + * we have done any operations on this file with whether the file appears + * to have been deleted. If you never do any operations on a file, then + * we assume it's OK to appear deleted. 
+ * + * PUBLIC: int __db_txnlist_close __P((void *, int32_t, u_int32_t)); + */ + +int +__db_txnlist_close(listp, lid, count) + void *listp; + int32_t lid; + u_int32_t count; +{ + DB_TXNHEAD *hp; + DB_TXNLIST *p; + + hp = (DB_TXNHEAD *)listp; + for (p = LIST_FIRST(&hp->head); p != NULL; p = LIST_NEXT(p, links)) { + if (p->type == TXNLIST_DELETE) + if (lid == p->u.d.fileid && + !F_ISSET(&p->u.d, TXNLIST_FLAG_CLOSED)) { + p->u.d.count += count; + return (0); + } + } + + return (0); +} + +/* + * __db_txnlist_delete -- + * + * Record that a file was missing or deleted. If the deleted + * flag is set, then we've encountered a delete of a file, else we've + * just encountered a file that is missing. The lid is the log fileid + * and is only meaningful if deleted is not equal to 0. + * + * PUBLIC: int __db_txnlist_delete __P((DB_ENV *, + * PUBLIC: void *, char *, u_int32_t, int)); + */ +int +__db_txnlist_delete(dbenv, listp, name, lid, deleted) + DB_ENV *dbenv; + void *listp; + char *name; + u_int32_t lid; + int deleted; +{ + DB_TXNHEAD *hp; + DB_TXNLIST *p; + int ret; + + hp = (DB_TXNHEAD *)listp; + for (p = LIST_FIRST(&hp->head); p != NULL; p = LIST_NEXT(p, links)) { + if (p->type == TXNLIST_DELETE) + if (strcmp(name, p->u.d.fname) == 0) { + if (deleted) + F_SET(&p->u.d, TXNLIST_FLAG_DELETED); + else + F_CLR(&p->u.d, TXNLIST_FLAG_CLOSED); + return (0); + } + } + + /* Need to add it. */ + if ((ret = __os_malloc(dbenv, sizeof(DB_TXNLIST), NULL, &p)) != 0) + return (ret); + LIST_INSERT_HEAD(&hp->head, p, links); + + p->type = TXNLIST_DELETE; + p->u.d.flags = 0; + if (deleted) + F_SET(&p->u.d, TXNLIST_FLAG_DELETED); + p->u.d.fileid = lid; + p->u.d.count = 0; + ret = __os_strdup(dbenv, name, &p->u.d.fname); + + return (ret); +} + +/* + * __db_txnlist_end -- + * Discard transaction linked list. Print out any error messages + * for deleted files. 
+ * + * PUBLIC: void __db_txnlist_end __P((DB_ENV *, void *)); + */ +void +__db_txnlist_end(dbenv, listp) + DB_ENV *dbenv; + void *listp; +{ + DB_TXNHEAD *hp; + DB_TXNLIST *p; + DB_LOG *lp; + + hp = (DB_TXNHEAD *)listp; + lp = (DB_LOG *)dbenv->lg_handle; + while (hp != NULL && (p = LIST_FIRST(&hp->head)) != NULL) { + LIST_REMOVE(p, links); + switch (p->type) { + case TXNLIST_DELETE: + /* + * If we have a file that is not deleted and has + * some operations, we flag the warning. Since + * the file could still be open, we need to check + * the actual log table as well. + */ + if ((!F_ISSET(&p->u.d, TXNLIST_FLAG_DELETED) && + p->u.d.count != 0) || + (!F_ISSET(&p->u.d, TXNLIST_FLAG_CLOSED) && + p->u.d.fileid != (int32_t) TXNLIST_INVALID_ID && + p->u.d.fileid < lp->dbentry_cnt && + lp->dbentry[p->u.d.fileid].count != 0)) + __db_err(dbenv, "warning: %s: %s", + p->u.d.fname, db_strerror(ENOENT)); + __os_freestr(p->u.d.fname); + break; + case TXNLIST_LSN: + __os_free(p->u.l.lsn_array, + p->u.l.maxn * sizeof(DB_LSN)); + break; + default: + /* Possibly an incomplete DB_TXNLIST; just free it. */ + break; + } + __os_free(p, sizeof(DB_TXNLIST)); + } + __os_free(listp, sizeof(DB_TXNHEAD)); +} + +/* + * __db_txnlist_find -- + * Checks to see if a txnid with the current generation is in the + * txnid list. This returns DB_NOTFOUND if the item isn't in the + * list otherwise it returns (like __db_txnlist_find_internal) a + * 1 or 0 indicating if the transaction is aborted or not. A txnid + * of 0 means the record was generated while not in a transaction. + * + * PUBLIC: int __db_txnlist_find __P((void *, u_int32_t)); + */ +int +__db_txnlist_find(listp, txnid) + void *listp; + u_int32_t txnid; +{ + DB_TXNLIST *entry; + + if (txnid == 0) + return (DB_NOTFOUND); + return (__db_txnlist_find_internal(listp, + TXNLIST_TXNID, txnid, NULL, &entry, 0)); +} + +/* + * __db_txnlist_find_internal -- + * Find an entry on the transaction list. 
+ * If the entry is not there or the list pointer is not initialized + * we return DB_NOTFOUND. If the item is found, we return the aborted + * status (1 for aborted, 0 for not aborted). Currently we always call + * this with an initialized list pointer but checking for NULL keeps it general. + */ +static int +__db_txnlist_find_internal(listp, type, txnid, uid, txnlistp, delete) + void *listp; + db_txnlist_type type; + u_int32_t txnid; + u_int8_t uid[DB_FILE_ID_LEN]; + DB_TXNLIST **txnlistp; + int delete; +{ + DB_TXNHEAD *hp; + DB_TXNLIST *p; + int ret; + + if ((hp = (DB_TXNHEAD *)listp) == NULL) + return (DB_NOTFOUND); + + for (p = LIST_FIRST(&hp->head); p != NULL; p = LIST_NEXT(p, links)) { + if (p->type != type) + continue; + switch (type) { + case TXNLIST_TXNID: + if (p->u.t.txnid != txnid || + hp->generation != p->u.t.generation) + continue; + ret = p->u.t.aborted; + break; + + case TXNLIST_PGNO: + if (memcmp(uid, p->u.p.uid, DB_FILE_ID_LEN) != 0) + continue; + + ret = 0; + break; + default: + DB_ASSERT(0); + ret = EINVAL; + } + if (delete == 1) { + LIST_REMOVE(p, links); + __os_free(p, sizeof(DB_TXNLIST)); + } else if (p != LIST_FIRST(&hp->head)) { + /* Move it to head of list. */ + LIST_REMOVE(p, links); + LIST_INSERT_HEAD(&hp->head, p, links); + } + *txnlistp = p; + return (ret); + } + + return (DB_NOTFOUND); +} + +/* + * __db_txnlist_gen -- + * Change the current generation number. + * + * PUBLIC: void __db_txnlist_gen __P((void *, int)); + */ +void +__db_txnlist_gen(listp, incr) + void *listp; + int incr; +{ + DB_TXNHEAD *hp; + + /* + * During recovery, generation numbers keep track of how many "restart" + * checkpoints we've seen. Restart checkpoints occur whenever we take + * a checkpoint and there are no outstanding transactions. When that + * happens, we can reset transaction IDs back to 1. It always happens + * at recovery and it prevents us from exhausting the transaction IDs + * name space.
+ */ + hp = (DB_TXNHEAD *)listp; + hp->generation += incr; +} + +#define TXN_BUBBLE(AP, MAX) { \ + int __j; \ + DB_LSN __tmp; \ + \ + for (__j = 0; __j < MAX - 1; __j++) \ + if (log_compare(&AP[__j], &AP[__j + 1]) < 0) { \ + __tmp = AP[__j]; \ + AP[__j] = AP[__j + 1]; \ + AP[__j + 1] = __tmp; \ + } \ +} + +/* + * __db_txnlist_lsnadd -- + * Add to or re-sort the transaction list lsn entry. + * Note that since this is used during an abort, the __txn_undo + * code calls into the "recovery" subsystem explicitly, and there + * is only a single TXNLIST_LSN entry on the list. + * + * PUBLIC: int __db_txnlist_lsnadd __P((DB_ENV *, void *, DB_LSN *, u_int32_t)); + */ +int +__db_txnlist_lsnadd(dbenv, listp, lsnp, flags) + DB_ENV *dbenv; + void *listp; + DB_LSN *lsnp; + u_int32_t flags; +{ + DB_TXNHEAD *hp; + DB_TXNLIST *elp; + int i, ret; + + hp = (DB_TXNHEAD *)listp; + + for (elp = LIST_FIRST(&hp->head); + elp != NULL; elp = LIST_NEXT(elp, links)) + if (elp->type == TXNLIST_LSN) + break; + + if (elp == NULL) + return (EINVAL); + + if (LF_ISSET(TXNLIST_NEW)) { + if (elp->u.l.ntxns >= elp->u.l.maxn) { + if ((ret = __os_realloc(dbenv, + 2 * elp->u.l.maxn * sizeof(DB_LSN), + NULL, &elp->u.l.lsn_array)) != 0) + return (ret); + elp->u.l.maxn *= 2; + } + elp->u.l.lsn_array[elp->u.l.ntxns++] = *lsnp; + } else + /* Simply replace the 0th element. */ + elp->u.l.lsn_array[0] = *lsnp; + + /* + * If we just added a new entry and there may be NULL + * entries, so we have to do a complete bubble sort, + * not just trickle a changed entry around. + */ + for (i = 0; i < (!LF_ISSET(TXNLIST_NEW) ? 1 : elp->u.l.ntxns); i++) + TXN_BUBBLE(elp->u.l.lsn_array, elp->u.l.ntxns); + + *lsnp = elp->u.l.lsn_array[0]; + + return (0); +} + +/* + * __db_txnlist_lsnhead -- + * Return a pointer to the beginning of the lsn_array. 
+ * + * PUBLIC: int __db_txnlist_lsnhead __P((void *, DB_LSN **)); + */ +int +__db_txnlist_lsnhead(listp, lsnpp) + void *listp; + DB_LSN **lsnpp; +{ + DB_TXNHEAD *hp; + DB_TXNLIST *elp; + + hp = (DB_TXNHEAD *)listp; + + for (elp = LIST_FIRST(&hp->head); + elp != NULL; elp = LIST_NEXT(elp, links)) + if (elp->type == TXNLIST_LSN) + break; + + if (elp == NULL) + return (EINVAL); + + *lsnpp = &elp->u.l.lsn_array[0]; + + return (0); +} + +/* + * __db_txnlist_lsninit -- + * Initialize a transaction list with an lsn array entry. + * + * PUBLIC: int __db_txnlist_lsninit __P((DB_ENV *, DB_TXNHEAD *, DB_LSN *)); + */ +int +__db_txnlist_lsninit(dbenv, hp, lsnp) + DB_ENV *dbenv; + DB_TXNHEAD *hp; + DB_LSN *lsnp; +{ + DB_TXNLIST *elp; + int ret; + + elp = NULL; + + if ((ret = __os_malloc(dbenv, sizeof(DB_TXNLIST), NULL, &elp)) != 0) + goto err; + LIST_INSERT_HEAD(&hp->head, elp, links); + + if ((ret = __os_malloc(dbenv, + 12 * sizeof(DB_LSN), NULL, &elp->u.l.lsn_array)) != 0) + goto err; + elp->type = TXNLIST_LSN; + elp->u.l.maxn = 12; + elp->u.l.ntxns = 1; + elp->u.l.lsn_array[0] = *lsnp; + + return (0); + +err: __db_txnlist_end(dbenv, hp); + return (ret); +} + +/* + * __db_add_limbo -- add pages to the limbo list. + * Get the file information and call pgnoadd + * for each page. + * + * PUBLIC: int __db_add_limbo __P((DB_ENV *, + * PUBLIC: void *, int32_t, db_pgno_t, int32_t)); + */ +int +__db_add_limbo(dbenv, info, fileid, pgno, count) + DB_ENV *dbenv; + void *info; + int32_t fileid; + db_pgno_t pgno; + int32_t count; +{ + DB_LOG *dblp; + FNAME *fnp; + int ret; + + dblp = dbenv->lg_handle; + if ((ret = __log_lid_to_fname(dblp, fileid, &fnp)) != 0) + return (ret); + + do { + if ((ret = + __db_txnlist_pgnoadd(dbenv, info, fileid, fnp->ufid, + R_ADDR(&dblp->reginfo, fnp->name_off), pgno)) != 0) + return (ret); + pgno++; + } while (--count != 0); + + return (0); +} + +/* + * __db_do_the_limbo -- move pages from limbo to free. 
+ * + * If we are in recovery we add things to the free list without + * logging because we want to incrementally apply logs that + * may be generated on another copy of this environment. + * Otherwise we just call __db_free to put the pages on + * the free list and log the activity. + * + * PUBLIC: int __db_do_the_limbo __P((DB_ENV *, DB_TXNHEAD *)); + */ +int +__db_do_the_limbo(dbenv, hp) + DB_ENV *dbenv; + DB_TXNHEAD *hp; +{ + DB *dbp; + DBC *dbc; + DBMETA *meta; + DB_TXN *txn; + DB_TXNLIST *elp; + PAGE *pagep; + db_pgno_t last_pgno, pgno; + int i, in_recover, put_page, ret, t_ret; + + dbp = NULL; + dbc = NULL; + txn = NULL; + ret = 0; + + /* Are we in recovery? */ + in_recover = F_ISSET((DB_LOG *)dbenv->lg_handle, DBLOG_RECOVER); + + for (elp = LIST_FIRST(&hp->head); + elp != NULL; elp = LIST_NEXT(elp, links)) { + if (elp->type != TXNLIST_PGNO) + continue; + + if (in_recover) { + if ((ret = db_create(&dbp, dbenv, 0)) != 0) + goto err; + + /* + * It is ok if the file is no longer there. + */ + dbp->type = DB_UNKNOWN; + ret = __db_dbopen(dbp, + elp->u.p.fname, 0, __db_omode("rw----"), 0); + } else { + /* + * If we are in transaction undo, then we know + * the fileid is still correct. + */ + if ((ret = + __db_fileid_to_db(dbenv, &dbp, + elp->u.p.fileid, 0)) != 0 && ret != DB_DELETED) + goto err; + /* File is being destroyed. */ + if (F_ISSET(dbp, DB_AM_DISCARD)) + ret = DB_DELETED; + } + /* + * Verify that we are opening the same file that we were + * referring to when we wrote this log record. + */ + if (ret == 0 && + memcmp(elp->u.p.uid, dbp->fileid, DB_FILE_ID_LEN) == 0) { + last_pgno = PGNO_INVALID; + if (in_recover) { + pgno = PGNO_BASE_MD; + if ((ret = memp_fget(dbp->mpf, + &pgno, 0, (PAGE **)&meta)) != 0) + goto err; + last_pgno = meta->free; + /* + * Check to see if the head of the free + * list is any of the pages we are about + * to link in. We could have crashed + * after linking them in and before writing + * a checkpoint.
+ * It may not be the last one since + * any page may get reallocated before here. + */ + if (last_pgno != PGNO_INVALID) + for (i = 0; i < elp->u.p.nentries; i++) + if (last_pgno + == elp->u.p.pgno_array[i]) + goto done_it; + } + + for (i = 0; i < elp->u.p.nentries; i++) { + pgno = elp->u.p.pgno_array[i]; + if ((ret = memp_fget(dbp->mpf, + &pgno, DB_MPOOL_CREATE, &pagep)) != 0) + goto err; + + put_page = 1; + if (IS_ZERO_LSN(LSN(pagep))) { + P_INIT(pagep, dbp->pgsize, + pgno, PGNO_INVALID, + last_pgno, 0, P_INVALID); + + if (in_recover) { + LSN(pagep) = LSN(meta); + last_pgno = pgno; + } else { + /* + * Starting the transaction + * is postponed until we know + * we have something to do. + */ + if (txn == NULL && + (ret = txn_begin(dbenv, + NULL, &txn, 0)) != 0) + goto err; + + if (dbc == NULL && + (ret = dbp->cursor(dbp, + txn, &dbc, 0)) != 0) + goto err; + /* Turn off locking. */ + F_SET(dbc, DBC_COMPENSATE); + + /* __db_free puts the page. */ + if ((ret = + __db_free(dbc, pagep)) != 0) + goto err; + put_page = 0; + } + } + + if (put_page == 1 && + (ret = memp_fput(dbp->mpf, + pagep, DB_MPOOL_DIRTY)) != 0) + goto err; + } + if (in_recover) { + if (last_pgno == meta->free) { +done_it: + if ((ret = + memp_fput(dbp->mpf, meta, 0)) != 0) + goto err; + } else { + /* + * Flush the new free list then + * update the metapage. This is + * unlogged so we cannot have the + * metapage pointing at pages that + * are not on disk. 
+ */ + dbp->sync(dbp, 0); + meta->free = last_pgno; + if ((ret = memp_fput(dbp->mpf, + meta, DB_MPOOL_DIRTY)) != 0) + goto err; + } + } + if (dbc != NULL && (ret = dbc->c_close(dbc)) != 0) + goto err; + dbc = NULL; + } + if (in_recover && (t_ret = dbp->close(dbp, 0)) != 0 && ret == 0) + ret = t_ret; + dbp = NULL; + __os_free(elp->u.p.fname, 0); + __os_free(elp->u.p.pgno_array, 0); + if (ret == ENOENT) + ret = 0; + else if (ret != 0) + goto err; + } + + if (txn != NULL) { + ret = txn_commit(txn, 0); + txn = NULL; + } +err: + if (dbc != NULL) + (void)dbc->c_close(dbc); + if (in_recover && dbp != NULL) + (void)dbp->close(dbp, 0); + if (txn != NULL) + (void)txn_abort(txn); + return (ret); + +} + +#define DB_TXNLIST_MAX_PGNO 8 /* A nice even number. */ + +/* + * __db_txnlist_pgnoadd -- + * Find the txnlist entry for a file and add this pgno, + * or add the list entry for the file and then add the pgno. + * + * PUBLIC: int __db_txnlist_pgnoadd __P((DB_ENV *, DB_TXNHEAD *, + * PUBLIC: int32_t, u_int8_t [DB_FILE_ID_LEN], char *, db_pgno_t)); + */ +int +__db_txnlist_pgnoadd(dbenv, hp, fileid, uid, fname, pgno) + DB_ENV *dbenv; + DB_TXNHEAD *hp; + int32_t fileid; + u_int8_t uid[DB_FILE_ID_LEN]; + char *fname; + db_pgno_t pgno; +{ + DB_TXNLIST *elp; + int len, ret; + + elp = NULL; + + if (__db_txnlist_find_internal(hp, TXNLIST_PGNO, 0, uid, &elp, 0) != 0) { + if ((ret = + __os_malloc(dbenv, sizeof(DB_TXNLIST), NULL, &elp)) != 0) + goto err; + LIST_INSERT_HEAD(&hp->head, elp, links); + elp->u.p.fileid = fileid; + memcpy(elp->u.p.uid, uid, DB_FILE_ID_LEN); + + len = strlen(fname) + 1; + if ((ret = __os_malloc(dbenv, len, NULL, &elp->u.p.fname)) != 0) + goto err; + memcpy(elp->u.p.fname, fname, len); + + elp->u.p.maxentry = 0; + elp->type = TXNLIST_PGNO; + if ((ret = __os_malloc(dbenv, + 8 * sizeof(db_pgno_t), NULL, &elp->u.p.pgno_array)) != 0) + goto err; + elp->u.p.maxentry = DB_TXNLIST_MAX_PGNO; + elp->u.p.nentries = 0; + } else if (elp->u.p.nentries == elp->u.p.maxentry) { + 
elp->u.p.maxentry <<= 1; + if ((ret = __os_realloc(dbenv, elp->u.p.maxentry * + sizeof(db_pgno_t), NULL, &elp->u.p.pgno_array)) != 0) + goto err; + } + + elp->u.p.pgno_array[elp->u.p.nentries++] = pgno; + + return (0); + +err: __db_txnlist_end(dbenv, hp); + return (ret); +} + +#ifdef DEBUG +/* + * __db_txnlist_print -- + * Print out the transaction list. + * + * PUBLIC: void __db_txnlist_print __P((void *)); + */ +void +__db_txnlist_print(listp) + void *listp; +{ + DB_TXNHEAD *hp; + DB_TXNLIST *p; + + hp = (DB_TXNHEAD *)listp; + + printf("Maxid: %lu Generation: %lu\n", + (u_long)hp->maxid, (u_long)hp->generation); + for (p = LIST_FIRST(&hp->head); p != NULL; p = LIST_NEXT(p, links)) { + switch (p->type) { + case TXNLIST_TXNID: + printf("TXNID: %lu(%lu)\n", + (u_long)p->u.t.txnid, (u_long)p->u.t.generation); + break; + case TXNLIST_DELETE: + printf("FILE: %s id=%d ops=%d %s %s\n", + p->u.d.fname, p->u.d.fileid, p->u.d.count, + F_ISSET(&p->u.d, TXNLIST_FLAG_DELETED) ? + "(deleted)" : "(missing)", + F_ISSET(&p->u.d, TXNLIST_FLAG_CLOSED) ? + "(closed)" : "(open)"); + + break; + default: + printf("Unrecognized type: %d\n", p->type); + break; + } + } +} +#endif diff --git a/bdb/db/db_dup.c b/bdb/db/db_dup.c new file mode 100644 index 00000000000..6d8b2df9518 --- /dev/null +++ b/bdb/db/db_dup.c @@ -0,0 +1,275 @@ +/*- + * See the file LICENSE for redistribution information. + * + * Copyright (c) 1996, 1997, 1998, 1999, 2000 + * Sleepycat Software. All rights reserved. + */ + +#include "db_config.h" + +#ifndef lint +static const char revid[] = "$Id: db_dup.c,v 11.18 2000/11/30 00:58:32 ubell Exp $"; +#endif /* not lint */ + +#ifndef NO_SYSTEM_INCLUDES +#include <sys/types.h> + +#include <string.h> +#endif + +#include "db_int.h" +#include "db_page.h" +#include "db_shash.h" +#include "btree.h" +#include "hash.h" +#include "lock.h" +#include "db_am.h" + +/* + * __db_ditem -- + * Remove an item from a page. 
+ * + * PUBLIC: int __db_ditem __P((DBC *, PAGE *, u_int32_t, u_int32_t)); + */ +int +__db_ditem(dbc, pagep, indx, nbytes) + DBC *dbc; + PAGE *pagep; + u_int32_t indx, nbytes; +{ + DB *dbp; + DBT ldbt; + db_indx_t cnt, offset; + int ret; + u_int8_t *from; + + dbp = dbc->dbp; + if (DB_LOGGING(dbc)) { + ldbt.data = P_ENTRY(pagep, indx); + ldbt.size = nbytes; + if ((ret = __db_addrem_log(dbp->dbenv, dbc->txn, + &LSN(pagep), 0, DB_REM_DUP, dbp->log_fileid, PGNO(pagep), + (u_int32_t)indx, nbytes, &ldbt, NULL, &LSN(pagep))) != 0) + return (ret); + } + + /* + * If there's only a single item on the page, we don't have to + * work hard. + */ + if (NUM_ENT(pagep) == 1) { + NUM_ENT(pagep) = 0; + HOFFSET(pagep) = dbp->pgsize; + return (0); + } + + /* + * Pack the remaining key/data items at the end of the page. Use + * memmove(3), the regions may overlap. + */ + from = (u_int8_t *)pagep + HOFFSET(pagep); + memmove(from + nbytes, from, pagep->inp[indx] - HOFFSET(pagep)); + HOFFSET(pagep) += nbytes; + + /* Adjust the indices' offsets. */ + offset = pagep->inp[indx]; + for (cnt = 0; cnt < NUM_ENT(pagep); ++cnt) + if (pagep->inp[cnt] < offset) + pagep->inp[cnt] += nbytes; + + /* Shift the indices down. */ + --NUM_ENT(pagep); + if (indx != NUM_ENT(pagep)) + memmove(&pagep->inp[indx], &pagep->inp[indx + 1], + sizeof(db_indx_t) * (NUM_ENT(pagep) - indx)); + + return (0); +} + +/* + * __db_pitem -- + * Put an item on a page. + * + * PUBLIC: int __db_pitem + * PUBLIC: __P((DBC *, PAGE *, u_int32_t, u_int32_t, DBT *, DBT *)); + */ +int +__db_pitem(dbc, pagep, indx, nbytes, hdr, data) + DBC *dbc; + PAGE *pagep; + u_int32_t indx; + u_int32_t nbytes; + DBT *hdr, *data; +{ + DB *dbp; + BKEYDATA bk; + DBT thdr; + int ret; + u_int8_t *p; + + if (nbytes > P_FREESPACE(pagep)) { + DB_ASSERT(nbytes <= P_FREESPACE(pagep)); + return (EINVAL); + } + /* + * Put a single item onto a page. The logic figuring out where to + * insert and whether it fits is handled in the caller. 
All we do + * here is manage the page shuffling. We cheat a little bit in that + * we don't want to copy the dbt on a normal put twice. If hdr is + * NULL, we create a BKEYDATA structure on the page, otherwise, just + * copy the caller's information onto the page. + * + * This routine is also used to put entries onto the page where the + * entry is pre-built, e.g., during recovery. In this case, the hdr + * will point to the entry, and the data argument will be NULL. + * + * !!! + * There's a tremendous potential for off-by-one errors here, since + * the passed in header sizes must be adjusted for the structure's + * placeholder for the trailing variable-length data field. + */ + dbp = dbc->dbp; + if (DB_LOGGING(dbc)) + if ((ret = __db_addrem_log(dbp->dbenv, dbc->txn, + &LSN(pagep), 0, DB_ADD_DUP, dbp->log_fileid, PGNO(pagep), + (u_int32_t)indx, nbytes, hdr, data, &LSN(pagep))) != 0) + return (ret); + + if (hdr == NULL) { + B_TSET(bk.type, B_KEYDATA, 0); + bk.len = data == NULL ? 0 : data->size; + + thdr.data = &bk; + thdr.size = SSZA(BKEYDATA, data); + hdr = &thdr; + } + + /* Adjust the index table, then put the item on the page. */ + if (indx != NUM_ENT(pagep)) + memmove(&pagep->inp[indx + 1], &pagep->inp[indx], + sizeof(db_indx_t) * (NUM_ENT(pagep) - indx)); + HOFFSET(pagep) -= nbytes; + pagep->inp[indx] = HOFFSET(pagep); + ++NUM_ENT(pagep); + + p = P_ENTRY(pagep, indx); + memcpy(p, hdr->data, hdr->size); + if (data != NULL) + memcpy(p + hdr->size, data->data, data->size); + + return (0); +} + +/* + * __db_relink -- + * Relink around a deleted page. 
+ * + * PUBLIC: int __db_relink __P((DBC *, u_int32_t, PAGE *, PAGE **, int)); + */ +int +__db_relink(dbc, add_rem, pagep, new_next, needlock) + DBC *dbc; + u_int32_t add_rem; + PAGE *pagep, **new_next; + int needlock; +{ + DB *dbp; + PAGE *np, *pp; + DB_LOCK npl, ppl; + DB_LSN *nlsnp, *plsnp, ret_lsn; + int ret; + + ret = 0; + np = pp = NULL; + npl.off = ppl.off = LOCK_INVALID; + nlsnp = plsnp = NULL; + dbp = dbc->dbp; + + /* + * Retrieve and lock the one/two pages. For a remove, we may need + * two pages (the before and after). For an add, we only need one + * because the split took care of the prev. + */ + if (pagep->next_pgno != PGNO_INVALID) { + if (needlock && (ret = __db_lget(dbc, + 0, pagep->next_pgno, DB_LOCK_WRITE, 0, &npl)) != 0) + goto err; + if ((ret = memp_fget(dbp->mpf, + &pagep->next_pgno, 0, &np)) != 0) { + (void)__db_pgerr(dbp, pagep->next_pgno); + goto err; + } + nlsnp = &np->lsn; + } + if (add_rem == DB_REM_PAGE && pagep->prev_pgno != PGNO_INVALID) { + if (needlock && (ret = __db_lget(dbc, + 0, pagep->prev_pgno, DB_LOCK_WRITE, 0, &ppl)) != 0) + goto err; + if ((ret = memp_fget(dbp->mpf, + &pagep->prev_pgno, 0, &pp)) != 0) { + (void)__db_pgerr(dbp, pagep->prev_pgno); + goto err; + } + plsnp = &pp->lsn; + } + + /* Log the change. */ + if (DB_LOGGING(dbc)) { + if ((ret = __db_relink_log(dbp->dbenv, dbc->txn, + &ret_lsn, 0, add_rem, dbp->log_fileid, + pagep->pgno, &pagep->lsn, + pagep->prev_pgno, plsnp, pagep->next_pgno, nlsnp)) != 0) + goto err; + if (np != NULL) + np->lsn = ret_lsn; + if (pp != NULL) + pp->lsn = ret_lsn; + if (add_rem == DB_REM_PAGE) + pagep->lsn = ret_lsn; + } + + /* + * Modify and release the two pages. + * + * !!! + * The parameter new_next gets set to the page following the page we + * are removing. If there is no following page, then new_next gets + * set to NULL.
+ */ + if (np != NULL) { + if (add_rem == DB_ADD_PAGE) + np->prev_pgno = pagep->pgno; + else + np->prev_pgno = pagep->prev_pgno; + if (new_next == NULL) + ret = memp_fput(dbp->mpf, np, DB_MPOOL_DIRTY); + else { + *new_next = np; + ret = memp_fset(dbp->mpf, np, DB_MPOOL_DIRTY); + } + if (ret != 0) + goto err; + if (needlock) + (void)__TLPUT(dbc, npl); + } else if (new_next != NULL) + *new_next = NULL; + + if (pp != NULL) { + pp->next_pgno = pagep->next_pgno; + if ((ret = memp_fput(dbp->mpf, pp, DB_MPOOL_DIRTY)) != 0) + goto err; + if (needlock) + (void)__TLPUT(dbc, ppl); + } + return (0); + +err: if (np != NULL) + (void)memp_fput(dbp->mpf, np, 0); + if (needlock && npl.off != LOCK_INVALID) + (void)__TLPUT(dbc, npl); + if (pp != NULL) + (void)memp_fput(dbp->mpf, pp, 0); + if (needlock && ppl.off != LOCK_INVALID) + (void)__TLPUT(dbc, ppl); + return (ret); +} diff --git a/bdb/db/db_iface.c b/bdb/db/db_iface.c new file mode 100644 index 00000000000..3548a2527bb --- /dev/null +++ b/bdb/db/db_iface.c @@ -0,0 +1,687 @@ +/*- + * See the file LICENSE for redistribution information. + * + * Copyright (c) 1996, 1997, 1998, 1999, 2000 + * Sleepycat Software. All rights reserved. + */ + +#include "db_config.h" + +#ifndef lint +static const char revid[] = "$Id: db_iface.c,v 11.34 2001/01/11 18:19:51 bostic Exp $"; +#endif /* not lint */ + +#ifndef NO_SYSTEM_INCLUDES +#include <sys/types.h> + +#include <errno.h> +#endif + +#include "db_int.h" +#include "db_page.h" +#include "db_am.h" +#include "btree.h" + +static int __db_curinval __P((const DB_ENV *)); +static int __db_rdonly __P((const DB_ENV *, const char *)); +static int __dbt_ferr __P((const DB *, const char *, const DBT *, int)); + +/* + * __db_cursorchk -- + * Common cursor argument checking routine. + * + * PUBLIC: int __db_cursorchk __P((const DB *, u_int32_t, int)); + */ +int +__db_cursorchk(dbp, flags, isrdonly) + const DB *dbp; + u_int32_t flags; + int isrdonly; +{ + /* Check for invalid function flags. 
*/ + switch (flags) { + case 0: + break; + case DB_WRITECURSOR: + if (isrdonly) + return (__db_rdonly(dbp->dbenv, "DB->cursor")); + if (!CDB_LOCKING(dbp->dbenv)) + return (__db_ferr(dbp->dbenv, "DB->cursor", 0)); + break; + case DB_WRITELOCK: + if (isrdonly) + return (__db_rdonly(dbp->dbenv, "DB->cursor")); + break; + default: + return (__db_ferr(dbp->dbenv, "DB->cursor", 0)); + } + + return (0); +} + +/* + * __db_ccountchk -- + * Common cursor count argument checking routine. + * + * PUBLIC: int __db_ccountchk __P((const DB *, u_int32_t, int)); + */ +int +__db_ccountchk(dbp, flags, isvalid) + const DB *dbp; + u_int32_t flags; + int isvalid; +{ + /* Check for invalid function flags. */ + switch (flags) { + case 0: + break; + default: + return (__db_ferr(dbp->dbenv, "DBcursor->c_count", 0)); + } + + /* + * The cursor must be initialized, return EINVAL for an invalid cursor, + * otherwise 0. + */ + return (isvalid ? 0 : __db_curinval(dbp->dbenv)); +} + +/* + * __db_cdelchk -- + * Common cursor delete argument checking routine. + * + * PUBLIC: int __db_cdelchk __P((const DB *, u_int32_t, int, int)); + */ +int +__db_cdelchk(dbp, flags, isrdonly, isvalid) + const DB *dbp; + u_int32_t flags; + int isrdonly, isvalid; +{ + /* Check for changes to a read-only tree. */ + if (isrdonly) + return (__db_rdonly(dbp->dbenv, "c_del")); + + /* Check for invalid function flags. */ + switch (flags) { + case 0: + break; + default: + return (__db_ferr(dbp->dbenv, "DBcursor->c_del", 0)); + } + + /* + * The cursor must be initialized, return EINVAL for an invalid cursor, + * otherwise 0. + */ + return (isvalid ? 0 : __db_curinval(dbp->dbenv)); +} + +/* + * __db_cgetchk -- + * Common cursor get argument checking routine. + * + * PUBLIC: int __db_cgetchk __P((const DB *, DBT *, DBT *, u_int32_t, int)); + */ +int +__db_cgetchk(dbp, key, data, flags, isvalid) + const DB *dbp; + DBT *key, *data; + u_int32_t flags; + int isvalid; +{ + int ret; + + /* + * Check for read-modify-write validity. 
DB_RMW doesn't make sense + * with CDB cursors since if you're going to write the cursor, you + * had to create it with DB_WRITECURSOR. Regardless, we check for + * LOCKING_ON and not STD_LOCKING, as we don't want to disallow it. + * If this changes, confirm that DB does not itself set the DB_RMW + * flag in a path where CDB may have been configured. + */ + if (LF_ISSET(DB_RMW)) { + if (!LOCKING_ON(dbp->dbenv)) { + __db_err(dbp->dbenv, + "the DB_RMW flag requires locking"); + return (EINVAL); + } + LF_CLR(DB_RMW); + } + + /* Check for invalid function flags. */ + switch (flags) { + case DB_CONSUME: + case DB_CONSUME_WAIT: + if (dbp->type != DB_QUEUE) + goto err; + break; + case DB_CURRENT: + case DB_FIRST: + case DB_GET_BOTH: + case DB_LAST: + case DB_NEXT: + case DB_NEXT_DUP: + case DB_NEXT_NODUP: + case DB_PREV: + case DB_PREV_NODUP: + case DB_SET: + case DB_SET_RANGE: + break; + case DB_GET_BOTHC: + if (dbp->type == DB_QUEUE) + goto err; + break; + case DB_GET_RECNO: + if (!F_ISSET(dbp, DB_BT_RECNUM)) + goto err; + break; + case DB_SET_RECNO: + if (!F_ISSET(dbp, DB_BT_RECNUM)) + goto err; + break; + default: +err: return (__db_ferr(dbp->dbenv, "DBcursor->c_get", 0)); + } + + /* Check for invalid key/data flags. */ + if ((ret = __dbt_ferr(dbp, "key", key, 0)) != 0) + return (ret); + if ((ret = __dbt_ferr(dbp, "data", data, 0)) != 0) + return (ret); + + /* + * The cursor must be initialized for DB_CURRENT or DB_NEXT_DUP, + * return EINVAL for an invalid cursor, otherwise 0. + */ + if (isvalid || (flags != DB_CURRENT && flags != DB_NEXT_DUP)) + return (0); + + return (__db_curinval(dbp->dbenv)); +} + +/* + * __db_cputchk -- + * Common cursor put argument checking routine. 
+ * + * PUBLIC: int __db_cputchk __P((const DB *, + * PUBLIC: const DBT *, DBT *, u_int32_t, int, int)); + */ +int +__db_cputchk(dbp, key, data, flags, isrdonly, isvalid) + const DB *dbp; + const DBT *key; + DBT *data; + u_int32_t flags; + int isrdonly, isvalid; +{ + int key_flags, ret; + + key_flags = 0; + + /* Check for changes to a read-only tree. */ + if (isrdonly) + return (__db_rdonly(dbp->dbenv, "c_put")); + + /* Check for invalid function flags. */ + switch (flags) { + case DB_AFTER: + case DB_BEFORE: + switch (dbp->type) { + case DB_BTREE: + case DB_HASH: /* Only with unsorted duplicates. */ + if (!F_ISSET(dbp, DB_AM_DUP)) + goto err; + if (dbp->dup_compare != NULL) + goto err; + break; + case DB_QUEUE: /* Not permitted. */ + goto err; + case DB_RECNO: /* Only with mutable record numbers. */ + if (!F_ISSET(dbp, DB_RE_RENUMBER)) + goto err; + key_flags = 1; + break; + default: + goto err; + } + break; + case DB_CURRENT: + /* + * If there is a comparison function, doing a DB_CURRENT + * must not change the part of the data item that is used + * for the comparison. + */ + break; + case DB_NODUPDATA: + if (!F_ISSET(dbp, DB_AM_DUPSORT)) + goto err; + /* FALLTHROUGH */ + case DB_KEYFIRST: + case DB_KEYLAST: + if (dbp->type == DB_QUEUE || dbp->type == DB_RECNO) + goto err; + key_flags = 1; + break; + default: +err: return (__db_ferr(dbp->dbenv, "DBcursor->c_put", 0)); + } + + /* Check for invalid key/data flags. */ + if (key_flags && (ret = __dbt_ferr(dbp, "key", key, 0)) != 0) + return (ret); + if ((ret = __dbt_ferr(dbp, "data", data, 0)) != 0) + return (ret); + + /* + * The cursor must be initialized for anything other than DB_KEYFIRST + * and DB_KEYLAST, return EINVAL for an invalid cursor, otherwise 0. + */ + if (isvalid || flags == DB_KEYFIRST || + flags == DB_KEYLAST || flags == DB_NODUPDATA) + return (0); + + return (__db_curinval(dbp->dbenv)); +} + +/* + * __db_closechk -- + * DB->close flag check. 
+ * + * PUBLIC: int __db_closechk __P((const DB *, u_int32_t)); + */ +int +__db_closechk(dbp, flags) + const DB *dbp; + u_int32_t flags; +{ + /* Check for invalid function flags. */ + switch (flags) { + case 0: + case DB_NOSYNC: + break; + default: + return (__db_ferr(dbp->dbenv, "DB->close", 0)); + } + + return (0); +} + +/* + * __db_delchk -- + * Common delete argument checking routine. + * + * PUBLIC: int __db_delchk __P((const DB *, DBT *, u_int32_t, int)); + */ +int +__db_delchk(dbp, key, flags, isrdonly) + const DB *dbp; + DBT *key; + u_int32_t flags; + int isrdonly; +{ + COMPQUIET(key, NULL); + + /* Check for changes to a read-only tree. */ + if (isrdonly) + return (__db_rdonly(dbp->dbenv, "delete")); + + /* Check for invalid function flags. */ + switch (flags) { + case 0: + break; + default: + return (__db_ferr(dbp->dbenv, "DB->del", 0)); + } + + return (0); +} + +/* + * __db_getchk -- + * Common get argument checking routine. + * + * PUBLIC: int __db_getchk __P((const DB *, const DBT *, DBT *, u_int32_t)); + */ +int +__db_getchk(dbp, key, data, flags) + const DB *dbp; + const DBT *key; + DBT *data; + u_int32_t flags; +{ + int ret; + + /* + * Check for read-modify-write validity. DB_RMW doesn't make sense + * with CDB cursors since if you're going to write the cursor, you + * had to create it with DB_WRITECURSOR. Regardless, we check for + * LOCKING_ON and not STD_LOCKING, as we don't want to disallow it. + * If this changes, confirm that DB does not itself set the DB_RMW + * flag in a path where CDB may have been configured. + */ + if (LF_ISSET(DB_RMW)) { + if (!LOCKING_ON(dbp->dbenv)) { + __db_err(dbp->dbenv, + "the DB_RMW flag requires locking"); + return (EINVAL); + } + LF_CLR(DB_RMW); + } + + /* Check for invalid function flags. 
*/ + switch (flags) { + case 0: + case DB_GET_BOTH: + break; + case DB_SET_RECNO: + if (!F_ISSET(dbp, DB_BT_RECNUM)) + goto err; + break; + case DB_CONSUME: + case DB_CONSUME_WAIT: + if (dbp->type == DB_QUEUE) + break; + /* Fall through */ + default: +err: return (__db_ferr(dbp->dbenv, "DB->get", 0)); + } + + /* Check for invalid key/data flags. */ + if ((ret = __dbt_ferr(dbp, "key", key, flags == DB_SET_RECNO)) != 0) + return (ret); + if ((ret = __dbt_ferr(dbp, "data", data, 1)) != 0) + return (ret); + + return (0); +} + +/* + * __db_joinchk -- + * Common join argument checking routine. + * + * PUBLIC: int __db_joinchk __P((const DB *, DBC * const *, u_int32_t)); + */ +int +__db_joinchk(dbp, curslist, flags) + const DB *dbp; + DBC * const *curslist; + u_int32_t flags; +{ + DB_TXN *txn; + int i; + + switch (flags) { + case 0: + case DB_JOIN_NOSORT: + break; + default: + return (__db_ferr(dbp->dbenv, "DB->join", 0)); + } + + if (curslist == NULL || curslist[0] == NULL) { + __db_err(dbp->dbenv, + "At least one secondary cursor must be specified to DB->join"); + return (EINVAL); + } + + txn = curslist[0]->txn; + for (i = 1; curslist[i] != NULL; i++) + if (curslist[i]->txn != txn) { + __db_err(dbp->dbenv, + "All secondary cursors must share the same transaction"); + return (EINVAL); + } + + return (0); +} + +/* + * __db_joingetchk -- + * Common join_get argument checking routine. 
+ * + * PUBLIC: int __db_joingetchk __P((const DB *, DBT *, u_int32_t)); + */ +int +__db_joingetchk(dbp, key, flags) + const DB *dbp; + DBT *key; + u_int32_t flags; +{ + + if (LF_ISSET(DB_RMW)) { + if (!LOCKING_ON(dbp->dbenv)) { + __db_err(dbp->dbenv, + "the DB_RMW flag requires locking"); + return (EINVAL); + } + LF_CLR(DB_RMW); + } + + switch (flags) { + case 0: + case DB_JOIN_ITEM: + break; + default: + return (__db_ferr(dbp->dbenv, "DBcursor->c_get", 0)); + } + + /* + * A partial get of the key of a join cursor doesn't make much sense; + * the entire key is necessary to query the primary database + * and find the datum, and so regardless of the size of the key + * it would not be a performance improvement. Since it would require + * special handling, we simply disallow it. + * + * A partial get of the data, however, potentially makes sense (if + * all possible data are a predictable large structure, for instance) + * and causes us no headaches, so we permit it. + */ + if (F_ISSET(key, DB_DBT_PARTIAL)) { + __db_err(dbp->dbenv, + "DB_DBT_PARTIAL may not be set on key during join_get"); + return (EINVAL); + } + + return (0); +} + +/* + * __db_putchk -- + * Common put argument checking routine. + * + * PUBLIC: int __db_putchk + * PUBLIC: __P((const DB *, DBT *, const DBT *, u_int32_t, int, int)); + */ +int +__db_putchk(dbp, key, data, flags, isrdonly, isdup) + const DB *dbp; + DBT *key; + const DBT *data; + u_int32_t flags; + int isrdonly, isdup; +{ + int ret; + + /* Check for changes to a read-only tree. */ + if (isrdonly) + return (__db_rdonly(dbp->dbenv, "put")); + + /* Check for invalid function flags. */ + switch (flags) { + case 0: + case DB_NOOVERWRITE: + break; + case DB_APPEND: + if (dbp->type != DB_RECNO && dbp->type != DB_QUEUE) + goto err; + break; + case DB_NODUPDATA: + if (F_ISSET(dbp, DB_AM_DUPSORT)) + break; + /* FALLTHROUGH */ + default: +err: return (__db_ferr(dbp->dbenv, "DB->put", 0)); + } + + /* Check for invalid key/data flags.
*/ + if ((ret = __dbt_ferr(dbp, "key", key, 0)) != 0) + return (ret); + if ((ret = __dbt_ferr(dbp, "data", data, 0)) != 0) + return (ret); + + /* Check for partial puts in the presence of duplicates. */ + if (isdup && F_ISSET(data, DB_DBT_PARTIAL)) { + __db_err(dbp->dbenv, +"a partial put in the presence of duplicates requires a cursor operation"); + return (EINVAL); + } + + return (0); +} + +/* + * __db_removechk -- + * DB->remove flag check. + * + * PUBLIC: int __db_removechk __P((const DB *, u_int32_t)); + */ +int +__db_removechk(dbp, flags) + const DB *dbp; + u_int32_t flags; +{ + /* Check for invalid function flags. */ + switch (flags) { + case 0: + break; + default: + return (__db_ferr(dbp->dbenv, "DB->remove", 0)); + } + + return (0); +} + +/* + * __db_statchk -- + * Common stat argument checking routine. + * + * PUBLIC: int __db_statchk __P((const DB *, u_int32_t)); + */ +int +__db_statchk(dbp, flags) + const DB *dbp; + u_int32_t flags; +{ + /* Check for invalid function flags. */ + switch (flags) { + case 0: + case DB_CACHED_COUNTS: + break; + case DB_RECORDCOUNT: + if (dbp->type == DB_RECNO) + break; + if (dbp->type == DB_BTREE && F_ISSET(dbp, DB_BT_RECNUM)) + break; + goto err; + default: +err: return (__db_ferr(dbp->dbenv, "DB->stat", 0)); + } + + return (0); +} + +/* + * __db_syncchk -- + * Common sync argument checking routine. + * + * PUBLIC: int __db_syncchk __P((const DB *, u_int32_t)); + */ +int +__db_syncchk(dbp, flags) + const DB *dbp; + u_int32_t flags; +{ + /* Check for invalid function flags. */ + switch (flags) { + case 0: + break; + default: + return (__db_ferr(dbp->dbenv, "DB->sync", 0)); + } + + return (0); +} + +/* + * __dbt_ferr -- + * Check a DBT for flag errors. + */ +static int +__dbt_ferr(dbp, name, dbt, check_thread) + const DB *dbp; + const char *name; + const DBT *dbt; + int check_thread; +{ + DB_ENV *dbenv; + int ret; + + dbenv = dbp->dbenv; + + /* + * Check for invalid DBT flags. 
We allow any of the flags to be + * specified to any DB or DBcursor call so that applications can + * set DB_DBT_MALLOC when retrieving a data item from a secondary + * database and then specify that same DBT as a key to a primary + * database, without having to clear flags. + */ + if ((ret = __db_fchk(dbenv, name, dbt->flags, + DB_DBT_MALLOC | DB_DBT_DUPOK | + DB_DBT_REALLOC | DB_DBT_USERMEM | DB_DBT_PARTIAL)) != 0) + return (ret); + switch (F_ISSET(dbt, DB_DBT_MALLOC | DB_DBT_REALLOC | DB_DBT_USERMEM)) { + case 0: + case DB_DBT_MALLOC: + case DB_DBT_REALLOC: + case DB_DBT_USERMEM: + break; + default: + return (__db_ferr(dbenv, name, 1)); + } + + if (check_thread && DB_IS_THREADED(dbp) && + !F_ISSET(dbt, DB_DBT_MALLOC | DB_DBT_REALLOC | DB_DBT_USERMEM)) { + __db_err(dbenv, + "DB_THREAD mandates memory allocation flag on DBT %s", + name); + return (EINVAL); + } + return (0); +} + +/* + * __db_rdonly -- + * Common readonly message. + */ +static int +__db_rdonly(dbenv, name) + const DB_ENV *dbenv; + const char *name; +{ + __db_err(dbenv, "%s: attempt to modify a read-only tree", name); + return (EACCES); +} + +/* + * __db_curinval + * Report that a cursor is in an invalid state. + */ +static int +__db_curinval(dbenv) + const DB_ENV *dbenv; +{ + __db_err(dbenv, + "Cursor position must be set before performing this operation"); + return (EINVAL); +} diff --git a/bdb/db/db_join.c b/bdb/db/db_join.c new file mode 100644 index 00000000000..881dedde0fc --- /dev/null +++ b/bdb/db/db_join.c @@ -0,0 +1,730 @@ +/*- + * See the file LICENSE for redistribution information. + * + * Copyright (c) 1998, 1999, 2000 + * Sleepycat Software. All rights reserved. 
+ */ + +#include "db_config.h" + +#ifndef lint +static const char revid[] = "$Id: db_join.c,v 11.31 2000/12/20 22:41:54 krinsky Exp $"; +#endif /* not lint */ + +#ifndef NO_SYSTEM_INCLUDES +#include <sys/types.h> + +#include <stdlib.h> +#include <string.h> +#endif + +#include "db_int.h" +#include "db_page.h" +#include "db_join.h" +#include "db_am.h" +#include "btree.h" + +static int __db_join_close __P((DBC *)); +static int __db_join_cmp __P((const void *, const void *)); +static int __db_join_del __P((DBC *, u_int32_t)); +static int __db_join_get __P((DBC *, DBT *, DBT *, u_int32_t)); +static int __db_join_getnext __P((DBC *, DBT *, DBT *, u_int32_t)); +static int __db_join_put __P((DBC *, DBT *, DBT *, u_int32_t)); + +/* + * Check to see if the Nth secondary cursor of join cursor jc is pointing + * to a sorted duplicate set. + */ +#define SORTED_SET(jc, n) ((jc)->j_curslist[(n)]->dbp->dup_compare != NULL) + +/* + * This is the duplicate-assisted join functionality. Right now we're + * going to write it such that we return one item at a time, although + * I think we may need to optimize it to return them all at once. + * It should be easier to get it working this way, and I believe that + * changing it should be fairly straightforward. + * + * We optimize the join by sorting cursors from smallest to largest + * cardinality. In most cases, this is indeed optimal. However, if + * a cursor with large cardinality has very few data in common with the + * first cursor, it is possible that the join will be made faster by + * putting it earlier in the cursor list. Since we have no way to detect + * cases like this, we simply provide a flag, DB_JOIN_NOSORT, which retains + * the sort order specified by the caller, who may know more about the + * structure of the data. + * + * The first cursor moves sequentially through the duplicate set while + * the others search explicitly for the duplicate in question. 
+ * + */ + +/* + * __db_join -- + * This is the interface to the duplicate-assisted join functionality. + * In the same way that cursors mark a position in a database, a cursor + * can mark a position in a join. While most cursors are created by the + * cursor method of a DB, join cursors are created through an explicit + * call to DB->join. + * + * The curslist is an array of existing, initialized cursors and primary + * is the DB of the primary file. The data item that joins all the + * cursors in the curslist is used as the key into the primary and that + * key and data are returned. When no more items are left in the join + * set, the c_next operation off the join cursor will return DB_NOTFOUND. + * + * PUBLIC: int __db_join __P((DB *, DBC **, DBC **, u_int32_t)); + */ +int +__db_join(primary, curslist, dbcp, flags) + DB *primary; + DBC **curslist, **dbcp; + u_int32_t flags; +{ + DB_ENV *dbenv; + DBC *dbc; + JOIN_CURSOR *jc; + int ret; + u_int32_t i, ncurs, nslots; + + COMPQUIET(nslots, 0); + + PANIC_CHECK(primary->dbenv); + + if ((ret = __db_joinchk(primary, curslist, flags)) != 0) + return (ret); + + dbc = NULL; + jc = NULL; + dbenv = primary->dbenv; + + if ((ret = __os_calloc(dbenv, 1, sizeof(DBC), &dbc)) != 0) + goto err; + + if ((ret = __os_calloc(dbenv, + 1, sizeof(JOIN_CURSOR), &jc)) != 0) + goto err; + + if ((ret = __os_malloc(dbenv, 256, NULL, &jc->j_key.data)) != 0) + goto err; + jc->j_key.ulen = 256; + F_SET(&jc->j_key, DB_DBT_USERMEM); + + for (jc->j_curslist = curslist; + *jc->j_curslist != NULL; jc->j_curslist++) + ; + + /* + * The number of cursor slots we allocate is one greater than + * the number of cursors involved in the join, because the + * list is NULL-terminated. + */ + ncurs = jc->j_curslist - curslist; + nslots = ncurs + 1; + + /* + * !!! -- A note on the various lists hanging off jc. + * + * j_curslist is the initial NULL-terminated list of cursors passed + * into __db_join.
The original cursors are not modified; pristine + * copies are required because, in databases with unsorted dups, we + * must reset all of the secondary cursors after the first each + * time the first one is incremented, or else we will lose data + * which happen to be sorted differently in two different cursors. + * + * j_workcurs is where we put those copies that we're planning to + * work with. They're lazily c_dup'ed from j_curslist as we need + * them, and closed when the join cursor is closed or when we need + * to reset them to their original values (in which case we just + * c_dup afresh). + * + * j_fdupcurs is an array of cursors which point to the first + * duplicate in the duplicate set that contains the data value + * we're currently interested in. We need this to make + * __db_join_get correctly return duplicate duplicates; i.e., if a + * given data value occurs twice in the set belonging to cursor #2, + * and thrice in the set belonging to cursor #3, and once in all + * the other cursors, successive calls to __db_join_get need to + * return that data item six times. To make this happen, each time + * cursor N is allowed to advance to a new datum, all cursors M + * such that M > N have to be reset to the first duplicate with + * that datum, so __db_join_get will return all the dup-dups again. + * We could just reset them to the original cursor from j_curslist, + * but that would be a bit slower in the unsorted case and a LOT + * slower in the sorted one. + * + * j_exhausted is a list of boolean values which represent + * whether or not their corresponding cursors are "exhausted", + * i.e. whether the datum under the corresponding cursor has + * been found not to exist in any unreturned combinations of + * later secondary cursors, in which case they are ready to be + * incremented. + */ + + /* We don't want to free regions whose callocs have failed. 
*/ + jc->j_curslist = NULL; + jc->j_workcurs = NULL; + jc->j_fdupcurs = NULL; + jc->j_exhausted = NULL; + + if ((ret = __os_calloc(dbenv, nslots, sizeof(DBC *), + &jc->j_curslist)) != 0) + goto err; + if ((ret = __os_calloc(dbenv, nslots, sizeof(DBC *), + &jc->j_workcurs)) != 0) + goto err; + if ((ret = __os_calloc(dbenv, nslots, sizeof(DBC *), + &jc->j_fdupcurs)) != 0) + goto err; + if ((ret = __os_calloc(dbenv, nslots, sizeof(u_int8_t), + &jc->j_exhausted)) != 0) + goto err; + for (i = 0; curslist[i] != NULL; i++) { + jc->j_curslist[i] = curslist[i]; + jc->j_workcurs[i] = NULL; + jc->j_fdupcurs[i] = NULL; + jc->j_exhausted[i] = 0; + } + jc->j_ncurs = ncurs; + + /* + * If DB_JOIN_NOSORT is not set, optimize secondary cursors by + * sorting in order of increasing cardinality. + */ + if (!LF_ISSET(DB_JOIN_NOSORT)) + qsort(jc->j_curslist, ncurs, sizeof(DBC *), __db_join_cmp); + + /* + * We never need to reset the 0th cursor, so there's no + * solid reason to use workcurs[0] rather than curslist[0] in + * join_get. Nonetheless, it feels cleaner to do it for symmetry, + * and this is the most logical place to copy it. + * + * !!! + * There's no need to close the new cursor if we goto err only + * because this is the last thing that can fail. Modifier of this + * function beware! 
+ */ + if ((ret = jc->j_curslist[0]->c_dup(jc->j_curslist[0], jc->j_workcurs, + DB_POSITIONI)) != 0) + goto err; + + dbc->c_close = __db_join_close; + dbc->c_del = __db_join_del; + dbc->c_get = __db_join_get; + dbc->c_put = __db_join_put; + dbc->internal = (DBC_INTERNAL *) jc; + dbc->dbp = primary; + jc->j_primary = primary; + + *dbcp = dbc; + + MUTEX_THREAD_LOCK(dbenv, primary->mutexp); + TAILQ_INSERT_TAIL(&primary->join_queue, dbc, links); + MUTEX_THREAD_UNLOCK(dbenv, primary->mutexp); + + return (0); + +err: if (jc != NULL) { + if (jc->j_curslist != NULL) + __os_free(jc->j_curslist, nslots * sizeof(DBC *)); + if (jc->j_workcurs != NULL) { + if (jc->j_workcurs[0] != NULL) + __os_free(jc->j_workcurs[0], sizeof(DBC)); + __os_free(jc->j_workcurs, nslots * sizeof(DBC *)); + } + if (jc->j_fdupcurs != NULL) + __os_free(jc->j_fdupcurs, nslots * sizeof(DBC *)); + if (jc->j_exhausted != NULL) + __os_free(jc->j_exhausted, nslots * sizeof(u_int8_t)); + __os_free(jc, sizeof(JOIN_CURSOR)); + } + if (dbc != NULL) + __os_free(dbc, sizeof(DBC)); + return (ret); +} + +static int +__db_join_put(dbc, key, data, flags) + DBC *dbc; + DBT *key; + DBT *data; + u_int32_t flags; +{ + PANIC_CHECK(dbc->dbp->dbenv); + + COMPQUIET(key, NULL); + COMPQUIET(data, NULL); + COMPQUIET(flags, 0); + return (EINVAL); +} + +static int +__db_join_del(dbc, flags) + DBC *dbc; + u_int32_t flags; +{ + PANIC_CHECK(dbc->dbp->dbenv); + + COMPQUIET(flags, 0); + return (EINVAL); +} + +static int +__db_join_get(dbc, key_arg, data_arg, flags) + DBC *dbc; + DBT *key_arg, *data_arg; + u_int32_t flags; +{ + DBT *key_n, key_n_mem; + DB *dbp; + DBC *cp; + JOIN_CURSOR *jc; + int ret; + u_int32_t i, j, operation; + + dbp = dbc->dbp; + jc = (JOIN_CURSOR *)dbc->internal; + + PANIC_CHECK(dbp->dbenv); + + operation = LF_ISSET(DB_OPFLAGS_MASK); + + if ((ret = __db_joingetchk(dbp, key_arg, flags)) != 0) + return (ret); + + /* + * Since we are fetching the key as a datum in the secondary indices, + * we must be careful of 
caller-specified DB_DBT_* memory + * management flags. If necessary, use a stack-allocated DBT; + * we'll appropriately copy and/or allocate the data later. + */ + if (F_ISSET(key_arg, DB_DBT_USERMEM) || + F_ISSET(key_arg, DB_DBT_MALLOC)) { + /* We just use the default buffer; no need to go malloc. */ + key_n = &key_n_mem; + memset(key_n, 0, sizeof(DBT)); + } else { + /* + * Either DB_DBT_REALLOC or the default buffer will work + * fine if we have to reuse it, as we do. + */ + key_n = key_arg; + } + + /* + * If our last attempt to do a get on the primary key failed, + * short-circuit the join and try again with the same key. + */ + if (F_ISSET(jc, JOIN_RETRY)) + goto samekey; + F_CLR(jc, JOIN_RETRY); + +retry: ret = jc->j_workcurs[0]->c_get(jc->j_workcurs[0], + &jc->j_key, key_n, jc->j_exhausted[0] ? DB_NEXT_DUP : DB_CURRENT); + + if (ret == ENOMEM) { + jc->j_key.ulen <<= 1; + if ((ret = __os_realloc(dbp->dbenv, + jc->j_key.ulen, NULL, &jc->j_key.data)) != 0) + goto mem_err; + goto retry; + } + + /* + * If ret == DB_NOTFOUND, we're out of elements of the first + * secondary cursor. This is how we finally finish the join + * if all goes well. + */ + if (ret != 0) + goto err; + + /* + * If jc->j_exhausted[0] == 1, we've just advanced the first cursor, + * and we're going to want to advance all the cursors that point to + * the first member of a duplicate duplicate set (j_fdupcurs[1..N]). + * Close all the cursors in j_fdupcurs; we'll reopen them the + * first time through the upcoming loop. + */ + for (i = 1; i < jc->j_ncurs; i++) { + if (jc->j_fdupcurs[i] != NULL && + (ret = jc->j_fdupcurs[i]->c_close(jc->j_fdupcurs[i])) != 0) + goto err; + jc->j_fdupcurs[i] = NULL; + } + + /* + * If jc->j_curslist[1] == NULL, we have only one cursor in the join. + * Thus, we can safely increment that one cursor on each call + * to __db_join_get, and we signal this by setting jc->j_exhausted[0] + * right away. 
+ * + * Otherwise, reset jc->j_exhausted[0] to 0, so that we don't + * increment it until we know we're ready to. + */ + if (jc->j_curslist[1] == NULL) + jc->j_exhausted[0] = 1; + else + jc->j_exhausted[0] = 0; + + /* We have the first element; now look for it in the other cursors. */ + for (i = 1; i < jc->j_ncurs; i++) { + DB_ASSERT(jc->j_curslist[i] != NULL); + if (jc->j_workcurs[i] == NULL) + /* If this is NULL, we need to dup curslist into it. */ + if ((ret = jc->j_curslist[i]->c_dup( + jc->j_curslist[i], jc->j_workcurs + i, + DB_POSITIONI)) != 0) + goto err; + +retry2: cp = jc->j_workcurs[i]; + + if ((ret = __db_join_getnext(cp, &jc->j_key, key_n, + jc->j_exhausted[i])) == DB_NOTFOUND) { + /* + * jc->j_workcurs[i] has no more of the datum we're + * interested in. Go back one cursor and get + * a new dup. We can't just move to a new + * element of the outer relation, because that way + * we might miss duplicate duplicates in cursor i-1. + * + * If this takes us back to the first cursor, + * -then- we can move to a new element of the outer + * relation. + */ + --i; + jc->j_exhausted[i] = 1; + + if (i == 0) { + for (j = 1; jc->j_workcurs[j] != NULL; j++) { + /* + * We're moving to a new element of + * the first secondary cursor. If + * that cursor is sorted, then any + * other sorted cursors can be safely + * reset to the first duplicate + * duplicate in the current set if we + * have a pointer to it (we can't just + * leave them be, or we'll miss + * duplicate duplicates in the outer + * relation). + * + * If the first cursor is unsorted, or + * if cursor j is unsorted, we can + * make no assumptions about what + * we're looking for next or where it + * will be, so we reset to the very + * beginning (setting workcurs NULL + * will achieve this next go-round). + * + * XXX: This is likely to break + * horribly if any two cursors are + * both sorted, but have different + * specified sort functions. 
For + * now, we dismiss this as pathology + * and let strange things happen--we + * can't make rope childproof. + */ + if ((ret = jc->j_workcurs[j]->c_close( + jc->j_workcurs[j])) != 0) + goto err; + if (!SORTED_SET(jc, 0) || + !SORTED_SET(jc, j) || + jc->j_fdupcurs[j] == NULL) + /* + * Unsafe conditions; + * reset fully. + */ + jc->j_workcurs[j] = NULL; + else + /* Partial reset suffices. */ + if ((jc->j_fdupcurs[j]->c_dup( + jc->j_fdupcurs[j], + &jc->j_workcurs[j], + DB_POSITIONI)) != 0) + goto err; + jc->j_exhausted[j] = 0; + } + goto retry; + /* NOTREACHED */ + } + + /* + * We're about to advance the cursor and need to + * reset all of the workcurs[j] where j>i, so that + * we don't miss any duplicate duplicates. + */ + for (j = i + 1; + jc->j_workcurs[j] != NULL; + j++) { + if ((ret = jc->j_workcurs[j]->c_close( + jc->j_workcurs[j])) != 0) + goto err; + jc->j_exhausted[j] = 0; + if (jc->j_fdupcurs[j] != NULL && + (ret = jc->j_fdupcurs[j]->c_dup( + jc->j_fdupcurs[j], &jc->j_workcurs[j], + DB_POSITIONI)) != 0) + goto err; + else + jc->j_workcurs[j] = NULL; + } + goto retry2; + /* NOTREACHED */ + } + + if (ret == ENOMEM) { + jc->j_key.ulen <<= 1; + if ((ret = __os_realloc(dbp->dbenv, jc->j_key.ulen, + NULL, &jc->j_key.data)) != 0) { +mem_err: __db_err(dbp->dbenv, + "Allocation failed for join key, len = %lu", + (u_long)jc->j_key.ulen); + goto err; + } + goto retry2; + } + + if (ret != 0) + goto err; + + /* + * If we made it this far, we've found a matching + * datum in cursor i. Mark the current cursor + * unexhausted, so we don't miss any duplicate + * duplicates the next go-round--unless this is the + * very last cursor, in which case there are none to + * miss, and we'll need that exhausted flag to finally + * get a DB_NOTFOUND and move on to the next datum in + * the outermost cursor.
+ */ + if (i + 1 != jc->j_ncurs) + jc->j_exhausted[i] = 0; + else + jc->j_exhausted[i] = 1; + + /* + * If jc->j_fdupcurs[i] is NULL and the ith cursor's dups are + * sorted, then we're here for the first time since advancing + * cursor 0, and we have a new datum of interest. + * jc->j_workcurs[i] points to the beginning of a set of + * duplicate duplicates; store this into jc->j_fdupcurs[i]. + */ + if (SORTED_SET(jc, i) && jc->j_fdupcurs[i] == NULL && (ret = + cp->c_dup(cp, &jc->j_fdupcurs[i], DB_POSITIONI)) != 0) + goto err; + + } + +err: if (ret != 0) + return (ret); + + if (0) { +samekey: /* + * Get the key we tried and failed to return last time; + * it should be the current datum of all the secondary cursors. + */ + if ((ret = jc->j_workcurs[0]->c_get(jc->j_workcurs[0], + &jc->j_key, key_n, DB_CURRENT)) != 0) + return (ret); + F_CLR(jc, JOIN_RETRY); + } + + /* + * ret == 0; we have a key to return. + * + * If DB_DBT_USERMEM or DB_DBT_MALLOC is set, we need to + * copy it back into the dbt we were given for the key; + * call __db_retcopy. + * + * Otherwise, assert that we do not in fact need to copy anything + * and simply proceed. + */ + if (F_ISSET(key_arg, DB_DBT_USERMEM) || + F_ISSET(key_arg, DB_DBT_MALLOC)) { + /* + * We need to copy the key back into our original + * datum. Do so. + */ + if ((ret = __db_retcopy(dbp, + key_arg, key_n->data, key_n->size, NULL, NULL)) != 0) { + /* + * The retcopy failed, most commonly because we + * have a user buffer for the key which is too small. + * Set things up to retry next time, and return. + */ + F_SET(jc, JOIN_RETRY); + return (ret); + } + } else + DB_ASSERT(key_n == key_arg); + + /* + * If DB_JOIN_ITEM is + * set, we return it; otherwise we do the lookup in the + * primary and then return. + * + * Note that we use key_arg here; it is safe (and appropriate) + * to do so. 
+ */ + if (operation == DB_JOIN_ITEM) + return (0); + + if ((ret = jc->j_primary->get(jc->j_primary, + jc->j_curslist[0]->txn, key_arg, data_arg, 0)) != 0) + /* + * The get on the primary failed, most commonly because we're + * using a user buffer that's not big enough. Flag our + * failure so we can return the same key next time. + */ + F_SET(jc, JOIN_RETRY); + + return (ret); +} + +static int +__db_join_close(dbc) + DBC *dbc; +{ + DB *dbp; + JOIN_CURSOR *jc; + int ret, t_ret; + u_int32_t i; + + jc = (JOIN_CURSOR *)dbc->internal; + dbp = dbc->dbp; + ret = t_ret = 0; + + /* + * Remove from active list of join cursors. Note that this + * must happen before any action that can fail and return, or else + * __db_close may loop indefinitely. + */ + MUTEX_THREAD_LOCK(dbp->dbenv, dbp->mutexp); + TAILQ_REMOVE(&dbp->join_queue, dbc, links); + MUTEX_THREAD_UNLOCK(dbp->dbenv, dbp->mutexp); + + PANIC_CHECK(dbc->dbp->dbenv); + + /* + * Close any open scratch cursors. In each case, there may + * not be as many outstanding as there are cursors in + * curslist, but we want to close whatever's there. + * + * If any close fails, there's no reason not to close everything else; + * we'll just return the error code of the last one to fail. There's + * not much the caller can do anyway, since these cursors only exist + * hanging off a db-internal data structure that they shouldn't be + * mucking with. 
+ */ + for (i = 0; i < jc->j_ncurs; i++) { + if (jc->j_workcurs[i] != NULL && (t_ret = + jc->j_workcurs[i]->c_close(jc->j_workcurs[i])) != 0) + ret = t_ret; + if (jc->j_fdupcurs[i] != NULL && (t_ret = + jc->j_fdupcurs[i]->c_close(jc->j_fdupcurs[i])) != 0) + ret = t_ret; + } + + __os_free(jc->j_exhausted, 0); + __os_free(jc->j_curslist, 0); + __os_free(jc->j_workcurs, 0); + __os_free(jc->j_fdupcurs, 0); + __os_free(jc->j_key.data, jc->j_key.ulen); + __os_free(jc, sizeof(JOIN_CURSOR)); + __os_free(dbc, sizeof(DBC)); + + return (ret); +} + +/* + * __db_join_getnext -- + * This function replaces the DBC_CONTINUE and DBC_KEYSET + * functionality inside the various cursor get routines. + * + * If exhausted == 0, we're not done with the current datum; + * return it if it matches "matching", otherwise search + * using DB_GET_BOTHC (which is faster than iteratively doing + * DB_NEXT_DUP) forward until we find one that does. + * + * If exhausted == 1, we are done with the current datum, so just + * leap forward to searching NEXT_DUPs. + * + * If no matching datum exists, returns DB_NOTFOUND, else 0. + */ +static int +__db_join_getnext(dbc, key, data, exhausted) + DBC *dbc; + DBT *key, *data; + u_int32_t exhausted; +{ + int ret, cmp; + DB *dbp; + DBT ldata; + int (*func) __P((DB *, const DBT *, const DBT *)); + + dbp = dbc->dbp; + func = (dbp->dup_compare == NULL) ? __bam_defcmp : dbp->dup_compare; + + switch (exhausted) { + case 0: + memset(&ldata, 0, sizeof(DBT)); + /* We don't want to step on data->data; malloc. */ + F_SET(&ldata, DB_DBT_MALLOC); + if ((ret = dbc->c_get(dbc, key, &ldata, DB_CURRENT)) != 0) + break; + cmp = func(dbp, data, &ldata); + if (cmp == 0) { + /* + * We have to return the real data value. Copy + * it into data, then free the buffer we malloc'ed + * above. 
+ */ + if ((ret = __db_retcopy(dbp, data, ldata.data, + ldata.size, &data->data, &data->size)) != 0) + return (ret); + __os_free(ldata.data, 0); + return (0); + } + + /* + * Didn't match--we want to fall through and search future + * dups. We just forget about ldata and free + * its buffer--data contains the value we're searching for. + */ + __os_free(ldata.data, 0); + /* FALLTHROUGH */ + case 1: + ret = dbc->c_get(dbc, key, data, DB_GET_BOTHC); + break; + default: + ret = EINVAL; + break; + } + + return (ret); +} + +/* + * __db_join_cmp -- + * Comparison function for sorting DBCs in cardinality order. + */ + +static int +__db_join_cmp(a, b) + const void *a, *b; +{ + DBC *dbca, *dbcb; + db_recno_t counta, countb; + + /* In case c_count fails, pretend cursors are equal. */ + counta = countb = 0; + + dbca = *((DBC * const *)a); + dbcb = *((DBC * const *)b); + + if (dbca->c_count(dbca, &counta, 0) != 0 || + dbcb->c_count(dbcb, &countb, 0) != 0) + return (0); + + return (counta - countb); +} diff --git a/bdb/db/db_meta.c b/bdb/db/db_meta.c new file mode 100644 index 00000000000..5b57c369454 --- /dev/null +++ b/bdb/db/db_meta.c @@ -0,0 +1,309 @@ +/*- + * See the file LICENSE for redistribution information. + * + * Copyright (c) 1996, 1997, 1998, 1999, 2000 + * Sleepycat Software. All rights reserved. + */ +/* + * Copyright (c) 1990, 1993, 1994, 1995, 1996 + * Keith Bostic. All rights reserved. + */ +/* + * Copyright (c) 1990, 1993, 1994, 1995 + * The Regents of the University of California. All rights reserved. + * + * This code is derived from software contributed to Berkeley by + * Mike Olson. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. 
Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 3. Neither the name of the University nor the names of its contributors + * may be used to endorse or promote products derived from this software + * without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + */ + +#include "db_config.h" + +#ifndef lint +static const char revid[] = "$Id: db_meta.c,v 11.26 2001/01/16 21:57:19 ubell Exp $"; +#endif /* not lint */ + +#ifndef NO_SYSTEM_INCLUDES +#include <sys/types.h> + +#include <string.h> +#endif + +#include "db_int.h" +#include "db_page.h" +#include "db_shash.h" +#include "lock.h" +#include "txn.h" +#include "db_am.h" +#include "btree.h" + +/* + * __db_new -- + * Get a new page, preferably from the freelist. 
+ * + * PUBLIC: int __db_new __P((DBC *, u_int32_t, PAGE **)); + */ +int +__db_new(dbc, type, pagepp) + DBC *dbc; + u_int32_t type; + PAGE **pagepp; +{ + DBMETA *meta; + DB *dbp; + DB_LOCK metalock; + PAGE *h; + db_pgno_t pgno; + int ret; + + dbp = dbc->dbp; + meta = NULL; + h = NULL; + + pgno = PGNO_BASE_MD; + if ((ret = __db_lget(dbc, + LCK_ALWAYS, pgno, DB_LOCK_WRITE, 0, &metalock)) != 0) + goto err; + if ((ret = memp_fget(dbp->mpf, &pgno, 0, (PAGE **)&meta)) != 0) + goto err; + + if (meta->free == PGNO_INVALID) { + if ((ret = memp_fget(dbp->mpf, &pgno, DB_MPOOL_NEW, &h)) != 0) + goto err; + ZERO_LSN(h->lsn); + h->pgno = pgno; + } else { + pgno = meta->free; + if ((ret = memp_fget(dbp->mpf, &pgno, 0, &h)) != 0) + goto err; + meta->free = h->next_pgno; + (void)memp_fset(dbp->mpf, (PAGE *)meta, DB_MPOOL_DIRTY); + } + + DB_ASSERT(TYPE(h) == P_INVALID); + + if (TYPE(h) != P_INVALID) + return (__db_panic(dbp->dbenv, EINVAL)); + + /* Log the change. */ + if (DB_LOGGING(dbc)) { + if ((ret = __db_pg_alloc_log(dbp->dbenv, + dbc->txn, &LSN(meta), 0, dbp->log_fileid, + &LSN(meta), &h->lsn, h->pgno, + (u_int32_t)type, meta->free)) != 0) + goto err; + LSN(h) = LSN(meta); + } + + (void)memp_fput(dbp->mpf, (PAGE *)meta, DB_MPOOL_DIRTY); + (void)__TLPUT(dbc, metalock); + + P_INIT(h, dbp->pgsize, h->pgno, PGNO_INVALID, PGNO_INVALID, 0, type); + *pagepp = h; + return (0); + +err: if (h != NULL) + (void)memp_fput(dbp->mpf, h, 0); + if (meta != NULL) + (void)memp_fput(dbp->mpf, meta, 0); + (void)__TLPUT(dbc, metalock); + return (ret); +} + +/* + * __db_free -- + * Add a page to the head of the freelist. + * + * PUBLIC: int __db_free __P((DBC *, PAGE *)); + */ +int +__db_free(dbc, h) + DBC *dbc; + PAGE *h; +{ + DBMETA *meta; + DB *dbp; + DBT ldbt; + DB_LOCK metalock; + db_pgno_t pgno; + u_int32_t dirty_flag; + int ret, t_ret; + + dbp = dbc->dbp; + + /* + * Retrieve the metadata page and insert the page at the head of + * the free list. 
If either the lock get or page get routines + * fail, then we need to put the page with which we were called + * back because our caller assumes we take care of it. + */ + dirty_flag = 0; + pgno = PGNO_BASE_MD; + if ((ret = __db_lget(dbc, + LCK_ALWAYS, pgno, DB_LOCK_WRITE, 0, &metalock)) != 0) + goto err; + if ((ret = memp_fget(dbp->mpf, &pgno, 0, (PAGE **)&meta)) != 0) { + (void)__TLPUT(dbc, metalock); + goto err; + } + + DB_ASSERT(h->pgno != meta->free); + /* Log the change. */ + if (DB_LOGGING(dbc)) { + memset(&ldbt, 0, sizeof(ldbt)); + ldbt.data = h; + ldbt.size = P_OVERHEAD; + if ((ret = __db_pg_free_log(dbp->dbenv, + dbc->txn, &LSN(meta), 0, dbp->log_fileid, h->pgno, + &LSN(meta), &ldbt, meta->free)) != 0) { + (void)memp_fput(dbp->mpf, (PAGE *)meta, 0); + (void)__TLPUT(dbc, metalock); + return (ret); + } + LSN(h) = LSN(meta); + } + + P_INIT(h, dbp->pgsize, h->pgno, PGNO_INVALID, meta->free, 0, P_INVALID); + + meta->free = h->pgno; + + /* Discard the metadata page. */ + if ((t_ret = memp_fput(dbp->mpf, + (PAGE *)meta, DB_MPOOL_DIRTY)) != 0 && ret == 0) + ret = t_ret; + if ((t_ret = __TLPUT(dbc, metalock)) != 0 && ret == 0) + ret = t_ret; + + /* Discard the caller's page reference. */ + dirty_flag = DB_MPOOL_DIRTY; +err: if ((t_ret = memp_fput(dbp->mpf, h, dirty_flag)) != 0 && ret == 0) + ret = t_ret; + + /* + * XXX + * We have to unlock the caller's page in the caller! + */ + return (ret); +} + +#ifdef DEBUG +/* + * __db_lprint -- + * Print out the list of locks currently held by a cursor. + * + * PUBLIC: int __db_lprint __P((DBC *)); + */ +int +__db_lprint(dbc) + DBC *dbc; +{ + DB *dbp; + DB_LOCKREQ req; + + dbp = dbc->dbp; + + if (LOCKING_ON(dbp->dbenv)) { + req.op = DB_LOCK_DUMP; + lock_vec(dbp->dbenv, dbc->locker, 0, &req, 1, NULL); + } + return (0); +} +#endif + +/* + * __db_lget -- + * The standard lock get call. 
+ * + * PUBLIC: int __db_lget __P((DBC *, + * PUBLIC: int, db_pgno_t, db_lockmode_t, int, DB_LOCK *)); + */ +int +__db_lget(dbc, flags, pgno, mode, lkflags, lockp) + DBC *dbc; + int flags, lkflags; + db_pgno_t pgno; + db_lockmode_t mode; + DB_LOCK *lockp; +{ + DB *dbp; + DB_ENV *dbenv; + DB_LOCKREQ couple[2], *reqp; + int ret; + + dbp = dbc->dbp; + dbenv = dbp->dbenv; + + /* + * We do not always check if we're configured for locking before + * calling __db_lget to acquire the lock. + */ + if (CDB_LOCKING(dbenv) + || !LOCKING_ON(dbenv) || F_ISSET(dbc, DBC_COMPENSATE) + || (!LF_ISSET(LCK_ROLLBACK) && F_ISSET(dbc, DBC_RECOVER)) + || (!LF_ISSET(LCK_ALWAYS) && F_ISSET(dbc, DBC_OPD))) { + lockp->off = LOCK_INVALID; + return (0); + } + + dbc->lock.pgno = pgno; + if (lkflags & DB_LOCK_RECORD) + dbc->lock.type = DB_RECORD_LOCK; + else + dbc->lock.type = DB_PAGE_LOCK; + lkflags &= ~DB_LOCK_RECORD; + + /* + * If the transaction enclosing this cursor has DB_LOCK_NOWAIT set, + * pass that along to the lock call. + */ + if (DB_NONBLOCK(dbc)) + lkflags |= DB_LOCK_NOWAIT; + + /* + * If the object is not currently locked, acquire the lock and return, + * otherwise, lock couple. + */ + if (LF_ISSET(LCK_COUPLE)) { + couple[0].op = DB_LOCK_GET; + couple[0].obj = &dbc->lock_dbt; + couple[0].mode = mode; + couple[1].op = DB_LOCK_PUT; + couple[1].lock = *lockp; + + ret = lock_vec(dbenv, + dbc->locker, lkflags, couple, 2, &reqp); + if (ret == 0 || reqp == &couple[1]) + *lockp = couple[0].lock; + } else { + ret = lock_get(dbenv, + dbc->locker, lkflags, &dbc->lock_dbt, mode, lockp); + + if (ret != 0) + lockp->off = LOCK_INVALID; + } + + return (ret); +} diff --git a/bdb/db/db_method.c b/bdb/db/db_method.c new file mode 100644 index 00000000000..01568a6e144 --- /dev/null +++ b/bdb/db/db_method.c @@ -0,0 +1,629 @@ +/*- + * See the file LICENSE for redistribution information. + * + * Copyright (c) 1999, 2000 + * Sleepycat Software. All rights reserved.
+ */ + +#include "db_config.h" + +#ifndef lint +static const char revid[] = "$Id: db_method.c,v 11.36 2000/12/21 09:17:04 krinsky Exp $"; +#endif /* not lint */ + +#ifndef NO_SYSTEM_INCLUDES +#include <sys/types.h> + +#ifdef HAVE_RPC +#include <rpc/rpc.h> +#endif + +#include <string.h> +#endif + +#ifdef HAVE_RPC +#include "db_server.h" +#endif + +#include "db_int.h" +#include "db_page.h" +#include "db_am.h" +#include "btree.h" +#include "hash.h" +#include "qam.h" +#include "xa.h" +#include "xa_ext.h" + +#ifdef HAVE_RPC +#include "gen_client_ext.h" +#include "rpc_client_ext.h" +#endif + +static int __db_get_byteswapped __P((DB *)); +static DBTYPE + __db_get_type __P((DB *)); +static int __db_init __P((DB *, u_int32_t)); +static int __db_key_range + __P((DB *, DB_TXN *, DBT *, DB_KEY_RANGE *, u_int32_t)); +static int __db_set_append_recno __P((DB *, int (*)(DB *, DBT *, db_recno_t))); +static int __db_set_cachesize __P((DB *, u_int32_t, u_int32_t, int)); +static int __db_set_dup_compare + __P((DB *, int (*)(DB *, const DBT *, const DBT *))); +static void __db_set_errcall __P((DB *, void (*)(const char *, char *))); +static void __db_set_errfile __P((DB *, FILE *)); +static int __db_set_feedback __P((DB *, void (*)(DB *, int, int))); +static int __db_set_flags __P((DB *, u_int32_t)); +static int __db_set_lorder __P((DB *, int)); +static int __db_set_malloc __P((DB *, void *(*)(size_t))); +static int __db_set_pagesize __P((DB *, u_int32_t)); +static int __db_set_realloc __P((DB *, void *(*)(void *, size_t))); +static void __db_set_errpfx __P((DB *, const char *)); +static int __db_set_paniccall __P((DB *, void (*)(DB_ENV *, int))); +static void __dbh_err __P((DB *, int, const char *, ...)); +static void __dbh_errx __P((DB *, const char *, ...)); + +/* + * db_create -- + * DB constructor. + */ +int +db_create(dbpp, dbenv, flags) + DB **dbpp; + DB_ENV *dbenv; + u_int32_t flags; +{ + DB *dbp; + int ret; + + /* Check for invalid function flags. 
*/ + switch (flags) { + case 0: + break; + case DB_XA_CREATE: + if (dbenv != NULL) { + __db_err(dbenv, + "XA applications may not specify an environment to db_create"); + return (EINVAL); + } + + /* + * If it's an XA database, open it within the XA environment, + * taken from the global list of environments. (When the XA + * transaction manager called our xa_start() routine the + * "current" environment was moved to the start of the list.) + */ + dbenv = TAILQ_FIRST(&DB_GLOBAL(db_envq)); + break; + default: + return (__db_ferr(dbenv, "db_create", 0)); + } + + /* Allocate the DB. */ + if ((ret = __os_calloc(dbenv, 1, sizeof(*dbp), &dbp)) != 0) + return (ret); +#ifdef HAVE_RPC + if (dbenv != NULL && dbenv->cl_handle != NULL) + ret = __dbcl_init(dbp, dbenv, flags); + else +#endif + ret = __db_init(dbp, flags); + if (ret != 0) { + __os_free(dbp, sizeof(*dbp)); + return (ret); + } + + /* If we don't have an environment yet, allocate a local one. */ + if (dbenv == NULL) { + if ((ret = db_env_create(&dbenv, 0)) != 0) { + __os_free(dbp, sizeof(*dbp)); + return (ret); + } + dbenv->dblocal_ref = 0; + F_SET(dbenv, DB_ENV_DBLOCAL); + } + if (F_ISSET(dbenv, DB_ENV_DBLOCAL)) + ++dbenv->dblocal_ref; + + dbp->dbenv = dbenv; + + *dbpp = dbp; + return (0); +} + +/* + * __db_init -- + * Initialize a DB structure. + */ +static int +__db_init(dbp, flags) + DB *dbp; + u_int32_t flags; +{ + int ret; + + dbp->log_fileid = DB_LOGFILEID_INVALID; + + TAILQ_INIT(&dbp->free_queue); + TAILQ_INIT(&dbp->active_queue); + TAILQ_INIT(&dbp->join_queue); + + FLD_SET(dbp->am_ok, + DB_OK_BTREE | DB_OK_HASH | DB_OK_QUEUE | DB_OK_RECNO); + + dbp->close = __db_close; + dbp->cursor = __db_cursor; + dbp->del = NULL; /* !!! Must be set by access method.
*/ + dbp->err = __dbh_err; + dbp->errx = __dbh_errx; + dbp->fd = __db_fd; + dbp->get = __db_get; + dbp->get_byteswapped = __db_get_byteswapped; + dbp->get_type = __db_get_type; + dbp->join = __db_join; + dbp->key_range = __db_key_range; + dbp->open = __db_open; + dbp->put = __db_put; + dbp->remove = __db_remove; + dbp->rename = __db_rename; + dbp->set_append_recno = __db_set_append_recno; + dbp->set_cachesize = __db_set_cachesize; + dbp->set_dup_compare = __db_set_dup_compare; + dbp->set_errcall = __db_set_errcall; + dbp->set_errfile = __db_set_errfile; + dbp->set_errpfx = __db_set_errpfx; + dbp->set_feedback = __db_set_feedback; + dbp->set_flags = __db_set_flags; + dbp->set_lorder = __db_set_lorder; + dbp->set_malloc = __db_set_malloc; + dbp->set_pagesize = __db_set_pagesize; + dbp->set_paniccall = __db_set_paniccall; + dbp->set_realloc = __db_set_realloc; + dbp->stat = NULL; /* !!! Must be set by access method. */ + dbp->sync = __db_sync; + dbp->upgrade = __db_upgrade; + dbp->verify = __db_verify; + /* Access method specific. */ + if ((ret = __bam_db_create(dbp)) != 0) + return (ret); + if ((ret = __ham_db_create(dbp)) != 0) + return (ret); + if ((ret = __qam_db_create(dbp)) != 0) + return (ret); + + /* + * XA specific: must be last, as we replace methods set by the + * access methods. + */ + if (LF_ISSET(DB_XA_CREATE) && (ret = __db_xa_create(dbp)) != 0) + return (ret); + + return (0); +} + +/* + * __dbh_am_chk -- + * Error if an unreasonable method is called. + * + * PUBLIC: int __dbh_am_chk __P((DB *, u_int32_t)); + */ +int +__dbh_am_chk(dbp, flags) + DB *dbp; + u_int32_t flags; +{ + /* + * We start out allowing any access methods to be called, and as the + * application calls the methods the options become restricted. The + * idea is to quit as soon as an illegal method combination is called. 
+ */ + if ((LF_ISSET(DB_OK_BTREE) && FLD_ISSET(dbp->am_ok, DB_OK_BTREE)) || + (LF_ISSET(DB_OK_HASH) && FLD_ISSET(dbp->am_ok, DB_OK_HASH)) || + (LF_ISSET(DB_OK_QUEUE) && FLD_ISSET(dbp->am_ok, DB_OK_QUEUE)) || + (LF_ISSET(DB_OK_RECNO) && FLD_ISSET(dbp->am_ok, DB_OK_RECNO))) { + FLD_CLR(dbp->am_ok, ~flags); + return (0); + } + + __db_err(dbp->dbenv, + "call implies an access method which is inconsistent with previous calls"); + return (EINVAL); +} + +/* + * __dbh_err -- + * Error message, including the standard error string. + */ +static void +#ifdef __STDC__ +__dbh_err(DB *dbp, int error, const char *fmt, ...) +#else +__dbh_err(dbp, error, fmt, va_alist) + DB *dbp; + int error; + const char *fmt; + va_dcl +#endif +{ + va_list ap; + +#ifdef __STDC__ + va_start(ap, fmt); +#else + va_start(ap); +#endif + __db_real_err(dbp->dbenv, error, 1, 1, fmt, ap); + + va_end(ap); +} + +/* + * __dbh_errx -- + * Error message. + */ +static void +#ifdef __STDC__ +__dbh_errx(DB *dbp, const char *fmt, ...) +#else +__dbh_errx(dbp, fmt, va_alist) + DB *dbp; + const char *fmt; + va_dcl +#endif +{ + va_list ap; + +#ifdef __STDC__ + va_start(ap, fmt); +#else + va_start(ap); +#endif + __db_real_err(dbp->dbenv, 0, 0, 1, fmt, ap); + + va_end(ap); +} + +/* + * __db_get_byteswapped -- + * Return if database requires byte swapping. + */ +static int +__db_get_byteswapped(dbp) + DB *dbp; +{ + DB_ILLEGAL_BEFORE_OPEN(dbp, "get_byteswapped"); + + return (F_ISSET(dbp, DB_AM_SWAP) ? 1 : 0); +} + +/* + * __db_get_type -- + * Return type of underlying database. + */ +static DBTYPE +__db_get_type(dbp) + DB *dbp; +{ + DB_ILLEGAL_BEFORE_OPEN(dbp, "get_type"); + + return (dbp->type); +} + +/* + * __db_key_range -- + * Return proportion of keys above and below given key. 
+ */ +static int +__db_key_range(dbp, txn, key, kr, flags) + DB *dbp; + DB_TXN *txn; + DBT *key; + DB_KEY_RANGE *kr; + u_int32_t flags; +{ + COMPQUIET(txn, NULL); + COMPQUIET(key, NULL); + COMPQUIET(kr, NULL); + COMPQUIET(flags, 0); + + DB_ILLEGAL_BEFORE_OPEN(dbp, "key_range"); + DB_ILLEGAL_METHOD(dbp, DB_OK_BTREE); + + return (EINVAL); +} + +/* + * __db_set_append_recno -- + * Set record number append routine. + */ +static int +__db_set_append_recno(dbp, func) + DB *dbp; + int (*func) __P((DB *, DBT *, db_recno_t)); +{ + DB_ILLEGAL_AFTER_OPEN(dbp, "set_append_recno"); + DB_ILLEGAL_METHOD(dbp, DB_OK_QUEUE | DB_OK_RECNO); + + dbp->db_append_recno = func; + + return (0); +} + +/* + * __db_set_cachesize -- + * Set underlying cache size. + */ +static int +__db_set_cachesize(dbp, cache_gbytes, cache_bytes, ncache) + DB *dbp; + u_int32_t cache_gbytes, cache_bytes; + int ncache; +{ + DB_ILLEGAL_IN_ENV(dbp, "set_cachesize"); + DB_ILLEGAL_AFTER_OPEN(dbp, "set_cachesize"); + + return (dbp->dbenv->set_cachesize( + dbp->dbenv, cache_gbytes, cache_bytes, ncache)); +} + +/* + * __db_set_dup_compare -- + * Set duplicate comparison routine. 
+ */ +static int +__db_set_dup_compare(dbp, func) + DB *dbp; + int (*func) __P((DB *, const DBT *, const DBT *)); +{ + DB_ILLEGAL_AFTER_OPEN(dbp, "dup_compare"); + DB_ILLEGAL_METHOD(dbp, DB_OK_BTREE | DB_OK_HASH); + + dbp->dup_compare = func; + + return (0); +} + +static void +__db_set_errcall(dbp, errcall) + DB *dbp; + void (*errcall) __P((const char *, char *)); +{ + dbp->dbenv->set_errcall(dbp->dbenv, errcall); +} + +static void +__db_set_errfile(dbp, errfile) + DB *dbp; + FILE *errfile; +{ + dbp->dbenv->set_errfile(dbp->dbenv, errfile); +} + +static void +__db_set_errpfx(dbp, errpfx) + DB *dbp; + const char *errpfx; +{ + dbp->dbenv->set_errpfx(dbp->dbenv, errpfx); +} + +static int +__db_set_feedback(dbp, feedback) + DB *dbp; + void (*feedback) __P((DB *, int, int)); +{ + dbp->db_feedback = feedback; + return (0); +} + +static int +__db_set_flags(dbp, flags) + DB *dbp; + u_int32_t flags; +{ + int ret; + + /* + * !!! + * The hash access method only takes two flags: DB_DUP and DB_DUPSORT. + * The Btree access method uses them for the same purposes, and so we + * resolve them there. + * + * The queue access method takes no flags. + */ + if ((ret = __bam_set_flags(dbp, &flags)) != 0) + return (ret); + if ((ret = __ram_set_flags(dbp, &flags)) != 0) + return (ret); + + return (flags == 0 ? 0 : __db_ferr(dbp->dbenv, "DB->set_flags", 0)); +} + +static int +__db_set_lorder(dbp, db_lorder) + DB *dbp; + int db_lorder; +{ + int ret; + + DB_ILLEGAL_AFTER_OPEN(dbp, "set_lorder"); + + /* Flag if the specified byte order requires swapping. 
*/ + switch (ret = __db_byteorder(dbp->dbenv, db_lorder)) { + case 0: + F_CLR(dbp, DB_AM_SWAP); + break; + case DB_SWAPBYTES: + F_SET(dbp, DB_AM_SWAP); + break; + default: + return (ret); + /* NOTREACHED */ + } + return (0); +} + +static int +__db_set_malloc(dbp, func) + DB *dbp; + void *(*func) __P((size_t)); +{ + DB_ILLEGAL_AFTER_OPEN(dbp, "set_malloc"); + + dbp->db_malloc = func; + return (0); +} + +static int +__db_set_pagesize(dbp, db_pagesize) + DB *dbp; + u_int32_t db_pagesize; +{ + DB_ILLEGAL_AFTER_OPEN(dbp, "set_pagesize"); + + if (db_pagesize < DB_MIN_PGSIZE) { + __db_err(dbp->dbenv, "page sizes may not be smaller than %lu", + (u_long)DB_MIN_PGSIZE); + return (EINVAL); + } + if (db_pagesize > DB_MAX_PGSIZE) { + __db_err(dbp->dbenv, "page sizes may not be larger than %lu", + (u_long)DB_MAX_PGSIZE); + return (EINVAL); + } + + /* + * We don't want anything that's not a power-of-2, as we rely on that + * for alignment of various types on the pages. + */ + if ((u_int32_t)1 << __db_log2(db_pagesize) != db_pagesize) { + __db_err(dbp->dbenv, "page sizes must be a power-of-2"); + return (EINVAL); + } + + /* + * XXX + * Should we be checking for a page size that's not a multiple of 512, + * so that we never try and write less than a disk sector? + */ + dbp->pgsize = db_pagesize; + + return (0); +} + +static int +__db_set_realloc(dbp, func) + DB *dbp; + void *(*func) __P((void *, size_t)); +{ + DB_ILLEGAL_AFTER_OPEN(dbp, "set_realloc"); + + dbp->db_realloc = func; + return (0); +} + +static int +__db_set_paniccall(dbp, paniccall) + DB *dbp; + void (*paniccall) __P((DB_ENV *, int)); +{ + return (dbp->dbenv->set_paniccall(dbp->dbenv, paniccall)); +} + +#ifdef HAVE_RPC +/* + * __dbcl_init -- + * Initialize a DB structure on the server. 
+ * + * PUBLIC: #ifdef HAVE_RPC + * PUBLIC: int __dbcl_init __P((DB *, DB_ENV *, u_int32_t)); + * PUBLIC: #endif + */ +int +__dbcl_init(dbp, dbenv, flags) + DB *dbp; + DB_ENV *dbenv; + u_int32_t flags; +{ + CLIENT *cl; + __db_create_reply *replyp; + __db_create_msg req; + int ret; + + TAILQ_INIT(&dbp->free_queue); + TAILQ_INIT(&dbp->active_queue); + /* !!! + * Note that we don't need to initialize the join_queue; it's + * not used in RPC clients. See the comment in __dbcl_db_join_ret(). + */ + + dbp->close = __dbcl_db_close; + dbp->cursor = __dbcl_db_cursor; + dbp->del = __dbcl_db_del; + dbp->err = __dbh_err; + dbp->errx = __dbh_errx; + dbp->fd = __dbcl_db_fd; + dbp->get = __dbcl_db_get; + dbp->get_byteswapped = __dbcl_db_swapped; + dbp->get_type = __db_get_type; + dbp->join = __dbcl_db_join; + dbp->key_range = __dbcl_db_key_range; + dbp->open = __dbcl_db_open; + dbp->put = __dbcl_db_put; + dbp->remove = __dbcl_db_remove; + dbp->rename = __dbcl_db_rename; + dbp->set_append_recno = __dbcl_db_set_append_recno; + dbp->set_cachesize = __dbcl_db_cachesize; + dbp->set_dup_compare = NULL; + dbp->set_errcall = __db_set_errcall; + dbp->set_errfile = __db_set_errfile; + dbp->set_errpfx = __db_set_errpfx; + dbp->set_feedback = __dbcl_db_feedback; + dbp->set_flags = __dbcl_db_flags; + dbp->set_lorder = __dbcl_db_lorder; + dbp->set_malloc = __dbcl_db_malloc; + dbp->set_pagesize = __dbcl_db_pagesize; + dbp->set_paniccall = __dbcl_db_panic; + dbp->set_q_extentsize = __dbcl_db_extentsize; + dbp->set_realloc = __dbcl_db_realloc; + dbp->stat = __dbcl_db_stat; + dbp->sync = __dbcl_db_sync; + dbp->upgrade = __dbcl_db_upgrade; + + /* + * Set all the method specific functions to client funcs as well. 
+ */ + dbp->set_bt_compare = __dbcl_db_bt_compare; + dbp->set_bt_maxkey = __dbcl_db_bt_maxkey; + dbp->set_bt_minkey = __dbcl_db_bt_minkey; + dbp->set_bt_prefix = __dbcl_db_bt_prefix; + dbp->set_h_ffactor = __dbcl_db_h_ffactor; + dbp->set_h_hash = __dbcl_db_h_hash; + dbp->set_h_nelem = __dbcl_db_h_nelem; + dbp->set_re_delim = __dbcl_db_re_delim; + dbp->set_re_len = __dbcl_db_re_len; + dbp->set_re_pad = __dbcl_db_re_pad; + dbp->set_re_source = __dbcl_db_re_source; +/* + dbp->set_q_extentsize = __dbcl_db_q_extentsize; +*/ + + cl = (CLIENT *)dbenv->cl_handle; + req.flags = flags; + req.envpcl_id = dbenv->cl_id; + + /* + * CALL THE SERVER + */ + replyp = __db_db_create_1(&req, cl); + if (replyp == NULL) { + __db_err(dbenv, clnt_sperror(cl, "Berkeley DB")); + return (DB_NOSERVER); + } + + if ((ret = replyp->status) != 0) + return (ret); + + dbp->cl_id = replyp->dbpcl_id; + return (0); +} +#endif diff --git a/bdb/db/db_overflow.c b/bdb/db/db_overflow.c new file mode 100644 index 00000000000..54f0a03aafe --- /dev/null +++ b/bdb/db/db_overflow.c @@ -0,0 +1,681 @@ +/*- + * See the file LICENSE for redistribution information. + * + * Copyright (c) 1996, 1997, 1998, 1999, 2000 + * Sleepycat Software. All rights reserved. + */ +/* + * Copyright (c) 1990, 1993, 1994, 1995, 1996 + * Keith Bostic. All rights reserved. + */ +/* + * Copyright (c) 1990, 1993, 1994, 1995 + * The Regents of the University of California. All rights reserved. + * + * This code is derived from software contributed to Berkeley by + * Mike Olson. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. 
Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 3. Neither the name of the University nor the names of its contributors + * may be used to endorse or promote products derived from this software + * without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + */ + +#include "db_config.h" + +#ifndef lint +static const char revid[] = "$Id: db_overflow.c,v 11.21 2000/11/30 00:58:32 ubell Exp $"; +#endif /* not lint */ + +#ifndef NO_SYSTEM_INCLUDES +#include <sys/types.h> + +#include <string.h> +#endif + +#include "db_int.h" +#include "db_page.h" +#include "db_am.h" +#include "db_verify.h" + +/* + * Big key/data code. + * + * Big key and data entries are stored on linked lists of pages. The initial + * reference is a structure with the total length of the item and the page + * number where it begins. Each entry in the linked list contains a pointer + * to the next page of data, and so on. + */ + +/* + * __db_goff -- + * Get an offpage item. 
+ * + * PUBLIC: int __db_goff __P((DB *, DBT *, + * PUBLIC: u_int32_t, db_pgno_t, void **, u_int32_t *)); + */ +int +__db_goff(dbp, dbt, tlen, pgno, bpp, bpsz) + DB *dbp; + DBT *dbt; + u_int32_t tlen; + db_pgno_t pgno; + void **bpp; + u_int32_t *bpsz; +{ + DB_ENV *dbenv; + PAGE *h; + db_indx_t bytes; + u_int32_t curoff, needed, start; + u_int8_t *p, *src; + int ret; + + dbenv = dbp->dbenv; + + /* + * Check if the buffer is big enough; if it is not and we are + * allowed to malloc space, then we'll malloc it. If we are + * not (DB_DBT_USERMEM), then we'll set the dbt and return + * appropriately. + */ + if (F_ISSET(dbt, DB_DBT_PARTIAL)) { + start = dbt->doff; + needed = dbt->dlen; + } else { + start = 0; + needed = tlen; + } + + /* Allocate any necessary memory. */ + if (F_ISSET(dbt, DB_DBT_USERMEM)) { + if (needed > dbt->ulen) { + dbt->size = needed; + return (ENOMEM); + } + } else if (F_ISSET(dbt, DB_DBT_MALLOC)) { + if ((ret = __os_malloc(dbenv, + needed, dbp->db_malloc, &dbt->data)) != 0) + return (ret); + } else if (F_ISSET(dbt, DB_DBT_REALLOC)) { + if ((ret = __os_realloc(dbenv, + needed, dbp->db_realloc, &dbt->data)) != 0) + return (ret); + } else if (*bpsz == 0 || *bpsz < needed) { + if ((ret = __os_realloc(dbenv, needed, NULL, bpp)) != 0) + return (ret); + *bpsz = needed; + dbt->data = *bpp; + } else + dbt->data = *bpp; + + /* + * Step through the linked list of pages, copying the data on each + * one into the buffer. Never copy more than the total data length. + */ + dbt->size = needed; + for (curoff = 0, p = dbt->data; pgno != PGNO_INVALID && needed > 0;) { + if ((ret = memp_fget(dbp->mpf, &pgno, 0, &h)) != 0) { + (void)__db_pgerr(dbp, pgno); + return (ret); + } + /* Check if we need any bytes from this page. 
*/ + if (curoff + OV_LEN(h) >= start) { + src = (u_int8_t *)h + P_OVERHEAD; + bytes = OV_LEN(h); + if (start > curoff) { + src += start - curoff; + bytes -= start - curoff; + } + if (bytes > needed) + bytes = needed; + memcpy(p, src, bytes); + p += bytes; + needed -= bytes; + } + curoff += OV_LEN(h); + pgno = h->next_pgno; + memp_fput(dbp->mpf, h, 0); + } + return (0); +} + +/* + * __db_poff -- + * Put an offpage item. + * + * PUBLIC: int __db_poff __P((DBC *, const DBT *, db_pgno_t *)); + */ +int +__db_poff(dbc, dbt, pgnop) + DBC *dbc; + const DBT *dbt; + db_pgno_t *pgnop; +{ + DB *dbp; + PAGE *pagep, *lastp; + DB_LSN new_lsn, null_lsn; + DBT tmp_dbt; + db_indx_t pagespace; + u_int32_t sz; + u_int8_t *p; + int ret; + + /* + * Allocate pages and copy the key/data item into them. Calculate the + * number of bytes we get for pages we fill completely with a single + * item. + */ + dbp = dbc->dbp; + pagespace = P_MAXSPACE(dbp->pgsize); + + lastp = NULL; + for (p = dbt->data, + sz = dbt->size; sz > 0; p += pagespace, sz -= pagespace) { + /* + * Reduce pagespace so we terminate the loop correctly and + * don't copy too much data. + */ + if (sz < pagespace) + pagespace = sz; + + /* + * Allocate and initialize a new page and copy all or part of + * the item onto the page. If sz is less than pagespace, we + * have a partial record. + */ + if ((ret = __db_new(dbc, P_OVERFLOW, &pagep)) != 0) + return (ret); + if (DB_LOGGING(dbc)) { + tmp_dbt.data = p; + tmp_dbt.size = pagespace; + ZERO_LSN(null_lsn); + if ((ret = __db_big_log(dbp->dbenv, dbc->txn, + &new_lsn, 0, DB_ADD_BIG, dbp->log_fileid, + PGNO(pagep), lastp ? PGNO(lastp) : PGNO_INVALID, + PGNO_INVALID, &tmp_dbt, &LSN(pagep), + lastp == NULL ? &null_lsn : &LSN(lastp), + &null_lsn)) != 0) + return (ret); + + /* Move lsn onto page. 
*/ + if (lastp) + LSN(lastp) = new_lsn; + LSN(pagep) = new_lsn; + } + + P_INIT(pagep, dbp->pgsize, + PGNO(pagep), PGNO_INVALID, PGNO_INVALID, 0, P_OVERFLOW); + OV_LEN(pagep) = pagespace; + OV_REF(pagep) = 1; + memcpy((u_int8_t *)pagep + P_OVERHEAD, p, pagespace); + + /* + * If this is the first entry, update the user's info. + * Otherwise, update the entry on the last page filled + * in and release that page. + */ + if (lastp == NULL) + *pgnop = PGNO(pagep); + else { + lastp->next_pgno = PGNO(pagep); + pagep->prev_pgno = PGNO(lastp); + (void)memp_fput(dbp->mpf, lastp, DB_MPOOL_DIRTY); + } + lastp = pagep; + } + (void)memp_fput(dbp->mpf, lastp, DB_MPOOL_DIRTY); + return (0); +} + +/* + * __db_ovref -- + * Increment/decrement the reference count on an overflow page. + * + * PUBLIC: int __db_ovref __P((DBC *, db_pgno_t, int32_t)); + */ +int +__db_ovref(dbc, pgno, adjust) + DBC *dbc; + db_pgno_t pgno; + int32_t adjust; +{ + DB *dbp; + PAGE *h; + int ret; + + dbp = dbc->dbp; + if ((ret = memp_fget(dbp->mpf, &pgno, 0, &h)) != 0) { + (void)__db_pgerr(dbp, pgno); + return (ret); + } + + if (DB_LOGGING(dbc)) + if ((ret = __db_ovref_log(dbp->dbenv, dbc->txn, + &LSN(h), 0, dbp->log_fileid, h->pgno, adjust, + &LSN(h))) != 0) + return (ret); + OV_REF(h) += adjust; + + (void)memp_fput(dbp->mpf, h, DB_MPOOL_DIRTY); + return (0); +} + +/* + * __db_doff -- + * Delete an offpage chain of overflow pages. + * + * PUBLIC: int __db_doff __P((DBC *, db_pgno_t)); + */ +int +__db_doff(dbc, pgno) + DBC *dbc; + db_pgno_t pgno; +{ + DB *dbp; + PAGE *pagep; + DB_LSN null_lsn; + DBT tmp_dbt; + int ret; + + dbp = dbc->dbp; + do { + if ((ret = memp_fget(dbp->mpf, &pgno, 0, &pagep)) != 0) { + (void)__db_pgerr(dbp, pgno); + return (ret); + } + + DB_ASSERT(TYPE(pagep) == P_OVERFLOW); + /* + * If it's referenced by more than one key/data item, + * decrement the reference count and return. 
+ */ + if (OV_REF(pagep) > 1) { + (void)memp_fput(dbp->mpf, pagep, 0); + return (__db_ovref(dbc, pgno, -1)); + } + + if (DB_LOGGING(dbc)) { + tmp_dbt.data = (u_int8_t *)pagep + P_OVERHEAD; + tmp_dbt.size = OV_LEN(pagep); + ZERO_LSN(null_lsn); + if ((ret = __db_big_log(dbp->dbenv, dbc->txn, + &LSN(pagep), 0, DB_REM_BIG, dbp->log_fileid, + PGNO(pagep), PREV_PGNO(pagep), NEXT_PGNO(pagep), + &tmp_dbt, &LSN(pagep), &null_lsn, &null_lsn)) != 0) + return (ret); + } + pgno = pagep->next_pgno; + if ((ret = __db_free(dbc, pagep)) != 0) + return (ret); + } while (pgno != PGNO_INVALID); + + return (0); +} + +/* + * __db_moff -- + * Match on overflow pages. + * + * Given a starting page number and a key, return <0, 0, >0 to indicate if the + * key on the page is less than, equal to or greater than the key specified. + * We optimize this by doing chunk at a time comparison unless the user has + * specified a comparison function. In this case, we need to materialize + * the entire object and call their comparison routine. + * + * PUBLIC: int __db_moff __P((DB *, const DBT *, db_pgno_t, u_int32_t, + * PUBLIC: int (*)(DB *, const DBT *, const DBT *), int *)); + */ +int +__db_moff(dbp, dbt, pgno, tlen, cmpfunc, cmpp) + DB *dbp; + const DBT *dbt; + db_pgno_t pgno; + u_int32_t tlen; + int (*cmpfunc) __P((DB *, const DBT *, const DBT *)), *cmpp; +{ + PAGE *pagep; + DBT local_dbt; + void *buf; + u_int32_t bufsize, cmp_bytes, key_left; + u_int8_t *p1, *p2; + int ret; + + /* + * If there is a user-specified comparison function, build a + * contiguous copy of the key, and call it. + */ + if (cmpfunc != NULL) { + memset(&local_dbt, 0, sizeof(local_dbt)); + buf = NULL; + bufsize = 0; + + if ((ret = __db_goff(dbp, + &local_dbt, tlen, pgno, &buf, &bufsize)) != 0) + return (ret); + /* Pass the key as the first argument */ + *cmpp = cmpfunc(dbp, dbt, &local_dbt); + __os_free(buf, bufsize); + return (0); + } + + /* While there are both keys to compare. 
*/ + for (*cmpp = 0, p1 = dbt->data, + key_left = dbt->size; key_left > 0 && pgno != PGNO_INVALID;) { + if ((ret = memp_fget(dbp->mpf, &pgno, 0, &pagep)) != 0) + return (ret); + + cmp_bytes = OV_LEN(pagep) < key_left ? OV_LEN(pagep) : key_left; + tlen -= cmp_bytes; + key_left -= cmp_bytes; + for (p2 = + (u_int8_t *)pagep + P_OVERHEAD; cmp_bytes-- > 0; ++p1, ++p2) + if (*p1 != *p2) { + *cmpp = (long)*p1 - (long)*p2; + break; + } + pgno = NEXT_PGNO(pagep); + if ((ret = memp_fput(dbp->mpf, pagep, 0)) != 0) + return (ret); + if (*cmpp != 0) + return (0); + } + if (key_left > 0) /* DBT is longer than the page key. */ + *cmpp = 1; + else if (tlen > 0) /* DBT is shorter than the page key. */ + *cmpp = -1; + else + *cmpp = 0; + + return (0); +} + +/* + * __db_vrfy_overflow -- + * Verify overflow page. + * + * PUBLIC: int __db_vrfy_overflow __P((DB *, VRFY_DBINFO *, PAGE *, db_pgno_t, + * PUBLIC: u_int32_t)); + */ +int +__db_vrfy_overflow(dbp, vdp, h, pgno, flags) + DB *dbp; + VRFY_DBINFO *vdp; + PAGE *h; + db_pgno_t pgno; + u_int32_t flags; +{ + VRFY_PAGEINFO *pip; + int isbad, ret, t_ret; + + isbad = 0; + if ((ret = __db_vrfy_getpageinfo(vdp, pgno, &pip)) != 0) + return (ret); + + if ((ret = __db_vrfy_datapage(dbp, vdp, h, pgno, flags)) != 0) { + if (ret == DB_VERIFY_BAD) + isbad = 1; + else + goto err; + } + + pip->refcount = OV_REF(h); + if (pip->refcount < 1) { + EPRINT((dbp->dbenv, + "Overflow page %lu has zero reference count", + (u_long)pgno)); + isbad = 1; + } + + /* Just store for now. */ + pip->olen = HOFFSET(h); + +err: if ((t_ret = __db_vrfy_putpageinfo(vdp, pip)) != 0) + ret = t_ret; + return ((ret == 0 && isbad == 1) ? DB_VERIFY_BAD : ret); +} + +/* + * __db_vrfy_ovfl_structure -- + * Walk a list of overflow pages, avoiding cycles and marking + * pages seen. 
+ * + * PUBLIC: int __db_vrfy_ovfl_structure + * PUBLIC: __P((DB *, VRFY_DBINFO *, db_pgno_t, u_int32_t, u_int32_t)); + */ +int +__db_vrfy_ovfl_structure(dbp, vdp, pgno, tlen, flags) + DB *dbp; + VRFY_DBINFO *vdp; + db_pgno_t pgno; + u_int32_t tlen; + u_int32_t flags; +{ + DB *pgset; + VRFY_PAGEINFO *pip; + db_pgno_t next, prev; + int isbad, p, ret, t_ret; + u_int32_t refcount; + + pgset = vdp->pgset; + DB_ASSERT(pgset != NULL); + isbad = 0; + + /* This shouldn't happen, but just to be sure. */ + if (!IS_VALID_PGNO(pgno)) + return (DB_VERIFY_BAD); + + /* + * Check the first prev_pgno; it ought to be PGNO_INVALID, + * since there's no prev page. + */ + if ((ret = __db_vrfy_getpageinfo(vdp, pgno, &pip)) != 0) + return (ret); + + /* The refcount is stored on the first overflow page. */ + refcount = pip->refcount; + + if (pip->type != P_OVERFLOW) { + EPRINT((dbp->dbenv, + "Overflow page %lu of invalid type %lu", + (u_long)pgno, (u_long)pip->type)); + ret = DB_VERIFY_BAD; + goto err; /* Unsafe to continue. */ + } + + prev = pip->prev_pgno; + if (prev != PGNO_INVALID) { + EPRINT((dbp->dbenv, + "First overflow page %lu has a prev_pgno", (u_long)pgno)); + isbad = 1; + } + + for (;;) { + /* + * This is slightly gross. Btree leaf pages reference + * individual overflow trees multiple times if the overflow page + * is the key to a duplicate set. The reference count does not + * reflect this multiple referencing. Thus, if this is called + * during the structure verification of a btree leaf page, we + * check to see whether we've seen it from a leaf page before + * and, if we have, adjust our count of how often we've seen it + * accordingly. + * + * (This will screw up if it's actually referenced--and + * correctly refcounted--from two different leaf pages, but + * that's a very unlikely brokenness that we're not checking for + * anyway.)
+ */ + + if (LF_ISSET(ST_OVFL_LEAF)) { + if (F_ISSET(pip, VRFY_OVFL_LEAFSEEN)) { + if ((ret = + __db_vrfy_pgset_dec(pgset, pgno)) != 0) + goto err; + } else + F_SET(pip, VRFY_OVFL_LEAFSEEN); + } + + if ((ret = __db_vrfy_pgset_get(pgset, pgno, &p)) != 0) + goto err; + + /* + * We may have seen this elsewhere, if the overflow entry + * has been promoted to an internal page. + */ + if ((u_int32_t)p > refcount) { + EPRINT((dbp->dbenv, + "Page %lu encountered twice in overflow traversal", + (u_long)pgno)); + ret = DB_VERIFY_BAD; + goto err; + } + if ((ret = __db_vrfy_pgset_inc(pgset, pgno)) != 0) + goto err; + + /* Keep a running tab on how much of the item we've seen. */ + tlen -= pip->olen; + + /* Send feedback to the application about our progress. */ + if (!LF_ISSET(DB_SALVAGE)) + __db_vrfy_struct_feedback(dbp, vdp); + + next = pip->next_pgno; + + /* Are we there yet? */ + if (next == PGNO_INVALID) + break; + + /* + * We've already checked this when we saved it, but just + * to be sure... + */ + if (!IS_VALID_PGNO(next)) { + DB_ASSERT(0); + EPRINT((dbp->dbenv, + "Overflow page %lu has bad next_pgno", + (u_long)pgno)); + ret = DB_VERIFY_BAD; + goto err; + } + + if ((ret = __db_vrfy_putpageinfo(vdp, pip)) != 0 || + (ret = __db_vrfy_getpageinfo(vdp, next, &pip)) != 0) + return (ret); + if (pip->prev_pgno != pgno) { + EPRINT((dbp->dbenv, + "Overflow page %lu has bogus prev_pgno value", + (u_long)next)); + isbad = 1; + /* + * It's safe to continue because we have separate + * cycle detection. + */ + } + + pgno = next; + } + + if (tlen > 0) { + isbad = 1; + EPRINT((dbp->dbenv, + "Overflow item incomplete on page %lu", (u_long)pgno)); + } + +err: if ((t_ret = __db_vrfy_putpageinfo(vdp, pip)) != 0 && ret == 0) + ret = t_ret; + return ((ret == 0 && isbad == 1) ? DB_VERIFY_BAD : ret); +} + +/* + * __db_safe_goff -- + * Get an overflow item, very carefully, from an untrusted database, + * in the context of the salvager. 
+ * + * PUBLIC: int __db_safe_goff __P((DB *, VRFY_DBINFO *, db_pgno_t, + * PUBLIC: DBT *, void **, u_int32_t)); + */ +int +__db_safe_goff(dbp, vdp, pgno, dbt, buf, flags) + DB *dbp; + VRFY_DBINFO *vdp; + db_pgno_t pgno; + DBT *dbt; + void **buf; + u_int32_t flags; +{ + PAGE *h; + int ret, err_ret; + u_int32_t bytesgot, bytes; + u_int8_t *src, *dest; + + ret = DB_VERIFY_BAD; + err_ret = 0; + bytesgot = bytes = 0; + + while ((pgno != PGNO_INVALID) && (IS_VALID_PGNO(pgno))) { + /* + * Mark that we're looking at this page; if we've seen it + * already, quit. + */ + if ((ret = __db_salvage_markdone(vdp, pgno)) != 0) + break; + + if ((ret = memp_fget(dbp->mpf, &pgno, 0, &h)) != 0) + break; + + /* + * Make sure it's really an overflow page, unless we're + * being aggressive, in which case we pretend it is. + */ + if (!LF_ISSET(DB_AGGRESSIVE) && TYPE(h) != P_OVERFLOW) { + ret = DB_VERIFY_BAD; + break; + } + + src = (u_int8_t *)h + P_OVERHEAD; + bytes = OV_LEN(h); + + if (bytes + P_OVERHEAD > dbp->pgsize) + bytes = dbp->pgsize - P_OVERHEAD; + + if ((ret = __os_realloc(dbp->dbenv, + bytesgot + bytes, 0, buf)) != 0) + break; + + dest = (u_int8_t *)*buf + bytesgot; + bytesgot += bytes; + + memcpy(dest, src, bytes); + + pgno = NEXT_PGNO(h); + /* Not much we can do here--we don't want to quit. */ + if ((ret = memp_fput(dbp->mpf, h, 0)) != 0) + err_ret = ret; + } + + if (ret == 0) { + dbt->size = bytesgot; + dbt->data = *buf; + } + + return ((err_ret != 0 && ret == 0) ? err_ret : ret); +} diff --git a/bdb/db/db_pr.c b/bdb/db/db_pr.c new file mode 100644 index 00000000000..cb977cadfda --- /dev/null +++ b/bdb/db/db_pr.c @@ -0,0 +1,1284 @@ +/*- + * See the file LICENSE for redistribution information. + * + * Copyright (c) 1996, 1997, 1998, 1999, 2000 + * Sleepycat Software. All rights reserved. 
+ */ + +#include "db_config.h" + +#ifndef lint +static const char revid[] = "$Id: db_pr.c,v 11.46 2001/01/22 17:25:06 krinsky Exp $"; +#endif /* not lint */ + +#ifndef NO_SYSTEM_INCLUDES +#include <sys/types.h> + +#include <ctype.h> +#include <stdlib.h> +#include <string.h> +#include <unistd.h> +#endif + +#include "db_int.h" +#include "db_page.h" +#include "btree.h" +#include "hash.h" +#include "qam.h" +#include "db_am.h" +#include "db_verify.h" + +static int __db_bmeta __P((DB *, FILE *, BTMETA *, u_int32_t)); +static int __db_hmeta __P((DB *, FILE *, HMETA *, u_int32_t)); +static void __db_meta __P((DB *, DBMETA *, FILE *, FN const *, u_int32_t)); +static const char *__db_dbtype_to_string __P((DB *)); +static void __db_prdb __P((DB *, FILE *, u_int32_t)); +static FILE *__db_prinit __P((FILE *)); +static void __db_proff __P((void *)); +static int __db_prtree __P((DB *, u_int32_t)); +static void __db_psize __P((DB *)); +static int __db_qmeta __P((DB *, FILE *, QMETA *, u_int32_t)); + +/* + * 64K is the maximum page size, so by default we check for offsets larger + * than that, and, where possible, we refine the test. + */ +#define PSIZE_BOUNDARY (64 * 1024 + 1) +static size_t set_psize = PSIZE_BOUNDARY; + +static FILE *set_fp; /* Output file descriptor. */ + +/* + * __db_loadme -- + * A nice place to put a breakpoint. + * + * PUBLIC: void __db_loadme __P((void)); + */ +void +__db_loadme() +{ + getpid(); +} + +/* + * __db_dump -- + * Dump the tree to a file. 
+ * + * PUBLIC: int __db_dump __P((DB *, char *, char *)); + */ +int +__db_dump(dbp, op, name) + DB *dbp; + char *op, *name; +{ + FILE *fp, *save_fp; + u_int32_t flags; + + COMPQUIET(save_fp, NULL); + + if (set_psize == PSIZE_BOUNDARY) + __db_psize(dbp); + + if (name != NULL) { + if ((fp = fopen(name, "w")) == NULL) + return (__os_get_errno()); + save_fp = set_fp; + set_fp = fp; + } else + fp = __db_prinit(NULL); + + for (flags = 0; *op != '\0'; ++op) + switch (*op) { + case 'a': + LF_SET(DB_PR_PAGE); + break; + case 'h': + break; + case 'r': + LF_SET(DB_PR_RECOVERYTEST); + break; + default: + return (EINVAL); + } + + __db_prdb(dbp, fp, flags); + + fprintf(fp, "%s\n", DB_LINE); + + (void)__db_prtree(dbp, flags); + + fflush(fp); + + if (name != NULL) { + fclose(fp); + set_fp = save_fp; + } + return (0); +} + +/* + * __db_prdb -- + * Print out the DB structure information. + */ +static void +__db_prdb(dbp, fp, flags) + DB *dbp; + FILE *fp; + u_int32_t flags; +{ + static const FN fn[] = { + { DB_AM_DISCARD, "discard cached pages" }, + { DB_AM_DUP, "duplicates" }, + { DB_AM_INMEM, "in-memory" }, + { DB_AM_PGDEF, "default page size" }, + { DB_AM_RDONLY, "read-only" }, + { DB_AM_SUBDB, "multiple-databases" }, + { DB_AM_SWAP, "needswap" }, + { DB_BT_RECNUM, "btree:recnum" }, + { DB_BT_REVSPLIT, "btree:no reverse split" }, + { DB_DBM_ERROR, "dbm/ndbm error" }, + { DB_OPEN_CALLED, "DB->open called" }, + { DB_RE_DELIMITER, "recno:delimiter" }, + { DB_RE_FIXEDLEN, "recno:fixed-length" }, + { DB_RE_PAD, "recno:pad" }, + { DB_RE_RENUMBER, "recno:renumber" }, + { DB_RE_SNAPSHOT, "recno:snapshot" }, + { 0, NULL } + }; + BTREE *bt; + HASH *h; + QUEUE *q; + + COMPQUIET(flags, 0); + + fprintf(fp, + "In-memory DB structure:\n%s: %#lx", + __db_dbtype_to_string(dbp), (u_long)dbp->flags); + __db_prflags(dbp->flags, fn, fp); + fprintf(fp, "\n"); + + switch (dbp->type) { + case DB_BTREE: + case DB_RECNO: + bt = dbp->bt_internal; + fprintf(fp, "bt_meta: %lu bt_root: %lu\n", + 
(u_long)bt->bt_meta, (u_long)bt->bt_root); + fprintf(fp, "bt_maxkey: %lu bt_minkey: %lu\n", + (u_long)bt->bt_maxkey, (u_long)bt->bt_minkey); + fprintf(fp, "bt_compare: %#lx bt_prefix: %#lx\n", + (u_long)bt->bt_compare, (u_long)bt->bt_prefix); + fprintf(fp, "bt_lpgno: %lu\n", (u_long)bt->bt_lpgno); + if (dbp->type == DB_RECNO) { + fprintf(fp, + "re_pad: %#lx re_delim: %#lx re_len: %lu re_source: %s\n", + (u_long)bt->re_pad, (u_long)bt->re_delim, + (u_long)bt->re_len, + bt->re_source == NULL ? "" : bt->re_source); + fprintf(fp, "re_modified: %d re_eof: %d re_last: %lu\n", + bt->re_modified, bt->re_eof, (u_long)bt->re_last); + } + break; + case DB_HASH: + h = dbp->h_internal; + fprintf(fp, "meta_pgno: %lu\n", (u_long)h->meta_pgno); + fprintf(fp, "h_ffactor: %lu\n", (u_long)h->h_ffactor); + fprintf(fp, "h_nelem: %lu\n", (u_long)h->h_nelem); + fprintf(fp, "h_hash: %#lx\n", (u_long)h->h_hash); + break; + case DB_QUEUE: + q = dbp->q_internal; + fprintf(fp, "q_meta: %lu\n", (u_long)q->q_meta); + fprintf(fp, "q_root: %lu\n", (u_long)q->q_root); + fprintf(fp, "re_pad: %#lx re_len: %lu\n", + (u_long)q->re_pad, (u_long)q->re_len); + fprintf(fp, "rec_page: %lu\n", (u_long)q->rec_page); + fprintf(fp, "page_ext: %lu\n", (u_long)q->page_ext); + break; + default: + break; + } +} + +/* + * __db_prtree -- + * Print out the entire tree. + */ +static int +__db_prtree(dbp, flags) + DB *dbp; + u_int32_t flags; +{ + PAGE *h; + db_pgno_t i, last; + int ret; + + if (set_psize == PSIZE_BOUNDARY) + __db_psize(dbp); + + if (dbp->type == DB_QUEUE) { + ret = __db_prqueue(dbp, flags); + goto done; + } + + /* Find out the page number of the last page in the database. */ + if ((ret = memp_fget(dbp->mpf, &last, DB_MPOOL_LAST, &h)) != 0) + return (ret); + if ((ret = memp_fput(dbp->mpf, h, 0)) != 0) + return (ret); + + /* Dump each page. 
*/ + for (i = 0; i <= last; ++i) { + if ((ret = memp_fget(dbp->mpf, &i, 0, &h)) != 0) + return (ret); + (void)__db_prpage(dbp, h, flags); + if ((ret = memp_fput(dbp->mpf, h, 0)) != 0) + return (ret); + } + +done: + (void)fflush(__db_prinit(NULL)); + return (0); +} + +/* + * __db_meta -- + * Print out common metadata information. + */ +static void +__db_meta(dbp, dbmeta, fp, fn, flags) + DB *dbp; + DBMETA *dbmeta; + FILE *fp; + FN const *fn; + u_int32_t flags; +{ + PAGE *h; + int cnt; + db_pgno_t pgno; + u_int8_t *p; + int ret; + const char *sep; + + fprintf(fp, "\tmagic: %#lx\n", (u_long)dbmeta->magic); + fprintf(fp, "\tversion: %lu\n", (u_long)dbmeta->version); + fprintf(fp, "\tpagesize: %lu\n", (u_long)dbmeta->pagesize); + fprintf(fp, "\ttype: %lu\n", (u_long)dbmeta->type); + fprintf(fp, "\tkeys: %lu\trecords: %lu\n", + (u_long)dbmeta->key_count, (u_long)dbmeta->record_count); + + if (!LF_ISSET(DB_PR_RECOVERYTEST)) { + /* + * If we're doing recovery testing, don't display the free + * list, it may have changed and that makes the dump diff + * not work. 
+ */ + fprintf(fp, "\tfree list: %lu", (u_long)dbmeta->free); + for (pgno = dbmeta->free, + cnt = 0, sep = ", "; pgno != PGNO_INVALID;) { + if ((ret = memp_fget(dbp->mpf, &pgno, 0, &h)) != 0) { + fprintf(fp, + "Unable to retrieve free-list page: %lu: %s\n", + (u_long)pgno, db_strerror(ret)); + break; + } + pgno = h->next_pgno; + (void)memp_fput(dbp->mpf, h, 0); + fprintf(fp, "%s%lu", sep, (u_long)pgno); + if (++cnt % 10 == 0) { + fprintf(fp, "\n"); + cnt = 0; + sep = "\t"; + } else + sep = ", "; + } + fprintf(fp, "\n"); + } + + if (fn != NULL) { + fprintf(fp, "\tflags: %#lx", (u_long)dbmeta->flags); + __db_prflags(dbmeta->flags, fn, fp); + fprintf(fp, "\n"); + } + + fprintf(fp, "\tuid: "); + for (p = (u_int8_t *)dbmeta->uid, + cnt = 0; cnt < DB_FILE_ID_LEN; ++cnt) { + fprintf(fp, "%x", *p++); + if (cnt < DB_FILE_ID_LEN - 1) + fprintf(fp, " "); + } + fprintf(fp, "\n"); +} + +/* + * __db_bmeta -- + * Print out the btree meta-data page. + */ +static int +__db_bmeta(dbp, fp, h, flags) + DB *dbp; + FILE *fp; + BTMETA *h; + u_int32_t flags; +{ + static const FN mfn[] = { + { BTM_DUP, "duplicates" }, + { BTM_RECNO, "recno" }, + { BTM_RECNUM, "btree:recnum" }, + { BTM_FIXEDLEN, "recno:fixed-length" }, + { BTM_RENUMBER, "recno:renumber" }, + { BTM_SUBDB, "multiple-databases" }, + { 0, NULL } + }; + + __db_meta(dbp, (DBMETA *)h, fp, mfn, flags); + + fprintf(fp, "\tmaxkey: %lu minkey: %lu\n", + (u_long)h->maxkey, (u_long)h->minkey); + if (dbp->type == DB_RECNO) + fprintf(fp, "\tre_len: %#lx re_pad: %lu\n", + (u_long)h->re_len, (u_long)h->re_pad); + fprintf(fp, "\troot: %lu\n", (u_long)h->root); + + return (0); +} + +/* + * __db_hmeta -- + * Print out the hash meta-data page. 
+ */ +static int +__db_hmeta(dbp, fp, h, flags) + DB *dbp; + FILE *fp; + HMETA *h; + u_int32_t flags; +{ + static const FN mfn[] = { + { DB_HASH_DUP, "duplicates" }, + { DB_HASH_SUBDB, "multiple-databases" }, + { 0, NULL } + }; + int i; + + __db_meta(dbp, (DBMETA *)h, fp, mfn, flags); + + fprintf(fp, "\tmax_bucket: %lu\n", (u_long)h->max_bucket); + fprintf(fp, "\thigh_mask: %#lx\n", (u_long)h->high_mask); + fprintf(fp, "\tlow_mask: %#lx\n", (u_long)h->low_mask); + fprintf(fp, "\tffactor: %lu\n", (u_long)h->ffactor); + fprintf(fp, "\tnelem: %lu\n", (u_long)h->nelem); + fprintf(fp, "\th_charkey: %#lx\n", (u_long)h->h_charkey); + fprintf(fp, "\tspare points: "); + for (i = 0; i < NCACHED; i++) + fprintf(fp, "%lu ", (u_long)h->spares[i]); + fprintf(fp, "\n"); + + return (0); +} + +/* + * __db_qmeta -- + * Print out the queue meta-data page. + */ +static int +__db_qmeta(dbp, fp, h, flags) + DB *dbp; + FILE *fp; + QMETA *h; + u_int32_t flags; +{ + __db_meta(dbp, (DBMETA *)h, fp, NULL, flags); + + fprintf(fp, "\tfirst_recno: %lu\n", (u_long)h->first_recno); + fprintf(fp, "\tcur_recno: %lu\n", (u_long)h->cur_recno); + fprintf(fp, "\tre_len: %#lx re_pad: %lu\n", + (u_long)h->re_len, (u_long)h->re_pad); + fprintf(fp, "\trec_page: %lu\n", (u_long)h->rec_page); + fprintf(fp, "\tpage_ext: %lu\n", (u_long)h->page_ext); + + return (0); +} + +/* + * __db_prnpage + * -- Print out a specific page. + * + * PUBLIC: int __db_prnpage __P((DB *, db_pgno_t)); + */ +int +__db_prnpage(dbp, pgno) + DB *dbp; + db_pgno_t pgno; +{ + PAGE *h; + int ret; + + if (set_psize == PSIZE_BOUNDARY) + __db_psize(dbp); + + if ((ret = memp_fget(dbp->mpf, &pgno, 0, &h)) != 0) + return (ret); + + ret = __db_prpage(dbp, h, DB_PR_PAGE); + (void)fflush(__db_prinit(NULL)); + + (void)memp_fput(dbp->mpf, h, 0); + return (ret); +} + +/* + * __db_prpage + * -- Print out a page. 
+ * + * PUBLIC: int __db_prpage __P((DB *, PAGE *, u_int32_t)); + */ +int +__db_prpage(dbp, h, flags) + DB *dbp; + PAGE *h; + u_int32_t flags; +{ + BINTERNAL *bi; + BKEYDATA *bk; + BTREE *t; + FILE *fp; + HOFFPAGE a_hkd; + QAMDATA *qp, *qep; + RINTERNAL *ri; + db_indx_t dlen, len, i; + db_pgno_t pgno; + db_recno_t recno; + int deleted, ret; + const char *s; + u_int32_t qlen; + u_int8_t *ep, *hk, *p; + void *sp; + + fp = __db_prinit(NULL); + + /* + * If we're doing recovery testing and this page is P_INVALID, + * assume it's a page that's on the free list, and don't display it. + */ + if (LF_ISSET(DB_PR_RECOVERYTEST) && TYPE(h) == P_INVALID) + return (0); + + s = __db_pagetype_to_string(TYPE(h)); + if (s == NULL) { + fprintf(fp, "ILLEGAL PAGE TYPE: page: %lu type: %lu\n", + (u_long)h->pgno, (u_long)TYPE(h)); + return (1); + } + + /* Page number, page type. */ + fprintf(fp, "page %lu: %s level: %lu", + (u_long)h->pgno, s, (u_long)h->level); + + /* Record count. */ + if (TYPE(h) == P_IBTREE || + TYPE(h) == P_IRECNO || (TYPE(h) == P_LRECNO && + h->pgno == ((BTREE *)dbp->bt_internal)->bt_root)) + fprintf(fp, " records: %lu", (u_long)RE_NREC(h)); + + /* LSN. */ + if (!LF_ISSET(DB_PR_RECOVERYTEST)) + fprintf(fp, " (lsn.file: %lu lsn.offset: %lu)\n", + (u_long)LSN(h).file, (u_long)LSN(h).offset); + + switch (TYPE(h)) { + case P_BTREEMETA: + return (__db_bmeta(dbp, fp, (BTMETA *)h, flags)); + case P_HASHMETA: + return (__db_hmeta(dbp, fp, (HMETA *)h, flags)); + case P_QAMMETA: + return (__db_qmeta(dbp, fp, (QMETA *)h, flags)); + case P_QAMDATA: /* Should be meta->start. */ + if (!LF_ISSET(DB_PR_PAGE)) + return (0); + + qlen = ((QUEUE *)dbp->q_internal)->re_len; + recno = (h->pgno - 1) * QAM_RECNO_PER_PAGE(dbp) + 1; + i = 0; + qep = (QAMDATA *)((u_int8_t *)h + set_psize - qlen); + for (qp = QAM_GET_RECORD(dbp, h, i); qp < qep; + recno++, i++, qp = QAM_GET_RECORD(dbp, h, i)) { + if (!F_ISSET(qp, QAM_SET)) + continue; + + fprintf(fp, "%s", + F_ISSET(qp, QAM_VALID) ? 
"\t" : " D"); + fprintf(fp, "[%03lu] %4lu ", + (u_long)recno, (u_long)qp - (u_long)h); + __db_pr(qp->data, qlen); + } + return (0); + } + + /* LSN. */ + if (LF_ISSET(DB_PR_RECOVERYTEST)) + fprintf(fp, " (lsn.file: %lu lsn.offset: %lu)\n", + (u_long)LSN(h).file, (u_long)LSN(h).offset); + + t = dbp->bt_internal; + + s = "\t"; + if (TYPE(h) != P_IBTREE && TYPE(h) != P_IRECNO) { + fprintf(fp, "%sprev: %4lu next: %4lu", + s, (u_long)PREV_PGNO(h), (u_long)NEXT_PGNO(h)); + s = " "; + } + if (TYPE(h) == P_OVERFLOW) { + fprintf(fp, "%sref cnt: %4lu ", s, (u_long)OV_REF(h)); + __db_pr((u_int8_t *)h + P_OVERHEAD, OV_LEN(h)); + return (0); + } + fprintf(fp, "%sentries: %4lu", s, (u_long)NUM_ENT(h)); + fprintf(fp, " offset: %4lu\n", (u_long)HOFFSET(h)); + + if (TYPE(h) == P_INVALID || !LF_ISSET(DB_PR_PAGE)) + return (0); + + ret = 0; + for (i = 0; i < NUM_ENT(h); i++) { + if (P_ENTRY(h, i) - (u_int8_t *)h < P_OVERHEAD || + (size_t)(P_ENTRY(h, i) - (u_int8_t *)h) >= set_psize) { + fprintf(fp, + "ILLEGAL PAGE OFFSET: indx: %lu of %lu\n", + (u_long)i, (u_long)h->inp[i]); + ret = EINVAL; + continue; + } + deleted = 0; + switch (TYPE(h)) { + case P_HASH: + case P_IBTREE: + case P_IRECNO: + sp = P_ENTRY(h, i); + break; + case P_LBTREE: + sp = P_ENTRY(h, i); + deleted = i % 2 == 0 && + B_DISSET(GET_BKEYDATA(h, i + O_INDX)->type); + break; + case P_LDUP: + case P_LRECNO: + sp = P_ENTRY(h, i); + deleted = B_DISSET(GET_BKEYDATA(h, i)->type); + break; + default: + fprintf(fp, + "ILLEGAL PAGE ITEM: %lu\n", (u_long)TYPE(h)); + ret = EINVAL; + continue; + } + fprintf(fp, "%s", deleted ? 
" D" : "\t"); + fprintf(fp, "[%03lu] %4lu ", (u_long)i, (u_long)h->inp[i]); + switch (TYPE(h)) { + case P_HASH: + hk = sp; + switch (HPAGE_PTYPE(hk)) { + case H_OFFDUP: + memcpy(&pgno, + HOFFDUP_PGNO(hk), sizeof(db_pgno_t)); + fprintf(fp, + "%4lu [offpage dups]\n", (u_long)pgno); + break; + case H_DUPLICATE: + /* + * If this is the first item on a page, then + * we cannot figure out how long it is, so + * we only print the first one in the duplicate + * set. + */ + if (i != 0) + len = LEN_HKEYDATA(h, 0, i); + else + len = 1; + + fprintf(fp, "Duplicates:\n"); + for (p = HKEYDATA_DATA(hk), + ep = p + len; p < ep;) { + memcpy(&dlen, p, sizeof(db_indx_t)); + p += sizeof(db_indx_t); + fprintf(fp, "\t\t"); + __db_pr(p, dlen); + p += sizeof(db_indx_t) + dlen; + } + break; + case H_KEYDATA: + __db_pr(HKEYDATA_DATA(hk), + LEN_HKEYDATA(h, i == 0 ? set_psize : 0, i)); + break; + case H_OFFPAGE: + memcpy(&a_hkd, hk, HOFFPAGE_SIZE); + fprintf(fp, + "overflow: total len: %4lu page: %4lu\n", + (u_long)a_hkd.tlen, (u_long)a_hkd.pgno); + break; + } + break; + case P_IBTREE: + bi = sp; + fprintf(fp, "count: %4lu pgno: %4lu type: %4lu", + (u_long)bi->nrecs, (u_long)bi->pgno, + (u_long)bi->type); + switch (B_TYPE(bi->type)) { + case B_KEYDATA: + __db_pr(bi->data, bi->len); + break; + case B_DUPLICATE: + case B_OVERFLOW: + __db_proff(bi->data); + break; + default: + fprintf(fp, "ILLEGAL BINTERNAL TYPE: %lu\n", + (u_long)B_TYPE(bi->type)); + ret = EINVAL; + break; + } + break; + case P_IRECNO: + ri = sp; + fprintf(fp, "entries %4lu pgno %4lu\n", + (u_long)ri->nrecs, (u_long)ri->pgno); + break; + case P_LBTREE: + case P_LDUP: + case P_LRECNO: + bk = sp; + switch (B_TYPE(bk->type)) { + case B_KEYDATA: + __db_pr(bk->data, bk->len); + break; + case B_DUPLICATE: + case B_OVERFLOW: + __db_proff(bk); + break; + default: + fprintf(fp, + "ILLEGAL DUPLICATE/LBTREE/LRECNO TYPE: %lu\n", + (u_long)B_TYPE(bk->type)); + ret = EINVAL; + break; + } + break; + } + } + (void)fflush(fp); + return (ret); +} 
+ +/* + * __db_pr -- + * Print out a data element. + * + * PUBLIC: void __db_pr __P((u_int8_t *, u_int32_t)); + */ +void +__db_pr(p, len) + u_int8_t *p; + u_int32_t len; +{ + FILE *fp; + u_int lastch; + int i; + + fp = __db_prinit(NULL); + + fprintf(fp, "len: %3lu", (u_long)len); + lastch = '.'; + if (len != 0) { + fprintf(fp, " data: "); + for (i = len <= 20 ? len : 20; i > 0; --i, ++p) { + lastch = *p; + if (isprint((int)*p) || *p == '\n') + fprintf(fp, "%c", *p); + else + fprintf(fp, "0x%.2x", (u_int)*p); + } + if (len > 20) { + fprintf(fp, "..."); + lastch = '.'; + } + } + if (lastch != '\n') + fprintf(fp, "\n"); +} + +/* + * __db_prdbt -- + * Print out a DBT data element. + * + * PUBLIC: int __db_prdbt __P((DBT *, int, const char *, void *, + * PUBLIC: int (*)(void *, const void *), int, VRFY_DBINFO *)); + */ +int +__db_prdbt(dbtp, checkprint, prefix, handle, callback, is_recno, vdp) + DBT *dbtp; + int checkprint; + const char *prefix; + void *handle; + int (*callback) __P((void *, const void *)); + int is_recno; + VRFY_DBINFO *vdp; +{ + static const char hex[] = "0123456789abcdef"; + db_recno_t recno; + u_int32_t len; + int ret; +#define DBTBUFLEN 100 + char *p, *hp, buf[DBTBUFLEN], hbuf[DBTBUFLEN]; + + if (vdp != NULL) { + /* + * If vdp is non-NULL, we might be the first key in the + * "fake" subdatabase used for key/data pairs we can't + * associate with a known subdb. + * + * Check and clear the SALVAGE_PRINTHEADER flag; if + * it was set, print a subdatabase header. + */ + if (F_ISSET(vdp, SALVAGE_PRINTHEADER)) + (void)__db_prheader(NULL, "__OTHER__", 0, 0, + handle, callback, vdp, 0); + F_CLR(vdp, SALVAGE_PRINTHEADER); + F_SET(vdp, SALVAGE_PRINTFOOTER); + } + + /* + * !!! + * This routine is the routine that dumps out items in the format + * used by db_dump(1) and db_load(1). This means that the format + * cannot change. 
+ */ + if (prefix != NULL && (ret = callback(handle, prefix)) != 0) + return (ret); + if (is_recno) { + /* + * We're printing a record number, and this has to be done + * in a platform-independent way. So we use the numeral in + * straight ASCII. + */ + __ua_memcpy(&recno, dbtp->data, sizeof(recno)); + snprintf(buf, DBTBUFLEN, "%lu", (u_long)recno); + + /* If we're printing data as hex, print keys as hex too. */ + if (!checkprint) { + for (len = strlen(buf), p = buf, hp = hbuf; + len-- > 0; ++p) { + *hp++ = hex[(u_int8_t)(*p & 0xf0) >> 4]; + *hp++ = hex[*p & 0x0f]; + } + *hp = '\0'; + ret = callback(handle, hbuf); + } else + ret = callback(handle, buf); + + if (ret != 0) + return (ret); + } else if (checkprint) { + for (len = dbtp->size, p = dbtp->data; len--; ++p) + if (isprint((int)*p)) { + if (*p == '\\' && + (ret = callback(handle, "\\")) != 0) + return (ret); + snprintf(buf, DBTBUFLEN, "%c", *p); + if ((ret = callback(handle, buf)) != 0) + return (ret); + } else { + snprintf(buf, DBTBUFLEN, "\\%c%c", + hex[(u_int8_t)(*p & 0xf0) >> 4], + hex[*p & 0x0f]); + if ((ret = callback(handle, buf)) != 0) + return (ret); + } + } else + for (len = dbtp->size, p = dbtp->data; len--; ++p) { + snprintf(buf, DBTBUFLEN, "%c%c", + hex[(u_int8_t)(*p & 0xf0) >> 4], + hex[*p & 0x0f]); + if ((ret = callback(handle, buf)) != 0) + return (ret); + } + + return (callback(handle, "\n")); +} + +/* + * __db_proff -- + * Print out an off-page element. + */ +static void +__db_proff(vp) + void *vp; +{ + FILE *fp; + BOVERFLOW *bo; + + fp = __db_prinit(NULL); + + bo = vp; + switch (B_TYPE(bo->type)) { + case B_OVERFLOW: + fprintf(fp, "overflow: total len: %4lu page: %4lu\n", + (u_long)bo->tlen, (u_long)bo->pgno); + break; + case B_DUPLICATE: + fprintf(fp, "duplicate: page: %4lu\n", (u_long)bo->pgno); + break; + } +} + +/* + * __db_prflags -- + * Print out flags values. 
+ * + * PUBLIC: void __db_prflags __P((u_int32_t, const FN *, FILE *)); + */ +void +__db_prflags(flags, fn, fp) + u_int32_t flags; + FN const *fn; + FILE *fp; +{ + const FN *fnp; + int found; + const char *sep; + + sep = " ("; + for (found = 0, fnp = fn; fnp->mask != 0; ++fnp) + if (LF_ISSET(fnp->mask)) { + fprintf(fp, "%s%s", sep, fnp->name); + sep = ", "; + found = 1; + } + if (found) + fprintf(fp, ")"); +} + +/* + * __db_prinit -- + * Initialize tree printing routines. + */ +static FILE * +__db_prinit(fp) + FILE *fp; +{ + if (set_fp == NULL) + set_fp = fp == NULL ? stdout : fp; + return (set_fp); +} + +/* + * __db_psize -- + * Get the page size. + */ +static void +__db_psize(dbp) + DB *dbp; +{ + DBMETA *mp; + db_pgno_t pgno; + + set_psize = PSIZE_BOUNDARY - 1; + + pgno = PGNO_BASE_MD; + if (memp_fget(dbp->mpf, &pgno, 0, &mp) != 0) + return; + + switch (mp->magic) { + case DB_BTREEMAGIC: + case DB_HASHMAGIC: + case DB_QAMMAGIC: + set_psize = mp->pagesize; + break; + } + (void)memp_fput(dbp->mpf, mp, 0); +} + +/* + * __db_dbtype_to_string -- + * Return the name of the database type. + */ +static const char * +__db_dbtype_to_string(dbp) + DB *dbp; +{ + switch (dbp->type) { + case DB_BTREE: + return ("btree"); + case DB_HASH: + return ("hash"); + break; + case DB_RECNO: + return ("recno"); + break; + case DB_QUEUE: + return ("queue"); + default: + return ("UNKNOWN TYPE"); + } + /* NOTREACHED */ +} + +/* + * __db_pagetype_to_string -- + * Return the name of the specified page type. 
+ * + * PUBLIC: const char *__db_pagetype_to_string __P((u_int32_t)); + */ +const char * +__db_pagetype_to_string(type) + u_int32_t type; +{ + char *s; + + s = NULL; + switch (type) { + case P_BTREEMETA: + s = "btree metadata"; + break; + case P_LDUP: + s = "duplicate"; + break; + case P_HASH: + s = "hash"; + break; + case P_HASHMETA: + s = "hash metadata"; + break; + case P_IBTREE: + s = "btree internal"; + break; + case P_INVALID: + s = "invalid"; + break; + case P_IRECNO: + s = "recno internal"; + break; + case P_LBTREE: + s = "btree leaf"; + break; + case P_LRECNO: + s = "recno leaf"; + break; + case P_OVERFLOW: + s = "overflow"; + break; + case P_QAMMETA: + s = "queue metadata"; + break; + case P_QAMDATA: + s = "queue"; + break; + default: + /* Just return a NULL. */ + break; + } + return (s); +} + +/* + * __db_prheader -- + * Write out header information in the format expected by db_load. + * + * PUBLIC: int __db_prheader __P((DB *, char *, int, int, void *, + * PUBLIC: int (*)(void *, const void *), VRFY_DBINFO *, db_pgno_t)); + */ +int +__db_prheader(dbp, subname, pflag, keyflag, handle, callback, vdp, meta_pgno) + DB *dbp; + char *subname; + int pflag, keyflag; + void *handle; + int (*callback) __P((void *, const void *)); + VRFY_DBINFO *vdp; + db_pgno_t meta_pgno; +{ + DB_BTREE_STAT *btsp; + DB_ENV *dbenv; + DB_HASH_STAT *hsp; + DB_QUEUE_STAT *qsp; + VRFY_PAGEINFO *pip; + char *buf; + int buflen, ret, t_ret; + u_int32_t dbtype; + + btsp = NULL; + hsp = NULL; + qsp = NULL; + ret = 0; + buf = NULL; + COMPQUIET(buflen, 0); + + if (dbp == NULL) + dbenv = NULL; + else + dbenv = dbp->dbenv; + + /* + * If we've been passed a verifier statistics object, use + * that; we're being called in a context where dbp->stat + * is unsafe. 
+ */ + if (vdp != NULL) { + if ((ret = __db_vrfy_getpageinfo(vdp, meta_pgno, &pip)) != 0) + return (ret); + } else + pip = NULL; + + /* + * If dbp is NULL, we're being called from inside __db_prdbt, + * and this is a special subdatabase for "lost" items. Make it a btree. + * Otherwise, set dbtype to the appropriate type for the specified + * meta page, or the type of the dbp. + */ + if (dbp == NULL) + dbtype = DB_BTREE; + else if (pip != NULL) + switch (pip->type) { + case P_BTREEMETA: + if (F_ISSET(pip, VRFY_IS_RECNO)) + dbtype = DB_RECNO; + else + dbtype = DB_BTREE; + break; + case P_HASHMETA: + dbtype = DB_HASH; + break; + default: + /* + * If the meta page is of a bogus type, it's + * because we have a badly corrupt database. + * (We must be in the verifier for pip to be non-NULL.) + * Pretend we're a Btree and salvage what we can. + */ + DB_ASSERT(F_ISSET(dbp, DB_AM_VERIFYING)); + dbtype = DB_BTREE; + break; + } + else + dbtype = dbp->type; + + if ((ret = callback(handle, "VERSION=3\n")) != 0) + goto err; + if (pflag) { + if ((ret = callback(handle, "format=print\n")) != 0) + goto err; + } else if ((ret = callback(handle, "format=bytevalue\n")) != 0) + goto err; + + /* + * 64 bytes is long enough, as a minimum bound, for any of the + * fields besides subname. Subname can be anything, and so + * 64 + subname is big enough for all the things we need to print here. + */ + buflen = 64 + ((subname != NULL) ? 
strlen(subname) : 0); + if ((ret = __os_malloc(dbenv, buflen, NULL, &buf)) != 0) + goto err; + if (subname != NULL) { + snprintf(buf, buflen, "database=%s\n", subname); + if ((ret = callback(handle, buf)) != 0) + goto err; + } + switch (dbtype) { + case DB_BTREE: + if ((ret = callback(handle, "type=btree\n")) != 0) + goto err; + if (pip != NULL) { + if (F_ISSET(pip, VRFY_HAS_RECNUMS)) + if ((ret = + callback(handle, "recnum=1\n")) != 0) + goto err; + if (pip->bt_maxkey != 0) { + snprintf(buf, buflen, + "bt_maxkey=%lu\n", (u_long)pip->bt_maxkey); + if ((ret = callback(handle, buf)) != 0) + goto err; + } + if (pip->bt_minkey != 0 && + pip->bt_minkey != DEFMINKEYPAGE) { + snprintf(buf, buflen, + "bt_minkey=%lu\n", (u_long)pip->bt_minkey); + if ((ret = callback(handle, buf)) != 0) + goto err; + } + break; + } + if ((ret = dbp->stat(dbp, &btsp, NULL, 0)) != 0) { + dbp->err(dbp, ret, "DB->stat"); + goto err; + } + if (F_ISSET(dbp, DB_BT_RECNUM)) + if ((ret = callback(handle, "recnum=1\n")) != 0) + goto err; + if (btsp->bt_maxkey != 0) { + snprintf(buf, buflen, + "bt_maxkey=%lu\n", (u_long)btsp->bt_maxkey); + if ((ret = callback(handle, buf)) != 0) + goto err; + } + if (btsp->bt_minkey != 0 && btsp->bt_minkey != DEFMINKEYPAGE) { + snprintf(buf, buflen, + "bt_minkey=%lu\n", (u_long)btsp->bt_minkey); + if ((ret = callback(handle, buf)) != 0) + goto err; + } + break; + case DB_HASH: + if ((ret = callback(handle, "type=hash\n")) != 0) + goto err; + if (pip != NULL) { + if (pip->h_ffactor != 0) { + snprintf(buf, buflen, + "h_ffactor=%lu\n", (u_long)pip->h_ffactor); + if ((ret = callback(handle, buf)) != 0) + goto err; + } + if (pip->h_nelem != 0) { + snprintf(buf, buflen, + "h_nelem=%lu\n", (u_long)pip->h_nelem); + if ((ret = callback(handle, buf)) != 0) + goto err; + } + break; + } + if ((ret = dbp->stat(dbp, &hsp, NULL, 0)) != 0) { + dbp->err(dbp, ret, "DB->stat"); + goto err; + } + if (hsp->hash_ffactor != 0) { + snprintf(buf, buflen, + "h_ffactor=%lu\n", 
(u_long)hsp->hash_ffactor); + if ((ret = callback(handle, buf)) != 0) + goto err; + } + if (hsp->hash_nelem != 0 || hsp->hash_nkeys != 0) { + snprintf(buf, buflen, "h_nelem=%lu\n", + hsp->hash_nelem > hsp->hash_nkeys ? + (u_long)hsp->hash_nelem : (u_long)hsp->hash_nkeys); + if ((ret = callback(handle, buf)) != 0) + goto err; + } + break; + case DB_QUEUE: + if ((ret = callback(handle, "type=queue\n")) != 0) + goto err; + if (vdp != NULL) { + snprintf(buf, + buflen, "re_len=%lu\n", (u_long)vdp->re_len); + if ((ret = callback(handle, buf)) != 0) + goto err; + break; + } + if ((ret = dbp->stat(dbp, &qsp, NULL, 0)) != 0) { + dbp->err(dbp, ret, "DB->stat"); + goto err; + } + snprintf(buf, buflen, "re_len=%lu\n", (u_long)qsp->qs_re_len); + if (qsp->qs_re_pad != 0 && qsp->qs_re_pad != ' ') + snprintf(buf, buflen, "re_pad=%#x\n", qsp->qs_re_pad); + if ((ret = callback(handle, buf)) != 0) + goto err; + break; + case DB_RECNO: + if ((ret = callback(handle, "type=recno\n")) != 0) + goto err; + if (pip != NULL) { + if (F_ISSET(pip, VRFY_IS_RRECNO)) + if ((ret = + callback(handle, "renumber=1\n")) != 0) + goto err; + if (pip->re_len > 0) { + snprintf(buf, buflen, + "re_len=%lu\n", (u_long)pip->re_len); + if ((ret = callback(handle, buf)) != 0) + goto err; + } + break; + } + if ((ret = dbp->stat(dbp, &btsp, NULL, 0)) != 0) { + dbp->err(dbp, ret, "DB->stat"); + goto err; + } + if (F_ISSET(dbp, DB_RE_RENUMBER)) + if ((ret = callback(handle, "renumber=1\n")) != 0) + goto err; + if (F_ISSET(dbp, DB_RE_FIXEDLEN)) { + snprintf(buf, buflen, + "re_len=%lu\n", (u_long)btsp->bt_re_len); + if ((ret = callback(handle, buf)) != 0) + goto err; + } + if (btsp->bt_re_pad != 0 && btsp->bt_re_pad != ' ') { + snprintf(buf, buflen, "re_pad=%#x\n", btsp->bt_re_pad); + if ((ret = callback(handle, buf)) != 0) + goto err; + } + break; + case DB_UNKNOWN: + DB_ASSERT(0); /* Impossible. 
*/ + __db_err(dbp->dbenv, "Impossible DB type in __db_prheader"); + ret = EINVAL; + goto err; + } + + if (pip != NULL) { + if (F_ISSET(pip, VRFY_HAS_DUPS)) + if ((ret = callback(handle, "duplicates=1\n")) != 0) + goto err; + if (F_ISSET(pip, VRFY_HAS_DUPSORT)) + if ((ret = callback(handle, "dupsort=1\n")) != 0) + goto err; + /* We should handle page size. XXX */ + } else { + if (F_ISSET(dbp, DB_AM_DUP)) + if ((ret = callback(handle, "duplicates=1\n")) != 0) + goto err; + if (F_ISSET(dbp, DB_AM_DUPSORT)) + if ((ret = callback(handle, "dupsort=1\n")) != 0) + goto err; + if (!F_ISSET(dbp, DB_AM_PGDEF)) { + snprintf(buf, buflen, + "db_pagesize=%lu\n", (u_long)dbp->pgsize); + if ((ret = callback(handle, buf)) != 0) + goto err; + } + } + + if (keyflag && (ret = callback(handle, "keys=1\n")) != 0) + goto err; + + ret = callback(handle, "HEADER=END\n"); + +err: if (pip != NULL && + (t_ret = __db_vrfy_putpageinfo(vdp, pip)) != 0 && ret == 0) + ret = t_ret; + if (btsp != NULL) + __os_free(btsp, 0); + if (hsp != NULL) + __os_free(hsp, 0); + if (qsp != NULL) + __os_free(qsp, 0); + if (buf != NULL) + __os_free(buf, buflen); + + return (ret); +} + +/* + * __db_prfooter -- + * Print the footer that marks the end of a DB dump. This is trivial, + * but for consistency's sake we don't want to put its literal contents + * in multiple places. + * + * PUBLIC: int __db_prfooter __P((void *, int (*)(void *, const void *))); + */ +int +__db_prfooter(handle, callback) + void *handle; + int (*callback) __P((void *, const void *)); +{ + return (callback(handle, "DATA=END\n")); +} diff --git a/bdb/db/db_rec.c b/bdb/db/db_rec.c new file mode 100644 index 00000000000..998d074290d --- /dev/null +++ b/bdb/db/db_rec.c @@ -0,0 +1,529 @@ +/*- + * See the file LICENSE for redistribution information. + * + * Copyright (c) 1996, 1997, 1998, 1999, 2000 + * Sleepycat Software. All rights reserved. 
+ */ + +#include "db_config.h" + +#ifndef lint +static const char revid[] = "$Id: db_rec.c,v 11.10 2000/08/03 15:32:19 ubell Exp $"; +#endif /* not lint */ + +#ifndef NO_SYSTEM_INCLUDES +#include <sys/types.h> + +#include <string.h> +#endif + +#include "db_int.h" +#include "db_page.h" +#include "log.h" +#include "hash.h" + +/* + * PUBLIC: int __db_addrem_recover + * PUBLIC: __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *)); + * + * This log message is generated whenever we add or remove a duplicate + * to/from a duplicate page. On recover, we just do the opposite. + */ +int +__db_addrem_recover(dbenv, dbtp, lsnp, op, info) + DB_ENV *dbenv; + DBT *dbtp; + DB_LSN *lsnp; + db_recops op; + void *info; +{ + __db_addrem_args *argp; + DB *file_dbp; + DBC *dbc; + DB_MPOOLFILE *mpf; + PAGE *pagep; + u_int32_t change; + int cmp_n, cmp_p, ret; + + COMPQUIET(info, NULL); + REC_PRINT(__db_addrem_print); + REC_INTRO(__db_addrem_read, 1); + + if ((ret = memp_fget(mpf, &argp->pgno, 0, &pagep)) != 0) { + if (DB_UNDO(op)) { + /* + * We are undoing and the page doesn't exist. That + * is equivalent to having a pagelsn of 0, so we + * would not have to undo anything. In this case, + * don't bother creating a page. + */ + goto done; + } else + if ((ret = memp_fget(mpf, + &argp->pgno, DB_MPOOL_CREATE, &pagep)) != 0) + goto out; + } + + cmp_n = log_compare(lsnp, &LSN(pagep)); + cmp_p = log_compare(&LSN(pagep), &argp->pagelsn); + CHECK_LSN(op, cmp_p, &LSN(pagep), &argp->pagelsn); + change = 0; + if ((cmp_p == 0 && DB_REDO(op) && argp->opcode == DB_ADD_DUP) || + (cmp_n == 0 && DB_UNDO(op) && argp->opcode == DB_REM_DUP)) { + + /* Need to redo an add, or undo a delete. */ + if ((ret = __db_pitem(dbc, pagep, argp->indx, argp->nbytes, + argp->hdr.size == 0 ? NULL : &argp->hdr, + argp->dbt.size == 0 ? 
NULL : &argp->dbt)) != 0) + goto out; + + change = DB_MPOOL_DIRTY; + + } else if ((cmp_n == 0 && DB_UNDO(op) && argp->opcode == DB_ADD_DUP) || + (cmp_p == 0 && DB_REDO(op) && argp->opcode == DB_REM_DUP)) { + /* Need to undo an add, or redo a delete. */ + if ((ret = __db_ditem(dbc, + pagep, argp->indx, argp->nbytes)) != 0) + goto out; + change = DB_MPOOL_DIRTY; + } + + if (change) { + if (DB_REDO(op)) + LSN(pagep) = *lsnp; + else + LSN(pagep) = argp->pagelsn; + } + + if ((ret = memp_fput(mpf, pagep, change)) != 0) + goto out; + +done: *lsnp = argp->prev_lsn; + ret = 0; + +out: REC_CLOSE; +} + +/* + * PUBLIC: int __db_big_recover + * PUBLIC: __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *)); + */ +int +__db_big_recover(dbenv, dbtp, lsnp, op, info) + DB_ENV *dbenv; + DBT *dbtp; + DB_LSN *lsnp; + db_recops op; + void *info; +{ + __db_big_args *argp; + DB *file_dbp; + DBC *dbc; + DB_MPOOLFILE *mpf; + PAGE *pagep; + u_int32_t change; + int cmp_n, cmp_p, ret; + + COMPQUIET(info, NULL); + REC_PRINT(__db_big_print); + REC_INTRO(__db_big_read, 1); + + if ((ret = memp_fget(mpf, &argp->pgno, 0, &pagep)) != 0) { + if (DB_UNDO(op)) { + /* + * We are undoing and the page doesn't exist. That + * is equivalent to having a pagelsn of 0, so we + * would not have to undo anything. In this case, + * don't bother creating a page. + */ + ret = 0; + goto ppage; + } else + if ((ret = memp_fget(mpf, + &argp->pgno, DB_MPOOL_CREATE, &pagep)) != 0) + goto out; + } + + /* + * There are three pages we need to check. The one on which we are + * adding data, the previous one whose next_pointer may have + * been updated, and the next one whose prev_pointer may have + * been updated. 
+ */ + cmp_n = log_compare(lsnp, &LSN(pagep)); + cmp_p = log_compare(&LSN(pagep), &argp->pagelsn); + CHECK_LSN(op, cmp_p, &LSN(pagep), &argp->pagelsn); + change = 0; + if ((cmp_p == 0 && DB_REDO(op) && argp->opcode == DB_ADD_BIG) || + (cmp_n == 0 && DB_UNDO(op) && argp->opcode == DB_REM_BIG)) { + /* We are either redo-ing an add, or undoing a delete. */ + P_INIT(pagep, file_dbp->pgsize, argp->pgno, argp->prev_pgno, + argp->next_pgno, 0, P_OVERFLOW); + OV_LEN(pagep) = argp->dbt.size; + OV_REF(pagep) = 1; + memcpy((u_int8_t *)pagep + P_OVERHEAD, argp->dbt.data, + argp->dbt.size); + PREV_PGNO(pagep) = argp->prev_pgno; + change = DB_MPOOL_DIRTY; + } else if ((cmp_n == 0 && DB_UNDO(op) && argp->opcode == DB_ADD_BIG) || + (cmp_p == 0 && DB_REDO(op) && argp->opcode == DB_REM_BIG)) { + /* + * We are either undo-ing an add or redo-ing a delete. + * The page is about to be reclaimed in either case, so + * there really isn't anything to do here. + */ + change = DB_MPOOL_DIRTY; + } + if (change) + LSN(pagep) = DB_REDO(op) ? *lsnp : argp->pagelsn; + + if ((ret = memp_fput(mpf, pagep, change)) != 0) + goto out; + + /* Now check the previous page. */ +ppage: if (argp->prev_pgno != PGNO_INVALID) { + change = 0; + if ((ret = memp_fget(mpf, &argp->prev_pgno, 0, &pagep)) != 0) { + if (DB_UNDO(op)) { + /* + * We are undoing and the page doesn't exist. + * That is equivalent to having a pagelsn of 0, + * so we would not have to undo anything. In + * this case, don't bother creating a page. + */ + *lsnp = argp->prev_lsn; + ret = 0; + goto npage; + } else + if ((ret = memp_fget(mpf, &argp->prev_pgno, + DB_MPOOL_CREATE, &pagep)) != 0) + goto out; + } + + cmp_n = log_compare(lsnp, &LSN(pagep)); + cmp_p = log_compare(&LSN(pagep), &argp->prevlsn); + CHECK_LSN(op, cmp_p, &LSN(pagep), &argp->prevlsn); + + if ((cmp_p == 0 && DB_REDO(op) && argp->opcode == DB_ADD_BIG) || + (cmp_n == 0 && DB_UNDO(op) && argp->opcode == DB_REM_BIG)) { + /* Redo add, undo delete. 
*/ + NEXT_PGNO(pagep) = argp->pgno; + change = DB_MPOOL_DIRTY; + } else if ((cmp_n == 0 && + DB_UNDO(op) && argp->opcode == DB_ADD_BIG) || + (cmp_p == 0 && DB_REDO(op) && argp->opcode == DB_REM_BIG)) { + /* Redo delete, undo add. */ + NEXT_PGNO(pagep) = argp->next_pgno; + change = DB_MPOOL_DIRTY; + } + if (change) + LSN(pagep) = DB_REDO(op) ? *lsnp : argp->prevlsn; + if ((ret = memp_fput(mpf, pagep, change)) != 0) + goto out; + } + + /* Now check the next page. Can only be set on a delete. */ +npage: if (argp->next_pgno != PGNO_INVALID) { + change = 0; + if ((ret = memp_fget(mpf, &argp->next_pgno, 0, &pagep)) != 0) { + if (DB_UNDO(op)) { + /* + * We are undoing and the page doesn't exist. + * That is equivalent to having a pagelsn of 0, + * so we would not have to undo anything. In + * this case, don't bother creating a page. + */ + goto done; + } else + if ((ret = memp_fget(mpf, &argp->next_pgno, + DB_MPOOL_CREATE, &pagep)) != 0) + goto out; + } + + cmp_n = log_compare(lsnp, &LSN(pagep)); + cmp_p = log_compare(&LSN(pagep), &argp->nextlsn); + CHECK_LSN(op, cmp_p, &LSN(pagep), &argp->nextlsn); + if (cmp_p == 0 && DB_REDO(op)) { + PREV_PGNO(pagep) = PGNO_INVALID; + change = DB_MPOOL_DIRTY; + } else if (cmp_n == 0 && DB_UNDO(op)) { + PREV_PGNO(pagep) = argp->pgno; + change = DB_MPOOL_DIRTY; + } + if (change) + LSN(pagep) = DB_REDO(op) ? *lsnp : argp->nextlsn; + if ((ret = memp_fput(mpf, pagep, change)) != 0) + goto out; + } + +done: *lsnp = argp->prev_lsn; + ret = 0; + +out: REC_CLOSE; +} + +/* + * __db_ovref_recover -- + * Recovery function for __db_ovref(). 
+ * + * PUBLIC: int __db_ovref_recover __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *)); + */ +int +__db_ovref_recover(dbenv, dbtp, lsnp, op, info) + DB_ENV *dbenv; + DBT *dbtp; + DB_LSN *lsnp; + db_recops op; + void *info; +{ + __db_ovref_args *argp; + DB *file_dbp; + DBC *dbc; + DB_MPOOLFILE *mpf; + PAGE *pagep; + int cmp, modified, ret; + + COMPQUIET(info, NULL); + REC_PRINT(__db_ovref_print); + REC_INTRO(__db_ovref_read, 1); + + if ((ret = memp_fget(mpf, &argp->pgno, 0, &pagep)) != 0) { + if (DB_UNDO(op)) + goto done; + (void)__db_pgerr(file_dbp, argp->pgno); + goto out; + } + + modified = 0; + cmp = log_compare(&LSN(pagep), &argp->lsn); + CHECK_LSN(op, cmp, &LSN(pagep), &argp->lsn); + if (cmp == 0 && DB_REDO(op)) { + /* Need to redo update described. */ + OV_REF(pagep) += argp->adjust; + + pagep->lsn = *lsnp; + modified = 1; + } else if (log_compare(lsnp, &LSN(pagep)) == 0 && DB_UNDO(op)) { + /* Need to undo update described. */ + OV_REF(pagep) -= argp->adjust; + + pagep->lsn = argp->lsn; + modified = 1; + } + if ((ret = memp_fput(mpf, pagep, modified ? DB_MPOOL_DIRTY : 0)) != 0) + goto out; + +done: *lsnp = argp->prev_lsn; + ret = 0; + +out: REC_CLOSE; +} + +/* + * __db_relink_recover -- + * Recovery function for relink. + * + * PUBLIC: int __db_relink_recover + * PUBLIC: __P((DB_ENV *, DBT *, DB_LSN *, db_recops, void *)); + */ +int +__db_relink_recover(dbenv, dbtp, lsnp, op, info) + DB_ENV *dbenv; + DBT *dbtp; + DB_LSN *lsnp; + db_recops op; + void *info; +{ + __db_relink_args *argp; + DB *file_dbp; + DBC *dbc; + DB_MPOOLFILE *mpf; + PAGE *pagep; + int cmp_n, cmp_p, modified, ret; + + COMPQUIET(info, NULL); + REC_PRINT(__db_relink_print); + REC_INTRO(__db_relink_read, 1); + + /* + * There are up to three pages we need to check -- the page, and the + * previous and next pages, if they existed. For a page add operation, + * the current page is the result of a split and is being recovered + * elsewhere, so all we need do is recover the next page. 
+ */ + if ((ret = memp_fget(mpf, &argp->pgno, 0, &pagep)) != 0) { + if (DB_REDO(op)) { + (void)__db_pgerr(file_dbp, argp->pgno); + goto out; + } + goto next2; + } + modified = 0; + if (argp->opcode == DB_ADD_PAGE) + goto next1; + + cmp_p = log_compare(&LSN(pagep), &argp->lsn); + CHECK_LSN(op, cmp_p, &LSN(pagep), &argp->lsn); + if (cmp_p == 0 && DB_REDO(op)) { + /* Redo the relink. */ + pagep->lsn = *lsnp; + modified = 1; + } else if (log_compare(lsnp, &LSN(pagep)) == 0 && DB_UNDO(op)) { + /* Undo the relink. */ + pagep->next_pgno = argp->next; + pagep->prev_pgno = argp->prev; + + pagep->lsn = argp->lsn; + modified = 1; + } +next1: if ((ret = memp_fput(mpf, pagep, modified ? DB_MPOOL_DIRTY : 0)) != 0) + goto out; + +next2: if ((ret = memp_fget(mpf, &argp->next, 0, &pagep)) != 0) { + if (DB_REDO(op)) { + (void)__db_pgerr(file_dbp, argp->next); + goto out; + } + goto prev; + } + modified = 0; + cmp_n = log_compare(lsnp, &LSN(pagep)); + cmp_p = log_compare(&LSN(pagep), &argp->lsn_next); + CHECK_LSN(op, cmp_p, &LSN(pagep), &argp->lsn_next); + if ((argp->opcode == DB_REM_PAGE && cmp_p == 0 && DB_REDO(op)) || + (argp->opcode == DB_ADD_PAGE && cmp_n == 0 && DB_UNDO(op))) { + /* Redo the remove or undo the add. */ + pagep->prev_pgno = argp->prev; + + modified = 1; + } else if ((argp->opcode == DB_REM_PAGE && cmp_n == 0 && DB_UNDO(op)) || + (argp->opcode == DB_ADD_PAGE && cmp_p == 0 && DB_REDO(op))) { + /* Undo the remove or redo the add. */ + pagep->prev_pgno = argp->pgno; + + modified = 1; + } + if (modified == 1) { + if (DB_UNDO(op)) + pagep->lsn = argp->lsn_next; + else + pagep->lsn = *lsnp; + } + if ((ret = memp_fput(mpf, pagep, modified ? 
DB_MPOOL_DIRTY : 0)) != 0) + goto out; + if (argp->opcode == DB_ADD_PAGE) + goto done; + +prev: if ((ret = memp_fget(mpf, &argp->prev, 0, &pagep)) != 0) { + if (DB_REDO(op)) { + (void)__db_pgerr(file_dbp, argp->prev); + goto out; + } + goto done; + } + modified = 0; + cmp_p = log_compare(&LSN(pagep), &argp->lsn_prev); + CHECK_LSN(op, cmp_p, &LSN(pagep), &argp->lsn_prev); + if (cmp_p == 0 && DB_REDO(op)) { + /* Redo the relink. */ + pagep->next_pgno = argp->next; + + modified = 1; + } else if (log_compare(lsnp, &LSN(pagep)) == 0 && DB_UNDO(op)) { + /* Undo the relink. */ + pagep->next_pgno = argp->pgno; + + modified = 1; + } + if (modified == 1) { + if (DB_UNDO(op)) + pagep->lsn = argp->lsn_prev; + else + pagep->lsn = *lsnp; + } + if ((ret = memp_fput(mpf, pagep, modified ? DB_MPOOL_DIRTY : 0)) != 0) + goto out; + +done: *lsnp = argp->prev_lsn; + ret = 0; + +out: REC_CLOSE; +} + +/* + * __db_debug_recover -- + * Recovery function for debug. + * + * PUBLIC: int __db_debug_recover __P((DB_ENV *, + * PUBLIC: DBT *, DB_LSN *, db_recops, void *)); + */ +int +__db_debug_recover(dbenv, dbtp, lsnp, op, info) + DB_ENV *dbenv; + DBT *dbtp; + DB_LSN *lsnp; + db_recops op; + void *info; +{ + __db_debug_args *argp; + int ret; + + COMPQUIET(op, 0); + COMPQUIET(dbenv, NULL); + COMPQUIET(info, NULL); + + REC_PRINT(__db_debug_print); + REC_NOOP_INTRO(__db_debug_read); + + *lsnp = argp->prev_lsn; + ret = 0; + + REC_NOOP_CLOSE; +} + +/* + * __db_noop_recover -- + * Recovery function for noop. 
+ * + * PUBLIC: int __db_noop_recover __P((DB_ENV *, + * PUBLIC: DBT *, DB_LSN *, db_recops, void *)); + */ +int +__db_noop_recover(dbenv, dbtp, lsnp, op, info) + DB_ENV *dbenv; + DBT *dbtp; + DB_LSN *lsnp; + db_recops op; + void *info; +{ + __db_noop_args *argp; + DB *file_dbp; + DBC *dbc; + DB_MPOOLFILE *mpf; + PAGE *pagep; + u_int32_t change; + int cmp_n, cmp_p, ret; + + COMPQUIET(info, NULL); + REC_PRINT(__db_noop_print); + REC_INTRO(__db_noop_read, 0); + + if ((ret = memp_fget(mpf, &argp->pgno, 0, &pagep)) != 0) + goto out; + + cmp_n = log_compare(lsnp, &LSN(pagep)); + cmp_p = log_compare(&LSN(pagep), &argp->prevlsn); + CHECK_LSN(op, cmp_p, &LSN(pagep), &argp->prevlsn); + change = 0; + if (cmp_p == 0 && DB_REDO(op)) { + LSN(pagep) = *lsnp; + change = DB_MPOOL_DIRTY; + } else if (cmp_n == 0 && DB_UNDO(op)) { + LSN(pagep) = argp->prevlsn; + change = DB_MPOOL_DIRTY; + } + ret = memp_fput(mpf, pagep, change); + +done: *lsnp = argp->prev_lsn; +out: REC_CLOSE; +} diff --git a/bdb/db/db_reclaim.c b/bdb/db/db_reclaim.c new file mode 100644 index 00000000000..739f348407d --- /dev/null +++ b/bdb/db/db_reclaim.c @@ -0,0 +1,134 @@ +/*- + * See the file LICENSE for redistribution information. + * + * Copyright (c) 1996, 1997, 1998, 1999, 2000 + * Sleepycat Software. All rights reserved. + */ + +#include "db_config.h" + +#ifndef lint +static const char revid[] = "$Id: db_reclaim.c,v 11.5 2000/04/07 14:26:58 bostic Exp $"; +#endif /* not lint */ + +#ifndef NO_SYSTEM_INCLUDES +#include <sys/types.h> +#endif + +#include "db_int.h" +#include "db_page.h" +#include "db_am.h" + +/* + * Assume that we enter with a valid pgno. We traverse a set of + * duplicate pages. The format of the callback routine is: + * callback(dbp, page, cookie, did_put). did_put is an output + * value that will be set to 1 by the callback routine if it + * already put the page back. Otherwise, this routine must + * put the page. 
+ * + * PUBLIC: int __db_traverse_dup __P((DB *, + * PUBLIC: db_pgno_t, int (*)(DB *, PAGE *, void *, int *), void *)); + */ +int +__db_traverse_dup(dbp, pgno, callback, cookie) + DB *dbp; + db_pgno_t pgno; + int (*callback) __P((DB *, PAGE *, void *, int *)); + void *cookie; +{ + PAGE *p; + int did_put, i, opgno, ret; + + do { + did_put = 0; + if ((ret = memp_fget(dbp->mpf, &pgno, 0, &p)) != 0) + return (ret); + pgno = NEXT_PGNO(p); + + for (i = 0; i < NUM_ENT(p); i++) { + if (B_TYPE(GET_BKEYDATA(p, i)->type) == B_OVERFLOW) { + opgno = GET_BOVERFLOW(p, i)->pgno; + if ((ret = __db_traverse_big(dbp, + opgno, callback, cookie)) != 0) + goto err; + } + } + + if ((ret = callback(dbp, p, cookie, &did_put)) != 0) + goto err; + + if (!did_put) + if ((ret = memp_fput(dbp->mpf, p, 0)) != 0) + return (ret); + } while (pgno != PGNO_INVALID); + + if (0) { +err: if (did_put == 0) + (void)memp_fput(dbp->mpf, p, 0); + } + return (ret); +} + +/* + * __db_traverse_big + * Traverse a chain of overflow pages and call the callback routine + * on each one. The calling convention for the callback is: + * callback(dbp, page, cookie, did_put), + * where did_put is a return value indicating if the page in question has + * already been returned to the mpool. + * + * PUBLIC: int __db_traverse_big __P((DB *, + * PUBLIC: db_pgno_t, int (*)(DB *, PAGE *, void *, int *), void *)); + */ +int +__db_traverse_big(dbp, pgno, callback, cookie) + DB *dbp; + db_pgno_t pgno; + int (*callback) __P((DB *, PAGE *, void *, int *)); + void *cookie; +{ + PAGE *p; + int did_put, ret; + + do { + did_put = 0; + if ((ret = memp_fget(dbp->mpf, &pgno, 0, &p)) != 0) + return (ret); + pgno = NEXT_PGNO(p); + if ((ret = callback(dbp, p, cookie, &did_put)) == 0 && + !did_put) + ret = memp_fput(dbp->mpf, p, 0); + } while (ret == 0 && pgno != PGNO_INVALID); + + return (ret); +} + +/* + * __db_reclaim_callback + * This is the callback routine used during a delete of a subdatabase. 
+ * we are traversing a btree or hash table and trying to free all the + * pages. Since they share common code for duplicates and overflow + * items, we traverse them identically and use this routine to do the + * actual free. The reason that this is callback is because hash uses + * the same traversal code for statistics gathering. + * + * PUBLIC: int __db_reclaim_callback __P((DB *, PAGE *, void *, int *)); + */ +int +__db_reclaim_callback(dbp, p, cookie, putp) + DB *dbp; + PAGE *p; + void *cookie; + int *putp; +{ + int ret; + + COMPQUIET(dbp, NULL); + + if ((ret = __db_free(cookie, p)) != 0) + return (ret); + *putp = 1; + + return (0); +} diff --git a/bdb/db/db_ret.c b/bdb/db/db_ret.c new file mode 100644 index 00000000000..0782de3e450 --- /dev/null +++ b/bdb/db/db_ret.c @@ -0,0 +1,160 @@ +/*- + * See the file LICENSE for redistribution information. + * + * Copyright (c) 1996, 1997, 1998, 1999, 2000 + * Sleepycat Software. All rights reserved. + */ + +#include "db_config.h" + +#ifndef lint +static const char revid[] = "$Id: db_ret.c,v 11.12 2000/11/30 00:58:33 ubell Exp $"; +#endif /* not lint */ + +#ifndef NO_SYSTEM_INCLUDES +#include <sys/types.h> + +#include <string.h> +#endif + +#include "db_int.h" +#include "db_page.h" +#include "btree.h" +#include "db_am.h" + +/* + * __db_ret -- + * Build return DBT. 
+ * + * PUBLIC: int __db_ret __P((DB *, + * PUBLIC: PAGE *, u_int32_t, DBT *, void **, u_int32_t *)); + */ +int +__db_ret(dbp, h, indx, dbt, memp, memsize) + DB *dbp; + PAGE *h; + u_int32_t indx; + DBT *dbt; + void **memp; + u_int32_t *memsize; +{ + BKEYDATA *bk; + HOFFPAGE ho; + BOVERFLOW *bo; + u_int32_t len; + u_int8_t *hk; + void *data; + + switch (TYPE(h)) { + case P_HASH: + hk = P_ENTRY(h, indx); + if (HPAGE_PTYPE(hk) == H_OFFPAGE) { + memcpy(&ho, hk, sizeof(HOFFPAGE)); + return (__db_goff(dbp, dbt, + ho.tlen, ho.pgno, memp, memsize)); + } + len = LEN_HKEYDATA(h, dbp->pgsize, indx); + data = HKEYDATA_DATA(hk); + break; + case P_LBTREE: + case P_LDUP: + case P_LRECNO: + bk = GET_BKEYDATA(h, indx); + if (B_TYPE(bk->type) == B_OVERFLOW) { + bo = (BOVERFLOW *)bk; + return (__db_goff(dbp, dbt, + bo->tlen, bo->pgno, memp, memsize)); + } + len = bk->len; + data = bk->data; + break; + default: + return (__db_pgfmt(dbp, h->pgno)); + } + + return (__db_retcopy(dbp, dbt, data, len, memp, memsize)); +} + +/* + * __db_retcopy -- + * Copy the returned data into the user's DBT, handling special flags. + * + * PUBLIC: int __db_retcopy __P((DB *, DBT *, + * PUBLIC: void *, u_int32_t, void **, u_int32_t *)); + */ +int +__db_retcopy(dbp, dbt, data, len, memp, memsize) + DB *dbp; + DBT *dbt; + void *data; + u_int32_t len; + void **memp; + u_int32_t *memsize; +{ + DB_ENV *dbenv; + int ret; + + dbenv = dbp == NULL ? NULL : dbp->dbenv; + + /* If returning a partial record, reset the length. */ + if (F_ISSET(dbt, DB_DBT_PARTIAL)) { + data = (u_int8_t *)data + dbt->doff; + if (len > dbt->doff) { + len -= dbt->doff; + if (len > dbt->dlen) + len = dbt->dlen; + } else + len = 0; + } + + /* + * Return the length of the returned record in the DBT size field. + * This satisfies the requirement that if we're using user memory + * and insufficient memory was provided, return the amount necessary + * in the size field. 
+	 */
+	dbt->size = len;
+
+	/*
+	 * Allocate memory to be owned by the application: DB_DBT_MALLOC,
+	 * DB_DBT_REALLOC.
+	 *
+	 * !!!
+	 * We always allocate memory, even if we're copying out 0 bytes. This
+	 * guarantees consistency, i.e., the application can always free memory
+	 * without concern as to how many bytes of the record were requested.
+	 *
+	 * Use the memory specified by the application: DB_DBT_USERMEM.
+	 *
+	 * !!!
+	 * If the length we're going to copy is 0, the application-supplied
+	 * memory pointer is allowed to be NULL.
+	 */
+	if (F_ISSET(dbt, DB_DBT_MALLOC)) {
+		if ((ret = __os_malloc(dbenv, len,
+		    dbp == NULL ? NULL : dbp->db_malloc, &dbt->data)) != 0)
+			return (ret);
+	} else if (F_ISSET(dbt, DB_DBT_REALLOC)) {
+		if ((ret = __os_realloc(dbenv, len,
+		    dbp == NULL ? NULL : dbp->db_realloc, &dbt->data)) != 0)
+			return (ret);
+	} else if (F_ISSET(dbt, DB_DBT_USERMEM)) {
+		if (len != 0 && (dbt->data == NULL || dbt->ulen < len))
+			return (ENOMEM);
+	} else if (memp == NULL || memsize == NULL) {
+		return (EINVAL);
+	} else {
+		if (len != 0 && (*memsize == 0 || *memsize < len)) {
+			if ((ret = __os_realloc(dbenv, len, NULL, memp)) != 0) {
+				*memsize = 0;
+				return (ret);
+			}
+			*memsize = len;
+		}
+		dbt->data = *memp;
+	}
+
+	if (len != 0)
+		memcpy(dbt->data, data, len);
+	return (0);
+}
diff --git a/bdb/db/db_upg.c b/bdb/db/db_upg.c
new file mode 100644
index 00000000000..d8573146ad6
--- /dev/null
+++ b/bdb/db/db_upg.c
@@ -0,0 +1,338 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ *	Sleepycat Software. All rights reserved.
+ */ + +#include "db_config.h" + +#ifndef lint +static const char revid[] = "$Id: db_upg.c,v 11.20 2000/12/12 17:35:30 bostic Exp $"; +#endif /* not lint */ + +#ifndef NO_SYSTEM_INCLUDES +#include <sys/types.h> + +#include <string.h> +#endif + +#include "db_int.h" +#include "db_page.h" +#include "db_swap.h" +#include "btree.h" +#include "hash.h" +#include "qam.h" + +static int (* const func_31_list[P_PAGETYPE_MAX]) + __P((DB *, char *, u_int32_t, DB_FH *, PAGE *, int *)) = { + NULL, /* P_INVALID */ + NULL, /* __P_DUPLICATE */ + __ham_31_hash, /* P_HASH */ + NULL, /* P_IBTREE */ + NULL, /* P_IRECNO */ + __bam_31_lbtree, /* P_LBTREE */ + NULL, /* P_LRECNO */ + NULL, /* P_OVERFLOW */ + __ham_31_hashmeta, /* P_HASHMETA */ + __bam_31_btreemeta, /* P_BTREEMETA */ +}; + +static int __db_page_pass __P((DB *, char *, u_int32_t, int (* const []) + (DB *, char *, u_int32_t, DB_FH *, PAGE *, int *), DB_FH *)); + +/* + * __db_upgrade -- + * Upgrade an existing database. + * + * PUBLIC: int __db_upgrade __P((DB *, const char *, u_int32_t)); + */ +int +__db_upgrade(dbp, fname, flags) + DB *dbp; + const char *fname; + u_int32_t flags; +{ + DB_ENV *dbenv; + DB_FH fh; + size_t n; + int ret, t_ret; + u_int8_t mbuf[256]; + char *real_name; + + dbenv = dbp->dbenv; + + /* Validate arguments. */ + if ((ret = __db_fchk(dbenv, "DB->upgrade", flags, DB_DUPSORT)) != 0) + return (ret); + + /* Get the real backing file name. */ + if ((ret = __db_appname(dbenv, + DB_APP_DATA, NULL, fname, 0, NULL, &real_name)) != 0) + return (ret); + + /* Open the file. */ + if ((ret = __os_open(dbenv, real_name, 0, 0, &fh)) != 0) { + __db_err(dbenv, "%s: %s", real_name, db_strerror(ret)); + return (ret); + } + + /* Initialize the feedback. */ + if (dbp->db_feedback != NULL) + dbp->db_feedback(dbp, DB_UPGRADE, 0); + + /* + * Read the metadata page. We read 256 bytes, which is larger than + * any access method's metadata page and smaller than any disk sector. 
+ */ + if ((ret = __os_read(dbenv, &fh, mbuf, sizeof(mbuf), &n)) != 0) + goto err; + + switch (((DBMETA *)mbuf)->magic) { + case DB_BTREEMAGIC: + switch (((DBMETA *)mbuf)->version) { + case 6: + /* + * Before V7 not all pages had page types, so we do the + * single meta-data page by hand. + */ + if ((ret = + __bam_30_btreemeta(dbp, real_name, mbuf)) != 0) + goto err; + if ((ret = __os_seek(dbenv, + &fh, 0, 0, 0, 0, DB_OS_SEEK_SET)) != 0) + goto err; + if ((ret = __os_write(dbenv, &fh, mbuf, 256, &n)) != 0) + goto err; + /* FALLTHROUGH */ + case 7: + /* + * We need the page size to do more. Rip it out of + * the meta-data page. + */ + memcpy(&dbp->pgsize, mbuf + 20, sizeof(u_int32_t)); + + if ((ret = __db_page_pass( + dbp, real_name, flags, func_31_list, &fh)) != 0) + goto err; + /* FALLTHROUGH */ + case 8: + break; + default: + __db_err(dbenv, "%s: unsupported btree version: %lu", + real_name, (u_long)((DBMETA *)mbuf)->version); + ret = DB_OLD_VERSION; + goto err; + } + break; + case DB_HASHMAGIC: + switch (((DBMETA *)mbuf)->version) { + case 4: + case 5: + /* + * Before V6 not all pages had page types, so we do the + * single meta-data page by hand. + */ + if ((ret = + __ham_30_hashmeta(dbp, real_name, mbuf)) != 0) + goto err; + if ((ret = __os_seek(dbenv, + &fh, 0, 0, 0, 0, DB_OS_SEEK_SET)) != 0) + goto err; + if ((ret = __os_write(dbenv, &fh, mbuf, 256, &n)) != 0) + goto err; + + /* + * Before V6, we created hash pages one by one as they + * were needed, using hashhdr.ovfl_point to reserve + * a block of page numbers for them. A consequence + * of this was that, if no overflow pages had been + * created, the current doubling might extend past + * the end of the database file. + * + * In DB 3.X, we now create all the hash pages + * belonging to a doubling atomicly; it's not + * safe to just save them for later, because when + * we create an overflow page we'll just create + * a new last page (whatever that may be). 
Grow + * the database to the end of the current doubling. + */ + if ((ret = + __ham_30_sizefix(dbp, &fh, real_name, mbuf)) != 0) + goto err; + /* FALLTHROUGH */ + case 6: + /* + * We need the page size to do more. Rip it out of + * the meta-data page. + */ + memcpy(&dbp->pgsize, mbuf + 20, sizeof(u_int32_t)); + + if ((ret = __db_page_pass( + dbp, real_name, flags, func_31_list, &fh)) != 0) + goto err; + /* FALLTHROUGH */ + case 7: + break; + default: + __db_err(dbenv, "%s: unsupported hash version: %lu", + real_name, (u_long)((DBMETA *)mbuf)->version); + ret = DB_OLD_VERSION; + goto err; + } + break; + case DB_QAMMAGIC: + switch (((DBMETA *)mbuf)->version) { + case 1: + /* + * If we're in a Queue database, the only page that + * needs upgrading is the meta-database page, don't + * bother with a full pass. + */ + if ((ret = __qam_31_qammeta(dbp, real_name, mbuf)) != 0) + return (ret); + /* FALLTHROUGH */ + case 2: + if ((ret = __qam_32_qammeta(dbp, real_name, mbuf)) != 0) + return (ret); + if ((ret = __os_seek(dbenv, + &fh, 0, 0, 0, 0, DB_OS_SEEK_SET)) != 0) + goto err; + if ((ret = __os_write(dbenv, &fh, mbuf, 256, &n)) != 0) + goto err; + /* FALLTHROUGH */ + case 3: + break; + default: + __db_err(dbenv, "%s: unsupported queue version: %lu", + real_name, (u_long)((DBMETA *)mbuf)->version); + ret = DB_OLD_VERSION; + goto err; + } + break; + default: + M_32_SWAP(((DBMETA *)mbuf)->magic); + switch (((DBMETA *)mbuf)->magic) { + case DB_BTREEMAGIC: + case DB_HASHMAGIC: + case DB_QAMMAGIC: + __db_err(dbenv, + "%s: DB->upgrade only supported on native byte-order systems", + real_name); + break; + default: + __db_err(dbenv, + "%s: unrecognized file type", real_name); + break; + } + ret = EINVAL; + goto err; + } + + ret = __os_fsync(dbenv, &fh); + +err: if ((t_ret = __os_closehandle(&fh)) != 0 && ret == 0) + ret = t_ret; + __os_freestr(real_name); + + /* We're done. 
*/ + if (dbp->db_feedback != NULL) + dbp->db_feedback(dbp, DB_UPGRADE, 100); + + return (ret); +} + +/* + * __db_page_pass -- + * Walk the pages of the database, upgrading whatever needs it. + */ +static int +__db_page_pass(dbp, real_name, flags, fl, fhp) + DB *dbp; + char *real_name; + u_int32_t flags; + int (* const fl[P_PAGETYPE_MAX]) + __P((DB *, char *, u_int32_t, DB_FH *, PAGE *, int *)); + DB_FH *fhp; +{ + DB_ENV *dbenv; + PAGE *page; + db_pgno_t i, pgno_last; + size_t n; + int dirty, ret; + + dbenv = dbp->dbenv; + + /* Determine the last page of the file. */ + if ((ret = __db_lastpgno(dbp, real_name, fhp, &pgno_last)) != 0) + return (ret); + + /* Allocate memory for a single page. */ + if ((ret = __os_malloc(dbenv, dbp->pgsize, NULL, &page)) != 0) + return (ret); + + /* Walk the file, calling the underlying conversion functions. */ + for (i = 0; i < pgno_last; ++i) { + if (dbp->db_feedback != NULL) + dbp->db_feedback(dbp, DB_UPGRADE, (i * 100)/pgno_last); + if ((ret = __os_seek(dbenv, + fhp, dbp->pgsize, i, 0, 0, DB_OS_SEEK_SET)) != 0) + break; + if ((ret = __os_read(dbenv, fhp, page, dbp->pgsize, &n)) != 0) + break; + dirty = 0; + if (fl[TYPE(page)] != NULL && (ret = fl[TYPE(page)] + (dbp, real_name, flags, fhp, page, &dirty)) != 0) + break; + if (dirty) { + if ((ret = __os_seek(dbenv, + fhp, dbp->pgsize, i, 0, 0, DB_OS_SEEK_SET)) != 0) + break; + if ((ret = __os_write(dbenv, + fhp, page, dbp->pgsize, &n)) != 0) + break; + } + } + + __os_free(page, dbp->pgsize); + return (ret); +} + +/* + * __db_lastpgno -- + * Return the current last page number of the file. 
+ *
+ * PUBLIC: int __db_lastpgno __P((DB *, char *, DB_FH *, db_pgno_t *));
+ */
+int
+__db_lastpgno(dbp, real_name, fhp, pgno_lastp)
+	DB *dbp;
+	char *real_name;
+	DB_FH *fhp;
+	db_pgno_t *pgno_lastp;
+{
+	DB_ENV *dbenv;
+	db_pgno_t pgno_last;
+	u_int32_t mbytes, bytes;
+	int ret;
+
+	dbenv = dbp->dbenv;
+
+	if ((ret = __os_ioinfo(dbenv,
+	    real_name, fhp, &mbytes, &bytes, NULL)) != 0) {
+		__db_err(dbenv, "%s: %s", real_name, db_strerror(ret));
+		return (ret);
+	}
+
+	/* Page sizes have to be a power-of-two. */
+	if (bytes % dbp->pgsize != 0) {
+		__db_err(dbenv,
+		    "%s: file size not a multiple of the pagesize", real_name);
+		return (EINVAL);
+	}
+	pgno_last = mbytes * (MEGABYTE / dbp->pgsize);
+	pgno_last += bytes / dbp->pgsize;
+
+	*pgno_lastp = pgno_last;
+	return (0);
+}
diff --git a/bdb/db/db_upg_opd.c b/bdb/db/db_upg_opd.c
new file mode 100644
index 00000000000..a7be784afb8
--- /dev/null
+++ b/bdb/db/db_upg_opd.c
@@ -0,0 +1,353 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 1996, 1997, 1998, 1999, 2000
+ *	Sleepycat Software. All rights reserved.
+ */ + +#include "db_config.h" + +#ifndef lint +static const char revid[] = "$Id: db_upg_opd.c,v 11.9 2000/11/30 00:58:33 ubell Exp $"; +#endif /* not lint */ + +#ifndef NO_SYSTEM_INCLUDES +#include <sys/types.h> + +#include <string.h> +#endif + +#include "db_int.h" +#include "db_page.h" +#include "db_swap.h" +#include "btree.h" +#include "hash.h" +#include "qam.h" + +static int __db_build_bi __P((DB *, DB_FH *, PAGE *, PAGE *, u_int32_t, int *)); +static int __db_build_ri __P((DB *, DB_FH *, PAGE *, PAGE *, u_int32_t, int *)); +static int __db_up_ovref __P((DB *, DB_FH *, db_pgno_t)); + +#define GET_PAGE(dbp, fhp, pgno, page) { \ + if ((ret = __os_seek(dbp->dbenv, \ + fhp, (dbp)->pgsize, pgno, 0, 0, DB_OS_SEEK_SET)) != 0) \ + goto err; \ + if ((ret = __os_read(dbp->dbenv, \ + fhp, page, (dbp)->pgsize, &n)) != 0) \ + goto err; \ +} +#define PUT_PAGE(dbp, fhp, pgno, page) { \ + if ((ret = __os_seek(dbp->dbenv, \ + fhp, (dbp)->pgsize, pgno, 0, 0, DB_OS_SEEK_SET)) != 0) \ + goto err; \ + if ((ret = __os_write(dbp->dbenv, \ + fhp, page, (dbp)->pgsize, &n)) != 0) \ + goto err; \ +} + +/* + * __db_31_offdup -- + * Convert 3.0 off-page duplicates to 3.1 off-page duplicates. + * + * PUBLIC: int __db_31_offdup __P((DB *, char *, DB_FH *, int, db_pgno_t *)); + */ +int +__db_31_offdup(dbp, real_name, fhp, sorted, pgnop) + DB *dbp; + char *real_name; + DB_FH *fhp; + int sorted; + db_pgno_t *pgnop; +{ + PAGE *ipage, *page; + db_indx_t indx; + db_pgno_t cur_cnt, i, next_cnt, pgno, *pgno_cur, pgno_last; + db_pgno_t *pgno_next, pgno_max, *tmp; + db_recno_t nrecs; + size_t n; + int level, nomem, ret; + + ipage = page = NULL; + pgno_cur = pgno_next = NULL; + + /* Allocate room to hold a page. */ + if ((ret = __os_malloc(dbp->dbenv, dbp->pgsize, NULL, &page)) != 0) + goto err; + + /* + * Walk the chain of 3.0 off-page duplicates. Each one is converted + * in place to a 3.1 off-page duplicate page. 
If the duplicates are + * sorted, they are converted to a Btree leaf page, otherwise to a + * Recno leaf page. + */ + for (nrecs = 0, cur_cnt = pgno_max = 0, + pgno = *pgnop; pgno != PGNO_INVALID;) { + if (pgno_max == cur_cnt) { + pgno_max += 20; + if ((ret = __os_realloc(dbp->dbenv, pgno_max * + sizeof(db_pgno_t), NULL, &pgno_cur)) != 0) + goto err; + } + pgno_cur[cur_cnt++] = pgno; + + GET_PAGE(dbp, fhp, pgno, page); + nrecs += NUM_ENT(page); + LEVEL(page) = LEAFLEVEL; + TYPE(page) = sorted ? P_LDUP : P_LRECNO; + /* + * !!! + * DB didn't zero the LSNs on off-page duplicates pages. + */ + ZERO_LSN(LSN(page)); + PUT_PAGE(dbp, fhp, pgno, page); + + pgno = NEXT_PGNO(page); + } + + /* If we only have a single page, it's easy. */ + if (cur_cnt > 1) { + /* + * pgno_cur is the list of pages we just converted. We're + * going to walk that list, but we'll need to create a new + * list while we do so. + */ + if ((ret = __os_malloc(dbp->dbenv, + cur_cnt * sizeof(db_pgno_t), NULL, &pgno_next)) != 0) + goto err; + + /* Figure out where we can start allocating new pages. */ + if ((ret = __db_lastpgno(dbp, real_name, fhp, &pgno_last)) != 0) + goto err; + + /* Allocate room for an internal page. */ + if ((ret = __os_malloc(dbp->dbenv, + dbp->pgsize, NULL, &ipage)) != 0) + goto err; + PGNO(ipage) = PGNO_INVALID; + } + + /* + * Repeatedly walk the list of pages, building internal pages, until + * there's only one page at a level. + */ + for (level = LEAFLEVEL + 1; cur_cnt > 1; ++level) { + for (indx = 0, i = next_cnt = 0; i < cur_cnt;) { + if (indx == 0) { + P_INIT(ipage, dbp->pgsize, pgno_last, + PGNO_INVALID, PGNO_INVALID, + level, sorted ? P_IBTREE : P_IRECNO); + ZERO_LSN(LSN(ipage)); + + pgno_next[next_cnt++] = pgno_last++; + } + + GET_PAGE(dbp, fhp, pgno_cur[i], page); + + /* + * If the duplicates are sorted, put the first item on + * the lower-level page onto a Btree internal page. If + * the duplicates are not sorted, create an internal + * Recno structure on the page. 
If either case doesn't + * fit, push out the current page and start a new one. + */ + nomem = 0; + if (sorted) { + if ((ret = __db_build_bi( + dbp, fhp, ipage, page, indx, &nomem)) != 0) + goto err; + } else + if ((ret = __db_build_ri( + dbp, fhp, ipage, page, indx, &nomem)) != 0) + goto err; + if (nomem) { + indx = 0; + PUT_PAGE(dbp, fhp, PGNO(ipage), ipage); + } else { + ++indx; + ++NUM_ENT(ipage); + ++i; + } + } + + /* + * Push out the last internal page. Set the top-level record + * count if we've reached the top. + */ + if (next_cnt == 1) + RE_NREC_SET(ipage, nrecs); + PUT_PAGE(dbp, fhp, PGNO(ipage), ipage); + + /* Swap the current and next page number arrays. */ + cur_cnt = next_cnt; + tmp = pgno_cur; + pgno_cur = pgno_next; + pgno_next = tmp; + } + + *pgnop = pgno_cur[0]; + +err: if (pgno_cur != NULL) + __os_free(pgno_cur, 0); + if (pgno_next != NULL) + __os_free(pgno_next, 0); + if (ipage != NULL) + __os_free(ipage, dbp->pgsize); + if (page != NULL) + __os_free(page, dbp->pgsize); + + return (ret); +} + +/* + * __db_build_bi -- + * Build a BINTERNAL entry for a parent page. + */ +static int +__db_build_bi(dbp, fhp, ipage, page, indx, nomemp) + DB *dbp; + DB_FH *fhp; + PAGE *ipage, *page; + u_int32_t indx; + int *nomemp; +{ + BINTERNAL bi, *child_bi; + BKEYDATA *child_bk; + u_int8_t *p; + int ret; + + switch (TYPE(page)) { + case P_IBTREE: + child_bi = GET_BINTERNAL(page, 0); + if (P_FREESPACE(ipage) < BINTERNAL_PSIZE(child_bi->len)) { + *nomemp = 1; + return (0); + } + ipage->inp[indx] = + HOFFSET(ipage) -= BINTERNAL_SIZE(child_bi->len); + p = P_ENTRY(ipage, indx); + + bi.len = child_bi->len; + B_TSET(bi.type, child_bi->type, 0); + bi.pgno = PGNO(page); + bi.nrecs = __bam_total(page); + memcpy(p, &bi, SSZA(BINTERNAL, data)); + p += SSZA(BINTERNAL, data); + memcpy(p, child_bi->data, child_bi->len); + + /* Increment the overflow ref count. 
*/ + if (B_TYPE(child_bi->type) == B_OVERFLOW) + if ((ret = __db_up_ovref(dbp, fhp, + ((BOVERFLOW *)(child_bi->data))->pgno)) != 0) + return (ret); + break; + case P_LDUP: + child_bk = GET_BKEYDATA(page, 0); + switch (B_TYPE(child_bk->type)) { + case B_KEYDATA: + if (P_FREESPACE(ipage) < + BINTERNAL_PSIZE(child_bk->len)) { + *nomemp = 1; + return (0); + } + ipage->inp[indx] = + HOFFSET(ipage) -= BINTERNAL_SIZE(child_bk->len); + p = P_ENTRY(ipage, indx); + + bi.len = child_bk->len; + B_TSET(bi.type, child_bk->type, 0); + bi.pgno = PGNO(page); + bi.nrecs = __bam_total(page); + memcpy(p, &bi, SSZA(BINTERNAL, data)); + p += SSZA(BINTERNAL, data); + memcpy(p, child_bk->data, child_bk->len); + break; + case B_OVERFLOW: + if (P_FREESPACE(ipage) < + BINTERNAL_PSIZE(BOVERFLOW_SIZE)) { + *nomemp = 1; + return (0); + } + ipage->inp[indx] = + HOFFSET(ipage) -= BINTERNAL_SIZE(BOVERFLOW_SIZE); + p = P_ENTRY(ipage, indx); + + bi.len = BOVERFLOW_SIZE; + B_TSET(bi.type, child_bk->type, 0); + bi.pgno = PGNO(page); + bi.nrecs = __bam_total(page); + memcpy(p, &bi, SSZA(BINTERNAL, data)); + p += SSZA(BINTERNAL, data); + memcpy(p, child_bk, BOVERFLOW_SIZE); + + /* Increment the overflow ref count. */ + if ((ret = __db_up_ovref(dbp, fhp, + ((BOVERFLOW *)child_bk)->pgno)) != 0) + return (ret); + break; + default: + return (__db_pgfmt(dbp, PGNO(page))); + } + break; + default: + return (__db_pgfmt(dbp, PGNO(page))); + } + + return (0); +} + +/* + * __db_build_ri -- + * Build a RINTERNAL entry for an internal parent page. 
+ */
+static int
+__db_build_ri(dbp, fhp, ipage, page, indx, nomemp)
+	DB *dbp;
+	DB_FH *fhp;
+	PAGE *ipage, *page;
+	u_int32_t indx;
+	int *nomemp;
+{
+	RINTERNAL ri;
+
+	COMPQUIET(dbp, NULL);
+	COMPQUIET(fhp, NULL);
+
+	if (P_FREESPACE(ipage) < RINTERNAL_PSIZE) {
+		*nomemp = 1;
+		return (0);
+	}
+
+	ri.pgno = PGNO(page);
+	ri.nrecs = __bam_total(page);
+	ipage->inp[indx] = HOFFSET(ipage) -= RINTERNAL_SIZE;
+	memcpy(P_ENTRY(ipage, indx), &ri, RINTERNAL_SIZE);
+
+	return (0);
+}
+
+/*
+ * __db_up_ovref --
+ *	Increment/decrement the reference count on an overflow page.
+ */
+static int
+__db_up_ovref(dbp, fhp, pgno)
+	DB *dbp;
+	DB_FH *fhp;
+	db_pgno_t pgno;
+{
+	PAGE *page;
+	size_t n;
+	int ret;
+
+	/* Allocate room to hold a page. */
+	if ((ret = __os_malloc(dbp->dbenv, dbp->pgsize, NULL, &page)) != 0)
+		return (ret);
+
+	GET_PAGE(dbp, fhp, pgno, page);
+	++OV_REF(page);
+	PUT_PAGE(dbp, fhp, pgno, page);
+
+err:	__os_free(page, dbp->pgsize);
+
+	return (ret);
+}
diff --git a/bdb/db/db_vrfy.c b/bdb/db/db_vrfy.c
new file mode 100644
index 00000000000..3509e05e91f
--- /dev/null
+++ b/bdb/db/db_vrfy.c
@@ -0,0 +1,2340 @@
+/*-
+ * See the file LICENSE for redistribution information.
+ *
+ * Copyright (c) 2000
+ *	Sleepycat Software. All rights reserved.
+ * + * $Id: db_vrfy.c,v 1.53 2001/01/11 18:19:51 bostic Exp $ + */ + +#include "db_config.h" + +#ifndef lint +static const char revid[] = "$Id: db_vrfy.c,v 1.53 2001/01/11 18:19:51 bostic Exp $"; +#endif /* not lint */ + +#ifndef NO_SYSTEM_INCLUDES +#include <sys/types.h> + +#include <string.h> +#endif + +#include "db_int.h" +#include "db_page.h" +#include "db_swap.h" +#include "db_verify.h" +#include "db_ext.h" +#include "btree.h" +#include "hash.h" +#include "qam.h" + +static int __db_guesspgsize __P((DB_ENV *, DB_FH *)); +static int __db_is_valid_magicno __P((u_int32_t, DBTYPE *)); +static int __db_is_valid_pagetype __P((u_int32_t)); +static int __db_meta2pgset + __P((DB *, VRFY_DBINFO *, db_pgno_t, u_int32_t, DB *)); +static int __db_salvage_subdbs + __P((DB *, VRFY_DBINFO *, void *, + int(*)(void *, const void *), u_int32_t, int *)); +static int __db_salvage_unknowns + __P((DB *, VRFY_DBINFO *, void *, + int (*)(void *, const void *), u_int32_t)); +static int __db_vrfy_common + __P((DB *, VRFY_DBINFO *, PAGE *, db_pgno_t, u_int32_t)); +static int __db_vrfy_freelist __P((DB *, VRFY_DBINFO *, db_pgno_t, u_int32_t)); +static int __db_vrfy_invalid + __P((DB *, VRFY_DBINFO *, PAGE *, db_pgno_t, u_int32_t)); +static int __db_vrfy_orderchkonly __P((DB *, + VRFY_DBINFO *, const char *, const char *, u_int32_t)); +static int __db_vrfy_pagezero __P((DB *, VRFY_DBINFO *, DB_FH *, u_int32_t)); +static int __db_vrfy_subdbs + __P((DB *, VRFY_DBINFO *, const char *, u_int32_t)); +static int __db_vrfy_structure + __P((DB *, VRFY_DBINFO *, const char *, db_pgno_t, u_int32_t)); +static int __db_vrfy_walkpages + __P((DB *, VRFY_DBINFO *, void *, int (*)(void *, const void *), + u_int32_t)); + +/* + * This is the code for DB->verify, the DB database consistency checker. + * For now, it checks all subdatabases in a database, and verifies + * everything it knows how to (i.e. it's all-or-nothing, and one can't + * check only for a subset of possible problems). 
+ */ + +/* + * __db_verify -- + * Walk the entire file page-by-page, either verifying with or without + * dumping in db_dump -d format, or DB_SALVAGE-ing whatever key/data + * pairs can be found and dumping them in standard (db_load-ready) + * dump format. + * + * (Salvaging isn't really a verification operation, but we put it + * here anyway because it requires essentially identical top-level + * code.) + * + * flags may be 0, DB_NOORDERCHK, DB_ORDERCHKONLY, or DB_SALVAGE + * (and optionally DB_AGGRESSIVE). + * + * __db_verify itself is simply a wrapper to __db_verify_internal, + * which lets us pass appropriate equivalents to FILE * in from the + * non-C APIs. + * + * PUBLIC: int __db_verify + * PUBLIC: __P((DB *, const char *, const char *, FILE *, u_int32_t)); + */ +int +__db_verify(dbp, file, database, outfile, flags) + DB *dbp; + const char *file, *database; + FILE *outfile; + u_int32_t flags; +{ + + return (__db_verify_internal(dbp, + file, database, outfile, __db_verify_callback, flags)); +} + +/* + * __db_verify_callback -- + * Callback function for using pr_* functions from C. + * + * PUBLIC: int __db_verify_callback __P((void *, const void *)); + */ +int +__db_verify_callback(handle, str_arg) + void *handle; + const void *str_arg; +{ + char *str; + FILE *f; + + str = (char *)str_arg; + f = (FILE *)handle; + + if (fprintf(f, "%s", str) != (int)strlen(str)) + return (EIO); + + return (0); +} + +/* + * __db_verify_internal -- + * Inner meat of __db_verify. 
+ * + * PUBLIC: int __db_verify_internal __P((DB *, const char *, + * PUBLIC: const char *, void *, int (*)(void *, const void *), u_int32_t)); + */ +int +__db_verify_internal(dbp_orig, name, subdb, handle, callback, flags) + DB *dbp_orig; + const char *name, *subdb; + void *handle; + int (*callback) __P((void *, const void *)); + u_int32_t flags; +{ + DB *dbp; + DB_ENV *dbenv; + DB_FH fh, *fhp; + PAGE *h; + VRFY_DBINFO *vdp; + db_pgno_t last; + int has, ret, isbad; + char *real_name; + + dbenv = dbp_orig->dbenv; + vdp = NULL; + real_name = NULL; + ret = isbad = 0; + + memset(&fh, 0, sizeof(fh)); + fhp = &fh; + + PANIC_CHECK(dbenv); + DB_ILLEGAL_AFTER_OPEN(dbp_orig, "verify"); + +#define OKFLAGS (DB_AGGRESSIVE | DB_NOORDERCHK | DB_ORDERCHKONLY | DB_SALVAGE) + if ((ret = __db_fchk(dbenv, "DB->verify", flags, OKFLAGS)) != 0) + return (ret); + + /* + * DB_SALVAGE is mutually exclusive with the other flags except + * DB_AGGRESSIVE. + */ + if (LF_ISSET(DB_SALVAGE) && + (flags & ~DB_AGGRESSIVE) != DB_SALVAGE) + return (__db_ferr(dbenv, "__db_verify", 1)); + + if (LF_ISSET(DB_ORDERCHKONLY) && flags != DB_ORDERCHKONLY) + return (__db_ferr(dbenv, "__db_verify", 1)); + + if (LF_ISSET(DB_ORDERCHKONLY) && subdb == NULL) { + __db_err(dbenv, "DB_ORDERCHKONLY requires a database name"); + return (EINVAL); + } + + /* + * Forbid working in an environment that uses transactions or + * locking; we're going to be looking at the file freely, + * and while we're not going to modify it, we aren't obeying + * locking conventions either. + */ + if (TXN_ON(dbenv) || LOCKING_ON(dbenv) || LOGGING_ON(dbenv)) { + dbp_orig->errx(dbp_orig, + "verify may not be used with transactions, logging, or locking"); + return (EINVAL); + /* NOTREACHED */ + } + + /* Create a dbp to use internally, which we can close at our leisure. */ + if ((ret = db_create(&dbp, dbenv, 0)) != 0) + goto err; + + F_SET(dbp, DB_AM_VERIFYING); + + /* Copy the supplied pagesize, which we use if the file one is bogus. 
*/ + if (dbp_orig->pgsize >= DB_MIN_PGSIZE && + dbp_orig->pgsize <= DB_MAX_PGSIZE) + dbp->set_pagesize(dbp, dbp_orig->pgsize); + + /* Copy the feedback function, if present, and initialize it. */ + if (!LF_ISSET(DB_SALVAGE) && dbp_orig->db_feedback != NULL) { + dbp->set_feedback(dbp, dbp_orig->db_feedback); + dbp->db_feedback(dbp, DB_VERIFY, 0); + } + + /* + * Copy the comparison and hashing functions. Note that + * even if the database is not a hash or btree, the respective + * internal structures will have been initialized. + */ + if (dbp_orig->dup_compare != NULL && + (ret = dbp->set_dup_compare(dbp, dbp_orig->dup_compare)) != 0) + goto err; + if (((BTREE *)dbp_orig->bt_internal)->bt_compare != NULL && + (ret = dbp->set_bt_compare(dbp, + ((BTREE *)dbp_orig->bt_internal)->bt_compare)) != 0) + goto err; + if (((HASH *)dbp_orig->h_internal)->h_hash != NULL && + (ret = dbp->set_h_hash(dbp, + ((HASH *)dbp_orig->h_internal)->h_hash)) != 0) + goto err; + + /* + * We don't know how large the cache is, and if the database + * in question uses a small page size--which we don't know + * yet!--it may be uncomfortably small for the default page + * size [#2143]. However, the things we need temporary + * databases for in dbinfo are largely tiny, so using a + * 1024-byte pagesize is probably not going to be a big hit, + * and will make us fit better into small spaces. + */ + if ((ret = __db_vrfy_dbinfo_create(dbenv, 1024, &vdp)) != 0) + goto err; + + /* Find the real name of the file. */ + if ((ret = __db_appname(dbenv, + DB_APP_DATA, NULL, name, 0, NULL, &real_name)) != 0) + goto err; + + /* + * Our first order of business is to verify page 0, which is + * the metadata page for the master database of subdatabases + * or of the only database in the file. We want to do this by hand + * rather than just calling __db_open in case it's corrupt--various + * things in __db_open might act funny. 
+ * + * Once we know the metadata page is healthy, I believe that it's + * safe to open the database normally and then use the page swapping + * code, which makes life easier. + */ + if ((ret = __os_open(dbenv, real_name, DB_OSO_RDONLY, 0444, fhp)) != 0) + goto err; + + /* Verify the metadata page 0; set pagesize and type. */ + if ((ret = __db_vrfy_pagezero(dbp, vdp, fhp, flags)) != 0) { + if (ret == DB_VERIFY_BAD) + isbad = 1; + else + goto err; + } + + /* + * We can assume at this point that dbp->pagesize and dbp->type are + * set correctly, or at least as well as they can be, and that + * locking, logging, and txns are not in use. Thus we can trust + * the memp code not to look at the page, and thus to be safe + * enough to use. + * + * The dbp is not open, but the file is open in the fhp, and we + * cannot assume that __db_open is safe. Call __db_dbenv_setup, + * the [safe] part of __db_open that initializes the environment-- + * and the mpool--manually. + */ + if ((ret = __db_dbenv_setup(dbp, + name, DB_ODDFILESIZE | DB_RDONLY)) != 0) + return (ret); + + /* Mark the dbp as opened, so that we correctly handle its close. */ + F_SET(dbp, DB_OPEN_CALLED); + + /* + * Find out the page number of the last page in the database. + * + * XXX: This currently fails if the last page is of bad type, + * because it calls __db_pgin and that pukes. This is bad. + */ + if ((ret = memp_fget(dbp->mpf, &last, DB_MPOOL_LAST, &h)) != 0) + goto err; + if ((ret = memp_fput(dbp->mpf, h, 0)) != 0) + goto err; + + vdp->last_pgno = last; + + /* + * DB_ORDERCHKONLY is a special case; our file consists of + * several subdatabases, which use different hash, bt_compare, + * and/or dup_compare functions. Consequently, we couldn't verify + * sorting and hashing simply by calling DB->verify() on the file. + * DB_ORDERCHKONLY allows us to come back and check those things; it + * requires a subdatabase, and assumes that everything but that + * database's sorting/hashing is correct. 
+ */ + if (LF_ISSET(DB_ORDERCHKONLY)) { + ret = __db_vrfy_orderchkonly(dbp, vdp, name, subdb, flags); + goto done; + } + + /* + * When salvaging, we use a db to keep track of whether we've seen a + * given overflow or dup page in the course of traversing normal data. + * If in the end we have not, we assume its key got lost and print it + * with key "UNKNOWN". + */ + if (LF_ISSET(DB_SALVAGE)) { + if ((ret = __db_salvage_init(vdp)) != 0) + return (ret); + + /* + * If we're not being aggressive, attempt to crack subdbs. + * "has" will indicate whether the attempt has succeeded + * (even in part), meaning that we have some semblance of + * subdbs; on the walkpages pass, we print out + * whichever data pages we have not seen. + */ + has = 0; + if (!LF_ISSET(DB_AGGRESSIVE) && (__db_salvage_subdbs(dbp, + vdp, handle, callback, flags, &has)) != 0) + isbad = 1; + + /* + * If we have subdatabases, we need to signal that if + * any keys are found that don't belong to a subdatabase, + * they'll need to have an "__OTHER__" subdatabase header + * printed first. Flag this. Else, print a header for + * the normal, non-subdb database. + */ + if (has == 1) + F_SET(vdp, SALVAGE_PRINTHEADER); + else if ((ret = __db_prheader(dbp, + NULL, 0, 0, handle, callback, vdp, PGNO_BASE_MD)) != 0) + goto err; + } + + if ((ret = + __db_vrfy_walkpages(dbp, vdp, handle, callback, flags)) != 0) { + if (ret == DB_VERIFY_BAD) + isbad = 1; + else if (ret != 0) + goto err; + } + + /* If we're verifying, verify inter-page structure. */ + if (!LF_ISSET(DB_SALVAGE) && isbad == 0) + if ((ret = + __db_vrfy_structure(dbp, vdp, name, 0, flags)) != 0) { + if (ret == DB_VERIFY_BAD) + isbad = 1; + else if (ret != 0) + goto err; + } + + /* + * If we're salvaging, output with key UNKNOWN any overflow or dup pages + * we haven't been able to put in context. Then destroy the salvager's + * state-saving database. 
+ */ + if (LF_ISSET(DB_SALVAGE)) { + if ((ret = __db_salvage_unknowns(dbp, + vdp, handle, callback, flags)) != 0) + isbad = 1; + /* No return value, since there's little we can do. */ + __db_salvage_destroy(vdp); + } + + if (0) { +err: (void)__db_err(dbenv, "%s: %s", name, db_strerror(ret)); + } + + if (LF_ISSET(DB_SALVAGE) && + (has == 0 || F_ISSET(vdp, SALVAGE_PRINTFOOTER))) + (void)__db_prfooter(handle, callback); + + /* Send feedback that we're done. */ +done: if (!LF_ISSET(DB_SALVAGE) && dbp->db_feedback != NULL) + dbp->db_feedback(dbp, DB_VERIFY, 100); + + if (F_ISSET(fhp, DB_FH_VALID)) + (void)__os_closehandle(fhp); + if (dbp) + (void)dbp->close(dbp, 0); + if (vdp) + (void)__db_vrfy_dbinfo_destroy(vdp); + if (real_name) + __os_freestr(real_name); + + if ((ret == 0 && isbad == 1) || ret == DB_VERIFY_FATAL) + ret = DB_VERIFY_BAD; + + return (ret); +} + +/* + * __db_vrfy_pagezero -- + * Verify the master metadata page. Use seek, read, and a local buffer + * rather than the DB paging code, for safety. + * + * Must correctly (or best-guess) set dbp->type and dbp->pagesize. + */ +static int +__db_vrfy_pagezero(dbp, vdp, fhp, flags) + DB *dbp; + VRFY_DBINFO *vdp; + DB_FH *fhp; + u_int32_t flags; +{ + DBMETA *meta; + DB_ENV *dbenv; + VRFY_PAGEINFO *pip; + db_pgno_t freelist; + int t_ret, ret, nr, swapped; + u_int8_t mbuf[DBMETASIZE]; + + swapped = ret = t_ret = 0; + freelist = 0; + dbenv = dbp->dbenv; + meta = (DBMETA *)mbuf; + dbp->type = DB_UNKNOWN; + + /* + * Seek to the metadata page. + * Note that if we're just starting a verification, dbp->pgsize + * may be zero; this is okay, as we want page zero anyway and + * 0*0 == 0. 
+ */ + if ((ret = __os_seek(dbenv, fhp, 0, 0, 0, 0, DB_OS_SEEK_SET)) != 0) + goto err; + + if ((ret = __os_read(dbenv, fhp, mbuf, DBMETASIZE, (size_t *)&nr)) != 0) + goto err; + + if (nr != DBMETASIZE) { + EPRINT((dbp->dbenv, + "Incomplete metadata page %lu", (u_long)PGNO_BASE_MD)); + t_ret = DB_VERIFY_FATAL; + goto err; + } + + /* + * Check all of the fields that we can. + */ + + /* 08-11: Current page number. Must == pgno. */ + /* Note that endianness doesn't matter--it's zero. */ + if (meta->pgno != PGNO_BASE_MD) { + EPRINT((dbp->dbenv, "Bad pgno: was %lu, should be %lu", + (u_long)meta->pgno, (u_long)PGNO_BASE_MD)); + ret = DB_VERIFY_BAD; + } + + /* 12-15: Magic number. Must be one of valid set. */ + if (__db_is_valid_magicno(meta->magic, &dbp->type)) + swapped = 0; + else { + M_32_SWAP(meta->magic); + if (__db_is_valid_magicno(meta->magic, + &dbp->type)) + swapped = 1; + else { + EPRINT((dbp->dbenv, + "Bad magic number: %lu", (u_long)meta->magic)); + ret = DB_VERIFY_BAD; + } + } + + /* + * 16-19: Version. Must be current; for now, we + * don't support verification of old versions. + */ + if (swapped) + M_32_SWAP(meta->version); + if ((dbp->type == DB_BTREE && meta->version != DB_BTREEVERSION) || + (dbp->type == DB_HASH && meta->version != DB_HASHVERSION) || + (dbp->type == DB_QUEUE && meta->version != DB_QAMVERSION)) { + ret = DB_VERIFY_BAD; + EPRINT((dbp->dbenv, "%s%s", "Old or incorrect DB ", + "version; extraneous errors may result")); + } + + /* + * 20-23: Pagesize. Must be power of two, + * greater than 512, and less than 64K. + */ + if (swapped) + M_32_SWAP(meta->pagesize); + if (IS_VALID_PAGESIZE(meta->pagesize)) + dbp->pgsize = meta->pagesize; + else { + EPRINT((dbp->dbenv, + "Bad page size: %lu", (u_long)meta->pagesize)); + ret = DB_VERIFY_BAD; + + /* + * Now try to settle on a pagesize to use. + * If the user-supplied one is reasonable, + * use it; else, guess. 
+ */ + if (!IS_VALID_PAGESIZE(dbp->pgsize)) + dbp->pgsize = __db_guesspgsize(dbenv, fhp); + } + + /* + * 25: Page type. Must be correct for dbp->type, + * which is by now set as well as it can be. + */ + /* Needs no swapping--only one byte! */ + if ((dbp->type == DB_BTREE && meta->type != P_BTREEMETA) || + (dbp->type == DB_HASH && meta->type != P_HASHMETA) || + (dbp->type == DB_QUEUE && meta->type != P_QAMMETA)) { + ret = DB_VERIFY_BAD; + EPRINT((dbp->dbenv, "Bad page type: %lu", (u_long)meta->type)); + } + + /* + * 28-31: Free list page number. + * We'll verify its sensibility when we do inter-page + * verification later; for now, just store it. + */ + if (swapped) + M_32_SWAP(meta->free); + freelist = meta->free; + + /* + * Initialize vdp->pages to fit a single pageinfo structure for + * this one page. We'll realloc later when we know how many + * pages there are. + */ + if ((ret = __db_vrfy_getpageinfo(vdp, PGNO_BASE_MD, &pip)) != 0) + return (ret); + pip->pgno = PGNO_BASE_MD; + pip->type = meta->type; + + /* + * Signal that we still have to check the info specific to + * a given type of meta page. + */ + F_SET(pip, VRFY_INCOMPLETE); + + pip->free = freelist; + + if ((ret = __db_vrfy_putpageinfo(vdp, pip)) != 0) + return (ret); + + /* Set up the dbp's fileid. We don't use the regular open path. */ + memcpy(dbp->fileid, meta->uid, DB_FILE_ID_LEN); + + if (0) { +err: __db_err(dbenv, "%s", db_strerror(ret)); + } + + if (swapped == 1) + F_SET(dbp, DB_AM_SWAP); + if (t_ret != 0) + ret = t_ret; + return (ret); +} + +/* + * __db_vrfy_walkpages -- + * Main loop of the verifier/salvager. Walks through, + * page by page, and verifies all pages and/or prints all data pages. 
+ */ +static int +__db_vrfy_walkpages(dbp, vdp, handle, callback, flags) + DB *dbp; + VRFY_DBINFO *vdp; + void *handle; + int (*callback) __P((void *, const void *)); + u_int32_t flags; +{ + DB_ENV *dbenv; + PAGE *h; + db_pgno_t i; + int ret, t_ret, isbad; + + ret = isbad = t_ret = 0; + dbenv = dbp->dbenv; + + if ((ret = __db_fchk(dbenv, + "__db_vrfy_walkpages", flags, OKFLAGS)) != 0) + return (ret); + + for (i = 0; i <= vdp->last_pgno; i++) { + /* + * If DB_SALVAGE is set, we inspect our database of + * completed pages, and skip any we've already printed in + * the subdb pass. + */ + if (LF_ISSET(DB_SALVAGE) && (__db_salvage_isdone(vdp, i) != 0)) + continue; + + /* If an individual page get fails, keep going. */ + if ((t_ret = memp_fget(dbp->mpf, &i, 0, &h)) != 0) { + if (ret == 0) + ret = t_ret; + continue; + } + + if (LF_ISSET(DB_SALVAGE)) { + /* + * We pretty much don't want to quit unless a + * bomb hits. May as well return that something + * was screwy, however. + */ + if ((t_ret = __db_salvage(dbp, + vdp, i, h, handle, callback, flags)) != 0) { + if (ret == 0) + ret = t_ret; + isbad = 1; + } + } else { + /* + * Verify info common to all page + * types. 
+ */ + if (i != PGNO_BASE_MD) + if ((t_ret = __db_vrfy_common(dbp, + vdp, h, i, flags)) == DB_VERIFY_BAD) + isbad = 1; + + switch (TYPE(h)) { + case P_INVALID: + t_ret = __db_vrfy_invalid(dbp, + vdp, h, i, flags); + break; + case __P_DUPLICATE: + isbad = 1; + EPRINT((dbp->dbenv, + "Old-style duplicate page: %lu", + (u_long)i)); + break; + case P_HASH: + t_ret = __ham_vrfy(dbp, + vdp, h, i, flags); + break; + case P_IBTREE: + case P_IRECNO: + case P_LBTREE: + case P_LDUP: + t_ret = __bam_vrfy(dbp, + vdp, h, i, flags); + break; + case P_LRECNO: + t_ret = __ram_vrfy_leaf(dbp, + vdp, h, i, flags); + break; + case P_OVERFLOW: + t_ret = __db_vrfy_overflow(dbp, + vdp, h, i, flags); + break; + case P_HASHMETA: + t_ret = __ham_vrfy_meta(dbp, + vdp, (HMETA *)h, i, flags); + break; + case P_BTREEMETA: + t_ret = __bam_vrfy_meta(dbp, + vdp, (BTMETA *)h, i, flags); + break; + case P_QAMMETA: + t_ret = __qam_vrfy_meta(dbp, + vdp, (QMETA *)h, i, flags); + break; + case P_QAMDATA: + t_ret = __qam_vrfy_data(dbp, + vdp, (QPAGE *)h, i, flags); + break; + default: + EPRINT((dbp->dbenv, + "Unknown page type: %lu", (u_long)TYPE(h))); + isbad = 1; + break; + } + + /* + * Set up error return. + */ + if (t_ret == DB_VERIFY_BAD) + isbad = 1; + else if (t_ret == DB_VERIFY_FATAL) + goto err; + else + ret = t_ret; + + /* + * Provide feedback to the application about our + * progress. The range 0-50% comes from the fact + * that this is the first of two passes through the + * database (front-to-back, then top-to-bottom). + */ + if (dbp->db_feedback != NULL) + dbp->db_feedback(dbp, DB_VERIFY, + (i + 1) * 50 / (vdp->last_pgno + 1)); + } + + if ((t_ret = memp_fput(dbp->mpf, h, 0)) != 0 && ret == 0) + ret = t_ret; + } + + if (0) { +err: if ((t_ret = memp_fput(dbp->mpf, h, 0)) != 0) + return (ret == 0 ? t_ret : ret); + return (DB_VERIFY_BAD); + } + + return ((isbad == 1 && ret == 0) ? 
DB_VERIFY_BAD : ret); +} + +/* + * __db_vrfy_structure-- + * After a beginning-to-end walk through the database has been + * completed, put together the information that has been collected + * to verify the overall database structure. + * + * Should only be called if we want to do a database verification, + * i.e. if DB_SALVAGE is not set. + */ +static int +__db_vrfy_structure(dbp, vdp, dbname, meta_pgno, flags) + DB *dbp; + VRFY_DBINFO *vdp; + const char *dbname; + db_pgno_t meta_pgno; + u_int32_t flags; +{ + DB *pgset; + DB_ENV *dbenv; + VRFY_PAGEINFO *pip; + db_pgno_t i; + int ret, isbad, hassubs, p; + + isbad = 0; + pip = NULL; + dbenv = dbp->dbenv; + pgset = vdp->pgset; + + if ((ret = __db_fchk(dbenv, "DB->verify", flags, OKFLAGS)) != 0) + return (ret); + if (LF_ISSET(DB_SALVAGE)) { + __db_err(dbenv, "__db_vrfy_structure called with DB_SALVAGE"); + return (EINVAL); + } + + /* + * Providing feedback here is tricky; in most situations, + * we fetch each page one more time, but we do so in a top-down + * order that depends on the access method. Worse, we do this + * recursively in btree, such that on any call where we're traversing + * a subtree we don't know where that subtree is in the whole database; + * worse still, any given database may be one of several subdbs. + * + * The solution is to decrement a counter vdp->pgs_remaining each time + * we verify (and call feedback on) a page. We may over- or + * under-count, but the structure feedback function will ensure that we + * never give a percentage under 50 or over 100. (The first pass + * covered the range 0-50%.) + */ + if (dbp->db_feedback != NULL) + vdp->pgs_remaining = vdp->last_pgno + 1; + + /* + * Call the appropriate function to downwards-traverse the db type. 
+ */ + switch(dbp->type) { + case DB_BTREE: + case DB_RECNO: + if ((ret = __bam_vrfy_structure(dbp, vdp, 0, flags)) != 0) { + if (ret == DB_VERIFY_BAD) + isbad = 1; + else + goto err; + } + + /* + * If we have subdatabases and we know that the database is, + * thus far, sound, it's safe to walk the tree of subdatabases. + * Do so, and verify the structure of the databases within. + */ + if ((ret = __db_vrfy_getpageinfo(vdp, 0, &pip)) != 0) + goto err; + hassubs = F_ISSET(pip, VRFY_HAS_SUBDBS); + if ((ret = __db_vrfy_putpageinfo(vdp, pip)) != 0) + goto err; + + if (isbad == 0 && hassubs) + if ((ret = + __db_vrfy_subdbs(dbp, vdp, dbname, flags)) != 0) { + if (ret == DB_VERIFY_BAD) + isbad = 1; + else + goto err; + } + break; + case DB_HASH: + if ((ret = __ham_vrfy_structure(dbp, vdp, 0, flags)) != 0) { + if (ret == DB_VERIFY_BAD) + isbad = 1; + else + goto err; + } + break; + case DB_QUEUE: + if ((ret = __qam_vrfy_structure(dbp, vdp, flags)) != 0) { + if (ret == DB_VERIFY_BAD) + isbad = 1; + } + + /* + * Queue pages may be unreferenced and totally zeroed, if + * they're empty; queue doesn't have much structure, so + * this is unlikely to be wrong in any troublesome sense. + * Skip to "err". + */ + goto err; + /* NOTREACHED */ + default: + /* This should only happen if the verifier is somehow broken. */ + DB_ASSERT(0); + ret = EINVAL; + goto err; + /* NOTREACHED */ + } + + /* Walk free list. */ + if ((ret = + __db_vrfy_freelist(dbp, vdp, meta_pgno, flags)) == DB_VERIFY_BAD) + isbad = 1; + + /* + * If structure checks up until now have failed, it's likely that + * checking what pages have been missed will result in oodles of + * extraneous error messages being EPRINTed. Skip to the end + * if this is the case; we're going to be printing at least one + * error anyway, and probably all the more salient ones. 
+ */ + if (ret != 0 || isbad == 1) + goto err; + + /* + * Make sure no page has been missed and that no page is still marked + * "all zeroes" (only certain hash pages can be, and they're unmarked + * in __ham_vrfy_structure). + */ + for (i = 0; i < vdp->last_pgno + 1; i++) { + if ((ret = __db_vrfy_getpageinfo(vdp, i, &pip)) != 0) + goto err; + if ((ret = __db_vrfy_pgset_get(pgset, i, &p)) != 0) + goto err; + if (p == 0) { + EPRINT((dbp->dbenv, + "Unreferenced page %lu", (u_long)i)); + isbad = 1; + } + + if (F_ISSET(pip, VRFY_IS_ALLZEROES)) { + EPRINT((dbp->dbenv, + "Totally zeroed page %lu", (u_long)i)); + isbad = 1; + } + if ((ret = __db_vrfy_putpageinfo(vdp, pip)) != 0) + goto err; + pip = NULL; + } + +err: if (pip != NULL) + (void)__db_vrfy_putpageinfo(vdp, pip); + + return ((isbad == 1 && ret == 0) ? DB_VERIFY_BAD : ret); +} + +/* + * __db_is_valid_pagetype + */ +static int +__db_is_valid_pagetype(type) + u_int32_t type; +{ + switch (type) { + case P_INVALID: /* Order matches ordinal value. */ + case P_HASH: + case P_IBTREE: + case P_IRECNO: + case P_LBTREE: + case P_LRECNO: + case P_OVERFLOW: + case P_HASHMETA: + case P_BTREEMETA: + case P_QAMMETA: + case P_QAMDATA: + case P_LDUP: + return (1); + } + return (0); +} + +/* + * __db_is_valid_magicno + */ +static int +__db_is_valid_magicno(magic, typep) + u_int32_t magic; + DBTYPE *typep; +{ + switch (magic) { + case DB_BTREEMAGIC: + *typep = DB_BTREE; + return (1); + case DB_HASHMAGIC: + *typep = DB_HASH; + return (1); + case DB_QAMMAGIC: + *typep = DB_QUEUE; + return (1); + } + *typep = DB_UNKNOWN; + return (0); +} + +/* + * __db_vrfy_common -- + * Verify info common to all page types. 
+ */ +static int +__db_vrfy_common(dbp, vdp, h, pgno, flags) + DB *dbp; + VRFY_DBINFO *vdp; + PAGE *h; + db_pgno_t pgno; + u_int32_t flags; +{ + VRFY_PAGEINFO *pip; + int ret, t_ret; + u_int8_t *p; + + if ((ret = __db_vrfy_getpageinfo(vdp, pgno, &pip)) != 0) + return (ret); + + pip->pgno = pgno; + F_CLR(pip, VRFY_IS_ALLZEROES); + + /* + * Hash expands the table by leaving some pages between the + * old last and the new last totally zeroed. Its pgin function + * should fix things, but we might not be using that (e.g. if + * we're a subdatabase). + * + * Queue will create sparse files if sparse record numbers are used. + */ + if (pgno != 0 && PGNO(h) == 0) { + for (p = (u_int8_t *)h; p < (u_int8_t *)h + dbp->pgsize; p++) + if (*p != 0) { + EPRINT((dbp->dbenv, + "Page %lu should be zeroed and is not", + (u_long)pgno)); + ret = DB_VERIFY_BAD; + goto err; + } + /* + * It's totally zeroed; mark it as a hash, and we'll + * check that that makes sense structurally later. + * (The queue verification doesn't care, since queues + * don't really have much in the way of structure.) + */ + pip->type = P_HASH; + F_SET(pip, VRFY_IS_ALLZEROES); + ret = 0; + goto err; /* well, not really an err. */ + } + + if (PGNO(h) != pgno) { + EPRINT((dbp->dbenv, + "Bad page number: %lu should be %lu", + (u_long)h->pgno, (u_long)pgno)); + ret = DB_VERIFY_BAD; + } + + if (!__db_is_valid_pagetype(h->type)) { + EPRINT((dbp->dbenv, "Bad page type: %lu", (u_long)h->type)); + ret = DB_VERIFY_BAD; + } + pip->type = h->type; + +err: if ((t_ret = __db_vrfy_putpageinfo(vdp, pip)) != 0 && ret == 0) + ret = t_ret; + + return (ret); +} + +/* + * __db_vrfy_invalid -- + * Verify P_INVALID page. + * (Yes, there's not much to do here.) 
+ */ +static int +__db_vrfy_invalid(dbp, vdp, h, pgno, flags) + DB *dbp; + VRFY_DBINFO *vdp; + PAGE *h; + db_pgno_t pgno; + u_int32_t flags; +{ + VRFY_PAGEINFO *pip; + int ret, t_ret; + + if ((ret = __db_vrfy_getpageinfo(vdp, pgno, &pip)) != 0) + return (ret); + pip->next_pgno = pip->prev_pgno = 0; + + if (!IS_VALID_PGNO(NEXT_PGNO(h))) { + EPRINT((dbp->dbenv, + "Invalid next_pgno %lu on page %lu", + (u_long)NEXT_PGNO(h), (u_long)pgno)); + ret = DB_VERIFY_BAD; + } else + pip->next_pgno = NEXT_PGNO(h); + + if ((t_ret = __db_vrfy_putpageinfo(vdp, pip)) != 0 && ret == 0) + ret = t_ret; + return (ret); +} + +/* + * __db_vrfy_datapage -- + * Verify elements common to data pages (P_HASH, P_LBTREE, + * P_IBTREE, P_IRECNO, P_LRECNO, P_OVERFLOW, P_DUPLICATE)--i.e., + * those defined in the PAGE structure. + * + * Called from each of the per-page routines, after the + * all-page-type-common elements of pip have been verified and filled + * in. + * + * PUBLIC: int __db_vrfy_datapage + * PUBLIC: __P((DB *, VRFY_DBINFO *, PAGE *, db_pgno_t, u_int32_t)); + */ +int +__db_vrfy_datapage(dbp, vdp, h, pgno, flags) + DB *dbp; + VRFY_DBINFO *vdp; + PAGE *h; + db_pgno_t pgno; + u_int32_t flags; +{ + VRFY_PAGEINFO *pip; + int isbad, ret, t_ret; + + if ((ret = __db_vrfy_getpageinfo(vdp, pgno, &pip)) != 0) + return (ret); + isbad = 0; + + /* + * prev_pgno and next_pgno: store for inter-page checks, + * verify that they point to actual pages and not to self. + * + * !!! + * Internal btree pages do not maintain these fields (indeed, + * they overload them). Skip. 
+ */ + if (TYPE(h) != P_IBTREE && TYPE(h) != P_IRECNO) { + if (!IS_VALID_PGNO(PREV_PGNO(h)) || PREV_PGNO(h) == pip->pgno) { + isbad = 1; + EPRINT((dbp->dbenv, "Page %lu: Invalid prev_pgno %lu", + (u_long)pip->pgno, (u_long)PREV_PGNO(h))); + } + if (!IS_VALID_PGNO(NEXT_PGNO(h)) || NEXT_PGNO(h) == pip->pgno) { + isbad = 1; + EPRINT((dbp->dbenv, "Page %lu: Invalid next_pgno %lu", + (u_long)pip->pgno, (u_long)NEXT_PGNO(h))); + } + pip->prev_pgno = PREV_PGNO(h); + pip->next_pgno = NEXT_PGNO(h); + } + + /* + * Verify the number of entries on the page. + * There is no good way to determine if this is accurate; the + * best we can do is verify that it's not more than can, in theory, + * fit on the page. Then, we make sure there are at least + * this many valid elements in inp[], and hope that this catches + * most cases. + */ + if (TYPE(h) != P_OVERFLOW) { + if (BKEYDATA_PSIZE(0) * NUM_ENT(h) > dbp->pgsize) { + isbad = 1; + EPRINT((dbp->dbenv, + "Page %lu: Too many entries: %lu", + (u_long)pgno, (u_long)NUM_ENT(h))); + } + pip->entries = NUM_ENT(h); + } + + /* + * btree level. Should be zero unless we're a btree; + * if we are a btree, should be between LEAFLEVEL and MAXBTREELEVEL, + * and we need to save it off. 
+ */ + switch (TYPE(h)) { + case P_IBTREE: + case P_IRECNO: + if (LEVEL(h) < LEAFLEVEL + 1 || LEVEL(h) > MAXBTREELEVEL) { + isbad = 1; + EPRINT((dbp->dbenv, "Bad btree level %lu on page %lu", + (u_long)LEVEL(h), (u_long)pgno)); + } + pip->bt_level = LEVEL(h); + break; + case P_LBTREE: + case P_LDUP: + case P_LRECNO: + if (LEVEL(h) != LEAFLEVEL) { + isbad = 1; + EPRINT((dbp->dbenv, + "Btree leaf page %lu has incorrect level %lu", + (u_long)pgno, (u_long)LEVEL(h))); + } + break; + default: + if (LEVEL(h) != 0) { + isbad = 1; + EPRINT((dbp->dbenv, + "Nonzero level %lu in non-btree database page %lu", + (u_long)LEVEL(h), (u_long)pgno)); + } + break; + } + + /* + * Even though inp[] occurs in all PAGEs, we look at it in the + * access-method-specific code, since btree and hash treat + * item lengths very differently, and one of the most important + * things we want to verify is that the data--as specified + * by offset and length--cover the right part of the page + * without overlaps, gaps, or violations of the page boundary. + */ + if ((t_ret = __db_vrfy_putpageinfo(vdp, pip)) != 0 && ret == 0) + ret = t_ret; + + return ((ret == 0 && isbad == 1) ? DB_VERIFY_BAD : ret); +} + +/* + * __db_vrfy_meta-- + * Verify the access-method common parts of a meta page, using + * normal mpool routines. + * + * PUBLIC: int __db_vrfy_meta + * PUBLIC: __P((DB *, VRFY_DBINFO *, DBMETA *, db_pgno_t, u_int32_t)); + */ +int +__db_vrfy_meta(dbp, vdp, meta, pgno, flags) + DB *dbp; + VRFY_DBINFO *vdp; + DBMETA *meta; + db_pgno_t pgno; + u_int32_t flags; +{ + DBTYPE dbtype, magtype; + VRFY_PAGEINFO *pip; + int isbad, ret, t_ret; + + isbad = 0; + if ((ret = __db_vrfy_getpageinfo(vdp, pgno, &pip)) != 0) + return (ret); + + /* type plausible for a meta page */ + switch (meta->type) { + case P_BTREEMETA: + dbtype = DB_BTREE; + break; + case P_HASHMETA: + dbtype = DB_HASH; + break; + case P_QAMMETA: + dbtype = DB_QUEUE; + break; + default: + /* The verifier should never let us get here. 
*/ + DB_ASSERT(0); + ret = EINVAL; + goto err; + } + + /* magic number valid */ + if (!__db_is_valid_magicno(meta->magic, &magtype)) { + isbad = 1; + EPRINT((dbp->dbenv, + "Magic number invalid on page %lu", (u_long)pgno)); + } + if (magtype != dbtype) { + isbad = 1; + EPRINT((dbp->dbenv, + "Magic number does not match type of page %lu", + (u_long)pgno)); + } + + /* version */ + if ((dbtype == DB_BTREE && meta->version != DB_BTREEVERSION) || + (dbtype == DB_HASH && meta->version != DB_HASHVERSION) || + (dbtype == DB_QUEUE && meta->version != DB_QAMVERSION)) { + isbad = 1; + EPRINT((dbp->dbenv, "%s%s", "Old or incorrect DB ", + "version; extraneous errors may result")); + } + + /* pagesize */ + if (meta->pagesize != dbp->pgsize) { + isbad = 1; + EPRINT((dbp->dbenv, + "Invalid pagesize %lu on page %lu", + (u_long)meta->pagesize, (u_long)pgno)); + } + + /* free list */ + /* + * If this is not the main, master-database meta page, it + * should not have a free list. + */ + if (pgno != PGNO_BASE_MD && meta->free != PGNO_INVALID) { + isbad = 1; + EPRINT((dbp->dbenv, + "Nonempty free list on subdatabase metadata page %lu", + (u_long)pgno)); + } + + /* Can correctly be PGNO_INVALID--that's just the end of the list. */ + if (meta->free != PGNO_INVALID && IS_VALID_PGNO(meta->free)) + pip->free = meta->free; + else if (!IS_VALID_PGNO(meta->free)) { + isbad = 1; + EPRINT((dbp->dbenv, + "Nonsensical free list pgno %lu on page %lu", + (u_long)meta->free, (u_long)pgno)); + } + + /* + * We have now verified the common fields of the metadata page. + * Clear the flag that told us they had been incompletely checked. + */ + F_CLR(pip, VRFY_INCOMPLETE); + +err: if ((t_ret = __db_vrfy_putpageinfo(vdp, pip)) != 0 && ret == 0) + ret = t_ret; + + return ((ret == 0 && isbad == 1) ? DB_VERIFY_BAD : ret); +} + +/* + * __db_vrfy_freelist -- + * Walk free list, checking off pages and verifying absence of + * loops. 
+ */ +static int +__db_vrfy_freelist(dbp, vdp, meta, flags) + DB *dbp; + VRFY_DBINFO *vdp; + db_pgno_t meta; + u_int32_t flags; +{ + DB *pgset; + VRFY_PAGEINFO *pip; + db_pgno_t pgno; + int p, ret, t_ret; + + pgset = vdp->pgset; + DB_ASSERT(pgset != NULL); + + if ((ret = __db_vrfy_getpageinfo(vdp, meta, &pip)) != 0) + return (ret); + for (pgno = pip->free; pgno != PGNO_INVALID; pgno = pip->next_pgno) { + if ((ret = __db_vrfy_putpageinfo(vdp, pip)) != 0) + return (ret); + + /* This shouldn't happen, but just in case. */ + if (!IS_VALID_PGNO(pgno)) { + EPRINT((dbp->dbenv, + "Invalid next_pgno on free list page %lu", + (u_long)pgno)); + return (DB_VERIFY_BAD); + } + + /* Detect cycles. */ + if ((ret = __db_vrfy_pgset_get(pgset, pgno, &p)) != 0) + return (ret); + if (p != 0) { + EPRINT((dbp->dbenv, + "Page %lu encountered a second time on free list", + (u_long)pgno)); + return (DB_VERIFY_BAD); + } + if ((ret = __db_vrfy_pgset_inc(pgset, pgno)) != 0) + return (ret); + + if ((ret = __db_vrfy_getpageinfo(vdp, pgno, &pip)) != 0) + return (ret); + + if (pip->type != P_INVALID) { + EPRINT((dbp->dbenv, + "Non-invalid page %lu on free list", (u_long)pgno)); + ret = DB_VERIFY_BAD; /* unsafe to continue */ + break; + } + } + + if ((t_ret = __db_vrfy_putpageinfo(vdp, pip)) != 0) + ret = t_ret; + return (ret); +} + +/* + * __db_vrfy_subdbs -- + * Walk the known-safe master database of subdbs with a cursor, + * verifying the structure of each subdatabase we encounter. 
+ */ +static int +__db_vrfy_subdbs(dbp, vdp, dbname, flags) + DB *dbp; + VRFY_DBINFO *vdp; + const char *dbname; + u_int32_t flags; +{ + DB *mdbp; + DBC *dbc; + DBT key, data; + VRFY_PAGEINFO *pip; + db_pgno_t meta_pgno; + int ret, t_ret, isbad; + u_int8_t type; + + isbad = 0; + dbc = NULL; + + if ((ret = __db_master_open(dbp, dbname, DB_RDONLY, 0, &mdbp)) != 0) + return (ret); + + if ((ret = + __db_icursor(mdbp, NULL, DB_BTREE, PGNO_INVALID, 0, &dbc)) != 0) + goto err; + + memset(&key, 0, sizeof(key)); + memset(&data, 0, sizeof(data)); + while ((ret = dbc->c_get(dbc, &key, &data, DB_NEXT)) == 0) { + if (data.size != sizeof(db_pgno_t)) { + EPRINT((dbp->dbenv, "Database entry of invalid size")); + isbad = 1; + goto err; + } + memcpy(&meta_pgno, data.data, data.size); + /* + * Subdatabase meta pgnos are stored in network byte + * order for cross-endian compatibility. Swap if appropriate. + */ + DB_NTOHL(&meta_pgno); + if (meta_pgno == PGNO_INVALID || meta_pgno > vdp->last_pgno) { + EPRINT((dbp->dbenv, + "Database entry references invalid page %lu", + (u_long)meta_pgno)); + isbad = 1; + goto err; + } + if ((ret = __db_vrfy_getpageinfo(vdp, meta_pgno, &pip)) != 0) + goto err; + type = pip->type; + if ((ret = __db_vrfy_putpageinfo(vdp, pip)) != 0) + goto err; + switch (type) { + case P_BTREEMETA: + if ((ret = __bam_vrfy_structure( + dbp, vdp, meta_pgno, flags)) != 0) { + if (ret == DB_VERIFY_BAD) + isbad = 1; + else + goto err; + } + break; + case P_HASHMETA: + if ((ret = __ham_vrfy_structure( + dbp, vdp, meta_pgno, flags)) != 0) { + if (ret == DB_VERIFY_BAD) + isbad = 1; + else + goto err; + } + break; + case P_QAMMETA: + default: + EPRINT((dbp->dbenv, + "Database entry references page %lu of invalid type %lu", + (u_long)meta_pgno, (u_long)type)); + ret = DB_VERIFY_BAD; + goto err; + /* NOTREACHED */ + } + } + + if (ret == DB_NOTFOUND) + ret = 0; + +err: if (dbc != NULL && (t_ret = __db_c_close(dbc)) != 0 && ret == 0) + ret = t_ret; + + if ((t_ret = mdbp->close(mdbp, 
0)) != 0 && ret == 0) + ret = t_ret; + + return ((ret == 0 && isbad == 1) ? DB_VERIFY_BAD : ret); +} + +/* + * __db_vrfy_struct_feedback -- + * Provide feedback during top-down database structure traversal. + * (See comment at the beginning of __db_vrfy_structure.) + * + * PUBLIC: int __db_vrfy_struct_feedback __P((DB *, VRFY_DBINFO *)); + */ +int +__db_vrfy_struct_feedback(dbp, vdp) + DB *dbp; + VRFY_DBINFO *vdp; +{ + int progress; + + if (dbp->db_feedback == NULL) + return (0); + + if (vdp->pgs_remaining > 0) + vdp->pgs_remaining--; + + /* Don't allow a feedback call of 100 until we're really done. */ + progress = 100 - (vdp->pgs_remaining * 50 / (vdp->last_pgno + 1)); + dbp->db_feedback(dbp, DB_VERIFY, progress == 100 ? 99 : progress); + + return (0); +} + +/* + * __db_vrfy_orderchkonly -- + * Do a sort-order/hashing check on a known-otherwise-good subdb. + */ +static int +__db_vrfy_orderchkonly(dbp, vdp, name, subdb, flags) + DB *dbp; + VRFY_DBINFO *vdp; + const char *name, *subdb; + u_int32_t flags; +{ + BTMETA *btmeta; + DB *mdbp, *pgset; + DBC *pgsc; + DBT key, data; + HASH *h_internal; + HMETA *hmeta; + PAGE *h, *currpg; + db_pgno_t meta_pgno, p, pgno; + u_int32_t bucket; + int t_ret, ret; + + currpg = h = NULL; + pgsc = NULL; + pgset = NULL; + + LF_CLR(DB_NOORDERCHK); + + /* Open the master database and get the meta_pgno for the subdb. 
*/ + if ((ret = db_create(&mdbp, NULL, 0)) != 0) + return (ret); + if ((ret = __db_master_open(dbp, name, DB_RDONLY, 0, &mdbp)) != 0) + goto err; + + memset(&key, 0, sizeof(key)); + key.data = (void *)subdb; + memset(&data, 0, sizeof(data)); + if ((ret = dbp->get(dbp, NULL, &key, &data, 0)) != 0) + goto err; + + if (data.size != sizeof(db_pgno_t)) { + EPRINT((dbp->dbenv, "Database entry of invalid size")); + ret = DB_VERIFY_BAD; + goto err; + } + + memcpy(&meta_pgno, data.data, data.size); + + if ((ret = memp_fget(dbp->mpf, &meta_pgno, 0, &h)) != 0) + goto err; + + if ((ret = __db_vrfy_pgset(dbp->dbenv, dbp->pgsize, &pgset)) != 0) + goto err; + + switch (TYPE(h)) { + case P_BTREEMETA: + btmeta = (BTMETA *)h; + if (F_ISSET(&btmeta->dbmeta, BTM_RECNO)) { + /* Recnos have no order to check. */ + ret = 0; + goto err; + } + if ((ret = + __db_meta2pgset(dbp, vdp, meta_pgno, flags, pgset)) != 0) + goto err; + if ((ret = pgset->cursor(pgset, NULL, &pgsc, 0)) != 0) + goto err; + while ((ret = __db_vrfy_pgset_next(pgsc, &p)) == 0) { + if ((ret = memp_fget(dbp->mpf, &p, 0, &currpg)) != 0) + goto err; + if ((ret = __bam_vrfy_itemorder(dbp, + NULL, currpg, p, NUM_ENT(currpg), 1, + F_ISSET(&btmeta->dbmeta, BTM_DUP), flags)) != 0) + goto err; + if ((ret = memp_fput(dbp->mpf, currpg, 0)) != 0) + goto err; + currpg = NULL; + } + if ((ret = pgsc->c_close(pgsc)) != 0) + goto err; + break; + case P_HASHMETA: + hmeta = (HMETA *)h; + h_internal = (HASH *)dbp->h_internal; + /* + * Make sure h_charkey is right. + */ + if (h_internal == NULL || h_internal->h_hash == NULL) { + EPRINT((dbp->dbenv, + "DB_ORDERCHKONLY requires that a hash function be set")); + ret = DB_VERIFY_BAD; + goto err; + } + if (hmeta->h_charkey != + h_internal->h_hash(dbp, CHARKEY, sizeof(CHARKEY))) { + EPRINT((dbp->dbenv, + "Incorrect hash function for database")); + ret = DB_VERIFY_BAD; + goto err; + } + + /* + * Foreach bucket, verify hashing on each page in the + * corresponding chain of pages. 
+ */ + for (bucket = 0; bucket <= hmeta->max_bucket; bucket++) { + pgno = BS_TO_PAGE(bucket, hmeta->spares); + while (pgno != PGNO_INVALID) { + if ((ret = memp_fget(dbp->mpf, + &pgno, 0, &currpg)) != 0) + goto err; + if ((ret = __ham_vrfy_hashing(dbp, + NUM_ENT(currpg),hmeta, bucket, pgno, + flags, h_internal->h_hash)) != 0) + goto err; + pgno = NEXT_PGNO(currpg); + if ((ret = memp_fput(dbp->mpf, currpg, 0)) != 0) + goto err; + currpg = NULL; + } + } + break; + default: + EPRINT((dbp->dbenv, "Database meta page %lu of bad type %lu", + (u_long)meta_pgno, (u_long)TYPE(h))); + ret = DB_VERIFY_BAD; + break; + } + +err: if (pgsc != NULL) + (void)pgsc->c_close(pgsc); + if (pgset != NULL) + (void)pgset->close(pgset, 0); + if (h != NULL && (t_ret = memp_fput(dbp->mpf, h, 0)) != 0) + ret = t_ret; + if (currpg != NULL && (t_ret = memp_fput(dbp->mpf, currpg, 0)) != 0) + ret = t_ret; + if ((t_ret = mdbp->close(mdbp, 0)) != 0) + ret = t_ret; + return (ret); +} + +/* + * __db_salvage -- + * Walk through a page, salvaging all likely or plausible (w/ + * DB_AGGRESSIVE) key/data pairs. + * + * PUBLIC: int __db_salvage __P((DB *, VRFY_DBINFO *, db_pgno_t, PAGE *, + * PUBLIC: void *, int (*)(void *, const void *), u_int32_t)); + */ +int +__db_salvage(dbp, vdp, pgno, h, handle, callback, flags) + DB *dbp; + VRFY_DBINFO *vdp; + db_pgno_t pgno; + PAGE *h; + void *handle; + int (*callback) __P((void *, const void *)); + u_int32_t flags; +{ + DB_ASSERT(LF_ISSET(DB_SALVAGE)); + + /* If we got this page in the subdb pass, we can safely skip it. 
*/ + if (__db_salvage_isdone(vdp, pgno)) + return (0); + + switch (TYPE(h)) { + case P_HASH: + return (__ham_salvage(dbp, + vdp, pgno, h, handle, callback, flags)); + /* NOTREACHED */ + case P_LBTREE: + return (__bam_salvage(dbp, + vdp, pgno, P_LBTREE, h, handle, callback, NULL, flags)); + /* NOTREACHED */ + case P_LDUP: + return (__db_salvage_markneeded(vdp, pgno, SALVAGE_LDUP)); + /* NOTREACHED */ + case P_OVERFLOW: + return (__db_salvage_markneeded(vdp, pgno, SALVAGE_OVERFLOW)); + /* NOTREACHED */ + case P_LRECNO: + /* + * Recnos are tricky -- they may represent dup pages, or + * they may be subdatabase/regular database pages in their + * own right. If the former, they need to be printed with a + * key, preferably when we hit the corresponding datum in + * a btree/hash page. If the latter, there is no key. + * + * If a database is sufficiently frotzed, we're not going + * to be able to get this right, so we best-guess: just + * mark it needed now, and if we're really a normal recno + * database page, the "unknowns" pass will pick us up. + */ + return (__db_salvage_markneeded(vdp, pgno, SALVAGE_LRECNO)); + /* NOTREACHED */ + case P_IBTREE: + case P_INVALID: + case P_IRECNO: + case __P_DUPLICATE: + default: + /* XXX: Should we be more aggressive here? */ + break; + } + return (0); +} + +/* + * __db_salvage_unknowns -- + * Walk through the salvager database, printing with key "UNKNOWN" + * any pages we haven't dealt with. 
+ */ +static int +__db_salvage_unknowns(dbp, vdp, handle, callback, flags) + DB *dbp; + VRFY_DBINFO *vdp; + void *handle; + int (*callback) __P((void *, const void *)); + u_int32_t flags; +{ + DBT unkdbt, key, *dbt; + PAGE *h; + db_pgno_t pgno; + u_int32_t pgtype; + int ret, err_ret; + void *ovflbuf; + + memset(&unkdbt, 0, sizeof(DBT)); + unkdbt.size = strlen("UNKNOWN") + 1; + unkdbt.data = "UNKNOWN"; + + if ((ret = __os_malloc(dbp->dbenv, dbp->pgsize, 0, &ovflbuf)) != 0) + return (ret); + + err_ret = 0; + while ((ret = __db_salvage_getnext(vdp, &pgno, &pgtype)) == 0) { + dbt = NULL; + + if ((ret = memp_fget(dbp->mpf, &pgno, 0, &h)) != 0) { + err_ret = ret; + continue; + } + + switch (pgtype) { + case SALVAGE_LDUP: + case SALVAGE_LRECNODUP: + dbt = &unkdbt; + /* FALLTHROUGH */ + case SALVAGE_LBTREE: + case SALVAGE_LRECNO: + if ((ret = __bam_salvage(dbp, vdp, pgno, pgtype, + h, handle, callback, dbt, flags)) != 0) + err_ret = ret; + break; + case SALVAGE_OVERFLOW: + /* + * XXX: + * This may generate multiple "UNKNOWN" keys in + * a database with no dups. What to do? + */ + if ((ret = __db_safe_goff(dbp, + vdp, pgno, &key, &ovflbuf, flags)) != 0) { + err_ret = ret; + continue; + } + if ((ret = __db_prdbt(&key, + 0, " ", handle, callback, 0, NULL)) != 0) { + err_ret = ret; + continue; + } + if ((ret = __db_prdbt(&unkdbt, + 0, " ", handle, callback, 0, NULL)) != 0) + err_ret = ret; + break; + case SALVAGE_HASH: + if ((ret = __ham_salvage( + dbp, vdp, pgno, h, handle, callback, flags)) != 0) + err_ret = ret; + break; + case SALVAGE_INVALID: + case SALVAGE_IGNORE: + default: + /* + * Shouldn't happen, but if it does, just do what the + * nice man says. + */ + DB_ASSERT(0); + break; + } + if ((ret = memp_fput(dbp->mpf, h, 0)) != 0) + err_ret = ret; + } + + __os_free(ovflbuf, 0); + + if (err_ret != 0 && ret == 0) + ret = err_ret; + + return (ret == DB_NOTFOUND ? 
0 : ret); +} + +/* + * Offset of the ith inp array entry, which we can compare to the offset + * the entry stores. + */ +#define INP_OFFSET(h, i) \ + ((db_indx_t)((u_int8_t *)(h)->inp + (i) - (u_int8_t *)(h))) + +/* + * __db_vrfy_inpitem -- + * Verify that a single entry in the inp array is sane, and update + * the high water mark and current item offset. (The former of these is + * used for state information between calls, and is required; it must + * be initialized to the pagesize before the first call.) + * + * Returns DB_VERIFY_FATAL if inp has collided with the data, + * since verification can't continue from there; returns DB_VERIFY_BAD + * if anything else is wrong. + * + * PUBLIC: int __db_vrfy_inpitem __P((DB *, PAGE *, + * PUBLIC: db_pgno_t, u_int32_t, int, u_int32_t, u_int32_t *, u_int32_t *)); + */ +int +__db_vrfy_inpitem(dbp, h, pgno, i, is_btree, flags, himarkp, offsetp) + DB *dbp; + PAGE *h; + db_pgno_t pgno; + u_int32_t i; + int is_btree; + u_int32_t flags, *himarkp, *offsetp; +{ + BKEYDATA *bk; + db_indx_t offset, len; + + DB_ASSERT(himarkp != NULL); + + /* + * Check that the inp array, which grows from the beginning of the + * page forward, has not collided with the data, which grow from the + * end of the page backward. + */ + if (h->inp + i >= (db_indx_t *)((u_int8_t *)h + *himarkp)) { + /* We've collided with the data. We need to bail. */ + EPRINT((dbp->dbenv, + "Page %lu entries listing %lu overlaps data", + (u_long)pgno, (u_long)i)); + return (DB_VERIFY_FATAL); + } + + offset = h->inp[i]; + + /* + * Check that the item offset is reasonable: it points somewhere + * after the inp array and before the end of the page. 
+ */ + if (offset <= INP_OFFSET(h, i) || offset > dbp->pgsize) { + EPRINT((dbp->dbenv, + "Bad offset %lu at page %lu index %lu", + (u_long)offset, (u_long)pgno, (u_long)i)); + return (DB_VERIFY_BAD); + } + + /* Update the high-water mark (what HOFFSET should be) */ + if (offset < *himarkp) + *himarkp = offset; + + if (is_btree) { + /* + * Check that the item length remains on-page. + */ + bk = GET_BKEYDATA(h, i); + + /* + * We need to verify the type of the item here; + * we can't simply assume that it will be one of the + * expected three. If it's not a recognizable type, + * it can't be considered to have a verifiable + * length, so it's not possible to certify it as safe. + */ + switch (B_TYPE(bk->type)) { + case B_KEYDATA: + len = bk->len; + break; + case B_DUPLICATE: + case B_OVERFLOW: + len = BOVERFLOW_SIZE; + break; + default: + EPRINT((dbp->dbenv, + "Item %lu on page %lu of unrecognizable type", + (u_long)i, (u_long)pgno)); + return (DB_VERIFY_BAD); + } + + if ((size_t)(offset + len) > dbp->pgsize) { + EPRINT((dbp->dbenv, + "Item %lu on page %lu extends past page boundary", + (u_long)i, (u_long)pgno)); + return (DB_VERIFY_BAD); + } + } + + if (offsetp != NULL) + *offsetp = offset; + return (0); +} + +/* + * __db_vrfy_duptype-- + * Given a page number and a set of flags to __bam_vrfy_subtree, + * verify that the dup tree type is correct--i.e., it's a recno + * if DUPSORT is not set and a btree if it is. 
+ * + * PUBLIC: int __db_vrfy_duptype + * PUBLIC: __P((DB *, VRFY_DBINFO *, db_pgno_t, u_int32_t)); + */ +int +__db_vrfy_duptype(dbp, vdp, pgno, flags) + DB *dbp; + VRFY_DBINFO *vdp; + db_pgno_t pgno; + u_int32_t flags; +{ + VRFY_PAGEINFO *pip; + int ret, isbad; + + isbad = 0; + + if ((ret = __db_vrfy_getpageinfo(vdp, pgno, &pip)) != 0) + return (ret); + + switch (pip->type) { + case P_IBTREE: + case P_LDUP: + if (!LF_ISSET(ST_DUPSORT)) { + EPRINT((dbp->dbenv, + "Sorted duplicate set at page %lu in unsorted-dup database", + (u_long)pgno)); + isbad = 1; + } + break; + case P_IRECNO: + case P_LRECNO: + if (LF_ISSET(ST_DUPSORT)) { + EPRINT((dbp->dbenv, + "Unsorted duplicate set at page %lu in sorted-dup database", + (u_long)pgno)); + isbad = 1; + } + break; + default: + EPRINT((dbp->dbenv, + "Duplicate page %lu of inappropriate type %lu", + (u_long)pgno, (u_long)pip->type)); + isbad = 1; + break; + } + + if ((ret = __db_vrfy_putpageinfo(vdp, pip)) != 0) + return (ret); + return (isbad == 1 ? DB_VERIFY_BAD : 0); +} + +/* + * __db_salvage_duptree -- + * Attempt to salvage a given duplicate tree, given its alleged root. + * + * The key that corresponds to this dup set has been passed to us + * in DBT *key. Because data items follow keys, though, it has been + * printed once already. + * + * The basic idea here is that pgno ought to be a P_LDUP, a P_LRECNO, a + * P_IBTREE, or a P_IRECNO. If it's an internal page, use the verifier + * functions to make sure it's safe; if it's not, we simply bail and the + * data will have to be printed with no key later on. if it is safe, + * recurse on each of its children. + * + * Whether or not it's safe, if it's a leaf page, __bam_salvage it. + * + * At all times, use the DB hanging off vdp to mark and check what we've + * done, so each page gets printed exactly once and we don't get caught + * in any cycles. 
+ * + * PUBLIC: int __db_salvage_duptree __P((DB *, VRFY_DBINFO *, db_pgno_t, + * PUBLIC: DBT *, void *, int (*)(void *, const void *), u_int32_t)); + */ +int +__db_salvage_duptree(dbp, vdp, pgno, key, handle, callback, flags) + DB *dbp; + VRFY_DBINFO *vdp; + db_pgno_t pgno; + DBT *key; + void *handle; + int (*callback) __P((void *, const void *)); + u_int32_t flags; +{ + PAGE *h; + int ret, t_ret; + + if (pgno == PGNO_INVALID || !IS_VALID_PGNO(pgno)) + return (DB_VERIFY_BAD); + + /* We have a plausible page. Try it. */ + if ((ret = memp_fget(dbp->mpf, &pgno, 0, &h)) != 0) + return (ret); + + switch (TYPE(h)) { + case P_IBTREE: + case P_IRECNO: + if ((ret = __db_vrfy_common(dbp, vdp, h, pgno, flags)) != 0) + goto err; + if ((ret = __bam_vrfy(dbp, + vdp, h, pgno, flags | DB_NOORDERCHK)) != 0 || + (ret = __db_salvage_markdone(vdp, pgno)) != 0) + goto err; + /* + * We have a known-healthy internal page. Walk it. + */ + if ((ret = __bam_salvage_walkdupint(dbp, vdp, h, key, + handle, callback, flags)) != 0) + goto err; + break; + case P_LRECNO: + case P_LDUP: + if ((ret = __bam_salvage(dbp, + vdp, pgno, TYPE(h), h, handle, callback, key, flags)) != 0) + goto err; + break; + default: + ret = DB_VERIFY_BAD; + goto err; + /* NOTREACHED */ + } + +err: if ((t_ret = memp_fput(dbp->mpf, h, 0)) != 0 && ret == 0) + ret = t_ret; + return (ret); +} + +/* + * __db_salvage_subdbs -- + * Check and see if this database has subdbs; if so, try to salvage + * them independently. 
+ */ +static int +__db_salvage_subdbs(dbp, vdp, handle, callback, flags, hassubsp) + DB *dbp; + VRFY_DBINFO *vdp; + void *handle; + int (*callback) __P((void *, const void *)); + u_int32_t flags; + int *hassubsp; +{ + BTMETA *btmeta; + DB *pgset; + DBC *pgsc; + PAGE *h; + db_pgno_t p, meta_pgno; + int ret, err_ret; + + err_ret = 0; + pgsc = NULL; + pgset = NULL; + + meta_pgno = PGNO_BASE_MD; + if ((ret = memp_fget(dbp->mpf, &meta_pgno, 0, &h)) != 0) + return (ret); + + if (TYPE(h) == P_BTREEMETA) + btmeta = (BTMETA *)h; + else { + /* Not a btree metadata, ergo no subdbs, so just return. */ + ret = 0; + goto err; + } + + /* If it's not a safe page, bail on the attempt. */ + if ((ret = __db_vrfy_common(dbp, vdp, h, PGNO_BASE_MD, flags)) != 0 || + (ret = __bam_vrfy_meta(dbp, vdp, btmeta, PGNO_BASE_MD, flags)) != 0) + goto err; + + if (!F_ISSET(&btmeta->dbmeta, BTM_SUBDB)) { + /* No subdbs, just return. */ + ret = 0; + goto err; + } + + /* We think we've got subdbs. Mark it so. */ + *hassubsp = 1; + + if ((ret = memp_fput(dbp->mpf, h, 0)) != 0) + return (ret); + + /* + * We have subdbs. Try to crack them. + * + * To do so, get a set of leaf pages in the master + * database, and then walk each of the valid ones, salvaging + * subdbs as we go. If any prove invalid, just drop them; we'll + * pick them up on a later pass. 
+ */ + if ((ret = __db_vrfy_pgset(dbp->dbenv, dbp->pgsize, &pgset)) != 0) + return (ret); + if ((ret = + __db_meta2pgset(dbp, vdp, PGNO_BASE_MD, flags, pgset)) != 0) + goto err; + + if ((ret = pgset->cursor(pgset, NULL, &pgsc, 0)) != 0) + goto err; + while ((ret = __db_vrfy_pgset_next(pgsc, &p)) == 0) { + if ((ret = memp_fget(dbp->mpf, &p, 0, &h)) != 0) { + err_ret = ret; + continue; + } + if ((ret = __db_vrfy_common(dbp, vdp, h, p, flags)) != 0 || + (ret = __bam_vrfy(dbp, + vdp, h, p, flags | DB_NOORDERCHK)) != 0) + goto nextpg; + if (TYPE(h) != P_LBTREE) + goto nextpg; + else if ((ret = __db_salvage_subdbpg( + dbp, vdp, h, handle, callback, flags)) != 0) + err_ret = ret; +nextpg: if ((ret = memp_fput(dbp->mpf, h, 0)) != 0) + err_ret = ret; + } + + if (ret != DB_NOTFOUND) + goto err; + if ((ret = pgsc->c_close(pgsc)) != 0) + goto err; + + ret = pgset->close(pgset, 0); + return ((ret == 0 && err_ret != 0) ? err_ret : ret); + + /* NOTREACHED */ + +err: if (pgsc != NULL) + (void)pgsc->c_close(pgsc); + if (pgset != NULL) + (void)pgset->close(pgset, 0); + (void)memp_fput(dbp->mpf, h, 0); + return (ret); +} + +/* + * __db_salvage_subdbpg -- + * Given a known-good leaf page in the master database, salvage all + * leaf pages corresponding to each subdb. 
+ * + * PUBLIC: int __db_salvage_subdbpg + * PUBLIC: __P((DB *, VRFY_DBINFO *, PAGE *, void *, + * PUBLIC: int (*)(void *, const void *), u_int32_t)); + */ +int +__db_salvage_subdbpg(dbp, vdp, master, handle, callback, flags) + DB *dbp; + VRFY_DBINFO *vdp; + PAGE *master; + void *handle; + int (*callback) __P((void *, const void *)); + u_int32_t flags; +{ + BKEYDATA *bkkey, *bkdata; + BOVERFLOW *bo; + DB *pgset; + DBC *pgsc; + DBT key; + PAGE *subpg; + db_indx_t i; + db_pgno_t meta_pgno, p; + int ret, err_ret, t_ret; + char *subdbname; + + ret = err_ret = 0; + subdbname = NULL; + + if ((ret = __db_vrfy_pgset(dbp->dbenv, dbp->pgsize, &pgset)) != 0) + return (ret); + + /* + * For each entry, get and salvage the set of pages + * corresponding to that entry. + */ + for (i = 0; i < NUM_ENT(master); i += P_INDX) { + bkkey = GET_BKEYDATA(master, i); + bkdata = GET_BKEYDATA(master, i + O_INDX); + + /* Get the subdatabase name. */ + if (B_TYPE(bkkey->type) == B_OVERFLOW) { + /* + * We can, in principle anyway, have a subdb + * name so long it overflows. Ick. + */ + bo = (BOVERFLOW *)bkkey; + if ((ret = __db_safe_goff(dbp, vdp, bo->pgno, &key, + (void **)&subdbname, flags)) != 0) { + err_ret = DB_VERIFY_BAD; + continue; + } + + /* Nul-terminate it. */ + if ((ret = __os_realloc(dbp->dbenv, + key.size + 1, NULL, &subdbname)) != 0) + goto err; + subdbname[key.size] = '\0'; + } else if (B_TYPE(bkkey->type) == B_KEYDATA) { + if ((ret = __os_realloc(dbp->dbenv, + bkkey->len + 1, NULL, &subdbname)) != 0) + goto err; + memcpy(subdbname, bkkey->data, bkkey->len); + subdbname[bkkey->len] = '\0'; + } + + /* Get the corresponding pgno. */ + if (bkdata->len != sizeof(db_pgno_t)) { + err_ret = DB_VERIFY_BAD; + continue; + } + memcpy(&meta_pgno, bkdata->data, sizeof(db_pgno_t)); + + /* If we can't get the subdb meta page, just skip the subdb.
*/ + if (!IS_VALID_PGNO(meta_pgno) || + (ret = memp_fget(dbp->mpf, &meta_pgno, 0, &subpg)) != 0) { + err_ret = ret; + continue; + } + + /* + * Verify the subdatabase meta page. This has two functions. + * First, if it's bad, we have no choice but to skip the subdb + * and let the pages just get printed on a later pass. Second, + * the access-method-specific meta verification routines record + * the various state info (such as the presence of dups) + * that we need for __db_prheader(). + */ + if ((ret = + __db_vrfy_common(dbp, vdp, subpg, meta_pgno, flags)) != 0) { + err_ret = ret; + (void)memp_fput(dbp->mpf, subpg, 0); + continue; + } + switch (TYPE(subpg)) { + case P_BTREEMETA: + if ((ret = __bam_vrfy_meta(dbp, + vdp, (BTMETA *)subpg, meta_pgno, flags)) != 0) { + err_ret = ret; + (void)memp_fput(dbp->mpf, subpg, 0); + continue; + } + break; + case P_HASHMETA: + if ((ret = __ham_vrfy_meta(dbp, + vdp, (HMETA *)subpg, meta_pgno, flags)) != 0) { + err_ret = ret; + (void)memp_fput(dbp->mpf, subpg, 0); + continue; + } + break; + default: + /* This isn't an appropriate page; skip this subdb. */ + err_ret = DB_VERIFY_BAD; + continue; + /* NOTREACHED */ + } + + if ((ret = memp_fput(dbp->mpf, subpg, 0)) != 0) { + err_ret = ret; + continue; + } + + /* Print a subdatabase header. 
*/ + if ((ret = __db_prheader(dbp, + subdbname, 0, 0, handle, callback, vdp, meta_pgno)) != 0) + goto err; + + if ((ret = __db_meta2pgset(dbp, vdp, meta_pgno, + flags, pgset)) != 0) { + err_ret = ret; + continue; + } + + if ((ret = pgset->cursor(pgset, NULL, &pgsc, 0)) != 0) + goto err; + while ((ret = __db_vrfy_pgset_next(pgsc, &p)) == 0) { + if ((ret = memp_fget(dbp->mpf, &p, 0, &subpg)) != 0) { + err_ret = ret; + continue; + } + if ((ret = __db_salvage(dbp, vdp, p, subpg, + handle, callback, flags)) != 0) + err_ret = ret; + if ((ret = memp_fput(dbp->mpf, subpg, 0)) != 0) + err_ret = ret; + } + + if (ret != DB_NOTFOUND) + goto err; + + if ((ret = pgsc->c_close(pgsc)) != 0) + goto err; + if ((ret = __db_prfooter(handle, callback)) != 0) + goto err; + } +err: if (subdbname) + __os_free(subdbname, 0); + + if ((t_ret = pgset->close(pgset, 0)) != 0) + ret = t_ret; + + if ((t_ret = __db_salvage_markdone(vdp, PGNO(master))) != 0) + return (t_ret); + + return ((err_ret != 0) ? err_ret : ret); +} + +/* + * __db_meta2pgset -- + * Given a known-safe meta page number, return the set of pages + * corresponding to the database it represents. Return DB_VERIFY_BAD if + * it's not a suitable meta page or is invalid. + */ +static int +__db_meta2pgset(dbp, vdp, pgno, flags, pgset) + DB *dbp; + VRFY_DBINFO *vdp; + db_pgno_t pgno; + u_int32_t flags; + DB *pgset; +{ + PAGE *h; + int ret, t_ret; + + if ((ret = memp_fget(dbp->mpf, &pgno, 0, &h)) != 0) + return (ret); + + switch (TYPE(h)) { + case P_BTREEMETA: + ret = __bam_meta2pgset(dbp, vdp, (BTMETA *)h, flags, pgset); + break; + case P_HASHMETA: + ret = __ham_meta2pgset(dbp, vdp, (HMETA *)h, flags, pgset); + break; + default: + ret = DB_VERIFY_BAD; + break; + } + + if ((t_ret = memp_fput(dbp->mpf, h, 0)) != 0) + return (t_ret); + return (ret); +} + +/* + * __db_guesspgsize -- + * Try to guess what the pagesize is if the one on the meta page + * and the one in the db are invalid. 
+ */ +static int +__db_guesspgsize(dbenv, fhp) + DB_ENV *dbenv; + DB_FH *fhp; +{ + db_pgno_t i; + size_t nr; + u_int32_t guess; + u_int8_t type; + int ret; + + for (guess = DB_MAX_PGSIZE; guess >= DB_MIN_PGSIZE; guess >>= 1) { + /* + * We try to read three pages ahead after the first one + * and make sure we have plausible types for all of them. + * If the seeks fail, continue with a smaller size; + * we're probably just looking past the end of the database. + * If they succeed and the types are reasonable, also continue + * with a size smaller; we may be looking at pages N, + * 2N, and 3N for some N > 1. + * + * As soon as we hit an invalid type, we stop and return + * our previous guess; that last one was probably the page size. + */ + for (i = 1; i <= 3; i++) { + if ((ret = __os_seek(dbenv, fhp, guess, + i, SSZ(DBMETA, type), 0, DB_OS_SEEK_SET)) != 0) + break; + if ((ret = __os_read(dbenv, + fhp, &type, 1, &nr)) != 0 || nr == 0) + break; + if (type == P_INVALID || type >= P_PAGETYPE_MAX) + return (guess << 1); + } + } + + /* + * If we're just totally confused--the corruption takes up most of the + * beginning pages of the database--go with the default size. + */ + return (DB_DEF_IOSIZE); +} diff --git a/bdb/db/db_vrfyutil.c b/bdb/db/db_vrfyutil.c new file mode 100644 index 00000000000..89dccdcc760 --- /dev/null +++ b/bdb/db/db_vrfyutil.c @@ -0,0 +1,830 @@ +/*- + * See the file LICENSE for redistribution information. + * + * Copyright (c) 2000 + * Sleepycat Software. All rights reserved. 
+ * + * $Id: db_vrfyutil.c,v 11.11 2000/11/28 21:36:04 bostic Exp $ + */ + +#include "db_config.h" + +#ifndef lint +static const char revid[] = "$Id: db_vrfyutil.c,v 11.11 2000/11/28 21:36:04 bostic Exp $"; +#endif /* not lint */ + +#ifndef NO_SYSTEM_INCLUDES +#include <sys/types.h> + +#include <string.h> +#endif + +#include "db_int.h" +#include "db_page.h" +#include "db_verify.h" +#include "db_ext.h" + +static int __db_vrfy_pgset_iinc __P((DB *, db_pgno_t, int)); + +/* + * __db_vrfy_dbinfo_create -- + * Allocate and initialize a VRFY_DBINFO structure. + * + * PUBLIC: int __db_vrfy_dbinfo_create + * PUBLIC: __P((DB_ENV *, u_int32_t, VRFY_DBINFO **)); + */ +int +__db_vrfy_dbinfo_create (dbenv, pgsize, vdpp) + DB_ENV *dbenv; + u_int32_t pgsize; + VRFY_DBINFO **vdpp; +{ + DB *cdbp, *pgdbp, *pgset; + VRFY_DBINFO *vdp; + int ret; + + vdp = NULL; + cdbp = pgdbp = pgset = NULL; + + if ((ret = __os_calloc(NULL, + 1, sizeof(VRFY_DBINFO), (void **)&vdp)) != 0) + goto err; + + if ((ret = db_create(&cdbp, dbenv, 0)) != 0) + goto err; + + if ((ret = cdbp->set_flags(cdbp, DB_DUP | DB_DUPSORT)) != 0) + goto err; + + if ((ret = cdbp->set_pagesize(cdbp, pgsize)) != 0) + goto err; + + if ((ret = + cdbp->open(cdbp, NULL, NULL, DB_BTREE, DB_CREATE, 0600)) != 0) + goto err; + + if ((ret = db_create(&pgdbp, dbenv, 0)) != 0) + goto err; + + if ((ret = pgdbp->set_pagesize(pgdbp, pgsize)) != 0) + goto err; + + if ((ret = + pgdbp->open(pgdbp, NULL, NULL, DB_BTREE, DB_CREATE, 0600)) != 0) + goto err; + + if ((ret = __db_vrfy_pgset(dbenv, pgsize, &pgset)) != 0) + goto err; + + LIST_INIT(&vdp->subdbs); + LIST_INIT(&vdp->activepips); + + vdp->cdbp = cdbp; + vdp->pgdbp = pgdbp; + vdp->pgset = pgset; + *vdpp = vdp; + return (0); + +err: if (cdbp != NULL) + (void)cdbp->close(cdbp, 0); + if (pgdbp != NULL) + (void)pgdbp->close(pgdbp, 0); + if (vdp != NULL) + __os_free(vdp, sizeof(VRFY_DBINFO)); + return (ret); +} + +/* + * __db_vrfy_dbinfo_destroy -- + * Destructor for VRFY_DBINFO. 
Destroys VRFY_PAGEINFOs and deallocates + * structure. + * + * PUBLIC: int __db_vrfy_dbinfo_destroy __P((VRFY_DBINFO *)); + */ +int +__db_vrfy_dbinfo_destroy(vdp) + VRFY_DBINFO *vdp; +{ + VRFY_CHILDINFO *c, *d; + int t_ret, ret; + + ret = 0; + + for (c = LIST_FIRST(&vdp->subdbs); c != NULL; c = d) { + d = LIST_NEXT(c, links); + __os_free(c, 0); + } + + if ((t_ret = vdp->pgdbp->close(vdp->pgdbp, 0)) != 0) + ret = t_ret; + + if ((t_ret = vdp->cdbp->close(vdp->cdbp, 0)) != 0 && ret == 0) + ret = t_ret; + + if ((t_ret = vdp->pgset->close(vdp->pgset, 0)) != 0 && ret == 0) + ret = t_ret; + + DB_ASSERT(LIST_FIRST(&vdp->activepips) == NULL); + + __os_free(vdp, sizeof(VRFY_DBINFO)); + return (ret); +} + +/* + * __db_vrfy_getpageinfo -- + * Get a PAGEINFO structure for a given page, creating it if necessary. + * + * PUBLIC: int __db_vrfy_getpageinfo + * PUBLIC: __P((VRFY_DBINFO *, db_pgno_t, VRFY_PAGEINFO **)); + */ +int +__db_vrfy_getpageinfo(vdp, pgno, pipp) + VRFY_DBINFO *vdp; + db_pgno_t pgno; + VRFY_PAGEINFO **pipp; +{ + DBT key, data; + DB *pgdbp; + VRFY_PAGEINFO *pip; + int ret; + + /* + * We want a page info struct. There are three places to get it from, + * in decreasing order of preference: + * + * 1. vdp->activepips. If it's already "checked out", we're + * already using it, we return the same exact structure with a + * bumped refcount. This is necessary because this code is + * replacing array accesses, and it's common for f() to make some + * changes to a pip, and then call g() and h() which each make + * changes to the same pip. vdps are never shared between threads + * (they're never returned to the application), so this is safe. + * 2. The pgdbp. It's not in memory, but it's in the database, so + * get it, give it a refcount of 1, and stick it on activepips. + * 3. malloc. It doesn't exist yet; create it, then stick it on + * activepips. We'll put it in the database when we putpageinfo + * later. + */ + + /* Case 1. 
+ */ + for (pip = LIST_FIRST(&vdp->activepips); pip != NULL; + pip = LIST_NEXT(pip, links)) + if (pip->pgno == pgno) + /* Found it. */ + goto found; + + /* Case 2. */ + pgdbp = vdp->pgdbp; + memset(&key, 0, sizeof(DBT)); + memset(&data, 0, sizeof(DBT)); + F_SET(&data, DB_DBT_MALLOC); + key.data = &pgno; + key.size = sizeof(db_pgno_t); + + if ((ret = pgdbp->get(pgdbp, NULL, &key, &data, 0)) == 0) { + /* Found it. */ + DB_ASSERT(data.size == sizeof(VRFY_PAGEINFO)); + pip = data.data; + DB_ASSERT(pip->pi_refcount == 0); + LIST_INSERT_HEAD(&vdp->activepips, pip, links); + goto found; + } else if (ret != DB_NOTFOUND) /* Something nasty happened. */ + return (ret); + + /* Case 3. */ + if ((ret = __db_vrfy_pageinfo_create(&pip)) != 0) + return (ret); + + LIST_INSERT_HEAD(&vdp->activepips, pip, links); +found: pip->pi_refcount++; + + *pipp = pip; + + DB_ASSERT(pip->pi_refcount > 0); + return (0); +} + +/* + * __db_vrfy_putpageinfo -- + * Put back a VRFY_PAGEINFO that we're done with. + * + * PUBLIC: int __db_vrfy_putpageinfo __P((VRFY_DBINFO *, VRFY_PAGEINFO *)); + */ +int +__db_vrfy_putpageinfo(vdp, pip) + VRFY_DBINFO *vdp; + VRFY_PAGEINFO *pip; +{ + DBT key, data; + DB *pgdbp; + VRFY_PAGEINFO *p; + int ret; +#ifdef DIAGNOSTIC + int found; + + found = 0; +#endif + + if (--pip->pi_refcount > 0) + return (0); + + pgdbp = vdp->pgdbp; + memset(&key, 0, sizeof(DBT)); + memset(&data, 0, sizeof(DBT)); + + key.data = &pip->pgno; + key.size = sizeof(db_pgno_t); + data.data = pip; + data.size = sizeof(VRFY_PAGEINFO); + + if ((ret = pgdbp->put(pgdbp, NULL, &key, &data, 0)) != 0) + return (ret); + + for (p = LIST_FIRST(&vdp->activepips); p != NULL; + p = LIST_NEXT(p, links)) + if (p == pip) { +#ifdef DIAGNOSTIC + found++; +#endif + DB_ASSERT(p->pi_refcount == 0); + LIST_REMOVE(p, links); + break; + } +#ifdef DIAGNOSTIC + DB_ASSERT(found == 1); +#endif + + DB_ASSERT(pip->pi_refcount == 0); + __os_free(pip, 0); + return (0); +} + +/* + * __db_vrfy_pgset -- + * Create a temporary database
for the storing of sets of page numbers. + * (A mapping from page number to int, used by the *_meta2pgset functions, + * as well as for keeping track of which pages the verifier has seen.) + * + * PUBLIC: int __db_vrfy_pgset __P((DB_ENV *, u_int32_t, DB **)); + */ +int +__db_vrfy_pgset(dbenv, pgsize, dbpp) + DB_ENV *dbenv; + u_int32_t pgsize; + DB **dbpp; +{ + DB *dbp; + int ret; + + if ((ret = db_create(&dbp, dbenv, 0)) != 0) + return (ret); + if ((ret = dbp->set_pagesize(dbp, pgsize)) != 0) + goto err; + if ((ret = dbp->open(dbp, NULL, NULL, DB_BTREE, DB_CREATE, 0600)) == 0) + *dbpp = dbp; + else +err: (void)dbp->close(dbp, 0); + + return (ret); +} + +/* + * __db_vrfy_pgset_get -- + * Get the value associated in a page set with a given pgno. Return + * a 0 value (and succeed) if we've never heard of this page. + * + * PUBLIC: int __db_vrfy_pgset_get __P((DB *, db_pgno_t, int *)); + */ +int +__db_vrfy_pgset_get(dbp, pgno, valp) + DB *dbp; + db_pgno_t pgno; + int *valp; +{ + DBT key, data; + int ret, val; + + memset(&key, 0, sizeof(DBT)); + memset(&data, 0, sizeof(DBT)); + + key.data = &pgno; + key.size = sizeof(db_pgno_t); + data.data = &val; + data.ulen = sizeof(int); + F_SET(&data, DB_DBT_USERMEM); + + if ((ret = dbp->get(dbp, NULL, &key, &data, 0)) == 0) { + DB_ASSERT(data.size == sizeof(int)); + memcpy(&val, data.data, sizeof(int)); + } else if (ret == DB_NOTFOUND) + val = 0; + else + return (ret); + + *valp = val; + return (0); +} + +/* + * __db_vrfy_pgset_inc -- + * Increment the value associated with a pgno by 1. + * + * PUBLIC: int __db_vrfy_pgset_inc __P((DB *, db_pgno_t)); + */ +int +__db_vrfy_pgset_inc(dbp, pgno) + DB *dbp; + db_pgno_t pgno; +{ + + return (__db_vrfy_pgset_iinc(dbp, pgno, 1)); +} + +/* + * __db_vrfy_pgset_dec -- + * Decrement the value associated with a pgno by 1.
+ * + * PUBLIC: int __db_vrfy_pgset_dec __P((DB *, db_pgno_t)); + */ +int +__db_vrfy_pgset_dec(dbp, pgno) + DB *dbp; + db_pgno_t pgno; +{ + + return (__db_vrfy_pgset_iinc(dbp, pgno, -1)); +} + +/* + * __db_vrfy_pgset_iinc -- + * Increment the value associated with a pgno by i (which may be + * negative). + * + */ +static int +__db_vrfy_pgset_iinc(dbp, pgno, i) + DB *dbp; + db_pgno_t pgno; + int i; +{ + DBT key, data; + int ret; + int val; + + memset(&key, 0, sizeof(DBT)); + memset(&data, 0, sizeof(DBT)); + + val = 0; + + key.data = &pgno; + key.size = sizeof(db_pgno_t); + data.data = &val; + data.ulen = sizeof(int); + F_SET(&data, DB_DBT_USERMEM); + + if ((ret = dbp->get(dbp, NULL, &key, &data, 0)) == 0) { + DB_ASSERT(data.size == sizeof(int)); + memcpy(&val, data.data, sizeof(int)); + } else if (ret != DB_NOTFOUND) + return (ret); + + data.size = sizeof(int); + val += i; + + return (dbp->put(dbp, NULL, &key, &data, 0)); +} + +/* + * __db_vrfy_pgset_next -- + * Given a cursor open in a pgset database, get the next page in the + * set. + * + * PUBLIC: int __db_vrfy_pgset_next __P((DBC *, db_pgno_t *)); + */ +int +__db_vrfy_pgset_next(dbc, pgnop) + DBC *dbc; + db_pgno_t *pgnop; +{ + DBT key, data; + db_pgno_t pgno; + int ret; + + memset(&key, 0, sizeof(DBT)); + memset(&data, 0, sizeof(DBT)); + /* We don't care about the data, just the keys. */ + F_SET(&data, DB_DBT_USERMEM | DB_DBT_PARTIAL); + F_SET(&key, DB_DBT_USERMEM); + key.data = &pgno; + key.ulen = sizeof(db_pgno_t); + + if ((ret = dbc->c_get(dbc, &key, &data, DB_NEXT)) != 0) + return (ret); + + DB_ASSERT(key.size == sizeof(db_pgno_t)); + *pgnop = pgno; + + return (0); +} + +/* + * __db_vrfy_childcursor -- + * Create a cursor to walk the child list with. Returns with a nonzero + * final argument if the specified page has no children.
+ * + * PUBLIC: int __db_vrfy_childcursor __P((VRFY_DBINFO *, DBC **)); + */ +int +__db_vrfy_childcursor(vdp, dbcp) + VRFY_DBINFO *vdp; + DBC **dbcp; +{ + DB *cdbp; + DBC *dbc; + int ret; + + cdbp = vdp->cdbp; + + if ((ret = cdbp->cursor(cdbp, NULL, &dbc, 0)) == 0) + *dbcp = dbc; + + return (ret); +} + +/* + * __db_vrfy_childput -- + * Add a child structure to the set for a given page. + * + * PUBLIC: int __db_vrfy_childput + * PUBLIC: __P((VRFY_DBINFO *, db_pgno_t, VRFY_CHILDINFO *)); + */ +int +__db_vrfy_childput(vdp, pgno, cip) + VRFY_DBINFO *vdp; + db_pgno_t pgno; + VRFY_CHILDINFO *cip; +{ + DBT key, data; + DB *cdbp; + int ret; + + cdbp = vdp->cdbp; + memset(&key, 0, sizeof(DBT)); + memset(&data, 0, sizeof(DBT)); + + key.data = &pgno; + key.size = sizeof(db_pgno_t); + + data.data = cip; + data.size = sizeof(VRFY_CHILDINFO); + + /* + * Don't add duplicate (data) entries for a given child, and accept + * DB_KEYEXIST as a successful return; we only need to verify + * each child once, even if a child (such as an overflow key) is + * multiply referenced. + */ + ret = cdbp->put(cdbp, NULL, &key, &data, DB_NODUPDATA); + return (ret == DB_KEYEXIST ? 0 : ret); +} + +/* + * __db_vrfy_ccset -- + * Sets a cursor created with __db_vrfy_childcursor to the first + * child of the given pgno, and returns it in the third arg. 
+ * + * PUBLIC: int __db_vrfy_ccset __P((DBC *, db_pgno_t, VRFY_CHILDINFO **)); + */ +int +__db_vrfy_ccset(dbc, pgno, cipp) + DBC *dbc; + db_pgno_t pgno; + VRFY_CHILDINFO **cipp; +{ + DBT key, data; + int ret; + + memset(&key, 0, sizeof(DBT)); + memset(&data, 0, sizeof(DBT)); + + key.data = &pgno; + key.size = sizeof(db_pgno_t); + + if ((ret = dbc->c_get(dbc, &key, &data, DB_SET)) != 0) + return (ret); + + DB_ASSERT(data.size == sizeof(VRFY_CHILDINFO)); + *cipp = (VRFY_CHILDINFO *)data.data; + + return (0); +} + +/* + * __db_vrfy_ccnext -- + * Gets the next child of the given cursor created with + * __db_vrfy_childcursor, and returns it in the memory provided in the + * second arg. + * + * PUBLIC: int __db_vrfy_ccnext __P((DBC *, VRFY_CHILDINFO **)); + */ +int +__db_vrfy_ccnext(dbc, cipp) + DBC *dbc; + VRFY_CHILDINFO **cipp; +{ + DBT key, data; + int ret; + + memset(&key, 0, sizeof(DBT)); + memset(&data, 0, sizeof(DBT)); + + if ((ret = dbc->c_get(dbc, &key, &data, DB_NEXT_DUP)) != 0) + return (ret); + + DB_ASSERT(data.size == sizeof(VRFY_CHILDINFO)); + *cipp = (VRFY_CHILDINFO *)data.data; + + return (0); +} + +/* + * __db_vrfy_ccclose -- + * Closes the cursor created with __db_vrfy_childcursor. + * + * This doesn't actually do anything interesting now, but it's + * not inconceivable that we might change the internal database usage + * and keep the interfaces the same, and a function call here or there + * seldom hurts anyone. + * + * PUBLIC: int __db_vrfy_ccclose __P((DBC *)); + */ +int +__db_vrfy_ccclose(dbc) + DBC *dbc; +{ + + return (dbc->c_close(dbc)); +} + +/* + * __db_vrfy_pageinfo_create -- + * Constructor for VRFY_PAGEINFO; allocates and initializes. 
+ * + * PUBLIC: int __db_vrfy_pageinfo_create __P((VRFY_PAGEINFO **)); + */ +int +__db_vrfy_pageinfo_create(pgipp) + VRFY_PAGEINFO **pgipp; +{ + VRFY_PAGEINFO *pgip; + int ret; + + if ((ret = __os_calloc(NULL, + 1, sizeof(VRFY_PAGEINFO), (void **)&pgip)) != 0) + return (ret); + + DB_ASSERT(pgip->pi_refcount == 0); + + *pgipp = pgip; + return (0); +} + +/* + * __db_salvage_init -- + * Set up salvager database. + * + * PUBLIC: int __db_salvage_init __P((VRFY_DBINFO *)); + */ +int +__db_salvage_init(vdp) + VRFY_DBINFO *vdp; +{ + DB *dbp; + int ret; + + if ((ret = db_create(&dbp, NULL, 0)) != 0) + return (ret); + + if ((ret = dbp->set_pagesize(dbp, 1024)) != 0) + goto err; + + if ((ret = dbp->open(dbp, NULL, NULL, DB_BTREE, DB_CREATE, 0)) != 0) + goto err; + + vdp->salvage_pages = dbp; + return (0); + +err: (void)dbp->close(dbp, 0); + return (ret); +} + +/* + * __db_salvage_destroy -- + * Close salvager database. + * PUBLIC: void __db_salvage_destroy __P((VRFY_DBINFO *)); + */ +void +__db_salvage_destroy(vdp) + VRFY_DBINFO *vdp; +{ + (void)vdp->salvage_pages->close(vdp->salvage_pages, 0); +} + +/* + * __db_salvage_getnext -- + * Get the next (first) unprinted page in the database of pages we need to + * print still. Delete entries for any already-printed pages we encounter + * in this search, as well as the page we're returning. 
+ * + * PUBLIC: int __db_salvage_getnext + * PUBLIC: __P((VRFY_DBINFO *, db_pgno_t *, u_int32_t *)); + */ +int +__db_salvage_getnext(vdp, pgnop, pgtypep) + VRFY_DBINFO *vdp; + db_pgno_t *pgnop; + u_int32_t *pgtypep; +{ + DB *dbp; + DBC *dbc; + DBT key, data; + int ret; + u_int32_t pgtype; + + dbp = vdp->salvage_pages; + + memset(&key, 0, sizeof(DBT)); + memset(&data, 0, sizeof(DBT)); + + if ((ret = dbp->cursor(dbp, NULL, &dbc, 0)) != 0) + return (ret); + + while ((ret = dbc->c_get(dbc, &key, &data, DB_NEXT)) == 0) { + DB_ASSERT(data.size == sizeof(u_int32_t)); + memcpy(&pgtype, data.data, sizeof(pgtype)); + + if ((ret = dbc->c_del(dbc, 0)) != 0) + goto err; + if (pgtype != SALVAGE_IGNORE) + goto found; + } + + /* No more entries--ret probably equals DB_NOTFOUND. */ + + if (0) { +found: DB_ASSERT(key.size == sizeof(db_pgno_t)); + DB_ASSERT(data.size == sizeof(u_int32_t)); + + *pgnop = *(db_pgno_t *)key.data; + *pgtypep = *(u_int32_t *)data.data; + } + +err: (void)dbc->c_close(dbc); + return (ret); +} + +/* + * __db_salvage_isdone -- + * Return whether or not the given pgno is already marked + * SALVAGE_IGNORE (meaning that we don't need to print it again). + * + * Returns DB_KEYEXIST if it is marked, 0 if not, or another error on + * error. + * + * PUBLIC: int __db_salvage_isdone __P((VRFY_DBINFO *, db_pgno_t)); + */ +int +__db_salvage_isdone(vdp, pgno) + VRFY_DBINFO *vdp; + db_pgno_t pgno; +{ + DBT key, data; + DB *dbp; + int ret; + u_int32_t currtype; + + dbp = vdp->salvage_pages; + + memset(&key, 0, sizeof(DBT)); + memset(&data, 0, sizeof(DBT)); + + currtype = SALVAGE_INVALID; + data.data = &currtype; + data.ulen = sizeof(u_int32_t); + data.flags = DB_DBT_USERMEM; + + key.data = &pgno; + key.size = sizeof(db_pgno_t); + + /* + * Put an entry for this page, with pgno as key and type as data, + * unless it's already there and is marked done. + * If it's there and is marked anything else, that's fine--we + * want to mark it done. 
+ */ + ret = dbp->get(dbp, NULL, &key, &data, 0); + if (ret == 0) { + /* + * The key's already here. Check and see if it's already + * marked done. If it is, return DB_KEYEXIST. If it's not, + * return 0. + */ + if (currtype == SALVAGE_IGNORE) + return (DB_KEYEXIST); + else + return (0); + } else if (ret != DB_NOTFOUND) + return (ret); + + /* The pgno is not yet marked anything; return 0. */ + return (0); +} + +/* + * __db_salvage_markdone -- + * Mark as done a given page. + * + * PUBLIC: int __db_salvage_markdone __P((VRFY_DBINFO *, db_pgno_t)); + */ +int +__db_salvage_markdone(vdp, pgno) + VRFY_DBINFO *vdp; + db_pgno_t pgno; +{ + DBT key, data; + DB *dbp; + int pgtype, ret; + u_int32_t currtype; + + pgtype = SALVAGE_IGNORE; + dbp = vdp->salvage_pages; + + memset(&key, 0, sizeof(DBT)); + memset(&data, 0, sizeof(DBT)); + + currtype = SALVAGE_INVALID; + data.data = &currtype; + data.ulen = sizeof(u_int32_t); + data.flags = DB_DBT_USERMEM; + + key.data = &pgno; + key.size = sizeof(db_pgno_t); + + /* + * Put an entry for this page, with pgno as key and type as data, + * unless it's already there and is marked done. + * If it's there and is marked anything else, that's fine--we + * want to mark it done, but db_salvage_isdone only lets + * us know if it's marked IGNORE. + * + * We don't want to return DB_KEYEXIST, though; this will + * likely get passed up all the way and make no sense to the + * application. Instead, use DB_VERIFY_BAD to indicate that + * we've seen this page already--it probably indicates a + * multiply-linked page. + */ + if ((ret = __db_salvage_isdone(vdp, pgno)) != 0) + return (ret == DB_KEYEXIST ? DB_VERIFY_BAD : ret); + + data.size = sizeof(u_int32_t); + data.data = &pgtype; + + return (dbp->put(dbp, NULL, &key, &data, 0)); +} + +/* + * __db_salvage_markneeded -- + * If it has not yet been printed, make note of the fact that a page + * must be dealt with later. 
+ * + * PUBLIC: int __db_salvage_markneeded + * PUBLIC: __P((VRFY_DBINFO *, db_pgno_t, u_int32_t)); + */ +int +__db_salvage_markneeded(vdp, pgno, pgtype) + VRFY_DBINFO *vdp; + db_pgno_t pgno; + u_int32_t pgtype; +{ + DB *dbp; + DBT key, data; + int ret; + + dbp = vdp->salvage_pages; + + memset(&key, 0, sizeof(DBT)); + memset(&data, 0, sizeof(DBT)); + + key.data = &pgno; + key.size = sizeof(db_pgno_t); + + data.data = &pgtype; + data.size = sizeof(u_int32_t); + + /* + * Put an entry for this page, with pgno as key and type as data, + * unless it's already there, in which case it's presumably + * already been marked done. + */ + ret = dbp->put(dbp, NULL, &key, &data, DB_NOOVERWRITE); + return (ret == DB_KEYEXIST ? 0 : ret); +} |
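The salvage bookkeeping at the end of db_vrfyutil.c (__db_salvage_markneeded, __db_salvage_isdone, __db_salvage_markdone, __db_salvage_getnext) is easier to follow with the Berkeley DB calls stripped away. The following is only a sketch under stated assumptions: a fixed-size array stands in for the salvage_pages btree, every sv_* name and SV_* constant is hypothetical and does not appear in the source, and the scan in sv_getnext walks page numbers in numeric order, which the real btree cursor need not do. It is meant to show the state transitions, not to reproduce the library code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these names exist in Berkeley DB. */
enum { SV_UNSEEN = 0, SV_IGNORE = 1, SV_FIRST_TYPE = 2 };
#define SV_MAXPAGES	64
#define SV_EXISTS	(-1)	/* plays the role of DB_KEYEXIST */
#define SV_NOTFOUND	(-2)	/* plays the role of DB_NOTFOUND */
#define SV_BAD		(-3)	/* plays the role of DB_VERIFY_BAD */

/* One slot per page: SV_UNSEEN means "no entry in the table". */
struct sv_table {
	uint32_t state[SV_MAXPAGES];
};

/* Like __db_salvage_markneeded: record a type only if pgno has no entry. */
static int
sv_markneeded(struct sv_table *t, uint32_t pgno, uint32_t type)
{
	assert(pgno < SV_MAXPAGES && type >= SV_FIRST_TYPE);
	if (t->state[pgno] == SV_UNSEEN)	/* DB_NOOVERWRITE semantics */
		t->state[pgno] = type;
	return (0);
}

/* Like __db_salvage_isdone: SV_EXISTS if already marked done, else 0. */
static int
sv_isdone(const struct sv_table *t, uint32_t pgno)
{
	assert(pgno < SV_MAXPAGES);
	return (t->state[pgno] == SV_IGNORE ? SV_EXISTS : 0);
}

/* Like __db_salvage_markdone: marking a page twice is reported as bad. */
static int
sv_markdone(struct sv_table *t, uint32_t pgno)
{
	if (sv_isdone(t, pgno) == SV_EXISTS)
		return (SV_BAD);
	t->state[pgno] = SV_IGNORE;
	return (0);
}

/*
 * Like __db_salvage_getnext: walk the entries, deleting each one visited,
 * and return the first that is not marked done.
 */
static int
sv_getnext(struct sv_table *t, uint32_t *pgnop, uint32_t *typep)
{
	uint32_t p, s;

	for (p = 0; p < SV_MAXPAGES; p++) {
		if ((s = t->state[p]) == SV_UNSEEN)
			continue;
		t->state[p] = SV_UNSEEN;	/* consume the entry */
		if (s != SV_IGNORE) {
			*pgnop = p;
			*typep = s;
			return (0);
		}
	}
	return (SV_NOTFOUND);
}
```

In this model a page noted with sv_markneeded keeps its first recorded type, a page already printed is skipped (and its entry dropped) by sv_getnext, and a second sv_markdone on the same page returns SV_BAD, mirroring how the source maps DB_KEYEXIST to DB_VERIFY_BAD to flag a probable multiply-linked page.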