There are a series of configuration tasks which are common to all access methods. They are described in the following sections.
The size of the pages used in the underlying database can be specified by calling the DB->set_pagesize() method. The minimum page size is 512 bytes, the maximum page size is 64K bytes, and the page size must be a power of two. If no page size is specified by the application, a page size is selected based on the underlying filesystem I/O block size. (A page size selected in this way has a lower limit of 512 bytes and an upper limit of 16K bytes.)

There are several issues to consider when selecting a page size: overflow record sizes, locking, I/O efficiency, and recoverability.
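For example, a minimal sketch in the C API that sets an explicit page size before the database is created (the 8KB value and the file name example.db are illustrative choices, and error handling is abbreviated):

    #include <db.h>

    /*
     * Minimal sketch: set an explicit 8KB page size.  The page
     * size must be a power of two between 512 bytes and 64KB,
     * and can only be set before the DB->open() call that
     * creates the file.
     */
    int
    configure_page_size(DB **dbpp)
    {
        DB *dbp;
        int ret;

        if ((ret = db_create(&dbp, NULL, 0)) != 0)
            return (ret);
        if ((ret = dbp->set_pagesize(dbp, 8 * 1024)) != 0)
            goto err;
        if ((ret = dbp->open(dbp, NULL,
            "example.db", NULL, DB_BTREE, DB_CREATE, 0664)) != 0)
            goto err;
        *dbpp = dbp;
        return (0);

    err:    (void)dbp->close(dbp, 0);
        return (ret);
    }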
First, the page size implicitly sets the size of an overflow record. Overflow records are key or data items that are too large to fit on a normal database page because of their size, and are therefore stored in overflow pages. Overflow pages are pages that exist outside of the normal database structure. For this reason, there is often a significant performance penalty associated with retrieving or modifying overflow records. Selecting a page size that is too small, and which forces the creation of large numbers of overflow pages, can seriously impact the performance of an application.

Second, in the Btree, Hash and Recno access methods, the finest-grained lock that Berkeley DB acquires is for a page. (The Queue access method generally acquires record-level locks rather than page-level locks.) Selecting a page size that is too large, and which causes threads or processes to wait because other threads of control are accessing or modifying records on the same page, can impact the performance of your application.
Third, the page size specifies the granularity of I/O from the database to the operating system. Berkeley DB will give a page-sized unit of bytes to the operating system to be scheduled for reading/writing from/to the disk. For many operating systems, there is an internal block size which is used as the granularity of I/O from the operating system to the disk. Generally, it will be more efficient for Berkeley DB to write filesystem-sized blocks to the operating system and for the operating system to write those same blocks to the disk.

Selecting a database page size smaller than the filesystem block size may cause the operating system to coalesce or otherwise manipulate Berkeley DB pages and can impact the performance of your application. When the page size is smaller than the filesystem block size and a page written by Berkeley DB is not found in the operating system's cache, the operating system may be forced to read a block from the disk, copy the page into the block it read, and then write out the block to disk, rather than simply writing the page to disk. Additionally, as the operating system is reading more data into its buffer cache than is strictly necessary to satisfy each Berkeley DB request for a page, the operating system buffer cache may be wasting memory.

Alternatively, selecting a page size larger than the filesystem block size may cause the operating system to read more data than necessary. On some systems, reading filesystem blocks sequentially may cause the operating system to begin performing read-ahead. If requesting a single database page implies reading enough filesystem blocks to satisfy the operating system's criteria for read-ahead, the operating system may do more I/O than is required.
Fourth, when using the Berkeley DB Transactional Data Store product, the page size may affect the errors from which your database can recover. See Berkeley DB recoverability for more information.

The db_tuner utility suggests a page size for btree databases that optimizes cache efficiency and storage space requirements. This utility works only when given a pre-populated database, so it is useful when tuning an existing application and not when first implementing an application.
The size of the cache used for the underlying database can be specified by calling the DB->set_cachesize() method. Choosing a cache size is, unfortunately, an art. Your cache must be at least large enough for your working set plus some overlap for unexpected situations.
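For example, a minimal sketch that sizes the cache before the database is opened (the 64MB value is purely illustrative; an appropriate size depends on your working set):

    #include <db.h>

    /*
     * Illustrative sketch: a 64MB cache in a single contiguous
     * region.  The gbytes and bytes arguments are summed, and an
     * ncache value greater than 1 splits the cache into multiple
     * regions.  Like the page size, the cache size must be set
     * before DB->open().
     */
    int
    configure_cache(DB *dbp)
    {
        return (dbp->set_cachesize(dbp, 0, 64 * 1024 * 1024, 1));
    }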
When using the Btree access method, you must have a cache big enough for the minimum working set for a single access. This will include a root page, one or more internal pages (depending on the depth of your tree), and a leaf page. If your cache is any smaller than that, each new page will force out the least-recently-used page, and Berkeley DB will re-read the root page of the tree anew on each database request.
If your keys are of moderate size (a few tens of bytes) and your pages are on the order of 4KB to 8KB, most Btree applications will be only three levels. For example, using 20-byte keys with 20 bytes of data associated with each key, an 8KB page can hold roughly 400 keys (or 200 key/data pairs), so a fully populated three-level Btree will hold 32 million key/data pairs, and a tree with only a 50% page-fill factor will still hold 16 million key/data pairs. We rarely expect trees to exceed five levels, although Berkeley DB will support trees up to 255 levels.
The rule of thumb is that cache is good, and more cache is better. Generally, applications benefit from increasing the cache size up to a point, at which the performance will stop improving as the cache size increases. When this point is reached, one of two things has happened: either the cache is large enough that the application is almost never having to retrieve information from disk, or your application is doing truly random accesses, and therefore increasing the size of the cache doesn't significantly increase the odds of finding the next requested information in the cache. The latter is fairly rare; almost all applications show some form of locality of reference.
That said, it is important not to increase your cache size beyond the capabilities of your system, as that will result in reduced performance. Under many operating systems, tying down enough virtual memory will cause your memory and potentially your program to be swapped. This is especially likely on systems without unified OS buffer caches and virtual memory spaces, as the buffer cache was allocated at boot time and so cannot be adjusted based on application requests for large amounts of virtual memory.
For example, even if accesses are truly random within a Btree, your access pattern will favor internal pages over leaf pages, so your cache should be large enough to hold all internal pages. In the steady state, this requires at most one I/O per operation to retrieve the appropriate leaf page.
You can use the db_stat utility to monitor the effectiveness of your cache. The following output is excerpted from the output of that utility's -m option:

prompt: db_stat -m
131072 Cache size (128K).
4273 Requested pages found in the cache (97%).
4 Current dirty buffer count.

The statistics for this cache say that there have been 4,273 requests of the cache, and only 116 of those requests required an I/O from disk. This means that the cache is working well, yielding a 97% cache hit rate. The db_stat utility will present these statistics both for the cache as a whole and for each file within the cache separately.
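If an application wants these counters programmatically rather than from the command line, a sketch using the memory pool statistics interface might look like the following. It assumes an already-open environment handle; the hit and miss fields shown are from the C API's DB_MPOOL_STAT structure:

    #include <stdio.h>
    #include <stdlib.h>
    #include <db.h>

    /*
     * Sketch: read the cache hit/miss counters through the memory
     * pool statistics interface instead of running db_stat -m.
     * The statistics structure is allocated by the library and
     * must be freed by the application.
     */
    int
    print_hit_rate(DB_ENV *dbenv)
    {
        DB_MPOOL_STAT *sp;
        int ret;

        if ((ret = dbenv->memp_stat(dbenv, &sp, NULL, 0)) != 0)
            return (ret);
        printf("cache hits %lu, misses %lu\n",
            (unsigned long)sp->st_cache_hit,
            (unsigned long)sp->st_cache_miss);
        free(sp);
        return (0);
    }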
Database files created by Berkeley DB can be created in either little- or big-endian formats. The byte order used for the underlying database is specified by calling the DB->set_lorder() method. If no order is selected, the native format of the machine on which the database is created will be used.

Berkeley DB databases are architecture independent, and any format database can be used on a machine with a different native format. In this case, each page that is read into or written from the cache must be converted to or from the host format, and databases with non-native formats will incur a performance penalty for the run-time conversion.

It is important to note that the Berkeley DB access methods do no data conversion for application-specified data. Key/data pairs written on a little-endian format architecture will be returned to the application exactly as they were written when retrieved on a big-endian format architecture.
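A minimal sketch, assuming the application wants to force big-endian order regardless of the host machine (Berkeley DB encodes byte order as the integer 1234 for little-endian and 4321 for big-endian):

    #include <db.h>

    /*
     * Sketch: force big-endian (network) byte order for a new
     * database regardless of the host's native order.  Must be
     * called before the database is created.
     */
    int
    configure_byte_order(DB *dbp)
    {
        return (dbp->set_lorder(dbp, 4321));
    }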
The Btree and Hash access methods support the creation of multiple data items for a single key item. By default, multiple data items are not permitted, and each database store operation will overwrite any previous data item for that key. To configure Berkeley DB for duplicate data items, call the DB->set_flags() method with the DB_DUP flag. Only one copy of the key will be stored for each set of duplicate data items. If the Btree access method comparison routine returns that two keys compare equally, it is undefined which of the two keys will be stored and returned from future database operations.

By default, Berkeley DB stores duplicates in the order in which they were added, that is, each new duplicate data item will be stored after any already existing data items. This default behavior can be overridden by using the DBC->put() method and one of the DB_AFTER, DB_BEFORE, DB_KEYFIRST or DB_KEYLAST flags. Alternatively, Berkeley DB may be configured to sort duplicate data items.

When stepping through the database sequentially, duplicate data items will be returned individually, as a key/data pair, where the key item only changes after the last duplicate data item has been returned. For this reason, duplicate data items cannot be accessed using the DB->get() method, as it always returns the first of the duplicate data items. Duplicate data items should be retrieved using a Berkeley DB cursor interface such as the DBC->get() method, as sketched below.

There is a flag that permits applications to request the following data item only if it is a duplicate data item of the current entry; see DB_NEXT_DUP for more information. There is a flag that permits applications to request the following data item only if it is not a duplicate data item of the current entry; see DB_NEXT_NODUP and DB_PREV_NODUP for more information.
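A minimal sketch of cursor-based duplicate retrieval: position a cursor on the key with DB_SET, then step with DB_NEXT_DUP until DB_NOTFOUND. It assumes the database was opened with DB_DUP set and that keys and data are printable strings; the function name is illustrative:

    #include <stdio.h>
    #include <string.h>
    #include <db.h>

    /* Print every duplicate data item stored under one key. */
    int
    dump_duplicates(DB *dbp, char *keystr)
    {
        DBC *dbc;
        DBT key, data;
        int ret;

        if ((ret = dbp->cursor(dbp, NULL, &dbc, 0)) != 0)
            return (ret);

        memset(&key, 0, sizeof(key));
        memset(&data, 0, sizeof(data));
        key.data = keystr;
        key.size = (u_int32_t)strlen(keystr);

        /* DB_SET finds the first duplicate for the key;
         * DB_NEXT_DUP returns the rest, in order. */
        for (ret = dbc->get(dbc, &key, &data, DB_SET);
            ret == 0;
            ret = dbc->get(dbc, &key, &data, DB_NEXT_DUP))
            printf("%.*s\n", (int)data.size, (char *)data.data);

        (void)dbc->close(dbc);
        return (ret == DB_NOTFOUND ? 0 : ret);
    }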
It is also possible to maintain duplicate records in sorted order. Sorting duplicates will significantly increase performance when searching them and performing equality joins, both of which are common operations when using secondary indices. To configure Berkeley DB to sort duplicate data items, the application must call the DB->set_flags() method with the DB_DUPSORT flag. Note that DB_DUPSORT automatically turns on the DB_DUP flag for you, so you do not have to also set that flag; however, it is not an error to also set DB_DUP when configuring for sorted duplicate records.

When configuring sorted duplicate records, you can also specify a custom comparison function using the DB->set_dup_compare() method. If the DB_DUPSORT flag is given, but no comparison routine is specified, then Berkeley DB defaults to the same lexicographical sorting used for Btree keys, with shorter items collating before longer items.

If the duplicate data items are unsorted, applications may store identical duplicate data items, or, for those that just like the way it sounds, duplicate duplicates.
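A sketch of configuring sorted duplicates with a custom comparison routine. The choice of native unsigned 32-bit integers as the duplicate format is an arbitrary example, and the three-argument callback form shown is the classic C API signature; consult your release's documentation, as some releases extend it:

    #include <string.h>
    #include <db.h>

    /*
     * Hypothetical comparison routine: order duplicate data
     * items as native unsigned 32-bit integers.
     */
    static int
    compare_uint32_dups(DB *dbp, const DBT *a, const DBT *b)
    {
        u_int32_t ai, bi;

        (void)dbp;    /* unused */
        memcpy(&ai, a->data, sizeof(ai));
        memcpy(&bi, b->data, sizeof(bi));
        return (ai < bi ? -1 : (ai > bi ? 1 : 0));
    }

    /* DB_DUPSORT implies DB_DUP, so a single flag call suffices. */
    int
    configure_sorted_dups(DB *dbp)
    {
        int ret;

        if ((ret = dbp->set_flags(dbp, DB_DUPSORT)) != 0)
            return (ret);
        return (dbp->set_dup_compare(dbp, compare_uint32_dups));
    }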
It is an error to attempt to store identical duplicate data items when duplicates are being stored in a sorted order. Any such attempt results in the error message "Duplicate data items are not supported with sorted data" with a DB_KEYEXIST return code.

Note that you can suppress the error message "Duplicate data items are not supported with sorted data" by using the DB_NODUPDATA flag. Use of this flag does not change the database's basic behavior; storing duplicate data items in a database configured for sorted duplicates is still an error and so you will continue to receive the DB_KEYEXIST return code if you try to do that.

For further information on how searching and insertion behave in the presence of duplicates (sorted or not), see the DB->get(), DB->put(), DBC->get() and DBC->put() documentation.
Berkeley DB allocates memory for returning key/data pairs and statistical information which becomes the responsibility of the application. There are also interfaces where an application will allocate memory which becomes the responsibility of Berkeley DB.

On systems in which there may be multiple library versions of the standard allocation routines (notably Windows NT), transferring memory between the library and the application will fail because the Berkeley DB library allocates memory from a different heap than the application uses to free it, or vice versa. To avoid this problem, the DB_ENV->set_alloc() and DB->set_alloc() methods can be used to give Berkeley DB references to the application's allocation routines.
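A minimal sketch, assuming the application simply wants Berkeley DB to use the application's own C library allocator (the call must be made before the environment is opened):

    #include <stdlib.h>
    #include <db.h>

    /*
     * Sketch: hand Berkeley DB the application's allocation
     * routines so that memory crossing the library boundary
     * (for example, DBTs flagged DB_DBT_MALLOC) is always
     * allocated and freed from the same heap.
     */
    int
    configure_allocators(DB_ENV *dbenv)
    {
        return (dbenv->set_alloc(dbenv, malloc, realloc, free));
    }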