| author | Marko Mäkelä <marko.makela@oracle.com> | 2011-08-29 11:16:42 +0300 |
|---|---|---|
| committer | Marko Mäkelä <marko.makela@oracle.com> | 2011-08-29 11:16:42 +0300 |
| commit | 41f229cd9ed5a1549fb6438bdc1614e48d003625 (patch) | |
| tree | efdc44422d9f3294b99449b048f528ea6de30c09 /storage/innobase | |
| parent | e46b3453bfd379f7eef4fcc01853f085007c012b (diff) | |
| download | mariadb-git-41f229cd9ed5a1549fb6438bdc1614e48d003625.tar.gz | |
Bug#12704861 Corruption after a crash during BLOB update
The fix of Bug#12612184 broke crash recovery. When a record that
contains off-page columns (BLOBs) is updated, we must first write redo
log about the BLOB page writes, and only after that write the redo log
about the B-tree changes. The buggy fix would log the B-tree changes
first, meaning that after recovery, we could end up having a record
that contains a null BLOB pointer.
Because we will be redo logging the writes of the off-page columns
before the B-tree changes, we must make sure that the pages chosen for
the off-page columns are free both before and after the B-tree
changes. In this way, the worst thing that can happen in crash
recovery is that the BLOBs are written to free pages, but the B-tree
changes are not applied. The BLOB pages would correctly remain free in
this case. To achieve this, we must allocate the BLOB pages in the
mini-transaction of the B-tree operation. A further quirk is that BLOB
pages are allocated from the same file segment as leaf pages. Because
of this, we must temporarily "hide" any leaf pages that were freed
during the B-tree operation by "fake allocating" them prior to writing
the BLOBs, and freeing them again before the mtr_commit() of the
B-tree operation, in btr_mark_freed_leaves().
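The ordering argument can be made concrete with a toy model (this is
illustrative C, not InnoDB code; all names below are invented):
recovery replays some prefix of the redo log, so corruption is possible
exactly when some prefix contains the B-tree pointer update but not the
BLOB page writes.

```c
#include <stdio.h>

/* Toy redo-log replay model.  Each record is a page write; crash
   recovery replays an arbitrary prefix of the log.  If the B-tree
   change (which makes the record point to the BLOB) is logged before
   the BLOB page writes, a prefix can contain the pointer but not the
   BLOB: the corruption fixed by this patch. */

enum rec { BLOB_PAGE_WRITE, BTREE_POINTER_UPDATE };

static int corrupt_after_prefix(const enum rec *log, int n, int prefix)
{
	int	have_blob = 0;
	int	have_ptr = 0;
	int	i;

	for (i = 0; i < prefix && i < n; i++) {
		if (log[i] == BLOB_PAGE_WRITE)      have_blob = 1;
		if (log[i] == BTREE_POINTER_UPDATE) have_ptr = 1;
	}
	return(have_ptr && !have_blob);	/* dangling BLOB pointer */
}

int main(void)
{
	enum rec buggy[] = { BTREE_POINTER_UPDATE, BLOB_PAGE_WRITE };
	enum rec fixed[] = { BLOB_PAGE_WRITE, BTREE_POINTER_UPDATE };
	int	p;

	for (p = 0; p <= 2; p++) {
		printf("prefix=%d  buggy:%s  fixed:%s\n", p,
		       corrupt_after_prefix(buggy, 2, p) ? "CORRUPT" : "ok",
		       corrupt_after_prefix(fixed, 2, p) ? "CORRUPT" : "ok");
	}
	return(0);
}
```

With the fixed ordering, every replayable prefix either lacks the BLOB
pointer entirely (the BLOB pages simply remain free) or contains both
the BLOB contents and the pointer.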
btr_cur_mtr_commit_and_start(): Remove this faulty function that was
introduced in the Bug#12612184 fix. The problem that this function was
trying to address was that when we did the mtr_commit() of the BLOB
writes before the mtr_commit() of the update, the new BLOB pages could have
overwritten clustered index B-tree leaf pages that were freed during
the update. If recovery applied the redo log of the BLOB writes but
did not see the log of the record update, the index tree would be
corrupted. The correct solution is to make the freed clustered index
pages unavailable to the BLOB allocation. This function is also a
likely culprit of InnoDB hangs that were observed when testing the
Bug#12612184 fix.
btr_mark_freed_leaves(): Mark all clustered index leaf pages that were
freed in a mini-transaction as allocated (nonfree=TRUE) before storing
the BLOBs, or as freed (nonfree=FALSE) before committing the
mini-transaction.
btr_freed_leaves_validate(): A debug function for checking that all
clustered index leaf pages that have been marked free in the
mini-transaction are consistent (have not been zeroed out).
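A stand-alone sketch of the memo scan these two functions perform (the
real implementation, which walks mtr->memo backwards and updates the
allocation bitmap, is in the btr0btr.c hunk below; the types and
printout here are simplifications):

```c
#include <stdio.h>
#include <stddef.h>

/* Toy model of btr_mark_freed_leaves(): scan the mini-transaction
   memo for freed clustered index leaf pages and either "fake
   allocate" them (nonfree) or free them again. */

enum slot_type { PAGE_X_FIX, FREE_CLUST_LEAF };

struct slot {
	enum slot_type	type;
	int		page_no;
};

static void
mark_freed_leaves(struct slot* memo, size_t n, int nonfree)
{
	size_t	i;

	/* The real code iterates the dyn_array backwards; the
	direction does not matter for this illustration. */
	for (i = 0; i < n; i++) {
		if (memo[i].type != FREE_CLUST_LEAF) {
			continue;
		}
		printf(nonfree
		       ? "fake-allocate page %d\n"
		       : "free page %d again\n",
		       memo[i].page_no);
	}
}

int
main(void)
{
	struct slot memo[] = { {PAGE_X_FIX, 3}, {FREE_CLUST_LEAF, 7} };

	mark_freed_leaves(memo, 2, 1);	/* before storing the BLOBs */
	mark_freed_leaves(memo, 2, 0);	/* before mtr_commit() */
	return(0);
}
```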
btr_page_alloc_low(): Refactored from btr_page_alloc(). Return the
number of the allocated page, or FIL_NULL if out of space. Add the
parameter "mtr_t* init_mtr" for specifying the mini-transaction where
the page should be initialized, or if this is a "fake allocation"
(init_mtr=NULL) by btr_mark_freed_leaves(nonfree=TRUE).
btr_page_alloc(): Add the parameter init_mtr, allowing the page to be
initialized and X-latched in a different mini-transaction than the one
that is used for the allocation. Invoke btr_page_alloc_low(). If a
clustered index leaf page was previously freed in mtr, remove it from
the memo of previously freed pages.
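Schematically, the parameter pair now admits three call shapes. The
sketch below uses a stubbed mtr_t so it compiles on its own; it is not
the real signature, which also takes the index, hint page, direction
and level:

```c
#include <stdio.h>

/* Illustration of the mtr/init_mtr split in btr_page_alloc() and
   btr_page_alloc_low().  mtr_t is stubbed here. */

typedef struct { const char* name; } mtr_t;

static void
page_alloc(mtr_t* mtr, mtr_t* init_mtr)
{
	printf("allocate in \"%s\", initialize in \"%s\"\n",
	       mtr->name,
	       init_mtr ? init_mtr->name : "(nobody: fake allocation)");
}

int
main(void)
{
	mtr_t	update_mtr = { "record-update mtr" };
	mtr_t	blob_mtr = { "per-BLOB mtr" };

	/* B-tree page split: allocate and initialize in one mtr. */
	page_alloc(&update_mtr, &update_mtr);
	/* BLOB page in an update: allocate in the update mtr,
	initialize and x-latch in the per-BLOB mtr. */
	page_alloc(&update_mtr, &blob_mtr);
	/* btr_mark_freed_leaves(nonfree=TRUE): reallocate a freed
	page without initializing it (btr_page_alloc_low() only). */
	page_alloc(&update_mtr, NULL);
	return(0);
}
```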
btr_page_free(): Assert that the page is a B-tree page and it has been
X-latched by the mini-transaction. If the freed page was a leaf page
of a clustered index, link it by a MTR_MEMO_FREE_CLUST_LEAF marker to
the mini-transaction.
btr_store_big_rec_extern_fields_func(): Add the parameter alloc_mtr,
which is NULL in inserts (the old behaviour) and the same as local_mtr
in updates. If alloc_mtr!=NULL, the BLOB pages will be allocated from it
instead of the mini-transaction that is used for writing the BLOBs.
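The two call shapes, again with a stubbed mtr_t so the fragment
compiles; store_big_rec() stands in for the real function:

```c
#include <stdio.h>

/* The alloc_mtr parameter of btr_store_big_rec_extern_fields(),
   modelled by store_big_rec(). */

typedef struct { const char* name; } mtr_t;

static void
store_big_rec(mtr_t* alloc_mtr, mtr_t* local_mtr)
{
	/* BLOB page allocation and the BLOB pointer writes go to
	alloc_mtr when it is given; otherwise each BLOB uses its own
	mini-transaction (the pre-patch behaviour). */
	printf("BLOB pages and pointers logged in: %s\n",
	       alloc_mtr ? alloc_mtr->name : "per-BLOB mtr");
	(void) local_mtr;
}

int
main(void)
{
	mtr_t	local_mtr = { "local_mtr (record update)" };

	store_big_rec(NULL, &local_mtr);	/* insert */
	store_big_rec(&local_mtr, &local_mtr);	/* update */
	return(0);
}
```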
fsp_alloc_from_free_frag(): Refactored from
fsp_alloc_free_page(). Allocate the specified page from a partially
free extent.
fseg_alloc_free_page_low(), fseg_alloc_free_page_general(): Add the
parameter "mtr_t* init_mtr" for specifying the mini-transaction where
the page should be initialized, or NULL if this is a "fake allocation"
that prevents the reuse of a previously freed B-tree page for BLOB
storage. If init_mtr==NULL, try harder to reallocate the specified page
and assert that it succeeded.
fsp_alloc_free_page(): Add the parameter "mtr_t* init_mtr" for
specifying the mini-transaction where the page should be initialized.
Do not allow init_mtr == NULL, because this function is never to be
used for "fake allocations".
mtr_t: Add the operation MTR_MEMO_FREE_CLUST_LEAF and the flag
mtr->freed_clust_leaf for quickly determining if any
MTR_MEMO_FREE_CLUST_LEAF operations have been posted.
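In the header this amounts to one new memo slot type and a one-bit
flag; a compilable reduction (field names as in the mtr0mtr.h hunk
below, the rest of mtr_struct omitted):

```c
#include <stdio.h>

#define MTR_MEMO_FREE_CLUST_LEAF 57	/* new memo slot type */

/* Reduction of mtr_struct to the two flag fields touched by the
   patch; both become one-bit bitfields. */
struct mtr_sketch {
	unsigned modifications:1;	/* TRUE if the mtr modified
					buffer pool pages */
	unsigned freed_clust_leaf:1;	/* TRUE if a
					MTR_MEMO_FREE_CLUST_LEAF slot
					was pushed to the memo */
};

int
main(void)
{
	struct mtr_sketch mtr = { 0, 0 };

	mtr.freed_clust_leaf = 1;	/* set by btr_page_free() */
	printf("freed_clust_leaf=%u\n", (unsigned) mtr.freed_clust_leaf);
	return(0);
}
```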
row_ins_index_entry_low(): When columns are being made off-page in
insert-by-update, invoke btr_mark_freed_leaves(nonfree=TRUE) and pass
the mini-transaction as the alloc_mtr to
btr_store_big_rec_extern_fields(). Finally, invoke
btr_mark_freed_leaves(nonfree=FALSE) to avoid leaking pages.
row_build(): Correct a comment, and add a debug assertion that a
record that contains NULL BLOB pointers must be a fresh insert.
row_upd_clust_rec(): When columns are being moved off-page, invoke
btr_mark_freed_leaves(nonfree=TRUE) and pass the mini-transaction as
the alloc_mtr to btr_store_big_rec_extern_fields(). Finally, invoke
btr_mark_freed_leaves(nonfree=FALSE) to avoid leaking pages.
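Both call sites now follow the same bracketing; a condensed,
self-contained sketch of the sequence (all functions stubbed, error
handling omitted; compare with the row0ins.c and row0upd.c hunks
below):

```c
#include <stdio.h>

/* The bracketing shared by row_ins_index_entry_low() and
   row_upd_clust_rec(): hide freed leaves, write the BLOBs in the
   update mtr, release the leaves, then commit once. */

typedef struct { int freed_clust_leaf; } mtr_t;

static void
btr_mark_freed_leaves_sketch(mtr_t* mtr, int nonfree)
{
	if (mtr->freed_clust_leaf) {
		printf(nonfree
		       ? "fake-allocate freed leaf pages\n"
		       : "free the leaf pages again\n");
	}
}

int
main(void)
{
	mtr_t	mtr = { 1 };	/* the update freed a leaf page */

	btr_mark_freed_leaves_sketch(&mtr, 1);
	/* btr_store_big_rec_extern_fields(index, rec, offsets,
	big_rec, &mtr, &mtr) in the real code: */
	printf("write BLOBs (their redo commits first)\n");
	btr_mark_freed_leaves_sketch(&mtr, 0);	/* no page leak */
	printf("mtr_commit()\n");
	return(0);
}
```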
buf_reset_check_index_page_at_flush(): Remove. The function
fsp_init_file_page_low() already sets
bpage->check_index_page_at_flush=FALSE.
There is a known issue in tablespace extension. If the request to
allocate a BLOB page leads to the tablespace being extended, crash
recovery could see BLOB writes to pages that are beyond the end of the
tablespace file. This should trigger an assertion failure in fil_io() at
crash recovery. The safe thing would be to write redo log about the
tablespace extension to the mini-transaction of the BLOB write, not to
the mini-transaction of the record update. However, there is no redo
log record for file extension in the current redo log format.
rb:693 approved by Sunny Bains
Diffstat (limited to 'storage/innobase')
| mode | path | lines changed |
|---|---|---|
| -rw-r--r-- | storage/innobase/btr/btr0btr.c | 222 |
| -rw-r--r-- | storage/innobase/btr/btr0cur.c | 92 |
| -rw-r--r-- | storage/innobase/buf/buf0buf.c | 23 |
| -rw-r--r-- | storage/innobase/fsp/fsp0fsp.c | 168 |
| -rw-r--r-- | storage/innobase/include/btr0btr.h | 31 |
| -rw-r--r-- | storage/innobase/include/btr0cur.h | 14 |
| -rw-r--r-- | storage/innobase/include/buf0buf.h | 9 |
| -rw-r--r-- | storage/innobase/include/fsp0fsp.h | 8 |
| -rw-r--r-- | storage/innobase/include/mtr0mtr.h | 7 |
| -rw-r--r-- | storage/innobase/include/mtr0mtr.ic | 4 |
| -rw-r--r-- | storage/innobase/mtr/mtr0mtr.c | 10 |
| -rw-r--r-- | storage/innobase/row/row0ins.c | 32 |
| -rw-r--r-- | storage/innobase/row/row0row.c | 38 |
| -rw-r--r-- | storage/innobase/row/row0upd.c | 23 |
| -rw-r--r-- | storage/innobase/trx/trx0undo.c | 2 |
15 files changed, 471 insertions, 212 deletions
diff --git a/storage/innobase/btr/btr0btr.c b/storage/innobase/btr/btr0btr.c index 790582815a3..ad99913cf3b 100644 --- a/storage/innobase/btr/btr0btr.c +++ b/storage/innobase/btr/btr0btr.c @@ -300,29 +300,30 @@ btr_page_alloc_for_ibuf( /****************************************************************** Allocates a new file page to be used in an index tree. NOTE: we assume that the caller has made the reservation for free extents! */ - -page_t* -btr_page_alloc( -/*===========*/ - /* out: new allocated page, x-latched; - NULL if out of space */ +static +ulint +btr_page_alloc_low( +/*===============*/ + /* out: allocated page number, + FIL_NULL if out of space */ dict_index_t* index, /* in: index */ ulint hint_page_no, /* in: hint of a good page */ byte file_direction, /* in: direction where a possible page split is made */ ulint level, /* in: level where the page is placed in the tree */ - mtr_t* mtr) /* in: mtr */ + mtr_t* mtr, /* in/out: mini-transaction + for the allocation */ + mtr_t* init_mtr) /* in/out: mini-transaction + in which the page should be + initialized (may be the same + as mtr), or NULL if it should + not be initialized (the page + at hint was previously freed + in mtr) */ { fseg_header_t* seg_header; page_t* root; - page_t* new_page; - ulint new_page_no; - - if (index->type & DICT_IBUF) { - - return(btr_page_alloc_for_ibuf(index, mtr)); - } root = btr_root_get(index, mtr); @@ -336,19 +337,61 @@ btr_page_alloc( reservation for free extents, and thus we know that a page can be allocated: */ - new_page_no = fseg_alloc_free_page_general(seg_header, hint_page_no, - file_direction, TRUE, mtr); + return(fseg_alloc_free_page_general(seg_header, hint_page_no, + file_direction, TRUE, + mtr, init_mtr)); +} + +/**************************************************************//** +Allocates a new file page to be used in an index tree. NOTE: we assume +that the caller has made the reservation for free extents! */ + +page_t* +btr_page_alloc( +/*===========*/ + /* out: new allocated block, x-latched; + NULL if out of space */ + dict_index_t* index, /* in: index */ + ulint hint_page_no, /* in: hint of a good page */ + byte file_direction, /* in: direction where a possible + page split is made */ + ulint level, /* in: level where the page is placed + in the tree */ + mtr_t* mtr, /* in/out: mini-transaction + for the allocation */ + mtr_t* init_mtr) /* in/out: mini-transaction + for x-latching and initializing + the page */ +{ + page_t* new_page; + ulint new_page_no; + + if (index->type & DICT_IBUF) { + + return(btr_page_alloc_for_ibuf(index, mtr)); + } + + new_page_no = btr_page_alloc_low( + index, hint_page_no, file_direction, level, mtr, init_mtr); + if (new_page_no == FIL_NULL) { return(NULL); } new_page = buf_page_get(dict_index_get_space(index), new_page_no, - RW_X_LATCH, mtr); + RW_X_LATCH, init_mtr); #ifdef UNIV_SYNC_DEBUG buf_page_dbg_add_level(new_page, SYNC_TREE_NODE_NEW); #endif /* UNIV_SYNC_DEBUG */ + if (mtr->freed_clust_leaf) { + mtr_memo_release(mtr, new_page, MTR_MEMO_FREE_CLUST_LEAF); + ut_ad(!mtr_memo_contains(mtr, buf_block_align(new_page), + MTR_MEMO_FREE_CLUST_LEAF)); + } + + ut_ad(btr_freed_leaves_validate(mtr)); return(new_page); } @@ -464,6 +507,16 @@ btr_page_free_low( page_no = buf_frame_get_page_no(page); fseg_free_page(seg_header, space, page_no, mtr); + + /* The page was marked free in the allocation bitmap, but it + should remain buffer-fixed until mtr_commit(mtr) or until it + is explicitly freed from the mini-transaction. 
*/ + ut_ad(mtr_memo_contains(mtr, buf_block_align(page), + MTR_MEMO_PAGE_X_FIX)); + /* TODO: Discard any operations on the page from the redo log + and remove the block from the flush list and the buffer pool. + This would free up buffer pool earlier and reduce writes to + both the tablespace and the redo log. */ } /****************************************************************** @@ -479,13 +532,144 @@ btr_page_free( { ulint level; + ut_ad(fil_page_get_type(page) == FIL_PAGE_INDEX); ut_ad(mtr_memo_contains(mtr, buf_block_align(page), MTR_MEMO_PAGE_X_FIX)); level = btr_page_get_level(page, mtr); btr_page_free_low(index, page, level, mtr); + + /* The handling of MTR_MEMO_FREE_CLUST_LEAF assumes this. */ + ut_ad(mtr_memo_contains(mtr, buf_block_align(page), + MTR_MEMO_PAGE_X_FIX)); + + if (level == 0 && (index->type & DICT_CLUSTERED)) { + /* We may have to call btr_mark_freed_leaves() to + temporarily mark the block nonfree for invoking + btr_store_big_rec_extern_fields() after an + update. Remember that the block was freed. */ + mtr->freed_clust_leaf = TRUE; + mtr_memo_push(mtr, buf_block_align(page), + MTR_MEMO_FREE_CLUST_LEAF); + } + + ut_ad(btr_freed_leaves_validate(mtr)); } +/**************************************************************//** +Marks all MTR_MEMO_FREE_CLUST_LEAF pages nonfree or free. +For invoking btr_store_big_rec_extern_fields() after an update, +we must temporarily mark freed clustered index pages allocated, so +that off-page columns will not be allocated from them. Between the +btr_store_big_rec_extern_fields() and mtr_commit() we have to +mark the pages free again, so that no pages will be leaked. */ + +void +btr_mark_freed_leaves( +/*==================*/ + dict_index_t* index, /* in/out: clustered index */ + mtr_t* mtr, /* in/out: mini-transaction */ + ibool nonfree)/* in: TRUE=mark nonfree, FALSE=mark freed */ +{ + /* This is loosely based on mtr_memo_release(). */ + + ulint offset; + + ut_ad(index->type & DICT_CLUSTERED); + ut_ad(mtr->magic_n == MTR_MAGIC_N); + ut_ad(mtr->state == MTR_ACTIVE); + + if (!mtr->freed_clust_leaf) { + return; + } + + offset = dyn_array_get_data_size(&mtr->memo); + + while (offset > 0) { + mtr_memo_slot_t* slot; + buf_block_t* block; + + offset -= sizeof *slot; + + slot = dyn_array_get_element(&mtr->memo, offset); + + if (slot->type != MTR_MEMO_FREE_CLUST_LEAF) { + continue; + } + + /* Because btr_page_alloc() does invoke + mtr_memo_release on MTR_MEMO_FREE_CLUST_LEAF, all + blocks tagged with MTR_MEMO_FREE_CLUST_LEAF in the + memo must still be clustered index leaf tree pages. */ + block = slot->object; + ut_a(buf_block_get_space(block) + == dict_index_get_space(index)); + ut_a(fil_page_get_type(buf_block_get_frame(block)) + == FIL_PAGE_INDEX); + ut_a(btr_page_get_level(buf_block_get_frame(block), mtr) == 0); + + if (nonfree) { + /* Allocate the same page again. */ + ulint page_no; + page_no = btr_page_alloc_low( + index, buf_block_get_page_no(block), + FSP_NO_DIR, 0, mtr, NULL); + ut_a(page_no == buf_block_get_page_no(block)); + } else { + /* Assert that the page is allocated and free it. */ + btr_page_free_low(index, buf_block_get_frame(block), + 0, mtr); + } + } + + ut_ad(btr_freed_leaves_validate(mtr)); +} + +#ifdef UNIV_DEBUG +/**************************************************************//** +Validates all pages marked MTR_MEMO_FREE_CLUST_LEAF. +See btr_mark_freed_leaves(). 
*/ + +ibool +btr_freed_leaves_validate( +/*======================*/ + /* out: TRUE if valid */ + mtr_t* mtr) /* in: mini-transaction */ +{ + ulint offset; + + ut_ad(mtr->magic_n == MTR_MAGIC_N); + ut_ad(mtr->state == MTR_ACTIVE); + + offset = dyn_array_get_data_size(&mtr->memo); + + while (offset > 0) { + mtr_memo_slot_t* slot; + buf_block_t* block; + + offset -= sizeof *slot; + + slot = dyn_array_get_element(&mtr->memo, offset); + + if (slot->type != MTR_MEMO_FREE_CLUST_LEAF) { + continue; + } + + ut_a(mtr->freed_clust_leaf); + /* Because btr_page_alloc() does invoke + mtr_memo_release on MTR_MEMO_FREE_CLUST_LEAF, all + blocks tagged with MTR_MEMO_FREE_CLUST_LEAF in the + memo must still be clustered index leaf tree pages. */ + block = slot->object; + ut_a(fil_page_get_type(buf_block_get_frame(block)) + == FIL_PAGE_INDEX); + ut_a(btr_page_get_level(buf_block_get_frame(block), mtr) == 0); + } + + return(TRUE); +} +#endif /* UNIV_DEBUG */ + /****************************************************************** Sets the child node file address in a node pointer. */ UNIV_INLINE @@ -1015,7 +1199,7 @@ btr_root_raise_and_insert( a node pointer to the new page, and then splitting the new page. */ new_page = btr_page_alloc(index, 0, FSP_NO_DIR, - btr_page_get_level(root, mtr), mtr); + btr_page_get_level(root, mtr), mtr, mtr); btr_page_create(new_page, index, mtr); @@ -1636,7 +1820,7 @@ func_start: /* 2. Allocate a new page to the index */ new_page = btr_page_alloc(cursor->index, hint_page_no, direction, - btr_page_get_level(page, mtr), mtr); + btr_page_get_level(page, mtr), mtr, mtr); btr_page_create(new_page, cursor->index, mtr); /* 3. Calculate the first record on the upper half-page, and the diff --git a/storage/innobase/btr/btr0cur.c b/storage/innobase/btr/btr0cur.c index 9ce09929f9a..a1dda8edf69 100644 --- a/storage/innobase/btr/btr0cur.c +++ b/storage/innobase/btr/btr0cur.c @@ -2051,43 +2051,6 @@ return_after_reservations: return(err); } -/***************************************************************** -Commits and restarts a mini-transaction so that it will retain an -x-lock on index->lock and the cursor page. */ - -void -btr_cur_mtr_commit_and_start( -/*=========================*/ - btr_cur_t* cursor, /* in: cursor */ - mtr_t* mtr) /* in/out: mini-transaction */ -{ - buf_block_t* block; - - block = buf_block_align(btr_cur_get_rec(cursor)); - - ut_ad(mtr_memo_contains(mtr, dict_index_get_lock(cursor->index), - MTR_MEMO_X_LOCK)); - ut_ad(mtr_memo_contains(mtr, block, MTR_MEMO_PAGE_X_FIX)); - /* Keep the locks across the mtr_commit(mtr). */ - rw_lock_x_lock(dict_index_get_lock(cursor->index)); - rw_lock_x_lock(&block->lock); - mutex_enter(&block->mutex); -#ifdef UNIV_SYNC_DEBUG - buf_block_buf_fix_inc_debug(block, __FILE__, __LINE__); -#else - buf_block_buf_fix_inc(block); -#endif - mutex_exit(&block->mutex); - /* Write out the redo log. */ - mtr_commit(mtr); - mtr_start(mtr); - /* Reassociate the locks with the mini-transaction. - They will be released on mtr_commit(mtr). 
*/ - mtr_memo_push(mtr, dict_index_get_lock(cursor->index), - MTR_MEMO_X_LOCK); - mtr_memo_push(mtr, block, MTR_MEMO_PAGE_X_FIX); -} - /*==================== B-TREE DELETE MARK AND UNMARK ===============*/ /******************************************************************** @@ -3494,6 +3457,11 @@ btr_store_big_rec_extern_fields( this function returns */ big_rec_t* big_rec_vec, /* in: vector containing fields to be stored externally */ + mtr_t* alloc_mtr, /* in/out: in an insert, NULL; + in an update, local_mtr for + allocating BLOB pages and + updating BLOB pointers; alloc_mtr + must not have freed any leaf pages */ mtr_t* local_mtr __attribute__((unused))) /* in: mtr containing the latch to rec and to the tree */ @@ -3514,6 +3482,8 @@ btr_store_big_rec_extern_fields( ulint i; mtr_t mtr; + ut_ad(local_mtr); + ut_ad(!alloc_mtr || alloc_mtr == local_mtr); ut_ad(rec_offs_validate(rec, index, offsets)); ut_ad(mtr_memo_contains(local_mtr, dict_index_get_lock(index), MTR_MEMO_X_LOCK)); @@ -3523,6 +3493,25 @@ btr_store_big_rec_extern_fields( space_id = buf_frame_get_space_id(rec); + if (alloc_mtr) { + /* Because alloc_mtr will be committed after + mtr, it is possible that the tablespace has been + extended when the B-tree record was updated or + inserted, or it will be extended while allocating + pages for big_rec. + + TODO: In mtr (not alloc_mtr), write a redo log record + about extending the tablespace to its current size, + and remember the current size. Whenever the tablespace + grows as pages are allocated, write further redo log + records to mtr. (Currently tablespace extension is not + covered by the redo log. If it were, the record would + only be written to alloc_mtr, which is committed after + mtr.) */ + } else { + alloc_mtr = &mtr; + } + /* We have to create a file segment to the tablespace for each field and put the pointer to the field in rec */ @@ -3549,7 +3538,7 @@ btr_store_big_rec_extern_fields( } page = btr_page_alloc(index, hint_page_no, - FSP_NO_DIR, 0, &mtr); + FSP_NO_DIR, 0, alloc_mtr, &mtr); if (page == NULL) { mtr_commit(&mtr); @@ -3603,37 +3592,42 @@ btr_store_big_rec_extern_fields( extern_len -= store_len; + if (alloc_mtr == &mtr) { #ifdef UNIV_SYNC_DEBUG - rec_page = + rec_page = #endif /* UNIV_SYNC_DEBUG */ - buf_page_get(space_id, - buf_frame_get_page_no(data), - RW_X_LATCH, &mtr); + buf_page_get( + space_id, + buf_frame_get_page_no(data), + RW_X_LATCH, &mtr); #ifdef UNIV_SYNC_DEBUG - buf_page_dbg_add_level(rec_page, SYNC_NO_ORDER_CHECK); + buf_page_dbg_add_level( + rec_page, SYNC_NO_ORDER_CHECK); #endif /* UNIV_SYNC_DEBUG */ + } + mlog_write_ulint(data + local_len + BTR_EXTERN_LEN, 0, - MLOG_4BYTES, &mtr); + MLOG_4BYTES, alloc_mtr); mlog_write_ulint(data + local_len + BTR_EXTERN_LEN + 4, big_rec_vec->fields[i].len - extern_len, - MLOG_4BYTES, &mtr); + MLOG_4BYTES, alloc_mtr); if (prev_page_no == FIL_NULL) { mlog_write_ulint(data + local_len + BTR_EXTERN_SPACE_ID, space_id, - MLOG_4BYTES, &mtr); + MLOG_4BYTES, alloc_mtr); mlog_write_ulint(data + local_len + BTR_EXTERN_PAGE_NO, page_no, - MLOG_4BYTES, &mtr); + MLOG_4BYTES, alloc_mtr); mlog_write_ulint(data + local_len + BTR_EXTERN_OFFSET, FIL_PAGE_DATA, - MLOG_4BYTES, &mtr); + MLOG_4BYTES, alloc_mtr); /* Set the bit denoting that this field in rec is stored externally */ @@ -3641,7 +3635,7 @@ btr_store_big_rec_extern_fields( rec_set_nth_field_extern_bit( rec, index, big_rec_vec->fields[i].field_no, - TRUE, &mtr); + TRUE, alloc_mtr); } prev_page_no = page_no; diff --git a/storage/innobase/buf/buf0buf.c 
b/storage/innobase/buf/buf0buf.c index 08e033e7a63..78b39812cff 100644 --- a/storage/innobase/buf/buf0buf.c +++ b/storage/innobase/buf/buf0buf.c @@ -1009,29 +1009,6 @@ buf_page_peek_block( } /************************************************************************ -Resets the check_index_page_at_flush field of a page if found in the buffer -pool. */ - -void -buf_reset_check_index_page_at_flush( -/*================================*/ - ulint space, /* in: space id */ - ulint offset) /* in: page number */ -{ - buf_block_t* block; - - mutex_enter_fast(&(buf_pool->mutex)); - - block = buf_page_hash_get(space, offset); - - if (block) { - block->check_index_page_at_flush = FALSE; - } - - mutex_exit(&(buf_pool->mutex)); -} - -/************************************************************************ Returns the current state of is_hashed of a page. FALSE if the page is not in the pool. NOTE that this operation does not fix the page in the pool if it is found there. */ diff --git a/storage/innobase/fsp/fsp0fsp.c b/storage/innobase/fsp/fsp0fsp.c index d228e683957..d5be8fca38f 100644 --- a/storage/innobase/fsp/fsp0fsp.c +++ b/storage/innobase/fsp/fsp0fsp.c @@ -293,15 +293,19 @@ fseg_alloc_free_page_low( /* out: the allocated page number, FIL_NULL if no page could be allocated */ ulint space, /* in: space */ - fseg_inode_t* seg_inode, /* in: segment inode */ + fseg_inode_t* seg_inode, /* in/out: segment inode */ ulint hint, /* in: hint of which page would be desirable */ byte direction, /* in: if the new page is needed because of an index page split, and records are inserted there in order, into which direction they go alphabetically: FSP_DOWN, FSP_UP, FSP_NO_DIR */ - mtr_t* mtr); /* in: mtr handle */ - + mtr_t* mtr, /* in/out: mini-transaction */ + mtr_t* init_mtr);/* in/out: mini-transaction in which the + page should be initialized + (may be the same as mtr), or NULL if it + should not be initialized (the page at hint + was previously freed in mtr) */ /************************************************************************** Reads the file space size stored in the header page. */ @@ -1371,6 +1375,43 @@ fsp_alloc_free_extent( return(descr); } +/**********************************************************************//** +Allocates a single free page from a space. */ +static __attribute__((nonnull)) +void +fsp_alloc_from_free_frag( +/*=====================*/ + fsp_header_t* header, /* in/out: tablespace header */ + xdes_t* descr, /* in/out: extent descriptor */ + ulint bit, /* in: slot to allocate in the extent */ + mtr_t* mtr) /* in/out: mini-transaction */ +{ + ulint frag_n_used; + + ut_ad(xdes_get_state(descr, mtr) == XDES_FREE_FRAG); + ut_a(xdes_get_bit(descr, XDES_FREE_BIT, bit, mtr)); + xdes_set_bit(descr, XDES_FREE_BIT, bit, FALSE, mtr); + + /* Update the FRAG_N_USED field */ + frag_n_used = mtr_read_ulint(header + FSP_FRAG_N_USED, MLOG_4BYTES, + mtr); + frag_n_used++; + mlog_write_ulint(header + FSP_FRAG_N_USED, frag_n_used, MLOG_4BYTES, + mtr); + if (xdes_is_full(descr, mtr)) { + /* The fragment is full: move it to another list */ + flst_remove(header + FSP_FREE_FRAG, descr + XDES_FLST_NODE, + mtr); + xdes_set_state(descr, XDES_FULL_FRAG, mtr); + + flst_add_last(header + FSP_FULL_FRAG, descr + XDES_FLST_NODE, + mtr); + mlog_write_ulint(header + FSP_FRAG_N_USED, + frag_n_used - FSP_EXTENT_SIZE, MLOG_4BYTES, + mtr); + } +} + /************************************************************************** Allocates a single free page from a space. The page is marked as used. 
*/ static @@ -1381,19 +1422,22 @@ fsp_alloc_free_page( be allocated */ ulint space, /* in: space id */ ulint hint, /* in: hint of which page would be desirable */ - mtr_t* mtr) /* in: mtr handle */ + mtr_t* mtr, /* in/out: mini-transaction */ + mtr_t* init_mtr)/* in/out: mini-transaction in which the + page should be initialized + (may be the same as mtr) */ { fsp_header_t* header; fil_addr_t first; xdes_t* descr; page_t* page; ulint free; - ulint frag_n_used; ulint page_no; ulint space_size; ibool success; ut_ad(mtr); + ut_ad(init_mtr); header = fsp_get_space_header(space, mtr); @@ -1441,6 +1485,7 @@ fsp_alloc_free_page( if (free == ULINT_UNDEFINED) { ut_print_buf(stderr, ((byte*)descr) - 500, 1000); + putc('\n', stderr); ut_error; } @@ -1472,40 +1517,21 @@ fsp_alloc_free_page( } } - xdes_set_bit(descr, XDES_FREE_BIT, free, FALSE, mtr); - - /* Update the FRAG_N_USED field */ - frag_n_used = mtr_read_ulint(header + FSP_FRAG_N_USED, MLOG_4BYTES, - mtr); - frag_n_used++; - mlog_write_ulint(header + FSP_FRAG_N_USED, frag_n_used, MLOG_4BYTES, - mtr); - if (xdes_is_full(descr, mtr)) { - /* The fragment is full: move it to another list */ - flst_remove(header + FSP_FREE_FRAG, descr + XDES_FLST_NODE, - mtr); - xdes_set_state(descr, XDES_FULL_FRAG, mtr); - - flst_add_last(header + FSP_FULL_FRAG, descr + XDES_FLST_NODE, - mtr); - mlog_write_ulint(header + FSP_FRAG_N_USED, - frag_n_used - FSP_EXTENT_SIZE, MLOG_4BYTES, - mtr); - } + fsp_alloc_from_free_frag(header, descr, free, mtr); /* Initialize the allocated page to the buffer pool, so that it can be obtained immediately with buf_page_get without need for a disk read. */ - buf_page_create(space, page_no, mtr); + buf_page_create(space, page_no, init_mtr); - page = buf_page_get(space, page_no, RW_X_LATCH, mtr); + page = buf_page_get(space, page_no, RW_X_LATCH, init_mtr); #ifdef UNIV_SYNC_DEBUG buf_page_dbg_add_level(page, SYNC_FSP_PAGE); #endif /* UNIV_SYNC_DEBUG */ /* Prior contents of the page should be ignored */ - fsp_init_file_page(page, mtr); + fsp_init_file_page(page, init_mtr); return(page_no); } @@ -1724,7 +1750,7 @@ fsp_alloc_seg_inode_page( space = buf_frame_get_space_id(space_header); - page_no = fsp_alloc_free_page(space, 0, mtr); + page_no = fsp_alloc_free_page(space, 0, mtr, mtr); if (page_no == FIL_NULL) { @@ -2094,7 +2120,8 @@ fseg_create_general( } if (page == 0) { - page = fseg_alloc_free_page_low(space, inode, 0, FSP_UP, mtr); + page = fseg_alloc_free_page_low(space, + inode, 0, FSP_UP, mtr, mtr); if (page == FIL_NULL) { @@ -2331,14 +2358,19 @@ fseg_alloc_free_page_low( /* out: the allocated page number, FIL_NULL if no page could be allocated */ ulint space, /* in: space */ - fseg_inode_t* seg_inode, /* in: segment inode */ + fseg_inode_t* seg_inode, /* in/out: segment inode */ ulint hint, /* in: hint of which page would be desirable */ byte direction, /* in: if the new page is needed because of an index page split, and records are inserted there in order, into which direction they go alphabetically: FSP_DOWN, FSP_UP, FSP_NO_DIR */ - mtr_t* mtr) /* in: mtr handle */ + mtr_t* mtr, /* in/out: mini-transaction */ + mtr_t* init_mtr)/* in/out: mini-transaction in which the + page should be initialized + (may be the same as mtr), or NULL if it + should not be initialized (the page at hint + was previously freed in mtr) */ { fsp_header_t* space_header; ulint space_size; @@ -2350,7 +2382,6 @@ fseg_alloc_free_page_low( if could not be allocated */ xdes_t* ret_descr; /* the extent of the allocated page */ page_t* page; - ibool 
frag_page_allocated = FALSE; ibool success; ulint n; @@ -2371,6 +2402,8 @@ fseg_alloc_free_page_low( if (descr == NULL) { /* Hint outside space or too high above free limit: reset hint */ + ut_a(init_mtr); + /* The file space header page is always allocated. */ hint = 0; descr = xdes_get_descriptor(space, hint, mtr); } @@ -2382,15 +2415,20 @@ fseg_alloc_free_page_low( mtr), seg_id)) && (xdes_get_bit(descr, XDES_FREE_BIT, hint % FSP_EXTENT_SIZE, mtr) == TRUE)) { - +take_hinted_page: /* 1. We can take the hinted page =================================*/ ret_descr = descr; ret_page = hint; + /* Skip the check for extending the tablespace. If the + page hint were not within the size of the tablespace, + we would have got (descr == NULL) above and reset the hint. */ + goto got_hinted_page; /*-----------------------------------------------------------*/ - } else if ((xdes_get_state(descr, mtr) == XDES_FREE) - && ((reserved - used) < reserved / FSEG_FILLFACTOR) - && (used >= FSEG_FRAG_LIMIT)) { + } else if (xdes_get_state(descr, mtr) == XDES_FREE + && (!init_mtr + || ((reserved - used < reserved / FSEG_FILLFACTOR) + && used >= FSEG_FRAG_LIMIT))) { /* 2. We allocate the free extent from space and can take ========================================================= @@ -2408,8 +2446,20 @@ fseg_alloc_free_page_low( /* Try to fill the segment free list */ fseg_fill_free_list(seg_inode, space, hint + FSP_EXTENT_SIZE, mtr); - ret_page = hint; + goto take_hinted_page; /*-----------------------------------------------------------*/ + } else if (!init_mtr) { + ut_a(xdes_get_state(descr, mtr) == XDES_FREE_FRAG); + fsp_alloc_from_free_frag(space_header, descr, + hint % FSP_EXTENT_SIZE, mtr); + ret_page = hint; + ret_descr = NULL; + + /* Put the page in the fragment page array of the segment */ + n = fseg_find_free_frag_page_slot(seg_inode, mtr); + ut_a(n != FIL_NULL); + fseg_set_nth_frag_page_no(seg_inode, n, ret_page, mtr); + goto got_hinted_page; } else if ((direction != FSP_NO_DIR) && ((reserved - used) < reserved / FSEG_FILLFACTOR) && (used >= FSEG_FRAG_LIMIT) @@ -2467,11 +2517,9 @@ fseg_alloc_free_page_low( } else if (used < FSEG_FRAG_LIMIT) { /* 6. We allocate an individual page from the space ===================================================*/ - ret_page = fsp_alloc_free_page(space, hint, mtr); + ret_page = fsp_alloc_free_page(space, hint, mtr, init_mtr); ret_descr = NULL; - frag_page_allocated = TRUE; - if (ret_page != FIL_NULL) { /* Put the page in the fragment page array of the segment */ @@ -2481,6 +2529,10 @@ fseg_alloc_free_page_low( fseg_set_nth_frag_page_no(seg_inode, n, ret_page, mtr); } + + /* fsp_alloc_free_page() invoked fsp_init_file_page() + already. */ + return(ret_page); /*-----------------------------------------------------------*/ } else { /* 7. We allocate a new extent and take its first page @@ -2527,22 +2579,31 @@ fseg_alloc_free_page_low( } } - if (!frag_page_allocated) { +got_hinted_page: + { /* Initialize the allocated page to buffer pool, so that it can be obtained immediately with buf_page_get without need for a disk read */ + mtr_t* block_mtr = init_mtr ? 
init_mtr : mtr; - page = buf_page_create(space, ret_page, mtr); + page = buf_page_create(space, ret_page, block_mtr); - ut_a(page == buf_page_get(space, ret_page, RW_X_LATCH, mtr)); + ut_a(page == buf_page_get(space, ret_page, RW_X_LATCH, + block_mtr)); #ifdef UNIV_SYNC_DEBUG buf_page_dbg_add_level(page, SYNC_FSP_PAGE); #endif /* UNIV_SYNC_DEBUG */ - /* The prior contents of the page should be ignored */ - fsp_init_file_page(page, mtr); + if (init_mtr) { + /* The prior contents of the page should be ignored */ + fsp_init_file_page(page, init_mtr); + } + } + /* ret_descr == NULL if the block was allocated from free_frag + (XDES_FREE_FRAG) */ + if (ret_descr != NULL) { /* At this point we know the extent and the page offset. The extent is still in the appropriate list (FSEG_NOT_FULL or FSEG_FREE), and the page is not yet marked as used. */ @@ -2554,8 +2615,6 @@ fseg_alloc_free_page_low( fseg_mark_page_used(seg_inode, space, ret_page, mtr); } - buf_reset_check_index_page_at_flush(space, ret_page); - return(ret_page); } @@ -2569,7 +2628,7 @@ fseg_alloc_free_page_general( /*=========================*/ /* out: allocated page offset, FIL_NULL if no page could be allocated */ - fseg_header_t* seg_header,/* in: segment header */ + fseg_header_t* seg_header,/* in/out: segment header */ ulint hint, /* in: hint of which page would be desirable */ byte direction,/* in: if the new page is needed because of an index page split, and records are @@ -2581,7 +2640,11 @@ fseg_alloc_free_page_general( with fsp_reserve_free_extents, then there is no need to do the check for this individual page */ - mtr_t* mtr) /* in: mtr handle */ + mtr_t* mtr, /* in/out: mini-transaction handle */ + mtr_t* init_mtr)/* in/out: mtr or another mini-transaction + in which the page should be initialized, + or NULL if this is a "fake allocation" of + a page that was previously freed in mtr */ { fseg_inode_t* inode; ulint space; @@ -2619,7 +2682,8 @@ fseg_alloc_free_page_general( } page_no = fseg_alloc_free_page_low(buf_frame_get_space_id(inode), - inode, hint, direction, mtr); + inode, hint, direction, + mtr, init_mtr); if (!has_done_reservation) { fil_space_release_free_extents(space, n_reserved); } @@ -2647,7 +2711,7 @@ fseg_alloc_free_page( mtr_t* mtr) /* in: mtr handle */ { return(fseg_alloc_free_page_general(seg_header, hint, direction, - FALSE, mtr)); + FALSE, mtr, mtr)); } /************************************************************************** diff --git a/storage/innobase/include/btr0btr.h b/storage/innobase/include/btr0btr.h index 269fa355558..3988019589d 100644 --- a/storage/innobase/include/btr0btr.h +++ b/storage/innobase/include/btr0btr.h @@ -379,7 +379,11 @@ btr_page_alloc( page split is made */ ulint level, /* in: level where the page is placed in the tree */ - mtr_t* mtr); /* in: mtr */ + mtr_t* mtr, /* in/out: mini-transaction + for the allocation */ + mtr_t* init_mtr); /* in/out: mini-transaction + for x-latching and initializing + the page */ /****************************************************************** Frees a file page used in an index tree. NOTE: cannot free field external storage pages because the page must contain info on its level. */ @@ -402,6 +406,31 @@ btr_page_free_low( page_t* page, /* in: page to be freed, x-latched */ ulint level, /* in: page level */ mtr_t* mtr); /* in: mtr */ +/**************************************************************//** +Marks all MTR_MEMO_FREE_CLUST_LEAF pages nonfree or free. 
+For invoking btr_store_big_rec_extern_fields() after an update, +we must temporarily mark freed clustered index pages allocated, so +that off-page columns will not be allocated from them. Between the +btr_store_big_rec_extern_fields() and mtr_commit() we have to +mark the pages free again, so that no pages will be leaked. */ + +void +btr_mark_freed_leaves( +/*==================*/ + dict_index_t* index, /* in/out: clustered index */ + mtr_t* mtr, /* in/out: mini-transaction */ + ibool nonfree);/* in: TRUE=mark nonfree, FALSE=mark freed */ +#ifdef UNIV_DEBUG +/**************************************************************//** +Validates all pages marked MTR_MEMO_FREE_CLUST_LEAF. +See btr_mark_freed_leaves(). */ + +ibool +btr_freed_leaves_validate( +/*======================*/ + /* out: TRUE if valid */ + mtr_t* mtr); /* in: mini-transaction */ +#endif /* UNIV_DEBUG */ #ifdef UNIV_BTR_PRINT /***************************************************************** Prints size info of a B-tree. */ diff --git a/storage/innobase/include/btr0cur.h b/storage/innobase/include/btr0cur.h index c068d8d3318..c2bf84ef9cb 100644 --- a/storage/innobase/include/btr0cur.h +++ b/storage/innobase/include/btr0cur.h @@ -252,15 +252,6 @@ btr_cur_pessimistic_update( updates */ que_thr_t* thr, /* in: query thread */ mtr_t* mtr); /* in: mtr */ -/***************************************************************** -Commits and restarts a mini-transaction so that it will retain an -x-lock on index->lock and the cursor page. */ - -void -btr_cur_mtr_commit_and_start( -/*=========================*/ - btr_cur_t* cursor, /* in: cursor */ - mtr_t* mtr); /* in/out: mini-transaction */ /*************************************************************** Marks a clustered index record deleted. Writes an undo log record to undo log on this delete marking. Writes in the trx id field the id @@ -471,6 +462,11 @@ btr_store_big_rec_extern_fields( this function returns */ big_rec_t* big_rec_vec, /* in: vector containing fields to be stored externally */ + mtr_t* alloc_mtr, /* in/out: in an insert, NULL; + in an update, local_mtr for + allocating BLOB pages and + updating BLOB pointers; alloc_mtr + must not have freed any leaf pages */ mtr_t* local_mtr); /* in: mtr containing the latch to rec and to the tree */ /*********************************************************************** diff --git a/storage/innobase/include/buf0buf.h b/storage/innobase/include/buf0buf.h index 7479ce9cbf0..87b2f6172de 100644 --- a/storage/innobase/include/buf0buf.h +++ b/storage/innobase/include/buf0buf.h @@ -294,15 +294,6 @@ buf_page_peek_block( ulint space, /* in: space id */ ulint offset);/* in: page number */ /************************************************************************ -Resets the check_index_page_at_flush field of a page if found in the buffer -pool. */ - -void -buf_reset_check_index_page_at_flush( -/*================================*/ - ulint space, /* in: space id */ - ulint offset);/* in: page number */ -/************************************************************************ Sets file_page_was_freed TRUE if the page is found in the buffer pool. 
This function should be called when we free a file page and want the debug version to check that it is not accessed any more unless diff --git a/storage/innobase/include/fsp0fsp.h b/storage/innobase/include/fsp0fsp.h index 17bfbeec2c1..4c58d6075e6 100644 --- a/storage/innobase/include/fsp0fsp.h +++ b/storage/innobase/include/fsp0fsp.h @@ -167,7 +167,7 @@ fseg_alloc_free_page_general( /*=========================*/ /* out: allocated page offset, FIL_NULL if no page could be allocated */ - fseg_header_t* seg_header,/* in: segment header */ + fseg_header_t* seg_header,/* in/out: segment header */ ulint hint, /* in: hint of which page would be desirable */ byte direction,/* in: if the new page is needed because of an index page split, and records are @@ -179,7 +179,11 @@ fseg_alloc_free_page_general( with fsp_reserve_free_extents, then there is no need to do the check for this individual page */ - mtr_t* mtr); /* in: mtr handle */ + mtr_t* mtr, /* in/out: mini-transaction */ + mtr_t* init_mtr);/* in/out: mtr or another mini-transaction + in which the page should be initialized, + or NULL if this is a "fake allocation" of + a page that was previously freed in mtr */ /************************************************************************** Reserves free pages from a tablespace. All mini-transactions which may use several pages from the tablespace should call this function beforehand diff --git a/storage/innobase/include/mtr0mtr.h b/storage/innobase/include/mtr0mtr.h index a6e2976830b..a0a51dbbd17 100644 --- a/storage/innobase/include/mtr0mtr.h +++ b/storage/innobase/include/mtr0mtr.h @@ -36,6 +36,8 @@ first 3 values must be RW_S_LATCH, RW_X_LATCH, RW_NO_LATCH */ #define MTR_MEMO_MODIFY 54 #define MTR_MEMO_S_LOCK 55 #define MTR_MEMO_X_LOCK 56 +/* The mini-transaction freed a clustered index leaf page. */ +#define MTR_MEMO_FREE_CLUST_LEAF 57 /* Log item types: we have made them to be of the type 'byte' for the compiler to warn if val and type parameters are switched @@ -325,9 +327,12 @@ struct mtr_struct{ ulint state; /* MTR_ACTIVE, MTR_COMMITTING, MTR_COMMITTED */ dyn_array_t memo; /* memo stack for locks etc. 
*/ dyn_array_t log; /* mini-transaction log */ - ibool modifications; + unsigned modifications:1; /* TRUE if the mtr made modifications to buffer pool pages */ + unsigned freed_clust_leaf:1; + /* TRUE if MTR_MEMO_FREE_CLUST_LEAF + was logged in the mini-transaction */ ulint n_log_recs; /* count of how many page initial log records have been written to the mtr log */ diff --git a/storage/innobase/include/mtr0mtr.ic b/storage/innobase/include/mtr0mtr.ic index 81eec3bfc92..6b4cacf0766 100644 --- a/storage/innobase/include/mtr0mtr.ic +++ b/storage/innobase/include/mtr0mtr.ic @@ -26,6 +26,7 @@ mtr_start( mtr->log_mode = MTR_LOG_ALL; mtr->modifications = FALSE; + mtr->freed_clust_leaf = FALSE; mtr->n_log_recs = 0; #ifdef UNIV_DEBUG @@ -50,7 +51,8 @@ mtr_memo_push( ut_ad(object); ut_ad(type >= MTR_MEMO_PAGE_S_FIX); - ut_ad(type <= MTR_MEMO_X_LOCK); + ut_ad(type <= MTR_MEMO_FREE_CLUST_LEAF); + ut_ad(type != MTR_MEMO_FREE_CLUST_LEAF || mtr->freed_clust_leaf); ut_ad(mtr); ut_ad(mtr->magic_n == MTR_MAGIC_N); diff --git a/storage/innobase/mtr/mtr0mtr.c b/storage/innobase/mtr/mtr0mtr.c index 365fa15878a..a11e20ca661 100644 --- a/storage/innobase/mtr/mtr0mtr.c +++ b/storage/innobase/mtr/mtr0mtr.c @@ -53,17 +53,13 @@ mtr_memo_slot_release( buf_page_release((buf_block_t*)object, type, mtr); } else if (type == MTR_MEMO_S_LOCK) { rw_lock_s_unlock((rw_lock_t*)object); -#ifdef UNIV_DEBUG - } else if (type == MTR_MEMO_X_LOCK) { - rw_lock_x_unlock((rw_lock_t*)object); - } else { - ut_ad(type == MTR_MEMO_MODIFY); + } else if (type != MTR_MEMO_X_LOCK) { + ut_ad(type == MTR_MEMO_MODIFY + || type == MTR_MEMO_FREE_CLUST_LEAF); ut_ad(mtr_memo_contains(mtr, object, MTR_MEMO_PAGE_X_FIX)); -#else } else { rw_lock_x_unlock((rw_lock_t*)object); -#endif } } diff --git a/storage/innobase/row/row0ins.c b/storage/innobase/row/row0ins.c index 7ff443a11ad..6366beb6b47 100644 --- a/storage/innobase/row/row0ins.c +++ b/storage/innobase/row/row0ins.c @@ -2089,15 +2089,20 @@ row_ins_index_entry_low( if (big_rec) { ut_a(err == DB_SUCCESS); /* Write out the externally stored - columns while still x-latching - index->lock and block->lock. We have - to mtr_commit(mtr) first, so that the - redo log will be written in the - correct order. Otherwise, we would run - into trouble on crash recovery if mtr - freed B-tree pages on which some of - the big_rec fields will be written. */ - btr_cur_mtr_commit_and_start(&cursor, &mtr); + columns, but allocate the pages and + write the pointers using the + mini-transaction of the record update. + If any pages were freed in the update, + temporarily mark them allocated so + that off-page columns will not + overwrite them. We must do this, + because we will write the redo log for + the BLOB writes before writing the + redo log for the record update. Thus, + redo log application at crash recovery + will see BLOBs being written to free pages. */ + + btr_mark_freed_leaves(index, &mtr, TRUE); rec = btr_cur_get_rec(&cursor); offsets = rec_get_offsets(rec, index, offsets, @@ -2105,7 +2110,8 @@ row_ins_index_entry_low( &heap); err = btr_store_big_rec_extern_fields( - index, rec, offsets, big_rec, &mtr); + index, rec, offsets, big_rec, + &mtr, &mtr); /* If writing big_rec fails (for example, because of DB_OUT_OF_FILE_SPACE), the record will be corrupted. Even if @@ -2118,6 +2124,9 @@ row_ins_index_entry_low( undo log, and thus the record cannot be rolled back. */ ut_a(err == DB_SUCCESS); + /* Free the pages again + in order to avoid a leak. 
*/ + btr_mark_freed_leaves(index, &mtr, FALSE); goto stored_big_rec; } } else { @@ -2165,7 +2174,8 @@ function_exit: ULINT_UNDEFINED, &heap); err = btr_store_big_rec_extern_fields(index, rec, - offsets, big_rec, &mtr); + offsets, big_rec, + NULL, &mtr); stored_big_rec: if (modify) { dtuple_big_rec_free(big_rec); diff --git a/storage/innobase/row/row0row.c b/storage/innobase/row/row0row.c index 171039e34ac..ccb3c1f7781 100644 --- a/storage/innobase/row/row0row.c +++ b/storage/innobase/row/row0row.c @@ -212,23 +212,27 @@ row_build( } #if defined UNIV_DEBUG || defined UNIV_BLOB_LIGHT_DEBUG - /* This condition can occur during crash recovery before - trx_rollback_or_clean_all_without_sess() has completed - execution. - - This condition is possible if the server crashed - during an insert or update before - btr_store_big_rec_extern_fields() did mtr_commit() all - BLOB pointers to the clustered index record. - - If the record contains a null BLOB pointer, look up the - transaction that holds the implicit lock on this record, and - assert that it is active. (In this version of InnoDB, we - cannot assert that it was recovered, because there is no - trx->is_recovered field.) */ - - ut_a(!rec_offs_any_null_extern(rec, offsets) - || trx_assert_active(row_get_rec_trx_id(rec, index, offsets))); + if (rec_offs_any_null_extern(rec, offsets)) { + /* This condition can occur during crash recovery + before trx_rollback_or_clean_all_without_sess() has + completed execution. + + This condition is possible if the server crashed + during an insert or update before + btr_store_big_rec_extern_fields() did mtr_commit() all + BLOB pointers to the clustered index record. + + If the record contains a null BLOB pointer, look up the + transaction that holds the implicit lock on this record, and + assert that it is active. (In this version of InnoDB, we + cannot assert that it was recovered, because there is no + trx->is_recovered field.) */ + + ut_a(trx_assert_active( + row_get_rec_trx_id(rec, index, offsets))); + ut_a(trx_undo_roll_ptr_is_insert( + row_get_rec_roll_ptr(rec, index, offsets))); + } #endif /* UNIV_DEBUG || UNIV_BLOB_LIGHT_DEBUG */ if (type != ROW_COPY_POINTERS) { diff --git a/storage/innobase/row/row0upd.c b/storage/innobase/row/row0upd.c index 694b00ea265..58739edfd98 100644 --- a/storage/innobase/row/row0upd.c +++ b/storage/innobase/row/row0upd.c @@ -1591,21 +1591,22 @@ row_upd_clust_rec( *offsets_ = (sizeof offsets_) / sizeof *offsets_; ut_a(err == DB_SUCCESS); - /* Write out the externally stored columns while still - x-latching index->lock and block->lock. We have to - mtr_commit(mtr) first, so that the redo log will be - written in the correct order. Otherwise, we would run - into trouble on crash recovery if mtr freed B-tree - pages on which some of the big_rec fields will be - written. */ - btr_cur_mtr_commit_and_start(btr_cur, mtr); - + /* Write out the externally stored columns, but + allocate the pages and write the pointers using the + mini-transaction of the record update. If any pages + were freed in the update, temporarily mark them + allocated so that off-page columns will not overwrite + them. We must do this, because we write the redo log + for the BLOB writes before writing the redo log for + the record update. 
*/ + + btr_mark_freed_leaves(index, mtr, TRUE); rec = btr_cur_get_rec(btr_cur); err = btr_store_big_rec_extern_fields( index, rec, rec_get_offsets(rec, index, offsets_, ULINT_UNDEFINED, &heap), - big_rec, mtr); + big_rec, mtr, mtr); if (UNIV_LIKELY_NULL(heap)) { mem_heap_free(heap); } @@ -1618,6 +1619,8 @@ row_upd_clust_rec( to the undo log, and thus the record cannot be rolled back. */ ut_a(err == DB_SUCCESS); + /* Free the pages again in order to avoid a leak. */ + btr_mark_freed_leaves(index, mtr, FALSE); } mtr_commit(mtr); diff --git a/storage/innobase/trx/trx0undo.c b/storage/innobase/trx/trx0undo.c index 329565943c8..ce09862f317 100644 --- a/storage/innobase/trx/trx0undo.c +++ b/storage/innobase/trx/trx0undo.c @@ -864,7 +864,7 @@ trx_undo_add_page( page_no = fseg_alloc_free_page_general(header_page + TRX_UNDO_SEG_HDR + TRX_UNDO_FSEG_HEADER, undo->top_page_no + 1, FSP_UP, - TRUE, mtr); + TRUE, mtr, mtr); fil_space_release_free_extents(undo->space, n_reserved); |