From f77329ace9a8a415b05ad473970de6dc187327e7 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Marko=20M=C3=A4kel=C3=A4?=
Date: Fri, 17 Feb 2012 11:42:04 +0200
Subject: Bug#13721257 RACE CONDITION IN UPDATES OR INSERTS OF WIDE RECORDS

This bug was originally filed and fixed as Bug#12612184. The original
fix was buggy, and it was patched by Bug#12704861. That patch was also
buggy (potentially breaking crash recovery), and both fixes were
reverted.

This fix was not ported to the built-in InnoDB of MySQL 5.1, because
the function signatures of many core functions are different from
InnoDB Plugin and later versions. The block allocation routines and
their callers would have to be changed so that they handle block
descriptors instead of page frames.

When a record is updated so that its size grows, non-updated columns
can be selected for external (off-page) storage. The bug is that the
initially inserted updated record contains an all-zero BLOB pointer to
the field that was not updated. Only after the BLOB pages have been
allocated and written can the valid pointer be written to the record.

Between the release of the page latch in mtr_commit(mtr) after
btr_cur_pessimistic_update() and the re-latching of the page in
btr_pcur_restore_position(), other threads can see the invalid BLOB
pointer consisting of 20 zero bytes. Moreover, if the system crashes
at this point, the situation could persist after crash recovery, and
the contents of the non-updated column would be permanently lost.

The problem is amplified by ROW_FORMAT=DYNAMIC and
ROW_FORMAT=COMPRESSED, which were introduced in
innodb_file_format=barracuda in InnoDB Plugin, but the bug does exist
in all InnoDB versions.

The fix is as follows. After a pessimistic B-tree operation that needs
to write out off-page columns, allocate the pages for these columns in
the mini-transaction that performed the B-tree operation (btr_mtr),
but write the pages in a separate mini-transaction (blob_mtr).
Do mtr_commit(blob_mtr) before mtr_commit(btr_mtr). A quirk: Do not
reuse pages that were previously freed in btr_mtr. Only write the
off-page columns to 'fresh' pages.

In this way, crash recovery will see redo log entries for blob_mtr
before any redo log entry for btr_mtr. It will apply the BLOB page
writes to pages that were marked free at that point. If crash recovery
fails to see all of the btr_mtr redo log, there will be some
unreachable BLOB data in free pages, but the B-tree will be in a
consistent state.

btr_page_alloc_low(): Renamed from btr_page_alloc(). Add the parameter
init_mtr. Return an allocated block, or NULL. If init_mtr!=mtr but the
page was already X-latched in mtr, do not initialize the page.

btr_page_alloc(): Wrapper for btr_page_alloc_for_ibuf() and
btr_page_alloc_low().

btr_page_free(): Add a debug assertion that the page was a B-tree page.

btr_lift_page_up(): Return the father block.

btr_compress(), btr_cur_compress_if_useful(): Add the parameter ibool
adjust, for adjusting the cursor position.

btr_cur_pessimistic_update(): Preserve the cursor position when
big_rec will be written and the new flag BTR_KEEP_POS_FLAG is defined.
Remove a duplicate rec_get_offsets() call. Keep the X-latch on
index->lock when big_rec is needed.

btr_store_big_rec_extern_fields(): Replace update_inplace with an
operation code, and local_mtr with btr_mtr. When not doing a fresh
insert and btr_mtr has freed pages, put aside any pages that were
previously X-latched in btr_mtr, and free the pages after writing out
all data. The data must be written to 'fresh' pages, because btr_mtr
will be committed and written to the redo log after the BLOB writes
have been written to the redo log.

btr_blob_op_is_update(): Check if an operation passed to
btr_store_big_rec_extern_fields() is an update or insert-by-update.

fseg_alloc_free_page_low(), fsp_alloc_free_page(),
fseg_alloc_free_extent(), fseg_alloc_free_page_general(): Add the
parameter init_mtr.
Return an allocated block, or NULL. If init_mtr!=mtr but the page was
already X-latched in mtr, do not initialize the page.

xdes_get_descriptor_with_space_hdr(): Assert that the file space
header is being X-latched.

fsp_alloc_from_free_frag(): Refactored from fsp_alloc_free_page().

fsp_page_create(): New function, for allocating, X-latching and
potentially initializing a page. If init_mtr!=mtr but the page was
already X-latched in mtr, do not initialize the page.

fsp_free_page(): Add ut_ad(0) to the error outcomes.

fsp_free_page(), fseg_free_page_low(): Increment mtr->n_freed_pages.

fsp_alloc_seg_inode_page(), fseg_create_general(): Assert that the
page was not previously X-latched in the mini-transaction. A file
segment or inode page should never be allocated in the middle of a
mini-transaction that frees pages, such as
btr_cur_pessimistic_delete().

fseg_alloc_free_page_low(): If the hinted page was allocated, skip the
check of whether the tablespace should be extended. Return NULL
instead of FIL_NULL on failure. Remove the flag frag_page_allocated.
Instead, return directly, because the page would already have been
initialized.

fseg_find_free_frag_page_slot() would return ULINT_UNDEFINED on error,
not FIL_NULL. Correct a bogus assertion.

fseg_alloc_free_page(): Redefine as a wrapper macro around
fseg_alloc_free_page_general().

buf_block_buf_fix_inc(): Move the definition from buf0buf.ic to
buf0buf.h, so that it can be called from other modules.

mtr_t: Add n_freed_pages (number of pages that have been freed).

page_rec_get_nth_const(), page_rec_get_nth(): The inverse of
page_rec_get_n_recs_before(): get the nth record of the record list.
This is faster than iterating the linked list. Refactored from
page_get_middle_rec().

trx_undo_rec_copy(): Add a debug assertion for the length.

trx_undo_add_page(): Return a block descriptor or NULL instead of a
page number or FIL_NULL.

trx_undo_report_row_operation(): Add debug assertions.
trx_sys_create_doublewrite_buf(): Assert that each page was not
previously X-latched.

page_cur_insert_rec_zip_reorg(): Make use of page_rec_get_nth().

row_ins_clust_index_entry_by_modify(): Pass BTR_KEEP_POS_FLAG, so that
the repositioning of the cursor can be avoided.

row_ins_index_entry_low(): Add DEBUG_SYNC points before and after
writing off-page columns. If inserting by updating a delete-marked
record, do not reposition the cursor or commit the mini-transaction
before writing the off-page columns.

row_build(): Tighten a debug assertion about null BLOB pointers.

row_upd_clust_rec(): Add DEBUG_SYNC points before and after writing
off-page columns. Do not reposition the cursor or commit the
mini-transaction before writing the off-page columns.

rb:939 approved by Jimmy Yang
---
 .../suite/innodb_plugin/r/innodb-blob.result      | 119 +++++++++++
 mysql-test/suite/innodb_plugin/t/innodb-blob.test | 218 +++++++++++++++++++++
 2 files changed, 337 insertions(+)
 create mode 100644 mysql-test/suite/innodb_plugin/r/innodb-blob.result
 create mode 100644 mysql-test/suite/innodb_plugin/t/innodb-blob.test

(limited to 'mysql-test')

diff --git a/mysql-test/suite/innodb_plugin/r/innodb-blob.result b/mysql-test/suite/innodb_plugin/r/innodb-blob.result
new file mode 100644
index 00000000000..b0b6bb9e5e2
--- /dev/null
+++ b/mysql-test/suite/innodb_plugin/r/innodb-blob.result
@@ -0,0 +1,119 @@
+CREATE TABLE t1 (a INT PRIMARY KEY, b TEXT) ENGINE=InnoDB;
+CREATE TABLE t2 (a INT PRIMARY KEY) ENGINE=InnoDB;
+CREATE TABLE t3 (a INT PRIMARY KEY, b TEXT, c TEXT) ENGINE=InnoDB;
+INSERT INTO t1 VALUES (1,REPEAT('a',30000)),(2,REPEAT('b',40000));
+SET DEBUG_SYNC='before_row_upd_extern SIGNAL have_latch WAIT_FOR go1';
+BEGIN;
+UPDATE t1 SET a=a+2;
+ROLLBACK;
+BEGIN;
+UPDATE t1 SET b=CONCAT(b,'foo');
+SET DEBUG_SYNC='now WAIT_FOR have_latch';
+SELECT a, RIGHT(b,20) FROM t1;
+SET DEBUG_SYNC='now SIGNAL go1';
+a RIGHT(b,20)
+1 aaaaaaaaaaaaaaaaaaaa
+2 bbbbbbbbbbbbbbbbbbbb
+SET DEBUG='+d,row_ins_extern_checkpoint';
+SET DEBUG_SYNC='before_row_ins_extern_latch SIGNAL rec_not_blob WAIT_FOR crash';
+ROLLBACK;
+BEGIN;
+INSERT INTO t1 VALUES (3,REPEAT('c',50000));
+SET DEBUG_SYNC='now WAIT_FOR rec_not_blob';
+SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
+SELECT @@tx_isolation;
+@@tx_isolation
+READ-UNCOMMITTED
+SELECT a, RIGHT(b,20) FROM t1;
+a RIGHT(b,20)
+1 aaaaaaaaaaaaaaaaaaaa
+2 bbbbbbbbbbbbbbbbbbbb
+SELECT a FROM t1;
+a
+1
+2
+3
+SET DEBUG='+d,crash_commit_before';
+INSERT INTO t2 VALUES (42);
+ERROR HY000: Lost connection to MySQL server during query
+ERROR HY000: Lost connection to MySQL server during query
+CHECK TABLE t1;
+Table Op Msg_type Msg_text
+test.t1 check status OK
+INSERT INTO t3 VALUES
+(1,REPEAT('d',7000),REPEAT('e',100)),
+(2,REPEAT('g',7000),REPEAT('h',100));
+SET DEBUG_SYNC='before_row_upd_extern SIGNAL have_latch WAIT_FOR go';
+UPDATE t3 SET c=REPEAT('f',3000) WHERE a=1;
+SET DEBUG_SYNC='now WAIT_FOR have_latch';
+SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
+SELECT @@tx_isolation;
+@@tx_isolation
+READ-UNCOMMITTED
+SELECT a, RIGHT(b,20), RIGHT(c,20) FROM t3;
+SET DEBUG_SYNC='now SIGNAL go';
+a RIGHT(b,20) RIGHT(c,20)
+1 dddddddddddddddddddd ffffffffffffffffffff
+2 gggggggggggggggggggg hhhhhhhhhhhhhhhhhhhh
+CHECK TABLE t1,t2,t3;
+Table Op Msg_type Msg_text
+test.t1 check status OK
+test.t2 check status OK
+test.t3 check status OK
+BEGIN;
+INSERT INTO t2 VALUES (347);
+SET DEBUG='+d,row_upd_extern_checkpoint';
+SET DEBUG_SYNC='before_row_upd_extern SIGNAL have_latch WAIT_FOR crash';
+UPDATE t3 SET c=REPEAT('i',3000) WHERE a=2;
+SET DEBUG_SYNC='now WAIT_FOR have_latch';
+SELECT info FROM information_schema.processlist
+WHERE state = 'debug sync point: before_row_upd_extern';
+info
+UPDATE t3 SET c=REPEAT('i',3000) WHERE a=2
+SET DEBUG='+d,crash_commit_before';
+COMMIT;
+ERROR HY000: Lost connection to MySQL server during query
+ERROR HY000: Lost connection to MySQL server during query
+CHECK TABLE t1,t2,t3;
+Table Op Msg_type Msg_text
+test.t1 check status OK
+test.t2 check status OK
+test.t3 check status OK
+SELECT a, RIGHT(b,20), RIGHT(c,20) FROM t3;
+a RIGHT(b,20) RIGHT(c,20)
+1 dddddddddddddddddddd ffffffffffffffffffff
+2 gggggggggggggggggggg hhhhhhhhhhhhhhhhhhhh
+SELECT a FROM t3;
+a
+1
+2
+BEGIN;
+INSERT INTO t2 VALUES (33101);
+SET DEBUG='+d,row_upd_extern_checkpoint';
+SET DEBUG_SYNC='after_row_upd_extern SIGNAL have_latch WAIT_FOR crash';
+UPDATE t3 SET c=REPEAT('j',3000) WHERE a=2;
+SET DEBUG_SYNC='now WAIT_FOR have_latch';
+SELECT info FROM information_schema.processlist
+WHERE state = 'debug sync point: after_row_upd_extern';
+info
+UPDATE t3 SET c=REPEAT('j',3000) WHERE a=2
+SET DEBUG='+d,crash_commit_before';
+COMMIT;
+ERROR HY000: Lost connection to MySQL server during query
+ERROR HY000: Lost connection to MySQL server during query
+CHECK TABLE t1,t2,t3;
+Table Op Msg_type Msg_text
+test.t1 check status OK
+test.t2 check status OK
+test.t3 check status OK
+SELECT a, RIGHT(b,20), RIGHT(c,20) FROM t3;
+a RIGHT(b,20) RIGHT(c,20)
+1 dddddddddddddddddddd ffffffffffffffffffff
+2 gggggggggggggggggggg hhhhhhhhhhhhhhhhhhhh
+SELECT a FROM t3;
+a
+1
+2
+SELECT * FROM t2;
+a
+DROP TABLE t1,t2,t3;
diff --git a/mysql-test/suite/innodb_plugin/t/innodb-blob.test b/mysql-test/suite/innodb_plugin/t/innodb-blob.test
new file mode 100644
index 00000000000..7d2968c720d
--- /dev/null
+++ b/mysql-test/suite/innodb_plugin/t/innodb-blob.test
@@ -0,0 +1,218 @@
+# Bug#13721257 RACE CONDITION IN UPDATES OR INSERTS OF WIDE RECORDS
+# Test what happens when a record is inserted or updated so that some
+# columns are stored off-page.
+
+--source include/have_innodb_plugin.inc
+
+# DEBUG_SYNC must be compiled in.
+--source include/have_debug_sync.inc
+
+# Valgrind would complain about memory leaks when we crash on purpose.
+--source include/not_valgrind.inc
+# Embedded server does not support crashing
+--source include/not_embedded.inc
+# Avoid CrashReporter popup on Mac
+--source include/not_crashrep.inc
+# InnoDB Plugin cannot use DEBUG_SYNC on Windows
+--source include/not_windows.inc
+
+CREATE TABLE t1 (a INT PRIMARY KEY, b TEXT) ENGINE=InnoDB;
+CREATE TABLE t2 (a INT PRIMARY KEY) ENGINE=InnoDB;
+CREATE TABLE t3 (a INT PRIMARY KEY, b TEXT, c TEXT) ENGINE=InnoDB;
+
+INSERT INTO t1 VALUES (1,REPEAT('a',30000)),(2,REPEAT('b',40000));
+SET DEBUG_SYNC='before_row_upd_extern SIGNAL have_latch WAIT_FOR go1';
+BEGIN;
+# This will not block, because it will not store new BLOBs.
+UPDATE t1 SET a=a+2;
+ROLLBACK;
+BEGIN;
+--send
+UPDATE t1 SET b=CONCAT(b,'foo');
+
+connect (con1,localhost,root,,);
+SET DEBUG_SYNC='now WAIT_FOR have_latch';
+
+# this one should block due to the clustered index tree and leaf page latches
+--send
+SELECT a, RIGHT(b,20) FROM t1;
+
+connect (con2,localhost,root,,);
+
+# Check that the above SELECT is blocked
+let $wait_condition=
+ select count(*) = 1 from information_schema.processlist
+ where state = 'Sending data' and
+ info = 'SELECT a, RIGHT(b,20) FROM t1';
+--source include/wait_condition.inc
+
+SET DEBUG_SYNC='now SIGNAL go1';
+
+connection con1;
+reap;
+connection default;
+reap;
+SET DEBUG='+d,row_ins_extern_checkpoint';
+SET DEBUG_SYNC='before_row_ins_extern_latch SIGNAL rec_not_blob WAIT_FOR crash';
+ROLLBACK;
+BEGIN;
+--send
+INSERT INTO t1 VALUES (3,REPEAT('c',50000));
+
+connection con1;
+SET DEBUG_SYNC='now WAIT_FOR rec_not_blob';
+SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
+SELECT @@tx_isolation;
+
+# this one should see (3,NULL_BLOB)
+SELECT a, RIGHT(b,20) FROM t1;
+SELECT a FROM t1;
+
+# Request a crash, and restart the server.
+SET DEBUG='+d,crash_commit_before';
+--exec echo "restart" > $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
+--error 2013
+INSERT INTO t2 VALUES (42);
+
+disconnect con1;
+disconnect con2;
+connection default;
+# This connection should notice the crash as well.
+--error 2013
+reap;
+
+# Write file to make mysql-test-run.pl restart the server
+--enable_reconnect
+--source include/wait_until_connected_again.inc
+--disable_reconnect
+
+CHECK TABLE t1;
+
+INSERT INTO t3 VALUES
+ (1,REPEAT('d',7000),REPEAT('e',100)),
+ (2,REPEAT('g',7000),REPEAT('h',100));
+SET DEBUG_SYNC='before_row_upd_extern SIGNAL have_latch WAIT_FOR go';
+# This should move column b off-page.
+--send
+UPDATE t3 SET c=REPEAT('f',3000) WHERE a=1;
+
+connect (con1,localhost,root,,);
+SET DEBUG_SYNC='now WAIT_FOR have_latch';
+SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
+SELECT @@tx_isolation;
+
+# this one should block
+-- send
+SELECT a, RIGHT(b,20), RIGHT(c,20) FROM t3;
+
+connect (con2,localhost,root,,);
+
+# Check that the above SELECT is blocked
+let $wait_condition=
+ select count(*) = 1 from information_schema.processlist
+ where state = 'Sending data' and
+ info = 'SELECT a, RIGHT(b,20), RIGHT(c,20) FROM t3';
+--source include/wait_condition.inc
+
+SET DEBUG_SYNC='now SIGNAL go';
+
+connection con1;
+reap;
+disconnect con1;
+
+connection default;
+reap;
+
+CHECK TABLE t1,t2,t3;
+
+connection con2;
+BEGIN;
+INSERT INTO t2 VALUES (347);
+connection default;
+
+# The row_upd_extern_checkpoint was removed in Bug#13721257,
+# because the mini-transaction of the B-tree modification would
+# remain open while we are writing the off-page columns and are
+# stuck in the DEBUG_SYNC. A checkpoint involves a flush, which
+# would wait for the buffer-fix to cease.
+SET DEBUG='+d,row_upd_extern_checkpoint';
+SET DEBUG_SYNC='before_row_upd_extern SIGNAL have_latch WAIT_FOR crash';
+# This should move column b off-page.
+--send
+UPDATE t3 SET c=REPEAT('i',3000) WHERE a=2;
+
+connection con2;
+SET DEBUG_SYNC='now WAIT_FOR have_latch';
+
+# Check that the above UPDATE is blocked
+SELECT info FROM information_schema.processlist
+WHERE state = 'debug sync point: before_row_upd_extern';
+
+# Request a crash, and restart the server.
+SET DEBUG='+d,crash_commit_before';
+--exec echo "restart" > $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
+--error 2013
+COMMIT;
+
+disconnect con2;
+connection default;
+# This connection should notice the crash as well.
+--error 2013
+reap;
+
+# Write file to make mysql-test-run.pl restart the server
+--enable_reconnect
+--source include/wait_until_connected_again.inc
+--disable_reconnect
+
+CHECK TABLE t1,t2,t3;
+SELECT a, RIGHT(b,20), RIGHT(c,20) FROM t3;
+SELECT a FROM t3;
+
+connect (con2,localhost,root,,);
+BEGIN;
+INSERT INTO t2 VALUES (33101);
+connection default;
+
+# The row_upd_extern_checkpoint was removed in Bug#13721257,
+# because the mini-transaction of the B-tree modification would
+# remain open while we are writing the off-page columns and are
+# stuck in the DEBUG_SYNC. A checkpoint involves a flush, which
+# would wait for the buffer-fix to cease.
+SET DEBUG='+d,row_upd_extern_checkpoint';
+SET DEBUG_SYNC='after_row_upd_extern SIGNAL have_latch WAIT_FOR crash';
+# This should move column b off-page.
+--send
+UPDATE t3 SET c=REPEAT('j',3000) WHERE a=2;
+
+connection con2;
+SET DEBUG_SYNC='now WAIT_FOR have_latch';
+
+# Check that the above UPDATE is blocked
+SELECT info FROM information_schema.processlist
+WHERE state = 'debug sync point: after_row_upd_extern';
+
+# Request a crash, and restart the server.
+SET DEBUG='+d,crash_commit_before';
+--exec echo "restart" > $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
+--error 2013
+COMMIT;
+
+disconnect con2;
+connection default;
+# This connection should notice the crash as well.
+--error 2013
+reap;
+
+# Write file to make mysql-test-run.pl restart the server
+--enable_reconnect
+--source include/wait_until_connected_again.inc
+--disable_reconnect
+
+CHECK TABLE t1,t2,t3;
+SELECT a, RIGHT(b,20), RIGHT(c,20) FROM t3;
+SELECT a FROM t3;
+
+SELECT * FROM t2;
+
+DROP TABLE t1,t2,t3;
-- 
cgit v1.2.1