author    Marko Mäkelä <marko.makela@mariadb.com>  2023-03-16 17:19:58 +0200
committer Marko Mäkelä <marko.makela@mariadb.com>  2023-03-16 17:19:58 +0200
commit    a55b951e6082a4ce9a1f2ed5ee176ea7dbbaf1f2 (patch)
tree      bbc01052f654499f11d4ee04bb17cf7480ae6e96
parent    9593cccf285ee348fc9a2743c1ed7d24c768439b (diff)
download  mariadb-git-a55b951e6082a4ce9a1f2ed5ee176ea7dbbaf1f2.tar.gz
MDEV-26827 Make page flushing even faster
For more convenient monitoring of something that could greatly affect
the volume of page writes, we add the status variable
Innodb_buffer_pool_pages_split that was previously only available
via information_schema.innodb_metrics as "innodb_page_splits".
This was suggested by Axel Schwenke.
buf_flush_page_count: Replaced with buf_pool.stat.n_pages_written.
We protect buf_pool.stat (except n_page_gets) with buf_pool.mutex
and remove unnecessary export_vars indirection.
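The split between one very hot atomic counter and mutex-protected everything else can be sketched like this. This is an illustrative C++ model, not the actual buf_pool definitions; the snapshot type stands in for what the removed export_vars indirection used to provide:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for buf_pool.stat: every field is protected by
// pool_mutex, except the one counter that is too hot to serialize.
struct pool_stat
{
  std::atomic<uint64_t> n_page_gets{0}; // updated lock-free on every page access
  uint64_t n_pages_written= 0;          // protected by pool_mutex
  uint64_t n_pages_created= 0;          // protected by pool_mutex
};

std::mutex pool_mutex;
pool_stat stats;

// A plain snapshot type lets SHOW STATUS style readers copy the counters
// under the mutex directly, with no separate export_vars copy step.
struct stat_snapshot
{
  uint64_t n_page_gets, n_pages_written, n_pages_created;
};

stat_snapshot read_stat()
{
  std::lock_guard<std::mutex> g(pool_mutex);
  return {stats.n_page_gets.load(std::memory_order_relaxed),
          stats.n_pages_written, stats.n_pages_created};
}

void page_written() // writers update the counters while holding the mutex
{
  std::lock_guard<std::mutex> g(pool_mutex);
  stats.n_pages_written++;
}
```

The design point is that a single mutex acquisition per write submission is cheaper than keeping every counter atomic, while the page-get path stays entirely lock-free.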
buf_pool.flush_list_bytes: Moved from buf_pool.stat.flush_list_bytes.
Protected by buf_pool.flush_list_mutex.
buf_pool_t::page_cleaner_status: Replaces buf_pool_t::n_flush_LRU_,
buf_pool_t::n_flush_list_, and buf_pool_t::page_cleaner_is_idle.
Protected by buf_pool.flush_list_mutex. We will exclusively broadcast
buf_pool.done_flush_list by the buf_flush_page_cleaner thread,
and only wait for it when communicating with buf_flush_page_cleaner.
There is no need to keep a count of pending writes by the
buf_pool.flush_list processing. A single flag suffices for that.
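A minimal model of such a combined status word, assuming flag values like those in the patch (PAGE_CLEANER_IDLE in the low bit, LRU_FLUSH as the counter increment); in InnoDB the field is protected by buf_pool.flush_list_mutex, which this single-threaded sketch omits:

```cpp
#include <cassert>
#include <cstdint>

// Assumed encoding, modelled on the patch: low bit = idle flag,
// higher bits = count of pending LRU flushes. Not the InnoDB definitions.
constexpr uint32_t PAGE_CLEANER_IDLE= 1;
constexpr uint32_t LRU_FLUSH= 2;

uint32_t page_cleaner_status= PAGE_CLEANER_IDLE;

bool page_cleaner_idle() { return page_cleaner_status & PAGE_CLEANER_IDLE; }

// Waking the cleaner clears the idle flag (and would signal do_flush_list).
void wakeup() { page_cleaner_status-= PAGE_CLEANER_IDLE; }
void go_idle() { page_cleaner_status+= PAGE_CLEANER_IDLE; }

void n_flush_inc() { page_cleaner_status+= LRU_FLUSH; }

// Returns true when the last pending LRU flush completes, which is the
// point where done_flush_LRU would be broadcast.
bool n_flush_dec()
{
  assert(page_cleaner_status >= LRU_FLUSH);
  return (page_cleaner_status-= LRU_FLUSH) < LRU_FLUSH;
}
```

Packing the flag and the counter into one word means a single mutex-protected load answers both "is the cleaner idle?" and "are LRU flushes pending?".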
Waits for page write completion can be performed by
simply waiting on block->page.lock, or by invoking
buf_dblwr.wait_for_page_writes().
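The pending-write accounting behind that wait can be sketched as follows; writes_pending, pending_writes() and wait_for_page_writes() echo the names above, but this is a simplified model, not the buf_dblwr code (which also manages the doublewrite batches):

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Simplified model of counting outstanding page writes and waiting for
// the count to drain to zero.
class write_tracker
{
  std::mutex m;
  std::condition_variable cond; // stands in for write_cond
  std::size_t writes_pending= 0;
public:
  void write_started()          // on submitting a page write
  {
    std::lock_guard<std::mutex> g(m);
    writes_pending++;
  }
  void write_completed()        // from the write completion callback
  {
    std::lock_guard<std::mutex> g(m);
    assert(writes_pending > 0);
    if (!--writes_pending)
      cond.notify_all();        // like pthread_cond_broadcast()
  }
  std::size_t pending_writes()
  {
    std::lock_guard<std::mutex> g(m);
    return writes_pending;
  }
  // Block until no writes of persistent pages remain outstanding.
  void wait_for_page_writes()
  {
    std::unique_lock<std::mutex> g(m);
    cond.wait(g, [this] { return writes_pending == 0; });
  }
};
```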
buf_LRU_block_free_non_file_page(): Broadcast buf_pool.done_free and
set buf_pool.try_LRU_scan when freeing a page. This is also
executed as part of buf_page_write_complete().
buf_page_write_complete(): Do not broadcast buf_pool.done_flush_list,
and do not acquire buf_pool.mutex unless buf_pool.LRU eviction is needed.
Let buf_dblwr count all writes to persistent pages and broadcast a
condition variable when no outstanding writes remain.
buf_flush_page_cleaner(): Prioritize LRU flushing and eviction right after
"furious flushing" (lsn_limit). Simplify the conditions and reduce the
hold time of buf_pool.flush_list_mutex. Refuse to shut down
or sleep if buf_pool.ran_out(), that is, LRU eviction is needed.
buf_pool_t::page_cleaner_wakeup(): Add the optional parameter for_LRU.
buf_LRU_get_free_block(): Protect buf_lru_free_blocks_error_printed
with buf_pool.mutex. Invoke buf_pool.page_cleaner_wakeup(true) to
ensure that buf_flush_page_cleaner() will process the LRU flush
request.
buf_do_LRU_batch(), buf_flush_list(), buf_flush_list_space():
Update buf_pool.stat.n_pages_written when submitting writes
(while holding buf_pool.mutex), not when completing them.
buf_page_t::flush(), buf_flush_discard_page(): Require that
the page U-latch be acquired upfront, and remove
buf_page_t::ready_for_flush().
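The new calling convention — the caller tries the page latch first, re-checks the dirty state under it, and only then dispatches the write — can be illustrated with stand-in types (std::shared_mutex approximates the U-latch; none of this is InnoDB code):

```cpp
#include <cassert>
#include <cstdint>
#include <shared_mutex>

// Stand-in page: std::shared_mutex plays the role of block->page.lock,
// and oldest_modification() is reduced to a plain field.
struct page
{
  std::shared_mutex latch;          // approximates the page U-latch
  uint64_t oldest_modification= 0;  // 0=clean, 1=to be cleared, >=2 dirty
  bool write_submitted= false;
  void flush()                      // caller must already hold the latch
  {
    write_submitted= true;
    // In InnoDB the latch stays held until write completion; it is
    // released here only to keep the example self-contained.
    latch.unlock();
  }
};

// The caller, not flush(), now acquires the latch: try-lock, re-check
// the state, and either dispatch the write or back off.
bool try_flush(page &p)
{
  if (!p.latch.try_lock())          // another thread holds the page latch
    return false;
  if (p.oldest_modification < 2)    // clean: nothing to write
  {
    p.latch.unlock();
    return false;
  }
  p.flush();
  return true;
}
```

With the latch acquired upfront, a separate ready_for_flush() predicate becomes redundant: holding the latch already excludes concurrent I/O fixing.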
buf_pool_t::delete_from_flush_list(): Remove the parameter "bool clear".
buf_flush_page(): Count pending page writes via buf_dblwr.
buf_flush_try_neighbors(): Take the block of page_id as a parameter.
If the tablespace is dropped before our page has been written out,
release the page U-latch.
buf_pool_invalidate(): Let the caller ensure that there are no
outstanding writes.
buf_flush_wait_batch_end(false),
buf_flush_wait_batch_end_acquiring_mutex(false):
Replaced with buf_dblwr.wait_for_page_writes().
buf_flush_wait_LRU_batch_end(): Replaces buf_flush_wait_batch_end(true).
buf_flush_list(): Remove some broadcast of buf_pool.done_flush_list.
buf_flush_buffer_pool(): Invoke also buf_dblwr.wait_for_page_writes().
buf_pool_t::io_pending(), buf_pool_t::n_flush_list(): Remove.
Outstanding writes are reflected by buf_dblwr.pending_writes().
buf_dblwr_t::init(): New function, to initialize the mutex and
the condition variables, but not the backing store.
buf_dblwr_t::is_created(): Replaces buf_dblwr_t::is_initialised().
buf_dblwr_t::pending_writes(), buf_dblwr_t::writes_pending:
Keeps track of writes of persistent data pages.
buf_flush_LRU(): Allow calls while LRU flushing may be in progress
in another thread.
Tested by Matthias Leich (correctness) and Axel Schwenke (performance)
21 files changed, 705 insertions(+), 669 deletions(-)
diff --git a/mysql-test/suite/innodb/r/innodb_skip_innodb_is_tables.result b/mysql-test/suite/innodb/r/innodb_skip_innodb_is_tables.result
index 19b426009f2..9bdb546482e 100644
--- a/mysql-test/suite/innodb/r/innodb_skip_innodb_is_tables.result
+++ b/mysql-test/suite/innodb/r/innodb_skip_innodb_is_tables.result
@@ -199,7 +199,7 @@
 compress_pages_page_decompressed compression 0 NULL NULL NULL 0 NULL NULL NULL N
 compress_pages_page_compression_error compression 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Number of page compression errors
 compress_pages_encrypted compression 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Number of pages encrypted
 compress_pages_decrypted compression 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Number of pages decrypted
-index_page_splits index 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Number of index page splits
+index_page_splits index 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 status_counter Number of index page splits
 index_page_merge_attempts index 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Number of index page merge attempts
 index_page_merge_successful index 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Number of successful index page merges
 index_page_reorg_attempts index 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Number of index page reorganization attempts
diff --git a/mysql-test/suite/innodb/r/innodb_status_variables.result b/mysql-test/suite/innodb/r/innodb_status_variables.result
index a729dd0a8d4..5b8ca678795 100644
--- a/mysql-test/suite/innodb/r/innodb_status_variables.result
+++ b/mysql-test/suite/innodb/r/innodb_status_variables.result
@@ -23,6 +23,7 @@
 INNODB_BUFFER_POOL_PAGES_OLD
 INNODB_BUFFER_POOL_PAGES_TOTAL
 INNODB_BUFFER_POOL_PAGES_LRU_FLUSHED
 INNODB_BUFFER_POOL_PAGES_LRU_FREED
+INNODB_BUFFER_POOL_PAGES_SPLIT
 INNODB_BUFFER_POOL_READ_AHEAD_RND
 INNODB_BUFFER_POOL_READ_AHEAD
 INNODB_BUFFER_POOL_READ_AHEAD_EVICTED
diff --git a/storage/innobase/btr/btr0btr.cc b/storage/innobase/btr/btr0btr.cc
index 1b69f4c7170..e54c2a101b8 100644
--- a/storage/innobase/btr/btr0btr.cc
+++ b/storage/innobase/btr/btr0btr.cc
@@ -2975,6 +2975,8 @@ btr_page_split_and_insert(
 	ut_ad(*err == DB_SUCCESS);
 	ut_ad(dtuple_check_typed(tuple));

+	buf_pool.pages_split++;
+
 	if (cursor->index()->is_spatial()) {
 		/* Split rtree page and update parent */
 		return rtr_page_split_and_insert(flags, cursor, offsets, heap,
@@ -3371,8 +3373,6 @@ func_exit:
 					left_block, right_block, mtr);
 	}

-	MONITOR_INC(MONITOR_INDEX_SPLIT);
-
 	ut_ad(page_validate(buf_block_get_frame(left_block),
 			    page_cursor->index));
 	ut_ad(page_validate(buf_block_get_frame(right_block),
diff --git a/storage/innobase/buf/buf0buf.cc b/storage/innobase/buf/buf0buf.cc
index 510872c142e..106569f74b2 100644
--- a/storage/innobase/buf/buf0buf.cc
+++ b/storage/innobase/buf/buf0buf.cc
@@ -1401,8 +1401,10 @@ inline bool buf_pool_t::withdraw_blocks()
                        true);
       mysql_mutex_unlock(&buf_pool.mutex);
       buf_dblwr.flush_buffered_writes();
+      mysql_mutex_lock(&buf_pool.flush_list_mutex);
+      buf_flush_wait_LRU_batch_end();
+      mysql_mutex_unlock(&buf_pool.flush_list_mutex);
       mysql_mutex_lock(&buf_pool.mutex);
-      buf_flush_wait_batch_end(true);
     }

     /* relocate blocks/buddies in withdrawn area */
@@ -2265,13 +2267,15 @@ lookup:
   return bpage;

must_read_page:
-  if (dberr_t err= buf_read_page(page_id, zip_size))
-  {
+  switch (dberr_t err= buf_read_page(page_id, zip_size)) {
+  case DB_SUCCESS:
+  case DB_SUCCESS_LOCKED_REC:
+    goto lookup;
+  default:
     ib::error() << "Reading compressed page " << page_id
                 << " failed with error: " << err;
     return nullptr;
   }
-  goto lookup;
 }

/********************************************************************//**
@@ -2511,20 +2515,23 @@ loop:
		corrupted, or if an encrypted page with a valid checksum
		cannot be decypted. */
-		if (dberr_t local_err = buf_read_page(page_id, zip_size)) {
-			if (local_err != DB_CORRUPTION
-			    && mode != BUF_GET_POSSIBLY_FREED
+		switch (dberr_t local_err = buf_read_page(page_id, zip_size)) {
+		case DB_SUCCESS:
+		case DB_SUCCESS_LOCKED_REC:
+			buf_read_ahead_random(page_id, zip_size, ibuf_inside(mtr));
+			break;
+		default:
+			if (mode != BUF_GET_POSSIBLY_FREED
 			    && retries++ < BUF_PAGE_READ_MAX_RETRIES) {
 				DBUG_EXECUTE_IF("intermittent_read_failure",
 						retries = BUF_PAGE_READ_MAX_RETRIES;);
-			} else {
-				if (err) {
-					*err = local_err;
-				}
-				return nullptr;
 			}
-		} else {
-			buf_read_ahead_random(page_id, zip_size, ibuf_inside(mtr));
+			/* fall through */
+		case DB_PAGE_CORRUPTED:
+			if (err) {
+				*err = local_err;
+			}
+			return nullptr;
 		}

 		ut_d(if (!(++buf_dbg_counter % 5771)) buf_pool.validate());
@@ -3279,12 +3286,12 @@ retry:
     buf_unzip_LRU_add_block(reinterpret_cast<buf_block_t*>(bpage), FALSE);
   }

+  buf_pool.stat.n_pages_created++;
   mysql_mutex_unlock(&buf_pool.mutex);

   mtr->memo_push(reinterpret_cast<buf_block_t*>(bpage), MTR_MEMO_PAGE_X_FIX);

   bpage->set_accessed();
-  buf_pool.stat.n_pages_created++;

   /* Delete possible entries for the page from the insert buffer:
   such can exist if the page belonged to an index which was dropped */
@@ -3534,7 +3541,6 @@ dberr_t buf_page_t::read_complete(const fil_node_t &node)
   ut_d(auto n=) buf_pool.n_pend_reads--;
   ut_ad(n > 0);
-  buf_pool.stat.n_pages_read++;

   const byte *read_frame= zip.data ? zip.data : frame;
   ut_ad(read_frame);
@@ -3686,9 +3692,6 @@ void buf_pool_invalidate()
 {
   mysql_mutex_lock(&buf_pool.mutex);

-  buf_flush_wait_batch_end(true);
-  buf_flush_wait_batch_end(false);
-
   /* It is possible that a write batch that has been posted
   earlier is still not complete. For buffer pool invalidation to
   proceed we must ensure there is NO write activity happening. */
@@ -3839,8 +3842,8 @@ void buf_pool_t::print()
		<< UT_LIST_GET_LEN(flush_list)
		<< ", n pending decompressions=" << n_pend_unzip
		<< ", n pending reads=" << n_pend_reads
-		<< ", n pending flush LRU=" << n_flush_LRU_
-		<< " list=" << n_flush_list_
+		<< ", n pending flush LRU=" << n_flush()
+		<< " list=" << buf_dblwr.pending_writes()
		<< ", pages made young=" << stat.n_pages_made_young
		<< ", not young=" << stat.n_pages_not_made_young
		<< ", pages read=" << stat.n_pages_read
@@ -3952,13 +3955,13 @@ void buf_stats_get_pool_info(buf_pool_info_t *pool_info)
	pool_info->flush_list_len = UT_LIST_GET_LEN(buf_pool.flush_list);

	pool_info->n_pend_unzip = UT_LIST_GET_LEN(buf_pool.unzip_LRU);
-	mysql_mutex_unlock(&buf_pool.flush_list_mutex);

	pool_info->n_pend_reads = buf_pool.n_pend_reads;

-	pool_info->n_pending_flush_lru = buf_pool.n_flush_LRU_;
+	pool_info->n_pending_flush_lru = buf_pool.n_flush();

-	pool_info->n_pending_flush_list = buf_pool.n_flush_list_;
+	pool_info->n_pending_flush_list = buf_dblwr.pending_writes();
+	mysql_mutex_unlock(&buf_pool.flush_list_mutex);

	current_time = time(NULL);
	time_elapsed = 0.001 + difftime(current_time,
diff --git a/storage/innobase/buf/buf0dblwr.cc b/storage/innobase/buf/buf0dblwr.cc
index c71fd8df068..72b1ba5ca2b 100644
--- a/storage/innobase/buf/buf0dblwr.cc
+++ b/storage/innobase/buf/buf0dblwr.cc
@@ -46,7 +46,17 @@ inline buf_block_t *buf_dblwr_trx_sys_get(mtr_t *mtr)
                                 0, RW_X_LATCH, mtr);
 }

-/** Initialize the doublewrite buffer data structure.
+void buf_dblwr_t::init()
+{
+  if (!active_slot)
+  {
+    active_slot= &slots[0];
+    mysql_mutex_init(buf_dblwr_mutex_key, &mutex, nullptr);
+    pthread_cond_init(&cond, nullptr);
+  }
+}
+
+/** Initialise the persistent storage of the doublewrite buffer.
 @param header doublewrite page header in the TRX_SYS page */
 inline void buf_dblwr_t::init(const byte *header)
 {
@@ -54,8 +64,6 @@ inline void buf_dblwr_t::init(const byte *header)
   ut_ad(!active_slot->reserved);
   ut_ad(!batch_running);

-  mysql_mutex_init(buf_dblwr_mutex_key, &mutex, nullptr);
-  pthread_cond_init(&cond, nullptr);
   block1= page_id_t(0, mach_read_from_4(header + TRX_SYS_DOUBLEWRITE_BLOCK1));
   block2= page_id_t(0, mach_read_from_4(header + TRX_SYS_DOUBLEWRITE_BLOCK2));
@@ -74,7 +82,7 @@ inline void buf_dblwr_t::init(const byte *header)
 @return whether the operation succeeded */
 bool buf_dblwr_t::create()
 {
-  if (is_initialised())
+  if (is_created())
     return true;

   mtr_t mtr;
@@ -343,7 +351,7 @@ func_exit:
 void buf_dblwr_t::recover()
 {
   ut_ad(recv_sys.parse_start_lsn);
-  if (!is_initialised())
+  if (!is_created())
     return;

   uint32_t page_no_dblwr= 0;
@@ -452,10 +460,9 @@ next_page:
 /** Free the doublewrite buffer. */
 void buf_dblwr_t::close()
 {
-  if (!is_initialised())
+  if (!active_slot)
     return;

-  /* Free the double write data structures. */
   ut_ad(!active_slot->reserved);
   ut_ad(!active_slot->first_free);
   ut_ad(!batch_running);
@@ -469,35 +476,41 @@ void buf_dblwr_t::close()
   mysql_mutex_destroy(&mutex);

   memset((void*) this, 0, sizeof *this);
-  active_slot= &slots[0];
 }

 /** Update the doublewrite buffer on write completion. */
-void buf_dblwr_t::write_completed()
+void buf_dblwr_t::write_completed(bool with_doublewrite)
 {
   ut_ad(this == &buf_dblwr);
-  ut_ad(srv_use_doublewrite_buf);
-  ut_ad(is_initialised());
   ut_ad(!srv_read_only_mode);

   mysql_mutex_lock(&mutex);

-  ut_ad(batch_running);
-  slot *flush_slot= active_slot == &slots[0] ? &slots[1] : &slots[0];
-  ut_ad(flush_slot->reserved);
-  ut_ad(flush_slot->reserved <= flush_slot->first_free);
+  ut_ad(writes_pending);
+  if (!--writes_pending)
+    pthread_cond_broadcast(&write_cond);

-  if (!--flush_slot->reserved)
+  if (with_doublewrite)
   {
-    mysql_mutex_unlock(&mutex);
-    /* This will finish the batch. Sync data files to the disk. */
-    fil_flush_file_spaces();
-    mysql_mutex_lock(&mutex);
+    ut_ad(is_created());
+    ut_ad(srv_use_doublewrite_buf);
+    ut_ad(batch_running);
+    slot *flush_slot= active_slot == &slots[0] ? &slots[1] : &slots[0];
+    ut_ad(flush_slot->reserved);
+    ut_ad(flush_slot->reserved <= flush_slot->first_free);
+
+    if (!--flush_slot->reserved)
+    {
+      mysql_mutex_unlock(&mutex);
+      /* This will finish the batch. Sync data files to the disk. */
+      fil_flush_file_spaces();
+      mysql_mutex_lock(&mutex);

-    /* We can now reuse the doublewrite memory buffer: */
-    flush_slot->first_free= 0;
-    batch_running= false;
-    pthread_cond_broadcast(&cond);
+      /* We can now reuse the doublewrite memory buffer: */
+      flush_slot->first_free= 0;
+      batch_running= false;
+      pthread_cond_broadcast(&cond);
+    }
   }

   mysql_mutex_unlock(&mutex);
@@ -642,7 +655,7 @@ void buf_dblwr_t::flush_buffered_writes_completed(const IORequest &request)
 {
   ut_ad(this == &buf_dblwr);
   ut_ad(srv_use_doublewrite_buf);
-  ut_ad(is_initialised());
+  ut_ad(is_created());
   ut_ad(!srv_read_only_mode);
   ut_ad(!request.bpage);
   ut_ad(request.node == fil_system.sys_space->chain.start);
@@ -708,7 +721,7 @@ posted, and also when we may have to wait for a page latch!
 Otherwise a deadlock of threads can occur. */
 void buf_dblwr_t::flush_buffered_writes()
 {
-  if (!is_initialised() || !srv_use_doublewrite_buf)
+  if (!is_created() || !srv_use_doublewrite_buf)
   {
     fil_flush_file_spaces();
     return;
@@ -741,6 +754,7 @@ void buf_dblwr_t::add_to_batch(const IORequest &request, size_t size)
   const ulint buf_size= 2 * block_size();

   mysql_mutex_lock(&mutex);
+  writes_pending++;

   for (;;)
   {
diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
index 70e1595e00e..326636e0c4d 100644
--- a/storage/innobase/buf/buf0flu.cc
+++ b/storage/innobase/buf/buf0flu.cc
@@ -47,15 +47,12 @@ Created 11/11/1995 Heikki Tuuri
 #endif

 /** Number of pages flushed via LRU. Protected by buf_pool.mutex.
-Also included in buf_flush_page_count. */
+Also included in buf_pool.stat.n_pages_written. */
 ulint buf_lru_flush_page_count;

 /** Number of pages freed without flushing. Protected by buf_pool.mutex. */
 ulint buf_lru_freed_page_count;

-/** Number of pages flushed. Protected by buf_pool.mutex. */
-ulint buf_flush_page_count;
-
 /** Flag indicating if the page_cleaner is in active state. */
 Atomic_relaxed<bool> buf_page_cleaner_is_active;
@@ -115,8 +112,7 @@ static void buf_flush_validate_skip()
 }
 #endif /* UNIV_DEBUG */

-/** Wake up the page cleaner if needed */
-void buf_pool_t::page_cleaner_wakeup()
+void buf_pool_t::page_cleaner_wakeup(bool for_LRU)
 {
   if (!page_cleaner_idle())
     return;
@@ -149,11 +145,12 @@ void buf_pool_t::page_cleaner_wakeup()
   - by allowing last_activity_count to updated when page-cleaner is made
   active and has work to do. This ensures that the last_activity signal
   is consumed by the page-cleaner before the next one is generated. */
-  if ((pct_lwm != 0.0 && pct_lwm <= dirty_pct) ||
-      (pct_lwm != 0.0 && last_activity_count == srv_get_activity_count()) ||
+  if (for_LRU ||
+      (pct_lwm != 0.0 && (pct_lwm <= dirty_pct ||
+                          last_activity_count == srv_get_activity_count())) ||
       srv_max_buf_pool_modified_pct <= dirty_pct)
   {
-    page_cleaner_is_idle= false;
+    page_cleaner_status-= PAGE_CLEANER_IDLE;
     pthread_cond_signal(&do_flush_list);
   }
 }
@@ -183,8 +180,8 @@ void buf_pool_t::insert_into_flush_list(buf_block_t *block, lsn_t lsn)
     delete_from_flush_list_low(&block->page);
   }
   else
-    stat.flush_list_bytes+= block->physical_size();
-  ut_ad(stat.flush_list_bytes <= curr_pool_size);
+    flush_list_bytes+= block->physical_size();
+  ut_ad(flush_list_bytes <= curr_pool_size);

   block->page.set_oldest_modification(lsn);
   MEM_CHECK_DEFINED(block->page.zip.data
@@ -197,14 +194,12 @@ void buf_pool_t::insert_into_flush_list(buf_block_t *block, lsn_t lsn)
 }

 /** Remove a block from flush_list.
-@param bpage   buffer pool page
-@param clear   whether to invoke buf_page_t::clear_oldest_modification() */
-void buf_pool_t::delete_from_flush_list(buf_page_t *bpage, bool clear)
+@param bpage   buffer pool page */
+void buf_pool_t::delete_from_flush_list(buf_page_t *bpage)
 {
   delete_from_flush_list_low(bpage);
-  stat.flush_list_bytes-= bpage->physical_size();
-  if (clear)
-    bpage->clear_oldest_modification();
+  flush_list_bytes-= bpage->physical_size();
+  bpage->clear_oldest_modification();
 #ifdef UNIV_DEBUG
   buf_flush_validate_skip();
 #endif /* UNIV_DEBUG */
@@ -219,10 +214,10 @@ void buf_flush_remove_pages(ulint id)
 {
   const page_id_t first(id, 0), end(id + 1, 0);
   ut_ad(id);
-  mysql_mutex_lock(&buf_pool.mutex);

   for (;;)
   {
+    mysql_mutex_lock(&buf_pool.mutex);
     bool deferred= false;

     mysql_mutex_lock(&buf_pool.flush_list_mutex);
@@ -245,18 +240,14 @@ void buf_flush_remove_pages(ulint id)
       bpage= prev;
     }

+    mysql_mutex_unlock(&buf_pool.mutex);
     mysql_mutex_unlock(&buf_pool.flush_list_mutex);

     if (!deferred)
       break;

-    mysql_mutex_unlock(&buf_pool.mutex);
-    std::this_thread::yield();
-    mysql_mutex_lock(&buf_pool.mutex);
-    buf_flush_wait_batch_end(false);
+    buf_dblwr.wait_for_page_writes();
   }
-
-  mysql_mutex_unlock(&buf_pool.mutex);
 }

 /*******************************************************************//**
@@ -301,7 +292,7 @@ buf_flush_relocate_on_flush_list(
	bpage->clear_oldest_modification();

	if (lsn == 1) {
-		buf_pool.stat.flush_list_bytes -= dpage->physical_size();
+		buf_pool.flush_list_bytes -= dpage->physical_size();
		dpage->list.prev = nullptr;
		dpage->list.next = nullptr;
		dpage->clear_oldest_modification();
@@ -341,6 +332,21 @@ inline void buf_page_t::write_complete(bool temporary)
   lock.u_unlock(true);
 }

+inline void buf_pool_t::n_flush_inc()
+{
+  mysql_mutex_assert_owner(&flush_list_mutex);
+  page_cleaner_status+= LRU_FLUSH;
+}
+
+inline void buf_pool_t::n_flush_dec()
+{
+  mysql_mutex_lock(&flush_list_mutex);
+  ut_ad(page_cleaner_status >= LRU_FLUSH);
+  if ((page_cleaner_status-= LRU_FLUSH) < LRU_FLUSH)
+    pthread_cond_broadcast(&done_flush_LRU);
+  mysql_mutex_unlock(&flush_list_mutex);
+}
+
 /** Complete write of a file page from buf_pool.
 @param request write request */
 void buf_page_write_complete(const IORequest &request)
@@ -356,13 +362,6 @@ void buf_page_write_complete(const IORequest &request)
   ut_ad(!buf_dblwr.is_inside(bpage->id()));
   ut_ad(request.node->space->id == bpage->id().space());

-  if (state < buf_page_t::WRITE_FIX_REINIT &&
-      request.node->space->use_doublewrite())
-  {
-    ut_ad(request.node->space != fil_system.temp_space);
-    buf_dblwr.write_completed();
-  }
-
   if (request.slot)
     request.slot->release();

@@ -370,32 +369,31 @@ void buf_page_write_complete(const IORequest &request)
   buf_page_monitor(*bpage, false);
   DBUG_PRINT("ib_buf", ("write page %u:%u",
                         bpage->id().space(), bpage->id().page_no()));
-  const bool temp= fsp_is_system_temporary(bpage->id().space());
-  mysql_mutex_lock(&buf_pool.mutex);
+  mysql_mutex_assert_not_owner(&buf_pool.mutex);
   mysql_mutex_assert_not_owner(&buf_pool.flush_list_mutex);
-  buf_pool.stat.n_pages_written++;
-  bpage->write_complete(temp);

   if (request.is_LRU())
   {
+    const bool temp= bpage->oldest_modification() == 2;
+    if (!temp)
+      buf_dblwr.write_completed(state < buf_page_t::WRITE_FIX_REINIT &&
+                                request.node->space->use_doublewrite());
+    /* We must hold buf_pool.mutex while releasing the block, so that
+    no other thread can access it before we have freed it. */
+    mysql_mutex_lock(&buf_pool.mutex);
+    bpage->write_complete(temp);
     buf_LRU_free_page(bpage, true);
-    buf_pool.try_LRU_scan= true;
-    pthread_cond_signal(&buf_pool.done_free);
+    mysql_mutex_unlock(&buf_pool.mutex);

-    ut_ad(buf_pool.n_flush_LRU_);
-    if (!--buf_pool.n_flush_LRU_)
-      pthread_cond_broadcast(&buf_pool.done_flush_LRU);
+    buf_pool.n_flush_dec();
   }
   else
   {
-    ut_ad(!temp);
-    ut_ad(buf_pool.n_flush_list_);
-    if (!--buf_pool.n_flush_list_)
-      pthread_cond_broadcast(&buf_pool.done_flush_list);
+    buf_dblwr.write_completed(state < buf_page_t::WRITE_FIX_REINIT &&
+                              request.node->space->use_doublewrite());
+    bpage->write_complete(false);
   }
-
-  mysql_mutex_unlock(&buf_pool.mutex);
 }

 /** Calculate a ROW_FORMAT=COMPRESSED page checksum and update the page.
@@ -739,43 +737,41 @@ not_compressed:
 }

 /** Free a page whose underlying file page has been freed. */
-inline void buf_pool_t::release_freed_page(buf_page_t *bpage)
+ATTRIBUTE_COLD void buf_pool_t::release_freed_page(buf_page_t *bpage)
 {
   mysql_mutex_assert_owner(&mutex);
-  mysql_mutex_lock(&flush_list_mutex);
   ut_d(const lsn_t oldest_modification= bpage->oldest_modification();)
   if (fsp_is_system_temporary(bpage->id().space()))
   {
     ut_ad(bpage->frame);
     ut_ad(oldest_modification == 2);
+    bpage->clear_oldest_modification();
   }
   else
   {
+    mysql_mutex_lock(&flush_list_mutex);
     ut_ad(oldest_modification > 2);
-    delete_from_flush_list(bpage, false);
+    delete_from_flush_list(bpage);
+    mysql_mutex_unlock(&flush_list_mutex);
   }

-  bpage->clear_oldest_modification();
-  mysql_mutex_unlock(&flush_list_mutex);
-  bpage->lock.u_unlock(true);
+  bpage->lock.u_unlock(true);
   buf_LRU_free_page(bpage, true);
 }

-/** Write a flushable page to a file. buf_pool.mutex must be held.
+/** Write a flushable page to a file or free a freeable block.
 @param evict   whether to evict the page on write completion
 @param space   tablespace
-@return whether the page was flushed and buf_pool.mutex was released */
-inline bool buf_page_t::flush(bool evict, fil_space_t *space)
+@return whether a page write was initiated and buf_pool.mutex released */
+bool buf_page_t::flush(bool evict, fil_space_t *space)
 {
+  mysql_mutex_assert_not_owner(&buf_pool.flush_list_mutex);
   ut_ad(in_file());
   ut_ad(in_LRU_list);
   ut_ad((space->purpose == FIL_TYPE_TEMPORARY) ==
         (space == fil_system.temp_space));
-  ut_ad(space->referenced());
   ut_ad(evict || space != fil_system.temp_space);
-
-  if (!lock.u_lock_try(true))
-    return false;
+  ut_ad(space->referenced());

   const auto s= state();
   ut_a(s >= FREED);
@@ -783,18 +779,29 @@ inline bool buf_page_t::flush(bool evict, fil_space_t *space)
   if (s < UNFIXED)
   {
     buf_pool.release_freed_page(this);
-    mysql_mutex_unlock(&buf_pool.mutex);
-    return true;
+    return false;
   }

-  if (s >= READ_FIX || oldest_modification() < 2)
+  ut_d(const auto f=) zip.fix.fetch_add(WRITE_FIX - UNFIXED);
+  ut_ad(f >= UNFIXED);
+  ut_ad(f < READ_FIX);
+  ut_ad((space == fil_system.temp_space)
+        ? oldest_modification() == 2
+        : oldest_modification() > 2);
+
+  /* Increment the I/O operation count used for selecting LRU policy. */
+  buf_LRU_stat_inc_io();
+  mysql_mutex_unlock(&buf_pool.mutex);
+
+  IORequest::Type type= IORequest::WRITE_ASYNC;
+  if (UNIV_UNLIKELY(evict))
   {
-    lock.u_unlock(true);
-    return false;
+    type= IORequest::WRITE_LRU;
+    mysql_mutex_lock(&buf_pool.flush_list_mutex);
+    buf_pool.n_flush_inc();
+    mysql_mutex_unlock(&buf_pool.flush_list_mutex);
   }

-  mysql_mutex_assert_not_owner(&buf_pool.flush_list_mutex);
-
   /* Apart from the U-lock, this block will also be protected by
   is_write_fixed() and oldest_modification()>1.
   Thus, it cannot be relocated or removed. */
@@ -802,25 +809,6 @@ inline bool buf_page_t::flush(bool evict, fil_space_t *space)
   DBUG_PRINT("ib_buf", ("%s %u page %u:%u",
                         evict ? "LRU" : "flush_list",
                         id().space(), id().page_no()));
-  ut_d(const auto f=) zip.fix.fetch_add(WRITE_FIX - UNFIXED);
-  ut_ad(f >= UNFIXED);
-  ut_ad(f < READ_FIX);
-  ut_ad(space == fil_system.temp_space
-        ? oldest_modification() == 2
-        : oldest_modification() > 2);
-  if (evict)
-  {
-    ut_ad(buf_pool.n_flush_LRU_ < ULINT_UNDEFINED);
-    buf_pool.n_flush_LRU_++;
-  }
-  else
-  {
-    ut_ad(buf_pool.n_flush_list_ < ULINT_UNDEFINED);
-    buf_pool.n_flush_list_++;
-  }
-  buf_flush_page_count++;
-
-  mysql_mutex_unlock(&buf_pool.mutex);

   buf_block_t *block= reinterpret_cast<buf_block_t*>(this);
   page_t *write_frame= zip.data;
@@ -830,7 +818,6 @@ inline bool buf_page_t::flush(bool evict, fil_space_t *space)
 #if defined HAVE_FALLOC_PUNCH_HOLE_AND_KEEP_SIZE || defined _WIN32
   size_t orig_size;
 #endif
-  IORequest::Type type= evict ? IORequest::WRITE_LRU : IORequest::WRITE_ASYNC;
   buf_tmp_buffer_t *slot= nullptr;

   if (UNIV_UNLIKELY(!frame)) /* ROW_FORMAT=COMPRESSED */
@@ -874,7 +861,10 @@ inline bool buf_page_t::flush(bool evict, fil_space_t *space)
   {
     switch (space->chain.start->punch_hole) {
     case 1:
-      type= evict ? IORequest::PUNCH_LRU : IORequest::PUNCH;
+      static_assert(IORequest::PUNCH_LRU - IORequest::PUNCH ==
+                    IORequest::WRITE_LRU - IORequest::WRITE_ASYNC, "");
+      type=
+        IORequest::Type(type + (IORequest::PUNCH - IORequest::WRITE_ASYNC));
       break;
     case 2:
       size= orig_size;
@@ -896,15 +886,14 @@ inline bool buf_page_t::flush(bool evict, fil_space_t *space)
       if (lsn > log_sys.get_flushed_lsn())
         log_write_up_to(lsn, true);
     }
+    if (UNIV_LIKELY(space->purpose != FIL_TYPE_TEMPORARY))
+      buf_dblwr.add_unbuffered();
     space->io(IORequest{type, this, slot}, physical_offset(), size,
               write_frame, this);
   }
   else
     buf_dblwr.add_to_batch(IORequest{this, slot, space->chain.start, type},
                            size);
-
-  /* Increment the I/O operation count used for selecting LRU policy. */
-  buf_LRU_stat_inc_io();
   return true;
 }

@@ -931,7 +920,7 @@ static bool buf_flush_check_neighbor(const page_id_t id, ulint fold,
   if (evict && !bpage->is_old())
     return false;

-  return bpage->oldest_modification() > 1 && bpage->ready_for_flush();
+  return bpage->oldest_modification() > 1 && !bpage->is_io_fixed();
 }

 /** Check which neighbors of a page can be flushed from the buf_pool.
@@ -1058,6 +1047,7 @@ uint32_t fil_space_t::flush_freed(bool writable)
 and also write zeroes or punch the hole for the freed ranges of pages.
 @param space       tablespace
 @param page_id     page identifier
+@param bpage       buffer page
 @param contiguous  whether to consider contiguous areas of pages
 @param evict       true=buf_pool.LRU; false=buf_pool.flush_list
 @param n_flushed   number of pages flushed so far in this batch
@@ -1065,10 +1055,12 @@ and also write zeroes or punch the hole for the freed ranges of pages.
 @return number of pages flushed */
 static ulint buf_flush_try_neighbors(fil_space_t *space,
                                      const page_id_t page_id,
+                                     buf_page_t *bpage,
                                      bool contiguous, bool evict,
                                      ulint n_flushed, ulint n_to_flush)
 {
   ut_ad(space->id == page_id.space());
+  ut_ad(bpage->id() == page_id);

   ulint count= 0;
   page_id_t id= page_id;
@@ -1077,9 +1069,15 @@ static ulint buf_flush_try_neighbors(fil_space_t *space,
   ut_ad(page_id >= id);
   ut_ad(page_id < high);

-  for (ulint id_fold= id.fold(); id < high && !space->is_stopping();
-       ++id, ++id_fold)
+  for (ulint id_fold= id.fold(); id < high; ++id, ++id_fold)
   {
+    if (UNIV_UNLIKELY(space->is_stopping()))
+    {
+      if (bpage)
+        bpage->lock.u_unlock(true);
+      break;
+    }
+
     if (count + n_flushed >= n_to_flush)
     {
       if (id > page_id)
@@ -1093,26 +1091,39 @@ static ulint buf_flush_try_neighbors(fil_space_t *space,
     const buf_pool_t::hash_chain &chain= buf_pool.page_hash.cell_get(id_fold);
     mysql_mutex_lock(&buf_pool.mutex);
-    if (buf_page_t *bpage= buf_pool.page_hash.get(id, chain))
+    if (buf_page_t *b= buf_pool.page_hash.get(id, chain))
     {
-      ut_ad(bpage->in_file());
-      /* We avoid flushing 'non-old' blocks in an eviction flush,
-      because the flushed blocks are soon freed */
-      if (!evict || id == page_id || bpage->is_old())
+      ut_ad(b->in_file());
+      if (id == page_id)
       {
-        if (!buf_pool.watch_is_sentinel(*bpage) &&
-            bpage->oldest_modification() > 1 && bpage->ready_for_flush() &&
-            bpage->flush(evict, space))
+        ut_ad(bpage == b);
+        bpage= nullptr;
+        ut_ad(!buf_pool.watch_is_sentinel(*b));
+        ut_ad(b->oldest_modification() > 1);
+      flush:
+        if (b->flush(evict, space))
         {
           ++count;
           continue;
         }
       }
+      /* We avoid flushing 'non-old' blocks in an eviction flush,
+      because the flushed blocks are soon freed */
+      else if ((!evict || b->is_old()) && !buf_pool.watch_is_sentinel(*b) &&
+               b->oldest_modification() > 1 && b->lock.u_lock_try(true))
+      {
+        if (b->oldest_modification() < 2)
+          b->lock.u_unlock(true);
+        else
+          goto flush;
+      }
     }
     mysql_mutex_unlock(&buf_pool.mutex);
   }

+  ut_ad(!bpage);
+
   if (auto n= count - 1)
   {
     MONITOR_INC_VALUE_CUMULATIVE(MONITOR_FLUSH_NEIGHBOR_TOTAL_PAGE,
@@ -1185,27 +1196,20 @@ struct flush_counters_t
   ulint evicted;
 };

-/** Try to discard a dirty page.
+/** Discard a dirty page, and release buf_pool.flush_list_mutex.
 @param bpage      dirty page whose tablespace is not accessible */
 static void buf_flush_discard_page(buf_page_t *bpage)
 {
-  mysql_mutex_assert_owner(&buf_pool.mutex);
-  mysql_mutex_assert_not_owner(&buf_pool.flush_list_mutex);
   ut_ad(bpage->in_file());
   ut_ad(bpage->oldest_modification());

-  if (!bpage->lock.u_lock_try(false))
-    return;
-
-  mysql_mutex_lock(&buf_pool.flush_list_mutex);
   buf_pool.delete_from_flush_list(bpage);
   mysql_mutex_unlock(&buf_pool.flush_list_mutex);

   ut_d(const auto state= bpage->state());
   ut_ad(state == buf_page_t::FREED || state == buf_page_t::UNFIXED ||
         state == buf_page_t::IBUF_EXIST || state == buf_page_t::REINIT);

-  bpage->lock.u_unlock();
-
+  bpage->lock.u_unlock(true);
   buf_LRU_free_page(bpage, true);
 }

@@ -1227,7 +1231,6 @@ static void buf_flush_LRU_list_batch(ulint max, bool evict,
   const auto neighbors= UT_LIST_GET_LEN(buf_pool.LRU) < BUF_LRU_OLD_MIN_LEN
     ? 0 : srv_flush_neighbors;
   fil_space_t *space= nullptr;
-  bool do_evict= evict;
   uint32_t last_space_id= FIL_NULL;
   static_assert(FIL_NULL > SRV_TMP_SPACE_ID, "consistency");
   static_assert(FIL_NULL > SRV_SPACE_ID_UPPER_BOUND, "consistency");
@@ -1236,27 +1239,47 @@ static void buf_flush_LRU_list_batch(ulint max, bool evict,
        bpage &&
        ((UT_LIST_GET_LEN(buf_pool.LRU) > BUF_LRU_MIN_LEN &&
          UT_LIST_GET_LEN(buf_pool.free) < free_limit) ||
-        recv_recovery_is_on()); ++scanned)
+        recv_recovery_is_on());
+       ++scanned, bpage= buf_pool.lru_hp.get())
   {
     buf_page_t *prev= UT_LIST_GET_PREV(LRU, bpage);
-    const lsn_t oldest_modification= bpage->oldest_modification();
     buf_pool.lru_hp.set(prev);
-    const auto state= bpage->state();
+    auto state= bpage->state();
     ut_ad(state >= buf_page_t::FREED);
     ut_ad(bpage->in_LRU_list);

-    if (oldest_modification <= 1)
-    {
+    switch (bpage->oldest_modification()) {
+    case 0:
+    evict:
       if (state != buf_page_t::FREED &&
           (state >= buf_page_t::READ_FIX || (~buf_page_t::LRU_MASK & state)))
-        goto must_skip;
-      if (buf_LRU_free_page(bpage, true))
-        ++n->evicted;
+        continue;
+      buf_LRU_free_page(bpage, true);
+      ++n->evicted;
+      /* fall through */
+    case 1:
+      continue;
     }
-    else if (state < buf_page_t::READ_FIX)
+
+    if (state < buf_page_t::READ_FIX && bpage->lock.u_lock_try(true))
     {
+      ut_ad(!bpage->is_io_fixed());
+      bool do_evict= evict;
+      switch (bpage->oldest_modification()) {
+      case 1:
+        mysql_mutex_lock(&buf_pool.flush_list_mutex);
+        buf_pool.delete_from_flush_list(bpage);
+        mysql_mutex_unlock(&buf_pool.flush_list_mutex);
+        /* fall through */
+      case 0:
+        bpage->lock.u_unlock(true);
+        goto evict;
+      case 2:
+        /* LRU flushing will always evict pages of the temporary
        tablespace. */
+        do_evict= true;
+      }
       /* Block is ready for flush. Dispatch an IO request.
-      If evict=true, the page will be evicted by buf_page_write_complete(). */
+      If do_evict, the page may be evicted by buf_page_write_complete(). */
       const page_id_t page_id(bpage->id());
       const uint32_t space_id= page_id.space();
       if (!space || space->id != space_id)
@@ -1269,14 +1292,10 @@ static void buf_flush_LRU_list_batch(ulint max, bool evict,
           space->release();
         auto p= buf_flush_space(space_id);
         space= p.first;
-        /* For the temporary tablespace, LRU flushing will always
-        evict pages upon completing the write. */
-        do_evict= evict || space == fil_system.temp_space;
         last_space_id= space_id;
         mysql_mutex_lock(&buf_pool.mutex);
         if (p.second)
           buf_pool.stat.n_pages_written+= p.second;
-        goto retry;
       }
       else
         ut_ad(!space);
@@ -1288,17 +1307,24 @@ static void buf_flush_LRU_list_batch(ulint max, bool evict,
       }

       if (!space)
+      {
+        mysql_mutex_lock(&buf_pool.flush_list_mutex);
         buf_flush_discard_page(bpage);
+      }
       else if (neighbors && space->is_rotational())
       {
         mysql_mutex_unlock(&buf_pool.mutex);
-        n->flushed+= buf_flush_try_neighbors(space, page_id, neighbors == 1,
+        n->flushed+= buf_flush_try_neighbors(space, page_id, bpage,
+                                             neighbors == 1,
                                              do_evict, n->flushed, max);
 reacquire_mutex:
         mysql_mutex_lock(&buf_pool.mutex);
       }
       else if (n->flushed >= max && !recv_recovery_is_on())
+      {
+        bpage->lock.u_unlock(true);
         break;
+      }
       else if (bpage->flush(do_evict, space))
       {
         ++n->flushed;
@@ -1306,11 +1332,8 @@ reacquire_mutex:
       }
     }
     else
-    must_skip:
       /* Can't evict or dispatch this block. Go to previous. */
       ut_ad(buf_pool.lru_hp.is_hp(prev));
-  retry:
-    bpage= buf_pool.lru_hp.get();
   }

   buf_pool.lru_hp.set(nullptr);
@@ -1341,6 +1364,7 @@ static void buf_do_LRU_batch(ulint max, bool evict, flush_counters_t *n)
   mysql_mutex_assert_owner(&buf_pool.mutex);
   buf_lru_freed_page_count+= n->evicted;
   buf_lru_flush_page_count+= n->flushed;
+  buf_pool.stat.n_pages_written+= n->flushed;
 }

 /** This utility flushes dirty blocks from the end of the flush_list.
@@ -1354,6 +1378,7 @@ static ulint buf_do_flush_list_batch(ulint max_n, lsn_t lsn)
   ulint scanned= 0;

   mysql_mutex_assert_owner(&buf_pool.mutex);
+  mysql_mutex_assert_owner(&buf_pool.flush_list_mutex);

   const auto neighbors= UT_LIST_GET_LEN(buf_pool.LRU) < BUF_LRU_OLD_MIN_LEN
     ? 0 : srv_flush_neighbors;
@@ -1364,7 +1389,6 @@ static ulint buf_do_flush_list_batch(ulint max_n, lsn_t lsn)

   /* Start from the end of the list looking for a suitable block
   to be flushed. */
-  mysql_mutex_lock(&buf_pool.flush_list_mutex);
   ulint len= UT_LIST_GET_LEN(buf_pool.flush_list);

   for (buf_page_t *bpage= UT_LIST_GET_LAST(buf_pool.flush_list);
@@ -1375,32 +1399,42 @@ static ulint buf_do_flush_list_batch(ulint max_n, lsn_t lsn)
       break;
     ut_ad(bpage->in_file());

-    buf_page_t *prev= UT_LIST_GET_PREV(list, bpage);
-
-    if (oldest_modification == 1)
-    {
-      buf_pool.delete_from_flush_list(bpage);
-    skip:
-      bpage= prev;
-      continue;
-    }
+    buf_page_t *prev= UT_LIST_GET_PREV(list, bpage);

-    ut_ad(oldest_modification > 2);
+    if (oldest_modification == 1)
+    {
+    clear:
+      buf_pool.delete_from_flush_list(bpage);
+    skip:
+      bpage= prev;
+      continue;
+    }

-    if (!bpage->ready_for_flush())
-      goto skip;
+    ut_ad(oldest_modification > 2);

-    /* In order not to degenerate this scan to O(n*n) we attempt to
-    preserve the pointer position. Any thread that would remove 'prev'
-    from buf_pool.flush_list must adjust the hazard pointer.
+    if (!bpage->lock.u_lock_try(true))
+      goto skip;

-    Note: A concurrent execution of buf_flush_list_space() may
-    terminate this scan prematurely. The buf_pool.n_flush_list_
-    should prevent multiple threads from executing
-    buf_do_flush_list_batch() concurrently,
-    but buf_flush_list_space() is ignoring that. */
-    buf_pool.flush_hp.set(prev);
-    mysql_mutex_unlock(&buf_pool.flush_list_mutex);
+    ut_ad(!bpage->is_io_fixed());
+
+    if (bpage->oldest_modification() == 1)
+    {
+      bpage->lock.u_unlock(true);
+      goto clear;
+    }
+
+    /* In order not to degenerate this scan to O(n*n) we attempt to
+    preserve the pointer position. Any thread that would remove 'prev'
+    from buf_pool.flush_list must adjust the hazard pointer.
+
+    Note: A concurrent execution of buf_flush_list_space() may
+    terminate this scan prematurely. The buf_pool.flush_list_active
+    should prevent multiple threads from executing
+    buf_do_flush_list_batch() concurrently,
+    but buf_flush_list_space() is ignoring that.
*/ + buf_pool.flush_hp.set(prev); + } const page_id_t page_id(bpage->id()); const uint32_t space_id= page_id.space(); @@ -1408,8 +1442,6 @@ static ulint buf_do_flush_list_batch(ulint max_n, lsn_t lsn) { if (last_space_id != space_id) { - mysql_mutex_lock(&buf_pool.flush_list_mutex); - buf_pool.flush_hp.set(bpage); mysql_mutex_unlock(&buf_pool.flush_list_mutex); mysql_mutex_unlock(&buf_pool.mutex); if (space) @@ -1418,18 +1450,8 @@ static ulint buf_do_flush_list_batch(ulint max_n, lsn_t lsn) space= p.first; last_space_id= space_id; mysql_mutex_lock(&buf_pool.mutex); - if (p.second) - buf_pool.stat.n_pages_written+= p.second; + buf_pool.stat.n_pages_written+= p.second; mysql_mutex_lock(&buf_pool.flush_list_mutex); - bpage= buf_pool.flush_hp.get(); - if (!bpage) - break; - if (bpage->id() != page_id) - continue; - buf_pool.flush_hp.set(UT_LIST_GET_PREV(list, bpage)); - if (bpage->oldest_modification() <= 1 || !bpage->ready_for_flush()) - goto next; - mysql_mutex_unlock(&buf_pool.flush_list_mutex); } else ut_ad(!space); @@ -1442,27 +1464,29 @@ static ulint buf_do_flush_list_batch(ulint max_n, lsn_t lsn) if (!space) buf_flush_discard_page(bpage); - else if (neighbors && space->is_rotational()) - { - mysql_mutex_unlock(&buf_pool.mutex); - count+= buf_flush_try_neighbors(space, page_id, neighbors == 1, - false, count, max_n); - reacquire_mutex: - mysql_mutex_lock(&buf_pool.mutex); - } - else if (bpage->flush(false, space)) + else { - ++count; - goto reacquire_mutex; + mysql_mutex_unlock(&buf_pool.flush_list_mutex); + if (neighbors && space->is_rotational()) + { + mysql_mutex_unlock(&buf_pool.mutex); + count+= buf_flush_try_neighbors(space, page_id, bpage, neighbors == 1, + false, count, max_n); + reacquire_mutex: + mysql_mutex_lock(&buf_pool.mutex); + } + else if (bpage->flush(false, space)) + { + ++count; + goto reacquire_mutex; + } } mysql_mutex_lock(&buf_pool.flush_list_mutex); - next: bpage= buf_pool.flush_hp.get(); } buf_pool.flush_hp.set(nullptr); - 
mysql_mutex_unlock(&buf_pool.flush_list_mutex); if (space) space->release(); @@ -1472,32 +1496,25 @@ static ulint buf_do_flush_list_batch(ulint max_n, lsn_t lsn) MONITOR_FLUSH_BATCH_SCANNED_NUM_CALL, MONITOR_FLUSH_BATCH_SCANNED_PER_CALL, scanned); - if (count) - MONITOR_INC_VALUE_CUMULATIVE(MONITOR_FLUSH_BATCH_TOTAL_PAGE, - MONITOR_FLUSH_BATCH_COUNT, - MONITOR_FLUSH_BATCH_PAGES, - count); - mysql_mutex_assert_owner(&buf_pool.mutex); return count; } -/** Wait until a flush batch ends. -@param lru true=buf_pool.LRU; false=buf_pool.flush_list */ -void buf_flush_wait_batch_end(bool lru) +/** Wait until a LRU flush batch ends. */ +void buf_flush_wait_LRU_batch_end() { - const auto &n_flush= lru ? buf_pool.n_flush_LRU_ : buf_pool.n_flush_list_; + mysql_mutex_assert_owner(&buf_pool.flush_list_mutex); + mysql_mutex_assert_not_owner(&buf_pool.mutex); - if (n_flush) + if (buf_pool.n_flush()) { - auto cond= lru ? &buf_pool.done_flush_LRU : &buf_pool.done_flush_list; tpool::tpool_wait_begin(); thd_wait_begin(nullptr, THD_WAIT_DISKIO); do - my_cond_wait(cond, &buf_pool.mutex.m_mutex); - while (n_flush); + my_cond_wait(&buf_pool.done_flush_LRU, + &buf_pool.flush_list_mutex.m_mutex); + while (buf_pool.n_flush()); tpool::tpool_wait_end(); thd_wait_end(nullptr); - pthread_cond_broadcast(cond); } } @@ -1514,21 +1531,31 @@ static ulint buf_flush_list_holding_mutex(ulint max_n= ULINT_UNDEFINED, ut_ad(lsn); mysql_mutex_assert_owner(&buf_pool.mutex); - if (buf_pool.n_flush_list_) + mysql_mutex_lock(&buf_pool.flush_list_mutex); + if (buf_pool.flush_list_active()) + { +nothing_to_do: + mysql_mutex_unlock(&buf_pool.flush_list_mutex); return 0; - - /* FIXME: we are performing a dirty read of buf_pool.flush_list.count - while not holding buf_pool.flush_list_mutex */ - if (!UT_LIST_GET_LEN(buf_pool.flush_list)) + } + if (!buf_pool.get_oldest_modification(0)) { pthread_cond_broadcast(&buf_pool.done_flush_list); - return 0; + goto nothing_to_do; } - - buf_pool.n_flush_list_++; + 
buf_pool.flush_list_set_active(); const ulint n_flushed= buf_do_flush_list_batch(max_n, lsn); - if (!--buf_pool.n_flush_list_) - pthread_cond_broadcast(&buf_pool.done_flush_list); + if (n_flushed) + buf_pool.stat.n_pages_written+= n_flushed; + buf_pool.flush_list_set_inactive(); + pthread_cond_broadcast(&buf_pool.done_flush_list); + mysql_mutex_unlock(&buf_pool.flush_list_mutex); + + if (n_flushed) + MONITOR_INC_VALUE_CUMULATIVE(MONITOR_FLUSH_BATCH_TOTAL_PAGE, + MONITOR_FLUSH_BATCH_COUNT, + MONITOR_FLUSH_BATCH_PAGES, + n_flushed); DBUG_PRINT("ib_buf", ("flush_list completed, " ULINTPF " pages", n_flushed)); return n_flushed; @@ -1560,6 +1587,7 @@ bool buf_flush_list_space(fil_space_t *space, ulint *n_flushed) bool may_have_skipped= false; ulint max_n_flush= srv_io_capacity; + ulint n_flush= 0; bool acquired= space->acquire(); { @@ -1576,11 +1604,17 @@ bool buf_flush_list_space(fil_space_t *space, ulint *n_flushed) ut_ad(bpage->in_file()); buf_page_t *prev= UT_LIST_GET_PREV(list, bpage); - if (bpage->id().space() != space_id); - else if (bpage->oldest_modification() == 1) + if (bpage->oldest_modification() == 1) + clear: buf_pool.delete_from_flush_list(bpage); - else if (!bpage->ready_for_flush()) + else if (bpage->id().space() != space_id); + else if (!bpage->lock.u_lock_try(true)) may_have_skipped= true; + else if (bpage->oldest_modification() == 1) + { + bpage->lock.u_unlock(true); + goto clear; + } else { /* In order not to degenerate this scan to O(n*n) we attempt to @@ -1592,13 +1626,10 @@ bool buf_flush_list_space(fil_space_t *space, ulint *n_flushed) concurrently. This may terminate our iteration prematurely, leading us to return may_have_skipped=true. 
*/ buf_pool.flush_hp.set(prev); - mysql_mutex_unlock(&buf_pool.flush_list_mutex); if (!acquired) - { was_freed: buf_flush_discard_page(bpage); - } else { if (space->is_stopping()) @@ -1607,28 +1638,24 @@ bool buf_flush_list_space(fil_space_t *space, ulint *n_flushed) acquired= false; goto was_freed; } - if (!bpage->flush(false, space)) - { - may_have_skipped= true; - mysql_mutex_lock(&buf_pool.flush_list_mutex); - goto next_after_skip; - } - if (n_flushed) - ++*n_flushed; - if (!--max_n_flush) + mysql_mutex_unlock(&buf_pool.flush_list_mutex); + if (bpage->flush(false, space)) { + ++n_flush; + if (!--max_n_flush) + { + mysql_mutex_lock(&buf_pool.mutex); + mysql_mutex_lock(&buf_pool.flush_list_mutex); + may_have_skipped= true; + goto done; + } mysql_mutex_lock(&buf_pool.mutex); - mysql_mutex_lock(&buf_pool.flush_list_mutex); - may_have_skipped= true; - break; } - mysql_mutex_lock(&buf_pool.mutex); } mysql_mutex_lock(&buf_pool.flush_list_mutex); if (!buf_pool.flush_hp.is_hp(prev)) may_have_skipped= true; - next_after_skip: bpage= buf_pool.flush_hp.get(); continue; } @@ -1641,14 +1668,19 @@ bool buf_flush_list_space(fil_space_t *space, ulint *n_flushed) buf_flush_list_space(). We should always return true from buf_flush_list_space() if that should be the case; in buf_do_flush_list_batch() we will simply perform less work. 
*/ - +done: buf_pool.flush_hp.set(nullptr); mysql_mutex_unlock(&buf_pool.flush_list_mutex); buf_pool.try_LRU_scan= true; pthread_cond_broadcast(&buf_pool.done_free); + + buf_pool.stat.n_pages_written+= n_flush; mysql_mutex_unlock(&buf_pool.mutex); + if (n_flushed) + *n_flushed= n_flush; + if (acquired) space->release(); @@ -1672,29 +1704,20 @@ ulint buf_flush_LRU(ulint max_n, bool evict) { mysql_mutex_assert_owner(&buf_pool.mutex); - if (evict) - { - if (buf_pool.n_flush_LRU_) - return 0; - buf_pool.n_flush_LRU_= 1; - } - flush_counters_t n; buf_do_LRU_batch(max_n, evict, &n); + ulint pages= n.flushed; + if (n.evicted) { + if (evict) + pages+= n.evicted; buf_pool.try_LRU_scan= true; - pthread_cond_signal(&buf_pool.done_free); + pthread_cond_broadcast(&buf_pool.done_free); } - if (!evict) - return n.flushed; - - if (!--buf_pool.n_flush_LRU_) - pthread_cond_broadcast(&buf_pool.done_flush_LRU); - - return n.evicted + n.flushed; + return pages; } /** Initiate a log checkpoint, discarding the start of the log. 
@@ -1826,9 +1849,14 @@ static void buf_flush_wait(lsn_t lsn) buf_flush_sync_lsn= lsn; buf_pool.page_cleaner_set_idle(false); pthread_cond_signal(&buf_pool.do_flush_list); + my_cond_wait(&buf_pool.done_flush_list, + &buf_pool.flush_list_mutex.m_mutex); + if (buf_pool.get_oldest_modification(lsn) >= lsn) + break; } - my_cond_wait(&buf_pool.done_flush_list, - &buf_pool.flush_list_mutex.m_mutex); + mysql_mutex_unlock(&buf_pool.flush_list_mutex); + buf_dblwr.wait_for_page_writes(); + mysql_mutex_lock(&buf_pool.flush_list_mutex); } } @@ -1849,6 +1877,9 @@ ATTRIBUTE_COLD void buf_flush_wait_flushed(lsn_t sync_lsn) if (buf_pool.get_oldest_modification(sync_lsn) < sync_lsn) { MONITOR_INC(MONITOR_FLUSH_SYNC_WAITS); + thd_wait_begin(nullptr, THD_WAIT_DISKIO); + tpool::tpool_wait_begin(); + #if 1 /* FIXME: remove this, and guarantee that the page cleaner serves us */ if (UNIV_UNLIKELY(!buf_page_cleaner_is_active)) { @@ -1856,28 +1887,23 @@ ATTRIBUTE_COLD void buf_flush_wait_flushed(lsn_t sync_lsn) { mysql_mutex_unlock(&buf_pool.flush_list_mutex); ulint n_pages= buf_flush_list(srv_max_io_capacity, sync_lsn); - mysql_mutex_lock(&buf_pool.mutex); - buf_flush_wait_batch_end(false); - mysql_mutex_unlock(&buf_pool.mutex); if (n_pages) { MONITOR_INC_VALUE_CUMULATIVE(MONITOR_FLUSH_SYNC_TOTAL_PAGE, MONITOR_FLUSH_SYNC_COUNT, MONITOR_FLUSH_SYNC_PAGES, n_pages); } + buf_dblwr.wait_for_page_writes(); mysql_mutex_lock(&buf_pool.flush_list_mutex); } while (buf_pool.get_oldest_modification(sync_lsn) < sync_lsn); } else #endif - { - thd_wait_begin(nullptr, THD_WAIT_DISKIO); - tpool::tpool_wait_begin(); buf_flush_wait(sync_lsn); - tpool::tpool_wait_end(); - thd_wait_end(nullptr); - } + + tpool::tpool_wait_end(); + thd_wait_end(nullptr); } mysql_mutex_unlock(&buf_pool.flush_list_mutex); @@ -1930,11 +1956,10 @@ and try to initiate checkpoints until the target is met. 
ATTRIBUTE_COLD static void buf_flush_sync_for_checkpoint(lsn_t lsn) { ut_ad(!srv_read_only_mode); + mysql_mutex_assert_not_owner(&buf_pool.flush_list_mutex); for (;;) { - mysql_mutex_unlock(&buf_pool.flush_list_mutex); - if (ulint n_flushed= buf_flush_list(srv_max_io_capacity, lsn)) { MONITOR_INC_VALUE_CUMULATIVE(MONITOR_FLUSH_SYNC_TOTAL_PAGE, @@ -1985,6 +2010,7 @@ ATTRIBUTE_COLD static void buf_flush_sync_for_checkpoint(lsn_t lsn) /* wake up buf_flush_wait() */ pthread_cond_broadcast(&buf_pool.done_flush_list); + mysql_mutex_unlock(&buf_pool.flush_list_mutex); lsn= std::max(lsn, target); @@ -2179,8 +2205,6 @@ static void buf_flush_page_cleaner() timespec abstime; set_timespec(abstime, 1); - mysql_mutex_lock(&buf_pool.flush_list_mutex); - lsn_t lsn_limit; ulint last_activity_count= srv_get_activity_count(); @@ -2188,45 +2212,34 @@ static void buf_flush_page_cleaner() { lsn_limit= buf_flush_sync_lsn; - if (UNIV_UNLIKELY(lsn_limit != 0)) + if (UNIV_UNLIKELY(lsn_limit != 0) && UNIV_LIKELY(srv_flush_sync)) { furious_flush: - if (UNIV_LIKELY(srv_flush_sync)) - { - buf_flush_sync_for_checkpoint(lsn_limit); - last_pages= 0; - set_timespec(abstime, 1); - continue; - } + buf_flush_sync_for_checkpoint(lsn_limit); + last_pages= 0; + set_timespec(abstime, 1); + continue; } + + mysql_mutex_lock(&buf_pool.flush_list_mutex); + if (buf_pool.ran_out()) + goto no_wait; else if (srv_shutdown_state > SRV_SHUTDOWN_INITIATED) break; - /* If buf pager cleaner is idle and there is no work - (either dirty pages are all flushed or adaptive flushing - is not enabled) then opt for non-timed wait */ if (buf_pool.page_cleaner_idle() && (!UT_LIST_GET_LEN(buf_pool.flush_list) || srv_max_dirty_pages_pct_lwm == 0.0)) + /* We are idle; wait for buf_pool.page_cleaner_wakeup() */ my_cond_wait(&buf_pool.do_flush_list, &buf_pool.flush_list_mutex.m_mutex); else my_cond_timedwait(&buf_pool.do_flush_list, &buf_pool.flush_list_mutex.m_mutex, &abstime); - + no_wait: set_timespec(abstime, 1); - lsn_t 
soft_lsn_limit= buf_flush_async_lsn; lsn_limit= buf_flush_sync_lsn; - - if (UNIV_UNLIKELY(lsn_limit != 0)) - { - if (UNIV_LIKELY(srv_flush_sync)) - goto furious_flush; - } - else if (srv_shutdown_state > SRV_SHUTDOWN_INITIATED) - break; - const lsn_t oldest_lsn= buf_pool.get_oldest_modification(0); if (!oldest_lsn) @@ -2241,6 +2254,8 @@ static void buf_flush_page_cleaner() buf_flush_async_lsn= 0; set_idle: buf_pool.page_cleaner_set_idle(true); + if (UNIV_UNLIKELY(srv_shutdown_state > SRV_SHUTDOWN_INITIATED)) + break; mysql_mutex_unlock(&buf_pool.flush_list_mutex); end_of_batch: buf_dblwr.flush_buffered_writes(); @@ -2257,10 +2272,57 @@ static void buf_flush_page_cleaner() } while (false); + if (!buf_pool.ran_out()) + continue; mysql_mutex_lock(&buf_pool.flush_list_mutex); - continue; } + lsn_t soft_lsn_limit= buf_flush_async_lsn; + + if (UNIV_UNLIKELY(lsn_limit != 0)) + { + if (srv_flush_sync) + goto do_furious_flush; + if (oldest_lsn >= lsn_limit) + { + buf_flush_sync_lsn= 0; + pthread_cond_broadcast(&buf_pool.done_flush_list); + } + else if (lsn_limit > soft_lsn_limit) + soft_lsn_limit= lsn_limit; + } + + bool idle_flush= false; + ulint n_flushed= 0, n; + + if (UNIV_UNLIKELY(soft_lsn_limit != 0)) + { + if (oldest_lsn >= soft_lsn_limit) + buf_flush_async_lsn= soft_lsn_limit= 0; + } + else if (buf_pool.ran_out()) + { + buf_pool.page_cleaner_set_idle(false); + mysql_mutex_unlock(&buf_pool.flush_list_mutex); + n= srv_max_io_capacity; + mysql_mutex_lock(&buf_pool.mutex); + LRU_flush: + n= buf_flush_LRU(n, false); + mysql_mutex_unlock(&buf_pool.mutex); + last_pages+= n; + + if (!idle_flush) + goto end_of_batch; + + /* When idle flushing kicks in, the page cleaner is marked active. + Reset it back to idle, since it was made active only as part of + the idle flushing stage. 
*/ + mysql_mutex_lock(&buf_pool.flush_list_mutex); + goto set_idle; + } + else if (UNIV_UNLIKELY(srv_shutdown_state > SRV_SHUTDOWN_INITIATED)) + break; + const ulint dirty_blocks= UT_LIST_GET_LEN(buf_pool.flush_list); ut_ad(dirty_blocks); /* We perform dirty reads of the LRU+free list lengths here. @@ -2268,60 +2330,53 @@ static void buf_flush_page_cleaner() guaranteed to be nonempty, and it is a subset of buf_pool.LRU. */ const double dirty_pct= double(dirty_blocks) * 100.0 / double(UT_LIST_GET_LEN(buf_pool.LRU) + UT_LIST_GET_LEN(buf_pool.free)); - - bool idle_flush= false; - - if (lsn_limit || soft_lsn_limit); - else if (af_needed_for_redo(oldest_lsn)); - else if (srv_max_dirty_pages_pct_lwm != 0.0) + if (srv_max_dirty_pages_pct_lwm != 0.0) { const ulint activity_count= srv_get_activity_count(); if (activity_count != last_activity_count) + { last_activity_count= activity_count; + goto maybe_unemployed; + } else if (buf_pool.page_cleaner_idle() && buf_pool.n_pend_reads == 0) { - /* reaching here means 3 things: - - last_activity_count == activity_count: suggesting server is idle - (no trx_t::commit activity) - - page cleaner is idle (dirty_pct < srv_max_dirty_pages_pct_lwm) - - there are no pending reads but there are dirty pages to flush */ - idle_flush= true; + /* reaching here means 3 things: + - last_activity_count == activity_count: suggesting server is idle + (no trx_t::commit() activity) + - page cleaner is idle (dirty_pct < srv_max_dirty_pages_pct_lwm) + - there are no pending reads but there are dirty pages to flush */ buf_pool.update_last_activity_count(activity_count); + mysql_mutex_unlock(&buf_pool.flush_list_mutex); + idle_flush= true; + goto idle_flush; } - - if (!idle_flush && dirty_pct < srv_max_dirty_pages_pct_lwm) - goto unemployed; + else + maybe_unemployed: + if (dirty_pct < srv_max_dirty_pages_pct_lwm) + goto possibly_unemployed; } else if (dirty_pct < srv_max_buf_pool_modified_pct) - goto unemployed; - - if (UNIV_UNLIKELY(lsn_limit != 0) && 
oldest_lsn >= lsn_limit) - lsn_limit= buf_flush_sync_lsn= 0; - if (UNIV_UNLIKELY(soft_lsn_limit != 0) && oldest_lsn >= soft_lsn_limit) - soft_lsn_limit= buf_flush_async_lsn= 0; + possibly_unemployed: + if (!soft_lsn_limit && !af_needed_for_redo(oldest_lsn)) + goto unemployed; buf_pool.page_cleaner_set_idle(false); mysql_mutex_unlock(&buf_pool.flush_list_mutex); - if (!lsn_limit) - lsn_limit= soft_lsn_limit; - - ulint n_flushed= 0, n; - - if (UNIV_UNLIKELY(lsn_limit != 0)) + if (UNIV_UNLIKELY(soft_lsn_limit != 0)) { n= srv_max_io_capacity; goto background_flush; } - else if (idle_flush || !srv_adaptive_flushing) + + if (!srv_adaptive_flushing) { + idle_flush: n= srv_io_capacity; - lsn_limit= LSN_MAX; + soft_lsn_limit= LSN_MAX; background_flush: mysql_mutex_lock(&buf_pool.mutex); - n_flushed= buf_flush_list_holding_mutex(n, lsn_limit); - /* wake up buf_flush_wait() */ - pthread_cond_broadcast(&buf_pool.done_flush_list); + n_flushed= buf_flush_list_holding_mutex(n, soft_lsn_limit); MONITOR_INC_VALUE_CUMULATIVE(MONITOR_FLUSH_BACKGROUND_TOTAL_PAGE, MONITOR_FLUSH_BACKGROUND_COUNT, MONITOR_FLUSH_BACKGROUND_PAGES, @@ -2347,18 +2402,8 @@ static void buf_flush_page_cleaner() goto unemployed; } - n= buf_flush_LRU(n >= n_flushed ? n - n_flushed : 0, false); - mysql_mutex_unlock(&buf_pool.mutex); - last_pages+= n; - - if (!idle_flush) - goto end_of_batch; - - /* when idle flushing kicks in page_cleaner is marked active. - reset it back to idle since the it was made active as part of - idle flushing stage. */ - mysql_mutex_lock(&buf_pool.flush_list_mutex); - goto set_idle; + n= n >= n_flushed ? 
n - n_flushed : 0; + goto LRU_flush; } mysql_mutex_unlock(&buf_pool.flush_list_mutex); @@ -2366,16 +2411,20 @@ static void buf_flush_page_cleaner() if (srv_fast_shutdown != 2) { buf_dblwr.flush_buffered_writes(); - mysql_mutex_lock(&buf_pool.mutex); - buf_flush_wait_batch_end(true); - buf_flush_wait_batch_end(false); - mysql_mutex_unlock(&buf_pool.mutex); + mysql_mutex_lock(&buf_pool.flush_list_mutex); + buf_flush_wait_LRU_batch_end(); + mysql_mutex_unlock(&buf_pool.flush_list_mutex); + buf_dblwr.wait_for_page_writes(); } mysql_mutex_lock(&buf_pool.flush_list_mutex); lsn_limit= buf_flush_sync_lsn; if (UNIV_UNLIKELY(lsn_limit != 0)) + { + do_furious_flush: + mysql_mutex_unlock(&buf_pool.flush_list_mutex); goto furious_flush; + } buf_page_cleaner_is_active= false; pthread_cond_broadcast(&buf_pool.done_flush_list); mysql_mutex_unlock(&buf_pool.flush_list_mutex); @@ -2400,17 +2449,6 @@ ATTRIBUTE_COLD void buf_flush_page_cleaner_init() std::thread(buf_flush_page_cleaner).detach(); } -#if defined(HAVE_SYSTEMD) && !defined(EMBEDDED_LIBRARY) -/** @return the number of dirty pages in the buffer pool */ -static ulint buf_flush_list_length() -{ - mysql_mutex_lock(&buf_pool.flush_list_mutex); - const ulint len= UT_LIST_GET_LEN(buf_pool.flush_list); - mysql_mutex_unlock(&buf_pool.flush_list_mutex); - return len; -} -#endif - /** Flush the buffer pool on shutdown. 
*/ ATTRIBUTE_COLD void buf_flush_buffer_pool() { @@ -2425,24 +2463,20 @@ ATTRIBUTE_COLD void buf_flush_buffer_pool() while (buf_pool.get_oldest_modification(0)) { mysql_mutex_unlock(&buf_pool.flush_list_mutex); - mysql_mutex_lock(&buf_pool.mutex); - buf_flush_list_holding_mutex(srv_max_io_capacity); - if (buf_pool.n_flush_list_) + buf_flush_list(srv_max_io_capacity); + if (const size_t pending= buf_dblwr.pending_writes()) { - mysql_mutex_unlock(&buf_pool.mutex); timespec abstime; service_manager_extend_timeout(INNODB_EXTEND_TIMEOUT_INTERVAL, - "Waiting to flush " ULINTPF " pages", - buf_flush_list_length()); + "Waiting to write %zu pages", pending); set_timespec(abstime, INNODB_EXTEND_TIMEOUT_INTERVAL / 2); - buf_dblwr.flush_buffered_writes(); - mysql_mutex_lock(&buf_pool.mutex); - while (buf_pool.n_flush_list_) - my_cond_timedwait(&buf_pool.done_flush_list, &buf_pool.mutex.m_mutex, - &abstime); + buf_dblwr.wait_for_page_writes(abstime); } - mysql_mutex_unlock(&buf_pool.mutex); + mysql_mutex_lock(&buf_pool.flush_list_mutex); + service_manager_extend_timeout(INNODB_EXTEND_TIMEOUT_INTERVAL, + "Waiting to flush " ULINTPF " pages", + UT_LIST_GET_LEN(buf_pool.flush_list)); } mysql_mutex_unlock(&buf_pool.flush_list_mutex); @@ -2483,6 +2517,7 @@ void buf_flush_sync() if (lsn == log_sys.get_lsn()) break; } + mysql_mutex_unlock(&buf_pool.flush_list_mutex); tpool::tpool_wait_end(); thd_wait_end(nullptr); diff --git a/storage/innobase/buf/buf0lru.cc b/storage/innobase/buf/buf0lru.cc index 9fa6492d525..1947dfaeeb4 100644 --- a/storage/innobase/buf/buf0lru.cc +++ b/storage/innobase/buf/buf0lru.cc @@ -136,7 +136,6 @@ static void buf_LRU_block_free_hashed_page(buf_block_t *block) @param[in] bpage control block */ static inline void incr_LRU_size_in_bytes(const buf_page_t* bpage) { - /* FIXME: use atomics, not mutex */ mysql_mutex_assert_owner(&buf_pool.mutex); buf_pool.stat.LRU_bytes += bpage->physical_size(); @@ -400,6 +399,7 @@ buf_block_t *buf_LRU_get_free_block(bool 
have_mutex) DBUG_EXECUTE_IF("recv_ran_out_of_buffer", if (recv_recovery_is_on() && recv_sys.apply_log_recs) { + mysql_mutex_lock(&buf_pool.mutex); goto flush_lru; }); get_mutex: @@ -445,20 +445,32 @@ got_block: if ((block = buf_LRU_get_free_only()) != nullptr) { goto got_block; } - if (!buf_pool.n_flush_LRU_) { - break; + mysql_mutex_unlock(&buf_pool.mutex); + mysql_mutex_lock(&buf_pool.flush_list_mutex); + const auto n_flush = buf_pool.n_flush(); + mysql_mutex_unlock(&buf_pool.flush_list_mutex); + mysql_mutex_lock(&buf_pool.mutex); + if (!n_flush) { + goto not_found; + } + if (!buf_pool.try_LRU_scan) { + mysql_mutex_lock(&buf_pool.flush_list_mutex); + buf_pool.page_cleaner_wakeup(true); + mysql_mutex_unlock(&buf_pool.flush_list_mutex); + my_cond_wait(&buf_pool.done_free, + &buf_pool.mutex.m_mutex); } - my_cond_wait(&buf_pool.done_free, &buf_pool.mutex.m_mutex); } -#ifndef DBUG_OFF not_found: -#endif - mysql_mutex_unlock(&buf_pool.mutex); + if (n_iterations > 1) { + MONITOR_INC( MONITOR_LRU_GET_FREE_WAITS ); + } - if (n_iterations > 20 && !buf_lru_free_blocks_error_printed + if (n_iterations == 21 && !buf_lru_free_blocks_error_printed && srv_buf_pool_old_size == srv_buf_pool_size) { - + buf_lru_free_blocks_error_printed = true; + mysql_mutex_unlock(&buf_pool.mutex); ib::warn() << "Difficult to find free blocks in the buffer pool" " (" << n_iterations << " search iterations)! " << flush_failures << " failed attempts to" @@ -472,12 +484,7 @@ not_found: << os_n_file_writes << " OS file writes, " << os_n_fsyncs << " OS fsyncs."; - - buf_lru_free_blocks_error_printed = true; - } - - if (n_iterations > 1) { - MONITOR_INC( MONITOR_LRU_GET_FREE_WAITS ); + mysql_mutex_lock(&buf_pool.mutex); } /* No free block was found: try to flush the LRU list. 
@@ -491,8 +498,6 @@ not_found: #ifndef DBUG_OFF flush_lru: #endif - mysql_mutex_lock(&buf_pool.mutex); - if (!buf_flush_LRU(innodb_lru_flush_size, true)) { MONITOR_INC(MONITOR_LRU_SINGLE_FLUSH_FAILURE_COUNT); ++flush_failures; @@ -1039,7 +1044,8 @@ buf_LRU_block_free_non_file_page( } else { UT_LIST_ADD_FIRST(buf_pool.free, &block->page); ut_d(block->page.in_free_list = true); - pthread_cond_signal(&buf_pool.done_free); + buf_pool.try_LRU_scan= true; + pthread_cond_broadcast(&buf_pool.done_free); } MEM_NOACCESS(block->page.frame, srv_page_size); diff --git a/storage/innobase/buf/buf0rea.cc b/storage/innobase/buf/buf0rea.cc index b20b105a4c4..b39a8f49133 100644 --- a/storage/innobase/buf/buf0rea.cc +++ b/storage/innobase/buf/buf0rea.cc @@ -226,6 +226,7 @@ static buf_page_t* buf_page_init_for_read(ulint mode, const page_id_t page_id, buf_LRU_add_block(bpage, true/* to old blocks */); } + buf_pool.stat.n_pages_read++; mysql_mutex_unlock(&buf_pool.mutex); buf_pool.n_pend_reads++; goto func_exit_no_mutex; @@ -245,20 +246,18 @@ buffer buf_pool if it is not already there, in which case does nothing. Sets the io_fix flag and sets an exclusive lock on the buffer frame. The flag is cleared and the x-lock released by an i/o-handler thread. 
-@param[out] err DB_SUCCESS or DB_TABLESPACE_DELETED - if we are trying - to read from a non-existent tablespace @param[in,out] space tablespace @param[in] sync true if synchronous aio is desired @param[in] mode BUF_READ_IBUF_PAGES_ONLY, ..., @param[in] page_id page id @param[in] zip_size ROW_FORMAT=COMPRESSED page size, or 0 @param[in] unzip true=request uncompressed page -@return whether a read request was queued */ +@return error code +@retval DB_SUCCESS if the page was read +@retval DB_SUCCESS_LOCKED_REC if the page exists in the buffer pool already */ static -bool +dberr_t buf_read_page_low( - dberr_t* err, fil_space_t* space, bool sync, ulint mode, @@ -268,15 +267,12 @@ buf_read_page_low( { buf_page_t* bpage; - *err = DB_SUCCESS; - if (buf_dblwr.is_inside(page_id)) { ib::error() << "Trying to read doublewrite buffer page " << page_id; ut_ad(0); -nothing_read: space->release(); - return false; + return DB_PAGE_CORRUPTED; } if (sync) { @@ -299,8 +295,9 @@ nothing_read: completed */ bpage = buf_page_init_for_read(mode, page_id, zip_size, unzip); - if (bpage == NULL) { - goto nothing_read; + if (!bpage) { + space->release(); + return DB_SUCCESS_LOCKED_REC; } ut_ad(bpage->in_file()); @@ -320,7 +317,6 @@ nothing_read: ? 
IORequest::READ_SYNC : IORequest::READ_ASYNC), page_id.page_no() * len, len, dst, bpage); - *err = fio.err; if (UNIV_UNLIKELY(fio.err != DB_SUCCESS)) { ut_d(auto n=) buf_pool.n_pend_reads--; @@ -329,14 +325,14 @@ nothing_read: } else if (sync) { thd_wait_end(NULL); /* The i/o was already completed in space->io() */ - *err = bpage->read_complete(*fio.node); + fio.err = bpage->read_complete(*fio.node); space->release(); - if (*err == DB_FAIL) { - *err = DB_PAGE_CORRUPTED; + if (fio.err == DB_FAIL) { + fio.err = DB_PAGE_CORRUPTED; } } - return true; + return fio.err; } /** Applies a random read-ahead in buf_pool if there are at least a threshold @@ -414,24 +410,26 @@ read_ahead: continue; if (space->is_stopping()) break; - dberr_t err; space->reacquire(); - if (buf_read_page_low(&err, space, false, ibuf_mode, i, zip_size, false)) + if (buf_read_page_low(space, false, ibuf_mode, i, zip_size, false) == + DB_SUCCESS) count++; } if (count) + { DBUG_PRINT("ib_buf", ("random read-ahead %zu pages from %s: %u", count, space->chain.start->name, low.page_no())); - space->release(); - - /* Read ahead is considered one I/O operation for the purpose of - LRU policy decision. */ - buf_LRU_stat_inc_io(); + mysql_mutex_lock(&buf_pool.mutex); + /* Read ahead is considered one I/O operation for the purpose of + LRU policy decision. */ + buf_LRU_stat_inc_io(); + buf_pool.stat.n_ra_pages_read_rnd+= count; + mysql_mutex_unlock(&buf_pool.mutex); + } - buf_pool.stat.n_ra_pages_read_rnd+= count; - srv_stats.buf_pool_reads.add(count); + space->release(); return count; } @@ -441,8 +439,9 @@ on the buffer frame. The flag is cleared and the x-lock released by the i/o-handler thread. 
@param[in] page_id page id @param[in] zip_size ROW_FORMAT=COMPRESSED page size, or 0 -@retval DB_SUCCESS if the page was read and is not corrupted, -@retval DB_PAGE_CORRUPTED if page based on checksum check is corrupted, +@retval DB_SUCCESS if the page was read and is not corrupted +@retval DB_SUCCESS_LOCKED_REC if the page was not read +@retval DB_PAGE_CORRUPTED if page based on checksum check is corrupted @retval DB_DECRYPTION_FAILED if page post encryption checksum matches but after decryption normal page checksum does not match. @retval DB_TABLESPACE_DELETED if tablespace .ibd file is missing */ @@ -456,13 +455,9 @@ dberr_t buf_read_page(const page_id_t page_id, ulint zip_size) return DB_TABLESPACE_DELETED; } - dberr_t err; - if (buf_read_page_low(&err, space, true, BUF_READ_ANY_PAGE, - page_id, zip_size, false)) - srv_stats.buf_pool_reads.add(1); - - buf_LRU_stat_inc_io(); - return err; + buf_LRU_stat_inc_io(); /* NOT protected by buf_pool.mutex */ + return buf_read_page_low(space, true, BUF_READ_ANY_PAGE, + page_id, zip_size, false); } /** High-level function which reads a page asynchronously from a file to the @@ -475,12 +470,8 @@ released by the i/o-handler thread. void buf_read_page_background(fil_space_t *space, const page_id_t page_id, ulint zip_size) { - dberr_t err; - - if (buf_read_page_low(&err, space, false, BUF_READ_ANY_PAGE, - page_id, zip_size, false)) { - srv_stats.buf_pool_reads.add(1); - } + buf_read_page_low(space, false, BUF_READ_ANY_PAGE, + page_id, zip_size, false); /* We do not increment number of I/O operations used for LRU policy here (buf_LRU_stat_inc_io()). 
 We use this in heuristics to decide
@@ -638,23 +629,26 @@ failed:
       continue;
     if (space->is_stopping())
       break;
-    dberr_t err;
     space->reacquire();
-    count+= buf_read_page_low(&err, space, false, ibuf_mode, new_low, zip_size,
-                              false);
+    if (buf_read_page_low(space, false, ibuf_mode, new_low, zip_size, false) ==
+        DB_SUCCESS)
+      count++;
   }
 
   if (count)
+  {
     DBUG_PRINT("ib_buf",
                ("linear read-ahead %zu pages from %s: %u",
                 count, space->chain.start->name, new_low.page_no()));
-  space->release();
-
-  /* Read ahead is considered one I/O operation for the purpose of
-  LRU policy decision. */
-  buf_LRU_stat_inc_io();
+    mysql_mutex_lock(&buf_pool.mutex);
+    /* Read ahead is considered one I/O operation for the purpose of
+    LRU policy decision. */
+    buf_LRU_stat_inc_io();
+    buf_pool.stat.n_ra_pages_read+= count;
+    mysql_mutex_unlock(&buf_pool.mutex);
+  }
 
-  buf_pool.stat.n_ra_pages_read+= count;
+  space->release();
   return count;
 }
 
@@ -709,13 +703,12 @@ void buf_read_recv_pages(ulint space_id, const uint32_t* page_nos, ulint n)
 			}
 		}
 
-		dberr_t err;
 		space->reacquire();
-		buf_read_page_low(&err, space, false,
-				  BUF_READ_ANY_PAGE, cur_page_id, zip_size,
-				  true);
-
-		if (err != DB_SUCCESS) {
+		switch (buf_read_page_low(space, false, BUF_READ_ANY_PAGE,
+					  cur_page_id, zip_size, true)) {
+		case DB_SUCCESS: case DB_SUCCESS_LOCKED_REC:
+			break;
+		default:
 			sql_print_error("InnoDB: Recovery failed to read page "
 					UINT32PF " from %s",
 					cur_page_id.page_no(),
diff --git a/storage/innobase/gis/gis0rtree.cc b/storage/innobase/gis/gis0rtree.cc
index 59d77c9c5fc..83afd732b21 100644
--- a/storage/innobase/gis/gis0rtree.cc
+++ b/storage/innobase/gis/gis0rtree.cc
@@ -1209,8 +1209,6 @@ after_insert:
 	ut_ad(!rec || rec_offs_validate(rec, cursor->index(), *offsets));
 #endif
 
-	MONITOR_INC(MONITOR_INDEX_SPLIT);
-
 	return(rec);
 }
 
diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
index aa2fb7c38eb..cac20c70e02 100644
--- a/storage/innobase/handler/ha_innodb.cc
+++ b/storage/innobase/handler/ha_innodb.cc
@@ -915,43 +915,37 @@ static SHOW_VAR innodb_status_variables[]= {
   (char*) &export_vars.innodb_buffer_pool_resize_status, SHOW_CHAR},
  {"buffer_pool_load_incomplete",
  &export_vars.innodb_buffer_pool_load_incomplete, SHOW_BOOL},
- {"buffer_pool_pages_data",
-  &export_vars.innodb_buffer_pool_pages_data, SHOW_SIZE_T},
+ {"buffer_pool_pages_data", &UT_LIST_GET_LEN(buf_pool.LRU), SHOW_SIZE_T},
  {"buffer_pool_bytes_data",
  &export_vars.innodb_buffer_pool_bytes_data, SHOW_SIZE_T},
  {"buffer_pool_pages_dirty",
-  &export_vars.innodb_buffer_pool_pages_dirty, SHOW_SIZE_T},
- {"buffer_pool_bytes_dirty",
-  &export_vars.innodb_buffer_pool_bytes_dirty, SHOW_SIZE_T},
- {"buffer_pool_pages_flushed", &buf_flush_page_count, SHOW_SIZE_T},
- {"buffer_pool_pages_free",
-  &export_vars.innodb_buffer_pool_pages_free, SHOW_SIZE_T},
+  &UT_LIST_GET_LEN(buf_pool.flush_list), SHOW_SIZE_T},
+ {"buffer_pool_bytes_dirty", &buf_pool.flush_list_bytes, SHOW_SIZE_T},
+ {"buffer_pool_pages_flushed", &buf_pool.stat.n_pages_written, SHOW_SIZE_T},
+ {"buffer_pool_pages_free", &UT_LIST_GET_LEN(buf_pool.free), SHOW_SIZE_T},
 #ifdef UNIV_DEBUG
  {"buffer_pool_pages_latched",
  &export_vars.innodb_buffer_pool_pages_latched, SHOW_SIZE_T},
 #endif /* UNIV_DEBUG */
  {"buffer_pool_pages_made_not_young",
-  &export_vars.innodb_buffer_pool_pages_made_not_young, SHOW_SIZE_T},
+  &buf_pool.stat.n_pages_not_made_young, SHOW_SIZE_T},
  {"buffer_pool_pages_made_young",
-  &export_vars.innodb_buffer_pool_pages_made_young, SHOW_SIZE_T},
+  &buf_pool.stat.n_pages_made_young, SHOW_SIZE_T},
  {"buffer_pool_pages_misc",
  &export_vars.innodb_buffer_pool_pages_misc, SHOW_SIZE_T},
- {"buffer_pool_pages_old",
-  &export_vars.innodb_buffer_pool_pages_old, SHOW_SIZE_T},
+ {"buffer_pool_pages_old", &buf_pool.LRU_old_len, SHOW_SIZE_T},
  {"buffer_pool_pages_total",
  &export_vars.innodb_buffer_pool_pages_total, SHOW_SIZE_T},
  {"buffer_pool_pages_LRU_flushed", &buf_lru_flush_page_count, SHOW_SIZE_T},
  {"buffer_pool_pages_LRU_freed", &buf_lru_freed_page_count, SHOW_SIZE_T},
+ {"buffer_pool_pages_split", &buf_pool.pages_split, SHOW_SIZE_T},
  {"buffer_pool_read_ahead_rnd",
-  &export_vars.innodb_buffer_pool_read_ahead_rnd, SHOW_SIZE_T},
- {"buffer_pool_read_ahead",
-  &export_vars.innodb_buffer_pool_read_ahead, SHOW_SIZE_T},
+  &buf_pool.stat.n_ra_pages_read_rnd, SHOW_SIZE_T},
+ {"buffer_pool_read_ahead", &buf_pool.stat.n_ra_pages_read, SHOW_SIZE_T},
  {"buffer_pool_read_ahead_evicted",
-  &export_vars.innodb_buffer_pool_read_ahead_evicted, SHOW_SIZE_T},
- {"buffer_pool_read_requests",
-  &export_vars.innodb_buffer_pool_read_requests, SHOW_SIZE_T},
- {"buffer_pool_reads",
-  &export_vars.innodb_buffer_pool_reads, SHOW_SIZE_T},
+  &buf_pool.stat.n_ra_pages_evicted, SHOW_SIZE_T},
+ {"buffer_pool_read_requests", &buf_pool.stat.n_page_gets, SHOW_SIZE_T},
+ {"buffer_pool_reads", &buf_pool.stat.n_pages_read, SHOW_SIZE_T},
  {"buffer_pool_wait_free", &buf_pool.stat.LRU_waits, SHOW_SIZE_T},
  {"buffer_pool_write_requests",
  &export_vars.innodb_buffer_pool_write_requests, SHOW_SIZE_T},
diff --git a/storage/innobase/include/buf0buf.h b/storage/innobase/include/buf0buf.h
index e79cbdadcd6..94f8dc2badb 100644
--- a/storage/innobase/include/buf0buf.h
+++ b/storage/innobase/include/buf0buf.h
@@ -782,11 +782,11 @@ public:
   it from buf_pool.flush_list */
   inline void write_complete(bool temporary);
 
-  /** Write a flushable page to a file. buf_pool.mutex must be held.
+  /** Write a flushable page to a file or free a freeable block.
   @param evict   whether to evict the page on write completion
   @param space   tablespace
-  @return whether the page was flushed and buf_pool.mutex was released */
-  inline bool flush(bool evict, fil_space_t *space);
+  @return whether a page write was initiated and buf_pool.mutex released */
+  bool flush(bool evict, fil_space_t *space);
 
   /** Notify that a page in a temporary tablespace has been modified.
   */
   void set_temp_modified()
@@ -856,8 +856,6 @@ public:
   /** @return whether the block is mapped to a data file */
   bool in_file() const { return state() >= FREED; }
 
-  /** @return whether the block is modified and ready for flushing */
-  inline bool ready_for_flush() const;
   /** @return whether the block can be relocated in memory.
   The block can be dirty, but it must not be I/O-fixed or bufferfixed. */
   inline bool can_relocate() const;
@@ -1030,10 +1028,10 @@ Compute the hash fold value for blocks in buf_pool.zip_hash. */
 #define BUF_POOL_ZIP_FOLD_BPAGE(b) BUF_POOL_ZIP_FOLD((buf_block_t*) (b))
 /* @} */
 
-/** A "Hazard Pointer" class used to iterate over page lists
-inside the buffer pool. A hazard pointer is a buf_page_t pointer
+/** A "Hazard Pointer" class used to iterate over buf_pool.LRU or
+buf_pool.flush_list. A hazard pointer is a buf_page_t pointer
 which we intend to iterate over next and we want it remain valid
-even after we release the buffer pool mutex. */
+even after we release the mutex that protects the list. */
 class HazardPointer
 {
 public:
@@ -1148,7 +1146,8 @@ struct buf_buddy_free_t {
 					/*!< Node of zip_free list */
 };
 
-/** @brief The buffer pool statistics structure. */
+/** @brief The buffer pool statistics structure;
+protected by buf_pool.mutex unless otherwise noted.
+*/
 struct buf_pool_stat_t{
 	/** Initialize the counters */
 	void init() { memset((void*) this, 0, sizeof *this); }
@@ -1157,9 +1156,8 @@ struct buf_pool_stat_t{
 					/*!< number of page gets performed;
 					also successful searches through
 					the adaptive hash index are
-					counted as page gets; this field
-					is NOT protected by the buffer
-					pool mutex */
+					counted as page gets;
+					NOT protected by buf_pool.mutex */
 	ulint	n_pages_read;	/*!< number read operations */
 	ulint	n_pages_written;/*!< number write operations */
 	ulint	n_pages_created;/*!< number of pages created
@@ -1177,10 +1175,9 @@ struct buf_pool_stat_t{
 					young because the first access
 					was not long enough ago, in
 					buf_page_peek_if_too_old() */
-	/** number of waits for eviction; writes protected by buf_pool.mutex */
+	/** number of waits for eviction */
 	ulint	LRU_waits;
 	ulint	LRU_bytes;	/*!< LRU size in bytes */
-	ulint	flush_list_bytes;/*!< flush_list size in bytes */
 };
 
 /** Statistics of buddy blocks of a given size. */
@@ -1501,6 +1498,11 @@ public:
 			n_chunks_new / 4 * chunks->size;
 	}
 
+  /** @return whether the buffer pool has run out */
+  TPOOL_SUPPRESS_TSAN
+  bool ran_out() const
+  { return UNIV_UNLIKELY(!try_LRU_scan || !UT_LIST_GET_LEN(free)); }
+
 	/** @return whether the buffer pool is shrinking */
 	inline bool is_shrinking() const
 	{
@@ -1538,14 +1540,10 @@ public:
   /** Buffer pool mutex */
   alignas(CPU_LEVEL1_DCACHE_LINESIZE) mysql_mutex_t mutex;
-  /** Number of pending LRU flush; protected by mutex.
-  */
-  ulint n_flush_LRU_;
-  /** broadcast when n_flush_LRU reaches 0; protected by mutex */
-  pthread_cond_t done_flush_LRU;
-  /** Number of pending flush_list flush; protected by mutex */
-  ulint n_flush_list_;
-  /** broadcast when n_flush_list reaches 0; protected by mutex */
-  pthread_cond_t done_flush_list;
+  /** current statistics; protected by mutex */
+  buf_pool_stat_t stat;
+  /** old statistics; protected by mutex */
+  buf_pool_stat_t old_stat;
 
   /** @name General fields */
   /* @{ */
@@ -1706,11 +1704,12 @@ public:
 	buf_buddy_stat_t	buddy_stat[BUF_BUDDY_SIZES_MAX + 1];
 					/*!< Statistics of buddy system,
 					indexed by block size */
-	buf_pool_stat_t		stat;		/*!< current statistics */
-	buf_pool_stat_t		old_stat;	/*!< old statistics */
 
 	/* @} */
 
+  /** number of index page splits */
+  Atomic_counter<ulint> pages_split;
+
 	/** @name Page flushing algorithm fields */
 	/* @{ */
 
@@ -1719,31 +1718,76 @@ public:
   alignas(CPU_LEVEL1_DCACHE_LINESIZE) mysql_mutex_t flush_list_mutex;
   /** "hazard pointer" for flush_list scans; protected by flush_list_mutex */
   FlushHp flush_hp;
-  /** modified blocks (a subset of LRU) */
+  /** flush_list size in bytes; protected by flush_list_mutex */
+  ulint flush_list_bytes;
+  /** possibly modified persistent pages (a subset of LRU);
+  buf_dblwr.pending_writes() is approximately COUNT(is_write_fixed()) */
   UT_LIST_BASE_NODE_T(buf_page_t) flush_list;
 private:
-  /** whether the page cleaner needs wakeup from indefinite sleep */
-  bool page_cleaner_is_idle;
+  static constexpr unsigned PAGE_CLEANER_IDLE= 1;
+  static constexpr unsigned FLUSH_LIST_ACTIVE= 2;
+  static constexpr unsigned LRU_FLUSH= 4;
+
+  /** Number of pending LRU flush * LRU_FLUSH +
+  PAGE_CLEANER_IDLE + FLUSH_LIST_ACTIVE flags */
+  unsigned page_cleaner_status;
   /** track server activity count for signaling idle flushing */
   ulint last_activity_count;
 public:
   /** signalled to wake up the page_cleaner; protected by flush_list_mutex */
   pthread_cond_t do_flush_list;
+  /** broadcast when !n_flush();
+  protected by flush_list_mutex */
+  pthread_cond_t done_flush_LRU;
+  /** broadcast when a batch completes; protected by flush_list_mutex */
+  pthread_cond_t done_flush_list;
+
+  /** @return number of pending LRU flush */
+  unsigned n_flush() const
+  {
+    mysql_mutex_assert_owner(&flush_list_mutex);
+    return page_cleaner_status / LRU_FLUSH;
+  }
+
+  /** Increment the number of pending LRU flush */
+  inline void n_flush_inc();
+
+  /** Decrement the number of pending LRU flush */
+  inline void n_flush_dec();
+
+  /** @return whether flush_list flushing is active */
+  bool flush_list_active() const
+  {
+    mysql_mutex_assert_owner(&flush_list_mutex);
+    return page_cleaner_status & FLUSH_LIST_ACTIVE;
+  }
+
+  void flush_list_set_active()
+  {
+    ut_ad(!flush_list_active());
+    page_cleaner_status+= FLUSH_LIST_ACTIVE;
+  }
+  void flush_list_set_inactive()
+  {
+    ut_ad(flush_list_active());
+    page_cleaner_status-= FLUSH_LIST_ACTIVE;
+  }
 
   /** @return whether the page cleaner must sleep due to being idle */
   bool page_cleaner_idle() const
   {
     mysql_mutex_assert_owner(&flush_list_mutex);
-    return page_cleaner_is_idle;
+    return page_cleaner_status & PAGE_CLEANER_IDLE;
   }
 
-  /** Wake up the page cleaner if needed */
-  void page_cleaner_wakeup();
+  /** Wake up the page cleaner if needed.
+  @param for_LRU  whether to wake up for LRU eviction */
+  void page_cleaner_wakeup(bool for_LRU= false);
 
   /** Register whether an explicit wakeup of the page cleaner is needed */
   void page_cleaner_set_idle(bool deep_sleep)
   {
     mysql_mutex_assert_owner(&flush_list_mutex);
-    page_cleaner_is_idle= deep_sleep;
+    page_cleaner_status= (page_cleaner_status & ~PAGE_CLEANER_IDLE) |
+      (PAGE_CLEANER_IDLE * deep_sleep);
   }
 
   /** Update server last activity count */
@@ -1753,9 +1797,6 @@ public:
     last_activity_count= activity_count;
   }
 
-  // n_flush_LRU_ + n_flush_list_
-  // is approximately COUNT(is_write_fixed()) in flush_list
-
 	unsigned	freed_page_clock;/*!< a sequence number used
 					to count the number of buffer
 					blocks removed from the end of
@@ -1765,16 +1806,10 @@ public:
 					to read this for heuristic
 					purposes without holding any
 					mutex or latch */
-	bool		try_LRU_scan;	/*!< Cleared when an LRU
-					scan for free block fails. This
-					flag is used to avoid repeated
-					scans of LRU list when we know
-					that there is no free block
-					available in the scan depth for
-					eviction. Set whenever
-					we flush a batch from the
-					buffer pool. Protected by the
-					buf_pool.mutex */
+  /** Cleared when buf_LRU_get_free_block() fails.
+  Set whenever the free list grows, along with a broadcast of done_free.
+  Protected by buf_pool.mutex.
+  */
+  Atomic_relaxed<bool> try_LRU_scan;
 	/* @} */
 
 	/** @name LRU replacement algorithm fields */
@@ -1783,8 +1818,8 @@ public:
 
 	UT_LIST_BASE_NODE_T(buf_page_t) free;
 					/*!< base node of the free
 					block list */
-  /** signaled each time when the free list grows and
-  broadcast each time try_LRU_scan is set; protected by mutex */
+  /** broadcast each time when the free list grows or try_LRU_scan is set;
+  protected by mutex */
   pthread_cond_t done_free;
 
 	UT_LIST_BASE_NODE_T(buf_page_t) withdraw;
@@ -1844,29 +1879,20 @@ public:
   {
     if (n_pend_reads)
       return true;
-    mysql_mutex_lock(&mutex);
-    const bool any_pending{n_flush_LRU_ || n_flush_list_};
-    mysql_mutex_unlock(&mutex);
+    mysql_mutex_lock(&flush_list_mutex);
+    const bool any_pending= page_cleaner_status > PAGE_CLEANER_IDLE ||
+      buf_dblwr.pending_writes();
+    mysql_mutex_unlock(&flush_list_mutex);
     return any_pending;
   }
-  /** @return total amount of pending I/O */
-  TPOOL_SUPPRESS_TSAN ulint io_pending() const
-  {
-    return n_pend_reads + n_flush_LRU_ + n_flush_list_;
-  }
 
 private:
   /** Remove a block from the flush list. */
   inline void delete_from_flush_list_low(buf_page_t *bpage);
-  /** Remove a block from flush_list.
-  @param bpage   buffer pool page
-  @param clear   whether to invoke buf_page_t::clear_oldest_modification() */
-  void delete_from_flush_list(buf_page_t *bpage, bool clear);
 public:
   /** Remove a block from flush_list.
   @param bpage   buffer pool page */
-  void delete_from_flush_list(buf_page_t *bpage)
-  { delete_from_flush_list(bpage, true); }
+  void delete_from_flush_list(buf_page_t *bpage);
 
   /** Insert a modified block into the flush list.
   @param block   modified block
@@ -1874,7 +1900,7 @@ public:
   void insert_into_flush_list(buf_block_t *block, lsn_t lsn);
 
   /** Free a page whose underlying file page has been freed.
   */
-  inline void release_freed_page(buf_page_t *bpage);
+  ATTRIBUTE_COLD void release_freed_page(buf_page_t *bpage);
 
 private:
   /** Temporary memory for page_compressed and encrypted I/O */
@@ -1994,17 +2020,6 @@ inline void buf_page_t::clear_oldest_modification()
   oldest_modification_.store(0, std::memory_order_release);
 }
 
-/** @return whether the block is modified and ready for flushing */
-inline bool buf_page_t::ready_for_flush() const
-{
-  mysql_mutex_assert_owner(&buf_pool.mutex);
-  ut_ad(in_LRU_list);
-  const auto s= state();
-  ut_a(s >= FREED);
-  ut_ad(!fsp_is_system_temporary(id().space()) || oldest_modification() == 2);
-  return s < READ_FIX;
-}
-
 /** @return whether the block can be relocated in memory.
 The block can be dirty, but it must not be I/O-fixed or bufferfixed. */
 inline bool buf_page_t::can_relocate() const
diff --git a/storage/innobase/include/buf0dblwr.h b/storage/innobase/include/buf0dblwr.h
index fb9df55504c..d9c9239c0b4 100644
--- a/storage/innobase/include/buf0dblwr.h
+++ b/storage/innobase/include/buf0dblwr.h
@@ -1,7 +1,7 @@
 /*****************************************************************************
 
 Copyright (c) 1995, 2017, Oracle and/or its affiliates. All Rights Reserved.
-Copyright (c) 2017, 2020, MariaDB Corporation.
+Copyright (c) 2017, 2022, MariaDB Corporation.
 This program is free software; you can redistribute it and/or modify it under
 the terms of the GNU General Public License as published by the Free Software
@@ -54,9 +54,9 @@ class buf_dblwr_t
   };
 
   /** the page number of the first doublewrite block (block_size() pages) */
-  page_id_t block1= page_id_t(0, 0);
+  page_id_t block1{0, 0};
   /** the page number of the second doublewrite block (block_size() pages) */
-  page_id_t block2= page_id_t(0, 0);
+  page_id_t block2{0, 0};
 
   /** mutex protecting the data members below */
   mysql_mutex_t mutex;
@@ -72,11 +72,15 @@ class buf_dblwr_t
   ulint writes_completed;
   /** number of pages written by flush_buffered_writes_completed() */
   ulint pages_written;
+  /** condition variable for !writes_pending */
+  pthread_cond_t write_cond;
+  /** number of pending page writes */
+  size_t writes_pending;
 
   slot slots[2];
-  slot *active_slot= &slots[0];
+  slot *active_slot;
 
-  /** Initialize the doublewrite buffer data structure.
+  /** Initialise the persistent storage of the doublewrite buffer.
   @param header   doublewrite page header in the TRX_SYS page */
   inline void init(const byte *header);
 
@@ -84,6 +88,8 @@ class buf_dblwr_t
   bool flush_buffered_writes(const ulint size);
 
 public:
+  /** Initialise the doublewrite buffer data structures. */
+  void init();
   /** Create or restore the doublewrite buffer in the TRX_SYS page.
   @return whether the operation succeeded */
   bool create();
@@ -118,7 +124,7 @@ public:
   void recover();
 
   /** Update the doublewrite buffer on data page write completion. */
-  void write_completed();
+  void write_completed(bool with_doublewrite);
   /** Flush possible buffered writes to persistent storage.
   It is very important to call this function after a batch of writes has been
   posted, and also when we may have to wait for a page latch!
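The buf_dblwr_t hunks above introduce a pending-page-write counter (writes_pending) paired with a condition variable (write_cond): add_unbuffered() increments under the mutex, write_completed() decrements, and wait_for_page_writes() blocks until the count reaches zero. The following is a minimal standalone sketch of that same pattern, with std::mutex and std::condition_variable standing in for mysql_mutex_t and pthread_cond_t; the class and member names are illustrative, not InnoDB's actual code.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Simplified model of the writes_pending/write_cond mechanism added to
// buf_dblwr_t: a mutex-protected counter plus a condition variable that
// is only signalled when the counter drops to zero.
class pending_writes_tracker
{
  std::mutex m;                  // plays the role of buf_dblwr_t::mutex
  std::condition_variable cond;  // plays the role of write_cond
  size_t pending= 0;             // plays the role of writes_pending

public:
  // Counterpart of add_unbuffered(): register one page write.
  void add()
  {
    std::lock_guard<std::mutex> g{m};
    ++pending;
  }

  // Counterpart of write_completed(): one write finished; wake waiters
  // only when nothing remains pending.
  void complete()
  {
    std::lock_guard<std::mutex> g{m};
    if (!--pending)
      cond.notify_all();
  }

  // Counterpart of wait_for_page_writes(): block until the counter
  // reaches zero (returns immediately if it already is zero).
  void wait_idle()
  {
    std::unique_lock<std::mutex> lk{m};
    cond.wait(lk, [this] { return pending == 0; });
  }

  // Counterpart of pending_writes(): read the counter under the mutex.
  size_t count()
  {
    std::lock_guard<std::mutex> g{m};
    return pending;
  }
};
```

In this design the broadcast happens only on the 1→0 transition, so a shutdown or checkpoint thread can sleep on the condition variable instead of polling a per-flush counter, which is the point of replacing the old n_flush_LRU_/n_flush_list_ bookkeeping with buf_dblwr.wait_for_page_writes().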
@@ -137,14 +143,14 @@ public:
   @param size    payload size in bytes */
   void add_to_batch(const IORequest &request, size_t size);
 
-  /** Determine whether the doublewrite buffer is initialized */
-  bool is_initialised() const
+  /** Determine whether the doublewrite buffer has been created */
+  bool is_created() const
   { return UNIV_LIKELY(block1 != page_id_t(0, 0)); }
 
   /** @return whether a page identifier is part of the doublewrite buffer */
   bool is_inside(const page_id_t id) const
   {
-    if (!is_initialised())
+    if (!is_created())
       return false;
     ut_ad(block1 < block2);
     if (id < block1)
@@ -156,13 +162,44 @@ public:
   /** Wait for flush_buffered_writes() to be fully completed */
   void wait_flush_buffered_writes()
   {
-    if (is_initialised())
-    {
-      mysql_mutex_lock(&mutex);
-      while (batch_running)
-        my_cond_wait(&cond, &mutex.m_mutex);
-      mysql_mutex_unlock(&mutex);
-    }
+    mysql_mutex_lock(&mutex);
+    while (batch_running)
+      my_cond_wait(&cond, &mutex.m_mutex);
+    mysql_mutex_unlock(&mutex);
+  }
+
+  /** Register an unbuffered page write */
+  void add_unbuffered()
+  {
+    mysql_mutex_lock(&mutex);
+    writes_pending++;
+    mysql_mutex_unlock(&mutex);
+  }
+
+  size_t pending_writes()
+  {
+    mysql_mutex_lock(&mutex);
+    const size_t pending{writes_pending};
+    mysql_mutex_unlock(&mutex);
+    return pending;
+  }
+
+  /** Wait for writes_pending to reach 0 */
+  void wait_for_page_writes()
+  {
+    mysql_mutex_lock(&mutex);
+    while (writes_pending)
+      my_cond_wait(&write_cond, &mutex.m_mutex);
+    mysql_mutex_unlock(&mutex);
+  }
+
+  /** Wait for writes_pending to reach 0 */
+  void wait_for_page_writes(const timespec &abstime)
+  {
+    mysql_mutex_lock(&mutex);
+    while (writes_pending)
+      my_cond_timedwait(&write_cond, &mutex.m_mutex, &abstime);
+    mysql_mutex_unlock(&mutex);
   }
 };
 
diff --git a/storage/innobase/include/buf0flu.h b/storage/innobase/include/buf0flu.h
index d71a05c0ec9..13a9363922b 100644
--- a/storage/innobase/include/buf0flu.h
+++ b/storage/innobase/include/buf0flu.h
@@ -30,10 +30,8 @@ Created 11/5/1995 Heikki Tuuri
 #include "log0log.h"
 #include "buf0buf.h"
 
-/** Number of pages flushed. Protected by buf_pool.mutex. */
-extern ulint buf_flush_page_count;
 /** Number of pages flushed via LRU. Protected by buf_pool.mutex.
-Also included in buf_flush_page_count. */
+Also included in buf_pool.stat.n_pages_written. */
 extern ulint buf_lru_flush_page_count;
 
 /** Number of pages freed without flushing. Protected by buf_pool.mutex. */
 extern ulint buf_lru_freed_page_count;
 
@@ -96,9 +94,8 @@ after releasing buf_pool.mutex.
 @retval 0 if a buf_pool.LRU batch is already running */
 ulint buf_flush_LRU(ulint max_n, bool evict);
 
-/** Wait until a flush batch ends.
-@param lru    true=buf_pool.LRU; false=buf_pool.flush_list */
-void buf_flush_wait_batch_end(bool lru);
+/** Wait until a LRU flush batch ends. */
+void buf_flush_wait_LRU_batch_end();
 
 /** Wait until all persistent pages are flushed up to a limit.
 @param sync_lsn   buf_pool.get_oldest_modification(LSN_MAX) to wait for */
 ATTRIBUTE_COLD void buf_flush_wait_flushed(lsn_t sync_lsn);
 
diff --git a/storage/innobase/include/buf0rea.h b/storage/innobase/include/buf0rea.h
index 8d6b28194dc..d898c5efc63 100644
--- a/storage/innobase/include/buf0rea.h
+++ b/storage/innobase/include/buf0rea.h
@@ -33,10 +33,11 @@ Created 11/5/1995 Heikki Tuuri
 buffer buf_pool if it is not already there. Sets the io_fix flag and sets
 an exclusive lock on the buffer frame. The flag is cleared and the x-lock
 released by the i/o-handler thread.
-@param[in]	page_id		page id
-@param[in]	zip_size	ROW_FORMAT=COMPRESSED page size, or 0
-@retval DB_SUCCESS if the page was read and is not corrupted,
-@retval DB_PAGE_CORRUPTED if page based on checksum check is corrupted,
+@param page_id    page id
+@param zip_size   ROW_FORMAT=COMPRESSED page size, or 0
+@retval DB_SUCCESS if the page was read and is not corrupted
+@retval DB_SUCCESS_LOCKED_REC if the page was not read
+@retval DB_PAGE_CORRUPTED if page based on checksum check is corrupted
 @retval DB_DECRYPTION_FAILED if page post encryption checksum matches but
 after decryption normal page checksum does not match.
 @retval DB_TABLESPACE_DELETED if tablespace .ibd file is missing */
diff --git a/storage/innobase/include/fil0fil.h b/storage/innobase/include/fil0fil.h
index 533f595c852..ff6ece8a360 100644
--- a/storage/innobase/include/fil0fil.h
+++ b/storage/innobase/include/fil0fil.h
@@ -1170,7 +1170,7 @@ private:
 inline bool fil_space_t::use_doublewrite() const
 {
   return !UT_LIST_GET_FIRST(chain)->atomic_write && srv_use_doublewrite_buf &&
-    buf_dblwr.is_initialised();
+    buf_dblwr.is_created();
 }
 
 inline void fil_space_t::set_imported()
diff --git a/storage/innobase/include/srv0srv.h b/storage/innobase/include/srv0srv.h
index 9807d9cd9a4..90d3a21f761 100644
--- a/storage/innobase/include/srv0srv.h
+++ b/storage/innobase/include/srv0srv.h
@@ -108,10 +108,6 @@ struct srv_stats_t
 	/** Store the number of write requests issued */
 	ulint_ctr_1_t		buf_pool_write_requests;
 
-	/** Number of buffer pool reads that led to the reading of
-	a disk page */
-	ulint_ctr_1_t		buf_pool_reads;
-
 	/** Number of bytes saved by page compression */
 	ulint_ctr_n_t		page_compression_saved;
 	/* Number of pages compressed with page compression */
@@ -670,24 +666,12 @@ struct export_var_t{
 	char innodb_buffer_pool_resize_status[512];/*!< Buf pool resize status */
 	my_bool innodb_buffer_pool_load_incomplete;/*!< Buf pool load incomplete */
 	ulint innodb_buffer_pool_pages_total;	/*!< Buffer pool size */
-	ulint innodb_buffer_pool_pages_data;	/*!< Data pages */
 	ulint innodb_buffer_pool_bytes_data;	/*!< File bytes used */
-	ulint innodb_buffer_pool_pages_dirty;	/*!< Dirty data pages */
-	ulint innodb_buffer_pool_bytes_dirty;	/*!< File bytes modified */
 	ulint innodb_buffer_pool_pages_misc;	/*!< Miscellanous pages */
-	ulint innodb_buffer_pool_pages_free;	/*!< Free pages */
 #ifdef UNIV_DEBUG
 	ulint innodb_buffer_pool_pages_latched;	/*!< Latched pages */
 #endif /* UNIV_DEBUG */
-	ulint innodb_buffer_pool_pages_made_not_young;
-	ulint innodb_buffer_pool_pages_made_young;
-	ulint innodb_buffer_pool_pages_old;
-	ulint innodb_buffer_pool_read_requests;	/*!< buf_pool.stat.n_page_gets */
-	ulint innodb_buffer_pool_reads;		/*!< srv_buf_pool_reads */
 	ulint innodb_buffer_pool_write_requests;/*!< srv_stats.buf_pool_write_requests */
-	ulint innodb_buffer_pool_read_ahead_rnd;/*!< srv_read_ahead_rnd */
-	ulint innodb_buffer_pool_read_ahead;	/*!< srv_read_ahead */
-	ulint innodb_buffer_pool_read_ahead_evicted;/*!< srv_read_ahead evicted*/
 	ulint innodb_checkpoint_age;
 	ulint innodb_checkpoint_max_age;
 	ulint innodb_data_pending_reads;	/*!< Pending reads */
diff --git a/storage/innobase/log/log0log.cc b/storage/innobase/log/log0log.cc
index 70f561280d9..c53e2fd5074 100644
--- a/storage/innobase/log/log0log.cc
+++ b/storage/innobase/log/log0log.cc
@@ -1173,14 +1173,6 @@ wait_suspend_loop:
 
 	if (!buf_pool.is_initialised()) {
 		ut_ad(!srv_was_started);
-	} else if (ulint pending_io = buf_pool.io_pending()) {
-		if (srv_print_verbose_log && count > 600) {
-			ib::info() << "Waiting for " << pending_io << " buffer"
-				" page I/Os to complete";
-			count = 0;
-		}
-
-		goto loop;
 	} else {
 		buf_flush_buffer_pool();
 	}
diff --git a/storage/innobase/srv/srv0mon.cc b/storage/innobase/srv/srv0mon.cc
index 60fef24d183..b6496d03908 100644
--- a/storage/innobase/srv/srv0mon.cc
+++ b/storage/innobase/srv/srv0mon.cc
@@ -909,7 +909,7 @@ static monitor_info_t innodb_counter_info[] =
 	MONITOR_DEFAULT_START,
 	MONITOR_MODULE_INDEX},
 
 	{"index_page_splits", "index",
 	 "Number of index page splits",
-	 MONITOR_NONE,
+	 MONITOR_EXISTING,
 	 MONITOR_DEFAULT_START, MONITOR_INDEX_SPLIT},
 
 	{"index_page_merge_attempts", "index",
@@ -1411,10 +1411,12 @@ srv_mon_process_existing_counter(
 	/* Get the value from corresponding global variable */
 	switch (monitor_id) {
-	/* export_vars.innodb_buffer_pool_reads. Num Reads from
-	disk (page not in buffer) */
+	case MONITOR_INDEX_SPLIT:
+		value = buf_pool.pages_split;
+		break;
+
 	case MONITOR_OVLD_BUF_POOL_READS:
-		value = srv_stats.buf_pool_reads;
+		value = buf_pool.stat.n_pages_read;
 		break;
 
 	/* innodb_buffer_pool_read_requests, the number of logical
@@ -1475,7 +1477,7 @@ srv_mon_process_existing_counter(
 	/* innodb_buffer_pool_bytes_dirty */
 	case MONITOR_OVLD_BUF_POOL_BYTES_DIRTY:
-		value = buf_pool.stat.flush_list_bytes;
+		value = buf_pool.flush_list_bytes;
 		break;
 
 	/* innodb_buffer_pool_pages_free */
diff --git a/storage/innobase/srv/srv0srv.cc b/storage/innobase/srv/srv0srv.cc
index c16868b5cf5..2e9f5a0eff8 100644
--- a/storage/innobase/srv/srv0srv.cc
+++ b/storage/innobase/srv/srv0srv.cc
@@ -675,6 +675,7 @@ void srv_boot()
   if (transactional_lock_enabled())
     sql_print_information("InnoDB: Using transactional memory");
 #endif
+  buf_dblwr.init();
   srv_thread_pool_init();
   trx_pool_init();
   srv_init();
@@ -1001,59 +1002,22 @@ srv_export_innodb_status(void)
 
 	export_vars.innodb_data_writes = os_n_file_writes;
 
-	ulint dblwr = 0;
-
-	if (buf_dblwr.is_initialised()) {
-		buf_dblwr.lock();
-		dblwr = buf_dblwr.submitted();
-		export_vars.innodb_dblwr_pages_written = buf_dblwr.written();
-		export_vars.innodb_dblwr_writes = buf_dblwr.batches();
-		buf_dblwr.unlock();
-	}
+	buf_dblwr.lock();
+	ulint dblwr = buf_dblwr.submitted();
+	export_vars.innodb_dblwr_pages_written = buf_dblwr.written();
+	export_vars.innodb_dblwr_writes = buf_dblwr.batches();
+	buf_dblwr.unlock();
 
 	export_vars.innodb_data_written = srv_stats.data_written + dblwr;
 
-	export_vars.innodb_buffer_pool_read_requests
-		= buf_pool.stat.n_page_gets;
-
 	export_vars.innodb_buffer_pool_write_requests =
 		srv_stats.buf_pool_write_requests;
 
-	export_vars.innodb_buffer_pool_reads = srv_stats.buf_pool_reads;
-
-	export_vars.innodb_buffer_pool_read_ahead_rnd =
-		buf_pool.stat.n_ra_pages_read_rnd;
-
-	export_vars.innodb_buffer_pool_read_ahead =
-		buf_pool.stat.n_ra_pages_read;
-
-	export_vars.innodb_buffer_pool_read_ahead_evicted =
-		buf_pool.stat.n_ra_pages_evicted;
-
-	export_vars.innodb_buffer_pool_pages_data =
-		UT_LIST_GET_LEN(buf_pool.LRU);
-
 	export_vars.innodb_buffer_pool_bytes_data =
 		buf_pool.stat.LRU_bytes
 		+ (UT_LIST_GET_LEN(buf_pool.unzip_LRU)
 		   << srv_page_size_shift);
 
-	export_vars.innodb_buffer_pool_pages_dirty =
-		UT_LIST_GET_LEN(buf_pool.flush_list);
-
-	export_vars.innodb_buffer_pool_pages_made_young
-		= buf_pool.stat.n_pages_made_young;
-	export_vars.innodb_buffer_pool_pages_made_not_young
-		= buf_pool.stat.n_pages_not_made_young;
-
-	export_vars.innodb_buffer_pool_pages_old = buf_pool.LRU_old_len;
-
-	export_vars.innodb_buffer_pool_bytes_dirty =
-		buf_pool.stat.flush_list_bytes;
-
-	export_vars.innodb_buffer_pool_pages_free =
-		UT_LIST_GET_LEN(buf_pool.free);
-
 #ifdef UNIV_DEBUG
 	export_vars.innodb_buffer_pool_pages_latched =
 		buf_get_latched_pages_number();
diff --git a/storage/innobase/srv/srv0start.cc b/storage/innobase/srv/srv0start.cc
index b0adc15300c..a881ae0ad6a 100644
--- a/storage/innobase/srv/srv0start.cc
+++ b/storage/innobase/srv/srv0start.cc
@@ -1997,7 +1997,7 @@ void innodb_shutdown()
 	ut_ad(dict_sys.is_initialised() || !srv_was_started);
 	ut_ad(trx_sys.is_initialised() || !srv_was_started);
-	ut_ad(buf_dblwr.is_initialised() || !srv_was_started
+	ut_ad(buf_dblwr.is_created() || !srv_was_started
 	      || srv_read_only_mode
 	      || srv_force_recovery >= SRV_FORCE_NO_TRX_UNDO);
 	ut_ad(lock_sys.is_initialised() || !srv_was_started);
diff --git a/storage/rocksdb/mysql-test/rocksdb/r/innodb_i_s_tables_disabled.result b/storage/rocksdb/mysql-test/rocksdb/r/innodb_i_s_tables_disabled.result
index 1b3b43c0304..d3f0ee3bcd9 100644
--- a/storage/rocksdb/mysql-test/rocksdb/r/innodb_i_s_tables_disabled.result
+++ b/storage/rocksdb/mysql-test/rocksdb/r/innodb_i_s_tables_disabled.result
@@ -181,7 +181,7 @@ compress_pages_page_decompressed compression 0 NULL NULL NULL 0 NULL NULL NULL N
 compress_pages_page_compression_error compression 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Number of page compression errors
 compress_pages_encrypted compression 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Number of pages encrypted
 compress_pages_decrypted compression 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Number of pages decrypted
-index_page_splits index 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Number of index page splits
+index_page_splits index 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 status_counter Number of index page splits
 index_page_merge_attempts index 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Number of index page merge attempts
 index_page_merge_successful index 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Number of successful index page merges
 index_page_reorg_attempts index 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Number of index page reorganization attempts
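The most compact change in this commit is the buf_pool_t::page_cleaner_status word, which replaces n_flush_LRU_, n_flush_list_, and page_cleaner_is_idle: the low two bits carry the PAGE_CLEANER_IDLE and FLUSH_LIST_ACTIVE flags, and every pending LRU flush adds LRU_FLUSH (4), so the count is recovered by integer division. The sketch below models only that encoding arithmetic; the constant and accessor names mirror the buf0buf.h hunks, but it is a standalone illustration (no mutex, no buffer pool), not the InnoDB code itself.

```cpp
#include <cassert>

// Flag bits and the per-LRU-flush increment, as declared in the
// buf0buf.h hunk above.
static constexpr unsigned PAGE_CLEANER_IDLE= 1;
static constexpr unsigned FLUSH_LIST_ACTIVE= 2;
static constexpr unsigned LRU_FLUSH= 4;

// Standalone model of the packed status word. In InnoDB the real field
// is protected by buf_pool.flush_list_mutex; that locking is omitted
// here to keep the arithmetic visible.
struct page_cleaner_state
{
  unsigned page_cleaner_status= 0;

  // Count of pending LRU flushes: everything above the two flag bits.
  unsigned n_flush() const { return page_cleaner_status / LRU_FLUSH; }

  bool page_cleaner_idle() const
  { return page_cleaner_status & PAGE_CLEANER_IDLE; }

  bool flush_list_active() const
  { return page_cleaner_status & FLUSH_LIST_ACTIVE; }

  // Adding or subtracting LRU_FLUSH never disturbs the flag bits.
  void n_flush_inc() { page_cleaner_status+= LRU_FLUSH; }
  void n_flush_dec() { page_cleaner_status-= LRU_FLUSH; }

  // Counterpart of page_cleaner_set_idle(): rewrite only the idle bit.
  void set_idle(bool idle)
  {
    page_cleaner_status= (page_cleaner_status & ~PAGE_CLEANER_IDLE) |
      (PAGE_CLEANER_IDLE * idle);
  }
};
```

Packing the count and flags into one word means a single comparison such as page_cleaner_status > PAGE_CLEANER_IDLE can answer "is any flushing pending?" under one mutex, which is how the rewritten any_io_pending() check in buf0buf.h uses it.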