author		Marko Mäkelä <marko.makela@mariadb.com>	2020-10-15 12:10:42 +0300
committer	Marko Mäkelä <marko.makela@mariadb.com>	2020-10-15 17:04:56 +0300
commit		7cffb5f6e8a231a041152447be8980ce35d2c9b8 (patch)
tree		2f6c6bc2a71e4d5235f6fe06b6cb1c34507f48ef
parent		46b1f500983d45e89dc84bb9820023bd51a4cda8 (diff)
download	mariadb-git-7cffb5f6e8a231a041152447be8980ce35d2c9b8.tar.gz
MDEV-23399: Performance regression with write workloads
The buffer pool refactoring in MDEV-15053 and MDEV-22871 shifted
the performance bottleneck to the page flushing.
The configuration parameters will be changed as follows:
innodb_lru_flush_size=32 (new: how many pages to flush on LRU eviction)
innodb_lru_scan_depth=1536 (old: 1024)
innodb_max_dirty_pages_pct=90 (old: 75)
innodb_max_dirty_pages_pct_lwm=75 (old: 0)
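For reference, a deployment that wants to retain the previous flushing behaviour can set the old values explicitly. A minimal, illustrative my.cnf fragment using the "old" values listed above:

```ini
[mysqld]
# Restore the pre-MDEV-23399 defaults explicitly (illustrative).
innodb_lru_scan_depth=1024
innodb_max_dirty_pages_pct=75
innodb_max_dirty_pages_pct_lwm=0
```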
Note: The parameter innodb_lru_scan_depth will only affect LRU
eviction of buffer pool pages when a new page is being allocated. The
page cleaner thread will no longer evict any pages. It used to
guarantee that some pages will remain free in the buffer pool. Now, we
perform that eviction 'on demand' in buf_LRU_get_free_block().
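The on-demand eviction can be sketched as follows. This is an illustrative stand-in for the logic of buf_LRU_get_free_block(), not the actual InnoDB code: scan at most innodb_lru_scan_depth blocks from the tail of the LRU list and evict the first clean one; if none is found, the caller would initiate an LRU flush batch.

```cpp
#include <list>
#include <optional>

// Hypothetical page descriptor; the real code uses buf_page_t.
struct Page { int id; bool dirty; };

// Scan up to scan_depth pages from the LRU tail (least recently used
// end) and evict the first clean page found. Returns the evicted page,
// or nullopt if every scanned page was dirty, in which case the caller
// would start an LRU flush batch and retry.
std::optional<Page> evict_one(std::list<Page>& lru, size_t scan_depth)
{
  size_t scanned = 0;
  for (auto it = lru.rbegin(); it != lru.rend(); ++it)
  {
    if (++scanned > scan_depth)
      break;
    if (!it->dirty)
    {
      Page victim = *it;
      // Convert the reverse_iterator to the forward iterator of the
      // same element before erasing.
      lru.erase(std::next(it).base());
      return victim;
    }
  }
  return std::nullopt;
}
```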
The parameter innodb_lru_scan_depth (srv_LRU_scan_depth) is used as follows:
* When the buffer pool is being shrunk in buf_pool_t::withdraw_blocks()
* As a buf_pool.free limit in buf_LRU_list_batch() for terminating
the flushing that is initiated e.g., by buf_LRU_get_free_block()
The parameter also used to serve as an initial limit for unzip_LRU
eviction (evicting uncompressed page frames while retaining
ROW_FORMAT=COMPRESSED pages), but now we will use a hard-coded limit
of 100 or unlimited for invoking buf_LRU_scan_and_free_block().
The status variables will be changed as follows:
innodb_buffer_pool_pages_flushed: This also includes the count of
innodb_buffer_pool_pages_LRU_flushed and should work reliably,
being updated one page at a time in buf_flush_page() to give more
real-time statistics. The function buf_flush_stats(), which we are removing,
was not called in every code path. For both counters, we will use
regular variables that are incremented in a critical section of
buf_pool.mutex. Note that show_innodb_vars() directly links to the
variables, and reads of the counters will *not* be protected by
buf_pool.mutex, so you cannot get a consistent snapshot of both variables.
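A minimal sketch of the counter discipline described above, with hypothetical names (the real counters live in buf_pool_t and are exposed via show_innodb_vars()): writers update both counters inside one buf_pool.mutex critical section, while readers may observe them without the mutex, so the pair need not be mutually consistent.

```cpp
#include <cstdint>
#include <mutex>

// Illustrative model, not the real buf0flu.cc code.
struct buf_pool_stats
{
  std::mutex mutex;               // stands in for buf_pool.mutex
  uint64_t pages_flushed = 0;     // innodb_buffer_pool_pages_flushed
  uint64_t pages_LRU_flushed = 0; // innodb_buffer_pool_pages_LRU_flushed

  // Called once per written page, as buf_flush_page() would do.
  void page_flushed(bool lru_flush)
  {
    std::lock_guard<std::mutex> g(mutex);
    ++pages_flushed;              // includes LRU flushes
    if (lru_flush)
      ++pages_LRU_flushed;
  }
};
```

Readers such as SHOW STATUS would load the two fields without taking `mutex`; each read is of one machine word, but the two values together do not form an atomic snapshot.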
The following INFORMATION_SCHEMA.INNODB_METRICS counters will be
removed, because the page cleaner no longer deals with writing or
evicting least recently used pages, and because the single-page writes
have been removed:
* buffer_LRU_batch_flush_avg_time_slot
* buffer_LRU_batch_flush_avg_time_thread
* buffer_LRU_batch_flush_avg_time_est
* buffer_LRU_batch_flush_avg_pass
* buffer_LRU_single_flush_scanned
* buffer_LRU_single_flush_num_scan
* buffer_LRU_single_flush_scanned_per_call
When moving to a single buffer pool instance in MDEV-15058, we missed
an opportunity to simplify the buf_flush_page_cleaner thread. It was
unnecessarily using a mutex and some complex data structures, even
though we always have a single page cleaner thread.
Furthermore, the buf_flush_page_cleaner thread had separate 'recovery'
and 'shutdown' modes where it was waiting to be triggered by some
other thread, adding unnecessary latency and potential for hangs in
relatively rarely executed startup or shutdown code.
The page cleaner was also running two kinds of batches in an
interleaved fashion: "LRU flush" (writing out some least recently used
pages and evicting them on write completion) and the normal batches
that aim to increase the MIN(oldest_modification) in the buffer pool,
to help the log checkpoint advance.
The buf_pool.flush_list flushing was being blocked by
buf_block_t::lock for no good reason. Furthermore, if the FIL_PAGE_LSN
of a page is ahead of log_sys.get_flushed_lsn(), that is, what has
been persistently written to the redo log, we would trigger a log
flush and then resume the page flushing. This would unnecessarily
limit the performance of the page cleaner thread and trigger the
infamous messages "InnoDB: page_cleaner: 1000ms intended loop took 4450ms.
The settings might not be optimal" that were suppressed in
commit d1ab89037a518fcffbc50c24e4bd94e4ec33aed0 unless log_warnings>2.
Our revised algorithm will make log_sys.get_flushed_lsn() advance at
the start of buf_flush_lists(), and then execute a 'best effort' to
write out all pages. The flush batches will skip pages that were modified
since the log was written, or are currently exclusively locked.
The MDEV-13670 message "page_cleaner: 1000ms intended loop took"
will be removed, because by design, buf_flush_page_cleaner() should
not be blocked during a batch for extended periods of time.
We will remove the single-page flushing altogether. Related to this,
the debug parameter innodb_doublewrite_batch_size will be removed,
because all of the doublewrite buffer will be used for flushing
batches. If a page needs to be evicted from the buffer pool and all
100 least recently used pages in the buffer pool have unflushed
changes, buf_LRU_get_free_block() will execute buf_flush_lists() to
write out and evict innodb_lru_flush_size pages. At most one thread
will execute buf_flush_lists() in buf_LRU_get_free_block(); other
threads will wait for that LRU flushing batch to finish.
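The "at most one flusher, others wait" protocol can be sketched with a standard mutex and condition variable. This is an illustrative model of the behaviour described above, not the actual buf_LRU_get_free_block() code:

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical coordinator; member names loosely mirror buf_pool_t.
struct flush_coordinator
{
  std::mutex mutex;                    // stands in for buf_pool.mutex
  std::condition_variable done_flush_LRU;
  bool lru_flush_active = false;

  // Returns true if this thread performed the flush itself; false if it
  // merely waited for another thread's LRU flush batch to finish.
  template <typename Flush>
  bool flush_or_wait(Flush do_flush)
  {
    std::unique_lock<std::mutex> lk(mutex);
    if (lru_flush_active)
    {
      // Another thread is flushing; block until it completes.
      done_flush_LRU.wait(lk, [this] { return !lru_flush_active; });
      return false;
    }
    lru_flush_active = true;
    lk.unlock();
    do_flush();                        // e.g. flush innodb_lru_flush_size pages
    lk.lock();
    lru_flush_active = false;
    done_flush_LRU.notify_all();
    return true;
  }
};
```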
To improve concurrency, we will replace the InnoDB ib_mutex_t and
os_event_t native mutexes and condition variables in this area of code.
Most notably, this means that the buffer pool mutex (buf_pool.mutex)
is no longer instrumented via any InnoDB interfaces. It will continue
to be instrumented via PERFORMANCE_SCHEMA.
For now, both buf_pool.flush_list_mutex and buf_pool.mutex will be
declared with MY_MUTEX_INIT_FAST (PTHREAD_MUTEX_ADAPTIVE_NP). The critical
sections of buf_pool.flush_list_mutex should be shorter than those for
buf_pool.mutex, because in the worst case, they cover a linear scan of
buf_pool.flush_list, while the worst case of a critical section of
buf_pool.mutex covers a linear scan of the potentially much longer
buf_pool.LRU list.
mysql_mutex_is_owner(), safe_mutex_is_owner(): New predicate, usable
with SAFE_MUTEX. Some InnoDB debug assertions need this predicate
instead of mysql_mutex_assert_owner() or mysql_mutex_assert_not_owner().
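The value of a boolean ownership predicate, as opposed to an assert-only macro, is that it can appear inside larger debug expressions, e.g. ut_ad(some_condition || mutex_is_owner()). A hypothetical C++ model of such an ownership-tracking mutex (the real implementation is the SAFE_MUTEX macro layer shown in the diff below):

```cpp
#include <mutex>
#include <thread>

// Illustrative debug mutex that records its owning thread, so that
// ownership can be queried as a plain boolean.
class debug_mutex
{
  std::mutex m;
  std::thread::id owner{};  // default id: "no owner"
public:
  void lock()
  {
    m.lock();
    owner = std::this_thread::get_id();
  }
  void unlock()
  {
    owner = std::thread::id{};
    m.unlock();
  }
  // Usable in arbitrary debug expressions, unlike an assert-only macro.
  bool is_owner() const { return owner == std::this_thread::get_id(); }
};
```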
buf_pool_t::n_flush_LRU, buf_pool_t::n_flush_list:
Replaces buf_pool_t::init_flush[] and buf_pool_t::n_flush[].
The number of active flush operations.
buf_pool_t::mutex, buf_pool_t::flush_list_mutex: Use mysql_mutex_t
instead of ib_mutex_t, to have native mutexes with PERFORMANCE_SCHEMA
and SAFE_MUTEX instrumentation.
buf_pool_t::done_flush_LRU: Condition variable for !n_flush_LRU.
buf_pool_t::done_flush_list: Condition variable for !n_flush_list.
buf_pool_t::do_flush_list: Condition variable to wake up the
buf_flush_page_cleaner when a log checkpoint needs to be written
or the server is being shut down. Replaces buf_flush_event.
We will keep using timed waits (the page cleaner thread will wake
_at least_ once per second), because the calculations for
innodb_adaptive_flushing depend on fixed time intervals.
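The timed wait can be sketched as follows (illustrative member names; the real code uses my_cond_timedwait on the InnoDB side): the page cleaner sleeps on do_flush_list but never longer than one second, so the adaptive-flushing statistics keep their fixed sampling interval.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical model of the page cleaner's wait loop.
struct cleaner
{
  std::mutex flush_list_mutex;
  std::condition_variable do_flush_list;
  bool work_pending = false;

  // Returns true if woken for work (checkpoint or shutdown request),
  // false when the 1-second tick elapsed with nothing to do.
  bool wait_for_work()
  {
    std::unique_lock<std::mutex> lk(flush_list_mutex);
    do_flush_list.wait_for(lk, std::chrono::seconds(1),
                           [this] { return work_pending; });
    bool woken = work_pending;
    work_pending = false;
    return woken;
  }
};
```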
buf_dblwr: Allocate statically, and move all code to member functions.
Use a native mutex and condition variable. Remove code to deal with
single-page flushing.
buf_dblwr_check_block(): Make the check debug-only. We were spending
a significant amount of execution time in page_simple_validate_new().
flush_counters_t::unzip_LRU_evicted: Remove.
IORequest: Make more members const. FIXME: m_fil_node should be removed.
buf_flush_sync_lsn: Protect by std::atomic, not page_cleaner.mutex
(which we are removing).
page_cleaner_slot_t, page_cleaner_t: Remove many redundant members.
pc_request_flush_slot(): Replaces pc_request() and pc_flush_slot().
recv_writer_thread: Remove. Recovery works just fine without it, if we
simply invoke buf_flush_sync() at the end of each batch in
recv_sys_t::apply().
recv_recovery_from_checkpoint_finish(): Remove. We can simply call
recv_sys.debug_free() directly.
srv_started_redo: Replaces srv_start_state.
SRV_SHUTDOWN_FLUSH_PHASE: Remove. logs_empty_and_mark_files_at_shutdown()
can communicate with the normal page cleaner loop via the new function
flush_buffer_pool().
buf_flush_remove(): Assert that the calling thread is holding
buf_pool.flush_list_mutex. This removes unnecessary mutex operations
from buf_flush_remove_pages() and buf_flush_dirty_pages(),
which replace buf_LRU_flush_or_remove_pages().
buf_flush_lists(): Renamed from buf_flush_batch(), with simplified
interface. Return the number of flushed pages. Clarified comments and
renamed min_n to max_n. Identify LRU batch by lsn=0. Merge all the functions
buf_flush_start(), buf_flush_batch(), buf_flush_end() directly to this
function, which was their only caller, and remove two unnecessary
buf_pool.mutex release/re-acquisition cycles that we used to perform around
the buf_flush_batch() call. At the start, if not all log has been
durably written, wait for a background task to do it, or start a new
task to do it. This allows the log write to run concurrently with our
page flushing batch. Any pages that were skipped due to too recent
FIL_PAGE_LSN or due to them being latched by a writer should be flushed
during the next batch, unless there are further modifications to those
pages. It is possible that a page that we must flush due to small
oldest_modification also carries a recent FIL_PAGE_LSN or is being
constantly modified. In the worst case, all writers would then end up
waiting in log_free_check() to allow the flushing and the checkpoint
to complete.
buf_do_flush_list_batch(): Clarify comments, and rename min_n to max_n.
Cache the last looked up tablespace. If neighbor flushing is not applicable,
invoke buf_flush_page() directly, avoiding a page lookup in between.
buf_flush_space(): Auxiliary function to look up a tablespace for
page flushing.
buf_flush_page(): Defer the computation of space->full_crc32(). Never
call log_write_up_to(), but instead skip persistent pages whose latest
modification (FIL_PAGE_LSN) is newer than the redo log. Also skip
pages on which we cannot acquire a shared latch without waiting.
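The two skip conditions can be modelled like this (hypothetical types; the real code compares FIL_PAGE_LSN against log_sys.get_flushed_lsn() and uses the block's own rw-latch):

```cpp
#include <cstdint>
#include <shared_mutex>

// Illustrative page frame: latest-modification LSN plus a latch that
// stands in for buf_block_t::lock.
struct page
{
  uint64_t newest_lsn;        // FIL_PAGE_LSN of the frame
  std::shared_mutex latch;
};

// Attempt to write one page; never block on the log or the latch.
bool try_flush(page& p, uint64_t flushed_lsn)
{
  if (p.newest_lsn > flushed_lsn)
    return false;             // log not durable yet: leave for next batch
  if (!p.latch.try_lock_shared())
    return false;             // latched by a writer: skip, do not wait
  // ... submit the asynchronous write here ...
  p.latch.unlock_shared();    // real code releases on write completion
  return true;
}
```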
buf_flush_try_neighbors(): Do not bother checking buf_fix_count
because buf_flush_page() will no longer wait for the page latch.
Take the tablespace as a parameter, and only execute this function
when innodb_flush_neighbors>0. Avoid repeated calls of page_id_t::fold().
buf_flush_relocate_on_flush_list(): Declare as cold, and push down
a condition from the callers.
buf_flush_check_neighbor(): Take id.fold() as a parameter.
buf_flush_sync(): Ensure that the buf_pool.flush_list is empty,
because the flushing batch will skip pages whose modifications have
not yet been written to the log or were latched for modification.
buf_free_from_unzip_LRU_list_batch(): Remove redundant local variables.
buf_flush_LRU_list_batch(): Let the caller buf_do_LRU_batch() initialize
the counters, and report n->evicted.
Cache the last looked up tablespace. If neighbor flushing is not applicable,
invoke buf_flush_page() directly, avoiding a page lookup in between.
buf_do_LRU_batch(): Return the number of pages flushed.
buf_LRU_free_page(): Only release and re-acquire buf_pool.mutex if
adaptive hash index entries are pointing to the block.
buf_LRU_get_free_block(): Do not wake up the page cleaner, because it
will no longer perform any useful work for us, and we do not want it
to compete for I/O while buf_flush_lists(innodb_lru_flush_size, 0)
writes out and evicts at most innodb_lru_flush_size pages. (The
function buf_do_LRU_batch() may complete after writing fewer pages if
more than innodb_lru_scan_depth pages end up in buf_pool.free list.)
Eliminate some mutex release-acquire cycles, and wait for the LRU
flush batch to complete before rescanning.
buf_LRU_check_size_of_non_data_objects(): Simplify the code.
buf_page_write_complete(): Remove the parameter evict, and always
evict pages that were part of an LRU flush.
buf_page_create(): Take a pre-allocated page as a parameter.
buf_pool_t::free_block(): Free a pre-allocated block.
recv_sys_t::recover_low(), recv_sys_t::apply(): Preallocate the block
while not holding recv_sys.mutex. During page allocation, we may
initiate a page flush, which in turn may initiate a log flush, which
would require acquiring log_sys.mutex, which should always be acquired
before recv_sys.mutex in order to avoid deadlocks. Therefore, we must
not be holding recv_sys.mutex while allocating a buffer pool block.
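The lock-ordering argument reduces to a small sketch (hypothetical mutex names): anything that may acquire the log mutex, such as block allocation that can trigger a flush, must run before the recovery mutex is taken, because the global order is log mutex before recovery mutex.

```cpp
#include <mutex>

// Stand-ins for log_sys.mutex and recv_sys.mutex; the global lock
// order is log_mutex -> recv_mutex.
std::mutex log_mutex;
std::mutex recv_mutex;

// Allocation may indirectly take log_mutex (page flush -> log write),
// so it must never be called while recv_mutex is held.
int* allocate_block()
{
  std::lock_guard<std::mutex> g(log_mutex);
  return new int(0);
}

int recover_page()
{
  int* block = allocate_block();  // 1: preallocate, no recv_mutex held
  {
    std::lock_guard<std::mutex> g(recv_mutex);
    *block += 1;                  // 2: "apply redo" under recv_mutex only
  }
  int applied = *block;
  delete block;
  return applied;
}
```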
BtrBulk::logFreeCheck(): Skip a redundant condition.
row_undo_step(): Do not invoke srv_inc_activity_count() for every row
that is being rolled back. It should suffice to invoke the function in
trx_flush_log_if_needed() during trx_t::commit_in_memory() when the
rollback completes.
sync_check_enable(): Remove. We will enable innodb_sync_debug from the
very beginning.
Reviewed by: Vladislav Vaintroub
73 files changed, 2023 insertions, 3835 deletions
diff --git a/extra/mariabackup/xtrabackup.cc b/extra/mariabackup/xtrabackup.cc
index 7fd5efe2cd5..3da6239b6f8 100644
--- a/extra/mariabackup/xtrabackup.cc
+++ b/extra/mariabackup/xtrabackup.cc
@@ -3426,9 +3426,7 @@ xb_data_files_close()
 {
 	ut_ad(!os_thread_count);
 	fil_close_all_files();
-	if (buf_dblwr) {
-		buf_dblwr_free();
-	}
+	buf_dblwr.close();
 }
 
 /***********************************************************************
@@ -4017,7 +4015,6 @@ fail:
 	}
 	srv_thread_pool_init();
 	sync_check_init();
-	ut_d(sync_check_enable());
 	/* Reset the system variables in the recovery module. */
 	trx_pool_init();
 	recv_sys.create();
@@ -5385,7 +5382,6 @@ static bool xtrabackup_prepare_func(char** argv)
 	}
 
 	sync_check_init();
-	ut_d(sync_check_enable());
 	recv_sys.create();
 	log_sys.create();
 	recv_sys.recovery_on = true;
diff --git a/include/my_pthread.h b/include/my_pthread.h
index 4888bfcc2c8..66876032178 100644
--- a/include/my_pthread.h
+++ b/include/my_pthread.h
@@ -427,12 +427,10 @@ void safe_mutex_free_deadlock_data(safe_mutex_t *mp);
 #define MYF_NO_DEADLOCK_DETECTION 2
 
 #ifdef SAFE_MUTEX
-#define safe_mutex_assert_owner(mp) \
-  DBUG_ASSERT((mp)->count > 0 && \
-              pthread_equal(pthread_self(), (mp)->thread))
-#define safe_mutex_assert_not_owner(mp) \
-  DBUG_ASSERT(! (mp)->count || \
-              ! pthread_equal(pthread_self(), (mp)->thread))
+#define safe_mutex_is_owner(mp) ((mp)->count > 0 && \
+  pthread_equal(pthread_self(), (mp)->thread))
+#define safe_mutex_assert_owner(mp) DBUG_ASSERT(safe_mutex_is_owner(mp))
+#define safe_mutex_assert_not_owner(mp) DBUG_ASSERT(!safe_mutex_is_owner(mp))
 #define safe_mutex_setflags(mp, F) do { (mp)->create_flags|= (F); } while (0)
 #define my_cond_timedwait(A,B,C) safe_cond_timedwait((A),(B),(C),__FILE__,__LINE__)
 #define my_cond_wait(A,B) safe_cond_wait((A), (B), __FILE__, __LINE__)
diff --git a/include/mysql/psi/mysql_thread.h b/include/mysql/psi/mysql_thread.h
index 711520dba78..47f89f76685 100644
--- a/include/mysql/psi/mysql_thread.h
+++ b/include/mysql/psi/mysql_thread.h
@@ -1,4 +1,5 @@
 /* Copyright (c) 2008, 2013, Oracle and/or its affiliates.
+   Copyright (c) 2020, MariaDB Corporation.
 
   This program is free software; you can redistribute it and/or modify
   it under the terms of the GNU General Public License, version 2.0,
@@ -262,6 +263,7 @@ typedef struct st_mysql_cond mysql_cond_t;
 */
 
 #ifndef DISABLE_MYSQL_THREAD_H
+#define mysql_mutex_is_owner(M) safe_mutex_is_owner(&(M)->m_mutex)
 /**
   @def mysql_mutex_assert_owner(M)
   Wrapper, to use safe_mutex_assert_owner with instrumented mutexes.
diff --git a/mysql-test/suite/encryption/t/innodb_encryption_discard_import.opt b/mysql-test/suite/encryption/t/innodb_encryption_discard_import.opt
index bcff011eb82..9fe990f7260 100644
--- a/mysql-test/suite/encryption/t/innodb_encryption_discard_import.opt
+++ b/mysql-test/suite/encryption/t/innodb_encryption_discard_import.opt
@@ -3,6 +3,5 @@
 --innodb-encryption-rotate-key-age=15
 --innodb-encryption-threads=4
 --innodb-tablespaces-encryption
+--innodb-max-dirty-pages-pct_lwm=0
 --innodb-max-dirty-pages-pct=0.001
-
-
diff --git a/mysql-test/suite/innodb/r/ibuf_not_empty.result b/mysql-test/suite/innodb/r/ibuf_not_empty.result
index 3382c74174e..d1b8203b063 100644
--- a/mysql-test/suite/innodb/r/ibuf_not_empty.result
+++ b/mysql-test/suite/innodb/r/ibuf_not_empty.result
@@ -1,4 +1,3 @@
-SET GLOBAL innodb_purge_rseg_truncate_frequency=1;
 CREATE TABLE t1(
 a INT AUTO_INCREMENT PRIMARY KEY,
 b CHAR(1),
@@ -6,26 +5,11 @@ c INT,
 INDEX(b))
 ENGINE=InnoDB STATS_PERSISTENT=0;
 SET GLOBAL innodb_change_buffering_debug = 1;
-BEGIN;
-INSERT INTO t1 VALUES(0,'x',1);
-INSERT INTO t1 SELECT 0,b,c FROM t1;
-INSERT INTO t1 SELECT 0,b,c FROM t1;
-INSERT INTO t1 SELECT 0,b,c FROM t1;
-INSERT INTO t1 SELECT 0,b,c FROM t1;
-INSERT INTO t1 SELECT 0,b,c FROM t1;
-INSERT INTO t1 SELECT 0,b,c FROM t1;
-INSERT INTO t1 SELECT 0,b,c FROM t1;
-INSERT INTO t1 SELECT 0,b,c FROM t1;
-INSERT INTO t1 SELECT 0,b,c FROM t1;
-INSERT INTO t1 SELECT 0,b,c FROM t1;
-INSERT INTO t1 SELECT 0,b,c FROM t1;
-INSERT INTO t1 SELECT 0,b,c FROM t1;
-COMMIT;
-InnoDB		0 transactions not purged
+INSERT INTO t1 SELECT 0,'x',1 FROM seq_1_to_1024;
 # restart: --innodb-force-recovery=6 --innodb-change-buffer-dump
 check table t1;
 Table	Op	Msg_type	Msg_text
-test.t1	check	Warning	InnoDB: Index 'b' contains #### entries, should be 4096.
+test.t1	check	Warning	InnoDB: Index 'b' contains 990 entries, should be 1024.
 test.t1	check	error	Corrupt
 # restart
 SET GLOBAL innodb_fast_shutdown=0;
diff --git a/mysql-test/suite/innodb/r/innodb-change-buffer-recovery.result b/mysql-test/suite/innodb/r/innodb-change-buffer-recovery.result
index d795b516d5e..678c8c67be5 100644
--- a/mysql-test/suite/innodb/r/innodb-change-buffer-recovery.result
+++ b/mysql-test/suite/innodb/r/innodb-change-buffer-recovery.result
@@ -50,5 +50,5 @@ Table	Op	Msg_type	Msg_text
 test.t1	check	status	OK
 SHOW ENGINE INNODB STATUS;
 Type	Name	Status
-InnoDB		insert 79, delete mark 1
+InnoDB
 DROP TABLE t1;
diff --git a/mysql-test/suite/innodb/r/innodb_skip_innodb_is_tables.result b/mysql-test/suite/innodb/r/innodb_skip_innodb_is_tables.result
index 9ab72408274..6a597a919e1 100644
--- a/mysql-test/suite/innodb/r/innodb_skip_innodb_is_tables.result
+++ b/mysql-test/suite/innodb/r/innodb_skip_innodb_is_tables.result
@@ -90,14 +90,10 @@ buffer_flush_neighbor_pages buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL N
 buffer_flush_n_to_flush_requested buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Number of pages requested for flushing.
 buffer_flush_n_to_flush_by_age buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Number of pages target by LSN Age for flushing.
 buffer_flush_adaptive_avg_time_slot buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Avg time (ms) spent for adaptive flushing recently per slot.
-buffer_LRU_batch_flush_avg_time_slot buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Avg time (ms) spent for LRU batch flushing recently per slot.
 buffer_flush_adaptive_avg_time_thread buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Avg time (ms) spent for adaptive flushing recently per thread.
-buffer_LRU_batch_flush_avg_time_thread buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Avg time (ms) spent for LRU batch flushing recently per thread.
 buffer_flush_adaptive_avg_time_est buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Estimated time (ms) spent for adaptive flushing recently.
-buffer_LRU_batch_flush_avg_time_est buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Estimated time (ms) spent for LRU batch flushing recently.
 buffer_flush_avg_time buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Avg time (ms) spent for flushing recently.
 buffer_flush_adaptive_avg_pass buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Number of adaptive flushes passed during the recent Avg period.
-buffer_LRU_batch_flush_avg_pass buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Number of LRU batch flushes passed during the recent Avg period.
 buffer_flush_avg_pass buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Number of flushes passed during the recent Avg period.
 buffer_LRU_get_free_loops buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Total loops in LRU get free.
 buffer_LRU_get_free_waits buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Total sleep waits in LRU get free.
@@ -124,9 +120,6 @@ buffer_LRU_batch_flush_pages buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL
 buffer_LRU_batch_evict_total_pages buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 set_owner Total pages evicted as part of LRU batches
 buffer_LRU_batches_evict buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 set_member Number of LRU batches
 buffer_LRU_batch_evict_pages buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 set_member Pages queued as an LRU batch
-buffer_LRU_single_flush_scanned buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 set_owner Total pages scanned as part of single page LRU flush
-buffer_LRU_single_flush_num_scan buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 set_member Number of times single page LRU flush is called
-buffer_LRU_single_flush_scanned_per_call buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 set_member Page scanned per single LRU flush
 buffer_LRU_single_flush_failure_count Buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Number of times attempt to flush a single page from LRU failed
 buffer_LRU_get_free_search Buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Number of searches performed for a clean page
 buffer_LRU_search_scanned buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 set_owner Total pages scanned as part of LRU search
diff --git a/mysql-test/suite/innodb/r/monitor.result b/mysql-test/suite/innodb/r/monitor.result
index 0c0168fb266..4aeab1a8402 100644
--- a/mysql-test/suite/innodb/r/monitor.result
+++ b/mysql-test/suite/innodb/r/monitor.result
@@ -56,14 +56,10 @@ buffer_flush_neighbor_pages disabled
 buffer_flush_n_to_flush_requested disabled
 buffer_flush_n_to_flush_by_age disabled
 buffer_flush_adaptive_avg_time_slot disabled
-buffer_LRU_batch_flush_avg_time_slot disabled
 buffer_flush_adaptive_avg_time_thread disabled
-buffer_LRU_batch_flush_avg_time_thread disabled
 buffer_flush_adaptive_avg_time_est disabled
-buffer_LRU_batch_flush_avg_time_est disabled
 buffer_flush_avg_time disabled
 buffer_flush_adaptive_avg_pass disabled
-buffer_LRU_batch_flush_avg_pass disabled
 buffer_flush_avg_pass disabled
 buffer_LRU_get_free_loops disabled
 buffer_LRU_get_free_waits disabled
@@ -90,9 +86,6 @@ buffer_LRU_batch_flush_pages disabled
 buffer_LRU_batch_evict_total_pages disabled
 buffer_LRU_batches_evict disabled
 buffer_LRU_batch_evict_pages disabled
-buffer_LRU_single_flush_scanned disabled
-buffer_LRU_single_flush_num_scan disabled
-buffer_LRU_single_flush_scanned_per_call disabled
 buffer_LRU_single_flush_failure_count disabled
 buffer_LRU_get_free_search disabled
 buffer_LRU_search_scanned disabled
diff --git a/mysql-test/suite/innodb/r/purge_secondary.result b/mysql-test/suite/innodb/r/purge_secondary.result
index 7c2b4151e76..a583e46418d 100644
--- a/mysql-test/suite/innodb/r/purge_secondary.result
+++ b/mysql-test/suite/innodb/r/purge_secondary.result
@@ -148,10 +148,6 @@ SELECT (variable_value > 0) FROM information_schema.global_status
 WHERE LOWER(variable_name) LIKE 'INNODB_BUFFER_POOL_PAGES_FLUSHED';
 (variable_value > 0)
 1
-SELECT NAME, SUBSYSTEM FROM INFORMATION_SCHEMA.INNODB_METRICS
-WHERE NAME="buffer_LRU_batch_evict_total_pages" AND COUNT > 0;
-NAME	SUBSYSTEM
-buffer_LRU_batch_evict_total_pages	buffer
 # Note: The OTHER_INDEX_SIZE does not cover any SPATIAL INDEX.
 # To test that all indexes were emptied, replace DROP TABLE
 # with the following, and examine the root pages in t1.ibd:
diff --git a/mysql-test/suite/innodb/t/ibuf_not_empty.combinations b/mysql-test/suite/innodb/t/ibuf_not_empty.combinations
index 729380593f3..c4b45dcca32 100644
--- a/mysql-test/suite/innodb/t/ibuf_not_empty.combinations
+++ b/mysql-test/suite/innodb/t/ibuf_not_empty.combinations
@@ -1,5 +1,9 @@
 [strict_crc32]
 --innodb-checksum-algorithm=strict_crc32
+--innodb-page-size=4k
+--innodb-force-recovery=2
 
 [strict_full_crc32]
 --innodb-checksum-algorithm=strict_full_crc32
+--innodb-page-size=4k
+--innodb-force-recovery=2
diff --git a/mysql-test/suite/innodb/t/ibuf_not_empty.test b/mysql-test/suite/innodb/t/ibuf_not_empty.test
index a3f4ad9ac5c..3b254177497 100644
--- a/mysql-test/suite/innodb/t/ibuf_not_empty.test
+++ b/mysql-test/suite/innodb/t/ibuf_not_empty.test
@@ -3,10 +3,8 @@
 --source include/have_debug.inc
 # Embedded server tests do not support restarting
 --source include/not_embedded.inc
-# The test is not big enough to use change buffering with larger page size.
---source include/have_innodb_max_16k.inc
+--source include/have_sequence.inc
 
-SET GLOBAL innodb_purge_rseg_truncate_frequency=1;
 --disable_query_log
 call mtr.add_suppression("InnoDB: Failed to find tablespace for table `test`\\.`t1` in the cache\\. Attempting to load the tablespace with space id");
 call mtr.add_suppression("InnoDB: Allocated tablespace ID \\d+ for test.t1, old maximum was");
@@ -30,27 +28,10 @@ SET GLOBAL innodb_change_buffering_debug = 1;
 # Create enough rows for the table, so that the change buffer will be
 # used for modifying the secondary index page. There must be multiple
 # index pages, because changes to the root page are never buffered.
-BEGIN;
-INSERT INTO t1 VALUES(0,'x',1);
-INSERT INTO t1 SELECT 0,b,c FROM t1;
-INSERT INTO t1 SELECT 0,b,c FROM t1;
-INSERT INTO t1 SELECT 0,b,c FROM t1;
-INSERT INTO t1 SELECT 0,b,c FROM t1;
-INSERT INTO t1 SELECT 0,b,c FROM t1;
-INSERT INTO t1 SELECT 0,b,c FROM t1;
-INSERT INTO t1 SELECT 0,b,c FROM t1;
-INSERT INTO t1 SELECT 0,b,c FROM t1;
-INSERT INTO t1 SELECT 0,b,c FROM t1;
-INSERT INTO t1 SELECT 0,b,c FROM t1;
-INSERT INTO t1 SELECT 0,b,c FROM t1;
-INSERT INTO t1 SELECT 0,b,c FROM t1;
-COMMIT;
+INSERT INTO t1 SELECT 0,'x',1 FROM seq_1_to_1024;
 
 let MYSQLD_DATADIR=`select @@datadir`;
 let PAGE_SIZE=`select @@innodb_page_size`;
 
-# Ensure that purge will not access the truncated .ibd file
---source include/wait_all_purged.inc
-
 --source include/shutdown_mysqld.inc
 
 # Corrupt the change buffer bitmap, to claim that pages are clean
@@ -87,7 +68,6 @@ EOF
 --let $restart_parameters= --innodb-force-recovery=6 --innodb-change-buffer-dump
 --source include/start_mysqld.inc
---replace_regex /contains \d+ entries/contains #### entries/
 check table t1;
 --source include/shutdown_mysqld.inc
diff --git a/mysql-test/suite/innodb/t/innodb-change-buffer-recovery.test b/mysql-test/suite/innodb/t/innodb-change-buffer-recovery.test
index 79d9cc814a0..a12ca43cec1 100644
--- a/mysql-test/suite/innodb/t/innodb-change-buffer-recovery.test
+++ b/mysql-test/suite/innodb/t/innodb-change-buffer-recovery.test
@@ -76,6 +76,6 @@ SET GLOBAL innodb_fast_shutdown=0;
 --let $restart_parameters=
 --source include/restart_mysqld.inc
 CHECK TABLE t1;
-replace_regex /.*operations:.* (insert.*), delete \d.*discarded .*/\1/;
+replace_regex /.*operations:.* insert [1-9][0-9]*, delete mark [1-9][0-9]*, delete \d.*discarded .*//;
 SHOW ENGINE INNODB STATUS;
 DROP TABLE t1;
diff --git a/mysql-test/suite/innodb/t/purge_secondary.test b/mysql-test/suite/innodb/t/purge_secondary.test
index 34b4ce06f5f..f2c85ce10e7 100644
--- a/mysql-test/suite/innodb/t/purge_secondary.test
+++ b/mysql-test/suite/innodb/t/purge_secondary.test
@@ -131,9 +131,6 @@ ALTER TABLE t1 FORCE, ALGORITHM=INPLACE;
 SELECT (variable_value > 0) FROM information_schema.global_status
 WHERE LOWER(variable_name) LIKE 'INNODB_BUFFER_POOL_PAGES_FLUSHED';
-SELECT NAME, SUBSYSTEM FROM INFORMATION_SCHEMA.INNODB_METRICS
-WHERE NAME="buffer_LRU_batch_evict_total_pages" AND COUNT > 0;
-
 --echo # Note: The OTHER_INDEX_SIZE does not cover any SPATIAL INDEX.
 --echo # To test that all indexes were emptied, replace DROP TABLE
 --echo # with the following, and examine the root pages in t1.ibd:
diff --git a/mysql-test/suite/perfschema/t/show_sanity.test b/mysql-test/suite/perfschema/t/show_sanity.test
index 8f73c38eab8..61161ea5df6 100644
--- a/mysql-test/suite/perfschema/t/show_sanity.test
+++ b/mysql-test/suite/perfschema/t/show_sanity.test
@@ -422,7 +422,6 @@ insert into test.sanity values
 ("JUNK: GLOBAL-ONLY", "I_S.SESSION_VARIABLES", "INNODB_DISABLE_RESIZE_BUFFER_POOL_DEBUG"),
 ("JUNK: GLOBAL-ONLY", "I_S.SESSION_VARIABLES", "INNODB_DISABLE_SORT_FILE_CACHE"),
 ("JUNK: GLOBAL-ONLY", "I_S.SESSION_VARIABLES", "INNODB_DOUBLEWRITE"),
-("JUNK: GLOBAL-ONLY", "I_S.SESSION_VARIABLES", "INNODB_DOUBLEWRITE_BATCH_SIZE"),
 ("JUNK: GLOBAL-ONLY", "I_S.SESSION_VARIABLES", "INNODB_FAST_SHUTDOWN"),
 ("JUNK: GLOBAL-ONLY", "I_S.SESSION_VARIABLES", "INNODB_FILE_PER_TABLE"),
 ("JUNK: GLOBAL-ONLY", "I_S.SESSION_VARIABLES", "INNODB_FILL_FACTOR"),
diff --git a/mysql-test/suite/sys_vars/r/innodb_doublewrite_batch_size_basic.result b/mysql-test/suite/sys_vars/r/innodb_doublewrite_batch_size_basic.result
deleted file mode 100644
index cec90ea8950..00000000000
--- a/mysql-test/suite/sys_vars/r/innodb_doublewrite_batch_size_basic.result
+++ /dev/null
@@ -1,24 +0,0 @@
-select @@global.innodb_doublewrite_batch_size between 1 and 127;
-@@global.innodb_doublewrite_batch_size between 1 and 127
-1
-select @@global.innodb_doublewrite_batch_size;
-@@global.innodb_doublewrite_batch_size
-120
-select @@session.innodb_doublewrite_batch_size;
-ERROR HY000: Variable 'innodb_doublewrite_batch_size' is a GLOBAL variable
-show global variables like 'innodb_doublewrite_batch_size';
-Variable_name	Value
-innodb_doublewrite_batch_size	120
-show session variables like 'innodb_doublewrite_batch_size';
-Variable_name	Value
-innodb_doublewrite_batch_size	120
-select * from information_schema.global_variables where variable_name='innodb_doublewrite_batch_size';
-VARIABLE_NAME	VARIABLE_VALUE
-INNODB_DOUBLEWRITE_BATCH_SIZE	120
-select * from information_schema.session_variables where variable_name='innodb_doublewrite_batch_size';
-VARIABLE_NAME	VARIABLE_VALUE
-INNODB_DOUBLEWRITE_BATCH_SIZE	120
-set global innodb_doublewrite_batch_size=1;
-ERROR HY000: Variable 'innodb_doublewrite_batch_size' is a read only variable
-set @@session.innodb_doublewrite_batch_size='some';
-ERROR HY000: Variable 'innodb_doublewrite_batch_size' is a read only variable
diff --git a/mysql-test/suite/sys_vars/r/innodb_max_dirty_pages_pct_basic.result b/mysql-test/suite/sys_vars/r/innodb_max_dirty_pages_pct_basic.result
index 20b619972dd..ad0ffe9855a 100644
--- a/mysql-test/suite/sys_vars/r/innodb_max_dirty_pages_pct_basic.result
+++ b/mysql-test/suite/sys_vars/r/innodb_max_dirty_pages_pct_basic.result
@@ -7,7 +7,7 @@ SELECT @global_start_value;
 SET @global_start_max_dirty_lwm_value = @@global.innodb_max_dirty_pages_pct_lwm;
 SELECT @global_start_max_dirty_lwm_value;
 @global_start_max_dirty_lwm_value
-0
+75
 SET @@global.innodb_max_dirty_pages_pct_lwm = 0;
 SELECT @@global.innodb_max_dirty_pages_pct_lwm;
 @@global.innodb_max_dirty_pages_pct_lwm
@@ -17,13 +17,13 @@ SET @@global.innodb_max_dirty_pages_pct = 0;
 SET @@global.innodb_max_dirty_pages_pct = DEFAULT;
 SELECT @@global.innodb_max_dirty_pages_pct;
 @@global.innodb_max_dirty_pages_pct
-75.000000
+90.000000
 '#---------------------FN_DYNVARS_046_02-------------------------#'
 SET innodb_max_dirty_pages_pct = 1;
 ERROR HY000: Variable 'innodb_max_dirty_pages_pct' is a GLOBAL variable and should be set with SET GLOBAL
 SELECT @@innodb_max_dirty_pages_pct;
 @@innodb_max_dirty_pages_pct
-75.000000
+90.000000
 SELECT local.innodb_max_dirty_pages_pct;
 ERROR 42S02: Unknown table 'local' in field list
 SET global innodb_max_dirty_pages_pct = 0;
@@ -171,5 +171,5 @@ SELECT @@global.innodb_max_dirty_pages_pct;
 SET @@global.innodb_max_dirty_pages_pct_lwm = @global_start_max_dirty_lwm_value;
 SELECT @@global.innodb_max_dirty_pages_pct_lwm;
 @@global.innodb_max_dirty_pages_pct_lwm
-0.000000
+75.000000
 SET @@global.innodb_max_dirty_pages_pct=@save_innodb_max_dirty_pages_pct;
diff --git a/mysql-test/suite/sys_vars/r/innodb_max_dirty_pages_pct_func.result b/mysql-test/suite/sys_vars/r/innodb_max_dirty_pages_pct_func.result
index 8b68f182789..43cdf17ee27 100644
--- a/mysql-test/suite/sys_vars/r/innodb_max_dirty_pages_pct_func.result
+++ b/mysql-test/suite/sys_vars/r/innodb_max_dirty_pages_pct_func.result
@@ -1,26 +1,28 @@
+SET @innodb_max_dirty_pages_pct_lwm = @@global.innodb_max_dirty_pages_pct_lwm;
 SET @innodb_max_dirty_pages_pct = @@global.innodb_max_dirty_pages_pct;
 '#--------------------FN_DYNVARS_044_02-------------------------#'
+SET @@global.innodb_max_dirty_pages_pct_lwm = 0;
 SET @@global.innodb_max_dirty_pages_pct = 80;
-'connect (con1,localhost,root,,,,)'
+SET @@global.innodb_max_dirty_pages_pct_lwm = 80;
 connect con1,localhost,root,,,,;
-'connection con1'
 connection con1;
 SELECT @@global.innodb_max_dirty_pages_pct;
 @@global.innodb_max_dirty_pages_pct
 80.000000
 SET @@global.innodb_max_dirty_pages_pct = 70;
-'connect (con2,localhost,root,,,,)'
+Warnings:
+Warning	1210	innodb_max_dirty_pages_pct cannot be set lower than innodb_max_dirty_pages_pct_lwm.
+Warning	1210	Lowering innodb_max_dirty_page_pct_lwm to 70.000000
+SELECT @@global.innodb_max_dirty_pages_pct_lwm;
+@@global.innodb_max_dirty_pages_pct_lwm
+70.000000
 connect con2,localhost,root,,,,;
-'connection con2'
 connection con2;
 SELECT @@global.innodb_max_dirty_pages_pct;
 @@global.innodb_max_dirty_pages_pct
 70.000000
-'connection default'
 connection default;
-'disconnect con2'
 disconnect con2;
-'disconnect con1'
 disconnect con1;
 SET @@global.innodb_max_dirty_pages_pct = @innodb_max_dirty_pages_pct;
 '#--------------------FN_DYNVARS_044_02-------------------------#'
@@ -85,6 +87,22 @@ b CHAR(200)
 ) ENGINE = INNODB;
 '---Check when innodb_max_dirty_pages_pct is 10---'
 SET @@global.innodb_max_dirty_pages_pct = 10;
+Warnings:
+Warning	1210	innodb_max_dirty_pages_pct cannot be set lower than innodb_max_dirty_pages_pct_lwm.
+Warning	1210	Lowering innodb_max_dirty_page_pct_lwm to 10.000000
+SELECT @@global.innodb_max_dirty_pages_pct_lwm;
+@@global.innodb_max_dirty_pages_pct_lwm
+10.000000
+SET GLOBAL innodb_max_dirty_pages_pct_lwm = 15;
+Warnings:
+Warning	1210	innodb_max_dirty_pages_pct_lwm cannot be set higher than innodb_max_dirty_pages_pct.
+Warning	1210	Setting innodb_max_dirty_page_pct_lwm to 10.000000
+SELECT @@global.innodb_max_dirty_pages_pct_lwm;
+@@global.innodb_max_dirty_pages_pct_lwm
+10.000000
+SELECT @@global.innodb_max_dirty_pages_pct;
+@@global.innodb_max_dirty_pages_pct
+10.000000
 FLUSH STATUS;
 CALL add_until(10);
 FLUSH TABLES;
@@ -98,4 +116,6 @@ DROP PROCEDURE add_until;
 DROP PROCEDURE check_pct;
 DROP FUNCTION dirty_pct;
 DROP TABLE t1;
+SET GLOBAL innodb_max_dirty_pages_pct_lwm = 0;
 SET @@global.innodb_max_dirty_pages_pct = @innodb_max_dirty_pages_pct;
+SET @@global.innodb_max_dirty_pages_pct_lwm = @innodb_max_dirty_pages_pct_lwm;
diff --git a/mysql-test/suite/sys_vars/r/innodb_max_dirty_pages_pct_lwm_basic.result b/mysql-test/suite/sys_vars/r/innodb_max_dirty_pages_pct_lwm_basic.result
index 641386d5f23..313bdf28e82 100644
--- a/mysql-test/suite/sys_vars/r/innodb_max_dirty_pages_pct_lwm_basic.result
+++ b/mysql-test/suite/sys_vars/r/innodb_max_dirty_pages_pct_lwm_basic.result
@@ -3,7 +3,7 @@ set @@global.innodb_max_dirty_pages_pct=75;
 SET @pct_lwm_start_value = @@global.innodb_max_dirty_pages_pct_lwm;
 SELECT @pct_lwm_start_value;
 @pct_lwm_start_value
-0
+75
 SET @pct_start_value = @@global.innodb_max_dirty_pages_pct;
 SELECT @pct_start_value;
 @pct_start_value
@@ -13,13 +13,13 @@ SET @@global.innodb_max_dirty_pages_pct_lwm = 0;
 SET @@global.innodb_max_dirty_pages_pct_lwm = DEFAULT;
 SELECT @@global.innodb_max_dirty_pages_pct_lwm;
 @@global.innodb_max_dirty_pages_pct_lwm
-0.000000
+75.000000
 '#---------------------FN_DYNVARS_046_02-------------------------#'
 SET innodb_max_dirty_pages_pct_lwm = 1;
 ERROR HY000: Variable 'innodb_max_dirty_pages_pct_lwm' is a GLOBAL variable and should be set with SET GLOBAL
 SELECT @@innodb_max_dirty_pages_pct_lwm;
 @@innodb_max_dirty_pages_pct_lwm
-0.000000
+75.000000
 SELECT local.innodb_max_dirty_pages_pct_lwm;
 ERROR 42S02: Unknown table 'local' in field list
 SET global innodb_max_dirty_pages_pct_lwm = 0;
@@ -130,5 +130,5 @@ SELECT
@@global.innodb_max_dirty_pages_pct; SET @@global.innodb_max_dirty_pages_pct_lwm = @pct_lwm_start_value; SELECT @@global.innodb_max_dirty_pages_pct_lwm; @@global.innodb_max_dirty_pages_pct_lwm -0.000000 +75.000000 SET @@global.innodb_max_dirty_pages_pct=@save_innodb_max_dirty_pages_pct; diff --git a/mysql-test/suite/sys_vars/r/sysvars_innodb,32bit.rdiff b/mysql-test/suite/sys_vars/r/sysvars_innodb,32bit.rdiff index 2f39a472b99..0a558e77923 100644 --- a/mysql-test/suite/sys_vars/r/sysvars_innodb,32bit.rdiff +++ b/mysql-test/suite/sys_vars/r/sysvars_innodb,32bit.rdiff @@ -85,16 +85,7 @@ VARIABLE_COMMENT Percentage of empty space on a data page that can be reserved to make the page compressible. NUMERIC_MIN_VALUE 0 NUMERIC_MAX_VALUE 75 -@@ -661,7 +661,7 @@ - SESSION_VALUE NULL - DEFAULT_VALUE 120 - VARIABLE_SCOPE GLOBAL --VARIABLE_TYPE BIGINT UNSIGNED -+VARIABLE_TYPE INT UNSIGNED - VARIABLE_COMMENT Number of pages reserved in doublewrite buffer for batch flushing - NUMERIC_MIN_VALUE 1 - NUMERIC_MAX_VALUE 127 -@@ -757,7 +757,7 @@ +@@ -745,7 +745,7 @@ SESSION_VALUE NULL DEFAULT_VALUE 600 VARIABLE_SCOPE GLOBAL @@ -103,7 +94,7 @@ VARIABLE_COMMENT Maximum number of seconds that semaphore times out in InnoDB. NUMERIC_MIN_VALUE 1 NUMERIC_MAX_VALUE 4294967295 -@@ -805,7 +805,7 @@ +@@ -793,7 +793,7 @@ SESSION_VALUE NULL DEFAULT_VALUE 0 VARIABLE_SCOPE GLOBAL @@ -112,7 +103,7 @@ VARIABLE_COMMENT Make the first page of the given tablespace dirty. NUMERIC_MIN_VALUE 0 NUMERIC_MAX_VALUE 4294967295 -@@ -817,7 +817,7 @@ +@@ -805,7 +805,7 @@ SESSION_VALUE NULL DEFAULT_VALUE 30 VARIABLE_SCOPE GLOBAL @@ -121,7 +112,7 @@ VARIABLE_COMMENT Number of iterations over which the background flushing is averaged. NUMERIC_MIN_VALUE 1 NUMERIC_MAX_VALUE 1000 -@@ -841,7 +841,7 @@ +@@ -829,7 +829,7 @@ SESSION_VALUE NULL DEFAULT_VALUE 1 VARIABLE_SCOPE GLOBAL @@ -130,7 +121,7 @@ VARIABLE_COMMENT Controls the durability/speed trade-off for commits. 
Set to 0 (write and flush redo log to disk only once per second), 1 (flush to disk at each commit), 2 (write to log at commit but flush to disk only once per second) or 3 (flush to disk at prepare and at commit, slower and usually redundant). 1 and 3 guarantees that after a crash, committed transactions will not be lost and will be consistent with the binlog and other transactional engines. 2 can get inconsistent and lose transactions if there is a power failure or kernel crash but not if mysqld crashes. 0 has no guarantees in case of crash. 0 and 2 can be faster than 1 or 3. NUMERIC_MIN_VALUE 0 NUMERIC_MAX_VALUE 3 -@@ -865,7 +865,7 @@ +@@ -853,7 +853,7 @@ SESSION_VALUE NULL DEFAULT_VALUE 1 VARIABLE_SCOPE GLOBAL @@ -139,7 +130,7 @@ VARIABLE_COMMENT Set to 0 (don't flush neighbors from buffer pool), 1 (flush contiguous neighbors from buffer pool) or 2 (flush neighbors from buffer pool), when flushing a block NUMERIC_MIN_VALUE 0 NUMERIC_MAX_VALUE 2 -@@ -913,7 +913,7 @@ +@@ -901,7 +901,7 @@ SESSION_VALUE NULL DEFAULT_VALUE 0 VARIABLE_SCOPE GLOBAL @@ -148,7 +139,7 @@ VARIABLE_COMMENT Helps to save your data in case the disk image of the database becomes corrupt. Value 5 can return bogus data, and 6 can permanently corrupt data. 
NUMERIC_MIN_VALUE 0 NUMERIC_MAX_VALUE 6 -@@ -937,7 +937,7 @@ +@@ -925,7 +925,7 @@ SESSION_VALUE NULL DEFAULT_VALUE 8000000 VARIABLE_SCOPE GLOBAL @@ -157,7 +148,7 @@ VARIABLE_COMMENT InnoDB Fulltext search cache size in bytes NUMERIC_MIN_VALUE 1600000 NUMERIC_MAX_VALUE 80000000 -@@ -973,7 +973,7 @@ +@@ -961,7 +961,7 @@ SESSION_VALUE NULL DEFAULT_VALUE 84 VARIABLE_SCOPE GLOBAL @@ -166,7 +157,7 @@ VARIABLE_COMMENT InnoDB Fulltext search maximum token size in characters NUMERIC_MIN_VALUE 10 NUMERIC_MAX_VALUE 84 -@@ -985,7 +985,7 @@ +@@ -973,7 +973,7 @@ SESSION_VALUE NULL DEFAULT_VALUE 3 VARIABLE_SCOPE GLOBAL @@ -175,7 +166,7 @@ VARIABLE_COMMENT InnoDB Fulltext search minimum token size in characters NUMERIC_MIN_VALUE 0 NUMERIC_MAX_VALUE 16 -@@ -997,7 +997,7 @@ +@@ -985,7 +985,7 @@ SESSION_VALUE NULL DEFAULT_VALUE 2000 VARIABLE_SCOPE GLOBAL @@ -184,7 +175,7 @@ VARIABLE_COMMENT InnoDB Fulltext search number of words to optimize for each optimize table call NUMERIC_MIN_VALUE 1000 NUMERIC_MAX_VALUE 10000 -@@ -1009,10 +1009,10 @@ +@@ -997,10 +997,10 @@ SESSION_VALUE NULL DEFAULT_VALUE 2000000000 VARIABLE_SCOPE GLOBAL @@ -197,7 +188,7 @@ NUMERIC_BLOCK_SIZE 0 ENUM_VALUE_LIST NULL READ_ONLY NO -@@ -1033,7 +1033,7 @@ +@@ -1021,7 +1021,7 @@ SESSION_VALUE NULL DEFAULT_VALUE 2 VARIABLE_SCOPE GLOBAL @@ -206,7 +197,7 @@ VARIABLE_COMMENT InnoDB Fulltext search parallel sort degree, will round up to nearest power of 2 number NUMERIC_MIN_VALUE 1 NUMERIC_MAX_VALUE 16 -@@ -1045,7 +1045,7 @@ +@@ -1033,7 +1033,7 @@ SESSION_VALUE NULL DEFAULT_VALUE 640000000 VARIABLE_SCOPE GLOBAL @@ -215,7 +206,7 @@ VARIABLE_COMMENT Total memory allocated for InnoDB Fulltext Search cache NUMERIC_MIN_VALUE 32000000 NUMERIC_MAX_VALUE 1600000000 -@@ -1069,7 +1069,7 @@ +@@ -1057,7 +1057,7 @@ SESSION_VALUE NULL DEFAULT_VALUE 100 VARIABLE_SCOPE GLOBAL @@ -224,7 +215,7 @@ VARIABLE_COMMENT Up to what percentage of dirty pages should be flushed when innodb finds it has spare resources to do so. 
NUMERIC_MIN_VALUE 0 NUMERIC_MAX_VALUE 100 -@@ -1105,22 +1105,22 @@ +@@ -1093,22 +1093,22 @@ SESSION_VALUE NULL DEFAULT_VALUE 200 VARIABLE_SCOPE GLOBAL @@ -252,7 +243,7 @@ NUMERIC_BLOCK_SIZE 0 ENUM_VALUE_LIST NULL READ_ONLY NO -@@ -1165,7 +1165,7 @@ +@@ -1153,7 +1153,7 @@ SESSION_VALUE 50 DEFAULT_VALUE 50 VARIABLE_SCOPE SESSION @@ -261,7 +252,7 @@ VARIABLE_COMMENT Timeout in seconds an InnoDB transaction may wait for a lock before being rolled back. Values above 100000000 disable the timeout. NUMERIC_MIN_VALUE 0 NUMERIC_MAX_VALUE 1073741824 -@@ -1177,10 +1177,10 @@ +@@ -1165,10 +1165,10 @@ SESSION_VALUE NULL DEFAULT_VALUE 16777216 VARIABLE_SCOPE GLOBAL @@ -274,7 +265,7 @@ NUMERIC_BLOCK_SIZE 1024 ENUM_VALUE_LIST NULL READ_ONLY YES -@@ -1225,7 +1225,7 @@ +@@ -1213,7 +1213,7 @@ SESSION_VALUE NULL DEFAULT_VALUE 1 VARIABLE_SCOPE GLOBAL @@ -283,7 +274,7 @@ VARIABLE_COMMENT Deprecated parameter with no effect. NUMERIC_MIN_VALUE 1 NUMERIC_MAX_VALUE 100 -@@ -1273,7 +1273,7 @@ +@@ -1261,7 +1261,7 @@ SESSION_VALUE NULL DEFAULT_VALUE 8192 VARIABLE_SCOPE GLOBAL @@ -292,6 +283,19 @@ VARIABLE_COMMENT Redo log write ahead unit size to avoid read-on-write, it should match the OS cache block IO size NUMERIC_MIN_VALUE 512 NUMERIC_MAX_VALUE 16384 +@@ -1273,10 +1273,10 @@ + SESSION_VALUE NULL + DEFAULT_VALUE 100 + VARIABLE_SCOPE GLOBAL +-VARIABLE_TYPE BIGINT UNSIGNED ++VARIABLE_TYPE INT UNSIGNED + VARIABLE_COMMENT How many pages to flush on LRU eviction + NUMERIC_MIN_VALUE 1 +-NUMERIC_MAX_VALUE 18446744073709551615 ++NUMERIC_MAX_VALUE 4294967295 + NUMERIC_BLOCK_SIZE 0 + ENUM_VALUE_LIST NULL + READ_ONLY NO @@ -1285,10 +1285,10 @@ SESSION_VALUE NULL DEFAULT_VALUE 1024 diff --git a/mysql-test/suite/sys_vars/r/sysvars_innodb.result b/mysql-test/suite/sys_vars/r/sysvars_innodb.result index 5b532addaa8..767d31c033e 100644 --- a/mysql-test/suite/sys_vars/r/sysvars_innodb.result +++ b/mysql-test/suite/sys_vars/r/sysvars_innodb.result @@ -657,18 +657,6 @@ NUMERIC_BLOCK_SIZE NULL ENUM_VALUE_LIST 
OFF,ON READ_ONLY YES COMMAND_LINE_ARGUMENT NONE -VARIABLE_NAME INNODB_DOUBLEWRITE_BATCH_SIZE -SESSION_VALUE NULL -DEFAULT_VALUE 120 -VARIABLE_SCOPE GLOBAL -VARIABLE_TYPE BIGINT UNSIGNED -VARIABLE_COMMENT Number of pages reserved in doublewrite buffer for batch flushing -NUMERIC_MIN_VALUE 1 -NUMERIC_MAX_VALUE 127 -NUMERIC_BLOCK_SIZE 0 -ENUM_VALUE_LIST NULL -READ_ONLY YES -COMMAND_LINE_ARGUMENT OPTIONAL VARIABLE_NAME INNODB_ENCRYPTION_ROTATE_KEY_AGE SESSION_VALUE NULL DEFAULT_VALUE 1 @@ -1281,9 +1269,21 @@ NUMERIC_BLOCK_SIZE 512 ENUM_VALUE_LIST NULL READ_ONLY NO COMMAND_LINE_ARGUMENT REQUIRED +VARIABLE_NAME INNODB_LRU_FLUSH_SIZE +SESSION_VALUE NULL +DEFAULT_VALUE 32 +VARIABLE_SCOPE GLOBAL +VARIABLE_TYPE BIGINT UNSIGNED +VARIABLE_COMMENT How many pages to flush on LRU eviction +NUMERIC_MIN_VALUE 1 +NUMERIC_MAX_VALUE 18446744073709551615 +NUMERIC_BLOCK_SIZE 0 +ENUM_VALUE_LIST NULL +READ_ONLY NO +COMMAND_LINE_ARGUMENT REQUIRED VARIABLE_NAME INNODB_LRU_SCAN_DEPTH SESSION_VALUE NULL -DEFAULT_VALUE 1024 +DEFAULT_VALUE 1536 VARIABLE_SCOPE GLOBAL VARIABLE_TYPE BIGINT UNSIGNED VARIABLE_COMMENT How deep to scan LRU to keep it clean @@ -1307,7 +1307,7 @@ READ_ONLY NO COMMAND_LINE_ARGUMENT OPTIONAL VARIABLE_NAME INNODB_MAX_DIRTY_PAGES_PCT SESSION_VALUE NULL -DEFAULT_VALUE 75.000000 +DEFAULT_VALUE 90.000000 VARIABLE_SCOPE GLOBAL VARIABLE_TYPE DOUBLE VARIABLE_COMMENT Percentage of dirty pages allowed in bufferpool. @@ -1319,7 +1319,7 @@ READ_ONLY NO COMMAND_LINE_ARGUMENT REQUIRED VARIABLE_NAME INNODB_MAX_DIRTY_PAGES_PCT_LWM SESSION_VALUE NULL -DEFAULT_VALUE 0.000000 +DEFAULT_VALUE 75.000000 VARIABLE_SCOPE GLOBAL VARIABLE_TYPE DOUBLE VARIABLE_COMMENT Percentage of dirty pages at which flushing kicks in. 
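The updated test results above exercise a coupling between the two dirty-page thresholds: `innodb_max_dirty_pages_pct_lwm` may never exceed `innodb_max_dirty_pages_pct`, so setting either variable can silently lower the low-water mark and raise Warning 1210. A minimal sketch of that clamping rule (the struct and function names here are illustrative, not MariaDB's actual sysvar update hooks):

```cpp
#include <algorithm>

// Illustrative model of the clamping between the two dirty-page
// thresholds; defaults match the values introduced by this commit.
struct dirty_pct_vars
{
  double pct= 90.0; /* innodb_max_dirty_pages_pct; was 75 */
  double lwm= 75.0; /* innodb_max_dirty_pages_pct_lwm; was 0 (disabled) */

  /* SET GLOBAL innodb_max_dirty_pages_pct = value: if the new maximum
  falls below the low-water mark, the mark is lowered with it.
  @return whether Warning 1210 would be issued */
  bool set_pct(double value)
  {
    pct= value;
    if (lwm <= pct)
      return false;
    lwm= pct;
    return true;
  }

  /* SET GLOBAL innodb_max_dirty_pages_pct_lwm = value: the mark is
  capped at the current maximum.
  @return whether Warning 1210 would be issued */
  bool set_lwm(double value)
  {
    lwm= std::min(value, pct);
    return value > pct;
  }
};
```

This reproduces the sequence in `innodb_max_dirty_pages_pct_func.result`: setting the pct to 10 drags the lwm down to 10, and a subsequent attempt to raise the lwm to 15 is capped back at 10.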
diff --git a/mysql-test/suite/sys_vars/t/innodb_doublewrite_batch_size_basic.test b/mysql-test/suite/sys_vars/t/innodb_doublewrite_batch_size_basic.test deleted file mode 100644 index 5e9104b5335..00000000000 --- a/mysql-test/suite/sys_vars/t/innodb_doublewrite_batch_size_basic.test +++ /dev/null @@ -1,24 +0,0 @@ ---source include/have_innodb.inc ---source include/have_debug.inc - -# -# exists as global only -# -select @@global.innodb_doublewrite_batch_size between 1 and 127; -select @@global.innodb_doublewrite_batch_size; ---error ER_INCORRECT_GLOBAL_LOCAL_VAR -select @@session.innodb_doublewrite_batch_size; -show global variables like 'innodb_doublewrite_batch_size'; -show session variables like 'innodb_doublewrite_batch_size'; ---disable_warnings -select * from information_schema.global_variables where variable_name='innodb_doublewrite_batch_size'; -select * from information_schema.session_variables where variable_name='innodb_doublewrite_batch_size'; ---enable_warnings - -# -# show that it's read-only -# ---error ER_INCORRECT_GLOBAL_LOCAL_VAR -set global innodb_doublewrite_batch_size=1; ---error ER_INCORRECT_GLOBAL_LOCAL_VAR -set @@session.innodb_doublewrite_batch_size='some'; diff --git a/mysql-test/suite/sys_vars/t/innodb_max_dirty_pages_pct_func.test b/mysql-test/suite/sys_vars/t/innodb_max_dirty_pages_pct_func.test index c7a9e567e69..0720aca65b9 100644 --- a/mysql-test/suite/sys_vars/t/innodb_max_dirty_pages_pct_func.test +++ b/mysql-test/suite/sys_vars/t/innodb_max_dirty_pages_pct_func.test @@ -25,6 +25,7 @@ --source include/have_innodb.inc # safe initial value +SET @innodb_max_dirty_pages_pct_lwm = @@global.innodb_max_dirty_pages_pct_lwm; SET @innodb_max_dirty_pages_pct = @@global.innodb_max_dirty_pages_pct; --echo '#--------------------FN_DYNVARS_044_02-------------------------#' @@ -32,23 +33,19 @@ SET @innodb_max_dirty_pages_pct = @@global.innodb_max_dirty_pages_pct; # Check if setting innodb_max_dirty_pages_pct is changed in new connection # 
############################################################################ +SET @@global.innodb_max_dirty_pages_pct_lwm = 0; SET @@global.innodb_max_dirty_pages_pct = 80; ---echo 'connect (con1,localhost,root,,,,)' +SET @@global.innodb_max_dirty_pages_pct_lwm = 80; connect (con1,localhost,root,,,,); ---echo 'connection con1' connection con1; SELECT @@global.innodb_max_dirty_pages_pct; SET @@global.innodb_max_dirty_pages_pct = 70; ---echo 'connect (con2,localhost,root,,,,)' +SELECT @@global.innodb_max_dirty_pages_pct_lwm; connect (con2,localhost,root,,,,); ---echo 'connection con2' connection con2; SELECT @@global.innodb_max_dirty_pages_pct; ---echo 'connection default' connection default; ---echo 'disconnect con2' disconnect con2; ---echo 'disconnect con1' disconnect con1; # restore initial value SET @@global.innodb_max_dirty_pages_pct = @innodb_max_dirty_pages_pct; @@ -138,6 +135,10 @@ b CHAR(200) #========================================================== SET @@global.innodb_max_dirty_pages_pct = 10; +SELECT @@global.innodb_max_dirty_pages_pct_lwm; +SET GLOBAL innodb_max_dirty_pages_pct_lwm = 15; +SELECT @@global.innodb_max_dirty_pages_pct_lwm; +SELECT @@global.innodb_max_dirty_pages_pct; FLUSH STATUS; @@ -164,7 +165,9 @@ DROP FUNCTION dirty_pct; DROP TABLE t1; # restore initial value +SET GLOBAL innodb_max_dirty_pages_pct_lwm = 0; SET @@global.innodb_max_dirty_pages_pct = @innodb_max_dirty_pages_pct; +SET @@global.innodb_max_dirty_pages_pct_lwm = @innodb_max_dirty_pages_pct_lwm; ################################################################## # End of functionality Testing for innodb_max_dirty_pages_pct # diff --git a/storage/innobase/btr/btr0bulk.cc b/storage/innobase/btr/btr0bulk.cc index a44b96add2c..e64f3c8c803 100644 --- a/storage/innobase/btr/btr0bulk.cc +++ b/storage/innobase/btr/btr0bulk.cc @@ -981,7 +981,7 @@ inline void BtrBulk::logFreeCheck() if (log_sys.check_flush_or_checkpoint()) { release(); - log_free_check(); + log_check_margins(); latch(); 
} @@ -1113,7 +1113,7 @@ BtrBulk::insert( /* Wake up page cleaner to flush dirty pages. */ srv_inc_activity_count(); - os_event_set(buf_flush_event); + mysql_cond_signal(&buf_pool.do_flush_list); logFreeCheck(); } diff --git a/storage/innobase/btr/btr0cur.cc b/storage/innobase/btr/btr0cur.cc index a190c11eb0a..f7c0e080cdd 100644 --- a/storage/innobase/btr/btr0cur.cc +++ b/storage/innobase/btr/btr0cur.cc @@ -7058,7 +7058,7 @@ static void btr_blob_free(buf_block_t *block, bool all, mtr_t *mtr) const ulint fold= page_id.fold(); - mutex_enter(&buf_pool.mutex); + mysql_mutex_lock(&buf_pool.mutex); if (buf_page_t *bpage= buf_pool.page_hash_get_low(page_id, fold)) if(!buf_LRU_free_page(bpage, all) && all && bpage->zip.data) @@ -7066,7 +7066,7 @@ static void btr_blob_free(buf_block_t *block, bool all, mtr_t *mtr) if the whole ROW_FORMAT=COMPRESSED block cannot be deallocted. */ buf_LRU_free_page(bpage, false); - mutex_exit(&buf_pool.mutex); + mysql_mutex_unlock(&buf_pool.mutex); } /** Helper class used while writing blob pages, during insert or update. */ @@ -8253,7 +8253,7 @@ btr_rec_copy_externally_stored_field( field_ref_zero, BTR_EXTERN_FIELD_REF_SIZE))) { /* The externally stored field was not written yet. This record should only be seen by - recv_recovery_rollback_active() or any + trx_rollback_recovered() or any TRX_ISO_READ_UNCOMMITTED transactions. 
*/ return(NULL); } diff --git a/storage/innobase/btr/btr0sea.cc b/storage/innobase/btr/btr0sea.cc index 9ee2342420b..238551783db 100644 --- a/storage/innobase/btr/btr0sea.cc +++ b/storage/innobase/btr/btr0sea.cc @@ -261,9 +261,9 @@ void btr_search_disable() void btr_search_enable(bool resize) { if (!resize) { - mutex_enter(&buf_pool.mutex); + mysql_mutex_lock(&buf_pool.mutex); bool changed = srv_buf_pool_old_size != srv_buf_pool_size; - mutex_exit(&buf_pool.mutex); + mysql_mutex_unlock(&buf_pool.mutex); if (changed) { return; } @@ -2133,7 +2133,7 @@ btr_search_hash_table_validate(ulint hash_table_id) rec_offs_init(offsets_); - mutex_enter(&buf_pool.mutex); + mysql_mutex_lock(&buf_pool.mutex); auto &part = btr_search_sys.parts[hash_table_id]; @@ -2144,7 +2144,7 @@ btr_search_hash_table_validate(ulint hash_table_id) give other queries a chance to run. */ if ((i != 0) && ((i % chunk_size) == 0)) { - mutex_exit(&buf_pool.mutex); + mysql_mutex_unlock(&buf_pool.mutex); btr_search_x_unlock_all(); os_thread_yield(); @@ -2156,7 +2156,7 @@ btr_search_hash_table_validate(ulint hash_table_id) goto func_exit; } - mutex_enter(&buf_pool.mutex); + mysql_mutex_lock(&buf_pool.mutex); ulint curr_cell_count = part.table.n_cells; @@ -2252,7 +2252,7 @@ state_ok: /* We release search latches every once in a while to give other queries a chance to run. 
*/ if (i != 0) { - mutex_exit(&buf_pool.mutex); + mysql_mutex_unlock(&buf_pool.mutex); btr_search_x_unlock_all(); os_thread_yield(); @@ -2264,7 +2264,7 @@ state_ok: goto func_exit; } - mutex_enter(&buf_pool.mutex); + mysql_mutex_lock(&buf_pool.mutex); ulint curr_cell_count = part.table.n_cells; @@ -2285,7 +2285,7 @@ state_ok: } } - mutex_exit(&buf_pool.mutex); + mysql_mutex_unlock(&buf_pool.mutex); func_exit: btr_search_x_unlock_all(); diff --git a/storage/innobase/buf/buf0buddy.cc b/storage/innobase/buf/buf0buddy.cc index 8280377b42a..a83d0840cb5 100644 --- a/storage/innobase/buf/buf0buddy.cc +++ b/storage/innobase/buf/buf0buddy.cc @@ -192,7 +192,7 @@ static bool buf_buddy_check_free(const buf_buddy_free_t* buf, ulint i) { const ulint size = BUF_BUDDY_LOW << i; - ut_ad(mutex_own(&buf_pool.mutex)); + mysql_mutex_assert_owner(&buf_pool.mutex); ut_ad(!ut_align_offset(buf, size)); ut_ad(i >= buf_buddy_get_slot(UNIV_ZIP_SIZE_MIN)); @@ -261,7 +261,7 @@ UNIV_INLINE void buf_buddy_add_to_free(buf_buddy_free_t* buf, ulint i) { - ut_ad(mutex_own(&buf_pool.mutex)); + mysql_mutex_assert_owner(&buf_pool.mutex); ut_ad(buf_pool.zip_free[i].start != buf); buf_buddy_stamp_free(buf, i); @@ -276,7 +276,7 @@ UNIV_INLINE void buf_buddy_remove_from_free(buf_buddy_free_t* buf, ulint i) { - ut_ad(mutex_own(&buf_pool.mutex)); + mysql_mutex_assert_owner(&buf_pool.mutex); ut_ad(buf_buddy_check_free(buf, i)); UT_LIST_REMOVE(buf_pool.zip_free[i], buf); @@ -290,7 +290,7 @@ static buf_buddy_free_t* buf_buddy_alloc_zip(ulint i) { buf_buddy_free_t* buf; - ut_ad(mutex_own(&buf_pool.mutex)); + mysql_mutex_assert_owner(&buf_pool.mutex); ut_a(i < BUF_BUDDY_SIZES); ut_a(i >= buf_buddy_get_slot(UNIV_ZIP_SIZE_MIN)); @@ -350,7 +350,7 @@ buf_buddy_block_free(void* buf) buf_page_t* bpage; buf_block_t* block; - ut_ad(mutex_own(&buf_pool.mutex)); + mysql_mutex_assert_owner(&buf_pool.mutex); ut_a(!ut_align_offset(buf, srv_page_size)); HASH_SEARCH(hash, &buf_pool.zip_hash, fold, buf_page_t*, bpage, @@ -433,7 
+433,7 @@ byte *buf_buddy_alloc_low(ulint i, bool *lru) { buf_block_t* block; - ut_ad(mutex_own(&buf_pool.mutex)); + mysql_mutex_assert_owner(&buf_pool.mutex); ut_ad(i >= buf_buddy_get_slot(UNIV_ZIP_SIZE_MIN)); if (i < BUF_BUDDY_SIZES) { @@ -483,7 +483,7 @@ static bool buf_buddy_relocate(void* src, void* dst, ulint i, bool force) ulint space; ulint offset; - ut_ad(mutex_own(&buf_pool.mutex)); + mysql_mutex_assert_owner(&buf_pool.mutex); ut_ad(!ut_align_offset(src, size)); ut_ad(!ut_align_offset(dst, size)); ut_ad(i >= buf_buddy_get_slot(UNIV_ZIP_SIZE_MIN)); @@ -584,7 +584,7 @@ void buf_buddy_free_low(void* buf, ulint i) { buf_buddy_free_t* buddy; - ut_ad(mutex_own(&buf_pool.mutex)); + mysql_mutex_assert_owner(&buf_pool.mutex); ut_ad(i <= BUF_BUDDY_SIZES); ut_ad(i >= buf_buddy_get_slot(UNIV_ZIP_SIZE_MIN)); ut_ad(buf_pool.buddy_stat[i].used > 0); @@ -670,7 +670,7 @@ buf_buddy_realloc(void* buf, ulint size) buf_block_t* block = NULL; ulint i = buf_buddy_get_slot(size); - ut_ad(mutex_own(&buf_pool.mutex)); + mysql_mutex_assert_owner(&buf_pool.mutex); ut_ad(i <= BUF_BUDDY_SIZES); ut_ad(i >= buf_buddy_get_slot(UNIV_ZIP_SIZE_MIN)); @@ -711,7 +711,7 @@ buf_buddy_realloc(void* buf, ulint size) /** Combine all pairs of free buddies. */ void buf_buddy_condense_free() { - ut_ad(mutex_own(&buf_pool.mutex)); + mysql_mutex_assert_owner(&buf_pool.mutex); ut_ad(buf_pool.curr_size < buf_pool.old_size); for (ulint i = 0; i < UT_ARR_SIZE(buf_pool.zip_free); ++i) { diff --git a/storage/innobase/buf/buf0buf.cc b/storage/innobase/buf/buf0buf.cc index 078361092fc..6bde6939fd6 100644 --- a/storage/innobase/buf/buf0buf.cc +++ b/storage/innobase/buf/buf0buf.cc @@ -525,7 +525,7 @@ decrypt_failed: @retval 0 if all modified persistent pages have been flushed */ lsn_t buf_pool_t::get_oldest_modification() { - mutex_enter(&flush_list_mutex); + mysql_mutex_lock(&flush_list_mutex); /* FIXME: Keep temporary tablespace pages in a separate flush list. 
We would only need to write out temporary pages if the @@ -538,7 +538,7 @@ lsn_t buf_pool_t::get_oldest_modification() ut_ad(bpage->oldest_modification()); lsn_t oldest_lsn= bpage ? bpage->oldest_modification() : 0; - mutex_exit(&flush_list_mutex); + mysql_mutex_unlock(&flush_list_mutex); /* The result may become stale as soon as we released the mutex. On log checkpoint, also log_sys.flush_order_mutex will be needed. */ @@ -1063,14 +1063,14 @@ buf_madvise_do_dump() ret+= madvise(recv_sys.buf, recv_sys.len, MADV_DODUMP); } - mutex_enter(&buf_pool.mutex); + mysql_mutex_lock(&buf_pool.mutex); auto chunk = buf_pool.chunks; for (ulint n = buf_pool.n_chunks; n--; chunk++) { ret+= madvise(chunk->mem, chunk->mem_size(), MADV_DODUMP); } - mutex_exit(&buf_pool.mutex); + mysql_mutex_unlock(&buf_pool.mutex); return ret; } #endif @@ -1546,7 +1546,7 @@ bool buf_pool_t::create() while (++chunk < chunks + n_chunks); ut_ad(is_initialised()); - mutex_create(LATCH_ID_BUF_POOL, &mutex); + mysql_mutex_init(buf_pool_mutex_key, &mutex, MY_MUTEX_INIT_FAST); UT_LIST_INIT(LRU, &buf_page_t::LRU); UT_LIST_INIT(withdraw, &buf_page_t::list); @@ -1570,17 +1570,18 @@ bool buf_pool_t::create() zip_hash.create(2 * curr_size); last_printout_time= time(NULL); - mutex_create(LATCH_ID_FLUSH_LIST, &flush_list_mutex); + mysql_mutex_init(flush_list_mutex_key, &flush_list_mutex, + MY_MUTEX_INIT_FAST); - for (int i= 0; i < 3; i++) - no_flush[i]= os_event_create(0); + mysql_cond_init(0, &done_flush_LRU, nullptr); + mysql_cond_init(0, &done_flush_list, nullptr); + mysql_cond_init(0, &do_flush_list, nullptr); try_LRU_scan= true; ut_d(flush_hp.m_mutex= &flush_list_mutex;); ut_d(lru_hp.m_mutex= &mutex); ut_d(lru_scan_itr.m_mutex= &mutex); - ut_d(single_scan_itr.m_mutex= &mutex); io_buf.create((srv_n_read_io_threads + srv_n_write_io_threads) * OS_AIO_N_PENDING_IOS_PER_THREAD); @@ -1604,8 +1605,8 @@ void buf_pool_t::close() if (!is_initialised()) return; - mutex_free(&mutex); - mutex_free(&flush_list_mutex); + 
mysql_mutex_destroy(&mutex); + mysql_mutex_destroy(&flush_list_mutex); for (buf_page_t *bpage= UT_LIST_GET_LAST(LRU), *prev_bpage= nullptr; bpage; bpage= prev_bpage) @@ -1633,8 +1634,9 @@ void buf_pool_t::close() allocator.deallocate_large_dodump(chunk->mem, &chunk->mem_pfx); } - for (int i= 0; i < 3; ++i) - os_event_destroy(no_flush[i]); + mysql_cond_destroy(&done_flush_LRU); + mysql_cond_destroy(&done_flush_list); + mysql_cond_destroy(&do_flush_list); ut_free(chunks); chunks= nullptr; @@ -1661,7 +1663,7 @@ inline bool buf_pool_t::realloc(buf_block_t *block) buf_block_t* new_block; ut_ad(withdrawing); - ut_ad(mutex_own(&mutex)); + mysql_mutex_assert_owner(&mutex); ut_ad(block->page.state() == BUF_BLOCK_FILE_PAGE); new_block = buf_LRU_get_free_only(); @@ -1732,13 +1734,8 @@ inline bool buf_pool_t::realloc(buf_block_t *block) + FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID, 0xff, 4); MEM_UNDEFINED(block->frame, srv_page_size); block->page.set_state(BUF_BLOCK_REMOVE_HASH); - - /* Relocate flush_list. */ - if (block->page.oldest_modification()) { - buf_flush_relocate_on_flush_list( - &block->page, &new_block->page); - } - + buf_flush_relocate_on_flush_list(&block->page, + &new_block->page); block->page.set_corrupt_id(); /* set other flags of buf_block_t */ @@ -1804,16 +1801,16 @@ inline bool buf_pool_t::withdraw_blocks() << withdraw_target << " blocks"; /* Minimize zip_free[i] lists */ - mutex_enter(&mutex); + mysql_mutex_lock(&mutex); buf_buddy_condense_free(); - mutex_exit(&mutex); + mysql_mutex_unlock(&mutex); while (UT_LIST_GET_LEN(withdraw) < withdraw_target) { /* try to withdraw from free_list */ ulint count1 = 0; - mutex_enter(&mutex); + mysql_mutex_lock(&mutex); block = reinterpret_cast<buf_block_t*>( UT_LIST_GET_FIRST(free)); while (block != NULL @@ -1828,7 +1825,7 @@ inline bool buf_pool_t::withdraw_blocks() UT_LIST_GET_NEXT( list, &block->page)); - if (buf_pool.will_be_withdrawn(block->page)) { + if (will_be_withdrawn(block->page)) { /* This should be withdrawn */ 
UT_LIST_REMOVE(free, &block->page); UT_LIST_ADD_LAST(withdraw, &block->page); @@ -1838,40 +1835,29 @@ inline bool buf_pool_t::withdraw_blocks() block = next_block; } - mutex_exit(&mutex); + mysql_mutex_unlock(&mutex); /* reserve free_list length */ if (UT_LIST_GET_LEN(withdraw) < withdraw_target) { - ulint scan_depth; - flush_counters_t n; - - /* cap scan_depth with current LRU size. */ - mutex_enter(&mutex); - scan_depth = UT_LIST_GET_LEN(LRU); - mutex_exit(&mutex); + ulint n_flushed = buf_flush_lists( + std::max<ulint>(withdraw_target + - UT_LIST_GET_LEN(withdraw), + srv_LRU_scan_depth), 0); + buf_flush_wait_batch_end_acquiring_mutex(true); - scan_depth = ut_min( - ut_max(withdraw_target - - UT_LIST_GET_LEN(withdraw), - static_cast<ulint>(srv_LRU_scan_depth)), - scan_depth); - - buf_flush_do_batch(true, scan_depth, 0, &n); - buf_flush_wait_batch_end(true); - - if (n.flushed) { + if (n_flushed) { MONITOR_INC_VALUE_CUMULATIVE( MONITOR_LRU_BATCH_FLUSH_TOTAL_PAGE, MONITOR_LRU_BATCH_FLUSH_COUNT, MONITOR_LRU_BATCH_FLUSH_PAGES, - n.flushed); + n_flushed); } } /* relocate blocks/buddies in withdrawn area */ ulint count2 = 0; - mutex_enter(&mutex); + mysql_mutex_lock(&mutex); buf_page_t* bpage; bpage = UT_LIST_GET_FIRST(LRU); while (bpage != NULL) { @@ -1892,7 +1878,7 @@ inline bool buf_pool_t::withdraw_blocks() } if (bpage->state() == BUF_BLOCK_FILE_PAGE - && buf_pool.will_be_withdrawn(*bpage)) { + && will_be_withdrawn(*bpage)) { if (bpage->can_relocate()) { buf_pool_mutex_exit_forbid(); if (!realloc( @@ -1911,7 +1897,7 @@ inline bool buf_pool_t::withdraw_blocks() bpage = next_bpage; } - mutex_exit(&mutex); + mysql_mutex_unlock(&mutex); buf_resize_status( "withdrawing blocks. 
(" ULINTPF "/" ULINTPF ")", @@ -2031,7 +2017,7 @@ inline void buf_pool_t::page_hash_table::write_unlock_all() inline void buf_pool_t::write_lock_all_page_hash() { - ut_ad(mutex_own(&mutex)); + mysql_mutex_assert_owner(&mutex); page_hash.write_lock_all(); for (page_hash_table *old_page_hash= freed_page_hash; old_page_hash; old_page_hash= static_cast<page_hash_table*> @@ -2103,7 +2089,7 @@ inline void buf_pool_t::resize() srv_buf_pool_old_size, srv_buf_pool_size, srv_buf_pool_chunk_unit); - mutex_enter(&mutex); + mysql_mutex_lock(&mutex); ut_ad(curr_size == old_size); ut_ad(n_chunks_new == n_chunks); ut_ad(UT_LIST_GET_LEN(withdraw) == 0); @@ -2111,7 +2097,7 @@ inline void buf_pool_t::resize() n_chunks_new = (new_instance_size << srv_page_size_shift) / srv_buf_pool_chunk_unit; curr_size = n_chunks_new * chunks->size; - mutex_exit(&mutex); + mysql_mutex_unlock(&mutex); #ifdef BTR_CUR_HASH_ADAPT /* disable AHI if needed */ @@ -2225,7 +2211,7 @@ withdraw_retry: /* Indicate critical path */ resizing.store(true, std::memory_order_relaxed); - mutex_enter(&mutex); + mysql_mutex_lock(&mutex); write_lock_all_page_hash(); chunk_t::map_reg = UT_NEW_NOKEY(chunk_t::map()); @@ -2389,7 +2375,7 @@ calc_buf_pool_size: ib::info() << "hash tables were resized"; } - mutex_exit(&mutex); + mysql_mutex_unlock(&mutex); write_unlock_all_page_hash(); UT_DELETE(chunk_map_old); @@ -2454,10 +2440,10 @@ static void buf_resize_callback(void *) { DBUG_ENTER("buf_resize_callback"); ut_a(srv_shutdown_state == SRV_SHUTDOWN_NONE); - mutex_enter(&buf_pool.mutex); + mysql_mutex_lock(&buf_pool.mutex); const auto size= srv_buf_pool_size; const bool work= srv_buf_pool_old_size != size; - mutex_exit(&buf_pool.mutex); + mysql_mutex_unlock(&buf_pool.mutex); if (work) buf_pool.resize(); @@ -2495,7 +2481,7 @@ static void buf_relocate(buf_page_t *bpage, buf_page_t *dpage) { const ulint fold= bpage->id().fold(); ut_ad(bpage->state() == BUF_BLOCK_ZIP_PAGE); - ut_ad(mutex_own(&buf_pool.mutex)); + 
mysql_mutex_assert_owner(&buf_pool.mutex); ut_ad(buf_pool.hash_lock_get(bpage->id())->is_write_locked()); ut_a(bpage->io_fix() == BUF_IO_NONE); ut_a(!bpage->buf_fix_count()); @@ -2568,7 +2554,7 @@ retry: (*hash_lock)->write_unlock(); /* Allocate a watch[] and then try to insert it into the page_hash. */ - mutex_enter(&mutex); + mysql_mutex_lock(&mutex); /* The maximum number of purge tasks should never exceed the UT_ARR_SIZE(watch) - 1, and there is no way for a purge task to hold a @@ -2592,17 +2578,17 @@ retry: *hash_lock= page_hash.lock_get(fold); (*hash_lock)->write_lock(); - mutex_exit(&mutex); + mysql_mutex_unlock(&mutex); buf_page_t *bpage= page_hash_get_low(id, fold); if (UNIV_LIKELY_NULL(bpage)) { (*hash_lock)->write_unlock(); - mutex_enter(&mutex); + mysql_mutex_lock(&mutex); w->set_state(BUF_BLOCK_NOT_USED); *hash_lock= page_hash.lock_get(fold); (*hash_lock)->write_lock(); - mutex_exit(&mutex); + mysql_mutex_unlock(&mutex); goto retry; } @@ -2615,7 +2601,7 @@ retry: } ut_error; - mutex_exit(&mutex); + mysql_mutex_unlock(&mutex); return nullptr; } @@ -2733,10 +2719,10 @@ err_exit: { discard_attempted= true; hash_lock->read_unlock(); - mutex_enter(&buf_pool.mutex); + mysql_mutex_lock(&buf_pool.mutex); if (buf_page_t *bpage= buf_pool.page_hash_get_low(page_id, fold)) buf_LRU_free_page(bpage, false); - mutex_exit(&buf_pool.mutex); + mysql_mutex_unlock(&buf_pool.mutex); goto lookup; } @@ -3261,14 +3247,14 @@ got_block: if (UNIV_UNLIKELY(mode == BUF_EVICT_IF_IN_POOL)) { evict_from_pool: ut_ad(!fix_block->page.oldest_modification()); - mutex_enter(&buf_pool.mutex); + mysql_mutex_lock(&buf_pool.mutex); fix_block->unfix(); if (!buf_LRU_free_page(&fix_block->page, true)) { ut_ad(0); } - mutex_exit(&buf_pool.mutex); + mysql_mutex_unlock(&buf_pool.mutex); return(NULL); } @@ -3317,7 +3303,7 @@ evict_from_pool: block = buf_LRU_get_free_block(false); buf_block_init_low(block); - mutex_enter(&buf_pool.mutex); + mysql_mutex_lock(&buf_pool.mutex); hash_lock = 
buf_pool.page_hash.lock_get(fold); hash_lock->write_lock(); @@ -3336,7 +3322,7 @@ evict_from_pool: hash_lock->write_unlock(); buf_LRU_block_free_non_file_page(block); - mutex_exit(&buf_pool.mutex); + mysql_mutex_unlock(&buf_pool.mutex); /* Try again */ goto loop; @@ -3355,9 +3341,7 @@ evict_from_pool: /* Set after buf_relocate(). */ block->page.set_buf_fix_count(1); - if (block->page.oldest_modification()) { - buf_flush_relocate_on_flush_list(bpage, &block->page); - } + buf_flush_relocate_on_flush_list(bpage, &block->page); /* Buffer-fix, I/O-fix, and X-latch the block for the duration of the decompression. @@ -3372,7 +3356,7 @@ evict_from_pool: MEM_UNDEFINED(bpage, sizeof *bpage); - mutex_exit(&buf_pool.mutex); + mysql_mutex_unlock(&buf_pool.mutex); hash_lock->write_unlock(); buf_pool.n_pend_unzip++; @@ -3412,7 +3396,7 @@ evict_from_pool: ut_ad(fix_block->page.state() == BUF_BLOCK_FILE_PAGE); #if defined UNIV_DEBUG || defined UNIV_IBUF_DEBUG - +re_evict: if (mode != BUF_GET_IF_IN_POOL && mode != BUF_GET_IF_IN_POOL_OR_WATCH) { } else if (!ibuf_debug) { @@ -3421,18 +3405,19 @@ evict_from_pool: /* Try to evict the block from the buffer pool, to use the insert buffer (change buffer) as much as possible. */ - mutex_enter(&buf_pool.mutex); + mysql_mutex_lock(&buf_pool.mutex); fix_block->unfix(); /* Blocks cannot be relocated or enter or exit the buf_pool while we are holding the buf_pool.mutex. */ + const bool evicted = buf_LRU_free_page(&fix_block->page, true); + space->release_for_io(); - if (buf_LRU_free_page(&fix_block->page, true)) { - space->release_for_io(); + if (evicted) { hash_lock = buf_pool.page_hash.lock_get(fold); hash_lock->write_lock(); - mutex_exit(&buf_pool.mutex); + mysql_mutex_unlock(&buf_pool.mutex); /* We may set the watch, as it would have been set if the page were not in the buffer pool in the first place. 
*/ @@ -3456,20 +3441,16 @@ evict_from_pool: return(NULL); } - bool flushed = fix_block->page.ready_for_flush() - && buf_flush_page(&fix_block->page, - IORequest::SINGLE_PAGE, space, true); - space->release_for_io(); - if (flushed) { - guess = fix_block; - goto loop; - } - fix_block->fix(); + mysql_mutex_unlock(&buf_pool.mutex); + buf_flush_lists(ULINT_UNDEFINED, LSN_MAX); + buf_flush_wait_batch_end_acquiring_mutex(false); - /* Failed to evict the page; change it directly */ + if (!fix_block->page.oldest_modification()) { + goto re_evict; + } - mutex_exit(&buf_pool.mutex); + /* Failed to evict the page; change it directly */ } #endif /* UNIV_DEBUG || UNIV_IBUF_DEBUG */ @@ -3793,23 +3774,22 @@ FILE_PAGE (the other is buf_page_get_gen). @param[in] offset offset of the tablespace @param[in] zip_size ROW_FORMAT=COMPRESSED page size, or 0 @param[in,out] mtr mini-transaction +@param[in,out] free_block pre-allocated buffer block @return pointer to the block, page bufferfixed */ buf_block_t* buf_page_create(fil_space_t *space, uint32_t offset, - ulint zip_size, mtr_t *mtr) + ulint zip_size, mtr_t *mtr, buf_block_t *free_block) { page_id_t page_id(space->id, offset); ut_ad(mtr->is_active()); ut_ad(page_id.space() != 0 || !zip_size); space->free_page(offset, false); -loop: - buf_block_t *free_block= buf_LRU_get_free_block(false); free_block->initialise(page_id, zip_size, 1); const ulint fold= page_id.fold(); - - mutex_enter(&buf_pool.mutex); +loop: + mysql_mutex_lock(&buf_pool.mutex); buf_block_t *block= reinterpret_cast<buf_block_t*> (buf_pool.page_hash_get_low(page_id, fold)); @@ -3820,7 +3800,7 @@ loop: #ifdef BTR_CUR_HASH_ADAPT const dict_index_t *drop_hash_entry= nullptr; #endif - switch (block->page.state()) { + switch (UNIV_EXPECT(block->page.state(), BUF_BLOCK_FILE_PAGE)) { default: ut_ad(0); break; @@ -3831,16 +3811,15 @@ loop: while (block->page.io_fix() != BUF_IO_NONE || num_fix_count != block->page.buf_fix_count()) { - mutex_exit(&buf_pool.mutex); + 
mysql_mutex_unlock(&buf_pool.mutex); os_thread_yield(); - mutex_enter(&buf_pool.mutex); + mysql_mutex_lock(&buf_pool.mutex); } } rw_lock_x_lock(&block->lock); #ifdef BTR_CUR_HASH_ADAPT drop_hash_entry= block->index; #endif - buf_LRU_block_free_non_file_page(free_block); break; case BUF_BLOCK_ZIP_PAGE: page_hash_latch *hash_lock= buf_pool.page_hash.lock_get(fold); @@ -3849,15 +3828,13 @@ loop: { hash_lock->write_unlock(); buf_LRU_block_free_non_file_page(free_block); - mutex_exit(&buf_pool.mutex); + mysql_mutex_unlock(&buf_pool.mutex); goto loop; } rw_lock_x_lock(&free_block->lock); buf_relocate(&block->page, &free_block->page); - - if (block->page.oldest_modification() > 0) - buf_flush_relocate_on_flush_list(&block->page, &free_block->page); + buf_flush_relocate_on_flush_list(&block->page, &free_block->page); free_block->page.set_state(BUF_BLOCK_FILE_PAGE); buf_unzip_LRU_add_block(free_block, FALSE); @@ -3868,7 +3845,7 @@ loop: break; } - mutex_exit(&buf_pool.mutex); + mysql_mutex_unlock(&buf_pool.mutex); #ifdef BTR_CUR_HASH_ADAPT if (drop_hash_entry) @@ -3926,7 +3903,7 @@ loop: else hash_lock->write_unlock(); - mutex_exit(&buf_pool.mutex); + mysql_mutex_unlock(&buf_pool.mutex); mtr->memo_push(block, MTR_MEMO_PAGE_X_FIX); block->page.set_accessed(); @@ -4074,12 +4051,12 @@ static void buf_mark_space_corrupt(buf_page_t* bpage, const fil_space_t& space) /** Release and evict a corrupted page. 
@param bpage page that was being read */ -void buf_pool_t::corrupted_evict(buf_page_t *bpage) +ATTRIBUTE_COLD void buf_pool_t::corrupted_evict(buf_page_t *bpage) { const page_id_t id(bpage->id()); page_hash_latch *hash_lock= hash_lock_get(id); - mutex_enter(&mutex); + mysql_mutex_lock(&mutex); hash_lock->write_lock(); ut_ad(bpage->io_fix() == BUF_IO_READ); @@ -4094,7 +4071,7 @@ void buf_pool_t::corrupted_evict(buf_page_t *bpage) /* remove from LRU and page_hash */ buf_LRU_free_one_page(bpage, id, hash_lock); - mutex_exit(&mutex); + mysql_mutex_unlock(&mutex); ut_d(auto n=) n_pend_reads--; ut_ad(n > 0); @@ -4104,6 +4081,7 @@ void buf_pool_t::corrupted_evict(buf_page_t *bpage) @param[in] bpage Corrupted page @param[in] node data file Also remove the bpage from LRU list. */ +ATTRIBUTE_COLD static void buf_corrupt_page_release(buf_page_t *bpage, const fil_node_t &node) { ut_ad(bpage->id().space() == node.space->id); @@ -4220,7 +4198,7 @@ dberr_t buf_page_read_complete(buf_page_t *bpage, const fil_node_t &node) { const page_id_t id(bpage->id()); ut_ad(bpage->in_file()); - ut_ad(id.space() || !buf_dblwr_page_inside(id.page_no())); + ut_ad(!buf_dblwr.is_inside(id)); ut_ad(id.space() == node.space->id); ut_ad(bpage->zip_size() == node.space->zip_size()); @@ -4287,7 +4265,7 @@ dberr_t buf_page_read_complete(buf_page_t *bpage, const fil_node_t &node) } err= buf_page_check_corrupt(bpage, node); - if (err != DB_SUCCESS) + if (UNIV_UNLIKELY(err != DB_SUCCESS)) { database_corrupted: /* Not a real corruption if it was triggered by error injection */ @@ -4375,12 +4353,12 @@ release_page: @retval nullptr if all freed */ void buf_pool_t::assert_all_freed() { - mutex_enter(&mutex); + mysql_mutex_lock(&mutex); const chunk_t *chunk= chunks; for (auto i= n_chunks; i--; chunk++) if (const buf_block_t* block= chunk->not_freed()) ib::fatal() << "Page " << block->page.id() << " still fixed or dirty"; - mutex_exit(&mutex); + mysql_mutex_unlock(&mutex); } #endif /* UNIV_DEBUG */ @@ -4395,33 
+4373,20 @@ void buf_refresh_io_stats() All pages must be in a replaceable state (not modified or latched). */ void buf_pool_invalidate() { - mutex_enter(&buf_pool.mutex); - ut_ad(!buf_pool.init_flush[IORequest::LRU]); - ut_ad(!buf_pool.init_flush[IORequest::FLUSH_LIST]); - ut_ad(!buf_pool.init_flush[IORequest::SINGLE_PAGE]); - ut_ad(!buf_pool.n_flush[IORequest::SINGLE_PAGE]); - - if (buf_pool.n_flush[IORequest::LRU]) { - mutex_exit(&buf_pool.mutex); - buf_flush_wait_batch_end(true); - mutex_enter(&buf_pool.mutex); - } + mysql_mutex_lock(&buf_pool.mutex); - if (buf_pool.n_flush[IORequest::FLUSH_LIST]) { - mutex_exit(&buf_pool.mutex); - buf_flush_wait_batch_end(false); - mutex_enter(&buf_pool.mutex); - } + buf_flush_wait_batch_end(true); + buf_flush_wait_batch_end(false); /* It is possible that a write batch that has been posted earlier is still not complete. For buffer pool invalidation to proceed we must ensure there is NO write activity happening. */ - ut_d(mutex_exit(&buf_pool.mutex)); + ut_d(mysql_mutex_unlock(&buf_pool.mutex)); ut_d(buf_pool.assert_all_freed()); - ut_d(mutex_enter(&buf_pool.mutex)); + ut_d(mysql_mutex_lock(&buf_pool.mutex)); - while (buf_LRU_scan_and_free_block(true)); + while (buf_LRU_scan_and_free_block()); ut_ad(UT_LIST_GET_LEN(buf_pool.LRU) == 0); ut_ad(UT_LIST_GET_LEN(buf_pool.unzip_LRU) == 0); @@ -4432,7 +4397,7 @@ void buf_pool_invalidate() memset(&buf_pool.stat, 0x00, sizeof(buf_pool.stat)); buf_refresh_io_stats(); - mutex_exit(&buf_pool.mutex); + mysql_mutex_unlock(&buf_pool.mutex); } #ifdef UNIV_DEBUG @@ -4444,7 +4409,7 @@ void buf_pool_t::validate() ulint n_free = 0; ulint n_zip = 0; - mutex_enter(&mutex); + mysql_mutex_lock(&mutex); chunk_t* chunk = chunks; @@ -4485,7 +4450,7 @@ void buf_pool_t::validate() /* Check dirty blocks. 
*/ - mutex_enter(&flush_list_mutex); + mysql_mutex_lock(&flush_list_mutex); for (buf_page_t* b = UT_LIST_GET_FIRST(flush_list); b; b = UT_LIST_GET_NEXT(list, b)) { ut_ad(b->oldest_modification()); @@ -4511,7 +4476,7 @@ void buf_pool_t::validate() ut_ad(UT_LIST_GET_LEN(flush_list) == n_flushing); - mutex_exit(&flush_list_mutex); + mysql_mutex_unlock(&flush_list_mutex); if (curr_size == old_size && n_lru + n_free > curr_size + n_zip) { @@ -4531,7 +4496,7 @@ void buf_pool_t::validate() << ", free blocks " << n_free << ". Aborting..."; } - mutex_exit(&mutex); + mysql_mutex_unlock(&mutex); ut_d(buf_LRU_validate()); ut_d(buf_flush_validate()); @@ -4559,8 +4524,8 @@ void buf_pool_t::print() counts = static_cast<ulint*>(ut_malloc_nokey(sizeof(ulint) * size)); - mutex_enter(&mutex); - mutex_enter(&flush_list_mutex); + mysql_mutex_lock(&mutex); + mysql_mutex_lock(&flush_list_mutex); ib::info() << "[buffer pool: size=" << curr_size @@ -4570,16 +4535,15 @@ void buf_pool_t::print() << UT_LIST_GET_LEN(flush_list) << ", n pending decompressions=" << n_pend_unzip << ", n pending reads=" << n_pend_reads - << ", n pending flush LRU=" << n_flush[IORequest::LRU] - << " list=" << n_flush[IORequest::FLUSH_LIST] - << " single page=" << n_flush[IORequest::SINGLE_PAGE] + << ", n pending flush LRU=" << n_flush_LRU + << " list=" << n_flush_list << ", pages made young=" << stat.n_pages_made_young << ", not young=" << stat.n_pages_not_made_young << ", pages read=" << stat.n_pages_read << ", created=" << stat.n_pages_created << ", written=" << stat.n_pages_written << "]"; - mutex_exit(&flush_list_mutex); + mysql_mutex_unlock(&flush_list_mutex); /* Count the number of blocks belonging to each index in the buffer */ @@ -4620,7 +4584,7 @@ void buf_pool_t::print() } } - mutex_exit(&mutex); + mysql_mutex_unlock(&mutex); for (i = 0; i < n_found; i++) { index = dict_index_get_if_in_cache(index_ids[i]); @@ -4650,14 +4614,14 @@ ulint buf_get_latched_pages_number() { ulint fixed_pages_number= 0; - 
mutex_enter(&buf_pool.mutex); + mysql_mutex_lock(&buf_pool.mutex); for (buf_page_t *b= UT_LIST_GET_FIRST(buf_pool.LRU); b; b= UT_LIST_GET_NEXT(LRU, b)) if (b->in_file() && (b->buf_fix_count() || b->io_fix() != BUF_IO_NONE)) fixed_pages_number++; - mutex_exit(&buf_pool.mutex); + mysql_mutex_unlock(&buf_pool.mutex); return fixed_pages_number; } @@ -4670,8 +4634,8 @@ void buf_stats_get_pool_info(buf_pool_info_t *pool_info) time_t current_time; double time_elapsed; - mutex_enter(&buf_pool.mutex); - mutex_enter(&buf_pool.flush_list_mutex); + mysql_mutex_lock(&buf_pool.mutex); + mysql_mutex_lock(&buf_pool.flush_list_mutex); pool_info->pool_size = buf_pool.curr_size; @@ -4687,19 +4651,11 @@ void buf_stats_get_pool_info(buf_pool_info_t *pool_info) pool_info->n_pend_reads = buf_pool.n_pend_reads; - pool_info->n_pending_flush_lru = - (buf_pool.n_flush[IORequest::LRU] - + buf_pool.init_flush[IORequest::LRU]); - - pool_info->n_pending_flush_list = - (buf_pool.n_flush[IORequest::FLUSH_LIST] - + buf_pool.init_flush[IORequest::FLUSH_LIST]); + pool_info->n_pending_flush_lru = buf_pool.n_flush_LRU; - pool_info->n_pending_flush_single_page = - (buf_pool.n_flush[IORequest::SINGLE_PAGE] - + buf_pool.init_flush[IORequest::SINGLE_PAGE]); + pool_info->n_pending_flush_list = buf_pool.n_flush_list; - mutex_exit(&buf_pool.flush_list_mutex); + mysql_mutex_unlock(&buf_pool.flush_list_mutex); current_time = time(NULL); time_elapsed = 0.001 + difftime(current_time, @@ -4790,7 +4746,7 @@ void buf_stats_get_pool_info(buf_pool_info_t *pool_info) pool_info->unzip_cur = buf_LRU_stat_cur.unzip; buf_refresh_io_stats(); - mutex_exit(&buf_pool.mutex); + mysql_mutex_unlock(&buf_pool.mutex); } /*********************************************************************//** @@ -4813,8 +4769,7 @@ buf_print_io_instance( "Percent of dirty pages(LRU & free pages): %.3f\n" "Max dirty pages percent: %.3f\n" "Pending reads " ULINTPF "\n" - "Pending writes: LRU " ULINTPF ", flush list " ULINTPF - ", single page " 
ULINTPF "\n", + "Pending writes: LRU " ULINTPF ", flush list " ULINTPF "\n", pool_info->pool_size, pool_info->free_list_len, pool_info->lru_len, @@ -4827,8 +4782,7 @@ buf_print_io_instance( srv_max_buf_pool_modified_pct, pool_info->n_pend_reads, pool_info->n_pending_flush_lru, - pool_info->n_pending_flush_list, - pool_info->n_pending_flush_single_page); + pool_info->n_pending_flush_list); fprintf(file, "Pages made young " ULINTPF ", not young " ULINTPF "\n" diff --git a/storage/innobase/buf/buf0dblwr.cc b/storage/innobase/buf/buf0dblwr.cc index 4a583bf7a9a..d9faf2ffe06 100644 --- a/storage/innobase/buf/buf0dblwr.cc +++ b/storage/innobase/buf/buf0dblwr.cc @@ -29,6 +29,7 @@ Created 2011/12/19 #include "buf0checksum.h" #include "srv0start.h" #include "srv0srv.h" +#include "sync0sync.h" #include "page0zip.h" #include "trx0sys.h" #include "fil0crypt.h" @@ -37,38 +38,7 @@ Created 2011/12/19 using st_::span; /** The doublewrite buffer */ -buf_dblwr_t* buf_dblwr = NULL; - -#define TRX_SYS_DOUBLEWRITE_BLOCKS 2 - -/****************************************************************//** -Determines if a page number is located inside the doublewrite buffer. 
-@return TRUE if the location is inside the two blocks of the
-doublewrite buffer */
-ibool
-buf_dblwr_page_inside(
-/*==================*/
-	ulint	page_no)	/*!< in: page number */
-{
-	if (buf_dblwr == NULL) {
-
-		return(FALSE);
-	}
-
-	if (page_no >= buf_dblwr->block1
-	    && page_no < buf_dblwr->block1
-	    + TRX_SYS_DOUBLEWRITE_BLOCK_SIZE) {
-		return(TRUE);
-	}
-
-	if (page_no >= buf_dblwr->block2
-	    && page_no < buf_dblwr->block2
-	    + TRX_SYS_DOUBLEWRITE_BLOCK_SIZE) {
-		return(TRUE);
-	}
-
-	return(FALSE);
-}
+buf_dblwr_t buf_dblwr;
 
 /** @return the TRX_SYS page */
 inline buf_block_t *buf_dblwr_trx_sys_get(mtr_t *mtr)
@@ -79,616 +49,447 @@ inline buf_block_t *buf_dblwr_trx_sys_get(mtr_t *mtr)
   return block;
 }
 
-/****************************************************************//**
-Creates or initialializes the doublewrite buffer at a database start. */
-static void buf_dblwr_init(const byte *doublewrite)
+/** Initialize the doublewrite buffer data structure.
+@param header   doublewrite page header in the TRX_SYS page */
+inline void buf_dblwr_t::init(const byte *header)
 {
-	ulint	buf_size;
+  ut_ad(!first_free);
+  ut_ad(!reserved);
+  ut_ad(!batch_running);
 
-	buf_dblwr = static_cast<buf_dblwr_t*>(
-		ut_zalloc_nokey(sizeof(buf_dblwr_t)));
+  mysql_mutex_init(buf_dblwr_mutex_key, &mutex, nullptr);
+  mysql_cond_init(0, &cond, nullptr);
+  block1= page_id_t(0, mach_read_from_4(header + TRX_SYS_DOUBLEWRITE_BLOCK1));
+  block2= page_id_t(0, mach_read_from_4(header + TRX_SYS_DOUBLEWRITE_BLOCK2));
 
-	/* There are two blocks of same size in the doublewrite
-	buffer. */
-	buf_size = TRX_SYS_DOUBLEWRITE_BLOCKS * TRX_SYS_DOUBLEWRITE_BLOCK_SIZE;
+  const uint32_t buf_size= 2 * block_size();
+  write_buf= static_cast<byte*>(aligned_malloc(buf_size << srv_page_size_shift,
+                                               srv_page_size));
+  buf_block_arr= static_cast<element*>
+    (ut_zalloc_nokey(buf_size * sizeof(element)));
+}
 
-	/* There must be atleast one buffer for single page writes
-	and one buffer for batch writes. */
-	ut_a(srv_doublewrite_batch_size > 0
-	     && srv_doublewrite_batch_size < buf_size);
+/** Create or restore the doublewrite buffer in the TRX_SYS page.
+@return whether the operation succeeded */
+bool buf_dblwr_t::create()
+{
+  if (is_initialised())
+    return true;
 
-	mutex_create(LATCH_ID_BUF_DBLWR, &buf_dblwr->mutex);
+  mtr_t mtr;
+  const ulint size= block_size();
 
-	buf_dblwr->b_event = os_event_create("dblwr_batch_event");
-	buf_dblwr->s_event = os_event_create("dblwr_single_event");
-	buf_dblwr->first_free = 0;
-	buf_dblwr->s_reserved = 0;
-	buf_dblwr->b_reserved = 0;
+start_again:
+  mtr.start();
 
-	buf_dblwr->block1 = mach_read_from_4(
-		doublewrite + TRX_SYS_DOUBLEWRITE_BLOCK1);
-	buf_dblwr->block2 = mach_read_from_4(
-		doublewrite + TRX_SYS_DOUBLEWRITE_BLOCK2);
+  buf_block_t *trx_sys_block= buf_dblwr_trx_sys_get(&mtr);
 
-	buf_dblwr->write_buf = static_cast<byte*>(
-		aligned_malloc(buf_size << srv_page_size_shift,
-			       srv_page_size));
+  if (mach_read_from_4(TRX_SYS_DOUBLEWRITE + TRX_SYS_DOUBLEWRITE_MAGIC +
+                       trx_sys_block->frame) == TRX_SYS_DOUBLEWRITE_MAGIC_N)
+  {
+    /* The doublewrite buffer has already been created: just read in
+    some numbers */
+    init(TRX_SYS_DOUBLEWRITE + trx_sys_block->frame);
+    mtr.commit();
+    return true;
+  }
 
-	buf_dblwr->buf_block_arr = static_cast<buf_dblwr_t::element*>(
-		ut_zalloc_nokey(buf_size * sizeof(buf_dblwr_t::element)));
-}
+  if (UT_LIST_GET_FIRST(fil_system.sys_space->chain)->size < 3 * size)
+  {
+too_small:
+    ib::error() << "Cannot create doublewrite buffer: "
+                   "the first file in innodb_data_file_path must be at least "
+                << (3 * (size >> (20U - srv_page_size_shift))) << "M.";
+    mtr.commit();
+    return false;
+  }
+  else
+  {
+    buf_block_t *b= fseg_create(fil_system.sys_space,
+                                TRX_SYS_DOUBLEWRITE + TRX_SYS_DOUBLEWRITE_FSEG,
+                                &mtr, false, trx_sys_block);
+    if (!b)
+      goto too_small;
+    ib::info() << "Doublewrite buffer not found: creating new";
+
+    /* FIXME: After this point, the doublewrite buffer creation
+    is not atomic. The doublewrite buffer should not exist in
+    the InnoDB system tablespace file in the first place.
+    It could be located in separate optional file(s) in a
+    user-specified location. */
+
+    /* fseg_create acquires a second latch on the page,
+    therefore we must declare it: */
+    buf_block_dbg_add_level(b, SYNC_NO_ORDER_CHECK);
+  }
 
-/** Create the doublewrite buffer if the doublewrite buffer header
-is not present in the TRX_SYS page.
-@return whether the operation succeeded
-@retval true if the doublewrite buffer exists or was created
-@retval false if the creation failed (too small first data file) */
-bool
-buf_dblwr_create()
-{
-	buf_block_t*	block2;
-	buf_block_t*	new_block;
-	byte*	fseg_header;
-	ulint	page_no;
-	ulint	prev_page_no;
-	ulint	i;
-	mtr_t	mtr;
-
-	if (buf_dblwr) {
-		/* Already inited */
-		return(true);
-	}
+  byte *fseg_header= TRX_SYS_DOUBLEWRITE + TRX_SYS_DOUBLEWRITE_FSEG +
+    trx_sys_block->frame;
+  for (ulint prev_page_no= 0, i= 0; i < 2 * size + FSP_EXTENT_SIZE / 2; i++)
+  {
+    buf_block_t *new_block= fseg_alloc_free_page(fseg_header, prev_page_no + 1,
+                                                 FSP_UP, &mtr);
+    if (!new_block)
+    {
+      ib::error() << "Cannot create doublewrite buffer: "
+                     " you must increase your tablespace size."
+                     " Cannot continue operation.";
+      /* This may essentially corrupt the doublewrite
+      buffer. However, usually the doublewrite buffer
+      is created at database initialization, and it
+      should not matter (just remove all newly created
+      InnoDB files and restart). */
+      mtr.commit();
+      return false;
+    }
 
-start_again:
-	mtr.start();
+    /* We read the allocated pages to the buffer pool; when they are
+    written to disk in a flush, the space id and page number fields
+    are also written to the pages. When we at database startup read
+    pages from the doublewrite buffer, we know that if the space id
+    and page number in them are the same as the page position in the
+    tablespace, then the page has not been written to in
+    doublewrite. */
+
+    ut_ad(rw_lock_get_x_lock_count(&new_block->lock) == 1);
+    const page_id_t id= new_block->page.id();
+    /* We only do this in the debug build, to ensure that the check in
+    buf_flush_init_for_writing() will see a valid page type. The
+    flushes of new_block are actually unnecessary here. */
+    ut_d(mtr.write<2>(*new_block, FIL_PAGE_TYPE + new_block->frame,
+                      FIL_PAGE_TYPE_SYS));
+
+    if (i == size / 2)
+    {
+      ut_a(id.page_no() == size);
+      mtr.write<4>(*trx_sys_block,
+                   TRX_SYS_DOUBLEWRITE + TRX_SYS_DOUBLEWRITE_BLOCK1 +
+                   trx_sys_block->frame, id.page_no());
+      mtr.write<4>(*trx_sys_block,
+                   TRX_SYS_DOUBLEWRITE + TRX_SYS_DOUBLEWRITE_REPEAT +
+                   TRX_SYS_DOUBLEWRITE_BLOCK1 + trx_sys_block->frame,
+                   id.page_no());
+    }
+    else if (i == size / 2 + size)
+    {
+      ut_a(id.page_no() == 2 * size);
+      mtr.write<4>(*trx_sys_block,
+                   TRX_SYS_DOUBLEWRITE + TRX_SYS_DOUBLEWRITE_BLOCK2 +
+                   trx_sys_block->frame, id.page_no());
+      mtr.write<4>(*trx_sys_block,
+                   TRX_SYS_DOUBLEWRITE + TRX_SYS_DOUBLEWRITE_REPEAT +
+                   TRX_SYS_DOUBLEWRITE_BLOCK2 + trx_sys_block->frame,
+                   id.page_no());
+    }
+    else if (i > size / 2)
+      ut_a(id.page_no() == prev_page_no + 1);
+
+    if (((i + 1) & 15) == 0) {
+      /* rw_locks can only be recursively x-locked 2048 times. (on 32
+      bit platforms, (lint) 0 - (X_LOCK_DECR * 2049) is no longer a
+      negative number, and thus lock_word becomes like a shared lock).
+      For 4k page size this loop will lock the fseg header too many
+      times. Since this code is not done while any other threads are
+      active, restart the MTR occasionally. */
+      mtr.commit();
+      mtr.start();
+      trx_sys_block= buf_dblwr_trx_sys_get(&mtr);
+      fseg_header= TRX_SYS_DOUBLEWRITE + TRX_SYS_DOUBLEWRITE_FSEG +
+        trx_sys_block->frame;
+    }
 
-	buf_block_t *trx_sys_block = buf_dblwr_trx_sys_get(&mtr);
+    prev_page_no= id.page_no();
+  }
 
-	if (mach_read_from_4(TRX_SYS_DOUBLEWRITE + TRX_SYS_DOUBLEWRITE_MAGIC
-			     + trx_sys_block->frame)
-	    == TRX_SYS_DOUBLEWRITE_MAGIC_N) {
-		/* The doublewrite buffer has already been created:
-		just read in some numbers */
+  mtr.write<4>(*trx_sys_block,
+               TRX_SYS_DOUBLEWRITE + TRX_SYS_DOUBLEWRITE_MAGIC +
+               trx_sys_block->frame, TRX_SYS_DOUBLEWRITE_MAGIC_N);
+  mtr.write<4>(*trx_sys_block,
+               TRX_SYS_DOUBLEWRITE + TRX_SYS_DOUBLEWRITE_MAGIC +
+               TRX_SYS_DOUBLEWRITE_REPEAT + trx_sys_block->frame,
+               TRX_SYS_DOUBLEWRITE_MAGIC_N);
 
-		buf_dblwr_init(TRX_SYS_DOUBLEWRITE + trx_sys_block->frame);
+  mtr.write<4>(*trx_sys_block,
+               TRX_SYS_DOUBLEWRITE + TRX_SYS_DOUBLEWRITE_SPACE_ID_STORED +
+               trx_sys_block->frame, TRX_SYS_DOUBLEWRITE_SPACE_ID_STORED_N);
+  mtr.commit();
 
-		mtr.commit();
-		return(true);
-	} else {
-		if (UT_LIST_GET_FIRST(fil_system.sys_space->chain)->size
-		    < 3 * FSP_EXTENT_SIZE) {
-			goto too_small;
-		}
-	}
+  /* Flush the modified pages to disk and make a checkpoint */
+  log_make_checkpoint();
 
-	block2 = fseg_create(fil_system.sys_space,
-			     TRX_SYS_DOUBLEWRITE + TRX_SYS_DOUBLEWRITE_FSEG,
-			     &mtr, false, trx_sys_block);
+  /* Remove doublewrite pages from LRU */
+  buf_pool_invalidate();
 
-	if (block2 == NULL) {
-too_small:
-		ib::error()
-			<< "Cannot create doublewrite buffer: "
-			"the first file in innodb_data_file_path"
-			" must be at least "
-			<< (3 * (FSP_EXTENT_SIZE
-				 >> (20U - srv_page_size_shift)))
-			<< "M.";
-		mtr.commit();
-		return(false);
-	}
-
-	ib::info() << "Doublewrite buffer not found: creating new";
-
-	/* FIXME: After this point, the doublewrite buffer creation
-	is not atomic. The doublewrite buffer should not exist in
-	the InnoDB system tablespace file in the first place.
-	It could be located in separate optional file(s) in a
-	user-specified location. */
-
-	/* fseg_create acquires a second latch on the page,
-	therefore we must declare it: */
-
-	buf_block_dbg_add_level(block2, SYNC_NO_ORDER_CHECK);
-
-	fseg_header = TRX_SYS_DOUBLEWRITE + TRX_SYS_DOUBLEWRITE_FSEG
-		+ trx_sys_block->frame;
-	prev_page_no = 0;
-
-	for (i = 0; i < TRX_SYS_DOUBLEWRITE_BLOCKS * TRX_SYS_DOUBLEWRITE_BLOCK_SIZE
-		     + FSP_EXTENT_SIZE / 2; i++) {
-		new_block = fseg_alloc_free_page(
-			fseg_header, prev_page_no + 1, FSP_UP, &mtr);
-		if (new_block == NULL) {
-			ib::error() << "Cannot create doublewrite buffer: "
-				" you must increase your tablespace size."
-				" Cannot continue operation.";
-			/* This may essentially corrupt the doublewrite
-			buffer. However, usually the doublewrite buffer
-			is created at database initialization, and it
-			should not matter (just remove all newly created
-			InnoDB files and restart). */
-			mtr.commit();
-			return(false);
-		}
-
-		/* We read the allocated pages to the buffer pool;
-		when they are written to disk in a flush, the space
-		id and page number fields are also written to the
-		pages. When we at database startup read pages
-		from the doublewrite buffer, we know that if the
-		space id and page number in them are the same as
-		the page position in the tablespace, then the page
-		has not been written to in doublewrite. */
-
-		ut_ad(rw_lock_get_x_lock_count(&new_block->lock) == 1);
-		page_no = new_block->page.id().page_no();
-		/* We only do this in the debug build, to ensure that
-		the check in buf_flush_init_for_writing() will see a valid
-		page type. The flushes of new_block are actually
-		unnecessary here. */
-		ut_d(mtr.write<2>(*new_block,
-				  FIL_PAGE_TYPE + new_block->frame,
-				  FIL_PAGE_TYPE_SYS));
-
-		if (i == FSP_EXTENT_SIZE / 2) {
-			ut_a(page_no == FSP_EXTENT_SIZE);
-			mtr.write<4>(*trx_sys_block,
-				     TRX_SYS_DOUBLEWRITE
-				     + TRX_SYS_DOUBLEWRITE_BLOCK1
-				     + trx_sys_block->frame,
-				     page_no);
-			mtr.write<4>(*trx_sys_block,
-				     TRX_SYS_DOUBLEWRITE
-				     + TRX_SYS_DOUBLEWRITE_REPEAT
-				     + TRX_SYS_DOUBLEWRITE_BLOCK1
-				     + trx_sys_block->frame,
-				     page_no);
-
-		} else if (i == FSP_EXTENT_SIZE / 2
-			   + TRX_SYS_DOUBLEWRITE_BLOCK_SIZE) {
-			ut_a(page_no == 2 * FSP_EXTENT_SIZE);
-			mtr.write<4>(*trx_sys_block,
-				     TRX_SYS_DOUBLEWRITE
-				     + TRX_SYS_DOUBLEWRITE_BLOCK2
-				     + trx_sys_block->frame,
-				     page_no);
-			mtr.write<4>(*trx_sys_block,
-				     TRX_SYS_DOUBLEWRITE
-				     + TRX_SYS_DOUBLEWRITE_REPEAT
-				     + TRX_SYS_DOUBLEWRITE_BLOCK2
-				     + trx_sys_block->frame,
-				     page_no);
-		} else if (i > FSP_EXTENT_SIZE / 2) {
-			ut_a(page_no == prev_page_no + 1);
-		}
-
-		if (((i + 1) & 15) == 0) {
-			/* rw_locks can only be recursively x-locked
-			2048 times. (on 32 bit platforms,
-			(lint) 0 - (X_LOCK_DECR * 2049)
-			is no longer a negative number, and thus
-			lock_word becomes like a shared lock).
-			For 4k page size this loop will
-			lock the fseg header too many times. Since
-			this code is not done while any other threads
-			are active, restart the MTR occasionally. */
-			mtr.commit();
-			mtr.start();
-			trx_sys_block = buf_dblwr_trx_sys_get(&mtr);
-			fseg_header = TRX_SYS_DOUBLEWRITE
-				+ TRX_SYS_DOUBLEWRITE_FSEG
-				+ trx_sys_block->frame;
-		}
-
-		prev_page_no = page_no;
-	}
-
-	mtr.write<4>(*trx_sys_block,
-		     TRX_SYS_DOUBLEWRITE + TRX_SYS_DOUBLEWRITE_MAGIC
-		     + trx_sys_block->frame,
-		     TRX_SYS_DOUBLEWRITE_MAGIC_N);
-	mtr.write<4>(*trx_sys_block,
-		     TRX_SYS_DOUBLEWRITE + TRX_SYS_DOUBLEWRITE_MAGIC
-		     + TRX_SYS_DOUBLEWRITE_REPEAT
-		     + trx_sys_block->frame,
-		     TRX_SYS_DOUBLEWRITE_MAGIC_N);
-
-	mtr.write<4>(*trx_sys_block,
-		     TRX_SYS_DOUBLEWRITE + TRX_SYS_DOUBLEWRITE_SPACE_ID_STORED
-		     + trx_sys_block->frame,
-		     TRX_SYS_DOUBLEWRITE_SPACE_ID_STORED_N);
-	mtr.commit();
-
-	/* Flush the modified pages to disk and make a checkpoint */
-	log_make_checkpoint();
-
-	/* Remove doublewrite pages from LRU */
-	buf_pool_invalidate();
-
-	ib::info() << "Doublewrite buffer created";
-
-	goto start_again;
+  ib::info() << "Doublewrite buffer created";
+  goto start_again;
 }
 
-/**
-At database startup initializes the doublewrite buffer memory structure if
-we already have a doublewrite buffer created in the data files. If we are
-upgrading to an InnoDB version which supports multiple tablespaces, then this
-function performs the necessary update operations. If we are in a crash
-recovery, this function loads the pages from double write buffer into memory.
-@param[in]	file		File handle
-@param[in]	path		Path name of file
+/** Initialize the doublewrite buffer memory structure on recovery.
+If we are upgrading from a version before MySQL 4.1, then this
+function performs the necessary update operations to support
+innodb_file_per_table. If we are in a crash recovery, this function
+loads the pages from double write buffer into memory.
+@param file	File handle
+@param path	Path name of file
 @return DB_SUCCESS or error code */
-dberr_t
-buf_dblwr_init_or_load_pages(
-	pfs_os_file_t	file,
-	const char*	path)
+dberr_t buf_dblwr_t::init_or_load_pages(pfs_os_file_t file, const char *path)
 {
-	byte*	buf;
-	byte*	page;
-	ulint	block1;
-	ulint	block2;
-	ulint	space_id;
-	byte*	read_buf;
-	byte*	doublewrite;
-	ibool	reset_space_ids = FALSE;
-	recv_dblwr_t& recv_dblwr = recv_sys.dblwr;
-
-	/* We do the file i/o past the buffer pool */
-	read_buf = static_cast<byte*>(
-		aligned_malloc(2 * srv_page_size, srv_page_size));
-
-	/* Read the trx sys header to check if we are using the doublewrite
-	buffer */
-	dberr_t		err;
-
-	IORequest	read_request(IORequest::READ);
-
-	err = os_file_read(
-		read_request,
-		file, read_buf, TRX_SYS_PAGE_NO << srv_page_size_shift,
-		srv_page_size);
-
-	if (err != DB_SUCCESS) {
-
-		ib::error()
-			<< "Failed to read the system tablespace header page";
+  ut_ad(this == &buf_dblwr);
+  const uint32_t size= block_size();
+
+  /* We do the file i/o past the buffer pool */
+  byte *read_buf= static_cast<byte*>(aligned_malloc(srv_page_size,
+                                                    srv_page_size));
+  /* Read the TRX_SYS header to check if we are using the doublewrite buffer */
+  dberr_t err= os_file_read(IORequestRead, file, read_buf,
+                            TRX_SYS_PAGE_NO << srv_page_size_shift,
+                            srv_page_size);
+
+  if (err != DB_SUCCESS)
+  {
+    ib::error() << "Failed to read the system tablespace header page";
 func_exit:
-	aligned_free(read_buf);
-	return(err);
-	}
-
-	doublewrite = read_buf + TRX_SYS_DOUBLEWRITE;
-
-	/* TRX_SYS_PAGE_NO is not encrypted see fil_crypt_rotate_page() */
-
-	if (mach_read_from_4(doublewrite + TRX_SYS_DOUBLEWRITE_MAGIC)
-	    == TRX_SYS_DOUBLEWRITE_MAGIC_N) {
-		/* The doublewrite buffer has been created */
-
-		buf_dblwr_init(doublewrite);
-
-		block1 = buf_dblwr->block1;
-		block2 = buf_dblwr->block2;
-
-		buf = buf_dblwr->write_buf;
-	} else {
-		err = DB_SUCCESS;
-		goto func_exit;
-	}
-
-	if (mach_read_from_4(doublewrite + TRX_SYS_DOUBLEWRITE_SPACE_ID_STORED)
-	    != TRX_SYS_DOUBLEWRITE_SPACE_ID_STORED_N) {
-
-		/* We are upgrading from a version < 4.1.x to a version where
-		multiple tablespaces are supported. We must reset the space id
-		field in the pages in the doublewrite buffer because starting
-		from this version the space id is stored to
-		FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID. */
-
-		reset_space_ids = TRUE;
-
-		ib::info() << "Resetting space id's in the doublewrite buffer";
-	}
-
-	/* Read the pages from the doublewrite buffer to memory */
-	err = os_file_read(
-		read_request,
-		file, buf, block1 << srv_page_size_shift,
-		TRX_SYS_DOUBLEWRITE_BLOCK_SIZE << srv_page_size_shift);
-
-	if (err != DB_SUCCESS) {
-
-		ib::error()
-			<< "Failed to read the first double write buffer "
-			"extent";
-		goto func_exit;
-	}
-
-	err = os_file_read(
-		read_request,
-		file,
-		buf + (TRX_SYS_DOUBLEWRITE_BLOCK_SIZE << srv_page_size_shift),
-		block2 << srv_page_size_shift,
-		TRX_SYS_DOUBLEWRITE_BLOCK_SIZE << srv_page_size_shift);
-
-	if (err != DB_SUCCESS) {
-
-		ib::error()
-			<< "Failed to read the second double write buffer "
-			"extent";
-		goto func_exit;
-	}
-
-	/* Check if any of these pages is half-written in data files, in the
-	intended position */
-
-	page = buf;
-
-	for (ulint i = 0; i < TRX_SYS_DOUBLEWRITE_BLOCK_SIZE * 2; i++) {
-
-		if (reset_space_ids) {
-			ulint source_page_no;
-
-			space_id = 0;
-			mach_write_to_4(page + FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID,
-					space_id);
-			/* We do not need to calculate new checksums for the
-			pages because the field .._SPACE_ID does not affect
-			them. Write the page back to where we read it from. */
-
-			if (i < TRX_SYS_DOUBLEWRITE_BLOCK_SIZE) {
-				source_page_no = block1 + i;
-			} else {
-				source_page_no = block2
-					+ i - TRX_SYS_DOUBLEWRITE_BLOCK_SIZE;
-			}
-
-			err = os_file_write(
-				IORequestWrite, path, file, page,
-				source_page_no << srv_page_size_shift,
-				srv_page_size);
-			if (err != DB_SUCCESS) {
-
-				ib::error()
-					<< "Failed to write to the double write"
-					" buffer";
-				goto func_exit;
-			}
-		} else if (mach_read_from_8(page + FIL_PAGE_LSN)) {
-			/* Each valid page header must contain
-			a nonzero FIL_PAGE_LSN field. */
-			recv_dblwr.add(page);
-		}
-
-		page += srv_page_size;
-	}
-
-	if (reset_space_ids) {
-		os_file_flush(file);
-	}
-
-	err = DB_SUCCESS;
-	goto func_exit;
+    aligned_free(read_buf);
+    return err;
+  }
+
+  /* TRX_SYS_PAGE_NO is not encrypted see fil_crypt_rotate_page() */
+  if (mach_read_from_4(TRX_SYS_DOUBLEWRITE_MAGIC + TRX_SYS_DOUBLEWRITE +
+                       read_buf) != TRX_SYS_DOUBLEWRITE_MAGIC_N)
+  {
+    /* There is no doublewrite buffer initialized in the TRX_SYS page.
+    This should normally not be possible; the doublewrite buffer should
+    be initialized when creating the database. */
+    err= DB_SUCCESS;
+    goto func_exit;
+  }
+
+  init(TRX_SYS_DOUBLEWRITE + read_buf);
+
+  const bool upgrade_to_innodb_file_per_table=
+    mach_read_from_4(TRX_SYS_DOUBLEWRITE_SPACE_ID_STORED +
+                     TRX_SYS_DOUBLEWRITE + read_buf) !=
+    TRX_SYS_DOUBLEWRITE_SPACE_ID_STORED_N;
+
+  /* Read the pages from the doublewrite buffer to memory */
+  err= os_file_read(IORequestRead, file, write_buf,
+                    block1.page_no() << srv_page_size_shift,
+                    size << srv_page_size_shift);
+
+  if (err != DB_SUCCESS)
+  {
+    ib::error() << "Failed to read the first double write buffer extent";
+    goto func_exit;
+  }
+
+  err= os_file_read(IORequestRead, file,
+                    write_buf + (size << srv_page_size_shift),
+                    block2.page_no() << srv_page_size_shift,
+                    size << srv_page_size_shift);
+  if (err != DB_SUCCESS)
+  {
+    ib::error() << "Failed to read the second double write buffer extent";
+    goto func_exit;
+  }
+
+  byte *page= write_buf;
+
+  if (UNIV_UNLIKELY(upgrade_to_innodb_file_per_table))
+  {
+    ib::info() << "Resetting space id's in the doublewrite buffer";
+
+    for (ulint i= 0; i < size * 2; i++, page += srv_page_size)
+    {
+      memset(page + FIL_PAGE_SPACE_ID, 0, 4);
+      /* For innodb_checksum_algorithm=innodb, we do not need to
+      calculate new checksums for the pages because the field
+      .._SPACE_ID does not affect them. Write the page back to where
+      we read it from. */
+      const ulint source_page_no= i < size
+        ? block1.page_no() + i
+        : block2.page_no() + i - size;
+      err= os_file_write(IORequestWrite, path, file, page,
+                         source_page_no << srv_page_size_shift, srv_page_size);
+      if (err != DB_SUCCESS)
+      {
+        ib::error() << "Failed to upgrade the double write buffer";
+        goto func_exit;
+      }
+    }
+    os_file_flush(file);
+  }
+  else
+    for (ulint i= 0; i < size * 2; i++, page += srv_page_size)
+      if (mach_read_from_8(my_assume_aligned<8>(page + FIL_PAGE_LSN)))
+        /* Each valid page header must contain a nonzero FIL_PAGE_LSN field.
*/ + recv_sys.dblwr.add(page); + + err= DB_SUCCESS; + goto func_exit; } /** Process and remove the double write buffer pages for all tablespaces. */ -void -buf_dblwr_process() +void buf_dblwr_t::recover() { - ut_ad(recv_sys.parse_start_lsn); - - ulint page_no_dblwr = 0; - byte* read_buf; - recv_dblwr_t& recv_dblwr = recv_sys.dblwr; - - if (!buf_dblwr) { - return; - } - - read_buf = static_cast<byte*>( - aligned_malloc(3 * srv_page_size, srv_page_size)); - byte* const buf = read_buf + srv_page_size; - - for (recv_dblwr_t::list::iterator i = recv_dblwr.pages.begin(); - i != recv_dblwr.pages.end(); - ++i, ++page_no_dblwr) { - byte* page = *i; - const ulint page_no = page_get_page_no(page); - - if (!page_no) { - /* page 0 should have been recovered - already via Datafile::restore_from_doublewrite() */ - continue; - } - - const ulint space_id = page_get_space_id(page); - const lsn_t lsn = mach_read_from_8(page + FIL_PAGE_LSN); - - if (recv_sys.parse_start_lsn > lsn) { - /* Pages written before the checkpoint are - not useful for recovery. */ - continue; - } - - const page_id_t page_id(space_id, page_no); - - if (recv_sys.scanned_lsn < lsn) { - ib::warn() << "Ignoring a doublewrite copy of page " - << page_id - << " with future log sequence number " - << lsn; - continue; - } - - fil_space_t* space = fil_space_acquire_for_io(space_id); - - if (!space) { - /* Maybe we have dropped the tablespace - and this page once belonged to it: do nothing */ - continue; - } - - fil_space_open_if_needed(space); - - if (UNIV_UNLIKELY(page_no >= space->size)) { - - /* Do not report the warning for undo - tablespaces, because they can be truncated in place. 
*/ - if (!srv_is_undo_tablespace(space_id)) { - ib::warn() << "A copy of page " << page_no - << " in the doublewrite buffer slot " - << page_no_dblwr - << " is beyond the end of tablespace " - << space->name - << " (" << space->size << " pages)"; - } + ut_ad(recv_sys.parse_start_lsn); + if (!is_initialised()) + return; + + ulint page_no_dblwr= 0; + byte *read_buf= static_cast<byte*>(aligned_malloc(3 * srv_page_size, + srv_page_size)); + byte *const buf= read_buf + srv_page_size; + + for (recv_dblwr_t::list::iterator i= recv_sys.dblwr.pages.begin(); + i != recv_sys.dblwr.pages.end(); ++i, ++page_no_dblwr) + { + byte *page= *i; + const ulint page_no= page_get_page_no(page); + if (!page_no) /* recovered via Datafile::restore_from_doublewrite() */ + continue; + + const lsn_t lsn= mach_read_from_8(page + FIL_PAGE_LSN); + if (recv_sys.parse_start_lsn > lsn) + /* Pages written before the checkpoint are not useful for recovery. */ + continue; + const ulint space_id= page_get_space_id(page); + const page_id_t page_id(space_id, page_no); + + if (recv_sys.scanned_lsn < lsn) + { + ib::warn() << "Ignoring a doublewrite copy of page " << page_id + << " with future log sequence number " << lsn; + continue; + } + + fil_space_t* space= fil_space_acquire_for_io(space_id); + + if (!space) + /* The tablespace that this page once belonged to does not exist */ + continue; + + fil_space_open_if_needed(space); + + if (UNIV_UNLIKELY(page_no >= space->size)) + { + /* Do not report the warning for undo tablespaces, because they + can be truncated in place. 
*/ + if (!srv_is_undo_tablespace(space_id)) + ib::warn() << "A copy of page " << page_no + << " in the doublewrite buffer slot " << page_no_dblwr + << " is beyond the end of tablespace " << space->name + << " (" << space->size << " pages)"; next_page: - space->release_for_io(); - continue; - } - - const ulint physical_size = space->physical_size(); - const ulint zip_size = space->zip_size(); - ut_ad(!buf_is_zeroes(span<const byte>(page, physical_size))); - - /* We want to ensure that for partial reads the - unread portion of the page is NUL. */ - memset(read_buf, 0x0, physical_size); - - IORequest request; - - request.dblwr_recover(); - - /* Read in the actual page from the file */ - fil_io_t fio = fil_io( - request, true, - page_id, zip_size, - 0, physical_size, read_buf, NULL); - - if (UNIV_UNLIKELY(fio.err != DB_SUCCESS)) { - ib::warn() - << "Double write buffer recovery: " - << page_id << " read failed with " - << "error: " << fio.err; - } - - if (fio.node) { - fio.node->space->release_for_io(); - } - - if (buf_is_zeroes(span<const byte>(read_buf, physical_size))) { - /* We will check if the copy in the - doublewrite buffer is valid. If not, we will - ignore this page (there should be redo log - records to initialize it). */ - } else if (recv_dblwr.validate_page( - page_id, read_buf, space, buf)) { - goto next_page; - } else { - /* We intentionally skip this message for - all-zero pages. */ - ib::info() - << "Trying to recover page " << page_id - << " from the doublewrite buffer."; - } - - page = recv_dblwr.find_page(page_id, space, buf); - - if (!page) { - goto next_page; - } - - /* Write the good page from the doublewrite buffer to - the intended position. 
*/ - fio = fil_io(IORequestWrite, true, page_id, zip_size, - 0, physical_size, page, nullptr); - - if (fio.node) { - ut_ad(fio.err == DB_SUCCESS); - ib::info() << "Recovered page " << page_id - << " to '" << fio.node->name - << "' from the doublewrite buffer."; - fio.node->space->release_for_io(); - } - - goto next_page; - } - - recv_dblwr.pages.clear(); - - fil_flush_file_spaces(); - aligned_free(read_buf); + space->release_for_io(); + continue; + } + + const ulint physical_size= space->physical_size(); + const ulint zip_size= space->zip_size(); + ut_ad(!buf_is_zeroes(span<const byte>(page, physical_size))); + + /* We want to ensure that for partial reads the unread portion of + the page is NUL. */ + memset(read_buf, 0x0, physical_size); + + /* Read in the actual page from the file */ + fil_io_t fio= fil_io(IORequest(IORequest::READ | IORequest::DBLWR_RECOVER), + true, page_id, zip_size, + 0, physical_size, read_buf, nullptr); + + if (UNIV_UNLIKELY(fio.err != DB_SUCCESS)) + ib::warn() << "Double write buffer recovery: " << page_id + << " (tablespace '" << space->name + << "') read failed with error: " << fio.err; + + if (fio.node) + fio.node->space->release_for_io(); + + if (buf_is_zeroes(span<const byte>(read_buf, physical_size))) + { + /* We will check if the copy in the doublewrite buffer is + valid. If not, we will ignore this page (there should be redo + log records to initialize it). */ + } + else if (recv_sys.dblwr.validate_page(page_id, read_buf, space, buf)) + goto next_page; + else + /* We intentionally skip this message for all-zero pages. */ + ib::info() << "Trying to recover page " << page_id + << " from the doublewrite buffer."; + + page= recv_sys.dblwr.find_page(page_id, space, buf); + + if (!page) + goto next_page; + + /* Write the good page from the doublewrite buffer to the intended + position. 
*/ + fio= fil_io(IORequestWrite, true, page_id, zip_size, 0, physical_size, + page, nullptr); + + if (fio.node) + { + ut_ad(fio.err == DB_SUCCESS); + ib::info() << "Recovered page " << page_id << " to '" << fio.node->name + << "' from the doublewrite buffer."; + fio.node->space->release_for_io(); + goto next_page; + } + } + + recv_sys.dblwr.pages.clear(); + fil_flush_file_spaces(); + aligned_free(read_buf); } -/****************************************************************//** -Frees doublewrite buffer. */ -void -buf_dblwr_free() +/** Free the doublewrite buffer. */ +void buf_dblwr_t::close() { - /* Free the double write data structures. */ - ut_a(buf_dblwr != NULL); - ut_ad(buf_dblwr->s_reserved == 0); - ut_ad(buf_dblwr->b_reserved == 0); - - os_event_destroy(buf_dblwr->b_event); - os_event_destroy(buf_dblwr->s_event); - aligned_free(buf_dblwr->write_buf); - ut_free(buf_dblwr->buf_block_arr); - mutex_free(&buf_dblwr->mutex); - ut_free(buf_dblwr); - buf_dblwr = NULL; + if (!is_initialised()) + return; + + /* Free the double write data structures. */ + ut_ad(!reserved); + ut_ad(!first_free); + ut_ad(!batch_running); + + mysql_cond_destroy(&cond); + aligned_free(write_buf); + ut_free(buf_block_arr); + mysql_mutex_destroy(&mutex); + + memset((void*) this, 0, sizeof *this); } /** Update the doublewrite buffer on write completion. */ -void buf_dblwr_update(const buf_page_t &bpage, bool single_page) +void buf_dblwr_t::write_completed() { + ut_ad(this == &buf_dblwr); ut_ad(srv_use_doublewrite_buf); - ut_ad(buf_dblwr); - ut_ad(!fsp_is_system_temporary(bpage.id().space())); + ut_ad(is_initialised()); ut_ad(!srv_read_only_mode); - if (!single_page) - { - mutex_enter(&buf_dblwr->mutex); - - ut_ad(buf_dblwr->batch_running); - ut_ad(buf_dblwr->b_reserved > 0); - ut_ad(buf_dblwr->b_reserved <= buf_dblwr->first_free); - - if (!--buf_dblwr->b_reserved) - { - mutex_exit(&buf_dblwr->mutex); - /* This will finish the batch. Sync data files to the disk. 
*/ - fil_flush_file_spaces(); - mutex_enter(&buf_dblwr->mutex); - - /* We can now reuse the doublewrite memory buffer: */ - buf_dblwr->first_free= 0; - buf_dblwr->batch_running= false; - os_event_set(buf_dblwr->b_event); - } + mysql_mutex_lock(&mutex); - mutex_exit(&buf_dblwr->mutex); - return; - } + ut_ad(batch_running); + ut_ad(reserved); + ut_ad(reserved <= first_free); - ulint size= TRX_SYS_DOUBLEWRITE_BLOCKS * TRX_SYS_DOUBLEWRITE_BLOCK_SIZE; - mutex_enter(&buf_dblwr->mutex); - for (ulint i= srv_doublewrite_batch_size; i < size; ++i) + if (!--reserved) { - if (buf_dblwr->buf_block_arr[i].bpage != &bpage) - continue; - buf_dblwr->s_reserved--; - buf_dblwr->buf_block_arr[i].bpage= nullptr; - os_event_set(buf_dblwr->s_event); - mutex_exit(&buf_dblwr->mutex); - return; + mysql_mutex_unlock(&mutex); + /* This will finish the batch. Sync data files to the disk. */ + fil_flush_file_spaces(); + mysql_mutex_lock(&mutex); + + /* We can now reuse the doublewrite memory buffer: */ + first_free= 0; + batch_running= false; + mysql_cond_broadcast(&cond); } - /* The block must exist as a reserved block. */ - ut_error; + mysql_mutex_unlock(&mutex); } #ifdef UNIV_DEBUG @@ -718,390 +519,203 @@ static void buf_dblwr_check_page_lsn(const buf_page_t &b, const byte *page) space->release_for_io(); } } -#endif /* UNIV_DEBUG */ -/********************************************************************//** -Asserts when a corrupt block is find during writing out data to the -disk. */ -static -void -buf_dblwr_assert_on_corrupt_block( -/*==============================*/ - const buf_block_t* block) /*!< in: block to check */ +/** Check the LSN values on the page with which this block is associated. */ +static void buf_dblwr_check_block(const buf_page_t *bpage) { - buf_page_print(block->frame); - - ib::fatal() << "Apparent corruption of an index page " - << block->page.id() - << " to be written to data file. 
We intentionally crash" - " the server to prevent corrupt data from ending up in" - " data files."; + ut_ad(bpage->state() == BUF_BLOCK_FILE_PAGE); + const page_t *page= reinterpret_cast<const buf_block_t*>(bpage)->frame; + + switch (fil_page_get_type(page)) { + case FIL_PAGE_INDEX: + case FIL_PAGE_TYPE_INSTANT: + case FIL_PAGE_RTREE: + if (page_is_comp(page)) + { + if (page_simple_validate_new(page)) + return; + } + else if (page_simple_validate_old(page)) + return; + /* While it is possible that this is not an index page but just + happens to have wrongly set FIL_PAGE_TYPE, such pages should never + be modified to without also adjusting the page type during page + allocation or buf_flush_init_for_writing() or + fil_block_reset_type(). */ + buf_page_print(page); + + ib::fatal() << "Apparent corruption of an index page " << bpage->id() + << " to be written to data file. We intentionally crash" + " the server to prevent corrupt data from ending up in" + " data files."; + } } +#endif /* UNIV_DEBUG */ -/********************************************************************//** -Check the LSN values on the page with which this block is associated. -Also validate the page if the option is set. 
*/ -static -void -buf_dblwr_check_block( -/*==================*/ - const buf_block_t* block) /*!< in: block to check */ +bool buf_dblwr_t::flush_buffered_writes(const ulint size) { - ut_ad(block->page.state() == BUF_BLOCK_FILE_PAGE); - - switch (fil_page_get_type(block->frame)) { - case FIL_PAGE_INDEX: - case FIL_PAGE_TYPE_INSTANT: - case FIL_PAGE_RTREE: - if (page_is_comp(block->frame)) { - if (page_simple_validate_new(block->frame)) { - return; - } - } else if (page_simple_validate_old(block->frame)) { - return; - } - /* While it is possible that this is not an index page - but just happens to have wrongly set FIL_PAGE_TYPE, - such pages should never be modified to without also - adjusting the page type during page allocation or - buf_flush_init_for_writing() or fil_block_reset_type(). */ - break; - case FIL_PAGE_TYPE_FSP_HDR: - case FIL_PAGE_IBUF_BITMAP: - case FIL_PAGE_TYPE_UNKNOWN: - /* Do not complain again, we already reset this field. */ - case FIL_PAGE_UNDO_LOG: - case FIL_PAGE_INODE: - case FIL_PAGE_IBUF_FREE_LIST: - case FIL_PAGE_TYPE_SYS: - case FIL_PAGE_TYPE_TRX_SYS: - case FIL_PAGE_TYPE_XDES: - case FIL_PAGE_TYPE_BLOB: - case FIL_PAGE_TYPE_ZBLOB: - case FIL_PAGE_TYPE_ZBLOB2: - /* TODO: validate also non-index pages */ - return; - case FIL_PAGE_TYPE_ALLOCATED: - /* empty pages should never be flushed */ - return; - } - - buf_dblwr_assert_on_corrupt_block(block); -} + mysql_mutex_assert_owner(&mutex); + ut_ad(size == block_size()); -/********************************************************************//** -Writes a page that has already been written to the doublewrite buffer -to the datafile. It is the job of the caller to sync the datafile. 
*/ -static void -buf_dblwr_write_block_to_datafile(const buf_dblwr_t::element &e, bool sync) -{ - ut_ad(!sync || e.flush == IORequest::SINGLE_PAGE); - buf_page_t* bpage = e.bpage; - ut_a(bpage->in_file()); - IORequest request(IORequest::WRITE, bpage, e.flush); - - /* We request frame here to get correct buffer in case of - encryption and/or page compression */ - void * frame = buf_page_get_frame(bpage); - - fil_io_t fio; - - if (bpage->zip.data) { - ut_ad(bpage->zip_size()); - - fio = fil_io(request, sync, bpage->id(), bpage->zip_size(), 0, - bpage->zip_size(), frame, bpage); - } else { - ut_ad(bpage->state() == BUF_BLOCK_FILE_PAGE); - ut_ad(!bpage->zip_size()); - - ut_d(buf_dblwr_check_page_lsn(*bpage, static_cast<const byte*> - (frame))); - fio = fil_io(request, - sync, bpage->id(), bpage->zip_size(), 0, - e.size, frame, bpage); - } - - if (sync && fio.node) { - ut_ad(fio.err == DB_SUCCESS); - fio.node->space->release_for_io(); - } -} + for (;;) + { + if (!first_free) + return false; + if (!batch_running) + break; + mysql_cond_wait(&cond, &mutex); + } -/********************************************************************//** -Flushes possible buffered writes from the doublewrite memory buffer to disk. -It is very important to call this function after a batch of writes has been posted, -and also when we may have to wait for a page latch! Otherwise a deadlock -of threads can occur. */ -void -buf_dblwr_flush_buffered_writes() -{ - byte* write_buf; - ulint first_free; - ulint len; - - if (!srv_use_doublewrite_buf || buf_dblwr == NULL) { - /* Sync the writes to the disk. */ - os_aio_wait_until_no_pending_writes(); - /* Now we flush the data to disk (for example, with fsync) */ - fil_flush_file_spaces(); - return; - } - - ut_ad(!srv_read_only_mode); - -try_again: - mutex_enter(&buf_dblwr->mutex); - - /* Write first to doublewrite buffer blocks. We use synchronous - aio and thus know that file write has been completed when the - control returns. 
*/ - - if (buf_dblwr->first_free == 0) { - - mutex_exit(&buf_dblwr->mutex); - return; - } - - if (buf_dblwr->batch_running) { - /* Another thread is running the batch right now. Wait - for it to finish. */ - int64_t sig_count = os_event_reset(buf_dblwr->b_event); - mutex_exit(&buf_dblwr->mutex); - - os_event_wait_low(buf_dblwr->b_event, sig_count); - goto try_again; - } - - ut_ad(buf_dblwr->first_free == buf_dblwr->b_reserved); - - /* Disallow anyone else to post to doublewrite buffer or to - start another batch of flushing. */ - buf_dblwr->batch_running = true; - first_free = buf_dblwr->first_free; - - /* Now safe to release the mutex. Note that though no other - thread is allowed to post to the doublewrite batch flushing - but any threads working on single page flushes are allowed - to proceed. */ - mutex_exit(&buf_dblwr->mutex); - - write_buf = buf_dblwr->write_buf; - - for (ulint len2 = 0, i = 0; - i < buf_dblwr->first_free; - len2 += srv_page_size, i++) { - - buf_page_t* bpage= buf_dblwr->buf_block_arr[i].bpage; - - if (bpage->state() != BUF_BLOCK_FILE_PAGE || bpage->zip.data) { - /* No simple validate for compressed - pages exists. */ - continue; - } - - /* Check that the actual page in the buffer pool is - not corrupt and the LSN values are sane. */ - buf_dblwr_check_block(reinterpret_cast<buf_block_t*>(bpage)); - ut_d(buf_dblwr_check_page_lsn(*bpage, write_buf + len2)); - } - - /* Write out the first block of the doublewrite buffer */ - len = std::min<ulint>(TRX_SYS_DOUBLEWRITE_BLOCK_SIZE, - buf_dblwr->first_free) << srv_page_size_shift; - - fil_io_t fio = fil_io(IORequestWrite, true, - page_id_t(TRX_SYS_SPACE, buf_dblwr->block1), 0, - 0, len, write_buf, nullptr); - fio.node->space->release_for_io(); - - if (buf_dblwr->first_free <= TRX_SYS_DOUBLEWRITE_BLOCK_SIZE) { - /* No unwritten pages in the second block. */ - goto flush; - } - - /* Write out the second block of the doublewrite buffer. 
*/ - len = (buf_dblwr->first_free - TRX_SYS_DOUBLEWRITE_BLOCK_SIZE) - << srv_page_size_shift; - - write_buf = buf_dblwr->write_buf - + (TRX_SYS_DOUBLEWRITE_BLOCK_SIZE << srv_page_size_shift); - - fio = fil_io(IORequestWrite, true, - page_id_t(TRX_SYS_SPACE, buf_dblwr->block2), 0, - 0, len, write_buf, nullptr); - fio.node->space->release_for_io(); - -flush: - /* increment the doublewrite flushed pages counter */ - srv_stats.dblwr_pages_written.add(buf_dblwr->first_free); - srv_stats.dblwr_writes.inc(); - - /* Now flush the doublewrite buffer data to disk */ - fil_flush(TRX_SYS_SPACE); - - /* We know that the writes have been flushed to disk now - and in recovery we will find them in the doublewrite buffer - blocks. Next do the writes to the intended positions. */ - - /* Up to this point first_free and buf_dblwr->first_free are - same because we have set the buf_dblwr->batch_running flag - disallowing any other thread to post any request but we - can't safely access buf_dblwr->first_free in the loop below. - This is so because it is possible that after we are done with - the last iteration and before we terminate the loop, the batch - gets finished in the IO helper thread and another thread posts - a new batch setting buf_dblwr->first_free to a higher value. - If this happens and we are using buf_dblwr->first_free in the - loop termination condition then we'll end up dispatching - the same block twice from two different threads. */ - ut_ad(first_free == buf_dblwr->first_free); - for (ulint i = 0; i < first_free; i++) { - buf_dblwr_write_block_to_datafile( - buf_dblwr->buf_block_arr[i], false); - } -} + ut_ad(reserved == first_free); + /* Disallow anyone else to post to doublewrite buffer or to + start another batch of flushing. */ + batch_running= true; + const ulint old_first_free= first_free; -/** Schedule a page write. If the doublewrite memory buffer is full, -buf_dblwr_flush_buffered_writes() will be invoked to make space. 
-@param bpage buffer pool page to be written -@param flush type of flush -@param size payload size in bytes */ -void buf_dblwr_t::add_to_batch(buf_page_t *bpage, IORequest::flush_t flush, - size_t size) -{ - ut_ad(bpage->in_file()); - ut_ad(flush == IORequest::LRU || flush == IORequest::FLUSH_LIST); + /* Now safe to release the mutex. */ + mysql_mutex_unlock(&mutex); +#ifdef UNIV_DEBUG + for (ulint len2= 0, i= 0; i < old_first_free; len2 += srv_page_size, i++) + { + buf_page_t *bpage= buf_block_arr[i].bpage; -try_again: - mutex_enter(&mutex); + if (bpage->zip.data) + /* No simple validate for ROW_FORMAT=COMPRESSED pages exists. */ + continue; - ut_a(first_free <= srv_doublewrite_batch_size); + /* Check that the actual page in the buffer pool is not corrupt + and the LSN values are sane. */ + buf_dblwr_check_block(bpage); + ut_d(buf_dblwr_check_page_lsn(*bpage, write_buf + len2)); + } +#endif /* UNIV_DEBUG */ + /* Write out the first block of the doublewrite buffer */ + fil_io_t fio= fil_io(IORequestWrite, true, block1, 0, 0, + std::min(size, old_first_free) << srv_page_size_shift, + write_buf, nullptr); + fio.node->space->release_for_io(); - if (batch_running) + if (old_first_free > size) { - /* This not nearly as bad as it looks. There is only page_cleaner - thread which does background flushing in batches therefore it is - unlikely to be a contention point. The only exception is when a - user thread is forced to do a flush batch because of a sync - checkpoint. */ - int64_t sig_count= os_event_reset(b_event); - mutex_exit(&mutex); - - os_event_wait_low(b_event, sig_count); - goto try_again; + /* Write out the second block of the doublewrite buffer. 
*/ + fio= fil_io(IORequestWrite, true, block2, 0, 0, + (old_first_free - size) << srv_page_size_shift, + write_buf + (size << srv_page_size_shift), nullptr); + fio.node->space->release_for_io(); } - if (first_free == srv_doublewrite_batch_size) + /* increment the doublewrite flushed pages counter */ + srv_stats.dblwr_pages_written.add(first_free); + srv_stats.dblwr_writes.inc(); + + /* Now flush the doublewrite buffer data to disk */ + fil_flush(TRX_SYS_SPACE); + + /* We know that the writes have been flushed to disk now + and in recovery we will find them in the doublewrite buffer + blocks. Next do the writes to the intended positions. */ + + /* Up to this point old_first_free == first_free because we have set + the batch_running flag disallowing any other thread to post any + request but we can't safely access first_free in the loop below. + This is so because it is possible that after we are done with the + last iteration and before we terminate the loop, the batch gets + finished in the IO helper thread and another thread posts a new + batch setting first_free to a higher value. If this happens and we + are using first_free in the loop termination condition then we'll + end up dispatching the same block twice from two different + threads. 
*/ + ut_ad(old_first_free == first_free); + for (ulint i= 0; i < old_first_free; i++) { - mutex_exit(&mutex); - buf_dblwr_flush_buffered_writes(); - goto try_again; - } + auto e= buf_block_arr[i]; + buf_page_t* bpage= e.bpage; + ut_a(bpage->in_file()); - byte *p= write_buf + srv_page_size * first_free; + /* We request frame here to get correct buffer in case of + encryption and/or page compression */ + void *frame= buf_page_get_frame(bpage); - /* We request frame here to get correct buffer in case of - encryption and/or page compression */ - void * frame = buf_page_get_frame(bpage); + auto e_size= e.size; - memcpy_aligned<OS_FILE_LOG_BLOCK_SIZE>(p, frame, size); - ut_ad(!bpage->zip_size() || bpage->zip_size() == size); - buf_block_arr[first_free++] = { bpage, flush, size }; - b_reserved++; + if (UNIV_LIKELY_NULL(bpage->zip.data)) + { + e_size= bpage->zip_size(); + ut_ad(e_size); + } + else + { + ut_ad(bpage->state() == BUF_BLOCK_FILE_PAGE); + ut_ad(!bpage->zip_size()); + ut_d(buf_dblwr_check_page_lsn(*bpage, static_cast<const byte*>(frame))); + } - ut_ad(!batch_running); - ut_ad(first_free == b_reserved); - ut_ad(b_reserved <= srv_doublewrite_batch_size); + fil_io(IORequest(IORequest::WRITE, bpage, e.lru), false, + bpage->id(), bpage->zip_size(), 0, e_size, frame, bpage); + } + + return true; +} + +/** Flush possible buffered writes to persistent storage. +It is very important to call this function after a batch of writes has been +posted, and also when we may have to wait for a page latch! +Otherwise a deadlock of threads can occur. 
*/ +void buf_dblwr_t::flush_buffered_writes() +{ + if (!is_initialised() || !srv_use_doublewrite_buf) + { + os_aio_wait_until_no_pending_writes(); + fil_flush_file_spaces(); + return; + } - const bool need_flush= first_free == srv_doublewrite_batch_size; - mutex_exit(&mutex); + ut_ad(!srv_read_only_mode); + const ulint size= block_size(); - if (need_flush) - buf_dblwr_flush_buffered_writes(); + mysql_mutex_lock(&mutex); + if (!flush_buffered_writes(size)) + mysql_mutex_unlock(&mutex); } -/** Write a page to the doublewrite buffer on disk, sync it, then write -the page to the datafile and sync the datafile. This function is used -for single page flushes. If all the buffers allocated for single page -flushes in the doublewrite buffer are in use we wait here for one to -become free. We are guaranteed that a slot will become free because any -thread that is using a slot must also release the slot before leaving -this function. -@param bpage buffer pool page to be written -@param sync whether synchronous operation is requested -@param size payload size in bytes */ -void buf_dblwr_t::write_single_page(buf_page_t *bpage, bool sync, size_t size) +/** Schedule a page write. If the doublewrite memory buffer is full, +flush_buffered_writes() will be invoked to make space. +@param bpage buffer pool page to be written +@param lru true=buf_pool.LRU; false=buf_pool.flush_list +@param size payload size in bytes */ +void buf_dblwr_t::add_to_batch(buf_page_t *bpage, bool lru, size_t size) { ut_ad(bpage->in_file()); - ut_ad(srv_use_doublewrite_buf); - ut_ad(this == buf_dblwr); + const ulint buf_size= 2 * block_size(); - /* total number of slots available for single page flushes - starts from srv_doublewrite_batch_size to the end of the buffer. 
*/ - ulint slots = TRX_SYS_DOUBLEWRITE_BLOCKS * TRX_SYS_DOUBLEWRITE_BLOCK_SIZE; - ut_a(slots > srv_doublewrite_batch_size); - ulint n_slots= slots - srv_doublewrite_batch_size; + mysql_mutex_lock(&mutex); - if (bpage->state() == BUF_BLOCK_FILE_PAGE) + for (;;) { - /* Check that the actual page in the buffer pool is not corrupt - and the LSN values are sane. */ - buf_dblwr_check_block(reinterpret_cast<buf_block_t*>(bpage)); -#ifdef UNIV_DEBUG - /* Check that the page as written to the doublewrite buffer has - sane LSN values. */ - if (!bpage->zip.data) - buf_dblwr_check_page_lsn(*bpage, reinterpret_cast<buf_block_t*> - (bpage)->frame); -#endif - } + while (batch_running) + mysql_cond_wait(&cond, &mutex); -retry: - mutex_enter(&mutex); - if (s_reserved == n_slots) - { - /* All slots are reserved. */ - int64_t sig_count = os_event_reset(s_event); - mutex_exit(&mutex); - os_event_wait_low(s_event, sig_count); - goto retry; - } + ut_ad(first_free <= buf_size); + if (first_free != buf_size) + break; - ulint i; - for (i = srv_doublewrite_batch_size; i < slots; ++i) - if (!buf_block_arr[i].bpage) - goto found; - /* We are guaranteed to find a slot. */ - ut_error; -found: - s_reserved++; - buf_block_arr[i]= { bpage, IORequest::SINGLE_PAGE, size }; - - /* increment the doublewrite flushed pages counter */ - srv_stats.dblwr_pages_written.inc(); - srv_stats.dblwr_writes.inc(); - - mutex_exit(&mutex); + if (flush_buffered_writes(buf_size / 2)) + mysql_mutex_lock(&mutex); + } - const ulint offset= i < TRX_SYS_DOUBLEWRITE_BLOCK_SIZE - ? 
block1 + i - : block2 + i - TRX_SYS_DOUBLEWRITE_BLOCK_SIZE; + byte *p= write_buf + srv_page_size * first_free; /* We request frame here to get correct buffer in case of encryption and/or page compression */ - void * frame = buf_page_get_frame(bpage); - ut_ad(!bpage->zip_size() || bpage->zip_size() == size); - fil_io_t fio= fil_io(IORequestWrite, true, page_id_t(TRX_SYS_SPACE, offset), - 0, 0, size, frame, nullptr); - fio.node->space->release_for_io(); + void *frame= buf_page_get_frame(bpage); - /* Now flush the doublewrite buffer data to disk */ - fil_flush(TRX_SYS_SPACE); + memcpy_aligned<OS_FILE_LOG_BLOCK_SIZE>(p, frame, size); + ut_ad(!bpage->zip_size() || bpage->zip_size() == size); + ut_ad(reserved == first_free); + ut_ad(reserved < buf_size); + buf_block_arr[first_free++]= { bpage, lru, size }; + reserved= first_free; - /* We know that the write has been flushed to disk now - and during recovery we will find it in the doublewrite buffer - blocks. Next do the write to the intended position. 
*/ - buf_dblwr_write_block_to_datafile({bpage, IORequest::SINGLE_PAGE, size}, - sync); + if (first_free != buf_size || !flush_buffered_writes(buf_size / 2)) + mysql_mutex_unlock(&mutex); } diff --git a/storage/innobase/buf/buf0dump.cc b/storage/innobase/buf/buf0dump.cc index d468000f894..01b523d6e94 100644 --- a/storage/innobase/buf/buf0dump.cc +++ b/storage/innobase/buf/buf0dump.cc @@ -276,13 +276,13 @@ buf_dump( ulint n_pages; ulint j; - mutex_enter(&buf_pool.mutex); + mysql_mutex_lock(&buf_pool.mutex); n_pages = UT_LIST_GET_LEN(buf_pool.LRU); /* skip empty buffer pools */ if (n_pages == 0) { - mutex_exit(&buf_pool.mutex); + mysql_mutex_unlock(&buf_pool.mutex); goto done; } @@ -310,7 +310,7 @@ buf_dump( n_pages * sizeof(*dump))); if (dump == NULL) { - mutex_exit(&buf_pool.mutex); + mysql_mutex_unlock(&buf_pool.mutex); fclose(f); buf_dump_status(STATUS_ERR, "Cannot allocate " ULINTPF " bytes: %s", @@ -335,7 +335,7 @@ buf_dump( dump[j++] = id; } - mutex_exit(&buf_pool.mutex); + mysql_mutex_unlock(&buf_pool.mutex); ut_a(j <= n_pages); n_pages = j; diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc index 10ed54be452..0f8084886c1 100644 --- a/storage/innobase/buf/buf0flu.cc +++ b/storage/innobase/buf/buf0flu.cc @@ -34,20 +34,10 @@ Created 11/11/1995 Heikki Tuuri #include "buf0checksum.h" #include "buf0dblwr.h" #include "srv0start.h" -#include "srv0srv.h" #include "page0zip.h" -#include "ut0byte.h" -#include "page0page.h" #include "fil0fil.h" -#include "buf0lru.h" -#include "buf0rea.h" -#include "ibuf0ibuf.h" -#include "log0log.h" #include "log0crypt.h" -#include "os0file.h" -#include "trx0sys.h" #include "srv0mon.h" -#include "ut0stage.h" #include "fil0pagecompress.h" #ifdef UNIV_LINUX /* include defs for CPU time priority settings */ @@ -55,30 +45,25 @@ Created 11/11/1995 Heikki Tuuri #include <sys/syscall.h> #include <sys/time.h> #include <sys/resource.h> -static const int buf_flush_page_cleaner_priority = -20; #endif /* UNIV_LINUX */ 
 #ifdef HAVE_LZO
-#include "lzo/lzo1x.h"
-#endif
-
-#ifdef HAVE_SNAPPY
-#include "snappy-c.h"
+# include "lzo/lzo1x.h"
+#elif defined HAVE_SNAPPY
+# include "snappy-c.h"
 #endif
 
 /** Sleep time in microseconds for loop waiting for the oldest
 modification lsn */
-static const ulint buf_flush_wait_flushed_sleep_time = 10000;
-
-#include <my_service_manager.h>
+static constexpr ulint buf_flush_wait_flushed_sleep_time = 10000;
 
-/** Number of pages flushed through non flush_list flushes. */
+/** Number of pages flushed via LRU. Protected by buf_pool.mutex.
+Also included in buf_flush_page_count. */
 ulint buf_lru_flush_page_count;
 
-/** Flag indicating if the page_cleaner is in active state. This flag
-is set to TRUE by the page_cleaner thread when it is spawned and is set
-back to FALSE at shutdown by the page_cleaner as well. Therefore no
-need to protect it by a mutex. It is only ever read by the thread
-doing the shutdown */
+/** Number of pages flushed. Protected by buf_pool.mutex. */
+ulint buf_flush_page_count;
+
+/** Flag indicating if the page_cleaner is in active state. */
 bool buf_page_cleaner_is_active;
 
 /** Factor for scan length to determine n_pages for intended oldest LSN
@@ -89,63 +74,20 @@ static ulint buf_flush_lsn_scan_factor = 3;
 static lsn_t lsn_avg_rate = 0;
 
 /** Target oldest LSN for the requested flush_sync */
-static lsn_t buf_flush_sync_lsn = 0;
+static std::atomic<lsn_t> buf_flush_sync_lsn;
 
 #ifdef UNIV_PFS_THREAD
 mysql_pfs_key_t page_cleaner_thread_key;
 #endif /* UNIV_PFS_THREAD */
 
-/** Event to synchronise with the flushing. */
-os_event_t	buf_flush_event;
-
-static void pc_flush_slot_func(void *);
-static tpool::task_group page_cleaner_task_group(1);
-static tpool::waitable_task pc_flush_slot_task(
-	pc_flush_slot_func, 0, &page_cleaner_task_group);
-
-/** State for page cleaner array slot */
-enum page_cleaner_state_t {
-	/** Not requested any yet. Moved from FINISHED. */
-	PAGE_CLEANER_STATE_NONE = 0,
-	/** Requested but not started flushing. Moved from NONE. */
-	PAGE_CLEANER_STATE_REQUESTED,
-	/** Flushing is on going. Moved from REQUESTED. */
-	PAGE_CLEANER_STATE_FLUSHING,
-	/** Flushing was finished. Moved from FLUSHING. */
-	PAGE_CLEANER_STATE_FINISHED
-};
-
 /** Page cleaner request state for buf_pool */
 struct page_cleaner_slot_t {
-	page_cleaner_state_t	state;	/*!< state of the request.
-					protected by page_cleaner_t::mutex
-					if the worker thread got the slot and
-					set to PAGE_CLEANER_STATE_FLUSHING,
-					n_flushed_lru and n_flushed_list can be
-					updated only by the worker thread */
-	/* This value is set during state==PAGE_CLEANER_STATE_NONE */
-	ulint			n_pages_requested;
-					/*!< number of requested pages
-					for the slot */
-	/* These values are updated during state==PAGE_CLEANER_STATE_FLUSHING,
-	and commited with state==PAGE_CLEANER_STATE_FINISHED.
-	The consistency is protected by the 'state' */
-	ulint			n_flushed_lru;
-					/*!< number of flushed pages
-					by LRU scan flushing */
 	ulint			n_flushed_list;
 					/*!< number of flushed pages
 					by flush_list flushing */
-	bool			succeeded_list;
-					/*!< true if flush_list flushing
-					succeeded. */
-	ulint			flush_lru_time;
-					/*!< elapsed time for LRU flushing */
 	ulint			flush_list_time;
 					/*!< elapsed time for flush_list
 					flushing */
-	ulint			flush_lru_pass;
-					/*!< count to attempt LRU flushing */
 	ulint			flush_list_pass;
 					/*!< count to attempt flush_list
 					flushing */
@@ -153,37 +95,11 @@ struct page_cleaner_slot_t {
 
 /** Page cleaner structure */
 struct page_cleaner_t {
-	/* FIXME: do we need mutex? use atomics? */
-	ib_mutex_t		mutex;		/*!< mutex to protect whole of
-						page_cleaner_t struct and
-						page_cleaner_slot_t slots. */
-	os_event_t		is_finished;	/*!< event to signal that all
-						slots were finished. */
-	bool			requested;	/*!< true if requested pages
-						to flush */
-	lsn_t			lsn_limit;	/*!< upper limit of LSN to be
-						flushed */
-#if 1 /* FIXME: use bool for these, or remove some of these */
-	ulint			n_slots_requested;
-						/*!< number of slots
-						in the state
-						PAGE_CLEANER_STATE_REQUESTED */
-	ulint			n_slots_flushing;
-						/*!< number of slots
-						in the state
-						PAGE_CLEANER_STATE_FLUSHING */
-	ulint			n_slots_finished;
-						/*!< number of slots
-						in the state
-						PAGE_CLEANER_STATE_FINISHED */
-#endif
 	ulint			flush_time;	/*!< elapsed time to flush
 						requests for all slots */
 	ulint			flush_pass;	/*!< count to finish to flush
 						requests for all slots */
 	page_cleaner_slot_t	slot;
-	bool			is_running;	/*!< false if attempt
-						to shutdown */
 };
 
 static page_cleaner_t	page_cleaner;
@@ -200,15 +116,6 @@ in thrashing. */
 /* @} */
 
-/** Increases flush_list size in bytes with the page size */
-static inline void incr_flush_list_size_in_bytes(const buf_block_t* block)
-{
-	/* FIXME: use std::atomic! */
-	ut_ad(mutex_own(&buf_pool.flush_list_mutex));
-	buf_pool.stat.flush_list_bytes += block->physical_size();
-	ut_ad(buf_pool.stat.flush_list_bytes <= buf_pool.curr_pool_size);
-}
-
 #ifdef UNIV_DEBUG
 /** Validate the flush list. */
 static void buf_flush_validate_low();
@@ -241,37 +148,29 @@ static void buf_flush_validate_skip()
 @param[in]	lsn	oldest modification */
 void buf_flush_insert_into_flush_list(buf_block_t* block, lsn_t lsn)
 {
-	ut_ad(!mutex_own(&buf_pool.mutex));
+	mysql_mutex_assert_not_owner(&buf_pool.mutex);
 	ut_ad(log_flush_order_mutex_own());
 	ut_ad(lsn);
 
-	mutex_enter(&buf_pool.flush_list_mutex);
+	mysql_mutex_lock(&buf_pool.flush_list_mutex);
 	block->page.set_oldest_modification(lsn);
 	MEM_CHECK_DEFINED(block->page.zip.data
-			  ? block->page.zip.data : block->frame,
-			  block->physical_size());
-	incr_flush_list_size_in_bytes(block);
+			  ? block->page.zip.data : block->frame,
+			  block->physical_size());
+	buf_pool.stat.flush_list_bytes += block->physical_size();
+	ut_ad(buf_pool.stat.flush_list_bytes <= buf_pool.curr_pool_size);
 	UT_LIST_ADD_FIRST(buf_pool.flush_list, &block->page);
 	ut_d(buf_flush_validate_skip());
-	mutex_exit(&buf_pool.flush_list_mutex);
+	mysql_mutex_unlock(&buf_pool.flush_list_mutex);
 }
 
 /** Remove a block from the flush list of modified blocks.
-@param[in]	bpage	block to be removed from the flush list */
-void buf_flush_remove(buf_page_t* bpage)
+@param[in,out]	bpage	block to be removed from the flush list */
+static void buf_flush_remove(buf_page_t *bpage)
 {
-#if 0 // FIXME: Rate-limit the output. Move this to the page cleaner?
-	if (UNIV_UNLIKELY(srv_shutdown_state == SRV_SHUTDOWN_FLUSH_PHASE)) {
-		service_manager_extend_timeout(
-			INNODB_EXTEND_TIMEOUT_INTERVAL,
-			"Flush and remove page with tablespace id %u"
-			", flush list length " ULINTPF,
-			bpage->space, UT_LIST_GET_LEN(buf_pool.flush_list));
-	}
-#endif
-	ut_ad(mutex_own(&buf_pool.mutex));
-	mutex_enter(&buf_pool.flush_list_mutex);
+	mysql_mutex_assert_owner(&buf_pool.mutex);
+	mysql_mutex_assert_owner(&buf_pool.flush_list_mutex);
 
 	/* Important that we adjust the hazard pointer before removing
 	the bpage from flush list. */
@@ -284,8 +183,78 @@ void buf_flush_remove(buf_page_t* bpage)
 #ifdef UNIV_DEBUG
 	buf_flush_validate_skip();
 #endif /* UNIV_DEBUG */
+}
+
+/** Remove all dirty pages belonging to a given tablespace when we are
+deleting the data file of that tablespace.
+The pages still remain a part of LRU and are evicted from
+the list as they age towards the tail of the LRU.
+@param id    tablespace identifier */
+void buf_flush_remove_pages(ulint id)
+{
+  const page_id_t first(id, 0), end(id + 1, 0);
+  ut_ad(id);
+  mysql_mutex_lock(&buf_pool.mutex);
+
+  for (;;)
+  {
+    bool deferred= false;
-	mutex_exit(&buf_pool.flush_list_mutex);
+    mysql_mutex_lock(&buf_pool.flush_list_mutex);
+
+    for (buf_page_t *bpage= UT_LIST_GET_LAST(buf_pool.flush_list); bpage; )
+    {
+      ut_ad(bpage->in_file());
+      buf_page_t *prev= UT_LIST_GET_PREV(list, bpage);
+
+      const page_id_t bpage_id(bpage->id());
+
+      if (bpage_id < first || bpage_id >= end);
+      else if (bpage->io_fix() != BUF_IO_NONE)
+        deferred= true;
+      else
+        buf_flush_remove(bpage);
+
+      bpage= prev;
+    }
+
+    mysql_mutex_unlock(&buf_pool.flush_list_mutex);
+
+    if (!deferred)
+      break;
+
+    mysql_mutex_unlock(&buf_pool.mutex);
+    os_thread_yield();
+    mysql_mutex_lock(&buf_pool.mutex);
+    buf_flush_wait_batch_end(false);
+  }
+
+  mysql_mutex_unlock(&buf_pool.mutex);
+}
+
+/** Try to flush all the dirty pages that belong to a given tablespace.
+@param id    tablespace identifier
+@return number dirty pages that there were for this tablespace */
+ulint buf_flush_dirty_pages(ulint id)
+{
+  ut_ad(!sync_check_iterate(dict_sync_check()));
+
+  ulint n= 0;
+
+  mysql_mutex_lock(&buf_pool.flush_list_mutex);
+
+  for (buf_page_t *bpage= UT_LIST_GET_FIRST(buf_pool.flush_list); bpage;
+       bpage= UT_LIST_GET_NEXT(list, bpage))
+  {
+    ut_ad(bpage->in_file());
+    ut_ad(bpage->oldest_modification());
+    if (id == bpage->id().space())
+      n++;
+  }
+  mysql_mutex_unlock(&buf_pool.flush_list_mutex);
+  if (n)
+    buf_flush_lists(ULINT_UNDEFINED, LSN_MAX);
+  return n;
 }
 
 /*******************************************************************//**
@@ -299,6 +268,7 @@ use the current list node (bpage) to do the list manipulation because
 the list pointers could have changed between the time that we copied
 the contents of bpage to the dpage and the flush list manipulation
 below. */
+ATTRIBUTE_COLD
 void
 buf_flush_relocate_on_flush_list(
 /*=============================*/
@@ -307,8 +277,13 @@ buf_flush_relocate_on_flush_list(
 {
 	buf_page_t*	prev;
 
-	ut_ad(mutex_own(&buf_pool.mutex));
-	mutex_enter(&buf_pool.flush_list_mutex);
+	mysql_mutex_assert_owner(&buf_pool.mutex);
+
+	if (!bpage->oldest_modification()) {
+		return;
+	}
+
+	mysql_mutex_lock(&buf_pool.flush_list_mutex);
 
 	/* FIXME: At this point we have both buf_pool and flush_list
 	mutexes. Theoretically removal of a block from flush list is
@@ -330,56 +305,29 @@ buf_flush_relocate_on_flush_list(
 	if (prev) {
 		ut_ad(prev->oldest_modification());
-		UT_LIST_INSERT_AFTER( buf_pool.flush_list, prev, dpage);
+		UT_LIST_INSERT_AFTER(buf_pool.flush_list, prev, dpage);
 	} else {
 		UT_LIST_ADD_FIRST(buf_pool.flush_list, dpage);
 	}
 
 	ut_d(buf_flush_validate_low());
-	mutex_exit(&buf_pool.flush_list_mutex);
-}
-
-/** Update the buf_pool data structures on write completion.
-@param[in,out]	bpage	written page
-@param[in]	flush_type	write request type
-@param[in]	dblwr	whether the doublewrite buffer was used */
-static void buf_flush_write_complete(buf_page_t *bpage,
-                                     IORequest::flush_t flush_type, bool dblwr)
-{
-  ut_ad(mutex_own(&buf_pool.mutex));
-  buf_flush_remove(bpage);
-
-  switch (--buf_pool.n_flush[flush_type]) {
-#ifdef UNIV_DEBUG
-  case ULINT_UNDEFINED:
-    ut_error;
-    break;
-#endif
-  case 0:
-    if (!buf_pool.init_flush[flush_type])
-      os_event_set(buf_pool.no_flush[flush_type]);
-  }
-
-  if (dblwr)
-    buf_dblwr_update(*bpage, flush_type == IORequest::SINGLE_PAGE);
+	mysql_mutex_unlock(&buf_pool.flush_list_mutex);
 }
 
 /** Complete write of a file page from buf_pool.
 @param bpage   written page
 @param request write request
-@param dblwr   whether the doublewrite buffer was used
-@param evict   whether or not to evict the page from LRU list */
+@param dblwr   whether the doublewrite buffer was used */
 void buf_page_write_complete(buf_page_t *bpage, const IORequest &request,
-                             bool dblwr, bool evict)
+                             bool dblwr)
 {
   ut_ad(request.is_write());
   ut_ad(bpage->in_file());
   ut_ad(bpage->io_fix() == BUF_IO_WRITE);
-  ut_ad(bpage->id().space() != TRX_SYS_SPACE ||
-        !buf_dblwr_page_inside(bpage->id().page_no()));
+  ut_ad(!buf_dblwr.is_inside(bpage->id()));
 
   /* We do not need protect io_fix here by mutex to read it because
-  this and buf_page_write_complete() are the only functions where we can
+  this and buf_page_read_complete() are the only functions where we can
   change the value from BUF_IO_READ or BUF_IO_WRITE to some other value,
   and our code ensures that this is the only thread that handles the i/o
   for this block. */
@@ -393,9 +341,19 @@ void buf_page_write_complete(buf_page_t *bpage, const IORequest &request,
   buf_page_monitor(bpage, BUF_IO_WRITE);
   DBUG_PRINT("ib_buf", ("write page %u:%u",
                         bpage->id().space(), bpage->id().page_no()));
-  mutex_enter(&buf_pool.mutex);
+  ut_ad(request.is_LRU() ? buf_pool.n_flush_LRU : buf_pool.n_flush_list);
+
+  mysql_mutex_lock(&buf_pool.mutex);
   bpage->set_io_fix(BUF_IO_NONE);
-  buf_flush_write_complete(bpage, request.flush_type(), dblwr);
+  mysql_mutex_lock(&buf_pool.flush_list_mutex);
+  buf_flush_remove(bpage);
+  mysql_mutex_unlock(&buf_pool.flush_list_mutex);
+
+  if (dblwr)
+  {
+    ut_ad(!fsp_is_system_temporary(bpage->id().space()));
+    buf_dblwr.write_completed();
+  }
 
   /* Because this thread which does the unlocking might not be the same that
   did the locking, we use a pass value != 0 in unlock, which simply
@@ -405,10 +363,19 @@ void buf_page_write_complete(buf_page_t *bpage, const IORequest &request,
 
   buf_pool.stat.n_pages_written++;
 
-  if (evict)
+  if (request.is_LRU())
+  {
     buf_LRU_free_page(bpage, true);
+    if (!--buf_pool.n_flush_LRU)
+      mysql_cond_broadcast(&buf_pool.done_flush_LRU);
+  }
+  else
+  {
+    if (!--buf_pool.n_flush_list)
+      mysql_cond_broadcast(&buf_pool.done_flush_list);
+  }
 
-  mutex_exit(&buf_pool.mutex);
+  mysql_mutex_unlock(&buf_pool.mutex);
 }
 
 /** Calculate a ROW_FORMAT=COMPRESSED page checksum and update the page.
@@ -803,72 +770,47 @@ static void buf_release_freed_page(buf_page_t *bpage)
 {
   ut_ad(bpage->in_file());
   const bool uncompressed= bpage->state() == BUF_BLOCK_FILE_PAGE;
-  mutex_enter(&buf_pool.mutex);
+  mysql_mutex_lock(&buf_pool.mutex);
   bpage->set_io_fix(BUF_IO_NONE);
   bpage->status= buf_page_t::NORMAL;
+  mysql_mutex_lock(&buf_pool.flush_list_mutex);
   buf_flush_remove(bpage);
+  mysql_mutex_unlock(&buf_pool.flush_list_mutex);
 
   if (uncompressed)
     rw_lock_sx_unlock_gen(&reinterpret_cast<buf_block_t*>(bpage)->lock,
                           BUF_IO_WRITE);
 
   buf_LRU_free_page(bpage, true);
-  mutex_exit(&buf_pool.mutex);
+  mysql_mutex_unlock(&buf_pool.mutex);
 }
 
 /** Write a flushable page from buf_pool to a file.
 buf_pool.mutex must be held.
 @param bpage       buffer control block
-@param flush_type  type of flush
-@param space       tablespace (or nullptr if not known)
-@param sync        whether this is a synchronous request
-                   (only for flush_type=SINGLE_PAGE)
+@param lru         true=buf_pool.LRU; false=buf_pool.flush_list
+@param space       tablespace
 @return whether the page was flushed and buf_pool.mutex was released */
-bool buf_flush_page(buf_page_t *bpage, IORequest::flush_t flush_type,
-                    fil_space_t *space, bool sync)
+static bool buf_flush_page(buf_page_t *bpage, bool lru, fil_space_t *space)
 {
   ut_ad(bpage->in_file());
   ut_ad(bpage->ready_for_flush());
-  ut_ad(!sync || flush_type == IORequest::SINGLE_PAGE);
-  ut_ad(mutex_own(&buf_pool.mutex));
 
   rw_lock_t *rw_lock;
-  bool no_fix_count= bpage->buf_fix_count() == 0;
 
   if (bpage->state() != BUF_BLOCK_FILE_PAGE)
     rw_lock= nullptr;
-  else if (!(no_fix_count || flush_type == IORequest::FLUSH_LIST) ||
-           (!no_fix_count && srv_shutdown_state <= SRV_SHUTDOWN_CLEANUP &&
-            fsp_is_system_temporary(bpage->id().space())))
-    /* This is a heuristic, to avoid expensive SX attempts. */
-    /* For table residing in temporary tablespace sync is done
-    using IO_FIX and so before scheduling for flush ensure that
-    page is not fixed. */
-    return false;
   else
   {
     rw_lock= &reinterpret_cast<buf_block_t*>(bpage)->lock;
-    if (flush_type != IORequest::FLUSH_LIST &&
-        !rw_lock_sx_lock_nowait(rw_lock, BUF_IO_WRITE))
+    if (!rw_lock_sx_lock_nowait(rw_lock, BUF_IO_WRITE))
       return false;
   }
 
-  /* We are committed to flushing by the time we get here */
   bpage->set_io_fix(BUF_IO_WRITE);
-  mutex_exit(&buf_pool.mutex);
-
-  if (flush_type == IORequest::FLUSH_LIST && rw_lock &&
-      !rw_lock_sx_lock_nowait(rw_lock, BUF_IO_WRITE))
-  {
-    if (!fsp_is_system_temporary(bpage->id().space()))
-      /* Avoid a potential deadlock with the doublewrite buffer,
-      which might be holding another buf_block_t::lock. */
-      buf_dblwr_flush_buffered_writes();
-    else
-      os_aio_wait_until_no_pending_writes();
-
-    rw_lock_sx_lock_gen(rw_lock, BUF_IO_WRITE);
-  }
+  buf_flush_page_count++;
+  mysql_mutex_unlock(&buf_pool.mutex);
+  mysql_mutex_assert_not_owner(&buf_pool.flush_list_mutex);
 
   /* We are holding rw_lock = buf_block_t::lock in SX mode except if
   this is a ROW_FORMAT=COMPRESSED page whose uncompressed page frame
@@ -877,38 +819,21 @@ bool buf_flush_page(buf_page_t *bpage, IORequest::flush_t flush_type,
   Apart from possible rw_lock protection, bpage is also protected by
   io_fix and oldest_modification()!=0. Thus, it cannot be relocated in
   the buffer pool or removed from flush_list or LRU_list. */
-#if 0 /* rw_lock_own() does not hold because we passed BUF_IO_WRITE above. */
-  ut_ad(!rw_lock || rw_lock_own(rw_lock, RW_LOCK_SX));
-#endif
 
-  const fil_space_t * const provided_space= space;
-  if (!space)
-  {
-    space= fil_space_acquire_for_io(bpage->id().space());
-    if (UNIV_UNLIKELY(!space))
-    {
-      mutex_enter(&buf_pool.mutex);
-      bpage->status= buf_page_t::NORMAL;
-      bpage->set_io_fix(BUF_IO_NONE);
-      if (rw_lock)
-        rw_lock_sx_unlock_gen(rw_lock, BUF_IO_WRITE);
-      return false;
-    }
-  }
   ut_ad((space->purpose == FIL_TYPE_TEMPORARY) ==
         (space == fil_system.temp_space));
+  ut_ad(space->purpose == FIL_TYPE_TABLESPACE ||
+        space->atomic_write_supported);
 
-  const bool full_crc32= space->full_crc32();
-
-  DBUG_PRINT("ib_buf", ("flush %s %u page %u:%u",
-                        sync ? "sync" : "async", (unsigned) flush_type,
+  DBUG_PRINT("ib_buf", ("%s %u page %u:%u",
                        lru ? "LRU" : "flush_list",
                         bpage->id().space(), bpage->id().page_no()));
-  ut_ad(!mutex_own(&buf_pool.mutex));
-  ut_ad(!mutex_own(&buf_pool.flush_list_mutex));
   ut_ad(bpage->io_fix() == BUF_IO_WRITE);
   ut_ad(bpage->oldest_modification());
   ut_ad(bpage->state() ==
         (rw_lock ? BUF_BLOCK_FILE_PAGE : BUF_BLOCK_ZIP_PAGE));
+  ut_ad(ULINT_UNDEFINED >
+        (lru ? buf_pool.n_flush_LRU : buf_pool.n_flush_list));
 
   /* Because bpage->status can only be changed while buf_block_t
   exists, it cannot be modified for ROW_FORMAT=COMPRESSED pages
@@ -917,22 +842,29 @@ bool buf_flush_page(buf_page_t *bpage, IORequest::flush_t flush_type,
   is protected even if !rw_lock. */
   const auto status= bpage->status;
 
-  if (status != buf_page_t::FREED)
+  buf_block_t *block= reinterpret_cast<buf_block_t*>(bpage);
+  page_t *frame= bpage->zip.data;
+
+  if (UNIV_LIKELY(space->purpose == FIL_TYPE_TABLESPACE))
   {
-    switch (buf_pool.n_flush[flush_type]++) {
-    case 0:
-      os_event_reset(buf_pool.no_flush[flush_type]);
-      break;
-#ifdef UNIV_DEBUG
-    case ULINT_UNDEFINED:
-      ut_error;
-      break;
-#endif
+    const lsn_t lsn= mach_read_from_8(my_assume_aligned<8>
+                                      (FIL_PAGE_LSN +
+                                       (frame ? frame : block->frame)));
+    ut_ad(lsn);
+    ut_ad(lsn >= bpage->oldest_modification());
+    ut_ad(!srv_read_only_mode);
+    if (UNIV_UNLIKELY(lsn > log_sys.get_flushed_lsn()))
+    {
+      if (rw_lock)
+        rw_lock_sx_unlock_gen(rw_lock, BUF_IO_WRITE);
+      mysql_mutex_lock(&buf_pool.mutex);
+      bpage->set_io_fix(BUF_IO_NONE);
+      return false;
     }
   }
 
-  page_t *frame= bpage->zip.data;
   size_t size, orig_size;
+  ulint type= IORequest::WRITE;
 
   if (UNIV_UNLIKELY(!rw_lock)) /* ROW_FORMAT=COMPRESSED */
   {
@@ -948,43 +880,37 @@ bool buf_flush_page(buf_page_t *bpage, IORequest::flush_t flush_type,
   }
   else
   {
-    buf_block_t *block= reinterpret_cast<buf_block_t*>(bpage);
     byte *page= block->frame;
     orig_size= size= block->physical_size();
 
-    if (status != buf_page_t::FREED)
+    if (status == buf_page_t::FREED);
+    else if (space->full_crc32())
+    {
+      /* innodb_checksum_algorithm=full_crc32 is not implemented for
+      ROW_FORMAT=COMPRESSED pages. */
+      ut_ad(!frame);
+      page= buf_page_encrypt(space, bpage, page, &size);
+      buf_flush_init_for_writing(block, page, nullptr, true);
+    }
+    else
     {
-      if (full_crc32)
-      {
-        /* innodb_checksum_algorithm=full_crc32 is not implemented for
-        ROW_FORMAT=COMPRESSED pages. */
-        ut_ad(!frame);
-        page= buf_page_encrypt(space, bpage, page, &size);
-      }
-
-      buf_flush_init_for_writing(block, page, frame ? &bpage->zip : nullptr,
-                                 full_crc32);
-
-      if (!full_crc32)
-        page= buf_page_encrypt(space, bpage, frame ? frame : page, &size);
+      buf_flush_init_for_writing(block, page, frame ? &bpage->zip : nullptr,
+                                 false);
+      page= buf_page_encrypt(space, bpage, frame ? frame : page, &size);
     }
+#if defined HAVE_FALLOC_PUNCH_HOLE_AND_KEEP_SIZE || defined _WIN32
+    if (size != orig_size && space->punch_hole)
+      type|= IORequest::PUNCH_HOLE;
+#else
+    DBUG_EXECUTE_IF("ignore_punch_hole",
+                    if (size != orig_size && space->punch_hole)
+                      type|= IORequest::PUNCH_HOLE;);
+#endif
     frame= page;
   }
 
-  if (UNIV_LIKELY(space->purpose == FIL_TYPE_TABLESPACE))
-  {
-    const lsn_t lsn= mach_read_from_8(frame + FIL_PAGE_LSN);
-    ut_ad(lsn);
-    ut_ad(lsn >= bpage->oldest_modification());
-    ut_ad(!srv_read_only_mode);
-    log_write_up_to(lsn, true);
-  }
-  else
-    ut_ad(space->atomic_write_supported);
-
-  bool use_doublewrite;
-  IORequest request(IORequest::WRITE, bpage, flush_type);
+  IORequest request(type, bpage, lru);
 
   ut_ad(status == bpage->status);
 
@@ -992,49 +918,29 @@
   default:
     ut_ad(status == buf_page_t::FREED);
     buf_release_freed_page(bpage);
-    goto done;
+    break;
   case buf_page_t::NORMAL:
-    use_doublewrite= space->use_doublewrite();
-
-    if (use_doublewrite)
+    if (space->use_doublewrite())
    {
      ut_ad(!srv_read_only_mode);
-      if (flush_type == IORequest::SINGLE_PAGE)
-        buf_dblwr->write_single_page(bpage, sync, size);
+      if (lru)
+        buf_pool.n_flush_LRU++;
      else
-        buf_dblwr->add_to_batch(bpage, flush_type, size);
+        buf_pool.n_flush_list++;
+      buf_dblwr.add_to_batch(bpage, lru, size);
      break;
    }
    /* fall through */
   case buf_page_t::INIT_ON_FLUSH:
-    use_doublewrite= false;
-    if (size != orig_size)
-      request.set_punch_hole();
+    if (lru)
+      buf_pool.n_flush_LRU++;
+    else
+      buf_pool.n_flush_list++;
    /* FIXME: pass space to fil_io() */
-    fil_io_t fio= fil_io(request, sync, bpage->id(), bpage->zip_size(), 0,
-                         bpage->physical_size(), frame, bpage);
-    ut_ad(!fio.node || fio.node->space == space);
-    if (fio.node && sync)
-      fio.node->space->release_for_io();
+    fil_io(request, false, bpage->id(), bpage->zip_size(), 0,
+           bpage->physical_size(), frame, bpage);
  }
 
-  if (sync)
-  {
-    ut_ad(bpage->io_fix() == BUF_IO_WRITE);
-
-    /* When flushing single page synchronously, we flush the changes
-    only for the tablespace we are working on. */
-    if (space->purpose != FIL_TYPE_TEMPORARY)
-      fil_flush(space);
-
-    if (size != orig_size && space->punch_hole)
-      request.set_punch_hole();
-    buf_page_write_complete(bpage, request, use_doublewrite, true/*evict*/);
-  }
-
-done:
-  if (!provided_space)
-    space->release_for_io();
   /* Increment the I/O operation count used for selecting LRU policy. */
   buf_LRU_stat_inc_io();
   return true;
@@ -1042,15 +948,15 @@ done:
 
 /** Check whether a page can be flushed from the buf_pool.
 @param id          page identifier
-@param flush       LRU or FLUSH_LIST
+@param fold        id.fold()
+@param lru         true=buf_pool.LRU; false=buf_pool.flush_list
 @return whether the page can be flushed */
-static bool buf_flush_check_neighbor(const page_id_t id,
-                                     IORequest::flush_t flush)
+static bool buf_flush_check_neighbor(const page_id_t id, ulint fold, bool lru)
 {
-  ut_ad(flush == IORequest::LRU || flush == IORequest::FLUSH_LIST);
-  ut_ad(mutex_own(&buf_pool.mutex));
+  mysql_mutex_assert_owner(&buf_pool.mutex);
+  ut_ad(fold == id.fold());
 
-  buf_page_t *bpage= buf_pool.page_hash_get_low(id, id.fold());
+  buf_page_t *bpage= buf_pool.page_hash_get_low(id, fold);
 
   if (!bpage || buf_pool.watch_is_sentinel(*bpage))
     return false;
@@ -1058,22 +964,20 @@ static bool buf_flush_check_neighbor(const page_id_t id,
   /* We avoid flushing 'non-old' blocks in an LRU flush, because the
   flushed blocks are soon freed */
-  return (flush != IORequest::LRU || bpage->is_old()) &&
-    bpage->ready_for_flush();
+  return (!lru || bpage->is_old()) && bpage->ready_for_flush();
 }
 
 /** Check which neighbors of a page can be flushed from the buf_pool.
 @param space       tablespace
 @param id          page identifier of a dirty page
 @param contiguous  whether to consider contiguous areas of pages
-@param flush       LRU or FLUSH_LIST
+@param lru         true=buf_pool.LRU; false=buf_pool.flush_list
 @return last page number that can be flushed */
 static page_id_t buf_flush_check_neighbors(const fil_space_t &space,
                                            page_id_t &id, bool contiguous,
-                                           IORequest::flush_t flush)
+                                           bool lru)
 {
   ut_ad(id.page_no() < space.size);
-  ut_ad(flush == IORequest::LRU || flush == IORequest::FLUSH_LIST);
   /* When flushed, dirty blocks are searched in neighborhoods of this
   size, and flushed along with the original page. */
   const ulint s= buf_pool.curr_size / 16;
@@ -1095,7 +999,7 @@ static page_id_t buf_flush_check_neighbors(const fil_space_t &space,
   /* Determine the contiguous dirty area around id. */
   const ulint id_fold= id.fold();
 
-  mutex_enter(&buf_pool.mutex);
+  mysql_mutex_lock(&buf_pool.mutex);
 
   if (id > low)
   {
@@ -1103,8 +1007,7 @@ static page_id_t buf_flush_check_neighbors(const fil_space_t &space,
     for (page_id_t i= id - 1;; --i)
     {
       fold--;
-      ut_ad(i.fold() == fold);
-      if (!buf_flush_check_neighbor(i, flush))
+      if (!buf_flush_check_neighbor(i, fold, lru))
      {
        low= i + 1;
        break;
@@ -1120,12 +1023,11 @@ static page_id_t buf_flush_check_neighbors(const fil_space_t &space,
   while (++i < high)
   {
     ++fold;
-    ut_ad(i.fold() == fold);
-    if (!buf_flush_check_neighbor(i, flush))
+    if (!buf_flush_check_neighbor(i, fold, lru))
      break;
   }
 
-  mutex_exit(&buf_pool.mutex);
+  mysql_mutex_unlock(&buf_pool.mutex);
   return i;
 }
 
@@ -1183,100 +1085,67 @@ static void buf_flush_freed_pages(fil_space_t *space)
 
 /** Flushes to disk all flushable pages within the flush area
 and also write zeroes or punch the hole for the freed ranges of pages.
-@param[in]	page_id		page id
-@param[in]	flush		LRU or FLUSH_LIST
-@param[in]	n_flushed	number of pages flushed so far in this batch
-@param[in]	n_to_flush	maximum number of pages we are allowed to flush
+@param space       tablespace
+@param page_id     page identifier
+@param contiguous  whether to consider contiguous areas of pages
+@param lru         true=buf_pool.LRU; false=buf_pool.flush_list
+@param n_flushed   number of pages flushed so far in this batch
+@param n_to_flush  maximum number of pages we are allowed to flush
 @return number of pages flushed */
-static
-ulint
-buf_flush_try_neighbors(
-	const page_id_t		page_id,
-	IORequest::flush_t	flush,
-	ulint			n_flushed,
-	ulint			n_to_flush)
+static ulint buf_flush_try_neighbors(fil_space_t *space,
+                                     const page_id_t page_id,
+                                     bool contiguous, bool lru,
+                                     ulint n_flushed, ulint n_to_flush)
 {
-	ulint		count = 0;
-
-	ut_ad(flush == IORequest::LRU || flush == IORequest::FLUSH_LIST);
-	fil_space_t* space = fil_space_acquire_for_io(page_id.space());
-	if (!space) {
-		return 0;
-	}
-
-	/* Flush the freed ranges while flushing the neighbors */
-	buf_flush_freed_pages(space);
-
-	const auto neighbors= srv_flush_neighbors;
+  ut_ad(space->id == page_id.space());
 
-	page_id_t	id = page_id;
-	page_id_t	high = (!neighbors
-			|| UT_LIST_GET_LEN(buf_pool.LRU)
-			< BUF_LRU_OLD_MIN_LEN
-			|| !space->is_rotational())
-		? id + 1 /* Flush the minimum. */
-		: buf_flush_check_neighbors(*space, id, neighbors == 1, flush);
-
-	ut_ad(page_id >= id);
-	ut_ad(page_id < high);
-
-	for (; id < high; ++id) {
-		buf_page_t*	bpage;
-
-		if ((count + n_flushed) >= n_to_flush) {
-
-			/* We have already flushed enough pages and
-			should call it a day. There is, however, one
-			exception. If the page whose neighbors we
-			are flushing has not been flushed yet then
-			we'll try to flush the victim that we
-			selected originally. */
-			if (id <= page_id) {
-				id = page_id;
-			} else {
-				break;
-			}
-		}
-
-		const ulint fold = id.fold();
-
-		mutex_enter(&buf_pool.mutex);
-
-		bpage = buf_pool.page_hash_get_low(id, fold);
+  ulint count= 0;
+  page_id_t id= page_id;
+  page_id_t high= buf_flush_check_neighbors(*space, id, contiguous, lru);
 
-		if (bpage == NULL) {
-			mutex_exit(&buf_pool.mutex);
-			continue;
-		}
+  ut_ad(page_id >= id);
+  ut_ad(page_id < high);
 
-		ut_a(bpage->in_file());
+  for (ulint id_fold= id.fold(); id < high; ++id, ++id_fold)
+  {
+    if (count + n_flushed >= n_to_flush)
+    {
+      if (id > page_id)
+        break;
+      /* If the page whose neighbors we are flushing has not been
+      flushed yet, we must flush the page that we selected originally. */
+      id= page_id;
+      id_fold= id.fold();
+    }
 
-		/* We avoid flushing 'non-old' blocks in an LRU flush,
-		because the flushed blocks are soon freed */
+    mysql_mutex_lock(&buf_pool.mutex);
 
-		if (flush != IORequest::LRU
-		    || id == page_id || bpage->is_old()) {
-			if (bpage->ready_for_flush()
-			    && (id == page_id || bpage->buf_fix_count() == 0)
-			    && buf_flush_page(bpage, flush, space, false)) {
-				++count;
-				continue;
-			}
-		}
-		mutex_exit(&buf_pool.mutex);
-	}
+    if (buf_page_t *bpage= buf_pool.page_hash_get_low(id, id_fold))
+    {
+      ut_ad(bpage->in_file());
+      /* We avoid flushing 'non-old' blocks in an LRU flush,
+      because the flushed blocks are soon freed */
+      if (!lru || id == page_id || bpage->is_old())
+      {
+        if (bpage->ready_for_flush() && buf_flush_page(bpage, lru, space))
+        {
+          ++count;
+          continue;
+        }
+      }
+    }
 
-	space->release_for_io();
+    mysql_mutex_unlock(&buf_pool.mutex);
  }
 
-	if (count > 1) {
-		MONITOR_INC_VALUE_CUMULATIVE(
-			MONITOR_FLUSH_NEIGHBOR_TOTAL_PAGE,
-			MONITOR_FLUSH_NEIGHBOR_COUNT,
-			MONITOR_FLUSH_NEIGHBOR_PAGES,
-			(count - 1));
-	}
+  if (auto n= count - 1)
+  {
+    MONITOR_INC_VALUE_CUMULATIVE(MONITOR_FLUSH_NEIGHBOR_TOTAL_PAGE,
+                                 MONITOR_FLUSH_NEIGHBOR_COUNT,
+                                 MONITOR_FLUSH_NEIGHBOR_PAGES, n);
+  }
 
-	return(count);
+  return count;
 }
 /*******************************************************************//**
@@ -1293,17 +1162,16 @@ static ulint buf_free_from_unzip_LRU_list_batch(ulint max)
 {
 	ulint		scanned = 0;
 	ulint		count = 0;
-	ulint		free_len = UT_LIST_GET_LEN(buf_pool.free);
-	ulint		lru_len = UT_LIST_GET_LEN(buf_pool.unzip_LRU);
 
-	ut_ad(mutex_own(&buf_pool.mutex));
+	mysql_mutex_assert_owner(&buf_pool.mutex);
 
 	buf_block_t*	block = UT_LIST_GET_LAST(buf_pool.unzip_LRU);
 
-	while (block != NULL
+	while (block
 	       && count < max
-	       && free_len < srv_LRU_scan_depth
-	       && lru_len > UT_LIST_GET_LEN(buf_pool.LRU) / 10) {
+	       && UT_LIST_GET_LEN(buf_pool.free) < srv_LRU_scan_depth
+	       && UT_LIST_GET_LEN(buf_pool.unzip_LRU)
+	       > UT_LIST_GET_LEN(buf_pool.LRU) / 10) {
 
 		++scanned;
 		if (buf_LRU_free_page(&block->page, false)) {
@@ -1311,14 +1179,12 @@ static ulint buf_free_from_unzip_LRU_list_batch(ulint max)
 			released and reacquired */
 			++count;
 			block = UT_LIST_GET_LAST(buf_pool.unzip_LRU);
-			free_len = UT_LIST_GET_LEN(buf_pool.free);
-			lru_len = UT_LIST_GET_LEN(buf_pool.unzip_LRU);
 		} else {
 			block = UT_LIST_GET_PREV(unzip_LRU, block);
 		}
 	}
 
-	ut_ad(mutex_own(&buf_pool.mutex));
+	mysql_mutex_assert_owner(&buf_pool.mutex);
 
 	if (scanned) {
 		MONITOR_INC_VALUE_CUMULATIVE(
@@ -1331,23 +1197,42 @@ static ulint buf_free_from_unzip_LRU_list_batch(ulint max)
 	return(count);
 }
 
-/** Flush dirty blocks from the end of the LRU list.
-The calling thread is not allowed to own any latches on pages!
+/** Start writing out pages for a tablespace.
+@param id   tablespace identifier
+@return tablespace
+@retval nullptr if the pages for this tablespace should be discarded */
+static fil_space_t *buf_flush_space(const uint32_t id)
+{
+  fil_space_t *space= fil_space_acquire_for_io(id);
+  if (space)
+    buf_flush_freed_pages(space);
+  return space;
+}
+
+struct flush_counters_t
+{
+  /** number of dirty pages flushed */
+  ulint flushed;
+  /** number of clean pages evicted */
+  ulint evicted;
+};
 
-@param[in]	max	desired number of blocks to make available
-			in the free list (best effort; not guaranteed)
-@param[out]	n	counts of flushed and evicted pages */
+/** Flush dirty blocks from the end of the LRU list.
+@param max    maximum number of blocks to make available in buf_pool.free
+@param n      counts of flushed and evicted pages */
 static void buf_flush_LRU_list_batch(ulint max, flush_counters_t *n)
 {
   ulint scanned= 0;
   ulint free_limit= srv_LRU_scan_depth;
-  n->flushed = 0;
-  n->evicted = 0;
-  n->unzip_LRU_evicted = 0;
-  ut_ad(mutex_own(&buf_pool.mutex));
+
+  mysql_mutex_assert_owner(&buf_pool.mutex);
   if (buf_pool.withdraw_target && buf_pool.curr_size < buf_pool.old_size)
     free_limit+= buf_pool.withdraw_target - UT_LIST_GET_LEN(buf_pool.withdraw);
 
+  const auto neighbors= UT_LIST_GET_LEN(buf_pool.LRU) < BUF_LRU_OLD_MIN_LEN
+    ? 0 : srv_flush_neighbors;
+  fil_space_t *space= nullptr;
+
   for (buf_page_t *bpage= UT_LIST_GET_LAST(buf_pool.LRU);
        bpage && n->flushed + n->evicted < max &&
        UT_LIST_GET_LEN(buf_pool.LRU) > BUF_LRU_MIN_LEN &&
@@ -1369,10 +1254,28 @@ static void buf_flush_LRU_list_batch(ulint max, flush_counters_t *n)
       /* Block is ready for flush. Dispatch an IO request. The IO
      helper thread will put it on free list in IO completion routine. */
       const page_id_t page_id(bpage->id());
-      mutex_exit(&buf_pool.mutex);
-      n->flushed+= buf_flush_try_neighbors(page_id, IORequest::LRU, n->flushed,
-                                           max);
-      mutex_enter(&buf_pool.mutex);
+      const uint32_t space_id= page_id.space();
+      if (!space || space->id != space_id)
+      {
+        if (space)
+          space->release_for_io();
+        space= buf_flush_space(space_id);
+        if (!space)
+          continue;
+      }
+      if (neighbors && space->is_rotational())
+      {
+        mysql_mutex_unlock(&buf_pool.mutex);
+        n->flushed+= buf_flush_try_neighbors(space, page_id, neighbors == 1,
+                                             true, n->flushed, max);
+reacquire_mutex:
+        mysql_mutex_lock(&buf_pool.mutex);
+      }
+      else if (buf_flush_page(bpage, true, space))
+      {
+        ++n->flushed;
+        goto reacquire_mutex;
+      }
     }
     else
      /* Can't evict or dispatch this block. Go to previous. */
@@ -1381,18 +1284,16 @@ static void buf_flush_LRU_list_batch(ulint max, flush_counters_t *n)
 
   buf_pool.lru_hp.set(nullptr);
 
+  if (space)
+    space->release_for_io();
+
   /* We keep track of all flushes happening as part of LRU
   flush. When estimating the desired rate at which flush_list
   should be flushed, we factor in this value. */
   buf_lru_flush_page_count+= n->flushed;
 
-  ut_ad(mutex_own(&buf_pool.mutex));
+  mysql_mutex_assert_owner(&buf_pool.mutex);
 
-  if (n->evicted)
-    MONITOR_INC_VALUE_CUMULATIVE(MONITOR_LRU_BATCH_EVICT_TOTAL_PAGE,
-                                 MONITOR_LRU_BATCH_EVICT_COUNT,
-                                 MONITOR_LRU_BATCH_EVICT_PAGES,
-                                 n->evicted);
   if (scanned)
     MONITOR_INC_VALUE_CUMULATIVE(MONITOR_LRU_BATCH_SCANNED,
                                  MONITOR_LRU_BATCH_SCANNED_NUM_CALL,
@@ -1402,44 +1303,49 @@ static void buf_flush_LRU_list_batch(ulint max, flush_counters_t *n)
 
 /** Flush and move pages from LRU or unzip_LRU list to the free list.
 Whether LRU or unzip_LRU is used depends on the state of the system.
-@param[in] max desired number of blocks to make available - in the free list (best effort; not guaranteed) -@param[out] n counts of flushed and evicted pages */ -static void buf_do_LRU_batch(ulint max, flush_counters_t* n) +@param max maximum number of blocks to make available in buf_pool.free +@return number of flushed pages */ +static ulint buf_do_LRU_batch(ulint max) { - n->unzip_LRU_evicted = buf_LRU_evict_from_unzip_LRU() - ? buf_free_from_unzip_LRU_list_batch(max) : 0; - - if (max > n->unzip_LRU_evicted) { - buf_flush_LRU_list_batch(max - n->unzip_LRU_evicted, n); - } else { - n->evicted = 0; - n->flushed = 0; - } + const ulint n_unzip_LRU_evicted= buf_LRU_evict_from_unzip_LRU() + ? buf_free_from_unzip_LRU_list_batch(max) + : 0; + flush_counters_t n; + n.flushed= 0; + n.evicted= n_unzip_LRU_evicted; + buf_flush_LRU_list_batch(max, &n); + + if (const ulint evicted= n.evicted - n_unzip_LRU_evicted) + { + MONITOR_INC_VALUE_CUMULATIVE(MONITOR_LRU_BATCH_EVICT_TOTAL_PAGE, + MONITOR_LRU_BATCH_EVICT_COUNT, + MONITOR_LRU_BATCH_EVICT_PAGES, + evicted); + } - /* Add evicted pages from unzip_LRU to the evicted pages from - the simple LRU. */ - n->evicted += n->unzip_LRU_evicted; + return n.flushed; } /** This utility flushes dirty blocks from the end of the flush_list. The calling thread is not allowed to own any latches on pages! 
-@param[in] min_n wished minimum mumber of blocks flushed (it is -not guaranteed that the actual number is that big, though) -@param[in] lsn_limit all blocks whose oldest_modification is smaller -than this should be flushed (if their number does not exceed min_n) +@param max_n maximum number of blocks to flush +@param lsn once an oldest_modification>=lsn is found, terminate the batch @return number of blocks for which the write request was queued */ -static ulint buf_do_flush_list_batch(ulint min_n, lsn_t lsn_limit) +static ulint buf_do_flush_list_batch(ulint max_n, lsn_t lsn) { ulint count= 0; ulint scanned= 0; - ut_ad(mutex_own(&buf_pool.mutex)); + mysql_mutex_assert_owner(&buf_pool.mutex); + + const auto neighbors= UT_LIST_GET_LEN(buf_pool.LRU) < BUF_LRU_OLD_MIN_LEN + ? 0 : srv_flush_neighbors; + fil_space_t *space= nullptr; /* Start from the end of the list looking for a suitable block to be flushed. */ - mutex_enter(&buf_pool.flush_list_mutex); - ulint len = UT_LIST_GET_LEN(buf_pool.flush_list); + mysql_mutex_lock(&buf_pool.flush_list_mutex); + ulint len= UT_LIST_GET_LEN(buf_pool.flush_list); /* In order not to degenerate this scan to O(n*n) we attempt to preserve pointer of previous block in the flush list. To do so we @@ -1447,17 +1353,17 @@ must check the hazard pointer and if it is removing the same block then it must reset it. 
*/ for (buf_page_t *bpage= UT_LIST_GET_LAST(buf_pool.flush_list); - bpage && len && count < min_n; + bpage && len && count < max_n; bpage= buf_pool.flush_hp.get(), ++scanned, len--) { const lsn_t oldest_modification= bpage->oldest_modification(); - if (oldest_modification >= lsn_limit) + if (oldest_modification >= lsn) break; - ut_a(oldest_modification); + ut_ad(oldest_modification); buf_page_t *prev= UT_LIST_GET_PREV(list, bpage); buf_pool.flush_hp.set(prev); - mutex_exit(&buf_pool.flush_list_mutex); + mysql_mutex_unlock(&buf_pool.flush_list_mutex); ut_ad(bpage->in_file()); const bool flushed= bpage->ready_for_flush(); @@ -1465,18 +1371,39 @@ static ulint buf_do_flush_list_batch(ulint min_n, lsn_t lsn_limit) if (flushed) { const page_id_t page_id(bpage->id()); - mutex_exit(&buf_pool.mutex); - count+= buf_flush_try_neighbors(page_id, IORequest::FLUSH_LIST, - count, min_n); - mutex_enter(&buf_pool.mutex); + const uint32_t space_id= page_id.space(); + if (!space || space->id != space_id) + { + if (space) + space->release_for_io(); + space= buf_flush_space(space_id); + if (!space) + continue; + } + if (neighbors && space->is_rotational()) + { + mysql_mutex_unlock(&buf_pool.mutex); + count+= buf_flush_try_neighbors(space, page_id, neighbors == 1, + false, count, max_n); +reacquire_mutex: + mysql_mutex_lock(&buf_pool.mutex); + } + else if (buf_flush_page(bpage, false, space)) + { + ++count; + goto reacquire_mutex; + } } - mutex_enter(&buf_pool.flush_list_mutex); + mysql_mutex_lock(&buf_pool.flush_list_mutex); ut_ad(flushed || buf_pool.flush_hp.is_hp(prev)); } buf_pool.flush_hp.set(nullptr); - mutex_exit(&buf_pool.flush_list_mutex); + mysql_mutex_unlock(&buf_pool.flush_list_mutex); + + if (space) + space->release_for_io(); if (scanned) MONITOR_INC_VALUE_CUMULATIVE(MONITOR_FLUSH_BATCH_SCANNED, @@ -1488,160 +1415,136 @@ static ulint buf_do_flush_list_batch(ulint min_n, lsn_t lsn_limit) MONITOR_FLUSH_BATCH_COUNT, MONITOR_FLUSH_BATCH_PAGES, count); - 
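The comment above describes the hazard-pointer trick that keeps this scan from degenerating to O(n*n): before releasing the mutex, the scanner publishes the predecessor it intends to visit next, and any code that removes exactly that node must step the pointer back. The scheme can be sketched in isolation; everything below is illustrative only — plain `int` nodes in a `std::list` stand in for `buf_page_t` and `buf_pool.flush_list`, and both scanner and remover are assumed to hold the list mutex:

```cpp
#include <cassert>
#include <iterator>
#include <list>

// Sketch of the flush_hp idea (illustrative, not the InnoDB API): the
// scanner publishes the node it will visit next; whoever removes exactly
// that node must move the pointer to its predecessor, so the resumed
// scan never follows a freed block.
struct HazardPointer
{
  const int *at= nullptr;
  void set(const int *p) { at= p; }
  const int *get() const { return at; }
  bool is_hp(const int *p) const { return at == p; }
};

// Remove 'victim' from 'lst'; if the scanner had published it, retarget
// the hazard pointer to the predecessor (or clear it at the head),
// mirroring what buf_pool.flush_hp requires of every flush_list removal.
inline void remove_with_hp(std::list<int> &lst,
                           std::list<int>::iterator victim,
                           HazardPointer &hp)
{
  if (hp.is_hp(&*victim))
    hp.set(victim == lst.begin() ? nullptr : &*std::prev(victim));
  lst.erase(victim);
}
```

A backward scan would then resume from `hp.get()` after reacquiring the mutex, just as the loop above resumes from `buf_pool.flush_hp.get()` on each iteration.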
ut_ad(mutex_own(&buf_pool.mutex)); + mysql_mutex_assert_owner(&buf_pool.mutex); return count; } -/** This utility flushes dirty blocks from the end of the LRU list or -flush_list. -NOTE 1: in the case of an LRU flush the calling thread may own latches to -pages: to avoid deadlocks, this function must be written so that it cannot -end up waiting for these latches! NOTE 2: in the case of a flush list flush, -the calling thread is not allowed to own any latches on pages! -@param[in] lru true=LRU; false=FLUSH_LIST; -if !lru, then the caller must not own any latches on pages -@param[in] min_n wished minimum mumber of blocks flushed (it is -not guaranteed that the actual number is that big, though) -@param[in] lsn_limit in the case of !lru all blocks whose -@param[out] n counts of flushed and evicted pages -oldest_modification is smaller than this should be flushed (if their number -does not exceed min_n), otherwise ignored */ -static -void -buf_flush_batch( - bool lru, - ulint min_n, - lsn_t lsn_limit, - flush_counters_t* n) +/** Wait until a flush batch ends. +@param lru true=buf_pool.LRU; false=buf_pool.flush_list */ +void buf_flush_wait_batch_end(bool lru) { - ut_ad(lru || !sync_check_iterate(dict_sync_check())); - - mutex_enter(&buf_pool.mutex); + const auto &n_flush= lru ? buf_pool.n_flush_LRU : buf_pool.n_flush_list; - /* Note: The buffer pool mutex is released and reacquired within - the flush functions. */ - if (lru) { - buf_do_LRU_batch(min_n, n); - } else { - n->flushed = buf_do_flush_list_batch(min_n, lsn_limit); - n->evicted = 0; - } - - mutex_exit(&buf_pool.mutex); - - DBUG_PRINT("ib_buf", - (lru ? "LRU flush completed" : "flush_list completed")); + if (n_flush) + { + auto cond= lru ? 
&buf_pool.done_flush_LRU : &buf_pool.done_flush_list; + tpool::tpool_wait_begin(); + thd_wait_begin(nullptr, THD_WAIT_DISKIO); + do + mysql_cond_wait(cond, &buf_pool.mutex); + while (n_flush); + tpool::tpool_wait_end(); + thd_wait_end(nullptr); + mysql_cond_broadcast(cond); + } } -/******************************************************************//** -Gather the aggregated stats for both flush list and LRU list flushing. -@param page_count_flush number of pages flushed from the end of the flush_list -@param page_count_LRU number of pages flushed from the end of the LRU list -*/ -static -void -buf_flush_stats( -/*============*/ - ulint page_count_flush, - ulint page_count_LRU) -{ - DBUG_PRINT("ib_buf", ("flush completed, from flush_list %u pages, " - "from LRU_list %u pages", - unsigned(page_count_flush), - unsigned(page_count_LRU))); +/** Whether a background log flush is pending */ +static std::atomic_flag log_flush_pending; - srv_stats.buf_pool_flushed.add(page_count_flush + page_count_LRU); +/** Advance log_sys.get_flushed_lsn() */ +static void log_flush(void *) +{ + log_write_up_to(log_sys.get_lsn(), true); + log_flush_pending.clear(); } -/** Start a buffer flush batch for LRU or flush list -@param[in] lru true=buf_pool.LRU; false=buf_pool.flush_list -@return whether the flush batch was started (was not already running) */ -static bool buf_flush_start(bool lru) +static tpool::waitable_task log_flush_task(log_flush, nullptr, nullptr); + +/** Write out dirty blocks from buf_pool.flush_list. +@param max_n wished maximum number of blocks flushed +@param lsn buf_pool.get_oldest_modification(LSN_MAX) target (0=LRU flush) +@return the number of processed pages +@retval 0 if a batch of the same type (lsn==0 or lsn!=0) is already running */ +ulint buf_flush_lists(ulint max_n, lsn_t lsn) { - IORequest::flush_t flush_type= lru ? IORequest::LRU : IORequest::FLUSH_LIST; - mutex_enter(&buf_pool.mutex); + auto &n_flush= lsn ? 
buf_pool.n_flush_list : buf_pool.n_flush_LRU; + + if (n_flush) + return 0; - if (buf_pool.n_flush[flush_type] > 0 || buf_pool.init_flush[flush_type]) + if (log_sys.get_lsn() > log_sys.get_flushed_lsn()) { - /* There is already a flush batch of the same type running */ - mutex_exit(&buf_pool.mutex); - return false; + log_flush_task.wait(); + if (log_sys.get_lsn() > log_sys.get_flushed_lsn() && + !log_flush_pending.test_and_set()) + srv_thread_pool->submit_task(&log_flush_task); +#if defined UNIV_DEBUG || defined UNIV_IBUF_DEBUG + if (UNIV_UNLIKELY(ibuf_debug)) + log_write_up_to(log_sys.get_lsn(), true); +#endif } - buf_pool.init_flush[flush_type]= true; - os_event_reset(buf_pool.no_flush[flush_type]); - mutex_exit(&buf_pool.mutex); - return true; -} + auto cond= lsn ? &buf_pool.done_flush_list : &buf_pool.done_flush_LRU; -/** End a buffer flush batch. -@param[in] lru true=buf_pool.LRU; false=buf_pool.flush_list */ -static void buf_flush_end(bool lru) -{ - IORequest::flush_t flush_type= lru ? IORequest::LRU : IORequest::FLUSH_LIST; + mysql_mutex_lock(&buf_pool.mutex); + const bool running= n_flush != 0; + /* FIXME: we are performing a dirty read of buf_pool.flush_list.count + while not holding buf_pool.flush_list_mutex */ + if (running || !UT_LIST_GET_LEN(buf_pool.flush_list)) + { + mysql_mutex_unlock(&buf_pool.mutex); + if (running) + return 0; + mysql_cond_broadcast(cond); + return 0; + } + n_flush++; - mutex_enter(&buf_pool.mutex); + ulint n_flushed= lsn + ? 
buf_do_flush_list_batch(max_n, lsn) + : buf_do_LRU_batch(max_n); + + const auto n_flushing= --n_flush; - buf_pool.init_flush[flush_type]= false; buf_pool.try_LRU_scan= true; - if (!buf_pool.n_flush[flush_type]) - /* The running flush batch has ended */ - os_event_set(buf_pool.no_flush[flush_type]); + mysql_mutex_unlock(&buf_pool.mutex); - mutex_exit(&buf_pool.mutex); + if (!n_flushing) + mysql_cond_broadcast(cond); - if (!srv_read_only_mode) - buf_dblwr_flush_buffered_writes(); -} + buf_dblwr.flush_buffered_writes(); -/** Wait until a flush batch ends. -@param[in] lru true=buf_pool.LRU; false=buf_pool.flush_list */ -void buf_flush_wait_batch_end(bool lru) -{ - thd_wait_begin(nullptr, THD_WAIT_DISKIO); - os_event_wait(buf_pool.no_flush[lru - ? IORequest::LRU : IORequest::FLUSH_LIST]); - thd_wait_end(nullptr); + DBUG_PRINT("ib_buf", ("%s completed, " ULINTPF " pages", + lsn ? "flush_list" : "LRU flush", n_flushed)); + return n_flushed; } -/** Do flushing batch of a given type. -NOTE: The calling thread is not allowed to own any latches on pages! -@param[in] lru true=buf_pool.LRU; false=buf_pool.flush_list -@param[in] min_n wished minimum mumber of blocks flushed -(it is not guaranteed that the actual number is that big, though) -@param[in] lsn_limit if !lru, all blocks whose -oldest_modification is smaller than this should be flushed (if their number -does not exceed min_n), otherwise ignored -@param[out] n_processed the number of pages which were processed is -passed back to caller. Ignored if NULL -@retval true if a batch was queued successfully. -@retval false if another batch of same type was already running. */ -bool buf_flush_do_batch(bool lru, ulint min_n, lsn_t lsn_limit, - flush_counters_t *n) +/** Request IO burst and wake up the page_cleaner. 
+@param lsn desired lower bound of oldest_modification */ +static void buf_flush_request_force(lsn_t lsn) { - if (n) - n->flushed= 0; + lsn+= lsn_avg_rate * 3; - if (!buf_flush_start(lru)) - return false; + lsn_t o= 0; - buf_flush_batch(lru, min_n, lsn_limit, n); - buf_flush_end(lru); + while (!buf_flush_sync_lsn.compare_exchange_weak(o, lsn, + std::memory_order_acquire, + std::memory_order_relaxed)) + if (lsn > o) + break; - return true; + mysql_cond_signal(&buf_pool.do_flush_list); } /** Wait until a flush batch of the given lsn ends @param[in] new_oldest target oldest_modified_lsn to wait for */ void buf_flush_wait_flushed(lsn_t new_oldest) { + ut_ad(new_oldest); + + if (srv_flush_sync) { + /* wake page cleaner for IO burst */ + buf_flush_request_force(new_oldest); + } + for (;;) { /* We don't need to wait for fsync of the flushed blocks, because anyway we need fsync to make chekpoint. So, we don't need to wait for the batch end here. */ - mutex_enter(&buf_pool.flush_list_mutex); + mysql_mutex_lock(&buf_pool.flush_list_mutex); buf_page_t* bpage; /* FIXME: Keep temporary tablespace pages in a separate flush @@ -1656,7 +1559,7 @@ void buf_flush_wait_flushed(lsn_t new_oldest) lsn_t oldest = bpage ? bpage->oldest_modification() : 0; - mutex_exit(&buf_pool.flush_list_mutex); + mysql_mutex_unlock(&buf_pool.flush_list_mutex); if (oldest == 0 || oldest >= new_oldest) { break; @@ -1669,145 +1572,15 @@ void buf_flush_wait_flushed(lsn_t new_oldest) } } -/** This utility flushes dirty blocks from the end of the flush list. -NOTE: The calling thread is not allowed to own any latches on pages! -@param[in] min_n wished minimum mumber of blocks flushed (it is -not guaranteed that the actual number is that big, though) -@param[in] lsn_limit all blocks whose -oldest_modification is smaller than this should be flushed (if their number -does not exceed min_n), otherwise ignored -@param[out] n_processed the number of pages which were processed is -passed back to caller. 
Ignored if NULL. -@retval true if a batch was queued successfully -@retval false if another batch of same type was already running */ -bool buf_flush_lists(ulint min_n, lsn_t lsn_limit, ulint *n_processed) +/** Wait for pending flushes to complete. */ +void buf_flush_wait_batch_end_acquiring_mutex(bool lru) { - flush_counters_t n; - - bool success = buf_flush_do_batch(false, min_n, lsn_limit, &n); - - if (n.flushed) { - buf_flush_stats(n.flushed, 0); - } - - if (n_processed) { - *n_processed = n.flushed; - } - - return success; -} - -/******************************************************************//** -This function picks up a single page from the tail of the LRU -list, flushes it (if it is dirty), removes it from page_hash and LRU -list and puts it on the free list. It is called from user threads when -they are unable to find a replaceable page at the tail of the LRU -list i.e.: when the background LRU flushing in the page_cleaner thread -is not fast enough to keep pace with the workload. -@return true if success. */ -bool buf_flush_single_page_from_LRU() -{ - ulint scanned = 0; - bool freed = false; - - mutex_enter(&buf_pool.mutex); - - for (buf_page_t* bpage = buf_pool.single_scan_itr.start(); bpage; - ++scanned, bpage = buf_pool.single_scan_itr.get()) { - - ut_ad(mutex_own(&buf_pool.mutex)); - - buf_page_t* prev = UT_LIST_GET_PREV(LRU, bpage); - buf_pool.single_scan_itr.set(prev); - - if (!bpage->ready_for_flush()) { // FIXME: ready_for_replace() - continue; - } - - if (!bpage->buf_fix_count() - && buf_LRU_free_page(bpage, true)) { - /* block is ready for eviction i.e., it is - clean and is not IO-fixed or buffer fixed. */ - freed = true; - break; - } else { - /* Block is ready for flush. Try and dispatch an IO - request. We'll put it on free list in IO completion - routine if it is not buffer fixed. The following call - will release the buf_pool.mutex. 
- - Note: There is no guarantee that this page has actually - been freed, only that it has been flushed to disk */ - - freed = buf_flush_page(bpage, IORequest::SINGLE_PAGE, - nullptr, true); - - if (freed) { - goto found; - } - } - } - - mutex_exit(&buf_pool.mutex); -found: - if (scanned) { - MONITOR_INC_VALUE_CUMULATIVE( - MONITOR_LRU_SINGLE_FLUSH_SCANNED, - MONITOR_LRU_SINGLE_FLUSH_SCANNED_NUM_CALL, - MONITOR_LRU_SINGLE_FLUSH_SCANNED_PER_CALL, - scanned); - } - - ut_ad(!mutex_own(&buf_pool.mutex)); - return(freed); -} - -/** -Clear up the tail of the LRU list. -Put replaceable pages at the tail of LRU to the free list. -Flush dirty pages at the tail of LRU to the disk. -The depth to which we scan each buffer pool is controlled by dynamic -config parameter innodb_LRU_scan_depth. -@return total pages flushed */ -static ulint buf_flush_LRU_list() -{ - ulint scan_depth, withdraw_depth; - flush_counters_t n; - - memset(&n, 0, sizeof(flush_counters_t)); - - /* srv_LRU_scan_depth can be arbitrarily large value. - We cap it with current LRU size. */ - mutex_enter(&buf_pool.mutex); - scan_depth = UT_LIST_GET_LEN(buf_pool.LRU); - if (buf_pool.curr_size < buf_pool.old_size - && buf_pool.withdraw_target > 0) { - withdraw_depth = buf_pool.withdraw_target - - UT_LIST_GET_LEN(buf_pool.withdraw); - } else { - withdraw_depth = 0; - } - mutex_exit(&buf_pool.mutex); - if (withdraw_depth > srv_LRU_scan_depth) { - scan_depth = ut_min(withdraw_depth, scan_depth); - } else { - scan_depth = ut_min(static_cast<ulint>(srv_LRU_scan_depth), - scan_depth); - } - /* Currently one of page_cleaners is the only thread - that can trigger an LRU flush at the same time. - So, it is not possible that a batch triggered during - last iteration is still running, */ - buf_flush_do_batch(true, scan_depth, 0, &n); - - return(n.flushed); -} - -/** Wait for any possible LRU flushes to complete. 
*/ -void buf_flush_wait_LRU_batch_end() -{ - if (buf_pool.n_flush[IORequest::LRU] || buf_pool.init_flush[IORequest::LRU]) - buf_flush_wait_batch_end(true); + if (lru ? buf_pool.n_flush_LRU : buf_pool.n_flush_list) + { + mysql_mutex_lock(&buf_pool.mutex); + buf_flush_wait_batch_end(lru); + mysql_mutex_unlock(&buf_pool.mutex); + } } /*********************************************************************//** @@ -1911,7 +1684,6 @@ page_cleaner_flush_pages_recommendation(ulint last_pages_in) static ulint n_iterations = 0; static time_t prev_time; lsn_t oldest_lsn; - lsn_t cur_lsn; lsn_t age; lsn_t lsn_rate; ulint n_pages = 0; @@ -1919,7 +1691,7 @@ page_cleaner_flush_pages_recommendation(ulint last_pages_in) ulint pct_for_lsn = 0; ulint pct_total = 0; - cur_lsn = log_sys.get_lsn(); + const lsn_t cur_lsn = log_sys.get_lsn(); if (prev_lsn == 0) { /* First time around. */ @@ -1958,29 +1730,18 @@ page_cleaner_flush_pages_recommendation(ulint last_pages_in) lsn_avg_rate = (lsn_avg_rate + lsn_rate) / 2; - /* aggregate stats of all slots */ - mutex_enter(&page_cleaner.mutex); - ulint flush_tm = page_cleaner.flush_time; ulint flush_pass = page_cleaner.flush_pass; page_cleaner.flush_time = 0; page_cleaner.flush_pass = 0; - ulint lru_tm = page_cleaner.slot.flush_lru_time; ulint list_tm = page_cleaner.slot.flush_list_time; - ulint lru_pass = page_cleaner.slot.flush_lru_pass; ulint list_pass = page_cleaner.slot.flush_list_pass; - page_cleaner.slot.flush_lru_time = 0; - page_cleaner.slot.flush_lru_pass = 0; page_cleaner.slot.flush_list_time = 0; page_cleaner.slot.flush_list_pass = 0; - mutex_exit(&page_cleaner.mutex); /* minimum values are 1, to avoid dividing by zero. 
*/ - if (lru_tm < 1) { - lru_tm = 1; - } if (list_tm < 1) { list_tm = 1; } @@ -1988,9 +1749,6 @@ page_cleaner_flush_pages_recommendation(ulint last_pages_in) flush_tm = 1; } - if (lru_pass < 1) { - lru_pass = 1; - } if (list_pass < 1) { list_pass = 1; } @@ -2000,23 +1758,14 @@ page_cleaner_flush_pages_recommendation(ulint last_pages_in) MONITOR_SET(MONITOR_FLUSH_ADAPTIVE_AVG_TIME_SLOT, list_tm / list_pass); - MONITOR_SET(MONITOR_LRU_BATCH_FLUSH_AVG_TIME_SLOT, - lru_tm / lru_pass); MONITOR_SET(MONITOR_FLUSH_ADAPTIVE_AVG_TIME_THREAD, list_tm / flush_pass); - MONITOR_SET(MONITOR_LRU_BATCH_FLUSH_AVG_TIME_THREAD, - lru_tm / flush_pass); MONITOR_SET(MONITOR_FLUSH_ADAPTIVE_AVG_TIME_EST, - flush_tm * list_tm / flush_pass - / (list_tm + lru_tm)); - MONITOR_SET(MONITOR_LRU_BATCH_FLUSH_AVG_TIME_EST, - flush_tm * lru_tm / flush_pass - / (list_tm + lru_tm)); + flush_tm / flush_pass); MONITOR_SET(MONITOR_FLUSH_AVG_TIME, flush_tm / flush_pass); MONITOR_SET(MONITOR_FLUSH_ADAPTIVE_AVG_PASS, list_pass); - MONITOR_SET(MONITOR_LRU_BATCH_FLUSH_AVG_PASS, lru_pass); MONITOR_SET(MONITOR_FLUSH_AVG_PASS, flush_pass); prev_lsn = cur_lsn; @@ -2043,7 +1792,7 @@ page_cleaner_flush_pages_recommendation(ulint last_pages_in) + lsn_avg_rate * buf_flush_lsn_scan_factor; ulint pages_for_lsn = 0; - mutex_enter(&buf_pool.flush_list_mutex); + mysql_mutex_lock(&buf_pool.flush_list_mutex); for (buf_page_t* b = UT_LIST_GET_LAST(buf_pool.flush_list); b != NULL; b = UT_LIST_GET_PREV(list, b)) { @@ -2052,13 +1801,7 @@ page_cleaner_flush_pages_recommendation(ulint last_pages_in) } ++pages_for_lsn; } - mutex_exit(&buf_pool.flush_list_mutex); - - mutex_enter(&page_cleaner.mutex); - ut_ad(page_cleaner.slot.state == PAGE_CLEANER_STATE_NONE); - page_cleaner.slot.n_pages_requested - = pages_for_lsn / buf_flush_lsn_scan_factor + 1; - mutex_exit(&page_cleaner.mutex); + mysql_mutex_unlock(&buf_pool.flush_list_mutex); pages_for_lsn /= buf_flush_lsn_scan_factor; if (pages_for_lsn < 1) { @@ -2077,21 +1820,6 @@ 
page_cleaner_flush_pages_recommendation(ulint last_pages_in) n_pages = srv_max_io_capacity; } - mutex_enter(&page_cleaner.mutex); - ut_ad(page_cleaner.n_slots_requested == 0); - ut_ad(page_cleaner.n_slots_flushing == 0); - ut_ad(page_cleaner.n_slots_finished == 0); - - /* if REDO has enough of free space, - don't care about age distribution of pages */ - if (pct_for_lsn > 30) { - page_cleaner.slot.n_pages_requested *= n_pages - / pages_for_lsn + 1; - } else { - page_cleaner.slot.n_pages_requested = n_pages; - } - mutex_exit(&page_cleaner.mutex); - MONITOR_SET(MONITOR_FLUSH_N_TO_FLUSH_REQUESTED, n_pages); MONITOR_SET(MONITOR_FLUSH_N_TO_FLUSH_BY_AGE, pages_for_lsn); @@ -2104,229 +1832,28 @@ page_cleaner_flush_pages_recommendation(ulint last_pages_in) return(n_pages); } -/*********************************************************************//** -Puts the page_cleaner thread to sleep if it has finished work in less -than a second -@retval 0 wake up by event set, -@retval OS_SYNC_TIME_EXCEEDED if timeout was exceeded -@param next_loop_time time when next loop iteration should start -@param sig_count zero or the value returned by previous call of - os_event_reset() -@param cur_time current time as in ut_time_ms() */ -static -ulint -pc_sleep_if_needed( -/*===============*/ - ulint next_loop_time, - int64_t sig_count, - ulint cur_time) -{ - /* No sleep if we are cleaning the buffer pool during the shutdown - with everything else finished */ - if (srv_shutdown_state == SRV_SHUTDOWN_FLUSH_PHASE) - return OS_SYNC_TIME_EXCEEDED; - - if (next_loop_time > cur_time) { - /* Get sleep interval in micro seconds. We use - ut_min() to avoid long sleep in case of wrap around. */ - ulint sleep_us; - - sleep_us = ut_min(static_cast<ulint>(1000000), - (next_loop_time - cur_time) * 1000); - - return(os_event_wait_time_low(buf_flush_event, - sleep_us, sig_count)); - } - - return(OS_SYNC_TIME_EXCEEDED); -} - -/** -Requests for all slots to flush. 
-@param min_n wished minimum mumber of blocks flushed - (it is not guaranteed that the actual number is that big) -@param lsn_limit in the case of buf_pool.flush_list all blocks whose - oldest_modification is smaller than this should be flushed - (if their number does not exceed min_n), otherwise ignored -*/ -static void pc_request(ulint min_n, lsn_t lsn_limit) -{ - mutex_enter(&page_cleaner.mutex); - - ut_ad(page_cleaner.n_slots_requested == 0); - ut_ad(page_cleaner.n_slots_flushing == 0); - ut_ad(page_cleaner.n_slots_finished == 0); - - page_cleaner.requested = (min_n > 0); - page_cleaner.lsn_limit = lsn_limit; - - ut_ad(page_cleaner.slot.state == PAGE_CLEANER_STATE_NONE); - - if (min_n == 0 || min_n == ULINT_MAX) { - page_cleaner.slot.n_pages_requested = min_n; - } - - /* page_cleaner.slot.n_pages_requested was already set by - page_cleaner_flush_pages_recommendation() */ - - page_cleaner.slot.state = PAGE_CLEANER_STATE_REQUESTED; - - page_cleaner.n_slots_requested = 1; - page_cleaner.n_slots_flushing = 0; - page_cleaner.n_slots_finished = 0; - - mutex_exit(&page_cleaner.mutex); -} - -/** -Do flush for one slot. -@return the number of the slots which has not been treated yet. */ -static ulint pc_flush_slot() +/** Initiate a flushing batch. 
+@param max_n maximum number of blocks flushed +@param lsn oldest_modification limit +@return ut_time_ms() at the start of the wait */ +static ulint pc_request_flush_slot(ulint max_n, lsn_t lsn) { - ulint lru_tm = 0; - ulint list_tm = 0; - ulint lru_pass = 0; - ulint list_pass = 0; - - mutex_enter(&page_cleaner.mutex); - - if (page_cleaner.n_slots_requested) { - ut_ad(page_cleaner.slot.state == PAGE_CLEANER_STATE_REQUESTED); - page_cleaner.n_slots_requested--; - page_cleaner.n_slots_flushing++; - page_cleaner.slot.state = PAGE_CLEANER_STATE_FLUSHING; - - if (UNIV_UNLIKELY(!page_cleaner.is_running)) { - page_cleaner.slot.n_flushed_lru = 0; - page_cleaner.slot.n_flushed_list = 0; - goto finish_mutex; - } - - mutex_exit(&page_cleaner.mutex); - - lru_tm = ut_time_ms(); - - /* Flush pages from end of LRU if required */ - page_cleaner.slot.n_flushed_lru = buf_flush_LRU_list(); - - lru_tm = ut_time_ms() - lru_tm; - lru_pass++; - - if (UNIV_UNLIKELY(!page_cleaner.is_running)) { - page_cleaner.slot.n_flushed_list = 0; - goto finish; - } - - /* Flush pages from flush_list if required */ - if (page_cleaner.requested) { - flush_counters_t n; - memset(&n, 0, sizeof(flush_counters_t)); - list_tm = ut_time_ms(); - - page_cleaner.slot.succeeded_list = buf_flush_do_batch( - false, - page_cleaner.slot.n_pages_requested, - page_cleaner.lsn_limit, - &n); - - page_cleaner.slot.n_flushed_list = n.flushed; - - list_tm = ut_time_ms() - list_tm; - list_pass++; - } else { - page_cleaner.slot.n_flushed_list = 0; - page_cleaner.slot.succeeded_list = true; - } -finish: - mutex_enter(&page_cleaner.mutex); -finish_mutex: - page_cleaner.n_slots_flushing--; - page_cleaner.n_slots_finished++; - page_cleaner.slot.state = PAGE_CLEANER_STATE_FINISHED; - - page_cleaner.slot.flush_lru_time += lru_tm; - page_cleaner.slot.flush_list_time += list_tm; - page_cleaner.slot.flush_lru_pass += lru_pass; - page_cleaner.slot.flush_list_pass += list_pass; - - if (page_cleaner.n_slots_requested == 0 - && 
page_cleaner.n_slots_flushing == 0) { - os_event_set(page_cleaner.is_finished); - } - } - - ulint ret = page_cleaner.n_slots_requested; - - mutex_exit(&page_cleaner.mutex); - - return(ret); + ut_ad(max_n); + ut_ad(lsn); + + const ulint flush_start_tm= ut_time_ms(); + page_cleaner.slot.n_flushed_list= buf_flush_lists(max_n, lsn); + page_cleaner.slot.flush_list_time+= ut_time_ms() - flush_start_tm; + page_cleaner.slot.flush_list_pass++; + return flush_start_tm; } -/** -Wait until all flush requests are finished. -@param n_flushed_lru number of pages flushed from the end of the LRU list. -@param n_flushed_list number of pages flushed from the end of the - flush_list. -@return true if all flush_list flushing batch were success. */ -static -bool -pc_wait_finished( - ulint* n_flushed_lru, - ulint* n_flushed_list) -{ - bool all_succeeded = true; - - *n_flushed_lru = 0; - *n_flushed_list = 0; - - os_event_wait(page_cleaner.is_finished); - - mutex_enter(&page_cleaner.mutex); - - ut_ad(page_cleaner.n_slots_requested == 0); - ut_ad(page_cleaner.n_slots_flushing == 0); - ut_ad(page_cleaner.n_slots_finished == 1); - - ut_ad(page_cleaner.slot.state == PAGE_CLEANER_STATE_FINISHED); - page_cleaner.slot.state = PAGE_CLEANER_STATE_NONE; - *n_flushed_lru = page_cleaner.slot.n_flushed_lru; - *n_flushed_list = page_cleaner.slot.n_flushed_list; - all_succeeded = page_cleaner.slot.succeeded_list; - page_cleaner.slot.n_pages_requested = 0; - - page_cleaner.n_slots_finished = 0; - - os_event_reset(page_cleaner.is_finished); - - mutex_exit(&page_cleaner.mutex); - - return(all_succeeded); -} - -#ifdef UNIV_LINUX -/** -Set priority for page_cleaner threads. 
-@param[in] priority priority intended to set -@return true if set as intended */ -static -bool -buf_flush_page_cleaner_set_priority( - int priority) -{ - setpriority(PRIO_PROCESS, (pid_t)syscall(SYS_gettid), - priority); - return(getpriority(PRIO_PROCESS, (pid_t)syscall(SYS_gettid)) - == priority); -} -#endif /* UNIV_LINUX */ - #ifdef UNIV_DEBUG /** Loop used to disable the page cleaner thread. */ static void buf_flush_page_cleaner_disabled_loop() { while (innodb_page_cleaner_disabled_debug - && srv_shutdown_state == SRV_SHUTDOWN_NONE - && page_cleaner.is_running) { + && srv_shutdown_state == SRV_SHUTDOWN_NONE) { os_thread_sleep(100000); } } @@ -2343,6 +1870,7 @@ static os_thread_ret_t DECLARE_THREAD(buf_flush_page_cleaner)(void*) pfs_register_thread(page_cleaner_thread_key); #endif /* UNIV_PFS_THREAD */ ut_ad(!srv_read_only_mode); + ut_ad(buf_page_cleaner_is_active); #ifdef UNIV_DEBUG_THREAD_CREATION ib::info() << "page_cleaner thread running, id " @@ -2350,300 +1878,115 @@ static os_thread_ret_t DECLARE_THREAD(buf_flush_page_cleaner)(void*) #endif /* UNIV_DEBUG_THREAD_CREATION */ #ifdef UNIV_LINUX /* linux might be able to set different setting for each thread. - worth to try to set high priority for page cleaner threads */ - if (buf_flush_page_cleaner_set_priority( - buf_flush_page_cleaner_priority)) { - - ib::info() << "page_cleaner coordinator priority: " - << buf_flush_page_cleaner_priority; - } else { + worth to try to set high priority for the page cleaner thread */ + const pid_t tid= static_cast<pid_t>(syscall(SYS_gettid)); + setpriority(PRIO_PROCESS, tid, -20); + if (getpriority(PRIO_PROCESS, tid) != -20) { ib::info() << "If the mysqld execution user is authorized," " page cleaner thread priority can be changed." 
" See the man page of setpriority()."; } #endif /* UNIV_LINUX */ - ulint ret_sleep = 0; - ulint n_evicted = 0; - ulint n_flushed_last = 0; - ulint warn_interval = 1; - ulint warn_count = 0; - int64_t sig_count = os_event_reset(buf_flush_event); - ulint next_loop_time = ut_time_ms() + 1000; + ulint curr_time = ut_time_ms(); ulint n_flushed = 0; ulint last_activity = srv_get_activity_count(); ulint last_pages = 0; - while (srv_shutdown_state <= SRV_SHUTDOWN_INITIATED) { - ulint curr_time = ut_time_ms(); + for (ulint next_loop_time = curr_time + 1000; + srv_shutdown_state <= SRV_SHUTDOWN_INITIATED; + curr_time = ut_time_ms()) { + bool sleep_timeout; /* The page_cleaner skips sleep if the server is idle and there are no pending IOs in the buffer pool and there is work to do. */ - if (!n_flushed || !buf_pool.n_pend_reads - || srv_check_activity(&last_activity)) { - - ret_sleep = pc_sleep_if_needed( - next_loop_time, sig_count, curr_time); - } else if (curr_time > next_loop_time) { - ret_sleep = OS_SYNC_TIME_EXCEEDED; - } else { - ret_sleep = 0; - } - - if (srv_shutdown_state > SRV_SHUTDOWN_INITIATED) { - break; - } - - sig_count = os_event_reset(buf_flush_event); - - if (ret_sleep == OS_SYNC_TIME_EXCEEDED) { - if (global_system_variables.log_warnings > 2 - && curr_time > next_loop_time + 3000 - && !(test_flags & TEST_SIGINT)) { - if (warn_count == 0) { - ib::info() << "page_cleaner: 1000ms" - " intended loop took " - << 1000 + curr_time - - next_loop_time - << "ms. The settings might not" - " be optimal. 
(flushed=" - << n_flushed_last - << " and evicted=" - << n_evicted - << ", during the time.)"; - if (warn_interval > 300) { - warn_interval = 600; - } else { - warn_interval *= 2; - } - - warn_count = warn_interval; - } else { - --warn_count; - } - } else { - /* reset counter */ - warn_interval = 1; - warn_count = 0; + if (next_loop_time <= curr_time) { + sleep_timeout = true; + } else if (!n_flushed || !buf_pool.n_pend_reads + || srv_check_activity(&last_activity)) { + const ulint sleep_ms = std::min<ulint>(next_loop_time + - curr_time, + 1000); + timespec abstime; + set_timespec_nsec(abstime, 1000000ULL * sleep_ms); + mysql_mutex_lock(&buf_pool.flush_list_mutex); + const auto error = mysql_cond_timedwait( + &buf_pool.do_flush_list, + &buf_pool.flush_list_mutex, + &abstime); + mysql_mutex_unlock(&buf_pool.flush_list_mutex); + sleep_timeout = error == ETIMEDOUT || error == ETIME; + if (srv_shutdown_state > SRV_SHUTDOWN_INITIATED) { + break; } - - next_loop_time = curr_time + 1000; - n_flushed_last = n_evicted = 0; + } else { + sleep_timeout = false; } - if (ret_sleep != OS_SYNC_TIME_EXCEEDED - && srv_flush_sync - && buf_flush_sync_lsn > 0) { - /* woke up for flush_sync */ - mutex_enter(&page_cleaner.mutex); - lsn_t lsn_limit = buf_flush_sync_lsn; - buf_flush_sync_lsn = 0; - mutex_exit(&page_cleaner.mutex); - - /* Request flushing for threads */ - pc_request(ULINT_MAX, lsn_limit); - - ulint tm = ut_time_ms(); + if (sleep_timeout) { + /* no activity, slept enough */ + n_flushed = buf_flush_lists(srv_io_capacity, LSN_MAX); + last_pages = n_flushed; - /* Coordinator also treats requests */ - while (pc_flush_slot() > 0) {} + if (n_flushed) { + MONITOR_INC_VALUE_CUMULATIVE( + MONITOR_FLUSH_BACKGROUND_TOTAL_PAGE, + MONITOR_FLUSH_BACKGROUND_COUNT, + MONITOR_FLUSH_BACKGROUND_PAGES, + n_flushed); - /* only coordinator is using these counters, - so no need to protect by lock. 
*/ - page_cleaner.flush_time += ut_time_ms() - tm; + } + } else if (lsn_t lsn_limit = buf_flush_sync_lsn.exchange( + 0, std::memory_order_release)) { + page_cleaner.flush_time += ut_time_ms() + - pc_request_flush_slot(ULINT_MAX, lsn_limit); page_cleaner.flush_pass++; + n_flushed = page_cleaner.slot.n_flushed_list; - /* Wait for all slots to be finished */ - ulint n_flushed_lru = 0; - ulint n_flushed_list = 0; - pc_wait_finished(&n_flushed_lru, &n_flushed_list); - - if (n_flushed_list > 0 || n_flushed_lru > 0) { - buf_flush_stats(n_flushed_list, n_flushed_lru); - + if (n_flushed) { MONITOR_INC_VALUE_CUMULATIVE( MONITOR_FLUSH_SYNC_TOTAL_PAGE, MONITOR_FLUSH_SYNC_COUNT, MONITOR_FLUSH_SYNC_PAGES, - n_flushed_lru + n_flushed_list); + n_flushed); } - - n_flushed = n_flushed_lru + n_flushed_list; - - } else if (srv_check_activity(&last_activity)) { - ulint n_to_flush; - lsn_t lsn_limit; - + } else if (!srv_check_activity(&last_activity)) { + /* no activity, but woken up by event */ + n_flushed = 0; + } else if (ulint n= page_cleaner_flush_pages_recommendation( + last_pages)) { /* Estimate pages from flush_list to be flushed */ - if (ret_sleep == OS_SYNC_TIME_EXCEEDED) { - last_activity = srv_get_activity_count(); - n_to_flush = - page_cleaner_flush_pages_recommendation( - last_pages); - lsn_limit = LSN_MAX; - } else { - n_to_flush = 0; - lsn_limit = 0; - } - - /* Request flushing for threads */ - pc_request(n_to_flush, lsn_limit); - - ulint tm = ut_time_ms(); - - /* Coordinator also treats requests */ - while (pc_flush_slot() > 0) { - /* No op */ - } + ulint tm= pc_request_flush_slot(n, LSN_MAX); - /* only coordinator is using these counters, - so no need to protect by lock. 
*/ page_cleaner.flush_time += ut_time_ms() - tm; page_cleaner.flush_pass++ ; - /* Wait for all slots to be finished */ - ulint n_flushed_lru = 0; - ulint n_flushed_list = 0; - - pc_wait_finished(&n_flushed_lru, &n_flushed_list); - - if (n_flushed_list > 0 || n_flushed_lru > 0) { - buf_flush_stats(n_flushed_list, n_flushed_lru); - } - - if (ret_sleep == OS_SYNC_TIME_EXCEEDED) { - last_pages = n_flushed_list; - } - - n_evicted += n_flushed_lru; - n_flushed_last += n_flushed_list; + n_flushed = page_cleaner.slot.n_flushed_list; - n_flushed = n_flushed_lru + n_flushed_list; - - if (n_flushed_lru) { - MONITOR_INC_VALUE_CUMULATIVE( - MONITOR_LRU_BATCH_FLUSH_TOTAL_PAGE, - MONITOR_LRU_BATCH_FLUSH_COUNT, - MONITOR_LRU_BATCH_FLUSH_PAGES, - n_flushed_lru); - } - - if (n_flushed_list) { + if (n_flushed) { MONITOR_INC_VALUE_CUMULATIVE( MONITOR_FLUSH_ADAPTIVE_TOTAL_PAGE, MONITOR_FLUSH_ADAPTIVE_COUNT, MONITOR_FLUSH_ADAPTIVE_PAGES, - n_flushed_list); - } - - } else if (ret_sleep == OS_SYNC_TIME_EXCEEDED) { - /* no activity, slept enough */ - buf_flush_lists(srv_io_capacity, LSN_MAX, &n_flushed); - - n_flushed_last += n_flushed; - - if (n_flushed) { - MONITOR_INC_VALUE_CUMULATIVE( - MONITOR_FLUSH_BACKGROUND_TOTAL_PAGE, - MONITOR_FLUSH_BACKGROUND_COUNT, - MONITOR_FLUSH_BACKGROUND_PAGES, n_flushed); - } - } else { - /* no activity, but woken up by event */ n_flushed = 0; } + if (!n_flushed) { + next_loop_time = curr_time + 1000; + } + ut_d(buf_flush_page_cleaner_disabled_loop()); } - ut_ad(srv_shutdown_state > SRV_SHUTDOWN_INITIATED); - if (srv_fast_shutdown == 2 - || srv_shutdown_state == SRV_SHUTDOWN_EXIT_THREADS) { - /* In very fast shutdown or when innodb failed to start, we - simulate a crash of the buffer pool. We are not required to do - any flushing. 
*/ - goto thread_exit; + if (srv_fast_shutdown != 2) { + buf_flush_wait_batch_end_acquiring_mutex(true); + buf_flush_wait_batch_end_acquiring_mutex(false); } - /* In case of normal and slow shutdown the page_cleaner thread - must wait for all other activity in the server to die down. - Note that we can start flushing the buffer pool as soon as the - server enters shutdown phase but we must stay alive long enough - to ensure that any work done by the master or purge threads is - also flushed. - During shutdown we pass through two stages. In the first stage, - when SRV_SHUTDOWN_CLEANUP is set other threads like the master - and the purge threads may be working as well. We start flushing - the buffer pool but can't be sure that no new pages are being - dirtied until we enter SRV_SHUTDOWN_FLUSH_PHASE phase. */ - - do { - pc_request(ULINT_MAX, LSN_MAX); - - while (pc_flush_slot() > 0) {} - - ulint n_flushed_lru = 0; - ulint n_flushed_list = 0; - pc_wait_finished(&n_flushed_lru, &n_flushed_list); - - n_flushed = n_flushed_lru + n_flushed_list; - - /* We sleep only if there are no pages to flush */ - if (n_flushed == 0) { - os_thread_sleep(100000); - } - } while (srv_shutdown_state == SRV_SHUTDOWN_CLEANUP); - - /* At this point all threads including the master and the purge - thread must have been suspended. */ - ut_ad(!srv_any_background_activity()); - ut_ad(srv_shutdown_state == SRV_SHUTDOWN_FLUSH_PHASE); - - /* We can now make a final sweep on flushing the buffer pool - and exit after we have cleaned the whole buffer pool. - It is important that we wait for any running batch that has - been triggered by us to finish. 
Otherwise we can end up - considering end of that batch as a finish of our final - sweep and we'll come out of the loop leaving behind dirty pages - in the flush_list */ - buf_flush_wait_batch_end(false); - buf_flush_wait_LRU_batch_end(); - - bool success; - - do { - pc_request(ULINT_MAX, LSN_MAX); - - while (pc_flush_slot() > 0) {} - - ulint n_flushed_lru = 0; - ulint n_flushed_list = 0; - success = pc_wait_finished(&n_flushed_lru, &n_flushed_list); - - n_flushed = n_flushed_lru + n_flushed_list; - - buf_flush_wait_batch_end(false); - buf_flush_wait_LRU_batch_end(); - - } while (!success || n_flushed > 0); - - /* Some sanity checks */ - ut_ad(!srv_any_background_activity()); - ut_ad(srv_shutdown_state == SRV_SHUTDOWN_FLUSH_PHASE); - ut_a(UT_LIST_GET_LEN(buf_pool.flush_list) == 0); - - /* We have lived our life. Time to die. */ - -thread_exit: - page_cleaner.is_running = false; - mutex_destroy(&page_cleaner.mutex); - - os_event_destroy(page_cleaner.is_finished); - buf_page_cleaner_is_active = false; my_thread_end(); @@ -2654,52 +1997,34 @@ thread_exit: OS_THREAD_DUMMY_RETURN; } -static void pc_flush_slot_func(void*) -{ - while (pc_flush_slot() > 0) {}; -} - /** Initialize page_cleaner. */ void buf_flush_page_cleaner_init() { - ut_ad(!page_cleaner.is_running); - - mutex_create(LATCH_ID_PAGE_CLEANER, &page_cleaner.mutex); - - page_cleaner.is_finished = os_event_create("pc_is_finished"); - - page_cleaner.is_running = true; - - buf_page_cleaner_is_active = true; - os_thread_create(buf_flush_page_cleaner); + ut_ad(!buf_page_cleaner_is_active); + buf_page_cleaner_is_active= true; + os_thread_create(buf_flush_page_cleaner); } /** Synchronously flush dirty blocks. NOTE: The calling thread is not allowed to hold any buffer page latches! */ void buf_flush_sync() { - bool success; - do { - success = buf_flush_lists(ULINT_MAX, LSN_MAX, NULL); - buf_flush_wait_batch_end(false); - } while (!success); -} - -/** Request IO burst and wake page_cleaner up. 
-@param[in] lsn_limit upper limit of LSN to be flushed */ -void buf_flush_request_force(lsn_t lsn_limit) -{ - /* adjust based on lsn_avg_rate not to get old */ - lsn_t lsn_target = lsn_limit + lsn_avg_rate * 3; + ut_ad(!sync_check_iterate(dict_sync_check())); - mutex_enter(&page_cleaner.mutex); - if (lsn_target > buf_flush_sync_lsn) { - buf_flush_sync_lsn = lsn_target; - } - mutex_exit(&page_cleaner.mutex); - - os_event_set(buf_flush_event); + for (;;) + { + const ulint n_flushed= buf_flush_lists(ULINT_UNDEFINED, LSN_MAX); + buf_flush_wait_batch_end_acquiring_mutex(false); + if (!n_flushed) + { + mysql_mutex_lock(&buf_pool.flush_list_mutex); + const auto len= UT_LIST_GET_LEN(buf_pool.flush_list); + mysql_mutex_unlock(&buf_pool.flush_list_mutex); + if (!len) + return; + } + } } #ifdef UNIV_DEBUG @@ -2716,7 +2041,7 @@ static void buf_flush_validate_low() { buf_page_t* bpage; - ut_ad(mutex_own(&buf_pool.flush_list_mutex)); + mysql_mutex_assert_owner(&buf_pool.flush_list_mutex); ut_list_validate(buf_pool.flush_list, Check()); @@ -2743,8 +2068,8 @@ static void buf_flush_validate_low() /** Validate the flush list. 
*/ void buf_flush_validate() { - mutex_enter(&buf_pool.flush_list_mutex); - buf_flush_validate_low(); - mutex_exit(&buf_pool.flush_list_mutex); + mysql_mutex_lock(&buf_pool.flush_list_mutex); + buf_flush_validate_low(); + mysql_mutex_unlock(&buf_pool.flush_list_mutex); } #endif /* UNIV_DEBUG */ diff --git a/storage/innobase/buf/buf0lru.cc b/storage/innobase/buf/buf0lru.cc index 521be10ba21..f9ed938b20c 100644 --- a/storage/innobase/buf/buf0lru.cc +++ b/storage/innobase/buf/buf0lru.cc @@ -39,6 +39,9 @@ Created 11/5/1995 Heikki Tuuri #include "srv0srv.h" #include "srv0mon.h" +/** Flush this many pages in buf_LRU_get_free_block() */ +size_t innodb_lru_flush_size; + /** The number of blocks from the LRU_old pointer onward, including the block pointed to, must be buf_pool.LRU_old_ratio/BUF_LRU_OLD_RATIO_DIV of the whole LRU list length, except that the tolerance defined below @@ -46,28 +49,13 @@ is allowed. Note that the tolerance must be small enough such that for even the BUF_LRU_OLD_MIN_LEN long LRU list, the LRU_old pointer is not allowed to point to either end of the LRU list. */ -static const ulint BUF_LRU_OLD_TOLERANCE = 20; +static constexpr ulint BUF_LRU_OLD_TOLERANCE = 20; /** The minimum amount of non-old blocks when the LRU_old list exists (that is, when there are more than BUF_LRU_OLD_MIN_LEN blocks). @see buf_LRU_old_adjust_len */ #define BUF_LRU_NON_OLD_MIN_LEN 5 -#ifdef BTR_CUR_HASH_ADAPT -/** When dropping the search hash index entries before deleting an ibd -file, we build a local array of pages belonging to that tablespace -in the buffer pool. Following is the size of that array. -We also release buf_pool.mutex after scanning this many pages of the -flush_list when dropping a table. This is to ensure that other threads -are not blocked for extended period of time when using very large -buffer pools. 
*/ -static const ulint BUF_LRU_DROP_SEARCH_SIZE = 1024; -#endif /* BTR_CUR_HASH_ADAPT */ - -/** We scan these many blocks when looking for a clean page to evict -during LRU eviction. */ -static const ulint BUF_LRU_SEARCH_SCAN_THRESHOLD = 100; - /** If we switch on the InnoDB monitor because there are too few available frames in the buffer pool, we set this to TRUE */ static bool buf_lru_switched_on_innodb_mon = false; @@ -149,7 +137,7 @@ static void buf_LRU_block_free_hashed_page(buf_block_t *block) static inline void incr_LRU_size_in_bytes(const buf_page_t* bpage) { /* FIXME: use atomics, not mutex */ - ut_ad(mutex_own(&buf_pool.mutex)); + mysql_mutex_assert_owner(&buf_pool.mutex); buf_pool.stat.LRU_bytes += bpage->physical_size(); @@ -160,7 +148,7 @@ static inline void incr_LRU_size_in_bytes(const buf_page_t* bpage) instead of the general LRU list */ bool buf_LRU_evict_from_unzip_LRU() { - ut_ad(mutex_own(&buf_pool.mutex)); + mysql_mutex_assert_owner(&buf_pool.mutex); /* If the unzip_LRU list is empty, we can only use the LRU. */ if (UT_LIST_GET_LEN(buf_pool.unzip_LRU) == 0) { @@ -196,280 +184,19 @@ bool buf_LRU_evict_from_unzip_LRU() return(unzip_avg <= io_avg * BUF_LRU_IO_TO_UNZIP_FACTOR); } -#ifdef BTR_CUR_HASH_ADAPT -/** -While flushing (or removing dirty) pages from a tablespace we don't -want to hog the CPU and resources. Release the buffer pool and block -mutex and try to force a context switch. Then reacquire the same mutexes. -The current page is "fixed" before the release of the mutexes and then -"unfixed" again once we have reacquired the mutexes. 
-@param[in,out] bpage current page */ -static void buf_flush_yield(buf_page_t *bpage) -{ - mutex_exit(&buf_pool.flush_list_mutex); - ut_ad(bpage->oldest_modification()); - ut_ad(bpage->in_file()); - ut_ad(bpage->io_fix() == BUF_IO_NONE); - /** Make the block sticky, so that even after we release buf_pool.mutex: - (1) it cannot be removed from the buf_pool.flush_list - (2) bpage cannot be relocated in buf_pool - (3) bpage->in_LRU_list cannot change - However, bpage->LRU can change. */ - bpage->set_io_fix(BUF_IO_PIN); - mutex_exit(&buf_pool.mutex); - - /* Try and force a context switch. */ - os_thread_yield(); - - mutex_enter(&buf_pool.mutex); - bpage->io_unfix(); - mutex_enter(&buf_pool.flush_list_mutex); - /* Should not have been removed from the flush - list during the yield. However, this check is - not sufficient to catch a remove -> add. */ - ut_ad(bpage->oldest_modification()); -} - -/******************************************************************//** -If we have hogged the resources for too long then release the buffer -pool and flush list mutex and do a thread yield. Set the current page -to "sticky" so that it is not relocated during the yield. -@return true if yielded */ -static MY_ATTRIBUTE((warn_unused_result)) -bool -buf_flush_try_yield( -/*================*/ - buf_page_t* bpage, /*!< in/out: bpage to remove */ - ulint processed) /*!< in: number of pages processed */ -{ - /* Every BUF_LRU_DROP_SEARCH_SIZE iterations in the - loop we release buf_pool.mutex to let other threads - do their job but only if the block is not IO fixed. This - ensures that the block stays in its position in the - flush_list. */ - - if (bpage != NULL - && processed >= BUF_LRU_DROP_SEARCH_SIZE - && bpage->io_fix() == BUF_IO_NONE) { - - /* Release the buf_pool.mutex - to give the other threads a go. */ - - buf_flush_yield(bpage); - return(true); - } - - return(false); -} -#endif /* BTR_CUR_HASH_ADAPT */ - -/** Remove a single page from flush_list. 
-@param[in,out] bpage buffer page to remove -@param[in] flush whether to flush the page before removing -@return true if page was removed. */ -static bool buf_flush_or_remove_page(buf_page_t *bpage, bool flush) -{ - ut_ad(mutex_own(&buf_pool.mutex)); - ut_ad(mutex_own(&buf_pool.flush_list_mutex)); - - /* bpage->id and bpage->io_fix are protected by - buf_pool.mutex (and bpage->id additionally by hash_lock). - It is safe to check them while holding buf_pool.mutex only. */ - - if (bpage->io_fix() != BUF_IO_NONE) { - - /* We cannot remove this page during this scan - yet; maybe the system is currently reading it - in, or flushing the modifications to the file */ - return(false); - - } - - bool processed = false; - - /* We have to release the flush_list_mutex to obey the - latching order. We are however guaranteed that the page - will stay in the flush_list and won't be relocated because - buf_flush_remove() and buf_flush_relocate_on_flush_list() - need buf_pool.mutex as well. */ - - mutex_exit(&buf_pool.flush_list_mutex); - - ut_ad(bpage->oldest_modification()); - - if (!flush) { - buf_flush_remove(bpage); - processed = true; - } else if (bpage->ready_for_flush()) { - processed = buf_flush_page(bpage, IORequest::SINGLE_PAGE, - nullptr, false); - - if (processed) { - mutex_enter(&buf_pool.mutex); - } - } - - mutex_enter(&buf_pool.flush_list_mutex); - - ut_ad(mutex_own(&buf_pool.mutex)); - - return(processed); -} - -/** Remove all dirty pages belonging to a given tablespace when we are -deleting the data file of that tablespace. -The pages still remain a part of LRU and are evicted from -the list as they age towards the tail of the LRU. 
-@param[in] id tablespace identifier -@param[in] flush whether to flush the pages before removing -@param[in] first first page to be flushed or evicted -@return whether all matching dirty pages were removed */ -static bool buf_flush_or_remove_pages(ulint id, bool flush, ulint first) -{ - buf_page_t* prev; - buf_page_t* bpage; - ulint processed = 0; - - mutex_enter(&buf_pool.flush_list_mutex); -rescan: - bool all_freed = true; - - for (bpage = UT_LIST_GET_LAST(buf_pool.flush_list); - bpage != NULL; - bpage = prev) { - - ut_a(bpage->in_file()); - - /* Save the previous link because once we free the - page we can't rely on the links. */ - - prev = UT_LIST_GET_PREV(list, bpage); - - const page_id_t bpage_id(bpage->id()); - - if (id != bpage_id.space()) { - /* Skip this block, because it is for a - different tablespace. */ - } else if (bpage_id.page_no() < first) { - /* Skip this block, because it is below the limit. */ - } else if (!buf_flush_or_remove_page(bpage, flush)) { - - /* Remove was unsuccessful, we have to try again - by scanning the entire list from the end. - This also means that we never released the - buf_pool mutex. Therefore we can trust the prev - pointer. - buf_flush_or_remove_page() released the - flush list mutex but not the buf_pool mutex. - Therefore it is possible that a new page was - added to the flush list. For example, in case - where we are at the head of the flush list and - prev == NULL. That is OK because we have the - tablespace quiesced and no new pages for this - space-id should enter flush_list. This is - because the only callers of this function are - DROP TABLE and FLUSH TABLE FOR EXPORT. - We know that we'll have to do at least one more - scan but we don't break out of loop here and - try to do as much work as we can in this - iteration. */ - - all_freed = false; - } else if (flush) { - - /* The processing was successful. And during the - processing we have released the buf_pool mutex - when calling buf_page_flush(). 
We cannot trust - prev pointer. */ - goto rescan; - } - -#ifdef BTR_CUR_HASH_ADAPT - ++processed; - - /* Yield if we have hogged the CPU and mutexes for too long. */ - if (buf_flush_try_yield(prev, processed)) { - /* Reset the batch size counter if we had to yield. */ - processed = 0; - } -#endif /* BTR_CUR_HASH_ADAPT */ - } - - mutex_exit(&buf_pool.flush_list_mutex); - - return(all_freed); -} - -/** Remove or flush all the dirty pages that belong to a given tablespace. -The pages will remain in the LRU list and will be evicted from the LRU list -as they age and move towards the tail of the LRU list. -@param[in] id tablespace identifier -@param[in] flush whether to flush the pages before removing -@param[in] first first page to be flushed or evicted */ -static void buf_flush_dirty_pages(ulint id, bool flush, ulint first) -{ - mutex_enter(&buf_pool.mutex); - while (!buf_flush_or_remove_pages(id, flush, first)) - { - mutex_exit(&buf_pool.mutex); - ut_d(buf_flush_validate()); - os_thread_sleep(2000); - mutex_enter(&buf_pool.mutex); - } - -#ifdef UNIV_DEBUG - if (!first) - { - mutex_enter(&buf_pool.flush_list_mutex); - - for (buf_page_t *bpage= UT_LIST_GET_FIRST(buf_pool.flush_list); bpage; - bpage= UT_LIST_GET_NEXT(list, bpage)) - { - ut_ad(bpage->in_file()); - ut_ad(bpage->oldest_modification()); - ut_ad(id != bpage->id().space()); - } - - mutex_exit(&buf_pool.flush_list_mutex); - } -#endif - - mutex_exit(&buf_pool.mutex); -} - -/** Empty the flush list for all pages belonging to a tablespace. -@param[in] id tablespace identifier -@param[in] flush whether to write the pages to files -@param[in] first first page to be flushed or evicted */ -void buf_LRU_flush_or_remove_pages(ulint id, bool flush, ulint first) -{ - /* Pages in the system tablespace must never be discarded. */ - ut_ad(id || flush); - - buf_flush_dirty_pages(id, flush, first); - - if (flush) { - /* Ensure that all asynchronous IO is completed. 
*/ - os_aio_wait_until_no_pending_writes(); - fil_flush(id); - } -} - /** Try to free an uncompressed page of a compressed block from the unzip LRU list. The compressed page is preserved, and it need not be clean. -@param[in] scan_all true=scan the whole list; - false=scan srv_LRU_scan_depth / 2 blocks +@param limit maximum number of blocks to scan @return true if freed */ -static bool buf_LRU_free_from_unzip_LRU_list(bool scan_all) +static bool buf_LRU_free_from_unzip_LRU_list(ulint limit) { - ut_ad(mutex_own(&buf_pool.mutex)); + mysql_mutex_assert_owner(&buf_pool.mutex); if (!buf_LRU_evict_from_unzip_LRU()) { return(false); } ulint scanned = 0; - const ulint limit = scan_all ? ULINT_UNDEFINED : srv_LRU_scan_depth; bool freed = false; for (buf_block_t* block = UT_LIST_GET_LAST(buf_pool.unzip_LRU); @@ -500,31 +227,24 @@ static bool buf_LRU_free_from_unzip_LRU_list(bool scan_all) } /** Try to free a clean page from the common LRU list. -@param[in] scan_all true=scan the whole LRU list - false=use BUF_LRU_SEARCH_SCAN_THRESHOLD +@param limit maximum number of blocks to scan @return whether a page was freed */ -static bool buf_LRU_free_from_common_LRU_list(bool scan_all) +static bool buf_LRU_free_from_common_LRU_list(ulint limit) { - ut_ad(mutex_own(&buf_pool.mutex)); + mysql_mutex_assert_owner(&buf_pool.mutex); ulint scanned = 0; bool freed = false; for (buf_page_t* bpage = buf_pool.lru_scan_itr.start(); - bpage && (scan_all || scanned < BUF_LRU_SEARCH_SCAN_THRESHOLD); + bpage && scanned < limit; ++scanned, bpage = buf_pool.lru_scan_itr.get()) { buf_page_t* prev = UT_LIST_GET_PREV(LRU, bpage); buf_pool.lru_scan_itr.set(prev); const auto accessed = bpage->is_accessed(); - freed = bpage->ready_for_replace(); - - if (freed) { - freed = buf_LRU_free_page(bpage, true); - if (!freed) { - continue; - } - + if (!bpage->oldest_modification() + && buf_LRU_free_page(bpage, true)) { if (!accessed) { /* Keep track of pages that are evicted without ever being accessed. 
This gives us a measure of @@ -532,6 +252,7 @@ static bool buf_LRU_free_from_common_LRU_list(bool scan_all) ++buf_pool.stat.n_ra_pages_evicted; } + freed = true; break; } } @@ -548,15 +269,14 @@ static bool buf_LRU_free_from_common_LRU_list(bool scan_all) } /** Try to free a replaceable block. -@param[in] scan_all true=scan the whole LRU list, - false=use BUF_LRU_SEARCH_SCAN_THRESHOLD +@param limit maximum number of blocks to scan @return true if found and freed */ -bool buf_LRU_scan_and_free_block(bool scan_all) +bool buf_LRU_scan_and_free_block(ulint limit) { - ut_ad(mutex_own(&buf_pool.mutex)); + mysql_mutex_assert_owner(&buf_pool.mutex); - return(buf_LRU_free_from_unzip_LRU_list(scan_all) - || buf_LRU_free_from_common_LRU_list(scan_all)); + return buf_LRU_free_from_unzip_LRU_list(limit) || + buf_LRU_free_from_common_LRU_list(limit); } /** @return a buffer block from the buf_pool.free list @@ -565,7 +285,7 @@ buf_block_t* buf_LRU_get_free_only() { buf_block_t* block; - ut_ad(mutex_own(&buf_pool.mutex)); + mysql_mutex_assert_owner(&buf_pool.mutex); block = reinterpret_cast<buf_block_t*>( UT_LIST_GET_FIRST(buf_pool.free)); @@ -611,106 +331,89 @@ function will either assert or issue a warning and switch on the status monitor. */ static void buf_LRU_check_size_of_non_data_objects() { - ut_ad(mutex_own(&buf_pool.mutex)); + mysql_mutex_assert_owner(&buf_pool.mutex); - if (!recv_recovery_is_on() - && buf_pool.curr_size == buf_pool.old_size - && UT_LIST_GET_LEN(buf_pool.free) - + UT_LIST_GET_LEN(buf_pool.LRU) < buf_pool.curr_size / 20) { + if (recv_recovery_is_on() || buf_pool.curr_size != buf_pool.old_size) + return; - ib::fatal() << "Over 95 percent of the buffer pool is" - " occupied by lock heaps" + const auto s= UT_LIST_GET_LEN(buf_pool.free) + UT_LIST_GET_LEN(buf_pool.LRU); + + if (s < buf_pool.curr_size / 20) + ib::fatal() << "Over 95 percent of the buffer pool is" + " occupied by lock heaps" #ifdef BTR_CUR_HASH_ADAPT - " or the adaptive hash index!" 
+ " or the adaptive hash index" #endif /* BTR_CUR_HASH_ADAPT */ - " Check that your transactions do not set too many" - " row locks, or review if" - " innodb_buffer_pool_size=" - << (buf_pool.curr_size >> (20U - srv_page_size_shift)) - << "M could be bigger."; - } else if (!recv_recovery_is_on() - && buf_pool.curr_size == buf_pool.old_size - && (UT_LIST_GET_LEN(buf_pool.free) - + UT_LIST_GET_LEN(buf_pool.LRU)) - < buf_pool.curr_size / 3) { - - if (!buf_lru_switched_on_innodb_mon && srv_monitor_timer) { - - /* Over 67 % of the buffer pool is occupied by lock - heaps or the adaptive hash index. This may be a memory - leak! */ - - ib::warn() << "Over 67 percent of the buffer pool is" - " occupied by lock heaps" + "! Check that your transactions do not set too many" + " row locks, or review if innodb_buffer_pool_size=" + << (buf_pool.curr_size >> (20U - srv_page_size_shift)) + << "M could be bigger."; + + if (s < buf_pool.curr_size / 3) + { + if (!buf_lru_switched_on_innodb_mon && srv_monitor_timer) + { + /* Over 67 % of the buffer pool is occupied by lock heaps or + the adaptive hash index. This may be a memory leak! */ + ib::warn() << "Over 67 percent of the buffer pool is" + " occupied by lock heaps" #ifdef BTR_CUR_HASH_ADAPT - " or the adaptive hash index!" + " or the adaptive hash index" #endif /* BTR_CUR_HASH_ADAPT */ - " Check that your transactions do not" - " set too many row locks." - " innodb_buffer_pool_size=" - << (buf_pool.curr_size >> - (20U - srv_page_size_shift)) << "M." - " Starting the InnoDB Monitor to print" - " diagnostics."; - - buf_lru_switched_on_innodb_mon = true; - srv_print_innodb_monitor = TRUE; - srv_monitor_timer_schedule_now(); - } - - } else if (buf_lru_switched_on_innodb_mon) { - - /* Switch off the InnoDB Monitor; this is a simple way - to stop the monitor if the situation becomes less urgent, - but may also surprise users if the user also switched on the - monitor! 
*/ - - buf_lru_switched_on_innodb_mon = false; - srv_print_innodb_monitor = FALSE; - } + "! Check that your transactions do not set too many row locks." + " innodb_buffer_pool_size=" + << (buf_pool.curr_size >> (20U - srv_page_size_shift)) + << "M. Starting the InnoDB Monitor to print diagnostics."; + buf_lru_switched_on_innodb_mon= true; + srv_print_innodb_monitor= TRUE; + srv_monitor_timer_schedule_now(); + } + } + else if (buf_lru_switched_on_innodb_mon) + { + /* Switch off the InnoDB Monitor; this is a simple way to stop the + monitor if the situation becomes less urgent, but may also + surprise users who did SET GLOBAL innodb_status_output=ON earlier! */ + buf_lru_switched_on_innodb_mon= false; + srv_print_innodb_monitor= FALSE; + } } -/** Get a free block from the buf_pool. The block is taken off the -free list. If free list is empty, blocks are moved from the end of the -LRU list to the free list. +/** Get a block from the buf_pool.free list. +If the list is empty, blocks will be moved from the end of buf_pool.LRU +to buf_pool.free. This function is called from a user thread when it needs a clean block to read in a page. Note that we only ever get a block from the free list. Even when we flush a page or find a page in LRU scan we put it to free list to be used. 
 * iteration 0:
-  * get a block from free list, success:done
+  * get a block from the buf_pool.free list, success:done
   * if buf_pool.try_LRU_scan is set
-    * scan LRU up to srv_LRU_scan_depth to find a clean block
-    * the above will put the block on free list
+    * scan LRU up to 100 pages to free a clean block
     * success:retry the free list
-  * flush one dirty page from tail of LRU to disk
-  * the above will put the block on free list
+  * flush up to innodb_lru_flush_size LRU blocks to data files
+    (until UT_LIST_GET_GEN(buf_pool.free) < innodb_lru_scan_depth)
+  * on buf_page_write_complete() the blocks will put on buf_pool.free list
     * success: retry the free list
-* iteration 1:
-  * same as iteration 0 except:
-  * scan whole LRU list
-  * scan LRU list even if buf_pool.try_LRU_scan is not set
-* iteration > 1:
-  * same as iteration 1 but sleep 10ms
+* subsequent iterations: same as iteration 0 except:
+  * scan whole LRU list
+  * scan LRU list even if buf_pool.try_LRU_scan is not set
 @param have_mutex whether buf_pool.mutex is already being held
 @return the free control block, in state BUF_BLOCK_MEMORY */
 buf_block_t* buf_LRU_get_free_block(bool have_mutex)
 {
-	buf_block_t*	block		= NULL;
-	bool		freed		= false;
 	ulint		n_iterations	= 0;
 	ulint		flush_failures	= 0;
 
 	MONITOR_INC(MONITOR_LRU_GET_FREE_SEARCH);
 	if (have_mutex) {
-		ut_ad(mutex_own(&buf_pool.mutex));
+		mysql_mutex_assert_owner(&buf_pool.mutex);
 		goto got_mutex;
 	}
 loop:
-	mutex_enter(&buf_pool.mutex);
+	mysql_mutex_lock(&buf_pool.mutex);
 got_mutex:
-	buf_LRU_check_size_of_non_data_objects();
 	DBUG_EXECUTE_IF("ib_lru_force_no_free_page",
@@ -718,49 +421,38 @@ got_mutex:
 			n_iterations = 21;
 			goto not_found;});
 
+retry:
 	/* If there is a block in the free list, take it */
-	block = buf_LRU_get_free_only();
-
-	if (block) {
+	if (buf_block_t* block = buf_LRU_get_free_only()) {
 		if (!have_mutex) {
-			mutex_exit(&buf_pool.mutex);
+			mysql_mutex_unlock(&buf_pool.mutex);
 		}
 		memset(&block->page.zip, 0, sizeof block->page.zip);
-
-		return(block);
+		return block;
 	}
 
 	MONITOR_INC( MONITOR_LRU_GET_FREE_LOOPS );
-	freed = false;
 	if (n_iterations || buf_pool.try_LRU_scan) {
 		/* If no block was in the free list, search from the
 		end of the LRU list and try to free a block there.
 		If we are doing for the first time we'll scan only
 		tail of the LRU list otherwise we scan the whole LRU
 		list. */
-		freed = buf_LRU_scan_and_free_block(n_iterations > 0);
-
-		if (!freed && n_iterations == 0) {
-			/* Tell other threads that there is no point
-			in scanning the LRU list. This flag is set to
-			TRUE again when we flush a batch from this
-			buffer pool. */
-			buf_pool.try_LRU_scan = false;
-
-			/* Also tell the page_cleaner thread that
-			there is work for it to do. */
-			os_event_set(buf_flush_event);
+		if (buf_LRU_scan_and_free_block(n_iterations
+						? ULINT_UNDEFINED : 100)) {
+			goto retry;
 		}
+
+		/* Tell other threads that there is no point
+		in scanning the LRU list. */
+		buf_pool.try_LRU_scan = false;
 	}
 
 #ifndef DBUG_OFF
 not_found:
 #endif
-
-	mutex_exit(&buf_pool.mutex);
-
-	if (freed) {
-		goto loop;
-	}
+	mysql_mutex_unlock(&buf_pool.mutex);
+	buf_flush_wait_batch_end_acquiring_mutex(true);
 
 	if (n_iterations > 20 && !buf_lru_free_blocks_error_printed
 	    && srv_buf_pool_old_size == srv_buf_pool_size) {
@@ -782,18 +474,8 @@ not_found:
 		buf_lru_free_blocks_error_printed = true;
 	}
 
-	/* If we have scanned the whole LRU and still are unable to
-	find a free block then we should sleep here to let the
-	page_cleaner do an LRU batch for us. */
-
-	if (!srv_read_only_mode) {
-		os_event_set(buf_flush_event);
-	}
-
 	if (n_iterations > 1) {
-		MONITOR_INC( MONITOR_LRU_GET_FREE_WAITS );
-		os_thread_sleep(10000);
 	}
 
 	/* No free block was found: try to flush the LRU list.
@@ -804,10 +486,10 @@ not_found:
 	TODO: A more elegant way would have been to return the freed
 	up block to the caller here but the code that deals with
 	removing the block from page_hash and LRU_list is fairly
-	involved (particularly in case of compressed pages). We
+	involved (particularly in case of ROW_FORMAT=COMPRESSED pages). We
 	can do that in a separate patch sometime in future. */
 
-	if (!buf_flush_single_page_from_LRU()) {
+	if (!buf_flush_lists(innodb_lru_flush_size, 0)) {
 		MONITOR_INC(MONITOR_LRU_SINGLE_FLUSH_FAILURE_COUNT);
 		++flush_failures;
 	}
@@ -827,7 +509,7 @@ static void buf_LRU_old_adjust_len()
 	ulint	new_len;
 
 	ut_a(buf_pool.LRU_old);
-	ut_ad(mutex_own(&buf_pool.mutex));
+	mysql_mutex_assert_owner(&buf_pool.mutex);
 	ut_ad(buf_pool.LRU_old_ratio >= BUF_LRU_OLD_RATIO_MIN);
 	ut_ad(buf_pool.LRU_old_ratio <= BUF_LRU_OLD_RATIO_MAX);
 	compile_time_assert(BUF_LRU_OLD_RATIO_MIN * BUF_LRU_OLD_MIN_LEN
@@ -888,7 +570,7 @@ static void buf_LRU_old_adjust_len()
 called when the LRU list grows to BUF_LRU_OLD_MIN_LEN length. */
 static void buf_LRU_old_init()
 {
-	ut_ad(mutex_own(&buf_pool.mutex));
+	mysql_mutex_assert_owner(&buf_pool.mutex);
 	ut_a(UT_LIST_GET_LEN(buf_pool.LRU) == BUF_LRU_OLD_MIN_LEN);
 
 	/* We first initialize all blocks in the LRU list as old and then use
@@ -917,7 +599,7 @@ static void buf_LRU_old_init()
 static void buf_unzip_LRU_remove_block_if_needed(buf_page_t* bpage)
 {
 	ut_ad(bpage->in_file());
-	ut_ad(mutex_own(&buf_pool.mutex));
+	mysql_mutex_assert_owner(&buf_pool.mutex);
 
 	if (bpage->belongs_to_unzip_LRU()) {
 		buf_block_t* block = reinterpret_cast<buf_block_t*>(bpage);
@@ -1000,7 +682,7 @@ buf_unzip_LRU_add_block(
 	ibool		old)	/*!< in: TRUE if should be put to the end
 				of the list, else put to the start */
 {
-	ut_ad(mutex_own(&buf_pool.mutex));
+	mysql_mutex_assert_owner(&buf_pool.mutex);
 	ut_a(block->page.belongs_to_unzip_LRU());
 	ut_ad(!block->in_unzip_LRU_list);
 	ut_d(block->in_unzip_LRU_list = true);
@@ -1024,7 +706,7 @@ buf_LRU_add_block(
 				LRU list is very short, the block is added to
 				the start, regardless of this parameter */
 {
-	ut_ad(mutex_own(&buf_pool.mutex));
+	mysql_mutex_assert_owner(&buf_pool.mutex);
 	ut_ad(!bpage->in_LRU_list);
 
 	if (!old || (UT_LIST_GET_LEN(buf_pool.LRU) < BUF_LRU_OLD_MIN_LEN)) {
@@ -1084,7 +766,7 @@ void buf_page_make_young(buf_page_t *bpage)
 {
 	ut_ad(bpage->in_file());
 
-	mutex_enter(&buf_pool.mutex);
+	mysql_mutex_lock(&buf_pool.mutex);
 
 	if (UNIV_UNLIKELY(bpage->old))
 		buf_pool.stat.n_pages_made_young++;
@@ -1092,7 +774,7 @@ void buf_page_make_young(buf_page_t *bpage)
 	buf_LRU_remove_block(bpage);
 	buf_LRU_add_block(bpage, false);
 
-	mutex_exit(&buf_pool.mutex);
+	mysql_mutex_unlock(&buf_pool.mutex);
 }
 
 /** Try to free a block. If bpage is a descriptor of a compressed-only
@@ -1107,7 +789,7 @@ bool buf_LRU_free_page(buf_page_t *bpage, bool zip)
 	const page_id_t id(bpage->id());
 	buf_page_t*	b = nullptr;
 
-	ut_ad(mutex_own(&buf_pool.mutex));
+	mysql_mutex_assert_owner(&buf_pool.mutex);
 	ut_ad(bpage->in_file());
 	ut_ad(bpage->in_LRU_list);
@@ -1148,7 +830,7 @@ func_exit:
 		b->set_state(BUF_BLOCK_ZIP_PAGE);
 	}
 
-	ut_ad(mutex_own(&buf_pool.mutex));
+	mysql_mutex_assert_owner(&buf_pool.mutex);
 	ut_ad(bpage->in_file());
 	ut_ad(bpage->in_LRU_list);
@@ -1238,9 +920,7 @@ func_exit:
 		buf_LRU_add_block(b, b->old);
 	}
 
-	if (b->oldest_modification()) {
-		buf_flush_relocate_on_flush_list(bpage, b);
-	}
+	buf_flush_relocate_on_flush_list(bpage, b);
 
 	bpage->zip.data = nullptr;
@@ -1253,25 +933,35 @@ func_exit:
 		hash_lock->write_unlock();
 	}
 
-	mutex_exit(&buf_pool.mutex);
-
-	/* Remove possible adaptive hash index on the page.
-	The page was declared uninitialized by
-	buf_LRU_block_remove_hashed(). We need to flag
-	the contents of the page valid (which it still is) in
-	order to avoid bogus Valgrind or MSAN warnings.*/
 	buf_block_t* block = reinterpret_cast<buf_block_t*>(bpage);
-	MEM_MAKE_DEFINED(block->frame, srv_page_size);
-	btr_search_drop_page_hash_index(block);
-	MEM_UNDEFINED(block->frame, srv_page_size);
+#ifdef BTR_CUR_HASH_ADAPT
+	if (block->index) {
+		mysql_mutex_unlock(&buf_pool.mutex);
+
+		/* Remove the adaptive hash index on the page.
+		The page was declared uninitialized by
+		buf_LRU_block_remove_hashed(). We need to flag
+		the contents of the page valid (which it still is) in
+		order to avoid bogus Valgrind or MSAN warnings.*/
+
+		MEM_MAKE_DEFINED(block->frame, srv_page_size);
+		btr_search_drop_page_hash_index(block);
+		MEM_UNDEFINED(block->frame, srv_page_size);
+
+		if (UNIV_LIKELY_NULL(b)) {
+			ut_ad(b->zip_size());
+			b->io_unfix();
+		}
+		mysql_mutex_lock(&buf_pool.mutex);
+	} else
+#endif
 	if (UNIV_LIKELY_NULL(b)) {
 		ut_ad(b->zip_size());
 		b->io_unfix();
 	}
 
-	mutex_enter(&buf_pool.mutex);
 	buf_LRU_block_free_hashed_page(block);
 
 	return(true);
@@ -1332,6 +1022,16 @@ buf_LRU_block_free_non_file_page(
 	MEM_NOACCESS(block->frame, srv_page_size);
 }
 
+/** Release a memory block to the buffer pool. */
+ATTRIBUTE_COLD void buf_pool_t::free_block(buf_block_t *block)
+{
+  ut_ad(this == &buf_pool);
+  mysql_mutex_lock(&mutex);
+  buf_LRU_block_free_non_file_page(block);
+  mysql_mutex_unlock(&mutex);
+}
+
+
 /** Remove bpage from buf_pool.LRU and buf_pool.page_hash.
 
 If bpage->state() == BUF_BLOCK_ZIP_PAGE && !bpage->oldest_modification(),
@@ -1350,7 +1050,7 @@ this case the block is already returned to the buddy allocator. */
 static bool buf_LRU_block_remove_hashed(buf_page_t *bpage, const page_id_t id,
                                         page_hash_latch *hash_lock, bool zip)
 {
-	ut_ad(mutex_own(&buf_pool.mutex));
+	mysql_mutex_assert_owner(&buf_pool.mutex);
 	ut_ad(hash_lock->is_write_locked());
 
 	ut_a(bpage->io_fix() == BUF_IO_NONE);
@@ -1545,7 +1245,7 @@ uint buf_LRU_old_ratio_update(uint old_pct, bool adjust)
 	}
 
 	if (adjust) {
-		mutex_enter(&buf_pool.mutex);
+		mysql_mutex_lock(&buf_pool.mutex);
 
 		if (ratio != buf_pool.LRU_old_ratio) {
 			buf_pool.LRU_old_ratio = ratio;
@@ -1556,7 +1256,7 @@ uint buf_LRU_old_ratio_update(uint old_pct, bool adjust)
 			}
 		}
 
-		mutex_exit(&buf_pool.mutex);
+		mysql_mutex_unlock(&buf_pool.mutex);
 	} else {
 		buf_pool.LRU_old_ratio = ratio;
 	}
@@ -1609,7 +1309,7 @@ void buf_LRU_validate()
 	ulint		old_len;
 	ulint		new_len;
 
-	mutex_enter(&buf_pool.mutex);
+	mysql_mutex_lock(&buf_pool.mutex);
 
 	if (UT_LIST_GET_LEN(buf_pool.LRU) >= BUF_LRU_OLD_MIN_LEN) {
@@ -1687,7 +1387,7 @@ void buf_LRU_validate()
 		ut_a(block->page.belongs_to_unzip_LRU());
 	}
 
-	mutex_exit(&buf_pool.mutex);
+	mysql_mutex_unlock(&buf_pool.mutex);
 }
 #endif /* UNIV_DEBUG */
 
@@ -1695,7 +1395,7 @@ void buf_LRU_validate()
 /** Dump the LRU list to stderr. */
 void buf_LRU_print()
 {
-	mutex_enter(&buf_pool.mutex);
+	mysql_mutex_lock(&buf_pool.mutex);
 
 	for (buf_page_t* bpage = UT_LIST_GET_FIRST(buf_pool.LRU);
 	     bpage != NULL;
@@ -1744,6 +1444,6 @@ void buf_LRU_print()
 		}
 	}
 
-	mutex_exit(&buf_pool.mutex);
+	mysql_mutex_unlock(&buf_pool.mutex);
 }
 #endif /* UNIV_DEBUG_PRINT || UNIV_DEBUG */
diff --git a/storage/innobase/buf/buf0rea.cc b/storage/innobase/buf/buf0rea.cc
index 911fce4cc43..d843865d0a9 100644
--- a/storage/innobase/buf/buf0rea.cc
+++ b/storage/innobase/buf/buf0rea.cc
@@ -121,7 +121,7 @@ static buf_page_t* buf_page_init_for_read(ulint mode, const page_id_t page_id,
 
   const ulint fold= page_id.fold();
 
-  mutex_enter(&buf_pool.mutex);
+  mysql_mutex_lock(&buf_pool.mutex);
 
   /* We must acquire hash_lock this early to prevent
   a race condition with buf_pool_t::watch_remove() */
@@ -239,11 +239,11 @@ static buf_page_t* buf_page_init_for_read(ulint mode, const page_id_t page_id,
     buf_LRU_add_block(bpage, true/* to old blocks */);
   }
 
-  mutex_exit(&buf_pool.mutex);
+  mysql_mutex_unlock(&buf_pool.mutex);
   buf_pool.n_pend_reads++;
   goto func_exit_no_mutex;
 func_exit:
-  mutex_exit(&buf_pool.mutex);
+  mysql_mutex_unlock(&buf_pool.mutex);
 func_exit_no_mutex:
   if (mode == BUF_READ_IBUF_PAGES_ONLY)
     ibuf_mtr_commit(&mtr);
@@ -286,10 +286,10 @@ buf_read_page_low(
 
 	*err = DB_SUCCESS;
 
-	if (!page_id.space() && buf_dblwr_page_inside(page_id.page_no())) {
-
+	if (buf_dblwr.is_inside(page_id)) {
 		ib::error() << "Trying to read doublewrite buffer page "
 			<< page_id;
+		ut_ad(0);
 		return(0);
 	}
 
diff --git a/storage/innobase/fil/fil0crypt.cc b/storage/innobase/fil/fil0crypt.cc
index 5d5bf9967d2..131df87b383 100644
--- a/storage/innobase/fil/fil0crypt.cc
+++ b/storage/innobase/fil/fil0crypt.cc
@@ -1080,20 +1080,8 @@ static bool fil_crypt_start_encrypting_space(fil_space_t* space)
 
 		mtr.commit();
 
-		/* record lsn of update */
-		lsn_t end_lsn = mtr.commit_lsn();
-
 		/* 4 - sync tablespace before publishing crypt data */
-
-		bool success = false;
-		ulint sum_pages = 0;
-
-		do {
-			ulint n_pages = 0;
-			success = buf_flush_lists(ULINT_MAX, end_lsn, &n_pages);
-			buf_flush_wait_batch_end(false);
-			sum_pages += n_pages;
-		} while (!success);
+		while (buf_flush_dirty_pages(space->id));
 
 		/* 5 - publish crypt data */
 		mutex_enter(&fil_crypt_threads_mutex);
@@ -1156,7 +1144,6 @@ struct rotate_thread_t {
 		case SRV_SHUTDOWN_CLEANUP:
 		case SRV_SHUTDOWN_INITIATED:
 			return true;
-		case SRV_SHUTDOWN_FLUSH_PHASE:
 		case SRV_SHUTDOWN_LAST_PHASE:
 			break;
 		}
@@ -1945,8 +1932,7 @@ fil_crypt_rotate_pages(
 		 * real pages, they will be updated anyway when the
		 * real page is updated
		 */
-		if (space == TRX_SYS_SPACE &&
-		    buf_dblwr_page_inside(state->offset)) {
+		if (buf_dblwr.is_inside(page_id_t(space, state->offset))) {
			continue;
		}
 
@@ -1978,20 +1964,19 @@ fil_crypt_flush_space(
	lsn_t end_lsn = crypt_data->rotate_state.end_lsn;
 
	if (end_lsn > 0 && !space->is_stopping()) {
-		bool success = false;
-		ulint n_pages = 0;
		ulint sum_pages = 0;
		const ulonglong start = my_interval_timer();
 
		do {
-			success = buf_flush_lists(ULINT_MAX, end_lsn, &n_pages);
-			buf_flush_wait_batch_end(false);
-			sum_pages += n_pages;
-		} while (!success && !space->is_stopping());
+			ulint n_dirty= buf_flush_dirty_pages(state->space->id);
+			if (!n_dirty) {
+				break;
+			}
+			sum_pages += n_dirty;
+		} while (!space->is_stopping());
 
-		const ulonglong end = my_interval_timer();
+		if (sum_pages) {
+			const ulonglong end = my_interval_timer();
 
-		if (sum_pages && end > start) {
			state->cnt_waited += sum_pages;
			state->sum_waited_us += (end - start) / 1000;
diff --git a/storage/innobase/fil/fil0fil.cc b/storage/innobase/fil/fil0fil.cc
index 0325dce8422..93047391aa2 100644
--- a/storage/innobase/fil/fil0fil.cc
+++ b/storage/innobase/fil/fil0fil.cc
@@ -171,7 +171,6 @@ fil_system_t	fil_system;
 
 /** At this age or older a space/page will be rotated */
 UNIV_INTERN extern uint srv_fil_crypt_rotate_key_age;
-UNIV_INTERN extern ib_mutex_t fil_crypt_threads_mutex;
 
 /** Determine if the space id is a user tablespace id or not.
 @param[in]	space_id	Space ID to check
@@ -713,8 +712,8 @@ fil_space_extend_must_retry(
 
	const ulint	page_size = space->physical_size();
 
-	/* fil_read_first_page() expects srv_page_size bytes.
-	fil_node_open_file() expects at least 4 * srv_page_size bytes.*/
+	/* Datafile::read_first_page() expects srv_page_size bytes.
+	fil_node_t::read_page0() expects at least 4 * srv_page_size bytes.*/
	os_offset_t new_size = std::max(
		os_offset_t(size - file_start_page_no) * page_size,
		os_offset_t(FIL_IBD_FILE_INITIAL_SIZE << srv_page_size_shift));
@@ -2097,7 +2096,10 @@ void fil_close_tablespace(ulint id)
	Thus we can clean the tablespace out of buf_pool
	completely and permanently. The flag stop_new_ops also prevents
	fil_flush() from being applied to this tablespace. */
-	buf_LRU_flush_or_remove_pages(id, true);
+	while (buf_flush_dirty_pages(id));
+	/* Ensure that all asynchronous IO is completed. */
+	os_aio_wait_until_no_pending_writes();
+	fil_flush(id);
 
	/* If the free is successful, the X lock will be released before
	the space memory data structure is freed. */
@@ -2186,7 +2188,7 @@ dberr_t fil_delete_tablespace(ulint id, bool if_exists,
	::stop_new_ops flag in fil_io(). */
 
	err = DB_SUCCESS;
-	buf_LRU_flush_or_remove_pages(id, false);
+	buf_flush_remove_pages(id);
 
	/* If it is a delete then also delete any generated files, otherwise
	when we drop the database the remove directory will fail. */
@@ -3683,15 +3685,6 @@ fil_report_invalid_page_access(const page_id_t id, const char *name,
		<< ". Byte offset " << byte_offset << ", len " << len;
 }
 
-inline void IORequest::set_fil_node(fil_node_t* node)
-{
-	if (!node->space->punch_hole) {
-		clear_punch_hole();
-	}
-
-	m_fil_node = node;
-}
-
 /** Reads or writes data. This operation could be asynchronous (aio).
 @param[in,out]	type		IO context
@@ -3726,9 +3719,8 @@ fil_io(
	bool			punch_hole)
 {
	os_offset_t		offset;
-	IORequest		req_type(type);
 
-	ut_ad(req_type.validate());
+	ut_ad(type.validate());
	ut_ad(len > 0);
	ut_ad(byte_offset < srv_page_size);
@@ -3742,7 +3734,7 @@ fil_io(
 
	/* ibuf bitmap pages must be read in the sync AIO mode: */
	ut_ad(recv_no_ibuf_operations
-	      || req_type.is_write()
+	      || type.is_write()
	      || !ibuf_bitmap_page(page_id, zip_size)
	      || sync);
 
@@ -3750,7 +3742,7 @@ fil_io(
 
	if (sync) {
		mode = OS_AIO_SYNC;
-	} else if (req_type.is_read()
+	} else if (type.is_read()
		   && !recv_no_ibuf_operations
		   && ibuf_page(page_id, zip_size, NULL)) {
		mode = OS_AIO_IBUF;
@@ -3758,11 +3750,11 @@ fil_io(
		mode = OS_AIO_NORMAL;
	}
 
-	if (req_type.is_read()) {
+	if (type.is_read()) {
 
		srv_stats.data_read.add(len);
 
-	} else if (req_type.is_write()) {
+	} else if (type.is_write()) {
 
		ut_ad(!srv_read_only_mode
		      || fsp_is_system_temporary(page_id.space()));
@@ -3776,7 +3768,7 @@ fil_io(
		page_id.space());
 
	if (!space
-	    || (req_type.is_read()
+	    || (type.is_read()
		&& !sync
		&& space->is_stopping()
		&& !space->is_being_truncated)) {
@@ -3786,7 +3778,7 @@ fil_io(
			ib::error()
				<< "Trying to do I/O to a tablespace which"
				" does not exist. I/O type: "
-				<< (req_type.is_read() ? "read" : "write")
+				<< (type.is_read() ? "read" : "write")
				<< ", page: " << page_id
				<< ", I/O length: " << len << " bytes";
		}
@@ -3807,7 +3799,7 @@ fil_io(
 
			fil_report_invalid_page_access(
				page_id, space->name, byte_offset, len,
-				req_type.is_read());
+				type.is_read());
 
		} else if (fil_is_user_tablespace_id(space->id)
			   && node->size == 0) {
@@ -3838,7 +3830,7 @@ fil_io(
					<< space->name
					<< "' which exists without .ibd data"
					" file. I/O type: "
-					<< (req_type.is_read()
+					<< (type.is_read()
					    ? "read" : "write")
					<< ", page: "
					<< page_id
@@ -3853,14 +3845,14 @@ fil_io(
			/* If we can tolerate the non-existent pages, we
			should return with DB_ERROR and let caller decide
			what to do. */
-			node->complete_io(req_type.is_write());
+			node->complete_io(type.is_write());
			mutex_exit(&fil_system.mutex);
			return {DB_ERROR, nullptr};
		}
 
		fil_report_invalid_page_access(
			page_id, space->name, byte_offset, len,
-			req_type.is_read());
+			type.is_read());
	}
 
	space->acquire_for_io();
@@ -3879,9 +3871,7 @@ fil_io(
	const char* name = node->name == NULL ? space->name : node->name;
 
-	req_type.set_fil_node(node);
-
-	ut_ad(!req_type.is_write()
+	ut_ad(!type.is_write()
	      || !fil_is_user_tablespace_id(page_id.space())
	      || offset == page_id.page_no() * zip_size);
 
@@ -3897,6 +3887,8 @@ fil_io(
			err = DB_SUCCESS;
		}
	} else {
+		IORequest req_type(type);
+		req_type.set_fil_node(node);
		/* Queue the aio request */
		err = os_aio(
			req_type,
@@ -3909,10 +3901,10 @@ fil_io(
	/* We an try to recover the page from the double write buffer if
	the decompression fails or the page is corrupt. */
 
-	ut_a(req_type.is_dblwr_recover() || err == DB_SUCCESS);
+	ut_a(type.is_dblwr_recover() || err == DB_SUCCESS);
	if (sync) {
		mutex_enter(&fil_system.mutex);
-		node->complete_io(req_type.is_write());
+		node->complete_io(type.is_write());
		mutex_exit(&fil_system.mutex);
		ut_ad(fil_validate_skip());
	}
@@ -3960,7 +3952,7 @@ write_completed:
			bpage->status= buf_page_t::NORMAL;
			dblwr= false;
		}
-		buf_page_write_complete(bpage, data->type, dblwr, false);
+		buf_page_write_complete(bpage, data->type, dblwr);
		goto write_completed;
	}
 
diff --git a/storage/innobase/fsp/fsp0file.cc b/storage/innobase/fsp/fsp0file.cc
index a8f04a754b4..e8fc47f3e41 100644
--- a/storage/innobase/fsp/fsp0file.cc
+++ b/storage/innobase/fsp/fsp0file.cc
@@ -296,14 +296,13 @@ Datafile::read_first_page(bool read_only_mode)
	m_first_page = static_cast<byte*>(
		aligned_malloc(UNIV_PAGE_SIZE_MAX, srv_page_size));
 
-	IORequest	request;
+	constexpr IORequest request(IORequest::READ |
+				    IORequest::DISABLE_PARTIAL_IO_WARNINGS);
	dberr_t		err = DB_ERROR;
	size_t		page_size = UNIV_PAGE_SIZE_MAX;
 
	/* Don't want unnecessary complaints about partial reads. */
-	request.disable_partial_io_warnings();
-
	while (page_size >= UNIV_PAGE_SIZE_MIN) {
 
		ulint	n_read = 0;
@@ -805,10 +804,8 @@ Datafile::restore_from_doublewrite()
		<< physical_size << " bytes into file '"
		<< m_filepath << "'";
 
-	IORequest	request(IORequest::WRITE);
-
	return(os_file_write(
-		       request,
+		       IORequestWrite,
		       m_filepath, m_handle, page, 0, physical_size)
	       != DB_SUCCESS);
 }
diff --git a/storage/innobase/fsp/fsp0fsp.cc b/storage/innobase/fsp/fsp0fsp.cc
index 6ac57ca395c..fedade4366f 100644
--- a/storage/innobase/fsp/fsp0fsp.cc
+++ b/storage/innobase/fsp/fsp0fsp.cc
@@ -553,11 +553,18 @@ void fsp_header_init(fil_space_t* space, ulint size, mtr_t* mtr)
	const page_id_t page_id(space->id, 0);
	const ulint zip_size = space->zip_size();
 
+	buf_block_t *free_block = buf_LRU_get_free_block(false);
+
	mtr_x_lock_space(space, mtr);
 
-	buf_block_t* block = buf_page_create(space, 0, zip_size, mtr);
+	buf_block_t* block = buf_page_create(space, 0, zip_size, mtr,
+					     free_block);
	buf_block_dbg_add_level(block, SYNC_FSP_PAGE);
 
+	if (UNIV_UNLIKELY(block != free_block)) {
+		buf_pool.free_block(free_block);
+	}
+
	space->size_in_header = size;
	space->free_len = 0;
	space->free_limit = 0;
@@ -874,11 +881,14 @@ fsp_fill_free_list(
			pages should be ignored. */
 
			if (i > 0) {
+				buf_block_t *f= buf_LRU_get_free_block(false);
				block= buf_page_create(
					space, static_cast<uint32_t>(i),
-					zip_size, mtr);
-
+					zip_size, mtr, f);
				buf_block_dbg_add_level(block, SYNC_FSP_PAGE);
+				if (UNIV_UNLIKELY(block != f)) {
+					buf_pool.free_block(f);
+				}
				fsp_init_file_page(space, block, mtr);
				mtr->write<2>(*block,
					      FIL_PAGE_TYPE + block->frame,
@@ -886,13 +896,16 @@ fsp_fill_free_list(
			}
 
			if (space->purpose != FIL_TYPE_TEMPORARY) {
+				buf_block_t *f= buf_LRU_get_free_block(false);
				block = buf_page_create(
					space,
					static_cast<uint32_t>(
						i + FSP_IBUF_BITMAP_OFFSET),
-					zip_size, mtr);
+					zip_size, mtr, f);
				buf_block_dbg_add_level(block, SYNC_FSP_PAGE);
-
+				if (UNIV_UNLIKELY(block != f)) {
+					buf_pool.free_block(f);
+				}
				fsp_init_file_page(space, block, mtr);
				mtr->write<2>(*block,
					      block->frame + FIL_PAGE_TYPE,
@@ -1042,8 +1055,11 @@ static
 buf_block_t*
 fsp_page_create(fil_space_t *space, page_no_t offset, mtr_t *mtr)
 {
+  buf_block_t *free_block= buf_LRU_get_free_block(false);
  buf_block_t *block= buf_page_create(space, static_cast<uint32_t>(offset),
-                                      space->zip_size(), mtr);
+                                      space->zip_size(), mtr, free_block);
+  if (UNIV_UNLIKELY(block != free_block))
+    buf_pool.free_block(free_block);
 
  fsp_init_file_page(space, block, mtr);
  return block;
 }
diff --git a/storage/innobase/fsp/fsp0sysspace.cc b/storage/innobase/fsp/fsp0sysspace.cc
index 5d381fca033..2e4b3678760 100644
--- a/storage/innobase/fsp/fsp0sysspace.cc
+++ b/storage/innobase/fsp/fsp0sysspace.cc
@@ -559,7 +559,7 @@ SysTablespace::read_lsn_and_check_flags(lsn_t* flushed_lsn)
	ut_a(it->order() == 0);
 
	if (srv_operation == SRV_OPERATION_NORMAL) {
-		buf_dblwr_init_or_load_pages(it->handle(), it->filepath());
+		buf_dblwr.init_or_load_pages(it->handle(), it->filepath());
	}
 
	/* Check the contents of the first page of the
diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
index ec2764f44a9..f9cc8629a73 100644
--- a/storage/innobase/handler/ha_innodb.cc
+++ b/storage/innobase/handler/ha_innodb.cc
@@ -878,8 +877,7 @@ static SHOW_VAR innodb_status_variables[]= {
  &export_vars.innodb_buffer_pool_pages_dirty, SHOW_SIZE_T},
  {"buffer_pool_bytes_dirty",
  &export_vars.innodb_buffer_pool_bytes_dirty, SHOW_SIZE_T},
-  {"buffer_pool_pages_flushed",
-  &export_vars.innodb_buffer_pool_pages_flushed, SHOW_SIZE_T},
+  {"buffer_pool_pages_flushed", &buf_flush_page_count, SHOW_SIZE_T},
  {"buffer_pool_pages_free",
  &export_vars.innodb_buffer_pool_pages_free, SHOW_SIZE_T},
 #ifdef UNIV_DEBUG
@@ -17505,6 +17504,7 @@ func_exit:
				block->frame[FIL_PAGE_SPACE_ID]);
	}
	mtr.commit();
+	log_write_up_to(mtr.commit_lsn(), true);
	goto func_exit;
 }
 #endif // UNIV_DEBUG
@@ -17966,7 +17966,7 @@ static bool innodb_buffer_pool_evict_uncompressed()
 {
	bool	all_evicted = true;
 
-	mutex_enter(&buf_pool.mutex);
+	mysql_mutex_lock(&buf_pool.mutex);
 
	for (buf_block_t* block = UT_LIST_GET_LAST(buf_pool.unzip_LRU);
	     block != NULL; ) {
@@ -17986,7 +17986,7 @@ static bool innodb_buffer_pool_evict_uncompressed()
		}
	}
 
-	mutex_exit(&buf_pool.mutex);
+	mysql_mutex_unlock(&buf_pool.mutex);
 
	return(all_evicted);
 }
@@ -19063,13 +19063,13 @@ static MYSQL_SYSVAR_ULONG(page_cleaners, deprecated::innodb_page_cleaners,
 static MYSQL_SYSVAR_DOUBLE(max_dirty_pages_pct, srv_max_buf_pool_modified_pct,
  PLUGIN_VAR_RQCMDARG,
  "Percentage of dirty pages allowed in bufferpool.",
-  NULL, innodb_max_dirty_pages_pct_update, 75.0, 0, 99.999, 0);
+  NULL, innodb_max_dirty_pages_pct_update, 90.0, 0, 99.999, 0);
 
 static MYSQL_SYSVAR_DOUBLE(max_dirty_pages_pct_lwm,
  srv_max_dirty_pages_pct_lwm,
  PLUGIN_VAR_RQCMDARG,
  "Percentage of dirty pages at which flushing kicks in.",
-  NULL, innodb_max_dirty_pages_pct_lwm_update, 0, 0, 99.999, 0);
+  NULL, innodb_max_dirty_pages_pct_lwm_update, 75.0, 0, 99.999, 0);
 
 static MYSQL_SYSVAR_DOUBLE(adaptive_flushing_lwm,
  srv_adaptive_flushing_lwm,
@@ -19234,13 +19234,6 @@ static MYSQL_SYSVAR_ULONG(buffer_pool_chunk_size, srv_buf_pool_chunk_unit,
  NULL, NULL,
  128 * 1024 * 1024, 1024 * 1024, LONG_MAX, 1024 * 1024);
 
-#if defined UNIV_DEBUG || defined UNIV_PERF_DEBUG
-static MYSQL_SYSVAR_ULONG(doublewrite_batch_size, srv_doublewrite_batch_size,
-  PLUGIN_VAR_OPCMDARG | PLUGIN_VAR_READONLY,
-  "Number of pages reserved in doublewrite buffer for batch flushing",
-  NULL, NULL, 120, 1, 127, 0);
-#endif /* defined UNIV_DEBUG || defined UNIV_PERF_DEBUG */
-
 static MYSQL_SYSVAR_ENUM(lock_schedule_algorithm, innodb_lock_schedule_algorithm,
  PLUGIN_VAR_RQCMDARG | PLUGIN_VAR_READONLY,
  "The algorithm Innodb uses for deciding which locks to grant next when"
@@ -19364,7 +19357,12 @@ static MYSQL_SYSVAR_UINT(defragment_frequency, srv_defragment_frequency,
 static MYSQL_SYSVAR_ULONG(lru_scan_depth, srv_LRU_scan_depth,
  PLUGIN_VAR_RQCMDARG,
  "How deep to scan LRU to keep it clean",
-  NULL, NULL, 1024, 100, ~0UL, 0);
+  NULL, NULL, 1536, 100, ~0UL, 0);
+
+static MYSQL_SYSVAR_SIZE_T(lru_flush_size, innodb_lru_flush_size,
+  PLUGIN_VAR_RQCMDARG,
+  "How many pages to flush on LRU eviction",
+  NULL, NULL, 32, 1, SIZE_T_MAX, 0);
 
 static MYSQL_SYSVAR_ULONG(flush_neighbors, srv_flush_neighbors,
  PLUGIN_VAR_OPCMDARG,
@@ -19994,6 +19992,7 @@ static struct st_mysql_sys_var* innobase_system_variables[]= {
  MYSQL_SYSVAR(defragment_fill_factor_n_recs),
  MYSQL_SYSVAR(defragment_frequency),
  MYSQL_SYSVAR(lru_scan_depth),
+  MYSQL_SYSVAR(lru_flush_size),
  MYSQL_SYSVAR(flush_neighbors),
  MYSQL_SYSVAR(checksum_algorithm),
  MYSQL_SYSVAR(log_checksums),
@@ -20115,9 +20114,6 @@ static struct st_mysql_sys_var* innobase_system_variables[]= {
  MYSQL_SYSVAR(buf_flush_list_now),
  MYSQL_SYSVAR(merge_threshold_set_all_debug),
 #endif /* UNIV_DEBUG */
-#if defined UNIV_DEBUG || defined UNIV_PERF_DEBUG
-  MYSQL_SYSVAR(doublewrite_batch_size),
-#endif /* defined UNIV_DEBUG || defined UNIV_PERF_DEBUG */
  MYSQL_SYSVAR(status_output),
  MYSQL_SYSVAR(status_output_locks),
  MYSQL_SYSVAR(print_all_deadlocks),
@@ -21191,10 +21187,10 @@ innodb_buffer_pool_size_validate(
 #endif /* UNIV_DEBUG */
 
-	mutex_enter(&buf_pool.mutex);
+	mysql_mutex_lock(&buf_pool.mutex);
 
	if (srv_buf_pool_old_size != srv_buf_pool_size) {
-		mutex_exit(&buf_pool.mutex);
+		mysql_mutex_unlock(&buf_pool.mutex);
		my_printf_error(ER_WRONG_ARGUMENTS,
			"Another buffer pool resize is already in progress.", MYF(0));
		return(1);
@@ -21205,13 +21201,13 @@ innodb_buffer_pool_size_validate(
	*static_cast<ulonglong*>(save) = requested_buf_pool_size;
 
	if (srv_buf_pool_size == ulint(intbuf)) {
-		mutex_exit(&buf_pool.mutex);
+		mysql_mutex_unlock(&buf_pool.mutex);
		/* nothing to do */
		return(0);
	}
 
	if (srv_buf_pool_size == requested_buf_pool_size) {
-		mutex_exit(&buf_pool.mutex);
+		mysql_mutex_unlock(&buf_pool.mutex);
		push_warning_printf(thd, Sql_condition::WARN_LEVEL_WARN,
				    ER_WRONG_ARGUMENTS,
				    "innodb_buffer_pool_size must be at least"
@@ -21222,7 +21218,7 @@ innodb_buffer_pool_size_validate(
	}
 
	srv_buf_pool_size = requested_buf_pool_size;
-	mutex_exit(&buf_pool.mutex);
+	mysql_mutex_unlock(&buf_pool.mutex);
 
	if (intbuf != static_cast<longlong>(requested_buf_pool_size)) {
		char	buf[64];
diff --git a/storage/innobase/handler/i_s.cc b/storage/innobase/handler/i_s.cc
index 9d77b6625b8..de4195b5727 100644
--- a/storage/innobase/handler/i_s.cc
+++ b/storage/innobase/handler/i_s.cc
@@ -1635,7 +1635,7 @@ i_s_cmpmem_fill_low(
	buf_buddy_stat_t	buddy_stat_local[BUF_BUDDY_SIZES_MAX + 1];
 
	/* Save buddy stats for buffer pool in local variables. */
-	mutex_enter(&buf_pool.mutex);
+	mysql_mutex_lock(&buf_pool.mutex);
 
	for (uint x = 0; x <= BUF_BUDDY_SIZES; x++) {
 
		zip_free_len_local[x] = (x < BUF_BUDDY_SIZES) ?
@@ -1650,7 +1650,7 @@ i_s_cmpmem_fill_low(
		}
	}
 
-	mutex_exit(&buf_pool.mutex);
+	mysql_mutex_unlock(&buf_pool.mutex);
 
	for (uint x = 0; x <= BUF_BUDDY_SIZES; x++) {
		buf_buddy_stat_t*	buddy_stat = &buddy_stat_local[x];
@@ -4270,7 +4270,7 @@ static int i_s_innodb_buffer_page_fill(THD *thd, TABLE_LIST *tables, Item *)
		buffer pool info printout, we are not required to
		preserve the overall consistency, so we can release
		mutex periodically */
-		mutex_enter(&buf_pool.mutex);
+		mysql_mutex_lock(&buf_pool.mutex);
 
		/* GO through each block in the chunk */
		for (n_blocks = num_to_process; n_blocks--; block++) {
@@ -4281,7 +4281,7 @@ static int i_s_innodb_buffer_page_fill(THD *thd, TABLE_LIST *tables, Item *)
			num_page++;
		}
 
-		mutex_exit(&buf_pool.mutex);
+		mysql_mutex_unlock(&buf_pool.mutex);
 
		/* Fill in information schema table with information
		just collected from the buffer chunk scan */
@@ -4604,7 +4604,7 @@ static int i_s_innodb_fill_buffer_lru(THD *thd, TABLE_LIST *tables, Item *)
	/* Aquire the mutex before allocating info_buffer, since
	UT_LIST_GET_LEN(buf_pool.LRU) could change */
-	mutex_enter(&buf_pool.mutex);
+	mysql_mutex_lock(&buf_pool.mutex);
 
	lru_len = UT_LIST_GET_LEN(buf_pool.LRU);
@@ -4636,7 +4636,7 @@ static int i_s_innodb_fill_buffer_lru(THD *thd, TABLE_LIST *tables, Item *)
	ut_ad(lru_pos == UT_LIST_GET_LEN(buf_pool.LRU));
 
 exit:
-	mutex_exit(&buf_pool.mutex);
+	mysql_mutex_unlock(&buf_pool.mutex);
 
	if (info_buffer) {
		status = i_s_innodb_buf_page_lru_fill(
diff --git a/storage/innobase/include/buf0buf.h b/storage/innobase/include/buf0buf.h
index 9b58fa76c01..d5b65bb7ed8 100644
--- a/storage/innobase/include/buf0buf.h
+++ b/storage/innobase/include/buf0buf.h
@@ -104,10 +104,6 @@ struct buf_pool_info_t
	ulint	n_pend_reads;		/*!< buf_pool.n_pend_reads, pages
					pending read */
	ulint	n_pending_flush_lru;	/*!< Pages pending flush in LRU */
-	ulint	n_pending_flush_single_page;/*!< Pages pending to be
-					flushed as part of single page
-					flushes issued by various user
-					threads */
	ulint	n_pending_flush_list;	/*!< Pages pending flush in
					FLUSH LIST */
	ulint	n_pages_made_young;	/*!< number of pages made young */
@@ -339,10 +335,11 @@ FILE_PAGE (the other is buf_page_get_gen).
 @param[in]	offset		offset of the tablespace
 @param[in]	zip_size	ROW_FORMAT=COMPRESSED page size, or 0
 @param[in,out]	mtr		mini-transaction
+@param[in,out]	free_block	pre-allocated buffer block
 @return pointer to the block, page bufferfixed */
 buf_block_t*
 buf_page_create(fil_space_t *space, uint32_t offset,
-		ulint zip_size, mtr_t *mtr);
+		ulint zip_size, mtr_t *mtr, buf_block_t *free_block);
 
 /********************************************************************//**
 Releases a compressed-only page acquired with buf_page_get_zip(). */
@@ -1038,7 +1035,7 @@ struct buf_block_t{
					is of size srv_page_size, and
					aligned to an address divisible by
					srv_page_size */
-	BPageLock	lock;		/*!< read-write lock of the buffer
+	rw_lock_t	lock;		/*!< read-write lock of the buffer
					frame */
 #ifdef UNIV_DEBUG
  /** whether page.list is in buf_pool.withdraw
@@ -1214,13 +1211,13 @@ public:
  virtual ~HazardPointer() {}
 
  /** @return current value */
-  buf_page_t *get() const { ut_ad(mutex_own(m_mutex)); return m_hp; }
+  buf_page_t *get() const { mysql_mutex_assert_owner(m_mutex); return m_hp; }
 
  /** Set current value
  @param bpage buffer block to be set as hp */
  void set(buf_page_t *bpage)
  {
-    ut_ad(mutex_own(m_mutex));
+    mysql_mutex_assert_owner(m_mutex);
    ut_ad(!bpage || bpage->in_file());
    m_hp= bpage;
  }
@@ -1229,7 +1226,7 @@ public:
  @param bpage buffer block to be compared
  @return true if it is hp */
  bool is_hp(const buf_page_t *bpage) const
-  { ut_ad(mutex_own(m_mutex)); return bpage == m_hp; }
+  { mysql_mutex_assert_owner(m_mutex); return bpage == m_hp; }
 
  /** Adjust the value of hp. This happens when some
  other thread working on the same list attempts to
@@ -1238,7 +1235,7 @@ public:
 
 #ifdef UNIV_DEBUG
  /** mutex that protects access to the m_hp. */
-  const ib_mutex_t *m_mutex= nullptr;
+  const mysql_mutex_t *m_mutex= nullptr;
 #endif /* UNIV_DEBUG */
 
 protected:
@@ -1494,7 +1491,10 @@ public:
  bool will_be_withdrawn(const byte *ptr) const
  {
    ut_ad(curr_size < old_size);
-    ut_ad(!resizing.load(std::memory_order_relaxed) || mutex_own(&mutex));
+#ifdef SAFE_MUTEX
+    if (resizing.load(std::memory_order_relaxed))
+      mysql_mutex_assert_owner(&mutex);
+#endif /* SAFE_MUTEX */
 
    for (const chunk_t *chunk= chunks + n_chunks_new,
         * const echunk= chunks + n_chunks;
@@ -1511,7 +1511,10 @@ public:
  bool will_be_withdrawn(const buf_page_t &bpage) const
  {
    ut_ad(curr_size < old_size);
-    ut_ad(!resizing.load(std::memory_order_relaxed) || mutex_own(&mutex));
+#ifdef SAFE_MUTEX
+    if (resizing.load(std::memory_order_relaxed))
+      mysql_mutex_assert_owner(&mutex);
+#endif /* SAFE_MUTEX */
 
    for (const chunk_t *chunk= chunks + n_chunks_new,
         * const echunk= chunks + n_chunks;
@@ -1524,7 +1527,10 @@ public:
 
  /** Release and evict a corrupted page.
  @param bpage    page that was being read */
-  void corrupted_evict(buf_page_t *bpage);
+  ATTRIBUTE_COLD void corrupted_evict(buf_page_t *bpage);
+
+  /** Release a memory block to the buffer pool. */
+  ATTRIBUTE_COLD void free_block(buf_block_t *block);
 
 #ifdef UNIV_DEBUG
  /** Find a block that points to a ROW_FORMAT=COMPRESSED page
@@ -1533,7 +1539,7 @@ public:
  @retval nullptr if not found */
  const buf_block_t *contains_zip(const void *data) const
  {
-    ut_ad(mutex_own(&mutex));
+    mysql_mutex_assert_owner(&mutex);
    for (const chunk_t *chunk= chunks, * const end= chunks + n_chunks;
         chunk != end; chunk++)
      if (const buf_block_t *block= chunk->contains_zip(data))
@@ -1556,8 +1562,8 @@ public:
  inline buf_block_t *block_from_ahi(const byte *ptr) const;
 #endif /* BTR_CUR_HASH_ADAPT */
 
-  bool is_block_lock(const BPageLock *l) const
-  { return is_block_field(reinterpret_cast<const void*>(l)); }
+  bool is_block_lock(const rw_lock_t *l) const
+  { return is_block_field(static_cast<const void*>(l)); }
 
  /**
  @return the smallest oldest_modification lsn for any page
@@ -1588,7 +1594,10 @@ public:
  buf_page_t *page_hash_get_low(const page_id_t id, const ulint fold)
  {
    ut_ad(id.fold() == fold);
-    ut_ad(mutex_own(&mutex) || page_hash.lock_get(fold)->is_locked());
+#ifdef SAFE_MUTEX
+    DBUG_ASSERT(mysql_mutex_is_owner(&mutex) ||
+                page_hash.lock_get(fold)->is_locked());
+#endif /* SAFE_MUTEX */
    buf_page_t *bpage;
    /* Look for the page in the hash table */
    HASH_SEARCH(hash, &page_hash, fold, buf_page_t*, bpage,
@@ -1655,7 +1664,10 @@ public:
  @return whether bpage a sentinel for a buffer pool watch */
  bool watch_is_sentinel(const buf_page_t &bpage)
  {
-    ut_ad(mutex_own(&mutex) || hash_lock_get(bpage.id())->is_locked());
+#ifdef SAFE_MUTEX
+    DBUG_ASSERT(mysql_mutex_is_owner(&mutex) ||
+                hash_lock_get(bpage.id())->is_locked());
+#endif /* SAFE_MUTEX */
    ut_ad(bpage.in_file());
 
    if (&bpage < &watch[0] || &bpage >= &watch[UT_ARR_SIZE(watch)])
@@ -1712,7 +1724,7 @@ public:
      HASH_DELETE(buf_page_t, hash, &page_hash, fold, watch);
      hash_lock->write_unlock();
      // Now that the watch is detached from page_hash, release it to watch[].
-      mutex_enter(&mutex);
+      mysql_mutex_lock(&mutex);
      /* It is possible that watch_remove() already removed the watch. */
      if (watch->id_ == id)
      {
@@ -1720,7 +1732,7 @@ public:
        ut_ad(watch->state() == BUF_BLOCK_ZIP_PAGE);
        watch->set_state(BUF_BLOCK_NOT_USED);
      }
-      mutex_exit(&mutex);
+      mysql_mutex_unlock(&mutex);
    }
    else
      hash_lock->write_unlock();
@@ -1753,14 +1765,13 @@ public:
  @return the predecessor in the LRU list */
  buf_page_t *LRU_remove(buf_page_t *bpage)
  {
-    ut_ad(mutex_own(&mutex));
+    mysql_mutex_assert_owner(&mutex);
    ut_ad(bpage->in_LRU_list);
    ut_ad(bpage->in_page_hash);
    ut_ad(!bpage->in_zip_hash);
    ut_ad(bpage->in_file());
    lru_hp.adjust(bpage);
    lru_scan_itr.adjust(bpage);
-    single_scan_itr.adjust(bpage);
    ut_d(bpage->in_LRU_list= false);
    buf_page_t *prev= UT_LIST_GET_PREV(LRU, bpage);
    UT_LIST_REMOVE(LRU, bpage);
@@ -1770,9 +1781,19 @@ public:
  /** Number of pages to read ahead */
  static constexpr uint32_t READ_AHEAD_PAGES= 64;
 
+  /** Buffer pool mutex */
+  mysql_mutex_t mutex;
+  /** Number of pending LRU flush. */
+  Atomic_counter<ulint> n_flush_LRU;
+  /** broadcast when n_flush_LRU reaches 0; protected by mutex */
+  mysql_cond_t done_flush_LRU;
+  /** Number of pending flush_list flush. */
+  Atomic_counter<ulint> n_flush_list;
+  /** broadcast when n_flush_list reaches 0; protected by mutex */
+  mysql_cond_t done_flush_list;
+
  /** @name General fields */
  /* @{ */
-  BufPoolMutex	mutex;		/*!< Buffer pool mutex */
  ulint		curr_pool_size;	/*!< Current pool size in bytes */
  ulint		LRU_old_ratio;	/*!< Reserve this much of the buffer
				pool for "old" blocks */
@@ -1903,36 +1924,23 @@ public:
 
  /* @} */
 
-	/** @name Page flushing algorithm fields */
+  /** @name Page flushing algorithm fields */
+  /* @{ */
 
-	/* @{ */
+  /** mutex protecting flush_list, buf_page_t::set_oldest_modification()
+  and buf_page_t::list pointers when !oldest_modification() */
+  mysql_mutex_t flush_list_mutex;
+  /** "hazard pointer" for flush_list scans; protected by flush_list_mutex */
+  FlushHp flush_hp;
+  /** modified blocks (a subset of LRU) */
+  UT_LIST_BASE_NODE_T(buf_page_t) flush_list;
+
+  /** signalled to wake up the page_cleaner; protected by flush_list_mutex */
+  mysql_cond_t do_flush_list;
+
+  // n_flush_LRU + n_flush_list is approximately COUNT(io_fix()==BUF_IO_WRITE)
+  // in flush_list
 
-	FlushListMutex	flush_list_mutex;/*!< mutex protecting the
-					flush list access. This mutex
-					protects flush_list
-					and bpage::list pointers when
-					the bpage is on flush_list. It
-					also protects writes to
-					bpage::oldest_modification and
-					flush_list_hp */
-	FlushHp		flush_hp;/*!< "hazard pointer"
-					used during scan of flush_list
-					while doing flush list batch.
-					Protected by flush_list_mutex */
-	UT_LIST_BASE_NODE_T(buf_page_t) flush_list;
-					/*!< base node of the modified block
-					list */
-	/** set if a flush of the type is being initialized */
-	Atomic_relaxed<bool> init_flush[3];
-	/** Number of pending writes of a flush type.
-	The sum of these is approximately the sum of BUF_IO_WRITE blocks. */
-	Atomic_counter<ulint> n_flush[3];
-	os_event_t	no_flush[3];
-					/*!< this is in the set state
-					when there is no flush batch
-					of the given type running;
-					os_event_set() and os_event_reset()
-					are protected by buf_pool_t::mutex */
 
	unsigned	freed_page_clock;/*!< a sequence number used
					to count the number of buffer
					blocks removed from the end of
@@ -1978,10 +1986,6 @@ public:
					replacable victim. Protected by
					buf_pool_t::mutex. */
	LRUItr		lru_scan_itr;
 
-	/** Iterator used to scan the LRU list when searching for
-	single page flushing victim. Protected by buf_pool_t::mutex. */
-	LRUItr		single_scan_itr;
-
	UT_LIST_BASE_NODE_T(buf_page_t) LRU;
					/*!< base node of the LRU list */
@@ -2020,16 +2024,12 @@ public:
 
  /** @return whether any I/O is pending */
  bool any_io_pending() const
  {
-    return n_pend_reads ||
-      n_flush[IORequest::LRU] || n_flush[IORequest::FLUSH_LIST] ||
-      n_flush[IORequest::SINGLE_PAGE];
+    return n_pend_reads || n_flush_LRU || n_flush_list;
  }
  /** @return total amount of pending I/O */
  ulint io_pending() const
  {
-    return n_pend_reads +
-      n_flush[IORequest::LRU] + n_flush[IORequest::FLUSH_LIST] +
-      n_flush[IORequest::SINGLE_PAGE];
+    return n_pend_reads + n_flush_LRU + n_flush_list;
  }
 
 private:
  /** Temporary memory for page_compressed and encrypted I/O */
@@ -2086,7 +2086,7 @@ extern buf_pool_t buf_pool;
 
 inline void page_hash_latch::read_lock()
 {
-  ut_ad(!mutex_own(&buf_pool.mutex));
+  mysql_mutex_assert_not_owner(&buf_pool.mutex);
  if (!read_trylock())
    read_lock_wait();
 }
@@ -2099,19 +2099,19 @@ inline void page_hash_latch::write_lock()
 
 inline void buf_page_t::add_buf_fix_count(uint32_t count)
 {
-  ut_ad(mutex_own(&buf_pool.mutex));
+  mysql_mutex_assert_owner(&buf_pool.mutex);
  buf_fix_count_+= count;
 }
 
 inline void buf_page_t::set_buf_fix_count(uint32_t count)
 {
-  ut_ad(mutex_own(&buf_pool.mutex));
+  mysql_mutex_assert_owner(&buf_pool.mutex);
  buf_fix_count_= count;
 }
 
 inline void buf_page_t::set_state(buf_page_state state)
 {
-  ut_ad(mutex_own(&buf_pool.mutex));
+  mysql_mutex_assert_owner(&buf_pool.mutex);
mysql_mutex_assert_owner(&buf_pool.mutex); #ifdef UNIV_DEBUG switch (state) { case BUF_BLOCK_REMOVE_HASH: @@ -2140,7 +2140,7 @@ inline void buf_page_t::set_state(buf_page_state state) inline void buf_page_t::set_io_fix(buf_io_fix io_fix) { - ut_ad(mutex_own(&buf_pool.mutex)); + mysql_mutex_assert_owner(&buf_pool.mutex); io_fix_= io_fix; } @@ -2166,7 +2166,7 @@ inline void buf_page_t::set_corrupt_id() /** Set oldest_modification when adding to buf_pool.flush_list */ inline void buf_page_t::set_oldest_modification(lsn_t lsn) { - ut_ad(mutex_own(&buf_pool.flush_list_mutex)); + mysql_mutex_assert_owner(&buf_pool.flush_list_mutex); ut_ad(!oldest_modification()); oldest_modification_= lsn; } @@ -2174,7 +2174,7 @@ inline void buf_page_t::set_oldest_modification(lsn_t lsn) /** Clear oldest_modification when removing from buf_pool.flush_list */ inline void buf_page_t::clear_oldest_modification() { - ut_ad(mutex_own(&buf_pool.flush_list_mutex)); + mysql_mutex_assert_owner(&buf_pool.flush_list_mutex); ut_d(const auto state= state_); ut_ad(state == BUF_BLOCK_FILE_PAGE || state == BUF_BLOCK_ZIP_PAGE || state == BUF_BLOCK_REMOVE_HASH); @@ -2185,7 +2185,7 @@ inline void buf_page_t::clear_oldest_modification() /** @return whether the block is modified and ready for flushing */ inline bool buf_page_t::ready_for_flush() const { - ut_ad(mutex_own(&buf_pool.mutex)); + mysql_mutex_assert_owner(&buf_pool.mutex); ut_ad(in_LRU_list); ut_a(in_file()); return oldest_modification() && io_fix_ == BUF_IO_NONE; @@ -2195,7 +2195,7 @@ inline bool buf_page_t::ready_for_flush() const The block can be dirty, but it must not be I/O-fixed or bufferfixed. 
*/ inline bool buf_page_t::can_relocate() const { - ut_ad(mutex_own(&buf_pool.mutex)); + mysql_mutex_assert_owner(&buf_pool.mutex); ut_ad(in_file()); ut_ad(in_LRU_list); return io_fix_ == BUF_IO_NONE && !buf_fix_count_; @@ -2204,7 +2204,7 @@ inline bool buf_page_t::can_relocate() const /** @return whether the block has been flagged old in buf_pool.LRU */ inline bool buf_page_t::is_old() const { - ut_ad(mutex_own(&buf_pool.mutex)); + mysql_mutex_assert_owner(&buf_pool.mutex); ut_ad(in_file()); ut_ad(in_LRU_list); return old; @@ -2213,7 +2213,7 @@ inline bool buf_page_t::is_old() const /** Set whether a block is old in buf_pool.LRU */ inline void buf_page_t::set_old(bool old) { - ut_ad(mutex_own(&buf_pool.mutex)); + mysql_mutex_assert_owner(&buf_pool.mutex); ut_ad(in_LRU_list); #ifdef UNIV_LRU_DEBUG @@ -2240,14 +2240,14 @@ inline void buf_page_t::set_old(bool old) #ifdef UNIV_DEBUG /** Forbid the release of the buffer pool mutex. */ -# define buf_pool_mutex_exit_forbid() do { \ - ut_ad(mutex_own(&buf_pool.mutex)); \ - buf_pool.mutex_exit_forbidden++; \ +# define buf_pool_mutex_exit_forbid() do { \ + mysql_mutex_assert_owner(&buf_pool.mutex); \ + buf_pool.mutex_exit_forbidden++; \ } while (0) /** Allow the release of the buffer pool mutex. */ # define buf_pool_mutex_exit_allow() do { \ - ut_ad(mutex_own(&buf_pool.mutex)); \ - ut_ad(buf_pool.mutex_exit_forbidden--); \ + mysql_mutex_assert_owner(&buf_pool.mutex); \ + ut_ad(buf_pool.mutex_exit_forbidden--); \ } while (0) #else /** Forbid the release of the buffer pool mutex. */ @@ -2265,8 +2265,7 @@ MEMORY: is not in free list, LRU list, or flush list, nor page hash table FILE_PAGE: space and offset are defined, is in page hash table if io_fix == BUF_IO_WRITE, - pool: no_flush[flush_type] is in reset state, - pool: n_flush[flush_type] > 0 + buf_pool.n_flush_LRU > 0 || buf_pool.n_flush_list > 0 (1) if buf_fix_count == 0, then is in LRU list, not in free list @@ -2303,7 +2302,7 @@ of the LRU list. 
@return buf_page_t from where to start scan. */ inline buf_page_t *LRUItr::start() { - ut_ad(mutex_own(m_mutex)); + mysql_mutex_assert_owner(m_mutex); if (!m_hp || m_hp->old) m_hp= UT_LIST_GET_LAST(buf_pool.LRU); diff --git a/storage/innobase/include/buf0buf.ic b/storage/innobase/include/buf0buf.ic index 2384c46af02..3074489527f 100644 --- a/storage/innobase/include/buf0buf.ic +++ b/storage/innobase/include/buf0buf.ic @@ -198,9 +198,9 @@ buf_block_free( /*===========*/ buf_block_t* block) /*!< in, own: block to be freed */ { - mutex_enter(&buf_pool.mutex); + mysql_mutex_lock(&buf_pool.mutex); buf_LRU_block_free_non_file_page(block); - mutex_exit(&buf_pool.mutex); + mysql_mutex_unlock(&buf_pool.mutex); } /********************************************************************//** @@ -213,12 +213,20 @@ buf_block_modify_clock_inc( /*=======================*/ buf_block_t* block) /*!< in: block */ { +#ifdef SAFE_MUTEX /* No latch is acquired for the shared temporary tablespace. */ ut_ad(fsp_is_system_temporary(block->page.id().space()) - || (mutex_own(&buf_pool.mutex) + || (mysql_mutex_is_owner(&buf_pool.mutex) && !block->page.buf_fix_count()) || rw_lock_own_flagged(&block->lock, RW_LOCK_FLAG_X | RW_LOCK_FLAG_SX)); +#else /* SAFE_MUTEX */ + /* No latch is acquired for the shared temporary tablespace. 
*/ + ut_ad(fsp_is_system_temporary(block->page.id().space()) + || !block->page.buf_fix_count() + || rw_lock_own_flagged(&block->lock, + RW_LOCK_FLAG_X | RW_LOCK_FLAG_SX)); +#endif /* SAFE_MUTEX */ assert_block_ahi_valid(block); block->modify_clock++; diff --git a/storage/innobase/include/buf0dblwr.h b/storage/innobase/include/buf0dblwr.h index fed64e0ee72..1b9415d38be 100644 --- a/storage/innobase/include/buf0dblwr.h +++ b/storage/innobase/include/buf0dblwr.h @@ -24,101 +24,38 @@ Doublewrite buffer module Created 2011/12/19 Inaam Rana *******************************************************/ -#ifndef buf0dblwr_h -#define buf0dblwr_h +#pragma once -#include "ut0byte.h" -#include "log0log.h" +#include "os0file.h" #include "buf0types.h" -#include "log0recv.h" - -/** Doublewrite system */ -extern buf_dblwr_t* buf_dblwr; - -/** Create the doublewrite buffer if the doublewrite buffer header -is not present in the TRX_SYS page. -@return whether the operation succeeded -@retval true if the doublewrite buffer exists or was created -@retval false if the creation failed (too small first data file) */ -MY_ATTRIBUTE((warn_unused_result)) -bool -buf_dblwr_create(); - -/** -At database startup initializes the doublewrite buffer memory structure if -we already have a doublewrite buffer created in the data files. If we are -upgrading to an InnoDB version which supports multiple tablespaces, then this -function performs the necessary update operations. If we are in a crash -recovery, this function loads the pages from double write buffer into memory. -@param[in] file File handle -@param[in] path Path name of file -@return DB_SUCCESS or error code */ -dberr_t -buf_dblwr_init_or_load_pages( - pfs_os_file_t file, - const char* path); - -/** Process and remove the double write buffer pages for all tablespaces. */ -void -buf_dblwr_process(); - -/****************************************************************//** -frees doublewrite buffer. 
*/ -void -buf_dblwr_free(); - -/** Update the doublewrite buffer on write completion. */ -void buf_dblwr_update(const buf_page_t &bpage, bool single_page); -/****************************************************************//** -Determines if a page number is located inside the doublewrite buffer. -@return TRUE if the location is inside the two blocks of the -doublewrite buffer */ -ibool -buf_dblwr_page_inside( -/*==================*/ - ulint page_no); /*!< in: page number */ - -/********************************************************************//** -Flushes possible buffered writes from the doublewrite memory buffer to disk. -It is very important to call this function after a batch of writes -has been posted, and also when we may have to wait for a page latch! -Otherwise a deadlock of threads can occur. */ -void -buf_dblwr_flush_buffered_writes(); /** Doublewrite control struct */ -struct buf_dblwr_t{ - ib_mutex_t mutex; /*!< mutex protecting the first_free - field and write_buf */ - ulint block1; /*!< the page number of the first - doublewrite block (64 pages) */ - ulint block2; /*!< page number of the second block */ - ulint first_free;/*!< first free position in write_buf - measured in units of srv_page_size */ - ulint b_reserved;/*!< number of slots currently reserved - for batch flush. */ - os_event_t b_event;/*!< event where threads wait for a - batch flush to end; - os_event_set() and os_event_reset() - are protected by buf_dblwr_t::mutex */ - ulint s_reserved;/*!< number of slots currently - reserved for single page flushes. */ - os_event_t s_event;/*!< event where threads wait for a - single page flush slot. Protected by mutex. */ - bool batch_running;/*!< set to TRUE if currently a batch - is being written from the doublewrite - buffer. 
*/ - byte* write_buf;/*!< write buffer used in writing to the - doublewrite buffer, aligned to an - address divisible by srv_page_size - (which is required by Windows aio) */ +class buf_dblwr_t +{ + /** the page number of the first doublewrite block (block_size() pages) */ + page_id_t block1= page_id_t(0, 0); + /** the page number of the second doublewrite block (block_size() pages) */ + page_id_t block2= page_id_t(0, 0); + + /** mutex protecting the data members below */ + mysql_mutex_t mutex; + /** condition variable for !batch_running */ + mysql_cond_t cond; + /** whether a batch is being written from the doublewrite buffer */ + bool batch_running; + /** first free position in write_buf measured in units of srv_page_size */ + ulint first_free; + /** number of slots reserved for the current write batch */ + ulint reserved; + /** the doublewrite buffer, aligned to srv_page_size */ + byte *write_buf; struct element { /** block descriptor */ buf_page_t *bpage; - /** flush type */ - IORequest::flush_t flush; + /** true=buf_pool.flush_list, false=buf_pool.LRU */ + bool lru; /** payload size in bytes */ size_t size; }; @@ -126,23 +63,67 @@ struct buf_dblwr_t{ /** buffer blocks to be written via write_buf */ element *buf_block_arr; + /** Initialize the doublewrite buffer data structure. + @param header doublewrite page header in the TRX_SYS page */ + inline void init(const byte *header); + + /** Flush possible buffered writes to persistent storage. */ + bool flush_buffered_writes(const ulint size); + +public: + /** Create or restore the doublewrite buffer in the TRX_SYS page. + @return whether the operation succeeded */ + bool create(); + /** Free the doublewrite buffer. */ + void close(); + + /** Initialize the doublewrite buffer memory structure on recovery. + If we are upgrading from a version before MySQL 4.1, then this + function performs the necessary update operations to support + innodb_file_per_table. 
If we are in a crash recovery, this function + loads the pages from double write buffer into memory. + @param file File handle + @param path Path name of file + @return DB_SUCCESS or error code */ + dberr_t init_or_load_pages(pfs_os_file_t file, const char *path); + + /** Process and remove the double write buffer pages for all tablespaces. */ + void recover(); + + /** Update the doublewrite buffer on write completion. */ + void write_completed(); + /** Flush possible buffered writes to persistent storage. + It is very important to call this function after a batch of writes has been + posted, and also when we may have to wait for a page latch! + Otherwise a deadlock of threads can occur. */ + void flush_buffered_writes(); + + /** Size of the doublewrite block in pages */ + uint32_t block_size() const { return FSP_EXTENT_SIZE; } + /** Schedule a page write. If the doublewrite memory buffer is full, - buf_dblwr_flush_buffered_writes() will be invoked to make space. - @param bpage buffer pool page to be written - @param flush type of flush - @param size payload size in bytes */ - void add_to_batch(buf_page_t *bpage, IORequest::flush_t flush, size_t size); - /** Write a page to the doublewrite buffer on disk, sync it, then write - the page to the datafile and sync the datafile. This function is used - for single page flushes. If all the buffers allocated for single page - flushes in the doublewrite buffer are in use we wait here for one to - become free. We are guaranteed that a slot will become free because any - thread that is using a slot must also release the slot before leaving - this function. - @param bpage buffer pool page to be written - @param sync whether synchronous operation is requested - @param size payload size in bytes */ - void write_single_page(buf_page_t *bpage, bool sync, size_t size); + flush_buffered_writes() will be invoked to make space. 
+ @param bpage buffer pool page to be written + @param lru true=buf_pool.LRU; false=buf_pool.flush_list + @param size payload size in bytes */ + void add_to_batch(buf_page_t *bpage, bool lru, size_t size); + + /** Determine whether the doublewrite buffer is initialized */ + bool is_initialised() const + { return UNIV_LIKELY(block1 != page_id_t(0, 0)); } + + /** @return whether a page identifier is part of the doublewrite buffer */ + bool is_inside(const page_id_t id) const + { + if (!is_initialised()) + return false; + ut_ad(block1 < block2); + if (id < block1) + return false; + const uint32_t size= block_size(); + return id < block1 + size || (id >= block2 && id < block2 + size); + } }; -#endif +/** The doublewrite buffer */ +extern buf_dblwr_t buf_dblwr; diff --git a/storage/innobase/include/buf0flu.h b/storage/innobase/include/buf0flu.h index 17568d0e2b1..12ebf6f01e9 100644 --- a/storage/innobase/include/buf0flu.h +++ b/storage/innobase/include/buf0flu.h @@ -31,7 +31,10 @@ Created 11/5/1995 Heikki Tuuri #include "log0log.h" #include "buf0types.h" -/** Number of pages flushed through non flush_list flushes. */ +/** Number of pages flushed. Protected by buf_pool.mutex. */ +extern ulint buf_flush_page_count; +/** Number of pages flushed via LRU. Protected by buf_pool.mutex. +Also included in buf_flush_page_count. */ extern ulint buf_lru_flush_page_count; /** Flag indicating if the page_cleaner is in active state. */ @@ -44,27 +47,24 @@ extern my_bool innodb_page_cleaner_disabled_debug; #endif /* UNIV_DEBUG */ -/** Event to synchronise with the flushing. */ -extern os_event_t buf_flush_event; +/** Remove all dirty pages belonging to a given tablespace when we are +deleting the data file of that tablespace. +The pages still remain a part of LRU and are evicted from +the list as they age towards the tail of the LRU. 
+@param id tablespace identifier */ +void buf_flush_remove_pages(ulint id); -class ut_stage_alter_t; - -/** Handled page counters for a single flush */ -struct flush_counters_t { - ulint flushed; /*!< number of dirty pages flushed */ - ulint evicted; /*!< number of clean pages evicted */ - ulint unzip_LRU_evicted;/*!< number of uncompressed page images - evicted */ -}; - -/** Remove a block from the flush list of modified blocks. -@param[in] bpage block to be removed from the flush list */ -void buf_flush_remove(buf_page_t* bpage); +/** Try to flush all the dirty pages that belong to a given tablespace. +@param id tablespace identifier +@return number dirty pages that there were for this tablespace */ +ulint buf_flush_dirty_pages(ulint id) + MY_ATTRIBUTE((warn_unused_result)); /*******************************************************************//** Relocates a buffer control block on the flush_list. Note that it is assumed that the contents of bpage has already been copied to dpage. */ +ATTRIBUTE_COLD void buf_flush_relocate_on_flush_list( /*=============================*/ @@ -74,10 +74,9 @@ buf_flush_relocate_on_flush_list( /** Complete write of a file page from buf_pool. @param bpage written page @param request write request -@param dblwr whether the doublewrite buffer was used -@param evict whether or not to evict the page from LRU list */ +@param dblwr whether the doublewrite buffer was used */ void buf_page_write_complete(buf_page_t *bpage, const IORequest &request, - bool dblwr, bool evict); + bool dblwr); /** Assign the full crc32 checksum for non-compressed page. @param[in,out] page page to be updated */ @@ -95,46 +94,15 @@ buf_flush_init_for_writing( void* page_zip_, bool use_full_checksum); -/** Do flushing batch of a given type. -NOTE: The calling thread is not allowed to own any latches on pages! 
-@param[in] lru true=buf_pool.LRU; false=buf_pool.flush_list -@param[in] min_n wished minimum mumber of blocks flushed -(it is not guaranteed that the actual number is that big, though) -@param[in] lsn_limit in the case BUF_FLUSH_LIST all blocks whose -oldest_modification is smaller than this should be flushed (if their number -does not exceed min_n), otherwise ignored -@param[out] n the number of pages which were processed is -passed back to caller. Ignored if NULL -@retval true if a batch was queued successfully. -@retval false if another batch of same type was already running. */ -bool buf_flush_do_batch(bool lru, ulint min_n, lsn_t lsn_limit, - flush_counters_t *n); - -/** This utility flushes dirty blocks from the end of the flush list. -NOTE: The calling thread is not allowed to own any latches on pages! -@param[in] min_n wished minimum mumber of blocks flushed (it is -not guaranteed that the actual number is that big, though) -@param[in] lsn_limit in the case BUF_FLUSH_LIST all blocks whose -oldest_modification is smaller than this should be flushed (if their number -does not exceed min_n), otherwise ignored -@param[out] n_processed the number of pages which were processed is -passed back to caller. Ignored if NULL. -@retval true if a batch was queued successfully -@retval false if another batch of same type was already running */ -bool buf_flush_lists(ulint min_n, lsn_t lsn_limit, ulint *n_processed); - -/******************************************************************//** -This function picks up a single page from the tail of the LRU -list, flushes it (if it is dirty), removes it from page_hash and LRU -list and puts it on the free list. It is called from user threads when -they are unable to find a replaceable page at the tail of the LRU -list i.e.: when the background LRU flushing in the page_cleaner thread -is not fast enough to keep pace with the workload. -@return true if success. 
*/ -bool buf_flush_single_page_from_LRU(); +/** Write out dirty blocks from buf_pool.flush_list. +@param max_n wished maximum number of blocks flushed +@param lsn buf_pool.get_oldest_modification(LSN_MAX) target (0=LRU flush) +@return the number of processed pages +@retval 0 if a batch of the same type (lsn==0 or lsn!=0) is already running */ +ulint buf_flush_lists(ulint max_n, lsn_t lsn); /** Wait until a flush batch ends. -@param[in] lru true=buf_pool.LRU; false=buf_pool.flush_list +@param lru true=buf_pool.LRU; false=buf_pool.flush_list */ void buf_flush_wait_batch_end(bool lru); /** Wait until a flush batch of the given lsn ends @param[in] new_oldest target oldest_modified_lsn to wait for */ @@ -156,33 +124,18 @@ buf_flush_note_modification( /** Initialize page_cleaner. */ void buf_flush_page_cleaner_init(); -/** Wait for any possible LRU flushes to complete. */ -void buf_flush_wait_LRU_batch_end(); +/** Wait for pending flushes to complete. */ +void buf_flush_wait_batch_end_acquiring_mutex(bool lru); #ifdef UNIV_DEBUG /** Validate the flush list. */ void buf_flush_validate(); #endif /* UNIV_DEBUG */ -/** Write a flushable page from buf_pool to a file. -buf_pool.mutex must be held. -@param bpage buffer control block -@param flush_type type of flush -@param space tablespace (or nullptr if not known) -@param sync whether this is a synchronous request - (only for flush_type=SINGLE_PAGE) -@return whether the page was flushed and buf_pool.mutex was released */ -bool buf_flush_page(buf_page_t *bpage, IORequest::flush_t flush_type, - fil_space_t *space, bool sync); - /** Synchronously flush dirty blocks. NOTE: The calling thread is not allowed to hold any buffer page latches! */ void buf_flush_sync(); -/** Request IO burst and wake page_cleaner up.
-@param[in] lsn_limit upper limit of LSN to be flushed */ -void buf_flush_request_force(lsn_t lsn_limit); - #include "buf0flu.ic" #endif diff --git a/storage/innobase/include/buf0lru.h b/storage/innobase/include/buf0lru.h index bdf56656692..540c14a49c9 100644 --- a/storage/innobase/include/buf0lru.h +++ b/storage/innobase/include/buf0lru.h @@ -34,6 +34,9 @@ Created 11/5/1995 Heikki Tuuri struct trx_t; struct fil_space_t; +/** Flush this many pages in buf_LRU_get_free_block() */ +extern size_t innodb_lru_flush_size; + /*####################################################################### These are low-level functions #########################################################################*/ @@ -41,12 +44,6 @@ These are low-level functions /** Minimum LRU list length for which the LRU_old pointer is defined */ #define BUF_LRU_OLD_MIN_LEN 512 /* 8 megabytes of 16k pages */ -/** Empty the flush list for all pages belonging to a tablespace. -@param[in] id tablespace identifier -@param[in] flush whether to write the pages to files -@param[in] first first page to be flushed or evicted */ -void buf_LRU_flush_or_remove_pages(ulint id, bool flush, ulint first = 0); - /** Try to free a block. If bpage is a descriptor of a compressed-only ROW_FORMAT=COMPRESSED page, the buf_page_t object will be freed as well. The caller must hold buf_pool.mutex. @@ -58,38 +55,34 @@ bool buf_LRU_free_page(buf_page_t *bpage, bool zip) MY_ATTRIBUTE((nonnull)); /** Try to free a replaceable block. -@param[in] scan_all true=scan the whole LRU list, - false=use BUF_LRU_SEARCH_SCAN_THRESHOLD +@param limit maximum number of blocks to scan @return true if found and freed */ -bool buf_LRU_scan_and_free_block(bool scan_all); +bool buf_LRU_scan_and_free_block(ulint limit= ULINT_UNDEFINED); /** @return a buffer block from the buf_pool.free list @retval NULL if the free list is empty */ buf_block_t* buf_LRU_get_free_only(); -/** Get a free block from the buf_pool. The block is taken off the -free list. 
If free list is empty, blocks are moved from the end of the -LRU list to the free list. +/** Get a block from the buf_pool.free list. +If the list is empty, blocks will be moved from the end of buf_pool.LRU +to buf_pool.free. This function is called from a user thread when it needs a clean block to read in a page. Note that we only ever get a block from the free list. Even when we flush a page or find a page in LRU scan we put it to free list to be used. * iteration 0: - * get a block from free list, success:done + * get a block from the buf_pool.free list, success:done * if buf_pool.try_LRU_scan is set - * scan LRU up to srv_LRU_scan_depth to find a clean block - * the above will put the block on free list + * scan LRU up to 100 pages to free a clean block * success:retry the free list - * flush one dirty page from tail of LRU to disk - * the above will put the block on free list + * flush up to innodb_lru_flush_size LRU blocks to data files + (until UT_LIST_GET_GEN(buf_pool.free) < innodb_lru_scan_depth) + * on buf_page_write_complete() the blocks will put on buf_pool.free list * success: retry the free list -* iteration 1: - * same as iteration 0 except: - * scan whole LRU list - * scan LRU list even if buf_pool.try_LRU_scan is not set -* iteration > 1: - * same as iteration 1 but sleep 10ms +* subsequent iterations: same as iteration 0 except: + * scan whole LRU list + * scan LRU list even if buf_pool.try_LRU_scan is not set @param have_mutex whether buf_pool.mutex is already being held @return the free control block, in state BUF_BLOCK_MEMORY */ diff --git a/storage/innobase/include/buf0types.h b/storage/innobase/include/buf0types.h index 55bd2ac3a5a..b50352a1c0b 100644 --- a/storage/innobase/include/buf0types.h +++ b/storage/innobase/include/buf0types.h @@ -27,8 +27,7 @@ Created 11/17/1995 Heikki Tuuri #ifndef buf0types_h #define buf0types_h -#include "os0event.h" -#include "ut0ut.h" +#include "univ.i" /** Buffer page (uncompressed or compressed) */ class 
buf_page_t; @@ -38,8 +37,6 @@ struct buf_block_t; struct buf_pool_stat_t; /** Buffer pool buddy statistics struct */ struct buf_buddy_stat_t; -/** Doublewrite memory struct */ -struct buf_dblwr_t; /** A buffer frame. @see page_t */ typedef byte buf_frame_t; @@ -194,10 +191,6 @@ extern const byte field_ref_zero[UNIV_PAGE_SIZE_MAX]; #include "sync0rw.h" #include "rw_lock.h" -typedef ib_mutex_t BufPoolMutex; -typedef ib_mutex_t FlushListMutex; -typedef rw_lock_t BPageLock; - class page_hash_latch : public rw_lock { public: diff --git a/storage/innobase/include/fil0fil.h b/storage/innobase/include/fil0fil.h index 1b89d38bef7..b6ff8b6b6bb 100644 --- a/storage/innobase/include/fil0fil.h +++ b/storage/innobase/include/fil0fil.h @@ -33,6 +33,7 @@ Created 10/25/1995 Heikki Tuuri #ifndef UNIV_INNOCHECKSUM +#include "buf0dblwr.h" #include "hash0hash.h" #include "log0recv.h" #include "dict0types.h" @@ -92,7 +93,6 @@ inline bool srv_is_undo_tablespace(ulint space_id) space_id < srv_undo_space_id_start + srv_undo_tablespaces_open; } -extern struct buf_dblwr_t* buf_dblwr; class page_id_t; /** Structure containing encryption specification */ @@ -415,12 +415,12 @@ public: ulint magic_n;/*!< FIL_SPACE_MAGIC_N */ - /** @return whether doublewrite buffering is needed */ - bool use_doublewrite() const - { - return !atomic_write_supported - && srv_use_doublewrite_buf && buf_dblwr; - } + /** @return whether doublewrite buffering is needed */ + bool use_doublewrite() const + { + return !atomic_write_supported && srv_use_doublewrite_buf && + buf_dblwr.is_initialised(); + } /** Append a file to the chain of files of a space. 
 @param[in]	name	file name of a file that is not open
diff --git a/storage/innobase/include/log0recv.h b/storage/innobase/include/log0recv.h
index e7cf100cbde..f822a874565 100644
--- a/storage/innobase/include/log0recv.h
+++ b/storage/innobase/include/log0recv.h
@@ -50,21 +50,12 @@ ATTRIBUTE_COLD void recv_recover_page(fil_space_t* space, buf_page_t* bpage)
 	MY_ATTRIBUTE((nonnull));
 
 /** Start recovering from a redo log checkpoint.
-@see recv_recovery_from_checkpoint_finish
 @param[in]	flush_lsn	FIL_PAGE_FILE_FLUSH_LSN
 of first system tablespace page
 @return error code or DB_SUCCESS */
 dberr_t
 recv_recovery_from_checkpoint_start(
 	lsn_t	flush_lsn);
-/** Complete recovery from a checkpoint. */
-void
-recv_recovery_from_checkpoint_finish(void);
-/********************************************************//**
-Initiates the rollback of active transactions. */
-void
-recv_recovery_rollback_active(void);
-/*===============================*/
 
 /** Whether to store redo log records in recv_sys.pages */
 enum store_t {
@@ -296,9 +287,10 @@ private:
   @param page_id    page identifier
   @param p          iterator pointing to page_id
   @param mtr        mini-transaction
+  @param b          pre-allocated buffer pool block
   @return whether the page was successfully initialized */
   inline buf_block_t *recover_low(const page_id_t page_id, map::iterator &p,
-                                  mtr_t &mtr);
+                                  mtr_t &mtr, buf_block_t *b);
 
   /** Attempt to initialize a page based on redo log records.
   @param page_id page identifier
   @return the recovered block
diff --git a/storage/innobase/include/os0file.h b/storage/innobase/include/os0file.h
index 08ea482333b..4be5e5341ba 100644
--- a/storage/innobase/include/os0file.h
+++ b/storage/innobase/include/os0file.h
@@ -182,32 +182,14 @@ static const ulint OS_FILE_OPERATION_NOT_SUPPORTED = 125;
 static const ulint OS_FILE_ERROR_MAX = 200;
 /* @} */
 
-/** Types for AIO operations @{ */
-
-/** No transformations during read/write, write as is. */
-#define IORequestRead		IORequest(IORequest::READ)
-#define IORequestWrite		IORequest(IORequest::WRITE)
-
 /** The I/O context that is passed down to the low level IO code */
 class IORequest {
 public:
-  /** Buffer pool flush types */
-  enum flush_t
-  {
-    /** via buf_pool.LRU */
-    LRU= 0,
-    /** via buf_pool.flush_list */
-    FLUSH_LIST,
-    /** single page of buf_poof.LRU */
-    SINGLE_PAGE
-  };
-
-  IORequest(ulint type= READ, buf_page_t *bpage= nullptr,
-            flush_t flush_type= LRU) :
-    m_bpage(bpage), m_type(static_cast<uint16_t>(type)),
-    m_flush_type(flush_type) {}
+  constexpr IORequest(ulint type= READ, buf_page_t *bpage= nullptr,
+                      bool lru= false) :
+    m_bpage(bpage), m_type(static_cast<uint16_t>(type)), m_LRU(lru) {}
 
 	/** Flags passed in the request, they can be ORred together. */
 	enum {
@@ -243,12 +225,6 @@ public:
 		return((m_type & WRITE) == WRITE);
 	}
 
-	/** Clear the punch hole flag */
-	void clear_punch_hole()
-	{
-		m_type &= uint16_t(~PUNCH_HOLE);
-	}
-
 	/** @return true if partial read warning disabled */
 	bool is_partial_io_warning_disabled() const
 		MY_ATTRIBUTE((warn_unused_result))
@@ -256,12 +232,6 @@ public:
 		return !!(m_type & DISABLE_PARTIAL_IO_WARNINGS);
 	}
 
-	/** Disable partial read warnings */
-	void disable_partial_io_warnings()
-	{
-		m_type |= DISABLE_PARTIAL_IO_WARNINGS;
-	}
-
 	/** @return true if punch hole should be used */
 	bool punch_hole() const
 		MY_ATTRIBUTE((warn_unused_result))
@@ -276,29 +246,15 @@ public:
 		return(is_read() ^ is_write());
 	}
 
-	/** Set the punch hole flag */
-	void set_punch_hole()
-	{
-		if (is_punch_hole_supported()) {
-			m_type |= PUNCH_HOLE;
-		}
-	}
-
 	/** Set the pointer to file node for IO
 	@param[in] node			File node */
-	inline void set_fil_node(fil_node_t* node);
+	void set_fil_node(fil_node_t *node) { m_fil_node= node; }
 
 	bool operator==(const IORequest& rhs) const
 	{
 		return(m_type == rhs.m_type);
 	}
 
-	/** Note that the IO is for double write recovery. */
-	void dblwr_recover()
-	{
-		m_type |= DBLWR_RECOVER;
-	}
-
 	/** @return true if the request is from the dblwr recovery */
 	bool is_dblwr_recover() const
 		MY_ATTRIBUTE((warn_unused_result))
@@ -306,24 +262,6 @@ public:
 		return((m_type & DBLWR_RECOVER) == DBLWR_RECOVER);
 	}
 
-	/** @return true if punch hole is supported */
-	static bool is_punch_hole_supported()
-	{
-
-		/* In this debugging mode, we act as if punch hole is
-		supported, and then skip any calls to actually punch a hole
-		here. In this way, Transparent Page Compression is still
-		being tested. */
-		DBUG_EXECUTE_IF("ignore_punch_hole",
-			return(true);
-		);
-
-#if defined(HAVE_FALLOC_PUNCH_HOLE_AND_KEEP_SIZE) || defined(_WIN32)
-		return(true);
-#else
-		return(false);
-#endif /* HAVE_FALLOC_PUNCH_HOLE_AND_KEEP_SIZE || _WIN32 */
-	}
-
 	ulint get_trim_length(ulint write_length) const
 	{
 		return (m_bpage ?
@@ -340,8 +278,8 @@ public:
 	@return DB_SUCCESS or error code */
 	dberr_t punch_hole(os_file_t fh, os_offset_t off, ulint len);
 
-	/** @return the flush type */
-	flush_t flush_type() const { return m_flush_type; }
+	/** @return type of page flush (for writes) */
+	bool is_LRU() const { return m_LRU; }
 
 private:
 	/** Page to be written on write operation. */
@@ -350,14 +288,16 @@ private:
 	/** File node */
 	fil_node_t*		m_fil_node= nullptr;
 
-	/** Request type bit flags */
-	uint16_t		m_type= READ;
+	/** Request type bit flags */
+	const uint16_t		m_type;
 
 	/** for writes, type of page flush */
-	flush_t			m_flush_type= LRU;
+	const bool		m_LRU= false;
 };
 
-/* @} */
+constexpr IORequest IORequestRead(IORequest::READ);
+constexpr IORequest IORequestWrite(IORequest::WRITE);
+
 /** Sparse file size information. */
 struct os_file_size_t {
diff --git a/storage/innobase/include/srv0mon.h b/storage/innobase/include/srv0mon.h
index 325bb3a2cee..a18ff5d49ad 100644
--- a/storage/innobase/include/srv0mon.h
+++ b/storage/innobase/include/srv0mon.h
@@ -196,16 +196,11 @@ enum monitor_id_t {
 	MONITOR_FLUSH_N_TO_FLUSH_BY_AGE,
 	MONITOR_FLUSH_ADAPTIVE_AVG_TIME_SLOT,
-	MONITOR_LRU_BATCH_FLUSH_AVG_TIME_SLOT,
-
 	MONITOR_FLUSH_ADAPTIVE_AVG_TIME_THREAD,
-	MONITOR_LRU_BATCH_FLUSH_AVG_TIME_THREAD,
 	MONITOR_FLUSH_ADAPTIVE_AVG_TIME_EST,
-	MONITOR_LRU_BATCH_FLUSH_AVG_TIME_EST,
 	MONITOR_FLUSH_AVG_TIME,
 
 	MONITOR_FLUSH_ADAPTIVE_AVG_PASS,
-	MONITOR_LRU_BATCH_FLUSH_AVG_PASS,
 	MONITOR_FLUSH_AVG_PASS,
 
 	MONITOR_LRU_GET_FREE_LOOPS,
@@ -234,9 +229,6 @@ enum monitor_id_t {
 	MONITOR_LRU_BATCH_EVICT_TOTAL_PAGE,
 	MONITOR_LRU_BATCH_EVICT_COUNT,
 	MONITOR_LRU_BATCH_EVICT_PAGES,
-	MONITOR_LRU_SINGLE_FLUSH_SCANNED,
-	MONITOR_LRU_SINGLE_FLUSH_SCANNED_NUM_CALL,
-	MONITOR_LRU_SINGLE_FLUSH_SCANNED_PER_CALL,
 	MONITOR_LRU_SINGLE_FLUSH_FAILURE_COUNT,
 	MONITOR_LRU_GET_FREE_SEARCH,
 	MONITOR_LRU_SEARCH_SCANNED,
diff --git a/storage/innobase/include/srv0srv.h b/storage/innobase/include/srv0srv.h
index ade0f1c198e..e956e421a6a 100644
--- a/storage/innobase/include/srv0srv.h
+++ b/storage/innobase/include/srv0srv.h
@@ -100,10 +100,6 @@ struct srv_stats_t
 	need to make a flush, in order to be able to read or create a page. */
 	ulint_ctr_1_t		buf_pool_wait_free;
 
-	/** Count the number of pages that were written from buffer
-	pool to the disk */
-	ulint_ctr_1_t		buf_pool_flushed;
-
 	/** Number of buffer pool reads that led to the reading of
 	a disk page */
 	ulint_ctr_1_t		buf_pool_reads;
@@ -409,7 +405,6 @@ extern unsigned long long srv_stats_modified_counter;
 extern my_bool	srv_stats_sample_traditional;
 
 extern my_bool	srv_use_doublewrite_buf;
-extern ulong	srv_doublewrite_batch_size;
 
 extern ulong	srv_checksum_algorithm;
 
 extern double	srv_max_buf_pool_modified_pct;
@@ -764,7 +759,6 @@ struct export_var_t{
 	ulint innodb_buffer_pool_read_requests;	/*!< buf_pool.stat.n_page_gets */
 	ulint innodb_buffer_pool_reads;		/*!< srv_buf_pool_reads */
 	ulint innodb_buffer_pool_wait_free;	/*!< srv_buf_pool_wait_free */
-	ulint innodb_buffer_pool_pages_flushed;	/*!< srv_buf_pool_flushed */
 	ulint innodb_buffer_pool_write_requests;/*!< srv_buf_pool_write_requests */
 	ulint innodb_buffer_pool_read_ahead_rnd;/*!< srv_read_ahead_rnd */
 	ulint innodb_buffer_pool_read_ahead;	/*!< srv_read_ahead */
diff --git a/storage/innobase/include/srv0start.h b/storage/innobase/include/srv0start.h
index 23dc8347129..324e3f0478d 100644
--- a/storage/innobase/include/srv0start.h
+++ b/storage/innobase/include/srv0start.h
@@ -112,11 +112,6 @@ enum srv_shutdown_t {
 	SRV_SHUTDOWN_INITIATED,
 	SRV_SHUTDOWN_CLEANUP,	/*!< Cleaning up in
 				logs_empty_and_mark_files_at_shutdown() */
-	SRV_SHUTDOWN_FLUSH_PHASE,/*!< At this phase the master and the
-				purge threads must have completed their
-				work. Once we enter this phase the
-				page_cleaner can clean up the buffer
-				pool and exit */
 	SRV_SHUTDOWN_LAST_PHASE,/*!< Last phase after ensuring that
 				the buffer pool can be freed: flush
 				all file spaces and close all files */
diff --git a/storage/innobase/include/sync0debug.h b/storage/innobase/include/sync0debug.h
index 55ea99cd47b..07e985465e0 100644
--- a/storage/innobase/include/sync0debug.h
+++ b/storage/innobase/include/sync0debug.h
@@ -1,7 +1,7 @@
 /*****************************************************************************
 
 Copyright (c) 2013, 2015, Oracle and/or its affiliates. All Rights Reserved.
-Copyright (c) 2017, MariaDB Corporation.
+Copyright (c) 2017, 2020, MariaDB Corporation.
 
 Portions of this file contain modifications contributed and copyrighted by
 Google, Inc. Those modifications are gratefully acknowledged and are described
@@ -44,10 +44,6 @@ void
 sync_check_close();
 
 #ifdef UNIV_DEBUG
-/** Enable sync order checking. */
-void
-sync_check_enable();
-
 /** Check if it is OK to acquire the latch.
 @param[in]	latch	latch type */
 void
diff --git a/storage/innobase/include/sync0sync.h b/storage/innobase/include/sync0sync.h
index 1b8b60e9f81..72f2d8ffb74 100644
--- a/storage/innobase/include/sync0sync.h
+++ b/storage/innobase/include/sync0sync.h
@@ -63,7 +63,6 @@ extern mysql_pfs_key_t	log_sys_mutex_key;
 extern mysql_pfs_key_t	log_cmdq_mutex_key;
 extern mysql_pfs_key_t	log_flush_order_mutex_key;
 extern mysql_pfs_key_t	recalc_pool_mutex_key;
-extern mysql_pfs_key_t	page_cleaner_mutex_key;
 extern mysql_pfs_key_t	purge_sys_pq_mutex_key;
 extern mysql_pfs_key_t	recv_sys_mutex_key;
 extern mysql_pfs_key_t	rtr_active_mutex_key;
diff --git a/storage/innobase/include/sync0types.h b/storage/innobase/include/sync0types.h
index 36c1ad4495a..a6d0bd8a86c 100644
--- a/storage/innobase/include/sync0types.h
+++ b/storage/innobase/include/sync0types.h
@@ -188,12 +188,6 @@ enum latch_level_t {
 
 	SYNC_ANY_LATCH,
 
-	SYNC_DOUBLEWRITE,
-
-	SYNC_BUF_FLUSH_LIST,
-
-	SYNC_BUF_POOL,
-
 	SYNC_POOL,
 	SYNC_POOL_MANAGER,
 
@@ -208,7 +202,6 @@ enum latch_level_t {
 	SYNC_RECV,
 	SYNC_LOG_FLUSH_ORDER,
 	SYNC_LOG,
-	SYNC_PAGE_CLEANER,
 	SYNC_PURGE_QUEUE,
 	SYNC_TRX_SYS_HEADER,
 	SYNC_REC_LOCK,
@@ -271,11 +264,9 @@
 up its meta-data. See sync0debug.cc. */
 enum latch_id_t {
 	LATCH_ID_NONE = 0,
-	LATCH_ID_BUF_POOL,
 	LATCH_ID_DICT_FOREIGN_ERR,
 	LATCH_ID_DICT_SYS,
 	LATCH_ID_FIL_SYSTEM,
-	LATCH_ID_FLUSH_LIST,
 	LATCH_ID_FTS_BG_THREADS,
 	LATCH_ID_FTS_DELETE,
 	LATCH_ID_FTS_DOC_ID,
@@ -285,7 +276,6 @@ enum latch_id_t {
 	LATCH_ID_IBUF_PESSIMISTIC_INSERT,
 	LATCH_ID_LOG_SYS,
 	LATCH_ID_LOG_FLUSH_ORDER,
-	LATCH_ID_PAGE_CLEANER,
 	LATCH_ID_PURGE_SYS_PQ,
 	LATCH_ID_RECALC_POOL,
 	LATCH_ID_RECV_SYS,
@@ -299,7 +289,6 @@ enum latch_id_t {
 	LATCH_ID_SRV_INNODB_MONITOR,
 	LATCH_ID_SRV_MISC_TMPFILE,
 	LATCH_ID_SRV_MONITOR_FILE,
-	LATCH_ID_BUF_DBLWR,
 	LATCH_ID_TRX_POOL,
 	LATCH_ID_TRX_POOL_MANAGER,
 	LATCH_ID_TRX,
@@ -982,6 +971,7 @@ struct sync_checker : public sync_check_functor_t
 	{
 		if (some_allowed) {
 			switch (level) {
+			case SYNC_FSP:
 			case SYNC_DICT:
 			case SYNC_DICT_OPERATION:
 			case SYNC_FTS_CACHE:
diff --git a/storage/innobase/include/trx0sys.h b/storage/innobase/include/trx0sys.h
index acb10428108..0bc8b95dd77 100644
--- a/storage/innobase/include/trx0sys.h
+++ b/storage/innobase/include/trx0sys.h
@@ -1,7 +1,7 @@
 /*****************************************************************************
 
 Copyright (c) 1996, 2016, Oracle and/or its affiliates. All Rights Reserved.
-Copyright (c) 2017, 2019, MariaDB Corporation.
+Copyright (c) 2017, 2020, MariaDB Corporation.
 
 This program is free software; you can redistribute it and/or modify it under
 the terms of the GNU General Public License as published by the Free Software
@@ -340,9 +340,6 @@
 FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID. */
 constexpr uint32_t TRX_SYS_DOUBLEWRITE_MAGIC_N= 536853855;
 /** Contents of TRX_SYS_DOUBLEWRITE_SPACE_ID_STORED */
 constexpr uint32_t TRX_SYS_DOUBLEWRITE_SPACE_ID_STORED_N= 1783657386;
-
-/** Size of the doublewrite block in pages */
-#define TRX_SYS_DOUBLEWRITE_BLOCK_SIZE	FSP_EXTENT_SIZE
 /* @} */
 
 trx_t* current_trx();
diff --git a/storage/innobase/log/log0log.cc b/storage/innobase/log/log0log.cc
index 090acc86843..7563f30e8fb 100644
--- a/storage/innobase/log/log0log.cc
+++ b/storage/innobase/log/log0log.cc
@@ -493,7 +493,7 @@ void log_t::create()
 	log record has a non-zero start lsn, a fact which we will use */
 
 	set_lsn(LOG_START_LSN + LOG_BLOCK_HDR_SIZE);
-	set_flushed_lsn(0);
+	set_flushed_lsn(LOG_START_LSN + LOG_BLOCK_HDR_SIZE);
 
 	ut_ad(srv_log_buffer_size >= 16 * OS_FILE_LOG_BLOCK_SIZE);
 	ut_ad(srv_log_buffer_size >= 4U << srv_page_size_shift);
@@ -936,7 +936,8 @@ loop:
 and invoke log_mutex_enter(). */
 static void log_write_flush_to_disk_low(lsn_t lsn)
 {
-  log_sys.log.flush();
+  if (!log_sys.log.writes_are_durable())
+    log_sys.log.flush();
   ut_a(lsn >= log_sys.get_flushed_lsn());
   log_sys.set_flushed_lsn(lsn);
 }
@@ -1129,12 +1130,7 @@ void log_write_up_to(lsn_t lsn, bool flush_to_disk, bool rotate_key)
     /* Flush the highest written lsn.*/
     auto flush_lsn = write_lock.value();
     flush_lock.set_pending(flush_lsn);
-
-    if (!log_sys.log.writes_are_durable())
-    {
-      log_write_flush_to_disk_low(flush_lsn);
-    }
-
+    log_write_flush_to_disk_low(flush_lsn);
     flush_lock.release(flush_lsn);
 
     innobase_mysql_log_notify(flush_lsn);
@@ -1186,7 +1182,7 @@
 this lsn which means that we could not start this flush batch */
 static bool log_preflush_pool_modified_pages(lsn_t new_oldest)
 {
-	bool	success;
+	bool	success;
 
 	if (recv_recovery_is_on()) {
 		/* If the recovery is running, we must first apply all
@@ -1204,29 +1200,22 @@ static bool log_preflush_pool_modified_pages(lsn_t new_oldest)
 	    || !buf_page_cleaner_is_active
 	    || srv_is_being_started) {
 
-		ulint	n_pages;
+		ulint	n_pages =
+			buf_flush_lists(ULINT_UNDEFINED, new_oldest);
 
-		success = buf_flush_lists(ULINT_MAX, new_oldest, &n_pages);
+		buf_flush_wait_batch_end_acquiring_mutex(false);
 
-		buf_flush_wait_batch_end(false);
-
-		if (!success) {
-			MONITOR_INC(MONITOR_FLUSH_SYNC_WAITS);
-		}
+		MONITOR_INC(MONITOR_FLUSH_SYNC_WAITS);
 
 		MONITOR_INC_VALUE_CUMULATIVE(
 			MONITOR_FLUSH_SYNC_TOTAL_PAGE,
 			MONITOR_FLUSH_SYNC_COUNT,
 			MONITOR_FLUSH_SYNC_PAGES, n_pages);
+
+		const lsn_t oldest = buf_pool.get_oldest_modification();
+		success = !oldest || oldest >= new_oldest;
 	} else {
 		/* better to wait for flushed by page cleaner */
-
-		if (srv_flush_sync) {
-			/* wake page cleaner for IO burst */
-			buf_flush_request_force(new_oldest);
-		}
-
 		buf_flush_wait_flushed(new_oldest);
 
 		success = true;
@@ -1515,6 +1504,41 @@ void log_check_margins()
 
 extern void buf_resize_shutdown();
 
+/** @return the number of dirty pages in the buffer pool */
+static ulint flush_list_length()
+{
+  mysql_mutex_lock(&buf_pool.flush_list_mutex);
+  const ulint len= UT_LIST_GET_LEN(buf_pool.flush_list);
+  mysql_mutex_unlock(&buf_pool.flush_list_mutex);
+  return len;
+}
+
+static void flush_buffer_pool()
+{
+  service_manager_extend_timeout(INNODB_EXTEND_TIMEOUT_INTERVAL,
+                                 "Waiting to flush the buffer pool");
+  while (buf_pool.n_flush_list || flush_list_length())
+  {
+    buf_flush_lists(ULINT_UNDEFINED, LSN_MAX);
+    timespec abstime;
+
+    if (buf_pool.n_flush_list)
+    {
+      service_manager_extend_timeout(INNODB_EXTEND_TIMEOUT_INTERVAL,
+                                     "Waiting to flush " ULINTPF " pages",
+                                     flush_list_length());
+      set_timespec(abstime, INNODB_EXTEND_TIMEOUT_INTERVAL / 2);
+      mysql_mutex_lock(&buf_pool.mutex);
+      while (buf_pool.n_flush_list)
+        mysql_cond_timedwait(&buf_pool.done_flush_list, &buf_pool.mutex,
+                             &abstime);
+      mysql_mutex_unlock(&buf_pool.mutex);
+    }
+  }
+
+  ut_ad(!buf_pool.any_io_pending());
+}
+
 /** Make a checkpoint at the latest lsn on shutdown. */
 void logs_empty_and_mark_files_at_shutdown()
 {
@@ -1616,30 +1640,26 @@ wait_suspend_loop:
 			goto wait_suspend_loop;
 		}
 
-	buf_load_dump_end();
-
-	srv_shutdown_state = SRV_SHUTDOWN_FLUSH_PHASE;
+	if (buf_page_cleaner_is_active) {
+		thread_name = "page cleaner thread";
+		mysql_cond_signal(&buf_pool.do_flush_list);
+		goto wait_suspend_loop;
+	}
 
-	/* At this point only page_cleaner should be active. We wait
-	here to let it complete the flushing of the buffer pools
-	before proceeding further. */
+	buf_load_dump_end();
 
-	count = 0;
-	service_manager_extend_timeout(COUNT_INTERVAL * CHECK_INTERVAL/1000000 * 2,
-		"Waiting for page cleaner");
-	while (buf_page_cleaner_is_active) {
-		++count;
-		os_thread_sleep(CHECK_INTERVAL);
-		if (srv_print_verbose_log && count > COUNT_INTERVAL) {
-			service_manager_extend_timeout(COUNT_INTERVAL * CHECK_INTERVAL/1000000 * 2,
-				"Waiting for page cleaner");
-			ib::info() << "Waiting for page_cleaner to "
-				"finish flushing of buffer pool";
-			/* This is a workaround to avoid the InnoDB hang
-			when OS datetime changed backwards */
-			os_event_set(buf_flush_event);
+	if (!buf_pool.is_initialised()) {
+		ut_ad(!srv_was_started);
+	} else if (ulint pending_io = buf_pool.io_pending()) {
+		if (srv_print_verbose_log && count > 600) {
+			ib::info() << "Waiting for " << pending_io << " buffer"
+				" page I/Os to complete";
 			count = 0;
 		}
+
+		goto loop;
+	} else {
+		flush_buffer_pool();
 	}
 
 	if (log_sys.is_initialised()) {
@@ -1660,18 +1680,6 @@ wait_suspend_loop:
 		}
 	}
 
-	if (!buf_pool.is_initialised()) {
-		ut_ad(!srv_was_started);
-	} else if (ulint pending_io = buf_pool.io_pending()) {
-		if (srv_print_verbose_log && count > 600) {
-			ib::info() << "Waiting for " << pending_io << " buffer"
-				" page I/Os to complete";
-			count = 0;
-		}
-
-		goto loop;
-	}
-
 	if (srv_fast_shutdown == 2 || !srv_was_started) {
 		if (!srv_read_only_mode && srv_was_started) {
 			ib::info() << "MySQL has requested a very fast"
diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc
index aff1d011a8d..11c57618e53 100644
--- a/storage/innobase/log/log0recv.cc
+++ b/storage/innobase/log/log0recv.cc
@@ -987,9 +987,9 @@ void recv_sys_t::debug_free()
 {
   ut_ad(this == &recv_sys);
   ut_ad(is_initialised());
-  ut_ad(!recv_recovery_is_on());
 
   mutex_enter(&mutex);
 
+  recovery_on= false;
   pages.clear();
   ut_free_dodump(buf, RECV_PARSING_BUF_SIZE);
 
@@ -2497,9 +2497,11 @@ static void recv_read_in_area(page_id_t page_id)
 @param page_id    page identifier
 @param p          iterator pointing to page_id
 @param mtr        mini-transaction
+@param b          pre-allocated buffer pool block
 @return whether the page was successfully initialized */
 inline buf_block_t *recv_sys_t::recover_low(const page_id_t page_id,
-                                            map::iterator &p, mtr_t &mtr)
+                                            map::iterator &p, mtr_t &mtr,
+                                            buf_block_t *b)
 {
   ut_ad(mutex_own(&mutex));
   ut_ad(p->first == page_id);
@@ -2515,20 +2517,21 @@ inline buf_block_t *recv_sys_t::recover_low(const page_id_t page_id,
   {
     mtr.start();
     mtr.set_log_mode(MTR_LOG_NO_REDO);
-    block= buf_page_create(space, page_id.page_no(), space->zip_size(), &mtr);
-    p= recv_sys.pages.find(page_id);
-    if (p == recv_sys.pages.end())
+    block= buf_page_create(space, page_id.page_no(), space->zip_size(), &mtr,
+                           b);
+    if (UNIV_UNLIKELY(block != b))
     {
       /* The page happened to exist in the buffer pool, or it was
       just being read in. Before buf_page_get_with_no_latch()
       returned to buf_page_create(), all changes must have been
       applied to the page already. */
+      ut_ad(recv_sys.pages.find(page_id) == recv_sys.pages.end());
       mtr.commit();
       block= nullptr;
     }
     else
     {
-      ut_ad(&recs == &p->second);
+      ut_ad(&recs == &recv_sys.pages.find(page_id)->second);
       i.created= true;
       buf_block_dbg_add_level(block, SYNC_NO_ORDER_CHECK);
       recv_recover_page(block, mtr, p, space, &i);
@@ -2548,6 +2551,7 @@ inline buf_block_t *recv_sys_t::recover_low(const page_id_t page_id,
 @return whether the page was successfully initialized */
 buf_block_t *recv_sys_t::recover_low(const page_id_t page_id)
 {
+  buf_block_t *free_block= buf_LRU_get_free_block(false);
   buf_block_t *block= nullptr;
 
   mutex_enter(&mutex);
@@ -2556,10 +2560,13 @@ buf_block_t *recv_sys_t::recover_low(const page_id_t page_id)
   if (p != pages.end() && p->second.state == page_recv_t::RECV_WILL_NOT_READ)
   {
     mtr_t mtr;
-    block= recover_low(page_id, p, mtr);
+    block= recover_low(page_id, p, mtr, free_block);
+    ut_ad(!block || block == free_block);
   }
 
   mutex_exit(&mutex);
+  if (UNIV_UNLIKELY(!block))
+    buf_pool.free_block(free_block);
   return block;
 }
 
@@ -2614,6 +2621,8 @@ void recv_sys_t::apply(bool last_batch)
         trim(page_id_t(id + srv_undo_space_id_start, t.pages), t.lsn);
     }
 
+  buf_block_t *free_block= buf_LRU_get_free_block(false);
+
   for (map::iterator p= pages.begin(); p != pages.end(); )
   {
     const page_id_t page_id= p->first;
@@ -2626,7 +2635,14 @@ void recv_sys_t::apply(bool last_batch)
       p++;
       continue;
     case page_recv_t::RECV_WILL_NOT_READ:
-      recover_low(page_id, p, mtr);
+      if (UNIV_LIKELY(!!recover_low(page_id, p, mtr, free_block)))
+      {
+        mutex_exit(&mutex);
+        free_block= buf_LRU_get_free_block(false);
+        mutex_enter(&mutex);
+next_page:
+        p= pages.lower_bound(page_id);
+      }
       continue;
     case page_recv_t::RECV_NOT_PROCESSED:
       mtr.start();
@@ -2652,9 +2668,11 @@ void recv_sys_t::apply(bool last_batch)
       continue;
     }
 
-    p= pages.lower_bound(page_id);
+    goto next_page;
   }
 
+  buf_pool.free_block(free_block);
+
   /* Wait until all the pages have been processed */
   while (!pages.empty())
   {
@@ -2686,7 +2704,6 @@ void recv_sys_t::apply(bool last_batch)
     /* Instead of flushing, last_batch could sort the buf_pool.flush_list
     in ascending order of buf_page_t::oldest_modification. */
-    buf_flush_wait_LRU_batch_end();
     buf_flush_sync();
 
     if (!last_batch)
@@ -3250,7 +3267,6 @@ recv_init_crash_recovery_spaces(bool rescan, bool& missing_tablespace)
 }
 
 /** Start recovering from a redo log checkpoint.
-@see recv_recovery_from_checkpoint_finish
 @param[in]	flush_lsn	FIL_PAGE_FILE_FLUSH_LSN
 of first system tablespace page
 @return error code or DB_SUCCESS */
@@ -3268,10 +3284,10 @@ recv_recovery_from_checkpoint_start(lsn_t flush_lsn)
 	ut_ad(srv_operation == SRV_OPERATION_NORMAL
 	      || srv_operation == SRV_OPERATION_RESTORE
 	      || srv_operation == SRV_OPERATION_RESTORE_EXPORT);
-	ut_d(mutex_enter(&buf_pool.flush_list_mutex));
+	ut_d(mysql_mutex_lock(&buf_pool.flush_list_mutex));
 	ut_ad(UT_LIST_GET_LEN(buf_pool.LRU) == 0);
 	ut_ad(UT_LIST_GET_LEN(buf_pool.unzip_LRU) == 0);
-	ut_d(mutex_exit(&buf_pool.flush_list_mutex));
+	ut_d(mysql_mutex_unlock(&buf_pool.flush_list_mutex));
 
 	if (srv_force_recovery >= SRV_FORCE_NO_LOG_REDO) {
 
@@ -3424,6 +3440,11 @@ completed:
 	}
 
 	log_sys.set_lsn(recv_sys.recovered_lsn);
+	if (UNIV_LIKELY(log_sys.get_flushed_lsn() < recv_sys.recovered_lsn)) {
+		/* This may already have been set by create_log_file()
+		if no logs existed when the server started up. */
+		log_sys.set_flushed_lsn(recv_sys.recovered_lsn);
+	}
 
 	if (recv_needed_recovery) {
 		bool missing_tablespace = false;
@@ -3472,7 +3493,7 @@ completed:
 	recv_sys.parse_start_lsn = checkpoint_lsn;
 
 	if (srv_operation == SRV_OPERATION_NORMAL) {
-		buf_dblwr_process();
+		buf_dblwr.recover();
 	}
 
 	ut_ad(srv_force_recovery <= SRV_FORCE_NO_UNDO_LOG_SCAN);
@@ -3556,19 +3577,6 @@ completed:
 	return(DB_SUCCESS);
 }
 
-/** Complete recovery from a checkpoint. */
-void recv_recovery_from_checkpoint_finish()
-{
-  /* Free the resources of the recovery system */
-
-  recv_sys.recovery_on= false;
-
-  recv_sys.debug_free();
-
-  /* Enable innodb_sync_debug checks */
-  ut_d(sync_check_enable());
-}
-
 bool recv_dblwr_t::validate_page(const page_id_t page_id,
                                  const byte *page,
                                  const fil_space_t *space,
@@ -3651,7 +3659,7 @@ byte *recv_dblwr_t::find_page(const page_id_t page_id,
     if (lsn <= max_lsn ||
         !validate_page(page_id, page, space, tmp_buf))
     {
-      /* Mark processed for subsequent iterations in buf_dblwr_process() */
+      /* Mark processed for subsequent iterations in buf_dblwr_t::recover() */
      memset(page + FIL_PAGE_LSN, 0, 8);
      continue;
    }
diff --git a/storage/innobase/os/os0file.cc b/storage/innobase/os/os0file.cc
index 0a7e0ed5dc8..e0817934e67 100644
--- a/storage/innobase/os/os0file.cc
+++ b/storage/innobase/os/os0file.cc
@@ -153,9 +153,6 @@ static ulint	os_innodb_umask	= 0;
 
 #endif /* _WIN32 */
 
-/** Flag indicating if the page_cleaner is in active state. */
-extern bool buf_page_cleaner_is_active;
-
 #ifdef WITH_INNODB_DISALLOW_WRITES
 #define WAIT_ALLOW_WRITES() os_event_wait(srv_allow_writes_event)
 #else
@@ -3829,9 +3826,7 @@ IORequest::punch_hole(os_file_t fh, os_offset_t off, ulint len)
 		/* If punch hole is not supported,
 		set space so that it is not used. */
 		if (err == DB_IO_NO_PUNCH_HOLE) {
-			if (m_fil_node) {
-				m_fil_node->space->punch_hole = false;
-			}
+			m_fil_node->space->punch_hole = false;
 			err = DB_SUCCESS;
 		}
 	}
diff --git a/storage/innobase/row/row0import.cc b/storage/innobase/row/row0import.cc
index 2ec0a06709e..539c2e83f04 100644
--- a/storage/innobase/row/row0import.cc
+++ b/storage/innobase/row/row0import.cc
@@ -3423,11 +3423,10 @@ fil_iterate(
 			? iter.crypt_io_buffer : io_buffer;
 		byte* const writeptr = readptr;
 
-		IORequest	read_request(IORequest::READ);
-		read_request.disable_partial_io_warnings();
-
 		err = os_file_read_no_error_handling(
-			read_request, iter.file, readptr, offset, n_bytes, 0);
+			IORequest(IORequest::READ
+				  | IORequest::DISABLE_PARTIAL_IO_WARNINGS),
+			iter.file, readptr, offset, n_bytes, 0);
 		if (err != DB_SUCCESS) {
 			ib::error() << iter.filepath
 				    << ": os_file_read() failed";
@@ -3760,11 +3759,10 @@ fil_tablespace_iterate(
 
 	/* Read the first page and determine the page and zip size. */
 
-	IORequest	request(IORequest::READ);
-	request.disable_partial_io_warnings();
-
-	err = os_file_read_no_error_handling(request, file, page, 0,
-					     srv_page_size, 0);
+	err = os_file_read_no_error_handling(
+		IORequest(IORequest::READ
+			  | IORequest::DISABLE_PARTIAL_IO_WARNINGS),
+		file, page, 0, srv_page_size, 0);
 
 	if (err == DB_SUCCESS) {
 		err = callback.init(file_size, block);
@@ -4175,7 +4173,7 @@ row_import_for_mysql(
 	The only dirty pages generated should be from the pessimistic purge
 	of delete marked records that couldn't be purged in Phase I. */
-	buf_LRU_flush_or_remove_pages(prebuilt->table->space_id, true);
+	while (buf_flush_dirty_pages(prebuilt->table->space_id));
 
 	ib::info() << "Phase IV - Flush complete";
 	prebuilt->table->space->set_imported();
diff --git a/storage/innobase/row/row0log.cc b/storage/innobase/row/row0log.cc
index 4db1b7cfa7f..02429aac23b 100644
--- a/storage/innobase/row/row0log.cc
+++ b/storage/innobase/row/row0log.cc
@@ -410,7 +410,6 @@ row_log_online_op(
 	const os_offset_t	byte_offset
 		= (os_offset_t) log->tail.blocks
 		* srv_sort_buf_size;
-	IORequest		request(IORequest::WRITE);
 	byte*			buf = log->tail.block;
 
 	if (byte_offset + srv_sort_buf_size >= srv_online_max_size) {
@@ -448,7 +447,7 @@ row_log_online_op(
 		log->tail.blocks++;
 
 		if (os_file_write(
-			    request,
+			    IORequestWrite,
 			    "(modification log)",
 			    log->fd,
 			    buf, byte_offset, srv_sort_buf_size)
@@ -549,7 +548,6 @@ row_log_table_close_func(
 	const os_offset_t	byte_offset
 		= (os_offset_t) log->tail.blocks
 		* srv_sort_buf_size;
-	IORequest		request(IORequest::WRITE);
 	byte*			buf = log->tail.block;
 
 	if (byte_offset + srv_sort_buf_size >= srv_online_max_size) {
@@ -587,7 +585,7 @@ row_log_table_close_func(
 		log->tail.blocks++;
 
 		if (os_file_write(
-			    request,
+			    IORequestWrite,
 			    "(modification log)",
 			    log->fd,
 			    buf, byte_offset, srv_sort_buf_size)
@@ -2874,11 +2872,10 @@ all_done:
 			goto func_exit;
 		}
 
-		IORequest	request(IORequest::READ);
 		byte*		buf = index->online_log->head.block;
 
 		if (os_file_read_no_error_handling(
-			    request, index->online_log->fd,
+			    IORequestRead, index->online_log->fd,
 			    buf, ofs, srv_sort_buf_size, 0) != DB_SUCCESS) {
 			ib::error()
 				<< "Unable to read temporary file"
@@ -3767,8 +3764,6 @@ all_done:
 		os_offset_t	ofs = static_cast<os_offset_t>(
 			index->online_log->head.blocks)
 			* srv_sort_buf_size;
-		IORequest	request(IORequest::READ);
-
 		ut_ad(has_index_lock);
 		has_index_lock = false;
 		rw_lock_x_unlock(dict_index_get_lock(index));
@@ -3783,7 +3778,7 @@ all_done:
 		byte*	buf = index->online_log->head.block;
 
 		if (os_file_read_no_error_handling(
-			    request, index->online_log->fd,
+			    IORequestRead, index->online_log->fd,
 			    buf, ofs, srv_sort_buf_size, 0) != DB_SUCCESS) {
 			ib::error()
 				<< "Unable to read temporary file"
diff --git a/storage/innobase/row/row0merge.cc b/storage/innobase/row/row0merge.cc
index 37814b70188..a57a53aaaea 100644
--- a/storage/innobase/row/row0merge.cc
+++ b/storage/innobase/row/row0merge.cc
@@ -1080,9 +1080,8 @@ row_merge_read(
 	DBUG_LOG("ib_merge_sort", "fd=" << fd << " ofs=" << ofs);
 	DBUG_EXECUTE_IF("row_merge_read_failure", DBUG_RETURN(FALSE););
 
-	IORequest	request(IORequest::READ);
 	const bool	success = DB_SUCCESS == os_file_read_no_error_handling(
-		request, fd, buf, ofs, srv_sort_buf_size, 0);
+		IORequestRead, fd, buf, ofs, srv_sort_buf_size, 0);
 
 	/* If encryption is enabled decrypt buffer */
 	if (success && log_tmp_is_encrypted()) {
@@ -1144,9 +1143,8 @@ row_merge_write(
 		out_buf = crypt_buf;
 	}
 
-	IORequest	request(IORequest::WRITE);
 	const bool	success = DB_SUCCESS == os_file_write(
-		request, "(merge)", fd, out_buf, ofs, buf_len);
+		IORequestWrite, "(merge)", fd, out_buf, ofs, buf_len);
 
 #ifdef POSIX_FADV_DONTNEED
 	/* The block will be needed on the next merge pass,
diff --git a/storage/innobase/row/row0mysql.cc b/storage/innobase/row/row0mysql.cc
index 00c2c41c1d0..c9ccb35ea05 100644
--- a/storage/innobase/row/row0mysql.cc
+++ b/storage/innobase/row/row0mysql.cc
@@ -3391,8 +3391,7 @@ row_drop_table_for_mysql(
 	dict_stats_recalc_pool_del(table);
 	dict_stats_defrag_pool_del(table, NULL);
 	if (btr_defragment_active) {
-		/* During fts_drop_orphaned_tables() in
-		recv_recovery_rollback_active() the
+		/* During fts_drop_orphaned_tables() the
 		btr_defragment_mutex has not yet been
 		initialized by btr_defragment_init(). */
 		btr_defragment_remove_table(table);
diff --git a/storage/innobase/row/row0quiesce.cc b/storage/innobase/row/row0quiesce.cc
index ff50e4f1510..0cddde4b3ca 100644
--- a/storage/innobase/row/row0quiesce.cc
+++ b/storage/innobase/row/row0quiesce.cc
@@ -525,17 +525,27 @@ row_quiesce_table_start(
 	}
 
 	for (ulint count = 0;
-	     ibuf_merge_space(table->space_id) != 0
-	     && !trx_is_interrupted(trx);
+	     ibuf_merge_space(table->space_id);
 	     ++count) {
+		if (trx_is_interrupted(trx)) {
+			goto aborted;
+		}
 		if (!(count % 20)) {
 			ib::info() << "Merging change buffer entries for "
 				<< table->name;
 		}
 	}
 
+	while (buf_flush_dirty_pages(table->space_id)) {
+		if (trx_is_interrupted(trx)) {
+			goto aborted;
+		}
+	}
+
 	if (!trx_is_interrupted(trx)) {
-		buf_LRU_flush_or_remove_pages(table->space_id, true);
+		/* Ensure that all asynchronous IO is completed. */
+		os_aio_wait_until_no_pending_writes();
+		fil_flush(table->space_id);
 
 		if (row_quiesce_write_cfg(table, trx->mysql_thd)
 		    != DB_SUCCESS) {
@@ -546,6 +556,7 @@ row_quiesce_table_start(
 				<< " flushed to disk";
 		}
 	} else {
+aborted:
 		ib::warn() << "Quiesce aborted!";
 	}
 
diff --git a/storage/innobase/row/row0sel.cc b/storage/innobase/row/row0sel.cc
index 239544c453e..9af0738faad 100644
--- a/storage/innobase/row/row0sel.cc
+++ b/storage/innobase/row/row0sel.cc
@@ -121,7 +121,7 @@ row_sel_sec_rec_is_for_blob(
 		    field_ref_zero, BTR_EXTERN_FIELD_REF_SIZE)) {
 		/* The externally stored field was not written yet.
 		This record should only be seen by
-		recv_recovery_rollback_active() or any
+		trx_rollback_recovered() or any
 		TRX_ISO_READ_UNCOMMITTED transactions. */
 		return(FALSE);
 	}
@@ -528,7 +528,7 @@ row_sel_fetch_columns(
 				externally stored field was not
 				written yet. This record
 				should only be seen by
-				recv_recovery_rollback_active() or any
+				trx_rollback_recovered() or any
 				TRX_ISO_READ_UNCOMMITTED
 				transactions. The InnoDB SQL parser
 				(the sole caller of this function)
@@ -2891,7 +2891,7 @@ row_sel_store_mysql_field(
 		if (UNIV_UNLIKELY(!data)) {
 			/* The externally stored field was not written
 			yet. This record should only be seen by
-			recv_recovery_rollback_active() or any
+			trx_rollback_recovered() or any
 			TRX_ISO_READ_UNCOMMITTED transactions. */
 
 			if (heap != prebuilt->blob_heap) {
diff --git a/storage/innobase/row/row0undo.cc b/storage/innobase/row/row0undo.cc
index 2fe1135b894..82c999d8b53 100644
--- a/storage/innobase/row/row0undo.cc
+++ b/storage/innobase/row/row0undo.cc
@@ -466,13 +466,7 @@ row_undo_step(
 {
 	dberr_t		err;
 	undo_node_t*	node;
-	trx_t*		trx;
-
-	ut_ad(thr);
-
-	srv_inc_activity_count();
-
-	trx = thr_get_trx(thr);
+	trx_t*		trx = thr_get_trx(thr);
 
 	node = static_cast<undo_node_t*>(thr->run_node);
 
diff --git a/storage/innobase/row/row0upd.cc b/storage/innobase/row/row0upd.cc
index fa3fad5fe26..aee20477ac0 100644
--- a/storage/innobase/row/row0upd.cc
+++ b/storage/innobase/row/row0upd.cc
@@ -1539,7 +1539,7 @@ row_upd_changes_ord_field_binary_func(
 
 			/* The externally stored field was not written
 			yet. This record should only be seen by
-			recv_recovery_rollback_active(),
+			trx_rollback_recovered(),
 			when the server had crashed before
 			storing the field. */
 			ut_ad(thr->graph->trx->is_recovered);
diff --git a/storage/innobase/srv/srv0mon.cc b/storage/innobase/srv/srv0mon.cc
index ef16c453657..81ab97daac9 100644
--- a/storage/innobase/srv/srv0mon.cc
+++ b/storage/innobase/srv/srv0mon.cc
@@ -386,31 +386,16 @@ static monitor_info_t	innodb_counter_info[] =
 	 MONITOR_NONE,
 	 MONITOR_DEFAULT_START, MONITOR_FLUSH_ADAPTIVE_AVG_TIME_SLOT},
 
-	{"buffer_LRU_batch_flush_avg_time_slot", "buffer",
-	 "Avg time (ms) spent for LRU batch flushing recently per slot.",
-	 MONITOR_NONE,
-	 MONITOR_DEFAULT_START, MONITOR_LRU_BATCH_FLUSH_AVG_TIME_SLOT},
-
 	{"buffer_flush_adaptive_avg_time_thread", "buffer",
 	 "Avg time (ms) spent for adaptive flushing recently per thread.",
 	 MONITOR_NONE,
 	 MONITOR_DEFAULT_START, MONITOR_FLUSH_ADAPTIVE_AVG_TIME_THREAD},
 
-	{"buffer_LRU_batch_flush_avg_time_thread", "buffer",
-	 "Avg time (ms) spent for LRU batch flushing recently per thread.",
-	 MONITOR_NONE,
-	 MONITOR_DEFAULT_START, MONITOR_LRU_BATCH_FLUSH_AVG_TIME_THREAD},
-
 	{"buffer_flush_adaptive_avg_time_est", "buffer",
 	 "Estimated time (ms) spent for adaptive flushing recently.",
 	 MONITOR_NONE,
 	 MONITOR_DEFAULT_START, MONITOR_FLUSH_ADAPTIVE_AVG_TIME_EST},
 
-	{"buffer_LRU_batch_flush_avg_time_est", "buffer",
-	 "Estimated time (ms) spent for LRU batch flushing recently.",
-	 MONITOR_NONE,
-	 MONITOR_DEFAULT_START, MONITOR_LRU_BATCH_FLUSH_AVG_TIME_EST},
-
 	{"buffer_flush_avg_time", "buffer",
 	 "Avg time (ms) spent for flushing recently.",
 	 MONITOR_NONE,
@@ -421,11 +406,6 @@ static monitor_info_t	innodb_counter_info[] =
 	 MONITOR_NONE,
 	 MONITOR_DEFAULT_START, MONITOR_FLUSH_ADAPTIVE_AVG_PASS},
 
-	{"buffer_LRU_batch_flush_avg_pass", "buffer",
-	 "Number of LRU batch flushes passed during the recent Avg period.",
-	 MONITOR_NONE,
-	 MONITOR_DEFAULT_START, MONITOR_LRU_BATCH_FLUSH_AVG_PASS},
-
 	{"buffer_flush_avg_pass", "buffer",
 	 "Number of flushes passed during the recent Avg period.",
 	 MONITOR_NONE,
@@ -562,23 +542,6 @@ static monitor_info_t	innodb_counter_info[] =
 	 MONITOR_SET_MEMBER, MONITOR_LRU_BATCH_EVICT_TOTAL_PAGE,
 	 MONITOR_LRU_BATCH_EVICT_PAGES},
 
-	/* Cumulative counter for single page LRU scans */
-	{"buffer_LRU_single_flush_scanned", "buffer",
-	 "Total pages scanned as part of single page LRU flush",
-	 MONITOR_SET_OWNER,
-	 MONITOR_LRU_SINGLE_FLUSH_SCANNED_NUM_CALL,
-	 MONITOR_LRU_SINGLE_FLUSH_SCANNED},
-
-	{"buffer_LRU_single_flush_num_scan", "buffer",
-	 "Number of times single page LRU flush is called",
-	 MONITOR_SET_MEMBER, MONITOR_LRU_SINGLE_FLUSH_SCANNED,
-	 MONITOR_LRU_SINGLE_FLUSH_SCANNED_NUM_CALL},
-
-	{"buffer_LRU_single_flush_scanned_per_call", "buffer",
-	 "Page scanned per single LRU flush",
-	 MONITOR_SET_MEMBER, MONITOR_LRU_SINGLE_FLUSH_SCANNED,
-	 MONITOR_LRU_SINGLE_FLUSH_SCANNED_PER_CALL},
-
 	{"buffer_LRU_single_flush_failure_count", "Buffer",
 	 "Number of times attempt to flush a single page from LRU failed",
 	 MONITOR_NONE,
@@ -1468,7 +1431,8 @@ srv_mon_set_module_control(
 	ibool		set_current_module = FALSE;
 
 	ut_a(module_id <= NUM_MONITOR);
-	ut_a(UT_ARR_SIZE(innodb_counter_info) == NUM_MONITOR);
+	compile_time_assert(array_elements(innodb_counter_info)
+			    == NUM_MONITOR);
 
 	/* The module_id must be an ID of MONITOR_MODULE type */
 	ut_a(innodb_counter_info[module_id].monitor_type & MONITOR_MODULE);
diff --git a/storage/innobase/srv/srv0srv.cc b/storage/innobase/srv/srv0srv.cc
index 144ea17ec0c..3303edf9272 100644
--- a/storage/innobase/srv/srv0srv.cc
+++ b/storage/innobase/srv/srv0srv.cc
@@ -342,11 +342,6 @@ my_bool	srv_stats_sample_traditional;
 
 my_bool	srv_use_doublewrite_buf;
 
-/** innodb_doublewrite_batch_size (a debug parameter) specifies the
-number of pages to use in LRU and flush_list batch flushing.
-The rest of the doublewrite buffer is used for single-page flushing. */
-ulong	srv_doublewrite_batch_size = 120;
-
 /** innodb_sync_spin_loops */
 ulong	srv_n_spin_wait_rounds;
 /** innodb_spin_wait_delay */
@@ -707,9 +702,6 @@ static void srv_init()
 
 	if (!srv_read_only_mode) {
 		mutex_create(LATCH_ID_SRV_SYS_TASKS, &srv_sys.tasks_mutex);
-
-		buf_flush_event = os_event_create("buf_flush_event");
-
 		UT_LIST_INIT(srv_sys.tasks, &que_thr_t::queue);
 	}
 
@@ -755,7 +747,6 @@ srv_free(void)
 
 	if (!srv_read_only_mode) {
 		mutex_free(&srv_sys.tasks_mutex);
-		os_event_destroy(buf_flush_event);
 	}
 
 	ut_d(os_event_destroy(srv_master_thread_disabled_event));
@@ -1119,9 +1110,6 @@ srv_export_innodb_status(void)
 	export_vars.innodb_buffer_pool_wait_free =
 		srv_stats.buf_pool_wait_free;
 
-	export_vars.innodb_buffer_pool_pages_flushed =
-		srv_stats.buf_pool_flushed;
-
 	export_vars.innodb_buffer_pool_reads = srv_stats.buf_pool_reads;
 
 	export_vars.innodb_buffer_pool_read_ahead_rnd =
@@ -1871,10 +1859,6 @@ srv_master_do_idle_tasks(void)
 	log_checkpoint();
 	MONITOR_INC_TIME_IN_MICRO_SECS(MONITOR_SRV_CHECKPOINT_MICROSECOND,
 				       counter_time);
-
-	/* This is a workaround to avoid the InnoDB hang when OS datetime
-	changed backwards.*/
-	os_event_set(buf_flush_event);
 }
 
 /**
diff --git a/storage/innobase/srv/srv0start.cc b/storage/innobase/srv/srv0start.cc
index 9a0f9d04149..dba660ee13f 100644
--- a/storage/innobase/srv/srv0start.cc
+++ b/storage/innobase/srv/srv0start.cc
@@ -140,30 +140,8 @@ UNIV_INTERN uint	srv_sys_space_size_debug;
 UNIV_INTERN bool	srv_log_file_created;
 #endif /* UNIV_DEBUG */
 
-/** Bit flags for tracking background thread creation. They are used to
-determine which threads need to be stopped if we need to abort during
-the initialisation step. */
-enum srv_start_state_t {
-	/** No thread started */
-	SRV_START_STATE_NONE = 0,		/*!< No thread started */
-	/** lock_wait_timeout timer task started */
-	SRV_START_STATE_LOCK_SYS = 1,
-	/** buf_flush_page_cleaner_coordinator,
-	buf_flush_page_cleaner_worker started */
-	SRV_START_STATE_IO = 2,
-	/** srv_error_monitor_thread, srv_print_monitor_task started */
-	SRV_START_STATE_MONITOR = 4,
-	/** srv_master_thread started */
-	SRV_START_STATE_MASTER = 8,
-	/** srv_purge_coordinator_thread, srv_worker_thread started */
-	SRV_START_STATE_PURGE = 16,
-	/** fil_crypt_thread,
-	(all background threads that can generate redo log but not undo log */
-	SRV_START_STATE_REDO = 32
-};
-
-/** Track server thrd starting phases */
-static ulint	srv_start_state;
+/** whether some background threads that create redo log have been started */
+static bool srv_started_redo;
 
 /** At a shutdown this value climbs from SRV_SHUTDOWN_NONE to
 SRV_SHUTDOWN_CLEANUP and then to SRV_SHUTDOWN_LAST_PHASE, and so on */
@@ -844,30 +822,6 @@ srv_open_tmp_tablespace(bool create_new_db)
 	return(err);
 }
 
-/****************************************************************//**
-Set state to indicate start of particular group of threads in InnoDB. */
-UNIV_INLINE
-void
-srv_start_state_set(
-/*================*/
-	srv_start_state_t state)	/*!< in: indicate current state of
-					thread startup */
-{
-	srv_start_state |= ulint(state);
-}
-
-/****************************************************************//**
-Check if following group of threads is started.
-@return true if started */
-UNIV_INLINE
-bool
-srv_start_state_is_set(
-/*===================*/
-	srv_start_state_t state)	/*!< in: state to check for */
-{
-	return(srv_start_state & ulint(state));
-}
-
 /** Shutdown all background threads created by InnoDB. */
 static
@@ -899,11 +853,11 @@ srv_shutdown_all_bg_threads()
 		}
 	}
 
-	if (srv_start_state_is_set(SRV_START_STATE_IO)) {
+	if (buf_page_cleaner_is_active) {
 		ut_ad(!srv_read_only_mode);
-		/* e.
Exit the i/o threads */ - os_event_set(buf_flush_event); + /* e. Exit the buf_flush_page_cleaner */ + mysql_cond_signal(&buf_pool.do_flush_list); } if (!os_thread_count) { @@ -1145,8 +1099,7 @@ dberr_t srv_start(bool create_new_db) || srv_force_recovery > SRV_FORCE_NO_IBUF_MERGE || srv_sys_space.created_new_raw(); - /* Reset the start state. */ - srv_start_state = SRV_START_STATE_NONE; + srv_started_redo = false; compile_time_assert(sizeof(ulint) == sizeof(void*)); @@ -1328,7 +1281,7 @@ dberr_t srv_start(bool create_new_db) if (!srv_read_only_mode) { buf_flush_page_cleaner_init(); - srv_start_state_set(SRV_START_STATE_IO); + ut_ad(buf_page_cleaner_is_active); } srv_startup_is_before_trx_rollback_phase = !create_new_db; @@ -1377,11 +1330,10 @@ dberr_t srv_start(bool create_new_db) std::string logfile0; if (create_new_db) { - + flushed_lsn = log_sys.get_lsn(); + log_sys.set_flushed_lsn(flushed_lsn); buf_flush_sync(); - flushed_lsn = log_get_lsn(); - err = create_log_file(flushed_lsn, logfile0); if (err != DB_SUCCESS) { @@ -1652,10 +1604,7 @@ file_checked: } } - /* recv_recovery_from_checkpoint_finish needs trx lists which - are initialized in trx_lists_init_at_db_start(). 
*/ - - recv_recovery_from_checkpoint_finish(); + recv_sys.debug_free(); if (srv_operation == SRV_OPERATION_RESTORE || srv_operation == SRV_OPERATION_RESTORE_EXPORT) { @@ -1757,7 +1706,7 @@ file_checked: /* Create the doublewrite buffer to a new tablespace */ if (!srv_read_only_mode && srv_force_recovery < SRV_FORCE_NO_TRX_UNDO - && !buf_dblwr_create()) { + && !buf_dblwr.create()) { return(srv_init_abort(DB_ERROR)); } @@ -1896,9 +1845,6 @@ file_checked: srv_start_periodic_timer(srv_error_monitor_timer, srv_error_monitor_task, 1000); srv_start_periodic_timer(srv_monitor_timer, srv_monitor_task, 5000); - srv_start_state |= SRV_START_STATE_LOCK_SYS - | SRV_START_STATE_MONITOR; - #ifndef DBUG_OFF skip_monitors: #endif @@ -1957,7 +1903,6 @@ skip_monitors: srv_init_purge_tasks(); purge_sys.coordinator_startup(); srv_wake_purge_thread_if_not_active(); - srv_start_state_set(SRV_START_STATE_PURGE); } srv_is_being_started = false; @@ -2016,7 +1961,7 @@ skip_monitors: /* Initialize online defragmentation. 
*/ btr_defragment_init(); - srv_start_state |= SRV_START_STATE_REDO; + srv_started_redo = true; } return(DB_SUCCESS); @@ -2103,7 +2048,8 @@ void innodb_shutdown() ut_ad(dict_sys.is_initialised() || !srv_was_started); ut_ad(trx_sys.is_initialised() || !srv_was_started); - ut_ad(buf_dblwr || !srv_was_started || srv_read_only_mode + ut_ad(buf_dblwr.is_initialised() || !srv_was_started + || srv_read_only_mode || srv_force_recovery >= SRV_FORCE_NO_TRX_UNDO); ut_ad(lock_sys.is_initialised() || !srv_was_started); ut_ad(log_sys.is_initialised() || !srv_was_started); @@ -2111,7 +2057,7 @@ void innodb_shutdown() dict_stats_deinit(); - if (srv_start_state_is_set(SRV_START_STATE_REDO)) { + if (srv_started_redo) { ut_ad(!srv_read_only_mode); /* srv_shutdown_bg_undo_sources() already invoked fts_optimize_shutdown(); dict_stats_shutdown(); */ @@ -2132,9 +2078,7 @@ void innodb_shutdown() log_sys.close(); purge_sys.close(); trx_sys.close(); - if (buf_dblwr) { - buf_dblwr_free(); - } + buf_dblwr.close(); lock_sys.close(); trx_pool_close(); @@ -2161,7 +2105,7 @@ void innodb_shutdown() << "; transaction id " << trx_sys.get_max_trx_id(); } srv_thread_pool_end(); - srv_start_state = SRV_START_STATE_NONE; + srv_started_redo = false; srv_was_started = false; srv_start_has_been_called = false; } diff --git a/storage/innobase/sync/sync0debug.cc b/storage/innobase/sync/sync0debug.cc index 78e613b52f0..11038c6020d 100644 --- a/storage/innobase/sync/sync0debug.cc +++ b/storage/innobase/sync/sync0debug.cc @@ -453,9 +453,6 @@ LatchDebug::LatchDebug() LEVEL_MAP_INSERT(RW_LOCK_X); LEVEL_MAP_INSERT(RW_LOCK_NOT_LOCKED); LEVEL_MAP_INSERT(SYNC_ANY_LATCH); - LEVEL_MAP_INSERT(SYNC_DOUBLEWRITE); - LEVEL_MAP_INSERT(SYNC_BUF_FLUSH_LIST); - LEVEL_MAP_INSERT(SYNC_BUF_POOL); LEVEL_MAP_INSERT(SYNC_POOL); LEVEL_MAP_INSERT(SYNC_POOL_MANAGER); LEVEL_MAP_INSERT(SYNC_SEARCH_SYS); @@ -467,7 +464,6 @@ LatchDebug::LatchDebug() LEVEL_MAP_INSERT(SYNC_RECV); LEVEL_MAP_INSERT(SYNC_LOG_FLUSH_ORDER); 
 	LEVEL_MAP_INSERT(SYNC_LOG);
-	LEVEL_MAP_INSERT(SYNC_PAGE_CLEANER);
 	LEVEL_MAP_INSERT(SYNC_PURGE_QUEUE);
 	LEVEL_MAP_INSERT(SYNC_TRX_SYS_HEADER);
 	LEVEL_MAP_INSERT(SYNC_REC_LOCK);
@@ -741,10 +737,8 @@ LatchDebug::check_order(
 	case SYNC_FTS_OPTIMIZE:
 	case SYNC_FTS_CACHE:
 	case SYNC_FTS_CACHE_INIT:
-	case SYNC_PAGE_CLEANER:
 	case SYNC_LOG:
 	case SYNC_LOG_FLUSH_ORDER:
-	case SYNC_DOUBLEWRITE:
 	case SYNC_SEARCH_SYS:
 	case SYNC_LOCK_SYS:
 	case SYNC_LOCK_WAIT_SYS:
@@ -802,15 +796,6 @@ LatchDebug::check_order(
 		}
 		break;
 
-	case SYNC_BUF_FLUSH_LIST:
-	case SYNC_BUF_POOL:
-
-		/* We can have multiple mutexes of this type therefore we
-		can only check whether the greater than condition holds. */
-
-		basic_check(latches, level, level - 1);
-		break;
-
 	case SYNC_REC_LOCK:
 
 		if (find(latches, SYNC_LOCK_SYS) != 0) {
@@ -1169,8 +1154,7 @@ sync_check_iterate(const sync_check_functor_t& functor)
 Note: We don't enforce any synchronisation checks. The caller must ensure
 that no races can occur */
-void
-sync_check_enable()
+static void sync_check_enable()
 {
 	if (!srv_sync_debug) {
 
@@ -1243,8 +1227,6 @@ sync_latch_meta_init()
 	/* The latches should be ordered on latch_id_t.
 	So that we can index directly into the vector to
 	update and fetch meta-data. */
-	LATCH_ADD_MUTEX(BUF_POOL, SYNC_BUF_POOL, buf_pool_mutex_key);
-
 	LATCH_ADD_MUTEX(DICT_FOREIGN_ERR, SYNC_NO_ORDER_CHECK,
 			dict_foreign_err_mutex_key);
@@ -1252,8 +1234,6 @@ sync_latch_meta_init()
 	LATCH_ADD_MUTEX(FIL_SYSTEM, SYNC_ANY_LATCH, fil_system_mutex_key);
 
-	LATCH_ADD_MUTEX(FLUSH_LIST, SYNC_BUF_FLUSH_LIST, flush_list_mutex_key);
-
 	LATCH_ADD_MUTEX(FTS_BG_THREADS, SYNC_FTS_BG_THREADS,
			fts_bg_threads_mutex_key);
@@ -1277,9 +1257,6 @@ sync_latch_meta_init()
 	LATCH_ADD_MUTEX(LOG_FLUSH_ORDER, SYNC_LOG_FLUSH_ORDER,
 			log_flush_order_mutex_key);
 
-	LATCH_ADD_MUTEX(PAGE_CLEANER, SYNC_PAGE_CLEANER,
-			page_cleaner_mutex_key);
-
 	LATCH_ADD_MUTEX(PURGE_SYS_PQ, SYNC_PURGE_QUEUE,
 			purge_sys_pq_mutex_key);
@@ -1320,8 +1297,6 @@ sync_latch_meta_init()
 	LATCH_ADD_MUTEX(SRV_MONITOR_FILE, SYNC_NO_ORDER_CHECK,
 			srv_monitor_file_mutex_key);
 
-	LATCH_ADD_MUTEX(BUF_DBLWR, SYNC_DOUBLEWRITE, buf_dblwr_mutex_key);
-
 	LATCH_ADD_MUTEX(TRX_POOL, SYNC_POOL, trx_pool_mutex_key);
 
 	LATCH_ADD_MUTEX(TRX_POOL_MANAGER, SYNC_POOL_MANAGER,
@@ -1456,6 +1431,8 @@ sync_check_init()
 	ut_d(LatchDebug::init());
 
 	sync_array_init();
+
+	ut_d(sync_check_enable());
 }
 
 /** Free the InnoDB synchronization data structures. */
diff --git a/storage/innobase/sync/sync0sync.cc b/storage/innobase/sync/sync0sync.cc
index 6a40c90579d..f88a3945773 100644
--- a/storage/innobase/sync/sync0sync.cc
+++ b/storage/innobase/sync/sync0sync.cc
@@ -52,7 +52,6 @@ mysql_pfs_key_t	log_sys_mutex_key;
 mysql_pfs_key_t	log_cmdq_mutex_key;
 mysql_pfs_key_t	log_flush_order_mutex_key;
 mysql_pfs_key_t	recalc_pool_mutex_key;
-mysql_pfs_key_t	page_cleaner_mutex_key;
 mysql_pfs_key_t	purge_sys_pq_mutex_key;
 mysql_pfs_key_t	recv_sys_mutex_key;
 mysql_pfs_key_t	redo_rseg_mutex_key;
diff --git a/storage/innobase/trx/trx0purge.cc b/storage/innobase/trx/trx0purge.cc
index 393f044d23a..f9f564e1841 100644
--- a/storage/innobase/trx/trx0purge.cc
+++ b/storage/innobase/trx/trx0purge.cc
@@ -262,7 +262,7 @@ trx_purge_add_undo_to_history(const trx_t* trx, trx_undo_t*& undo, mtr_t* mtr)
 	or in trx_rollback_recovered() in slow shutdown.
 
 	Before any transaction-generating background threads or the
-	purge have been started, recv_recovery_rollback_active() can
+	purge have been started, we can
 	start transactions in row_merge_drop_temp_indexes() and
 	fts_drop_orphaned_tables(), and roll back recovered transactions.
@@ -680,7 +680,7 @@ not_free:
 	mini-transaction commit and the server was killed, then
 	discarding the to-be-trimmed pages without flushing would
 	break crash recovery. So, we cannot avoid the write. */
-	buf_LRU_flush_or_remove_pages(space.id, true);
+	while (buf_flush_dirty_pages(space.id));
 
 	log_free_check();
 buffer_flush_avg_time buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Avg time (ms) spent for flushing recently.
 buffer_flush_adaptive_avg_pass buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Number of adaptive flushes passed during the recent Avg period.
-buffer_LRU_batch_flush_avg_pass buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Number of LRU batch flushes passed during the recent Avg period.
 buffer_flush_avg_pass buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Number of flushes passed during the recent Avg period.
 buffer_LRU_get_free_loops buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Total loops in LRU get free.
 buffer_LRU_get_free_waits buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Total sleep waits in LRU get free.
@@ -106,9 +102,6 @@ buffer_LRU_batch_flush_pages buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL
 buffer_LRU_batch_evict_total_pages buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 set_owner Total pages evicted as part of LRU batches
 buffer_LRU_batches_evict buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 set_member Number of LRU batches
 buffer_LRU_batch_evict_pages buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 set_member Pages queued as an LRU batch
-buffer_LRU_single_flush_scanned buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 set_owner Total pages scanned as part of single page LRU flush
-buffer_LRU_single_flush_num_scan buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 set_member Number of times single page LRU flush is called
-buffer_LRU_single_flush_scanned_per_call buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 set_member Page scanned per single LRU flush
 buffer_LRU_single_flush_failure_count Buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Number of times attempt to flush a single page from LRU failed
 buffer_LRU_get_free_search Buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 counter Number of searches performed for a clean page
 buffer_LRU_search_scanned buffer 0 NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL 0 set_owner Total pages scanned as part of LRU search