delta/mariadb-git.git - github.com: MariaDB/server.git

	Commit message (Collapse)	Author	Age	Files	Lines
*	MDEV-24350 buf_dblwr unnecessarily uses memory-intensive srv_stats counters (bb-10.5-MDEV-24350)	Marko Mäkelä	2020-12-04	1	-7/+8
\|	The counters in srv_stats use std::atomic and multiple cache lines per counter. This is overkill where a critical section already exists in the code: a regular variable works just fine, with a much smaller impact on the memory bus.
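The idea can be sketched as follows. This is only an illustration under stated assumptions: `DoublewriteSketch` and its members are hypothetical names, not the actual buf_dblwr_t code. It shows a plain counter that piggybacks on a mutex the caller must hold anyway, instead of a separate atomic counter.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>

// Hypothetical stand-in for a doublewrite buffer; NOT the real buf_dblwr_t.
class DoublewriteSketch {
  std::mutex mutex_;               // the critical section that already exists
  std::size_t pages_written_ = 0;  // plain variable, protected by mutex_
  std::size_t batches_ = 0;        // plain variable, protected by mutex_

public:
  // Account for one flushed batch while the mutex is held anyway.
  void flush_batch(std::size_t n_pages) {
    std::lock_guard<std::mutex> guard(mutex_);
    // ... the actual batch work would happen here ...
    pages_written_ += n_pages;
    ++batches_;
  }

  // Readers take the same mutex, so no atomic operations are needed.
  std::size_t pages_written() {
    std::lock_guard<std::mutex> guard(mutex_);
    return pages_written_;
  }

  std::size_t batches() {
    std::lock_guard<std::mutex> guard(mutex_);
    return batches_;
  }
};
```

The counters cost one ordinary increment each, and they share whatever cache line the protected state already occupies.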
*	MDEV-24348 InnoDB shutdown hang with innodb_flush_sync=0	Marko Mäkelä	2020-12-04	1	-0/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This hang was caused by MDEV-23855, and we failed to fix it in MDEV-24109 (commit 4cbfdeca840098b9ed0d8147d43288c36743a328). When buf_flush_ahead() is invoked soon before server shutdown and the non-default setting innodb_flush_sync=OFF is in effect and the buffer pool contains dirty pages of temporary tables, the page cleaner thread may remain in an infinite loop without completing its work, thus causing the shutdown to hang. buf_flush_page_cleaner(): If the buffer pool contains no unmodified persistent pages, ensure that buf_flush_sync_lsn= 0 will be assigned, so that shutdown will proceed. The test case is not deterministic. On my system, it reproduced the hang with 95% probability when running multiple instances of the test in parallel, and 4% when running single-threaded. Thanks to Eugene Kosov for debugging and testing this.
*	MDEV-24308: Remove some os_thread_ functions	Marko Mäkelä	2020-11-30	1	-1/+1
\| \| \| \| \| \| \| \| \|	os_thread_pf(): Remove. os_thread_eq(), os_thread_yield(), os_thread_get_curr_id(): Define as macros. ut_print_timestamp(), ut_sprintf_timestamp(): Simplify.
*	MDEV-24280 InnoDB triggers too many independent periodic tasks (bb-10.5-MDEV-24280)	Marko Mäkelä	2020-11-25	1	-4/+3
\|	A side effect of MDEV-16264 is that a large number of threads will be created at server startup, to be destroyed after a minute or two. One source of such thread creation is srv_start_periodic_timer(). InnoDB is creating 3 periodic tasks: srv_master_callback (1Hz), srv_error_monitor_task (1Hz), and srv_monitor_task (0.2Hz). It appears that we can merge srv_error_monitor_task and srv_monitor_task and have them invoked 4 times per minute (every 15 seconds). This will affect our ability to enforce innodb_fatal_semaphore_wait_threshold and some computations around BUF_LRU_STAT_N_INTERVAL. We could remove srv_master_callback along with the DROP TABLE queue at some point in the future. We must keep it independent of the innodb_fatal_semaphore_wait_threshold detection, because the background DROP TABLE queue could get stuck due to dict_sys being locked by another thread. For now, srv_master_callback must be invoked once per second, so that innodb_flush_log_at_timeout=1 can work. BUF_LRU_STAT_N_INTERVAL: Reduce the precision and extend the time from 50*1 second to 4*15 seconds. srv_error_monitor_timer: Remove. MAX_MUTEX_NOWAIT: Increase from 20*1 second to 2*15 seconds. srv_refresh_innodb_monitor_stats(): Avoid a repeated call to time(NULL). Change the interval to less than 60 seconds. srv_monitor(): Renamed from srv_monitor_task. srv_monitor_task(): Renamed from srv_error_monitor_task(). Invoked only once in 15 seconds. Invoke also srv_monitor(). Increase the fatal_cnt threshold from 10*1 second to 1*15 seconds. sync_array_print_long_waits_low(): Invoke time(NULL) only once. Remove a bogus message about printouts for 30 seconds. Those printouts were effectively already disabled in MDEV-16264 (commit 5e62b6a5e06eb02cbde1e34e95e26f42d87fce02).
*	MDEV-24278 InnoDB page cleaner keeps waking up on idle server	Marko Mäkelä	2020-11-25	1	-3/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The purpose of the InnoDB page cleaner subsystem is to write out modified pages from the buffer pool to data files. When the innodb_max_dirty_pages_pct_lwm is not exceeded or innodb_adaptive_flushing=ON decides not to write out anything, the page cleaner should keep sleeping indefinitely until the state of the system changes: a dirty page is added to the buffer pool such that the page cleaner would no longer be idle. buf_flush_page_cleaner(): Explicitly note when the page cleaner is idle. When that happens, use mysql_cond_wait() instead of mysql_cond_timedwait(). buf_flush_insert_into_flush_list(): Wake up the page cleaner if needed. innodb_max_dirty_pages_pct_update(), innodb_max_dirty_pages_pct_lwm_update(): Wake up the page cleaner just in case. Note: buf_flush_ahead(), buf_flush_wait_flushed() and shutdown are already waking up the page cleaner thread.
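A minimal sketch of the idle-wait pattern described above, using std::condition_variable rather than the server's mysql_cond_wait() wrappers. `PageCleanerSketch` and all of its members are hypothetical and only illustrate the technique: the worker sleeps without a timeout while there is nothing to do, and the code that adds a dirty page wakes it up only when it may be idle.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>

// Hypothetical illustration; NOT the real buf_flush_page_cleaner().
class PageCleanerSketch {
  std::mutex mutex_;
  std::condition_variable wake_;
  std::size_t dirty_ = 0;    // pages waiting to be written
  std::size_t flushed_ = 0;  // pages "written" so far
  bool shutdown_ = false;
  std::thread thread_;       // declared last: starts after the state above

  void run() {
    std::unique_lock<std::mutex> lock(mutex_);
    for (;;) {
      // Idle: wait with no timeout until work arrives or shutdown starts.
      wake_.wait(lock, [this] { return dirty_ != 0 || shutdown_; });
      flushed_ += dirty_;  // pretend to flush everything that is dirty
      dirty_ = 0;
      if (shutdown_) return;
    }
  }

public:
  PageCleanerSketch() : thread_([this] { run(); }) {}
  ~PageCleanerSketch() { shutdown(); }

  // Counterpart of "wake up the page cleaner if needed".
  void add_dirty_page() {
    bool was_idle;
    {
      std::lock_guard<std::mutex> guard(mutex_);
      was_idle = dirty_ == 0;
      ++dirty_;
    }
    if (was_idle) wake_.notify_one();
  }

  // Stops the worker and returns the number of pages it flushed.
  std::size_t shutdown() {
    if (thread_.joinable()) {
      {
        std::lock_guard<std::mutex> guard(mutex_);
        shutdown_ = true;
      }
      wake_.notify_one();
      thread_.join();
    }
    return flushed_;
  }
};
```

Because the wait uses a predicate that is checked under the mutex, a wake-up that arrives before the worker starts waiting is not lost.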
*	Partially Revert "MDEV-24270: Collect multiple completed events at a time"	Vladislav Vaintroub	2020-11-25	1	-1/+1
\| \| \| \| \| \| \|	This partially reverts commit 6479006e14691ff85072d06682f81b90875e9cb0. Remove the constant tpool::aio::N_PENDING, which has no intrinsic meaning for the tpool.
*	MDEV-24270: Collect multiple completed events at a time	Marko Mäkelä	2020-11-25	1	-1/+1
\| \| \| \| \| \| \|	tpool::aio::N_PENDING: Replaces OS_AIO_N_PENDING_IOS_PER_THREAD. This limits two similar things: the number of outstanding requests that a thread may io_submit(), and the number of completed requests collected at a time by io_getevents().
*	MDEV-24271 rw_lock::read_lock_yield() may cause writer starvation	Marko Mäkelä	2020-11-24	1	-12/+4
\| \| \| \| \| \| \|	The greedy fetch_add(1) approach of read_trylock() may cause starvation of a waiting write lock request. Let us use a compare-and-swap for the read lock acquisition in order to guarantee the progress of writers.
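A rough sketch of the compare-and-swap approach follows. The class name, the `WRITER_PENDING` flag and the word layout are assumptions made for illustration and are not taken from the real rw_lock. The point is that a reader never modifies the lock word once a writer has announced itself, whereas a greedy fetch_add(1) would keep bumping it.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical lock word layout: top bit = writer waiting or holding,
// remaining bits = number of readers. NOT the real rw_lock.
class RwLockSketch {
  static constexpr uint32_t WRITER_PENDING = 1U << 31;
  std::atomic<uint32_t> word_{0};

public:
  // Acquire a read lock only if no writer is pending.
  bool read_trylock() {
    uint32_t l = word_.load(std::memory_order_relaxed);
    while (!(l & WRITER_PENDING)) {
      // On failure, compare_exchange_weak reloads l, and we recheck the flag.
      if (word_.compare_exchange_weak(l, l + 1, std::memory_order_acquire,
                                      std::memory_order_relaxed))
        return true;
    }
    return false;  // a writer is pending: leave the word untouched
  }

  void read_unlock() { word_.fetch_sub(1, std::memory_order_release); }

  // A writer announces itself; new readers will back off from now on.
  void write_lock_request() {
    word_.fetch_or(WRITER_PENDING, std::memory_order_relaxed);
  }

  uint32_t readers() const {
    return word_.load(std::memory_order_relaxed) & ~WRITER_PENDING;
  }
};
```

Once the flag is set, the reader count can only decrease, so the waiting writer is guaranteed to see it reach zero eventually.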
*	MDEV-24167: Remove PFS instrumentation of buf_block_t	Marko Mäkelä	2020-11-20	1	-82/+0
\|	We always defined PFS_SKIP_BUFFER_MUTEX_RWLOCK, that is, the latches of the buffer pool blocks were never instrumented in PERFORMANCE_SCHEMA. For some reason, the debug_latch (which enforces proper usage of buffer-fixing in debug builds) was instrumented.
*	MDEV-24188 fixup: Simplify the wait loop	Marko Mäkelä	2020-11-17	1	-11/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Starting with commit 7cffb5f6e8a231a041152447be8980ce35d2c9b8 (MDEV-23399) the function buf_flush_page() will first acquire block->lock and only after that invoke set_io_fix(). Before that, it was possible to reach a livelock between buf_page_create() and buf_flush_page(). buf_page_create(): Directly try acquiring the exclusive page latch without checking whether the page is io-fixed or buffer-fixed. (As a matter of fact, the have_x_latch() check is not strictly necessary, because we still support recursive X-latches.) In case of a latch conflict, wait while allowing buf_page_write_complete() to acquire buf_pool.mutex and release the block->lock. An attempt to wait for exclusive block->lock while holding buf_pool.mutex would lead to a hang in the tests parts.part_supported_sql_func_innodb and stress.ddl_innodb, due to a deadlock between buf_page_write_complete() and buf_page_create(). Similarly, in case of an I/O fixed compressed-only ROW_FORMAT=COMPRESSED page, we will sleep before retrying. In both cases, we will sleep for 1ms or until a flush batch is completed.
*	MDEV-24188: Merge 10.4 into 10.5	Marko Mäkelä	2020-11-13	1	-10/+20
\|\
\| *	MDEV-24188: Merge 10.3 into 10.4	Marko Mäkelä	2020-11-13	1	-15/+20
\| \|\
\| \| *	MDEV-24188: Merge 10.2 into 10.3	Marko Mäkelä	2020-11-13	1	-25/+27
\| \| \|\
\| \| \| *	MDEV-24188 Hang in buf_page_create() after reusing a previously freed page	Marko Mäkelä	2020-11-13	1	-25/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The fix of MDEV-23456 (commit b1009ae5c16697d5eef443cc6a60a74301148c73) introduced a livelock between page flushing and a thread that is executing buf_page_create(). buf_page_create(): If the current mini-transaction is holding an exclusive latch on the page, do not attempt to acquire another one, and do not care about any I/O fix. mtr_t::have_x_latch(): Replaces mtr_t::get_fix_count(). dyn_buf_t::for_each_block(const Functor&) const: A new variant. rw_lock_own(): Add a const qualifier. Reviewed by: Thirunarayanan Balathandayuthapani
* \| \| \|	Merge 10.4 into 10.5	Marko Mäkelä	2020-11-13	1	-4/+4
\|\ \ \ \ \| \|/ / /
\| * \| \|	Merge 10.3 into 10.4	Marko Mäkelä	2020-11-12	1	-3/+3
\| \|\ \ \ \| \| \|/ /
\| \| * \|	Merge 10.2 into 10.3	Marko Mäkelä	2020-11-12	1	-3/+3
\| \| \|\ \ \| \| \| \|/
\| \| \| *	MDEV-24182 ibuf_merge_or_delete_for_page() contains dead code	Marko Mäkelä	2020-11-11	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The function ibuf_merge_or_delete_for_page() was always being invoked with update_ibuf_bitmap=true ever since commit cd623508dff53c210154392da6c0f65b7b6bcf4c fixed up something after MDEV-9566. Furthermore, the parameter page_size is never being passed as a null pointer, and therefore it should better be a reference to a constant object.
* \| \| \|	MDEV-24109 InnoDB hangs with innodb_flush_sync=OFF	Marko Mäkelä	2020-11-04	1	-44/+68
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	MDEV-23855 broke the handling of innodb_flush_sync=OFF. That parameter is supposed to limit the page write rate in case the log capacity is being exceeded and log checkpoints are needed. With this fix, the following should pass: ./mtr --mysqld=--loose-innodb-flush-sync=0 One of our best regression tests for page flushing is encryption.innochecksum. With innodb_page_size=16k and innodb_flush_sync=OFF it would likely hang without this fix. log_sys.last_checkpoint_lsn: Declare as Atomic_relaxed<lsn_t> so that we are allowed to read the value while not holding log_sys.mutex. buf_flush_wait_flushed(): Let the page cleaner perform the flushing also if innodb_flush_sync=OFF. After the page cleaner has completed, perform a checkpoint if it is needed, because buf_flush_sync_for_checkpoint() will not be run if innodb_flush_sync=OFF. buf_flush_ahead(): Simplify the condition. We do not really care whether buf_flush_page_cleaner() is running. buf_flush_page_cleaner(): Evaluate innodb_flush_sync at the low level. If innodb_flush_sync=OFF, rate-limit the batches to innodb_io_capacity_max pages per second. Reviewed by: Vladislav Vaintroub
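One way to picture the rate limit for innodb_flush_sync=OFF is a helper that computes how long to pause after a batch so that no more than innodb_io_capacity_max pages are written per second. The function below is a hypothetical sketch under that assumption; the actual buf_flush_page_cleaner() logic may be structured differently.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

using ms = std::chrono::milliseconds;

// Hypothetical helper: how long to sleep after a batch so that at most
// io_capacity_max pages are written per second.
ms throttle_delay(std::size_t pages_flushed, std::size_t io_capacity_max,
                  ms elapsed) {
  if (io_capacity_max == 0) return ms(0);  // no limit configured
  // Time budget that a batch of this size is allowed to consume.
  const ms budget(pages_flushed * 1000 / io_capacity_max);
  return elapsed >= budget ? ms(0) : budget - elapsed;
}
```

For example, a batch of 2000 pages with a limit of 2000 pages per second has a budget of one second; if it completed in 250 ms, the sketch would pause for the remaining 750 ms.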
* \| \| \|	MDEV-24101 innodb_random_read_ahead=ON causes hang on DDL or shutdown (mariadb-10.5.7)	Marko Mäkelä	2020-11-03	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	buf_read_ahead_random(): Do not leak a tablespace reference. The reference was already acquired in fil_space_t::get(), and we must only check that operations were not stopped. This error was introduced when commit 118e258aaac5da75a2ac4556201aaea3688fac67 merged n_pending_ios, n_pending_ops into a single n_pending. This was not noticed earlier, because innodb_random_read_ahead is OFF by default and our regression tests did not vary that parameter at all.
* \| \| \|	MDEV-24054 Assertion in_LRU_list failed in buf_flush_try_neighbors()	Marko Mäkelä	2020-10-30	1	-1/+2
\| \| \| \|	buf_flush_try_neighbors(): Before invoking buf_page_t::ready_for_flush(), check that the freshly looked up buf_pool.page_hash entry actually is a buffer page and not a buf_pool.watch[] sentinel for purge buffering. This race condition was introduced in MDEV-15053 (commit b1ab211dee599eabd9a5b886fafa3adea29ae041). It is rather hard to hit this bug, because buf_flush_check_neighbors() already checked the condition. The problem exists if buf_pool.watch_set() was invoked for a page in the range after the check in buf_flush_check_neighbors() had been finished.
* \| \| \|	Merge 10.4 into 10.5	Marko Mäkelä	2020-10-30	2	-10/+59
\|\ \ \ \ \| \|/ / /
\| * \| \|	Merge 10.3 into 10.4	Marko Mäkelä	2020-10-29	2	-74/+80
\| \|\ \ \ \| \| \|/ /
\| \| * \|	Merge 10.2 into 10.3	Marko Mäkelä	2020-10-28	2	-74/+80
\| \| \|\ \ \| \| \| \|/
\| \| \| *	MDEV-23693 Failing assertion: my_atomic_load32_explicit(&lock->lock_word, ↵	Thirunarayanan Balathandayuthapani	2020-10-27	2	-74/+80
\| \| \| \|	MY_MEMORY_ORDER_RELAXED) == X_LOCK_DECR InnoDB frees the block lock during buffer pool shrinking while another thread has yet to release it. While shrinking the buffer pool, InnoDB allows the page to be freed unless it is buffer-fixed. In some cases, InnoDB releases the latch after unfixing the block. Fix: - InnoDB should unfix the block only after releasing the latch. - Add more assertions to check the buffer fix while accessing the page. - Introduce a block_hint structure that stores a buf_block_t pointer and allows access to it only by passing a functor. The functor receives the original buf_block_t* pointer if it is still valid, or nullptr if the pointer has become stale. - Replace buf_block_is_uncompressed() with buf_pool_t::is_block_pointer(). This change is motivated by a change in mysql-5.7.32: mysql/mysql-server@46e60de444a8fbd876cc6778a7e64a1d3426a48d Bug #31036301 ASSERTION FAILURE: SYNC0RW.IC:429:LOCK->LOCK_WORD
* \| \| \|	MDEV-24053 MSAN use-of-uninitialized-value in ↵	Marko Mäkelä	2020-10-29	1	-1/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tpool::simulated_aio::simulated_aio_callback() Starting with commit ef3f71fa7435f092dfce36d606cf22332218dd8b MemorySanitizer would complain that we are writing uninitialized data via the doublewrite buffer. buf_dblwr_t::add_to_batch(): Zero out any unused part of the doublewrite buffer, for PAGE_COMPRESSED and ROW_FORMAT=COMPRESSED tables. Reviewed by: Eugene Kosov
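The padding step can be illustrated with a small helper. `copy_padded` is a hypothetical name, not a function in the server: it copies a possibly shorter page frame into a fixed-size slot and zeroes the remainder, so that no uninitialized bytes are ever handed to a write call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: copy frame into a fixed-size slot and zero the tail.
void copy_padded(unsigned char* slot, std::size_t slot_size,
                 const unsigned char* frame, std::size_t frame_size) {
  assert(frame_size <= slot_size);
  std::memcpy(slot, frame, frame_size);
  // Compressed pages may be shorter than the slot; clear the unused part.
  std::memset(slot + frame_size, 0, slot_size - frame_size);
}
```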
* \| \| \|	MDEV-23855: Use normal mutex for log_sys.mutex, log_sys.flush_order_mutex	Marko Mäkelä	2020-10-26	1	-18/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	With an unreasonably small innodb_log_file_size, the page cleaner thread would frequently acquire log_sys.flush_order_mutex and spend a significant portion of CPU time spinning on that mutex when determining the checkpoint LSN.
* \| \| \|	MDEV-23855: Implement asynchronous doublewrite	Marko Mäkelä	2020-10-26	2	-35/+53
\| \| \| \|	Synchronous writes and calls to fdatasync(), fsync() or FlushFileBuffers() would ruin performance. So, let us submit asynchronous writes for the doublewrite buffer. We submit a single request for the likely case that the two doublewrite buffers are contiguous in the system tablespace. buf_dblwr_t::flush_buffered_writes_completed(): The completion callback of buf_dblwr_t::flush_buffered_writes(). os_aio_wait_until_no_pending_writes(): Also wait for doublewrite batches. buf_dblwr_t::element::space: Remove. We can simply use element::request.node->space instead. Reviewed by: Vladislav Vaintroub
* \| \| \|	MDEV-23399 fixup: Interleaved doublewrite batches	Marko Mäkelä	2020-10-26	1	-45/+52
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Author: Vladislav Vaintroub
* \| \| \|	MDEV-16264 fixup: Clean up asynchronous I/O	Marko Mäkelä	2020-10-26	1	-5/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	os_aio_userdata_t: Remove. It was basically duplicating IORequest. buf_page_write_complete(): Take only IORequest as a parameter. os_aio_func(), pfs_os_aio_func(): Replaced with os_aio() that has no redundant parameters. There is only one caller, so there is no point to pass __FILE__, __LINE__ as a parameter.
* \| \| \|	MDEV-23855: Shrink fil_space_t	Marko Mäkelä	2020-10-26	5	-86/+52
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Merge n_pending_ios, n_pending_ops to std::atomic<uint32_t> n_pending. Change some more fil_space_t members to uint32_t to reduce the memory footprint. fil_space_t::add(), fil_ibd_create(): Attach the already opened handle to the tablespace, and enforce the fil_system.n_open limit. dict_boot(): Initialize fil_system.max_assigned_id. srv_boot(): Call srv_thread_pool_init() before anything else, so that files should be opened in the correct mode on Windows. fil_ibd_create(): Create the file in OS_FILE_AIO mode, just like fil_node_open_file_low() does it. dict_table_t::is_accessible(): Replaces fil_table_accessible(). Reviewed by: Vladislav Vaintroub
* \| \| \|	MDEV-23855: Remove fil_system.LRU and reduce fil_system.mutex contention	Marko Mäkelä	2020-10-26	5	-253/+320
\| \| \| \|	Also fixes MDEV-23929: innodb_flush_neighbors is not being ignored for system tablespace on SSD. When the maximum configured number of files is exceeded, InnoDB will close data files. We used to maintain a fil_system.LRU list and a counter fil_node_t::n_pending to achieve this, at the huge cost of multiple fil_system.mutex operations per I/O operation. fil_node_open_file_low(): Implement a FIFO replacement policy: The last opened file will be moved to the end of fil_system.space_list, and files will be closed from the start of the list. However, we will not move tablespaces in fil_system.space_list while i_s_tablespaces_encryption_fill_table() is executing (producing output for INFORMATION_SCHEMA.INNODB_TABLESPACES_ENCRYPTION) because it may cause information of some tablespaces to go missing. We also avoid this in mariabackup --backup because datafiles_iter_next() assumes that the ordering is not changed. IORequest: Fold more parameters to IORequest::type. fil_space_t::io(): Replaces fil_io(). fil_space_t::flush(): Replaces fil_flush(). OS_AIO_IBUF: Remove. We will always issue synchronous reads of the change buffer pages in buf_read_page_low(). We will always ignore some errors for background reads. This should reduce fil_system.mutex contention a little. fil_node_t::complete_write(): Replaces fil_node_t::complete_io(). On both read and write completion, fil_space_t::release_for_io() will have to be called.
fil_space_t::io(): Do not acquire fil_system.mutex in the normal code path. xb_delta_open_matching_space(): Do not try to open the system tablespace which was already opened. This fixes a file sharing violation in mariabackup --prepare --incremental. Reviewed by: Vladislav Vaintroub
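The FIFO policy described in this commit can be sketched with a plain list. `OpenFileFifo` is a hypothetical class, and the real code operates on fil_system.space_list with further constraints that are omitted here. Unlike an LRU list, nothing is reordered when an already open file is used again, so ordinary I/O does not have to touch the list at all.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <vector>

// Hypothetical FIFO of open data files; NOT the real fil_system.
class OpenFileFifo {
  std::size_t limit_;            // maximum number of simultaneously open files
  std::list<std::string> open_;  // front = opened longest ago

public:
  explicit OpenFileFifo(std::size_t limit) : limit_(limit) {}

  bool is_open(const std::string& name) const {
    return std::find(open_.begin(), open_.end(), name) != open_.end();
  }

  // Opens a file and returns the names of the files closed to make room.
  std::vector<std::string> open(const std::string& name) {
    std::vector<std::string> closed;
    if (is_open(name)) return closed;  // FIFO: no reordering on reuse
    while (!open_.empty() && open_.size() >= limit_) {
      closed.push_back(open_.front());  // close from the start of the list
      open_.pop_front();
    }
    open_.push_back(name);  // the last opened file goes to the end
    return closed;
  }
};
```

With a limit of two files, opening a, b, a, c closes a when c is opened, even though a was the most recently used of the two.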
* \| \| \|	MDEV-23855: Improve InnoDB log checkpoint performance	Marko Mäkelä	2020-10-26	4	-350/+497
\| \| \| \|	After MDEV-15053, MDEV-22871, MDEV-23399 shifted the scalability bottleneck, log checkpoints became a new bottleneck. If innodb_io_capacity is set low or innodb_max_dirty_pages_pct_lwm is set high and the workload fits in the buffer pool, the page cleaner thread will perform very little flushing. When we reach the capacity of the circular redo log file ib_logfile0 and must initiate a checkpoint, some 'furious flushing' will be necessary. (If innodb_flush_sync=OFF, then flushing would continue at the innodb_io_capacity rate, and writers would be throttled.) We have the best chance of advancing the checkpoint LSN immediately after a page flush batch has been completed. Hence, it is best to perform checkpoints after every batch in the page cleaner thread, attempting to run once per second. By initiating high-priority flushing in the page cleaner as early as possible, we aim to make the throughput more stable. The function buf_flush_wait_flushed() used to sleep for 10ms, hoping that the page cleaner thread would do something during that time. The observed end result was that a large number of threads that call log_free_check() would end up sleeping while nothing useful is happening. We will revise the design so that in the default innodb_flush_sync=ON mode, buf_flush_wait_flushed() will wake up the page cleaner thread to perform the necessary flushing, and it will wait for a signal from the page cleaner thread.
If innodb_io_capacity is set to a low value (causing the page cleaner to throttle its work), a write workload would initially perform well, until the capacity of the circular ib_logfile0 is reached and log_free_check() will trigger checkpoints. At that point, the extra waiting in buf_flush_wait_flushed() will start reducing throughput. The page cleaner thread will also initiate log checkpoints after each buf_flush_lists() call, because that is the best point of time for the checkpoint LSN to advance by the maximum amount. Even in 'furious flushing' mode we invoke buf_flush_lists() with innodb_io_capacity_max pages at a time, and at the start of each batch (in the log_flush() callback function that runs in a separate task) we will invoke os_aio_wait_until_no_pending_writes(). This tweak allows the checkpoint to advance in smaller steps and significantly reduces the maximum latency. On an Intel Optane 960 NVMe SSD on Linux, it reduced from 4.6 seconds to 74 milliseconds. On Microsoft Windows with a slower SSD, it reduced from more than 180 seconds to 0.6 seconds. We will make innodb_adaptive_flushing=OFF simply flush innodb_io_capacity per second whenever the dirty proportion of buffer pool pages exceeds innodb_max_dirty_pages_pct_lwm. For innodb_adaptive_flushing=ON we try to make page_cleaner_flush_pages_recommendation() more consistent and predictable: if we are below innodb_adaptive_flushing_lwm, let us flush pages according to the return value of af_get_pct_for_dirty(). innodb_max_dirty_pages_pct_lwm: Revert the change of the default value that was made in MDEV-23399. The value innodb_max_dirty_pages_pct_lwm=0 guarantees that a shutdown of an idle server will be fast. Users might be surprised if normal shutdown suddenly became slower when upgrading within a GA release series. innodb_checkpoint_usec: Remove. The master task will no longer perform periodic log checkpoints. It is the duty of the page cleaner thread. log_sys.max_modified_age: Remove. 
The current span of the buf_pool.flush_list expressed in LSN only matters for adaptive flushing (outside the 'furious flushing' condition). For the correctness of checkpoints, the only thing that matters is the checkpoint age (log_sys.lsn - log_sys.last_checkpoint_lsn). This run-time constant was also reported as log_max_modified_age_sync. log_sys.max_checkpoint_age_async: Remove. This does not serve any purpose, because the checkpoints will now be triggered by the page cleaner thread. We will retain the log_sys.max_checkpoint_age limit for engaging 'furious flushing'. page_cleaner.slot: Remove. It turns out that page_cleaner_slot.flush_list_time was duplicating page_cleaner.slot.flush_time and page_cleaner.slot.flush_list_pass was duplicating page_cleaner.flush_pass. Likewise, there were some redundant monitor counters, because the page cleaner thread no longer performs any buf_pool.LRU flushing, and because there only is one buf_flush_page_cleaner thread. buf_flush_sync_lsn: Protect writes by buf_pool.flush_list_mutex. buf_pool_t::get_oldest_modification(): Add a parameter to specify the return value when no persistent data pages are dirty. Require the caller to hold buf_pool.flush_list_mutex. log_buf_pool_get_oldest_modification(): Take the fall-back LSN as a parameter. All callers will also invoke log_sys.get_lsn(). log_preflush_pool_modified_pages(): Replaced with buf_flush_wait_flushed(). buf_flush_wait_flushed(): Implement two limits. If not enough buffer pool has been flushed, signal the page cleaner (unless innodb_flush_sync=OFF) and wait for the page cleaner to complete. If the page cleaner thread is not running (which can be the case during shutdown), initiate the flush and wait for it directly. buf_flush_ahead(): If innodb_flush_sync=ON (the default), submit a new buf_flush_sync_lsn target for the page cleaner but do not wait for the flushing to finish.
log_get_capacity(), log_get_max_modified_age_async(): Remove, to make it easier to see that af_get_pct_for_lsn() is not acquiring any mutexes. page_cleaner_flush_pages_recommendation(): Protect all access to buf_pool.flush_list with buf_pool.flush_list_mutex. Previously there were some race conditions in the calculation. buf_flush_sync_for_checkpoint(): New function to process buf_flush_sync_lsn in the page cleaner thread. At the end of each batch, we try to wake up any blocked buf_flush_wait_flushed(). If everything up to buf_flush_sync_lsn has been flushed, we will reset buf_flush_sync_lsn=0. The page cleaner thread will keep 'furious flushing' until the limit is reached. Any threads that are waiting in buf_flush_wait_flushed() will be able to resume as soon as their own limit has been satisfied. buf_flush_page_cleaner: Prioritize buf_flush_sync_lsn and do not sleep as long as it is set. Do not update any page_cleaner statistics for this special mode of operation. In the normal mode (buf_flush_sync_lsn is not set for innodb_flush_sync=ON), try to wake up once per second. No longer check whether srv_inc_activity_count() has been called. After each batch, try to perform a log checkpoint, because the best chances for the checkpoint LSN to advance by the maximum amount are upon completing a flushing batch. log_t: Move buf_free, max_buf_free possibly to the same cache line with log_sys.mutex. log_margin_checkpoint_age(): Simplify the logic, and replace a 0.1-second sleep with a call to buf_flush_wait_flushed() to initiate flushing. Moved to the same compilation unit with the only caller. log_close(): Clean up the calculations. (Should be no functional change.) Return whether flush-ahead is needed. Moved to the same compilation unit with the only caller. mtr_t::finish_write(): Return whether flush-ahead is needed. mtr_t::commit(): Invoke buf_flush_ahead() when needed. 
Let us avoid external calls in mtr_t::commit() and make the logic easier to follow by having related code in a single compilation unit. Also, we will invoke srv_stats.log_write_requests.inc() only once per mini-transaction commit, while not holding mutexes. log_checkpoint_margin(): Only care about log_sys.max_checkpoint_age. Upon reaching log_sys.max_checkpoint_age where we must wait to prevent the log from getting corrupted, let us wait for at most 1MiB of LSN at a time, before rechecking the condition. This should allow writers to proceed even if the redo log capacity has been reached and 'furious flushing' is in progress. We no longer care about log_sys.max_modified_age_sync or log_sys.max_modified_age_async. The log_sys.max_modified_age_sync could be a relic from the time when there was a srv_master_thread that wrote dirty pages to data files. Also, we no longer have any log_sys.max_checkpoint_age_async limit, because log checkpoints will now be triggered by the page cleaner thread upon completing buf_flush_lists(). log_set_capacity(): Simplify the calculations of the limit (no functional change). log_checkpoint_low(): Split from log_checkpoint(). Moved to the same compilation unit with the caller. log_make_checkpoint(): Only wait for everything to be flushed until the current LSN. create_log_file(): After checkpoint, invoke log_write_up_to() to ensure that the FILE_CHECKPOINT record has been written. This avoids ut_ad(!srv_log_file_created) in create_log_file_rename(). srv_start(): Do not call recv_recovery_from_checkpoint_start() if the log has just been created. Set fil_system.space_id_reuse_warned before dict_boot() has been executed, and clear it after recovery has finished. dict_boot(): Initialize fil_system.max_assigned_id. srv_check_activity(): Remove. The activity count is counting transaction commits and therefore mostly interesting for the purge of history. 
BtrBulk::insert(): Do not explicitly wake up the page cleaner, but do invoke srv_inc_activity_count(), because that counter is still being used in buf_load_throttle_if_needed() for some heuristics. (It might be cleaner to execute buf_load() in the page cleaner thread!) Reviewed by: Vladislav Vaintroub
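The checkpoint-age arithmetic and the "at most 1MiB of LSN at a time" wait from this commit message can be sketched as below. The function names and the exact choice of the wait target are assumptions made for illustration; the real log_checkpoint_margin() may compute its target differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

using lsn_t = uint64_t;

// Step size from the commit message: wait for at most 1MiB of LSN at a time.
constexpr lsn_t WAIT_STEP = lsn_t{1} << 20;

// Checkpoint age: log_sys.lsn - log_sys.last_checkpoint_lsn.
lsn_t checkpoint_age(lsn_t lsn, lsn_t last_checkpoint_lsn) {
  return lsn - last_checkpoint_lsn;
}

// Hypothetical helper: returns 0 if no wait is needed; otherwise the LSN up
// to which pages should be flushed before the condition is rechecked.
lsn_t next_wait_target(lsn_t lsn, lsn_t last_checkpoint_lsn,
                       lsn_t max_checkpoint_age) {
  if (checkpoint_age(lsn, last_checkpoint_lsn) <= max_checkpoint_age)
    return 0;  // the log still has room; writers may proceed
  // Advance in small steps so that writers are not blocked for long.
  return std::min(lsn, last_checkpoint_lsn + WAIT_STEP);
}
```

Waiting in bounded steps means that a writer resumes as soon as the checkpoint has advanced a little, rather than after everything up to the current LSN has been flushed.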
* \| \| \|	MDEV-23399 fixup: Assertion bpage->in_file() failed	Marko Mäkelä	2020-10-26	1	-4/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	buf_flush_remove_pages(), buf_flush_dirty_pages(): Because buf_page_t::state() is protected by buf_pool.mutex, which we are not holding, the state may be BUF_BLOCK_REMOVE_HASH when the page is being relocated. Let us relax these assertions similar to buf_flush_validate_low(). The other in_file() assertions in buf0flu.cc look valid.
* \| \| \|	MDEV-23399 fixup: Avoid crash on Mariabackup shutdown	Marko Mäkelä	2020-10-26	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	innodb_preshutdown(): Terminate the encryption threads before the page cleaner thread can be shut down. innodb_shutdown(): Always wait for the encryption threads and page cleaner to shut down. srv_shutdown_all_bg_threads(): Wait for the encryption threads and the page cleaner to shut down. (After an aborted startup, innodb_shutdown() would not be called.) row_get_background_drop_list_len_low(): Remove. os_thread_count: Remove. Alternatively, at the end of srv_shutdown_all_bg_threads() we could try to wait longer for the count to reach 0. On some platforms, an assertion os_thread_count==0 could fail even after a small delay, even though in the core dump all threads would have exited. srv_shutdown_threads(): Renamed from srv_shutdown_all_bg_threads(). Do not wait for the page cleaner to shut down, because the later innodb_shutdown(), which may invoke logs_empty_and_mark_files_at_shutdown(), assumes that it exists.
* \| \| \|	Merge 10.4 to 10.5	Marko Mäkelä	2020-10-22	1	-1/+1
\|\ \ \ \ \| \|/ / /
\| * \| \|	Merge 10.3 into 10.4	Marko Mäkelä	2020-10-22	1	-1/+1
\| \|\ \ \ \| \| \|/ /
\| \| * \|	Merge 10.2 into 10.3	Marko Mäkelä	2020-10-22	1	-1/+1
\| \| \|\ \ \| \| \| \|/
\| \| \| *	MDEV-23960 UBSAN ../storage/innobase/buf/buf0buddy.cc:350:6: runtime error: ↵	Eugene Kosov	2020-10-14	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	index 4096 out of bounds for type 'byte [38]' Reviewed by: Marko Mäkelä
* \| \| \|	MDEV-23998 Race between buf_page_optimistic_get() and buf_page_t::init()	Marko Mäkelä	2020-10-21	1	-4/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	MDEV-22871 tried to optimize the buf_page_t initialization in buf_page_init_for_read() by initializing everything while the block is in freed state, and only afterwards attaching the block to buf_pool.page_hash. In an rr replay trace, we have multiple threads executing in buf_page_optimistic_get() on the same buf_block_t while the block is being freed and reallocated several times in buf_page_init_for_read(). Because also the buf_page_t::id() is changing, the buf_pool.page_hash is being protected by a different rw-lock than the one that buf_page_optimistic_get() are successfully read-locking. buf_page_optimistic_get(): Validate also buf_page_t::id() after acquiring the buf_pool.page_hash latch. Reviewed by: Thirunarayanan Balathandayuthapani
* \| \| \|	MDEV-23399 fixup: Remove double-free of a buffer page	Marko Mäkelä	2020-10-16	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In commit 7cffb5f6e8a231a041152447be8980ce35d2c9b8 we changed the interface of buf_page_create() so that the free_block is allocated by the caller. Both calls to buf_LRU_block_free_non_file_page() should have been removed. This caused an assertion failure 'block->page.state() == BUF_BLOCK_MEMORY' in buf_LRU_block_free_non_file_page(). The bug only affected ROW_FORMAT=COMPRESSED pages.
* \| \| \|	MDEV-23973 Change buffer corruption when reallocating an recently freed page	Marko Mäkelä	2020-10-16	1	-0/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	After commit abb678b61894146fcb88eed7f4a5facf434aea7c (a follow-up fix to MDEV-19514 to prevent potential hangs) and MDEV-23399, the probability for hitting a dormant bug that is related to MDEV-19514 was increased. buf_page_create(): Call ibuf_merge_or_delete_for_page() also when reusing a previously freed page. Reviewed by: Thirunarayanan Balathandayuthapani
* \| \| \|	Cleanup: Make InnoDB page numbers uint32_t	Marko Mäkelä	2020-10-15	5	-34/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	InnoDB stores a 32-bit page number in page headers and in some data structures, such as FIL_ADDR (consisting of a 32-bit page number and a 16-bit byte offset within a page). For better compile-time error detection and to reduce the memory footprint in some data structures, let us use a uint32_t for the page number, instead of ulint (size_t) which can be 64 bits.
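As a rough illustration of the memory-footprint argument above, the following sketch compares two address layouts. The struct and function names are hypothetical and are not the InnoDB definitions; only the field widths (a 32-bit page number and a 16-bit byte offset) come from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout using a size_t page number (64 bits on LP64 platforms).
struct addr_wide_sketch
{
  std::size_t page;
  std::uint16_t boffset;
};

// Hypothetical layout using a 32-bit page number, matching the on-disk width.
struct addr_narrow_sketch
{
  std::uint32_t page;
  std::uint16_t boffset;
};

// The narrow layout is never larger; on typical 64-bit platforms it is half the size.
static_assert(sizeof(addr_narrow_sketch) <= sizeof(addr_wide_sketch),
              "a 32-bit page number must not enlarge the structure");

// Hypothetical checked conversion from a wider integer to a page number.
std::uint32_t to_page_no_sketch(std::uint64_t v)
{
  assert(v <= UINT32_MAX);
  return static_cast<std::uint32_t>(v);
}
```

With a dedicated 32-bit type, brace-initializing a page number from a wider variable becomes a narrowing error at compile time, which is the kind of detection the commit message refers to.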
* \| \| \|	MDEV-19514 fixup: Simplify buf_page_read_complete()	Marko Mäkelä	2020-10-15	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	False positives for buf_page_t::ibuf_exist are acceptable, because it does not hurt to unnecessarily invoke ibuf_merge_or_delete_for_page(). Invoking buf_page_get_gen() in a read completion function is a definite no-no, because it could trigger a page flush or cause the server to run out of buffer pool. With some MDEV-23855 changes present, the test innodb.purge_secondary occasionally failed due to the table having been dropped while ibuf_page_exists() invoked buf_page_get_gen(). Reviewed by: Thirunarayanan Balathandayuthapani
* \| \| \|	MDEV-23399: Performance regression with write workloads	Marko Mäkelä	2020-10-15	7	-2824/+1417
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The buffer pool refactoring in MDEV-15053 and MDEV-22871 shifted the performance bottleneck to the page flushing. The configuration parameters will be changed as follows: innodb_lru_flush_size=32 (new: how many pages to flush on LRU eviction) innodb_lru_scan_depth=1536 (old: 1024) innodb_max_dirty_pages_pct=90 (old: 75) innodb_max_dirty_pages_pct_lwm=75 (old: 0) Note: The parameter innodb_lru_scan_depth will only affect LRU eviction of buffer pool pages when a new page is being allocated. The page cleaner thread will no longer evict any pages. It used to guarantee that some pages will remain free in the buffer pool. Now, we perform that eviction 'on demand' in buf_LRU_get_free_block(). 
The parameter innodb_lru_scan_depth(srv_LRU_scan_depth) is used as follows: * When the buffer pool is being shrunk in buf_pool_t::withdraw_blocks() * As a buf_pool.free limit in buf_LRU_list_batch() for terminating the flushing that is initiated e.g., by buf_LRU_get_free_block() The parameter also used to serve as an initial limit for unzip_LRU eviction (evicting uncompressed page frames while retaining ROW_FORMAT=COMPRESSED pages), but now we will use a hard-coded limit of 100 or unlimited for invoking buf_LRU_scan_and_free_block(). The status variables will be changed as follows: innodb_buffer_pool_pages_flushed: This includes also the count of innodb_buffer_pool_pages_LRU_flushed and should work reliably, updated one by one in buf_flush_page() to give more real-time statistics. The function buf_flush_stats(), which we are removing, was not called in every code path. For both counters, we will use regular variables that are incremented in a critical section of buf_pool.mutex. Note that show_innodb_vars() directly links to the variables, and reads of the counters will not be protected by buf_pool.mutex, so you cannot get a consistent snapshot of both variables. The following INFORMATION_SCHEMA.INNODB_METRICS counters will be removed, because the page cleaner no longer deals with writing or evicting least recently used pages, and because the single-page writes have been removed: * buffer_LRU_batch_flush_avg_time_slot * buffer_LRU_batch_flush_avg_time_thread * buffer_LRU_batch_flush_avg_time_est * buffer_LRU_batch_flush_avg_pass * buffer_LRU_single_flush_scanned * buffer_LRU_single_flush_num_scan * buffer_LRU_single_flush_scanned_per_call When moving to a single buffer pool instance in MDEV-15058, we missed some opportunity to simplify the buf_flush_page_cleaner thread. It was unnecessarily using a mutex and some complex data structures, even though we always have a single page cleaner thread. 
Furthermore, the buf_flush_page_cleaner thread had separate 'recovery' and 'shutdown' modes where it was waiting to be triggered by some other thread, adding unnecessary latency and potential for hangs in relatively rarely executed startup or shutdown code. The page cleaner was also running two kinds of batches in an interleaved fashion: "LRU flush" (writing out some least recently used pages and evicting them on write completion) and the normal batches that aim to increase the MIN(oldest_modification) in the buffer pool, to help the log checkpoint advance. The buf_pool.flush_list flushing was being blocked by buf_block_t::lock for no good reason. Furthermore, if the FIL_PAGE_LSN of a page is ahead of log_sys.get_flushed_lsn(), that is, what has been persistently written to the redo log, we would trigger a log flush and then resume the page flushing. This would unnecessarily limit the performance of the page cleaner thread and trigger the infamous messages "InnoDB: page_cleaner: 1000ms intended loop took 4450ms. The settings might not be optimal" that were suppressed in commit d1ab89037a518fcffbc50c24e4bd94e4ec33aed0 unless log_warnings>2. Our revised algorithm will make log_sys.get_flushed_lsn() advance at the start of buf_flush_lists(), and then execute a 'best effort' to write out all pages. The flush batches will skip pages that were modified since the log was written, or are currently exclusively locked. The MDEV-13670 message "page_cleaner: 1000ms intended loop took" will be removed, because by design, the buf_flush_page_cleaner() should not be blocked during a batch for extended periods of time. We will remove the single-page flushing altogether. Related to this, the debug parameter innodb_doublewrite_batch_size will be removed, because all of the doublewrite buffer will be used for flushing batches. 
If a page needs to be evicted from the buffer pool and all 100 least recently used pages in the buffer pool have unflushed changes, buf_LRU_get_free_block() will execute buf_flush_lists() to write out and evict innodb_lru_flush_size pages. At most one thread will execute buf_flush_lists() in buf_LRU_get_free_block(); other threads will wait for that LRU flushing batch to finish. To improve concurrency, we will replace the InnoDB ib_mutex_t and os_event_t native mutexes and condition variables in this area of code. Most notably, this means that the buffer pool mutex (buf_pool.mutex) is no longer instrumented via any InnoDB interfaces. It will continue to be instrumented via PERFORMANCE_SCHEMA. For now, both buf_pool.flush_list_mutex and buf_pool.mutex will be declared with MY_MUTEX_INIT_FAST (PTHREAD_MUTEX_ADAPTIVE_NP). The critical sections of buf_pool.flush_list_mutex should be shorter than those for buf_pool.mutex, because in the worst case, they cover a linear scan of buf_pool.flush_list, while the worst case of a critical section of buf_pool.mutex covers a linear scan of the potentially much longer buf_pool.LRU list. mysql_mutex_is_owner(), safe_mutex_is_owner(): New predicate, usable with SAFE_MUTEX. Some InnoDB debug assertions need this predicate instead of mysql_mutex_assert_owner() or mysql_mutex_assert_not_owner(). buf_pool_t::n_flush_LRU, buf_pool_t::n_flush_list: Replaces buf_pool_t::init_flush[] and buf_pool_t::n_flush[]. The number of active flush operations. buf_pool_t::mutex, buf_pool_t::flush_list_mutex: Use mysql_mutex_t instead of ib_mutex_t, to have native mutexes with PERFORMANCE_SCHEMA and SAFE_MUTEX instrumentation. buf_pool_t::done_flush_LRU: Condition variable for !n_flush_LRU. buf_pool_t::done_flush_list: Condition variable for !n_flush_list. buf_pool_t::do_flush_list: Condition variable to wake up the buf_flush_page_cleaner when a log checkpoint needs to be written or the server is being shut down. Replaces buf_flush_event. 
We will keep using timed waits (the page cleaner thread will wake _at least_ once per second), because the calculations for innodb_adaptive_flushing depend on fixed time intervals. buf_dblwr: Allocate statically, and move all code to member functions. Use a native mutex and condition variable. Remove code to deal with single-page flushing. buf_dblwr_check_block(): Make the check debug-only. We were spending a significant amount of execution time in page_simple_validate_new(). flush_counters_t::unzip_LRU_evicted: Remove. IORequest: Make more members const. FIXME: m_fil_node should be removed. buf_flush_sync_lsn: Protect by std::atomic, not page_cleaner.mutex (which we are removing). page_cleaner_slot_t, page_cleaner_t: Remove many redundant members. pc_request_flush_slot(): Replaces pc_request() and pc_flush_slot(). recv_writer_thread: Remove. Recovery works just fine without it, if we simply invoke buf_flush_sync() at the end of each batch in recv_sys_t::apply(). recv_recovery_from_checkpoint_finish(): Remove. We can simply call recv_sys.debug_free() directly. srv_started_redo: Replaces srv_start_state. SRV_SHUTDOWN_FLUSH_PHASE: Remove. logs_empty_and_mark_files_at_shutdown() can communicate with the normal page cleaner loop via the new function flush_buffer_pool(). buf_flush_remove(): Assert that the calling thread is holding buf_pool.flush_list_mutex. This removes unnecessary mutex operations from buf_flush_remove_pages() and buf_flush_dirty_pages(), which replace buf_LRU_flush_or_remove_pages(). buf_flush_lists(): Renamed from buf_flush_batch(), with simplified interface. Return the number of flushed pages. Clarified comments and renamed min_n to max_n. Identify LRU batch by lsn=0. Merge all the functions buf_flush_start(), buf_flush_batch(), buf_flush_end() directly to this function, which was their only caller, and remove 2 unnecessary buf_pool.mutex release/re-acquisition that we used to perform around the buf_flush_batch() call. 
At the start, if not all log has been durably written, wait for a background task to do it, or start a new task to do it. This allows the log write to run concurrently with our page flushing batch. Any pages that were skipped due to too recent FIL_PAGE_LSN or due to them being latched by a writer should be flushed during the next batch, unless there are further modifications to those pages. It is possible that a page that we must flush due to small oldest_modification also carries a recent FIL_PAGE_LSN or is being constantly modified. In the worst case, all writers would then end up waiting in log_free_check() to allow the flushing and the checkpoint to complete. buf_do_flush_list_batch(): Clarify comments, and rename min_n to max_n. Cache the last looked up tablespace. If neighbor flushing is not applicable, invoke buf_flush_page() directly, avoiding a page lookup in between. buf_flush_space(): Auxiliary function to look up a tablespace for page flushing. buf_flush_page(): Defer the computation of space->full_crc32(). Never call log_write_up_to(), but instead skip persistent pages whose latest modification (FIL_PAGE_LSN) is newer than the redo log. Also skip pages on which we cannot acquire a shared latch without waiting. buf_flush_try_neighbors(): Do not bother checking buf_fix_count because buf_flush_page() will no longer wait for the page latch. Take the tablespace as a parameter, and only execute this function when innodb_flush_neighbors>0. Avoid repeated calls of page_id_t::fold(). buf_flush_relocate_on_flush_list(): Declare as cold, and push down a condition from the callers. buf_flush_check_neighbor(): Take id.fold() as a parameter. buf_flush_sync(): Ensure that the buf_pool.flush_list is empty, because the flushing batch will skip pages whose modifications have not yet been written to the log or were latched for modification. buf_free_from_unzip_LRU_list_batch(): Remove redundant local variables. 
buf_flush_LRU_list_batch(): Let the caller buf_do_LRU_batch() initialize the counters, and report n->evicted. Cache the last looked up tablespace. If neighbor flushing is not applicable, invoke buf_flush_page() directly, avoiding a page lookup in between. buf_do_LRU_batch(): Return the number of pages flushed. buf_LRU_free_page(): Only release and re-acquire buf_pool.mutex if adaptive hash index entries are pointing to the block. buf_LRU_get_free_block(): Do not wake up the page cleaner, because it will no longer perform any useful work for us, and we do not want it to compete for I/O while buf_flush_lists(innodb_lru_flush_size, 0) writes out and evicts at most innodb_lru_flush_size pages. (The function buf_do_LRU_batch() may complete after writing fewer pages if more than innodb_lru_scan_depth pages end up in buf_pool.free list.) Eliminate some mutex release-acquire cycles, and wait for the LRU flush batch to complete before rescanning. buf_LRU_check_size_of_non_data_objects(): Simplify the code. buf_page_write_complete(): Remove the parameter evict, and always evict pages that were part of an LRU flush. buf_page_create(): Take a pre-allocated page as a parameter. buf_pool_t::free_block(): Free a pre-allocated block. recv_sys_t::recover_low(), recv_sys_t::apply(): Preallocate the block while not holding recv_sys.mutex. During page allocation, we may initiate a page flush, which in turn may initiate a log flush, which would require acquiring log_sys.mutex, which should always be acquired before recv_sys.mutex in order to avoid deadlocks. Therefore, we must not be holding recv_sys.mutex while allocating a buffer pool block. BtrBulk::logFreeCheck(): Skip a redundant condition. row_undo_step(): Do not invoke srv_inc_activity_count() for every row that is being rolled back. It should suffice to invoke the function in trx_flush_log_if_needed() during trx_t::commit_in_memory() when the rollback completes. sync_check_enable(): Remove. 
We will enable innodb_sync_debug from the very beginning. Reviewed by: Vladislav Vaintroub
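The 'best effort' skipping rule described in this commit message can be sketched as follows. This is a simplified model, not the InnoDB implementation: the type page_sketch, the function flush_batch_sketch and all field names are hypothetical. The two skip conditions (page modified after the durably written log position, or page exclusively latched) and the max_n limit are taken from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a dirty persistent page in the flush list.
struct page_sketch
{
  std::uint64_t page_lsn;   // latest modification (FIL_PAGE_LSN in the text)
  bool exclusively_latched; // a writer currently holds the page latch
  bool written;             // set once the sketch "writes" the page
};

// Writes at most max_n pages and returns how many were written.
// Pages that cannot be written right now are skipped, not waited for.
std::size_t flush_batch_sketch(std::vector<page_sketch>& dirty,
                               std::uint64_t flushed_lsn, std::size_t max_n)
{
  assert(max_n > 0);
  std::size_t n = 0;
  for (page_sketch& p : dirty)
  {
    if (n >= max_n)
      break;
    if (p.written)
      continue;
    if (p.exclusively_latched)
      continue; // skip: a writer holds the page latch
    if (p.page_lsn > flushed_lsn)
      continue; // skip: the log for this change is not yet durable
    p.written = true;
    ++n;
  }
  return n;
}
```

Skipped pages remain in the list, so a later batch picks them up once the log has advanced or the latch has been released.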
* \| \| \|	MDEV-23399: Remove buf_pool.flush_rbt	Marko Mäkelä	2020-10-15	2	-209/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Normally, buf_pool.flush_list must be sorted by buf_page_t::oldest_modification, so that log_checkpoint() can choose MIN(oldest_modification) as the checkpoint LSN. During recovery, buf_pool.flush_rbt used to guarantee the ordering. However, we can allow the buf_pool.flush_list to be in an arbitrary order during recovery, and simply ensure that it is in the correct order by the time a log checkpoint needs to be executed. recv_sys_t::apply(): To keep it simple, we will always flush the buffer pool at the end of each batch. Note that log_checkpoint() will invoke recv_sys_t::apply() in case a checkpoint is initiated during the last batch of recovery, when we already allow writes to data pages and the redo log. Reviewed by: Vladislav Vaintroub
* \| \| \|	MDEV-23399: Remove recv_writer_thread	Marko Mäkelä	2020-10-15	1	-33/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Recovery works just fine without a separate thread whose only task is to tell the page cleaner thread to do its job. recv_sys_t::apply(): Flush the buffer pool at the end of each batch. Reviewed by: Vladislav Vaintroub
* \| \| \|	MDEV-23399 preparation: Remove buf_pool.zip_clean	Marko Mäkelä	2020-10-15	4	-154/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The debug data structure may have been useful during the development of ROW_FORMAT=COMPRESSED page frames. Let us simplify code by removing it.
* \| \| \|	MDEV-23909 innodb_flush_neighbors=2 is treated like innodb_flush_neighbors=0	Marko Mäkelä	2020-10-08	1	-3/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In MDEV-15053 (commit b1ab211dee599eabd9a5b886fafa3adea29ae041) we inadvertently removed a check whether innodb_flush_neighbors is 0, and thus started treating only the value 1 in a special way. buf_flush_check_neighbors(): Add the parameter contiguous, which can be set to skip the check for non-contiguous page number ranges. Reviewed by: Thirunarayanan Balathandayuthapani
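A minimal sketch of the distinction that the contiguous parameter restores is given below. It assumes the documented meaning of the setting (innodb_flush_neighbors=1 considers only the contiguous run of dirty pages around the target page within its extent, while 2 considers all dirty pages of the extent). The function name and the bitmap representation are hypothetical and do not reflect the actual signature of buf_flush_check_neighbors().

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: dirty[i] tells whether slot i of one extent is dirty.
// Returns the half-open slot range [first, last) considered for flushing
// around the dirty page at position pos.
std::pair<std::size_t, std::size_t>
neighbor_range_sketch(const std::vector<bool>& dirty, std::size_t pos,
                      bool contiguous)
{
  assert(pos < dirty.size());
  if (!contiguous)
    return {0, dirty.size()}; // whole extent; clean slots are simply not written
  std::size_t first = pos, last = pos + 1;
  while (first > 0 && dirty[first - 1])
    --first; // extend downwards over adjacent dirty pages
  while (last < dirty.size() && dirty[last])
    ++last;  // extend upwards over adjacent dirty pages
  return {first, last};
}
```

With innodb_flush_neighbors=0, a caller would not invoke such a helper at all and would write only the target page.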
* \| \| \|	MDEV-16264 fixup: Remove unused code and data	Marko Mäkelä	2020-09-30	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	LATCH_ID_OS_AIO_READ_MUTEX, LATCH_ID_OS_AIO_WRITE_MUTEX, LATCH_ID_OS_AIO_LOG_MUTEX, LATCH_ID_OS_AIO_IBUF_MUTEX, LATCH_ID_OS_AIO_SYNC_MUTEX: Remove. The tpool is not instrumented. lock_set_timeout_event(): Remove. srv_sys_mutex_key, srv_sys_t::mutex, SYNC_THREADS: Remove. srv_slot_t::suspended: Remove. We only ever assigned this data member true, so it is redundant. ib_wqueue_wait(), ib_wqueue_timedwait(): Remove. os_thread_join(): Remove. os_thread_create(), os_thread_exit(): Remove redundant parameters. These were missed in commit 5e62b6a5e06eb02cbde1e34e95e26f42d87fce02.