summaryrefslogtreecommitdiff
path: root/storage/rocksdb
Commit message (Collapse)AuthorAgeFilesLines
* MDEV-9077 Use sys schema in bootstrapping, incl. mtrVladislav Vaintroub2021-03-181-0/+1
|
* MDEV-25085: Simplify instrumentation for LRU evictionMarko Mäkelä2021-03-091-6/+2
| | | | | | | | | | | | | | | | | | | Let us add the status variable innodb_buffer_pool_pages_LRU_freed to monitor the number of pages that were freed by a buffer pool LRU eviction scan, without flushing. Also, let us simplify the monitor interface: MONITOR_LRU_BATCH_FLUSH_COUNT, MONITOR_LRU_BATCH_FLUSH_PAGES, MONITOR_LRU_BATCH_EVICT_COUNT, MONITOR_LRU_BATCH_EVICT_PAGES: Remove. MONITOR_LRU_BATCH_FLUSH_TOTAL_PAGE: Track buf_lru_flush_page_count (innodb_buffer_pool_pages_LRU_flushed). MONITOR_LRU_BATCH_EVICT_TOTAL_PAGE: Track buf_lru_freed_page_count (buffer_pool_pages_LRU_freed). Reviewed by: Vladislav Vaintroub
* MDEV-7317: Make an index ignorable to the optimizerVarun Gupta2021-03-0415-179/+179
| | | | | | | | | | | | | | This feature adds the functionality of ignorability for indexes. Indexes are not ignored be default. To control index ignorability explicitly for a new index, use IGNORE or NOT IGNORE as part of the index definition for CREATE TABLE, CREATE INDEX, or ALTER TABLE. Primary keys (explicit or implicit) cannot be made ignorable. The table INFORMATION_SCHEMA.STATISTICS get a new column named IGNORED that would store whether an index needs to be ignored or not.
* MDEV-24789: Reduce lock_sys.wait_mutex contentionMarko Mäkelä2021-02-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | A performance regression was introduced by commit e71e6133535da8d5eab86e504f0b116a03680780 (MDEV-24671) and mostly addressed by commit 455514c8006e2d1a147a7845c5d408b0dcb7cdd4. The regression is likely caused by increased contention lock_sys.latch (former lock_sys.mutex), possibly indirectly caused by contention on lock_sys.wait_mutex. This change aims to reduce both, but further improvements will be needed. lock_wait(): Minimize the lock_sys.wait_mutex hold time. lock_sys_t::deadlock_check(): Add a parameter for indicating whether lock_sys.latch is exclusively locked. trx_t::was_chosen_as_deadlock_victim: Always use atomics. lock_wait_wsrep(): Assume that no mutex is being held. Deadlock::report(): Always kill the victim transaction. lock_sys_t::timeout: New counter to back MONITOR_TIMEOUT.
* Merge 10.5 into 10.6Marko Mäkelä2021-02-242-4/+4
|\
| * Merge branch '10.4' into 10.5Sergei Golubchik2021-02-232-4/+4
| |\
| | * Merge branch '10.3' into 10.4Sergei Golubchik2021-02-232-4/+4
| | |\
| | | * Merge branch '10.2' into 10.3Sergei Golubchik2021-02-223-5/+5
| | | |\
| | | | * Updating test results in rocksdb test suite after MDEV-11172 is fixedVarun Gupta2021-02-013-5/+5
| | | | |
| | | | * MDEV-17556 Assertion `bitmap_is_set_all(&table->s->all_set)' failedNikita Malyavin2021-01-082-9/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The assertion failed in handler::ha_reset upon SELECT under READ UNCOMMITTED from table with index on virtual column. This was the debug-only failure, though the problem is mush wider: * MY_BITMAP is a structure containing my_bitmap_map, the latter is a raw bitmap. * read_set, write_set and vcol_set of TABLE are the pointers to MY_BITMAP * The rest of MY_BITMAPs are stored in TABLE and TABLE_SHARE * The pointers to the stored MY_BITMAPs, like orig_read_set etc, and sometimes all_set and tmp_set, are assigned to the pointers. * Sometimes tmp_use_all_columns is used to substitute the raw bitmap directly with all_set.bitmap * Sometimes even bitmaps are directly modified, like in TABLE::update_virtual_field(): bitmap_clear_all(&tmp_set) is called. The last three bullets in the list, when used together (which is mostly always) make the program flow cumbersome and impossible to follow, notwithstanding the errors they cause, like this MDEV-17556, where tmp_set pointer was assigned to read_set, write_set and vcol_set, then its bitmap was substituted with all_set.bitmap by dbug_tmp_use_all_columns() call, and then bitmap_clear_all(&tmp_set) was applied to all this. To untangle this knot, the rule should be applied: * Never substitute bitmaps! This patch is about this. orig_*, all_set bitmaps are never substituted already. This patch changes the following function prototypes: * tmp_use_all_columns, dbug_tmp_use_all_columns to accept MY_BITMAP** and to return MY_BITMAP * instead of my_bitmap_map* * tmp_restore_column_map, dbug_tmp_restore_column_maps to accept MY_BITMAP* instead of my_bitmap_map* These functions now will substitute read_set/write_set/vcol_set directly, and won't touch underlying bitmaps.
* | | | | Merge 10.5 into 10.6Marko Mäkelä2021-02-172-9/+8
|\ \ \ \ \ | |/ / / /
| * | | | Merge branch 'bb-10.4-release' into bb-10.5-releaseSergei Golubchik2021-02-152-9/+8
| |\ \ \ \ | | |/ / /
| | * | | Merge branch 'bb-10.3-release' into bb-10.4-releaseSergei Golubchik2021-02-122-9/+8
| | |\ \ \ | | | |/ / | | | | | | | | | | | | | | | Note, the fix for "MDEV-23328 Server hang due to Galera lock conflict resolution" was null-merged. 10.4 version of the fix is coming up separately
| | | * | MDEV-17556 Assertion `bitmap_is_set_all(&table->s->all_set)' failedNikita Malyavin2021-01-272-9/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The assertion failed in handler::ha_reset upon SELECT under READ UNCOMMITTED from table with index on virtual column. This was the debug-only failure, though the problem is mush wider: * MY_BITMAP is a structure containing my_bitmap_map, the latter is a raw bitmap. * read_set, write_set and vcol_set of TABLE are the pointers to MY_BITMAP * The rest of MY_BITMAPs are stored in TABLE and TABLE_SHARE * The pointers to the stored MY_BITMAPs, like orig_read_set etc, and sometimes all_set and tmp_set, are assigned to the pointers. * Sometimes tmp_use_all_columns is used to substitute the raw bitmap directly with all_set.bitmap * Sometimes even bitmaps are directly modified, like in TABLE::update_virtual_field(): bitmap_clear_all(&tmp_set) is called. The last three bullets in the list, when used together (which is mostly always) make the program flow cumbersome and impossible to follow, notwithstanding the errors they cause, like this MDEV-17556, where tmp_set pointer was assigned to read_set, write_set and vcol_set, then its bitmap was substituted with all_set.bitmap by dbug_tmp_use_all_columns() call, and then bitmap_clear_all(&tmp_set) was applied to all this. To untangle this knot, the rule should be applied: * Never substitute bitmaps! This patch is about this. orig_*, all_set bitmaps are never substituted already. This patch changes the following function prototypes: * tmp_use_all_columns, dbug_tmp_use_all_columns to accept MY_BITMAP** and to return MY_BITMAP * instead of my_bitmap_map* * tmp_restore_column_map, dbug_tmp_restore_column_maps to accept MY_BITMAP* instead of my_bitmap_map* These functions now will substitute read_set/write_set/vcol_set directly, and won't touch underlying bitmaps.
* | | | | MDEV-24731 Excessive mutex contention in DeadlockChecker::check_and_resolve()Marko Mäkelä2021-02-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The DeadlockChecker expects to be able to freeze the waits-for graph. Hence, it is best executed somewhere where we are not holding any additional mutexes. lock_wait(): Defer the deadlock check to this function, instead of executing it in lock_rec_enqueue_waiting(), lock_table_enqueue_waiting(). DeadlockChecker::trx_rollback(): Merge with the only caller, check_and_resolve(). LockMutexGuard: RAII accessor for lock_sys.mutex. lock_sys.deadlocks: Replaces lock_deadlock_found. trx_t: Clean up some comments.
* | | | | Merge 10.5 into 10.6Marko Mäkelä2021-01-251-1/+1
|\ \ \ \ \ | |/ / / /
| * | | | cmake warning, if/endif mismatchSergei Golubchik2021-01-171-1/+1
| | | | |
* | | | | Merge 10.5 into 10.6Marko Mäkelä2021-01-111-2/+2
|\ \ \ \ \ | |/ / / /
| * | | | MDEV-24556: Build does not recognize powerpc64 (OpenBSD)Brad Smith2021-01-111-2/+2
| | | | | | | | | | | | | | | | | | | | Reviewer: Daniel Black
* | | | | MDEV-24142: Replace InnoDB rw_lock_t with sux_lockMarko Mäkelä2020-12-032-10/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | InnoDB buffer pool block and index tree latches depend on a special kind of read-update-write lock that allows reentrant (recursive) acquisition of the 'update' and 'write' locks as well as an upgrade from 'update' lock to 'write' lock. The 'update' lock allows any number of reader locks from other threads, but no concurrent 'update' or 'write' lock. If there were no requirement to support an upgrade from 'update' to 'write', we could compose the lock out of two srw_lock (implemented as any type of native rw-lock, such as SRWLOCK on Microsoft Windows). Removing this requirement is very difficult, so in commit f7e7f487d4b06695f91f6fbeb0396b9d87fc7bbf we implemented an 'update' mode to our srw_lock. Re-entrant or recursive locking is mostly needed when writing or freeing BLOB pages, but also in crash recovery or when merging buffered changes to an index page. The re-entrancy allows us to attach a previously acquired page to a sub-mini-transaction that will be committed before whatever else is holding the page latch. The SUX lock supports Shared ('read'), Update, and eXclusive ('write') locking modes. The S latches are not re-entrant, but a single S latch may be acquired even if the thread already holds an U latch. The idea of the U latch is to allow a write of something that concurrent readers do not care about (such as the contents of BTR_SEG_LEAF, BTR_SEG_TOP and other page allocation metadata structures, or the MDEV-6076 PAGE_ROOT_AUTO_INC). (The PAGE_ROOT_AUTO_INC field is only updated when a dict_table_t for the table exists, and only read when a dict_table_t for the table is being added to dict_sys.) block_lock::u_lock_try(bool for_io=true) is used in buf_flush_page() to allow concurrent readers but no concurrent modifications while the page is being written to the data file. That latch will be released by buf_page_write_complete() in a different thread. Hence, we use the special lock owner value FOR_IO. The index_lock::u_lock() improves concurrency on operations that involve non-leaf index pages. The interface has been cleaned up a little. We will use x_lock_recursive() instead of x_lock() when we know that a lock is already held by the current thread. Similarly, a lock upgrade from U to X is only allowed via u_x_upgrade() or x_lock_upgraded() but not via x_lock(). We will disable the LatchDebug and sync_array interfaces to InnoDB rw-locks. The SEMAPHORES section of SHOW ENGINE INNODB STATUS output will no longer include any information about InnoDB rw-locks, only TTASEventMutex (cmake -DMUTEXTYPE=event) waits. This will make a part of the 'innotop' script dead code. The block_lock buf_block_t::lock will not be covered by any PERFORMANCE_SCHEMA instrumentation. SHOW ENGINE INNODB MUTEX and INFORMATION_SCHEMA.INNODB_MUTEXES will no longer output source code file names or line numbers. The dict_index_t::lock will be identified by index and table names, which should be much more useful. PERFORMANCE_SCHEMA is lumping information about all dict_index_t::lock together as event_name='wait/synch/sxlock/innodb/index_tree_rw_lock'. buf_page_free(): Remove the file,line parameters. The sux_lock will not store such diagnostic information. buf_block_dbg_add_level(): Define as empty macro, to be removed in a subsequent commit. Unless the build was configured with cmake -DPLUGIN_PERFSCHEMA=NO the index_lock dict_index_t::lock will be instrumented via PERFORMANCE_SCHEMA. Similar to commit 1669c8890ca2e9092213626e5b047e58ca8b1e77 we will distinguish lock waits by registering shared_lock,exclusive_lock events instead of try_shared_lock,try_exclusive_lock. Actual 'try' operations will not be instrumented at all. rw_lock_list: Remove. After MDEV-24167, this only covered buf_block_t::lock and dict_index_t::lock. We will output their information by traversing buf_pool or dict_sys.
* | | | | Merge 10.5 into 10.6Marko Mäkelä2020-12-031-1/+1
|\ \ \ \ \ | |/ / / /
| * | | | Merge 10.4 into 10.5Marko Mäkelä2020-12-021-1/+1
| |\ \ \ \ | | |/ / /
| | * | | Merge 10.3 into 10.4Marko Mäkelä2020-12-011-1/+1
| | |\ \ \ | | | |/ /
| | | * | Merge 10.2 into 10.3Marko Mäkelä2020-12-011-1/+1
| | | |\ \ | | | | |/
| | | | * MyRocks: Bare Windows compatibility: use rmdir built-in, not "rm -rf"Sergei Petrunia2020-11-171-1/+1
| | | | |
* | | | | MDEV-22343 Remove SYS_TABLESPACES and SYS_DATAFILESbb-10.6-MDEV-23497Marko Mäkelä2020-11-113-10/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The InnoDB internal tables SYS_TABLESPACES and SYS_DATAFILES as well as the INFORMATION_SCHEMA views INNODB_SYS_TABLESPACES and INNODB_SYS_DATAFILES were introduced in MySQL 5.6 for no good reason in mysql/mysql-server/commit/e9255a22ef16d612a8076bc0b34002bc5a784627 when the InnoDB support for the DATA DIRECTORY attribute was introduced. The file system should be the authoritative source of information on files. Storing information about file system paths in the file system (symlinks, or even the .isl files that were unfortunately chosen as the solution) is sufficient. If information is additionally stored in some hidden tables inside the InnoDB system tablespace, everything unnecessarily becomes more complicated, because more copies of data mean more opportunity for the copies to be out of sync, and because modifying the data in the system tablespace in the desired way might not be possible at all without modifying the InnoDB source code. So, the copy in the system tablespace basically is a redundant, non-authoritative source of information. We will stop creating or accessing the system tables SYS_TABLESPACES and SYS_DATAFILES. We will also remove the view INFORMATION_SCHEMA.INNODB_SYS_DATAFILES along with SYS_DATAFILES. The view INFORMATION_SCHEMA.INNODB_SYS_TABLESPACES will be repurposed to directly reflect fil_system.space_list. The column PAGE_SIZE, which would always contain the value of the GLOBAL read-only variable innodb_page_size, is removed. The column ZIP_PAGE_SIZE, which would actually contain the physical page size of a page, is renamed to PAGE_SIZE. Finally, a new column FILENAME is added, as a replacement of SYS_DATAFILES.PATH. This will also address MDEV-21801 (files that were created before upgrading to MySQL 5.6 or MariaDB 10.0 or later were never registered in SYS_TABLESPACES or SYS_DATAFILES) and MDEV-21801 (information about the system tablespace is not stored in SYS_TABLESPACES or SYS_DATAFILES).
* | | | | Merge 10.5 into 10.6Marko Mäkelä2020-11-022-15/+2
|\ \ \ \ \ | |/ / / /
| * | | | Merge 10.4 into 10.5Marko Mäkelä2020-10-301-1/+1
| |\ \ \ \ | | |/ / /
| | * | | Merge 10.3 into 10.4Marko Mäkelä2020-10-291-1/+1
| | |\ \ \ | | | |/ /
| | | * | cleanup: use predefined CMAKE_DL_LIBSSergei Golubchik2020-10-231-1/+1
| | | | | | | | | | | | | | | | | | | | instead of, say, MY_SEARCH_LIBS(dlopen dl LIBDL)
| * | | | MDEV-23855: Improve InnoDB log checkpoint performanceMarko Mäkelä2020-10-261-7/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After MDEV-15053, MDEV-22871, MDEV-23399 shifted the scalability bottleneck, log checkpoints became a new bottleneck. If innodb_io_capacity is set low or innodb_max_dirty_pct_lwm is set high and the workload fits in the buffer pool, the page cleaner thread will perform very little flushing. When we reach the capacity of the circular redo log file ib_logfile0 and must initiate a checkpoint, some 'furious flushing' will be necessary. (If innodb_flush_sync=OFF, then flushing would continue at the innodb_io_capacity rate, and writers would be throttled.) We have the best chance of advancing the checkpoint LSN immediately after a page flush batch has been completed. Hence, it is best to perform checkpoints after every batch in the page cleaner thread, attempting to run once per second. By initiating high-priority flushing in the page cleaner as early as possible, we aim to make the throughput more stable. The function buf_flush_wait_flushed() used to sleep for 10ms, hoping that the page cleaner thread would do something during that time. The observed end result was that a large number of threads that call log_free_check() would end up sleeping while nothing useful is happening. We will revise the design so that in the default innodb_flush_sync=ON mode, buf_flush_wait_flushed() will wake up the page cleaner thread to perform the necessary flushing, and it will wait for a signal from the page cleaner thread. If innodb_io_capacity is set to a low value (causing the page cleaner to throttle its work), a write workload would initially perform well, until the capacity of the circular ib_logfile0 is reached and log_free_check() will trigger checkpoints. At that point, the extra waiting in buf_flush_wait_flushed() will start reducing throughput. The page cleaner thread will also initiate log checkpoints after each buf_flush_lists() call, because that is the best point of time for the checkpoint LSN to advance by the maximum amount. Even in 'furious flushing' mode we invoke buf_flush_lists() with innodb_io_capacity_max pages at a time, and at the start of each batch (in the log_flush() callback function that runs in a separate task) we will invoke os_aio_wait_until_no_pending_writes(). This tweak allows the checkpoint to advance in smaller steps and significantly reduces the maximum latency. On an Intel Optane 960 NVMe SSD on Linux, it reduced from 4.6 seconds to 74 milliseconds. On Microsoft Windows with a slower SSD, it reduced from more than 180 seconds to 0.6 seconds. We will make innodb_adaptive_flushing=OFF simply flush innodb_io_capacity per second whenever the dirty proportion of buffer pool pages exceeds innodb_max_dirty_pages_pct_lwm. For innodb_adaptive_flushing=ON we try to make page_cleaner_flush_pages_recommendation() more consistent and predictable: if we are below innodb_adaptive_flushing_lwm, let us flush pages according to the return value of af_get_pct_for_dirty(). innodb_max_dirty_pages_pct_lwm: Revert the change of the default value that was made in MDEV-23399. The value innodb_max_dirty_pages_pct_lwm=0 guarantees that a shutdown of an idle server will be fast. Users might be surprised if normal shutdown suddenly became slower when upgrading within a GA release series. innodb_checkpoint_usec: Remove. The master task will no longer perform periodic log checkpoints. It is the duty of the page cleaner thread. log_sys.max_modified_age: Remove. The current span of the buf_pool.flush_list expressed in LSN only matters for adaptive flushing (outside the 'furious flushing' condition). For the correctness of checkpoints, the only thing that matters is the checkpoint age (log_sys.lsn - log_sys.last_checkpoint_lsn). This run-time constant was also reported as log_max_modified_age_sync. log_sys.max_checkpoint_age_async: Remove. This does not serve any purpose, because the checkpoints will now be triggered by the page cleaner thread. We will retain the log_sys.max_checkpoint_age limit for engaging 'furious flushing'. page_cleaner.slot: Remove. It turns out that page_cleaner_slot.flush_list_time was duplicating page_cleaner.slot.flush_time and page_cleaner.slot.flush_list_pass was duplicating page_cleaner.flush_pass. Likewise, there were some redundant monitor counters, because the page cleaner thread no longer performs any buf_pool.LRU flushing, and because there only is one buf_flush_page_cleaner thread. buf_flush_sync_lsn: Protect writes by buf_pool.flush_list_mutex. buf_pool_t::get_oldest_modification(): Add a parameter to specify the return value when no persistent data pages are dirty. Require the caller to hold buf_pool.flush_list_mutex. log_buf_pool_get_oldest_modification(): Take the fall-back LSN as a parameter. All callers will also invoke log_sys.get_lsn(). log_preflush_pool_modified_pages(): Replaced with buf_flush_wait_flushed(). buf_flush_wait_flushed(): Implement two limits. If not enough buffer pool has been flushed, signal the page cleaner (unless innodb_flush_sync=OFF) and wait for the page cleaner to complete. If the page cleaner thread is not running (which can be the case durign shutdown), initiate the flush and wait for it directly. buf_flush_ahead(): If innodb_flush_sync=ON (the default), submit a new buf_flush_sync_lsn target for the page cleaner but do not wait for the flushing to finish. log_get_capacity(), log_get_max_modified_age_async(): Remove, to make it easier to see that af_get_pct_for_lsn() is not acquiring any mutexes. page_cleaner_flush_pages_recommendation(): Protect all access to buf_pool.flush_list with buf_pool.flush_list_mutex. Previously there were some race conditions in the calculation. buf_flush_sync_for_checkpoint(): New function to process buf_flush_sync_lsn in the page cleaner thread. At the end of each batch, we try to wake up any blocked buf_flush_wait_flushed(). If everything up to buf_flush_sync_lsn has been flushed, we will reset buf_flush_sync_lsn=0. The page cleaner thread will keep 'furious flushing' until the limit is reached. Any threads that are waiting in buf_flush_wait_flushed() will be able to resume as soon as their own limit has been satisfied. buf_flush_page_cleaner: Prioritize buf_flush_sync_lsn and do not sleep as long as it is set. Do not update any page_cleaner statistics for this special mode of operation. In the normal mode (buf_flush_sync_lsn is not set for innodb_flush_sync=ON), try to wake up once per second. No longer check whether srv_inc_activity_count() has been called. After each batch, try to perform a log checkpoint, because the best chances for the checkpoint LSN to advance by the maximum amount are upon completing a flushing batch. log_t: Move buf_free, max_buf_free possibly to the same cache line with log_sys.mutex. log_margin_checkpoint_age(): Simplify the logic, and replace a 0.1-second sleep with a call to buf_flush_wait_flushed() to initiate flushing. Moved to the same compilation unit with the only caller. log_close(): Clean up the calculations. (Should be no functional change.) Return whether flush-ahead is needed. Moved to the same compilation unit with the only caller. mtr_t::finish_write(): Return whether flush-ahead is needed. mtr_t::commit(): Invoke buf_flush_ahead() when needed. Let us avoid external calls in mtr_t::commit() and make the logic easier to follow by having related code in a single compilation unit. Also, we will invoke srv_stats.log_write_requests.inc() only once per mini-transaction commit, while not holding mutexes. log_checkpoint_margin(): Only care about log_sys.max_checkpoint_age. Upon reaching log_sys.max_checkpoint_age where we must wait to prevent the log from getting corrupted, let us wait for at most 1MiB of LSN at a time, before rechecking the condition. This should allow writers to proceed even if the redo log capacity has been reached and 'furious flushing' is in progress. We no longer care about log_sys.max_modified_age_sync or log_sys.max_modified_age_async. The log_sys.max_modified_age_sync could be a relic from the time when there was a srv_master_thread that wrote dirty pages to data files. Also, we no longer have any log_sys.max_checkpoint_age_async limit, because log checkpoints will now be triggered by the page cleaner thread upon completing buf_flush_lists(). log_set_capacity(): Simplify the calculations of the limit (no functional change). log_checkpoint_low(): Split from log_checkpoint(). Moved to the same compilation unit with the caller. log_make_checkpoint(): Only wait for everything to be flushed until the current LSN. create_log_file(): After checkpoint, invoke log_write_up_to() to ensure that the FILE_CHECKPOINT record has been written. This avoids ut_ad(!srv_log_file_created) in create_log_file_rename(). srv_start(): Do not call recv_recovery_from_checkpoint_start() if the log has just been created. Set fil_system.space_id_reuse_warned before dict_boot() has been executed, and clear it after recovery has finished. dict_boot(): Initialize fil_system.max_assigned_id. srv_check_activity(): Remove. The activity count is counting transaction commits and therefore mostly interesting for the purge of history. BtrBulk::insert(): Do not explicitly wake up the page cleaner, but do invoke srv_inc_activity_count(), because that counter is still being used in buf_load_throttle_if_needed() for some heuristics. (It might be cleaner to execute buf_load() in the page cleaner thread!) Reviewed by: Vladislav Vaintroub
| * | | | MDEV-23399: Performance regression with write workloadsMarko Mäkelä2020-10-151-7/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The buffer pool refactoring in MDEV-15053 and MDEV-22871 shifted the performance bottleneck to the page flushing. The configuration parameters will be changed as follows: innodb_lru_flush_size=32 (new: how many pages to flush on LRU eviction) innodb_lru_scan_depth=1536 (old: 1024) innodb_max_dirty_pages_pct=90 (old: 75) innodb_max_dirty_pages_pct_lwm=75 (old: 0) Note: The parameter innodb_lru_scan_depth will only affect LRU eviction of buffer pool pages when a new page is being allocated. The page cleaner thread will no longer evict any pages. It used to guarantee that some pages will remain free in the buffer pool. Now, we perform that eviction 'on demand' in buf_LRU_get_free_block(). The parameter innodb_lru_scan_depth(srv_LRU_scan_depth) is used as follows: * When the buffer pool is being shrunk in buf_pool_t::withdraw_blocks() * As a buf_pool.free limit in buf_LRU_list_batch() for terminating the flushing that is initiated e.g., by buf_LRU_get_free_block() The parameter also used to serve as an initial limit for unzip_LRU eviction (evicting uncompressed page frames while retaining ROW_FORMAT=COMPRESSED pages), but now we will use a hard-coded limit of 100 or unlimited for invoking buf_LRU_scan_and_free_block(). The status variables will be changed as follows: innodb_buffer_pool_pages_flushed: This includes also the count of innodb_buffer_pool_pages_LRU_flushed and should work reliably, updated one by one in buf_flush_page() to give more real-time statistics. The function buf_flush_stats(), which we are removing, was not called in every code path. For both counters, we will use regular variables that are incremented in a critical section of buf_pool.mutex. Note that show_innodb_vars() directly links to the variables, and reads of the counters will *not* be protected by buf_pool.mutex, so you cannot get a consistent snapshot of both variables. The following INFORMATION_SCHEMA.INNODB_METRICS counters will be removed, because the page cleaner no longer deals with writing or evicting least recently used pages, and because the single-page writes have been removed: * buffer_LRU_batch_flush_avg_time_slot * buffer_LRU_batch_flush_avg_time_thread * buffer_LRU_batch_flush_avg_time_est * buffer_LRU_batch_flush_avg_pass * buffer_LRU_single_flush_scanned * buffer_LRU_single_flush_num_scan * buffer_LRU_single_flush_scanned_per_call When moving to a single buffer pool instance in MDEV-15058, we missed some opportunity to simplify the buf_flush_page_cleaner thread. It was unnecessarily using a mutex and some complex data structures, even though we always have a single page cleaner thread. Furthermore, the buf_flush_page_cleaner thread had separate 'recovery' and 'shutdown' modes where it was waiting to be triggered by some other thread, adding unnecessary latency and potential for hangs in relatively rarely executed startup or shutdown code. The page cleaner was also running two kinds of batches in an interleaved fashion: "LRU flush" (writing out some least recently used pages and evicting them on write completion) and the normal batches that aim to increase the MIN(oldest_modification) in the buffer pool, to help the log checkpoint advance. The buf_pool.flush_list flushing was being blocked by buf_block_t::lock for no good reason. Furthermore, if the FIL_PAGE_LSN of a page is ahead of log_sys.get_flushed_lsn(), that is, what has been persistently written to the redo log, we would trigger a log flush and then resume the page flushing. This would unnecessarily limit the performance of the page cleaner thread and trigger the infamous messages "InnoDB: page_cleaner: 1000ms intended loop took 4450ms. The settings might not be optimal" that were suppressed in commit d1ab89037a518fcffbc50c24e4bd94e4ec33aed0 unless log_warnings>2. Our revised algorithm will make log_sys.get_flushed_lsn() advance at the start of buf_flush_lists(), and then execute a 'best effort' to write out all pages. The flush batches will skip pages that were modified since the log was written, or are are currently exclusively locked. The MDEV-13670 message "page_cleaner: 1000ms intended loop took" message will be removed, because by design, the buf_flush_page_cleaner() should not be blocked during a batch for extended periods of time. We will remove the single-page flushing altogether. Related to this, the debug parameter innodb_doublewrite_batch_size will be removed, because all of the doublewrite buffer will be used for flushing batches. If a page needs to be evicted from the buffer pool and all 100 least recently used pages in the buffer pool have unflushed changes, buf_LRU_get_free_block() will execute buf_flush_lists() to write out and evict innodb_lru_flush_size pages. At most one thread will execute buf_flush_lists() in buf_LRU_get_free_block(); other threads will wait for that LRU flushing batch to finish. To improve concurrency, we will replace the InnoDB ib_mutex_t and os_event_t native mutexes and condition variables in this area of code. Most notably, this means that the buffer pool mutex (buf_pool.mutex) is no longer instrumented via any InnoDB interfaces. It will continue to be instrumented via PERFORMANCE_SCHEMA. For now, both buf_pool.flush_list_mutex and buf_pool.mutex will be declared with MY_MUTEX_INIT_FAST (PTHREAD_MUTEX_ADAPTIVE_NP). The critical sections of buf_pool.flush_list_mutex should be shorter than those for buf_pool.mutex, because in the worst case, they cover a linear scan of buf_pool.flush_list, while the worst case of a critical section of buf_pool.mutex covers a linear scan of the potentially much longer buf_pool.LRU list. mysql_mutex_is_owner(), safe_mutex_is_owner(): New predicate, usable with SAFE_MUTEX. Some InnoDB debug assertions need this predicate instead of mysql_mutex_assert_owner() or mysql_mutex_assert_not_owner(). buf_pool_t::n_flush_LRU, buf_pool_t::n_flush_list: Replaces buf_pool_t::init_flush[] and buf_pool_t::n_flush[]. The number of active flush operations. buf_pool_t::mutex, buf_pool_t::flush_list_mutex: Use mysql_mutex_t instead of ib_mutex_t, to have native mutexes with PERFORMANCE_SCHEMA and SAFE_MUTEX instrumentation. buf_pool_t::done_flush_LRU: Condition variable for !n_flush_LRU. buf_pool_t::done_flush_list: Condition variable for !n_flush_list. buf_pool_t::do_flush_list: Condition variable to wake up the buf_flush_page_cleaner when a log checkpoint needs to be written or the server is being shut down. Replaces buf_flush_event. We will keep using timed waits (the page cleaner thread will wake _at least_ once per second), because the calculations for innodb_adaptive_flushing depend on fixed time intervals. buf_dblwr: Allocate statically, and move all code to member functions. Use a native mutex and condition variable. Remove code to deal with single-page flushing. buf_dblwr_check_block(): Make the check debug-only. We were spending a significant amount of execution time in page_simple_validate_new(). flush_counters_t::unzip_LRU_evicted: Remove. IORequest: Make more members const. FIXME: m_fil_node should be removed. buf_flush_sync_lsn: Protect by std::atomic, not page_cleaner.mutex (which we are removing). page_cleaner_slot_t, page_cleaner_t: Remove many redundant members. pc_request_flush_slot(): Replaces pc_request() and pc_flush_slot(). recv_writer_thread: Remove. Recovery works just fine without it, if we simply invoke buf_flush_sync() at the end of each batch in recv_sys_t::apply(). recv_recovery_from_checkpoint_finish(): Remove. We can simply call recv_sys.debug_free() directly. srv_started_redo: Replaces srv_start_state. SRV_SHUTDOWN_FLUSH_PHASE: Remove. logs_empty_and_mark_files_at_shutdown() can communicate with the normal page cleaner loop via the new function flush_buffer_pool(). buf_flush_remove(): Assert that the calling thread is holding buf_pool.flush_list_mutex. This removes unnecessary mutex operations from buf_flush_remove_pages() and buf_flush_dirty_pages(), which replace buf_LRU_flush_or_remove_pages(). buf_flush_lists(): Renamed from buf_flush_batch(), with simplified interface. Return the number of flushed pages. Clarified comments and renamed min_n to max_n. Identify LRU batch by lsn=0. Merge all the functions buf_flush_start(), buf_flush_batch(), buf_flush_end() directly to this function, which was their only caller, and remove 2 unnecessary buf_pool.mutex release/re-acquisition that we used to perform around the buf_flush_batch() call. At the start, if not all log has been durably written, wait for a background task to do it, or start a new task to do it. This allows the log write to run concurrently with our page flushing batch. Any pages that were skipped due to too recent FIL_PAGE_LSN or due to them being latched by a writer should be flushed during the next batch, unless there are further modifications to those pages. It is possible that a page that we must flush due to small oldest_modification also carries a recent FIL_PAGE_LSN or is being constantly modified. In the worst case, all writers would then end up waiting in log_free_check() to allow the flushing and the checkpoint to complete. buf_do_flush_list_batch(): Clarify comments, and rename min_n to max_n. Cache the last looked up tablespace. If neighbor flushing is not applicable, invoke buf_flush_page() directly, avoiding a page lookup in between. buf_flush_space(): Auxiliary function to look up a tablespace for page flushing. buf_flush_page(): Defer the computation of space->full_crc32(). Never call log_write_up_to(), but instead skip persistent pages whose latest modification (FIL_PAGE_LSN) is newer than the redo log. Also skip pages on which we cannot acquire a shared latch without waiting. buf_flush_try_neighbors(): Do not bother checking buf_fix_count because buf_flush_page() will no longer wait for the page latch. Take the tablespace as a parameter, and only execute this function when innodb_flush_neighbors>0. Avoid repeated calls of page_id_t::fold(). buf_flush_relocate_on_flush_list(): Declare as cold, and push down a condition from the callers. buf_flush_check_neighbor(): Take id.fold() as a parameter. buf_flush_sync(): Ensure that the buf_pool.flush_list is empty, because the flushing batch will skip pages whose modifications have not yet been written to the log or were latched for modification. buf_free_from_unzip_LRU_list_batch(): Remove redundant local variables. buf_flush_LRU_list_batch(): Let the caller buf_do_LRU_batch() initialize the counters, and report n->evicted. Cache the last looked up tablespace. If neighbor flushing is not applicable, invoke buf_flush_page() directly, avoiding a page lookup in between. buf_do_LRU_batch(): Return the number of pages flushed. buf_LRU_free_page(): Only release and re-acquire buf_pool.mutex if adaptive hash index entries are pointing to the block. buf_LRU_get_free_block(): Do not wake up the page cleaner, because it will no longer perform any useful work for us, and we do not want it to compete for I/O while buf_flush_lists(innodb_lru_flush_size, 0) writes out and evicts at most innodb_lru_flush_size pages. (The function buf_do_LRU_batch() may complete after writing fewer pages if more than innodb_lru_scan_depth pages end up in buf_pool.free list.) Eliminate some mutex release-acquire cycles, and wait for the LRU flush batch to complete before rescanning. buf_LRU_check_size_of_non_data_objects(): Simplify the code. buf_page_write_complete(): Remove the parameter evict, and always evict pages that were part of an LRU flush. buf_page_create(): Take a pre-allocated page as a parameter. buf_pool_t::free_block(): Free a pre-allocated block. recv_sys_t::recover_low(), recv_sys_t::apply(): Preallocate the block while not holding recv_sys.mutex. During page allocation, we may initiate a page flush, which in turn may initiate a log flush, which would require acquiring log_sys.mutex, which should always be acquired before recv_sys.mutex in order to avoid deadlocks. Therefore, we must not be holding recv_sys.mutex while allocating a buffer pool block. BtrBulk::logFreeCheck(): Skip a redundant condition. row_undo_step(): Do not invoke srv_inc_activity_count() for every row that is being rolled back. It should suffice to invoke the function in trx_flush_log_if_needed() during trx_t::commit_in_memory() when the rollback completes. sync_check_enable(): Remove. We will enable innodb_sync_debug from the very beginning. Reviewed by: Vladislav Vaintroub
* | | | | Merge 10.5 into 10.6Marko Mäkelä2020-09-241-0/+3
|\ \ \ \ \ | |/ / / /
| * | | | Merge 10.4 into 10.5Marko Mäkelä2020-09-041-0/+3
| |\ \ \ \ | | |/ / /
| | * | | Merge 10.3 into 10.4Marko Mäkelä2020-09-041-0/+3
| | |\ \ \ | | | |/ /
| | | * | Merge 10.2 into 10.3Marko Mäkelä2020-09-041-0/+3
| | | |\ \ | | | | |/
| | | | * Fix a typo in the previous csetSergei Petrunia2020-09-041-1/+1
| | | | |
| | | | * MDEV-23661: RocksDB produces "missing initializer for member" warningsSergei Petrunia2020-09-031-0/+3
| | | | | | | | | | | | | | | | | | | | Add -Wno-missing-field-initializers for MyRocks and gcc version below 5.0
* | | | | Merge branch '10.5' into 10.6Oleksandr Byelkin2020-09-0210-16/+16
|\ \ \ \ \ | |/ / / /
| * | | | Merge 10.4 into 10.5Marko Mäkelä2020-08-2110-16/+16
| |\ \ \ \ | | |/ / /
| | * | | Merge 10.3 into 10.4Marko Mäkelä2020-08-2110-16/+16
| | |\ \ \ | | | |/ /
| | | * | Merge 10.2 into 10.3Marko Mäkelä2020-08-2110-16/+16
| | | |\ \ | | | | |/
| | | | * MDEV-23511 shutdown_server 10 times out, causing server kill at shutdownAndrei Elkin2020-08-219-15/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Shutdown of mtr tests may be too impatient, esp on CI environment where 10 seconds of `arg` of `shutdown_server arg` may not be enough for the clean shutdown to complete. This is fixed to remove explicit non-zero timeout argument to `shutdown_server` from all mtr tests. mysqltest computes 60 seconds default value for the timeout for the argless `shutdown_server` command. This policy is additionally ensured with a compile time assert.
* | | | | Merge 10.5 into 10.6Marko Mäkelä2020-08-043-7/+28
|\ \ \ \ \ | |/ / / /
| * | | | Merge 10.4 into 10.5Marko Mäkelä2020-08-013-7/+28
| |\ \ \ \ | | |/ / /
| | * | | Merge 10.3 into 10.4Marko Mäkelä2020-07-313-7/+28
| | |\ \ \ | | | |/ /
| | | * | Merge 10.2 into 10.3Marko Mäkelä2020-07-313-7/+28
| | | |\ \ | | | | |/
| | | | * MDEV-23137: aarch64, postfix - cmake includeDaniel Black2020-07-281-0/+1
| | | | |
| | | | * rocksdb: FreeBSD disable jemalloc searchDaniel Black2020-07-281-6/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | FreeBSD's inbuilt default jemalloc means its pointless to do a package search on it. The paths are already set by the system defaults.
| | | | * MDEV-23051: riscv64 fails build (atomics)Daniel Black2020-07-282-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | riscv64 fails to build because the use of #include <atomic> needs to link with -latomic. per https://github.com/riscv/riscv-gnu-toolchain/issues/183#issuecomment-253721765