summaryrefslogtreecommitdiff
path: root/mysql-test/suite/innodb_zip
Commit message (Collapse)AuthorAgeFilesLines
* MDEV-21921 Make transaction_isolation and transaction_read_only into system ↵Junqi Xie2023-04-1210-121/+121
| | | | | | | | | | | | | | variables In MariaDB, we have a confusing problem where: * The transaction_isolation option can be set in a configuration file, but it cannot be set dynamically. * The tx_isolation system variable can be set dynamically, but it cannot be set in a configuration file. Therefore, we have two different names for the same thing in different contexts. This is needlessly confusing, and it complicates the documentation. The same thing applys for transaction_read_only. MySQL 5.7 solved this problem by making them into system variables. https://dev.mysql.com/doc/relnotes/mysql/5.7/en/news-5-7-20.html This commit takes a similar approach by adding new system variables and marking the original ones as deprecated. This commit also resolves some legacy problems related to SET STATEMENT and transaction_isolation.
* Merge 10.11 into 11.0Marko Mäkelä2023-03-174-18/+49
|\
* | MDEV-29694 fixup: Remove srv_change_buffer_max_size, adjust commentsMarko Mäkelä2023-02-211-28/+0
| |
* | MDEV-30153 ad hoc client versions are confusingSergei Golubchik2023-01-192-3/+4
| | | | | | | | | | | | | | | | | | | | | | try to make them less confusing for users. Hopefully, if the version string will be changed like - mariadb Ver 15.1 Distrib 10.11.2-MariaDB for Linux (x86_64) + mariadb from 10.11.2-MariaDB, client 15.1 for Linux (x86_64) users will be less inclined to reply "15.1" to the question "what mariadb version are you using?"
* | MDEV-29694 Remove the InnoDB change bufferMarko Mäkelä2023-01-112-20/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The purpose of the change buffer was to reduce random disk access, which could be useful on rotational storage, but maybe less so on solid-state storage. When we wished to (1) insert a record into a non-unique secondary index, (2) delete-mark a secondary index record, (3) delete a secondary index record as part of purge (but not ROLLBACK), and the B-tree leaf page where the record belongs to is not in the buffer pool, we inserted a record into the change buffer B-tree, indexed by the page identifier. When the page was eventually read into the buffer pool, we looked up the change buffer B-tree for any modifications to the page, applied these upon the completion of the read operation. This was called the insert buffer merge. We remove the change buffer, because it has been the source of various hard-to-reproduce corruption bugs, including those fixed in commit 5b9ee8d8193a8c7a8ebdd35eedcadc3ae78e7fc1 and commit 165564d3c33ae3d677d70644a83afcb744bdbf65 but not limited to them. A downgrade will fail with a clear message starting with commit db14eb16f9977453467ec4765f481bb2f71814ba (MDEV-30106). buf_page_t::state: Merge IBUF_EXIST to UNFIXED and WRITE_FIX_IBUF to WRITE_FIX. buf_pool_t::watch[]: Remove. trx_t: Move isolation_level, check_foreigns, check_unique_secondary, bulk_insert into the same bit-field. The only purpose of trx_t::check_unique_secondary is to enable bulk insert into an empty table. It no longer enables insert buffering for UNIQUE INDEX. btr_cur_t::thr: Remove. This field was originally needed for change buffering. Later, its use was extended to cover SPATIAL INDEX. Much of the time, rtr_info::thr holds this field. When it does not, we will add parameters to SPATIAL INDEX specific functions. ibuf_upgrade_needed(): Check if the change buffer needs to be updated. ibuf_upgrade(): Merge and upgrade the change buffer after all redo log has been applied. Free any pages consumed by the change buffer, and zero out the change buffer root page to mark the upgrade completed, and to prevent a downgrade to an earlier version. dict_load_tablespaces(): Renamed from dict_check_tablespaces_and_store_max_id(). This needs to be invoked before ibuf_upgrade(). btr_cur_open_at_rnd_pos(): Specialize for use in persistent statistics. The change buffer merge does not need this function anymore. btr_page_alloc(): Renamed from btr_page_alloc_low(). We no longer allocate any change buffer pages. btr_cur_open_at_rnd_pos(): Specialize for use in persistent statistics. The change buffer merge does not need this function anymore. row_search_index_entry(), btr_lift_page_up(): Add a parameter thr for the SPATIAL INDEX case. rtr_page_split_and_insert(): Specialized from btr_page_split_and_insert(). rtr_root_raise_and_insert(): Specialized from btr_root_raise_and_insert(). Note: The support for upgrading from the MySQL 3.23 or MySQL 4.0 change buffer format that predates the MySQL 4.1 introduction of the option innodb_file_per_table was removed in MySQL 5.6.5 as part of mysql/mysql-server@69b6241a79876ae98bb0c9dce7c8d8799d6ad273 and MariaDB 10.0.11 as part of 1d0f70c2f894b27e98773a282871d32802f67964. In the tests innodb.log_upgrade and innodb.log_corruption, we create valid (upgraded) change buffer pages. Tested by: Matthias Leich
* | MDEV-29983 Deprecate innodb_file_per_tableMarko Mäkelä2023-01-1127-162/+69
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before commit 6112853cdab2770e92f9cfefdfef9c0a14b71cb7 in MySQL 4.1.1 introduced the parameter innodb_file_per_table, all InnoDB data was written to the InnoDB system tablespace (often named ibdata1). A serious design problem is that once the system tablespace has grown to some size, it cannot shrink even if the data inside it has been deleted. There are also other design problems, such as the server hang MDEV-29930 that should only be possible when using innodb_file_per_table=0 and innodb_undo_tablespaces=0 (storing both tables and undo logs in the InnoDB system tablespace). The parameter innodb_change_buffering was deprecated in commit b5852ffbeebc3000982988383daeefb0549e058a. Starting with commit baf276e6d4a44fe7cdf3b435c0153da0a42af2b6 (MDEV-19229) the number of innodb_undo_tablespaces can be increased, so that the undo logs can be moved out of the system tablespace of an existing installation. If all these things (tables, undo logs, and the change buffer) are removed from the InnoDB system tablespace, the only variable-size data structure inside it is the InnoDB data dictionary. DDL operations on .ibd files was optimized in commit 86dc7b4d4cfe15a2d37f8b5f60c4fce5dba9491d (MDEV-24626). That should have removed any thinkable performance advantage of using innodb_file_per_table=0. Since there should be no benefit of setting innodb_file_per_table=0, the parameter should be deprecated. Starting with MySQL 5.6 and MariaDB Server 10.0, the default value is innodb_file_per_table=1.
* Merge 10.7 into 10.8Marko Mäkelä2022-12-132-4/+6
|\
| * Merge 10.5 into 10.6Marko Mäkelä2022-12-132-4/+6
| |\
| | * Merge 10.4 into 10.5Marko Mäkelä2022-12-132-5/+7
| | |\
| | | * Merge 10.3 into 10.4Marko Mäkelä2022-12-131-3/+3
| | | |\
| | | | * MDEV-27882 Innodb - recognise MySQL-8.0 innodb flags and give a specific ↵Daniel Black2022-11-111-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | error message Per fsp0types.h, SDI is on tablespace flags position 14 where MariaDB stores its pagesize. Flag at position 13, also in MariaDB pagesize flags, is a MySQL encryption flag. These are checked only if fsp_flags_is_valid fails, so valid MariaDB pages sizes don't become errors. The error message "Cannot reset LSNs in table" was rather specific and not always true to replaced with more generic error. ALTER TABLE tbl IMPORT TABLESPACE now reports Unsupported on MySQL tablespace (rather than index corrupted) along with a server error message. MySQL innodb Errors are with with UNSUPPORTED rather than CORRUPTED to avoid user anxiety. Reviewer: Marko Mäkelä
* | | | | Merge 10.7 into 10.8Marko Mäkelä2022-11-091-1/+0
|\ \ \ \ \ | |/ / / /
| * | | | Merge 10.5 into 10.6Marko Mäkelä2022-11-091-1/+0
| |\ \ \ \ | | |/ / /
| | * | | MDEV-22512: Re-enable the tests on 10.5Marko Mäkelä2022-11-081-1/+0
| | | | |
| * | | | Merge 10.5 into 10.6Marko Mäkelä2022-11-081-1/+1
| |\ \ \ \ | | |/ / /
| | * | | Merge 10.4 into 10.5Marko Mäkelä2022-11-081-1/+1
| | |\ \ \ | | | |/ /
| | | * | Merge 10.3 into 10.4Marko Mäkelä2022-11-081-1/+1
| | | |\ \ | | | | |/
| | | | * MDEV-22512: Disable frequently failing testsMarko Mäkelä2022-11-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | InnoDB crash recovery can run out of memory before commit 50324ce62448f284522ee1e3be24d8f5ba3bf904 in MariaDB Server 10.5. Let us disable some frequently failing recovery tests in earlier versions.
* | | | | Merge 10.7 into 10.8Marko Mäkelä2022-09-219-53/+62
|\ \ \ \ \ | |/ / / /
| * | | | Merge 10.5 into 10.6Marko Mäkelä2022-09-209-53/+62
| |\ \ \ \ | | |/ / /
| | * | | Merge remote-tracking branch 'origin/10.4' into 10.5Alexander Barkov2022-09-149-53/+53
| | |\ \ \ | | | |/ /
| | | * | Merge 10.3 into 10.4Marko Mäkelä2022-09-139-53/+53
| | | |\ \ | | | | |/
| | | | * MDEV-29446 Change SHOW CREATE TABLE to display default collationAlexander Barkov2022-09-129-53/+53
| | | | |
* | | | | Merge 10.7 into 10.8Marko Mäkelä2022-08-241-10/+10
|\ \ \ \ \ | |/ / / /
| * | | | Merge 10.5 into 10.6Marko Mäkelä2022-08-221-10/+10
| |\ \ \ \ | | |/ / /
| | * | | Merge 10.4 into 10.5Marko Mäkelä2022-08-221-10/+10
| | |\ \ \ | | | |/ /
| | | * | Merge 10.3 into 10.4Marko Mäkelä2022-08-221-10/+10
| | | |\ \ | | | | |/
| | | | * MDEV-13013 fixup: Adjust a testMarko Mäkelä2022-08-221-10/+10
| | | | |
* | | | | Merge 10.7 into 10.8Marko Mäkelä2022-07-282-0/+28
|\ \ \ \ \ | |/ / / /
| * | | | MDEV-28950: Add a test caseMarko Mäkelä2022-07-272-0/+28
| | | | | | | | | | | | | | | | | | | | | | | | | After reverting commit commit 39f45f6f89ce2fc2db54bb8ab0f6076f923beeec all combinations of this test would crash the server.
* | | | | Merge 10.7 into 10.8Marko Mäkelä2022-06-062-8/+20
|\ \ \ \ \ | |/ / / /
| * | | | MDEV-13542: Crashing on corrupted page is unhelpfulMarko Mäkelä2022-06-062-8/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The approach to handling corruption that was chosen by Oracle in commit 177d8b0c125b841c0650d27d735e3b87509dc286 is not really useful. Not only did it actually fail to prevent InnoDB from crashing, but it is making things worse by blocking attempts to rescue data from or rebuild a partially readable table. We will try to prevent crashes in a different way: by propagating errors up the call stack. We will never mark the clustered index persistently corrupted, so that data recovery may be attempted by reading from the table, or by rebuilding the table. This should also fix MDEV-13680 (crash on btr_page_alloc() failure); it was extensively tested with innodb_file_per_table=0 and a non-autoextend system tablespace. We should now avoid crashes in many cases, such as when a page cannot be read or allocated, or an inconsistency is detected when attempting to update multiple pages. We will not crash on double-free, such as on the recovery of DDL in system tablespace in case something was corrupted. Crashes on corrupted data are still possible. The fault injection mechanism that is introduced in the subsequent commit may help catch more of them. buf_page_import_corrupt_failure: Remove the fault injection, and instead corrupt some pages using Perl code in the tests. btr_cur_pessimistic_insert(): Always reserve extents (except for the change buffer), in order to prevent a subsequent allocation failure. btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages(). btr_assert_not_corrupted(), btr_corruption_report(): Remove. Similar checks are already part of btr_block_get(). FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE. dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(), trx_undo_page_get_s_latched(): Replaced with error-checking calls. trx_rseg_t::get(mtr_t*): Replaces trx_rsegf_get(). trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed. trx_sys_create_sys_pages(): Merged with trx_sysf_create(). dict_check_tablespaces_and_store_max_id(): Do not access DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot(). Merge dict_check_sys_tables() with this function. dir_pathname(): Replaces os_file_make_new_pathname(). row_undo_ins_remove_sec(): Do not modify the undo page by adding a terminating NUL byte to the record. btr_decryption_failed(): Report decryption failures dict_set_corrupted_by_space(), dict_set_encrypted_by_space(), dict_set_corrupted_index_cache_only(): Remove. dict_set_corrupted(): Remove the constant parameter dict_locked=false. Never flag the clustered index corrupted in SYS_INDEXES, because that would deny further access to the table. It might be possible to repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case no B-tree leaf page is corrupted. dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(), row_purge_skip_uncommitted_virtual_index(): Remove, and refactor the callers to read dict_index_t::type only once. dict_table_is_corrupted(): Remove. dict_index_t::is_btree(): Determine if the index is a valid B-tree. BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove. UNIV_BTR_DEBUG: Remove. Any inconsistency will no longer trigger assertion failures, but error codes being returned. buf_corrupt_page_release(): Replaced with a direct call to buf_pool.corrupted_evict(). fil_invalid_page_access_msg(): Never crash on an invalid read; let the caller of buf_page_get_gen() decide. btr_pcur_t::restore_position(): Propagate failure status to the caller by returning CORRUPTED. opt_search_plan_for_table(): Simplify the code. row_purge_del_mark(), row_purge_upd_exist_or_extern_func(), row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(), row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free() when no secondary indexes exist. row_undo_mod_upd_exist_sec(): Simplify the code. row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT if the clustered index (and therefore the table) is corrupted, similar to what we do in row_insert_for_mysql(). fut_get_ptr(): Replace with buf_page_get_gen() calls. buf_page_get_gen(): Return nullptr and *err=DB_CORRUPTION if the page is marked as freed. For other modes than BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED, we will return nullptr for freed pages, so that the callers can be simplified. The purge of transaction history will be a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on corrupted data. buf_page_get_low(): Never crash on a corrupted page, but simply return nullptr. fseg_page_is_allocated(): Replaces fseg_page_is_free(). fts_drop_common_tables(): Return an error if the transaction was rolled back. fil_space_t::set_corrupted(): Report a tablespace as corrupted if it was not reported already. fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report out-of-bounds page access or other errors. Clean up mtr_t::page_lock() buf_page_get_low(): Validate the page identifier (to check for recently read corrupted pages) after acquiring the page latch. buf_page_t::read_complete(): Flag uninitialized (all-zero) pages with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch. mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi(). recv_sys_t::free_corrupted_page(): Only set_corrupt_fs() if any log records exist for the page. We do not mind if read-ahead produces corrupted (or all-zero) pages that were not actually needed during recovery. recv_recover_page(): Return whether the operation succeeded. recv_sys_t::recover_low(): Simplify the logic. Check for recovery error. Thanks to Matthias Leich for testing this extensively and to the authors of https://rr-project.org for making it easy to diagnose and fix any failures that were found during the testing.
* | | | | Merge 10.7 into 10.8Marko Mäkelä2022-03-304-2/+13
|\ \ \ \ \ | |/ / / /
| * | | | MDEV-28181 The innochecksum -w option was inadvertently removedMarko Mäkelä2022-03-284-2/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 7a4fbb55b02b449a135fe935f624422eaacfdd7c (MDEV-25105) the innochecksum option --write (-w) was removed altogether. It should have been made a Boolean option, so that old data files may be converted to a format that is compatible with innodb_checksum_algorithm=strict_crc32 by executing the following: innochecksum -n -w ibdata* */*.ibd It would be better to use an older-version innochecksum for such a conversion, so that page checksums will be validated before updating the checksum. It never was possible for innochecksum to convert files to the innodb_checksum_algorithm=full_crc32 format that is the default for new InnoDB data files.
* | | | | Merge 10.7 into 10.8Marko Mäkelä2022-02-1717-638/+362
|\ \ \ \ \ | |/ / / /
| * | | | Merge 10.5 into 10.6Marko Mäkelä2022-02-1717-638/+362
| |\ \ \ \ | | |/ / /
| | * | | Merge 10.4 into 10.5Marko Mäkelä2022-02-1717-638/+362
| | |\ \ \ | | | |/ /
| | | * | Merge 10.3 into 10.4Marko Mäkelä2022-02-1717-638/+362
| | | |\ \ | | | | |/
| | | | * Merge 10.2 into 10.3Marko Mäkelä2022-02-1717-638/+362
| | | | |\
| | | | | * MDEV-27634 innodb_zip tests failing on s390xMarko Mäkelä2022-02-1617-638/+362
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some GNU/Linux distributions ship a zlib that is modified to use the s390x DFLTCC instruction. That modification would essentially redefine compressBound(sourceLen) as (sourceLen * 16 + 2308) / 8 + 6. Let us relax the tests for InnoDB ROW_FORMAT=COMPRESSED to cope with such a weaker compression guarantee. create_table_info_t::row_size_is_acceptable(): Remove a bogus debug-only assertion that would fail to hold for the test innodb_zip.bug36169. The function page_zip_empty_size() may indeed return 0.
* | | | | | Merge branch '10.7' into 10.8Oleksandr Byelkin2022-02-042-1/+2
|\ \ \ \ \ \ | |/ / / / /
| * | | | | Merge branch '10.5' into 10.6Oleksandr Byelkin2022-02-032-1/+2
| |\ \ \ \ \ | | |/ / / /
| | * | | | Merge branch '10.4' into 10.5Oleksandr Byelkin2022-02-012-1/+2
| | |\ \ \ \ | | | |/ / /
| | | * | | Merge branch '10.3' into 10.4Oleksandr Byelkin2022-01-302-1/+2
| | | |\ \ \ | | | | |/ /
| | | | * | Merge branch '10.2' into 10.3mariadb-10.3.33Oleksandr Byelkin2022-01-292-1/+2
| | | | |\ \ | | | | | |/
| | | | | * MDEV-27273 Confusing column count in IMPORT TABLESPACE error messagebb-10.2-MDEV-27273-import-column-checkEugene Kosov2022-01-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's misleading to compare and write to user number of columns and fields. Thus, it would be better to remove that check and let use see a subsequent error message about missing or mispaced column. row_import::match_schema(): remove misleading check
| | | | | * MDEV-26870 --skip-symbolic-links does not disallow .isl file creationMarko Mäkelä2022-01-211-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The InnoDB DATA DIRECTORY attribute is not implemented via symbolic links but something similar, *.isl files that contain the names of data files. InnoDB failed to ignore the DATA DIRECTORY attribute even though the server was started with --skip-symbolic-links. Native ALTER TABLE in InnoDB will retain the DATA DIRECTORY attribute of the table, no matter if the table will be rebuilt or not. Generic ALTER TABLE (with ALGORITHM=COPY) as well as TRUNCATE TABLE will discard the DATA DIRECTORY attribute. All tests have been run with and without the ./mtr option --mysqld=--skip-symbolic-links and some tests that use the InnoDB DATA DIRECTORY attribute have been adjusted for this.
* | | | | | MDEV-14425 Improve the redo log for concurrencyMarko Mäkelä2022-01-212-2/+2
|/ / / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The InnoDB redo log used to be formatted in blocks of 512 bytes. The log blocks were encrypted and the checksum was calculated while holding log_sys.mutex, creating a serious scalability bottleneck. We remove the fixed-size redo log block structure altogether and essentially turn every mini-transaction into a log block of its own. This allows encryption and checksum calculations to be performed on local mtr_t::m_log buffers, before acquiring log_sys.mutex. The mutex only protects a memcpy() of the data to the shared log_sys.buf, as well as the padding of the log, in case the to-be-written part of the log would not end in a block boundary of the underlying storage. For now, the "padding" consists of writing a single NUL byte, to allow recovery and mariadb-backup to detect the end of the circular log faster. Like the previous implementation, we will overwrite the last log block over and over again, until it has been completely filled. It would be possible to write only up to the last completed block (if no more recent write was requested), or to write dummy FILE_CHECKPOINT records to fill the incomplete block, by invoking the currently disabled function log_pad(). This would require adjustments to some logic around log checkpoints, page flushing, and shutdown. An upgrade after a crash of any previous version is not supported. Logically empty log files from a previous version will be upgraded. An attempt to start up InnoDB without a valid ib_logfile0 will be refused. Previously, the redo log used to be created automatically if it was missing. Only with with innodb_force_recovery=6, it is possible to start InnoDB in read-only mode even if the log file does not exist. This allows the contents of a possibly corrupted database to be dumped. Because a prepared backup from an earlier version of mariadb-backup will create a 0-sized log file, we will allow an upgrade from such log files, provided that the FIL_PAGE_FILE_FLUSH_LSN in the system tablespace looks valid. The 512-byte log checkpoint blocks at 0x200 and 0x600 will be replaced with 64-byte log checkpoint blocks at 0x1000 and 0x2000. The start of log records will move from 0x800 to 0x3000. This allows us to use 4096-byte aligned blocks for all I/O in a future revision. We extend the MDEV-12353 redo log record format as follows. (1) Empty mini-transactions or extra NUL bytes will not be allowed. (2) The end-of-minitransaction marker (a NUL byte) will be replaced with a 1-bit sequence number, which will be toggled each time when the circular log file wraps back to the beginning. (3) After the sequence bit, a CRC-32C checksum of all data (excluding the sequence bit) will written. (4) If the log is encrypted, 8 bytes will be written before the checksum and included in it. This is part of the initialization vector (IV) of encrypted log data. (5) File names, page numbers, and checkpoint information will not be encrypted. Only the payload bytes of page-level log will be encrypted. The tablespace ID and page number will form part of the IV. (6) For padding, arbitrary-length FILE_CHECKPOINT records may be written, with all-zero payload, and with the normal end marker and checksum. The minimum size is 7 bytes, or 7+8 with innodb_encrypt_log=ON. In mariadb-backup and in Galera snapshot transfer (SST) scripts, we will no longer remove ib_logfile0 or create an empty ib_logfile0. Server startup will require a valid log file. When resizing the log, we will create a logically empty ib_logfile101 at the current LSN and use an atomic rename to replace ib_logfile0 with it. See the test innodb.log_file_size. Because there is no mandatory padding in the log file, we are able to create a dummy log file as of an arbitrary log sequence number. See the test mariabackup.huge_lsn. The parameter innodb_log_write_ahead_size and the INFORMATION_SCHEMA.INNODB_METRICS counter log_padded will be removed. The minimum value of innodb_log_buffer_size will be increased to 2MiB (because log_sys.buf will replace recv_sys.buf) and the increment adjusted to 4096 bytes (the maximum log block size). The following INFORMATION_SCHEMA.INNODB_METRICS counters will be removed: os_log_fsyncs os_log_pending_fsyncs log_pending_log_flushes log_pending_checkpoint_writes The following status variables will be removed: Innodb_os_log_fsyncs (this is included in Innodb_data_fsyncs) Innodb_os_log_pending_fsyncs (this was limited to at most 1 by design) log_sys.get_block_size(): Return the physical block size of the log file. This is only implemented on Linux and Microsoft Windows for now, and for the power-of-2 block sizes between 64 and 4096 bytes (the minimum and maximum size of a checkpoint block). If the block size is anything else, the traditional 512-byte size will be used via normal file system buffering. If the file system buffers can be bypassed, a message like the following will be issued: InnoDB: File system buffers for log disabled (block size=512 bytes) InnoDB: File system buffers for log disabled (block size=4096 bytes) This has been tested on Linux and Microsoft Windows with both sizes. On Linux, only enable O_DIRECT on the log for innodb_flush_method=O_DSYNC. Tests in 3 different environments where the log is stored in a device with a physical block size of 512 bytes are yielding better throughput without O_DIRECT. This could be due to the fact that in the event the last log block is being overwritten (if multiple transactions would become durable at the same time, and each of will write a small number of bytes to the last log block), it should be faster to re-copy data from log_sys.buf or log_sys.flush_buf to the kernel buffer, to be finally written at fdatasync() time. The parameter innodb_flush_method=O_DSYNC will imply O_DIRECT for data files. This option will enable O_DIRECT on the log file on Linux. It may be unsafe to use when the storage device does not support FUA (Force Unit Access) mode. When the server is compiled WITH_PMEM=ON, we will use memory-mapped I/O for the log file if the log resides on a "mount -o dax" device. We will identify PMEM in a start-up message: InnoDB: log sequence number 0 (memory-mapped); transaction id 3 On Linux, we will also invoke mmap() on any ib_logfile0 that resides in /dev/shm, effectively treating the log file as persistent memory. This should speed up "./mtr --mem" and increase the test coverage of PMEM on non-PMEM hardware. It also allows users to estimate how much the performance would be improved by installing persistent memory. On other tmpfs file systems such as /run, we will not use mmap(). mariadb-backup: Eliminated several variables. We will refer directly to recv_sys and log_sys. backup_wait_for_lsn(): Detect non-progress of xtrabackup_copy_logfile(). In this new log format with arbitrary-sized blocks, we can only detect log file overrun indirectly, by observing that the scanned log sequence number is not advancing. xtrabackup_copy_logfile(): On PMEM, do not modify the sequence bit, because we are not allowed to modify the server's log file, and our memory mapping is read-only. trx_flush_log_if_needed_low(): Do not use the callback on pmem. Using neither flush_lock nor write_lock around PMEM writes seems to yield the best performance. The pmem_persist() calls may still be somewhat slower than the pwrite() and fdatasync() based interface (PMEM mounted without -o dax). recv_sys_t::buf: Remove. We will use log_sys.buf for parsing. recv_sys_t::MTR_SIZE_MAX: Replaces RECV_SCAN_SIZE. recv_sys_t::file_checkpoint: Renamed from mlog_checkpoint_lsn. recv_sys_t, log_sys_t: Removed many data members. recv_sys.lsn: Renamed from recv_sys.recovered_lsn. recv_sys.offset: Renamed from recv_sys.recovered_offset. log_sys.buf_size: Replaces srv_log_buffer_size. recv_buf: A smart pointer that wraps log_sys.buf[recv_sys.offset] when the buffer is being allocated from the memory heap. recv_ring: A smart pointer that wraps a circular log_sys.buf[] that is backed by ib_logfile0. The pointer will wrap from recv_sys.len (log_sys.file_size) to log_sys.START_OFFSET. For the record that wraps around, we may copy file name or record payload data to the auxiliary buffer decrypt_buf in order to have a contiguous block of memory. The maximum size of a record is less than innodb_page_size bytes. recv_sys_t::parse(): Take the smart pointer as a template parameter. Do not temporarily add a trailing NUL byte to FILE_ records, because we are not supposed to modify the memory-mapped log file. (It is attached in read-write mode already during recovery.) recv_sys_t::parse_mtr(): Wrapper for recv_sys_t::parse(). recv_sys_t::parse_pmem(): Like parse_mtr(), but if PREMATURE_EOF would be returned on PMEM, use recv_ring to wrap around the buffer to the start. mtr_t::finish_write(), log_close(): Do not enforce log_sys.max_buf_free on PMEM, because it has no meaning on the mmap-based log. log_sys.write_to_buf: Count writes to log_sys.buf. Replaces srv_stats.log_write_requests and export_vars.innodb_log_write_requests. Protected by log_sys.mutex. Updated consistently in log_close(). Previously, mtr_t::commit() conditionally updated the count, which was inconsistent. log_sys.write_to_log: Count swaps of log_sys.buf and log_sys.flush_buf, for writing to log_sys.log (the ib_logfile0). Replaces srv_stats.log_writes and export_vars.innodb_log_writes. Protected by log_sys.mutex. log_sys.waits: Count waits in append_prepare(). Replaces srv_stats.log_waits and export_vars.innodb_log_waits. recv_recover_page(): Do not unnecessarily acquire log_sys.flush_order_mutex. We are inserting the blocks in arbitary order anyway, to be adjusted in recv_sys.apply(true). We will change the definition of flush_lock and write_lock to avoid potential false sharing. Depending on sizeof(log_sys) and CPU_LEVEL1_DCACHE_LINESIZE, the flush_lock and write_lock could share a cache line with each other or with the last data members of log_sys. Thanks to Matthias Leich for providing https://rr-project.org traces for various failures during the development, and to Thirunarayanan Balathandayuthapani for his help in debugging some of the recovery code. And thanks to the developers of the rr debugger for a tool without which extensive changes to InnoDB would be very challenging to get right. Thanks to Vladislav Vaintroub for useful feedback and to him, Axel Schwenke and Krunal Bauskar for testing the performance.
* | | | | Merge 10.5 into 10.6Marko Mäkelä2021-10-272-2/+2
|\ \ \ \ \ | |/ / / /
| * | | | Merge 10.4 into 10.5st-10.5-mergeMarko Mäkelä2021-10-272-2/+2
| |\ \ \ \ | | |/ / /