path: root/storage/innobase/include/page0page.h
Commit message / Author / Age / Files / Lines
* MDEV-29694 Remove the InnoDB change buffer (Marko Mäkelä, 2023-01-11, 1 file, -17/+2)
|
| The purpose of the change buffer was to reduce random disk access, which could be useful on rotational storage, but maybe less so on solid-state storage.
|
| When we wished to (1) insert a record into a non-unique secondary index, (2) delete-mark a secondary index record, or (3) delete a secondary index record as part of purge (but not ROLLBACK), and the B-tree leaf page where the record belongs was not in the buffer pool, we inserted a record into the change buffer B-tree, indexed by the page identifier. When the page was eventually read into the buffer pool, we looked up the change buffer B-tree for any modifications to the page, and applied these upon the completion of the read operation. This was called the insert buffer merge.
|
| We remove the change buffer, because it has been the source of various hard-to-reproduce corruption bugs, including those fixed in commit 5b9ee8d8193a8c7a8ebdd35eedcadc3ae78e7fc1 and commit 165564d3c33ae3d677d70644a83afcb744bdbf65 but not limited to them.
|
| A downgrade will fail with a clear message starting with commit db14eb16f9977453467ec4765f481bb2f71814ba (MDEV-30106).
|
| buf_page_t::state: Merge IBUF_EXIST to UNFIXED and WRITE_FIX_IBUF to WRITE_FIX.
|
| buf_pool_t::watch[]: Remove.
|
| trx_t: Move isolation_level, check_foreigns, check_unique_secondary, bulk_insert into the same bit-field. The only purpose of trx_t::check_unique_secondary is to enable bulk insert into an empty table. It no longer enables insert buffering for UNIQUE INDEX.
|
| btr_cur_t::thr: Remove. This field was originally needed for change buffering. Later, its use was extended to cover SPATIAL INDEX. Much of the time, rtr_info::thr holds this field. When it does not, we will add parameters to SPATIAL INDEX specific functions.
|
| ibuf_upgrade_needed(): Check if the change buffer needs to be updated.
|
| ibuf_upgrade(): Merge and upgrade the change buffer after all redo log has been applied. Free any pages consumed by the change buffer, and zero out the change buffer root page to mark the upgrade completed, and to prevent a downgrade to an earlier version.
|
| dict_load_tablespaces(): Renamed from dict_check_tablespaces_and_store_max_id(). This needs to be invoked before ibuf_upgrade().
|
| btr_cur_open_at_rnd_pos(): Specialize for use in persistent statistics. The change buffer merge does not need this function anymore.
|
| btr_page_alloc(): Renamed from btr_page_alloc_low(). We no longer allocate any change buffer pages.
|
| row_search_index_entry(), btr_lift_page_up(): Add a parameter thr for the SPATIAL INDEX case.
|
| rtr_page_split_and_insert(): Specialized from btr_page_split_and_insert().
|
| rtr_root_raise_and_insert(): Specialized from btr_root_raise_and_insert().
|
| Note: The support for upgrading from the MySQL 3.23 or MySQL 4.0 change buffer format that predates the MySQL 4.1 introduction of the option innodb_file_per_table was removed in MySQL 5.6.5 as part of mysql/mysql-server@69b6241a79876ae98bb0c9dce7c8d8799d6ad273 and MariaDB 10.0.11 as part of 1d0f70c2f894b27e98773a282871d32802f67964.
|
| In the tests innodb.log_upgrade and innodb.log_corruption, we create valid (upgraded) change buffer pages.
|
| Tested by: Matthias Leich
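The buffering scheme described in the MDEV-29694 message (changes queued per page identifier, then applied when the page is read in) can be illustrated with a toy model. This is only a sketch: ToyChangeBuffer, PageId, Op and BufferedChange are hypothetical names that do not exist in InnoDB, and an in-memory std::map stands in for the change buffer B-tree.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy types; none of these names exist in InnoDB.
using PageId = std::pair<uint32_t, uint32_t>;  // (tablespace id, page number)

enum class Op { Insert, DeleteMark, Delete };

struct BufferedChange {
  Op op;
  std::string key;  // stands in for the secondary index record
};

class ToyChangeBuffer {
  // Ordered by page identifier, like the change buffer B-tree in the text.
  std::map<PageId, std::vector<BufferedChange>> pending_;

public:
  // Called when the target leaf page is not in the (toy) buffer pool.
  void buffer(PageId id, Op op, std::string key) {
    pending_[id].push_back({op, std::move(key)});
  }

  // Called when the page has been read in: hand back the buffered changes
  // in their original order and forget them ("merge").
  std::vector<BufferedChange> merge(PageId id) {
    std::vector<BufferedChange> out;
    auto it = pending_.find(id);
    if (it != pending_.end()) {
      out = std::move(it->second);
      pending_.erase(it);
    }
    return out;
  }

  std::size_t pages_with_changes() const { return pending_.size(); }
};
```

A caller would invoke buffer() instead of reading the page, and merge() once the page read completes. The real implementation also had to survive crashes and interact with page latching, which this sketch ignores.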
* Merge 10.7 into 10.8 (Marko Mäkelä, 2022-11-09, 1 file, -22/+0)
|\
| * MDEV-28797 Assertion `page_rec_is_user_rec(rec)' failed in PageBulk::getSplitRec (Thirunarayanan Balathandayuthapani, 2022-11-08, 1 file, -22/+0)
| |
| | - During an ALTER operation on a compressed table, the page split operation chooses the first record of the page as the split record, which leads to an empty left page. This issue was caused by commit 77b3959b5c1528f33ada7aa4445cccf5b5e197b0 (MDEV-28457).
| |
| | page_rec_is_second(), page_rec_is_second_last(): Removed the functions, since they are dead code.
* | Merge 10.7 into 10.8 (Marko Mäkelä, 2022-08-02, 1 file, -48/+31)
|\ \
| |/
| * MDEV-21098: Assertion failure in rec_get_offsets_func() (Marko Mäkelä, 2022-08-01, 1 file, -48/+31)
| |
| | The function rec_get_offsets_func() used to hit ut_error due to an invalid rec_get_status() value of a ROW_FORMAT!=REDUNDANT record. This fix is twofold: We will not only avoid a crash on corruption in this case, but we will also make more effort to validate each record every time we are iterating over index page records.
| |
| | rec_get_offsets_func(): Do not crash on a corrupted record.
| |
| | page_rec_get_nth(): Return nullptr on error.
| |
| | page_dir_slot_get_rec_validate(): Like page_dir_slot_get_rec(), but validate the pointer and return nullptr on error.
| |
| | page_cur_search_with_match(), page_cur_search_with_match_bytes(), page_dir_split_slot(), page_cur_move_to_next(): Indicate failure in a return value.
| |
| | page_cur_search(): Replaced with page_cur_search_with_match().
| |
| | rec_get_next_ptr_const(), rec_get_next_ptr(): Replaced with page_rec_get_next_low().
| |
| | TODO: rtr_page_split_initialize_nodes(), rtr_update_mbr_field(), and possibly other SPATIAL INDEX functions fail to properly handle errors.
| |
| | Reviewed by: Thirunarayanan Balathandayuthapani
| | Tested by: Matthias Leich
| | Performance tested by: Axel Schwenke
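The idea in MDEV-21098 of validating every record while iterating over a page, and reporting failure instead of crashing, can be sketched on a toy page layout. Everything here is an assumption made for illustration: the 2-byte big-endian "next" offset, the value 0 as end marker, and the names TOY_HEADER, Step, toy_next and toy_count are hypothetical and are not the InnoDB record format or API.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy layout: every record starts with a 2-byte big-endian
// offset of the next record; 0 marks the end of the list.
constexpr std::size_t TOY_HEADER = 4;  // offsets below this are never valid

enum class Step { Ok, End, Corrupted };

// Advance 'off' to the next record, validating both the current and the
// stored offset instead of trusting them.
Step toy_next(const std::vector<uint8_t>& page, std::size_t& off) {
  if (off < TOY_HEADER || off + 2 > page.size()) return Step::Corrupted;
  const std::size_t next = (std::size_t{page[off]} << 8) | page[off + 1];
  if (next == 0) return Step::End;
  if (next < TOY_HEADER || next + 2 > page.size()) return Step::Corrupted;
  off = next;
  return Step::Ok;
}

// Count the records reachable from 'first'. Returns -1 on a bad offset or
// on a cycle, so that the caller can report corruption.
long toy_count(const std::vector<uint8_t>& page, std::size_t first) {
  std::size_t off = first;
  long n = 1;
  // A valid list cannot have more records than the page has bytes.
  for (std::size_t steps = 0; steps <= page.size(); ++steps) {
    switch (toy_next(page, off)) {
      case Step::End: return n;
      case Step::Corrupted: return -1;
      case Step::Ok: ++n; break;
    }
  }
  return -1;  // too many steps: the "next" pointers form a cycle
}
```

The step limit plays the role of a sanity bound: without it, a cycle of individually valid offsets would loop forever rather than being reported.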
* | Merge 10.7 into 10.8 (Marko Mäkelä, 2022-06-09, 1 file, -11/+13)
|\ \
| |/
| * MDEV-28457 Crash in page_dir_find_owner_slot() (Marko Mäkelä, 2022-06-08, 1 file, -11/+13)
| |
| | A prominent remaining source of crashes on corrupted index pages is page directory corruption. A frequent caller of page_dir_find_owner_slot() is page_rec_get_prev(). Some of those calls can be replaced with simpler logic that is less prone to fail.
| |
| | page_dir_find_owner_slot(), page_rec_get_prev(), page_rec_get_prev_const(), btr_pcur_move_to_prev(), btr_pcur_move_to_prev_on_page(), btr_cur_upd_rec_sys(), page_delete_rec_list_end(), rtr_page_copy_rec_list_end_no_locks(), rtr_page_copy_rec_list_start_no_locks(): Return an error code on failure.
| |
| | fil_space_t::io(), buf_page_get_low(): Use DB_CORRUPTION for out-of-bounds page reads.
| |
| | PageBulk::getSplitRec(), PageBulk::copyOut(): Simplify the code.
| |
| | btr_validate_level(): Prevent some more CHECK TABLE crashes on corrupted pages.
| |
| | btr_block_get(), btr_pcur_move_to_next_page(): Implement some checks that were previously only part of IndexPurge::next().
| |
| | IndexPurge::next(): Use btr_pcur_move_to_next_page().
* | Merge 10.7 into 10.8 (Marko Mäkelä, 2022-06-06, 1 file, -51/+18)
|\ \
| |/
| * MDEV-13542: Crashing on corrupted page is unhelpful (Marko Mäkelä, 2022-06-06, 1 file, -51/+18)
| |
| | The approach to handling corruption that was chosen by Oracle in commit 177d8b0c125b841c0650d27d735e3b87509dc286 is not really useful. Not only did it actually fail to prevent InnoDB from crashing, but it is making things worse by blocking attempts to rescue data from or rebuild a partially readable table.
| |
| | We will try to prevent crashes in a different way: by propagating errors up the call stack. We will never mark the clustered index persistently corrupted, so that data recovery may be attempted by reading from the table, or by rebuilding the table.
| |
| | This should also fix MDEV-13680 (crash on btr_page_alloc() failure); it was extensively tested with innodb_file_per_table=0 and a non-autoextend system tablespace.
| |
| | We should now avoid crashes in many cases, such as when a page cannot be read or allocated, or an inconsistency is detected when attempting to update multiple pages. We will not crash on double-free, such as on the recovery of DDL in system tablespace in case something was corrupted.
| |
| | Crashes on corrupted data are still possible. The fault injection mechanism that is introduced in the subsequent commit may help catch more of them.
| |
| | buf_page_import_corrupt_failure: Remove the fault injection, and instead corrupt some pages using Perl code in the tests.
| |
| | btr_cur_pessimistic_insert(): Always reserve extents (except for the change buffer), in order to prevent a subsequent allocation failure.
| |
| | btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages().
| |
| | btr_assert_not_corrupted(), btr_corruption_report(): Remove. Similar checks are already part of btr_block_get().
| |
| | FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE.
| |
| | dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(), trx_undo_page_get_s_latched(): Replaced with error-checking calls.
| |
| | trx_rseg_t::get(mtr_t*): Replaces trx_rsegf_get().
| |
| | trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed.
| |
| | trx_sys_create_sys_pages(): Merged with trx_sysf_create().
| |
| | dict_check_tablespaces_and_store_max_id(): Do not access DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot(). Merge dict_check_sys_tables() with this function.
| |
| | dir_pathname(): Replaces os_file_make_new_pathname().
| |
| | row_undo_ins_remove_sec(): Do not modify the undo page by adding a terminating NUL byte to the record.
| |
| | btr_decryption_failed(): Report decryption failures.
| |
| | dict_set_corrupted_by_space(), dict_set_encrypted_by_space(), dict_set_corrupted_index_cache_only(): Remove.
| |
| | dict_set_corrupted(): Remove the constant parameter dict_locked=false. Never flag the clustered index corrupted in SYS_INDEXES, because that would deny further access to the table. It might be possible to repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case no B-tree leaf page is corrupted.
| |
| | dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(), row_purge_skip_uncommitted_virtual_index(): Remove, and refactor the callers to read dict_index_t::type only once.
| |
| | dict_table_is_corrupted(): Remove.
| |
| | dict_index_t::is_btree(): Determine if the index is a valid B-tree.
| |
| | BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove.
| |
| | UNIV_BTR_DEBUG: Remove. Inconsistencies will no longer trigger assertion failures; error codes will be returned instead.
| |
| | buf_corrupt_page_release(): Replaced with a direct call to buf_pool.corrupted_evict().
| |
| | fil_invalid_page_access_msg(): Never crash on an invalid read; let the caller of buf_page_get_gen() decide.
| |
| | btr_pcur_t::restore_position(): Propagate failure status to the caller by returning CORRUPTED.
| |
| | opt_search_plan_for_table(): Simplify the code.
| |
| | row_purge_del_mark(), row_purge_upd_exist_or_extern_func(), row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(), row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free() when no secondary indexes exist.
| |
| | row_undo_mod_upd_exist_sec(): Simplify the code.
| |
| | row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT if the clustered index (and therefore the table) is corrupted, similar to what we do in row_insert_for_mysql().
| |
| | fut_get_ptr(): Replace with buf_page_get_gen() calls.
| |
| | buf_page_get_gen(): Return nullptr and *err=DB_CORRUPTION if the page is marked as freed. For other modes than BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED, we will return nullptr for freed pages, so that the callers can be simplified. The purge of transaction history will be a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on corrupted data.
| |
| | buf_page_get_low(): Never crash on a corrupted page, but simply return nullptr.
| |
| | fseg_page_is_allocated(): Replaces fseg_page_is_free().
| |
| | fts_drop_common_tables(): Return an error if the transaction was rolled back.
| |
| | fil_space_t::set_corrupted(): Report a tablespace as corrupted if it was not reported already.
| |
| | fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report out-of-bounds page access or other errors.
| |
| | Clean up mtr_t::page_lock().
| |
| | buf_page_get_low(): Validate the page identifier (to check for recently read corrupted pages) after acquiring the page latch.
| |
| | buf_page_t::read_complete(): Flag uninitialized (all-zero) pages with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch.
| |
| | mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi().
| |
| | recv_sys_t::free_corrupted_page(): Only set_corrupt_fs() if any log records exist for the page. We do not mind if read-ahead produces corrupted (or all-zero) pages that were not actually needed during recovery.
| |
| | recv_recover_page(): Return whether the operation succeeded.
| |
| | recv_sys_t::recover_low(): Simplify the logic. Check for recovery error.
| |
| | Thanks to Matthias Leich for testing this extensively and to the authors of https://rr-project.org for making it easy to diagnose and fix any failures that were found during the testing.
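The pattern that MDEV-13542 describes, returning nullptr together with an error code and letting each caller pass the failure upwards, might look roughly as follows. This is a hedged sketch, not the InnoDB code: toy_err_t, ToyPage, ToyPool, toy_page_get and toy_sum_pages are hypothetical names, and a hash map stands in for the buffer pool.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical error codes, loosely modelled on DB_SUCCESS / DB_CORRUPTION.
enum toy_err_t { TOY_SUCCESS, TOY_CORRUPTION };

struct ToyPage {
  bool corrupted;
  int payload;
};

using ToyPool = std::unordered_map<uint32_t, ToyPage>;

// Low level: never abort. Return nullptr and set *err when the page is
// missing or flagged as corrupted.
const ToyPage* toy_page_get(const ToyPool& pool, uint32_t page_no,
                            toy_err_t* err) {
  auto it = pool.find(page_no);
  if (it == pool.end() || it->second.corrupted) {
    *err = TOY_CORRUPTION;
    return nullptr;
  }
  *err = TOY_SUCCESS;
  return &it->second;
}

// Higher level: stop at the first failure and hand the error code to the
// caller, which decides what to report to the user.
toy_err_t toy_sum_pages(const ToyPool& pool, uint32_t first, uint32_t last,
                        long* sum) {
  *sum = 0;
  for (uint32_t p = first; p <= last; ++p) {
    toy_err_t err;
    const ToyPage* page = toy_page_get(pool, p, &err);
    if (!page) return err;  // propagate instead of crashing
    *sum += page->payload;
  }
  return TOY_SUCCESS;
}
```

With this shape, an operation that touches several pages can fail cleanly partway through, which is what allows reads from a partially readable table to proceed as far as possible.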
* | Merge branch '10.7' into 10.8 (Oleksandr Byelkin, 2022-02-04, 1 file, -1/+1)
|\ \
| |/
| * Merge branch '10.5' into 10.6 (Oleksandr Byelkin, 2022-02-03, 1 file, -1/+1)
| |\
| | * Merge branch '10.4' into 10.5 (Oleksandr Byelkin, 2022-02-01, 1 file, -1/+1)
| | |\
| | | * Merge branch '10.3' into 10.4 (Oleksandr Byelkin, 2022-01-30, 1 file, -1/+1)
| | | |\
| | | | * Merge branch '10.2' into 10.3 [ref: mariadb-10.3.33] (Oleksandr Byelkin, 2022-01-29, 1 file, -1/+1)
| | | | |\
| | | | | * MDEV-27494 Rename .ic files to .inl (Vladislav Vaintroub, 2022-01-17, 1 file, -1/+1)
| | | | | |
* | | | | | MDEV-26938 Support descending indexes internally in InnoDB (Marko Mäkelä, 2022-01-26, 1 file, -1/+1)
|/ / / / /
| | | | |
| | | | | This is loosely based on the InnoDB changes in mysql/mysql-server@97fd8b1b6993340b361fa7f85da86a308f0b5e0c that I had developed in 2015 or 2016.
| | | | |
| | | | | For each B-tree key field, we will allow a flag ASC/DESC to be associated. When PRIMARY KEY fields are internally appended to secondary indexes, the ASC/DESC attribute will be inherited, so that covering index scans will work as expected.
| | | | |
| | | | | Note: Until the subsequent commit, the DESC attribute will be ignored (no HA_REVERSE_SORT flag will be written to .frm files).
| | | | |
| | | | | dict_field_t::descending: A new flag to denote descending order.
| | | | |
| | | | | cmp_data(), cmp_dfield_dfield(): Add a new parameter descending.
| | | | |
| | | | | cmp_dtuple_rec(), cmp_dtuple_rec_with_match(): Add a parameter "index".
| | | | |
| | | | | dtuple_coll_eq(): Replaces dtuple_coll_cmp().
| | | | |
| | | | | cmp_dfield_dfield_eq_prefix(): Replaces cmp_dfield_dfield_like_prefix().
| | | | |
| | | | | dict_index_t::is_btree(): Check whether the index is a regular B-tree index (not SPATIAL, FULLTEXT, the ibuf.index, or a corrupted index).
| | | | |
| | | | | btr_cur_search_to_nth_level_func(): Only attempt to use the adaptive hash index if index->is_btree(). This function may also be invoked on ibuf.index, and cmp_dtuple_rec_with_match_bytes() will no longer work on ibuf.index because it assumes that the index and record fields exactly match. The ibuf.index is a special variadic index tree.
| | | | |
| | | | | Thanks to Thirunarayanan Balathandayuthapani for fixing some bugs: MDEV-27439, MDEV-27374/MDEV-27445.
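A per-field ASC/DESC flag, as introduced by MDEV-26938, amounts to negating the comparison result of each descending field. The sketch below shows that idea on integer tuples. The names toy_cmp_field and toy_cmp_tuple are hypothetical, and the real cmp_data() and cmp_dfield_dfield() work on typed column data rather than on int.

```cpp
#include <cstddef>
#include <vector>

// Compare one key field; a descending field simply inverts the sign.
int toy_cmp_field(int a, int b, bool descending) {
  const int c = (a > b) - (a < b);  // -1, 0 or 1
  return descending ? -c : c;
}

// Compare two tuples field by field; descending[i] is the flag of field i.
// Returns a negative value if 'a' sorts first, positive if 'b' sorts first.
int toy_cmp_tuple(const std::vector<int>& a, const std::vector<int>& b,
                  const std::vector<bool>& descending) {
  for (std::size_t i = 0;
       i < a.size() && i < b.size() && i < descending.size(); ++i) {
    if (int c = toy_cmp_field(a[i], b[i], descending[i])) return c;
  }
  return 0;  // equal on all compared fields
}
```

For an index on (a ASC, b DESC), the tuple (1, 5) sorts before (1, 3) under this comparison, which is the ordering that a scan of such an index would produce.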
* | | | | MDEV-27058: Reduce the size of buf_block_t and buf_page_t (Marko Mäkelä, 2021-11-18, 1 file, -2/+2)
|/ / / /
| | | |
| | | | buf_page_t::frame: Moved from buf_block_t::frame. All 'thin' buf_page_t describing compressed-only ROW_FORMAT=COMPRESSED pages will have frame=nullptr, while all 'fat' buf_block_t will have a non-null frame pointing to aligned innodb_page_size bytes. This eliminates the need for separate states for BUF_BLOCK_FILE_PAGE and BUF_BLOCK_ZIP_PAGE.
| | | |
| | | | buf_page_t::lock: Moved from buf_block_t::lock. That is, all block descriptors will have a page latch. The IO_PIN state that was used for discarding or creating the uncompressed page frame of a ROW_FORMAT=COMPRESSED block is replaced by a combination of read-fix and page X-latch.
| | | |
| | | | page_zip_des_t::fix: Replaces state_, buf_fix_count_, io_fix_, status of buf_page_t with a single std::atomic<uint32_t>. All modifications will use store(), fetch_add(), fetch_sub(). This space was previously wasted to alignment on 64-bit systems. We will use the following encoding that combines a state (partly read-fix or write-fix) and a buffer-fix count:
| | | |
| | | |   buf_page_t::NOT_USED=0 (previously BUF_BLOCK_NOT_USED)
| | | |   buf_page_t::MEMORY=1 (previously BUF_BLOCK_MEMORY)
| | | |   buf_page_t::REMOVE_HASH=2 (previously BUF_BLOCK_REMOVE_HASH)
| | | |   buf_page_t::FREED=3 + fix: pages marked as freed in the file
| | | |   buf_page_t::UNFIXED=1U<<29 + fix: normal pages
| | | |   buf_page_t::IBUF_EXIST=2U<<29 + fix: normal pages; may need ibuf merge
| | | |   buf_page_t::REINIT=3U<<29 + fix: reinitialized pages (skip doublewrite)
| | | |   buf_page_t::READ_FIX=4U<<29 + fix: read-fixed pages (also X-latched)
| | | |   buf_page_t::WRITE_FIX=5U<<29 + fix: write-fixed pages (also U-latched)
| | | |   buf_page_t::WRITE_FIX_IBUF=6U<<29 + fix: write-fixed; may have ibuf
| | | |   buf_page_t::WRITE_FIX_REINIT=7U<<29 + fix: write-fixed (no doublewrite)
| | | |
| | | | buf_page_t::write_complete(): Change WRITE_FIX or WRITE_FIX_REINIT to UNFIXED, and WRITE_FIX_IBUF to IBUF_EXIST, before releasing the U-latch.
| | | |
| | | | buf_page_t::read_complete(): Renamed from buf_page_read_complete(). Change READ_FIX to UNFIXED or IBUF_EXIST, before releasing the X-latch.
| | | |
| | | | buf_page_t::can_relocate(): If the page latch is being held or waited for, or the block is buffer-fixed or io-fixed, return false. (The condition on the page latch is new.)
| | | |
| | | | Outside buf_page_get_gen(), buf_page_get_low() and buf_page_free(), we will acquire the page latch before fix(), and unfix() before unlocking.
| | | |
| | | | buf_page_t::flush(): Replaces buf_flush_page(). Optimize the handling of FREED pages.
| | | |
| | | | buf_pool_t::release_freed_page(): Assume that buf_pool.mutex is held by the caller.
| | | |
| | | | buf_page_t::is_read_fixed(), buf_page_t::is_write_fixed(): New predicates.
| | | |
| | | | buf_page_get_low(): Ignore guesses that are read-fixed because they may not yet be registered in buf_pool.page_hash and buf_pool.LRU.
| | | |
| | | | buf_page_optimistic_get(): Acquire latch before buffer-fixing.
| | | |
| | | | buf_page_make_young(): Leave read-fixed blocks alone, because they might not be registered in buf_pool.LRU yet.
| | | |
| | | | recv_sys_t::recover_deferred(), recv_sys_t::recover_low(): Possibly fix MDEV-26326, by holding a page X-latch instead of only buffer-fixing the page.
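The combined state and buffer-fix encoding listed in MDEV-27058 can be tried out in isolation. The constants below are copied from the commit message; the toy namespace, the PageState struct and the decoding helpers are assumptions made for this sketch and may differ from the actual buf_page_t members.

```cpp
#include <atomic>
#include <cstdint>

namespace toy {
// Values taken from the encoding listed in the commit message.
constexpr uint32_t FREED = 3;
constexpr uint32_t UNFIXED = 1U << 29;
constexpr uint32_t READ_FIX = 4U << 29;
constexpr uint32_t WRITE_FIX = 5U << 29;
constexpr uint32_t WRITE_FIX_IBUF = 6U << 29;

// Assumed decoding: the state lives in the top 3 bits, so range checks on
// the whole word identify it regardless of the fix count in the low bits.
inline bool is_read_fixed(uint32_t s) { return s >= READ_FIX && s < WRITE_FIX; }
inline bool is_write_fixed(uint32_t s) { return s >= WRITE_FIX; }

// Assumed decoding of the buffer-fix count from a state word.
inline uint32_t fix_count(uint32_t s) {
  if (s >= UNFIXED) return s & (UNFIXED - 1);
  return s >= FREED ? s - FREED : 0;
}

struct PageState {
  std::atomic<uint32_t> word{UNFIXED};

  void fix() { word.fetch_add(1); }
  void unfix() { word.fetch_sub(1); }
  // READ_FIX -> UNFIXED in one atomic step, preserving the fix count.
  void read_complete() { word.fetch_sub(READ_FIX - UNFIXED); }
};
}  // namespace toy
```

Because a state transition is a single fetch_sub() on the same word that holds the count, concurrent fix() and unfix() calls cannot be lost during the transition, which is presumably part of the appeal of packing both into one atomic.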
* | | | Merge 10.4 into 10.5 (Marko Mäkelä, 2021-10-27, 1 file, -1/+1)
|\ \ \ \
| |/ / /
| * | | MDEV-18543 fixup: Fix 32-bit builds (Marko Mäkelä, 2021-10-27, 1 file, -3/+3)
| | | |
| * | | Merge 10.3 into 10.4, except MDEV-22543 (Marko Mäkelä, 2020-08-13, 1 file, -14/+0)
| |\ \ \
| | |/ /
| | | |
| | | | Also, fix GCC -Og -Wmaybe-uninitialized in run_backup_stage()
| | * | Merge 10.2 into 10.3 (Marko Mäkelä, 2020-08-13, 1 file, -14/+0)
| | |\ \
| | | |/
| | | * Merge 10.1 into 10.2 (Marko Mäkelä, 2020-08-13, 1 file, -14/+0)
| | | |\
| | | | * MDEV-19526 heap number overflow on innodb_page_size=64k (Marko Mäkelä, 2020-08-12, 1 file, -16/+1)
| | | | |
| | | | | InnoDB only reserves 13 bits for the heap number in the record header, limiting the heap number to be at most 8191. But, when using innodb_page_size=64k and secondary index records of 7 bytes each, it is possible to exceed the maximum heap number.
| | | | |
| | | | | btr_cur_optimistic_insert(): Let the operation fail if the maximum number of records would be exceeded.
| | | | |
| | | | | page_mem_alloc_heap(): Move to the same compilation unit with the only caller, and let the operation fail if the maximum heap number has been allocated already.
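The arithmetic behind MDEV-19526 can be checked at compile time. The sketch below assumes, for simplicity, that the whole 64KiB page could be filled with 7-byte records (page header and directory overhead are ignored), and toy_alloc_heap_no is a hypothetical stand-in for the check that the fix adds, not the actual page_mem_alloc_heap() logic.

```cpp
#include <cstdint>

constexpr unsigned TOY_HEAP_NO_BITS = 13;
constexpr uint32_t TOY_MAX_HEAP_NO = (1U << TOY_HEAP_NO_BITS) - 1;
static_assert(TOY_MAX_HEAP_NO == 8191, "13 bits hold heap numbers 0..8191");

// Simplifying assumption: ignore all per-page overhead.
constexpr uint32_t TOY_PAGE_SIZE = 64 * 1024;
constexpr uint32_t TOY_REC_SIZE = 7;
static_assert(TOY_PAGE_SIZE / TOY_REC_SIZE > TOY_MAX_HEAP_NO,
              "a 64k page could hold more 7-byte records than heap numbers");

// Hand out the next heap number, or fail once the 13-bit field is exhausted.
// 'n_heap' is the number of heap records allocated so far.
bool toy_alloc_heap_no(uint32_t n_heap, uint32_t* heap_no) {
  if (n_heap > TOY_MAX_HEAP_NO) return false;  // would not fit in 13 bits
  *heap_no = n_heap;
  return true;
}
```

Under these assumptions 65536 / 7 gives 9362 candidate records, well above 8191, so the insert has to be able to fail rather than silently wrap the heap number.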
* | | | | Cleanup: Make InnoDB page numbers uint32_t (Marko Mäkelä, 2020-10-15, 1 file, -2/+2)
| | | | |
| | | | | InnoDB stores a 32-bit page number in page headers and in some data structures, such as FIL_ADDR (consisting of a 32-bit page number and a 16-bit byte offset within a page). For better compile-time error detection and to reduce the memory footprint in some data structures, let us use a uint32_t for the page number, instead of ulint (size_t) which can be 64 bits.
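A file address made of a 32-bit page number and a 16-bit byte offset, as described above, fits in 6 bytes. The sketch below serializes such a pair in big-endian order. ToyFilAddr, toy_addr_write and toy_addr_read are hypothetical names, and the byte order is an assumption of this example rather than something stated in the commit message.

```cpp
#include <array>
#include <cstdint>

// Hypothetical in-memory form of a file address.
struct ToyFilAddr {
  uint32_t page;     // page number: 32 bits, not size_t
  uint16_t boffset;  // byte offset within the page
};

// Serialize to 6 bytes, most significant byte first (assumed byte order).
std::array<uint8_t, 6> toy_addr_write(ToyFilAddr a) {
  return {{uint8_t(a.page >> 24), uint8_t(a.page >> 16), uint8_t(a.page >> 8),
           uint8_t(a.page), uint8_t(a.boffset >> 8), uint8_t(a.boffset)}};
}

// Inverse of toy_addr_write().
ToyFilAddr toy_addr_read(const std::array<uint8_t, 6>& b) {
  ToyFilAddr a;
  a.page = (uint32_t{b[0]} << 24) | (uint32_t{b[1]} << 16) |
           (uint32_t{b[2]} << 8) | uint32_t{b[3]};
  a.boffset = uint16_t((b[4] << 8) | b[5]);
  return a;
}
```

Using uint32_t for the page number means that passing a 64-bit size_t where a page number is expected can draw a narrowing diagnostic from the compiler, which matches the stated goal of better compile-time error detection.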
* | | | | Merge 10.4 into 10.5 (Marko Mäkelä, 2020-05-05, 1 file, -2/+2)
|\ \ \ \ \
| |/ / / /
| * | | | Merge 10.3 into 10.4 (Marko Mäkelä, 2020-05-05, 1 file, -3/+3)
| |\ \ \ \
| | |/ / /
| | * | | Merge branch '10.2' into 10.3 (Oleksandr Byelkin, 2020-05-04, 1 file, -3/+3)
| | |\ \ \
| | | |/ /
| | | * | MDEV-21595: innodb offset_t rename to rec_offs (Daniel Black, 2020-04-29, 1 file, -3/+3)
| | | | |
| | | | | thanks to: perl -i -pe 's/\boffset_t\b/rec_offs/g' $(git grep -lw offset_t storage/innobase)
* | | | | MDEV-22126 Rename confusing constant mtr_t::OPT (Marko Mäkelä, 2020-04-03, 1 file, -2/+2)
| | | | |
| | | | | The template parameter mtr_t::OPT refers to optional, not optimized. Also the default parameter mtr_t::NORMAL refers to optimized writes. The name MAYBE_NOP would be more descriptive, conveying the idea that a write to a durable page might not actually have any effect.
* | | | | MDEV-12353 Cleanup: Remove page_rec_get_base_extra_size() (Marko Mäkelä, 2020-02-27, 1 file, -9/+0)
| | | | |
| | | | | The function page_rec_get_base_extra_size() became dead code in commit 08ba388713946c03aa591899cd3a446a6202f882.
* | | | | MDEV-12353: Reduce log volume of page_cur_delete_rec() (Marko Mäkelä, 2020-02-22, 1 file, -2/+2)
| | | | |
| | | | | mrec_ext_t: Introduce DELETE_ROW_FORMAT_REDUNDANT, DELETE_ROW_FORMAT_DYNAMIC.
| | | | |
| | | | | mtr_t::page_delete(): Write DELETE_ROW_FORMAT_REDUNDANT or DELETE_ROW_FORMAT_DYNAMIC log records. We log the byte offset of the preceding record, so that on recovery we can easily find everything to update. For DELETE_ROW_FORMAT_DYNAMIC, we must also write the header and data size of the record.
| | | | |
| | | | | We will retain the physical logging for ROW_FORMAT=COMPRESSED pages.
| | | | |
| | | | | page_zip_dir_balance_slot(): Renamed from page_dir_balance_slot(), and specialized for ROW_FORMAT=COMPRESSED only.
| | | | |
| | | | | page_rec_set_n_owned(), page_dir_slot_set_n_owned(), page_dir_balance_slot(): New variants that do not write any log.
| | | | |
| | | | | page_mem_free(): Take data_size, extra_size as parameters. Always zerofill the record payload.
| | | | |
| | | | | page_cur_delete_rec(): For other than ROW_FORMAT=COMPRESSED, only write log by mtr_t::page_delete().
* | | | | MDEV-12353: Remove support for crash-upgrade [ref: bb-10.5-MDEV-12353] (Marko Mäkelä, 2020-02-13, 1 file, -16/+0)
| | | | |
| | | | | We tighten some assertions regarding dict_index_t::is_dummy and crash recovery, now that redo log processing will no longer create dummy objects.
* | | | | MDEV-12353: Change the redo log encoding (Marko Mäkelä, 2020-02-13, 1 file, -11/+2)
| | | | |
| | | | | log_t::FORMAT_10_5: physical redo log format tag
| | | | |
| | | | | log_phys_t: Buffered records in the physical format. The log record bytes will follow the last data field, making use of alignment padding that would otherwise be wasted. If there are multiple records for the same page, also those may be appended to an existing log_phys_t object if the memory is available.
| | | | |
| | | | | In the physical format, the first byte of a record identifies the record and its length (up to 15 bytes). For longer records, the immediately following bytes will encode the remaining length in a variable-length encoding. Usually, a variable-length-encoded page identifier will follow, followed by optional payload, whose length is included in the initially encoded total record length.
| | | | |
| | | | | When a mini-transaction is updating multiple fields in a page, it can avoid repeating the tablespace identifier and page number by setting the same_page flag (most significant bit) in the first byte of the log record. The byte offset of the record will be relative to where the previous record for that page ended.
| | | | |
| | | | | Until MDEV-14425 introduces a separate file-level log for redo log checkpoints and file operations, we will write the file-level records in the page-level redo log file. The record FILE_CHECKPOINT (which replaces MLOG_CHECKPOINT) will be removed in MDEV-14425, and one sequential scan of the page recovery log will suffice.
| | | | |
| | | | | Compared to MLOG_FILE_CREATE2, FILE_CREATE will not include any flags. If the information is needed, it can be parsed from WRITE records that modify FSP_SPACE_FLAGS.
| | | | |
| | | | | MLOG_ZIP_WRITE_STRING: Remove. The record was only introduced temporarily as part of this work, before being replaced with WRITE (along with MLOG_WRITE_STRING, MLOG_1BYTE, MLOG_nBYTES).
| | | | |
| | | | | mtr_buf_t::empty(): Check if the buffer is empty.
| | | | |
| | | | | mtr_t::m_n_log_recs: Remove. It suffices to check if m_log is empty.
| | | | |
| | | | | mtr_t::m_last, mtr_t::m_last_offset: End of the latest m_log record, for the same_page encoding.
| | | | |
| | | | | page_recv_t::last_offset: Reflects mtr_t::m_last_offset.
| | | | |
| | | | | Valid values for last_offset during recovery should be 0 or above 8. (The first 8 bytes of a page are the checksum and the page number, and neither is ever updated directly by log records.) Internally, the special value 1 indicates that the same_page form will not be allowed for the subsequent record.
| | | | |
| | | | | mtr_t::page_create(): Take the block descriptor as parameter, so that it can be compared to mtr_t::m_last. The INIT_INDEX_PAGE record will always be followed by a subtype byte, because same_page records must be longer than 1 byte.
| | | | |
| | | | | trx_undo_page_init(): Combine the writes in WRITE record.
| | | | |
| | | | | trx_undo_header_create(): Write 4 bytes using a special MEMSET record that includes 1 byte of length and 2 bytes of payload.
| | | | |
| | | | | flst_write_addr(): Define as a static function. Combine the writes.
| | | | |
| | | | | flst_zero_both(): Replaces two flst_zero_addr() calls.
| | | | |
| | | | | flst_init(): Do not inline the function.
| | | | |
| | | | | fsp_free_seg_inode(): Zerofill the whole inode.
| | | | |
| | | | | fsp_apply_init_file_page(): Initialize FIL_PAGE_PREV,FIL_PAGE_NEXT to FIL_NULL when using the physical format.
| | | | |
| | | | | btr_create(): Assert !page_has_siblings() because fsp_apply_init_file_page() must have been invoked.
| | | | |
| | | | | fil_ibd_create(): Do not write FILE_MODIFY after FILE_CREATE.
| | | | |
| | | | | fil_names_dirty_and_write(): Remove the parameter mtr. Write the records using a separate mini-transaction object, because any FILE_ records must be at the start of a mini-transaction log.
| | | | |
| | | | | recv_recover_page(): Add a fil_space_t* parameter. After applying log to a ROW_FORMAT=COMPRESSED page, invoke buf_zip_decompress() to restore the uncompressed page.
| | | | |
| | | | | buf_page_io_complete(): Remove the temporary hack to discard the uncompressed page of a ROW_FORMAT=COMPRESSED page.
| | | | |
| | | | | page_zip_write_header(): Remove. Use mtr_t::write() or mtr_t::memset() instead, and update the compressed page frame separately.
| | | | |
| | | | | trx_undo_header_add_space_for_xid(): Remove.
| | | | |
| | | | | trx_undo_seg_create(): Perform the changes that were previously made by trx_undo_header_add_space_for_xid().
| | | | |
| | | | | btr_reset_instant(): New function: Reset the table to MariaDB 10.2 or 10.3 format when rolling back an instant ALTER TABLE operation.
| | | | |
| | | | | page_rec_find_owner_rec(): Merge with the only callers.
| | | | |
| | | | | page_cur_insert_rec_low(): Combine writes by using a local buffer. MEMMOVE data from the preceding record whenever feasible (copying at least 3 bytes).
| | | | |
| | | | | page_cur_insert_rec_zip(): Combine writes to page header fields.
| | | | |
| | | | | PageBulk::insertPage(): Issue MEMMOVE records to copy a matching part from the preceding record.
| | | | |
| | | | | PageBulk::finishPage(): Combine the writes to the page header and to the sparse page directory slots.
| | | | |
| | | | | mtr_t::write(): Only log the least significant (last) bytes of multi-byte fields that actually differ. For updating FSP_SIZE, we must always write all 4 bytes to the redo log, so that the fil_space_set_recv_size() logic in recv_sys_t::parse() will work.
| | | | |
| | | | | mtr_t::memcpy(), mtr_t::zmemcpy(): Take a pointer argument instead of a numeric offset to the page frame. Only log the last bytes of multi-byte fields that actually differ.
| | | | |
| | | | | In fil_space_crypt_t::write_page0(), we must log also any unchanged bytes, so that recovery will recognize the record and invoke fil_crypt_parse().
| | | | |
| | | | | Future work:
| | | | | MDEV-21724 Optimize page_cur_insert_rec_low() redo logging
| | | | | MDEV-21725 Optimize btr_page_reorganize_low() redo logging
| | | | | MDEV-21727 Optimize redo logging for ROW_FORMAT=COMPRESSED
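Two ideas from the MDEV-12353 encoding can be sketched independently of the real log writer: packing a flag, a type and a short length into the first byte, and logging only the trailing bytes of a field that changed. The exact bit layout used here (same_page in bit 7, a 3-bit type in bits 6..4, a length of 1..15 in bits 3..0) is an assumption of this example; the commit message only states that same_page is the most significant bit and that lengths up to 15 bytes fit in the first byte. toy_pack_first_byte, ToyWrite and toy_trim_write are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack the first byte of a toy record. Returns false when the inputs do
// not fit the assumed layout (longer records would need extra length bytes).
bool toy_pack_first_byte(bool same_page, uint8_t type, unsigned len,
                         uint8_t* out) {
  if (type > 7 || len < 1 || len > 15) return false;
  *out = uint8_t((same_page ? 0x80 : 0) | (type << 4) | len);
  return true;
}

struct ToyWrite {
  std::size_t skip;            // unchanged leading bytes that are not logged
  std::vector<uint8_t> bytes;  // the bytes that would go to the log
};

// Drop the leading bytes of a field that are identical before and after
// the update, keeping only the trailing bytes that need to be logged.
// Both vectors are assumed to have the same size.
ToyWrite toy_trim_write(const std::vector<uint8_t>& before,
                        const std::vector<uint8_t>& after) {
  std::size_t skip = 0;
  while (skip < before.size() && skip < after.size() &&
         before[skip] == after[skip]) {
    ++skip;
  }
  return {skip, std::vector<uint8_t>(
                    after.begin() + static_cast<std::ptrdiff_t>(skip),
                    after.end())};
}
```

When the field does not change at all, toy_trim_write() returns an empty byte list, which corresponds to the case where a write turns out to be a no-op and nothing needs to be logged.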
* | | | | MDEV-12353: Replace MLOG_REC_INSERT,MLOG_COMP_REC_INSERT (Marko Mäkelä, 2020-02-13, 1 file, -67/+0)
| | | | |
| | | | | page_mem_alloc_free(), page_dir_set_n_heap(), page_ptr_set_direction(): Merge with the callers.
| | | | |
| | | | | page_direction_reset(), page_direction_increment(), page_zip_dir_insert(), page_zip_write_rec_ext(), page_zip_write_rec(): Add the parameter mtr, and write log.
| | | | |
| | | | | PageBulk::insert(), PageBulk::finish(): Write log for all changes.
| | | | |
| | | | | page_cur_rec_insert(), page_cur_insert_rec_write_log(), page_cur_insert_rec_write_log(): Remove.
| | | | |
| | | | | page_rec_set_next(), page_header_set_field(), page_header_set_ptr(): Remove. Use lower-level operations with or without logging.
| | | | |
| | | | | page_zip_dir_add_slot(): Move to the same compilation unit with its only caller, page_cur_insert_rec_zip().
| | | | |
| | | | | page_cur_insert_rec_zip(): Mark pieces of code that must be skipped once this task is completed.
| | | | |
| | | | | btr_defragment_chunk(): Before starting a mini-transaction that is writing (a lot), invoke log_free_check(). This should allow the test innodb.innodb_defrag_concurrent to pass with the mtr default_mysqld.cnf setting of innodb_log_file_size=10M.
| | | | |
| | | | | MLOG_BUF_MARGIN: Remove.
* | | | | MDEV-12353: Replace MLOG_*LIST_*_DELETE and MLOG_*REC_DELETEMarko Mäkelä2020-02-131-28/+33
| | | | | No longer write the following redo log records: MLOG_COMP_LIST_END_DELETE, MLOG_LIST_END_DELETE, MLOG_COMP_LIST_START_DELETE, MLOG_LIST_START_DELETE, MLOG_REC_DELETE, MLOG_COMP_REC_DELETE.
| | | | | Each individual deleted record will be logged separately using physical log records.
| | | | | page_dir_slot_set_n_owned(), page_zip_rec_set_owned(), page_zip_dir_delete(), page_zip_clear_rec(): Add the parameter mtr, and write redo log.
| | | | | page_dir_slot_set_rec(): Remove. Replaced with lower-level operations that write redo log when necessary.
| | | | | page_rec_set_n_owned(): Replaces rec_set_n_owned_old(), rec_set_n_owned_new().
| | | | | rec_set_heap_no(): Replaces rec_set_heap_no_old(), rec_set_heap_no_new().
| | | | | page_mem_free(), page_dir_split_slot(), page_dir_balance_slot(): Add the parameter mtr.
| | | | | page_dir_set_n_slots(): Merge with the caller page_dir_split_slot().
| | | | | page_dir_slot_set_rec(): Merge with the callers page_dir_split_slot() and page_dir_balance_slot().
| | | | | page_cur_insert_rec_low(), page_cur_insert_rec_zip(): Suppress the logging of lower-level operations.
| | | | | page_cur_delete_rec_write_log(): Remove.
| | | | | page_cur_delete_rec(): Do not tolerate mtr=NULL.
| | | | | rec_convert_dtuple_to_rec_old(), rec_convert_dtuple_to_rec_comp(): Replace rec_set_heap_no_old() and rec_set_heap_no_new() with direct access that does not involve redo logging.
| | | | | mtr_t::memcpy(): Do allow non-redo-logged writes to uncompressed pages of ROW_FORMAT=COMPRESSED pages.
| | | | | buf_page_io_complete(): Evict the uncompressed page of a ROW_FORMAT=COMPRESSED page after recovery. Because we no longer write logical log records for deleting index records, but instead write physical records that may refer directly to the compressed page frame of a ROW_FORMAT=COMPRESSED page, and because on recovery we will only apply the changes to the ROW_FORMAT=COMPRESSED page, the uncompressed page frame can be stale until page_zip_decompress() is executed.
| | | | | recv_parse_or_apply_log_rec_body(): After applying MLOG_ZIP_WRITE_STRING, ensure that the FIL_PAGE_TYPE of the uncompressed page matches the compressed page, because buf_flush_init_for_writing() assumes that field to be valid.
| | | | | mlog_init_t::mark_ibuf_exist(): Invoke page_zip_decompress(), because the uncompressed page after buf_page_create() is not necessarily up to date.
| | | | | buf_LRU_block_remove_hashed(): Bypass a page_zip_validate() check during redo log apply.
| | | | | recv_apply_hashed_log_recs(): Invoke mlog_init.mark_ibuf_exist() also for the last batch, to ensure that page_zip_decompress() will be called for freshly initialized pages.
* | | | | MDEV-12353: Replace MLOG_PAGE_CREATE_RTREE, MLOG_PAGE_COMP_CREATE_RTREEMarko Mäkelä2020-02-131-23/+11
| | | | | page_create(): Create normal B-tree pages. Callers that create R-tree pages will set FIL_PAGE_TYPE and reset the split sequence number afterwards. The creation of ROW_FORMAT=COMPRESSED pages is unaffected; they will be logged as compressed page images.
| | | | | page_create_low(): Take const buf_block_t* as a parameter. Let the callers invoke buf_block_modify_clock_inc().
* | | | | Cleanup: Aligned InnoDB index page header accessMarko Mäkelä2020-02-081-36/+43
| | | | | ut_align_down(): Preserve the const qualifier. Use C++ casts.
| | | | | ha_delete_hash_node(): Correct an assertion expression.
| | | | | fil_page_get_type(): Perform an assumed-aligned read.
| | | | | page_align(): Preserve the const qualifier. Assume (some) alignment.
| | | | | page_get_max_trx_id(): Check the index page type.
| | | | | page_header_get_field(): Perform an assumed-aligned read.
| | | | | page_get_autoinc(): Perform an assumed-aligned read.
| | | | | page_dir_get_nth_slot(): Perform an assumed-aligned read. Preserve the const qualifier.
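One way to preserve the const qualifier when rounding a pointer down, as described for ut_align_down() and page_align(), is to make the pointee type a template parameter. The following is only a sketch under that assumption: align_down() is a hypothetical name, and the actual InnoDB signatures are not shown in the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical helper: round ptr down to a multiple of alignment.
// T is deduced from the argument, so passing a const pointer yields a
// const pointer and no qualifier is silently cast away.
template <typename T>
inline T *align_down(T *ptr, std::size_t alignment) {
  // The alignment must be a nonzero power of two.
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(ptr);
  const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
  return reinterpret_cast<T *>(addr & ~mask);
}

// Compile-time check that constness is carried through to the result.
static_assert(
    std::is_same<decltype(align_down(
                     static_cast<const unsigned char *>(nullptr), 16)),
                 const unsigned char *>::value,
    "const qualifier must be preserved");
```

With a page frame aligned to its own size, calling such a helper on any pointer inside the frame returns the start of the frame, in the same const or non-const flavour as the argument.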
* | | | | fix aligned memcpy()-like functions usageEugene Kosov2020-01-231-2/+1
| | | | | I found that memcpy_aligned was used incorrectly in the redo log and decided to put assertions in the aligned functions, which uncovered even more incorrect cases. Given the number of bugs discovered, I left the assertions in place to prevent future bugs.
| | | | | my_assume_aligned(): Replaces the MY_ASSUME_ALIGNED macro.
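The idea of guarding aligned copy helpers with assertions can be sketched as follows. The names checked_aligned() and copy_aligned() are hypothetical and chosen for illustration; the sketch only checks the alignment in debug builds and does not pass any alignment hint to the compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: assert that ptr is aligned to Alignment bytes and
// return it unchanged. A real implementation might additionally pass an
// alignment hint to the compiler.
template <std::size_t Alignment, class T>
inline T *checked_aligned(T *ptr) {
  static_assert(Alignment != 0 && (Alignment & (Alignment - 1)) == 0,
                "Alignment must be a power of two");
  assert(reinterpret_cast<std::uintptr_t>(ptr) % Alignment == 0);
  return ptr;
}

// Hypothetical memcpy() wrapper whose arguments are both asserted to be
// aligned, so that an incorrect call site fails in a debug build.
template <std::size_t Alignment>
inline void *copy_aligned(void *dest, const void *src, std::size_t n) {
  return std::memcpy(checked_aligned<Alignment>(dest),
                     checked_aligned<Alignment>(src), n);
}
```

A call such as copy_aligned<8>(dst, src, 8) with a misaligned src would then trip the assertion instead of silently relying on a false alignment assumption.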
* | | | | Merge 10.4 into 10.5Marko Mäkelä2020-01-201-2/+18
|\ \ \ \ \
| |/ / / /
| * | | | Merge 10.3 into 10.4Marko Mäkelä2020-01-201-2/+18
| |\ \ \ \
| | |/ / /
| | | | | The MDEV-17062 fix in commit c4195305b2a8431f39a4c75cc1c66ba43685f7a0 was omitted.
| | * | | Merge 10.2 into 10.3Marko Mäkelä2020-01-181-2/+18
| | |\ \ \
| | | |/ /
| | | * | MDEV-21509 Possible hang during purge of history, or rollbackMarko Mäkelä2020-01-171-2/+18
| | | | | WL#6326 in MariaDB 10.2.2 introduced a potential hang on purge or rollback when an index tree is being shrunk by multiple levels.
| | | | | This fix is based on mysql/mysql-server@f2c58526300c0d84837effa26d37cbd5d2694967 with the main difference that our version of the test case uses DEBUG_SYNC instrumentation on ROLLBACK, not on purge.
| | | | | btr_cur_will_modify_tree(): Simplify the check further. This is the actual bug fix.
| | | | | row_undo_mod_remove_clust_low(), row_undo_mod_clust(): Add DEBUG_SYNC instrumentation for the test case.
* | | | | Merge 10.4 into 10.5Marko Mäkelä2019-12-161-2/+2
|\ \ \ \ \
| |/ / / /
| * | | | Merge 10.3 into 10.4Marko Mäkelä2019-12-131-3/+3
| |\ \ \ \
| | |/ / /
| | | | | We disable the MDEV-21189 test galera.galera_partition because it times out.
| | * | | Merge 10.2 into 10.3Marko Mäkelä2019-12-131-3/+4
| | |\ \ \
| | | |/ /
| | | * | MDEV-20950 Reduce size of record offsetsbb-10.2-MDEV-20950-stack-offsetsEugene Kosov2019-12-131-3/+4
| | | | | offset_t: A type that represents one record offset. It is unsigned short int.
| | | | | a lot of functions: Replace ulint with offset_t.
| | | | | btr_pcur_restore_position_func(), page_validate(), row_ins_scan_sec_index_for_duplicate(), row_upd_clust_rec_by_insert_inherit_func(), row_vers_impl_x_locked_low(), trx_undo_prev_version_build(): Allocate record offsets on the stack instead of waiting for rec_get_offsets() to allocate them from mem_heap_t. This reduces memory allocations.
| | | | | RECORD_OFFSET, INDEX_OFFSET: It is now less convenient to store pointers in an offset_t* array, because one pointer now occupies several offset_t elements. These constants are the start indexes into the array where the pointer values are stored.
| | | | | REC_OFFS_HEADER_SIZE: Adjusted for the new reality.
| | | | | REC_OFFS_NORMAL_SIZE: Increase the size from 100 to 300, which means fewer heap allocations. sizeof(offset_t[REC_OFFS_NORMAL_SIZE]) is now 600 bytes, which is smaller than the previous 800 bytes.
| | | | | REC_OFFS_SEC_INDEX_SIZE: Adjusted for the new reality.
| | | | | rem0rec.h, rem0rec.ic, rem0rec.cc: The types of various arguments, return values and local variables were changed to fix numerous integer conversion issues.
| | | | | enum field_type_t: The concept of offset types was introduced, replacing the old offset flags. As in the earlier version, the 2 upper bits are used to store the offset type, and this enum represents those types.
| | | | | REC_OFFS_SQL_NULL, REC_OFFS_MASK: Removed.
| | | | | get_type(), set_type(), get_value(), combine(): Convenience functions for working with offsets and their types.
| | | | | rec_offs_base()[0]: Still uses the old scheme with the flags REC_OFFS_COMPACT and REC_OFFS_EXTERNAL.
| | | | | rec_offs_base()[i]: These have the type offset_t now. The two upper bits contain the type.
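The packing of an offset type into the 2 upper bits of a 16-bit offset_t can be sketched as below. The function names follow the commit message, but their signatures, the enumerator names and the enumerator values are assumptions made for illustration only; the real definitions are in rem0rec.h.

```cpp
#include <cassert>
#include <cstddef>

typedef unsigned short offset_t;  // "unsigned short int" per the message
static_assert(sizeof(offset_t) == 2, "this sketch assumes a 16-bit offset_t");

// Hypothetical enumerators: the message only says that the 2 upper bits
// hold the type. Names and values here are illustrative.
enum field_type_t : offset_t {
  FIELD_IN_RECORD = 0 << 14,
  FIELD_OFFPAGE = 1 << 14,
  FIELD_SQL_NULL = 2 << 14,
  FIELD_DEFAULT = 3 << 14
};

constexpr offset_t TYPE_MASK = 3 << 14;         // the two upper bits
constexpr offset_t VALUE_MASK = (1 << 14) - 1;  // the lower 14 bits

inline field_type_t get_type(offset_t n) {
  return static_cast<field_type_t>(n & TYPE_MASK);
}

inline offset_t get_value(offset_t n) {
  return static_cast<offset_t>(n & VALUE_MASK);
}

inline offset_t combine(offset_t value, field_type_t type) {
  assert(value <= VALUE_MASK);  // the value must fit in 14 bits
  return static_cast<offset_t>(value | type);
}

inline void set_type(offset_t &n, field_type_t type) {
  n = combine(get_value(n), type);
}

// The size arithmetic from the message: 300 two-byte offsets occupy
// 600 bytes. The "previous 800 bytes" is 100 elements of an 8-byte ulint.
static_assert(sizeof(offset_t[300]) == 600, "300 offsets take 600 bytes");
```

Under these assumptions, combine(1234, FIELD_SQL_NULL) yields an offset whose get_value() is 1234 and whose get_type() is FIELD_SQL_NULL, and set_type() changes the type without disturbing the value.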
* | | | | MDEV-21174: Replace mlog_write_ulint() with mtr_t::write()Marko Mäkelä2019-12-031-29/+28
| | | | | mtr_t::write(): Replaces mlog_write_ulint(), mlog_write_ull(). Optimize away writes if the page contents do not change, except when a dummy write has been explicitly requested.
| | | | | Because the member function template takes a block descriptor as a parameter, it is possible to introduce better consistency checks. Due to this, the code for handling file-based lists, undo logs and user transactions was refactored to pass around buf_block_t.
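The two properties described for mtr_t::write(), skipping writes that would not change the page and using the block descriptor for consistency checks, can be sketched in isolation. Block, WriteMode and write_field() are hypothetical stand-ins, not the InnoDB types or signatures; the 16384-byte frame size is also an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a block descriptor: only a page frame.
struct Block {
  static constexpr std::size_t kPageSize = 16384;  // assumed page size
  unsigned char frame[kPageSize];
};

// WRITE_FORCED models an explicitly requested dummy write.
enum WriteMode { WRITE_NORMAL, WRITE_FORCED };

// Write the N-byte value v, most significant byte first, at ptr inside
// block. Returns true if a redo log record would have to be emitted.
template <unsigned N>
bool write_field(Block &block, unsigned char *ptr, std::uint64_t v,
                 WriteMode mode = WRITE_NORMAL) {
  static_assert(N == 1 || N == 2 || N == 4 || N == 8, "unsupported size");
  // Consistency check made possible by passing the block descriptor:
  // the field must lie entirely within the frame of this block.
  assert(ptr >= block.frame && ptr + N <= block.frame + Block::kPageSize);
  unsigned char buf[N];
  for (unsigned i = 0; i < N; i++)
    buf[i] = static_cast<unsigned char>(v >> (8 * (N - 1 - i)));
  // Optimize away the write if the page contents would not change.
  if (mode == WRITE_NORMAL && std::memcmp(ptr, buf, N) == 0) return false;
  std::memcpy(ptr, buf, N);
  return true;
}
```

In this sketch, writing the same value twice emits a record only the first time, unless the second call passes WRITE_FORCED.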
* | | | | Cleanup: flst_read_addr(), fil_addr_tMarko Mäkelä2019-11-281-4/+2
| | | | | fil_addr_t: Use exactly sized data types.
| | | | | flst_read_addr(): Remove the unused parameter mtr.
| | | | | page_offset(): Return uint16_t.
* | | | | Merge 10.4 into 10.5Marko Mäkelä2019-11-131-6/+6
|\ \ \ \ \
| |/ / / /
| * | | | Use constexpr for constants on data pagesMarko Mäkelä2019-11-131-6/+6
| | | | |