path: root/mysql-test
Commit message (Author, Age, Files, Lines)
* MDEV-27683 EXCHANGE PARTITION allows different index direction, but causes further errors [bb-10.2-MDEV-277400] (Sergei Golubchik, 2022-02-03, 2 files, -0/+24)
* fix a copy-paste error (Sergei Golubchik, 2022-02-03, 2 files, -117/+46)
|   LEX_CSTRING table_name= { table->s->db.str, table->s->table_name.length };
|   and misc cleanups
* MDEV-11675. rpl_start_alter_ftwrl.test is refined [bb-10.8-andrei] (Andrei, 2022-02-02, 2 files, -0/+10)
|   The test could fail sporadically because of an unanticipated race on the slave between the CREATE and ALTER queries. Fixed by synchronizing the slave with the master with respect to CREATE.
* MDEV-11675. Convert the new session var to bool type and test changes (Andrei, 2022-01-31, 38 files, -109/+108)
|   The new @@binlog_alter_two_phase is converted to the `my_bool` type.
* MDEV-4989: Support for GTID in mysqlbinlog (Brandon Nesterenko, 2022-01-31, 4 files, -5/+67)
|   This patch fixes two issues:
|   First, it fixes a test failure due to GTID list events having an inconsistent ordering of domain ids. In particular, this patch ensures that a GTID list log event will have its GTIDs ordered by domain id (ascending) followed by sequence number (ascending).
|   Second, it fixes an assert which could use an uninitialized variable.
|   Reviewed By: Andrei Elkin <andrei.elkin@mariadb.com>
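The ordering rule described above (domain id ascending, then sequence number ascending) can be illustrated with a minimal Python sketch. The `Gtid` tuple and the helper names are hypothetical and are not taken from the server source; the only assumption is the `domain-server-seq_no` text form used elsewhere in this log.

```python
from typing import List, NamedTuple


class Gtid(NamedTuple):
    """Hypothetical GTID model: domain id, server id, sequence number."""
    domain_id: int
    server_id: int
    seq_no: int


def parse_gtid(text: str) -> Gtid:
    # Parse the "domain-server-seq_no" text form, e.g. "1-2-55".
    domain_id, server_id, seq_no = (int(part) for part in text.split("-"))
    return Gtid(domain_id, server_id, seq_no)


def order_gtid_list(gtids: List[Gtid]) -> List[Gtid]:
    # Order by domain id (ascending), then by sequence number (ascending).
    return sorted(gtids, key=lambda g: (g.domain_id, g.seq_no))
```

With a deterministic order like this, two GTID lists holding the same positions compare equal regardless of the order in which the positions were collected.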
* MDEV-11675 Lag Free Alter On Slave (Sachin, 2022-01-27, 60 files, -1/+6735)
|   This commit implements two-phase binloggable ALTER. With the new @@session.binlog_alter_two_phase = YES, an ALTER query gets logged in two parts: the START ALTER and the COMMIT or ROLLBACK ALTER.
|   START ALTER is written to the binlog as soon as the necessary locks have been acquired for the table. The timing is such that any concurrent DMLs that update the same table are either committed, and thus logged into the binary log having done their work on the old version of the table, or will be queued for execution on its new version.
|   The "COMPLETE" COMMIT or ROLLBACK ALTER is written at the very point where a normal "single-piece" ALTER is logged, that is, after most of the query work is done. When the result is positive, COMMIT ALTER is written; otherwise ROLLBACK ALTER is written with the specific error that happened after the START ALTER phase.
|   Replication of two-phase binloggable ALTER is cross-version safe. Specifically, an OLD slave merely does not recognize the START ALTER part, while still being able to process and memorize its gtid.
|   Two-phase logged ALTER is read from the binlog by mysqlbinlog to produce BINLOG 'string', where 'string' contains a base64 encoded Query_log_event containing either the start part of the ALTER or a completion part. The query details can be displayed with the `-v` flag, similarly to ROW format events. Note that mysqlbinlog output containing parts of a two-phase binloggable ALTER can be processed correctly only by a binlog_alter_two_phase server.
|   @@log_warnings > 2 can reveal details of binlogging and slave side processing of the ALTER parts.
|   The current commit also carries fixes to the following list of reported bugs: MDEV-27511, MDEV-27471, MDEV-27349, MDEV-27628, MDEV-27528.
|   Thanks to all people involved in the early discussion of the feature, including Kristian Nielsen, and to those who helped to design, implement and test it: Sergei Golubchik, Andrei Elkin who took on the burden of completing the implementation, Sujatha Sivakumar, Brandon Nesterenko, Alice Sherepa, Ramesh Sivaraman, Jan Lindstrom.
* MDEV-4989: Support for GTID in mysqlbinlog (Brandon Nesterenko, 2022-01-26, 7 files, -1/+2677)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | New Feature: =========== This commit extends the mariadb-binlog capabilities to allow events to be filtered by GTID ranges. More specifically, the --start-position and --stop-position arguments have been extended to accept values formatted as a list of GTID positions, e.g. --start-position=0-1-0,1-2-55. The following specific capabilities are addressed: 1) GTIDs can be used to filter results on local binlog files 2) GTIDs can be used to filter results from remote servers 3) Implemented --gtid-strict-mode that ensures the GTID event stream in each domain is monotonically increasing 4) Added new level of verbosity in mysqlbinlog -vvv to print additional diagnostic information/warnings about invalid GTID states 5) For a given GTID range, its start and stop position parameters aim to mimic the behaviors of CHANGE MASTER TO MASTER_USE_GTID=slave_pos and START SLAVE UNTIL master_gtid_pos=<GTID>, respectively. In particular, the start-position list expresses a gtid state of the server, similarly to how @@global.gtid_slave_pos expresses the gtid state of a slave server when connecting to a master with MASTER_USE_GTID=slave_pos. The GTID start-position list is exclusive and the stop-position list is inclusive. This allows users to receive events strictly after those that they already have, and is useful in cases of point in (logical) time recovery including 1) events were received out of order and should be re-sent, or 2) specifying the gtid state of a slave to get events newer than their current state. If a seq_no is 0 for start-position, it means to include the entirety of the domain. If a seq_no is 0 for stop-position, it means to exclude all events from that domain. 
The GTIDs provided in a start position argument must match with the GTID state of the first processed log (i.e. those listed in the Gtid_list event). If a stop position is provided, the events that are output are limited to only those with domain ids listed in the argument. When specifying combinations of start and stop positions, the following behaviors are expected: [--start-position without --stop-position]: Events that have domain ids in the start position are output if their seq_no occurs after the respective start position. Events with domain ids that are unspecified in the start position list are also output. Note that if the Gtid_list event of the first binary log is populated (i.e. non-empty), each domain in the Gtid_list must be present in the start-position list with a seq_no at or after the listed value. This behavior mimics how a slave only processes events after the state provided by @@global.gtid_slave_pos when connecting to a master with CHANGE MASTER TO MASTER_USE_GTID=slave_pos. [--stop-position without --start-position]: Output is limited to only events with both 1) domain ids that are present in the given stop position list and 2) seq_nos that are less than or equal to their respective stop GTID. Once all GTIDs in the stop position list have been processed, the program will stop processing log files. This behavior mimics how START SLAVE UNTIL master_gtid_pos=<G> has a slave only process events with domain ids present in G with their seq_nos at or before the respective gtid. [--start-position and --stop-position]: Output consists of the intersection between the events permitted by both the start and stop position rules. More concretely, the output can be defined by a union of the following rules: 1. For domains which exist in both the start and stop position lists, the events which exist in-between these positions (exclusive start, inclusive stop) are output 2. 
For all other events, the rules of [--stop-position without --start-position] are followed This is due to the implicit filtering within each individual rule. Even though the start position rule always includes events from unspecified domains, the stop position rule takes precedence because it always excludes events from unspecified domains. In other words, events which the start position rule would have included would then always be excluded by the stop position rule. [neither --start-position nor --stop-position]: Events are not omitted based on GTID positioning; however, --gtid-strict-mode and -vvv can still analyze gtid correctness for warning and error reporting. [repeated specification of --start-position or --stop-position]: Subsequent specifications of start and stop positions completely override previous ones. E.g., if invoked as mysqlbinlog --start-position=<G1> --start-position=<G2> ... All GTIDs specified in G1 are ignored and only those specified in G2 are used for the start position. A few additional notes: 1) this commit squashes together the commits: f4319661120e-78a9d49907ba 2) Changed rpl.rpl_blackhole_row_annotate test because it has out of order GTIDs in its binlog, so I added --skip-gtid-strict-mode 3) After all binlog events have been written, the session server id and domain id are reset to their values in the global state Reviewed By: =========== Andrei Elkin: <andrei.elkin@mariadb.com>
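The start/stop rules in this commit message can be summarized in a short Python sketch. This is a simplified model and not the mariadb-binlog implementation: the function name and the dictionary representation (domain id mapped to seq_no) are assumptions made for illustration, and the sketch ignores server ids, the Gtid_list validation of the first log, and the early stop once every stop position has been reached.

```python
from typing import Dict, Optional


def gtid_event_passes(domain_id: int,
                      seq_no: int,
                      start: Optional[Dict[int, int]] = None,
                      stop: Optional[Dict[int, int]] = None) -> bool:
    """Return True if an event would be output under the described rules.

    start/stop map domain id -> seq_no; None means the option was not given.
    """
    # Start rule: exclusive. Domains missing from the start list pass.
    if start is not None and domain_id in start:
        if seq_no <= start[domain_id]:
            return False
    # Stop rule: inclusive. Domains missing from the stop list are excluded.
    if stop is not None:
        if domain_id not in stop:
            return False
        if seq_no > stop[domain_id]:
            return False
    return True
```

Because real sequence numbers start at 1, a start seq_no of 0 admits the whole domain and a stop seq_no of 0 excludes it, which matches the special cases listed in the message without extra branches.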
* bump the version and maturity (Sergei Golubchik, 2022-01-26, 1 file, -1/+1)
|
* MDEV-26238: Remove inconsistent behaviour of --default-* options in my_print_defaults (Rucha Deodhar, 2022-01-26, 3 files, -24/+89)
|   Analysis: A --defaults* option is recognized anywhere on the command line instead of only at the beginning, because handle_options() recognizes options in any order.
|   Fix: Use get_defaults_options(), which recognizes --defaults* options only at the beginning. After this is done, we only want to recognize the other options, given in any order, which can be done using handle_options(). So only skip the --defaults* options and pass the rest of them to handle_options(). Also, removed -e, -g and -c because only my_print_defaults supports them.
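The two-stage parsing described in the fix can be sketched in Python. This is only an illustration of the idea (leading `--defaults*` options are consumed first, everything after them goes to the general parser); `split_defaults_options` is a hypothetical helper, and the real get_defaults_options() in the server handles more cases than a simple prefix match.

```python
from typing import List, Tuple


def split_defaults_options(args: List[str]) -> Tuple[List[str], List[str]]:
    """Split leading --defaults* options from the remaining arguments.

    Only options at the very beginning are treated as --defaults* options;
    the remainder would be handed to an order-insensitive option parser.
    """
    index = 0
    while index < len(args) and args[index].startswith("--defaults"):
        index += 1
    return args[:index], args[index:]
```

In this model a `--defaults-file=...` that appears after another option is not consumed by the first stage, so it reaches the general parser like any other argument.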
* MDEV-27398 DESC index causes wrong (empty) result on Federated tables (Sergei Golubchik, 2022-01-26, 2 files, -0/+55)
|   take descending indexes into account when generating a query
* MDEV-27581 Wrong result with DESC key on partitioned Spider table (Sergei Golubchik, 2022-01-26, 2 files, -0/+23)
|   take descending indexes into account when generating a query
|   also fixes MDEV-27617
* MDEV-27586 Auto-increment does not work with DESC on MERGE table (Sergei Golubchik, 2022-01-26, 2 files, -0/+25)
|   also fix handler::get_auto_increment()
* MDEV-27434 DESC attribute does not work with auto-increment on secondary column of multi-part index (Sergei Golubchik, 2022-01-26, 4 files, -10/+74)
|   when searching for the last auto-inc value, it's HA_READ_PREFIX_LAST for the ASC keypart, but HA_READ_PREFIX for the DESC one
|   also fixes MDEV-27585
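Why the read direction depends on the keypart order can be shown with a small Python model. It assumes a two-part index (prefix column, auto-increment column) whose entries are already stored in index order; the function name and the list-of-tuples representation are hypothetical and stand in for the handler's HA_READ_PREFIX and HA_READ_PREFIX_LAST lookups.

```python
from typing import List, Optional, Tuple


def last_autoinc_for_prefix(index_rows: List[Tuple[int, int]],
                            prefix: int,
                            autoinc_descending: bool) -> Optional[int]:
    """Return the highest auto-increment value stored under a key prefix.

    index_rows must already be in index order: within one prefix the second
    column ascends for an ASC keypart and descends for a DESC keypart.
    """
    matching = [value for key, value in index_rows if key == prefix]
    if not matching:
        return None
    # DESC keypart: the largest value is the first entry with the prefix
    # (analogous to HA_READ_PREFIX). ASC keypart: it is the last entry
    # (analogous to HA_READ_PREFIX_LAST).
    return matching[0] if autoinc_descending else matching[-1]
```

Reading the last entry of a DESC keypart would return the smallest value instead, which is the kind of wrong auto-increment result the commit addresses.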
* ORDER BY index traversal direction in the optimizer trace (Sergei Golubchik, 2022-01-26, 1 file, -0/+2)
|
* MDEV-27407 Different ASC/DESC index attributes on MERGE and underlying table can cause wrong results (Sergei Golubchik, 2022-01-26, 2 files, -0/+24)
|   detect if merge children are "differently defined" regarding ASC/DESC
* MDEV-27406 Index on a HEAP table retains DESC attribute despite being hash (Sergei Golubchik, 2022-01-26, 2 files, -7/+35)
|   just like in SHOW KEYS, suppress DESC in SHOW CREATE if not HA_READ_ORDER
* MDEV-27393 Timezone tables cannot have descending indexes (Sergei Golubchik, 2022-01-26, 2 files, -0/+28)
|   replace the assert with an if(). asserts should not be used on the input (even without DESC indexes the table could've been corrupted)
* MDEV-27309 Server crash or ASAN memcpy-param-overlap upon INSERT into Aria/MyISAM table with DESC key (Sergei Golubchik, 2022-01-26, 4 files, -0/+102)
|   MyISAM and Aria, indexes with prefix compression, where the first keypart could be NULL: in this case they didn't expect the next key after a not-NULL key to be NULL.
|   Expect the first keypart of the next key to have zero length even if store_not_null==1; this combination means the keypart is NULL, don't pack it.
|   also fixes MDEV-27340
* MDEV-27303 Table corruption after insert into a non-InnoDB table with DESC index (Sergei Golubchik, 2022-01-26, 4 files, -0/+159)
|   optimized prefix search didn't take into account descending indexes
|   also fixes MDEV-27330
* MDEV-27396 DESC index attribute remains in Archive table definition, despite being apparently ignored (Sergei Golubchik, 2022-01-26, 2 files, -8/+30)
|   disallow descending indexes in archive
* MDEV-27529 Wrong result upon query using index_merge with DESC key (#2) (Sergei Petrunia, 2022-01-26, 2 files, -1/+51)
|   ROR-index_merge relies on the Rowid-ordered-retrieval property: a ROR scan, e.g. a scan on an equality range tbl.key=const, should return rows ordered by their Rowid. Also, handler->cmp_ref() should compare rowids according to the Rowid ordering.
|   When the table's primary key uses DESC keyparts, ROR scans return rows according to the PK's ordering. But ha_innobase::cmp_ref() compared rowids as if the PK used ASC keyparts. This caused wrong query results with index_merge.
|   Fixed this by making ha_innobase::cmp_ref() compare according to the PK definition, including each keypart's DESC property.
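The comparison rule can be sketched as follows. This Python model is not the ha_innobase::cmp_ref() code: rowids are represented as tuples of primary key values and `descending` is a hypothetical per-keypart flag list, used only to show how a DESC keypart flips the result for that keypart.

```python
from typing import Sequence


def cmp_rowid(left: Sequence[int],
              right: Sequence[int],
              descending: Sequence[bool]) -> int:
    """Compare two rowids keypart by keypart, honoring DESC keyparts.

    Returns -1, 0 or 1 so that sorting with this comparator reproduces the
    order in which a scan over the primary key would return the rows.
    """
    for left_value, right_value, is_desc in zip(left, right, descending):
        if left_value != right_value:
            result = -1 if left_value < right_value else 1
            # A DESC keypart reverses the ordering of that keypart only.
            return -result if is_desc else result
    return 0
```

If the comparator ignored the flags while the scans returned rows in DESC order, a merge of rowid-ordered streams would see its inputs as unsorted, which is consistent with the wrong results described above.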
* MDEV-27426 Wrong result upon query using index_merge with DESC key (Sergei Petrunia, 2022-01-26, 2 files, -0/+20)
|   Make QUICK_RANGE_SELECT::cmp_next() aware of reverse-ordered key parts.
|   (QUICK_RANGE_SELECT::cmp_prev() uses key_cmp() and so it already works correctly)
* MDEV-26996 Reverse-ordered indexes: improve print-out (Sergei Petrunia, 2022-01-26, 1 file, -7/+7)
|   When printing a range into the optimizer trace, print DESC for columns that are reverse-ordered, for example:
|   "(4) <= (key1 DESC) <= (2)"
* MDEV-26996 Support descending indexes in the range optimizer (Sergei Petrunia, 2022-01-26, 5 files, -36/+70)
|   - Code cleanup
|   - Disable "Using index for GROUP BY" over indexes with DESC keyparts
* Descending indexes code exposed a gap in fix for MDEV-25858. (Sergei Petrunia, 2022-01-26, 3 files, -7/+70)
|   Extend the fix for MDEV-25858 to handle non-reverse-ordered ORDER BY: If test_if_skip_sort_order() decides to use an index to produce rows in the required ordering, it should disable "Range Checked for Each Record".
|   The fix needs to be backported to earlier versions.
* MDEV-26996 Support descending indexes in the range optimizer (Sergei Petrunia, 2022-01-26, 2 files, -0/+235)
|   Make the Range Optimizer support descending index key parts.
|   We follow the approach taken in MySQL-8.
|   See HowRangeOptimizerHandlesDescKeyparts for the description.
* MDEV-26938 Support descending indexes internally in InnoDB (server part) (Sergei Golubchik, 2022-01-26, 20 files, -104/+830)
|   * preserve DESC index property in the parser
|   * store it in the frm (only for HA_KEY_ALG_BTREE)
|   * read it from the frm
|   * show it in SHOW CREATE
|   * skip DESC indexes in opt_range.cc and opt_sum.cc
|   * ORDER BY test
|   This includes a fix of MDEV-27432.
* MDEV-26938 Support descending indexes internally in InnoDB (Marko Mäkelä, 2022-01-26, 8 files, -32/+324)
|   This is loosely based on the InnoDB changes in mysql/mysql-server@97fd8b1b6993340b361fa7f85da86a308f0b5e0c that I had developed in 2015 or 2016.
|   For each B-tree key field, we will allow a flag ASC/DESC to be associated. When PRIMARY KEY fields are internally appended to secondary indexes, the ASC/DESC attribute will be inherited, so that covering index scans will work as expected.
|   Note: Until the subsequent commit, the DESC attribute will be ignored (no HA_REVERSE_SORT flag will be written to .frm files).
|   dict_field_t::descending: A new flag to denote descending order.
|   cmp_data(), cmp_dfield_dfield(): Add a new parameter descending.
|   cmp_dtuple_rec(), cmp_dtuple_rec_with_match(): Add a parameter "index".
|   dtuple_coll_eq(): Replaces dtuple_coll_cmp().
|   cmp_dfield_dfield_eq_prefix(): Replaces cmp_dfield_dfield_like_prefix().
|   dict_index_t::is_btree(): Check whether the index is a regular B-tree index (not SPATIAL, FULLTEXT, the ibuf.index, or a corrupted index).
|   btr_cur_search_to_nth_level_func(): Only attempt to use the adaptive hash index if index->is_btree(). This function may also be invoked on ibuf.index, and cmp_dtuple_rec_with_match_bytes() will no longer work on ibuf.index because it assumes that the index and record fields exactly match. The ibuf.index is a special variadic index tree.
|   Thanks to Thirunarayanan Balathandayuthapani for fixing some bugs: MDEV-27439, MDEV-27374/MDEV-27445.
* cleanup: tests (Sergei Golubchik, 2022-01-26, 8 files, -255/+189)
|   * combine two test files with seemingly the same name
|   * comments
* MDEV-23570 deprecate keep_files_on_create (Alexander Barkov, 2022-01-26, 3 files, -0/+126)
|
* A clean-up for MDEV-10654 add support IN, OUT, INOUT parameter qualifiers for stored functions (Alexander Barkov, 2022-01-24, 6 files, -0/+4186)
|   Changes:
|   1. Enabling IN/OUT/INOUT mode for sql_mode=DEFAULT, adding tests for sql_mode=DEFAULT, mostly by translating compat/oracle.sp-inout.test to SQL/PSM with minor changes (e.g. testing trigger OLD.column and NEW.column as IN/OUT parameters).
|   2. Removing duplicate grammar: sp_pdparam and sp_fdparam implemented exactly the same syntax after
|      - the first patch for MDEV-10654 (for sql_mode=ORACLE)
|      - the change #1 from this patch (for sql_mode=DEFAULT)
|      Removing the separate rules and adding a single "sp_param" rule instead, which now covers both PROCEDURE and FUNCTION parameters (and CURSOR parameters as well!).
|   3. Adding a helper rule sp_param_name_and_mode, which is a combination of the parameter name and the IN/OUT/INOUT mode. It allows simplifying the grammar a bit.
|   4. The first patch unintentionally allowed IN/OUT/INOUT mode to be specified in CURSOR parameters. This is good for the IN keyword: it is allowed in PL/SQL CURSORs. This is not good for the OUT/INOUT keywords: they should not be allowed. Adding an additional semantic post-check.
* MDEV-10654 add support IN, OUT, INOUT parameter qualifiers for stored functions (ManoharKB, 2022-01-24, 2 files, -0/+5068)
|   Problem: Currently stored functions do not support IN/OUT/INOUT parameter qualifiers. This is needed for Oracle compatibility (sql_mode = ORACLE).
|   Solution: Implemented parameter qualifier support for CREATE FUNCTION (reference: CREATE PROCEDURE). Implemented return by reference for OUT/INOUT parameters in execute_function() (reference: execute_procedure()).
|   Files changed:
|   sql/sql_yacc.yy: Added IN, OUT, INOUT parameter qualifiers for CREATE FUNCTION.
|   sql/sp_head.cc: Added input and output parameter binding for IN/OUT/INOUT parameters in execute_function() so that OUT/INOUT can return by reference.
|   sql/share/errmsg-utf8.txt: Added error message to restrict OUT/INOUT parameters while the function is being called from an SQL query.
|   mysql-test/suite/compat/oracle/t/sp-inout.test: Added test cases
|   mysql-test/suite/compat/oracle/r/sp-inout.result: Added test results
|   Reviewed-by: iqbal@hasprime.com
* MDEV-5271 Support engine-defined attributes per partition (Nayuta Yanagisawa, 2022-01-24, 3 files, -1/+388)
|   Make it possible to specify engine-defined attributes on partitions as well as tables.
|   If an engine-defined attribute is only specified at the table level, it applies to all the partitions in the table. This is a backward-compatible behavior.
|   If the same attribute is specified both at the table level and the partition level, the per-partition one takes precedence. So, we can consider per-table attributes as default values.
|   One cannot specify engine-defined attributes on subpartitions.
|   Implementation details:
|   * We store per-partition attributes in the partition_element class because we already have the part_comment field, which is for per-partition comments.
|   * In the case of ALTER TABLE statements, the partition_elements in table->part_info is set up by mysql_unpack_partition(). So, we parse per-partition attributes after the call of the function.
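The precedence rule (per-table attributes act as defaults, per-partition attributes override them) amounts to a dictionary merge. The Python sketch below is illustrative only: the function name is hypothetical and the attribute names in the usage note are made up rather than attributes of any real storage engine.

```python
from typing import Dict


def effective_partition_attributes(table_attrs: Dict[str, str],
                                   partition_attrs: Dict[str, str]) -> Dict[str, str]:
    """Resolve the attributes that apply to a single partition.

    Table-level values are the defaults; a value given on the partition
    takes precedence over the table-level value of the same name.
    """
    resolved = dict(table_attrs)      # start from the table-level defaults
    resolved.update(partition_attrs)  # per-partition values win
    return resolved
```

For example, with table attributes `{"compression": "zlib", "cache": "on"}` and partition attributes `{"compression": "none"}` (both hypothetical), the partition ends up with `compression` set to `none` and inherits `cache` from the table.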
* MDEV-27314 InnoDB Buffer Pool Resize output cleanup (mtr postfix) (Daniel Black, 2022-01-24, 3 files, -3/+3)
|   More tests depending on 'Completed resizing buffer pool.' output
* MDEV-27314 InnoDB Buffer Pool Resize output cleanup (Haidong Ji, 2022-01-24, 1 file, -1/+1)
|   Cleaned up the log messages as suggested, with a minor code formatting change. On bullet point 13, I decided not to include the timestamp in the output message. In most (all?) cases, the output goes to the log file, which has a timestamp already.
* MDEV-27208: mtr --ps-protocol test fixup (Marko Mäkelä, 2022-01-22, 2 files, -0/+8)
|   The test ./mtr --ps-protocol main.func_math was broken in commit 5b3ad94c7b070be1b1e5ab186c5d4d017e1fe8cf because in that mode, one of several truncation warnings for a single integer literal would be omitted. Those warnings are issued by the parser somewhere outside CRC32() or CRC32C().
* MDEV-27208: Extend CRC32() and implement CRC32C() (Marko Mäkelä, 2022-01-21, 2 files, -16/+150)
|   We used to define a native unary function CRC32() that computes the CRC-32 of a string using the ISO 3309 polynomial that is being used by zlib and many others.
|   Often, a CRC is computed in pieces. To facilitate this, we introduce a 2-ary variant of the function that inputs a previous CRC as the first argument: CRC32('MariaDB')=CRC32(CRC32('Maria'),'DB').
|   InnoDB and MyRocks use a different polynomial, which was implemented in SSE4.2 instructions that were introduced in the Intel Nehalem microarchitecture. This is commonly called CRC-32C (Castagnoli).
|   We introduce a native function that uses the Castagnoli polynomial: CRC32C('MariaDB')=CRC32C(CRC32C('Maria'),'DB').
|   This allows SELECT...INTO DUMPFILE to be used for the creation of files with valid checksums, such as a logically empty InnoDB redo log file ib_logfile0 corresponding to a particular log sequence number.
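The piecewise identities quoted above can be reproduced outside the server with a short Python sketch. `zlib.crc32` is the standard-library CRC-32 (ISO 3309 polynomial) and accepts a running CRC as its second argument; the `crc32c` function below is a plain bitwise implementation written for this illustration, not the server's SSE4.2 code path. Note that the argument order differs from the SQL functions, which take the previous CRC first.

```python
import zlib


def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-32C (Castagnoli); crc is the CRC of the preceding bytes."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # 0x82F63B78 is the reflected Castagnoli polynomial.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def crc32_in_pieces(first: bytes, second: bytes) -> int:
    # Feed the CRC of the first piece as the running value for the second.
    return zlib.crc32(second, zlib.crc32(first))


def crc32c_in_pieces(first: bytes, second: bytes) -> int:
    return crc32c(second, crc32c(first))
```

Both `crc32_in_pieces(b"Maria", b"DB") == zlib.crc32(b"MariaDB")` and `crc32c_in_pieces(b"Maria", b"DB") == crc32c(b"MariaDB")` hold, mirroring the two SQL identities in the commit message.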
* MDEV-14425 Improve the redo log for concurrency (Marko Mäkelä, 2022-01-21, 45 files, -528/+270)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The InnoDB redo log used to be formatted in blocks of 512 bytes. The log blocks were encrypted and the checksum was calculated while holding log_sys.mutex, creating a serious scalability bottleneck. We remove the fixed-size redo log block structure altogether and essentially turn every mini-transaction into a log block of its own. This allows encryption and checksum calculations to be performed on local mtr_t::m_log buffers, before acquiring log_sys.mutex. The mutex only protects a memcpy() of the data to the shared log_sys.buf, as well as the padding of the log, in case the to-be-written part of the log would not end in a block boundary of the underlying storage. For now, the "padding" consists of writing a single NUL byte, to allow recovery and mariadb-backup to detect the end of the circular log faster. Like the previous implementation, we will overwrite the last log block over and over again, until it has been completely filled. It would be possible to write only up to the last completed block (if no more recent write was requested), or to write dummy FILE_CHECKPOINT records to fill the incomplete block, by invoking the currently disabled function log_pad(). This would require adjustments to some logic around log checkpoints, page flushing, and shutdown. An upgrade after a crash of any previous version is not supported. Logically empty log files from a previous version will be upgraded. An attempt to start up InnoDB without a valid ib_logfile0 will be refused. 
Previously, the redo log used to be created automatically if it was missing. Only with innodb_force_recovery=6 is it possible to start InnoDB in read-only mode even if the log file does not exist. This allows the contents of a possibly corrupted database to be dumped.
Because a prepared backup from an earlier version of mariadb-backup will create a 0-sized log file, we will allow an upgrade from such log files, provided that the FIL_PAGE_FILE_FLUSH_LSN in the system tablespace looks valid.
The 512-byte log checkpoint blocks at 0x200 and 0x600 will be replaced with 64-byte log checkpoint blocks at 0x1000 and 0x2000. The start of log records will move from 0x800 to 0x3000. This allows us to use 4096-byte aligned blocks for all I/O in a future revision.
We extend the MDEV-12353 redo log record format as follows.
(1) Empty mini-transactions or extra NUL bytes will not be allowed.
(2) The end-of-minitransaction marker (a NUL byte) will be replaced with a 1-bit sequence number, which will be toggled each time when the circular log file wraps back to the beginning.
(3) After the sequence bit, a CRC-32C checksum of all data (excluding the sequence bit) will be written.
(4) If the log is encrypted, 8 bytes will be written before the checksum and included in it. This is part of the initialization vector (IV) of encrypted log data.
(5) File names, page numbers, and checkpoint information will not be encrypted. Only the payload bytes of page-level log will be encrypted. The tablespace ID and page number will form part of the IV.
(6) For padding, arbitrary-length FILE_CHECKPOINT records may be written, with all-zero payload, and with the normal end marker and checksum. The minimum size is 7 bytes, or 7+8 with innodb_encrypt_log=ON.
In mariadb-backup and in Galera snapshot transfer (SST) scripts, we will no longer remove ib_logfile0 or create an empty ib_logfile0. Server startup will require a valid log file.
When resizing the log, we will create a logically empty ib_logfile101 at the current LSN and use an atomic rename to replace ib_logfile0 with it. See the test innodb.log_file_size. Because there is no mandatory padding in the log file, we are able to create a dummy log file as of an arbitrary log sequence number. See the test mariabackup.huge_lsn. The parameter innodb_log_write_ahead_size and the INFORMATION_SCHEMA.INNODB_METRICS counter log_padded will be removed. The minimum value of innodb_log_buffer_size will be increased to 2MiB (because log_sys.buf will replace recv_sys.buf) and the increment adjusted to 4096 bytes (the maximum log block size). The following INFORMATION_SCHEMA.INNODB_METRICS counters will be removed: os_log_fsyncs os_log_pending_fsyncs log_pending_log_flushes log_pending_checkpoint_writes The following status variables will be removed: Innodb_os_log_fsyncs (this is included in Innodb_data_fsyncs) Innodb_os_log_pending_fsyncs (this was limited to at most 1 by design) log_sys.get_block_size(): Return the physical block size of the log file. This is only implemented on Linux and Microsoft Windows for now, and for the power-of-2 block sizes between 64 and 4096 bytes (the minimum and maximum size of a checkpoint block). If the block size is anything else, the traditional 512-byte size will be used via normal file system buffering. If the file system buffers can be bypassed, a message like the following will be issued: InnoDB: File system buffers for log disabled (block size=512 bytes) InnoDB: File system buffers for log disabled (block size=4096 bytes) This has been tested on Linux and Microsoft Windows with both sizes. On Linux, only enable O_DIRECT on the log for innodb_flush_method=O_DSYNC. Tests in 3 different environments where the log is stored in a device with a physical block size of 512 bytes are yielding better throughput without O_DIRECT. 
This could be due to the fact that in the event the last log block is being overwritten (if multiple transactions would become durable at the same time, and each of them will write a small number of bytes to the last log block), it should be faster to re-copy data from log_sys.buf or log_sys.flush_buf to the kernel buffer, to be finally written at fdatasync() time.
The parameter innodb_flush_method=O_DSYNC will imply O_DIRECT for data files. This option will enable O_DIRECT on the log file on Linux. It may be unsafe to use when the storage device does not support FUA (Force Unit Access) mode.
When the server is compiled WITH_PMEM=ON, we will use memory-mapped I/O for the log file if the log resides on a "mount -o dax" device. We will identify PMEM in a start-up message:
InnoDB: log sequence number 0 (memory-mapped); transaction id 3
On Linux, we will also invoke mmap() on any ib_logfile0 that resides in /dev/shm, effectively treating the log file as persistent memory. This should speed up "./mtr --mem" and increase the test coverage of PMEM on non-PMEM hardware. It also allows users to estimate how much the performance would be improved by installing persistent memory. On other tmpfs file systems such as /run, we will not use mmap().
mariadb-backup: Eliminated several variables. We will refer directly to recv_sys and log_sys.
backup_wait_for_lsn(): Detect non-progress of xtrabackup_copy_logfile(). In this new log format with arbitrary-sized blocks, we can only detect log file overrun indirectly, by observing that the scanned log sequence number is not advancing.
xtrabackup_copy_logfile(): On PMEM, do not modify the sequence bit, because we are not allowed to modify the server's log file, and our memory mapping is read-only.
trx_flush_log_if_needed_low(): Do not use the callback on pmem. Using neither flush_lock nor write_lock around PMEM writes seems to yield the best performance.
The pmem_persist() calls may still be somewhat slower than the pwrite() and fdatasync() based interface (PMEM mounted without -o dax). recv_sys_t::buf: Remove. We will use log_sys.buf for parsing. recv_sys_t::MTR_SIZE_MAX: Replaces RECV_SCAN_SIZE. recv_sys_t::file_checkpoint: Renamed from mlog_checkpoint_lsn. recv_sys_t, log_sys_t: Removed many data members. recv_sys.lsn: Renamed from recv_sys.recovered_lsn. recv_sys.offset: Renamed from recv_sys.recovered_offset. log_sys.buf_size: Replaces srv_log_buffer_size. recv_buf: A smart pointer that wraps log_sys.buf[recv_sys.offset] when the buffer is being allocated from the memory heap. recv_ring: A smart pointer that wraps a circular log_sys.buf[] that is backed by ib_logfile0. The pointer will wrap from recv_sys.len (log_sys.file_size) to log_sys.START_OFFSET. For the record that wraps around, we may copy file name or record payload data to the auxiliary buffer decrypt_buf in order to have a contiguous block of memory. The maximum size of a record is less than innodb_page_size bytes. recv_sys_t::parse(): Take the smart pointer as a template parameter. Do not temporarily add a trailing NUL byte to FILE_ records, because we are not supposed to modify the memory-mapped log file. (It is attached in read-write mode already during recovery.) recv_sys_t::parse_mtr(): Wrapper for recv_sys_t::parse(). recv_sys_t::parse_pmem(): Like parse_mtr(), but if PREMATURE_EOF would be returned on PMEM, use recv_ring to wrap around the buffer to the start. mtr_t::finish_write(), log_close(): Do not enforce log_sys.max_buf_free on PMEM, because it has no meaning on the mmap-based log. log_sys.write_to_buf: Count writes to log_sys.buf. Replaces srv_stats.log_write_requests and export_vars.innodb_log_write_requests. Protected by log_sys.mutex. Updated consistently in log_close(). Previously, mtr_t::commit() conditionally updated the count, which was inconsistent. 
log_sys.write_to_log: Count swaps of log_sys.buf and log_sys.flush_buf, for writing to log_sys.log (the ib_logfile0). Replaces srv_stats.log_writes and export_vars.innodb_log_writes. Protected by log_sys.mutex.
log_sys.waits: Count waits in append_prepare(). Replaces srv_stats.log_waits and export_vars.innodb_log_waits.
recv_recover_page(): Do not unnecessarily acquire log_sys.flush_order_mutex. We are inserting the blocks in arbitrary order anyway, to be adjusted in recv_sys.apply(true).
We will change the definition of flush_lock and write_lock to avoid potential false sharing. Depending on sizeof(log_sys) and CPU_LEVEL1_DCACHE_LINESIZE, the flush_lock and write_lock could share a cache line with each other or with the last data members of log_sys.
Thanks to Matthias Leich for providing https://rr-project.org traces for various failures during the development, and to Thirunarayanan Balathandayuthapani for his help in debugging some of the recovery code. And thanks to the developers of the rr debugger for a tool without which extensive changes to InnoDB would be very challenging to get right.
Thanks to Vladislav Vaintroub for useful feedback and to him, Axel Schwenke and Krunal Bauskar for testing the performance.
* MDEV-25785 Add support for OpenSSL 3.0  Vladislav Vaintroub  2022-01-20  3  -5/+5
|
| Summary of changes
|
| - MD_CTX_SIZE is increased
| - EVP_CIPHER_CTX_buf_noconst(ctx) does not work anymore; it points to an unknown location. Since the function does not seem to be documented, the previous assumption was that it points to the last partial source block. Add our own partial block buffer for NOPAD encryption instead
| - SECLEVEL in CipherString in openssl.cnf has been downgraded from 1 to 0, to make TLSv1.0 and TLSv1.1 possible (according to https://github.com/openssl/openssl/blob/openssl-3.0.0/NEWS.md even though the manual for SSL_CTX_get_security_level claims that it should not be necessary)
| - Work around the Ssl_cipher_list issue: it now returns TLSv1.3 ciphers in addition to what was set in --ssl-cipher
| - The ctx_buf buffer must now be aligned to 16 bytes with OpenSSL (previously with WolfSSL only), or crashes will happen
| - Updated aes-t to be easier to debug, using a function rather than a huge multiline macro
| - Added a test that does "nopad" encryption piece-wise, to test the replacement of EVP_CIPHER_CTX_buf_noconst
* Merge 10.7 into 10.8  Marko Mäkelä  2022-01-20  7  -2/+165
|\
| * Merge 10.6 into 10.7  Marko Mäkelä  2022-01-20  7  -2/+165
| |\
| | * Merge 10.5 into 10.6  Marko Mäkelä  2022-01-20  3  -0/+75
| | |\
| | | * MDEV-27550: Disable galera.MW-328D  Marko Mäkelä  2022-01-20  1  -0/+1
| | | |
| | | * MDEV-27382: OFFSET is ignored when combined with DISTINCT  [bb-10.5-mdev27382-v2]  Sergei Petrunia  2022-01-19  2  -0/+74
| | | |
| | | | A query of the form
| | | |
| | | |   SELECT DISTINCT expr_that_is_inferred_to_be_const LIMIT 0 OFFSET n
| | | |
| | | | produces one row when it should produce none.
| | | |
| | | | The issue was in JOIN_TAB::remove_duplicates(), in the piece of logic that tried to avoid duplicate removal for such cases but didn't account for a possible "LIMIT 0".
| | | |
| | | | Fixed by making Select_limit_counters::set_limit() change OFFSET to 0 when LIMIT is 0.
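The normalization described in this fix can be illustrated with a small Python model. This is only a sketch of the idea, not the server's C++ code: `normalize_limit` and `apply_limit` are hypothetical names, and a plain list stands in for the result set.

```python
def normalize_limit(limit, offset):
    """Return (limit, offset), forcing OFFSET to 0 when LIMIT is 0.

    Hypothetical stand-in for the rule the commit message describes;
    the real logic lives in Select_limit_counters::set_limit().
    """
    if limit == 0:
        offset = 0
    return limit, offset


def apply_limit(rows, limit, offset):
    """Apply LIMIT/OFFSET to a list of rows after normalizing the pair."""
    limit, offset = normalize_limit(limit, offset)
    return rows[offset:offset + limit]
```

With the pair normalized up front, code that later inspects the offset never sees a non-zero OFFSET alongside LIMIT 0, and a LIMIT 0 query yields no rows whatever OFFSET was given.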
| | | * MDEV-27025 insert-intention lock conflicts with waiting ORDINARY lock  [bb-10.5-MDEV-27025-deadlock]  Vlad Lesin  2022-01-18  4  -2/+90
| | | |
| | | | The code was backported from the 10.6 commit bd03c0e51629e1c3969a171137712a6bb854c232. See that commit message for details.
| | | |
| | | | Apart from the above commit, trx_lock_t::wait_trx was also backported from MDEV-24738. trx_lock_t::wait_trx is protected with lock_sys.wait_mutex in 10.6, but that mutex was implemented only in MDEV-24789. As there is no need to backport MDEV-24789 for MDEV-27025, trx_lock_t::wait_trx is protected with the same mutexes as trx_lock_t::wait_lock.
| | | |
| | | | This fix should not break innodb-lock-schedule-algorithm=VATS. This algorithm uses an Eldest-Transaction-First (ETF) heuristic, which prefers older transactions over new ones. In this fix we just insert the granted lock just before the last granted lock of the same transaction, which does not change the transaction execution order.
| | | |
| | | | The changes in lock_rec_create_low() should not break Galera Cluster: there is a big "if" branch for WSREP. This branch is necessary to provide the correct transaction execution order, and should not be changed for the current bug fix.
| | * | MDEV-27025 insert-intention lock conflicts with waiting ORDINARY lock  [bb-10.6-MDEV-27025-deadlock]  Vlad Lesin  2022-01-18  4  -2/+90
| | | |
| | | | When a lock is checked for conflicts, ignore other locks on the record if they wait for the requesting transaction.
| | | |
| | | | lock_rec_has_to_wait_in_queue() does not iterate over all locks for the page, but only over the locks located before the waiting lock in the queue. So there is an invariant: any lock in the queue can wait only for a lock that is located before it in the queue.
| | | |
| | | | When a conflicting lock waits for the transaction of the requesting lock, we need to place the requesting lock before the waiting lock in the queue to preserve the invariant. That is why we look for the first lock that waits for the requesting transaction, and place the new lock just after the last granted lock of the requesting transaction that precedes that waiting lock.
| | | |
| | | | Example:
| | | |
| | | | trx1 waiting lock, trx1 granted lock, ..., trx2 lock - waiting for trx1
| | | | place new lock here -----------------^
| | | |
| | | | There are also implicit locks which are lazily converted to explicit ones, and we need to place the newly created explicit lock at the correct place in the queue. All explicit locks converted from implicit ones are placed just after the last non-waiting lock of the same transaction before the first lock that waits for the transaction.
| | | |
| | | | Code review and cleanup was made by Marko Mäkelä.
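The placement rule above can be sketched as a toy Python model of a record lock queue. This is not InnoDB code: the `Lock` class, its fields, and `insert_position` are hypothetical names chosen for illustration. The fallback used when the requesting transaction has no granted lock before the waiter (insert directly before the waiter) is an assumption, as the commit message does not spell that case out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Lock:
    trx: str                        # owning transaction (hypothetical field)
    waiting: bool = False           # True if the lock is not yet granted
    wait_trx: Optional[str] = None  # transaction this lock waits for


def insert_position(queue: List[Lock], trx: str) -> int:
    """Return the index at which a new lock of `trx` should be inserted."""
    # Find the first lock in the queue that waits for the requesting trx.
    first_waiter = next(
        (i for i, lock in enumerate(queue)
         if lock.waiting and lock.wait_trx == trx),
        None,
    )
    if first_waiter is None:
        return len(queue)  # nobody waits for trx: append at the end

    # Assumed fallback: directly before the waiter, which still keeps
    # the "wait only for earlier locks" invariant.
    pos = first_waiter
    for i in range(first_waiter):
        if queue[i].trx == trx and not queue[i].waiting:
            pos = i + 1  # just after the last granted lock of trx so far
    return pos
```

For the queue from the example (a waiting trx1 lock, a granted trx1 lock, other locks, then a trx2 lock waiting for trx1), the function returns the slot immediately after the granted trx1 lock, so the trx2 lock still waits only for locks that precede it.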
* | | | Code cleanup  Sergei Petrunia  2022-01-19  1  -1/+1
| | | |
* | | | Switch the default histogram_type to still be DOUBLE_PREC_HB  Sergei Petrunia  2022-01-19  3  -2/+3
| | | |
| | | | MTR still uses JSON_HB as the default.
* | | | JSON_HB histogram: represent values of BIT() columns in hex always  Sergei Petrunia  2022-01-19  2  -11/+64
| | | |
* | | | MDEV-26901: Estimation for filtered rows less precise ... #4  Sergei Petrunia  2022-01-19  4  -4/+40
| | | |
| | | | In Histogram_json_hb::point_selectivity(), do return a selectivity of 0.0 when the histogram says so.
| | | |
| | | | The logic of "Do not return 0.0 estimate as it causes a multiply-by-zero meltdown in cost and cardinality calculations" is moved into records_in_column_ranges(), where it is done *once* per column pair (as opposed to once per range, which can cause the error to add up to a large number when there are many ranges).
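Why the floor should be applied once rather than per range can be seen in a small numeric sketch. Everything here is illustrative: `MIN_SELECTIVITY` is a made-up floor value, and the two functions are hypothetical stand-ins, not the optimizer's actual code.

```python
MIN_SELECTIVITY = 0.001  # hypothetical floor; the server's value may differ


def clamp_per_range(range_selectivities):
    """Old behaviour (sketch): floor each range's estimate, then sum."""
    return sum(max(s, MIN_SELECTIVITY) for s in range_selectivities)


def clamp_once(range_selectivities):
    """New behaviour (sketch): sum the raw estimates, floor the total once."""
    return max(sum(range_selectivities), MIN_SELECTIVITY)
```

For 100 ranges that the histogram reports as empty, the per-range floor inflates the total to about 100 times `MIN_SELECTIVITY`, while the single floor keeps it at `MIN_SELECTIVITY`, which still avoids a zero estimate.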