| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Following issues here:
Whenever Galera BF (brute force) transaction decides to abort conflicting
transaction it will kill that thread using thd::awake()
User KILL [QUERY|CONNECTION] ... for a thread it will also call thd::awake()
Whenever one of these actions is executed we will hold number of InnoDB
internal mutexes and thd mutexes. Sometimes these mutexes are taken in
different order causing mutex deadlock.
Lets consider a example starting from Galera BF transaction deciding to abort
conflicting transaction (let's call this thread1):
thread1:
lock_rec_other_has_conflicting
here we hold both lock_sys mutex and trx mutex for conflicting lock transaction.
wsrep_innobase_kill_one_trx
For victim we call wsrep_thd_awake
Next thread2 we assume user to execute KILL QUERY to some other executing
query (note it can't be BF query).
thread2:
sql_kill()
kill_one_thread
find_thread_by_id
takes LOCK_thd_kill
thd->awake_no_mutex()
thread1:
thd->awake(KILL_QUERY)
Tries to have LOCK_thd_kill but it is hold by thread2 so we wait (note that
we still hold lock_sys mutex and trx mutex).
thread2:
ha_kill_query()
kill_handlerton
innobase_kill_query
lock_trx_handle_wait
lock_mutex_enter() must wait as thread1 is holding it
Thus thread1 lock_sys, trx_mutex waits -> thread2 LOCK_thd_kill waits lock_sys -> thread1
==> thread1 waits -> thread2 waits -> thread1 ==> mutex deadlock.
In this patch we will fix Galera BF and user kill cases so that we enqueue
victim thread to a list while we hold InnoDB mutexes and we then release them.
A new background thread will pick victim thread from this new list and uses
thd::awake() with no InnoDB mutexes. Idea is similar to replication background
kill. This fix enforces that we take LOCK_thd_data -> lock sys mutex -> trx mutex
always in this order.
wsrep_mysqld.cc
Here we introduce a list where victim threads are stored,
condition variable to be used to wake up background thread
and mutex to protect list.
wsrep_thd.cc
Create a new background thread to handle victim thread
abort. We may take wsrep_thd_LOCK mutex here but not any
InnoDB mutexes.
wsrep_innobase_kill_one_trx
Remove all the wsrep code that was moved to wsrep_thd.cc
We just enqueue required information to background kill
list and cancel victim trx lock wait if there is such.
Here we have InnoDB lock sys mutex and trx mutex so here
we can't take wsrep_thd_LOCK mutex.
wsrep_abort_transaction
Cleanup only.
|
|
|
|
|
|
| |
Lock wait can happen on secondary index when doing FK checks for wsrep.
We should just return error to upper layer and applier will retry
operation when needed.
|
|\ |
|
| | |
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This reverts commit 21b2fada7ab7f35c898c02d2f918461409cc9c8e
and commit 81d71ee6b21870772c336bff15b71904914f146a.
The MDEV-18464 change introduces a few data race issues. Contrary to
the documentation, the field trx_t::victim is not always being protected
by lock_sys_t::mutex and trx_t::mutex. Most importantly, it seems
that KILL QUERY could wrongly avoid acquiring both mutexes when
invoking lock_trx_handle_wait_low(), in case another thread had
already set trx->victim=true.
We also revert MDEV-12009, because it should depend on the MDEV-18464
fix being present.
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
priority by Galera
As noted on kill_one_thread SUPER should be able to kill even
system threads i.e. threads/query flagged as high priority or
wsrep applier thread. Normal user, should not able to kill
threads/query flagged as high priority (BF) or wsrep applier
thread.
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
If we have a 2+ node cluster which is replicating from an async master
and the binlog_format is set to STATEMENT and multi-row inserts are executed
on a table with an auto_increment column such that values are automatically
generated by MySQL, then the server node generates wrong auto_increment
values, which are different from what was generated on the async master.
In the title of the MDEV-9519 it was proposed to ban start slave on a Galera
if master binlog_format = statement and wsrep_auto_increment_control = 1,
but the problem can be solved without such a restriction.
The causes and fixes:
1. We need to improve processing of changing the auto-increment values
after changing the cluster size.
2. If wsrep auto_increment_control switched on during operation of
the node, then we should immediately update the auto_increment_increment
and auto_increment_offset global variables, without waiting of the next
invocation of the wsrep_view_handler_cb() callback. In the current version
these variables retain its initial values if wsrep_auto_increment_control
is switched on during operation of the node, which leads to inconsistent
results on the different nodes in some scenarios.
3. If wsrep auto_increment_control switched off during operation of the node,
then we must return the original values of the auto_increment_increment and
auto_increment_offset global variables, as the user has set. To make this
possible, we need to add a "shadow copies" of these variables (which stores
the latest values set by the user).
https://jira.mariadb.org/browse/MDEV-9519
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This patch contains a fix for the MDEV-17262/17243 issues and
new mtr test.
These issues (MDEV-17262/17243) have two reasons:
1) After an intermediate commit, a transaction loses its status
of "transaction that registered in the MySQL for 2pc coordinator"
(in the InnoDB) due to the fact that since version 10.2 the
write_row() function (which located in the ha_innodb.cc) does
not call trx_register_for_2pc(m_prebuilt->trx) during the processing
of split transactions. It is necessary to restore this call inside
the write_row() when an intermediate commit was made (for a split
transaction).
Similarly, we need to set the flag of the started transaction
(m_prebuilt->sql_stat_start) after intermediate commit.
The table->file->extra(HA_EXTRA_FAKE_START_STMT) called from the
wsrep_load_data_split() function (which located in sql_load.cc)
will also do this, but it will be too late. As a result, the call
to the wsrep_append_keys() function from the InnoDB engine may be
lost or function may be called with invalid transaction identifier.
2) If a transaction with the LOAD DATA statement is divided into
logical mini-transactions (of the 10K rows) and binlog is rotated,
then in rare cases due to the wsrep handler re-registration at the
boundary of the split, the last portion of data may be lost. Since
splitting of the LOAD DATA into mini-transactions is technical,
I believe that we should not allow these mini-transactions to fall
into separate binlogs. Therefore, it is necessary to prohibit the
rotation of binlog in the middle of processing LOAD DATA statement.
https://jira.mariadb.org/browse/MDEV-17262 and
https://jira.mariadb.org/browse/MDEV-17243
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Problem was that we skipped background persistent statistics calculation
on applier nodes if thread is marked as high priority (a.k.a BF).
However, on applier nodes all DDL which is replicate will be executed
as high priority i.e BF.
Fixed by allowing background persistent statistics calculation on
applier nodes even when thread is marked as BF. This could lead
BF lock waits but for queries on that node needs that statistics.
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If we have a 2+ node cluster which is replicating from an async master
and the binlog_format is set to STATEMENT and multi-row inserts are executed
on a table with an auto_increment column such that values are automatically
generated by MySQL, then the server node generates wrong auto_increment
values, which are different from what was generated on the async master.
In the title of the MDEV-9519 it was proposed to ban start slave on a Galera
if master binlog_format = statement and wsrep_auto_increment_control = 1,
but the problem can be solved without such a restriction.
The causes and fixes:
1. We need to improve processing of changing the auto-increment values
after changing the cluster size.
2. If wsrep auto_increment_control switched on during operation of
the node, then we should immediately update the auto_increment_increment
and auto_increment_offset global variables, without waiting of the next
invocation of the wsrep_view_handler_cb() callback. In the current version
these variables retain its initial values if wsrep_auto_increment_control
is switched on during operation of the node, which leads to inconsistent
results on the different nodes in some scenarios.
3. If wsrep auto_increment_control switched off during operation of the node,
then we must return the original values of the auto_increment_increment and
auto_increment_offset global variables, as the user has set. To make this
possible, we need to add a "shadow copies" of these variables (which stores
the latest values set by the user).
https://jira.mariadb.org/browse/MDEV-9519
|
|
|
|
|
|
| |
Post-fix for 7e8ed15b95b
Also, apply the same innodb fix to xtradb.
|
|
|
|
|
| |
Also, renamed wsrep_skip_append_keys to wsrep_ignore_table.
Test case : galera.galera_as_slave_gtid.test
|
|
|
|
| |
Note: some tests fail, just as they failed before the merge!
|
| |
|
| |
|
|
|