author    | Zhao Chao <zhaochao1984@gmail.com>   | 2018-03-06 18:08:59 +0800
committer | Elod Illes <elod.illes@ericsson.com> | 2018-07-10 10:56:06 +0200
commit    | 2133963b8975d4d5968b6559c9be36d0f992d8ab (patch)
tree      | 920f0c07faac60ed74384728f0950c4ccbcebb77
parent    | 43d2b96f86a5365d69c885738ea1c3642f4e5aa1 (diff)
download  | trove-2133963b8975d4d5968b6559c9be36d0f992d8ab.tar.gz
Fix zuul check jobs
This patch is a squash of three commits that are needed to have
a successful Zuul run:
1. Use neutronclient for floating IP operations
As add_floating_ip and remove_floating_ip were removed from novaclient,
floating IP operations are migrated to neutronclient (a short sketch of
the equivalent calls follows this item).
This is the first patch for fixing the unit tests. Since the integration
tests did not fail, floating IP operations are not covered there, so it may
be necessary to add new scenario tests for this in follow-up patches.
NOTE(elod.illes): this backport is necessary due to the nova client
constraint update [1]; otherwise the (periodic stable) jobs fail in
stable/queens.
[1] https://review.openstack.org/#/c/571540/
Signed-off-by: Zhao Chao <zhaochao1984@gmail.com>
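
For reference, a minimal sketch (not part of the patch) of the neutronclient
calls that replace the removed novaclient floating IP helpers; the function
names and the client/ID arguments here are illustrative assumptions only:

    # Illustrative only: 'neutron' is a neutronclient.v2_0.client.Client.
    def detach_floating_ip(neutron, fip_id):
        # replaces nova_instance.remove_floating_ip(ip)
        neutron.update_floatingip(fip_id, {'floatingip': {'port_id': None}})

    def attach_floating_ip(neutron, fip_id, port_id, fixed_address):
        # replaces nova_instance.add_floating_ip(ip); the target port and
        # the IPv4 fixed address must now be chosen explicitly by the caller
        neutron.update_floatingip(
            fip_id, {'floatingip': {'port_id': port_id,
                                    'fixed_ip_address': fixed_address}})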
2. Avoid diverged slave when migrating MariaDB master
When promoting one slave to be the new master in a replication group, the
old master was previously attached to the new one right after the new
master came up. For MariaDB, attaching the old master to the new one may
create new GTIDs on the old master, and these may also be synced to some of
the other replicas, as they are still connected to the old master. These
GTIDs do not exist on the new master, so the affected slaves diverge from
it. After that, when a diverged slave connects to the new master,
'START SLAVE' fails with logs like:
[ERROR] Error reading packet from server: Error: connecting slave
requested to start from GTID X-XXXXXXXXXX-XX, which is not in the
master's binlog. Since the master's binlog contains GTIDs with
higher sequence numbers, it probably means that the slave has
diverged due to executing extra erroneous transactions
(server_errno=1236)
These slaves are left orphaned and in an error state after
promote_to_replica_source finishes.
Attaching the other replicas to the new master before dealing with the old
master fixes this problem, as well as the failure of the
trove-scenario-mariadb-multi Zuul job (the reordered sequence is sketched
after this item).
Closes-Bug: #1754539
Signed-off-by: Zhao Chao <zhaochao1984@gmail.com>
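
For clarity, the reordered promotion flow looks roughly as follows; this is
a simplified sketch with assumed names (error handling, notifications and
the exact replica bookkeeping of the real taskmanager code are omitted):

    def promote_to_replica_source(old_master, master_candidate,
                                  other_replicas):
        old_master.make_read_only(True)                      # step 1
        master_ips = old_master.detach_public_ips()
        slave_ips = master_candidate.detach_public_ips()
        latest_txn_id = old_master.get_latest_txn_id()
        master_candidate.wait_for_txn(latest_txn_id)         # step 2
        master_candidate.detach_replica(old_master,
                                        for_failover=True)   # step 3
        master_candidate.enable_as_master()                  # step 4
        master_candidate.attach_public_ips(master_ips)
        master_candidate.make_read_only(False)

        for replica in other_replicas:          # step 5: other replicas first
            replica.detach_replica(old_master, for_failover=True)
            replica.attach_replica(master_candidate)

        old_master.attach_replica(master_candidate)  # step 6: old master last
        old_master.attach_public_ips(slave_ips)
        old_master.demote_replication_master()       # step 7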
3. Add a hook for restore process to check if successful
1. This is the same solution as the one used for the backup process.
Previously we only checked the stderr output of the restore command, but
that method is not reliable, and a change in percona-xtrabackup 2.4.11:
https://jira.percona.com/browse/PXB-1542
caused the gate jobs to fail when creating a slave MySQL instance.
The new hook is currently only used by the InnoBackupEx restore runner,
which checks the exit status and stderr output of xbstream (a simplified
sketch of the hook wiring follows).
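
For reference, a minimal self-contained sketch (assumed class and variable
names, not the actual Trove code) of how such a hook is wired into a
stream-based restore runner; the real changes to restore/base.py and
restore/mysql_impl.py are in the diff below:

    import subprocess

    class RestoreRunner(object):
        def __init__(self, command):
            self.command = command
            self.process = None

        def check_process(self):
            # Hook for subclasses to check the finished restore process.
            return True

        def unpack(self, chunks):
            self.process = subprocess.Popen(self.command, shell=True,
                                            stdin=subprocess.PIPE)
            for chunk in chunks:
                self.process.stdin.write(chunk)
            self.process.stdin.close()
            if not self.check_process():
                raise RuntimeError('restore process reported an error')

    class XbstreamRestoreRunner(RestoreRunner):
        def check_process(self):
            # The real override also scans the xbstream stderr log for
            # unexpected lines; here only the exit status is checked.
            return self.process.wait() == 0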
NOTE(elod.illes): The following part is not backported because [1] is not
backported either:
2. With [1] merged, DIB_CLOUD_IMAGES becomes more flexible, but it breaks
the trovestack image build; now we need to specify a more detailed path in
DIB_CLOUD_IMAGES to get the appropriate content.
[1]: https://review.openstack.org/#/c/568697/
Co-Authored-By: Zhao Chao <zhaochao1984@gmail.com>
Co-Authored-By: zhanggang <zhanggang@cmss.chinamobile.com>
Co-Authored-By: jiansong <jian.song@easystack.cn>
Closes-Bug: #1771990
Change-Id: Ic41f5898db77a09158e0f8bfa4196bdb5e40b49a
(cherry picked from commit 9209d0bf896c8b7c717e072317794f0b141306fc)
(cherry picked from commit 5895cf0ee99022e2910ec8c72393fe998f5860f8)
(cherry picked from commit 43b5807f2d7fe417712226062df8e2f366fff801)
-rw-r--r-- | releasenotes/notes/avoid-diverged-slave-when-migrating-mariadb-master-37e2429a1ea75913.yaml | 12
-rw-r--r-- | trove/guestagent/strategies/restore/base.py | 19
-rw-r--r-- | trove/guestagent/strategies/restore/mysql_impl.py | 40
-rw-r--r-- | trove/instance/models.py | 9
-rw-r--r-- | trove/taskmanager/manager.py | 26
-rwxr-xr-x | trove/taskmanager/models.py | 36
-rw-r--r-- | trove/tests/unittests/guestagent/test_backups.py | 36
-rw-r--r-- | trove/tests/unittests/taskmanager/test_models.py | 56
8 files changed, 196 insertions, 38 deletions
diff --git a/releasenotes/notes/avoid-diverged-slave-when-migrating-mariadb-master-37e2429a1ea75913.yaml b/releasenotes/notes/avoid-diverged-slave-when-migrating-mariadb-master-37e2429a1ea75913.yaml
new file mode 100644
index 00000000..f2c3f04c
--- /dev/null
+++ b/releasenotes/notes/avoid-diverged-slave-when-migrating-mariadb-master-37e2429a1ea75913.yaml
@@ -0,0 +1,12 @@
+---
+fixes:
+  - |
+    MariaDB allows an server to be a master and a slave simutaneously, so when
+    migrating masters, if the old master is reactivated before attaching the
+    other replicas to the new master, new unexpected GTIDs may be created on
+    the old master and synced to some of the other replicas by chance, as the
+    other replicas are still connecting to the old one by the time. After that
+    these diverged slave will fail changing to the new master. This will be
+    fixed by first attaching the other replicas to the new master, and then
+    dealing with old master.
+    Fixes #1754539
diff --git a/trove/guestagent/strategies/restore/base.py b/trove/guestagent/strategies/restore/base.py
index 9620e988..e77c2a41 100644
--- a/trove/guestagent/strategies/restore/base.py
+++ b/trove/guestagent/strategies/restore/base.py
@@ -53,6 +53,7 @@ class RestoreRunner(Strategy):
 
     def __init__(self, storage, **kwargs):
         self.storage = storage
+        self.process = None
         self.location = kwargs.pop('location')
         self.checksum = kwargs.pop('checksum')
         self.restore_location = kwargs.get('restore_location')
@@ -80,15 +81,17 @@ class RestoreRunner(Strategy):
 
     def _unpack(self, location, checksum, command):
         stream = self.storage.load(location, checksum)
-        process = subprocess.Popen(command, shell=True,
-                                   stdin=subprocess.PIPE,
-                                   stderr=subprocess.PIPE)
+        self.process = subprocess.Popen(command, shell=True,
+                                        stdin=subprocess.PIPE,
+                                        stderr=subprocess.PIPE)
         content_length = 0
         for chunk in stream:
-            process.stdin.write(chunk)
+            self.process.stdin.write(chunk)
             content_length += len(chunk)
-        process.stdin.close()
-        utils.raise_if_process_errored(process, RestoreError)
+        self.process.stdin.close()
+        utils.raise_if_process_errored(self.process, RestoreError)
+        if not self.check_process():
+            raise RestoreError
         LOG.debug("Restored %s bytes from stream.", content_length)
         return content_length
@@ -104,3 +107,7 @@ class RestoreRunner(Strategy):
     @property
     def unzip_cmd(self):
         return 'gzip -d -c | ' if self.is_zipped else ''
+
+    def check_process(self):
+        """Hook for subclasses to check the restore process for errors."""
+        return True
diff --git a/trove/guestagent/strategies/restore/mysql_impl.py b/trove/guestagent/strategies/restore/mysql_impl.py
index 27faef4e..6e2a8956 100644
--- a/trove/guestagent/strategies/restore/mysql_impl.py
+++ b/trove/guestagent/strategies/restore/mysql_impl.py
@@ -180,7 +180,8 @@ class MySQLDump(base.RestoreRunner, MySQLRestoreMixin):
 class InnoBackupEx(base.RestoreRunner, MySQLRestoreMixin):
     """Implementation of Restore Strategy for InnoBackupEx."""
     __strategy_name__ = 'innobackupex'
-    base_restore_cmd = 'sudo xbstream -x -C %(restore_location)s'
+    base_restore_cmd = ('sudo xbstream -x -C %(restore_location)s'
+                        ' 2>/tmp/xbstream_extract.log')
     base_prepare_cmd = ('sudo innobackupex'
                        ' --defaults-file=%(restore_location)s/backup-my.cnf'
                        ' --ibbackup=xtrabackup'
@@ -229,6 +230,43 @@ class InnoBackupEx(base.RestoreRunner, MySQLRestoreMixin):
         for f in files:
             os.unlink(f)
 
+    def check_process(self):
+        """Check whether xbstream restore is successful."""
+        # We first check the restore process exits with 0, however
+        # xbstream has a bug for creating new files:
+        # https://jira.percona.com/browse/PXB-1542
+        # So we also check the stderr with ignorance of some known
+        # non-error log lines. Currently we only need to ignore:
+        # "encryption: using gcrypt x.x.x"
+        # After PXB-1542 is fixed, we could just check the exit status.
+        LOG.debug('Checking return code of xbstream restore process.')
+        return_code = self.process.wait()
+        if return_code != 0:
+            LOG.error('xbstream exited with %s', return_code)
+            return False
+
+        LOG.debug('Checking xbstream restore process stderr output.')
+        IGNORE_LINES = [
+            'encryption: using gcrypt ',
+        ]
+        with open('/tmp/xbstream_extract.log', 'r') as xbstream_log:
+            for line in xbstream_log:
+                # Ignore empty lines
+                if not line.strip():
+                    continue
+
+                # Ignore known non-error log lines
+                check_ignorance = [line.startswith(non_err)
+                                   for non_err in IGNORE_LINES]
+                if any(check_ignorance):
+                    continue
+                else:
+                    LOG.error('xbstream restore failed with: %s',
+                              line.rstrip('\n'))
+                    return False
+
+        return True
+
 
 class InnoBackupExIncremental(InnoBackupEx):
     __strategy_name__ = 'innobackupexincremental'
diff --git a/trove/instance/models.py b/trove/instance/models.py
index dbc9487d..ebe8a411 100644
--- a/trove/instance/models.py
+++ b/trove/instance/models.py
@@ -36,6 +36,7 @@ from trove.common.notification import StartNotification
 from trove.common.remote import create_cinder_client
 from trove.common.remote import create_dns_client
 from trove.common.remote import create_guest_client
+from trove.common.remote import create_neutron_client
 from trove.common.remote import create_nova_client
 from trove.common import server_group as srv_grp
 from trove.common import template
@@ -616,6 +617,7 @@ class BaseInstance(SimpleInstance):
         self._guest = None
         self._nova_client = None
         self._volume_client = None
+        self._neutron_client = None
         self._server_group = None
         self._server_group_loaded = False
@@ -709,6 +711,13 @@ class BaseInstance(SimpleInstance):
                 self.context, region_name=self.db_info.region_id)
         return self._volume_client
 
+    @property
+    def neutron_client(self):
+        if not self._neutron_client:
+            self._neutron_client = create_neutron_client(
+                self.context, region_name=self.db_info.region_id)
+        return self._neutron_client
+
     def reset_task_status(self):
         LOG.info("Resetting task status to NONE on instance %s.",
                  self.id)
diff --git a/trove/taskmanager/manager.py b/trove/taskmanager/manager.py
index e600e7ad..7559c1ec 100644
--- a/trove/taskmanager/manager.py
+++ b/trove/taskmanager/manager.py
@@ -99,6 +99,26 @@ class Manager(periodic_task.PeriodicTasks):
                                 replica_models):
         # First, we transition from the old master to new as quickly as
         # possible to minimize the scope of unrecoverable error
+
+        # NOTE(zhaochao): we cannot reattach the old master to the new
+        # one immediately after the new master is up, because for MariaDB
+        # the other replicas are still connecting to the old master, and
+        # during reattaching the old master as a slave, new GTID may be
+        # created and synced to the replicas. After that, when attaching
+        # the replicas to the new master, 'START SLAVE' will fail by
+        # 'fatal error 1236' if the binlog of the replica diverged from
+        # the new master. So the proper order should be:
+        # -1. make the old master read only (and detach floating ips)
+        # -2. make sure the new master is up-to-date
+        # -3. detach the new master from the old one
+        # -4. enable the new master (and attach floating ips)
+        # -5. attach the other replicas to the new master
+        # -6. attach the old master to the new one
+        #     (and attach floating ips)
+        # -7. demote the old master
+        # What we changed here is the order of the 6th step, previously
+        # this step took place right after step 4, which causes failures
+        # with MariaDB replications.
         old_master.make_read_only(True)
         master_ips = old_master.detach_public_ips()
         slave_ips = master_candidate.detach_public_ips()
@@ -106,10 +126,8 @@ class Manager(periodic_task.PeriodicTasks):
         master_candidate.wait_for_txn(latest_txn_id)
         master_candidate.detach_replica(old_master, for_failover=True)
         master_candidate.enable_as_master()
-        old_master.attach_replica(master_candidate)
         master_candidate.attach_public_ips(master_ips)
         master_candidate.make_read_only(False)
-        old_master.attach_public_ips(slave_ips)
 
         # At this point, should something go wrong, there
         # should be a working master with some number of working slaves,
@@ -138,6 +156,10 @@ class Manager(periodic_task.PeriodicTasks):
                     error_messages += "%s (%s)\n" % (
                         exc_fmt % msg_content, ex)
 
+        # dealing with the old master after all the other replicas
+        # has been migrated.
+        old_master.attach_replica(master_candidate)
+        old_master.attach_public_ips(slave_ips)
         try:
             old_master.demote_replication_master()
         except Exception as ex:
diff --git a/trove/taskmanager/models.py b/trove/taskmanager/models.py
index 7ce8c424..fdf1a728 100755
--- a/trove/taskmanager/models.py
+++ b/trove/taskmanager/models.py
@@ -21,6 +21,7 @@ from eventlet import greenthread
 from eventlet.timeout import Timeout
 from novaclient import exceptions as nova_exceptions
 from oslo_log import log as logging
+from oslo_utils import netutils
 from swiftclient.client import ClientException
 
 from trove.backup import models as bkup_models
@@ -1218,31 +1219,46 @@ class BuiltInstanceTasks(BuiltInstance, NotifyMixin, ConfigurationMixin):
     def _get_floating_ips(self):
         """Returns floating ips as a dict indexed by the ip."""
         floating_ips = {}
-        neutron_client = remote.create_neutron_client(self.context)
-        network_floating_ips = neutron_client.list_floatingips()
+        network_floating_ips = self.neutron_client.list_floatingips()
         for ip in network_floating_ips.get('floatingips'):
-            floating_ips.update({ip.get('floating_ip_address'): ip})
+            floating_ips.update(
+                {ip.get('floating_ip_address'): ip.get('id')})
         LOG.debug("In _get_floating_ips(), returning %s", floating_ips)
         return floating_ips
 
     def detach_public_ips(self):
         LOG.debug("Begin detach_public_ips for instance %s", self.id)
         removed_ips = []
-        server_id = self.db_info.compute_instance_id
-        nova_instance = self.nova_client.servers.get(server_id)
         floating_ips = self._get_floating_ips()
         for ip in self.get_visible_ip_addresses():
             if ip in floating_ips:
-                nova_instance.remove_floating_ip(ip)
-                removed_ips.append(ip)
+                fip_id = floating_ips[ip]
+                self.neutron_client.update_floatingip(
+                    fip_id, {'floatingip': {'port_id': None}})
+                removed_ips.append(fip_id)
         return removed_ips
 
     def attach_public_ips(self, ips):
         LOG.debug("Begin attach_public_ips for instance %s", self.id)
         server_id = self.db_info.compute_instance_id
-        nova_instance = self.nova_client.servers.get(server_id)
-        for ip in ips:
-            nova_instance.add_floating_ip(ip)
+
+        # NOTE(zhaochao): in Nova's addFloatingIp, the new floating ip will
+        # always be associated with the first IPv4 fixed address of the Nova
+        # instance, we're doing the same thing here, after add_floating_ip is
+        # removed from novaclient.
+        server_ports = (self.neutron_client.list_ports(device_id=server_id)
+                        .get('ports'))
+        fixed_address, port_id = next(
+            (fixed_ip['ip_address'], port['id'])
+            for port in server_ports
+            for fixed_ip in port.get('fixed_ips')
+            if netutils.is_valid_ipv4(fixed_ip['ip_address']))
+
+        for fip_id in ips:
+            self.neutron_client.update_floatingip(
+                fip_id, {'floatingip': {
+                    'port_id': port_id,
+                    'fixed_ip_address': fixed_address}})
 
     def enable_as_master(self):
         LOG.debug("Calling enable_as_master on %s", self.id)
diff --git a/trove/tests/unittests/guestagent/test_backups.py b/trove/tests/unittests/guestagent/test_backups.py
index aae721f9..a22b9d52 100644
--- a/trove/tests/unittests/guestagent/test_backups.py
+++ b/trove/tests/unittests/guestagent/test_backups.py
@@ -95,7 +95,8 @@ SQLDUMP_BACKUP_RAW = ("mysqldump --all-databases %(extra_opts)s "
 SQLDUMP_BACKUP = SQLDUMP_BACKUP_RAW % {'extra_opts': ''}
 SQLDUMP_BACKUP_EXTRA_OPTS = (SQLDUMP_BACKUP_RAW %
                              {'extra_opts': '--events --routines --triggers'})
-XTRA_RESTORE_RAW = "sudo xbstream -x -C %(restore_location)s"
+XTRA_RESTORE_RAW = ("sudo xbstream -x -C %(restore_location)s"
+                    " 2>/tmp/xbstream_extract.log")
 XTRA_RESTORE = XTRA_RESTORE_RAW % {'restore_location': '/var/lib/mysql/data'}
 XTRA_INCR_PREPARE = ("sudo innobackupex"
                      " --defaults-file=/var/lib/mysql/data/backup-my.cnf"
@@ -1134,3 +1135,36 @@ class CouchDBRestoreTests(trove_testtools.TestCase):
         self.restore_runner.post_restore = mock.Mock()
         self.assertRaises(exception.ProcessExecutionError,
                           self.restore_runner.restore)
+
+
+class MySQLRestoreTests(trove_testtools.TestCase):
+
+    def setUp(self):
+        super(MySQLRestoreTests, self).setUp()
+
+        self.restore_runner = utils.import_class(
+            RESTORE_XTRA_CLS)(
+                'swift', location='http://some.where',
+                checksum='True_checksum',
+                restore_location='/tmp/somewhere')
+
+    def tearDown(self):
+        super(MySQLRestoreTests, self).tearDown()
+
+    def test_restore_success(self):
+        expected_content_length = 123
+        self.restore_runner._run_restore = mock.Mock(
+            return_value=expected_content_length)
+        self.restore_runner.pre_restore = mock.Mock()
+        self.restore_runner.post_restore = mock.Mock()
+        actual_content_length = self.restore_runner.restore()
+        self.assertEqual(
+            expected_content_length, actual_content_length)
+
+    def test_restore_failed_due_to_run_restore(self):
+        self.restore_runner.pre_restore = mock.Mock()
+        self.restore_runner._run_restore = mock.Mock(
+            side_effect=restoreBase.RestoreError('Error'))
+        self.restore_runner.post_restore = mock.Mock()
+        self.assertRaises(restoreBase.RestoreError,
+                          self.restore_runner.restore)
diff --git a/trove/tests/unittests/taskmanager/test_models.py b/trove/tests/unittests/taskmanager/test_models.py
index 3c620de2..6ab49e58 100644
--- a/trove/tests/unittests/taskmanager/test_models.py
+++ b/trove/tests/unittests/taskmanager/test_models.py
@@ -19,6 +19,7 @@ from cinderclient import exceptions as cinder_exceptions
 import cinderclient.v2.client as cinderclient
 from cinderclient.v2 import volumes as cinderclient_volumes
 from mock import Mock, MagicMock, patch, PropertyMock, call
+import neutronclient.v2_0.client as neutronclient
 from novaclient import exceptions as nova_exceptions
 import novaclient.v2.flavors
 import novaclient.v2.servers
@@ -59,11 +60,6 @@ INST_ID = 'dbinst-id-1'
 VOLUME_ID = 'volume-id-1'
 
 
-class _fake_neutron_client(object):
-    def list_floatingips(self):
-        return {'floatingips': [{'floating_ip_address': '192.168.10.1'}]}
-
-
 class FakeOptGroup(object):
     def __init__(self, tcp_ports=['3306', '3301-3307'],
                  udp_ports=[], icmp=False):
@@ -617,6 +613,19 @@ class BuiltInstanceTasksTest(trove_testtools.TestCase):
         stub_volume_mgr.get = MagicMock(return_value=stub_new_volume)
         stub_volume_mgr.attach = MagicMock(return_value=None)
 
+    def _stub_neutron_client(self):
+        stub_neutron_client = self.instance_task._neutron_client = MagicMock(
+            spec=neutronclient.Client)
+        stub_neutron_client.list_floatingips = MagicMock(
+            return_value={'floatingips': [{
+                'floating_ip_address': '192.168.10.1',
+                'id': 'fake-floatingip-id'}]})
+        stub_neutron_client.list_ports = MagicMock(
+            return_value={'ports': [{
+                'fixed_ips': [{'ip_address': '10.0.0.1'},
+                              {'ip_address': 'fd4a:7bef:d1ed:1::1'}],
+                'id': 'fake-port-id'}]})
+
     def setUp(self):
         super(BuiltInstanceTasksTest, self).setUp()
         self.new_flavor = {'id': 8, 'ram': 768, 'name': 'bigger_flavor'}
@@ -726,6 +735,10 @@ class BuiltInstanceTasksTest(trove_testtools.TestCase):
         if 'volume' in self._testMethodName:
             self._stub_volume_client()
 
+        if ('floating_ips' in self._testMethodName or
+                'public_ips' in self._testMethodName):
+            self._stub_neutron_client()
+
     def tearDown(self):
         super(BuiltInstanceTasksTest, self).tearDown()
@@ -856,25 +869,32 @@ class BuiltInstanceTasksTest(trove_testtools.TestCase):
                                            Mock())
 
     def test_get_floating_ips(self):
-        with patch.object(remote, 'create_neutron_client',
-                          return_value=_fake_neutron_client()):
-            floating_ips = self.instance_task._get_floating_ips()
-            self.assertEqual('192.168.10.1',
-                             floating_ips['192.168.10.1'].get(
-                                 'floating_ip_address'))
+        floating_ips = self.instance_task._get_floating_ips()
+        self.assertEqual({'192.168.10.1': 'fake-floatingip-id'},
+                         floating_ips)
 
     @patch.object(BaseInstance, 'get_visible_ip_addresses',
                   return_value=['192.168.10.1'])
     def test_detach_public_ips(self, mock_address):
-        with patch.object(remote, 'create_neutron_client',
-                          return_value=_fake_neutron_client()):
-            removed_ips = self.instance_task.detach_public_ips()
-            self.assertEqual(['192.168.10.1'], removed_ips)
+        removed_ips = self.instance_task.detach_public_ips()
+        self.assertEqual(['fake-floatingip-id'], removed_ips)
+        mock_update_floatingip = (self.instance_task.neutron_client
+                                  .update_floatingip)
+        mock_update_floatingip.assert_called_once_with(
+            removed_ips[0], {'floatingip': {'port_id': None}})
 
     def test_attach_public_ips(self):
-        self.instance_task.attach_public_ips(['192.168.10.1'])
-        self.stub_verifying_server.add_floating_ip.assert_called_with(
-            '192.168.10.1')
+        self.instance_task.attach_public_ips(['fake-floatingip-id'])
+        mock_list_ports = (self.instance_task.neutron_client
+                           .list_ports)
+        mock_list_ports.assert_called_once_with(device_id='computeinst-id-1')
+
+        mock_update_floatingip = (self.instance_task.neutron_client
+                                  .update_floatingip)
+        mock_update_floatingip.assert_called_once_with(
+            'fake-floatingip-id',
+            {'floatingip': {'port_id': 'fake-port-id',
+                            'fixed_ip_address': '10.0.0.1'}})
 
     @patch.object(BaseInstance, 'update_db')
     def test_enable_as_master(self, mock_update_db):