author    | Zhao Chao <zhaochao1984@gmail.com>   | 2018-03-06 18:08:59 +0800
committer | Elod Illes <elod.illes@ericsson.com> | 2018-07-10 10:56:06 +0200
commit    | 2133963b8975d4d5968b6559c9be36d0f992d8ab (patch)
tree      | 920f0c07faac60ed74384728f0950c4ccbcebb77
parent    | 43d2b96f86a5365d69c885738ea1c3642f4e5aa1 (diff)
download  | trove-2133963b8975d4d5968b6559c9be36d0f992d8ab.tar.gz
Fix zuul check jobs
This patch is a squash of three commits that are needed to have
a successful Zuul run:
1. Use neutronclient for floating IP operations
As add_floating_ip and remove_floating_ip were removed from novaclient,
floating IP operations are migrated to neutronclient (a short sketch of
the equivalent calls follows this item).
This is the first patch for fixing the unit tests. Since the integration
tests did not fail, floating IP operations are not covered there, so it may
be necessary to add new scenario tests for this in follow-up patches.
NOTE(elod.illes): this backport is necessary due to the nova client
constraint update [1]; otherwise the (periodic stable) jobs fail in
stable/queens.
[1] https://review.openstack.org/#/c/571540/
Signed-off-by: Zhao Chao <zhaochao1984@gmail.com>
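
For reference, a minimal sketch (not part of the patch) of the neutronclient
calls that replace the removed novaclient floating IP helpers; the function
names and the client/ID arguments here are illustrative assumptions only:

    # Illustrative only: 'neutron' is a neutronclient.v2_0.client.Client.
    def detach_floating_ip(neutron, fip_id):
        # replaces nova_instance.remove_floating_ip(ip)
        neutron.update_floatingip(fip_id, {'floatingip': {'port_id': None}})

    def attach_floating_ip(neutron, fip_id, port_id, fixed_address):
        # replaces nova_instance.add_floating_ip(ip); the target port and
        # the IPv4 fixed address must now be chosen explicitly by the caller
        neutron.update_floatingip(
            fip_id, {'floatingip': {'port_id': port_id,
                                    'fixed_ip_address': fixed_address}})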
2. Avoid diverged slave when migrating MariaDB master
When promoting one slave to be the new master in a replication group, the
old master was previously attached to the new one right after the new
master came up. For MariaDB, attaching the old master to the new one may
create new GTIDs on the old master, and these may also be synced to some of
the other replicas, as they are still connected to the old master. These
GTIDs do not exist on the new master, so the affected slaves diverge from
it. After that, when a diverged slave connects to the new master,
'START SLAVE' fails with logs like:
[ERROR] Error reading packet from server: Error: connecting slave
requested to start from GTID X-XXXXXXXXXX-XX, which is not in the
master's binlog. Since the master's binlog contains GTIDs with
higher sequence numbers, it probably means that the slave has
diverged due to executing extra erroneous transactions
(server_errno=1236)
These slaves are left orphaned and in an error state after
promote_to_replica_source finishes.
Attaching the other replicas to the new master before dealing with the old
master fixes this problem, as well as the failure of the
trove-scenario-mariadb-multi Zuul job (the reordered sequence is sketched
after this item).
Closes-Bug: #1754539
Signed-off-by: Zhao Chao <zhaochao1984@gmail.com>
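
For clarity, the reordered promotion flow looks roughly as follows; this is
a simplified sketch with assumed names (error handling, notifications and
the exact replica bookkeeping of the real taskmanager code are omitted):

    def promote_to_replica_source(old_master, master_candidate,
                                  other_replicas):
        old_master.make_read_only(True)                      # step 1
        master_ips = old_master.detach_public_ips()
        slave_ips = master_candidate.detach_public_ips()
        latest_txn_id = old_master.get_latest_txn_id()
        master_candidate.wait_for_txn(latest_txn_id)         # step 2
        master_candidate.detach_replica(old_master,
                                        for_failover=True)   # step 3
        master_candidate.enable_as_master()                  # step 4
        master_candidate.attach_public_ips(master_ips)
        master_candidate.make_read_only(False)

        for replica in other_replicas:          # step 5: other replicas first
            replica.detach_replica(old_master, for_failover=True)
            replica.attach_replica(master_candidate)

        old_master.attach_replica(master_candidate)  # step 6: old master last
        old_master.attach_public_ips(slave_ips)
        old_master.demote_replication_master()       # step 7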
3. Add a hook for restore process to check if successful
1. This is the same solution as the one used for the backup process.
Previously we only checked the stderr output of the restore command, but
that method is not reliable, and a change in percona-xtrabackup 2.4.11:
https://jira.percona.com/browse/PXB-1542
caused the gate jobs to fail when creating a slave MySQL instance.
The new hook is currently only used by the InnoBackupEx restore runner,
which checks the exit status and stderr output of xbstream (a simplified
sketch of the hook wiring follows).
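
For reference, a minimal self-contained sketch (assumed class and variable
names, not the actual Trove code) of how such a hook is wired into a
stream-based restore runner; the real changes to restore/base.py and
restore/mysql_impl.py are in the diff below:

    import subprocess

    class RestoreRunner(object):
        def __init__(self, command):
            self.command = command
            self.process = None

        def check_process(self):
            # Hook for subclasses to check the finished restore process.
            return True

        def unpack(self, chunks):
            self.process = subprocess.Popen(self.command, shell=True,
                                            stdin=subprocess.PIPE)
            for chunk in chunks:
                self.process.stdin.write(chunk)
            self.process.stdin.close()
            if not self.check_process():
                raise RuntimeError('restore process reported an error')

    class XbstreamRestoreRunner(RestoreRunner):
        def check_process(self):
            # The real override also scans the xbstream stderr log for
            # unexpected lines; here only the exit status is checked.
            return self.process.wait() == 0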
NOTE(elod.illes): The following part is not backported because [1] is not
backported either:
2. With [1] merged, DIB_CLOUD_IMAGES becomes more flexible, but it breaks
the trovestack image build; now we need to specify a more detailed path in
DIB_CLOUD_IMAGES to get the appropriate content.
[1]: https://review.openstack.org/#/c/568697/
Co-Authored-By: Zhao Chao <zhaochao1984@gmail.com>
Co-Authored-By: zhanggang <zhanggang@cmss.chinamobile.com>
Co-Authored-By: jiansong <jian.song@easystack.cn>
Closes-Bug: #1771990
Change-Id: Ic41f5898db77a09158e0f8bfa4196bdb5e40b49a
(cherry picked from commit 9209d0bf896c8b7c717e072317794f0b141306fc)
(cherry picked from commit 5895cf0ee99022e2910ec8c72393fe998f5860f8)
(cherry picked from commit 43b5807f2d7fe417712226062df8e2f366fff801)
-rw-r--r-- | releasenotes/notes/avoid-diverged-slave-when-migrating-mariadb-master-37e2429a1ea75913.yaml | 12
-rw-r--r-- | trove/guestagent/strategies/restore/base.py | 19
-rw-r--r-- | trove/guestagent/strategies/restore/mysql_impl.py | 40
-rw-r--r-- | trove/instance/models.py | 9
-rw-r--r-- | trove/taskmanager/manager.py | 26
-rwxr-xr-x | trove/taskmanager/models.py | 36
-rw-r--r-- | trove/tests/unittests/guestagent/test_backups.py | 36
-rw-r--r-- | trove/tests/unittests/taskmanager/test_models.py | 56
8 files changed, 196 insertions, 38 deletions
diff --git a/releasenotes/notes/avoid-diverged-slave-when-migrating-mariadb-master-37e2429a1ea75913.yaml b/releasenotes/notes/avoid-diverged-slave-when-migrating-mariadb-master-37e2429a1ea75913.yaml
new file mode 100644
index 00000000..f2c3f04c
--- /dev/null
+++ b/releasenotes/notes/avoid-diverged-slave-when-migrating-mariadb-master-37e2429a1ea75913.yaml
@@ -0,0 +1,12 @@
+---
+fixes:
+  - |
+    MariaDB allows an server to be a master and a slave simutaneously, so when
+    migrating masters, if the old master is reactivated before attaching the
+    other replicas to the new master, new unexpected GTIDs may be created on
+    the old master and synced to some of the other replicas by chance, as the
+    other replicas are still connecting to the old one by the time. After that
+    these diverged slave will fail changing to the new master. This will be
+    fixed by first attaching the other replicas to the new master, and then
+    dealing with old master.
+    Fixes #1754539
diff --git a/trove/guestagent/strategies/restore/base.py b/trove/guestagent/strategies/restore/base.py
index 9620e988..e77c2a41 100644
--- a/trove/guestagent/strategies/restore/base.py
+++ b/trove/guestagent/strategies/restore/base.py
@@ -53,6 +53,7 @@ class RestoreRunner(Strategy):
 
     def __init__(self, storage, **kwargs):
         self.storage = storage
+        self.process = None
         self.location = kwargs.pop('location')
         self.checksum = kwargs.pop('checksum')
         self.restore_location = kwargs.get('restore_location')
@@ -80,15 +81,17 @@ class RestoreRunner(Strategy):
 
     def _unpack(self, location, checksum, command):
         stream = self.storage.load(location, checksum)
-        process = subprocess.Popen(command, shell=True,
-                                   stdin=subprocess.PIPE,
-                                   stderr=subprocess.PIPE)
+        self.process = subprocess.Popen(command, shell=True,
+                                        stdin=subprocess.PIPE,
+                                        stderr=subprocess.PIPE)
         content_length = 0
         for chunk in stream:
-            process.stdin.write(chunk)
+            self.process.stdin.write(chunk)
             content_length += len(chunk)
-        process.stdin.close()
-        utils.raise_if_process_errored(process, RestoreError)
+        self.process.stdin.close()
+        utils.raise_if_process_errored(self.process, RestoreError)
+        if not self.check_process():
+            raise RestoreError
         LOG.debug("Restored %s bytes from stream.", content_length)
         return content_length
@@ -104,3 +107,7 @@ class RestoreRunner(Strategy):
     @property
     def unzip_cmd(self):
         return 'gzip -d -c | ' if self.is_zipped else ''
+
+    def check_process(self):
+        """Hook for subclasses to check the restore process for errors."""
+        return True
diff --git a/trove/guestagent/strategies/restore/mysql_impl.py b/trove/guestagent/strategies/restore/mysql_impl.py
index 27faef4e..6e2a8956 100644
--- a/trove/guestagent/strategies/restore/mysql_impl.py
+++ b/trove/guestagent/strategies/restore/mysql_impl.py
@@ -180,7 +180,8 @@ class MySQLDump(base.RestoreRunner, MySQLRestoreMixin):
 class InnoBackupEx(base.RestoreRunner, MySQLRestoreMixin):
     """Implementation of Restore Strategy for InnoBackupEx."""
     __strategy_name__ = 'innobackupex'
-    base_restore_cmd = 'sudo xbstream -x -C %(restore_location)s'
+    base_restore_cmd = ('sudo xbstream -x -C %(restore_location)s'
+                        ' 2>/tmp/xbstream_extract.log')
     base_prepare_cmd = ('sudo innobackupex'
                        ' --defaults-file=%(restore_location)s/backup-my.cnf'
                        ' --ibbackup=xtrabackup'
@@ -229,6 +230,43 @@ class InnoBackupEx(base.RestoreRunner, MySQLRestoreMixin):
         for f in files:
             os.unlink(f)
 
+    def check_process(self):
+        """Check whether xbstream restore is successful."""
+        # We first check the restore process exits with 0, however
+        # xbstream has a bug for creating new files:
+        # https://jira.percona.com/browse/PXB-1542
+        # So we also check the stderr with ignorance of some known
+        # non-error log lines. Currently we only need to ignore:
+        # "encryption: using gcrypt x.x.x"
+        # After PXB-1542 is fixed, we could just check the exit status.
+        LOG.debug('Checking return code of xbstream restore process.')
+        return_code = self.process.wait()
+        if return_code != 0:
+            LOG.error('xbstream exited with %s', return_code)
+            return False
+
+        LOG.debug('Checking xbstream restore process stderr output.')
+        IGNORE_LINES = [
+            'encryption: using gcrypt ',
+        ]
+        with open('/tmp/xbstream_extract.log', 'r') as xbstream_log:
+            for line in xbstream_log:
+                # Ignore empty lines
+                if not line.strip():
+                    continue
+
+                # Ignore known non-error log lines
+                check_ignorance = [line.startswith(non_err)
+                                   for non_err in IGNORE_LINES]
+                if any(check_ignorance):
+                    continue
+                else:
+                    LOG.error('xbstream restore failed with: %s',
+                              line.rstrip('\n'))
+                    return False
+
+        return True
+
 
 class InnoBackupExIncremental(InnoBackupEx):
     __strategy_name__ = 'innobackupexincremental'
diff --git a/trove/instance/models.py b/trove/instance/models.py
index dbc9487d..ebe8a411 100644
--- a/trove/instance/models.py
+++ b/trove/instance/models.py
@@ -36,6 +36,7 @@ from trove.common.notification import StartNotification
 from trove.common.remote import create_cinder_client
 from trove.common.remote import create_dns_client
 from trove.common.remote import create_guest_client
+from trove.common.remote import create_neutron_client
 from trove.common.remote import create_nova_client
 from trove.common import server_group as srv_grp
 from trove.common import template
@@ -616,6 +617,7 @@ class BaseInstance(SimpleInstance):
         self._guest = None
         self._nova_client = None
         self._volume_client = None
+        self._neutron_client = None
         self._server_group = None
         self._server_group_loaded = False
@@ -709,6 +711,13 @@ class BaseInstance(SimpleInstance):
                 self.context, region_name=self.db_info.region_id)
         return self._volume_client
 
+    @property
+    def neutron_client(self):
+        if not self._neutron_client:
+            self._neutron_client = create_neutron_client(
+                self.context, region_name=self.db_info.region_id)
+        return self._neutron_client
+
     def reset_task_status(self):
         LOG.info("Resetting task status to NONE on instance %s.",
                  self.id)
diff --git a/trove/taskmanager/manager.py b/trove/taskmanager/manager.py
index e600e7ad..7559c1ec 100644
--- a/trove/taskmanager/manager.py
+++ b/trove/taskmanager/manager.py
@@ -99,6 +99,26 @@ class Manager(periodic_task.PeriodicTasks):
                                 replica_models):
         # First, we transition from the old master to new as quickly as
         # possible to minimize the scope of unrecoverable error
+
+        # NOTE(zhaochao): we cannot reattach the old master to the new
+        # one immediately after the new master is up, because for MariaDB
+        # the other replicas are still connecting to the old master, and
+        # during reattaching the old master as a slave, new GTID may be
+        # created and synced to the replicas. After that, when attaching
+        # the replicas to the new master, 'START SLAVE' will fail by
+        # 'fatal error 1236' if the binlog of the replica diverged from
+        # the new master. So the proper order should be:
+        # -1. make the old master read only (and detach floating ips)
+        # -2. make sure the new master is up-to-date
+        # -3. detach the new master from the old one
+        # -4. enable the new master (and attach floating ips)
+        # -5. attach the other replicas to the new master
+        # -6. attach the old master to the new one
+        #     (and attach floating ips)
+        # -7. demote the old master
+        # What we changed here is the order of the 6th step, previously
+        # this step took place right after step 4, which causes failures
+        # with MariaDB replications.
         old_master.make_read_only(True)
         master_ips = old_master.detach_public_ips()
         slave_ips = master_candidate.detach_public_ips()
@@ -106,10 +126,8 @@ class Manager(periodic_task.PeriodicTasks):
         master_candidate.wait_for_txn(latest_txn_id)
         master_candidate.detach_replica(old_master, for_failover=True)
         master_candidate.enable_as_master()
-        old_master.attach_replica(master_candidate)
         master_candidate.attach_public_ips(master_ips)
         master_candidate.make_read_only(False)
-        old_master.attach_public_ips(slave_ips)
 
         # At this point, should something go wrong, there
         # should be a working master with some number of working slaves,
@@ -138,6 +156,10 @@ class Manager(periodic_task.PeriodicTasks):
                     error_messages += "%s (%s)\n" % (
                         exc_fmt % msg_content, ex)
 
+        # dealing with the old master after all the other replicas
+        # has been migrated.
+        old_master.attach_replica(master_candidate)
+        old_master.attach_public_ips(slave_ips)
         try:
             old_master.demote_replication_master()
         except Exception as ex:
diff --git a/trove/taskmanager/models.py b/trove/taskmanager/models.py
index 7ce8c424..fdf1a728 100755
--- a/trove/taskmanager/models.py
+++ b/trove/taskmanager/models.py
@@ -21,6 +21,7 @@ from eventlet import greenthread
 from eventlet.timeout import Timeout
 from novaclient import exceptions as nova_exceptions
 from oslo_log import log as logging
+from oslo_utils import netutils
 from swiftclient.client import ClientException
 
 from trove.backup import models as bkup_models
@@ -1218,31 +1219,46 @@ class BuiltInstanceTasks(BuiltInstance, NotifyMixin, ConfigurationMixin):
     def _get_floating_ips(self):
         """Returns floating ips as a dict indexed by the ip."""
         floating_ips = {}
-        neutron_client = remote.create_neutron_client(self.context)
-        network_floating_ips = neutron_client.list_floatingips()
+        network_floating_ips = self.neutron_client.list_floatingips()
         for ip in network_floating_ips.get('floatingips'):
-            floating_ips.update({ip.get('floating_ip_address'): ip})
+            floating_ips.update(
+                {ip.get('floating_ip_address'): ip.get('id')})
         LOG.debug("In _get_floating_ips(), returning %s", floating_ips)
         return floating_ips
 
     def detach_public_ips(self):
         LOG.debug("Begin detach_public_ips for instance %s", self.id)
         removed_ips = []
-        server_id = self.db_info.compute_instance_id
-        nova_instance = self.nova_client.servers.get(server_id)
         floating_ips = self._get_floating_ips()
         for ip in self.get_visible_ip_addresses():
             if ip in floating_ips:
-                nova_instance.remove_floating_ip(ip)
-                removed_ips.append(ip)
+                fip_id = floating_ips[ip]
+                self.neutron_client.update_floatingip(
+                    fip_id, {'floatingip': {'port_id': None}})
+                removed_ips.append(fip_id)
         return removed_ips
 
     def attach_public_ips(self, ips):
         LOG.debug("Begin attach_public_ips for instance %s", self.id)
         server_id = self.db_info.compute_instance_id
-        nova_instance = self.nova_client.servers.get(server_id)
-        for ip in ips:
-            nova_instance.add_floating_ip(ip)
+
+        # NOTE(zhaochao): in Nova's addFloatingIp, the new floating ip will
+        # always be associated with the first IPv4 fixed address of the Nova
+        # instance, we're doing the same thing here, after add_floating_ip is
+        # removed from novaclient.
+        server_ports = (self.neutron_client.list_ports(device_id=server_id)
+                        .get('ports'))
+        fixed_address, port_id = next(
+            (fixed_ip['ip_address'], port['id'])
+            for port in server_ports
+            for fixed_ip in port.get('fixed_ips')
+            if netutils.is_valid_ipv4(fixed_ip['ip_address']))
+
+        for fip_id in ips:
+            self.neutron_client.update_floatingip(
+                fip_id, {'floatingip': {
+                    'port_id': port_id,
+                    'fixed_ip_address': fixed_address}})
 
     def enable_as_master(self):
         LOG.debug("Calling enable_as_master on %s", self.id)
diff --git a/trove/tests/unittests/guestagent/test_backups.py b/trove/tests/unittests/guestagent/test_backups.py
index aae721f9..a22b9d52 100644
--- a/trove/tests/unittests/guestagent/test_backups.py
+++ b/trove/tests/unittests/guestagent/test_backups.py
@@ -95,7 +95,8 @@ SQLDUMP_BACKUP_RAW = ("mysqldump --all-databases %(extra_opts)s "
 SQLDUMP_BACKUP = SQLDUMP_BACKUP_RAW % {'extra_opts': ''}
 SQLDUMP_BACKUP_EXTRA_OPTS = (SQLDUMP_BACKUP_RAW %
                              {'extra_opts': '--events --routines --triggers'})
-XTRA_RESTORE_RAW = "sudo xbstream -x -C %(restore_location)s"
+XTRA_RESTORE_RAW = ("sudo xbstream -x -C %(restore_location)s"
+                    " 2>/tmp/xbstream_extract.log")
 XTRA_RESTORE = XTRA_RESTORE_RAW % {'restore_location': '/var/lib/mysql/data'}
 XTRA_INCR_PREPARE = ("sudo innobackupex"
                      " --defaults-file=/var/lib/mysql/data/backup-my.cnf"
@@ -1134,3 +1135,36 @@ class CouchDBRestoreTests(trove_testtools.TestCase):
         self.restore_runner.post_restore = mock.Mock()
         self.assertRaises(exception.ProcessExecutionError,
                           self.restore_runner.restore)
+
+
+class MySQLRestoreTests(trove_testtools.TestCase):
+
+    def setUp(self):
+        super(MySQLRestoreTests, self).setUp()
+
+        self.restore_runner = utils.import_class(
+            RESTORE_XTRA_CLS)(
+                'swift', location='http://some.where',
+                checksum='True_checksum',
+                restore_location='/tmp/somewhere')
+
+    def tearDown(self):
+        super(MySQLRestoreTests, self).tearDown()
+
+    def test_restore_success(self):
+        expected_content_length = 123
+        self.restore_runner._run_restore = mock.Mock(
+            return_value=expected_content_length)
+        self.restore_runner.pre_restore = mock.Mock()
+        self.restore_runner.post_restore = mock.Mock()
+        actual_content_length = self.restore_runner.restore()
+        self.assertEqual(
+            expected_content_length, actual_content_length)
+
+    def test_restore_failed_due_to_run_restore(self):
+        self.restore_runner.pre_restore = mock.Mock()
+        self.restore_runner._run_restore = mock.Mock(
+            side_effect=restoreBase.RestoreError('Error'))
+        self.restore_runner.post_restore = mock.Mock()
+        self.assertRaises(restoreBase.RestoreError,
+                          self.restore_runner.restore)
diff --git a/trove/tests/unittests/taskmanager/test_models.py b/trove/tests/unittests/taskmanager/test_models.py
index 3c620de2..6ab49e58 100644
--- a/trove/tests/unittests/taskmanager/test_models.py
+++ b/trove/tests/unittests/taskmanager/test_models.py
@@ -19,6 +19,7 @@ from cinderclient import exceptions as cinder_exceptions
 import cinderclient.v2.client as cinderclient
 from cinderclient.v2 import volumes as cinderclient_volumes
 from mock import Mock, MagicMock, patch, PropertyMock, call
+import neutronclient.v2_0.client as neutronclient
 from novaclient import exceptions as nova_exceptions
 import novaclient.v2.flavors
 import novaclient.v2.servers
@@ -59,11 +60,6 @@ INST_ID = 'dbinst-id-1'
 VOLUME_ID = 'volume-id-1'
 
 
-class _fake_neutron_client(object):
-    def list_floatingips(self):
-        return {'floatingips': [{'floating_ip_address': '192.168.10.1'}]}
-
-
 class FakeOptGroup(object):
     def __init__(self, tcp_ports=['3306', '3301-3307'],
                  udp_ports=[], icmp=False):
@@ -617,6 +613,19 @@ class BuiltInstanceTasksTest(trove_testtools.TestCase):
         stub_volume_mgr.get = MagicMock(return_value=stub_new_volume)
         stub_volume_mgr.attach = MagicMock(return_value=None)
 
+    def _stub_neutron_client(self):
+        stub_neutron_client = self.instance_task._neutron_client = MagicMock(
+            spec=neutronclient.Client)
+        stub_neutron_client.list_floatingips = MagicMock(
+            return_value={'floatingips': [{
+                'floating_ip_address': '192.168.10.1',
+                'id': 'fake-floatingip-id'}]})
+        stub_neutron_client.list_ports = MagicMock(
+            return_value={'ports': [{
+                'fixed_ips': [{'ip_address': '10.0.0.1'},
+                              {'ip_address': 'fd4a:7bef:d1ed:1::1'}],
+                'id': 'fake-port-id'}]})
+
     def setUp(self):
         super(BuiltInstanceTasksTest, self).setUp()
         self.new_flavor = {'id': 8, 'ram': 768, 'name': 'bigger_flavor'}
@@ -726,6 +735,10 @@ class BuiltInstanceTasksTest(trove_testtools.TestCase):
         if 'volume' in self._testMethodName:
             self._stub_volume_client()
 
+        if ('floating_ips' in self._testMethodName or
+                'public_ips' in self._testMethodName):
+            self._stub_neutron_client()
+
     def tearDown(self):
         super(BuiltInstanceTasksTest, self).tearDown()
@@ -856,25 +869,32 @@ class BuiltInstanceTasksTest(trove_testtools.TestCase):
                                            Mock())
 
     def test_get_floating_ips(self):
-        with patch.object(remote, 'create_neutron_client',
-                          return_value=_fake_neutron_client()):
-            floating_ips = self.instance_task._get_floating_ips()
-            self.assertEqual('192.168.10.1',
-                             floating_ips['192.168.10.1'].get(
-                                 'floating_ip_address'))
+        floating_ips = self.instance_task._get_floating_ips()
+        self.assertEqual({'192.168.10.1': 'fake-floatingip-id'},
+                         floating_ips)
 
     @patch.object(BaseInstance, 'get_visible_ip_addresses',
                   return_value=['192.168.10.1'])
     def test_detach_public_ips(self, mock_address):
-        with patch.object(remote, 'create_neutron_client',
-                          return_value=_fake_neutron_client()):
-            removed_ips = self.instance_task.detach_public_ips()
-            self.assertEqual(['192.168.10.1'], removed_ips)
+        removed_ips = self.instance_task.detach_public_ips()
+        self.assertEqual(['fake-floatingip-id'], removed_ips)
+        mock_update_floatingip = (self.instance_task.neutron_client
+                                  .update_floatingip)
+        mock_update_floatingip.assert_called_once_with(
+            removed_ips[0], {'floatingip': {'port_id': None}})
 
     def test_attach_public_ips(self):
-        self.instance_task.attach_public_ips(['192.168.10.1'])
-        self.stub_verifying_server.add_floating_ip.assert_called_with(
-            '192.168.10.1')
+        self.instance_task.attach_public_ips(['fake-floatingip-id'])
+        mock_list_ports = (self.instance_task.neutron_client
+                           .list_ports)
+        mock_list_ports.assert_called_once_with(device_id='computeinst-id-1')
+
+        mock_update_floatingip = (self.instance_task.neutron_client
+                                  .update_floatingip)
+        mock_update_floatingip.assert_called_once_with(
+            'fake-floatingip-id',
+            {'floatingip': {'port_id': 'fake-port-id',
+                            'fixed_ip_address': '10.0.0.1'}})
 
     @patch.object(BaseInstance, 'update_db')
     def test_enable_as_master(self, mock_update_db):