---
stage: Enablement
group: Geo
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#designated-technical-writers
type: howto
---

CAUTION: **Caution:**
This runbook is in **alpha**. For complete, production-ready documentation, see the
[disaster recovery documentation](index.md).

# Disaster Recovery (Geo) promotion runbooks **(PREMIUM ONLY)**

## Geo planned failover runbook 1

| Component   | Configuration   |
| ----------- | --------------- |
| PostgreSQL  | Omnibus-managed |
| Geo site    | Single-node     |
| Secondaries | One             |

This runbook guides you through a planned failover of a single-node Geo site
with one secondary. The following general architecture is assumed:

```mermaid
graph TD
  subgraph main[Geo deployment]
    subgraph Primary[Primary site]
      Node_1[(GitLab node)]
    end
    subgraph Secondary1[Secondary site]
      Node_2[(GitLab node)]
    end
  end
```

This guide results in the following:

1. An offline primary.
1. A promoted secondary that is now the new primary.

What is not covered:

1. Re-adding the old **primary** as a secondary.
1. Adding a new secondary.

### Preparation

NOTE: **Note:**
Before following any of these steps, make sure you have `root` access to the
**secondary** to promote it, since there is no automated way to promote a Geo
replica and perform a failover.

On the **secondary** node, navigate to the **Admin Area > Geo** dashboard to
review its status. Replicated objects (shown in green) should be close to 100%,
and there should be no failures (shown in red). If a large proportion of
objects aren't yet replicated (shown in gray), consider giving the node more
time to complete.

![Replication status](img/replication-status.png)

If any objects are failing to replicate, investigate them before scheduling
the maintenance window. After a planned failover, anything that failed to
replicate is **lost**.

You can use the
[Geo status API](../../../api/geo_nodes.md#retrieve-project-sync-or-verification-failures-that-occurred-on-the-current-node)
to review failed objects and the reasons for failure.
A common cause of replication failures is the data being missing on the
**primary** node - you can resolve these failures by restoring the data from backup,
or by removing references to the missing data.
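If you prefer to review those failures from a terminal, you can query the same
endpoint directly. This is only a minimal sketch, assuming `<secondary-url>` is
your secondary's URL and `<your-access-token>` is a personal access token with
API read access (both placeholders you must substitute):

```shell
# List objects that failed to sync or verify on the current (secondary) node.
# <secondary-url> and <your-access-token> are placeholders.
curl --header "PRIVATE-TOKEN: <your-access-token>" \
  "https://<secondary-url>/api/v4/geo_nodes/current/failures"
```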
The maintenance window doesn't end until Geo replication and verification are
completely finished. To keep the window as short as possible, you should
ensure these processes are as close to 100% as possible during active use.

If the **secondary** node is still replicating data from the **primary** node,
follow these steps to avoid unnecessary data loss:

1. Until a [read-only mode](https://gitlab.com/gitlab-org/gitlab/-/issues/14609)
   is implemented, updates must be prevented from happening manually to the
   **primary**. Note that your **secondary** node still needs read-only
   access to the **primary** node during the maintenance window:

   1. At the scheduled time, using your cloud provider or your node's firewall, block
      all HTTP, HTTPS and SSH traffic to/from the **primary** node, **except** for your IP and
      the **secondary** node's IP.

      For instance, you can run the following commands on the **primary** node:

      ```shell
      sudo iptables -A INPUT -p tcp -s <secondary_node_ip> --destination-port 22 -j ACCEPT
      sudo iptables -A INPUT -p tcp -s <your_ip> --destination-port 22 -j ACCEPT
      sudo iptables -A INPUT -p tcp --destination-port 22 -j REJECT

      sudo iptables -A INPUT -p tcp -s <secondary_node_ip> --destination-port 80 -j ACCEPT
      sudo iptables -A INPUT -p tcp -s <your_ip> --destination-port 80 -j ACCEPT
      sudo iptables -A INPUT -p tcp --destination-port 80 -j REJECT

      sudo iptables -A INPUT -p tcp -s <secondary_node_ip> --destination-port 443 -j ACCEPT
      sudo iptables -A INPUT -p tcp -s <your_ip> --destination-port 443 -j ACCEPT
      sudo iptables -A INPUT -p tcp --destination-port 443 -j REJECT
      ```

      From this point, users are unable to view their data or make changes on the
      **primary** node. They are also unable to log in to the **secondary** node.
      However, existing sessions work for the remainder of the maintenance period, and
      public data is accessible throughout.

   1. Verify the **primary** node is blocked to HTTP traffic by visiting it in a browser
      from another IP. The server should refuse the connection.

   1. Verify the **primary** node is blocked to Git over SSH traffic by attempting to pull an
      existing Git repository with an SSH remote URL. The server should refuse the
      connection.

   1. On the **primary** node, disable non-Geo periodic background jobs by navigating
      to **Admin Area > Monitoring > Background Jobs > Cron**, clicking `Disable All`,
      and then clicking `Enable` for the `geo_sidekiq_cron_config_worker` cron job.
      This job re-enables several other cron jobs that are essential for planned
      failover to complete successfully.

1. Finish replicating and verifying all data:

   CAUTION: **Caution:**
   Not all data is automatically replicated. Read more about
   [what is excluded](planned_failover.md#not-all-data-is-automatically-replicated).

   1. If you are manually replicating any
      [data not managed by Geo](../replication/datatypes.md#limitations-on-replicationverification),
      trigger the final replication process now.
   1. On the **primary** node, navigate to **Admin Area > Monitoring > Background Jobs > Queues**
      and wait for all queues except those with `geo` in the name to drop to 0.
      These queues contain work that has been submitted by your users; failing over
      before it is completed causes the work to be lost.
   1. On the **primary** node, navigate to **Admin Area > Geo** and wait for the
      following conditions to be true of the **secondary** node you are failing over to
      (see the monitoring sketch at the end of this step):

      - All replication meters reach 100% replicated, 0% failures.
      - All verification meters reach 100% verified, 0% failures.
      - Database replication lag is 0 ms.
      - The Geo log cursor is up to date (0 events behind).

   1. On the **secondary** node, navigate to **Admin Area > Monitoring > Background Jobs > Queues**
      and wait for all the `geo` queues to drop to 0 queued and 0 running jobs.
   1. On the **secondary** node, use [these instructions](../../raketasks/check.md)
      to verify the integrity of CI artifacts, LFS objects, and uploads in file
      storage.
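   Rather than refreshing the dashboards by hand, you can watch replication and
   verification progress from a shell on the **secondary** node. This is only a
   minimal sketch, assuming the `geo:status` Rake task is available in your
   GitLab version:

   ```shell
   # Re-run the Geo status Rake task every 30 seconds; stop watching
   # (Ctrl-C) once everything reports 100% replicated and verified.
   sudo watch --interval 30 gitlab-rake geo:status
   ```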
   At this point, your **secondary** node contains an up-to-date copy of everything the
   **primary** node has, meaning nothing is lost when you fail over.

1. In this final step, you need to permanently disable the **primary** node.

   CAUTION: **Caution:**
   When the **primary** node goes offline, there may be data saved on the **primary** node
   that has not been replicated to the **secondary** node. If you proceed, this data
   should be treated as lost.

   TIP: **Tip:**
   If you plan to [update the **primary** domain DNS record](index.md#step-4-optional-updating-the-primary-domain-dns-record),
   you may wish to lower the TTL now to speed up propagation.

   When performing a failover, we want to avoid a split-brain situation where
   writes can occur in two different GitLab instances. To prepare for the
   failover, you must disable the **primary** node:

   - If you have SSH access to the **primary** node, stop and disable GitLab:

     ```shell
     sudo gitlab-ctl stop
     ```

     Prevent GitLab from starting up again if the server unexpectedly reboots:

     ```shell
     sudo systemctl disable gitlab-runsvdir
     ```

     NOTE: **Note:**
     (**CentOS only**) In CentOS 6 or older, there is no easy way to prevent GitLab from
     being started if the machine reboots (see [Omnibus GitLab issue #3058](https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/3058)).
     It may be safest to uninstall the GitLab package completely with `sudo yum remove gitlab-ee`.

     NOTE: **Note:**
     (**Ubuntu 14.04 LTS**) If you are using an older version of Ubuntu
     or any other distribution based on the Upstart init system, you can prevent GitLab
     from starting if the machine reboots by running, as `root`:
     `initctl stop gitlab-runsvdir && echo 'manual' > /etc/init/gitlab-runsvdir.override && initctl reload-configuration`.

   - If you do not have SSH access to the **primary** node, take the machine offline and
     prevent it from rebooting. Since there are many ways you may prefer to accomplish
     this, we avoid a single recommendation. You may need to:

     - Reconfigure the load balancers.
     - Change DNS records (for example, point the **primary** DNS record to the **secondary**
       node in order to stop usage of the **primary** node).
     - Stop the virtual servers.
     - Block traffic through a firewall.
     - Revoke object storage permissions from the **primary** node.
     - Physically disconnect a machine.

### Promoting the **secondary** node

Note the following when promoting a secondary:

- A new **secondary** should not be added at this time. If you want to add a new
  **secondary**, do this after you have completed the entire process of promoting
  the **secondary** to the **primary**.
- If you encounter an `ActiveRecord::RecordInvalid: Validation failed: Name has already been taken`
  error during this process, read
  [the troubleshooting advice](../replication/troubleshooting.md#fixing-errors-during-a-failover-or-when-promoting-a-secondary-to-a-primary-node).

To promote the secondary node:

1. SSH in to your **secondary** node and log in as root:

   ```shell
   sudo -i
   ```

1. Edit `/etc/gitlab/gitlab.rb` to reflect its new status as **primary** by
   removing any lines that enabled the `geo_secondary_role`:

   ```ruby
   ## In pre-11.5 documentation, the role was enabled as follows. Remove this line.
   geo_secondary_role['enable'] = true

   ## In 11.5+ documentation, the role was enabled as follows. Remove this line.
   roles ['geo_secondary_role']
   ```
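   To confirm you removed every occurrence before promoting, a quick check
   (only a sketch; the path is the standard Omnibus location) is:

   ```shell
   # Should print no matching lines once the secondary role is fully removed.
   sudo grep -n "geo_secondary_role" /etc/gitlab/gitlab.rb
   ```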
1. Run the following command to list all preflight checks and automatically
   confirm that replication and verification are complete, so the failover
   goes smoothly:

   ```shell
   gitlab-ctl promotion-preflight-checks
   ```

1. Promote the **secondary**:

   ```shell
   gitlab-ctl promote-to-primary-node
   ```

   If you have already run the [preflight checks](planned_failover.md#preflight-checks)
   or don't want to run them, you can skip them:

   ```shell
   gitlab-ctl promote-to-primary-node --skip-preflight-check
   ```

   You can also promote the secondary node to primary **without any further confirmation**,
   even when preflight checks fail:

   ```shell
   sudo gitlab-ctl promote-to-primary-node --force
   ```

1. Verify you can connect to the newly promoted **primary** node using the URL used
   previously for the **secondary** node.

   If successful, the **secondary** node is now promoted to the **primary** node.

### Next steps

To regain geographic redundancy as quickly as possible, you should
[add a new **secondary** node](../setup/index.md). To do that, you can re-add
the old **primary** as a new secondary and bring it back online.
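Before re-adding a secondary, you may want a final smoke test that the promoted
node is serving web traffic. A minimal sketch, assuming `<new-primary-url>` is a
placeholder for the URL formerly used for the **secondary** node:

```shell
# A 200 or 302 response from the sign-in page indicates the promoted
# node is answering HTTP requests. <new-primary-url> is a placeholder.
curl -s -o /dev/null -w "%{http_code}\n" "https://<new-primary-url>/users/sign_in"
```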