# Disaster recovery for planned failover **[PREMIUM ONLY]**

The primary use case of Disaster Recovery is to ensure business continuity in
the event of an unplanned outage, but it can be used in conjunction with a
planned failover to migrate your GitLab instance between regions without
extended downtime.

As replication between Geo nodes is asynchronous, a planned failover requires
a maintenance window in which updates to the **primary** node are blocked. The
length of this window is determined by your replication capacity - once the
**secondary** node is completely synchronized with the **primary** node, the
failover can occur without data loss.

This document assumes you already have a fully configured, working Geo setup.
Please read it and the [Disaster Recovery][disaster-recovery] failover
documentation in full before proceeding. Planned failover is a major operation,
and if performed incorrectly, there is a high risk of data loss. Consider
rehearsing the procedure until you are comfortable with the necessary steps and
have a high degree of confidence in being able to perform them accurately.

## Not all data is automatically replicated

If you are using any GitLab features that Geo [doesn't support][limitations],
you must make separate provisions to ensure that the **secondary** node has an
up-to-date copy of any data associated with that feature. This may extend the
required scheduled maintenance period significantly.

A common strategy for keeping this period as short as possible for data stored
in files is to use `rsync` to transfer the data. An initial `rsync` can be
performed ahead of the maintenance window; subsequent `rsync`s (including a
final transfer inside the maintenance window) will then transfer only the
*changes* between the **primary** node and the **secondary** nodes.

Repository-centric strategies for using `rsync` effectively can be found in the
[moving repositories][moving-repositories] documentation; these strategies can
be adapted for use with any other file-based data, such as GitLab Pages (found
in `/var/opt/gitlab/gitlab-rails/shared/pages` if using Omnibus).
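As a minimal sketch of this approach for Pages data, assuming an Omnibus
installation and SSH access from the **primary** node to the **secondary**
node (`<secondary_node_fqdn>` is a placeholder):

```sh
# Initial bulk copy, run well ahead of the maintenance window.
sudo rsync -avz --delete /var/opt/gitlab/gitlab-rails/shared/pages/ \
  <secondary_node_fqdn>:/var/opt/gitlab/gitlab-rails/shared/pages/

# Re-run the same command inside the maintenance window; rsync compares
# both sides and transfers only files changed since the previous pass.
```

`--delete` keeps the **secondary** node from retaining files that were removed
on the **primary** node; omit it if you prefer to clean up stale files manually.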
## Pre-flight checks

Follow these steps before scheduling a planned failover to ensure the process
will go smoothly.

### Object storage

Some classes of non-repository data can use object storage in preference to
file storage. Geo [does not replicate data in object storage](../replication/object_storage.md),
leaving that task up to the object store itself. For a planned failover, this
means you can decouple the replication of this data from the failover of the
GitLab service.

If you're already using object storage, verify that your **secondary** node
has access to the same data as the **primary** node - either they must share
the same object storage configuration, or the **secondary** node must be
configured to access a [geographically-replicated][os-repl] copy provided by
the object store itself.

If you have a large GitLab installation or cannot tolerate downtime, consider
[migrating to Object Storage][os-conf] **before** scheduling a planned failover.
Doing so reduces both the length of the maintenance window and the risk of data
loss as a result of a poorly executed planned failover.

### Review the configuration of each **secondary** node

Database settings are automatically replicated to the **secondary** node, but the
`/etc/gitlab/gitlab.rb` file must be set up manually, and differs between
nodes. If features such as Mattermost, OAuth, or LDAP integration are enabled
on the **primary** node but not the **secondary** node, they will be lost during failover.

Review the `/etc/gitlab/gitlab.rb` file for both nodes and ensure the **secondary** node
supports everything the **primary** node does **before** scheduling a planned failover.

### Run system checks

Run the following on both **primary** and **secondary** nodes:

```sh
gitlab-rake gitlab:check
gitlab-rake gitlab:geo:check
```

If any failures are reported on either node, they should be resolved **before**
scheduling a planned failover.

### Check that secrets match between nodes

The SSH host keys and `/etc/gitlab/gitlab-secrets.json` files should be
identical on all nodes. Check this by running the following on all nodes and
comparing the output:

```sh
sudo sha256sum /etc/ssh/ssh_host* /etc/gitlab/gitlab-secrets.json
```

If any files differ, replace the content on the **secondary** node with the
content from the **primary** node.

### Ensure Geo replication is up-to-date

The maintenance window won't end until Geo replication and verification is
completely finished. To keep the window as short as possible, you should
ensure these processes are as close to 100% as possible during active use.

Navigate to the **Admin Area > Geo** dashboard on the **secondary** node to
review status. Replicated objects (shown in green) should be close to 100%,
and there should be no failures (shown in red). If a large proportion of
objects aren't yet replicated (shown in grey), consider giving the node more
time to complete synchronization.

![Replication status](img/replication-status.png)

If any objects are failing to replicate, this should be investigated before
scheduling the maintenance window. Following a planned failover, anything that
failed to replicate will be **lost**.

You can use the [Geo status API](https://docs.gitlab.com/ee/api/geo_nodes.html#retrieve-project-sync-or-verification-failures-that-occurred-on-the-current-node)
to review failed objects and the reasons for failure.

A common cause of replication failures is the data being missing on the
**primary** node - you can resolve these failures by restoring the data from backup,
or removing references to the missing data.
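For example, a minimal sketch of querying that endpoint from the command
line - the host name and token are placeholders, and the token must belong to
an administrator:

```sh
# List objects that failed to sync on the current (secondary) node,
# using the Geo status API endpoint linked above.
curl --header "PRIVATE-TOKEN: <your_access_token>" \
  "https://secondary.example.com/api/v4/geo_nodes/current/failures"
```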
### Verify the integrity of replicated data

This [content was moved to another location][background-verification].

### Notify users of scheduled maintenance

On the **primary** node, navigate to **Admin Area > Messages** and add a
broadcast message. You can check under **Admin Area > Geo** to estimate how
long it will take to finish syncing. An example message would be:

> A scheduled maintenance will take place at XX:XX UTC. We expect it to take
> less than 1 hour.

## Prevent updates to the **primary** node

Until a [read-only mode][ce-19739] is implemented, updates must be prevented
manually. Note that your **secondary** node still needs read-only access to
the **primary** node during the maintenance window.

1. At the scheduled time, using your cloud provider or your node's firewall, block
   all HTTP, HTTPS, and SSH traffic to/from the **primary** node, **except** for your IP and
   the **secondary** node's IP.

   For instance, you might run the following commands on the server(s) making up your **primary** node:

   ```sh
   sudo iptables -A INPUT -p tcp -s <secondary_node_ip> --destination-port 22 -j ACCEPT
   sudo iptables -A INPUT -p tcp -s <your_ip> --destination-port 22 -j ACCEPT
   sudo iptables -A INPUT -p tcp --destination-port 22 -j REJECT

   sudo iptables -A INPUT -p tcp -s <secondary_node_ip> --destination-port 80 -j ACCEPT
   sudo iptables -A INPUT -p tcp -s <your_ip> --destination-port 80 -j ACCEPT
   sudo iptables -A INPUT -p tcp --destination-port 80 -j REJECT

   sudo iptables -A INPUT -p tcp -s <secondary_node_ip> --destination-port 443 -j ACCEPT
   sudo iptables -A INPUT -p tcp -s <your_ip> --destination-port 443 -j ACCEPT
   sudo iptables -A INPUT -p tcp --destination-port 443 -j REJECT
   ```

   From this point, users will be unable to view their data or make changes on the
   **primary** node. They will also be unable to log in to the **secondary** node.
   However, existing sessions will work for the remainder of the maintenance period, and
   public data will be accessible throughout.

1. Verify the **primary** node is blocked to HTTP traffic by visiting it in a browser
   from another IP address. The server should refuse the connection (see the sketch
   after this list).

1. Verify the **primary** node is blocked to Git over SSH traffic by attempting to pull
   an existing Git repository with an SSH remote URL. The server should refuse the
   connection.

1. Disable non-Geo periodic background jobs on the **primary** node by navigating
   to **Admin Area > Monitoring > Background Jobs > Cron**, pressing `Disable All`,
   and then pressing `Enable` for the `geo_sidekiq_cron_config_worker` cron job.
   This job will re-enable several other cron jobs that are essential for planned
   failover to complete successfully.
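As a quick sketch of the two verification steps above, run the following from a
machine whose IP address is **not** in the allow list (`primary.example.com` and
the repository path are placeholders):

```sh
# HTTP/HTTPS should be refused. With the REJECT rules above, curl
# fails fast with "Connection refused".
curl --verbose https://primary.example.com

# Git over SSH should be refused as well; ls-remote exercises the
# same SSH path as a pull without needing a local checkout.
git ls-remote git@primary.example.com:<group>/<project>.git
```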
## Finish replicating and verifying all data

1. If you are manually replicating any data not managed by Geo, trigger the
   final replication process now.
1. On the **primary** node, navigate to **Admin Area > Monitoring > Background Jobs > Queues**
   and wait for all queues except those with `geo` in the name to drop to 0.
   These queues contain work that has been submitted by your users; failing over
   before it is completed will cause the work to be lost.
1. On the **primary** node, navigate to **Admin Area > Geo** and wait for the
   following conditions to be true of the **secondary** node you are failing over to:
   - All replication meters reach 100% replicated, 0% failures.
   - All verification meters reach 100% verified, 0% failures.
   - Database replication lag is 0ms.
   - The Geo log cursor is up to date (0 events behind).
1. On the **secondary** node, navigate to **Admin Area > Monitoring > Background Jobs > Queues**
   and wait for all the `geo` queues to drop to 0 queued and 0 running jobs.
1. On the **secondary** node, use [these instructions][foreground-verification]
   to verify the integrity of CI artifacts, LFS objects, and uploads in file
   storage.

At this point, your **secondary** node will contain an up-to-date copy of everything the
**primary** node has, meaning nothing will be lost when you fail over.

## Promote the **secondary** node

Finally, follow the [Disaster Recovery docs][disaster-recovery] to promote the
**secondary** node to a **primary** node. This process will cause a brief outage
on the **secondary** node, and users may need to log in again.

Once it is completed, the maintenance window is over! Your new **primary** node will now
begin to diverge from the old one. If problems do arise at this point, failing
back to the old **primary** node [is possible][bring-primary-back], but likely to result
in the loss of any data uploaded to the new **primary** node in the meantime.

Don't forget to remove the broadcast message after failover is complete.

[bring-primary-back]: bring_primary_back.md
[ce-19739]: https://gitlab.com/gitlab-org/gitlab-ce/issues/19739
[container-registry]: ../replication/container_registry.md
[disaster-recovery]: index.md
[ee-4930]: https://gitlab.com/gitlab-org/gitlab-ee/issues/4930
[ee-5064]: https://gitlab.com/gitlab-org/gitlab-ee/issues/5064
[foreground-verification]: ../../raketasks/check.md
[background-verification]: background_verification.md
[limitations]: ../replication/index.md#current-limitations
[moving-repositories]: ../../operations/moving_repositories.md
[os-conf]: ../replication/object_storage.md#configuration
[os-repl]: ../replication/object_storage.md#replication