Diffstat (limited to 'doc/administration/geo/disaster_recovery')
-rw-r--r--  doc/administration/geo/disaster_recovery/background_verification.md             172
-rw-r--r--  doc/administration/geo/disaster_recovery/bring_primary_back.md                   61
-rw-r--r--  doc/administration/geo/disaster_recovery/img/replication-status.png              bin 0 -> 7716 bytes
-rw-r--r--  doc/administration/geo/disaster_recovery/img/reverification-interval.png         bin 0 -> 33620 bytes
-rw-r--r--  doc/administration/geo/disaster_recovery/img/verification-status-primary.png     bin 0 -> 13329 bytes
-rw-r--r--  doc/administration/geo/disaster_recovery/img/verification-status-secondary.png   bin 0 -> 12186 bytes
-rw-r--r--  doc/administration/geo/disaster_recovery/index.md                                322
-rw-r--r--  doc/administration/geo/disaster_recovery/planned_failover.md                     227
8 files changed, 782 insertions, 0 deletions
diff --git a/doc/administration/geo/disaster_recovery/background_verification.md b/doc/administration/geo/disaster_recovery/background_verification.md
new file mode 100644
index 00000000000..7d2fd51f834
--- /dev/null
+++ b/doc/administration/geo/disaster_recovery/background_verification.md
@@ -0,0 +1,172 @@
+# Automatic background verification **[PREMIUM ONLY]**
+
+NOTE: **Note:**
+Automatic background verification of repositories and wikis was added in
+GitLab EE 10.6, but is enabled by default only from GitLab EE 11.1 onward. You can
+disable or enable this feature manually by following
+[these instructions](#disabling-or-enabling-the-automatic-background-verification).
+
+Automatic background verification ensures that the transferred data matches a
+calculated checksum. If the checksum of the data on the **primary** node matches the
+checksum of the data on the **secondary** node, the data was transferred successfully.
+Following a planned failover, any corrupted data may be **lost**, depending on
+the extent of the corruption.
+
+If verification fails on the **primary** node, this indicates that Geo is
+successfully replicating a corrupted object; restore it from backup or remove
+it from the **primary** node to resolve the issue.
+
+If verification succeeds on the **primary** node but fails on the **secondary** node,
+this indicates that the object was corrupted during the replication process.
+Geo actively tries to correct verification failures by marking the repository
+to be resynced with a back-off period. If you want to reset the verification
+for these failures, follow [these instructions][reset-verification].
+
+If verification is lagging significantly behind replication, consider giving
+the node more time before scheduling a planned failover.
+
+## Disabling or enabling the automatic background verification
+
+Run the following commands in a Rails console on the **primary** node:
+
+```sh
+# Omnibus GitLab
+gitlab-rails console
+
+# Installation from source
+cd /home/git/gitlab
+sudo -u git -H bin/rails console RAILS_ENV=production
+```
+
+To check if automatic background verification is enabled:
+
+```ruby
+Gitlab::Geo.repository_verification_enabled?
+```
+
+To disable automatic background verification:
+
+```ruby
+Feature.disable('geo_repository_verification')
+```
+
+To enable automatic background verification:
+
+```ruby
+Feature.enable('geo_repository_verification')
+```
+
+## Repository verification
+
+Navigate to the **Admin Area > Geo** dashboard on the **primary** node and expand
+the **Verification information** tab for that node to view automatic checksumming
+status for repositories and wikis. Successes are shown in green, pending work
+in grey, and failures in red.
+
+![Verification status](img/verification-status-primary.png)
+
+Navigate to the **Admin Area > Geo** dashboard on the **secondary** node and expand
+the **Verification information** tab for that node to view automatic verification
+status for repositories and wikis. As with checksumming, successes are shown in
+green, pending work in grey, and failures in red.
+
+![Verification status](img/verification-status-secondary.png)
+
+## Using checksums to compare Geo nodes
+
+To check the health of Geo **secondary** nodes, we use a checksum over the list of
+Git references and their values. The checksum includes `HEAD`, `heads`, `tags`,
+`notes`, and GitLab-specific references to ensure true consistency. If two nodes
+have the same checksum, then they definitely hold the same references. We compute
+the checksum for every node after every update to make sure that they are all
+in sync.
+
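+As an illustration only (this is not the exact algorithm GitLab uses
+internally), a conceptually similar checksum can be computed from the shell:
+
+```sh
+# List every ref (HEAD, heads, tags, notes, and so on) together with its
+# value, then hash the sorted list. Two repositories that produce the same
+# output hold the same references.
+(git rev-parse HEAD && git for-each-ref) | sort | sha256sum
+```
+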
+## Repository re-verification
+
+> [Introduced](https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/8550) in GitLab Enterprise Edition 11.6. Available in [GitLab Premium](https://about.gitlab.com/pricing/).
+
+Due to bugs or transient infrastructure failures, it is possible for Git
+repositories to change unexpectedly without being marked for verification.
+Geo constantly reverifies the repositories to ensure the integrity of the
+data. The default and recommended re-verification interval is 7 days, though
+an interval as short as 1 day can be set. Shorter intervals reduce risk but
+increase load and vice versa.
+
+Navigate to the **Admin Area > Geo** dashboard on the **primary** node, and
+click the **Edit** button for the **primary** node to customize the minimum
+re-verification interval:
+
+![Re-verification interval](img/reverification-interval.png)
+
+The automatic background re-verification is enabled by default, but you can
+disable it if needed. Run the following commands in a Rails console on the
+**primary** node:
+
+```sh
+# Omnibus GitLab
+gitlab-rails console
+
+# Installation from source
+cd /home/git/gitlab
+sudo -u git -H bin/rails console RAILS_ENV=production
+```
+
+To disable automatic background re-verification:
+
+```ruby
+Feature.disable('geo_repository_reverification')
+```
+
+To enable automatic background re-verification:
+
+```ruby
+Feature.enable('geo_repository_reverification')
+```
+
+## Reset verification for projects where verification has failed
+
+Geo actively tries to correct verification failures by marking the repository
+to be resynced with a back-off period. If you want to reset them manually, the
+following Rake tasks mark projects where verification has failed or where the
+checksum does not match to be resynced without the back-off period:
+
+For repositories:
+
+- Omnibus Installation
+
+ ```sh
+ sudo gitlab-rake geo:verification:repository:reset
+ ```
+
+- Source Installation
+
+ ```sh
+ sudo -u git -H bundle exec rake geo:verification:repository:reset RAILS_ENV=production
+ ```
+
+For wikis:
+
+- Omnibus Installation
+
+ ```sh
+ sudo gitlab-rake geo:verification:wiki:reset
+ ```
+
+- Source Installation
+
+ ```sh
+ sudo -u git -H bundle exec rake geo:verification:wiki:reset RAILS_ENV=production
+ ```
+
+## Current limitations
+
+Until [issue #5064][ee-5064] is completed, background verification doesn't cover
+CI job artifacts and traces, LFS objects, or user uploads in file storage.
+Verify their integrity manually by following [these instructions][foreground-verification]
+on both nodes, and comparing the output between them.
+
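+For example, on an Omnibus installation the foreground checks look like the
+following (task names are taken from the linked Rake documentation; confirm
+they are available in your GitLab version):
+
+```sh
+sudo gitlab-rake gitlab:artifacts:check
+sudo gitlab-rake gitlab:lfs:check
+sudo gitlab-rake gitlab:uploads:check
+```
+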
+Data in object storage is **not verified**, as the object store is responsible
+for ensuring the integrity of the data.
+
+[reset-verification]: background_verification.md#reset-verification-for-projects-where-verification-has-failed
+[foreground-verification]: ../../raketasks/check.md
+[ee-5064]: https://gitlab.com/gitlab-org/gitlab-ee/issues/5064
diff --git a/doc/administration/geo/disaster_recovery/bring_primary_back.md b/doc/administration/geo/disaster_recovery/bring_primary_back.md
new file mode 100644
index 00000000000..f4d31a98080
--- /dev/null
+++ b/doc/administration/geo/disaster_recovery/bring_primary_back.md
@@ -0,0 +1,61 @@
+# Bring a demoted primary node back online **[PREMIUM ONLY]**
+
+After a failover, it is possible to fail back to the demoted **primary** node to
+restore your original configuration. This process consists of two steps:
+
+1. Making the old **primary** node a **secondary** node.
+1. Promoting a **secondary** node to a **primary** node.
+
+CAUTION: **Caution:**
+If you have any doubts about the consistency of the data on this node, we recommend setting it up from scratch.
+
+## Configure the former **primary** node to be a **secondary** node
+
+Since the former **primary** node will be out of sync with the current **primary** node,
+the first step is to bring it up to date. Note that the deletion of data stored
+on disk, like repositories and uploads, is not replayed when bringing the former
+**primary** node back into sync, which may result in increased disk usage.
+Alternatively, you can [set up a new **secondary** GitLab instance][setup-geo] to avoid this.
+
+To bring the former **primary** node up to date:
+
+1. SSH into the former **primary** node that has fallen behind.
+1. Make sure all the services are up:
+
+ ```sh
+ sudo gitlab-ctl start
+ ```
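+
+   To confirm that all services came up, you can check their status:
+
+   ```sh
+   sudo gitlab-ctl status
+   ```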
+
+ > **Note 1:** If you [disabled the **primary** node permanently][disaster-recovery-disable-primary],
+ > you need to undo those steps now. For Debian/Ubuntu you just need to run
+ > `sudo systemctl enable gitlab-runsvdir`. For CentOS 6, you need to install
+ > the GitLab instance from scratch and set it up as a **secondary** node by
+ > following [Setup instructions][setup-geo]. In this case, you don't need to follow the next step.
+ >
+ > **Note 2:** If you [changed the DNS records](index.md#step-4-optional-updating-the-primary-domain-dns-record)
+ > for this node during disaster recovery procedure you may need to [block
+ > all the writes to this node](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/doc/gitlab-geo/planned-failover.md#block-primary-traffic)
+ > during this procedure.
+
+1. [Set up database replication][database-replication]. Note that in this
+ case, **primary** node refers to the current **primary** node, and **secondary** node refers to the
+ former **primary** node.
+
+If you have lost your original **primary** node, follow the
+[setup instructions][setup-geo] to set up a new **secondary** node.
+
+## Promote the **secondary** node to **primary** node
+
+When the initial replication is complete and the **primary** node and **secondary** node are
+closely in sync, you can do a [planned failover].
+
+## Restore the **secondary** node
+
+If your objective is to have two nodes again, you need to bring your **secondary**
+node back online as well by repeating the first step
+([configure the former **primary** node to be a **secondary** node](#configure-the-former-primary-node-to-be-a-secondary-node))
+for the **secondary** node.
+
+[setup-geo]: ../replication/index.md#setup-instructions
+[database-replication]: ../replication/database.md
+[disaster-recovery-disable-primary]: index.md#step-2-permanently-disable-the-primary-node
+[planned failover]: planned_failover.md
diff --git a/doc/administration/geo/disaster_recovery/img/replication-status.png b/doc/administration/geo/disaster_recovery/img/replication-status.png
new file mode 100644
index 00000000000..d7085927c75
--- /dev/null
+++ b/doc/administration/geo/disaster_recovery/img/replication-status.png
Binary files differ
diff --git a/doc/administration/geo/disaster_recovery/img/reverification-interval.png b/doc/administration/geo/disaster_recovery/img/reverification-interval.png
new file mode 100644
index 00000000000..ad4597a4f49
--- /dev/null
+++ b/doc/administration/geo/disaster_recovery/img/reverification-interval.png
Binary files differ
diff --git a/doc/administration/geo/disaster_recovery/img/verification-status-primary.png b/doc/administration/geo/disaster_recovery/img/verification-status-primary.png
new file mode 100644
index 00000000000..2503408ec5d
--- /dev/null
+++ b/doc/administration/geo/disaster_recovery/img/verification-status-primary.png
Binary files differ
diff --git a/doc/administration/geo/disaster_recovery/img/verification-status-secondary.png b/doc/administration/geo/disaster_recovery/img/verification-status-secondary.png
new file mode 100644
index 00000000000..462274d8b14
--- /dev/null
+++ b/doc/administration/geo/disaster_recovery/img/verification-status-secondary.png
Binary files differ
diff --git a/doc/administration/geo/disaster_recovery/index.md b/doc/administration/geo/disaster_recovery/index.md
new file mode 100644
index 00000000000..71dc797f281
--- /dev/null
+++ b/doc/administration/geo/disaster_recovery/index.md
@@ -0,0 +1,322 @@
+# Disaster Recovery **[PREMIUM ONLY]**
+
+Geo replicates your database, your Git repositories, and a few other assets.
+We will support and replicate more data in the future, which will enable you to
+fail over with minimal effort in a disaster situation.
+
+See [Geo current limitations][geo-limitations] for more information.
+
+CAUTION: **Warning:**
+Disaster recovery for multi-secondary configurations is in **Alpha**.
+For the latest updates, check the multi-secondary [Disaster Recovery epic][gitlab-org&65].
+
+## Promoting a **secondary** Geo node in single-secondary configurations
+
+We don't currently provide an automated way to promote a Geo replica and do a
+failover, but you can do it manually if you have `root` access to the machine.
+
+This process promotes a **secondary** Geo node to a **primary** node. To regain
+geographic redundancy as quickly as possible, you should add a new **secondary** node
+immediately after following these instructions.
+
+### Step 1. Allow replication to finish if possible
+
+If the **secondary** node is still replicating data from the **primary** node, follow
+[the planned failover docs][planned-failover] as closely as possible in
+order to avoid unnecessary data loss.
+
+### Step 2. Permanently disable the **primary** node
+
+CAUTION: **Warning:**
+If the **primary** node goes offline, there may be data saved on the **primary** node
+that has not been replicated to the **secondary** node. This data should be treated
+as lost if you proceed.
+
+If an outage on the **primary** node happens, you should do everything possible to
+avoid a split-brain situation where writes can occur in two different GitLab
+instances, complicating recovery efforts. So to prepare for the failover, we
+must disable the **primary** node.
+
+1. SSH into the **primary** node to stop and disable GitLab, if possible:
+
+ ```sh
+ sudo gitlab-ctl stop
+ ```
+
+ Prevent GitLab from starting up again if the server unexpectedly reboots:
+
+ ```sh
+ sudo systemctl disable gitlab-runsvdir
+ ```
+
+   > **CentOS only**: In CentOS 6 or older, there is no easy way to prevent
+   > GitLab from starting if the machine reboots (see [gitlab-org/omnibus-gitlab#3058]).
+   > It may be safest to uninstall the GitLab package completely:
+
+   ```sh
+   sudo yum remove gitlab-ee
+   ```
+
+ > **Ubuntu 14.04 LTS**: If you are using an older version of Ubuntu
+ > or any other distro based on the Upstart init system, you can prevent GitLab
+ > from starting if the machine reboots by doing the following:
+
+ ```sh
+   sudo initctl stop gitlab-runsvdir
+   echo 'manual' | sudo tee /etc/init/gitlab-runsvdir.override
+   sudo initctl reload-configuration
+ ```
+
+1. If you do not have SSH access to the **primary** node, take the machine offline and
+ prevent it from rebooting by any means at your disposal.
+ Since there are many ways you may prefer to accomplish this, we will avoid a
+ single recommendation. You may need to:
+ - Reconfigure the load balancers.
+ - Change DNS records (e.g., point the primary DNS record to the **secondary**
+ node in order to stop usage of the **primary** node).
+ - Stop the virtual servers.
+ - Block traffic through a firewall.
+ - Revoke object storage permissions from the **primary** node.
+ - Physically disconnect a machine.
+
+1. If you plan to
+ [update the primary domain DNS record](#step-4-optional-updating-the-primary-domain-dns-record),
+ you may wish to lower the TTL now to speed up propagation.
+
+### Step 3. Promoting a **secondary** node
+
+NOTE: **Note:**
+A new **secondary** should not be added at this time. If you want to add a new
+**secondary**, do this after you have completed the entire process of promoting
+the **secondary** to the **primary**.
+
+#### Promoting a **secondary** node running on a single machine
+
+1. SSH in to your **secondary** node and log in as root:
+
+ ```sh
+ sudo -i
+ ```
+
+1. Edit `/etc/gitlab/gitlab.rb` to reflect its new status as **primary** by
+ removing any lines that enabled the `geo_secondary_role`:
+
+ ```ruby
+ ## In pre-11.5 documentation, the role was enabled as follows. Remove this line.
+ geo_secondary_role['enable'] = true
+
+ ## In 11.5+ documentation, the role was enabled as follows. Remove this line.
+ roles ['geo_secondary_role']
+ ```
+
+1. Promote the **secondary** node to the **primary** node. Execute:
+
+ ```sh
+ gitlab-ctl promote-to-primary-node
+ ```
+
+1. Verify you can connect to the newly promoted **primary** node using the URL used
+ previously for the **secondary** node.
+1. If successful, the **secondary** node has now been promoted to the **primary** node.
+
+#### Promoting a **secondary** node with HA
+
+The `gitlab-ctl promote-to-primary-node` command cannot yet be used in
+conjunction with High Availability or with multiple machines, as it can only
+perform changes on a **secondary** node running on a single machine. Instead,
+you must perform the promotion manually.
+
+1. SSH in to the database node in the **secondary** and trigger PostgreSQL to
+ promote to read-write:
+
+ ```bash
+ sudo gitlab-pg-ctl promote
+ ```
+
+1. Edit `/etc/gitlab/gitlab.rb` on every machine in the **secondary** to
+ reflect its new status as **primary** by removing any lines that enabled the
+ `geo_secondary_role`:
+
+ ```ruby
+ ## In pre-11.5 documentation, the role was enabled as follows. Remove this line.
+ geo_secondary_role['enable'] = true
+
+ ## In 11.5+ documentation, the role was enabled as follows. Remove this line.
+ roles ['geo_secondary_role']
+ ```
+
+   After making these changes, [reconfigure GitLab](../../restart_gitlab.md#omnibus-gitlab-reconfigure) on each
+   machine so the changes take effect.
+
+1. Promote the **secondary** to **primary**. SSH into a single application
+ server and execute:
+
+ ```bash
+ sudo gitlab-rake geo:set_secondary_as_primary
+ ```
+
+1. Verify you can connect to the newly promoted **primary** using the URL used
+ previously for the **secondary**.
+1. Success! The **secondary** has now been promoted to **primary**.
+
+### Step 4. (Optional) Updating the primary domain DNS record
+
+Updating the DNS records for the primary domain to point to the **secondary** node
+will prevent the need to update all references to the primary domain to the
+secondary domain, like changing Git remotes and API URLs.
+
+1. SSH into the **secondary** node and log in as root:
+
+ ```sh
+ sudo -i
+ ```
+
+1. Update the primary domain's DNS record. After updating the primary domain's
+ DNS records to point to the **secondary** node, edit `/etc/gitlab/gitlab.rb` on the
+ **secondary** node to reflect the new URL:
+
+ ```ruby
+ # Change the existing external_url configuration
+ external_url 'https://<new_external_url>'
+ ```
+
+   NOTE: **Note:**
+ Changing `external_url` won't prevent access via the old secondary URL, as
+ long as the secondary DNS records are still intact.
+
+1. Reconfigure the **secondary** node for the change to take effect:
+
+ ```sh
+ gitlab-ctl reconfigure
+ ```
+
+1. Execute the command below to update the newly promoted **primary** node URL:
+
+ ```sh
+ gitlab-rake geo:update_primary_node_url
+ ```
+
+ This command will use the changed `external_url` configuration defined
+ in `/etc/gitlab/gitlab.rb`.
+
+1. Verify you can connect to the newly promoted **primary** using its URL.
+   If you updated the DNS records for the primary domain, these changes may
+   not have propagated yet, depending on the previous DNS records' TTL. You
+   can check propagation as shown below.
+
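+You can check whether the new DNS records have propagated with standard DNS
+tooling, for example (the hostname is a placeholder):
+
+```sh
+# The answer should be the IP address of the newly promoted node.
+dig +short gitlab.example.com
+```
+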
+### Step 5. (Optional) Add **secondary** Geo node to a promoted **primary** node
+
+Promoting a **secondary** node to **primary** node using the process above does not enable
+Geo on the new **primary** node.
+
+To bring a new **secondary** node online, follow the [Geo setup instructions][setup-geo].
+
+### Step 6. (Optional) Removing the secondary's tracking database
+
+Every **secondary** has a special tracking database that is used to save the status of the synchronization of all the items from the **primary**.
+Because the **secondary** is already promoted, that data in the tracking database is no longer required.
+
+The data can be removed with the following command:
+
+```sh
+sudo rm -rf /var/opt/gitlab/geo-postgresql
+```
+
+## Promoting secondary Geo replica in multi-secondary configurations
+
+If you have more than one **secondary** node and you need to promote one of them, we suggest you follow
+[Promoting a **secondary** Geo node in single-secondary configurations](#promoting-a-secondary-geo-node-in-single-secondary-configurations)
+and then complete the two extra steps below.
+
+### Step 1. Prepare the new **primary** node to serve one or more **secondary** nodes
+
+1. SSH into the new **primary** node and log in as root:
+
+ ```sh
+ sudo -i
+ ```
+
+1. Edit `/etc/gitlab/gitlab.rb`:
+
+ ```ruby
+ ## Enable a Geo Primary role (if you haven't yet)
+ roles ['geo_primary_role']
+
+ ##
+ # Allow PostgreSQL client authentication from the primary and secondary IPs. These IPs may be
+ # public or VPC addresses in CIDR format, for example ['198.51.100.1/32', '198.51.100.2/32']
+ ##
+ postgresql['md5_auth_cidr_addresses'] = ['<primary_node_ip>/32', '<secondary_node_ip>/32']
+
+   # Every secondary server needs its own replication slot, so set this to
+   # the number of secondary nodes you will have.
+   postgresql['max_replication_slots'] = 1
+
+ ##
+ ## Disable automatic database migrations temporarily
+ ## (until PostgreSQL is restarted and listening on the private address).
+ ##
+ gitlab_rails['auto_migrate'] = false
+   ```
+
+   For more details about these settings, see [Configure the primary server][configure-the-primary-server].
+
+1. Save the file and reconfigure GitLab for the database listen changes and
+ the replication slot changes to be applied.
+
+ ```sh
+ gitlab-ctl reconfigure
+ ```
+
+ Restart PostgreSQL for its changes to take effect:
+
+ ```sh
+ gitlab-ctl restart postgresql
+ ```
+
+1. Re-enable migrations now that PostgreSQL is restarted and listening on the
+ private address.
+
+ Edit `/etc/gitlab/gitlab.rb` and **change** the configuration to `true`:
+
+ ```ruby
+ gitlab_rails['auto_migrate'] = true
+ ```
+
+ Save the file and reconfigure GitLab:
+
+ ```sh
+ gitlab-ctl reconfigure
+ ```
+
+### Step 2. Initiate the replication process
+
+Now we need to make each **secondary** node listen for changes on the new **primary** node. To do that, you need
+to [initiate the replication process][initiate-the-replication-process] again, this time
+for the new **primary** node. All the old replication settings will be overwritten.
+
+## Troubleshooting
+
+### I followed the disaster recovery instructions and now two-factor auth is broken!
+
+The setup instructions for Geo prior to 10.5 failed to replicate the
+`otp_key_base` secret, which is used to encrypt the two-factor authentication
+secrets stored in the database. If it differs between **primary** and **secondary**
+nodes, users with two-factor authentication enabled won't be able to log in
+after a failover.
+
+If you still have access to the old **primary** node, you can follow the
+instructions in the
+[Upgrading to GitLab 10.5][updating-geo]
+section to resolve the error. Otherwise, the secret is lost and you'll need to
+[reset two-factor authentication for all users][sec-tfa].
+
+[gitlab-org&65]: https://gitlab.com/groups/gitlab-org/-/epics/65
+[geo-limitations]: ../replication/index.md#current-limitations
+[planned-failover]: planned_failover.md
+[setup-geo]: ../replication/index.md#setup-instructions
+[updating-geo]: ../replication/updating_the_geo_nodes.md#upgrading-to-gitlab-105
+[sec-tfa]: ../../../security/two_factor_authentication.md#disabling-2fa-for-everyone
+[gitlab-org/omnibus-gitlab#3058]: https://gitlab.com/gitlab-org/omnibus-gitlab/issues/3058
+[gitlab-org/gitlab-ee#4284]: https://gitlab.com/gitlab-org/gitlab-ee/issues/4284
+[initiate-the-replication-process]: ../replication/database.md#step-3-initiate-the-replication-process
+[configure-the-primary-server]: ../replication/database.md#step-1-configure-the-primary-server
diff --git a/doc/administration/geo/disaster_recovery/planned_failover.md b/doc/administration/geo/disaster_recovery/planned_failover.md
new file mode 100644
index 00000000000..88ab12d910a
--- /dev/null
+++ b/doc/administration/geo/disaster_recovery/planned_failover.md
@@ -0,0 +1,227 @@
+# Disaster recovery for planned failover **[PREMIUM ONLY]**
+
+The primary use case of Disaster Recovery is to ensure business continuity in
+the event of an unplanned outage, but it can be used in conjunction with a planned
+failover to migrate your GitLab instance between regions without extended
+downtime.
+
+As replication between Geo nodes is asynchronous, a planned failover requires
+a maintenance window in which updates to the **primary** node are blocked. The
+length of this window is determined by your replication capacity - once the
+**secondary** node is completely synchronized with the **primary** node, the failover can occur without
+data loss.
+
+This document assumes you already have a fully configured, working Geo setup.
+Please read it and the [Disaster Recovery][disaster-recovery] failover
+documentation in full before proceeding. Planned failover is a major operation,
+and if performed incorrectly, there is a high risk of data loss. Consider
+rehearsing the procedure until you are comfortable with the necessary steps and
+have a high degree of confidence in being able to perform them accurately.
+
+## Not all data is automatically replicated
+
+If you are using any GitLab features that Geo [doesn't support][limitations],
+you must make separate provisions to ensure that the **secondary** node has an
+up-to-date copy of any data associated with that feature. This may extend the
+required scheduled maintenance period significantly.
+
+A common strategy for keeping this period as short as possible for data stored
+in files is to use `rsync` to transfer the data. An initial `rsync` can be
+performed ahead of the maintenance window; subsequent `rsync`s (including a
+final transfer inside the maintenance window) will then transfer only the
+*changes* between the **primary** node and the **secondary** nodes.
+
+Repository-centric strategies for using `rsync` effectively can be found in the
+[moving repositories][moving-repositories] documentation; these strategies can
+be adapted for use with any other file-based data, such as GitLab Pages (to
+be found in `/var/opt/gitlab/gitlab-rails/shared/pages` if using Omnibus).
+
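+A minimal sketch of this approach, assuming the default Omnibus Pages path and
+an illustrative hostname:
+
+```sh
+# Initial transfer ahead of the maintenance window; re-running the same
+# command later copies only the changes made since the previous run.
+rsync -aP --delete /var/opt/gitlab/gitlab-rails/shared/pages/ \
+  secondary.example.com:/var/opt/gitlab/gitlab-rails/shared/pages/
+```
+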
+## Pre-flight checks
+
+Follow these steps before scheduling a planned failover to ensure the process
+will go smoothly.
+
+### Object storage
+
+Some classes of non-repository data can use object storage in preference to
+file storage. Geo [does not replicate data in object storage](../replication/object_storage.md),
+leaving that task up to the object store itself. For a planned failover, this
+means you can decouple the replication of this data from the failover of the
+GitLab service.
+
+If you're already using object storage, simply verify that your **secondary**
+node has access to the same data as the **primary** node - they must either share the
+same object storage configuration, or the **secondary** node should be configured to
+access a [geographically-replicated][os-repl] copy provided by the object store
+itself.
+
+If you have a large GitLab installation or cannot tolerate downtime, consider
+[migrating to Object Storage][os-conf] **before** scheduling a planned failover.
+Doing so reduces both the length of the maintenance window, and the risk of data
+loss as a result of a poorly executed planned failover.
+
+### Review the configuration of each **secondary** node
+
+Database settings are automatically replicated to the **secondary** node, but the
+`/etc/gitlab/gitlab.rb` file must be set up manually, and differs between
+nodes. If features such as Mattermost, OAuth or LDAP integration are enabled
+on the **primary** node but not the **secondary** node, they will be lost during failover.
+
+Review the `/etc/gitlab/gitlab.rb` file for both nodes and ensure the **secondary** node
+supports everything the **primary** node does **before** scheduling a planned failover.
+
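+One illustrative way to compare the two files from a workstation with SSH
+access to both nodes (hostnames are placeholders, and this assumes
+passwordless `sudo` over SSH):
+
+```sh
+diff <(ssh <primary_node> sudo cat /etc/gitlab/gitlab.rb) \
+     <(ssh <secondary_node> sudo cat /etc/gitlab/gitlab.rb)
+```
+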
+### Run system checks
+
+Run the following on both **primary** and **secondary** nodes:
+
+```sh
+gitlab-rake gitlab:check
+gitlab-rake gitlab:geo:check
+```
+
+If any failures are reported on either node, they should be resolved **before**
+scheduling a planned failover.
+
+### Check that secrets match between nodes
+
+The SSH host keys and `/etc/gitlab/gitlab-secrets.json` files should be
+identical on all nodes. Check this by running the following on all nodes and
+comparing the output:
+
+```sh
+sudo sha256sum /etc/ssh/ssh_host* /etc/gitlab/gitlab-secrets.json
+```
+
+If any files differ, replace the content on the **secondary** node with the
+content from the **primary** node.
+
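+For example, a sketch of copying the secrets file from the **primary** node
+(the path is the Omnibus default; reconfigure afterwards so services pick up
+the change):
+
+```sh
+sudo scp root@<primary_node>:/etc/gitlab/gitlab-secrets.json /etc/gitlab/
+sudo gitlab-ctl reconfigure
+```
+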
+### Ensure Geo replication is up-to-date
+
+The maintenance window won't end until Geo replication and verification are
+completely finished. To keep the window as short as possible, you should
+ensure these processes are as close to 100% as possible during active use.
+
+Navigate to the **Admin Area > Geo** dashboard on the **secondary** node to
+review status. Replicated objects (shown in green) should be close to 100%,
+and there should be no failures (shown in red). If a large proportion of
+objects aren't yet replicated (shown in grey), consider giving the node more
+time to complete replication.
+
+![Replication status](img/replication-status.png)
+
+If any objects are failing to replicate, this should be investigated before
+scheduling the maintenance window. Following a planned failover, anything that
+failed to replicate will be **lost**.
+
+You can use the [Geo status API](https://docs.gitlab.com/ee/api/geo_nodes.html#retrieve-project-sync-or-verification-failures-that-occurred-on-the-current-node) to review failed objects and
+the reasons for failure.
+
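+For example, a sketch of querying the failures endpoint on the **secondary**
+node (the token and hostname are placeholders; see the linked API
+documentation for details):
+
+```sh
+curl --header "PRIVATE-TOKEN: <your_access_token>" \
+  "https://<secondary_node_url>/api/v4/geo_nodes/current/failures"
+```
+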
+A common cause of replication failures is the data being missing on the
+**primary** node - you can resolve these failures by restoring the data from backup,
+or removing references to the missing data.
+
+### Verify the integrity of replicated data
+
+This [content was moved to another location][background-verification].
+
+### Notify users of scheduled maintenance
+
+On the **primary** node, navigate to **Admin Area > Messages** and add a broadcast
+message. You can check under **Admin Area > Geo** to estimate how long it
+will take to finish syncing. An example message would be:
+
+> A scheduled maintenance will take place at XX:XX UTC. We expect it to take
+> less than 1 hour.
+
+## Prevent updates to the **primary** node
+
+Until a [read-only mode][ce-19739] is implemented, you must manually prevent
+updates from reaching the **primary** node. Note that your **secondary** node still
+needs read-only access to the **primary** node during the maintenance window.
+
+1. At the scheduled time, using your cloud provider or your node's firewall, block
+ all HTTP, HTTPS and SSH traffic to/from the **primary** node, **except** for your IP and
+ the **secondary** node's IP.
+
+ For instance, you might run the following commands on the server(s) making up your **primary** node:
+
+ ```sh
+   sudo iptables -A INPUT -p tcp -s <secondary_node_ip> --destination-port 22 -j ACCEPT
+   sudo iptables -A INPUT -p tcp -s <your_ip> --destination-port 22 -j ACCEPT
+   sudo iptables -A INPUT -p tcp --destination-port 22 -j REJECT
+
+   sudo iptables -A INPUT -p tcp -s <secondary_node_ip> --destination-port 80 -j ACCEPT
+   sudo iptables -A INPUT -p tcp -s <your_ip> --destination-port 80 -j ACCEPT
+   sudo iptables -A INPUT -p tcp --destination-port 80 -j REJECT
+
+   sudo iptables -A INPUT -p tcp -s <secondary_node_ip> --destination-port 443 -j ACCEPT
+   sudo iptables -A INPUT -p tcp -s <your_ip> --destination-port 443 -j ACCEPT
+   sudo iptables -A INPUT -p tcp --destination-port 443 -j REJECT
+ ```
+
+ From this point, users will be unable to view their data or make changes on the
+ **primary** node. They will also be unable to log in to the **secondary** node.
+ However, existing sessions will work for the remainder of the maintenance period, and
+ public data will be accessible throughout.
+
+1. Verify the **primary** node is blocked to HTTP traffic by visiting it in a browser
+   from another IP address. The server should refuse the connection.
+
+1. Verify the **primary** node is blocked to Git over SSH traffic by attempting to pull an
+   existing Git repository with an SSH remote URL. The server should refuse the
+   connection.
+
+1. Disable non-Geo periodic background jobs on the **primary** node by navigating
+   to **Admin Area > Monitoring > Background Jobs > Cron**, pressing `Disable All`,
+   and then pressing `Enable` for the `geo_sidekiq_cron_config_worker` cron job.
+ This job will re-enable several other cron jobs that are essential for planned
+ failover to complete successfully.
+
+## Finish replicating and verifying all data
+
+1. If you are manually replicating any data not managed by Geo, trigger the
+ final replication process now.
+1. On the **primary** node, navigate to **Admin Area > Monitoring > Background Jobs > Queues**
+ and wait for all queues except those with `geo` in the name to drop to 0.
+ These queues contain work that has been submitted by your users; failing over
+ before it is completed will cause the work to be lost.
+1. On the **primary** node, navigate to **Admin Area > Geo** and wait for the
+ following conditions to be true of the **secondary** node you are failing over to:
+   - All replication meters reach 100% replicated, 0% failures.
+ - All verification meters reach 100% verified, 0% failures.
+ - Database replication lag is 0ms.
+ - The Geo log cursor is up to date (0 events behind).
+
+1. On the **secondary** node, navigate to **Admin Area > Monitoring > Background Jobs > Queues**
+ and wait for all the `geo` queues to drop to 0 queued and 0 running jobs.
+1. On the **secondary** node, use [these instructions][foreground-verification]
+ to verify the integrity of CI artifacts, LFS objects and uploads in file
+ storage.
+
+At this point, your **secondary** node will contain an up-to-date copy of everything the
+**primary** node has, meaning nothing will be lost when you fail over.
+
+## Promote the **secondary** node
+
+Finally, follow the [Disaster Recovery docs][disaster-recovery] to promote the
+**secondary** node to a **primary** node. This process will cause a brief outage on the **secondary** node, and users may need to log in again.
+
+Once it is completed, the maintenance window is over! Your new **primary** node will now
+begin to diverge from the old one. If problems do arise at this point, failing
+back to the old **primary** node [is possible][bring-primary-back], but likely to result
+in the loss of any data uploaded to the new primary in the meantime.
+
+Don't forget to remove the broadcast message after failover is complete.
+
+[bring-primary-back]: bring_primary_back.md
+[ce-19739]: https://gitlab.com/gitlab-org/gitlab-ce/issues/19739
+[container-registry]: ../replication/container_registry.md
+[disaster-recovery]: index.md
+[ee-4930]: https://gitlab.com/gitlab-org/gitlab-ee/issues/4930
+[ee-5064]: https://gitlab.com/gitlab-org/gitlab-ee/issues/5064
+[foreground-verification]: ../../raketasks/check.md
+[background-verification]: background_verification.md
+[limitations]: ../replication/index.md#current-limitations
+[moving-repositories]: ../../operations/moving_repositories.md
+[os-conf]: ../replication/object_storage.md#configuration
+[os-repl]: ../replication/object_storage.md#replication