Diffstat (limited to 'doc/administration/geo/disaster_recovery')
-rw-r--r-- | doc/administration/geo/disaster_recovery/background_verification.md | 172
-rw-r--r-- | doc/administration/geo/disaster_recovery/bring_primary_back.md | 61
-rw-r--r-- | doc/administration/geo/disaster_recovery/img/replication-status.png | bin | 0 -> 7716 bytes
-rw-r--r-- | doc/administration/geo/disaster_recovery/img/reverification-interval.png | bin | 0 -> 33620 bytes
-rw-r--r-- | doc/administration/geo/disaster_recovery/img/verification-status-primary.png | bin | 0 -> 13329 bytes
-rw-r--r-- | doc/administration/geo/disaster_recovery/img/verification-status-secondary.png | bin | 0 -> 12186 bytes
-rw-r--r-- | doc/administration/geo/disaster_recovery/index.md | 322
-rw-r--r-- | doc/administration/geo/disaster_recovery/planned_failover.md | 227
8 files changed, 782 insertions, 0 deletions
diff --git a/doc/administration/geo/disaster_recovery/background_verification.md b/doc/administration/geo/disaster_recovery/background_verification.md
new file mode 100644
index 00000000000..7d2fd51f834
--- /dev/null
+++ b/doc/administration/geo/disaster_recovery/background_verification.md
@@ -0,0 +1,172 @@

# Automatic background verification **[PREMIUM ONLY]**

NOTE: **Note:**
Automatic background verification of repositories and wikis was added in
GitLab EE 10.6 but is enabled by default only in GitLab EE 11.1. You can
disable or enable this feature manually by following
[these instructions](#disabling-or-enabling-the-automatic-background-verification).

Automatic background verification ensures that the transferred data matches a
calculated checksum. If the checksum of the data on the **primary** node matches the
checksum of the data on the **secondary** node, the data transferred successfully.
Following a planned failover, any corrupted data may be **lost**, depending on the
extent of the corruption.

If verification fails on the **primary** node, this indicates that Geo is
successfully replicating a corrupted object; restore it from backup or remove
it from the **primary** node to resolve the issue.

If verification succeeds on the **primary** node but fails on the **secondary** node,
this indicates that the object was corrupted during the replication process.
Geo actively tries to correct verification failures by marking the repository to
be resynced with a backoff period. If you want to reset the verification for
these failures, follow [these instructions][reset-verification].

If verification is lagging significantly behind replication, consider giving
the node more time before scheduling a planned failover.

## Disabling or enabling the automatic background verification

Run the following commands in a Rails console on the **primary** node:

```sh
# Omnibus GitLab
gitlab-rails console

# Installation from source
cd /home/git/gitlab
sudo -u git -H bin/rails console RAILS_ENV=production
```

To check if automatic background verification is enabled:

```ruby
Gitlab::Geo.repository_verification_enabled?
```

To disable automatic background verification:

```ruby
Feature.disable('geo_repository_verification')
```

To enable automatic background verification:

```ruby
Feature.enable('geo_repository_verification')
```

## Repository verification

Navigate to the **Admin Area > Geo** dashboard on the **primary** node and expand
the **Verification information** tab for that node to view automatic checksumming
status for repositories and wikis. Successes are shown in green, pending work
in grey, and failures in red.

![Verification status](img/verification-status-primary.png)

Navigate to the **Admin Area > Geo** dashboard on the **secondary** node and expand
the **Verification information** tab for that node to view automatic verification
status for repositories and wikis. As with checksumming, successes are shown in
green, pending work in grey, and failures in red.

![Verification status](img/verification-status-secondary.png)

## Using checksums to compare Geo nodes

To check the health of Geo **secondary** nodes, we use a checksum over the list of
Git references and their values. The checksum includes `HEAD`, `heads`, `tags`,
`notes`, and GitLab-specific references to ensure true consistency. If two nodes
have the same checksum, then they definitely hold the same references. We compute
the checksum for every node after every update to make sure that they are all
in sync.
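The following is a minimal sketch of the idea, not the exact implementation GitLab
uses internally: it lists every reference with its value, sorts them for a stable
ordering, and hashes the result. Running it inside copies of the same repository on
two nodes and comparing the output illustrates how a single checksum can confirm
that the references match.

```sh
# Run inside a repository on each node, then compare the resulting digests
# between the primary and secondary nodes.
git show-ref --head | sort | sha256sum
```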
## Repository re-verification

> [Introduced](https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/8550) in GitLab Enterprise Edition 11.6. Available in [GitLab Premium](https://about.gitlab.com/pricing/).

Due to bugs or transient infrastructure failures, it is possible for Git
repositories to change unexpectedly without being marked for verification.
Geo constantly reverifies the repositories to ensure the integrity of the
data. The default and recommended re-verification interval is 7 days, though
an interval as short as 1 day can be set. Shorter intervals reduce risk but
increase load, and vice versa.

Navigate to the **Admin Area > Geo** dashboard on the **primary** node, and
click the **Edit** button for the **primary** node to customize the minimum
re-verification interval:

![Re-verification interval](img/reverification-interval.png)

Automatic background re-verification is enabled by default, but you can
disable it if needed. Run the following commands in a Rails console on the
**primary** node:

```sh
# Omnibus GitLab
gitlab-rails console

# Installation from source
cd /home/git/gitlab
sudo -u git -H bin/rails console RAILS_ENV=production
```

To disable automatic background re-verification:

```ruby
Feature.disable('geo_repository_reverification')
```

To enable automatic background re-verification:

```ruby
Feature.enable('geo_repository_reverification')
```

## Reset verification for projects where verification has failed

Geo actively tries to correct verification failures by marking the repository to
be resynced with a backoff period. If you want to reset them manually, the
following Rake tasks mark projects where verification has failed or where the
checksums do not match, so they are resynced without the backoff period.

For repositories:

- Omnibus Installation

  ```sh
  sudo gitlab-rake geo:verification:repository:reset
  ```

- Source Installation

  ```sh
  sudo -u git -H bundle exec rake geo:verification:repository:reset RAILS_ENV=production
  ```

For wikis:

- Omnibus Installation

  ```sh
  sudo gitlab-rake geo:verification:wiki:reset
  ```

- Source Installation

  ```sh
  sudo -u git -H bundle exec rake geo:verification:wiki:reset RAILS_ENV=production
  ```

## Current limitations

Until [issue #5064][ee-5064] is completed, background verification doesn't cover
CI job artifacts and traces, LFS objects, or user uploads in file storage.
Verify their integrity manually by following [these instructions][foreground-verification]
on both nodes, and comparing the output between them.
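For Omnibus installations, that foreground verification can be run with Rake tasks
along the following lines; see the linked documentation for the authoritative list
of tasks and their source-installation equivalents.

```sh
# Run on both nodes and compare the output.
sudo gitlab-rake gitlab:lfs:check
sudo gitlab-rake gitlab:uploads:check
sudo gitlab-rake gitlab:artifacts:check
```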
Data in object storage is **not verified**, as the object store is responsible
for ensuring the integrity of the data.

[reset-verification]: background_verification.md#reset-verification-for-projects-where-verification-has-failed
[foreground-verification]: ../../raketasks/check.md
[ee-5064]: https://gitlab.com/gitlab-org/gitlab-ee/issues/5064

diff --git a/doc/administration/geo/disaster_recovery/bring_primary_back.md b/doc/administration/geo/disaster_recovery/bring_primary_back.md
new file mode 100644
index 00000000000..f4d31a98080
--- /dev/null
+++ b/doc/administration/geo/disaster_recovery/bring_primary_back.md
@@ -0,0 +1,61 @@

# Bring a demoted primary node back online **[PREMIUM ONLY]**

After a failover, it is possible to fail back to the demoted **primary** node to
restore your original configuration. This process consists of two steps:

1. Making the old **primary** node a **secondary** node.
1. Promoting a **secondary** node to a **primary** node.

CAUTION: **Caution:**
If you have any doubts about the consistency of the data on this node, we recommend setting it up from scratch.

## Configure the former **primary** node to be a **secondary** node

Since the former **primary** node will be out of sync with the current **primary** node,
the first step is to bring the former **primary** node up to date. Note that deletions
of data stored on disk, such as repositories and uploads, will not be replayed when
bringing the former **primary** node back into sync, which may result in increased disk
usage. Alternatively, you can [set up a new **secondary** GitLab instance][setup-geo] to
avoid this.

To bring the former **primary** node up to date:

1. SSH into the former **primary** node that has fallen behind.
1. Make sure all the services are up:

    ```sh
    sudo gitlab-ctl start
    ```

   > **Note 1:** If you [disabled the **primary** node permanently][disaster-recovery-disable-primary],
   > you need to undo those steps now. For Debian/Ubuntu you just need to run
   > `sudo systemctl enable gitlab-runsvdir`. For CentOS 6, you need to install
   > the GitLab instance from scratch and set it up as a **secondary** node by
   > following the [setup instructions][setup-geo]. In this case, you don't need to follow the next step.
   >
   > **Note 2:** If you [changed the DNS records](index.md#step-4-optional-updating-the-primary-domain-dns-record)
   > for this node during the disaster recovery procedure, you may need to [block
   > all the writes to this node](https://gitlab.com/gitlab-org/gitlab-ee/blob/master/doc/gitlab-geo/planned-failover.md#block-primary-traffic)
   > during this procedure.

1. [Set up database replication][database-replication]. Note that in this
   case, **primary** node refers to the current **primary** node, and **secondary** node refers to the
   former **primary** node.

If you have lost your original **primary** node, follow the
[setup instructions][setup-geo] to set up a new **secondary** node.

## Promote the **secondary** node to **primary** node

When the initial replication is complete and the **primary** node and **secondary** node are
closely in sync, you can do a [planned failover].

## Restore the **secondary** node

If your objective is to have two nodes again, you need to bring your **secondary**
node back online as well by repeating the first step
([configure the former **primary** node to be a **secondary** node](#configure-the-former-primary-node-to-be-a-secondary-node))
for the **secondary** node.
[setup-geo]: ../replication/index.md#setup-instructions
[database-replication]: ../replication/database.md
[disaster-recovery-disable-primary]: index.md#step-2-permanently-disable-the-primary-node
[planned failover]: planned_failover.md

diff --git a/doc/administration/geo/disaster_recovery/img/replication-status.png b/doc/administration/geo/disaster_recovery/img/replication-status.png
new file mode 100644
index 00000000000..d7085927c75
--- /dev/null
+++ b/doc/administration/geo/disaster_recovery/img/replication-status.png
Binary files differ
diff --git a/doc/administration/geo/disaster_recovery/img/reverification-interval.png b/doc/administration/geo/disaster_recovery/img/reverification-interval.png
new file mode 100644
index 00000000000..ad4597a4f49
--- /dev/null
+++ b/doc/administration/geo/disaster_recovery/img/reverification-interval.png
Binary files differ
diff --git a/doc/administration/geo/disaster_recovery/img/verification-status-primary.png b/doc/administration/geo/disaster_recovery/img/verification-status-primary.png
new file mode 100644
index 00000000000..2503408ec5d
--- /dev/null
+++ b/doc/administration/geo/disaster_recovery/img/verification-status-primary.png
Binary files differ
diff --git a/doc/administration/geo/disaster_recovery/img/verification-status-secondary.png b/doc/administration/geo/disaster_recovery/img/verification-status-secondary.png
new file mode 100644
index 00000000000..462274d8b14
--- /dev/null
+++ b/doc/administration/geo/disaster_recovery/img/verification-status-secondary.png
Binary files differ
diff --git a/doc/administration/geo/disaster_recovery/index.md b/doc/administration/geo/disaster_recovery/index.md
new file mode 100644
index 00000000000..71dc797f281
--- /dev/null
+++ b/doc/administration/geo/disaster_recovery/index.md
@@ -0,0 +1,322 @@

# Disaster Recovery **[PREMIUM ONLY]**

Geo replicates your database, your Git repositories, and a few other assets.
We will support and replicate more data in the future, which will enable you to
fail over with minimal effort in a disaster situation.

See [Geo current limitations][geo-limitations] for more information.

CAUTION: **Warning:**
Disaster recovery for multi-secondary configurations is in **Alpha**.
For the latest updates, check the multi-secondary [Disaster Recovery epic][gitlab-org&65].

## Promoting a **secondary** Geo node in single-secondary configurations

We don't currently provide an automated way to promote a Geo replica and do a
failover, but you can do it manually if you have `root` access to the machine.

This process promotes a **secondary** Geo node to a **primary** node. To regain
geographic redundancy as quickly as possible, you should add a new **secondary** node
immediately after following these instructions.

### Step 1. Allow replication to finish if possible

If the **secondary** node is still replicating data from the **primary** node, follow
[the planned failover docs][planned-failover] as closely as possible in
order to avoid unnecessary data loss.
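One way to monitor how far replication has progressed is the `geo:status` Rake task
on the **secondary** node, which prints a summary of replication and verification
progress (shown here for an Omnibus installation):

```sh
# Prints sync and verification percentages for repositories, wikis, uploads, and more.
sudo gitlab-rake geo:status
```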
### Step 2. Permanently disable the **primary** node

CAUTION: **Warning:**
If the **primary** node goes offline, there may be data saved on the **primary** node
that has not been replicated to the **secondary** node. This data should be treated
as lost if you proceed.

If an outage on the **primary** node happens, you should do everything possible to
avoid a split-brain situation where writes can occur in two different GitLab
instances, complicating recovery efforts. So to prepare for the failover, we
must disable the **primary** node.

1. SSH into the **primary** node to stop and disable GitLab, if possible:

    ```sh
    sudo gitlab-ctl stop
    ```

   Prevent GitLab from starting up again if the server unexpectedly reboots:

    ```sh
    sudo systemctl disable gitlab-runsvdir
    ```

   > **CentOS only**: In CentOS 6 or older, there is no easy way to prevent GitLab
   > from being started if the machine reboots (see [gitlab-org/omnibus-gitlab#3058]).
   > It may be safest to uninstall the GitLab package completely:

    ```sh
    yum remove gitlab-ee
    ```

   > **Ubuntu 14.04 LTS**: If you are using an older version of Ubuntu
   > or any other distro based on the Upstart init system, you can prevent GitLab
   > from starting if the machine reboots by doing the following:

    ```sh
    initctl stop gitlab-runsvdir
    echo 'manual' > /etc/init/gitlab-runsvdir.override
    initctl reload-configuration
    ```

1. If you do not have SSH access to the **primary** node, take the machine offline and
   prevent it from rebooting by any means at your disposal.
   Since there are many ways you may prefer to accomplish this, we will avoid a
   single recommendation. You may need to:

   - Reconfigure the load balancers.
   - Change DNS records (e.g., point the primary DNS record to the **secondary**
     node in order to stop usage of the **primary** node).
   - Stop the virtual servers.
   - Block traffic through a firewall.
   - Revoke object storage permissions from the **primary** node.
   - Physically disconnect a machine.

1. If you plan to
   [update the primary domain DNS record](#step-4-optional-updating-the-primary-domain-dns-record),
   you may wish to lower the TTL now to speed up propagation.

### Step 3. Promoting a **secondary** node

NOTE: **Note:**
A new **secondary** should not be added at this time. If you want to add a new
**secondary**, do this after you have completed the entire process of promoting
the **secondary** to the **primary**.

#### Promoting a **secondary** node running on a single machine

1. SSH in to your **secondary** node and log in as root:

    ```sh
    sudo -i
    ```

1. Edit `/etc/gitlab/gitlab.rb` to reflect its new status as **primary** by
   removing any lines that enabled the `geo_secondary_role`:

    ```ruby
    ## In pre-11.5 documentation, the role was enabled as follows. Remove this line.
    geo_secondary_role['enable'] = true

    ## In 11.5+ documentation, the role was enabled as follows. Remove this line.
    roles ['geo_secondary_role']
    ```

1. Promote the **secondary** node to the **primary** node. Execute:

    ```sh
    gitlab-ctl promote-to-primary-node
    ```

1. Verify you can connect to the newly promoted **primary** node using the URL used
   previously for the **secondary** node (see the connection check sketch after this list).
1. If successful, the **secondary** node has now been promoted to the **primary** node.
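As a quick connection check, you can request the sign-in page over HTTPS;
`gitlab-secondary.example.com` below is a placeholder for the URL previously used by
the **secondary** node:

```sh
# A 200 OK response indicates the promoted node is serving requests.
curl --silent --head https://gitlab-secondary.example.com/users/sign_in
```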
#### Promoting a **secondary** node with HA

The `gitlab-ctl promote-to-primary-node` command cannot yet be used in
conjunction with High Availability or with multiple machines, as it can only
perform changes on a **secondary** node consisting of a single machine. Instead, you
must do this manually.

1. SSH in to the database node in the **secondary** and trigger PostgreSQL to
   promote to read-write:

    ```bash
    sudo gitlab-pg-ctl promote
    ```

1. Edit `/etc/gitlab/gitlab.rb` on every machine in the **secondary** to
   reflect its new status as **primary** by removing any lines that enabled the
   `geo_secondary_role`:

    ```ruby
    ## In pre-11.5 documentation, the role was enabled as follows. Remove this line.
    geo_secondary_role['enable'] = true

    ## In 11.5+ documentation, the role was enabled as follows. Remove this line.
    roles ['geo_secondary_role']
    ```

   After making these changes, [reconfigure GitLab](../../restart_gitlab.md#omnibus-gitlab-reconfigure) on each
   machine so the changes take effect.

1. Promote the **secondary** to **primary**. SSH into a single application
   server and execute:

    ```bash
    sudo gitlab-rake geo:set_secondary_as_primary
    ```

1. Verify you can connect to the newly promoted **primary** using the URL used
   previously for the **secondary**.
1. Success! The **secondary** has now been promoted to **primary**.

### Step 4. (Optional) Updating the primary domain DNS record

Updating the DNS records for the primary domain to point to the **secondary** node
avoids the need to update all references to the primary domain to the
secondary domain, such as changing Git remotes and API URLs.

1. SSH into the **secondary** node and log in as root:

    ```sh
    sudo -i
    ```

1. Update the primary domain's DNS record. After updating the primary domain's
   DNS records to point to the **secondary** node, edit `/etc/gitlab/gitlab.rb` on the
   **secondary** node to reflect the new URL:

    ```ruby
    # Change the existing external_url configuration
    external_url 'https://<new_external_url>'
    ```

   NOTE: **Note**
   Changing `external_url` won't prevent access via the old secondary URL, as
   long as the secondary DNS records are still intact.

1. Reconfigure the **secondary** node for the change to take effect:

    ```sh
    gitlab-ctl reconfigure
    ```

1. Execute the command below to update the newly promoted **primary** node URL:

    ```sh
    gitlab-rake geo:update_primary_node_url
    ```

   This command uses the changed `external_url` configuration defined
   in `/etc/gitlab/gitlab.rb`.

1. Verify you can connect to the newly promoted **primary** using its URL.
   If you updated the DNS records for the primary domain, these changes may
   not have yet propagated, depending on the previous DNS records' TTL
   (see the propagation check after this list).
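You can check DNS propagation from the command line; `gitlab.example.com` below is a
placeholder for your primary domain:

```sh
# The answer should now be the IP address of the newly promoted primary node.
dig +short gitlab.example.com
```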
### Step 5. (Optional) Add **secondary** Geo node to a promoted **primary** node

Promoting a **secondary** node to **primary** node using the process above does not enable
Geo on the new **primary** node.

To bring a new **secondary** node online, follow the [Geo setup instructions][setup-geo].

### Step 6. (Optional) Removing the secondary's tracking database

Every **secondary** node has a special tracking database that is used to save the status
of the synchronization of all the items from the **primary** node.
Because the **secondary** node is already promoted, that data in the tracking database is
no longer required.

The data can be removed with the following command:

```sh
sudo rm -rf /var/opt/gitlab/geo-postgresql
```

## Promoting secondary Geo replica in multi-secondary configurations

If you have more than one **secondary** node and you need to promote one of them, we
suggest you follow
[Promoting a **secondary** Geo node in single-secondary configurations](#promoting-a-secondary-geo-node-in-single-secondary-configurations),
and after that you also need to follow two extra steps.

### Step 1. Prepare the new **primary** node to serve one or more **secondary** nodes

1. SSH into the new **primary** node and log in as root:

    ```sh
    sudo -i
    ```

1. Edit `/etc/gitlab/gitlab.rb`:

    ```ruby
    ## Enable a Geo Primary role (if you haven't yet)
    roles ['geo_primary_role']

    ##
    # Allow PostgreSQL client authentication from the primary and secondary IPs. These IPs may be
    # public or VPC addresses in CIDR format, for example ['198.51.100.1/32', '198.51.100.2/32']
    ##
    postgresql['md5_auth_cidr_addresses'] = ['<primary_node_ip>/32', '<secondary_node_ip>/32']

    # Every secondary server needs to have its own slot so specify the number of secondary nodes you're going to have
    postgresql['max_replication_slots'] = 1

    ##
    ## Disable automatic database migrations temporarily
    ## (until PostgreSQL is restarted and listening on the private address).
    ##
    gitlab_rails['auto_migrate'] = false
    ```

   (For more details about these settings, read [Configure the primary server][configure-the-primary-server].)

1. Save the file and reconfigure GitLab for the database listen changes and
   the replication slot changes to be applied:

    ```sh
    gitlab-ctl reconfigure
    ```

   Restart PostgreSQL for its changes to take effect:

    ```sh
    gitlab-ctl restart postgresql
    ```

1. Re-enable migrations now that PostgreSQL is restarted and listening on the
   private address.

   Edit `/etc/gitlab/gitlab.rb` and **change** the configuration to `true`:

    ```ruby
    gitlab_rails['auto_migrate'] = true
    ```

   Save the file and reconfigure GitLab:

    ```sh
    gitlab-ctl reconfigure
    ```

### Step 2. Initiate the replication process

Now we need to make each **secondary** node listen to changes on the new **primary** node.
To do that, you need to [initiate the replication process][initiate-the-replication-process]
again, but this time for another **primary** node. All the old replication settings will
be overwritten.

## Troubleshooting

### I followed the disaster recovery instructions and now two-factor auth is broken!

The setup instructions for Geo prior to 10.5 failed to replicate the
`otp_key_base` secret, which is used to encrypt the two-factor authentication
secrets stored in the database. If it differs between **primary** and **secondary**
nodes, users with two-factor authentication enabled won't be able to log in
after a failover.

If you still have access to the old **primary** node, you can follow the
instructions in the
[Upgrading to GitLab 10.5][updating-geo]
section to resolve the error. Otherwise, the secret is lost and you'll need to
[reset two-factor authentication for all users][sec-tfa].
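On Omnibus installations, that reset typically comes down to a single Rake task;
check the linked document for the exact procedure before running it:

```sh
# Disables two-factor authentication for every user, forcing them to re-enroll.
sudo gitlab-rake gitlab:two_factor:disable_for_all_users
```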
[gitlab-org&65]: https://gitlab.com/groups/gitlab-org/-/epics/65
[geo-limitations]: ../replication/index.md#current-limitations
[planned-failover]: planned_failover.md
[setup-geo]: ../replication/index.md#setup-instructions
[updating-geo]: ../replication/updating_the_geo_nodes.md#upgrading-to-gitlab-105
[sec-tfa]: ../../../security/two_factor_authentication.md#disabling-2fa-for-everyone
[gitlab-org/omnibus-gitlab#3058]: https://gitlab.com/gitlab-org/omnibus-gitlab/issues/3058
[gitlab-org/gitlab-ee#4284]: https://gitlab.com/gitlab-org/gitlab-ee/issues/4284
[initiate-the-replication-process]: ../replication/database.html#step-3-initiate-the-replication-process
[configure-the-primary-server]: ../replication/database.html#step-1-configure-the-primary-server

diff --git a/doc/administration/geo/disaster_recovery/planned_failover.md b/doc/administration/geo/disaster_recovery/planned_failover.md
new file mode 100644
index 00000000000..88ab12d910a
--- /dev/null
+++ b/doc/administration/geo/disaster_recovery/planned_failover.md
@@ -0,0 +1,227 @@

# Disaster recovery for planned failover **[PREMIUM ONLY]**

The primary use-case of Disaster Recovery is to ensure business continuity in
the event of an unplanned outage, but it can be used in conjunction with a planned
failover to migrate your GitLab instance between regions without extended
downtime.

As replication between Geo nodes is asynchronous, a planned failover requires
a maintenance window in which updates to the **primary** node are blocked. The
length of this window is determined by your replication capacity: once the
**secondary** node is completely synchronized with the **primary** node, the failover
can occur without data loss.

This document assumes you already have a fully configured, working Geo setup.
Please read it and the [Disaster Recovery][disaster-recovery] failover
documentation in full before proceeding. Planned failover is a major operation,
and if performed incorrectly, there is a high risk of data loss. Consider
rehearsing the procedure until you are comfortable with the necessary steps and
have a high degree of confidence in being able to perform them accurately.

## Not all data is automatically replicated

If you are using any GitLab features that Geo [doesn't support][limitations],
you must make separate provisions to ensure that the **secondary** node has an
up-to-date copy of any data associated with that feature. This may extend the
required scheduled maintenance period significantly.

A common strategy for keeping this period as short as possible for data stored
in files is to use `rsync` to transfer the data. An initial `rsync` can be
performed ahead of the maintenance window; subsequent `rsync`s (including a
final transfer inside the maintenance window) will then transfer only the
*changes* between the **primary** node and the **secondary** nodes.

Repository-centric strategies for using `rsync` effectively can be found in the
[moving repositories][moving-repositories] documentation; these strategies can
be adapted for use with any other file-based data, such as GitLab Pages (to
be found in `/var/opt/gitlab/gitlab-rails/shared/pages` if using Omnibus).
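For example, a Pages transfer might look like the sketch below; the hostname and SSH
user are placeholders, and the paths assume an Omnibus installation:

```sh
# Initial transfer ahead of the maintenance window.
sudo rsync -az /var/opt/gitlab/gitlab-rails/shared/pages/ \
  root@secondary.example.com:/var/opt/gitlab/gitlab-rails/shared/pages/

# Re-run the same command inside the maintenance window; only changes are transferred.
```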
## Pre-flight checks

Follow these steps before scheduling a planned failover to ensure the process
will go smoothly.

### Object storage

Some classes of non-repository data can use object storage in preference to
file storage. Geo [does not replicate data in object storage](../replication/object_storage.md),
leaving that task up to the object store itself. For a planned failover, this
means you can decouple the replication of this data from the failover of the
GitLab service.

If you're already using object storage, simply verify that your **secondary**
node has access to the same data as the **primary** node: they must either share the
same object storage configuration, or the **secondary** node should be configured to
access a [geographically-replicated][os-repl] copy provided by the object store
itself.

If you have a large GitLab installation or cannot tolerate downtime, consider
[migrating to Object Storage][os-conf] **before** scheduling a planned failover.
Doing so reduces both the length of the maintenance window and the risk of data
loss as a result of a poorly executed planned failover.

### Review the configuration of each **secondary** node

Database settings are automatically replicated to the **secondary** node, but the
`/etc/gitlab/gitlab.rb` file must be set up manually, and differs between
nodes. If features such as Mattermost, OAuth or LDAP integration are enabled
on the **primary** node but not the **secondary** node, they will be lost during failover.

Review the `/etc/gitlab/gitlab.rb` file for both nodes and ensure the **secondary** node
supports everything the **primary** node does **before** scheduling a planned failover.

### Run system checks

Run the following on both **primary** and **secondary** nodes:

```sh
gitlab-rake gitlab:check
gitlab-rake gitlab:geo:check
```

If any failures are reported on either node, they should be resolved **before**
scheduling a planned failover.

### Check that secrets match between nodes

The SSH host keys and `/etc/gitlab/gitlab-secrets.json` files should be
identical on all nodes. Check this by running the following on all nodes and
comparing the output:

```sh
sudo sha256sum /etc/ssh/ssh_host* /etc/gitlab/gitlab-secrets.json
```

If any files differ, replace the content on the **secondary** node with the
content from the **primary** node.

### Ensure Geo replication is up-to-date

The maintenance window won't end until Geo replication and verification is
completely finished. To keep the window as short as possible, you should
ensure these processes are as close to 100% as possible during active use.

Navigate to the **Admin Area > Geo** dashboard on the **secondary** node to
review status. Replicated objects (shown in green) should be close to 100%,
and there should be no failures (shown in red). If a large proportion of
objects aren't yet replicated (shown in grey), consider giving the node more
time to complete.

![Replication status](img/replication-status.png)

If any objects are failing to replicate, this should be investigated before
scheduling the maintenance window. Following a planned failover, anything that
failed to replicate will be **lost**.

You can use the [Geo status API](https://docs.gitlab.com/ee/api/geo_nodes.html#retrieve-project-sync-or-verification-failures-that-occurred-on-the-current-node) to review failed objects and
the reasons for failure.
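For example, a request along the following lines lists the failures recorded on the
**secondary** node; `<your_access_token>` is a placeholder for an administrator's
personal access token, and the exact endpoint and parameters are described in the
linked API documentation:

```sh
curl --header "PRIVATE-TOKEN: <your_access_token>" \
  "https://secondary.example.com/api/v4/geo_nodes/current/failures"
```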
A common cause of replication failures is the data being missing on the
**primary** node: you can resolve these failures by restoring the data from backup,
or removing references to the missing data.

### Verify the integrity of replicated data

This [content was moved to another location][background-verification].

### Notify users of scheduled maintenance

On the **primary** node, navigate to **Admin Area > Messages** and add a broadcast
message. You can check under **Admin Area > Geo** to estimate how long it
will take to finish syncing. An example message would be:

> A scheduled maintenance will take place at XX:XX UTC. We expect it to take
> less than 1 hour.

## Prevent updates to the **primary** node

Until a [read-only mode][ce-19739] is implemented, updates must be prevented
from happening manually. Note that your **secondary** node still needs read-only
access to the **primary** node during the maintenance window.

1. At the scheduled time, using your cloud provider or your node's firewall, block
   all HTTP, HTTPS and SSH traffic to/from the **primary** node, **except** for your IP and
   the **secondary** node's IP.

   For instance, you might run the following commands on the server(s) making up your **primary** node:

    ```sh
    sudo iptables -A INPUT -p tcp -s <secondary_node_ip> --destination-port 22 -j ACCEPT
    sudo iptables -A INPUT -p tcp -s <your_ip> --destination-port 22 -j ACCEPT
    sudo iptables -A INPUT -p tcp --destination-port 22 -j REJECT

    sudo iptables -A INPUT -p tcp -s <secondary_node_ip> --destination-port 80 -j ACCEPT
    sudo iptables -A INPUT -p tcp -s <your_ip> --destination-port 80 -j ACCEPT
    sudo iptables -A INPUT -p tcp --destination-port 80 -j REJECT

    sudo iptables -A INPUT -p tcp -s <secondary_node_ip> --destination-port 443 -j ACCEPT
    sudo iptables -A INPUT -p tcp -s <your_ip> --destination-port 443 -j ACCEPT
    sudo iptables -A INPUT -p tcp --destination-port 443 -j REJECT
    ```

   From this point, users will be unable to view their data or make changes on the
   **primary** node. They will also be unable to log in to the **secondary** node.
   However, existing sessions will work for the remainder of the maintenance period, and
   public data will be accessible throughout.

1. Verify the **primary** node is blocked to HTTP traffic by visiting it in a browser from
   another IP (see the connection check sketch after this list). The server should refuse
   the connection.

1. Verify the **primary** node is blocked to Git over SSH traffic by attempting to pull an
   existing Git repository with an SSH remote URL. The server should refuse the
   connection.

1. Disable non-Geo periodic background jobs on the **primary** node by navigating
   to **Admin Area > Monitoring > Background Jobs > Cron**, pressing `Disable All`,
   and then pressing `Enable` for the `geo_sidekiq_cron_config_worker` cron job.
   This job will re-enable several other cron jobs that are essential for planned
   failover to complete successfully.
## Finish replicating and verifying all data

1. If you are manually replicating any data not managed by Geo, trigger the
   final replication process now.
1. On the **primary** node, navigate to **Admin Area > Monitoring > Background Jobs > Queues**
   and wait for all queues except those with `geo` in the name to drop to 0.
   These queues contain work that has been submitted by your users; failing over
   before it is completed will cause the work to be lost.
1. On the **primary** node, navigate to **Admin Area > Geo** and wait for the
   following conditions to be true of the **secondary** node you are failing over to:

   - All replication meters reach 100% replicated, 0% failures.
   - All verification meters reach 100% verified, 0% failures.
   - Database replication lag is 0ms.
   - The Geo log cursor is up to date (0 events behind).

1. On the **secondary** node, navigate to **Admin Area > Monitoring > Background Jobs > Queues**
   and wait for all the `geo` queues to drop to 0 queued and 0 running jobs.
1. On the **secondary** node, use [these instructions][foreground-verification]
   to verify the integrity of CI artifacts, LFS objects and uploads in file
   storage.

At this point, your **secondary** node will contain an up-to-date copy of everything the
**primary** node has, meaning nothing will be lost when you fail over.

## Promote the **secondary** node

Finally, follow the [Disaster Recovery docs][disaster-recovery] to promote the
**secondary** node to a **primary** node. This process will cause a brief outage on the
**secondary** node, and users may need to log in again.

Once it is completed, the maintenance window is over! Your new **primary** node will now
begin to diverge from the old one. If problems do arise at this point, failing
back to the old **primary** node [is possible][bring-primary-back], but likely to result
in the loss of any data uploaded to the new **primary** node in the meantime.

Don't forget to remove the broadcast message after the failover is complete.

[bring-primary-back]: bring_primary_back.md
[ce-19739]: https://gitlab.com/gitlab-org/gitlab-ce/issues/19739
[container-registry]: ../replication/container_registry.md
[disaster-recovery]: index.md
[ee-4930]: https://gitlab.com/gitlab-org/gitlab-ee/issues/4930
[ee-5064]: https://gitlab.com/gitlab-org/gitlab-ee/issues/5064
[foreground-verification]: ../../raketasks/check.md
[background-verification]: background_verification.md
[limitations]: ../replication/index.md#current-limitations
[moving-repositories]: ../../operations/moving_repositories.md
[os-conf]: ../replication/object_storage.md#configuration
[os-repl]: ../replication/object_storage.md#replication