summaryrefslogtreecommitdiff
path: root/doc/administration/gitaly/recovery.md
diff options
context:
space:
mode:
Diffstat (limited to 'doc/administration/gitaly/recovery.md')
-rw-r--r--doc/administration/gitaly/recovery.md418
1 files changed, 418 insertions, 0 deletions
diff --git a/doc/administration/gitaly/recovery.md b/doc/administration/gitaly/recovery.md
new file mode 100644
index 00000000000..e1b9a73908d
--- /dev/null
+++ b/doc/administration/gitaly/recovery.md
@@ -0,0 +1,418 @@
+---
+stage: Create
+group: Gitaly
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+type: reference
+---
+
+# Recovery options
+
+Gitaly Cluster can [recover from certain types of failure](recovery.md).
+
+## Primary Node Failure
+
+Gitaly Cluster recovers from a failing primary Gitaly node by promoting a healthy secondary as the
+new primary.
+
+In GitLab 14.1 and later, Gitaly Cluster:
+
+- Elects a healthy secondary with a fully up to date copy of the repository as the new primary.
+- Repository becomes unavailable if there are no fully up to date copies of it on healthy secondaries.
+
+To minimize data loss in GitLab 13.0 to 14.0, Gitaly Cluster:
+
+- Switches repositories that are outdated on the new primary to [read-only mode](#read-only-mode).
+- Elects the secondary with the least unreplicated writes from the primary to be the new
+ primary. Because there can still be some unreplicated writes,
+ [data loss can occur](#check-for-data-loss).
+
+### Read-only mode
+
+> - Introduced in GitLab 13.0 as [generally available](https://about.gitlab.com/handbook/product/gitlab-the-product/#generally-available-ga).
+> - Between GitLab 13.0 and GitLab 13.2, read-only mode applied to the whole virtual storage and occurred whenever failover occurred.
+> - [In GitLab 13.3 and later](https://gitlab.com/gitlab-org/gitaly/-/issues/2862), read-only mode applies on a per-repository basis and only occurs if a new primary is out of date.
+new primary. If the failed primary contained unreplicated writes, [data loss can occur](#check-for-data-loss).
+> - Removed in GitLab 14.1. Instead, repositories [become unavailable](#unavailable-repositories).
+
+When Gitaly Cluster switches to a new primary in GitLab 13.0 to 14.0, repositories enter
+read-only mode if they are out of date. This can happen after failing over to an outdated
+secondary. Read-only mode eases data recovery efforts by preventing writes that may conflict
+with the unreplicated writes on other nodes.
+
+To enable writes again in GitLab 13.0 to 14.0, an administrator can:
+
+1. [Check](#check-for-data-loss) for data loss.
+1. Attempt to [recover](#data-recovery) missing data.
+1. Either [enable writes](#enable-writes-or-accept-data-loss) in the virtual storage or
+ [accept data loss](#enable-writes-or-accept-data-loss) if necessary, depending on the version of
+ GitLab.
+
+## Unavailable repositories
+
+> - From GitLab 13.0 through 14.0, repositories became read-only if they were outdated on the primary but fully up to date on a healthy secondary. `dataloss` sub-command displays read-only repositories by default through these versions.
+> - Since GitLab 14.1, Praefect contains more responsive failover logic which immediately fails over to one of the fully up to date secondaries rather than placing the repository in read-only mode. Since GitLab 14.1, the `dataloss` sub-command displays repositories which are unavailable due to having no fully up to date copies on healthy Gitaly nodes.
+
+A repository is unavailable if all of its up to date replicas are unavailable. Unavailable repositories are
+not accessible through Praefect to prevent serving stale data that may break automated tooling.
+
+### Check for data loss
+
+The Praefect `dataloss` subcommand identifies:
+
+- Copies of repositories in GitLab 13.0 to GitLab 14.0 that at are likely to be outdated.
+ This can help identify potential data loss after a failover.
+- Repositories in GitLab 14.1 and later that are unavailable. This helps identify potential
+ data loss and repositories which are no longer accessible because all of their up-to-date
+ replicas copies are unavailable.
+
+The following parameters are available:
+
+- `-virtual-storage` that specifies which virtual storage to check. Because they might require
+ an administrator to intervene, the default behavior is to display:
+ - In GitLab 13.0 to 14.0, copies of read-only repositories.
+ - In GitLab 14.1 and later, unavailable repositories.
+- In GitLab 14.1 and later, [`-partially-unavailable`](#unavailable-replicas-of-available-repositories)
+ that specifies whether to include in the output repositories that are available but have
+ some assigned copies that are not available.
+
+NOTE:
+`dataloss` is still in beta and the output format is subject to change.
+
+To check for repositories with outdated primaries or for unavailable repositories, run:
+
+```shell
+sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml dataloss [-virtual-storage <virtual-storage>]
+```
+
+Every configured virtual storage is checked if none is specified:
+
+```shell
+sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml dataloss
+```
+
+Repositories are listed in the output that have either:
+
+- An outdated copy of the repository on the primary, in GitLab 13.0 to GitLab 14.0.
+- No healthy and fully up-to-date copies available, in GitLab 14.1 and later.
+
+The following information is printed for each repository:
+
+- A repository's relative path to the storage directory identifies each repository and groups the related
+ information.
+- The repository's current status is printed in parentheses next to the disk path:
+ - In GitLab 13.0 to 14.0, either `(read-only)` if the repository's primary node is outdated
+ and can't accept writes. Otherwise, `(writable)`.
+ - In GitLab 14.1 and later, `(unavailable)` is printed next to the disk path if the
+ repository is unavailable.
+- The primary field lists the repository's current primary. If the repository has no primary, the field shows
+ `No Primary`.
+- The In-Sync Storages lists replicas which have replicated the latest successful write and all writes
+ preceding it.
+- The Outdated Storages lists replicas which contain an outdated copy of the repository. Replicas which have no copy
+ of the repository but should contain it are also listed here. The maximum number of changes the replica is missing
+ is listed next to replica. It's important to notice that the outdated replicas may be fully up to date or contain
+ later changes but Praefect can't guarantee it.
+
+Additional information includes:
+
+- Whether a node is assigned to host the repository is listed with each node's status.
+ `assigned host` is printed next to nodes that are assigned to store the repository. The
+ text is omitted if the node contains a copy of the repository but is not assigned to store
+ the repository. Such copies aren't kept in sync by Praefect, but may act as replication
+ sources to bring assigned copies up to date.
+- In GitLab 14.1 and later, `unhealthy` is printed next to the copies that are located
+ on unhealthy Gitaly nodes.
+
+Example output:
+
+```shell
+Virtual storage: default
+ Outdated repositories:
+ @hashed/3f/db/3fdba35f04dc8c462986c992bcf875546257113072a909c162f7e470e581e278.git (unavailable):
+ Primary: gitaly-1
+ In-Sync Storages:
+ gitaly-2, assigned host, unhealthy
+ Outdated Storages:
+ gitaly-1 is behind by 3 changes or less, assigned host
+ gitaly-3 is behind by 3 changes or less
+```
+
+A confirmation is printed out when every repository is available. For example:
+
+```shell
+Virtual storage: default
+ All repositories are available!
+```
+
+#### Unavailable replicas of available repositories
+
+NOTE:
+In GitLab 14.0 and earlier, the flag is `-partially-replicated` and the output shows any repositories with assigned nodes with outdated
+copies.
+
+To also list information of repositories which are available but are unavailable from some of the assigned nodes,
+use the `-partially-unavailable` flag.
+
+A repository is available if there is a healthy, up to date replica available. Some of the assigned secondary
+replicas may be temporarily unavailable for access while they are waiting to replicate the latest changes.
+
+```shell
+sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml dataloss [-virtual-storage <virtual-storage>] [-partially-unavailable]
+```
+
+Example output:
+
+```shell
+Virtual storage: default
+ Outdated repositories:
+ @hashed/3f/db/3fdba35f04dc8c462986c992bcf875546257113072a909c162f7e470e581e278.git:
+ Primary: gitaly-1
+ In-Sync Storages:
+ gitaly-1, assigned host
+ Outdated Storages:
+ gitaly-2 is behind by 3 changes or less, assigned host
+ gitaly-3 is behind by 3 changes or less
+```
+
+With the `-partially-unavailable` flag set, a confirmation is printed out if every assigned replica is fully up to
+date and healthy.
+
+For example:
+
+```shell
+Virtual storage: default
+ All repositories are fully available on all assigned storages!
+```
+
+### Check repository checksums
+
+To check a project's repository checksums across on all Gitaly nodes, run the
+[replicas Rake task](../raketasks/praefect.md#replica-checksums) on the main GitLab node.
+
+### Accept data loss
+
+WARNING:
+`accept-dataloss` causes permanent data loss by overwriting other versions of the repository. Data
+[recovery efforts](#data-recovery) must be performed before using it.
+
+If it is not possible to bring one of the up to date replicas back online, you may have to accept data
+loss. When accepting data loss, Praefect marks the chosen replica of the repository as the latest version
+and replicates it to the other assigned Gitaly nodes. This process overwrites any other version of the
+repository so care must be taken.
+
+```shell
+sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml accept-dataloss
+-virtual-storage <virtual-storage> -repository <relative-path> -authoritative-storage <storage-name>
+```
+
+### Enable writes or accept data loss
+
+WARNING:
+`accept-dataloss` causes permanent data loss by overwriting other versions of the repository.
+Data [recovery efforts](#data-recovery) must be performed before using it.
+
+Praefect provides the following subcommands to re-enable writes or accept data loss:
+
+- In GitLab 13.2 and earlier, `enable-writes` to re-enable virtual storage for writes after
+ data recovery attempts:
+
+ ```shell
+ sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml enable-writes -virtual-storage <virtual-storage>
+ ```
+
+- In GitLab 13.3 and later, if it is not possible to bring one of the up to date nodes back
+ online, you may have to accept data loss:
+
+ ```shell
+ sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml accept-dataloss -virtual-storage <virtual-storage> -repository <relative-path> -authoritative-storage <storage-name>
+ ```
+
+ When accepting data loss, Praefect:
+
+ 1. Marks the chosen copy of the repository as the latest version.
+ 1. Replicates the copy to the other assigned Gitaly nodes.
+
+ This process overwrites any other copy of the repository so care must be taken.
+
+## Data recovery
+
+If a Gitaly node fails replication jobs for any reason, it ends up hosting outdated versions of the
+affected repositories. Praefect provides tools for:
+
+- [Automatic](#automatic-reconciliation) reconciliation, for GitLab 13.4 and later.
+- [Manual](#manual-reconciliation) reconciliation, for:
+ - GitLab 13.3 and earlier.
+ - Repositories upgraded to GitLab 13.4 and later without entries in the `repositories` table. In
+ GitLab 13.6 and later, [a migration is run](https://gitlab.com/gitlab-org/gitaly/-/issues/3033)
+ when Praefect starts for these repositories.
+
+These tools reconcile the outdated repositories to bring them fully up to date again.
+
+### Automatic reconciliation
+
+> [Introduced](https://gitlab.com/gitlab-org/gitaly/-/issues/2717) in GitLab 13.4.
+
+Praefect automatically reconciles repositories that are not up to date. By default, this is done every
+five minutes. For each outdated repository on a healthy Gitaly node, the Praefect picks a
+random, fully up-to-date replica of the repository on another healthy Gitaly node to replicate from. A
+replication job is scheduled only if there are no other replication jobs pending for the target
+repository.
+
+The reconciliation frequency can be changed via the configuration. The value can be any valid
+[Go duration value](https://pkg.go.dev/time#ParseDuration). Values below 0 disable the feature.
+
+Examples:
+
+```ruby
+praefect['reconciliation_scheduling_interval'] = '5m' # the default value
+```
+
+```ruby
+praefect['reconciliation_scheduling_interval'] = '30s' # reconcile every 30 seconds
+```
+
+```ruby
+praefect['reconciliation_scheduling_interval'] = '0' # disable the feature
+```
+
+### Manual reconciliation
+
+WARNING:
+The `reconcile` sub-command was removed in GitLab 14.1. Use [automatic reconciliation](#automatic-reconciliation) instead.
+Manual reconciliation may produce excess replication jobs and is limited in functionality. Manual reconciliation does not
+work when [repository-specific primary nodes](praefect.md#repository-specific-primary-nodes) are enabled.
+
+The Praefect `reconcile` sub-command allows for the manual reconciliation between two Gitaly nodes. The
+command replicates every repository on a later version on the reference storage to the target storage.
+
+```shell
+sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml reconcile -virtual <virtual-storage> -reference <up-to-date-storage> -target <outdated-storage> -f
+```
+
+- Replace the placeholder `<virtual-storage>` with the virtual storage containing the Gitaly node storage to be checked.
+- Replace the placeholder `<up-to-date-storage>` with the Gitaly storage name containing up to date repositories.
+- Replace the placeholder `<outdated-storage>` with the Gitaly storage name containing outdated repositories.
+
+### Manually remove repositories
+
+> - [Introduced](https://gitlab.com/gitlab-org/gitaly/-/merge_requests/3767) in GitLab 14.3.
+> - [Introduced](https://gitlab.com/gitlab-org/gitaly/-/merge_requests/4054) in GitLab 14.6, support for dry-run mode.
+
+The `remove-repository` Praefect sub-command removes a repository from a Gitaly Cluster, and all state associated with a given repository including:
+
+- On-disk repositories on all relevant Gitaly nodes.
+- Any database state tracked by Praefect.
+
+In GitLab 14.6 and later, by default, the command operates in dry-run mode. In earlier versions, the command didn't support dry-run mode. For example:
+
+```shell
+sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml remove-repository -virtual-storage <virtual-storage> -repository <repository>
+```
+
+- Replace `<virtual-storage>` with the name of the virtual storage containing the repository.
+- Replace `<repository>` with the relative path of the repository to remove.
+- In GitLab 14.6 and later, add `-apply` to run the command outside of dry-run mode and remove the repository. For example:
+
+ ```shell
+ sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml remove-repository -virtual-storage <virtual-storage> -repository <repository> -apply
+ ```
+
+- `-virtual-storage` is the virtual storage the repository is located in. Virtual storages are configured in `/etc/gitlab/gitlab.rb` under `praefect['virtual_storages]` and looks like the following:
+
+ ```ruby
+ praefect['virtual_storages'] = {
+ 'default' => {
+ ...
+ },
+ 'storage-1' => {
+ ...
+ }
+ }
+ ```
+
+ In this example, the virtual storage to specify is `default` or `storage-1`.
+
+- `-repository` is the repository's relative path in the storage [beginning with `@hashed`](../repository_storage_types.md#hashed-storage).
+ For example:
+
+ ```plaintext
+ @hashed/f5/ca/f5ca38f748a1d6eaf726b8a42fb575c3c71f1864a8143301782de13da2d9202b.git
+ ```
+
+Parts of the repository can continue to exist after running `remove-repository`. This can be because of:
+
+- A deletion error.
+- An in-flight RPC call targeting the repository.
+
+If this occurs, run `remove-repository` again.
+
+### Manually list untracked repositories
+
+> [Introduced](https://gitlab.com/gitlab-org/gitaly/-/merge_requests/3926) in GitLab 14.4.
+
+The `list-untracked-repositories` Praefect sub-command lists repositories of the Gitaly Cluster that both:
+
+- Exist for at least one Gitaly storage.
+- Aren't tracked in the Praefect database.
+
+The command outputs:
+
+- Result to `STDOUT` and the command's logs.
+- Errors to `STDERR`.
+
+Each entry is a complete JSON string with a newline at the end (configurable using the
+`-delimiter` flag). For example:
+
+```plaintext
+sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml list-untracked-repositories
+{"virtual_storage":"default","storage":"gitaly-1","relative_path":"@hashed/ab/cd/abcd123456789012345678901234567890123456789012345678901234567890.git"}
+{"virtual_storage":"default","storage":"gitaly-1","relative_path":"@hashed/ab/cd/abcd123456789012345678901234567890123456789012345678901234567891.git"}
+```
+
+### Manually track repositories
+
+> - [Introduced](https://gitlab.com/gitlab-org/omnibus-gitlab/-/merge_requests/5658) in GitLab 14.4.
+> - [Introduced](https://gitlab.com/gitlab-org/omnibus-gitlab/-/merge_requests/5789) in GitLab 14.6, support for immediate replication.
+
+The `track-repository` Praefect sub-command adds repositories on disk to the Praefect database to be tracked.
+
+```shell
+sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml track-repository -virtual-storage <virtual-storage> -repository <repository> -replicate-immediately
+```
+
+- `-virtual-storage` is the virtual storage the repository is located in. Virtual storages are configured in `/etc/gitlab/gitlab.rb` under `praefect['virtual_storages]` and looks like the following:
+
+ ```ruby
+ praefect['virtual_storages'] = {
+ 'default' => {
+ ...
+ },
+ 'storage-1' => {
+ ...
+ }
+ }
+ ```
+
+ In this example, the virtual storage to specify is `default` or `storage-1`.
+
+- `-repository` is the repository's relative path in the storage [beginning with `@hashed`](../repository_storage_types.md#hashed-storage).
+ For example:
+
+ ```plaintext
+ @hashed/f5/ca/f5ca38f748a1d6eaf726b8a42fb575c3c71f1864a8143301782de13da2d9202b.git
+ ```
+
+- `-authoritative-storage` is the storage we want Praefect to treat as the primary. Required if
+ [per-repository replication](praefect.md#configure-replication-factor) is set as the replication strategy.
+- `-replicate-immediately`, available in GitLab 14.6 and later, causes the command to replicate the repository to its secondaries immediately.
+ Otherwise, replication jobs are scheduled for execution in the database and are picked up by a Praefect background process.
+
+The command outputs:
+
+- Results to `STDOUT` and the command's logs.
+- Errors to `STDERR`.
+
+This command fails if:
+
+- The repository is already being tracked by the Praefect database.
+- The repository does not exist on disk.