Diffstat (limited to 'doc/administration/gitaly/recovery.md')
-rw-r--r-- | doc/administration/gitaly/recovery.md | 418 |
1 files changed, 418 insertions, 0 deletions
diff --git a/doc/administration/gitaly/recovery.md b/doc/administration/gitaly/recovery.md new file mode 100644 index 00000000000..e1b9a73908d --- /dev/null +++ b/doc/administration/gitaly/recovery.md @@ -0,0 +1,418 @@ +--- +stage: Create +group: Gitaly +info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments +type: reference +--- + +# Recovery options + +Gitaly Cluster can [recover from certain types of failure](recovery.md). + +## Primary Node Failure + +Gitaly Cluster recovers from a failing primary Gitaly node by promoting a healthy secondary as the +new primary. + +In GitLab 14.1 and later, Gitaly Cluster: + +- Elects a healthy secondary with a fully up to date copy of the repository as the new primary. +- Makes the repository unavailable if there are no fully up to date copies of it on healthy secondaries. + +To minimize data loss in GitLab 13.0 to 14.0, Gitaly Cluster: + +- Switches repositories that are outdated on the new primary to [read-only mode](#read-only-mode). +- Elects the secondary with the fewest unreplicated writes from the primary to be the new + primary. Because there can still be some unreplicated writes, + [data loss can occur](#check-for-data-loss). + +### Read-only mode + +> - Introduced in GitLab 13.0 as [generally available](https://about.gitlab.com/handbook/product/gitlab-the-product/#generally-available-ga). +> - Between GitLab 13.0 and GitLab 13.2, read-only mode applied to the whole virtual storage and occurred whenever failover occurred. +> - [In GitLab 13.3 and later](https://gitlab.com/gitlab-org/gitaly/-/issues/2862), read-only mode applies on a per-repository basis and only occurs if a new primary is out of date. If the failed primary contained unreplicated writes, [data loss can occur](#check-for-data-loss). +> - Removed in GitLab 14.1. 
Instead, repositories [become unavailable](#unavailable-repositories). + +When Gitaly Cluster switches to a new primary in GitLab 13.0 to 14.0, repositories enter +read-only mode if they are out of date. This can happen after failing over to an outdated +secondary. Read-only mode eases data recovery efforts by preventing writes that may conflict +with the unreplicated writes on other nodes. + +To enable writes again in GitLab 13.0 to 14.0, an administrator can: + +1. [Check](#check-for-data-loss) for data loss. +1. Attempt to [recover](#data-recovery) missing data. +1. Either [enable writes](#enable-writes-or-accept-data-loss) in the virtual storage or + [accept data loss](#enable-writes-or-accept-data-loss) if necessary, depending on the version of + GitLab. + +## Unavailable repositories + +> - From GitLab 13.0 through 14.0, repositories became read-only if they were outdated on the primary but fully up to date on a healthy secondary. The `dataloss` sub-command displays read-only repositories by default through these versions. +> - Since GitLab 14.1, Praefect contains more responsive failover logic which immediately fails over to one of the fully up to date secondaries rather than placing the repository in read-only mode. Since GitLab 14.1, the `dataloss` sub-command displays repositories which are unavailable due to having no fully up to date copies on healthy Gitaly nodes. + +A repository is unavailable if all of its up to date replicas are unavailable. Unavailable repositories are +not accessible through Praefect to prevent serving stale data that may break automated tooling. + +### Check for data loss + +The Praefect `dataloss` subcommand identifies: + +- Copies of repositories in GitLab 13.0 to GitLab 14.0 that are likely to be outdated. + This can help identify potential data loss after a failover. +- Repositories in GitLab 14.1 and later that are unavailable. 
This helps identify potential + data loss and repositories which are no longer accessible because all of their up-to-date + replicas are unavailable. + +The following parameters are available: + +- `-virtual-storage`, which specifies the virtual storage to check. Because they might require + an administrator to intervene, the default behavior is to display: + - In GitLab 13.0 to 14.0, copies of read-only repositories. + - In GitLab 14.1 and later, unavailable repositories. +- In GitLab 14.1 and later, [`-partially-unavailable`](#unavailable-replicas-of-available-repositories), + which specifies whether the output also includes repositories that are available but have + some assigned copies that are not available. + +NOTE: +`dataloss` is still in beta and the output format is subject to change. + +To check for repositories with outdated primaries or for unavailable repositories, run: + +```shell +sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml dataloss [-virtual-storage <virtual-storage>] +``` + +Every configured virtual storage is checked if none is specified: + +```shell +sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml dataloss +``` + +The output lists repositories that have either: + +- An outdated copy of the repository on the primary, in GitLab 13.0 to GitLab 14.0. +- No healthy and fully up-to-date copies available, in GitLab 14.1 and later. + +The following information is printed for each repository: + +- A repository's relative path to the storage directory identifies each repository and groups the related + information. +- The repository's current status is printed in parentheses next to the disk path: + - In GitLab 13.0 to 14.0, `(read-only)` if the repository's primary node is outdated + and can't accept writes. Otherwise, `(writable)`. + - In GitLab 14.1 and later, `(unavailable)` is printed next to the disk path if the + repository is unavailable. 
+- The `Primary` field lists the repository's current primary. If the repository has no primary, the field shows + `No Primary`. +- The `In-Sync Storages` field lists replicas which have replicated the latest successful write and all writes + preceding it. +- The `Outdated Storages` field lists replicas which contain an outdated copy of the repository. Replicas which have no copy + of the repository but should contain it are also listed here. The maximum number of changes the replica is missing + is listed next to the replica. Replicas listed as outdated may in fact be fully up to date or contain + later changes, but Praefect can't guarantee it. + +Additional information includes: + +- Whether a node is assigned to host the repository is listed with each node's status. + `assigned host` is printed next to nodes that are assigned to store the repository. The + text is omitted if the node contains a copy of the repository but is not assigned to store + the repository. Such copies aren't kept in sync by Praefect, but may act as replication + sources to bring assigned copies up to date. +- In GitLab 14.1 and later, `unhealthy` is printed next to the copies that are located + on unhealthy Gitaly nodes. + +Example output: + +```shell +Virtual storage: default + Outdated repositories: + @hashed/3f/db/3fdba35f04dc8c462986c992bcf875546257113072a909c162f7e470e581e278.git (unavailable): + Primary: gitaly-1 + In-Sync Storages: + gitaly-2, assigned host, unhealthy + Outdated Storages: + gitaly-1 is behind by 3 changes or less, assigned host + gitaly-3 is behind by 3 changes or less +``` + +A confirmation is printed out when every repository is available. For example: + +```shell +Virtual storage: default + All repositories are available! +``` + +#### Unavailable replicas of available repositories + +NOTE: +In GitLab 14.0 and earlier, the flag is `-partially-replicated` and the output shows any repositories with assigned nodes with outdated +copies. 
+ +To also list information about repositories which are available but are unavailable from some of the assigned nodes, +use the `-partially-unavailable` flag. + +A repository is available if there is a healthy, up to date replica available. Some of the assigned secondary +replicas may be temporarily unavailable for access while they are waiting to replicate the latest changes. + +```shell +sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml dataloss [-virtual-storage <virtual-storage>] [-partially-unavailable] +``` + +Example output: + +```shell +Virtual storage: default + Outdated repositories: + @hashed/3f/db/3fdba35f04dc8c462986c992bcf875546257113072a909c162f7e470e581e278.git: + Primary: gitaly-1 + In-Sync Storages: + gitaly-1, assigned host + Outdated Storages: + gitaly-2 is behind by 3 changes or less, assigned host + gitaly-3 is behind by 3 changes or less +``` + +With the `-partially-unavailable` flag set, a confirmation is printed out if every assigned replica is fully up to +date and healthy. + +For example: + +```shell +Virtual storage: default + All repositories are fully available on all assigned storages! +``` + +### Check repository checksums + +To check a project's repository checksums across all Gitaly nodes, run the +[replicas Rake task](../raketasks/praefect.md#replica-checksums) on the main GitLab node. + +### Accept data loss + +WARNING: +`accept-dataloss` causes permanent data loss by overwriting other versions of the repository. Data +[recovery efforts](#data-recovery) must be performed before using it. + +If it is not possible to bring one of the up to date replicas back online, you may have to accept data +loss. When accepting data loss, Praefect marks the chosen replica of the repository as the latest version +and replicates it to the other assigned Gitaly nodes. This process overwrites any other version of the +repository so care must be taken. 
+ +```shell +sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml accept-dataloss -virtual-storage <virtual-storage> -repository <relative-path> -authoritative-storage <storage-name> +``` + +### Enable writes or accept data loss + +WARNING: +`accept-dataloss` causes permanent data loss by overwriting other versions of the repository. +Data [recovery efforts](#data-recovery) must be performed before using it. + +Praefect provides the following subcommands to re-enable writes or accept data loss: + +- In GitLab 13.2 and earlier, `enable-writes` to re-enable virtual storage for writes after + data recovery attempts: + + ```shell + sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml enable-writes -virtual-storage <virtual-storage> + ``` + +- In GitLab 13.3 and later, `accept-dataloss` to accept data loss if it is not possible to bring + one of the up to date nodes back online: + + ```shell + sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml accept-dataloss -virtual-storage <virtual-storage> -repository <relative-path> -authoritative-storage <storage-name> + ``` + + When accepting data loss, Praefect: + + 1. Marks the chosen copy of the repository as the latest version. + 1. Replicates the copy to the other assigned Gitaly nodes. + + This process overwrites any other copy of the repository so care must be taken. + +## Data recovery + +If a Gitaly node fails replication jobs for any reason, it ends up hosting outdated versions of the +affected repositories. Praefect provides tools for: + +- [Automatic](#automatic-reconciliation) reconciliation, for GitLab 13.4 and later. +- [Manual](#manual-reconciliation) reconciliation, for: + - GitLab 13.3 and earlier. + - Repositories upgraded to GitLab 13.4 and later without entries in the `repositories` table. 
In + GitLab 13.6 and later, [a migration is run](https://gitlab.com/gitlab-org/gitaly/-/issues/3033) + when Praefect starts for these repositories. + +These tools reconcile the outdated repositories to bring them fully up to date again. + +### Automatic reconciliation + +> [Introduced](https://gitlab.com/gitlab-org/gitaly/-/issues/2717) in GitLab 13.4. + +Praefect automatically reconciles repositories that are not up to date. By default, this is done every +five minutes. For each outdated repository on a healthy Gitaly node, Praefect picks a +random, fully up-to-date replica of the repository on another healthy Gitaly node to replicate from. A +replication job is scheduled only if there are no other replication jobs pending for the target +repository. + +The reconciliation frequency can be changed with the `reconciliation_scheduling_interval` setting. The value can be any valid +[Go duration value](https://pkg.go.dev/time#ParseDuration). Values of 0 or below disable the feature. + +Examples: + +```ruby +praefect['reconciliation_scheduling_interval'] = '5m' # the default value +``` + +```ruby +praefect['reconciliation_scheduling_interval'] = '30s' # reconcile every 30 seconds +``` + +```ruby +praefect['reconciliation_scheduling_interval'] = '0' # disable the feature +``` + +### Manual reconciliation + +WARNING: +The `reconcile` sub-command was removed in GitLab 14.1. Use [automatic reconciliation](#automatic-reconciliation) instead. +Manual reconciliation may produce excess replication jobs and is limited in functionality. Manual reconciliation does not +work when [repository-specific primary nodes](praefect.md#repository-specific-primary-nodes) are enabled. + +The Praefect `reconcile` sub-command allows for manual reconciliation between two Gitaly nodes. The +command replicates every repository that is at a later version on the reference storage to the target storage. 
+ +```shell +sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml reconcile -virtual <virtual-storage> -reference <up-to-date-storage> -target <outdated-storage> -f +``` + +- Replace the placeholder `<virtual-storage>` with the virtual storage containing the Gitaly node storage to be checked. +- Replace the placeholder `<up-to-date-storage>` with the Gitaly storage name containing up to date repositories. +- Replace the placeholder `<outdated-storage>` with the Gitaly storage name containing outdated repositories. + +### Manually remove repositories + +> - [Introduced](https://gitlab.com/gitlab-org/gitaly/-/merge_requests/3767) in GitLab 14.3. +> - [Introduced](https://gitlab.com/gitlab-org/gitaly/-/merge_requests/4054) in GitLab 14.6, support for dry-run mode. + +The `remove-repository` Praefect sub-command removes a repository from a Gitaly Cluster, and all state associated with a given repository including: + +- On-disk repositories on all relevant Gitaly nodes. +- Any database state tracked by Praefect. + +In GitLab 14.6 and later, by default, the command operates in dry-run mode. In earlier versions, the command didn't support dry-run mode. For example: + +```shell +sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml remove-repository -virtual-storage <virtual-storage> -repository <repository> +``` + +- Replace `<virtual-storage>` with the name of the virtual storage containing the repository. +- Replace `<repository>` with the relative path of the repository to remove. +- In GitLab 14.6 and later, add `-apply` to run the command outside of dry-run mode and remove the repository. For example: + + ```shell + sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml remove-repository -virtual-storage <virtual-storage> -repository <repository> -apply + ``` + +- `-virtual-storage` is the virtual storage the repository is located in. 
Virtual storages are configured in `/etc/gitlab/gitlab.rb` under `praefect['virtual_storages']` and look like the following: + + ```ruby + praefect['virtual_storages'] = { + 'default' => { + ... + }, + 'storage-1' => { + ... + } + } + ``` + + In this example, the virtual storage to specify is `default` or `storage-1`. + +- `-repository` is the repository's relative path in the storage [beginning with `@hashed`](../repository_storage_types.md#hashed-storage). + For example: + + ```plaintext + @hashed/f5/ca/f5ca38f748a1d6eaf726b8a42fb575c3c71f1864a8143301782de13da2d9202b.git + ``` + +Parts of the repository can continue to exist after running `remove-repository`. This can be because of: + +- A deletion error. +- An in-flight RPC call targeting the repository. + +If this occurs, run `remove-repository` again. + +### Manually list untracked repositories + +> [Introduced](https://gitlab.com/gitlab-org/gitaly/-/merge_requests/3926) in GitLab 14.4. + +The `list-untracked-repositories` Praefect sub-command lists repositories of the Gitaly Cluster that both: + +- Exist on at least one Gitaly storage. +- Aren't tracked in the Praefect database. + +The command outputs: + +- Results to `STDOUT` and the command's logs. +- Errors to `STDERR`. + +Each entry is a complete JSON string with a newline at the end (configurable using the +`-delimiter` flag). For example: + +```plaintext +sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml list-untracked-repositories +{"virtual_storage":"default","storage":"gitaly-1","relative_path":"@hashed/ab/cd/abcd123456789012345678901234567890123456789012345678901234567890.git"} +{"virtual_storage":"default","storage":"gitaly-1","relative_path":"@hashed/ab/cd/abcd123456789012345678901234567890123456789012345678901234567891.git"} +``` + +### Manually track repositories + +> - [Introduced](https://gitlab.com/gitlab-org/omnibus-gitlab/-/merge_requests/5658) in GitLab 14.4. 
+> - [Introduced](https://gitlab.com/gitlab-org/omnibus-gitlab/-/merge_requests/5789) in GitLab 14.6, support for immediate replication. + +The `track-repository` Praefect sub-command adds repositories on disk to the Praefect database so that they are tracked. + +```shell +sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml track-repository -virtual-storage <virtual-storage> -repository <repository> -replicate-immediately +``` + +- `-virtual-storage` is the virtual storage the repository is located in. Virtual storages are configured in `/etc/gitlab/gitlab.rb` under `praefect['virtual_storages']` and look like the following: + + ```ruby + praefect['virtual_storages'] = { + 'default' => { + ... + }, + 'storage-1' => { + ... + } + } + ``` + + In this example, the virtual storage to specify is `default` or `storage-1`. + +- `-repository` is the repository's relative path in the storage [beginning with `@hashed`](../repository_storage_types.md#hashed-storage). + For example: + + ```plaintext + @hashed/f5/ca/f5ca38f748a1d6eaf726b8a42fb575c3c71f1864a8143301782de13da2d9202b.git + ``` + +- `-authoritative-storage` is the storage Praefect should treat as the primary. Required if + [per-repository replication](praefect.md#configure-replication-factor) is set as the replication strategy. +- `-replicate-immediately`, available in GitLab 14.6 and later, causes the command to replicate the repository to its secondaries immediately. + Otherwise, replication jobs are scheduled for execution in the database and are picked up by a Praefect background process. + +The command outputs: + +- Results to `STDOUT` and the command's logs. +- Errors to `STDERR`. + +This command fails if: + +- The repository is already being tracked by the Praefect database. +- The repository does not exist on disk.
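Reader's note (not part of the diff): the `list-untracked-repositories` output and the `track-repository` flags described above can be combined. The following Python sketch is illustrative only. It assumes the default newline delimiter and the binary and config paths used in the examples above. The helper names `parse_untracked` and `track_command` are hypothetical, and using the entry's `storage` value for `-authoritative-storage` is an assumption, not documented behavior. The sketch only builds argument lists and does not run Praefect.

```python
import json
import shlex

# Paths taken from the examples in this page; adjust for other installations.
PRAEFECT = "/opt/gitlab/embedded/bin/praefect"
CONFIG = "/var/opt/gitlab/praefect/config.toml"


def parse_untracked(output):
    """Parse newline-delimited JSON entries, ignoring non-JSON lines
    and repeated (virtual_storage, relative_path) pairs."""
    seen = set()
    entries = []
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # e.g. an echoed command line or a blank line
        entry = json.loads(line)
        key = (entry["virtual_storage"], entry["relative_path"])
        if key not in seen:
            seen.add(key)
            entries.append(entry)
    return entries


def track_command(entry, authoritative=False):
    """Build a track-repository argument list for one untracked entry."""
    args = [
        "sudo", PRAEFECT, "-config", CONFIG, "track-repository",
        "-virtual-storage", entry["virtual_storage"],
        "-repository", entry["relative_path"],
    ]
    if authoritative:
        # Assumption: treat the storage that holds the untracked copy as primary.
        args += ["-authoritative-storage", entry["storage"]]
    return args


sample = """\
sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml list-untracked-repositories
{"virtual_storage":"default","storage":"gitaly-1","relative_path":"@hashed/ab/cd/abcd123456789012345678901234567890123456789012345678901234567890.git"}
{"virtual_storage":"default","storage":"gitaly-1","relative_path":"@hashed/ab/cd/abcd123456789012345678901234567890123456789012345678901234567891.git"}
"""

for entry in parse_untracked(sample):
    # Print the commands for review instead of executing them.
    print(shlex.join(track_command(entry)))
```

Printing the commands rather than executing them leaves room to review each one, because `track-repository` fails for repositories that are already tracked or that do not exist on disk.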