Diffstat (limited to 'doc/administration/gitaly')
-rw-r--r--  doc/administration/gitaly/faq.md              |   5
-rw-r--r--  doc/administration/gitaly/index.md            | 574
-rw-r--r--  doc/administration/gitaly/praefect.md         | 522
-rw-r--r--  doc/administration/gitaly/troubleshooting.md  | 372
4 files changed, 819 insertions, 654 deletions
diff --git a/doc/administration/gitaly/faq.md b/doc/administration/gitaly/faq.md index 98a90925d32..a5964b7a2eb 100644 --- a/doc/administration/gitaly/faq.md +++ b/doc/administration/gitaly/faq.md @@ -7,7 +7,8 @@ type: reference # Frequently asked questions **(FREE SELF)** -The following are answers to frequently asked questions about Gitaly and Gitaly Cluster. +The following are answers to frequently asked questions about Gitaly and Gitaly Cluster. For +troubleshooting information, see [Troubleshooting Gitaly and Gitaly Cluster](troubleshooting.md). ## How does Gitaly Cluster compare to Geo? @@ -87,4 +88,4 @@ There are no special requirements. Gitaly Cluster requires PostgreSQL version 11 These tables are created per the [specific configuration section](praefect.md#postgresql). If you find you have an empty Praefect database table, see the -[relevant troubleshooting section](index.md#relation-does-not-exist-errors). +[relevant troubleshooting section](troubleshooting.md#relation-does-not-exist-errors). diff --git a/doc/administration/gitaly/index.md b/doc/administration/gitaly/index.md index eaf9e21780d..0af248e0573 100644 --- a/doc/administration/gitaly/index.md +++ b/doc/administration/gitaly/index.md @@ -19,6 +19,67 @@ Gitaly implements a client-server architecture: - [GitLab Shell](https://gitlab.com/gitlab-org/gitlab-shell). - [GitLab Workhorse](https://gitlab.com/gitlab-org/gitlab-workhorse). +Gitaly manages only Git repository access for GitLab. Other types of GitLab data aren't accessed +using Gitaly. + +GitLab accesses [repositories](../../user/project/repository/index.md) through the configured +[repository storages](../repository_storage_paths.md). Each new repository is stored on one of the +repository storages based on their +[configured weights](../repository_storage_paths.md#configure-where-new-repositories-are-stored). 
Each +repository storage is either: + +- A Gitaly storage with direct access to repositories using [storage paths](../repository_storage_paths.md), + where each repository is stored on a single Gitaly node. All requests are routed to this node. +- A virtual storage provided by [Gitaly Cluster](#gitaly-cluster), where each repository can be + stored on multiple Gitaly nodes for fault tolerance. In a Gitaly Cluster: + - Read requests are distributed between multiple Gitaly nodes, which can improve performance. + - Write requests are broadcast to repository replicas. + +WARNING: +Engineering support for NFS for Git repositories is deprecated. Read the +[deprecation notice](#nfs-deprecation-notice). + +## Virtual storage + +Virtual storage makes it viable to have a single repository storage in GitLab to simplify repository +management. + +Virtual storage with Gitaly Cluster can usually replace direct Gitaly storage configurations. +However, this is at the expense of additional storage space needed to store each repository on multiple +Gitaly nodes. The benefit of using Gitaly Cluster virtual storage over direct Gitaly storage is: + +- Improved fault tolerance, because each Gitaly node has a copy of every repository. +- Improved resource utilization, reducing the need for over-provisioning for shard-specific peak + loads, because read loads are distributed across Gitaly nodes. +- Manual rebalancing for performance is not required, because read loads are distributed across + Gitaly nodes. +- Simpler management, because all Gitaly nodes are identical. + +The number of repository replicas can be configured using a +[replication factor](praefect.md#replication-factor). + +It can +be uneconomical to have the same replication factor for all repositories. +[Variable replication factor](https://gitlab.com/groups/gitlab-org/-/epics/3372) is planned to +provide greater flexibility for extremely large GitLab instances. 
+ +As with normal Gitaly storages, virtual storages can be sharded. + +## Gitaly + +The following shows GitLab set up to use direct access to Gitaly: + +![Shard example](img/shard_example_v13_3.png) + +In this example: + +- Each repository is stored on one of three Gitaly storages: `storage-1`, `storage-2`, or + `storage-3`. +- Each storage is serviced by a Gitaly node. +- The three Gitaly nodes store data on their file systems. + +### Gitaly architecture + The following illustrates the Gitaly client-server architecture: ```mermaid @@ -44,19 +105,7 @@ D -- gRPC --> Gitaly E --> F ``` -End users do not have direct access to Gitaly. Gitaly manages only Git repository access for GitLab. -Other types of GitLab data aren't accessed using Gitaly. - -<!-- vale gitlab.FutureTense = NO --> - -WARNING: -From GitLab 14.0, enhancements and bug fixes for NFS for Git repositories will no longer be -considered and customer technical support will be considered out of scope. -[Read more about Gitaly and NFS](#nfs-deprecation-notice). - -<!-- vale gitlab.FutureTense = YES --> - -## Configure Gitaly +### Configure Gitaly Gitaly comes pre-configured with Omnibus GitLab, which is a configuration [suitable for up to 1000 users](../reference_architectures/1k_users.md). For: @@ -72,10 +121,24 @@ default value. The default value depends on the GitLab version. ## Gitaly Cluster -Gitaly, the service that provides storage for Git repositories, can -be run in a clustered configuration to scale the Gitaly service and increase -fault tolerance. In this configuration, every Git repository is stored on every -Gitaly node in the cluster. +Git storage is provided through the Gitaly service in GitLab, and is essential to the operation of +GitLab. 
When the number of users, repositories, and activity grows, it is important to scale Gitaly
+appropriately by:
+
+- Increasing the CPU and memory resources available to Git before
+  resource exhaustion degrades Git, Gitaly, and GitLab application performance.
+- Increasing available storage before storage limits are reached, causing write
+  operations to fail.
+- Removing single points of failure to improve fault tolerance. Git should be
+  considered mission critical if a service degradation would prevent you from
+  deploying changes to production.
+
+Gitaly can be run in a clustered configuration to:
+
+- Scale the Gitaly service.
+- Increase fault tolerance.
+
+In this configuration, every Git repository can be stored on multiple Gitaly nodes in the cluster.
 
 Using a Gitaly Cluster increases fault tolerance by:
 
@@ -87,6 +150,19 @@ NOTE:
 Technical support for Gitaly clusters is limited to GitLab Premium and Ultimate
 customers.
 
+The following shows GitLab set up to access `storage-1`, a virtual storage provided by Gitaly
+Cluster:
+
+![Cluster example](img/cluster_example_v13_3.png)
+
+In this example:
+
+- Repositories are stored on a virtual storage called `storage-1`.
+- Three Gitaly nodes provide `storage-1` access: `gitaly-1`, `gitaly-2`, and `gitaly-3`.
+- The three Gitaly nodes share data in three separate hashed storage locations.
+- The [replication factor](praefect.md#replication-factor) is `3`. There are three copies maintained
+  of each repository.
+
 The availability objectives for Gitaly clusters are:
 
 - **Recovery Point Objective (RPO):** Less than 1 minute.
@@ -110,33 +186,18 @@ Gitaly Cluster supports:
 
 - [Strong consistency](praefect.md#strong-consistency) of the secondary replicas.
 - [Automatic failover](praefect.md#automatic-failover-and-primary-election-strategies) from the primary to the secondary.
 - Reporting of possible data loss if replication queue is non-empty.
-- Marking repositories as [read-only](praefect.md#read-only-mode) if data loss is detected to prevent data inconsistencies. +- From GitLab 13.0 to GitLab 14.0, marking repositories as [read-only](praefect.md#read-only-mode) + if data loss is detected to prevent data inconsistencies. Follow the [Gitaly Cluster epic](https://gitlab.com/groups/gitlab-org/-/epics/1489) for improvements including [horizontally distributing reads](https://gitlab.com/groups/gitlab-org/-/epics/2013). -### Overview - -Git storage is provided through the Gitaly service in GitLab, and is essential -to the operation of the GitLab application. When the number of -users, repositories, and activity grows, it is important to scale Gitaly -appropriately by: - -- Increasing the available CPU and memory resources available to Git before - resource exhaustion degrades Git, Gitaly, and GitLab application performance. -- Increase available storage before storage limits are reached causing write - operations to fail. -- Improve fault tolerance by removing single points of failure. Git should be - considered mission critical if a service degradation would prevent you from - deploying changes to production. - ### Moving beyond NFS WARNING: -From GitLab 13.0, using NFS for Git repositories is deprecated. In GitLab 14.0, -support for NFS for Git repositories is scheduled to be removed. Upgrade to -Gitaly Cluster as soon as possible. +Engineering support for NFS for Git repositories is deprecated. Technical support is planned to be +unavailable from GitLab 15.0. No further enhancements are planned for this feature. [Network File System (NFS)](https://en.wikipedia.org/wiki/Network_File_System) is not well suited to Git workloads which are CPU and IOPS sensitive. 
@@ -159,22 +220,6 @@ Further reading: - Blog post: [The road to Gitaly v1.0 (aka, why GitLab doesn't require NFS for storing Git data anymore)](https://about.gitlab.com/blog/2018/09/12/the-road-to-gitaly-1-0/) - Blog post: [How we spent two weeks hunting an NFS bug in the Linux kernel](https://about.gitlab.com/blog/2018/11/14/how-we-spent-two-weeks-hunting-an-nfs-bug/) -### Where Gitaly Cluster fits - -GitLab accesses [repositories](../../user/project/repository/index.md) through the configured -[repository storages](../repository_storage_paths.md). Each new repository is stored on one of the -repository storages based on their configured weights. Each repository storage is either: - -- A Gitaly storage served directly by Gitaly. These map to a directory on the file system of a - Gitaly node. -- A [virtual storage](#virtual-storage-or-direct-gitaly-storage) served by Praefect. A virtual - storage is a cluster of Gitaly storages that appear as a single repository storage. - -Virtual storages are a feature of Gitaly Cluster. They support replicating the repositories to -multiple storages for fault tolerance. Virtual storages can improve performance by distributing -requests across Gitaly nodes. Their distributed nature makes it viable to have a single repository -storage in GitLab to simplify repository management. - ### Components of Gitaly Cluster Gitaly Cluster consists of multiple components: @@ -182,59 +227,10 @@ Gitaly Cluster consists of multiple components: - [Load balancer](praefect.md#load-balancer) for distributing requests and providing fault-tolerant access to Praefect nodes. - [Praefect](praefect.md#praefect) nodes for managing the cluster and routing requests to Gitaly nodes. 
-- [PostgreSQL database](praefect.md#postgresql) for persisting cluster metadata and [PgBouncer](praefect.md#pgbouncer), +- [PostgreSQL database](praefect.md#postgresql) for persisting cluster metadata and [PgBouncer](praefect.md#use-pgbouncer), recommended for pooling Praefect's database connections. - Gitaly nodes to provide repository storage and Git access. -![Cluster example](img/cluster_example_v13_3.png) - -In this example: - -- Repositories are stored on a virtual storage called `storage-1`. -- Three Gitaly nodes provide `storage-1` access: `gitaly-1`, `gitaly-2`, and `gitaly-3`. -- The three Gitaly nodes store data on their file systems. - -### Virtual storage or direct Gitaly storage - -Gitaly supports multiple models of scaling: - -- Clustering using Gitaly Cluster, where each repository is stored on multiple Gitaly nodes in the - cluster. Read requests are distributed between repository replicas and write requests are - broadcast to repository replicas. GitLab accesses virtual storage. -- Direct access to Gitaly storage using [repository storage paths](../repository_storage_paths.md), - where each repository is stored on the assigned Gitaly node. All requests are routed to this node. - -The following is Gitaly set up to use direct access to Gitaly instead of Gitaly Cluster: - -![Shard example](img/shard_example_v13_3.png) - -In this example: - -- Each repository is stored on one of three Gitaly storages: `storage-1`, `storage-2`, - or `storage-3`. -- Each storage is serviced by a Gitaly node. -- The three Gitaly nodes share data in three separate hashed storage locations. -- The [replication factor](praefect.md#replication-factor) is `3`. There are three copies maintained - of each repository. - -Generally, virtual storage with Gitaly Cluster can replace direct Gitaly storage configurations, at -the expense of additional storage needed to store each repository on multiple Gitaly nodes. 
The -benefit of using Gitaly Cluster over direct Gitaly storage is: - -- Improved fault tolerance, because each Gitaly node has a copy of every repository. -- Improved resource utilization, reducing the need for over-provisioning for shard-specific peak - loads, because read loads are distributed across replicas. -- Manual rebalancing for performance is not required, because read loads are distributed across - replicas. -- Simpler management, because all Gitaly nodes are identical. - -Under some workloads, CPU and memory requirements may require a large fleet of Gitaly nodes. It -can be uneconomical to have one to one replication factor. - -A hybrid approach can be used in these instances, where each shard is configured as a smaller -cluster. [Variable replication factor](https://gitlab.com/groups/gitlab-org/-/epics/3372) is planned -to provide greater flexibility for extremely large GitLab instances. - ### Architecture Praefect is a router and transaction manager for Gitaly, and a required @@ -360,385 +356,21 @@ The second facet presents the only real solution. For this, we developed ## NFS deprecation notice -<!-- vale gitlab.FutureTense = NO --> - -From GitLab 14.0, enhancements and bug fixes for NFS for Git repositories will no longer be -considered and customer technical support will be considered out of scope. +Engineering support for NFS for Git repositories is deprecated. Technical support is planned to be +unavailable from GitLab 15.0. No further enhancements are planned for this feature. Additional information: - [Recommended NFS mount options and known issues with Gitaly and NFS](../nfs.md#upgrade-to-gitaly-cluster-or-disable-caching-if-experiencing-data-loss). - [GitLab statement of support](https://about.gitlab.com/support/statement-of-support.html#gitaly-and-nfs). -<!-- vale gitlab.FutureTense = YES --> - GitLab recommends: - Creating a [Gitaly Cluster](#gitaly-cluster) as soon as possible. 
- [Moving your repositories](praefect.md#migrate-to-gitaly-cluster) from NFS-based storage to Gitaly Cluster. -We welcome your feedback on this process: raise a support ticket, or [comment on the epic](https://gitlab.com/groups/gitlab-org/-/epics/4916). - -## Troubleshooting - -Refer to the information below when troubleshooting Gitaly and Gitaly Cluster. - -Before troubleshooting, see the Gitaly and Gitaly Cluster -[frequently asked questions](faq.md). - -### Troubleshoot Gitaly - -The following sections provide possible solutions to Gitaly errors. - -See also [Gitaly timeout](../../user/admin_area/settings/gitaly_timeouts.md) settings. - -#### Check versions when using standalone Gitaly servers - -When using standalone Gitaly servers, you must make sure they are the same version -as GitLab to ensure full compatibility: - -1. On the top bar, select **Menu >** **{admin}** **Admin** on your GitLab instance. -1. On the left sidebar, select **Overview > Gitaly Servers**. -1. Confirm all Gitaly servers indicate that they are up to date. - -#### Use `gitaly-debug` - -The `gitaly-debug` command provides "production debugging" tools for Gitaly and Git -performance. It is intended to help production engineers and support -engineers investigate Gitaly performance problems. - -If you're using GitLab 11.6 or newer, this tool should be installed on -your GitLab or Gitaly server already at `/opt/gitlab/embedded/bin/gitaly-debug`. 
-If you're investigating an older GitLab version you can compile this -tool offline and copy the executable to your server: - -```shell -git clone https://gitlab.com/gitlab-org/gitaly.git -cd cmd/gitaly-debug -GOOS=linux GOARCH=amd64 go build -o gitaly-debug -``` - -To see the help page of `gitaly-debug` for a list of supported sub-commands, run: - -```shell -gitaly-debug -h -``` - -#### Commits, pushes, and clones return a 401 - -```plaintext -remote: GitLab: 401 Unauthorized -``` - -You need to sync your `gitlab-secrets.json` file with your GitLab -application nodes. - -#### Client side gRPC logs - -Gitaly uses the [gRPC](https://grpc.io/) RPC framework. The Ruby gRPC -client has its own log file which may contain useful information when -you are seeing Gitaly errors. You can control the log level of the -gRPC client with the `GRPC_LOG_LEVEL` environment variable. The -default level is `WARN`. - -You can run a gRPC trace with: - -```shell -sudo GRPC_TRACE=all GRPC_VERBOSITY=DEBUG gitlab-rake gitlab:gitaly:check -``` - -#### Server side gRPC logs - -gRPC tracing can also be enabled in Gitaly itself with the `GODEBUG=http2debug` -environment variable. To set this in an Omnibus GitLab install: - -1. Add the following to your `gitlab.rb` file: - - ```ruby - gitaly['env'] = { - "GODEBUG=http2debug" => "2" - } - ``` - -1. [Reconfigure](../restart_gitlab.md#omnibus-gitlab-reconfigure) GitLab. - -#### Correlating Git processes with RPCs - -Sometimes you need to find out which Gitaly RPC created a particular Git process. - -One method for doing this is by using `DEBUG` logging. However, this needs to be enabled -ahead of time and the logs produced are quite verbose. 
- -A lightweight method for doing this correlation is by inspecting the environment -of the Git process (using its `PID`) and looking at the `CORRELATION_ID` variable: - -```shell -PID=<Git process ID> -sudo cat /proc/$PID/environ | tr '\0' '\n' | grep ^CORRELATION_ID= -``` - -This method isn't reliable for `git cat-file` processes, because Gitaly -internally pools and re-uses those across RPCs. - -#### Observing `gitaly-ruby` traffic +We welcome your feedback on this process. You can: -[`gitaly-ruby`](configure_gitaly.md#gitaly-ruby) is an internal implementation detail of Gitaly, -so, there's not that much visibility into what goes on inside -`gitaly-ruby` processes. - -If you have Prometheus set up to scrape your Gitaly process, you can see -request rates and error codes for individual RPCs in `gitaly-ruby` by -querying `grpc_client_handled_total`. - -- In theory, this metric does not differentiate between `gitaly-ruby` and other RPCs. -- In practice from GitLab 11.9, all gRPC calls made by Gitaly itself are internal calls from the - main Gitaly process to one of its `gitaly-ruby` sidecars. - -Assuming your `grpc_client_handled_total` counter only observes Gitaly, -the following query shows you RPCs are (most likely) internally -implemented as calls to `gitaly-ruby`: - -```prometheus -sum(rate(grpc_client_handled_total[5m])) by (grpc_method) > 0 -``` - -#### Repository changes fail with a `401 Unauthorized` error - -If you run Gitaly on its own server and notice these conditions: - -- Users can successfully clone and fetch repositories by using both SSH and HTTPS. -- Users can't push to repositories, or receive a `401 Unauthorized` message when attempting to - make changes to them in the web UI. - -Gitaly may be failing to authenticate with the Gitaly client because it has the -[wrong secrets file](configure_gitaly.md#configure-gitaly-servers). 
- -Confirm the following are all true: - -- When any user performs a `git push` to any repository on this Gitaly server, it - fails with a `401 Unauthorized` error: - - ```shell - remote: GitLab: 401 Unauthorized - To <REMOTE_URL> - ! [remote rejected] branch-name -> branch-name (pre-receive hook declined) - error: failed to push some refs to '<REMOTE_URL>' - ``` - -- When any user adds or modifies a file from the repository using the GitLab - UI, it immediately fails with a red `401 Unauthorized` banner. -- Creating a new project and [initializing it with a README](../../user/project/working_with_projects.md#blank-projects) - successfully creates the project but doesn't create the README. -- When [tailing the logs](https://docs.gitlab.com/omnibus/settings/logs.html#tail-logs-in-a-console-on-the-server) - on a Gitaly client and reproducing the error, you get `401` errors - when reaching the [`/api/v4/internal/allowed`](../../development/internal_api.md) endpoint: - - ```shell - # api_json.log - { - "time": "2019-07-18T00:30:14.967Z", - "severity": "INFO", - "duration": 0.57, - "db": 0, - "view": 0.57, - "status": 401, - "method": "POST", - "path": "\/api\/v4\/internal\/allowed", - "params": [ - { - "key": "action", - "value": "git-receive-pack" - }, - { - "key": "changes", - "value": "REDACTED" - }, - { - "key": "gl_repository", - "value": "REDACTED" - }, - { - "key": "project", - "value": "\/path\/to\/project.git" - }, - { - "key": "protocol", - "value": "web" - }, - { - "key": "env", - "value": "{\"GIT_ALTERNATE_OBJECT_DIRECTORIES\":[],\"GIT_ALTERNATE_OBJECT_DIRECTORIES_RELATIVE\":[],\"GIT_OBJECT_DIRECTORY\":null,\"GIT_OBJECT_DIRECTORY_RELATIVE\":null}" - }, - { - "key": "user_id", - "value": "2" - }, - { - "key": "secret_token", - "value": "[FILTERED]" - } - ], - "host": "gitlab.example.com", - "ip": "REDACTED", - "ua": "Ruby", - "route": "\/api\/:version\/internal\/allowed", - "queue_duration": 4.24, - "gitaly_calls": 0, - "gitaly_duration": 0, - 
"correlation_id": "XPUZqTukaP3" - } - - # nginx_access.log - [IP] - - [18/Jul/2019:00:30:14 +0000] "POST /api/v4/internal/allowed HTTP/1.1" 401 30 "" "Ruby" - ``` - -To fix this problem, confirm that your [`gitlab-secrets.json` file](configure_gitaly.md#configure-gitaly-servers) -on the Gitaly server matches the one on Gitaly client. If it doesn't match, -update the secrets file on the Gitaly server to match the Gitaly client, then -[reconfigure](../restart_gitlab.md#omnibus-gitlab-reconfigure). - -#### Command line tools cannot connect to Gitaly - -gRPC cannot reach your Gitaly server if: - -- You can't connect to a Gitaly server with command-line tools. -- Certain actions result in a `14: Connect Failed` error message. - -Verify you can reach Gitaly by using TCP: - -```shell -sudo gitlab-rake gitlab:tcp_check[GITALY_SERVER_IP,GITALY_LISTEN_PORT] -``` - -If the TCP connection: - -- Fails, check your network settings and your firewall rules. -- Succeeds, your networking and firewall rules are correct. - -If you use proxy servers in your command line environment such as Bash, these can interfere with -your gRPC traffic. - -If you use Bash or a compatible command line environment, run the following commands to determine -whether you have proxy servers configured: - -```shell -echo $http_proxy -echo $https_proxy -``` - -If either of these variables have a value, your Gitaly CLI connections may be getting routed through -a proxy which cannot connect to Gitaly. - -To remove the proxy setting, run the following commands (depending on which variables had values): - -```shell -unset http_proxy -unset https_proxy -``` - -#### Permission denied errors appearing in Gitaly or Praefect logs when accessing repositories - -You might see the following in Gitaly and Praefect logs: - -```shell -{ - ... 
- "error":"rpc error: code = PermissionDenied desc = permission denied", - "grpc.code":"PermissionDenied", - "grpc.meta.client_name":"gitlab-web", - "grpc.request.fullMethod":"/gitaly.ServerService/ServerInfo", - "level":"warning", - "msg":"finished unary call with code PermissionDenied", - ... -} -``` - -This is a GRPC call -[error response code](https://grpc.github.io/grpc/core/md_doc_statuscodes.html). - -If this error occurs, even though -[the Gitaly auth tokens are set up correctly](#praefect-errors-in-logs), -it's likely that the Gitaly servers are experiencing -[clock drift](https://en.wikipedia.org/wiki/Clock_drift). - -Ensure the Gitaly clients and servers are synchronized, and use an NTP time -server to keep them synchronized. - -#### Gitaly not listening on new address after reconfiguring - -When updating the `gitaly['listen_addr']` or `gitaly['prometheus_listen_addr']` values, Gitaly may -continue to listen on the old address after a `sudo gitlab-ctl reconfigure`. - -When this occurs, run `sudo gitlab-ctl restart` to resolve the issue. This should no longer be -necessary because [this issue](https://gitlab.com/gitlab-org/gitaly/-/issues/2521) is resolved. - -#### Permission denied errors appearing in Gitaly logs when accessing repositories from a standalone Gitaly node - -If this error occurs even though file permissions are correct, it's likely that the Gitaly node is -experiencing [clock drift](https://en.wikipedia.org/wiki/Clock_drift). - -Please ensure that the GitLab and Gitaly nodes are synchronized and use an NTP time -server to keep them synchronized if possible. - -### Troubleshoot Praefect (Gitaly Cluster) - -The following sections provide possible solutions to Gitaly Cluster errors. - -#### Praefect errors in logs - -If you receive an error, check `/var/log/gitlab/gitlab-rails/production.log`. 
- -Here are common errors and potential causes: - -- 500 response code - - **ActionView::Template::Error (7:permission denied)** - - `praefect['auth_token']` and `gitlab_rails['gitaly_token']` do not match on the GitLab server. - - **Unable to save project. Error: 7:permission denied** - - Secret token in `praefect['storage_nodes']` on GitLab server does not match the - value in `gitaly['auth_token']` on one or more Gitaly servers. -- 503 response code - - **GRPC::Unavailable (14:failed to connect to all addresses)** - - GitLab was unable to reach Praefect. - - **GRPC::Unavailable (14:all SubCons are in TransientFailure...)** - - Praefect cannot reach one or more of its child Gitaly nodes. Try running - the Praefect connection checker to diagnose. - -#### Determine primary Gitaly node - -To determine the current primary Gitaly node for a specific Praefect node: - -- Use the `Shard Primary Election` [Grafana chart](praefect.md#grafana) on the - [`Gitlab Omnibus - Praefect` dashboard](https://gitlab.com/gitlab-org/grafana-dashboards/-/blob/master/omnibus/praefect.json). - This is recommended. -- If you do not have Grafana set up, use the following command on each host of each - Praefect node: - - ```shell - curl localhost:9652/metrics | grep gitaly_praefect_primaries` - ``` - -#### Relation does not exist errors - -By default Praefect database tables are created automatically by `gitlab-ctl reconfigure` task. -However, if the `gitlab-ctl reconfigure` command isn't executed or there are errors during the -execution, the Praefect database tables are not created on initial reconfigure and can throw -errors that relations do not exist. 
-
-For example:
-
-- `ERROR: relation "node_status" does not exist at character 13`
-- `ERROR: relation "replication_queue_lock" does not exist at character 40`
-- This error:
-
-  ```json
-  {"level":"error","msg":"Error updating node: pq: relation \"node_status\" does not exist","pid":210882,"praefectName":"gitlab1x4m:0.0.0.0:2305","time":"2021-04-01T19:26:19.473Z","virtual_storage":"praefect-cluster-1"}
-  ```
-
-To solve this, the database schema migration can be done using `sql-migrate` sub-command of
-the `praefect` command:
-
-```shell
-$ sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml sql-migrate
-praefect sql-migrate: OK (applied 21 migrations)
-```
+- Raise a support ticket.
+- [Comment on the epic](https://gitlab.com/groups/gitlab-org/-/epics/4916).
diff --git a/doc/administration/gitaly/praefect.md b/doc/administration/gitaly/praefect.md
index 21e5360e27b..e483bcc944a 100644
--- a/doc/administration/gitaly/praefect.md
+++ b/doc/administration/gitaly/praefect.md
@@ -43,8 +43,8 @@ default value. The default value depends on the GitLab version.
 
 ## Setup Instructions
 
-If you [installed](https://about.gitlab.com/install/) GitLab using the Omnibus
-package (highly recommended), follow the steps below:
+If you [installed](https://about.gitlab.com/install/) GitLab using the Omnibus GitLab package
+(highly recommended), follow the steps below:
 
 1. [Preparation](#preparation)
 1. [Configuring the Praefect database](#postgresql)
@@ -59,25 +59,27 @@ package (highly recommended), follow the steps below:
 
 Before beginning, you should already have a working GitLab instance.
 [Learn how to install GitLab](https://about.gitlab.com/install/).
 
-Provision a PostgreSQL server (PostgreSQL 11 or newer).
+Provision a PostgreSQL server. We recommend using the PostgreSQL that is shipped
+with Omnibus GitLab and using it to configure the PostgreSQL database. You can use an
+external PostgreSQL server (version 11 or newer) but you must set it up [manually](#manual-database-setup).
 
-Prepare all your new nodes by [installing
-GitLab](https://about.gitlab.com/install/).
+Prepare all your new nodes by [installing GitLab](https://about.gitlab.com/install/).
 
 You need:
 
+- 1 PostgreSQL node
+- 1 PgBouncer node (optional)
 - At least 1 Praefect node (minimal storage required)
 - 3 Gitaly nodes (high CPU, high memory, fast storage)
 - 1 GitLab server
 
-You need the IP/host address for each node.
+You also need the IP/host address for each node:
 
-1. `LOAD_BALANCER_SERVER_ADDRESS`: the IP/host address of the load balancer
-1. `POSTGRESQL_SERVER_ADDRESS`: the IP/host address of the PostgreSQL server
+1. `PRAEFECT_LOADBALANCER_HOST`: the IP/host address of the Praefect load balancer
+1. `POSTGRESQL_HOST`: the IP/host address of the PostgreSQL server
+1. `PGBOUNCER_HOST`: the IP/host address of the PgBouncer server
 1. `PRAEFECT_HOST`: the IP/host address of the Praefect server
 1. `GITALY_HOST_*`: the IP or host address of each Gitaly server
 1. `GITLAB_HOST`: the IP/host address of the GitLab server
 
-If you are using a cloud provider, you can look up the addresses for each server through your cloud provider's management console.
-
 If you are using Google Cloud Platform, SoftLayer, or any other vendor that provides a virtual private cloud (VPC) you can use the private addresses for each cloud instance (corresponds to "internal address" for Google Cloud Platform) for `PRAEFECT_HOST`, `GITALY_HOST_*`, and `GITLAB_HOST`.
 
 #### Secrets
 
@@ -98,6 +100,14 @@ with secure tokens as you complete the setup process.
 
    Praefect cluster directly; that could lead to data loss.
 1. `PRAEFECT_SQL_PASSWORD`: this password is used by Praefect to connect to
    PostgreSQL.
+1. `PRAEFECT_SQL_PASSWORD_HASH`: the hash of the password of the Praefect user.
+   Use `gitlab-ctl pg-password-md5 praefect` to generate the hash. The command
+   asks for the password for the `praefect` user. Enter the `PRAEFECT_SQL_PASSWORD`
+   plaintext password. By default, Praefect uses the `praefect` user, but you can
+   change it.
+1. `PGBOUNCER_SQL_PASSWORD_HASH`: the hash of the password of the PgBouncer user.
+   PgBouncer uses this password to connect to PostgreSQL. For more details,
+   see the [bundled PgBouncer](../postgresql/pgbouncer.md) documentation.
 
 We note in the instructions below where these secrets are required.
 
@@ -108,127 +118,210 @@ Omnibus GitLab installations can use `gitlab-secrets.json` for `GITLAB_SHELL_SEC
 
 NOTE:
 Do not store the GitLab application database and the Praefect
-database on the same PostgreSQL server if using
-[Geo](../geo/index.md). The replication state is internal to each instance
-of GitLab and should not be replicated.
+database on the same PostgreSQL server if using [Geo](../geo/index.md).
+The replication state is internal to each instance of GitLab and should
+not be replicated.
 
 These instructions help set up a single PostgreSQL database, which creates a single point of
-failure. The following options are available:
+failure. Alternatively, [you can use PostgreSQL replication and failover](../postgresql/replication_and_failover.md).
+
+The following options are available:
 
 - For non-Geo installations, either:
   - Use one of the documented [PostgreSQL setups](../postgresql/index.md).
-  - Use your own third-party database setup, if fault tolerance is required.
+  - Use your own third-party database setup. This requires [manual setup](#manual-database-setup).
 - For Geo instances, either:
   - Set up a separate [PostgreSQL instance](https://www.postgresql.org/docs/11/high-availability.html).
   - Use a cloud-managed PostgreSQL service. AWS
     [Relational Database Service](https://aws.amazon.com/rds/) is recommended.
-To complete this section you need:
+#### Manual database setup

-- 1 Praefect node
-- 1 PostgreSQL server (PostgreSQL 11 or newer)
-  - An SQL user with permissions to create databases
+To complete this section you need:

-During this section, we configure the PostgreSQL server, from the Praefect
-node, using `psql` which is installed by Omnibus GitLab.
+- One Praefect node
+- One PostgreSQL node (version 11 or newer)
+  - A PostgreSQL user with permissions to manage the database server

-1. SSH into the **Praefect** node and login as root:
+In this section, we configure the PostgreSQL database. This can be used for both external
+and Omnibus-provided PostgreSQL servers.

-   ```shell
-   sudo -i
-   ```
+To run the following instructions, you can use the Praefect node, where `psql` is installed
+by Omnibus GitLab (`/opt/gitlab/embedded/bin/psql`). If you are using the Omnibus-provided
+PostgreSQL, you can use `gitlab-psql` on the PostgreSQL node instead:

-1. Connect to the PostgreSQL server with administrative access. This is likely
-   the `postgres` user. The database `template1` is used because it is created
-   by default on all PostgreSQL servers.
+1. Create a new user `praefect` to be used by Praefect:

-   ```shell
-   /opt/gitlab/embedded/bin/psql -U postgres -d template1 -h POSTGRESQL_SERVER_ADDRESS
+   ```sql
+   CREATE ROLE praefect WITH LOGIN PASSWORD 'PRAEFECT_SQL_PASSWORD';
   ```

-   Create a new user `praefect` to be used by Praefect. Replace
-   `PRAEFECT_SQL_PASSWORD` with the strong password you generated in the
-   preparation step.
+   Replace `PRAEFECT_SQL_PASSWORD` with the strong password you generated in the preparation step.
+
+1. Create a new database `praefect_production` that is owned by the `praefect` user:

   ```sql
-   CREATE ROLE praefect WITH LOGIN CREATEDB PASSWORD 'PRAEFECT_SQL_PASSWORD';
+   CREATE DATABASE praefect_production WITH OWNER praefect ENCODING UTF8;
   ```

-1.
Reconnect to the PostgreSQL server, this time as the `praefect` user:
+To use the Omnibus-provided PgBouncer, you need to take the following additional steps. We strongly
+recommend using the PostgreSQL that is shipped with Omnibus as the backend. The following
+instructions only work on Omnibus-provided PostgreSQL:

-   ```shell
-   /opt/gitlab/embedded/bin/psql -U praefect -d template1 -h POSTGRESQL_SERVER_ADDRESS
+1. For the Omnibus-provided PgBouncer, you need to use the hash of the `praefect` user's password
+   instead of the actual password:
+
+   ```sql
+   ALTER ROLE praefect WITH PASSWORD 'md5<PRAEFECT_SQL_PASSWORD_HASH>';
   ```

-   Create a new database `praefect_production`. By creating the database while
-   connected as the `praefect` user, we are confident they have access.
+   Replace `<PRAEFECT_SQL_PASSWORD_HASH>` with the hash of the password you generated in the
+   preparation step. Note that it is prefixed with the literal `md5`.
+
+1. The PgBouncer that is shipped with Omnibus is configured to use [`auth_query`](https://www.pgbouncer.org/config.html#generic-settings)
+   and uses the `pg_shadow_lookup` function. You need to create this function in the `praefect_production`
+   database:

   ```sql
-   CREATE DATABASE praefect_production WITH ENCODING=UTF8;
+   CREATE OR REPLACE FUNCTION public.pg_shadow_lookup(in i_username text, out username text, out password text) RETURNS record AS $$
+   BEGIN
+       SELECT usename, passwd FROM pg_catalog.pg_shadow
+       WHERE usename = i_username INTO username, password;
+       RETURN;
+   END;
+   $$ LANGUAGE plpgsql SECURITY DEFINER;
+
+   REVOKE ALL ON FUNCTION public.pg_shadow_lookup(text) FROM public, pgbouncer;
+   GRANT EXECUTE ON FUNCTION public.pg_shadow_lookup(text) TO pgbouncer;
   ```

The database used by Praefect is now configured.

If you see Praefect database errors after configuring PostgreSQL, see
-[troubleshooting steps](index.md#relation-does-not-exist-errors).
+[troubleshooting steps](troubleshooting.md#relation-does-not-exist-errors).
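If you created the `pg_shadow_lookup` function for PgBouncer, you can sanity-check it before continuing. This is a hedged example; run it in the `praefect_production` database as a user allowed to execute the function (the exact output depends on your setup):

```sql
-- Should return one row containing the praefect role name and its
-- md5-prefixed password hash if the function was created correctly.
SELECT * FROM public.pg_shadow_lookup('praefect');
```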
-#### PgBouncer +#### Use PgBouncer To reduce PostgreSQL resource consumption, we recommend setting up and configuring [PgBouncer](https://www.pgbouncer.org/) in front of the PostgreSQL instance. To do -this, set the corresponding IP or host address of the PgBouncer instance in -`/etc/gitlab/gitlab.rb` by changing the following settings: +this, you must point Praefect to PgBouncer by setting Praefect database parameters: -- `praefect['database_host']`, for the address. -- `praefect['database_port']`, for the port. +```ruby +praefect['database_host'] = PGBOUNCER_HOST +praefect['database_port'] = 6432 +praefect['database_user'] = 'praefect' +praefect['database_password'] = PRAEFECT_SQL_PASSWORD +praefect['database_dbname'] = 'praefect_production' +#praefect['database_sslmode'] = '...' +#praefect['database_sslcert'] = '...' +#praefect['database_sslkey'] = '...' +#praefect['database_sslrootcert'] = '...' +``` -Because PgBouncer manages resources more efficiently, Praefect still requires a -direct connection to the PostgreSQL database. It uses the -[LISTEN](https://www.postgresql.org/docs/11/sql-listen.html) -feature that is [not supported](https://www.pgbouncer.org/features.html) by -PgBouncer with `pool_mode = transaction`. -Set `praefect['database_host_no_proxy']` and `praefect['database_port_no_proxy']` -to a direct connection, and not a PgBouncer connection. +Praefect requires an additional connection to the PostgreSQL that supports the +[LISTEN](https://www.postgresql.org/docs/11/sql-listen.html) feature. With PgBouncer +this feature is only available with `session` pool mode (`pool_mode = session`). +It is not supported in `transaction` pool mode (`pool_mode = transaction`). -Save the changes to `/etc/gitlab/gitlab.rb` and -[reconfigure Praefect](../restart_gitlab.md#omnibus-gitlab-reconfigure). 
+For the additional connection, you must either:

-This documentation doesn't provide PgBouncer installation instructions,
-but you can:
+- Connect Praefect directly to PostgreSQL and bypass PgBouncer.
+- Configure a new PgBouncer database that uses the same PostgreSQL database endpoint,
+  but with a different pool mode, that is, `pool_mode = session`.

-- Find instructions on the [official website](https://www.pgbouncer.org/install.html).
-- Use a [Docker image](https://hub.docker.com/r/edoburu/pgbouncer/).
+Praefect can be configured to use different connection parameters for direct access
+to PostgreSQL. This is the connection that supports the `LISTEN` feature.

-In addition to the base PgBouncer configuration options, set the following values in
-your `pgbouncer.ini` file:
+Here is an example of configuring Praefect to bypass PgBouncer and connect directly to PostgreSQL:

-- The [Praefect PostgreSQL database](#postgresql) in the `[databases]` section:
+```ruby
+praefect['database_direct_host'] = POSTGRESQL_HOST
+praefect['database_direct_port'] = 5432
+
+# Use the following to override parameters of the direct database connection.
+# Comment out where the parameters are the same for both connections.
+
+praefect['database_direct_user'] = 'praefect'
+praefect['database_direct_password'] = PRAEFECT_SQL_PASSWORD
+praefect['database_direct_dbname'] = 'praefect_production'
+#praefect['database_direct_sslmode'] = '...'
+#praefect['database_direct_sslcert'] = '...'
+#praefect['database_direct_sslkey'] = '...'
+#praefect['database_direct_sslrootcert'] = '...'
+```

-  ```ini
-  [databases]
-  * = host=POSTGRESQL_SERVER_ADDRESS port=5432 auth_user=praefect
-  ```
+We recommend using PgBouncer with `session` pool mode instead. You can use the [bundled
+PgBouncer](../postgresql/pgbouncer.md) or use an external PgBouncer and [configure it
+manually](https://www.pgbouncer.org/config.html).
-- [`pool_mode`](https://www.pgbouncer.org/config.html#pool_mode)
-  and [`ignore_startup_parameters`](https://www.pgbouncer.org/config.html#ignore_startup_parameters)
-  in the `[pgbouncer]` section:
+The following example uses the bundled PgBouncer and sets up two separate connection pools,
+one in `session` pool mode and the other in `transaction` pool mode. For this example to work,
+you need to prepare the PostgreSQL server with the [setup instructions](#manual-database-setup):

-  ```ini
-  [pgbouncer]
-  pool_mode = transaction
-  ignore_startup_parameters = extra_float_digits
-  ```
+```ruby
+pgbouncer['databases'] = {
+  # Other database configuration including gitlabhq_production
+  ...
+
+  praefect_production: {
+    host: POSTGRESQL_HOST,
+    # Use the `pgbouncer` user to connect to the database backend.
+    user: 'pgbouncer',
+    password: PGBOUNCER_SQL_PASSWORD_HASH,
+    pool_mode: 'transaction'
+  },
+  praefect_production_direct: {
+    host: POSTGRESQL_HOST,
+    # Use the `pgbouncer` user to connect to the database backend.
+    user: 'pgbouncer',
+    password: PGBOUNCER_SQL_PASSWORD_HASH,
+    dbname: 'praefect_production',
+    pool_mode: 'session'
+  },
+
+  ...
+}
+```
+
+Both `praefect_production` and `praefect_production_direct` use the same database endpoint
+(`praefect_production`), but with different pool modes. This translates to the following
+`databases` section of PgBouncer:

-The `praefect` user and its password should be included in the file (default is
-`userlist.txt`) used by PgBouncer if the [`auth_file`](https://www.pgbouncer.org/config.html#auth_file)
-configuration option is set.
+```ini
+[databases]
+praefect_production = host=POSTGRESQL_HOST auth_user=pgbouncer pool_mode=transaction
+praefect_production_direct = host=POSTGRESQL_HOST auth_user=pgbouncer dbname=praefect_production pool_mode=session
+```
+
+Now you can configure Praefect to use PgBouncer for both connections:
+
+```ruby
+praefect['database_host'] = PGBOUNCER_HOST
+praefect['database_port'] = 6432
+praefect['database_user'] = 'praefect'
+# `PRAEFECT_SQL_PASSWORD` is the plain-text password of the
+# Praefect user. Not to be confused with `PRAEFECT_SQL_PASSWORD_HASH`.
+praefect['database_password'] = PRAEFECT_SQL_PASSWORD
+
+praefect['database_dbname'] = 'praefect_production'
+praefect['database_direct_dbname'] = 'praefect_production_direct'
+
+# There is no need to repeat the following. Parameters of the direct
+# database connection fall back to the values above.
+
+#praefect['database_direct_host'] = PGBOUNCER_HOST
+#praefect['database_direct_port'] = 6432
+#praefect['database_direct_user'] = 'praefect'
+#praefect['database_direct_password'] = PRAEFECT_SQL_PASSWORD
+```
+
+With this configuration, Praefect uses PgBouncer for both connection types.

NOTE:
-By default PgBouncer uses port `6432` to accept incoming
-connections. You can change it by setting the [`listen_port`](https://www.pgbouncer.org/config.html#listen_port)
-configuration option. We recommend setting it to the default port value (`5432`) used by
-PostgreSQL instances. Otherwise you should change the configuration parameter
-`praefect['database_port']` for each Praefect instance to the correct value.
+Omnibus GitLab handles the authentication requirements (using `auth_query`), but if you are preparing
+your databases manually and configuring an external PgBouncer, you must include the `praefect` user and
+its password in the file used by PgBouncer (for example, `userlist.txt`) if the [`auth_file`](https://www.pgbouncer.org/config.html#auth_file)
+configuration option is set.
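For an external PgBouncer using `auth_file`, the entry could look like the following sketch. The file name and location are assumptions that depend on your PgBouncer configuration:

```ini
; Hypothetical userlist.txt entry for the praefect user. Replace
; <PRAEFECT_SQL_PASSWORD_HASH> with the hash generated by
; `gitlab-ctl pg-password-md5 praefect`; note the literal md5 prefix.
"praefect" "md5<PRAEFECT_SQL_PASSWORD_HASH>"
```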
For more details, consult the PgBouncer documentation. ### Praefect @@ -241,17 +334,10 @@ If there are multiple Praefect nodes: To complete this section you need a [configured PostgreSQL server](#postgresql), including: -- IP/host address (`POSTGRESQL_SERVER_ADDRESS`) -- Password (`PRAEFECT_SQL_PASSWORD`) - Praefect should be run on a dedicated node. Do not run Praefect on the application server, or a Gitaly node. -1. SSH into the **Praefect** node and login as root: - - ```shell - sudo -i - ``` +On the **Praefect** node: 1. Disable all other services by editing `/etc/gitlab/gitlab.rb`: @@ -295,22 +381,8 @@ application server, or a Gitaly node. praefect['auth_token'] = 'PRAEFECT_EXTERNAL_TOKEN' ``` -1. Configure **Praefect** to connect to the PostgreSQL database by editing - `/etc/gitlab/gitlab.rb`. - - You need to replace `POSTGRESQL_SERVER_ADDRESS` with the IP/host address - of the database, and `PRAEFECT_SQL_PASSWORD` with the strong password set - above. - - ```ruby - praefect['database_host'] = 'POSTGRESQL_SERVER_ADDRESS' - praefect['database_port'] = 5432 - praefect['database_user'] = 'praefect' - praefect['database_password'] = 'PRAEFECT_SQL_PASSWORD' - praefect['database_dbname'] = 'praefect_production' - praefect['database_host_no_proxy'] = 'POSTGRESQL_SERVER_ADDRESS' - praefect['database_port_no_proxy'] = 5432 - ``` +1. Configure **Praefect** to [connect to the PostgreSQL database](#postgresql). We + highly recommend using [PgBouncer](#use-pgbouncer) as well. 
If you want to use a TLS client certificate, the options below can be used: @@ -507,7 +579,7 @@ To configure Praefect with TLS: ```ruby git_data_dirs({ "default" => { - "gitaly_address" => 'tls://LOAD_BALANCER_SERVER_ADDRESS:2305', + "gitaly_address" => 'tls://PRAEFECT_LOADBALANCER_HOST:2305', "gitaly_token" => 'PRAEFECT_EXTERNAL_TOKEN' } }) @@ -544,7 +616,7 @@ To configure Praefect with TLS: repositories: storages: default: - gitaly_address: tls://LOAD_BALANCER_SERVER_ADDRESS:3305 + gitaly_address: tls://PRAEFECT_LOADBALANCER_HOST:3305 path: /some/local/path ``` @@ -817,7 +889,7 @@ Particular attention should be shown to: You need to replace: - - `LOAD_BALANCER_SERVER_ADDRESS` with the IP address or hostname of the load + - `PRAEFECT_LOADBALANCER_HOST` with the IP address or hostname of the load balancer. - `PRAEFECT_EXTERNAL_TOKEN` with the real secret @@ -826,7 +898,7 @@ Particular attention should be shown to: ```ruby git_data_dirs({ "default" => { - "gitaly_address" => "tcp://LOAD_BALANCER_SERVER_ADDRESS:2305", + "gitaly_address" => "tcp://PRAEFECT_LOADBALANCER_HOST:2305", "gitaly_token" => 'PRAEFECT_EXTERNAL_TOKEN' } }) @@ -926,7 +998,7 @@ For example: git_data_dirs({ 'default' => { 'gitaly_address' => 'tcp://old-gitaly.internal:8075' }, 'cluster' => { - 'gitaly_address' => 'tcp://<load_balancer_server_address>:2305', + 'gitaly_address' => 'tcp://<PRAEFECT_LOADBALANCER_HOST>:2305', 'gitaly_token' => '<praefect_external_token>' } }) @@ -981,6 +1053,26 @@ To get started quickly: Congratulations! You've configured an observable fault-tolerant Praefect cluster. +## Network connectivity requirements + +Gitaly Cluster components need to communicate with each other over many routes. 
+Your firewall rules must allow the following for Gitaly Cluster to function properly: + +| From | To | Default port / TLS port | +|:-----------------------|:------------------------|:------------------------| +| GitLab | Praefect load balancer | `2305` / `3305` | +| Praefect load balancer | Praefect | `2305` / `3305` | +| Praefect | Gitaly | `8075` / `9999` | +| Gitaly | GitLab (internal API) | `80` / `443` | +| Gitaly | Praefect load balancer | `2305` / `3305` | +| Gitaly | Praefect | `2305` / `3305` | +| Gitaly | Gitaly | `8075` / `9999` | + +NOTE: +Gitaly does not directly connect to Praefect. However, requests from Gitaly to the Praefect +load balancer may still be blocked unless firewalls on the Praefect nodes allow traffic from +the Gitaly nodes. + ## Distributed reads > - Introduced in GitLab 13.1 in [beta](https://about.gitlab.com/handbook/product/gitlab-the-product/#alpha-beta-ga) with feature flag `gitaly_distributed_reads` set to disabled. @@ -1147,24 +1239,30 @@ The `per_repository` election strategy solves this problem by electing a primary repository. Combined with [configurable replication factors](#configure-replication-factor), you can horizontally scale storage capacity and distribute write load across Gitaly nodes. -Primary elections are run when: +Primary elections are run: -- Praefect starts up. -- The cluster's consensus of a Gitaly node's health changes. +- In GitLab 14.1 and later, lazily. This means that Praefect doesn't immediately elect + a new primary node if the current one is unhealthy. A new primary is elected if it is + necessary to serve a request while the current primary is unavailable. +- In GitLab 13.12 to GitLab 14.0 when: + - Praefect starts up. + - The cluster's consensus of a Gitaly node's health changes. -A Gitaly node is considered: +A valid primary node candidate is a Gitaly node that: -- Healthy if `>=50%` Praefect nodes have successfully health checked the Gitaly node in the - previous ten seconds. 
-- Unhealthy otherwise. +- Is healthy. A Gitaly node is considered healthy if `>=50%` Praefect nodes have + successfully health checked the Gitaly node in the previous ten seconds. +- Has a fully up to date copy of the repository. -During an election run, Praefect elects a new primary Gitaly node for each repository that has -an unhealthy primary Gitaly node. The election is made: +If there are multiple primary node candidates, Praefect: -- Randomly from healthy secondary Gitaly nodes that are the most up to date. -- Only from Gitaly nodes assigned to the host repository. +- Picks one of them randomly. +- Prioritizes promoting a Gitaly node that is assigned to host the repository. If + there are no assigned Gitaly nodes to elect as the primary, Praefect may temporarily + elect an unassigned one. The unassigned primary is demoted in favor of an assigned + one when one becomes available. -If there are no healthy secondary nodes for a repository: +If there are no valid primary candidates for a repository: - The unhealthy primary node is demoted and the repository is left without a primary node. - Operations that require a primary node fail until a primary is successfully elected. @@ -1212,7 +1310,7 @@ To migrate existing clusters: - If downtime is unacceptable: - 1. Determine which Gitaly node is [the current primary](index.md#determine-primary-gitaly-node). + 1. Determine which Gitaly node is [the current primary](troubleshooting.md#determine-primary-gitaly-node). 1. Comment out the secondary Gitaly nodes from the virtual storage's configuration in `/etc/gitlab/gitlab.rb` on all Praefect nodes. This ensures there's only one Gitaly node configured, causing both of the election @@ -1259,23 +1357,37 @@ Migrate to [repository-specific primary nodes](#repository-specific-primary-node Gitaly Cluster recovers from a failing primary Gitaly node by promoting a healthy secondary as the new primary. 
-To minimize data loss, Gitaly Cluster:
+In GitLab 14.1 and later, Gitaly Cluster:
+
+- Elects a healthy secondary with a fully up to date copy of the repository as the new primary.
+- Makes the repository unavailable if there are no fully up to date copies of it on healthy secondaries.
+
+To minimize data loss in GitLab 13.0 to 14.0, Gitaly Cluster:

- Switches repositories that are outdated on the new primary to [read-only mode](#read-only-mode).
-- Elects the secondary with the least unreplicated writes from the primary to be the new primary.
-  Because there can still be some unreplicated writes, [data loss can occur](#check-for-data-loss).
+- Elects the secondary with the least unreplicated writes from the primary to be the new
+  primary. Because there can still be some unreplicated writes,
+  [data loss can occur](#check-for-data-loss).

### Read-only mode

> - Introduced in GitLab 13.0 as [generally available](https://about.gitlab.com/handbook/product/gitlab-the-product/#generally-available-ga).
> - Between GitLab 13.0 and GitLab 13.2, read-only mode applied to the whole virtual storage and occurred whenever failover occurred.
> - [In GitLab 13.3 and later](https://gitlab.com/gitlab-org/gitaly/-/issues/2862), read-only mode applies on a per-repository basis and only occurs if a new primary is out of date.
+If the failed primary contained unreplicated writes, [data loss can occur](#check-for-data-loss).
+> - Removed in GitLab 14.1. Instead, repositories [become unavailable](#unavailable-repositories).
+
+In GitLab 13.0 to 14.0, when Gitaly Cluster switches to a new primary, repositories enter
+read-only mode if they are out of date. This can happen after failing over to an outdated
+secondary. Read-only mode eases data recovery efforts by preventing writes that may conflict
+with the unreplicated writes on other nodes.

-When Gitaly Cluster switches to a new primary, repositories enter read-only mode if they are out of
-date.
This can happen after failing over to an outdated secondary. Read-only mode eases data
-recovery efforts by preventing writes that may conflict with the unreplicated writes on other nodes.

-To enable writes again, an administrator can:
+To enable writes again in GitLab 13.0 to 14.0, an administrator can:

1. [Check](#check-for-data-loss) for data loss.
1. Attempt to [recover](#data-recovery) missing data.
@@ -1283,21 +1395,38 @@ To enable writes again, an administrator can:
   [accept data loss](#enable-writes-or-accept-data-loss) if necessary, depending on the
   version of GitLab.

+## Unavailable repositories
+
+> - From GitLab 13.0 through 14.0, repositories became read-only if they were outdated on the primary but fully up to date on a healthy secondary. The `dataloss` sub-command displays read-only repositories by default through these versions.
+> - Since GitLab 14.1, Praefect contains more responsive failover logic that immediately fails over to one of the fully up to date secondaries rather than placing the repository in read-only mode. The `dataloss` sub-command displays repositories that are unavailable due to having no fully up to date copies on healthy Gitaly nodes.
+
+A repository is unavailable if all of its up to date replicas are unavailable. Unavailable repositories are
+not accessible through Praefect to prevent serving stale data that may break automated tooling.
The following parameters are
-available:
+The Praefect `dataloss` subcommand identifies:
+
+- Copies of repositories in GitLab 13.0 to GitLab 14.0 that are likely to be outdated.
+  This can help identify potential data loss after a failover.
+- Repositories in GitLab 14.1 and later that are unavailable. This helps identify potential
+  data loss and repositories that are no longer accessible because all of their up-to-date
+  copies are unavailable.
+
+The following parameters are available:

-- `-virtual-storage` that specifies which virtual storage to check. The default behavior is to
-  display outdated replicas of read-only repositories as they might require administrator action.
-- In GitLab 13.3 and later, `-partially-replicated` that specifies whether to display a list of
-  [outdated replicas of writable repositories](#outdated-replicas-of-writable-repositories).
+- `-virtual-storage` that specifies which virtual storage to check. Because they might require
+  an administrator to intervene, the default behavior is to display:
+  - In GitLab 13.0 to 14.0, copies of read-only repositories.
+  - In GitLab 14.1 and later, unavailable repositories.
+- In GitLab 14.1 and later, [`-partially-unavailable`](#unavailable-replicas-of-available-repositories)
+  that specifies whether to include in the output repositories that are available but have
+  some assigned copies that are not available.

NOTE:
`dataloss` is still in beta and the output format is subject to change.
-To check for repositories with outdated primaries, run: +To check for repositories with outdated primaries or for unavailable repositories, run: ```shell sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml dataloss [-virtual-storage <virtual-storage>] @@ -1309,13 +1438,20 @@ Every configured virtual storage is checked if none is specified: sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml dataloss ``` -Repositories which have assigned storage nodes that contain an outdated copy of the repository are listed -in the output. This information is printed for each repository: +Repositories are listed in the output that have either: + +- An outdated copy of the repository on the primary, in GitLab 13.0 to GitLab 14.0. +- No healthy and fully up-to-date copies available, in GitLab 14.1 and later. + +The following information is printed for each repository: - A repository's relative path to the storage directory identifies each repository and groups the related information. -- The repository's current status is printed in parentheses next to the disk path. If the repository's primary - is outdated, the repository is in `read-only` mode and can't accept writes. Otherwise, the mode is `writable`. +- The repository's current status is printed in parentheses next to the disk path: + - In GitLab 13.0 to 14.0, either `(read-only)` if the repository's primary node is outdated + and can't accept writes. Otherwise, `(writable)`. + - In GitLab 14.1 and later, `(unavailable)` is printed next to the disk path if the + repository is unavailable. - The primary field lists the repository's current primary. If the repository has no primary, the field shows `No Primary`. - The In-Sync Storages lists replicas which have replicated the latest successful write and all writes @@ -1325,44 +1461,51 @@ in the output. This information is printed for each repository: is listed next to replica. 
It's important to notice that the outdated replicas may be fully up to date or contain later changes but Praefect can't guarantee it. -Whether a replica is assigned to host the repository is listed with each replica's status. `assigned host` is printed -next to replicas which are assigned to store the repository. The text is omitted if the replica contains a copy of -the repository but is not assigned to store the repository. Such replicas aren't kept in-sync by Praefect, but may -act as replication sources to bring assigned replicas up to date. +Additional information includes: + +- Whether a node is assigned to host the repository is listed with each node's status. + `assigned host` is printed next to nodes that are assigned to store the repository. The + text is omitted if the node contains a copy of the repository but is not assigned to store + the repository. Such copies aren't kept in sync by Praefect, but may act as replication + sources to bring assigned copies up to date. +- In GitLab 14.1 and later, `unhealthy` is printed next to the copies that are located + on unhealthy Gitaly nodes. Example output: ```shell Virtual storage: default Outdated repositories: - @hashed/3f/db/3fdba35f04dc8c462986c992bcf875546257113072a909c162f7e470e581e278.git (read-only): + @hashed/3f/db/3fdba35f04dc8c462986c992bcf875546257113072a909c162f7e470e581e278.git (unavailable): Primary: gitaly-1 In-Sync Storages: - gitaly-2, assigned host + gitaly-2, assigned host, unhealthy Outdated Storages: gitaly-1 is behind by 3 changes or less, assigned host gitaly-3 is behind by 3 changes or less ``` -A confirmation is printed out when every repository is writable. For example: +A confirmation is printed out when every repository is available. For example: ```shell Virtual storage: default - All repositories are writable! + All repositories are available! 
```

-#### Outdated replicas of writable repositories
+#### Unavailable replicas of available repositories

-> [Introduced](https://gitlab.com/gitlab-org/gitaly/-/issues/3019) in GitLab 13.3.
+NOTE:
+In GitLab 14.0 and earlier, the flag is `-partially-replicated` and the output shows any repositories with assigned nodes with outdated
+copies.

-To also list information of repositories whose primary is up to date but one or more assigned
-replicas are outdated, use the `-partially-replicated` flag.
+To also list information about repositories that are available but unavailable on some of their assigned nodes,
+use the `-partially-unavailable` flag.

-A repository is writable if the primary has the latest changes. Secondaries might be temporarily
-outdated while they are waiting to replicate the latest changes.
+A repository is available if a healthy, up to date replica is available. Some of the assigned secondary
+replicas may be temporarily unavailable while they are waiting to replicate the latest changes.

```shell
-sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml dataloss [-virtual-storage <virtual-storage>] [-partially-replicated]
+sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml dataloss [-virtual-storage <virtual-storage>] [-partially-unavailable]
```

Example output:

@@ -1370,7 +1513,7 @@ Example output:
```shell
Virtual storage: default
  Outdated repositories:
-    @hashed/3f/db/3fdba35f04dc8c462986c992bcf875546257113072a909c162f7e470e581e278.git (writable):
+    @hashed/3f/db/3fdba35f04dc8c462986c992bcf875546257113072a909c162f7e470e581e278.git:
      Primary: gitaly-1
      In-Sync Storages:
        gitaly-1, assigned host
@@ -1379,14 +1522,14 @@ Virtual storage: default
        gitaly-3 is behind by 3 changes or less
```

-With the `-partially-replicated` flag set, a confirmation is printed out if every assigned replica is fully up to
-date.
+With the `-partially-unavailable` flag set, a confirmation is printed out if every assigned replica is fully up to +date and healthy. For example: ```shell Virtual storage: default - All repositories are up to date! + All repositories are fully available on all assigned storages! ``` ### Check repository checksums @@ -1394,30 +1537,50 @@ Virtual storage: default To check a project's repository checksums across on all Gitaly nodes, run the [replicas Rake task](../raketasks/praefect.md#replica-checksums) on the main GitLab node. +### Accept data loss + +WARNING: +`accept-dataloss` causes permanent data loss by overwriting other versions of the repository. Data +[recovery efforts](#data-recovery) must be performed before using it. + +If it is not possible to bring one of the up to date replicas back online, you may have to accept data +loss. When accepting data loss, Praefect marks the chosen replica of the repository as the latest version +and replicates it to the other assigned Gitaly nodes. This process overwrites any other version of the +repository so care must be taken. + +```shell +sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml accept-dataloss +-virtual-storage <virtual-storage> -repository <relative-path> -authoritative-storage <storage-name> +``` + ### Enable writes or accept data loss -Praefect provides the following sub-commands to re-enable writes: +WARNING: +`accept-dataloss` causes permanent data loss by overwriting other versions of the repository. +Data [recovery efforts](#data-recovery) must be performed before using it. -- In GitLab 13.2 and earlier, `enable-writes` to re-enable virtual storage for writes after data - recovery attempts. 
+Praefect provides the following subcommands to re-enable writes or accept data loss:
 
-  ```shell
-  sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml enable-writes -virtual-storage <virtual-storage>
-  ```
+- In GitLab 13.2 and earlier, `enable-writes` to re-enable virtual storage for writes after
+  data recovery attempts:
 
-- [In GitLab 13.3](https://gitlab.com/gitlab-org/gitaly/-/merge_requests/2415) and later,
-  `accept-dataloss` to accept data loss and re-enable writes for repositories after data recovery
-  attempts have failed. Accepting data loss causes current version of the repository on the
-  authoritative storage to be considered latest. Other storages are brought up to date with the
-  authoritative storage by scheduling replication jobs.
+  ```shell
+  sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml enable-writes -virtual-storage <virtual-storage>
+  ```
+
+- In GitLab 13.3 and later, if it is not possible to bring one of the up-to-date nodes back
+  online, you may have to accept data loss:
 
   ```shell
   sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml accept-dataloss -virtual-storage <virtual-storage> -repository <relative-path> -authoritative-storage <storage-name>
   ```
 
-WARNING:
-`accept-dataloss` causes permanent data loss by overwriting other versions of the repository. Data
-[recovery efforts](#data-recovery) must be performed before using it.
+  When accepting data loss, Praefect:
+
+  1. Marks the chosen copy of the repository as the latest version.
+  1. Replicates the copy to the other assigned Gitaly nodes.
+
+  This process overwrites any other copy of the repository, so care must be taken.
@@ -1463,10 +1626,7 @@ praefect['reconciliation_scheduling_interval'] = '0' # disable the feature
 ### Manual reconciliation
 
 WARNING:
-The `reconcile` sub-command is deprecated and scheduled for removal in GitLab 14.0. Use
-[automatic reconciliation](#automatic-reconciliation) instead. Manual reconciliation may
-produce excess replication jobs and is limited in functionality. Manual reconciliation does
-not work when [repository-specific primary nodes](#repository-specific-primary-nodes) are
+The `reconcile` sub-command was removed in GitLab 14.1. Use [automatic reconciliation](#automatic-reconciliation) instead. Manual reconciliation may produce excess replication jobs and is limited in functionality. Manual reconciliation does not work when [repository-specific primary nodes](#repository-specific-primary-nodes) are
 enabled.
 
 The Praefect `reconcile` sub-command allows for the manual reconciliation between two Gitaly nodes. The
@@ -1509,7 +1669,7 @@ After creating and configuring Gitaly Cluster:
 1. Ensure all storages are accessible to the GitLab instance. In this example, these are
    `<original_storage_name>` and `<cluster_storage_name>`.
 1. [Configure repository storage weights](../repository_storage_paths.md#configure-where-new-repositories-are-stored)
-   so that the Gitaly Cluster receives all new projects. This stops new projects being created
+   so that the Gitaly Cluster receives all new projects. This stops new projects from being created
    on existing Gitaly nodes while the migration is in progress.
 1. Schedule repository moves for:
    - [Projects](#bulk-schedule-project-moves).
diff --git a/doc/administration/gitaly/troubleshooting.md b/doc/administration/gitaly/troubleshooting.md
new file mode 100644
index 00000000000..ab6f493cf0f
--- /dev/null
+++ b/doc/administration/gitaly/troubleshooting.md
@@ -0,0 +1,372 @@
+---
+stage: Create
+group: Gitaly
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+type: reference
+---
+
+# Troubleshooting Gitaly and Gitaly Cluster **(FREE SELF)**
+
+Refer to the information below when troubleshooting Gitaly and Gitaly Cluster.
+
+Before troubleshooting, see the Gitaly and Gitaly Cluster
+[frequently asked questions](faq.md).
+
+## Troubleshoot Gitaly
+
+The following sections provide possible solutions to Gitaly errors.
+
+See also [Gitaly timeout](../../user/admin_area/settings/gitaly_timeouts.md) settings.
+
+### Check versions when using standalone Gitaly servers
+
+When using standalone Gitaly servers, you must make sure they are the same version
+as GitLab to ensure full compatibility:
+
+1. On the top bar, select **Menu >** **{admin}** **Admin** on your GitLab instance.
+1. On the left sidebar, select **Overview > Gitaly Servers**.
+1. Confirm all Gitaly servers indicate that they are up to date.
+
+### Use `gitaly-debug`
+
+The `gitaly-debug` command provides "production debugging" tools for Gitaly and Git
+performance. It is intended to help production engineers and support
+engineers investigate Gitaly performance problems.
+
+If you're using GitLab 11.6 or newer, this tool is already installed on
+your GitLab or Gitaly server at `/opt/gitlab/embedded/bin/gitaly-debug`.
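As a quick sanity check, you can test for the bundled binary before compiling your own copy. A minimal sketch, assuming the Omnibus install path above:

```shell
# Check whether the bundled gitaly-debug binary is present.
# The /opt/gitlab path assumes an Omnibus GitLab install; source installs differ.
if [ -x /opt/gitlab/embedded/bin/gitaly-debug ]; then
  echo "gitaly-debug found"
else
  echo "gitaly-debug missing; build it from source"
fi
```

If the binary is missing, build it from source as shown next.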
+If you're investigating an older GitLab version, you can compile this
+tool offline and copy the executable to your server:
+
+```shell
+git clone https://gitlab.com/gitlab-org/gitaly.git
+cd gitaly/cmd/gitaly-debug
+GOOS=linux GOARCH=amd64 go build -o gitaly-debug
+```
+
+To see the help page of `gitaly-debug` for a list of supported sub-commands, run:
+
+```shell
+gitaly-debug -h
+```
+
+### Commits, pushes, and clones return a 401
+
+```plaintext
+remote: GitLab: 401 Unauthorized
+```
+
+You need to sync your `gitlab-secrets.json` file with your GitLab
+application nodes.
+
+### Client side gRPC logs
+
+Gitaly uses the [gRPC](https://grpc.io/) RPC framework. The Ruby gRPC
+client has its own log file which may contain useful information when
+you are seeing Gitaly errors. You can control the log level of the
+gRPC client with the `GRPC_LOG_LEVEL` environment variable. The
+default level is `WARN`.
+
+You can run a gRPC trace with:
+
+```shell
+sudo GRPC_TRACE=all GRPC_VERBOSITY=DEBUG gitlab-rake gitlab:gitaly:check
+```
+
+### Server side gRPC logs
+
+gRPC tracing can also be enabled in Gitaly itself with the `GODEBUG=http2debug`
+environment variable. To set this in an Omnibus GitLab install:
+
+1. Add the following to your `gitlab.rb` file:
+
+   ```ruby
+   gitaly['env'] = {
+     "GODEBUG=http2debug" => "2"
+   }
+   ```
+
+1. [Reconfigure](../restart_gitlab.md#omnibus-gitlab-reconfigure) GitLab.
+
+### Correlating Git processes with RPCs
+
+Sometimes you need to find out which Gitaly RPC created a particular Git process.
+
+One method for doing this is by using `DEBUG` logging. However, this needs to be enabled
+ahead of time and the logs produced are quite verbose.
+
+A lightweight method for doing this correlation is by inspecting the environment
+of the Git process (using its `PID`) and looking at the `CORRELATION_ID` variable:
+
+```shell
+PID=<Git process ID>
+sudo cat /proc/$PID/environ | tr '\0' '\n' | grep ^CORRELATION_ID=
+```
+
+This method isn't reliable for `git cat-file` processes, because Gitaly
+internally pools and re-uses those across RPCs.
+
+### Observing `gitaly-ruby` traffic
+
+[`gitaly-ruby`](configure_gitaly.md#gitaly-ruby) is an internal implementation detail of Gitaly,
+so there's not much visibility into what goes on inside
+`gitaly-ruby` processes.
+
+If you have Prometheus set up to scrape your Gitaly process, you can see
+request rates and error codes for individual RPCs in `gitaly-ruby` by
+querying `grpc_client_handled_total`.
+
+- In theory, this metric does not differentiate between `gitaly-ruby` and other RPCs.
+- In practice from GitLab 11.9, all gRPC calls made by Gitaly itself are internal calls from the
+  main Gitaly process to one of its `gitaly-ruby` sidecars.
+
+Assuming your `grpc_client_handled_total` counter only observes Gitaly,
+the following query shows you which RPCs are (most likely) internally
+implemented as calls to `gitaly-ruby`:
+
+```prometheus
+sum(rate(grpc_client_handled_total[5m])) by (grpc_method) > 0
+```
+
+### Repository changes fail with a `401 Unauthorized` error
+
+If you run Gitaly on its own server and notice these conditions:
+
+- Users can successfully clone and fetch repositories by using both SSH and HTTPS.
+- Users can't push to repositories, or receive a `401 Unauthorized` message when attempting to
+  make changes to them in the web UI.
+
+Gitaly may be failing to authenticate with the Gitaly client because it has the
+[wrong secrets file](configure_gitaly.md#configure-gitaly-servers).
+
+Confirm the following are all true:
+
+- When any user performs a `git push` to any repository on this Gitaly server, it
+  fails with a `401 Unauthorized` error:
+
+  ```shell
+  remote: GitLab: 401 Unauthorized
+  To <REMOTE_URL>
+  ! [remote rejected] branch-name -> branch-name (pre-receive hook declined)
+  error: failed to push some refs to '<REMOTE_URL>'
+  ```
+
+- When any user adds or modifies a file from the repository using the GitLab
+  UI, it immediately fails with a red `401 Unauthorized` banner.
+- Creating a new project and [initializing it with a README](../../user/project/working_with_projects.md#blank-projects)
+  successfully creates the project but doesn't create the README.
+- When [tailing the logs](https://docs.gitlab.com/omnibus/settings/logs.html#tail-logs-in-a-console-on-the-server)
+  on a Gitaly client and reproducing the error, you get `401` errors
+  when reaching the [`/api/v4/internal/allowed`](../../development/internal_api.md) endpoint:
+
+  ```shell
+  # api_json.log
+  {
+    "time": "2019-07-18T00:30:14.967Z",
+    "severity": "INFO",
+    "duration": 0.57,
+    "db": 0,
+    "view": 0.57,
+    "status": 401,
+    "method": "POST",
+    "path": "\/api\/v4\/internal\/allowed",
+    "params": [
+      {
+        "key": "action",
+        "value": "git-receive-pack"
+      },
+      {
+        "key": "changes",
+        "value": "REDACTED"
+      },
+      {
+        "key": "gl_repository",
+        "value": "REDACTED"
+      },
+      {
+        "key": "project",
+        "value": "\/path\/to\/project.git"
+      },
+      {
+        "key": "protocol",
+        "value": "web"
+      },
+      {
+        "key": "env",
+        "value": "{\"GIT_ALTERNATE_OBJECT_DIRECTORIES\":[],\"GIT_ALTERNATE_OBJECT_DIRECTORIES_RELATIVE\":[],\"GIT_OBJECT_DIRECTORY\":null,\"GIT_OBJECT_DIRECTORY_RELATIVE\":null}"
+      },
+      {
+        "key": "user_id",
+        "value": "2"
+      },
+      {
+        "key": "secret_token",
+        "value": "[FILTERED]"
+      }
+    ],
+    "host": "gitlab.example.com",
+    "ip": "REDACTED",
+    "ua": "Ruby",
+    "route": "\/api\/:version\/internal\/allowed",
+    "queue_duration": 4.24,
+    "gitaly_calls": 0,
+    "gitaly_duration": 0,
+    "correlation_id": "XPUZqTukaP3"
+  }
+
+  # nginx_access.log
+  [IP] - - [18/Jul/2019:00:30:14 +0000] "POST /api/v4/internal/allowed HTTP/1.1" 401 30 "" "Ruby"
+  ```
+
+To fix this problem, confirm that your [`gitlab-secrets.json` file](configure_gitaly.md#configure-gitaly-servers)
+on the Gitaly server matches the one on the Gitaly client. If it doesn't match,
+update the secrets file on the Gitaly server to match the Gitaly client, then
+[reconfigure](../restart_gitlab.md#omnibus-gitlab-reconfigure).
+
+### Command line tools cannot connect to Gitaly
+
+gRPC cannot reach your Gitaly server if:
+
+- You can't connect to a Gitaly server with command-line tools.
+- Certain actions result in a `14: Connect Failed` error message.
+
+Verify you can reach Gitaly by using TCP:
+
+```shell
+sudo gitlab-rake gitlab:tcp_check[GITALY_SERVER_IP,GITALY_LISTEN_PORT]
+```
+
+If the TCP connection:
+
+- Fails, check your network settings and your firewall rules.
+- Succeeds, your networking and firewall rules are correct.
+
+If you use proxy servers in your command line environment such as Bash, these can interfere with
+your gRPC traffic.
+
+If you use Bash or a compatible command line environment, run the following commands to determine
+whether you have proxy servers configured:
+
+```shell
+echo $http_proxy
+echo $https_proxy
+```
+
+If either of these variables has a value, your Gitaly CLI connections may be getting routed through
+a proxy which cannot connect to Gitaly.
+
+To remove the proxy setting, run the following commands (depending on which variables had values):
+
+```shell
+unset http_proxy
+unset https_proxy
+```
+
+### Permission denied errors appearing in Gitaly or Praefect logs when accessing repositories
+
+You might see the following in Gitaly and Praefect logs:
+
+```shell
+{
+  ...
+  "error":"rpc error: code = PermissionDenied desc = permission denied",
+  "grpc.code":"PermissionDenied",
+  "grpc.meta.client_name":"gitlab-web",
+  "grpc.request.fullMethod":"/gitaly.ServerService/ServerInfo",
+  "level":"warning",
+  "msg":"finished unary call with code PermissionDenied",
+  ...
+}
+```
+
+This is a gRPC call
+[error response code](https://grpc.github.io/grpc/core/md_doc_statuscodes.html).
+
+If this error occurs, even though
+[the Gitaly auth tokens are set up correctly](#praefect-errors-in-logs),
+it's likely that the Gitaly servers are experiencing
+[clock drift](https://en.wikipedia.org/wiki/Clock_drift).
+
+Ensure the Gitaly clients and servers are synchronized, and use an NTP time
+server to keep them synchronized.
+
+### Gitaly not listening on new address after reconfiguring
+
+When updating the `gitaly['listen_addr']` or `gitaly['prometheus_listen_addr']` values, Gitaly may
+continue to listen on the old address after a `sudo gitlab-ctl reconfigure`.
+
+When this occurs, run `sudo gitlab-ctl restart` to resolve the issue. This should no longer be
+necessary because [this issue](https://gitlab.com/gitlab-org/gitaly/-/issues/2521) is resolved.
+
+### Permission denied errors appearing in Gitaly logs when accessing repositories from a standalone Gitaly node
+
+If this error occurs even though file permissions are correct, it's likely that the Gitaly node is
+experiencing [clock drift](https://en.wikipedia.org/wiki/Clock_drift).
+
+Ensure that the GitLab and Gitaly nodes are synchronized, and use an NTP time
+server to keep them synchronized if possible.
+
+## Troubleshoot Praefect (Gitaly Cluster)
+
+The following sections provide possible solutions to Gitaly Cluster errors.
+
+### Praefect errors in logs
+
+If you receive an error, check `/var/log/gitlab/gitlab-rails/production.log`.
+
+Here are common errors and potential causes:
+
+- 500 response code
+  - **ActionView::Template::Error (7:permission denied)**
+    - `praefect['auth_token']` and `gitlab_rails['gitaly_token']` do not match on the GitLab server.
+  - **Unable to save project. Error: 7:permission denied**
+    - Secret token in `praefect['storage_nodes']` on GitLab server does not match the
+      value in `gitaly['auth_token']` on one or more Gitaly servers.
+- 503 response code
+  - **GRPC::Unavailable (14:failed to connect to all addresses)**
+    - GitLab was unable to reach Praefect.
+  - **GRPC::Unavailable (14:all SubCons are in TransientFailure...)**
+    - Praefect cannot reach one or more of its child Gitaly nodes. Try running
+      the Praefect connection checker to diagnose.
+
+### Determine primary Gitaly node
+
+To determine the current primary Gitaly node for a specific Praefect node:
+
+- Use the `Shard Primary Election` [Grafana chart](praefect.md#grafana) on the
+  [`Gitlab Omnibus - Praefect` dashboard](https://gitlab.com/gitlab-org/grafana-dashboards/-/blob/master/omnibus/praefect.json).
+  This is recommended.
+- If you do not have Grafana set up, use the following command on each host of each
+  Praefect node:
+
+  ```shell
+  curl localhost:9652/metrics | grep gitaly_praefect_primaries
+  ```
+
+### Relation does not exist errors
+
+By default, Praefect database tables are created automatically by the `gitlab-ctl reconfigure` task.
+
+However, the Praefect database tables are not created on initial reconfigure and can throw
+errors that relations do not exist if either:
+
+- The `gitlab-ctl reconfigure` command isn't executed.
+- There are errors during the execution.
+
+For example:
+
+- `ERROR: relation "node_status" does not exist at character 13`
+- `ERROR: relation "replication_queue_lock" does not exist at character 40`
+- This error:
+
+  ```json
+  {"level":"error","msg":"Error updating node: pq: relation \"node_status\" does not exist","pid":210882,"praefectName":"gitlab1x4m:0.0.0.0:2305","time":"2021-04-01T19:26:19.473Z","virtual_storage":"praefect-cluster-1"}
+  ```
+
+To solve this, run the database schema migration using the `sql-migrate` subcommand of
+the `praefect` command:
+
+```shell
+$ sudo /opt/gitlab/embedded/bin/praefect -config /var/opt/gitlab/praefect/config.toml sql-migrate
+praefect sql-migrate: OK (applied 21 migrations)
+```
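To confirm that a given Praefect log entry is reporting this missing-migration condition, the relation name can be pulled out of the error message with `grep`. A minimal sketch using the sample JSON log line from this section (on a real node you would read Praefect's own log file instead):

```shell
# Extract the "relation ... does not exist" fragment from a Praefect JSON log
# line. The log line is the sample quoted above; no jq required.
log_line='{"level":"error","msg":"Error updating node: pq: relation \"node_status\" does not exist","pid":210882}'
printf '%s\n' "$log_line" | grep -o 'relation [^ ]* does not exist'
# → relation \"node_status\" does not exist
```

If the fragment matches, running the `sql-migrate` subcommand shown above should create the missing tables.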