Diffstat (limited to 'doc/administration/gitaly/monitoring.md')
-rw-r--r-- | doc/administration/gitaly/monitoring.md | 202 |
1 files changed, 202 insertions, 0 deletions
diff --git a/doc/administration/gitaly/monitoring.md b/doc/administration/gitaly/monitoring.md
new file mode 100644
index 00000000000..17f94f912ee
--- /dev/null
+++ b/doc/administration/gitaly/monitoring.md
@@ -0,0 +1,202 @@
+---
+stage: Create
+group: Gitaly
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Monitoring Gitaly and Gitaly Cluster
+
+You can use the available logs and [Prometheus metrics](../monitoring/prometheus/index.md) to
+monitor Gitaly and Gitaly Cluster (Praefect).
+
+Metric definitions are available:
+
+- Directly from the Prometheus `/metrics` endpoint configured for Gitaly.
+- Using [Grafana Explore](https://grafana.com/docs/grafana/latest/explore/) on a
+  Grafana instance configured against Prometheus.
+
+## Monitor Gitaly rate limiting
+
+Gitaly can be configured to limit requests based on:
+
+- Concurrency of requests.
+- A rate limit.
+
+Monitor Gitaly request limiting with the `gitaly_requests_dropped_total` Prometheus metric. This metric provides a total count
+of requests dropped due to request limiting. The `reason` label indicates why a request was dropped:
+
+- `rate`, due to rate limiting.
+- `max_size`, because the concurrency queue size was reached.
+- `max_time`, because the request exceeded the maximum queue wait time as configured in Gitaly.
+
+## Monitor Gitaly concurrency limiting
+
+You can observe specific behavior of [concurrency-queued requests](configure_gitaly.md#limit-rpc-concurrency) using
+the Gitaly logs and Prometheus:
+
+- In the [Gitaly logs](../logs.md#gitaly-logs), look for the string (or structured log field)
+  `acquire_ms`. Messages that have this field report on the concurrency limiter.
+- In Prometheus, look for the following metrics:
+  - `gitaly_concurrency_limiting_in_progress` indicates how many concurrent requests are
+    being processed.
+  - `gitaly_concurrency_limiting_queued` indicates how many requests for an RPC for a given
+    repository are waiting due to the concurrency limit being reached.
+  - `gitaly_concurrency_limiting_acquiring_seconds` indicates how long a request has to
+    wait due to concurrency limits before being processed.
+
+## Monitor Gitaly cgroups
+
+You can observe the status of [control groups (cgroups)](configure_gitaly.md#control-groups) using Prometheus:
+
+- `gitaly_cgroups_memory_failed_total`, a gauge for the total number of times
+  the memory limit has been hit. This number resets each time a server is
+  restarted.
+- `gitaly_cgroups_cpu_usage`, a gauge that measures CPU usage per cgroup.
+- `gitaly_cgroup_procs_total`, a gauge that measures the total number of
+  processes Gitaly has spawned under the control of cgroups.
+
+## `pack-objects` cache
+
+The following [`pack-objects` cache](configure_gitaly.md#pack-objects-cache) metrics are available:
+
+- `gitaly_pack_objects_cache_enabled`, a gauge set to `1` when the cache is enabled. Available
+  labels: `dir` and `max_age`.
+- `gitaly_pack_objects_cache_lookups_total`, a counter for cache lookups. Available label: `result`.
+- `gitaly_pack_objects_generated_bytes_total`, a counter for the number of bytes written into the
+  cache.
+- `gitaly_pack_objects_served_bytes_total`, a counter for the number of bytes read from the cache.
+- `gitaly_streamcache_filestore_disk_usage_bytes`, a gauge for the total size of cache files.
+  Available label: `dir`.
+- `gitaly_streamcache_index_entries`, a gauge for the number of entries in the cache. Available
+  label: `dir`.
+
+Some of these metrics start with `gitaly_streamcache` because they are generated by the
+`streamcache` internal library package in Gitaly.
+
+Example:
+
+```plaintext
+gitaly_pack_objects_cache_enabled{dir="/var/opt/gitlab/git-data/repositories/+gitaly/PackObjectsCache",max_age="300"} 1
+gitaly_pack_objects_cache_lookups_total{result="hit"} 2
+gitaly_pack_objects_cache_lookups_total{result="miss"} 1
+gitaly_pack_objects_generated_bytes_total 2.618649e+07
+gitaly_pack_objects_served_bytes_total 7.855947e+07
+gitaly_streamcache_filestore_disk_usage_bytes{dir="/var/opt/gitlab/git-data/repositories/+gitaly/PackObjectsCache"} 2.6200152e+07
+gitaly_streamcache_filestore_removed_total{dir="/var/opt/gitlab/git-data/repositories/+gitaly/PackObjectsCache"} 1
+gitaly_streamcache_index_entries{dir="/var/opt/gitlab/git-data/repositories/+gitaly/PackObjectsCache"} 1
+```
+
+## Useful queries
+
+The following are useful queries for monitoring Gitaly:
+
+- Use the following Prometheus query to observe the
+  [type of connections](configure_gitaly.md#enable-tls-support) Gitaly is serving in a production
+  environment:
+
+  ```prometheus
+  sum(rate(gitaly_connections_total[5m])) by (type)
+  ```
+
+- Use the following Prometheus query to monitor the
+  [authentication behavior](configure_gitaly.md#observe-type-of-gitaly-connections) of your GitLab
+  installation:
+
+  ```prometheus
+  sum(rate(gitaly_authentications_total[5m])) by (enforced, status)
+  ```
+
+  In a system where authentication is configured correctly and where you have live traffic, you
+  see something like this:
+
+  ```prometheus
+  {enforced="true",status="ok"} 4424.985419441742
+  ```
+
+  There may also be other numbers with rate 0, but you only have to take note of the non-zero numbers.
+
+  The only non-zero number should have `enforced="true",status="ok"`. If you have other non-zero
+  numbers, something is wrong in your configuration.
+
+  The `status="ok"` number reflects your current request rate. In the example above, Gitaly is
+  handling about 4400 requests per second.
+
+- Use the following Prometheus query to observe the [Git protocol versions](../git_protocol.md)
+  being used in a production environment:
+
+  ```prometheus
+  sum(rate(gitaly_git_protocol_requests_total[1m])) by (grpc_method,git_protocol,grpc_service)
+  ```
+
+## Monitor Gitaly Cluster
+
+To monitor Gitaly Cluster (Praefect), you can use Prometheus metrics. There are two separate metrics
+endpoints from which metrics can be scraped:
+
+- The default `/metrics` endpoint.
+- `/db_metrics`, which contains metrics that require database queries.
+
+### Default Prometheus `/metrics` endpoint
+
+The following metrics are available from the `/metrics` endpoint:
+
+- `gitaly_praefect_read_distribution`, a counter to track [distribution of reads](index.md#distributed-reads).
+  It has two labels:
+
+  - `virtual_storage`.
+  - `storage`.
+
+  They reflect the configuration defined for this instance of Praefect.
+
+- `gitaly_praefect_replication_latency_bucket`, a histogram measuring the amount of time it takes
+  for replication to complete after the replication job starts. Available in GitLab 12.10 and later.
+- `gitaly_praefect_replication_delay_bucket`, a histogram measuring how much time passes between
+  when the replication job is created and when it starts. Available in GitLab 12.10 and later.
+- `gitaly_praefect_node_latency_bucket`, a histogram measuring the latency in Gitaly returning
+  health check information to Praefect. This indicates Praefect connection saturation. Available in
+  GitLab 12.10 and later.
+
+To monitor [strong consistency](index.md#strong-consistency), you can use the following Prometheus metrics:
+
+- `gitaly_praefect_transactions_total`, the number of transactions created and voted on.
+- `gitaly_praefect_subtransactions_per_transaction_total`, the number of times nodes cast a vote for
+  a single transaction. This can happen multiple times if multiple references are getting updated in
+  a single transaction.
+- `gitaly_praefect_voters_per_transaction_total`, the number of Gitaly nodes taking part in a
+  transaction.
+- `gitaly_praefect_transactions_delay_seconds`, the server-side delay introduced by waiting for the
+  transaction to be committed.
+- `gitaly_hook_transaction_voting_delay_seconds`, the client-side delay introduced by waiting for
+  the transaction to be committed.
+
+To monitor the number of repositories that have no healthy, up-to-date replicas:
+
+- `gitaly_praefect_unavailable_repositories`
+
+To monitor [repository verification](praefect.md#repository-verification), use the following Prometheus metrics:
+
+- `gitaly_praefect_verification_queue_depth`, the total number of replicas pending verification. This
+  metric is scraped from the database and is only available when Prometheus is scraping the database metrics.
+- `gitaly_praefect_verification_jobs_dequeued_total`, the number of verification jobs picked up by the
+  worker.
+- `gitaly_praefect_verification_jobs_completed_total`, the number of verification jobs completed by the
+  worker. The `result` label indicates the end result of the jobs:
+  - `valid` indicates the expected replica existed on the storage.
+  - `invalid` indicates the replica expected to exist did not exist on the storage.
+  - `error` indicates the job failed and has to be retried.
+- `gitaly_praefect_stale_verification_leases_released_total`, the number of stale verification leases
+  released.
+
+You can also monitor the [Praefect logs](../logs.md#praefect-logs).
+
+### Database metrics `/db_metrics` endpoint
+
+> [Introduced](https://gitlab.com/gitlab-org/gitaly/-/issues/3286) in GitLab 14.5.
+
+The following metrics are available from the `/db_metrics` endpoint:
+
+- `gitaly_praefect_unavailable_repositories`, the number of repositories that have no healthy, up-to-date replicas.
+- `gitaly_praefect_read_only_repositories`, the number of repositories in read-only mode in a virtual storage.
+  This metric is available for backwards compatibility reasons. `gitaly_praefect_unavailable_repositories` is more
+  accurate.
+- `gitaly_praefect_replication_queue_depth`, the number of jobs in the replication queue.
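
Note (illustrative, not part of the change above): the metrics this page documents could be combined into queries along the following lines. This is a minimal sketch under stated assumptions: the 5-minute rate window and the 0.95 quantile are arbitrary choices made here for illustration, and `le` is the standard Prometheus histogram bucket label rather than something the page itself names.

```prometheus
# Rate of requests dropped by Gitaly request limiting, split by the documented
# `reason` label (`rate`, `max_size`, `max_time`). The 5m window is an assumption.
sum(rate(gitaly_requests_dropped_total[5m])) by (reason)

# Approximate pack-objects cache hit ratio, using the documented `result` label.
sum(rate(gitaly_pack_objects_cache_lookups_total{result="hit"}[5m]))
/
sum(rate(gitaly_pack_objects_cache_lookups_total[5m]))

# Estimated 95th percentile of Praefect replication latency, computed from the
# documented histogram buckets. The 0.95 quantile is an assumption.
histogram_quantile(0.95, sum(rate(gitaly_praefect_replication_latency_bucket[5m])) by (le))
```

Each expression is a separate query and would be run on its own. A sustained non-zero rate for `max_size` or `max_time` in the first query would suggest that the concurrency queue limits described above are being reached.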