doc/administration/gitaly/monitoring.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203

---
stage: Systems
group: Gitaly
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
---

# Monitoring Gitaly and Gitaly Cluster

You can use the available logs and [Prometheus metrics](../monitoring/prometheus/index.md) to
monitor Gitaly and Gitaly Cluster (Praefect).

Metric definitions are available:

- Directly from Prometheus `/metrics` endpoint configured for Gitaly.
- Using [Grafana Explore](https://grafana.com/docs/grafana/latest/explore/) on a
  Grafana instance configured against Prometheus.

## Monitor Gitaly rate limiting

Gitaly can be configured to limit requests based on:

- Concurrency of requests.
- A rate limit.

Monitor Gitaly request limiting with the `gitaly_requests_dropped_total` Prometheus metric. This metric provides a total count
of requests dropped due to request limiting. The `reason` label indicates why a request was dropped:

- `rate`, due to rate limiting.
- `max_size`, because the concurrency queue size was reached.
- `max_time`, because the request exceeded the maximum queue wait time as configured in Gitaly.

## Monitor Gitaly concurrency limiting

You can observe specific behavior of [concurrency-queued requests](configure_gitaly.md#limit-rpc-concurrency) using
the Gitaly logs and Prometheus:

- In the [Gitaly logs](../logs/index.md#gitaly-logs), look for the string (or structured log field)
  `acquire_ms`. Messages that have this field are reporting about the concurrency limiter.
- In Prometheus, look for the following metrics:
  - `gitaly_concurrency_limiting_in_progress` indicates how many concurrent requests are
    being processed.
  - `gitaly_concurrency_limiting_queued` indicates how many requests for an RPC for a given
    repository are waiting due to the concurrency limit being reached.
  - `gitaly_concurrency_limiting_acquiring_seconds` indicates how long a request has to
    wait due to concurrency limits before being processed.

## Monitor Gitaly cgroups

You can observe the status of [control groups (cgroups)](configure_gitaly.md#control-groups) using Prometheus:

- `gitaly_cgroups_reclaim_attempts_total`, a gauge for the total number of times
   there has been a memory relcaim attempt. This number resets each time a server is
   restarted.
- `gitaly_cgroups_cpu_usage`, a gauge that measures CPU usage per cgroup.
- `gitaly_cgroup_procs_total`, a gauge that measures the total number of
   processes Gitaly has spawned under the control of cgroups.

## `pack-objects` cache

The following [`pack-objects` cache](configure_gitaly.md#pack-objects-cache) metrics are available:

- `gitaly_pack_objects_cache_enabled`, a gauge set to `1` when the cache is enabled. Available
  labels: `dir` and `max_age`.
- `gitaly_pack_objects_cache_lookups_total`, a counter for cache lookups. Available label: `result`.
- `gitaly_pack_objects_generated_bytes_total`, a counter for the number of bytes written into the
  cache.
- `gitaly_pack_objects_served_bytes_total`, a counter for the number of bytes read from the cache.
- `gitaly_streamcache_filestore_disk_usage_bytes`, a gauge for the total size of cache files.
  Available label: `dir`.
- `gitaly_streamcache_index_entries`, a gauge for the number of entries in the cache. Available
  label: `dir`.

Some of these metrics start with `gitaly_streamcache` because they are generated by the
`streamcache` internal library package in Gitaly.

Example:

```plaintext
gitaly_pack_objects_cache_enabled{dir="/var/opt/gitlab/git-data/repositories/+gitaly/PackObjectsCache",max_age="300"} 1
gitaly_pack_objects_cache_lookups_total{result="hit"} 2
gitaly_pack_objects_cache_lookups_total{result="miss"} 1
gitaly_pack_objects_generated_bytes_total 2.618649e+07
gitaly_pack_objects_served_bytes_total 7.855947e+07
gitaly_streamcache_filestore_disk_usage_bytes{dir="/var/opt/gitlab/git-data/repositories/+gitaly/PackObjectsCache"} 2.6200152e+07
gitaly_streamcache_filestore_removed_total{dir="/var/opt/gitlab/git-data/repositories/+gitaly/PackObjectsCache"} 1
gitaly_streamcache_index_entries{dir="/var/opt/gitlab/git-data/repositories/+gitaly/PackObjectsCache"} 1
```

## Useful queries

The following are useful queries for monitoring Gitaly:

- Use the following Prometheus query to observe the
  [type of connections](configure_gitaly.md#enable-tls-support) Gitaly is serving a production
  environment:

  ```prometheus
  sum(rate(gitaly_connections_total[5m])) by (type)
  ```

- Use the following Prometheus query to monitor the
  [authentication behavior](configure_gitaly.md#observe-type-of-gitaly-connections) of your GitLab
  installation:

  ```prometheus
  sum(rate(gitaly_authentications_total[5m])) by (enforced, status)
  ```

  In a system where authentication is configured correctly and where you have live traffic, you
  see something like this:

  ```prometheus
  {enforced="true",status="ok"}  4424.985419441742
  ```

  There may also be other numbers with rate 0, but you only have to take note of the non-zero numbers.

  The only non-zero number should have `enforced="true",status="ok"`. If you have other non-zero
  numbers, something is wrong in your configuration.

  The `status="ok"` number reflects your current request rate. In the example above, Gitaly is
  handling about 4000 requests per second.

- Use the following Prometheus query to observe the [Git protocol versions](../git_protocol.md)
  being used in a production environment:

  ```prometheus
  sum(rate(gitaly_git_protocol_requests_total[1m])) by (grpc_method,git_protocol,grpc_service)
  ```

## Monitor Gitaly Cluster

To monitor Gitaly Cluster (Praefect), you can use these Prometheus metrics. There are two separate metrics
endpoints from which metrics can be scraped:

- The default `/metrics` endpoint.
- `/db_metrics`, which contains metrics that require database queries.

### Default Prometheus `/metrics` endpoint

The following metrics are available from the `/metrics` endpoint:

- `gitaly_praefect_read_distribution`, a counter to track [distribution of reads](index.md#distributed-reads).
  It has two labels:

  - `virtual_storage`.
  - `storage`.

  They reflect configuration defined for this instance of Praefect.

- `gitaly_praefect_replication_latency_bucket`, a histogram measuring the amount of time it takes
  for replication to complete after the replication job starts. Available in GitLab 12.10 and later.
- `gitaly_praefect_replication_delay_bucket`, a histogram measuring how much time passes between
  when the replication job is created and when it starts. Available in GitLab 12.10 and later.
- `gitaly_praefect_node_latency_bucket`, a histogram measuring the latency in Gitaly returning
  health check information to Praefect. This indicates Praefect connection saturation. Available in
  GitLab 12.10 and later.
- `gitaly_praefect_connections_total`, the total number of connections to Praefect. [Introduced](https://gitlab.com/gitlab-org/gitaly/-/merge_requests/4220) in GitLab 14.7.

To monitor [strong consistency](index.md#strong-consistency), you can use the following Prometheus metrics:

- `gitaly_praefect_transactions_total`, the number of transactions created and voted on.
- `gitaly_praefect_subtransactions_per_transaction_total`, the number of times nodes cast a vote for
  a single transaction. This can happen multiple times if multiple references are getting updated in
  a single transaction.
- `gitaly_praefect_voters_per_transaction_total`: the number of Gitaly nodes taking part in a
  transaction.
- `gitaly_praefect_transactions_delay_seconds`, the server-side delay introduced by waiting for the
  transaction to be committed.
- `gitaly_hook_transaction_voting_delay_seconds`, the client-side delay introduced by waiting for
  the transaction to be committed.

To monitor the number of repositories that have no healthy, up-to-date replicas:

- `gitaly_praefect_unavailable_repositories`

To monitor [repository verification](praefect.md#repository-verification), use the following Prometheus metrics:

- `gitaly_praefect_verification_queue_depth`, the total number of replicas pending verification. This
  metric is scraped from the database and is only available when Prometheus is scraping the database metrics.
- `gitaly_praefect_verification_jobs_dequeued_total`, the number of verification jobs picked up by the
  worker.
- `gitaly_praefect_verification_jobs_completed_total`, the number of verification jobs completed by the
  worker. The `result` label indicates the end result of the jobs:
  - `valid` indicates the expected replica existed on the storage.
  - `invalid` indicates the replica expected to exist did not exist on the storage.
  - `error` indicates the job failed and has to be retried.
- `gitaly_praefect_stale_verification_leases_released_total`, the number of stale verification leases
  released.

You can also monitor the [Praefect logs](../logs/index.md#praefect-logs).

### Database metrics `/db_metrics` endpoint

> [Introduced](https://gitlab.com/gitlab-org/gitaly/-/issues/3286) in GitLab 14.5.

The following metrics are available from the `/db_metrics` endpoint:

- `gitaly_praefect_unavailable_repositories`, the number of repositories that have no healthy, up to date replicas.
- `gitaly_praefect_read_only_repositories`, the number of repositories in read-only mode in a virtual storage.
  This metric is available for backwards compatibility reasons. `gitaly_praefect_unavailable_repositories` is more
  accurate.
- `gitaly_praefect_replication_queue_depth`, the number of jobs in the replication queue.