summaryrefslogtreecommitdiff
path: root/doc/development/database/clickhouse/merge_request_analytics.md
blob: 34da71d6c4cb87dab8e08d744c16c95b9ce0422f (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
---
stage: Data Stores
group: Database
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
---

# Merge request analytics with ClickHouse

The [merge request analytics feature](../../../user/analytics/merge_request_analytics.md)
shows statistics about the merged merge requests in the project and also exposes record-level metadata.
Aggregations include:

- **Average time to merge**: The duration between the creation time and the merge time.
- **Monthly aggregations**: A chart of 12 months of the merged merge requests.

Under the chart, the user can see the paginated list of merge requests, 12 months per page.

You can filter by:

- Author
- Assignee
- Labels
- Milestone
- Source branch
- Target branch

## Current performance problems

- The aggregation queries require specialized indexes, which cost additional
  disk space (index-only scans).
- Querying the whole 12 months is slow (statement timeout). Instead, the frontend
  requests data per month (12 database queries).
- Even with specialized indexes, making the feature available on the group level
  would not be feasible due to the large volume of merge requests.

## Example queries

Get the number of merge requests merged in a given month:

```sql
SELECT COUNT(*)
FROM "merge_requests"
INNER JOIN "merge_request_metrics" ON "merge_request_metrics"."merge_request_id" = "merge_requests"."id"
WHERE (NOT EXISTS
         (SELECT 1
          FROM "banned_users"
          WHERE (merge_requests.author_id = banned_users.user_id)))
  AND "merge_request_metrics"."target_project_id" = 278964
  AND "merge_request_metrics"."merged_at" >= '2022-12-01 00:00:00'
  AND "merge_request_metrics"."merged_at" <= '2023-01-01 00:00:00'
```

The `merge_request_metrics` table was de-normalized (by adding `target_project_id`)
to improve the first-page load time. The query itself works well for smaller date ranges,
however, it can time out as the date range increases.

After an extra filter is added, the query becomes more complex because it must also
filter the `merge_requests` table:

```sql
SELECT COUNT(*)
FROM "merge_requests"
INNER JOIN "merge_request_metrics" ON "merge_request_metrics"."merge_request_id" = "merge_requests"."id"
WHERE (NOT EXISTS
         (SELECT 1
          FROM "banned_users"
          WHERE (merge_requests.author_id = banned_users.user_id)))
  AND "merge_requests"."author_id" IN
    (SELECT "users"."id"
     FROM "users"
     WHERE (LOWER("users"."username") IN (LOWER('ahegyi'))))
  AND "merge_request_metrics"."target_project_id" = 278964
  AND "merge_request_metrics"."merged_at" >= '2022-12-01 00:00:00'
  AND "merge_request_metrics"."merged_at" <= '2023-01-01 00:00:00'
```

To calculate mean time to merge, we also query the total time between the
merge request creation time and merge time.

```sql
SELECT EXTRACT(epoch
               FROM SUM(AGE(merge_request_metrics.merged_at, merge_request_metrics.created_at)))
FROM "merge_requests"
INNER JOIN "merge_request_metrics" ON "merge_request_metrics"."merge_request_id" = "merge_requests"."id"
WHERE (NOT EXISTS
         (SELECT 1
          FROM "banned_users"
          WHERE (merge_requests.author_id = banned_users.user_id)))
  AND "merge_requests"."author_id" IN
    (SELECT "users"."id"
     FROM "users"
     WHERE (LOWER("users"."username") IN (LOWER('ahegyi'))))
  AND "merge_request_metrics"."target_project_id" = 278964
  AND "merge_request_metrics"."merged_at" >= '2022-08-01 00:00:00'
  AND "merge_request_metrics"."merged_at" <= '2022-09-01 00:00:00'
  AND "merge_request_metrics"."merged_at" > "merge_request_metrics"."created_at"
LIMIT 1
```

## Store merge request data in ClickHouse

Several other use cases exist for storing and querying merge request data in
ClickHouse. In this document, we focus on this particular feature.

The core data exists in the `merge_request_metrics` and in the `merge_requests`
database tables. Some filters require extra tables to be joined:

- `banned_users`: Filter out merge requests created by banned users.
- `labels`: A merge request can have one or more assigned labels.
- `assignees`: A merge request can have one or more assignees.
- `merged_at`: The `merged_at` column is located in the `merge_request_metrics` table.

The `merge_requests` table contains data that can be filtered directly:

- **Author**: via the `author_id` column.
- **Milestone**: via the `milestone_id` column.
- **Source branch**.
- **Target branch**.
- **Project**: via the `project_id` column.

### Keep ClickHouse data up to date

Replicating or syncing the `merge_requests` table is unfortunately not enough.
Separate queries to associated tables are required to insert one de-normalized
`merge_requests` row into the ClickHouse database.

Change detection is non-trivial to implement. A few corners we could cut:

- The feature is available for GitLab Premium and GitLab Ultimate customers.
  We don't have to sync all the data, but instead sync only the `merge_requests` records
  which are part of licensed groups.
- Data changes (often) happen via the `MergeRequest` services, where bumping the
  `updated_at` timestamp column is mostly consistent. Some sort of incremental
  synchronization process could be implemented.
- We only need to query the merged merge requests. After the merge, the record rarely changes.

### Database table structure

The database table structure uses de-normalization to make all required columns
available in one database table. This eliminates the need for `JOINs`.

```sql
CREATE TABLE merge_requests
(
    `id` UInt64,
    `project_id` UInt64 DEFAULT 0 NOT NULL,
    `author_id` UInt64 DEFAULT 0 NOT NULL,
    `milestone_id` UInt64 DEFAULT 0 NOT NULL,
    `label_ids` Array(UInt64) DEFAULT [] NOT NULL,
    `assignee_ids` Array(UInt64) DEFAULT [] NOT NULL,
    `source_branch` String DEFAULT '' NOT NULL,
    `target_branch` String DEFAULT '' NOT NULL,
    `merged_at` DateTime64(6, 'UTC') NOT NULL,
    `created_at` DateTime64(6, 'UTC') DEFAULT now() NOT NULL,
    `updated_at` DateTime64(6, 'UTC') DEFAULT now() NOT NULL
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY (project_id, merged_at, id);
```

Similarly to the [activity data example](gitlab_activity_data.md), we use the
`ReplacingMergeTree` engine. Several columns of the merge request record may change,
so keeping the table up-to-date is important.

The database table is ordered by the `project_id, merged_at, id` columns. This ordering
optimizes the table data for our use case: querying the `merged_at` column in a project.

## Rewrite the count query

First, let's generate some data for the table.

```sql
INSERT INTO merge_requests (id, project_id, author_id, milestone_id, label_ids, merged_at, created_at)
SELECT id, project_id, author_id, milestone_id, label_ids, merged_at, created_at
FROM generateRandom('id UInt64, project_id UInt8, author_id UInt8, milestone_id UInt8, label_ids Array(UInt8), merged_at DateTime64(6, \'UTC\'), created_at DateTime64(6, \'UTC\')')
LIMIT 1000000;
```

NOTE:
Some integer data types were cast as `UInt8` so it is highly probable that they
have same values across different rows.

The original count query only aggregated data for one month. With ClickHouse, we can
attempt aggregating the data for the whole year.

PostgreSQL-based count query:

```sql
SELECT COUNT(*)
FROM "merge_requests"
INNER JOIN "merge_request_metrics" ON "merge_request_metrics"."merge_request_id" = "merge_requests"."id"
WHERE (NOT EXISTS
         (SELECT 1
          FROM "banned_users"
          WHERE (merge_requests.author_id = banned_users.user_id)))
  AND "merge_request_metrics"."target_project_id" = 278964
  AND "merge_request_metrics"."merged_at" >= '2022-12-01 00:00:00'
  AND "merge_request_metrics"."merged_at" <= '2023-01-01 00:00:00'
```

ClickHouse query:

```sql
SELECT
  toYear(merged_at) AS year,
  toMonth(merged_at) AS month,
  COUNT(*)
FROM merge_requests
WHERE
  project_id = 200
  AND merged_at BETWEEN '2022-01-01 00:00:00'
  AND '2023-01-01 00:00:00'
GROUP BY year, month
```

The query processed a significantly lower number of rows compared to the generated data.
The `ORDER BY` clause (primary key) is helping the query execution:

```plaintext
11 rows in set. Elapsed: 0.010 sec.
Processed 8.19 thousand rows, 131.07 KB (783.45 thousand rows/s., 12.54 MB/s.)
```

## Rewrite the Mean time to merge query

The query calculates the mean time to merge as:
`duration(created_at, merged_at) / merge_request_count`. The calculation is done in
two separate steps:

1. Request the monthly counts and the monthly duration values.
1. Sum the counts to get the yearly count.
1. Sum the durations to get the yearly duration.
1. Divide the durations by the count.

In ClickHouse, we can calculate the mean time to merge with one query:

```sql
SELECT
  SUM(
    dateDiff('second', merged_at, created_at) / 3600 / 24
  ) / COUNT(*) AS mean_time_to_merge -- mean_time_to_merge is in days
FROM merge_requests
WHERE
  project_id = 200
  AND merged_at BETWEEN '2022-01-01 00:00:00'
  AND '2023-01-01 00:00:00'
```

## Filtering

The database queries above can be used as base queries. You can add more filters.
For example, filtering for a label and a milestone:

```sql
SELECT
  toYear(merged_at) AS year,
  toMonth(merged_at) AS month,
  COUNT(*)
FROM merge_requests
WHERE
  project_id = 200
  AND milestone_id = 15
  AND has(label_ids, 118)
  AND -- array includes 118
  merged_at BETWEEN '2022-01-01 00:00:00'
  AND '2023-01-01 00:00:00'
GROUP BY year, month
```

Optimizing a particular filter is usually done with a database index. This particular
query reads 8000 rows:

```plaintext
1 row in set. Elapsed: 0.016 sec.
Processed 8.19 thousand rows, 589.99 KB (505.38 thousand rows/s., 36.40 MB/s.)
```

Adding an index on `milestone_id`:

```sql
ALTER TABLE merge_requests
ADD
  INDEX milestone_id_index milestone_id TYPE minmax GRANULARITY 10;
ALTER TABLE
  merge_requests MATERIALIZE INDEX milestone_id_index;
```

On the generated data, adding the index didn't improve the performance.

### Banned users filter

A recently added feature in GitLab filters out merge requests where the author is
banned by the admins. The banned users are tracked on the instance level in the
`banned_users` database table.

#### Idea 1: Enumerate the banned user IDs

This would require no structural changes to the ClickHouse database schema.
We could query the banned users in the project and filter the values out in query time.

Get the banned users (in PostgreSQL):

```sql
SELECT user_id FROM banned_users
```

In ClickHouse

```sql
SELECT
  toYear(merged_at) AS year,
  toMonth(merged_at) AS month,
  COUNT(*)
FROM merge_requests
WHERE
  author_id NOT IN (1, 2, 3, 4) AND -- banned users
  project_id = 200
  AND milestone_id = 15
  AND has(label_ids, 118) AND -- array includes 118
  merged_at BETWEEN '2022-01-01 00:00:00'
  AND '2023-01-01 00:00:00'
GROUP BY year, month
```

The problem with this approach is that the number of banned users could increase significantly which would make the query bigger and slower.

#### Idea 2: replicate the `banned_users` table

Assuming that the `banned_users table` doesn't grow to millions of rows, we could
attempt to periodically sync the whole table to ClickHouse. With this approach,
a mostly consistent `banned_users` table could be used in the ClickHouse database query:

```sql
SELECT
  toYear(merged_at) AS year,
  toMonth(merged_at) AS month,
  COUNT(*)
FROM merge_requests
WHERE
  author_id NOT IN (SELECT user_id FROM banned_users) AND
  project_id = 200 AND
  milestone_id = 15 AND
  has(label_ids, 118) AND -- array includes 118
  merged_at BETWEEN '2022-01-01 00:00:00' AND '2023-01-01 00:00:00'
GROUP BY year, month
```

Alternatively, the `banned_users` table could be stored as a
[dictionary](https://clickhouse.com/docs/en/sql-reference/dictionaries/external-dictionaries/external-dicts)
to further improve the query performance.

#### Idea 3: Alter the feature

For analytical calculations, it might be acceptable to drop this particular filter.
This approach assumes that including the merge requests of banned users doesn't skew the statistics significantly.