---
stage: enablement
group: Tenant Scale
description: 'Cells: CI Runners'
---

<!-- vale gitlab.FutureTense = NO -->

This document is a work-in-progress and represents a very early state of the
Cells design. Significant aspects are not documented, though we expect to add
them in the future. This is one possible architecture for Cells, and we intend to
contrast this with alternatives before deciding which approach to implement.
This documentation will be kept even if we decide not to implement this so that
we can document the reasons for not choosing this approach.

# Cells: CI Runners

To execute CI jobs, GitLab uses [GitLab Runner](https://gitlab.com/gitlab-org/gitlab-runner/),
which is very often managed by customers in their own infrastructure.

All CI jobs created as part of a CI pipeline run in the context of a project,
which poses a challenge for how to manage GitLab Runners.

## 1. Definition

There are 3 different types of runners:

- instance-wide runners: runners that are registered globally with specific tags (selection criteria)
- group runners: runners that execute jobs from a given top-level group or from subprojects of that group
- project runners: runners that execute jobs from one project or from many projects. Some runners might
  have projects assigned from different top-level groups.

This, together with the existing data structure in which `ci_runners` is a single table describing
all types of runners, poses a challenge for how `ci_runners` should be managed in a Cells environment.
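To make the difficulty concrete, the following is a minimal sketch of the three runner types and of the case that is hardest to place: a project runner attached to projects in more than one top-level group. The class, field, and function names are hypothetical and do not reflect the actual `ci_runners` schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Set


class RunnerType(Enum):
    # Illustrative values only, not the real database enum.
    INSTANCE = "instance"
    GROUP = "group"
    PROJECT = "project"


@dataclass
class Runner:
    id: int
    runner_type: RunnerType
    # Top-level groups that own the groups or projects this runner serves.
    top_level_groups: Set[str] = field(default_factory=set)


def spans_multiple_top_level_groups(runner: Runner) -> bool:
    """Return True for a project runner that serves projects from more than
    one top-level group, so it cannot be tied to a single group trivially."""
    return (
        runner.runner_type is RunnerType.PROJECT
        and len(runner.top_level_groups) > 1
    )
```

A runner for which this check returns `True` is the kind of record that makes a single shared `ci_runners` table awkward to split by Cell.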

## 2. Data flow

GitLab Runners use a set of globally scoped endpoints to:

- register a new runner with a registration token via `https://gitlab.com/api/v4/runners`
  ([subject for removal](../runner_tokens/index.md)) (`registration token`)
- request jobs via the authenticated `https://gitlab.com/api/v4/jobs/request` endpoint (`runner token`)
- upload job status via `https://gitlab.com/api/v4/jobs/:job_id` (`build token`)
- upload trace via `https://gitlab.com/api/v4/jobs/:job_id/trace` (`build token`)
- download and upload artifacts via `https://gitlab.com/api/v4/jobs/:job_id/artifacts` (`build token`)

Currently, three types of authentication tokens are used:

- runner registration token ([subject for removal](../runner_tokens/index.md))
- runner token, representing a registered runner in the system with a specific configuration (`tags`, `locked`, etc.)
- build token, an ephemeral token giving limited access to updating a specific
  job, uploading artifacts, downloading dependent artifacts, and downloading and uploading
  container registry images

Each of those endpoints receives an authentication token via a header (`JOB-TOKEN` for `/trace`)
or a body parameter (`token` for all other endpoints).

Because a CI pipeline would be created in the context of a specific Cell, picking
a build would have to be processed by that particular Cell. Depending on the solution,
build picking would therefore have to be either:

- routed to the correct Cell on the first request
- made two-phase: request a build from a global pool, then claim the build on a specific Cell using a Cell-specific URL
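The two-phase option can be sketched as follows. This is only an in-memory illustration under the assumption that a global pool advertises pending builds together with the name of the owning Cell, and that the claim is decided by that Cell. `PendingBuild`, `Cell`, and `pick_build` are hypothetical names, and the network calls are replaced by plain method calls.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class PendingBuild:
    build_id: int
    cell: str  # name of the Cell that owns the pipeline data


@dataclass
class Cell:
    name: str
    claimed: Set[int] = field(default_factory=set)

    def claim(self, build_id: int) -> bool:
        """Phase 2: succeeds only for the first claim of a given build."""
        if build_id in self.claimed:
            return False
        self.claimed.add(build_id)
        return True


def pick_build(pool: List[PendingBuild], cells: Dict[str, Cell]) -> Optional[PendingBuild]:
    # Phase 1: inspect the candidates advertised in the global pool.
    for candidate in list(pool):
        # Phase 2: claim the build on the owning Cell. The claim can fail
        # when another runner already took the build.
        if cells[candidate.cell].claim(candidate.build_id):
            pool.remove(candidate)
            return candidate
    return None
```

In this sketch the global pool is only a hint: the owning Cell remains the single place where a build is actually handed to a runner.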

## 3. Proposal

This section describes various proposals. The reader should keep in mind that these
proposals describe solutions to different problems. Some or many aspects
of these proposals might form the solution to the stated problem.

### 3.1. Authentication tokens

Even though the paths for CI Runners are not routable, they can be made routable with
one of the following two possible solutions:

- The `https://gitlab.com/api/v4/jobs/request` endpoint uses a long polling mechanism with
  a ticketing mechanism (based on the `X-GitLab-Last-Update` header). When a Runner first
  starts, it sends a request to GitLab, to which GitLab responds with either a build for the
  runner to pick or an updated ticket value. This value is completely controlled by GitLab,
  which allows GitLab to use JWT or any other means to encode a `cell` identifier that the
  Router could easily decode.
- The majority of communication (in terms of volume) uses the `build token`, making it
  the easiest target to change, since GitLab is the sole owner of the token that the Runner later
  uses for a specific job. There were prior discussions about not storing the `build token`
  and instead using a `JWT` token with defined scopes. Such a token could encode the `cell`
  to which the Router could easily route all requests.

### 3.2. Request body

- Most of the endpoints in use pass the authentication token in the request body. It might be
  desirable to use HTTP headers instead, as an easier way for the Router to access this
  information without a need to proxy requests.
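The difference in cost for the Router can be sketched as follows. Reading a header needs no access to the payload, while reading a body parameter requires the whole body to be buffered and parsed first. The function name is hypothetical, and the sketch assumes a JSON request body with a top-level `token` field.

```python
import json
from typing import Mapping, Optional


def token_from_request(headers: Mapping[str, str], body: bytes) -> Optional[str]:
    # Cheap path: the token is available before the body is read at all.
    for name, value in headers.items():
        if name.lower() == "job-token":
            return value

    # Expensive path: the full body has to be buffered and parsed.
    try:
        parsed = json.loads(body or b"{}")
    except ValueError:
        return None
    if isinstance(parsed, dict) and isinstance(parsed.get("token"), str):
        return parsed["token"]
    return None
```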

### 3.3. Instance-wide are Cell local

We can pick a design where all runners are always registered and local to a given Cell:

- Each Cell has its own set of instance-wide runners that are updated at its own pace.
- The project runners can only be linked to projects from the same organization
  creating strong isolation.
- In this model the `ci_runners` table is local to the Cell.
- In this model, we would require the above endpoints to be scoped to a Cell in some way
  or made routable. This could be done by prefixing them, by adding an additional Cell parameter,
  or by providing a much more robust way to decode the runner token and match it to a Cell.
- If a routable token is used, we could move away from cryptographically random tokens stored in the
  database and instead prefer JWT tokens that would encode
- The Admin Area showing registered Runners would have to be scoped to a Cell

This model might be desirable because it provides strong isolation guarantees.
However, it significantly increases maintenance overhead because each Cell is managed
separately.

This model may require adjustments to the runner tags feature so that projects have a consistent runner experience across Cells.
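The three ways of scoping an endpoint to a Cell listed above (a path prefix, an additional Cell parameter, or a decodable runner token) could be resolved by a Router roughly as in the sketch below. The `/cells/<name>/` prefix, the `cell` query parameter, and the `<cell>.<random>` token layout are all invented for illustration. None of them is an existing GitLab convention.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def resolve_cell(url: str, token: Optional[str] = None) -> Optional[str]:
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]

    # Option 1 (hypothetical): path prefix, /cells/<name>/api/v4/...
    if len(segments) >= 2 and segments[0] == "cells":
        return segments[1]

    # Option 2 (hypothetical): explicit query parameter, ?cell=<name>
    values = parse_qs(parts.query).get("cell")
    if values:
        return values[0]

    # Option 3 (hypothetical): token layout "<cell>.<random part>"
    if token and "." in token:
        return token.split(".", 1)[0]

    # Not routable: the Router cannot tell which Cell owns the request.
    return None
```

For example, `resolve_cell("https://gitlab.com/api/v4/jobs/request")` returns `None`, which corresponds to the current, non-routable state of these endpoints.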

### 3.4. Instance-wide are cluster-wide

Contrary to the proposal where all runners are Cell local, we can consider a model where all
runners are global, or where only instance-wide runners are global.

However, this requires a significant overhaul of the system and changes to the following aspects:

- the `ci_runners` table would likely have to be decomposed into `ci_instance_runners`, ...
- all interfaces would have to be adapted to use the correct table
- build queuing would have to be reworked to be two-phase, where each Cell would know of all pending
  and running builds, but the actual claim of a build would happen against the Cell containing the data
- `ci_pending_builds` and `ci_running_builds` would likely have to be made `cluster-wide` tables,
  increasing the likelihood of creating hotspots in the system related to CI queueing

This model is complex to implement from an engineering perspective. It causes some data to be shared
between Cells, and it creates hotspots and scalability issues in the system (for example, during abuse) that
might impact the experience of organizations on other Cells.

### 3.5. GitLab CI Daemon

Another potential solution to explore is a dedicated service responsible for build queueing,
owning its own database and working in a model of either a sharded or a celled service. There were prior
discussions about [CI/CD Daemon](https://gitlab.com/gitlab-org/gitlab/-/issues/19435).

If the service is sharded:

- depending on whether runners are cluster-wide or Cell-local, this service would have to fetch
  data from all Cells
- we could adopt a model of sharing the database containing
  `ci_pending_builds`/`ci_running_builds` with the service
- alternatively, we could consider a push model where each Cell pushes to the CI/CD Daemon
  the builds that should be picked by a Runner
- the sharded service would be aware of which Cell is responsible for processing a given build and could
  route processing requests to the designated Cell
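The push model from the list above can be sketched as a queue that remembers which Cell each build came from, so that later processing requests can be routed back to that Cell. `CiDaemon` and its methods are hypothetical names, and a single in-memory FIFO queue stands in for whatever storage and scheduling policy a real service would use.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple


class CiDaemon:
    def __init__(self) -> None:
        self._queue: Deque[Tuple[str, int]] = deque()
        # Remembers the owning Cell of every build that was handed out.
        self._owner: Dict[int, str] = {}

    def push(self, cell: str, build_id: int) -> None:
        """Called by a Cell when a build becomes ready to be picked."""
        self._queue.append((cell, build_id))

    def next_build(self) -> Optional[int]:
        """Called on behalf of a Runner asking for work."""
        if not self._queue:
            return None
        cell, build_id = self._queue.popleft()
        self._owner[build_id] = cell
        return build_id

    def cell_for(self, build_id: int) -> Optional[str]:
        """Used to route status, trace, and artifact requests to the Cell."""
        return self._owner.get(build_id)
```

Build IDs are assumed to be unique across Cells in this sketch. If they were not, the lookup key would have to include the Cell as well.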

If the service is celled:

- all expectations of routable endpoints are still valid

In general, using a CI Daemon does not help significantly with the stated problem. However, it offers
a few upsides related to more efficient processing and a decoupled model: it enables a push model, and it
opens a way to offer stateful communication with GitLab Runners (for example, gRPC or WebSockets).

## 4. Evaluation

Considering all solutions, the one that appears most promising is to:

- use "instance-wide are Cell local"
- refine endpoints to have routable identities (either via specific paths, or better tokens)

Another potential upside is getting rid of `ci_builds.token` in favor of a `JWT token`,
which can more easily encode the wider set of scopes allowed for a CI runner.

### 4.1. Pros

### 4.2. Cons