Diffstat (limited to 'doc/architecture/blueprints/pods/pods-feature-ci-runners.md')
-rw-r--r-- | doc/architecture/blueprints/pods/pods-feature-ci-runners.md | 169 |
1 files changed, 169 insertions, 0 deletions
diff --git a/doc/architecture/blueprints/pods/pods-feature-ci-runners.md b/doc/architecture/blueprints/pods/pods-feature-ci-runners.md
new file mode 100644
index 00000000000..b75515a916f
--- /dev/null
+++ b/doc/architecture/blueprints/pods/pods-feature-ci-runners.md
@@ -0,0 +1,169 @@
+---
+stage: enablement
+group: pods
+comments: false
+description: 'Pods: CI Runners'
+---
+
+This document is a work-in-progress and represents a very early state of the
+Pods design. Significant aspects are not documented, though we expect to add
+them in the future. This is one possible architecture for Pods, and we intend to
+contrast this with alternatives before deciding which approach to implement.
+This documentation will be kept even if we decide not to implement this so that
+we can document the reasons for not choosing this approach.
+
+# Pods: CI Runners
+
+To execute CI jobs, GitLab uses [GitLab Runner](https://gitlab.com/gitlab-org/gitlab-runner/),
+very often managed by customers in their own infrastructure.
+
+All CI jobs created as part of a CI pipeline run in the context of a project,
+which poses a challenge for how to manage GitLab Runners.
+
+## 1. Definition
+
+There are 3 different types of runners:
+
+- instance-wide: runners that are registered globally with specific tags (selection criteria)
+- group runners: runners that execute jobs from a given top-level group or subprojects of that group
+- project runners: runners that execute jobs from one project or many projects: some runners might
+  have projects assigned from different top-level groups.
+
+This, together with the existing data structure where `ci_runners` is a table describing
+all types of runners, poses a challenge for how `ci_runners` should be managed in a Pods environment.
+
+## 2. Data flow
+
+GitLab Runners use a set of globally scoped endpoints to:
+
+- register a new runner via registration token `https://gitlab.com/api/v4/runners`
+  ([subject for removal](../runner_tokens/index.md)) (`registration token`)
+- request jobs via an authenticated `https://gitlab.com/api/v4/jobs/request` endpoint (`runner token`)
+- upload job status via `https://gitlab.com/api/v4/jobs/:job_id` (`build token`)
+- upload trace via `https://gitlab.com/api/v4/jobs/:job_id/trace` (`build token`)
+- download and upload artifacts via `https://gitlab.com/api/v4/jobs/:job_id/artifacts` (`build token`)
+
+Currently three types of authentication tokens are used:
+
+- runner registration token ([subject for removal](../runner_tokens/index.md))
+- runner token representing a registered runner in a system with a specific configuration (`tags`, `locked`, etc.)
+- build token representing an ephemeral token giving limited access to updating a specific
+  job, uploading artifacts, downloading dependent artifacts, downloading and uploading
+  container registry images
+
+Each of those endpoints receives an authentication token via a header (`JOB-TOKEN` for `/trace`)
+or a body parameter (`token` for all other endpoints).
+
+Since the CI pipeline would be created in the context of a specific Pod, picking a build
+would have to be processed by that particular Pod. Depending on the solution, this requires
+that build picking would have to be either:
+
+- routed to the correct Pod the first time
+- made two-phase: request a build from the global pool, then claim the build on a specific Pod using a Pod-specific URL
+
+## 3. Proposal
+
+This section describes various proposals. The reader should consider that these
+proposals describe solutions for different problems. Many or some aspects
+of those proposals might be the solution to the stated problem.
+
+### 3.1. Authentication tokens
+
+Even though the paths for CI Runners are not routable, they can be made routable with
+these two possible solutions:
+
+- The `https://gitlab.com/api/v4/jobs/request` endpoint uses a long polling mechanism with
+  a ticketing mechanism (based on the `X-GitLab-Last-Update` header). When a Runner first
+  starts, it sends a request to GitLab, to which GitLab responds with either a build for the
+  runner to pick or an updated ticket value. This value is completely controlled by GitLab. This allows GitLab
+  to use JWT or any other means to encode a `pod` identifier that could be easily
+  decoded by the Router.
+- The majority of communication (in terms of volume) uses the `build token`, making it
+  the easiest target to change since GitLab is the sole owner of the token that the Runner later
+  uses for a specific job. There were prior discussions about not storing the `build token`
+  but rather using a `JWT` token with defined scopes. Such a token could encode the `pod`
+  to which the Router could easily route all requests.
+
+### 3.2. Request body
+
+- Most of the endpoints used pass the authentication token in the request body. It might be desirable
+  to use HTTP headers as an easier way for the Router to access this information without
+  a need to proxy requests.
+
+### 3.3. Instance-wide are Pod local
+
+We can pick a design where all runners are always registered and local to a given Pod:
+
+- Each Pod has its own set of instance-wide runners that are updated at its own pace
+- The project runners can only be linked to projects from the same organization
+  creating strong isolation.
+- In this model the `ci_runners` table is local to the Pod.
+- In this model we would require the above endpoints to be scoped to a Pod in some way
+  or made routable. It might be via prefixing them, adding an additional Pod parameter,
+  or providing a much more robust way to decode the runner token and match it to a Pod.
+- If a routable token is used, we could move away from a cryptographically random value stored in the
+  database and instead use JWT tokens that would encode
+- The Admin Area showing registered Runners would have to be scoped to a Pod
+
+This model might be desirable since it provides strong isolation guarantees.
+This model does significantly increase maintenance overhead since each Pod is managed
+separately.
+
+This model may require adjustments to the runner tags feature so that projects have a consistent runner experience across Pods.
+
+### 3.4. Instance-wide are cluster-wide
+
+Contrary to the proposal where all runners are Pod local, we can consider that runners
+are global, or that just instance-wide runners are global.
+
+However, this requires a significant overhaul of the system and changes to the following aspects:
+
+- the `ci_runners` table would likely have to be decomposed into `ci_instance_runners`, ...
+- all interfaces would have to be adapted to use the correct table
+- build queuing would have to be reworked to be two-phase, where each Pod would know of all pending
+  and running builds, but the actual claim of a build would happen against the Pod containing the data
+- likely `ci_pending_builds` and `ci_running_builds` would have to be made `cluster-wide` tables,
+  increasing the likelihood of creating hotspots in a system related to CI queueing
+
+This model is complex to implement from an engineering side. It makes some data shared
+between Pods. It creates hotspots and scalability issues in the system (for example, during abuse) that
+might impact the experience of organizations on other Pods.
+
+### 3.5. GitLab CI Daemon
+
+Another potential solution to explore is to have a dedicated service responsible for build queueing,
+owning its database and working in a model of either a sharded or podded service. There were prior
+discussions about [CI/CD Daemon](https://gitlab.com/gitlab-org/gitlab/-/issues/19435).
+
+If the service would be sharded:
+
+- depending on the model, whether runners are cluster-wide or Pod local, this service would have to fetch
+  data from all Pods
+- if the sharded service would be used, we could adopt a model of sharing the database containing
+  `ci_pending_builds/ci_running_builds` with the service
+- if the sharded service would be used, we could consider a push model where each Pod pushes to the CI/CD Daemon
+  the builds that should be picked by a Runner
+- the sharded service would be aware of which Pod is responsible for processing the given build and could
+  route processing requests to the designated Pod
+
+If the service would be podded:
+
+- all expectations of routable endpoints are still valid
+
+In general, usage of a CI Daemon does not help significantly with the stated problem. However, it offers
+a few upsides related to more efficient processing and a decoupled model: a push model, and it opens a way
+to offer stateful communication with GitLab Runners (for example, gRPC or WebSockets).
+
+## 4. Evaluation
+
+Considering all solutions, it appears that the solution giving the most promise is:
+
+- use "instance-wide are Pod local"
+- refine endpoints to have routable identities (either via specific paths, or better tokens)
+
+Another potential upside is to get rid of `ci_builds.token` and rather use a `JWT token`
+that can much better and more easily encode a wider set of scopes allowed by the CI runner.
+
+### 4.1. Pros
+
+### 4.2. Cons