1 files changed, 227 insertions, 0 deletions
diff --git a/doc/architecture/blueprints/runner_tokens/index.md b/doc/architecture/blueprints/runner_tokens/index.md
new file mode 100644
index 00000000000..3f8a27e503d
--- /dev/null
+++ b/doc/architecture/blueprints/runner_tokens/index.md
@@ -0,0 +1,227 @@
+---
+stage: Verify
+group: Runner
+comments: false
+description: 'Next Runner Token Architecture'
+---
+
+# Next GitLab Runner Token Architecture
+
+## Summary
+
+GitLab Runner is a core component of GitLab CI/CD that runs
+CI/CD jobs in a reliable and concurrent environment. Ever since the beginnings
+of the service as a Ruby program, runners are registered in a GitLab instance with
+a registration token - a randomly generated string of text. The registration token is unique for its given scope
+(instance, group, or project). The registration token proves that the party that registers the runner has
+administrator access to the instance, group, or project to which the runner is registered.
+
+This approach has worked well in the initial years, but some major known issues started to
+become apparent as the target audience grew:
+
+| Problem                                     | Symptoms                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
+|---------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Single token per scope                      | - The registration token is shared by multiple runners: <br/>- Single tokens lower the value of auditing and make traceability almost impossible; <br/>- Copied in many places for [self-registration of runners](https://docs.gitlab.com/runner/install/kubernetes.html#required-configuration); <br/>- Reports of users storing tokens in unsecured locations; <br/>- Makes rotation of tokens costly. <br/>- In the case of a security event affecting the whole instance, rotating tokens requires users to update a table of projects/namespaces, which takes a significant amount of time. |
+| No provision for automatic expiration       | Requires manual intervention to change token. Addressed in [#30942](https://gitlab.com/gitlab-org/gitlab/-/issues/30942).                                                                                                                                                                                                                                                                                                                                                                                                                                        |
+| No permissions model                        | Used to register a runner for protected branches, and for any tags. In this case, the registration token has permission to do everything. Effectively, someone taking a possession of registration token could steal secrets or source code.                                                                                                                                                                                                                                                                                                                                                       |
+| No traceability                             | Given that the token is not created by a user, and is accessible to all administrators, there is no possibility to know the source of a leaked token.                                                                                                                                                                                                                                                                                                                                                                                                                  |
+| No historical records                       | When reset, the previous value of the registration token is not stored so there is no historical data to enable deeper auditing and inspection.                                                                                                                                                                                                                                                                                                                                                                                                                        |
+| Token stored in project/namespace model     | Inadvertent disclosure of token is possible.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
+| Too many registered runners                 | It is too straightforward to register a new runner using a well-known registration token.                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
+
+In light of these issues, it is important that we redesign the way in which we connect runners to the GitLab instance so that we can guarantee traceability, security, and performance.
+
+We call this new mechanism the "next GitLab Runner Token architecture".
+
+## Proposal
+
+The proposal addresses the issues of a _single token per scope_ and _token storage_
+by eliminating the need for a registration token. Runner creation happens
+in the GitLab Runners settings page for the given scope, in the context of the logged-in user
+, which provides traceability. The page provides instructions to configure the newly-created
+runner in supported environments.
+
+The runner configuration will be generated through a new `deploy` command, which will leverage
+the `/runners/verify` REST endpoint to ensure the validity of the authentication token.
+The remaining concerns become non-issues due to the elimination of the registration token.
+
+The configuration can be applied across many machines by reusing the same instructions.
+A unique system identifier will be generated automatically if a value is missing from
+the runner entry in the `config.toml` file. This allows differentiating systems sharing the same
+runner token (for example, in auto-scaling scenarios), and is crucial for the proper functioning of our
+long-polling mechanism when the same authentication token is shared across two or more runner managers.
+
+Given that the creation of runners involves user interaction, it should be possible
+to eventually lower the per-plan limit of CI runners that can be registered per scope.
+
+### Auto-scaling scenarios (for example Helm chart)
+
+In the existing model, a new runner is created whenever a new worker is required. This
+has led to many situations where runners are left behind and become stale.
+
+In the proposed model, a `ci_runners` table entry describes a configuration,
+which the runner could reuse across multiple machines. This allows differentiating the context in
+which the runner is being used. In situations where we must differentiate between runners
+that reuse the same configuration, we can use the unique system identifier to track all
+unique "runners" that are executed in context of a single `ci_runners` model. This unique
+system identifier would be present in the Runner's `config.toml` configuration file and
+initially set when generating the new `[[runners]]` configuration by means of the `deploy` command.
+Legacy files that miss values for unique system identifiers will get rewritten automatically with new values.
+
+### Runner identification in CI jobs
+
+For users to identify the machine where the job was executed, the unique identifier will need to be visible in CI job contexts.
+As a first iteration, GitLab Runner will include the unique system identifier in the build logs,
+wherever it publishes the short token SHA.
+
+Given that the runner will potentially be reused with different unique system identifiers,
+we can store the unique system ID. This ensures the unique system ID maps to a GitLab Runner's `config.toml` entry with
+the runner token. The `ci_runner_machines` would hold information about each unique runner machine,
+with information when runner last connected, and what type of runner it was. The relevant fields
+will be moved from the `ci_runners`.
+The `ci_builds_runner_session` (or `ci_builds` or `ci_builds_metadata`) will reference
+`ci_runner_machines`.
+We might consider a more efficient way to store `contacted_at` than updating the existing record.
+
+```sql
+CREATE TABLE ci_builds_runner_session (
+    ...
+    runner_machine_id bigint NOT NULL
+);
+
+CREATE TABLE ci_runner_machines (
+    id integer NOT NULL,
+    machine_id character varying UNIQUE NOT NULL,
+    contacted_at timestamp without time zone,
+    version character varying,
+    revision character varying,
+    platform character varying,
+    architecture character varying,
+    ip_address character varying,
+    executor_type smallint,
+);
+```
+
+## Advantages
+
+- Easier for users to wrap their minds around the concept: instead of two types of tokens,
+  there is a single type of token - the per-runner authentication token. Having two types of tokens
+  frequently results in misunderstandings when discussing issues;
+- Runners can always be traced back to the user who created it, using the audit log;
+- The claims of a CI runner are known at creation time, and cannot be changed from the runner
+  (for example, changing the `access_level`/`protected` flag). Authenticated users
+  may however still edit these settings through the GitLab UI.
+
+## Details
+
+In the proposed approach, we create a distinct way to configure runners that is usable
+alongside the current registration token method during a transition period. The idea is
+to avoid having the Runner make API calls that allow it to leverage a single "god-like"
+token to register new runners.
+
+The new workflow looks as follows:
+
+  1. The user opens the Runners settings page;
+  1. The user fills in the details regarding the new desired runner, namely description,
+  tags, protected, locked, etc.;
+  1. The user clicks `Create`. That results in the following:
+
+      1. Creates a new runner in the `ci_runners` table (and corresponding authentication token);
+      1. Presents the user with instructions on how to configure this new runner on a machine,
+         with possibilities for different supported deployment scenarios (e.g. shell, `docker-compose`, Helm chart, etc.)
+         This information contains a token which will only be available to the user once, and the UI
+         will make it clear to the user that the value will not be shown again, as registering the same runner multiple times
+         is discouraged (though not impossible).
+
+  1. The user copies and pastes the instructions for the intended deployment scenario (a `deploy` command), leading to the following actions:
+
+      1. Upon executing the new `gitlab-runner deploy` command in the instructions, `gitlab-runner` will perform
+      a call to the `POST /runners/verify` with the given runner token;
+      1. If the `POST /runners/verify` GitLab endpoint validates the token, the `config.toml` file will be populated with the configuration.
+
+     The `gitlab-runner deploy` will also accept executor-specific arguments
+     currently present in the `register` command.
+
+As part of the transition period, we will provide admins and top-level group owners with a instance/group-level setting to disable
+the legacy registration token functionality and enforce using only the new workflow.
+Any attempt by a `gitlab-runner register` command to hit the `POST /runners` endpoint to register a new runner
+will result in a `HTTP 410 - Gone` status code. The instance setting is inherited by the groups
+, which means that if the legacy registration method is disabled at the instance method, the descendant groups/projects will also mandatorily
+prevent the legacy registration method.
+
+The registration token workflow is to be deprecated (with a deprecation notice printed by the `gitlab-runner register` command)
+and removed at a future major release after the concept is proven stable and customers have migrated to the new workflow.
+
+### Handling of legacy runners
+
+Legacy versions of GitLab Runner will not send the unique system identifier in its requests, and we
+will not change logic in Workhorse to handle unique system IDs. This can be improved upon in the
+future once the legacy registration system is removed, and runners have been upgraded to newer
+versions.
+
+Not using the unique system ID means that all connected runners with the same token will be
+notified, instead of just the runner matching the exact system identifier. While not ideal, this is
+not an issue per-se.
+
+### Helm chart
+
+The `runnerRegistrationToken` entry in the [`values.yaml` file](https://gitlab.com/gitlab-org/charts/gitlab-runner/-/blob/a70bc29a903b79d5675bb0c45d981adf8b7a8659/values.yaml#L52)
+will be retired. The `runnerRegistrationToken` entry will be replaced by the existing `runnerToken` value, which will be passed
+to the new `gitlab-runner deploy` command in [`configmap.yaml`](https://gitlab.com/gitlab-org/charts/gitlab-runner/-/blob/a70bc29a903b79d5675bb0c45d981adf8b7a8659/templates/configmap.yaml#L116).
+
+### Runner creation through API
+
+Automated runner creation may be allowed, although always through authenticated API calls -
+using PAT tokens for example - such that every runner is associated with an owner.
+
+## Implementation plan
+
+| Component        | Milestone | Changes |
+|------------------|-----------|---------|
+| GitLab Rails app | `15.x` (latest at `15.6`) | Deprecate `POST /api/v4/runners` endpoint for `16.0`. This hinges on a [proposal](https://gitlab.com/gitlab-org/gitlab/-/issues/373774) to allow deprecating REST API endpoints for security reasons. |
+| GitLab Runner    | `15.x` (latest at `15.8`) | Add deprecation notice for `register` command for `16.0`. |
+| GitLab Runner    | `15.x` | Ensure all runner entries in `config.toml` have unique system identifier values assigned. Log new system ID values with `INFO` level as they get created. |
+| GitLab Runner    | `15.x` | Start additionally logging unique system ID anywhere we log the runner short SHA. |
+| GitLab Rails app | `15.x` | Create database migrations to add settings from `application_settings` and `namaspace_settings` tables. |
+| GitLab Runner    | `15.x` | Start sending `unique_id` value in `POST /jobs/request` request and other follow-up requests that require identifying the unique system. |
+| GitLab Runner    | `15.x` | Implement new user-authenticated API (REST and GraphQL) to create a new runner. |
+| GitLab Rails app | `15.x` | Implement UI to create new runner. |
+| GitLab Runner    | `16.0` | Remove `register` command and support for `POST /runners` endpoint. |
+| GitLab Rails app | `16.0` | Remove legacy UI showing registration with a registration token. |
+| GitLab Rails app | `16.0` | Create database migrations to remove settings from `application_settings` and `namaspace_settings` tables. |
+| GitLab Rails app | `16.0` | Make [`POST /api/v4/runners` endpoint](../../../api/runners.md#register-a-new-runner-deprecated) permanently return `410 Gone`. A future v5 version of the API would return `404 Not Found`. |
+| GitLab Rails app | `16.0` | Start refusing job requests that don't include a unique ID. |
+
+## Status
+
+Status: RFC.
+
+## Who
+
+Proposal:
+
+<!-- vale gitlab.Spelling = NO -->
+
+| Role                         | Who
+|------------------------------|--------------------------------------------------|
+| Authors                      | Kamil Trzciński, Tomasz Maczukin, Pedro Pombeiro |
+| Architecture Evolution Coach | Kamil Trzciński                                  |
+| Engineering Leader           | Elliot Rushton, Cheryl Li                        |
+| Product Manager              | Darren Eastman, Jackie Porter                    |
+| Domain Expert / Runner       | Tomasz Maczukin                                  |
+
+DRIs:
+
+| Role                         | Who                             |
+|------------------------------|---------------------------------|
+| Leadership                   | Elliot Rushton                  |
+| Product                      | Darren Eastman                  |
+| Engineering                  | Tomasz Maczukin, Pedro Pombeiro |
+
+Domain experts:
+
+| Area                         | Who             |
+|------------------------------|-----------------|
+| Domain Expert / Runner       | Tomasz Maczukin |
+
+<!-- vale gitlab.Spelling = YES -->