Diffstat (limited to 'doc/architecture/blueprints')
-rw-r--r-- doc/architecture/blueprints/_template.md | 7
-rw-r--r-- doc/architecture/blueprints/ci_data_decay/index.md | 60
-rw-r--r-- doc/architecture/blueprints/ci_data_decay/pipeline_partitioning.md | 48
-rw-r--r-- doc/architecture/blueprints/ci_pipeline_components/index.md | 58
-rw-r--r-- doc/architecture/blueprints/cloud_native_build_logs/index.md | 38
-rw-r--r-- doc/architecture/blueprints/cloud_native_gitlab_pages/index.md | 47
-rw-r--r-- doc/architecture/blueprints/composable_codebase_using_rails_engines/index.md | 46
-rw-r--r-- doc/architecture/blueprints/consolidating_groups_and_projects/index.md | 40
-rw-r--r-- doc/architecture/blueprints/container_registry_metadata_database/index.md | 33
-rw-r--r-- doc/architecture/blueprints/database/scalability/patterns/read_mostly.md | 2
-rw-r--r-- doc/architecture/blueprints/database/scalability/patterns/time_decay.md | 2
-rw-r--r-- doc/architecture/blueprints/database_scaling/size-limits.md | 2
-rw-r--r-- doc/architecture/blueprints/database_testing/index.md | 35
-rw-r--r-- doc/architecture/blueprints/feature_flags_development/index.md | 35
-rw-r--r-- doc/architecture/blueprints/gitlab_to_kubernetes_communication/index.md | 37
-rw-r--r-- doc/architecture/blueprints/graphql_api/index.md | 51
-rw-r--r-- doc/architecture/blueprints/image_resizing/index.md | 12
-rw-r--r-- doc/architecture/blueprints/object_storage/index.md | 32
-rw-r--r-- doc/architecture/blueprints/pods/images/iteration0-organizations-introduction.png | bin 0 -> 67160 bytes
-rw-r--r-- doc/architecture/blueprints/pods/images/pods-and-fulfillment.png | bin 0 -> 75803 bytes
-rw-r--r-- doc/architecture/blueprints/pods/images/term-cluster.png | bin 0 -> 63268 bytes
-rw-r--r-- doc/architecture/blueprints/pods/images/term-organization.png | bin 0 -> 7150 bytes
-rw-r--r-- doc/architecture/blueprints/pods/images/term-pod.png (renamed from doc/architecture/blueprints/pods/term-pod.png) | bin 16104 -> 16104 bytes
-rw-r--r-- doc/architecture/blueprints/pods/images/term-top-level-namespace.png (renamed from doc/architecture/blueprints/pods/term-top-level-namespace.png) | bin 11451 -> 11451 bytes
-rw-r--r-- doc/architecture/blueprints/pods/index.md | 122
-rw-r--r-- doc/architecture/blueprints/pods/iteration0-organizations-introduction.png | bin 326285 -> 0 bytes
-rw-r--r-- doc/architecture/blueprints/pods/pods-feature-data-migration.md | 82
-rw-r--r-- doc/architecture/blueprints/pods/pods-feature-database-sequences.md | 94
-rw-r--r-- doc/architecture/blueprints/pods/pods-feature-git-access.md | 163
-rw-r--r-- doc/architecture/blueprints/pods/pods-feature-graphql.md | 94
-rw-r--r-- doc/architecture/blueprints/pods/pods-feature-organizations.md | 58
-rw-r--r-- doc/architecture/blueprints/pods/pods-feature-router-endpoints-classification.md | 46
-rw-r--r-- doc/architecture/blueprints/pods/pods-feature-template.md | 29
-rw-r--r-- doc/architecture/blueprints/pods/proposal-stateless-router-with-buffering-requests.md | 648
-rw-r--r-- doc/architecture/blueprints/pods/proposal-stateless-router-with-routes-learning.md | 672
-rw-r--r-- doc/architecture/blueprints/pods/term-cluster.png | bin 271291 -> 0 bytes
-rw-r--r-- doc/architecture/blueprints/pods/term-organization.png | bin 22575 -> 0 bytes
-rw-r--r-- doc/architecture/blueprints/rate_limiting/index.md | 84
-rw-r--r-- doc/architecture/blueprints/runner_scaling/index.md | 47
-rw-r--r-- doc/architecture/blueprints/runner_tokens/index.md | 227
-rw-r--r-- doc/architecture/blueprints/work_items/index.md | 66
41 files changed, 2436 insertions, 581 deletions
diff --git a/doc/architecture/blueprints/_template.md b/doc/architecture/blueprints/_template.md
index 7637c3bf5fa..798d51a97ad 100644
--- a/doc/architecture/blueprints/_template.md
+++ b/doc/architecture/blueprints/_template.md
@@ -1,14 +1,11 @@
---
-stage: none
-group: unassigned
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
status: proposed
creation-date: yyyy-mm-dd
authors: [ "@username" ]
coach: "@username"
-owning-section: "~section::<section>"
-participating-sections: []
approvers: [ "@product-manager", "@engineering-manager" ]
+owning-stage: "~devops::<stage>"
+participating-stages: []
---
<!--
diff --git a/doc/architecture/blueprints/ci_data_decay/index.md b/doc/architecture/blueprints/ci_data_decay/index.md
index 221c2364f79..b7c3bdde2f8 100644
--- a/doc/architecture/blueprints/ci_data_decay/index.md
+++ b/doc/architecture/blueprints/ci_data_decay/index.md
@@ -1,8 +1,11 @@
---
-stage: none
-group: unassigned
-comments: false
-description: 'CI/CD data time decay'
+status: ready
+creation-date: "2021-09-10"
+authors: [ "@grzesiek" ]
+coach: "@kamil"
+approvers: [ "@jporter", "@cheryl.li" ]
+owning-stage: "~devops::verify"
+participating-stages: []
---
# CI/CD data time decay
@@ -23,18 +26,6 @@ the data storage for pipeline builds remains almost the same since 2012. In
a separate database. Now we want to improve the architecture of GitLab CI/CD
product to enable further scaling.
-_Disclaimer: The following contains information related to upcoming products,
-features, and functionality._
-
-_It is important to note that the information presented is for informational
-purposes only. Please do not rely on this information for purchasing or
-planning purposes._
-
-_As with all projects, the items mentioned in this document and linked pages are
-subject to change or delay. The development, release and timing of any
-products, features, or functionality remain at the sole discretion of GitLab
-Inc._
-
## Goals
**Implement a new architecture of CI/CD data storage to enable scaling.**
@@ -67,7 +58,7 @@ When a build gets archived it will not be possible to retry it, but we still do
keep all the processing metadata in the database, and it consumes resources
that are scarce in the primary database.
-In order to improve performance and make it easier to scale CI/CD data storage
+To improve performance and make it easier to scale CI/CD data storage
we might want to follow these three tracks described below.
![pipeline data time decay](pipeline_data_time_decay.png)
@@ -210,7 +201,7 @@ We accept the possible necessity of building a separate API endpoint /
endpoints needed to access pipeline data through the API.
In the new API users might need to provide a time range in which the data has
-been created to search through their pipelines / builds. In order to make it
+been created to search through their pipelines / builds. To make it
efficient it might be necessary to restrict access to querying data residing in
more than two partitions at once. We can do that by supporting time ranges
spanning a duration equal to the builds archival policy.
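
The time range restriction described above can be sketched as follows. This is a minimal illustration, not the proposed API: the function names, the 90-day archival policy, and the assumption that every partition spans exactly one archival period are all hypothetical.

```python
from datetime import datetime, timedelta


def validate_time_range(start, end, archival_policy):
    """Reject ranges that are empty or longer than the archival policy."""
    if end <= start:
        raise ValueError("end must be later than start")
    if end - start > archival_policy:
        raise ValueError("time range exceeds the builds archival policy")
    return start, end


def partitions_touched(start, end, partition_span, epoch):
    """List the indexes of the time-based partitions a range overlaps.

    Assumes (for illustration only) that partitions are consecutive,
    each `partition_span` long, and that the first one starts at `epoch`.
    """
    first = (start - epoch) // partition_span
    last = (end - epoch) // partition_span
    return list(range(first, last + 1))


# Hypothetical values: a 90-day policy and partitions starting on 2022-01-01.
policy = timedelta(days=90)
epoch = datetime(2022, 1, 1)
start, end = validate_time_range(datetime(2022, 3, 1), datetime(2022, 5, 30), policy)
touched = partitions_touched(start, end, policy, epoch)
```

Under these assumptions, a range no longer than one partition span can overlap at most two neighboring partitions, which is why capping the range at the archival policy keeps a query within two partitions.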
@@ -246,35 +237,4 @@ In progress.
- 2022-04-30: Additional [benchmarking started](https://gitlab.com/gitlab-org/gitlab/-/issues/361019) to evaluate impact.
- 2022-06-31: [Pipeline partitioning design](pipeline_partitioning.md) document [merge request](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/87683) merged.
- 2022-09-01: Engineering effort started to implement partitioning.
-
-## Who
-
-Proposal:
-
-<!-- vale gitlab.Spelling = NO -->
-
-| Role | Who
-|------------------------------|-------------------------|
-| Author | Grzegorz Bizon |
-| Engineering Leader | Cheryl Li |
-| Product Manager | Jackie Porter |
-| Architecture Evolution Coach | Kamil Trzciński |
-
-DRIs:
-
-| Role | Who
-|------------------------------|------------------------|
-| Leadership | Cheryl Li |
-| Product | Jackie Porter |
-| Engineering | Grzegorz Bizon |
-
-Domain experts:
-
-| Area | Who
-|------------------------------|------------------------|
-| Verify / Pipeline execution | Fabio Pitino |
-| Verify / Pipeline execution | Marius Bobin |
-| Verify / Pipeline insights | Maxime Orefice |
-| PostgreSQL Database | Andreas Brandl |
-
-<!-- vale gitlab.Spelling = YES -->
+- 2022-11-01: The fastest growing CI table partitioned: `ci_builds_metadata`.
diff --git a/doc/architecture/blueprints/ci_data_decay/pipeline_partitioning.md b/doc/architecture/blueprints/ci_data_decay/pipeline_partitioning.md
index 5f907ecdaa4..d61412ae1ed 100644
--- a/doc/architecture/blueprints/ci_data_decay/pipeline_partitioning.md
+++ b/doc/architecture/blueprints/ci_data_decay/pipeline_partitioning.md
@@ -7,18 +7,6 @@ description: 'Pipeline data partitioning design'
# Pipeline data partitioning design
-_Disclaimer: The following contains information related to upcoming products,
-features, and functionality._
-
-_It is important to note that the information presented is for informational
-purposes only. Please do not rely on this information for purchasing or
-planning purposes._
-
-_As with all projects, the items mentioned in this document and linked pages
-are subject to change or delay. The development, release and timing of any
-products, features, or functionality remain at the sole discretion of GitLab
-Inc._
-
## What problem are we trying to solve?
We want to partition the CI/CD dataset, because some of the database tables are
@@ -267,9 +255,27 @@ new routing tables. Depending on the chosen
[partitioning strategy](#how-do-we-want-to-partition-cicd-data) for a given
table, it is possible to have many logical partitions per one physical partition.
+### Attaching first partition and acquiring locks
+
+We learned when [partitioning](https://gitlab.com/gitlab-org/gitlab/-/issues/378644)
+the first table that `PostgreSQL` requires an `AccessExclusiveLock` on the table and
+all of the other tables that it references through foreign keys. This can cause a deadlock
+if the migration tries to acquire the locks in a different order from the application
+business logic.
+
+To solve this problem, we introduced a **priority locking strategy** to avoid
+further deadlock errors. This allows us to define the locking order and
+then keep retrying aggressively until we acquire the locks or run out of retries.
+This process can take up to 40 minutes.
+
+With this strategy, we successfully acquired a lock on the `ci_builds` table after 15 retries
+during a low traffic period ([after `00:00 UTC`](https://dashboards.gitlab.net/d/web-main/web-overview?orgId=1&viewPanel=537181794&from=now-2d&to=now)).
+
+See an example of this strategy in our [partition tooling](../../../development/database/table_partitioning.md#step-6---create-parent-table-and-attach-existing-table-as-the-initial-partition).
+
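
The idea behind the priority locking strategy in the section above can be sketched in a few lines. This is only an illustration under stated assumptions: in-process `threading.Lock` objects stand in for PostgreSQL table locks, the function names and retry parameters are hypothetical, and `ci_pipelines` is used purely as an example of a second table. It is not the migration helper GitLab uses.

```python
import threading
import time


def lock_in_priority_order(locks, order, max_retries=5, timeout=0.05, backoff=0.01):
    """Try to acquire every lock named in `order`, always in that order.

    Returns the 1-based attempt number on success and raises
    TimeoutError once the retries are exhausted.
    """
    for attempt in range(1, max_retries + 1):
        held = []
        for name in order:
            if locks[name].acquire(timeout=timeout):
                held.append(name)
            else:
                break
        if len(held) == len(order):
            return attempt
        # Release whatever was acquired so other lock holders can make
        # progress, then retry the whole sequence from the start.
        for name in reversed(held):
            locks[name].release()
        time.sleep(backoff)
    raise TimeoutError(f"could not acquire {order} after {max_retries} attempts")


def release_all(locks, order):
    """Release the locks in the reverse of the acquisition order."""
    for name in reversed(order):
        locks[name].release()
```

Because every caller requests the locks in the same order and gives up everything it holds before retrying, two callers cannot end up each waiting on a lock the other one holds.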
## Storing partitions metadata in the database
-In order to build an efficient mechanism that will be responsible for creating
+To build an efficient mechanism that will be responsible for creating
new partitions, and to implement time decay we want to introduce a partitioning
metadata table, called `ci_partitions`. In that table we would store metadata
about all the logical partitions, with many pipelines per partition. We may
@@ -366,6 +372,20 @@ scope block takes an argument). Preloading instance dependent scopes is not
supported.
```
+### Query analyzers
+
+We implemented 2 query analyzers to detect queries that need to be fixed so that everything
+keeps working with partitioned tables:
+
+- One analyzer to detect queries not going through a routing table.
+- One analyzer to detect queries that use routing tables without specifying the `partition_id` in the `WHERE` clauses.
+
+We started by enabling our first analyzer in the `test` environment to detect existing broken
+queries. It is also enabled in the `production` environment, but only for a small subset of the traffic (`0.1%`)
+because of scalability concerns.
+
+The second analyzer will be enabled in a future iteration.
+
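
The check performed by the second analyzer described above could look roughly like the sketch below. It is a simplified, regex-based illustration: the routing table name `p_ci_builds_metadata` and the function name are hypothetical, and a real analyzer would more likely inspect a parsed query than raw SQL text.

```python
import re

# Hypothetical routing table names, used only for this illustration.
ROUTING_TABLES = {"p_ci_builds_metadata"}


def missing_partition_filter(sql, routing_tables=ROUTING_TABLES):
    """Return the routing tables that `sql` uses without a `partition_id` filter."""
    lowered = sql.lower()
    used = {
        table
        for table in routing_tables
        if re.search(rf"\b{re.escape(table)}\b", lowered)
    }
    if not used:
        return set()
    # Everything after the first WHERE keyword is treated as the filter.
    parts = re.split(r"\bwhere\b", lowered, maxsplit=1)
    where_clause = parts[1] if len(parts) > 1 else ""
    if re.search(r"\bpartition_id\b", where_clause):
        return set()
    return used
```

A query such as `SELECT * FROM p_ci_builds_metadata WHERE build_id = 1` would be flagged, while the same query with an additional `partition_id = 100` condition would pass.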
### Primary key
Primary key must include the partitioning key column to partition the table.
@@ -652,7 +672,7 @@ application-wide outage.
1. Make it possible to create partitions in an automatic way.
1. Deliver the new architecture to self-managed instances.
-The diagram below visualizes this plan on Gantt chart. Please note that dates
+The diagram below visualizes this plan on a Gantt chart. The dates
on the chart below are just estimates to visualize the plan better; these are
not deadlines and can change at any time.
diff --git a/doc/architecture/blueprints/ci_pipeline_components/index.md b/doc/architecture/blueprints/ci_pipeline_components/index.md
index 115f6909d2d..a3c72227f3e 100644
--- a/doc/architecture/blueprints/ci_pipeline_components/index.md
+++ b/doc/architecture/blueprints/ci_pipeline_components/index.md
@@ -1,12 +1,13 @@
---
-stage: Stage
-group: Pipeline Authoring
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
-comments: false
-description: 'Create a catalog of shareable pipeline constructs'
+status: proposed
+creation-date: "2022-09-14"
+authors: [ "@fabio", "@grzesiek" ]
+coach: "@kamil"
+approvers: [ "@dov" ]
+owning-stage: "~devops::verify"
+participating-stages: []
---
-
# CI/CD pipeline components catalog
## Summary
@@ -115,21 +116,22 @@ while encapsulating and isolating implementation details.
Components allow a pipeline to be assembled by using abstractions instead of having all the details defined in one place.
When using a component in a pipeline, a user shouldn't need to know the implementation details of the component and should
-only rely on the provided interface. The interface will have a version / revision, so that users understand which revision they are interfacing with.
+only rely on the provided interface.
A pipeline component defines its type which indicates in which context of the pipeline configuration the component can be used.
For example, a component of type X can only be used according to the type X use-case.
-For best experience with any systems made of components it's fundamental that components are single purpose,
-isolated, reusable and resolvable.
+For the best experience with any system made of components, it's fundamental that components are:
- **Single purpose**: a component must focus on a single goal and the scope be as small as possible.
-- **Isolation**: when a component is used in a pipeline, its implementation details should not leak outside the
+- **Isolated**: when a component is used in a pipeline, its implementation details should not leak outside the
component itself and into the main pipeline.
-- **Reusability:** a component is designed to be used in different pipelines.
+- **Reusable**: a component is designed to be used in different pipelines.
Depending on the assumptions it's built on a component can be more or less generic.
Generic components are more reusable but may require more customization.
-- **Resolvable:** When a component depends on another component, this dependency must be explicit and trackable.
+- **Versioned**: when using a component we must specify the version we are interested in.
+ The version identifies the exact interface and behavior of the component.
+- **Resolvable**: when a component depends on another component, this dependency must be explicit and trackable.
## Proposal
@@ -186,35 +188,3 @@ Some limits we could consider adding:
- Allow self-managed administrators to populate their self-managed catalog by importing/updating
components from GitLab.com or from repository exports.
- Iterate on feedback.
-
-## Who
-
-Proposal:
-
-<!-- vale gitlab.Spelling = NO -->
-
-| Role | Who
-|------------------------------|-------------------------|
-| Author | Fabio Pitino |
-| Engineering Leader | ? |
-| Product Manager | Dov Hershkovitch |
-| Architecture Evolution Coach | Kamil Trzciński |
-
-DRIs:
-
-| Role | Who
-|------------------------------|------------------------|
-| Leadership | ? |
-| Product | Dov Hershkovitch |
-| Engineering | ? |
-| UX | Nadia Sotnikova |
-
-Domain experts:
-
-| Area | Who
-|------------------------------|------------------------|
-| Verify / Pipeline authoring | Avielle Wolfe |
-| Verify / Pipeline authoring | Furkan Ayhan |
-| Verify / Pipeline execution | Fabio Pitino |
-
-<!-- vale gitlab.Spelling = YES -->
diff --git a/doc/architecture/blueprints/cloud_native_build_logs/index.md b/doc/architecture/blueprints/cloud_native_build_logs/index.md
index b77d7998fc8..20cfb46abc4 100644
--- a/doc/architecture/blueprints/cloud_native_build_logs/index.md
+++ b/doc/architecture/blueprints/cloud_native_build_logs/index.md
@@ -1,9 +1,11 @@
---
-stage: none
-group: unassigned
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
-comments: false
-description: 'Next iteration of build logs architecture at GitLab'
+status: implemented
+creation-date: "2020-08-26"
+authors: [ "@grzesiek" ]
+coach: "@kamil"
+approvers: [ "@thaoyeager", "@darbyfrey" ]
+owning-stage: "~devops::release"
+participating-stages: []
---
# Cloud Native Build Logs
@@ -31,7 +33,7 @@ a job is complete, the trace file contents are sent to the object store.
New architecture writes data to Redis instead of writing build logs into a
file.
-In order to make this performant and resilient enough, we implemented a chunked
+To make this performant and resilient enough, we implemented a chunked
I/O mechanism - we store data in Redis in chunks, and migrate them to an object
store once we reach a desired chunk size.
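
The chunked I/O mechanism described above can be illustrated with a small in-memory sketch. The class name and chunk size are hypothetical, and plain Python containers stand in for Redis and the object store; this is not how the production code is structured.

```python
class ChunkedLog:
    """Buffer log data in a live chunk and archive it once it is full."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.live = bytearray()  # stands in for the chunk kept in Redis
        self.archived = []       # stands in for chunks in the object store

    def append(self, data: bytes):
        """Add data to the live chunk and migrate every full chunk."""
        self.live.extend(data)
        while len(self.live) >= self.chunk_size:
            self.archived.append(bytes(self.live[:self.chunk_size]))
            del self.live[:self.chunk_size]

    def read(self):
        """Return the whole log: archived chunks followed by the live chunk."""
        return b"".join(self.archived) + bytes(self.live)


log = ChunkedLog(chunk_size=4)
log.append(b"abcdef")  # one full chunk is archived, b"ef" stays live
```

Writes only touch the small live chunk, while completed chunks move to cheaper storage, and a reader still sees one continuous log.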
@@ -121,27 +123,3 @@ Enabling this feature on GitLab.com is a subtask of
This change has been implemented and enabled on GitLab.com.
We are working on [an epic to make this feature more resilient and observable](https://gitlab.com/groups/gitlab-org/-/epics/4860).
-
-## Who
-
-Proposal:
-
-<!-- vale gitlab.Spelling = NO -->
-
-| Role | Who
-|------------------------------|-------------------------|
-| Author | Grzegorz Bizon |
-| Architecture Evolution Coach | Gerardo Lopez-Fernandez |
-| Engineering Leader | Darby Frey |
-| Domain Expert | Kamil Trzciński |
-| Domain Expert | Sean McGivern |
-
-DRIs:
-
-| Role | Who
-|------------------------------|------------------------|
-| Product | Thao Yeager |
-| Leadership | Darby Frey |
-| Engineering | Grzegorz Bizon |
-
-<!-- vale gitlab.Spelling = YES -->
diff --git a/doc/architecture/blueprints/cloud_native_gitlab_pages/index.md b/doc/architecture/blueprints/cloud_native_gitlab_pages/index.md
index 127badabb71..b6f3a59dc0b 100644
--- a/doc/architecture/blueprints/cloud_native_gitlab_pages/index.md
+++ b/doc/architecture/blueprints/cloud_native_gitlab_pages/index.md
@@ -1,9 +1,11 @@
---
-stage: none
-group: unassigned
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
-comments: false
-description: 'Making GitLab Pages a Cloud Native application - architecture blueprint.'
+status: implemented
+creation-date: "2019-05-16"
+authors: [ "@grzesiek" ]
+coach: "@kamil"
+approvers: [ "@ogolowinski", "@dcroft", "@vshushlin" ]
+owning-stage: "~devops::release"
+participating-stages: []
---
# GitLab Pages New Architecture
@@ -100,38 +102,3 @@ too.
[GitLab Pages Architecture](https://gitlab.com/groups/gitlab-org/-/epics/1316)
epic with detailed roadmap is also available.
-
-## Who
-
-Proposal:
-
-<!-- vale gitlab.Spelling = NO -->
-
-| Role | Who
-|------------------------------|-------------------------|
-| Author | Grzegorz Bizon |
-| Architecture Evolution Coach | Kamil Trzciński |
-| Engineering Leader | Daniel Croft |
-| Domain Expert | Grzegorz Bizon |
-| Domain Expert | Vladimir Shushlin |
-| Domain Expert | Jaime Martinez |
-
-DRIs:
-
-| Role | Who
-|------------------------------|------------------------|
-| Product | Orit Golowinski |
-| Leadership | Daniel Croft |
-| Engineering | Vladimir Shushlin |
-
-Domain Experts:
-
-| Role | Who
-|------------------------------|------------------------|
-| Domain Expert | Kamil Trzciński |
-| Domain Expert | Grzegorz Bizon |
-| Domain Expert | Vladimir Shushlin |
-| Domain Expert | Jaime Martinez |
-| Domain Expert | Krasimir Angelov |
-
-<!-- vale gitlab.Spelling = YES -->
diff --git a/doc/architecture/blueprints/composable_codebase_using_rails_engines/index.md b/doc/architecture/blueprints/composable_codebase_using_rails_engines/index.md
index 4111e2ef056..53f38fa85fd 100644
--- a/doc/architecture/blueprints/composable_codebase_using_rails_engines/index.md
+++ b/doc/architecture/blueprints/composable_codebase_using_rails_engines/index.md
@@ -1,16 +1,18 @@
---
-stage: none
-group: unassigned
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
-comments: false
-description: 'Making a GitLab codebase composable - allowing to run parts of the application'
+status: proposed
+creation-date: "2021-05-19"
+authors: [ "@kamil", "@mkaeppler" ]
+coach: "@glopezfernandez"
+approvers: []
+owning-stage: "~devops::non_devops"
+participating-stages: []
---
+# Composable GitLab codebase - using Rails Engines
+
NOTE:
Due to our focus on improving the overall availability of GitLab.com and reducing tech debt, we do not have capacity to act on this blueprint. We will re-evaluate in Q1-FY23.
-# Composable GitLab codebase - using Rails Engines
-
One of the major risks of a single codebase is the unbounded growth of the whole
application. Adding more code results not only in ever-increasing resource requirements
for running the application, but also in increased application coupling and an explosion of complexity.
@@ -585,33 +587,3 @@ to be created to ensure that we do not have explosion of engines.
- [Use nested structure to organize CI classes](https://gitlab.com/gitlab-org/gitlab/-/issues/209745)
- [WIP: Make it simple to build and use "Decoupled Services"](https://gitlab.com/gitlab-org/gitlab/-/issues/31121)
- [Rails takes awhile to boot, let's see if we can improve this](https://gitlab.com/gitlab-org/gitlab/-/issues/213992)
-
-## Who
-
-Proposal:
-
-<!-- vale gitlab.Spelling = NO -->
-
-| Role | Who
-|------------------------------|-------------------------|
-| Author | Kamil Trzciński |
-| Architecture Evolution Coach | ? |
-| Engineering Leader | ? |
-
-DRIs:
-
-| Role | Who
-|------------------------------|------------------------|
-| Product | ? |
-| Leadership | Craig Gomes |
-| Engineering | ? |
-
-Domain Experts:
-
-| Role | Who
-|------------------------------|------------------------|
-| Domain Expert | Nikola Milojevic |
-| Domain Expert | ? |
-| Domain Expert | ? |
-
-<!-- vale gitlab.Spelling = YES -->
diff --git a/doc/architecture/blueprints/consolidating_groups_and_projects/index.md b/doc/architecture/blueprints/consolidating_groups_and_projects/index.md
index 433c23bf188..0818d9b973d 100644
--- a/doc/architecture/blueprints/consolidating_groups_and_projects/index.md
+++ b/doc/architecture/blueprints/consolidating_groups_and_projects/index.md
@@ -1,9 +1,11 @@
---
-stage: none
-group: unassigned
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
-comments: false
-description: Consolidating groups and projects
+status: proposed
+creation-date: "2021-02-07"
+authors: [ "@alexpooley", "@ifarkas" ]
+coach: "@grzesiek"
+approvers: [ "@m_gill", "@mushakov" ]
+owning-stage: "~devops::plan"
+participating-stages: []
---
# Consolidating Groups and Projects
@@ -143,34 +145,6 @@ The initial iteration will provide a framework to house features under `Namespac
- Start small: What are the product changes that need to be made to assist with the migration?
- Move fast: Prioritise these solution ideas, document in issues, and create a roadmap for implementation.
-## Who
-
-Proposal:
-
-<!-- vale gitlab.Spelling = NO -->
-
-| Role | Who
-|------------------------------|-------------------------------------|
-| Author | Alex Pooley, Imre Farkas |
-| Architecture Evolution Coach | Dmitriy Zaporozhets, Grzegorz Bizon |
-| Engineering Leader | Michelle Gill |
-| Domain Expert | Jan Provaznik |
-
-<!-- vale gitlab.Spelling = YES -->
-
-DRIs:
-
-<!-- vale gitlab.Spelling = NO -->
-
-| Role | Who
-|------------------------------|------------------------|
-| Product | Melissa Ushakov |
-| Leadership | Michelle Gill |
-| Engineering | Imre Farkas |
-| Design | Nick Post |
-
-<!-- vale gitlab.Spelling = YES -->
-
## Related topics
- [Workspace developer documentation](../../../development/workspace/index.md)
diff --git a/doc/architecture/blueprints/container_registry_metadata_database/index.md b/doc/architecture/blueprints/container_registry_metadata_database/index.md
index 58d59fe5737..63e27286756 100644
--- a/doc/architecture/blueprints/container_registry_metadata_database/index.md
+++ b/doc/architecture/blueprints/container_registry_metadata_database/index.md
@@ -1,9 +1,11 @@
---
-stage: Package
-group: Package
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
-comments: false
-description: 'Container Registry metadata database'
+status: implemented
+creation-date: "2020-09-29"
+authors: [ "@jdrpereira" ]
+coach: "@glopezfernandez"
+approvers: [ "@trizzi", "@hswimelar" ]
+owning-stage: "~devops::package"
+participating-stages: []
---
# Container Registry Metadata Database
@@ -344,24 +346,3 @@ A more detailed list of all tasks, as well as periodic progress updates can be f
- [Gradual migration proposal for the GitLab.com container registry](https://gitlab.com/gitlab-org/container-registry/-/issues/191)
- [Create a self-serve registry deployment](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/316)
- [Database cluster for container registry](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/11154)
-
-## Who
-
-Proposal:
-
-<!-- vale gitlab.Spelling = NO -->
-
-| Role | Who
-|------------------------------|-------------------------|
-| Author | João Pereira |
-| Architecture Evolution Coach | Gerardo Lopez-Fernandez |
-| Engineering Leader | |
-| Domain Expert | Hayley Swimelar |
-
-DRIs:
-
-| Role | Who
-|------------------------------|------------------------|
-| Product | Tim Rizzi |
-| Leadership | |
-| Engineering | João Pereira |
diff --git a/doc/architecture/blueprints/database/scalability/patterns/read_mostly.md b/doc/architecture/blueprints/database/scalability/patterns/read_mostly.md
index 6cf8e17edeb..ec236c9bfe3 100644
--- a/doc/architecture/blueprints/database/scalability/patterns/read_mostly.md
+++ b/doc/architecture/blueprints/database/scalability/patterns/read_mostly.md
@@ -8,7 +8,7 @@ description: 'Learn how to scale operating on read-mostly data at scale'
# Read-mostly data
-[Introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/326037) in GitLab 14.0.
+> [Introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/326037) in GitLab 14.0.
This document describes the *read-mostly* pattern introduced in the
[Database Scalability Working Group](https://about.gitlab.com/company/team/structure/working-groups/database-scalability/#read-mostly-data).
diff --git a/doc/architecture/blueprints/database/scalability/patterns/time_decay.md b/doc/architecture/blueprints/database/scalability/patterns/time_decay.md
index ff5f7c25ea1..93f5dffd3f5 100644
--- a/doc/architecture/blueprints/database/scalability/patterns/time_decay.md
+++ b/doc/architecture/blueprints/database/scalability/patterns/time_decay.md
@@ -8,7 +8,7 @@ description: 'Learn how to operate on large time-decay data'
# Time-decay data
-[Introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/326035) in GitLab 14.0.
+> [Introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/326035) in GitLab 14.0.
This document describes the *time-decay pattern* introduced in the
[Database Scalability Working Group](https://about.gitlab.com/company/team/structure/working-groups/database-scalability/#time-decay-data).
diff --git a/doc/architecture/blueprints/database_scaling/size-limits.md b/doc/architecture/blueprints/database_scaling/size-limits.md
index 0bb1ae9efb4..e530bd6eff0 100644
--- a/doc/architecture/blueprints/database_scaling/size-limits.md
+++ b/doc/architecture/blueprints/database_scaling/size-limits.md
@@ -117,7 +117,7 @@ limit 30;
NOTE:
In PostgreSQL context, a **physical table** is either a regular table or a partition of a partitioned table.
-In order to maintain and improve operational stability and lessen development burden, we target a **table size less than 100 GB for a physical table on GitLab.com** (including its indexes). This has numerous benefits:
+To maintain and improve operational stability and lessen development burden, we target a **table size less than 100 GB for a physical table on GitLab.com** (including its indexes). This has numerous benefits:
1. Improved query performance and more stable query plans
1. Significantly reduce vacuum run times and increase frequency of vacuum runs to maintain a healthy state - reducing overhead on the database primary
diff --git a/doc/architecture/blueprints/database_testing/index.md b/doc/architecture/blueprints/database_testing/index.md
index 3f8041ea416..fe6dcf1723d 100644
--- a/doc/architecture/blueprints/database_testing/index.md
+++ b/doc/architecture/blueprints/database_testing/index.md
@@ -1,9 +1,11 @@
---
-stage: none
-group: unassigned
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
-comments: false
-description: 'Database Testing'
+status: accepted
+creation-date: "2021-02-08"
+authors: [ "@abrandl" ]
+coach: "@glopezfernandez"
+approvers: [ "@fabian", "@craig-gomes" ]
+owning-stage: "~devops::data_stores"
+participating-stages: []
---
# Database Testing
@@ -122,26 +124,3 @@ An alternative approach we have discussed and abandoned is to "scrub" and anonym
- Annotating data as "sensitive" is error prone, with the wrong anonymization approach used for a data type or one sensitive attribute accidentally not marked as such possibly leading to a data breach.
- Scrubbing not only removes sensitive data, but it also changes data distribution, which greatly affects performance of migrations and queries.
- Scrubbing heavily changes the database contents, potentially updating a lot of data, which leads to different data storage details (think MVC bloat), affecting performance of migrations and queries.
-
-## Who
-
-<!-- vale gitlab.Spelling = NO -->
-
-This effort is owned and driven by the [GitLab Database Team](https://about.gitlab.com/handbook/engineering/development/enablement/data_stores/database/) with support from the [GitLab.com Reliability Datastores](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/) team.
-
-| Role | Who
-|------------------------------|-------------------------|
-| Author | Andreas Brandl |
-| Architecture Evolution Coach | Gerardo Lopez-Fernandez |
-| Engineering Leader | Craig Gomes |
-| Domain Expert | Yannis Roussos |
-| Domain Expert | Pat Bair |
-
-DRIs:
-
-| Role | Who
-|------------------------------|------------------------|
-| Product | Fabian Zimmer |
-| Engineering | Andreas Brandl |
-
-<!-- vale gitlab.Spelling = YES -->
diff --git a/doc/architecture/blueprints/feature_flags_development/index.md b/doc/architecture/blueprints/feature_flags_development/index.md
index 866be9d8a70..730daf56f0d 100644
--- a/doc/architecture/blueprints/feature_flags_development/index.md
+++ b/doc/architecture/blueprints/feature_flags_development/index.md
@@ -1,9 +1,11 @@
---
-stage: none
-group: unassigned
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
-comments: false
-description: 'Internal usage of Feature Flags for GitLab development'
+status: accepted
+creation-date: "2020-06-10"
+authors: [ "@kamil" ]
+coach: "@glopezfernandez"
+approvers: [ "@kencjohnston", "@craig-gomes" ]
+owning-stage: "~devops::non_devops"
+participating-stages: []
---
# Architectural discussion of feature flags
@@ -118,26 +120,3 @@ These are reason why these changes are needed:
This work is being done as part of dedicated epic:
[Improve internal usage of Feature Flags](https://gitlab.com/groups/gitlab-org/-/epics/3551).
This epic describes the meta reasons for making these changes.
-
-## Who
-
-Proposal:
-
-<!-- vale gitlab.Spelling = NO -->
-
-| Role | Who
-|------------------------------|-------------------------|
-| Author | Kamil Trzciński |
-| Architecture Evolution Coach | Gerardo Lopez-Fernandez |
-| Engineering Leader | Kamil Trzciński |
-| Domain Expert | Shinya Maeda |
-
-DRIs:
-
-| Role | Who
-|------------------------------|------------------------|
-| Product | Kenny Johnston |
-| Leadership | Craig Gomes |
-| Engineering | Kamil Trzciński |
-
-<!-- vale gitlab.Spelling = YES -->
diff --git a/doc/architecture/blueprints/gitlab_to_kubernetes_communication/index.md b/doc/architecture/blueprints/gitlab_to_kubernetes_communication/index.md
index 19fd995bead..6ac67dd0f18 100644
--- a/doc/architecture/blueprints/gitlab_to_kubernetes_communication/index.md
+++ b/doc/architecture/blueprints/gitlab_to_kubernetes_communication/index.md
@@ -1,9 +1,11 @@
---
-stage: Configure
-group: Configure
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
-comments: false
-description: 'GitLab to Kubernetes communication'
+status: implemented
+creation-date: "2020-12-03"
+authors: [ "@ash2k" ]
+coach: "@andrewn"
+approvers: [ "@nicholasklick", "@nagyv-gitlab" ]
+owning-stage: "~devops::configure"
+participating-stages: []
---
# GitLab to Kubernetes communication **(FREE)**
@@ -137,28 +139,3 @@ flowchart LR
### Iterations
Iterations are tracked in [the dedicated epic](https://gitlab.com/groups/gitlab-org/-/epics/4591).
-
-## Who
-
-Proposal:
-
-<!-- vale gitlab.Spelling = NO -->
-
-| Role | Who
-|------------------------------|-------------------------|
-| Author | Mikhail Mazurskiy |
-| Architecture Evolution Coach | Andrew Newdigate |
-| Engineering Leader | Nicholas Klick |
-| Domain Expert | Thong Kuah |
-| Domain Expert | Graeme Gillies |
-| Security Expert | Vitor Meireles De Sousa |
-
-DRIs:
-
-| Role | Who
-|------------------------------|------------------------|
-| Product Lead | Viktor Nagy |
-| Engineering Leader | Nicholas Klick |
-| Domain Expert | Mikhail Mazurskiy |
-
-<!-- vale gitlab.Spelling = YES -->
diff --git a/doc/architecture/blueprints/graphql_api/index.md b/doc/architecture/blueprints/graphql_api/index.md
index 1ee322c412b..4b446a78541 100644
--- a/doc/architecture/blueprints/graphql_api/index.md
+++ b/doc/architecture/blueprints/graphql_api/index.md
@@ -1,8 +1,11 @@
---
-stage: none
-group: unassigned
-comments: false
-description: 'GraphQL API architecture foundation'
+status: accepted
+creation-date: "2021-01-07"
+authors: [ "@grzesiek" ]
+coach: "@kamil"
+approvers: [ "@dsatcher", "@deuley" ]
+owning-stage: "~devops::manage"
+participating-stages: []
---
# GraphQL API
@@ -155,43 +158,3 @@ state synchronization mechanisms and hooking into existing ones.
1. [Build a scalable state synchronization for GraphQL](https://gitlab.com/groups/gitlab-org/-/epics/5319)
1. [Add support for direct uploads for GraphQL](https://gitlab.com/gitlab-org/gitlab/-/issues/280819)
1. [Review GraphQL design choices related to security](https://gitlab.com/gitlab-org/security/gitlab/-/issues/339)
-
-## Status
-
-Current status: in progress.
-
-## Who
-
-Proposal:
-
-<!-- vale gitlab.Spelling = NO -->
-
-| Role | Who
-|------------------------------|-------------------------|
-| Author | Grzegorz Bizon |
-| Architecture Evolution Coach | Kamil Trzciński |
-| Engineering Leader | Darva Satcher |
-| Product Manager | Patrick Deuley |
-| Domain Expert / GraphQL | Charlie Ablett |
-| Domain Expert / GraphQL | Alex Kalderimis |
-| Domain Expert / GraphQL | Natalia Tepluhina |
-| Domain Expert / Scalability | Bob Van Landuyt |
-
-DRIs:
-
-| Role | Who
-|------------------------------|------------------------|
-| Leadership | Darva Satcher |
-| Product | Patrick Deuley |
-| Engineering | Paul Slaughter |
-
-Domain Experts:
-
-| Area | Who
-|------------------------------|------------------------|
-| Domain Expert / GraphQL | Charlie Ablett |
-| Domain Expert / GraphQL | Alex Kalderimis |
-| Domain Expert / GraphQL | Natalia Tepluhina |
-| Domain Expert / Scalability | Bob Van Landuyt |
-
-<!-- vale gitlab.Spelling = YES -->
diff --git a/doc/architecture/blueprints/image_resizing/index.md b/doc/architecture/blueprints/image_resizing/index.md
index dd7ce27f459..948378d8834 100644
--- a/doc/architecture/blueprints/image_resizing/index.md
+++ b/doc/architecture/blueprints/image_resizing/index.md
@@ -1,9 +1,11 @@
---
-stage: none
-group: unassigned
-info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
-comments: false
-description: 'Image Resizing'
+status: implemented
+creation-date: "2020-10-21"
+authors: [ "@craig-gomes" ]
+coach: "@kamil"
+approvers: [ "@timzallmann", "@joshlambert" ]
+owning-stage: "~devops::non_devops"
+participating-stages: []
---
# Image resizing for avatars and content images
diff --git a/doc/architecture/blueprints/object_storage/index.md b/doc/architecture/blueprints/object_storage/index.md
index 7a4ecd0e5a8..61dc37d7706 100644
--- a/doc/architecture/blueprints/object_storage/index.md
+++ b/doc/architecture/blueprints/object_storage/index.md
@@ -1,8 +1,11 @@
---
-stage: none
-group: unassigned
-comments: false
-description: 'Object storage: direct_upload consolidation - architecture blueprint.'
+status: ready
+creation-date: "2021-11-18"
+authors: [ "@nolith" ]
+coach: "@glopezfernandez"
+approvers: [ "@marin" ]
+owning-stage: "~devops::data_stores"
+participating-stages: []
---
# Object storage: `direct_upload` consolidation
@@ -197,24 +200,3 @@ require one bucket.
- [Speed up the monolith, building a smart reverse proxy in Go](https://archive.fosdem.org/2020/schedule/event/speedupmonolith/): a presentation explaining a bit of workhorse history and the challenge we faced in releasing the first cloud-native installation.
- [Object Storage improvements epic](https://gitlab.com/groups/gitlab-org/-/epics/483).
- We are moving to GraphQL API, but [we do not support direct upload](https://gitlab.com/gitlab-org/gitlab/-/issues/280819).
-
-## Who
-
-Proposal:
-
-<!-- vale gitlab.Spelling = NO -->
-
-| Role | Who |
-|--------------------------------|-------------------------|
-| Author | Alessio Caiazza |
-| Architecture Evolution Coach | Gerardo Lopez-Fernandez |
-| Engineering Leader | Marin Jankovski |
-| Domain Expert / Object storage | Stan Hu |
-| Domain Expert / Security | Joern Schneeweisz |
-
-DRIs:
-
-The DRI for this blueprint is the
-[Object Storage Working Group](https://about.gitlab.com/company/team/structure/working-groups/object-storage/).
-
-<!-- vale gitlab.Spelling = YES -->
diff --git a/doc/architecture/blueprints/pods/images/iteration0-organizations-introduction.png b/doc/architecture/blueprints/pods/images/iteration0-organizations-introduction.png
new file mode 100644
index 00000000000..5725b0fa71f
--- /dev/null
+++ b/doc/architecture/blueprints/pods/images/iteration0-organizations-introduction.png
Binary files differ
diff --git a/doc/architecture/blueprints/pods/images/pods-and-fulfillment.png b/doc/architecture/blueprints/pods/images/pods-and-fulfillment.png
new file mode 100644
index 00000000000..aab8556a5d3
--- /dev/null
+++ b/doc/architecture/blueprints/pods/images/pods-and-fulfillment.png
Binary files differ
diff --git a/doc/architecture/blueprints/pods/images/term-cluster.png b/doc/architecture/blueprints/pods/images/term-cluster.png
new file mode 100644
index 00000000000..87e4d631551
--- /dev/null
+++ b/doc/architecture/blueprints/pods/images/term-cluster.png
Binary files differ
diff --git a/doc/architecture/blueprints/pods/images/term-organization.png b/doc/architecture/blueprints/pods/images/term-organization.png
new file mode 100644
index 00000000000..4c82c62b8f4
--- /dev/null
+++ b/doc/architecture/blueprints/pods/images/term-organization.png
Binary files differ
diff --git a/doc/architecture/blueprints/pods/term-pod.png b/doc/architecture/blueprints/pods/images/term-pod.png
index d8f79df2f29..d8f79df2f29 100644
--- a/doc/architecture/blueprints/pods/term-pod.png
+++ b/doc/architecture/blueprints/pods/images/term-pod.png
Binary files differ
diff --git a/doc/architecture/blueprints/pods/term-top-level-namespace.png b/doc/architecture/blueprints/pods/images/term-top-level-namespace.png
index c1cd317d878..c1cd317d878 100644
--- a/doc/architecture/blueprints/pods/term-top-level-namespace.png
+++ b/doc/architecture/blueprints/pods/images/term-top-level-namespace.png
Binary files differ
diff --git a/doc/architecture/blueprints/pods/index.md b/doc/architecture/blueprints/pods/index.md
index 01d56c483ea..3ba319d169b 100644
--- a/doc/architecture/blueprints/pods/index.md
+++ b/doc/architecture/blueprints/pods/index.md
@@ -1,15 +1,15 @@
---
-stage: enablement
-group: pods
-comments: false
-description: 'Pods'
+status: accepted
+creation-date: "2022-09-07"
+authors: [ "@fzimmer", "@DylanGriffith" ]
+coach: "@kamil"
+approvers: [ "@fzimmer" ]
+owning-stage: "~devops::enablement"
+participating-stages: []
---
# Pods
-DISCLAIMER:
-This page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.
-
This document is a work-in-progress and represents a very early state of the Pods design. Significant aspects are not documented, though we expect to add them in the future.
## Summary
@@ -24,7 +24,7 @@ We use the following terms to describe components and properties of the Pods arc
A Pod is a set of infrastructure components that contains multiple top-level namespaces that belong to different organizations. The components include both datastores (PostgreSQL, Redis etc.) and stateless services (web etc.). The infrastructure components provided within a Pod are shared among organizations and their top-level namespaces but not shared with other Pods. This isolation of infrastructure components means that Pods are independent from each other.
-![Term Pod](term-pod.png)
+![Term Pod](images/term-pod.png)
#### Pod properties
@@ -42,7 +42,7 @@ Discouraged synonyms: GitLab instance, cluster, shard
A cluster is a collection of Pods.
-![Term Cluster](term-cluster.png)
+![Term Cluster](images/term-cluster.png)
#### Cluster properties
@@ -66,7 +66,7 @@ Organizations work under the following assumptions:
1. Users understand that the majority of pages they view are only scoped to a single organization at a time.
1. Organizations are located on a single pod.
-![Term Organization](term-organization.png)
+![Term Organization](images/term-organization.png)
#### Organization properties
@@ -94,7 +94,7 @@ Top-level namespaces may [be replaced by workspaces](https://gitlab.com/gitlab-o
Discouraged synonyms: Root-level namespace
-![Term Top-level Namespace](term-top-level-namespace.png)
+![Term Top-level Namespace](images/term-top-level-namespace.png)
#### Top-level namespace properties
@@ -111,8 +111,8 @@ Users are available globally and not restricted to a single Pod. Users can be me
- Users can create multiple top-level namespaces
- Users can be a member of multiple top-level namespaces
- Users can be a member of multiple organizations
-- Users can administrate organizations
-- User activity is aggregated within an organization
+- Users can administer organizations
+- User activity is aggregated in an organization
- Every user has one personal namespace
## Goals
@@ -160,6 +160,59 @@ A number of technical issues need to be resolved to implement Pods (in no partic
1. How are Pods provisioned?
1. How can Pods implement disaster recovery capabilities?
+## Cross-section impact
+
+Pods is a fundamental architecture change that impacts other sections and stages. This section summarizes and links to other groups that may be impacted and highlights potential conflicts that need to be resolved. The Pods group is not responsible for achieving the goals of other groups but we want to ensure that dependencies are resolved.
+
+### Summary
+
+Based on discussions with other groups, the net impact of introducing Pods and a new entity called organizations is mostly neutral. It may slow down development in some areas. We did not discover major blockers for other teams.
+
+1. We need to resolve naming conflicts (proposal is TBD)
+1. Pods requires introducing Organizations. Organizations are a new entity **above** top-level groups. Because this is a new entity, it may impact the ability to consolidate settings for Group Workspace and influence their decision on [how to approach introducing a workspace](https://gitlab.com/gitlab-org/gitlab/-/issues/376285#approach-2-workspace-is-built-on-top-of-top-level-groups)
+1. Organizations may make it slightly easier for Fulfillment to realize their billing plans.
+
+### Impact on Group Manage Workspace
+
+We synced with the Workspace PM and Designer ([recording](https://youtu.be/b5Opn9cFWFk)) and discussed the similarities and differences between the Pods and Workspace proposal ([presentation](https://docs.google.com/presentation/d/1FsUi22Up15b_tu6p2m-yLML3hCZ3rgrZrmzJAxUsNmU/edit?usp=sharing)).
+
+#### Goals of Group Manage Workspace
+
+As defined in the [workspace documentation](../../../user/workspace/index.md):
+
+1. Create an entity to manage everything you do as a GitLab administrator, including:
+ 1. Defining and applying settings to all of your groups, subgroups, and projects.
+ 1. Aggregating data from all your groups, subgroups, and projects.
+1. Reach feature parity between SaaS and self-managed installations, with all Admin Area settings moving to groups (?). Hardware controls remain on the instance level.
+
+The [workspace roadmap outlines](https://gitlab.com/gitlab-org/gitlab/-/issues/368237#high-level-goals) the current goals in detail.
+
+#### Potential conflicts with Pods
+
+- Workspace and Organization are different terms for the same entity. Both define a new entity as the primary organizational object for groups and projects. This is mainly a semantic difference and **we need to decide on a name** following [user research to decide if workspace](https://gitlab.com/gitlab-org/ux-research/-/issues/2147). This is also driven by the fact that the Remote Development team is looking at better names and [is considering the term Workspace as well](https://gitlab.com/gitlab-com/Product/-/issues/4812).
+- We will only introduce one entity
+- Group workspace highlighted the need to further validate the key assumption that users only care about what happens within their organization.
+
+### Impact on Fulfillment
+
+We synced with Fulfillment ([recording](https://youtu.be/FkQF3uF7vTY)) to discuss how Pods would impact them. Fulfillment is supportive of an entity above top-level namespaces. Their perspective is outlined in [!5639](https://gitlab.com/gitlab-org/customers-gitlab-com/-/merge_requests/5639/diffs).
+
+#### Goals of Fulfillment
+
+- Fulfillment has a longstanding plan to move billing from the top-level namespace to a level above. This would mean that a license applies for an organization and all its top-level namespaces.
+- Fulfillment uses Zuora for billing and would like to have a 1-to-1 relationship between an organization and their Zuora entity called BillingAccount. They want to move away from tying a license to a single user.
+- If a customer needs multiple organizations, the corresponding BillingAccounts can be rolled up into a consolidated billing account (similar to [AWS consolidated billing](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/consolidated-billing.html))
+- Ideally, a self-managed instance has a single Organization by default, which should be enough for most customers.
+- Fulfillment prefers only one additional entity.
+
+A rough representation of this is:
+
+![Pods and Fulfillment](images/pods-and-fulfillment.png)
+
+#### Potential conflicts with Pods
+
+- There are no known conflicts between Fulfillment's plans and Pods
+
## Iteration plan
We can't ship the entire Pods architecture in one go - it is too large. Instead, we are adopting an iteration plan that provides value along the way.
@@ -189,7 +242,7 @@ Organizations solve the following problems:
1. Self-managed instances would set a default organization.
1. Organizations can control user-profiles in a central way. This could be achieved by having an organization specific user-profile. Such a profile makes it possible for the organization administrators to control the user role in a company, enforce user emails, or show a graphical indicator of a user being part of the organization. An example would be a "GitLab Employee stamp" on comments.
-![Move to Organizations](iteration0-organizations-introduction.png)
+![Move to Organizations](images/iteration0-organizations-introduction.png)
#### Why would customers opt-in to Organizations?
@@ -251,28 +304,31 @@ Based on user research, we may want to change certain features to work across or
- Specific features allow for cross-organization interactions, for example forking, search.
-### Links
+## Technical Proposals
+
+The Pods architecture has long-lasting implications for data processing, location, scalability, and the GitLab architecture.
+This section links to the technical proposals that are being evaluated.
+
+- [Stateless Router That Uses a Cache to Pick Pod and Is Redirected When Wrong Pod Is Reached](proposal-stateless-router-with-buffering-requests.md)
+
+- [Stateless Router That Uses a Cache to Pick Pod and pre-flight `/api/v4/pods/learn`](proposal-stateless-router-with-routes-learning.md)
+
+## Impacted features
+
+The Pods architecture will impact many features, requiring some of them to be rewritten or changed significantly.
+This is the list of known affected features with the proposed solutions.
+
+- [Pods: Git Access](pods-feature-git-access.md)
+- [Pods: Data Migration](pods-feature-data-migration.md)
+- [Pods: Database Sequences](pods-feature-database-sequences.md)
+- [Pods: GraphQL](pods-feature-graphql.md)
+- [Pods: Organizations](pods-feature-organizations.md)
+- [Pods: Router Endpoints Classification](pods-feature-router-endpoints-classification.md)
+
+## Links
- [Internal Pods presentation](https://docs.google.com/presentation/d/1x1uIiN8FR9fhL7pzFh9juHOVcSxEY7d2_q4uiKKGD44/edit#slide=id.ge7acbdc97a_0_155)
- [Pods Epic](https://gitlab.com/groups/gitlab-org/-/epics/7582)
- [Database Group investigation](https://about.gitlab.com/handbook/engineering/development/enablement/data_stores/database/doc/root-namespace-sharding.html)
- [Shopify Pods architecture](https://shopify.engineering/a-pods-architecture-to-allow-shopify-to-scale)
- [Opstrace architecture](https://gitlab.com/gitlab-org/opstrace/opstrace/-/blob/main/docs/architecture/overview.md)
-
-### Who
-
-| Role | Who
-|------------------------------|-------------------------|
-| Author | Fabian Zimmer |
-| Architecture Evolution Coach | Kamil Trzciński |
-| Engineering Leader | TBD |
-| Product Manager | Fabian Zimmer |
-| Domain Expert / Database | TBD |
-
-DRIs:
-
-| Role | Who
-|------------------------------|------------------------|
-| Leadership | TBD |
-| Product | Fabian Zimmer |
-| Engineering | Thong Kuah |
diff --git a/doc/architecture/blueprints/pods/iteration0-organizations-introduction.png b/doc/architecture/blueprints/pods/iteration0-organizations-introduction.png
deleted file mode 100644
index 5f5cad7b169..00000000000
--- a/doc/architecture/blueprints/pods/iteration0-organizations-introduction.png
+++ /dev/null
Binary files differ
diff --git a/doc/architecture/blueprints/pods/pods-feature-data-migration.md b/doc/architecture/blueprints/pods/pods-feature-data-migration.md
new file mode 100644
index 00000000000..fad6bca45fa
--- /dev/null
+++ b/doc/architecture/blueprints/pods/pods-feature-data-migration.md
@@ -0,0 +1,82 @@
+---
+stage: enablement
+group: pods
+comments: false
+description: 'Pods: Data migration'
+---
+
+DISCLAIMER:
+This page may contain information related to upcoming products, features and
+functionality. It is important to note that the information presented is for
+informational purposes only, so please do not rely on the information for
+purchasing or planning purposes. Just like with all projects, the items
+mentioned on the page are subject to change or delay, and the development,
+release, and timing of any products, features, or functionality remain at the
+sole discretion of GitLab Inc.
+
+This document is a work-in-progress and represents a very early state of the
+Pods design. Significant aspects are not documented, though we expect to add
+them in the future. This is one possible architecture for Pods, and we intend to
+contrast this with alternatives before deciding which approach to implement.
+This documentation will be kept even if we decide not to implement this so that
+we can document the reasons for not choosing this approach.
+
+# Pods: Data migration
+
+It is essential for the Pods architecture to provide a way to migrate data out of big Pods
+into smaller ones. This document describes various approaches to providing this type of split.
+
+## 1. Definition
+
+## 2. Data flow
+
+## 3. Proposal
+
+### 3.1. Split large Pods
+
+A single Pod can only be divided into many Pods. This is based on the principle
+that it is easier to create exact clones of an existing Pod as many replicas,
+some of which are made authoritative once migrated. Keeping those
+replicas up to date with Pod 0 is also much easier because of pre-existing
+replication solutions that can replicate whole systems: Geo, PostgreSQL
+physical replication, and so on.
+
+1. The data of an organization must not be divided across many Pods.
+1. Split should be doable online.
+1. New Pods cannot contain pre-existing data.
+1. N Pods contain an exact replica of Pod 0.
+1. The data of Pod 0 is live replicated to as many Pods as it needs to be split into.
+1. Once consensus is achieved between Pod 0 and the N Pods, the organizations to be migrated away
+ are marked as read-only cluster-wide.
+1. `routes` is updated for all organizations being split to indicate the authoritative
+ Pod holding the most recent data, like `gitlab-org` on `pod-100`.
+1. The data for `gitlab-org` on Pod 0 and on the other non-authoritative N Pods is dormant
+ and will be removed in the future.
+1. All accesses to `gitlab-org` on a given Pod are validated against the `pod_id` of `routes`
+ to ensure that the given Pod is authoritative to handle the data.
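
The last step of the list above (validating access against the `pod_id` of `routes`) could look roughly like the following. This is only a sketch under assumptions: the in-memory `routes` mapping, the `NotAuthoritativeError` class, and the `validate_access` function are hypothetical names used for illustration and are not part of GitLab.

```python
class NotAuthoritativeError(Exception):
    """Raised when a Pod is asked to serve data it is not authoritative for."""


def validate_access(routes, path, current_pod_id):
    """Check that `current_pod_id` is authoritative for the top-level path.

    `routes` is a hypothetical mapping of top-level path -> authoritative pod_id,
    standing in for the `pod_id` of `routes` mentioned in the list above.
    """
    top_level = path.strip("/").split("/", 1)[0]
    authoritative_pod = routes.get(top_level)
    if authoritative_pod is None:
        raise KeyError(f"no route known for {top_level!r}")
    if authoritative_pod != current_pod_id:
        # Dormant copies on Pod 0 or other N Pods must not serve this data.
        raise NotAuthoritativeError(
            f"{top_level!r} is served by {authoritative_pod}, not {current_pod_id}"
        )
    return True


# Example: after the split, `gitlab-org` is authoritative on `pod-100`.
routes = {"gitlab-org": "pod-100"}
validate_access(routes, "/gitlab-org/gitlab", "pod-100")
```

With this check, a request for `gitlab-org` that reaches Pod 0 after the split is rejected instead of being answered from the dormant copy.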
+
+### 3.2. Migrate organization from an existing Pod
+
+This is different from a split, as we intend to perform logical and selective replication
+of data belonging to a single organization.
+
+Today this type of selective replication is only implemented by Gitaly, where we can migrate
+a Git repository from one Gitaly node to another with minimal downtime.
+
+In this model, we would need to identify all resources belonging to a given organization
+(database rows, object storage files, Git repositories, and so on) and selectively copy them
+to another (likely existing) Pod, importing the data into it. Ideally, we would
+perform live logical replication of all changed data and then, as with a split, change
+which Pod is authoritative for this organization.
+
+1. It is hard to identify all resources belonging to an organization.
+1. It requires either downtime for the organization or a robust system to identify
+ live changes made.
+1. It will likely require a full database structure analysis (more robust than project import/export)
+ to perform selective PostgreSQL logical replication.
+
+## 4. Evaluation
+
+### 4.1. Pros
+
+### 4.2. Cons
diff --git a/doc/architecture/blueprints/pods/pods-feature-database-sequences.md b/doc/architecture/blueprints/pods/pods-feature-database-sequences.md
new file mode 100644
index 00000000000..0a8bb4d250e
--- /dev/null
+++ b/doc/architecture/blueprints/pods/pods-feature-database-sequences.md
@@ -0,0 +1,94 @@
+---
+stage: enablement
+group: pods
+comments: false
+description: 'Pods: Database Sequences'
+---
+
+DISCLAIMER:
+This page may contain information related to upcoming products, features and
+functionality. It is important to note that the information presented is for
+informational purposes only, so please do not rely on the information for
+purchasing or planning purposes. Just like with all projects, the items
+mentioned on the page are subject to change or delay, and the development,
+release, and timing of any products, features, or functionality remain at the
+sole discretion of GitLab Inc.
+
+This document is a work-in-progress and represents a very early state of the
+Pods design. Significant aspects are not documented, though we expect to add
+them in the future. This is one possible architecture for Pods, and we intend to
+contrast this with alternatives before deciding which approach to implement.
+This documentation will be kept even if we decide not to implement this so that
+we can document the reasons for not choosing this approach.
+
+# Pods: Database Sequences
+
+GitLab today ensures that every database row created has a unique ID, allowing
+access to a Merge Request, CI Job, or Project by a known global ID.
+
+Pods will use many distinct and unconnected databases, each of them having
+separate IDs for most entities.
+
+It might be desirable to retain globally unique IDs for all database rows
+to allow migrating resources between Pods in the future.
+
+## 1. Definition
+
+## 2. Data flow
+
+## 3. Proposal
+
+These are some preliminary ideas for how we can retain unique IDs across the system.
+
+### 3.1. UUID
+
+Instead of using incremental sequences, use a UUID (128 bits) that is stored in the database.
+
+- This might break existing IDs and requires adding a UUID column to all existing tables.
+- This makes all indexes larger, as it requires storing 128 bits instead of 32 or 64 bits in the index.
+
+### 3.2. Use Pod index encoded in ID
+
+Since a significant number of tables already use 64-bit ID numbers, we could use the most
+significant bits (MSB) to encode the Pod ID.
+
+- This might limit the number of Pods that can be enabled in the system, as we might decide to only
+ allocate 1024 possible Pod numbers.
+- This might make IDs migratable between Pods, because even if an entity from Pod 1 is migrated to Pod 100,
+ its ID would still be unique.
+- If resources are migrated, the ID itself will not be enough to decode the Pod number, and we would need
+ a lookup table.
+- This requires updating all IDs to 64 bits.
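
To make the bit layout concrete, here is a minimal sketch of encoding a Pod index in the most significant bits of an ID. It assumes signed 64-bit IDs (so the sign bit is left clear) and the 1024 possible Pod numbers mentioned above; the names `POD_BITS`, `LOCAL_BITS`, `encode_id`, and `decode_id` are hypothetical and chosen only for illustration.

```python
POD_BITS = 10                 # assumption: 2**10 = 1024 possible Pod numbers
LOCAL_BITS = 63 - POD_BITS    # assumption: keep the sign bit of a signed 64-bit ID clear

MAX_POD_ID = (1 << POD_BITS) - 1
MAX_LOCAL_ID = (1 << LOCAL_BITS) - 1


def encode_id(pod_id, local_id):
    """Pack a Pod number and a Pod-local sequence value into one 64-bit ID."""
    if not 0 <= pod_id <= MAX_POD_ID:
        raise ValueError("pod_id out of range")
    if not 0 <= local_id <= MAX_LOCAL_ID:
        raise ValueError("local_id out of range")
    return (pod_id << LOCAL_BITS) | local_id


def decode_id(global_id):
    """Return (pod_id, local_id) for the Pod that originally created the ID."""
    return global_id >> LOCAL_BITS, global_id & MAX_LOCAL_ID


global_id = encode_id(100, 42)
pod_id, local_id = decode_id(global_id)
```

Note that `decode_id` only recovers the Pod that created the record. If the record is later migrated, the decoded Pod number is stale, which is why a lookup table would still be needed.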
+
+### 3.3. Allocate sequence ranges from central place
+
+Each Pod might receive its own range of sequence values, consumed from a centrally managed place.
+Once a Pod consumes all IDs assigned for a given table, it would be replenished and the next range would be allocated.
+Ranges would be tracked to provide a faster lookup table if a random access pattern is required.
+
+- This might make IDs migratable between Pods, because even if an entity from Pod 1 is migrated to Pod 100,
+ its ID would still be unique.
+- If resources are migrated, the ID itself will not be enough to decode the Pod number, and we would need
+ a much more robust lookup table, as we could be breaking previously assigned sequence ranges.
+- This does not require updating all IDs to 64 bits.
+- This adds some performance penalty to all `INSERT` statements in PostgreSQL, or at least from Rails, as we need to check the sequence number and potentially wait for our range to be refreshed from the ID server.
+- The available range needs to be stored and incremented in a centralized place so that concurrent transactions cannot get the same value.
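
The range allocation idea can be sketched as follows. This is an in-process illustration only, assuming a single central allocator guarded by a lock; `CentralRangeAllocator`, `PodSequence`, and the `range_size` parameter are hypothetical names, and a real ID server would need durable storage and network calls.

```python
import bisect
import threading


class CentralRangeAllocator:
    """Hands out non-overlapping ID ranges and remembers which Pod got each one."""

    def __init__(self, range_size=1000):
        self._range_size = range_size
        self._next_start = 1
        self._lock = threading.Lock()
        self._starts = []  # sorted start of every allocated range
        self._owners = []  # Pod that received the range at the same index

    def allocate(self, pod_id):
        # The lock ensures concurrent callers can never receive the same range.
        with self._lock:
            start = self._next_start
            self._next_start += self._range_size
            self._starts.append(start)
            self._owners.append(pod_id)
            return start, start + self._range_size - 1

    def pod_for(self, record_id):
        """Look up the Pod that was originally allocated the range of `record_id`."""
        with self._lock:
            index = bisect.bisect_right(self._starts, record_id) - 1
            if index < 0 or record_id >= self._next_start:
                raise KeyError(record_id)
            return self._owners[index]


class PodSequence:
    """Pod-local sequence that asks for a new range once the current one is used up."""

    def __init__(self, pod_id, allocator):
        self._pod_id = pod_id
        self._allocator = allocator
        self._next, self._last = 1, 0  # empty range forces an allocation on first use

    def next_id(self):
        if self._next > self._last:
            self._next, self._last = self._allocator.allocate(self._pod_id)
        value = self._next
        self._next += 1
        return value


allocator = CentralRangeAllocator(range_size=3)
pod_1 = PodSequence(1, allocator)
pod_2 = PodSequence(2, allocator)
ids = [pod_1.next_id(), pod_1.next_id(), pod_1.next_id(), pod_2.next_id(), pod_1.next_id()]
```

The `pod_for` lookup only reflects the original allocation. Once records are migrated between Pods, ranges are effectively broken up, which is the more robust lookup table problem described in the list above.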
+
+### 3.4. Define only some tables to require unique IDs
+
+Maybe it is acceptable for only some tables to have globally unique IDs. These could be projects, groups,
+and other top-level entities. All other tables, like `merge_requests`, would only offer a Pod-local ID,
+but when referenced externally would use the IID (an ID that is monotonic in the context of a given resource, like a project).
+
+- This makes the ID 10000 for `merge_requests` present on all Pods, which might sometimes be confusing
+ with regard to the uniqueness of the resource.
+- This might make random access by ID (if ever needed) impossible without using a composite key, like `project_id+merge_request_id`.
+- This would require us to implement a transformation/generation of a new ID if we need to migrate records to another pod. This can lead to very difficult migration processes when these IDs are also used as foreign keys for other records being migrated.
+- If IDs need to change when moving between pods, this means that any links to records by ID would no longer work, even if those links included the `project_id`.
+- If we plan to allow these IDs to not be unique and change the unique constraint to be based on a composite key, then we'd need to update all foreign key references to be based on the composite key.
+
+## 4. Evaluation
+
+### 4.1. Pros
+
+### 4.2. Cons
diff --git a/doc/architecture/blueprints/pods/pods-feature-git-access.md b/doc/architecture/blueprints/pods/pods-feature-git-access.md
new file mode 100644
index 00000000000..ae996281d46
--- /dev/null
+++ b/doc/architecture/blueprints/pods/pods-feature-git-access.md
@@ -0,0 +1,163 @@
+---
+stage: enablement
+group: pods
+comments: false
+description: 'Pods: Git Access'
+---
+
+This document is a work-in-progress and represents a very early state of the
+Pods design. Significant aspects are not documented, though we expect to add
+them in the future. This is one possible architecture for Pods, and we intend to
+contrast this with alternatives before deciding which approach to implement.
+This documentation will be kept even if we decide not to implement this so that
+we can document the reasons for not choosing this approach.
+
+# Pods: Git Access
+
+This document describes the impact of the Pods architecture on all Git access patterns
+(over HTTPS and SSH), and explains how those features could be changed
+to work well with Pods.
+
+## 1. Definition
+
+Git access is done throughout the application. It can be an operation performed by the system
+(read a Git repository) or by a user (create a new file via the Web IDE, `git clone` or `git push` via the command line).
+
+The Pods architecture defines that all Git repositories will be local to the Pod,
+so no repository could be shared with another Pod.
+
+The Pods architecture will require that any Git operation can only be handled by the Pod holding
+the data. This means that any operation via the Web interface, API, or GraphQL needs to be routed
+to the correct Pod. It also means that any `git clone` or `git push` operation can only be performed
+in the context of a Pod.
+
+## 2. Data flow
+
+There are various operations performed today by GitLab on a Git repository. This section describes
+how their data flows behave today, to better represent the impact.
+
+It appears that Git access requires changes to only a few endpoints that are scoped to a project.
+There appear to be different types of repositories:
+
+- Project: assigned to Group
+- Wiki: additional repository assigned to Project
+- Design: similar to Wiki, additional repository assigned to Project
+- Snippet: creates a virtual project to hold repository, likely tied to the User
+
+### 2.1. Git clone over HTTPS
+
+Execution of: `git clone` over HTTPS
+
+```mermaid
+sequenceDiagram
+ User ->> Workhorse: GET /gitlab-org/gitlab.git/info/refs?service=git-upload-pack
+ Workhorse ->> Rails: GET /gitlab-org/gitlab.git/info/refs?service=git-upload-pack
+ Rails ->> Workhorse: 200 OK
+ Workhorse ->> Gitaly: RPC InfoRefsUploadPack
+ Gitaly ->> User: Response
+ User ->> Workhorse: POST /gitlab-org/gitlab.git/git-upload-pack
+ Workhorse ->> Gitaly: RPC PostUploadPackWithSidechannel
+ Gitaly ->> User: Response
+```
+
+### 2.2. Git clone over SSH
+
+Execution of: `git clone` over SSH
+
+```mermaid
+sequenceDiagram
+ User ->> Git SSHD: ssh git@gitlab.com
+ Git SSHD ->> Rails: GET /api/v4/internal/authorized_keys
+ Rails ->> Git SSHD: 200 OK (list of accepted SSH keys)
+ Git SSHD ->> User: Accept SSH
+ User ->> Git SSHD: git clone over SSH
+ Git SSHD ->> Rails: POST /api/v4/internal/allowed?project=/gitlab-org/gitlab.git&service=git-upload-pack
+ Rails ->> Git SSHD: 200 OK
+ Git SSHD ->> Gitaly: RPC SSHUploadPackWithSidechannel
+ Gitaly ->> User: Response
+```
+
+### 2.3. Git push over HTTPS
+
+Execution of: `git push` over HTTPS
+
+```mermaid
+sequenceDiagram
+ User ->> Workhorse: GET /gitlab-org/gitlab.git/info/refs?service=git-receive-pack
+ Workhorse ->> Rails: GET /gitlab-org/gitlab.git/info/refs?service=git-receive-pack
+ Rails ->> Workhorse: 200 OK
+ Workhorse ->> Gitaly: RPC PostReceivePack
+ Gitaly ->> Rails: POST /api/v4/internal/allowed?gl_repository=project-111&service=git-receive-pack
+ Gitaly ->> Rails: POST /api/v4/internal/pre_receive?gl_repository=project-111
+ Gitaly ->> Rails: POST /api/v4/internal/post_receive?gl_repository=project-111
+ Gitaly ->> User: Response
+```
+
+### 2.4. Git push over SSH
+
+Execution of: `git push` over SSH
+
+```mermaid
+sequenceDiagram
+ User ->> Git SSHD: ssh git@gitlab.com
+ Git SSHD ->> Rails: GET /api/v4/internal/authorized_keys
+ Rails ->> Git SSHD: 200 OK (list of accepted SSH keys)
+ Git SSHD ->> User: Accept SSH
+ User ->> Git SSHD: git push over SSH
+ Git SSHD ->> Rails: POST /api/v4/internal/allowed?project=/gitlab-org/gitlab.git&service=git-receive-pack
+ Rails ->> Git SSHD: 200 OK
+ Git SSHD ->> Gitaly: RPC ReceivePack
+ Gitaly ->> Rails: POST /api/v4/internal/allowed?gl_repository=project-111
+ Gitaly ->> Rails: POST /api/v4/internal/pre_receive?gl_repository=project-111
+ Gitaly ->> Rails: POST /api/v4/internal/post_receive?gl_repository=project-111
+ Gitaly ->> User: Response
+```
+
+### 2.5. Create commit via Web
+
+Execution of `Add CHANGELOG` to repository:
+
+```mermaid
+sequenceDiagram
+ Web ->> Puma: POST /gitlab-org/gitlab/-/create/main
+ Puma ->> Gitaly: RPC TreeEntry
+ Gitaly ->> Rails: POST /api/v4/internal/allowed?gl_repository=project-111
+ Gitaly ->> Rails: POST /api/v4/internal/pre_receive?gl_repository=project-111
+ Gitaly ->> Rails: POST /api/v4/internal/post_receive?gl_repository=project-111
+ Gitaly ->> Puma: Response
+ Puma ->> Web: See CHANGELOG
+```
+
+## 3. Proposal
+
+The Pods stateless router proposal requires that any ambiguous path (that is not routable)
+be made routable. This means that at least the following paths will have to be updated
+to introduce a routable entity (project, group, or organization).
+
+Change:
+
+- `/api/v4/internal/allowed` => `/api/v4/internal/projects/<gl_repository>/allowed`
+- `/api/v4/internal/pre_receive` => `/api/v4/internal/projects/<gl_repository>/pre_receive`
+- `/api/v4/internal/post_receive` => `/api/v4/internal/projects/<gl_repository>/post_receive`
+- `/api/v4/internal/lfs_authenticate` => `/api/v4/internal/projects/<gl_repository>/lfs_authenticate`
+
+Where:
+
+- `gl_repository` can be `project-1111` (`Gitlab::GlRepository`)
+- `gl_repository` in some cases might be a full path to repository as executed by GitLab Shell (`/gitlab-org/gitlab.git`)
+
+## 4. Evaluation
+
+Supporting Git repositories if a Pod can access only its own repositories does not appear to be complex.
+
+The one major complication is supporting snippets, but this likely falls in the same category as the approach
+to supporting users' personal namespaces.
+
+## 4.1. Pros
+
+1. The APIs used for supporting HTTPS/SSH and Hooks are well defined and can easily be made routable.
+
+## 4.2. Cons
+
+1. Sharing of repository objects is limited to the given Pod and Gitaly node.
+1. Cross-Pod forks are likely impossible to support (to discover: how this works today across different Gitaly nodes).
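
Note on section 3 of `pods-feature-git-access.md` above: the proposed path change can be illustrated with a small helper. This is a minimal sketch under stated assumptions, not GitLab's implementation. The function name `routable_internal_path` and the choice to percent-encode the identifier are hypothetical; only the four endpoint names and the target path shape come from the proposal.

```python
from urllib.parse import quote

# Internal endpoints that the proposal lists as needing a routable entity.
ROUTABLE_ENDPOINTS = {"allowed", "pre_receive", "post_receive", "lfs_authenticate"}

INTERNAL_PREFIX = "/api/v4/internal/"


def routable_internal_path(path: str, gl_repository: str) -> str:
    """Rewrite /api/v4/internal/<endpoint> so it names the routable entity.

    Paths outside the listed endpoints are returned unchanged.
    """
    if not path.startswith(INTERNAL_PREFIX):
        return path
    endpoint = path[len(INTERNAL_PREFIX):]
    if endpoint not in ROUTABLE_ENDPOINTS:
        return path
    # Assumption: the identifier is percent-encoded so that a full path such
    # as "/gitlab-org/gitlab.git" stays a single URL segment.
    entity = quote(gl_repository.strip("/"), safe="")
    return f"{INTERNAL_PREFIX}projects/{entity}/{endpoint}"
```

For example, `routable_internal_path("/api/v4/internal/allowed", "project-1111")` yields `/api/v4/internal/projects/project-1111/allowed`, while an endpoint outside the list (such as `authorized_keys`) is returned unchanged.
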
diff --git a/doc/architecture/blueprints/pods/pods-feature-graphql.md b/doc/architecture/blueprints/pods/pods-feature-graphql.md
new file mode 100644
index 00000000000..5f8a39c0b3f
--- /dev/null
+++ b/doc/architecture/blueprints/pods/pods-feature-graphql.md
@@ -0,0 +1,94 @@
+---
+stage: enablement
+group: pods
+comments: false
+description: 'Pods: GraphQL'
+---
+
+DISCLAIMER:
+This page may contain information related to upcoming products, features and
+functionality. It is important to note that the information presented is for
+informational purposes only, so please do not rely on the information for
+purchasing or planning purposes. Just like with all projects, the items
+mentioned on the page are subject to change or delay, and the development,
+release, and timing of any products, features, or functionality remain at the
+sole discretion of GitLab Inc.
+
+This document is a work-in-progress and represents a very early state of the
+Pods design. Significant aspects are not documented, though we expect to add
+them in the future. This is one possible architecture for Pods, and we intend to
+contrast this with alternatives before deciding which approach to implement.
+This documentation will be kept even if we decide not to implement this so that
+we can document the reasons for not choosing this approach.
+
+# Pods: GraphQL
+
+GitLab extensively uses GraphQL to perform efficient data query operations.
+GraphQL, due to its nature, is not directly routable. GitLab sends every GraphQL
+request to the `/api/graphql` endpoint, and only the query or mutation in the request body
+might define where the data can be accessed.
+
+## 1. Definition
+
+## 2. Data flow
+
+## 3. Proposal
+
+There are at least two main ways to implement GraphQL in Pods architecture.
+
+### 3.1. GraphQL routable by endpoint
+
+Change `/api/graphql` to `/api/organization/<organization>/graphql`.
+
+- This breaks all existing usages of `/api/graphql` endpoint
+ since the API URI is changed.
+
+### 3.2. GraphQL routable by body
+
+As part of the router, parse the GraphQL body to find a routable entity, such as `project`.
+
+- This still means the GraphQL query is executed only in the context of a given Pod,
+ and does not allow the data to be merged.
+
+```graphql
+# Good example
+{
+ project(fullPath:"gitlab-org/gitlab") {
+ id
+ description
+ }
+}
+
+# Bad example, since Merge Request is not routable
+{
+ mergeRequest(id: 1111) {
+ iid
+ description
+ }
+}
+```
+
+### 3.3. Merging GraphQL Proxy
+
+Implement, as part of the router, a GraphQL proxy that can parse the body
+and merge results from many Pods.
+
+- This might make pagination hard to achieve, or we might assume that
+ we execute many queries whose results are merged across all Pods.
+
+```graphql
+{
+ project(fullPath:"gitlab-org/gitlab"){
+ id, description
+ }
+ group(fullPath:"gitlab-com") {
+ id, description
+ }
+}
+```
+
+## 4. Evaluation
+
+## 4.1. Pros
+
+## 4.2. Cons
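
Note on section 3.2 of `pods-feature-graphql.md` above: routing by body means the router has to find a routable entity inside the query text. The following is a rough sketch of that idea, assuming only top-level `project` and `group` fields with a `fullPath` argument are routable. It uses a regular expression rather than a real GraphQL parser, so it is illustrative only; the name `routable_entity` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumption: only these fields carry a routable `fullPath` argument.
ROUTABLE_FIELDS = ("project", "group")

_ENTITY_PATTERN = re.compile(
    r'\b(' + "|".join(ROUTABLE_FIELDS) + r')\s*\(\s*fullPath\s*:\s*"([^"]+)"'
)


def routable_entity(query: str) -> Optional[Tuple[str, str]]:
    """Return (field, full_path) for the first routable entity, or None."""
    match = _ENTITY_PATTERN.search(query)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

With the "good example" query from section 3.2 this returns `("project", "gitlab-org/gitlab")`; with the `mergeRequest(id: 1111)` query it returns `None`, which is the case the document calls not routable.
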
diff --git a/doc/architecture/blueprints/pods/pods-feature-organizations.md b/doc/architecture/blueprints/pods/pods-feature-organizations.md
new file mode 100644
index 00000000000..a0a87458767
--- /dev/null
+++ b/doc/architecture/blueprints/pods/pods-feature-organizations.md
@@ -0,0 +1,58 @@
+---
+stage: enablement
+group: pods
+comments: false
+description: 'Pods: Organizations'
+---
+
+DISCLAIMER:
+This page may contain information related to upcoming products, features and
+functionality. It is important to note that the information presented is for
+informational purposes only, so please do not rely on the information for
+purchasing or planning purposes. Just like with all projects, the items
+mentioned on the page are subject to change or delay, and the development,
+release, and timing of any products, features, or functionality remain at the
+sole discretion of GitLab Inc.
+
+This document is a work-in-progress and represents a very early state of the
+Pods design. Significant aspects are not documented, though we expect to add
+them in the future. This is one possible architecture for Pods, and we intend to
+contrast this with alternatives before deciding which approach to implement.
+This documentation will be kept even if we decide not to implement this so that
+we can document the reasons for not choosing this approach.
+
+# Pods: Organizations
+
+One of the major design aspects of the Pods architecture is strong isolation between Groups.
+Organizations, as described by this blueprint, provide a way to have a plausible UX
+for joining together many Groups that are isolated from the rest of the system.
+
+## 1. Definition
+
+Pods require that all groups and projects of a single organization be
+stored on a single Pod, since a Pod can only access data that it holds locally
+and has very limited capabilities to read information from other Pods.
+
+Pods with Organizations require strong isolation between organizations.
+
+This will have significant implications for various user-facing features,
+like Todos, dropdowns for selecting projects, references to other issues
+or projects, or any other social functions present in GitLab. Today those functions
+can reference anything in the whole system. With the introduction of
+organizations, such references will be forbidden.
+
+This problem definition aims to describe the effort and implications of adding
+strong isolation between organizations to the system, including the features affected
+and their data processing flow. The purpose is to ensure that our solution, when
+implemented, consistently avoids data leakage between organizations residing on
+a single Pod.
+
+## 2. Data flow
+
+## 3. Proposal
+
+## 4. Evaluation
+
+## 4.1. Pros
+
+## 4.2. Cons
diff --git a/doc/architecture/blueprints/pods/pods-feature-router-endpoints-classification.md b/doc/architecture/blueprints/pods/pods-feature-router-endpoints-classification.md
new file mode 100644
index 00000000000..c672342fff9
--- /dev/null
+++ b/doc/architecture/blueprints/pods/pods-feature-router-endpoints-classification.md
@@ -0,0 +1,46 @@
+---
+stage: enablement
+group: pods
+comments: false
+description: 'Pods: Router Endpoints Classification'
+---
+
+DISCLAIMER:
+This page may contain information related to upcoming products, features and
+functionality. It is important to note that the information presented is for
+informational purposes only, so please do not rely on the information for
+purchasing or planning purposes. Just like with all projects, the items
+mentioned on the page are subject to change or delay, and the development,
+release, and timing of any products, features, or functionality remain at the
+sole discretion of GitLab Inc.
+
+This document is a work-in-progress and represents a very early state of the
+Pods design. Significant aspects are not documented, though we expect to add
+them in the future. This is one possible architecture for Pods, and we intend to
+contrast this with alternatives before deciding which approach to implement.
+This documentation will be kept even if we decide not to implement this so that
+we can document the reasons for not choosing this approach.
+
+# Pods: Router Endpoints Classification
+
+Classification of all endpoints is essential to properly route requests
+hitting the load balancer of a GitLab installation to a Pod that can serve them.
+
+Each Pod should be able to decode each request and classify which Pod
+it belongs to.
+
+GitLab currently implements hundreds of endpoints. This document tries
+to describe various techniques that can be implemented to allow Rails
+to provide this information efficiently.
+
+## 1. Definition
+
+## 2. Data flow
+
+## 3. Proposal
+
+## 4. Evaluation
+
+## 4.1. Pros
+
+## 4.2. Cons
diff --git a/doc/architecture/blueprints/pods/pods-feature-template.md b/doc/architecture/blueprints/pods/pods-feature-template.md
new file mode 100644
index 00000000000..dfae21b5406
--- /dev/null
+++ b/doc/architecture/blueprints/pods/pods-feature-template.md
@@ -0,0 +1,29 @@
+---
+stage: enablement
+group: pods
+comments: false
+description: 'Pods: Problem A'
+---
+
+This document is a work-in-progress and represents a very early state of the
+Pods design. Significant aspects are not documented, though we expect to add
+them in the future. This is one possible architecture for Pods, and we intend to
+contrast this with alternatives before deciding which approach to implement.
+This documentation will be kept even if we decide not to implement this so that
+we can document the reasons for not choosing this approach.
+
+# Pods: A
+
+> TL;DR
+
+## 1. Definition
+
+## 2. Data flow
+
+## 3. Proposal
+
+## 4. Evaluation
+
+## 4.1. Pros
+
+## 4.2. Cons
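
Note on the file that follows, `proposal-stateless-router-with-buffering-requests.md`: it describes a router that picks a random pod in its region when it has no cached route, replays the buffered request when a pod answers `302` with `X-Gitlab-Pod-Redirect`, and remembers the path prefix from `X-Gitlab-Pod-Cache`. A minimal sketch of that loop is given here. The `Router` class, the `(status, headers, body)` handler shape, and the use of plain strings as header values are all assumptions for illustration; the proposal does not define a header encoding.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical handler shape: a pod maps a path to (status, headers, body).
Response = Tuple[int, Dict[str, str], str]
Pod = Callable[[str], Response]


class Router:
    def __init__(self, pods: Dict[str, Pod], local_pods: List[str]) -> None:
        self.pods = pods
        self.local_pods = local_pods     # pods in the router's own region
        self.cache: Dict[str, str] = {}  # path prefix -> pod name

    def _cached_pod(self, path: str) -> Optional[str]:
        for prefix, pod in self.cache.items():
            if path.startswith(prefix):
                return pod
        return None

    def handle(self, path: str) -> Response:
        # No cached route: pick any pod in the local region.
        pod = self._cached_pod(path) or random.choice(self.local_pods)
        status, headers, body = self.pods[pod](path)
        if status == 302 and "X-Gitlab-Pod-Redirect" in headers:
            # The request was buffered, so it can be replayed on the owner.
            pod = headers["X-Gitlab-Pod-Redirect"]
            status, headers, body = self.pods[pod](path)
        prefix = headers.get("X-Gitlab-Pod-Cache")
        if prefix:
            # Remember which pod serves this prefix for later requests.
            self.cache[prefix] = pod
        return status, headers, body
```

In this sketch a first request for `/my-company/my-project` may reach two pods, while a later request under the same prefix goes straight to the cached pod, matching the request flows in the proposal.
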
diff --git a/doc/architecture/blueprints/pods/proposal-stateless-router-with-buffering-requests.md b/doc/architecture/blueprints/pods/proposal-stateless-router-with-buffering-requests.md
new file mode 100644
index 00000000000..21aa72273fe
--- /dev/null
+++ b/doc/architecture/blueprints/pods/proposal-stateless-router-with-buffering-requests.md
@@ -0,0 +1,648 @@
+---
+stage: enablement
+group: pods
+comments: false
+description: 'Pods Stateless Router Proposal'
+---
+
+This document is a work-in-progress and represents a very early state of the
+Pods design. Significant aspects are not documented, though we expect to add
+them in the future. This is one possible architecture for Pods, and we intend to
+contrast this with alternatives before deciding which approach to implement.
+This documentation will be kept even if we decide not to implement this so that
+we can document the reasons for not choosing this approach.
+
+# Proposal: Stateless Router
+
+We will decompose `gitlab_users`, `gitlab_routes` and `gitlab_admin` related
+tables so that they can be shared between all pods and allow any pod to
+authenticate a user and route requests to the correct pod. Pods may receive
+requests for the resources they don't own, but they know how to redirect back
+to the correct pod.
+
+The router is stateless and does not read from the `routes` database which
+means that all interactions with the database still happen from the Rails
+monolith. This architecture also supports regions by allowing for low traffic
+databases to be replicated across regions.
+
+Users are not directly exposed to the concept of Pods but instead they see
+different data dependent on their currently chosen "organization".
+[Organizations](index.md#organizations) will be a new model introduced to enforce isolation in the
+application and allow us to decide which requests route to which pod, since an
+organization can only be on a single pod.
+
+## Differences
+
+The main difference between this proposal and the one [with learning routes](proposal-stateless-router-with-routes-learning.md)
+is that this proposal always sends requests to any of the Pods. If a request cannot be processed,
+it is bounced back with relevant headers. This requires the request to be buffered.
+It allows Rails to decode the request from either the URI or the request body.
+This means that each request might be sent and processed more than once as a result.
+
+The [with learning routes proposal](proposal-stateless-router-with-routes-learning.md) requires that
+routable information is always encoded in the URI, and the router sends a pre-flight request.
+
+## Summary in diagrams
+
+This shows how a user request routes via DNS to the nearest router and the router chooses a pod to send the request to.
+
+```mermaid
+graph TD;
+ user((User));
+ dns[DNS];
+ router_us(Router);
+ router_eu(Router);
+ pod_us0{Pod US0};
+ pod_us1{Pod US1};
+ pod_eu0{Pod EU0};
+ pod_eu1{Pod EU1};
+ user-->dns;
+ dns-->router_us;
+ dns-->router_eu;
+ subgraph Europe
+ router_eu-->pod_eu0;
+ router_eu-->pod_eu1;
+ end
+ subgraph United States
+ router_us-->pod_us0;
+ router_us-->pod_us1;
+ end
+```
+
+<details><summary>More detail</summary>
+
+This shows that the router can actually send requests to any pod. The user will
+get the closest router to them geographically.
+
+```mermaid
+graph TD;
+ user((User));
+ dns[DNS];
+ router_us(Router);
+ router_eu(Router);
+ pod_us0{Pod US0};
+ pod_us1{Pod US1};
+ pod_eu0{Pod EU0};
+ pod_eu1{Pod EU1};
+ user-->dns;
+ dns-->router_us;
+ dns-->router_eu;
+ subgraph Europe
+ router_eu-->pod_eu0;
+ router_eu-->pod_eu1;
+ end
+ subgraph United States
+ router_us-->pod_us0;
+ router_us-->pod_us1;
+ end
+ router_eu-.->pod_us0;
+ router_eu-.->pod_us1;
+ router_us-.->pod_eu0;
+ router_us-.->pod_eu1;
+```
+
+</details>
+
+<details><summary>Even more detail</summary>
+
+This shows the databases. `gitlab_users` and `gitlab_routes` exist only in the
+US region but are replicated to other regions. Replication is not shown with an
+arrow because it would make the diagram too hard to read.
+
+```mermaid
+graph TD;
+ user((User));
+ dns[DNS];
+ router_us(Router);
+ router_eu(Router);
+ pod_us0{Pod US0};
+ pod_us1{Pod US1};
+ pod_eu0{Pod EU0};
+ pod_eu1{Pod EU1};
+ db_gitlab_users[(gitlab_users Primary)];
+ db_gitlab_routes[(gitlab_routes Primary)];
+ db_gitlab_users_replica[(gitlab_users Replica)];
+ db_gitlab_routes_replica[(gitlab_routes Replica)];
+ db_pod_us0[(gitlab_main/gitlab_ci Pod US0)];
+ db_pod_us1[(gitlab_main/gitlab_ci Pod US1)];
+ db_pod_eu0[(gitlab_main/gitlab_ci Pod EU0)];
+ db_pod_eu1[(gitlab_main/gitlab_ci Pod EU1)];
+ user-->dns;
+ dns-->router_us;
+ dns-->router_eu;
+ subgraph Europe
+ router_eu-->pod_eu0;
+ router_eu-->pod_eu1;
+ pod_eu0-->db_pod_eu0;
+ pod_eu0-->db_gitlab_users_replica;
+ pod_eu0-->db_gitlab_routes_replica;
+ pod_eu1-->db_gitlab_users_replica;
+ pod_eu1-->db_gitlab_routes_replica;
+ pod_eu1-->db_pod_eu1;
+ end
+ subgraph United States
+ router_us-->pod_us0;
+ router_us-->pod_us1;
+ pod_us0-->db_pod_us0;
+ pod_us0-->db_gitlab_users;
+ pod_us0-->db_gitlab_routes;
+ pod_us1-->db_gitlab_users;
+ pod_us1-->db_gitlab_routes;
+ pod_us1-->db_pod_us1;
+ end
+ router_eu-.->pod_us0;
+ router_eu-.->pod_us1;
+ router_us-.->pod_eu0;
+ router_us-.->pod_eu1;
+```
+
+</details>
+
+## Summary of changes
+
+1. Tables related to User data (including profile settings, authentication credentials, personal access tokens) are decomposed into a `gitlab_users` schema
+1. The `routes` table is decomposed into `gitlab_routes` schema
+1. The `application_settings` (and probably a few other instance level tables) are decomposed into `gitlab_admin` schema
+1. A new column `routes.pod_id` is added to `routes` table
+1. A new Router service exists to choose which pod to route a request to.
+1. A new concept will be introduced in GitLab called an organization and a user can select a "default organization" and this will be a user level setting. The default organization is used to redirect users away from ambiguous routes like `/dashboard` to organization scoped routes like `/organizations/my-organization/-/dashboard`. Legacy users will have a special default organization that allows them to keep using global resources on `Pod US0`. All existing namespaces will initially move to this public organization.
+1. If a pod receives a request for a `routes.pod_id` that it does not own, it returns a `302` with an `X-Gitlab-Pod-Redirect` header so that the router can send the request to the correct pod. The correct pod can also set a header `X-Gitlab-Pod-Cache` which contains information about how this request should be cached to remember the pod. For example if the request was `/gitlab-org/gitlab` then the header would encode `/gitlab-org/* => Pod US0` (ie. any requests starting with `/gitlab-org/` can always be routed to `Pod US0`).
+1. When the router does not know (from the cache) which pod to send a request to, it picks a random pod within its region
+1. Writes to `gitlab_users` and `gitlab_routes` are sent to a primary PostgreSQL server in our `US` region but reads can come from replicas in the same region. This will add latency for these writes but we expect they are infrequent relative to the rest of GitLab.
+
+## Detailed explanation of default organization in the first iteration
+
+All users will get a new column `users.default_organization` which they can
+control in user settings. We will introduce a concept of the
+`GitLab.com Public` organization. This will be set as the default organization for all existing
+users. This organization will allow the user to see data from all namespaces in
+`Pod US0` (ie. our original GitLab.com instance). This behavior can be invisible to
+existing users such that they don't even get told when they are viewing a
+global page like `/dashboard` that it's even scoped to an organization.
+
+Any new users with a default organization other than `GitLab.com Public` will have
+a distinct user experience and will be fully aware that every page they load is
+only ever scoped to a single organization. These users can never
+load any global pages like `/dashboard` and will end up being redirected to
+`/organizations/<DEFAULT_ORGANIZATION>/-/dashboard`. This may also be the case
+for legacy APIs and such users may only ever be able to use APIs scoped to an
+organization.
+
+## Detailed explanation of Admin Area settings
+
+We believe that maintaining and synchronizing Admin Area settings will be
+frustrating and painful so to avoid this we will decompose and share all Admin Area
+settings in the `gitlab_admin` schema. This should be safe (similar to other
+shared schemas) because these receive very little write traffic.
+
+In cases where different pods need different settings (eg. the
+Elasticsearch URL), we will either use a templated
+format in the relevant `application_settings` row, which allows it to be dynamic
+per pod, or, if that proves difficult, introduce a new table
+called `per_pod_application_settings` with 1 row per pod to allow
+different settings per pod. It will still be part of the `gitlab_admin`
+schema and shared, which will allow us to centrally manage it and simplify
+keeping settings in sync for all pods.
+
+## Pros
+
+1. Router is stateless and can live in many regions. We use Anycast DNS to resolve to the nearest region for the user.
+1. Pods can receive requests for namespaces in the wrong pod and the user
+ still gets the right response. The router also caches the
+ mapping, which ensures the next request is sent to the
+ correct pod.
+1. The majority of the code still lives in the `gitlab` Rails codebase. The Router doesn't actually need to understand how GitLab URLs are composed.
+1. Since the responsibility to read and write `gitlab_users`,
+ `gitlab_routes` and `gitlab_admin` still lives in Rails it means minimal
+ changes will be needed to the Rails application compared to extracting
+ services that need to isolate the domain models and build new interfaces.
+1. Compared to a separate routing service this allows the Rails application
+ to encode more complex rules around how to map URLs to the correct pod
+ and may work for some existing API endpoints.
+1. All the new infrastructure (just a router) is optional and a single-pod
+ self-managed installation does not even need to run the Router and there are
+ no other new services.
+
+## Cons
+
+1. `gitlab_users`, `gitlab_routes` and `gitlab_admin` databases may need to be
+ replicated across regions and writes need to go across regions. We need to
+ do an analysis on write TPS for the relevant tables to determine if this is
+ feasible.
+1. Sharing access to the database from many different Pods means that they are
+ all coupled at the Postgres schema level, and this means changes to the
+ database schema need to be done carefully in sync with the deployment of all
+ Pods. This requires us to keep Pods on closely similar
+ versions, compared to an architecture with shared services that have an API
+ we control.
+1. Although most data is stored in the right region there can be requests
+ proxied from another region which may be an issue for certain types
+ of compliance.
+1. Data in `gitlab_users` and `gitlab_routes` databases must be replicated in
+ all regions which may be an issue for certain types of compliance.
+1. The router cache may need to be very large if we get a wide variety of URLs
+ (ie. long tail). In such a case we may need to implement a 2nd level of
+ caching in user cookies so their frequently accessed pages always go to the
+ right pod the first time.
+1. Having shared database access for `gitlab_users` and `gitlab_routes`
+ from multiple pods is an unusual architecture decision compared to
+ extracting services that are called from multiple pods.
+1. It is very likely we won't be able to find cacheable elements of a
+ GraphQL URL and often existing GraphQL endpoints are heavily dependent on
+ ids that won't be in the `routes` table so pods won't necessarily know
+ what pod has the data. As such we'll probably have to update our GraphQL
+ calls to include an organization context in the path like
+ `/api/organizations/<organization>/graphql`.
+1. This architecture implies that implemented endpoints can only access data
+ that is readily accessible on a given Pod, and are unlikely
+ to be able to aggregate information from many Pods.
+1. All unknown routes are sent to the latest deployment, which we assume to be `Pod US0`.
+ This is required as newly added endpoints will only be decodable by the latest pod.
+ This Pod could later redirect to the correct one that can serve the given request.
+ Since request processing might be heavy, some Pods might receive a significant amount
+ of traffic as a result.
+
+## Example database configuration
+
+Handling shared `gitlab_users`, `gitlab_routes` and `gitlab_admin` databases, while having dedicated `gitlab_main` and `gitlab_ci` databases, should already be supported by the way we use `config/database.yml`. We should also already be able to handle the dedicated EU replicas while having a single US primary for `gitlab_users` and `gitlab_routes`. Below is a snippet of part of the database configuration for the Pod architecture described above.
+
+<details><summary>Pod US0</summary>
+
+```yaml
+# config/database.yml
+production:
+ main:
+ host: postgres-main.pod-us0.primary.consul
+ load_balancing:
+ discovery: postgres-main.pod-us0.replicas.consul
+ ci:
+ host: postgres-ci.pod-us0.primary.consul
+ load_balancing:
+ discovery: postgres-ci.pod-us0.replicas.consul
+ users:
+ host: postgres-users-primary.consul
+ load_balancing:
+ discovery: postgres-users-replicas.us.consul
+ routes:
+ host: postgres-routes-primary.consul
+ load_balancing:
+ discovery: postgres-routes-replicas.us.consul
+ admin:
+ host: postgres-admin-primary.consul
+ load_balancing:
+ discovery: postgres-admin-replicas.us.consul
+```
+
+</details>
+
+<details><summary>Pod EU0</summary>
+
+```yaml
+# config/database.yml
+production:
+ main:
+ host: postgres-main.pod-eu0.primary.consul
+ load_balancing:
+ discovery: postgres-main.pod-eu0.replicas.consul
+ ci:
+ host: postgres-ci.pod-eu0.primary.consul
+ load_balancing:
+ discovery: postgres-ci.pod-eu0.replicas.consul
+ users:
+ host: postgres-users-primary.consul
+ load_balancing:
+ discovery: postgres-users-replicas.eu.consul
+ routes:
+ host: postgres-routes-primary.consul
+ load_balancing:
+ discovery: postgres-routes-replicas.eu.consul
+ admin:
+ host: postgres-admin-primary.consul
+ load_balancing:
+ discovery: postgres-admin-replicas.eu.consul
+```
+
+</details>
+
+## Request flows
+
+1. `gitlab-org` is a top level namespace and lives in `Pod US0` in the `GitLab.com Public` organization
+1. `my-company` is a top level namespace and lives in `Pod EU0` in the `my-organization` organization
+
+### Experience for paying user that is part of `my-organization`
+
+Such a user will have a default organization set to `/my-organization` and will be
+unable to load any global routes outside of this organization. They may load other
+projects/namespaces but their MR/Todo/Issue counts at the top of the page will
+not be correctly populated in the first iteration. The user will be aware of
+this limitation.
+
+#### Navigates to `/my-company/my-project` while logged in
+
+1. User is in Europe so DNS resolves to the router in Europe
+1. They request `/my-company/my-project` without the router cache, so the router randomly chooses `Pod EU1`
+1. `Pod EU1` does not have `/my-company`, but it knows that it lives in `Pod EU0` so it redirects the router to `Pod EU0`
+1. `Pod EU0` returns the correct response as well as setting the cache headers for the router `/my-company/* => Pod EU0`
+1. The router now caches and remembers any request paths matching `/my-company/*` should go to `Pod EU0`
+
+```mermaid
+sequenceDiagram
+ participant user as User
+ participant router_eu as Router EU
+ participant pod_eu0 as Pod EU0
+ participant pod_eu1 as Pod EU1
+ user->>router_eu: GET /my-company/my-project
+ router_eu->>pod_eu1: GET /my-company/my-project
+ pod_eu1->>router_eu: 302 /my-company/my-project X-Gitlab-Pod-Redirect={pod:Pod EU0}
+ router_eu->>pod_eu0: GET /my-company/my-project
+ pod_eu0->>user: <h1>My Project... X-Gitlab-Pod-Cache={path_prefix:/my-company/}
+```
+
+#### Navigates to `/my-company/my-project` while not logged in
+
+1. User is in Europe so DNS resolves to the router in Europe
+1. The router does not have `/my-company/*` cached yet so it randomly chooses `Pod EU1`
+1. `Pod EU1` redirects them through a login flow
+1. They still request `/my-company/my-project` without the router cache, so the router again chooses a random pod, `Pod EU1`
+1. `Pod EU1` does not have `/my-company`, but it knows that it lives in `Pod EU0` so it redirects the router to `Pod EU0`
+1. `Pod EU0` returns the correct response as well as setting the cache headers for the router `/my-company/* => Pod EU0`
+1. The router now caches and remembers any request paths matching `/my-company/*` should go to `Pod EU0`
+
+```mermaid
+sequenceDiagram
+ participant user as User
+ participant router_eu as Router EU
+ participant pod_eu0 as Pod EU0
+ participant pod_eu1 as Pod EU1
+ user->>router_eu: GET /my-company/my-project
+ router_eu->>pod_eu1: GET /my-company/my-project
+ pod_eu1->>user: 302 /users/sign_in?redirect=/my-company/my-project
+ user->>router_eu: GET /users/sign_in?redirect=/my-company/my-project
+ router_eu->>pod_eu1: GET /users/sign_in?redirect=/my-company/my-project
+ pod_eu1->>user: <h1>Sign in...
+ user->>router_eu: POST /users/sign_in?redirect=/my-company/my-project
+ router_eu->>pod_eu1: POST /users/sign_in?redirect=/my-company/my-project
+ pod_eu1->>user: 302 /my-company/my-project
+ user->>router_eu: GET /my-company/my-project
+ router_eu->>pod_eu1: GET /my-company/my-project
+ pod_eu1->>router_eu: 302 /my-company/my-project X-Gitlab-Pod-Redirect={pod:Pod EU0}
+ router_eu->>pod_eu0: GET /my-company/my-project
+ pod_eu0->>user: <h1>My Project... X-Gitlab-Pod-Cache={path_prefix:/my-company/}
+```
+
+#### Navigates to `/my-company/my-other-project` after last step
+
+1. User is in Europe so DNS resolves to the router in Europe
+1. The router cache now has `/my-company/* => Pod EU0`, so the router chooses `Pod EU0`
+1. `Pod EU0` returns the correct response as well as the cache header again
+
+```mermaid
+sequenceDiagram
+ participant user as User
+ participant router_eu as Router EU
+ participant pod_eu0 as Pod EU0
+ participant pod_eu1 as Pod EU1
+ user->>router_eu: GET /my-company/my-other-project
+ router_eu->>pod_eu0: GET /my-company/my-other-project
+ pod_eu0->>user: <h1>My Other Project... X-Gitlab-Pod-Cache={path_prefix:/my-company/}
+```
+
+#### Navigates to `/gitlab-org/gitlab` after last step
+
+1. User is in Europe so DNS resolves to the router in Europe
+1. The router has no cached value for this URL so randomly chooses `Pod EU0`
+1. `Pod EU0` redirects the router to `Pod US0`
+1. `Pod US0` returns the correct response as well as the cache header again
+
+```mermaid
+sequenceDiagram
+ participant user as User
+ participant router_eu as Router EU
+ participant pod_eu0 as Pod EU0
+ participant pod_us0 as Pod US0
+ user->>router_eu: GET /gitlab-org/gitlab
+ router_eu->>pod_eu0: GET /gitlab-org/gitlab
+ pod_eu0->>router_eu: 302 /gitlab-org/gitlab X-Gitlab-Pod-Redirect={pod:Pod US0}
+ router_eu->>pod_us0: GET /gitlab-org/gitlab
+ pod_us0->>user: <h1>GitLab.org... X-Gitlab-Pod-Cache={path_prefix:/gitlab-org/}
+```
+
+In this case the user is not on their "default organization" so their TODO
+counter will not include their normal todos. We may choose to highlight this in
+the UI somewhere. A future iteration may be able to fetch that for them from
+their default organization.
+
+#### Navigates to `/`
+
+1. User is in Europe so DNS resolves to the router in Europe
+1. Router does not have a cache for `/` route (specifically Rails never tells it to cache this route)
+1. The Router chooses `Pod EU0` randomly
+1. The Rails application knows the user's default organization is `/my-organization`, so
+ it redirects the user to `/organizations/my-organization/-/dashboard`
+1. The Router has a cached value for `/organizations/my-organization/*` so it then sends the
+ request to `Pod EU0`
+1. `Pod EU0` serves up a new page `/organizations/my-organization/-/dashboard` which is the same
+ dashboard view we have today but scoped to an organization clearly in the UI
+1. The user is (optionally) presented with a message saying that data on this page is only
+ from their default organization and that they can change their default
+ organization if it's not right.
+
+```mermaid
+sequenceDiagram
+ participant user as User
+ participant router_eu as Router EU
+ participant pod_eu0 as Pod EU0
+ user->>router_eu: GET /
+ router_eu->>pod_eu0: GET /
+ pod_eu0->>user: 302 /organizations/my-organization/-/dashboard
+ user->>router_eu: GET /organizations/my-organization/-/dashboard
+ router_eu->>pod_eu0: GET /organizations/my-organization/-/dashboard
+ pod_eu0->>user: <h1>My Company Dashboard... X-Gitlab-Pod-Cache={path_prefix:/organizations/my-organization/}
+```
+
+#### Navigates to `/dashboard`
+
+As above, they will end up on `/organizations/my-organization/-/dashboard` as
+the rails application will already redirect `/` to the dashboard page.
+
+#### Navigates to `/not-my-company/not-my-project` while logged in (but they don't have access since this project/group is private)
+
+1. User is in Europe so DNS resolves to the router in Europe
+1. The router knows that `/not-my-company` lives in `Pod US1` so sends the request to this pod
+1. The user does not have access so `Pod US1` returns 404
+
+```mermaid
+sequenceDiagram
+ participant user as User
+ participant router_eu as Router EU
+ participant pod_us1 as Pod US1
+ user->>router_eu: GET /not-my-company/not-my-project
+ router_eu->>pod_us1: GET /not-my-company/not-my-project
+ pod_us1->>user: 404
+```
+
+#### Creates a new top level namespace
+
+The user will be asked which organization they want the namespace to belong to.
+If they select `my-organization` then it will end up on the same pod as all
+other namespaces in `my-organization`. If they select nothing we default to
+`GitLab.com Public` and it is clear to the user that this is isolated from
+their existing organization such that they won't be able to see data from both
+on a single page.
+
+### Experience for GitLab team member that is part of `/gitlab-org`
+
+Such a user is considered a legacy user and has their default organization set to
+`GitLab.com Public`. This is a "meta" organization that does not really exist but
+the Rails application knows to interpret this organization to mean that they are
+allowed to use legacy global functionality like `/dashboard` to see data across
+namespaces located on `Pod US0`. The rails backend also knows that the default pod to render any ambiguous
+routes like `/dashboard` is `Pod US0`. Lastly the user will be allowed to
+navigate to organizations on another pod like `/my-organization` but when they do the
+user will see a message indicating that some data may be missing (for example, the
+MRs/Issues/Todos counts).
+
+#### Navigates to `/gitlab-org/gitlab` while not logged in
+
+1. User is in the US so DNS resolves to the US router
+1. The router knows that `/gitlab-org` lives in `Pod US0` so sends the request
+ to this pod
+1. `Pod US0` serves up the response
+
+```mermaid
+sequenceDiagram
+ participant user as User
+ participant router_us as Router US
+ participant pod_us0 as Pod US0
+ user->>router_us: GET /gitlab-org/gitlab
+ router_us->>pod_us0: GET /gitlab-org/gitlab
+ pod_us0->>user: <h1>GitLab.org... X-Gitlab-Pod-Cache={path_prefix:/gitlab-org/}
+```
+
+#### Navigates to `/`
+
+1. User is in US so DNS resolves to the router in US
+1. Router does not have a cache for `/` route (specifically rails never tells it to cache this route)
+1. The Router chooses `Pod US1` randomly
+1. The Rails application knows the user's default organization is `GitLab.com Public`, so
+ it redirects the user to `/dashboard` (only legacy users can see
+ `/dashboard` global view)
+1. Router does not have a cache for `/dashboard` route (specifically Rails never tells it to cache this route)
+1. The Router chooses `Pod US1` randomly
+1. The Rails application knows the user's default organization is `GitLab.com Public`, so
+ it allows the user to load `/dashboard` (only legacy users can see
+ `/dashboard` global view) and redirects the router to the legacy pod, which is `Pod US0`
+1. `Pod US0` serves up the global view dashboard page `/dashboard` which is the same
+ dashboard view we have today
+
+```mermaid
+sequenceDiagram
+ participant user as User
+ participant router_us as Router US
+ participant pod_us0 as Pod US0
+ participant pod_us1 as Pod US1
+ user->>router_us: GET /
+ router_us->>pod_us1: GET /
+ pod_us1->>user: 302 /dashboard
+ user->>router_us: GET /dashboard
+ router_us->>pod_us1: GET /dashboard
+ pod_us1->>router_us: 302 /dashboard X-Gitlab-Pod-Redirect={pod:Pod US0}
+ router_us->>pod_us0: GET /dashboard
+ pod_us0->>user: <h1>Dashboard...
+```
+
+#### Navigates to `/my-company/my-other-project` while logged in (but they don't have access since this project is private)
+
+They get a 404.
+
+### Experience for non-logged in users
+
+Flow is similar to logged in users except global routes like `/dashboard` will
+redirect to the login page as there is no default organization to choose from.
+
+### A new customer signs up
+
+They will be asked if they are already part of an organization or if they'd
+like to create one. If they choose neither they end up on the default
+`GitLab.com Public` organization.
+
+### An organization is moved from 1 pod to another
+
+TODO
+
+### GraphQL/API requests which don't include the namespace in the URL
+
+TODO
+
+### The autocomplete suggestion functionality in the search bar which remembers recent issues/MRs
+
+TODO
+
+### Global search
+
+TODO
+
+## Administrator
+
+### Loads `/admin` page
+
+1. Router picks a random pod `Pod US0`
+1. `Pod US0` redirects the user to `/admin/pods/podus0`
+1. `Pod US0` renders an Admin Area page and also returns a cache header to cache `/admin/pods/podus0/* => Pod US0`. The Admin Area page contains a dropdown list showing other pods they could select, and selecting one changes the query parameter.
+
+Admin Area settings in Postgres are all shared across all pods to avoid
+divergence but we still make it clear in the URL and UI which pod is serving
+the Admin Area page as there is dynamic data being generated from these pages and
+the operator may want to view a specific pod.
+
+## More Technical Problems To Solve
+
+### Replicating User Sessions Between All Pods
+
+Today user sessions live in Redis, but each pod will have its own Redis instance. We already use a dedicated Redis instance for sessions, so we could consider sharing this with all pods like we do with the `gitlab_users` PostgreSQL database. But an important consideration will be latency, as we would still want to mostly fetch sessions from the same region.
+
+An alternative might be that user sessions get moved to a JWT payload that encodes all the session data but this has downsides. For example, it is difficult to expire a user session, when their password changes or for other reasons, if the session lives in a JWT controlled by the user.
+
+### How do we migrate between Pods
+
+Migrating data between pods will need to factor all data stores:
+
+1. PostgreSQL
+1. Redis Shared State
+1. Gitaly
+1. Elasticsearch
+
+### Is it still possible to leak the existence of private groups via a timing attack?
+
+If you have a router in the EU, and you know that the EU router by default redirects
+to Pods located in the EU, you know their latency (let's assume 10 ms). Now, if your
+request is bounced back and redirected to the US, which has a different latency
+(let's assume that the round trip will be around 60 ms), you can deduce that the 404 was
+returned by a US Pod and know that your 404 is in fact a 403.
+
+We may defer this until we actually implement a pod in a different region. Such timing attacks are already theoretically possible with the way we do permission checks today but the timing difference is probably too small to be able to detect.
+
+One technique to mitigate this risk might be to have the router add a random
+delay to any request that returns 404 from a pod.
+
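A minimal sketch of the random-delay mitigation, assuming the router can hold a response before returning it. The function name `delay_for_status` and the 100 ms ceiling are illustrative assumptions and not part of the proposal. The ceiling would presumably need to exceed the inter-region latency gap (around 50 ms in the example above) for the jitter to mask it.

```python
import random


def delay_for_status(status: int, max_jitter_ms: float = 100.0, rng=random) -> float:
    """Return how many milliseconds to hold a response before sending it.

    Only 404 responses are delayed, so successful requests keep their latency.
    `max_jitter_ms` is an assumed tuning knob, not a value from the proposal.
    """
    if status != 404:
        return 0.0
    # uniform() draws a float between 0 and max_jitter_ms; a seeded
    # random.Random can be passed as `rng` to make tests reproducible.
    return rng.uniform(0.0, max_jitter_ms)
```

A router would sleep for the returned duration before forwarding the pod's 404 to the user. Whether uniform jitter is enough to hide the signal from an attacker who averages many requests is an open question this sketch does not settle.
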
+## Should runners be shared across all pods?
+
+We have 2 options and we should decide which is easier:
+
+1. Decompose runner registration and queuing tables and share them across all
+ pods. This may have implications for scalability, and we'd need to consider
+ if this would include group/project runners as this may have scalability
+ concerns as these are high traffic tables that would need to be shared.
+1. Runners are registered per pod, and we probably have a separate fleet of
+ runners for every pod, or just register the same runners to many pods, which
+ may have implications for queueing.
+
+## How do we guarantee unique ids across all pods for things that cannot conflict?
+
+This project assumes at least namespaces and projects have unique ids across
+all pods as many requests need to be routed based on their ID. Since those
+tables are across different databases then guaranteeing a unique ID will
+require a new solution. There are likely other tables where unique IDs are
+necessary and depending on how we resolve routing for GraphQL and other APIs
+and other design goals it may be determined that we want the primary key to be
+unique for all tables.
diff --git a/doc/architecture/blueprints/pods/proposal-stateless-router-with-routes-learning.md b/doc/architecture/blueprints/pods/proposal-stateless-router-with-routes-learning.md
new file mode 100644
index 00000000000..e7520f3d6a8
--- /dev/null
+++ b/doc/architecture/blueprints/pods/proposal-stateless-router-with-routes-learning.md
@@ -0,0 +1,672 @@
+---
+stage: enablement
+group: pods
+comments: false
+description: 'Pods Stateless Router Proposal'
+---
+
+This document is a work-in-progress and represents a very early state of the
+Pods design. Significant aspects are not documented, though we expect to add
+them in the future. This is one possible architecture for Pods, and we intend to
+contrast this with alternatives before deciding which approach to implement.
+This documentation will be kept even if we decide not to implement this so that
+we can document the reasons for not choosing this approach.
+
+# Proposal: Stateless Router
+
+We will decompose `gitlab_users`, `gitlab_routes` and `gitlab_admin` related
+tables so that they can be shared between all pods and allow any pod to
+authenticate a user and route requests to the correct pod. Pods may receive
+requests for the resources they don't own, but they know how to redirect back
+to the correct pod.
+
+The router is stateless and does not read from the `routes` database which
+means that all interactions with the database still happen from the Rails
+monolith. This architecture also supports regions by allowing for low traffic
+databases to be replicated across regions.
+
+Users are not directly exposed to the concept of Pods but instead they see
+different data dependent on their currently chosen "organization".
+[Organizations](index.md#organizations) will be a new model introduced to enforce isolation in the
+application and allow us to decide which requests route to which pod, since an
+organization can only be on a single pod.
+
+## Differences
+
+The main difference between this proposal and one [with buffering requests](proposal-stateless-router-with-buffering-requests.md)
+is that this proposal uses a pre-flight API request (`/api/v4/pods/learn`) to redirect the request body to the correct Pod.
+This means that each request is sent exactly once to be processed, but the URI is used to decode which Pod it should be directed to.
+
+## Summary in diagrams
+
+This shows how a user request routes via DNS to the nearest router and the router chooses a pod to send the request to.
+
+```mermaid
+graph TD;
+ user((User));
+ dns[DNS];
+ router_us(Router);
+ router_eu(Router);
+ pod_us0{Pod US0};
+ pod_us1{Pod US1};
+ pod_eu0{Pod EU0};
+ pod_eu1{Pod EU1};
+ user-->dns;
+ dns-->router_us;
+ dns-->router_eu;
+ subgraph Europe
+ router_eu-->pod_eu0;
+ router_eu-->pod_eu1;
+ end
+ subgraph United States
+ router_us-->pod_us0;
+ router_us-->pod_us1;
+ end
+```
+
+### More detail
+
+This shows that the router can actually send requests to any pod. The user will
+get the closest router to them geographically.
+
+```mermaid
+graph TD;
+ user((User));
+ dns[DNS];
+ router_us(Router);
+ router_eu(Router);
+ pod_us0{Pod US0};
+ pod_us1{Pod US1};
+ pod_eu0{Pod EU0};
+ pod_eu1{Pod EU1};
+ user-->dns;
+ dns-->router_us;
+ dns-->router_eu;
+ subgraph Europe
+ router_eu-->pod_eu0;
+ router_eu-->pod_eu1;
+ end
+ subgraph United States
+ router_us-->pod_us0;
+ router_us-->pod_us1;
+ end
+ router_eu-.->pod_us0;
+ router_eu-.->pod_us1;
+ router_us-.->pod_eu0;
+ router_us-.->pod_eu1;
+```
+
+### Even more detail
+
+This shows the databases. `gitlab_users` and `gitlab_routes` exist only in the
+US region but are replicated to other regions. Replication does not have an
+arrow because it's too hard to read the diagram.
+
+```mermaid
+graph TD;
+ user((User));
+ dns[DNS];
+ router_us(Router);
+ router_eu(Router);
+ pod_us0{Pod US0};
+ pod_us1{Pod US1};
+ pod_eu0{Pod EU0};
+ pod_eu1{Pod EU1};
+ db_gitlab_users[(gitlab_users Primary)];
+ db_gitlab_routes[(gitlab_routes Primary)];
+ db_gitlab_users_replica[(gitlab_users Replica)];
+ db_gitlab_routes_replica[(gitlab_routes Replica)];
+ db_pod_us0[(gitlab_main/gitlab_ci Pod US0)];
+ db_pod_us1[(gitlab_main/gitlab_ci Pod US1)];
+ db_pod_eu0[(gitlab_main/gitlab_ci Pod EU0)];
+ db_pod_eu1[(gitlab_main/gitlab_ci Pod EU1)];
+ user-->dns;
+ dns-->router_us;
+ dns-->router_eu;
+ subgraph Europe
+ router_eu-->pod_eu0;
+ router_eu-->pod_eu1;
+ pod_eu0-->db_pod_eu0;
+ pod_eu0-->db_gitlab_users_replica;
+ pod_eu0-->db_gitlab_routes_replica;
+ pod_eu1-->db_gitlab_users_replica;
+ pod_eu1-->db_gitlab_routes_replica;
+ pod_eu1-->db_pod_eu1;
+ end
+ subgraph United States
+ router_us-->pod_us0;
+ router_us-->pod_us1;
+ pod_us0-->db_pod_us0;
+ pod_us0-->db_gitlab_users;
+ pod_us0-->db_gitlab_routes;
+ pod_us1-->db_gitlab_users;
+ pod_us1-->db_gitlab_routes;
+ pod_us1-->db_pod_us1;
+ end
+ router_eu-.->pod_us0;
+ router_eu-.->pod_us1;
+ router_us-.->pod_eu0;
+ router_us-.->pod_eu1;
+```
+
+## Summary of changes
+
+1. Tables related to User data (including profile settings, authentication credentials, personal access tokens) are decomposed into a `gitlab_users` schema
+1. The `routes` table is decomposed into `gitlab_routes` schema
+1. The `application_settings` (and probably a few other instance level tables) are decomposed into `gitlab_admin` schema
+1. A new column `routes.pod_id` is added to `routes` table
+1. A new Router service exists to choose which pod to route a request to.
+1. If a router receives a new request it will send `/api/v4/pods/learn?method=GET&path_info=/group-org/project` to learn which Pod can process it
+1. A new concept will be introduced in GitLab called an organization
+1. We require all existing endpoints to be routable by URI, or be fixed to a specific Pod for processing. This requires changing ambiguous endpoints like `/dashboard` to be scoped like `/organizations/my-organization/-/dashboard`
+1. Endpoints like `/admin` would always be routed to a specific Pod, like `pod_0`
+1. Each Pod can respond to `/api/v4/pods/learn` and classify each endpoint
+1. Writes to `gitlab_users` and `gitlab_routes` are sent to a primary PostgreSQL server in our `US` region but reads can come from replicas in the same region. This will add latency for these writes but we expect they are infrequent relative to the rest of GitLab.
+
+## Pre-flight request learning
+
+While processing a request the URI will be decoded and a pre-flight request
+will be sent for each non-cached endpoint.
+
+When asking for the endpoint GitLab Rails will return information about
+the routable path. GitLab Rails will decode `path_info` and match it to
+an existing endpoint and find a routable entity (like project). The router will
+treat this as short-lived cache information.
+
+1. Prefix match: `/api/v4/pods/learn?method=GET&path_info=/gitlab-org/gitlab-test/-/issues`
+
+ ```json
+ {
+ "path": "/gitlab-org/gitlab-test",
+ "pod": "pod_0",
+ "source": "routable"
+ }
+ ```
+
+1. Some endpoints might require an exact match: `/api/v4/pods/learn?method=GET&path_info=/-/profile`
+
+ ```json
+ {
+ "path": "/-/profile",
+ "pod": "pod_0",
+ "source": "fixed",
+ "exact": true
+ }
+ ```
+
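The two response shapes above suggest how the router might match later requests against what it has learned. The following is a minimal sketch under stated assumptions: `LearnedRoute` and `RouteCache` are hypothetical names, the longest-match rule is an assumption, and cache expiry (the "short-lived" part) is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class LearnedRoute:
    """One decoded /api/v4/pods/learn response (fields follow the JSON above)."""
    path: str
    pod: str
    source: str
    exact: bool = False


class RouteCache:
    """Hypothetical in-memory router cache keyed by learned path."""

    def __init__(self) -> None:
        self._routes: Dict[str, LearnedRoute] = {}

    def store(self, route: LearnedRoute) -> None:
        self._routes[route.path] = route

    def lookup(self, path_info: str) -> Optional[str]:
        """Return the cached pod for a request path, or None on a cache miss."""
        best: Optional[LearnedRoute] = None
        for route in self._routes.values():
            if route.exact:
                matches = path_info == route.path
            else:
                # A prefix entry covers the path itself and anything below it,
                # but not a sibling such as /gitlab-org/gitlab-test-2.
                matches = (path_info == route.path
                           or path_info.startswith(route.path + "/"))
            # Assumption: when several entries match, the longest path wins.
            if matches and (best is None or len(route.path) > len(best.path)):
                best = route
        return best.pod if best else None
```

With the two example responses stored, `/gitlab-org/gitlab-test/-/issues` resolves to `pod_0` through the prefix entry, while `/-/profile/keys` is a miss because the `/-/profile` entry is marked `exact`.
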
+## Detailed explanation of default organization in the first iteration
+
+All users will get a new column `users.default_organization` which they can
+control in user settings. We will introduce a concept of the
+`GitLab.com Public` organization. This will be set as the default organization for all existing
+users. This organization will allow the user to see data from all namespaces in
+`Pod US0` (ie. our original GitLab.com instance). This behavior can be invisible to
+existing users such that they don't even get told when they are viewing a
+global page like `/dashboard` that it's even scoped to an organization.
+
+Any new users with a default organization other than `GitLab.com Public` will have
+a distinct user experience and will be fully aware that every page they load is
+only ever scoped to a single organization. These users can never
+load any global pages like `/dashboard` and will end up being redirected to
+`/organizations/<DEFAULT_ORGANIZATION>/-/dashboard`. This may also be the case
+for legacy APIs and such users may only ever be able to use APIs scoped to an
+organization.
+
+## Detailed explanation of Admin Area settings
+
+We believe that maintaining and synchronizing Admin Area settings will be
+frustrating and painful so to avoid this we will decompose and share all Admin Area
+settings in the `gitlab_admin` schema. This should be safe (similar to other
+shared schemas) because these receive very little write traffic.
+
+In cases where different pods need different settings (for example, the
+Elasticsearch URL), we will either use a templated
+format in the relevant `application_settings` row, which allows it to be dynamic
+per pod, or, if that proves difficult, introduce a new table
+called `per_pod_application_settings` with 1 row per pod to allow
+setting different settings per pod. It will still be part of the `gitlab_admin`
+schema and shared, which will allow us to centrally manage it and simplify
+keeping settings in sync for all pods.
+
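To illustrate the templated-format option, here is a minimal sketch of how a pod might expand a shared setting at read time. The `${pod}` placeholder syntax, the `render_setting` helper, and the example hostname are all assumptions made for illustration; the proposal does not specify a template format.

```python
from string import Template


def render_setting(template: str, pod: str) -> str:
    """Expand a per-pod placeholder in a shared setting value.

    `template` is the value stored once in the shared settings row; `pod`
    is the identifier of the pod that is reading it.
    """
    # substitute() raises KeyError if the template names an unknown
    # placeholder, which surfaces misconfigured settings early.
    return Template(template).substitute(pod=pod)
```

For example, `render_setting("https://elasticsearch.${pod}.example.com", "pod-eu0")` yields a URL specific to that pod while the stored row stays identical for every pod.
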
+## Pros
+
+1. Router is stateless and can live in many regions. We use Anycast DNS to resolve to nearest region for the user.
+1. Pods can receive requests for namespaces in the wrong pod and the user
+ still gets the right response. The router also caches the result, which
+ ensures the next request is sent to the correct pod.
+1. The majority of the code still lives in `gitlab` rails codebase. The Router doesn't actually need to understand how GitLab URLs are composed.
+1. Since the responsibility to read and write `gitlab_users`,
+ `gitlab_routes` and `gitlab_admin` still lives in Rails it means minimal
+ changes will be needed to the Rails application compared to extracting
+ services that need to isolate the domain models and build new interfaces.
+1. Compared to a separate routing service this allows the Rails application
+ to encode more complex rules around how to map URLs to the correct pod
+ and may work for some existing API endpoints.
+1. All the new infrastructure (just a router) is optional and a single-pod
+ self-managed installation does not even need to run the Router and there are
+ no other new services.
+
+## Cons
+
+1. `gitlab_users`, `gitlab_routes` and `gitlab_admin` databases may need to be
+ replicated across regions and writes need to go across regions. We need to
+ do an analysis on write TPS for the relevant tables to determine if this is
+ feasible.
+1. Sharing access to the database from many different Pods means that they are
+ all coupled at the Postgres schema level and this means changes to the
+ database schema need to be done carefully in sync with the deployment of all
+ Pods. This requires us to keep all Pods on closely matching
+ versions, unlike an architecture with shared services that have an API
+ we control.
+1. Although most data is stored in the right region there can be requests
+ proxied from another region which may be an issue for certain types
+ of compliance.
+1. Data in `gitlab_users` and `gitlab_routes` databases must be replicated in
+ all regions which may be an issue for certain types of compliance.
+1. The router cache may need to be very large if we get a wide variety of URLs
+ (ie. long tail). In such a case we may need to implement a 2nd level of
+ caching in user cookies so their frequently accessed pages always go to the
+ right pod the first time.
+1. Having shared database access for `gitlab_users` and `gitlab_routes`
+ from multiple pods is an unusual architecture decision compared to
+ extracting services that are called from multiple pods.
+1. It is very likely we won't be able to find cacheable elements of a
+ GraphQL URL and often existing GraphQL endpoints are heavily dependent on
+ ids that won't be in the `routes` table so pods won't necessarily know
+ what pod has the data. As such we'll probably have to update our GraphQL
+ calls to include an organization context in the path like
+ `/api/organizations/<organization>/graphql`.
+1. This architecture implies that implemented endpoints can only access data
+ that are readily accessible on a given Pod, but are unlikely
+ to aggregate information from many Pods.
+1. All unknown routes are sent to the latest deployment, which we assume to be `Pod US0`.
+ This is required as newly added endpoints will only be decodable by the latest pod.
+ This is likely not a problem for `/pods/learn`, as it is lightweight
+ to process and should not cause a performance impact.
+
+## Example database configuration
+
+Handling shared `gitlab_users`, `gitlab_routes` and `gitlab_admin` databases, while having dedicated `gitlab_main` and `gitlab_ci` databases, should already be handled by the way we use `config/database.yml`. We should also already be able to handle the dedicated EU replicas while having a single US primary for `gitlab_users` and `gitlab_routes`. Below is a snippet of part of the database configuration for the Pod architecture described above.
+
+**Pod US0**:
+
+```yaml
+# config/database.yml
+production:
+ main:
+ host: postgres-main.pod-us0.primary.consul
+ load_balancing:
+ discovery: postgres-main.pod-us0.replicas.consul
+ ci:
+ host: postgres-ci.pod-us0.primary.consul
+ load_balancing:
+ discovery: postgres-ci.pod-us0.replicas.consul
+ users:
+ host: postgres-users-primary.consul
+ load_balancing:
+ discovery: postgres-users-replicas.us.consul
+ routes:
+ host: postgres-routes-primary.consul
+ load_balancing:
+ discovery: postgres-routes-replicas.us.consul
+ admin:
+ host: postgres-admin-primary.consul
+ load_balancing:
+ discovery: postgres-admin-replicas.us.consul
+```
+
+**Pod EU0**:
+
+```yaml
+# config/database.yml
+production:
+ main:
+ host: postgres-main.pod-eu0.primary.consul
+ load_balancing:
+ discovery: postgres-main.pod-eu0.replicas.consul
+ ci:
+ host: postgres-ci.pod-eu0.primary.consul
+ load_balancing:
+ discovery: postgres-ci.pod-eu0.replicas.consul
+ users:
+ host: postgres-users-primary.consul
+ load_balancing:
+ discovery: postgres-users-replicas.eu.consul
+ routes:
+ host: postgres-routes-primary.consul
+ load_balancing:
+ discovery: postgres-routes-replicas.eu.consul
+ admin:
+ host: postgres-admin-primary.consul
+ load_balancing:
+ discovery: postgres-admin-replicas.eu.consul
+```
+
+## Request flows
+
+1. `gitlab-org` is a top level namespace and lives in `Pod US0` in the `GitLab.com Public` organization
+1. `my-company` is a top level namespace and lives in `Pod EU0` in the `my-organization` organization
+
+### Experience for paying user that is part of `my-organization`
+
+Such a user will have a default organization set to `/my-organization` and will be
+unable to load any global routes outside of this organization. They may load other
+projects/namespaces but their MR/Todo/Issue counts at the top of the page will
+not be correctly populated in the first iteration. The user will be aware of
+this limitation.
+
+#### Navigates to `/my-company/my-project` while logged in
+
+1. User is in Europe so DNS resolves to the router in Europe
+1. They request `/my-company/my-project` without the router cache, so the router randomly chooses `Pod EU1`
+1. The `/pods/learn` request is sent to `Pod EU1`, which responds that the resource lives on `Pod EU0`
+1. `Pod EU0` returns the correct response
+1. The router now caches and remembers any request paths matching `/my-company/*` should go to `Pod EU0`
+
+```mermaid
+sequenceDiagram
+ participant user as User
+ participant router_eu as Router EU
+ participant pod_eu0 as Pod EU0
+ participant pod_eu1 as Pod EU1
+ user->>router_eu: GET /my-company/my-project
+ router_eu->>pod_eu1: /api/v4/pods/learn?method=GET&path_info=/my-company/my-project
+ pod_eu1->>router_eu: {path: "/my-company", pod: "pod_eu0", source: "routable"}
+ router_eu->>pod_eu0: GET /my-company/my-project
+ pod_eu0->>user: <h1>My Project...
+```
+
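The flow above can be sketched as a single routing decision: consult the cache, and on a miss send the pre-flight request to a random pod in the router's own region. This is a simplified illustration, not the proposed implementation: `choose_pod` and `LearnFn` are hypothetical names, the cache is a plain dictionary of path prefixes, and exact-match routes and cache expiry are left out.

```python
import random
from typing import Callable, Dict, List

# Stand-in for the HTTP call to /api/v4/pods/learn on a given pod.
# Arguments: (pod to ask, path_info); returns the decoded JSON response.
LearnFn = Callable[[str, str], Dict[str, str]]


def choose_pod(
    path_info: str,
    cache: Dict[str, str],
    local_pods: List[str],
    learn: LearnFn,
) -> str:
    """Return the pod that should receive the request for `path_info`."""
    for prefix, pod in cache.items():
        if path_info == prefix or path_info.startswith(prefix + "/"):
            return pod  # cache hit: no pre-flight request is needed
    # Cache miss: ask a random pod in the router's own region where to go.
    answer = learn(random.choice(local_pods), path_info)
    # Remember the learned prefix so later requests skip the pre-flight call.
    cache[answer["path"]] = answer["pod"]
    return answer["pod"]
```

After the first request for `/my-company/my-project` stores `/my-company => pod_eu0`, a later request for `/my-company/my-other-project` is answered from the cache without calling `learn` again.
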
+#### Navigates to `/my-company/my-project` while not logged in
+
+1. User is in Europe so DNS resolves to the router in Europe
+1. The router does not have `/my-company/*` cached yet so it randomly chooses `Pod EU1`
+1. The `/pods/learn` request is sent to `Pod EU1`, which responds that the resource lives on `Pod EU0`
+1. `Pod EU0` redirects them through a login flow
+1. The user requests `/users/sign_in`, and the router uses a random Pod to run `/pods/learn`
+1. `Pod EU1` responds with `pod_0` as a fixed route
+1. After login, the user requests `/my-company/my-project`, which is cached and lives in `Pod EU0`
+1. `Pod EU0` returns the correct response
+
+```mermaid
+sequenceDiagram
+ participant user as User
+ participant router_eu as Router EU
+ participant pod_eu0 as Pod EU0
+ participant pod_eu1 as Pod EU1
+ user->>router_eu: GET /my-company/my-project
+ router_eu->>pod_eu1: /api/v4/pods/learn?method=GET&path_info=/my-company/my-project
+ pod_eu1->>router_eu: {path: "/my-company", pod: "pod_eu0", source: "routable"}
+ router_eu->>pod_eu0: GET /my-company/my-project
+ pod_eu0->>user: 302 /users/sign_in?redirect=/my-company/my-project
+ user->>router_eu: GET /users/sign_in?redirect=/my-company/my-project
+ router_eu->>pod_eu1: /api/v4/pods/learn?method=GET&path_info=/users/sign_in
+ pod_eu1->>router_eu: {path: "/users", pod: "pod_eu0", source: "fixed"}
+ router_eu->>pod_eu0: GET /users/sign_in?redirect=/my-company/my-project
+ pod_eu0-->>user: <h1>Sign in...
+ user->>router_eu: POST /users/sign_in?redirect=/my-company/my-project
+ router_eu->>pod_eu0: POST /users/sign_in?redirect=/my-company/my-project
+ pod_eu0->>user: 302 /my-company/my-project
+ user->>router_eu: GET /my-company/my-project
+ router_eu->>pod_eu0: GET /my-company/my-project
+ pod_eu0->>user: <h1>My Project...
+```
+
+#### Navigates to `/my-company/my-other-project` after last step
+
+1. User is in Europe so DNS resolves to the router in Europe
+1. The router cache now has `/my-company/* => Pod EU0`, so the router chooses `Pod EU0`
+1. `Pod EU0` returns the correct response as well as the cache header again
+
+```mermaid
+sequenceDiagram
+ participant user as User
+ participant router_eu as Router EU
+ participant pod_eu0 as Pod EU0
+ participant pod_eu1 as Pod EU1
+ user->>router_eu: GET /my-company/my-other-project
+ router_eu->>pod_eu0: GET /my-company/my-other-project
+ pod_eu0->>user: <h1>My Other Project...
+```
+
+#### Navigates to `/gitlab-org/gitlab` after last step
+
+1. User is in Europe so DNS resolves to the router in Europe
+1. The router has no cached value for this URL so randomly chooses `Pod EU0`
+1. `Pod EU0` redirects the router to `Pod US0`
+1. `Pod US0` returns the correct response as well as the cache header again
+
+```mermaid
+sequenceDiagram
+ participant user as User
+ participant router_eu as Router EU
+ participant pod_eu0 as Pod EU0
+ participant pod_us0 as Pod US0
+ user->>router_eu: GET /gitlab-org/gitlab
+ router_eu->>pod_eu0: /api/v4/pods/learn?method=GET&path_info=/gitlab-org/gitlab
+ pod_eu0->>router_eu: {path: "/gitlab-org", pod: "pod_us0", source: "routable"}
+ router_eu->>pod_us0: GET /gitlab-org/gitlab
+ pod_us0->>user: <h1>GitLab.org...
+```
+
+In this case the user is not on their "default organization" so their TODO
+counter will not include their normal todos. We may choose to highlight this in
+the UI somewhere. A future iteration may be able to fetch that for them from
+their default organization.
+
+#### Navigates to `/`
+
+1. User is in Europe so DNS resolves to the router in Europe
+1. Router does not have a cache for `/` route (specifically Rails never tells it to cache this route)
+1. The Router chooses `Pod EU0` randomly
+1. The Rails application knows the user's default organization is `/my-organization`, so
+ it redirects the user to `/organizations/my-organization/-/dashboard`
+1. The Router has a cached value for `/organizations/my-organization/*` so it then sends the
+ request to `Pod EU0`
+1. `Pod EU0` serves up a new page `/organizations/my-organization/-/dashboard` which is the same
+ dashboard view we have today but scoped to an organization clearly in the UI
+1. The user is (optionally) presented with a message saying that data on this page is only
+ from their default organization and that they can change their default
+ organization if it's not right.
+
+```mermaid
+sequenceDiagram
+ participant user as User
+ participant router_eu as Router EU
+ participant pod_eu0 as Pod EU0
+ user->>router_eu: GET /
+ router_eu->>pod_eu0: GET /
+ pod_eu0->>user: 302 /organizations/my-organization/-/dashboard
+ user->>router_eu: GET /organizations/my-organization/-/dashboard
+ router_eu->>pod_eu0: GET /organizations/my-organization/-/dashboard
+ pod_eu0->>user: <h1>My Company Dashboard... X-Gitlab-Pod-Cache={path_prefix:/organizations/my-organization/}
+```
+
+#### Navigates to `/dashboard`
+
+As above, they will end up on `/organizations/my-organization/-/dashboard` as
+the rails application will already redirect `/` to the dashboard page.
+
+#### Navigates to `/not-my-company/not-my-project` while logged in (but they don't have access since this project/group is private)
+
+1. User is in Europe so DNS resolves to the router in Europe
+1. The router knows that `/not-my-company` lives in `Pod US1` so sends the request to this pod
+1. The user does not have access so `Pod US1` returns 404
+
+```mermaid
+sequenceDiagram
+ participant user as User
+ participant router_eu as Router EU
+ participant pod_us1 as Pod US1
+ user->>router_eu: GET /not-my-company/not-my-project
+ router_eu->>pod_us1: GET /not-my-company/not-my-project
+ pod_us1->>user: 404
+```
+
+#### Creates a new top level namespace
+
+The user will be asked which organization they want the namespace to belong to.
+If they select `my-organization` then it will end up on the same pod as all
+other namespaces in `my-organization`. If they select nothing we default to
+`GitLab.com Public` and it is clear to the user that this is isolated from
+their existing organization such that they won't be able to see data from both
+on a single page.
+
+### Experience for GitLab team member that is part of `/gitlab-org`
+
+Such a user is considered a legacy user and has their default organization set to
+`GitLab.com Public`. This is a "meta" organization that does not really exist but
+the Rails application knows to interpret this organization to mean that they are
+allowed to use legacy global functionality like `/dashboard` to see data across
+namespaces located on `Pod US0`. The Rails backend also knows that the default pod to render any ambiguous
+routes like `/dashboard` is `Pod US0`. Lastly the user will be allowed to
+navigate to organizations on another pod like `/my-organization` but when they do the
+user will see a message indicating that some data may be missing (for example, the
+MRs/Issues/Todos counts).
+
+#### Navigates to `/gitlab-org/gitlab` while not logged in
+
+1. User is in the US so DNS resolves to the US router
+1. The router knows that `/gitlab-org` lives in `Pod US0` so sends the request
+ to this pod
+1. `Pod US0` serves up the response
+
+```mermaid
+sequenceDiagram
+ participant user as User
+ participant router_us as Router US
+ participant pod_us0 as Pod US0
+ user->>router_us: GET /gitlab-org/gitlab
+ router_us->>pod_us0: GET /gitlab-org/gitlab
+ pod_us0->>user: <h1>GitLab.org...
+```
+
+#### Navigates to `/`
+
+1. User is in the US so DNS resolves to the router in the US
+1. Router does not have a cache for `/` route (specifically Rails never tells it to cache this route)
+1. The Router chooses `Pod US1` randomly
+1. The Rails application knows the user's default organization is `GitLab.com Public`, so
+ it redirects the user to `/dashboard` (only legacy users can see
+ the `/dashboard` global view)
+1. Router does not have a cache for `/dashboard` route (specifically Rails never tells it to cache this route)
+1. The Router chooses `Pod US1` randomly
+1. The Rails application knows the user's default organization is `GitLab.com Public`, so
+ it allows the user to load `/dashboard` (only legacy users can see
+ the `/dashboard` global view) and directs the router to the legacy pod, which is `Pod US0`
+1. `Pod US0` serves up the global view dashboard page `/dashboard` which is the same
+ dashboard view we have today
+
+```mermaid
+sequenceDiagram
+ participant user as User
+ participant router_us as Router US
+ participant pod_us0 as Pod US0
+ participant pod_us1 as Pod US1
+ user->>router_us: GET /
+ router_us->>pod_us1: GET /
+ pod_us1->>user: 302 /dashboard
+ user->>router_us: GET /dashboard
+ router_us->>pod_us1: /api/v4/pods/learn?method=GET&path_info=/dashboard
+ pod_us1->>router_us: {path: "/dashboard", pod: "pod_us0", source: "routable"}
+ router_us->>pod_us0: GET /dashboard
+ pod_us0->>user: <h1>Dashboard...
+```
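
The learn exchange in the diagram above can be sketched as a small Python function. The response shape (`path`, `pod`, `source`) is taken from the diagram; everything else is an assumption for illustration, including the `learn` function name, the `LEGACY_POD` and `GLOBAL_ROUTES` constants, and the rule that a namespace is identified by the first path segment.

```python
LEGACY_POD = "pod_us0"  # assumption: the pod that serves legacy global routes
GLOBAL_ROUTES = {"/dashboard"}  # assumption: routes treated as ambiguous


def learn(path, namespace_pods):
    """Return a learn-style response for a path, or None if it is unknown."""
    if path in GLOBAL_ROUTES:
        # Ambiguous global routes are pinned to the legacy pod.
        return {"path": path, "pod": LEGACY_POD, "source": "routable"}
    # Use the top-level namespace (first path segment) as the routing key.
    top_level = "/" + path.strip("/").split("/")[0]
    pod = namespace_pods.get(top_level)
    if pod is None:
        return None
    return {"path": top_level, "pod": pod, "source": "routable"}
```

The router would then forward the original request to the returned pod, as the last two arrows of the diagram show.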
+
+#### Navigates to `/my-company/my-other-project` while logged in (but they don't have access since this project is private)
+
+They get a 404.
+
+### Experience for non-logged in users
+
+Flow is similar to logged in users except global routes like `/dashboard` will
+redirect to the login page as there is no default organization to choose from.
+
+### A new customer signs up
+
+They will be asked if they are already part of an organization or if they'd
+like to create one. If they choose neither they end up on the default
+`GitLab.com Public` organization.
+
+### An organization is moved from 1 pod to another
+
+TODO
+
+### GraphQL/API requests which don't include the namespace in the URL
+
+TODO
+
+### The autocomplete suggestion functionality in the search bar which remembers recent issues/MRs
+
+TODO
+
+### Global search
+
+TODO
+
+## Administrator
+
+### Loads `/admin` page
+
+1. The `/admin` is locked to `Pod US0`
+1. Some endpoints of `/admin`, like Projects in Admin, are scoped to a Pod
+ and users need to choose the correct one in a dropdown, which results in an endpoint
+ like `/admin/pods/pod_0/projects`.
+
+Admin Area settings in Postgres are all shared across all pods to avoid
+divergence but we still make it clear in the URL and UI which pod is serving
+the Admin Area page as there is dynamic data being generated from these pages and
+the operator may want to view a specific pod.
+
+## More Technical Problems To Solve
+
+### Replicating User Sessions Between All Pods
+
+Today user sessions live in Redis but each pod will have its own Redis instance. We already use a dedicated Redis instance for sessions so we could consider sharing this with all pods like we do with the `gitlab_users` PostgreSQL database. But an important consideration will be latency as we would still want to mostly fetch sessions from the same region.
+
+An alternative might be that user sessions get moved to a JWT payload that encodes all the session data but this has downsides. For example, it is difficult to expire a user session, when their password changes or for other reasons, if the session lives in a JWT controlled by the user.
+
+### How do we migrate between Pods
+
+Migrating data between pods will need to factor in all data stores:
+
+1. PostgreSQL
+1. Redis Shared State
+1. Gitaly
+1. Elasticsearch
+
+### Is it still possible to leak the existence of private groups via a timing attack?
+
+If you have a router in the EU, and you know that the EU router by default redirects
+to pods located in the EU, you know their latency (let's assume 10 ms). Now, if your
+request is bounced back and redirected to the US, which has a different latency
+(let's assume that the round trip will be around 60 ms), you can deduce that the 404 was
+returned by a US pod and know that your 404 is in fact a 403.
+
+We may defer this until we actually implement a pod in a different region. Such timing attacks are already theoretically possible with the way we do permission checks today but the timing difference is probably too small to be able to detect.
+
+One technique to mitigate this risk might be to have the router add a random
+delay to any request that returns 404 from a pod.
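
A minimal sketch of that mitigation follows. The function name `delay_for_response`, the uniform distribution, and the `max_delay_s` default are assumptions chosen for illustration. Whether uniform jitter of this size is enough to mask a cross-region latency gap is not established here.

```python
import random


def delay_for_response(status_code, rng, max_delay_s=0.1):
    """Return extra seconds the router waits before relaying a response."""
    if status_code == 404:
        # Blur the latency signal by adding a random delay to every 404.
        return rng.uniform(0.0, max_delay_s)
    return 0.0
```

A router loop would sleep for the returned value before sending the response to the client. Passing the random generator in as `rng` keeps the behaviour reproducible in tests.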
+
+## Should runners be shared across all pods?
+
+We have 2 options and we should decide which is easier:
+
+1. Decompose runner registration and queuing tables and share them across all
+ pods. This may have implications for scalability, and we'd need to consider
+ whether this would include group/project runners, because these are
+ high-traffic tables that would need to be shared.
+1. Runners are registered per pod. We would probably have a separate fleet of
+ runners for every pod, or register the same runners to many pods, which
+ may have implications for queueing.
+
+## How do we guarantee unique ids across all pods for things that cannot conflict?
+
+This project assumes at least namespaces and projects have unique ids across
+all pods as many requests need to be routed based on their ID. Because those
+tables are spread across different databases, guaranteeing a unique ID will
+require a new solution. There are likely other tables where unique IDs are
+necessary and depending on how we resolve routing for GraphQL and other APIs
+and other design goals it may be determined that we want the primary key to be
+unique for all tables.
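
The blueprint does not choose a solution. Purely as an illustration of the problem, the sketch below shows one well-known option, interleaved sequences, where each pod starts at a different offset and steps by the number of pods. The `pod_id_sequence` helper and its parameters are hypothetical and are not proposed by the text above.

```python
import itertools


def pod_id_sequence(pod_index, pod_count, start=1):
    """Return an iterator of IDs that no other pod index can produce."""
    if not 0 <= pod_index < pod_count:
        raise ValueError("pod_index must be in [0, pod_count)")
    # Pod k yields start + k, start + k + pod_count, start + k + 2 * pod_count, ...
    return itertools.count(start + pod_index, pod_count)
```

With three pods, pod 0 yields 1, 4, 7 and pod 1 yields 2, 5, 8, so the sets never overlap. A known drawback of this scheme is that the stride has to be fixed in advance, so adding pods later needs extra planning.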
diff --git a/doc/architecture/blueprints/pods/term-cluster.png b/doc/architecture/blueprints/pods/term-cluster.png
deleted file mode 100644
index f52e31b52ad..00000000000
--- a/doc/architecture/blueprints/pods/term-cluster.png
+++ /dev/null
Binary files differ
diff --git a/doc/architecture/blueprints/pods/term-organization.png b/doc/architecture/blueprints/pods/term-organization.png
deleted file mode 100644
index f605adb124d..00000000000
--- a/doc/architecture/blueprints/pods/term-organization.png
+++ /dev/null
Binary files differ
diff --git a/doc/architecture/blueprints/rate_limiting/index.md b/doc/architecture/blueprints/rate_limiting/index.md
index 2ed66f22b53..ffe0712d69b 100644
--- a/doc/architecture/blueprints/rate_limiting/index.md
+++ b/doc/architecture/blueprints/rate_limiting/index.md
@@ -1,8 +1,11 @@
---
-stage: none
-group: unassigned
-comments: false
-description: 'Next Rate Limiting Architecture'
+status: ready
+creation-date: "2022-09-08"
+authors: [ "@grzesiek", "@marshall007", "@fabiopitino", "@hswimelar" ]
+coach: "@andrewn"
+approvers: [ "@sgoldstein" ]
+owning-stage:
+participating-stages: []
---
# Next Rate Limiting Architecture
@@ -35,18 +38,6 @@ stack.
This blueprint has been written to consolidate our limits and to describe the
vision of our next rate limiting and policies enforcement architecture.
-_Disclaimer: The following contains information related to upcoming products,
-features, and functionality._
-
-_It is important to note that the information presented is for informational
-purposes only. Please do not rely on this information for purchasing or
-planning purposes._
-
-_As with all projects, the items mentioned in this document and linked pages are
-subject to change or delay. The development, release and timing of any
-products, features, or functionality remain at the sole discretion of GitLab
-Inc._
-
## Goals
**Implement a next architecture for rate limiting and policies definition.**
@@ -361,6 +352,31 @@ hierarchy. Choosing a proper solution will require a thoughtful research.
1. Maintain consistent features and behavior across SaaS and self-managed codebase.
1. Be mindful about a cognitive load added by the hierarchical limits, aim to reduce it.
+## Phases and iterations
+
+**Phase 1**: Compile examples of current most important application limits - Owning Team
+ a. Owning Team (in collaboration with Stage Groups) compiles a list of the
+ most important application limits used in Rails today.
+**Phase 2**: Implement Rate Limiting Framework in Rails - Owning Team
+ a. Triangulate rate limiting abstractions based on the data gathered in Phase 1
+ b. Develop YAML model for limits.
+ c. Build Rails SDK.
+ d. Create examples showcasing usage of the new rate limits SDK.
+**Phase 3**: Team Fanout of Rails SDK - Stage Groups
+ a. Individual stage groups begin using the SDK built in Phase 2 for new limits and policies.
+ b. Stage groups begin replacing historical ad hoc limit implementations with the SDK.
+ c. Provide a means to monitor and observe the progress of the replacement effort. Ideally this is broken down to the `feature_category` level to drive group-level buy-in - Owning Team.
+**Phase 4**: Enable Satellite Services to Use the Rate Limiting Framework - Owning Team
+ a. Determine if the goals of Phase 4 are best met by either
+ 1. Extracting the Rails rate limiting service into a decoupled service OR
+ 2. Implementing a separate Go library which uses the same backend (for example, Redis) for rate limiting.
+**Phase 5**: SDK for Satellite Services - Owning Team
+ a. Build Golang SDK.
+ b. Create examples showcasing usage of the new rate limits SDK.
+**Phase 6**: Team Fanout for Satellite Services - Stage Groups
+ a. Individual stage groups begin using the SDK built in Phase 5 for new limits and policies.
+ b. Stage groups begin replacing historical adhoc limit implementations with the SDK.
+
## Status
Request For Comments.
@@ -373,39 +389,3 @@ Request For Comments.
- 2022-07-06: A fourth, [consolidated proposal](https://gitlab.com/gitlab-org/gitlab/-/issues/364524#note_1017640650), has been submitted.
- 2022-07-12: Started working on the design document following [Architecture Evolution Workflow](https://about.gitlab.com/handbook/engineering/architecture/workflow/).
- 2022-09-08: The initial version of the blueprint has been merged.
-
-## Who
-
-Proposal:
-
-<!-- vale gitlab.Spelling = NO -->
-
-| Role | Who
-|------------------------------|-------------------------|
-| Author | Grzegorz Bizon |
-| Author | Fabio Pitino |
-| Author | Marshall Cottrell |
-| Author | Hayley Swimelar |
-| Engineering Leader | Sam Goldstein |
-| Product Manager | |
-| Architecture Evolution Coach | Andrew Newdigate |
-| Recommender | |
-| Recommender | |
-| Recommender | |
-| Recommender | |
-
-DRIs:
-
-| Role | Who
-|------------------------------|------------------------|
-| Leadership | |
-| Product | |
-| Engineering | |
-
-Domain experts:
-
-| Area | Who
-|------------------------------|------------------------|
-| | |
-
-<!-- vale gitlab.Spelling = YES -->
diff --git a/doc/architecture/blueprints/runner_scaling/index.md b/doc/architecture/blueprints/runner_scaling/index.md
index 415884449ed..24c6820f94a 100644
--- a/doc/architecture/blueprints/runner_scaling/index.md
+++ b/doc/architecture/blueprints/runner_scaling/index.md
@@ -1,8 +1,11 @@
---
-stage: none
-group: unassigned
-comments: false
-description: 'Next Runner Auto-scaling Architecture'
+status: accepted
+creation-date: "2022-01-19"
+authors: [ "@grzesiek", "@tmaczukin", "@josephburnett" ]
+coach: "@kamil"
+approvers: [ "@DarrenEastman" ]
+owning-stage: "~devops::verify"
+participating-stages: []
---
# Next Runner Auto-scaling Architecture
@@ -50,18 +53,6 @@ build on top of it to improve efficiency, reliability and availability.
We call this new mechanism the "next GitLab Runner Scaling architecture".
-_Disclaimer The following contain information related to upcoming products,
-features, and functionality._
-
-_It is important to note that the information presented is for informational
-purposes only. Please do not rely on this information for purchasing or
-planning purposes._
-
-_As with all projects, the items mentioned in this document and linked pages are
-subject to change or delay. The development, release and timing of any
-products, features, or functionality remain at the sole discretion of GitLab
-Inc._
-
## Continuing building on Docker Machine
At this moment one of our core products - GitLab Runner - and one of its most
@@ -210,7 +201,7 @@ easier to understand how it performs.
## Details
-How the abstraction for the custom provider will look exactly is something that
+How the abstraction will look exactly is something that
we will need to prototype, PoC and decide in a data-informed way. There are a
few proposals that we should describe in detail, develop requirements for, PoC
and score. We will choose the solution that seems to support our goals the
@@ -257,6 +248,10 @@ them each separately.
to the Runner system. These details are highly dependent on the VM
architecture and operating system as well as Executor type.
+See also Glossary below.
+
+#### Current state
+
The current architecture has several points of coupling between concerns.
Coupling reduces opportunities for abstraction (e.g. community supported
plugins) and increases complexity, making the code harder to understand,
@@ -391,7 +386,7 @@ for by the plugin.
Rationale: [Description of the Custom Executor Provider proposal](https://gitlab.com/gitlab-org/gitlab-runner/-/issues/28848#note_823321515)
-### Fleeting VM provider
+### Taskscaler provider
We can introduce a more simple version of the `Machine` abstraction in the
form of a "Fleeting" interface. Fleeting provides a low-level interface to
@@ -412,6 +407,22 @@ component so it can be used by multiple Runner Executors (not just `docker+autos
Rationale: [Description of the InstanceGroup / Fleeting proposal](https://gitlab.com/gitlab-org/gitlab-runner/-/issues/28848#note_823430883)
POC: [Merge request](https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/3315)
+## Glossary
+
+- **[GitLab Runner](../../../development/documentation/styleguide/word_list.md#gitlab-runner)** - the software application that you can choose to install and manage, whose source code is hosted at `gitlab.com/gitlab-org/gitlab-runner`.
+- **[runners](../../../development/documentation/styleguide/word_list.md#runner-runners)** - the runner is the agent that's responsible for running GitLab CI/CD jobs in an environment and reporting the results to a GitLab instance. It /1/ retrieves jobs from GitLab, /2/ configures a local or remote build environment, and /3/ executes jobs within the provisioned environment, passing along log data and status updates to GitLab.
+- **runner manager** - the runner process is often referred to as the `Runner Manager` as it manages multiple runners, which are the `[[runners]]` workers defined in the runners `config.toml` file.
+- **executor** - a concrete environment which can be prepared and used to run a job. A new executor is created for each job.
+- **executor provider** - an implementation capable of providing executors on demand. Executor providers are registered on import and initialized once when a runner starts up.
+- **custom executor** - works as an interface between GitLab Runner and a set of binaries or shell scripts with environment variable inputs that enable executing CI jobs in any host computing environment. New custom executors can be added to the system without making any changes to the GitLab Runner codebase.
+- **custom executor provider** - a new abstraction, proposed under the custom provider heading in the plugin boundary proposal section above, which allows new executor providers to be created without modifying the GitLab Runner codebase. The protocol could be similar to custom executors or done over gRPC. This abstraction places all the mechanics of producing executors within the plugin, delegating autoscaling and lifecycle management concerns to each implementation.
+- **taskscaler** - a new library, proposed under the taskscaler provider heading in the plugin boundary proposal section above, which is parameterized with a concrete executor provider and a fleeting provider. Taskscaler is responsible for the autoscaling concern and can be used to autoscale any executor provider using any VM shape. Taskscaler is also responsible for the runner-specific aspect of VM lifecycle and keeps track of how many jobs are using a given VM and how many times a VM has been used.
+- **fleeting** - a new library proposed along with taskscaler which provides abstractions for cloud provider VMs.
+- **fleeting instance group** - the abstraction that fleeting uses to represent a pool of like VMs. This would represent a GCP IGM or an AWS ASG (without the autoscaling). Instance groups can be increased, decreased or can provide connection details for a specific VM.
+- **fleeting plugin** - a concrete implementation of a fleeting instance group representing a specific IGM or ASG (when initialized). There will be N of these, one for each provider, each in its own project. We will own and maintain the core ones but some will be community supported. A new fleeting plugin can be created without making any changes to the runner, taskscaler or fleeting code bases. This makes it analogous to the custom executor provider in terms of self-service and decoupling, but along a different line of concerns.
+- **fleeting plugin Google Compute** - the fleeting plugin which creates GCP instances. This lives in a separate project from the fleeting and taskscaler.
+- **fleeting plugin AWS** - the fleeting plugin which creates AWS instances. This lives in a separate project from the fleeting and taskscaler.
+
## Status
Status: RFC.
diff --git a/doc/architecture/blueprints/runner_tokens/index.md b/doc/architecture/blueprints/runner_tokens/index.md
new file mode 100644
index 00000000000..3f8a27e503d
--- /dev/null
+++ b/doc/architecture/blueprints/runner_tokens/index.md
@@ -0,0 +1,227 @@
+---
+stage: Verify
+group: Runner
+comments: false
+description: 'Next Runner Token Architecture'
+---
+
+# Next GitLab Runner Token Architecture
+
+## Summary
+
+GitLab Runner is a core component of GitLab CI/CD that runs
+CI/CD jobs in a reliable and concurrent environment. Ever since the beginnings
+of the service as a Ruby program, runners are registered in a GitLab instance with
+a registration token - a randomly generated string of text. The registration token is unique for its given scope
+(instance, group, or project). The registration token proves that the party that registers the runner has
+administrator access to the instance, group, or project to which the runner is registered.
+
+This approach has worked well in the initial years, but some major known issues started to
+become apparent as the target audience grew:
+
+| Problem | Symptoms |
+|---------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Single token per scope | - The registration token is shared by multiple runners: <br/>- Single tokens lower the value of auditing and make traceability almost impossible; <br/>- Copied in many places for [self-registration of runners](https://docs.gitlab.com/runner/install/kubernetes.html#required-configuration); <br/>- Reports of users storing tokens in unsecured locations; <br/>- Makes rotation of tokens costly. <br/>- In the case of a security event affecting the whole instance, rotating tokens requires users to update a table of projects/namespaces, which takes a significant amount of time. |
+| No provision for automatic expiration | Requires manual intervention to change token. Addressed in [#30942](https://gitlab.com/gitlab-org/gitlab/-/issues/30942). |
+| No permissions model | Used to register a runner for protected branches, and for any tags. In this case, the registration token has permission to do everything. Effectively, someone taking possession of a registration token could steal secrets or source code. |
+| No traceability | Given that the token is not created by a user, and is accessible to all administrators, there is no possibility to know the source of a leaked token. |
+| No historical records | When reset, the previous value of the registration token is not stored so there is no historical data to enable deeper auditing and inspection. |
+| Token stored in project/namespace model | Inadvertent disclosure of token is possible. |
+| Too many registered runners | It is too straightforward to register a new runner using a well-known registration token. |
+
+In light of these issues, it is important that we redesign the way in which we connect runners to the GitLab instance so that we can guarantee traceability, security, and performance.
+
+We call this new mechanism the "next GitLab Runner Token architecture".
+
+## Proposal
+
+The proposal addresses the issues of a _single token per scope_ and _token storage_
+by eliminating the need for a registration token. Runner creation happens
+in the GitLab Runners settings page for the given scope, in the context of the logged-in user,
+which provides traceability. The page provides instructions to configure the newly-created
+runner in supported environments.
+
+The runner configuration will be generated through a new `deploy` command, which will leverage
+the `/runners/verify` REST endpoint to ensure the validity of the authentication token.
+The remaining concerns become non-issues due to the elimination of the registration token.
+
+The configuration can be applied across many machines by reusing the same instructions.
+A unique system identifier will be generated automatically if a value is missing from
+the runner entry in the `config.toml` file. This allows differentiating systems sharing the same
+runner token (for example, in auto-scaling scenarios), and is crucial for the proper functioning of our
+long-polling mechanism when the same authentication token is shared across two or more runner managers.
+
+Given that the creation of runners involves user interaction, it should be possible
+to eventually lower the per-plan limit of CI runners that can be registered per scope.
+
+### Auto-scaling scenarios (for example Helm chart)
+
+In the existing model, a new runner is created whenever a new worker is required. This
+has led to many situations where runners are left behind and become stale.
+
+In the proposed model, a `ci_runners` table entry describes a configuration,
+which the runner could reuse across multiple machines. This allows differentiating the context in
+which the runner is being used. In situations where we must differentiate between runners
+that reuse the same configuration, we can use the unique system identifier to track all
+unique "runners" that are executed in context of a single `ci_runners` model. This unique
+system identifier would be present in the Runner's `config.toml` configuration file and
+initially set when generating the new `[[runners]]` configuration by means of the `deploy` command.
+Legacy files that miss values for unique system identifiers will get rewritten automatically with new values.
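
The rewrite of legacy entries could look roughly like the sketch below, which treats the parsed `[[runners]]` entries as a list of dictionaries. The `system_id` key name, the `s_` prefix, and the `ensure_system_ids` helper are assumptions for illustration. The text above does not fix the identifier's name or format.

```python
import secrets


def ensure_system_ids(runners):
    """Assign a system ID to every runner entry that lacks one."""
    created = []
    for entry in runners:
        if not entry.get("system_id"):
            # Random value, generated once and then persisted with the entry.
            entry["system_id"] = "s_" + secrets.token_hex(6)
            created.append(entry["system_id"])
    return created
```

The function returns the newly created IDs so the caller can log them, and running it a second time on the same entries changes nothing.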
+
+### Runner identification in CI jobs
+
+For users to identify the machine where the job was executed, the unique identifier will need to be visible in CI job contexts.
+As a first iteration, GitLab Runner will include the unique system identifier in the build logs,
+wherever it publishes the short token SHA.
+
+Given that the runner will potentially be reused with different unique system identifiers,
+we can store the unique system ID. This ensures the unique system ID maps to a GitLab Runner's `config.toml` entry with
+the runner token. The `ci_runner_machines` table would hold information about each unique runner machine,
+including when the runner last connected and what type of runner it was. The relevant fields
+will be moved from the `ci_runners` table.
+The `ci_builds_runner_session` table (or `ci_builds` or `ci_builds_metadata`) will reference
+`ci_runner_machines`.
+We might consider a more efficient way to store `contacted_at` than updating the existing record.
+
+```sql
+CREATE TABLE ci_builds_runner_session (
+ ...
+ runner_machine_id bigint NOT NULL
+);
+
+CREATE TABLE ci_runner_machines (
+ id integer NOT NULL,
+ machine_id character varying UNIQUE NOT NULL,
+ contacted_at timestamp without time zone,
+ version character varying,
+ revision character varying,
+ platform character varying,
+ architecture character varying,
+ ip_address character varying,
+ executor_type smallint
+);
+```
+
+## Advantages
+
+- Easier for users to wrap their minds around the concept: instead of two types of tokens,
+ there is a single type of token - the per-runner authentication token. Having two types of tokens
+ frequently results in misunderstandings when discussing issues;
+- Runners can always be traced back to the user who created them, using the audit log;
+- The claims of a CI runner are known at creation time, and cannot be changed from the runner
+ (for example, changing the `access_level`/`protected` flag). Authenticated users
+ may however still edit these settings through the GitLab UI.
+
+## Details
+
+In the proposed approach, we create a distinct way to configure runners that is usable
+alongside the current registration token method during a transition period. The idea is
+to avoid having the Runner make API calls that allow it to leverage a single "god-like"
+token to register new runners.
+
+The new workflow looks as follows:
+
+ 1. The user opens the Runners settings page;
+ 1. The user fills in the details regarding the new desired runner, namely description,
+ tags, protected, locked, etc.;
+ 1. The user clicks `Create`. That results in the following:
+
+ 1. Creates a new runner in the `ci_runners` table (and corresponding authentication token);
+ 1. Presents the user with instructions on how to configure this new runner on a machine,
+ with possibilities for different supported deployment scenarios (e.g. shell, `docker-compose`, Helm chart, etc.)
+ This information contains a token which will only be available to the user once, and the UI
+ will make it clear to the user that the value will not be shown again, as registering the same runner multiple times
+ is discouraged (though not impossible).
+
+ 1. The user copies and pastes the instructions for the intended deployment scenario (a `deploy` command), leading to the following actions:
+
+ 1. Upon executing the new `gitlab-runner deploy` command in the instructions, `gitlab-runner` will perform
+ a call to the `POST /runners/verify` with the given runner token;
+ 1. If the `POST /runners/verify` GitLab endpoint validates the token, the `config.toml` file will be populated with the configuration.
+
+ The `gitlab-runner deploy` will also accept executor-specific arguments
+ currently present in the `register` command.
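
The last step of the workflow above, verify first and only then write configuration, can be sketched as follows. The `deploy` function, the injected `verify` callable (standing in for the `POST /runners/verify` call), and the dictionary layout of the configuration are assumptions for illustration, not the planned `gitlab-runner deploy` implementation.

```python
def deploy(token, verify, config):
    """Append a runner entry to config only if the token verifies."""
    if not verify(token):
        # Nothing is written when the verify endpoint rejects the token.
        raise PermissionError("runner token rejected by verify endpoint")
    entry = {"token": token}
    config.setdefault("runners", []).append(entry)
    return entry
```

Injecting `verify` keeps the sketch free of network calls. A real command would also merge the executor-specific arguments into the entry.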
+
+As part of the transition period, we will provide admins and top-level group owners with an instance/group-level setting to disable
+the legacy registration token functionality and enforce using only the new workflow.
+Any attempt by a `gitlab-runner register` command to hit the `POST /runners` endpoint to register a new runner
+will result in an `HTTP 410 - Gone` status code. The instance setting is inherited by the groups,
+which means that if the legacy registration method is disabled at the instance level, the descendant groups/projects will also
+be prevented from using the legacy registration method.
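
The inheritance rule can be expressed as a small predicate. This is a sketch under assumptions: the function names are hypothetical, the settings are modelled as plain booleans, and `201` is assumed as the success status for a legacy registration.

```python
def legacy_registration_allowed(instance_allowed, group_allowed):
    # A disabled instance setting overrides whatever the group chose.
    return instance_allowed and group_allowed


def register_status(instance_allowed, group_allowed):
    """HTTP status for a legacy `POST /runners` registration attempt."""
    if legacy_registration_allowed(instance_allowed, group_allowed):
        return 201  # assumption: status returned on successful registration
    return 410
```

For example, `register_status(False, True)` yields `410` because the instance-level setting wins over the group-level one.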
+
+The registration token workflow is to be deprecated (with a deprecation notice printed by the `gitlab-runner register` command)
+and removed at a future major release after the concept is proven stable and customers have migrated to the new workflow.
+
+### Handling of legacy runners
+
+Legacy versions of GitLab Runner will not send the unique system identifier in their requests, and we
+will not change logic in Workhorse to handle unique system IDs. This can be improved upon in the
+future once the legacy registration system is removed, and runners have been upgraded to newer
+versions.
+
+Not using the unique system ID means that all connected runners with the same token will be
+notified, instead of just the runner matching the exact system identifier. While not ideal, this is
+not an issue per se.
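
The difference between the two notification behaviours can be illustrated with a short sketch. The `runners_to_notify` function and the dictionary shape of a connected runner are assumptions for illustration only. They do not describe how Workhorse tracks long-polling connections.

```python
def runners_to_notify(connected, token, system_id=None):
    """Select connected runners to wake up for a given token."""
    same_token = [r for r in connected if r["token"] == token]
    if system_id is None:
        # Legacy behaviour: no system ID, so every holder of the token is notified.
        return same_token
    # With a system ID, only the exact runner machine is notified.
    return [r for r in same_token if r.get("system_id") == system_id]
```

Without a system ID, two runner managers sharing one token are both woken up. With it, only the matching one is.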
+
+### Helm chart
+
+The `runnerRegistrationToken` entry in the [`values.yaml` file](https://gitlab.com/gitlab-org/charts/gitlab-runner/-/blob/a70bc29a903b79d5675bb0c45d981adf8b7a8659/values.yaml#L52)
+will be retired. The `runnerRegistrationToken` entry will be replaced by the existing `runnerToken` value, which will be passed
+to the new `gitlab-runner deploy` command in [`configmap.yaml`](https://gitlab.com/gitlab-org/charts/gitlab-runner/-/blob/a70bc29a903b79d5675bb0c45d981adf8b7a8659/templates/configmap.yaml#L116).
+
+### Runner creation through API
+
+Automated runner creation may be allowed, although always through authenticated API calls -
+using personal access tokens (PATs) for example - such that every runner is associated with an owner.
+
+## Implementation plan
+
+| Component | Milestone | Changes |
+|------------------|-----------|---------|
+| GitLab Rails app | `15.x` (latest at `15.6`) | Deprecate `POST /api/v4/runners` endpoint for `16.0`. This hinges on a [proposal](https://gitlab.com/gitlab-org/gitlab/-/issues/373774) to allow deprecating REST API endpoints for security reasons. |
+| GitLab Runner | `15.x` (latest at `15.8`) | Add deprecation notice for `register` command for `16.0`. |
+| GitLab Runner | `15.x` | Ensure all runner entries in `config.toml` have unique system identifier values assigned. Log new system ID values with `INFO` level as they get created. |
+| GitLab Runner | `15.x` | Start additionally logging unique system ID anywhere we log the runner short SHA. |
+| GitLab Rails app | `15.x` | Create database migrations to add settings to `application_settings` and `namespace_settings` tables. |
+| GitLab Runner | `15.x` | Start sending `unique_id` value in `POST /jobs/request` request and other follow-up requests that require identifying the unique system. |
+| GitLab Runner | `15.x` | Implement new user-authenticated API (REST and GraphQL) to create a new runner. |
+| GitLab Rails app | `15.x` | Implement UI to create new runner. |
+| GitLab Runner | `16.0` | Remove `register` command and support for `POST /runners` endpoint. |
+| GitLab Rails app | `16.0` | Remove legacy UI showing registration with a registration token. |
+| GitLab Rails app | `16.0` | Create database migrations to remove settings from `application_settings` and `namespace_settings` tables. |
+| GitLab Rails app | `16.0` | Make [`POST /api/v4/runners` endpoint](../../../api/runners.md#register-a-new-runner-deprecated) permanently return `410 Gone`. A future v5 version of the API would return `404 Not Found`. |
+| GitLab Rails app | `16.0` | Start refusing job requests that don't include a unique ID. |
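
The `16.0` server-side behavior in the table above can be sketched as follows. This is a hedged illustration only: the function names are hypothetical, the `unique_id` field name is taken from the plan, and how a refused job request is reported to the runner is not specified in this blueprint, so the sketch returns only a boolean.

```python
from http import HTTPStatus


def legacy_registration_status(api_version: int) -> HTTPStatus:
    """Status for the removed `POST /runners` endpoint (hypothetical helper).

    Per the plan, v4 permanently returns 410 Gone, and a future v5
    would return 404 Not Found.
    """
    if api_version <= 4:
        return HTTPStatus.GONE
    return HTTPStatus.NOT_FOUND


def accepts_job_request(payload: dict) -> bool:
    """Refuse job requests that do not carry a non-empty `unique_id`."""
    return bool(payload.get("unique_id"))
```

For example, `accepts_job_request({"token": "..."})` is refused because the payload has no `unique_id`, which is the case for runners that have not been upgraded.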
+
+## Status
+
+Status: RFC.
+
+## Who
+
+Proposal:
+
+<!-- vale gitlab.Spelling = NO -->
+
+| Role | Who |
+|------------------------------|--------------------------------------------------|
+| Authors | Kamil Trzciński, Tomasz Maczukin, Pedro Pombeiro |
+| Architecture Evolution Coach | Kamil Trzciński |
+| Engineering Leader | Elliot Rushton, Cheryl Li |
+| Product Manager | Darren Eastman, Jackie Porter |
+| Domain Expert / Runner | Tomasz Maczukin |
+
+DRIs:
+
+| Role | Who |
+|------------------------------|---------------------------------|
+| Leadership | Elliot Rushton |
+| Product | Darren Eastman |
+| Engineering | Tomasz Maczukin, Pedro Pombeiro |
+
+Domain experts:
+
+| Area | Who |
+|------------------------------|-----------------|
+| Domain Expert / Runner | Tomasz Maczukin |
+
+<!-- vale gitlab.Spelling = YES -->
diff --git a/doc/architecture/blueprints/work_items/index.md b/doc/architecture/blueprints/work_items/index.md
index 42864e7112e..75a9d8d76ad 100644
--- a/doc/architecture/blueprints/work_items/index.md
+++ b/doc/architecture/blueprints/work_items/index.md
@@ -1,15 +1,15 @@
---
-stage: Plan
-group: Project Management
-comments: false
-description: 'Work Items'
+status: accepted
+creation-date: "2022-09-28"
+authors: [ "@ntepluhina" ]
+coach: "@kamil"
+approvers: [ "@gweaver" ]
+owning-stage: "~devops::plan"
+participating-stages: []
---
# Work Items
-DISCLAIMER:
-This page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.
-
This document is a work-in-progress. Some aspects are not documented, though we expect to add them in the future.
## Summary
@@ -55,14 +55,19 @@ All Work Item types share the same pool of predefined widgets and are customized
### Work Item widget types (updating)
-- assignees
-- description
-- hierarchy
-- iteration
-- labels
-- start and due date
-- verification status
-- weight
+| widget type | feature flag |
+|---|---|
+| assignees | |
+| description | |
+| hierarchy | |
+| [iteration](https://gitlab.com/gitlab-org/gitlab/-/issues/367456) | `work_items_mvc_2` |
+| [milestone](https://gitlab.com/gitlab-org/gitlab/-/issues/367463) | `work_items_mvc_2` |
+| labels | |
+| start and due date | |
+| status\* | |
+| weight | |
+
+\* status is not currently a widget, but a part of the root work item, similar to title
### Work Item view
@@ -72,6 +77,16 @@ The new frontend view that renders Work Items of any type using global Work Item
Task is a special Work Item type. Tasks can be added to issues as child items and can be displayed in the modal on the issue view.
+### Feature flags
+
+Since this is a large project with numerous moving parts, feature flags are being used to track promotions of available widgets. The table below shows the different feature flags that are being used, and the audience that they are available to.
+
+| feature flag name | audience |
+|---|---|
+| `work_items` | defaulted to on |
+| `work_items_mvc` | `gitlab-org`, `gitlab-com` |
+| `work_items_mvc_2` | `gitlab-org/plan-stage` |
+
## Motivation
Work Items main goal is to enhance the planning toolset to become the most popular collaboration tool for knowledge workers in any industry.
@@ -107,24 +122,3 @@ Work Item architecture is designed with making all the features for all the type
- [Tasks roadmap](https://gitlab.com/groups/gitlab-org/-/epics/7103?_gl=1*zqatx*_ga*NzUyOTc3NTc1LjE2NjEzNDcwMDQ.*_ga_ENFH3X7M5Y*MTY2MjU0MDQ0MC43LjEuMTY2MjU0MDc2MC4wLjAuMA..)
- [Work Item "Vision" Prototype](https://gitlab.com/gitlab-org/gitlab/-/issues/368607)
- [Work Item Discussions](https://gitlab.com/groups/gitlab-org/-/epics/7060)
-
-### Who
-
-| Role | Who
-|------------------------------|-----------------------------|
-| Author | Natalia Tepluhina |
-| Architecture Evolution Coach | Kamil Trzciński |
-| Engineering Leader | TBD |
-| Product Manager | Gabe Weaver |
-| Domain Expert / Frontend | Natalia Tepluhina |
-| Domain Expert / Backend | Heinrich Lee Yu |
-| Domain Expert / Backend | Jan Provaznik |
-| Domain Expert / Backend | Mario Celi |
-
-DRIs:
-
-| Role | Who
-|------------------------------|------------------------|
-| Leadership | TBD |
-| Product | Gabe Weaver |
-| Engineering | TBD |