path: root/doc/architecture/blueprints
Diffstat (limited to 'doc/architecture/blueprints')
-rw-r--r--  doc/architecture/blueprints/ci_scale/ci_builds_cumulative_forecast.png  bin 0 -> 36221 bytes
-rw-r--r--  doc/architecture/blueprints/ci_scale/ci_builds_daily_forecast.png  bin 0 -> 29472 bytes
-rw-r--r--  doc/architecture/blueprints/ci_scale/index.md  205
-rw-r--r--  doc/architecture/blueprints/container_registry_metadata_database/index.md  4
-rw-r--r--  doc/architecture/blueprints/database_testing/index.md  4
-rw-r--r--  doc/architecture/blueprints/graphql_api/index.md  19
-rw-r--r--  doc/architecture/blueprints/image_resizing/index.md  4
7 files changed, 226 insertions, 10 deletions
diff --git a/doc/architecture/blueprints/ci_scale/ci_builds_cumulative_forecast.png b/doc/architecture/blueprints/ci_scale/ci_builds_cumulative_forecast.png
new file mode 100644
index 00000000000..fa34c7d1c36
--- /dev/null
+++ b/doc/architecture/blueprints/ci_scale/ci_builds_cumulative_forecast.png
Binary files differ
diff --git a/doc/architecture/blueprints/ci_scale/ci_builds_daily_forecast.png b/doc/architecture/blueprints/ci_scale/ci_builds_daily_forecast.png
new file mode 100644
index 00000000000..b73a592fa6b
--- /dev/null
+++ b/doc/architecture/blueprints/ci_scale/ci_builds_daily_forecast.png
Binary files differ
diff --git a/doc/architecture/blueprints/ci_scale/index.md b/doc/architecture/blueprints/ci_scale/index.md
new file mode 100644
index 00000000000..99997e7b19b
--- /dev/null
+++ b/doc/architecture/blueprints/ci_scale/index.md
@@ -0,0 +1,205 @@
+---
+stage: none
+group: unassigned
+comments: false
+description: 'Improve scalability of GitLab CI/CD'
+---
+
+# Next CI/CD scale target: 20M builds per day by 2024
+
+## Summary
+
+GitLab CI/CD is one of the most data and compute intensive components of GitLab.
+Since its [initial release in November 2012](https://about.gitlab.com/blog/2012/11/13/continuous-integration-server-from-gitlab/),
+the CI/CD subsystem has evolved significantly. It was [integrated into GitLab in September 2015](https://about.gitlab.com/releases/2015/09/22/gitlab-8-0-released/)
+and has become [one of the most beloved CI/CD solutions](https://about.gitlab.com/blog/2017/09/27/gitlab-leader-continuous-integration-forrester-wave/).
+
+GitLab CI/CD has come a long way since the initial release, but the design of
+the data storage for pipeline builds remains almost the same since 2012. We
+store all builds in PostgreSQL, in the `ci_builds` table, and because we are
+creating more than [2 million builds each day on GitLab.com](https://docs.google.com/spreadsheets/d/17ZdTWQMnTHWbyERlvj1GA7qhw_uIfCoI5Zfrrsh95zU),
+we are reaching database limits that slow our development velocity down.
+
+On February 1st, 2021, the billionth CI/CD job was created, and the number of
+builds is growing exponentially. Unless we improve the database model used to
+store CI/CD data, we will run out of available primary keys for builds before
+December 2021.
+
+We expect to see 20M builds created daily on GitLab.com in the first half of
+2024.
+
+![ci_builds cumulative with forecast](ci_builds_cumulative_forecast.png)
+
+## Goals
+
+**Enable future growth by making it possible to process 20M builds in a day.**
+
+## Challenges
+
+The current CI/CD product architecture needs to be updated if we want to
+sustain future growth.
+
+### We are running out of the capacity to store primary keys
+
+The primary key in the `ci_builds` table is an integer generated in a
+sequence. Historically, Rails used the [integer](https://www.postgresql.org/docs/9.1/datatype-numeric.html)
+type when creating primary keys for a table, and we used that default when we
+[created the `ci_builds` table in 2012](https://gitlab.com/gitlab-org/gitlab/-/blob/046b28312704f3131e72dcd2dbdacc5264d4aa62/db/ci/migrate/20121004165038_create_builds.rb).
+[The behavior of Rails has changed](https://github.com/rails/rails/pull/26266)
+since the release of Rails 5. The framework now uses the `bigint` type, which
+is 8 bytes long; however, we have not yet migrated the primary keys of the
+`ci_builds` table to `bigint`.
+
+We will run out of the capacity of the integer type to store primary keys in
+the `ci_builds` table before December 2021. If that happens without a viable
+workaround or an emergency plan in place, GitLab.com will go down.
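
To make the risk concrete, here is a back-of-the-envelope headroom check. The
numbers are the approximations quoted in this document, not live figures, and
the flat-rate estimate is only an optimistic ceiling because both the build
creation rate and the sequence itself advance faster than the daily build
count alone suggests:

```ruby
# Back-of-the-envelope headroom check for an int4 primary key sequence.
# Figures are approximations from this document: the billionth job was
# created on February 1st, 2021, with more than 2M builds per day.
INT4_MAX = 2**31 - 1                 # 2_147_483_647, the int4 ceiling
current_max_id = 1_000_000_000       # approximate sequence position, Feb 2021
daily_rate = 2_000_000               # builds created per day, early 2021

remaining_ids = INT4_MAX - current_max_id
days_left_at_flat_rate = remaining_ids / daily_rate

puts remaining_ids             # roughly 1.1 billion identifiers left
puts days_left_at_flat_rate    # an upper bound; exponential growth eats it sooner
```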
+
+`ci_builds` is just one of the tables that are running out of primary keys
+available in an `int4` sequence. Multiple other tables storing CI/CD data have
+the same problem.
+
+The primary keys problem will be tackled by our Database Team.
+
+### The table is too large
+
+There are more than a billion rows in the `ci_builds` table. We store more
+than 2 terabytes of data in that table, and the total size of its indexes is
+more than 1 terabyte (as of February 2021).
+
+This amount of data contributes to significant performance problems we
+experience on our primary PostgreSQL database.
+
+Most of the problems are related to how the PostgreSQL database works
+internally and how it uses the resources of the node it runs on. We are at the
+limits of vertical scaling of the primary database nodes, and we frequently
+see a negative impact of the `ci_builds` table on the overall performance,
+stability, scalability, and predictability of the database GitLab.com depends
+on.
+
+The size of the table also hinders development velocity, because queries that
+seem fine in the development environment may not work on GitLab.com. The
+difference in dataset size between the environments makes it difficult to
+predict the performance of even the simplest queries.
+
+We also expect a significant, exponential growth in the upcoming years.
+
+A forecast made using [Facebook's
+Prophet](https://facebook.github.io/prophet/) shows that in the first half of
+2024 we can expect to see 20M builds created on GitLab.com each day. Compared
+to the roughly 2M created today, this is a 10x growth that our product might
+need to sustain in the upcoming years.
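
Spelling out the 10x claim as a compound rate (a rough calculation on the
figures above, not part of the Prophet forecast itself):

```ruby
# Implied annual growth rate if daily builds go from ~2M (early 2021) to
# 20M (first half of 2024), i.e. a 10x increase over roughly 3.5 years.
implied_annual_growth = (20_000_000.0 / 2_000_000.0)**(1 / 3.5)

puts implied_annual_growth.round(2)  # close to doubling every year
```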
+
+![ci_builds daily forecast](ci_builds_daily_forecast.png)
+
+### Queuing mechanisms are using the large table
+
+Because of how large the table is, the mechanisms that we use to build queues
+of pending builds (there is more than one queue) are not very efficient.
+Pending builds represent a small fraction of what we store in the `ci_builds`
+table, yet we need to find them in this big dataset to determine the order in
+which we want to process them.
+
+This mechanism is very inefficient and has frequently been causing problems in
+production. These problems usually result in a significant drop in the CI/CD
+Apdex score, and sometimes even in a significant performance degradation of
+the production environment.
+
+There are multiple other strategies that can improve performance and
+reliability, and we want to explore them: [Redis
+queuing](https://gitlab.com/gitlab-org/gitlab/-/issues/322972), or [a separate
+table that accelerates the SQL queries used to build
+queues](https://gitlab.com/gitlab-org/gitlab/-/issues/322766).
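
As an in-memory illustration only (plain Ruby, not the actual schema or the
linked proposals), the dedicated-table idea replaces a filter over the whole
dataset with a lookup in a small structure maintained on state transitions:

```ruby
# Illustration of the "separate queuing table" idea: instead of filtering
# the entire (billion-row) ci_builds dataset for pending rows on every
# runner request, keep a small ci_pending_builds set that is updated when
# a build changes state, so dequeuing never touches the large table.
ci_builds = [
  { id: 1, status: 'success' },
  { id: 2, status: 'pending' },
  { id: 3, status: 'running' },
  { id: 4, status: 'pending' },
]

# Today: scan the large dataset on each request.
scanned_queue = ci_builds.select { |b| b[:status] == 'pending' }.map { |b| b[:id] }

# Proposed: a small, separately maintained queue with the same contents.
ci_pending_builds = [2, 4]

puts scanned_queue.inspect  # same result, far less work at real scale
```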
+
+### Moving big amounts of data is challenging
+
+We store a significant amount of data in the `ci_builds` table. Some of the
+columns in that table store serialized user-provided data. The
+`ci_builds.options` column stores more than 600 gigabytes of data, and
+`ci_builds.yaml_variables` more than 300 gigabytes (as of February 2021).
+
+This is a lot of data that needs to be reliably moved to a different place.
+Unfortunately, right now, our [background
+migrations](https://docs.gitlab.com/ee/development/background_migrations.html)
+are not reliable enough to migrate this amount of data at scale. We need to
+build mechanisms that will give us confidence in moving this data between
+columns, tables, partitions or database shards.
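
A minimal sketch of the batching pattern such migrations rely on (this is the
general shape, not GitLab's background migration framework): split a large id
range into small sub-ranges so every copy step is a short transaction that can
be retried independently if it fails.

```ruby
# Split an id range into small sub-ranges for batched data movement.
# Each returned range would be processed in its own short transaction.
def each_batch(min_id, max_id, batch_size)
  ranges = []
  start = min_id
  while start <= max_id
    stop = [start + batch_size - 1, max_id].min
    ranges << (start..stop)
    start = stop + 1
  end
  ranges
end

puts each_batch(1, 10, 4).inspect  # three small, independently retryable batches
```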
+
+The effort to improve background migrations will be owned by our Database Team.
+
+### Development velocity is negatively affected
+
+Team members and wider community members are struggling to contribute to the
+Verify area, because we have restricted the possibility of extending
+`ci_builds` even further. Our static analysis tools prevent adding more
+columns to this table. Adding new queries is unpredictable because of the size
+of the dataset and the number of queries executed against the table. This
+significantly hinders development velocity and contributes to incidents in the
+production environment.
+
+## Proposal
+
+Making the GitLab CI/CD product ready for the scale we expect to see in the
+upcoming years is a multi-phase effort.
+
+First, we want to focus on things that are urgently needed right now. We need
+to fix the primary key overflow risk and unblock other teams that are working
+on database partitioning and sharding.
+
+We want to improve the situation around already-known bottlenecks, like the
+queuing mechanisms that use the large table, and remove things that are
+holding other teams back.
+
+Extending CI/CD metrics is important to get a better sense of how the system
+performs and what growth we should expect. This will make it easier for us to
+identify bottlenecks and perform more advanced capacity planning.
+
+As we work on the first iterations, we expect our Database Sharding team and
+Database Scalability Working Group to make progress on patterns we will be
+able to use to partition the large CI/CD dataset. We consider the strong
+time-decay effect, related to the diminishing importance of pipelines over
+time, an opportunity we might want to seize.
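
To illustrate the time-decay observation (a hypothetical sketch, not a
committed partitioning scheme): if pipelines lose importance with age, rows
can be grouped into time-based partitions, and old partitions archived or
detached wholesale instead of deleting rows one by one.

```ruby
require 'date'

# Hypothetical partition key: one partition per calendar month of creation.
# With such a scheme, "drop everything older than N months" becomes a cheap
# partition-level operation rather than a massive row-level delete.
def partition_key(created_at)
  format('%04d%02d', created_at.year, created_at.month)
end

puts partition_key(Date.new(2021, 2, 1))  # "202102"
```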
+
+## Iterations
+
+Work required to achieve our next CI/CD scaling target is tracked in the
+[GitLab CI/CD 20M builds per day scaling
+target](https://gitlab.com/groups/gitlab-org/-/epics/5745) epic.
+
+## Status
+
+In progress.
+
+## Who
+
+Proposal:
+
+<!-- vale gitlab.Spelling = NO -->
+
+| Role | Who
+|------------------------------|-------------------------|
+| Author | Grzegorz Bizon |
+| Architecture Evolution Coach | Kamil TrzciƄski |
+| Engineering Leader | Darby Frey |
+| Product Manager | Jackie Porter |
+| Domain Expert / Verify | Fabio Pitino |
+| Domain Expert / Database | Jose Finotto |
+| Domain Expert / PostgreSQL | Nikolay Samokhvalov |
+
+DRIs:
+
+| Role | Who
+|------------------------------|------------------------|
+| Leadership | Darby Frey |
+| Product | Jackie Porter |
+| Engineering | Grzegorz Bizon |
+
+Domain experts:
+
+| Area | Who
+|------------------------------|------------------------|
+| Domain Expert / Verify | Fabio Pitino |
+| Domain Expert / Database | Jose Finotto |
+| Domain Expert / PostgreSQL | Nikolay Samokhvalov |
+
+<!-- vale gitlab.Spelling = YES -->
diff --git a/doc/architecture/blueprints/container_registry_metadata_database/index.md b/doc/architecture/blueprints/container_registry_metadata_database/index.md
index 4e40f249e56..86628b31536 100644
--- a/doc/architecture/blueprints/container_registry_metadata_database/index.md
+++ b/doc/architecture/blueprints/container_registry_metadata_database/index.md
@@ -26,7 +26,7 @@ graph LR
R -- Write/read metadata --> B
```
-Client applications (e.g. GitLab Rails and Docker CLI) interact with the Container Registry through its [HTTP API](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/spec/api.md). The most common operations are pushing and pulling images to/from the registry, which require a series of HTTP requests in a specific order. The request flow for these operations is detailed [here](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs-gitlab/push-pull-request-flow.md).
+Client applications (e.g. GitLab Rails and Docker CLI) interact with the Container Registry through its [HTTP API](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/spec/api.md). The most common operations are pushing and pulling images to/from the registry, which require a series of HTTP requests in a specific order. The request flow for these operations is detailed in the [Request flow](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs-gitlab/push-pull-request-flow.md).
The registry supports multiple [storage backends](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/configuration.md#storage), including Google Cloud Storage (GCS) which is used for the GitLab.com registry. In the storage backend, images are stored as blobs, deduplicated, and shared across repositories. These are then linked (like a symlink) to each repository that relies on them, giving them access to the central storage location.
@@ -156,7 +156,7 @@ Running *online* and [*post deployment*](../../../development/post_deployment_mi
The registry database will be partitioned from start to achieve greater performance (by limiting the amount of data to act upon and enable parallel execution), easier maintenance (by splitting tables and indexes into smaller units), and high availability (with partition independence). By partitioning the database from start we can also facilitate a sharding implementation later on if necessary.
-Although blobs are shared across repositories, manifest and tag metadata are scoped by repository. This is also visible at the API level, where all write and read requests (except [listing repositories](https://gitlab.com/gitlab-org/container-registry/-/blob/a113d0f0ab29b49cf88e173ee871893a9fc56a90/docs/spec/api.md#listing-repositories)) are scoped by repository, with its namespace being part of the request URI. For this reason, after [identifying access patterns](https://gitlab.com/gitlab-org/gitlab/-/issues/234255), we decided to partition manifests and tags by repository and blobs by digest, ensuring that lookups are always performed by partition key for optimal performance. The initial version of the partitioned schema was documented [here](https://gitlab.com/gitlab-com/www-gitlab-com/-/merge_requests/60918).
+Although blobs are shared across repositories, manifest and tag metadata are scoped by repository. This is also visible at the API level, where all write and read requests (except [listing repositories](https://gitlab.com/gitlab-org/container-registry/-/blob/a113d0f0ab29b49cf88e173ee871893a9fc56a90/docs/spec/api.md#listing-repositories)) are scoped by repository, with its namespace being part of the request URI. For this reason, after [identifying access patterns](https://gitlab.com/gitlab-org/gitlab/-/issues/234255), we decided to partition manifests and tags by repository and blobs by digest, ensuring that lookups are always performed by partition key for optimal performance. The initial version of the partitioned schema was documented [in a merge request](https://gitlab.com/gitlab-com/www-gitlab-com/-/merge_requests/60918).
#### GitLab.com
diff --git a/doc/architecture/blueprints/database_testing/index.md b/doc/architecture/blueprints/database_testing/index.md
index a333ac12ef3..162b112732c 100644
--- a/doc/architecture/blueprints/database_testing/index.md
+++ b/doc/architecture/blueprints/database_testing/index.md
@@ -79,7 +79,7 @@ Database Lab provides an API we can interact with to manage thin clones. In orde
The short-term focus is on testing regular migrations (typically schema changes) and using the existing Database Lab instance from postgres.ai for it.
-In order to secure this process and meet compliance goals, the runner environment will be treated as a *production* environment and similarly locked down, monitored and audited. Only Database Maintainers will have access to the CI pipeline and its job output. Everyone else will only be able to see the results and statistics posted back on the merge request.
+In order to secure this process and meet compliance goals, the runner environment is treated as a *production* environment and similarly locked down, monitored and audited. Only Database Maintainers have access to the CI pipeline and its job output. Everyone else can only see the results and statistics posted back on the merge request.
We implement a secured CI pipeline on <https://ops.gitlab.net> that adds the execution steps outlined above. The goal is to secure this pipeline in order to solve the following problem:
@@ -117,7 +117,7 @@ An alternative approach we have discussed and abandoned is to "scrub" and anonym
- Anonymization is complex by nature - it is a hard problem to call a "scrubbed clone" actually safe to work with in public. Different data types may require different anonymization techniques (e.g. anonymizing sensitive information inside a JSON field) and only focusing on one attribute at a time does not guarantee that a dataset is fully anonymized (for example join attacks, or using timestamps in conjunction with public profiles/projects to de-anonymize users by their activity).
- Anonymization requires an additional process to keep track and update the set of attributes considered as sensitive, ongoing maintenance and security reviews every time the database schema changes.
- Annotating data as "sensitive" is error prone, with the wrong anonymization approach used for a data type or one sensitive attribute accidentally not marked as such possibly leading to a data breach.
-- Scrubbing not only removes sensitive data, but also changes data distribution, which greatly affects performance of migrations and queries.
+- Scrubbing not only removes sensitive data, but it also changes data distribution, which greatly affects performance of migrations and queries.
- Scrubbing heavily changes the database contents, potentially updating a lot of data, which leads to different data storage details (think MVC bloat), affecting performance of migrations and queries.
## Who
diff --git a/doc/architecture/blueprints/graphql_api/index.md b/doc/architecture/blueprints/graphql_api/index.md
index 99047eb5964..b856f7d96ad 100644
--- a/doc/architecture/blueprints/graphql_api/index.md
+++ b/doc/architecture/blueprints/graphql_api/index.md
@@ -143,11 +143,17 @@ state synchronization mechanisms and hooking into existing ones.
## Iterations
-1. [Build comprehensive Grafana dashboard for GraphQL](https://gitlab.com/groups/gitlab-com/-/epics/1343)
-1. [Improve logging of GraphQL requests in Elastic](https://gitlab.com/groups/gitlab-org/-/epics/4646)
+### In the scope of the blueprint
+
+1. [GraphQL API architecture](https://gitlab.com/groups/gitlab-org/-/epics/5842)
+ 1. [Build comprehensive Grafana dashboard for GraphQL](https://gitlab.com/groups/gitlab-org/-/epics/5841)
+ 1. [Improve logging of GraphQL requests in Elastic](https://gitlab.com/groups/gitlab-org/-/epics/4646)
+ 1. [Build GraphQL query correlation mechanisms](https://gitlab.com/groups/gitlab-org/-/epics/5320)
+ 1. [Design a better data-informed deprecation policy](https://gitlab.com/groups/gitlab-org/-/epics/5321)
+
+### Future iterations
+
1. [Build a scalable state synchronization for GraphQL](https://gitlab.com/groups/gitlab-org/-/epics/5319)
-1. [Build GraphQL feature-to-query correlation mechanisms](https://gitlab.com/groups/gitlab-org/-/epics/5320)
-1. [Design a better data-informed deprecation policy](https://gitlab.com/groups/gitlab-org/-/epics/5321)
1. [Add support for direct uploads for GraphQL](https://gitlab.com/gitlab-org/gitlab/-/issues/280819)
1. [Review GraphQL design choices related to security](https://gitlab.com/gitlab-org/security/gitlab/-/issues/339)
@@ -179,6 +185,11 @@ DRIs:
| Leadership | Darva Satcher |
| Product | Patrick Deuley |
| Engineering | Paul Slaughter |
+
+Domain Experts:
+
+| Area | Who
+|------------------------------|------------------------|
| Domain Expert / GraphQL | Charlie Ablett |
| Domain Expert / GraphQL | Alex Kalderimis |
| Domain Expert / GraphQL | Natalia Tepluhina |
diff --git a/doc/architecture/blueprints/image_resizing/index.md b/doc/architecture/blueprints/image_resizing/index.md
index 686a2f9c8f5..26c15d7a035 100644
--- a/doc/architecture/blueprints/image_resizing/index.md
+++ b/doc/architecture/blueprints/image_resizing/index.md
@@ -35,10 +35,10 @@ sequenceDiagram
Content image resizing is a more complex problem to tackle. There are no set size restrictions and there are additional features or requirements to consider.
-- Dynamic WebP support - the WebP format typically achieves an average of 30% more compression than JPEG without the loss of image quality. More details [here](https://developers.google.com/speed/webp/docs/c_study)
+- Dynamic WebP support - the WebP format typically achieves an average of 30% more compression than JPEG without the loss of image quality. More details are in [this Google Comparative Study](https://developers.google.com/speed/webp/docs/c_study)
+- Extract the first image of GIFs so we can avoid loading 10 MB of pixels
- Check Device Pixel Ratio to deliver nice images on High DPI screens
-- Progressive image loading, similar to what is described [here](https://www.sitepoint.com/how-to-build-your-own-progressive-image-loader/)
+- Progressive image loading, similar to what is described in [this article about how to build a progressive image loader](https://www.sitepoint.com/how-to-build-your-own-progressive-image-loader/)
- Resizing recommendations (size, clarity, etc.)
- Storage