diff options
Diffstat (limited to 'doc/architecture')
-rw-r--r-- | doc/architecture/blueprints/ci_data_decay/index.md | 255 | ||||
-rw-r--r-- | doc/architecture/blueprints/ci_data_decay/pipeline_data_time_decay.png | bin | 0 -> 13687 bytes | |||
-rw-r--r-- | doc/architecture/blueprints/container_registry_metadata_database/index.md | 2 | ||||
-rw-r--r-- | doc/architecture/blueprints/database_testing/index.md | 3 | ||||
-rw-r--r-- | doc/architecture/blueprints/runner_scaling/gitlab-autoscaling-overview.png | bin | 0 -> 94088 bytes | |||
-rw-r--r-- | doc/architecture/blueprints/runner_scaling/index.md | 239 |
6 files changed, 498 insertions, 1 deletions
diff --git a/doc/architecture/blueprints/ci_data_decay/index.md b/doc/architecture/blueprints/ci_data_decay/index.md new file mode 100644 index 00000000000..155c781b04a --- /dev/null +++ b/doc/architecture/blueprints/ci_data_decay/index.md @@ -0,0 +1,255 @@ +--- +stage: none +group: unassigned +comments: false +description: 'CI/CD data time decay' +--- + +# CI/CD data time decay + +## Summary + +GitLab CI/CD is one of the most data- and compute-intensive components of GitLab. +Since its [initial release in November 2012](https://about.gitlab.com/blog/2012/11/13/continuous-integration-server-from-gitlab/), +the CI/CD subsystem has evolved significantly. It was [integrated into GitLab in September 2015](https://about.gitlab.com/releases/2015/09/22/gitlab-8-0-released/) +and has become [one of the most beloved CI/CD solutions](https://about.gitlab.com/blog/2017/09/27/gitlab-leader-continuous-integration-forrester-wave/). + +On February 1st, 2021, GitLab.com surpassed 1 billion CI/CD builds, and the number of +builds [continues to grow exponentially](../ci_scale/index.md). + +GitLab CI/CD has come a long way since the initial release, but the design of +the data storage for pipeline builds has remained almost the same since 2012. In +2021 we started working on database decomposition and extracting CI/CD data to +a separate database. Now we want to improve the architecture of the GitLab CI/CD +product to enable further scaling. + +_Disclaimer: The following contains information related to upcoming products, +features, and functionality._ + +_It is important to note that the information presented is for informational +purposes only. Please do not rely on this information for purchasing or +planning purposes._ + +_As with all projects, the items mentioned in this document and linked pages are +subject to change or delay.
The development, release and timing of any +products, features, or functionality remain at the sole discretion of GitLab +Inc._ + +## Goals + +**Implement a new architecture for CI/CD data storage to enable scaling.** + +## Challenges + +There are more than two billion rows describing CI/CD builds in GitLab.com's +database. This data represents a sizable portion of all the data stored in +the PostgreSQL database running on GitLab.com. + +This volume contributes to significant performance problems and development +challenges, and is often related to production incidents. + +We also expect a [significant growth in the number of builds executed on +GitLab.com](../ci_scale/index.md) in the upcoming years. + +## Opportunity + +CI/CD data is subject to +[time-decay](https://about.gitlab.com/company/team/structure/working-groups/database-scalability/time-decay.html) +because, usually, pipelines that are a few months old are not frequently +accessed or are not even relevant anymore. Restricting access to processing +pipelines that are older than a few months might help us to move this data out +of the primary database, to different storage that is more performant and +cost-effective. + +It is already possible to prevent processing builds [that have been +archived](../../../user/admin_area/settings/continuous_integration.md#archive-jobs). +When a build gets archived it is no longer possible to retry it, but we still +keep all the processing metadata in the database, and it consumes resources +that are scarce in the primary database. + +In order to improve performance and make it easier to scale CI/CD data storage, +we might want to follow the three tracks described below. + +![pipeline data time decay](pipeline_data_time_decay.png) + +<!-- markdownlint-disable MD029 --> + +1. Partition builds queuing tables +2. Archive CI/CD data into partitioned database schema +3.
Migrate archived builds metadata out of primary database + +<!-- markdownlint-enable MD029 --> + +### Migrate archived builds metadata out of primary database + +Once a build (or a pipeline) gets archived, it is no longer possible to resume +pipeline processing in such a pipeline. It means that all the metadata we store +in PostgreSQL that is needed to efficiently and reliably process builds can be +safely moved to a different data store. + +Currently, storing pipeline processing data is expensive, as this kind of CI/CD +data represents a significant portion of data stored in CI/CD tables. Once we +restrict access to processing archived pipelines, we can move this metadata to +a different place - preferably object storage - and make it accessible on +demand, when it is really needed again (for example for compliance or auditing purposes). + +We need to evaluate whether moving data is the optimal solution. We might +be able to use de-duplication of metadata entries and other normalization +strategies to consume less storage while retaining the ability to query this +dataset. Technical evaluation will be required to find the best solution here. + +Epic: [Migrate archived builds metadata out of primary database](https://gitlab.com/groups/gitlab-org/-/epics/7216). + +### Archive CI/CD data into partitioned database schema + +After we move CI/CD metadata to a different store, the problem of having +billions of rows describing pipelines, builds and artifacts remains. We still +need to keep a reference to the metadata we store in object storage and we still +need to be able to retrieve this information reliably in bulk (or search +through it). + +It means that by moving data to object storage we might not be able to reduce +the number of rows in CI/CD tables. Moving data to object storage should help +with reducing the data size, but not the quantity of entries describing this +data.
Because of this limitation, we still want to partition CI/CD data to +reduce the impact on the database (index size, auto-vacuum time and +frequency). + +Our intent here is not to move this data out of our primary database to another +store. We want to divide the very large database tables that store CI/CD data into +multiple smaller ones, using PostgreSQL partitioning features. + +There are a few approaches we can take to partition CI/CD data. A promising one +is using list-based partitioning, where a partition number is assigned to a +pipeline and gets propagated to all resources that are related to this +pipeline. We assign the partition number based on when the pipeline was created +or when we observed the last processing activity in it. This is very flexible +because we can extend this partitioning strategy at will; for example, with this +strategy we can assign an arbitrary partition number based on multiple +partitioning keys, combining time-decay-based partitioning with tenant-based +partitioning on the application level. + +Partitioning rarely accessed data should also follow the policy defined for +builds archival, to make it consistent and reliable. + +Epic: [Archive CI/CD data into partitioned database schema](https://gitlab.com/groups/gitlab-org/-/epics/5417). + +### Partition builds queuing tables + +While working on the [CI/CD Scale](../ci_scale/index.md) blueprint, we +introduced a [new architecture for queuing CI/CD builds](https://gitlab.com/groups/gitlab-org/-/epics/5909#note_680407908) +for execution. + +This allowed us to significantly improve performance. We still consider the new +solution an intermediate mechanism, needed before we start working on the +next iteration. The next iteration should improve the architecture of +builds queuing even more (it might require moving off PostgreSQL fully or +partially). + +In the meantime we want to ship another iteration, an intermediate step towards a +more flexible and reliable solution.
We want to partition the new queuing +tables to reduce the impact on the database, and to improve reliability and +database health. + +Partitioning of CI/CD queuing tables does not need to follow the policy defined +for builds archival. Instead, we should leverage a long-standing policy saying +that builds created more than 24 hours ago need to be removed from the queue. This +business rule has been present in the product since the inception of GitLab CI. + +Epic: [Partition builds queuing tables](https://gitlab.com/gitlab-org/gitlab/-/issues/347027). + +## Principles + +All three tracks we will use to implement the CI/CD time decay pattern are +associated with some challenges. As we progress with the implementation, we will +need to solve many problems and devise many implementation details to make this +successful. + +Below, we document a few foundational principles to make it easier for +everyone to understand the vision described in this architectural blueprint. + +### Removing pipeline data + +While it might be tempting to simply remove old or archived data from our +databases, this should be avoided. It is usually not desirable to permanently +remove user data unless consent is given to do so. We can, however, move data +to a different data store, like object storage. + +Archived data can still be needed sometimes (for example for compliance or +auditing reasons). We want to be able to retrieve this data if needed, as long +as permanent removal has not been requested or approved by a user. + +### Accessing pipeline data in the UI + +Implementing CI/CD data time-decay through partitioning might be challenging +when we still want to make it possible for users to access data stored in many +partitions. + +We want to retain the simplicity of accessing pipeline data in the UI. It will +require some backstage changes in how we reference pipeline data from other +resources, but we don't want to make it more difficult for users to find their +pipelines in the UI.
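To illustrate what such a backstage change might look like (the names and the in-memory storage here are hypothetical illustrations, not the actual GitLab schema), a resource that references a pipeline could carry a partition identifier alongside the pipeline ID, so that a lookup only ever touches a single partition:

```go
package main

import "fmt"

// PipelineRef is a hypothetical composite reference: resources that point at a
// pipeline carry the partition ID alongside the pipeline ID, so lookups can be
// routed to one partition instead of scanning all of them.
type PipelineRef struct {
	PartitionID int64
	PipelineID  int64
}

// partitions is a toy stand-in for partitioned storage: one map per partition,
// keyed by pipeline ID, holding the pipeline status.
var partitions = map[int64]map[int64]string{
	1: {100: "success"},
	2: {200: "failed"},
}

// FindPipeline resolves a reference by consulting only the referenced partition.
func FindPipeline(ref PipelineRef) (string, bool) {
	p, ok := partitions[ref.PartitionID]
	if !ok {
		return "", false
	}
	status, ok := p[ref.PipelineID]
	return status, ok
}

func main() {
	status, ok := FindPipeline(PipelineRef{PartitionID: 2, PipelineID: 200})
	fmt.Println(status, ok) // prints "failed true"
}
```

A merge request or deployment holding such a composite reference could then show its pipeline status without searching every partition.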
+ +We may need to add an "Archived" tab on the pipelines / builds list pages, but we +should be able to avoid additional steps / clicks when someone wants to view +pipeline status or builds associated with a merge request or a deployment. + +We may also need to disable search in the "Archived" tab on pipelines / builds +list pages. + +### Accessing pipeline data through the API + +We accept the possible necessity of building a separate API endpoint / +endpoints needed to access pipeline data through the API. + +In the new API, users might need to provide a time range in which the data has +been created to search through their pipelines / builds. In order to make it +efficient, it might be necessary to restrict access to querying data residing in +more than two partitions at once. We can do that by supporting time ranges +spanning a duration equal to the builds archival policy. + +It is still possible to allow users to use the old API to access archived +pipelines data, although a user-provided partition identifier may be required. + +## Iterations + +All three tracks can be worked on in parallel: + +1. [Migrate archived build metadata to object storage](https://gitlab.com/groups/gitlab-org/-/epics/7216). +1. [Partition CI/CD data that has been archived](https://gitlab.com/groups/gitlab-org/-/epics/5417). +1. [Partition CI/CD queuing tables using list partitioning](https://gitlab.com/gitlab-org/gitlab/-/issues/347027). + +## Status + +In progress.
+ +## Who + +Proposal: + +<!-- vale gitlab.Spelling = NO --> + +| Role | Who +|------------------------------|-------------------------| +| Author | Grzegorz Bizon | +| Engineering Leader | Cheryl Li | +| Product Manager | Jackie Porter | +| Architecture Evolution Coach | Kamil Trzciński | + +DRIs: + +| Role | Who +|------------------------------|------------------------| +| Leadership | Cheryl Li | +| Product | Jackie Porter | +| Engineering | Grzegorz Bizon | + +Domain experts: + +| Area | Who +|------------------------------|------------------------| +| Verify / Pipeline execution | Fabio Pitino | +| Verify / Pipeline execution | Marius Bobin | +| PostgreSQL Database | Andreas Brandl | + +<!-- vale gitlab.Spelling = YES --> diff --git a/doc/architecture/blueprints/ci_data_decay/pipeline_data_time_decay.png b/doc/architecture/blueprints/ci_data_decay/pipeline_data_time_decay.png Binary files differnew file mode 100644 index 00000000000..e094b87933a --- /dev/null +++ b/doc/architecture/blueprints/ci_data_decay/pipeline_data_time_decay.png diff --git a/doc/architecture/blueprints/container_registry_metadata_database/index.md b/doc/architecture/blueprints/container_registry_metadata_database/index.md index a38a8727dc4..c1aac235085 100644 --- a/doc/architecture/blueprints/container_registry_metadata_database/index.md +++ b/doc/architecture/blueprints/container_registry_metadata_database/index.md @@ -78,7 +78,7 @@ The single entrypoint for the registry is the [HTTP API](https://gitlab.com/gitl | Operation | UI | Background | Observations | | ------------------------------------------------------------ | ------------------ | ------------------------ | ------------------------------------------------------------ | | [Check API version](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/spec/api.md#api-version-check) | **{check-circle}** Yes | **{check-circle}** Yes | Used globally to ensure that the registry supports the Docker Distribution V2 API, as 
well as for identifying whether GitLab Rails is talking to the GitLab Container Registry or a third-party one (used to toggle features only available in the former). | -| [List repository tags](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/spec/api.md#listing-image-tags) | **{check-circle}** Yes | **{check-circle}** Yes | Used to list and show tags in the UI. Used to list tags in the background for [cleanup policies](../../../user/packages/container_registry/#cleanup-policy) and [Geo replication](../../../administration/geo/replication/docker_registry.md). | +| [List repository tags](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/spec/api.md#listing-image-tags) | **{check-circle}** Yes | **{check-circle}** Yes | Used to list and show tags in the UI. Used to list tags in the background for [cleanup policies](../../../user/packages/container_registry/reduce_container_registry_storage.md#cleanup-policy) and [Geo replication](../../../administration/geo/replication/docker_registry.md). | | [Check if manifest exists](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/spec/api.md#existing-manifests) | **{check-circle}** Yes | **{dotted-circle}** No | Used to get the digest of a manifest by tag. This is then used to pull the manifest and show the tag details in the UI. | | [Pull manifest](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/spec/api.md#pulling-an-image-manifest) | **{check-circle}** Yes | **{dotted-circle}** No | Used to show the image size and the manifest digest in the tag details UI. | | [Pull blob](https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs/spec/api.md#pulling-a-layer) | **{check-circle}** Yes | **{dotted-circle}** No | Used to show the configuration digest and the creation date in the tag details UI. 
| diff --git a/doc/architecture/blueprints/database_testing/index.md b/doc/architecture/blueprints/database_testing/index.md index 38629e7348d..4676caab85d 100644 --- a/doc/architecture/blueprints/database_testing/index.md +++ b/doc/architecture/blueprints/database_testing/index.md @@ -1,4 +1,7 @@ --- +stage: none +group: unassigned +info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments comments: false description: 'Database Testing' --- diff --git a/doc/architecture/blueprints/runner_scaling/gitlab-autoscaling-overview.png b/doc/architecture/blueprints/runner_scaling/gitlab-autoscaling-overview.png Binary files differnew file mode 100644 index 00000000000..c3ba615784f --- /dev/null +++ b/doc/architecture/blueprints/runner_scaling/gitlab-autoscaling-overview.png diff --git a/doc/architecture/blueprints/runner_scaling/index.md b/doc/architecture/blueprints/runner_scaling/index.md new file mode 100644 index 00000000000..8e47b5fda8c --- /dev/null +++ b/doc/architecture/blueprints/runner_scaling/index.md @@ -0,0 +1,239 @@ +--- +stage: none +group: unassigned +comments: false +description: 'Next Runner Auto-scaling Architecture' +--- + +# Next Runner Auto-scaling Architecture + +## Summary + +GitLab Runner is a core component of GitLab CI/CD. It makes it possible to run +CI/CD jobs in a reliable and concurrent environment. It was initially +introduced by Kamil Trzciński in early 2015 to replace a Ruby version of the +same service. GitLab Runner, written in Go, turned out to be easier for the +wider community to use, and it was more efficient and reliable than the previous +Ruby-based version. + +In February 2016 Kamil Trzciński [implemented an auto-scaling feature](https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/53) +to leverage cloud infrastructure to run many CI/CD jobs in parallel.
This +feature has become a foundation supporting CI/CD adoption on GitLab.com over +the years, where we now run around 4 million builds per day at peak. + +During the initial implementation, a decision was made to use Docker Machine: + +> Is easy to use. Is well documented. Is well supported and constantly +> extended. It supports almost any cloud provider or virtualization +> infrastructure. We need minimal amount of changes to support Docker Machine: +> machine enumeration and inspection. We don't need to implement any "cloud +> specific" features. + +This design choice was crucial to GitLab Runner's success. Since that time +the auto-scaling feature has been used by many users and customers and has enabled +rapid growth of CI/CD adoption on GitLab.com. + +We cannot, however, continue using Docker Machine. Work on that project [was +paused in July 2018](https://github.com/docker/machine/issues/4537) and there +has been no development since that time (except for some highly important +security fixes). In 2018, after Docker Machine entered “maintenance mode”, +we decided to create [our own fork](https://gitlab.com/gitlab-org/ci-cd/docker-machine) +to be able to keep using it and to ship the fixes and updates needed for our use case. +[On September 26th, 2021, the project was archived](https://github.com/docker/docker.github.io/commit/2dc8b49dcbe85686cc7230e17aff8e9944cb47a5) +and its documentation has been removed from the official page. This +means that the original reason to use Docker Machine is no longer valid either. + +To keep supporting our customers and the wider community, we need to design a +new mechanism for GitLab Runner auto-scaling. It not only needs to support +auto-scaling, but it also needs to do that in a way that enables us to build on +top of it to improve efficiency, reliability and availability. + +We call this new mechanism the “next GitLab Runner Scaling architecture”.
+ +_Disclaimer: The following contains information related to upcoming products, +features, and functionality._ + +_It is important to note that the information presented is for informational +purposes only. Please do not rely on this information for purchasing or +planning purposes._ + +_As with all projects, the items mentioned in this document and linked pages are +subject to change or delay. The development, release and timing of any +products, features, or functionality remain at the sole discretion of GitLab +Inc._ + +## Proposal + +Currently, GitLab Runner auto-scaling can be configured in a few ways. Some +customers are successfully using an auto-scaled environment in Kubernetes. We +know that a custom and unofficial GitLab Runner version has been built to make +auto-scaling on Kubernetes more reliable. We recognize the importance of having +a really good Kubernetes solution for running multiple jobs in parallel, but +refinements in this area are out of scope for this architectural initiative. + +We want to focus on resolving problems with Docker Machine and replacing this +mechanism with a reliable and flexible one. We might be unable to build a +drop-in replacement for Docker Machine, as there are presumably many reasons +why it has been deprecated. It is very difficult to maintain compatibility with +so many cloud providers, and it seems that Docker Machine has been deprecated +in favor of Docker Desktop, which is not a viable replacement for us. [This +issue](https://github.com/docker/roadmap/issues/245) contains a discussion +about how people are using Docker Machine right now, and it seems that GitLab +CI is one of the most frequent reasons for people to keep using Docker Machine. + +There is also an opportunity in being able to optionally run multiple jobs in a +single, larger virtual machine. We can’t do that today, but we know that this +could significantly improve efficiency.
We might want to build a new +architecture that makes this easier and allows us to test how efficient it is +with PoCs. Running multiple jobs on a single machine can also make it possible +to reuse what we call a “sticky context” - a space for build artifacts / user +data that can be shared between job runs. + +### 💡 Design a simple abstraction that users will be able to build on top of + +Because there is no viable replacement and we might be unable to support all +cloud providers that Docker Machine used to support, the key design requirement +is to make it really simple and easy for the wider community to write a custom +GitLab auto-scaling plugin, whatever cloud provider they might be using. We +want to design a simple abstraction that users will be able to build on top of, as +will we, to support existing workflows on GitLab.com. + +The designed mechanism should abstract what the Docker Machine executor has been doing: +providing a way to create an external Docker environment that waits to execute +jobs, by provisioning this environment and returning the credentials required to +perform these operations. + +The new plugin system should be available for all major platforms: Linux, +Windows, macOS. + +### 💡 Migrate the existing Docker Machine solution to a plugin + +Once we design and implement the new abstraction, we should be able to migrate +the existing Docker Machine mechanisms to a plugin. This will make it possible for +users and customers to immediately start using the new architecture, but still +keep their existing workflows and configuration for Docker Machine. This will +give everyone time to migrate to the new architecture before we drop support +for the legacy auto-scaling entirely.
+ +### 💡 Build plugins for AWS, Google Cloud Platform and Azure + +Although we might be unable to add support for all the cloud providers that +Docker Machine used to support, it seems to be important to provide +GitLab-maintained plugins for the major cloud providers like AWS, Google Cloud +Platform and Azure. + +We should build them, presumably in separate repositories, in a way that they +are easy to contribute to, fork, and modify for the particular needs that wider +community members might have. It should also be easy to install a new plugin +without needing to rebuild GitLab Runner. + +### 💡 Write solid documentation about how to build your own plugin + +It is important to show users how to build an auto-scaling plugin, so that they +can implement support for their own cloud infrastructure. + +Building new plugins should be simple, and with the support of great +documentation it should not require advanced skills, like understanding how +gRPC works. We want to design the plugin system in a way that the entry barrier +for contributing new plugins is very low. + +### 💡 Build a PoC to run multiple builds on a single machine + +We want to better understand what kind of efficiency gains running multiple jobs +on a single machine can bring. It is difficult to predict that, so ideally we +should build a PoC that will help us to better understand what we can expect +from this. + +To run this experiment we will most likely need to build an experimental +plugin that not only allows us to schedule running multiple builds on a single +machine, but also has a set of comprehensive metrics built into it, to make it +easier to understand how it performs. + +## Details + +Exactly how the abstraction for the custom provider will look is something that +we will need to prototype, PoC and decide on in a data-informed way. There are a +few proposals that we should describe in detail, develop requirements for, PoC +and score.
We will choose the solution that seems to support our goals the +most. + +In order to describe the proposals, we first need to better explain what part of +GitLab Runner needs to be abstracted away. To make these concepts easier to +grasp, let's take a look at the current auto-scaling architecture and +sequence diagram. + +![GitLab Runner Autoscaling Overview](gitlab-autoscaling-overview.png) + +On the diagram above we can see that currently a GitLab Runner Manager runs on a +machine that has access to a cloud provider’s API. It uses Docker Machine +to provision new Virtual Machines with Docker Engine installed, and it +configures the Docker daemon there to allow external authenticated requests. It +stores credentials to such ephemeral Docker environments on disk. Once a +machine has been provisioned and made available for the GitLab Runner Manager to +run builds, it uses one of the existing executors to run a user-provided +script. In auto-scaling, this is typically done using the Docker executor. + +### Custom provider + +In order to reduce the scope of work, we only want to introduce the new +abstraction layer in one place. + +A few years ago we introduced the [Custom Executor](https://docs.gitlab.com/runner/executors/custom.html) +feature in GitLab Runner. It allows users to design custom build execution +methods. The custom executor driver can be implemented in any way - from a +simple shell script to a dedicated binary - that is then used by a Runner +through os/exec system calls. + +Thanks to the custom executor abstraction, there is no longer a need to implement +new executors internally in Runner. Users who have specific needs can implement +their own drivers and don’t need to wait for us to make their work part of the +“official” GitLab Runner. As each driver is a separate project, it also makes +it easier to create communities around them, where interested people can +collaborate on improvements and bug fixes.
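As a rough illustration of this driver model, a manager process can invoke a driver executable for each stage through os/exec. The stage names below follow the documented Custom Executor flow (config, prepare, run, cleanup); the `sh`/`echo` driver is a runnable stand-in for illustration, not a real driver.

```go
package main

import (
	"fmt"
	"os/exec"
)

// runStage invokes a driver binary for a single stage, the way GitLab Runner
// shells out to custom executor drivers via os/exec. Here `sh -c echo` acts as
// a stand-in driver so the sketch runs without a real driver installed.
func runStage(driver, stage string) (string, error) {
	out, err := exec.Command(driver, "-c", fmt.Sprintf("echo driver stage: %s", stage)).Output()
	return string(out), err
}

func main() {
	// Walk the documented custom executor stages in order.
	for _, stage := range []string{"config", "prepare", "run", "cleanup"} {
		out, err := runStage("sh", stage)
		if err != nil {
			panic(err)
		}
		fmt.Print(out)
	}
}
```

Because the contract is just "run this executable with these arguments", a driver can be a shell script, a Go binary, or anything else the author prefers.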
+ +We want to design the new Custom Provider to replicate the success of the +Custom Executor. It will make it easier for users to build their own ways to +provide a context and an environment in which a build will be executed by one +of the Custom Executors. + +There are multiple solutions for implementing a custom provider abstraction. We +can use raw Go plugins, HashiCorp’s Go Plugin, an HTTP interface, or a gRPC-based +facade service. There are many solutions, and we want to choose the most +suitable one. In order to do that, we will describe the solutions in a separate +document, define requirements and score the solutions accordingly. This will +allow us to choose a solution that will work best for us and the wider +community. + +## Status + +Status: RFC. + +## Who + +Proposal: + +<!-- vale gitlab.Spelling = NO --> + +| Role | Who +|------------------------------|------------------------------------------| +| Authors | Grzegorz Bizon, Tomasz Maczukin | +| Architecture Evolution Coach | Kamil Trzciński | +| Engineering Leader | Elliot Rushton, Cheryl Li | +| Product Manager | Darren Eastman, Jackie Porter | +| Domain Expert / Runner | Arran Walker | + +DRIs: + +| Role | Who +|------------------------------|------------------------| +| Leadership | Elliot Rushton | +| Product | Darren Eastman | +| Engineering | Tomasz Maczukin | + +Domain experts: + +| Area | Who +|------------------------------|------------------------| +| Domain Expert / Runner | Arran Walker | + +<!-- vale gitlab.Spelling = YES -->