author | GitLab Bot <gitlab-bot@gitlab.com> | 2022-09-19 23:18:09 +0000
committer | GitLab Bot <gitlab-bot@gitlab.com> | 2022-09-19 23:18:09 +0000
commit | 6ed4ec3e0b1340f96b7c043ef51d1b33bbe85fde (patch)
tree | dc4d20fe6064752c0bd323187252c77e0a89144b /doc/architecture/blueprints
parent | 9868dae7fc0655bd7ce4a6887d4e6d487690eeed (diff)
download | gitlab-ce-6ed4ec3e0b1340f96b7c043ef51d1b33bbe85fde.tar.gz
Add latest changes from gitlab-org/gitlab@15-4-stable-ee (v15.4.0-rc42)
Diffstat (limited to 'doc/architecture/blueprints')
14 files changed, 1130 insertions, 50 deletions
diff --git a/doc/architecture/blueprints/_template.md b/doc/architecture/blueprints/_template.md
new file mode 100644
index 00000000000..e99ce61970a
--- /dev/null
+++ b/doc/architecture/blueprints/_template.md
@@ -0,0 +1,142 @@
+---
+status: proposed
+creation-date: yyyy-mm-dd
+authors: [ "@username" ]
+coach: "@username"
+owning-section: "~section::<section>"
+participating-sections: []
+approvers: [ "@product-manager", "@engineering-manager" ]
+---
+
+<!--
+**Note:** Please remove comment blocks for sections you've filled in.
+When your blueprint is complete, all of these comment blocks should be removed.
+
+To get started with a blueprint you can use this template to inform you about
+what you may want to document in it at the beginning. This content will change
+/ evolve as you move forward with the proposal. You are not constrained by the
+content in this template. If you have a good idea about what should be in your
+blueprint, you can ignore the template, but if you don't know yet what should
+be in it, this template might be handy.
+
+- **Fill out this file as best you can.** At minimum, you should fill in the
+  "Summary" and "Motivation" sections. These can be brief and may be a copy
+  of issue or epic descriptions if the initiative is already on Product's
+  roadmap.
+- **Create an MR for this blueprint.** Assign it to an Architecture Evolution
+  Coach (i.e. a Principal+ engineer).
+- **Merge early and iterate.** Avoid getting hung up on specific details and
+  instead aim to get the goals of the blueprint clarified and merged quickly.
+  The best way to do this is to just start with the high-level sections and
+  fill out details incrementally in subsequent MRs.
+
+Just because a blueprint is merged does not mean it is complete or approved.
+Any blueprint is a working document and subject to change at any time.
+
+When editing blueprints, aim for tightly-scoped, single-topic MRs to keep
+discussions focused. If you disagree with what is already in a document, open a
+new MR with suggested changes.
+
+If there are new details that belong in the blueprint, edit the blueprint. Once
+a feature has become "implemented", major changes should get new blueprints.
+
+The canonical place for the latest set of instructions (and the likely source
+of this file) is [here](/doc/architecture/blueprints/_template.md).
+-->
+
+# {+ Title of Blueprint +}
+
+<!--
+This is the title of your blueprint. Keep it short, simple, and descriptive. A
+good title can help communicate what the blueprint is and should be considered
+as part of any review.
+-->
+
+[[_TOC_]]
+
+## Summary
+
+<!--
+This section is very important, because very often it is the only section that
+will be read by team members. We sometimes call it an "Executive summary",
+because executives usually don't have time to read an entire document like
+this. Focus on writing this section in a way that anyone can understand what
+it says; the audience here is everyone: executives, product managers,
+engineers, wider community members.
+
+A good summary is probably at least a paragraph in length.
+-->
+
+## Motivation
+
+<!--
+This section is for explicitly listing the motivation, goals and non-goals of
+this blueprint. Describe why the change is important, all the opportunities,
+and the benefits to users.
+
+The motivation section can optionally provide links to issues that demonstrate
+interest in a blueprint within the wider GitLab community.
+Links to documentation for competing products and services are also encouraged
+in cases where they demonstrate clear gaps in the functionality GitLab
+provides.
+
+For concrete proposals we recommend laying out goals and non-goals explicitly,
+but this section may be framed in terms of problem statements, challenges, or
+opportunities. The latter may be a more suitable framework in cases where the
+problem is not well-defined or design details are not yet established.
+-->
+
+### Goals
+
+<!--
+List the specific goals / opportunities of the blueprint.
+
+- What is it trying to achieve?
+- How will we know that this has succeeded?
+- What are other less tangible opportunities here?
+-->
+
+### Non-Goals
+
+<!--
+Listing non-goals helps to focus discussion and make progress. This section is
+optional.
+
+- What is out of scope for this blueprint?
+-->
+
+## Proposal
+
+<!--
+This is where we get down to the specifics of what the proposal actually is,
+but keep it simple! This should have enough detail that reviewers can
+understand exactly what you're proposing, but should not include things like
+API designs or implementation. The "Design Details" section below is for the
+real nitty-gritty.
+-->
+
+## Design and implementation details
+
+<!--
+This section should contain enough information that the specifics of your
+change are understandable. This may include API specs (though not always
+required) or even code snippets. If there's any ambiguity about HOW your
+proposal will be implemented, this is the place to discuss it.
+
+If you are not sure how many implementation details you should include in the
+blueprint, the rule of thumb here is to provide enough context for people to
+understand the proposal. As you move forward with the implementation, you may
+need to add more implementation details to the blueprint, as those may become
+an important context for important technical decisions made along the way. A
+blueprint is also a register of such technical decisions. If a technical
+decision requires additional context before it can be made, you probably
+should document this context in a blueprint. If it is a small technical
+decision that can be made in a merge request by an author and a maintainer,
+you probably do not need to document it here. The impact a technical decision
+will have is another helpful piece of information - if a technical decision is
+very impactful, documenting it, along with associated implementation details,
+is advisable.
+
+It may be helpful to include workflow diagrams or any other related images.
+Diagrams authored in GitLab flavored markdown are preferred. In cases where
+that is not feasible, images should be placed under `images/` in the same
+directory as the `index.md` for the proposal.
+-->

diff --git a/doc/architecture/blueprints/ci_data_decay/index.md b/doc/architecture/blueprints/ci_data_decay/index.md
index 7c0bdf299db..23c8e9df1bb 100644
--- a/doc/architecture/blueprints/ci_data_decay/index.md
+++ b/doc/architecture/blueprints/ci_data_decay/index.md
@@ -48,7 +48,7 @@ PostgreSQL database running on GitLab.com. This volume contributes to
significant performance problems, development challenges and is often related
to production incidents.

-We also expect a [significant growth in the number of builds executed on GitLab.com](../ci_scale/index.md)
+We also expect a [significant growth in the number of builds executed on GitLab.com](../ci_scale/index.md)
in the upcoming years.
## Opportunity

@@ -61,7 +61,7 @@ pipelines that are older than a few months might help us to move this data out
of the primary database, to a different storage, that is more performant and
cost effective.

-It is already possible to prevent processing builds
+It is already possible to prevent processing builds
[that have been archived](../../../user/admin_area/settings/continuous_integration.md#archive-jobs).
When a build gets archived it will not be possible to retry it, but we still do
keep all the processing metadata in the database, and it consumes resources
@@ -232,7 +232,7 @@ In progress.

## Timeline

-- 2021-01-21: Parent [CI Scaling](../ci_scale/) blueprint [merge request](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/52203) created.
+- 2021-01-21: Parent [CI Scaling](../ci_scale/index.md) blueprint [merge request](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/52203) created.
- 2021-04-26: CI Scaling blueprint approved and merged.
- 2021-09-10: CI/CD data time decay blueprint discussions started.
- 2022-01-07: CI/CD data time decay blueprint [merged](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/70052).

diff --git a/doc/architecture/blueprints/ci_data_decay/pipeline_partitioning.md b/doc/architecture/blueprints/ci_data_decay/pipeline_partitioning.md
index 868dae4fc6c..baec14e3f0f 100644
--- a/doc/architecture/blueprints/ci_data_decay/pipeline_partitioning.md
+++ b/doc/architecture/blueprints/ci_data_decay/pipeline_partitioning.md
@@ -74,7 +74,12 @@ violates our
[principle of 100 GB max size](../database_scaling/size-limits.md).
We also want to [build alerting](https://gitlab.com/gitlab-com/gl-infra/tamland/-/issues/5)
to notify us when this number is exceeded.

-We’ve seen numerous S1 and S2 database-related production environment
+Large SQL tables increase index maintenance time, during which freshly deleted
+tuples cannot be cleaned by `autovacuum`. This highlights the need for small
+tables. We will measure how much bloat we accumulate when (re)indexing huge
+tables. Based on this analysis, we will be able to set up SLOs (dead tuples /
+bloat) associated with (re)indexing.
+
+We've seen numerous S1 and S2 database-related production environment
incidents, over the last couple of months, for example:

- S1: 2022-03-17 [Increase in writes in `ci_builds` table](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6625)
@@ -130,7 +135,7 @@ remaining database tables when it becomes necessary.

It is also important to avoid large data migrations. We store almost 6
terabytes of data in the biggest CI/CD tables, in many different columns and
indexes. Migrating this amount of data might be challenging and could cause
-instability in the production environment. Due to this concern, we’ve developed
+instability in the production environment. Due to this concern, we've developed
a way to attach an existing database table as a partition zero without downtime
and excessive database locking, which has been demonstrated in one of the
[first proofs of concept](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/80186).
@@ -145,7 +150,7 @@ Our plan is to use logical partition IDs. We want to start with the
`ci_pipelines` table and create a `partition_id` column with a `DEFAULT` value
of `100` or `1000`. Using a `DEFAULT` value avoids the challenge of backfilling
this value for every row.
Adding a `CHECK` constraint prior to attaching the
-first partition tells PostgreSQL that we’ve already ensured consistency and
+first partition tells PostgreSQL that we've already ensured consistency and
there is no need to check it while holding an exclusive table lock when
attaching this table as a partition to the routing table (partitioned schema
definition). We will increment this value every time we create a new partition
@@ -159,21 +164,32 @@ and artifacts, will share the same value. We want to add the `partition_id`
column into all 6 problematic tables because we can avoid backfilling this
data when we decide it is time to start partitioning them.

-We want to partition CI/CD data iteratively, so we will start with the
-pipelines table, and create at least one, but likely two, partitions. The
-pipelines table will be partitioned using the `LIST` partitioning strategy. It
-is possible that, after some time, `p_ci_pipelines` will store data in two
-partitions with IDs of `100` and `101`. Then we will try partitioning
-`ci_builds`. Therefore we might want to use `RANGE` partitioning in
-`p_ci_builds` with IDs `100` and `101`, because builds for the two logical
-partitions used will still be stored in a single table.
+We want to partition CI/CD data iteratively. We plan to start with the
+`ci_builds_metadata` table, because this is the fastest growing table in the CI
+database and we want to contain this rapid growth. This table also has the
+simplest access patterns - a row from it is read when a build is exposed to a
+runner, and other access patterns are relatively simple too. Starting with
+`p_ci_builds_metadata` will allow us to achieve tangible and quantifiable
+results earlier, and will become a new pattern that makes partitioning the
+largest tables possible. We will partition builds metadata using the `LIST`
+partitioning strategy.
+
+Once we have many partitions attached to `p_ci_builds_metadata`, with many
+`partition_ids`, we will choose another CI table to partition next. In that
+case we might want to use `RANGE` partitioning for that next table, because
+`p_ci_builds_metadata` will already have many physical partitions, and
+therefore many logical `partition_ids` will be in use at that time. For
+example, if we choose `ci_builds` as the next partitioning candidate, after
+having partitioned `p_ci_builds_metadata`, it will have many different values
+stored in `ci_builds.partition_id`. Using `RANGE` partitioning in that case
+might be easier.

Physical partitioning and logical partitioning will be separated, and a
-strategy will be determined when we implement partitioning for the respective
-database tables. Using `RANGE` partitioning works similarly to using `LIST`
-partitioning in database tables other than `ci_pipelines`, but because we can
-guarantee continuity of `partition_id` values, using `RANGE` partitioning might
-be a better strategy.
+strategy will be determined when we implement physical partitioning for the
+respective database tables. Using `RANGE` partitioning works similarly to using
+`LIST` partitioning in database tables, but because we can guarantee continuity
+of `partition_id` values, using `RANGE` partitioning might be a better
+strategy.

## Why do we want to use explicit logical partition ids?

@@ -201,9 +217,30 @@ find this number, though we might not need to do this.

The single and uniform `partition_id` value for pipeline data gives us more
choices later on than primary-keys-based partitioning.
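To make the mechanics described above concrete, here is a minimal, hypothetical
sketch of the technique (most columns elided; a real migration would go through
GitLab's migration helpers rather than raw DDL):

```sql
-- Simplified, hypothetical sketch of the approach described above.
-- The routing table is LIST-partitioned on the logical partition_id.
CREATE TABLE p_ci_builds_metadata (
    id bigint NOT NULL,
    partition_id bigint NOT NULL DEFAULT 100,
    -- remaining columns elided
    PRIMARY KEY (id, partition_id)
) PARTITION BY LIST (partition_id);

-- A validated CHECK constraint proves consistency up front, so PostgreSQL
-- does not need to scan the table under an exclusive lock during ATTACH.
ALTER TABLE ci_builds_metadata
    ADD CONSTRAINT partitioning_check CHECK (partition_id = 100) NOT VALID;
ALTER TABLE ci_builds_metadata VALIDATE CONSTRAINT partitioning_check;

-- Attach the existing table as the zero partition of the routing table.
ALTER TABLE p_ci_builds_metadata
    ATTACH PARTITION ci_builds_metadata FOR VALUES IN (100);
```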
+## Altering partitioned tables
+
+It will still be possible to run `ALTER TABLE` statements against partitioned tables,
+similarly to how the tables behaved before partitioning. When PostgreSQL runs
+an `ALTER TABLE` statement against a parent partitioned table, it acquires the same
+lock on all child partitions and updates each to keep them in sync. This differs from
+running `ALTER TABLE` on a non-partitioned table in a few key ways:
+
+- PostgreSQL acquires `ACCESS EXCLUSIVE` locks against a larger number of tables, but
+  not a larger amount of data, than it would were the table not partitioned.
+  Each partition will be locked similarly to the parent table, and all will be updated
+  in a single transaction.
+- Lock duration will be increased based on the number of partitions involved.
+  All `ALTER TABLE` statements executed on the GitLab database (other than `VALIDATE CONSTRAINT`)
+  take small constant amounts of time per table modified. PostgreSQL will need
+  to modify each partition in sequence, increasing the runtime of the lock. This
+  time will still remain very small until there are many partitions involved.
+- If thousands of partitions are involved in an `ALTER TABLE`, we will need to verify that
+  the value of `max_locks_per_transaction` is high enough to support all of the locks that
+  need to be taken during the operation.
+
## Splitting large partitions into smaller ones

-We want to start with the initial `pipeline_id` number `100` (or higher, like
+We want to start with the initial `partition_id` number `100` (or higher, like
`1000`, depending on our calculations and estimations). We do not want to start
from 1, because existing tables are also large already, and we might want to
split them into smaller partitions. If we start with `100`, we will be able to
@@ -217,6 +254,18 @@ smaller ones (it's not yet clear if we will need to do this), we might be able
to just use background migrations to update partition IDs, and PostgreSQL is
smart enough to move rows between partitions on its own.

+### Naming conventions
+
+A partitioned table is called a **routing** table and it will use the `p_`
+prefix, which should help us with building automated tooling for query analysis.
+
+A table partition will simply be called a **partition** and it can use the
+physical partition ID as a suffix, preceded by the letter `p`, for example
+`ci_builds_p101`. Existing CI tables will become **zero partitions** of the
+new routing tables. Depending on the chosen
+[partitioning strategy](#how-do-we-want-to-partition-cicd-data) for a given
+table, it is possible to have many logical partitions per one physical partition.
+
## Storing partitions metadata in the database

In order to build an efficient mechanism that will be responsible for creating
@@ -225,8 +274,8 @@ metadata table, called `ci_partitions`. In that table we would store metadata
about all the logical partitions, with many pipelines per partition. We may
need to store a range of pipeline ids per logical partition. Using it we will
be able to find the `partition_id` number for a given pipeline ID and we will
-also find information about which logical partitions are “active” or
-“archived”, which will help us to implement a time-decay pattern using database
+also find information about which logical partitions are "active" or
+"archived", which will help us to implement a time-decay pattern using database
declarative partitioning.
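As a purely illustrative sketch of the metadata table described here (the
column set is an assumption, not a settled design):

```sql
-- Hypothetical shape of the ci_partitions metadata table; the columns are
-- illustrative assumptions based on the surrounding description.
CREATE TABLE ci_partitions (
    id bigserial PRIMARY KEY,
    partition_id bigint NOT NULL UNIQUE, -- the logical partition identifier
    pipeline_id_from bigint,             -- range of pipeline IDs it covers
    pipeline_id_to bigint,
    status smallint NOT NULL DEFAULT 0   -- for example: 0 = active, 1 = archived
);
```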
The `ci_partitions` table will store information about a partition identifier,
@@ -302,9 +351,116 @@ scope block takes an argument).

Preloading instance dependent scopes is not supported.
```

-We also need to build a proof of concept for removing data on the PostgreSQL
-side (using foreign keys with `ON DELETE CASCADE`) and removing data through
-Rails associations, as this might be an important area of uncertainty.
+### Foreign keys
+
+Foreign keys must reference columns that either are a primary key or form a
+unique constraint. We can define them using these strategies:
+
+#### Between routing tables sharing partition ID
+
+For relations that are part of the same pipeline hierarchy it is possible to
+share the `partition_id` column to define the foreign key constraint:
+
+```plaintext
+p_ci_pipelines:
+ - id
+ - partition_id
+
+p_ci_builds:
+ - id
+ - partition_id
+ - pipeline_id
+```
+
+In this case, `p_ci_builds.partition_id` indicates the partition for the build
+and also for the pipeline. We can add a FK on the routing table using:
+
+```sql
+ALTER TABLE ONLY p_ci_builds
+  ADD CONSTRAINT fk_on_pipeline_and_partition
+  FOREIGN KEY (pipeline_id, partition_id)
+  REFERENCES p_ci_pipelines(id, partition_id) ON DELETE CASCADE;
+```
+
+#### Between routing tables with different partition IDs
+
+It's not possible to reuse the `partition_id` for all relations in the CI
+domain, so in this case we'll need to store the value as a different attribute.
+For example, when canceling redundant pipelines we store on the old pipeline
+row the ID of the new pipeline that canceled it as `auto_canceled_by_id`:
+
+```plaintext
+p_ci_pipelines:
+ - id
+ - partition_id
+ - auto_canceled_by_id
+ - auto_canceled_by_partition_id
+```
+
+In this case we can't ensure that the canceling pipeline is part of the same
+hierarchy as the canceled pipelines, so we need an extra attribute to store its
+partition, `auto_canceled_by_partition_id`, and the FK becomes:
+
+```sql
+ALTER TABLE ONLY p_ci_pipelines
+  ADD CONSTRAINT fk_cancel_redundant_pipelines
+  FOREIGN KEY (auto_canceled_by_id, auto_canceled_by_partition_id)
+  REFERENCES p_ci_pipelines(id, partition_id) ON DELETE SET NULL;
+```
+
+#### Between routing tables and regular tables
+
+Not all of the tables in the CI domain will be partitioned, so we'll have
+routing tables that reference non-partitioned tables. For example, we reference
+`external_pull_requests` from `ci_pipelines`:
+
+```sql
+FOREIGN KEY (external_pull_request_id)
+REFERENCES external_pull_requests(id)
+ON DELETE SET NULL
+```
+
+In this case we only need to move the FK definition from the partition level
+to the routing table so that new pipeline partitions may use it:
+
+```sql
+ALTER TABLE p_ci_pipelines
+  ADD CONSTRAINT fk_external_request
+  FOREIGN KEY (external_pull_request_id)
+  REFERENCES external_pull_requests(id) ON DELETE SET NULL;
+```
+
+#### Between regular tables and routing tables
+
+Most of the tables from the CI domain reference at least one table that will be
+turned into a routing table. For example, `ci_pipeline_messages` references
+`ci_pipelines`.
+These definitions will need to be updated to use the routing tables, and for
+this they will need a `partition_id` column:
+
+```plaintext
+p_ci_pipelines:
+ - id
+ - partition_id
+
+ci_pipeline_messages:
+ - id
+ - pipeline_id
+ - pipeline_partition_id
+```
+
+The foreign key can be defined by using:
+
+```sql
+ALTER TABLE ci_pipeline_messages ADD CONSTRAINT fk_pipeline_partitioned
+  FOREIGN KEY (pipeline_id, pipeline_partition_id)
+  REFERENCES p_ci_pipelines(id, partition_id) ON DELETE CASCADE;
+```
+
+The old FK definition will need to be removed, otherwise new inserts into
+`ci_pipeline_messages` with pipeline IDs from a non-zero partition will fail
+with reference errors.
+
+### Indexes

We [learned](https://gitlab.com/gitlab-org/gitlab/-/issues/360148) that
`PostgreSQL` does not allow creating a single index (unique or otherwise)
across all partitions of a table.
@@ -465,7 +621,7 @@ strategy. The strategy, described in this document, is subject to iteration as
well. Whenever we find a better way to reduce the risk and improve our plan, we
should update this document as well.

-We’ve managed to find a way to avoid large-scale data migrations, and we are
+We've managed to find a way to avoid large-scale data migrations, and we are
building an iterative strategy for partitioning CI/CD data. We documented our
strategy here to share knowledge and solicit feedback from other team members.

diff --git a/doc/architecture/blueprints/ci_pipeline_components/index.md b/doc/architecture/blueprints/ci_pipeline_components/index.md
new file mode 100644
index 00000000000..94ec3e2f894
--- /dev/null
+++ b/doc/architecture/blueprints/ci_pipeline_components/index.md
@@ -0,0 +1,209 @@
+---
+stage: Stage
+group: Pipeline Authoring
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+comments: false
+description: 'Create a catalog of shareable pipeline constructs'
+---
+
+# CI/CD pipeline components catalog
+
+## Summary
+
+## Goals
+
+The goal of the CI/CD pipeline components catalog is to make reusing pipeline
+configurations easier and more efficient. Providing a way to discover,
+understand and learn how to reuse pipeline constructs allows for a more
+streamlined experience. Having a CI/CD pipeline components catalog also sets a
+framework for users to collaborate on pipeline constructs so that they can be
+evolved and improved over time.
+
+This blueprint defines the architectural guidelines on how to build a CI/CD
+catalog of pipeline components. It also defines the long-term direction for
+iterations and improvements to the solution.
+
+## Challenges
+
+- GitLab CI/CD can have a steep learning curve for new users. Users must read the documentation and
+  [YAML reference](../../../ci/yaml/index.md) to understand how to configure their pipelines.
+- Developers struggle to reuse existing CI/CD templates, with the result that they reinvent the wheel and write
+  YAML configurations repeatedly.
+- GitLab [CI templates](../../../development/cicd/templates.md#template-directories) provide users with
+  scaffolding pipelines or jobs for specific purposes.
+  However, versioning them is challenging today because they are shipped with the GitLab instance.
+  See [this issue](https://gitlab.com/gitlab-org/gitlab/-/issues/17716) for more information.
+- Users of GitLab CI/CD (pipeline authors) today have their own ad-hoc ways to organize shared pipeline
+  configurations inside their organization. Those configurations tend to be mostly undocumented.
+- The only discoverable configurations are GitLab CI templates. However, they don't have any inline documentation,
+  so it becomes harder to know what they do and how to use them without copy-pasting the content into the
+  editor and reading the actual YAML.
+- It's harder to adopt additional GitLab features (CD, security, test, etc.).
+- There is no framework for testing reusable CI configurations.
+  Many configurations are not unit tested against single changes.
+- Communities, partners, 3rd parties, and individual contributors must go through the
+  [GitLab Contribution process](https://about.gitlab.com/community/contribute/) to contribute to GitLab managed
+  templates. See [this issue](https://gitlab.com/gitlab-org/gitlab/-/issues/323727) for more information.
+- GitLab has more than 100 templates, some of them barely maintained after their addition.
+
+### Problems with GitLab CI templates
+
+- GitLab CI templates have not been designed with deterministic behavior in mind.
+- GitLab CI templates have not been designed with reusability in mind.
+- `Jobs/` templates hard-code the `stage:` attribute, but the user of the template must somehow override
+  or know in advance what stage is needed.
+  - The user should be able to import the job inside a given stage or pass the stage names as input parameters
+    when using the component.
+  - Failures in mapping the correct stage can result in confusing errors.
+- Some templates are designed to work with Auto DevOps but are not generic enough
+  ([example](https://gitlab.com/gitlab-org/gitlab/-/blob/2c0e8e4470001442e999391df81e19732b3439e6/lib/gitlab/ci/templates/AWS/Deploy-ECS.gitlab-ci.yml)).
+- Many CI templates, especially the [language-specific](https://gitlab.com/gitlab-org/gitlab/-/tree/2c0e8e4470001442e999391df81e19732b3439e6/lib/gitlab/ci/templates) ones,
+  are tutorial/scaffolding-style templates.
+  - They are meant to show the user what a typical pipeline could look like, but they require heavy customization from the user's perspective.
+  - They require a different UX: copy-paste at the position of the Pipeline Editor cursor.
+- Some templates, like `SAST.latest.gitlab-ci.yml`, add multiple jobs conditionally to the same pipeline.
+  - Ideally these jobs could run as a child pipeline and make the reports available to the parent pipeline.
+  - [This epic](https://gitlab.com/groups/gitlab-org/-/epics/8205) is necessary for Parent-child pipelines to be used.
+- Some templates incorrectly use `variables`, `image` and other top-level keywords, but those define them for all pipeline jobs,
+  not just those defined in the template.
+  - This technique introduces inheritance issues when a template modifies jobs unnecessarily.
+
+## Opportunities
+
+- Having a catalog of pipeline constructs where users can search and find what they need can greatly lower
+  the bar for new users.
+- Customers are already trying to roll out their own ad-hoc catalogs of shared configurations. We could provide a
+  standardized way to write, package and share pipeline constructs directly in the product.
+- As we implement new pipeline constructs (for example, reusable job steps), they could become items of the
+  catalog. The catalog can boost the adoption of new constructs.
+- The catalog can be a place where we strengthen our relationship with partners, with components offered
+  and maintained by our partners.
+- With discoverability and a better versioning mechanism we can have more improvements and better collaboration.
+- The competitive landscape is showing the need for such a feature:
+  - [R2DevOps](https://r2devops.io) implements a catalog of CI templates for GitLab pipelines.
+  - [GitHub Actions](https://github.com/features/actions) provides an extensive catalog of reusable job steps.
+
+## Implementation guidelines
+
+- Start with the smallest user base. Dogfood the feature for the `gitlab-org` and `gitlab-com` groups.
+  Involve the Engineering Productivity and other groups authoring pipeline configurations to test
+  and validate our solutions.
+- Ensure we can integrate all the feedback gathered, even if that means changing the technical design or
+  UX. Until we make the feature GA we should set clear expectations with early adopters.
+- Reuse existing functionality as much as possible. Don't reinvent the wheel in the initial iterations.
+  For example: reuse project features like title, description, and avatar to build the catalog.
+- Leverage GitLab features for the development lifecycle of the components (testing via `.gitlab-ci.yml`,
+  release management, Pipeline Editor, etc.).
+- Design the catalog with self-managed support in mind.
+- Allow the catalog and the workflow to support future types of pipeline constructs and new ways of using them.
+- Design components and the catalog following industry best practices related to building deterministic package managers.
+
+## Glossary
+
+This section defines some terms that are used throughout this document. These
+terms identify abstract concepts only and are subject to change as we refine
+the design by discovering new insights.
+
+- **Component** is the reusable unit of pipeline configuration.
+- **Project** is the GitLab project attached to a repository. A project can contain multiple components.
+- **Catalog** is the collection of projects that are set to contain components.
+- **Version** is the release name of a tag in the project, which allows components to be pinned to a specific revision.
+
+## Characteristics of a component
+
+For the best experience with any system made of components it's fundamental that components are
+single-purpose, isolated, reusable and resolvable.
+
+- **Single purpose**: a component must focus on a single goal and its scope should be as small as possible.
+- **Isolation**: when a component is used in a pipeline, its implementation details should not leak outside the
+  component itself and into the main pipeline.
+- **Reusability**: a component is designed to be used in different pipelines.
+  Depending on the assumptions it's built on, a component can be more or less generic.
+  Generic components are more reusable but may require more customization.
+- **Resolvable**: when a component depends on another component, this dependency needs to be explicit and trackable. Hidden dependencies can lead to myriads of problems.
+
+## Proposal
+
+Prerequisites to create a component:
+
+- Create a project. A description and avatar are highly recommended to improve discoverability.
+- Add a `README.md` in the top-level directory that documents the component:
+  what it does, how to use it, how to contribute, etc. This file is mandatory.
+- Add a `.gitlab-ci.yml` in the top-level directory to test that the component works as expected
+  (see the sketch after this list). This file is highly recommended.
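As an illustration only - the exact testing convention is not part of this
proposal - such a test pipeline might look like the following, where the
component file path `templates/my-component.yml` and the job contents are
hypothetical:

```yaml
# Hypothetical test pipeline for a component project.
# The included file name and the assertion are illustrative assumptions.
include:
  - local: templates/my-component.yml

test-component:
  stage: test
  script:
    # Smoke-test that including the component produces a working pipeline.
    - echo "component included and pipeline compiled successfully"
```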
+
+Characteristics of a component:
+
+- It must have a **name** to be referenced by, and a **description** for extra details.
+- It must specify its **type**, which defines how it can be used (raw configuration to be `include`d, child pipeline workflow, job step).
+- It must define its **content** based on the type.
+- It must specify the **input parameters** that it accepts. Components should depend on input parameters for dynamic values and not environment variables.
+- It can optionally define the **output data** that it returns.
+- Its YAML specification should be **validated statically** (for example: using JSON schema validators).
+- It should be possible to use specific **versions** of a component by referencing official releases and SHAs.
+- It should be possible to use components defined locally in the same repository.
+
+## Limits
+
+Any MVC that exposes a feature should be added with limitations from the beginning.
+It's safer to add new features with restrictions than to try to limit a feature after it's already in use.
+We can always soften the restrictions later depending on user demand.
+
+Some limits we could consider adding:
+
+- number of components that a single project can contain/export
+- number of imports that a `.gitlab-ci.yml` file can use
+- number of imports that a component can declare/use
+- max level of nested imports
+- max length of the exported component name
+
+## Iterations
+
+1. Experimentation phase
+   - Build an MVC behind a feature flag with the `namespace` actor.
+   - Enable the feature flag only for the `gitlab-com` and `gitlab-org` namespaces to initiate dogfooding.
+   - Refine the solution and UX based on feedback.
+   - Find customers to be early adopters of this feature and iterate on their feedback.
+1. Design new pipeline constructs (in parallel with other phases)
+   - Start the technical and design process to work on proposals for new pipeline constructs (steps, workflows, templates).
+   - Implement new constructs. The catalog must be compatible with them.
+   - Dogfood new constructs and iterate on feedback.
+   - Release new constructs on private catalogs.
+1. Release the private catalog for groups on the Ultimate plan.
+   - Iterate on feedback.
+1. Release the public catalog for all GitLab users (prospect feature)
+   - Publish new versions of GitLab CI templates as components using the new constructs whenever possible.
+   - Allow self-managed administrators to populate their self-managed catalog by importing/updating
+     components from GitLab.com or from repository exports.
+   - Iterate on feedback.
+
+## Who
+
+Proposal:
+
+<!-- vale gitlab.Spelling = NO -->
+
+| Role | Who
+|------------------------------|-------------------------|
+| Author | Fabio Pitino |
+| Engineering Leader | ? |
+| Product Manager | Dov Hershkovitch |
+| Architecture Evolution Coach | Kamil Trzciński |
+
+DRIs:
+
+| Role | Who
+|------------------------------|------------------------|
+| Leadership | ? |
+| Product | Dov Hershkovitch |
+| Engineering | ? |
+| UX | Nadia Sotnikova |
+
+Domain experts:
+
+| Area | Who
+|------------------------------|------------------------|
+| Verify / Pipeline authoring | Avielle Wolfe |
+| Verify / Pipeline authoring | Furkan Ayhan |
+| Verify / Pipeline execution | Fabio Pitino |
+
+<!-- vale gitlab.Spelling = YES -->

diff --git a/doc/architecture/blueprints/ci_scale/index.md b/doc/architecture/blueprints/ci_scale/index.md
index 5822ae2b5ed..75c4d05c334 100644
--- a/doc/architecture/blueprints/ci_scale/index.md
+++ b/doc/architecture/blueprints/ci_scale/index.md
@@ -115,13 +115,13 @@ of the CI/CD Apdex score, and sometimes even causes a significant performance
degradation in the production environment.

There are multiple other strategies that can improve performance and
-reliability. We can use [Redis queuing](https://gitlab.com/gitlab-org/gitlab/-/issues/322972), or
+reliability. We can use [Redis queuing](https://gitlab.com/gitlab-org/gitlab/-/issues/322972), or
[a separate table that will accelerate SQL queries used to build queues](https://gitlab.com/gitlab-org/gitlab/-/issues/322766)
and we want to explore them.

-**Status**: As of October 2021 the new architecture
+**Status**: As of October 2021 the new architecture
[has been implemented on GitLab.com](https://gitlab.com/groups/gitlab-org/-/epics/5909#note_680407908).

-The following epic tracks making it generally available:
+The following epic tracks making it generally available:
[Make the new pending builds architecture generally available](https://gitlab.com/groups/gitlab-org/-/epics/6954).

### Moving big amounts of data is challenging

@@ -171,7 +171,7 @@ Work required to achieve our next CI/CD scaling target is tracked in the

1. ✓ Migrate primary keys to big integers on GitLab.com.
1. ✓ Implement the new architecture of builds queuing on GitLab.com.
1. [Make the new builds queuing architecture generally available](https://gitlab.com/groups/gitlab-org/-/epics/6954).
-1. [Partition CI/CD data using time-decay pattern](../ci_data_decay/).
+1. [Partition CI/CD data using time-decay pattern](../ci_data_decay/index.md).

## Status

diff --git a/doc/architecture/blueprints/cloud_native_build_logs/index.md b/doc/architecture/blueprints/cloud_native_build_logs/index.md
index 0c941e332cb..3a06d73141b 100644
--- a/doc/architecture/blueprints/cloud_native_build_logs/index.md
+++ b/doc/architecture/blueprints/cloud_native_build_logs/index.md
@@ -12,7 +12,7 @@ Cloud native and the adoption of Kubernetes has been recognized by GitLab to be
one of the top two biggest tailwinds that are helping us grow faster as a
company behind the project.

-This effort is described in more detail
+This effort is described in more detail
[in the infrastructure team handbook](https://about.gitlab.com/handbook/engineering/infrastructure/production/kubernetes/gitlab-com/).

## Traditional build logs

@@ -88,7 +88,7 @@ even tried to replace NFS with

Since that time it has become apparent that the cost of operations and
maintenance of an NFS cluster is significant and that if we ever decide to
-migrate to Kubernetes
+migrate to Kubernetes
[we need to decouple GitLab from a shared local storage and NFS](https://gitlab.com/gitlab-org/gitlab-pages/-/issues/426#note_375646396).

1. NFS might be a single point of failure
@@ -113,7 +113,7 @@ of complexity, maintenance cost and enormous, negative impact on availability.
The work needed to make the new architecture production ready and enabled on
GitLab.com has been tracked in the [Cloud Native Build Logs on GitLab.com](https://gitlab.com/groups/gitlab-org/-/epics/4275)
epic.

-Enabling this feature on GitLab.com is a subtask of
+Enabling this feature on GitLab.com is a subtask of
[making the new architecture generally available](https://gitlab.com/groups/gitlab-org/-/epics/3791) for everyone.

## Status

diff --git a/doc/architecture/blueprints/cloud_native_gitlab_pages/index.md b/doc/architecture/blueprints/cloud_native_gitlab_pages/index.md
index 89c3a4cd6b4..431bc19ad84 100644
--- a/doc/architecture/blueprints/cloud_native_gitlab_pages/index.md
+++ b/doc/architecture/blueprints/cloud_native_gitlab_pages/index.md
@@ -17,7 +17,7 @@ Cloud Native and the adoption of Kubernetes has been recognized by GitLab to be
one of the top two biggest tailwinds that are helping us grow faster as a
company behind the project.

-This effort is described in more detail
+This effort is described in more detail
[in the infrastructure team handbook page](https://about.gitlab.com/handbook/engineering/infrastructure/production/kubernetes/gitlab-com/).

GitLab Pages is tightly coupled with NFS and in order to unblock Kubernetes
@@ -55,7 +55,7 @@ even tried to replace NFS with

Since that time it has become apparent that the cost of operations and
maintenance of an NFS cluster is significant and that if we ever decide to
-migrate to Kubernetes
+migrate to Kubernetes
[we need to decouple GitLab from a shared local storage and NFS](https://gitlab.com/gitlab-org/gitlab-pages/-/issues/426#note_375646396).

1. NFS might be a single point of failure
@@ -83,7 +83,7 @@ graph TD

  C -- Serves static content --> E(Visitors)
```

-This new architecture has been briefly described in
+This new architecture has been briefly described in
[the blog post](https://about.gitlab.com/blog/2020/08/03/how-gitlab-pages-uses-the-gitlab-api-to-serve-content/)
too.

diff --git a/doc/architecture/blueprints/database_scaling/size-limits.md b/doc/architecture/blueprints/database_scaling/size-limits.md
index 284f6402d3c..0bb1ae9efb4 100644
--- a/doc/architecture/blueprints/database_scaling/size-limits.md
+++ b/doc/architecture/blueprints/database_scaling/size-limits.md
@@ -138,7 +138,7 @@ There is no standard solution to reduce table sizes - there are many!

1. **Retention**: Delete unnecessary data, for example expire old and unneeded records.
1. **Remove STI**: We still use [single-table inheritance](../../../development/database/single_table_inheritance.md) in a few places, which is considered an anti-pattern. Redesigning this, we can split data into multiple tables.
1. **Index optimization**: Drop unnecessary indexes and consolidate overlapping indexes if possible.
-1. **Optimise data types**: Review data type decisions and optimise data types where possible (example: use integer instead of text for an enum column)
+1. **Optimize data types**: Review data type decisions and optimize data types where possible (example: use integer instead of text for an enum column; see the sketch after this list).
1. **Partitioning**: Apply a partitioning scheme if there is a common access dimension.
1. **Normalization**: Review relational modeling and apply normalization techniques to remove duplicate data.
1. **Vertical table splits**: Review column usage and split tables vertically.
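As a hedged illustration of the "optimize data types" item, where the table and
column names are hypothetical and not taken from the GitLab schema:

```sql
-- Hypothetical example: replace a text enum column with a smallint-backed one,
-- shrinking every row and its index entries.
ALTER TABLE example_records ADD COLUMN status_id smallint;

UPDATE example_records
SET status_id = CASE status
  WHEN 'created' THEN 0
  WHEN 'running' THEN 1
  ELSE 2
END;

-- After backfilling (done in batches on a real system), drop the wide column.
ALTER TABLE example_records DROP COLUMN status;
```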
diff --git a/doc/architecture/blueprints/feature_flags_development/index.md b/doc/architecture/blueprints/feature_flags_development/index.md
index eaca7da6bd7..08253ac883c 100644
--- a/doc/architecture/blueprints/feature_flags_development/index.md
+++ b/doc/architecture/blueprints/feature_flags_development/index.md
@@ -23,7 +23,7 @@ The extensive usage of feature flags poses a few challenges

- Each feature flag that we add to the codebase is a ~"technical debt" as it adds a
  matrix of configurations.
- Testing each combination of feature flags is close to impossible, so we
-  instead try to optimise our testing of feature flags to the most common
+  instead try to optimize our testing of feature flags to the most common
  scenarios.
- There's a growing challenge of maintaining a growing number of feature flags.
  We sometimes forget how our feature flags are configured or why we haven't
@@ -115,8 +115,8 @@ These are the reasons why these changes are needed:

## Iterations

-This work is being done as part of dedicated epic:
-[Improve internal usage of Feature Flags](https://gitlab.com/groups/gitlab-org/-/epics/3551).
+This work is being done as part of a dedicated epic:
+[Improve internal usage of Feature Flags](https://gitlab.com/groups/gitlab-org/-/epics/3551).
This epic describes the meta reasons for making these changes.

## Who

diff --git a/doc/architecture/blueprints/graphql_api/index.md b/doc/architecture/blueprints/graphql_api/index.md
index eb045de491e..1ee322c412b 100644
--- a/doc/architecture/blueprints/graphql_api/index.md
+++ b/doc/architecture/blueprints/graphql_api/index.md
@@ -44,11 +44,11 @@ It is an opportunity to learn from our experience in evolving the REST API, for
the scale, and to apply this knowledge to the GraphQL development efforts. We
can do that by building query-to-feature correlation mechanisms, adding
scalable state synchronization support and aligning GraphQL with other
-architectural initiatives being executed in parallel, like
+architectural initiatives being executed in parallel, like
[the support for direct uploads](https://gitlab.com/gitlab-org/gitlab/-/issues/280819).

GraphQL should be secure by default. We can avoid common security mistakes by
-building mechanisms that will help us to enforce
+building mechanisms that will help us to enforce
[OWASP GraphQL recommendations](https://cheatsheetseries.owasp.org/cheatsheets/GraphQL_Cheat_Sheet.html)
that are relevant to us.

diff --git a/doc/architecture/blueprints/object_storage/index.md b/doc/architecture/blueprints/object_storage/index.md
index b70339c8b8d..7a4ecd0e5a8 100644
--- a/doc/architecture/blueprints/object_storage/index.md
+++ b/doc/architecture/blueprints/object_storage/index.md
@@ -31,8 +31,8 @@ underlying implementation for shared, distributed, highly-available (HA)
file storage.

Over time, we have built support for object storage across the
-application, solving specific problems in a
-[multitude of iterations](https://about.gitlab.com/company/team/structure/working-groups/object-storage/#company-efforts-on-uploads).
+application, solving specific problems in a
+[multitude of iterations](https://about.gitlab.com/company/team/structure/working-groups/object-storage/#company-efforts-on-uploads).
This has led to increased complexity across the board, from development
(new features and bug fixes) to installation:

@@ -67,7 +67,7 @@ This has led to increased complexity across the board, from development

The following is a brief description of the main directions we can take to
remove the pain points affecting our object storage implementation.

-This is also available as [a YouTube video](https://youtu.be/X9V_w8hsM8E) recorded for the
+This is also available as [a YouTube video](https://youtu.be/X9V_w8hsM8E) recorded for the
[Object Storage Working Group](https://about.gitlab.com/company/team/structure/working-groups/object-storage/).

### Simplify GitLab architecture by shipping MinIO

@@ -78,7 +78,7 @@ local storage and object storage.

With local storage, there is the assumption of a shared storage
between components. This can be achieved by having a single box
-installation, without HA, or with NFS, which
+installation, without HA, or with NFS, which
[we no longer recommend](../../../administration/nfs.md).

We have a testing gap on object storage. It also requires Workhorse
@@ -134,7 +134,7 @@ access to new features without infrastructure chores.

Our implementation is built on top of a 3rd-party framework where
every object storage client is a 3rd-party library. Unfortunately some
-of them are unmaintained.
+of them are unmaintained.
[We have customers who cannot push 5GB Git LFS objects](https://gitlab.com/gitlab-org/gitlab/-/issues/216442),
but with such a vital feature implemented in 3rd-party libraries we
are slowed down in fixing it, and we also rely on external maintainers
@@ -214,7 +214,7 @@ Proposal:

DRIs:

-The DRI for this blueprint is the
+The DRI for this blueprint is the
[Object Storage Working Group](https://about.gitlab.com/company/team/structure/working-groups/object-storage/).

<!-- vale gitlab.Spelling = YES -->

diff --git a/doc/architecture/blueprints/pods/index.md b/doc/architecture/blueprints/pods/index.md
new file mode 100644
index 00000000000..fc33a4f441b
--- /dev/null
+++ b/doc/architecture/blueprints/pods/index.md
@@ -0,0 +1,162 @@
+---
+stage: enablement
+group: pods
+comments: false
+description: 'Pods'
+---
+
+# Pods
+
+DISCLAIMER:
+This page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.
+
+This document is a work in progress and represents a very early state of the Pods design. Significant aspects are not documented, though we expect to add them in the future.
+
+## Summary
+
+Pods is a new architecture for our Software as a Service platform that is horizontally scalable, resilient, and provides a more consistent user experience. It may also provide additional features in the future, such as data residency control (regions) and federated features.
+
+## Terminology
+
+We use the following terms to describe components and properties of the Pods architecture.
+
+### Pod
+
+A Pod is a set of infrastructure components that contains multiple workspaces that belong to different organizations. The components include both datastores (PostgreSQL, Redis, etc.) and stateless services (web, etc.).
The infrastructure components provided within a Pod are shared among workspaces but not shared with other Pods. This isolation of infrastructure components means that Pods are independent from each other.
+
+#### Pod properties
+
+- Each pod is independent from the others
+- Infrastructure components are shared by workspaces within a Pod
+- More Pods can be provisioned to provide horizontal scalability
+- A failing Pod does not lead to failure of other Pods
+- Noisy neighbor effects are limited to within a Pod
+- Pods are not visible to organizations; they are an implementation detail
+- Pods may be located in different geographical regions (for example, EU, US, JP, UK)
+
+Discouraged synonyms: GitLab instance, cluster, shard
+
+### Workspace
+
+A [workspace](../../../user/workspace/index.md) is the name for the top-level namespace that is used by organizations to manage everything GitLab. It will provide similar administrative capabilities to a self-managed instance.
+
+See more in the [workspace group overview](https://about.gitlab.com/direction/manage/workspace/#overview).
+
+#### Workspace properties
+
+- Workspaces are isolated from each other by default
+- A workspace is located on a single Pod
+- Workspaces share the resources provided by a Pod
+
+### Top-level namespace
+
+A top-level namespace is the logical object container in the code that represents all groups, subgroups and projects that belong to an organization.
+
+A top-level namespace is the root of nested collection namespaces and projects. The namespace and its related entities form a tree-like hierarchy: namespaces are the nodes of the tree, projects are the leaves. An organization usually contains a single top-level namespace, called a workspace.
+
+Example:
+
+`https://gitlab.com/gitlab-org/gitlab/`:
+
+- `gitlab-org` is a `top-level namespace`; the root for all groups and projects of an organization
+- `gitlab` is a `project`; a project of the organization.
+
+Discouraged synonyms: root-level namespace
+
+#### Top-level namespace properties
+
+Same as workspaces.
+
+### Users
+
+Users are available globally and not restricted to a single Pod. Users can create multiple workspaces and they may be members of several workspaces and contribute to them. Because users' activity is not limited to an individual Pod, their activity needs to be aggregated across Pods to reflect all their contributions (for example TODOs). This means the Pods architecture may need to provide a central dashboard.
+
+#### User properties
+
+- Users are shared globally across all Pods
+- Users can create multiple workspaces
+- Users can be a member of multiple workspaces
+
+## Goals
+
+### Scalability
+
+The main goal of this new shared-infrastructure architecture is to provide additional scalability for our SaaS Platform. GitLab.com is largely monolithic and we have estimated (internal) that the current architecture has scalability limitations, even when database partitioning and decomposition are taken into account.
+
+Pods provide a horizontally scalable solution because additional Pods can be created based on demand. Pods can be provisioned and tuned as needed for optimal scalability.
+
+### Increased availability
+
+A major challenge for shared-infrastructure architectures is a lack of isolation between workspaces. This can lead to noisy neighbor effects. An organization's behavior inside a workspace can impact all other workspaces. This is highly undesirable. Pods provide isolation at the pod level.
A group of organizations is fully isolated from other organizations located on a different Pod. This minimizes noisy neighbor effects while still benefiting from the cost-efficiency of shared infrastructure.
+
+Additionally, Pods provide a way to implement disaster recovery capabilities. Entire Pods may be replicated to read-only standbys with automatic failover capabilities.
+
+### A consistent experience
+
+Organizations should have the same user experience on our SaaS platform as they do on a self-managed GitLab instance.
+
+### Regions
+
+GitLab.com is only hosted within the United States of America. Organizations located in other regions have voiced demand for local SaaS offerings. Pods provide a path towards [GitLab Regions](https://gitlab.com/groups/gitlab-org/-/epics/6037) because Pods may be deployed within different geographies. Depending on which of an organization's data is located outside a Pod, this may solve data residency and compliance problems.
+
+## Market segment
+
+Pods would provide a solution for organizations in the small to medium business (up to 100 users) and the mid-market segment (up to 2000 users).
+(See [segmentation definitions](https://about.gitlab.com/handbook/sales/field-operations/gtm-resources/#segmentation).)
+Larger organizations may benefit substantially from [GitLab Dedicated](../../../subscriptions/gitlab_dedicated/index.md).
+
+## High-level architecture problems to solve
+
+A number of technical issues need to be resolved to implement Pods (in no particular order). This section will be expanded.
+
+1. How are users of an organization routed to the correct Pod containing their workspace?
+1. How do users authenticate?
+1. How are Pods rebalanced?
+1. How are Pods provisioned?
+1. How can Pods implement disaster recovery capabilities?
+
+## Iteration 1
+
+Ultimately, a Pods architecture should offer the same user experience as self-managed and GitLab Dedicated. However, at this moment GitLab.com has many more "social-network"-like capabilities that will be difficult to implement with a Pods architecture. We should evaluate whether the SMB and mid-market segments are interested in these features, or whether not having them is acceptable in most cases.
+
+The first iteration of Pods will still contain some limitations that would break cross-workspace workflows. This means it may only be acceptable for new customers, or for existing customers that are briefed.
+
+Limitations are:
+
+- An organization can create only a single workspace.
+- Workspaces are isolated from each other. This means cross-workspace workflows are broken.
+
+## Iteration 2
+
+Based on user research, we may want to change certain features to work across namespaces to allow organizations to interact with each other in specific circumstances. We may also allow organizations to have more than one workspace. This is particularly relevant for organizations with sub-divisions, or multi-national organizations that want to have workspaces in different regions.
+
+Additional features:
+
+- Specific features allow for cross-workspace interactions, for example forking and search.
+- An organization can own multiple workspaces on different Pods.
+
+### Links
+
+- [Internal Pods presentation](https://docs.google.com/presentation/d/1x1uIiN8FR9fhL7pzFh9juHOVcSxEY7d2_q4uiKKGD44/edit#slide=id.ge7acbdc97a_0_155)
+- [Pods Epic](https://gitlab.com/groups/gitlab-org/-/epics/7582)
+- [Database Group investigation](https://about.gitlab.com/handbook/engineering/development/enablement/data_stores/database/doc/root-namespace-sharding.html)
+- [Shopify Pods architecture](https://shopify.engineering/a-pods-architecture-to-allow-shopify-to-scale)
+- [Opstrace architecture](https://gitlab.com/gitlab-org/opstrace/opstrace/-/blob/main/docs/architecture/overview.md)
+
+### Who
+
+| Role | Who
+|------------------------------|-------------------------|
+| Author | Fabian Zimmer |
+| Architecture Evolution Coach | Kamil Trzciński |
+| Engineering Leader | TBD |
+| Product Manager | Fabian Zimmer |
+| Domain Expert / Database | TBD |
+
+DRIs:
+
+| Role | Who
+|------------------------------|------------------------|
+| Leadership | TBD |
+| Product | Fabian Zimmer |
+| Engineering | Thong Kuah |

diff --git a/doc/architecture/blueprints/rate_limiting/index.md b/doc/architecture/blueprints/rate_limiting/index.md
new file mode 100644
index 00000000000..692cef4b11d
--- /dev/null
+++ b/doc/architecture/blueprints/rate_limiting/index.md
@@ -0,0 +1,411 @@
+---
+stage: none
+group: unassigned
+comments: false
+description: 'Next Rate Limiting Architecture'
+---
+
+# Next Rate Limiting Architecture
+
+## Summary
+
+Introducing reasonable application limits is a very important step in any SaaS
+platform scaling strategy. The more users a SaaS platform has, the more
+important it is to introduce sensible rate limiting and policy enforcement
+that will help to achieve availability goals, reduce the problem of noisy
+neighbors for users, and ensure that they can keep using the platform
+successfully.
+
+This is especially true for GitLab.com. Our goal is to have a reasonable and
+transparent strategy for enforcing application limits, which will become a
+definition of responsible usage, to help us keep our availability and user
+satisfaction at the desired level.
+
+We've been introducing various application limits for many years already, but
+we've never had a consistent strategy for doing it. What we want to build now
+is a consistent framework used by engineers and product managers, across the
+entire application stack, to define, expose and enforce limits and policies.
+
+The lack of consistency in defining limits, and not being able to expose them
+to our users, support engineers and satellite services, has a negative impact
+on our productivity, makes it difficult to introduce new limits and eventually
+prevents us from enforcing responsible usage on all layers of our application
+stack.
+
+This blueprint has been written to consolidate our limits and to describe the
+vision of our next rate limiting and policy enforcement architecture.
+
+_Disclaimer: The following contains information related to upcoming products,
+features, and functionality._
+
+_It is important to note that the information presented is for informational
+purposes only. Please do not rely on this information for purchasing or
+planning purposes._
+
+_As with all projects, the items mentioned in this document and linked pages
+are subject to change or delay.
+The development, release and timing of any
+products, features, or functionality remain at the sole discretion of GitLab
+Inc._
+
+## Goals
+
+**Implement the next architecture for rate limiting and policy definition.**
+
+## Challenges
+
+- We have many ways to define application limits, in many different places.
+- It is difficult to understand what limits have been applied to a request.
+- It is difficult to introduce new limits, and even more difficult to define
+  policies.
+- Finding what limits are defined requires performing a codebase audit.
+- We don't have a good way to expose limits to satellite services like Registry.
+- We enforce a number of different policies via opaque external systems
+  (Pipeline Validation Service, Bouncer, Watchtower, Cloudflare, HAProxy).
+- There is no standardized way to define policies in a way consistent with
+  defining limits.
+- It is difficult to understand when a user is approaching a limit threshold.
+- There is no way to automatically notify a user when they are approaching
+  thresholds.
+- There is no single way to change limits for a namespace / project / user / customer.
+- There is no single way to monitor limits through real-time metrics.
+- There is no framework for hierarchical limit configuration (instance / namespace / sub-group / project).
+- We allow disabling rate-limiting for some marquee SaaS customers, but this
+  increases risk for those same customers. We should instead be able to set
+  higher limits.
+
+## Opportunity
+
+We want to build a new framework, making it easier to define limits, quotas and
+policies, and to enforce / adjust them in a controlled way, through robust
+monitoring capabilities.
+
+<!-- markdownlint-disable MD029 -->
+
+1. Build a framework to define and enforce limits in GitLab Rails.
+2. Build an API to consume limits in satellite services and expose them to users.
+3. Extract parts of this framework into a dedicated GitLab Limits Service.
+
+<!-- markdownlint-enable MD029 -->
+
+The most important opportunity here is consolidation happening on multiple
+levels:
+
+1. Consolidate on the application limits tooling used in GitLab Rails.
+1. Consolidate on the process of adding and managing application limits.
+1. Consolidate on the behavior of the hierarchical cascade of limits and overrides.
+1. Consolidate on the application limits tooling used across the entire application stack.
+1. Consolidate on the policy enforcement tooling used across the entire company.
+
+Once we do that, we will unlock another opportunity: to ship the new framework /
+tooling as a GitLab feature to unlock these consolidation benefits for our
+users, customers and the wider community.
+
+### Limits, quotas and policies
+
+This document aims to describe our technical vision for building the next rate
+limiting architecture for GitLab.com. We refer to this architectural evolution
+as "the next rate limiting architecture", but this is a mental shortcut,
+because we actually want to build a better framework that will make it easier
+for us to manage not only rate limits, but also quotas and policies.
+
+Below you can find short definitions of what we understand by a limit, a
+quota and a policy.
+
+- **Limit:** A constraint on application usage, typically used to mitigate
+  risks to performance, stability, and security.
+  - _Example:_ API calls per second for a given IP address
+  - _Example:_ `git clone` events per minute for a given user
+  - _Example:_ maximum artifact upload size of 1GB
+- **Quota:** A global constraint on application usage that is aggregated across an
+  entire namespace over the duration of its billing cycle.
+  - _Example:_ 400 CI/CD minutes per namespace per month
+  - _Example:_ 10GB transfer per namespace per month
+- **Policy:** A representation of business logic that is decoupled from application
+  code. Decoupled policy definitions allow logic to be shared across multiple services
+  and/or "hot-loaded" at runtime without releasing a new version of the application.
+  - _Example:_ decode and verify a JWT, determine whether the user has access to the
+    given resource based on the JWT's scopes and claims
+  - _Example:_ deny access based on group-level constraints
+    (such as IP allowlist, SSO, and 2FA) across all services
+
+Technically, all of these are limits, because rate limiting is still
+"limiting", a quota is usually a business limit, and a policy limits what you
+can do with the application to enforce specific rules. By referring to a
+"limit" in this document we mean a limit that is defined to protect business,
+availability and security.
+
+### Framework to define and enforce limits
+
+First we want to build a new framework that will allow us to define and enforce
+application limits, in the GitLab Rails project context, in a more consistent
+and established way. In order to do that, we will need to build a new
+abstraction that will tell engineers how to define a limit in a structured way
+(presumably using YAML or CUE format) and then how to consume the limit in the
+application itself.
+
+We already have many limits defined in the application; we can use them to
+triangulate a reasonable abstraction that will consolidate how we define, use
+and enforce limits.
+
+We envision building a simple Ruby library here (we can add it to LabKit) that
+will make it trivial for engineers to check if a certain limit has been
+exceeded or not.
+
+```yaml
+name: my_limit_name
+actors: user
+context: project, group, pipeline
+type: rate / second
+group: pipeline::execution
+limits:
+  warn: 2B / day
+  soft: 100k / s
+  hard: 500k / s
+```
+
+```ruby
+Gitlab::Limits::RateThreshold.enforce(:my_limit_name) do |threshold|
+  # The actor and the context for which the threshold is evaluated
+  actor = current_user
+  context = current_project
+
+  threshold.available do |limit|
+    # the limit has not been approached yet
+  end
+
+  threshold.approaching do |limit|
+    # the limit is close to being exceeded
+  end
+
+  threshold.exceeded do |limit|
+    # the limit has been exceeded
+  end
+end
+```
+
+In the example above, when `my_limit_name` is defined in YAML, engineers will
+be able to check the current state and execute the appropriate code block
+depending on past usage / resource consumption. A hypothetical sketch of how
+such a YAML definition could be loaded is shown after the list below.
+
+Things we want to build and support by default:
+
+1. Comprehensive dashboards showing how often limits are being hit.
+1. Notifications about the risk of hitting limits.
+1. Automation checking if limit definitions are being enforced properly.
+1. Different types of limits - time-bound, number per resource, etc.
+1. A panel that makes it easy to override limits per plan / namespace.
+1. Logging that will expose limits applied in Kibana.
+1. An automatically generated documentation page describing all the limits.
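+
+To make the definition above more concrete, here is a minimal, hypothetical
+sketch of how such a YAML definition could be loaded and validated. The class
+names, file layout and validation rules are illustrative assumptions only,
+not a settled design:
+
+```ruby
+# Hypothetical loader for YAML limit definitions. All names are illustrative.
+require 'yaml'
+
+module Gitlab
+  module Limits
+    Definition = Struct.new(:name, :actors, :context, :type, :group, :limits,
+                            keyword_init: true)
+
+    REQUIRED_KEYS = %w[name actors type limits].freeze
+
+    def self.load_definition(path)
+      raw = YAML.safe_load(File.read(path))
+
+      # Fail fast on malformed definitions so that mistakes are caught by
+      # automation (for example in CI) rather than at enforcement time.
+      missing = REQUIRED_KEYS - raw.keys
+      raise ArgumentError, "missing keys: #{missing.join(', ')}" if missing.any?
+
+      Definition.new(
+        name: raw['name'],
+        actors: split_list(raw['actors']),   # "user" -> ["user"]
+        context: split_list(raw['context']), # "project, group" -> ["project", "group"]
+        type: raw['type'],
+        group: raw['group'],
+        limits: raw['limits'] # for example { 'warn' => '2B / day', ... }
+      )
+    end
+
+    # The example YAML uses comma-separated scalars for list-like fields.
+    def self.split_list(value)
+      value.to_s.split(',').map(&:strip)
+    end
+  end
+end
+```
+
+A loader along these lines would give the automation mentioned above a single
+place to verify that every limit definition is well-formed.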
+
+### API to expose limits and policies
+
+Once we have established a consistent way to define application limits we can
+build a few API endpoints that will allow us to expose them to our users,
+customers and other satellite services that may want to consume them.
+
+Users will be able to ask the API about the limits / thresholds that have been
+set for them, how often they are hitting them, and what impact those might have
+on their business. This kind of transparency can help them with communicating
+their needs to the customer success team at GitLab, and we will be able to
+communicate how responsible usage is defined at a given moment.
+
+Because of how the GitLab architecture has been built, the GitLab Rails
+application, in most cases, behaves as a central enterprise service bus (ESB)
+with a few satellite services communicating with it. Services like Container
+Registry, GitLab Runner, Gitaly, Workhorse and KAS could use the API to
+receive the set of application limits they are supposed to enforce. This will
+still allow us to define all of them in a single place.
+
+We should, however, avoid a possible negative feedback loop that would put
+additional strain on the Rails application when there is a sudden increase in
+usage. This might be a big customer starting new automation that traverses our
+API, or a Denial of Service attack. In such cases, the additional traffic will
+reach GitLab Rails and subsequently the other satellite services too. The
+satellite services may then need to consult Rails again to obtain new
+instructions / policies for rate limiting the increased traffic. This can put
+additional strain on the Rails application and degrade performance even
+further. To avoid this problem, we should extract the API endpoints into a
+separate service (see the section below) if the request rate to those
+endpoints depends on the volume of incoming traffic. Alternatively, we can
+keep those endpoints in Rails if the increased traffic does not translate into
+an increased request rate or increased resource consumption on these API
+endpoints on the Rails side.
+
+#### Decoupled Limits Service
+
+At some point we may decide that it is time to extract a stateful backend,
+responsible for storing metadata around limits, all the counters and state
+required, and for exposing an API, out of Rails.
+
+It is impossible to make a decision about extracting such a decoupled limits
+service yet, because we will need to ship more proof-of-concept work and
+concrete iterations to better inform us about when and how we should do that.
+We will depend on the Architecture Evolution practice to guide us towards
+either extracting a Decoupled Limits Service or not doing that at all.
+
+As we evolve this blueprint, we will document our findings and insights about
+what this service should look like in this section of the document.
+
+### GitLab Policy Service
+
+_Disclaimer_: Extracting a GitLab Policy Service might be out of scope
+of the current workstream organized around implementing this blueprint.
+
+Not all limits can be easily described in YAML. There are some more complex
+policies that require a more sophisticated approach and a declarative
+programming language to enforce them. One example of such a language is
+[Rego](https://www.openpolicyagent.org/docs/latest/policy-language/), the
+standardized way to define policies in
+[OPA - Open Policy Agent](https://www.openpolicyagent.org/).
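+
+To illustrate how the Rails limits framework might consult such a policy
+agent, below is a hypothetical sketch that queries an OPA-style HTTP decision
+endpoint. The policy path, the input shape and the deny-by-default behavior
+are illustrative assumptions, not an agreed-upon contract:
+
+```ruby
+# Hypothetical policy check against an OPA-style agent. The endpoint and
+# policy path below are illustrative assumptions.
+require 'net/http'
+require 'json'
+require 'uri'
+
+OPA_DECISION_URL = URI('http://localhost:8181/v1/data/gitlab/policies/pipeline/allow')
+
+def policy_allows?(input)
+  response = Net::HTTP.post(
+    OPA_DECISION_URL,
+    { input: input }.to_json,
+    'Content-Type' => 'application/json'
+  )
+
+  # OPA's Data API wraps the decision in a "result" key. A missing result
+  # means the decision was undefined, which is treated as a denial here.
+  JSON.parse(response.body).fetch('result', false) == true
+end
+
+policy_allows?(user_id: 42, project_id: 1337, action: 'create_pipeline')
+```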
+
+At GitLab we are already using OPA in some departments. We envision the need
+for additional consolidation, not only to consolidate on the tooling we are
+using internally at GitLab, but also to transform the Next Rate Limiting
+Architecture into something we can make a part of the product itself.
+
+Today, we already have a policy service that we use to decide whether a
+pipeline can be created or not. There are many policies defined in
+[Pipeline Validation Service](https://gitlab.com/gitlab-org/modelops/anti-abuse/pipeline-validation-service).
+There is a significant opportunity here in transforming Pipeline Validation
+Service into a general-purpose GitLab Policy Service / GitLab Policy Agent that
+is well integrated into the GitLab product itself.
+
+Generalizing Pipeline Validation Service into GitLab Policy Service can bring a
+few interesting benefits:
+
+1. Consolidate on our tooling across the company to improve efficiency.
+1. Integrate our GitLab Rails limits framework to resolve policies using the policy service.
+1. Avoid struggling to define complex policies in YAML and hacking their evaluation in Ruby.
+1. Build a policy for limiting GraphQL queries using query execution cost estimation.
+1. Make it easier to resolve policies that do not need a "hierarchical limits" structure.
+1. Make GitLab Policy Service part of the product and integrate it into the single application.
+
+We envision GitLab Policy Service being the place to define policies that do
+not require knowing anything about the hierarchical structure of the limits.
+There are limits that do not need this, like IP address allowlists, spam
+checks, configuration validation, etc.
+
+We defined a "Policy" as a stateless, functional-style limit. It takes input
+arguments and evaluates to either true or false. It should not require a global
+counter or any other volatile global state to be evaluated. It may still
+require globally defined rules / configuration, but this state is not volatile
+in the same way a rate limiting counter is, or the number of megabytes
+consumed when evaluating a quota.
+
+#### Policies used internally and externally
+
+The GitLab Policy Service might be used in two different ways:
+
+1. The Rails limits framework will use it as a source of policies enforced internally.
+1. The policy service feature will be used as a backend to store policies defined by users.
+
+These are two slightly different use cases: the first one is about using
+internally-defined policies to ensure the stability / availability of a GitLab
+instance (GitLab.com or a self-managed instance). The second use case is about
+making GitLab Policy Service a feature that users will be able to build on top
+of.
+
+Both use cases are valid, but we will need to make a technical decision about
+how to separate them. Even if we decide to implement them both in a single
+service, we will need to draw a strong boundary between the two.
+
+The same principle might apply to the Decoupled Limits Service described in
+one of the sections of this document above.
+
+#### The two limits / policy services
+
+It is possible that GitLab Policy Service and Decoupled Limits Service can
+actually be the same thing. This, however, depends on implementation details
+that we can't predict yet, and the decision about merging these services
+together will need to be informed by feedback from subsequent iterations.
+
+## Hierarchical limits
+
+The GitLab application aggregates users, projects, groups and namespaces in a
+hierarchical way.
+This hierarchical structure has been designed to make it
+easier to manage permissions, streamline workflows, and allow users and
+customers to store related projects, repositories, and other artifacts
+together.
+
+It is important to design the new rate limiting framework in a way that is
+built on top of this hierarchical structure, so that engineers, customers,
+SREs and other stakeholders can understand how limits are applied, enforced
+and overridden within the hierarchy of namespaces, groups and projects.
+
+We want to reduce the cognitive load required to understand how limits are
+being managed within the existing permissions structure. We might need to build
+a simple and easy-to-understand formula for how our application decides which
+limits and thresholds to apply for a given request and a given actor:
+
+> GitLab will read the default limits for every operation and all overrides
+> configured, and will choose the limit with the highest precedence. A limit
+> precedence needs to be explicitly configured for every override; a default
+> limit has precedence 100.
+
+One way in which we can simplify limits management in general is to:
+
+1. Have default limits / thresholds defined in YAML files with a default precedence of 100.
+1. Allow limits to be overridden through the API, and store overrides in the database.
+1. Require every limit / threshold override to have an integer precedence value.
+1. Build an API that will take an actor and expose the limits applicable to it.
+1. Build a dashboard showing actors with non-standard limits / overrides.
+1. Build observability around this, showing in Kibana when non-standard limits are being used.
+
+The points above represent an idea to use a precedence score (or a Z-Index for
+limits), but there may be better solutions, like simply defining a direction of
+overrides - a lower limit might always override a limit defined higher in the
+hierarchy. Choosing a proper solution will require thoughtful research.
+
+## Principles
+
+1. Try to avoid building the rate limiting framework in a tightly coupled way.
+1. Build the application limits API in a way that it can be easily extracted to a separate service.
+1. Build the application limits definition in a way that is independent of the Rails application.
+1. Build tooling that produces consistent behavior and results across programming languages.
+1. Build the new framework in a way that we can extend it to allow self-managed administrators to customize limits.
+1. Maintain consistent features and behavior across the SaaS and self-managed codebases.
+1. Be mindful of the cognitive load added by hierarchical limits, and aim to reduce it.
+
+## Status
+
+Request For Comments.
+
+## Timeline
+
+- 2022-04-27: [Rate Limit Architecture Working Group](https://about.gitlab.com/company/team/structure/working-groups/rate-limit-architecture/) started.
+- 2022-06-07: Working Group members [started submitting technical proposals](https://gitlab.com/gitlab-org/gitlab/-/issues/364524) for the next rate limiting architecture.
+- 2022-06-15: We started [scoring proposals](https://docs.google.com/spreadsheets/d/1DFHU1kSdTnpydwM5P2RK8NhVBNWgEHvzT72eOhB8F9E) submitted by Working Group members.
+- 2022-07-06: A fourth, [consolidated proposal](https://gitlab.com/gitlab-org/gitlab/-/issues/364524#note_1017640650), has been submitted.
+- 2022-07-12: Started working on the design document following the [Architecture Evolution Workflow](https://about.gitlab.com/handbook/engineering/architecture/workflow/).
+- 2022-09-08: The initial version of the blueprint has been merged.
+
+## Who
+
+Proposal:
+
+<!-- vale gitlab.Spelling = NO -->
+
+| Role                         | Who
+|------------------------------|-------------------------|
+| Author                       | Grzegorz Bizon          |
+| Author                       | Fabio Pitino            |
+| Author                       | Marshall Cottrell       |
+| Author                       | Hayley Swimelar         |
+| Engineering Leader           | Sam Goldstein           |
+| Product Manager              |                         |
+| Architecture Evolution Coach |                         |
+| Recommender                  |                         |
+| Recommender                  |                         |
+| Recommender                  |                         |
+| Recommender                  |                         |
+
+DRIs:
+
+| Role                         | Who
+|------------------------------|------------------------|
+| Leadership                   |                        |
+| Product                      |                        |
+| Engineering                  |                        |
+
+Domain experts:
+
+| Area                         | Who
+|------------------------------|------------------------|
+|                              |                        |
+
+<!-- vale gitlab.Spelling = YES -->
diff --git a/doc/architecture/blueprints/runner_scaling/index.md b/doc/architecture/blueprints/runner_scaling/index.md
index 494aaa6a641..8f7062a1148 100644
--- a/doc/architecture/blueprints/runner_scaling/index.md
+++ b/doc/architecture/blueprints/runner_scaling/index.md
@@ -33,7 +33,7 @@ This design choice was crucial for the GitLab Runner success. Since that time
 the auto-scaling feature has been used by many users and customers and enabled
 rapid growth of CI/CD adoption on GitLab.com.
 
-We can not, however, continue using Docker Machine. Work on that project 
+We can not, however, continue using Docker Machine. Work on that project
 [was paused in July 2018](https://github.com/docker/machine/issues/4537) and
 there was no development made since that time (except for some highly important
 security fixes). In 2018, after Docker Machine entered the "maintenance mode",
@@ -76,7 +76,7 @@ mechanism with a reliable and flexible mechanism.
 We might be unable to build a drop-in replacement for Docker Machine, as there
 are presumably many reasons why it has been deprecated. It is very difficult to
 maintain compatibility with so many cloud providers, and it seems that Docker
 Machine has been deprecated
-in favor of Docker Desktop, which is not a viable replacement for us. 
+in favor of Docker Desktop, which is not a viable replacement for us.
 [This issue](https://github.com/docker/roadmap/issues/245) contains a
 discussion about how people are using Docker Machine right now, and it seems
 that GitLab CI is one of the most frequent reasons for people to keep using
 Docker Machine.