author    GitLab Bot <gitlab-bot@gitlab.com>  2021-08-19 09:08:42 +0000
committer GitLab Bot <gitlab-bot@gitlab.com>  2021-08-19 09:08:42 +0000
commit    b76ae638462ab0f673e5915986070518dd3f9ad3 (patch)
tree      bdab0533383b52873be0ec0eb4d3c66598ff8b91 /doc/architecture
parent    434373eabe7b4be9593d18a585fb763f1e5f1a6f (diff)
download  gitlab-ce-b76ae638462ab0f673e5915986070518dd3f9ad3.tar.gz

Add latest changes from gitlab-org/gitlab@14-2-stable-ee (v14.2.0-rc42)
Diffstat (limited to 'doc/architecture')
-rw-r--r--  doc/architecture/blueprints/consolidating_groups_and_projects/index.md                                155
-rw-r--r--  doc/architecture/blueprints/database/scalability/patterns/img/db_terminology_v14_2.png                bin 0 -> 51264 bytes
-rw-r--r--  doc/architecture/blueprints/database/scalability/patterns/img/read_mostly_licenses_calls_v14_2.png    bin 0 -> 157824 bytes
-rw-r--r--  doc/architecture/blueprints/database/scalability/patterns/img/read_mostly_licenses_fixed_v14_2.png    bin 0 -> 85790 bytes
-rw-r--r--  doc/architecture/blueprints/database/scalability/patterns/img/read_mostly_readwriteratio_v14_2.png    bin 0 -> 93291 bytes
-rw-r--r--  doc/architecture/blueprints/database/scalability/patterns/img/read_mostly_subscriptions_reads_v14_2.png    bin 0 -> 60703 bytes
-rw-r--r--  doc/architecture/blueprints/database/scalability/patterns/img/read_mostly_subscriptions_writes_v14_2.png   bin 0 -> 52727 bytes
-rw-r--r--  doc/architecture/blueprints/database/scalability/patterns/index.md                                    12
-rw-r--r--  doc/architecture/blueprints/database/scalability/patterns/read_mostly.md                               152
-rw-r--r--  doc/architecture/blueprints/database/scalability/patterns/time_decay.md                                361
-rw-r--r--  doc/architecture/blueprints/database_scaling/size-limits.md                                            176
-rw-r--r--  doc/architecture/blueprints/database_testing/index.md                                                  2
12 files changed, 857 insertions, 1 deletions
diff --git a/doc/architecture/blueprints/consolidating_groups_and_projects/index.md b/doc/architecture/blueprints/consolidating_groups_and_projects/index.md
new file mode 100644
index 00000000000..fab886808e2
--- /dev/null
+++ b/doc/architecture/blueprints/consolidating_groups_and_projects/index.md
@@ -0,0 +1,155 @@
+---
+stage: none
+group: unassigned
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+comments: false
+description: Consolidating groups and projects
+---
+
+# Consolidating Groups and Projects
+
+Numerous features exist exclusively within either groups or projects. The
+boundary between group and project features used to be clear. However, there is
+growing demand for group features within projects, and for project features
+within groups: for example, issues in groups and epics in projects.
+
+The [Simplify Groups & Projects Working Group](https://about.gitlab.com/company/team/structure/working-groups/simplify-groups-and-projects/)
+determined that our architecture is a significant hurdle in sharing features
+across groups and projects.
+
+Architecture issue: <https://gitlab.com/gitlab-org/architecture/tasks/-/issues/7>
+
+## Challenges
+
+### Feature duplication
+
+When a feature needs to be made available at a different level, we have
+no established process for doing so. This results in the same feature being
+reimplemented, and those implementations diverge from each other over time
+because each lives on its own. Further problems with this approach:
+
+- Features are coupled to their container. In practice it is not
+ straightforward to decouple a feature from its container, and the degree of
+ coupling varies across features.
+- Naive duplication of features will result in a more complex and fragile code base.
+- Generalizing solutions across groups and projects may degrade system performance.
+- The range of features spans many teams, and changes need to be coordinated
+ to manage development interference.
+- The group/project hierarchy creates a natural feature hierarchy. When features
+ exist across containers the feature hierarchy becomes ambiguous.
+- Duplication of features slows down development velocity.
+
+There is potential for significant architectural changes. These changes will
+have to be independent of the product design, so that customer experience
+remains consistent.
+
+### Performance
+
+Resources can only be queried in elaborate, complicated ways. This has caused
+performance issues with authorization, epics, and many other features. As an
+example, to query the projects a user has access to, the following sources need
+to be considered:
+
+- personal projects
+- direct group membership
+- direct project membership
+- inherited group membership
+- inherited project membership
+- group sharing
+- inherited membership via group sharing
+- project sharing
+
+Group/project membership and group/project sharing are themselves examples of
+duplicated features.
+
+## Goals
+
+For now this blueprint strictly relates to the engineering challenges.
+
+- Consolidate the group and project container architecture.
+- Develop a set of solutions to decouple features from their container.
+- Decouple engineering changes from product changes.
+- Develop a strategy to make architectural changes without adversely affecting
+ other teams.
+- Provide a solution for requests to make features available at other levels.
+
+## Proposal
+
+Use our existing `Namespace` model as a container for features. We already have
+a `Namespace` associated with `User` (personal namespace), and with `Group`
+(which is a subclass of `Namespace`). We can extend this further by introducing
+`ProjectNamespace`, a new `Namespace` subclass associated with `Project`. Each
+`Project` should be owned by its `ProjectNamespace`, and this relation should
+replace the existing relation between `Project` and `Group` / personal namespace.
+
+We also lack a model specific to personal namespaces, and we use the generic
+`Namespace` model instead. This is confusing, but can be fixed by creating a
+dedicated subclass: `UserNamespace`.
+
+As a result, the `Namespace` hierarchy will transition to:
+
+```mermaid
+classDiagram
+ Namespace <|-- UserNamespace
+ Namespace <|-- Group
+ Namespace <|-- ProjectNamespace
+```
+
+New features should be implemented on `Namespace`. Similarly, when a feature
+needs to be made available at a different level, moving it to `Namespace`
+essentially makes it available at all levels:
+
+- personal namespaces
+- groups
+- projects
+
+Various traversal queries are already available on `Namespaces` to query the
+group hierarchy. `Projects` represent the leaf nodes in the hierarchy, but with
+the introduction of `ProjectNamespace`, these traversal queries can be used to
+retrieve projects as well.
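+
+A minimal ActiveRecord sketch of how these relationships might be expressed (the class names follow
+the hierarchy above; the association and column names are illustrative assumptions, not the final
+implementation):
+
+```ruby
+class Namespace < ApplicationRecord
+  # Self-referential hierarchy used by the existing traversal queries.
+  belongs_to :parent, class_name: 'Namespace', optional: true
+  has_many :children, class_name: 'Namespace', foreign_key: :parent_id
+end
+
+class Group < Namespace; end
+class UserNamespace < Namespace; end
+
+class ProjectNamespace < Namespace
+  has_one :project, inverse_of: :project_namespace
+end
+
+class Project < ApplicationRecord
+  # Replaces the existing relation between Project and Group / personal namespace.
+  belongs_to :project_namespace, inverse_of: :project
+end
+```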
+
+This also enables further simplification of some of our core features:
+
+- Routes should be generated based on the `Namespace` hierarchy, instead of
+ mixing projects into the group hierarchy.
+- There is no need to differentiate between `GroupMembers` and `ProjectMembers`.
+ All `Members` should be related to a `Namespace`. This can lead to simplified
+ querying, and potentially to deduplicated policies.
+
+As more and more features are migrated to `Namespace`, the role of the `Project`
+model will diminish over time, leaving it as essentially a container around
+repository-related functionality.
+
+## Iterations
+
+The work required to establish `Namespace` as a container for our features is
+tracked in the [Consolidate Groups and Projects](https://gitlab.com/groups/gitlab-org/-/epics/6473)
+epic.
+
+## Who
+
+Proposal:
+
+<!-- vale gitlab.Spelling = NO -->
+
+| Role | Who
+|------------------------------|-------------------------------------|
+| Author | Alex Pooley, Imre Farkas |
+| Architecture Evolution Coach | Dmitriy Zaporozhets, Grzegorz Bizon |
+| Engineering Leader | Michelle Gill |
+| Domain Expert | Jan Provaznik |
+
+<!-- vale gitlab.Spelling = YES -->
+
+DRIs:
+
+<!-- vale gitlab.Spelling = NO -->
+
+| Role | Who
+|------------------------------|------------------------|
+| Product | Melissa Ushakov |
+| Leadership | Michelle Gill |
+| Engineering | Imre Farkas |
+
+<!-- vale gitlab.Spelling = YES -->
diff --git a/doc/architecture/blueprints/database/scalability/patterns/img/db_terminology_v14_2.png b/doc/architecture/blueprints/database/scalability/patterns/img/db_terminology_v14_2.png
new file mode 100644
index 00000000000..85ba1360f06
--- /dev/null
+++ b/doc/architecture/blueprints/database/scalability/patterns/img/db_terminology_v14_2.png
Binary files differ
diff --git a/doc/architecture/blueprints/database/scalability/patterns/img/read_mostly_licenses_calls_v14_2.png b/doc/architecture/blueprints/database/scalability/patterns/img/read_mostly_licenses_calls_v14_2.png
new file mode 100644
index 00000000000..f6ae995391c
--- /dev/null
+++ b/doc/architecture/blueprints/database/scalability/patterns/img/read_mostly_licenses_calls_v14_2.png
Binary files differ
diff --git a/doc/architecture/blueprints/database/scalability/patterns/img/read_mostly_licenses_fixed_v14_2.png b/doc/architecture/blueprints/database/scalability/patterns/img/read_mostly_licenses_fixed_v14_2.png
new file mode 100644
index 00000000000..dcfaae230d3
--- /dev/null
+++ b/doc/architecture/blueprints/database/scalability/patterns/img/read_mostly_licenses_fixed_v14_2.png
Binary files differ
diff --git a/doc/architecture/blueprints/database/scalability/patterns/img/read_mostly_readwriteratio_v14_2.png b/doc/architecture/blueprints/database/scalability/patterns/img/read_mostly_readwriteratio_v14_2.png
new file mode 100644
index 00000000000..9b85f814bc1
--- /dev/null
+++ b/doc/architecture/blueprints/database/scalability/patterns/img/read_mostly_readwriteratio_v14_2.png
Binary files differ
diff --git a/doc/architecture/blueprints/database/scalability/patterns/img/read_mostly_subscriptions_reads_v14_2.png b/doc/architecture/blueprints/database/scalability/patterns/img/read_mostly_subscriptions_reads_v14_2.png
new file mode 100644
index 00000000000..4f448841e48
--- /dev/null
+++ b/doc/architecture/blueprints/database/scalability/patterns/img/read_mostly_subscriptions_reads_v14_2.png
Binary files differ
diff --git a/doc/architecture/blueprints/database/scalability/patterns/img/read_mostly_subscriptions_writes_v14_2.png b/doc/architecture/blueprints/database/scalability/patterns/img/read_mostly_subscriptions_writes_v14_2.png
new file mode 100644
index 00000000000..e4f5756051f
--- /dev/null
+++ b/doc/architecture/blueprints/database/scalability/patterns/img/read_mostly_subscriptions_writes_v14_2.png
Binary files differ
diff --git a/doc/architecture/blueprints/database/scalability/patterns/index.md b/doc/architecture/blueprints/database/scalability/patterns/index.md
new file mode 100644
index 00000000000..dadf3634407
--- /dev/null
+++ b/doc/architecture/blueprints/database/scalability/patterns/index.md
@@ -0,0 +1,12 @@
+---
+stage: Enablement
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+comments: false
+description: 'Learn how to scale the database through the use of best-of-class database scalability patterns'
+---
+
+# Database Scalability Patterns
+
+- [Read-mostly](read_mostly.md)
+- [Time-decay](time_decay.md)
diff --git a/doc/architecture/blueprints/database/scalability/patterns/read_mostly.md b/doc/architecture/blueprints/database/scalability/patterns/read_mostly.md
new file mode 100644
index 00000000000..02b56841507
--- /dev/null
+++ b/doc/architecture/blueprints/database/scalability/patterns/read_mostly.md
@@ -0,0 +1,152 @@
+---
+stage: Enablement
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+comments: false
+description: 'Learn how to scale operating on read-mostly data at scale'
+---
+
+# Read-mostly data
+
+[Introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/326037) in GitLab 14.0.
+
+This document describes the *read-mostly* pattern introduced in the
+[Database Scalability Working Group](https://about.gitlab.com/company/team/structure/working-groups/database-scalability/#read-mostly-data).
+We discuss the characteristics of *read-mostly* data and propose best practices for GitLab development
+to consider in this context.
+
+## Characteristics of read-mostly data
+
+As the name suggests, *read-mostly* data is data that is read much more often than it is
+updated. Writing this data through updates, inserts, or deletes is a very rare event compared
+to reading it.
+
+In addition, *read-mostly* data in this context is typically a small dataset. We explicitly don't deal
+with large datasets here, even though they often have a "write once, read often" characteristic, too.
+
+### Example: license data
+
+Let's introduce a canonical example: license data in GitLab. A GitLab instance may have a license
+attached to use GitLab enterprise features. This license data is held instance-wide; that
+is, typically only a few relevant records exist. This information is kept in a table
+`licenses`, which is very small.
+
+We consider this *read-mostly* data, because it follows the characteristics outlined above:
+
+- **Rare writes:** license data very rarely sees any writes after the license has been inserted.
+- **Frequent reads:** license data is read extremely often to check if enterprise features can be used.
+- **Small size:** this dataset is very small. On GitLab.com we have 5 records at < 50 kB total relation size (see the quick check below).
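+
+One way to confirm the size characteristic is to check the total relation size directly. A quick,
+illustrative check from a Rails console (the exact output differs per instance):
+
+```ruby
+# pg_total_relation_size includes indexes and TOAST data for the real
+# `licenses` table, which is what matters when judging how cacheable it is.
+ActiveRecord::Base.connection.select_value(
+  "SELECT pg_size_pretty(pg_total_relation_size('licenses'))"
+)
+# => a value below 50 kB on GitLab.com, per the figures above
+```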
+
+### Effects of *read-mostly* data at scale
+
+Given this dataset is small and read very often, we can expect data to nearly always reside in
+database caches and/or database disk caches. Thus, the concern with *read-mostly* data is typically
+not around database I/O overhead, because we typically don't read data from disk anyway.
+
+However, considering the high read frequency, this has the potential to incur overhead in terms of
+database CPU load and database context switches. Additionally, those high-frequency queries go
+through the whole database stack and cause overhead on the database connection
+multiplexing components and load balancers. The application also spends cycles preparing and
+sending queries to retrieve the data, deserializing the results, and allocating new objects to
+represent the information gathered - all at high frequency.
+
+In the example of license data above, the query to read license data was
+[identified](https://gitlab.com/gitlab-org/gitlab/-/issues/292900) to stand out in terms of query
+frequency. In fact, we were seeing around 6,000 queries per second (QPS) on the cluster during peak
+times. With the cluster size at that time, we were seeing about 1,000 QPS on each replica, and fewer
+than 400 QPS on the primary at peak times. The difference is explained by our
+[database load balancing for scaling reads](https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/lib/gitlab/database/load_balancing.rb),
+which favors replicas for pure read-only transactions.
+
+![Licenses Calls](img/read_mostly_licenses_calls_v14_2.png)
+
+The overall transaction throughput on the database primary at the time varied between 50,000 and
+70,000 transactions per second (TPS). In comparison, this query frequency only takes a small
+portion of the overall query frequency. However, we do expect this to still have considerable
+overhead in terms of context switches. It is worth removing this overhead, if we can.
+
+## How to recognize read-mostly data
+
+It can be difficult to recognize *read-mostly* data, even though there are clear cases like in our
+example.
+
+One approach is to look at the [read/write ratio and statistics from, for example, the primary](https://bit.ly/3frdtyz). Here, we look at the TOP20 tables by their read/write ratio over 60 minutes (taken at a peak traffic time):
+
+```plaintext
+bottomk(20,
+avg by (relname, fqdn) (
+ (
+ rate(pg_stat_user_tables_seq_tup_read{env="gprd"}[1h])
+ +
+ rate(pg_stat_user_tables_idx_tup_fetch{env="gprd"}[1h])
+ ) /
+ (
+ rate(pg_stat_user_tables_seq_tup_read{env="gprd"}[1h])
+ + rate(pg_stat_user_tables_idx_tup_fetch{env="gprd"}[1h])
+ + rate(pg_stat_user_tables_n_tup_ins{env="gprd"}[1h])
+ + rate(pg_stat_user_tables_n_tup_upd{env="gprd"}[1h])
+ + rate(pg_stat_user_tables_n_tup_del{env="gprd"}[1h])
+ )
+) and on (fqdn) (pg_replication_is_replica == 0)
+)
+```
+
+This yields a good impression of which tables are much more often read than written (on the database
+primary):
+
+![Read Write Ratio TOP20](img/read_mostly_readwriteratio_v14_2.png)
+
+From here, we can [zoom](https://bit.ly/2VmloX1) into, for example, `gitlab_subscriptions` and realize that index reads peak at above 10,000 tuples per second overall (there are no sequential scans):
+
+![Subscriptions: reads](img/read_mostly_subscriptions_reads_v14_2.png)
+
+We very rarely write to the table:
+
+![Subscriptions: writes](img/read_mostly_subscriptions_writes_v14_2.png)
+
+Additionally, the table is only 400 MB in size - so this may be another candidate worth
+considering for this pattern (see [#327483](https://gitlab.com/gitlab-org/gitlab/-/issues/327483)).
+
+## Best practices for handling read-mostly data at scale
+
+### Cache read-mostly data
+
+To reduce the database overhead, we implement a cache for the data and thus significantly
+reduce the query frequency on the database side. There are different scopes for caching available:
+
+- `RequestStore`: per-request in-memory cache (based on [request_store gem](https://github.com/steveklabnik/request_store))
+- [`ProcessMemoryCache`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/gitlab/process_memory_cache.rb#L4): per-process in-memory cache (an `ActiveSupport::Cache::MemoryStore`)
+- [`Gitlab::Redis::Cache`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/gitlab/redis/cache.rb) and `Rails.cache`: full-blown cache in Redis
+
+Continuing the above example, we had a `RequestStore` in place to cache license information on a
+per-request basis. However, that still led to one query per request. When we started to cache license information
+[using a process-wide in-memory cache](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/50318)
+for 1 second, the query frequency dropped dramatically:
+
+![Licenses Calls - Fixed](img/read_mostly_licenses_fixed_v14_2.png)
+
+The choice of caching here depends highly on the characteristics of the data in question. A very small
+dataset like license data that is almost never updated is a good candidate for in-memory caching.
+A per-process cache is favorable here, because it unties the cache refresh rate from the incoming
+request rate.
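+
+A minimal sketch of such a per-process cache (the actual GitLab implementation sits behind the
+`ProcessMemoryCache` wrapper linked above; the cache key, expiry, and lookup query here are
+illustrative assumptions):
+
+```ruby
+# A process-wide in-memory store, shared by all requests served by this process.
+LICENSE_CACHE = ActiveSupport::Cache::MemoryStore.new
+
+def cached_current_license
+  # A 1-second expiry unties the database query rate from the request rate:
+  # at most one query per second per process, regardless of traffic.
+  LICENSE_CACHE.fetch(:current_license, expires_in: 1.second) do
+    License.order(created_at: :desc).first
+  end
+end
+```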
+
+A caveat here is that our Redis setup is currently not using Redis secondaries and we rely on a
+single node for caching. That is, we need to strike a balance to avoid Redis falling over due to
+increased pressure. In comparison, reading data from PostgreSQL replicas can be distributed across
+several read-only replicas. Even though a query to the database might be more expensive, the
+load is balanced across more nodes.
+
+### Read *read-mostly* data from a replica
+
+With or without caching implemented, we also must make sure to read data from database replicas if
+we can. This supports our efforts to scale reads across many database replicas and removes
+unnecessary workload from the database primary.
+
+GitLab [database load balancing for reads](https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/lib/gitlab/database/load_balancing.rb)
+sticks to the primary after a first write or when opening an
+explicit transaction. In the context of *read-mostly* data, we strive to read this data outside of a
+transaction scope and before doing any writes. This is often possible given that this data is only
+seldom updated (and thus we're often not concerned with reading slightly stale data, for example).
+However, it can be non-obvious that this query cannot be sent to a replica because of a previous
+write or transaction. Hence, when we encounter *read-mostly* data, it is a good practice to check the
+wider context and make sure this data can be read from a replica.
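+
+A rough sketch of that ordering (the surrounding logic is illustrative):
+
+```ruby
+def update_description(project_id, description)
+  # Read the read-mostly data first, outside any transaction, so the load
+  # balancer described above can still route this query to a replica.
+  license = License.order(created_at: :desc).first
+  return unless license
+
+  # Writes (or explicit transactions) after this point stick the request to
+  # the primary, which is fine: the read-mostly read already happened.
+  Project.find(project_id).update!(description: description)
+end
+```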
diff --git a/doc/architecture/blueprints/database/scalability/patterns/time_decay.md b/doc/architecture/blueprints/database/scalability/patterns/time_decay.md
new file mode 100644
index 00000000000..9309c581d54
--- /dev/null
+++ b/doc/architecture/blueprints/database/scalability/patterns/time_decay.md
@@ -0,0 +1,361 @@
+---
+stage: Enablement
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+comments: false
+description: 'Learn how to operate on large time-decay data'
+---
+
+# Time-decay data
+
+[Introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/326035) in GitLab 14.0.
+
+This document describes the *time-decay pattern* introduced in the
+[Database Scalability Working Group](https://about.gitlab.com/company/team/structure/working-groups/database-scalability/#time-decay-data).
+We discuss the characteristics of time-decay data, and propose best practices for GitLab development
+to consider in this context.
+
+Some datasets are subject to strong time-decay effects, in which recent data is accessed far more
+frequently than older data. Another aspect of time-decay is that, with time, some types of data
+become less important. This means we can also move old data to less durable (less available)
+storage, or, in extreme cases, even delete the data.
+
+Those effects are usually tied to product or application semantics. They can vary in the degree
+to which older data are accessed, and in how useful or necessary older data are to the users or the
+application.
+
+Let's first consider entities with no inherent time-related bias for their data.
+
+A record for a user or a project may be equally important and frequently accessed regardless of when
+it was created. We cannot predict from a user's `id` or `created_at` how often the related
+record is accessed or updated.
+
+On the other hand, good examples of datasets with extreme time-decay effects are logs and time
+series data, such as events recording user actions.
+
+Most of the time, that type of data may have no business use after a couple of days or weeks, and
+it quickly becomes less important even from a data analysis perspective. It represents a snapshot
+that quickly becomes less and less relevant to the current state of the application, until at
+some point it has no real value.
+
+Between the two extremes, we can find datasets that contain useful information we want to
+keep around, but whose old records are seldom accessed after an initial (short) period following
+creation.
+
+## Characteristics of time-decay data
+
+We are interested in datasets that show the following characteristics:
+
+- **Size of the dataset:** they are considerably large.
+- **Access methods:** we can filter the vast majority of queries accessing the dataset
+ by a time-related dimension or a categorical dimension with time-decay effects.
+- **Immutability:** the time-decay status does not change.
+- **Retention:** whether we want to keep the old data or not, or whether old
+ data should remain accessible by users through the application.
+
+### Size of the dataset
+
+There can be datasets of variable sizes that show strong time-decay effects, but in the context of
+this blueprint, we intend to focus on entities with a **considerably large dataset**.
+
+Smaller datasets do not contribute significantly to database-related resource usage, nor do they
+impose a considerable performance penalty on queries.
+
+In contrast, large datasets, over about 50 million records or 100 GB in size, add significant
+overhead when we constantly need to access only a really small subset of the data. In those cases,
+we want to use the time-decay effect to our advantage and reduce the actively accessed dataset.
+
+### Data access methods
+
+The second and most important characteristic of time-decay data is that most of the time, we are
+able to implicitly or explicitly access the data using a date filter,
+**restricting our results based on a time-related dimension**.
+
+There can be many such dimensions, but we are only going to focus on the creation date as it is both
+the most commonly used, and the one that we can control and optimize against. It:
+
+- Is immutable.
+- Is set when the record is created.
+- Can be tied to physically clustering the records, without having to move them around.
+
+It's important to add that even if time-decay data are not accessed that way by the application by
+default, you can make the vast majority of the queries explicitly filter the data in such
+a way. **Time decay data without such a time-decay related access method are of no use from an optimization perspective, as there is no way to set and follow a scaling pattern.**
+
+We are not restricting the definition to data that are always accessed using a time-decay related
+access method, as there may be some outlier operations. These may be necessary and we can accept
+them not scaling, if the rest of the access methods can scale. An example:
+an administrator accessing all past events of a specific type, while all other operations only access
+a maximum of a month of events, restricted to 6 months in the past.
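+
+To make the distinction concrete, a hedged sketch of the two kinds of access (model and attribute
+names are illustrative):
+
+```ruby
+# Scalable access method: always restricted to a recent time window.
+recent_events = Event.where('created_at > ?', 1.month.ago)
+
+# Outlier operation (acceptable only if rare): an unbounded scan over all
+# past events of a specific type.
+all_events_of_type = Event.where(action: 'pushed')
+```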
+
+### Immutability
+
+The third characteristic of time-decay data is that their **time-decay status does not change**.
+Once they are considered "old", they cannot switch back to "new" or relevant again.
+
+This definition may sound trivial, but we have to be able to make operations over "old" data **more**
+expensive (for example, by archiving or moving them to less expensive storage) without having to worry about
+the repercussions of switching back to being relevant and having important application operations
+underperforming.
+
+Consider, as a counter-example to a time-decay data access pattern, an application view that presents
+issues by when they were updated. We are also interested in the most recent data from an "update"
+perspective, but that definition is volatile and not actionable.
+
+### Retention
+
+Finally, a characteristic that further differentiates time-decay data in sub-categories with
+slightly different approaches available is **whether we want to keep the old data or not**
+(for example, retention policy) and/or
+**whether old data will be accessible by users through the application**.
+
+#### (optional) Extended definition of time-decay data
+
+As a side note, if we extend the aforementioned definitions to access patterns that restrict access
+to a well defined subset of the data based on a clustering attribute, we could use the time-decay
+scaling patterns for many other types of data.
+
+As an example, consider data that are only accessed while they are labeled as active, like To-Dos
+not marked as done, pipelines for unmerged merge requests (or a similar non-time-based constraint), etc.
+In this case, instead of using a time dimension to define the decay, we use a categorical dimension
+(that is, one that uses a finite set of values) to define the subset of interest. As long as that
+subset is small compared to the overall size of the dataset, we could use the same approach.
+
+Similarly, we may define data as old based both on a time dimension and additional status attributes,
+such as CI pipelines that failed more than 6 months ago.
+
+## Time-decay data strategies
+
+### Partition tables
+
+This is the accepted best practice for addressing time-decay data from a pure database perspective.
+You can find more information on table partitioning for PostgreSQL in the
+[documentation page for table partitioning](https://www.postgresql.org/docs/12/ddl-partitioning.html).
+
+Partitioning by date intervals (for example, month, year) allows us to create much smaller tables
+(partitions) for each date interval and only access the most recent partition(s) for any
+application related operation.
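+
+For illustration, a migration-style sketch of a monthly range-partitioned table (the table and
+column names are hypothetical, not an existing GitLab schema):
+
+```ruby
+class CreatePartitionedActivityLogs < ActiveRecord::Migration[6.1]
+  def up
+    execute <<~SQL
+      -- The partition key (created_at) must be part of the primary key.
+      CREATE TABLE activity_logs (
+        id bigserial NOT NULL,
+        created_at timestamptz NOT NULL,
+        payload jsonb,
+        PRIMARY KEY (id, created_at)
+      ) PARTITION BY RANGE (created_at);
+
+      -- One small table (partition) per month; queries filtered by created_at
+      -- only touch the relevant partition(s).
+      CREATE TABLE activity_logs_202108 PARTITION OF activity_logs
+        FOR VALUES FROM ('2021-08-01') TO ('2021-09-01');
+      CREATE TABLE activity_logs_202109 PARTITION OF activity_logs
+        FOR VALUES FROM ('2021-09-01') TO ('2021-10-01');
+    SQL
+  end
+
+  def down
+    execute 'DROP TABLE activity_logs'
+  end
+end
+```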
+
+We have to set the partitioning key based on the date interval of interest, which may depend on two
+factors:
+
+1. **How far back in time do we need to access data for?**
+ Partitioning by week is of no use if we always access data for a year back, as we would have to
+ execute queries over 52 different partitions (tables) each time. As an example, consider the
+ activity feed on the profile of any GitLab user.
+
+ In contrast, if we want to just access the last 7 days of created records, partitioning by year
+ would include too many unnecessary records in each partition, as is the case for `web_hook_logs`.
+1. **How large are the partitions created?**
+ The major purpose of partitioning is accessing tables that are as small as possible. If they get too
+ large by themselves, queries will start underperforming. We may have to re-partition (split) them
+ into even smaller partitions.
+
+The perfect partitioning scheme keeps **all queries over a dataset almost always over a single partition**;
+some queries going over two partitions and few queries going over multiple partitions is
+an acceptable balance. We should also aim for **partitions that are as small as possible**, below
+5-10 million records and/or 10 GB each at most.
+
+Partitioning can be combined with other strategies to either prune (drop) old partitions, move them
+to cheaper storage inside the database or move them outside of the database (archive or use of other
+types of storage engines).
+
+As long as we do not want to keep old records and partitioning is used, pruning old data has a
+constant - for all intents and purposes zero - cost compared to deleting the data from a huge table
+(as described in the following sub-section). We just need a background worker to drop old partitions
+whenever all the data inside a partition fall outside the retention policy's period.
+
+As an example, if we only want to keep records no more than 6 months old and we partition by month,
+we can safely keep the 7 latest partitions at all times (current month and 6 months in the past).
+That means that we can have a worker dropping the 8th oldest partition at the start of each month.
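+
+A hedged sketch of such a worker, using the hypothetical monthly-partitioned `activity_logs` table
+from the sketch above (class name and scheduling are assumptions):
+
+```ruby
+class DropExpiredPartitionsWorker
+  include Sidekiq::Worker
+
+  # The current month plus 6 months back are kept; the partition from 7 months
+  # back only contains data that is outside the retention period.
+  KEEP_MONTHS = 7
+
+  def perform
+    month_to_drop = KEEP_MONTHS.months.ago.beginning_of_month
+    partition = "activity_logs_#{month_to_drop.strftime('%Y%m')}"
+
+    # Dropping a whole partition is a near-constant-time metadata operation,
+    # unlike deleting individual rows from a huge table.
+    ActiveRecord::Base.connection.execute("DROP TABLE IF EXISTS #{partition}")
+  end
+end
+```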
+
+Moving partitions to cheaper storage inside the same database is relatively simple in PostgreSQL
+through the use of [tablespaces](https://www.postgresql.org/docs/12/manage-ag-tablespaces.html).
+It is possible to specify a tablespace and storage parameters for each partition separately, so the
+approach in this case would be to:
+
+1. Create a new tablespace on a cheaper, slow disk.
+1. Set the storage parameters higher on that new tablespace so that the PostgreSQL optimizer knows that the disks are slower.
+1. Move the old partitions to the slow tablespace automatically by using background workers.
+
+Finally, moving partitions outside of the database can be achieved through database archiving or
+manually exporting the partitions to a different storage engine (more details in the dedicated
+sub-section).
+
+### Prune old data
+
+If we don't want to keep old data around in any form, we can implement a pruning strategy and
+delete old data.
+
+It's a simple-to-implement strategy that uses a pruning worker to delete past data. As an example
+that we further analyze below, we prune `web_hook_logs` records older than 90 days.
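+
+For illustration, a minimal sketch of such a pruning pass (`WebHookLog` is the real model; the
+batch size and throttling are assumptions):
+
+```ruby
+# Delete expired rows in primary-key batches to bound transaction size.
+# Each batch still produces dead tuples that vacuum has to clean up later,
+# which is part of what makes this approach expensive at high write rates.
+WebHookLog.where('created_at < ?', 90.days.ago).in_batches(of: 10_000) do |batch|
+  batch.delete_all
+  sleep(0.05) # give vacuum and replication some breathing room
+end
+```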
+
+The disadvantage of such a solution on large, non-partitioned tables is that we have to manually
+access and delete all the records that are no longer considered relevant. That is a very
+expensive operation, due to multi-version concurrency control in PostgreSQL. It also leads to the
+pruning worker not being able to catch up with new records being created, if that rate exceeds a
+threshold, as is the case for [web_hook_logs](https://gitlab.com/gitlab-org/gitlab/-/issues/256088)
+at the time of writing this document.
+
+For the aforementioned reasons, our proposal is that
+**we should base any implementation of a data retention strategy on partitioning**,
+unless there are strong reasons not to.
+
+### Move old data outside of the database
+
+In most cases, we consider old data valuable, so we do not want to prune them. If, at the same
+time, they are not required for any database-related operations (for example, directly accessed or used in
+joins and other types of queries), we can move them outside of the database.
+
+That does not mean that they are not directly accessible by users through the application; we could
+move data outside the database and use other storage engines or access types for them, similarly to
+offloading metadata but only for the case of old data.
+
+In the simplest use case, we can provide fast and direct access to recent data, while allowing users
+to download an archive with older data. This is an option evaluated in the `audit_events` use case.
+Depending on the country and industry, audit events may have a very long retention period, while
+only the past month(s) of data are actively accessed through the GitLab interface.
+
+Additional use cases may include exporting data to a data warehouse or other types of data stores, as
+they may be better suited for processing that type of data. An example can be JSON logs that we
+sometimes store in tables: loading such data into BigQuery or a columnar store like Redshift may
+be better for analyzing/querying the data.
+
+We might consider a number of strategies for moving data outside of the database:
+
+1. Streaming this type of data into logs and then moving them to secondary storage options
+ or loading them into other types of data stores directly (as CSV/JSON data).
+1. Creating an ETL process that exports the data to CSV, uploads them to object storage,
+ drops this data from the database, and then loads the CSV into a different data store.
+1. Loading the data in the background by using the API provided by the data store.
+
+The last option may not be viable for large datasets; as long as bulk uploading using files is an
+option, it should outperform API calls.
+
+## Use cases
+
+### Web hook logs
+
+Related epic: [Partitioning: `web_hook_logs` table](https://gitlab.com/groups/gitlab-org/-/epics/5558)
+
+The important characteristics of `web_hook_logs` are the following:
+
+1. Size of the dataset: it is a really large table. At the time we decided to
+ partition it (`2021-03-01`), it had roughly 527M records and a total size of roughly 1 TB:
+
+ - Table: `web_hook_logs`
+ - Rows: approximately 527M
+ - Total size: 1.02 TiB (10.46%)
+ - Table size: 713.02 GiB (13.37%)
+ - Index(es) size: 42.26 GiB (1.10%)
+ - TOAST size: 279.01 GiB (38.56%)
+
+1. Access methods: we always request at most the past 7 days of logs.
+1. Immutability: it can be partitioned by `created_at`, an attribute that does not change.
+1. Retention: there is a 90-day retention policy set for it.
+
+Additionally, we were at the time trying to prune the data by using a background worker
+(`PruneWebHookLogsWorker`), which could not [keep up with the rate of inserts](https://gitlab.com/gitlab-org/gitlab/-/issues/256088).
+
+As a result, in March 2021 there were still undeleted records dating back to July 2020, and the
+table was increasing in size by more than 2 million records per day instead of staying at a more
+or less stable size.
+
+Finally, the rate of inserts had grown to more than 170 GB of data per month by March 2021 and kept
+growing, so the only viable solution for pruning old data was partitioning.
+
+Our approach was to partition the table per month as it aligned with the 90 days retention policy.
+
+The process required was the following:
+
+1. Decide on a partitioning key
+
+ Using the `created_at` column is straightforward in this case: it is a natural
+ partitioning key when a retention policy exists and there were no conflicting access patterns.
+
+1. After we decide on the partitioning key, we can create the partitions and backfill
+ them (copy data from the existing table). We can't just partition an existing table;
+ we have to create a new partitioned table.
+
+ So, we have to create the partitioned table and all the related partitions, start copying everything
+ over, and also add sync triggers so that any new data or updates/deletes to existing data can be
+ mirrored to the new partitioned table.
+
+ [MR with all the necessary details on how to start partitioning a table](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/55938)
+
+ It required 15 days and 7.6 hours to complete that process.
+
+1. One milestone after the initial partitioning starts, clean up after the background migration
+ used to backfill: finish executing any remaining jobs, retry failed jobs, and so on.
+
+ [MR with all the necessary details](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/57580)
+
+1. Add any remaining foreign keys and secondary indexes to the partitioned table. This brings
+ its schema on par with the original non-partitioned table before we can swap them in the next milestone.
+
+ We are not adding them at the beginning, as they add overhead to each insert and they
+ would slow down the initial backfilling of the table (in this case for more than half a billion
+ records, which can add up significantly). So we create a lightweight, *vanilla* version of the
+ table, copy all the data and then add any remaining indexes and foreign keys.
+
+1. Swap the base table with the partitioned copy: this is when the partitioned table
+ starts actively being used by the application.
+
+ Dropping the original table is a destructive operation, and we want to make sure that we had no
+ issues during the process, so we keep the old non-partitioned table. We also switch the sync trigger
+ the other way around so that the non-partitioned table is still up to date with any operations
+ happening on the partitioned table. That allows us to swap back the tables if it is necessary.
+
+ [MR with all the necessary details](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/60184)
+
+1. Last step, one milestone after the swap: drop the non-partitioned table
+
+ [Issue with all the necessary details](https://gitlab.com/gitlab-org/gitlab/-/issues/323678)
+
+1. After the non-partitioned table is dropped, we can add a worker to implement the
+ pruning strategy by dropping past partitions.
+
+ In this case, the worker will make sure that only 4 partitions are always active (as the
+ retention policy is 90 days) and drop any partitions older than four months. We have to keep 4
+ months of partitions while the current month is still active, as going 90 days back takes you to
+ the fourth oldest partition.
+
+### Audit Events
+
+Related epic: [Partitioning: Design and implement partitioning strategy for Audit Events](https://gitlab.com/groups/gitlab-org/-/epics/3206)
+
+The `audit_events` table shares a lot of characteristics with the `web_hook_logs` table discussed
+in the previous sub-section, so we are going to focus on the points where they differ.
+
+The consensus was that
+[partitioning could solve most of the performance issues](https://gitlab.com/groups/gitlab-org/-/epics/3206#note_338157248).
+
+In contrast to most other large tables, it has no major conflicting access patterns: we could switch
+the access patterns to align with partitioning by month. This is not the case, for example, for other
+tables which, even though they could justify a partitioning approach (for example, by namespace), have
+many conflicting access patterns.
+
+In addition, `audit_events` is a write-heavy table with very few reads (queries) over it and has a
+very simple schema, not connected with the rest of the database (no incoming or outgoing FK
+constraints) and with only two indexes defined over it.
+
+The latter was important at the time, as not having foreign key constraints meant that we could
+partition it while we were still on PostgreSQL 11. *This is not a concern any more now that we have
+moved to PostgreSQL 12 as a required default, as can be seen for the `web_hook_logs` use case above.*
+
+The migrations and steps required for partitioning the `audit_events` are similar to
+the ones described in the previous sub-section for `web_hook_logs`. There is no retention
+strategy defined for `audit_events` at the moment, so there is no pruning strategy
+implemented over it, but we may implement an archiving solution in the future.
+
+What's interesting in the case of `audit_events` is the discussion of the steps we had to follow
+to implement the UI/UX changes needed to
+[encourage optimal querying of the partitioned table](https://gitlab.com/gitlab-org/gitlab/-/issues/223260).
+It can be used as a starting point for the changes required at the application level
+to align all access patterns with a specific time-decay related access method.
+
+### CI tables
+
+NOTE:
+Requirements and analysis of the CI tables use case are still a work in progress. We intend
+to add more details as the analysis moves forward.
diff --git a/doc/architecture/blueprints/database_scaling/size-limits.md b/doc/architecture/blueprints/database_scaling/size-limits.md
new file mode 100644
index 00000000000..d63aa3bd4e8
--- /dev/null
+++ b/doc/architecture/blueprints/database_scaling/size-limits.md
@@ -0,0 +1,176 @@
+---
+comments: false
+description: 'Database Scalability / Limit table sizes'
+group: database
+---
+
+# Database Scalability: Limit on-disk table size to < 100 GB for GitLab.com
+
+This document is a proposal to work towards reducing and limiting table sizes on GitLab.com. We establish a **measurable target** by limiting table size to a certain threshold. This will be used as an indicator to drive database focus and decision making. With GitLab.com growing, we continuously re-evaluate which tables need to be worked on to prevent or otherwise fix violations.
+
+This is not meant to be a hard rule but rather a strong indication that work needs to be done to break a table apart or otherwise reduce its size.
+
+This is meant to be read in context with the [Database Sharding blueprint](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/64115),
+which paints the bigger picture. This proposal is intended to be part of the "debloating step" below, as we aim to reduce storage requirements and improve data modeling. Partitioning is part of the standard tool belt: where possible, we can already use partitioning as a solution to cut physical table sizes significantly. Both help to prepare efforts like decomposition (database usage is already optimized) and sharding (the database is already partitioned along an identified data access dimension).
+
+```mermaid
+graph LR
+ Fe(Pick feature) --> D
+ D[(Database)] --> De
+ De[Debloating] --> Dc
+ Dc[Decomposition] --> P
+ P[Partitioning] --> S
+ S[Sharding] --> R
+ P --> M
+ M[Microservices] --> R
+ R{Repeat?} --> Fe
+ style De fill:#fca326
+ style P fill:#fc6d26
+```
+
+## Motivation: GitLab.com stability and performance
+
+Large tables on GitLab.com are a major problem - for both operations and development. They cause a variety of problems:
+
+1. **Query timings** and hence overall application performance suffer.
+1. **Table maintenance** becomes much more costly. Vacuum activity has become a significant concern on GitLab.com - with large tables only seeing infrequent (once per day) processing and vacuum runs taking many hours to complete. This has various negative consequences, and a very large table can impact seemingly unrelated parts of the database, hurting overall application performance.
+1. **Data migrations** on large tables are significantly more complex to implement and incur development overhead. They have the potential to cause stability problems on GitLab.com and take a long time to execute on large datasets.
+1. **Index size** is significant. This directly impacts performance, as smaller parts of the index are kept in memory, and also makes the indexes harder to maintain (think repacking).
+1. **Index creation times** go up significantly - in 2021, we have seen creation of a single btree index take up to 6 hours. This impacts our ability to deploy frequently and leads to vacuum-related problems (delayed cleanup).
+1. We tend to add **many indexes** to mitigate, but this eventually causes significant overhead and can confuse the query planner; a large number of indexes is a smell of a design problem.
+
+## Examples
+
+Most prominently, the `ci_builds` table is 1.5 TB in size as of June 2021 and has 31 indexes associated with it, which sum up to 1 TB in size. The overall on-disk size for this table is 2.5 TB. Currently, this grows at 300 GB per month. By the end of the year, this is expected to be close to 5 TB if we don't take countermeasures.
+
+The following examples show that very large tables often constitute the root cause of incidents on GitLab.com.
+
+1. Infrequent and long running vacuum activity has led to [repeated degradation of query performance for CI queuing](https://gitlab.com/gitlab-com/gl-infra/production/-/issues?label_name%5B%5D=Service%3A%3ACI+Runners&label_name%5B%5D=incident&scope=all&search=shared_runner_queues&state=all)
+1. On large tables like `ci_builds`, index creation time varies between 1.5 to 6 hours during busy times. This process blocks deployments as migrations are being run synchronously - reducing our ability to deploy frequently.
+1. Creating a large index can lead to a burst of activity on the database primary:
+ 1. On the `merge_request_diff_commits` table: caused [high network saturation](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4823).
+ 1. Regular reindexing activity on the weekend: causes a [growing WAL queue](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4767) (impacts recovery objectives).
+ 1. On the `notes` table: re-creating a GIN trigram index for maintenance reasons has become nearly infeasible and had to be [aborted after 12 hours on the first try](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4633), as it was blocking other vacuum operations.
+
+## Problematic tables on GitLab.com
+
+This shows the top 30 tables by their total size (including index sizes) as of mid-June 2021 on GitLab.com. `table_size` and `index_size` are the on-disk sizes of the actual data and associated indexes, respectively. `percentage_of_total_database_size` displays the ratio of total table size to database size.
+
+As we can see, there are currently very large tables greater than 1 TB in size, which also tend to have very large indexes.
+
+The other observation is that there are also tables with a large number of indexes, whose total index size can be significantly larger than the data stored. For example, `deployments` is 30 GB in size plus an additional 123 GB of index data spread across 24 indexes.
+
+<!--
+select tablename,
+ pg_size_pretty(pg_total_relation_size(t.schemaname || '.' || t.tablename)) as total_size,
+ pg_size_pretty(pg_relation_size(t.schemaname || '.' || t.tablename)) as table_size,
+ pg_size_pretty(pg_indexes_size(t.schemaname || '.' || t.tablename)) as index_size,
+ count(*) as index_count,
+ round(pg_total_relation_size(t.schemaname || '.' || t.tablename) / pg_database_size('gitlabhq_production')::numeric * 100, 1) as percentage_of_total_database_size
+from pg_indexes i
+join pg_tables t USING (tablename)
+group by 1,
+ 2,
+ 3,
+ t.schemaname,
+ t.tablename
+order by pg_total_relation_size(t.schemaname || '.' || t.tablename) desc
+limit 30;
+-->
+
+| Table | Total size | Table size | Index size | Index count | Percentage of total database size |
+|------------------------------|------------|------------|------------|-------------|-----------------------------------|
+| `ci_builds` | 2975 GB | 1551 GB | 941 GB | 30 | 22.7 |
+| `merge_request_diff_commits` | 1890 GB | 1454 GB | 414 GB | 2 | 14.4 |
+| `ci_build_trace_sections` | 1123 GB | 542 GB | 581 GB | 3 | 8.6 |
+| `notes` | 748 GB | 390 GB | 332 GB | 13 | 5.7 |
+| `merge_request_diff_files` | 575 GB | 481 GB | 88 GB | 1 | 4.4 |
+| `events` | 441 GB | 95 GB | 346 GB | 12 | 3.4 |
+| `ci_job_artifacts` | 397 GB | 187 GB | 210 GB | 10 | 3.0 |
+| `ci_pipelines` | 266 GB | 66 GB | 200 GB | 23 | 2.0 |
+| `taggings` | 238 GB | 60 GB | 179 GB | 5 | 1.8 |
+| `ci_builds_metadata` | 237 GB | 88 GB | 149 GB | 5 | 1.8 |
+| `issues` | 219 GB | 47 GB | 150 GB | 28 | 1.7 |
+| `web_hook_logs_202103` | 186 GB | 122 GB | 8416 MB | 3 | 1.4 |
+| `ci_stages` | 182 GB | 58 GB | 124 GB | 6 | 1.4 |
+| `web_hook_logs_202105` | 180 GB | 115 GB | 7868 MB | 3 | 1.4 |
+| `merge_requests` | 176 GB | 44 GB | 125 GB | 36 | 1.3 |
+| `web_hook_logs_202104` | 176 GB | 115 GB | 7472 MB | 3 | 1.3 |
+| `web_hook_logs_202101` | 169 GB | 112 GB | 7231 MB | 3 | 1.3 |
+| `web_hook_logs_202102` | 167 GB | 111 GB | 7106 MB | 3 | 1.3 |
+| `sent_notifications` | 166 GB | 88 GB | 79 GB | 3 | 1.3 |
+| `web_hook_logs_202011` | 163 GB | 113 GB | 7125 MB | 3 | 1.2 |
+| `push_event_payloads` | 162 GB | 114 GB | 48 GB | 1 | 1.2 |
+| `web_hook_logs_202012` | 159 GB | 106 GB | 6771 MB | 3 | 1.2 |
+| `web_hook_logs_202106` | 156 GB | 101 GB | 6752 MB | 3 | 1.2 |
+| `deployments` | 155 GB | 30 GB | 125 GB | 24 | 1.2 |
+| `web_hook_logs_202010` | 136 GB | 98 GB | 6116 MB | 3 | 1.0 |
+| `web_hook_logs_202009` | 114 GB | 82 GB | 5168 MB | 3 | 0.9 |
+| `security_findings` | 109 GB | 21 GB | 88 GB | 8 | 0.8 |
+| `web_hook_logs_202008` | 92 GB | 66 GB | 3983 MB | 3 | 0.7 |
+| `resource_label_events` | 66 GB | 47 GB | 19 GB | 6 | 0.5 |
+| `merge_request_diffs` | 63 GB | 39 GB | 22 GB | 5 | 0.5 |
+
+## Target: All physical tables on GitLab.com are < 100 GB including indexes
+
+NOTE:
+In PostgreSQL context, a **physical table** is either a regular table or a partition of a partitioned table.
+
+In order to maintain and improve operational stability and lessen development burden, we target a **table size less than 100 GB for a physical table on GitLab.com** (including its indexes). This has numerous benefits:
+
+1. Improved query performance and more stable query plans
+1. Significantly reduced vacuum run times and more frequent vacuum runs, to maintain a healthy state - reducing overhead on the database primary
+1. Significantly faster index creation times (significantly less data to read per index)
+1. Smaller indexes, which can be maintained more efficiently and fit better into memory
+1. Data migrations that are easier to reason about and take less time to implement and execute
+
+This target is *pragmatic*: We understand table sizes depend on feature usage, code changes and other factors - which all change over time. We may not always find solutions where we can tightly limit the size of physical tables once and for all. That is acceptable though and we primarily aim to keep the situation on GitLab.com under control. We adapt our efforts to the situation present on GitLab.com and will re-evaluate frequently.
+
+While there are changes we can make that lead to a constant maximum physical table size over time, this doesn't necessarily need to be the case. Consider for example hash partitioning, which breaks a table down into a static number of partitions. With data growth over time, individual partitions also grow in size and may eventually reach the threshold size again. We strive for constant table sizes, but it is acceptable to ship easier solutions that don't have this characteristic but improve the situation for a considerable amount of time.
+
+As such, the target size of a physical table after refactoring depends on the situation, and there is no hard rule for it. We suggest considering historic data growth and forecasting when physical tables will reach the threshold of 100 GB again. This allows us to understand how long a particular solution is expected to last until the model has to be revisited.
+
+## Solutions
+
+There is no standard solution to reduce table sizes - there are many!
+
+1. **Retention**: Delete unnecessary data, for example expire old and unneeded records.
+1. **Remove STI**: We still use [single-table inheritance](../../../development/single_table_inheritance.md) in a few places, which is considered an anti-pattern. Redesigning this, we can split data into multiple tables.
+1. **Index optimization**: Drop unnecessary indexes and consolidate overlapping indexes if possible (see the sketch after this list).
+1. **Optimize data types**: Review data type decisions and optimize data types where possible (for example, use an integer instead of text for an enum column).
+1. **Partitioning**: Apply a partitioning scheme if there is a common access dimension.
+1. **Normalization**: Review relational modeling and apply normalization techniques to remove duplicate data.
+1. **Vertical table splits**: Review column usage and split table vertically.
+1. **Externalize**: Move large data types out of the database entirely. For example, JSON documents, especially when not used for filtering, may be better stored outside the database, for example, in object storage.
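+
+As a starting point for the index optimization item above, a hedged sketch for spotting candidate
+indexes to drop. Usage statistics are tracked per node, so check the primary and all replicas
+before acting, and keep indexes that back unique or exclusion constraints even if they are never
+scanned:
+
+```ruby
+# Largest indexes that have never been used for an index scan on this node.
+rows = ActiveRecord::Base.connection.select_all(<<~SQL)
+  SELECT relname, indexrelname,
+         pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
+         idx_scan
+  FROM pg_stat_user_indexes
+  WHERE idx_scan = 0
+  ORDER BY pg_relation_size(indexrelid) DESC
+  LIMIT 20
+SQL
+
+rows.each { |row| puts row.values_at('relname', 'indexrelname', 'index_size').join(' | ') }
+```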
+
+NOTE:
+While we target limiting physical table sizes, we also consider retaining or improving performance a goal.
+
+For solutions like normalization, this is a trade-off: Denormalized models can speed up queries when used appropriately, at the expense of table size. When normalizing models, splitting tables or externalizing data, we aim to understand the impact on performance and strive to find a solution to reduce table sizes that doesn't impact performance significantly.
+
+### Example efforts
+
+A few examples can be found below; many more are organized under the [Database efficiency](https://gitlab.com/groups/gitlab-org/-/epics/5585) epic.
+
+1. [Reduce number of indexes on `ci_builds`](https://gitlab.com/groups/gitlab-org/-/epics/6203)
+1. [Normalize and de-duplicate committer and author details in merge_request_diff_commits](https://gitlab.com/gitlab-org/gitlab/-/issues/331823)
+1. [Retention strategy for `ci_build_trace_sections`](https://gitlab.com/gitlab-org/gitlab/-/issues/32565#note_603138100)
+1. [Implement worker that hard-deletes old CI jobs metadata](https://gitlab.com/gitlab-org/gitlab/-/issues/215646)
+1. [merge_request_diff_files violates < 100 GB target](https://gitlab.com/groups/gitlab-org/-/epics/6215) (epic)
+
+## Goal
+
+The [epic for `~group::database`](https://gitlab.com/groups/gitlab-org/-/epics/6211) drives decision making to establish and communicate the target and to identify and propose necessary changes to reach it. Those changes should primarily be driven by the respective stage group owning the data (and the feature using it), with support from `~group::database`.
+
+## Who
+
+<!-- vale gitlab.Spelling = NO -->
+
+Identifying solutions for offending tables is driven by the [GitLab Database Team](https://about.gitlab.com/handbook/engineering/development/enablement/database/) and respective stage groups.
+
+| Role | Who
+|------------------------------|-------------------------|
+| Author | Andreas Brandl |
+| Engineering Leader | Craig Gomes |
+
+<!-- vale gitlab.Spelling = YES -->
diff --git a/doc/architecture/blueprints/database_testing/index.md b/doc/architecture/blueprints/database_testing/index.md
index fb52f6cc7d3..38629e7348d 100644
--- a/doc/architecture/blueprints/database_testing/index.md
+++ b/doc/architecture/blueprints/database_testing/index.md
@@ -64,7 +64,7 @@ We already use Database Lab from [postgres.ai](https://postgres.ai/), which is a
Internally, this is based on ZFS and implements a "thin-cloning technology". That is, ZFS snapshots are being used to clone the data and it exposes a full read/write PostgreSQL cluster based on the cloned data. This is called a *thin clone*. It is rather short lived and is going to be destroyed again shortly after we are finished using it.
-It is important to note that a thin clone is fully read/write. This allows us to execute migrations on top of it.
+A thin clone is fully read/write. This allows us to execute migrations on top of it.
Database Lab provides an API we can interact with to manage thin clones. In order to automate the migration and query testing, we add steps to the `gitlab/gitlab-org` CI pipeline. This triggers automation that performs the following steps for a given merge request: