Diffstat (limited to 'doc/development/database')
-rw-r--r--doc/development/database/add_foreign_key_to_existing_column.md8
-rw-r--r--doc/development/database/client_side_connection_pool.md63
-rw-r--r--doc/development/database/database_reviewer_guidelines.md6
-rw-r--r--doc/development/database/index.md9
-rw-r--r--doc/development/database/not_null_constraints.md8
-rw-r--r--doc/development/database/setting_multiple_values.md103
-rw-r--r--doc/development/database/strings_and_the_text_data_type.md6
-rw-r--r--doc/development/database/table_partitioning.md259
8 files changed, 460 insertions, 2 deletions
diff --git a/doc/development/database/add_foreign_key_to_existing_column.md b/doc/development/database/add_foreign_key_to_existing_column.md
index 1b41a52c95e..85411ff9aa7 100644
--- a/doc/development/database/add_foreign_key_to_existing_column.md
+++ b/doc/development/database/add_foreign_key_to_existing_column.md
@@ -1,3 +1,9 @@
+---
+stage: Enablement
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#designated-technical-writers
+---
+
# Adding foreign key constraint to an existing column
Foreign keys help ensure consistency between related database tables. The current database review process **always** encourages you to add [foreign keys](../foreign_keys.md) when creating tables that reference records from other tables.
@@ -103,7 +109,7 @@ class RemoveRecordsWithoutUserFromEmailsTable < ActiveRecord::Migration[5.2]
end
def down
- # Can be a no-op when data inconsistency is not affecting the pre and post deploymnet version of the application.
+ # Can be a no-op when data inconsistency is not affecting the pre and post deployment version of the application.
# In this case we might have records in the `emails` table where the associated record in the `users` table is not there anymore.
end
end
diff --git a/doc/development/database/client_side_connection_pool.md b/doc/development/database/client_side_connection_pool.md
new file mode 100644
index 00000000000..1a30d2d73a3
--- /dev/null
+++ b/doc/development/database/client_side_connection_pool.md
@@ -0,0 +1,63 @@
+---
+type: dev, reference
+stage: none
+group: Development
+info: "See the Technical Writers assigned to Development Guidelines: https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments-to-development-guidelines"
+---
+
+# Client-side connection-pool
+
+Ruby processes accessing the database through
+ActiveRecord automatically calculate the connection-pool size for the
+process based on the concurrency.
+
+Because of the way [Ruby on Rails manages database
+connections](#connection-lifecycle), it is important that we have at
+least as many connections as we have threads. While there is a 'pool'
+setting in [`database.yml`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/config/database.yml.postgresql), it is not very practical because you need to
+maintain it in tandem with the number of application threads. For this
+reason, we override the number of allowed connections in the database
+connection-pool based on the configured number of application threads.
+
+`Gitlab::Runtime.max_threads` is the number of user-facing
+application threads the process has been configured with. We also have
+auxiliary threads that use database connections. As it isn't
+straightforward to keep an accurate count of the number of auxiliary threads as
+the application evolves over time, we add a fixed headroom to the
+number of user-facing threads. It is acceptable if this number is too
+large, because connections are instantiated lazily.
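+
+As a rough illustration (a sketch, not the exact GitLab code), the
+effective pool size could be derived like this, where the default
+headroom of 10 and its `DB_POOL_HEADROOM` override are described in the
+troubleshooting section below:
+
+```ruby
+# Sketch: pool size = configured application threads + fixed headroom.
+headroom  = ENV.fetch('DB_POOL_HEADROOM', 10).to_i
+pool_size = Gitlab::Runtime.max_threads + headroom
+```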
+
+## Troubleshooting connection-pool issues
+
+The connection-pool usage can be seen per environment in the [connection-pool
+saturation
+dashboard](https://dashboards.gitlab.net/d/alerts-sat_rails_db_connection_pool/alerts-rails_db_connection_pool-saturation-detail?orgId=1).
+
+If the connection-pool is too small, this manifests as
+`ActiveRecord::ConnectionTimeoutError`s from the application. Because we alert
+when almost all connections are used, we should know about this before
+timeouts occur. If this happens, we can remediate by setting the
+`DB_POOL_HEADROOM` environment variable to something bigger than the
+hardcoded value (10).
+
+At this point, we need to investigate what is using more connections
+than we anticipated. To do that, we can use the
+`gitlab_ruby_threads_running_threads` metric. For example, [this
+graph](https://thanos-query.ops.gitlab.net/graph?g0.range_input=1h&g0.max_source_resolution=0s&g0.expr=sum%20by%20(thread_name)%20(%20gitlab_ruby_threads_running_threads%7Buses_db_connection%3D%22yes%22%7D%20)&g0.tab=0)
+shows all running threads that connect to the database by their
+name. Threads labeled `puma worker` or `sidekiq_worker_thread` are
+the threads that define `Gitlab::Runtime.max_threads`, so those are
+accounted for. If there are more than 10 other threads running, we could
+consider raising the default headroom.
+
+## Connection lifecycle
+
+For web requests, a connection is obtained from the pool the first
+time a database query is made. The connection is returned to the pool
+after the request completes.
+
+For background jobs, the behavior is very similar. The thread obtains
+a connection for the first query, and returns it after the job is
+finished.
+
+This is managed by Rails internally.
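+
+Auxiliary threads that run queries outside of a request or job should
+return connections to the pool explicitly. A minimal sketch, using the
+standard Rails `with_connection` helper:
+
+```ruby
+Thread.new do
+  # Checks a connection out of the pool for the duration of the block
+  # and returns it to the pool when the block finishes.
+  ActiveRecord::Base.connection_pool.with_connection do |connection|
+    connection.execute('SELECT 1')
+  end
+end
+```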
diff --git a/doc/development/database/database_reviewer_guidelines.md b/doc/development/database/database_reviewer_guidelines.md
index 6cb061f9959..3345df8b46b 100644
--- a/doc/development/database/database_reviewer_guidelines.md
+++ b/doc/development/database/database_reviewer_guidelines.md
@@ -1,3 +1,9 @@
+---
+stage: Enablement
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#designated-technical-writers
+---
+
# Database Reviewer Guidelines
This page includes introductory material for new database reviewers.
diff --git a/doc/development/database/index.md b/doc/development/database/index.md
index 9ea5b6fcaac..4bcefefe7a7 100644
--- a/doc/development/database/index.md
+++ b/doc/development/database/index.md
@@ -1,3 +1,9 @@
+---
+stage: Enablement
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#designated-technical-writers
+---
+
# Database guides
## Database Reviews
@@ -24,6 +30,7 @@
- [Background migrations](../background_migrations.md)
- [Swapping tables](../swapping_tables.md)
- [Deleting migrations](../deleting_migrations.md)
+- [Partitioning tables](table_partitioning.md)
## Debugging
@@ -49,6 +56,8 @@
- [Database Debugging and Troubleshooting](../database_debugging.md)
- [Query Count Limits](../query_count_limits.md)
- [Creating enums](../creating_enums.md)
+- [Client-side connection-pool](client_side_connection_pool.md)
+- [Updating multiple values](setting_multiple_values.md)
## Case studies
diff --git a/doc/development/database/not_null_constraints.md b/doc/development/database/not_null_constraints.md
index e4dec2afa10..96271863d94 100644
--- a/doc/development/database/not_null_constraints.md
+++ b/doc/development/database/not_null_constraints.md
@@ -1,3 +1,9 @@
+---
+stage: Enablement
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#designated-technical-writers
+---
+
# `NOT NULL` constraints
> [Introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/38358) in GitLab 13.0.
@@ -33,7 +39,7 @@ end
## Add a `NOT NULL` column to an existing table
-With PostgreSQL 11 being the minimum version since GitLab 13.0, adding columns with `NULL` and/or
+With PostgreSQL 11 being the minimum version in GitLab 13.0 and later, adding columns with `NULL` and/or
default values has become much easier and the standard `add_column` helper should be used in all cases.
For example, consider a migration that adds a new `NOT NULL` column `active` to table `db_guides`,
diff --git a/doc/development/database/setting_multiple_values.md b/doc/development/database/setting_multiple_values.md
new file mode 100644
index 00000000000..5569a0e10b7
--- /dev/null
+++ b/doc/development/database/setting_multiple_values.md
@@ -0,0 +1,103 @@
+# Setting Multiple Values
+
+> [Introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/32921) in GitLab 13.5.
+
+Frequently, we will want to update multiple objects with new values for one
+or more columns. The obvious way to do this is using `Relation#update_all`:
+
+```ruby
+user.issues.open.update_all(due_date: 7.days.from_now) # (1)
+user.issues.update_all('relative_position = relative_position + 1') # (2)
+```
+
+But what do you do if you cannot express the update as either a static value (1)
+or as a calculation (2)?
+
+Thankfully, we can use `UPDATE FROM` to express the need to update multiple rows
+with distinct values in a single query. We can either use a temporary table or
+a Common Table Expression (CTE), and then use that as the source of the updates:
+
+```sql
+with updates(obj_id, new_title, new_weight) as (
+ values (1 :: integer, 'Very difficult issue' :: text, 8 :: integer),
+ (2, 'Very easy issue', 1)
+)
+update issues
+ set title = new_title, weight = new_weight
+ from updates
+ where id = obj_id
+```
+
+The bad news: there is no way to express this in ActiveRecord, or even by
+dropping down to Arel - the `UpdateManager` simply does not support
+`update from`.
+
+The good news: We supply an abstraction to help you generate these kinds of
+updates, called `Gitlab::Database::BulkUpdate`. This constructs queries such as the
+above, and uses binding parameters to avoid SQL injection.
+
+## Usage
+
+To use this, we need:
+
+- The list of columns to update.
+- A mapping from object/ID to the new values to set for that object.
+- A way to determine the table for each object.
+
+So for example, we can express the query above as:
+
+```ruby
+issue_a = Issue.find(..)
+issue_b = Issue.find(..)
+
+# Issues a single query:
+::Gitlab::Database::BulkUpdate.execute(%i[title weight], {
+ issue_a => { title: 'Very difficult issue', weight: 8 },
+ issue_b => { title: 'Very easy issue', weight: 1 }
+})
+```
+
+Here the table can be determined automatically, from calling
+`object.class.table_name`, so we don't need to provide anything.
+
+We can even pass heterogeneous sets of objects, if the updates all make sense
+for them:
+
+```ruby
+issue_a = Issue.find(..)
+issue_b = Issue.find(..)
+merge_request = MergeRequest.find(..)
+
+# Issues two queries
+::Gitlab::Database::BulkUpdate.execute(%i[title], {
+ issue_a => { title: 'A' },
+ issue_b => { title: 'B' },
+ merge_request => { title: 'B' }
+})
+```
+
+If your objects do not return the correct model class (perhaps because they are
+part of a union), then you need to specify the model class explicitly in a block:
+
+```ruby
+bazzes = params
+objects = Foo.from_union([
+ Foo.select("id, 'foo' as object_type").where(quux: true),
+ Bar.select("id, 'bar' as object_type").where(wibble: true)
+ ])
+# At this point, all the objects are instances of Foo, even the ones from the
+# Bar table
+mapping = objects.to_h { |obj| [obj, bazzes[obj.id]] }
+
+# Issues at most 2 queries
+::Gitlab::Database::BulkUpdate.execute(%i[baz], mapping) do |obj|
+ obj.object_type.constantize
+end
+```
+
+## Caveats
+
+Note that this is a **very low-level** tool, and it operates on the raw column
+values. Enumerations and state fields must be translated into their underlying
+representations, for example, and nested associations are not supported. No
+validations or hooks are called.
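+
+For example, here is a minimal sketch of handling an enumeration,
+assuming a hypothetical Rails enum `status: { opened: 0, closed: 1 }`
+on `Issue`:
+
+```ruby
+# BulkUpdate writes raw column values, so resolve the enum to its
+# underlying integer before building the mapping.
+closed_value = Issue.statuses['closed'] # => 1, the raw column value
+
+::Gitlab::Database::BulkUpdate.execute(%i[status], {
+  issue_a => { status: closed_value },
+  issue_b => { status: closed_value }
+})
+```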
diff --git a/doc/development/database/strings_and_the_text_data_type.md b/doc/development/database/strings_and_the_text_data_type.md
index b73dfa859fb..fe8cfa5cd22 100644
--- a/doc/development/database/strings_and_the_text_data_type.md
+++ b/doc/development/database/strings_and_the_text_data_type.md
@@ -1,3 +1,9 @@
+---
+stage: Enablement
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#designated-technical-writers
+---
+
# Strings and the Text data type
> [Introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/30453) in GitLab 13.0.
diff --git a/doc/development/database/table_partitioning.md b/doc/development/database/table_partitioning.md
new file mode 100644
index 00000000000..30d0b0a2f5b
--- /dev/null
+++ b/doc/development/database/table_partitioning.md
@@ -0,0 +1,259 @@
+---
+stage: Enablement
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#designated-technical-writers
+---
+
+# Database table partitioning
+
+Table partitioning is a powerful database feature that allows a table's
+data to be split into smaller physical tables that act as a single large
+table. If the application is designed to work with partitioning in mind,
+there can be multiple benefits, such as:
+
+- Query performance can be improved greatly, because the database can
+cheaply eliminate much of the data from the search space, while still
+providing full SQL capabilities.
+
+- Bulk deletes can be achieved with minimal impact on the database by
+dropping entire partitions. This is a natural fit for features that need
+to periodically delete data that falls outside the retention window.
+
+- Administrative tasks like `VACUUM` and index rebuilds can operate on
+individual partitions, rather than across a single massive table.
+
+Unfortunately, not all models fit a partitioning scheme, and there are
+significant drawbacks if implemented incorrectly. Additionally, tables
+can only be partitioned at their creation, making it nontrivial to apply
+partitioning to a busy database. A suite of migration tools is available
+to enable backend developers to partition existing tables, but the
+migration process is rather heavy, taking multiple steps split across
+several releases. Due to the limitations of partitioning and the related
+migrations, you should understand how partitioning fits your use case
+before attempting to leverage this feature.
+
+## Determining when to use partitioning
+
+While partitioning can be very useful when properly applied, it's
+imperative to determine whether the data and workload of a table naturally
+fit a partitioning scheme. There are a few details you have to understand
+to decide if partitioning is a good fit for your particular
+problem.
+
+First, a table is partitioned on a partition key, which is a column or
+set of columns that determines how the data is split across the
+partitions. The partition key is used by the database when reading or
+writing data, to decide which partition(s) need to be accessed. The
+partition key should be a column that would be included in a `WHERE`
+clause on almost all queries accessing that table.
+
+Second, it's necessary to understand the strategy the database will
+use to split the data across the partitions. The scheme supported by the
+GitLab migration helpers is date-range partitioning, where each partition
+in the table contains data for a single month. In this case, the partitioning
+key would need to be a timestamp or date column. For this type of
+partitioning to work well, most queries would need to access data within a
+certain date range.
+
+For a more concrete example, consider the `audit_events` table, the
+first table to be partitioned in the application database
+(scheduled for deployment with the GitLab 13.5 release). This
+table tracks audit entries of security events that happen in the
+application. In almost all cases, users want to see audit activity that
+occurs in a certain timeframe. As a result, date-range partitioning
+was a natural fit for how the data would be accessed.
+
+To look at this in more detail, imagine a simplified `audit_events` schema:
+
+```sql
+CREATE TABLE audit_events (
+ id SERIAL NOT NULL PRIMARY KEY,
+ author_id INT NOT NULL,
+ details jsonb NOT NULL,
+ created_at timestamptz NOT NULL);
+```
+
+Now imagine that typical queries in the UI display the data within a
+certain date range, like a single week:
+
+```sql
+SELECT *
+FROM audit_events
+WHERE created_at >= '2020-01-01 00:00:00'
+ AND created_at < '2020-01-08 00:00:00'
+ORDER BY created_at DESC
+LIMIT 100
+```
+
+If the table is partitioned on the `created_at` column, the base table would
+look like:
+
+```sql
+CREATE TABLE audit_events (
+ id SERIAL NOT NULL,
+ author_id INT NOT NULL,
+ details jsonb NOT NULL,
+ created_at timestamptz NOT NULL,
+ PRIMARY KEY (id, created_at))
+PARTITION BY RANGE(created_at);
+```
+
+NOTE: **Note:**
+The primary key of a partitioned table must include the partition key as
+part of the primary key definition.
+
+And we might have a list of partitions for the table, such as:
+
+```sql
+audit_events_202001 FOR VALUES FROM ('2020-01-01') TO ('2020-02-01')
+audit_events_202002 FOR VALUES FROM ('2020-02-01') TO ('2020-03-01')
+audit_events_202003 FOR VALUES FROM ('2020-03-01') TO ('2020-04-01')
+```
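+
+Each of these partitions could be created with standard PostgreSQL DDL;
+for example (a sketch, with an illustrative name and range):
+
+```sql
+CREATE TABLE audit_events_202001 PARTITION OF audit_events
+FOR VALUES FROM ('2020-01-01') TO ('2020-02-01');
+```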
+
+Each partition is a separate physical table, with the same structure as
+the base `audit_events` table, but containing only data for rows where the
+partition key falls in the specified range. For example, the partition
+`audit_events_202001` contains rows where the `created_at` column is
+greater than or equal to `2020-01-01` and less than `2020-02-01`.
+
+Now, if we look at the previous example query again, the database can
+use the `WHERE` clause to recognize that all matching rows are in the
+`audit_events_202001` partition. Rather than searching all of the data
+in all of the partitions, it can search only the single month's worth
+of data in the appropriate partition. In a large table, this can
+dramatically reduce the amount of data the database needs to access.
+However, imagine a query that does not filter based on the partitioning
+key, such as:
+
+```sql
+SELECT *
+FROM audit_events
+WHERE author_id = 123
+ORDER BY created_at DESC
+LIMIT 100
+```
+
+In this example, the database can't prune any partitions from the search,
+because matching data could exist in any of them. As a result, it has to
+query each partition individually, and aggregate the rows into a single result
+set. Because `author_id` would be indexed, the performance impact is
+likely acceptable, but on more complex queries the overhead can be
+substantial. Partitioning should only be leveraged if the access patterns
+of the data support the partitioning strategy; otherwise, performance
+suffers.
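+
+One way to check whether a query benefits from pruning is to inspect its
+plan; a sketch (the exact plan output varies by PostgreSQL version and
+settings):
+
+```sql
+-- With date-range partitioning, the plan for this query should scan
+-- only the audit_events_202001 partition.
+EXPLAIN
+SELECT *
+FROM audit_events
+WHERE created_at >= '2020-01-01 00:00:00'
+  AND created_at < '2020-01-08 00:00:00';
+```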
+
+## Partitioning a table
+
+Unfortunately, tables can only be partitioned at their creation, making
+it nontrivial to apply to a busy database. A suite of migration
+tools has been developed to enable backend developers to partition
+existing tables. This migration process takes multiple steps which must
+be split across several releases.
+
+### Caveats
+
+The partitioning migration helpers work by creating a partitioned duplicate
+of the original table and using a combination of a trigger and a background
+migration to copy data into the new table. Changes to the original table
+schema can be made in parallel with the partitioning migration, but care
+must be taken not to break the underlying mechanism that makes the migration
+work. For example, if a column is added to the table that is being
+partitioned, both the partitioned table and the trigger definition need to
+be updated to match.
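+
+For example, a migration adding a column during this window would need to
+update both tables. A sketch, using `audit_events_part` as a hypothetical
+name for the partitioned copy:
+
+```ruby
+class AddTargetTypeToAuditEvents < ActiveRecord::Migration[6.0]
+  def change
+    # Keep the original table and its partitioned duplicate in sync.
+    add_column :audit_events, :target_type, :text
+    add_column :audit_events_part, :target_type, :text
+    # NOTE: the sync trigger definition must also be updated separately
+    # so that it copies the new column.
+  end
+end
+```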
+
+### Step 1: Creating the partitioned copy (Release N)
+
+The first step is to add a migration to create the partitioned copy of
+the original table. This migration also creates the appropriate
+partitions based on the data in the original table, and installs a
+trigger that syncs writes from the original table into the
+partitioned copy.
+
+An example migration of partitioning the `audit_events` table by its
+`created_at` column would look like:
+
+```ruby
+class PartitionAuditEvents < ActiveRecord::Migration[6.0]
+ include Gitlab::Database::PartitioningMigrationHelpers
+
+ def up
+ partition_table_by_date :audit_events, :created_at
+ end
+
+ def down
+ drop_partitioned_table_for :audit_events
+ end
+end
+```
+
+Once this migration has executed, any inserts, updates, or deletes in the
+original table are also duplicated in the new table. For updates and
+deletes, the operation only has an effect if the corresponding row
+exists in the partitioned table.
+
+### Step 2: Backfill the partitioned copy (Release N)
+
+The second step is to add a post-deployment migration that schedules
+the background jobs to backfill existing data from the original table
+into the partitioned copy.
+
+Continuing the above example, the migration would look like:
+
+```ruby
+class BackfillPartitionAuditEvents < ActiveRecord::Migration[6.0]
+ include Gitlab::Database::PartitioningMigrationHelpers
+
+ def up
+ enqueue_partitioning_data_migration :audit_events
+ end
+
+ def down
+ cleanup_partitioning_data_migration :audit_events
+ end
+end
+```
+
+This step uses the same mechanism as any background migration, so you
+may want to read the [Background Migration](../background_migrations.md)
+guide for details on that process. Background jobs are scheduled every
+2 minutes and copy `50_000` records at a time, which can be used to
+estimate the timing of the background migration portion of the
+partitioning migration.
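+
+For instance, a back-of-the-envelope estimate for a hypothetical table of
+300 million rows:
+
+```ruby
+rows       = 300_000_000
+batch_size = 50_000               # records copied per job
+interval   = 2                    # minutes between scheduled jobs
+jobs       = rows / batch_size    # => 6000 jobs
+minutes    = jobs * interval      # => 12000 minutes
+days       = minutes / 60.0 / 24  # => ~8.3 days
+```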
+
+### Step 3: Post-backfill cleanup (Release N+1)
+
+The third step must occur at least one release after the release that
+includes the background migration. This gives time for the background
+migration to execute properly in self-managed installations. In this step,
+add another post-deployment migration that cleans up after the
+background migration. This includes forcing any remaining jobs to
+execute, and copying data that may have been missed due to dropped or
+failed jobs.
+
+Once again, continuing the example, this migration would look like:
+
+```ruby
+class CleanupPartitionedAuditEventsBackfill < ActiveRecord::Migration[6.0]
+ include Gitlab::Database::PartitioningMigrationHelpers
+
+ def up
+ finalize_backfilling_partitioned_table :audit_events
+ end
+
+ def down
+ # no op
+ end
+end
+```
+
+After this migration has completed, the original table and partitioned
+table should contain identical data. The trigger installed on the
+original table guarantees that the data will remain in sync going
+forward.
+
+### Step 4: Swap the partitioned and non-partitioned tables (Release N+1)
+
+The final step of the migration makes the partitioned table ready
+for use by the application. This section will be updated when the
+migration helper is ready; for now, development can be followed in the
+[Tracking Issue](https://gitlab.com/gitlab-org/gitlab/-/issues/241267).