From 48aff82709769b098321c738f3444b9bdaa694c6 Mon Sep 17 00:00:00 2001 From: GitLab Bot Date: Wed, 21 Oct 2020 07:08:36 +0000 Subject: Add latest changes from gitlab-org/gitlab@13-5-stable-ee --- .../database/add_foreign_key_to_existing_column.md | 8 +- .../database/client_side_connection_pool.md | 63 +++++ .../database/database_reviewer_guidelines.md | 6 + doc/development/database/index.md | 9 + doc/development/database/not_null_constraints.md | 8 +- .../database/setting_multiple_values.md | 103 ++++++++ .../database/strings_and_the_text_data_type.md | 6 + doc/development/database/table_partitioning.md | 259 +++++++++++++++++++++ 8 files changed, 460 insertions(+), 2 deletions(-) create mode 100644 doc/development/database/client_side_connection_pool.md create mode 100644 doc/development/database/setting_multiple_values.md create mode 100644 doc/development/database/table_partitioning.md (limited to 'doc/development/database') diff --git a/doc/development/database/add_foreign_key_to_existing_column.md b/doc/development/database/add_foreign_key_to_existing_column.md index 1b41a52c95e..85411ff9aa7 100644 --- a/doc/development/database/add_foreign_key_to_existing_column.md +++ b/doc/development/database/add_foreign_key_to_existing_column.md @@ -1,3 +1,9 @@ +--- +stage: Enablement +group: Database +info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#designated-technical-writers +--- + # Adding foreign key constraint to an existing column Foreign keys help ensure consistency between related database tables. The current database review process **always** encourages you to add [foreign keys](../foreign_keys.md) when creating tables that reference records from other tables. 
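As a sketch of how such a constraint is typically added, consider the `emails`/`users` example used later in this guide. The helper names below are the GitLab migration helpers; the exact signatures and class name here are assumptions for illustration, not a definitive implementation:

```ruby
# Sketch only: add the foreign key without validating existing rows, so
# the table is not scanned while a heavy lock is held. A later
# post-deployment migration can validate the constraint.
class AddForeignKeyToEmailsUserId < ActiveRecord::Migration[5.2]
  include Gitlab::Database::MigrationHelpers

  disable_ddl_transaction!

  def up
    # `validate: false` enforces the constraint for new writes only,
    # until the constraint is validated in a separate migration.
    add_concurrent_foreign_key :emails, :users, column: :user_id, validate: false
  end

  def down
    remove_foreign_key_if_exists :emails, column: :user_id
  end
end
```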
@@ -103,7 +109,7 @@ class RemoveRecordsWithoutUserFromEmailsTable < ActiveRecord::Migration[5.2] end def down - # Can be a no-op when data inconsistency is not affecting the pre and post deploymnet version of the application. + # Can be a no-op when data inconsistency is not affecting the pre and post deployment version of the application. # In this case we might have records in the `emails` table where the associated record in the `users` table is not there anymore. end end diff --git a/doc/development/database/client_side_connection_pool.md b/doc/development/database/client_side_connection_pool.md new file mode 100644 index 00000000000..1a30d2d73a3 --- /dev/null +++ b/doc/development/database/client_side_connection_pool.md @@ -0,0 +1,63 @@ +--- +type: dev, reference +stage: none +group: Development +info: "See the Technical Writers assigned to Development Guidelines: https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments-to-development-guidelines" +--- + +# Client-side connection-pool + +Ruby processes accessing the database through +ActiveRecord automatically calculate the connection-pool size for the +process based on the concurrency. + +Because of the way [Ruby on Rails manages database +connections](#connection-lifecycle), it is important that we have at +least as many connections as we have threads. While there is a 'pool' +setting in [`database.yml`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/config/database.yml.postgresql), it is not very practical because you need to +maintain it in tandem with the number of application threads. For this +reason, we override the number of allowed connections in the database +connection-pool based on the configured number of application threads. + +`Gitlab::Runtime.max_threads` is the number of user-facing +application threads the process has been configured with. We also have +auxiliary threads that use database connections.
As it isn't +straightforward to keep an accurate count of the number of auxiliary threads as +the application evolves over time, we just add a fixed headroom to the +number of user-facing threads. It is OK if this number is too large +because connections are instantiated lazily. + +## Troubleshooting connection-pool issues + +The connection-pool usage can be seen per environment in the [connection-pool +saturation +dashboard](https://dashboards.gitlab.net/d/alerts-sat_rails_db_connection_pool/alerts-rails_db_connection_pool-saturation-detail?orgId=1). + +If the connection-pool is too small, this would manifest in +`ActiveRecord::ConnectionTimeoutError`s from the application. Because we alert +when almost all connections are used, we should know this before +timeouts occur. If this happens, we can remediate by setting the +`DB_POOL_HEADROOM` environment variable to something bigger than the +hardcoded value (10). + +At this point, we need to investigate what is using more connections +than we anticipated. To do that, we can use the +`gitlab_ruby_threads_running_threads` metric. For example, [this +graph](https://thanos-query.ops.gitlab.net/graph?g0.range_input=1h&g0.max_source_resolution=0s&g0.expr=sum%20by%20(thread_name)%20(%20gitlab_ruby_threads_running_threads%7Buses_db_connection%3D%22yes%22%7D%20)&g0.tab=0) +shows all running threads that connect to the database by their +name. Threads labeled `puma worker` or `sidekiq_worker_thread` are +the threads that define `Gitlab::Runtime.max_threads`, so those are +accounted for. If there are more than 10 other threads running, we could +consider raising the default headroom. + +## Connection lifecycle + +For web requests, a connection is obtained from the pool the first +time a database query is made. The connection is returned to the pool +after the request completes. + +For background jobs, the behavior is very similar. The thread obtains +a connection for the first query, and returns it after the job is +finished.
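The sizing rule described above (user-facing threads plus a fixed headroom, overridable with `DB_POOL_HEADROOM`) can be sketched in plain Ruby. The constant and method names here are illustrative, not the actual implementation:

```ruby
# Illustrative sketch of the pool-size calculation; the real code lives in
# the application initializers and may differ in detail.
DEFAULT_POOL_HEADROOM = 10

def effective_pool_size(max_threads, env = ENV)
  # DB_POOL_HEADROOM overrides the hardcoded headroom when set.
  headroom = Integer(env.fetch('DB_POOL_HEADROOM', DEFAULT_POOL_HEADROOM))
  max_threads + headroom
end

# With Puma configured for 4 user-facing threads and no override:
effective_pool_size(4, {}) # => 14
```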
+ +This is managed by Rails internally. diff --git a/doc/development/database/database_reviewer_guidelines.md b/doc/development/database/database_reviewer_guidelines.md index 6cb061f9959..3345df8b46b 100644 --- a/doc/development/database/database_reviewer_guidelines.md +++ b/doc/development/database/database_reviewer_guidelines.md @@ -1,3 +1,9 @@ +--- +stage: Enablement +group: Database +info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#designated-technical-writers +--- + # Database Reviewer Guidelines This page includes introductory material for new database reviewers. diff --git a/doc/development/database/index.md b/doc/development/database/index.md index 9ea5b6fcaac..4bcefefe7a7 100644 --- a/doc/development/database/index.md +++ b/doc/development/database/index.md @@ -1,3 +1,9 @@ +--- +stage: Enablement +group: Database +info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#designated-technical-writers +--- + # Database guides ## Database Reviews @@ -24,6 +30,7 @@ - [Background migrations](../background_migrations.md) - [Swapping tables](../swapping_tables.md) - [Deleting migrations](../deleting_migrations.md) +- [Partitioning tables](table_partitioning.md) ## Debugging @@ -49,6 +56,8 @@ - [Database Debugging and Troubleshooting](../database_debugging.md) - [Query Count Limits](../query_count_limits.md) - [Creating enums](../creating_enums.md) +- [Client-side connection-pool](client_side_connection_pool.md) +- [Updating multiple values](./setting_multiple_values.md) ## Case studies diff --git a/doc/development/database/not_null_constraints.md b/doc/development/database/not_null_constraints.md index e4dec2afa10..96271863d94 100644 --- a/doc/development/database/not_null_constraints.md +++ 
b/doc/development/database/not_null_constraints.md @@ -1,3 +1,9 @@ +--- +stage: Enablement +group: Database +info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#designated-technical-writers +--- + # `NOT NULL` constraints > [Introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/38358) in GitLab 13.0. @@ -33,7 +39,7 @@ end ## Add a `NOT NULL` column to an existing table -With PostgreSQL 11 being the minimum version since GitLab 13.0, adding columns with `NULL` and/or +With PostgreSQL 11 being the minimum version in GitLab 13.0 and later, adding columns with `NULL` and/or default values has become much easier and the standard `add_column` helper should be used in all cases. For example, consider a migration that adds a new `NOT NULL` column `active` to table `db_guides`, diff --git a/doc/development/database/setting_multiple_values.md b/doc/development/database/setting_multiple_values.md new file mode 100644 index 00000000000..5569a0e10b7 --- /dev/null +++ b/doc/development/database/setting_multiple_values.md @@ -0,0 +1,103 @@ +# Setting Multiple Values + +> [Introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/32921) in GitLab 13.5. + +Frequently, we will want to update multiple objects with new values for one +or more columns. The obvious way to do this is using `Relation#update_all`: + +```ruby +user.issues.open.update_all(due_date: 7.days.from_now) # (1) +user.issues.update_all('relative_position = relative_position + 1') # (2) +``` + +But what do you do if you cannot express the update as either a static value (1) +or as a calculation (2)? + +Thankfully we can use `UPDATE FROM` to express the need to update multiple rows +with distinct values in a single query. 
One can either use a temporary table, or +a Common Table Expression (CTE), and then use that as the source of the updates: + +```sql +with updates(obj_id, new_title, new_weight) as ( + values (1 :: integer, 'Very difficult issue' :: text, 8 :: integer), + (2, 'Very easy issue', 1) +) +update issues + set title = new_title, weight = new_weight + from updates + where id = obj_id +``` + +The bad news: There is no way to express this in ActiveRecord or even dropping +down to ARel - the `UpdateManager` just does not support `update from`, so this +is not expressible. + +The good news: We supply an abstraction to help you generate these kinds of +updates, called `Gitlab::Database::BulkUpdate`. This constructs queries such as the +above, and uses binding parameters to avoid SQL injection. + +## Usage + +To use this, we need: + +- the list of columns to update +- a mapping from object/ID to the new values to set for that object +- a way to determine the table for each object + +So for example, we can express the query above as: + +```ruby +issue_a = Issue.find(..) +issue_b = Issue.find(..) + +# Issues a single query: +::Gitlab::Database::BulkUpdate.execute(%i[title weight], { + issue_a => { title: 'Very difficult issue', weight: 8 }, + issue_b => { title: 'Very easy issue', weight: 1 } +}) +``` + +Here the table can be determined automatically, from calling +`object.class.table_name`, so we don't need to provide anything. + +We can even pass heterogeneous sets of objects, if the updates all make sense +for them: + +```ruby +issue_a = Issue.find(..) +issue_b = Issue.find(..) +merge_request = MergeRequest.find(..) 
+ +# Issues two queries +::Gitlab::Database::BulkUpdate.execute(%i[title], { + issue_a => { title: 'A' }, + issue_b => { title: 'B' }, + merge_request => { title: 'B' } +}) +``` + +If your objects do not return the correct model class (perhaps because they are +part of a union), then we need to specify this explicitly in a block: + +```ruby +bazzes = params +objects = Foo.from_union([ + Foo.select("id, 'foo' as object_type").where(quux: true), + Bar.select("id, 'bar' as object_type").where(wibble: true) + ]) +# At this point, all the objects are instances of Foo, even the ones from the +# Bar table +mapping = objects.to_h { |obj| [obj, bazzes[obj.id]] } + +# Issues at most 2 queries +::Gitlab::Database::BulkUpdate.execute(%i[baz], mapping) do |obj| + obj.object_type.constantize +end +``` + +## Caveats + +Note that this is a **very low-level** tool, and operates on the raw column +values. Enumerations and state fields must be translated into their underlying +representations, for example, and nested associations are not supported. No +validations or hooks will be called. diff --git a/doc/development/database/strings_and_the_text_data_type.md b/doc/development/database/strings_and_the_text_data_type.md index b73dfa859fb..fe8cfa5cd22 100644 --- a/doc/development/database/strings_and_the_text_data_type.md +++ b/doc/development/database/strings_and_the_text_data_type.md @@ -1,3 +1,9 @@ +--- +stage: Enablement +group: Database +info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#designated-technical-writers +--- + # Strings and the Text data type > [Introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/30453) in GitLab 13.0.
diff --git a/doc/development/database/table_partitioning.md b/doc/development/database/table_partitioning.md new file mode 100644 index 00000000000..30d0b0a2f5b --- /dev/null +++ b/doc/development/database/table_partitioning.md @@ -0,0 +1,259 @@ +--- +stage: Enablement +group: Database +info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#designated-technical-writers +--- + +# Database table partitioning + +Table partitioning is a powerful database feature that allows a table's +data to be split into smaller physical tables that act as a single large +table. If the application is designed to work with partitioning in mind, +there can be multiple benefits, such as: + +- Query performance can be improved greatly, because the database can +cheaply eliminate much of the data from the search space, while still +providing full SQL capabilities. + +- Bulk deletes can be achieved with minimal impact on the database by +dropping entire partitions. This is a natural fit for features that need +to periodically delete data that falls outside the retention window. + +- Administrative tasks like `VACUUM` and index rebuilds can operate on +individual partitions, rather than across a single massive table. + +Unfortunately, not all models fit a partitioning scheme, and there are +significant drawbacks if implemented incorrectly. Additionally, tables +can only be partitioned at their creation, making it nontrivial to apply +partitioning to a busy database. A suite of migration tools is available +to enable backend developers to partition existing tables, but the +migration process is rather heavy, taking multiple steps split across +several releases. Due to the limitations of partitioning and the related +migrations, you should understand how partitioning fits your use case +before attempting to leverage this feature.
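As an illustration of the bulk-delete benefit listed above, removing an expired month from a range-partitioned table is a metadata-only operation rather than a row-by-row `DELETE`. The table and partition names here are hypothetical:

```sql
-- Detach the partition holding the expired month, then drop it.
-- No per-row deletes, and no VACUUM debt left on the parent table.
ALTER TABLE audit_events DETACH PARTITION audit_events_202001;
DROP TABLE audit_events_202001;
```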
+ +## Determining when to use partitioning + +While partitioning can be very useful when properly applied, it's +imperative to identify if the data and workload of a table naturally fit a +partitioning scheme. There are a few details you'll have to understand +in order to decide if partitioning is a good fit for your particular +problem. + +First, a table is partitioned on a partition key, which is a column or +set of columns which determine how the data will be split across the +partitions. The partition key is used by the database when reading or +writing data, to decide which partition(s) need to be accessed. The +partition key should be a column that would be included in a `WHERE` +clause on almost all queries accessing that table. + +Second, it's necessary to understand the strategy the database will +use to split the data across the partitions. The scheme supported by the +GitLab migration helpers is date-range partitioning, where each partition +in the table contains data for a single month. In this case, the partitioning +key would need to be a timestamp or date column. In order for this type of +partitioning to work well, most queries would need to access data within a +certain date range. + +For a more concrete example, the `audit_events` table can be used, which +was the first table to be partitioned in the application database +(scheduled for deployment with the GitLab 13.5 release). This +table tracks audit entries of security events that happen in the +application. In almost all cases, users want to see audit activity that +occurs in a certain timeframe. As a result, date-range partitioning +was a natural fit for how the data would be accessed. 
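The monthly date-range scheme can be pictured as a simple mapping from the partitioning key to a partition. The `audit_events_YYYYMM` naming in this sketch is illustrative:

```ruby
require 'date'

# Illustrative: with monthly range partitioning, a row's partition is
# determined entirely by its partitioning key (a date or timestamp).
def partition_for(table, created_at)
  format('%s_%04d%02d', table, created_at.year, created_at.month)
end

partition_for('audit_events', Date.new(2020, 1, 15)) # => "audit_events_202001"
```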
+ +To look at this in more detail, imagine a simplified `audit_events` schema: + +```sql +CREATE TABLE audit_events ( + id SERIAL NOT NULL PRIMARY KEY, + author_id INT NOT NULL, + details jsonb NOT NULL, + created_at timestamptz NOT NULL); +``` + +Now imagine typical queries in the UI would display the data within a +certain date range, like a single week: + +```sql +SELECT * +FROM audit_events +WHERE created_at >= '2020-01-01 00:00:00' + AND created_at < '2020-01-08 00:00:00' +ORDER BY created_at DESC +LIMIT 100 +``` + +If the table is partitioned on the `created_at` column the base table would +look like: + +```sql +CREATE TABLE audit_events ( + id SERIAL NOT NULL, + author_id INT NOT NULL, + details jsonb NOT NULL, + created_at timestamptz NOT NULL, + PRIMARY KEY (id, created_at)) +PARTITION BY RANGE(created_at); +``` + +NOTE: **Note:** +The primary key of a partitioned table must include the partition key as +part of the primary key definition. + +And we might have a list of partitions for the table, such as: + +```sql +audit_events_202001 FOR VALUES FROM ('2020-01-01') TO ('2020-02-01') +audit_events_202002 FOR VALUES FROM ('2020-02-01') TO ('2020-03-01') +audit_events_202003 FOR VALUES FROM ('2020-03-01') TO ('2020-04-01') +``` + +Each partition is a separate physical table, with the same structure as +the base `audit_events` table, but contains only data for rows where the +partition key falls in the specified range. For example, the partition +`audit_events_202001` contains rows where the `created_at` column is +greater than or equal to `2020-01-01` and less than `2020-02-01`. + +Now, if we look at the previous example query again, the database can +use the `WHERE` to recognize that all matching rows will be in the +`audit_events_202001` partition. Rather than searching all of the data +in all of the partitions, it can search only the single month's worth +of data in the appropriate partition. 
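Pruning can be confirmed with `EXPLAIN`: given the partitions listed above, the plan for the range query should reference only `audit_events_202001`, because the constant bounds fall entirely within that partition's range:

```sql
EXPLAIN
SELECT *
FROM audit_events
WHERE created_at >= '2020-01-01 00:00:00'
  AND created_at < '2020-01-08 00:00:00'
ORDER BY created_at DESC
LIMIT 100;
```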
In a large table, this can +dramatically reduce the amount of data the database needs to access. +However, imagine a query that does not filter based on the partitioning +key, such as: + +```sql +SELECT * +FROM audit_events +WHERE author_id = 123 +ORDER BY created_at DESC +LIMIT 100 +``` + +In this example, the database can't prune any partitions from the search, +because matching data could exist in any of them. As a result, it has to +query each partition individually, and aggregate the rows into a single result +set. Since `author_id` would be indexed, the performance impact could +likely be acceptable, but on more complex queries the overhead can be +substantial. Partitioning should only be leveraged if the access patterns +of the data support the partitioning strategy; otherwise, performance will +suffer. + +## Partitioning a table + +Unfortunately, tables can only be partitioned at their creation, making +it nontrivial to apply to a busy database. A suite of migration +tools has been developed to enable backend developers to partition +existing tables. This migration process takes multiple steps that must +be split across several releases. + +### Caveats + +The partitioning migration helpers work by creating a partitioned duplicate +of the original table and using a combination of a trigger and a background +migration to copy data into the new table. Changes to the original table +schema can be made in parallel with the partitioning migration, but care +must be taken not to break the underlying mechanism that makes the migration +work. For example, if a column is added to the table that is being +partitioned, both the partitioned table and the trigger definition need to +be updated to match. + +### Step 1: Creating the partitioned copy (Release N) + +The first step is to add a migration to create the partitioned copy of +the original table.
This migration will also create the appropriate +partitions based on the data in the original table, and install a +trigger that will sync writes from the original table into the +partitioned copy. + +An example migration of partitioning the `audit_events` table by its +`created_at` column would look like: + +```ruby +class PartitionAuditEvents < ActiveRecord::Migration[6.0] + include Gitlab::Database::PartitioningMigrationHelpers + + def up + partition_table_by_date :audit_events, :created_at + end + + def down + drop_partitioned_table_for :audit_events + end +end +``` + +Once this has executed, any inserts, updates or deletes in the +original table will also be duplicated in the new table. For updates and +deletes, the operation will only have an effect if the corresponding row +exists in the partitioned table. + +### Step 2: Backfill the partitioned copy (Release N) + +The second step is to add a post-deployment migration that will schedule +the background jobs that will backfill existing data from the original table +into the partitioned copy. + +Continuing the above example, the migration would look like: + +```ruby +class BackfillPartitionAuditEvents < ActiveRecord::Migration[6.0] + include Gitlab::Database::PartitioningMigrationHelpers + + def up + enqueue_partitioning_data_migration :audit_events + end + + def down + cleanup_partitioning_data_migration :audit_events + end +end +``` + +This step uses the same mechanism as any background migration, so you +may want to read the [Background Migration](../background_migrations.md) +guide for details on that process. Background jobs are scheduled every +2 minutes and copy `50_000` records at a time, which can be used to +estimate the timing of the background migration portion of the +partitioning migration. + +### Step 3: Post-backfill cleanup (Release N+1) + +The third step must occur at least one release after the release that +includes the background migration. 
This gives time for the background +migration to execute properly in self-managed installations. In this step, +add another post-deployment migration that will clean up after the +background migration. This includes forcing any remaining jobs to +execute, and copying data that may have been missed due to dropped or +failed jobs. + +Once again, continuing the example, this migration would look like: + +```ruby +class CleanupPartitionedAuditEventsBackfill < ActiveRecord::Migration[6.0] + include Gitlab::Database::PartitioningMigrationHelpers + + def up + finalize_backfilling_partitioned_table :audit_events + end + + def down + # no op + end +end +``` + +After this migration has completed, the original table and partitioned +table should contain identical data. The trigger installed on the +original table guarantees that the data will remain in sync going +forward. + +### Step 4: Swap the partitioned and non-partitioned tables (Release N+1) + +The final step of the migration will make the partitioned table ready +for use by the application. This section will be updated when the +migration helper is ready; for now, development can be followed in the +[Tracking Issue](https://gitlab.com/gitlab-org/gitlab/-/issues/241267). -- cgit v1.2.1