Diffstat (limited to 'doc/development/database')
12 files changed, 1123 insertions, 13 deletions
diff --git a/doc/development/database/add_foreign_key_to_existing_column.md b/doc/development/database/add_foreign_key_to_existing_column.md index d74f826cc14..bfd455ef9da 100644 --- a/doc/development/database/add_foreign_key_to_existing_column.md +++ b/doc/development/database/add_foreign_key_to_existing_column.md @@ -123,7 +123,7 @@ end Validating the foreign key scans the whole table and makes sure that each relation is correct. NOTE: -When using [background migrations](../background_migrations.md), foreign key validation should happen in the next GitLab release. +When using [background migrations](background_migrations.md), foreign key validation should happen in the next GitLab release. Migration file for validating the foreign key: diff --git a/doc/development/database/avoiding_downtime_in_migrations.md b/doc/development/database/avoiding_downtime_in_migrations.md new file mode 100644 index 00000000000..ad2768397e6 --- /dev/null +++ b/doc/development/database/avoiding_downtime_in_migrations.md @@ -0,0 +1,491 @@ +--- +stage: Enablement +group: Database +info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments +--- + +# Avoiding downtime in migrations + +When working with a database certain operations may require downtime. Since we +cannot have downtime in migrations we need to use a set of steps to get the +same end result without downtime. This guide describes various operations that +may appear to need downtime, their impact, and how to perform them without +requiring downtime. + +## Dropping Columns + +Removing columns is tricky because running GitLab processes may still be using +the columns. To work around this safely, you will need three steps in three releases: + +1. Ignoring the column (release M) +1. Dropping the column (release M+1) +1. 
Removing the ignore rule (release M+2) + +The reason we spread this out across three releases is that dropping a column is +a destructive operation that can't be rolled back easily. + +Following this procedure helps us to make sure there are no deployments to GitLab.com +and upgrade processes for self-managed installations that lump together any of these steps. + +### Step 1: Ignoring the column (release M) + +The first step is to ignore the column in the application code. This is +necessary because Rails caches the columns and re-uses this cache in various +places. This can be done by defining the columns to ignore. For example, to ignore +`updated_at` in the User model you'd use the following: + +```ruby +class User < ApplicationRecord + include IgnorableColumns + ignore_column :updated_at, remove_with: '12.7', remove_after: '2020-01-22' +end +``` + +Multiple columns can be ignored, too: + +```ruby +ignore_columns %i[updated_at created_at], remove_with: '12.7', remove_after: '2020-01-22' +``` + +If the model exists in CE and EE, the column has to be ignored in the CE model. If the +model only exists in EE, then it has to be added there. + +We require indication of when it is safe to remove the column ignore with: + +- `remove_with`: set to a GitLab release typically two releases (M+2) after adding the + column ignore. +- `remove_after`: set to a date after which we consider it safe to remove the column + ignore, typically after the M+1 release date, during the M+2 development cycle. + +This information allows us to reason better about column ignores and makes sure we +don't remove column ignores too early for both regular releases and deployments to GitLab.com. For +example, this avoids a situation where we deploy a bulk of changes that include both changes +to ignore the column and subsequently remove the column ignore (which would result in a downtime). + +In this example, the change to ignore the column went into release 12.5. 
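Conceptually, `ignore_column` works by filtering the ignored names out of the column list the model exposes, so the cached columns are no longer referenced in generated SQL. Below is a simplified, Rails-free sketch of that idea; this is not GitLab's actual `IgnorableColumns` implementation, and all names are illustrative:

```ruby
# Simplified sketch of a column-ignore concern (NOT GitLab's actual
# IgnorableColumns implementation). The model records which columns are
# ignored and filters them out of the column list it exposes.
module SimpleIgnorableColumns
  def ignored_columns_registry
    @ignored_columns_registry ||= {}
  end

  # Mirrors the documented signature: remove_with / remove_after indicate
  # when the ignore rule itself may be deleted.
  def ignore_column(name, remove_with:, remove_after:)
    ignored_columns_registry[name.to_s] = {
      remove_with: remove_with,
      remove_after: remove_after
    }
  end

  # The column list the rest of the application sees: the raw database
  # columns minus anything currently ignored.
  def visible_columns(raw_columns)
    raw_columns.reject { |column| ignored_columns_registry.key?(column.to_s) }
  end
end

class User
  extend SimpleIgnorableColumns
  ignore_column :updated_at, remove_with: '12.7', remove_after: '2020-01-22'
end

User.visible_columns(%w[id name updated_at]) # => ["id", "name"]
```

Because the application no longer sees `updated_at`, the column can be dropped in release M+1 without running code still selecting or caching it.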
+
+### Step 2: Dropping the column (release M+1)
+
+Continuing our example, dropping the column goes into a _post-deployment_ migration in release 12.6:
+
+```ruby
+remove_column :users, :updated_at
+```
+
+### Step 3: Removing the ignore rule (release M+2)
+
+With the next release, in this example 12.7, we set up another merge request to remove the ignore rule.
+This removes the `ignore_column` line and, if it is no longer needed, the inclusion of `IgnorableColumns`.
+
+This should only get merged in the release indicated by `remove_with`, and only once
+the `remove_after` date has passed.
+
+## Renaming Columns
+
+Renaming columns the normal way requires downtime as an application may continue
+using the old column name during/after a database migration. To rename a column
+without requiring downtime we need two migrations: a regular migration and a
+post-deployment migration. Both of these migrations can go in the same release.
+
+### Step 1: Add The Regular Migration
+
+First we need to create the regular migration. This migration should use
+`Gitlab::Database::MigrationHelpers#rename_column_concurrently` to perform the
+renaming. For example:
+
+```ruby
+# A regular migration in db/migrate
+class RenameUsersUpdatedAtToUpdatedAtTimestamp < Gitlab::Database::Migration[1.0]
+  disable_ddl_transaction!
+
+  def up
+    rename_column_concurrently :users, :updated_at, :updated_at_timestamp
+  end
+
+  def down
+    undo_rename_column_concurrently :users, :updated_at, :updated_at_timestamp
+  end
+end
+```
+
+This takes care of renaming the column, ensuring data stays in sync, and
+copying over indexes and foreign keys.
+
+If a column contains one or more indexes that don't contain the name of the
+original column, the previously described procedure will fail. In that case,
+you'll first need to rename these indexes.
+
+### Step 2: Add A Post-Deployment Migration
+
+The renaming procedure requires some cleaning up in a post-deployment migration.
+We can perform this cleanup using +`Gitlab::Database::MigrationHelpers#cleanup_concurrent_column_rename`: + +```ruby +# A post-deployment migration in db/post_migrate +class CleanupUsersUpdatedAtRename < Gitlab::Database::Migration[1.0] + disable_ddl_transaction! + + def up + cleanup_concurrent_column_rename :users, :updated_at, :updated_at_timestamp + end + + def down + undo_cleanup_concurrent_column_rename :users, :updated_at, :updated_at_timestamp + end +end +``` + +If you're renaming a [large table](https://gitlab.com/gitlab-org/gitlab/-/blob/master/rubocop/rubocop-migrations.yml#L3), please carefully consider the state when the first migration has run but the second cleanup migration hasn't been run yet. +With [Canary](https://gitlab.com/gitlab-com/gl-infra/readiness/-/tree/master/library/canary/) it is possible that the system runs in this state for a significant amount of time. + +## Changing Column Constraints + +Adding or removing a `NOT NULL` clause (or another constraint) can typically be +done without requiring downtime. However, this does require that any application +changes are deployed _first_. Thus, changing the constraints of a column should +happen in a post-deployment migration. + +Avoid using `change_column` as it produces an inefficient query because it re-defines +the whole column type. + +You can check the following guides for each specific use case: + +- [Adding foreign-key constraints](../migration_style_guide.md#adding-foreign-key-constraints) +- [Adding `NOT NULL` constraints](not_null_constraints.md) +- [Adding limits to text columns](strings_and_the_text_data_type.md) + +## Changing Column Types + +Changing the type of a column can be done using +`Gitlab::Database::MigrationHelpers#change_column_type_concurrently`. This +method works similarly to `rename_column_concurrently`. For example, let's say +we want to change the type of `users.username` from `string` to `text`. 
+
+### Step 1: Create A Regular Migration
+
+A regular migration is used to create a new column with a temporary name along
+with setting up some triggers to keep data in sync. Such a migration would look
+as follows:
+
+```ruby
+# A regular migration in db/migrate
+class ChangeUsersUsernameStringToText < Gitlab::Database::Migration[1.0]
+  disable_ddl_transaction!
+
+  def up
+    change_column_type_concurrently :users, :username, :text
+  end
+
+  def down
+    undo_change_column_type_concurrently :users, :username
+  end
+end
+```
+
+### Step 2: Create A Post Deployment Migration
+
+Next we need to clean up our changes using a post-deployment migration:
+
+```ruby
+# A post-deployment migration in db/post_migrate
+class ChangeUsersUsernameStringToTextCleanup < Gitlab::Database::Migration[1.0]
+  disable_ddl_transaction!
+
+  def up
+    cleanup_concurrent_column_type_change :users, :username
+  end
+
+  def down
+    undo_cleanup_concurrent_column_type_change :users, :username, :string
+  end
+end
+```
+
+And that's it, we're done!
+
+### Casting data to a new type
+
+Some type changes require casting data to a new type, for example when changing from `text` to `jsonb`.
+In this case, use the `type_cast_function` option.
+Make sure there is no bad data and the cast will always succeed. You can also provide a custom function that handles
+casting errors.
+
+Example migration:
+
+```ruby
+  def up
+    change_column_type_concurrently :users, :settings, :jsonb, type_cast_function: 'jsonb'
+  end
+```
+
+## Changing The Schema For Large Tables
+
+While `change_column_type_concurrently` and `rename_column_concurrently` can be
+used for changing the schema of a table without downtime, they don't work very
+well for large tables. Because all of the work happens in sequence, the migration
+can take a very long time to complete, preventing a deployment from proceeding.
+They can also put a lot of pressure on the database because they rapidly
+update many rows in sequence.
+
+To reduce database pressure you should instead use a background migration
+when migrating a column in a large table (for example, `issues`). Background
+migrations spread the work and load over a longer time period, without slowing down deployments.
+
+For more information, see [the documentation on cleaning up background
+migrations](background_migrations.md#cleaning-up).
+
+## Adding Indexes
+
+Adding indexes does not require downtime when `add_concurrent_index`
+is used.
+
+See also [Migration Style Guide](../migration_style_guide.md#adding-indexes)
+for more information.
+
+## Dropping Indexes
+
+Dropping an index does not require downtime.
+
+## Adding Tables
+
+This operation is safe as there's no code using the table just yet.
+
+## Dropping Tables
+
+Dropping tables can be done safely using a post-deployment migration, but only
+if the application no longer uses the table.
+
+## Renaming Tables
+
+Renaming tables requires downtime as an application may continue
+using the old table name during/after a database migration.
+
+If the table and the ActiveRecord model are not in use yet, removing the old
+table and creating a new one is the preferred way to "rename" the table.
+
+Renaming a table is possible without downtime by following our multi-release
+[rename table process](rename_database_tables.md#rename-table-without-downtime).
+
+## Adding Foreign Keys
+
+Adding foreign keys usually works in three steps:
+
+1. Start a transaction
+1. Run `ALTER TABLE` to add the constraint(s)
+1. Check all existing data
+
+Because `ALTER TABLE` typically acquires an exclusive lock until the end of a
+transaction, this approach would require downtime.
+
+GitLab allows you to work around this by using
+`Gitlab::Database::MigrationHelpers#add_concurrent_foreign_key`. This method
+ensures that no downtime is needed.
+
+## Removing Foreign Keys
+
+This operation does not require downtime.
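For reference, the two-phase approach used by `add_concurrent_foreign_key` can be sketched in SQL. The table and constraint names below are hypothetical:

```sql
-- Phase 1: add the constraint without validating existing rows.
-- This needs only a brief lock because no data is scanned.
ALTER TABLE issues
  ADD CONSTRAINT fk_issues_project_id
  FOREIGN KEY (project_id) REFERENCES projects (id)
  ON DELETE CASCADE
  NOT VALID;

-- Phase 2: validate existing rows in a separate statement.
-- This scans the table but takes a much weaker lock, so concurrent
-- reads and writes are not blocked for the duration of the scan.
ALTER TABLE issues VALIDATE CONSTRAINT fk_issues_project_id;
```

New and updated rows are checked by the `NOT VALID` constraint immediately; only pre-existing rows wait for the `VALIDATE CONSTRAINT` step.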
+
+## Migrating `integer` primary keys to `bigint`
+
+To [prevent the overflow risk](https://gitlab.com/groups/gitlab-org/-/epics/4785) for some tables
+with an `integer` primary key (PK), we have to migrate their PK to `bigint`. The process to do this
+without downtime or causing too much load on the database is described below.
+
+### Initialize the conversion and start migrating existing data (release N)
+
+To start the process, add a regular migration to create the new `bigint` columns. Use the provided
+`initialize_conversion_of_integer_to_bigint` helper. The helper also creates a database trigger
+to keep both columns in sync for any new records ([see an example](https://gitlab.com/gitlab-org/gitlab/-/blob/41fbe34a4725a4e357a83fda66afb382828767b2/db/migrate/20210608072312_initialize_conversion_of_ci_stages_to_bigint.rb)):
+
+```ruby
+class InitializeConversionOfCiStagesToBigint < ActiveRecord::Migration[6.1]
+  include Gitlab::Database::MigrationHelpers
+
+  TABLE = :ci_stages
+  COLUMNS = %i(id)
+
+  def up
+    initialize_conversion_of_integer_to_bigint(TABLE, COLUMNS)
+  end
+
+  def down
+    revert_initialize_conversion_of_integer_to_bigint(TABLE, COLUMNS)
+  end
+end
+```
+
+Ignore the new `bigint` columns:
+
+```ruby
+module Ci
+  class Stage < Ci::ApplicationRecord
+    include IgnorableColumns
+    ignore_column :id_convert_to_bigint, remove_with: '14.2', remove_after: '2021-08-22'
+  end
+end
+```
+
+To migrate existing data, we introduced a new type of _batched background migrations_.
+Unlike the classic background migrations, which are built on top of Sidekiq, batched background migrations
+don't have to enqueue and schedule all the background jobs at the beginning.
+They also have other advantages, like automatic tuning of the batch size, better progress visibility,
+and collecting metrics.
To start the process, use the provided `backfill_conversion_of_integer_to_bigint` +helper ([example](https://gitlab.com/gitlab-org/gitlab/-/blob/41fbe34a4725a4e357a83fda66afb382828767b2/db/migrate/20210608072346_backfill_ci_stages_for_bigint_conversion.rb)): + +```ruby +class BackfillCiStagesForBigintConversion < ActiveRecord::Migration[6.1] + include Gitlab::Database::MigrationHelpers + + TABLE = :ci_stages + COLUMNS = %i(id) + + def up + backfill_conversion_of_integer_to_bigint(TABLE, COLUMNS) + end + + def down + revert_backfill_conversion_of_integer_to_bigint(TABLE, COLUMNS) + end +end +``` + +### Monitor the background migration + +Check how the migration is performing while it's running. Multiple ways to do this are described below. + +#### High-level status of batched background migrations + +See how to [check the status of batched background migrations](../../update/index.md#checking-for-background-migrations-before-upgrading). + +#### Query the database + +We can query the related database tables directly. Requires access to read-only replica. 
+
+Example queries:
+
+```sql
+-- Get details for batched background migration for given table
+SELECT * FROM batched_background_migrations WHERE table_name = 'namespaces'\gx
+
+-- Get count of batched background migration jobs by status for given table
+SELECT
+  batched_background_migrations.id, batched_background_migration_jobs.status, COUNT(*)
+FROM
+  batched_background_migrations
+  JOIN batched_background_migration_jobs ON batched_background_migrations.id = batched_background_migration_jobs.batched_background_migration_id
+WHERE
+  table_name = 'namespaces'
+GROUP BY
+  batched_background_migrations.id, batched_background_migration_jobs.status;
+
+-- Batched background migration progress for given table (based on estimated total number of tuples)
+SELECT
+  m.table_name,
+  LEAST(100 * sum(j.batch_size) / pg_class.reltuples, 100) AS percentage_complete
+FROM
+  batched_background_migrations m
+  JOIN batched_background_migration_jobs j ON j.batched_background_migration_id = m.id
+  JOIN pg_class ON pg_class.relname = m.table_name
+WHERE
+  j.status = 3 AND m.table_name = 'namespaces'
+GROUP BY m.id, pg_class.reltuples;
+```
+
+#### Sidekiq logs
+
+We can also use the Sidekiq logs to monitor the worker that executes the batched background
+migrations:
+
+1. Sign in to [Kibana](https://log.gprd.gitlab.net) with a `@gitlab.com` email address.
+1. Change the index pattern to `pubsub-sidekiq-inf-gprd*`.
+1. Add filter for `json.queue: cronjob:database_batched_background_migration`.
+
+#### PostgreSQL slow queries log
+
+The slow queries log keeps track of queries that took longer than 1 second to execute. To see them
+for a batched background migration:
+
+1. Sign in to [Kibana](https://log.gprd.gitlab.net) with a `@gitlab.com` email address.
+1. Change the index pattern to `pubsub-postgres-inf-gprd*`.
+1. Add filter for `json.endpoint_id.keyword: Database::BatchedBackgroundMigrationWorker`.
+1. Optional. To see only updates, add a filter for `json.command_tag.keyword: UPDATE`.
+1. 
Optional. To see only failed statements, add a filter for `json.error_severity.keyword: ERROR`.
+1. Optional. Add a filter by table name.
+
+#### Grafana dashboards
+
+To monitor the health of the database, use these additional metrics:
+
+- [PostgreSQL Tuple Statistics](https://dashboards.gitlab.net/d/000000167/postgresql-tuple-statistics?orgId=1&refresh=1m): if you see a high rate of updates for the tables being actively converted, or an increasing percentage of dead tuples for this table, it might mean that autovacuum cannot keep up.
+- [PostgreSQL Overview](https://dashboards.gitlab.net/d/000000144/postgresql-overview?orgId=1): if you see high system usage or transactions per second (TPS) on the primary database server, it might mean that the migration is causing problems.
+
+### Prometheus metrics
+
+A number of [metrics](https://gitlab.com/gitlab-org/gitlab/-/blob/294a92484ce4611f660439aa48eee4dfec2230b5/lib/gitlab/database/background_migration/batched_migration_wrapper.rb#L90-128)
+are published to Prometheus for each batched background migration. These metrics can be searched for and
+visualized in Thanos ([see an example](https://thanos-query.ops.gitlab.net/graph?g0.expr=sum%20(rate(batched_migration_job_updated_tuples_total%7Benv%3D%22gprd%22%7D%5B5m%5D))%20by%20(migration_id)%20&g0.tab=0&g0.stacked=0&g0.range_input=3d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D&g0.end_input=2021-06-13%2012%3A18%3A24&g0.moment_input=2021-06-13%2012%3A18%3A24)).
+
+### Swap the columns (release N + 1)
+
+After the background migration is completed and the new `bigint` columns are populated for all records, we can
+swap the columns. Swapping is done with a post-deployment migration. The exact process depends on the
+table being converted, but in general it's done in the following steps:
+
+1. 
Using the provided `ensure_batched_background_migration_is_finished` helper, make sure the batched
+migration has finished ([see an example](https://gitlab.com/gitlab-org/gitlab/-/blob/41fbe34a4725a4e357a83fda66afb382828767b2/db/post_migrate/20210707210916_finalize_ci_stages_bigint_conversion.rb#L13-18)).
+If the migration has not completed, the subsequent steps fail anyway. By checking in advance we
+aim to have a more helpful error message.
+1. Create indexes using the `bigint` columns that match the existing indexes using the `integer`
+column ([see an example](https://gitlab.com/gitlab-org/gitlab/-/blob/41fbe34a4725a4e357a83fda66afb382828767b2/db/post_migrate/20210707210916_finalize_ci_stages_bigint_conversion.rb#L28-34)).
+1. Create foreign keys (FK) using the `bigint` columns that match the existing FKs using the
+`integer` column. Do this both for FKs referencing other tables, and FKs that reference the table
+that is being migrated ([see an example](https://gitlab.com/gitlab-org/gitlab/-/blob/41fbe34a4725a4e357a83fda66afb382828767b2/db/post_migrate/20210707210916_finalize_ci_stages_bigint_conversion.rb#L36-43)).
+1. Inside a transaction, swap the columns:
+   1. Lock the tables involved. To reduce the chance of hitting a deadlock, we recommend doing this in parent-to-child order ([see an example](https://gitlab.com/gitlab-org/gitlab/-/blob/41fbe34a4725a4e357a83fda66afb382828767b2/db/post_migrate/20210707210916_finalize_ci_stages_bigint_conversion.rb#L47)).
+   1. Rename the columns to swap names ([see an example](https://gitlab.com/gitlab-org/gitlab/-/blob/41fbe34a4725a4e357a83fda66afb382828767b2/db/post_migrate/20210707210916_finalize_ci_stages_bigint_conversion.rb#L49-54)).
+   1. Reset the trigger function ([see an example](https://gitlab.com/gitlab-org/gitlab/-/blob/41fbe34a4725a4e357a83fda66afb382828767b2/db/post_migrate/20210707210916_finalize_ci_stages_bigint_conversion.rb#L56-57)).
+   1. 
Swap the defaults ([see an example](https://gitlab.com/gitlab-org/gitlab/-/blob/41fbe34a4725a4e357a83fda66afb382828767b2/db/post_migrate/20210707210916_finalize_ci_stages_bigint_conversion.rb#L59-62)). + 1. Swap the PK constraint (if any) ([see an example](https://gitlab.com/gitlab-org/gitlab/-/blob/41fbe34a4725a4e357a83fda66afb382828767b2/db/post_migrate/20210707210916_finalize_ci_stages_bigint_conversion.rb#L64-68)). + 1. Remove old indexes and rename new ones ([see an example](https://gitlab.com/gitlab-org/gitlab/-/blob/41fbe34a4725a4e357a83fda66afb382828767b2/db/post_migrate/20210707210916_finalize_ci_stages_bigint_conversion.rb#L70-72)). + 1. Remove old FKs (if still present) and rename new ones ([see an example](https://gitlab.com/gitlab-org/gitlab/-/blob/41fbe34a4725a4e357a83fda66afb382828767b2/db/post_migrate/20210707210916_finalize_ci_stages_bigint_conversion.rb#L74)). + +See example [merge request](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/66088), and [migration](https://gitlab.com/gitlab-org/gitlab/-/blob/41fbe34a4725a4e357a83fda66afb382828767b2/db/post_migrate/20210707210916_finalize_ci_stages_bigint_conversion.rb). + +### Remove the trigger and old `integer` columns (release N + 2) + +Using post-deployment migration and the provided `cleanup_conversion_of_integer_to_bigint` helper, +drop the database trigger and the old `integer` columns ([see an example](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/69714)). + +### Remove ignore rules (release N + 3) + +In the next release after the columns were dropped, remove the ignore rules as we do not need them +anymore ([see an example](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/71161)). + +## Data migrations + +Data migrations can be tricky. The usual approach to migrate data is to take a 3 +step approach: + +1. Migrate the initial batch of data +1. Deploy the application code +1. Migrate any remaining data + +Usually this works, but not always. 
For example, if a field's format is to be
+changed from JSON to something else, we have a bit of a problem. If we were to
+change existing data before deploying application code, we'd most likely run
+into errors. On the other hand, if we were to migrate after deploying the
+application code, we could run into the same problems.
+
+If you merely need to correct some invalid data, then a post-deployment
+migration is usually enough. If you need to change the format of data (for example, from
+JSON to something else) it's typically best to add a new column for the new data
+format, and have the application use that. In such a case the procedure would
+be:
+
+1. Add a new column in the new format
+1. Copy over existing data to this new column
+1. Deploy the application code
+1. In a post-deployment migration, copy over any remaining data
+
+In general there is no one-size-fits-all solution; it's therefore best to
+discuss these kinds of migrations in a merge request to make sure they are
+implemented in the best way possible.
diff --git a/doc/development/database/background_migrations.md b/doc/development/database/background_migrations.md
new file mode 100644
index 00000000000..a982b76ccf7
--- /dev/null
+++ b/doc/development/database/background_migrations.md
@@ -0,0 +1,499 @@
+---
+stage: Enablement
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Background migrations
+
+Background migrations should be used to perform data migrations whenever a
+migration exceeds [the time limits in our guidelines](../migration_style_guide.md#how-long-a-migration-should-take). For example, you can use background
+migrations to migrate data that's stored in a single JSON column
+to a separate table instead.
+ +If the database cluster is considered to be in an unhealthy state, background +migrations automatically reschedule themselves for a later point in time. + +## When To Use Background Migrations + +You should use a background migration when you migrate _data_ in tables that have +so many rows that the process would exceed [the time limits in our guidelines](../migration_style_guide.md#how-long-a-migration-should-take) if performed using a regular Rails migration. + +- Background migrations should be used when migrating data in [high-traffic tables](../migration_style_guide.md#high-traffic-tables). +- Background migrations may also be used when executing numerous single-row queries +for every item on a large dataset. Typically, for single-record patterns, runtime is +largely dependent on the size of the dataset, hence it should be split accordingly +and put into background migrations. +- Background migrations should not be used to perform schema migrations. + +Some examples where background migrations can be useful: + +- Migrating events from one table to multiple separate tables. +- Populating one column based on JSON stored in another column. +- Migrating data that depends on the output of external services (for example, an API). + +NOTE: +If the background migration is part of an important upgrade, make sure it's announced +in the release post. Discuss with your Project Manager if you're not sure the migration falls +into this category. + +## Isolation + +Background migrations must be isolated and can not use application code (for example, +models defined in `app/models`). Since these migrations can take a long time to +run it's possible for new versions to be deployed while they are still running. + +It's also possible for different migrations to be executed at the same time. +This means that different background migrations should not migrate data in a +way that would cause conflicts. 
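To make the isolation requirement concrete, here is a minimal, Rails-free sketch (all class, column, and namespace member names below are hypothetical): the migration defines its own data access inside its own namespace instead of reusing `app/models`, so later changes to application models cannot break jobs that are still queued:

```ruby
module Gitlab
  module BackgroundMigration
    class ExtractThemeFromSettings
      # A migration-scoped stand-in for a table row. A real GitLab
      # background migration would instead define a minimal ActiveRecord
      # class here (for example, with `self.table_name = 'users'`) rather
      # than using the application's User model.
      Row = Struct.new(:id, :settings, :theme)

      # Process a batch of rows. The update is also idempotent: running it
      # a second time on the same rows leaves them unchanged, which keeps
      # Sidekiq retries safe.
      def perform(rows)
        rows.each do |row|
          next if row.theme # already migrated

          row.theme = row.settings['theme'] if row.settings
        end
      end
    end
  end
end

rows = [Gitlab::BackgroundMigration::ExtractThemeFromSettings::Row.new(1, { 'theme' => 'dark' }, nil)]
Gitlab::BackgroundMigration::ExtractThemeFromSettings.new.perform(rows)
rows.first.theme # => "dark"
```

Because the migration owns every class it touches, two different migrations running concurrently only conflict if they write the same data, which is exactly the situation the guideline above asks you to avoid.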
+
+## Idempotence
+
+Background migrations are executed in the context of a Sidekiq process.
+The usual Sidekiq rules apply, especially the rule that jobs should be small
+and idempotent.
+
+See [Sidekiq best practices guidelines](https://github.com/mperham/sidekiq/wiki/Best-Practices)
+for more details.
+
+Make sure that data integrity is guaranteed even if your migration job is retried.
+
+## Background migrations for EE-only features
+
+All the background migration classes for EE-only features should be present in GitLab CE.
+For this purpose, an empty class can be created for GitLab CE, and it can be extended for GitLab EE
+as explained in the [guidelines for implementing Enterprise Edition features](../ee_features.md#code-in-libgitlabbackground_migration).
+
+## How It Works
+
+Background migrations are simple classes that define a `perform` method. A
+Sidekiq worker will then execute such a class, passing any arguments to it. All
+migration classes must be defined in the namespace
+`Gitlab::BackgroundMigration`, and the files should be placed in the directory
+`lib/gitlab/background_migration/`.
+
+## Scheduling
+
+Scheduling a background migration should be done in a post-deployment
+migration that includes `Gitlab::Database::MigrationHelpers`.
+To do so, use the following code, replacing the class name and arguments
+with whatever values are necessary for your migration:
+
+```ruby
+migrate_in('BackgroundMigrationClassName', [arg1, arg2, ...])
+```
+
+You can use the function `queue_background_migration_jobs_by_range_at_intervals`
+to automatically split the job into batches:
+
+```ruby
+queue_background_migration_jobs_by_range_at_intervals(
+  ClassName,
+  BackgroundMigrationClassName,
+  2.minutes,
+  batch_size: 10_000
+)
+```
+
+You'll also need to make sure that newly created data is either migrated or
+saved in both the old and new version upon creation.
For complex and time +consuming migrations it's best to schedule a background job using an +`after_create` hook so this doesn't affect response timings. The same applies to +updates. Removals in turn can be handled by simply defining foreign keys with +cascading deletes. + +### Rescheduling background migrations + +If one of the background migrations contains a bug that is fixed in a patch +release, the background migration needs to be rescheduled so the migration would +be repeated on systems that already performed the initial migration. + +When you reschedule the background migration, make sure to turn the original +scheduling into a no-op by clearing up the `#up` and `#down` methods of the +migration performing the scheduling. Otherwise the background migration would be +scheduled multiple times on systems that are upgrading multiple patch releases at +once. + +When you start the second post-deployment migration, you should delete any +previously queued jobs from the initial migration with the provided +helper: + +```ruby +delete_queued_jobs('BackgroundMigrationClassName') +``` + +## Cleaning Up + +NOTE: +Cleaning up any remaining background migrations _must_ be done in either a major +or minor release, you _must not_ do this in a patch release. + +Because background migrations can take a long time you can't immediately clean +things up after scheduling them. For example, you can't drop a column that's +used in the migration process as this would cause jobs to fail. This means that +you'll need to add a separate _post deployment_ migration in a future release +that finishes any remaining jobs before cleaning things up (for example, removing a +column). + +As an example, say you want to migrate the data from column `foo` (containing a +big JSON blob) to column `bar` (containing a string). The process for this would +roughly be as follows: + +1. Release A: + 1. Create a migration class that performs the migration for a row with a given ID. 
+ You can use [background jobs tracking](#background-jobs-tracking) to simplify cleaning up. + 1. Deploy the code for this release, this should include some code that will + schedule jobs for newly created data (for example, using an `after_create` hook). + 1. Schedule jobs for all existing rows in a post-deployment migration. It's + possible some newly created rows may be scheduled twice so your migration + should take care of this. +1. Release B: + 1. Deploy code so that the application starts using the new column and stops + scheduling jobs for newly created data. + 1. In a post-deployment migration, finalize all jobs that have not succeeded by now. + If you used [background jobs tracking](#background-jobs-tracking) in release A, + you can use `finalize_background_migration` from `BackgroundMigrationHelpers` to ensure no jobs remain. + This helper will: + 1. Use `Gitlab::BackgroundMigration.steal` to process any remaining + jobs in Sidekiq. + 1. Reschedule the migration to be run directly (that is, not through Sidekiq) + on any rows that weren't migrated by Sidekiq. This can happen if, for + instance, Sidekiq received a SIGKILL, or if a particular batch failed + enough times to be marked as dead. + 1. Remove `Gitlab::Database::BackgroundMigrationJob` rows where + `status = succeeded`. To retain diagnostic information that may + help with future bug tracking you can skip this step by specifying + the `delete_tracking_jobs: false` parameter. + 1. Remove the old column. + +This may also require a bump to the [import/export version](../../user/project/settings/import_export.md), if +importing a project from a prior version of GitLab requires the data to be in +the new format. + +## Example + +To explain all this, let's use the following example: the table `integrations` has a +field called `properties` which is stored in JSON. For all rows you want to +extract the `url` key from this JSON object and store it in the `integrations.url` +column. 
There are millions of integrations and parsing JSON is slow, thus you can't +do this in a regular migration. + +To do this using a background migration we'll start with defining our migration +class: + +```ruby +class Gitlab::BackgroundMigration::ExtractIntegrationsUrl + class Integration < ActiveRecord::Base + self.table_name = 'integrations' + end + + def perform(start_id, end_id) + Integration.where(id: start_id..end_id).each do |integration| + json = JSON.load(integration.properties) + + integration.update(url: json['url']) if json['url'] + rescue JSON::ParserError + # If the JSON is invalid we don't want to keep the job around forever, + # instead we'll just leave the "url" field to whatever the default value + # is. + next + end + end +end +``` + +Next we'll need to adjust our code so we schedule the above migration for newly +created and updated integrations. We can do this using something along the lines of +the following: + +```ruby +class Integration < ActiveRecord::Base + after_commit :schedule_integration_migration, on: :update + after_commit :schedule_integration_migration, on: :create + + def schedule_integration_migration + BackgroundMigrationWorker.perform_async('ExtractIntegrationsUrl', [id, id]) + end +end +``` + +We're using `after_commit` here to ensure the Sidekiq job is not scheduled +before the transaction completes as doing so can lead to race conditions where +the changes are not yet visible to the worker. + +Next we'll need a post-deployment migration that schedules the migration for +existing data. + +```ruby +class ScheduleExtractIntegrationsUrl < Gitlab::Database::Migration[1.0] + disable_ddl_transaction! 
+
+  MIGRATION = 'ExtractIntegrationsUrl'
+  DELAY_INTERVAL = 2.minutes
+
+  def up
+    queue_background_migration_jobs_by_range_at_intervals(
+      define_batchable_model('integrations'),
+      MIGRATION,
+      DELAY_INTERVAL)
+  end
+
+  def down
+  end
+end
+```
+
+Once deployed, our application will continue using the data as before, but at the
+same time will ensure that both existing and new data is migrated.
+
+In the next release we can remove the `after_commit` hooks and related code. We
+will also need to add a post-deployment migration that consumes any remaining
+jobs and manually runs the migration on any un-migrated rows. Such a migration
+would look like this:
+
+```ruby
+class ConsumeRemainingExtractIntegrationsUrlJobs < Gitlab::Database::Migration[1.0]
+  disable_ddl_transaction!
+
+  def up
+    # This must be included
+    Gitlab::BackgroundMigration.steal('ExtractIntegrationsUrl')
+
+    # This should be included, but can be skipped - see below
+    define_batchable_model('integrations').where(url: nil).each_batch(of: 50) do |batch|
+      range = batch.pluck('MIN(id)', 'MAX(id)').first
+
+      Gitlab::BackgroundMigration::ExtractIntegrationsUrl.new.perform(*range)
+    end
+  end
+
+  def down
+  end
+end
+```
+
+The final step runs for any un-migrated rows after all of the jobs have been
+processed. This is in case a Sidekiq process running the background migrations
+received a SIGKILL, leading to the jobs being lost. (See
+[more reliable Sidekiq queue](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/36791) for more information.)
+
+If the application does not depend on the data being 100% migrated (for
+instance, the data is advisory, and not mission-critical), then this final step
+can be skipped.
+
+This migration will then process any jobs for the `ExtractIntegrationsUrl` migration
+and continue once all jobs have been processed. Once done, you can safely remove
+the `integrations.properties` column.
+
+## Testing
+
+It is required to write tests for:
+
+- The background migration's scheduling migration.
+- The background migration itself.
+- A cleanup migration.
+
+The `:migration` and `schema: :latest` RSpec tags are automatically set for
+background migration specs.
+See the
+[Testing Rails migrations](../testing_guide/testing_migrations_guide.md#testing-a-non-activerecordmigration-class)
+style guide.
+
+Keep in mind that `before` and `after` RSpec hooks are going
+to migrate your database down and up, which can result in other background
+migrations being called. That means that using `spy` test doubles with
+`have_received` is encouraged, instead of using regular test doubles, because
+your expectations defined in an `it` block can conflict with what is being
+called in RSpec hooks. See [issue #18839](https://gitlab.com/gitlab-org/gitlab/-/issues/18839)
+for more details.
+
+## Best practices
+
+1. Make sure to know how much data you're dealing with.
+1. Make sure that background migration jobs are idempotent.
+1. Make sure that the tests you write are not false positives.
+1. Make sure that if the data being migrated is critical and cannot be lost, the
+   clean-up migration also checks the final state of the data before completing.
+1. When migrating many columns, make sure it won't generate too many
+   dead tuples in the process (you may need to directly query the number of dead tuples
+   and adjust the scheduling according to this piece of data).
+1. Make sure to discuss the numbers with a database specialist; the migration may add
+   more pressure on the database than you expect (measure on staging,
+   or ask someone to measure on production).
+1. Make sure to know how much time it takes to run all scheduled migrations.
+1. Provide an estimation section in the description, estimating both the total migration
+   run time and the query times for each background migration job. Explain plans for each query
+   should also be provided.
+
+   For example, assuming a migration that deletes data, include information similar to
+   the following section:
+
+   ```plaintext
+   Background Migration Details:
+
+   47600 items to delete
+   batch size = 1000
+   47600 / 1000 = 48 batches
+
+   Estimated times per batch:
+   - 820ms for select statement with 1000 items (see linked explain plan)
+   - 900ms for delete statement with 1000 items (see linked explain plan)
+   Total: ~2 sec per batch
+
+   2 mins delay per batch (safe for the given total time per batch)
+
+   48 batches * 2 min per batch = 96 mins to run all the scheduled jobs
+   ```
+
+   The execution time per batch (2 sec in this example) is not included in the calculation
+   for total migration time. The jobs are scheduled 2 minutes apart without knowledge of
+   the execution time.
+
+## Additional tips and strategies
+
+### Nested batching
+
+A strategy to make the migration run faster is to schedule larger batches, and then use `EachBatch`
+within the background migration to perform multiple statements.
+
+The background migration helpers that queue multiple jobs, such as
+`queue_background_migration_jobs_by_range_at_intervals`, use [`EachBatch`](../iterating_tables_in_batches.md).
+The example above has batches of 1000, where each queued job takes two seconds. If the query has been
+optimized to bring the time for the delete statement within the [query performance guidelines](../query_performance.md),
+1000 may be the largest number of records that can be deleted in a reasonable amount of time.
+
+The minimum and most common interval for delaying jobs is two minutes. This results in two seconds
+of work for each two-minute job. There's nothing that prevents you from executing multiple delete
+statements in each background migration job.
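The sub-batching pattern described above can be sketched in plain Ruby. This is a simplified illustration rather than GitLab's actual helper code: a real job would iterate with `EachBatch` on a batchable model, and `NestedBatchJob` is a hypothetical name, but the shape of the loop is the same.

```ruby
# Sketch: one queued background migration job covers a large ID range and
# works through it in fixed-size sub-batches, so each individual
# DELETE/UPDATE statement stays within query time guidelines.
# (Hypothetical class for illustration; a real migration would use
# EachBatch on an ActiveRecord relation instead of each_slice on a Range.)
class NestedBatchJob
  SUB_BATCH_SIZE = 1_000

  # Returns [sub_start_id, sub_end_id] pairs covering start_id..end_id,
  # one pair per sub-batch (one statement each).
  def sub_batches(start_id, end_id)
    (start_id..end_id).each_slice(SUB_BATCH_SIZE).map { |ids| [ids.first, ids.last] }
  end
end
```

With `start_id = 1` and `end_id = 10_000` this yields ten pairs, `[1, 1000]` through `[9001, 10_000]`, which is where an estimate like "10 total updates per batch" comes from.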
+
+Looking at the example above, you could alternatively do:
+
+```plaintext
+Background Migration Details:
+
+47600 items to delete
+batch size = 10_000
+47600 / 10_000 = 5 batches
+
+Estimated times per batch:
+- Records are updated in sub-batches of 1000 => 10_000 / 1000 = 10 total updates
+- 820ms for select statement with 1000 items (see linked explain plan)
+- 900ms for delete statement with 1000 items (see linked explain plan)
+Sub-batch total: ~2 sec per sub-batch
+Total batch time: 2 * 10 = 20 sec per batch
+
+2 mins delay per batch
+
+5 batches * 2 min per batch = 10 mins to run all the scheduled jobs
+```
+
+The batch time of 20 seconds still fits comfortably within the two-minute delay, yet the total run
+time is cut to a tenth, from around 100 minutes to 10 minutes! When dealing with large background
+migrations, this can cut the total migration time by days.
+
+When batching in this way, it is important to look at query times on the higher end
+of the table or relation being updated. `EachBatch` may generate some queries that become much
+slower when dealing with higher ID ranges.
+
+### Delay time
+
+When looking at the batch execution time versus the delay time, the execution time
+should fit comfortably within the delay time for a few reasons:
+
+- To allow for a variance in query times.
+- To allow autovacuum to catch up after periods of high churn.
+
+Never try to optimize by fully filling the delay window, even if you are confident
+the queries themselves have no timing variance.
+
+### Background jobs tracking
+
+NOTE:
+Background migrations with job tracking enabled must call `mark_all_as_succeeded` for each batch, even if no work needs to be done.
+
+`queue_background_migration_jobs_by_range_at_intervals` can create records for each job that is scheduled to run.
+You can enable this behavior by passing `track_jobs: true`. Each record starts with a `pending` status.
Make sure that your worker updates the job status to `succeeded` by calling `Gitlab::Database::BackgroundMigrationJob.mark_all_as_succeeded` in the `perform` method of your background migration. + +```ruby +# Background migration code + +def perform(start_id, end_id) + # do work here + + mark_job_as_succeeded(start_id, end_id) +end + +private + +# Make sure that the arguments passed here match those passed to the background +# migration +def mark_job_as_succeeded(*arguments) + Gitlab::Database::BackgroundMigrationJob.mark_all_as_succeeded( + self.class.name.demodulize, + arguments + ) +end +``` + +```ruby +# Post deployment migration +MIGRATION = 'YourBackgroundMigrationName' +DELAY_INTERVAL = 2.minutes.to_i # can be different +BATCH_SIZE = 10_000 # can be different + +disable_ddl_transaction! + +def up + queue_background_migration_jobs_by_range_at_intervals( + define_batchable_model('name_of_the_table_backing_the_model'), + MIGRATION, + DELAY_INTERVAL, + batch_size: BATCH_SIZE, + track_jobs: true + ) +end + +def down + # no-op +end +``` + +See [`lib/gitlab/background_migration/drop_invalid_vulnerabilities.rb`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/gitlab/background_migration/drop_invalid_vulnerabilities.rb) for a full example. + +#### Rescheduling pending jobs + +You can reschedule pending migrations from the `background_migration_jobs` table by creating a post-deployment migration and calling `requeue_background_migration_jobs_by_range_at_intervals` with the migration name and delay interval. + +```ruby +# Post deployment migration +MIGRATION = 'YourBackgroundMigrationName' +DELAY_INTERVAL = 2.minutes + +disable_ddl_transaction! 
+ +def up + requeue_background_migration_jobs_by_range_at_intervals(MIGRATION, DELAY_INTERVAL) +end + +def down + # no-op +end +``` + +See [`db/post_migrate/20210604070207_retry_backfill_traversal_ids.rb`](https://gitlab.com/gitlab-org/gitlab/blob/master/db/post_migrate/20210604070207_retry_backfill_traversal_ids.rb) for a full example. + +### Viewing failure error logs + +After running a background migration, if any jobs have failed, you can view the logs in [Kibana](https://log.gprd.gitlab.net/goto/5f06a57f768c6025e1c65aefb4075694). +View the production Sidekiq log and filter for: + +- `json.class: BackgroundMigrationWorker` +- `json.job_status: fail` +- `json.meta.caller_id: <MyBackgroundMigrationSchedulingMigrationClassName>` +- `json.args: <MyBackgroundMigrationClassName>` + +Looking at the `json.error_class`, `json.error_message` and `json.error_backtrace` values may be helpful in understanding why the jobs failed. + +Depending on when and how the failure occurred, you may find other helpful information by filtering with `json.class: <MyBackgroundMigrationClassName>`. diff --git a/doc/development/database/database_reviewer_guidelines.md b/doc/development/database/database_reviewer_guidelines.md index 9d5e4821c9f..ca9ca36b156 100644 --- a/doc/development/database/database_reviewer_guidelines.md +++ b/doc/development/database/database_reviewer_guidelines.md @@ -70,7 +70,7 @@ Finally, you can find various guides in the [Database guides](index.md) page tha topics and use cases. The most frequently required during database reviewing are the following: - [Migrations style guide](../migration_style_guide.md) for creating safe SQL migrations. -- [Avoiding downtime in migrations](../avoiding_downtime_in_migrations.md). +- [Avoiding downtime in migrations](avoiding_downtime_in_migrations.md). - [SQL guidelines](../sql.md) for working with SQL queries. 
- [Guidelines for JiHu contributions with database migrations](https://about.gitlab.com/handbook/ceo/chief-of-staff-team/jihu-support/jihu-database-change-process.html)
diff --git a/doc/development/database/deleting_migrations.md b/doc/development/database/deleting_migrations.md
new file mode 100644
index 00000000000..be9009f365d
--- /dev/null
+++ b/doc/development/database/deleting_migrations.md
@@ -0,0 +1,39 @@
+---
+stage: Enablement
+group: Database
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Delete existing migrations
+
+When removing existing migrations from the GitLab project, you have to take into account
+the possibility of the migration having already been included in past releases or in the
+current release, and thus already executed on GitLab.com and/or in self-managed instances.
+
+Because of this, it's not possible to delete existing migrations, as that could lead to:
+
+- Schema inconsistency, as changes introduced into the database were not rolled back properly.
+- Leaving a record in the `schema_migrations` table that points to a migration that no longer exists in the codebase.
+
+Instead of deleting, we can opt for disabling the migration.
+
+## Pre-requisites to disable a migration
+
+Migrations can be disabled if:
+
+- They caused a timeout or general issue on GitLab.com.
+- They are obsolete; for example, the changes are no longer necessary due to a feature change.
+- The migration is a data migration only, that is, the migration does not change the database schema.
+
+## How to disable a data migration?
+
+To disable a migration, the following steps apply to all types of migrations:
+
+1. Turn the migration into a no-op by removing the code inside the `#up`, `#down`,
+   or `#perform` methods, and adding a `# no-op` comment instead.
+1. Add a comment explaining why the code is gone.
+
+Disabling migrations requires the explicit approval of a Database Maintainer.
+
+## Examples
+
+- [Disable scheduling of productivity analytics](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/17253)
diff --git a/doc/development/database/index.md b/doc/development/database/index.md
index 830e97ec533..1dc40e62840 100644
--- a/doc/development/database/index.md
+++ b/doc/development/database/index.md
@@ -23,14 +23,14 @@ info: To determine the technical writer assigned to the Stage/Group associated w

## Migrations

-- [Avoiding downtime in migrations](../avoiding_downtime_in_migrations.md)
+- [Avoiding downtime in migrations](avoiding_downtime_in_migrations.md)
- [SQL guidelines](../sql.md) for working with SQL queries
- [Migrations style guide](../migration_style_guide.md) for creating safe SQL migrations
- [Testing Rails migrations](../testing_guide/testing_migrations_guide.md) guide
-- [Post deployment migrations](../post_deployment_migrations.md)
-- [Background migrations](../background_migrations.md)
+- [Post deployment migrations](post_deployment_migrations.md)
+- [Background migrations](background_migrations.md)
- [Swapping tables](../swapping_tables.md)
-- [Deleting migrations](../deleting_migrations.md)
+- [Deleting migrations](deleting_migrations.md)
- [Partitioning tables](table_partitioning.md)

## Debugging
diff --git a/doc/development/database/loose_foreign_keys.md b/doc/development/database/loose_foreign_keys.md
index 2d3812ec584..2bcdc91202a 100644
--- a/doc/development/database/loose_foreign_keys.md
+++ b/doc/development/database/loose_foreign_keys.md
@@ -191,7 +191,7 @@ ci_pipelines:

### Track record changes

To know about deletions in the `projects` table, configure a `DELETE` trigger
-using a [post-deployment migration](../post_deployment_migrations.md). The
+using a [post-deployment migration](post_deployment_migrations.md). The
trigger needs to be configured only once.
If the model already has at least one `loose_foreign_key` definition, then this step can be skipped: @@ -226,7 +226,7 @@ ON DELETE CASCADE; The migration must run after the `DELETE` trigger is installed and the loose foreign key definition is deployed. As such, it must be a [post-deployment -migration](../post_deployment_migrations.md) dated after the migration for the +migration](post_deployment_migrations.md) dated after the migration for the trigger. If the foreign key is deleted earlier, there is a good chance of introducing data inconsistency which needs manual cleanup: diff --git a/doc/development/database/not_null_constraints.md b/doc/development/database/not_null_constraints.md index de070f7e434..af7d569e282 100644 --- a/doc/development/database/not_null_constraints.md +++ b/doc/development/database/not_null_constraints.md @@ -197,7 +197,7 @@ end If you have to clean up a nullable column for a [high-traffic table](../migration_style_guide.md#high-traffic-tables) (for example, the `artifacts` in `ci_builds`), your background migration will go on for a while and -it will need an additional [background migration cleaning up](../background_migrations.md#cleaning-up) +it will need an additional [background migration cleaning up](background_migrations.md#cleaning-up) in the release after adding the data migration. 
In that rare case you will need 3 releases end-to-end: diff --git a/doc/development/database/post_deployment_migrations.md b/doc/development/database/post_deployment_migrations.md new file mode 100644 index 00000000000..799eefdb875 --- /dev/null +++ b/doc/development/database/post_deployment_migrations.md @@ -0,0 +1,81 @@ +--- +stage: Enablement +group: Database +info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments +--- + +# Post Deployment Migrations + +Post deployment migrations are regular Rails migrations that can optionally be +executed after a deployment. By default these migrations are executed alongside +the other migrations. To skip these migrations you must set the +environment variable `SKIP_POST_DEPLOYMENT_MIGRATIONS` to a non-empty value +when running `rake db:migrate`. + +For example, this would run all migrations including any post deployment +migrations: + +```shell +bundle exec rake db:migrate +``` + +This however skips post deployment migrations: + +```shell +SKIP_POST_DEPLOYMENT_MIGRATIONS=true bundle exec rake db:migrate +``` + +## Deployment Integration + +Say you're using Chef for deploying new versions of GitLab and you'd like to run +post deployment migrations after deploying a new version. Let's assume you +normally use the command `chef-client` to do so. To make use of this feature +you'd have to run this command as follows: + +```shell +SKIP_POST_DEPLOYMENT_MIGRATIONS=true sudo chef-client +``` + +Once all servers have been updated you can run `chef-client` again on a single +server _without_ the environment variable. + +The process is similar for other deployment techniques: first you would deploy +with the environment variable set, then you re-deploy a single +server but with the variable _unset_. 
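The two-pass deployment described above hinges on a single check: whether `SKIP_POST_DEPLOYMENT_MIGRATIONS` holds a non-empty value. As a rough sketch of that behavior (a simplification for illustration, not the actual implementation in the Rake tasks):

```ruby
# Sketch: post-deployment migrations run only when
# SKIP_POST_DEPLOYMENT_MIGRATIONS is unset or empty. Any non-empty
# value (such as "true" or "1") skips them.
def run_post_deployment_migrations?(env = ENV)
  env['SKIP_POST_DEPLOYMENT_MIGRATIONS'].to_s.empty?
end
```

This is why the second pass simply re-runs `rake db:migrate` without the variable: the same migration task runs, and the previously skipped post-deployment migrations are now included.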
+ +## Creating Migrations + +To create a post deployment migration you can use the following Rails generator: + +```shell +bundle exec rails g post_deployment_migration migration_name_here +``` + +This generates the migration file in `db/post_migrate`. These migrations +behave exactly like regular Rails migrations. + +## Use Cases + +Post deployment migrations can be used to perform migrations that mutate state +that an existing version of GitLab depends on. For example, say you want to +remove a column from a table. This requires downtime as a GitLab instance +depends on this column being present while it's running. Normally you'd follow +these steps in such a case: + +1. Stop the GitLab instance +1. Run the migration removing the column +1. Start the GitLab instance again + +Using post deployment migrations we can instead follow these steps: + +1. Deploy a new version of GitLab while ignoring post deployment migrations +1. Re-run `rake db:migrate` but without the environment variable set + +Here we don't need any downtime as the migration takes place _after_ a new +version (which doesn't depend on the column anymore) has been deployed. + +Some other examples where these migrations are useful: + +- Cleaning up data generated due to a bug in GitLab +- Removing tables +- Migrating jobs from one Sidekiq queue to another diff --git a/doc/development/database/rename_database_tables.md b/doc/development/database/rename_database_tables.md index 881adf00ad0..7a76c028042 100644 --- a/doc/development/database/rename_database_tables.md +++ b/doc/development/database/rename_database_tables.md @@ -82,7 +82,7 @@ when naming indexes, so there is a possibility that not all indexes are properly the migration locally, check if there are inconsistently named indexes (`db/structure.sql`). Those can be renamed manually in a separate migration, which can be also part of the release M.N+1. - Foreign key columns might still contain the old table name. 
For smaller tables, follow our [standard column
-rename process](../avoiding_downtime_in_migrations.md#renaming-columns)
+rename process](avoiding_downtime_in_migrations.md#renaming-columns)
- Avoid renaming database tables that are used with triggers.
- Table modifications (adding or removing columns) are not allowed during the rename process.
Please make sure that all changes to the table happen before the rename migration is started (or in the next release).
- As the index names might change, verify that the model does not use bulk insert
diff --git a/doc/development/database/strings_and_the_text_data_type.md b/doc/development/database/strings_and_the_text_data_type.md
index 9674deb4603..4ed7cf1b4de 100644
--- a/doc/development/database/strings_and_the_text_data_type.md
+++ b/doc/development/database/strings_and_the_text_data_type.md
@@ -229,7 +229,7 @@ end

To keep this guide short, we skipped the definition of the background migration and only
provided a high level example of the post-deployment migration that is used to schedule the batches.
-You can find more information on the guide about [background migrations](../background_migrations.md)
+You can find more information in the guide about [background migrations](background_migrations.md)

#### Validate the text limit (next release)

@@ -277,7 +277,7 @@ end

If you have to clean up a text column for a really [large table](https://gitlab.com/gitlab-org/gitlab/-/blob/master/rubocop/rubocop-migrations.yml#L3)
(for example, the `artifacts` in `ci_builds`), your background migration will go on for a while and
-it will need an additional [background migration cleaning up](../background_migrations.md#cleaning-up)
+it will need an additional [background migration cleaning up](background_migrations.md#cleaning-up)
in the release after adding the data migration.
In that rare case you will need 3 releases end-to-end:
diff --git a/doc/development/database/table_partitioning.md b/doc/development/database/table_partitioning.md
index 5319c73aad0..ec768136404 100644
--- a/doc/development/database/table_partitioning.md
+++ b/doc/development/database/table_partitioning.md
@@ -214,7 +214,7 @@ end
```

This step uses the same mechanism as any background migration, so you
-may want to read the [Background Migration](../background_migrations.md)
+may want to read the [Background Migration](background_migrations.md)
guide for details on that process. Background jobs are scheduled every
2 minutes and copy `50_000` records at a time, which can be used to
estimate the timing of the background migration portion of the migration.