1 files changed, 259 insertions, 0 deletions
diff --git a/doc/development/background_migrations.md b/doc/development/background_migrations.md
new file mode 100644
index 00000000000..f83a60e49e8
--- /dev/null
+++ b/doc/development/background_migrations.md
@@ -0,0 +1,259 @@
+# Background Migrations
+
+Background migrations can be used to perform data migrations that would
+otherwise take a very long time (hours, days, years, etc) to complete. For
+example, you can use background migrations to migrate data so that instead of
+storing data in a single JSON column the data is stored in a separate table.
+
+## When To Use Background Migrations
+
+>**Note:**
+When adding background migrations _you must_ make sure they are announced in the
+monthly release post along with an estimate of how long it will take to complete
+the migrations.
+
+In the vast majority of cases you will want to use a regular Rails migration
+instead. Background migrations should _only_ be used when migrating _data_ in
+tables that have so many rows this process would take hours when performed in a
+regular Rails migration.
+
+Background migrations _may not_ be used to perform schema migrations, they
+should only be used for data migrations.
+
+Some examples where background migrations can be useful:
+
+* Migrating events from one table to multiple separate tables.
+* Populating one column based on JSON stored in another column.
+* Migrating data that depends on the output of exernal services (e.g. an API).
+
+## Isolation
+
+Background migrations must be isolated and can not use application code (e.g.
+models defined in `app/models`). Since these migrations can take a long time to
+run it's possible for new versions to be deployed while they are still running.
+
+It's also possible for different migrations to be executed at the same time.
+This means that different background migrations should not migrate data in a
+way that would cause conflicts.
+
+## Idempotence
+
+Background migrations are executed in a context of a Sidekiq process.
+Usual Sidekiq rules apply, especially the rule that jobs should be small
+and idempotent.
+
+See [Sidekiq best practices guidelines](https://github.com/mperham/sidekiq/wiki/Best-Practices)
+for more details.
+
+Make sure that in case that your migration job is going to be retried data
+integrity is guarateed.
+
+## How It Works
+
+Background migrations are simple classes that define a `perform` method. A
+Sidekiq worker will then execute such a class, passing any arguments to it. All
+migration classes must be defined in the namespace
+`Gitlab::BackgroundMigration`, the files should be placed in the directory
+`lib/gitlab/background_migration/`.
+
+## Scheduling
+
+Scheduling a migration can be done in either a regular migration or a
+post-deployment migration. To do so, simply use the following code while
+replacing the class name and arguments with whatever values are necessary for
+your migration:
+
+```ruby
+BackgroundMigrationWorker.perform_async('BackgroundMigrationClassName', [arg1, arg2, ...])
+```
+
+Usually it's better to enqueue jobs in bulk, for this you can use
+`BackgroundMigrationWorker.perform_bulk`:
+
+```ruby
+BackgroundMigrationWorker.perform_bulk(
+  [['BackgroundMigrationClassName', [1]],
+   ['BackgroundMigrationClassName', [2]]]
+)
+```
+
+You'll also need to make sure that newly created data is either migrated, or
+saved in both the old and new version upon creation. For complex and time
+consuming migrations it's best to schedule a background job using an
+`after_create` hook so this doesn't affect response timings. The same applies to
+updates. Removals in turn can be handled by simply defining foreign keys with
+cascading deletes.
+
+If you would like to schedule jobs in bulk with a delay, you can use
+`BackgroundMigrationWorker.perform_bulk_in`:
+
+```ruby
+jobs = [['BackgroundMigrationClassName', [1]],
+        ['BackgroundMigrationClassName', [2]]]
+
+BackgroundMigrationWorker.perform_bulk_in(5.minutes, jobs)
+```
+
+## Cleaning Up
+
+>**Note:**
+Cleaning up any remaining background migrations _must_ be done in either a major
+or minor release, you _must not_ do this in a patch release.
+
+Because background migrations can take a long time you can't immediately clean
+things up after scheduling them. For example, you can't drop a column that's
+used in the migration process as this would cause jobs to fail. This means that
+you'll need to add a separate _post deployment_ migration in a future release
+that finishes any remaining jobs before cleaning things up (e.g. removing a
+column).
+
+As an example, say you want to migrate the data from column `foo` (containing a
+big JSON blob) to column `bar` (containing a string). The process for this would
+roughly be as follows:
+
+1. Release A:
+  1. Create a migration class that perform the migration for a row with a given ID.
+  1. Deploy the code for this release, this should include some code that will
+     schedule jobs for newly created data (e.g. using an `after_create` hook).
+  1. Schedule jobs for all existing rows in a post-deployment migration. It's
+     possible some newly created rows may be scheduled twice so your migration
+     should take care of this.
+1. Release B:
+  1. Deploy code so that the application starts using the new column and stops
+     scheduling jobs for newly created data.
+  1. In a post-deployment migration you'll need to ensure no jobs remain. To do
+     so you can use `Gitlab::BackgroundMigration.steal` to process any remaining
+     jobs before continueing.
+  1. Remove the old column.
+
+## Example
+
+To explain all this, let's use the following example: the table `services` has a
+field called `properties` which is stored in JSON. For all rows you want to
+extract the `url` key from this JSON object and store it in the `services.url`
+column. There are millions of services and parsing JSON is slow, thus you can't
+do this in a regular migration.
+
+To do this using a background migration we'll start with defining our migration
+class:
+
+```ruby
+class Gitlab::BackgroundMigration::ExtractServicesUrl
+  class Service < ActiveRecord::Base
+    self.table_name = 'services'
+  end
+
+  def perform(service_id)
+    # A row may be removed between scheduling and starting of a job, thus we
+    # need to make sure the data is still present before doing any work.
+    service = Service.select(:properties).find_by(id: service_id)
+
+    return unless service
+
+    begin
+      json = JSON.load(service.properties)
+    rescue JSON::ParserError
+      # If the JSON is invalid we don't want to keep the job around forever,
+      # instead we'll just leave the "url" field to whatever the default value
+      # is.
+      return
+    end
+
+    service.update(url: json['url']) if json['url']
+  end
+end
+```
+
+Next we'll need to adjust our code so we schedule the above migration for newly
+created and updated services. We can do this using something along the lines of
+the following:
+
+```ruby
+class Service < ActiveRecord::Base
+  after_commit :schedule_service_migration, on: :update
+  after_commit :schedule_service_migration, on: :create
+
+  def schedule_service_migration
+    BackgroundMigrationWorker.perform_async('ExtractServicesUrl', [id])
+  end
+end
+```
+
+We're using `after_commit` here to ensure the Sidekiq job is not scheduled
+before the transaction completes as doing so can lead to race conditions where
+the changes are not yet visible to the worker.
+
+Next we'll need a post-deployment migration that schedules the migration for
+existing data. Since we're dealing with a lot of rows we'll schedule jobs in
+batches instead of doing this one by one:
+
+```ruby
+class ScheduleExtractServicesUrl < ActiveRecord::Migration
+  disable_ddl_transaction!
+
+  class Service < ActiveRecord::Base
+    self.table_name = 'services'
+  end
+
+  def up
+    Service.select(:id).in_batches do |relation|
+      jobs = relation.pluck(:id).map do |id|
+        ['ExtractServicesUrl', [id]]
+      end
+
+      BackgroundMigrationWorker.perform_bulk(jobs)
+    end
+  end
+
+  def down
+  end
+end
+```
+
+Once deployed our application will continue using the data as before but at the
+same time will ensure that both existing and new data is migrated.
+
+In the next release we can remove the `after_commit` hooks and related code. We
+will also need to add a post-deployment migration that consumes any remaining
+jobs. Such a migration would look like this:
+
+```ruby
+class ConsumeRemainingExtractServicesUrlJobs < ActiveRecord::Migration
+  disable_ddl_transaction!
+
+  def up
+    Gitlab::BackgroundMigration.steal('ExtractServicesUrl')
+  end
+
+  def down
+  end
+end
+```
+
+This migration will then process any jobs for the ExtractServicesUrl migration
+and continue once all jobs have been processed. Once done you can safely remove
+the `services.properties` column.
+
+## Testing
+
+It is required to write tests for background migrations' scheduling migration
+(either a regular migration or a post deployment migration), background
+migration itself and a cleanup migration. You can use the `:migration` RSpec
+tag when testing a regular / post deployment migration.
+See [README][migrations-readme].
+
+When you do that, keep in mind that `before` and `after` RSpec hooks are going
+to migrate you database down and up, which can result in other background
+migrations being called. That means that using `spy` test doubles with
+`have_received` is encouraged, instead of using regular test doubles, because
+your expectations defined in a `it` block can conflict with what is being
+called in RSpec hooks. See [gitlab-org/gitlab-ce#35351][issue-rspec-hooks]
+for more details.
+
+## Best practices
+
+1. Make sure that background migration jobs are idempotent.
+1. Make sure that tests you write are not false positives.
+
+[migrations-readme]: https://gitlab.com/gitlab-org/gitlab-ce/blob/master/spec/migrations/README.md
+[issue-rspec-hooks]: https://gitlab.com/gitlab-org/gitlab-ce/issues/35351