summaryrefslogtreecommitdiff
path: root/doc/development/database/migrations_for_multiple_databases.md
blob: 0ec4612e9857ddb1b70e0184e4d261612d0ca4e0 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
---
stage: Enablement
group: Database
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---

# Migrations for Multiple databases

> Support for describing migration purposes was [introduced](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/73756) in GitLab 14.8.

This document describes how to properly write database migrations
for [the decomposed GitLab application using multiple databases](https://gitlab.com/groups/gitlab-org/-/epics/6168).

Learn more about general multiple databases support in a [separate document](multiple_databases.md).

WARNING:
If you experience any issues using `Gitlab::Database::Migration[2.0]`,
you can temporarily revert back to the previous behavior by changing the version to `Gitlab::Database::Migration[1.0]`.
Please report any issues with `Gitlab::Database::Migration[2.0]` in [this issue](https://gitlab.com/gitlab-org/gitlab/-/issues/358430).

The design for multiple databases (except for the Geo database) assumes
that all decomposed databases have **the same structure** (for example, schema), but **the data is different** in each database. This means that some tables do not contain data on each database.

## Operations

Depending on the used constructs, we can classify migrations to be either:

1. Modifying structure ([DDL - Data Definition Language](https://www.postgresql.org/docs/current/ddl.html)) (for example, `ALTER TABLE`).
1. Modifying data ([DML - Data Manipulation Language](https://www.postgresql.org/docs/current/dml.html)) (for example, `UPDATE`).
1. Performing [other queries](https://www.postgresql.org/docs/current/queries.html) (for example, `SELECT`) that are treated as **DML** for the purposes of our migrations.

**The usage of `Gitlab::Database::Migration[2.0]` requires migrations to always be of a single purpose**.
Migrations cannot mix **DDL** and **DML** changes as the application requires the structure
(as described by `db/structure.sql`) to be exactly the same across all decomposed databases.

### Data Definition Language (DDL) 

The DDL migrations are all migrations that:

1. Create or drop a table (for example, `create_table`).
1. Add or remove an index (for example, `add_index`, `add_index_concurrently`).
1. Add or remove a foreign key (for example `add_foreign_key`, `add_foreign_key_concurrently`).
1. Add or remove a column with or without a default value (for example, `add_column`).
1. Create or drop trigger functions (for example, `create_trigger_function`).
1. Attach or detach triggers from tables (for example, `track_record_deletions`, `untrack_record_deletions`).
1. Prepare or not async indexes (for example, `prepare_async_index`, `unprepare_async_index_by_name`).

As such DDL migrations **CANNOT**:

1. Read or modify data in any form, via SQL statements or ActiveRecord models.
1. Update column values (for example, `update_column_in_batches`).
1. Schedule background migrations (for example, `queue_background_migration_jobs_by_range_at_intervals`).
1. Read the state of feature flags since they are stored in `main:` (a `features` and `feature_gates`).
1. Read application settings (as settings are stored in `main:`).

As the majority of migrations in the GitLab codebase are of the DDL-type,
this is also the default mode of operation and requires no further changes
to the migrations files.

#### Example: perform DDL on all databases

Example migration adding a concurrent index that is treated as change of the structure (DDL)
that is executed on all configured databases.

```ruby
class AddUserIdAndStateIndexToMergeRequestReviewers < Gitlab::Database::Migration[2.0]
  disable_ddl_transaction!

  INDEX_NAME = 'index_on_merge_request_reviewers_user_id_and_state'

  def up
    add_concurrent_index :merge_request_reviewers, [:user_id, :state], where: 'state = 2', name: INDEX_NAME
  end

  def down
    remove_concurrent_index_by_name :merge_request_reviewers, INDEX_NAME
  end
end
```

### Data Manipulation Language (DML)

The DML migrations are all migrations that:

1. Read data via SQL statements (for example, `SELECT * FROM projects WHERE id=1`).
1. Read data via ActiveRecord models (for example, `User < MigrationRecord`).
1. Create, update or delete data via ActiveRecord models (for example, `User.create!(...)`).
1. Create, update or delete data via SQL statements (for example, `DELETE FROM projects WHERE id=1`).
1. Update columns in batches (for example, `update_column_in_batches(:projects, :archived, true)`).
1. Schedule background migrations (for example, `queue_background_migration_jobs_by_range_at_intervals`).
1. Access application settings (for example, `ApplicationSetting.last` if run for `main:` database).
1. Read and modify feature flags if run for the `main:` database.

The DML migrations **CANNOT**:

1. Make any changes to DDL since this breaks the rule of keeping `structure.sql` coherent across
   all decomposed databases.
1. **Read data from another database**.

To indicate the `DML` migration type, a migration must use the `restrict_gitlab_migration gitlab_schema:`
syntax in a migration class. This marks the given migration as DML and restricts access to it.

#### Example: perform DML only in context of the database containing the given `gitlab_schema`

Example migration updating `archived` column of `projects` that is executed
only for the database containing `gitlab_main` schema.

```ruby
class UpdateProjectsArchivedState < Gitlab::Database::Migration[2.0]
  disable_ddl_transaction!

  restrict_gitlab_migration gitlab_schema: :gitlab_main

  def up
    update_column_in_batches(:projects, :archived, true) do |table, query|
      query.where(table[:archived].eq(false)) # rubocop:disable CodeReuse/ActiveRecord
    end
  end

  def down
    # no-op
  end
end
```

#### Example: usage of `ActiveRecord` classes

A migration using `ActiveRecord` class to perform data manipulation
must use the `MigrationRecord` class. This class is guaranteed to provide
a correct connection in a context of a given migration.

Underneath the `MigrationRecord == ActiveRecord::Base`, as once the `db:migrate`
runs, it switches the active connection of `ActiveRecord::Base.establish_connection :ci`.
To avoid confusion to using the `ActiveRecord::Base`, `MigrationRecord` is required.

This implies that DML migrations are forbidden to read data from other
databases. For example, running migration in context of `ci:` and reading feature flags
from `main:`, as no established connection to another database is present.

```ruby
class UpdateProjectsArchivedState < Gitlab::Database::Migration[2.0]
  disable_ddl_transaction!

  restrict_gitlab_migration gitlab_schema: :gitlab_main

  class Project < MigrationRecord
  end

  def up
    Project.where(archived: false).each_batch of |batch|
      batch.update_all(archived: true)
    end
  end

  def down
  end
end
```

### The special purpose of `gitlab_shared`

As described in [gitlab_schema](multiple_databases.md#the-special-purpose-of-gitlab_shared),
the `gitlab_shared` tables are allowed to contain data across all databases. This implies
that such migrations should run across all databases to modify structure (DDL) or modify data (DML).

As such migrations accessing `gitlab_shared` do not need to use `restrict_gitlab_migration gitlab_schema:`,
migrations without restriction run across all databases and are allowed to modify data on each of them.
If the `restrict_gitlab_migration gitlab_schema:` is specified, the `DML` migration
runs only in a context of a database containing the given `gitlab_schema`.

#### Example: run DML `gitlab_shared` migration on all databases

Example migration updating `loose_foreign_keys_deleted_records` table
that is marked in `lib/gitlab/database/gitlab_schemas.yml` as `gitlab_shared`.

This migration is executed across all configured databases.

```ruby
class DeleteAllLooseForeignKeyRecords < Gitlab::Database::Migration[2.0]
  disable_ddl_transaction!

  def up
    execute("DELETE FROM loose_foreign_keys_deleted_records")
  end

  def down
    # no-op
  end
end
```

#### Example: run DML `gitlab_shared` only on the database containing the given `gitlab_schema`

Example migration updating `loose_foreign_keys_deleted_records` table
that is marked in `lib/gitlab/database/gitlab_schemas.yml` as `gitlab_shared`.

This migration since it configures restriction on `gitlab_ci` is executed only
in context of database containing `gitlab_ci` schema.

```ruby
class DeleteCiBuildsLooseForeignKeyRecords < Gitlab::Database::Migration[2.0]
  disable_ddl_transaction!

  restrict_gitlab_migration gitlab_schema: :gitlab_ci

  def up
    execute("DELETE FROM loose_foreign_keys_deleted_records WHERE fully_qualified_table_name='ci_builds'")
  end

  def down
    # no-op
  end
end
```

### The behavior of skipping migrations

The only migrations that are skipped are the ones performing **DML** changes.
The **DDL** migrations are **always and unconditionally** executed.

The implemented [solution](https://gitlab.com/gitlab-org/gitlab/-/issues/355014#solution-2-use-database_tasks)
uses the `database_tasks:` as a way to indicate which additional database configurations
(in `config/database.yml`) share the same primary database. The database configurations
marked with `database_tasks: false` are exempt from executing `db:migrate` for those
database configurations.

If database configurations do not share databases (all do have `database_tasks: true`),
each migration runs for every database configuration:

1. The DDL migration applies all structure changes on all databases.
1. The DML migration runs only in the context of a database containing the given `gitlab_schema:`.
1. If the DML migration is not eligible to run, it is skipped. It's still
   marked as executed in `schema_migrations`. While running `db:migrate`, the skipped
   migration outputs `Current migration is skipped since it modifies 'gitlab_ci' which is outside of 'gitlab_main, gitlab_shared`.

To prevent loss of migrations if the `database_tasks: false` is configured, a dedicated
Rake task is used [`gitlab:db:validate_config`](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/83118).
The `gitlab:db:validate_config` validates the correctness of `database_tasks:` by checking database identifiers
of each underlying database configuration. The ones that share the database are required to have
the `database_tasks: false` set. `gitlab:db:validate_config` always runs before `db:migrate`.

## Validation

Validation in a nutshell uses [pg_query](https://github.com/pganalyze/pg_query) to analyze
each query and classify tables with information from [`gitlab_schema.yml`](multiple_databases.md#gitlab-schema).
The migration is skipped if the specified `gitlab_schema` is outside of a list of schemas
managed by a given database connection (`Gitlab::Database::gitlab_schemas_for_connection`).

The `Gitlab::Database::Migration[2.0]` includes `Gitlab::Database::MigrationHelpers::RestrictGitlabSchema`
which extends the `#migrate` method. For the duration of a migration a dedicated query analyzer
is installed `Gitlab::Database::QueryAnalyzers::RestrictAllowedSchemas` that accepts
a list of allowed schemas as defined by `restrict_gitlab_migration:`. If the executed query
is outside of allowed schemas, it raises an exception.

## Exceptions

Depending on misuse or lack of `restrict_gitlab_migration` various exceptions can be raised
as part of the migration run and prevent the migration from being completed.

### Exception 1: migration running in DDL mode does DML select

```ruby
class UpdateProjectsArchivedState < Gitlab::Database::Migration[2.0]
  disable_ddl_transaction!

  # Missing:
  # restrict_gitlab_migration gitlab_schema: :gitlab_main

  def up
    update_column_in_batches(:projects, :archived, true) do |table, query|
      query.where(table[:archived].eq(false)) # rubocop:disable CodeReuse/ActiveRecord
    end
  end

  def down
    # no-op
  end
end
```

```plaintext
Select/DML queries (SELECT/UPDATE/DELETE) are disallowed in the DDL (structure) mode
Modifying of 'projects' (gitlab_main) with 'SELECT * FROM projects...
```

The current migration do not use `restrict_gitlab_migration`. The lack indicates a migration
running in **DDL** mode, but the executed payload appears to be reading data from `projects`.

**The solution** is to add `restrict_gitlab_migration gitlab_schema: :gitlab_main`.

### Exception 2: migration running in DML mode changes the structure

```ruby
class AddUserIdAndStateIndexToMergeRequestReviewers < Gitlab::Database::Migration[2.0]
  disable_ddl_transaction!

  # restrict_gitlab_migration if defined indicates DML, it should be removed
  restrict_gitlab_migration gitlab_schema: :gitlab_main

  INDEX_NAME = 'index_on_merge_request_reviewers_user_id_and_state'

  def up
    add_concurrent_index :merge_request_reviewers, [:user_id, :state], where: 'state = 2', name: INDEX_NAME
  end

  def down
    remove_concurrent_index_by_name :merge_request_reviewers, INDEX_NAME
  end
end
```

```plaintext
DDL queries (structure) are disallowed in the Select/DML (SELECT/UPDATE/DELETE) mode.
Modifying of 'merge_request_reviewers' with 'CREATE INDEX...
```

The current migration do use `restrict_gitlab_migration`. The presence indicates **DML** mode,
but the executed payload appears to be doing structure changes (DDL).

**The solution** is to remove `restrict_gitlab_migration gitlab_schema: :gitlab_main`.

### Exception 3: migration running in DML mode accesses data from a table in another schema

```ruby
class UpdateProjectsArchivedState < Gitlab::Database::Migration[2.0]
  disable_ddl_transaction!

  # Since it modifies `projects` it should use `gitlab_main`
  restrict_gitlab_migration gitlab_schema: :gitlab_ci

  def up
    update_column_in_batches(:projects, :archived, true) do |table, query|
      query.where(table[:archived].eq(false)) # rubocop:disable CodeReuse/ActiveRecord
    end
  end

  def down
    # no-op
  end
end
```

```plaintext
Select/DML queries (SELECT/UPDATE/DELETE) do access 'projects' (gitlab_main) " \
which is outside of list of allowed schemas: 'gitlab_ci'
```

The current migration do restrict the migration to `gitlab_ci`, but appears to modify
data in `gitlab_main`.

**The solution** is to change `restrict_gitlab_migration gitlab_schema: :gitlab_ci`.

### Exception 4: mixing DDL and DML mode

```ruby
class UpdateProjectsArchivedState < Gitlab::Database::Migration[2.0]
  disable_ddl_transaction!

  # This migration is invalid regardless of specification
  # as it cannot modify structure and data at the same time
  restrict_gitlab_migration gitlab_schema: :gitlab_ci

  def up
    add_concurrent_index :merge_request_reviewers, [:user_id, :state], where: 'state = 2', name: 'index_on_merge_request_reviewers'
    update_column_in_batches(:projects, :archived, true) do |table, query|
      query.where(table[:archived].eq(false)) # rubocop:disable CodeReuse/ActiveRecord
    end
  end

  def down
    # no-op
  end
end
```

The migrations mixing **DDL** and **DML** depending on ordering of operations raises
one of the prior exceptions.

## Upcoming changes on multiple database migrations

The `restrict_gitlab_migration` using `gitlab_schema:` is considered as a first iteration
of this feature for running migrations selectively depending on a context. It is possible
to add additional restrictions to DML-only migrations (as the structure coherency is likely
to stay as-is until further notice) to restrict when they run.

A Potential extension is to limit running DML migration only to specific environments:

```ruby
restrict_gitlab_migration gitlab_schema: :gitlab_main, gitlab_env: :gitlab_com
```