summaryrefslogtreecommitdiff
path: root/doc/development/database/adding_database_indexes.md
blob: e1d5a7af6d96768c0a0ec76480c6895f77e7e4d9 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
---
stage: Data Stores
group: Database
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
---

# Adding Database Indexes

Indexes can be used to speed up database queries, but when should you add a new
index? Traditionally the answer to this question has been to add an index for
every column used for filtering or joining data. For example, consider the
following query:

```sql
SELECT *
FROM projects
WHERE user_id = 2;
```

Here we are filtering by the `user_id` column and as such a developer may decide
to index this column.

While in certain cases indexing columns using the above approach may make sense,
it can actually have a negative impact. Whenever you write data to a table, any
existing indexes must also be updated. The more indexes there are, the slower this
can potentially become. Indexes can also take up significant disk space, depending
on the amount of data indexed and the index type. For example, PostgreSQL offers
`GIN` indexes which can be used to index certain data types that cannot be
indexed by regular B-tree indexes. These indexes, however, generally take up more
data and are slower to update compared to B-tree indexes.

Because of all this, it's important make the following considerations
when adding a new index:

1. Do the new queries re-use as many existing indexes as possible?
1. Is there enough data that using an index is faster than iterating over
   rows in the table?
1. Is the overhead of maintaining the index worth the reduction in query
   timings?

## Re-using Queries

The first step is to make sure your query re-uses as many existing indexes as
possible. For example, consider the following query:

```sql
SELECT *
FROM todos
WHERE user_id = 123
AND state = 'open';
```

Now imagine we already have an index on the `user_id` column but not on the
`state` column. One may think this query performs badly due to `state` being
unindexed. In reality the query may perform just fine given the index on
`user_id` can filter out enough rows.

The best way to determine if indexes are re-used is to run your query using
`EXPLAIN ANALYZE`. Depending on the joined tables and the columns being used for filtering,
you may find an extra index doesn't make much, if any, difference.

In short:

1. Try to write your query in such a way that it re-uses as many existing
   indexes as possible.
1. Run the query using `EXPLAIN ANALYZE` and study the output to find the most
   ideal query.

## Data Size

A database may not use an index even when a regular sequence scan
(iterating over all rows) is faster, especially for small tables.

Consider adding an index if a table is expected to grow, and your query has to filter a lot of rows.
You may _not_ want to add an index if the table size is small (<`1,000` records),
or if existing indexes already filter out enough rows.

## Maintenance Overhead

Indexes have to be updated on every table write. In the case of PostgreSQL, _all_
existing indexes are updated whenever data is written to a table. As a
result, having many indexes on the same table slows down writes. It's therefore important
to balance query performance with the overhead of maintaining an extra index.

Let's say that adding an index reduces SELECT timings by 5 milliseconds but increases
INSERT/UPDATE/DELETE timings by 10 milliseconds. In this case, the new index may not be worth
it. A new index is more valuable when SELECT timings are reduced and INSERT/UPDATE/DELETE
timings are unaffected.

## Finding Unused Indexes

To see which indexes are unused you can run the following query:

```sql
SELECT relname as table_name, indexrelname as index_name, idx_scan, idx_tup_read, idx_tup_fetch, pg_size_pretty(pg_relation_size(indexrelname::regclass))
FROM pg_stat_all_indexes
WHERE schemaname = 'public'
AND idx_scan = 0
AND idx_tup_read = 0
AND idx_tup_fetch = 0
ORDER BY pg_relation_size(indexrelname::regclass) desc;
```

This query outputs a list containing all indexes that are never used and sorts
them by indexes sizes in descending order. This query helps in
determining whether existing indexes are still required. More information on
the meaning of the various columns can be found at
<https://www.postgresql.org/docs/current/monitoring-stats.html>.

To determine if an index is still being used on production, use [Thanos](https://thanos-query.ops.gitlab.net/graph?g0.expr=sum%20by%20(type)(rate(pg_stat_user_indexes_idx_scan%7Benv%3D%22gprd%22%2C%20indexrelname%3D%22INSERT%20INDEX%20NAME%20HERE%22%7D%5B30d%5D))&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D):

```sql
sum by (type)(rate(pg_stat_user_indexes_idx_scan{env="gprd", indexrelname="INSERT INDEX NAME HERE"}[30d]))
```

Because the query output relies on the actual usage of your database, it
may be affected by factors such as:

- Certain queries never being executed, thus not being able to use certain
  indexes.
- Certain tables having little data, resulting in PostgreSQL using sequence
  scans instead of index scans.

This data is only reliable for a frequently used database with
plenty of data, and using as many GitLab features as possible.

## Requirements for naming indexes

Indexes with complex definitions must be explicitly named rather than
relying on the implicit naming behavior of migration methods. In short,
that means you **must** provide an explicit name argument for an index
created with one or more of the following options:

- `where`
- `using`
- `order`
- `length`
- `type`
- `opclass`

### Considerations for index names

Check our [Constraints naming conventions](constraint_naming_convention.md) page.

### Why explicit names are required

As Rails is database agnostic, it generates an index name only
from the required options of all indexes: table name and column names.
For example, imagine the following two indexes are created in a migration:

```ruby
def up
  add_index :my_table, :my_column

  add_index :my_table, :my_column, where: 'my_column IS NOT NULL'
end
```

Creation of the second index would fail, because Rails would generate
the same name for both indexes.

This naming issue is further complicated by the behavior of the `index_exists?` method.
It considers only the table name, column names, and uniqueness specification
of the index when making a comparison. Consider:

```ruby
def up
  unless index_exists?(:my_table, :my_column, where: 'my_column IS NOT NULL')
    add_index :my_table, :my_column, where: 'my_column IS NOT NULL'
  end
end
```

The call to `index_exists?` returns true if **any** index exists on
`:my_table` and `:my_column`, and index creation is bypassed.

The `add_concurrent_index` helper is a requirement for creating indexes
on populated tables. Because it cannot be used inside a transactional
migration, it has a built-in check that detects if the index already
exists. In the event a match is found, index creation is skipped.
Without an explicit name argument, Rails can return a false positive
for `index_exists?`, causing a required index to not be created
properly. By always requiring a name for certain types of indexes, the
chance of error is greatly reduced.

## Temporary indexes

There may be times when an index is only needed temporarily.

For example, in a migration, a column of a table might be conditionally
updated. To query which columns must be updated in the
[query performance guidelines](query_performance.md), an index is needed
that would otherwise not be used.

In these cases, consider a temporary index. To specify a
temporary index:

1. Prefix the index name with `tmp_` and follow the [naming conventions](constraint_naming_convention.md).
1. Create a follow-up issue to remove the index in the next (or future) milestone.
1. Add a comment in the migration mentioning the removal issue.

A temporary migration would look like:

```ruby
INDEX_NAME = 'tmp_index_projects_on_owner_where_emails_disabled'

def up
  # Temporary index to be removed in 13.9 https://gitlab.com/gitlab-org/gitlab/-/issues/1234
  add_concurrent_index :projects, :creator_id, where: 'emails_disabled = false', name: INDEX_NAME
end

def down
  remove_concurrent_index_by_name :projects, INDEX_NAME
end
```

## Indexes for partitioned tables

Indexes [cannot be created](https://www.postgresql.org/docs/15/ddl-partitioning.html#DDL-PARTITIONING-DECLARATIVE-MAINTENANCE)
**concurrently** on a partitioned table. You must use `CONCURRENTLY` to avoid service disruption in a hot system.

To create an index on a partitioned table, use `add_concurrent_partitioned_index`, provided by the database team.

Under the hood, `add_concurrent_partitioned_index`:

1. Creates indexes on each partition using `CONCURRENTLY`.
1. Creates an index on the parent table.

A Rails migration example:

```ruby
# in db/post_migrate/

class AddIndexToPartitionedTable < Gitlab::Database::Migration[2.1]
  include Gitlab::Database::PartitioningMigrationHelpers

  disable_ddl_transaction!

  TABLE_NAME = :table_name
  COLUMN_NAMES = [:partition_id, :id]
  INDEX_NAME = :index_name

  def up
    add_concurrent_partitioned_index(TABLE_NAME, COLUMN_NAMES, name: INDEX_NAME)
  end

  def down
    remove_concurrent_partitioned_index_by_name(TABLE_NAME, INDEX_NAME)
  end
end
```

## Create indexes asynchronously

For very large tables, index creation can be a challenge to manage.
While `add_concurrent_index` creates indexes in a way that does not block
normal traffic, it can still be problematic when index creation runs for
many hours. Necessary database operations like `autovacuum` cannot run, and
on GitLab.com, the deployment process is blocked waiting for index
creation to finish.

To limit impact on GitLab.com, a process exists to create indexes
asynchronously during weekend hours. Due to generally lower traffic and fewer deployments,
index creation can proceed at a lower level of risk.

### Schedule index creation for a low-impact time

1. [Schedule the index to be created](#schedule-the-index-to-be-created).
1. [Verify the MR was deployed and the index exists in production](#verify-the-mr-was-deployed-and-the-index-exists-in-production).
1. [Add a migration to create the index synchronously](#add-a-migration-to-create-the-index-synchronously).

### Schedule the index to be created

Create an MR with a post-deployment migration which prepares the index
for asynchronous creation. An example of creating an index using
the asynchronous index helpers can be seen in the block below. This migration
enters the index name and definition into the `postgres_async_indexes`
table. The process that runs on weekends pulls indexes from this
table and attempt to create them.

```ruby
# in db/post_migrate/

INDEX_NAME = 'index_ci_builds_on_some_column'

def up
  prepare_async_index :ci_builds, :some_column, name: INDEX_NAME
end

def down
  unprepare_async_index :ci_builds, :some_column, name: INDEX_NAME
end
```

### Verify the MR was deployed and the index exists in production

1. Verify that the post-deploy migration was executed on GitLab.com using ChatOps with
   `/chatops run auto_deploy status <merge_sha>`. If the output returns `db/gprd`,
   the post-deploy migration has been executed in the production database. For more information, see
   [How to determine if a post-deploy migration has been executed on GitLab.com](https://gitlab.com/gitlab-org/release/docs/-/blob/master/general/post_deploy_migration/readme.md#how-to-determine-if-a-post-deploy-migration-has-been-executed-on-gitlabcom).
1. In the case of an [index created asynchronously](#schedule-the-index-to-be-created), wait
   until the next week so that the index can be created over a weekend.
1. Use [Database Lab](database_lab.md) to check [if creation was successful](database_lab.md#checking-indexes).
   Ensure the output does not indicate the index is `invalid`.

### Add a migration to create the index synchronously

After the index is verified to exist on the production database, create a second
merge request that adds the index synchronously. The schema changes must be
updated and committed to `structure.sql` in this second merge request.
The synchronous migration results in a no-op on GitLab.com, but you should still add the
migration as expected for other installations. The below block
demonstrates how to create the second migration for the previous
asynchronous example.

**WARNING:**
Verify that the index exists in production before merging a second migration with `add_concurrent_index`.
If the second migration is deployed before the index has been created,
the index is created synchronously when the second migration executes.

```ruby
# in db/post_migrate/

INDEX_NAME = 'index_ci_builds_on_some_column'

disable_ddl_transaction!

def up
  add_concurrent_index :ci_builds, :some_column, name: INDEX_NAME
end

def down
  remove_concurrent_index_by_name :ci_builds, INDEX_NAME
end
```

## Test database index changes locally

You must test the database index changes locally before creating a merge request.

### Verify indexes created asynchronously

Use the asynchronous index helpers on your local environment to test changes for creating an index:

1. Enable the feature flags by running `Feature.enable(:database_async_index_creation)` and `Feature.enable(:database_reindexing)` in the Rails console.
1. Run `bundle exec rails db:migrate` so that it creates an entry in the `postgres_async_indexes` table.
1. Run `bundle exec rails gitlab:db:reindex` so that the index is created asynchronously.
1. To verify the index, open the PostgreSQL console using the [GDK](https://gitlab.com/gitlab-org/gitlab-development-kit/-/blob/main/doc/howto/postgresql.md) command `gdk psql` and run the command `\d <index_name>` to check that your newly created index exists.

## Drop indexes asynchronously

For very large tables, index destruction can be a challenge to manage.
While `remove_concurrent_index` removes indexes in a way that does not block
normal traffic, it can still be problematic if index destruction runs for
during `autovacuum`. Necessary database operations like `autovacuum` cannot run, and
the deployment process on GitLab.com is blocked while waiting for index
destruction to finish.

To limit the impact on GitLab.com, use the following process to remove indexes
asynchronously during weekend hours. Due to generally lower traffic and fewer deployments,
index destruction can proceed at a lower level of risk.

1. [Schedule the index to be removed](#schedule-the-index-to-be-removed).
1. [Verify the MR was deployed and the index exists in production](#verify-the-mr-was-deployed-and-the-index-no-longer-exists-in-production).
1. [Add a migration to destroy the index synchronously](#add-a-migration-to-destroy-the-index-synchronously).

### Schedule the index to be removed

Create an MR with a post-deployment migration which prepares the index
for asynchronous destruction. For example. to destroy an index using
the asynchronous index helpers:

```ruby
# in db/post_migrate/

INDEX_NAME = 'index_ci_builds_on_some_column'

def up
  prepare_async_index_removal :ci_builds, :some_column, name: INDEX_NAME
end

def down
  unprepare_async_index :ci_builds, :some_column, name: INDEX_NAME
end
```

This migration enters the index name and definition into the `postgres_async_indexes`
table. The process that runs on weekends pulls indexes from this table and attempt
to remove them.

You must test the database index changes locally before creating a merge request.

### Verify the MR was deployed and the index no longer exists in production

1. Verify that the post-deploy migration was executed on GitLab.com using ChatOps with
   `/chatops run auto_deploy status <merge_sha>`. If the output returns `db/gprd`,
   the post-deploy migration has been executed in the production database. For more information, see
   [How to determine if a post-deploy migration has been executed on GitLab.com](https://gitlab.com/gitlab-org/release/docs/-/blob/master/general/post_deploy_migration/readme.md#how-to-determine-if-a-post-deploy-migration-has-been-executed-on-gitlabcom).
1. In the case of an [index removed asynchronously](#schedule-the-index-to-be-removed), wait
   until the next week so that the index can be created over a weekend.
1. Use Database Lab [to check if removal was successful](database_lab.md#checking-indexes).
   [Database Lab](database_lab.md)
   should report an error when trying to find the removed index. If not, the index may still exist.

### Add a migration to destroy the index synchronously

After you verify the index no longer exists in the production database, create a second
merge request that removes the index synchronously. The schema changes must be
updated and committed to `structure.sql` in this second merge request.
The synchronous migration results in a no-op on GitLab.com, but you should still add the
migration as expected for other installations. For example, to
create the second migration for the previous asynchronous example:

**WARNING:**
Verify that the index no longer exists in production before merging a second migration with `remove_concurrent_index_by_name`.
If the second migration is deployed before the index has been destroyed,
the index is destroyed synchronously when the second migration executes.

```ruby
# in db/post_migrate/

INDEX_NAME = 'index_ci_builds_on_some_column'

disable_ddl_transaction!

def up
  remove_concurrent_index_by_name :ci_builds, name: INDEX_NAME
end

def down
  add_concurrent_index :ci_builds, :some_column, name: INDEX_NAME
end
```

### Verify indexes removed asynchronously

To test changes for removing an index, use the asynchronous index helpers on your local environment:

1. Enable the feature flags by running `Feature.enable(:database_reindexing)` in the Rails console.
1. Run `bundle exec rails db:migrate` which should create an entry in the `postgres_async_indexes` table.
1. Run `bundle exec rails gitlab:db:reindex` destroy the index asynchronously.
1. To verify the index, open the PostgreSQL console by using the [GDK](https://gitlab.com/gitlab-org/gitlab-development-kit/-/blob/main/doc/howto/postgresql.md)
   command `gdk psql` and run `\d <index_name>` to check that the destroyed index no longer exists.