---
stage: Enablement
group: Database
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---

# Iterating tables in batches

Rails provides a method called `in_batches` that can be used to iterate over
rows in batches. For example:

```ruby
User.in_batches(of: 10) do |relation|
  relation.update_all(updated_at: Time.now)
end
```

Unfortunately, this method is implemented in a way that is not very efficient,
both in terms of query performance and memory usage.

To work around this, you can include the `EachBatch` module into your models,
then use the `each_batch` class method. For example:

```ruby
class User < ActiveRecord::Base
  include EachBatch
end

User.each_batch(of: 10) do |relation|
  relation.update_all(updated_at: Time.now)
end
```

This will end up producing queries such as:

```plaintext
User Load (0.7ms)  SELECT  "users"."id" FROM "users" WHERE ("users"."id" >= 41654)  ORDER BY "users"."id" ASC LIMIT 1 OFFSET 1000
  (0.7ms)  SELECT COUNT(*) FROM "users" WHERE ("users"."id" >= 41654) AND ("users"."id" < 42687)
```

The API of this method is similar to `in_batches`, though it doesn't support
all of the arguments that `in_batches` supports. You should always use
`each_batch` _unless_ you have a specific need for `in_batches`.

## Avoid iterating over non-unique columns

Proceed with extra caution, and possibly avoid iterating over a column that can contain
duplicate values. When you iterate over an attribute that is not unique, even with the applied
maximum batch size, there is no guarantee that the resulting batches do not surpass it. The
following snippet demonstrates this situation: when selecting `Ci::Build` entries for users with
`id` between `1` and `10,000`, the database returns `1,215,178` matching rows.

```ruby
[ gstg ] production> Ci::Build.where(user_id: (1..10_000)).size
=> 1215178
```

This happens because the built relation is translated into the following query:

```ruby
[ gstg ] production> puts Ci::Build.where(user_id: (1..10_000)).to_sql
SELECT "ci_builds".* FROM "ci_builds" WHERE "ci_builds"."type" = 'Ci::Build' AND "ci_builds"."user_id" BETWEEN 1 AND 10000
=> nil
```

The query filters a non-unique column by range (`WHERE "ci_builds"."user_id" BETWEEN ? AND ?`).
Even though the range size is limited to a certain threshold (`10,000` in the previous example),
this threshold does not translate to the size of the returned dataset. That is because when taking
`n` possible values of an attribute, we cannot tell for sure that the number of records containing
them will be fewer than `n`.
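
The same pitfall applies when `each_batch` itself iterates over a non-unique column. A minimal
sketch of the effect, assuming a batch size of 100 (the batch size and model here are illustrative):

```ruby
# Each batch spans at most 100 distinct `user_id` values, but every one of
# those users may own thousands of builds, so the yielded relation is not
# bounded by the batch size.
Ci::Build.distinct.each_batch(column: :user_id, of: 100) do |relation|
  puts relation.count # can be much larger than 100
end
```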

## Column definition

`EachBatch` uses the primary key of the model by default for the iteration. This works in most
cases. However, in some cases you might want to use a different column for the iteration.

```ruby
Project.distinct.each_batch(column: :creator_id, of: 10) do |relation|
  puts User.where(id: relation.select(:creator_id)).map(&:id)
end
```

The query above iterates over the project creators and prints them out without duplications.

NOTE:
If the column is not unique (there is no unique index definition), calling the `distinct` method on
the relation is necessary. Using a non-unique column without `distinct` may result in `each_batch`
falling into an endless loop, as described in the following
[issue](https://gitlab.com/gitlab-org/gitlab/-/issues/285097).
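
For contrast, a minimal sketch of the pattern the note warns about (iterating over the non-unique
`creator_id` column without calling `distinct`):

```ruby
# Without `distinct`, the OFFSET-based lookup of the next batch boundary can
# land on the same duplicated `creator_id` value again and again, so the
# iteration may never advance past that value.
Project.each_batch(column: :creator_id, of: 10) do |relation|
  puts User.where(id: relation.select(:creator_id)).map(&:id)
end
```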

## `EachBatch` in data migrations

When dealing with data migrations, the preferred way to iterate over a large volume of data is
using `EachBatch`.

A special case of data migration is a [background migration](database/background_migrations.md#scheduling)
where the actual data modification is executed in a background job. The migration code that
determines the data ranges (slices) and schedules the background jobs uses `each_batch`.
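
A minimal sketch of such a scheduling migration. The names `ExampleBackfillWorker`, `BATCH_SIZE`,
and `DELAY_INTERVAL` are illustrative, not existing GitLab constants:

```ruby
BATCH_SIZE = 10_000
DELAY_INTERVAL = 2.minutes

def up
  User.each_batch(of: BATCH_SIZE) do |relation, index|
    # Determine the id range (slice) covered by the current batch.
    range = relation.pluck(Arel.sql('MIN(id)'), Arel.sql('MAX(id)')).first

    # Schedule one background job per slice, spread out over time.
    ExampleBackfillWorker.perform_in(index * DELAY_INTERVAL, *range)
  end
end
```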

## Efficient usage of `each_batch`

`EachBatch` helps to iterate over large tables. It's important to highlight that `EachBatch` is
not going to magically solve all iteration-related performance problems, and it might not help at
all in some scenarios. From the database point of view, correctly configured database indexes are
also necessary to make `EachBatch` perform well.

### Example 1: Simple iteration

Let's consider that we want to iterate over the `users` table and print the `User` records to the
standard output. The `users` table contains millions of records, thus running one query to fetch
the users will likely time out.

![Users table overview](img/each_batch_users_table_v13_7.png)

This is a simplified version of the `users` table which contains several rows. We have a few
smaller gaps in the `id` column to make the example a bit more realistic (a few records were
already deleted). Currently, we have one index on the `id` field.

Loading all users into memory (avoid):

```ruby
users = User.all

users.each { |user| puts user.inspect }
```

Use `each_batch`:

```ruby
# Note: for this example I picked 5 as the batch size, the default is 1_000
User.each_batch(of: 5) do |relation|
  relation.each { |user| puts user.inspect }
end
```

#### How `each_batch` works

As the first step, it finds the lowest `id` (start `id`) in the table by executing the following
database query:

```sql
SELECT "users"."id" FROM "users" ORDER BY "users"."id" ASC LIMIT 1
```

![Reading the start ID value](img/each_batch_users_table_iteration_1_v13_7.png)

Notice that the query only reads data from the index (`INDEX ONLY SCAN`), the table is not
accessed. Database indexes are sorted so taking out the first item is a very cheap operation.

The next step is to find the next `id` (end `id`) which should respect the batch size
configuration. In this example we used a batch size of 5. `EachBatch` uses the `OFFSET` clause
to get a "shifted" `id` value.

```sql
SELECT "users"."id" FROM "users" WHERE "users"."id" >= 1 ORDER BY "users"."id" ASC LIMIT 1 OFFSET 5
```

![Reading the end ID value](img/each_batch_users_table_iteration_2_v13_7.png)

Again, the query only looks into the index. The `OFFSET 5` takes out the sixth `id` value: this
query reads a maximum of six items from the index regardless of the table size or the iteration
count.

At this point, we know the `id` range for the first batch. Now it's time to construct the query
for the `relation` block.

```sql
SELECT "users".* FROM "users" WHERE "users"."id" >= 1 AND "users"."id" < 302
```

![Reading the rows from the `users` table](img/each_batch_users_table_iteration_3_v13_7.png)

Notice the `<` sign. Previously six items were read from the index and in this query, the last
value is "excluded". The query will look at the index to get the location of the five `user`
rows on the disk and read the rows from the table. The returned array is processed in Ruby.

The first iteration is done. For the next iteration, the last `id` value is reused from the
previous iteration in order to find out the next end `id` value.

```sql
SELECT "users"."id" FROM "users" WHERE "users"."id" >= 302 ORDER BY "users"."id" ASC LIMIT 1 OFFSET 5
```

![Reading the second end ID value](img/each_batch_users_table_iteration_4_v13_7.png)

Now we can easily construct the `users` query for the second iteration.

```sql
SELECT "users".* FROM "users" WHERE "users"."id" >= 302 AND "users"."id" < 353
```

![Reading the rows for the second iteration from the users table](img/each_batch_users_table_iteration_5_v13_7.png)

### Example 2: Iteration with filters

Building on top of the previous example, we want to print users with zero sign-in count. We keep
track of the number of sign-ins in the `sign_in_count` column so we write the following code:

```ruby
users = User.where(sign_in_count: 0)

users.each_batch(of: 5) do |relation|
  relation.each { |user| puts user.inspect }
end
```

`each_batch` will produce the following SQL query for the start `id` value:

```sql
SELECT "users"."id" FROM "users" WHERE "users"."sign_in_count" = 0 ORDER BY "users"."id" ASC LIMIT 1
```

Selecting only the `id` column and ordering by `id` "forces" the database to use the
index on the `id` (primary key index) column. However, we also have an extra condition on the
`sign_in_count` column. The column is not part of the index, so the database needs to look into
the actual table to find the first matching row.

![Reading the index with extra filter](img/each_batch_users_table_filter_v13_7.png)

NOTE:
The number of scanned rows depends on the data distribution in the table.

- Best case scenario: the first user has never logged in. The database reads only one row.
- Worst case scenario: all users have logged in at least once. The database reads all rows.

In this particular example, the database had to read 10 rows (regardless of our batch size setting)
to determine the first `id` value. In a "real-world" application it's hard to predict whether the
filtering is going to cause problems or not. In the case of GitLab, verifying the data on a
production replica is a good start, but keep in mind that data distribution on GitLab.com can be
different from self-managed instances.

#### Improve filtering with `each_batch`

##### Specialized conditional index

```sql
CREATE INDEX index_on_users_never_logged_in ON users (id) WHERE sign_in_count = 0
```

This is how our table and the newly created index look:

![Reading the specialized index](img/each_batch_users_table_filtered_index_v13_7.png)

This index definition covers the conditions on the `id` and `sign_in_count` columns, thus making
the `each_batch` queries very effective (similar to the simple iteration example).

It's rare for a user to have never signed in, so we anticipate a small index size. Including only
the `id` in the index definition also helps to keep the index size small.

##### Index on columns

Later on, we might want to iterate over the table filtering for different `sign_in_count` values;
in those cases, we cannot use the previously suggested conditional index because the `WHERE`
condition does not match our new filter (`sign_in_count > 10`).

To address this problem, we have two options:

- Create another, conditional index to cover the new query.
- Replace the index with a more generalized configuration.

NOTE:
Having multiple indexes on the same table and on the same columns could be a performance bottleneck
when writing data.

Let's consider the following index (avoid):

```sql
CREATE INDEX index_on_users_never_logged_in ON users (id, sign_in_count)
```

The index definition starts with the `id` column, which makes the index very inefficient from a
data selectivity point of view.

```sql
SELECT "users"."id" FROM "users" WHERE "users"."sign_in_count" = 0 ORDER BY "users"."id" ASC LIMIT 1
```

Executing the query above results in an `INDEX ONLY SCAN`. However, the query still needs to
iterate over an unknown number of entries in the index, and then find the first item where the
`sign_in_count` is `0`.

![Reading an ineffective index](img/each_batch_users_table_bad_index_v13_7.png)

We can improve the query significantly by swapping the columns in the index definition (prefer).

```sql
CREATE INDEX index_on_users_never_logged_in ON users (sign_in_count, id)
```

![Reading a good index](img/each_batch_users_table_good_index_v13_7.png)

The following index definition is not going to work well with `each_batch` (avoid).

```sql
CREATE INDEX index_on_users_never_logged_in ON users (sign_in_count)
```

Since `each_batch` builds range queries based on the `id` column, this index cannot be used
efficiently. The DB reads the rows from the table or uses a bitmap search where the primary
key index is also read.

##### "Slow" iteration

Slow iteration means that we use a good index configuration to iterate over the table and
apply filtering on the yielded relation.

```ruby
User.each_batch(of: 5) do |relation|
  relation.where(sign_in_count: 0).each { |user| puts user.inspect }
end
```

The iteration uses the primary key index (on the `id` column) which makes it safe from statement
timeouts. The filter (`sign_in_count: 0`) is applied on the `relation` where the `id` is already
constrained (range). The number of rows is limited.

Slow iteration generally takes more time to finish. The iteration count is higher and
one iteration could yield fewer records than the batch size. Iterations may even yield
0 records. This is not an optimal solution; however, in some cases (especially when
dealing with large tables) this is the only viable option.

### Using Subqueries

Using subqueries in your `each_batch` query does not work well in most cases. Consider the following example:

```ruby
projects = Project.where(creator_id: Issue.where(confidential: true).select(:author_id))

projects.each_batch do |relation|
  # do something
end
```

The iteration uses the `id` column of the `projects` table. The batching does not affect the
subquery. This means for each iteration, the subquery is executed by the database. This adds a
constant "load" on the query which often ends up in statement timeouts. We have an unknown number
of [confidential issues](../user/project/issues/confidential_issues.md); the execution time
and the accessed database rows depend on the data distribution in the `issues` table.

NOTE:
Using subqueries works only when the subquery returns a small number of rows.

#### Improving Subqueries

When dealing with subqueries, a slow iteration approach could work: the filter on `creator_id`
can be part of the generated `relation` object.

```ruby
projects = Project.all

projects.each_batch do |relation|
  relation.where(creator_id: Issue.where(confidential: true).select(:author_id))
end
```

If the query on the `issues` table itself is not performant enough, a nested loop could be
constructed. Try to avoid it when possible.

```ruby
projects = Project.all

projects.each_batch do |relation|
  issues = Issue.where(confidential: true)

  issues.each_batch do |issues_relation|
    relation.where(creator_id: issues_relation.select(:author_id))
  end
end
```

If we know that the `issues` table has many more rows than `projects`, it would make sense to flip
the queries, where the `issues` table is batched first.
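
A minimal sketch of the flipped iteration, assuming `issues` is the significantly larger table:

```ruby
Issue.where(confidential: true).each_batch do |issues_relation|
  # The projects lookup is bounded by the author IDs in the current issues batch.
  projects = Project.where(creator_id: issues_relation.select(:author_id))

  # do something with the projects
end
```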

### Using `JOIN` and `EXISTS`

When to use `JOIN`:

- When there's a 1:1 or 1:N relationship between the tables where we know that the joined record
(almost) always exists. This works well for "extension-like" tables:
  - `projects` - `project_settings`
  - `users` - `user_details`
  - `users` - `user_statuses`
- `LEFT JOIN` works well in this case. Conditions on the joined table need to go to the yielded
relation so the iteration is not affected by the data distribution in the joined table.

Example:

```ruby
users = User.joins("LEFT JOIN personal_access_tokens on personal_access_tokens.user_id = users.id")

users.each_batch do |relation|
  relation.where("personal_access_tokens.name = 'name'")
end
```

`EXISTS` queries should be added only to the inner `relation` of the `each_batch` query:

```ruby
User.each_batch do |relation|
  relation.where("EXISTS (SELECT 1 FROM ...")
end
```

### Complex queries on the relation object

When the `relation` object has several extra conditions, the execution plans might become
"unstable".

Example:

```ruby
Issue.each_batch do |relation|
  relation
    .joins(:metrics)
    .joins(:merge_requests_closing_issues)
    .where("id IN (SELECT ...)")
    .where(confidential: true)
end
```

Here, we expect that the `relation` query reads the `BATCH_SIZE` of issue records and then
filters down the results according to the provided queries. The planner might decide that
using a bitmap index lookup with the index on the `confidential` column is a better way to
execute the query. This can cause an unexpectedly high amount of rows to be read and the
query could time out.

Problem: we know for sure that the relation returns a maximum of `BATCH_SIZE` records;
however, the planner does not know this.

A common table expression (CTE) trick can be used to force the range query to execute first:

```ruby
Issue.each_batch(of: 1000) do |relation|
  cte = Gitlab::SQL::CTE.new(:batched_relation, relation.limit(1000))

  scope = cte
    .apply_to(Issue.all)
    .joins(:metrics)
    .joins(:merge_requests_closing_issues)
    .where("id IN (SELECT ...)")
    .where(confidential: true)

  puts scope.to_a
end
```

### `EachBatch` vs `BatchCount`

When adding new counters for Service Ping, the preferred way to count records is using the
`Gitlab::Database::BatchCount` class. The iteration logic implemented in `BatchCount`
has similar performance characteristics to `EachBatch`. Most of the tips and suggestions
for improving `EachBatch` mentioned above apply to `BatchCount` as well.
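
A minimal sketch of counting with `BatchCount` (the models and column are illustrative):

```ruby
# Counts all users in batches instead of issuing a single, potentially slow COUNT(*).
Gitlab::Database::BatchCount.batch_count(User)

# Distinct counts work similarly, using a column instead of the primary key.
Gitlab::Database::BatchCount.batch_distinct_count(Issue, :author_id)
```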

## Iterate with keyset pagination

There are a few special cases where iterating with `EachBatch` does not work. `EachBatch`
requires one distinct column (usually the primary key), which makes the iteration impossible
for timestamp columns and tables with composite primary keys.

Where `EachBatch` does not work, you can use
[keyset pagination](database/pagination_guidelines.md#keyset-pagination) to iterate over the
table or a range of rows. The scaling and performance characteristics are very similar to
`EachBatch`.

Examples:

- Iterate over the table in a specific order (timestamp columns) in combination with a tie-breaker
if the column used to sort by does not contain unique values.
- Iterate over the table with composite primary keys.

### Iterate over the issues in a project by creation date

You can use keyset pagination to iterate over any database column in a specific order (for example,
`created_at DESC`). To ensure consistent order of the returned records with the same values for
`created_at`, use a tie-breaker column with unique values (for example, `id`).

Assume you have the following index in the `issues` table:

```sql
"idx_issues_on_project_id_and_created_at_and_id" btree (project_id, created_at, id)
```

### Fetching records for further processing

The following snippet iterates over issue records within the project using the specified order
(`created_at, id`).

```ruby
scope = Issue.where(project_id: 278964).order(:created_at, :id) # id is the tie-breaker

iterator = Gitlab::Pagination::Keyset::Iterator.new(scope: scope)

iterator.each_batch(of: 100) do |records|
  puts records.map(&:id)
end
```

You can add extra filters to the query. This example only lists the issue IDs created in the last
30 days:

```ruby
scope = Issue.where(project_id: 278964).where('created_at > ?', 30.days.ago).order(:created_at, :id) # id is the tie-breaker

iterator = Gitlab::Pagination::Keyset::Iterator.new(scope: scope)

iterator.each_batch(of: 100) do |records|
  puts records.map(&:id)
end
```

### Updating records in the batch

For complex `ActiveRecord` queries, the `.update_all` method does not work well, because it
generates an incorrect `UPDATE` statement.
You can use raw SQL for updating records in batches:

```ruby
scope = Issue.where(project_id: 278964).order(:created_at, :id) # id is the tie-breaker

iterator = Gitlab::Pagination::Keyset::Iterator.new(scope: scope)

iterator.each_batch(of: 100) do |records|
  ApplicationRecord.connection.execute("UPDATE issues SET updated_at=NOW() WHERE issues.id in (#{records.dup.reselect(:id).to_sql})")
end
```

NOTE:
To keep the iteration stable and predictable, avoid updating the columns in the `ORDER BY` clause.

### Iterate over the `merge_request_diff_commits` table

The `merge_request_diff_commits` table uses a composite primary key (`merge_request_diff_id,
relative_order`), which makes `EachBatch` impossible to use efficiently.

To paginate over the `merge_request_diff_commits` table, you can use the following snippet:

```ruby
# Custom order object configuration:
order = Gitlab::Pagination::Keyset::Order.build([
  Gitlab::Pagination::Keyset::ColumnOrderDefinition.new(
    attribute_name: 'merge_request_diff_id',
    order_expression: MergeRequestDiffCommit.arel_table[:merge_request_diff_id].asc,
    nullable: :not_nullable,
    distinct: false,
  ),
  Gitlab::Pagination::Keyset::ColumnOrderDefinition.new(
    attribute_name: 'relative_order',
    order_expression: MergeRequestDiffCommit.arel_table[:relative_order].asc,
    nullable: :not_nullable,
    distinct: false,
  )
])
MergeRequestDiffCommit.include(FromUnion) # keyset pagination generates UNION queries

scope = MergeRequestDiffCommit.order(order)

iterator = Gitlab::Pagination::Keyset::Iterator.new(scope: scope)

iterator.each_batch(of: 100) do |records|
  puts records.map { |record| [record.merge_request_diff_id, record.relative_order] }.inspect
end
```

### Order object configuration

Keyset pagination works well with simple `ActiveRecord` `order` scopes
([first example](iterating_tables_in_batches.md#iterate-over-the-issues-in-a-project-by-creation-date)).
However, in special cases, you need to describe the columns in the `ORDER BY` clause (second example)
for the underlying keyset pagination library. When the `ORDER BY` configuration cannot be
automatically determined by the keyset pagination library, an error is raised.

The code comments of the
[`Gitlab::Pagination::Keyset::Order`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/pagination/keyset/order.rb)
and [`Gitlab::Pagination::Keyset::ColumnOrderDefinition`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/pagination/keyset/column_order_definition.rb)
classes give an overview of the possible options for configuring the `ORDER BY` clause. You can
also find a few code examples in the
[keyset pagination](database/keyset_pagination.md#complex-order-configuration) documentation.