Diffstat (limited to 'doc/development/database/iterating_tables_in_batches.md')
-rw-r--r-- | doc/development/database/iterating_tables_in_batches.md | 598 |
1 file changed, 598 insertions, 0 deletions
diff --git a/doc/development/database/iterating_tables_in_batches.md b/doc/development/database/iterating_tables_in_batches.md
new file mode 100644
index 00000000000..6d7a57ecacb
--- /dev/null
+++ b/doc/development/database/iterating_tables_in_batches.md
@@ -0,0 +1,598 @@
---
stage: Data Stores
group: Database
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---

# Iterating tables in batches

Rails provides a method called `in_batches` that can be used to iterate over
rows in batches. For example:

```ruby
User.in_batches(of: 10) do |relation|
  relation.update_all(updated_at: Time.now)
end
```

Unfortunately, this method is implemented in a way that is not very efficient,
both in terms of query performance and memory usage.

To work around this you can include the `EachBatch` module into your models,
then use the `each_batch` class method. For example:

```ruby
class User < ActiveRecord::Base
  include EachBatch
end

User.each_batch(of: 10) do |relation|
  relation.update_all(updated_at: Time.now)
end
```

This produces queries such as:

```plaintext
User Load (0.7ms)  SELECT "users"."id" FROM "users" WHERE ("users"."id" >= 41654) ORDER BY "users"."id" ASC LIMIT 1 OFFSET 1000
  (0.7ms)  SELECT COUNT(*) FROM "users" WHERE ("users"."id" >= 41654) AND ("users"."id" < 42687)
```

The API of this method is similar to `in_batches`, though it doesn't support
all of the arguments that `in_batches` supports. You should always use
`each_batch` _unless_ you have a specific need for `in_batches`.

## Iterating over non-unique columns

Proceed with extra caution when iterating over a column that is not unique:
even with a maximum batch size applied, there is no guarantee that the
resulting batches do not surpass it. The following snippet demonstrates the
problem: when selecting `Ci::Build` entries for users with `id` between `1`
and `10,000`, the database returns `1,215,178` matching rows.

```ruby
[ gstg ] production> Ci::Build.where(user_id: (1..10_000)).size
=> 1215178
```

This happens because the built relation is translated into the following query:

```ruby
[ gstg ] production> puts Ci::Build.where(user_id: (1..10_000)).to_sql
SELECT "ci_builds".* FROM "ci_builds" WHERE "ci_builds"."type" = 'Ci::Build' AND "ci_builds"."user_id" BETWEEN 1 AND 10000
=> nil
```

For queries that filter a non-unique column by a range
(`WHERE "ci_builds"."user_id" BETWEEN ? AND ?`), limiting the range size to a
certain threshold (`10,000` in the previous example) does not limit the size
of the returned dataset: given `n` possible values of the attribute, there is
no way to tell in advance that the number of records containing them is less
than `n`.

### Loose-index scan with `distinct_each_batch`

When iterating over a non-unique column is necessary, use the `distinct_each_batch` helper
method. The helper uses the [loose-index scan technique](https://wiki.postgresql.org/wiki/Loose_indexscan)
(skip-index scan) to skip duplicated values within a database index.
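PostgreSQL does not implement loose index scans natively, so the technique is
emulated with a recursive CTE that repeatedly jumps to the next distinct value
in the index. The following SQL is a simplified sketch of the idea (following
the PostgreSQL wiki pattern), not the exact query the helper generates:

```sql
-- Emulated loose index scan over issues.author_id (simplified sketch).
-- Each recursive step performs one cheap index lookup for the next
-- distinct value instead of scanning every duplicated entry.
WITH RECURSIVE distinct_authors AS (
  (SELECT author_id FROM issues ORDER BY author_id LIMIT 1)
  UNION ALL
  SELECT (SELECT author_id
          FROM issues
          WHERE author_id > distinct_authors.author_id
          ORDER BY author_id
          LIMIT 1)
  FROM distinct_authors
  WHERE distinct_authors.author_id IS NOT NULL
)
SELECT author_id FROM distinct_authors WHERE author_id IS NOT NULL;
```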
Example: iterating over distinct `author_id` values in the `Issue` model:

```ruby
Issue.distinct_each_batch(column: :author_id, of: 1000) do |relation|
  users = User.where(id: relation.select(:author_id)).to_a
end
```

The technique provides stable performance between the batches regardless of the data distribution.
The `relation` object returns an ActiveRecord scope where only the given `column` is available.
Other columns are not loaded.

The underlying database queries use recursive CTEs, which adds extra overhead. We therefore advise
using smaller batch sizes than those used for a standard `each_batch` iteration.

## Column definition

By default, `EachBatch` uses the primary key of the model for the iteration. This works in most
cases; however, in some cases you might want to use a different column for the iteration.

```ruby
Project.distinct.each_batch(column: :creator_id, of: 10) do |relation|
  puts User.where(id: relation.select(:creator_id)).map(&:id)
end
```

The query above iterates over the project creators and prints them out without duplications.

NOTE:
If the column is not unique (there is no unique index definition), calling the `distinct` method on
the relation is necessary. Using a non-unique column without `distinct` may result in `each_batch`
falling into an endless loop, as described in the following
[issue](https://gitlab.com/gitlab-org/gitlab/-/issues/285097).

## `EachBatch` in data migrations

When dealing with data migrations, the preferred way to iterate over a large volume of data is using
`EachBatch`.

A special case of data migration is a [background migration](background_migrations.md#scheduling)
where the actual data modification is executed in a background job. The migration code that
determines the data ranges (slices) and schedules the background jobs uses `each_batch`, as
sketched below.
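A minimal sketch of this scheduling pattern, assuming a hypothetical
`BackfillSomethingWorker` background job; the real scheduling helpers, delays,
and batch sizes are covered in the background migrations documentation:

```ruby
# Illustrative only: slice the table into id ranges with each_batch and
# enqueue one (hypothetical) background job per slice.
User.each_batch(of: 10_000) do |relation|
  # MIN(id)/MAX(id) of the current slice; pluck returns [[min, max]].
  start_id, end_id = relation.pluck(Arel.sql('MIN(id), MAX(id)')).first

  BackfillSomethingWorker.perform_async(start_id, end_id)
end
```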
## Efficient usage of `each_batch`

`EachBatch` helps to iterate over large tables. It's important to highlight that `EachBatch`
does not magically solve all iteration-related performance problems, and it might not help at
all in some scenarios. From the database point of view, correctly configured database indexes are
also necessary to make `EachBatch` perform well.

### Example 1: Simple iteration

Let's consider that we want to iterate over the `users` table and print the `User` records to the
standard output. The `users` table contains millions of records, thus running one query to fetch
the users likely times out.

![Users table overview](../img/each_batch_users_table_v13_7.png)

This is a simplified version of the `users` table which contains several rows. We have a few
smaller gaps in the `id` column to make the example a bit more realistic (a few records were
already deleted). Currently, we have one index on the `id` field.

Loading all users into memory (avoid):

```ruby
users = User.all

users.each { |user| puts user.inspect }
```

Use `each_batch`:

```ruby
# Note: for this example we picked 5 as the batch size, the default is 1_000
User.each_batch(of: 5) do |relation|
  relation.each { |user| puts user.inspect }
end
```

#### How `each_batch` works

As the first step, it finds the lowest `id` (start `id`) in the table by executing the following
database query:

```sql
SELECT "users"."id" FROM "users" ORDER BY "users"."id" ASC LIMIT 1
```

![Reading the start ID value](../img/each_batch_users_table_iteration_1_v13_7.png)

Notice that the query only reads data from the index (`INDEX ONLY SCAN`); the table is not
accessed. Database indexes are sorted, so taking out the first item is a very cheap operation.

The next step is to find the next `id` (end `id`) which should respect the batch size
configuration. In this example we used a batch size of 5. `EachBatch` uses the `OFFSET` clause
to get a "shifted" `id` value.

```sql
SELECT "users"."id" FROM "users" WHERE "users"."id" >= 1 ORDER BY "users"."id" ASC LIMIT 1 OFFSET 5
```

![Reading the end ID value](../img/each_batch_users_table_iteration_2_v13_7.png)

Again, the query only looks into the index. The `OFFSET 5` takes out the sixth `id` value: this
query reads a maximum of six items from the index regardless of the table size or the iteration
count.

At this point, we know the `id` range for the first batch. Now it's time to construct the query
for the `relation` block.

```sql
SELECT "users".* FROM "users" WHERE "users"."id" >= 1 AND "users"."id" < 302
```

![Reading the rows from the `users` table](../img/each_batch_users_table_iteration_3_v13_7.png)

Notice the `<` sign. Previously six items were read from the index, and in this query the last
value is "excluded". The query looks at the index to get the location of the five `user`
rows on the disk and reads the rows from the table. The returned array is processed in Ruby.

The first iteration is done. For the next iteration, the last `id` value is reused from the
previous iteration to find the next end `id` value.

```sql
SELECT "users"."id" FROM "users" WHERE "users"."id" >= 302 ORDER BY "users"."id" ASC LIMIT 1 OFFSET 5
```

![Reading the second end ID value](../img/each_batch_users_table_iteration_4_v13_7.png)

Now we can easily construct the `users` query for the second iteration.

```sql
SELECT "users".* FROM "users" WHERE "users"."id" >= 302 AND "users"."id" < 353
```

![Reading the rows for the second iteration from the users table](../img/each_batch_users_table_iteration_5_v13_7.png)
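The whole loop can be summarized in a short Ruby sketch. This mirrors the
queries described above but is not the actual `EachBatch` implementation:

```ruby
# Simplified model of the each_batch loop (not the real implementation).
def each_batch_sketch(model, of:)
  start_id = model.order(:id).limit(1).pick(:id) # lowest id, read from the index

  while start_id
    # Find the id just past the end of the batch using OFFSET; nil on the last batch.
    end_id = model.where('id >= ?', start_id).order(:id).offset(of).limit(1).pick(:id)

    relation = model.where('id >= ?', start_id)
    relation = relation.where('id < ?', end_id) if end_id

    yield relation # the block receives the batched relation

    start_id = end_id # nil after the last batch, which stops the loop
  end
end

each_batch_sketch(User, of: 5) do |relation|
  relation.each { |user| puts user.inspect }
end
```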
### Example 2: Iteration with filters

Building on top of the previous example, we want to print users with zero sign-in count. We keep
track of the number of sign-ins in the `sign_in_count` column, so we write the following code:

```ruby
users = User.where(sign_in_count: 0)

users.each_batch(of: 5) do |relation|
  relation.each { |user| puts user.inspect }
end
```

`each_batch` produces the following SQL query for the start `id` value:

```sql
SELECT "users"."id" FROM "users" WHERE "users"."sign_in_count" = 0 ORDER BY "users"."id" ASC LIMIT 1
```

Selecting only the `id` column and ordering by `id` forces the database to use the
index on the `id` column (the primary key index). However, we also have an extra condition on the
`sign_in_count` column. The column is not part of the index, so the database needs to look into
the actual table to find the first matching row.

![Reading the index with extra filter](../img/each_batch_users_table_filter_v13_7.png)

NOTE:
The number of scanned rows depends on the data distribution in the table.

- Best case scenario: the first user has never signed in. The database reads only one row.
- Worst case scenario: all users have signed in at least once. The database reads all rows.

In this particular example, the database had to read 10 rows (regardless of our batch size setting)
to determine the first `id` value. In a "real-world" application it's hard to predict whether
filtering causes problems or not. In the case of GitLab, verifying the data on a
production replica is a good start, but keep in mind that data distribution on GitLab.com can be
different from self-managed instances.

#### Improve filtering with `each_batch`

##### Specialized conditional index

One way to make the filtered iteration efficient is a conditional (partial) index that contains
only the rows matching the filter:

```sql
CREATE INDEX index_on_users_never_logged_in ON users (id) WHERE sign_in_count = 0
```

This is what our table and the newly created index look like:

![Reading the specialized index](../img/each_batch_users_table_filtered_index_v13_7.png)

This index definition covers the conditions on the `id` and `sign_in_count` columns and thus makes
the `each_batch` queries very effective (similar to the simple iteration example).

It's rare for a user to have never signed in, so we anticipate a small index size. Including only
the `id` in the index definition also helps to keep the index size small.

##### Index on columns

Later on, we might want to iterate over the table filtering for different `sign_in_count` values. In
those cases, we cannot use the previously suggested conditional index because the `WHERE` condition
does not match our new filter (`sign_in_count > 10`).

To address this problem, we have two options:

- Create another, conditional index to cover the new query.
- Replace the index with a more generalized configuration.

NOTE:
Having multiple indexes on the same table and on the same columns could be a performance bottleneck
when writing data.

Let's consider the following index (avoid):

```sql
CREATE INDEX index_on_users_never_logged_in ON users (id, sign_in_count)
```

The index definition starts with the `id` column, which makes the index very inefficient from a
data selectivity point of view.

```sql
SELECT "users"."id" FROM "users" WHERE "users"."sign_in_count" = 0 ORDER BY "users"."id" ASC LIMIT 1
```

Executing the query above results in an `INDEX ONLY SCAN`. However, the query still needs to
iterate over an unknown number of entries in the index, and then find the first item where the
`sign_in_count` is `0`.

![Reading an ineffective index](../img/each_batch_users_table_bad_index_v13_7.png)

We can improve the query significantly by swapping the columns in the index definition (prefer).

```sql
CREATE INDEX index_on_users_never_logged_in ON users (sign_in_count, id)
```

![Reading a good index](../img/each_batch_users_table_good_index_v13_7.png)

The following index definition does not work well with `each_batch` (avoid).

```sql
CREATE INDEX index_on_users_never_logged_in ON users (sign_in_count)
```

Since `each_batch` builds range queries based on the `id` column, this index cannot be used
efficiently. The database reads the rows from the table or uses a bitmap search where the primary
key index is also read.
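To see why, consider the per-batch query that `each_batch` generates (the `id`
boundaries below are illustrative): nothing in a `(sign_in_count)`-only index
helps to narrow down the `id` range.

```sql
SELECT "users".* FROM "users"
WHERE "users"."sign_in_count" = 0
AND "users"."id" >= 1 AND "users"."id" < 302
```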
##### "Slow" iteration

Slow iteration means that we use a good index configuration to iterate over the table and
apply filtering on the yielded relation.

```ruby
User.each_batch(of: 5) do |relation|
  relation.where(sign_in_count: 0).each { |user| puts user.inspect }
end
```

The iteration uses the primary key index (on the `id` column) which makes it safe from statement
timeouts. The filter (`sign_in_count: 0`) is applied on the `relation` where the `id` is already
constrained (range). The number of rows is limited.

Slow iteration generally takes more time to finish. The iteration count is higher and
one iteration could yield fewer records than the batch size. Iterations may even yield
0 records. This is not an optimal solution; however, in some cases (especially when
dealing with large tables) this is the only viable option.

### Using Subqueries

Using subqueries in your `each_batch` query does not work well in most cases. Consider the following example:

```ruby
projects = Project.where(creator_id: Issue.where(confidential: true).select(:author_id))

projects.each_batch do |relation|
  # do something
end
```

The iteration uses the `id` column of the `projects` table. The batching does not affect the
subquery. This means for each iteration, the subquery is executed by the database. This adds a
constant "load" on the query which often ends up in statement timeouts. We have an unknown number
of [confidential issues](../../user/project/issues/confidential_issues.md); the execution time
and the accessed database rows depend on the data distribution in the `issues` table.

NOTE:
Using subqueries works only when the subquery returns a small number of rows.

#### Improving Subqueries

When dealing with subqueries, a slow iteration approach could work: the filter on `creator_id`
can be part of the generated `relation` object.

```ruby
projects = Project.all

projects.each_batch do |relation|
  relation.where(creator_id: Issue.where(confidential: true).select(:author_id))
end
```

If the query on the `issues` table itself is not performant enough, a nested loop could be
constructed. Try to avoid it when possible.

```ruby
projects = Project.all

projects.each_batch do |relation|
  issues = Issue.where(confidential: true)

  issues.each_batch do |issues_relation|
    relation.where(creator_id: issues_relation.select(:author_id))
  end
end
```

If we know that the `issues` table has many more rows than `projects`, it would make sense to flip
the queries, where the `issues` table is batched first, as sketched below.
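A sketch of the flipped nesting, assuming `issues` is the significantly larger
table:

```ruby
# Batch the larger issues table in the outer loop so each projects lookup
# is constrained by a small slice of confidential issues.
issues = Issue.where(confidential: true)

issues.each_batch do |issues_relation|
  Project.each_batch do |projects_relation|
    projects_relation.where(creator_id: issues_relation.select(:author_id))
  end
end
```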
### Using `JOIN` and `EXISTS`

When to use `JOIN`s:

- When there's a 1:1 or 1:N relationship between the tables where we know that the joined record
(almost) always exists. This works well for "extension-like" tables:
  - `projects` - `project_settings`
  - `users` - `user_details`
  - `users` - `user_statuses`
- `LEFT JOIN` works well in this case. Conditions on the joined table need to go to the yielded
relation so the iteration is not affected by the data distribution in the joined table.

Example:

```ruby
users = User.joins("LEFT JOIN personal_access_tokens on personal_access_tokens.user_id = users.id")

users.each_batch do |relation|
  relation.where("personal_access_tokens.name = 'name'")
end
```

`EXISTS` queries should be added only to the inner `relation` of the `each_batch` query:

```ruby
User.each_batch do |relation|
  relation.where("EXISTS (SELECT 1 FROM ...)")
end
```

### Complex queries on the relation object

When the `relation` object has several extra conditions, the execution plans might become
"unstable".

Example:

```ruby
Issue.each_batch do |relation|
  relation
    .joins(:metrics)
    .joins(:merge_requests_closing_issues)
    .where("id IN (SELECT ...)")
    .where(confidential: true)
end
```

Here, we expect that the `relation` query reads the `BATCH_SIZE` of issue records and then
filters down the results according to the provided queries. The planner might decide that
using a bitmap index lookup with the index on the `confidential` column is a better way to
execute the query. This can cause an unexpectedly high amount of rows to be read and the
query could time out.

Problem: we know for sure that the relation returns a maximum of `BATCH_SIZE` records;
however, the planner does not know this.

Common table expression (CTE) trick to force the range query to execute first:

```ruby
Issue.each_batch(of: 1000) do |relation|
  cte = Gitlab::SQL::CTE.new(:batched_relation, relation.limit(1000))

  scope = cte
    .apply_to(Issue.all)
    .joins(:metrics)
    .joins(:merge_requests_closing_issues)
    .where("id IN (SELECT ...)")
    .where(confidential: true)

  puts scope.to_a
end
```

### `EachBatch` vs `BatchCount`

When adding new counters for Service Ping, the preferred way to count records is using the
`Gitlab::Database::BatchCount` class. The iteration logic implemented in `BatchCount`
has performance characteristics similar to `EachBatch`. Most of the tips and suggestions
for improving `EachBatch` mentioned above apply to `BatchCount` as well.
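For example, a counter could look like the following minimal sketch; the exact
options are documented in the `Gitlab::Database::BatchCount` class itself:

```ruby
# Count users who have never signed in, in batches, instead of one big COUNT(*).
Gitlab::Database::BatchCount.batch_count(User.where(sign_in_count: 0))

# A distinct count over a column works similarly.
Gitlab::Database::BatchCount.batch_distinct_count(Project, :creator_id)
```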
## Iterate with keyset pagination

There are a few special cases where iterating with `EachBatch` does not work. `EachBatch`
requires one distinct column (usually the primary key), which makes the iteration impossible
for timestamp columns and tables with composite primary keys.

Where `EachBatch` does not work, you can use
[keyset pagination](pagination_guidelines.md#keyset-pagination) to iterate over the
table or a range of rows. The scaling and performance characteristics are very similar to
`EachBatch`.

Examples:

- Iterate over the table in a specific order (timestamp columns) in combination with a tie-breaker
if the column used to sort by does not contain unique values.
- Iterate over the table with composite primary keys.

### Iterate over the issues in a project by creation date

You can use keyset pagination to iterate over any database column in a specific order (for example,
`created_at DESC`). To ensure consistent order of the returned records with the same values for
`created_at`, use a tie-breaker column with unique values (for example, `id`).

Assume you have the following index in the `issues` table:

```sql
"idx_issues_on_project_id_and_created_at_and_id" btree (project_id, created_at, id)
```

### Fetching records for further processing

The following snippet iterates over issue records within the project using the specified order
(`created_at, id`).

```ruby
scope = Issue.where(project_id: 278964).order(:created_at, :id) # id is the tie-breaker

iterator = Gitlab::Pagination::Keyset::Iterator.new(scope: scope)

iterator.each_batch(of: 100) do |records|
  puts records.map(&:id)
end
```

You can add extra filters to the query. This example only lists the issue IDs created in the last
30 days:

```ruby
scope = Issue.where(project_id: 278964).where('created_at > ?', 30.days.ago).order(:created_at, :id) # id is the tie-breaker

iterator = Gitlab::Pagination::Keyset::Iterator.new(scope: scope)

iterator.each_batch(of: 100) do |records|
  puts records.map(&:id)
end
```

### Updating records in the batch

For complex `ActiveRecord` queries, the `.update_all` method does not work well, because it
generates an incorrect `UPDATE` statement.
You can use raw SQL for updating records in batches:

```ruby
scope = Issue.where(project_id: 278964).order(:created_at, :id) # id is the tie-breaker

iterator = Gitlab::Pagination::Keyset::Iterator.new(scope: scope)

iterator.each_batch(of: 100) do |records|
  ApplicationRecord.connection.execute("UPDATE issues SET updated_at=NOW() WHERE issues.id in (#{records.dup.reselect(:id).to_sql})")
end
```

NOTE:
To keep the iteration stable and predictable, avoid updating the columns in the `ORDER BY` clause.

### Iterate over the `merge_request_diff_commits` table

The `merge_request_diff_commits` table uses a composite primary key (`merge_request_diff_id,
relative_order`), which makes `EachBatch` impossible to use efficiently.

To paginate over the `merge_request_diff_commits` table, you can use the following snippet:

```ruby
# Custom order object configuration:
order = Gitlab::Pagination::Keyset::Order.build([
  Gitlab::Pagination::Keyset::ColumnOrderDefinition.new(
    attribute_name: 'merge_request_diff_id',
    order_expression: MergeRequestDiffCommit.arel_table[:merge_request_diff_id].asc,
    nullable: :not_nullable,
    distinct: false
  ),
  Gitlab::Pagination::Keyset::ColumnOrderDefinition.new(
    attribute_name: 'relative_order',
    order_expression: MergeRequestDiffCommit.arel_table[:relative_order].asc,
    nullable: :not_nullable,
    distinct: false
  )
])

MergeRequestDiffCommit.include(FromUnion) # keyset pagination generates UNION queries

scope = MergeRequestDiffCommit.order(order)

iterator = Gitlab::Pagination::Keyset::Iterator.new(scope: scope)

iterator.each_batch(of: 100) do |records|
  puts records.map { |record| [record.merge_request_diff_id, record.relative_order] }.inspect
end
```

### Order object configuration

Keyset pagination works well with simple `ActiveRecord` `order` scopes
([first example](#iterate-over-the-issues-in-a-project-by-creation-date)).
However, in special cases, you need to describe the columns in the `ORDER BY` clause (second example)
for the underlying keyset pagination library. When the `ORDER BY` configuration cannot be
automatically determined by the keyset pagination library, an error is raised.

The code comments of the
[`Gitlab::Pagination::Keyset::Order`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/pagination/keyset/order.rb)
and [`Gitlab::Pagination::Keyset::ColumnOrderDefinition`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/pagination/keyset/column_order_definition.rb)
classes give an overview of the possible options for configuring the `ORDER BY` clause.
You can also find a few code examples in the
[keyset pagination](keyset_pagination.md#complex-order-configuration) documentation.