summaryrefslogtreecommitdiff
path: root/doc/development/iterating_tables_in_batches.md
diff options
context:
space:
mode:
Diffstat (limited to 'doc/development/iterating_tables_in_batches.md')
-rw-r--r--doc/development/iterating_tables_in_batches.md25
1 files changed, 24 insertions, 1 deletions
diff --git a/doc/development/iterating_tables_in_batches.md b/doc/development/iterating_tables_in_batches.md
index 3953e7097dd..43d7f32ad7f 100644
--- a/doc/development/iterating_tables_in_batches.md
+++ b/doc/development/iterating_tables_in_batches.md
@@ -42,6 +42,29 @@ The API of this method is similar to `in_batches`, though it doesn't support
all of the arguments that `in_batches` supports. You should always use
`each_batch` _unless_ you have a specific need for `in_batches`.
+## Avoid iterating over non-unique columns
+
+One should proceed with extra caution, and possibly avoid iterating over a column that can contain duplicate values.
+When you iterate over an attribute that is not unique, even with the applied max batch size, there is no guarantee that the resulting batches will not surpass it.
+The following snippet demonstrates this situation, whe one attempt to select `Ci::Build` entries for users with `id` between `1` and `10,s000`, database returns `1 215 178`
+matching rows
+
+```ruby
+[ gstg ] production> Ci::Build.where(user_id: (1..10_000)).size
+=> 1215178
+```
+
+This happens because built relation is translated into following query
+
+```ruby
+[ gstg ] production> puts Ci::Build.where(user_id: (1..10_000)).to_sql
+SELECT "ci_builds".* FROM "ci_builds" WHERE "ci_builds"."type" = 'Ci::Build' AND "ci_builds"."user_id" BETWEEN 1 AND 10000
+=> nil
+```
+
+And queries which filters non-unique column by range `WHERE "ci_builds"."user_id" BETWEEN ? AND ?`, even though the range size is limited to certain threshold (`10,000` in previous example) this threshold does not translates to the size of returned dataset. That happens because when taking `n` possible values of attributes,
+one can't tell for sure that the number of records that contains them will be less than `n`.
+
## Column definition
`EachBatch` uses the primary key of the model by default for the iteration. This works most of the cases, however in some cases, you might want to use a different column for the iteration.
@@ -55,7 +78,7 @@ end
The query above iterates over the project creators and prints them out without duplications.
NOTE:
-In case the column is not unique (no unique index definition), calling the `distinct` method on the relation is necessary.
+In case the column is not unique (no unique index definition), calling the `distinct` method on the relation is necessary. Using not unique column without `distinct` may result in `each_batch` falling into endless loop as described at following [issue](https://gitlab.com/gitlab-org/gitlab/-/issues/285097)
## `EachBatch` in data migrations