Diffstat (limited to 'doc/development/insert_into_tables_in_batches.md')
-rw-r--r--  doc/development/insert_into_tables_in_batches.md  156
1 files changed, 156 insertions, 0 deletions
diff --git a/doc/development/insert_into_tables_in_batches.md b/doc/development/insert_into_tables_in_batches.md
new file mode 100644
index 00000000000..763185013c9
--- /dev/null
+++ b/doc/development/insert_into_tables_in_batches.md
@@ -0,0 +1,156 @@
+---
+description: "Sometimes it is necessary to store large amounts of records at once, which can be inefficient
+when iterating collections and performing individual `save`s. With the arrival of `insert_all`
+in Rails 6, which operates at the row level (that is, using `Hash`es), GitLab has added a set
+of APIs that make it safe and simple to insert ActiveRecord objects in bulk."
+---
+
+# Insert into tables in batches
+
+Sometimes it is necessary to store large numbers of records at once, which can be inefficient
+when iterating over a collection and saving each record individually. With the arrival of
+[`insert_all`](https://apidock.com/rails/ActiveRecord/Persistence/ClassMethods/insert_all)
+in Rails 6, which operates at the row level (that is, using `Hash` objects), GitLab has added a set
+of APIs that make it safe and simple to insert `ActiveRecord` objects in bulk.
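+
+For context, the pattern this replaces is a plain loop that saves each record on its own, one
+`INSERT` at a time (a minimal illustration):
+
+```ruby
+# Inefficient: issues one INSERT statement (and transaction) per record.
+records.each(&:save!)
+```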
+
+## Prepare `ApplicationRecord`s for bulk insertion
+
+For a model class to take advantage of the bulk insertion API, it must first include the
+`BulkInsertSafe` concern:
+
+```ruby
+class MyModel < ApplicationRecord
+  # other includes here
+  # ...
+  include BulkInsertSafe # include this last
+
+  # ...
+end
+```
+
+The `BulkInsertSafe` concern has two functions:
+
+- It performs checks against your model class to ensure that it does not use ActiveRecord
+ APIs that are not safe to use with respect to bulk insertions (more on that below).
+- It adds a new class method `bulk_insert!`, which you can use to insert many records at once.
+
+## Insert records via `bulk_insert!`
+
+If the target class passes the checks performed by `BulkInsertSafe`, you can proceed to use
+the `bulk_insert!` class method as follows:
+
+```ruby
+records = [MyModel.new, ...]
+
+MyModel.bulk_insert!(records)
+```
+
+### Record validation
+
+The `bulk_insert!` method guarantees that `records` will be inserted transactionally, and
+will run validations on each record prior to insertion. If any record fails to validate,
+an error is raised and the transaction is rolled back. You can turn off validations via
+the `:validate` option:
+
+```ruby
+MyModel.bulk_insert!(records, validate: false)
+```
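+
+If a record fails validation, the failure surfaces as an exception and nothing is written.
+Assuming the concern reports failures via the standard Rails `ActiveRecord::RecordInvalid` error
+(check the concern's source for the exact class), handling it could look like this (the `name`
+validation is hypothetical):
+
+```ruby
+# Hypothetical: MyModel validates the presence of `name`.
+records = [MyModel.new(name: 'valid'), MyModel.new(name: nil)]
+
+begin
+  MyModel.bulk_insert!(records)
+rescue ActiveRecord::RecordInvalid => error
+  # The transaction was rolled back, so none of the records were inserted.
+  Rails.logger.warn("Bulk insert failed: #{error.message}")
+end
+```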
+
+### Batch size configuration
+
+When the number of `records` exceeds a given threshold, insertions
+occur in multiple batches. The default batch size is defined in
+[`BulkInsertSafe::DEFAULT_BATCH_SIZE`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/models/concerns/bulk_insert_safe.rb).
+Assuming a default threshold of 500, inserting 950 records
+would result in two batches being written sequentially (of size 500 and 450, respectively).
+You can override the default batch size via the `:batch_size` option:
+
+```ruby
+MyModel.bulk_insert!(records, batch_size: 100)
+```
+
+Assuming the same 950 records, this would result in 10 batches being written instead.
+Since this also affects the number of `INSERT` statements that occur, make sure you measure the
+performance impact this might have on your code. There is a trade-off between the number of
+`INSERT` statements the database has to process and the size and cost of each `INSERT`.
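+
+One rough way to gauge that trade-off is to time a few candidate batch sizes against a
+representative data set, for example in a quick benchmark script (the sizes and record counts
+below are made up for illustration):
+
+```ruby
+require 'benchmark'
+
+# Illustrative only: compare a few batch sizes against the same number of records.
+[100, 500, 1_000].each do |size|
+  records = Array.new(2_000) { MyModel.new } # build representative records here
+  seconds = Benchmark.realtime { MyModel.bulk_insert!(records, batch_size: size) }
+  puts "batch_size=#{size}: #{seconds.round(2)}s"
+end
+```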
+
+### Requirements for safe bulk insertions
+
+Large parts of ActiveRecord's persistence API are built around the notion of callbacks. Many
+of these callbacks fire in response to model life cycle events such as `save` or `create`.
+These callbacks cannot be used with bulk insertions, since they are meant to be called for
+every instance that is saved or created. Because these events do not fire when
+records are inserted in bulk, we currently disallow their use.
+
+The specifics around which callbacks are disallowed are defined in
+[`BulkInsertSafe`](https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/models/concerns/bulk_insert_safe.rb).
+Consult the module source code for details. If your class uses any of the disallowed
+functionality, and you `include BulkInsertSafe`, the application will fail with an error.
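+
+For illustration, a model that combines `BulkInsertSafe` with a disallowed life cycle callback
+(here a hypothetical `after_save` hook) fails with an error; the exact error class and the point
+at which it is raised are defined by the concern:
+
+```ruby
+class UnsafeModel < ApplicationRecord
+  include BulkInsertSafe
+
+  # `after_save` is a life cycle callback and is not on the allowed list,
+  # so this declaration is rejected with an error.
+  after_save :notify_owner
+
+  def notify_owner
+    # ...
+  end
+end
+```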
+
+### `BulkInsertSafe` versus `InsertAll`
+
+Internally, `BulkInsertSafe` is based on `InsertAll`, and you may wonder when to choose
+the former over the latter. To help you make the decision,
+the key differences between these classes are listed in the table below.
+
+| | Input type | Validates input | Specify batch size | Can bypass callbacks | Transactional |
+|--------------- | -------------------- | --------------- | ------------------ | --------------------------------- | ------------- |
+| `bulk_insert!` | ActiveRecord objects | Yes (optional) | Yes (optional) | No (prevents unsafe callback use) | Yes |
+| `insert_all!` | Attribute hashes | No | No | Yes | Yes |
+
+To summarize, `BulkInsertSafe` moves bulk inserts closer to how individual ActiveRecord object
+inserts normally behave. However, if all you need is to insert raw data in bulk, then
+`insert_all` is more efficient.
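+
+As a concrete example, both calls below insert the same two rows; which one fits depends on
+whether you start from model instances or from raw attribute hashes (the `name` column is
+illustrative):
+
+```ruby
+# Model instances: validations run, unsafe callbacks are prevented.
+MyModel.bulk_insert!([MyModel.new(name: 'foo'), MyModel.new(name: 'bar')])
+
+# Raw attribute hashes: no validations, no model instantiation.
+MyModel.insert_all!([{ name: 'foo' }, { name: 'bar' }])
+```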
+
+## Insert `has_many` associations in bulk
+
+A common use case is to save collections of associated records through the owner side of the
+association, where the owned records are linked to the owner through the `has_many` class method:
+
+```ruby
+owner = OwnerModel.new(owned_relations: array_of_owned_relations)
+# saves all `owned_relations` one-by-one
+owner.save!
+```
+
+This issues a separate `INSERT` statement and transaction for every record in `owned_relations`,
+which is inefficient if `array_of_owned_relations` is large. To remedy this, the
+`BulkInsertableAssociations` concern can be used to declare that the owner defines associations
+that are safe for bulk insertion:
+
+```ruby
+class OwnerModel < ApplicationRecord
+  # other includes here
+  # ...
+  include BulkInsertableAssociations # include this last
+
+  has_many :my_models
+end
+```
+
+Here `my_models` must be declared `BulkInsertSafe` (as described previously) for bulk insertions
+to happen. You can now insert any unsaved records as follows:
+
+```ruby
+BulkInsertableAssociations.with_bulk_insert do
+  owner = OwnerModel.new(my_models: array_of_my_model_instances)
+  # saves `my_models` using a single bulk insert (possibly via multiple batches)
+  owner.save!
+end
+```
+
+Note that you can still save relations that are not `BulkInsertSafe` in this block; they will
+simply be treated as if you had invoked `save` from outside the block.
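+
+For example, assuming `OwnerModel` also declared a hypothetical `has_many :other_models`
+association whose model is not `BulkInsertSafe`, the following would bulk insert `my_models`
+while saving `other_models` one-by-one:
+
+```ruby
+BulkInsertableAssociations.with_bulk_insert do
+  owner = OwnerModel.new(
+    my_models: array_of_my_model_instances, # bulk inserted
+    other_models: array_of_other_instances  # hypothetical non-BulkInsertSafe association: saved one-by-one
+  )
+  owner.save!
+end
+```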
+
+## Known limitations
+
+There are a few restrictions to how these APIs can be used:
+
+- Bulk inserts only work for new records; `UPDATE`s or "upserts" are not supported yet.
+- `ON CONFLICT` behavior cannot currently be configured; an error will be raised on primary key conflicts.
+- `BulkInsertableAssociations` furthermore has the following restrictions:
+  - only compatible with `has_many` relations.
+  - does not support `has_many through: ...` relations.
+
+Moreover, input data should either be limited to around 1,000 records,
+or already be batched prior to calling bulk insert. The `INSERT` statement runs in a single
+transaction, so for large numbers of records it may negatively affect database stability.
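+
+For example, you could pre-batch a large input yourself so that each `bulk_insert!` call (and
+therefore each transaction) stays small (the slice size of 1,000 is illustrative):
+
+```ruby
+# Illustrative pre-batching: each call below runs in its own transaction.
+all_records.each_slice(1_000) do |slice|
+  MyModel.bulk_insert!(slice)
+end
+```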