Diffstat (limited to 'doc/development/database/table_partitioning.md')
 doc/development/database/table_partitioning.md | 192 ++++++++++++++++------
 1 file changed, 160 insertions(+), 32 deletions(-)
diff --git a/doc/development/database/table_partitioning.md b/doc/development/database/table_partitioning.md
index 0d5e3c233f6..88b2ccbc6a2 100644
--- a/doc/development/database/table_partitioning.md
+++ b/doc/development/database/table_partitioning.md
@@ -6,6 +6,13 @@ info: To determine the technical writer assigned to the Stage/Group associated w
# Database table partitioning
+WARNING:
+If you have questions not answered below, check for and add them
+to [this issue](https://gitlab.com/gitlab-org/gitlab/-/issues/398650).
+Tag `@gitlab-org/database-team/triage` and we'll get back to you with an
+answer as soon as possible. If you get an answer in Slack, document
+it on the issue as well so we can update this document in the future.
+
Table partitioning is a powerful database feature that allows a table's
data to be split into smaller physical tables that act as a single large
table. If the application is designed to work with partitioning in mind,
@@ -32,31 +39,38 @@ several releases. Due to the limitations of partitioning and the related
migrations, you should understand how partitioning fits your use case
before attempting to leverage this feature.
-## Determining when to use partitioning
+## Determine when to use partitioning
While partitioning can be very useful when properly applied, it's
imperative to identify if the data and workload of a table naturally fit a
-partitioning scheme. There are a few details you have to understand
-to decide if partitioning is a good fit for your particular
-problem.
-
-First, a table is partitioned on a partition key, which is a column or
-set of columns which determine how the data is split across the
-partitions. The partition key is used by the database when reading or
-writing data, to decide which partitions must be accessed. The
-partition key should be a column that would be included in a `WHERE`
-clause on almost all queries accessing that table.
-
-Second, it's necessary to understand the strategy the database uses
-to split the data across the partitions. The scheme supported by the
-GitLab migration helpers is date-range partitioning, where each partition
-in the table contains data for a single month. In this case, the partitioning
-key must be a timestamp or date column. In order for this type of
+partitioning scheme. Understand a few details to decide if partitioning
+is a good fit for your particular problem:
+
+- **Table partitioning**. A table is partitioned on a partition key: a
+  column or set of columns that determines how the data is split across
+  the partitions. The database uses the partition key when reading or
+  writing data to decide which partitions must be accessed. The
+  partition key should be a column that would be included in a `WHERE`
+  clause on almost all queries accessing that table.
+
+- **How the data is split**. What strategy does the database use
+ to split the data across the partitions? The available choices are `range`,
+ `hash`, and `list`.
+
+## Determine the appropriate partitioning strategy
+
+The available partitioning strategy choices are `range`, `hash`, and `list`.
+
+### Range partitioning
+
+The scheme best supported by the GitLab migration helpers is date-range partitioning,
+where each partition in the table contains data for a single month. In this case,
+the partitioning key must be a timestamp or date column. For this type of
partitioning to work well, most queries must access data in a
certain date range.
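+
+To see the difference the partition key makes, consider the sketch below.
+The `Event` model and its columns are hypothetical, used only for
+illustration:
+
+```ruby
+# Backed by a table range-partitioned by month on created_at.
+class Event < ApplicationRecord
+end
+
+# Includes the partition key, so PostgreSQL prunes to only the partitions
+# covering the requested dates:
+Event.where(created_at: 30.days.ago..Time.current)
+
+# Omits the partition key, so every monthly partition must be scanned:
+Event.where(author_id: 42)
+```
+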
-For a more concrete example, the `audit_events` table can be used, which
-was the first table to be partitioned in the application database
+For a more concrete example, consider the `audit_events` table.
+It was the first table to be partitioned in the application database
(scheduled for deployment with the GitLab 13.5 release). This
table tracks audit entries of security events that happen in the
application. In almost all cases, users want to see audit activity that
@@ -142,6 +156,31 @@ substantial. Partitioning should only be leveraged if the access patterns
of the data support the partitioning strategy; otherwise, performance
suffers.
+### Hash partitioning
+
+Hash partitioning splits a logical table into a series of partitioned
+tables. Each partition holds the rows whose hashed partition key, taken
+modulo the partition count, equals that partition's remainder. For example,
+if partitioning `BY HASH (id)`, rows with `hash(id) % 64 == 1` end up in
+the partition `WITH (MODULUS 64, REMAINDER 1)`.
+
+When hash partitioning, you must include a `WHERE hashed_column = ?` condition in
+every performance-sensitive query issued by the application. If this is not possible,
+hash partitioning may not be the correct fit for your use case.
+
+Hash partitioning has one main advantage: it is the only type of partitioning
+that can enforce uniqueness on a single numeric `id` column. (While this is
+also possible with range partitioning, it's rarely the correct choice.)
+
+Hash partitioning has downsides:
+
+- The number of partitions must be known up-front.
+- It's difficult to move new data to an extra partition if current partitions become too large.
+- Range queries, such as `WHERE id BETWEEN ? AND ?`, are unsupported.
+- Lookups by other keys, such as `WHERE other_id = ?`, are unsupported.
+
+For these reasons, it's often best to choose a large number of hash partitions to accommodate future table growth.
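+
+To make the mechanics concrete, the following sketch shows the raw DDL
+involved. The table and migration names are hypothetical, and a real
+migration would use the GitLab partitioning helpers rather than
+hand-written SQL:
+
+```ruby
+# frozen_string_literal: true
+
+class CreateHashPartitionedEvents < Gitlab::Database::Migration[2.1]
+  def up
+    execute(<<~SQL)
+      -- A primary key on the single numeric id column is allowed because
+      -- it matches the hash partition key.
+      CREATE TABLE hash_partitioned_events (
+        id bigint NOT NULL,
+        payload jsonb,
+        PRIMARY KEY (id)
+      ) PARTITION BY HASH (id);
+
+      -- One partition per remainder; the modulus is fixed up-front.
+      CREATE TABLE hash_partitioned_events_00
+        PARTITION OF hash_partitioned_events
+        FOR VALUES WITH (MODULUS 64, REMAINDER 0);
+      -- Repeat for remainders 1 through 63.
+    SQL
+  end
+
+  def down
+    drop_table :hash_partitioned_events
+  end
+end
+```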
+
## Partitioning a table (Range)
Unfortunately, tables can only be partitioned at their creation, making
@@ -203,6 +242,10 @@ Continuing the above example, the migration would look like:
class BackfillPartitionAuditEvents < Gitlab::Database::Migration[2.1]
include Gitlab::Database::PartitioningMigrationHelpers
+ disable_ddl_transaction!
+
+ restrict_gitlab_migration gitlab_schema: :gitlab_main
+
def up
enqueue_partitioning_data_migration :audit_events
end
@@ -213,17 +256,12 @@ class BackfillPartitionAuditEvents < Gitlab::Database::Migration[2.1]
end
```
-This step uses the same mechanism as any background migration, so you
-may want to read the [Background Migration](background_migrations.md)
-guide for details on that process. Background jobs are scheduled every
-2 minutes and copy `50_000` records at a time, which can be used to
-estimate the timing of the background migration portion of the
-partitioning migration.
+This step internally [queues a batched background migration](batched_background_migrations.md#queueing) with a `BATCH_SIZE` of `50,000` and a `SUB_BATCH_SIZE` of `2,500`. For more details, refer to the [batched background migrations guide](batched_background_migrations.md).
### Step 3: Post-backfill cleanup (Release N+1)
-The third step must occur at least one release after the release that
-includes the background migration. This gives time for the background
+This step must occur at least one release after the release that
+includes step (2). This gives time for the background
migration to execute properly in self-managed installations. In this step,
add another post-deployment migration that cleans up after the
background migration. This includes forcing any remaining jobs to
@@ -236,6 +274,10 @@ Once again, continuing the example, this migration would look like:
class CleanupPartitionedAuditEventsBackfill < Gitlab::Database::Migration[2.1]
include Gitlab::Database::PartitioningMigrationHelpers
+ disable_ddl_transaction!
+
+ restrict_gitlab_migration gitlab_schema: :gitlab_main
+
def up
finalize_backfilling_partitioned_table :audit_events
end
@@ -246,16 +288,57 @@ class CleanupPartitionedAuditEventsBackfill < Gitlab::Database::Migration[2.1]
end
```
-After this migration has completed, the original table and partitioned
+After this migration completes, the original table and partitioned
table should contain identical data. The trigger installed on the
original table guarantees that the data remains in sync going forward.
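+
+Conceptually, the trigger mirrors every write on the original table to the
+partitioned copy. The following is a much-simplified sketch of the idea,
+not the actual code generated by the helpers; the `audit_events_part` name
+and the identical column layout are assumptions for illustration:
+
+```ruby
+execute(<<~SQL)
+  CREATE FUNCTION sync_audit_events() RETURNS trigger
+  LANGUAGE plpgsql AS
+  $$
+  BEGIN
+    -- Remove the old row copy on update or delete...
+    IF TG_OP IN ('UPDATE', 'DELETE') THEN
+      DELETE FROM audit_events_part WHERE id = OLD.id;
+    END IF;
+    -- ...and write the new row copy on insert or update.
+    IF TG_OP IN ('INSERT', 'UPDATE') THEN
+      INSERT INTO audit_events_part VALUES (NEW.*);
+    END IF;
+    RETURN NULL;
+  END
+  $$;
+
+  CREATE TRIGGER audit_events_sync
+  AFTER INSERT OR UPDATE OR DELETE ON audit_events
+  FOR EACH ROW EXECUTE FUNCTION sync_audit_events();
+SQL
+```
+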
### Step 4: Swap the partitioned and non-partitioned tables (Release N+1)
-The final step of the migration makes the partitioned table ready
-for use by the application. This section will be updated when the
-migration helper is ready, for now development can be followed in the
-[Tracking Issue](https://gitlab.com/gitlab-org/gitlab/-/issues/241267).
+This step replaces the non-partitioned table with its partitioned copy. Run it
+only after all other migration steps have completed successfully.
+
+The following limitations of this method must be handled before, or during, the swap migration:
+
+- Secondary indexes and foreign keys are not automatically recreated on the partitioned table.
+- Some types of constraints (`UNIQUE` and `EXCLUDE`) that rely on indexes are not automatically
+  recreated on the partitioned table, because the underlying index is not present.
+- Foreign keys referencing the original non-partitioned table must be updated to reference the
+  partitioned table. This is not supported in PostgreSQL 11.
+- Views referencing the original table are not automatically updated to reference the partitioned table.
+
+```ruby
+# frozen_string_literal: true
+
+class SwapPartitionedAuditEvents < ActiveRecord::Migration[6.0]
+ include Gitlab::Database::PartitioningMigrationHelpers
+
+ def up
+ replace_with_partitioned_table :audit_events
+ end
+
+ def down
+ rollback_replace_with_partitioned_table :audit_events
+ end
+end
+```
+
+After this migration completes:
+
+- The partitioned table replaces the non-partitioned (original) table.
+- The sync trigger created earlier is dropped.
+
+The partitioned table is now ready for use by the application.
+
+## Partitioning a table (Hash)
+
+Hash partitioning divides data into partitions based on a hash of their ID.
+It works well only if most queries against the table include a clause like `WHERE id = ?`,
+so that PostgreSQL can decide which partition to look in based on the ID or IDs being requested.
+
+Another key downside is that hash partitioning does not allow adding additional partitions after table creation.
+The correct number of partitions must be chosen up-front.
+
+Hash partitioning is the only type of partitioning (aside from some complex uses of list partitioning) that can guarantee
+uniqueness of an ID across multiple partitions at the database level.
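+
+For example, with a table partitioned `BY HASH (id)` as sketched earlier
+(the model name is hypothetical):
+
+```ruby
+# WHERE id = 123: PostgreSQL hashes the value and reads only the matching
+# (MODULUS, REMAINDER) partition.
+HashPartitionedEvent.find(123)
+
+# WHERE author_id = 42: the hash of id cannot be derived from this filter,
+# so all partitions are scanned.
+HashPartitionedEvent.where(author_id: 42)
+```
+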
## Partitioning a table (List)
@@ -504,3 +587,48 @@ class Model < ApplicationRecord
self.sequence_name = 'model_id_seq'
end
```
+
+If the partitioning constraint migration takes [more than 10 minutes](../migration_style_guide.md#how-long-a-migration-should-take) to finish,
+it can be made to run asynchronously to avoid running the post-migration during busy hours.
+
+Prepend the following migration `AsyncPrepareTableConstraintsForListPartitioning`
+and use the `async: true` option. This change marks the partitioning constraint as `NOT VALID`
+and enqueues a scheduled job to validate the existing data in the table during the weekend.
+
+Then the second post-migration `PrepareTableConstraintsForListPartitioning` only
+marks the partitioning constraint as validated, because the existing data was
+already validated during the previous weekend.
+
+For example:
+
+```ruby
+class AsyncPrepareTableConstraintsForListPartitioning < Gitlab::Database::Migration[2.1]
+ include Gitlab::Database::PartitioningMigrationHelpers::TableManagementHelpers
+
+ disable_ddl_transaction!
+
+ TABLE_NAME = :table_name
+ PARENT_TABLE_NAME = :p_table_name
+ FIRST_PARTITION = 100
+ PARTITION_COLUMN = :partition_id
+
+ def up
+ prepare_constraint_for_list_partitioning(
+ table_name: TABLE_NAME,
+ partitioning_column: PARTITION_COLUMN,
+ parent_table_name: PARENT_TABLE_NAME,
+ initial_partitioning_value: FIRST_PARTITION,
+ async: true
+ )
+ end
+
+ def down
+ revert_preparing_constraint_for_list_partitioning(
+ table_name: TABLE_NAME,
+ partitioning_column: PARTITION_COLUMN,
+ parent_table_name: PARENT_TABLE_NAME,
+ initial_partitioning_value: FIRST_PARTITION
+ )
+ end
+end
+```
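+
+The follow-up post-migration then makes the same helper call without the
+`async: true` option, which only marks the already-validated constraint as
+valid. A sketch, mirroring the migration above:
+
+```ruby
+class PrepareTableConstraintsForListPartitioning < Gitlab::Database::Migration[2.1]
+  include Gitlab::Database::PartitioningMigrationHelpers::TableManagementHelpers
+
+  disable_ddl_transaction!
+
+  TABLE_NAME = :table_name
+  PARENT_TABLE_NAME = :p_table_name
+  FIRST_PARTITION = 100
+  PARTITION_COLUMN = :partition_id
+
+  def up
+    prepare_constraint_for_list_partitioning(
+      table_name: TABLE_NAME,
+      partitioning_column: PARTITION_COLUMN,
+      parent_table_name: PARENT_TABLE_NAME,
+      initial_partitioning_value: FIRST_PARTITION
+    )
+  end
+
+  def down
+    revert_preparing_constraint_for_list_partitioning(
+      table_name: TABLE_NAME,
+      partitioning_column: PARTITION_COLUMN,
+      parent_table_name: PARENT_TABLE_NAME,
+      initial_partitioning_value: FIRST_PARTITION
+    )
+  end
+end
+```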