diff options
author | GitLab Bot <gitlab-bot@gitlab.com> | 2021-11-18 13:16:36 +0000 |
---|---|---|
committer | GitLab Bot <gitlab-bot@gitlab.com> | 2021-11-18 13:16:36 +0000 |
commit | 311b0269b4eb9839fa63f80c8d7a58f32b8138a0 (patch) | |
tree | 07e7870bca8aed6d61fdcc810731c50d2c40af47 /doc/development/database | |
parent | 27909cef6c4170ed9205afa7426b8d3de47cbb0c (diff) | |
download | gitlab-ce-311b0269b4eb9839fa63f80c8d7a58f32b8138a0.tar.gz |
Add latest changes from gitlab-org/gitlab@14-5-stable-eev14.5.0-rc42
Diffstat (limited to 'doc/development/database')
-rw-r--r-- | doc/development/database/loose_foreign_keys.md | 182 | ||||
-rw-r--r-- | doc/development/database/multiple_databases.md | 66 |
2 files changed, 238 insertions, 10 deletions
diff --git a/doc/development/database/loose_foreign_keys.md b/doc/development/database/loose_foreign_keys.md new file mode 100644 index 00000000000..157c1284512 --- /dev/null +++ b/doc/development/database/loose_foreign_keys.md @@ -0,0 +1,182 @@ +--- +stage: Enablement +group: Database +info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments +--- + +# Loose foreign keys + +## Problem statement + +In relational databases (including PostgreSQL), foreign keys provide a way to link +two database tables together, and ensure data-consistency between them. In GitLab, +[foreign keys](../foreign_keys.md) are vital part of the database design process. +Most of our database tables have foreign keys. + +With the ongoing database [decomposition work](https://gitlab.com/groups/gitlab-org/-/epics/6168), +linked records might be present on two different database servers. Ensuring data consistency +between two databases is not possible with standard PostgreSQL foreign keys. PostgreSQL +does not support foreign keys operating within a single database server, defining +a link between two database tables in two different database servers over the network. + +Example: + +- Database "Main": `projects` table +- Database "CI": `ci_pipelines` table + +A project can have many pipelines. When a project is deleted, the associated `ci_pipeline` (via the +`project_id` column) records must be also deleted. + +With a multi-database setup, this cannot be achieved with foreign keys. + +## Asynchronous approach + +Our preferred approach to this problem is eventual consistency. With the loose foreign keys +feature, we can configure delayed association cleanup without negatively affecting the +application performance. + +### How it works + +In the previous example, a record in the `projects` table can have multiple `ci_pipeline` +records. To keep the cleanup process separate from the actual parent record deletion, +we can: + +1. Create a `DELETE` trigger on the `projects` table. + Record the deletions in a separate table (`deleted_records`). +1. A job checks the `deleted_records` table every 5 minutes. +1. For each record in the table, delete the associated `ci_pipelines` records + using the `project_id` column. + +NOTE: +For this procedure to work, we must register which tables to clean up asynchronously. + +## Example migration and configuration + +### Configure the model + +First, tell the application that the `projects` table has a new loose foreign key. +You can do this in the `Project` model: + +```ruby +class Project < ApplicationRecord + # ... + + include LooseForeignKey + + loose_foreign_key :ci_pipelines, :project_id, on_delete: :async_delete # or async_nullify + + # ... +end +``` + +This instruction ensures the asynchronous cleanup process knows about the association, and the +how to do the cleanup. In this case, the associated `ci_pipelines` records are deleted. + +### Track record changes + +To know about deletions in the `projects` table, configure a `DELETE` trigger using a database +migration (post-migration). The trigger needs to be configured only once. If the model already has +at least one `loose_foreign_key` definition, then this step can be skipped: + +```ruby +class TrackProjectRecordChanges < Gitlab::Database::Migration[1.0] + include Gitlab::Database::MigrationHelpers::LooseForeignKeyHelpers + + enable_lock_retries! + + def up + track_record_deletions(:projects) + end + + def down + untrack_record_deletions(:projects) + end +end +``` + +### Remove the foreign key + +If there is an existing foreign key, then it can be removed from the database. As of GitLab 14.5, +the following foreign key describes the link between the `projects` and `ci_pipelines` tables: + +```sql +ALTER TABLE ONLY ci_pipelines +ADD CONSTRAINT fk_86635dbd80 +FOREIGN KEY (project_id) +REFERENCES projects(id) +ON DELETE CASCADE; +``` + +The migration should run after the `DELETE` trigger is installed. If the foreign key is deleted +earlier, there is a good chance of introducing data inconsistency which needs manual cleanup: + +```ruby +class RemoveProjectsCiPipelineFk < Gitlab::Database::Migration[1.0] + enable_lock_retries! + + def up + remove_foreign_key_if_exists(:ci_pipelines, :projects, name: "fk_86635dbd80") + end + + def down + add_concurrent_foreign_key(:ci_pipelines, :projects, name: "fk_86635dbd80", column: :project_id, target_column: :id, on_delete: "cascade") + end +end +``` + +At this point, the setup phase is concluded. The deleted `projects` records should be automatically +picked up by the scheduled cleanup worker job. + +## Caveats of loose foreign keys + +### Record creation + +The feature provides an efficient way of cleaning up associated records after the parent record is +deleted. Without foreign keys, it's the application's responsibility to validate if the parent record +exists when a new associated record is created. + +A bad example: record creation with the given ID (`project_id` comes from user input). +In this example, nothing prevents us from passing a random project ID: + +```ruby +Ci::Pipeline.create!(project_id: params[:project_id]) +``` + +A good example: record creation with extra check: + +```ruby +project = Project.find(params[:project_id]) +Ci::Pipeline.create!(project_id: project.id) +``` + +### Association lookup + +Consider the following HTTP request: + +```plaintext +GET /projects/5/pipelines/100 +``` + +The controller action ignores the `project_id` parameter and finds the pipeline using the ID: + +```ruby + def show + # bad, avoid it + pipeline = Ci::Pipeline.find(params[:id]) # 100 +end +``` + +This endpoint still works when the parent `Project` model is deleted. This can be considered a +a data leak which should not happen under normal circumstances: + +```ruby +def show + # good + project = Project.find(params[:project_id]) + pipeline = project.pipelines.find(params[:pipeline_id]) # 100 +end +``` + +NOTE: +This example is unlikely in GitLab, because we usually look up the parent models to perform +permission checks. diff --git a/doc/development/database/multiple_databases.md b/doc/development/database/multiple_databases.md index 0ba752ba3a6..a17ad798305 100644 --- a/doc/development/database/multiple_databases.md +++ b/doc/development/database/multiple_databases.md @@ -88,16 +88,6 @@ test: &test statement_timeout: 120s ``` -### Migrations - -Place any migrations that affect `Ci::CiDatabaseRecord` models -and their tables in two directories: - -- `db/migrate` -- `db/ci_migrate` - -We aim to keep the schema for both tables the same across both databases. - <!-- NOTE: The `validate_cross_joins!` method in `spec/support/database/prevent_cross_joins.rb` references the following heading in the code, so if you make a change to this heading, make sure to update @@ -272,6 +262,62 @@ logic to delete these rows if or whenever necessary in your domain. Finally, this de-normalization and new query also improves performance because it does less joins and needs less filtering. +##### Remove a redundant join + +Sometimes there are cases where a query is doing excess (or redundant) joins. + +A common example occurs where a query is joining from `A` to `C`, via some +table with both foreign keys, `B`. +When you only care about counting how +many rows there are in `C` and if there are foreign keys and `NOT NULL` constraints +on the foreign keys in `B`, then it might be enough to count those rows. +For example, in +[MR 71811](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/71811), it was +previously doing `project.runners.count`, which would produce a query like: + +```sql +select count(*) from projects +inner join ci_runner_projects on ci_runner_projects.project_id = projects.id +where ci_runner_projects.runner_id IN (1, 2, 3) +``` + +This was changed to avoid the cross-join by changing the code to +`project.runner_projects.count`. It produces the same response with the +following query: + +```sql +select count(*) from ci_runner_projects +where ci_runner_projects.runner_id IN (1, 2, 3) +``` + +Another common redundant join is joining all the way to another table, +then filtering by primary key when you could have instead filtered on a foreign +key. See an example in +[MR 71614](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/71614). The previous +code was `joins(scan: :build).where(ci_builds: { id: build_ids })`, which +generated a query like: + +```sql +select ... +inner join security_scans +inner join ci_builds on security_scans.build_id = ci_builds.id +where ci_builds.id IN (1, 2, 3) +``` + +However, as `security_scans` already has a foreign key `build_id`, the code +can be changed to `joins(:scan).where(security_scans: { build_id: build_ids })`, +which produces the same response with the following query: + +```sql +select ... +inner join security_scans +where security_scans.build_id IN (1, 2, 3) +``` + +Both of these examples of removing redundant joins remove the cross-joins, +but they have the added benefit of producing simpler and faster +queries. + ##### Use `disable_joins` for `has_one` or `has_many` `through:` relations Sometimes a join query is caused by using `has_one ... through:` or `has_many |