diff options
Diffstat (limited to 'doc/development/database/multiple_databases.md')
-rw-r--r-- | doc/development/database/multiple_databases.md | 236 |
1 files changed, 203 insertions, 33 deletions
diff --git a/doc/development/database/multiple_databases.md b/doc/development/database/multiple_databases.md index 71dcc5bb866..0fd9f821fab 100644 --- a/doc/development/database/multiple_databases.md +++ b/doc/development/database/multiple_databases.md @@ -24,24 +24,26 @@ configurations. For example, given a `config/database.yml` like below: ```yaml development: - adapter: postgresql - encoding: unicode - database: gitlabhq_development - host: /path/to/gdk/postgresql - pool: 10 - prepared_statements: false - variables: - statement_timeout: 120s + main: + adapter: postgresql + encoding: unicode + database: gitlabhq_development + host: /path/to/gdk/postgresql + pool: 10 + prepared_statements: false + variables: + statement_timeout: 120s test: &test - adapter: postgresql - encoding: unicode - database: gitlabhq_test - host: /path/to/gdk/postgresql - pool: 10 - prepared_statements: false - variables: - statement_timeout: 120s + main: + adapter: postgresql + encoding: unicode + database: gitlabhq_test + host: /path/to/gdk/postgresql + pool: 10 + prepared_statements: false + variables: + statement_timeout: 120s ``` Edit the `config/database.yml` to look like this: @@ -98,18 +100,45 @@ and their tables must be placed in two directories for now: We aim to keep the schema for both tables the same across both databases. +<!-- +NOTE: The `validate_cross_joins!` method in `spec/support/database/prevent_cross_joins.rb` references + the following heading in the code, so if you make a change to this heading, make sure to update + the corresponding documentation URL used in `spec/support/database/prevent_cross_joins.rb`. +--> + ### Removing joins between `ci_*` and non `ci_*` tables -We are planning on moving all the `ci_*` tables to a separate database so +Queries that join across databases raise an error. [Introduced](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/68620) +in GitLab 14.3, for new queries only. Pre-existing queries do not raise an error. + +We are planning on moving all the `ci_*` tables to a separate database, so referencing `ci_*` tables with other tables will not be possible. This means, that using any kind of `JOIN` in SQL queries will not work. We have identified already many such examples that need to be fixed in <https://gitlab.com/groups/gitlab-org/-/epics/6289> . -The following are some real examples that have resulted from this and these -patterns may apply to future cases. +#### Path to removing cross-database joins + +The following steps are the process to remove cross-database joins between +`ci_*` and non `ci_*` tables: + +1. **{check-circle}** Add all failing specs to the [`cross-join-allowlist.yml`](https://gitlab.com/gitlab-org/gitlab/-/blob/f5de89daeb468fc45e1e95a76d1b5297aa53da11/spec/support/database/cross-join-allowlist.yml) + file. +1. **{dotted-circle}** Find the code that caused the spec failure and wrap the isolated code + in [`allow_cross_joins_across_databases`](#allowlist-for-existing-cross-joins). + Link to a new issue assigned to the correct team to remove the specs from the + `cross-join-allowlist.yml` file. +1. **{dotted-circle}** Remove the `cross-join-allowlist.yml` file and stop allowing + whole test files. +1. **{dotted-circle}** Fix the problem and remove the `allow_cross_joins_across_databases` call. +1. **{dotted-circle}** Fix all the cross-joins and remove the `allow_cross_joins_across_databases` method. + +#### Suggestions for removing cross-database joins -#### Remove the code +The following sections are some real examples that were identified as joining across databases, +along with possible suggestions on how to fix them. + +##### Remove the code The simplest solution we've seen several times now has been an existing scope that is unused. This is the easiest example to fix. So the first step is to @@ -131,7 +160,7 @@ to evaluate, because `UsageData` is not critical to users and it may be possible to get a similarly useful metric with a simpler approach. Alternatively we may find that nobody is using these metrics, so we can remove them. -#### Use `preload` instead of `includes` +##### Use `preload` instead of `includes` The `includes` and `preload` methods in Rails are both ways to avoid an N+1 query. The `includes` method in Rails uses a heuristic approach to determine @@ -145,7 +174,7 @@ allows you to avoid the join, while still avoiding the N+1 query. You can see a real example of this solution being used in <https://gitlab.com/gitlab-org/gitlab/-/merge_requests/67655>. -#### De-normalize some foreign key to the table +##### De-normalize some foreign key to the table De-normalization refers to adding redundant precomputed (duplicated) data to a table to simplify certain queries or to improve performance. In this @@ -198,7 +227,7 @@ You can see this approach implemented in <https://gitlab.com/gitlab-org/gitlab/-/merge_requests/66963> . This MR also de-normalizes `pipeline_id` to fix a similar query. -#### De-normalize into an extra table +##### De-normalize into an extra table Sometimes the previous de-normalization (adding an extra column) doesn't work for your specific case. This may be due to the fact that your data is not 1:1, or @@ -245,18 +274,88 @@ logic to delete these rows if or whenever necessary in your domain. Finally, this de-normalization and new query also improves performance because it does less joins and needs less filtering. -#### Summary of cross-join removal patterns +##### Use `disable_joins` for `has_one` or `has_many` `through:` relations + +Sometimes a join query is caused by using `has_one ... through:` or `has_many +... through:` across tables that span the different databases. These joins +sometimes can be solved by adding +[`disable_joins:true`](https://edgeguides.rubyonrails.org/active_record_multiple_databases.html#handling-associations-with-joins-across-databases). +This is a Rails feature which we +[backported](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/66400). We +also extended the feature to allow a lambda syntax for enabling `disable_joins` +with a feature flag. If you use this feature we encourage using a feature flag +as it mitigates risk if there is some serious performance regression. + +You can see an example where this was used in +<https://gitlab.com/gitlab-org/gitlab/-/merge_requests/66709/diffs>. + +With any change to DB queries it is important to analyze and compare the SQL +before and after the change. `disable_joins` can introduce very poorly performing +code depending on the actual logic of the `has_many` or `has_one` relationship. +The key thing to look for is whether any of the intermediate result sets +used to construct the final result set have an unbounded amount of data loaded. +The best way to tell is by looking at the SQL generated and confirming that +each one is limited in some way. You can tell by either a `LIMIT 1` clause or +by `WHERE` clause that is limiting based on a unique column. Any unbounded +intermediate dataset could lead to loading too many IDs into memory. + +An example where you may see very poor performance is the following +hypothetical code: + +```ruby +class Project + has_many :pipelines + has_many :builds, through: :pipelines +end + +class Pipeline + has_many :builds +end + +class Build + belongs_to :pipeline +end + +def some_action + @builds = Project.find(5).builds.order(created_at: :desc).limit(10) +end +``` + +In the above case `some_action` will generate a query like: + +```sql +select * from builds +inner join pipelines on builds.pipeline_id = pipelines.id +where pipelines.project_id = 5 +order by builds.created_at desc +limit 10 +``` + +However, if you changed the relation to be: + +```ruby +class Project + has_many :pipelines + has_many :builds, through: :pipelines, disable_joins: true +end +``` -A quick checklist for fixing a specific join query would be: +Then you would get the following 2 queries: -1. Is the code even used? If not just remove it -1. If the code is used, then is this feature even used or can we implement the - feature in a simpler way and still meet the requirements. Always prefer the - simplest option. -1. Can we remove the join if we de-normalize the data you are joining to by - adding a new column -1. Can we remove the join by adding a new table in the correct database that - replicates the minimum data needed to do the join +```sql +select id from pipelines where project_id = 5; + +select * from builds where pipeline_id in (...) +order by created_at desc +limit 10; +``` + +Because the first query does not limit by any unique column or +have a `LIMIT` clause, it can load an unlimited number of +pipeline IDs into memory, which are then sent in the following query. +This can lead to very poor performance in the Rails application and the +database. In cases like this, you might need to re-write the +query or look at other patterns described above for removing cross-joins. #### How to validate you have correctly removed a cross-join @@ -291,3 +390,74 @@ end You can see a real example of using this method for fixing a cross-join in <https://gitlab.com/gitlab-org/gitlab/-/merge_requests/67655>. + +#### Allowlist for existing cross-joins + +A cross-join across databases can be explicitly allowed by wrapping the code in the +`::Gitlab::Database.allow_cross_joins_across_databases` helper method. + +This method should only be used: + +- For existing code. +- If the code is required to help migrate away from a cross-join. For example, + in a migration that backfills data for future use to remove a cross-join. + +The `allow_cross_joins_across_databases` helper method can be used as follows: + +```ruby +::Gitlab::Database.allow_cross_joins_across_databases(url: 'https://gitlab.com/gitlab-org/gitlab/-/issues/336590') do + subject.perform(1, 4) +end +``` + +The `url` parameter should point to an issue with a milestone for when we intend +to fix the cross-join. If the cross-join is being used in a migration, we do not +need to fix the code. See <https://gitlab.com/gitlab-org/gitlab/-/issues/340017> +for more details. + +## `config/database.yml` + +GitLab will support running multiple databases in the future, for example to [separate tables for the continuous integration features](https://gitlab.com/groups/gitlab-org/-/epics/6167) from the main database. In order to prepare for this change, we [validate the structure of the configuration](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/67877) in `database.yml` to ensure that only known databases are used. + +Previously, the `config/database.yml` would look like this: + +```yaml +production: + adapter: postgresql + encoding: unicode + database: gitlabhq_production + ... +``` + +With the support for many databases the support for this +syntax is deprecated and will be removed in [15.0](https://gitlab.com/gitlab-org/gitlab/-/issues/338182). + +The new `config/database.yml` needs to include a database name +to define a database configuration. Only `main:` and `ci:` database +names are supported today. The `main:` needs to always be a first +entry in a hash. This change applies to decomposed and non-decomposed +change. If an invalidate or deprecated syntax is used the error +or warning will be printed during application start. + +```yaml +# Non-decomposed database +production: + main: + adapter: postgresql + encoding: unicode + database: gitlabhq_production + ... + +# Decomposed database +production: + main: + adapter: postgresql + encoding: unicode + database: gitlabhq_production + ... + ci: + adapter: postgresql + encoding: unicode + database: gitlabhq_production_ci + ... +``` |