diff options
author | Achilleas Pipinellis <axilleas@axilleas.me> | 2016-08-16 17:03:22 +0000 |
---|---|---|
committer | Achilleas Pipinellis <axilleas@axilleas.me> | 2016-08-16 17:03:22 +0000 |
commit | 07d5759408000ff46088ad886df149e21faf676f (patch) | |
tree | b62b612847a7b070fbe8488df9db95fbcbfa7656 /doc | |
parent | b44b09b302e6f4d19c7e6b2dac3be62fe83b31b2 (diff) | |
parent | 5f7394070f92acb2b7858b792034b30102fe1d9b (diff) | |
download | gitlab-ce-07d5759408000ff46088ad886df149e21faf676f.tar.gz |
Merge branch 'document-when-to-add-indexes' into 'master'
Added documentation on adding database indexes
## What does this MR do?
This MR adds a guide on when to add database indexes, how to approach this problem, etc
## Are there points in the code the reviewer needs to double check?
Spelling, grammar, etc
## Why was this MR needed?
Blindly adding indexes can be harmful to the database, and I recently had to remove 48-something unused indexes.
## What are the relevant issue numbers?
#20767
## Does this MR meet the acceptance criteria?
- [x] [Documentation created/updated](https://gitlab.com/gitlab-org/gitlab-ce/blob/master/doc/development/doc_styleguide.md)
- Tests
- [x] All builds are passing
- [ ] Conform by the [style guides](https://gitlab.com/gitlab-org/gitlab-ce/blob/master/CONTRIBUTING.md#style-guides)
- [x] Branch has no merge conflicts with `master` (if you do - rebase it please)
- [x] [Squashed related commits together](https://git-scm.com/book/en/Git-Tools-Rewriting-History#Squashing-Commits)
See merge request !5771
Diffstat (limited to 'doc')
-rw-r--r-- | doc/development/README.md | 4 | ||||
-rw-r--r-- | doc/development/adding_database_indexes.md | 123 |
2 files changed, 127 insertions, 0 deletions
diff --git a/doc/development/README.md b/doc/development/README.md index bf67b5d8dff..57f37da6f80 100644 --- a/doc/development/README.md +++ b/doc/development/README.md @@ -30,7 +30,11 @@ - [Rake tasks](rake_tasks.md) for development - [Shell commands](shell_commands.md) in the GitLab codebase - [Sidekiq debugging](sidekiq_debugging.md) + +## Databases + - [What requires downtime?](what_requires_downtime.md) +- [Adding database indexes](adding_database_indexes.md) ## Compliance diff --git a/doc/development/adding_database_indexes.md b/doc/development/adding_database_indexes.md new file mode 100644 index 00000000000..ea6f14da3b9 --- /dev/null +++ b/doc/development/adding_database_indexes.md @@ -0,0 +1,123 @@ +# Adding Database Indexes + +Indexes can be used to speed up database queries, but when should you add a new +index? Traditionally the answer to this question has been to add an index for +every column used for filtering or joining data. For example, consider the +following query: + +```sql +SELECT * +FROM projects +WHERE user_id = 2; +``` + +Here we are filtering by the `user_id` column and as such a developer may decide +to index this column. + +While in certain cases indexing columns using the above approach may make sense +it can actually have a negative impact. Whenever you write data to a table any +existing indexes need to be updated. The more indexes there are the slower this +can potentially become. Indexes can also take up quite some disk space depending +on the amount of data indexed and the index type. For example, PostgreSQL offers +"GIN" indexes which can be used to index certain data types that can not be +indexed by regular btree indexes. These indexes however generally take up more +data and are slower to update compared to btree indexes. + +Because of all this one should not blindly add a new index for every column used +to filter data by. Instead one should ask themselves the following questions: + +1. Can I write my query in such a way that it re-uses as many existing indexes + as possible? +2. Is the data going to be large enough that using an index will actually be + faster than just iterating over the rows in the table? +3. Is the overhead of maintaining the index worth the reduction in query + timings? + +We'll explore every question in detail below. + +## Re-using Queries + +The first step is to make sure your query re-uses as many existing indexes as +possible. For example, consider the following query: + +```sql +SELECT * +FROM todos +WHERE user_id = 123 +AND state = 'open'; +``` + +Now imagine we already have an index on the `user_id` column but not on the +`state` column. One may think this query will perform badly due to `state` being +unindexed. In reality the query may perform just fine given the index on +`user_id` can filter out enough rows. + +The best way to determine if indexes are re-used is to run your query using +`EXPLAIN ANALYZE`. Depending on any extra tables that may be joined and +other columns being used for filtering you may find an extra index is not going +to make much (if any) difference. On the other hand you may determine that the +index _may_ make a difference. + +In short: + +1. Try to write your query in such a way that it re-uses as many existing + indexes as possible. +2. Run the query using `EXPLAIN ANALYZE` and study the output to find the most + ideal query. + +## Data Size + +A database may decide not to use an index despite it existing in case a regular +sequence scan (= simply iterating over all existing rows) is faster. This is +especially the case for small tables. + +If a table is expected to grow in size and you expect your query has to filter +out a lot of rows you may want to consider adding an index. If the table size is +very small (e.g. only a handful of rows) or any existing indexes filter out +enough rows you may _not_ want to add a new index. + +## Maintenance Overhead + +Indexes have to be updated on every table write. In case of PostgreSQL _all_ +existing indexes will be updated whenever data is written to a table. As a +result of this having many indexes on the same table will slow down writes. + +Because of this one should ask themselves: is the reduction in query performance +worth the overhead of maintaining an extra index? + +If adding an index reduces SELECT timings by 5 milliseconds but increases +INSERT/UPDATE/DELETE timings by 10 milliseconds then the index may not be worth +it. On the other hand, if SELECT timings are reduced but INSERT/UPDATE/DELETE +timings are not affected you may want to add the index after all. + +## Finding Unused Indexes + +To see which indexes are unused you can run the following query: + +```sql +SELECT relname as table_name, indexrelname as index_name, idx_scan, idx_tup_read, idx_tup_fetch, pg_size_pretty(pg_relation_size(indexrelname::regclass)) +FROM pg_stat_all_indexes +WHERE schemaname = 'public' +AND idx_scan = 0 +AND idx_tup_read = 0 +AND idx_tup_fetch = 0 +ORDER BY pg_relation_size(indexrelname::regclass) desc; +``` + +This query outputs a list containing all indexes that are never used and sorts +them by indexes sizes in descending order. This query can be useful to +determine if any previously indexes are useful after all. More information on +the meaning of the various columns can be found at +<https://www.postgresql.org/docs/current/static/monitoring-stats.html>. + +Because the output of this query relies on the actual usage of your database it +may be affected by factors such as (but not limited to): + +* Certain queries never being executed, thus not being able to use certain + indexes. +* Certain tables having little data, resulting in PostgreSQL using sequence + scans instead of index scans. + +In other words, this data is only reliable for a frequently used database with +plenty of data and with as many GitLab features enabled (and being used) as +possible. |