From 862da3cfed8db462589d0271e144f21f408cf73b Mon Sep 17 00:00:00 2001 From: Yorick Peterse Date: Thu, 10 Aug 2017 17:53:20 +0200 Subject: Add more database development related docs --- doc/development/README.md | 2 + .../database_merge_request_checklist.md | 15 +++ doc/development/ordering_table_columns.md | 127 +++++++++++++++++++++ doc/development/sql.md | 26 +++++ 4 files changed, 170 insertions(+) create mode 100644 doc/development/database_merge_request_checklist.md create mode 100644 doc/development/ordering_table_columns.md diff --git a/doc/development/README.md b/doc/development/README.md index 58993c52dcd..36367f0a60d 100644 --- a/doc/development/README.md +++ b/doc/development/README.md @@ -46,6 +46,7 @@ ## Databases +- [Merge Request Checklist](database_merge_request_checklist.md) - [What requires downtime?](what_requires_downtime.md) - [Adding database indexes](adding_database_indexes.md) - [Post Deployment Migrations](post_deployment_migrations.md) @@ -56,6 +57,7 @@ - [Background Migrations](background_migrations.md) - [Storing SHA1 Hashes As Binary](sha1_as_binary.md) - [Iterating Tables In Batches](iterating_tables_in_batches.md) +- [Ordering Table Columns](ordering_table_columns.md) ## i18n diff --git a/doc/development/database_merge_request_checklist.md b/doc/development/database_merge_request_checklist.md new file mode 100644 index 00000000000..75c395b61ef --- /dev/null +++ b/doc/development/database_merge_request_checklist.md @@ -0,0 +1,15 @@ +# Merge Request Checklist + +When creating a merge request that performs database related changes (schema +changes, adjusting queries to optimise performance, etc) you should use the +merge request template called "Database Changes". This template contains a +checklist of steps to follow to make sure the changes are up to snuff. + +To use the checklist, create a new merge request and click on the "Choose a +template" dropdown, then click "Database Changes". + +An example of this checklist can be found at +https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/12463. + +The source code of the checklist can be found in at +https://gitlab.com/gitlab-org/gitlab-ce/blob/master/.gitlab/merge_request_templates/Database%20Changes.md diff --git a/doc/development/ordering_table_columns.md b/doc/development/ordering_table_columns.md new file mode 100644 index 00000000000..249e70c7b0e --- /dev/null +++ b/doc/development/ordering_table_columns.md @@ -0,0 +1,127 @@ +# Ordering Table Columns + +Similar to C structures the space of a table is influenced by the order of +columns. This is because the size of columns is aligned depending on the type of +the column. Take the following column order for example: + +* id (integer, 4 bytes) +* name (text, variable) +* user_id (integer, 4 bytes) + +Integers are aligned to the word size. This means that on a 64 bit platform the +actual size of each column would be: 8 bytes, variable, 8 bytes. This means that +each row will require at least 16 bytes for the two integers, and a variable +amount for the text field. If a table has a few rows this is not an issue, but +once you start storing millions of rows you can save space by using a different +order. For the above example a more ideal column order would be the following: + +* id (integer, 4 bytes) +* user_id (integer, 4 bytes) +* name (text, variable) + +In this setup the `id` and `user_id` columns can be packed together, which means +we only need 8 bytes to store _both_ of them. This in turn each row will require +8 bytes less of space. + +For GitLab we require that columns of new tables are ordered based to use the +least amount of space. An easy way of doing this is to order them based on the +type size in descending order with variable sizes (string and text columns for +example) at the end. + +## Type Sizes + +While the PostgreSQL docuemntation +(https://www.postgresql.org/docs/current/static/datatype.html) contains plenty +of information we will list the sizes of common types here so it's easier to +look them up. Here "word" refers to the word size, which is 4 bytes for a 32 +bits platform and 8 bytes for a 64 bits platform. + +| Type | Size | Aligned To | +|:-----------------|:-------------------------------------|:-----------| +| smallint | 2 bytes | 1 word | +| integer | 4 bytes | 1 word | +| bigint | 8 bytes | 8 bytes | +| real | 4 bytes | 1 word | +| double precision | 8 bytes | 8 bytes | +| boolean | 1 byte | not needed | +| text / string | variable, 1 byte plus the data | 1 word | +| bytea | variable, 1 or 4 bytes plus the data | 1 word | +| timestamp | 8 bytes | 8 bytes | +| timestamptz | 8 bytes | 8 bytes | +| date | 4 bytes | 1 word | + +A "variable" size means the actual size depends on the value being stored. If +PostgreSQL determines this can be embedded directly into a row it may do so, but +for very large values it will store the data externally and store a pointer (of +1 word in size) in the column. Because of this variable sized columns should +always be at the end of a table. + +## Real Example + +Let's use the "events" table as an example, which currently has the following +layout: + +| Column | Type | Size | +|:------------|:----------------------------|:---------| +| id | integer | 4 bytes | +| target_type | character varying | variable | +| target_id | integer | 4 bytes | +| title | character varying | variable | +| data | text | variable | +| project_id | integer | 4 bytes | +| created_at | timestamp without time zone | 8 bytes | +| updated_at | timestamp without time zone | 8 bytes | +| action | integer | 4 bytes | +| author_id | integer | 4 bytes | + +After adding padding to align the columns this would translate to columns being +divided into fixed size chunks as follows: + +| Chunk Size | Columns | +|:-----------|:------------------| +| 8 bytes | id | +| variable | target_type | +| 8 bytes | target_id | +| variable | title | +| variable | data | +| 8 bytes | project_id | +| 8 bytes | created_at | +| 8 bytes | updated_at | +| 8 bytes | action, author_id | + +This means that excluding the variable sized data we need at least 48 bytes per +row. + +We can optimise this by using the following column order instead: + +| Column | Type | Size | +|:------------|:----------------------------|:---------| +| created_at | timestamp without time zone | 8 bytes | +| updated_at | timestamp without time zone | 8 bytes | +| id | integer | 4 bytes | +| target_id | integer | 4 bytes | +| project_id | integer | 4 bytes | +| action | integer | 4 bytes | +| author_id | integer | 4 bytes | +| target_type | character varying | variable | +| title | character varying | variable | +| data | text | variable | + +This would produce the following chunks: + +| Chunk Size | Columns | +|:-----------|:-------------------| +| 8 bytes | created_at | +| 8 bytes | updated_at | +| 8 bytes | id, target_id | +| 8 bytes | project_id, action | +| 8 bytes | author_id | +| variable | target_type | +| variable | title | +| variable | data | + +Here we only need 40 bytes per row excluding the variable sized data. 8 bytes +being saved may not sound like much, but for tables as large as the "events" +table it does begin to matter. For example, when storing 80 000 000 rows this +translates to a space saving of at least 610 MB: all by just changing the order +of a few columns. diff --git a/doc/development/sql.md b/doc/development/sql.md index 23fd7604957..974b1d99dff 100644 --- a/doc/development/sql.md +++ b/doc/development/sql.md @@ -216,4 +216,30 @@ exact same results. This also means there's no need to add an index on `created_at` to ensure consistent performance as `id` is already indexed by default. +## Use WHERE EXISTS instead of WHERE IN + +While `WHERE IN` and `WHERE EXISTS` can be used to produce the same data it is +recommended to use `WHERE EXISTS` whenever possible. While in many cases +PostgreSQL can optimise `WHERE IN` quite well there are also many cases where +`WHERE EXISTS` will perform (much) better. + +In Rails you have to use this by creating SQL fragments: + +```ruby +Project.where('EXISTS (?)', User.select(1).where('projects.creator_id = users.id AND users.foo = X')) +``` + +This would then produce a query along the lines of the following: + +```sql +SELECT * +FROM projects +WHERE EXISTS ( + SELECT 1 + FROM users + WHERE projects.creator_id = users.id + AND users.foo = X +) +``` + [gin-index]: http://www.postgresql.org/docs/current/static/gin.html -- cgit v1.2.1