summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorRémy Coutable <remy@rymai.me>2017-08-16 15:03:41 +0000
committerRémy Coutable <remy@rymai.me>2017-08-16 15:03:41 +0000
commitba89ee1f7d9e126dc6306a857da5abe816a18047 (patch)
tree43a0050d82b39a8d47526e6e7c296df5c0fc47cd
parent53edaa3111ff752ffc000fcca0f64595948cbeb3 (diff)
parenta4a8cae7e1d4f5c72ddc0fce18d8530bf0e6c911 (diff)
downloadgitlab-ce-ba89ee1f7d9e126dc6306a857da5abe816a18047.tar.gz
Merge branch 'docs/database-development' into 'master'
Add more database development related docs See merge request !13466
-rw-r--r--doc/development/README.md4
-rw-r--r--doc/development/database_merge_request_checklist.md15
-rw-r--r--doc/development/hash_indexes.md20
-rw-r--r--doc/development/ordering_table_columns.md127
-rw-r--r--doc/development/serializing_data.md3
-rw-r--r--doc/development/sql.md26
-rw-r--r--doc/development/verifying_database_capabilities.md26
7 files changed, 220 insertions, 1 deletions
diff --git a/doc/development/README.md b/doc/development/README.md
index 58993c52dcd..dd150421b65 100644
--- a/doc/development/README.md
+++ b/doc/development/README.md
@@ -46,6 +46,7 @@
## Databases
+- [Merge Request Checklist](database_merge_request_checklist.md)
- [What requires downtime?](what_requires_downtime.md)
- [Adding database indexes](adding_database_indexes.md)
- [Post Deployment Migrations](post_deployment_migrations.md)
@@ -56,6 +57,9 @@
- [Background Migrations](background_migrations.md)
- [Storing SHA1 Hashes As Binary](sha1_as_binary.md)
- [Iterating Tables In Batches](iterating_tables_in_batches.md)
+- [Ordering Table Columns](ordering_table_columns.md)
+- [Verifying Database Capabilities](verifying_database_capabilities.md)
+- [Hash Indexes](hash_indexes.md)
## i18n
diff --git a/doc/development/database_merge_request_checklist.md b/doc/development/database_merge_request_checklist.md
new file mode 100644
index 00000000000..75c395b61ef
--- /dev/null
+++ b/doc/development/database_merge_request_checklist.md
@@ -0,0 +1,15 @@
+# Merge Request Checklist
+
+When creating a merge request that performs database related changes (schema
+changes, adjusting queries to optimise performance, etc) you should use the
+merge request template called "Database Changes". This template contains a
+checklist of steps to follow to make sure the changes are up to snuff.
+
+To use the checklist, create a new merge request and click on the "Choose a
+template" dropdown, then click "Database Changes".
+
+An example of this checklist can be found at
+https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/12463.
+
+The source code of the checklist can be found in at
+https://gitlab.com/gitlab-org/gitlab-ce/blob/master/.gitlab/merge_request_templates/Database%20Changes.md
diff --git a/doc/development/hash_indexes.md b/doc/development/hash_indexes.md
new file mode 100644
index 00000000000..e6c1b3590b1
--- /dev/null
+++ b/doc/development/hash_indexes.md
@@ -0,0 +1,20 @@
+# Hash Indexes
+
+Both PostgreSQL and MySQL support hash indexes besides the regular btree
+indexes. Hash indexes however are to be avoided at all costs. While they may
+_sometimes_ provide better performance the cost of rehashing can be very high.
+More importantly: at least until PostgreSQL 10.0 hash indexes are not
+WAL-logged, meaning they are not replicated to any replicas. From the PostgreSQL
+documentation:
+
+> Hash index operations are not presently WAL-logged, so hash indexes might need
+> to be rebuilt with REINDEX after a database crash if there were unwritten
+> changes. Also, changes to hash indexes are not replicated over streaming or
+> file-based replication after the initial base backup, so they give wrong
+> answers to queries that subsequently use them. For these reasons, hash index
+> use is presently discouraged.
+
+RuboCop is configured to register an offence when it detects the use of a hash
+index.
+
+Instead of using hash indexes you should use regular btree indexes.
diff --git a/doc/development/ordering_table_columns.md b/doc/development/ordering_table_columns.md
new file mode 100644
index 00000000000..249e70c7b0e
--- /dev/null
+++ b/doc/development/ordering_table_columns.md
@@ -0,0 +1,127 @@
+# Ordering Table Columns
+
+Similar to C structures the space of a table is influenced by the order of
+columns. This is because the size of columns is aligned depending on the type of
+the column. Take the following column order for example:
+
+* id (integer, 4 bytes)
+* name (text, variable)
+* user_id (integer, 4 bytes)
+
+Integers are aligned to the word size. This means that on a 64 bit platform the
+actual size of each column would be: 8 bytes, variable, 8 bytes. This means that
+each row will require at least 16 bytes for the two integers, and a variable
+amount for the text field. If a table has a few rows this is not an issue, but
+once you start storing millions of rows you can save space by using a different
+order. For the above example a more ideal column order would be the following:
+
+* id (integer, 4 bytes)
+* user_id (integer, 4 bytes)
+* name (text, variable)
+
+In this setup the `id` and `user_id` columns can be packed together, which means
+we only need 8 bytes to store _both_ of them. This in turn each row will require
+8 bytes less of space.
+
+For GitLab we require that columns of new tables are ordered based to use the
+least amount of space. An easy way of doing this is to order them based on the
+type size in descending order with variable sizes (string and text columns for
+example) at the end.
+
+## Type Sizes
+
+While the PostgreSQL docuemntation
+(https://www.postgresql.org/docs/current/static/datatype.html) contains plenty
+of information we will list the sizes of common types here so it's easier to
+look them up. Here "word" refers to the word size, which is 4 bytes for a 32
+bits platform and 8 bytes for a 64 bits platform.
+
+| Type | Size | Aligned To |
+|:-----------------|:-------------------------------------|:-----------|
+| smallint | 2 bytes | 1 word |
+| integer | 4 bytes | 1 word |
+| bigint | 8 bytes | 8 bytes |
+| real | 4 bytes | 1 word |
+| double precision | 8 bytes | 8 bytes |
+| boolean | 1 byte | not needed |
+| text / string | variable, 1 byte plus the data | 1 word |
+| bytea | variable, 1 or 4 bytes plus the data | 1 word |
+| timestamp | 8 bytes | 8 bytes |
+| timestamptz | 8 bytes | 8 bytes |
+| date | 4 bytes | 1 word |
+
+A "variable" size means the actual size depends on the value being stored. If
+PostgreSQL determines this can be embedded directly into a row it may do so, but
+for very large values it will store the data externally and store a pointer (of
+1 word in size) in the column. Because of this variable sized columns should
+always be at the end of a table.
+
+## Real Example
+
+Let's use the "events" table as an example, which currently has the following
+layout:
+
+| Column | Type | Size |
+|:------------|:----------------------------|:---------|
+| id | integer | 4 bytes |
+| target_type | character varying | variable |
+| target_id | integer | 4 bytes |
+| title | character varying | variable |
+| data | text | variable |
+| project_id | integer | 4 bytes |
+| created_at | timestamp without time zone | 8 bytes |
+| updated_at | timestamp without time zone | 8 bytes |
+| action | integer | 4 bytes |
+| author_id | integer | 4 bytes |
+
+After adding padding to align the columns this would translate to columns being
+divided into fixed size chunks as follows:
+
+| Chunk Size | Columns |
+|:-----------|:------------------|
+| 8 bytes | id |
+| variable | target_type |
+| 8 bytes | target_id |
+| variable | title |
+| variable | data |
+| 8 bytes | project_id |
+| 8 bytes | created_at |
+| 8 bytes | updated_at |
+| 8 bytes | action, author_id |
+
+This means that excluding the variable sized data we need at least 48 bytes per
+row.
+
+We can optimise this by using the following column order instead:
+
+| Column | Type | Size |
+|:------------|:----------------------------|:---------|
+| created_at | timestamp without time zone | 8 bytes |
+| updated_at | timestamp without time zone | 8 bytes |
+| id | integer | 4 bytes |
+| target_id | integer | 4 bytes |
+| project_id | integer | 4 bytes |
+| action | integer | 4 bytes |
+| author_id | integer | 4 bytes |
+| target_type | character varying | variable |
+| title | character varying | variable |
+| data | text | variable |
+
+This would produce the following chunks:
+
+| Chunk Size | Columns |
+|:-----------|:-------------------|
+| 8 bytes | created_at |
+| 8 bytes | updated_at |
+| 8 bytes | id, target_id |
+| 8 bytes | project_id, action |
+| 8 bytes | author_id |
+| variable | target_type |
+| variable | title |
+| variable | data |
+
+Here we only need 40 bytes per row excluding the variable sized data. 8 bytes
+being saved may not sound like much, but for tables as large as the "events"
+table it does begin to matter. For example, when storing 80 000 000 rows this
+translates to a space saving of at least 610 MB: all by just changing the order
+of a few columns.
diff --git a/doc/development/serializing_data.md b/doc/development/serializing_data.md
index 2b56f48bc44..37332c20147 100644
--- a/doc/development/serializing_data.md
+++ b/doc/development/serializing_data.md
@@ -1,7 +1,8 @@
# Serializing Data
**Summary:** don't store serialized data in the database, use separate columns
-and/or tables instead.
+and/or tables instead. This includes storing of comma separated values as a
+string.
Rails makes it possible to store serialized data in JSON, YAML or other formats.
Such a field can be defined as follows:
diff --git a/doc/development/sql.md b/doc/development/sql.md
index 23fd7604957..974b1d99dff 100644
--- a/doc/development/sql.md
+++ b/doc/development/sql.md
@@ -216,4 +216,30 @@ exact same results. This also means there's no need to add an index on
`created_at` to ensure consistent performance as `id` is already indexed by
default.
+## Use WHERE EXISTS instead of WHERE IN
+
+While `WHERE IN` and `WHERE EXISTS` can be used to produce the same data it is
+recommended to use `WHERE EXISTS` whenever possible. While in many cases
+PostgreSQL can optimise `WHERE IN` quite well there are also many cases where
+`WHERE EXISTS` will perform (much) better.
+
+In Rails you have to use this by creating SQL fragments:
+
+```ruby
+Project.where('EXISTS (?)', User.select(1).where('projects.creator_id = users.id AND users.foo = X'))
+```
+
+This would then produce a query along the lines of the following:
+
+```sql
+SELECT *
+FROM projects
+WHERE EXISTS (
+ SELECT 1
+ FROM users
+ WHERE projects.creator_id = users.id
+ AND users.foo = X
+)
+```
+
[gin-index]: http://www.postgresql.org/docs/current/static/gin.html
diff --git a/doc/development/verifying_database_capabilities.md b/doc/development/verifying_database_capabilities.md
new file mode 100644
index 00000000000..cc6d62957e3
--- /dev/null
+++ b/doc/development/verifying_database_capabilities.md
@@ -0,0 +1,26 @@
+# Verifying Database Capabilities
+
+Sometimes certain bits of code may only work on a certain database and/or
+version. While we try to avoid such code as much as possible sometimes it is
+necessary to add database (version) specific behaviour.
+
+To facilitate this we have the following methods that you can use:
+
+* `Gitlab::Database.postgresql?`: returns `true` if PostgreSQL is being used
+* `Gitlab::Database.mysql?`: returns `true` if MySQL is being used
+* `Gitlab::Database.version`: returns the PostgreSQL version number as a string
+ in the format `X.Y.Z`. This method does not work for MySQL
+
+This allows you to write code such as:
+
+```ruby
+if Gitlab::Database.postgresql?
+ if Gitlab::Database.version.to_f >= 9.6
+ run_really_fast_query
+ else
+ run_fast_query
+ end
+else
+ run_query
+end
+```