Document not using polymorphic associations

Instead of using polymorphic associations a developer should use separate tables.
author: Yorick Peterse <yorickpeterse@gmail.com> 2017-05-08 13:36:20 +0200
committer: Yorick Peterse <yorickpeterse@gmail.com> 2017-06-07 17:36:55 +0200
commit: 44d65c36dbe2f38eacb1858a99996c025b755937 (patch)
tree: 372fdf94b479c0624207cf5471a75f03ccbaf44f /doc/development
parent: 0601ac5d0b0330fb21228bef82650b3d3f6898cb (diff)
download: gitlab-ce-44d65c36dbe2f38eacb1858a99996c025b755937.tar.gz
2 files changed, 147 insertions, 0 deletions
diff --git a/doc/development/README.md b/doc/development/README.md
index af4131c4a8f..83eef9dadc3 100644
--- a/doc/development/README.md
+++ b/doc/development/README.md
@@ -51,6 +51,7 @@
 - [Post Deployment Migrations](post_deployment_migrations.md)
 - [Foreign Keys & Associations](foreign_keys.md)
 - [Serializing Data](serializing_data.md)
+- [Polymorphic Associations](polymorphic_associations.md)
 
 ## i18n
 
diff --git a/doc/development/polymorphic_associations.md b/doc/development/polymorphic_associations.md
new file mode 100644
index 00000000000..d63b9fb115f
--- /dev/null
+++ b/doc/development/polymorphic_associations.md
@@ -0,0 +1,146 @@
+# Polymorphic Associations
+
+**Summary:** always use separate tables instead of polymorphic associations.
+
+Rails makes it possible to define so called "polymorphic associations". This
+usually works by adding two columns to a table: a target type column, and a
+target id. For example, at the time of writing we have such a setup for
+`members` with the following columns:
+
+* `source_type`: a string defining the model to use, can be either `Project` or
+  `Namespace`.
+* `source_id`: the ID of the row to retrieve based on `source_type`. For
+  example, when `source_type` is `Project` then `source_id` will contain a
+  project ID.
+
+While such a setup may appear to be useful, it comes with many drawbacks; enough
+that you should avoid this at all costs.
+
+## Space Wasted
+
+Because this setup relies on string values to determine the model to use it will
+end up wasting a lot of space. For example, for `Project` and `Namespace` the
+maximum size is 9 bytes, plus 1 extra byte for every string when using
+PostgreSQL. While this may only be 10 bytes per row, given enough tables and
+rows using such a setup we can end up wasting quite a bit of disk space and
+memory (for any indexes).
+
+## Indexes
+
+Because our associations are broken up into two columns this may result in
+requiring composite indexes for queries to be performed efficiently. While
+composite indexes are not wrong at all, they can be tricky to set up as the
+ordering of columns in these indexes is important to ensure optimal performance.
+
+## Consistency
+
+One really big problem with polymorphic associations is being unable to enforce
+data consistency on the database level using foreign keys. For consistency to be
+enforced on the database level one would have to write their own foreign key
+logic to support polymorphic associations.
+
+Enforcing consistency on the database level is absolutely crucial for
+maintaining a healthy environment, and thus is another reason to avoid
+polymorphic associations.
+
+## Query Overhead
+
+When using polymorphic associations you always need to filter using both
+columns. For example, you may end up writing a query like this:
+
+```sql
+SELECT *
+FROM members
+WHERE source_type = 'Project'
+AND source_id = 13083;
+```
+
+Here PostgreSQL can perform the query quite efficiently if both columns are
+indexed, but as the query gets more complex it may not be able to use these
+indexes efficiently.
+
+## Mixed Responsibilities
+
+Similar to functions and classes a table should have a single responsibility:
+storing data with a certain set of pre-defined columns. When using polymorphic
+associations you are instead storing different types of data (possibly with
+different columns set) in the same table.
+
+## The Solution
+
+Fortunately there is a very simple solution to these problems: simply use a
+separate table for every type you would otherwise store in the same table. Using
+a separate table allows you to use everything a database may provide to ensure
+consistency and query data efficiently, without any additional application logic
+being necessary.
+
+Let's say you have a `members` table storing both approved and pending members,
+for both projects and groups, and the pending state is determined by the column
+`requested_at` being set or not. Schema wise such a setup can lead to various
+columns only being set for certain rows, wasting space. It's also possible that
+certain indexes will only be set for certain rows, again wasting space. Finally,
+querying such a table requires less than ideal queries. For example:
+
+```sql
+SELECT *
+FROM members
+WHERE requested_at IS NULL
+AND source_type = 'GroupMember'
+AND source_id = 4
+```
+
+Instead such a table should be broken up into separate tables. For example, you
+may end up with 4 tables in this case:
+
+* project_members
+* group_members
+* pending_project_members
+* pending_group_members
+
+This makes querying data trivial. For example, to get the members of a group
+you'd run:
+
+```sql
+SELECT *
+FROM group_members
+WHERE group_id = 4
+```
+
+To get all the pending members of a group in turn you'd run:
+
+```sql
+SELECT *
+FROM pending_group_members
+WHERE group_id = 4
+```
+
+If you want to get both you can use a UNION, though you need to be explicit
+about what columns you want to SELECT as otherwise the result set will use the
+columns of the first query. For example:
+
+```sql
+SELECT id, 'Group' AS target_type, group_id AS target_id
+FROM group_members
+
+UNION ALL
+
+SELECT id, 'Project' AS target_type, project_id AS target_id
+FROM project_members
+```
+
+The above example is perhaps a bit silly, but it shows that there's nothing
+stopping you from merging the data together and presenting it on the same page.
+Selecting columns explicitly can also speed up queries as the database has to do
+less work to get the data (compared to selecting all columns, even ones you're
+not using).
+
+Our schema also becomes easier. No longer do we need to both store and index the
+`source_type` column, we can define foreign keys easily, and we don't need to
+filter rows using the `IS NULL` condition.
+
+To summarize: using separate tables allows us to use foreign keys effectively,
+create indexes only where necessary, conserve space, query data more
+efficiently, and scale these tables more easily (e.g. by storing them on
+separate disks). A nice side effect of this is that code can also become easier
+as you won't end up with a single model having to handle different kinds of
+data.
author	Yorick Peterse <yorickpeterse@gmail.com>	2017-05-08 13:36:20 +0200
committer	Yorick Peterse <yorickpeterse@gmail.com>	2017-06-07 17:36:55 +0200
commit	44d65c36dbe2f38eacb1858a99996c025b755937 (patch)
tree	372fdf94b479c0624207cf5471a75f03ccbaf44f /doc/development
parent	0601ac5d0b0330fb21228bef82650b3d3f6898cb (diff)
download	gitlab-ce-44d65c36dbe2f38eacb1858a99996c025b755937.tar.gz