diff options
-rw-r--r-- | doc/development/README.md | 1 | ||||
-rw-r--r-- | doc/development/polymorphic_associations.md | 146 |
2 files changed, 147 insertions, 0 deletions
diff --git a/doc/development/README.md b/doc/development/README.md index af4131c4a8f..83eef9dadc3 100644 --- a/doc/development/README.md +++ b/doc/development/README.md @@ -51,6 +51,7 @@ - [Post Deployment Migrations](post_deployment_migrations.md) - [Foreign Keys & Associations](foreign_keys.md) - [Serializing Data](serializing_data.md) +- [Polymorphic Associations](polymorphic_associations.md) ## i18n diff --git a/doc/development/polymorphic_associations.md b/doc/development/polymorphic_associations.md new file mode 100644 index 00000000000..d63b9fb115f --- /dev/null +++ b/doc/development/polymorphic_associations.md @@ -0,0 +1,146 @@ +# Polymorphic Associations + +**Summary:** always use separate tables instead of polymorphic associations. + +Rails makes it possible to define so called "polymorphic associations". This +usually works by adding two columns to a table: a target type column, and a +target id. For example, at the time of writing we have such a setup for +`members` with the following columns: + +* `source_type`: a string defining the model to use, can be either `Project` or + `Namespace`. +* `source_id`: the ID of the row to retrieve based on `source_type`. For + example, when `source_type` is `Project` then `source_id` will contain a + project ID. + +While such a setup may appear to be useful, it comes with many drawbacks; enough +that you should avoid this at all costs. + +## Space Wasted + +Because this setup relies on string values to determine the model to use it will +end up wasting a lot of space. For example, for `Project` and `Namespace` the +maximum size is 9 bytes, plus 1 extra byte for every string when using +PostgreSQL. While this may only be 10 bytes per row, given enough tables and +rows using such a setup we can end up wasting quite a bit of disk space and +memory (for any indexes). + +## Indexes + +Because our associations are broken up into two columns this may result in +requiring composite indexes for queries to be performed efficiently. While +composite indexes are not wrong at all, they can be tricky to set up as the +ordering of columns in these indexes is important to ensure optimal performance. + +## Consistency + +One really big problem with polymorphic associations is being unable to enforce +data consistency on the database level using foreign keys. For consistency to be +enforced on the database level one would have to write their own foreign key +logic to support polymorphic associations. + +Enforcing consistency on the database level is absolutely crucial for +maintaining a healthy environment, and thus is another reason to avoid +polymorphic associations. + +## Query Overhead + +When using polymorphic associations you always need to filter using both +columns. For example, you may end up writing a query like this: + +```sql +SELECT * +FROM members +WHERE source_type = 'Project' +AND source_id = 13083; +``` + +Here PostgreSQL can perform the query quite efficiently if both columns are +indexed, but as the query gets more complex it may not be able to use these +indexes efficiently. + +## Mixed Responsibilities + +Similar to functions and classes a table should have a single responsibility: +storing data with a certain set of pre-defined columns. When using polymorphic +associations you are instead storing different types of data (possibly with +different columns set) in the same table. + +## The Solution + +Fortunately there is a very simple solution to these problems: simply use a +separate table for every type you would otherwise store in the same table. Using +a separate table allows you to use everything a database may provide to ensure +consistency and query data efficiently, without any additional application logic +being necessary. + +Let's say you have a `members` table storing both approved and pending members, +for both projects and groups, and the pending state is determined by the column +`requested_at` being set or not. Schema wise such a setup can lead to various +columns only being set for certain rows, wasting space. It's also possible that +certain indexes will only be set for certain rows, again wasting space. Finally, +querying such a table requires less than ideal queries. For example: + +```sql +SELECT * +FROM members +WHERE requested_at IS NULL +AND source_type = 'GroupMember' +AND source_id = 4 +``` + +Instead such a table should be broken up into separate tables. For example, you +may end up with 4 tables in this case: + +* project_members +* group_members +* pending_project_members +* pending_group_members + +This makes querying data trivial. For example, to get the members of a group +you'd run: + +```sql +SELECT * +FROM group_members +WHERE group_id = 4 +``` + +To get all the pending members of a group in turn you'd run: + +```sql +SELECT * +FROM pending_group_members +WHERE group_id = 4 +``` + +If you want to get both you can use a UNION, though you need to be explicit +about what columns you want to SELECT as otherwise the result set will use the +columns of the first query. For example: + +```sql +SELECT id, 'Group' AS target_type, group_id AS target_id +FROM group_members + +UNION ALL + +SELECT id, 'Project' AS target_type, project_id AS target_id +FROM project_members +``` + +The above example is perhaps a bit silly, but it shows that there's nothing +stopping you from merging the data together and presenting it on the same page. +Selecting columns explicitly can also speed up queries as the database has to do +less work to get the data (compared to selecting all columns, even ones you're +not using). + +Our schema also becomes easier. No longer do we need to both store and index the +`source_type` column, we can define foreign keys easily, and we don't need to +filter rows using the `IS NULL` condition. + +To summarize: using separate tables allows us to use foreign keys effectively, +create indexes only where necessary, conserve space, query data more +efficiently, and scale these tables more easily (e.g. by storing them on +separate disks). A nice side effect of this is that code can also become easier +as you won't end up with a single model having to handle different kinds of +data. |