Add development documentation on label filteringdocs-label-filtering

This topic comes up periodically, and we've investigated several options for changing the way we filter labels. Unfortunately, we have yet to find an option that is strictly better than the current state.
author: Sean McGivern <sean@gitlab.com> 2019-07-15 14:12:38 +0100
committer: Sean McGivern <sean@gitlab.com> 2019-07-25 08:56:45 +0100
commit: 1ec7dae9adddbd73f6d2d3eb87b8c4473038d50a (patch)
tree: 76923568d4d390c254fca41846c33c992e33bd40
parent: f2582fbcddb353a640cf65acae5714b025ef07ad (diff)
download: gitlab-ce-docs-label-filtering.tar.gz
2 files changed, 170 insertions, 0 deletions
diff --git a/doc/development/README.md b/doc/development/README.md
index a74770ae383..45efa744a6b 100644
--- a/doc/development/README.md
+++ b/doc/development/README.md
@@ -111,6 +111,10 @@ description: 'Learn how to contribute to GitLab.'
 - [Database helper modules](database_helpers.md)
 - [Code comments](code_comments.md)
 
+## Case studies
+
+- [Database case study: Filtering by label](filtering_by_label.md)
+
 ## Integration guides
 
 - [Jira Connect app](integrations/jira_connect.md)
diff --git a/doc/development/filtering_by_label.md b/doc/development/filtering_by_label.md
new file mode 100644
index 00000000000..6e6b71b1787
--- /dev/null
+++ b/doc/development/filtering_by_label.md
@@ -0,0 +1,166 @@
+# Filtering by label
+
+## Introduction
+
+GitLab has [labels](../user/project/labels.md) that can be assigned to issues,
+merge requests, and epics. Labels on those objects are a many-to-many relation
+through the polymorphic `label_links` table.
+
+To filter these objects by multiple labels - for instance, 'all open
+issues with the label ~Plan and the label ~backend' - we generate a
+query containing a `GROUP BY` clause. In a simple form, this looks like:
+
+```sql
+SELECT
+    issues.*
+FROM
+    issues
+    INNER JOIN label_links ON label_links.target_id = issues.id
+        AND label_links.target_type = 'Issue'
+    INNER JOIN labels ON labels.id = label_links.label_id
+WHERE
+    issues.project_id = 13083
+    AND (issues.state IN ('opened'))
+    AND labels.title IN ('Plan',
+        'backend')
+GROUP BY
+    issues.id
+HAVING (COUNT(DISTINCT labels.title) = 2)
+ORDER BY
+    issues.updated_at DESC,
+    issues.id DESC
+LIMIT 20 OFFSET 0
+```
+
+In particular, note that:
+
+1. We `GROUP BY issues.id` so that we can ...
+2. Use the `HAVING (COUNT(DISTINCT labels.title) = 2)` condition to ensure that
+   all matched issues have both labels.
+
+This is more complicated than is ideal. It makes the query construction more
+prone to errors (such as
+[gitlab-org/gitlab-ce#15557](https://gitlab.com/gitlab-org/gitlab-ce/issues/15557)).
+
+## Attempt A: WHERE EXISTS
+
+### Attempt A1: use multiple subqueries with WHERE EXISTS
+
+In
+[gitlab-org/gitlab-ce#37137](https://gitlab.com/gitlab-org/gitlab-ce/issues/37137)
+and its associated merge request
+[gitlab-org/gitlab-ce!14022](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14022),
+we tried to replace the `GROUP BY` with multiple uses of `WHERE EXISTS`. For the
+example above, this would give:
+
+```sql
+WHERE (EXISTS (
+        SELECT
+            TRUE
+        FROM
+            label_links
+            INNER JOIN labels ON labels.id = label_links.label_id
+        WHERE
+            labels.title = 'Plan'
+            AND target_type = 'Issue'
+            AND target_id = issues.id))
+AND (EXISTS (
+        SELECT
+            TRUE
+        FROM
+            label_links
+            INNER JOIN labels ON labels.id = label_links.label_id
+        WHERE
+            labels.title = 'backend'
+            AND target_type = 'Issue'
+            AND target_id = issues.id))
+```
+
+While this worked without schema changes, and did improve readability somewhat,
+it did not improve query performance.
+
+## Attempt B: Denormalize using an array column
+
+Having [removed MySQL support in GitLab
+12.1](https://about.gitlab.com/2019/06/27/removing-mysql-support/), using
+[Postgres's arrays](https://www.postgresql.org/docs/9.6/arrays.html) became more
+tractable as we didn't have to support two databases. We discussed denormalizing
+the `label_links` table for querying in
+[gitlab-org/gitlab-ce#49651](https://gitlab.com/gitlab-org/gitlab-ce/issues/49651),
+with two options: label IDs and titles.
+
+We can think of both of those as array columns on `issues`, `merge_requests`,
+and `epics`: `issues.label_ids` would be an array column of label IDs, and
+`issues.label_titles` would be an array of label titles.
+
+These array columns can be complemented with [GIN
+indexes](https://www.postgresql.org/docs/9.6/gin-intro.html) to improve
+matching.
+
+### Attempt B1: store label IDs for each object
+
+This has some strong advantages over titles:
+
+1. Unless a label is deleted, or a project is moved, we never need to
+   bulk-update the denormalized column.
+2. It uses less storage than the titles.
+
+Unfortunately, our application design makes this hard. If we were able to query
+just by label ID easily, we wouldn't need the `INNER JOIN labels` in the initial
+query at the start of this document. GitLab allows users to filter by label
+title across projects and even across groups, so a filter by the label ~Plan may
+include labels with multiple distinct IDs.
+
+We do not want users to have to know about the different IDs, which means that
+given this data set:
+
+| Project | ~Plan label ID | ~backend label ID |
+| ---     | ---            | ---               |
+| A       | 11             | 12                |
+| B       | 21             | 22                |
+| C       | 31             | 32                |
+
+We would need something like:
+
+```sql
+WHERE
+    label_ids @> ARRAY[11, 12]
+    OR label_ids @> ARRAY[21, 22]
+    OR label_ids @> ARRAY[31, 32]
+```
+
+This can get even more complicated when we consider that in some cases, there
+might be two ~backend labels - with different IDs - that could apply to the same
+object, so the number of combinations would balloon further.
+
+### Attempt B2: store label titles for each object
+
+From the perspective of updating the labelable object, this is the worst
+option. We have to bulk update the objects when:
+
+1. The objects are moved from one project to another.
+1. The project is moved from one group to another.
+1. The label is renamed.
+1. The label is deleted.
+
+It also uses much more storage. Querying is simple, though:
+
+```sql
+WHERE
+    label_titles @> ARRAY['Plan', 'backend']
+```
+
+And our [tests in
+gitlab-org/gitlab-ce#49651](https://gitlab.com/gitlab-org/gitlab-ce/issues/49651#note_188777346)
+showed that this could be fast.
+
+However, at present, the disadvantages outweigh the advantages.
+
+## Conclusion
+
+We have yet to find a method that is demonstratably better than the current
+method, when considering:
+
+1. Query performance.
+1. Readability.
+1. Ease of maintaining schema consistency.
author	Sean McGivern <sean@gitlab.com>	2019-07-15 14:12:38 +0100
committer	Sean McGivern <sean@gitlab.com>	2019-07-25 08:56:45 +0100
commit	1ec7dae9adddbd73f6d2d3eb87b8c4473038d50a (patch)
tree	76923568d4d390c254fca41846c33c992e33bd40
parent	f2582fbcddb353a640cf65acae5714b025ef07ad (diff)
download	gitlab-ce-docs-label-filtering.tar.gz