summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorSean McGivern <sean@gitlab.com>2019-07-26 09:53:53 +0000
committerSean McGivern <sean@gitlab.com>2019-07-26 09:53:53 +0000
commit0d2b5bff8c524376f42316f1a4cc8cfbeabc2d8d (patch)
treea944ba284b5f1119708d0d89e1b0cca1a8387d72
parent59a5a89c61df0da1c65de92a25173b806bd5b0f0 (diff)
parent1ec7dae9adddbd73f6d2d3eb87b8c4473038d50a (diff)
downloadgitlab-ce-0d2b5bff8c524376f42316f1a4cc8cfbeabc2d8d.tar.gz
Merge branch 'docs-label-filtering' into 'master'
Add development documentation on label filtering Closes #49651 See merge request gitlab-org/gitlab-ce!30741
-rw-r--r--doc/development/README.md4
-rw-r--r--doc/development/filtering_by_label.md166
2 files changed, 170 insertions, 0 deletions
diff --git a/doc/development/README.md b/doc/development/README.md
index ea5d9e10e2c..99c88146be5 100644
--- a/doc/development/README.md
+++ b/doc/development/README.md
@@ -113,6 +113,10 @@ description: 'Learn how to contribute to GitLab.'
- [Database helper modules](database_helpers.md)
- [Code comments](code_comments.md)
+## Case studies
+
+- [Database case study: Filtering by label](filtering_by_label.md)
+
## Integration guides
- [Jira Connect app](integrations/jira_connect.md)
diff --git a/doc/development/filtering_by_label.md b/doc/development/filtering_by_label.md
new file mode 100644
index 00000000000..6e6b71b1787
--- /dev/null
+++ b/doc/development/filtering_by_label.md
@@ -0,0 +1,166 @@
+# Filtering by label
+
+## Introduction
+
+GitLab has [labels](../user/project/labels.md) that can be assigned to issues,
+merge requests, and epics. Labels on those objects are a many-to-many relation
+through the polymorphic `label_links` table.
+
+To filter these objects by multiple labels - for instance, 'all open
+issues with the label ~Plan and the label ~backend' - we generate a
+query containing a `GROUP BY` clause. In a simple form, this looks like:
+
+```sql
+SELECT
+ issues.*
+FROM
+ issues
+ INNER JOIN label_links ON label_links.target_id = issues.id
+ AND label_links.target_type = 'Issue'
+ INNER JOIN labels ON labels.id = label_links.label_id
+WHERE
+ issues.project_id = 13083
+ AND (issues.state IN ('opened'))
+ AND labels.title IN ('Plan',
+ 'backend')
+GROUP BY
+ issues.id
+HAVING (COUNT(DISTINCT labels.title) = 2)
+ORDER BY
+ issues.updated_at DESC,
+ issues.id DESC
+LIMIT 20 OFFSET 0
+```
+
+In particular, note that:
+
+1. We `GROUP BY issues.id` so that we can ...
+2. Use the `HAVING (COUNT(DISTINCT labels.title) = 2)` condition to ensure that
+ all matched issues have both labels.
+
+This is more complicated than is ideal. It makes the query construction more
+prone to errors (such as
+[gitlab-org/gitlab-ce#15557](https://gitlab.com/gitlab-org/gitlab-ce/issues/15557)).
+
+## Attempt A: WHERE EXISTS
+
+### Attempt A1: use multiple subqueries with WHERE EXISTS
+
+In
+[gitlab-org/gitlab-ce#37137](https://gitlab.com/gitlab-org/gitlab-ce/issues/37137)
+and its associated merge request
+[gitlab-org/gitlab-ce!14022](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14022),
+we tried to replace the `GROUP BY` with multiple uses of `WHERE EXISTS`. For the
+example above, this would give:
+
+```sql
+WHERE (EXISTS (
+ SELECT
+ TRUE
+ FROM
+ label_links
+ INNER JOIN labels ON labels.id = label_links.label_id
+ WHERE
+ labels.title = 'Plan'
+ AND target_type = 'Issue'
+ AND target_id = issues.id))
+AND (EXISTS (
+ SELECT
+ TRUE
+ FROM
+ label_links
+ INNER JOIN labels ON labels.id = label_links.label_id
+ WHERE
+ labels.title = 'backend'
+ AND target_type = 'Issue'
+ AND target_id = issues.id))
+```
+
+While this worked without schema changes, and did improve readability somewhat,
+it did not improve query performance.
+
+## Attempt B: Denormalize using an array column
+
+Having [removed MySQL support in GitLab
+12.1](https://about.gitlab.com/2019/06/27/removing-mysql-support/), using
+[Postgres's arrays](https://www.postgresql.org/docs/9.6/arrays.html) became more
+tractable as we didn't have to support two databases. We discussed denormalizing
+the `label_links` table for querying in
+[gitlab-org/gitlab-ce#49651](https://gitlab.com/gitlab-org/gitlab-ce/issues/49651),
+with two options: label IDs and titles.
+
+We can think of both of those as array columns on `issues`, `merge_requests`,
+and `epics`: `issues.label_ids` would be an array column of label IDs, and
+`issues.label_titles` would be an array of label titles.
+
+These array columns can be complemented with [GIN
+indexes](https://www.postgresql.org/docs/9.6/gin-intro.html) to improve
+matching.
+
+### Attempt B1: store label IDs for each object
+
+This has some strong advantages over titles:
+
+1. Unless a label is deleted, or a project is moved, we never need to
+ bulk-update the denormalized column.
+2. It uses less storage than the titles.
+
+Unfortunately, our application design makes this hard. If we were able to query
+just by label ID easily, we wouldn't need the `INNER JOIN labels` in the initial
+query at the start of this document. GitLab allows users to filter by label
+title across projects and even across groups, so a filter by the label ~Plan may
+include labels with multiple distinct IDs.
+
+We do not want users to have to know about the different IDs, which means that
+given this data set:
+
+| Project | ~Plan label ID | ~backend label ID |
+| --- | --- | --- |
+| A | 11 | 12 |
+| B | 21 | 22 |
+| C | 31 | 32 |
+
+We would need something like:
+
+```sql
+WHERE
+ label_ids @> ARRAY[11, 12]
+ OR label_ids @> ARRAY[21, 22]
+ OR label_ids @> ARRAY[31, 32]
+```
+
+This can get even more complicated when we consider that in some cases, there
+might be two ~backend labels - with different IDs - that could apply to the same
+object, so the number of combinations would balloon further.
+
+### Attempt B2: store label titles for each object
+
+From the perspective of updating the labelable object, this is the worst
+option. We have to bulk update the objects when:
+
+1. The objects are moved from one project to another.
+1. The project is moved from one group to another.
+1. The label is renamed.
+1. The label is deleted.
+
+It also uses much more storage. Querying is simple, though:
+
+```sql
+WHERE
+ label_titles @> ARRAY['Plan', 'backend']
+```
+
+And our [tests in
+gitlab-org/gitlab-ce#49651](https://gitlab.com/gitlab-org/gitlab-ce/issues/49651#note_188777346)
+showed that this could be fast.
+
+However, at present, the disadvantages outweigh the advantages.
+
+## Conclusion
+
+We have yet to find a method that is demonstratably better than the current
+method, when considering:
+
+1. Query performance.
+1. Readability.
+1. Ease of maintaining schema consistency.