From 45c5c2aad628242b88e0938e21a40d30299e3d81 Mon Sep 17 00:00:00 2001
From: Jacob Vosmaer <jacob@gitlab.com>
Date: Wed, 12 Jun 2019 07:12:15 +0000
Subject: Update git object deduplication overview

---
 doc/administration/repository_storage_types.md |   5 ++
 doc/development/git_object_deduplication.md    | 109 ++++++++-----------------
 2 files changed, 38 insertions(+), 76 deletions(-)

diff --git a/doc/administration/repository_storage_types.md b/doc/administration/repository_storage_types.md
index 4db3cbb9958..38842693d73 100644
--- a/doc/administration/repository_storage_types.md
+++ b/doc/administration/repository_storage_types.md
@@ -106,6 +106,11 @@ enabled for individual projects by executing
 be on hashed storage, should not be a fork itself, and hashed storage should be
 enabled for all new projects.
 
+DANGER: **Danger:**
+Do not run `git prune` or `git gc` in pool repositories! This can
+cause data loss in "real" repositories that depend on the pool in
+question.
+
 ### How to migrate to Hashed Storage
 
 To start a migration, enable Hashed Storage for new projects:
diff --git a/doc/development/git_object_deduplication.md b/doc/development/git_object_deduplication.md
index 81c5f69c7b8..b512d7611d3 100644
--- a/doc/development/git_object_deduplication.md
+++ b/doc/development/git_object_deduplication.md
@@ -10,10 +10,10 @@ GitLab implements Git object deduplication.
 
 ## Enabling Git object deduplication via feature flags
 
-As of GitLab 11.9, Git object deduplication in GitLab is in beta. In this
-document, you can read about the caveats of enabling the feature. Also,
-note that Git object deduplication is limited to forks of public
-projects on hashed repository storage.
+As of GitLab 12.0, Git object deduplication in GitLab is still behind a
+feature flag. In this document, you can read about the effects of
+enabling the feature. Also, note that Git object deduplication is
+limited to forks of public projects on hashed repository storage.
 
 You can enable deduplication globally by setting the `object_pools`
 feature flag to `true`:
@@ -51,6 +51,15 @@ configuration. Objects in A that are not in B will remain in A. For this
 to work, it is of course critical that **no objects ever get deleted from
 B** because A might need them.
 
+DANGER: **Danger:**
+Do not run `git prune` or `git gc` in pool repositories! This can
+cause data loss in "real" repositories that depend on the pool in
+question.
+
+The danger lies in `git prune`, and `git gc` calls `git prune`. The
+problem is that `git prune`, when running in a pool repository, cannot
+reliably decide if an object is no longer needed.
+
 ### Git alternates in GitLab: pool repositories
 
 GitLab organizes this object borrowing by creating special **pool
@@ -80,43 +89,10 @@ across a collection of GitLab project repositories at the Git level:
 
 The effectiveness of Git object deduplication in GitLab depends on the
 amount of overlap between the pool repository and each of its
-participants. As of GitLab 11.9, we have a somewhat optimistic system.
-The only data that will be deduplicated is the data in the source
-project repository at the time the pool repository is created. That is,
-the data in the source project at the time of the first fork *after* the
-deduplication feature has been enabled.
-
-When we enable the object deduplication feature for
-gitlab.com/gitlab-org/gitlab-ce, which is about 1GB at the time of
-writing, all new forks of that project would be 1GB smaller than they
-would have been without Git object deduplication. So even in its current
-optimistic form, we expect Git object deduplication in GitLab to make a
-difference.
-
-However, if a lot of Git objects get added to the project repositories
-in a pool after the pool repository was created these new Git objects
-will currently (GitLab 11.9) not get deduplicated. Over time, the
-deduplication factor of the pool will get worse and worse.
-
-As an extreme example, if we create an empty repository A, and fork that
-to repository B, behind the scenes we get an object pool P with no
-objects in it at all. If we then push 1GB of Git data to A, and push the
-same Git data to B, it will not get deduplicated, because that data was
-not in A at the time P was created.
-
-This also matters in less extreme examples. Consider a pool P with
-source project A and 500 active forks B1, B2,...,B500. Suppose,
-optimistically, that the forks are fully deduplicated at the start of
-our scenario. Now some time passes and 200MB of new Git data gets added
-to project A. Because of the forking workflow, this data makes also its way
-into the forks B1, ..., B500. That means we would now have 100GB of Git
-data sitting around (500 \* 200MB) across the forks, that could have
-been deduplicated. But because of the way we do deduplication this new
-data will not be deduplicated.
-
-> TODO Add periodic maintenance of object pools to prevent gradual loss
-> of deduplication over time.
-> https://gitlab.com/groups/gitlab-org/-/epics/524
+participants. Each time garbage collection runs on the source project,
+Git objects from the source project will get migrated to the pool
+repository. One by one, as garbage collection runs, other member
+projects will benefit from the new objects that got added to the pool.
 
 ## SQL model
 
@@ -136,6 +112,9 @@ are as follows:
 -   a `PoolRepository` has exactly one "source `Project`"
     (`pool.source_project`)
 
+> TODO Fix invalid SQL data for pools created prior to GitLab 11.11
+> https://gitlab.com/gitlab-org/gitaly/issues/1653.
+
 ### Assumptions
 
 -   All repositories in a pool must use [hashed
@@ -146,10 +125,6 @@ are as follows:
     The Git alternates mechanism relies on direct disk access across
     multiple repositories, and we can only assume direct disk access to
     be possible within a Gitaly storage shard.
--   All project repositories in a pool must have "Public" visibility in
-    GitLab at the time they join. There are gotchas around visibility of
-    Git objects across alternates links. This restriction is a defense
-    against accidentally leaking private Git data.
 -   The only two ways to remove a member project from a pool are (1) to
     delete the project or (2) to move the project to another Gitaly
     storage shard.
@@ -187,17 +162,14 @@ are as follows:
 ### Consequences
 
 -   If a normal Project participating in a pool gets moved to another
-    Gitaly storage shard, its "belongs to PoolRepository" relation must
+    Gitaly storage shard, its "belongs to PoolRepository" relation will
     be broken. Because of the way moving repositories between shard is
     implemented, we will automatically get a fresh self-contained copy
     of the project's repository on the new storage shard.
 -   If the source project of a pool gets moved to another Gitaly storage
-    shard or is deleted, we may have to break the "PoolRepository has
-    one source Project" relation?
-
-> TODO What happens, or should happen, if a source project changes
-> visibility, is deleted, or moves to another storage shard?
-> https://gitlab.com/gitlab-org/gitaly/issues/1488
+    shard or is deleted, the "source project" relation is not broken.
+    However, as of GitLab 12.0 a pool will not fetch from a source
+    unless the source is on the same Gitaly shard.
 
 ## Consistency between the SQL pool relation and Gitaly
 
@@ -209,16 +181,8 @@ repository and a pool.
 ### Pool existence
 
 If GitLab thinks a pool repository exists (i.e. it exists according to
-SQL), but it does not on the Gitaly server, then certain RPC calls that
-take the object pool as an argument will fail.
-
-> TODO What happens if SQL says the pool repo exists but Gitaly says it
-> does not? https://gitlab.com/gitlab-org/gitaly/issues/1533
-
-If GitLab thinks a pool does not exist, while it does exist on disk,
-that has no direct consequences on its own. However, if other
-repositories on disk borrow objects from this unknown pool repository
-then we risk data loss, see below.
+SQL), but it does not on the Gitaly server, then it will be created on
+the fly by Gitaly.
 
 ### Pool relation existence
 
@@ -226,26 +190,19 @@ There are three different things that can go wrong here.
 
 #### 1. SQL says repo A belongs to pool P but Gitaly says A has no alternate objects
 
-In this case, we miss out on disk space savings but all RPC's on A itself
-will function fine. As long as Git can find all its objects, it does not
-matter exactly where those objects are.
+In this case, we miss out on disk space savings but all RPCs on A
+itself will function fine. The next time garbage collection runs on A,
+the alternates connection gets established in Gitaly. This is done by
+`Projects::GitDeduplicationService` in gitlab-rails.
 
 #### 2. SQL says repo A belongs to pool P1 but Gitaly says A has alternate objects in pool P2
 
-If we are not careful, this situation can lead to data loss. During some
-operations (repository maintenance), GitLab will try to re-link A to its
-pool P1. If this clobbers the existing link to P2, then A will loose Git
-objects and become invalid.
-
-Also, keep in mind that if GitLab's database got messed up, it may not
-even know that P2 exists.
-
-> TODO Ensure that Gitaly will not clobber existing, unexpected
-> alternates links. https://gitlab.com/gitlab-org/gitaly/issues/1534
+In this case, `Projects::GitDeduplicationService` will raise an exception.
 
 #### 3. SQL says repo A does not belong to any pool but Gitaly says A belongs to P
 
-This has the same data loss possibility as scenario 2 above.
+In this case, `Projects::GitDeduplicationService` will try to
+"re-duplicate" the repository A using the `DisconnectGitAlternates` RPC.
 
 ## Git object deduplication and GitLab Geo
 
-- 
cgit v1.2.1