summaryrefslogtreecommitdiff
path: root/doc/development/git_object_deduplication.md
diff options
context:
space:
mode:
Diffstat (limited to 'doc/development/git_object_deduplication.md')
-rw-r--r--doc/development/git_object_deduplication.md261
1 files changed, 261 insertions, 0 deletions
diff --git a/doc/development/git_object_deduplication.md b/doc/development/git_object_deduplication.md
new file mode 100644
index 00000000000..81c5f69c7b8
--- /dev/null
+++ b/doc/development/git_object_deduplication.md
@@ -0,0 +1,261 @@
+# How Git object deduplication works in GitLab
+
+When a GitLab user [forks a project](../workflow/forking_workflow.md),
+GitLab creates a new Project with an associated Git repository that is a
+copy of the original project at the time of the fork. If a large project
+gets forked often, this can lead to a quick increase in Git repository
+storage disk use. To counteract this problem, we are adding Git object
+deduplication for forks to GitLab. In this document, we will describe how
+GitLab implements Git object deduplication.
+
+## Enabling Git object deduplication via feature flags
+
+As of GitLab 11.9, Git object deduplication in GitLab is in beta. In this
+document, you can read about the caveats of enabling the feature. Also,
+note that Git object deduplication is limited to forks of public
+projects on hashed repository storage.
+
+You can enable deduplication globally by setting the `object_pools`
+feature flag to `true`:
+
+``` {.ruby}
+Feature.enable(:object_pools)
+```
+
+Or just for forks of a specific project:
+
+``` {.ruby}
+fork_parent = Project.find(MY_PROJECT_ID)
+Feature.enable(:object_pools, fork_parent)
+```
+
+To check if a project uses Git object deduplication, look in a Rails
+console if `project.pool_repository` is present.
+
+## Pool repositories
+
+### Understanding Git alternates
+
+At the Git level, we achieve deduplication by using [Git
+alternates](https://git-scm.com/docs/gitrepository-layout#gitrepository-layout-objects).
+Git alternates is a mechanism that lets a repository borrow objects from
+another repository on the same machine.
+
+If we want repository A to borrow from repository B, we first write a
+path that resolves to `B.git/objects` in the special file
+`A.git/objects/info/alternates`. This establishes the alternates link.
+Next, we must perform a Git repack in A. After the repack, any objects
+that are duplicated between A and B will get deleted from A. Repository
+A is now no longer self-contained, but it still has its own refs and
+configuration. Objects in A that are not in B will remain in A. For this
+to work, it is of course critical that **no objects ever get deleted from
+B** because A might need them.
+
+### Git alternates in GitLab: pool repositories
+
+GitLab organizes this object borrowing by creating special **pool
+repositories** which are hidden from the user. We then use Git
+alternates to let a collection of project repositories borrow from a
+single pool repository. We call such a collection of project
+repositories a pool. Pools form star-shaped networks of repositories
+that borrow from a single pool, which will resemble (but not be
+identical to) the fork networks that get formed when users fork
+projects.
+
+At the Git level, pool repositories are created and managed using Gitaly
+RPC calls. Just like with normal repositories, the authority on which
+pool repositories exist, and which repositories borrow from them, lies
+at the Rails application level in SQL.
+
+In conclusion, we need three things for effective object deduplication
+across a collection of GitLab project repositories at the Git level:
+
+1. A pool repository must exist.
+2. The participating project repositories must be linked to the pool
+ repository via their respective `objects/info/alternates` files.
+3. The pool repository must contain Git object data common to the
+ participating project repositories.
+
+### Deduplication factor
+
+The effectiveness of Git object deduplication in GitLab depends on the
+amount of overlap between the pool repository and each of its
+participants. As of GitLab 11.9, we have a somewhat optimistic system.
+The only data that will be deduplicated is the data in the source
+project repository at the time the pool repository is created. That is,
+the data in the source project at the time of the first fork *after* the
+deduplication feature has been enabled.
+
+When we enable the object deduplication feature for
+gitlab.com/gitlab-org/gitlab-ce, which is about 1GB at the time of
+writing, all new forks of that project would be 1GB smaller than they
+would have been without Git object deduplication. So even in its current
+optimistic form, we expect Git object deduplication in GitLab to make a
+difference.
+
+However, if a lot of Git objects get added to the project repositories
+in a pool after the pool repository was created these new Git objects
+will currently (GitLab 11.9) not get deduplicated. Over time, the
+deduplication factor of the pool will get worse and worse.
+
+As an extreme example, if we create an empty repository A, and fork that
+to repository B, behind the scenes we get an object pool P with no
+objects in it at all. If we then push 1GB of Git data to A, and push the
+same Git data to B, it will not get deduplicated, because that data was
+not in A at the time P was created.
+
+This also matters in less extreme examples. Consider a pool P with
+source project A and 500 active forks B1, B2,...,B500. Suppose,
+optimistically, that the forks are fully deduplicated at the start of
+our scenario. Now some time passes and 200MB of new Git data gets added
+to project A. Because of the forking workflow, this data makes also its way
+into the forks B1, ..., B500. That means we would now have 100GB of Git
+data sitting around (500 \* 200MB) across the forks, that could have
+been deduplicated. But because of the way we do deduplication this new
+data will not be deduplicated.
+
+> TODO Add periodic maintenance of object pools to prevent gradual loss
+> of deduplication over time.
+> https://gitlab.com/groups/gitlab-org/-/epics/524
+
+## SQL model
+
+As of GitLab 11.8, project repositories in GitLab do not have their own
+SQL table. They are indirectly identified by columns on the `projects`
+table. In other words, the only way to look up a project repository is to
+first look up its project, and then call `project.repository`.
+
+With pool repositories we made a fresh start. These live in their own
+`pool_repositories` SQL table. The relations between these two tables
+are as follows:
+
+- a `Project` belongs to at most one `PoolRepository`
+ (`project.pool_repository`)
+- as an automatic consequence of the above, a `PoolRepository` has
+ many `Project`s
+- a `PoolRepository` has exactly one "source `Project`"
+ (`pool.source_project`)
+
+### Assumptions
+
+- All repositories in a pool must use [hashed
+ storage](../administration/repository_storage_types.md). This is so
+ that we don't have to ever worry about updating paths in
+ `object/info/alternates` files.
+- All repositories in a pool must be on the same Gitaly storage shard.
+ The Git alternates mechanism relies on direct disk access across
+ multiple repositories, and we can only assume direct disk access to
+ be possible within a Gitaly storage shard.
+- All project repositories in a pool must have "Public" visibility in
+ GitLab at the time they join. There are gotchas around visibility of
+ Git objects across alternates links. This restriction is a defense
+ against accidentally leaking private Git data.
+- The only two ways to remove a member project from a pool are (1) to
+ delete the project or (2) to move the project to another Gitaly
+ storage shard.
+
+### Creating pools and pool memberships
+
+- When a pool gets created, it must have a source project. The initial
+ contents of the pool repository are a Git clone of the source
+ project repository.
+- The occasion for creating a pool is when an existing eligible
+ (public, hashed storage, non-forked) GitLab project gets forked and
+ this project does not belong to a pool repository yet. The fork
+ parent project becomes the source project of the new pool, and both
+ the fork parent and the fork child project become members of the new
+ pool.
+- Once project A has become the source project of a pool, all future
+ eligible forks of A will become pool members.
+- If the fork source is itself a fork, the resulting repository will
+ neither join the repository nor will a new pool repository be
+ seeded.
+
+ eg:
+
+ Suppose fork A is part of a pool repository, any forks created off
+ of fork A *will not* be a part of the pool repository that fork A is
+ a part of.
+
+ Suppose B is a fork of A, and A does not belong to an object pool.
+ Now C gets created as a fork of B. C will not be part of a pool
+ repository.
+
+> TODO should forks of forks be deduplicated?
+> https://gitlab.com/gitlab-org/gitaly/issues/1532
+
+### Consequences
+
+- If a normal Project participating in a pool gets moved to another
+ Gitaly storage shard, its "belongs to PoolRepository" relation must
+ be broken. Because of the way moving repositories between shard is
+ implemented, we will automatically get a fresh self-contained copy
+ of the project's repository on the new storage shard.
+- If the source project of a pool gets moved to another Gitaly storage
+ shard or is deleted, we may have to break the "PoolRepository has
+ one source Project" relation?
+
+> TODO What happens, or should happen, if a source project changes
+> visibility, is deleted, or moves to another storage shard?
+> https://gitlab.com/gitlab-org/gitaly/issues/1488
+
+## Consistency between the SQL pool relation and Gitaly
+
+As far as Gitaly is concerned, the SQL pool relations make two types of
+claims about the state of affairs on the Gitaly server: pool repository
+existence, and the existence of an alternates connection between a
+repository and a pool.
+
+### Pool existence
+
+If GitLab thinks a pool repository exists (i.e. it exists according to
+SQL), but it does not on the Gitaly server, then certain RPC calls that
+take the object pool as an argument will fail.
+
+> TODO What happens if SQL says the pool repo exists but Gitaly says it
+> does not? https://gitlab.com/gitlab-org/gitaly/issues/1533
+
+If GitLab thinks a pool does not exist, while it does exist on disk,
+that has no direct consequences on its own. However, if other
+repositories on disk borrow objects from this unknown pool repository
+then we risk data loss, see below.
+
+### Pool relation existence
+
+There are three different things that can go wrong here.
+
+#### 1. SQL says repo A belongs to pool P but Gitaly says A has no alternate objects
+
+In this case, we miss out on disk space savings but all RPC's on A itself
+will function fine. As long as Git can find all its objects, it does not
+matter exactly where those objects are.
+
+#### 2. SQL says repo A belongs to pool P1 but Gitaly says A has alternate objects in pool P2
+
+If we are not careful, this situation can lead to data loss. During some
+operations (repository maintenance), GitLab will try to re-link A to its
+pool P1. If this clobbers the existing link to P2, then A will loose Git
+objects and become invalid.
+
+Also, keep in mind that if GitLab's database got messed up, it may not
+even know that P2 exists.
+
+> TODO Ensure that Gitaly will not clobber existing, unexpected
+> alternates links. https://gitlab.com/gitlab-org/gitaly/issues/1534
+
+#### 3. SQL says repo A does not belong to any pool but Gitaly says A belongs to P
+
+This has the same data loss possibility as scenario 2 above.
+
+## Git object deduplication and GitLab Geo
+
+When a pool repository record is created in SQL on a Geo primary, this
+will eventually trigger an event on the Geo secondary. The Geo secondary
+will then create the pool repository in Gitaly. This leads to an
+"eventually consistent" situation because as each pool participant gets
+synchronized, Geo will eventuall trigger garbage collection in Gitaly on
+the secondary, at which stage Git objects will get deduplicated.
+
+> TODO How do we handle the edge case where at the time the Geo
+> secondary tries to create the pool repository, the source project does
+> not exist? https://gitlab.com/gitlab-org/gitaly/issues/1533