author    GitLab Bot <gitlab-bot@gitlab.com> 2023-05-17 16:05:49 +0000
committer GitLab Bot <gitlab-bot@gitlab.com> 2023-05-17 16:05:49 +0000
commit    43a25d93ebdabea52f99b05e15b06250cd8f07d7 (patch)
tree      dceebdc68925362117480a5d672bcff122fb625b /doc/architecture/blueprints/object_pools
parent    20c84b99005abd1c82101dfeff264ac50d2df211 (diff)
download  gitlab-ce-16.0.0-rc42.tar.gz
Add latest changes from gitlab-org/gitlab@16-0-stable-ee (16.0.0-rc42, 16-0-stable)
Diffstat (limited to 'doc/architecture/blueprints/object_pools')
-rw-r--r-- doc/architecture/blueprints/object_pools/index.md | 495
1 file changed, 495 insertions, 0 deletions
diff --git a/doc/architecture/blueprints/object_pools/index.md b/doc/architecture/blueprints/object_pools/index.md
new file mode 100644
index 00000000000..3f3a0341e4a
--- /dev/null
+++ b/doc/architecture/blueprints/object_pools/index.md
@@ -0,0 +1,495 @@
+---
+status: proposed
+creation-date: "2023-03-30"
+authors: [ "@pks-gitlab" ]
+coach: [ ]
+approvers: [ ]
+owning-stage: "~devops::systems"
+participating-stages: [ "~devops::create" ]
+---
+
+# Iterate on the design of object pools
+
+## Summary
+
+Forking repositories is at the heart of many modern workflows for projects
+hosted in GitLab. As most of the objects between a fork and its upstream project
+will typically be the same, this opens up potential for optimizations:
+
+- Creating forks can theoretically be lightning fast if we reuse large parts
+  of the upstream repository.
+
+- We can save on storage space by deduplicating objects which are shared.
+
+This architecture is currently implemented with object pools, which hold the
+objects of the primary repository. But the design of object pools has grown
+organically and is nowadays showing its limits.
+
+This blueprint explores how we can iterate on the design of object pools to fix
+long-standing issues with it. Furthermore, the intent is to arrive at a design
+that lets us iterate more readily on the exact implementation details of object
+pools.
+
+## Motivation
+
+The current design of object pools is showing scalability problems in various
+ways. For a large part, these problems stem from the fact that object pools
+have grown organically and that we learned as we went.
+
+It is proving hard to fix the overall design of object pools because there is no
+clear ownership. While Gitaly provides the low-level building blocks to make
+them work, it does not have enough control over them to be able to iterate on
+their implementation details.
+
+There are thus two major goals: taking ownership of object pools so that it
+becomes easier to iterate on the design, and fixing scalability issues once we
+can iterate.
+
+### Lifecycle ownership
+
+While Gitaly provides the interfaces to manage object pools, their actual
+lifecycle is controlled by the client. A typical lifecycle of an object pool
+looks as follows (a sketch of the sequence in code follows the list):
+
+1. An object pool is created via `CreateObjectPool()`. The caller provides the
+ path where the object pool shall be created as well as the origin repository
+ from which the repository shall be created.
+
+1. The origin repository needs to be linked to the object pool explicitly by
+ calling `LinkRepositoryToObjectPool()`.
+
+1. The object pool needs to be regularly updated via `FetchIntoObjectPool()`,
+   which fetches all changes from the primary pool member into the object
+   pool.
+
+1. To create forks, the client needs to call `CreateFork()` followed by
+ `LinkRepositoryToObjectPool()`.
+
+1. Repositories of forks are unlinked by calling `DisconnectGitAlternates()`.
+ This will reduplicate objects.
+
+1. The object pool is deleted via `DeleteObjectPool()`.
+
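+Expressed in code, the sequence above looks roughly like the sketch below. It
+is written against a hypothetical thin wrapper interface rather than Gitaly's
+actual Go client API, and the real RPC messages carry repository and storage
+details that are elided here:
+
+```go
+package objectpool
+
+import "context"
+
+// ObjectPoolClient is a hypothetical wrapper around the RPCs named above.
+type ObjectPoolClient interface {
+	CreateObjectPool(ctx context.Context, pool, origin string) error
+	LinkRepositoryToObjectPool(ctx context.Context, repo, pool string) error
+	FetchIntoObjectPool(ctx context.Context, origin, pool string) error
+	CreateFork(ctx context.Context, source, fork string) error
+	DisconnectGitAlternates(ctx context.Context, repo string) error
+	DeleteObjectPool(ctx context.Context, pool string) error
+}
+
+// forkLifecycle drives the whole lifecycle for a single fork. In reality the
+// steps are spread across Rails requests and periodic Sidekiq jobs.
+func forkLifecycle(ctx context.Context, c ObjectPoolClient, origin, pool, fork string) error {
+	// Create the pool from the origin repository and link the origin to it.
+	if err := c.CreateObjectPool(ctx, pool, origin); err != nil {
+		return err
+	}
+	if err := c.LinkRepositoryToObjectPool(ctx, origin, pool); err != nil {
+		return err
+	}
+	// The pool must be updated regularly with the origin's new objects.
+	if err := c.FetchIntoObjectPool(ctx, origin, pool); err != nil {
+		return err
+	}
+	// Forks are created first and linked to the pool in a second step.
+	if err := c.CreateFork(ctx, origin, fork); err != nil {
+		return err
+	}
+	if err := c.LinkRepositoryToObjectPool(ctx, fork, pool); err != nil {
+		return err
+	}
+	// Unlinking reduplicates objects; deleting the pool is a separate call.
+	if err := c.DisconnectGitAlternates(ctx, fork); err != nil {
+		return err
+	}
+	return c.DeleteObjectPool(ctx, pool)
+}
+```
+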
+This lifecycle is complex and leaks a lot of implementation details to the
+caller. This was originally done in part to give the Rails side control and
+management over Git object visibility. GitLab project visibility rules are
+complex and not a Gitaly concern. By exposing these details Rails can control
+when pool membership links are created and broken. At the current point in
+time it is not clear how the complete system works, and its limits are not
+explicitly documented.
+
+In addition to the complexity of the lifecycle, we also have multiple sources
+of truth for pool membership. Gitaly never tracks the set of members of a pool
+repository but can only tell for a specific repository that it is part of said
+pool. Consequently, Rails is forced to maintain this information in a
+database, but it is hard to keep that information from becoming stale.
+
+### Repository maintenance
+
+Related to the lifecycle ownership issues is the issue of repository
+maintenance. As mentioned, keeping an object pool up to date requires regular
+calls to `FetchIntoObjectPool()`. This is leaking implementation details to the
+client, but was done to give the client control over syncing the primary
+repository with its object pool. With this control, private repositories can be
+prevented from syncing and consequently leaking objects to other repositories
+in the fork network.
+
+We have had good success with moving repository maintenance into Gitaly so that
+clients do not need to know about on-disk details. Ideally, we would do the same
+for repositories that are the primary member of an object pool: if we optimize
+its on-disk state, we will also automatically update the object pool.
+
+There are two issues that keep us from doing so:
+
+- Gitaly does not know about the relationship between an object pool and its
+ members.
+
+- Updating object pools is expensive.
+
+By making Gitaly the single source of truth for object pool memberships we would
+be in a position to fix both issues.
+
+### Fast forking
+
+In the current implementation, Rails first invokes `CreateFork()` which results
+in a complete `git-clone(1)` being performed to generate the fork repository.
+This is followed by `LinkRepositoryToObjectPool()` to link the fork with the
+object pool. It is not until housekeeping is performed on the fork repository
+that objects are deduplicated. This is not only leaking implementation details
+to clients, but it also keeps us from reaping the full potential benefit of
+object pools.
+
+In particular, creating forks is a lot slower than it could be since a clone is
+always performed before linking. If the steps of creating the fork and linking
+the fork to the pool repository were unified, the initial clone could be
+avoided.
+
+### Clustered object pools
+
+Gitaly Cluster and object pools development overlapped. Consequently, they are
+known to not work well together. Praefect neither ensures that repositories
+with object pools have their object pools present on all nodes, nor does it
+ensure that object pools are in a known state. Insofar as they work at all,
+object pools only work by chance.
+
+The current state has led to cases where object pools were missing or had
+different contents per node. This can result in inconsistently observed state
+in object pool members and can cause writes that depend on the object pool's
+contents to fail.
+
+One way object pools might be handled for clustered Gitaly could be to have the
+pool repositories duplicated on nodes that contain repositories dependent on
+them. This would allow members of a fork network to exist on different nodes.
+To make this work, repository replication would have to be aware of object
+pools and know when it needs to duplicate them onto a particular node.
+
+## Requirements
+
+There is a set of requirements and invariants that any particular solution
+must satisfy.
+
+### Private upstream repositories should not leak objects to forks
+
+When a project has a visibility setting that is not public, the objects in the
+repository should not be fetched into an object pool. An object pool should only
+ever contain objects from the upstream repository that were at one point public.
+This prevents private upstream repositories from having objects leaked to forks
+through a shared object pool.
+
+### Forks cannot sneak objects into upstream projects
+
+It should not be possible to make objects uploaded in a fork repository
+accessible in the upstream repository via a shared object pool. Otherwise
+potentially unauthorized users would be able to "sneak" objects into
+repositories by simply forking them.
+
+Besides leading to confusion, this could also serve as a mechanism to corrupt
+upstream repositories by introducing objects that are known to be broken.
+
+### Object pool lifetime exceeds upstream repository lifetime
+
+If the upstream repository gets deleted, its object pool should remain in place
+to provide continued deduplication of shared objects between the other
+repositories in the fork network. Thus it can be said that the lifetime of the
+object pool is longer than the lifetime of the upstream repository. An object
+pool should only be deleted if there are no longer any repositories referencing
+it.
+
+### Object lifetime
+
+By deduplicating objects in a fork network, repositories become dependent on the
+object pool. Missing objects in the pooled repository could lead to corruption
+of repositories in the fork network. Therefore, objects in the pooled repository
+must continue to exist as long as there are repositories referencing them.
+
+Without a mechanism to accurately determine whether a pooled object is
+referenced by one or more repositories, all objects in the pooled repository
+must remain. Only when there are no repositories referencing the object pool
+can the pooled repository, and therefore all its objects, be removed.
+
+### Object sharing
+
+An object that is deduplicated will become accessible from all forks of a
+particular repository, even if it has never been reachable in any of the forks.
+The consequence is that any write to an object pool immediately influences all
+of its members.
+
+We need to be mindful of this property when repositories connected to an object
+pool are replicated. As the user-observable state should be the same on all
+replicas, we need to ensure that both the repository and its object pool are
+consistent across the different nodes.
+
+## Proposal
+
+In the current design, management of object pools mostly happens on the client
+side, as clients need to manage their complete lifecycle. This requires Rails
+to store the object pool relationships in the Rails database, perform
+fine-grained management of every single step of an object pool's life, and run
+periodic Sidekiq jobs to enforce state by calling idempotent Gitaly RPCs. This
+design significantly increases complexity of an already-complex mechanism.
+
+Instead of handling the full lifecycle of object pools on the client side,
+this document proposes to encapsulate the object pool lifecycle management
+inside of Gitaly. Rather than performing low-level actions to maintain object
+pools, clients would only need to tell Gitaly about updated relationships
+between a repository and its object pool.
+
+This brings us multiple advantages:
+
+- The inherent complexity of the lifecycle management is encapsulated in a
+ single place, namely Gitaly.
+
+- Gitaly is in a better position to iterate on the low-level technical design
+  of object pools in case we find a better solution than "alternates" in the
+  future.
+
+- We can ensure better interplay between Gitaly Cluster, object pools and
+ repository housekeeping.
+
+- Gitaly becomes the single source of truth for object pool relationships and
+  can thus start to manage them better.
+
+Overall, the goal is to raise the abstraction level so that clients need to
+worry less about the technical details while Gitaly is in a better position to
+iterate on them.
+
+### Move lifecycle management of pools into Gitaly
+
+The lifecycle management of object pools is leaking too many details to the
+client, and by doing so makes things both hard to understand and
+inefficient.
+
+The current solution relies on a set of fine-grained RPCs that manage the
+relationship between repositories and their object pools. Instead, we are
+aiming for a simplified approach that only exposes the high-level concept of
+forks to the client. This will happen in the form of three RPCs (a sketch
+follows the list):
+
+- `ForkRepository()` will create a fork of a given repository. If the upstream
+ repository does not yet have an object pool, Gitaly will create it. It will
+ then create the new repository and automatically link it to the object pool.
+  The upstream repository will be recorded as the primary member of the object
+  pool, and the fork will be recorded as a secondary member.
+
+- `UnforkRepository()` will remove a repository from the object pool it is
+ connected to. This will stop deduplication of objects. For the primary object
+ pool member this also means that Gitaly will stop pulling new objects into the
+ object pool.
+
+- `GetObjectPool()` returns the object pool for a given repository. The pool
+ description will contain information about the pool's primary object pool
+ member as well as all secondary object pool members.
+
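+The sketch below shows what the proposed surface could look like from the
+client's perspective. None of these RPCs exist yet, so all names and message
+shapes are assumptions taken from this blueprint:
+
+```go
+package objectpool
+
+import "context"
+
+// ForkClient is a hypothetical view of the proposed RPC surface.
+type ForkClient interface {
+	// ForkRepository creates the object pool on demand, creates the fork
+	// and links both to the pool in a single call.
+	ForkRepository(ctx context.Context, upstream, fork string) error
+	// UnforkRepository removes the repository from its object pool and
+	// stops deduplication of its objects.
+	UnforkRepository(ctx context.Context, repo string) error
+	// GetObjectPool describes the object pool of a repository.
+	GetObjectPool(ctx context.Context, repo string) (PoolInfo, error)
+}
+
+// PoolInfo is a hypothetical description of an object pool.
+type PoolInfo struct {
+	Pool        string   // path of the object pool repository
+	Primary     string   // primary member the pool is updated from
+	Secondaries []string // all other object pool members
+}
+
+// createAndSeverFork shows the complete client-side lifecycle: one call to
+// create the fork, one call to sever the relationship again.
+func createAndSeverFork(ctx context.Context, c ForkClient, upstream, fork string) error {
+	if err := c.ForkRepository(ctx, upstream, fork); err != nil {
+		return err
+	}
+	// ... time passes, the fork diverges from its upstream ...
+	return c.UnforkRepository(ctx, fork)
+}
+```
+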
+Furthermore, the following changes will be implemented:
+
+- `RemoveRepository()` will remove the repository from its object pool. If it
+ was the last object pool member, the pool will be removed.
+
+- `OptimizeRepository()`, when executed on the primary object pool member, will
+ also update and optimize the object pool.
+
+- `ReplicateRepository()` needs to be aware of object pools and replicate them
+  correctly. Repositories shall be linked to and unlinked from object pools as
+ required. While this is a step towards fixing the Praefect world, which may
+ seem redundant given that we plan to deprecate Praefect anyway, this RPC call
+ is also used for other use cases like repository rebalancing.
+
+With these changes, Gitaly will have much tighter control over the lifecycle of
+object pools. Furthermore, as it starts to track the membership of repositories
+in object pools it can become the single source of truth for fork networks.
+
+### Fix inefficient maintenance of object pools
+
+In order to update object pools, Gitaly performs a fetch of new objects from
+the primary object pool member into the object pool. This fetch is inefficient
+as it needlessly negotiates which objects are new in the primary object pool
+member. But given that objects in the primary object pool member are already
+deduplicated, its object database should only contain objects that do not yet
+exist in the object pool. Consequently, we should be able to skip the
+negotiation completely and instead link all objects that exist in the source
+repository into the object pool, as sketched below.
+
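+The following is a minimal sketch of such a linking step. It assumes loose
+objects only and ignores packfiles, locking and concurrent writers, all of
+which a real implementation would have to handle:
+
+```go
+package objectpool
+
+import (
+	"io/fs"
+	"os"
+	"path/filepath"
+)
+
+// linkObjects hard-links every object file that exists in the source
+// repository's object database but is missing from the pool's. No object
+// negotiation is needed: anything missing from the pool is linked into place.
+func linkObjects(sourceObjectsDir, poolObjectsDir string) error {
+	return filepath.WalkDir(sourceObjectsDir, func(path string, d fs.DirEntry, err error) error {
+		if err != nil || d.IsDir() {
+			return err
+		}
+		rel, err := filepath.Rel(sourceObjectsDir, path)
+		if err != nil {
+			return err
+		}
+		target := filepath.Join(poolObjectsDir, rel)
+		if _, err := os.Stat(target); err == nil {
+			return nil // object already exists in the pool
+		}
+		if err := os.MkdirAll(filepath.Dir(target), 0o755); err != nil {
+			return err
+		}
+		// Hard links share the same inode, so this is cheap regardless of
+		// object size and deduplicates the objects on disk.
+		return os.Link(path, target)
+	})
+}
+```
+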
+In the current design, these objects are kept alive by creating references to
+the just-fetched objects. If the fetch deleted references or force-updated any
+references, previously-referenced objects may become unreferenced. Gitaly thus
+creates keep-around references so that these objects can never be deleted.
+Furthermore, those references are required in order to properly replicate
+object pools, as the replication is reference-based.
+
+These two things can be solved in different ways:
+
+- We can set the `preciousObjects` repository extension. This instructs all
+  versions of Git that understand this extension to never delete any objects,
+  even if `git-prune(1)` or similar commands are executed. Versions of Git that
+  do not understand this extension refuse to work in the repository altogether
+  (see the configuration sketch after this list).
+
+- Instead of replicating object pools via `git-fetch(1)`, we can replicate
+  them by sending over all objects that are part of the object database.
+
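+For illustration, the extension could be set by shelling out to
+`git-config(1)` as sketched below. Note that `extensions.*` settings are only
+honored once the repository format version is raised to 1:
+
+```go
+package objectpool
+
+import "os/exec"
+
+// markPreciousObjects marks a pool repository such that Git refuses to
+// delete any of its objects, for example via git-prune(1) or git-repack(1).
+func markPreciousObjects(poolPath string) error {
+	for _, kv := range [][2]string{
+		{"core.repositoryformatversion", "1"},
+		{"extensions.preciousObjects", "true"},
+	} {
+		if err := exec.Command("git", "-C", poolPath, "config", kv[0], kv[1]).Run(); err != nil {
+			return err
+		}
+	}
+	return nil
+}
+```
+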
+Taken together this means that we can stop writing references in object pools
+altogether. This leads to efficient updates of object pools by simply linking
+all new objects into place, and it fixes issues we have seen with unbounded
+growth of references in object pools.
+
+## Design and implementation details
+
+<!--
+
+This section intentionally left blank. I first want to reach consensus on the
+bigger picture I'm proposing in this blueprint before I iterate and fill in the
+lower-level design and implementation details.
+
+-->
+
+## Problems with the design
+
+As mentioned before, object pools are not a perfect solution. This section goes
+over the most important issues.
+
+### Complexity of lifecycle management
+
+Even though the lifecycle of object pools becomes easier to handle once it is
+fully owned by Gitaly, it remains complex and needs to be considered in many
+places. Handling object pools in combination with their repositories is not an
+atomic operation, as any action by necessity spans at least two different
+resources.
+
+### Performance issues
+
+As object pools deduplicate objects, the end result is that object pool members
+never have the full closure of objects in a single packfile. This is not
+typically an issue for the primary object pool member, which by definition
+cannot diverge from the object pool's contents. But secondary object pool
+members can and often will diverge from the original contents of the upstream
+repository.
+
+This leads to two different sets of reachable objects in secondary object pool
+members. Unfortunately, due to limitations in Git itself, this precludes the use
+of a subset of optimizations:
+
+- Packfiles cannot be reused as efficiently when serving fetches, as
+  already-deltified objects cannot always be reused as-is. This requires Git to
+  recompute deltas on the fly for object pool members which have diverged from
+  their object pools.
+
+- Packfile bitmaps can only exist in the object pool, as it is not easily
+  feasible for these bitmaps to cover multiple object databases. This requires
+  Git to traverse larger parts of the object graph for many operations,
+  especially when serving fetches.
+
+### Dependent writes across repositories
+
+The design of object pools introduces significant complexity into the Raft
+world, where we use a write-ahead log for all changes to repositories. In the
+ideal case, a Raft-based design would only need to care about the write-ahead
+log of a single repository when considering requests. But with object pools,
+both reads from and writes to a pooled repository depend on all writes to its
+object pool having been applied.
+
+## Alternative Solutions
+
+The proposed solution is not obviously the best choice as it has issues both
+with complexity (management of the lifecycle) and performance (inefficiently
+served fetches for pool members).
+
+This section explores alternatives to object pools and why they have not been
+chosen as the new target architecture.
+
+### Stop using object pools altogether
+
+An obvious way to avoid all of the complexity is to stop using object pools
+altogether. While this is appealing from an engineering point of view, as we
+could significantly simplify the architecture, it is not viable from the
+product perspective: it would mean that we cannot support efficient forking
+workflows.
+
+### Primary repository as object pool
+
+Instead of creating an explicit object pool repository, we could just use the
+upstream repository as an alternate object database of all forks. This avoids a
+lot of complexity around managing the lifetime of the object pool, at least
+superficially. Furthermore, it circumvents the issue of how to update object
+pools as it will always match the contents of the upstream repository.
+
+It has a number of downsides though:
+
+- Normal repositories can now have different states, where some of the
+ repositories are allowed to prune objects and others aren't. This introduces a
+ source of uncertainty and makes it easy to accidentally delete objects in a
+ normal repository and thus corrupt its forks.
+
+- When an upstream repository goes private, we must stop updating the set of
+  objects that is deduplicated across members of the fork network. This means
+  that we would ultimately still be forced to create object pools once this
+  happens, in order to freeze the set of deduplicated objects at the point in
+  time where the repository goes private.
+
+- Deleting repositories becomes more complex as we need to take into account
+ whether a repository is linked to by forks.
+
+### Reference namespaces
+
+With `gitnamespaces(7)`, Git provides a mechanism to partition references into
+different sets of namespaces. This allows us to serve all forks from a single
+repository that contains all objects.
+
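+As an illustration, serving a fetch for a single fork out of such a shared
+repository could look like the sketch below. `GIT_NAMESPACE` is the mechanism
+documented in `gitnamespaces(7)`; the surrounding plumbing is hypothetical:
+
+```go
+package objectpool
+
+import (
+	"os"
+	"os/exec"
+)
+
+// uploadPackForFork serves a fetch for one fork from the single shared
+// repository. All forks share one object database, while GIT_NAMESPACE
+// restricts the references the client gets to see to
+// refs/namespaces/<forkID>/.
+func uploadPackForFork(sharedRepoPath, forkID string) error {
+	cmd := exec.Command("git", "upload-pack", sharedRepoPath)
+	cmd.Env = append(os.Environ(), "GIT_NAMESPACE="+forkID)
+	cmd.Stdin = os.Stdin
+	cmd.Stdout = os.Stdout
+	cmd.Stderr = os.Stderr
+	return cmd.Run()
+}
+```
+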
+One neat property is that we have the global view of objects referenced by all
+forks together in a single object database. We can thus easily perform shared
+housekeeping across all forks at once, including deletion of objects that are
+not used by any of the forks anymore. Regarding objects, this is likely to be
+the most efficient solution we could potentially aim for.
+
+There are again some downsides though:
+
+- Calculating usage quotas must by necessity take actual reachability of
+  objects into account, which is expensive to compute. This is not a
+  showstopper, but something to keep in mind.
+
+- One stated requirement is that it must not be possible to make objects
+  reachable in other repositories from forks. This property could theoretically
+  be enforced by only allowing access to reachable objects: an object could
+  only be accessed through a virtual repository if it is reachable from that
+  repository's references. But reachability checks are too computationally
+  expensive for this to be practical.
+
+- Even though references are partitioned, large fork networks would still easily
+ end up with multiple millions of references. It is unclear what the impact on
+ performance would be.
+
+- The blast radius for any repository-level attacks significantly increases as
+ you would not only impact your own repository, but also all forks.
+
+- Custom hooks would have to be isolated for each of the virtual repositories.
+  Since the execution of Git hooks is controlled by Gitaly, it should be
+  possible to handle this for each of the namespaces.
+
+### Filesystem-based deduplication
+
+The idea of deduplicating objects on the filesystem level was floating around at
+several points in time. While it would be nice if we could shift the burden of
+this to another component, it is likely not easy to implement due to the nature
+of how Git works.
+
+Git objects are the most important contributing factor to repository size.
+While it would be possible to store objects in their loose representation and
+thus deduplicate on that level, this is infeasible:
+
+- Git would not be able to deltify objects, which is an extremely important
+ mechanism to reduce on-disk size. It is unlikely that the size reduction
+ caused by deduplication would outweigh the size reduction gained from the
+ deltification mechanism.
+
+- Loose objects are significantly less efficient when accessing the repository.
+
+- Serving fetches requires us to send a packfile to the client. Usually, Git is
+ able to reuse large parts of already-existing packfiles, which significantly
+ reduces the computational overhead.
+
+Deduplicating on the loose-object level is thus infeasible.
+
+The other unit that one could try to deduplicate is packfiles. But packfiles are
+not deterministically generated by Git and will furthermore be different once
+repositories start to diverge from each other. So packfiles are not a natural
+fit for filesystem-level deduplication either.
+
+An alternative could be to use hard links of packfiles across repositories.
+But storage space would be duplicated whenever any repository decides to
+repack its objects, which would be unpredictable and hard to manage.
+
+### Custom object backend
+
+In theory, it would be possible to implement a custom object backend that allows
+us to store objects in such a way that we can deduplicate them across forks.
+There are several technical hurdles though that keep us from doing so without
+significant upstream investments:
+
+- Git is not currently designed to have different backends for objects.
+  Accesses to files that are part of the object database are littered across
+  the code base with no abstraction layer. This is in contrast to the
+  reference database, which has at least some level of abstraction.
+
+- Implementing a custom object backend would likely necessitate a fork of the
+ Git project. Even if we had the resources to do so, it would introduce a major
+ risk factor due to potential incompatibilities with upstream changes. It would
+ become impossible to use vanilla Git, which is often a requirement that exists
+ in the context of Linux distributions that package GitLab.
+
+Both the initial and the operational risk of ongoing maintenance are too high to
+really justify this approach for now. We might revisit this approach in the
+future.