---
status: proposed
creation-date: "2023-03-30"
authors: [ "@pks-gitlab" ]
coach: [ ]
approvers: [ ]
owning-stage: "~devops::systems"
participating-stages: [ "~devops::create" ]
---

# Iterate on the design of object pools

## Summary

Forking repositories is at the heart of many modern workflows for projects
hosted in GitLab. As most of the objects between a fork and its upstream project
will typically be the same, this opens up potential for optimizations:

- Creating forks can theoretically be lightning fast if we reuse large parts of
  the upstream repository.

- We can save on storage space by deduplicating objects which are shared.

This architecture is currently implemented with object pools, which hold the
objects of the primary repository. But the design of object pools has grown
organically and is nowadays showing its limits.

This blueprint explores how we can iterate on the design of object pools to fix
long-standing issues with it. Furthermore, the intent is to arrive at a design
that lets us iterate more readily on the exact implementation details of object
pools.

## Motivation

The current design of object pools is showing scalability problems in various
ways. The problems largely come from the fact that object pools have grown
organically and that we learned as we went along.

It is proving hard to fix the overall design of object pools because there is no
clear ownership. While Gitaly provides the low-level building blocks to make
them work, it does not have enough control over them to be able to iterate on
their implementation details.

There are thus two major goals: taking ownership of object pools so that it
becomes easier to iterate on the design, and fixing scalability issues once we
can iterate.

### Lifecycle ownership

While Gitaly provides the interfaces to manage object pools, their actual
lifecycle is controlled by the client. A typical lifecycle of an object pool
looks as follows:

1. An object pool is created via `CreateObjectPool()`. The caller provides the
   path where the object pool shall be created as well as the origin repository
   from which the object pool shall be created.

1. The origin repository needs to be linked to the object pool explicitly by
   calling `LinkRepositoryToObjectPool()`.

1. The object pool needs to be updated regularly via `FetchIntoObjectPool()`,
   which fetches all changes from the primary pool member into the object pool.

1. To create forks, the client needs to call `CreateFork()` followed by
   `LinkRepositoryToObjectPool()`.

1. Repositories of forks are unlinked by calling `DisconnectGitAlternates()`.
   This will reduplicate objects.

1. The object pool is deleted via `DeleteObjectPool()`.

This lifecycle is complex and leaks a lot of implementation details to the
caller. This was originally done in part to give the Rails side control and
management over Git object visibility. GitLab project visibility rules are
complex and not a Gitaly concern. By exposing these details, Rails can control
when pool membership links are created and broken. It is not clear at the
current point in time how the complete system works, and its limits are not
explicitly documented.
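To make the amount of orchestration concrete, the following Go sketch walks
through this sequence. The RPC names are the ones listed above; the `Repo` type,
the `PoolClient` wrapper, and all method signatures are simplified assumptions
for illustration rather than Gitaly's actual generated gRPC API.

```go
package poolclient

import "context"

// Repo identifies a repository on a Gitaly storage. Both this type and the
// PoolClient interface are simplified assumptions for illustration.
type Repo struct{ StorageName, RelativePath string }

type PoolClient interface {
	CreateObjectPool(ctx context.Context, pool, origin Repo) error
	LinkRepositoryToObjectPool(ctx context.Context, repo, pool Repo) error
	FetchIntoObjectPool(ctx context.Context, origin, pool Repo) error
	CreateFork(ctx context.Context, origin, fork Repo) error
	DisconnectGitAlternates(ctx context.Context, repo Repo) error
	DeleteObjectPool(ctx context.Context, pool Repo) error
}

// forkLifecycle walks through the steps a client has to orchestrate today.
func forkLifecycle(ctx context.Context, c PoolClient, origin, fork, pool Repo) error {
	// 1. Create the pool from the origin repository.
	if err := c.CreateObjectPool(ctx, pool, origin); err != nil {
		return err
	}
	// 2. Explicitly link the origin to its pool.
	if err := c.LinkRepositoryToObjectPool(ctx, origin, pool); err != nil {
		return err
	}
	// 3. Regularly refresh the pool from the primary member.
	if err := c.FetchIntoObjectPool(ctx, origin, pool); err != nil {
		return err
	}
	// 4. Forks are created via a full clone and only then linked to the pool.
	if err := c.CreateFork(ctx, origin, fork); err != nil {
		return err
	}
	if err := c.LinkRepositoryToObjectPool(ctx, fork, pool); err != nil {
		return err
	}
	// 5. Members are eventually unlinked, which reduplicates objects ...
	if err := c.DisconnectGitAlternates(ctx, fork); err != nil {
		return err
	}
	// 6. ... and the pool is deleted once it is no longer needed.
	return c.DeleteObjectPool(ctx, pool)
}
```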
In addition to the complexity of the lifecycle, we also have multiple sources of
truth for pool membership. Gitaly never tracks the set of members of a pool
repository but can only tell for a specific repository that it is part of said
pool. Consequently, Rails is forced to maintain this information in a database,
and it is hard to keep that information from becoming stale.

### Repository maintenance

Related to the lifecycle ownership issues is the issue of repository
maintenance. As mentioned, keeping an object pool up to date requires regular
calls to `FetchIntoObjectPool()`. This leaks implementation details to the
client, but was done to give the client control over syncing the primary
repository with its object pool. With this control, private repositories can be
prevented from syncing and consequently from leaking objects to other
repositories in the fork network.

We have had good success with moving repository maintenance into Gitaly so that
clients do not need to know about on-disk details. Ideally, we would do the same
for repositories that are the primary member of an object pool: if we optimize
their on-disk state, we would also automatically update the object pool.

There are two issues that keep us from doing so:

- Gitaly does not know about the relationship between an object pool and its
  members.

- Updating object pools is expensive.

By making Gitaly the single source of truth for object pool memberships we would
be in a position to fix both issues.

### Fast forking

In the current implementation, Rails first invokes `CreateFork()`, which results
in a complete `git-clone(1)` being performed to generate the fork repository.
This is followed by `LinkRepositoryToObjectPool()` to link the fork with the
object pool. It is not until housekeeping is performed on the fork repository
that objects are deduplicated. This not only leaks implementation details to
clients, but also keeps us from reaping the full potential benefit of object
pools.

In particular, creating forks is a lot slower than it could be since a clone is
always performed before linking. If the steps of creating the fork and linking
the fork to the pool repository were unified, the initial clone could be
avoided.

### Clustered object pools

The development of Gitaly Cluster and object pools overlapped. Consequently,
they are known to not work well together. Praefect neither ensures that
repositories with object pools have their object pools present on all nodes, nor
does it ensure that object pools are in a known state. To the extent that they
work at all, object pools only work by chance.

The current state has led to cases where object pools were missing or had
different contents per node. This can result in inconsistently observed state in
object pool members and in failures of writes that depend on the object pool's
contents.

One way object pools might be handled for clustered Gitaly could be to have the
pool repositories duplicated on nodes that contain repositories dependent on
them. This would allow members of a fork network to exist on different nodes. To
make this work, repository replication would have to be aware of object pools
and know when it needs to duplicate them onto a particular node.
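On disk, all of the linking and unlinking discussed throughout this section
builds on Git's alternates mechanism: a member repository's
`objects/info/alternates` file points at the pool's object database, which makes
the pool's objects available in the member. The following Go sketch shows a
minimal version of establishing such a link; the paths and the helper function
are illustrative, and real code would additionally need locking and atomic file
updates.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// linkToPool makes poolPath's objects reachable from repoPath by writing the
// pool's object database into the member's objects/info/alternates file. Git
// resolves relative alternates entries against the member's objects/ directory.
func linkToPool(repoPath, poolPath string) error {
	infoDir := filepath.Join(repoPath, "objects", "info")
	if err := os.MkdirAll(infoDir, 0o755); err != nil {
		return err
	}
	rel, err := filepath.Rel(
		filepath.Join(repoPath, "objects"),
		filepath.Join(poolPath, "objects"),
	)
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(infoDir, "alternates"), []byte(rel+"\n"), 0o644)
}

func main() {
	// Illustrative paths for a fork linked against its pool.
	if err := linkToPool("/repos/fork.git", "/repos/@pools/pool.git"); err != nil {
		fmt.Println("link failed:", err)
	}
}
```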
## Requirements

There are a set of requirements and invariants that must hold for any particular
solution.

### Private upstream repositories should not leak objects to forks

When a project has a visibility setting that is not public, the objects in the
repository should not be fetched into an object pool. An object pool should only
ever contain objects from the upstream repository that were at one point public.
This prevents private upstream repositories from having objects leaked to forks
through a shared object pool.

### Forks cannot sneak objects into upstream projects

It should not be possible to make objects uploaded to a fork repository
accessible in the upstream repository via a shared object pool. Otherwise,
potentially unauthorized users would be able to "sneak" objects into
repositories by simply forking them.

Besides causing confusion, this could also serve as a mechanism to corrupt
upstream repositories by introducing objects that are known to be broken.

### Object pool lifetime exceeds upstream repository lifetime

If the upstream repository gets deleted, its object pool should remain in place
to provide continued deduplication of shared objects between the other
repositories in the fork network. Thus it can be said that the lifetime of the
object pool is longer than the lifetime of the upstream repository. An object
pool should only be deleted once there are no longer any repositories
referencing it.

### Object lifetime

By deduplicating objects in a fork network, repositories become dependent on the
object pool. Missing objects in the pooled repository could lead to corruption
of repositories in the fork network. Therefore, objects in the pooled repository
must continue to exist as long as there are repositories referencing them.

Without a mechanism to accurately determine whether a pooled object is
referenced by one or more repositories, all objects in the pooled repository
must remain. Only when there are no repositories referencing the object pool can
the pooled repository, and therefore all its objects, be removed.

### Object sharing

An object that is deduplicated becomes accessible from all forks of a particular
repository, even if it has never been reachable in any of the forks. The
consequence is that any write to an object pool immediately influences all of
its members.

We need to be mindful of this property when repositories connected to an object
pool are replicated. As the user-observable state should be the same on all
replicas, we need to ensure that both the repository and its object pool are
consistent across the different nodes.

## Proposal

In the current design, management of object pools mostly happens on the client
side, as clients need to manage their complete lifecycle. This requires Rails to
store the object pool relationships in the Rails database, perform fine-grained
management of every single step of an object pool's life, and run periodic
Sidekiq jobs to enforce state by calling idempotent Gitaly RPCs. This design
significantly increases the complexity of an already-complex mechanism.

Instead of handling the full lifecycle of object pools on the client side, this
document proposes to encapsulate the object pool lifecycle management inside of
Gitaly. Instead of performing low-level actions to maintain object pools,
clients would only need to tell Gitaly about updated relationships between a
repository and its object pool.

This brings us multiple advantages:

- The inherent complexity of the lifecycle management is encapsulated in a
  single place, namely Gitaly.

- Gitaly is in a better position to iterate on the low-level technical design of
  object pools in case we find a better solution compared to "alternates" in the
  future.

- We can ensure better interplay between Gitaly Cluster, object pools and
  repository housekeeping.

- Gitaly becomes the single source of truth for object pool relationships and
  can thus start to manage them better.

Overall, the goal is to raise the abstraction level so that clients need to
worry less about the technical details while Gitaly is in a better position to
iterate on them.

### Move lifecycle management of pools into Gitaly

The lifecycle management of object pools leaks too many details to the client,
and by doing so makes things both hard to understand and inefficient.

The current solution relies on a set of fine-grained RPCs that manage the
relationship between repositories and their object pools. Instead, we are aiming
for a simplified approach that only exposes the high-level concept of forks to
the client. This will happen in the form of three RPCs:

- `ForkRepository()` will create a fork of a given repository. If the upstream
  repository does not yet have an object pool, Gitaly will create it. It will
  then create the new repository and automatically link it to the object pool.
  The upstream repository will be recorded as the primary member of the object
  pool, the fork will be recorded as a secondary member of the object pool.

- `UnforkRepository()` will remove a repository from the object pool it is
  connected to. This will stop deduplication of objects. For the primary object
  pool member this also means that Gitaly will stop pulling new objects into the
  object pool.

- `GetObjectPool()` returns the object pool for a given repository. The pool
  description will contain information about the pool's primary object pool
  member as well as all secondary object pool members.

Furthermore, the following changes will be implemented:

- `RemoveRepository()` will remove the repository from its object pool. If it
  was the last object pool member, the pool will be removed.

- `OptimizeRepository()`, when executed on the primary object pool member, will
  also update and optimize the object pool.

- `ReplicateRepository()` needs to be aware of object pools and replicate them
  correctly. Repositories shall be linked to and unlinked from object pools as
  required. While this is a step towards fixing the Praefect world, which may
  seem redundant given that we plan to deprecate Praefect anyway, this RPC call
  is also used for other use cases like repository rebalancing.

With these changes, Gitaly will have much tighter control over the lifecycle of
object pools. Furthermore, as it starts to track the membership of repositories
in object pools, it can become the single source of truth for fork networks.
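For comparison with the six-step lifecycle from the motivation section, a Go
rendering of the proposed surface might look as follows. The interface and
message shapes are assumptions for illustration; the final RPC definitions may
well differ.

```go
package poolapi

import "context"

// Repo identifies a repository on a Gitaly storage.
type Repo struct{ StorageName, RelativePath string }

// PoolInfo describes a pool as returned by GetObjectPool(): its primary
// member and all secondary members. Field names are illustrative.
type PoolInfo struct {
	Pool        Repo
	Primary     Repo
	Secondaries []Repo
}

// ForkService is an illustrative rendering of the three proposed RPCs. Pool
// creation, linking, and membership bookkeeping all happen inside Gitaly.
type ForkService interface {
	// ForkRepository creates the fork, transparently creating the object
	// pool and linking both members if the upstream has no pool yet.
	ForkRepository(ctx context.Context, upstream, fork Repo) error
	// UnforkRepository detaches a member from its pool and reduplicates
	// objects; for the primary member it also stops pool updates.
	UnforkRepository(ctx context.Context, repo Repo) error
	// GetObjectPool reports the pool and its members for a repository.
	GetObjectPool(ctx context.Context, repo Repo) (PoolInfo, error)
}
```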
### Fix inefficient maintenance of object pools

In order to update object pools, Gitaly performs a fetch of new objects from the
primary object pool member into the object pool. This fetch is inefficient, as
it needlessly negotiates which objects are new in the primary object pool
member. Given that objects are already deduplicated in the primary object pool
member, its object database should only contain objects that do not yet exist in
the object pool. Consequently, we should be able to skip the negotiation
completely and instead link into the object pool all objects that exist in the
source repository.
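A minimal sketch of what such object linking could look like, assuming loose
objects that can be hard-linked between the two object databases; a real
implementation would additionally have to deal with packfiles, locking, and
crash safety.

```go
package poolmaint

import (
	"io/fs"
	"os"
	"path/filepath"
)

// linkObjects hard-links every loose object file from the source member's
// object database into the pool's object database, skipping objects the pool
// already has. No fetch negotiation is needed because, with deduplication in
// place, the source only holds objects the pool is missing.
func linkObjects(srcObjects, poolObjects string) error {
	return filepath.WalkDir(srcObjects, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		rel, err := filepath.Rel(srcObjects, path)
		if err != nil {
			return err
		}
		target := filepath.Join(poolObjects, rel)
		if _, err := os.Stat(target); err == nil {
			return nil // object already exists in the pool
		}
		if err := os.MkdirAll(filepath.Dir(target), 0o755); err != nil {
			return err
		}
		return os.Link(path, target)
	})
}
```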
In the current design, these objects are kept alive by creating references to
the just-fetched objects. If the fetch deleted references or force-updated any
references, then it may happen that previously-referenced objects become
unreferenced. Gitaly thus creates keep-around references so that they cannot
ever be deleted. Furthermore, those references are required in order to properly
replicate object pools, as the replication is reference-based.

These two things can be solved in different ways:

- We can set the `preciousObjects` repository extension (see the sketch at the
  end of this section). This instructs all versions of Git which understand the
  extension to never delete any objects, even if `git-prune(1)` or similar
  commands are executed. Versions of Git that do not understand the extension
  refuse to work in the repository.

- Instead of replicating object pools via `git-fetch(1)`, we can replicate them
  by sending over all objects that are part of the object database.

Taken together, this means that we can stop writing references in object pools
altogether. This leads to efficient updates of object pools by simply linking
all new objects into place, and it fixes issues we have seen with unbounded
growth of references in object pools.
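For reference, `preciousObjects` is an existing Git repository format extension:
once it is set, Git commands that could delete objects, such as `git-prune(1)`
or `git repack -d`, refuse to run. The helper below is an illustrative sketch of
how a pool could opt in, not actual Gitaly code.

```go
package housekeeping

import (
	"fmt"
	"os/exec"
)

// markPrecious enables the preciousObjects extension in an object pool so that
// Git never deletes objects from it. Repository format extensions belong to
// repository format version 1, so we bump that as well.
func markPrecious(poolPath string) error {
	for _, args := range [][]string{
		{"config", "core.repositoryFormatVersion", "1"},
		{"config", "extensions.preciousObjects", "true"},
	} {
		cmd := exec.Command("git", append([]string{"-C", poolPath}, args...)...)
		if out, err := cmd.CombinedOutput(); err != nil {
			return fmt.Errorf("git %v: %w: %s", args, err, out)
		}
	}
	return nil
}
```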
## Design and implementation details

<!--

This section intentionally left blank. I first want to reach consensus on the
bigger picture I'm proposing in this blueprint before I iterate and fill in the
lower-level design and implementation details.

-->

## Problems with the design

As mentioned before, object pools are not a perfect solution. This section goes
over the most important issues.

### Complexity of lifecycle management

Even though the lifecycle of object pools becomes easier to handle once it is
fully owned by Gitaly, it remains complex and needs to be considered in many
ways. Handling object pools in combination with their repositories is not an
atomic operation, as any action by necessity spans at least two different
resources.

### Performance issues

As object pools deduplicate objects, the end result is that object pool members
never have the full closure of objects in a single packfile. This is typically
not an issue for the primary object pool member, which by definition cannot
diverge from the object pool's contents. But secondary object pool members can
and often will diverge from the original contents of the upstream repository.

This leads to two different sets of reachable objects in secondary object pool
members. Unfortunately, due to limitations in Git itself, this precludes the use
of a subset of optimizations:

- Packfiles cannot be reused as efficiently when serving fetches to send
  already-deltified objects. This requires Git to recompute deltas on the fly
  for object pool members which have diverged from their object pools.

- Packfile bitmaps can only exist in object pools, as it is neither possible nor
  easily feasible for these bitmaps to cover multiple object databases. This
  requires Git to traverse larger parts of the object graph for many operations,
  especially when serving fetches.

### Dependent writes across repositories

The design of object pools introduces significant complexity into the Raft
world, where we use a write-ahead log for all changes to repositories. In the
ideal case, a Raft-based design would only need to care about the write-ahead
log of a single repository when considering requests. But with object pools, we
are forced to consider both reads and writes for a pooled repository to be
dependent on all writes in its object pool having been applied.

## Alternative Solutions

The proposed solution is not obviously the best choice, as it has issues both
with complexity (management of the lifecycle) and performance (inefficiently
served fetches for pool members).

This section explores alternatives to object pools and why they have not been
chosen as the new target architecture.

### Stop using object pools altogether

An obvious way to avoid all of the complexity is to stop using object pools
altogether. While this is appealing from an engineering point of view, as we
could significantly simplify the architecture, it is not a viable approach from
the product perspective as it would mean that we cannot support efficient
forking workflows.

### Primary repository as object pool

Instead of creating an explicit object pool repository, we could just use the
upstream repository as an alternate object database of all forks. This avoids a
lot of complexity around managing the lifetime of the object pool, at least
superficially. Furthermore, it circumvents the issue of how to update object
pools, as the pool would always match the contents of the upstream repository.

It has a number of downsides though:

- Normal repositories can now have different states, where some of the
  repositories are allowed to prune objects and others aren't. This introduces a
  source of uncertainty and makes it easy to accidentally delete objects in a
  normal repository and thus corrupt its forks.

- When upstream repositories go private we must stop updating objects which are
  supposed to be deduplicated across members of the fork network. This means
  that we would ultimately still be forced to create object pools once this
  happens in order to freeze the set of deduplicated objects at the point in
  time where the repository goes private.

- Deleting repositories becomes more complex, as we need to take into account
  whether a repository is linked to by forks.

### Reference namespaces

With `gitnamespaces(7)`, Git provides a mechanism to partition references into
different sets of namespaces. This would allow us to serve all forks from a
single repository that contains all objects; a sketch of serving a namespaced
fetch follows the list of downsides below.

One neat property is that we would have a global view of the objects referenced
by all forks together in a single object database. We could thus easily perform
shared housekeeping across all forks at once, including deletion of objects that
are no longer used by any of the forks. Regarding objects, this is likely the
most efficient solution we could potentially aim for.

There are again some downsides though:

- Calculating usage quotas must by necessity take actual reachability of objects
  into account, which is expensive to compute. This is not a showstopper, but
  something to keep in mind.

- One stated requirement is that it must not be possible to make objects
  reachable in other repositories from forks. This property could theoretically
  be enforced by only allowing access to reachable objects. That way, an object
  could only be accessed through a virtual repository if it is reachable from
  that repository's references. Reachability checks are too compute-heavy for
  this to be practical.

- Even though references are partitioned, large fork networks would still easily
  end up with multiple millions of references. It is unclear what the impact on
  performance would be.

- The blast radius of any repository-level attack significantly increases, as it
  would not only impact your own repository, but also all forks.

- Custom hooks would have to be isolated for each of the virtual repositories.
  Since the execution of Git hooks is controlled, it should be possible to
  handle this for each of the namespaces.
### Filesystem-based deduplication

The idea of deduplicating objects on the filesystem level has been floating
around at several points in time. While it would be nice if we could shift this
burden to another component, it is likely not easy to implement due to the
nature of how Git works.

The most important contributing factor to repository sizes is Git objects. While
it would be possible to store the objects in their loose representation and thus
deduplicate on that level, this is infeasible:

- Git would not be able to deltify objects, which is an extremely important
  mechanism to reduce on-disk size. It is unlikely that the size reduction
  gained from deduplication would outweigh the size reduction gained from the
  deltification mechanism.

- Loose objects are significantly less efficient when accessing the repository.

- Serving fetches requires us to send a packfile to the client. Usually, Git is
  able to reuse large parts of already-existing packfiles, which significantly
  reduces the computational overhead.

Deduplicating on the loose-object level is thus infeasible.

The other unit one could try to deduplicate is packfiles. But packfiles are not
generated deterministically by Git and will furthermore differ once repositories
start to diverge from each other. So packfiles are not a natural fit for
filesystem-level deduplication either.

An alternative could be to use hard links of packfiles across repositories. This
would cause us to duplicate storage space whenever any repository decides to
repack its objects and would thus be unpredictable and hard to manage.

### Custom object backend

In theory, it would be possible to implement a custom object backend that allows
us to store objects in such a way that we can deduplicate them across forks.
There are several technical hurdles though that keep us from doing so without
significant upstream investment:

- Git is not currently designed to have different backends for objects. Accesses
  to files that are part of the object database are littered across the code
  base with no abstraction layer. This is in contrast to the reference database,
  which has at least some level of abstraction.

- Implementing a custom object backend would likely necessitate a fork of the
  Git project. Even if we had the resources to do so, it would introduce a major
  risk factor due to potential incompatibilities with upstream changes. It would
  become impossible to use vanilla Git, which is often a requirement that exists
  in the context of Linux distributions that package GitLab.

Both the initial risk and the operational risk of ongoing maintenance are too
high to justify this approach for now. We might revisit this approach in the
future.