Diffstat (limited to 'doc/topics/git/partial_clone.md')
-rw-r--r-- doc/topics/git/partial_clone.md 147
1 files changed, 147 insertions, 0 deletions
diff --git a/doc/topics/git/partial_clone.md b/doc/topics/git/partial_clone.md
new file mode 100644
index 00000000000..f2951308ba1
--- /dev/null
+++ b/doc/topics/git/partial_clone.md
@@ -0,0 +1,147 @@
+# Partial Clone for Large Repositories
+
+CAUTION: **Alpha:**
+Partial Clone is an experimental feature. Performing a partial clone
+significantly increases Gitaly resource utilization and decreases the
+performance of subsequent fetch operations.
+
+As Git repositories grow very large, performance degrades and usability
+suffers with it. One major challenge is cloning, because Git downloads the
+entire repository, including every commit and every version of every object.
+This can be slow to transfer and can require large amounts of disk space.
+
+Historically, performing a **shallow clone**
+([`--depth`](https://www.git-scm.com/docs/git-clone#Documentation/git-clone.txt---depthltdepthgt))
+has been the only way to reduce the amount of data transferred when cloning
+a Git repository. A shallow clone does not, however, allow filtering by
+subtree, which is important for monolithic repositories containing many
+projects, or by object size, which prevents unnecessarily large objects from
+being downloaded.
+
+[Partial clone](https://github.com/git/git/blob/master/Documentation/technical/partial-clone.txt)
+is a performance optimization that "allows Git to function without having a
+complete copy of the repository. The goal of this work is to allow Git better
+handle extremely large repositories."
+
+Specifically, using partial clone, it should be possible for Git to natively
+support:
+
+- large objects, instead of using [Git LFS](https://git-lfs.github.com/)
+- enormous repositories
+
+Briefly, partial clone works by:
+
+- excluding objects from being transferred when cloning or fetching a
+ repository using a new `--filter` flag
+- downloading missing objects on demand
+
+Follow [Git for enormous repositories](https://gitlab.com/groups/gitlab-org/-/epics/773) for roadmap and updates.
+
+## Enabling partial clone
+
+GitLab 12.1 uses Git 2.21.0, which has an arbitrary file access security
+vulnerability when `uploadpack.allowFilter` is enabled. Do not enable
+`uploadpack.allowFilter` in production environments.
+
+A feature flag is planned to enable `uploadpack.allowFilter` and
+`uploadpack.allowAnySHA1InWant` once the version of Git used by GitLab has been
+updated to Git 2.22.0.
+
+Follow [this issue](https://gitlab.com/gitlab-org/gitaly/issues/1553) for
+updates.
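+
+For experimentation on a non-production Git server, both options are ordinary
+Git configuration settings. As a minimal sketch, assuming a self-managed test
+server whose Git configuration file you can edit directly (how GitLab itself
+will expose these options is tracked in the issue above), the fragment might
+look like:
+
+```ini
+# Assumed test-server Git configuration fragment; not for production use
+# while the server runs a Git version affected by the vulnerability above.
+[uploadpack]
+    allowFilter = true
+    allowAnySHA1InWant = true
+```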
+
+## Excluding objects by size
+
+Partial Clone allows large objects to be stored directly in the Git repository
+and excluded from clones as the user chooses. This eliminates the error-prone
+process of deciding which objects should be stored in LFS. Using partial
+clone, all files – large or small – may be treated the same.
+
+With the `uploadpack.allowFilter` and `uploadpack.allowAnySHA1InWant` options
+enabled on the Git server:
+
+```bash
+# clone the repo, excluding blobs larger than 1 megabyte
+git clone --filter=blob:limit=1m <url>
+
+# during the checkout step of the clone, and in any subsequent operations,
+# any blobs that are needed are downloaded on demand
+git checkout feature-branch
+```
+
+## Excluding objects by path
+
+Partial Clone allows clones to be filtered by path using a format similar to a
+`.gitignore` file stored inside the repository.
+
+With the `uploadpack.allowFilter` and `uploadpack.allowAnySHA1InWant` options
+enabled on the Git server:
+
+1. **Create a filter spec.** For example, consider a monolithic repository with
+ many applications, each in a different subdirectory in the root. Create a file
+ `shiny-app/.gitfilterspec` using the GitLab web interface:
+
+ ```.gitignore
+ # Only the paths listed in this file are downloaded when performing a
+ # partial clone using `--filter=sparse:oid=master:shiny-app/.gitfilterspec`
+
+ # Explicitly include the filterspec, which is needed to configure sparse
+ # checkout with:
+ # git config --local core.sparsecheckout true
+ # git show master:shiny-app/.gitfilterspec >> .git/info/sparse-checkout
+ shiny-app/.gitfilterspec
+
+ # Shiny App
+ shiny-app/
+
+ # Dependencies
+ shimmery-app/
+ shared-component-a/
+ shared-component-b/
+ ```
+
+2. **Create a new Git repository and fetch.** Support for `--filter=sparse:oid`
+ using the clone command is incomplete, so we will emulate the clone command
+ by hand, using `git init` and `git fetch`. Follow
+ [gitaly#1769](https://gitlab.com/gitlab-org/gitaly/issues/1769) for updates.
+
+ ```bash
+ # Create a new directory for the Git repository
+ mkdir jumbo-repo && cd jumbo-repo
+
+ # Initialize a new Git repository
+ git init
+
+ # Add the remote
+ git remote add origin git@gitlab.com:example/jumbo-repo.git
+
+ # Enable partial clone support for the remote
+ git config --local extensions.partialClone origin
+
+ # Fetch the filtered set of objects using the filterspec stored on the
+ # server. WARNING: this step is slow!
+ git fetch --filter=sparse:oid=master:shiny-app/.gitfilterspec origin
+
+ # Optional: observe there are missing objects that we have not fetched
+ git rev-list --all --quiet --objects --missing=print | wc -l
+ ```
+
+ CAUTION: **IDE and Shell integrations:**
+ Git integrations with `bash`, `zsh`, etc. and editors that automatically
+ show Git status information often run `git fetch`, which fetches the
+ entire repository. You may need to disable or reconfigure these
+ integrations.
+
+3. **Enable sparse checkout.** Sparse checkout must be enabled and configured
+ to prevent objects from other paths from being downloaded automatically when
+ checking out branches. Follow
+ [gitaly#1765](https://gitlab.com/gitlab-org/gitaly/issues/1765) for updates.
+
+ ```bash
+ # Enable sparse checkout
+ git config --local core.sparsecheckout true
+
+ # Configure sparse checkout
+ git show master:shiny-app/.gitfilterspec >> .git/info/sparse-checkout
+
+ # Check out master
+ git checkout master
+ ```