---
stage: Systems
group: Gitaly
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/product/ux/technical-writing/#assignments
---

# Reduce repository size **(FREE)**

Git repositories become larger over time. When large files are added to a Git repository:

- Fetching the repository becomes slower because everyone must download the files.
- They take up a large amount of storage space on the server.
- Git repository storage limits [can be reached](#storage-limits).

Rewriting a repository can remove unwanted history to make the repository smaller.
We **recommend [`git filter-repo`](https://github.com/newren/git-filter-repo/blob/main/README.md)**
over [`git filter-branch`](https://git-scm.com/docs/git-filter-branch) and
[BFG](https://rtyley.github.io/bfg-repo-cleaner/).

WARNING:
Rewriting repository history is a destructive operation. Make sure to back up your repository before
you begin. The best way to back up a repository is to
[export the project](../settings/import_export.md#export-a-project-and-its-data).

## Purge files from repository history

To reduce the size of your repository in GitLab, you must first remove references to large files from branches, tags, *and*
other internal references (refs) that are automatically created by GitLab. These refs include:

- `refs/merge-requests/*` for merge requests.
- `refs/pipelines/*` for
  [pipelines](../../../ci/troubleshooting.md#fatal-reference-is-not-a-tree-error).
- `refs/environments/*` for environments.
- `refs/keep-around/*`, which are created as hidden refs to prevent commits referenced in the database from being removed.

These refs are not automatically downloaded, and hidden refs are not advertised, but a project export includes them, so you can use an export to remove them.
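
To see how many of these internal refs a repository carries, you can group ref names by namespace. The following is a minimal sketch and not part of GitLab: `count_refs_by_namespace` is a hypothetical helper that reads ref names on standard input, for example from `git for-each-ref` run in the mirror clone created in the steps below.

```shell
# Hypothetical helper: count refs per namespace (for example, refs/heads).
# Reads one ref name per line on standard input.
count_refs_by_namespace() {
  awk -F/ 'NF >= 2 { count[$1 "/" $2]++ }
           END { for (ns in count) print count[ns], ns }' | sort -k2
}

# Example usage inside a mirror clone:
#   git for-each-ref --format='%(refname)' | count_refs_by_namespace
```

A large count under `refs/merge-requests` or `refs/keep-around` suggests that force pushing branches and tags alone is unlikely to release the old objects.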

WARNING:
This process is not suitable for removing sensitive data like passwords or keys from your repository.
Information about commits, including file content, is cached in the database, and remains
visible even after it has been removed from the repository.

To purge files from a GitLab repository:

1. Install [`git filter-repo`](https://github.com/newren/git-filter-repo/blob/main/INSTALL.md) and, optionally,
   [`git-sizer`](https://github.com/github/git-sizer#getting-started)
   using a supported package manager or from source.

1. Generate a fresh
   [export from the project](../settings/import_export.md#export-a-project-and-its-data) and download it.
   This project export contains a backup copy of your repository *and* refs
   we can use to purge files from your repository.

1. Decompress the backup using `tar`:

   ```shell
   tar xzf project-backup.tar.gz
   ```

   This contains a `project.bundle` file, which was created by
   [`git bundle`](https://git-scm.com/docs/git-bundle).

1. Clone a fresh copy of the repository from the bundle using the `--bare` and `--mirror` options:

   ```shell
   git clone --bare --mirror /path/to/project.bundle
   ```

1. Go to the `project.git` directory:

   ```shell
   cd project.git
   ```

1. Because cloning from a bundle file sets the `origin` remote to the local bundle file, change it to the URL of your repository:

   ```shell
   git remote set-url origin https://gitlab.example.com/<namespace>/<project_name>.git
   ```

1. Using either `git filter-repo` or `git-sizer`, analyze your repository
   and review the results to determine which items you want to purge:

   ```shell
   # Using git filter-repo
   git filter-repo --analyze
   # In a bare clone, the analysis is written to filter-repo/ in the current directory
   head filter-repo/analysis/*-{all,deleted}-sizes.txt

   # Using git-sizer
   git-sizer
   ```

1. Purge the history of your repository using relevant `git filter-repo` options.
   Two common options are:

   - `--path` and `--invert-paths` to purge specific files:

     ```shell
     git filter-repo --path path/to/file.ext --invert-paths
     ```

   - `--strip-blobs-bigger-than` to purge all files larger than a given size, for example 10M:

     ```shell
     git filter-repo --strip-blobs-bigger-than 10M
     ```

   See the
   [`git filter-repo` documentation](https://htmlpreview.github.io/?https://github.com/newren/git-filter-repo/blob/docs/html/git-filter-repo.html#EXAMPLES)
   for more examples and the complete documentation.

1. Because you are trying to remove internal refs,
   you'll later rely on `commit-map` files produced by each run
   to tell you which internal refs to remove.
   Every `git filter-repo` run creates a new `commit-map`,
   and overwrites the `commit-map` from the previous run.
   You can use the following command to back up each `commit-map` file:

   ```shell
   cp filter-repo/commit-map ./_filter_repo_commit_map_$(date +%s)
   ```

   Repeat this step and all following steps (including the [repository cleanup](#repository-cleanup) step)
   every time you run any `git filter-repo` command.

1. Force push your changes to overwrite all branches on GitLab:

   ```shell
   git push origin --force 'refs/heads/*'
   ```

   [Protected branches](../protected_branches.md) cause this to fail. To proceed, you must
   remove branch protection, push, and then re-enable protected branches.

1. To remove large files from tagged releases, force push your changes to all tags on GitLab:

   ```shell
   git push origin --force 'refs/tags/*'
   ```

   [Protected tags](../protected_tags.md) cause this to fail. To proceed, you must remove tag
   protection, push, and then re-enable protected tags.

1. To prevent dead links to commits that no longer exist, push the `refs/replace` refs created by `git filter-repo`:

   ```shell
   git push origin --force 'refs/replace/*'
   ```

   Refer to the Git [`replace`](https://git-scm.com/book/en/v2/Git-Tools-Replace) documentation for information on how this works.

1. Wait at least 30 minutes, because the repository cleanup process only processes objects older than 30 minutes.
1. Run [repository cleanup](#repository-cleanup).
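
The `commit-map` backup step above can be wrapped in a small script so that no run is lost. This is a minimal sketch under stated assumptions: `backup_commit_map` and `count_rewritten_commits` are hypothetical helper names, and the second helper assumes the `commit-map` file consists of a header line followed by one `<old-id> <new-id>` pair per line.

```shell
# Hypothetical helper: copy a commit-map to a timestamped backup file.
# $1 is the commit-map path, $2 is the backup directory.
backup_commit_map() {
  src=$1
  dest_dir=$2
  dest="$dest_dir/_filter_repo_commit_map_$(date +%s)"
  # Refuse to overwrite a backup made within the same second.
  if [ -e "$dest" ]; then
    echo "backup already exists: $dest" >&2
    return 1
  fi
  cp "$src" "$dest" && echo "$dest"
}

# Hypothetical helper: count commits whose ID changed in a commit-map.
# Assumes a header line followed by "<old-id> <new-id>" pairs.
count_rewritten_commits() {
  awk 'NR > 1 && $1 != $2 { n++ } END { print n + 0 }' "$1"
}

# Example usage after each `git filter-repo` run:
#   backup_commit_map /path/to/commit-map .
#   count_rewritten_commits /path/to/commit-map
```

A count of zero suggests that the last run did not rewrite any commits, in which case there is probably nothing new to upload for that run.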

## Repository cleanup

> [Introduced](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/19376) in GitLab 11.6.

Repository cleanup allows you to upload a text file of objects and GitLab removes internal Git
references to these objects. You can use
[`git filter-repo`](https://github.com/newren/git-filter-repo) to produce a list of objects (in a
`commit-map` file) that can be used with repository cleanup.

In GitLab 13.6 and later, safely cleaning the repository
[requires it to be made read-only](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/45058)
for the duration of the operation. This happens automatically, but submitting the cleanup request
fails if any writes are ongoing, so cancel any outstanding `git push`
operations before continuing.

To clean up a repository:

1. Go to the project for the repository.
1. Go to **Settings > Repository**.
1. Upload a list of objects, for example a `commit-map` file created by `git filter-repo`, which is located in the
   `filter-repo` directory.

   If your `commit-map` file is larger than about 250 KB or 3000 lines, you can split the file and upload it piece by piece:

   ```shell
   split -l 3000 filter-repo/commit-map filter-repo/commit-map-
   ```

1. Select **Start cleanup**.

This:

- Removes any internal Git references to old commits.
- Runs `git gc --prune=30.minutes.ago` against the repository to remove unreferenced objects. Repacking your repository temporarily
  causes the size of your repository to increase significantly, because the old pack files are not removed until the
  new pack files have been created.
- Unlinks any unused LFS objects attached to your project, freeing up storage space.
- Recalculates the size of your repository on disk.

GitLab sends an email notification with the recalculated repository size after the cleanup has completed.

If the repository size does not decrease, this may be caused by loose objects
being kept around because they were referenced in a Git operation that happened
in the last 30 minutes. Try re-running these steps after the repository has been
dormant for at least 30 minutes.
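
The splitting step in the procedure above can be made conditional, so that small files are left alone. The following is a sketch only: `split_commit_map` is a hypothetical helper, and it checks the 3000-line threshold but not the approximate 250 KB size threshold.

```shell
# Hypothetical helper: split a commit-map into chunks of at most 3000 lines,
# but only when the file exceeds that length. Prints the files to upload.
split_commit_map() {
  map=$1
  # tr strips the padding that some wc implementations add.
  lines=$(wc -l < "$map" | tr -d ' ')
  if [ "$lines" -le 3000 ]; then
    echo "$map"
  else
    split -l 3000 "$map" "$map-"
    ls "$map"-*
  fi
}

# Example usage:
#   split_commit_map filter-repo/commit-map
```

Each file name printed by the helper is one upload in **Settings > Repository**.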

When using repository cleanup, note:

- Project statistics are cached. You may need to wait 5-10 minutes to see a reduction in storage utilization.
- The cleanup prunes loose objects older than 30 minutes. This means objects added or referenced in the last 30 minutes
  are not removed immediately. If you have access to the
  [Gitaly](../../../administration/gitaly/index.md) server, you may skip that delay and run `git gc --prune=now` to
  prune all loose objects immediately.
- This process removes some copies of the rewritten commits from the GitLab cache and database,
  but there are still numerous gaps in coverage and some of the copies may persist indefinitely.
  [Clearing the instance cache](../../../administration/raketasks/maintenance.md#clear-redis-cache)
  may help to remove some of them, but it should not be depended on for security purposes!

## Storage limits

Repository size limits:

- Can [be set by an administrator](../../admin_area/settings/account_and_limit_settings.md#account-and-limit-settings) on self-managed instances.
- Are [set for GitLab.com](../../gitlab_com/index.md#account-and-limit-settings).

When a project has reached its size limit, you cannot:

- Push to the project.
- Create a new merge request.
- Merge existing merge requests.
- Upload LFS objects.

You can still:

- Create new issues.
- Clone the project.

If you exceed the repository size limit, you might try to:

1. Remove some data.
1. Make a new commit.
1. Push back to the repository.

If these actions are insufficient, you can also:

- Move some blobs to LFS.
- Remove some old dependency updates from history.

Unfortunately, removing data in a new commit doesn't work. Deleting files in a commit doesn't actually reduce the
size of the repository, because the earlier commits and blobs still exist. Instead, you must rewrite
history. We recommend the open-source, community-maintained tool
[`git filter-repo`](https://github.com/newren/git-filter-repo).
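
To check locally whether a change actually made the repository smaller, you can compare the object storage reported by `git count-objects -v` before and after. This is a minimal sketch: `repo_object_kib` is a hypothetical helper, and the local figure is only an approximation of the size GitLab reports for the project.

```shell
# Hypothetical helper: sum the loose ("size") and packed ("size-pack")
# object sizes, in KiB, from `git count-objects -v` output on standard input.
repo_object_kib() {
  awk -F': ' '$1 == "size" || $1 == "size-pack" { total += $2 }
              END { print total + 0 }'
}

# Example usage inside a repository:
#   git count-objects -v | repo_object_kib
```

If the total is about the same after you delete a file in a new commit, the blob is still stored in history.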

NOTE:
Until `git gc` runs on the GitLab side, the "removed" commits and blobs still exist. You also
must be able to push the rewritten history to GitLab, which may be impossible if you've already
exceeded the maximum size limit.

To lift these restrictions, the administrator of the self-managed GitLab instance must
increase the limit on the particular project that exceeded it. Therefore, it's always better to
proactively stay underneath the limit. If you hit the limit, and can't have it temporarily
increased, your only option is to:

1. Prune all unneeded data locally.
1. Create a new project on GitLab and start using that instead.

## Troubleshooting

### Incorrect repository statistics shown in the GUI

If the displayed size or commit count is different from the exported `.tar.gz` or local repository,
you can ask a GitLab administrator to force an update.

Using [the Rails console](../../../administration/operations/rails_console.md#starting-a-rails-console-session):

```ruby
p = Project.find_by_full_path('<namespace>/<project>')
pp p.statistics
p.statistics.refresh!
pp p.statistics
# compare with earlier values

# An alternate method to clear project statistics
p.repository.expire_all_method_caches
UpdateProjectStatisticsWorker.perform_async(p.id, ["commit_count","repository_size","storage_size","lfs_objects_size"])

# check the total artifact storage space separately
builds_with_artifacts = p.builds.with_downloadable_artifacts.all

artifact_storage = 0
builds_with_artifacts.find_each do |build|
  artifact_storage += build.artifacts_size
end

puts "#{artifact_storage} bytes"
```