# Repository Storage Types

> [Introduced][ce-28283] in GitLab 10.0.

## Legacy Storage

Legacy Storage is the storage behavior prior to version 10.0. For historical
reasons, GitLab replicated the same mapping structure from the project URLs:

- Project's repository: `#{namespace}/#{project_name}.git`
- Project's wiki: `#{namespace}/#{project_name}.wiki.git`

This structure made it simple to migrate from existing solutions to GitLab and
easy for Administrators to find where the repository is stored.

On the other hand, this structure has some drawbacks:

The storage location concentrates a large number of top-level namespaces. The
impact can be reduced by the introduction of [multiple storage
paths][storage-paths].

Because backups are a snapshot of the same URL mapping, if you try to recover a
very old backup, you need to verify whether any project has taken the place of
an old removed or renamed project sharing the same URL. This means that
`mygroup/myproject` from your backup may not be the same original project that
is at that same URL today.

Any change in the URL needs to be reflected on disk (when groups, users, or
projects are renamed). This can add a lot of load in big installations,
especially if using any type of network-based filesystem.
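
To illustrate why a rename touches the disk under Legacy Storage, here is a
minimal sketch of the mapping described above. The helper name
`legacy_disk_path` and the example namespace and project names are hypothetical
and not part of GitLab.

```ruby
# Hypothetical helper: Legacy Storage derives the on-disk path from the URL parts.
def legacy_disk_path(namespace, project_name)
  "#{namespace}/#{project_name}.git"
end

before_rename = legacy_disk_path('mygroup', 'myproject')
after_rename  = legacy_disk_path('mygroup', 'renamed-project')

# The two paths differ, so the repository folder would have to be moved on disk.
puts before_rename
puts after_rename
puts before_rename != after_rename
```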

For GitLab Geo in particular: Geo does work with legacy storage, but in some
edge cases due to race conditions it can lead to errors when a project is
renamed multiple times in short succession, or a project is deleted and
recreated under the same name very quickly. We expect these race events to be
rare, and we have not observed a race condition side-effect happening yet.

This pattern also exists in other objects stored in GitLab, like issue
Attachments, GitLab Pages artifacts, Docker Containers for the integrated
Registry, etc.

## Hashed Storage

Hashed Storage is the new storage behavior we rolled out with 10.0. Instead
of coupling the project URL to the folder structure where the repository is
stored on disk, we couple the folder structure to a hash based on the project's
ID. This makes the folder structure immutable, and therefore eliminates any
requirement to synchronize state from URLs to disk structure. This means that
renaming a group, user, or project costs only the database transaction, and
takes effect immediately.

The hash also helps to spread the repositories more evenly on the disk, so the
top-level directory contains fewer folders than the total number of top-level
namespaces.

The hash format is based on the hexadecimal representation of SHA256:
`SHA256(project.id)`. The top-level folder uses the first 2 characters, followed
by another folder with the next 2 characters. They are both stored in a special
`@hashed` folder, to be able to co-exist with existing Legacy Storage projects:

```ruby
# Project's repository:
"@hashed/#{hash[0..1]}/#{hash[2..3]}/#{hash}.git"

# Wiki's repository:
"@hashed/#{hash[0..1]}/#{hash[2..3]}/#{hash}.wiki.git"
```
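
As a minimal sketch of the layout above, the following computes both paths with
Ruby's standard `Digest` library. It assumes the hash input is the project ID
rendered as a decimal string; the helper name `hashed_disk_path` is hypothetical
and not part of GitLab.

```ruby
require 'digest'

# Hypothetical helper: builds the Hashed Storage base path for a project ID.
# Assumption: the ID is hashed as its decimal string representation.
def hashed_disk_path(project_id)
  hash = Digest::SHA256.hexdigest(project_id.to_s)
  "@hashed/#{hash[0..1]}/#{hash[2..3]}/#{hash}"
end

base = hashed_disk_path(2)
repository_path = "#{base}.git"
wiki_path = "#{base}.wiki.git"

puts repository_path
puts wiki_path
```

Because the top-level folder name is two hexadecimal characters, `@hashed` can
contain at most 256 top-level folders, regardless of how many namespaces exist.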

### How to migrate to Hashed Storage

In GitLab, go to **Admin > Settings**, find the **Repository Storage** section
and select "_Use hashed storage paths for newly created and renamed projects_".

To migrate your existing projects to the new storage type, check the specific
[rake tasks].

[ce-28283]: https://gitlab.com/gitlab-org/gitlab-ce/issues/28283
[rake tasks]: raketasks/storage.md#migrate-existing-projects-to-hashed-storage
[storage-paths]: repository_storage_paths.md

#### Rollback

There is no automated rollback implemented. Below are the steps required to roll
back from each storage migration.

The rollback has to be performed in the reverse order. To get into the "Legacy" state,
you need to roll back Attachments first, then Project.

Also note that if Geo is enabled, an event is generated after the migration is triggered
to replicate the operation on any Secondary node. That means the on-disk changes also
need to be performed on these nodes. Database changes will propagate without issues.

You must make sure the migration event has already been processed; otherwise, it may migrate
the files back to the Hashed state again.

##### Attachments

To roll back a single Attachment migration, rename the `aa/bb/abcdef1234567890...` folder back to `namespace/project`.

Both folder names can be generated with `FileUploader.absolute_base_dir(project)`; you
just need to switch the `project`'s storage version back to the previous one.

```ruby
project.storage_version
# => 2

FileUploader.absolute_base_dir(project)
# => "/opt/gitlab/embedded/service/gitlab-rails/public/uploads/@hashed/d4/73/d4735e3a265e16eee03f59718b9b5d03019c07d8b6c51f90da3a666eec13ab35"

project.storage_version = 1

FileUploader.absolute_base_dir(project)
# => "/opt/gitlab/embedded/service/gitlab-rails/public/uploads/gitlab/gitlab-shell-renamed"
```

##### Project

To roll back a single Project migration, move `@hashed/aa/bb/aabbcdef1234567890abcdef.git` and `@hashed/aa/bb/aabbcdef1234567890abcdef.wiki.git`
back to `namespace/project.git` and `namespace/project.wiki.git` respectively, and switch the `project`'s storage version back to `null`.

#### Hashed object pools

For deduplication of public forks and their parent repository, objects are pooled
in an object pool. An object pool is a third repository where shared objects
are stored.

```ruby
# object pool paths
"@pools/#{hash[0..1]}/#{hash[2..3]}/#{hash}.git"
```

The object pool feature is behind the `object_pools` feature flag, and can be
enabled for individual projects by executing
`Feature.enable(:object_pools, Project.find(<id>))`. Note that the project has to
be on hashed storage, should not be a fork itself, and hashed storage should be
enabled for all new projects.

### Hashed Storage coverage

We are incrementally moving every storable object in GitLab to the Hashed
Storage pattern. You can check the current coverage status below (and also see
the [issue](https://gitlab.com/gitlab-com/infrastructure/issues/2821)).

Note that things stored in an S3 compatible endpoint will not have the downsides
mentioned earlier, if they are not prefixed with `#{namespace}/#{project_name}`,
which is true for CI Cache and LFS Objects.

| Storable Object | Legacy Storage | Hashed Storage | S3 Compatible | GitLab Version |
| --------------- | -------------- | -------------- | ------------- | -------------- |
| Repository      | Yes            | Yes            | -             | 10.0           |
| Attachments     | Yes            | Yes            | -             | 10.2           |
| Avatars         | Yes            | No             | -             | -              |
| Pages           | Yes            | No             | -             | -              |
| Docker Registry | Yes            | No             | -             | -              |
| CI Build Logs   | No             | No             | -             | -              |
| CI Artifacts    | No             | No             | Yes           | 9.4 / 10.6     |
| CI Cache        | No             | No             | Yes           | -              |
| LFS Objects     | Yes            | Similar        | Yes           | 10.0 / 10.7    |

#### Implementation Details

##### Avatars

Each file is stored in a folder with its `id` from the database. The filename is always `avatar.png` for user avatars.
When an avatar is replaced, the `Upload` model is destroyed and a new one takes its place with a different `id`.

##### CI Artifacts

CI Artifacts are S3 compatible since **9.4** (GitLab Premium), and available in GitLab Core since **10.6**.

##### LFS Objects

LFS Objects implement a similar storage pattern using two levels of 2-character folders, following Git's own implementation:

```ruby
"shared/lfs-objects/#{oid[0..1]}/#{oid[2..3]}/#{oid[4..-1]}"

# Based on object `oid`: `8909029eb962194cfb326259411b22ae3f4a814b5be4f80651735aeef9f3229c`, path will be:
"shared/lfs-objects/89/09/029eb962194cfb326259411b22ae3f4a814b5be4f80651735aeef9f3229c"
```

They are also S3 compatible since **10.0** (GitLab Premium), and available in GitLab Core since **10.7**.
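
The LFS path layout shown above can be sketched as a small runnable helper. The
method name `lfs_object_path` is hypothetical and not part of GitLab; the `oid`
is the one used in the example above.

```ruby
# Hypothetical helper: splits an LFS object ID into the two-level folder layout.
def lfs_object_path(oid)
  "shared/lfs-objects/#{oid[0..1]}/#{oid[2..3]}/#{oid[4..-1]}"
end

oid = '8909029eb962194cfb326259411b22ae3f4a814b5be4f80651735aeef9f3229c'
path = lfs_object_path(oid)

puts path
```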