1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
|
---
stage: none
group: unassigned
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---
# Working with diffs
We rely on different sources to present diffs. These include:
- Gitaly service
- Database (through `merge_request_diff_files`)
- Redis (cached highlighted diffs)
## Deep Dive
In January 2019, Oswaldo Ferreira hosted a Deep Dive (GitLab team members only:
`https://gitlab.com/gitlab-org/create-stage/issues/1`) on GitLab Diffs and Commenting on Diffs
functionality to share his domain specific knowledge with anyone who may work in this part of the
codebase in the future. You can find the [recording on YouTube](https://www.youtube.com/watch?v=K6G3gMcFyek),
and the slides on [Google Slides](https://docs.google.com/presentation/d/1bGutFH2AT3bxOPZuLMGl1ANWHqFnrxwQwjiwAZkF-TU/edit)
and in [PDF](https://gitlab.com/gitlab-org/create-stage/uploads/b5ad2f336e0afcfe0f99db0af0ccc71a/).
Everything covered in this deep dive was accurate as of GitLab 11.7, and while specific details may
have changed since then, it should still serve as a good introduction.
## Architecture overview
### Merge request diffs
When refreshing a Merge Request (pushing to a source branch, force-pushing to target branch, or if the target branch now contains any commits from the MR)
we fetch the comparison information using `Gitlab::Git::Compare`, which fetches `base` and `head` data using Gitaly and diff between them through
`Gitlab::Git::Diff.between`.
The diffs fetching process _limits_ single file diff sizes and the overall size of the whole diff through a series of constant values. Raw diff files are
then persisted on `merge_request_diff_files` table.
Even though diffs larger than 10% of the value of `ApplicationSettings#diff_max_patch_bytes` are collapsed,
we still keep them on PostgreSQL. However, diff files larger than defined _safety limits_
(see the [Diff limits section](#diff-limits)) are _not_ persisted in the database.
In order to present diffs information on the Merge Request diffs page, we:
1. Fetch all diff files from database `merge_request_diff_files`
1. Fetch the _old_ and _new_ file blobs in batch to:
- Highlight old and new file content
- Know which viewer it should use for each file (text, image, deleted, etc)
- Know if the file content changed
- Know if it was stored externally
- Know if it had storage errors
1. If the diff file is cacheable (text-based), it's cached on Redis
using `Gitlab::Diff::FileCollection::MergeRequestDiff`
### Note diffs
When commenting on a diff (any comparison), we persist a truncated diff version
on `NoteDiffFile` (which is associated with the actual `DiffNote`). So instead
of hitting the repository every time we need the diff of the file, we:
1. Check whether we have the `NoteDiffFile#diff` persisted and use it
1. Otherwise, if it's a current MR revision, use the persisted
`MergeRequestDiffFile#diff`
1. In the last scenario, go the repository and fetch the diff
## Diff limits
As explained above, we limit single diff files and the size of the whole diff. There are scenarios where we collapse the diff file,
and cases where the diff file is not presented at all, and the user is guided to the Blob view.
### Diff collection limits
Limits that act onto all diff files collection. Files number, lines number and files size are considered.
```ruby
Gitlab::Git::DiffCollection.collection_limits[:safe_max_files] = Gitlab::Git::DiffCollection::DEFAULT_LIMITS[:max_files] = 100
```
File diffs are collapsed (but are expandable) if 100 files have already been rendered.
```ruby
Gitlab::Git::DiffCollection.collection_limits[:safe_max_lines] = Gitlab::Git::DiffCollection::DEFAULT_LIMITS[:max_lines] = 5000
```
File diffs are collapsed (but be expandable) if 5000 lines have already been rendered.
```ruby
Gitlab::Git::DiffCollection.collection_limits[:safe_max_bytes] = Gitlab::Git::DiffCollection.collection_limits[:safe_max_files] * 5.kilobytes = 500.kilobytes
```
File diffs are collapsed (but be expandable) if 500 kilobytes have already been rendered.
```ruby
Gitlab::Git::DiffCollection.collection_limits[:max_files] = Commit::DIFF_HARD_LIMIT_FILES = 1000
```
No more files are rendered at all if 1000 files have already been rendered.
```ruby
Gitlab::Git::DiffCollection.collection_limits[:max_lines] = Commit::DIFF_HARD_LIMIT_LINES = 50000
```
No more files are rendered at all if 50,000 lines have already been rendered.
```ruby
Gitlab::Git::DiffCollection.collection_limits[:max_bytes] = Gitlab::Git::DiffCollection.collection_limits[:max_files] * 5.kilobytes = 5000.kilobytes
```
No more files are rendered at all if 5 megabytes have already been rendered.
All collection limit parameters are sent and applied on Gitaly. That is, after the limit is surpassed,
Gitaly only returns the safe amount of data to be persisted on `merge_request_diff_files`.
### Individual diff file limits
Limits that act onto each diff file of a collection. Files number, lines number and files size are considered.
#### Expandable patches (collapsed)
Diff patches are collapsed when surpassing 10% of the value set in `ApplicationSettings#diff_max_patch_bytes`.
That is, it's equivalent to 10kb if the maximum allowed value is 100kb.
The diff is persisted and expandable if the patch size doesn't
surpass `ApplicationSettings#diff_max_patch_bytes`.
Although this nomenclature (Collapsing) is also used on Gitaly, this limit is only used on GitLab (hardcoded - not sent to Gitaly).
Gitaly only returns `Diff.Collapsed` (RPC) when surpassing collection limits.
#### Not expandable patches (too large)
The patch not be rendered if it's larger than `ApplicationSettings#diff_max_patch_bytes`.
Users see a `This source diff could not be displayed because it is too large` message.
```ruby
Commit::DIFF_SAFE_LINES = Gitlab::Git::DiffCollection::DEFAULT_LIMITS[:max_lines] = 5000
```
File diff is suppressed (technically different from collapsed, but behaves the same, and is expandable) if it has more than 5000 lines.
This limit is hardcoded and only applied on GitLab.
## Viewers
Diff Viewers, which can be found on `models/diff_viewer/*` are classes used to map metadata about each type of Diff File. It has information
whether it's a binary, which partial should be used to render it or which File extensions this class accounts for.
`DiffViewer::Base` validates _blobs_ (old and new versions) content, extension and file type in order to check if it can be rendered.
## Merge request diffs against the `HEAD` of the target branch
Historically, merge request diffs have been calculated by `git diff target...source` which compares the
`HEAD` of the source branch with the merge base (or a common ancestor) of the target branch and the source's.
This solution works well until the target branch starts containing some of the
changes introduced by the source branch: Consider the following case, in which the source branch
is `feature_a` and the target is `master`:
1. Checkout a new branch `feature_a` from `master` and remove `file_a` and `file_b` in it.
1. Add a commit that removes `file_a` to `master`.
The merge request diff still contains the `file_a` removal while the actual diff compared to
`master`'s `HEAD` has only the `file_b` removal. The diff with such redundant
changes is harder to review.
In order to display an up-to-date diff, in GitLab 12.9 we
[introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/27008) merge request
diffs compared against `HEAD` of the target branch: the
target branch is artificially merged into the source branch, then the resulting
merge ref is compared to the source branch in order to calculate an accurate
diff.
Until we complete the epics ["use merge refs for diffs"](https://gitlab.com/groups/gitlab-org/-/epics/854)
and ["merge conflicts in diffs"](https://gitlab.com/groups/gitlab-org/-/epics/4893),
both options `master (base)` and `master (HEAD)` are available to be displayed in merge requests:
![Merge ref head options](img/merge_ref_head_options_v13_6.png)
The `master (HEAD)` option is meant to replace `master (base)` in the future.
In order to support comments for both options, diff note positions are stored for
both `master (base)` and `master (HEAD)` versions ([introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/198457) in 12.10).
The position for `master (base)` version is stored in `Note#position` and
`Note#original_position` columns, for `master (HEAD)` version `DiffNotePosition`
has been introduced.
One of the key challenges to deal with when working on merge ref diffs are merge
conflicts. If the target and source branch contains a merge conflict, the branches
cannot be automatically merged. The [recording on
YouTube](https://www.youtube.com/watch?v=GFXIFA4ZuZw&feature=youtu.be&ab_channel=GitLabUnfiltered)
is a quick introduction to the problem and the motivation behind the [epic](https://gitlab.com/groups/gitlab-org/-/epics/854).
In 13.5 a solution for both-modified merge
conflict has been
[introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/232484). However,
there are more classes of merge conflicts that are to be
[addressed](https://gitlab.com/groups/gitlab-org/-/epics/4893) in the future.
|