---
type: reference, dev
stage: none
group: Verify
---

# Contribute to Verify stage codebase

## What are we working on in Verify?

Verify stage is working on a comprehensive Continuous Integration platform
integrated into the GitLab product. Our goal is to empower our users to make
great technical and business decisions by delivering a fast, reliable, secure
platform that verifies the assumptions our users make and checks them against
the criteria defined in CI/CD configuration. These checks could be unit tests,
end-to-end tests, benchmarking, performance validation, code coverage enforcement, and so on.

Feedback delivered by GitLab CI/CD makes it possible for our users to make
well-informed decisions about the technological and business choices they need
to make to succeed. Why is Continuous Integration a mission-critical product?

GitLab CI/CD is our platform to deliver feedback to our users and customers.

They contribute their continuous integration configuration files
(`.gitlab-ci.yml`) to describe the questions they want answered. Each
time someone pushes a commit or triggers a pipeline, we need to find answers to
the important questions that have been asked in the CI/CD configuration.

Failing to answer these questions, or even worse, providing false answers,
might result in a user making a wrong decision. Such wrong decisions can have
very severe consequences.

## Core principles of our CI/CD platform

Data produced by the platform should be:

1. Accurate.
1. Durable.
1. Accessible.

The platform itself should be:

1. Reliable.
1. Secure.
1. Deterministic.
1. Trustworthy.
1. Fast.
1. Simple.

Since the inception of GitLab CI/CD, we have lived by these principles,
and they serve us and our users well. Some examples of these principles are that:

- The feedback delivered by GitLab CI/CD and the data produced by the platform should be accurate.
  If a job fails and we notify a user that it was successful, it can have severe negative consequences.
- Feedback needs to be available when a user needs it, and data cannot disappear unexpectedly when engineers need it.
- None of this matters if the platform is not secure and we are leaking credentials or secrets.
- When a user provides a set of preconditions in the form of CI/CD configuration, the result should be deterministic each time a pipeline runs, because otherwise the platform might not be trustworthy.
- If it is fast, simple to use, and has a great UX, it will serve our users well.

## Building things in Verify

### Measure before you optimize, and make data-informed decisions

It is very difficult to optimize something that you cannot measure. How would you
know if you succeeded, or how significant the success was? If you are working on
a performance or reliability improvement, make sure that you measure things before
you optimize them.

The best way to measure things is to add a Prometheus metric. Counters, gauges, and
histograms are great ways to quickly get approximate results. Unfortunately, this
is not the best way to measure tail latency, because Prometheus metrics, especially
histograms, are usually approximations.

If you have to measure tail latency, like how slow something could be or how
large a request payload might be, consider adding custom application logs and
always use structured logging.
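
As an illustration only, here is a minimal sketch of how a metric and a
structured log entry might be combined in the GitLab Rails codebase. The class,
metric name, buckets, and log fields are invented for this example, and the
`Gitlab::Metrics.histogram` and `Gitlab::AppJsonLogger` helpers are assumed to
be available in the area you are instrumenting:

```ruby
# Hypothetical sketch, not an established convention: instrument an expensive
# operation with an approximate Prometheus histogram and an exact structured
# log entry.
class ArtifactsProcessingService
  def execute(build)
    started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)

    result = do_expensive_work(build)

    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at

    # Histogram buckets give a cheap, approximate latency distribution.
    self.class.processing_duration_histogram.observe({}, duration)

    # A structured log entry keeps the exact value, so tail latency can be
    # analyzed later from the logs.
    Gitlab::AppJsonLogger.info(
      class: self.class.name,
      project_id: build.project_id,
      artifacts_processing_duration_s: duration.round(3)
    )

    result
  end

  def self.processing_duration_histogram
    # Metric name and buckets are illustrative only.
    @histogram ||= Gitlab::Metrics.histogram(
      :ci_artifacts_processing_duration_seconds,
      'Duration of CI artifacts processing',
      {},
      [0.1, 0.5, 1, 5, 10, 60]
    )
  end

  private

  def do_expensive_work(build)
    # Placeholder for the code path being measured.
  end
end
```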

It's useful to use profiling and flamegraphs to understand what the code execution
path truly looks like!

### Strive for simple solutions, avoid clever solutions

It is sometimes tempting to use a clever solution to deliver something more
quickly. We want to avoid shipping clever code, because it is usually more
difficult to understand and maintain in the long term. Instead, we want to
focus on boring solutions that make it easier to evolve the codebase and keep the
contribution barrier low. We want to find solutions that are as simple as
possible.

### Do not confuse boring solutions with easy solutions

Boring solutions are sometimes confused with easy solutions. Very often the
opposite is true. An easy solution might not be simple. For example, a complex
new library can be included to add a very small piece of functionality that
otherwise could be implemented quickly: it is easier to include the library than
to build the functionality yourself, but it can bring a lot of complexity into the product.

On the other hand, it is also possible to over-engineer a solution when a simple,
well-tested, and well-maintained library is available. In that case using the
library might make sense. We recognize that we are constantly balancing simple
and easy solutions, and that finding the right balance is important.

### "Simple" is not mutually exclusive with "flexible"

Building simple things does not mean that more advanced and flexible solutions
will not be available. A good example here is the expanding complexity of
`.gitlab-ci.yml` configuration. For example, you can use a simple
method to define an environment name:

```yaml
deploy:
  environment: production
  script: cap deploy
```

But the `environment` keyword can also be expanded into another level of
configuration that offers more flexibility:

```yaml
deploy:
  environment:
    name: review/$CI_COMMIT_REF_SLUG
    url: https://prod.example.com
  script: cap deploy
```

This kind of approach shields new users from the complexities of the platform,
but still allows them to go deeper if they need to. This approach can be
applied to many other technical implementations.

### Make things observable

GitLab is a DevOps platform. We popularize DevOps because it helps companies
be more efficient and achieve better results. One important component of
DevOps culture is taking ownership of the features and code that you are
building. It is very difficult to do that when you don’t know how your features
perform and behave in the production environment.

This is why we want to make our features and code observable. They
should be written in a way that allows the author to understand how well or how
poorly a feature or code path behaves in the production environment. We usually
accomplish that by introducing the proper mix of Prometheus metrics and
application loggers.

**TODO** document when to use Prometheus metrics, when to use loggers. Write a
few sentences about histograms and counters. Write a few sentences highlighting
importance of metrics when doing incremental rollouts.
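
For instance, a counter can answer "how often does this code path succeed or
fail?" on a dashboard, while a log entry preserves the context of each
individual failure. The class, metric name, and log fields below are invented
for this sketch, assuming the `Gitlab::Metrics.counter` and
`Gitlab::AppJsonLogger` helpers:

```ruby
# Hypothetical sketch: count outcomes with a Prometheus counter and keep
# per-event context in the application logs.
module Ci
  class PipelineProcessingObserver
    def self.processing_counter
      @counter ||= Gitlab::Metrics.counter(
        :ci_pipeline_processing_events_total,
        'Count of pipeline processing attempts by status'
      )
    end

    def self.track(pipeline, status, error: nil)
      # Counters are cheap and ideal for dashboards and rollout monitoring.
      processing_counter.increment({ status: status.to_s })

      # Loggers capture the detail needed to debug a specific occurrence.
      return unless error

      Gitlab::AppJsonLogger.warn(
        class: name,
        pipeline_id: pipeline.id,
        status: status,
        error_message: error.message
      )
    end
  end
end
```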

### Protect customer data

Making the data produced by our CI/CD platform durable is important. We
recognize that data generated in CI/CD by users and customers is important and
we must protect it. This data matters not only because it can contain important
information: we also have compliance and auditing responsibilities.

Therefore we must take extra care when we are writing migrations
that permanently remove data from our database, or when we define
new retention policies.

As a general rule, when you are writing code that is supposed to remove
data from the database, file system, or object storage, you should get an extra pair
of eyes on your changes. When you are defining a new retention policy, you
should double-check with PMs and EMs.

### Get your changes reviewed

When your merge request is ready for review, you must assign
reviewers and then maintainers. Depending on the complexity of a change, you
might want to involve the people that know the most about the codebase area you are
changing. We do have many domain experts in Verify, and it is absolutely acceptable to
ask them to review your code when you are not certain whether a reviewer or
maintainer assigned by the Reviewer Roulette has enough context about the
change.

The Reviewer Roulette offers useful suggestions, but because assigning the right
reviewers is important, you should not rely on it blindly every time. It might
not make sense to assign someone who knows nothing about the area you are
updating, because their feedback might be limited to code style and syntax.
Depending on the complexity and impact of a change, assigning the right people
to review your changes might be very important.

If you don’t know who to assign, consult `git blame` or ask in the `#verify`
Slack channel (GitLab team members only).

### Incremental rollouts

After your merge request is merged by a maintainer, it is time to release it to
users and the wider community. We usually do this with feature flags.
While not every merge request needs a feature flag, most merge
requests in Verify should have [feature flags](https://about.gitlab.com/handbook/product-development-flow/feature-flag-lifecycle/#when-to-use-feature-flags).

If you follow the advice on this page, you probably already have a
few metrics and perhaps a few loggers in place that make your new code observable
in the production environment. You can now use these metrics to incrementally
roll out your changes!

A typical scenario involves enabling the feature for a few internal projects
while observing your metrics or loggers. Be aware that there might be a
small delay involved in ingesting logs into Elasticsearch or Kibana. After you
confirm the feature works well with internal projects, you can start an
incremental rollout for other projects.

Avoid using "percent of time" incremental rollouts. These are error prone,
especially when you are checking feature flags in a few places in the codebase
and you have not memoized the result of a check in a single place.
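
A hedged sketch of what such memoization might look like, assuming the
`Feature.enabled?` and `Gitlab::Utils::StrongMemoize` helpers and an invented
flag name (`:ci_new_processing_path`) and service class:

```ruby
# Hypothetical sketch: check the feature flag once and memoize the result,
# so every caller within this object sees a consistent answer.
class Ci::ProcessPipelineService
  include Gitlab::Utils::StrongMemoize

  def initialize(pipeline)
    @pipeline = pipeline
  end

  def execute
    if new_processing_path_enabled?
      process_with_new_path
    else
      process_with_legacy_path
    end
  end

  private

  def new_processing_path_enabled?
    strong_memoize(:new_processing_path_enabled) do
      # Checked once per service instance; an actor-based flag (per project)
      # is more predictable than a "percent of time" rollout.
      Feature.enabled?(:ci_new_processing_path, @pipeline.project)
    end
  end

  def process_with_new_path
    # New code path, guarded by the flag.
  end

  def process_with_legacy_path
    # Existing behavior.
  end
end
```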

### Do not cause our Universe to implode

During one of the first GitLab Contributes events, we had a discussion about the importance
of keeping CI/CD pipeline, stage, and job statuses accurate. We considered a hypothetical
scenario relating to software being built by one of our [early customers](https://about.gitlab.com/blog/2016/11/23/gitlab-adoption-growing-at-cern/):

> What happens if software deployed to the [Large Hadron Collider (LHC)](https://en.wikipedia.org/wiki/Large_Hadron_Collider)
> breaks because of a bug in GitLab CI/CD that showed that a pipeline
> passed, but this data was not accurate and the software deployed was actually
> invalid? A problem like this could cause the LHC to malfunction, which
> could generate a new particle that would then cause the universe to implode.

That would be quite an undesirable outcome of a small bug in GitLab CI/CD status
processing. Please take extra care when you are working on CI/CD statuses;
we don’t want to implode our Universe!

This is an extreme and unlikely scenario, but presenting data that is not accurate
can potentially cause a myriad of problems through the
[butterfly effect](https://en.wikipedia.org/wiki/Butterfly_effect).
There are much more likely scenarios that
can have disastrous consequences. GitLab CI/CD is being used by companies
building medical, aviation, and automotive software. Continuous Integration is
a mission critical part of software engineering.

When you are working on a subsystem for pipeline processing and transitioning
CI/CD statuses, request an additional review from a domain expert and hold
others accountable for doing the same.