---
stage: Platforms
group: Scalability
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---

# Observability for stage groups

Observability is about bringing visibility into a system to see and
understand the state of each component, with context, to support
performance tuning and debugging. To run a SaaS platform at scale, a
rich and detailed observability platform is needed.

To make information available to [stage groups](https://about.gitlab.com/handbook/product/categories/#hierarchy),
we aggregate metrics by feature category and then show
this information on [dashboards](dashboards/index.md) tailored to the groups. Only metrics
for the features built by a group are visible on that group's
dashboards.

With a filtered view, groups can discover bugs and performance regressions that could otherwise
be missed when viewing aggregated data.

For more specific information on dashboards, see:

- [Dashboards](dashboards/index.md): a general overview of where to find dashboards
  and how to use them.
- [Stage group dashboard](dashboards/stage_group_dashboard.md): how to use and customize the stage group dashboard.
- [Error budget detail](dashboards/error_budget_detail.md): how to explore error budget over time.

## Error budget

The error budget is calculated from the same [Service Level Indicators](https://en.wikipedia.org/wiki/Service_level_indicator) (SLIs)
that we use to monitor GitLab.com. The 28-day availability number for a
stage group is comparable to the
[monthly availability](https://about.gitlab.com/handbook/engineering/infrastructure/performance-indicators/#gitlabcom-availability)
we calculate for GitLab.com, except it's scoped to the features of a group.

To learn more about how we use error budgets, see the
[Engineering Error Budgets](https://about.gitlab.com/handbook/engineering/error-budgets/) handbook page.

By default, the first row of panels on both dashboards shows the
[error budget for the stage group](https://about.gitlab.com/handbook/engineering/error-budgets/#budget-spend-by-stage-group).
This row shows how features owned by the group contribute to our
[overall availability](https://about.gitlab.com/handbook/engineering/infrastructure/performance-indicators/#gitlabcom-availability).

The official budget is aggregated over the past 28 days. You can see it on the
[stage group dashboard](dashboards/stage_group_dashboard.md).
The [error budget detail dashboard](dashboards/error_budget_detail.md)
lets you customize the time range.

We show the information in two formats:

- Availability: this number can be compared to the overall GitLab.com
  availability target of 99.95% uptime.
- Budget Spent: the time over the past 28 days during which features owned by the group
  did not perform adequately.
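
The two formats describe the same measurement. The following Python sketch shows one way to convert between them. It assumes that the budget spent equals the unavailable fraction multiplied by the 28-day window. The function names are hypothetical and are not part of any GitLab tooling.

```python
from datetime import timedelta

WINDOW = timedelta(days=28)    # aggregation window of the official budget
TARGET_AVAILABILITY = 0.9995   # the 99.95% target mentioned above


def total_budget(target: float = TARGET_AVAILABILITY,
                 window: timedelta = WINDOW) -> timedelta:
    """Time the target allows features to perform inadequately (assumed model)."""
    return window * (1 - target)


def budget_spent(availability: float, window: timedelta = WINDOW) -> timedelta:
    """Time in the window attributed to inadequate performance (assumed model)."""
    return window * (1 - availability)


if __name__ == "__main__":
    # 28 days is 40,320 minutes, and 0.05% of that is about 20 minutes.
    print("budget:", total_budget())
    print("spent at 99.90% availability:", budget_spent(0.999))
```

Under this assumption, a group with 99.90% availability has spent roughly twice the time that the 99.95% target allows.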

The budget is calculated based on indicators per component. Each
component can have two indicators:

- [Apdex](https://en.wikipedia.org/wiki/Apdex): the rate of operations that performed adequately.

  The threshold for "performing adequately" is stored in our
  [metrics catalog](https://gitlab.com/gitlab-com/runbooks/-/tree/master/metrics-catalog)
  and depends on the service in question. For the Puma (Rails) component of the
  [API](https://gitlab.com/gitlab-com/runbooks/-/blob/f22f40b2c2eab37d85e23ccac45e658b2c914445/metrics-catalog/services/api.jsonnet#L127),
  [Git](https://gitlab.com/gitlab-com/runbooks/-/blob/f22f40b2c2eab37d85e23ccac45e658b2c914445/metrics-catalog/services/git.jsonnet#L216),
  and
  [Web](https://gitlab.com/gitlab-com/runbooks/-/blob/f22f40b2c2eab37d85e23ccac45e658b2c914445/metrics-catalog/services/web.jsonnet#L154)
  services, that threshold is **5 seconds** when not opted in to the
  [`rails_requests` SLI](../application_slis/rails_request_apdex.md).

  We've made this target configurable in [this project](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/525).
  To learn how to customize the request Apdex, see
  [Rails request Apdex SLI](../application_slis/rails_request_apdex.md).
  This new Apdex measurement is not part of the error budget until you
  [opt in](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1451).

  For Sidekiq job execution, the threshold depends on the
  [job urgency](../sidekiq/worker_attributes.md#job-urgency). It is
  [currently](https://gitlab.com/gitlab-com/runbooks/-/blob/f22f40b2c2eab37d85e23ccac45e658b2c914445/metrics-catalog/services/lib/sidekiq-helpers.libsonnet#L25-38)
  **10 seconds** for high-urgency jobs and **5 minutes** for other jobs.

  Some stage groups might have more services. The thresholds for them are also in the metrics catalog.

- Error rate: the rate of operations that had errors.
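
As an illustration of the two indicators, the following Python sketch computes them for a list of operations. It is a simplification, not the way the metrics catalog records them: the `Operation` type, the dictionary keys, and the choice to measure Apdex over every operation are assumptions. Only the threshold values come from the text above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Thresholds quoted above, in seconds. The keys are hypothetical labels.
APDEX_THRESHOLDS_S = {
    "puma": 5.0,                   # when not opted in to the rails_requests SLI
    "sidekiq_high_urgency": 10.0,
    "sidekiq_other": 300.0,        # 5 minutes
}


@dataclass
class Operation:
    duration_s: float
    failed: bool = False


def apdex(operations: Sequence[Operation], threshold_s: float) -> Optional[float]:
    """Fraction of operations that completed within the threshold."""
    if not operations:
        return None  # no measurements, so no ratio
    meeting = sum(1 for op in operations if op.duration_s <= threshold_s)
    return meeting / len(operations)


def error_rate(operations: Sequence[Operation]) -> Optional[float]:
    """Fraction of operations that had errors."""
    if not operations:
        return None
    return sum(1 for op in operations if op.failed) / len(operations)


if __name__ == "__main__":
    requests = [Operation(0.2), Operation(6.1), Operation(0.4, failed=True)]
    print(apdex(requests, APDEX_THRESHOLDS_S["puma"]), error_rate(requests))
```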

The availability ratio combines both indicators and is calculated as follows:

```math
\frac {operations\_meeting\_apdex + (total\_operations - operations\_with\_errors)} {total\_apdex\_measurements + total\_operations}
```
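
A worked example can make the ratio easier to read. The following Python sketch implements the formula directly. The function name and the sample numbers are illustrative only.

```python
from typing import Optional


def availability_ratio(operations_meeting_apdex: int,
                       total_apdex_measurements: int,
                       operations_with_errors: int,
                       total_operations: int) -> Optional[float]:
    """Combine the Apdex and error indicators into a single ratio."""
    denominator = total_apdex_measurements + total_operations
    if denominator == 0:
        return None  # nothing was measured in the window
    successful_operations = total_operations - operations_with_errors
    return (operations_meeting_apdex + successful_operations) / denominator


if __name__ == "__main__":
    # 990 of 1,000 Apdex measurements met the threshold,
    # and 5 of 1,000 operations had errors:
    # (990 + 995) / (1000 + 1000) = 0.9925
    print(availability_ratio(990, 1000, 5, 1000))
```

Because both indicators share one denominator, a slow operation and a failed operation each count once against the budget.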

## Check where budget is being spent

Both the [stage group dashboard](dashboards/stage_group_dashboard.md)
and the [error budget detail dashboard](dashboards/error_budget_detail.md)
have panels that show where the error budget was spent. The stage group
dashboard always shows a fixed 28 days. The error budget detail
dashboard lets you drill down into the SLIs over time.

The row below the error budget row is collapsed by default. Expanding
it shows which component and violation type had the most offending
operations in the past 28 days.

![Error attribution](img/stage_group_dashboards_error_attribution.png)

The first panel on the left shows a table with the number of errors per
component. Investigating the first row in that table is likely to have
the biggest impact on the budget spent.

Commonly, the components that spend most of the budget are Sidekiq or Puma. The panel in
the center explains what different violation types mean and how to dig
deeper in the logs.

The panel on the right provides links to Kibana that should reveal
which endpoints or Sidekiq jobs are causing the errors.

<i class="fa fa-youtube-play youtube" aria-hidden="true"></i>
To learn how to use these panels and logs for
determining which Rails endpoints are slow,
see the [Error Budget Attribution for Purchase group](https://youtu.be/M9u6unON7bU) video.

Other components visible in the table come from
[service-level indicators](https://sre.google/sre-book/service-level-objectives/) (SLIs) defined
in the [metrics catalog](https://gitlab.com/gitlab-com/runbooks/-/blob/master/metrics-catalog/README.md).

For those types of failures, you can follow the link to the service
dashboard from the `type` column. The service dashboard
contains a row specifically for the SLI that is spending the
budget, with links to logs and a description of what the
component means.

For example, see the `server` component of the `web-pages` service:

![web-pages-server-component SLI](img/stage_group_dashboards_service_sli_detail.png)

To add more SLIs tailored to specific features, you can use an [Application SLI](../application_slis/index.md).