summaryrefslogtreecommitdiff
path: root/doc/development/stage_group_observability/dashboards/error_budget_detail.md
blob: 19f98d404e7a449140d088c6126c6609d3f61e13 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
---
stage: Platforms
group: Scalability
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
---

# Error budget detail dashboard

With error budget detailed dashboards you can explore the error budget
spent at specific moments in time. By default, the dashboard shows
the past 28 days. You can adjust it with the [time range controls](index.md#time-range-controls)
or by selecting a range on one of the graphs.

This dashboard is the same kind of dashboard we use for service level
monitoring. For example, see the
[overview dashboard for the web service](https://dashboards.gitlab.net/d/web-main) (GitLab internal).

## Error budget panels

On top of each dashboard, there's the same panel with the [error budget](../index.md#error-budget).
Here, the time based targets adjust depending on the range.
For example, while the budget was 20 minutes per 28 days, it is only 1/4 of that for 7 days:

![5m budget in 7 days](img/error_budget_detail_7d_budget.png)

Also, keep in mind that Grafana rounds the numbers. In this example the
total time spent is 5 minutes and 24 seconds, so 24 seconds over
budget.

The attribution panels also show only failures that occurred
within the selected range.

These two panels represent a view of the "official" error budget: they
take into account if an SLI was ignored.
The [attribution panels](../index.md#check-where-budget-is-being-spent) show which components
contributed the most over the selected period.

The panels below take into account all SLIs that contribute to GitLab.com availability.
This includes SLIs that are ignored for the official error budget.

## Time series for aggregations

The time series panels for aggregations all contain three panels:

- Apdex: the [Apdex score](https://en.wikipedia.org/wiki/Apdex) for one or more SLIs. Higher score is better.
- Error Ratio: the error ratio for one or more SLIs. Lower is better.
- Requests Per Second: the number of operations per second. Higher means a bigger impact on the error budget.

The Apdex and error-ratio panels also contain two alerting thresholds:

- The one-hour threshold: the fast burn rate.

  When this line is crossed, we've spent 2% of our monthly budget in the last hour.

- The six-hour threshold: the slow burn rate.

  When this line is crossed, we've spent 2% of our budget in the last six hours.

If there is no error-ratio or Apdex for a certain SLI, the panel is hidden.

Read more about these alerting windows in
[Google SRE workbook](https://sre.google/workbook/alerting-on-slos/#recommended_time_windows_and_burn_rates_f).

We don't have alerting on these metrics for stage groups.
This work is being discussed in [epic 615](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/615).
If this is something you would like for your group, let us know there.

### Stage group aggregation

![stage group aggregation graphs](img/error_budget_detail_stage_group_aggregation.png)

The stage group aggregation shows a graph with the Apdex and errors
portion of the error budget over time. The lower a dip in the Apdex
graph or the higher a peak on the error ratio graph, the more budget
was spent at that moment.

The third graph shows the sum of all the request rates for all
SLIs. Higher means there was more traffic.

To zoom in on a particular moment where a lot of budget was spent, select the appropriate time in
the graph.

### Service-level indicators

![Rails requests service level indicator](img/error_budget_detail_sli.png)

This time series shows a breakdown of each SLI that could be contributing to the
error budget for a stage group. Similar to the stage group
aggregation, it contains an Apdex score, error ratio, and request
rate.

Here we also display an explanation panel, describing the SLI and
linking to other monitoring tools. The links to logs (📖) or
visualizations (📈) in Kibana are scoped to the feature categories
for your stage group, and limited to the range selected. Keep in mind
that we only keep logs in Kibana for seven days.

In the graphs, there is a single line per service. In the previous example image,
`rails_requests` is an SLI for the `web`, `api` and `git` services.

Sidekiq is not included in this dashboard. We're tracking this in
[epic 700](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/700).

### SLI detail

![Rails requests SLI detail](img/error_budget_detail_sli_detail.png)

The SLI details row shows a breakdown of a specific SLI based on the
labels present on the source metrics.

For example, in the previous image, the `rails_requests` SLI has an `endpoint_id` label.
We can show how much a certain endpoint was requested (RPS), and how much it contributed to the error
budget spend.

For Apdex we show the **Apdex Attribution** panel. The more prominent
color is the one that contributed most to the spend. To see the
top spending endpoint over the entire range, sort by the average.

For error ratio we show an error rate. To see which label contributed most to the spend, sort by the
average.

We don't have endpoint information available for Rails errors. This work is being planned in
[epic 663](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/663).

The number of series to be loaded in the SLI details graphs is very
high when compared to the other aggregations. Because of this, it's not possible to
load more than a few days' worth of data.