1 files changed, 138 insertions, 0 deletions
diff --git a/doc/development/stage_group_observability/index.md b/doc/development/stage_group_observability/index.md
new file mode 100644
index 00000000000..868e55735e8
--- /dev/null
+++ b/doc/development/stage_group_observability/index.md
@@ -0,0 +1,138 @@
+---
+stage: Platforms
+group: Scalability
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Observability for stage groups
+
+Observability is about bringing visibility into a system to see and
+understand the state of each component, with context, to support
+performance tuning and debugging. To run a SaaS platform at scale, a
+rich and detailed observability platform is needed.
+
+To make information available to [stage groups](https://about.gitlab.com/handbook/product/categories/#hierarchy),
+we are aggregating metrics by feature category and then show
+this information on [dashboards](dashboards/index.md) tailored to the groups. Only metrics
+for the features built by the group are visible on their
+dashboards.
+
+With a filtered view, groups can discover bugs and performance regressions that could otherwise
+be missed when viewing aggregated data.
+
+For more specific information on dashboards, see:
+
+- [Dashboards](dashboards/index.md): a general overview of where to find dashboards
+  and how to use them.
+- [Stage group dashboard](dashboards/stage_group_dashboard.md): how to use and customize the stage group dashboard.
+- [Error budget detail](dashboards/error_budget_detail.md): how to explore error budget over time.
+
+## Error budget
+
+The error budget is calculated from the same [Service Level Indicators](https://en.wikipedia.org/wiki/Service_level_indicator) (SLIs)
+that we use to monitor GitLab.com. The 28-day availability number for a
+stage group is comparable to the
+[monthly availability](https://about.gitlab.com/handbook/engineering/infrastructure/performance-indicators/#gitlabcom-availability)
+we calculate for GitLab.com, except it's scoped to the features of a group.
+
+To learn more about how we use error budgets, see the
+[Engineering Error Budgets](https://about.gitlab.com/handbook/engineering/error-budgets/) handbook page.
+
+By default, the first row of panels on both dashboards shows the
+[error budget for the stage group](https://about.gitlab.com/handbook/engineering/error-budgets/#budget-spend-by-stage-group).
+This row shows how features owned by the group contribute to our
+[overall availability](https://about.gitlab.com/handbook/engineering/infrastructure/performance-indicators/#gitlabcom-availability).
+
+The official budget is aggregated over the 28 days. You can see it on the
+[stage group dashboard](dashboards/stage_group_dashboard.md).
+The [error budget detail dashboard](dashboards/error_budget_detail.md)
+allows customizing the range.
+
+We show the information in two formats:
+
+- Availability: this number can be compared to GitLab.com overall
+  availability target of 99.95% uptime.
+- Budget Spent: time over the past 28 days that features owned by the group have not been performing
+  adequately.
+
+The budget is calculated based on indicators per component. Each
+component can have two indicators:
+
+- [Apdex](https://en.wikipedia.org/wiki/Apdex): the rate of operations that performed adequately.
+
+  The threshold for "performing adequately" is stored in our
+  [metrics catalog](https://gitlab.com/gitlab-com/runbooks/-/tree/master/metrics-catalog)
+  and depends on the service in question. For the Puma (Rails) component of the
+  [API](https://gitlab.com/gitlab-com/runbooks/-/blob/f22f40b2c2eab37d85e23ccac45e658b2c914445/metrics-catalog/services/api.jsonnet#L127),
+  [Git](https://gitlab.com/gitlab-com/runbooks/-/blob/f22f40b2c2eab37d85e23ccac45e658b2c914445/metrics-catalog/services/git.jsonnet#L216),
+  and
+  [Web](https://gitlab.com/gitlab-com/runbooks/-/blob/f22f40b2c2eab37d85e23ccac45e658b2c914445/metrics-catalog/services/web.jsonnet#L154)
+  services, that threshold is **5 seconds** when not opted in to the
+  [`rails_requests` SLI](../application_slis/rails_request_apdex.md).
+
+  We've made this target configurable in [this project](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/525).
+  To learn how to customize the request Apdex, see
+  [Rails request Apdex SLI](../application_slis/rails_request_apdex.md).
+  This new Apdex measurement is not part of the error budget until you
+  [opt in](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1451).
+
+  For Sidekiq job execution, the threshold depends on the
+  [job urgency](../sidekiq/worker_attributes.md#job-urgency). It is
+  [currently](https://gitlab.com/gitlab-com/runbooks/-/blob/f22f40b2c2eab37d85e23ccac45e658b2c914445/metrics-catalog/services/lib/sidekiq-helpers.libsonnet#L25-38)
+  **10 seconds** for high-urgency jobs and **5 minutes** for other jobs.
+
+  Some stage groups might have more services. The thresholds for them are also in the metrics catalog.
+
+- Error rate: The rate of operations that had errors.
+
+The calculation of the ratio happens as follows:
+
+```math
+\frac {operations\_meeting\_apdex + (total\_operations - operations\_with\_errors)} {total\_apdex\_measurements + total\_operations}
+```
+
+## Check where budget is being spent
+
+Both the [stage group dashboard](dashboards/stage_group_dashboard.md)
+and the [error budget detail dashboard](dashboards/error_budget_detail.md)
+show panels to see where the error budget was spent. The stage group
+dashboard always shows a fixed 28 days. The error budget detail
+dashboard allows drilling down to the SLIs over time.
+
+The row below the error budget row is collapsed by default. Expanding
+it shows which component and violation type had the most offending
+operations in the past 28 days.
+
+![Error attribution](img/stage_group_dashboards_error_attribution.png)
+
+The first panel on the left shows a table with the number of errors per
+component. Digging into the first row in that table has
+the biggest impact on the budget spent.
+
+Commonly, the components that spend most of the budget are Sidekiq or Puma. The panel in
+the center explains what different violation types mean and how to dig
+deeper in the logs.
+
+The panel on the right provides links to Kibana that should reveal
+which endpoints or Sidekiq jobs are causing the errors.
+
+<i class="fa fa-youtube-play youtube" aria-hidden="true"></i>
+To learn how to use these panels and logs for
+determining which Rails endpoints are slow,
+see the [Error Budget Attribution for Purchase group](https://youtu.be/M9u6unON7bU) video.
+
+Other components visible in the table come from
+[service-level indicators](https://sre.google/sre-book/service-level-objectives/) (SLIs) defined
+in the [metrics catalog](https://gitlab.com/gitlab-com/runbooks/-/blob/master/metrics-catalog/README.md).
+
+For those types of failures, you can follow the link to the service
+dashboard linked from the `type` column. The service dashboard
+contains a row specifically for the SLI that is causing the budget
+spent, with links to logs and a description of what the
+component means.
+
+For example, see the `server` component of the `web-pages` service:
+
+![web-pages-server-component SLI](img/stage_group_dashboards_service_sli_detail.png)
+
+To add more SLIs tailored to specific features, you can use an [Application SLI](../application_slis/index.md).