Diffstat (limited to 'doc/development/application_slis/index.md')
-rw-r--r-- | doc/development/application_slis/index.md | 130 |
1 files changed, 130 insertions, 0 deletions
diff --git a/doc/development/application_slis/index.md b/doc/development/application_slis/index.md
new file mode 100644
index 00000000000..c1d7ac9fa0c
--- /dev/null
+++ b/doc/development/application_slis/index.md
@@ -0,0 +1,130 @@
+---
+stage: Platforms
+group: Scalability
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# GitLab Application Service Level Indicators (SLIs)
+
+> [Introduced](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/525) in GitLab 14.4
+
+It is possible to define [Service Level Indicators
+(SLIs)](https://en.wikipedia.org/wiki/Service_level_indicator)
+directly in the Ruby codebase. This keeps the definition of operations
+and their success close to the implementation and allows the people
+building features to easily define how these features should be
+monitored.
+
+Defining an SLI causes two
+[Prometheus
+counters](https://prometheus.io/docs/concepts/metric_types/#counter)
+to be emitted from the Rails application:
+
+- `gitlab_sli:<sli_name>:total`: incremented for each operation.
+- `gitlab_sli:<sli_name>:success_total`: incremented for successful
+  operations.
+
+## Existing SLIs
+
+1. [`rails_request_apdex`](rails_request_apdex.md)
+
+## Defining a new SLI
+
+An SLI can be defined using the `Gitlab::Metrics::Sli` class.
+
+Before the first scrape, it is important to have [initialized the SLI
+with all possible
+label-combinations](https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics). This
+avoids confusing results when using these counters in calculations.
+
+To initialize an SLI, use the `.initialize_sli` class method, for
+example:
+
+```ruby
+Gitlab::Metrics::Sli.initialize_sli(:received_email, [
+  {
+    feature_category: :issue_tracking,
+    email_type: :create_issue
+  },
+  {
+    feature_category: :service_desk,
+    email_type: :service_desk
+  },
+  {
+    feature_category: :code_review,
+    email_type: :create_merge_request
+  }
+])
+```
+
+Metrics must be initialized before they get
+scraped for the first time. This could be done at the start time of the
+process that will emit them, in which case we need to pay attention
+not to increase the application's boot time too much. This is preferable
+if possible.
+
+Alternatively, if initializing would take too long, this can be done
+during the first scrape. We need to make sure we don't do it for every
+scrape. This can be done as follows:
+
+```ruby
+def initialize_request_slis_if_needed!
+  return if Gitlab::Metrics::Sli.initialized?(:rails_request_apdex)
+  Gitlab::Metrics::Sli.initialize_sli(:rails_request_apdex, possible_request_labels)
+end
+```
+
+Also make sure to do this for each of the metrics
+endpoints we have. Currently these are the
+[`WebExporter`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/gitlab/metrics/exporter/web_exporter.rb)
+and the
+[`HealthController`](https://gitlab.com/gitlab-org/gitlab/blob/master/app/controllers/health_controller.rb)
+for Rails and
+[`SidekiqExporter`](https://gitlab.com/gitlab-org/gitlab/blob/master/lib/gitlab/metrics/exporter/sidekiq_exporter.rb)
+for Sidekiq.
+
+## Tracking operations for an SLI
+
+Tracking an operation in the newly defined SLI can be done like this:
+
+```ruby
+Gitlab::Metrics::Sli[:received_email].increment(
+  labels: {
+    feature_category: :service_desk,
+    email_type: :service_desk
+  },
+  success: issue_created?
+)
+```
+
+Calling `#increment` on this SLI will increment the total Prometheus counter:
+
+```prometheus
+gitlab_sli:received_email:total{ feature_category='service_desk', email_type='service_desk' }
+```
+
+If the `success:` argument passed is truthy, then the success counter
+will also be incremented:
+
+```prometheus
+gitlab_sli:received_email:success_total{ feature_category='service_desk', email_type='service_desk' }
+```
+
+## Using the SLI in service monitoring and alerts
+
+When the application is emitting metrics for the new SLI, those need
+to be consumed in the service catalog to result in alerts, and be
+included in the error budget for stage groups and GitLab.com's overall
+availability.
+
+This is currently being worked on in [this
+project](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/573). As
+part of [this
+issue](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1307)
+we will update the documentation.
+
+If you have any questions, please don't hesitate to create an issue in [the
+Scalability issue
+tracker](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues)
+or come find us in
+[#g_scalability](https://gitlab.slack.com/archives/CMMF8TKR9) on Slack.
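
The counter semantics this page describes (a total counter and a success counter per label combination, both initialized up front) can be sketched in plain Ruby. This is a minimal illustration and is not part of the change itself: `SketchSli` and its `success_ratio` helper are hypothetical names, and the class is an in-memory stand-in rather than the real `Gitlab::Metrics::Sli`, which emits Prometheus counters.

```ruby
# Hypothetical stand-in for illustration only; this is NOT Gitlab::Metrics::Sli.
class SketchSli
  attr_reader :total, :success_total

  def initialize(label_combinations)
    # Create both counters at 0 for every known label combination, mirroring
    # the advice to initialize all combinations before the first scrape.
    @total = label_combinations.to_h { |labels| [labels, 0] }
    @success_total = label_combinations.to_h { |labels| [labels, 0] }
  end

  # Always bumps the total counter; bumps the success counter only when
  # `success` is truthy.
  def increment(labels:, success:)
    @total[labels] = @total.fetch(labels, 0) + 1
    @success_total[labels] = @success_total.fetch(labels, 0) + 1 if success
  end

  # Assumed helper: the kind of ratio a monitoring query might derive
  # from the two counters. Returns nil when nothing has been recorded.
  def success_ratio(labels)
    count = @total.fetch(labels, 0)
    return nil if count.zero?

    @success_total.fetch(labels, 0).to_f / count
  end
end

service_desk = { feature_category: :service_desk, email_type: :service_desk }
create_issue = { feature_category: :issue_tracking, email_type: :create_issue }

sli = SketchSli.new([service_desk, create_issue])
sli.increment(labels: service_desk, success: true)
sli.increment(labels: service_desk, success: false)
```

Because every label combination starts at zero, a combination that has seen no operations yet still has both counters present, so a calculation over them does not have to deal with missing series.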