diff options
author | Zuul <zuul@review.opendev.org> | 2022-08-08 19:22:27 +0000 |
---|---|---|
committer | Gerrit Code Review <review@openstack.org> | 2022-08-08 19:22:27 +0000 |
commit | 7c72f0379d0fa28df4309e018cb80bc8ff616007 (patch) | |
tree | 7916672cc6795aa94c2dab6c2f51a5fc3dabad00 /doc | |
parent | 68684d519e9f906afd899a37f9e3da92b61f914d (diff) | |
parent | 30c046842459891d03654d6ceeaa46fa3f27bd57 (diff) | |
download | zuul-7c72f0379d0fa28df4309e018cb80bc8ff616007.tar.gz |
Merge "Add tracing (OpenTelemetry) spec"
Diffstat (limited to 'doc')
-rw-r--r-- | doc/source/developer/specs/index.rst | 1 | ||||
-rw-r--r-- | doc/source/developer/specs/tracing.rst | 323 |
2 files changed, 324 insertions, 0 deletions
diff --git a/doc/source/developer/specs/index.rst b/doc/source/developer/specs/index.rst index d96df0c26..78c11bbb8 100644 --- a/doc/source/developer/specs/index.rst +++ b/doc/source/developer/specs/index.rst @@ -23,3 +23,4 @@ documentation instead. enhanced-regional-executors tenant-resource-quota community-matrix + tracing diff --git a/doc/source/developer/specs/tracing.rst b/doc/source/developer/specs/tracing.rst new file mode 100644 index 000000000..5e0e9e4d1 --- /dev/null +++ b/doc/source/developer/specs/tracing.rst @@ -0,0 +1,323 @@ +Tracing +======= + +.. warning:: This is not authoritative documentation. These features + are not currently available in Zuul. They may change significantly + before final implementation, or may never be fully completed. + +It can be difficult for a user to understand what steps were involved +between a trigger event (such as a patchset upload or recheck comment) +and a buildset report. If it took an unusually long time it can be +difficult to determine why. At present, an operator would need to +examine logs to determine what steps were involved and the sources of +any potential delays. Even experienced operators and developers can +take quite some time to first collect and then analyze logs to answer +these questions. + +Sometimes these answers may point to routine system operation (such as +a delay caused by many gate resets, or preparing a large number of +repositories). Other times they may point to deficiencies in the +system (insufficient mergers) or bugs in the code. + +Being able to visualize the activities of a Zuul system can help +operators (and potentially users) triage and diagnose issues more +quickly and accurately. Even if examining logs is ultimately required +in order to fully diagnose an issue, being able to narrow down the +scope using analsys tools can greatly simplify the process. + +Proposed Solution +----------------- + +Implementing distributed tracing in Zuul can help improve the +observability of the system and aid operators and potentially users in +understanding the sequence of events. + +By exporting information about the processing Zuul performs using the +OpenTelemetry API, information about Zuul operations can be collected +in any of several tools for analysis. + +OpenTelemetry is an Open Source protocol for exchanging observability +data, an SDK implementing that protocol, as well as an implementation +of a collector for distributing information to multiple backends. + +It supports three kinds of observability data: `traces`, `metrics`, +and `logs`. Since Zuul already has support for metrics and logs, this +specification proposes that we use only the support in OpenTelemtry +for `traces`. + +Usage Scenarios +~~~~~~~~~~~~~~~ + +Usage of OpenTelemetry should be entirely optional and supplementary +for any Zuul deployment. Log messages alone should continue to be +sufficient to analyze any potential problem. + +Should a deployer wish to use OpenTelemetry tracing data, a very +simple deployment for smaller sites may be constructed by running only +Jaeger. Jaeger is a service that can receive, store, and display +tracing information. The project distributes an all-in-one container +image which can store data in local filesystem storage. + +https://www.jaegertracing.io/ + +Larger sites may wish to run multiple collectors and feed data to +larger, distributed storage backends (such as Cassandra, +Elasticsearch, etc). + +Suitability to Zuul +~~~~~~~~~~~~~~~~~~~ + +OpenTelemetry tracing, at a high level, is designed to record +information about events, their timing, and their relation to other +events. At first this seems like a natural fit for Zuul, which reacts +to events, processes events, and generates more events. However, +OpenTelemetry's bias toward small and simple web applications is +evident throughout its documentation and the SDK implementation. + + Traces give us the big picture of what happens when a request is + made by user or an application. + +Zuul is not driven by user or application requests, and a system +designed to record several millisecond-long events which make up the +internal response to a user request of a web app is not necessarily +the obvious right choice for recording sequences and combinations of +events which frequently take hours (and sometimes days) to play out +across multiple systems. + +Fortunately, the concepts and protocol implementation of OpenTelemtry +are sufficiently well-designed for the general case to be able to +accomodate a system like Zuul, even if the SDK makes incompatible +assumptions that make integration difficult. There are some +challenges to implementation, but because the concepts appear to be +well matched, we should proceed with using the OpenTelemetry protocol +and SDK. + +Spans +~~~~~ + +The key tracing concepts in OpenTelemety are `traces` and `spans`. +From a data model perspective, the unit of data storage is a `span`. +A trace itself is really just a unique ID that is common to multiple +spans. + +Spans can relate to other spans as either children or links. A trace +is generally considered to have a single 'root' span, and within the +time period represented by that span, it may have any number of child +spans (which may further have their own child spans). + +OpenTelemetry anticipates that a span on one system may spawn a child +span on another system and includes facilities for transferring enough +information about the parent span to a child system that the child +system alone can emit traces for its span and any children that it +spawns in turn. + +For a concrete example in Zuul, we might have a Zuul scheduler start a +span for a buildset, and then a merger might emit a child span for +performing the initial merge, and an executor might emit a child span +for executing a build. + +Spans can relate to other spans (including spans in other traces), so +sequences of events can be chained together without necessitating that +they all be part of the same span or trace. + +Because Zuul processes series of events which may stretch for long +periods of time, we should specify what events and actions should +correspond to spans and traces. Spans can have arbitrary metadat +associated with them, so we will be able to search by event or job +ids. + +The following sections describe traces and their child spans. + +Event Ingestion ++++++++++++++++ + +A trace will begin when Zuul receives an event and end when that event +has been enqueued into scheduler queues (or discarded). A driver +completing processing of an event is a definitive point in time so it +is easy to know when to close the root span for that event's trace +(whereas if we kept the trace open to include scheduler processing, we +would need to know when the last trigger event spawned by the +connection event was complete). + +This may include processing in internal queues by a given driver, and +these processing steps/queues should appear as their own child spans. +The spans should include event IDs (and potentially other information +about the event such as change or pull request numbers) as metadata. + +Tenant Event Processing ++++++++++++++++++++++++ + +A trace will begin when a scheduler begins processing a tenant event +and ends when it has forwarded the event to all pipelines within a +tenant. It will link to the event ingestion trace as a follow-on +span. + +Queue Item +++++++++++ + +A trace will begin when an item is enqueued and end when it is +dequeued. This will be quite a long trace (hours or days). It is +expected to be the primary benefit of this telemetry effort as it will +show the entire lifetime of a queue item. It will link to the tenant +event processing trace as a follow-on span. + +Within the root span, there will be a span for each buildset (so that +if a gate reset happens and a new buildset is created, users will see +a series of buildset spans). Within a buildset, there will be spans +for all of the major processing steps, such as merge operations, +layout calculating, freezing the job graph, and freezing jobs. Each +build will also merit a span (retried builds will get their own spans +as well), and within a job span, there will be child spans for git +repo prep, job setup, individual playbooks, and cleanup. + +SDK Challenges +~~~~~~~~~~~~~~ + +As a high-level concept, the idea of spans for each of these +operations makes sense. In practice, the SDK makes implementation +challenging. + +The OpenTelemtry SDK makes no provision for beginning a span on one +system and ending it on another, so the fact that one Zuul scheduler +might start a buildset span while another ends it is problematic. + +Fortunately, the OpenTelemetry API only reports spans when they end, +not when they start. This means that we don't need to coordinate a +"start" API call on one scheduler with an "end" API call on another. +We can simply emit the trace with its root span at the end. However, +any child spans emitted during that time need to know the trace ID +they should use, which means that we at least need to store a trace ID +and start timestamp on our starting scheduler for use by any child +spans as well as the "end span" API call. + +The SDK does not support creating a span with a specific trace ID or +start timestamp (most timestamps are automatic), but it has +well-defined interfaces for spans and we can subclass the +implementation to allow us to specify trace IDs and timestamps. With +this approach, we can "virtually" start a span on one host, store its +information in ZooKeeper with whatever long-lived object it is +associated with (such as a QueueItem) and then make it concrete on +another host when we end it. + +Alternatives +++++++++++++ + +This section describes some alternative ideas for dealing with the +SDK's mismatch with Zuul concepts as well as why they weren't +selected. + +* Multiple root spans with the same trace ID + + Jaeger handles this relatively well, and the timeline view appears + as expected (multiple events with whitespace between them). The + graph view in Jaeger may have some trouble displaying this. + + It is not clear that OpenTelemetry anticipates having multiple + "root" spans, so it may be best to avoid this in order to avoid + potential problems with other tools. + +* Child spans without a parent + + If we emit spans that specify a parent which does not exist, Jaeger + will display these traces but show a warning that the parent is + invalid. This may occur naturally while the system is operating + (builds complete while a buildset is running), but should be + eventually corrected once an item is dequeued. In case of a serious + error, we may never close a parent span, which would cause this to + persist. We should accept that this may happen, but try to avoid it + happening intentionally. + +Links +~~~~~ + +Links between spans are fairly primitive in Jaeger. While the +OpenTelemetry API includes attributes for links (so that when we link +a queue item to an event, we could specify that it was a forwarded +event), Jaeger does not store or render them. Instead, we are only +left with a reference to a ``< span in another trace >`` with a +reference type of ``FOLLOWS_FROM``. Clicking on that link will +immediately navigate to the other trace where metadata about the trace +will be visible, but before clicking on it, users will have little +idea of what awaits on the other side. + +For this reason, we should use span links sparingly so that when they +are encountered, users are likely to intuit what they are for and are +not overwhelmed by multiple indistinguishable links. + +Events and Exceptions +~~~~~~~~~~~~~~~~~~~~~ + +OpenTelemetry allows events to be added to spans. Events have their +own timestamp and attributes. These can be used to add additional +context to spans (representing single points in time rather than +events with duration that should be child spans). Examples might +include receiving a request to cancel a job or dequeue an item. + +Events should not be used as an alternative to logs, nor should all +log messages be copied as events. Events should be used sparingly to +avoid overwhelming the tracing storage with data and the user with +information. + +Exceptions may also be included in spans. This happens automatically +and by default when using the context managers supplied by the SDK. +Because many spans in Zuul will be unable to use the SDK context +managers and any exception information would need to be explicitly +handled and stored in ZooKeeper, we will disable inclusion of +exception information in spans. This will provide a more consistent +experience (so that users don't see the absence of an exception in +tracing information to indicate the absence of an error in logs) and +reduce the cost of supporting traces (extra storage in ZooKeeper and +in the telemetry storage). + +If we decide that exception information is worth including in the +future, this decision will be easy to revisit and reverse. + +Sensitive Information +~~~~~~~~~~~~~~~~~~~~~ + +No sensitive information (secrets, passwords, job variables, etc) +should be included in tracing output. All output should be suitable +for an audience of Zuul users (that is, if someone has access to the +Zuul dashboard, then tracing data should not have any more sensitive +information than they already have access to). For public-facing Zuul +systems (such as OpenDev), the information should be suitable for +public use. + +Protobuf and gRPC +~~~~~~~~~~~~~~~~~ + +The most efficient and straightforward method of transmitting data +from Zuul to a collector (including Jaeger) is using OTLP with gRPC +(OpenTelemetry Protocol + gRPC Remote Procedure Calls). Because +Protobuf applications include automatically generated code, we may +encounter the occasional version inconsistency. We may need to +navigate package requirements more than normal due to this (especially +if we have multiple packages that depend on protobuf). + +For a contemporary example, the OpenTelemetry project is in the +process of pinning to an older version of protobuf: + +https://github.com/open-telemetry/opentelemetry-python/issues/2717 + +There is an HTTP+JSON exporter as well, so in the case that something +goes very wrong with protobuf+gRPC, that may be available as a fallback. + +Work Items +---------- + +* Add OpenTelemetry SDK and support for configuring an exporter to + zuul.conf +* Implement SDK subclasses to support opening and closing spans on + different hosts +* Instrument event processing in each driver +* Instrument event processing in scheduler +* Instrument queue items and related spans +* Document a simple Jaeger setup as a quickstart add-on (similar to + authz) +* Optional: work with OpenDev to run a public Jaeger server for + OpenDev + +The last item is not required for this specification (and not our +choice as Zuul developers to make) but it would be nice if there were +one available so that all Zuul users and developers have a reference +implementation available for community collaboration. |