author	Zuul <zuul@review.opendev.org>	2022-08-08 19:22:27 +0000
committer	Gerrit Code Review <review@openstack.org>	2022-08-08 19:22:27 +0000
commit	7c72f0379d0fa28df4309e018cb80bc8ff616007 (patch)
tree	7916672cc6795aa94c2dab6c2f51a5fc3dabad00 /doc
parent	68684d519e9f906afd899a37f9e3da92b61f914d (diff)
parent	30c046842459891d03654d6ceeaa46fa3f27bd57 (diff)
download	zuul-7c72f0379d0fa28df4309e018cb80bc8ff616007.tar.gz
Merge "Add tracing (OpenTelemetry) spec"
Diffstat (limited to 'doc')
-rw-r--r--	doc/source/developer/specs/index.rst	1
-rw-r--r--	doc/source/developer/specs/tracing.rst	323
2 files changed, 324 insertions, 0 deletions
diff --git a/doc/source/developer/specs/index.rst b/doc/source/developer/specs/index.rst
index d96df0c26..78c11bbb8 100644
--- a/doc/source/developer/specs/index.rst
+++ b/doc/source/developer/specs/index.rst
@@ -23,3 +23,4 @@ documentation instead.
enhanced-regional-executors
tenant-resource-quota
community-matrix
+ tracing
diff --git a/doc/source/developer/specs/tracing.rst b/doc/source/developer/specs/tracing.rst
new file mode 100644
index 000000000..5e0e9e4d1
--- /dev/null
+++ b/doc/source/developer/specs/tracing.rst
@@ -0,0 +1,323 @@
+Tracing
+=======
+
+.. warning:: This is not authoritative documentation. These features
+   are not currently available in Zuul. They may change significantly
+   before final implementation, or may never be fully completed.
+
+It can be difficult for a user to understand what steps were involved
+between a trigger event (such as a patchset upload or recheck comment)
+and a buildset report. If it took an unusually long time, it can be
+difficult to determine why. At present, an operator would need to
+examine logs to determine what steps were involved and the sources of
+any potential delays. Even experienced operators and developers can
+take quite some time to first collect and then analyze logs to answer
+these questions.
+
+Sometimes these answers may point to routine system operation (such as
+a delay caused by many gate resets, or preparing a large number of
+repositories). Other times they may point to deficiencies in the
+system (insufficient mergers) or bugs in the code.
+
+Being able to visualize the activities of a Zuul system can help
+operators (and potentially users) triage and diagnose issues more
+quickly and accurately. Even if examining logs is ultimately required
+in order to fully diagnose an issue, being able to narrow down the
+scope using analysis tools can greatly simplify the process.
+
+Proposed Solution
+-----------------
+
+Implementing distributed tracing in Zuul can help improve the
+observability of the system and aid operators and potentially users in
+understanding the sequence of events.
+
+By exporting information about the processing Zuul performs using the
+OpenTelemetry API, information about Zuul operations can be collected
+in any of several tools for analysis.
+
+OpenTelemetry is an open-source protocol for exchanging observability
+data, an SDK implementing that protocol, and an implementation of a
+collector for distributing information to multiple backends.
+
+It supports three kinds of observability data: `traces`, `metrics`,
+and `logs`. Since Zuul already has support for metrics and logs, this
+specification proposes that we use only the support in OpenTelemetry
+for `traces`.
+
+Usage Scenarios
+~~~~~~~~~~~~~~~
+
+Usage of OpenTelemetry should be entirely optional and supplementary
+for any Zuul deployment. Log messages alone should continue to be
+sufficient to analyze any potential problem.
+
+Should a deployer wish to use OpenTelemetry tracing data, a very
+simple deployment for smaller sites may be constructed by running only
+Jaeger. Jaeger is a service that can receive, store, and display
+tracing information. The project distributes an all-in-one container
+image which can store data in local filesystem storage.
+
+https://www.jaegertracing.io/
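+
+For illustration only, such an all-in-one instance is typically
+started with something like the following (the image name comes from
+the Jaeger project; the ports are Jaeger's defaults for the web UI
+and OTLP over gRPC, so check the Jaeger documentation for the current
+invocation):

```shell
# Run the Jaeger all-in-one image with local storage.
# Port 16686 serves the query UI; port 4317 accepts OTLP/gRPC
# traces from Zuul.  COLLECTOR_OTLP_ENABLED turns on the OTLP
# receiver in recent Jaeger releases.
docker run --rm \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4317:4317 \
  jaegertracing/all-in-one
```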
+
+Larger sites may wish to run multiple collectors and feed data to
+larger, distributed storage backends (such as Cassandra,
+Elasticsearch, etc).
+
+Suitability to Zuul
+~~~~~~~~~~~~~~~~~~~
+
+OpenTelemetry tracing, at a high level, is designed to record
+information about events, their timing, and their relation to other
+events. At first this seems like a natural fit for Zuul, which reacts
+to events, processes events, and generates more events. However,
+OpenTelemetry's bias toward small and simple web applications is
+evident throughout its documentation and the SDK implementation.
+
+   Traces give us the big picture of what happens when a request is
+   made by user or an application.
+
+Zuul is not driven by user or application requests, and a system
+designed to record several millisecond-long events which make up the
+internal response to a user request of a web app is not necessarily
+the obvious right choice for recording sequences and combinations of
+events which frequently take hours (and sometimes days) to play out
+across multiple systems.
+
+Fortunately, the concepts and protocol implementation of OpenTelemetry
+are sufficiently well-designed for the general case to be able to
+accommodate a system like Zuul, even if the SDK makes incompatible
+assumptions that make integration difficult. There are some
+challenges to implementation, but because the concepts appear to be
+well matched, we should proceed with using the OpenTelemetry protocol
+and SDK.
+
+Spans
+~~~~~
+
+The key tracing concepts in OpenTelemetry are `traces` and `spans`.
+From a data model perspective, the unit of data storage is a `span`.
+A trace itself is really just a unique ID that is common to multiple
+spans.
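+
+That relationship can be sketched in plain Python (this models only
+the data concepts described above; the names are illustrative and are
+not the SDK's API):

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional

def new_trace_id() -> str:
    return secrets.token_hex(16)  # 128-bit trace ID, as in OpenTelemetry

def new_span_id() -> str:
    return secrets.token_hex(8)   # 64-bit span ID

@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str = field(default_factory=new_span_id)
    parent_id: Optional[str] = None

# Two spans form one trace purely by sharing a trace ID; the child
# additionally records its parent's span ID.
trace_id = new_trace_id()
root = Span("buildset", trace_id)
child = Span("merge", trace_id, parent_id=root.span_id)
```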
+
+Spans can relate to other spans as either children or links. A trace
+is generally considered to have a single 'root' span, and within the
+time period represented by that span, it may have any number of child
+spans (which may further have their own child spans).
+
+OpenTelemetry anticipates that a span on one system may spawn a child
+span on another system and includes facilities for transferring enough
+information about the parent span to a child system that the child
+system alone can emit traces for its span and any children that it
+spawns in turn.
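+
+A minimal sketch of that hand-off, loosely modeled on the W3C
+``traceparent`` header format that OpenTelemetry uses for context
+propagation (the helper names here are hypothetical):

```python
# Serialize just enough parent-span context for another system to
# emit child spans in the same trace.  "00" is the format version
# and "01" marks the trace as sampled, per the W3C Trace Context
# header layout.
def inject(trace_id: str, span_id: str) -> str:
    return f"00-{trace_id}-{span_id}-01"

def extract(traceparent: str) -> tuple:
    _version, trace_id, parent_span_id, _flags = traceparent.split("-")
    return trace_id, parent_span_id

# The parent system sends the header; the child system recovers the
# trace ID and parent span ID from it.
header = inject("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
ids = extract(header)
```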
+
+For a concrete example in Zuul, we might have a Zuul scheduler start a
+span for a buildset, and then a merger might emit a child span for
+performing the initial merge, and an executor might emit a child span
+for executing a build.
+
+Spans can relate to other spans (including spans in other traces), so
+sequences of events can be chained together without necessitating that
+they all be part of the same span or trace.
+
+Because Zuul processes series of events which may stretch for long
+periods of time, we should specify what events and actions should
+correspond to spans and traces. Spans can have arbitrary metadata
+associated with them, so we will be able to search by event or job
+ids.
+
+The following sections describe traces and their child spans.
+
+Event Ingestion
++++++++++++++++
+
+A trace will begin when Zuul receives an event and end when that event
+has been enqueued into scheduler queues (or discarded). A driver
+completing processing of an event is a definitive point in time so it
+is easy to know when to close the root span for that event's trace
+(whereas if we kept the trace open to include scheduler processing, we
+would need to know when the last trigger event spawned by the
+connection event was complete).
+
+This may include processing in internal queues by a given driver, and
+these processing steps/queues should appear as their own child spans.
+The spans should include event IDs (and potentially other information
+about the event such as change or pull request numbers) as metadata.
+
+Tenant Event Processing
++++++++++++++++++++++++
+
+A trace will begin when a scheduler begins processing a tenant event
+and end when it has forwarded the event to all pipelines within a
+tenant. It will link to the event ingestion trace as a follow-on
+span.
+
+Queue Item
+++++++++++
+
+A trace will begin when an item is enqueued and end when it is
+dequeued. This will be quite a long trace (hours or days). It is
+expected to be the primary benefit of this telemetry effort as it will
+show the entire lifetime of a queue item. It will link to the tenant
+event processing trace as a follow-on span.
+
+Within the root span, there will be a span for each buildset (so that
+if a gate reset happens and a new buildset is created, users will see
+a series of buildset spans). Within a buildset, there will be spans
+for all of the major processing steps, such as merge operations,
+layout calculating, freezing the job graph, and freezing jobs. Each
+build will also merit a span (retried builds will get their own spans
+as well), and within a job span, there will be child spans for git
+repo prep, job setup, individual playbooks, and cleanup.
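+
+The nesting described above might be summarized as follows (an
+illustrative outline of this spec's proposed span names, not SDK
+calls or settled identifiers):

```python
# One queue-item trace as proposed above: each key is a span, its
# value is that span's children.  A gate reset would add a sibling
# "buildset" span; a retried build adds a sibling "build" span.
QUEUE_ITEM_SPANS = {
    "queue_item": {
        "buildset": {
            "merge": {},
            "layout_calculation": {},
            "job_graph_freeze": {},
            "job_freeze": {},
            "build": {
                "repo_prep": {},
                "job_setup": {},
                "playbook": {},
                "cleanup": {},
            },
        },
    },
}
```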
+
+SDK Challenges
+~~~~~~~~~~~~~~
+
+As a high-level concept, the idea of spans for each of these
+operations makes sense. In practice, the SDK makes implementation
+challenging.
+
+The OpenTelemetry SDK makes no provision for beginning a span on one
+system and ending it on another, so the fact that one Zuul scheduler
+might start a buildset span while another ends it is problematic.
+
+Fortunately, the OpenTelemetry API only reports spans when they end,
+not when they start. This means that we don't need to coordinate a
+"start" API call on one scheduler with an "end" API call on another.
+We can simply emit the trace with its root span at the end. However,
+any child spans emitted during that time need to know the trace ID
+they should use, which means that we at least need to store a trace ID
+and start timestamp on our starting scheduler for use by any child
+spans as well as the "end span" API call.
+
+The SDK does not support creating a span with a specific trace ID or
+start timestamp (most timestamps are automatic), but it has
+well-defined interfaces for spans and we can subclass the
+implementation to allow us to specify trace IDs and timestamps. With
+this approach, we can "virtually" start a span on one host, store its
+information in ZooKeeper with whatever long-lived object it is
+associated with (such as a QueueItem) and then make it concrete on
+another host when we end it.
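+
+A stdlib-only sketch of that approach (this is not the real SDK
+subclassing, and the function names are hypothetical; it only
+illustrates starting a span "virtually" on one host and making it
+concrete on another):

```python
import json
import secrets
import time

def start_virtual_span(name: str) -> str:
    # Runs on the scheduler that begins the operation.  Nothing is
    # reported yet; we only record the IDs and start time, small
    # enough to store in ZooKeeper next to e.g. a QueueItem.
    return json.dumps({
        "name": name,
        "trace_id": secrets.token_hex(16),
        "span_id": secrets.token_hex(8),
        "start_time": time.time_ns(),
    })

def end_virtual_span(saved: str) -> dict:
    # Runs on whichever scheduler ends the operation, possibly a
    # different host: rehydrate the stored state and emit a finished
    # span carrying the original trace ID and start timestamp.
    span = json.loads(saved)
    span["end_time"] = time.time_ns()
    return span

saved = start_virtual_span("buildset")  # scheduler A; stored in ZooKeeper
span = end_virtual_span(saved)          # scheduler B, when the buildset ends
```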
+
+Alternatives
+++++++++++++
+
+This section describes some alternative ideas for dealing with the
+SDK's mismatch with Zuul concepts as well as why they weren't
+selected.
+
+* Multiple root spans with the same trace ID
+
+ Jaeger handles this relatively well, and the timeline view appears
+ as expected (multiple events with whitespace between them). The
+ graph view in Jaeger may have some trouble displaying this.
+
+ It is not clear that OpenTelemetry anticipates having multiple
+ "root" spans, so it may be best to avoid this in order to avoid
+ potential problems with other tools.
+
+* Child spans without a parent
+
+ If we emit spans that specify a parent which does not exist, Jaeger
+ will display these traces but show a warning that the parent is
+ invalid. This may occur naturally while the system is operating
+ (builds complete while a buildset is running), but should be
+ eventually corrected once an item is dequeued. In case of a serious
+ error, we may never close a parent span, which would cause this to
+ persist. We should accept that this may happen, but try to avoid it
+ happening intentionally.
+
+Links
+~~~~~
+
+Links between spans are fairly primitive in Jaeger. While the
+OpenTelemetry API includes attributes for links (so that when we link
+a queue item to an event, we could specify that it was a forwarded
+event), Jaeger does not store or render them. Instead, we are only
+left with a reference to a ``< span in another trace >`` with a
+reference type of ``FOLLOWS_FROM``. Clicking on that link will
+immediately navigate to the other trace where metadata about the trace
+will be visible, but before clicking on it, users will have little
+idea of what awaits on the other side.
+
+For this reason, we should use span links sparingly so that when they
+are encountered, users are likely to intuit what they are for and are
+not overwhelmed by multiple indistinguishable links.
+
+Events and Exceptions
+~~~~~~~~~~~~~~~~~~~~~
+
+OpenTelemetry allows events to be added to spans. Events have their
+own timestamp and attributes. These can be used to add additional
+context to spans (representing single points in time rather than
+events with duration that should be child spans). Examples might
+include receiving a request to cancel a job or dequeue an item.
+
+Events should not be used as an alternative to logs, nor should all
+log messages be copied as events. Events should be used sparingly to
+avoid overwhelming the tracing storage with data and the user with
+information.
+
+Exceptions may also be included in spans. This happens automatically
+and by default when using the context managers supplied by the SDK.
+Because many spans in Zuul will be unable to use the SDK context
+managers and any exception information would need to be explicitly
+handled and stored in ZooKeeper, we will disable inclusion of
+exception information in spans. This will provide a more consistent
+experience (so that users don't take the absence of an exception in
+tracing information to indicate the absence of an error in the logs) and
+reduce the cost of supporting traces (extra storage in ZooKeeper and
+in the telemetry storage).
+
+If we decide that exception information is worth including in the
+future, this decision will be easy to revisit and reverse.
+
+Sensitive Information
+~~~~~~~~~~~~~~~~~~~~~
+
+No sensitive information (secrets, passwords, job variables, etc)
+should be included in tracing output. All output should be suitable
+for an audience of Zuul users (that is, if someone has access to the
+Zuul dashboard, then tracing data should not have any more sensitive
+information than they already have access to). For public-facing Zuul
+systems (such as OpenDev), the information should be suitable for
+public use.
+
+Protobuf and gRPC
+~~~~~~~~~~~~~~~~~
+
+The most efficient and straightforward method of transmitting data
+from Zuul to a collector (including Jaeger) is using OTLP with gRPC
+(OpenTelemetry Protocol + gRPC Remote Procedure Calls). Because
+Protobuf applications include automatically generated code, we may
+encounter the occasional version inconsistency. We may need to
+navigate package requirements more than normal due to this (especially
+if we have multiple packages that depend on protobuf).
+
+For a contemporary example, the OpenTelemetry project is in the
+process of pinning to an older version of protobuf:
+
+https://github.com/open-telemetry/opentelemetry-python/issues/2717
+
+There is an HTTP+JSON exporter as well, so in the case that something
+goes very wrong with protobuf+gRPC, that may be available as a fallback.
+
+Work Items
+----------
+
+* Add OpenTelemetry SDK and support for configuring an exporter to
+ zuul.conf
+* Implement SDK subclasses to support opening and closing spans on
+ different hosts
+* Instrument event processing in each driver
+* Instrument event processing in scheduler
+* Instrument queue items and related spans
+* Document a simple Jaeger setup as a quickstart add-on (similar to
+ authz)
+* Optional: work with OpenDev to run a public Jaeger server for
+ OpenDev
+
+The last item is not required for this specification (and not our
+choice as Zuul developers to make) but it would be nice if there were
+one available so that all Zuul users and developers have a reference
+implementation available for community collaboration.
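+
+To make the first work item concrete, exporter configuration in
+zuul.conf might eventually look something like the following (the
+section and option names here are purely hypothetical placeholders,
+not a settled interface):

```ini
# Hypothetical sketch only; the real option names are part of the
# implementation work described above.
[tracing]
enabled=true
# OTLP/gRPC endpoint of a collector or all-in-one Jaeger instance
endpoint=jaeger.example.com:4317
```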