|
This adds several metrics for different phases of processing an item
in a pipeline:
* How long we wait for a response from mergers
* How long it takes to get or compute a layout
* How long it takes to freeze jobs
* How long we wait for node requests to complete
* How long we wait for an executor to start running a job
  after the request
And finally, the total amount of time from the original event until
the first job starts. We already report that at the tenant level;
this change duplicates it as a pipeline-specific metric.
Several of these would also make sense as job metrics, but since they
are mainly intended to diagnose Zuul system performance and not
individual jobs, that would be a waste of storage space due to the
extremely high cardinality.
Additionally, two other timing metrics are added: the cumulative time
spent reading and writing ZKObject data to ZK during pipeline
processing. These can help determine whether more effort should be
spent optimizing ZK data transfer.
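
As a rough illustration of what "cumulative time" means here, the
sketch below accumulates elapsed time per operation type across
several calls. The IOTimer class and its attribute names are
hypothetical and are not taken from the Zuul code; the sleeps stand
in for ZK reads and writes.

```python
import time
from contextlib import contextmanager


class IOTimer:
    """Hypothetical helper that accumulates read and write durations."""

    def __init__(self):
        self.read_time = 0.0
        self.write_time = 0.0

    @contextmanager
    def timed(self, kind):
        # A process-local clock is fine here because the elapsed value
        # is only ever compared within this process.
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            if kind == "read":
                self.read_time += elapsed
            else:
                self.write_time += elapsed


timer = IOTimer()
with timer.timed("read"):
    time.sleep(0.01)   # stand-in for reading an object
with timer.timed("write"):
    time.sleep(0.01)   # stand-in for writing an object
with timer.timed("read"):
    time.sleep(0.01)   # a second read adds to the same total
```

After one pass of pipeline processing, the two totals could each be
reported as a single timing value.
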
In preparing this change, I noticed that python statsd emits floating
point values for timing. It's not clear whether this strictly matches
the statsd spec, but since it does emit values with that precision,
I have removed several int() casts in order to maintain the precision
through to the statsd client.
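
To illustrate the effect of the int() casts, here is a minimal
sketch using a stand-in client that only records what it is given.
RecordingStatsClient and the metric names are hypothetical; the
timing(stat, delta) shape is assumed to mirror the python statsd
client, with delta in milliseconds.

```python
class RecordingStatsClient:
    """Hypothetical stand-in for a statsd client; records calls only."""

    def __init__(self):
        self.sent = []

    def timing(self, stat, delta):
        # delta is a duration in milliseconds
        self.sent.append((stat, delta))


client = RecordingStatsClient()
elapsed_ms = 12.75

# With the cast, the fractional part is dropped before the client
# ever sees the value.
client.timing("example.truncated", int(elapsed_ms))

# Without the cast, the full precision reaches the client.
client.timing("example.precise", elapsed_ms)
```
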
I also noticed a place where we were writing a monotonic timestamp
value in a JSON serialized string to ZK. I do not believe this value
is currently being used, so there is no further error to correct.
However, we should not use time.monotonic() for values that are
serialized, since the reference clock will be different on different
systems.
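
A minimal sketch of the distinction, assuming a hypothetical
serialize_event() helper and payload layout (neither is from the
Zuul code): wall-clock time goes into anything serialized, while
time.monotonic() is kept for durations measured within one process.

```python
import json
import time


def serialize_event(name):
    # time.time() is relative to the epoch, so another host can
    # interpret it (subject to clock synchronization).
    return json.dumps({"name": name, "timestamp": time.time()})


def local_elapsed(func):
    # time.monotonic() has an undefined reference point; only the
    # difference between two readings in the same process is meaningful.
    start = time.monotonic()
    func()
    return time.monotonic() - start


payload = serialize_event("example")
restored = json.loads(payload)
age = time.time() - restored["timestamp"]
elapsed = local_elapsed(lambda: time.sleep(0.01))
```
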
Several new attributes are added to the QueueItem and Build classes
in a backwards-compatible way, so no model API schema upgrade is
needed. The code sites where they are used protect against the null
values which will occur in a mixed-version cluster (the components
will just not emit these stats in those cases).
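
The guard pattern described above might look roughly like the
following. The Item class, its attribute names, the metric name, and
FakeStats are all hypothetical illustrations, not the actual
QueueItem or Build implementation.

```python
class FakeStats:
    """Hypothetical stand-in for a statsd client."""

    def __init__(self):
        self.sent = []

    def timing(self, stat, delta):
        self.sent.append((stat, delta))


class Item:
    """Hypothetical stand-in for an object deserialized from ZK."""

    def __init__(self, data):
        # .get() yields None when the data was written by an older
        # component that does not know about these attributes.
        self.event_time = data.get("event_time")
        self.first_job_start_time = data.get("first_job_start_time")


def report_event_job_time(stats, item):
    # Skip the stat entirely if either timestamp is missing.
    if item.event_time is None or item.first_job_start_time is None:
        return False
    elapsed_ms = (item.first_job_start_time - item.event_time) * 1000
    stats.timing("example.event_job_time", elapsed_ms)
    return True


stats = FakeStats()
new_item = Item({"event_time": 100.0, "first_job_start_time": 102.5})
old_item = Item({})  # as written by an older component

reported_new = report_event_job_time(stats, new_item)
reported_old = report_event_job_time(stats, old_item)
```
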
Change-Id: Iaacbef7fa2ed93bfc398a118c5e8cfbc0a67b846
|