author     James E. Blair <jim@acmegating.com>    2022-07-06 10:56:30 -0700
committer  James E. Blair <jim@acmegating.com>    2022-07-21 14:21:02 -0700
commit     49abc4255e211c6987d714c6e6089980c6c703cb (patch)
tree       4febbbf28ee315bb8a2ab3cbccd38e88fb760800 /TESTING.rst
parent     78b14ec3c196e7533ac2c72d95fba09c936e625a (diff)
download   zuul-49abc4255e211c6987d714c6e6089980c6c703cb.tar.gz
Apply timer trigger jitter to project-branches
Currently the timer trigger accepts an optional "jitter" specification
which can delay the start of a pipeline timer trigger by up to a
certain number of seconds. It applies uniformly to every project-branch
that participates in the pipeline. For example, if a periodic pipeline
with nova and glance is configured to trigger at midnight, and has
a jitter of 30 seconds, then the master and stable branches of nova and
glance will all be enqueued at the same time (perhaps 00:00:17).
This provides some utility: if other systems are configured to do things
around midnight, this pipeline may not join a thundering herd with them;
and if there are many periodic pipelines configured for midnight (perhaps
across different tenants, or just with slightly different purposes), they
won't form a thundering herd with each other.
But to the extent that jobs within a given pipeline might want to avoid
a thundering herd with other similar jobs in the same pipeline, it offers
no relief. While Zuul may be able to handle it (especially since multiple
schedulers allow other pipelines to continue to operate), these jobs
may interact with remote systems which would appreciate not being DoS'd.
To alleviate this, we change the jitter from applying to the pipeline
as a whole to individual project-branches. To be clear, it is still the
case that the pipeline has only a single configured trigger time (this
change does not allow projects to configure their own triggers). But
instead of firing a single event for the entire pipeline, we will fire
a unique event for every project-branch in that pipeline, and these
events will have the jitter applied to them individually. So in our
example above, nova@master might fire at 00:00:05, nova@stable/zulu
may fire at 00:00:07, glance@master at 00:00:13, etc.
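
As an illustrative sketch (the names and numbers here are made up, not
Zuul's actual internals), the difference amounts to drawing one random
offset per project-branch instead of a single offset for the whole
pipeline:

  import random

  jitter = 30  # configured maximum jitter in seconds
  project_branches = ["nova@master", "nova@stable/zulu", "glance@master"]

  # Old behavior: one offset delays the single pipeline-wide event.
  pipeline_offset = random.uniform(0, jitter)
  old_delays = {pb: pipeline_offset for pb in project_branches}

  # New behavior: each project-branch gets its own event with its own
  # independently drawn offset.
  new_delays = {pb: random.uniform(0, jitter) for pb in project_branches}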
This behavior is similar enough in spirit to the current behavior that
we can consider it a minor implementation change, and it doesn't require
any new configuration options, feature flags, deprecation notice, etc.
The documentation is updated to describe the new behavior, as well as
correct an error in the description of jitter (it only delays events; it
does not advance them).
We currently add a single job to APScheduler for every timer triggered
pipeline in every tenant (so the number of jobs is the sum of the
periodic pipelines in every tenant). OpenDev for example may have on
the order of 20 APScheduler jobs. With the new approach, we will
enqueue a job for each project-branch in a periodic pipeline. For a
system like OpenDev, that could potentially be thousands of jobs.
In reality, based on current configuration and pipeline participation,
it should be 176.
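
A rough sketch of that shape using APScheduler 3.x (the callback and
the project-branch list are placeholders, and APScheduler's CronTrigger
jitter option is used here for brevity; the actual change may apply the
jitter differently):

  from apscheduler.schedulers.background import BackgroundScheduler
  from apscheduler.triggers.cron import CronTrigger

  def fire_timer_event(project, branch):
      # Placeholder for enqueuing a Zuul trigger event for one
      # project-branch.
      print(f"periodic event for {project}@{branch}")

  scheduler = BackgroundScheduler()
  scheduler.start()

  # One APScheduler job per project-branch instead of one per pipeline;
  # each job's midnight fire time is jittered by up to 30 seconds.
  for project, branch in [("nova", "master"), ("nova", "stable/zulu"),
                          ("glance", "master")]:
      scheduler.add_job(fire_timer_event,
                        CronTrigger(hour=0, minute=0, jitter=30),
                        args=(project, branch))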
Even though it will result in the same number of Zuul trigger events,
there is overhead to having more APScheduler jobs. To characterize
this, I performed a benchmark where I added a certain number of
APScheduler jobs with the same trigger time (and no jitter) and
recorded the amount of time needed to add the jobs and also, once the
jobs began firing, the elapsed time from the first to the last job.
This should characterize the additional overhead the scheduler will
encounter with this change.
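
The benchmark script itself is not part of this change; a rough
reconstruction along the lines described above might look like this
(assuming APScheduler 3.x, with misfire_grace_time relaxed so that late
executions are not skipped):

  import time
  from datetime import datetime, timedelta
  from apscheduler.schedulers.background import BackgroundScheduler

  N = 10000
  fire_times = []

  def job():
      fire_times.append(time.monotonic())

  # misfire_grace_time=None means jobs are never discarded for running
  # late, which matters when thousands share one trigger time.
  scheduler = BackgroundScheduler(
      job_defaults={'misfire_grace_time': None})
  scheduler.start()

  run_date = datetime.now() + timedelta(seconds=5)
  start = time.monotonic()
  for _ in range(N):
      scheduler.add_job(job, 'date', run_date=run_date)
  print("time to add jobs:", time.monotonic() - start)

  time.sleep(30)  # crude wait for all jobs to finish firing
  print("last - first:", max(fire_times) - min(fire_times))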
Time needed to add jobs to APScheduler (seconds)
1: 0.00014448165893554688
10: 0.0009338855743408203
100: 0.00925445556640625
1000: 0.09204769134521484
10000: 0.9236903190612793
100000: 11.758053541183472
1000000: 223.83168983459473
Time to run jobs (last-first in seconds)
1: 2.384185791015625e-06
10: 0.006863832473754883
100: 0.09936022758483887
1000: 0.22670435905456543
10000: 1.517075777053833
100000: 19.97287678718567
1000000: 399.24730825424194
Given that this operates primarily at the tenant level (when a tenant
reconfiguration happens, jobs need to be removed and added), I think
it makes sense to consider up to 10,000 jobs a reasonable high end.
It looks like we can go a little past that (between 10,000 and 100,000)
while still seeing something like a linear increase. As we approach
1,000,000 jobs it starts looking more polynomial and I would not consider
the performance to be acceptable. But 100,000 is already an unlikely
number, so I think this level of performance is okay within the likely
range of jobs.
The default executor used by APScheduler is a standard Python
ThreadPoolExecutor with a maximum of 10 simultaneous workers. This
will cause us to fire up to 10 Zuul events simultaneously (whereas
before we were only likely to fire simultaneous events if multiple
tenants had identical pipeline timer triggers). This could result in
more load on the connection sources and the change cache as they
update the branch tips in the change cache. It seems plausible that
10 simultaneous events is something that the sources and ZK can
handle. If not, we can reduce the granularity of the lock we use to
prevent updating the same project at the same time (to perhaps a
single lock for all projects), or construct the APScheduler with a
lower number of max_workers.
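
For illustration, lowering that concurrency would be a small change
when constructing the scheduler; a sketch (the value of 2 is
arbitrary):

  from apscheduler.executors.pool import ThreadPoolExecutor
  from apscheduler.schedulers.background import BackgroundScheduler

  # Replace the default 10-thread executor with a smaller one so that
  # fewer timer events can fire at the same moment.
  scheduler = BackgroundScheduler(
      executors={'default': ThreadPoolExecutor(max_workers=2)})
  scheduler.start()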
Change-Id: I27fc23763da81273eb135e14cd1d0bd95964fd16