author     James E. Blair <jim@acmegating.com>    2022-07-06 10:56:30 -0700
committer  James E. Blair <jim@acmegating.com>    2022-07-21 14:21:02 -0700
commit     49abc4255e211c6987d714c6e6089980c6c703cb (patch)
tree       4febbbf28ee315bb8a2ab3cbccd38e88fb760800 /TESTING.rst
parent     78b14ec3c196e7533ac2c72d95fba09c936e625a (diff)
download   zuul-49abc4255e211c6987d714c6e6089980c6c703cb.tar.gz
Apply timer trigger jitter to project-branches
Currently the timer trigger accepts an optional "jitter" specification
which can delay the start of a pipeline timer trigger by up to a
certain number of seconds. It applies uniformly to every project-branch
that participates in the pipeline. For example, if a periodic pipeline
with nova and glance is configured to trigger at midnight, and has
a jitter of 30 seconds, then the master and stable branches of nova and
glance will all be enqueued at the same time (perhaps 00:00:17).
This provides some utility: if other systems are configured to do things
around midnight, this pipeline may not join a thundering herd with them;
and if there are many periodic pipelines configured for midnight (perhaps
across different tenants, or just with slightly different purposes), they
won't form a thundering herd with each other.
But to the extent that jobs within a given pipeline might want to avoid
a thundering herd with other similar jobs in the same pipeline, it offers
no relief. While Zuul may be able to handle it (especially since multiple
schedulers allow other pipelines to continue to operate), these jobs
may interact with remote systems which would appreciate not being DoS'd.
To alleviate this, we change the jitter from applying to the pipeline
as a whole to individual project-branches. To be clear, it is still the
case that the pipeline has only a single configured trigger time (this
change does not allow projects to configure their own triggers). But
instead of firing a single event for the entire pipeline, we will fire
a unique event for every project-branch in that pipeline, and these
events will have the jitter applied to them individually. So in our
example above, nova@master might fire at 00:00:05, nova@stable/zulu
may fire at 00:00:07, glance@master at 00:00:13, etc.
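
As an illustrative sketch (the names and numbers here are made up, not
Zuul's actual internals), the difference amounts to drawing one random
offset per project-branch instead of a single offset for the whole
pipeline:

  import random

  jitter = 30  # configured maximum jitter in seconds
  project_branches = ["nova@master", "nova@stable/zulu", "glance@master"]

  # Old behavior: one offset delays the single pipeline-wide event.
  pipeline_offset = random.uniform(0, jitter)
  old_delays = {pb: pipeline_offset for pb in project_branches}

  # New behavior: each project-branch gets its own event with its own
  # independently drawn offset.
  new_delays = {pb: random.uniform(0, jitter) for pb in project_branches}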
This behavior is similar enough in spirit to the current behavior that
we can consider it a minor implementation change, and it doesn't require
any new configuration options, feature flags, deprecation notice, etc.
The documentation is updated to describe the new behavior, as well as
correct an error in the description of jitter (it only delays events; it
does not advance them).
We currently add a single job to APScheduler for every timer triggered
pipeline in every tenant (so the number of jobs is the sum of the
periodic pipelines in every tenant). OpenDev for example may have on
the order of 20 APScheduler jobs. With the new approach, we will
enqueue a job for each project-branch in a periodic pipeline. For a
system like OpenDev, that could potentially be thousands of jobs.
In reality, based on current configuration and pipeline participation,
it should be 176.
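
A rough sketch of that shape using APScheduler 3.x (the callback and
the project-branch list are placeholders, and APScheduler's CronTrigger
jitter option is used here for brevity; the actual change may apply the
jitter differently):

  from apscheduler.schedulers.background import BackgroundScheduler
  from apscheduler.triggers.cron import CronTrigger

  def fire_timer_event(project, branch):
      # Placeholder for enqueuing a Zuul trigger event for one
      # project-branch.
      print(f"periodic event for {project}@{branch}")

  scheduler = BackgroundScheduler()
  scheduler.start()

  # One APScheduler job per project-branch instead of one per pipeline;
  # each job's midnight fire time is jittered by up to 30 seconds.
  for project, branch in [("nova", "master"), ("nova", "stable/zulu"),
                          ("glance", "master")]:
      scheduler.add_job(fire_timer_event,
                        CronTrigger(hour=0, minute=0, jitter=30),
                        args=(project, branch))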
Even though it will result in the same number of Zuul trigger events,
there is overhead to having more APScheduler jobs. To characterize
this, I performed a benchmark where I added a certain number of
APScheduler jobs with the same trigger time (and no jitter) and
recorded the amount of time needed to add the jobs and also, once the
jobs began firing, the elapsed time from the first to the last job.
This should characterize the additional overhead the scheduler will
encounter with this change.
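
The benchmark script itself is not part of this change; a rough
reconstruction along the lines described above might look like this
(assuming APScheduler 3.x, with misfire_grace_time relaxed so that late
executions are not skipped):

  import time
  from datetime import datetime, timedelta
  from apscheduler.schedulers.background import BackgroundScheduler

  N = 10000
  fire_times = []

  def job():
      fire_times.append(time.monotonic())

  # misfire_grace_time=None means jobs are never discarded for running
  # late, which matters when thousands share one trigger time.
  scheduler = BackgroundScheduler(
      job_defaults={'misfire_grace_time': None})
  scheduler.start()

  run_date = datetime.now() + timedelta(seconds=5)
  start = time.monotonic()
  for _ in range(N):
      scheduler.add_job(job, 'date', run_date=run_date)
  print("time to add jobs:", time.monotonic() - start)

  time.sleep(30)  # crude wait for all jobs to finish firing
  print("last - first:", max(fire_times) - min(fire_times))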
Time needed to add jobs to APScheduler (seconds)
1: 0.00014448165893554688
10: 0.0009338855743408203
100: 0.00925445556640625
1000: 0.09204769134521484
10000: 0.9236903190612793
100000: 11.758053541183472
1000000: 223.83168983459473
Time to run jobs (last-first in seconds)
1: 2.384185791015625e-06
10: 0.006863832473754883
100: 0.09936022758483887
1000: 0.22670435905456543
10000: 1.517075777053833
100000: 19.97287678718567
1000000: 399.24730825424194
Given that this operates primarily at the tenant level (when a tenant
reconfiguration happens, jobs need to be removed and added), I think
it makes sense to consider up to 10,000 jobs a reasonable high end.
It looks like we can go a little past that (between 10,000 and 100,000)
while still seeing something like a linear increase. As we approach
1,000,000 jobs it starts looking more polynomial and I would not consider
the performance to be acceptable. But 100,000 is already an unlikely
number, so I think this level of performance is okay within the likely
range of jobs.
The default executor used by APScheduler is a standard Python
ThreadPoolExecutor with a maximum of 10 simultaneous workers. This
will cause us to fire up to 10 Zuul events simultaneously (whereas
before we were only likely to fire simultaneous events if multiple
tenants had identical pipeline timer triggers). This could result in
more load on the connection sources and the change cache as they
update the branch tips in the change cache. It seems plausible that
10 simultaneous events is something that the sources and ZK can
handle. If not, we can reduce the granularity of the lock we use to
prevent updating the same project at the same time (to perhaps a
single lock for all projects), or construct the APScheduler with a
lower number of max_workers.
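
For illustration, lowering that concurrency would be a small change
when constructing the scheduler; a sketch (the value of 2 is
arbitrary):

  from apscheduler.executors.pool import ThreadPoolExecutor
  from apscheduler.schedulers.background import BackgroundScheduler

  # Replace the default 10-thread executor with a smaller one so that
  # fewer timer events can fire at the same moment.
  scheduler = BackgroundScheduler(
      executors={'default': ThreadPoolExecutor(max_workers=2)})
  scheduler.start()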
Change-Id: I27fc23763da81273eb135e14cd1d0bd95964fd16