From 965f5f04cb6a105764c537e3a213d72fbdc71548 Mon Sep 17 00:00:00 2001
From: Lars Wirzenius
Date: Thu, 26 Jun 2014 15:17:29 +0000
Subject: Add yarn tests for removing ghost jobs

---
 yarns.webapp/040-running-jobs.yarn | 79 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)

(limited to 'yarns.webapp')

diff --git a/yarns.webapp/040-running-jobs.yarn b/yarns.webapp/040-running-jobs.yarn
index 879d9fa..571afd6 100644
--- a/yarns.webapp/040-running-jobs.yarn
+++ b/yarns.webapp/040-running-jobs.yarn
@@ -237,6 +237,85 @@ Cleanup.
 
     FINALLY WEBAPP terminates
 
+
+Forget jobs whose MINION is gone
+--------------------------------
+
+A job's status is updated when a MINION uses the `/1.0/job-update`
+call, and when the MINION uses that to report that the job has
+finished, the STATEDB is updated accordingly. However, sometimes the
+MINION never tells WEBAPP that the job is finished. This can happen
+for a variety of reasons, such as (not limited to these):
+
+* MINION crashes.
+* WEBAPP is unavailable.
+* The host reboots, killing MINION and WEBAPP both.
+
+If this happens, STATEDB still marks the job as running, and WEBAPP
+won't start a new job for that lorry specification.
+
+To deal with this, we need a way to clean up such "ghost jobs".
+We do this with the `/1.0/remove-ghost-jobs` API call,
+which marks as finished all jobs that haven't had a `job-update`
+called on them for a long time.
+
+    SCENARIO forget jobs without MINION updates in a long time
+
+Set up a WEBAPP that uses a CONFGIT with a Lorry file, so we can start
+a job.
+
+    GIVEN a new git repository in CONFGIT
+    AND an empty lorry-controller.conf in CONFGIT
+    AND lorry-controller.conf in CONFGIT adds lorries *.lorry using prefix upstream
+    AND Lorry file CONFGIT/foo.lorry with {"foo":{"type":"git","url":"git://foo"}}
+    AND WEBAPP uses CONFGIT as its configuration directory
+    AND a running WEBAPP
+
+Pretend it is a known time (specifically, the beginning of the epoch).
+This is needed so we can trigger the ghost job timeout later.
+
+    WHEN admin makes request POST /1.0/pretend-time with now=0
+
+Tell WEBAPP to read the configuration.
+
+    WHEN admin makes request POST /1.0/read-configuration
+
+Start a new job.
+
+    WHEN admin makes request POST /1.0/give-me-job with host=testhost&pid=123
+    THEN response has job_id set to 1
+
+Verify that the job is in the list of running jobs.
+
+    WHEN admin makes request GET /1.0/list-running-jobs
+    THEN response has running_jobs set to [1]
+
+Remove any ghosts. There aren't any yet, so nothing should be removed.
+
+    WHEN admin makes request POST /1.0/remove-ghost-jobs
+    AND admin makes request GET /1.0/list-running-jobs
+    THEN response has running_jobs set to [1]
+
+Now, pretend a long time has passed, and clean up the ghost job. The
+default value for the ghost timeout is reasonably short (less than a
+day), so we pretend it is about 11.5 days later (one million seconds).
+
+    WHEN admin makes request POST /1.0/pretend-time with now=1000000
+    AND admin makes request POST /1.0/remove-ghost-jobs
+    AND admin makes request GET /1.0/list-running-jobs
+    THEN response has running_jobs set to []
+
+Further, if we request a new job now, we'll get one for the same
+lorry specification.
+
+    WHEN admin makes request POST /1.0/give-me-job with host=testhost&pid=123
+    THEN response has job_id set to 2
+    AND response has path set to "upstream/foo"
+
+Finally, clean up.
+
+    FINALLY WEBAPP terminates
+
 Remove a terminated job
 -----------------------
 
-- 
cgit v1.2.1
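
The timeout rule that the scenario above exercises can be sketched in Python as follows. This is only an illustration of the idea, not the WEBAPP implementation: the `Job` class, the `remove_ghost_jobs` and `list_running_jobs` helpers, and the 12-hour `GHOST_TIMEOUT` are all assumptions made for the example (the scenario says only that the default timeout is less than a day).

```python
from dataclasses import dataclass

# Hypothetical value: the scenario only states that the default ghost
# timeout is shorter than one day.
GHOST_TIMEOUT = 12 * 60 * 60


@dataclass
class Job:
    """Illustrative stand-in for a job row in STATEDB (assumed fields)."""
    job_id: int
    path: str
    updated: int          # time of the last job-update, seconds since epoch
    running: bool = True


def remove_ghost_jobs(jobs, now, timeout=GHOST_TIMEOUT):
    """Mark running jobs as finished if they have had no update for
    more than `timeout` seconds. Returns the ids of the ghosts found."""
    ghosts = []
    for job in jobs:
        if job.running and now - job.updated > timeout:
            job.running = False
            ghosts.append(job.job_id)
    return ghosts


def list_running_jobs(jobs):
    """Return the ids of jobs still marked as running."""
    return [job.job_id for job in jobs if job.running]


# Mirror the scenario: a job starts at time 0, ghost removal at time 0
# finds nothing, and ghost removal one million seconds later forgets it.
jobs = [Job(job_id=1, path="upstream/foo", updated=0)]
first = remove_ghost_jobs(jobs, now=0)
running_before = list_running_jobs(jobs)
second = remove_ghost_jobs(jobs, now=1000000)
running_after = list_running_jobs(jobs)
```

Passing `now` in explicitly plays the same role as `/1.0/pretend-time` in the scenario: the check stays deterministic because it never reads the real clock.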