Running jobs
============

This chapter contains tests that verify that WEBAPP schedules jobs,
accepts job output, and lets the admin kill running jobs.

Run a job successfully
----------------------

Initially, with an empty run-queue, nothing should be scheduled.

    SCENARIO run a job
    GIVEN a new git repository in CONFGIT
    AND an empty lorry-controller.conf in CONFGIT
    AND lorry-controller.conf in CONFGIT adds lorries *.lorry using prefix upstream
    AND WEBAPP uses CONFGIT as its configuration directory
    AND a running WEBAPP

We stop the queue first.

    WHEN admin makes request POST /1.0/stop-queue

Then make sure we don't get a job when we request one.

    WHEN admin makes request POST /1.0/give-me-job with host=testhost&pid=123
    THEN response has job_id set to null

    WHEN admin makes request GET /1.0/list-running-jobs
    THEN response has running_jobs set to []
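
For readers who want to try the same calls outside the test suite, the
requests can be built with the Python standard library. This is only a
sketch: the base URL and the helper name `build_request` are assumptions
made for illustration, and the request is constructed but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical address of a running WEBAPP; adjust to your setup.
BASE_URL = "http://localhost:12765"


def build_request(method, path, **params):
    """Build (but do not send) a request like the ones in the scenario."""
    # Form-encode the parameters, e.g. host=testhost&pid=123.
    body = urlencode(params).encode("ascii") if params else None
    return Request(BASE_URL + path, data=body, method=method)


req = build_request("POST", "/1.0/give-me-job", host="testhost", pid=123)
```

Passing `req` to `urllib.request.urlopen` would send it. The scenario
steps suggest the reply is a JSON object with fields such as `job_id`.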

Add a Lorry spec to the run-queue, and check that it looks OK.

    GIVEN Lorry file CONFGIT/foo.lorry with {"foo":{"type":"git","url":"git://foo"}}

    WHEN admin makes request POST /1.0/read-configuration
    AND admin makes request GET /1.0/lorry/upstream/foo
    THEN response has jobs set to []

Request a job. We still shouldn't get a job, since the queue isn't set
to run yet.

    WHEN admin makes request POST /1.0/give-me-job with host=testhost&pid=123
    THEN response has job_id set to null

Enable the queue, and off we go.

    WHEN admin makes request POST /1.0/start-queue
    AND admin makes request POST /1.0/give-me-job with host=testhost&pid=123
    THEN response has job_id set to 1
    AND response has path set to "upstream/foo"

    WHEN admin makes request GET /1.0/lorry/upstream/foo
    THEN response has running_job set to 1
    AND response has jobs set to [1]

    WHEN admin makes request GET /1.0/list-running-jobs
    THEN response has running_jobs set to [1]

Requesting another job should now again return null.

    WHEN admin makes request POST /1.0/give-me-job with host=testhost&pid=123
    THEN response has job_id set to null
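
The scheduling rules checked so far can be summarised with a toy
in-memory model. It is not WEBAPP's implementation: the class name
`ToyQueue` and its attributes are made up for this sketch, which only
mirrors the behaviour the steps above assert (a stopped queue hands out
nothing, and a lorry with a running job is not scheduled again).

```python
class ToyQueue:
    """In-memory stand-in for the scheduling rules the scenario checks."""

    def __init__(self, lorries):
        self.lorries = list(lorries)  # lorry paths, e.g. "upstream/foo"
        self.queue_running = True
        self.running = {}             # lorry path -> running job id
        self.next_job_id = 1

    def give_me_job(self):
        # A stopped queue never hands out work.
        if not self.queue_running:
            return {"job_id": None}
        for path in self.lorries:
            # At most one running job per lorry.
            if path not in self.running:
                job_id = self.next_job_id
                self.next_job_id += 1
                self.running[path] = job_id
                return {"job_id": job_id, "path": path}
        return {"job_id": None}


q = ToyQueue(["upstream/foo"])
q.queue_running = False
stopped = q.give_me_job()   # queue stopped: no job
q.queue_running = True
first = q.give_me_job()     # job 1 for upstream/foo
second = q.give_me_job()    # lorry already busy: no job
```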

Inform WEBAPP the job is finished.

    WHEN MINION makes request POST /1.0/job-update with job_id=1&exit=0
    THEN response has kill set to false
    WHEN admin makes request GET /1.0/lorry/upstream/foo
    THEN response has running_job set to null
    AND response has jobs set to [1]
    AND response has failed_jobs set to []
    WHEN admin makes request GET /1.0/list-running-jobs
    THEN response has running_jobs set to []

Cleanup.

    FINALLY WEBAPP terminates


Run a job that fails
--------------------

Lorry Controller needs to be able to deal with jobs that fail. It also
needs to be able to list them correctly to the user.

    SCENARIO run a job that fails
    GIVEN a new git repository in CONFGIT
    AND an empty lorry-controller.conf in CONFGIT
    AND lorry-controller.conf in CONFGIT adds lorries *.lorry using prefix upstream
    AND WEBAPP uses CONFGIT as its configuration directory
    AND a running WEBAPP
    AND Lorry file CONFGIT/foo.lorry with {"foo":{"type":"git","url":"git://foo"}}
    WHEN admin makes request POST /1.0/read-configuration
    AND admin makes request POST /1.0/start-queue

Initially, the lorry spec should have no jobs or failed jobs listed.

    WHEN admin makes request GET /1.0/lorry/upstream/foo
    THEN response has jobs set to []
    AND response has failed_jobs set to []

MINION requests a job.

    WHEN MINION makes request POST /1.0/give-me-job with host=testhost&pid=123
    THEN response has job_id set to 1
    AND response has path set to "upstream/foo"

Now MINION updates WEBAPP about the job, indicating that it has
failed. Admin will then see that the lorry spec lists the job in
failed jobs.

    WHEN MINION makes request POST /1.0/job-update with job_id=1&exit=1
    AND admin makes request GET /1.0/lorry/upstream/foo
    THEN response has jobs set to [1]
    AND response has failed_jobs set to [1]
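
As a rough illustration of the bookkeeping this scenario expects, the
sketch below applies a form-encoded `job-update` body to a toy lorry
record. The function name and record layout are assumptions. Reading
`exit=no` as "still running" follows how later scenarios in this
chapter use it.

```python
from urllib.parse import parse_qs


def apply_job_update(lorry, body):
    """Apply a form-encoded job-update body to a toy lorry record."""
    fields = parse_qs(body)
    job_id = int(fields["job_id"][0])
    exit_field = fields["exit"][0]
    if exit_field == "no":
        return lorry  # job has not exited yet; nothing to record
    lorry["running_job"] = None
    if int(exit_field) != 0:
        # A non-zero exit code marks the job as failed.
        lorry["failed_jobs"].append(job_id)
    return lorry


failed = apply_job_update(
    {"running_job": 1, "jobs": [1], "failed_jobs": []}, "job_id=1&exit=1")
ok = apply_job_update(
    {"running_job": 1, "jobs": [1], "failed_jobs": []}, "job_id=1&exit=0")
```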

Cleanup.

    FINALLY WEBAPP terminates



Limit number of jobs running at the same time
---------------------------------------------

WEBAPP can be told to limit the number of jobs running at the same
time.

Set things up. Note that we have two local Lorry files, so that we
could, in principle, run two jobs at the same time.

    SCENARIO limit concurrent jobs
    GIVEN a new git repository in CONFGIT
    AND an empty lorry-controller.conf in CONFGIT
    AND lorry-controller.conf in CONFGIT adds lorries *.lorry using prefix upstream
    AND Lorry file CONFGIT/foo.lorry with {"foo":{"type":"git","url":"git://foo"}}
    AND Lorry file CONFGIT/bar.lorry with {"bar":{"type":"git","url":"git://bar"}}
    AND WEBAPP uses CONFGIT as its configuration directory
    AND a running WEBAPP
    WHEN admin makes request POST /1.0/read-configuration

Check the current value of the `max_jobs` setting.

    WHEN admin makes request GET /1.0/get-max-jobs
    THEN response has max_jobs set to null

Set the limit to 1.

    WHEN admin makes request POST /1.0/set-max-jobs with max_jobs=1
    THEN response has max_jobs set to 1
    WHEN admin makes request GET /1.0/get-max-jobs
    THEN response has max_jobs set to 1

Get a job. This should succeed.

    WHEN MINION makes request POST /1.0/give-me-job with host=testhost&pid=1
    THEN response has job_id set to 1

Get a second job. This should not succeed.

    WHEN MINION makes request POST /1.0/give-me-job with host=testhost&pid=2
    THEN response has job_id set to null
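
The limit check itself could be as simple as the sketch below. The
function name is hypothetical, and treating a `max_jobs` of `None`
(JSON `null`) as "no limit" is an assumption based on the initial value
seen above.

```python
def may_start_job(running_jobs, max_jobs):
    """Return True if another job may start under the max_jobs limit."""
    # None mirrors the initial "max_jobs set to null": assumed unlimited.
    if max_jobs is None:
        return True
    return len(running_jobs) < max_jobs


first_allowed = may_start_job([], 1)    # nothing running yet
second_allowed = may_start_job([1], 1)  # limit of 1 already reached
```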

Finish the first job. Then get a new job. This should succeed.

    WHEN MINION makes request POST /1.0/job-update with job_id=1&exit=0
    AND MINION makes request POST /1.0/give-me-job with host=testhost&pid=2
    THEN response has job_id set to 2

Cleanup.

    FINALLY WEBAPP terminates

Stop job in the middle
----------------------

We need to be able to stop jobs while they're running as well. We
start by setting up everything so that a job is running, the same way
we did for the successful job scenario.

    SCENARIO stop a job while it's running
    GIVEN a new git repository in CONFGIT
    AND an empty lorry-controller.conf in CONFGIT
    AND lorry-controller.conf in CONFGIT adds lorries *.lorry using prefix upstream
    AND WEBAPP uses CONFGIT as its configuration directory
    AND a running WEBAPP
    AND Lorry file CONFGIT/foo.lorry with {"foo":{"type":"git","url":"git://foo"}}
    WHEN admin makes request POST /1.0/read-configuration
    AND admin makes request POST /1.0/start-queue
    AND admin makes request POST /1.0/give-me-job with host=testhost&pid=123
    THEN response has job_id set to 1
    AND response has path set to "upstream/foo"

Admin will now ask WEBAPP to kill the job. This only sets a field
in the STATEDB.

    WHEN admin makes request POST /1.0/stop-job with job_id=1
    THEN response has kill set to true

Now, when MINION updates the job, WEBAPP will tell MINION to kill it.
MINION will do so, and then update the job again.

    WHEN MINION makes request POST /1.0/job-update with job_id=1&exit=no
    THEN response has kill set to true
    WHEN MINION makes request POST /1.0/job-update with job_id=1&exit=1

Admin will now see that the job has, indeed, been killed.

    WHEN admin makes request GET /1.0/lorry/upstream/foo
    THEN response has running_job set to null

    WHEN admin makes request GET /1.0/list-running-jobs
    THEN response has running_jobs set to []

Check that the job can be run successfully again. In 2014, we found a
bug where a lorry that had ever been set to be killed would never run
successfully again.

    WHEN admin makes request POST /1.0/give-me-job with host=testhost&pid=123
    THEN response has job_id set to 2
    AND response has path set to "upstream/foo"
    WHEN MINION makes request POST /1.0/job-update with job_id=2&exit=no
    THEN response has kill set to false
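
The kill handshake can be pictured with a small model. Everything here
is hypothetical (class and method names included), and it is not a
description of how STATEDB stores the flag. It only shows one design
that would pass the regression check above: the kill flag belongs to
the job, so a new job for the same lorry starts with it cleared.

```python
class ToyStateDB:
    """Toy model of the kill flag; one flag per job id."""

    def __init__(self):
        self.kill_flags = {}  # job id -> bool

    def start_job(self, job_id):
        # Every new job starts with a clear kill flag.
        self.kill_flags[job_id] = False

    def stop_job(self, job_id):
        # stop-job only records the request; MINION does the killing.
        self.kill_flags[job_id] = True
        return {"kill": True}

    def job_update(self, job_id):
        # MINION learns about the kill request on its next update.
        return {"kill": self.kill_flags[job_id]}


db = ToyStateDB()
db.start_job(1)
db.stop_job(1)
killed = db.job_update(1)   # job 1 is told to die
db.start_job(2)
fresh = db.job_update(2)    # job 2 is unaffected
```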

Cleanup.

    FINALLY WEBAPP terminates

Stop a job that runs too long
-----------------------------

Sometimes a job gets "stuck" and should be killed. The
`lorry-controller.conf` has an optional `lorry-timeout` field for
this, to set the timeout, and WEBAPP will tell MINION to kill a job
when it has been running too long.

Some setup. Set the `lorry-timeout` to a known value. It doesn't
matter what it is since we'll be telling WEBAPP to fake its sense of
time, so that the test suite is not timing sensitive. We wouldn't want
to have the test suite fail when running on slow devices.

    SCENARIO stop stuck job
    GIVEN a new git repository in CONFGIT
    AND an empty lorry-controller.conf in CONFGIT
    AND lorry-controller.conf in CONFGIT adds lorries *.lorry using prefix upstream
    AND lorry-controller.conf in CONFGIT has lorry-timeout set to 1 for everything
    AND Lorry file CONFGIT/foo.lorry with {"foo":{"type":"git","url":"git://foo"}}
    AND WEBAPP uses CONFGIT as its configuration directory
    AND a running WEBAPP
    WHEN admin makes request POST /1.0/read-configuration

Pretend it is the start of time.

    WHEN admin makes request POST /1.0/pretend-time with now=0
    AND admin makes request GET /1.0/status
    THEN response has timestamp set to "1970-01-01 00:00:00 UTC"
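
The expected timestamp string is what the Python standard library
produces for zero seconds since the epoch in UTC. The helper name
below is made up, and whether WEBAPP formats its timestamp this way is
an assumption.

```python
import time


def format_timestamp(now):
    """Format seconds since the epoch the way the status check expects."""
    return time.strftime("%Y-%m-%d %H:%M:%S UTC", time.gmtime(now))


epoch = format_timestamp(0)
```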

Start the job.

    WHEN admin makes request POST /1.0/give-me-job with host=testhost&pid=123
    THEN response has job_id set to 1

Check that the job info contains a start time.

    WHEN admin makes request GET /1.0/job/1
    THEN response has job_started set

Pretend it is now much later, or at least later than the timeout specified.

    WHEN admin makes request POST /1.0/pretend-time with now=2

Pretend to be a MINION that reports an update on the job. WEBAPP
should now be telling us to kill the job.

    WHEN MINION makes request POST /1.0/job-update with job_id=1&exit=no
    THEN response has kill set to true
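
With the clock faked, the timeout decision reduces to simple
arithmetic. A minimal sketch, assuming a hypothetical `should_kill`
helper and a strict comparison (the scenario does not say what happens
exactly at the timeout):

```python
def should_kill(job_started, now, lorry_timeout):
    """Return True once a job has run longer than lorry_timeout seconds."""
    return now - job_started > lorry_timeout


# Values from the scenario: started at 0, timeout 1, pretend time 2.
stuck = should_kill(job_started=0, now=2, lorry_timeout=1)
```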

Kill the job, as requested.

    WHEN MINION makes request POST /1.0/job-update with job_id=1&exit=1

Verify we can run the job successfully after it has been killed once
by timeout. In 2014 we had a bug where this would not happen, because
a lorry that had ever been killed would never run successfully again.

    WHEN admin makes request POST /1.0/give-me-job with host=testhost&pid=123
    THEN response has job_id set to 2
    WHEN MINION makes request POST /1.0/job-update with job_id=2&exit=no
    THEN response has kill set to false

Cleanup.

    FINALLY WEBAPP terminates


Forget jobs whose MINION is gone
--------------------------------

A job's status is updated when a MINION uses the `/1.0/job-update`
call, and when the MINION uses that to report that the job has
finished, the STATEDB is updated accordingly. However, sometimes the
MINION never tells WEBAPP that the job is finished. This can happen
for a variety of reasons, such as (not limited to these):

* MINION crashes.
* WEBAPP is unavailable.
* The host reboots, killing MINION and WEBAPP both.

If this happens, STATEDB still marks the job as running, and WEBAPP
won't start a new job for that lorry specification.

To deal with this, we need a way to clean up such "ghost jobs". We do
this with the `/1.0/remove-ghost-jobs` API call, which marks as
finished all jobs that haven't had a `job-update` called on them for a
long time.

    SCENARIO forget jobs without MINION updates in a long time

Set up a WEBAPP that uses a CONFGIT with a Lorry file, so we can start
a job.

    GIVEN a new git repository in CONFGIT
    AND an empty lorry-controller.conf in CONFGIT
    AND lorry-controller.conf in CONFGIT adds lorries *.lorry using prefix upstream
    AND Lorry file CONFGIT/foo.lorry with {"foo":{"type":"git","url":"git://foo"}}
    AND WEBAPP uses CONFGIT as its configuration directory
    AND a running WEBAPP

Pretend it is a known time (specifically, the beginning of the epoch).
This is needed so we can trigger the ghost job timeout later.

    WHEN admin makes request POST /1.0/pretend-time with now=0

Tell WEBAPP to read the configuration.

    WHEN admin makes request POST /1.0/read-configuration

Start a new job.

    WHEN admin makes request POST /1.0/give-me-job with host=testhost&pid=123
    THEN response has job_id set to 1

Verify that the job is in the list of running jobs.

    WHEN admin makes request GET /1.0/list-running-jobs
    THEN response has running_jobs set to [1]

Remove any ghosts. There aren't any yet, so nothing should be removed.

    WHEN admin makes request POST /1.0/remove-ghost-jobs
    AND admin makes request GET /1.0/list-running-jobs
    THEN response has running_jobs set to [1]

Now, pretend a long time has passed, and clean up the ghost job. The
default value for the ghost timeout is reasonably short (less than a
day), so we pretend it is about 11.5 days later (one million seconds).

    WHEN admin makes request POST /1.0/pretend-time with now=1000000
    AND admin makes request POST /1.0/remove-ghost-jobs
    AND admin makes request GET /1.0/list-running-jobs
    THEN response has running_jobs set to []
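
The ghost check can be sketched as a filter on the time of each job's
last update. The names are hypothetical, and the 12 hour timeout is an
assumed value chosen only because it is less than a day. One million
seconds is roughly 11.6 days, far beyond any such timeout.

```python
# Assumed ghost timeout in seconds; the real default is not given here.
GHOST_TIMEOUT = 12 * 60 * 60


def remove_ghost_jobs(last_update, now, ghost_timeout=GHOST_TIMEOUT):
    """Split jobs into (still_running, ghosts).

    last_update maps job id -> time of the last job-update (or start).
    """
    ghosts = sorted(
        job_id for job_id, seen in last_update.items()
        if now - seen > ghost_timeout)
    running = sorted(job_id for job_id in last_update if job_id not in ghosts)
    return running, ghosts


at_start = remove_ghost_jobs({1: 0}, now=0)
much_later = remove_ghost_jobs({1: 0}, now=1000000)
```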

Further, if we request a new job now, we'll get one for the same
lorry specification.

    WHEN admin makes request POST /1.0/give-me-job with host=testhost&pid=123
    THEN response has job_id set to 2
    AND response has path set to "upstream/foo"

Finally, clean up.

    FINALLY WEBAPP terminates

Remove a terminated job
-----------------------

WEBAPP doesn't remove jobs automatically; it needs to be told to
remove them.

    SCENARIO remove job

Setup.

    GIVEN a new git repository in CONFGIT
    AND an empty lorry-controller.conf in CONFGIT
    AND lorry-controller.conf in CONFGIT adds lorries *.lorry using prefix upstream
    AND WEBAPP uses CONFGIT as its configuration directory
    AND a running WEBAPP
    GIVEN Lorry file CONFGIT/foo.lorry with {"foo":{"type":"git","url":"git://foo"}}
    WHEN admin makes request POST /1.0/read-configuration

Start job 1.

    WHEN admin makes request POST /1.0/give-me-job with host=testhost&pid=123
    THEN response has job_id set to 1

Try to remove job 1 while it is running. This should fail.

    WHEN admin makes request POST /1.0/remove-job with job_id=1
    THEN response has reason set to "still running"

Finish the job.

    WHEN MINION makes request POST /1.0/job-update with job_id=1&exit=0
    WHEN admin makes request GET /1.0/list-jobs
    THEN response has job_ids set to [1]

Remove it.

    WHEN admin makes request POST /1.0/remove-job with job_id=1
    AND admin makes request GET /1.0/list-jobs
    THEN response has job_ids set to []
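
The rule exercised above (refuse to remove a running job, remove a
finished one) might look like this in a toy model. The function name
and the shape of the success reply are placeholders. Only the
"still running" reason comes from the scenario.

```python
def remove_job(jobs, job_id):
    """Remove a finished job from jobs (job id -> exit code or None).

    A value of None means the job is still running.
    """
    if jobs[job_id] is None:
        return {"job_id": job_id, "reason": "still running"}
    del jobs[job_id]
    # Placeholder success reply; the real reply is not shown here.
    return {"job_id": job_id, "reason": None}


jobs = {1: None}              # job 1 is running
refused = remove_job(jobs, 1)
jobs[1] = 0                   # job 1 finishes with exit code 0
removed = remove_job(jobs, 1)
```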

Cleanup.

    FINALLY WEBAPP terminates


Remove old terminated jobs with helper program
----------------------------------------------

There is a helper program to remove old jobs automatically.

    SCENARIO remove old terminated jobs

Setup.

    GIVEN a new git repository in CONFGIT
    AND an empty lorry-controller.conf in CONFGIT
    AND lorry-controller.conf in CONFGIT adds lorries *.lorry using prefix upstream
    AND WEBAPP uses CONFGIT as its configuration directory
    AND a running WEBAPP
    GIVEN Lorry file CONFGIT/foo.lorry with {"foo":{"type":"git","url":"git://foo"}}
    WHEN admin makes request POST /1.0/read-configuration

Start job 1. We start it at a known time of 100, so that we can control
when jobs become old.

    WHEN admin makes request POST /1.0/pretend-time with now=100
    AND admin makes request POST /1.0/give-me-job with host=testhost&pid=123
    THEN response has job_id set to 1

Remove old jobs while job 1 is running, still pretending time is 100
seconds since epoch. This should leave job 1 running.

    WHEN admin removes old jobs at 100
    AND admin makes request GET /1.0/list-jobs
    THEN response has job_ids set to [1]

Finish the job.

    WHEN MINION makes request POST /1.0/job-update with job_id=1&exit=0
    WHEN admin makes request GET /1.0/list-jobs
    THEN response has job_ids set to [1]

Remove old jobs, still at 100 seconds. Job 1 should still remain, as
it just finished.

    WHEN admin removes old jobs at 100
    AND admin makes request GET /1.0/list-jobs
    THEN response has job_ids set to [1]

Let a long time pass, and remove old jobs again. Job 1 should now go
away.

    WHEN admin removes old jobs at 100000000000
    AND admin makes request GET /1.0/list-jobs
    THEN response has job_ids set to []
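
The three outcomes above follow from an age check that skips running
jobs. A sketch under stated assumptions: `remove_old_jobs` and the one
day retention period are invented for illustration and are not the
helper program's actual interface or default.

```python
# Hypothetical retention period in seconds.
MAX_AGE = 24 * 60 * 60


def remove_old_jobs(finish_times, now, max_age=MAX_AGE):
    """Return the ids of the jobs that are kept.

    finish_times maps job id -> finish time in seconds since the epoch,
    or None for a job that is still running.
    """
    return sorted(
        job_id
        for job_id, finished in finish_times.items()
        # Running jobs are never removed; finished ones only when old.
        if finished is None or now - finished <= max_age
    )


running = remove_old_jobs({1: None}, now=100)
just_finished = remove_old_jobs({1: 100}, now=100)
much_later = remove_old_jobs({1: 100}, now=100000000000)
```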

Cleanup.

    FINALLY WEBAPP terminates