Change-Id: I493fced8cf2664283923f6f41097ca991d3fc3de

A bad function prototype meant that the mechanism for handling workers
disconnecting actually caused the controller to crash instead.
Change-Id: I8ceb6ad027ba2481c0c4c335e1760692823c208b

Change-Id: I01a60d4ec187d5fab060f40947d97aa97013f7a7

Currently, jobs may continue running after exec-cancel is sent if the
exec-response takes a while to come back. This commit sets the job's
state to 'failed' as soon as exec-cancel is sent, so that the wait for
the exec-response no longer matters.
Change-Id: I858d9efcba38c81a912cf57aee2bdd8c02cb466b
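
As an illustration of the idea only (not the actual morph code; the
names here are hypothetical), marking the job failed at cancel time
lets a stale exec-response be recognised and dropped:

    class Job(object):
        def __init__(self, job_id):
            self.job_id = job_id
            self.state = 'running'

    class JobTable(object):
        def __init__(self):
            self._jobs = {}

        def add(self, job):
            self._jobs[job.job_id] = job

        def cancel(self, job_id):
            # Mark the job failed *now*, instead of waiting for the
            # worker to acknowledge the cancel with an exec-response.
            self._jobs[job_id].state = 'failed'

        def handle_exec_response(self, job_id):
            job = self._jobs.get(job_id)
            if job is None or job.state == 'failed':
                return  # stale response for a cancelled job; ignore it
            job.state = 'complete'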

This reverts commit 75ef3e9585091b463b60d2981b3b7283a2ea8eab.
It turns out that the JobQueue may need to handle more than one build
of the same artifact at once, as one may be in the process of being
cancelled when another build of the same artifact is requested. So jobs
do need an ID separate from the artifact ID.
Change-Id: Ifa0c06987795a4aebdadbd9927de27919377b0a2
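
A minimal sketch of the distinction (invented names, not the real
JobQueue code): two live jobs can share an artifact cache key, so the
queue cannot key jobs on the artifact alone:

    import itertools

    class Job(object):
        _next_id = itertools.count()

        def __init__(self, artifact_cache_key):
            self.id = next(Job._next_id)                  # unique per job
            self.artifact_cache_key = artifact_cache_key  # shared by rebuilds

    cancelling = Job('3f6479...stage1-binutils-misc')  # being cancelled
    requested = Job('3f6479...stage1-binutils-misc')   # fresh request
    assert cancelling.artifact_cache_key == requested.artifact_cache_key
    assert cancelling.id != requested.id               # still distinguishable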

We no longer serialise whole artifacts, so it doesn't make sense for
code to still refer to serialise-artifact and similar names.
Change-Id: Id4d563a07041bbce77f13ac71dc3f7de39df5e23

Remove an extra job-setting line, since self._current_job no longer
exists in worker_build_scheduler.py.
Change-Id: I8849742587f11f83ebba64f48eaf97fac83e6589

Although in theory a worker should only ever have one job at a time, in
practice this assumption doesn't hold, and that can cause serious
confusion. The worker (implemented in the JsonRouter class) will
actually queue up exec-request messages and run the oldest one first. I
saw a case where, due to a build not being correctly cancelled, the
WorkerConnection.current_job attribute got out of sync with what the
worker was actually building. This led to an error when trying to fetch
the built artifacts, as the controller tried to fetch artifacts for
something that wasn't actually built yet, and everything got stuck.

To prevent this from happening, we either need to remove the
exec-request queue in the worker-daemon process, or make the
WorkerConnection class cope with multiple jobs at once. The latter seems
like the more robust approach, so I have done that.

Another bug this fixes is the issue where, if the 'Computing build
graph' (serialise-artifact) step of a build completes on the controller
while one of its WorkerConnection objects is waiting for artifacts to
be fetched by the shared cache from the worker, the build hangs. This
happened because the WorkerConnection assumed that any HelperResponse
message it saw was the result of its own request, so it would send a
_JobFinished before caching had actually finished if an unrelated
HelperResponse arrived in the meantime. It now checks the request ID of
the HelperResponse before calling the code that now lives in the new
_handle_helper_result_for_job() function.

Change-Id: Ia961f333f9dae77405b58c82c99a56e4c43e1628
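
A rough sketch of both fixes, with hypothetical names (the real code
lives in WorkerConnection): jobs are held in a dict instead of a single
current_job attribute, and a HelperResponse is only acted on if its
request ID matches one of this connection's outstanding cache-fetch
requests:

    class WorkerConnection(object):
        def __init__(self):
            self._active_jobs = {}              # job ID -> job
            self._pending_helper_requests = {}  # helper request ID -> job ID

        def job_started(self, job):
            # Multiple jobs can be tracked at once; there is no single
            # current_job attribute to get out of sync.
            self._active_jobs[job.id] = job

        def handle_helper_response(self, msg):
            job_id = self._pending_helper_requests.pop(msg['id'], None)
            if job_id is None:
                return  # someone else's HelperResponse; not our fetch
            self._handle_helper_result_for_job(job_id, msg)

        def _handle_helper_result_for_job(self, job_id, msg):
            job = self._active_jobs.pop(job_id)
            # ... send _JobFinished or _JobFailed for this job only ...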

Rather than generating IDs for each job, identify jobs by the artifact
they are going to build. Artifact cache IDs need to be unique in any
case.
Change-Id: I37a0277931c45a8fb6e37ae7c2a6a942ae732fdd

This is a bit more comprehensive than the previous approach of using
public instance attributes, and I find it easier to reason about.
Change-Id: I2942ecf53c95e29893dc0982d38aec689ebfa614
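
The commit doesn't show the code it replaces, but if the change is
along the lines of swapping public attributes for read-only properties,
the shape would be something like this purely hypothetical sketch:

    class Job(object):
        def __init__(self, job_id, artifact):
            self._id = job_id
            self._artifact = artifact
            self._initiators = []

        @property
        def id(self):
            return self._id

        @property
        def artifact(self):
            return self._artifact

        @property
        def initiators(self):
            # A copy, so callers can't mutate internal state directly.
            return list(self._initiators)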

The intention is to allow workers to use this class for job tracking,
in addition to the controller.
Change-Id: I355861086764476b383266bab7e850af5e05bc54

Change-Id: Ifdaa92c209a4ca488c4447911bef9b1bf7d61438

We no longer serialise entire artifacts, so the output of
deserialise_artifact is an ArtifactReference. This commit changes the
distbuild code to deal with that rather than with an Artifact.
Change-Id: I79b40d041700a85c25980e3bd70cd34dedd2a113

A JsonMachine object can be set to log all messages that it sends, so
we don't need to handle this in the WorkerConnection class as well.
Change-Id: Idfdc06953363a016708b5dda50c978eb93b1113c

It's good to know which jobs are in progress and which are queued when
reading morph-controller.log.

Old output:

    2015-04-09 10:40:58 DEBUG Current jobs:
    ['3f647933a1effbb128c857225ba77e9aa775d92314ef0acf3e58e084a7248c73.chunk.stage1-binutils-misc',
     'd7279e4179a31d8a3a98c27d5b01ad1bb7387c7fab623fee1086ab68af2784bb.chunk.stage2-fhs-dirs-misc']

New output:

    2015-04-09 10:40:58 DEBUG Current jobs:
    ['3f647933a1effbb128c857225ba77e9aa775d92314ef0acf3e58e084a7248c73.chunk.stage1-binutils-misc (given to worker1:3434)',
     'd7279e4179a31d8a3a98c27d5b01ad1bb7387c7fab623fee1086ab68af2784bb.chunk.stage2-fhs-dirs-misc (given to worker2:3434)']

Change-Id: Ie89e6723b0da5f930813591a3166301fd3966804

Change-Id: I992dc0c1d40f563ade56a833162d409b02be90a0

Reviewed-By: Adam Coldrick <adam.coldrick@codethink.co.uk>
Reviewed-By: Richard Maw <richard.maw@codethink.co.uk>

For a while we have seen an issue where output from build A would end
up in the log file of some other random chunk.

The problem turns out to be that the WorkerConnection class in the
controller-daemon assumes cancellation is instantaneous. If a build was
cancelled, the WorkerConnection would send a cancel message for the job
it was running, and then start a new job. However, the worker-daemon
process would have a backlog of exec-output messages and a delayed
exec-response message from the old job. The controller would receive
these and assume that they were for the new job, without checking the
job ID in the messages. Thus they would be sent to the wrong log file.

To fix this, the WorkerConnection class now tracks jobs by job ID, and
the code should be generally more robust when unexpected messages are
received.
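
In outline (a hedged sketch with simplified names, not the real
message layout), the fix means routing on the job ID carried in each
message rather than on arrival order:

    class WorkerConnection(object):
        def __init__(self):
            self._jobs = {}  # job ID -> job

        def handle_exec_output(self, msg):
            job = self._jobs.get(msg['id'])
            if job is None:
                # Backlogged output from a job that was already
                # cancelled; dropping it keeps it out of the log file
                # of whatever job happens to be running now.
                return
            job.write_to_log(msg.get('stdout', ''))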

The logic to handle a worker disconnecting was broken. The
WorkerConnection object would remove itself from the main loop as soon
as the worker disconnected, but it would not get removed from the list
of available workers that the WorkerBuildQueue maintains. So the
controller would continue sending messages to this dead connection, and
the builds it sent would hang forever waiting for a response.
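
The repaired teardown, sketched with invented collaborator names:
disconnecting must deregister the worker from the queue as well as
from the main loop:

    class WorkerConnection(object):
        def __init__(self, mainloop, build_queue):
            self._mainloop = mainloop
            self._build_queue = build_queue

        def handle_disconnect(self):
            self._mainloop.remove_state_machine(self)
            # The missing half of the old logic: stop the queue from
            # ever offering this dead connection another job.
            self._build_queue.remove_worker(self)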

We always want to warn if we attempt to remove a job that's not
present.
|
| |
|
|
|
|
|
|
|
|
|
|
| |
If a new build request makes a request for an artifact that is currently being
cached then the artifact will be needlessly rebuilt.
To avoid this the new build request should wait for caching to finish.
We rename _ExecStarted, _ExecEnded, _ExecFailed to
_JobStarted, _JobFinished, _JobFailed
and Job's is_building attribute is renamed to running.
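
A minimal sketch of the waiting behaviour, under assumed names: a
request for an artifact that already has a running job just adds its
initiator to that job instead of queueing a rebuild:

    class Job(object):
        def __init__(self, cache_key, initiator_id):
            self.cache_key = cache_key
            self.initiators = [initiator_id]
            self.running = True  # covers both building and caching

    class JobQueue(object):
        def __init__(self):
            self._jobs = {}  # artifact cache key -> job

        def request(self, cache_key, initiator_id):
            job = self._jobs.get(cache_key)
            if job is not None and job.running:
                # Already building or being cached: wait for that
                # result rather than rebuilding needlessly.
                job.initiators.append(initiator_id)
                return job
            job = Job(cache_key, initiator_id)
            self._jobs[cache_key] = job
            return job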

This fixes the bug that causes the distbuild controller to crash when
population of the artifact cache fails.

Reviewed-By: Richard Ipsum <richard.ipsum@codethink.co.uk>
Reviewed-By: Lars Wirzenius <lars.wirzenius@codethink.co.uk>

Users need to be able to see logs of all builds, not just those that
failed.

To cancel jobs cleanly, we need to know when a job has failed.

add_initiator() isn't necessary, given that lists have a remove method.

Put our _exec_response_msg into the WorkerBuildFinished event; it's
essentially the same as _finished_msg, just under a different name.
Get our artifact's cache key from the job.

Now we just get everything from the job object.

The exec_response_msg also needs to be sent to a number of initiators,
so we give it a list of IDs, not just one. The exec_response_msg will
be sent to the controller once the artifacts have been cached
successfully. There's no longer any need to use a route map to retrieve
the ID of the initiator, since this is stored with the job.
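
Sketched with an assumed message shape (the real field names live in
the distbuild protocol code): the job carries every interested
initiator, so the response can fan out without a route map:

    def send_exec_response(job, router):
        msg = {
            'type': 'exec-response',
            'ids': list(job.initiators),  # a list of IDs, not a single one
            # ... exit code, artifact cache key, etc. ...
        }
        # Sent only after the artifacts have been cached successfully.
        for initiator_id in job.initiators:
            router.send_to_initiator(initiator_id, msg)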

msg now contains a list of initiator IDs rather than a single one,
since BuiltOutput needs to be sent to a number of initiators.

Each job is given a unique ID, so we don't need to generate an ID for
each exec request. This means we can remove the use of the route map,
since we can use the job's ID for the exec request.
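
Illustrative only (hypothetical helper names): because a job already
has a unique ID, the exec-request can quote it directly, and replies
can be matched back with a plain dict lookup, with no route map in
between:

    def make_exec_request(job, argv):
        return {
            'type': 'exec-request',
            'id': job.id,  # the job's own ID, not a freshly generated one
            'argv': argv,
        }

    def job_for_reply(jobs_by_id, reply_msg):
        # The worker quotes the same ID back, so matching the reply to
        # its job is a dictionary access.
        return jobs_by_id.get(reply_msg['id'])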

This method no longer works; we will replace it soon.

The name change from BuildFailed to JobFailed etc. was unintentionally
merged into master; undo it.

_job is the job this worker is carrying out; _exec_response_msg will
contain the response the worker sends back to us when it finishes the
build.

We need to be able to send this message to a number of initiators.

Conflicts:
    distbuild/build_controller.py

Reviewed by:
    Lars Wirzenius
    Daniel Silverstone
    Sam Thursfield

body and headers must now be specified for the http-request message.
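
For illustration, an assumed constructor that makes the two fields
impossible to forget; the real message layout is defined by the
distbuild protocol module, so treat these field names as guesses:

    def http_request(request_id, method, url, headers, body):
        # headers and body are now required arguments, even if empty.
        return {
            'type': 'http-request',
            'id': request_id,
            'method': method,
            'url': url,
            'headers': headers,
            'body': body,
        }

    # e.g. a GET with no payload still names both fields explicitly:
    msg = http_request(1, 'GET', '/1.0/artifacts', headers={}, body=None)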
|
| | |
|