| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This fixes an issue where distbuild would build the same artifact more
than once. The problem occurs with a single distbuild controller, if
multiple initiators request builds of the same thing at roughly the
same time (which scripts/release-build in definitions.git does).
This change also means that multiple distbuild controllers sharing a
single artifact cache will be smart about sharing built artifacts. It
does not mean that distbuild can handle having built artifacts removed
from the cache while it is building stuff.
The number of HTTP requests made to the shared artifact cache is higher
with this patch, but these seem to take no more than 1 second and we
only ever need to run one request before starting more builds, so there
should be no noticeable impact on performance.
Change-Id: Ib3246219a10ca95d40b8a21bd0fe53f32e46c1c9
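A rough sketch of the check described in this message, assuming a callable that stands in for the HTTP request to the shared artifact cache. The names select_builds_to_start, in_progress and is_cached are hypothetical and are not taken from the distbuild code.

```python
def select_builds_to_start(ready, in_progress, is_cached):
    """Return the artifacts from `ready` that still need a new build.

    ready       -- artifact names whose dependencies are satisfied
    in_progress -- set of artifact names already being built (updated in place)
    is_cached   -- callable standing in for a query to the shared cache
    """
    to_start = []
    for artifact in ready:
        if artifact in in_progress:
            # Another request already triggered this build; share its result.
            continue
        if is_cached(artifact):
            # Someone (perhaps another controller) already built it.
            continue
        to_start.append(artifact)
        in_progress.add(artifact)
    return to_start
```

Recording each artifact in in_progress as soon as it is selected is what stops two near-simultaneous requests from starting the same build twice.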
|
|
|
|
|
|
|
| |
Hopefully this makes the code a little less cryptic. No functional
changes.
Change-Id: I615810e4eacdd5454731e07387b1dbb9eb348fd5
|
|
|
|
| |
Change-Id: I16680439b131e63d30eeff91814a1af643af6246
|
|
|
|
| |
Change-Id: I01a60d4ec187d5fab060f40947d97aa97013f7a7
|
|
|
|
|
|
| |
Logging build output makes the controller logs difficult to read.
Change-Id: I5b81ff9359ada969e964328eb1c2624ab6b9375a
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We found a distbuild controller stuck in a busy loop, with the logs
full of the same error message repeated:
... _flush(): Exception 'IOError: [Errno 32] Broken pipe' from sock.write()
We suspect this came about because the initiator disconnected without
sending an EOF. The initiator was in a VM on a laptop so it seems
possible that the host OS turned off the wireless adaptor without giving
the VM a chance to close its connections gracefully.
The busy loop is because nothing in the SocketBuffer class handles the
SocketError events queued by the _flush() method. Unhandled events are
ignored. So the SocketBuffer stays in 'w' state without ever shifting
any data and never returns. Adding transitions to handle the SocketError
event will fix the problem.
If a socket error happens now in the same scenario, it will be handled
as if the initiator disconnected.
Change-Id: I0f6834f7186a01ca2bc74aef899a4cccbc891e51
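The failure mode can be illustrated with a toy state machine. This is a simplified sketch, not the real SocketBuffer class: the state names other than 'w', the event names and the make_buffer helper are all hypothetical.

```python
class TinyStateMachine:
    """Toy state machine that silently ignores unhandled events."""

    def __init__(self, state, transitions):
        self.state = state
        # Maps (current state, event name) -> new state.
        self.transitions = transitions

    def handle(self, event):
        """Return True if the event caused a transition, False if ignored."""
        key = (self.state, event)
        if key not in self.transitions:
            return False  # unhandled events are dropped
        self.state = self.transitions[key]
        return True


def make_buffer(handle_socket_error):
    """Build a toy buffer machine that starts in the 'w' (writing) state."""
    transitions = {
        ('w', 'flushed'): 'idle',
        ('idle', 'write'): 'w',
    }
    if handle_socket_error:
        # The fix: a socket error is treated like a disconnect.
        transitions[('w', 'socket-error')] = 'closed'
        transitions[('idle', 'socket-error')] = 'closed'
    return TinyStateMachine('w', transitions)
```

Without the extra transitions the machine stays in 'w' however many socket-error events are queued, which corresponds to the busy loop described above.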
|
|
|
|
|
|
|
|
|
| |
Create an InitiatorCommand class that accepts message_type and
status_text parameters to be used by the distbuild-list-jobs,
distbuild-status and distbuild-cancel commands to send request
messages to the distbuild network.
Change-Id: Ib686dcd7c370d802b612e9aaa1e3df76f0275fae
|
|
|
|
|
|
|
| |
This patch fixes an error where we can end up calling int(None) when
we try to send an error response for a malformed message.
Change-Id: Id3ee3298cfb6a5cb32e35fdc5916dab1e4c87a03
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Cancelling a distbuild with ctrl+c no longer cancels the build
itself. This commit adds some output explaining what should be
done to cancel the build as well as the local process.
This commit also fixes a bug where the BuildStarted event would
be sent each time a chunk finished building, since it was being
sent in _queue_worker_builds. This is fixed by adding a new
function to be called when the build graph annotation is
complete which sends BuildStarted and then calls
_queue_worker_builds, which no longer sends the BuildStarted
event.
Change-Id: I26ddea2c9080887f449e87004411ddffe4e583b7
|
|
|
|
|
|
|
|
|
| |
Currently jobs may continue running after exec-cancel is sent if
exec-response takes a while to be sent back. This commit makes the
job's state be set to 'failed' when exec-cancel is sent, so that
the wait for exec-response doesn't matter.
Change-Id: I858d9efcba38c81a912cf57aee2bdd8c02cb466b
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit 75ef3e9585091b463b60d2981b3b7283a2ea8eab.
It turns out that the JobQueue may need to handle more than one
build of the same artifact at once, as one may be in the process
of being cancelled when another build of the same artifact is
requested. So they do need an ID separate from the artifact ID.
Change-Id: Ifa0c06987795a4aebdadbd9927de27919377b0a2
|
|
|
|
|
|
|
| |
We no longer serialise whole artifacts, so it doesn't make sense
for things to still refer to serialise-artifact and similar.
Change-Id: Id4d563a07041bbce77f13ac71dc3f7de39df5e23
|
|
|
|
| |
Change-Id: I674c39149aad82c07c85d2db3207280b91dfa292
|
|
|
|
| |
Change-Id: I95fbfcb2ed6a8ffdd946d36eacc030b4ae1b9b21
|
|
|
|
|
|
|
| |
Adds distinct message types to give us more flexibility over message
handling now that we have multiple initiator types with different requirements.
Change-Id: Ib2af8736b83d66ef20a8e37591ca68c9441b6497
|
|
|
|
|
|
|
|
| |
This fixes an issue with distbuild-status and distbuild-cancel crashing
due to their appropriate Initiator classes not handling 'build-failed'
messages.
Change-Id: Ia35c8e14a30e3a9bdea1e44f7726181db75dfbe5
|
|
|
|
|
|
|
| |
Remove extra job set line as self._current_job no longer exists
in worker_build_scheduler.py
Change-Id: I8849742587f11f83ebba64f48eaf97fac83e6589
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If the initiator sends an invalid build-request message, it will now
exit with the following sort of error:
ERROR: Failed to build baserock:baserock/definitions
f2d78e9b7221bca65cba53af3f3b50d50d90628f
systems/build-system-x86_64.morph: Invalid build-request message. Check
you are using a supported version of Morph. This distbuild network uses
protocol version 2.
Previously, the controller would log an error to its log file, but it
would not send any response to the initiator so the initiator would
hang forever.
Behaviour is the same as before for the case where the initiator sends a
build-request message with the wrong protocol version: the initiator
will exit with an error message.
Change-Id: I94fdee02bc701d4a679a0261b3c46dbdf14cfcaf
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Although in theory a worker should only ever have one job at once, in
practice this assumption doesn't hold, and can cause serious confusion.
The worker (implemented in the JsonRouter class) will actually queue up
exec-request messages and run the oldest one first. I saw a case where,
due to a build not being correctly cancelled, the
WorkerConnection.current_job attribute got out of sync with what the
worker was actually building. This led to an error when trying to fetch
the built artifacts, as the controller tried to fetch artifacts for
something that wasn't actually built yet, and everything got stuck.
To prevent this from happening, we either need to remove the
exec-request queue in the worker-daemon process, or make the
WorkerConnection class cope with multiple jobs at once. The latter seems
like the more robust approach, so I have done that.
Another bug this fixes is the issue where, if the 'Computing build
graph' (serialise-artifact) step of a build completes on the controller
while one of its WorkerConnection objects is waiting for artifacts to
be fetched by the shared cache from the worker, the build hangs. This
would happen because the WorkerConnection assumed that any
HelperResponse message it saw was the result of its request, so would
send a _JobFinished before caching had actually finished if there was
an unrelated HelperResponse received in the meantime. It now checks
the request ID of the HelperResponse before calling the code that is
now in the new _handle_helper_result_for_job() function.
Change-Id: Ia961f333f9dae77405b58c82c99a56e4c43e1628
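The request ID check can be sketched as below. This is an illustration under assumptions, not the WorkerConnection code: the HelperResponse fields and the CachingTracker class are hypothetical stand-ins.

```python
class HelperResponse:
    """Stand-in for a helper result message (field names are assumed)."""

    def __init__(self, request_id, exit_code):
        self.request_id = request_id
        self.exit_code = exit_code


class CachingTracker:
    """Tracks which caching request belongs to which job."""

    def __init__(self):
        self._pending = {}   # request id -> job name
        self.finished = []   # jobs whose caching has completed

    def request_caching(self, request_id, job):
        self._pending[request_id] = job

    def handle_helper_response(self, response):
        """Finish a job only if the response answers one of our requests."""
        job = self._pending.pop(response.request_id, None)
        if job is None:
            return False  # unrelated response; keep waiting
        self.finished.append(job)
        return True
```

Keying the pending requests by ID also covers the case of several jobs being tracked on one connection at once.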
|
|
|
|
|
|
|
| |
Rather than generating IDs for each job, identify them by what artifact
is going to be built. Artifact cache IDs need to be unique in any case.
Change-Id: I37a0277931c45a8fb6e37ae7c2a6a942ae732fdd
|
|
|
|
|
|
|
| |
This is a bit more comprehensive than the previous approach of using
public instance attributes, and I find it easier to reason about.
Change-Id: I2942ecf53c95e29893dc0982d38aec689ebfa614
|
|
|
|
|
|
|
| |
The intention is to allow workers to use this class for job tracking, in
addition to the controller.
Change-Id: I355861086764476b383266bab7e850af5e05bc54
|
|
|
|
|
|
|
|
|
| |
Commit 84096556ea54d4af236f1fe5f7ccf61c1343016f changed the protocol
without changing the protocol version. Versions of Morph between
that one and this one may hang forever in 'morph distbuild' if trying
to build on an incompatible distbuild network.
Change-Id: I9194657f59a4b4a61a6fde7bd85105b56ca1a78d
|
|
|
|
|
|
|
|
|
|
| |
Currently, attempting to distbuild a component which is not in
the given system or doesn't exist at all will cause the full
system to be built, rather than an error raised. This is because
the logic which checks that all components were found is
completely nonsensical. This commit makes it actually check the
right thing.
Change-Id: Ide4d7e3fa5f71e433f3a7b7c8c387fe594c92e43
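A minimal sketch of the intended check, assuming components are identified by name. The function names and the exception type are hypothetical; the real code works on the build graph rather than plain lists.

```python
def find_missing_components(requested, available):
    """Return the requested component names that are not in `available`."""
    available = set(available)
    return [name for name in requested if name not in available]


def check_components(requested, available):
    """Raise an error instead of silently building the whole system."""
    missing = find_missing_components(requested, available)
    if missing:
        raise ValueError('Components not found: %s' % ', '.join(missing))
```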
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Once building starts we close the json machine on the initiator,
but we may have received build progress events between processing
our build-started event and closing the json machine. Since there
is no nice way to tell the different types of build progress
apart (they all use BuildProgress), we will ignore all BuildProgress
messages for now.
A possible fix for this is to introduce GraphProgress messages so that
we can report the building of the graph without reporting other types
of BuildProgress ("Waiting for worker" or "Transferring artifact to cache")
that we're not interested in.
Note that we will still report build failures or build success, so if
there's a mistake in the definitions this will be reported before the
detach can occur. Similarly, if the system is already built this will
be reported before the detach happens.
Change-Id: Ia006ccfba826d2c91f4dea6c028ecdcb5a2b02d6
|
|
|
|
| |
Change-Id: Icfc3d1aa125196e208d7ac35f43f06c5f5a21ba4
|
|
|
|
|
|
|
|
|
| |
Adds a command to get the status of all recently run distbuilds
for a given server (e.g. Running, Finished, Failed, Cancelled),
so that you can tell whether a build started via distbuild-start has
finished or otherwise exited without going through the server's log files.
Change-Id: I5ce9fe54ae7b1bd8fe3e0d629f615042be8827ed
|
|
|
|
|
|
|
|
|
|
|
| |
Add a distbuild-start command to build_plugin in morphlib,
and create a boolean parameter to tell the initiator whether
to disconnect from the controller and leave the build running remotely.
Add a distbuild-cancel command to parse currently-running distbuild
build-request IDs and cancel the one matching the given argument.
Change-Id: I458a5767bb768ceb2b4d8876adf1c86075d452bd
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, the distbuild-list-jobs command will fail if morph is
outdated (i.e. protocol version for client and distbuild network
don't match); a protocol_version field has been added to the
list-jobs request message to fix this.
Moved version check outside build-request message to reduce duplication
in new functions.
Generalised the list-request output to reduce duplication for any
further additions that may require a message output.
Change-Id: I28e733cbfe8c89e8c11427df5d40ab275abd313c
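One way to picture a version check that is shared by every request type is sketched below. Only the protocol_version field name comes from this message; the 'type' key, the error reply format and the function names are assumptions made for illustration.

```python
def check_protocol_version(message, supported_version):
    """Return None if the message is acceptable, else an error string."""
    version = message.get('protocol_version')
    if version != supported_version:
        return ('Protocol version mismatch: network uses %s, request used %s'
                % (supported_version, version))
    return None


def handle_request(message, supported_version, handlers):
    """Run the shared version check once, then dispatch on message type."""
    error = check_protocol_version(message, supported_version)
    if error is not None:
        # Hypothetical error reply; the real message format may differ.
        return {'type': 'error', 'text': error}
    return handlers[message['type']](message)
```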
|
|
|
|
| |
Change-Id: Ifdaa92c209a4ca488c4447911bef9b1bf7d61438
|
|
|
|
|
|
|
|
| |
We no longer serialise entire artifacts, so the output of deserialise_artifact
is an ArtifactReference. This commit changes stuff in distbuild to know how to
deal with that rather than an Artifact.
Change-Id: I79b40d041700a85c25980e3bd70cd34dedd2a113
|
|
|
|
|
|
|
|
|
|
| |
The controller no longer needs to know everything about an artifact
as the workers can calculate the build graph themselves quickly.
This reduces the amount of data which needs to be serialised by
serialise-artifact, making the yaml dump quicker.
Change-Id: I6bd0bed14c2efb2f499e9d6f0a97e6188353121a
|
|
|
|
|
|
|
|
|
| |
This is mostly to check that the 'cancel entire subprocess tree' patch works
as expected. Revert that patch and the test fails.
There are also some tweaks included in this commit.
Change-Id: If297522e6589ebb3a07dac66a39eb243789e53aa
|
|
|
|
|
|
|
|
| |
Currently, it leaves around empty directories called build-00, build-01,
etc. when you run a distbuild that fails to get as far as building
something, which is annoying.
Change-Id: Id3466e248c327dedaf973bc2fe22d42e5c5570d4
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We discovered a case where a user of distbuild began a build of
'qtbase', then cancelled it 2 minutes in. The `morph worker-build`
process didn't exit for over an hour -- it ran right through until the
chunk artifacts had been created. Then it exited with code -9 (SIGKILL).
This seems to be due to the fact that SIGKILL doesn't kill subprocesses,
and so any file descriptors the subprocesses have open will remain open.
If we set up the `morph worker-build` process as a process group
leader, using os.setpgid(), then we can use os.killpg() to kill the
entire process group. This should ensure that the `morph worker-build`
command exits straight away, as all of its subprocesses will be killed
at the same time it is.
Change-Id: I38707d18004d8c5bc994fd0cb99e90fd5def58e4
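The process group approach can be sketched with the standard library as follows. This assumes a POSIX system; using preexec_fn to call os.setpgid() in the child is one possible way to do it and is not necessarily how the patch does it. The helper names are hypothetical.

```python
import os
import signal
import subprocess


def start_in_own_group(argv):
    """Start argv as the leader of a new process group."""
    # preexec_fn runs in the child before exec; setpgid(0, 0) makes the
    # child the leader of a new group whose ID equals the child's PID.
    return subprocess.Popen(argv, preexec_fn=lambda: os.setpgid(0, 0))


def kill_group(process):
    """Kill the process and everything else in its process group."""
    # The process is the group leader, so its PID is also the group ID.
    os.killpg(process.pid, signal.SIGKILL)
    return process.wait()
```

Subprocesses inherit the group of their parent unless they change it themselves, so a single os.killpg() call reaches the whole tree in the common case.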
|
|
|
|
|
|
|
|
|
| |
Previously it was only available in the distbuild-helper program. Moving
it to its own module means we can test it and reuse it.
This commit also adds a docstring to the class.
Change-Id: Iaf7854048cf0ff463a87894f1f500cdcb6a34d8b
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A log message was printing the 'remote name' of a socket that was
listening for connections. There isn't one, so the message always
shows this:
2015-04-14 17:05:19 INFO Binding socket to sam-jetson-mason
2015-04-14 17:05:19 INFO Listening at None
Print the local name instead:
2015-04-14 17:05:19 INFO Binding socket to sam-jetson-mason
2015-04-14 17:05:19 INFO Listening at 10.24.2.125:7878
Change-Id: I22c1bbe8c9f78ef63e587b6ace516afc861fae0f
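The difference between the two names can be shown with the standard socket module. This is a small sketch; the helper names are hypothetical and the real logging code is not shown here.

```python
import socket


def describe_listener(sock):
    """Return 'host:port' for the local address of an IPv4 socket."""
    host, port = sock.getsockname()[:2]
    return '%s:%d' % (host, port)


def remote_name_or_none(sock):
    """Return the peer address, or None if the socket has no peer."""
    try:
        return sock.getpeername()
    except OSError:
        # A listening socket is not connected, so it has no remote name.
        return None
```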
|
|
|
|
|
|
|
|
|
|
| |
Add InitiatorListJobs class and list-jobs message template, add
distbuild-list-jobs to the morph command list, and send running job
information back to the initiator. Split out handling of build-request
and list-jobs messages into separate functions, and use a UUID instead
of a random integer for message identification.
Change-Id: Id02604f2c1201dbc10f6bbd7f501b8ce1ce0deae
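Generating a UUID-based message ID needs only the standard uuid module. In the sketch below the 'type' and 'id' keys are assumptions for illustration, not the real list-jobs message template.

```python
import uuid


def make_request_id():
    """Return a random 32-character hexadecimal identifier."""
    return uuid.uuid4().hex


def make_list_jobs_request():
    """Build a hypothetical list-jobs request carrying a fresh ID."""
    return {'type': 'list-jobs', 'id': make_request_id()}
```

A version 4 UUID carries 122 random bits, so accidental collisions between requests are far less likely than with a small random integer.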
|
|
|
|
|
|
|
| |
A JsonMachine object can be set to log all messages that it sends, so we
don't need to handle it in the WorkerConnection class as well.
Change-Id: Idfdc06953363a016708b5dda50c978eb93b1113c
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Worker log files are overly verbose with this enabled. Each message is
dumped 6 times:
2015-03-19 11:00:11 DEBUG JsonMachine: Received: '"{...}\\n"\n'
2015-03-19 11:00:11 DEBUG JsonMachine: line: '"{...}\\n"'
2015-03-19 11:00:11 DEBUG JsonRouter: got msg: {...}
2015-03-19 11:00:11 DEBUG JsonMachine: Sending message {...}
2015-03-19 11:00:11 DEBUG JsonMachine: As '"{...}\\n"'
2015-03-19 11:00:11 DEBUG JsonRouter: sent to client: {...}
With this setting disabled, the message is only logged by the JsonRouter
class, so appears only twice:
2015-03-19 11:00:11 DEBUG JsonRouter: got msg: {...}
2015-03-19 11:00:11 DEBUG JsonRouter: sent to client: {...}
We've not seen any issues with message encoding/decoding recently so I
think it's safe to disable this debugging output by default.
Change-Id: I7d22ed29e81d6c594cb2c639abf3b40bfb27e3ad
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It's good to know which jobs are in progress and which are queued, when
reading morph-controller.log.
Old output:
2015-04-09 10:40:58 DEBUG Current jobs:
['3f647933a1effbb128c857225ba77e9aa775d92314ef0acf3e58e084a7248c73.chunk.stage1-binutils-misc',
'd7279e4179a31d8a3a98c27d5b01ad1bb7387c7fab623fee1086ab68af2784bb.chunk.stage2-fhs-dirs-misc']
New output:
2015-04-09 10:40:58 DEBUG Current jobs:
['3f647933a1effbb128c857225ba77e9aa775d92314ef0acf3e58e084a7248c73.chunk.stage1-binutils-misc (given to worker1:3434)',
'd7279e4179a31d8a3a98c27d5b01ad1bb7387c7fab623fee1086ab68af2784bb.chunk.stage2-fhs-dirs-misc (given to worker2:3434)']
Change-Id: Ie89e6723b0da5f930813591a3166301fd3966804
|
|
|
|
|
|
|
|
|
| |
A cancel during the 'graphing' or 'annotating' stages would be ignored
as the BuildController was listening for the InitiatorDisconnect message
from the wrong event source. In 'building' state the actual build would
be stopped, but the BuildController instance would stick around due to
sending the message class instead of an instance of the message.
Change-Id: I222a8aa39bf7fffab4d89e12997ffd18cd1b54fc
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In addition to partial builds we also want to be able to do partial
distbuilds, and distbuild uses a different codepath.
This commit updates the distbuild code to know what to do if a partial
build is requested. It only builds up to the latest chunk/stratum that
was requested, and displays where to find the artifacts for each of
the chunks/strata requested upon completion of the build.
The usage is the same as for local builds.
Change-Id: I0537f74e2e65c7aefe5e71795f17999e2415fce5
|
|
|
|
| |
Change-Id: Ibda7a938cd16e35517a531140f39ef4664d85c72
|
|
|
|
| |
Change-Id: I992dc0c1d40f563ade56a833162d409b02be90a0
|
| |
|
|\
| |
| |
| |
| | |
Reviewed-By: Adam Coldrick <adam.coldrick@codethink.co.uk>
Reviewed-By: Richard Maw <richard.maw@codethink.co.uk>
|
| |
| |
| |
| |
| | |
This makes it easier to spot if an incomplete build was due to the user
cancelling, or if it represents a dropped connection or internal error.
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This message was hundreds of kilobytes in size, as it contained a
recursive list of dependencies for each artifact in the build graph. It
was used in the initiator only to print this message:
Build steps in total: 592
This message is now gone. The 'Need to build %d artifacts'
build-progress message now indicates the total build steps instead:
Need to build 300 artifacts, of 592 total
This is a compatible change to the distbuild protocol: old initiators
will continue to work as normal with new controllers that don't send
the build-steps message.
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
It gets messy having hundreds of build-step-xx.log files in the current
directory, and if two builds are run in parallel from the same directory
the logs for a given chunk will be mixed together in one file.
Now, a new directory named build-0, build-1, build-2 etc is created for
each new build.
If the user passes --initiator-step-output-dir the logs will be placed
in that directory, instead. This behaviour is the same as before.
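Picking the next free build-N directory could look roughly like this. The function name and the choice to rely on os.mkdir() failing for an existing directory are assumptions; the initiator may do this differently.

```python
import os


def create_build_log_dir(parent):
    """Create and return the first unused build-N directory under parent."""
    n = 0
    while True:
        path = os.path.join(parent, 'build-%d' % n)
        try:
            # mkdir fails if the directory exists, so two builds started
            # from the same directory cannot end up sharing a log dir.
            os.mkdir(path)
        except FileExistsError:
            n += 1
            continue
        return path
```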
|