path: root/distbuild
Commit message [Author, Age, Files, Lines]
* worker_build_scheduler: Consider active jobs when creating/cancelling [Pedro Alvarez, 2016-07-25, 1 file, -5/+11]
    In WorkerBuildQueuer._handle_request we were only considering running jobs when creating a new one to not create duplicates. This was making `morph distbuild` build some components more than one time.

    In this commit we also start considering active jobs when cancelling them.

    Change-Id: Ib0a7296d453ccd0b8c636c7506d9f1da82acc462
* Avoid UnicodeDecodeError when writing to log files [Pedro Alvarez, 2016-02-23, 1 file, -5/+11]
    This is the counterpart fix to b3ecd02236e58386ac4d7566ef70e751ff0d7e26, which had broken the Morph test suite, as it turns out.

    Change-Id: I5392c2c762c733d7d88cd20898970ec314525d89
* distbuild: Avoid UnicodeEncodeError when writing build output to log files [Sam Thursfield, 2016-01-26, 1 file, -2/+4]
    Text encoding in Python 2 is a total mess so I can only pretend to understand what's going on.

    The 'stdout' and 'stderr' messages are Python 'unicode' instances, which isn't really important, but I know that because when we try to write them to the log file, and they contain non-ASCII data, we see this error:

        File "/usr/lib/python2.7/site-packages/distbuild/initiator.py", line 231, in _handle_step_output_message
          f.write(msg['stdout'])
        UnicodeEncodeError: 'ascii' codec can't encode character u'\u2022' in position 29: ordinal not in range(128)

    Who said anything about encoding 'unicode' to 'ascii'? It turns out that when you write to a file, Python implicitly tries to encode the data to the 'default encoding', which happens to be 'ascii'. You lose!

    The only way to fix this is to tell Python that the file has a different encoding, a nice one like UTF-8. (I tried opening the file with 'b' mode, that doesn't seem to help).

    UTF-8 can only encode valid Unicode data, of course, so we need to make sure the data we write is valid UTF-8. You can do this by calling decode('unicode-escape'), which converts *from* Unicode *to* Unicode, but replacing any invalid characters with escape codes so that we don't get any UnicodeDecodeErrors during the conversion, or when we try to write it to the UTF-8 file.

    See this presentation for more info: http://farmdev.com/talks/unicode/

    Or just use Python 3.

    Change-Id: I6316d346f5cca2c75f198b48ec9878ac647ae7e5
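The "tell Python that the file has a different encoding" step can be sketched as below. This is a minimal illustration, not the code from the commit: the helper name `append_to_log` and the log file name are made up here, and only the explicit UTF-8 file encoding is shown (the `decode('unicode-escape')` step is left out).

```python
import io
import os
import tempfile


def append_to_log(path, text):
    """Append text to a log file that is explicitly opened as UTF-8.

    Hypothetical helper: io.open() with encoding='utf-8' avoids the
    implicit 'ascii' encoding that Python 2 applies to unicode data.
    """
    with io.open(path, 'a', encoding='utf-8') as f:
        f.write(text)


# Write a line containing the same non-ASCII character as in the traceback.
log_path = os.path.join(tempfile.mkdtemp(), 'build-step.log')
append_to_log(log_path, u'\u2022 configure finished\n')
```

`io.open()` accepts the `encoding` argument on both Python 2 and Python 3, so the same call works on either.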
* distbuild: When a build finishes, say which worker it was built on [Sam Thursfield, 2015-10-07, 5 files, -6/+14]
    Change-Id: I493fced8cf2664283923f6f41097ca991d3fc3de
* distbuild: Fix cache status message [Sam Thursfield, 2015-06-24, 1 file, -2/+3]
    The initiator would always say "Need to build 260/260 artifacts" even when it didn't need to build everything, because we were counting the number of unbuilt artifacts wrongly.

    Change-Id: I5da88157dba59949597c58a983f7b31975c52d7f
* distbuild: Fix crash when worker disconnects [Sam Thursfield, 2015-06-24, 1 file, -1/+1]
    Bad function prototype meant that the mechanism for handling workers disconnecting actually caused the controller to crash instead.

    Change-Id: I8ceb6ad027ba2481c0c4c335e1760692823c208b
* distbuild: Fix partial distbuilding [Sam Thursfield, 2015-06-24, 1 file, -1/+1]
    I was getting an error from this command:

        morph distbuild systems/build-system-x86_64.morph stage2-fhs-dirs

    Saying this:

        ERROR: Failed to build git://git.baserock.org/baserock/baserock/definitions 93575a2ceeeda77a5bb8c6121a9cac3edde1afbf systems/build-system-x86_64.morph: Some of the requested components are not in build-system-x86_64-rootfs: stage2-fhs-dirs

    This patch fixes that issue and makes Morph build up to stage2-fhs-dirs successfully.

    Change-Id: I61c373272484dcb5dc62f281cae8f21f742c31a9
* distbuild: Add __str__() and __repr__() to ArtifactReference [Sam Thursfield, 2015-06-23, 1 file, -0/+6]
    This is just to make the log files more readable, as what would previously have been logged as '<ArtifactReference at 0x1235478>' is now logged as the actual name of the artifact.

    Change-Id: I6189aa1390268cec379dd459fc3f4fecc71363b1
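The effect of adding these two methods can be sketched as follows. The class body is an assumption for illustration: the attribute name `basename` and the exact strings returned are not taken from the Morph source.

```python
class ArtifactReference(object):
    """Toy stand-in for distbuild's ArtifactReference (fields assumed)."""

    def __init__(self, basename):
        self.basename = basename  # hypothetical attribute name

    def __str__(self):
        # Used by '%s' formatting and str(): log the artifact's name.
        return self.basename

    def __repr__(self):
        # Used when the object appears inside a list or dict in a log line.
        return '<ArtifactReference %s>' % self.basename


ref = ArtifactReference('stage2-fhs-dirs-misc')
message = 'BC: got artifact: %s' % ref
```

Without `__repr__`, formatting a list of references would still show the default `<... object at 0x...>` form, which is why both methods are useful in log output.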
* distbuild: Hide a log message [Sam Thursfield, 2015-06-23, 1 file, -1/+4]
    The 'BC: got artifact: <distbuild.artifact_reference.ArtifactReference object at 0x7f84ea2b5c10>' message is only useful when _debug_build_output is true, if at all.

    Change-Id: I079b398e841d5508ecefd00167fb0d83be748ce6
* distbuild: Check cache status each time we enqueue new artifacts [Sam Thursfield, 2015-06-23, 1 file, -131/+174]
    This fixes an issue where distbuild would build the same artifact more than once. The problem occurs with a single distbuild controller, if multiple initiators request builds of the same thing at roughly the same time (which scripts/release-build in definitions.git does).

    This change also means that multiple distbuild controllers sharing a single artifact cache will be smart about sharing built artifacts. It does not mean that distbuild can handle having built artifacts removed from the cache while it is building stuff.

    The number of HTTP requests made to the shared artifact cache is higher with this patch, but these seem to take no more than 1 second and we only ever need to run one request before starting more builds, so there should be no noticeable impact on performance.

    Change-Id: Ib3246219a10ca95d40b8a21bd0fe53f32e46c1c9
* distbuild: Add docstrings to BuildController state machine [Sam Thursfield, 2015-06-23, 1 file, -2/+96]
    Hopefully this makes the code a little less cryptic. No functional changes.

    Change-Id: I615810e4eacdd5454731e07387b1dbb9eb348fd5
* Use protocol to validate incoming requests [Richard Ipsum, 2015-05-19, 2 files, -23/+49]
    Change-Id: I16680439b131e63d30eeff91814a1af643af6246
* Disable WC exec-output messages in log by default [Richard Ipsum, 2015-05-18, 1 file, -1/+3]
    Change-Id: I01a60d4ec187d5fab060f40947d97aa97013f7a7
* Disable logging of build output by default [Richard Ipsum, 2015-05-18, 2 files, -5/+13]
    Logging build output makes the controller logs difficult to read.

    Change-Id: I5b81ff9359ada969e964328eb1c2624ab6b9375a
* distbuild: Handle errors from socket [Sam Thursfield, 2015-05-15, 4 files, -3/+25]
    We found a distbuild controller stuck in a busy loop, with the logs full of the same error message repeated:

        ... _flush(): Exception 'IOError: [Errno 32] Broken pipe' from sock.write()

    We suspect this came about because the initiator disconnected without sending an EOF. The initiator was in a VM on a laptop so it seems possible that the host OS turned off the wireless adaptor without giving the VM a chance to close its connections gracefully.

    The busy loop is because nothing in the SocketBuffer class handles the SocketError events queued by the _flush() method. Unhandled events are ignored. So the SocketBuffer stays in 'w' state without ever shifting any data and never returns.

    Adding transitions to handle the SocketError event will fix the problem. If a socket error happens now in the same scenario, it will be handled as if the initiator disconnected.

    Change-Id: I0f6834f7186a01ca2bc74aef899a4cccbc891e51
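The "unhandled events are ignored" behaviour, and why adding a transition fixes it, can be sketched with a toy state machine. This is not distbuild's state machine code: the class, the event names and the 'idle' and 'closed' states are invented for illustration, and only the 'w' state name comes from the commit message.

```python
class TinyStateMachine(object):
    """Toy state machine: events with no matching transition are ignored."""

    def __init__(self, initial_state, transitions):
        self.state = initial_state
        self._transitions = transitions  # {(state, event): new_state}

    def handle(self, event):
        """Apply the event; return False if no transition matched."""
        key = (self.state, event)
        if key not in self._transitions:
            return False  # ignored: the machine stays where it was
        self.state = self._transitions[key]
        return True


# Before the fix: nothing handles a socket error while in the 'w' state.
without_fix = TinyStateMachine('w', {('w', 'flushed'): 'idle'})

# After the fix: the error is handled like a disconnection.
with_fix = TinyStateMachine('w', {
    ('w', 'flushed'): 'idle',
    ('w', 'socket-error'): 'closed',
})
```

Feeding `'socket-error'` to `without_fix` leaves it in `'w'`, which is the stuck situation described above, while `with_fix` moves to `'closed'`.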
* distbuild: Condense Initiator class to remove unnecessary duplication [Lauren Perry, 2015-05-15, 2 files, -129/+8]
    Create an InitiatorCommand class that accepts message_type and status_text parameters to be used by the distbuild-list-jobs, distbuild-status and distbuild-cancel commands to send request messages to the distbuild network.

    Change-Id: Ib686dcd7c370d802b612e9aaa1e3df76f0275fae
* Protocol check fix [Richard Ipsum, 2015-05-13, 1 file, -2/+2]
    This patch fixes an error where we can end up calling int(None) when we try to send an error response for a malformed message.

    Change-Id: Id3ee3298cfb6a5cb32e35fdc5916dab1e4c87a03
* Explain how to cancel a distbuild [Adam Coldrick, 2015-05-12, 2 files, -16/+28]
    Cancelling a distbuild with ctrl+c no longer cancels the build itself. This commit adds some output explaining what should be done to cancel the build as well as the local process.

    This commit also fixes a bug where the BuildStarted event would be sent each time a chunk finished building, since it was being sent in _queue_worker_builds. This is fixed by adding a new function to be called when the build graph annotation is complete which sends BuildStarted and then calls _queue_worker_builds, which no longer sends the BuildStarted event.

    Change-Id: I26ddea2c9080887f449e87004411ddffe4e583b7
* distbuild: Set job status to failed when sending exec-cancel [Adam Coldrick, 2015-05-12, 1 file, -0/+8]
    Currently jobs may continue running after exec-cancel is sent if exec-response takes a while to be sent back. This commit makes the job's state be set to 'failed' when exec-cancel is sent, so that the wait for exec-response doesn't matter.

    Change-Id: I858d9efcba38c81a912cf57aee2bdd8c02cb466b
* Revert "distbuild: Track worker jobs using artifact basename only" [Adam Coldrick, 2015-05-12, 1 file, -29/+48]
    This reverts commit 75ef3e9585091b463b60d2981b3b7283a2ea8eab.

    It turns out that the JobQueue may need to handle more than one build of the same artifact at once, as one may be in the process of being cancelled when another build of the same artifact is requested. So they do need an ID separate from the artifact ID.

    Change-Id: Ifa0c06987795a4aebdadbd9927de27919377b0a2
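Why jobs need an ID separate from the artifact ID can be sketched like this. The classes below are simplified stand-ins, not the real Job and JobQueue: the counter-based IDs, the method names and the state strings are assumptions made for the example.

```python
import itertools


class Job(object):
    """Toy job record; 'state' values here are illustrative only."""

    def __init__(self, job_id, artifact_basename):
        self.id = job_id
        self.artifact_basename = artifact_basename
        self.state = 'queued'


class JobQueue(object):
    """Toy queue keyed by job ID, so one artifact may have several jobs."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._jobs = {}  # job id -> Job

    def add(self, artifact_basename):
        job = Job(next(self._next_id), artifact_basename)
        self._jobs[job.id] = job
        return job

    def jobs_for(self, artifact_basename):
        return [job for job in self._jobs.values()
                if job.artifact_basename == artifact_basename]


queue = JobQueue()
old_job = queue.add('stage1-binutils-misc')
old_job.state = 'cancelling'          # first build is still being cancelled
new_job = queue.add('stage1-binutils-misc')  # same artifact requested again
```

If the queue were keyed by artifact basename alone, the second `add()` would overwrite or collide with the job that is still being cancelled.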
* Clean up artifact serialisation [Adam Coldrick, 2015-05-12, 5 files, -76/+89]
    We no longer serialise whole artifacts, so it doesn't make sense for things to still refer to serialise-artifact and similar.

    Change-Id: Id4d563a07041bbce77f13ac71dc3f7de39df5e23
* Remove % from debug statement [Richard Ipsum, 2015-05-12, 1 file, -1/+1]
    Change-Id: I674c39149aad82c07c85d2db3207280b91dfa292
* Add a common func for handling build termination [Richard Ipsum, 2015-05-12, 1 file, -20/+11]
    Change-Id: I95fbfcb2ed6a8ffdd946d36eacc030b4ae1b9b21
* Add GraphProgress messages [Richard Ipsum, 2015-05-12, 5 files, -10/+91]
    Adds distinct message types to give us more flexibility over message handling now that we have multiple initiator types with different requirements.

    Change-Id: Ib2af8736b83d66ef20a8e37591ca68c9441b6497
* distbuild: Fix protocol version checking for distbuild commands [Lauren Perry, 2015-05-11, 1 file, -0/+14]
    This fixes an issue with distbuild-status and distbuild-cancel crashing due to their respective Initiator classes not handling 'build-failed' messages.

    Change-Id: Ia35c8e14a30e3a9bdea1e44f7726181db75dfbe5
* distbuild: Builds currently break due to job being set twice [Lauren Perry, 2015-05-11, 1 file, -1/+0]
    Remove extra job set line as self._current_job no longer exists in worker_build_scheduler.py

    Change-Id: I8849742587f11f83ebba64f48eaf97fac83e6589
* distbuild: Fix initiator hanging when protocol errors occur [Sam Thursfield, 2015-05-07, 1 file, -19/+34]
    If the initiator sends an invalid build-request message, it will now exit with the following sort of error:

        ERROR: Failed to build baserock:baserock/definitions f2d78e9b7221bca65cba53af3f3b50d50d90628f systems/build-system-x86_64.morph: Invalid build-request message. Check you are using a supported version of Morph. This distbuild network uses protocol version 2.

    Previously, the controller would log an error to its log file, but it would not send any response to the initiator so the initiator would hang forever.

    Behaviour is the same as before for the case where the initiator sends a build-request message with the wrong protocol version: the initiator will exit with an error message.

    Change-Id: I94fdee02bc701d4a679a0261b3c46dbdf14cfcaf
* distbuild: Allow WorkerConnection to track multiple in-flight jobs [Sam Thursfield, 2015-05-07, 1 file, -108/+115]
    Although in theory a worker should only ever have one job at once, in practice this assumption doesn't hold, and can cause serious confusion.

    The worker (implemented in the JsonRouter class) will actually queue up exec-request messages and run the oldest one first. I saw a case where, due to a build not being correctly cancelled, the WorkerConnection.current_job attribute got out of sync with what the worker was actually building. This led to an error when trying to fetch the built artifacts, as the controller tried to fetch artifacts for something that wasn't actually built yet, and everything got stuck.

    To prevent this from happening, we either need to remove the exec-request queue in the worker-daemon process, or make the WorkerConnection class cope with multiple jobs at once. The latter seems like the more robust approach, so I have done that.

    Another bug this fixes is the issue where, if the 'Computing build graph' (serialise-artifact) step of a build completes on the controller while one of its WorkerConnection objects is waiting for artifacts to be fetched by the shared cache from the worker, the build hangs. This would happen because the WorkerConnection assumed that any HelperResponse message it saw was the result of its request, so would send a _JobFinished before caching had actually finished if there was an unrelated HelperResponse received in the meantime. It now checks the request ID of the HelperResponse before calling the code that is now in the new _handle_helper_result_for_job() function.

    Change-Id: Ia961f333f9dae77405b58c82c99a56e4c43e1628
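The request ID check described in the last paragraph can be sketched as below. This is an illustration under assumptions, not the WorkerConnection code: the class name, the method names and the `'id'` key of the response are hypothetical.

```python
class WorkerConnectionSketch(object):
    """Toy model: only act on helper responses that we actually asked for."""

    def __init__(self):
        self._pending = {}   # helper request id -> job name (assumed shape)
        self.finished = []   # jobs whose caching step has completed

    def expect_helper_result(self, request_id, job):
        """Remember that 'job' is waiting on the helper request 'request_id'."""
        self._pending[request_id] = job

    def handle_helper_response(self, response):
        """Return True if the response belonged to one of our jobs."""
        job = self._pending.pop(response['id'], None)
        if job is None:
            return False  # unrelated HelperResponse: ignore it
        self.finished.append(job)
        return True


conn = WorkerConnectionSketch()
conn.expect_helper_result('req-1', 'stage1-binutils-misc')
conn.expect_helper_result('req-2', 'stage2-fhs-dirs-misc')
```

Keeping the pending requests in a dictionary also covers the first fix in the commit, since several jobs can be in flight on one connection at the same time.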
* distbuild: Track worker jobs using artifact basename only [Sam Thursfield, 2015-05-07, 1 file, -34/+23]
    Rather than generating IDs for each job, identify them by what artifact is going to be built. Artifact cache IDs need to be unique in any case.

    Change-Id: I37a0277931c45a8fb6e37ae7c2a6a942ae732fdd
* distbuild: Track state of a job in the Job class [Sam Thursfield, 2015-05-07, 1 file, -22/+31]
    This is a bit more comprehensive than the previous approach of using public instance attributes, and I find it easier to reason about.

    Change-Id: I2942ecf53c95e29893dc0982d38aec689ebfa614
* distbuild: Make Jobs class into a more generic JobQueue [Sam Thursfield, 2015-05-07, 1 file, -11/+17]
    The intention is to allow workers to use this class for job tracking, in addition to the controller.

    Change-Id: I355861086764476b383266bab7e850af5e05bc54
* Update distbuild protocol version to 3 [Sam Thursfield, 2015-05-05, 1 file, -1/+1]
    Commit 84096556ea54d4af236f1fe5f7ccf61c1343016f changed the protocol without changing the protocol version.

    Versions of Morph between that one and this one may hang forever in 'morph distbuild' if trying to build on an incompatible distbuild network.

    Change-Id: I9194657f59a4b4a61a6fde7bd85105b56ca1a78d
* Fix partial distbuilds of non-existent components [Adam Coldrick, 2015-04-30, 1 file, -8/+9]
    Currently, attempting to distbuild a component which is not in the given system or doesn't exist at all will cause the full system to be built, rather than an error raised. This is because the logic which checks that all components were found is completely nonsensical. This commit makes it actually check the right thing.

    Change-Id: Ide4d7e3fa5f71e433f3a7b7c8c387fe594c92e43
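A check that "all components were found" could look roughly like the sketch below. The function name and the flat lists of names are assumptions; the real code works on the build graph rather than on plain strings.

```python
def find_missing_components(requested, available):
    """Return the requested component names that are not in 'available'.

    Hypothetical helper: order of 'requested' is preserved so that an
    error message can list the names as the user typed them.
    """
    available = set(available)
    return [name for name in requested if name not in available]


missing = find_missing_components(
    ['stage2-fhs-dirs', 'no-such-chunk'],
    ['stage1-binutils', 'stage2-fhs-dirs'])
```

A caller would raise an error whenever the returned list is non-empty, instead of silently falling back to building the whole system.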
* Ignore BuildProgress messages [Richard Ipsum, 2015-04-29, 1 file, -1/+0]
    Once building starts we close the json machine on the initiator, but we may have received build progress events between processing our build-started event and closing the json machine. Since there is not a nice way to tell the different types of build progress apart (they all use BuildProgress) we will ignore all BuildProgress messages for now.

    A possible fix for this is to introduce GraphProgress messages so that we can report the building of the graph without reporting other types of BuildProgress ("Waiting for worker" or "Transferring artifact to cache") that we're not interested in.

    Note that we will still report build failures or build success, so if there's a mistake in the definitions this will be reported before the detach can occur. Similarly, if the system is already built this will be reported before the detach happens.

    Change-Id: Ia006ccfba826d2c91f4dea6c028ecdcb5a2b02d6
* Remove n_state_machines_of_type function [Richard Ipsum, 2015-04-29, 2 files, -4/+2]
    Change-Id: Icfc3d1aa125196e208d7ac35f43f06c5f5a21ba4
* distbuild: Add distbuild status command [Lauren Perry, 2015-04-29, 6 files, -10/+100]
    Adds a command to get the status of all recently run distbuilds for a given server (e.g. Running, Finished, Failed, Cancelled), so that you can tell whether a build started via distbuild-start has finished or otherwise exited without going through the server's log files.

    Change-Id: I5ce9fe54ae7b1bd8fe3e0d629f615042be8827ed
* distbuild: Add distbuild start and cancel functionality [Lauren Perry, 2015-04-29, 5 files, -8/+221]
    Add command for distbuild-start to build_plugin in morphlib, and create a boolean parameter to inform the initiator whether to disconnect the controller and leave the build running remotely.

    Add distbuild-cancel command to parse currently-running distbuild build-request IDs and cancel the one matching the given argument.

    Change-Id: I458a5767bb768ceb2b4d8876adf1c86075d452bd
* distbuild: Add protocol version checking for list-jobs command [Lauren Perry, 2015-04-29, 3 files, -15/+30]
    Currently, the distbuild-list-jobs command will fail if morph is outdated (i.e. protocol version for client and distbuild network don't match); a protocol_version field has been added to the list-jobs request message to fix this.

    Moved version check outside build-request message to reduce duplication in new functions.

    Generalised the list-request output to reduce duplication for any further additions that may require a message output.

    Change-Id: I28e733cbfe8c89e8c11427df5d40ab275abd313c
* distbuild: Fix NameError when worker disconnects [Sam Thursfield, 2015-04-28, 1 file, -1/+1]
    Change-Id: Ifdaa92c209a4ca488c4447911bef9b1bf7d61438
* Make distbuild use an ArtifactReference not an Artifact internally when building [Adam Coldrick, 2015-04-24, 2 files, -28/+30]
    We no longer serialise entire artifacts, so the output of deserialise_artifact is an ArtifactReference. This commit changes stuff in distbuild to know how to deal with that rather than an Artifact.

    Change-Id: I79b40d041700a85c25980e3bd70cd34dedd2a113
* Don't serialise the entire build graph [Adam Coldrick, 2015-04-24, 2 files, -219/+131]
    The controller no longer needs to know everything about an artifact as the workers can calculate the build graph themselves quickly. This reduces the amount of data which needs to be serialised by serialise-artifact, making the yaml dump quicker.

    Change-Id: I6bd0bed14c2efb2f499e9d6f0a97e6188353121a
* distbuild: Add test suite for distbuild-helper [Sam Thursfield, 2015-04-22, 4 files, -5/+70]
    This is mostly to check that the 'cancel entire subprocess tree' works as expected. Revert that patch and the test fails.

    There are also some tweaks included in this commit.

    Change-Id: If297522e6589ebb3a07dac66a39eb243789e53aa
* distbuild: Don't create a directory for build output until we get some [Sam Thursfield, 2015-04-21, 1 file, -9/+17]
    Currently, it leaves around empty directories called build-00, build-01, etc. when you run a distbuild that fails to get as far as building something, which is annoying.

    Change-Id: Id3466e248c327dedaf973bc2fe22d42e5c5570d4
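Deferring directory creation until the first output arrives can be sketched as follows. The class and method names are invented for this example and the log file name is arbitrary; only the `build-00` directory name echoes the commit message.

```python
import os
import tempfile


class LazyOutputDir(object):
    """Toy helper: the directory is created on the first write, not before."""

    def __init__(self, path):
        self.path = path  # nothing is created on disk yet

    def write(self, filename, text):
        # Create the directory only now that there is something to store.
        if not os.path.isdir(self.path):
            os.makedirs(self.path)
        with open(os.path.join(self.path, filename), 'a') as f:
            f.write(text)


output = LazyOutputDir(os.path.join(tempfile.mkdtemp(), 'build-00'))
```

A build that fails before producing any output never calls `write()`, so no empty `build-00` directory is left behind.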
* distbuild: Kill the whole process tree when cancelling a build [Sam Thursfield, 2015-04-21, 1 file, -2/+4]
    We discovered a case where a user of distbuild began a build of 'qtbase', then cancelled it 2 minutes in. The `morph worker-build` process didn't exit for over an hour -- it ran right through until the chunk artifacts had been created. Then it exited with code -9 (SIGKILL).

    This seems to be due to the fact that SIGKILL doesn't kill subprocesses, and so any file descriptors the subprocesses have open will remain open.

    If we set up the `morph worker-build` process as a process group leader, using os.setpgid(), then we can use os.killpg() to kill the entire process group. This should ensure that the `morph worker-build` command exits straight away, as all of its subprocesses will be killed at the same time it is.

    Change-Id: I38707d18004d8c5bc994fd0cb99e90fd5def58e4
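The os.setpgid() / os.killpg() approach can be sketched on a POSIX system as below. This is a minimal illustration, not the distbuild-helper code: the two function names are made up, and `subprocess.Popen` with `preexec_fn` is one possible way of making the child a process group leader.

```python
import os
import signal
import subprocess


def start_in_own_group(argv):
    """Start argv as the leader of a new process group (POSIX only)."""
    # preexec_fn runs in the child before exec; setpgid(0, 0) puts the
    # child into a new group whose ID equals the child's own PID.
    return subprocess.Popen(argv, preexec_fn=lambda: os.setpgid(0, 0))


def kill_process_group(proc):
    """SIGKILL the whole group led by proc and return its exit status."""
    # The child is the group leader, so its PID is also the group ID.
    os.killpg(proc.pid, signal.SIGKILL)
    return proc.wait()
```

Any subprocesses started by the child inherit its process group, so a single `os.killpg()` call reaches all of them at once.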
* distbuild: Move SubprocessEventSource into its own module [Sam Thursfield, 2015-04-21, 2 files, -0/+106]
    Previously it was only available in the distbuild-helper program. Moving it to its own module means we can test it and reuse it.

    This commit also adds a docstring to the class.

    Change-Id: Iaf7854048cf0ff463a87894f1f500cdcb6a34d8b
* distbuild: Fix log message when listening for connections [Sam Thursfield, 2015-04-21, 1 file, -1/+1]
    A log message was printing the 'remote name' of a socket that was listening for connections. There isn't one, so the message always shows this:

        2015-04-14 17:05:19 INFO Binding socket to sam-jetson-mason
        2015-04-14 17:05:19 INFO Listening at None

    Print the local name instead:

        2015-04-14 17:05:19 INFO Binding socket to sam-jetson-mason
        2015-04-14 17:05:19 INFO Listening at 10.24.2.125:7878

    Change-Id: I22c1bbe8c9f78ef63e587b6ace516afc861fae0f
* distbuild: Add distbuild-list-jobs function [Lauren Perry, 2015-04-17, 5 files, -26/+116]
    Add InitiatorListJobs class and list-jobs message template, add distbuild-list-jobs to morph commandlist, send running job information back to initiator, split out handling of build request and list-jobs messages to separate functions and change generating a random integer to UUID for message identification.

    Change-Id: Id02604f2c1201dbc10f6bbd7f501b8ce1ce0deae
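Using a UUID rather than a random integer to identify a message can be sketched as below. The message layout is an assumption: the `'type'` and `'id'` keys and the function name are hypothetical, and only the `list-jobs` message name comes from the commit message.

```python
import json
import uuid


def new_list_jobs_request():
    """Build a list-jobs style request with a UUID as its identifier.

    The field names are illustrative; the real message template may differ.
    """
    return {
        'type': 'list-jobs',
        'id': uuid.uuid4().hex,  # 32 hex characters, random (version 4) UUID
    }


request = new_list_jobs_request()
encoded = json.dumps(request)  # messages are sent as JSON text
```

A version 4 UUID carries 122 random bits, so two initiators picking the same identifier is far less likely than with a small random integer.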
* distbuild: Remove unneeded debugging statement [Sam Thursfield, 2015-04-09, 1 file, -6/+0]
    A JsonMachine object can be set to log all messages that it sends, we don't need to handle it in the WorkerConnection class as well.

    Change-Id: Idfdc06953363a016708b5dda50c978eb93b1113c
* distbuild: Disable extra message debugging in worker log files [Sam Thursfield, 2015-04-09, 1 file, -1/+0]
    Worker log files are overly verbose with this enabled, each message is dumped 6 times:

        2015-03-19 11:00:11 DEBUG JsonMachine: Received: '"{...}\\n"\n'
        2015-03-19 11:00:11 DEBUG JsonMachine: line: '"{...}\\n"'
        2015-03-19 11:00:11 DEBUG JsonRouter: got msg: {...}
        2015-03-19 11:00:11 DEBUG JsonMachine: Sending message {...}
        2015-03-19 11:00:11 DEBUG JsonMachine: As '"{...}\\n"'
        2015-03-19 11:00:11 DEBUG JsonRouter: sent to client: {...}

    With this setting disabled, the message is only logged by the JsonRouter class, so appears only twice:

        2015-03-19 11:00:11 DEBUG JsonRouter: got msg: {...}
        2015-03-19 11:00:11 DEBUG JsonRouter: sent to client: {...}

    We've not seen any issues with message encoding/decoding recently so I think it's safe to disable this debugging output by default.

    Change-Id: I7d22ed29e81d6c594cb2c639abf3b40bfb27e3ad
* distbuild: Make 'Current jobs' log message more useful [Sam Thursfield, 2015-04-09, 1 file, -2/+11]
    It's good to know which jobs are in progress and which are queued, when reading morph-controller.log.

    Old output:

        2015-04-09 10:40:58 DEBUG Current jobs: ['3f647933a1effbb128c857225ba77e9aa775d92314ef0acf3e58e084a7248c73.chunk.stage1-binutils-misc', 'd7279e4179a31d8a3a98c27d5b01ad1bb7387c7fab623fee1086ab68af2784bb.chunk.stage2-fhs-dirs-misc']

    New output:

        2015-04-09 10:40:58 DEBUG Current jobs: ['3f647933a1effbb128c857225ba77e9aa775d92314ef0acf3e58e084a7248c73.chunk.stage1-binutils-misc (given to worker1:3434)', 'd7279e4179a31d8a3a98c27d5b01ad1bb7387c7fab623fee1086ab68af2784bb.chunk.stage2-fhs-dirs-misc (given to worker2:3434)']

    Change-Id: Ie89e6723b0da5f930813591a3166301fd3966804