path: root/src/buildstream/_scheduler
Commit message | Author | Age | Files | Lines
* _scheduler/scheduler.py: Use Messenger convenience functions | Tristan van Berkom | 2020-12-22 | 1 | -3/+1
* Refactor: Use explicit invocation for retrying jobs. | Tristan van Berkom | 2020-12-10 | 1 | -15/+0
  We created the State object in the core for the purpose of advertising state to the frontend: the frontend can register callbacks and get updates on state changes (implicit invocation in the frontend). State always belongs to the core, and the frontend can only read it.
  When the frontend asks the core to do something, this should always be done with an explicit function call, and preferably not via the State object, as this confuses the role of State, which is only a read-only desk for advertising state.
  This was broken (implemented backwards) for job retries: we had the frontend telling State "It has been requested that this job be retried!", and then the core registering callbacks on that frontend request. Implicit invocation should never happen in this direction (in fact, the core should never have to register callbacks on the State object at all).
  Summary of changes:
  * _stream.py: Change _failure_retry(), which was for some reason private although it is called from the frontend, to an explicit function call named "retry_job()". Instead of calling into the State object and causing core-side callbacks to be triggered, later to be handled by the Scheduler, implement the retry directly in the Stream. Since this implementation deals only with Queues and State, which already belong directly to the Stream object, there is no reason to trouble the Scheduler with it.
  * _scheduler.py: Remove the callback handling the State "task retry" event.
  * _state.py: Remove the task retry callback chain completely.
  * _frontend/app.py: Call stream.retry_job() instead of stream.failure_retry(), now passing along the task's action name rather than the task's ID. This API now assumes that Stream.retry_job() can only be called on a task which originates from a scheduler Queue, and expects to be given the action name of the queue in which the given element has failed and should be retried.
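A minimal sketch of the invocation direction described above. The `State`, `Queue` and `Stream` classes here are hypothetical and heavily simplified, not BuildStream's real signatures; the point is that the frontend only registers callbacks on `State` and reads it, while a retry is an explicit call on `Stream` that re-enqueues the element in the queue named by its action name.

```python
class State:
    """Read-only advertising desk: the core updates it, the frontend listens."""

    def __init__(self):
        self._task_changed_cbs = []
        self.failed = {}  # (action_name, element_name) -> True

    def register_task_changed_callback(self, cb):
        # Only the frontend registers callbacks here; the core never does.
        self._task_changed_cbs.append(cb)

    def task_changed(self, action_name, element_name, status):
        # Called by the core only, to advertise a change to the frontend.
        for cb in self._task_changed_cbs:
            cb(action_name, element_name, status)


class Queue:
    def __init__(self, action_name):
        self.action_name = action_name
        self.pending = []

    def enqueue(self, element_name):
        self.pending.append(element_name)


class Stream:
    def __init__(self, queues, state):
        self.queues = {q.action_name: q for q in queues}
        self.state = state

    def retry_job(self, action_name, element_name):
        # Explicit invocation: the frontend names the queue (by action name)
        # in which the element failed, and the Stream re-enqueues it itself.
        queue = self.queues[action_name]
        self.state.failed.pop((action_name, element_name), None)
        queue.enqueue(element_name)
        self.state.task_changed(action_name, element_name, "retrying")
```

In this shape the core never subscribes to anything on `State`, so there is no callback chain to follow when reading the retry path.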
* job.py: Simplify handling of messages through the parent-child pipe [bschubert/optimize-job] | Benjamin Schubert | 2020-12-05 | 1 | -49/+4
  Now that the only messages going through the pipe are messages for the messenger, we can remove the envelope and only ever handle the messenger's messages.
* job.py: Stop sending errors through the child-parent pipe, and set it directly | Benjamin Schubert | 2020-12-05 | 1 | -24/+1
  Since we run in a single process, we do not need this distinction anymore.
* job.py: Stop sending the result from a job through the pipe | Benjamin Schubert | 2020-12-05 | 1 | -30/+8
  This is not needed now that jobs run in the same process; we can just return the value from the method.
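Once jobs run in threads of the same process, both the result and any error can travel back through an ordinary future instead of a pipe. The sketch below uses the standard library's `concurrent.futures.ThreadPoolExecutor` to show the idea; `child_process()` and `run_jobs()` are hypothetical names, and this is not BuildStream's job implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def child_process(element_name):
    # Hypothetical job body. Running in a thread, its return value and any
    # exception reach the caller through the Future, so nothing needs to be
    # pickled and sent back through a pipe.
    if element_name.endswith(".bad"):
        raise RuntimeError(f"failed to build {element_name}")
    return {"element": element_name, "success": True}


def run_jobs(names, max_workers=2):
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(child_process, name) for name in names}
        for name, future in futures.items():
            try:
                # result() returns the job's value or re-raises its error.
                results[name] = future.result()
            except RuntimeError as e:
                errors[name] = str(e)
    return results, errors
```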
* job.py: Remove the ability to send child data to the parent | Benjamin Schubert | 2020-12-05 | 3 | -39/+7
  This is currently only used by the ElementJob to send back information about the workspace, which we can get directly now that we run in the same process.
  * elementjob.py: Remove the returning of the workspace dict. This is directly available in the main thread.
  * queue.py: Use the workspace from the element directly instead of going through child data
* scheduler.py: Reconnect signal handlers sooner [bschubert/no-multiprocessing] | Benjamin Schubert | 2020-12-04 | 1 | -1/+1
  This reduces a race condition where a SIGINT received shortly after restarting the scheduler would cause the scheduler to crash.
* scheduler.py: Use threads instead of processes for jobs | Benjamin Schubert | 2020-12-04 | 6 | -293/+187
  This changes how the scheduler works and adapts all the code that needs adapting in order to run in threads instead of in subprocesses, which helps with Windows support and will allow some simplifications in the main pipeline.
  This addresses the following issues:
  * Fix #810: All CAS calls are now made in the master process, and thus share the same connection to the CAS server
  * Fix #93: We don't start as many child processes anymore, so the risk of starving the machine is much lower
  * Fix #911: We now use `forkserver` for starting processes. We also don't use subprocesses for jobs, so we should be starting fewer subprocesses
  And the following high-level changes were made:
  * cascache.py: Run the CasCacheUsageMonitor in a thread instead of a subprocess.
  * casdprocessmanager.py: Ensure start and stop of the process are thread safe.
  * job.py: Run the child in a thread instead of a process, and adapt how we stop a thread, since we can't use signals anymore.
  * _multiprocessing.py: Not needed anymore, as we are not using `fork()`.
  * scheduler.py: Run the scheduler with a thread pool to run the child jobs in. Also adapt how our signal handling is done, since we are not receiving signals from our children anymore, and can't kill them the same way.
  * sandbox: Stop using blocking signals to wait on the process, and use timeouts all the time.
  * messenger.py: Use a thread-local context for the handler, to allow for multiple parameters in the same process.
  * _remote.py: Ensure the start of the connection is thread safe.
  * _signal.py: Allow blocking entry into the signal context managers by setting an event. This ensures no thread runs long-running code while we have asked the scheduler to pause. This also ensures all the signal handlers are thread safe.
  * source.py: Change the check around saving the source's ref. We are now running in the same process, and thus the ref will already have been changed.
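The messenger.py change relies on the fact that a `threading.local` object gives each thread its own attribute namespace, so every job thread can install its own message handler without affecting the others. The `Messenger` class below is a hypothetical, minimal sketch of that idea under those assumptions, not BuildStream's real messenger API.

```python
import contextlib
import threading


class Messenger:
    """Hypothetical sketch: one message handler per thread."""

    def __init__(self, default_handler):
        self._default_handler = default_handler
        self._local = threading.local()  # per-thread storage

    @contextlib.contextmanager
    def handler(self, fn):
        # Install fn for the current thread only; restore the previous one after.
        previous = getattr(self._local, "handler", None)
        self._local.handler = fn
        try:
            yield
        finally:
            self._local.handler = previous

    def message(self, text):
        # Threads that never installed a handler fall back to the default.
        fn = getattr(self._local, "handler", None) or self._default_handler
        fn(text)
```

For example, two job threads can each wrap their work in `with messenger.handler(job_log.append):` and their messages stay separate, while the main thread keeps using the default handler.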
* _stream.py: Make `_enqueue_plan` a timed activity [bschubert/notify-prepare-plan] | Benjamin Schubert | 2020-11-04 | 1 | -1/+5
  `_enqueue_plan` can take a long time, as in some cases it triggers a verification of the 'cached' state of sources.
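Wrapping a slow step in a timed activity means the user sees a start message and then a completion message with the elapsed time, instead of apparent silence. The context manager below is a hypothetical stand-in written with `contextlib`; it is not the signature of BuildStream's `Messenger.timed_activity()`, and the activity name in the usage is invented.

```python
import contextlib
import time


@contextlib.contextmanager
def timed_activity(log, activity_name):
    # Hypothetical stand-in: log START, then SUCCESS or FAILURE with the
    # elapsed time in seconds, so a slow step is visible to the user.
    start = time.monotonic()
    log.append(("START", activity_name))
    try:
        yield
    except Exception:
        log.append(("FAILURE", activity_name, time.monotonic() - start))
        raise
    log.append(("SUCCESS", activity_name, time.monotonic() - start))
```

Usage would be along the lines of `with timed_activity(log, "Preparing work plan"): ...` around the slow enqueueing code.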
* Restore task element name / element name distinction in UI | Tristan van Berkom | 2020-10-27 | 1 | -28/+21
  This behavior regressed a while back when the messenger object was introduced in 0026e379 from merge request !1500.
  Main behavior change:
  - Messages in the master log always appear with the task element's element name and cache key, even if the element or plugin issuing the log line is not the primary task element.
  - Messages logged in the task-specific log retain the context of the element names and cache keys which are issuing the log lines.
  Changes include:
  * _message.py: Added the task element name & key members
  * _messenger.py: Log the element key as well if it is provided
  * _widget.py: Prefer the task name & key when logging; we fall back to the element name & key in case messages are being logged outside of any ongoing task (main process/context)
  * job.py: Unconditionally stamp messages with the task name & key. Also removed some unused parameters here, clearing up an XXX comment
  * plugin.py: Add new `_message_kwargs` instance property; it is the responsibility of the core base class to maintain the base keyword arguments which are to be used as kwargs for Message() instances created on behalf of the issuing plugin. Use this method to construct messages in Plugin.__message() and to pass kwargs along to Messenger.timed_activity().
  * element.py: Update the `_message_kwargs` when the cache key is updated
  * tests/frontend/logging.py: Fix test to expect the cache key in the logline
  * tests/frontend/artifact_log.py: Fix test to expect the cache key in the logline
  Fixes #1393
* Adding _DisplayKey type | Tristan van Berkom | 2020-10-27 | 2 | -3/+3
  Instead of passing around untyped tuples for cache keys, let's have a clearly typed object for this. This makes for more readable code, and additionally corrects the data model's statement of intent that some cache keys should be displayed as "dim": the core now informs the frontend whether the cache key is "strict" or not, allowing the frontend to decide how to display a strict or non-strict key.
  This patch does the following:
  * types.py: Add _DisplayKey
  * element.py: Return a _DisplayKey from Element._get_display_key()
  * Other sources: Updated to use the display key object
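A sketch of the idea of replacing a tuple with a typed key object. The `DisplayKey` dataclass and its field names (`full`, `brief`, `strict`) are assumptions for illustration, not the real definition in types.py; the sketch shows the core reporting *whether* a key is strict while the frontend alone decides to render non-strict keys dimmed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DisplayKey:
    # Hypothetical fields: not necessarily those of BuildStream's _DisplayKey.
    full: str     # complete cache key
    brief: str    # abbreviated form used in log lines
    strict: bool  # whether this is the strict cache key


def render_key(key: DisplayKey, dim: Callable[[str], str]) -> str:
    # Presentation is a frontend decision: non-strict keys are dimmed here,
    # but the core only reports the `strict` flag.
    return key.brief if key.strict else dim(key.brief)
```

Compared with a bare tuple, attribute names make call sites self-describing and the frozen dataclass keeps the key immutable once created.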
* scheduler.py: Invoke the ticker callback at the end of run() [juerg/scheduler-ticker] | Jürg Billeter | 2020-10-26 | 1 | -0/+3
  This allows the frontend to render pending messages.
* _stream.py: Pull missing artifacts in push() [juerg/push] | Jürg Billeter | 2020-09-29 | 1 | -1/+1
  As per #819, BuildStream should pull missing artifacts by default. The previous behavior was to only pull missing buildtrees. A top-level `--no-pull` option can easily be supported in the future.
  This change makes it possible to use a single scheduler session (with concurrent pull and push jobs). This commit also simplifies the code as it removes the `sched_error_action` emulation, using the regular frontend code path instead.
* element.py: Add skip_uncached parameter to _skip_push() | Jürg Billeter | 2020-09-29 | 1 | -1/+6
  This allows proper error handling when pushing an uncached element should result in a failure (bst artifact push), not a skipped job (bst build).
* source.py: Remove BST_KEY_REQUIRES_STAGE | Tristan van Berkom | 2020-09-24 | 1 | -3/+2
  Refactored this to remove unneeded complexity in the code base, as described here:
  https://lists.apache.org/thread.html/r4b9517742433f07c79379ba5b67932cfe997c1e64965a9f1a2b613fc%40%3Cdev.buildstream.apache.org%3E
  Changes:
  * source.py: Added private Source._cache_directory() context manager. We also move the assertion about nodes which are safe to write to a bit lower in Source._set_ref(), as it was unnecessarily early. When tracking a workspace, the ref will be None and will remain None afterwards, so it is not a problem that a workspace's node is a synthetic one, as tracking will never affect it.
  * local plugin: Implement get_unique_key() and stage() using the new context manager in order to optimize staging and cache key calculations here.
  * workspace plugin: Implement get_unique_key() and stage() using the new context manager in order to optimize staging and cache key calculations here.
  * trackqueue.py: No special casing with Source._is_trackable()
* _state.py: Use separate task identifier | Jürg Billeter | 2020-09-10 | 2 | -5/+14
  `State.add_task()` required the job name to be unique in the session. However, the tuple `(action_name, full_name)` is not guaranteed to be unique. E.g., multiple `ArtifactElement` objects with the same element name may participate in a single session. Use a separate task identifier to fix this.
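A minimal sketch of keying tasks by a generated identifier rather than by `(action_name, full_name)`. The `TaskRegistry` class and its methods are hypothetical and do not reproduce the real `State.add_task()` signature; they only show why two tasks with identical names no longer collide.

```python
import itertools


class TaskRegistry:
    """Hypothetical sketch: tasks keyed by a generated id, not by name."""

    def __init__(self):
        self._ids = itertools.count(1)  # monotonically increasing ids
        self.tasks = {}

    def add_task(self, action_name, full_name):
        # Two tasks may share (action_name, full_name); each still gets
        # its own id, so neither overwrites the other.
        task_id = next(self._ids)
        self.tasks[task_id] = (action_name, full_name)
        return task_id

    def remove_task(self, task_id):
        del self.tasks[task_id]
```

Callers then keep the returned id and use it for later updates and removal instead of reconstructing the name tuple.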
* job.py: Remove ability of job classes to send custom messages [bschubert/remove-custom-sched-messages] | Benjamin Schubert | 2020-08-23 | 1 | -43/+0
  We previously sent custom messages from child jobs to parent jobs, for example for reporting the cache size. This is no longer used by the current implementation, so let's remove it entirely.
* _messenger.py: Make `timed_suspendable` public and use it in job.py [bschubert/timed-suspendable] | Benjamin Schubert | 2020-08-22 | 1 | -20/+9
  This reduces the amount of code duplication.
* Rework handling of cached failures | Abderrahim Kitouni | 2020-07-29 | 1 | -35/+0
* element.py: move printing the build environment from elementjob.py | Abderrahim Kitouni | 2020-07-29 | 1 | -10/+0
* scheduler.py: Remove all usage of notifications | Benjamin Schubert | 2020-07-06 | 2 | -76/+15
  Call the relevant scheduler methods directly from the stream.
* scheduler.py: Remove notifications from scheduler to stream | Benjamin Schubert | 2020-07-06 | 1 | -12/+6
  This removes all remaining notifications coming from the scheduler, and replaces them with callbacks.
* _stream.py: Stop using a 'RUNNING' event to know the state of the scheduler | Benjamin Schubert | 2020-07-06 | 1 | -7/+0
  The stream is itself calling the `run` method on the scheduler, so we don't need another indirection.
* _stream.py: Stop using a 'TERMINATED' event to know the state of the scheduler | Benjamin Schubert | 2020-07-06 | 1 | -2/+0
  We are calling the scheduler, and the fact that it returns correctly already tells us this.
* _stream.py: Stop using a 'SUSPENDED' event to know the state of the scheduler | Benjamin Schubert | 2020-07-06 | 1 | -3/+0
  We are calling the scheduler, and the fact that it returns correctly already tells us this.
* scheduler.py: Pass all 'retry' operations through the state | Benjamin Schubert | 2020-07-06 | 1 | -3/+1
  Stop using 'Notifications' for retries; the state is the one handling the callbacks required for every change in the status of elements.
* _state.py: Only use a single place of truth for the start time | Benjamin Schubert | 2020-07-06 | 1 | -5/+2
  This moves all implementations of 'start_time' into a single place for easier handling, and removes notification round-trips.
* scheduler.py: Remove task-based notifications and use the state | Benjamin Schubert | 2020-07-06 | 1 | -19/+5
  The State is the interface between the two; there is no need to do multiple round-trips to handle such notifications.
* scheduler.py: Remove 'Message' notification type, use the messenger | Benjamin Schubert | 2020-07-06 | 3 | -18/+7
  The messenger should be the one receiving messages directly; we don't need this indirection.
* Completely abolish job pickling. [tristan/nuke-pickle-jobber] | Tristan van Berkom | 2020-06-15 | 3 | -224/+1
* _pluginfactory: Delegating the work of locating plugins to the PluginOrigin | Tristan van Berkom | 2020-05-28 | 1 | -1/+1
  This way we split up the logic of how to load plugins from different origins into their respective classes.
  This commit also:
  o Introduces PluginType (which is currently either SOURCE or ELEMENT)
  o Reduces the complexity of the PluginFactory constructor
  o Kills the loaded_dependencies list and the all_loaded_plugins API, and replaces both of these with a new list_plugins() API. Consequently, the jobpickler.py from the scheduler and the widget.py from the frontend are updated to use list_plugins().
  o Splits up the PluginOrigin implementations into separate files. Instead of having all PluginOrigin classes in pluginorigin.py, there is now one base class and a separate file for each implementation, which is more in line with BuildStream coding style. This has the unfortunate side effect of adding load_plugin_origin() into the __init__.py file, because keeping new_from_node() as a PluginOrigin class method cannot be done without introducing a cyclic dependency between PluginOrigin and its implementations.
* Revert "Simplify queue management" | Jürg Billeter | 2020-05-27 | 1 | -0/+7
  This reverts commit aa25f6fcf49f0015fae34dfd79b4626a816bf886.
* Revert "Schedule elements instead of "requiring" them" | Jürg Billeter | 2020-05-27 | 1 | -1/+4
  This reverts commit 14e32a34f67df754d9146efafe9686bfe6c91e50.
* _scheduler/scheduler.py: Reset the schedule handler at the beginning of real_schedule() | Tristan Van Berkom | 2020-05-19 | 1 | -3/+15
  In case queuing jobs results in jobs completing, we need to reset the schedule handler at the beginning of the function and not after queueing the jobs.
  This fixes the failure to exit the main loop in #1312
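To see why the reset must come first, consider a scheduler that coalesces wake-ups: `_schedule()` only queues a call to `_real_schedule()` if none is pending. If a job completes synchronously while being launched, its completion calls `_schedule()`; with the handle still set, that request is silently dropped, and clearing the handle afterwards leaves nothing scheduled, so the loop never exits. The `MiniScheduler` below is a hypothetical toy built on `asyncio`, not BuildStream's scheduler.

```python
import asyncio


class MiniScheduler:
    """Hypothetical, heavily simplified scheduler (not BuildStream's)."""

    def __init__(self, items):
        self.items = list(items)
        self.done = []
        self._sched_handle = None
        self._loop = None

    def run(self):
        self._loop = asyncio.new_event_loop()
        try:
            self._schedule()
            # Safety net so a lost wake-up ends the demo instead of hanging.
            self._loop.call_later(1.0, self._loop.stop)
            self._loop.run_forever()
        finally:
            self._loop.close()
        return self.done

    def _schedule(self):
        # Coalesce requests: at most one pending _real_schedule() call.
        if self._sched_handle is None:
            self._sched_handle = self._loop.call_soon(self._real_schedule)

    def _real_schedule(self):
        # Reset the handle FIRST: launching a job below may complete it
        # synchronously and call _schedule() again, which must not be lost.
        self._sched_handle = None
        if not self.items:
            self._loop.stop()
            return
        item = self.items.pop(0)  # launch one job per round
        self._job_complete(item)  # pretend it finished immediately

    def _job_complete(self, item):
        self.done.append(item)
        self._schedule()
```

Moving `self._sched_handle = None` to the end of `_real_schedule()` makes the second `_schedule()` call a no-op, so only the first item runs before the safety timer fires.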
* _scheduler: Fix order of launching jobs and sending notifications. | Tristan Van Berkom | 2020-05-19 | 2 | -3/+14
  Sending notifications causes potentially large bodies of code to run in the abstracted frontend codebase, and we are not allowed to have knowledge of the frontend from this code.
  Previously, we were adding the job to the active jobs, sending the notification, and then starting the job. This means that if a BuildStream frontend implementation crashes, we handle the exception in an inconsistent state and try to kill jobs which are not running.
  In addition to making sure that no other code body runs in the danger window between adjusting the active_jobs list and starting the job, this patch also adds some fault tolerance and assertions around job termination so that:
  o Job.terminate() and Job.kill() do not crash with a None _process
  o Job.start() raises an assertion if started after being terminated
  This fixes the infinite looping aspects of frontend crashes at job_start() time described in #1312.
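A sketch of the ordering and the two safeguards described above, using hypothetical `Job` and `Scheduler` classes whose `_process` attribute is only a placeholder for a launched worker. The job is started and recorded with no foreign code in between, and only then is the frontend notified, so a crashing frontend leaves a consistent set of running jobs to clean up.

```python
class Job:
    """Hypothetical, simplified job; _process stands in for the real worker."""

    def __init__(self, name):
        self.name = name
        self._process = None
        self._terminated = False

    def start(self):
        # Explicit raise rather than `assert`, so it also holds under -O.
        if self._terminated:
            raise AssertionError("job started after being terminated")
        self._process = object()  # placeholder for the launched worker

    def terminate(self):
        self._terminated = True
        if self._process is None:
            return  # never started: nothing to kill, and no crash
        self._process = None


class Scheduler:
    def __init__(self, notify):
        self.active_jobs = []
        self._notify = notify  # frontend code, which may raise

    def start_job(self, job):
        # Start the job and record it with no foreign code in between,
        # then notify; a frontend crash still leaves consistent state.
        job.start()
        self.active_jobs.append(job)
        self._notify(job)
```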
* plugin.py: Rework how deprecation warnings are configured. | Tristan Van Berkom | 2020-05-04 | 1 | -1/+1
  This is mostly a semantic change which defines how deprecation warnings are suppressed in a more consistent fashion, by declaring such suppressions in the plugin origin declarations rather than in the generic element/source configuration overrides section.
  Other side effects of this commit are that the warnings have been enhanced to include the provenance of where the deprecated plugins have been used in the project, and that the custom deprecation message is optional and will appear in the message detail string rather than in the primary warning text, which now simply indicates that the plugin being used is deprecated.
  Documentation and test cases are updated.
  This fixes #1291
* _pluginfactory/pluginfactory.py: Add provenance to missing plugin errors | Tristan Van Berkom | 2020-05-03 | 1 | -1/+1
  So far we were only reporting "No Source plugin registered for kind 'foo'", without specifying the bst file with line and column information; this commit fixes it.
  Additionally, this patch stores the provenance on the MetaSource to allow this to happen for sources.
* job.py: Use `_signals.terminator()` to handle `SIGTERM` | Jürg Billeter | 2020-04-09 | 1 | -9/+7
  `Sandbox` subclasses use `_signals.terminator()` to gracefully terminate the running command and clean up the sandbox. Setting a `SIGTERM` handler in `job.py` breaks this.
* element.py: Optimize assemble_done() | Jürg Billeter | 2020-01-18 | 1 | -1/+2
  After a successful build we know that the artifact is cached. Avoid querying buildbox-casd and the filesystem.
* source.py: Remove the reliance on consistency to get whether a source is cached | Benjamin Schubert | 2020-01-16 | 1 | -7/+1
  This removes the need to use consistency in Sources, by asking explicitly whether the source is cached or not.
  This introduces a new public method on Source, `is_cached`, which needs to be implemented and should return whether the source has a local copy or not.
  - On fetch, also reset whether the source was cached, or set it as cached when we know it was.
  - Validate the cache's source after fetching it
  This doesn't need to be run in the scheduler's process and can be offloaded to the child, which will allow better multiprocessing.
* element.py: Remove _get_consistency and introduce explicit methods | Benjamin Schubert | 2020-01-16 | 1 | -4/+1
  This replaces the _get_consistency method with two methods, `_has_all_sources_resolved` and `_has_all_sources_cached`, which allow more fine-grained control over what information is needed.
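The two element-level questions can be derived from per-source checks instead of a single consistency enum. In the sketch below, `is_cached()` follows the method named in the previous commit, while `is_resolved()` and both classes are assumptions made for illustration; they are not the real `Source` and `Element` implementations.

```python
class Source:
    """Hypothetical minimal source; real plugins implement these checks."""

    def __init__(self, ref, local_copy):
        self._ref = ref
        self._local_copy = local_copy

    def is_resolved(self):
        # Assumed: a source is resolved once its ref is known (e.g. tracked).
        return self._ref is not None

    def is_cached(self):
        # Whether a local copy exists (e.g. after a fetch).
        return self._local_copy


class Element:
    def __init__(self, sources):
        self._sources = sources

    def _has_all_sources_resolved(self):
        return all(s.is_resolved() for s in self._sources)

    def _has_all_sources_cached(self):
        return all(s.is_cached() for s in self._sources)
```

Callers that only need to know whether refs are known no longer pay for a cache check, which is the finer-grained control the commit message refers to.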
* element.py: Rename '_source_cached' to '_has_all_sources_in_source_cache' | Benjamin Schubert | 2020-01-16 | 1 | -1/+1
  '_source_cached' is not explicit enough, as it doesn't distinguish between sources in their respective caches and sources in the global sourcecache.
* scheduler.py: Handle exceptions that are caught under the event loop [tpollard/loop_exception] | Tom Pollard | 2020-01-10 | 1 | -0/+20
  The default exception handler of the async event loop bypasses our custom global exception handler. Ensure that this isn't the case, as these exceptions are BUGS and should cause bst to exit.
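Exceptions raised in event-loop callbacks are caught by asyncio and passed to the loop's exception handler, which by default only logs them. A sketch of routing them through the global hook instead, using the real `loop.set_exception_handler()` API and assuming (as an illustration) that the application's global handler is installed as `sys.excepthook`; the helper names are hypothetical.

```python
import asyncio
import sys


def _forward_to_excepthook(loop, context):
    # asyncio calls this as handler(loop, context) for errors it catches.
    exc = context.get("exception")
    if exc is None:
        # Some contexts only carry a message; keep asyncio's default output.
        loop.default_exception_handler(context)
        return
    # Hand the bug to the application's global handler (assumed to exit).
    sys.excepthook(type(exc), exc, exc.__traceback__)


def make_loop():
    loop = asyncio.new_event_loop()
    loop.set_exception_handler(_forward_to_excepthook)
    return loop
```

Without the custom handler, a `RuntimeError` raised inside a `call_soon()` callback would only be logged and the loop would keep running, which hides the bug.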
* _scheduler/scheduler.py: Enforce SafeChildWatcher | Chandan Singh | 2019-12-24 | 1 | -0/+6
  In Python 3.8, `ThreadedChildWatcher` is the default watcher, and it causes issues with our scheduler. Enforce use of `SafeChildWatcher`.
* job.py: Do not call Process.close() | Jürg Billeter | 2019-12-19 | 1 | -1/+0
  As we handle subprocess termination by pid with an asyncio child watcher, the multiprocessing.Process object does not get notified when the process terminates. And as the child watcher reaps the process, the pid is no longer valid and the Process object is unable to check whether the process is dead. This results in Process.close() raising a ValueError.
  Fixes: 9c23ce5c ("job.py: Replace message queue with pipe")
* job.py: Replace message queue with pipe [juerg/job-pipe] | Jürg Billeter | 2019-12-12 | 1 | -44/+40
  A lightweight unidirectional pipe is sufficient to pass messages from the child job process to its parent. This also avoids the need to access the private `_reader` instance variable of `multiprocessing.Queue`.
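`multiprocessing.Pipe(duplex=False)` returns a `(reader, writer)` pair of connections, and `reader.recv()` raises `EOFError` once the writer is closed and the pipe is drained, which gives the parent a natural end-of-messages signal. To keep the sketch self-contained the "child" below is a thread; in the real job the writer lives in the child process, and the parent must close its own copy of the writer after starting the child or it never sees EOF. `child_main()` and `collect()` are hypothetical names.

```python
import multiprocessing
import threading


def child_main(writer, messages):
    # Stand-in for the child job: send each message, then close the pipe.
    try:
        for msg in messages:
            writer.send(msg)  # pickled and written to the pipe
    finally:
        writer.close()  # the reader gets EOFError once this is drained


def collect(messages):
    reader, writer = multiprocessing.Pipe(duplex=False)  # one-way pipe
    child = threading.Thread(target=child_main, args=(writer, messages))
    child.start()
    received = []
    while True:
        try:
            received.append(reader.recv())
        except EOFError:
            break  # writer closed: no more messages
    child.join()
    reader.close()
    return received
```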
* resources: remove [un]register_exclusive_interest() | Darius Makovsky | 2019-12-09 | 1 | -50/+0
* scheduler.py: Only run thread-safe code in callbacks from watchers [bschubert/stricter-asyncio-handling] | Benjamin Schubert | 2019-12-07 | 2 | -2/+12
  Per https://docs.python.org/3/library/asyncio-policy.html#asyncio.AbstractChildWatcher.add_child_handler, the callback from a child handler must be thread safe. Not all our callbacks were. This changes all our callbacks to schedule a call for the next loop iteration instead of executing it directly.
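The thread-safe way to "schedule a call for the next loop iteration" from a callback that may run on another thread is `loop.call_soon_threadsafe()`, which queues the call and wakes the loop. The sketch below simulates a watcher thread invoking a child-exit callback; `make_child_callback()`, `demo()` and the pid value are hypothetical. It also records whether the real handler ran on the loop's thread.

```python
import asyncio
import threading


def make_child_callback(loop, on_exit):
    def callback(pid, returncode):
        # May run in a watcher thread: do nothing here except hand the
        # work to the event loop, which runs it on its own thread.
        loop.call_soon_threadsafe(on_exit, pid, returncode)
    return callback


def demo():
    loop = asyncio.new_event_loop()
    loop_thread = threading.get_ident()
    seen = []

    def on_exit(pid, returncode):
        seen.append((pid, returncode, threading.get_ident() == loop_thread))
        loop.stop()

    cb = make_child_callback(loop, on_exit)
    watcher = threading.Thread(target=cb, args=(1234, 0))  # simulated watcher
    watcher.start()
    loop.run_forever()
    watcher.join()
    loop.close()
    return seen
```

Calling `on_exit` directly from the watcher thread would instead touch loop state from the wrong thread, which is exactly what the commit avoids.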
* job.py: Only start new jobs in a `with watcher:` block | Benjamin Schubert | 2019-12-07 | 1 | -26/+5
  The documentation (https://docs.python.org/3/library/asyncio-policy.html#asyncio.AbstractChildWatcher) is apparently missing this part, but the code mentions that new processes should only ever be started inside a `with` block: https://github.com/python/cpython/blob/99eb70a9eb9493602ff6ad8bb92df4318cf05a3e/Lib/asyncio/unix_events.py#L808
* job.py: Remove '_watcher' attribute, it is not needed | Benjamin Schubert | 2019-12-07 | 1 | -3/+2
  We don't need to keep a reference to the watcher, so let's remove it.