path: root/zuul
Commit message | Author | Date | Files | Lines
* Merge "Do not use _display outside the main thread in zuul_stream"Zuul2022-09-011-13/+20
|\
| * Do not use _display outside the main thread in zuul_streamJames E. Blair2022-08-311-13/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the defaulh "linear" strategy (and likely others), Ansible will send the on_task_start callback, and then fork a worker process to execute that task. Since we spawn a thread in the on_task_start callback, we can end up emitting a log message in this method while Ansible is forking. If a forked process inherits a Python file object (i.e., stdout) that is locked by a thread that doesn't exist in the fork (i.e., this one), it can deadlock when trying to flush the file object. To minimize the chances of that happening, we should avoid using _display outside the main thread. The Python logging module is supposed to use internal locks which are automatically aqcuired and released across a fork. Assuming this is (still) true and functioning correctly, we should be okay to issue our Python logging module calls at any time. If there is a fault in this system, however, it could have a potential to cause a similar problem. If we can convince the Ansible maintainers to lock _display across forks, we may be able to revert this change in the future. Change-Id: Ifc6b835c151539e6209284728ccad467bef8be6f
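A minimal sketch of the pattern this commit describes: emit messages from the streaming thread via the Python logging module instead of Ansible's _display. The function layout and logger name are illustrative, not Zuul's actual callback code.

```python
# Sketch only: log from a secondary thread via the logging module, whose
# internal locks are meant to be acquired/released safely across a fork,
# rather than via Ansible's _display object.
import logging
import threading

LOG = logging.getLogger("zuul.ansible.example")


def stream_worker(host, task_uuid):
    # Runs in the thread spawned from on_task_start; avoid _display here.
    LOG.debug("Starting log stream for %s (task %s)", host, task_uuid)


def on_task_start(host, task_uuid):
    t = threading.Thread(target=stream_worker, args=(host, task_uuid))
    t.daemon = True
    t.start()
```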
* | Merge "Add config-error reporter and report config errors to DB"Zuul2022-08-314-7/+23
|\ \ | |/ |/|
| * Add config-error reporter and report config errors to DBJames E. Blair2022-08-224-7/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds a config-error pipeline reporter configuration option and now also reports config errors and merge conflicts to the database as buildset failures. The driving use case is that if periodic pipelines encounter config errors (such as being unable to freeze a job graph), they might send email if configured to send email on merge conflicts, but otherwise their results are not reported to the database. To make this more visible, first we need Zuul pipelines to report buildset ends to the database in more cases -- currently we typically only report a buildset end if there are jobs (and so a buildset start), or in some other special cases. This change adds config errors and merge conflicts to the set of cases where we report a buildset end. Because of some shortcuts previously taken, that would end up reporting a merge conflict message to the database instead of the actual error message. To resolve this, we add a new config-error reporter action and adjust the config error reporter handling path to use it instead of the merge-conflicts action. Tests of this as well as the merge-conflicts code path are added. Finally, a small debug aid is added to the GerritReporter so that we can easily see in the logs which reporter action was used. Change-Id: I805c26a88675bf15ae9d0d6c8999b178185e4f1f
* | Merge "Add detail to "depends on a change that failed to merge""Zuul2022-08-305-65/+102
|\ \
| * | Add detail to "depends on a change that failed to merge"James E. Blair2022-08-115-65/+102
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The report message "This change depends on a change that failed to merge" (and a similar change for circular dependency bundles) is famously vague. To help users identify the actual problem, include URLs for which change(s) caused the problem so that users may more easily resolve the issue. Change-Id: Id8b9f8cf2c108703e9209e30bdc9a3933f074652
* | | Merge "Set remote URL after config was updated"Zuul2022-08-241-5/+5
|\ \ \
| * | | Set remote URL after config was updatedSimon Westphahl2022-08-181-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To avoid issues with outdated Github access tokens in the Git config we only update the remote URL on the repo object after the config update was successful. This also adds a missing repo lock when building the repo state. Change-Id: I8e1b5b26f03cb75727d2b2e3c9310214a3eac447
* Fix links for jobs with special characters | Albin Vass | 2022-08-23 | 1 | -0/+2

    Change-Id: I12e8a056a2e5cd1bb18c1f24ecd7db55405f0a8c
* | | Merge "Reduce redundant Gerrit queries"Zuul2022-08-193-7/+27
|\ \ \
| * | | Reduce redundant Gerrit queriesJames E. Blair2022-08-193-7/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sometimes Gerrit events may arrive in batches (for example, an automated process modifies several related changes nearly simultaneously). Because of our inbuilt delay (10 seconds by default), it's possible that in these cases, many or all of the updates represented by these events will have settled on the Gerrit server before we even start processing the first event. In these cases, we don't need to query the same changes multiple times. Take for example a stack of 10 changes. Someone approves all 10 simultaneously. That would produce (at least) 10 events for Zuul to process. Each event would cause Zuul to query all 10 changes in the series (since they are related). That's 100 change queries (and each change query requires 2 or 3 HTTP queries). But if we know that all the event arrived before our first set of change queries, we can reduce that set of 100 queries to 10 by suppressing any queries after the first. This change generates a logical timestamp (ltime) immediately before querying Gerrit for a change, and stores that ltime in the change cache. Whenever an event arrives for processing with an ltime later than the query ltime, we assume the change is already up to date with that event and do not perform another query. Change-Id: Ib1b9245cc84ab3f5a0624697f4e3fc73bc8e03bd
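A minimal sketch of the suppression logic described above; the cache and connection helpers (query_ltime_cache, get_ltime, query_change) are hypothetical names, while the real implementation stores the query ltime in Zuul's change cache.

```python
# Sketch only: skip a Gerrit query when the triggering event is older than
# the ltime recorded just before the change was last queried.
query_ltime_cache = {}  # change key -> ltime recorded before the last query


def maybe_query(change_key, event_ltime, connection):
    last_query_ltime = query_ltime_cache.get(change_key, -1)
    if event_ltime <= last_query_ltime:
        # The change was queried after this event arrived, so the cached
        # copy already reflects it; no need to query again.
        return None
    query_ltime_cache[change_key] = connection.get_ltime()
    return connection.query_change(change_key)
```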
* | | | Merge "zuul-stream : don't write out logfile for tasks in loops"Zuul2022-08-174-14/+44
|\ \ \ \
| * | | | zuul-stream : don't write out logfile for tasks in loopsIan Wienand2022-07-234-14/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In zuul_stream.py:v2_playbook_on_task_start() it checks for "task.loop" and exits if the task is part of a loop. However the library/command.py override still writes out the console log despite it never being read. To avoid leaving this file around, mark a sentinel uuid in the action plugin if the command is part of a loop. In that case, for simplicity we just write to /dev/null -- that way no other assumptions in the library/command.py have to change; it just doesn't leave a file on disk. This is currently difficult to test as the infrastructure zuul_console leaves /tmp/console-* files and we do not know what comes from that, or testing. After this and the related change I823156dc2bcae91bd6d9770bd1520aa55ad875b4 are deployed to the infrastructure executors, we can make a simple and complete test for the future by just ensuring no /tmp/console-* files are left behind afer testing. I have tested this locally and do not see files from loops, which I was before this change. Change-Id: I4f4660c3c0b0f170561c14940cc159dc43eadc79
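A minimal sketch of the /dev/null idea; the sentinel value and helper function are illustrative, not the exact zuul-ansible code.

```python
# Sketch only: when the action plugin marks a command as part of a loop,
# point the console log at /dev/null so no per-task file is left behind.
import os

LOOP_SENTINEL = "in-loop-ignore"  # hypothetical sentinel value


def console_log_path(zuul_log_id):
    if zuul_log_id == LOOP_SENTINEL:
        return os.devnull  # loop output is never streamed back anyway
    return "/tmp/console-{}.log".format(zuul_log_id)
```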
* | | | | Merge "Fix zoned executor metric when unzoned is allowed"Zuul2022-08-152-2/+6
|\ \ \ \ \ | |_|_|/ / |/| | | |
| * | | | Fix zoned executor metric when unzoned is allowedSimon Westphahl2022-08-112-2/+6
| | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | An executor can have a zone configured and at the same time allow unzoned jobs. In this case the executor was not counted for the zoned executor metric (online/accepting). Change-Id: Ib39947e3403d828b595cf2479e64789e049e63cc
* | | | Merge "zuul_console: do not use f-strings"Zuul2022-08-151-1/+2
|\ \ \ \
| * | | | zuul_console: do not use f-stringsIan Wienand2022-08-151-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In Ief366c092e05fb88351782f6d9cd280bfae96237 I missed that this runs in the context of the remote node; meaning that it must support all the Python versions that might run there. f-strings are not 3.5 compatible. I'm thinking about how to lint this better (a syntax check run?) Change-Id: Ia4133b061800791196cd631f2e6836cb77347664
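As an example of the kind of rewrite this implies, an f-string (Python 3.6+) can be replaced with str.format(), which works on Python 3.5:

```python
path = "/tmp/console-xyz.log"

# Not valid on Python 3.5:
# msg = f"Could not open {path}"

# 3.5-compatible equivalent:
msg = "Could not open {}".format(path)
```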
* | | | | Merge "zuul-stream: automatically remove streaming files"Zuul2022-08-142-6/+26
|\ \ \ \ \ | |/ / / / |/| | | |
| * | | | zuul-stream: automatically remove streaming filesIan Wienand2022-08-092-6/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When using protocol version 1, send a finalise message when streaming is complete so that the zuul_console daemon can delete the temporary file. We test this by inspecting the Ansible console output, which logs a message with the UUID of the streaming job. We dump the temporary files on the remote side and make sure a console file for that job isn't present. Change-Id: I823156dc2bcae91bd6d9770bd1520aa55ad875b4
* | | | | Merge "zuul-stream: implement a protocol and version"Zuul2022-08-122-31/+99
|\ \ \ \ \ | |/ / / / | | / / / | |/ / / |/| | |
| * | | zuul-stream: implement a protocol and versionIan Wienand2022-08-092-31/+99
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A refresher on how this works, to the best of my knowledge 1 Firstly, Zuul's Ansible has a library task "zuul_console:" which is run against the remote node; this forks a console daemon, listening on a default port. 2 We have a action plugin that runs for each task, and if that task is a command/shell task, assigns it a unique id 3 We then override with library/command.py (which backs command/shell tasks) with a version that forks and runs the process on the target node as usual, but also saves the stdout/stderr output to a temporary file named per the unique uuid from the step above. 4 At the same time we have the callback plugin zuul_stream.py, which Ansible is calling as it moves through starting, running and finishing the tasks. This looks at the task, and if it has a UUID [2], sends a request to the zuul_console [1], which opens the temporary file [3] and starts streaming it back. 5 We loop reading this until the connection is closed by [1], eventually outputting each line. In this way, the console log is effectively streamed and saved into our job output. We have established that we expect the console [1] is updated asynchronously to the command/streaming [3,4] in situation such as static nodes. This poses a problem if we ever want to update either part -- for example we can not change the file-name that the command.py file logs to, because an old zuul_console: will not know to open the new file. You could imagine other fantasy things you might like to do; e.g. negotiate compression etc. that would have similar issues. To provide the flexibility for these types of changes, implement a simple protocol where the zuul_stream and zuul_console sides exchange their respective version numbers before sending the log files. This way they can both decide what operations are compatible both ways. Luckily the extant protocol, which is really just sending a plain uuid, can be adapted to this. When an old zuul_console server gets the protocol request it will just look like an invalid log file, which zuul_stream can handle and thus assume the remote end doesn't know about protocols. This bumps the testing timeout; it seems that the extra calls make for random failures. The failures are random and not in the same place, I've run this separately in 850719 several times and not seen any problems with the higher timeout. This test is already has a settle timeout slightly higher so I think it must have just been on the edge. Change-Id: Ief366c092e05fb88351782f6d9cd280bfae96237
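A minimal sketch of a version handshake of this shape; the wire format and names are illustrative, not the exact zuul_stream/zuul_console protocol.

```python
# Sketch only: announce a protocol version before requesting the log.  An old
# server that expects a bare UUID will treat the request as an unknown log
# file, which the client can detect and use to fall back to the old behaviour.
import socket

PROTOCOL_VERSION = 1


def request_log(host, port, log_uuid):
    sock = socket.create_connection((host, port))
    request = "v:{}\n{}\n".format(PROTOCOL_VERSION, log_uuid)
    sock.sendall(request.encode("utf8"))
    return sock
```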
* | | Merge "smtpreporter: Add pipeline to subject"Zuul2022-08-021-1/+1
|\ \ \
| * | | smtpreporter: Add pipeline to subjectJoshua Watt2022-07-291-1/+1
| |/ / | | | | | | | | | | | | | | | | | | | | | Adds the pipeline for the change to the subject format. This makes it easier to include information about the pipeline (e.g. its name) in the e-mail subject Change-Id: I6ec973635543b4404c125589f23ffd1ba5504c17
* Add whitespace around keywords | James E. Blair | 2022-08-02 | 2 | -2/+2

    This fixes pep8 E275 which wants whitespace after assert and del.

    Change-Id: I1f8659f462aa91c3fdf8f7eb8b939b67c0ce9f55
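For reference, the kind of change E275 asks for:

```python
items = {"a": 1}

# Flagged by pep8 E275 (missing whitespace after keyword):
# assert(items)
# del(items["a"])

# Fixed:
assert items
del items["a"]
```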
* | | Merge "Add pipeline-based merge op metrics"Zuul2022-07-294-7/+37
|\ \ \
| * | | Add pipeline-based merge op metricsJames E. Blair2022-07-124-7/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | So that operators can see in aggregate how long merge, files-changes, and repo-state merge operations take in certain pipelines, add metrics for the merge operations themselves (these exclude the overhead of pipeline processing and job dispatching). Change-Id: I8a707b8453c7c9559d22c627292741972c47c7d7
* | | | Merge "Clear pipeline change cache at start of refresh"Zuul2022-07-292-0/+8
|\ \ \ \
| * | | | Clear pipeline change cache at start of refreshJames E. Blair2022-07-272-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The pipeline change cache is used to avoid repeated cache lookups for change dependencies, but is not as robust as the "real" sounce change cache at managing the lifetime of change objects. If it is used to store some change objects in one scheduler which are later dequeued by a second scheduler, the next time those changes show up in that pipeline on the first scheduler it may use old change objects instead of new ones from the ZooKeeper cache. To fix this, clear the pipeeline change cache before we refresh the pipeline state. This will cause some extra ZK ChangeCache hits to repopulate the cache, but only in the case of commit dependencies (the pipeline change cache isn't used for git dependencies or the changes in the pipeline themselves unless they are commit dependencies). Change-Id: I0a20dc972917440d4f3e8bb59295b77c13913a48
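A minimal sketch of the ordering, with hypothetical attribute and method names (clearCache, state.refresh); the real cache lives on the pipeline manager and the state is refreshed from ZooKeeper.

```python
# Sketch only: drop the per-pipeline change cache before refreshing pipeline
# state so stale Change objects cannot be reused after another scheduler has
# dequeued them; the cache repopulates from the ZK ChangeCache as needed.
def refresh_pipeline(pipeline, context):
    pipeline.manager.clearCache()    # hypothetical helper; empties the cache
    pipeline.state.refresh(context)  # reload state with current objects
```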
* | | | | Merge "Fix nodepool label query"Zuul2022-07-271-8/+6
|\ \ \ \ \
| * | | | | Fix nodepool label queryJames E. Blair2022-07-171-8/+6
| | |_|/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Nodepool updated its internal interface for registering launchers. Since Zuul uses that internal interface to determine what labels are available, update it to match. Change-Id: Iffa0c1c1d9ef8d195c5e1ea1b625de9d119add3b
* | | | | Merge "Fix deduplicated build references"Zuul2022-07-272-4/+73
|\ \ \ \ \ | |_|/ / / |/| | | |
| * | | | Fix deduplicated build referencesJames E. Blair2022-07-252-4/+73
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we deduplicate a build, we use the same Build object in multiple BuildSets. But when a second scheduler loads the pipeline state, even if the build path points to a build outside of a given BuildSet it will create new Build objects and set their local object references for the buildset and job to the "current" buildset rather than the one that is being referenced. Instead we need to find the Build object for the other buildset, but at this point in the process, the queues haven't been completely loaded yet and so we can't go searching for it. To correct this, when we load a Build from ZooKeeper that is outside of our buildset, we insert a BuildReference object which is a placeholder until we finish loading the pipeline state. Then at the end of that process, we go through and replace the BuildReference objects with the actual builds. Change-Id: I0f0e5e4b8f8c9d34b86ccd83ad0b0578ce4a780c
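A minimal sketch of the two-pass load: insert a placeholder while reading, then resolve it once everything is loaded. Class and attribute names are illustrative, not Zuul's exact model code.

```python
# Sketch only: BuildReference stands in for a Build that lives in another
# buildset until the whole pipeline state has been read from ZooKeeper.
class BuildReference:
    def __init__(self, build_path):
        self.build_path = build_path  # ZK path of the real Build


def resolve_build_references(queue_items, builds_by_path):
    # Second pass after loading: swap placeholders for the real Build objects.
    for item in queue_items:
        for job_name, build in list(item.builds.items()):
            if isinstance(build, BuildReference):
                item.builds[job_name] = builds_by_path[build.build_path]
```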
* | | | | Merge "Fix duplicate setResult calls in deduplicated builds"Zuul2022-07-261-3/+6
|\ \ \ \ \ | |/ / / /
| * | | | Fix duplicate setResult calls in deduplicated buildsJames E. Blair2022-07-251-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We call item.setResult after a build is complete so that the queue item can do any internal processing necessary (for example, prepare data structures for child jobs, or move the build to the retry_builds list). In the case of deduplicated builds, we should do that for every queue item the build participates in since each item may have a different job graph. We were not correctly identifying other builds of deduplicated jobs and so in the case of a dependency cycle we would call setResult on jobs of the same name in that cycle regardless of whether they were deduplicated. This corrects the issue and adds a test to detect that case. Change-Id: I4c47beb2709a77c21c11c97f1d1a8f743d4bf5eb
* | | | | Merge "Do not deduplicate the noop job"Zuul2022-07-261-0/+1
|\ \ \ \ \ | |/ / / /
| * | | | Do not deduplicate the noop jobJames E. Blair2022-07-251-0/+1
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no good reason to do so (there are no resources consumed by the job), and it's difficult to disable a behavior for the noop job globally since it has no definition. Let's never have it deduplicate so that we keep things simple for folks who want to avoid deduplication. Change-Id: Ib3841ce5ef020540edef1cfa479d90c65be97112
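The guard this implies is tiny; a sketch (the function shape is illustrative, not Zuul's actual deduplication code):

```python
def can_deduplicate(job):
    # The noop job consumes no resources and has no definition on which a
    # deduplication setting could be configured, so never deduplicate it.
    if job.name == "noop":
        return False
    return True
```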
* | | | Merge "Hide skipped jobs in status/reports"Zuul2022-07-223-4/+17
|\ \ \ \
| * | | | Hide skipped jobs in status/reportsJames E. Blair2022-07-213-4/+17
| | |/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For heavy users of "dispatch jobs" (where many jobs are declared as dependencies of a single job which then mutates the child_jobs return value to indicate which few of those should be run), there may be large numbers of "SKIPPED" jobs in the status page and in the final job report, which reduces the usability of both of those. Yet it is important for users to be able to see skipped jobs since they may represent an error (they may be inadvertently skipped). To address this, we remove "SKIPPED" jobs from the status page by default, but add a button at the bottom of the change box which can toggle their display. We remove "SKIPPED" jobs from the report, but add a note at the bottom which says "Skipped X jobs". Users can follow the buildset link to see which ones were skipped. The buildset page will continue to show all the jobs for the buildset. Change-Id: Ie297168cdf5b39d1d6f219e9b2efc44c01e87f35
* | | | Merge "Apply timer trigger jitter to project-branches"Zuul2022-07-221-47/+59
|\ \ \ \
| * | | | Apply timer trigger jitter to project-branchesJames E. Blair2022-07-211-47/+59
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the timer trigger accepts an optional "jitter" specification which can delay the start of a pipeline timer trigger by up to a certain number of seconds. It applies uniformly to every project-branch that participates in the pipeline. For example, if a periodic pipeline with nova and glance is configured to trigger at midnight, and has a jitter of 30 seconds, then the master and stable branches of nova and glance will all be enqueued at the same time (perhaps 00:00:17). While this provides some utility in that if other systems are configured to do things around midnight, this pipeline may not join a thundering herd with them. Or if there are many periodic pipelines configured for midnight (perhaps across different tenants, or just with slightly different purposes), they won't be a thundering hurd. But to the extent that jobs within a given pipeline might want to avoid a thundering herd with other similar jobs in the same pipeline, it offers no relief. While Zuul may be able to handle it (especially since multiple schedulers allows other pipelines to continue to operate), these jobs may interact with remote systems which would appreciate not being DoS'd. To alleviate this, we change the jitter from applying to the pipeline as a whole to individual project-branches. To be clear, it is still the case that the pipeline has only a single configured trigger time (this change does not allow projects to configure their own triggers). But instead of firing a single event for the entire pipeline, we will fire a unique event for every project-branch in that pipeline, and these events will have the jitter applied to them individually. So in our example above, nova@master might fire at 00:00:05, nova@stable/zulu may fire at 00:00:07, glance@master at 00:00:13, etc. This behavior is similar enough in spirit to the current behavior that we can consider it a minor implementation change, and it doesn't require any new configuration options, feature flags, deprecation notice, etc. The documentation is updated to describe the new behavior, as well as correct an error relating to the description of jitter (it only delays, not advances, events). We currently add a single job to APScheduler for every timer triggered pipeline in every tenant (so the number of jobs is the sum of the periodic pipelines in every tenant). OpenDev for example may have on the order of 20 APScheduler jobs. With the new approach, we will enqueue a job for each project-branch in a periodic pipeline. 
For a system like OpenDev, that could potentially be thousands of jobs. In reality, based on current configuration and pipeline participation, it should be 176. Even though it will result in the same number of Zuul trigger events, there is overhead to having more APScheduler jobs. To characterize this, I performed a benchmark where I added a certain number of APScheduler jobs with the same trigger time (and no jitter) and recorded the amount of time needed to add the jobs and also, once the jobs began firing, the elapsed time from the first to the last job. This should charactize the additional overhead the scheduler will encounter with this change. Time needed to add jobs to APScheduler (seconds) 1: 0.00014448165893554688 10: 0.0009338855743408203 100: 0.00925445556640625 1000: 0.09204769134521484 10000: 0.9236903190612793 100000: 11.758053541183472 1000000: 223.83168983459473 Time to run jobs (last-first in seconds) 1: 2.384185791015625e-06 10: 0.006863832473754883 100: 0.09936022758483887 1000: 0.22670435905456543 10000: 1.517075777053833 100000: 19.97287678718567 1000000: 399.24730825424194 Given that this operates primarily at the tenant level (when a tenant reconfiguration happens, jobs need to be removed and added), I think it makes sense to consider up to 10,000 jobs a reasonable high end. It looks like we can go a little past that (between 10,000 and 100,000) while still seeing something like a linear increase. As we approach 1,000,000 jobs it starts looking more polynomial and I would not conisder the performance to be acceptable. But 100,000 is already an unlikely number, so I think this level of performance is okay within the likely range of jobs. The default executor used by APScheduler is a standard python ThreadPoolExecutor with a maximum of 10 simultaneous workers. This will cause us to fire up to 10 Zuul event simultaneously (whereas before we were only likely to fire simultaneous events if multiple tenants had identical pipeline timer triggers). This could result in more load on the connection sources and change cache as they update the branch tips in the change cache. It seems plausible that 10 simulatenous events is something that the sources and ZK can handle. If not, we can reduce the granularity of the lock we use to prevent updating the same project at the same time (to perhaps a single lock for all projects), or construct the APScheduler with a lower number of max_workrs. Change-Id: I27fc23763da81273eb135e14cd1d0bd95964fd16
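A minimal sketch of the per-project-branch scheduling using APScheduler; this is not Zuul's timer driver code, and the cron time, jitter handling, and callback are illustrative.

```python
# Sketch only: one APScheduler job per project-branch, each with its own
# jitter, so nova@master, glance@master, etc. fire at slightly different times.
from apscheduler.schedulers.background import BackgroundScheduler


def enqueue_timer_event(project, branch):
    print("periodic event for {}@{}".format(project, branch))


def register_pipeline_timer(scheduler, project_branches, jitter=30):
    for project, branch in project_branches:
        scheduler.add_job(enqueue_timer_event, 'cron', hour=0, minute=0,
                          jitter=jitter, args=(project, branch))


scheduler = BackgroundScheduler()
register_pipeline_timer(scheduler, [("nova", "master"), ("glance", "master")])
scheduler.start()
```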
* | | | | Merge "Strictly sequence reconfiguration events"Zuul2022-07-225-7/+104
|\ \ \ \ \ | |_|_|/ / |/| | | |
| * | | | Strictly sequence reconfiguration eventsJames E. Blair2022-07-185-7/+104
| | |/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the before times when we only had a single scheduler, it was naturally the case that reconfiguration events were processed as they were encountered and no trigger events which arrived after them would be processed until the reconfiguration was complete. As we added more event queues to support SOS, it became possible for trigger events which arrived at the scheduler to be processed before a tenant reconfiguration caused by a preceding event to be complete. This is now even possible with a single scheduler. As a concrete example, imagine a change merges which updates the jobs which should run on a tag, and then a tag is created. A scheduler will process both of those events in succession. The first will cause it to submit a tenant reconfiguration event, and then forward the trigger event to any matching pipelines. The second event will also be forwarded to pipeline event queues. The pipeline events will then be processed, and then only at that point will the scheduler return to the start of the run loop and process the reconfiguration event. To correct this, we can take one of two approaches: make the reconfiguration more synchronous, or make it safer to be asynchronous. To make reconfiguration more synchronous, we would need to be able to upgrade a tenant read lock into a tenant write lock without releasing it. The lock recipes we use from kazoo do not support this. While it would be possible to extend them to do so, it would lead us further from parity with the upstream kazoo recipes, so this aproach is not used. Instead, we will make it safer for reconfiguration to be asynchronous by annotating every trigger event we forward with the last reconfiguration event that was seen before it. This means that every trigger event now specifies the minimum reconfiguration time for that event. If our local scheduler has not reached that time, we should stop processing trigger events and wait for it to catch up. This means that schedulers may continue to process events up to the point of a reconfiguration, but will then stop. The already existing short-circuit to abort processing once a scheduler is ready to reconfigure a tenant (where we check the tenant write lock contenders for a waiting reconfiguration) helps us get out of the way of pending reconfigurations as well. In short, once a reconfiguration is ready to start, we won't start processing tenant events anymore because of the existing lock check. 
And up until that happens, we will process as many events as possible until any further events require the reconfiguration. We will use the ltime of the tenant trigger event as our timestamp. As we forward tenant trigger events to the pipeline trigger event queues, we decide whether an event should cause a reconfiguration. Whenever one does, we note the ltime of that event and store it as metadata on the tenant trigger event queue so that we always know what the most recent required minimum ltime is (ie, the ltime of the most recently seen event that should cause a reconfiguration). Every event that we forward to the pipeline trigger queue will be annotated to specify that its minimum required reconfiguration ltime is that most recently seen ltime. And each time we reconfigure a tenant, we store the ltime of the event that prompted the reconfiguration in the layout state. If we later process a pipeline trigger event with a minimum required reconfigure ltime greater than the current one, we know we need to stop and wait for a reconfiguration, so we abort early. Because this system involves several event queues and objects each of which may be serialized at any point during a rolling upgrade, every involved object needs to have appropriate default value handling, and a synchronized model api change is not helpful. The remainder of this commit message is a description of what happens with each object when handled by either an old or new scheduler component during a rolling upgrade. When forwarding a trigger event and submitting a tenant reconfiguration event: The tenant trigger event zuul_event_ltime is initialized from zk, so will always have a value. The pipeline management event trigger_event_ltime is initialzed to the tenant trigger event zuul_event_ltime, so a new scheduler will write out the value. If an old scheduler creates the tenant reconfiguration event, it will be missing the trigger_event_ltime. The _reconfigureTenant method is called with a last_reconfigure_event_ltime parameter, which is either the trigger_event_ltime above in the case of a tenant reconfiguration event forwarded by a new scheduler, or -1 in all other cases (including other types of reconfiguration, or a tenant reconfiguration event forwarded by an old scheduler). If it is -1, it will use the current ltime so that if we process an event from an old scheduler which is missing the event ltime, or we are bootstrapping a tenant or otherwise reconfiguring in a context where we don't have a triggering event ltime, we will use an ltime which is very new so that we don't defer processing trigger events. We also ensure we never go backward, so that if we process an event from an old scheduler (and thus use the current ltime) then process an event from a new scheduler with an older (than "now") ltime, we retain the newer ltime. Each time a tenant reconfiguration event is submitted, the ltime of that reconfiguration event is stored on the trigger event queue. This is then used as the min_reconfigure_ltime attribute on the forwarded trigger events. This is updated by new schedulers, and ignored by old ones, so if an old scheduler process a tenant trigger event queue it won't update the min ltime. That will just mean that any events processed by a new scheduler may continue to use an older ltime as their minimum, which should not cause a problem. Any events forwarded by an old scheduler will omit the min_reconfigure_ltime field; that field will be initialized to -1 when loaded on a new scheduler. 
When processing pipeline trigger events: In process_pipeline_trigger_queue we compare two values: the last_reconfigure_event_ltime on the layout state which is either set to a value as above (by a new scheduler), or will be -1 if it was last written by an old scheduler (including in the case it was overwritten by an old scheduler; it will re-initialize to -1 in that case). The event.min_reconfigure_ltime field will either be the most recent reconfiguration ltime seen by a new scheduler forwarding trigger events, or -1 otherwise. If the min_reconfigure_ltime of an event is -1, we retain the old behavior of processing the event regardless. Only if we have a min_reconfigure_ltime > -1 and it is greater than the layout state last_reconfigure_event_ltime (which itself may be -1, and thus less than the min_reconfigure_ltime) do we abort processing the event. (The test_config_update test for the Gerrit checks plugin is updated to include an extra waitUntilSettled since a potential test race was observed during development.) Change-Id: Icb6a7858591ab867e7006c7c80bfffeb582b28ee
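The comparison at the heart of the pipeline-side check can be sketched as follows (attribute names follow the commit message; the surrounding scheduler loop is omitted):

```python
def should_defer(event, layout_state):
    # -1 means the value came from an old scheduler (or no minimum is known);
    # in that case keep the old behaviour and process the event.
    if event.min_reconfigure_ltime == -1:
        return False
    # Defer if the current layout predates the most recent
    # reconfiguration-causing event seen before this trigger event.
    return (event.min_reconfigure_ltime >
            layout_state.last_reconfigure_event_ltime)
```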
* | | | Merge "Handle jwt decoding error, fix exception default messages"Zuul2022-07-142-7/+11
|\ \ \ \ | |/ / / |/| | |
| * | | Handle jwt decoding error, fix exception default messagesMatthieu Huin2022-05-122-7/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Using a badly formatted token resulted in an error 500 from zuul-web. Return a more precise error message and an error 401 in zuul-web when this occurs. Also fix a typo in default messages for some auth-related exceptions. Change-Id: I4abe013e76ac51c3dad7ccd969ffe79f5cb459e3
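A minimal sketch of the behaviour using PyJWT; the AuthError class and the surrounding web plumbing are illustrative, not zuul-web's actual code.

```python
import jwt


class AuthError(Exception):
    def __init__(self, status, message):
        super().__init__(message)
        self.status = status


def authenticate(raw_token, key):
    try:
        return jwt.decode(raw_token, key, algorithms=["HS256"])
    except jwt.DecodeError:
        # A malformed token is a client error: answer 401 with a clear
        # message instead of letting the exception surface as a 500.
        raise AuthError(401, "Token could not be decoded")
```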
* | | | Merge "Handle non default loopvars in Ansible callback stream plugin"Zuul2022-07-071-5/+18
|\ \ \ \
| * | | | Handle non default loopvars in Ansible callback stream pluginClark Boylan2022-07-061-5/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Zuul Ansible callback stream plugin assumed that the ansible loop var was always called 'item' in the result_dict. You can override this value (and it is often necessary to do so to avoid collisions) to something less generic. In those cases we would get errors like: b'[WARNING]: Failure using method (v2_runner_item_on_ok) in callback plugin' b'(<ansible.plugins.callback.zuul_stream.CallbackModule object at' b"0x7fbecc97c910>): 'item'" And stream output would not include the info typically logged. Address this by checking if ansible_loop_var is in the results_dict and using that value for the loop var name instead. We still fall back to 'item' as I'm not sure that ansible_loop_var is always present. Change-Id: I408e6d4af632f8097d63c04cbcb611d843086f6c
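The fix boils down to looking up the loop variable name before indexing; a sketch (the helper function is illustrative):

```python
def get_loop_item(result_dict):
    # Prefer the loop variable name Ansible reports; fall back to 'item'
    # since ansible_loop_var may not always be present.
    loop_var = result_dict.get('ansible_loop_var', 'item')
    return result_dict.get(loop_var)
```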
* | | | | Merge "Report timing info for POST_FAILURE and TIMED_OUT builds"Zuul2022-07-061-1/+3
|\ \ \ \ \ | |/ / / / |/| | | |
| * | | | Report timing info for POST_FAILURE and TIMED_OUT buildsClark Boylan2022-06-281-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We previously only reported statsd timing info for SUCCESS and FAILURE builds. But it would be useful to get info for POST_FAILURE and TIMED_OUT builds as well. In particular with POST_FAILURE builds we can track how long they are running before they fail. Change-Id: I2fe443ac2f69f40b7419e5280a38958d3ac7c080
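A minimal sketch of widening the reported results using the statsd client; the metric key layout is illustrative, not Zuul's exact naming.

```python
import statsd

REPORTED_RESULTS = ('SUCCESS', 'FAILURE', 'POST_FAILURE', 'TIMED_OUT')


def report_build_time(client, job_name, result, elapsed_ms):
    if result in REPORTED_RESULTS:
        client.timing('zuul.builds.{}.{}'.format(job_name, result), elapsed_ms)


client = statsd.StatsClient('localhost', 8125)
report_build_time(client, 'tox-py310', 'POST_FAILURE', 125000)
```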
* | | | | Merge "Fix read-only branches error in zuul-web"Zuul2022-07-063-40/+85
|\ \ \ \ \
| * | | | | Fix read-only branches error in zuul-webSimon Westphahl2022-07-043-40/+85
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When exclude-unprotected-branche is in effect and a project doesn't have any protected branches (e.g. a wrong branch protection rule in Github or none at all) the branch cache will contain an empty list. Since the empty list was so far used to indicate a fetch error, those projects showed up with a config error in zuul-web ("Will not fetch project branches as read-only is set"). For the branch cache we need to distinguish three different cases: 1. branche cache miss (no fetch yet) 2. branch cache hit (including empty list of branches) 3. fetch error Instead of storing an empty list in case of a fetch error we will store a sentinel value of `None` in those cases. However, with this change we can't use `None` as an indicator for a cache miss anymore, so we are now raising a `LookupError` instead. Change-Id: I5b51805f7d0a5d916a46fe89db7b32f14063d7aa
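A minimal sketch of the three cache states; the real branch cache lives in ZooKeeper, so this dict-based version is only illustrative.

```python
_branch_cache = {}
FETCH_ERROR = None  # sentinel stored when fetching branches failed


def get_project_branches(project):
    try:
        branches = _branch_cache[project]
    except KeyError:
        # 1. cache miss: nothing has been fetched yet
        raise LookupError("no branches fetched yet for {}".format(project))
    if branches is FETCH_ERROR:
        # 3. a previous fetch failed; don't treat this as "no branches"
        raise RuntimeError("branch fetch failed for {}".format(project))
    # 2. cache hit, possibly an empty list (e.g. no protected branches)
    return branches
```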