path: root/tests
* Merge "merger: Keep redundant cherry-pick commits"Zuul2023-03-152-6/+87
|\
| * merger: Keep redundant cherry-pick commitsJoshua Watt2023-03-012-6/+87
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In normal git usage, cherry-picking a commit that has already been applied and doesn't do anything or cherry-picking an empty commit causes git to exit with an error to let the user decide what they want to do. However, this doesn't match the behavior of merges and rebases where non-empty commits that have already been applied are simply skipped (empty source commits are preserved). To fix this, add the --keep-redundant-commit option to `git cherry-pick` to make git always keep a commit when cherry-picking even when it is empty for either reason. Then, after the cherry-pick, check if the new commit is empty and if so back it out if the original commit _wasn't_ empty. This two step process is necessary because git doesn't have any options to simply skip cherry-pick commits that have already been applied to the tree. Removing commits that have already been applied is particularly important in a "deploy" pipeline triggered by a Gerrit "change-merged" event, since the scheduler will try to cherry-pick the change on top of the commit that just merged. Without this option, the cherry-pick will fail and the deploy pipeline will fail with a MERGE_CONFICT. Change-Id: I326ba49e2268197662d11fd79e46f3c020675f21
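    A minimal sketch of that two-step flow using GitPython; this is illustrative
    only, and the function and variable names are assumptions rather than Zuul's
    actual merger code:

        import git

        def cherry_pick_keep_redundant(repo: git.Repo, commit_sha: str):
            orig = repo.commit(commit_sha)
            # Step 1: always keep the commit, even if it ends up empty.
            repo.git.cherry_pick('--keep-redundant-commits', commit_sha)
            new = repo.head.commit
            # A commit is "empty" when its tree equals its first parent's tree.
            new_is_empty = new.tree == new.parents[0].tree
            orig_is_empty = bool(orig.parents) and orig.tree == orig.parents[0].tree
            # Step 2: back out the new commit only if it became empty because the
            # (non-empty) original change was already applied to the tree.
            if new_is_empty and not orig_is_empty:
                repo.git.reset('--hard', 'HEAD~1')
            return repo.head.commit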
* | Merge "Elasticsearch: filter zuul data from job returned vars"Zuul2023-03-101-0/+2
|\ \
| * | Elasticsearch: filter zuul data from job returned varsThomas Cardonne2022-09-171-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | Remove data under the `zuul` key from the job returned vars. These returned values are meant to be used only by Zuul and shouldn't be included in documents as it may include large amount of data such as file comments. Change-Id: Ie6de7e3373b21b7c234ffedd5db7d3ca5a0645b6
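    A sketch of the filtering idea (illustrative, not the actual Elasticsearch
    reporter code): drop the reserved "zuul" key from the returned vars before
    building the document.

        def filter_returned_vars(returned_vars: dict) -> dict:
            """Return a copy of the job's returned vars without the zuul data."""
            return {k: v for k, v in returned_vars.items() if k != "zuul"}

        # {"zuul": {"file_comments": ...}, "foo": "bar"}  ->  {"foo": "bar"}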
* | | Merge "extra-config-files/dirs in items of a bundle should be loaded"Zuul2023-03-068-0/+66
|\ \ \
| * | | extra-config-files/dirs in items of a bundle should be loadedDong Zhang2023-02-238-0/+66
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case of a bundle, zuul should load extra-config-paths not only from items ahead, but should from all items in that bundle. Otherwise it might throw a "invalid config" error, because the required zuul items in extra-config-paths are not found. Change-Id: I5c14bcb14b7f5c627fd9bd49f887dcd55803c6a1
* | | | Merge "Fix race related to PR with changed base branch"Zuul2023-03-031-0/+35
|\ \ \ \
| * | | | Fix race related to PR with changed base branchSimon Westphahl2023-03-021-0/+35
| | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some people use a workflow that's known as "stacked pull requests" in order to split a change into more reviewable chunks. In this workflow the first PR in the stack targets a protected branch (e.g. master) and all other PRs target the unprotected branch of the next item in the stack. E.g. master <- feature-A (PR#1) <- feature-B (PR#2) <- ... Now, when the first PR in the stack is merged Github will automatically change the target branch of dependent PRs. For the above example this would look like the following after PR#1 is merged: master <- feature-B (PR#2) <- ... The problem here is that we might still process events for a PR before the base branch change, but the Github API already returns the info about the updated target branch. The problem with this is, that we used the target branch name from the event (outdated branch name) and and the information from the change object (new target branch) whether or not the target branch is protected to determine if a branch was configured as protected. In the above example Zuul might wrongly conclude that the "feature-A" branch (taken from the event) is now protected. In the related incident we also observed that this triggered a reconfiguration with the wrong state of now two protected branches (masters + feature-A). Since the project in question previously had only one branch this lead to a change in branch matching behavior for jobs defined in that repository. Change-Id: Ia037e3070aaecb05c062865a6bb0479b86e4dcde
* | | | Merge "Avoid layout updates after delete-pipeline-state"Zuul2023-03-021-29/+49
|\ \ \ \
| * | | | Avoid layout updates after delete-pipeline-stateJames E. Blair2023-03-011-29/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The delete-pipeline-state commend forces a layout update on every scheduler, but that isn't strictly necessary. While it may be helpful for some issues, if it really is necessary, the operator can issue a tenant reconfiguration after performing the delete-pipeline-state. In most cases, where only the state information itself is causing a problem, we can omit the layout updates and assume that the state reset alone is sufficient. To that end, this change removes the layout state changes from the delete-pipeline-state command and instead simply empties and recreates the pipeline state and change list objects. This is very similar to what happens in the pipeline manager _postConfig call, except in this case, we have the tenant lock so we know we can write with imputinity, and we know we are creating objects in ZK from scratch, so we use direct create calls. We set the pipeline state's layout uuid to None, which will cause the first scheduler that comes across it to (assuming its internal layout is up to date) perform a pipeline reset (which is almost a noop on an empty pipeline) and update the pipeline state layout to the current tenant layout state. Change-Id: I1c503280b516ffa7bbe4cf456d9c900b500e16b0
* | | | | Merge "Set layout state event ltime in delete-pipeline-state"Zuul2023-03-021-63/+75
|\ \ \ \ \ | |/ / / / | | / / / | |/ / / |/| | |
| * | | Set layout state event ltime in delete-pipeline-stateJames E. Blair2023-02-281-63/+75
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The delete-pipeline-state command updates the layout state in order to force schedulers to update their local layout (essentially perform a local-only reconfiguration). In doing so, it sets the last event ltime to -1. This is reasonable for initializing a new system, but in an existing system, when an event arrives at the tenant trigger event queue it is assigned the last reconfiguration event ltime seen by that trigger event queue. Later, when a scheduler processes such a trigger event after the delete-pipeline-state command has run, it will refuse to handle the event since it arrived much later than its local layout state. This must then be corrected manually by the operator by forcing a tenant reconfiguration. This means that the system essentially suffers the delay of two sequential reconfigurations before it can proceed. To correct this, set the last event ltime for the layout state to the ltime of the layout state itself. This means that once a scheduler has updated its local layout, it can proceed in processing old events. Change-Id: I66e798adbbdd55ff1beb1ecee39c7f5a5351fc4b
* | | | Merge "Handle missing diff_refs attribute"Zuul2023-03-013-7/+93
|\ \ \ \
| * | | | Handle missing diff_refs attributeFabien Boucher2023-01-173-7/+93
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Recently, on a production Zuul acting on projects hosted on gitlab.com, it has been discovered that a merge requested fetched from the API (just after Zuul receives the merge request created event) could have the "diff_refs" attribute set to None. Related bug: https://gitlab.com/gitlab-org/gitlab/-/issues/386562 Leading to the following stacktrace in the logs: 2022-12-14 10:08:47,921 ERROR zuul.GitlabEventConnector: Exception handling Gitlab event: Traceback (most recent call last): File "/usr/local/lib/python3.8/site-packages/zuul/driver/gitlab/gitlabconnection.py", line 102, in run self.event_queue.election.run(self._run) File "/usr/local/lib/python3.8/site-packages/zuul/zk/election.py", line 28, in run return super().run(func, *args, **kwargs) File "/usr/local/lib/python3.8/site-packages/kazoo/recipe/election.py", line 54, in run func(*args, **kwargs) File "/usr/local/lib/python3.8/site-packages/zuul/driver/gitlab/gitlabconnection.py", line 110, in _run self._handleEvent(event) File "/usr/local/lib/python3.8/site-packages/zuul/driver/gitlab/gitlabconnection.py", line 246, in _handleEvent self.connection._getChange(change_key, refresh=True, File "/usr/local/lib/python3.8/site-packages/zuul/driver/gitlab/gitlabconnection.py", line 621, in _getChange change = self._change_cache.updateChangeWithRetry(change_key, change, File "/usr/local/lib/python3.8/site-packages/zuul/zk/change_cache.py", line 432, in updateChangeWithRetry update_func(change) File "/usr/local/lib/python3.8/site-packages/zuul/driver/gitlab/gitlabconnection.py", line 619, in _update_change self._updateChange(c, event, mr) File "/usr/local/lib/python3.8/site-packages/zuul/driver/gitlab/gitlabconnection.py", line 665, in _updateChange change.commit_id = change.mr['diff_refs'].get('head_sha') AttributeError: 'NoneType' object has no attribute 'get' The attribute "diff_refs" becomes an object (with the expected keys) few moments later. In order to avoid this situation, this change adds a mechanism to retry fetching a MR until it owns some expected fields. In our case only "diff_refs". https://docs.gitlab.com/ee/api/merge_requests.html#response Tests are included with that change. Change-Id: I6f279516728def655acb8933542a02a4dbb3ccb6
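    A rough sketch of that retry mechanism (the helper name and the fetch callable
    are hypothetical; this is not the Gitlab driver's real API):

        import time

        def get_mr_with_fields(fetch_mr, expected_fields=("diff_refs",),
                               attempts=5, delay=1.0):
            """Re-fetch a merge request until the expected fields are populated."""
            for _ in range(attempts):
                mr = fetch_mr()
                if all(mr.get(field) for field in expected_fields):
                    return mr
                time.sleep(delay)
            raise RuntimeError(
                "merge request still missing fields: %s" % (expected_fields,))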
* | | | | Merge "Re-enqueue changes if dequeued missing deps"Zuul2023-02-281-12/+4
|\ \ \ \ \
| * | | | | Re-enqueue changes if dequeued missing depsJames E. Blair2023-02-201-12/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When users create dependency cycles, the process often takes multiple steps, some of which cause changes enqueued in earlier steps to be dequeued. Users then need to re-enqueue those changes in order to have all changes in the cycle tested. This change attempts to improve this by detecting that situation and re-enqueing changes that are being de-queued because of missing deps. Note, thare are plenty of cases where we de-queue because of missing deps and we don't want to re-enqueue (ie, a one-way dependent change ahead has failed). To restrict this to only the situation we're interested in, we only act if the dependencies are already in the pipeline. A recently updated change in a cycle will have just been added to the pipeline, so this is true in the case we're interested in. A one-way dependent change that failed will have already been removed from the pipeline, and so this will be false in cases in which we are not interested. Change-Id: I84b3b2f8fffd1c946dafa605d1c17a37131558bd
* | | | | | Merge "Match events to pipelines based on topic deps"Zuul2023-02-272-10/+88
|\ \ \ \ \ \ | |/ / / / / | | | / / / | |_|/ / / |/| | | |
| * | | | Match events to pipelines based on topic depsJames E. Blair2023-02-172-10/+88
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We distribute tenant events to pipelines based on whether the event matches the pipeline (ie, patchset-created for a check pipeline) or if the event is related to a change already in the pipeline. The latter condition means that pipelines notice quickly when dependencies are changed and they can take appropriate action (such as ejecting changes which no longer have correct dependencies). For git and commit dependencies, an update to a cycle to add a new change requires an update to at least one existing change (for example, adding a new change to a cycle usually requires at least two Depends-On footers: the new change, as well as one of the changes already in the cycle). This means that updates to add new changes to cycles quickly come to the attention of pipelines. However, it is possible to add a new change to a topic dependency cycle without updating any existing changes. Merely uploading a new change in the same topic adds it to the cycle. Since that new change does not appear in any existing pipelines, pipelines won't notice the update until their next natural processing cycle, at which point they will refresh dependencies of any changes they contain, and they will notice the new dpendency and eject the cycle. To align the behavior of topic dependencies with git and commit dependencis, this change causes the scheduler to refresh the dependencies of the change it is handling during tenant trigger event processing, so that it can then compare that change's dependencies to changes already in pipelines to determine if this event is potentially relevant. This moves some work from pipeline processing (which is highly parallel) to tenant processing (which is only somewhat parallel). This could slow tenant event processing somewhat. However, the work is persisted in the change cache, and so it will not need to be repeated during pipeline processing. This is necessary because the tenant trigger event processor operates only with the pipeline change list data; it does not perform a full pipeline refresh, so it does not have access to the current queue items and their changes in order to compare the event change's topic with currently enqueued topics. There are some alternative ways we could implement this if the additional cost is an issue: 1) At the beginning of tenant trigger event processing, using the change list, restore each of the queue's change items from the change cache and compare topics. For large queues, this could end up generating quite a bit of ZK traffic. 2) Add the change topic to the change reference data structure so that it is stored in the change list. This is an abuse of this structure which otherwise exists only to store the minimum amount of information about a change in order to uniquely identify it. 3) Implement a PipelineTopicList similar to a PipelineChangeList for storing pipeline topics and accesing them without a full refresh. 
Another alternative would be to accept the delayed event handling of topic dependencies and elect not to "fix" this behavior. Change-Id: Ia9d691fa45d4a71a1bc78cc7a4bdec206cc025c8
* Only store trigger event info on queue item  (Simon Westphahl, 2023-02-22; 1 file, -0/+27)

    The event that's currently stored as part of the queue item is not sharded.
    This means that we can see Zookeeper disconnects when the queue item data
    exceeds the max. znode size of 1MB.

    Since we only need the event's timestamp and the Zuul event-id after an item
    is enqueued, we can reduce the amount of data we store in Zookeeper and also
    avoid sharding the event.

    Change-Id: I13577498e55fd4bb189679836219dea4dc5729fc
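    A sketch of the reduced data kept after enqueue (the class and field names
    are illustrative assumptions, not necessarily what the queue item actually
    stores):

        from dataclasses import dataclass

        @dataclass
        class EventInfo:
            timestamp: float      # when the triggering event happened
            zuul_event_id: str    # correlation id used in log messages

            @classmethod
            def fromEvent(cls, event):
                # Keep only the two fields still needed after enqueue instead
                # of serializing the whole (potentially very large) event.
                return cls(timestamp=event.timestamp,
                           zuul_event_id=event.zuul_event_id)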
* | | | Merge "Cleanup old rebase-merge dirs on repo reset"8.2.0Zuul2023-02-171-0/+58
|\ \ \ \
| * | | | Cleanup old rebase-merge dirs on repo resetSimon Westphahl2023-02-171-0/+58
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When using the rebase merge-mode a failed "merge" will leave the repo in a state that Zuul so far could not recover from. The rebase will create a `.git/rebase-merge` directory which is not removed when the rebase fails. To fix this we will abort the rebase when it fails and also remove any existing `.git/rebase-merge` and `.git/rebase-apply` directories when resetting the repository. DEBUG zuul.Merger: [e: ...] Unable to merge {'branch': 'master', 'buildset_uuid': 'f7be4215f37049b4ba0236892a5d8197', 'connection': 'github', 'merge_mode': 5, 'newrev': None, 'number': 71, 'oldrev': None, 'patchset': 'e81d0b248565db290b30d9a638095947b699c76d', 'project': 'org/project', 'ref': 'refs/pull/71/head'} Traceback (most recent call last): File "/opt/zuul/lib/python3.10/site-packages/zuul/merger/merger.py", line 1099, in _mergeChange commit = repo.rebaseMerge( File "/opt/zuul/lib/python3.10/site-packages/zuul/merger/merger.py", line 626, in rebaseMerge repo.git.rebase(*args) File "/opt/zuul/lib/python3.10/site-packages/git/cmd.py", line 542, in <lambda> return lambda *args, **kwargs: self._call_process(name, *args, **kwargs) File "/opt/zuul/lib/python3.10/site-packages/git/cmd.py", line 1005, in _call_process return self.execute(call, **exec_kwargs) File "/opt/zuul/lib/python3.10/site-packages/git/cmd.py", line 822, in execute raise GitCommandError(command, status, stderr_value, stdout_value) git.exc.GitCommandError: Cmd('git') failed due to: exit code(128) cmdline: git rebase 39fead1852ef01a716a1c6470cee9e4197ff5587 stderr: 'fatal: It seems that there is already a rebase-merge directory, and I wonder if you are in the middle of another rebase. If that is the case, please try git rebase (--continue | --abort | --skip) If that is not the case, please rm -fr ".git/rebase-merge" and run me again. I am stopping in case you still have something valuable there. Change-Id: I8518cc5e4b3f7bbfc2c2283a2b946dee504991dd
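    An illustrative sketch of the abort-and-cleanup idea described above,
    assuming a GitPython `Repo`; the function names are assumptions and this is
    not the exact code from zuul/merger/merger.py:

        import os
        import shutil
        import git

        def reset_repo(repo: git.Repo):
            # Remove leftover state from a failed/interrupted rebase so a later
            # rebase does not fail with "there is already a rebase-merge
            # directory".
            for leftover in ("rebase-merge", "rebase-apply"):
                path = os.path.join(repo.git_dir, leftover)
                if os.path.exists(path):
                    shutil.rmtree(path)
            repo.git.reset('--hard')

        def rebase_merge(repo: git.Repo, base_sha: str):
            try:
                repo.git.rebase(base_sha)
            except git.GitCommandError:
                # Abort so the work tree is left usable, then re-raise.
                repo.git.rebase('--abort')
                raise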
* | | | | Merge "Return cached Github change on concurrent update"Zuul2023-02-171-1/+42
|\ \ \ \ \
| * | | | | Return cached Github change on concurrent updateSimon Westphahl2023-02-171-1/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a pull-request is updated concurrently on a scheduler we'll wait until the first thread has updated the change. The problem is if we needed to create a new PR object. In this case we'd return a Github change that wasn't updated and also doesn't have a cache key set. ERROR zuul.Scheduler: Exception processing pipeline check in tenant foobar Traceback (most recent call last): File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 2149, in process_pipelines refreshed = self._process_pipeline( File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 2241, in _process_pipeline self.process_pipeline_trigger_queue(tenant, pipeline) File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 2447, in process_pipeline_trigger_queue self._process_trigger_event(tenant, pipeline, event) File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 2480, in _process_trigger_event pipeline.manager.addChange(change, event) File "/opt/zuul/lib/python3.10/site-packages/zuul/manager/__init__.py", line 534, in addChange self.updateCommitDependencies(change, None, event) File "/opt/zuul/lib/python3.10/site-packages/zuul/manager/__init__.py", line 868, in updateCommitDependencies new_commit_needs_changes = [d.cache_key for d in dependencies] File "/opt/zuul/lib/python3.10/site-packages/zuul/manager/__init__.py", line 868, in <listcomp> new_commit_needs_changes = [d.cache_key for d in dependencies] File "/opt/zuul/lib/python3.10/site-packages/zuul/model.py", line 5946, in cache_key return self.cache_stat.key.reference AttributeError: 'NoneType' object has no attribute 'key' Change-Id: I2f3012060c486d593ad857e046334c3d9bef0ed5
* Update reconfig event ltime on (smart) reconfig  (Simon Westphahl, 2023-02-17; 1 file, -0/+4)

    Make sure to update the last reconfig event ltime in the layout state during
    a (smart) reconfig. This is important in order to unblock pipelines when a
    tenant reconfig event is lost.

    So far the last reconfig event ltime was passed as -1, but wasn't set in the
    layout state since the ltime is not allowed to go backward.

    Change-Id: Iab04a962abbfbe901c22e4d5f1d484e3f53b0d33
* | | | | | Merge "Add scheduler run handler metric"Zuul2023-02-151-0/+1
|\ \ \ \ \ \
| * | | | | | Add scheduler run handler metricSimon Westphahl2023-02-061-0/+1
| | |/ / / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to better understand the scheduler run handler performance this change adds a new `zuul.scheduler.run_handler` metric to measure the duration of one run handler loop. Change-Id: I77e862cf99d6a8445e71d7daab410d5853487dc3
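    A rough illustration of how such a timer metric is typically emitted with the
    statsd client (the client setup and function names are assumptions, not the
    scheduler's actual code):

        import time
        import statsd

        client = statsd.StatsClient('localhost', 8125)

        def run_handler_iteration(process_tenants):
            start = time.monotonic()
            process_tenants()                      # one pass of the run handler
            elapsed_ms = (time.monotonic() - start) * 1000
            client.timing('zuul.scheduler.run_handler', elapsed_ms)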
* | | | | | Merge "gerrit driver: fix bug around unicode branch names"Zuul2023-02-151-0/+45
|\ \ \ \ \ \
| * | | | | | gerrit driver: fix bug around unicode branch namesAlex Hornung2023-01-241-0/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a branch name contains unicode characters that are more than 1-byte wide, the size in bytes of the pack record won't match the size in characters, and the pack parsing will be incorrect. Instead, treat everything as an encoded byte string until parsing is done - and only decode when handling a single, parsed, record. Change-Id: I7f1a0cc96a36129fbc04c7a8687da3f66c1eef99
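    A small standalone illustration of the underlying issue (the branch name is
    just an example): for non-ASCII names the byte length and the character
    length differ, so size fields counted in bytes must be applied to the
    undecoded byte string, with decoding deferred until a record is fully parsed.

        branch = "feature/änderung"          # 16 characters
        raw = branch.encode("utf-8")         # 17 bytes: 'ä' encodes to two bytes
        assert len(branch) == 16
        assert len(raw) == 17

        # Slice and parse using byte offsets on `raw`; decode only the final,
        # fully parsed record.
        record = raw[8:]                     # b'\xc3\xa4nderung'
        print(record.decode("utf-8"))        # "änderung"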
* | | | | | | Merge "Address SqlAlchemy Removed in 2.0 Warnings"Zuul2023-02-142-98/+98
|\ \ \ \ \ \ \
| * | | | | | | Address SqlAlchemy Removed in 2.0 WarningsClark Boylan2023-01-302-98/+98
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This addresses SqlAlchemy's removed in 2.0 warnings. Now that SqlAlchemy 2.0 has released we can see that we are not compatible yet. A good first step in adding compatibility is fixing warnings in 1.4. In particular there are four types of warning we fix here: 1. Using raw strings in conn.execute() calls. We need to use the text() construct instead. 2. Passing a list of items to select when doing select queries. Instead we need to pass things as normal posargs. 3. Accessing row result items as if the row is a dict. THis is not longer possible without first going through the row._mapping system. Instead we can access items as normal object attributes. 4. You must now use sqlalchemy.inspect() on a connectable to create an Inspector object rather than instantiating it directly. Finally we set up alembic's engine creation to run with future 2.0 behavior now that the warnings are cleared up. This appears to have already been done for the main zuul application. Change-Id: I5475e39bd93d71cd1106ec6d3a5423ea2dd51859
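    A compact sketch of the four 1.4-compatible patterns listed above (the table
    and column names are made up for illustration):

        import sqlalchemy as sa

        engine = sa.create_engine("sqlite://", future=True)
        meta = sa.MetaData()
        builds = sa.Table("builds", meta,
                          sa.Column("uuid", sa.String),
                          sa.Column("result", sa.String))
        meta.create_all(engine)

        with engine.connect() as conn:
            # 1. Wrap raw SQL strings in text() instead of passing them directly.
            conn.execute(sa.text("SELECT 1"))

            # 2. Pass columns to select() as positional args, not as a list.
            stmt = sa.select(builds.c.uuid, builds.c.result)

            for row in conn.execute(stmt):
                # 3. Access row values as attributes (or via row._mapping),
                #    not with dict-style indexing.
                _ = row.uuid

            # 4. Use sqlalchemy.inspect() on a connectable to get an Inspector
            #    instead of instantiating Inspector directly.
            inspector = sa.inspect(engine)
            inspector.get_table_names()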
* | | | | | | | Merge "Fix more file opening ResourceWarnings"Zuul2023-02-143-3/+6
|\ \ \ \ \ \ \ \ | |_|_|_|/ / / / |/| | | | | | |
| * | | | | | | Fix more file opening ResourceWarningsClark Boylan2023-02-073-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I've managed to get better at grepping for this and this finds some of the stragglers. They are all file opens without closes fixed by using a with open() context manager. Change-Id: I7b8c8516a86558e2027cb0f01aefe2dd1849069c
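    The pattern these fixes use; the filename here is only an example:

        # Let the context manager close the file instead of relying on garbage
        # collection, which is what triggers the ResourceWarning.
        with open("inventory.yaml") as f:
            data = f.read()
        # f is guaranteed to be closed here.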
* Fix configuration error related to YAML anchors  (James E. Blair, 2023-02-13; 1 file, -0/+37)

    PyYAML is efficient with YAML anchors and will only construct one Python
    object for a YAML mapping that is used multiple times via anchors. We copy
    job variables (a mapping) straight from the YAML to an attribute of the Job
    object, then we freeze the Job object which converts all of its dicts to
    mappingproxy objects. This mutates the contents of the vars dict, and if that
    is used on another job via an anchor, we will end up with mappingproxy
    objects in what we think is a "raw yaml" configuration dict, and we will not
    be able to serialize it in case of errors.

    To address this, perform a deep copy of every configuration yaml blob before
    parsing it into an object.

    Change-Id: Idaa6ff78b1ac5a108fb9f43700cf8e66192c43ce
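    A small demonstration of the sharing behaviour described above (the YAML
    snippet is illustrative, not real Zuul configuration): an anchor and its
    aliases are constructed as the same Python object, so mutating one occurrence
    mutates them all, while a deep copy breaks the link.

        import copy
        import yaml

        doc = (
            "common: &common_vars\n"
            "  foo: bar\n"
            "job_a:\n"
            "  vars: *common_vars\n"
            "job_b:\n"
            "  vars: *common_vars\n"
        )
        data = yaml.safe_load(doc)
        assert data["job_a"]["vars"] is data["job_b"]["vars"]   # same object

        job_a = copy.deepcopy(data["job_a"])
        assert job_a["vars"] is not data["job_b"]["vars"]       # now independent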
* Remove layout_uuid from PipelineState create call  (James E. Blair, 2023-02-10; 1 file, -4/+12)

    This argument is no longer necessary because it is set via an indirect route
    through the pipeline and tenant objects. Remove it, and also validate that it
    is set correctly in unit tests.

    Change-Id: I62ce18a61a416cbc397866838f8ac3b0ec1bd564
* Fix race condition in pipeline change list init  (James E. Blair, 2023-02-10; 2 files, -3/+76)

    Simon Westphahl describes the race condition:

    > [The race condition] can occur after a reconfiguration while some
    > schedulers are updating their local layout and some already start
    > processing pipelines in the same tenant.
    >
    > In this case the pipeline manager's `_postConfig()` method that calls
    > `PipelineChangeList.create(...)` races with the pipeline processor
    > updating the change keys.
    >
    > This leads to two change lists being written as separate shards, that
    > can't be correctly loaded, as all shards combined are expected to form a
    > single JSON object.
    >
    > The sequence of events seems to be the following:
    > 1. S1: pipeline processor needs to update the change keys
    > 2. S1: the shard writer will delete the `change_list` key with the old shards
    > 3. S2: configloader calls the `_postConfig()` method
    > 4. S2: `PipelineChangeList.create()` notices that the `change_list` node
    >    doesn't exist in Zookeeper:
    >    https://opendev.org/zuul/zuul/src/branch/master/zuul/model.py#L921
    > 6. S2: the shard writer creates the first shard `0000000000`
    > 7. S1: the shard writer creates the second shard `0000000001`
    >
    > The race condition seems to be introduced with
    > Ib1e467b5adb907f93bab0de61da84d2efc22e2a7

    That change updated the pipeline manager _postConfig method so that it no
    longer acquires the pipeline lock when initializing the pipeline state and
    change lists. This greatly reduces potential delays during reconfiguration,
    but has, perhaps predictably, led to the race condition above.

    In the commit message for that change, I noted that we might be able to
    avoid even more work if we accept some caveats related to error reporting.
    Avoiding that work means avoiding performing any writes during _postConfig,
    which addresses the root cause of the race condition (steps 3-6 above; ed.
    note: there is no step 5). From the commit message:

    > We still need to attach new state and change list objects to the pipeline
    > in _postConfig (since our pipeline object is new). We also should make
    > sure that the objects exist in ZK before we leave that method, so that if
    > a new pipeline is created, other schedulers will be able to load the
    > (potentially still empty) objects from ZK. As an alternative, we could
    > avoid even this work in _postConfig, but then we would have to handle
    > missing objects on refresh, and it would not be possible to tell if the
    > object was missing due to it being new or due to an error. To avoid
    > masking errors, we keep the current expectation that we will create these
    > objects in ZK on the initial reconfiguration.

    The current change does exactly that. We no longer perform any ZK write
    operations on the state and change list objects in _postConfig. Instead,
    inside of the refresh methods, we detect the cases where they should be
    newly created and do so at that time. This happens with the pipeline lock,
    so it is safe against any simultaneous operation from other components.

    There will be "ERROR" level log messages indicating that reading the state
    from ZK has failed when these objects are first initialized. To indicate
    that this is probably okay, they will now be immediately followed by
    "WARNING" level messages explaining that.

    Strictly speaking, this particular race should only occur for the change
    list object, not the pipeline state, since the race condition above requires
    a sharded object and, of the two, only the change list is sharded. However,
    to keep the lifecycle of these two objects matched (and to simplify
    _postConfig) the same treatment is applied to both.

    Note that change I7fa99cd83a857216321f8d946fd42abd9ec427a3 merged after
    Ib1e467b and changed the behavior slightly, introducing the old_state and
    old_list arguments. Curiously, the old_list argument is effectively unused,
    so it is removed entirely in this change. Old_state still has a purpose and
    is retained.

    Change-Id: I519348e7d5d74e675808e990920480fb6e1fb981
* Fix race in test_queue unit tests  (James E. Blair, 2023-02-10; 1 file, -11/+26)

    These tests call getChangeQueue which mutates the pipeline state objects in
    ZK. They should therefore hold the pipeline lock while doing this. Otherwise,
    cleanup routines may run during the test and delete the state out from under
    them.

    Change-Id: If85d3cf66669f5786203309294528e1f528b0423
* | | | | | | Merge "Add an !unsafe change_message variable"Zuul2023-02-102-3/+13
|\ \ \ \ \ \ \
| * | | | | | | Add an !unsafe change_message variableJames E. Blair2023-02-092-3/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In I9628e2770dda120b269612e28bb6217036942b8e we switched zuul.change from a plain string tagged with !unsafe to base64 encoded and no !unsafe tag. The idea was to make the inventory file parseable by external tools while avoiding accidental interpolation of the commit message by Ansible. That doesn't work in all cases -- it's not hard to construct a scenario where after base64 decoding the message any further processing by Ansible causes it to undergo interpolation. Moreover, since then we have made many changes to how we deal with variables; notably, the inventory.yaml is no longer actually used by Zuul's Anisble -- it is now there only for human and downstream processing. We call it the "debug inventory". The actual inventory is much more complex and in some cases has lots of !unsafe tags in it. Given all that, it now seems like the most straightforward way to deal with this is to tag the message variable as !unsafe when passing it to Zuul's Ansible, but render it as plain text in the inventory.yaml. To address backwards compatability, this is done in a new variable called zuul.change_message. Since that's a more descriptive variable anyway, we will just keep that one in the future and drop the current base64- encoded zuul.message variable Change-Id: Iea86de15e722bc271c1bf0540db2c9efb032500c
* | | | | | | | Merge "Refresh dependencies for changes in pipelines"Zuul2023-02-091-0/+93
|\ \ \ \ \ \ \ \ | |/ / / / / / / |/| | | | | | |
| * | | | | | | Refresh dependencies for changes in pipelinesJames E. Blair2023-01-241-0/+93
| | |/ / / / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 549103881899751e84b411212f7820cde502b42b appears to have had an error. It attempted to only refresh changes if they are in the pipeline or are dependencies of changes in the pipeline, but it only did the latter, which has an adverse effect for the GitHub driver. We can change dependencies of GitHub changes without creating new patchsets, which means that this error (not updating dependencies of changes in the pipeline) would cause Zuul not to eject a GitHub change from a pipeline when its dependencies changed via a PR message update. The error went unnoticed with Gerrit since dependency updates happen with new patchsets, which themselves cause changes to be ejected. This change corrects the behavior. Repetitive calls to updateCommitDependencies are also eliminated by creating a set of changes to update across the whole pipeline. Finally, two tests are added, one for Gerrit (which previously would have passed) and one for GitHub (which previously would have failed). Change-Id: I005820223471e8f6372ef70730f381918ec9c1af
* | | | | | | Merge "Fix DeprecationWarning: ssl.PROTOCOL_TLS is deprecated"Zuul2023-02-091-1/+1
|\ \ \ \ \ \ \ | | |/ / / / / | |/| | | | |
| * | | | | | Fix DeprecationWarning: ssl.PROTOCOL_TLS is deprecatedClark Boylan2023-02-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since python 3.10 ssl.PROTOCOL_TLS has been deprecated. We are expected to use ssl.PROTOCOL_TLS_CLIENT and ssl.PROTOCOL_TLS_SERVER depending on how the sockets are to be used. Switch over to these new constants to avoid the DeprecationWarning. One thing to note is that PROTOCOL_TLS_CLIENT has default behaviors around cert verification and hostname checking. Zuul is already explicitly setting those options the way it wants to and I've left that alone to avoid trouble if the defaults change later. Finally, this doesn't fix the occurence of this error that happens within kazoo. A separate PR has been made upstream to kazoo and this should be fixed in the next kazoo release. Change-Id: Ib41640f1d33d60503066464c8c98f865a74f003a
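    The replacement pattern looks roughly like this (the verification settings
    and CA path shown are illustrative, not necessarily the values Zuul uses):

        import ssl

        # Pick the role-specific constant instead of the deprecated PROTOCOL_TLS.
        context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)

        # PROTOCOL_TLS_CLIENT enables hostname checking and cert verification by
        # default; keep setting them explicitly so behavior doesn't silently
        # change if the defaults ever do.
        context.check_hostname = True
        context.verify_mode = ssl.CERT_REQUIRED
        context.load_verify_locations("/path/to/ca.pem")   # illustrative path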
* | | | | | | Merge "Fix ResourceWarnings in fingergw"Zuul2023-02-091-11/+15
|\ \ \ \ \ \ \ | |/ / / / / /
| * | | | | | Fix ResourceWarnings in fingergwClark Boylan2023-02-071-11/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The fingergw (and its associated testing) was not properly managing ssl sockets. The issue was we were in a context manager for the underlying tcp socket which will get closed, but that doesn't call close() on the ssl socket wrapping the tcp socket. Fix this by moving common recv() code into a function then use the ssl socket in an inner context manager if we are using ssl. Both ssl and plain tcp will close() properly and we avoid duplicating common code. Change-Id: I1feefbd03a90734cf3c16baa6ed8f52cd8e00d14
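    A sketch of the nested-context-manager pattern described above (illustrative,
    not the exact fingergw code): the wrapped SSL socket gets its own `with`
    block so close() is called on it too, not just on the underlying TCP socket.

        import socket
        import ssl

        def _recv_all(sock):
            chunks = []
            while True:
                chunk = sock.recv(4096)
                if not chunk:
                    return b"".join(chunks)
                chunks.append(chunk)

        def fetch(host, port, ssl_context=None):
            with socket.create_connection((host, port)) as tcp_sock:
                if ssl_context is None:
                    return _recv_all(tcp_sock)
                with ssl_context.wrap_socket(
                        tcp_sock, server_hostname=host) as tls_sock:
                    return _recv_all(tls_sock)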
* | | | | | | Merge "Fix ResourceWarnings in inventory testing"Zuul2023-02-092-5/+11
|\ \ \ \ \ \ \ | |/ / / / / /
| * | | | | | Fix ResourceWarnings in inventory testingClark Boylan2023-02-072-5/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Inventory testing was opening yaml files to parse them and not explicitly closing them when done. Fix this through the use of with open() context managers. Change-Id: I41a8ee607fcf13e86dd800cefb00d7e120265ed4
* | | | | | | Merge "Cleanup some Python ResourceWarnings in the test suite"Zuul2023-02-092-0/+3
|\ \ \ \ \ \ \ | |/ / / / / / | | | | / / / | |_|_|/ / / |/| | | | |
| * | | | | Cleanup some Python ResourceWarnings in the test suiteClark Boylan2023-02-062-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | First thing we add support for PYTHONTRACEMALLOC being passed through nox to easily enable tracebacks on emitted ResourceWarnings. We set this value to 0 by deafult as enabling this slows things down and requires more memory. But it is super useful locally when debugging specific ResourceWarnings to set `PYTHONTRACEMALLOC=5` in order o correct these issues. With that in place we identify and correct two classes of ResourceWarnings. First up is the executor server not closing its statsd socket when stopping the executor server. Address that by closing the socket if statsd is enabled and set the statsd attribute to None to prevent anything else from using it later. Second is a test only issue where we don't close the fake Gerrit, Gitlab, or Web Proxy Fixture server's HTTP socket we only shutdown the server. Add a close call to the server after it is shutdown to correct this. There are potentially other ResourceWarnings to be found and cleaned up, but sifting through the noise will be easier as we eliminate these more widespread warnings. Change-Id: Iddabe79be1c8557e300dde21a6b34e57b04c48e0
* | | | | | Merge "Cleanup deleted pipelines and and event queues"Zuul2023-02-032-0/+74
|\ \ \ \ \ \
| * | | | | | Cleanup deleted pipelines and and event queuesSimon Westphahl2023-01-242-0/+74
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a pipeline is removed during a reconfiguration Zuul will cancel active builds and node requests. However, since we no longer refresh the pipeline state during a reconfig we can run into errors when Zuul tries to cancel builds and node requests based on a possibly outdated pipeline state. 2023-01-17 10:41:32,223 ERROR zuul.Scheduler: Exception in run handler: Traceback (most recent call last): File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 2007, in run self.process_tenant_management_queue(tenant) File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 2452, in process_tenant_management_queue self._process_tenant_management_queue(tenant) File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 2462, in _process_tenant_management_queue self._doTenantReconfigureEvent(event) File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 1533, in _doTenantReconfigureEvent self._reconfigureTenant(ctx, min_ltimes, File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 1699, in _reconfigureTenant self._reconfigureDeletePipeline(old_pipeline) File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 1804, in _reconfigureDeletePipeline self.cancelJob(build.build_set, build.job, File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 2930, in cancelJob build.updateAttributes( File "/opt/zuul/lib/python3.10/site-packages/zuul/zk/zkobject.py", line 193, in updateAttributes self._save(context, serial) File "/opt/zuul/lib/python3.10/site-packages/zuul/zk/zkobject.py", line 392, in _save zstat = self._retry(context, self._retryableSave, File "/opt/zuul/lib/python3.10/site-packages/zuul/zk/zkobject.py", line 314, in _retry return kazoo_retry(func, *args, **kw) File "/opt/zuul/lib/python3.10/site-packages/kazoo/retry.py", line 126, in __call__ return func(*args, **kwargs) File "/opt/zuul/lib/python3.10/site-packages/zuul/zk/zkobject.py", line 371, in _retryableSave zstat = context.client.set(path, compressed_data, File "/opt/zuul/lib/python3.10/site-packages/kazoo/client.py", line 1359, in set return self.set_async(path, value, version).get() File "/opt/zuul/lib/python3.10/site-packages/kazoo/handlers/utils.py", line 86, in get raise self._exception kazoo.exceptions.BadVersionError To fix this we need to refresh the pipeline state prior to canceling those active builds and node requests. We will also take care of removing the pipeline state and the event queues from Zookeeper if possible. Errors will be ignored as the periodic cleanup task takes care of removing leaked pipelines. Change-Id: I2986419636d8c6557d68d65fb6aff589aa4a680e