summaryrefslogtreecommitdiff
path: root/zuul
Commit message (Collapse)AuthorAgeFilesLines
* Fix prune-database commandJames E. Blair2023-03-292-8/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This command had two problems: * It would only delete the first 50 buildsets * Depending on DB configuration, it may not have deleted anything or left orphan data. We did not tell sqlalchemy to cascade delete operations, meaning that when we deleted the buildset, we didn't delete anything else. If the database enforces foreign keys (innodb, psql) then the command would have failed. If it doesn't (myisam) then it would have deleted the buildset rows but not anything else. The tests use myisam, so they ran without error and without deleting the builds. They check that the builds are deleted, but only through the ORM via a joined load with the buildsets, and since the buildsets are gone, the builds weren't returned. To address this shortcoming, the tests now use distinct ORM methods which return objects without any joins. This would have caught the error had it been in place before. Additionally, the delet operation retained the default limit of 50 rows (set in place for the web UI), meaning that when it did run, it would only delete the most recent 50 matching builds. We now explicitly set the limit to a user-configurable batch size (by default, 10,000 builds) so that we keep transaction sizes manageable and avoid monopolizing database locks. We continue deleting buildsets in batches as long as any matching buildsets remain. This should allow users to remove very large amounts of data without affecting ongoing operations too much. Change-Id: I4c678b294eeda25589b75ab1ce7c5d0b93a07df3
* Merge "[gitlab driver] fix "'EnqueueEvent' object has no attribute ↵Zuul2023-02-221-6/+1
|\ | | | | | | 'change_url'""
| * [gitlab driver] fix "'EnqueueEvent' object has no attribute 'change_url'"Fabien Boucher2022-12-151-6/+1
| | | | | | | | | | | | | | | | | | | | This change fixes an issue that prevent an admin to use the zuul enqueue and enqueue-ref commands. File "/usr/local/lib/python3.8/site-packages/zuul/driver/gitlab/gitlabconnection.py", line 602, in _getChange AttributeError: 'EnqueueEvent' object has no attribute 'change_url' Change-Id: Iceaeadc64baee26adb71909122d8c55314b8e036
* | Merge "Cleanup old rebase-merge dirs on repo reset"8.2.0Zuul2023-02-171-1/+27
|\ \
| * | Cleanup old rebase-merge dirs on repo resetSimon Westphahl2023-02-171-1/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When using the rebase merge-mode a failed "merge" will leave the repo in a state that Zuul so far could not recover from. The rebase will create a `.git/rebase-merge` directory which is not removed when the rebase fails. To fix this we will abort the rebase when it fails and also remove any existing `.git/rebase-merge` and `.git/rebase-apply` directories when resetting the repository. DEBUG zuul.Merger: [e: ...] Unable to merge {'branch': 'master', 'buildset_uuid': 'f7be4215f37049b4ba0236892a5d8197', 'connection': 'github', 'merge_mode': 5, 'newrev': None, 'number': 71, 'oldrev': None, 'patchset': 'e81d0b248565db290b30d9a638095947b699c76d', 'project': 'org/project', 'ref': 'refs/pull/71/head'} Traceback (most recent call last): File "/opt/zuul/lib/python3.10/site-packages/zuul/merger/merger.py", line 1099, in _mergeChange commit = repo.rebaseMerge( File "/opt/zuul/lib/python3.10/site-packages/zuul/merger/merger.py", line 626, in rebaseMerge repo.git.rebase(*args) File "/opt/zuul/lib/python3.10/site-packages/git/cmd.py", line 542, in <lambda> return lambda *args, **kwargs: self._call_process(name, *args, **kwargs) File "/opt/zuul/lib/python3.10/site-packages/git/cmd.py", line 1005, in _call_process return self.execute(call, **exec_kwargs) File "/opt/zuul/lib/python3.10/site-packages/git/cmd.py", line 822, in execute raise GitCommandError(command, status, stderr_value, stdout_value) git.exc.GitCommandError: Cmd('git') failed due to: exit code(128) cmdline: git rebase 39fead1852ef01a716a1c6470cee9e4197ff5587 stderr: 'fatal: It seems that there is already a rebase-merge directory, and I wonder if you are in the middle of another rebase. If that is the case, please try git rebase (--continue | --abort | --skip) If that is not the case, please rm -fr ".git/rebase-merge" and run me again. I am stopping in case you still have something valuable there. Change-Id: I8518cc5e4b3f7bbfc2c2283a2b946dee504991dd
* | | Merge "Return cached Github change on concurrent update"Zuul2023-02-171-0/+1
|\ \ \
| * | | Return cached Github change on concurrent updateSimon Westphahl2023-02-171-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a pull-request is updated concurrently on a scheduler we'll wait until the first thread has updated the change. The problem is if we needed to create a new PR object. In this case we'd return a Github change that wasn't updated and also doesn't have a cache key set. ERROR zuul.Scheduler: Exception processing pipeline check in tenant foobar Traceback (most recent call last): File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 2149, in process_pipelines refreshed = self._process_pipeline( File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 2241, in _process_pipeline self.process_pipeline_trigger_queue(tenant, pipeline) File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 2447, in process_pipeline_trigger_queue self._process_trigger_event(tenant, pipeline, event) File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 2480, in _process_trigger_event pipeline.manager.addChange(change, event) File "/opt/zuul/lib/python3.10/site-packages/zuul/manager/__init__.py", line 534, in addChange self.updateCommitDependencies(change, None, event) File "/opt/zuul/lib/python3.10/site-packages/zuul/manager/__init__.py", line 868, in updateCommitDependencies new_commit_needs_changes = [d.cache_key for d in dependencies] File "/opt/zuul/lib/python3.10/site-packages/zuul/manager/__init__.py", line 868, in <listcomp> new_commit_needs_changes = [d.cache_key for d in dependencies] File "/opt/zuul/lib/python3.10/site-packages/zuul/model.py", line 5946, in cache_key return self.cache_stat.key.reference AttributeError: 'NoneType' object has no attribute 'key' Change-Id: I2f3012060c486d593ad857e046334c3d9bef0ed5
* | | | Merge "Update reconfig event ltime on (smart) reconfig"Zuul2023-02-171-1/+1
|\ \ \ \
| * | | | Update reconfig event ltime on (smart) reconfigSimon Westphahl2023-02-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make sure to update the last reconfig event ltime in the layout state during a (smart) reconfig. This is important in order to unblock pipelines when a tenant reconfig event is lost. So far the last reconfig event ltime was passed as -1, but wasn't set in the layout state since the ltime is not allowed to go backward. Change-Id: Iab04a962abbfbe901c22e4d5f1d484e3f53b0d33
* | | | | Merge "Allow default-ansible-version to be an int"Zuul2023-02-171-3/+3
|\ \ \ \ \ | |/ / / / |/| | | |
| * | | | Allow default-ansible-version to be an intJames E. Blair2023-01-311-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We allow job.ansible-version to be a str, int, or float and coerce it to a string. We should do the same for tenant.default-ansible-version. Change-Id: I1a104b84d578f4932597893f62cd0cd06f031b4a
* | | | | Merge "Fix exception on retry in source base class"Zuul2023-02-151-2/+1
|\ \ \ \ \
| * | | | | Fix exception on retry in source base classSimon Westphahl2023-02-141-2/+1
| | |/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ERROR zuul.Scheduler: Exception processing pipeline check in tenant foobar Traceback (most recent call last): File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 2149, in process_pipelines refreshed = self._process_pipeline( File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 2241, in _process_pipeline self.process_pipeline_trigger_queue(tenant, pipeline) File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 2447, in process_pipeline_trigger_queue self._process_trigger_event(tenant, pipeline, event) File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 2480, in _process_trigger_event pipeline.manager.addChange(change, event) File "/opt/zuul/lib/python3.10/site-packages/zuul/manager/__init__.py", line 534, in addChange self.updateCommitDependencies(change, None, event) File "/opt/zuul/lib/python3.10/site-packages/zuul/manager/__init__.py", line 864, in updateCommitDependencies dep = source.getChangeByURLWithRetry(match, event) File "/opt/zuul/lib/python3.10/site-packages/zuul/source/__init__.py", line 112, in getChangeByURLWithRetry return dep UnboundLocalError: local variable 'dep' referenced before assignment Change-Id: I1c706e5e5d2d337ec84b8fc1ad5e900191f2362c
* | | | | Merge "Prevent files cache ltime from going backward"Zuul2023-02-151-0/+6
|\ \ \ \ \
| * | | | | Prevent files cache ltime from going backwardSimon Westphahl2023-02-011-0/+6
| | |/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This fixes a race condition when two schedulers request files for the same repository at roughly the same time. The files cache ltime is generated before creating the cat job. We also cannot guarantee the order in which the job results are processed. Since we include the files cache ltime in the min. ltimes as part of the layout state we need to make sure that the timestamp of the cache doesn't go backward. This should fix the following exception we occassionally see in zuul-web: WARNING zuul.ConfigLoader: Zuul encountered an error while accessing the repo org/project. The error was: Configuration files missing from cache. Check Zuul scheduler logs for more information. Change-Id: I9fe2ed0b7a95dd12bb978273673e905d06bee72d
* | | | | Merge "Simplufy shouldRefresh* checks"Zuul2023-02-151-6/+2
|\ \ \ \ \
| * | | | | Simplufy shouldRefresh* checksJames E. Blair2023-02-011-6/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In comments on I3824af6149bf27c41a8d895fc682236bd0d91f6b Clark Boylan suggested that an equality check of the version number might be a simpler and more robust way of checking that an object should be refreshed. This change implements that suggestion. Change-Id: I9261fc68442e4002c4579db4f99d3c151ffd485c
* | | | | | Merge "Expand the scope of zkprofile"Zuul2023-02-151-4/+3
|\ \ \ \ \ \
| * | | | | | Expand the scope of zkprofileJames E. Blair2023-02-141-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the zkprofile command only profiles the pipeline refresh operations. This expands the scope to include all pipeline operations (refreshing and then any subsequent changes). In particular, this will help identify places where we write too much data. Change-Id: Ida7bcb311b5037f6ce15a166ddeae6db44ed6709
* | | | | | | Merge "Add more log messages for run handler steps"Zuul2023-02-151-1/+7
|\ \ \ \ \ \ \
| * | | | | | | Add more log messages for run handler stepsSimon Westphahl2023-02-061-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This change adds distinct log message for certain run handler steps. So far some of those phases could only be inferred by other related log messages. Change-Id: Icceca78456b20754cbebb5747da3a538586cd5f5
* | | | | | | | Merge "Add scheduler run handler metric"Zuul2023-02-151-52/+56
|\ \ \ \ \ \ \ \ | |/ / / / / / /
| * | | | | | | Add scheduler run handler metricSimon Westphahl2023-02-061-52/+56
| | |_|_|_|/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to better understand the scheduler run handler performance this change adds a new `zuul.scheduler.run_handler` metric to measure the duration of one run handler loop. Change-Id: I77e862cf99d6a8445e71d7daab410d5853487dc3
* | | | | | | Merge "gerrit driver: fix bug around unicode branch names"Zuul2023-02-151-2/+7
|\ \ \ \ \ \ \ | |_|/ / / / / |/| | | | | |
| * | | | | | gerrit driver: fix bug around unicode branch namesAlex Hornung2023-01-241-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a branch name contains unicode characters that are more than 1-byte wide, the size in bytes of the pack record won't match the size in characters, and the pack parsing will be incorrect. Instead, treat everything as an encoded byte string until parsing is done - and only decode when handling a single, parsed, record. Change-Id: I7f1a0cc96a36129fbc04c7a8687da3f66c1eef99
* | | | | | | Merge "Fix queue item creation in bundle deserialization"Zuul2023-02-141-2/+1
|\ \ \ \ \ \ \
| * | | | | | | Fix queue item creation in bundle deserializationSimon Westphahl2023-02-141-2/+1
| | |_|_|_|/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I7fa99cd83a857216321f8d946fd42abd9ec427a3 replaced the pipeline reference of the `QueueItem` with a property. However, the pipeline was still passed to the `fromZK()` method when deserializing the bundle. ERROR zuul.Scheduler: Exception processing pipeline check in tenant foobar Traceback (most recent call last): File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 2149, in process_pipelines refreshed = self._process_pipeline( File "/opt/zuul/lib/python3.10/site-packages/zuul/scheduler.py", line 2227, in _process_pipeline pipeline.state.refresh(ctx) File "/opt/zuul/lib/python3.10/site-packages/zuul/model.py", line 715, in refresh return super().refresh(context) File "/opt/zuul/lib/python3.10/site-packages/zuul/zk/zkobject.py", line 238, in refresh self._load(context) File "/opt/zuul/lib/python3.10/site-packages/zuul/zk/zkobject.py", line 361, in _load self._set(**self.deserialize(data, context)) File "/opt/zuul/lib/python3.10/site-packages/zuul/model.py", line 762, in deserialize queue.refresh(context) File "/opt/zuul/lib/python3.10/site-packages/zuul/zk/zkobject.py", line 238, in refresh self._load(context) File "/opt/zuul/lib/python3.10/site-packages/zuul/zk/zkobject.py", line 361, in _load self._set(**self.deserialize(data, context)) File "/opt/zuul/lib/python3.10/site-packages/zuul/model.py", line 1103, in deserialize Bundle.deserialize(context, self, items_by_path, bundle_data))) File "/opt/zuul/lib/python3.10/site-packages/zuul/model.py", line 5912, in deserialize bundle.items = [ File "/opt/zuul/lib/python3.10/site-packages/zuul/model.py", line 5913, in <listcomp> items_by_path.get(p) or QueueItem.fromZK( File "/opt/zuul/lib/python3.10/site-packages/zuul/zk/zkobject.py", line 232, in fromZK obj._set(**kw) File "/opt/zuul/lib/python3.10/site-packages/zuul/zk/zkobject.py", line 420, in _set super().__setattr__(name, value) AttributeError: can't set attribute 'pipeline' Change-Id: Ia8de8f6717b3a1211cf4d1841e56ccd0c78ee677
* | | | | | | Merge "Fix deprecated use of currentThread() in REPL"Zuul2023-02-141-3/+3
|\ \ \ \ \ \ \
| * | | | | | | Fix deprecated use of currentThread() in REPLSimon Westphahl2023-02-141-3/+3
| |/ / / / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | zuul/lib/repl.py:29: DeprecationWarning: currentThread() is deprecated, use current_thread() instead obj = self.files.get(threading.currentThread(), self.default) Change-Id: I6f5a9b6b169b882024a41623364eefe0955796ef
* | | | | | | Merge "Don't nest alembic transactions"Zuul2023-02-141-18/+22
|\ \ \ \ \ \ \
| * | | | | | | Don't nest alembic transactionsClark Boylan2023-02-011-18/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SqlAlchemy 2.0 has gotten a lot more picky about transactions. In addition to needing to explicitly set up transactions using SqlAlchemy 2.0 it seems alembic's get_current_revision() call cannot be in the same transaction as the alembic migrate/stamp calls with MySQL 8.0. In particular the get_current_revision call seems to get a SHARED_HIGH_PRIO lock on the alembic_version table. This prevents the migrate/stamp calls from creating the alembic_version table as this requires an EXCLUSIVE lock. The SHARED_HIGH_PRIO lock appears to be in place as long as the get_current_revision transaction is active. To fix this we simplify our migration tooling and put get_current_revision in a transaction block of its own. The rest of our migrate function calls into functions that will setup new transactions and it doesn't need to be in this block. Change-Id: Ic71ddf1968610784cef72c4634ccec3a18855a0e
* | | | | | | | Merge "Address SqlAlchemy Removed in 2.0 Warnings"Zuul2023-02-143-15/+22
|\ \ \ \ \ \ \ \ | |/ / / / / / /
| * | | | | | | Address SqlAlchemy Removed in 2.0 WarningsClark Boylan2023-01-303-15/+22
| | |_|_|_|/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This addresses SqlAlchemy's removed in 2.0 warnings. Now that SqlAlchemy 2.0 has released we can see that we are not compatible yet. A good first step in adding compatibility is fixing warnings in 1.4. In particular there are four types of warning we fix here: 1. Using raw strings in conn.execute() calls. We need to use the text() construct instead. 2. Passing a list of items to select when doing select queries. Instead we need to pass things as normal posargs. 3. Accessing row result items as if the row is a dict. THis is not longer possible without first going through the row._mapping system. Instead we can access items as normal object attributes. 4. You must now use sqlalchemy.inspect() on a connectable to create an Inspector object rather than instantiating it directly. Finally we set up alembic's engine creation to run with future 2.0 behavior now that the warnings are cleared up. This appears to have already been done for the main zuul application. Change-Id: I5475e39bd93d71cd1106ec6d3a5423ea2dd51859
* | | | | | | Merge "Fix more file opening ResourceWarnings"Zuul2023-02-141-1/+2
|\ \ \ \ \ \ \ | |_|/ / / / / |/| | | | | |
| * | | | | | Fix more file opening ResourceWarningsClark Boylan2023-02-071-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I've managed to get better at grepping for this and this finds some of the stragglers. They are all file opens without closes fixed by using a with open() context manager. Change-Id: I7b8c8516a86558e2027cb0f01aefe2dd1849069c
* | | | | | | Fix configuration error related to YAML anchorsJames E. Blair2023-02-131-0/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PyYAML is efficient with YAML anchors and will only construct one Python object for a YAML mapping that is used multiple times via anchors. We copy job variables (a mapping) straight from the YAML to an attribute of the Job object, then we freeze the Job object which converts all of its dicts to mappingproxy objects. This mutates the contents of the vars dict, and if that is used on another job via an anchor, we will end up with mappingproxy objects in what we think is a "raw yaml" configuration dict, and we will not be able to serialize it in case of errors. To address this, perform a deep copy of every configuration yaml blob before parsing it into an object. Change-Id: Idaa6ff78b1ac5a108fb9f43700cf8e66192c43ce
* | | | | | | Merge "Update build/job versions in place"Zuul2023-02-111-6/+4
|\ \ \ \ \ \ \ | | |_|_|_|/ / | |/| | | | |
| * | | | | | Update build/job versions in placeJames E. Blair2023-01-171-6/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The previous change to add build/job object version numbers in ZK was perhaps too conservative with the way it updated them. It updated a copy of the version dictionary in order to avoid updating the in-memory dictionary in case the ZK write failed. This behavior is common for calls to updateAttributes, but we don't get this behavior when we use activeContext, so we generally don't rely on it. This change makes it slightly more efficient by updating the in-memory dictionary in place before writing it to ZK. Change-Id: If48d745ceefc0897946232947e5c6056fc833041
* | | | | | | Remove layout_uuid from PipelineState create callJames E. Blair2023-02-102-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This argument is no longer necessary because it is set via an inderect route through the pipeline and tenant objects. Remove it, and also validate that it is set correctly in unit tests. Change-Id: I62ce18a61a416cbc397866838f8ac3b0ec1bd564
* | | | | | | Fix race condition in pipeline change list initJames E. Blair2023-02-104-46/+165
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Simon Westphahl describes the race condition: > [The race condition] can occur after a reconfiguration while > some schedulers are updating their local layout and some > already start processing pipelines in the same tenant. > > In this case the pipeline manager's `_postConfig()` method that > calls `PipelineChangeList.create(...)` races with the pipeline > processor updating the change keys. > > This leads to two change lists being written as separate > shards, that can't be correctly loaded, as all shards combined > are expected to form a single JSON object. > > The sequence of events seems to be the following: > 1. S1: pipeline processor needs to update the change keys > 2. S1 the shard writer will delete the `change_list` key with the old > shards > 3. S2: configloader calls the `_postConfig()` method > 4. S2: `PipelineChangeList.create()` notices that the `change_list` node > doesn't exists in Zookeeper: > https://opendev.org/zuul/zuul/src/branch/master/zuul/model.py#L921 > 6. S2: the shard writer creates the first shard `0000000000` > 7. S1: the shard writer creates the second shared `0000000001` > > The race condition seems to be introduced with > Ib1e467b5adb907f93bab0de61da84d2efc22e2a7 That change updated the pipeline manager _postConfig method so that it no longer acquires the pipeline lock when initalizing the pipeline state and change lists. This greatly reduces potential delays during reconfiguration, but has, perhaps predictably, lead to the race condition above. In the commit message for that change, I noted that we might be able to avoid even more work if we accept some caveats related to error reporting. Avoiding that work mean avoiding performing any writes during _postConfig which addresses the root cause of the race condition (steps 3-6 above. Ed. note: there is no step 5). From the commit message: > We still need to attach new state and change list objects to > the pipeline in _postConfig (since our pipeline object is new). > We also should make sure that the objects exist in ZK before we > leave that method, so that if a new pipeline is created, other > schedulers will be able to load the (potentially still empty) > objects from ZK. As an alternative, we could avoid even this > work in _postConfig, but then we would have to handle missing > objects on refresh, and it would not be possible to tell if the > object was missing due to it being new or due to an error. To > avoid masking errors, we keep the current expectation that we > will create these objects in ZK on the initial reconfiguration. The current change does exactly that. We no longer perform any ZK write operations on the state and change list objects in _postConfig. Instead, inside of the refresh methods, we detect the cases where they should be newly created and do so at that time. This happens with the pipeline lock, so is safe against any simultaneous operation from other components. There will be "ERROR" level log messages indicating that reading the state from ZK has failed when these objects are first initialized. To indicate that this is probably okay, they will now be immediately followed by "WARNING" level messages explaining that. Strictly speaking, this particular race should only occur for the change list object, not the pipeline state, since the race condition above requires a sharded object and of the two, only the change list is sharded. However, to keep the lifecycle of these two objects matched (and to simplify _postConfig) the same treatment is applied to both. Note that change I7fa99cd83a857216321f8d946fd42abd9ec427a3 merged after Ib1e467b and changed the behavior slightly, introducing the old_state and old_list arguments. Curiously, the old_list argument is effectively unused, so it is removed entirely in this change. Old_state still has a purpose and is retained. Change-Id: I519348e7d5d74e675808e990920480fb6e1fb981
* | | | | | | Merge "Add an !unsafe change_message variable"Zuul2023-02-102-4/+14
|\ \ \ \ \ \ \
| * | | | | | | Add an !unsafe change_message variableJames E. Blair2023-02-092-4/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In I9628e2770dda120b269612e28bb6217036942b8e we switched zuul.change from a plain string tagged with !unsafe to base64 encoded and no !unsafe tag. The idea was to make the inventory file parseable by external tools while avoiding accidental interpolation of the commit message by Ansible. That doesn't work in all cases -- it's not hard to construct a scenario where after base64 decoding the message any further processing by Ansible causes it to undergo interpolation. Moreover, since then we have made many changes to how we deal with variables; notably, the inventory.yaml is no longer actually used by Zuul's Anisble -- it is now there only for human and downstream processing. We call it the "debug inventory". The actual inventory is much more complex and in some cases has lots of !unsafe tags in it. Given all that, it now seems like the most straightforward way to deal with this is to tag the message variable as !unsafe when passing it to Zuul's Ansible, but render it as plain text in the inventory.yaml. To address backwards compatability, this is done in a new variable called zuul.change_message. Since that's a more descriptive variable anyway, we will just keep that one in the future and drop the current base64- encoded zuul.message variable Change-Id: Iea86de15e722bc271c1bf0540db2c9efb032500c
* | | | | | | | Merge "Refresh dependencies for changes in pipelines"Zuul2023-02-091-6/+6
|\ \ \ \ \ \ \ \
| * | | | | | | | Refresh dependencies for changes in pipelinesJames E. Blair2023-01-241-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 549103881899751e84b411212f7820cde502b42b appears to have had an error. It attempted to only refresh changes if they are in the pipeline or are dependencies of changes in the pipeline, but it only did the latter, which has an adverse effect for the GitHub driver. We can change dependencies of GitHub changes without creating new patchsets, which means that this error (not updating dependencies of changes in the pipeline) would cause Zuul not to eject a GitHub change from a pipeline when its dependencies changed via a PR message update. The error went unnoticed with Gerrit since dependency updates happen with new patchsets, which themselves cause changes to be ejected. This change corrects the behavior. Repetitive calls to updateCommitDependencies are also eliminated by creating a set of changes to update across the whole pipeline. Finally, two tests are added, one for Gerrit (which previously would have passed) and one for GitHub (which previously would have failed). Change-Id: I005820223471e8f6372ef70730f381918ec9c1af
* | | | | | | | | Merge "Replace use of ProjectContext with SourceContext"Zuul2023-02-092-22/+7
|\ \ \ \ \ \ \ \ \ | |_|/ / / / / / / |/| | | | | | | |
| * | | | | | | | Replace use of ProjectContext with SourceContextSimon Westphahl2023-02-082-22/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It seems we never added serialization for ProjectContext objects. This issue showed up after I3ed9a98dfc1ed63ac11025eb792c61c9a6414384 since we use the ProjectContext for reporting errors related to the merge mode. Since the ProjectContext and SourceContext are almost identical we can use a SourceContext object with irrelevant fields set to None in places where we currently use the ProjectContext. ERROR zuul.Pipeline.tenant.check: [e: ...] Error in dynamic layout Traceback (most recent call last): File "/opt/zuul/lib/python3.10/site-packages/zuul/manager/__init__.py", line 1136, in _loadDynamicLayout item.setConfigErrors(relevant_errors) File "/opt/zuul/lib/python3.10/site-packages/zuul/model.py", line 5549, in setConfigErrors self.current_build_set.setConfigErrors(errors) File "/opt/zuul/lib/python3.10/site-packages/zuul/model.py", line 4129, in setConfigErrors el = ConfigurationErrorList.new(self._active_context, File "/opt/zuul/lib/python3.10/site-packages/zuul/zk/zkobject.py", line 224, in new data = obj._trySerialize(context) File "/opt/zuul/lib/python3.10/site-packages/zuul/zk/zkobject.py", line 249, in _trySerialize return self.serialize(context) File "/opt/zuul/lib/python3.10/site-packages/zuul/model.py", line 289, in serialize "errors": [e.serialize() for e in self.errors], File "/opt/zuul/lib/python3.10/site-packages/zuul/model.py", line 289, in <listcomp> "errors": [e.serialize() for e in self.errors], File "/opt/zuul/lib/python3.10/site-packages/zuul/model.py", line 254, in serialize "key": self.key.serialize() File "/opt/zuul/lib/python3.10/site-packages/zuul/model.py", line 211, in serialize "context": self.context and self.context.serialize(), AttributeError: 'ProjectContext' object has no attribute 'serialize' Change-Id: I1b48abcd593685c332f3f218e2ed05cbcebfb244
* | | | | | | | | Merge "Fix DeprecationWarning: ssl.PROTOCOL_TLS is deprecated"Zuul2023-02-093-3/+3
|\ \ \ \ \ \ \ \ \ | | |_|_|/ / / / / | |/| | | | | | |
| * | | | | | | | Fix DeprecationWarning: ssl.PROTOCOL_TLS is deprecatedClark Boylan2023-02-073-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since python 3.10 ssl.PROTOCOL_TLS has been deprecated. We are expected to use ssl.PROTOCOL_TLS_CLIENT and ssl.PROTOCOL_TLS_SERVER depending on how the sockets are to be used. Switch over to these new constants to avoid the DeprecationWarning. One thing to note is that PROTOCOL_TLS_CLIENT has default behaviors around cert verification and hostname checking. Zuul is already explicitly setting those options the way it wants to and I've left that alone to avoid trouble if the defaults change later. Finally, this doesn't fix the occurence of this error that happens within kazoo. A separate PR has been made upstream to kazoo and this should be fixed in the next kazoo release. Change-Id: Ib41640f1d33d60503066464c8c98f865a74f003a
* | | | | | | | | Merge "Fix ResourceWarnings in the executor server"Zuul2023-02-091-2/+4
|\ \ \ \ \ \ \ \ \ | |/ / / / / / / /
| * | | | | | | | Fix ResourceWarnings in the executor serverClark Boylan2023-02-071-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have two open() calls in the executor server without matching close() calls. Fix this by wrapping both in a with context manager. Change-Id: I0e51c2b22ea1540484851f98749e648728b26406