path: root/ironic
Commit message (Author, Age, Files, Lines)
* CI: Fix another network test (Julia Kreger, 2023-05-08, 1 file, -0/+4)
| It turns out that more than one test relied on object change determination. This modifies the test to use the same pattern of behavior so that we avoid racing.
| Change-Id: I29ee6cab7320d13fcc2eeda27dae08aeb2d98b00
* CI: Modify dhcp client ID fail (Julia Kreger, 2023-05-08, 1 file, -2/+4)
| Under certain CI race conditions, the test may periodically be handled as if there was no change. This breaks the test because it does not save a modified port; it uses the in-flight list of changes to determine the correct path. The challenge is that the list of changes may not recognize that there has been a change in the underlying object/db layer.
| So instead of re-testing the library code, we force the behavior by replacing the method on the object in the test, since the underlying method is already tested as part of the oslo versioned objects code base.
| Change-Id: Ic8f9b2384ab2f8f76299afce9806fbe93e350f0e
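Forcing the change list in a test, as the message above describes, could look roughly like the sketch below. `FakePort` and `needs_update` are hypothetical stand-ins and not Ironic code; only the method name `obj_what_changed` comes from oslo.versionedobjects.

```python
from unittest import mock


class FakePort:
    """Hypothetical stand-in for a versioned object such as a port."""

    def __init__(self):
        self._changed = set()

    def obj_what_changed(self):
        # The real library derives this from tracked field changes.
        return set(self._changed)


def needs_update(port):
    """Hypothetical code under test: act only if 'extra' is reported changed."""
    return 'extra' in port.obj_what_changed()


port = FakePort()

# Replace the method on this one object so the test no longer depends on
# change tracking, which the library's own test suite already covers.
with mock.patch.object(port, 'obj_what_changed', return_value={'extra'}):
    forced = needs_update(port)

# Outside the patch the original method is back in place.
unforced = needs_update(port)
```

Patching the instance rather than the class keeps the override local to the one object the test cares about.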
* Merge "Handle MissingAttributeError when using OOB inspections to fetch MACs" (Zuul, 2023-05-08, 2 files, -1/+20)
|\
| * Handle MissingAttributeError when using OOB inspections to fetch MACs (Jacob Anders, 2023-05-02, 2 files, -1/+20)
| | Currently, if an attempt is made to fetch MAC address information using OOB inspection on a Redfish-managed node and the EthernetInterfaces attribute is missing on the node, inspection fails because sushy raises a MissingAttributeError exception. This change catches and handles that exception.
| | Change-Id: I6f16da05e19c7efc966128fdf79f13546f51b5a6
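The handling described above can be illustrated with a small sketch. Everything here is a hypothetical stand-in: the exception class only mimics the name of the sushy exception, and `FakeSystem` and `get_mac_addresses` are invented for illustration. What the real change does after catching the exception is not stated in the message; returning `None` is an assumption.

```python
class MissingAttributeError(Exception):
    """Stand-in named after the exception that sushy raises."""


class FakeSystem:
    """Hypothetical Redfish system lacking the EthernetInterfaces attribute."""

    @property
    def ethernet_interfaces(self):
        raise MissingAttributeError('EthernetInterfaces')


def get_mac_addresses(system):
    """Return a list of MACs, or None when the attribute is unavailable."""
    try:
        interfaces = system.ethernet_interfaces
    except MissingAttributeError:
        # Assumption: treat the missing attribute as "no data" rather than
        # letting the exception fail the whole inspection.
        return None
    return list(interfaces)


macs = get_mac_addresses(FakeSystem())
```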
* | Fix DB/Lock session handling issues (Julia Kreger, 2023-05-01, 6 files, -29/+72)
| | Prior to this fix, we have been unable to run the Metal3 CI job with SQLAlchemy's internal autocommit setting enabled. However, that setting is deprecated and needs to be removed.
| | Investigating our DB queries and request patterns, we identified some queries which generally resulted in the underlying task and lock being held longer because the output was not actually returned, which is something we have had to fix in some places previously.
| | Making some of these changes drastically reduced the number of errors encountered with the Metal3 CI job, but did not eliminate them entirely.
| | On further investigation, we determined that the underlying issue arises with an external semi-random reader, such as Metal3 polling endpoints. In that situation we can be blocked from updating the database because, to open a write lock, we need the active readers not to be interacting with the database. With a random reader of this sort, the only realistic option we have is to enable the Write Ahead Log[0]. We didn't have to do this with SQLAlchemy previously because autocommit behavior hid the complexities from us, but in order to move to SQLAlchemy 2.0, we do need to remove autocommit.
| | Additionally, adds two unit tests for the get_node_with_token rpc method, which we apparently missed or lost somewhere along the way.
| | Also adds notes to two database interactions to suggest we look at them in the future, as they may not be the most efficient path forward.
| | [0]: https://www.sqlite.org/wal.html
| | Change-Id: Iebcc15fe202910b942b58fc004d077740ec61912
* | Merge "Upgrade to latest hacking - v6" (Zuul, 2023-04-30, 3 files, -3/+3)
|\ \
| * | Upgrade to latest hacking - v6 (Jay Faulkner, 2023-04-21, 3 files, -3/+3)
| | | Required minor changes to existing files to comply with new flake rules.
| | | Change-Id: Ia0bff27ab4a7ec98c533ea66357a3c0529026102
* | | Merge "[iRMC] Fix typo of Python string format in log message" (Zuul, 2023-04-27, 1 file, -1/+1)
|\ \ \
| * | | [iRMC] Fix typo of Python string format in log message (Vanou Ishii, 2023-04-24, 1 file, -1/+1)
| |/ /
| | | This patch fixes a Python string format mistake in a log message of the iRMC driver.
| | | Change-Id: Ib58ae51849cbb06b3dcd6222d5b4ddacd2fbe230
* | | Merge "Remove all references to the "cpus" property" (Zuul, 2023-04-27, 10 files, -93/+33)
|\ \ \
| * | | Remove all references to the "cpus" property (Dmitry Tantsur, 2023-03-28, 10 files, -93/+33)
| | | | Unused by Nova and, unlike memory_mb/local_gb, also unused by Ironic (actually, our usage of local_gb is worth double-checking as well, but at the very least it's referenced by inspection implementations).
| | | | Change-Id: Ie8b0d9f58f4dcd102c183c30ae7f5acf68a5e4c3
* | | | Merge "Add ablity to power off nodes in clean failed" (Zuul, 2023-04-27, 3 files, -0/+43)
|\ \ \ \
| |_|/ /
|/| | |
| * | | Add ablity to power off nodes in clean failed (Chris Krelle, 2023-04-24, 3 files, -0/+43)
| | |/
| |/|
| | | We have seen duplicate IP issues when leaving clean failed nodes powered on. This patch allows operators to power down nodes that enter the clean failed state.
| | | Change-Id: Iecb402227485fe0ba787a262121c9d6a048b0e13
* | | tests: Replace invalid UUIDs (Stephen Finucane, 2023-04-19, 3 files, -6/+13)
| | | Fix the warnings that oslo.versionedobjects has been emitting for years now.
| | | Change-Id: I53bd78d8b70f276d2ea8569f0ab1e7ce04f52fea
| | | Signed-off-by: Stephen Finucane <sfinucan@redhat.com>
* | | db: Resolve SAWarning warnings (Stephen Finucane, 2023-04-19, 2 files, -12/+4)
|/ /
| | Resolve the following SAWarning warning:
| |   SELECT statement has a cartesian product between FROM element(s) "foo" and FROM element "bar". Apply join condition(s) between each element to resolve.
| | This was happening because we were filtering instances of ConductorHardwareInterfaces by the state of the Conductor referenced by the 'conductor_id' field *without* joining the Conductor table. By adding the join, we can avoid this cartesian product.
| | Change-Id: I2c20d7a7c1de41d4d0057fabc1d953b5bfb5b216
| | Signed-off-by: Stephen Finucane <sfinucan@redhat.com>
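The effect of the missing join can be reproduced with plain SQL. The sketch below uses the standard library `sqlite3` module rather than SQLAlchemy, and the table and column names are simplified assumptions rather than the actual Ironic schema.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE conductors (id INTEGER PRIMARY KEY, online INTEGER);
    CREATE TABLE conductor_hardware_interfaces (
        id INTEGER PRIMARY KEY, conductor_id INTEGER, interface_name TEXT);
    INSERT INTO conductors (id, online) VALUES (1, 1), (2, 1), (3, 0);
    INSERT INTO conductor_hardware_interfaces (conductor_id, interface_name)
        VALUES (1, 'ipmitool'), (3, 'redfish');
''')

# Filtering on conductors without a join condition pairs every interface
# row with every online conductor: a cartesian product.
without_join = conn.execute('''
    SELECT chi.interface_name
    FROM conductor_hardware_interfaces AS chi, conductors AS c
    WHERE c.online = 1
''').fetchall()

# Joining on conductor_id keeps only interfaces whose own conductor is online.
with_join = conn.execute('''
    SELECT chi.interface_name
    FROM conductor_hardware_interfaces AS chi
    JOIN conductors AS c ON c.id = chi.conductor_id
    WHERE c.online = 1
''').fetchall()
```

Here the unjoined query returns four rows (two interfaces times two online conductors), including the interface of the offline conductor, while the joined query returns only the one row that actually matches.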
* | Merge "Always fall back from hard linking to copying files" (Zuul, 2023-04-10, 2 files, -107/+35)
|\ \
| * | Always fall back from hard linking to copying files (Dmitry Tantsur, 2023-03-31, 2 files, -107/+35)
| | | The current check is insufficient: it passes for Kubernetes shared volumes, although hard-linking between them is not possible. This patch changes the approach to trying a hard link and falling back to copyfile instead.
| | | The patch relies on optimizations in Python 3.8 and thus should not be backported beyond the Zed series to avoid performance regression.
| | | Change-Id: I929944685b3ac61b2f63d2549198a2d8a1c8fe35
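The try-then-fall-back approach described above can be sketched as follows. The helper name `link_or_copy` and the file names are hypothetical; the actual Ironic function may differ in name, error handling, and logging.

```python
import os
import shutil
import tempfile


def link_or_copy(src, dest):
    """Try to hard link src to dest; fall back to copying on failure."""
    try:
        os.link(src, dest)
        return 'linked'
    except OSError:
        # Covers cross-device links and volumes that refuse hard links,
        # without having to predict the outcome up front.
        shutil.copyfile(src, dest)
        return 'copied'


workdir = tempfile.mkdtemp()
source = os.path.join(workdir, 'image.raw')
target = os.path.join(workdir, 'image-copy.raw')
with open(source, 'wb') as f:
    f.write(b'example image data')

method = link_or_copy(source, target)
with open(target, 'rb') as f:
    target_data = f.read()
```

Attempting the operation and handling the failure avoids a separate pre-check, which the message says can pass even where linking is impossible.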
* | | Merge "On rpc service stop, wait for node reservation release" (Zuul, 2023-04-05, 3 files, -8/+77)
|\ \ \
| * | | On rpc service stop, wait for node reservation release (Steve Baker, 2023-02-27, 3 files, -8/+77)
| | | | Instead of clearing existing reservations at the beginning of del_host, wait for the tasks holding them to go to completion. This check continues indefinitely until the conductor process exits due to one of:
| | | | - All reservations for this conductor are released
| | | | - CONF.graceful_shutdown_timeout has elapsed
| | | | - The process manager (systemd, kubernetes) sends SIGKILL after the configured graceful period
| | | | Because the default values of [DEFAULT]graceful_shutdown_timeout and [conductor]heartbeat_timeout are the same (60s), no other conductor will claim a node as an orphan until this conductor exits.
| | | | Change-Id: Ib8db915746228cd87272740825aaaea1fdf953c7
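A polling loop of the kind described above might look like this sketch. The function name, its parameters, and the way reservations are counted are assumptions for illustration; the SIGKILL case is outside the process's control and so is not modeled.

```python
import time


def wait_for_reservations(count_reserved, timeout, interval=1.0,
                          sleep=time.sleep, clock=time.monotonic):
    """Poll until no reservations remain or the timeout elapses.

    Returns True if all reservations were released, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if count_reserved() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Simulated reservation counts that drain to zero over three polls.
remaining = iter([2, 1, 0])
released = wait_for_reservations(lambda: next(remaining), timeout=5,
                                 sleep=lambda _: None)

# A reservation that never clears, with a zero timeout, gives up at once.
timed_out = wait_for_reservations(lambda: 1, timeout=0,
                                  sleep=lambda _: None)
```

Passing `sleep` and `clock` as parameters is only a testing convenience in this sketch.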
* | | | Fix requests calls with timeouts (Julia Kreger, 2023-04-04, 6 files, -19/+37)
| | | | Bandit 1.7.5 was released with logic to check requests invocations. Specifically, a call without an explicitly set timeout now results in an error. This change should cause our bandit job to go green.
| | | | Closes-Bug: 2015284
| | | | Change-Id: I1dcb3075de63aae97bb22012a54736c293393185
* | | | Merge "Add error logging on lookup failures in the API" (Zuul, 2023-04-04, 1 file, -1/+5)
|\ \ \ \
| |_|/ /
|/| | |
| * | | Add error logging on lookup failures in the API (Dmitry Tantsur, 2023-03-17, 1 file, -1/+5)
| | | | Lookup returns generic 404 errors for security reasons. Logging is the only way of debugging any issues during lookup.
| | | | Change-Id: I860ed6b90468a403f0f6cdec9c3d84bc872fda06
* | | | Merge "Enables boot modes switching with Anaconda deploy for ilo driver" (Zuul, 2023-03-27, 2 files, -0/+90)
|\ \ \ \
| * | | | Enables boot modes switching with Anaconda deploy for ilo driver (Nisha Agarwal, 2023-03-17, 2 files, -0/+90)
| | | | | Enables boot modes switching with Anaconda deploy for ilo driver
| | | | | Story: 2010357
| | | | | Task: 46530
| | | | | Change-Id: I383cdd5c9d45b074d351ec98b1145fd68e2f3ac3
* | | | | Merge "Refactoring: clean up inspection data handlers" (Zuul, 2023-03-23, 8 files, -199/+150)
|\ \ \ \ \
| * | | | | Refactoring: clean up inspection data handlers (Dmitry Tantsur, 2023-03-14, 8 files, -199/+150)
| | | | | | * Avoid using the term "introspection". We need to settle on either "inspection" or "introspection", and the Ironic API already uses the former.
| | | | | | * Accept (and return) inventory and plugin data separately to reflect the Ironic API (single JSON blobs are an Inspector legacy).
| | | | | | * Make sure to mention the container name in error logging.
| | | | | | * Use more readable formatting syntax for building Swift names.
| | | | | | * Do not mock objects with dicts (in unit tests).
| | | | | | * Simplify inventory API tests.
| | | | | | Change-Id: Id8c4bc6d35b9634f5a5ac2b345a8fd7f1dba13c0
* | | | | | Merge "Refactoring: DRY in the root API controller" (Zuul, 2023-03-23, 1 file, -117/+29)
|\ \ \ \ \ \
| |/ / / / /
| * | | | | Refactoring: DRY in the root API controller (Dmitry Tantsur, 2023-03-14, 1 file, -117/+29)
| | | | | | Change-Id: I7bba31e73daef7292d0710242e6f88793b7ab357
* | | | | | Merge "Refactoring: create ironic.conductor.inspection" (Zuul, 2023-03-23, 4 files, -193/+232)
|\ \ \ \ \ \
| |/ / / / /
| | | | | /
| |_|_|_|/
|/| | | |
| * | | | Refactoring: create ironic.conductor.inspection (Dmitry Tantsur, 2023-03-14, 4 files, -193/+232)
| | | | | ... to reduce the already frightening size of ironic.conductor.manager and make space for more inspection additions.
| | | | | While here, fix up log messages for clarity and brevity.
| | | | | Change-Id: I5196d58016ae094f17e0aad187a11d9cceaab04b
* | | | | Merge "Fixes Secureboot with Anaconda deploy" (Zuul, 2023-03-20, 3 files, -17/+15)
|\ \ \ \ \
| | |/ / /
| |/| / /
| |_|/ /
|/| | |
| * | | Fixes Secureboot with Anaconda deploy (Nisha Agarwal, 2023-03-16, 3 files, -17/+15)
| | | | Fixes Secureboot with Anaconda deploy with PXE and iPXE
| | | | Story: 2010356
| | | | Task: 46529
| | | | Change-Id: Id6262654bb5e41e02c7d90b9a9aaf395e7b6a088
* | | | Merge "Fix auth_protocol and priv_protocol for SNMP v3" (Zuul, 2023-03-17, 4 files, -2/+105)
|\ \ \ \
| |/ / /
|/| | |
| * | | Fix auth_protocol and priv_protocol for SNMP v3 (Duc Truong, 2023-03-01, 4 files, -2/+105)
| | | | The SNMP driver was using the wrong dictionary key to retrieve auth_protocol and priv_protocol from driver info. As a result, the SNMP client was created with empty strings for both of those fields. Any node configured to use SNMP v3 with those fields failed because the SNMP driver was unable to perform power-related operations due to an authentication error.
| | | | - Use correct keys for snmp auth_protocol and priv_protocol when creating the SNMP client
| | | | - Sanitize snmp auth_key and priv_key in API results
| | | | Story: 2010613
| | | | Task: 47535
| | | | Change-Id: I5efd3c9f79a021f1a8e613c3d13b6596a7972672
* | | | Merge "Wipe Agent Token when cleaning timeout occcurs" (Zuul, 2023-03-14, 2 files, -2/+8)
|\ \ \ \
| * | | | Wipe Agent Token when cleaning timeout occcurs (Julia Kreger, 2023-03-02, 2 files, -2/+8)
| | | | | In a relatively odd turn of events, should cleaning have started but then timed out due to lost communications or a hard failure of the machine, an agent token could previously be orphaned, preventing re-cleaning.
| | | | | We now explicitly remove the token in this case.
| | | | | Change-Id: I236cdf6ddb040284e9fd1fa10136ad17ef665638
* | | | | Merge "Clean out agent token even if power is already off" (Zuul, 2023-03-13, 2 files, -0/+32)
|\ \ \ \ \
| |_|_|/ /
|/| | | |
| * | | | Clean out agent token even if power is already off (Julia Kreger, 2023-03-02, 2 files, -0/+32)
| |/ / /
| | | | While investigating a very curious report, I discovered that if the power to a node was somehow *already* turned off, say through an incorrect BMC *or* human action, and Ironic were to pick it up (as it does by default, because it checks before applying the power state), then it would not wipe the token information, preventing the agent from connecting on the next action/attempt/operation.
| | | | We now remove the token on all calls to the conductor utilities node_power_action method when appropriate, even if no other work is required.
| | | | Change-Id: Ie89e8be9ad2887467f277772445d4bef79fa5ea1
* | | | Merge "Refactoring: extract some common functions from the inspector code" (Zuul, 2023-03-13, 3 files, -37/+45)
|\ \ \ \
| * | | | Refactoring: extract some common functions from the inspector code (Dmitry Tantsur, 2023-03-01, 3 files, -37/+45)
| | | | | Change-Id: I0acc5303c1a38645318fb9be4cb068d069b7fe6a
* | | | | Merge "Do not recalculate checksum if disk_format is not changed" (Zuul, 2023-03-13, 2 files, -19/+91)
|\ \ \ \ \
| * | | | | Do not recalculate checksum if disk_format is not changed (Dmitry Tantsur, 2023-03-07, 2 files, -19/+91)
| | | | | | Even if a glance image is raw, we still recalculate the checksum after "converting" it to raw. This process may take exceptionally long.
| | | | | | Change-Id: Id93d518b8d2b8064ff901f1a0452abd825e366c0
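The idea of skipping the recalculation when the format does not change can be sketched as below. The function and parameter names are hypothetical, and SHA-256 is chosen only for illustration; the message does not say which checksum algorithm or code path Ironic uses.

```python
import hashlib


def checksum_for_deploy(image_format, target_format, known_checksum,
                        read_image):
    """Reuse the known checksum unless the image format actually changed."""
    if image_format == target_format:
        # Nothing was converted, so the existing checksum is still valid.
        return known_checksum
    # Only a real conversion requires hashing the (possibly huge) result.
    return hashlib.sha256(read_image()).hexdigest()


reads = []


def read_image():
    # Stand-in for reading the converted image; records each invocation.
    reads.append(1)
    return b'converted image bytes'


same_format = checksum_for_deploy('raw', 'raw', 'abc123', read_image)
converted = checksum_for_deploy('qcow2', 'raw', 'abc123', read_image)
```

In this sketch the image is read once, for the qcow2 case only.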
* | | | | | Merge "Restructure the inspector module in preparation for its expansion" (Zuul, 2023-03-10, 8 files, -99/+152)
|\ \ \ \ \ \
| | |/ / / /
| |/| | | |
| * | | | | Restructure the inspector module in preparation for its expansion (Dmitry Tantsur, 2023-03-01, 8 files, -99/+152)
| | | | | | Converts ironic.drivers.modules.inspector into a package with two subpackages: client and interface, the latter containing most of the current content.
| | | | | | Change-Id: Idbfd275c60a873e3de2e0a34db793619f8c99d85
* | | | | | Update release mappings for 21.4 release [tag: 21.4.0] (Julia Kreger, 2023-03-07, 2 files, -4/+66)
| | | | | | This mapping allows object version upgrades to be navigated and needs to be updated pre-release; otherwise we break the inherent upgrade job to the latest state of the development branch.
| | | | | | Also, had to backfill the records for the bugfix branch: while they are not required for that version to run, they are required to upgrade from that version.
| | | | | | Also, lists antelope and 2023.1 as "named" releases. Given the ambiguity and configuration, it seemed better to be on the safe side.
| | | | | | Change-Id: I633275caf8c3dc750023fbb27bd8a3f4d23e9fa5
* | | | | | Fix online upgrades for Bios/Traits (Julia Kreger, 2023-03-07, 2 files, -12/+48)
| |/ / / /
|/| | | |
| | | | | ... And tags, but nobody uses tags since they are not available via the API.
| | | | | Anyhow, the online upgrade code was written under the assumption that *all* tables had an "id" column. This is not always true in the ironic data model for tables which started as pure extensions of the Nodes table, and fails in particular when:
| | | | | 1) A database row has data stored in an earlier version of the object
| | | | | 2) That same object gets a version upgrade.
| | | | | In the case which discovered this, BIOSSetting was added at version 1.0, and later updated to include additional fields which incremented the version to 1.1.
| | | | | When the upgrade went to evaluate and iterate through the fields, the command failed because the table was designed around "node_id" instead of "id".
| | | | | Story: 2010632
| | | | | Task: 47590
| | | | | Change-Id: I7bec6cfacb9d1558bc514c07386583436759f4df
* | | | | Merge "Do not move nodes to CLEAN FAILED with empty last_error" (Zuul, 2023-03-02, 8 files, -27/+72)
|\ \ \ \ \
| |/ / / /
|/| | | |
| * | | | Do not move nodes to CLEAN FAILED with empty last_error (Dmitry Tantsur, 2023-03-01, 8 files, -27/+72)
| | | | | When cleaning fails, we power off the node, unless it has been running a clean step already. This happens when aborting cleaning or on a boot failure. This change makes sure that the power action does not wipe the last_error field, resulting in a node with provision_state=CLEANFAIL and last_error=None for several seconds. I've hit this in Metal3.
| | | | | Also when aborting cleaning, make sure last_error is set during the transition to CLEANFAIL, not when the clean up thread starts running.
| | | | | While here, make sure to log the current step in all cases, not only when aborting a non-abortable step.
| | | | | Change-Id: Id21dd7eb44dad149661ebe2d75a9b030aa70526f
| | | | | Story: #2010603
| | | | | Task: #47476
* | | | | Merge "Respond to rpc requests on stop until hash ring reset" (Zuul, 2023-02-28, 3 files, -7/+105)
|\ \ \ \ \
| |_|/ / /
|/| | | /
| | |_|/
| |/| |
| * | | Respond to rpc requests on stop until hash ring reset (Steve Baker, 2023-02-27, 3 files, -7/+105)
| | | | Currently when a conductor is stopped, the rpc service stops responding to requests as soon as self.manager.del_host returns. This means that until the hash ring is reset on the whole cluster, requests can be sent to a service which is stopped.
| | | | This change waits for the remaining seconds to delay stopping until CONF.hash_ring_reset_interval has elapsed. This will improve the reliability of the cluster when scaling down or rolling out updates.
| | | | This delay only occurs when there is more than one online conductor, to allow fast restarts on single-node ironic installs (bifrost, metal3).
| | | | Change-Id: I643eb34f9605532c5c12dd2a42f4ea67bf3e0b40
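The delay rule described above reduces to a small calculation, sketched here. The function name and the choice to measure `elapsed` from the start of shutdown are assumptions; only the two conditions (remaining time up to the reset interval, and more than one online conductor) come from the message.

```python
def remaining_stop_delay(elapsed, hash_ring_reset_interval,
                         online_conductors):
    """Seconds to keep answering RPC requests before stopping."""
    if online_conductors <= 1:
        # Single-conductor installs have no peers to wait for.
        return 0.0
    # Wait only for whatever part of the interval has not yet passed.
    return max(0.0, float(hash_ring_reset_interval - elapsed))


partial_wait = remaining_stop_delay(5, 15, online_conductors=3)
no_wait_left = remaining_stop_delay(20, 15, online_conductors=3)
single_node = remaining_stop_delay(5, 15, online_conductors=1)
```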