delta/openstack/ironic.git - opendev.org: openstack/ironic.git

	Commit message	Author	Age	Files	Lines
*	Merge "Support longer checksums for redfish firmware upgrade"	Zuul	2023-05-09	1	-0/+6
\|\
\| *	Support longer checksums for redfish firmware upgrade	Julia Kreger	2023-05-03	1	-0/+6
	Previously only SHA1 hashes were supported, now we support SHA256 and SHA512 by length. Change-Id: Iddb196faca4008837595a3d0923f55d0e9d2aea5
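	A minimal sketch of selecting a hash algorithm "by length", as the message above describes. The helper names and the lookup table are assumptions for illustration, not Ironic's actual code; the only fact relied on is that hex digests of SHA1, SHA256 and SHA512 are 40, 64 and 128 characters long.

```python
import hashlib

# Hypothetical table: hex digest length -> hashlib algorithm name.
_ALGO_BY_HEX_LENGTH = {40: "sha1", 64: "sha256", 128: "sha512"}


def guess_hash_algorithm(checksum: str) -> str:
    """Guess the hash algorithm from the length of a hex checksum."""
    length = len(checksum.strip())
    try:
        return _ALGO_BY_HEX_LENGTH[length]
    except KeyError:
        raise ValueError(f"unsupported checksum length: {length}")


def verify(data: bytes, checksum: str) -> bool:
    """Hash data with the guessed algorithm and compare to the checksum."""
    algo = guess_hash_algorithm(checksum)
    return hashlib.new(algo, data).hexdigest() == checksum.strip().lower()
```

	A checksum of any other length is rejected rather than silently treated as SHA1.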
* \|	Merge "Remove use of nomodeset by default"	Zuul	2023-05-09	1	-0/+22
\|\ \
\| * \|	Remove use of nomodeset by default	Julia Kreger	2023-04-26	1	-0/+22
	The troubleshooting kernel command line option nomodeset unfortunately changes the way framebuffer interactions work with graphics devices, which in some cases can result in kernel memory being used for graphics updates. When this happens on some specific hardware common in rack mount servers with baseboard management controllers, this can cause the memory bus to become locked for a brief time while the graphics update is occurring. This locked memory bus means disk IO can become blocked, and network cards can overflow their buffers, resulting in packet loss on top of the latency incurred by the graphics update executing. As such, we've removed the nomodeset option from default usage and added a note describing its removal to the documentation along with a release note. Change-Id: I9084d88c3ec6f13bd64b8707892758fa87dd7f86
* \| \|	Imported Translations from Zanata	OpenStack Proposal Bot	2023-05-09	2	-163/+48
	For more information about this automatic import see: https://docs.openstack.org/i18n/latest/reviewing-translation-import.html Change-Id: Ice56ac44161d27ede41fdf53024e62e49c572049
* \| \|	Merge "Handle MissingAttributeError when using OOB inspections to fetch MACs"	Zuul	2023-05-08	1	-0/+9
\|\ \ \ \| \|_\|/ \|/\| \|
\| * \|	Handle MissingAttributeError when using OOB inspections to fetch MACs	Jacob Anders	2023-05-02	1	-0/+9
	Currently, if an attempt is made to fetch MAC address information using OOB inspection on a Redfish-managed node and the EthernetInterfaces attribute is missing on the node, inspection fails due to a MissingAttributeError exception being raised by sushy. This change catches and handles this exception. Change-Id: I6f16da05e19c7efc966128fdf79f13546f51b5a6
* \|	Fix DB/Lock session handling issues	Julia Kreger	2023-05-01	1	-0/+6
	Prior to this fix, we have been unable to run the Metal3 CI job with SQLAlchemy's internal autocommit setting enabled. However that setting is deprecated and needs to be removed. Investigating our DB queries and request patterns, we were able to identify some queries which generally resulted in the underlying task and lock being held longer because the output was not actually returned, which is something we've generally had to fix in some places previously. Doing some of these changes did drastically reduce the number of errors encountered with the Metal3 CI job, however it did not eliminate them entirely. On further investigation, we were able to determine that the underlying issue we were encountering was that when we had an external semi-random reader, such as Metal3 polling endpoints, we could reach a situation where we would be blocked from updating the database: to open a write lock, we need the active readers not to be interacting with the database, and with a random reader of sorts, the only realistic option we have is to enable the Write Ahead Log[0]. We didn't have to do this with SQLAlchemy previously because autocommit behavior hid the complexities from us, but in order to move to SQLAlchemy 2.0, we do need to remove autocommit. Additionally, adds two unit tests for get_node_with_token rpc method, which apparently we missed or lost somewhere along the way. Also, adds notes to two Database interactions to suggest we look at them in the future as they may not be the most efficient path forward. [0]: https://www.sqlite.org/wal.html Change-Id: Iebcc15fe202910b942b58fc004d077740ec61912
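	For readers unfamiliar with the Write Ahead Log mentioned above, the following sketch shows how WAL mode is switched on for a SQLite database file using Python's standard sqlite3 module. It is an illustration only: the function name is hypothetical, and how Ironic itself enables WAL through SQLAlchemy is not shown in this log.

```python
import sqlite3


def enable_wal(db_path: str) -> str:
    """Switch a SQLite database file to WAL journaling.

    Returns the journal mode reported by SQLite after the change.
    """
    conn = sqlite3.connect(db_path)
    try:
        # The pragma returns one row holding the journal mode now in effect.
        row = conn.execute("PRAGMA journal_mode=WAL").fetchone()
        return row[0]
    finally:
        conn.close()
```

	In WAL mode readers do not block the writer and the writer does not block readers, which is the property the commit relies on. The setting is persistent for the database file.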
* \|	Merge "Remove all references to the "cpus" property"	Zuul	2023-04-27	1	-0/+6
\|\ \
\| * \|	Remove all references to the "cpus" property	Dmitry Tantsur	2023-03-28	1	-0/+6
	Unused by Nova and unlike memory_mb/local_gb also by Ironic (actually, our usage of local_gb is worth double-checking as well, but at the very least it's referenced by inspection implementations). Change-Id: Ie8b0d9f58f4dcd102c183c30ae7f5acf68a5e4c3
* \| \|	Add ablity to power off nodes in clean failed	Chris Krelle	2023-04-24	1	-0/+8
	We have seen duplicate IP issues when leaving clean failed nodes powered on. This patch allows operators to power down nodes that enter the clean failed state. Change-Id: Iecb402227485fe0ba787a262121c9d6a048b0e13
* \|	Merge "Always fall back from hard linking to copying files"	Zuul	2023-04-10	1	-0/+5
\|\ \
\| * \|	Always fall back from hard linking to copying files	Dmitry Tantsur	2023-03-31	1	-0/+5
	The current check is insufficient: it passes for Kubernetes shared volumes, although hard-linking between them is not possible. This patch changes the approach to trying a hard link and falling back to copyfile instead. The patch relies on optimizations in Python 3.8 and thus should not be backported beyond the Zed series to avoid performance regression. Change-Id: I929944685b3ac61b2f63d2549198a2d8a1c8fe35
* \| \|	Merge "On rpc service stop, wait for node reservation release"	Zuul	2023-04-05	1	-0/+15
\|\ \ \ \| \|/ / \|/\| \|
\| * \|	On rpc service stop, wait for node reservation release	Steve Baker	2023-02-27	1	-0/+15
	Instead of clearing existing reservations at the beginning of del_host, wait for the tasks holding them to go to completion. This check continues indefinitely until the conductor process exits due to one of: - All reservations for this conductor are released - CONF.graceful_shutdown_timeout has elapsed - The process manager (systemd, kubernetes) sends SIGKILL after the configured graceful period Because the default values of [DEFAULT]graceful_shutdown_timeout and [conductor]heartbeat_timeout are the same (60s) no other conductor will claim a node as an orphan until this conductor exits. Change-Id: Ib8db915746228cd87272740825aaaea1fdf953c7
* \| \|	Merge "Enables boot modes switching with Anaconda deploy for ilo driver"	Zuul	2023-03-27	1	-0/+5
\|\ \ \ \| \|_\|/ \|/\| \|
\| * \|	Enables boot modes switching with Anaconda deploy for ilo driver	Nisha Agarwal	2023-03-17	1	-0/+5
	Enables boot modes switching with Anaconda deploy for ilo driver Story: 2010357 Task: 46530 Change-Id: I383cdd5c9d45b074d351ec98b1145fd68e2f3ac3
* \| \|	Merge "Fixes Secureboot with Anaconda deploy"	Zuul	2023-03-20	1	-0/+4
\|\ \ \ \| \|/ /
\| * \|	Fixes Secureboot with Anaconda deploy	Nisha Agarwal	2023-03-16	1	-0/+4
	Fixes Secureboot with Anaconda deploy with PXE and iPXE Story: 2010356 Task: 46529 Change-Id: Id6262654bb5e41e02c7d90b9a9aaf395e7b6a088
* \| \|	Merge "Fix auth_protocol and priv_protocol for SNMP v3"	Zuul	2023-03-17	1	-0/+7
\|\ \ \ \| \|/ / \|/\| \|
\| * \|	Fix auth_protocol and priv_protocol for SNMP v3	Duc Truong	2023-03-01	1	-0/+7
	SNMP driver was using the wrong dictionary key to retrieve auth_protocol and priv_protocol from driver info. As a result, the SNMP client was created with empty strings for both those fields. Any nodes configured to use SNMP v3 with those fields failed because the SNMP driver was unable to perform power related operations due to authentication error. - Use correct keys for snmp auth_protocol and priv_protocol when creating SNMP client - Sanitize snmp auth_key and priv_key in API results Story: 2010613 Task: 47535 Change-Id: I5efd3c9f79a021f1a8e613c3d13b6596a7972672
* \| \|	Merge "Wipe Agent Token when cleaning timeout occcurs"	Zuul	2023-03-14	1	-0/+7
\|\ \ \
\| * \| \|	Wipe Agent Token when cleaning timeout occcurs	Julia Kreger	2023-03-02	1	-0/+7
	In a relatively odd turn of events, should cleaning have started, but then timed out due to lost communications or a hard failure of the machine, an agent token could previously be orphaned preventing re-cleaning. We now explicitly remove the token in this case. Change-Id: I236cdf6ddb040284e9fd1fa10136ad17ef665638
* \| \| \|	Merge "Clean out agent token even if power is already off"	Zuul	2023-03-13	1	-0/+6
\|\ \ \ \
\| * \| \| \|	Clean out agent token even if power is already off	Julia Kreger	2023-03-02	1	-0/+6
	While investigating a very curious report, I discovered that if somehow the power was already turned off to a node, say through an incorrect BMC or human action, and Ironic were to pick it up (as it does by default, because it checks before applying the power state), then it would not wipe the token information, preventing the agent from connecting on the next action/attempt/operation. We now remove the token on all calls to conductor utilities node_power_action method when appropriate, even if no other work is required. Change-Id: Ie89e8be9ad2887467f277772445d4bef79fa5ea1
* \| \| \|	Merge "Do not recalculate checksum if disk_format is not changed"	Zuul	2023-03-13	1	-0/+5
\|\ \ \ \
\| * \| \| \|	Do not recalculate checksum if disk_format is not changed	Dmitry Tantsur	2023-03-07	1	-0/+5
	Even if a glance image is raw, we still recalculate the checksum after "converting" it to raw. This process may take exceptionally long. Change-Id: Id93d518b8d2b8064ff901f1a0452abd825e366c0
* \| \| \| \|	Merge "Update master for stable/2023.1"	Zuul	2023-03-10	2	-0/+7
\|\ \ \ \ \
\| * \| \| \| \|	Update master for stable/2023.1	OpenStack Release Bot	2023-03-09	2	-0/+7
	Add file to the reno documentation build to show release notes for stable/2023.1. Use pbr instruction to increment the minor version number automatically so that master versions are higher than the versions on stable/2023.1. Sem-Ver: feature Change-Id: Iabfa1f9b492c85be185c6993a016a6dd88ed9f9a
* \| \| \| \| \|	Imported Translations from Zanata	OpenStack Proposal Bot	2023-03-10	2	-10/+160
	For more information about this automatic import see: https://docs.openstack.org/i18n/latest/reviewing-translation-import.html Change-Id: I40870112f420690bde769c94740602e29ca34183
* \| \| \| \|	Merge "Fix online upgrades for Bios/Traits"	Zuul	2023-03-07	1	-0/+14
\|\ \ \ \ \
\| * \| \| \| \|	Fix online upgrades for Bios/Traits	Julia Kreger	2023-03-07	1	-0/+14
	... And tags, but nobody uses tags since it is not available via the API. Anyhow, the online upgrade code was written under the assumption that all tables had an "id" column. This is not always true in the ironic data model for tables which started as pure extensions of the Nodes table, and fails in particular when: 1) A database row has data stored in an earlier version of the object 2) That same object gets a version upgrade. In the case which discovered this, BIOSSetting was added at version 1.0, and later updated to include additional fields which incremented the version to 1.1. When the upgrade went to evaluate and iterate through the fields, the command failed because the table was designed around "node_id" instead of "id". Story: 2010632 Task: 47590 Change-Id: I7bec6cfacb9d1558bc514c07386583436759f4df
* \| \| \| \|	Add prelude for OpenStack 2023.1 Ironic release	Jay Faulkner	2023-03-06	1	-0/+14
	We need a prelude. I added one. Change-Id: I48a7ca99439ce2ac3f954ec382971c1a4382ac58
* \| \| \|	Merge "Do not move nodes to CLEAN FAILED with empty last_error"	Zuul	2023-03-02	1	-0/+8
\|\ \ \ \
\| * \| \| \|	Do not move nodes to CLEAN FAILED with empty last_error	Dmitry Tantsur	2023-03-01	1	-0/+8
	When cleaning fails, we power off the node, unless it has been running a clean step already. This happens when aborting cleaning or on a boot failure. This change makes sure that the power action does not wipe the last_error field, resulting in a node with provision_state=CLEANFAIL and last_error=None for several seconds. I've hit this in Metal3. Also when aborting cleaning, make sure last_error is set during the transition to CLEANFAIL, not when the clean up thread starts running. While here, make sure to log the current step in all cases, not only when aborting a non-abortable step. Change-Id: Id21dd7eb44dad149661ebe2d75a9b030aa70526f Story: #2010603 Task: #47476
* \| \| \| \|	Merge "Respond to rpc requests on stop until hash ring reset"	Zuul	2023-02-28	1	-0/+13
\|\ \ \ \ \ \| \|_\|/ / / \|/\| \| \| / \| \| \|_\|/ \| \|/\| \|
\| * \| \|	Respond to rpc requests on stop until hash ring reset	Steve Baker	2023-02-27	1	-0/+13
	Currently when a conductor is stopped, the rpc service stops responding to requests as soon as self.manager.del_host returns. This means that until the hash ring is reset on the whole cluster, requests can be sent to a service which is stopped. This change waits for the remaining seconds to delay stopping until CONF.hash_ring_reset_interval has elapsed. This will improve the reliability of the cluster when scaling down or rolling out updates. This delay only occurs when there is more than one online conductor, to allow fast restarts on single-node ironic installs (bifrost, metal3). Change-Id: I643eb34f9605532c5c12dd2a42f4ea67bf3e0b40
* \| \| \|	Merge "Add configurable delays to the fake drivers"	Zuul	2023-02-27	1	-0/+8
\|\ \ \ \
\| * \| \| \|	Add configurable delays to the fake drivers	Steve Baker	2022-10-13	1	-0/+8
	Simulating workloads with the fake driver currently misses the reality that some operations take time to complete, rather than occurring instantly. This makes it difficult to mock real workloads for performance and functional testing of ironic itself. This change adds configurable random wait times for fake drivers in a new ironic.conf [fake] section. Each supported driver has one configuration option controlling the delay. These delays are applied to operations which typically block in other drivers. The default value of zero continues the existing behaviour of no delay. A single integer value will result in a constant delay in seconds. Two values separated by a comma will result in a triangular distribution weighted by the first value, specifically in python[1]: random.triangular(a, b, a) Change-Id: I7cb1b50d035939e6c4538b3373002a309bfedea4 [1] https://docs.python.org/3/library/random.html#random.triangular
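	The delay semantics described above (zero for no delay, one value for a constant delay, two comma-separated values for a triangular distribution weighted toward the first) can be sketched as below. The function name and the string-parsing details are assumptions; only the random.triangular(a, b, a) call is taken from the commit message.

```python
import random


def sample_delay(value: str) -> float:
    """Return a delay in seconds for a config value such as "0", "5" or "1,10"."""
    parts = [float(p) for p in value.split(",")]
    if len(parts) == 1:
        # A single value is a constant delay; zero means no delay.
        return parts[0]
    if len(parts) == 2:
        a, b = parts
        # The mode equals the lower bound, so samples cluster near a.
        return random.triangular(a, b, a)
    raise ValueError(f"expected one or two values, got {len(parts)}")
```

	With a value of "1,10" every sample falls between 1 and 10 seconds, with short delays being the most likely.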
* \| \| \| \|	Merge "Get conductor metric data"	Zuul	2023-02-27	1	-0/+39
\|\ \ \ \ \ \| \|_\|/ / / \|/\| \| \| \|
\| * \| \| \|	Get conductor metric data	Julia Kreger	2023-02-23	1	-0/+39
	This change adds the capability for the ironic-conductor and standalone service process to transmit timer and counter metrics to the message bus notifier which may be consumed by a ceilometer, ironic-prometheus-exporter, or other consumer of metrics event data on to the message bus. This functionality is not presently supported on dedicated API services such as those running as an ``ironic-api`` application process, or Ironic WSGI application. This is due to the lack of an internal trigger mechanism to transmit the data in a metrics update to the message bus and/or notifier plugin. This change requires ironic-lib 5.4.0 to collect and ship metrics via the message bus. Depends-On: https://review.opendev.org/c/openstack/ironic-lib/+/865311 Change-Id: If6941f970241a22d96e06d88365f76edc4683364
* \| \| \|	Merge "Set lockutils default logging"	Zuul	2023-02-23	1	-0/+8
\|\ \ \ \
\| * \| \| \|	Set lockutils default logging	Julia Kreger	2023-02-20	1	-0/+8
	While developing some internal metrics collection capability, and realizing that a lock was needed, we realized that the lock activity itself would be a bit noisy. Image actions also get lock logging, and it is just really noisy, but not super helpful for troubleshooting. So, set it to WARNING instead. Discussion wise, see: https://review.opendev.org/c/openstack/ironic-lib/+/865311 Change-Id: I3ab14ee5b5cc063784d26e3c760f1422c692060d
* \| \| \| \|	Merge "Relaxing console pid looking"	Zuul	2023-02-23	1	-0/+6
\|\ \ \ \ \
\| * \| \| \| \|	Relaxing console pid looking	Kaifeng Wang	2023-02-15	1	-0/+6
	Recently we hit an issue where the pid file is missing. The current logic simply removes the pid file if the corresponding process is not found, but if the pid file is lost then the console could never be stopped and, furthermore, never be restarted, regardless of whether the process is there or not. This patch adds FileNotFound to the exception handling to allow console recovery. Change-Id: I1a0b8347e960c6cff8aca10a22c67b710f7d617e
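	As a rough illustration of tolerating a missing pid file, the sketch below treats an absent file as "no known process" instead of an error. The function name and the None return convention are hypothetical and are not taken from Ironic's console utilities.

```python
from typing import Optional


def read_console_pid(pid_file: str) -> Optional[int]:
    """Read a console pid from pid_file, or return None if the file is gone."""
    try:
        with open(pid_file) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        # A lost pid file should not prevent the console from being
        # stopped or restarted, so report "no pid" instead of failing.
        return None
```

	A caller can then proceed with stopping or restarting the console when None is returned.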
* \| \| \| \| \|	Merge "fix inspectwait logic"	Zuul	2023-02-23	1	-0/+5
\|\ \ \ \ \ \ \| \|_\|_\|_\|_\|/ \|/\| \| \| \| \|
\| * \| \| \| \|	fix inspectwait logic	Julia Kreger	2023-02-15	1	-0/+5
	The tl;dr is that we changed ``inspecting`` to include an ``inspect wait`` state. Unfortunately we never spotted the logic inside of the db API. We never spotted it because our testing in inspection code uses a mocked task manager... and we really don't have intense db testing because we expect the objects and higher level interactions to validate the lowest db level. Unfortunately, because of the out of band inspection workflow, we have to cover both cases in terms of what the starting state and ending state could be, but we've added tests to validate this is handled as we expect. Change-Id: Icccbc6d65531e460c55555e021bf81d362f5fc8b
* \| \| \| \|	Merge "Add release note for node sharding"	Zuul	2023-02-20	1	-0/+14
\|\ \ \ \ \
\| * \| \| \| \|	Add release note for node sharding	Jay Faulkner	2023-02-17	1	-0/+14
	Release note covers changes in the previous 4 commits in this chain. Change-Id: I5388e82e958acd930295215c9f9427080650866d
* \| \| \| \|	Merge "Make metrics names a little more consistent"	Zuul	2023-02-20	1	-0/+9
\|\ \ \ \ \