path: root/ironic/conductor
Commit message, Author, Age, Files, Lines
* Respond to rpc requests on stop until hash ring resetSteve Baker2023-02-271-0/+4
| | | | | | | | | | | | | | | | | Currently when a conductor is stopped, the rpc service stops responding to requests as soon as self.manager.del_host returns. This means that until the hash ring is reset on the whole cluster, requests can be sent to a service which is stopped. This change waits for the remaining seconds to delay stopping until CONF.hash_ring_reset_interval has elapsed. This will improve the reliability of the cluster when scaling down or rolling out updates. This delay only occurs when there is more than one online conductor, to allow fast restarts on single-node ironic installs (bifrost, metal3). Change-Id: I643eb34f9605532c5c12dd2a42f4ea67bf3e0b40
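The delay described above can be sketched as a small calculation. This is a minimal illustration, not the actual Ironic code: the function name `seconds_to_wait` and its parameters are hypothetical, and only the idea (wait out the remainder of the hash ring reset interval, and skip the wait when no other conductor is online) comes from the commit message.

```python
import time


def seconds_to_wait(stop_started_at, reset_interval,
                    multiple_conductors_online, now=None):
    """Return how long to keep answering RPC requests before stopping.

    Hypothetical helper: all names here are assumptions for illustration.
    """
    # Single-conductor installs restart quickly, so no delay is applied.
    if not multiple_conductors_online:
        return 0.0
    now = time.monotonic() if now is None else now
    elapsed = now - stop_started_at
    # Wait only for whatever is left of the reset interval.
    return max(0.0, reset_interval - elapsed)
```

With a 15 second interval and 4 seconds already spent deregistering, the sketch would wait the remaining 11 seconds.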
* Merge "Fixes console port conflict occurs in certain path"Zuul2023-02-201-4/+0
|\
| * Fixes console port conflict occurs in certain pathKaifeng Wang2023-02-151-4/+0
| | The dynamically allocated console port for a node is saved into the database and reused on subsequent console operations. In certain code paths the port record can't be trusted and we should do a re-allocation. This patch fixes the issue by ignoring the previous allocation record. The extra cleanup in the takeover is no longer required and is removed as well. Change-Id: I1a07ea9b30a2c760af7a6a4e39f3ff227df28fff Story: 2010489 Task: 47061
* | Merge "Indicate maintenance mode"Zuul2023-02-171-3/+3
|\ \
| * | Indicate maintenance modeJakub Jelinek2023-02-161-3/+3
| | | Follow-up to I74b19f7a42c1326d7ec04e6320176e81639ebfb4. Mention the need for maintenance mode to orphan Swift objects during node cleanup. Story: 2010275 Task: 46204 Change-Id: Ie95a5bd333b0dab3e97254dfb4eb532bdbfd2650
* | | Merge "Fix debug log message argument formatting"Zuul2023-02-151-3/+3
|\ \ \ | |/ / |/| |
| * | Fix debug log message argument formattingJonathan Rosser2023-02-011-3/+3
| |/ | | The format string is expecting a dictionary with keys matching those used in the format string. Any unused parameters will cause a "not all arguments converted during string formatting" exception. The quote style is also changed from double to single quotes to match the other logging statements in the code. Change-Id: Ic9dea4f51d82866be8ac16242a79237c789b9745
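As a generic illustration of the formatting behaviour involved (the actual log statement is not shown on this page, so the message text and keys below are made up), named placeholders take a dictionary, while surplus positional arguments trigger the exception quoted above:

```python
# Named placeholders are filled from a dictionary with matching keys.
params = {'node': 'node-1', 'state': 'active'}
msg = 'Node %(node)s is in state %(state)s' % params


def format_with_surplus_argument():
    """Return the error text produced by one placeholder and two arguments."""
    try:
        return 'Node %s' % ('node-1', 'active')
    except TypeError as exc:
        return str(exc)
```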
* | Erase swift inventory entry on node deletionJakub Jelinek2023-02-141-0/+21
|/ | | | | | | | | | Follow-up to Ie174904420691be64ce6ca10bca3231f45a5bc58 which enables storage of inventory in Swift, but does not delete the Swift entry when the node whose inventory is stored is deleted Story: 2010275 Task: 46204 Change-Id: I74b19f7a42c1326d7ec04e6320176e81639ebfb4
* Catch any exception for CleaningJulia Kreger2022-12-121-2/+3
| | | | | | | | | | | | | | | | No exception is used to communicate back, the exceptions are used to catch failures, and if we don't catch other possible exceptions leaving cleaning states, we may not clean up state properly. So instead of specific exceptions, we just catch any exception like is used earlier in the same method. Inspired by https://review.opendev.org/c/openstack/ironic/+/866856 and investigation through the code base as a result of inability to clean the node. Change-Id: I2a6bca3550819b98adbaffe315f77427b8a43d62
* Phase 1 - SQLAlchemy 2.0 CompatabilityJulia Kreger2022-10-132-10/+14
| One of the major changes in SQLAlchemy 2.0 is the removal of autocommit support. It turns out Ironic was using this quite aggressively without even really being aware of it.
|
| * Moved the declarative_base to ORM, as noted in the SQLAlchemy 2.0 changes[0].
| * Console testing caused us to become aware of issues around locking where session synchronization, when autocommit was enabled, was defaulted to False. The result of this is that two sessions could have different results, which could occur on different threads, and one could still attempt to lock based upon prior information. Inherently, while this basically worked, it was also sort of broken behavior. This resulted in locking being rewritten to use the style mandated in the SQLAlchemy 2.0 migration documentation. This ultimately is due to locking, which is *heavily* relied upon in Ironic, and in unit testing with sqlite, there are no transactions, which means we can get some data inconsistency in unit testing as well if we're reliant upon the database to precisely and exactly return what we committed.[1]
| * Begins changing the query.one()/query.all() style to use explicit select statements as part of the new style mandated for migration to SQLAlchemy 2.0.
| * Instead of using field label strings for joined queries, use the object format, which makes much more sense now, and is part of the items required for eventual migration to 2.0.
| * DB queries involving Traits are now loaded using SelectInLoad as opposed to Joins. The now deprecated ORM queries were quietly de-duplicating rows and providing consistent sets from the resulting joined table responses, while putting a much higher CPU load on the processing of results on the client. Prior performance testing has informed us this should be a minimal overhead impact, however these queries should no longer be in transactions with the Database Servers, which should offset the shift in load pattern. The reason we cannot continue to deduplicate locally in our code is that we carry Dict data sets which cannot be hashed for deduplication. Most projects have handled this by treating them as Text and then converting, but without a massive rewrite, this seems to be the viable middle ground.
| * Adds an explicit mapping for traits and tags on the Node object to point directly to the NodeTrait and NodeTag classes. This supersedes the prior usage of a backref to make the association.
| * Splits SQLAlchemy class model Node into Node and NodeBase, which allows high performance queries to skip querying for ``tags`` and ``traits``. Otherwise the aforementioned lookups would always execute, as they are now properties on the Node class as well. This is more common for a SQLAlchemy model, but Ironic's model has been a bit more rigid to date.
| * Adds a ``start_consoles`` and ``start_allocations`` option to the conductor ``init_host`` method. This allows unit tests to be executed and launched with the service context, while *not* also creating race conditions which resulted in failed tests.
| * The db API ``_paginate_query`` wrapper now contains additional logic to handle traditional ORM query responses and the newer style of unified query responses. This is due to differences in queries and handling, which was also part of the driver for the creation of ``NodeBase``, as SQLAlchemy will only create an object if a base object is referenced. Also, by default, everything returned is a tuple in 1.4 with the unified interface.
| * Also modified one unit test which counted time.sleep calls, which is a known pattern which can create failures which are ultimately noise.
|
| Ultimately, I have labelled the remaining places where SQLAlchemy warnings are raised for deprecation/removal of functionality, which need to be addressed.
|
| [0] https://docs.sqlalchemy.org/en/14/changelog/migration_20.html
| [1] https://docs.sqlalchemy.org/en/14/dialects/sqlite.html#transaction-isolation-level-autocommit
|
| Change-Id: Ie0f4b8a814eaef1e852088d12d33ce1eab408e23
* Merge "Concurrent Distructive/Intensive ops limits"Zuul2022-09-211-3/+49
|\
| * Concurrent Distructive/Intensive ops limitsJulia Kreger2022-09-201-3/+49
| | Provide the ability to limit resource intensive or potentially wide scale operations which could be a symptom of a highly destructive and unplanned operation in progress. The idea behind this change is to help guard the overall deployment to prevent an overall resource exhaustion situation, or prevent an attacker with valid credentials from putting an entire deployment into a potentially disastrous cleaning situation, since ironic otherwise only limits concurrency based upon running tasks per conductor. Story: 2010007 Task: 45140 Change-Id: I642452cd480e7674ff720b65ca32bce59a4a834a
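A guard of this kind can be sketched as a simple count check before an operation starts. This is only an illustration under assumed names: the exception class, the function `check_concurrency_limit` and its parameters are hypothetical here, and how the real change counts nodes or where it reads its limits is not shown on this page.

```python
class ConcurrentActionLimit(Exception):
    """Illustrative error for 'too many nodes already in this operation'."""


def check_concurrency_limit(action, nodes_in_progress, limit):
    """Raise if starting one more `action` would exceed `limit`.

    A limit of None is treated as "no limit" in this sketch.
    """
    if limit is not None and nodes_in_progress >= limit:
        raise ConcurrentActionLimit(
            'Refusing to start %(action)s: %(count)d nodes already in '
            'progress, limit is %(limit)d'
            % {'action': action, 'count': nodes_in_progress, 'limit': limit})
```

Checking before the state transition, rather than after, keeps a burst of requests from ever pushing the deployment past the configured ceiling.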
* | Fix nodes stuck at cleaning on Network Service issuesKaifeng Wang2022-09-201-1/+1
|/ Ironic validates the network interface before the cleaning process. Currently an invalid parameter is caught, but other errors are not. There is a chance that a node could get stuck in the cleaning state on networking issues or a temporary outage of the neutron service. This patch adds NetworkError to the exception handling to cover such cases. Change-Id: If20de2ad4ae4177dea10b7ebfc9a91ca6fbabdb9
* Modify do_node_verify to avoid state machine stuckVanou Ishii2022-08-031-1/+1
| The do_node_verify function runs the vendor-driver-defined verify_step. However, when the vendor verify_step fails, the state machine gets stuck at verifying. This is because do_node_verify tries to retrieve the name of the verify_step through node.verify_step, but the node doesn't have a verify_step attribute and there is no way to handle the exception. This commit fixes this issue. Change-Id: Ie2ec6e08214661f7dc61c92de646e2f4d5bb5469 Story: 2010209 Task: 45942
* Merge "Allocation candidates prefer matching name"Zuul2022-07-151-0/+9
|\
| * Allocation candidates prefer matching nameSteve Baker2022-06-161-0/+9
| | This change finds a node with the same name as the allocation and moves it to the beginning of the shuffled candidate list so that node is the first allocation attempt. It is common for the node naming scheme to match the node's role (such as compute-1, compute-2). Also this often matches the hostname (allocation name) scheme. Without this change, this scenario will generally result in swapped names (node compute-1 having hostname compute-2, etc). By preferring matching names this situation can be avoided in the majority of cases, while not otherwise affecting the candidate allocation approach. Change-Id: Ie990bfc209959d58852b9080778602eab5aa30af
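The reordering described above can be sketched as follows. This is a minimal illustration that assumes candidates are plain dictionaries with a `name` key; the function name `order_candidates` and the `rng` parameter are hypothetical and not taken from the Ironic code.

```python
import random


def order_candidates(nodes, allocation_name, rng=random):
    """Shuffle candidates, then move a name match (if any) to the front."""
    candidates = list(nodes)
    rng.shuffle(candidates)
    if allocation_name:
        for index, node in enumerate(candidates):
            if node.get('name') == allocation_name:
                # Try the matching node first; the rest stay shuffled.
                candidates.insert(0, candidates.pop(index))
                break
    return candidates
```

When no node carries the allocation's name, the list is just the shuffled candidates, so the existing behaviour is unchanged in that case.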
* | Merge "Trivial: log which state the node is in"Zuul2022-07-141-2/+4
|\ \
| * | Trivial: log which state the node is inDmitry Tantsur2022-07-061-2/+4
| | | | | | | | | | | | Change-Id: I87326585896cae9df717e9d19b2ea441d92d3b1a
* | | Merge "Make anaconda non-image deploys sane"Zuul2022-07-142-1/+16
|\ \ \
| * | | Make anaconda non-image deploys saneJulia Kreger2022-07-112-1/+16
| |/ / Ironic has a lot of logic built up around use of images for filesystems, however several recent additions, such as the ``ramdisk`` and ``anaconda`` deployment interfaces, have started to break this mold. In working with some operators attempting to utilize the anaconda deployment interface outside the context of full OpenStack, we discovered some issues which needed to be made simpler to help remove the need to route around data validation checks for things that are not required. Standalone users also have the ability to point to a URL with anaconda, whereas operators using OpenStack can only do so with customized kickstart files. While this is okay, the disparity in configuration checking was also creating additional issues. In this, we discovered we were not really graceful with redirects, so we're now a little more graceful with them. Story: 2009939 Story: 2009940 Task: 44834 Task: 44833 Change-Id: I8b0a50751014c6093faa26094d9f99e173dcdd38
* | | Move logging out of skip_automated_cleaningDmitry Tantsur2022-07-062-6/+5
|/ / Simple boolean functions should not have logging as a side effect. This one is also used in deploy_utils without logging. Change-Id: Iaa398f09cec06a8417c595acac19b0b9f3f3a871
* | Merge "Auto-populate lessee for deployments"Zuul2022-07-024-1/+48
|\ \
| * | Auto-populate lessee for deploymentsJulia Kreger2022-05-234-1/+48
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | Adds a configuration option and capability to automatically record the lessee for a deployment based upon the original auth_token information provided in the request context. Additional token information is now shared through the context which is extended in the same fashion as most other projects saving request token information to their RequestContext, instead of triggering excess API calls in the background to Keystone to try and figure out requestor's information. Change-Id: I42a2ceb9d2e7dfdc575eb37ed773a1bc682cec23
* | No deploy_kernel/ramdisk with the ramdisk deploy and no cleaningDmitry Tantsur2022-06-231-4/+4
|/ | | | | | | | | Ramdisk deploys don't use IPA, no need to provide it. Cleaning may need the agent, so only skip verification if cleaning is disabled. Other boot interfaces may need fixing as well, I haven't checked them. Change-Id: Ia2739311f065e19ba539fe3df7268075d6075787
* Exclude current conductor from offline_conductorsHarald Jensås2022-04-282-2/+25
| In some cases the current conductor may have failed to update the heartbeat timestamp due to failure or resource starvation. When this occurs the dbapi get_offline_conductors method will include the current conductor in its return value. In this scenario the conductor may end up forcefully removing node reservations or allocations from itself, triggering a takeover which fails ongoing operations. This change adds a wrapper to exclude the current conductor. The wrapper will log a warning to raise the issue. Related-Bug: #1970484 Story: 2010016 Task: 45204 Change-Id: I6a8f38934b475f792433be6f0882540b82ca26c1
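The wrapper idea can be sketched like this. It is a minimal illustration that assumes the offline conductors arrive as a list of hostnames; the function name `offline_conductors_excluding_self` is hypothetical, and only the behaviour (filter out the current conductor and log a warning) is taken from the commit message.

```python
import logging

LOG = logging.getLogger(__name__)


def offline_conductors_excluding_self(offline_hosts, current_host):
    """Return offline conductor hostnames without the current conductor."""
    if current_host in offline_hosts:
        # Being listed as offline while running suggests missed heartbeats.
        LOG.warning('Current conductor %s is reported as offline; '
                    'excluding it from the offline conductor list',
                    current_host)
    return [host for host in offline_hosts if host != current_host]
```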
* Merge "Shorten error messages in commonly used modules"Zuul2022-03-114-26/+22
|\
| * Shorten error messages in commonly used modulesDmitry Tantsur2022-02-174-26/+22
| | | | | | | | | | | | | | | | | | | | | | * Do not mention "deploy driver", it's not a thing. * Be careful with the pattern "Error: %s" or "Reason: %s". It is good for long introductory sentences, but looks poor for shorter ones and becomes really problematic when several instances are concatenated. This change updates deploy_utils, agent code and conductor modules. Change-Id: Ie1efea02b5f1a174e9ef8c5253ce9754a60b4c56
* | Merge "Ignore fake nodes in the power sync loop"Zuul2022-03-091-2/+4
|\ \ | |/ |/|
| * Ignore fake nodes in the power sync loopDmitry Tantsur2021-12-091-2/+4
| | | | | | | | | | | | | | | | Fake nodes have power_state None and the power interface returns None as well. Currently we make an update every loop, even though the values are the same. This change fixes it. Change-Id: I5227c058f3f6e1583b54a1cbc7edc6d42e20ae53
* | Explicit parameter to distinguish partition/whole-disk imagesDmitry Tantsur2022-01-283-14/+26
| | | | | | | | | | | | | | | | | | | | Using kernel/ramdisk makes no sense with local boot, we need a better way. We already have an internal image_type instance parameter, let's make it public. Glance support will be added in the next patch. Change-Id: I4ce5f7a2317d952f976194d2022328f4afbb0258
* | Automatically configure enabled_***_interfacesDmitry Tantsur2021-12-201-24/+0
| | | | | | | | | | | | | | | | | | This change makes it easier to configure power and management interfaces (and thus vendor drivers) by figuring out reasonable defaults. Story: #2009316 Task: #43717 Change-Id: I8779603e566be5a84daf6f680c0bbe2f191923d9
* | Merge "Allow enabling fast-track per node"Zuul2021-12-151-1/+2
|\ \
| * | Allow enabling fast-track per nodeDmitry Tantsur2021-12-081-1/+2
| | | | | | | | | | | | | | | | | | | | | This is useful when some nodes need the "agent" power interface, while the others can be deployed normally. Change-Id: Ief7df40c83ef03d0ec5ae92d09ceffd39d3c12a3
* | | Adoption: do not validate boot interface when local bootingDmitry Tantsur2021-12-131-5/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We validate the boot interface during adoption because of: a) potential rebuilding, b) non-local boot. Rebuild proved a rarely used feature, and local boot is the default nowadays, so it makes less sense to unconditionally validate the boot interface during adoption. We will run the validation anyway the next time we need to do something with booting. Similarly, do not record is_whole_disk_image None if it cannot be reliably determined. Change-Id: I95252aea808c48ea2d94569449c871f0d483caaa
* | | Move place_loaders_for_boot to boot driver __init__Steve Baker2021-12-101-8/+3
| |/ |/| | | | | | | | | | | | | | | | | | | | | Host preparation file writing is already done by the __init__ method of iPXEBoot. This change moves place_loaders_for_boot calls to iPXEBoot and PXEBoot to be consistent, and to only write the files when that driver is enabled. This will mean multiple writes of the same file when subclasses of these drivers are also enabled, but this overhead will be negligible. Change-Id: I7e17f4d1a54cd6c5d1a4bf006a0d42db8d123a46
* | Merge "Add "none" RPC transport that disables the RPC bus"Zuul2021-12-081-10/+23
|\ \
| * | Add "none" RPC transport that disables the RPC busDmitry Tantsur2021-12-071-10/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When using the new combined executable in a single-conductor scenario, it may make sense to completely disable the remote RPC. The new ``rpc_transport`` value ``none`` achieves that. Change-Id: I6a83358c65b3ed213c8a991d42660ca51fc3a8ec Story: #2009676 Task: #44104
* | | Merge "All-in-one Ironic service with a local RPC bus"Zuul2021-12-082-57/+122
|\ \ \ | |/ /
| * | All-in-one Ironic service with a local RPC busDmitry Tantsur2021-12-072-57/+122
| |/ | | | | | | | | | | | | | | | | | | | | This adds a new executable /usr/bin/ironic (cool that we no longer have a CLI with this name) that starts API and conductor together in the same process. When an RPC host name matches the current one, the call is not routed through the remote RPC, a local function call is done instead. Story: #2009676 Task: #43953 Change-Id: I51bf7226aea145dc7c8fd93d61caa233ca16c9c9
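The routing rule described above (call locally when the target host is the current one, otherwise go through remote RPC) can be sketched as below. The class `LocalOrRemoteClient` and its constructor arguments are hypothetical names for illustration and do not reflect how Ironic's RPC API is actually structured.

```python
class LocalOrRemoteClient:
    """Dispatch to a local manager when the target host is this host."""

    def __init__(self, local_host, local_manager, remote_call):
        self.local_host = local_host
        self.local_manager = local_manager
        # remote_call(host, method, **kwargs) stands in for the RPC bus.
        self.remote_call = remote_call

    def call(self, host, method, **kwargs):
        if host == self.local_host:
            # Same process: a plain function call, no RPC round trip.
            return getattr(self.local_manager, method)(**kwargs)
        return self.remote_call(host, method, **kwargs)
```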
* | Merge "Refactor driver_internal_info updates to methods"Zuul2021-12-075-128/+81
|\ \
| * | Refactor driver_internal_info updates to methodsSteve Baker2021-12-035-128/+81
| |/ Making updates to driver_internal_info can result in hard-to-read code due to the requirement to assign the whole driver_internal_info back to the node to trigger the expected update operation. This change replaces driver_internal_info update operations with new methods: - set_driver_internal_info - del_driver_internal_info - timestamp_driver_internal_info This change defines the functions and moves core conductor logic to use them. Subsequent changes in this series will move drivers to use the new functions. Change-Id: Ib8917c3c674e77cd3aba6a1e73c65162e3ee1141
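The pattern can be illustrated with a toy class. The three method names come from the commit message, but their signatures, the `Node` class below and its `dirty` flag are assumptions made for this sketch, not the real Ironic object model.

```python
import datetime


class Node:
    """Toy node whose field only registers a change on assignment."""

    def __init__(self):
        self._info = {}
        self.dirty = False  # stands in for change tracking

    @property
    def driver_internal_info(self):
        return self._info

    @driver_internal_info.setter
    def driver_internal_info(self, value):
        self._info = value
        self.dirty = True

    def set_driver_internal_info(self, key, value):
        info = dict(self._info)
        info[key] = value
        # Assign the whole dict back so the change is registered.
        self.driver_internal_info = info

    def del_driver_internal_info(self, key):
        info = dict(self._info)
        info.pop(key, None)
        self.driver_internal_info = info

    def timestamp_driver_internal_info(self, key):
        now = datetime.datetime.now(datetime.timezone.utc)
        self.set_driver_internal_info(key, now.isoformat())
```

Callers then write `node.set_driver_internal_info('key', value)` instead of copying the dictionary, mutating it, and assigning it back by hand.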
* | Merge "Avoid RPC notify_conductor_resume_{deploy,clean} in agent_base"Zuul2021-12-072-5/+9
|\ \
| * | Avoid RPC notify_conductor_resume_{deploy,clean} in agent_baseDmitry Tantsur2021-12-062-5/+9
| |/ Currently we use an RPC call to the conductor itself to proceed to the next clean or deploy step. This is unnecessary and requires temporarily lifting the lock, potentially causing race conditions. This change makes the agent code use continue_node_{deploy,clean} directly. The drivers still need updating; this will be done later. Story: #2008167 Task: #40922 Change-Id: If4763d542029b9021432425532f24a0228f04c25
* | Trivial: log current state when continuing cleaningDmitry Tantsur2021-12-061-2/+2
|/ | | | Change-Id: I02a8ed6802fffee071e94be3c0cab2382b7e60ca
* Merge "[Trivial] Clarify conditions under which power recovery is attempted"Zuul2021-11-151-3/+4
|\
| * [Trivial] Clarify conditions under which power recovery is attemptedArne Wiebalck2021-11-041-3/+4
| | | | | | | | | | | | | | | | Be more precise when describing the conditions for automatic recovery from power failures ('maintenance type' is a term we use nowhere else). Change-Id: Iaf14c0fc73f8c97b9d8669485011966a650c21a8
* | Avoid handling a deploy failure twiceDmitry Tantsur2021-11-041-18/+25
|/ | | | | | | | | In some cases we handle the same exception twice in a row: in agent_base and in deployments.do_next_deploy_step. This change avoids it. Also make deploy step error messages more uniform across the board. Change-Id: Ic84c04118b1a85b10a761fc58796827583a5b086
* Merge "node_periodics: encapsulate the interface class check"Zuul2021-10-141-0/+13
|\
| * node_periodics: encapsulate the interface class checkDmitry Tantsur2021-10-121-0/+13
| | | | | | | | Change-Id: I887d4fe4836bc58b5605e950a4287f0d27a590cb
* | Merge "Add a helper for node-based periodics"Zuul2021-10-142-96/+212
|\ \ | |/