path: root/ironic/conductor/base_manager.py
Commit message | Author | Age | Files | Lines
* On rpc service stop, wait for node reservation release | Steve Baker | 2023-02-27 | 1 | -3/+14
| Instead of clearing existing reservations at the beginning of del_host, wait for the tasks holding them to go to completion. This check continues indefinitely until the conductor process exits due to one of:
| - All reservations for this conductor are released
| - CONF.graceful_shutdown_timeout has elapsed
| - The process manager (systemd, kubernetes) sends SIGKILL after the configured graceful period
|
| Because the default values of [DEFAULT]graceful_shutdown_timeout and [conductor]heartbeat_timeout are the same (60s) no other conductor will claim a node as an orphan until this conductor exits.
|
| Change-Id: Ib8db915746228cd87272740825aaaea1fdf953c7
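The waiting behaviour described in this commit can be sketched as a polling loop. This is a simplified illustration, not the code in base_manager.py: the function name `wait_for_reservations` and its parameters are hypothetical, and the sketch enforces the timeout itself, whereas the commit message describes a check that runs until the process exits.

```python
import time


def wait_for_reservations(count_reserved, timeout, interval=1.0,
                          clock=time.monotonic, sleep=time.sleep):
    """Poll until no reservations remain or the timeout elapses.

    count_reserved is a hypothetical callable returning how many nodes
    this conductor still holds. Returns True when everything was
    released, False when the timeout was reached first.
    """
    deadline = clock() + timeout
    while count_reserved() > 0:
        if clock() >= deadline:
            return False
        sleep(interval)
    return True
```

Passing `clock` and `sleep` as parameters keeps the loop testable without real delays.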
* Respond to rpc requests on stop until hash ring reset | Steve Baker | 2023-02-27 | 1 | -0/+4
| Currently when a conductor is stopped, the rpc service stops responding to requests as soon as self.manager.del_host returns. This means that until the hash ring is reset on the whole cluster, requests can be sent to a service which is stopped.
|
| This change waits for the remaining seconds to delay stopping until CONF.hash_ring_reset_interval has elapsed. This will improve the reliability of the cluster when scaling down or rolling out updates.
|
| This delay only occurs when there is more than one online conductor, to allow fast restarts on single-node ironic installs (bifrost, metal3).
|
| Change-Id: I643eb34f9605532c5c12dd2a42f4ea67bf3e0b40
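The "remaining seconds" calculation can be sketched as follows. The function name and arguments are hypothetical, and the sketch assumes the conductor count includes the one being stopped; the actual change may count conductors differently.

```python
def seconds_to_delay_stop(stop_started, now, hash_ring_reset_interval,
                          online_conductors):
    """Return how long to keep answering RPC requests before stopping.

    All arguments are illustrative: timestamps are in seconds and
    online_conductors is assumed to include this conductor.
    """
    if online_conductors <= 1:
        # Single-conductor install: nothing else needs to notice the
        # change, so restart immediately.
        return 0.0
    elapsed = now - stop_started
    return max(0.0, hash_ring_reset_interval - elapsed)
```

With a 15 second interval and 10 seconds already spent in shutdown, the service would stay responsive for 5 more seconds.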
* Phase 1 - SQLAlchemy 2.0 Compatibility | Julia Kreger | 2022-10-13 | 1 | -5/+12
| One of the major changes in SQLAlchemy 2.0 is the removal of autocommit support. It turns out Ironic was using this quite aggressively without even really being aware of it.
|
| * Moved the declarative_base to ORM, as noted in the SQLAlchemy 2.0 changes[0].
| * Console testing caused us to become aware of issues around locking where session synchronization, when autocommit was enabled, was defaulted to False. The result is that two sessions, possibly on different threads, could see different results, and one could still attempt to lock based upon prior information. Inherently, while this basically worked, it was also sort of broken behavior. This resulted in locking being rewritten to use the style mandated in SQLAlchemy 2.0 migration documentation. This ultimately is due to locking, which is *heavily* relied upon in Ironic, and in unit testing with sqlite, there are no transactions, which means we can get some data inconsistency in unit testing as well if we're reliant upon the database to precisely and exactly return what we committed.[1]
| * Begins changing the query.one()/query.all() style to use explicit select statements as part of the new style mandated for migration to SQLAlchemy 2.0.
| * Instead of using field label strings for joined queries, use the object format, which makes much more sense now, and is part of the items required for eventual migration to 2.0.
| * DB queries involving Traits are now loaded using SelectInLoad as opposed to Joins. The now deprecated ORM queries were quietly and silently de-duplicating rows and providing consistent sets from the resulting joined table responses, however putting much higher CPU load on the processing of results on the client. Prior performance testing has informed us this should be a minimal overhead impact, however these queries should no longer be in transactions with the Database Servers which should offset the shift in load pattern. The reason we cannot continue to deduplicate locally in our code is because we carry Dict data sets which cannot be hashed for deduplication. Most projects have handled this by treating them as Text and then converting, but without a massive rewrite, this seems to be the viable middle ground.
| * Adds an explicit mapping for traits and tags on the Node object to point directly to the NodeTrait and NodeTag classes. This supersedes the prior usage of a backref to make the association.
| * Splits SQLAlchemy class model Node into Node and NodeBase, which allows for high performance queries to skip querying for ``tags`` and ``traits``. Otherwise the aforementioned lookups would always execute, as they are now properties as well on the Node class. This is a more common SQLAlchemy model pattern, but Ironic's model has been a bit more rigid to date.
| * Adds a ``start_consoles`` and ``start_allocations`` option to the conductor ``init_host`` method. This allows unit tests to be executed and launched with the service context, while *not* also creating race conditions which resulted in failed tests.
| * The db API ``_paginate_query`` wrapper now contains additional logic to handle traditional ORM query responses and the newer style of unified query responses, due to differences in queries and handling. This was also part of the driver for the creation of ``NodeBase``, as SQLAlchemy will only create an object if a base object is referenced. Also, by default, everything returned is a tuple in 1.4 with the unified interface.
| * Also modified one unit test which counted time.sleep calls, which is a known pattern which can create failures which are ultimately noise.
|
| Ultimately, I have labelled the remaining places where SQLAlchemy warnings are raised for deprecation/removal of functionality, which need to be addressed.
|
| [0] https://docs.sqlalchemy.org/en/14/changelog/migration_20.html
| [1] https://docs.sqlalchemy.org/en/14/dialects/sqlite.html#transaction-isolation-level-autocommit
|
| Change-Id: Ie0f4b8a814eaef1e852088d12d33ce1eab408e23
* Automatically configure enabled_***_interfaces | Dmitry Tantsur | 2021-12-20 | 1 | -24/+0
| This change makes it easier to configure power and management interfaces (and thus vendor drivers) by figuring out reasonable defaults.
|
| Story: #2009316
| Task: #43717
| Change-Id: I8779603e566be5a84daf6f680c0bbe2f191923d9
* Move place_loaders_for_boot to boot driver __init__ | Steve Baker | 2021-12-10 | 1 | -8/+3
| Host preparation file writing is already done by the __init__ method of iPXEBoot. This change moves place_loaders_for_boot calls to iPXEBoot and PXEBoot to be consistent, and to only write the files when that driver is enabled.
|
| This will mean multiple writes of the same file when subclasses of these drivers are also enabled, but this overhead will be negligible.
|
| Change-Id: I7e17f4d1a54cd6c5d1a4bf006a0d42db8d123a46
* Facilitate asset copy for bootloader ops | Julia Kreger | 2021-09-15 | 1 | -3/+8
| Adds capability to copy bootloader assets from the system OS into the network boot folders on conductor startup.
|
| Change-Id: Ica8f9472d0a2409cf78832166c57f2bb96677833
* Record node history and manage events in db | Julia Kreger | 2021-09-10 | 1 | -2/+8
| * Adds periodic task to purge node_history entries based upon provided configuration.
| * Adds recording of node history entries for errors in the core conductor code.
| * Also changes the rescue abort behavior to remove the notice from being recorded as an error, as this is a likely bug in behavior for any process or service evaluating the node last_error field.
| * Makes use of a semi-free form event_type field to help provide some additional context into what is going on and why. For example if deployments are repeatedly failing, then perhaps it is a configuration issue, as opposed to a general failure. If a conductor has no resources, then the failure, in theory would point back to the conductor itself.
|
| Story: 2002980
| Task: 42960
| Change-Id: Ibfa8ac4878cacd98a43dd4424f6d53021ad91166
* Register all hardware_interfaces together | Derek Higgins | 2021-01-08 | 1 | -4/+11
| Prevent each driver coming online one at a time, so that /driver returns nothing until all interfaces are registered.
|
| Story: #2008423
| Task: #41368
| Change-Id: I6ef3e6e36b96106faf4581509d9219e5c535a6d8
* Remove locks before RPC bus is started | Julia Kreger | 2020-07-28 | 1 | -5/+29
| A partner performing some testing recognized a case where a request sent to the Ironic Conductor while it is in the process of starting makes it in to be processed, yet later the operation fails with errors such as a NodeNotLocked exception. Notably they were able to reproduce this by requesting the attachment or detachment of a VIF at the same time as restarting the conductor.
|
| In part, this condition is due to the conductor being restarted where the conductors table includes the node being restarted and the webserver has possibly not had a chance to observe that the conductor is in the process of restarting as the hash ring is still valid.
|
| In short - Incoming RPC requests can come in during the initialization window and as such we should not remove locks while the conductor could possibly already be receiving work.
|
| As such, we've added a ``prepare_host`` method which initializes the conductor database connection and removes the stale locks.
|
| Under normal operating conditions, the database client is reused.
|
| rhbz# 1847305
| Change-Id: I8e759168f1dc81cdcf430f3e33be990731595ec3
* Merge "Release reservation when stopping the ironic-conductor service" | Zuul | 2020-04-17 | 1 | -0/+2
|\
| * Release reservation when stopping the ironic-conductor service | shenxindi | 2020-04-10 | 1 | -0/+2
| | If a conductor hostname is changed while reservations are issued to a conductor with one hostname, such as 'hostname', and then the process is restarted with 'new_hostname', then the queries would not match the node and effectively the nodes would become inaccessible until the reservation is cleared.
| |
| | This patch clears the reservation when stopping the ironic-conductor service to avoid the nodes becoming inaccessible.
| |
| | Ref to: https://review.opendev.org/#/c/711765/
| | Change-Id: Id31cd30564ff26df0bbe4976ffe3f268b0dd3d7b
* | Add my new address to .mailmap | Aeva Black | 2020-04-13 | 1 | -1/+2
|/
| This commit updates the mailmap file and changes my alias in a few places within old comments.
|
| Change-Id: Ica0e184109d794b8e129d567b5606d7fe84ff384
* Nodes in maintenance didn't fail, when they should have | Ruby Loo | 2020-01-16 | 1 | -4/+8
| In this code in base_manager.py, _fail_if_in_state() [1]: if the node is in maintenance, nothing is done. This means that when a node in maintenance is in mid deployment or cleaning and their conductor dies, it won't get put into a failed state [2].
|
| This fixes it.
|
| [1] https://opendev.org/openstack/ironic/src/commit/8294afa6231629f9734f19ea5b3b0253ee9b8957/ironic/conductor/base_manager.py#L485
| [2] https://opendev.org/openstack/ironic/src/commit/8294afa6231629f9734f19ea5b3b0253ee9b8957/ironic/conductor/base_manager.py#L235
|
| Story #2007098
| Task #38134
| Change-Id: Ide70619271455685d09671ae16d744fc9ae58a02
* Stop using six library | Riccardo Pittau | 2019-12-23 | 1 | -2/+1
| Since we've dropped support for Python 2.7, it's time to look at the bright future that Python 3.x will bring and stop forcing compatibility with older versions. This patch removes the six library from requirements, not looking back.
|
| Change-Id: Ib546f16965475c32b2f8caabd560e2c7d382ac5a
* Fix :param: in docstring | zhu.fanglei | 2019-06-14 | 1 | -24/+24
| In docstring :param should be used instead of :param:.
|
| Change-Id: Id531e58087b8196b30dda12aa4245a1eefd638ac
* Remove deprecated option [DEFAULT]enabled_drivers | Kaifeng Wang | 2019-06-06 | 1 | -5/+0
| The option was deprecated in Rocky, this patch removes it.
|
| Change-Id: I6c1ef41b0f960fc2843a494dfd002a982e2bef01
* Publish baremetal endpoint via mdns | Dmitry Tantsur | 2019-05-23 | 1 | -0/+20
| This change adds an option to publish the endpoint via mDNS on start up and clean it up on tear down.
|
| Story: #2005393
| Task: #30383
| Change-Id: I55d2e7718a23cde111eaac4e431588184cb16bda
* Removes `hash_distribution_replicas` configuration option | Varsha | 2019-04-19 | 1 | -3/+1
| `hash_distribution_replicas` was deprecated in the Stein cycle (12.1.0).
|
| Story: #1680160
| Task: #30033
| Change-Id: Iddc59ed113fb9808f8c8564433475638491be84f
* Allocation API: resume allocations on conductor restart | Dmitry Tantsur | 2019-02-19 | 1 | -0/+16
| This change allows allocations that were not finished because of conductor restarting or crashing to be finished after start up.
|
| Change-Id: I016e08dcb59613b59ae753ef7d3bc9ac4a4a950a
| Story: #2004341
| Task: #29544
* Node gets stuck in ING state when conductor goes down | Dao Cong Tien | 2018-08-03 | 1 | -8/+6
| If a node is in an ING state such as INSPECTING or RESCUING and the conductor goes down, then when the conductor comes back the node gets stuck in that ING state.
|
| The cases for (DEPLOYING, CLEANING) are already processed as expected, but not those for (INSPECTING, RESCUING, UNRESCUING, VERIFYING, ADOPTING). DELETING cannot be transitioned to a 'fail' state.
|
| Change-Id: Ie6886ea78fac8bae81675dabf467939deb1c4460
| Story: #2003147
| Task: #23282
* Merge "Use conductor group for hash ring calculations" | Zuul | 2018-07-25 | 1 | -11/+26
|\
| * Use conductor group for hash ring calculations | Jim Rollenhagen | 2018-07-23 | 1 | -11/+26
| | This changes the calculation for keys in the hash ring manager to be of the form "<conductor_group>:<driver>", instead of just driver. This is used when the RPC version pin is 1.47 or greater (1.47 was created to handle this).
| |
| | When finding an RPC topic, we use the conductor group marked on the node as part of this calculation. However, this becomes a problem when we don't have a node that we're looking up a topic for. In this case we look for a conductor in any group which has the driver loaded, and use a temporary hash ring that does not use conductor groups to find a conductor.
| |
| | This also begins the API work, as the API must be aware of the new hash ring calculation. However, exposing the conductor_group field and adding a microversion is left for a future patch.
| |
| | Story: 2001795
| | Task: 22641
| | Change-Id: Iaf71348666b683518fc6ce4769112459d98938f2
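The key format and the version gate from this commit can be illustrated with two small helpers. Both function names are hypothetical and are not the hash ring manager's real API; only the "<conductor_group>:<driver>" form and the 1.47 threshold come from the commit message.

```python
def uses_conductor_groups(rpc_version_pin):
    """Return True when a pin such as "1.47" allows group-aware keys."""
    major, minor = (int(part) for part in rpc_version_pin.split("."))
    return (major, minor) >= (1, 47)


def hash_ring_key(driver, conductor_group="", use_groups=True):
    """Build the ring key: "<conductor_group>:<driver>" or just driver."""
    if use_groups:
        return "%s:%s" % (conductor_group, driver)
    return driver
```

Comparing the version as an integer tuple avoids the string comparison trap where "1.100" would sort below "1.47".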
* | Merge "Remove endpoint_type from configuration" | Zuul | 2018-07-23 | 1 | -14/+0
|\ \
| |/
|/|
| * Remove endpoint_type from configuration | Matthew Thode | 2018-05-22 | 1 | -14/+0
| | python-swiftclient stopped supporting the temp url structure used when radosgw was set as the endpoint_type in ocata, meaning only Newton and older versions of python-swiftclient will work. Newton is deprecated, so remove the option.
| |
| | This breaks the deprecation cycle, but since it has been not working for so long it needs to just be dropped.
| |
| | Change-Id: Ibdc93b049b7e1ae34cac9e1f599786439c46a685
* | Add conductor_group field to config, node and conductor objects | Jim Rollenhagen | 2018-07-18 | 1 | -2/+3
| | Adds the fields and bumps the objects versions. Excludes the field from the node API for now.
| |
| | Also adds the conductor_group config option, and populates the field in the conductors table.
| |
| | Also fixes a fundamentally broken test in ironic.tests.unit.db.test_api.
| |
| | Change-Id: Ice2f90f7739b2927712ed45c969865136a216bd6
| | Story: 2001795
| | Task: 22640
| | Task: 22642
* | Remove support for creating and loading classic drivers | Dmitry Tantsur | 2018-07-02 | 1 | -41/+23
| | * removes any bits related to loading classic drivers from the drivers factory code
| | * removes exceptions that only happen when classic drivers can be loaded
| | * removes the BaseDriver, moves the useful functionality to the BareDriver class
| | * /v1/drivers/?type=classic now always returns an empty list
| | * removes the migration updating classic drivers to hardware types
| |
| | The documentation will be updated separately.
| |
| | Change-Id: I8ee58dfade87ae2a2544c5dcc27702c069f5089d
* | Fix E501 errors | Julia Kreger | 2018-05-14 | 1 | -1/+2
|/
| Task: #19602
| Story: #2001985
| Change-Id: I6120b1447819ab3e8b2fb4a7fb8e412dc6ce5034
* Fix W504 errors | Julia Kreger | 2018-05-09 | 1 | -4/+4
| Also a few related errors based on some earlier investigation may have been pulled in along the lines of E305.
|
| Story: #2001985
| Change-Id: Ifb2d3b481202fbd8cbb472e02de0f14f4d0809fd
* Collect periodic tasks from all enabled hardware interfaces | Dmitry Tantsur | 2018-04-24 | 1 | -43/+61
| Currently we only collect periodic tasks from interfaces used in enabled classic drivers. Meaning, periodics are not collected from interfaces that are only used in hardware types. This patch corrects it.
|
| This patch does not enable collection of periodic tasks from hardware types, since we did not collect them from classic drivers. I don't remember the reason for that, and we may want to fix it later.
|
| Change-Id: Ib1963f3f67a758a6b2405387bfe7b3e30cc31ed8
| Story: #2001884
| Task: #14357
* Rework logic handling reserved orphaned nodes in the conductor | Dmitry Tantsur | 2018-02-21 | 1 | -2/+7
| If a conductor dies while holding a reservation, the node can get stuck in its current state. Currently the conductor that takes over the node only cleans it up if it's in the DEPLOYING state.
|
| This change applies the same logic for all nodes:
| 1. Reservation is cleared by the conductor that took over the node no matter what provision state.
| 2. CLEANING is also aborted, nodes are moved to CLEAN FAIL with maintenance on.
| 3. Target power state is cleared as well.
|
| The reservation is cleared even for nodes in maintenance mode, otherwise it's impossible to move them out of maintenance.
|
| Change-Id: I379c1335692046ca9423fda5ea68d2f10c065cb5
| Closes-Bug: #1588901
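The three numbered steps can be sketched on a plain dictionary standing in for a node. The function name, the dictionary keys, and the state strings "cleaning" and "clean failed" are illustrative assumptions; the real code operates on node objects and Ironic's state machine.

```python
def take_over_orphan(node):
    """Return a cleaned-up copy of an orphaned node record.

    node is a hypothetical dict with reservation, target_power_state,
    provision_state and maintenance keys.
    """
    updated = dict(node)
    # Step 1: clear the reservation regardless of provision state,
    # including for nodes in maintenance.
    updated["reservation"] = None
    # Step 3: clear the target power state as well.
    updated["target_power_state"] = None
    # Step 2: abort cleaning and flag the node for operator attention.
    if updated["provision_state"] == "cleaning":
        updated["provision_state"] = "clean failed"
        updated["maintenance"] = True
    return updated
```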
* Clean nodes stuck in CLEANING state when ir-cond restarts | Zhenguo Niu | 2018-02-16 | 1 | -15/+22
| When a conductor managing a node dies abruptly mid-cleaning, the node will get stuck in the CLEANING state.
|
| This also moves _start_service() before creating CLEANING nodes in tests. Finally, it adds autospec to a few places where the tests fail in a mysterious way otherwise.
|
| Change-Id: Ia7bce4dff57569707de4fcf3002eac241a5aa85b
| Co-Authored-By: Dmitry Tantsur <dtantsur@redhat.com>
| Partial-Bug: #1651092
* Adds more exception handling for ironic-conductor heartbeat | D G Lee | 2017-09-18 | 1 | -0/+3
| When the heartbeat thread of the ironic-conductor server is reporting a heartbeat, it will be interrupted by database exceptions other than 'DBConnectionError'. So add 'Exception' in _conductor_service_record_keepalive to catch all possible exceptions raised from the database, to ensure the heartbeat thread does not exit, and log the exception information. When the database recovers from an exception, the heartbeat thread will continue to report heartbeats.
|
| Change-Id: I0dc3ada945275811ef7272d500823e0a57011e8f
| Closes-Bug: #1696296
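A minimal sketch of the pattern this commit describes: one heartbeat attempt that survives any exception. The `DBConnectionError` class below is a local stand-in for the database exception named in the message, and `keepalive_once` and `touch` are hypothetical names rather than the body of _conductor_service_record_keepalive.

```python
import logging

LOG = logging.getLogger(__name__)


class DBConnectionError(Exception):
    """Local stand-in for the database connection error."""


def keepalive_once(touch):
    """Run one heartbeat; never let an exception escape to the loop.

    touch is a hypothetical callable that updates the conductor record.
    Returns True on success and False when the heartbeat was skipped.
    """
    try:
        touch()
        return True
    except DBConnectionError:
        LOG.warning("Could not connect to the database while heartbeating.")
    except Exception as exc:
        # Catch-all so the heartbeat thread keeps running and can
        # resume once the database recovers.
        LOG.exception("Error while heartbeating: %s", exc)
    return False
```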
* Improve graceful shutdown of conductor process | Yuriy Zveryanskyy | 2017-07-05 | 1 | -0/+5
| If the conductor is being stopped, it waits for completion of all periodic tasks which are already in the running state. If there are many nodes assigned to the conductor this may take a long time, and the oslo service library can kill the thread by timeout. This patch adds code that stops iterations over nodes in periodic tasks if the conductor is being stopped. These changes reduce the probability of locked nodes after shutdown and the time of shutdown.
|
| Closes-Bug: #1701495
| Change-Id: If6ea48d01132817a6f47560d3f6ee1756ebfab39
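Stopping node iteration once shutdown begins can be sketched with a generator and a `threading.Event`. The helper name and the use of an event object are assumptions for illustration; the commit does not say how the conductor tracks that it is stopping.

```python
import threading


def iter_nodes_until_shutdown(nodes, shutdown):
    """Yield nodes one by one, stopping early once shutdown is set.

    shutdown is a threading.Event that the (hypothetical) stop handler
    sets when the conductor begins shutting down.
    """
    for node in nodes:
        if shutdown.is_set():
            return
        yield node
```

A periodic task that loops over this generator finishes the node it is working on and then exits, instead of visiting every remaining node.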
* Merge "Config drive support for ceph radosgw" | Jenkins | 2017-04-17 | 1 | -0/+14
|\
| * Config drive support for ceph radosgw | Anup Navare | 2017-04-12 | 1 | -0/+14
| | Currently config drive can be stored in swift with keystone authentication. This change allows ironic to store the config drive in ceph radosgw and use radosgw authentication mechanism that is not currently supported. It uses swift API compatibility for ceph radosgw.
| |
| | New options:
| |   [deploy]/configdrive_use_object_store
| |   [deploy]/object_store_endpoint_type
| |
| | Deprecations:
| |   [conductor]/configdrive_use_swift
| |     Replaced by: [deploy]/configdrive_use_object_store
| |   [glance]/temp_url_endpoint_type
| |     Replaced by: [deploy]/object_store_endpoint_type
| |
| | Change-Id: I9204c718505376cfb73632b0d0f31cea00d5e4d8
| | Closes-Bug: #1642719
* | Remove translation of log messages from ironic/conductor | Ramamani Yeleswarapu | 2017-03-22 | 1 | -37/+35
|/
| The i18n team has decided not to translate the logs because it seems like it's not very useful. This patch removes translation of log messages from ironic/conductor.
|
| Change-Id: I0fabef88f2d1bc588150f02cac0f5e975965fc29
| Partial-Bug: #1674374
* Merge "exception from driver_factory.default_interface()" | Jenkins | 2017-02-07 | 1 | -4/+1
|\
| * exception from driver_factory.default_interface() | Ruby Loo | 2017-02-06 | 1 | -4/+1
| | This changes driver_factory.default_interface() so that instead of returning None if there is no calculated default interface, it raises exception.NoValidDefaultForInterface.
| |
| | This is a follow up to 6206c47720eb16719088540a48b53fb22a4eb999.
| |
| | Change-Id: I0c3d5d75b5a37af02f3660968cf3f2c669e52019
| | Partial-Bug: #1524745
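The change from returning None to raising can be illustrated as below. The exception class is a local stand-in for exception.NoValidDefaultForInterface, the signature is hypothetical, and the selection rule (first supported implementation that is also enabled) is an assumption rather than something this commit message states.

```python
class NoValidDefaultForInterface(Exception):
    """Local stand-in for exception.NoValidDefaultForInterface."""


def default_interface(interface_type, supported, enabled):
    """Pick a default implementation or raise; never return None.

    supported is assumed to be in priority order; enabled is the set
    of implementations switched on in configuration.
    """
    for impl in supported:
        if impl in enabled:
            return impl
    raise NoValidDefaultForInterface(
        "no enabled %s interface is supported" % interface_type)
```

Raising lets callers drop their `is None` checks and surfaces the misconfiguration at the point where it is detected.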
* | Improve enabled_*_interfaces config help and validation | Jim Rollenhagen | 2017-02-03 | 1 | -0/+25
|/
| This adds additional constraints to the help messages for the enabled_*_interfaces config options. It also checks if they are empty at conductor startup, and if any are empty, errors out with a better error message than previously provided.
|
| Change-Id: I97fc318ce00291d5e43b70423930981c2f5a2de0
| Partial-Bug: #1524745
* Fail conductor startup if invalid defaults exist | Jim Rollenhagen | 2017-02-01 | 1 | -1/+7
| This causes the conductor to fail to start up if a default interface implementation cannot be found for any dynamic driver. This avoids problems later where building a task object to operate on a node could fail for the same reason.
|
| This also removes a RAID interface test that turned out to be an invalid test, but we couldn't tell it was invalid until we had changed the start up behavior of the conductor.
|
| Note that this release note doesn't actually note a change between releases, but rather is mostly for my use when I come back to combine many of the release notes for this feature later.
|
| Change-Id: I39d3c30a6beda2e496ff85119281fdf4de191560
| Partial-Bug: #1524745
* Improve conductor driver validation at startup | Jim Rollenhagen | 2017-02-01 | 1 | -13/+30
| This changes the driver loading validation in the conductor startup to check for at least one classic *or* dynamic driver. Previously the conductor would fail to start if no classic drivers were loaded. This allows the conductor to use only dynamic drivers, without loading any classic drivers.
|
| It also now checks classic driver names against dynamic driver names, and fails to start if there is a conflict there. This would totally break the hash ring and cause mass confusion, so we cannot allow it.
|
| Change-Id: Id368690697f90471d09f16eaa4925338dadebd0f
| Partial-Bug: #1524745
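Both startup checks from this commit fit in a few lines of set arithmetic. The function name and the use of ValueError are illustrative choices; the conductor raises its own exception types.

```python
def validate_driver_names(classic_names, dynamic_names):
    """Check startup preconditions and return all loaded driver names.

    Raises ValueError (a placeholder exception) when nothing is loaded
    or when a name is used by both driver kinds.
    """
    classic, dynamic = set(classic_names), set(dynamic_names)
    if not classic and not dynamic:
        raise ValueError(
            "no drivers loaded; enable at least one classic or dynamic driver")
    conflicts = sorted(classic & dynamic)
    if conflicts:
        raise ValueError(
            "driver names used by both classic and dynamic drivers: %s"
            % ", ".join(conflicts))
    return classic | dynamic
```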
* Merge "Remove support for driver object periodic tasks" | Jenkins | 2017-02-01 | 1 | -3/+0
|\
| * Remove support for driver object periodic tasks | Ruby Loo | 2017-01-31 | 1 | -3/+0
| | Attaching periodic tasks on a driver object (rather than an interface) was deprecated during the Newton cycle (6.1.0). This removes support for it.
| |
| | Change-Id: I35afd4e0d3d1a32a516f6c755a0bd9aee0f1b1ba
| | Fixes-Bug: #1660805
* | Log reason for hardware type registration failure | Ruby Loo | 2017-01-31 | 1 | -2/+2
|/
| The log string 'Failed to register hardware types' doesn't provide much help as to what went wrong. This adds the reason (exception message) to the log.
|
| Change-Id: I941e35473f48c636134d5df31087d0ddbcacf44a
| Partial-Bug: #1524745
* Turn NOTE into docstring | Jim Rollenhagen | 2017-01-20 | 1 | -3/+13
| conductor.base_manager._register_and_validate_hardware_interfaces had a note at the top about what exceptions might be raised. Turn this into a proper docstring.
|
| Change-Id: I60b3e864f4cfba38ed7d12caf3bf723d73ab9e39
| Partial-Bug: #1524745
* Merge "Register/unregister hardware interfaces for conductors" | Jenkins | 2017-01-19 | 1 | -0/+37
|\
| * Register/unregister hardware interfaces for conductors | Jim Rollenhagen | 2017-01-19 | 1 | -0/+37
| | This registers the intersection of supported and enabled interfaces for each hardware type enabled in the conductor at conductor startup, and unregisters them at conductor shutdown.
| |
| | Validation is left as a todo for now.
| |
| | Change-Id: I14e88bfc304de9414de008d1cc8568dda9115ecc
| | Partial-Bug: #1524745
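The "intersection of supported and enabled interfaces" can be sketched for a single hardware type as follows. The function name and the dictionary layout (interface type mapped to implementation names) are assumptions made for the example.

```python
def interfaces_to_register(supported, enabled):
    """Return, per interface type, implementations both supported and enabled.

    supported maps interface type -> names the hardware type supports;
    enabled maps interface type -> names enabled in configuration.
    """
    return {
        iface_type: sorted(set(impls) & set(enabled.get(iface_type, ())))
        for iface_type, impls in supported.items()
    }
```

An interface type whose result is an empty list has nothing to register, which is the situation the later validation commits turn into a startup error.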
* | Move to tooz hash ring implementation | Jim Rollenhagen | 2017-01-19 | 1 | -3/+5
|/
| This changes the ironic driver to use the hash ring implementation from tooz, which is nearly identical to ironic.common.hash_ring.
|
| Change-Id: I4200be2035067622604e5aa70e025594bcd0a801
| Depends-On: Ic1f8b89b819ace8df9b15c61eaf9bf136ad3166b
* Merge "Add node console notifications" | Jenkins | 2016-12-28 | 1 | -0/+11
|\
| * Add node console notifications | Yuriy Zveryanskyy | 2016-12-23 | 1 | -0/+11
| | This patch adds node console notifications, event types are: "baremetal.node.console_{set, restore}.{start, end, error}".
| |
| | Developer documentation updated.
| |
| | Change-Id: I3b3ac74607fd6e218fdf0ea3ff30964e527db399
| | Partial-Bug: #1606520
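The brace shorthand in the event type string expands to six concrete names. The helper below is only an illustration of that expansion; `console_event_types` is a hypothetical name, not part of the notification code.

```python
from itertools import product


def console_event_types():
    """Expand "baremetal.node.console_{set, restore}.{start, end, error}"."""
    actions = ("set", "restore")
    statuses = ("start", "end", "error")
    return [
        "baremetal.node.console_%s.%s" % (action, status)
        for action, status in product(actions, statuses)
    ]
```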