| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Instead of clearing existing reservations at the beginning of
del_host, wait for the tasks holding them to go to completion. This
check continues indefinitely until the conductor process exits due to
one of:
- All reservations for this conductor are released
- CONF.graceful_shutdown_timeout has elapsed
- The process manager (systemd, kubernetes) sends SIGKILL after the
configured graceful period
Because the default values of [DEFAULT]graceful_shutdown_timeout and
[conductor]heartbeat_timeout are the same (60s) no other conductor
will claim a node as an orphan until this conductor exits.
Change-Id: Ib8db915746228cd87272740825aaaea1fdf953c7
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently when a conductor is stopped, the rpc service stops
responding to requests as soon as self.manager.del_host returns. This
means that until the hash ring is reset on the whole cluster, requests
can be sent to a service which is stopped.
This change waits for the remaining seconds to delay stopping until
CONF.hash_ring_reset_interval has elapsed. This will improve the
reliability of the cluster when scaling down or rolling out updates.
This delay only occurs when there is more than one online conductor,
to allow fast restarts on single-node ironic installs (bifrost,
metal3).
Change-Id: I643eb34f9605532c5c12dd2a42f4ea67bf3e0b40
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
One of the major changes in SQLAlchemy 2.0 is the removal
of autocommit support. It turns out Ironic was using this quite
aggressively without even really being aware of it.
* Moved the declaritive_base to ORM, as noted in the SQLAlchemy 2.0
changes[0].
* Console testing caused us to become aware of issues around locking
where session synchronization, when autocommit was enabled, was
defaulted to False. The result of this is that you could have two
sessions have different results, which could results on different
threads, and where one could still attempt to lock based upon prior
information. Inherently, while this basically worked, it was
also sort of broken behavior. This resulted in locking being
rewritten to use the style mandated in SQLAlchemy 2.0 migration
documentation. This ultimately is due to locking, which is *heavily*
relied upon in Ironic, and in unit testing with sqlite, there are
no transactions, which means we can get some data inconsistency
in unit testing as well if we're reliant upon the database to
precisely and exactly return what we committed.[1]
* Begins changing the query.one()/query.all() style to use explicit
select statements as part of the new style mandated for migration
to SQLAlchemy 2.0.
* Instead of using field label strings for joined queries, use the
object format, which makes much more sense now, and is part of
the items required for eventual migration to 2.0.
* DB queries involving Traits are now loaded using SelectInLoad
as opposed to Joins. The now deprecated ORM queries were quietly
and silently de-duplicating rows and providing consistent sets
from the resulting joined table responses, however putting much
higher CPU load on the processing of results on the client.
Prior performance testing has informed us this should be a minimal
overhead impact, however these queries should no longer be in
transactions with the Database Servers which should offset the
shift in load pattern. The reason we cannot continue to deduplicate
locally in our code is because we carry Dict data sets which cannot
be hashed for deduplication. Most projects have handled this by
treating them as Text and then converting, but without a massive
rewrite, this seems to be the viable middle ground.
* Adds an explict mapping for traits and tags on the Node object
to point directly to the NodeTrait and NodeTag classes. This
superceeds the prior usage of a backref to make the association.
* Splits SQLAlchemy class model Node into Node and NodeBase, which
allows for high performance queries to skip querying for ``tags``
and ``traits``. Otherwise with the afrormentioned lookups would
always execute as they are now properties as well on the Node
class. This more common of a SQLAlchemy model, but Ironic's model
has been a bit more rigid to date.
* Adds a ``start_consoles`` and ``start_allocations`` option to the
conductor ``init_host`` method. This allows unit tests to be
executed and launched with the service context, while *not* also
creating race conditions which resulted in failed tests.
* The db API ``_paginate_query`` wrapper now contains additional
logic to handle traditional ORM query responses and the newer style
of unified query responses. Due to differences in queries and handling,
which also was part of the driver for the creation of ``NodeBase``,
as SQLAlchemy will only create an object if a base object is referenced.
Also, by default, everything returned is a tuple in 1.4 with the
unified interface.
* Also modified one unit test which counted time.sleep calls, which is
a known pattern which can create failures which are ultimately noise.
Ultimately, I have labelled the remaining places which SQLAlchemy
warnings are raised at for deprecation/removal of functionality,
which needs to be addressed.
[0] https://docs.sqlalchemy.org/en/14/changelog/migration_20.html
[1] https://docs.sqlalchemy.org/en/14/dialects/sqlite.html#transaction-isolation-level-autocommit
Change-Id: Ie0f4b8a814eaef1e852088d12d33ce1eab408e23
|
|
|
|
|
|
|
|
|
| |
This change makes it easier to configure power and management interfaces
(and thus vendor drivers) by figuring out reasonable defaults.
Story: #2009316
Task: #43717
Change-Id: I8779603e566be5a84daf6f680c0bbe2f191923d9
|
|
|
|
|
|
|
|
|
|
|
|
| |
Host preparation file writing is already done by the __init__ method
of iPXEBoot. This change moves place_loaders_for_boot calls to
iPXEBoot and PXEBoot to be consistent, and to only write the files
when that driver is enabled.
This will mean multiple writes of the same file when subclasses of
these drivers are also enabled, but this overhead will be negligible.
Change-Id: I7e17f4d1a54cd6c5d1a4bf006a0d42db8d123a46
|
|
|
|
|
|
|
| |
Adds capability to copy bootloader assets from the system OS
into the network boot folders on conductor startup.
Change-Id: Ica8f9472d0a2409cf78832166c57f2bb96677833
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Adds periodic task to purge node_history entries based upon
provided configuration.
* Adds recording of node history entries for errors in the
core conductor code.
* Also changes the rescue abort behavior to remove the notice
from being recorded as an error, as this is a likely bug in
behavior for any process or service evaluating the node
last_error field.
* Makes use of a semi-free form event_type field to help
provide some additional context into what is going on and
why. For example if deployments are repeatedly failing,
then perhaps it is a configuration issue, as opposed to
a general failure. If a conductor has no resources, then
the failure, in theory would point back to the conductor
itself.
Story: 2002980
Task: 42960
Change-Id: Ibfa8ac4878cacd98a43dd4424f6d53021ad91166
|
|
|
|
|
|
|
|
|
|
| |
Prevent each driver comming online one at a time. So that
/driver returns nothign until all interfaces are registered
Story: #2008423
Task: #41368
Change-Id: I6ef3e6e36b96106faf4581509d9219e5c535a6d8
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A partner performing some testing recognized a case where if a request
is sent to the Ironic Conductor while it is in the process of starting,
and the request makes it into be processed, yet latter the operation
fails with errors such as NodeNotLocked exception. Notably they were
able to reproduce this by requesting the attachment or detachment of
a VIF at the same time as restarting the conductor.
In part, this condition is due to to the conductor being restarted
where the conductors table includes the node being restarted and
the webserver has not possibly had a chance to observe that the
conductor is in the process of restarting as the hash ring is
still valid.
In short - Incoming RPC requests can come in during the initialization
window and as such we should not remove locks while the conductor could
possibly already be receiving work.
As such, we've added a ``prepare_host`` method which initializes
the conductor database connection and removes the stale locks.
Under normal operating conditions, the database client is reused.
rhbz# 1847305
Change-Id: I8e759168f1dc81cdcf430f3e33be990731595ec3
|
|\ |
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
If a conductor hostname is changed while reservations are
issued to a conductor with one hostname, such as 'hostname'
and then the process is restarted with 'new_hostname', then the
queries would not match the node and effectively the nodes
would become inaccessible until the reservation is cleared.
This patch clears the reservation when stoping the
ironic-conductor service to avoid the nodes becoming inaccessible.
Ref to: https://review.opendev.org/#/c/711765/
Change-Id: Id31cd30564ff26df0bbe4976ffe3f268b0dd3d7b
|
|/
|
|
|
|
|
| |
This commit updates the mailmap file and changes my alias
in a few places within old comments.
Change-Id: Ica0e184109d794b8e129d567b5606d7fe84ff384
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In this code in base_manager.py, _fail_if_in_state() [1]: if the
node is in maintenance, nothing is done. This means that when a
node in maintenance is in mid deployment or cleaning and their
conductor dies, it won't get put into a failed state [2].
This fixes it.
[1] https://opendev.org/openstack/ironic/src/commit/8294afa6231629f9734f19ea5b3b0253ee9b8957/ironic/conductor/base_manager.py#L485
[2] https://opendev.org/openstack/ironic/src/commit/8294afa6231629f9734f19ea5b3b0253ee9b8957/ironic/conductor/base_manager.py#L235
Story #2007098
Task #38134
Change-Id: Ide70619271455685d09671ae16d744fc9ae58a02
|
|
|
|
|
|
|
|
|
|
| |
Since we've dropped support for Python 2.7, it's time to look at
the bright future that Python 3.x will bring and stop forcing
compatibility with older versions.
This patch removes the six library from requirements, not
looking back.
Change-Id: Ib546f16965475c32b2f8caabd560e2c7d382ac5a
|
|
|
|
|
|
| |
In docstring :param should be used instead of :param:.
Change-Id: Id531e58087b8196b30dda12aa4245a1eefd638ac
|
|
|
|
|
|
| |
The option was deprecated in Rocky, this patch removes it.
Change-Id: I6c1ef41b0f960fc2843a494dfd002a982e2bef01
|
|
|
|
|
|
|
|
|
| |
This change adds an option to publish the endpoint via mDNS on start
up and clean it up on tear down.
Story: #2005393
Task: #30383
Change-Id: I55d2e7718a23cde111eaac4e431588184cb16bda
|
|
|
|
|
|
|
|
|
| |
`hash_distribution_replicas` was deprecated in the Stein cycle (12.1.0).
Story: #1680160
Task: #30033
Change-Id: Iddc59ed113fb9808f8c8564433475638491be84f
|
|
|
|
|
|
|
|
|
| |
This change allows allocations that were not finished because of
conductor restarting or crashing to be finished after start up.
Change-Id: I016e08dcb59613b59ae753ef7d3bc9ac4a4a950a
Story: #2004341
Task: #29544
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If a node is in ING state such as INSPECTING, RESCUING, and
the conductor goes down, when the conductor backs, the node
gets stuck with that ING state.
The cases for (DEPLOYING, CLEANING) are already processed
as expected, but (INSPECTING, RESCUING, UNRESCUING, VERIFYING,
ADOPTING). DELETING cannot be transitioned to 'fail' state.
Change-Id: Ie6886ea78fac8bae81675dabf467939deb1c4460
Story: #2003147
Task: #23282
|
|\ |
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This changes the calculation for keys in the hash ring manager to be of
the form "<conductor_group>:<driver>", instead of just driver. This is
used when the RPC version pin is 1.47 or greater (1.47 was created to
handle this).
When finding an RPC topic, we use the conductor group marked on the node
as part of this calculation. However, this becomes a problem when we
don't have a node that we're looking up a topic for. In this case we
look for a conductor in any group which has the driver loaded, and use a
temporary hash ring that does not use conductor groups to find a
conductor.
This also begins the API work, as the API must be aware of the new hash
ring calculation. However, exposing the conductor_group field and adding
a microversion is left for a future patch.
Story: 2001795
Task: 22641
Change-Id: Iaf71348666b683518fc6ce4769112459d98938f2
|
|\ \
| |/
|/| |
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
python-swiftclient stopped supporting the temp url structure used when radosgw
was set as the endpoint_type in ocata, meaning only Newton and older versions
of python-swiftclient will work. Newton is deprecated, so remove the option.
This breaks the deprecation cycle, but since it has been not working for so
long it needs to just be dropped.
Change-Id: Ibdc93b049b7e1ae34cac9e1f599786439c46a685
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Adds the fields and bumps the objects versions. Excludes the field from
the node API for now.
Also adds the conductor_group config option, and populates the field in
the conductors table.
Also fixes a fundamentally broken test in ironic.tests.unit.db.test_api.
Change-Id: Ice2f90f7739b2927712ed45c969865136a216bd6
Story: 2001795
Task: 22640
Task: 22642
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
* removes any bits related to loading classic drivers from
the drivers factory code
* removes exceptions that only happen when classic drivers
can be loaded
* removes the BaseDriver, moves the useful functionality to
the BareDriver class
* /v1/drivers/?type=classic now always returns an empty list
* removes the migration updating classic drivers to hardware
types
The documentation will be updated separately.
Change-Id: I8ee58dfade87ae2a2544c5dcc27702c069f5089d
|
|/
|
|
|
|
| |
Task: #19602
Story: #2001985
Change-Id: I6120b1447819ab3e8b2fb4a7fb8e412dc6ce5034
|
|
|
|
|
|
|
|
| |
Also a few related errors based on some earlier investigation
may have been pulled in along the lines of E305.
Story: #2001985
Change-Id: Ifb2d3b481202fbd8cbb472e02de0f14f4d0809fd
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently we only collect periodic tasks from interfaces used in enabled
classic drivers. Meaning, periodics are not collected from interfaces
that are only used in hardware types. This patch corrects it.
This patch does not enable collection of periodic tasks from hardware
types, since we did not collect them from classic drivers. I don't
remember the reason for that, and we may want to fix it later.
Change-Id: Ib1963f3f67a758a6b2405387bfe7b3e30cc31ed8
Story: #2001884
Task: #14357
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If a conductor dies while holding a reservation, the node can get
stuck in its current state. Currently the conductor that takes
over the node only cleans it up if it's in the DEPLOYING state.
This change applies the same logic for all nodes:
1. Reservation is cleared by the conductor that took over the node
no matter what provision state.
2. CLEANING is also aborted, nodes are moved to CLEAN FAIL with
maintenance on.
3. Target power state is cleared as well.
The reservation is cleared even for nodes in maintenance mode,
otherwise it's impossible to move them out of maintenance.
Change-Id: I379c1335692046ca9423fda5ea68d2f10c065cb5
Closes-Bug: #1588901
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When a conductor managing a node dies abruptly mid cleaing, the
node will get stuck in the CLEANING state.
This also moves _start_service() before creating CLEANING nodes
in tests. Finally, it adds autospec to a few places where the tests
fail in a mysterious way otherwise.
Change-Id: Ia7bce4dff57569707de4fcf3002eac241a5aa85b
Co-Authored-By: Dmitry Tantsur <dtantsur@redhat.com>
Partial-Bug: #1651092
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When heartbeat thread of ironic-conductor server is reporting heartbeat,
it will be interrupted by database exceptions except 'DBConnectionError'.
So add 'Exception' in _conductor_service_record_keepalive to catch all
possible exceptions raised from database to ensure the heartbeat thread
not to exit. And also log the exception information. When the database
recovers from an exception, heartbeat thread will continue to report
heartbeat.
Change-Id: I0dc3ada945275811ef7272d500823e0a57011e8f
Closes-Bug: #1696296
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If conductor is being stopped it is trying to wait of completion of
all periodical tasks which are already in the running state. If there
are many nodes assigned to the conductor this may take a long time,
and oslo service library can kill thread by timeout. This patch adds
code
that stops iterations over nodes in periodical tasks if conductor
is being stopped. These changes reduce probability to get locked
nodes after shutdown and time of shutdown.
Closes-Bug: #1701495
Change-Id: If6ea48d01132817a6f47560d3f6ee1756ebfab39
|
|\ |
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Currently config drive can be stored in swift with keystone
authentication. This change allows ironic to store the config drive in
ceph radosgw and use radosgw authentication mechanism that is not
currently supported. It uses swift API compatibility for ceph radosgw.
New options:
[deploy]/configdrive_use_object_store
[deploy]/object_store_endpoint_type
Deprecations:
[conductor]/configdrive_use_swift
Replaced by: [deploy]/configdrive_use_object_store
[glance]/temp_url_endpoint_type
Replaced by: [deploy]/object_store_endpoint_type
Change-Id: I9204c718505376cfb73632b0d0f31cea00d5e4d8
Closes-Bug: #1642719
|
|/
|
|
|
|
|
|
|
|
| |
The i18n team has decided not to translate the logs because it seems
like it's not very useful.
This patch removes translation of log messages from ironic/conductor.
Change-Id: I0fabef88f2d1bc588150f02cac0f5e975965fc29
Partial-Bug: #1674374
|
|\ |
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This changes driver_factory.default_interface() so that instead
of returning None if there is no calculated default interface,
it raises exception.NoValidDefaultForInterface.
This is a follow up to 6206c47720eb16719088540a48b53fb22a4eb999.
Change-Id: I0c3d5d75b5a37af02f3660968cf3f2c669e52019
Partial-Bug: #1524745
|
|/
|
|
|
|
|
|
|
|
| |
This adds additional constraints to the help messages for the
enabled_*_interfaces config options. It also checks if they are
empty at conductor startup, and if any are empty, errors out
with a better error message than previously provided.
Change-Id: I97fc318ce00291d5e43b70423930981c2f5a2de0
Partial-Bug: #1524745
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This causes the conductor to fail to start up if a default interface
implementation cannot be found for any dynamic driver. This avoids
problems later where building a task object to operate on a node
could fail for the same reason.
This also removes a RAID interface test that turned out to be an
invalid test, but we couldn't tell it was invalid until we had
changed the start up behavior of the conductor.
Note that this release note doesn't actually note a change between
releases, but rather is mostly for my use when I come back to combine
many of the release notes for this feature later.
Change-Id: I39d3c30a6beda2e496ff85119281fdf4de191560
Partial-Bug: #1524745
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This changes the driver loading validation in the conductor
startup to check for at least one classic *or* dynamic driver.
Previously the conductor would fail to start if no classic drivers
were loaded. This allows the conductor to use only dynamic
drivers, without loading any classic drivers.
It also now checks classic driver names against dynamic driver
names, and fails to start if there is a conflict there. This
would totally break the hash ring and cause mass confusion,
so we cannot allow it.
Change-Id: Id368690697f90471d09f16eaa4925338dadebd0f
Partial-Bug: #1524745
|
|\ |
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Attaching periodic tasks on a driver object (rather than an interface)
was deprecated during the Newton cycle (6.1.0). This removes support
for it.
Change-Id: I35afd4e0d3d1a32a516f6c755a0bd9aee0f1b1ba
Fixes-Bug: #1660805
|
|/
|
|
|
|
|
|
|
| |
The log string 'Failed to register hardware types' doesn't
provide much help as to what went wrong. This adds the reason
(exception message) to the log.
Change-Id: I941e35473f48c636134d5df31087d0ddbcacf44a
Partial-Bug: #1524745
|
|
|
|
|
|
|
|
|
| |
conductor.base_manager._register_and_validate_hardware_interfaces had a
note at the top about what exceptions might be raised. Turn this into a
proper docstring.
Change-Id: I60b3e864f4cfba38ed7d12caf3bf723d73ab9e39
Partial-Bug: #1524745
|
|\ |
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This registers the intersection of supported and enabled interfaces for
each hardware type enabled in the conductor at conductor startup, and
unregisters them at conductor shutdown. Validation is left as a todo for
now.
Change-Id: I14e88bfc304de9414de008d1cc8568dda9115ecc
Partial-Bug: #1524745
|
|/
|
|
|
|
|
|
| |
This changes the ironic driver to use the hash ring implementation from
tooz, which is nearly identical to ironic.common.hash_ring.
Change-Id: I4200be2035067622604e5aa70e025594bcd0a801
Depends-On: Ic1f8b89b819ace8df9b15c61eaf9bf136ad3166b
|
|\ |
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This patch adds node console notifications, event types are:
"baremetal.node.console_{set, restore}.{start, end, error}".
Developer documentation updated.
Change-Id: I3b3ac74607fd6e218fdf0ea3ff30964e527db399
Partial-Bug: #1606520
|