path: root/nova/conductor
Commit log (newest first). Each entry shows the commit subject, author, date, number of files changed, and lines removed/added.
* Merge "Transport context to all threads" (Zuul, 2023-02-27; 1 file, -2/+2)
* Transport context to all threads (Fabian Wiesel, 2022-08-04; 1 file, -2/+2)
    The nova.utils.spawn and spawn_n methods transport the context (and
    profiling information) to the newly created threads. But the same
    isn't done when submitting work to thread-pools in the
    ComputeManager. The code doing that for spawn and spawn_n is
    extracted to a new function and called to submit the work to the
    thread-pools.
    Closes-Bug: #1962574
    Change-Id: I9085deaa8cf0b167d87db68e4afc4a463c00569c
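The pattern this commit describes can be sketched generically: capture the submitter's context and re-establish it inside the pooled worker before the task runs. This is an illustrative sketch with made-up helper names, not nova's actual API (nova's real version is built around nova.utils.spawn and the ComputeManager thread pools):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Thread-local stand-in for nova's per-thread request context tracking.
_store = threading.local()

def set_current_context(ctxt):
    _store.context = ctxt

def get_current_context():
    return getattr(_store, 'context', None)

def with_current_context(func, *args, **kwargs):
    """Capture the caller's context; restore it in the worker thread."""
    ctxt = get_current_context()
    def wrapper():
        set_current_context(ctxt)
        return func(*args, **kwargs)
    return wrapper

set_current_context({'request_id': 'req-123'})
with ThreadPoolExecutor(max_workers=1) as pool:
    # Without the wrapper, the worker thread has no context of its own.
    bare = pool.submit(get_current_context).result()
    # With the wrapper, the submitter's context follows the work.
    wrapped = pool.submit(with_current_context(get_current_context)).result()
```

Submitting the wrapped callable is the equivalent of the extracted helper the commit mentions: any work handed to the pool sees the same context the submitting thread had.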
* compute: enhance compute evacuate instance to support target state (Sahid Orentino Ferdjaoui, 2023-01-31; 3 files, -8/+21)
    Related to bp/allowing-target-state-for-evacuate. This change
    extends the compute API to accept a new argument, targetState. When
    set, targetState forces the state of the evacuated instance on the
    destination host.
    Signed-off-by: Sahid Orentino Ferdjaoui <sahid.ferdjaoui@industrialdiscipline.com>
    Change-Id: I9660d42937ad62d647afc6be965f166cc5631392
* Support unshelve with PCI in placement (Balazs Gibizer, 2022-12-21; 1 file, -0/+6)
    blueprint: pci-device-tracking-in-placement
    Change-Id: I35ca3ae82be5dc345d80ad1857abb915c987d34e
* Support evacuate with PCI in placement (Balazs Gibizer, 2022-12-21; 1 file, -0/+6)
    blueprint: pci-device-tracking-in-placement
    Change-Id: I1462ee4f4dd143b56732332f7ed00df00a9f2067
* Support cold migrate and resize with PCI tracking in placement (Balazs Gibizer, 2022-12-21; 1 file, -0/+5)
    This patch adds support for cold migrate and resize with PCI devices
    when the placement tracking is enabled. Same-host resize, evacuate
    and unshelve will be supported by subsequent patches. Live migration
    was not supported with flavor-based PCI requests before, so it won't
    be supported now either.
    blueprint: pci-device-tracking-in-placement
    Change-Id: I8eec331ab3c30e5958ed19c173eff9998c1f41b0
* Store allocated RP in InstancePCIRequest (Balazs Gibizer, 2022-12-21; 1 file, -1/+1)
    After the scheduler has selected a target host and allocated an
    allocation candidate that passed the filters, nova needs to make
    sure that the PCI claim allocates the real PCI devices from the RP
    which is allocated in placement. Placement returns the request group
    to provider mapping for each allocation candidate, so nova can map
    which InstancePCIRequest was fulfilled from which RP in the selected
    allocation candidate. This mapping is then recorded in the
    InstancePCIRequest object and used during the PCI claim to filter
    the PCI pools that can be used to claim PCI devices from.
    blueprint: pci-device-tracking-in-placement
    Change-Id: I18bb31e23cc014411db68c31317ed983886d1a8e
* Add conductor RPC interface for rebuild (whoami-rajat, 2022-08-31; 3 files, -6/+18)
    This patch adds support for passing the ``reimage_boot_volume`` flag
    from the API layer through the conductor layer to the compute layer,
    and also includes the necessary RPC version bump.
    Related blueprint volume-backed-server-rebuild
    Change-Id: I8daf177eb67d08112a16fe788910644abf338fa6
* Add support for volume backed server rebuild (Rajat Dhasmana, 2022-08-31; 1 file, -2/+3)
    This patch adds the plumbing for rebuilding a volume backed instance
    in compute code. This functionality will be enabled in a subsequent
    patch which adds a new microversion and the external support for
    requesting it. The flow of the operation is as follows:
    1) Create an empty attachment
    2) Detach the volume
    3) Request cinder to reimage the volume
    4) Wait for cinder to notify success to nova (via external events)
    5) Update and complete the attachment
    Related blueprint volume-backed-server-rebuild
    Change-Id: I0d889691de1af6875603a9f0f174590229e7be18
* Avoid n-cond startup abort for keystone failures (Dan Smith, 2022-08-18; 1 file, -1/+33)
    Conductor creates a placement client for the potential case where it
    needs to make a call for certain operations. A transient network or
    keystone failure will currently cause it to abort startup, which
    means it is not available for other unrelated activities, such as DB
    proxying for compute. This makes conductor test the placement client
    on startup, but only abort startup on errors that are highly likely
    to be permanent configuration errors, and only warn about things
    like being unable to contact keystone/placement during
    initialization. If a non-fatal error is encountered at startup,
    later operations needing the placement client will retry
    initialization.
    Closes-Bug: #1846820
    Change-Id: Idb7fcbce0c9562e7b9bd3e80f2a6d4b9bc286830
* Unify placement client singleton implementations (Dan Smith, 2022-08-18; 2 files, -3/+3)
    We have many places where we implement singleton behavior for the
    placement client. This unifies them into a single place and
    implementation. Not only does this DRY things up, but it may cause
    us to initialize the client fewer times, and it also allows for
    emitting a common set of error messages about expected failures for
    better troubleshooting.
    Change-Id: Iab8a791f64323f996e1d6e6d5a7e7a7c34eb4fb3
    Related-Bug: #1846820
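The singleton behavior this commit unifies can be sketched minimally with a placeholder client class and a lazily initialized, process-wide instance (the real implementation wraps nova's placement report client; names here are illustrative):

```python
import threading

class PlacementClient:
    """Placeholder for the real placement report client."""

_client = None
_client_lock = threading.Lock()

def get_placement_client():
    """Return one process-wide client, created lazily.

    Double-checked locking keeps the common path lock-free while
    guaranteeing a single instance under concurrent first calls.
    """
    global _client
    if _client is None:
        with _client_lock:
            if _client is None:
                _client = PlacementClient()
    return _client

first = get_placement_client()
second = get_placement_client()
```

Centralizing this in one helper also gives a single place to catch and report the expected startup failures mentioned in the previous commit.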
* Merge "Add a workaround to skip hypervisor version check on LM" (Zuul, 2022-07-27; 1 file, -2/+3)
* Add a workaround to skip hypervisor version check on LM (Kashyap Chamarthy, 2022-07-27; 1 file, -2/+3)
    When turned on, this will disable the version-checking of
    hypervisors during live-migration. This can be useful for operators
    in certain scenarios when upgrading, e.g. if you want to relocate
    all instances off a compute node due to an emergency hardware issue
    and you only have another old compute node ready at the time. Note,
    though: libvirt will do its own internal compatibility checks, and
    might still reject live migration if the destination is
    incompatible.
    Closes-Bug: #1982853
    Change-Id: Iec387dcbc49ddb91ebf5cfd188224eaf6021c0e1
    Signed-off-by: Kashyap Chamarthy <kchamart@redhat.com>
* Allow unshelve to a specific host (Compute API part) (René Ribaud, 2022-07-22; 1 file, -0/+6)
    This patch introduces changes to the compute API that allow
    PROJECT_ADMIN to unshelve a shelved offloaded server to a specific
    host. This patch also supports the ability to unpin the
    availability_zone of an instance that is bound to it.
    Implements: blueprint unshelve-to-host
    Change-Id: Ieb4766fdd88c469574fad823e05fe401537cdc30
* Remove return from rpc cast (Rajesh Tailor, 2022-06-18; 1 file, -1/+1)
    This change removes the return statement from rpc cast method calls.
    As rpc casts are asynchronous, they don't return anything.
    Change-Id: I766f64f2c086dd652bc28b338320cc94ccc48f1f
* Fix typos (Rajesh Tailor, 2022-05-30; 1 file, -1/+1)
    This change fixes some of the typos in unit tests as well as in the
    nova code-base.
    Change-Id: I209bbb270baf889fcb2b9a4d1ce0ab4a962d0d0e
* Enforce resource limits using oslo.limit (John Garbutt, 2022-02-24; 1 file, -1/+7)
    We now enforce limits on resources requested in the flavor. This
    includes: instances, ram, cores. It also works for any resource
    class being requested via the chosen flavor, such as custom resource
    classes relating to Ironic resources.
    Note that because disk resources can be limited, we need to know if
    the instance is boot from volume or not. This has meant adding extra
    code to make sure we know that when enforcing the limits.
    Follow-on patches will update the APIs to accurately report the
    limits being applied to instances, ram and cores.
    blueprint unified-limits-nova
    Change-Id: If1df93400dcbcb1d3aac0ade80ae5ecf6ce38d11
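A toy sketch of the enforcement idea described above, using plain dicts rather than oslo.limit's actual API; the function name and the boot-from-volume handling are illustrative assumptions, not nova's real code:

```python
def enforce_limits(limits, usage, requested, is_bfv=False):
    """Reject a request that would push any resource over its limit.

    Disk is skipped for boot-from-volume servers, mirroring the note
    above that BFV status must be known before enforcing disk limits.
    """
    for resource, amount in requested.items():
        if is_bfv and resource == 'disk':
            continue  # root disk comes from the volume, not local storage
        limit = limits.get(resource)
        if limit is not None and usage.get(resource, 0) + amount > limit:
            raise ValueError('over limit for %s' % resource)

# Within limits: passes silently (6 + 2 cores == 8, not over).
enforce_limits({'cores': 8, 'ram': 4096}, {'cores': 6},
               {'cores': 2, 'ram': 1024})

# Over the core limit (7 + 2 > 8): rejected.
try:
    enforce_limits({'cores': 8}, {'cores': 7}, {'cores': 2})
    rejected = False
except ValueError:
    rejected = True

# BFV server: the disk request is not counted against the disk limit.
enforce_limits({'disk': 10}, {'disk': 10}, {'disk': 20}, is_bfv=True)
```

The real implementation delegates the counting and comparison to oslo.limit's enforcer; this sketch only shows why BFV status has to be an input.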
* Merge "neutron: Rework how we check for extensions" (Zuul, 2022-02-08; 2 files, -4/+3)
* neutron: Rework how we check for extensions (Stephen Finucane, 2021-09-02; 2 files, -4/+3)
    There are a couple of changes we can make here:
    - Always attempt to refresh the cache before checking if an
      extension is enabled.
    - Use extension slugs as our reference point rather than extension
      names. They seem like a better thing to use as a constant and are
      similarly fixed.
    - Be consistent in how we name and call the extension check
      functions.
    - Add documentation for what each extension is doing/used for.
    There's a TODO here to remove some code that relies on an
    out-of-tree extension that I can't see. That's done separately since
    this is already big enough.
    Change-Id: I8058902df167239fa455396d3595a56bcf472b2b
    Signed-off-by: Stephen Finucane <sfinucan@redhat.com>
* Add autopep8 to tox and pre-commit (Sean Mooney, 2021-11-08; 1 file, -0/+1)
    autopep8 is a code formatting tool that makes python code pep8
    compliant without changing everything. Unlike black it will not
    radically change all code, and the primary change to the existing
    codebase is adding a new line after class-level docstrings.
    This change adds a new tox autopep8 env to manually run it on your
    code before you submit a patch; it also adds autopep8 to pre-commit,
    so if you use pre-commit it will do it for you automatically.
    This change runs autopep8 in diff mode with --exit-code in the pep8
    tox env, so it will fail if autopep8 would modify your code if run
    in in-place mode. This allows us to gate on autopep8 not modifying
    patches that are submitted, and ensures authorship of patches is
    maintained.
    The intent of this change is to save the large amount of time we
    spend on ensuring style guidelines are followed, automatically,
    making it simpler for both new and old contributors to work on nova
    and saving time and effort for all involved.
    Change-Id: Idd618d634cc70ae8d58fab32f322e75bfabefb9d
* Merge "Support move ops with extended resource request" (Zuul, 2021-08-31; 3 files, -24/+26)
* Support move ops with extended resource request (Balazs Gibizer, 2021-08-27; 3 files, -24/+26)
    Nova re-generates the resource request of an instance for each
    server move operation (migrate, resize, evacuate, live-migrate,
    unshelve) to find (or validate) a target host for the instance move.
    This patch extends this logic to support the extended resource
    request from neutron. As the changed neutron interface code is
    called from the nova-compute service during port binding, the
    compute service version is bumped. A check is also added to the
    compute API to reject move operations with ports having an extended
    resource request if there are old computes in the cluster.
    blueprint: qos-minimum-guaranteed-packet-rate
    Change-Id: Ibcf703e254e720b9a6de17527325758676628d48
* Add force kwarg to delete_allocation_for_instance (Matt Riedemann, 2021-08-30; 1 file, -1/+1)
    This adds a force kwarg to delete_allocation_for_instance which
    defaults to True, because that was found to be the most common use
    case by a significant margin during implementation of this patch. In
    most cases, this method is called when we want to delete the
    allocations because they should be gone, e.g. server delete, failed
    build, or shelve offload. The alternative in these cases is that the
    caller could trap the conflict error and retry, but we might as well
    just force the delete in that case (it's cleaner).
    When force=True, it will DELETE the consumer allocations rather than
    GET and PUT with an empty allocations dict and the consumer
    generation, which can result in a 409 conflict from Placement. For
    example, bug 1836754 shows that in one tempest test that creates a
    server and then immediately deletes it, we can hit a very tight
    window where the method GETs the allocations and, before it PUTs the
    empty allocations to remove them, something changes, which results
    in a conflict and the server delete fails with a 409 error.
    It's worth noting that delete_allocation_for_instance used to just
    DELETE the allocations before Stein [1], when we started taking
    consumer generations into account. There was also a related mailing
    list thread [2].
    Closes-Bug: #1836754
    [1] I77f34788dd7ab8fdf60d668a4f76452e03cf9888
    [2] http://lists.openstack.org/pipermail/openstack-dev/2018-August/133374.html
    Change-Id: Ife3c7a5a95c5d707983ab33fd2fbfc1cfb72f676
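The DELETE-vs-GET/PUT distinction described in this commit can be sketched with a stub client; all names here are illustrative stand-ins, not placement's real client API:

```python
class StubPlacement:
    """Stand-in for a placement REST client, recording calls made."""
    def __init__(self, generation=5):
        self.calls = []
        self.generation = generation

    def get(self, url):
        self.calls.append(('GET', url))
        return {'allocations': {'rp': {'resources': {'VCPU': 1}}},
                'consumer_generation': self.generation}

    def put(self, url, body):
        self.calls.append(('PUT', url))

    def delete(self, url):
        self.calls.append(('DELETE', url))

def delete_allocation_for_instance(client, consumer_uuid, force=True):
    url = '/allocations/%s' % consumer_uuid
    if force:
        # Unconditional DELETE: no read-modify-write window, no 409.
        client.delete(url)
        return
    # Non-forced path: GET, then PUT empty allocations with the consumer
    # generation. A concurrent change between the two calls is what
    # produces the 409 conflict described above.
    current = client.get(url)
    client.put(url, {'allocations': {},
                     'consumer_generation': current['consumer_generation']})

forced = StubPlacement()
delete_allocation_for_instance(forced, 'uuid-1', force=True)

unforced = StubPlacement()
delete_allocation_for_instance(unforced, 'uuid-1', force=False)
```

The forced path issues a single DELETE; the non-forced path does the two-step GET/PUT whose race window the commit closes by defaulting to force=True.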
* Merge "scheduler: Merge 'FilterScheduler' into base class" (Zuul, 2021-08-20; 1 file, -4/+6)
* scheduler: Merge 'FilterScheduler' into base class (Stephen Finucane, 2021-06-29; 1 file, -4/+6)
    There are no longer any custom filters. We don't need the abstract
    base class. Merge the code in and give it a more useful
    'SchedulerDriver' name.
    Change-Id: Id08dafa72d617ca85e66d50b3c91045e0e8723d0
    Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
* smartnic support - cleanup arqs (Yongli He, 2021-08-05; 1 file, -1/+7)
    Delete ARQs when:
    - a port is unbound
    - the create operation failed and the ARQs were not bound to the
      instance
    - an ARQ is bound to the instance but not bound to a port
    Implements: blueprint sriov-smartnic-support
    Change-Id: Idab0ee38750d018de409699a0dbdff106d9e11fb
* smartnic support - create arqs (Yongli He, 2021-08-05; 1 file, -21/+88)
    Create ARQs for a port with a device profile:
    - At the API stage, the device profile is used to get scheduling
      information.
    - After scheduling places the instance on a host, the conductor
      creates the ARQs and updates the ARQ binding info in Cyborg.
    Implements: blueprint sriov-smartnic-support
    Depends-On: https://review.opendev.org/c/openstack/neutron-lib/+/768324
    Depends-On: https://review.opendev.org/q/topic:%22bug%252F1906602%22+
    Depends-On: https://review.opendev.org/c/openstack/cyborg/+/758942
    Change-Id: Idaf92c54df0f39d177d7acaabbfcf254ff5a4d0f
    Co-Authored-By: Shaohe Feng <shaohe.feng@intel.com>
    Co-Authored-By: Xinran Wang <xin-ran.wang@intel.com>
* db: Remove 'nova.db.base' module (Stephen Finucane, 2021-06-16; 1 file, -3/+1)
    This made sense back in the day when the ORM was configurable and we
    were making lots of direct calls to the database. Now, in a world
    where most things happen via o.vo, it's just noise. Remove it.
    Change-Id: I216cabcde5311abd46fdad9c95bb72c31b414010
    Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
* Remove (almost) all references to 'instance_type' (Stephen Finucane, 2021-03-29; 1 file, -5/+5)
    This continues on from I81fec10535034f3a81d46713a6eda813f90561cf and
    removes all other references to 'instance_type' where it's possible
    to do so. The only things left are DB columns, o.vo fields, some
    unversioned objects, and RPC API methods. If we want to remove
    these, we can, but it's a lot more work.
    Change-Id: I264d6df1809d7283415e69a66a9153829b8df537
    Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
* rpc: Rework 'get_notifier', 'wrap_exception' (Stephen Finucane, 2021-03-01; 1 file, -1/+1)
    The 'nova.exception_wrapper.wrap_exception' decorator accepted
    either a pre-configured notifier or a 'get_notifier' function, but
    the former was never provided and the latter was consistently a
    notifier created via a call to 'nova.rpc.get_notifier'. Simplify
    things by passing the arguments relied on by 'get_notifier' into
    'wrap_exception', allowing the latter to create the former for us.
    While doing this rework, it became obvious that 'get_notifier'
    accepted a 'publisher_id' that is never provided nowadays, so that
    is dropped. In addition, a number of calls to 'get_notifier' were
    passing in 'host=CONF.host', which duplicated the default value for
    this parameter and is therefore unnecessary. Finally, the unit tests
    are split up by file, as they should be.
    Change-Id: I89e1c13e8a0df18594593b1e80c60d177e0d9c4c
    Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
* Rename ensure_network_metadata to amend requested_networks (Sylvain Bauza, 2021-02-03; 3 files, -4/+4)
    As we don't persist (fortunately) the requested networks when
    booting an instance, we need a way to amend the value of the
    RequestSpec field during any create or move operation, so that a
    later change can know which port or network was asked for.
    Partially-Implements: blueprint routed-networks-scheduling
    Change-Id: I0c7e32f6088a8fc1625a0655af824dee2df4a12c
* Refactor update_pci_request_spec_with_allocated_interface_name (Balazs Gibizer, 2021-01-18; 1 file, -2/+2)
    Make update_pci_request_spec_with_allocated_interface_name depend
    only on a list of InstancePCIRequest o.vos instead of a whole
    Instance object. This will come in handy for the qos interface
    attach case, where we only need to make the changes on the Instance
    o.vo after we are sure that both the resource allocation and the PCI
    claim have succeeded for the request.
    Change-Id: I5a6c6d3eed61895b00f9e9c3fb3b5d09d6786e9c
    blueprint: support-interface-attach-with-qos-ports
* Cyborg shelve/unshelve support (zhangbailin, 2021-01-15; 1 file, -32/+42)
    This change extends the conductor manager to append the cyborg
    resource request to the request spec when performing an unshelve. On
    shelve offload of an instance, the instance's ARQ binding info is
    deleted to free up the bound ARQs in the Cyborg service. This change
    also passes the ARQs to spawn when unshelving an instance.
    This change extends the ``shelve_instance``,
    ``shelve_offload_instance`` and ``unshelve_instance`` rpcapi
    functions to carry the arq_uuids.
    Co-Authored-By: Wenping Song <songwenping@inspur.com>
    Implements: blueprint cyborg-shelve-and-unshelve
    Change-Id: I258df4d77f6d86df1d867a8fe27360731c21d237
* Remove six.text_type (1/2) (Takashi Natsume, 2020-12-13; 2 files, -7/+4)
    Replace six.text_type with str. A subsequent patch will replace
    other six.text_type usages.
    Change-Id: I23bb9e539d08f5c6202909054c2dd49b6c7a7a0e
    Implements: blueprint six-removal
    Signed-off-by: Takashi Natsume <takanattie@gmail.com>
* Merge "Remove compute service level check for qos ops" (Zuul, 2020-11-15; 1 file, -113/+2)
* Remove compute service level check for qos ops (Balazs Gibizer, 2020-11-09; 1 file, -113/+2)
    To support move operations with qos ports, both the source and the
    destination compute hosts need to be on Ussuri level. We have
    service level checks implemented in Ussuri. In Victoria we could
    have removed those checks, as nova only supports compatibility
    between N and N-1 computes, but we kept them there just for extra
    safety. In the meanwhile we codified [1] the rule that nova does not
    support N-2 computes any more. So in Wallaby we can assume that the
    oldest compute is already on Victoria (Ussuri would be enough too).
    This patch therefore removes the unnecessary service level checks
    and related test cases.
    [1] Ie15ec8299ae52ae8f5334d591ed3944e9585cf71
    Change-Id: I14177e35b9d6d27d49e092604bf0f288cd05f57e
* Remove six.moves (Takashi Natsume, 2020-11-07; 1 file, -5/+4)
    Replace the following items with Python 3 style code.
    - six.moves.configparser
    - six.moves.StringIO
    - six.moves.cStringIO
    - six.moves.urllib
    - six.moves.builtins
    - six.moves.range
    - six.moves.xmlrpc_client
    - six.moves.http_client
    - six.moves.http_cookies
    - six.moves.queue
    - six.moves.zip
    - six.moves.reload_module
    - six.StringIO
    - six.BytesIO
    Subsequent patches will replace other six usages.
    Change-Id: Ib2c406327fef2fb4868d8050fc476a7d17706e23
    Implements: blueprint six-removal
    Signed-off-by: Takashi Natsume <takanattie@gmail.com>
* conductor: Don't use setattr (Stephen Finucane, 2020-09-14; 1 file, -11/+7)
    setattr kills discoverability, making it hard to figure out who's
    setting various fields. Don't do it. While we're here, we drop
    legacy compat handlers for pre-Train compute nodes.
    Change-Id: Ie694a80e89f99c8d3e326eebb4590d93c0ebf671
    Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
* Merge "Move confirm resize under semaphore" (Zuul, 2020-09-10; 1 file, -1/+1)
* Move confirm resize under semaphore (Stephen Finucane, 2020-09-03; 1 file, -1/+1)
    The 'ResourceTracker.update_available_resource' periodic task builds
    usage information for the current host by inspecting instances and
    in-progress migrations, combining the two. Specifically, it finds
    all instances that are not in the 'DELETED' or 'SHELVED_OFFLOADED'
    state, calculates the usage from these, then finds all in-progress
    migrations for the host that don't have an associated instance (to
    prevent double accounting) and includes the usage for these.

    In addition to the periodic task, the 'ResourceTracker' class has a
    number of helper functions to make or drop claims for the inventory
    generated by the 'update_available_resource' periodic task as part
    of the various instance operations. These helpers naturally assume
    that when making a claim for a particular instance or migration,
    there shouldn't already be resources allocated for same. Conversely,
    when dropping claims, the resources should currently be allocated.
    However, the check for *active* instances and *in-progress*
    migrations in the periodic task means we have to be careful in how
    we make changes to a given instance or migration record. Running the
    periodic task between such an operation and an attempt to make or
    drop a claim can result in TOCTOU-like races.

    This generally isn't an issue: we use the
    'COMPUTE_RESOURCE_SEMAPHORE' semaphore to prevent the periodic task
    running while we're claiming resources in helpers like
    'ResourceTracker.instance_claim', and we make our changes to the
    instances and migrations within this context. There is one exception
    though: the 'drop_move_claim' helper. This function is used when
    dropping a claim for either a cold migration, a resize or a live
    migration, and will drop usage from either the source host (based on
    the "old" flavor) for a resize confirm, or the destination host
    (based on the "new" flavor) for a resize revert or live migration
    rollback. Unfortunately, while the function itself is wrapped in the
    semaphore, no changes to the state of the instance or migration in
    question are protected by it.

    Consider the confirm resize case, which we're addressing here. If we
    mark the migration as 'confirmed' before running 'drop_move_claim',
    then the periodic task running between these steps will not account
    for the usage on the source, since the migration is allegedly
    'confirmed'. The call to 'drop_move_claim' will then result in the
    tracker dropping usage that we're no longer accounting for. This
    "set migration status before dropping usage" is the current
    behaviour for both same-cell and cross-cell resize, via the
    'ComputeManager.confirm_resize' and
    'ComputeManager.confirm_snapshot_based_resize_at_source' functions,
    respectively. We could reverse those calls and run 'drop_move_claim'
    before marking the migration as 'confirmed', but while our usage
    would be momentarily correct, the periodic task running between
    these steps would re-add the usage we just dropped, since the
    migration isn't yet 'confirmed'. The correct solution is to close
    this gap between setting the migration status and dropping the move
    claim to zero. We do this by putting both operations behind the
    'COMPUTE_RESOURCE_SEMAPHORE', just like the claim operations.

    Change-Id: I26b050c402f5721fc490126e9becb643af9279b4
    Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
    Partial-Bug: #1879878
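The race and its fix can be reduced to a toy model: a periodic task that rebuilds usage from migration status, and a confirm step that flips the status and drops the claim under the same lock. This is a deliberately simplified sketch (real nova uses an oslo.concurrency synchronized decorator over far richer state, not a bare Lock and dicts):

```python
import threading

COMPUTE_RESOURCE_SEMAPHORE = threading.Lock()

# Toy state: one in-progress resize accounting 1 unit on the source host.
usage = {'source': 1}
migration = {'status': 'finished'}

def update_available_resource():
    """Periodic task: rebuild usage, counting only non-confirmed
    migrations (a 'confirmed' migration contributes nothing)."""
    with COMPUTE_RESOURCE_SEMAPHORE:
        usage['source'] = 0 if migration['status'] == 'confirmed' else 1

def confirm_resize():
    """Flip the status and drop the claim under one semaphore.

    Because both steps happen atomically with respect to the periodic
    task, it can never observe 'confirmed' while the old usage is still
    recorded, nor re-add usage for a just-confirmed migration.
    """
    with COMPUTE_RESOURCE_SEMAPHORE:
        migration['status'] = 'confirmed'
        usage['source'] -= 1  # the drop_move_claim equivalent

confirm_resize()
update_available_resource()
```

If the two statements in confirm_resize ran outside the lock, a periodic run interleaved between them would either double-drop or re-add the source usage, which is exactly the TOCTOU window the commit closes.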
* Set 'old_flavor', 'new_flavor' on source before resize (Stephen Finucane, 2020-09-08; 1 file, -7/+10)
    Cross-cell resize is confusing. We need to set this information
    ahead of time.
    Change-Id: I5a403c072b9f03074882b552e1925f22cb5b15b6
    Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
    Partial-Bug: #1879878
* Cyborg evacuate support (Sean Mooney, 2020-09-01; 1 file, -19/+54)
    This change extends the conductor manager to append the cyborg
    resource request to the request spec when performing an evacuate,
    and passes the ARQs to spawn during rebuild and evacuate. On
    evacuate the existing ARQs will be deleted and new ARQs will be
    created and bound; during rebuild the existing ARQs are reused.
    This change extends the rebuild_instance compute rpcapi function to
    carry the arq_uuids. This eliminates the need to look up the uuids
    associated with the arqs assigned to the instance by querying
    cyborg.
    Co-Authored-By: Wenping Song <songwenping@inspur.com>
    Co-Authored-By: Brin Zhang <zhangbailin@inspur.com>
    Implements: blueprint cyborg-rebuild-and-evacuate
    Change-Id: I147bf4d95e6d86ff1f967a8ce37260730f21d236
* Remove six.add_metaclass (Takashi Natsume, 2020-08-15; 1 file, -3/+1)
    Replace six.add_metaclass with Python 3 style code.
    Change-Id: Ifc3f2bcb8fcdd2b555864bd4e22a973a7858c272
    Implements: blueprint six-removal
    Signed-off-by: Takashi Natsume <takanattie@gmail.com>
* Delete ARQs by UUID if Cyborg ARQ bind fails. (Sundar Nadathur, 2020-07-23; 1 file, -16/+23)
    During the review of the cyborg series it was noted that in some
    cases ARQs could be leaked during binding. See
    https://review.opendev.org/#/c/673735/46/nova/conductor/manager.py@1632
    This change adds a delete_arqs_by_uuid function that can delete
    unbound ARQs by instance uuid. It also modifies build_instances and
    schedule_and_build_instances to handle the
    AcceleratorRequestBindingFailed exception raised when binding fails
    and to clean up the instance's ARQs.
    Co-Authored-By: Wenping Song <songwenping@inspur.com>
    Closes-Bug: #1872730
    Change-Id: I86c2f00e2368fe02211175e7328b2cd9c0ebf41b
    Blueprint: nova-cyborg-interaction
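The cleanup-on-bind-failure flow can be sketched with a stub Cyborg client. The exception name matches the commit; the client methods and ARQ shapes are illustrative stand-ins, not Cyborg's real API:

```python
class AcceleratorRequestBindingFailed(Exception):
    """Raised when Cyborg fails to bind one or more ARQs."""
    def __init__(self, arq_uuids):
        super().__init__('ARQ binding failed')
        self.arq_uuids = arq_uuids

class StubCyborg:
    """Illustrative stand-in for a Cyborg client."""
    def __init__(self, fail=False):
        self.deleted = []
        self.fail = fail

    def bind_arqs(self, arqs):
        if self.fail:
            raise AcceleratorRequestBindingFailed(
                [a['uuid'] for a in arqs])

    def delete_arqs_by_uuid(self, uuids):
        self.deleted.extend(uuids)

def bind_or_cleanup(cyborg, arqs):
    """Bind ARQs; on failure, delete them so they are not leaked."""
    try:
        cyborg.bind_arqs(arqs)
    except AcceleratorRequestBindingFailed as exc:
        cyborg.delete_arqs_by_uuid(exc.arq_uuids)
        raise  # the build still fails; we just don't leak ARQs

cyborg = StubCyborg(fail=True)
try:
    bind_or_cleanup(cyborg, [{'uuid': 'arq-1'}, {'uuid': 'arq-2'}])
except AcceleratorRequestBindingFailed:
    pass  # build_instances would reschedule or abort here
```

The point of the pattern is that the exception still propagates so the build is failed normally, but the otherwise-orphaned ARQs are deleted first.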
* objects: Add MigrationTypeField (Stephen Finucane, 2020-05-08; 1 file, -1/+1)
    We use these things in many places in the code and it would be good
    to have constants to reference. Do just that. Note that this results
    in a change in the object hash; however, there are no actual changes
    in the output object, so that's okay.
    Change-Id: If02567ce0a3431dda5b2bf6d398bbf7cc954eed0
    Signed-off-by: Stephen Finucane <sfinucan@redhat.com>
* Support live migration with vpmem (LuyaoZhong, 2020-04-07; 1 file, -3/+19)
    1. Check if the cluster supports live migration with vpmem.
    2. On the source host we generate a new dest XML with the vpmem info
       stored in migration_context.new_resources.
    3. If there are vpmems, clean them up on the source host/destination
       when live migration succeeds/fails.
    Change-Id: I5c346e690148678a2f0dc63f4f516a944c3db8cd
    Implements: blueprint support-live-migration-with-virtual-persistent-memory
* partial support for live migration with specific resources (LuyaoZhong, 2020-04-07; 1 file, -16/+17)
    1. Claim allocations from placement first, then claim specific
       resources in the Resource Tracker on the destination to populate
       migration_context.new_resources.
    2. Clean up specific resources when live migration succeeds/fails.
    Because we store specific resources in the migration_context during
    live migration, to ensure correct cleanup we can't drop the
    migration_context before cleanup is complete:
    a) in post live migration, we move source host cleanup before
       destination cleanup (post_live_migration_at_destination will
       apply the migration_context and drop it)
    b) when rolling back live migration, we drop the migration_context
       after the rollback operations are complete
    For different specific resources we might need driver-specific
    support, such as for vpmem. This change just ensures that newly
    claimed specific resources are populated to the migration_context
    and that the migration_context is not dropped before cleanup is
    complete.
    Change-Id: I44ad826f0edb39d770bb3201c675dff78154cbf3
    Implements: blueprint support-live-migration-with-virtual-persistent-memory
* Bump compute rpcapi version and reduce Cyborg calls. (Sundar Nadathur, 2020-03-31; 1 file, -5/+9)
    The _get_bound_arq_resources() in the compute manager [1] calls
    Cyborg up to 3 times: once to get the accelerator request (ARQ)
    UUIDs for the instance, and then once or twice to get all ARQs with
    completed bindings. The first call can be eliminated by passing the
    ARQs from the conductor to the compute manager as an additional
    parameter in build_and_run_instance(). This requires a bump in the
    compute rpcapi version.
    [1] https://review.opendev.org/#/c/631244/54/nova/compute/manager.py@2652
    Blueprint: nova-cyborg-interaction
    Change-Id: I26395d57bd4ba55276b7514baa808f9888639e11
* Delete ARQs for an instance when the instance is deleted. (Sundar Nadathur, 2020-03-24; 1 file, -0/+1)
    This patch series now works for many VM operations with libvirt:
    * Creation, deletion of VM instances.
    * Pause/unpause.
    The following works but is a no-op:
    * Lock/unlock.
    Hard reboots are taken up in a later patch in this series. Soft
    reboots work for accelerators unless some unrelated failure forces a
    hard reboot in the libvirt driver. Suspend is not supported yet; it
    would fail with this error:
        libvirtError: Requested operation is not valid: domain has
        assigned non-USB host devices
    Shelve is not supported yet. Live migration is not intended to be
    supported with accelerators now.
    Change-Id: Icb95890d8f16cad1f7dc18487a48def2f7c9aec2
    Blueprint: nova-cyborg-interaction
* Create and bind Cyborg ARQs. (Sundar Nadathur, 2020-03-21; 1 file, -0/+55)
    * Call Cyborg with the device profile name to get ARQs (Accelerator
      Requests). Each ARQ corresponds to a single device profile group,
      which corresponds to a single request group in the request spec.
    * Match each ARQ to its associated request group, and thereby obtain
      the corresponding RP for that ARQ.
    * Call Cyborg to bind the ARQ to that host/device RP.
    * When Cyborg sends the ARQ bind notification events, wait for those
      events with a timeout.
    Change-Id: I0f8b6bf2b4f4510da6c84fede532533602b6af7f
    Blueprint: nova-cyborg-interaction