Diffstat (limited to 'doc/source/admin')
-rw-r--r-- doc/source/admin/anaconda-deploy-interface.rst | 37
-rw-r--r-- doc/source/admin/drivers.rst                   |  1
-rw-r--r-- doc/source/admin/drivers/fake.rst              | 36
-rw-r--r-- doc/source/admin/drivers/ibmc.rst              |  2
-rw-r--r-- doc/source/admin/drivers/ilo.rst               | 83
-rw-r--r-- doc/source/admin/drivers/irmc.rst              | 70
-rw-r--r-- doc/source/admin/drivers/redfish.rst           | 83
-rw-r--r-- doc/source/admin/drivers/snmp.rst              | 74
-rw-r--r-- doc/source/admin/hardware-burn-in.rst          |  7
-rw-r--r-- doc/source/admin/metrics.rst                   | 34
-rw-r--r-- doc/source/admin/retirement.rst                | 21
-rw-r--r-- doc/source/admin/secure-rbac.rst               | 40
-rw-r--r-- doc/source/admin/troubleshooting.rst           | 171
13 files changed, 612 insertions, 47 deletions
diff --git a/doc/source/admin/anaconda-deploy-interface.rst b/doc/source/admin/anaconda-deploy-interface.rst
index 2c686506a..2b7195525 100644
--- a/doc/source/admin/anaconda-deploy-interface.rst
+++ b/doc/source/admin/anaconda-deploy-interface.rst
@@ -271,11 +271,44 @@ purposes.
 ``liveimg`` which is used as the base operating system image to start with.
 
+Configuration Considerations
+----------------------------
+
+When using the ``anaconda`` deployment interface, some configuration
+parameters may need to be adjusted in your environment. This is in large
+part because the general defaults are set to much lower values suited to
+image based deployments; given the way the anaconda deployment interface
+works, you may need to make some adjustments.
+
+* ``[conductor]deploy_callback_timeout`` likely needs to be adjusted
+  for most ``anaconda`` deployment interface users. By default this
+  is a timer which looks for "agents" which have not checked in with
+  Ironic, or agents which may have crashed or failed after they
+  started. If the value is reached, the current operation is failed.
+  This value should be set to a number of seconds which exceeds your
+  average anaconda deployment time.
+* ``[pxe]boot_retry_timeout`` can also be triggered and result in an
+  anaconda deployment in progress getting reset, as it is intended
+  to reboot nodes which might have failed their initial PXE operation.
+  Depending on the sizes of images, and the exact nature of what was
+  deployed, it may be necessary to set this to a much higher value.
+
 Limitations
 -----------
 
-This deploy interface has only been tested with Red Hat based operating systems
-that use anaconda. Other systems are not supported.
+* This deploy interface has only been tested with Red Hat based operating
+  systems that use anaconda. Other systems are not supported.
+
+* Runtime TLS certificate injection into ramdisks is not supported.
+  Assets such as the ``ramdisk`` or a ``stage2`` ramdisk image need to have
+  trusted Certificate Authority certificates present within the images *or*
+  the Ironic API endpoint utilized should use a known trusted Certificate
+  Authority.
+
+* The ``anaconda`` tooling deploying the instance/workload does not
+  heartbeat to Ironic like the ``ironic-python-agent`` driven ramdisks.
+  As such, you may need to adjust some timers. See
+  `Configuration Considerations`_ for some details on this.
 
 .. _`anaconda`: https://fedoraproject.org/wiki/Anaconda
 .. _`ks.cfg.template`: https://opendev.org/openstack/ironic/src/branch/master/ironic/drivers/modules/ks.cfg.template
diff --git a/doc/source/admin/drivers.rst b/doc/source/admin/drivers.rst
index c3d8eb377..f35cb2dfa 100644
--- a/doc/source/admin/drivers.rst
+++ b/doc/source/admin/drivers.rst
@@ -26,6 +26,7 @@ Hardware Types
    drivers/redfish
    drivers/snmp
    drivers/xclarity
+   drivers/fake
 
 Changing Hardware Types and Interfaces
 --------------------------------------
diff --git a/doc/source/admin/drivers/fake.rst b/doc/source/admin/drivers/fake.rst
new file mode 100644
index 000000000..ea7d7ef4c
--- /dev/null
+++ b/doc/source/admin/drivers/fake.rst
@@ -0,0 +1,36 @@
+===========
+Fake driver
+===========
+
+Overview
+========
+
+The ``fake-hardware`` hardware type is what it claims to be: fake. Use of this
+type or the ``fake`` interfaces should be temporary or limited to
+non-production environments, as the ``fake`` interfaces do not perform any of
+the actions typically expected.
+
+The ``fake`` interfaces can be combined with any of the "real" hardware
+interfaces, allowing you to effectively disable one or more hardware
+interfaces for testing by simply setting that interface to ``fake``.
+
+Use cases
+=========
+
+Development
+-----------
+Developers can use the ``fake-hardware`` hardware type to mock out nodes for
+testing without those nodes needing to exist with physical or virtual
+hardware.
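The interface combinations described in the overview can be sketched in ``ironic.conf``. This is an illustrative fragment only: the exact set of ``enabled_*_interfaces`` options to list depends on which interfaces your deployment enables, and the values shown here are assumptions for a test-only setup.

.. code-block:: ini

   [DEFAULT]
   # Enable the fake hardware type alongside any real types in use.
   enabled_hardware_types = fake-hardware
   # Setting an interface to "fake" effectively disables it for testing.
   enabled_power_interfaces = fake
   enabled_management_interfaces = fake
   enabled_deploy_interfaces = fake
   enabled_boot_interfaces = fake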
+ +Adoption +-------- +Some OpenStack deployers have used ``fake`` interfaces in Ironic to allow an +adoption-style workflow with Nova. By setting a node's hardware interfaces to +``fake``, it's possible to deploy to that node with Nova without causing any +actual changes to the hardware or an OS already deployed on it. + +This is generally an unsupported use case, but it is possible. For more +information, see the relevant `post from CERN TechBlog`_. + +.. _`post from CERN TechBlog`: https://techblog.web.cern.ch/techblog/post/ironic-nova-adoption/ diff --git a/doc/source/admin/drivers/ibmc.rst b/doc/source/admin/drivers/ibmc.rst index 1bf9a3ba2..0f7fe1d90 100644 --- a/doc/source/admin/drivers/ibmc.rst +++ b/doc/source/admin/drivers/ibmc.rst @@ -312,6 +312,6 @@ boot_up_seq GET Query boot up sequence get_raid_controller_list GET Query RAID controller summary info ======================== ============ ====================================== -.. _Huawei iBMC: https://e.huawei.com/en/products/cloud-computing-dc/servers/accessories/ibmc +.. _Huawei iBMC: https://e.huawei.com/en/products/computing/kunpeng/accessories/ibmc .. _TLS: https://en.wikipedia.org/wiki/Transport_Layer_Security .. 
_HUAWEI iBMC Client library: https://pypi.org/project/python-ibmcclient/ diff --git a/doc/source/admin/drivers/ilo.rst b/doc/source/admin/drivers/ilo.rst index f764a6d89..b6825fc40 100644 --- a/doc/source/admin/drivers/ilo.rst +++ b/doc/source/admin/drivers/ilo.rst @@ -55,6 +55,8 @@ The hardware type ``ilo`` supports following HPE server features: * `Updating security parameters as manual clean step`_ * `Update Minimum Password Length security parameter as manual clean step`_ * `Update Authentication Failure Logging security parameter as manual clean step`_ +* `Create Certificate Signing Request(CSR) as manual clean step`_ +* `Add HTTPS Certificate as manual clean step`_ * `Activating iLO Advanced license as manual clean step`_ * `Removing CA certificates from iLO as manual clean step`_ * `Firmware based UEFI iSCSI boot from volume support`_ @@ -65,6 +67,7 @@ The hardware type ``ilo`` supports following HPE server features: * `BIOS configuration support`_ * `IPv6 support`_ * `Layer 3 or DHCP-less ramdisk booting`_ +* `Events subscription`_ Apart from above features hardware type ``ilo5`` also supports following features: @@ -200,6 +203,18 @@ The ``ilo`` hardware type supports following hardware interfaces: enabled_hardware_types = ilo enabled_rescue_interfaces = agent,no-rescue +* vendor + Supports ``ilo``, ``ilo-redfish`` and ``no-vendor``. The default is + ``ilo``. They can be enabled by using the + ``[DEFAULT]enabled_vendor_interfaces`` option in ``ironic.conf`` as given + below: + + .. code-block:: ini + + [DEFAULT] + enabled_hardware_types = ilo + enabled_vendor_interfaces = ilo,ilo-redfish,no-vendor + The ``ilo5`` hardware type supports all the ``ilo`` interfaces described above, except for ``boot`` and ``raid`` interfaces. The details of ``boot`` and @@ -751,6 +766,12 @@ Supported **Manual** Cleaning Operations ``update_auth_failure_logging_threshold``: Updates the Authentication Failure Logging security parameter. 
See `Update Authentication Failure Logging security parameter as manual clean step`_ for user guidance on usage. + ``create_csr``: + Creates the certificate signing request. See `Create Certificate Signing Request(CSR) as manual clean step`_ + for user guidance on usage. + ``add_https_certificate``: + Adds the signed HTTPS certificate to the iLO. See `Add HTTPS Certificate as manual clean step`_ for user + guidance on usage. * iLO with firmware version 1.5 is minimally required to support all the operations. @@ -1648,6 +1669,54 @@ Both the arguments ``logging_threshold`` and ``ignore`` are optional. The accept value be False. If user passes the value of logging_threshold as 0, the Authentication Failure Logging security parameter will be disabled. +Create Certificate Signing Request(CSR) as manual clean step +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +iLO driver can invoke ``create_csr`` request as a manual clean step. This step is only supported for iLO5 based hardware. + +An example of a manual clean step with ``create_csr`` as the only clean step could be:: + + "clean_steps": [{ + "interface": "management", + "step": "create_csr", + "args": { + "csr_params": { + "City": "Bengaluru", + "CommonName": "1.1.1.1", + "Country": "India", + "OrgName": "HPE", + "State": "Karnataka" + } + } + }] + +The ``[ilo]cert_path`` option in ``ironic.conf`` is used as the directory path for +creating the CSR, which defaults to ``/var/lib/ironic/ilo``. The CSR is created in the directory location +given in ``[ilo]cert_path`` in ``node_uuid`` directory as <node_uuid>.csr. + + +Add HTTPS Certificate as manual clean step +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +iLO driver can invoke ``add_https_certificate`` request as a manual clean step. This step is only supported for +iLO5 based hardware. 
+ +An example of a manual clean step with ``add_https_certificate`` as the only clean step could be:: + + "clean_steps": [{ + "interface": "management", + "step": "add_https_certificate", + "args": { + "cert_file": "/test1/iLO.crt" + } + }] + +Argument ``cert_file`` is mandatory. The ``cert_file`` takes the path or url of the certificate file. +The url schemes supported are: ``file``, ``http`` and ``https``. +The CSR generated in step ``create_csr`` needs to be signed by a valid CA and the resultant HTTPS certificate should +be provided in ``cert_file``. It copies the ``cert_file`` to ``[ilo]cert_path`` under ``node.uuid`` as <node_uuid>.crt +before adding it to iLO. + RAID Support ^^^^^^^^^^^^ @@ -2136,6 +2205,20 @@ DHCP-less deploy is supported by ``ilo`` and ``ilo5`` hardware types. However it would work only with ilo-virtual-media boot interface. See :doc:`/admin/dhcp-less` for more information. +Events subscription +^^^^^^^^^^^^^^^^^^^ +Events subscription is supported by ``ilo`` and ``ilo5`` hardware types with +``ilo`` vendor interface for Gen10 and Gen10 Plus servers. See +:ref:`node-vendor-passthru-methods` for more information. + +Anaconda based deployment +^^^^^^^^^^^^^^^^^^^^^^^^^ +Deployment with ``anaconda`` deploy interface is supported by ``ilo`` and +``ilo5`` hardware type and works with ``ilo-pxe`` and ``ilo-ipxe`` +boot interfaces. See :doc:`/admin/anaconda-deploy-interface` for +more information. + + .. _`ssacli documentation`: https://support.hpe.com/hpsc/doc/public/display?docId=c03909334 .. _`proliant-tools`: https://docs.openstack.org/diskimage-builder/latest/elements/proliant-tools/README.html .. 
_`HPE iLO4 User Guide`: https://h20566.www2.hpe.com/hpsc/doc/public/display?docId=c03334051
diff --git a/doc/source/admin/drivers/irmc.rst b/doc/source/admin/drivers/irmc.rst
index 17b8d8644..9ddfa3b3d 100644
--- a/doc/source/admin/drivers/irmc.rst
+++ b/doc/source/admin/drivers/irmc.rst
@@ -123,11 +123,29 @@ Configuration via ``driver_info``
     the iRMC with administrator privileges.
   - ``driver_info/irmc_password`` property to be ``password`` for
     irmc_username.
-  - ``properties/capabilities`` property to be ``boot_mode:uefi`` if
-    UEFI boot is required.
-  - ``properties/capabilities`` property to be ``secure_boot:true`` if
-    UEFI Secure Boot is required. Please refer to `UEFI Secure Boot Support`_
-    for more information.
+
+  .. note::
+     Fujitsu servers equipped with iRMC S6 2.00 or a later firmware version
+     disable IPMI over LAN by default, though users may be able to enable
+     IPMI via the BMC settings.
+     To handle this change, the ``irmc`` hardware type first tries IPMI and,
+     if the IPMI operation fails, uses the Redfish API of the Fujitsu server
+     to provide Ironic functionality.
+     So if you deploy a Fujitsu server with iRMC S6 2.00 or later, you need
+     to set the Redfish related parameters in ``driver_info``.
+
+  - ``driver_info/redfish_address`` property to be the ``IP address`` or
+    ``hostname`` of the iRMC. You can prefix it with a protocol (e.g.
+    ``https://``). If you don't provide a protocol, Ironic assumes HTTPS
+    (i.e. adds the ``https://`` prefix).
+    iRMC with S6 2.00 or later only supports HTTPS connections to the
+    Redfish API.
+  - ``driver_info/redfish_username`` to be the user name of the iRMC with
+    administrator privileges
+  - ``driver_info/redfish_password`` to be the password of
+    ``redfish_username``
+  - ``driver_info/redfish_verify_ca`` accepts the same values as
+    ``driver_info/irmc_verify_ca``
+  - ``driver_info/redfish_auth_type`` to be one of ``basic``, ``session`` or
+    ``auto``
 
 * If ``port`` in ``[irmc]`` section of ``/etc/ironic/ironic.conf`` or
   ``driver_info/irmc_port`` is set to 443, ``driver_info/irmc_verify_ca``
@@ -191,6 +209,22 @@ Configuration via ``driver_info``
   - ``driver_info/irmc_snmp_priv_password`` property to be the privacy
     protocol pass phrase. The length of the pass phrase should be at least
     8 characters.
+
+Configuration via ``properties``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+* Each node is configured for the ``irmc`` hardware type by setting the
+  following ironic node object's properties:
+
+  - ``properties/capabilities`` property to be ``boot_mode:uefi`` if
+    UEFI boot is required, or ``boot_mode:bios`` if Legacy BIOS is required.
+    If this is not set, ``default_boot_mode`` at the ``[default]`` section in
+    ``ironic.conf`` will be used.
+  - ``properties/capabilities`` property to be ``secure_boot:true`` if
+    UEFI Secure Boot is required. Please refer to `UEFI Secure Boot Support`_
+    for more information.
+
+
 Configuration via ``ironic.conf``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -199,6 +233,25 @@ Configuration via ``ironic.conf``
   - ``port``: Port to be used for iRMC operations; either 80 or 443. The
     default value is 443. Optional.
+
+    .. note::
+       Since iRMC S6 2.00, the iRMC firmware doesn't support HTTP connections
+       to the REST API. If you deploy a server with iRMC S6 2.00 or later,
+       please set ``port`` to 443.
+
+    The ``irmc`` hardware type provides a ``verify_step`` named
+    ``verify_http_https_connection_and_fw_version`` to check the HTTP(S)
+    connection to the iRMC REST API. If the HTTP(S) connection is
+    successfully established, it then fetches and caches the iRMC firmware
+    version.
+    If the HTTP(S) connection to the iRMC REST API fails, the Ironic node's
+    state moves to ``enroll``, with a suggestion put in the log message.
+    The default priority of this verify step is 10.
+
+    If an operator updates the iRMC firmware version of a node, the operator
+    should run the ``cache_irmc_firmware_version`` node vendor passthru
+    method to update the iRMC firmware version stored in
+    ``driver_internal_info/irmc_fw_version``.
+
   - ``auth_method``: Authentication method for iRMC operations; either
     ``basic`` or ``digest``. The default value is ``basic``. Optional.
   - ``client_timeout``: Timeout (in seconds) for iRMC
@@ -229,9 +282,10 @@ Configuration via ``ironic.conf``
     and ``v2c``. The default value is ``public``. Optional.
   - ``snmp_security``: SNMP security name required for version ``v3``.
     Optional.
-  - ``snmp_auth_proto``: The SNMPv3 auth protocol. The valid value and the
-    default value are both ``sha``. We will add more supported valid values
-    in the future. Optional.
+  - ``snmp_auth_proto``: The SNMPv3 auth protocol. If using iRMC S4 or S5, the
+    valid value of this option is only ``sha``. If using iRMC S6, the valid
+    values are ``sha256``, ``sha384`` and ``sha512``. The default value is
+    ``sha``. Optional.
   - ``snmp_priv_proto``: The SNMPv3 privacy protocol. The valid value and the
     default value are both ``aes``. We will add more supported valid values
     in the future. Optional.
diff --git a/doc/source/admin/drivers/redfish.rst b/doc/source/admin/drivers/redfish.rst
index dd19f8bde..063dd1fe5 100644
--- a/doc/source/admin/drivers/redfish.rst
+++ b/doc/source/admin/drivers/redfish.rst
@@ -87,8 +87,18 @@ field:
    The "auto" mode first tries "session" and falls back
    to "basic" if session authentication is not supported
    by the Redfish BMC. Default is set in ironic config
-   as ``[redfish]auth_type``.
+   as ``[redfish]auth_type``. Most operators should not
+   need to leverage this setting.
+   Session based authentication should generally be used
+   in most cases as it prevents re-authentication every
+   time a background task checks in with the BMC.
 
+.. note::
+   The ``redfish_address``, ``redfish_username``, ``redfish_password``,
+   and ``redfish_verify_ca`` fields, if changed, will trigger a new session
+   to be established and cached with the BMC. The ``redfish_auth_type`` field
+   will only be used for the creation of a new cached session, or should
+   one be rejected by the BMC.
 
 The ``baremetal node create`` command can be used to enroll a node
 with the ``redfish`` driver. For example:
@@ -533,6 +543,8 @@ settings. The following fields will be returned in the BIOS API
    "``unique``", "The setting is specific to this node"
    "``reset_required``", "After changing this setting a node reboot is required"
 
+.. _node-vendor-passthru-methods:
+
 Node Vendor Passthru Methods
 ============================
@@ -620,6 +632,75 @@ Eject Virtual Media
    "boot_device (optional)", "body", "string", "Type of the device to eject (all devices by default)"
 
+Internal Session Cache
+======================
+
+The ``redfish`` hardware type, and derived interfaces, utilizes a built-in
+session cache which prevents Ironic from re-authenticating every time
+Ironic attempts to connect to the BMC for any reason.
+
+This consists of cached connector objects which are used and tracked by
+the unique combination of ``redfish_username``, ``redfish_password``,
+``redfish_verify_ca``, and finally ``redfish_address``. Changing any one
+of those values will trigger a new session to be created.
+The ``redfish_system_id`` value is explicitly not considered, as Redfish
+has a model of use of one BMC to many systems, which is also a model
+Ironic supports.
+
+The session cache default size is ``1000`` sessions per conductor.
+If you are operating a deployment with a larger number of Redfish
+BMCs, it is advised that you tune that number appropriately.
+This can be tuned via the API service configuration file,
+``[redfish]connection_cache_size``.
+
+Session Cache Expiration
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+By default, sessions remain cached for as long as possible in
+memory, as long as they have not experienced an authentication,
+connection, or other unexplained error.
+
+Under normal circumstances, the sessions will only be rolled out
+of the cache, oldest first, when the cache becomes full. There is
+no time based expiration of entries in the session cache.
+
+Of course, the cache is only in memory, and restarting the
+``ironic-conductor`` will also cause the cache to be rebuilt
+from scratch. If this happens due to a persistent connectivity issue,
+it may be a sign of an unexpected condition; please consider
+contacting the Ironic developer community for assistance.
+
+Redfish Interoperability Profile
+================================
+
+The Ironic project provides a Redfish Interoperability Profile, located in
+the ``redfish-interop-profiles`` folder at the root of the source code. The
+Redfish Interoperability Profile is a JSON document written in a particular
+format that serves two purposes.
+
+* It enables the creation of a human-readable document that merges the
+  profile requirements with the Redfish schema into a single document
+  for developers or users.
+* It allows a conformance test utility to test a Redfish Service
+  implementation for conformance with the profile.
+
+The JSON document structure is intended to align easily with JSON payloads
+retrieved from Redfish Service implementations, to allow for easy comparisons
+and conformance testing. Many of the properties defined within this structure
+have assumed default values that correspond with the most common use case, so
+that those properties can be omitted from the document for brevity.
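For orientation, such a profile is shaped roughly like the following. This fragment is hypothetical and heavily abridged: the field names follow the general DMTF interoperability profile format, and it is not one of the profiles shipped by Ironic.

.. code-block:: json

   {
       "SchemaDefinition": "RedfishInteroperabilityProfile.v1_0_0",
       "ProfileName": "ExampleProfile",
       "ProfileVersion": "1.0.0",
       "Purpose": "Illustrative fragment only",
       "Protocol": {
           "MinVersion": "1.0"
       },
       "Resources": {
           "ComputerSystem": {
               "PropertyRequirements": {
                   "SerialNumber": {
                       "ReadRequirement": "Recommended"
                   }
               }
           }
       }
   }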
+ +Validation of Profiles using DMTF tool +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +An open source utility has been created by the Redfish Forum to verify that +a Redfish Service implementation conforms to the requirements included in a +Redfish Interoperability Profile. The Redfish Interop Validator is available +for download from the DMTF's organization on Github at +https://github.com/DMTF/Redfish-Interop-Validator. Refer to instructions in +README on how to configure and run validation. + + .. _Redfish: http://redfish.dmtf.org/ .. _Sushy: https://opendev.org/openstack/sushy .. _TLS: https://en.wikipedia.org/wiki/Transport_Layer_Security diff --git a/doc/source/admin/drivers/snmp.rst b/doc/source/admin/drivers/snmp.rst index 1c402ab9b..eed4ed794 100644 --- a/doc/source/admin/drivers/snmp.rst +++ b/doc/source/admin/drivers/snmp.rst @@ -22,39 +22,47 @@ this table could possibly work using a similar driver. Please report any device status. -============== ========== ========== ===================== -Manufacturer Model Supported? Driver name -============== ========== ========== ===================== -APC AP7920 Yes apc_masterswitch -APC AP9606 Yes apc_masterswitch -APC AP9225 Yes apc_masterswitchplus -APC AP7155 Yes apc_rackpdu -APC AP7900 Yes apc_rackpdu -APC AP7901 Yes apc_rackpdu -APC AP7902 Yes apc_rackpdu -APC AP7911a Yes apc_rackpdu -APC AP7921 Yes apc_rackpdu -APC AP7922 Yes apc_rackpdu -APC AP7930 Yes apc_rackpdu -APC AP7931 Yes apc_rackpdu -APC AP7932 Yes apc_rackpdu -APC AP7940 Yes apc_rackpdu -APC AP7941 Yes apc_rackpdu -APC AP7951 Yes apc_rackpdu -APC AP7960 Yes apc_rackpdu -APC AP7990 Yes apc_rackpdu -APC AP7998 Yes apc_rackpdu -APC AP8941 Yes apc_rackpdu -APC AP8953 Yes apc_rackpdu -APC AP8959 Yes apc_rackpdu -APC AP8961 Yes apc_rackpdu -APC AP8965 Yes apc_rackpdu -Aten all? Yes aten -CyberPower all? Untested cyberpower -EatonPower all? Untested eatonpower -Teltronix all? 
Yes teltronix -BayTech MRP27 Yes baytech_mrp27 -============== ========== ========== ===================== +============== ============== ========== ===================== +Manufacturer Model Supported? Driver name +============== ============== ========== ===================== +APC AP7920 Yes apc_masterswitch +APC AP9606 Yes apc_masterswitch +APC AP9225 Yes apc_masterswitchplus +APC AP7155 Yes apc_rackpdu +APC AP7900 Yes apc_rackpdu +APC AP7901 Yes apc_rackpdu +APC AP7902 Yes apc_rackpdu +APC AP7911a Yes apc_rackpdu +APC AP7921 Yes apc_rackpdu +APC AP7922 Yes apc_rackpdu +APC AP7930 Yes apc_rackpdu +APC AP7931 Yes apc_rackpdu +APC AP7932 Yes apc_rackpdu +APC AP7940 Yes apc_rackpdu +APC AP7941 Yes apc_rackpdu +APC AP7951 Yes apc_rackpdu +APC AP7960 Yes apc_rackpdu +APC AP7990 Yes apc_rackpdu +APC AP7998 Yes apc_rackpdu +APC AP8941 Yes apc_rackpdu +APC AP8953 Yes apc_rackpdu +APC AP8959 Yes apc_rackpdu +APC AP8961 Yes apc_rackpdu +APC AP8965 Yes apc_rackpdu +Aten all? Yes aten +CyberPower all? Untested cyberpower +EatonPower all? Untested eatonpower +Teltronix all? 
Yes        teltronix
+BayTech        MRP27          Yes        baytech_mrp27
+Raritan        PX3-5547V-V2   Yes        raritan_pdu2
+Raritan        PX3-5726V      Yes        raritan_pdu2
+Raritan        PX3-5776U-N2   Yes        raritan_pdu2
+Raritan        PX3-5969U-V2   Yes        raritan_pdu2
+Raritan        PX3-5961I2U-V2 Yes        raritan_pdu2
+Vertiv         NU30212        Yes        vertivgeist_pdu
+ServerTech     CW-16VE-P32M   Yes        servertech_sentry3
+ServerTech     C2WG24SN       Yes        servertech_sentry4
+============== ============== ========== =====================
 
 Software Requirements
diff --git a/doc/source/admin/hardware-burn-in.rst b/doc/source/admin/hardware-burn-in.rst
index 503664182..35f231d11 100644
--- a/doc/source/admin/hardware-burn-in.rst
+++ b/doc/source/admin/hardware-burn-in.rst
@@ -108,6 +108,13 @@ Then launch the test with:
    baremetal node clean --clean-steps '[{"step": "burnin_disk", \
      "interface": "deploy"}]' $NODE_NAME_OR_UUID
 
+In order to launch a parallel SMART self test on all devices after the
+disk burn-in (which will fail the step if any of the tests fail), set:
+
+.. code-block:: console
+
+   baremetal node set --driver-info agent_burnin_fio_disk_smart_test=True \
+     $NODE_NAME_OR_UUID
 
 Network burn-in
 ===============
diff --git a/doc/source/admin/metrics.rst b/doc/source/admin/metrics.rst
index f435a50c5..733c6569b 100644
--- a/doc/source/admin/metrics.rst
+++ b/doc/source/admin/metrics.rst
@@ -17,8 +17,11 @@
 These performance measurements, herein referred to as "metrics", can be
 emitted from the Bare Metal service, including ironic-api, ironic-conductor,
 and ironic-python-agent. By default, none of the services will emit metrics.
 
-Configuring the Bare Metal Service to Enable Metrics
-====================================================
+It is important to stress that statsd is not the only supported mechanism
+for metrics collection and transmission. This is covered later on in this
+document.
+
+Configuring the Bare Metal Service to Enable Metrics with Statsd
+================================================================
 
 Enabling metrics in ironic-api and ironic-conductor
 ---------------------------------------------------
@@ -62,6 +65,30 @@ in the ironic configuration file as well::
 
    agent_statsd_host = 198.51.100.2
    agent_statsd_port = 8125
 
+.. Note::
+   Use of a different metrics backend with the agent is not presently
+   supported.
+
+Transmission to the Message Bus Notifier
+========================================
+
+Regardless of whether you're using Ceilometer,
+`ironic-prometheus-exporter <https://docs.openstack.org/ironic-prometheus-exporter/latest/>`_,
+or some scripting you wrote to consume the message bus notifications,
+metrics data can be sent to the message bus notifier from the timer methods
+*and* additional gauge counters by setting the ``[metrics]backend``
+configuration option to ``collector``. When this is the case,
+information is cached locally and periodically sent along with the general
+sensor data update to the messaging notifier, which can be consumed off of
+the message bus, or via a notifier plugin (such as is done with
+ironic-prometheus-exporter).
+
+.. NOTE::
+   Transmission of timer data only works for the Conductor or
+   ``single-process`` Ironic service model. A separate webserver process
+   presently does not have the capability of triggering the call to retrieve
+   and transmit the data.
+
+.. NOTE::
+   This functionality requires ironic-lib version 5.4.0 to be installed.
 
 Types of Metrics Emitted
 ========================
@@ -79,6 +106,9 @@ additional load before enabling metrics.
 To see which metrics have changed names or have been removed between
 releases, refer to the `ironic release notes
 <https://docs.openstack.org/releasenotes/ironic/>`_.
 
+Additional conductor metrics in the form of counts will also be generated
+in limited locations where pertinent to the activity of the conductor.
+
 ..
note::
    With the default statsd configuration, each timing metric may create
    additional metrics due to how statsd handles timing metrics. For more
diff --git a/doc/source/admin/retirement.rst b/doc/source/admin/retirement.rst
index e4884e0f4..aab307bac 100644
--- a/doc/source/admin/retirement.rst
+++ b/doc/source/admin/retirement.rst
@@ -23,6 +23,27 @@ scheduling of instances, but will still allow for other operations, such as
 cleaning, to happen (this marks an important difference to nodes which have
 the ``maintenance`` flag set).
 
+Requirements
+============
+
+The use of the retirement feature requires that automated cleaning be
+enabled. The default ``[conductor]automated_clean`` setting must not be
+disabled, because the retirement feature is only engaged upon the
+completion of cleaning, which sets forth the expectation of removing
+sensitive data from a node.
+
+If you're uncomfortable with full cleaning, but want to make use of the
+retirement feature, a compromise may be to explore the use of metadata
+erasure; however, this will leave additional data on disk which you may
+wish to erase completely. Please consult the configuration for the
+``[deploy]erase_devices_metadata_priority`` and
+``[deploy]erase_devices_priority`` settings, and do note that
+clean steps can be manually invoked through manual cleaning should you
+wish to trigger the ``erase_devices`` clean step to completely wipe
+all data from storage devices. Alternatively, automated cleaning can
+also be enabled on an individual node level using the
+``baremetal node set --automated-clean <node_id>`` command.
+
 How to use
 ==========
diff --git a/doc/source/admin/secure-rbac.rst b/doc/source/admin/secure-rbac.rst
index 639cfcb23..1f1bb66d1 100644
--- a/doc/source/admin/secure-rbac.rst
+++ b/doc/source/admin/secure-rbac.rst
@@ -267,3 +267,43 @@ restrictive and an ``owner`` may revoke access to ``lessee``.
 Access to the underlying baremetal node is not exclusive between the
 ``owner`` and ``lessee``, and this use model expects that some level of
 communication takes place between the appropriate parties.
+
+Can I, a project admin, create a node?
+--------------------------------------
+
+Starting in API version ``1.80``, the capability was added to allow users
+with an ``admin`` role to create and delete their own nodes in Ironic.
+
+This functionality is enabled by default, and automatically
+imparts ``owner`` privileges to the created Bare Metal node.
+
+This functionality can be disabled by setting
+``[api]project_admin_can_manage_own_nodes`` to ``False``.
+
+Can I use a service role?
+-------------------------
+
+In later versions of Ironic, the ``service`` role has been added to enable
+delineation of accounts and access to Ironic's API. As Ironic's API was
+largely originally intended as an "admin" API service, the service role
+enables similar levels of access as a project-scoped user with the
+``admin`` or ``manager`` roles.
+
+In terms of access, this is likely best viewed as a user with the
+``manager`` role, but with a slight elevation in privilege to enable
+usage of the service via a service account.
+
+A project scoped user with the ``service`` role is able to create
+baremetal nodes, but is not able to delete them. To disable the
+ability to create nodes, set the
+``[api]project_admin_can_manage_own_nodes`` setting to ``False``.
+The nodes which can be accessed/managed in the project scope also align
+with the ``owner`` and ``lessee`` access model, and thus if nodes do not
+match the user's ``project_id``, then Ironic's API will appear not to
+have any enrolled baremetal nodes.
+
+With the system scope, a user with the ``service`` role is able to
+create baremetal nodes but, likewise, not able to delete them. The access
+rights are modeled such that an ``admin`` scope is needed to delete
+baremetal nodes from Ironic.
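The opt-out described above boils down to a single configuration option; an illustrative ``ironic.conf`` fragment:

.. code-block:: ini

   [api]
   # When False, project-scoped users with the admin or service role can
   # no longer create their own baremetal nodes.
   project_admin_can_manage_own_nodes = False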
diff --git a/doc/source/admin/troubleshooting.rst b/doc/source/admin/troubleshooting.rst
index fa04d3006..72e969b6e 100644
--- a/doc/source/admin/troubleshooting.rst
+++ b/doc/source/admin/troubleshooting.rst
@@ -973,3 +973,174 @@
 Unfortunately, due to the way the conductor is designed, it is not possible
 to gracefully break a stuck lock held in ``*-ing`` states. As the last
 resort, you may need to restart the affected conductor. See `Why are my
 nodes stuck in a "-ing" state?`_.
+
+What is ConcurrentActionLimit?
+==============================
+
+ConcurrentActionLimit is an exception which is raised to clients when an
+operation is requested, but cannot be serviced at that moment because the
+overall threshold of nodes in concurrent "Deployment" or "Cleaning"
+operations has been reached.
+
+These limits exist for two distinct reasons.
+
+The first is that they allow an operator to tune a deployment such that
+too many concurrent deployments cannot be triggered at any given time.
+While a single conductor has an internal limit to the number of overall
+concurrent tasks, this limit restricts only the number of running
+concurrent actions. As such, it accounts for the number of nodes in
+``deploy`` and ``deploy wait`` states.
+In the case of deployments, the default value is relatively high and should
+be suitable for *most* larger operators.
+
+The second is to help slow down the speed at which an entire population of
+baremetal nodes can be moved into and through cleaning, in order to help
+guard against authenticated malicious users, or accidental script driven
+operations. In this case, the total number of nodes in ``deleting``,
+``cleaning``, and ``clean wait`` is evaluated. The default maximum limit
+for cleaning operations is *50* and should be suitable for the majority of
+baremetal operators.
+
+These settings can be modified by using the
+``[conductor]max_concurrent_deploy`` and ``[conductor]max_concurrent_clean``
+settings in the ironic.conf file supporting the ``ironic-conductor``
+service. Neither setting can be explicitly disabled, however there is also
+no upper limit to the setting.
+
+.. note::
+   This was an infrastructure operator requested feature from actual
+   lessons learned in the operation of Ironic in large scale production.
+   The defaults may not be suitable for the largest scale operators.
+
+Why do I have an error that an NVMe Partition is not a block device?
+====================================================================
+
+In some cases, you can encounter an error suggesting that a partition
+which has been created on an NVMe block device is not a block device.
+
+Example::
+
+  lsblk: /dev/nvme0n1p2: not a block device
+
+What has happened is that the partition contains a partition table inside
+of it, which confuses the NVMe device interaction. While nested partition
+tables are basically valid in some cases, for example with software RAID,
+in the NVMe case the driver, and possibly the underlying device, gets
+quite confused. This is in part because partitions on NVMe devices are
+higher level abstractions.
+
+The way this occurs is that you likely had a ``whole-disk`` image, and it
+was configured as a partition image. If using Glance, your image
+properties may have an ``img_type`` field, which should be ``whole-disk``,
+or you have ``kernel_id`` and ``ramdisk_id`` values in the Glance image
+``properties`` field. Definition of a kernel and ramdisk value also
+indicates that the image is of a ``partition`` image type. This is because
+a ``whole-disk`` image is bootable from the contents within the image,
+whereas partition images are unable to be booted without a kernel and
+ramdisk.
+
+If you are using Ironic in standalone mode, it may be advisable to check
+the optional ``instance_info/image_type`` setting.
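The image-type inference described above can be sketched as follows. This is an illustrative reimplementation of the rules as documented, not Ironic's actual code, and ``infer_image_type`` is a hypothetical helper name:

```python
def infer_image_type(properties):
    """Guess whether image properties describe a whole-disk or a
    partition image, per the rules described above (sketch only)."""
    # An explicit img_type property wins outright.
    if "img_type" in properties:
        return properties["img_type"]
    # A kernel/ramdisk pair marks the image as a partition image,
    # since a whole-disk image boots from its own contents.
    if "kernel_id" in properties and "ramdisk_id" in properties:
        return "partition"
    return "whole-disk"


print(infer_image_type({"kernel_id": "abc", "ramdisk_id": "def"}))
```

If this prints ``partition`` for an image you intended to be whole-disk, the stray ``kernel_id``/``ramdisk_id`` properties are the likely cause of the misdeployment described above.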
+Very similar to the Glance usage above, if you have set Ironic's
+node-level ``instance_info/kernel`` and ``instance_info/ramdisk``
+parameters, Ironic will proceed with deploying the image as if it is a
+partition image, create a partition table on the new block device, and
+then write the contents of the image into the newly created partition.
+
+.. NOTE::
+   As a general reminder, the Ironic community recommends the use of
+   whole disk images over the use of partition images.
+
+Why can't I use Secure Erase/Wipe with RAID controllers?
+========================================================
+
+Situations have been reported where an infrastructure operator is
+expecting particular device types to be Secure Erased or Wiped when they
+are behind a RAID controller.
+
+For example, the server may have NVMe devices attached to a RAID
+controller which could be in pass-through or single disk volume mode. The
+same scenario exists basically regardless of the disk/storage medium/type.
+
+The basic reason is that RAID controllers essentially act as command
+translators with a buffer cache. They tend to offer a simplified protocol
+to the Operating System, and interact with the storage device in whatever
+protocol is native to the device. This is the root of the underlying
+problem.
+
+Protocols such as SCSI are rooted in quite a bit of computing history,
+but never evolved to include primitives like Secure Erase, which evolved in
+the `ATA protocol <https://en.wikipedia.org/wiki/Parallel_ATA#HDD_passwords_and_security>`_.
+
+The closest primitives in SCSI to ATA Secure Erase are the ``FORMAT UNIT``
+and ``UNMAP`` commands.
+
+``FORMAT UNIT`` might be a viable solution, and a tool named
+`sg_format <https://linux.die.net/man/8/sg_format>`_ exists,
+but there has not been a sufficient call upstream to implement this and
+test it sufficiently that the Ironic community would be comfortable
+shipping such a capability.
+The possibility also exists that a RAID
+controller might not translate this command through to an end device,
+just as some RAID controllers know how to handle and pass through
+ATA commands to disk devices which support them. It is entirely dependent
+upon the hardware configuration scenario.
+
+The ``UNMAP`` command is similar to the ATA ``TRIM`` command.
+Unfortunately the SCSI protocol requires this be performed at the block
+level, and similar to ``FORMAT UNIT``, it may not be supported or passed
+through.
+
+If you are interested in working on this area, or are willing to help
+test, please feel free to contact the
+:doc:`Ironic development community </contributor/community>`.
+An additional option is the creation of your own
+`custom Hardware Manager <https://opendev.org/openstack/ironic-python-agent/src/branch/master/examples/custom-disk-erase>`_
+which can contain your preferred logic, however this does require some
+Python development experience.
+
+One last item of note: depending on the RAID controller, the BMC, and a
+number of other variables, you may be able to leverage the `RAID <raid>`_
+configuration interface to delete volumes/disks, and recreate them. This
+may have the same effect as a clean disk, however that too is RAID
+controller dependent behavior.
+
+I'm in "clean failed" state, what do I do?
+==========================================
+
+There is only one way to exit the ``clean failed`` state. But before we
+visit the answer as to **how**, we need to stress the importance of
+attempting to understand **why** cleaning failed. On the simple side of
+things, this may be as simple as a DHCP failure, but on the complex side
+of things, it could be that a cleaning action failed against the
+underlying hardware, possibly due to a hardware failure.
+
+As such, we encourage everyone to attempt to understand **why** before
+exiting the ``clean failed`` state, because you could potentially make
+things worse for yourself.
+For example, if firmware updates were being performed, you may need to
+perform a rollback operation against the physical server, depending on
+what, and how, the firmware was being updated. Unfortunately this also
+borders the territory of "no simple answer".
+
+This can be counterbalanced by the fact that sometimes there is simply a
+transient networking failure and a DHCP address was not obtained. An
+example of this would be suggested by the ``last_error`` field indicating
+something about "Timeout reached while cleaning the node", however we
+recommend following several basic troubleshooting steps:
+
+* Consult the ``last_error`` field on the node, utilizing the
+  ``baremetal node show <uuid>`` command.
+* If the version of ironic supports the feature, consult the node history
+  log, ``baremetal node history list`` and
+  ``baremetal node history get <uuid>``.
+* Consult the actual console screen of the physical machine. *If* the
+  ramdisk booted, you will generally want to investigate the controller
+  logs and see if an uploaded agent log is being stored on the conductor
+  responsible for the baremetal node. Consult
+  `Retrieving logs from the deploy ramdisk`_.
+  If the node did not boot for some reason, you can typically just retry
+  at this point and move on.
+
+How to get out of the state, once you've understood **why** you reached
+it in the first place, is to utilize the ``baremetal node manage
+<node_id>`` command. This returns the node to the ``manageable`` state,
+from which you can retry cleaning through automated cleaning with the
+``provide`` command, manual cleaning with the ``clean`` command, or the
+next appropriate action in the workflow process you are attempting to
+follow, which may ultimately be decommissioning the node because it could
+have failed and is being removed or replaced.
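As a sketch of the recovery flow above, using the ``baremetal`` CLI (the node UUID and the clean-steps file name are placeholders, and the exact ``clean`` invocation may vary with your client version):

```shell
# Return the node from "clean failed" to "manageable".
baremetal node manage <node_uuid>

# Then either re-run automated cleaning on the way back to "available"...
baremetal node provide <node_uuid>

# ...or run manual cleaning with your own steps (hypothetical file name).
baremetal node clean <node_uuid> --clean-steps my-clean-steps.json
```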