Diffstat (limited to 'doc/source/admin')
-rw-r--r--  doc/source/admin/anaconda-deploy-interface.rst   37
-rw-r--r--  doc/source/admin/drivers.rst                      1
-rw-r--r--  doc/source/admin/drivers/fake.rst                36
-rw-r--r--  doc/source/admin/drivers/ibmc.rst                 2
-rw-r--r--  doc/source/admin/drivers/ilo.rst                 83
-rw-r--r--  doc/source/admin/drivers/irmc.rst                70
-rw-r--r--  doc/source/admin/drivers/redfish.rst             83
-rw-r--r--  doc/source/admin/drivers/snmp.rst                74
-rw-r--r--  doc/source/admin/hardware-burn-in.rst             7
-rw-r--r--  doc/source/admin/metrics.rst                     34
-rw-r--r--  doc/source/admin/retirement.rst                  21
-rw-r--r--  doc/source/admin/secure-rbac.rst                 40
-rw-r--r--  doc/source/admin/troubleshooting.rst            171
13 files changed, 612 insertions, 47 deletions
diff --git a/doc/source/admin/anaconda-deploy-interface.rst b/doc/source/admin/anaconda-deploy-interface.rst
index 2c686506a..2b7195525 100644
--- a/doc/source/admin/anaconda-deploy-interface.rst
+++ b/doc/source/admin/anaconda-deploy-interface.rst
@@ -271,11 +271,44 @@ purposes.
``liveimg`` which is used as the base operating system image to
start with.
+Configuration Considerations
+----------------------------
+
+When using the ``anaconda`` deployment interface, some configuration
+parameters may need to be adjusted in your environment. The general
+defaults are tuned for image based deployments and are set to much lower
+values than the ``anaconda`` deployment interface typically requires, so
+you may need to make some adjustments.
+
+* ``[conductor]deploy_callback_timeout`` likely needs to be adjusted
+ for most ``anaconda`` deployment interface users. By default this
+ is a timer which looks for "agents" which have not checked in with
+ Ironic, or agents which may have crashed or failed after they
+ started. If the value is reached, then the current operation is failed.
+ This value should be set to a number of seconds which exceeds your
+ average anaconda deployment time.
+* ``[pxe]boot_retry_timeout`` can also be triggered and result in
+ an anaconda deployment in progress getting reset as it is intended
+ to reboot nodes which might have failed their initial PXE operation.
+ Depending on sizes of images, and the exact nature of what was deployed,
+ it may be necessary to ensure this is a much higher value.
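+
+As a rough sketch of the two options above, the fragment below raises both
+timers in ``ironic.conf``. The values are illustrative assumptions, not
+recommendations; pick numbers that exceed the anaconda deployment times you
+actually observe.
+
+.. code-block:: ini
+
+   [conductor]
+   # Assumed value: two hours, longer than a typical anaconda install here.
+   deploy_callback_timeout = 7200
+
+   [pxe]
+   # Assumed value: high enough that an in-progress install is not reset,
+   # and kept below deploy_callback_timeout as a conservative choice.
+   boot_retry_timeout = 6000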
+
Limitations
-----------
-This deploy interface has only been tested with Red Hat based operating systems
-that use anaconda. Other systems are not supported.
+* This deploy interface has only been tested with Red Hat based operating
+ systems that use anaconda. Other systems are not supported.
+
+* Runtime TLS certificate injection into ramdisks is not supported. Assets
+ such as ``ramdisk`` or a ``stage2`` ramdisk image need to have trusted
+ Certificate Authority certificates present within the images *or* the
+ Ironic API endpoint utilized should utilize a known trusted Certificate
+ Authority.
+
+* The ``anaconda`` tooling deploying the instance/workload does not
+ heartbeat to Ironic like the ``ironic-python-agent`` driven ramdisks.
+ As such, you may need to adjust some timers. See
+ `Configuration Considerations`_ for some details on this.
.. _`anaconda`: https://fedoraproject.org/wiki/Anaconda
.. _`ks.cfg.template`: https://opendev.org/openstack/ironic/src/branch/master/ironic/drivers/modules/ks.cfg.template
diff --git a/doc/source/admin/drivers.rst b/doc/source/admin/drivers.rst
index c3d8eb377..f35cb2dfa 100644
--- a/doc/source/admin/drivers.rst
+++ b/doc/source/admin/drivers.rst
@@ -26,6 +26,7 @@ Hardware Types
drivers/redfish
drivers/snmp
drivers/xclarity
+ drivers/fake
Changing Hardware Types and Interfaces
--------------------------------------
diff --git a/doc/source/admin/drivers/fake.rst b/doc/source/admin/drivers/fake.rst
new file mode 100644
index 000000000..ea7d7ef4c
--- /dev/null
+++ b/doc/source/admin/drivers/fake.rst
@@ -0,0 +1,36 @@
+===========
+Fake driver
+===========
+
+Overview
+========
+
+The ``fake-hardware`` hardware type is what it claims to be: fake. Use of this
+type or the ``fake`` interfaces should be temporary or limited to
+non-production environments, as the ``fake`` interfaces do not perform any of
+the actions typically expected.
+
+The ``fake`` interfaces can be configured to be combined with any of the
+"real" hardware interfaces, allowing you to effectively disable one or more
+hardware interfaces for testing by simply setting that interface to
+``fake``.
+
+Use cases
+=========
+
+Development
+-----------
+Developers can use the ``fake-hardware`` hardware type to mock out nodes for
+testing without those nodes needing to exist as physical or virtual hardware.
+
+Adoption
+--------
+Some OpenStack deployers have used ``fake`` interfaces in Ironic to allow an
+adoption-style workflow with Nova. By setting a node's hardware interfaces to
+``fake``, it's possible to deploy to that node with Nova without causing any
+actual changes to the hardware or an OS already deployed on it.
+
+This is generally an unsupported use case, but it is possible. For more
+information, see the relevant `post from CERN TechBlog`_.
+
+.. _`post from CERN TechBlog`: https://techblog.web.cern.ch/techblog/post/ironic-nova-adoption/
diff --git a/doc/source/admin/drivers/ibmc.rst b/doc/source/admin/drivers/ibmc.rst
index 1bf9a3ba2..0f7fe1d90 100644
--- a/doc/source/admin/drivers/ibmc.rst
+++ b/doc/source/admin/drivers/ibmc.rst
@@ -312,6 +312,6 @@ boot_up_seq GET Query boot up sequence
get_raid_controller_list GET Query RAID controller summary info
======================== ============ ======================================
-.. _Huawei iBMC: https://e.huawei.com/en/products/cloud-computing-dc/servers/accessories/ibmc
+.. _Huawei iBMC: https://e.huawei.com/en/products/computing/kunpeng/accessories/ibmc
.. _TLS: https://en.wikipedia.org/wiki/Transport_Layer_Security
.. _HUAWEI iBMC Client library: https://pypi.org/project/python-ibmcclient/
diff --git a/doc/source/admin/drivers/ilo.rst b/doc/source/admin/drivers/ilo.rst
index f764a6d89..b6825fc40 100644
--- a/doc/source/admin/drivers/ilo.rst
+++ b/doc/source/admin/drivers/ilo.rst
@@ -55,6 +55,8 @@ The hardware type ``ilo`` supports following HPE server features:
* `Updating security parameters as manual clean step`_
* `Update Minimum Password Length security parameter as manual clean step`_
* `Update Authentication Failure Logging security parameter as manual clean step`_
+* `Create Certificate Signing Request(CSR) as manual clean step`_
+* `Add HTTPS Certificate as manual clean step`_
* `Activating iLO Advanced license as manual clean step`_
* `Removing CA certificates from iLO as manual clean step`_
* `Firmware based UEFI iSCSI boot from volume support`_
@@ -65,6 +67,7 @@ The hardware type ``ilo`` supports following HPE server features:
* `BIOS configuration support`_
* `IPv6 support`_
* `Layer 3 or DHCP-less ramdisk booting`_
+* `Events subscription`_
Apart from above features hardware type ``ilo5`` also supports following
features:
@@ -200,6 +203,18 @@ The ``ilo`` hardware type supports following hardware interfaces:
enabled_hardware_types = ilo
enabled_rescue_interfaces = agent,no-rescue
+* vendor
+ Supports ``ilo``, ``ilo-redfish`` and ``no-vendor``. The default is
+ ``ilo``. They can be enabled by using the
+ ``[DEFAULT]enabled_vendor_interfaces`` option in ``ironic.conf`` as given
+ below:
+
+ .. code-block:: ini
+
+ [DEFAULT]
+ enabled_hardware_types = ilo
+ enabled_vendor_interfaces = ilo,ilo-redfish,no-vendor
+
The ``ilo5`` hardware type supports all the ``ilo`` interfaces described above,
except for ``boot`` and ``raid`` interfaces. The details of ``boot`` and
@@ -751,6 +766,12 @@ Supported **Manual** Cleaning Operations
``update_auth_failure_logging_threshold``:
Updates the Authentication Failure Logging security parameter. See
`Update Authentication Failure Logging security parameter as manual clean step`_ for user guidance on usage.
+ ``create_csr``:
+ Creates the certificate signing request. See `Create Certificate Signing Request(CSR) as manual clean step`_
+ for user guidance on usage.
+ ``add_https_certificate``:
+ Adds the signed HTTPS certificate to the iLO. See `Add HTTPS Certificate as manual clean step`_ for user
+ guidance on usage.
* iLO with firmware version 1.5 is minimally required to support all the
operations.
@@ -1648,6 +1669,54 @@ Both the arguments ``logging_threshold`` and ``ignore`` are optional. The accept
value be False. If user passes the value of logging_threshold as 0, the Authentication Failure Logging security
parameter will be disabled.
+Create Certificate Signing Request(CSR) as manual clean step
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+iLO driver can invoke ``create_csr`` request as a manual clean step. This step is only supported for iLO5 based hardware.
+
+An example of a manual clean step with ``create_csr`` as the only clean step could be::
+
+ "clean_steps": [{
+ "interface": "management",
+ "step": "create_csr",
+ "args": {
+ "csr_params": {
+ "City": "Bengaluru",
+ "CommonName": "1.1.1.1",
+ "Country": "India",
+ "OrgName": "HPE",
+ "State": "Karnataka"
+ }
+ }
+ }]
+
+The ``[ilo]cert_path`` option in ``ironic.conf`` is used as the directory path for
+creating the CSR, and defaults to ``/var/lib/ironic/ilo``. The CSR is created under the directory
+given in ``[ilo]cert_path``, in a subdirectory named after the node UUID, as ``<node_uuid>.csr``.
+
+
+Add HTTPS Certificate as manual clean step
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+iLO driver can invoke ``add_https_certificate`` request as a manual clean step. This step is only supported for
+iLO5 based hardware.
+
+An example of a manual clean step with ``add_https_certificate`` as the only clean step could be::
+
+ "clean_steps": [{
+ "interface": "management",
+ "step": "add_https_certificate",
+ "args": {
+ "cert_file": "/test1/iLO.crt"
+ }
+ }]
+
+The ``cert_file`` argument is mandatory. It takes the path or URL of the certificate file.
+The supported URL schemes are ``file``, ``http`` and ``https``.
+The CSR generated in the ``create_csr`` step needs to be signed by a valid CA, and the resulting HTTPS certificate
+should be provided in ``cert_file``. The step copies ``cert_file`` to ``[ilo]cert_path``, in a subdirectory named
+after the node UUID, as ``<node_uuid>.crt`` before adding it to iLO.
+
RAID Support
^^^^^^^^^^^^
@@ -2136,6 +2205,20 @@ DHCP-less deploy is supported by ``ilo`` and ``ilo5`` hardware types.
However it would work only with ilo-virtual-media boot interface. See
:doc:`/admin/dhcp-less` for more information.
+Events subscription
+^^^^^^^^^^^^^^^^^^^
+Events subscription is supported by ``ilo`` and ``ilo5`` hardware types with
+``ilo`` vendor interface for Gen10 and Gen10 Plus servers. See
+:ref:`node-vendor-passthru-methods` for more information.
+
+Anaconda based deployment
+^^^^^^^^^^^^^^^^^^^^^^^^^
+Deployment with ``anaconda`` deploy interface is supported by ``ilo`` and
+``ilo5`` hardware type and works with ``ilo-pxe`` and ``ilo-ipxe``
+boot interfaces. See :doc:`/admin/anaconda-deploy-interface` for
+more information.
+
+
.. _`ssacli documentation`: https://support.hpe.com/hpsc/doc/public/display?docId=c03909334
.. _`proliant-tools`: https://docs.openstack.org/diskimage-builder/latest/elements/proliant-tools/README.html
.. _`HPE iLO4 User Guide`: https://h20566.www2.hpe.com/hpsc/doc/public/display?docId=c03334051
diff --git a/doc/source/admin/drivers/irmc.rst b/doc/source/admin/drivers/irmc.rst
index 17b8d8644..9ddfa3b3d 100644
--- a/doc/source/admin/drivers/irmc.rst
+++ b/doc/source/admin/drivers/irmc.rst
@@ -123,11 +123,29 @@ Configuration via ``driver_info``
the iRMC with administrator privileges.
- ``driver_info/irmc_password`` property to be ``password`` for
irmc_username.
- - ``properties/capabilities`` property to be ``boot_mode:uefi`` if
- UEFI boot is required.
- - ``properties/capabilities`` property to be ``secure_boot:true`` if
- UEFI Secure Boot is required. Please refer to `UEFI Secure Boot Support`_
- for more information.
+
+ .. note::
+ Fujitsu servers equipped with iRMC S6 firmware version 2.00 or later
+ disable IPMI over LAN by default. However, users may be able to enable
+ IPMI via the BMC settings.
+ To handle this change, the ``irmc`` hardware type first tries IPMI and,
+ if the IPMI operation fails, uses the Redfish API of the Fujitsu
+ server to provide Ironic functionality.
+ So when deploying a Fujitsu server with iRMC S6 2.00 or later, users need
+ to set the Redfish related parameters in ``driver_info``.
+
+ - ``driver_info/redfish_address`` property to be ``IP address`` or
+ ``hostname`` of the iRMC. You can prefix it with a protocol (e.g.
+ ``https://``). If you don't provide a protocol, Ironic assumes HTTPS
+ (i.e. adds the ``https://`` prefix).
+ iRMC S6 2.00 or later only supports HTTPS connections to the Redfish API.
+ - ``driver_info/redfish_username`` to be the user name of an iRMC account
+ with administrative privileges
+ - ``driver_info/redfish_password`` to be the password of ``redfish_username``
+ - ``driver_info/redfish_verify_ca`` accepts the same values as
+ ``driver_info/irmc_verify_ca``
+ - ``driver_info/redfish_auth_type`` to be one of ``basic``, ``session`` or
+ ``auto``
* If ``port`` in ``[irmc]`` section of ``/etc/ironic/ironic.conf`` or
``driver_info/irmc_port`` is set to 443, ``driver_info/irmc_verify_ca``
@@ -191,6 +209,22 @@ Configuration via ``driver_info``
- ``driver_info/irmc_snmp_priv_password`` property to be the privacy protocol
pass phrase. The length of pass phrase should be at least 8 characters.
+
+Configuration via ``properties``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+* Each node is configured for ``irmc`` hardware type by setting the following
+ ironic node object's properties:
+
+ - ``properties/capabilities`` property to be ``boot_mode:uefi`` if
+ UEFI boot is required, or ``boot_mode:bios`` if Legacy BIOS is required.
+ If this is not set, ``default_boot_mode`` in the ``[deploy]`` section of
+ ``ironic.conf`` will be used.
+ - ``properties/capabilities`` property to be ``secure_boot:true`` if
+ UEFI Secure Boot is required. Please refer to `UEFI Secure Boot Support`_
+ for more information.
+
+
Configuration via ``ironic.conf``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -199,6 +233,25 @@ Configuration via ``ironic.conf``
- ``port``: Port to be used for iRMC operations; either 80
or 443. The default value is 443. Optional.
+
+ .. note::
+ Since iRMC S6 2.00, iRMC firmware doesn't support HTTP connections to
+ the REST API. If you deploy a server with iRMC S6 2.00 or later, please
+ set ``port`` to 443.
+
+ The ``irmc`` hardware type provides a verify step named
+ ``verify_http_https_connection_and_fw_version`` to check the HTTP(S)
+ connection to the iRMC REST API. If the HTTP(S) connection is successfully
+ established, the step fetches and caches the iRMC firmware version.
+ If the HTTP(S) connection to the iRMC REST API fails, the Ironic node's
+ state moves to ``enroll`` and a suggestion is written to the log.
+ The default priority of this verify step is 10.
+
+ If an operator updates the iRMC firmware version of a node, the operator
+ should run the ``cache_irmc_firmware_version`` node vendor passthru method
+ to update the iRMC firmware version stored in
+ ``driver_internal_info/irmc_fw_version``.
+
- ``auth_method``: Authentication method for iRMC operations;
either ``basic`` or ``digest``. The default value is ``basic``. Optional.
- ``client_timeout``: Timeout (in seconds) for iRMC
@@ -229,9 +282,10 @@ Configuration via ``ironic.conf``
and ``v2c``. The default value is ``public``. Optional.
- ``snmp_security``: SNMP security name required for version ``v3``.
Optional.
- - ``snmp_auth_proto``: The SNMPv3 auth protocol. The valid value and the
- default value are both ``sha``. We will add more supported valid values
- in the future. Optional.
+ - ``snmp_auth_proto``: The SNMPv3 auth protocol. If using iRMC S4 or S5, the
+ valid value of this option is only ``sha``. If using iRMC S6, the valid
+ values are ``sha256``, ``sha384`` and ``sha512``. The default value is
+ ``sha``. Optional.
- ``snmp_priv_proto``: The SNMPv3 privacy protocol. The valid value and
the default value are both ``aes``. We will add more supported valid values
in the future. Optional.
diff --git a/doc/source/admin/drivers/redfish.rst b/doc/source/admin/drivers/redfish.rst
index dd19f8bde..063dd1fe5 100644
--- a/doc/source/admin/drivers/redfish.rst
+++ b/doc/source/admin/drivers/redfish.rst
@@ -87,8 +87,18 @@ field:
The "auto" mode first tries "session" and falls back
to "basic" if session authentication is not supported
by the Redfish BMC. Default is set in ironic config
- as ``[redfish]auth_type``.
+ as ``[redfish]auth_type``. Most operators should not
+ need to leverage this setting. Session based
+ authentication should generally be used in most
+ cases as it prevents re-authentication every time
+ a background task checks in with the BMC.
+.. note::
+ The ``redfish_address``, ``redfish_username``, ``redfish_password``,
+ and ``redfish_verify_ca`` fields, if changed, will trigger a new session
+ to be established and cached with the BMC. The ``redfish_auth_type`` field
+ will only be used for the creation of a new cached session, or when an
+ existing session is rejected by the BMC.
The ``baremetal node create`` command can be used to enroll
a node with the ``redfish`` driver. For example:
@@ -533,6 +543,8 @@ settings. The following fields will be returned in the BIOS API
"``unique``", "The setting is specific to this node"
"``reset_required``", "After changing this setting a node reboot is required"
+.. _node-vendor-passthru-methods:
+
Node Vendor Passthru Methods
============================
@@ -620,6 +632,75 @@ Eject Virtual Media
"boot_device (optional)", "body", "string", "Type of the device to eject (all devices by default)"
+Internal Session Cache
+======================
+
+The ``redfish`` hardware type, and derived interfaces, utilizes a built-in
+session cache which prevents Ironic from re-authenticating every time
+Ironic attempts to connect to the BMC for any reason.
+
+This consists of cached connector objects which are used and tracked by
+the unique combination of ``redfish_username``, ``redfish_password``,
+``redfish_verify_ca``, and finally ``redfish_address``. Changing any one
+of those values will trigger a new session to be created.
+The ``redfish_system_id`` value is explicitly not considered as Redfish
+has a model of use of one BMC to many systems, which is also a model
+Ironic supports.
+
+The session cache default size is ``1000`` sessions per conductor.
+If you are operating a deployment with a larger number of Redfish
+BMCs, it is advised that you tune that number appropriately.
+This can be tuned via the API service configuration file,
+``[redfish]connection_cache_size``.
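+
+For illustration only, a deployment with more Redfish BMCs than the default
+cache size might raise the limit as sketched below. The value ``2000`` is a
+hypothetical figure, not a recommendation.
+
+.. code-block:: ini
+
+   [redfish]
+   # Hypothetical value; size this to the number of BMCs per conductor.
+   connection_cache_size = 2000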
+
+Session Cache Expiration
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+By default, sessions remain cached for as long as possible in
+memory, as long as they have not experienced an authentication,
+connection, or other unexplained error.
+
+Under normal circumstances, the sessions will only be rolled out
+of the cache in order of oldest first when the cache becomes full.
+There is no time based expiration to entries in the session cache.
+
+Of course, the cache is only in memory, and restarting the
+``ironic-conductor`` will also cause the cache to be rebuilt
+from scratch. If sessions are repeatedly re-created due to a persistent
+connectivity issue, this may be a sign of an unexpected condition;
+please consider contacting the Ironic developer community for assistance.
+
+Redfish Interoperability Profile
+================================
+
+The Ironic project provides a Redfish Interoperability Profile located in
+the ``redfish-interop-profiles`` folder at the source code root. The Redfish
+Interoperability Profile is a JSON document written in a particular format
+that serves two purposes.
+
+* It enables the creation of a human-readable document that merges the
+ profile requirements with the Redfish schema into a single document
+ for developers or users.
+* It allows a conformance test utility to test a Redfish Service
+ implementation for conformance with the profile.
+
+The JSON document structure is intended to align easily with JSON payloads
+retrieved from Redfish Service implementations, to allow for easy comparisons
+and conformance testing. Many of the properties defined within this structure
+have assumed default values that correspond with the most common use case, so
+that those properties can be omitted from the document for brevity.
+
+Validation of Profiles using DMTF tool
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+An open source utility has been created by the Redfish Forum to verify that
+a Redfish Service implementation conforms to the requirements included in a
+Redfish Interoperability Profile. The Redfish Interop Validator is available
+for download from the DMTF's organization on GitHub at
+https://github.com/DMTF/Redfish-Interop-Validator. Refer to the instructions
+in its README on how to configure and run validation.
+
+
.. _Redfish: http://redfish.dmtf.org/
.. _Sushy: https://opendev.org/openstack/sushy
.. _TLS: https://en.wikipedia.org/wiki/Transport_Layer_Security
diff --git a/doc/source/admin/drivers/snmp.rst b/doc/source/admin/drivers/snmp.rst
index 1c402ab9b..eed4ed794 100644
--- a/doc/source/admin/drivers/snmp.rst
+++ b/doc/source/admin/drivers/snmp.rst
@@ -22,39 +22,47 @@ this table could possibly work using a similar driver.
Please report any device status.
-============== ========== ========== =====================
-Manufacturer Model Supported? Driver name
-============== ========== ========== =====================
-APC AP7920 Yes apc_masterswitch
-APC AP9606 Yes apc_masterswitch
-APC AP9225 Yes apc_masterswitchplus
-APC AP7155 Yes apc_rackpdu
-APC AP7900 Yes apc_rackpdu
-APC AP7901 Yes apc_rackpdu
-APC AP7902 Yes apc_rackpdu
-APC AP7911a Yes apc_rackpdu
-APC AP7921 Yes apc_rackpdu
-APC AP7922 Yes apc_rackpdu
-APC AP7930 Yes apc_rackpdu
-APC AP7931 Yes apc_rackpdu
-APC AP7932 Yes apc_rackpdu
-APC AP7940 Yes apc_rackpdu
-APC AP7941 Yes apc_rackpdu
-APC AP7951 Yes apc_rackpdu
-APC AP7960 Yes apc_rackpdu
-APC AP7990 Yes apc_rackpdu
-APC AP7998 Yes apc_rackpdu
-APC AP8941 Yes apc_rackpdu
-APC AP8953 Yes apc_rackpdu
-APC AP8959 Yes apc_rackpdu
-APC AP8961 Yes apc_rackpdu
-APC AP8965 Yes apc_rackpdu
-Aten all? Yes aten
-CyberPower all? Untested cyberpower
-EatonPower all? Untested eatonpower
-Teltronix all? Yes teltronix
-BayTech MRP27 Yes baytech_mrp27
-============== ========== ========== =====================
+============== ============== ========== =====================
+Manufacturer Model Supported? Driver name
+============== ============== ========== =====================
+APC AP7920 Yes apc_masterswitch
+APC AP9606 Yes apc_masterswitch
+APC AP9225 Yes apc_masterswitchplus
+APC AP7155 Yes apc_rackpdu
+APC AP7900 Yes apc_rackpdu
+APC AP7901 Yes apc_rackpdu
+APC AP7902 Yes apc_rackpdu
+APC AP7911a Yes apc_rackpdu
+APC AP7921 Yes apc_rackpdu
+APC AP7922 Yes apc_rackpdu
+APC AP7930 Yes apc_rackpdu
+APC AP7931 Yes apc_rackpdu
+APC AP7932 Yes apc_rackpdu
+APC AP7940 Yes apc_rackpdu
+APC AP7941 Yes apc_rackpdu
+APC AP7951 Yes apc_rackpdu
+APC AP7960 Yes apc_rackpdu
+APC AP7990 Yes apc_rackpdu
+APC AP7998 Yes apc_rackpdu
+APC AP8941 Yes apc_rackpdu
+APC AP8953 Yes apc_rackpdu
+APC AP8959 Yes apc_rackpdu
+APC AP8961 Yes apc_rackpdu
+APC AP8965 Yes apc_rackpdu
+Aten all? Yes aten
+CyberPower all? Untested cyberpower
+EatonPower all? Untested eatonpower
+Teltronix all? Yes teltronix
+BayTech MRP27 Yes baytech_mrp27
+Raritan PX3-5547V-V2 Yes raritan_pdu2
+Raritan PX3-5726V Yes raritan_pdu2
+Raritan PX3-5776U-N2 Yes raritan_pdu2
+Raritan PX3-5969U-V2 Yes raritan_pdu2
+Raritan PX3-5961I2U-V2 Yes raritan_pdu2
+Vertiv NU30212 Yes vertivgeist_pdu
+ServerTech CW-16VE-P32M Yes servertech_sentry3
+ServerTech C2WG24SN Yes servertech_sentry4
+============== ============== ========== =====================
Software Requirements
diff --git a/doc/source/admin/hardware-burn-in.rst b/doc/source/admin/hardware-burn-in.rst
index 503664182..35f231d11 100644
--- a/doc/source/admin/hardware-burn-in.rst
+++ b/doc/source/admin/hardware-burn-in.rst
@@ -108,6 +108,13 @@ Then launch the test with:
baremetal node clean --clean-steps '[{"step": "burnin_disk", \
"interface": "deploy"}]' $NODE_NAME_OR_UUID
+In order to launch a parallel SMART self test on all devices after the
+disk burn-in (which will fail the step if any of the tests fail), set:
+
+.. code-block:: console
+
+ baremetal node set --driver-info agent_burnin_fio_disk_smart_test=True \
+ $NODE_NAME_OR_UUID
Network burn-in
===============
diff --git a/doc/source/admin/metrics.rst b/doc/source/admin/metrics.rst
index f435a50c5..733c6569b 100644
--- a/doc/source/admin/metrics.rst
+++ b/doc/source/admin/metrics.rst
@@ -17,8 +17,11 @@ These performance measurements, herein referred to as "metrics", can be
emitted from the Bare Metal service, including ironic-api, ironic-conductor,
and ironic-python-agent. By default, none of the services will emit metrics.
-Configuring the Bare Metal Service to Enable Metrics
-====================================================
+It is important to stress that statsd is not the only supported backend for
+metrics collection and transmission. This is covered later on in our documentation.
+
+Configuring the Bare Metal Service to Enable Metrics with Statsd
+================================================================
Enabling metrics in ironic-api and ironic-conductor
---------------------------------------------------
@@ -62,6 +65,30 @@ in the ironic configuration file as well::
agent_statsd_host = 198.51.100.2
agent_statsd_port = 8125
+.. Note::
+ Use of a different metrics backend with the agent is not presently
+ supported.
+
+Transmission to the Message Bus Notifier
+========================================
+
+Regardless of whether you're using Ceilometer,
+`ironic-prometheus-exporter <https://docs.openstack.org/ironic-prometheus-exporter/latest/>`_,
+or some scripting you wrote to consume the message bus notifications,
+metrics data can be sent to the message bus notifier from the timer methods
+*and* additional gauge counters by utilizing the ``[metrics]backend``
+configuration option and setting it to ``collector``. When this is the case,
+information is cached locally and periodically sent along with the general sensor
+data update to the messaging notifier, which can be consumed off of the message bus,
+or via a notifier plugin (such as is done with ironic-prometheus-exporter).
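+
+A minimal configuration sketch for this mode, assuming the option is set in
+the ``ironic.conf`` file read by the conductor:
+
+.. code-block:: ini
+
+   [metrics]
+   # Cache metrics locally and send them via the message bus notifier.
+   backend = collector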
+
+.. NOTE::
+ Transmission of timer data only works for the Conductor or ``single-process``
+ Ironic service model. A separate webserver process presently does not have
+ the capability of triggering the call to retrieve and transmit the data.
+
+.. NOTE::
+ This functionality requires ironic-lib version 5.4.0 to be installed.
Types of Metrics Emitted
========================
@@ -79,6 +106,9 @@ additional load before enabling metrics. To see which metrics have changed names
or have been removed between releases, refer to the `ironic release notes
<https://docs.openstack.org/releasenotes/ironic/>`_.
+Additional conductor metrics in the form of counts will also be generated in
+limited locations where pertinent to the activity of the conductor.
+
.. note::
With the default statsd configuration, each timing metric may create
additional metrics due to how statsd handles timing metrics. For more
diff --git a/doc/source/admin/retirement.rst b/doc/source/admin/retirement.rst
index e4884e0f4..aab307bac 100644
--- a/doc/source/admin/retirement.rst
+++ b/doc/source/admin/retirement.rst
@@ -23,6 +23,27 @@ scheduling of instances, but will still allow for other operations,
such as cleaning, to happen (this marks an important difference to
nodes which have the ``maintenance`` flag set).
+Requirements
+============
+
+The use of the retirement feature requires that automated cleaning
+be enabled. The default ``[conductor]automated_clean`` setting must
+not be disabled as the retirement feature is only engaged upon
+the completion of cleaning as it sets forth the expectation of removing
+sensitive data from a node.
+
+If you're uncomfortable with full cleaning, but want to make use of
+the retirement feature, a compromise may be to explore use of metadata
+erasure; however, this will leave additional data on disk which you may
+wish to erase completely. Please consult the configuration for the
+``[deploy]erase_devices_metadata_priority`` and
+``[deploy]erase_devices_priority`` settings, and do note that
+clean steps can be manually invoked through manual cleaning should you
+wish to trigger the ``erase_devices`` clean step to completely wipe
+all data from storage devices. Alternatively, automated cleaning can
+also be enabled on an individual node level using the
+``baremetal node set --automated-clean <node_id>`` command.
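+
+One possible shape of the metadata-only compromise is sketched below. It
+assumes that a priority of ``0`` disables a clean step; the non-zero value is
+illustrative, so check the configuration reference for your release.
+
+.. code-block:: ini
+
+   [conductor]
+   # Automated cleaning must remain enabled for retirement to take effect.
+   automated_clean = True
+
+   [deploy]
+   # Assumption: 0 skips the full device erase during automated cleaning.
+   erase_devices_priority = 0
+   # Illustrative priority so that only metadata erasure runs.
+   erase_devices_metadata_priority = 10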
+
How to use
==========
diff --git a/doc/source/admin/secure-rbac.rst b/doc/source/admin/secure-rbac.rst
index 639cfcb23..1f1bb66d1 100644
--- a/doc/source/admin/secure-rbac.rst
+++ b/doc/source/admin/secure-rbac.rst
@@ -267,3 +267,43 @@ restrictive and an ``owner`` may revoke access to ``lessee``.
Access to the underlying baremetal node is not exclusive between the
``owner`` and ``lessee``, and this use model expects that some level of
communication takes place between the appropriate parties.
+
+Can I, a project admin, create a node?
+--------------------------------------
+
+Starting in API version ``1.80``, the capability was added
+to allow users with an ``admin`` role to be able to create and
+delete their own nodes in Ironic.
+
+This functionality is enabled by default, and automatically
+imparts ``owner`` privileges to the created Bare Metal node.
+
+This functionality can be disabled by setting
+``[api]project_admin_can_manage_own_nodes`` to ``False``.
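+
+As a sketch, the corresponding ``ironic.conf`` fragment would be:
+
+.. code-block:: ini
+
+   [api]
+   # Prevent project-scoped admins from creating and deleting their own nodes.
+   project_admin_can_manage_own_nodes = False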
+
+Can I use a service role?
+-------------------------
+
+In later versions of Ironic, the ``service`` role has been added to enable
+delineation of accounts and access to Ironic's API. As Ironic's API was
+largely originally intended as an "admin" API service, the service role
+enables similar levels of access as a project-scoped user with the
+``admin`` or ``manager`` roles.
+
+In terms of access, this is likely best viewed as a user with the
+``manager`` role, but with slight elevation in privilege to enable
+usage of the service via a service account.
+
+A project scoped user with the ``service`` role is able to create
+baremetal nodes, but is not able to delete them. To disable the
+ability to create nodes, set the
+``[api]project_admin_can_manage_own_nodes`` setting to ``False``.
+The nodes which can be accessed/managed in the project scope also align
+with the ``owner`` and ``lessee`` access model, and thus if nodes are not
+matching the user's ``project_id``, then Ironic's API will appear not to
+have any enrolled baremetal nodes.
+
+With the system scope, a user with the ``service`` role is also able to
+create baremetal nodes, but not delete them. The access rights
+are modeled such that the ``admin`` role is needed to delete baremetal
+nodes from Ironic.
diff --git a/doc/source/admin/troubleshooting.rst b/doc/source/admin/troubleshooting.rst
index fa04d3006..72e969b6e 100644
--- a/doc/source/admin/troubleshooting.rst
+++ b/doc/source/admin/troubleshooting.rst
@@ -973,3 +973,174 @@ Unfortunately, due to the way the conductor is designed, it is not possible to
gracefully break a stuck lock held in ``*-ing`` states. As the last resort, you
may need to restart the affected conductor. See `Why are my nodes stuck in a
"-ing" state?`_.
+
+What is ConcurrentActionLimit?
+==============================
+
+ConcurrentActionLimit is an exception which is raised to clients when an
+operation is requested, but cannot be serviced at that moment because the
+overall threshold of nodes in concurrent "Deployment" or "Cleaning"
+operations has been reached.
+
+These limits exist for two distinct reasons.
+
+The first is that they allow an operator to tune a deployment such that too
+many concurrent deployments cannot be triggered at any given time. A single
+conductor has an internal limit on the overall number of concurrent tasks,
+but that restricts only the number of running concurrent actions. As such,
+this limit accounts for the number of nodes in ``deploy`` and ``deploy wait`` states.
+In the case of deployments, the default value is relatively high and should
+be suitable for *most* larger operators.
+
+The second is to slow the rate at which an entire population of baremetal
+nodes can be moved into and through cleaning, in order to help guard
+against authenticated malicious users or accidental script-driven
+operations. In this case, the total number of nodes in the ``deleting``,
+``cleaning``, and ``clean wait`` states is evaluated. The default maximum
+limit for cleaning operations is *50* and should be suitable for the
+majority of baremetal operators.
+
+These limits can be modified using the
+``[conductor]max_concurrent_deploy`` and ``[conductor]max_concurrent_clean``
+settings in the ``ironic.conf`` file used by the ``ironic-conductor``
+service. Neither setting can be explicitly disabled; however, there is also
+no upper limit on either setting.
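+
+As a minimal sketch, the relevant ``ironic.conf`` fragment could look like
+the following. The deployment value shown is purely illustrative and is not
+a recommendation; the cleaning value matches the default stated above.
+
+.. code-block:: ini
+
+   [conductor]
+   # Illustrative value only: cap on nodes concurrently being deployed.
+   max_concurrent_deploy = 100
+   # Cap on nodes concurrently in deleting, cleaning, or clean wait.
+   max_concurrent_clean = 50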
+
+.. note::
+ This was an infrastructure operator requested feature from actual lessons
+ learned in the operation of Ironic in large scale production. The defaults
+ may not be suitable for the largest scale operators.
+
+Why do I have an error that an NVMe Partition is not a block device?
+====================================================================
+
+In some cases, you can encounter an error suggesting that a partition which
+has been created on an NVMe block device is not a block device.
+
+Example::
+
+ lsblk: /dev/nvme0n1p2: not a block device
+
+What has happened is that the partition contains a partition table inside
+of it, which confuses the NVMe device interaction. While nested partition
+tables are valid in some cases, for example with software RAID, in the NVMe
+case the driver, and possibly the underlying device, gets quite confused.
+This is in part because partitions in NVMe devices are higher level
+abstractions.
+
+This most likely occurs because you had a ``whole-disk`` image which was
+configured as a partition image. If using glance, your image properties
+may have an ``img_type`` field, which should be ``whole-disk``, or you may
+have ``kernel_id`` and ``ramdisk_id`` values in the glance image
+``properties`` field. Defining kernel and ramdisk values also indicates
+that the image is of the ``partition`` image type. This is because a
+``whole-disk`` image is bootable from the contents within the image,
+whereas partition images cannot be booted without a kernel and ramdisk.
+
+If you are using Ironic in standalone mode, it is worth checking the
+optional ``instance_info/image_type`` setting. Much like the Glance usage
+above, if you have set Ironic's node level ``instance_info/kernel`` and
+``instance_info/ramdisk`` parameters, Ironic will proceed with deploying
+the image as if it is a partition image: it will create a partition table
+on the new block device, and then write the contents of the image into the
+newly created partition.
+
+.. NOTE::
+ As a general reminder, the Ironic community recommends the use of
+ whole disk images over the use of partition images.
+
+Why can't I use Secure Erase/Wipe with RAID controllers?
+========================================================
+
+Situations have been reported where an infrastructure operator is expecting
+particular device types to be Secure Erased or Wiped when they are behind a
+RAID controller.
+
+For example, the server may have NVMe devices attached to a RAID controller
+which could be in pass-through or single disk volume mode. The same scenario
+exists basically regardless of the disk/storage medium/type.
+
+The basic reason is that RAID controllers essentially act as command
+translators with a buffer cache. They tend to offer a simplified protocol
+to the Operating System, and interact with the storage device in whatever
+protocol is native to the device. This is the root of the underlying
+problem.
+
+Protocols such as SCSI are rooted in quite a bit of computing history,
+but never evolved to include primitives like Secure Erase which evolved in
+the `ATA protocol <https://en.wikipedia.org/wiki/Parallel_ATA#HDD_passwords_and_security>`_.
+
+The closest primitives in SCSI to ATA Secure Erase are the ``FORMAT UNIT``
+and ``UNMAP`` commands.
+
+``FORMAT UNIT`` might be a viable solution, and a tool named
+`sg_format <https://linux.die.net/man/8/sg_format>`_ exists,
+but there has not been a sufficient call upstream to implement this and
+test it sufficiently that the Ironic community would be comfortable
+shipping such a capability. The possibility also exists that a RAID
+controller might not translate this command through to an end device,
+just as some RAID controllers know how to handle and pass through
+ATA commands to disk devices which support them. It is entirely dependent
+upon the hardware configuration scenario.
+
+The ``UNMAP`` command is similar to the ATA ``TRIM`` command. Unfortunately,
+the SCSI protocol requires this to be performed at the block level and,
+similar to ``FORMAT UNIT``, it may not be supported or passed through by
+the RAID controller.
+
+If you're interested in working on this area, or are willing to help test,
+please feel free to contact the
+:doc:`Ironic development community </contributor/community>`.
+An additional option is the creation of your own
+`custom Hardware Manager <https://opendev.org/openstack/ironic-python-agent/src/branch/master/examples/custom-disk-erase>`_
+which can contain your preferred logic, however this does require some Python
+development experience.
+
+One last item of note: depending on the RAID controller, the BMC, and a number
+of other variables, you may be able to leverage the `RAID <raid>`_
+configuration interface to delete volumes/disks, and recreate them. This may
+have the same effect as a clean disk, however that too is RAID controller
+dependent behavior.
+
+I'm in "clean failed" state, what do I do?
+==========================================
+
+There is only one way to exit the ``clean failed`` state. But before we visit
+the answer as to **how**, we need to stress the importance of attempting to
+understand **why** cleaning failed. The cause may be as simple as a DHCP
+failure, or as complex as a cleaning action which failed against the
+underlying hardware, possibly due to a hardware failure.
+
+As such, we encourage everyone to attempt to understand **why** before exiting
+the ``clean failed`` state, because you could potentially make things worse
+for yourself. For example, if firmware updates were being performed, you may
+need to perform a rollback operation against the physical server, depending
+on what firmware was being updated and how. Unfortunately, this also borders
+the territory of "no simple answer".
+
+On the other hand, sometimes the cause is simply a transient networking
+failure in which a DHCP address was not obtained. This would be suggested by
+the ``last_error`` field indicating something about "Timeout reached while
+cleaning the node". In any case, we recommend following several basic
+troubleshooting steps:
+
+* Consult the ``last_error`` field on the node, utilizing the
+ ``baremetal node show <uuid>`` command.
+* If the version of ironic supports the feature, consult the node history
+ log, ``baremetal node history list`` and
+ ``baremetal node history get <uuid>``.
+* Consult the actual console screen of the physical machine. *If* the ramdisk
+ booted, you will generally want to investigate the controller logs and see
+ if an uploaded agent log is being stored on the conductor responsible for
+ the baremetal node. Consult `Retrieving logs from the deploy ramdisk`_.
+ If the node did not boot for some reason, you can typically just retry
+ at this point and move on.
+
+Once you've understood **why** you reached the state in the first place, the
+way out is to utilize the ``baremetal node manage <node_id>`` command. This
+returns the node to the ``manageable`` state, from where you can retry
+cleaning through automated cleaning with the ``provide`` command or manual
+cleaning with the ``clean`` command, or take the next appropriate action in
+the workflow process you are attempting to follow. That may ultimately be
+decommissioning the node, because it could have failed and is being removed
+or replaced.
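+
+The recovery sequence described above could look like the following sketch,
+where ``<node>`` is a placeholder for the name or UUID of the affected node.
+It assumes that automated cleaning is the desired retry path; substitute the
+``clean`` command if manual cleaning is what you need.
+
+.. code-block:: console
+
+   # Review why cleaning failed before changing the node's state.
+   baremetal node show <node> --fields last_error
+   # Return the node to the manageable state.
+   baremetal node manage <node>
+   # Retry automated cleaning; on success the node becomes available.
+   baremetal node provide <node>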