author Zuul <zuul@review.opendev.org> 2021-07-19 15:04:01 +0000
committer Gerrit Code Review <review@openstack.org> 2021-07-19 15:04:01 +0000
commit 41a10cffce8bd85048d939f79fd64371b7382997
tree 970f46a7b86231a327278c1a76e57f65eb411f53
parent 33842cda0d0e330407bf26cd4d3a6192ecacc5a3
parent e488792ccc3c4c037977c0b1a3d4dcb724d71e6c
Merge "Burn-in: Add documentation"
-rw-r--r-- doc/source/admin/hardware-burn-in.rst 166
-rw-r--r-- doc/source/admin/index.rst 1
2 files changed, 167 insertions, 0 deletions
diff --git a/doc/source/admin/hardware-burn-in.rst b/doc/source/admin/hardware-burn-in.rst
new file mode 100644
index 000000000..0792d58ff
--- /dev/null
+++ b/doc/source/admin/hardware-burn-in.rst
@@ -0,0 +1,166 @@
+.. _hardware-burn-in:
+
+================
+Hardware Burn-in
+================
+
+Overview
+========
+
+Workflows to onboard new hardware often include a stress-testing step to
+provoke early failures, so that load-triggered issues surface before the
+nodes move to production rather than after. These ``burn-in`` tests
+typically cover CPU, memory, disk, and network. Starting with the Xena
+release, Ironic supports such tests as part of the cleaning framework.
+
+The burn-in steps rely on standard tools:
+`stress-ng <https://wiki.ubuntu.com/Kernel/Reference/stress-ng>`_ for CPU
+and memory, and `fio <https://fio.readthedocs.io/en/latest/>`_ for disk and
+network. The burn-in cleaning steps are part of the generic hardware manager
+in the Ironic Python Agent (IPA), so the agent ramdisk does not need to be
+bundled with a specific
+:ironic-python-agent-doc:`IPA hardware manager
+<admin/hardware_managers.html>` to make them available.
+
+Each burn-in step accepts (or, in the case of network, requires) some basic
+configuration options, mostly to limit the duration of the test and to
+specify the amount of resources to be used. The options are set in a node's
+``driver_info`` and are prefixed with ``agent_burnin_``. The options
+available for the individual tests are described below.
+
+CPU burn-in
+===========
+
+The option names follow the schema ``agent_burnin_`` + stress-ng stressor
+(``cpu``) + stress-ng option:
+
+* ``agent_burnin_cpu_timeout`` (in seconds; default: 24 hours)
+* ``agent_burnin_cpu_cpu`` (default: 0, meaning all CPUs)
+
+They limit the overall runtime and set the number of CPUs to stress.
+
+For instance, to limit the duration of the CPU burn-in to 10 minutes,
+run:
+
+.. code-block:: console
+
+ baremetal node set --driver-info agent_burnin_cpu_timeout=600 \
+ $NODE_NAME_OR_UUID
+
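+Several options can also be set in a single call, since ``--driver-info``
+may be repeated. A minimal sketch with purely illustrative values (a
+10-minute limit on 4 CPUs), followed by a check of what was stored on the
+node:
+
+.. code-block:: console
+
+ baremetal node set --driver-info agent_burnin_cpu_timeout=600 \
+     --driver-info agent_burnin_cpu_cpu=4 $NODE_NAME_OR_UUID
+ baremetal node show $NODE_NAME_OR_UUID -f json -c driver_info
+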
+Then launch the test with:
+
+.. code-block:: console
+
+ baremetal node clean \
+     --clean-steps '[{"step": "burnin_cpu", "interface": "deploy"}]' \
+     $NODE_NAME_OR_UUID
+
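+Manual cleaning is only accepted for nodes in the ``manageable`` provision
+state. The following is a sketch of the surrounding state changes, assuming
+the node starts out as ``available`` and should return there after the
+burn-in. The ``--wait`` option blocks until cleaning has finished; note that
+``provide`` runs automated cleaning first if it is enabled:
+
+.. code-block:: console
+
+ baremetal node manage $NODE_NAME_OR_UUID
+ baremetal node clean --wait \
+     --clean-steps '[{"step": "burnin_cpu", "interface": "deploy"}]' \
+     $NODE_NAME_OR_UUID
+ baremetal node provide $NODE_NAME_OR_UUID
+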
+Memory burn-in
+==============
+
+The option names follow the schema ``agent_burnin_`` + stress-ng stressor
+(``vm``) + stress-ng option:
+
+* ``agent_burnin_vm_timeout`` (in seconds; default: 24 hours)
+* ``agent_burnin_vm_vm-bytes`` (default: 98%)
+
+They limit the overall runtime and set the fraction of RAM to stress.
+
+For instance, to limit the duration of the memory burn-in to 1 hour
+and the amount of RAM to be used to 75%, run:
+
+.. code-block:: console
+
+ baremetal node set --driver-info agent_burnin_vm_timeout=3600 \
+ $NODE_NAME_OR_UUID
+ baremetal node set --driver-info agent_burnin_vm_vm-bytes=75% \
+ $NODE_NAME_OR_UUID
+
+Then launch the test with:
+
+.. code-block:: console
+
+ baremetal node clean \
+     --clean-steps '[{"step": "burnin_memory", "interface": "deploy"}]' \
+     $NODE_NAME_OR_UUID
+
+Disk burn-in
+============
+
+The option names follow the schema ``agent_burnin_`` + fio stressor
+(``fio_disk``) + fio option:
+
+* ``agent_burnin_fio_disk_runtime`` (default: 0, meaning no time limit)
+* ``agent_burnin_fio_disk_loops`` (default: 4)
+
+They set the time limit and the number of iterations when going
+over the disks.
+
+For instance, to limit the number of loops to 2, set:
+
+.. code-block:: console
+
+ baremetal node set --driver-info agent_burnin_fio_disk_loops=2 \
+ $NODE_NAME_OR_UUID
+
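+Both options can be combined. According to the fio documentation, a job
+ends when its workload completes or its ``runtime`` expires, whichever
+comes first. A sketch with illustrative values (2 loops, capped at one
+hour), assuming the runtime is handed to fio unchanged and therefore
+interpreted as seconds:
+
+.. code-block:: console
+
+ baremetal node set --driver-info agent_burnin_fio_disk_loops=2 \
+     --driver-info agent_burnin_fio_disk_runtime=3600 $NODE_NAME_OR_UUID
+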
+Then launch the test with:
+
+.. code-block:: console
+
+ baremetal node clean \
+     --clean-steps '[{"step": "burnin_disk", "interface": "deploy"}]' \
+     $NODE_NAME_OR_UUID
+
+
+Network burn-in
+===============
+
+Burning in the network needs a little more configuration, since the test
+runs between a pair of nodes. Each node needs the
+``agent_burnin_fio_network_config`` option, a JSON object with a ``role``
+field (``reader`` or ``writer``) and a ``partner`` field (the hostname of
+the other node in the pair). Replace ``$HOST1`` and ``$HOST2`` below with
+the actual hostnames, since the shell does not expand variables inside
+single quotes:
+
+.. code-block:: console
+
+ baremetal node set --driver-info \
+     agent_burnin_fio_network_config='{"role": "writer", "partner": "$HOST2"}' \
+     $NODE_NAME_OR_UUID1
+ baremetal node set --driver-info \
+     agent_burnin_fio_network_config='{"role": "reader", "partner": "$HOST1"}' \
+     $NODE_NAME_OR_UUID2
+
+In addition, and similar to the other tests, a runtime option can be set
+(on the writer only):
+
+.. code-block:: console
+
+ baremetal node set --driver-info agent_burnin_fio_network_runtime=600 \
+ $NODE_NAME_OR_UUID1
+
+Then launch the test with:
+
+.. code-block:: console
+
+ baremetal node clean \
+     --clean-steps '[{"step": "burnin_network", "interface": "deploy"}]' \
+     $NODE_NAME_OR_UUID1
+ baremetal node clean \
+     --clean-steps '[{"step": "burnin_network", "interface": "deploy"}]' \
+     $NODE_NAME_OR_UUID2
+
+Each node blocks until its partner shows up. If the partner never
+appears, the wait ends when the cleaning timeout expires.
+
+Additional Information
+======================
+
+All tests can be aborted at any moment with:
+
+.. code-block:: console
+
+ baremetal node abort $NODE_NAME_OR_UUID
+
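+To follow a running burn-in, or to see where an aborted one ended up, the
+node's provision state and last error can be queried. This is only a
+sketch; the values shown depend on the deployment and on when the command
+is run:
+
+.. code-block:: console
+
+ baremetal node show $NODE_NAME_OR_UUID -f json \
+     -c provision_state -c last_error
+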
+One can also launch multiple tests which will be run in sequence, e.g.:
+
+.. code-block:: console
+
+ baremetal node clean --clean-steps \
+     '[{"step": "burnin_cpu", "interface": "deploy"},
+       {"step": "burnin_memory", "interface": "deploy"}]' \
+     $NODE_NAME_OR_UUID
+
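+Quoting long JSON strings on the command line is error-prone. As an
+alternative sketch, ``--clean-steps`` also accepts the path to a file
+holding the steps (or ``-`` to read them from standard input). The file
+name ``burnin-steps.json`` is only an example:
+
+.. code-block:: console
+
+ cat > burnin-steps.json << 'EOF'
+ [{"step": "burnin_cpu", "interface": "deploy"},
+  {"step": "burnin_memory", "interface": "deploy"}]
+ EOF
+ baremetal node clean --clean-steps burnin-steps.json $NODE_NAME_OR_UUID
+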
+If desired, configuring ``fast-track`` may be helpful here, as it keeps
+the node up between consecutive calls of ``baremetal node clean``.
diff --git a/doc/source/admin/index.rst b/doc/source/admin/index.rst
index 86b1e55ff..28f55d021 100644
--- a/doc/source/admin/index.rst
+++ b/doc/source/admin/index.rst
@@ -31,6 +31,7 @@ the services.
Fast-Track Deployment <fast-track>
Booting a Ramdisk or an ISO <ramdisk-boot>
Deploying with anaconda deploy interface <anaconda-deploy-interface>
+ Hardware Burn-in <hardware-burn-in>
Drivers, Hardware Types and Hardware Interfaces
-----------------------------------------------