author Julia Kreger <juliaashleykreger@gmail.com> 2020-07-29 08:38:22 -0700
committer Julia Kreger <juliaashleykreger@gmail.com> 2020-09-04 15:01:19 +0000
commit fa383c916aa15bad10825bc457b1e0ec3cee1dea
tree 26859aabdad0bcd4f1192965052e35217647a17a /doc
parent ebae6a40f1131f2b4602e7deddd2a8bd6382ae22
Detail iPXE + LACP troubleshooting information
Please consider providing lolcat gifs to the fund to help ironic developers recover from the headaches of iPXE + LACP.

In all seriousness, we needed to document this headache and it does so at a fairly high level so we are not shaming anything specifically.

Change-Id: Ic792697a0574e45723c8076002aa802ad22b3d54
Diffstat (limited to 'doc')
-rw-r--r-- doc/source/admin/troubleshooting.rst 46
1 files changed, 46 insertions, 0 deletions
diff --git a/doc/source/admin/troubleshooting.rst b/doc/source/admin/troubleshooting.rst
index 0c29343c8..db81d81f0 100644
--- a/doc/source/admin/troubleshooting.rst
+++ b/doc/source/admin/troubleshooting.rst
@@ -388,6 +388,52 @@ do that for a Cisco Nexus switch is:
$ (config) interface eth1/11
$ (config-if) spanning-tree port type edge
+Why does X issue occur when I am using LACP bonding with iPXE?
+==============================================================
+
+If you are using iPXE, an unfortunate aspect of its design is that it
+automatically responds to remote switches as a Link Aggregation Control
+Protocol (LACP) peer. iPXE does this only for the single port which is
+used for network booting.
+
+In theory, this may help establish the port link-state faster with some
+switch vendors, but as far as the Ironic developers are aware, the
+official reasoning is not documented for iPXE. The end result is that once
+iPXE stops responding to LACP messages from the peer port, which
+occurs when iPXE boots a ramdisk and hands control over
+to a full operating system, switches typically begin a
+timer to determine how to handle the failure. This is because,
+depending on the mode of LACP, the silence can be interpreted as a switch
+or network fabric failure.
+
+This may manifest as any number of behaviors or issues, from ramdisks
+being unable to acquire DHCP addresses over the network interface,
+to downloads abruptly stalling, to minor issues such as LLDP port data
+being unavailable in introspection.
+
+Overall:
+
+* Ironic's agent doesn't officially support LACP and the Ironic community
+ generally believes using it may cause more problems than it would solve.
+ During the Victoria development cycle, we added retry logic for most
+ actions in an attempt to outlast the worst-known default hold-down
+ timers, so that a deployment does not fail due to a short-lived
+ network connectivity failure caused by a switch port having
+ moved to a temporary blocking state. Where applicable and possible,
+ many of these patches have been backported to supported releases.
+ However, users of the iSCSI deployment interface will see the least
+ capability for these sorts of situations to be handled
+ automatically. These patches also require that the switch port
+ eventually falls back to a non-bonded mode. If the port remains in a
+ blocking state, then traffic will be unable to flow and the deployment is
+ likely to time out.
+* If you must use LACP, consider ``passive`` LACP negotiation settings
+ in the network switch as opposed to ``active``. With ``passive``, the
+ switch expects the connected device, most likely a server, to
+ request that the Link Aggregate be established, instead of
+ treating the device as if it were possibly another switch.
+* Consult your switch vendor's support forums. Some vendors have recommended
+ port settings for booting machines using iPXE with their switches.
IPMI errors
===========