TODO: add "Improve Shutdown of NetworkManager" item

author: Thomas Haller <thaller@redhat.com> 2022-02-18 12:38:23 +0100
committer: Thomas Haller <thaller@redhat.com> 2022-02-18 13:00:35 +0100
commit: 037dfe5ac14891d260798726965281c93256c25d (patch)
tree: fcf8e012170cd888529073181403006698c7839c
parent: 14a599539530bd553de56fbb0514368b6ed1c43c (diff)
download: NetworkManager-037dfe5ac14891d260798726965281c93256c25d.tar.gz
1 files changed, 115 insertions, 8 deletions
diff --git a/TODO b/TODO
index efb1e909e1..d2c6d10b60 100644
--- a/TODO
+++ b/TODO
@@ -2,7 +2,107 @@ So you're interested in hacking on NetworkManager?  Here's some cool
 stuff you could do...
 
 
-* Use netlink API instead of ioctl based ethtool.
+Improve Shutdown of NetworkManager
+==================================
+
+NetworkManager quits when receiving SIGTERM.
+
+Currently, it stops iterating the GMainContext (g_main_loop_quit()) and performs
+some synchronous cleanup actions.
+
+That is problematic for the following reasons.
+
+- We generally avoid blocking operations in NetworkManager (except currently during shutdown).
+  Hence it's normal at any time to have async operations pending. Async operations
+  with glib basically mean that we will receive a callback from the mainloop. For that
+  to work, we need to keep iterating the GMainContext. If we stop iterating,
+  we cannot cleanup the pending operations and leak resources. It's not possible
+  to free all resources, unless we iterate as long as we have pending operations.
+
+  That is because even if you g_cancellable_cancel() an sync operation, you still
+  get a callback. The fact that an async operation will always get (one) callback
+  invocation is an important guarantee in glib. If we no longer have that guarantee,
+  it would be effectively impossible to implement cancellation and proper cleanup
+  and it would require to do that for all async operations (changing the guaranteed
+  semantics of all async operations).
+
+  Often it wouldn't matter whether we free all resources during shutdown. However,
+  unless we have a strict policy and method for freeing all, we will inevitably
+  leak resources where it does matter.
+
+  It's anyway hard to move from a "running" state to a "shutdown" state. It's
+  impossible to get right, if we have pending async operations that no longer can
+  complete.
+
+- Once we stop iterating the mainloop, we also cannot make async operations anymore.
+  This reduces our shutdown to blocking operations (or a string of async operations that
+  get chained together to one blocking operation, e.g. by using a separate GMainContext).
+  This is very limiting, also because it's getting really hard to do things in
+  parallel (unless you strongly intertwine them or essentially re-implement a
+  main loop). Doing things in parallel will be necessary, for example if deactivate
+  two devices, then both should shutdown in parallel.
+
+The real problem is that our shutdown is really messy due to this. And this is a
+fundamental limitation of the current implementation.
+
+The solution will be the following.
+
+When we receive SIGTERM we go into shutdown mode. This may mean to reject new D-Bus
+requests and in general to move into a shutdown state. All the while we keep iterating
+the GMainContext, but we also start to tear down and cancel/complete pending operations.
+While we do that, we may need to start new async operations. For example, during
+shutdown we may want to kill dnsmasq, which itself is a new asynchronous operation.
+
+The API nm_shutdown_wait_obj_register_object() and family allow for things to register
+themselves to block shutdown. This works using weak pointers. Basically, NetworkManager will
+keep iterating the GMainContext as long as we have objects registered there. While shutting
+down, we expect those objects to complete and unregister themselves.
+
+Currently, our singleton objects (NM_DEFINE_SINGLETON_REGISTER) get unrefed after
+the `main()` functions. For some/all of those singletons, during SIGTERM we may
+want to register them as nm_shutdown_wait_obj_register_object() and unref them when
+we initiate the shutdown.
+Singletons also use weak pointers and can work together with nm_shutdown_wait_obj_register_object().
+For that to work, we need that nobody is calling the singleton getter *after* shutdown
+starts. That means, instead of using the singleton getter, you need to get the reference
+from somebody. For example, NMDevice has a reference to a NMNetns and NMPlatform
+and should use those instead of NM_PLATFORM_GET(). For those singeltons that works
+this way (maybe all of them), the singleton getters only works reliably before
+shutdown starts. And no singleton getters work reliably after the main() function
+because singletons unref themselves. In general, avoid singleton getters and see
+that somebody hands you a reference.
+
+After NM_SHUTDOWN_TIMEOUT_MS we loose patience that it's taking too long.
+We now log a debug message about who is still blocking shutdown.
+We also cancel the cancellables from nm_shutdown_wait_obj_register_cancellable()
+and give NM_SHUTDOWN_TIMEOUT_MS_WATCHDOG more time. If we then are still
+not complete, we log an error message about who is still blocking shutdown,
+and just exit with an assertion failure. We encountered a bug.
+
+This means, *all* async operations in NetworkManager must either be cancellable (and
+afterwards complete fast) or they must not take long to begin with. In particular,
+every individual async operation must be completed in at most NM_SHUTDOWN_TIMEOUT_MS,
+and all async cleanup operations must complete in NM_SHUTDOWN_TIMEOUT_MS too.
+
+So if you make an async operation not cancellable, but guarantee that you don't take
+longer than NM_SHUTDOWN_TIMEOUT_MS you are mostly fine (better would be to actually
+complete fast, if you can). That's why reaching NM_SHUTDOWN_TIMEOUT_MS timeout is
+still not a bug scenario. But reaching NM_SHUTDOWN_TIMEOUT_MS+NM_SHUTDOWN_TIMEOUT_MS_WATCHDOG
+is a bug.
+
+As NM_SHUTDOWN_TIMEOUT_MS and nm_shutdown_wait_obj_register_object() API already exists,
+the first step is to ensure that all parts of NetworkManager can be shutdown and be terminated
+in a timely manner.
+
+The second step is to replace the current sync cleanup operations with iterating the
+GMainContext. This is gonna be difficult.
+
+Search for `FIXME(shutdown` for places that are related to this effort and that need
+consideration.
+
+
+Use netlink API instead of ioctl based ethtool
+==============================================
 
 NetworkManager uses ethtool API to set/obtain certain settings of network
 devices. This is an ioctl based API and implmented in "src/platform/nm-platform-utils.c".
@@ -15,14 +115,16 @@ also implements this API, however it is under an incompatible license,
 so better don't look and make sure not to use the code.
 
 
-* Add 802-1x capability to nmtui.
+Add 802-1x capability to nmtui
+==============================
 
 Add dialogs to nmtui for 802-1x. This will be useful for ethernet (with 802-1x
 port authentication), enterprise Wi-Fi and MACSec. From the GUI and dialog design,
 possibly get inspired by nm-connection-editor.
 
 
-* Ethernet Network Auto-detection
+Ethernet Network Auto-detection
+===============================
 
 There are various methods we can use to autodetect which wired network connection
 to use if the user connects to more than one wired network on a frequent basis.
@@ -80,7 +182,8 @@ un-authenticated connections and that additional credentials are required to
 successfully connect to this network.
 
 
-* VPN re-connect (bgo #349151)
+VPN re-connect (bgo #349151)
+============================
 
 NM should remember whether a VPN was connected if a connection disconnects
 (like Wi-Fi drops out or short carrier drop) or if the laptop goes to sleep.
@@ -90,7 +193,8 @@ the VPN because Wi-Fi choked for 10 seconds, but reconnect the VPN if it was
 connected before the drop.
 
 
-* VPN IP Methods
+VPN IP Methods
+==============
 
 Some VPNs (openvpn with TAP for example) require that DHCP is run on a
 pseudo-ethernet device to obtain addressing information.  Currenty, this is not
@@ -133,7 +237,8 @@ failure of the VPN connection, just like DHCP timeouts and lease-renewal
 failures do for other devices (see dhcp_state_changed() in nm-device.c).
 
 
-* VPN Service Daemon Secret Requests
+VPN Service Daemon Secret Requests
+==================================
 
 In addition to NM asking the service daemons whether more secrets are required,
 VPN service daemons (like nm-vpnc-service, nm-openvpn-service, etc) should be
@@ -171,7 +276,8 @@ challenge-response and does not use the "--non-inter" flag which suppresses that
 behavior.
 
 
-* WPS
+WPS
+===
 
 wpa_supplicant has support for WPS (Wifi Protected Setup, basically Bluetooth-
 like PIN codes for setting up a wifi connection) and we should add support for
@@ -215,7 +321,8 @@ because the user has no physical access to the router itself, but has been given
 as passphrase/PSK instead.
 
 
-* Better Tablet/Mobile Behavior
+Better Tablet/Mobile Behavior
+=============================
 
 There are a few components to this:
author	Thomas Haller <thaller@redhat.com>	2022-02-18 12:38:23 +0100
committer	Thomas Haller <thaller@redhat.com>	2022-02-18 13:00:35 +0100
commit	037dfe5ac14891d260798726965281c93256c25d (patch)
tree	fcf8e012170cd888529073181403006698c7839c
parent	14a599539530bd553de56fbb0514368b6ed1c43c (diff)
download	NetworkManager-037dfe5ac14891d260798726965281c93256c25d.tar.gz