shared: use move_pivot_root() for services

Currently, services use mount_move_root() to set up the root directory of services that use a mount namespace. This relies on MS_MOVE and chroot(). However, this has serious drawbacks even for relatively simple mount propagation scenarios. What systemd currently does is roughly equivalent to the following shell code:

    unshare --mount --propagation=shared
    cd /
    mount --make-rslave /
    mkdir /new-root
    mount --rbind / /new-root
    cd /new-root
    mount --move /new-root /
    chroot .

This looks simple enough, but it has the consequence that two separate mount trees exist for the lifetime of the service. The first one was created when the mount namespace was created, and the second one when a new mount for the rootfs was created. The first mount tree sticks around as a shadow mount tree. Both mount trees are dependent mounts with the host rootfs as their dominating mount.

Now, when mount propagation is triggered by the host, e.g. by

    mount --bind /opt /mnt

two propagation events are generated. I'm skipping over the exact kernel details as they aren't that important. The gist is that for every propagation event that is generated, a second one is generated for the shadow mount tree. In other words, the kernel creates two copies of each propagated mount instead of one.

This isn't necessary. We can simply change the sequence above to:

    unshare --mount --propagation=shared
    cd /
    mount --make-rslave /
    mkdir /new-root
    # stash fd to old rootfs
    # stash fd to new rootfs
    mount --rbind / /new-root
    cd /new-root
    pivot_root . .
    # new root is tucked under old root
    # chdir into old rootfs via stashed fd
    umount -l /old-root

The pivot_root allows us to get rid of the old mount tree that was created when the mount namespace was created. So after this sequence only one mount tree is alive. Plus, it's safer and nicer. Moving mounts isn't pleasant.
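For readers who prefer C to shell, the pivot_root sequence described above can be sketched roughly as follows. This is an illustration only, not the actual mount_pivot_root() implementation: the function name pivot_root_sketch() is hypothetical, and the sketch assumes the caller is already in its own mount namespace, has the needed privileges, and passes a new_root that is a mount point. The MS_SLAVE remount of the old root is an added precaution so that the lazy unmount cannot propagate back to the host.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper for illustration. Returns 0 on success or a
 * negative errno value on failure. */
int pivot_root_sketch(const char *new_root) {
        int fd_old, fd_new, r = 0;

        /* Stash an fd to the old rootfs. */
        fd_old = open("/", O_PATH | O_DIRECTORY | O_CLOEXEC);
        if (fd_old < 0)
                return -errno;

        /* Stash an fd to the new rootfs. */
        fd_new = open(new_root, O_PATH | O_DIRECTORY | O_CLOEXEC);
        if (fd_new < 0) {
                r = -errno;
                close(fd_old);
                return r;
        }

        if (fchdir(fd_new) < 0 ||                        /* cd /new-root */
            syscall(SYS_pivot_root, ".", ".") < 0 ||     /* pivot_root . . */
            fchdir(fd_old) < 0 ||                        /* chdir into old rootfs via stashed fd */
            mount(NULL, ".", NULL, MS_SLAVE | MS_REC, NULL) < 0 || /* keep the unmount local */
            umount2(".", MNT_DETACH) < 0 ||              /* lazily detach the old mount tree */
            fchdir(fd_new) < 0)                          /* end up inside the new rootfs */
                r = -errno;

        close(fd_old);
        close(fd_new);
        return r;
}
```

Stashing both file descriptors up front is what makes `pivot_root(".", ".")` workable here: no temporary directory for the old root is needed, and the old mount tree can still be reached and detached afterwards.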

This patch doesn't convert nspawn yet, as its requirements are trickier: it wants to preserve the rootfs as a shared mount, which goes against the requirements of pivot_root().

Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
author: Christian Brauner <brauner@kernel.org> 2022-11-23 16:15:20 +0100
committer: Luca Boccassi <luca.boccassi@gmail.com> 2022-11-24 10:58:26 +0100
commit: 2e776ed6c8649d5991de5d2a7c0334a77485456c (patch)
tree: 97576a887ba4c1c23cbaa67fee69ceb10af74bc8 /src/core/namespace.c
parent: 00a60eaf5fcb3a0e415349aa649f2699550d26b0 (diff)
download: systemd-2e776ed6c8649d5991de5d2a7c0334a77485456c.tar.gz
1 files changed, 2 insertions, 2 deletions
diff --git a/src/core/namespace.c b/src/core/namespace.c
index 7752e48fb0..c0d0cc9715 100644
--- a/src/core/namespace.c
+++ b/src/core/namespace.c
@@ -2486,7 +2486,7 @@ int setup_namespace(
                 goto finish;
 
         /* MS_MOVE does not work on MS_SHARED so the remount MS_SHARED will be done later */
-        r = mount_move_root(root);
+        r = mount_pivot_root(root);
         if (r == -EINVAL && root_directory) {
                 /* If we are using root_directory and we don't have privileges (ie: user manager in a user
                  * namespace) and the root_directory is already a mount point in the parent namespace,
@@ -2496,7 +2496,7 @@ int setup_namespace(
                 r = mount_nofollow_verbose(LOG_DEBUG, root, root, NULL, MS_BIND|MS_REC, NULL);
                 if (r < 0)
                         goto finish;
-                r = mount_move_root(root);
+                r = mount_pivot_root(root);
         }
         if (r < 0) {
                 log_debug_errno(r, "Failed to mount root with MS_MOVE: %m");