nspawn: make sure host root can write to the uidmapped mounts we prepare for the container payload

When using user namespaces in conjunction with uidmapped mounts, nspawn so far set up two uidmappings:

1. One that is used for the uidmapped mount and that maps the UID range 0…65535 on the backing fs to some high UID range X…X+65535 on the uidmapped fs. (Let's call this mapping the "mount mapping".)

2. One that is used for the user namespace the container payload processes run in, that maps X…X+65535 back to 0…65535. (Let's call this one the "process mapping".)

These mappings hence are pretty much identical, one just moves things up and one back down. (Reminder: we do all this so that the processes can run under high UIDs while running off file systems that require no recursive chown()ing, i.e. we want processes with high UID range but files with low UID range.)

This creates one problem, i.e. issue #20989: if nspawn (which runs as host root, i.e. host UID 0) wants to add inodes to the uidmapped mount it can't do that, since host UID 0 is not defined in the mount mapping (only the X…X+65535 range is, after all, and X > 0), and processes whose UID is not mapped in a uidmapped fs cannot create inodes in it since those would be owned by an unmapped UID, which then triggers the famous EOVERFLOW error.

Let's fix this, by explicitly including an entry for the host UID 0 in the mount mapping. Specifically, we'll extend the mount mapping to map UID 2147483646 (which is INT32_MAX-1, see code for an explanation why I picked this one) of the backing fs to UID 0 on the uidmapped fs. This way nspawn can create inodes on the uidmapped fs as it likes (which will then actually be owned by UID 2147483646 on the backing fs), and as it always did.

Note that we do *not* create a similar entry in the process mapping. Thus any files created by nspawn that way (and not chown()ed to something better) will appear as unmapped (i.e. as overflowuid/"nobody") in the container payload. And that's good.

Of course, the latter is mostly theoretical, as nspawn should generally chown() the inodes it creates to UID ranges that actually make sense for the container (and we generally already do this correctly), but it's good to know that we are safe here, given we might accidentally forget to chown() some inodes we create.

Net effect: the two mappings will not be identical anymore. The mount mapping has one entry more, and the only reason it exists is so that nspawn can access the uidmapped fs reasonably independently from any process mapping.

Fixes: #20989
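
The effect of the extra mount mapping entry can be modelled in a few lines of C. This is only an illustrative sketch, not nspawn's code: the uid_map_entry type and both helper functions are hypothetical names, the shift X is passed in as an arbitrary example value, and a failed lookup merely stands in for the case where the kernel would refuse inode creation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one mapping entry: 'count' consecutive UIDs starting
 * at 'backing_start' on the backing fs show up starting at 'mapped_start' on
 * the uidmapped fs. */
typedef struct {
    uint32_t backing_start;
    uint32_t mapped_start;
    uint32_t count;
} uid_map_entry;

/* Translate a UID as seen on the uidmapped fs into the UID stored on the
 * backing fs. Returns false if no entry covers the UID, i.e. it is unmapped. */
static bool to_backing_uid(const uid_map_entry *map, size_t n, uint32_t uid, uint32_t *ret) {
    for (size_t i = 0; i < n; i++)
        if (uid >= map[i].mapped_start && uid - map[i].mapped_start < map[i].count) {
            *ret = map[i].backing_start + (uid - map[i].mapped_start);
            return true;
        }
    return false;
}

/* Checks the behaviour described in the commit message for an example shift
 * x (assumed: x > 0 and x + 65535 does not overflow). */
static void check_mount_mappings(uint32_t x) {
    const uid_map_entry old_map[] = {
        { 0, x, 65536 },           /* backing 0…65535 -> X…X+65535 */
    };
    const uid_map_entry new_map[] = {
        { 0, x, 65536 },           /* same as before */
        { 2147483646U, 0, 1 },     /* backing INT32_MAX-1 -> host UID 0 */
    };
    uint32_t backing = 0;

    /* Old mount mapping: host UID 0 is not covered by any entry. */
    assert(!to_backing_uid(old_map, 1, 0, &backing));

    /* New mount mapping: host UID 0 is stored as 2147483646 on the backing fs. */
    assert(to_backing_uid(new_map, 2, 0, &backing));
    assert(backing == 2147483646U);

    /* Payload UIDs are translated exactly as before. */
    assert(to_backing_uid(new_map, 2, x + 1000, &backing));
    assert(backing == 1000);
}
```

Calling check_mount_mappings(524288) from a test program exercises all three cases. The process mapping is deliberately left out of the model, since it gains no entry for host UID 0.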
author: Lennart Poettering <lennart@poettering.net> 2022-03-17 13:46:12 +0100
committer: Lennart Poettering <lennart@poettering.net> 2022-03-17 19:08:12 +0100
commit: 50ae2966d20b0b4a19def060de3b966b7a70b54a (patch)
tree: d0c072dfc682f5d2e39439d8b664c76a359eba37 /src/basic/user-util.h
parent: 264caae299aa8f42f20460ad3280add657a3747f (diff)
download: systemd-50ae2966d20b0b4a19def060de3b966b7a70b54a.tar.gz
1 files changed, 13 insertions, 0 deletions
diff --git a/src/basic/user-util.h b/src/basic/user-util.h
index 40979d1080..e1692c4f66 100644
--- a/src/basic/user-util.h
+++ b/src/basic/user-util.h
@@ -67,6 +67,19 @@ int take_etc_passwd_lock(const char *root);
 #define UID_NOBODY ((uid_t) 65534U)
 #define GID_NOBODY ((gid_t) 65534U)
 
+/* If REMOUNT_IDMAP_HOST_ROOT is set for remount_idmap() we'll include a mapping here that maps the host root
+ * user accessing the idmapped mount to the this user ID on the backing fs. This is the last valid UID in the
+ * *signed* 32bit range. You might wonder why precisely use this specific UID for this purpose? Well, we
+ * definitely cannot use the first 0…65536 UIDs for that, since in most cases that's precisely the file range
+ * we intend to map to some high UID range, and since UID mappings have to be bijective we thus cannot use
+ * them at all. Furthermore the UID range beyond INT32_MAX (i.e. the range above the signed 32bit range) is
+ * icky, since many APIs cannot use it (example: setfsuid() returns the old UID as signed integer). Following
+ * our usual logic of assigning a 16bit UID range to each container, so that the upper 16bit of a 32bit UID
+ * value indicate kind of a "container ID" and the lower 16bit map directly to the intended user you can read
+ * this specific UID as the "nobody" user of the container with ID 0x7FFF, which is kinda nice. */
+#define UID_MAPPED_ROOT ((uid_t) (INT32_MAX-1))
+#define GID_MAPPED_ROOT ((gid_t) (INT32_MAX-1))
+
 #define ETC_PASSWD_LOCK_PATH "/etc/.pwd.lock"
 
 /* The following macros add 1 when converting things, since UID 0 is a valid UID, while the pointer
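
The reasoning in the new comment about reading INT32_MAX-1 as "nobody" of container 0x7FFF can be checked with a little bit arithmetic. The following is a minimal sketch, assuming the 16bit-per-container split the comment describes. The EXAMPLE_ macros and the helper functions are hypothetical names used only here, not part of user-util.h.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the values discussed in the patch. */
#define EXAMPLE_UID_NOBODY      65534U
#define EXAMPLE_UID_MAPPED_ROOT ((uint32_t) (INT32_MAX - 1))

/* Upper 16 bits: the "container ID" in the comment's reading. */
static uint32_t uid_container_id(uint32_t uid) {
    return uid >> 16;
}

/* Lower 16 bits: the user within that container's 16bit range. */
static uint32_t uid_within_container(uint32_t uid) {
    return uid & 0xFFFFU;
}

/* True if the UID still fits into the signed 32bit range. */
static int uid_fits_signed_32bit(uint32_t uid) {
    return uid <= (uint32_t) INT32_MAX;
}

/* Compile-time checks: 2147483646 is 0x7FFFFFFE, i.e. container 0x7FFF,
 * user 0xFFFE, and 0xFFFE is 65534. */
static_assert((EXAMPLE_UID_MAPPED_ROOT >> 16) == 0x7FFFU,
              "upper 16 bits should be 0x7FFF");
static_assert((EXAMPLE_UID_MAPPED_ROOT & 0xFFFFU) == EXAMPLE_UID_NOBODY,
              "lower 16 bits should equal the nobody UID");
```

The static_assert lines hold at compile time. The helpers can be applied to other UIDs to see which container range they would fall into under the same convention.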