path: root/src/shared/mount-util.h
Commit message (Author, Age, Files, Lines)
* nspawn: make sure host root can write to the uidmapped mounts we prepare for the container payload (Lennart Poettering, 2022-03-17, 1 file, -1/+12)

  When using user namespaces in conjunction with uidmapped mounts, nspawn so far set up two uidmappings:

  1. One that is used for the uidmapped mount and that maps the UID range 0…65535 on the backing fs to some high UID range X…X+65535 on the uidmapped fs. (Let's call this mapping the "mount mapping".)

  2. One that is used for the user namespace the container payload processes run in, that maps X…X+65535 back to 0…65535. (Let's call this one the "process mapping".)

  These mappings hence are pretty much identical, one just moves things up and one back down. (Reminder: we do all this so that the processes can run under high UIDs while running off file systems that require no recursive chown()ing, i.e. we want processes with high UID range but files with low UID range.)

  This creates one problem, i.e. issue #20989: if nspawn (which runs as host root, i.e. host UID 0) wants to add inodes to the uidmapped mount it can't do that, since host UID 0 is not defined in the mount mapping (only the X…X+65535 range is, after all, and X > 0), and processes whose UID is not mapped in a uidmapped fs cannot create inodes in it since those would be owned by an unmapped UID, which then triggers the famous EOVERFLOW error.

  Let's fix this, by explicitly including an entry for the host UID 0 in the mount mapping. Specifically, we'll extend the mount mapping to map UID 2147483646 (which is INT32_MAX-1, see code for an explanation why I picked this one) of the backing fs to UID 0 on the uidmapped fs. This way nspawn can create inodes on the uidmapped fs as it likes (which will then actually be owned by UID 2147483646 on the backing fs), and as it always did.

  Note that we do *not* create a similar entry in the process mapping. Thus any files created by nspawn that way (and not chown()ed to something better) will appear as unmapped (i.e. as overflowuid/"nobody") in the container payload. And that's good. Of course, the latter is mostly theoretical, as nspawn should generally chown() the inodes it creates to UID ranges that actually make sense for the container (and we generally already do this correctly), but it's good to know that we are safe here, given we might accidentally forget to chown() some inodes we create.

  Net effect: the two mappings will not be identical anymore. The mount mapping has one entry more, and the only reason it exists is so that nspawn can access the uidmapped fs reasonably independently from any process mapping.

  Fixes: #20989
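To make the two mappings concrete, here is a minimal Python sketch of the UID translation the commit above describes. It is a model only, not nspawn's code: the helper names, the example base UID standing in for X, and the overflow UID of 65534 (the usual kernel default) are assumptions for illustration.

```python
OVERFLOW_UID = 65534             # assumed: default kernel overflowuid ("nobody")
UID_RANGE = 65536                # size of the container UID range
NSPAWN_BACKING_UID = 2147483646  # INT32_MAX - 1, as named in the commit message
EXAMPLE_BASE = 1000 * 65536      # hypothetical value for X


def mount_mapping(base, with_host_root):
    """Return (backing_start, mapped_start, count) entries of the mount mapping."""
    entries = [(0, base, UID_RANGE)]
    if with_host_root:
        # The extra entry added by the fix: backing UID INT32_MAX-1 <-> host UID 0.
        entries.append((NSPAWN_BACKING_UID, 0, 1))
    return entries


def to_backing(entries, uid):
    """Translate a UID on the uidmapped fs to the backing fs, or None if unmapped."""
    for backing_start, mapped_start, count in entries:
        if mapped_start <= uid < mapped_start + count:
            return backing_start + (uid - mapped_start)
    return None  # creating an inode as this UID would fail (EOVERFLOW)


def to_container(base, host_uid):
    """Translate a host UID through the process mapping (X..X+65535 -> 0..65535)."""
    if base <= host_uid < base + UID_RANGE:
        return host_uid - base
    return OVERFLOW_UID


if __name__ == "__main__":
    old = mount_mapping(EXAMPLE_BASE, with_host_root=False)
    new = mount_mapping(EXAMPLE_BASE, with_host_root=True)
    print("host root, old mapping:", to_backing(old, 0))
    print("host root, new mapping:", to_backing(new, 0))
    print("host root seen in container:", to_container(EXAMPLE_BASE, 0))
```

In this model, host UID 0 is unmapped under the old mount mapping, lands on the backing fs as UID 2147483646 under the new one, and still appears as the overflow UID inside the container because the process mapping has no matching entry.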
* tree-wide: move `unsigned` to the start of type declaration (Frantisek Sumsal, 2022-02-10, 1 file, -1/+1)

  Even though ISO C11 doesn't mandate in which order the type specifiers should appear, having `unsigned` at the beginning of each type declaration feels more natural and, more importantly, it unbreaks Coccinelle, which has a hard time parsing `long unsigned` and others:

  ```
  init_defs_builtins: /usr/lib64/coccinelle/standard.h
  init_defs: /home/mrc0mmand/repos/systemd/coccinelle/macros.h
  HANDLING: src/shared/mount-util.c
  : 1: strange type1, maybe because of weird order: long unsigned
  ```

  Most of the codebase already "complies", so let's fix the remaining "offenders".
* Bump the max number of inodes for /dev to a million (Zbigniew Jędrzejewski-Szmek, 2021-12-09, 1 file, -2/+2)

  4c733d3046942984c5f73b40c3af39cc218c103f shows that 95k can be used easily on a large system. Let's bump it up even more so that we have some "breathing room".
* Bump the max number of inodes for /dev to 128k (Franck Bui, 2021-12-03, 1 file, -2/+2)

  Follow-up for 7d85383edbab73274dc81cc888d884bb01070bc2.

  Apparently the previous limit set on the max number of inodes for /dev was too small, as a system with 4096 LUNs attached can consume up to 95k inodes for symlinks:

      # /bin/df -i
      Filesystem       Inodes IUsed    IFree IUse% Mounted on
      devtmpfs       49274377 95075 49179302    1% /dev

  Hence this patch bumps the limit from 64k to 128k, although the new limit is still pretty arbitrary (that said, not sure if it really makes sense to set such an absolute limit).
* mount-util: move opening of /proc/self/mountinfo into bind_remount_one_with_mountinfo() (Lennart Poettering, 2021-10-25, 1 file, -1/+5)

  Let's move things around a bit, and open /proc/self/mountinfo if needed inside of bind_remount_one_with_mountinfo(). That way bind_remount_one() can become a superthin inline wrapper around bind_remount_one_with_mountinfo().

  Main benefit is that we don't even have to open /proc/self/mountinfo in case mount_setattr() actually worked for us.
* basic,shared: move make_mount_point_inode_*() to shared/ (Zbigniew Jędrzejewski-Szmek, 2021-06-23, 1 file, -0/+5)

  Those pull in selinux for labelling, and we should avoid selinux in basic/.
* test-mount-util: add output test for mount_flags_to_string() (Zbigniew Jędrzejewski-Szmek, 2021-06-22, 1 file, -0/+1)
* mount-util: add a helper that can add an idmap to an existing mount (Lennart Poettering, 2021-05-07, 1 file, -0/+2)

  This makes use of the new kernel 5.12 APIs to add an idmap to a mount point. It does so by cloning the mountpoint, changing it, and then unmounting the old mountpoint, replacing it later with the new one.
* mount-util: add helper that ensures something is a mount point (Lennart Poettering, 2021-05-07, 1 file, -0/+2)
* mount-util: make umount_and_rmdir_and_freep() cleanup handler deal with NULL (Lennart Poettering, 2021-04-20, 1 file, -2/+4)
* tree-wide: reset the cleaned-up variable in cleanup functions (Zbigniew Jędrzejewski-Szmek, 2021-02-16, 1 file, -1/+1)

  If the cleanup function returns the appropriate type, use that to reset the variable. For other functions (usually the foreign ones which return void), add an explicit value to reset to.

  This causes a bit of code churn, but I think it might be worth it. In a following patch static destructors will be called from a fuzzer, and this change allows them to be called multiple times. But I think such a change might help with detecting uninitialized code reuse too. We hit various bugs like this, and things are more obvious when a pointer has been set to NULL.

  I was worried whether this change increases text size, but it doesn't seem to:

  -Dbuildtype=debug:

      before "tree-wide: return NULL from freeing functions":
      -rwxrwxr-x 1 zbyszek zbyszek 4117672 Feb 16 14:36 build/libsystemd.so.0.30.0*
      -rwxrwxr-x 1 zbyszek zbyszek 4494520 Feb 16 15:06 build/systemd*

      after "tree-wide: return NULL from freeing functions":
      -rwxrwxr-x 1 zbyszek zbyszek 4117672 Feb 16 14:36 build/libsystemd.so.0.30.0*
      -rwxrwxr-x 1 zbyszek zbyszek 4494576 Feb 16 15:10 build/systemd*

      now:
      -rwxrwxr-x 1 zbyszek zbyszek 4117672 Feb 16 14:36 build/libsystemd.so.0.30.0*
      -rwxrwxr-x 1 zbyszek zbyszek 4494640 Feb 16 15:15 build/systemd*

  -Dbuildtype=release:

      before "tree-wide: return NULL from freeing functions":
      -rwxrwxr-x 1 zbyszek zbyszek 5252256 Feb 14 14:47 build-rawhide/libsystemd.so.0.30.0*
      -rwxrwxr-x 1 zbyszek zbyszek 1834184 Feb 16 15:09 build-rawhide/systemd*

      after "tree-wide: return NULL from freeing functions":
      -rwxrwxr-x 1 zbyszek zbyszek 5252256 Feb 14 14:47 build-rawhide/libsystemd.so.0.30.0*
      -rwxrwxr-x 1 zbyszek zbyszek 1834184 Feb 16 15:10 build-rawhide/systemd*

      now:
      -rwxrwxr-x 1 zbyszek zbyszek 5252256 Feb 14 14:47 build-rawhide/libsystemd.so.0.30.0*
      -rwxrwxr-x 1 zbyszek zbyszek 1834184 Feb 16 15:16 build-rawhide/systemd*

  I would expect that the compiler would be able to elide the setting of a variable if the variable is never used again. And this seems to be the case: in optimized builds there is no change in size whatsoever. And the change in size in unoptimized builds is negligible.

  Something strange is happening with the size of libsystemd: it's bigger in optimized builds. Something to figure out, but unrelated to this patch.
* mount-util: add helper to mount image inside live namespace (Luca Boccassi, 2021-01-21, 1 file, -0/+2)
* machine/basic: factor out helper function to add airlocked mount to namespace (Luca Boccassi, 2021-01-18, 1 file, -0/+2)
* mount-util: fix typo (Yu Watanabe, 2020-12-09, 1 file, -1/+1)
* mount-util: use mfree() (Yu Watanabe, 2020-11-27, 1 file, -2/+2)
* license: LGPL-2.1+ -> LGPL-2.1-or-later (Yu Watanabe, 2020-11-09, 1 file, -1/+1)
* test: add heavy load loopback block device test (Lennart Poettering, 2020-10-22, 1 file, -1/+2)
* mount-util: rework umount_verbose() to take log level and flags arg (Lennart Poettering, 2020-09-23, 1 file, -1/+4)

  Let's make umount_verbose() more like mount_verbose_xyz(), i.e. take log level and flags param. In particular the latter matters, since we typically don't actually want to follow symlinks when unmounting.
* mount-util: switch most mount_verbose() code over to not follow symlinks (Lennart Poettering, 2020-09-23, 1 file, -2/+24)
* mount-util: add helpers for mount() without following symlinks (Lennart Poettering, 2020-09-23, 1 file, -0/+3)
* pid1: stop limiting size of /dev/shm (Zbigniew Jędrzejewski-Szmek, 2020-07-30, 1 file, -5/+3)

  The explicit limit is dropped, which means that we return to the kernel default of 50% of RAM. See 362a55fc14 for a discussion why that is not as much as it seems. It turns out various applications need more space in /dev/shm and we would break them by imposing a low limit.

  While at it, rename the define and use a single macro for various tmpfs mounts. We don't really care what the purpose of the given tmpfs is, so it seems reasonable to use a single macro.

  This effectively reverts part of 7d85383edbab7. Fixes #16617.
* Bump /tmp size back to 50% of RAM (Zbigniew Jędrzejewski-Szmek, 2020-07-29, 1 file, -6/+12)

  This should be enough to fix https://bugzilla.redhat.com/show_bug.cgi?id=1856514. But the limit should be significantly higher than 10% anyway. By setting a limit on /tmp at 10% we'll break many reasonable use cases, even though the machine would deal fine with a much larger fraction devoted to /tmp. (In the first version of this patch I made it 25% with the comment that "Even 25% might be too low.". The kernel default is 50%, and we have been using that seemingly without trouble since https://fedoraproject.org/wiki/Features/tmp-on-tmpfs. So let's just make it 50% again.)

  See 7d85383edbab73274dc81cc888d884bb01070bc2.

  (Another consideration is that we learned from the whole initiative with zram in Fedora that a reasonable size for zram is 0.5-1.5 of RAM, and that pretty much all systems benefit from having zram or zswap enabled. Thus it is reasonable to assume that it'll become widely used. Taking the usual compression effectiveness of 0.2 into account, machines have effective memory available of between 1.0 - 0.2*0.5 + 0.5 = 1.4 (for zram sized to 0.5 of RAM) and 1.0 - 0.2*1.5 + 1.5 = 2.2 (for zram sized to 1.5 of RAM) times RAM size. This means that the 10% was really more like 4-7% of effective memory.)

  A comment is added to mount-util.h to clarify that tmp.mount is separate.
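The zram arithmetic in the commit above can be checked with a few lines of Python. This is only a sketch of the formula as stated in the message; the function names are made up for illustration, and the 0.2 compression factor is the value the commit assumes.

```python
def effective_memory(zram_ratio, compression=0.2):
    """Effective memory as a multiple of RAM, for zram sized zram_ratio * RAM.

    A full zram device holds zram_ratio * RAM of data but occupies only
    compression * zram_ratio * RAM of real memory.
    """
    return 1.0 - compression * zram_ratio + zram_ratio


def share_of_effective(limit_fraction, zram_ratio):
    """Express a limit given as a fraction of RAM as a fraction of effective memory."""
    return limit_fraction / effective_memory(zram_ratio)


if __name__ == "__main__":
    for ratio in (0.5, 1.5):
        eff = effective_memory(ratio)
        share = share_of_effective(0.10, ratio)
        print(f"zram={ratio} x RAM: effective={eff:.1f} x RAM, 10% limit = {share:.1%}")
```

The two ends of the range give multipliers of 1.4 and 2.2, so a 10% limit corresponds to roughly 7% and a bit under 5% of effective memory.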
* mount-util: add destructor helper that umounts + rmdirs a path (Lennart Poettering, 2020-07-07, 1 file, -0/+11)
* tree-wide: avoid some loaded terms (Lennart Poettering, 2020-06-25, 1 file, -2/+2)

  https://tools.ietf.org/html/draft-knodel-terminology-02
  https://lwn.net/Articles/823224/

  This gets rid of most but not all occasions of these loaded terms:

  1. scsi_id and friends are something that is supposed to be removed from our tree (see #7594)

  2. The test suite defines an API used by the ubuntu CI. We can remove this too later, but this needs to be done in sync with the ubuntu CI.

  3. In some cases the terms are part of APIs we call or where we expose concepts the kernel names the way it names them. (In particular all remaining uses of the word "slave" in our codebase are like this, it's used by the POSIX PTY layer, by the network subsystem, the mount API and the block device subsystem). Getting rid of the term in these contexts would mean doing some major fixes of the kernel ABI first.

  Regarding the replacements: when whitelist/blacklist is used as a noun we replace it with allow list/deny list, and when used as a verb with allow-list/deny-list.
* Increase size of /run to 20% (Topi Miettinen, 2020-05-15, 1 file, -1/+4)

  For low memory machines (256MB), 10% of RAM for /run may not be enough for re-exec of PID1 because 16MB of free space is required and /run may already contain something.
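A quick way to see the problem on a 256MB machine is to compute the headroom left in /run. The sketch below is illustrative only: the function name and the figure of 12MB already in use are assumptions, while the 16MB re-exec requirement and the 10%/20% limits come from the commit message.

```python
REEXEC_FREE_MB = 16  # free space in /run that the commit says PID 1 re-exec needs


def run_headroom_mb(ram_mb, fraction, used_mb):
    """Space left in /run beyond the re-exec requirement; negative means too small."""
    return ram_mb * fraction - used_mb - REEXEC_FREE_MB


if __name__ == "__main__":
    used = 12  # hypothetical amount of data already stored in /run, in MB
    for fraction in (0.10, 0.20):
        print(f"{fraction:.0%} of 256MB: headroom {run_headroom_mb(256, fraction, used):.1f}MB")
```

With these assumed numbers the 10% limit falls short while the 20% limit leaves room to spare.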
* tree-wide: add size limits for tmpfs mounts (Topi Miettinen, 2020-05-13, 1 file, -0/+17)

  Limit size of various tmpfs mounts to 10% of RAM, except volatile root and /var to 25%. Another exception is made for /dev (also /devs for PrivateDevices) and /sys/fs/cgroup since no (or very few) regular files are expected to be used.

  In addition, since directories, symbolic links, device specials and xattrs are not counted towards the size= limit, number of inodes is also limited correspondingly: 4MB size translates to 1k of inodes (assuming 4k each), 10% of RAM (using 16GB of RAM as baseline) translates to 400k and 25% to 1M inodes.

  Because nr_inodes option can't use ratios like size option, there's an unfortunate side effect that with small memory systems the limit may be on the too large side. Also, on an extremely small device with only 256MB of RAM, 10% of RAM for /run may not be enough for re-exec of PID1 because 16MB of free space is required.
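The inode figures quoted above follow from dividing each size limit by 4k per inode. A small Python sketch of that arithmetic follows; the function name is hypothetical, and binary units (MiB/GiB) are assumed for the sizes in the commit message.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def inode_limit(size_bytes, bytes_per_inode=4 * KIB):
    """Number of inodes corresponding to a size limit, assuming 4k per inode."""
    return size_bytes // bytes_per_inode


if __name__ == "__main__":
    baseline_ram = 16 * GIB  # baseline RAM size used in the commit message
    print("4MB      ->", inode_limit(4 * MIB))
    print("10% RAM  ->", inode_limit(baseline_ram // 10))
    print("25% RAM  ->", inode_limit(baseline_ram // 4))
```

The results (1024, about 419k, and 1048576) match the rounded values of 1k, 400k and 1M given in the commit message.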
* core: make sure we use the correct mount flag when re-mounting bind mounts (Lennart Poettering, 2020-01-09, 1 file, -0/+1)

  When in a userns environment we cannot take away per-mount point flags set on a mount point that was passed to us. Hence we need to be careful to always check the actual mount flags in place and manipulate only those flags of them that we actually want to change and not reset more as side-effect.

  We mostly got this right already in bind_remount_recursive_with_mountinfo(), but didn't in the simpler bind_remount_one_with_mountinfo(). Catch up.

  (The old code assumed that the MountEntry.flags field contained the right flag settings, but it actually doesn't for new mounts we just established as for those mount() establishes the initial flags for us, and we have to read them back to figure out which ones the kernel picked.)

  Fixes: #13622
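The flag handling described above comes down to a masked update: start from the flags currently in effect and change only the bits named in a mask. The Python sketch below illustrates that idea under stated assumptions. The function name is hypothetical and this is not systemd's implementation; the MS_* values are the Linux constants from <sys/mount.h>.

```python
# Linux per-mount flag values from <sys/mount.h>
MS_RDONLY = 1
MS_NOSUID = 2
MS_NODEV = 4
MS_NOEXEC = 8


def updated_flags(current, new_flags, mask):
    """Change only the bits selected by mask, keeping all other current flags."""
    return (current & ~mask) | (new_flags & mask)


if __name__ == "__main__":
    # A mount passed to us that already has nosuid and nodev set
    current = MS_NOSUID | MS_NODEV
    # Make it read-only without dropping the flags that were already there
    print(updated_flags(current, MS_RDONLY, MS_RDONLY))
```

Passing only `MS_RDONLY` as the complete new flag set would drop `MS_NOSUID` and `MS_NODEV`, which is the kind of unintended reset the commit guards against.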
* core: create inaccessible nodes for users when making runtime dirs (Anita Zhang, 2019-12-18, 1 file, -1/+1)

  To support ProtectHome=y in a user namespace (which mounts the inaccessible nodes), the nodes need to be accessible by the user. Create these paths and devices in the user runtime directory so they can be used later if needed.
* mount-util: beef up bind_remount_recursive() to be able to toggle more than MS_RDONLY (Lennart Poettering, 2019-03-25, 1 file, -2/+2)

  The function is otherwise generic enough to toggle other bind mount flags beyond MS_RDONLY (for example: MS_NOSUID or MS_NODEV), hence let's beef it up slightly to support that too.
* Move mount-util.c to shared/ (Zbigniew Jędrzejewski-Szmek, 2018-11-29, 1 file, -0/+34)

  libmount dep is moved from libbasic to libshared, potentially removing libmount from some build products.