| Commit message | Author | Age | Files | Lines |
| |
the container payload
When using user namespaces in conjunction with uidmapped mounts, nspawn
so far set up two uidmappings:
1. One that is used for the uidmapped mount and that maps the UID range
0…65535 on the backing fs to some high UID range X…X+65535 on the
uidmapped fs. (Let's call this mapping the "mount mapping")
2. One that is used for the user namespace the container payload
processes run in, that maps X…X+65535 back to 0…65535. (Let's call
this one the "process mapping").
These mappings hence are pretty much identical, one just moves things up
and one back down. (Reminder: we do all this so that the processes can
run under high UIDs while running off file systems that require no
recursive chown()ing, i.e. we want processes with high UID range but
files with low UID range.)
This creates one problem, i.e. issue #20989: if nspawn (which runs as
host root, i.e. host UID 0) wants to add inodes to the uidmapped mount
it can't do that, since host UID 0 is not defined in the mount mapping
(only the X…X+65535 range is, after all, and X > 0), and processes whose
UID is not mapped in a uidmapped fs cannot create inodes in it since
those would be owned by an unmapped UID, which then triggers
the famous EOVERFLOW error.
Let's fix this, by explicitly including an entry for the host UID 0 in
the mount mapping. Specifically, we'll extend the mount mapping to map
UID 2147483646 (which is INT32_MAX-1, see code for an explanation why I
picked this one) of the backing fs to UID 0 on the uidmapped fs. This
way nspawn can create inodes on the uidmapped fs as it likes (which will
then actually be owned by UID 2147483646 on the backing fs), and as it
always did. Note that we do *not* create a similar entry in the process
mapping. Thus any files created by nspawn that way (and not chown()ed to
something better) will appear as unmapped (i.e. as overflowuid/"nobody")
in the container payload. And that's good. Of course, the latter is
mostly theoretical, as nspawn should generally chown() the inodes it
creates to UID ranges that actually make sense for the container (and we
generally already do this correctly), but it's good to know that we are
safe here, given we might accidentally forget to chown() some inodes we
create.
Net effect: the two mappings will not be identical anymore. The mount
mapping has one entry more, and the only reason it exists is so that
nspawn can access the uidmapped fs reasonably independently from any
process mapping.
Fixes: #20989
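The difference between the old and the new mount mapping can be sketched
as a small lookup table. This is only an illustration and not nspawn's
actual code: the UidRange type, the mapped_to_backing() helper, the
UID_UNMAPPED sentinel and the concrete base X = 1000000 are all
assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel for "this UID has no entry in the mapping". */
#define UID_UNMAPPED UINT32_MAX

/* Hypothetical base of the high UID range ("X" in the text above). */
#define X_BASE UINT32_C(1000000)

/* One mapping entry: 'count' UIDs starting at 'backing_base' on the backing
 * fs appear starting at 'mapped_base' on the uidmapped fs. */
typedef struct {
        uint32_t backing_base;
        uint32_t mapped_base;
        uint32_t count;
} UidRange;

/* Old mount mapping: only 0..65535 on the backing fs <-> X..X+65535. */
const UidRange old_mount_map[] = {
        { 0, X_BASE, 65536 },
};

/* New mount mapping: one extra entry, so that host UID 0 is defined too. */
const UidRange new_mount_map[] = {
        { 0,                    X_BASE, 65536 },
        { UINT32_C(2147483646), 0,      1     },
};

/* Translate a UID as used on the uidmapped fs into the owner UID that ends
 * up on the backing fs, or UID_UNMAPPED if no entry covers it. */
uint32_t mapped_to_backing(const UidRange *map, size_t n, uint32_t uid) {
        assert(map || n == 0);

        for (size_t i = 0; i < n; i++)
                if (uid >= map[i].mapped_base &&
                    uid - map[i].mapped_base < map[i].count)
                        return map[i].backing_base + (uid - map[i].mapped_base);

        return UID_UNMAPPED;
}
```

With the old table, looking up UID 0 yields the sentinel, which models
the case where nspawn cannot create inodes. With the new table UID 0
resolves to 2147483646, while a payload UID such as X+5 still resolves
to 5 on the backing fs.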
| |
Even though ISO C11 doesn't mandate in which order the type specifiers
should appear, having `unsigned` at the beginning of each type
declaration feels more natural and, more importantly, it unbreaks
Coccinelle, which has a hard time parsing `long unsigned` and others:
```
init_defs_builtins: /usr/lib64/coccinelle/standard.h
init_defs: /home/mrc0mmand/repos/systemd/coccinelle/macros.h
HANDLING: src/shared/mount-util.c
: 1: strange type1, maybe because of weird order: long unsigned
```
Most of the codebase already "complies", so let's fix the remaining
"offenders".
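That both spellings name the same type can be checked at compile time.
A minimal sketch using C11 `_Generic` and `static_assert`; the helper
name long_unsigned_is_unsigned_long() is made up for this example.

```c
#include <assert.h>

/* C11 does not care about the order of type specifiers, so 'long unsigned'
 * and 'unsigned long' denote the same type. _Generic picks the
 * 'unsigned long' branch for an expression of type 'long unsigned'. */
static_assert(_Generic((long unsigned) 0, unsigned long: 1, default: 0),
              "'long unsigned' and 'unsigned long' are the same type");

/* Hypothetical helper exposing the same check at run time. */
int long_unsigned_is_unsigned_long(void) {
        long unsigned a = 42;   /* the spelling Coccinelle has trouble with */
        unsigned long b = 42;   /* the preferred spelling */

        return _Generic(a, unsigned long: 1, default: 0) && a == b;
}
```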
| |
4c733d3046942984c5f73b40c3af39cc218c103f shows that 95k can be used easily on a large
system. Let's bump it up even more so that we have some "breathing room".
| |
Follow-up for 7d85383edbab73274dc81cc888d884bb01070bc2.
Apparently the previous limit set on the max number of inodes for /dev was too
small as a system with 4096 LUNs attached can consume up to 95k inodes for
symlinks:
    # /bin/df -i
    Filesystem       Inodes IUsed    IFree IUse% Mounted on
    devtmpfs       49274377 95075 49179302    1% /dev
Hence this patch bumps the limit from 64k to 128k, although the new limit is
still pretty arbitrary (that said, it is not clear whether imposing such an
absolute limit makes sense at all).
| |
bind_remount_one_with_mountinfo()
Let's move things around a bit, and open /proc/self/mountinfo if needed
inside of bind_remount_one_with_mountinfo(). That way bind_remount_one()
can become a superthin inline wrapper around
bind_remount_one_with_mountinfo(). Main benefit is that we don't even
have to open /proc/self/mountinfo in case mount_setattr() actually worked
for us.
| |
Those pull in selinux for labelling, and we should avoid selinux in basic/.
| |
This makes use of the new kernel 5.12 APIs to add an idmap to a mount
point. It does so by cloning the mountpoint, changing it, and then
unmounting the old mountpoint, replacing it later with the new one.
| |
If the cleanup function returns the appropriate type, use that to reset the
variable. For other functions (usually the foreign ones which return void), add
an explicit value to reset to.
This causes a bit of code churn, but I think it might be worth it. In a
following patch static destructors will be called from a fuzzer, and this
change allows them to be called multiple times. But I think such a change might
help with detecting uninitialized code reuse too. We hit various bugs like this,
and things are more obvious when a pointer has been set to NULL.
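The pattern described above can be sketched as follows. This is an
illustration under assumptions, not the macros from the tree: the Foo
type, foo_free() and both DEFINE_RESETTING_CLEANUP_FUNC* macro names are
hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Foo {
        int value;
} Foo;

/* Freeing function that returns NULL, so callers can write f = foo_free(f). */
Foo *foo_free(Foo *f) {
        free(f);
        return NULL;
}

/* Hypothetical macro: the cleanup helper resets the variable with whatever
 * the freeing function returns. */
#define DEFINE_RESETTING_CLEANUP_FUNC(type, func)               \
        void func##p(type *p) {                                 \
                assert(p);                                      \
                if (*p)                                         \
                        *p = func(*p);                          \
        }

/* Hypothetical variant for foreign functions that return void: the value to
 * reset to is passed explicitly. */
#define DEFINE_RESETTING_CLEANUP_FUNC_FULL(type, func, empty)   \
        void func##p(type *p) {                                 \
                assert(p);                                      \
                if (*p != (empty)) {                            \
                        func(*p);                               \
                        *p = (empty);                           \
                }                                               \
        }

/* Defines foo_freep(Foo **p) and freep(void **p). */
DEFINE_RESETTING_CLEANUP_FUNC(Foo*, foo_free)
DEFINE_RESETTING_CLEANUP_FUNC_FULL(void*, free, NULL)
```

Because the variable is reset, calling foo_freep(&f) a second time is a
no-op rather than a double free, which is the property that allows
destructors to run more than once.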
I was worried whether this change increases text size, but it doesn't seem to:
-Dbuildtype=debug:
before "tree-wide: return NULL from freeing functions":
-rwxrwxr-x 1 zbyszek zbyszek 4117672 Feb 16 14:36 build/libsystemd.so.0.30.0*
-rwxrwxr-x 1 zbyszek zbyszek 4494520 Feb 16 15:06 build/systemd*
after "tree-wide: return NULL from freeing functions":
-rwxrwxr-x 1 zbyszek zbyszek 4117672 Feb 16 14:36 build/libsystemd.so.0.30.0*
-rwxrwxr-x 1 zbyszek zbyszek 4494576 Feb 16 15:10 build/systemd*
now:
-rwxrwxr-x 1 zbyszek zbyszek 4117672 Feb 16 14:36 build/libsystemd.so.0.30.0*
-rwxrwxr-x 1 zbyszek zbyszek 4494640 Feb 16 15:15 build/systemd*
-Dbuildtype=release:
before "tree-wide: return NULL from freeing functions":
-rwxrwxr-x 1 zbyszek zbyszek 5252256 Feb 14 14:47 build-rawhide/libsystemd.so.0.30.0*
-rwxrwxr-x 1 zbyszek zbyszek 1834184 Feb 16 15:09 build-rawhide/systemd*
after "tree-wide: return NULL from freeing functions":
-rwxrwxr-x 1 zbyszek zbyszek 5252256 Feb 14 14:47 build-rawhide/libsystemd.so.0.30.0*
-rwxrwxr-x 1 zbyszek zbyszek 1834184 Feb 16 15:10 build-rawhide/systemd*
now:
-rwxrwxr-x 1 zbyszek zbyszek 5252256 Feb 14 14:47 build-rawhide/libsystemd.so.0.30.0*
-rwxrwxr-x 1 zbyszek zbyszek 1834184 Feb 16 15:16 build-rawhide/systemd*
I would expect that the compiler would be able to elide the setting of a
variable if the variable is never used again. And this seems to be the case:
in optimized builds there is no change in size whatsoever. And the change in
size in unoptimized build is negligible.
Something strange is happening with size of libsystemd: it's bigger in
optimized builds. Something to figure out, but unrelated to this patch.
| |
Let's make umount_verbose() more like mount_verbose_xyz(), i.e. take log
level and flags param. In particular the latter matters, since we
typically don't actually want to follow symlinks when unmounting.
| |
The explicit limit is dropped, which means that we return to the kernel default
of 50% of RAM. See 362a55fc14 for a discussion why that is not as much as it
seems. It turns out various applications need more space in /dev/shm and we
would break them by imposing a low limit.
While at it, rename the define and use a single macro for various tmpfs mounts.
We don't really care what the purpose of the given tmpfs is, so it seems
reasonable to use a single macro.
This effectively reverts part of 7d85383edbab7. Fixes #16617.
| |
This should be enough to fix https://bugzilla.redhat.com/show_bug.cgi?id=1856514.
But the limit should be significantly higher than 10% anyway. By setting a
limit on /tmp at 10% we'll break many reasonable use cases, even though the
machine would deal fine with a much larger fraction devoted to /tmp.
(In the first version of this patch I made it 25% with the comment that
"Even 25% might be too low.". The kernel default is 50%, and we have been using
that seemingly without trouble since https://fedoraproject.org/wiki/Features/tmp-on-tmpfs.
So let's just make it 50% again.)
See 7d85383edbab73274dc81cc888d884bb01070bc2.
(Another consideration is that we learned from the whole initiative with
zram in Fedora that a reasonable size for zram is 0.5-1.5 of RAM, and that pretty
much all systems benefit from having zram or zswap enabled. Thus it is reasonable
to assume that it'll become widely used. Taking the usual compression effectiveness
of 0.2 into account, machines have effective memory available of between
1.0 - 0.2*0.5 + 0.5 = 1.4 (for zram sized to 0.5 of RAM) and
1.0 - 0.2*1.5 + 1.5 = 2.2 (for zram sized to 1.5 of RAM) times RAM size.
This means that the 10% was really like 7% to 4.5% of effective memory.)
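The arithmetic above can be reproduced with a small helper. This is only
a sketch of the estimate in the text; the function names are made up for
the example and the model assumes a completely filled zram device.

```c
#include <assert.h>

/* Effective memory, as a multiple of physical RAM, when a zram device sized
 * 'zram_fraction' of RAM is completely filled and data compresses down to
 * 'compression_ratio' of its original size. */
double effective_memory_factor(double zram_fraction, double compression_ratio) {
        assert(zram_fraction >= 0.0);
        assert(compression_ratio >= 0.0 && compression_ratio <= 1.0);

        /* RAM left over after storing the compressed data, plus the
         * uncompressed amount of data held in zram. */
        return 1.0 - compression_ratio * zram_fraction + zram_fraction;
}

/* Share of effective memory taken by a tmpfs limited to 'limit' (a fraction
 * of physical RAM). */
double share_of_effective_memory(double limit, double zram_fraction,
                                 double compression_ratio) {
        return limit / effective_memory_factor(zram_fraction, compression_ratio);
}
```

For a 10% limit and a compression ratio of 0.2 this gives about 7.1% of
effective memory with zram at 0.5 of RAM, and about 4.5% with zram at
1.5 of RAM.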
A comment is added to mount-util.h to clarify that tmp.mount is separate.
| |
https://tools.ietf.org/html/draft-knodel-terminology-02
https://lwn.net/Articles/823224/
This gets rid of most but not all occurrences of these loaded terms:
1. scsi_id and friends are something that is supposed to be removed from
our tree (see #7594)
2. The test suite defines an API used by the ubuntu CI. We can remove
this too later, but this needs to be done in sync with the ubuntu CI.
3. In some cases the terms are part of APIs we call or where we expose
concepts the kernel names the way it names them. (In particular all
remaining uses of the word "slave" in our codebase are like this,
it's used by the POSIX PTY layer, by the network subsystem, the mount
API and the block device subsystem). Getting rid of the term in these
contexts would mean doing some major fixes of the kernel ABI first.
Regarding the replacements: when whitelist/blacklist is used as noun we
replace it with allow list/deny list, and when used as verb with
allow-list/deny-list.
| |
For low memory machines (256MB), 10% of RAM for /run may not be enough for
re-exec of PID1 because 16MB of free space is required and /run may already
contain something.
| |
Limit size of various tmpfs mounts to 10% of RAM, except volatile root and /var
to 25%. Another exception is made for /dev (also /devs for PrivateDevices) and
/sys/fs/cgroup since no (or very few) regular files are expected to be used.
In addition, since directories, symbolic links, device specials and xattrs are
not counted towards the size= limit, the number of inodes is also limited
correspondingly: 4MB size translates to 1k of inodes (assuming 4k each), 10% of
RAM (using 16GB of RAM as baseline) translates to 400k and 25% to 1M inodes.
Because the nr_inodes option can't use ratios like the size option, there's an
unfortunate side effect: on systems with little memory the inode limit may be
too large. Also, on an extremely small device with only 256MB of RAM, 10%
of RAM for /run may not be enough for re-exec of PID1 because 16MB of free
space is required.
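The inode numbers above follow from a simple division. A minimal sketch,
assuming one inode is budgeted per 4k of tmpfs size as stated in the
text; the macro and function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption taken from the text: one inode per 4k of tmpfs size. */
#define BYTES_PER_INODE UINT64_C(4096)

/* Number of inodes corresponding to a size= limit given in bytes. */
uint64_t inodes_for_size(uint64_t size_bytes) {
        return size_bytes / BYTES_PER_INODE;
}

/* Number of inodes for a size limit of 'percent' % of 'ram_bytes'. */
uint64_t inodes_for_ram_percent(uint64_t ram_bytes, unsigned percent) {
        assert(percent <= 100);

        /* Multiply first to avoid losing precision; this cannot overflow for
         * any realistic amount of RAM. */
        return inodes_for_size(ram_bytes * percent / 100);
}
```

For 4MB this gives 1024 inodes, for 10% of 16GB it gives 419430 (the
"400k" above) and for 25% it gives 1048576 (the "1M" above). The same
arithmetic shows why 256MB is tight: 10% of it is only about 25.6MB,
which leaves little headroom over the 16MB needed for re-exec.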
| |
When in a userns environment we cannot take away per-mount point flags
set on a mount point that was passed to us. Hence we need to be careful
to always check the actual mount flags in place and manipulate only
those flags that we actually want to change, without resetting others
as a side effect.
We mostly got this right already in
bind_remount_recursive_with_mountinfo(), but didn't in the simpler
bind_remount_one_with_mountinfo(). Catch up.
(The old code assumed that the MountEntry.flags field contained the
right flag settings, but it actually doesn't for new mounts we just
established as for those mount() establishes the initial flags for us,
and we have to read them back to figure out which ones the kernel
picked.)
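The "change only what we were asked to change" rule can be sketched as a
pure function on flag words. This is an illustration, not the code from
this commit: combine_remount_flags() is a hypothetical name, and the
current flags are assumed to have been read back from the kernel
beforehand.

```c
#include <assert.h>
#include <sys/mount.h>

/* Combine the flags currently in effect on a mount with the requested ones:
 * only bits covered by 'flags_mask' are taken from 'new_flags', everything
 * else is preserved from 'current_flags'. */
unsigned long combine_remount_flags(unsigned long current_flags,
                                    unsigned long new_flags,
                                    unsigned long flags_mask) {
        /* Requested bits outside of the mask would silently be ignored, so
         * treat them as a caller error in this sketch. */
        assert((new_flags & ~flags_mask) == 0);

        return (current_flags & ~flags_mask) | (new_flags & flags_mask);
}
```

For example, turning on MS_RDONLY for a mount that currently has
MS_NOSUID|MS_NODEV set yields MS_RDONLY|MS_NOSUID|MS_NODEV, so the
remount does not attempt to drop flags it is not allowed to take away.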
Fixes: #13622
| |
To support ProtectHome=y in a user namespace (which mounts the inaccessible
nodes), the nodes need to be accessible by the user. Create these paths and
devices in the user runtime directory so they can be used later if needed.
| |
MS_RDONLY
The function is otherwise generic enough to toggle other bind mount
flags beyond MS_RDONLY (for example: MS_NOSUID or MS_NODEV), hence let's
beef it up slightly to support that too.
|
|
libmount dep is moved from libbasic to libshared, potentially removing
libmount from some build products.
|