| Commit message (Collapse) | Author | Age | Files | Lines |
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is closely related to the previous commit: if the credentials dir
is empty and nothing mounted on it, let's remove it again.
This will in particular happen if we decided to not actually install the
mount we prepared for the credentials because it is empty. In that case
the mount point inode is already there, and with this we'll remove it.
Primary effect, users will see ENOENT rather than EACCESS when trying to
access it, which should be preferable, given we already handle that
nicely in our credential consumption code.
This should also be useful on systems where we lack any privs to create
mounts, and thus operate on a regular dir anyway.
|
|\
| |
| | |
dissect: add dissection policies
|
| | |
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
extension-release file
The release file that accompanies the confext images needs to be
host compatible to be able to be merged into the host /etc/ directory.
This commit checks for version compatibility between the image file and
the host file.
|
|/
|
|
|
|
|
|
|
| |
files
Adds a new image type called IMAGE_CONFEXT which is similar to IMAGE_SYSEXT but works
for the /etc/ directory instead of /usr/ and /opt/. This commit also adds the ability to
parse the release file that is present with the confext image in /etc/confext-release.d/
directory.
|
|
|
|
|
| |
It will be used for other extension DDI validation, not just for extension-release
validation
|
|
|
|
|
|
|
|
|
| |
Chasing symlinks is a core function that's used in a lot of places
so it deservers a less verbose names so let's rename it to chase()
and chaseat().
We also slightly change the pattern used for the chaseat() helpers
so we get chase_and_openat() and similar.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
appropriate
ExecContext has a field that controls the mount propagation flag of the
mounts in the resulting namespace. This is exposed as "MountFlags="
which is super confusing, as it suggests one could control more than
propagation, and that it was actually a flags field. It's an enum
though only, and nothing else.
We might want to rename this externally one day, but given the compat
kludges this requires and the fact this is somewhat nichey it might not
be worth it. But internally let's rename it, as it makes things much
easier to grok, in particular as part of the codebase already exposed
the concept as mount_propagation_flag.
No actual code flow changes, just some renaming.
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Even when a mount namespace is created, previously host's sysfs is used,
especially with RootDirectory= or RootImage=, thus service processes can
still access the properties of the network interfaces in the main network
namespace through sysfs.
This makes, sysfs is remounted with the new network namespace tag, except
when PrivateMounts= is explicitly disabled. Hence, the properties of the
network interfaces in the main network namespace cannot be accessed by
service processes through sysfs.
Fixes #26422.
|
|
|
|
|
|
| |
This is useful when a service running with a new network namespace.
The mount mode is not used yet, but will be used in a later commit.
|
|
|
|
| |
No functional change, just preparation for later commits.
|
|
|
|
|
|
| |
Let's not leave the sector size unspecified: either set a user supplied
value, or auto-detect the right size by probing the disk image
accordingly.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
-1 was used everywhere, but -EBADF or -EBADFD started being used in various
places. Let's make things consistent in the new style.
Note that there are two candidates:
EBADF 9 Bad file descriptor
EBADFD 77 File descriptor in bad state
Since we're initializating the fd, we're just assigning a value that means
"no fd yet", so it's just a bad file descriptor, and the first errno fits
better. If instead we had a valid file descriptor that became invalid because
of some operation or state change, the other errno would fit better.
In some places, initialization is dropped if unnecessary.
|
| |
|
|
|
|
|
|
| |
This is an octal number. We used the 0 prefix in some places inconsistently.
The kernel always interprets in base-8, so this has no effect, but I think
it's nicer to use the 0 to remind the reader that this is not a decimal number.
|
| |
|
|
|
|
|
|
| |
RUN_WITH_UMASK was initially conceived for spawning externals progs with the
umask set. But nowadays we use it various syscalls and stuff that doesn't "run"
anything, so the "RUN_" prefix has outlived its usefulness.
|
|\
| |
| | |
nspawn: support pivot_root()
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
In order to support pivot_root() we need to move mount propagation
changes after the pivot_root(). While MS_MOVE requires the source mount
to not be a shared mount pivot_root() also requires the target mount to
not be a shared mount. This guarantees that pivot_root() doesn't leak
any mounts.
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
DISSECT_IMAGE_OPEN_PARTITION_DEVICES
Curently, these two flags were implied by dissect_loop_device(), but
that's not right, because this means systemd-gpt-auto-generator will
dissect the root block device with these flags set and that's not
desirable: the generator should not cause the partition devices to be
created (we don't intend to use them right-away after all, but expect
udev to find/probe them first, and then mount them though .mount units).
And there's no point in opening the partition devices, since we do not
intend to mount them via fds either.
Hence, rework this: instead of implying the flags, specify them
explicitly.
While we are at it, let's also rename the flags to make them more
descriptive:
DISSECT_IMAGE_MANAGE_PARTITION_DEVICES becomes
DISSECT_IMAGE_ADD_PARTITION_DEVICES, since that's really all this does:
add the partition devices via BLKPG.
DISSECT_IMAGE_OPEN_PARTITION_DEVICES becomes
DISSECT_IMAGE_PIN_PARTITION_DEVICES, since we not only open the devices,
but keep the devices open continously (i.e. we "pin" them).
Also, drop the DISSECT_IMAGE_BLOCK_DEVICE combination flag, since it is
misleading, i.e. it suggests it was appropriate to specify on all
dissected blocking devices, but that's precisely not the case, see the
systemd-gpt-auto-generator case. My guess is that the confusion around
this was actually the cause for this bug we are addressing here.
Fixes: #25528
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, services use mount_move_root() in order to setup the root
directory of services using a mount namespace. This relies on MS_MOVE
and chroot(). However, this has serious drawbacks even for relatively
simple mount propagation scenarios.
What systemd currently does is roughly equivalent to the following shell
code:
unshare --mount --propagation=shared
cd /
mount --make-rslave /
mkdir /new-root
mount --rbind / /new-root
cd /new-root
mount --move /new-root /
chroot .
This looks simple enough but has the consequence that two separate mount
trees exist for the lifetime of the service. The first one was created
when the mount namespace was created, and the second one when a new
mount for the rootfs was created. The first mount tree sticks around as
a shadow mount tree. Both mount trees are dependent mounts with the host
rootfs as their dominating mount.
Now, when mount propagation is triggered by the host by e.g.,
mount --bind /opt /mnt
it means that two propagation events are generated. I'm skipping over
the exact kernel details as they aren't that important. The gist is that
for every propagation event that is generated a second one is generated
for the shadow mount tree. In other words, the kernel creates two copies
for each mount that is propagated instead of one.
This isn't necessary. We can simply change the sequence above to:
unshare --mount --propagation=shared
cd /
mount --make-rslave /
mkdir /new-root
# stash fd to old rootfs
# stash fd to new rootfs
mount --rbind / /new-root
mkdir /new-root
cd /new-root
pivot_root . .
# new root is tucked under old root
# chdir into old rootfs via stashed fd
umount -l /old-root
The pivot_root allows us to get rid of the old mount tree that was
created when the mount namespace was created. So after this sequence
only one mount tree is alive. Plus, it's safer and nicer. Moving mounts
isn't pleasnt.
This patch doesn't convert nspawn yet as the requirements are more
tricky given that it wants to preserve the rootfs as a shared mount
which goes against pivot_root() requirements.
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
| |
|
|
|
|
|
|
| |
Using fsopen()/fsconfig(), we can check if hidepid/subset are supported to
avoid the noisy logs from the kernel if they aren't supported. This works
on centos/redhat 8 as well since they've backported fsopen()/fsconfig().
|
|
|
|
|
| |
When the --force flag is used, do not insist that the extension-release
file has to match the extension image name
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
| |
image name
Follow-up for e374439f4b8def786031ddbbd7dfdae3a335d4d2 (#24322).
This also simplify the logic of generating image name from image path.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The loading of an extension image from a symlink "NAME.raw" to
"NAME-VERSION.raw" failed because the release file name check worked
with the backing file of the loop device which already resolves the
symlink and thus the found name "NAME-VERSION" mismatched "NAME".
Pass the original filename and use it instead of the backing file
when available. This fixes the loading of "NAME.raw" extensions which
are a symlink to "NAME-VERSION.raw" as, e.g., may be the case when
systemd-sysupdate manages multiple versions.
Fixes https://github.com/systemd/systemd/issues/24293
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Let's rework how we lock loopback block devices in two ways:
1. Lock a separate fd, instead of the main block device fd. We already
did that for our internal locking when allocating loopback block
devices, but do so for the exposed locking (i.e.
loop_device_flock()), too, so that the lock is independent of the
main fd we actually use of IO.
2. Instead of locking the device during allocation of the loopback
device, then unlocking it (which will make udev run), and then
re-locking things if we need, let's instead just keep the lock the
whole time, to make things a bit safer and faster, and not have to
wait for udev at all. This is done by adding a "lock_op" parameter to
loop device allocation functions that declares the initial state of
the lock, and is one of LOCK_UN/LOCK_SH/LOCK_EX. This change also
shortens a lot of code, since we allocate + immediately lock loopback
devices pretty much everywhere.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts a major chunk of 75d7e04eb4662a814c26010d447eed8a862f5ec1
Now that the loopback device code already destroys the partitions we
don't have to do this here anymore.
I am sure the right place to delete the partitions is in the loopback
code, since we really only should do that for loopback devices, see
bug #24431, and not on "real" block devices.
I am also not convinced dropping partitions the dissection logic doesn't
care about is a good idea, after all. The dissection stuff should
probably not consider itself the "owner" of the block devices it
analyzes, but take a more passive role: figure out what is what, but not
modify it.
Fixes: #24431
|
|
|
|
|
|
|
| |
Follow-up for 4c733d3046942984c5f73b40c3af39cc218c103f.
Finding a suitable limit that would fit any use cases out there is pretty hard
and since /dev is only writeable by root anyway, let's simply drop the limit.
|
|
|
|
| |
conflict with glibc 2.36
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This merges the various labelling calls into a single label_fix_full(),
which can operate on paths, on inode fds, and in a dirfd/fname style
(i.e. like openat()). It also systematically separates the path to look
up in the db from the path we actually use to reference the inode to
relabel.
This then ports tmpfiles over to labelling by fd. This should make the
code a bit less racy, as we'll try hard to always operate on the very
same inode, pinning it via an fd.
User-visibly the behaviour should not change.
|
| |
|
| |
|
|
|
|
|
|
|
|
| |
Fixes build with musl:
| ../git/src/shared/dissect-image.c: In function 'mount_image_privately_interactively':
| ../git/src/shared/dissect-image.c:2986:34: error: 'LOCK_SH' undeclared (first use in this function)
| 2986 | r = loop_device_flock(d, LOCK_SH);
| | ^~~~~~~
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When closing a loop device, the kernel will asynchronously remove
the probed partitions. This can lead to race conditions where we
try to reuse a partition device that still needs to be removed by
the kernel. To avoid such issues, let's explicitly try to remove
any partitions using BLKPG_DEL_PARTITION when we're done with an
image.
To make sure we don't try to remove partitions when we want them
to remain (e.g. systemd-dissect --mount), we add
dissected_image_relinquish() in a similar vein to loop_device_relinquish()
and decrypted_image_relinquish().
|
|
|
|
| |
And port some parts over.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
So here's something we should always keep in mind:
systemd-udevd actually does *two* things with BSD file locks on block
devices:
1. While it probes a device it takes a LOCK_SH lock. Thus everyone else
taking a LOCK_EX lock will temporarily block udev from probing
devices, which is good when making changes to it.
2. Whenever a device is closed after write (detected via inotify), udevd
will issue BLKRRPART (requesting the kernel to reread the partition
table). It does this while holding a LOCK_EX lock on the block
device. Thus anyone else taking LOCK_SH or LOCK_EX will temporarily
block udevd from issuing that ioctl. And that's quite relevant, since
the kernel will temporarily flush out all partitions while re-reading
the partition table and then create them anew. Thus it is smart to
take LOCK_SH when dissecting a block device to ensure that no
BLKRRPART is issued in the background, until we mounted the devices.
|
|
|
|
|
|
|
|
|
| |
The implementation of MountImageUnit()/systemctl mount-image was
changed to use a /proc/self/fd path as the source, but that causes
the dm-verity files autodiscovery to fail, as it looks for files
in the same directory as the image.
Use the original file path when setting up dm-verity.
|
| |
|
|
|
|
| |
This also avoids multiple evaluations in STRV_FOREACH_BACKWARDS()
|
| |
|