path: root/src/core/namespace.c
Commit message (Author, Age, Files, Lines)
* pid1: port unit namespacing to new /run/systemd/mount-rootfs dir (Lennart Poettering, 2023-05-16, 1 file, -3/+4)
|
* execute: remove credentials dir again when empty (Lennart Poettering, 2023-05-04, 1 file, -0/+1)
|   This is closely related to the previous commit: if the credentials dir is empty and nothing is mounted on it, let's remove it again.
|
|   This will in particular happen if we decided to not actually install the mount we prepared for the credentials because it is empty. In that case the mount point inode is already there, and with this we'll remove it.
|
|   Primary effect: users will see ENOENT rather than EACCES when trying to access it, which should be preferable, given we already handle that nicely in our credential consumption code.
|
|   This should also be useful on systems where we lack any privs to create mounts, and thus operate on a regular dir anyway.
* Merge pull request #25608 from poettering/dissect-moar (Lennart Poettering, 2023-04-12, 1 file, -9/+27)
|\
| |   dissect: add dissection policies
| * tree-wide: hook up image dissection policy logic everywhere (Lennart Poettering, 2023-04-05, 1 file, -9/+27)
| |
* | extension-release: establish compatibility between host file and extension-release file (maanyagoenka, 2023-04-05, 1 file, -2/+3)
| |   The release file that accompanies the confext images needs to be host compatible to be able to be merged into the host /etc/ directory. This commit checks for version compatibility between the image file and the host file.
* | os-util: add a new confext image type and the ability to parse their release files (maanyagoenka, 2023-04-05, 1 file, -1/+1)
|/
|   Adds a new image type called IMAGE_CONFEXT which is similar to IMAGE_SYSEXT but works for the /etc/ directory instead of /usr/ and /opt/. This commit also adds the ability to parse the release file that is present with the confext image in the /etc/confext-release.d/ directory.
* rename extension-release.[c|h] -> extension-util.[c|h] (Luca Boccassi, 2023-03-30, 1 file, -1/+1)
|   It will be used for other extension DDI validation, not just for extension-release validation.
* chase-symlinks: Rename chase_symlinks() to chase() (Daan De Meyer, 2023-03-24, 1 file, -5/+5)
|   Chasing symlinks is a core function that's used in a lot of places, so it deserves a less verbose name. Let's rename it to chase() and chaseat().
|
|   We also slightly change the pattern used for the chaseat() helpers so we get chase_and_openat() and similar.
* core: rename "mount_flags" → "mount_propagation_flag" internally where appropriate (Lennart Poettering, 2023-03-14, 1 file, -7/+6)
|   ExecContext has a field that controls the mount propagation flag of the mounts in the resulting namespace. This is exposed as "MountFlags=" which is super confusing, as it suggests one could control more than propagation, and that it was actually a flags field. It's only an enum though, and nothing else.
|
|   We might want to rename this externally one day, but given the compat kludges this requires and the fact this is somewhat niche it might not be worth it.
|
|   But internally let's rename it, as it makes things much easier to grok, in particular as part of the codebase already exposed the concept as mount_propagation_flag.
|
|   No actual code flow changes, just some renaming.
* namespace: use ERRNO_IS_PRIVILEGE()/ERRNO_IS_NOT_SUPPORTED() where appropriate (Lennart Poettering, 2023-03-14, 1 file, -1/+2)
|
* namespace: Modernize shareable namespace functions (Daan De Meyer, 2023-03-13, 1 file, -71/+53)
|
* mountpoint-util: generalize mount_option_supported() (Lennart Poettering, 2023-03-09, 1 file, -23/+3)
|
* core/namespace: mount new sysfs when new network namespace is requested (Yu Watanabe, 2023-02-23, 1 file, -0/+7)
|   Previously, even when a mount namespace was created, the host's sysfs was used, especially with RootDirectory= or RootImage=, so service processes could still access the properties of the network interfaces in the main network namespace through sysfs.
|
|   With this change, sysfs is remounted with the new network namespace tag, except when PrivateMounts= is explicitly disabled. Hence, the properties of the network interfaces in the main network namespace can no longer be accessed by service processes through sysfs.
|
|   Fixes #26422.
* core/namespace: introduce a new namespace mount mode PRIVATE_SYSFS (Yu Watanabe, 2023-02-23, 1 file, -1/+29)
|   This is useful when a service is running with a new network namespace. The mount mode is not used yet, but will be used in a later commit.
* core/namespace: rename SYSFS -> BIND_SYSFS (Yu Watanabe, 2023-02-23, 1 file, -7/+7)
|   No functional change, just preparation for later commits.
* loop-util: always tell kernel explicitly about loopback sector size (Lennart Poettering, 2023-01-18, 1 file, -0/+1)
|   Let's not leave the sector size unspecified: either set a user supplied value, or auto-detect the right size by probing the disk image accordingly.
* tree-wide: use -EBADF more (Yu Watanabe, 2022-12-21, 1 file, -2/+2)
|
* tree-wide: use -EBADF for fd initialization (Zbigniew Jędrzejewski-Szmek, 2022-12-19, 1 file, -2/+2)
|   -1 was used everywhere, but -EBADF or -EBADFD started being used in various places. Let's make things consistent in the new style.
|
|   Note that there are two candidates:
|
|       EBADF   9   Bad file descriptor
|       EBADFD  77  File descriptor in bad state
|
|   Since we're initializing the fd, we're just assigning a value that means "no fd yet", so it's just a bad file descriptor, and the first errno fits better. If instead we had a valid file descriptor that became invalid because of some operation or state change, the other errno would fit better.
|
|   In some places, initialization is dropped if unnecessary.
* mount-util: make mount_switch_root() take a mount propagation flag (Yu Watanabe, 2022-12-15, 1 file, -2/+2)
|
* tree-wide: use mode=0nnn for mount option (Zbigniew Jędrzejewski-Szmek, 2022-12-14, 1 file, -3/+3)
|   This is an octal number. We used the 0 prefix in some places inconsistently. The kernel always interprets in base-8, so this has no effect, but I think it's nicer to use the 0 to remind the reader that this is not a decimal number.
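As a userspace analogy for the point made above (this is not the kernel's own option parser), the sketch below parses a mode string in base 8 with `strtoul()`. The helper name `parse_mode_octal()` is hypothetical. When the base is fixed at 8, the leading 0 does not change the result; it only signals to the reader that the number is octal.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: interpret a mode string strictly as base 8,
 * regardless of whether it carries a leading 0. */
static unsigned long parse_mode_octal(const char *s) {
        assert(s);
        return strtoul(s, NULL, 8);
}
```

Both "755" and "0755" yield the same value here, which equals the C literal `0755` (493 in decimal).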
* core/namespace: indentation (Zbigniew Jędrzejewski-Szmek, 2022-12-13, 1 file, -6/+6)
|
* treewide: drop "RUN_" from "RUN_WITH_UMASK" (Zbigniew Jędrzejewski-Szmek, 2022-12-13, 1 file, -4/+4)
|   RUN_WITH_UMASK was initially conceived for spawning external programs with the umask set. But nowadays we use it for various syscalls and other things that don't "run" anything, so the "RUN_" prefix has outlived its usefulness.
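The macro itself is not shown on this page. As a rough sketch of the underlying idea only, the hypothetical helper `mkdir_with_umask()` below sets a temporary umask around a single syscall and restores the previous value afterwards. It is an assumption about the general pattern, not systemd's macro.

```c
#include <assert.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: create a directory while a temporary umask is in
 * effect, then restore the caller's umask. Returns 0 or a negative errno. */
static int mkdir_with_umask(const char *path, mode_t mode, mode_t mask) {
        mode_t saved;
        int r = 0;

        assert(path);

        saved = umask(mask);        /* umask() returns the previous mask */
        if (mkdir(path, mode) < 0)
                r = -errno;
        (void) umask(saved);        /* always restore, even on failure */

        return r;
}
```

Nothing is "run" here in the sense of spawning a program, which matches the commit's reasoning for dropping the "RUN_" prefix.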
* Merge pull request #25513 from brauner/pivot_root.nspawn (Luca Boccassi, 2022-12-06, 1 file, -2/+2)
|\
| |   nspawn: support pivot_root()
| * nspawn: support pivot_root() (Christian Brauner, 2022-12-05, 1 file, -2/+2)
| |   In order to support pivot_root() we need to move mount propagation changes after the pivot_root(). While MS_MOVE requires the source mount to not be a shared mount, pivot_root() also requires the target mount to not be a shared mount. This guarantees that pivot_root() doesn't leak any mounts.
| |
| |   Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
* | dissect: rework DISSECT_IMAGE_ADD_PARTITION_DEVICES + DISSECT_IMAGE_OPEN_PARTITION_DEVICES (Lennart Poettering, 2022-12-01, 1 file, -1/+3)
|/
|   Currently, these two flags were implied by dissect_loop_device(), but that's not right, because this means systemd-gpt-auto-generator will dissect the root block device with these flags set and that's not desirable: the generator should not cause the partition devices to be created (we don't intend to use them right away after all, but expect udev to find/probe them first, and then mount them through .mount units). And there's no point in opening the partition devices, since we do not intend to mount them via fds either.
|
|   Hence, rework this: instead of implying the flags, specify them explicitly.
|
|   While we are at it, let's also rename the flags to make them more descriptive:
|
|   DISSECT_IMAGE_MANAGE_PARTITION_DEVICES becomes DISSECT_IMAGE_ADD_PARTITION_DEVICES, since that's really all this does: add the partition devices via BLKPG.
|
|   DISSECT_IMAGE_OPEN_PARTITION_DEVICES becomes DISSECT_IMAGE_PIN_PARTITION_DEVICES, since we not only open the devices, but keep the devices open continuously (i.e. we "pin" them).
|
|   Also, drop the DISSECT_IMAGE_BLOCK_DEVICE combination flag, since it is misleading, i.e. it suggests it was appropriate to specify on all dissected block devices, but that's precisely not the case, see the systemd-gpt-auto-generator case. My guess is that the confusion around this was actually the cause for this bug we are addressing here.
|
|   Fixes: #25528
* shared: use move_pivot_root() for services (Christian Brauner, 2022-11-24, 1 file, -2/+2)
|   Currently, services use mount_move_root() in order to setup the root directory of services using a mount namespace. This relies on MS_MOVE and chroot().
|
|   However, this has serious drawbacks even for relatively simple mount propagation scenarios. What systemd currently does is roughly equivalent to the following shell code:
|
|       unshare --mount --propagation=shared
|       cd /
|       mount --make-rslave /
|       mkdir /new-root
|       mount --rbind / /new-root
|       cd /new-root
|       mount --move /new-root /
|       chroot .
|
|   This looks simple enough but has the consequence that two separate mount trees exist for the lifetime of the service. The first one was created when the mount namespace was created, and the second one when a new mount for the rootfs was created. The first mount tree sticks around as a shadow mount tree. Both mount trees are dependent mounts with the host rootfs as their dominating mount.
|
|   Now, when mount propagation is triggered by the host by e.g.,
|
|       mount --bind /opt /mnt
|
|   it means that two propagation events are generated. I'm skipping over the exact kernel details as they aren't that important. The gist is that for every propagation event that is generated a second one is generated for the shadow mount tree. In other words, the kernel creates two copies for each mount that is propagated instead of one.
|
|   This isn't necessary. We can simply change the sequence above to:
|
|       unshare --mount --propagation=shared
|       cd /
|       mount --make-rslave /
|       mkdir /new-root
|       # stash fd to old rootfs
|       # stash fd to new rootfs
|       mount --rbind / /new-root
|       mkdir /new-root
|       cd /new-root
|       pivot_root . .
|       # new root is tucked under old root
|       # chdir into old rootfs via stashed fd
|       umount -l /old-root
|
|   The pivot_root allows us to get rid of the old mount tree that was created when the mount namespace was created. So after this sequence only one mount tree is alive. Plus, it's safer and nicer. Moving mounts isn't pleasant.
|
|   This patch doesn't convert nspawn yet as the requirements are more tricky given that it wants to preserve the rootfs as a shared mount which goes against pivot_root() requirements.
|
|   Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
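The second shell sequence in the commit above could be expressed in C roughly as follows. This is a minimal sketch and not systemd's implementation: the function name `pivot_into_new_root()` is hypothetical, it assumes the caller has already unshared the mount namespace, adjusted propagation, and bind-mounted the new root, and it needs the matching privileges to succeed. glibc provides no wrapper for pivot_root(), so the sketch goes through `syscall()`.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: switch the root of the current mount namespace to
 * new_root via pivot_root(".", "."). Returns 0 or a negative errno. */
static int pivot_into_new_root(const char *new_root) {
        int fd_old = -EBADF, fd_new = -EBADF, r = 0;

        assert(new_root);

        /* Stash fds to the old and new rootfs before changing anything. */
        fd_old = open("/", O_RDONLY | O_DIRECTORY | O_CLOEXEC);
        if (fd_old < 0)
                return -errno;

        fd_new = open(new_root, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
        if (fd_new < 0) {
                r = -errno;
                goto finish;
        }

        if (fchdir(fd_new) < 0 ||                     /* cd /new-root */
            syscall(SYS_pivot_root, ".", ".") < 0 ||  /* pivot_root . . */
            fchdir(fd_old) < 0 ||                     /* chdir into old rootfs */
            umount2(".", MNT_DETACH) < 0 ||           /* umount -l the old root */
            fchdir(fd_new) < 0)                       /* end up in the new root */
                r = -errno;

finish:
        if (fd_new >= 0)
                (void) close(fd_new);
        (void) close(fd_old);
        return r;
}
```

Opening both directories first means that a bad path fails early, before the mount tree has been touched.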
* nulstr-util: Declare NULSTR_FOREACH() iterator inline (Daan De Meyer, 2022-11-11, 1 file, -1/+1)
|
* namespace: Add hidepid/subset support check (Daan De Meyer, 2022-11-01, 1 file, -6/+41)
|   Using fsopen()/fsconfig(), we can check if hidepid/subset are supported to avoid the noisy logs from the kernel if they aren't supported. This works on centos/redhat 8 as well since they've backported fsopen()/fsconfig().
* portable: allow caller to override extension-release name check (Luca Boccassi, 2022-10-12, 1 file, -1/+1)
|   When the --force flag is used, do not insist that the extension-release file has to match the extension image name.
* namespace-util: add namespace_info (Christian Brauner, 2022-10-04, 1 file, -0/+1)
|
* tree-wide: drop unused reference to DecryptedImage (Yu Watanabe, 2022-09-18, 1 file, -2/+1)
|
* tree-wide: use dissected_image_relinquish() (Yu Watanabe, 2022-09-18, 1 file, -10/+5)
|
* dissect-image: use backing_file stored in LoopDevice object to generate image name (Yu Watanabe, 2022-09-07, 1 file, -1/+0)
|   Follow-up for e374439f4b8def786031ddbbd7dfdae3a335d4d2 (#24322).
|
|   This also simplifies the logic of generating the image name from the image path.
* Use original filename for extension name check (Kai Lueke, 2022-09-06, 1 file, -0/+1)
|   The loading of an extension image from a symlink "NAME.raw" to "NAME-VERSION.raw" failed because the release file name check worked with the backing file of the loop device which already resolves the symlink and thus the found name "NAME-VERSION" mismatched "NAME".
|
|   Pass the original filename and use it instead of the backing file when available. This fixes the loading of "NAME.raw" extensions which are a symlink to "NAME-VERSION.raw" as, e.g., may be the case when systemd-sysupdate manages multiple versions.
|
|   Fixes https://github.com/systemd/systemd/issues/24293
* dissect-image: introduce dissect_loop_device() which takes LoopDevice object (Yu Watanabe, 2022-09-03, 1 file, -5/+2)
|
* loop-util: rework how we lock loopback block devices (Lennart Poettering, 2022-09-01, 1 file, -6/+1)
|   Let's rework how we lock loopback block devices in two ways:
|
|   1. Lock a separate fd, instead of the main block device fd. We already did that for our internal locking when allocating loopback block devices, but do so for the exposed locking (i.e. loop_device_flock()), too, so that the lock is independent of the main fd we actually use for IO.
|
|   2. Instead of locking the device during allocation of the loopback device, then unlocking it (which will make udev run), and then re-locking things if we need, let's instead just keep the lock the whole time, to make things a bit safer and faster, and not have to wait for udev at all. This is done by adding a "lock_op" parameter to loop device allocation functions that declares the initial state of the lock, and is one of LOCK_UN/LOCK_SH/LOCK_EX.
|
|   This change also shortens a lot of code, since we allocate + immediately lock loopback devices pretty much everywhere.
* dissect: drop partition removal code (Lennart Poettering, 2022-09-01, 1 file, -1/+0)
|   This reverts a major chunk of 75d7e04eb4662a814c26010d447eed8a862f5ec1
|
|   Now that the loopback device code already destroys the partitions we don't have to do this here anymore.
|
|   I am sure the right place to delete the partitions is in the loopback code, since we really only should do that for loopback devices, see bug #24431, and not on "real" block devices.
|
|   I am also not convinced dropping partitions the dissection logic doesn't care about is a good idea, after all. The dissection stuff should probably not consider itself the "owner" of the block devices it analyzes, but take a more passive role: figure out what is what, but not modify it.
|
|   Fixes: #24431
* Drop the limit on number of inodes for /dev (Franck Bui, 2022-08-19, 1 file, -1/+1)
|   Follow-up for 4c733d3046942984c5f73b40c3af39cc218c103f.
|
|   Finding a suitable limit that would fit any use cases out there is pretty hard and since /dev is only writeable by root anyway, let's simply drop the limit.
* glibc: Remove #include <linux/fs.h> to resolve fsconfig_command/mount_attr conflict with glibc 2.36 (Rudi Heitbaum, 2022-07-24, 1 file, -0/+2)
* mac: rework labelling code to be simpler, and less racy (Lennart Poettering, 2022-07-08, 1 file, -3/+3)
|   This merges the various labelling calls into a single label_fix_full(), which can operate on paths, on inode fds, and in a dirfd/fname style (i.e. like openat()). It also systematically separates the path to look up in the db from the path we actually use to reference the inode to relabel.
|
|   This then ports tmpfiles over to labelling by fd. This should make the code a bit less racy, as we'll try hard to always operate on the very same inode, pinning it via an fd.
|
|   User-visibly the behaviour should not change.
* namespace: fix propagated error number (Lennart Poettering, 2022-07-08, 1 file, -1/+1)
|
* tree-wide: allow ASCII fallback for → in logs (David Tardon, 2022-06-28, 1 file, -2/+5)
|
* Add sys/file.h for LOCK_ (Pavel Zhukov, 2022-06-21, 1 file, -0/+1)
|   Fixes build with musl:
|
|   | ../git/src/shared/dissect-image.c: In function 'mount_image_privately_interactively':
|   | ../git/src/shared/dissect-image.c:2986:34: error: 'LOCK_SH' undeclared (first use in this function)
|   |  2986 |         r = loop_device_flock(d, LOCK_SH);
|   |       |                                  ^~~~~~~
* dissect-image: Explicitly remove partitions when done with image (Daan De Meyer, 2022-05-23, 1 file, -0/+1)
|   When closing a loop device, the kernel will asynchronously remove the probed partitions. This can lead to race conditions where we try to reuse a partition device that still needs to be removed by the kernel.
|
|   To avoid such issues, let's explicitly try to remove any partitions using BLKPG_DEL_PARTITION when we're done with an image.
|
|   To make sure we don't try to remove partitions when we want them to remain (e.g. systemd-dissect --mount), we add dissected_image_relinquish() in a similar vein to loop_device_relinquish() and decrypted_image_relinquish().
* devnum-util: define helper macros for formatting devnum major/minor pairs (Lennart Poettering, 2022-04-13, 1 file, -2/+3)
|   And port some parts over.
* tree-wide: take BSD lock on loopback devices we dissect/mount/operate on (Lennart Poettering, 2022-04-10, 1 file, -0/+14)
|   So here's something we should always keep in mind:
|
|   systemd-udevd actually does *two* things with BSD file locks on block devices:
|
|   1. While it probes a device it takes a LOCK_SH lock. Thus everyone else taking a LOCK_EX lock will temporarily block udev from probing devices, which is good when making changes to it.
|
|   2. Whenever a device is closed after write (detected via inotify), udevd will issue BLKRRPART (requesting the kernel to reread the partition table). It does this while holding a LOCK_EX lock on the block device. Thus anyone else taking LOCK_SH or LOCK_EX will temporarily block udevd from issuing that ioctl. And that's quite relevant, since the kernel will temporarily flush out all partitions while re-reading the partition table and then create them anew.
|
|   Thus it is smart to take LOCK_SH when dissecting a block device to ensure that no BLKRRPART is issued in the background, until we mounted the devices.
* core: fix dm-verity auto-discovery in MountImageUnit() (Luca Boccassi, 2022-04-07, 1 file, -1/+1)
|   The implementation of MountImageUnit()/systemctl mount-image was changed to use a /proc/self/fd path as the source, but that causes the dm-verity files autodiscovery to fail, as it looks for files in the same directory as the image. Use the original file path when setting up dm-verity.
* core/namespace: inline one more iterator variable (Yu Watanabe, 2022-03-23, 1 file, -7/+5)
|
* strv: make iterator in STRV_FOREACH() declared in the loop (Yu Watanabe, 2022-03-19, 1 file, -6/+0)
|   This also avoids multiple evaluations in STRV_FOREACH_BACKWARDS().
* list: declare iterator of LIST_FOREACH() in the loop (Yu Watanabe, 2022-03-19, 1 file, -1/+0)
|