| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This implements a minimal subset of #24961, but in a lot more
restrictive way: we only allow one level of subcgroup (as that's enough
to address the no-processes in inner cgroups rule), and does not change
anything about threaded cgroup logic or similar, or make any of this new
behaviour mandatory.
All this does is this: all non-control processes we invoke for a unit
we'll invoke in a subgroup by the specified name.
We'll later port all our current services that use cgroup delegation
over to this, i.e. user@.service, systemd-nspawn@.service and
systemd-udevd.service.
|
|
|
|
|
|
|
|
|
| |
Addresses
https://github.com/systemd/systemd/pull/25608/commits/84be0c710d9d562f6d2cf986cc2a8ff4c98a138b#r1060130312,
https://github.com/systemd/systemd/pull/25608/commits/84be0c710d9d562f6d2cf986cc2a8ff4c98a138b#r1067927293, and
https://github.com/systemd/systemd/pull/25608/commits/84be0c710d9d562f6d2cf986cc2a8ff4c98a138b#r1067926416.
Follow-up for 84be0c710d9d562f6d2cf986cc2a8ff4c98a138b.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Oftentimes it is useful to allow the per-service fd store to survive
longer than for a restart. This is useful in various scenarios:
1. An fd to some security relevant object needs to be stashed somewhere,
that should not be cleaned automatically, because the security
enforcement would be dropped then.
2. A user namespace fd should be allocated on first invocation and be
kept around until the user logs out (i.e. systemd --user ends), á la
#16328 (This does not implement what #16318 asks for, but should
solve the use-case discussed there.)
3. There's interest in allow a concept of "userspace reboots" where the
kernel stays running, and userspace is swapped out (i.e. all services
exit, and the rootfs transitioned into a new version of it) while
keeping some select resources pinned, very similar to how we
implement a switch root. Thus it is useful to allow services to exit,
while leaving their fds around till the very end.
This is exposed through a new FileDescriptorStorePreserve= setting that
is closely modelled after RuntimeDirectoryPreserve= (in fact it reused
the same internal type), since we want similar behaviour in the end, and
quite often they probably want to be used together.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
appropriate
ExecContext has a field that controls the mount propagation flag of the
mounts in the resulting namespace. This is exposed as "MountFlags="
which is super confusing, as it suggests one could control more than
propagation, and that it was actually a flags field. It's an enum
though only, and nothing else.
We might want to rename this externally one day, but given the compat
kludges this requires and the fact this is somewhat nichey it might not
be worth it. But internally let's rename it, as it makes things much
easier to grok, in particular as part of the codebase already exposed
the concept as mount_propagation_flag.
No actual code flow changes, just some renaming.
|
|
|
|
|
|
| |
Follow-up for #26393.
Addresses https://github.com/systemd/systemd/pull/26393#issuecomment-1458655798.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
| |
Define new unit parameter (LogFilterPatterns) to filter logs processed by
journald.
This option is used to store a regular expression which is carried from
PID1 to systemd-journald through a cgroup xattrs:
`user.journald_log_filter_patterns`.
|
| |
|
|
|
|
| |
Signed-off-by: wineway <wangyuweihx@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts PR #22587 and its follow-up commit. More specifically,
2299b1cae32c1fb8911da0ce26efced68032f4f8 (partially),
e176f855278d5098d3fecc5aa24ba702147d42e0,
ceb46a31a01b3d3d1d6095d857e29ea214a2776b, and
51bb9076ab8c050bebb64db5035852385accda35.
The PR was merged without final approval, and has several issues:
- OSS fuzz reported issues in the conf parser,
- It calls synchrnous netlink call, it should not be especially in PID1,
- The importance of NFTSet for CGroup and DynamicUser may be
questionable, at least, there was no justification PID1 should support
it.
- For networkd, it should be implemented with Request object,
- There is no test for the feature.
Fixes #23711.
Fixes #23717.
Fixes #23719.
Fixes #23720.
Fixes #23721.
Fixes #23759.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
New directive `DynamicUserNFTSet=` provides a method for integrating
configuration of dynamic users into firewall rules with NFT sets.
Example:
```
table inet filter {
set u {
typeof meta skuid
}
chain service_output {
meta skuid != @u drop
accept
}
}
```
```
/etc/systemd/system/dunft.service
[Service]
DynamicUser=yes
DynamicUserNFTSet=inet:filter:u
ExecStart=/bin/sleep 1000
[Install]
WantedBy=multi-user.target
```
```
$ sudo nft list set inet filter u
table inet filter {
set u {
typeof meta skuid
elements = { 64864 }
}
}
$ ps -n --format user,group,pid,command -p `pgrep sleep`
USER GROUP PID COMMAND
64864 64864 55158 /bin/sleep 1000
```
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
New directive `ControlGroupNFTSet=` provides a method for integrating services
into firewall rules with NFT sets.
Example:
```
table inet filter {
...
set timesyncd {
type cgroupsv2
}
chain ntp_output {
socket cgroupv2 != @timesyncd counter drop
accept
}
...
}
```
/etc/systemd/system/systemd-timesyncd.service.d/override.conf
```
[Service]
ControlGroupNFTSet=inet:filter:timesyncd
```
```
$ sudo nft list set inet filter timesyncd
table inet filter {
set timesyncd {
type cgroupsv2
elements = { "system.slice/systemd-timesyncd.service" }
}
}
```
|
|\
| |
| | |
Reintroduce ExitType
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This introduces `ExitType=main|cgroup` for services.
Similar to how `Type` specifies the launch of a service, `ExitType` is
concerned with how systemd determines that a service exited.
- If set to `main` (the current behavior), the service manager will consider
the unit stopped when the main process exits.
- The `cgroup` exit type is meant for applications whose forking model is not
known ahead of time and which might not have a specific main process.
The service will stay running as long as at least one process in the cgroup
is running. This is intended for transient or automatically generated
services, such as graphical applications inside of a desktop environment.
Motivation for this is #16805. The original PR (#18782) was reverted (#20073)
after realizing that the exit status of "the last process in the cgroup" can't
reliably be known (#19385)
This version instead uses the main process exit status if there is one and just
listens to the cgroup empty event otherwise.
The advantages of a service with `ExitType=cgroup` over scopes are:
- Integrated logging / stdout redirection
- Avoids the race / synchronisation issue between launch and scope creation
- More extensive use of drop-ins and thus distro-level configuration:
by moving from scopes to services we can have drop ins that will affect
properties that can only be set during service creation,
like `OOMPolicy` and security-related properties
- It makes systemd-xdg-autostart-generator usable by fixing [1], as obviously
only services can be used in the generator, not scopes.
[1] https://bugs.kde.org/show_bug.cgi?id=433299
|
|/ |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
To prevent situations like in #17602 from happening, let's drop
direct recursive template dependencies. These will almost certainly
lead to infinite recursion so let's drop them immediately to avoid
instantiating potentially thousands of irrelevant units.
Example of a template that would lead to infinite recursion which
is caught by this check:
notify@.service:
```
[Unit]
Wants=notify@%n.service
```
|
|\
| |
| | |
Watchdog more rework
|
| |
| |
| |
| | |
options to "default"
|
|/
|
|
|
| |
It takes an allow or deny list of filesystems services should have
access to.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
which binaries executed by Exec*= should be found
Currently there does not exist a way to specify a path relative to which
all binaries executed by Exec should be found. The only way is to
specify the absolute path.
This change implements the functionality to specify a path relative to which
binaries executed by Exec*= can be found.
Closes #6308
|
|
|
|
|
|
|
| |
Add new settings which can be used to control cpuset based cpu affinity
during the startup phase only.
Signed-off-by: Peter Morrow <pemorrow@linux.microsoft.com>
|
|
|
|
| |
Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit cb0e818f7cc2499d81ef143e5acaa00c6e684711.
After this was merged, some design and implementation issues were discovered,
see the discussion in #18782 and #19385. They certainly can be fixed, but so
far nobody has stepped up, and we're nearing a release. Hopefully, this feature
can be merged again after a rework.
Fixes #19345.
|
| |
|
|
|
|
|
|
| |
- Parse a string for bpf attach type
- Simplify bpffs path
- Add foreign bpf program to cgroup context
|
| |
|
|
|
|
|
| |
Add support for overlaying images for services on top of their
root fs, using a read-only overlay.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
This adds a way to control SO_TIMESTAMP/SO_TIMESTAMPNS socket options
for sockets PID 1 binds to.
This is useful in journald so that we get proper timestamps even for
ingress log messages that are submitted before journald is running.
We recently turned on packet info metadata from PID 1 for these sockets,
but the timestamping info was still missing. Let's correct that.
|
|
|
|
|
| |
This adds the hook ups so it can be read with the usual systemd
utilities. Used in later commits by sytemd-oomd.
|
|
|
|
| |
No functional change intended so far.
|
|
|
|
|
|
|
|
|
|
|
|
| |
With new directive SystemCallLog= it's possible to list system calls to be
logged. This can be used for auditing or temporarily when constructing system
call filters.
---
v5: drop intermediary, update HASHMAP_FOREACH_KEY() use
v4: skip useless debug messages, actually parse directive
v3: don't declare unused variables with old libseccomp
v2: fix build without seccomp or old libseccomp
|
|
|
|
| |
Fixes: #15778 #16060
|
|
|
|
|
|
|
|
|
|
|
| |
procfs mount options
Kernel 5.8 gained a hidepid= implementation that is truly per procfs,
which allows us to mount a distinct once into every unit, with
individual hidepid= settings. Let's expose this via two new settings:
ProtectProc= (wrapping hidpid=) and ProcSubset= (wrapping subset=).
Replaces: #11670
|
|
|
|
|
|
|
|
|
|
| |
The concept is flawed, and mostly useless. Let's finally remove it.
It has been deprecated since 90a2ec10f2d43a8530aae856013518eb567c4039 (6
years ago) and we started to warn since
55dadc5c57ef1379dbc984938d124508a454be55 (1.5 years ago).
Let's get rid of it altogether.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Follows the same pattern and features as RootImage, but allows an
arbitrary mount point under / to be specified by the user, and
multiple values - like BindPaths.
Original implementation by @topimiettinen at:
https://github.com/systemd/systemd/pull/14451
Reworked to use dissect's logic instead of bare libmount() calls
and other review comments.
Thanks Topi for the initial work to come up with and implement
this useful feature.
|
|
|
|
|
|
|
|
|
|
| |
Allows to specify mount options for RootImage.
In case of multi-partition images, the partition number can be prefixed
followed by colon. Eg:
RootImageOptions=1:ro,dev 2:nosuid nodev
In absence of a partition number, 0 is assumed.
|
|
|
|
|
| |
Allow to explicitly pass root hash signature as a unit option. Takes precedence
over implicit checks.
|
|
|
|
|
| |
Allow to explicitly pass root hash (explicitly or as a file) and verity
device/file as unit options. Take precedence over implicit checks.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The usual behaviour when a timeout expires is to terminate/kill the
service. This is what user usually want in production systems. To debug
services that fail to start/stop (especially sporadic failures) it
might be necessary to trigger the watchdog machinery and write core
dumps, though. Likewise, it is usually just a waste of time to
gracefully stop a stuck service. Instead it might save time to go
directly into kill mode.
This commit adds two new options to services: TimeoutStartFailureMode=
and TimeoutStopFailureMode=. Both take the same values and tweak the
behavior of systemd when a start/stop timeout expires:
* 'terminate': is the default behaviour as it has always been,
* 'abort': triggers the watchdog machinery and will send SIGABRT
(unless WatchdogSignal was changed) and
* 'kill' will directly send SIGKILL.
To handle the stop failure mode in stop-post state too a new
final-watchdog state needs to be introduced.
|
|
|
|
| |
Fixes #6685.
|
| |
|
|
|
|
| |
Fixes: #14524
|
|
|
|
|
|
|
|
|
|
|
| |
The original PR was submitted with CPUSetCpus and CPUSetMems, which was later
changed to AllowedCPUs and AllowedMemmoryNodes everywhere (including the parser
used by systemd-run), but not in the parser for unit files.
Since we already released -rc1, let's keep support for the old names. I think
we can remove it in a release or two if anyone remembers to do that.
Fixes #14126. Follow-up for 047f5d63d7a1ab75073f8485e2f9b550d25b0772.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Introduce support for configuring cpus and mems for processes using
cgroup v2 CPUSET controller. This allows users to limit which cpus
and memory NUMA nodes can be used by processes to better utilize
system resources.
The cgroup v2 interfaces to control it are cpuset.cpus and cpuset.mems
where the requested configuration is written. However, it doesn't mean
that the requested configuration will be actually used as parent cgroup
may limit the cpus or mems as well. In order to reflect the real
configuration cgroup v2 provides read-only files cpuset.cpus.effective
and cpuset.mems.effective which are exported to users as well.
|
| |
|