author Alan Jenkins <alan.christopher.jenkins@gmail.com> 2018-02-02 16:06:32 +0000
committer Alan Jenkins <alan.christopher.jenkins@gmail.com> 2018-02-02 18:12:34 +0000
commit 2428aaf8a24e8792506de5653a373ddfcee6d722
tree a13233b9ffb5e3058961788f19c89ab6e718f0dc
parent 5c19ff79de0cde873de7122f4cf417c7c3012c1a
seccomp: allow x86-64 syscalls on x32, used by the VDSO (fix #8060)
The VDSO that the kernel provides for x32 uses x86-64 syscalls instead of x32 ones. I think we can safely allow this; the set of x86-64 syscalls should be very similar to the x32 ones. The real point is to avoid allowing *x86* syscalls, because some of those are inconveniently multiplexed and we're apparently not able to block the specific actions we want to.
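The rule this commit adds can be sketched in isolation as below. This is only an illustration, not the patched code: the `enum arch` identifiers and the helpers `arch_bit()`, `allowed_archs()` and `arch_allowed()` are hypothetical stand-ins, whereas the real seccomp_restrict_archs() works on libseccomp's SCMP_ARCH_* constants and a Set. The sketch assumes the native architecture is always kept, as the patch's comment says libseccomp does by default.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical architecture identifiers, for illustration only; the real
 * code uses libseccomp's SCMP_ARCH_* constants. */
enum arch { ARCH_X86, ARCH_X86_64, ARCH_X32, ARCH_COUNT };

/* One bit per architecture in the mask. */
static uint32_t arch_bit(enum arch a) {
        assert(a < ARCH_COUNT);
        return UINT32_C(1) << a;
}

/* Returns the mask of architectures whose syscalls stay permitted: the
 * native one, the requested ones, and x86-64 whenever x32 is involved. */
static uint32_t allowed_archs(enum arch native, const enum arch *requested, size_t n) {
        uint32_t mask = arch_bit(native);       /* native is always kept */

        for (size_t i = 0; i < n; i++)
                mask |= arch_bit(requested[i]);

        /* The x32 VDSO issues x86-64 syscalls, so x32 implies x86-64. */
        if (mask & arch_bit(ARCH_X32))
                mask |= arch_bit(ARCH_X86_64);

        return mask;
}

/* True if syscalls of architecture a are permitted by mask. */
static bool arch_allowed(uint32_t mask, enum arch a) {
        return (mask & arch_bit(a)) != 0;
}
```

In this sketch, requesting x32 on an x86-64 host, or running natively on x32, both leave x86-64 permitted, while 32-bit x86 stays blocked unless it is listed explicitly.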
-rw-r--r-- man/systemd.exec.xml 12
-rw-r--r-- src/shared/seccomp-util.c 26
2 files changed, 29 insertions, 9 deletions
diff --git a/man/systemd.exec.xml b/man/systemd.exec.xml
index fc3b9ffd16..f01599f656 100644
--- a/man/systemd.exec.xml
+++ b/man/systemd.exec.xml
@@ -1429,17 +1429,19 @@ CapabilityBoundingSet=~CAP_B CAP_C</programlisting>
filter. The known architecture identifiers are the same as for <varname>ConditionArchitecture=</varname>
described in <citerefentry><refentrytitle>systemd.unit</refentrytitle><manvolnum>5</manvolnum></citerefentry>,
as well as <constant>x32</constant>, <constant>mips64-n32</constant>, <constant>mips64-le-n32</constant>, and
- the special identifier <constant>native</constant>. If this setting is used, processes of this unit will only
- be permitted to call native system calls, and system calls of the specified architectures. This is an
- effective way to disable compatibility with non-native architectures for processes, for example to prohibit
- execution of 32-bit x86 binaries on 64-bit x86-64 systems. The special <constant>native</constant> identifier
+ the special identifier <constant>native</constant>. The special identifier <constant>native</constant>
implicitly maps to the native architecture of the system (or more precisely: to the architecture the system
manager is compiled for). If running in user mode, or in system mode, but without the
<constant>CAP_SYS_ADMIN</constant> capability (e.g. setting <varname>User=nobody</varname>),
<varname>NoNewPrivileges=yes</varname> is implied. By default, this option is set to the empty list, i.e. no
system call architecture filtering is applied.</para>
- <para>Note that system call filtering is not equally effective on all architectures. For example, on x86
+ <para>If this setting is used, processes of this unit will only be permitted to call native system calls, and
+ system calls of the specified architectures. For the purposes of this option, the x32 architecture is treated
+ as including x86-64 system calls. However, this setting still fulfills its purpose, as explained below, on
+ x32.</para>
+
+ <para>System call filtering is not equally effective on all architectures. For example, on x86
filtering of network socket-related calls is not possible, due to ABI limitations — a limitation that x86-64
does not have, however. On systems supporting multiple ABIs at the same time — such as x86/x86-64 — it is hence
recommended to limit the set of permitted system call architectures so that secondary ABIs may not be used to
diff --git a/src/shared/seccomp-util.c b/src/shared/seccomp-util.c
index e4bc803132..9a9d78dc49 100644
--- a/src/shared/seccomp-util.c
+++ b/src/shared/seccomp-util.c
@@ -1534,17 +1534,35 @@ int seccomp_restrict_archs(Set *archs) {
int r;
/* This installs a filter with no rules, but that restricts the system call architectures to the specified
- * list. */
+ * list.
+ *
+ * There are some qualifications. However the most important use is to stop processes from bypassing
+ * system call restrictions, in case they used a broader (multiplexing) syscall which is only available
+ * in a non-native architecture. There are no holes in this use case, at least so far. */
+ /* Note libseccomp includes our "native" (current) architecture in the filter by default.
+ * We do not remove it. For example, our callers expect to be able to call execve() afterwards
+ * to run a program with the restrictions applied. */
seccomp = seccomp_init(SCMP_ACT_ALLOW);
if (!seccomp)
return -ENOMEM;
SET_FOREACH(id, archs, i) {
r = seccomp_arch_add(seccomp, PTR_TO_UINT32(id) - 1);
- if (r == -EEXIST)
- continue;
- if (r < 0)
+ if (r < 0 && r != -EEXIST)
+ return r;
+ }
+
+ /* The vdso for x32 assumes that x86-64 syscalls are available. Let's allow them, since
+ * x32 syscalls should basically match x86-64 for everything except the pointer type.
+ * The important thing is that you can block the old 32-bit x86 syscalls.
+ * https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=850047 */
+
+ if (seccomp_arch_native() == SCMP_ARCH_X32 ||
+ set_contains(archs, UINT32_TO_PTR(SCMP_ARCH_X32 + 1))) {
+
+ r = seccomp_arch_add(seccomp, SCMP_ARCH_X86_64);
+ if (r < 0 && r != -EEXIST)
return r;
}