author | Alan Jenkins <alan.christopher.jenkins@gmail.com> | 2018-02-02 16:06:32 +0000 |
---|---|---|
committer | Alan Jenkins <alan.christopher.jenkins@gmail.com> | 2018-02-02 18:12:34 +0000 |
commit | 2428aaf8a24e8792506de5653a373ddfcee6d722 (patch) | |
tree | a13233b9ffb5e3058961788f19c89ab6e718f0dc | |
parent | 5c19ff79de0cde873de7122f4cf417c7c3012c1a (diff) | |
download | systemd-2428aaf8a24e8792506de5653a373ddfcee6d722.tar.gz |
seccomp: allow x86-64 syscalls on x32, used by the VDSO (fix #8060)
The VDSO provided by the kernel for x32 uses x86-64 syscalls instead of
x32 ones.
I think we can safely allow this; the set of x86-64 syscalls should be
very similar to the x32 ones. The real point is not to allow *x86*
syscalls, because some of those are inconveniently multiplexed and we're
apparently not able to block the specific actions we want to.
-rw-r--r-- | man/systemd.exec.xml | 12 | ||||
-rw-r--r-- | src/shared/seccomp-util.c | 26 |
2 files changed, 29 insertions, 9 deletions
diff --git a/man/systemd.exec.xml b/man/systemd.exec.xml
index fc3b9ffd16..f01599f656 100644
--- a/man/systemd.exec.xml
+++ b/man/systemd.exec.xml
@@ -1429,17 +1429,19 @@ CapabilityBoundingSet=~CAP_B CAP_C</programlisting>
         filter. The known architecture identifiers are the same as for <varname>ConditionArchitecture=</varname>
         described in <citerefentry><refentrytitle>systemd.unit</refentrytitle><manvolnum>5</manvolnum></citerefentry>,
         as well as <constant>x32</constant>, <constant>mips64-n32</constant>, <constant>mips64-le-n32</constant>, and
-        the special identifier <constant>native</constant>. If this setting is used, processes of this unit will only
-        be permitted to call native system calls, and system calls of the specified architectures. This is an
-        effective way to disable compatibility with non-native architectures for processes, for example to prohibit
-        execution of 32-bit x86 binaries on 64-bit x86-64 systems. The special <constant>native</constant> identifier
+        the special identifier <constant>native</constant>. The special identifier <constant>native</constant>
         implicitly maps to the native architecture of the system (or more precisely: to the architecture the system
         manager is compiled for). If running in user mode, or in system mode, but without the
         <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting <varname>User=nobody</varname>),
         <varname>NoNewPrivileges=yes</varname> is implied. By default, this option is set to the empty list, i.e. no
         system call architecture filtering is applied.</para>

-        <para>Note that system call filtering is not equally effective on all architectures. For example, on x86
+        <para>If this setting is used, processes of this unit will only be permitted to call native system calls, and
+        system calls of the specified architectures. For the purposes of this option, the x32 architecture is treated
+        as including x86-64 system calls. However, this setting still fulfills its purpose, as explained below, on
+        x32.</para>
+
+        <para>System call filtering is not equally effective on all architectures. For example, on x86
         filtering of network socket-related calls is not possible, due to ABI limitations — a limitation that x86-64
         does not have, however. On systems supporting multiple ABIs at the same time — such as x86/x86-64 — it is hence
         recommended to limit the set of permitted system call architectures so that secondary ABIs may not be used to
diff --git a/src/shared/seccomp-util.c b/src/shared/seccomp-util.c
index e4bc803132..9a9d78dc49 100644
--- a/src/shared/seccomp-util.c
+++ b/src/shared/seccomp-util.c
@@ -1534,17 +1534,35 @@ int seccomp_restrict_archs(Set *archs) {
         int r;

         /* This installs a filter with no rules, but that restricts the system call architectures to the specified
-         * list. */
+         * list.
+         *
+         * There are some qualifications. However the most important use is to stop processes from bypassing
+         * system call restrictions, in case they used a broader (multiplexing) syscall which is only available
+         * in a non-native architecture. There are no holes in this use case, at least so far. */

+        /* Note libseccomp includes our "native" (current) architecture in the filter by default.
+         * We do not remove it. For example, our callers expect to be able to call execve() afterwards
+         * to run a program with the restrictions applied. */
         seccomp = seccomp_init(SCMP_ACT_ALLOW);
         if (!seccomp)
                 return -ENOMEM;

         SET_FOREACH(id, archs, i) {
                 r = seccomp_arch_add(seccomp, PTR_TO_UINT32(id) - 1);
-                if (r == -EEXIST)
-                        continue;
-                if (r < 0)
+                if (r < 0 && r != -EEXIST)
+                        return r;
+        }
+
+        /* The vdso for x32 assumes that x86-64 syscalls are available. Let's allow them, since x32
+         * x32 syscalls should basically match x86-64 for everything except the pointer type.
+         * The important thing is that you can block the old 32-bit x86 syscalls.
+         * https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=850047 */
+
+        if (seccomp_arch_native() == SCMP_ARCH_X32 ||
+            set_contains(archs, UINT32_TO_PTR(SCMP_ARCH_X32 + 1))) {
+
+                r = seccomp_arch_add(seccomp, SCMP_ARCH_X86_64);
+                if (r < 0 && r != -EEXIST)
                         return r;
         }

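
The architecture rule this change adds to seccomp_restrict_archs() can be sketched in isolation. The following is a simplified model, not systemd code: the ARCH_* flags and effective_archs() are hypothetical names invented for illustration, and a bitmask stands in for the libseccomp filter and the Set of SCMP_ARCH_* tokens that the real function uses. It assumes, as the commit's comments state, that the native architecture is always part of the filter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical architecture flags, for illustration only.
 * The real code works with libseccomp's SCMP_ARCH_* tokens. */
enum {
        ARCH_X86    = 1u << 0,
        ARCH_X86_64 = 1u << 1,
        ARCH_X32    = 1u << 2,
};

/* Returns the set of architectures whose syscalls end up permitted:
 * the native one (kept in the filter by default), the requested ones,
 * and x86-64 whenever x32 is native or requested, because the x32 VDSO
 * issues x86-64 syscalls. */
uint32_t effective_archs(uint32_t native, uint32_t requested) {
        uint32_t allowed;

        /* Exactly one native architecture bit is expected. */
        assert(native != 0 && (native & (native - 1)) == 0);

        allowed = native | requested;
        if (allowed & ARCH_X32)
                allowed |= ARCH_X86_64;

        return allowed;
}
```

In this model, a native x32 process with an empty request list still ends up with x86-64 allowed, while 32-bit x86 stays blocked unless it is requested explicitly, which is the property the commit message calls "the real point".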