summaryrefslogtreecommitdiff
path: root/kernel
Commit message (Collapse)AuthorAgeFilesLines
* rwsem: Provide down_read_non_owner() and up_read_non_owner() for -RTSebastian Andrzej Siewior2020-06-161-4/+4
| | | | | | | | | The rwsem implementation on -RT allows multiple reader and there is no owner tracking anymore. We can provide down_read_non_owner() and up_read_non_owner() by skipping the owner check bits which are only available in the !RT implementation. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
* fs/dcache: use swait_queue instead of waitqueueSebastian Andrzej Siewior2020-06-161-0/+1
| | | | | | | __d_lookup_done() invokes wake_up_all() while holding a hlist_bl_lock() which disables preemption. As a workaround convert it to swait. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
* hrtimer: Use sequence counter with associated raw spinlockAhmed S. Darwish2020-06-161-3/+10
| | | | | | | | | | | | | | | | | | A sequence counter write side critical section must be protected by some form of locking to serialize writers. A plain seqcount_t does not contain the information of which lock must be held when entering a write side critical section. Use the new seqcount_raw_spinlock_t data type, which allows to associate a raw spinlock with the sequence counter. This enables lockdep to verify that the raw spinlock used for writer serialization is held when the write side critical section is entered. If lockdep is disabled this lock association is compiled out and has neither storage size nor runtime overhead. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
* timekeeping: Use sequence counter with associated raw spinlockAhmed S. Darwish2020-06-161-8/+11
| | | | | | | | | | | | | | | | | | A sequence counter write side critical section must be protected by some form of locking to serialize writers. A plain seqcount_t does not contain the information of which lock must be held when entering a write side critical section. Use the new seqcount_raw_spinlock_t data type, which allows to associate a raw spinlock with the sequence counter. This enables lockdep to verify that the raw spinlock used for writer serialization is held when the write side critical section is entered. If lockdep is disabled this lock association is compiled out and has neither storage size nor runtime overhead. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
* sched: tasks: Use sequence counter with associated spinlockAhmed S. Darwish2020-06-161-1/+1
| | | | | | | | | | | | | | | | | | A sequence counter write side critical section must be protected by some form of locking to serialize writers. A plain seqcount_t does not contain the information of which lock must be held when entering a write side critical section. Use the new seqcount_spinlock_t data type, which allows to associate a spinlock with the sequence counter. This enables lockdep to verify that the spinlock used for writer serialization is held when the write side critical section is entered. If lockdep is disabled this lock association is compiled out and has neither storage size nor runtime overhead. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
* Revert seqcount related patchesSebastian Andrzej Siewior2020-06-161-1/+0
| | | | | | | | | | | | | | | | | | | | This is an all-in-one patch reverting the following patches: dma-buf-Use-seqlock_t-instread-disabling-preemption.patch seqlock-prevent-rt-starvation.patch NFSv4-replace-seqcount_t-with-a-seqlock_t.patch net-Qdisc-use-a-seqlock-instead-seqcount.patch net-make-devnet_rename_seq-a-mutex.patch userfaultfd-Use-a-seqlock-instead-of-seqcount.patch fs-nfs-turn-rmdir_sem-into-a-semaphore.patch fs-dcache-disable-preemption-on-i_dir_seq-s-write-si.patch list_bl.h-make-list-head-locking-RT-safe.patch fscache-initialize-cookie-hash-table-raw-spinlocks.patch fs-dcache-bring-back-explicit-INIT_HLIST_BL_HEAD-in.patch fs-dcache-Include-swait.h-header.patch fs-dcache-use-swait_queue-instead-of-waitqueue.patch Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
* workqueue: Sync with upstreamSebastian Andrzej Siewior2020-06-101-9/+19
| | | | | | | | | | | | | | | | This is an all-on-one patch reverting the following commits: workqueue: Don't assume that the callback has interrupts disabled sched/swait: Add swait_event_lock_irq() workqueue: Use swait for wq_manager_wait workqueue: Convert the locks to raw type and introducing the following commits from upstream: workqueue: Use rcuwait for wq_manager_wait workqueue: Convert the pool::lock and wq_mayday_lock to raw_spinlock_t as an replacement. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
* Merge tag 'v5.6.17' into linux-5.6.y-rtSebastian Andrzej Siewior2020-06-096-41/+57
|\ | | | | | | | | | | This is the 5.6.17 stable release Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
| * kernel/relay.c: handle alloc_percpu returning NULL in relay_openDaniel Axtens2020-06-071-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 54e200ab40fc14c863bcc80a51e20b7906608fce upstream. alloc_percpu() may return NULL, which means chan->buf may be set to NULL. In that case, when we do *per_cpu_ptr(chan->buf, ...), we dereference an invalid pointer: BUG: Unable to handle kernel data access at 0x7dae0000 Faulting instruction address: 0xc0000000003f3fec ... NIP relay_open+0x29c/0x600 LR relay_open+0x270/0x600 Call Trace: relay_open+0x264/0x600 (unreliable) __blk_trace_setup+0x254/0x600 blk_trace_setup+0x68/0xa0 sg_ioctl+0x7bc/0x2e80 do_vfs_ioctl+0x13c/0x1300 ksys_ioctl+0x94/0x130 sys_ioctl+0x48/0xb0 system_call+0x5c/0x68 Check if alloc_percpu returns NULL. This was found by syzkaller both on x86 and powerpc, and the reproducer it found on powerpc is capable of hitting the issue as an unprivileged user. Fixes: 017c59c042d0 ("relay: Use per CPU constructs for the relay channel buffer pointers") Reported-by: syzbot+1e925b4b836afe85a1c6@syzkaller-ppc64.appspotmail.com Reported-by: syzbot+587b2421926808309d21@syzkaller-ppc64.appspotmail.com Reported-by: syzbot+58320b7171734bf79d26@syzkaller.appspotmail.com Reported-by: syzbot+d6074fb08bdb2e010520@syzkaller.appspotmail.com Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Akash Goel <akash.goel@intel.com> Cc: Andrew Donnellan <ajd@linux.ibm.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Salvatore Bonaccorso <carnil@debian.org> Cc: <stable@vger.kernel.org> [4.10+] Link: http://lkml.kernel.org/r/20191219121256.26480-1-dja@axtens.net Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * Revert "cgroup: Add memory barriers to plug cgroup_rstat_updated() race window"Tejun Heo2020-06-071-13/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit d8ef4b38cb69d907f9b0e889c44d05fc0f890977 ] This reverts commit 9a9e97b2f1f2 ("cgroup: Add memory barriers to plug cgroup_rstat_updated() race window"). The commit was added in anticipation of memcg rstat conversion which needed synchronous accounting for the event counters (e.g. oom kill count). However, the conversion didn't get merged due to percpu memory overhead concern which couldn't be addressed at the time. Unfortunately, the patch's addition of smp_mb() to cgroup_rstat_updated() meant that every scheduling event now had to go through an additional full barrier and Mel Gorman noticed it as 1% regression in netperf UDP_STREAM test. There's no need to have this barrier in tree now and even if we need synchronous accounting in the future, the right thing to do is separating that out to a separate function so that hot paths which don't care about synchronous behavior don't have to pay the overhead of the full barrier. Let's revert. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Mel Gorman <mgorman@techsingularity.net> Link: http://lkml.kernel.org/r/20200409154413.GK3818@techsingularity.net Cc: v4.18+ Signed-off-by: Sasha Levin <sashal@kernel.org>
| * sched/fair: Fix enqueue_task_fair() warning some morePhil Auld2020-05-271-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit b34cb07dde7c2346dec73d053ce926aeaa087303 ] sched/fair: Fix enqueue_task_fair warning some more The recent patch, fe61468b2cb (sched/fair: Fix enqueue_task_fair warning) did not fully resolve the issues with the rq->tmp_alone_branch != &rq->leaf_cfs_rq_list warning in enqueue_task_fair. There is a case where the first for_each_sched_entity loop exits due to on_rq, having incompletely updated the list. In this case the second for_each_sched_entity loop can further modify se. The later code to fix up the list management fails to do what is needed because se does not point to the sched_entity which broke out of the first loop. The list is not fixed up because the throttled parent was already added back to the list by a task enqueue in a parallel child hierarchy. Address this by calling list_add_leaf_cfs_rq if there are throttled parents while doing the second for_each_sched_entity loop. Fixes: fe61468b2cb ("sched/fair: Fix enqueue_task_fair warning") Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20200512135222.GC2201@lorien.usersys.redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
| * sched/fair: Fix reordering of enqueue/dequeue_task_fair()Vincent Guittot2020-05-271-8/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 5ab297bab984310267734dfbcc8104566658ebef ] Even when a cgroup is throttled, the group se of a child cgroup can still be enqueued and its gse->on_rq stays true. When a task is enqueued on such child, we still have to update the load_avg and increase h_nr_running of the throttled cfs. Nevertheless, the 1st for_each_sched_entity() loop is skipped because of gse->on_rq == true and the 2nd loop because the cfs is throttled whereas we have to update both load_avg with the old h_nr_running and increase h_nr_running in such case. The same sequence can happen during dequeue when se moves to parent before breaking in the 1st loop. Note that the update of load_avg will effectively happen only once in order to sync up to the throttled time. Next call for updating load_avg will stop early because the clock stays unchanged. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Fixes: 6d4d22468dae ("sched/fair: Reorder enqueue/dequeue_task_fair path") Link: https://lkml.kernel.org/r/20200306084208.12583-1-vincent.guittot@linaro.org Signed-off-by: Sasha Levin <sashal@kernel.org>
| * sched/fair: Reorder enqueue/dequeue_task_fair pathVincent Guittot2020-05-271-22/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 6d4d22468dae3d8757af9f8b81b848a76ef4409d ] The walk through the cgroup hierarchy during the enqueue/dequeue of a task is split in 2 distinct parts for throttled cfs_rq without any added value but making code less readable. Change the code ordering such that everything related to a cfs_rq (throttled or not) will be done in the same loop. In addition, the same steps ordering is used when updating a cfs_rq: - update_load_avg - update_cfs_group - update *h_nr_running This reordering enables the use of h_nr_running in PELT algorithm. No functional and performance changes are expected and have been noticed during tests. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: "Dietmar Eggemann <dietmar.eggemann@arm.com>" Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Phil Auld <pauld@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Link: https://lore.kernel.org/r/20200224095223.13361-5-mgorman@techsingularity.net Signed-off-by: Sasha Levin <sashal@kernel.org>
| * bpf: Prevent mmap()'ing read-only maps as writableAndrii Nakryiko2020-05-271-3/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit dfeb376dd4cb2c5004aeb625e2475f58a5ff2ea7 ] As discussed in [0], it's dangerous to allow mapping BPF map, that's meant to be frozen and is read-only on BPF program side, because that allows user-space to actually store a writable view to the page even after it is frozen. This is exacerbated by BPF verifier making a strong assumption that contents of such frozen map will remain unchanged. To prevent this, disallow mapping BPF_F_RDONLY_PROG mmap()'able BPF maps as writable, ever. [0] https://lore.kernel.org/bpf/CAEf4BzYGWYhXdp6BJ7_=9OQPJxQpgug080MMjdSB72i9R+5c6g@mail.gmail.com/ Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY") Suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/bpf/20200519053824.1089415-1-andriin@fb.com Signed-off-by: Sasha Levin <sashal@kernel.org>
| * bpf: Add bpf_probe_read_{user, kernel}_str() to do_refine_retval_rangeDaniel Borkmann2020-05-271-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 47cc0ed574abcbbde0cf143ddb21a0baed1aa2df upstream. Given bpf_probe_read{,str}() BPF helpers are now only available under CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE, we need to add the drop-in replacements of bpf_probe_read_{kernel,user}_str() to do_refine_retval_range() as well to avoid hitting the same issue as in 849fa50662fbc ("bpf/verifier: refine retval R0 state for bpf_get_stack helper"). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200515101118.6508-3-daniel@iogearbox.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * bpf: Restrict bpf_probe_read{, str}() only to archs where they workDaniel Borkmann2020-05-271-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 0ebeea8ca8a4d1d453ad299aef0507dab04f6e8d upstream. Given the legacy bpf_probe_read{,str}() BPF helpers are broken on archs with overlapping address ranges, we should really take the next step to disable them from BPF use there. To generally fix the situation, we've recently added new helper variants bpf_probe_read_{user,kernel}() and bpf_probe_read_{user,kernel}_str(). For details on them, see 6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and probe_read_{user,kernel}_str helpers"). Given bpf_probe_read{,str}() have been around for ~5 years by now, there are plenty of users at least on x86 still relying on them today, so we cannot remove them entirely w/o breaking the BPF tracing ecosystem. However, their use should be restricted to archs with non-overlapping address ranges where they are working in their current form. Therefore, move this behind a CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE and have x86, arm64, arm select it (other archs supporting it can follow-up on it as well). For the remaining archs, they can workaround easily by relying on the feature probe from bpftool which spills out defines that can be used out of BPF C code to implement the drop-in replacement for old/new kernels via: bpftool feature probe macro Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/bpf/20200515101118.6508-2-daniel@iogearbox.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | printk: Force a line break on pr_cont("\n")汪勇102695662020-05-261-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the printk rework, pr_cont("\n") will not lead to a line break. A new line will only be created if - cpu != c->cpu_owner || !(flags & LOG_CONT) - c->len + len > sizeof(c->buf) Flush the buffer to enforce a new line on pr_cont(). [bigeasy: reword commit message ] Signed-off-by: 汪勇10269566 <wang.yong12@zte.com.cn> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: John Ogness <john.ogness@linutronix.de>
* | Merge tag 'v5.6.14' into linux-5.6.y-rtSebastian Andrzej Siewior2020-05-2615-82/+180
|\ \ | |/ | | | | | | | | This is the 5.6.14 stable release Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
| * bpf: Restrict bpf_trace_printk()'s %s usage and add %pks, %pus specifierDaniel Borkmann2020-05-201-32/+62
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit b2a5212fb634561bb734c6356904e37f6665b955 upstream. Usage of plain %s conversion specifier in bpf_trace_printk() suffers from the very same issue as bpf_probe_read{,str}() helpers, that is, it is broken on archs with overlapping address ranges. While the helpers have been addressed through work in 6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers"), we need an option for bpf_trace_printk() as well to fix it. Similarly as with the helpers, force users to make an explicit choice by adding %pks and %pus specifier to bpf_trace_printk() which will then pick the corresponding strncpy_from_unsafe*() variant to perform the access under KERNEL_DS or USER_DS. The %pk* (kernel specifier) and %pu* (user specifier) can later also be extended for other objects aside strings that are probed and printed under tracing, and reused out of other facilities like bpf_seq_printf() or BTF based type printing. Existing behavior of %s for current users is still kept working for archs where it is not broken and therefore gated through CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE. For archs not having this property we fall-back to pick probing under KERNEL_DS as a sensible default. Fixes: 8d3b7dce8622 ("bpf: add support for %s specifier to bpf_trace_printk()") Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Link: https://lore.kernel.org/bpf/20200515101118.6508-4-daniel@iogearbox.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * bpf: Enforce returning 0 for fentry/fexit progsYonghong Song2020-05-201-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit e92888c72fbdc6f9d07b3b0604c012e81d7c0da7 upstream. Currently, tracing/fentry and tracing/fexit prog return values are not enforced. In trampoline codes, the fentry/fexit prog return values are ignored. Let us enforce it to be 0 to avoid confusion and allows potential future extension. This patch also explicitly added return value checking for tracing/raw_tp, tracing/fmod_ret, and freplace programs such that these program return values can be anything. The purpose are two folds: 1. to make it explicit about return value expectations for these programs in verifier. 2. for tracing prog_type, if a future attach type is added, the default is -ENOTSUPP which will enforce to specify return value ranges explicitly. Fixes: fec56f5890d9 ("bpf: Introduce BPF trampoline") Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200514053206.1298415-1-yhs@fb.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * Stop the ad-hoc games with -Wno-maybe-initializedLinus Torvalds2020-05-201-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 78a5255ffb6a1af189a83e493d916ba1c54d8c75 upstream. We have some rather random rules about when we accept the "maybe-initialized" warnings, and when we don't. For example, we consider it unreliable for gcc versions < 4.9, but also if -O3 is enabled, or if optimizing for size. And then various kernel config options disabled it, because they know that they trigger that warning by confusing gcc sufficiently (ie PROFILE_ALL_BRANCHES). And now gcc-10 seems to be introducing a lot of those warnings too, so it falls under the same heading as 4.9 did. At the same time, we have a very straightforward way to _enable_ that warning when wanted: use "W=2" to enable more warnings. So stop playing these ad-hoc games, and just disable that warning by default, with the known and straight-forward "if you want to work on the extra compiler warnings, use W=123". Would it be great to have code that is always so obvious that it never confuses the compiler whether a variable is used initialized or not? Yes, it would. In a perfect world, the compilers would be smarter, and our source code would be simpler. That's currently not the world we live in, though. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * bpf: Fix bug in mmap() implementation for BPF array mapSasha Levin2020-05-201-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 333291ce5055f2039afc907badaf5b66bc1adfdc ] mmap() subsystem allows user-space application to memory-map region with initial page offset. This wasn't taken into account in initial implementation of BPF array memory-mapping. This would result in wrong pages, not taking into account requested page shift, being memory-mmaped into user-space. This patch fixes this gap and adds a test for such scenario. Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200512235925.3817805-1-andriin@fb.com Signed-off-by: Sasha Levin <sashal@kernel.org>
| * x86/ftrace: Have ftrace trampolines turn read-only at the end of system boot upSteven Rostedt (VMware)2020-05-201-22/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 59566b0b622e3e6ea928c0b8cac8a5601b00b383 ] Booting one of my machines, it triggered the following crash: Kernel/User page tables isolation: enabled ftrace: allocating 36577 entries in 143 pages Starting tracer 'function' BUG: unable to handle page fault for address: ffffffffa000005c #PF: supervisor write access in kernel mode #PF: error_code(0x0003) - permissions violation PGD 2014067 P4D 2014067 PUD 2015063 PMD 7b253067 PTE 7b252061 Oops: 0003 [#1] PREEMPT SMP PTI CPU: 0 PID: 0 Comm: swapper Not tainted 5.4.0-test+ #24 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007 RIP: 0010:text_poke_early+0x4a/0x58 Code: 34 24 48 89 54 24 08 e8 bf 72 0b 00 48 8b 34 24 48 8b 4c 24 08 84 c0 74 0b 48 89 df f3 a4 48 83 c4 10 5b c3 9c 58 fa 48 89 df <f3> a4 50 9d 48 83 c4 10 5b e9 d6 f9 ff ff 0 41 57 49 RSP: 0000:ffffffff82003d38 EFLAGS: 00010046 RAX: 0000000000000046 RBX: ffffffffa000005c RCX: 0000000000000005 RDX: 0000000000000005 RSI: ffffffff825b9a90 RDI: ffffffffa000005c RBP: ffffffffa000005c R08: 0000000000000000 R09: ffffffff8206e6e0 R10: ffff88807b01f4c0 R11: ffffffff8176c106 R12: ffffffff8206e6e0 R13: ffffffff824f2440 R14: 0000000000000000 R15: ffffffff8206eac0 FS: 0000000000000000(0000) GS:ffff88807d400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffa000005c CR3: 0000000002012000 CR4: 00000000000006b0 Call Trace: text_poke_bp+0x27/0x64 ? mutex_lock+0x36/0x5d arch_ftrace_update_trampoline+0x287/0x2d5 ? ftrace_replace_code+0x14b/0x160 ? ftrace_update_ftrace_func+0x65/0x6c __register_ftrace_function+0x6d/0x81 ftrace_startup+0x23/0xc1 register_ftrace_function+0x20/0x37 func_set_flag+0x59/0x77 __set_tracer_option.isra.19+0x20/0x3e trace_set_options+0xd6/0x13e apply_trace_boot_options+0x44/0x6d register_tracer+0x19e/0x1ac early_trace_init+0x21b/0x2c9 start_kernel+0x241/0x518 ? load_ucode_intel_bsp+0x21/0x52 secondary_startup_64+0xa4/0xb0 I was able to trigger it on other machines, when I added to the kernel command line of both "ftrace=function" and "trace_options=func_stack_trace". The cause is the "ftrace=function" would register the function tracer and create a trampoline, and it will set it as executable and read-only. Then the "trace_options=func_stack_trace" would then update the same trampoline to include the stack tracer version of the function tracer. But since the trampoline already exists, it updates it with text_poke_bp(). The problem is that text_poke_bp() called while system_state == SYSTEM_BOOTING, it will simply do a memcpy() and not the page mapping, as it would think that the text is still read-write. But in this case it is not, and we take a fault and crash. Instead, lets keep the ftrace trampolines read-write during boot up, and then when the kernel executable text is set to read-only, the ftrace trampolines get set to read-only as well. Link: https://lkml.kernel.org/r/20200430202147.4dc6e2de@oasis.local.home Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: stable@vger.kernel.org Fixes: 768ae4406a5c ("x86/ftrace: Use text_poke()") Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
| * tracing: Wait for preempt irq delay thread to executeSteven Rostedt (VMware)2020-05-201-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 8b1fac2e73e84ef0d6391051880a8e1d7044c847 ] A bug report was posted that running the preempt irq delay module on a slow machine, and removing it quickly could lead to the thread created by the modlue to execute after the module is removed, and this could cause the kernel to crash. The fix for this was to call kthread_stop() after creating the thread to make sure it finishes before allowing the module to be removed. Now this caused the opposite problem on fast machines. What now happens is the kthread_stop() can cause the kthread never to execute and the test never to run. To fix this, add a completion and wait for the kthread to execute, then wait for it to end. This issue caused the ftracetest selftests to fail on the preemptirq tests. Link: https://lore.kernel.org/r/20200510114210.15d9e4af@oasis.local.home Cc: stable@vger.kernel.org Fixes: d16a8c31077e ("tracing: Wait for preempt irq delay thread to finish") Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
| * fork: prevent accidental access to clone3 featuresChristian Brauner2020-05-201-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 3f2c788a13143620c5471ac96ac4f033fc9ac3f3 ] Jan reported an issue where an interaction between sign-extending clone's flag argument on ppc64le and the new CLONE_INTO_CGROUP feature causes clone() to consistently fail with EBADF. The whole story is a little longer. The legacy clone() syscall is odd in a bunch of ways and here two things interact. First, legacy clone's flag argument is word-size dependent, i.e. it's an unsigned long whereas most system calls with flag arguments use int or unsigned int. Second, legacy clone() ignores unknown and deprecated flags. The two of them taken together means that users on 64bit systems can pass garbage for the upper 32bit of the clone() syscall since forever and things would just work fine. Just try this on a 64bit kernel prior to v5.7-rc1 where this will succeed and on v5.7-rc1 where this will fail with EBADF: int main(int argc, char *argv[]) { pid_t pid; /* Note that legacy clone() has different argument ordering on * different architectures so this won't work everywhere. * * Only set the upper 32 bits. */ pid = syscall(__NR_clone, 0xffffffff00000000 | SIGCHLD, NULL, NULL, NULL, NULL); if (pid < 0) exit(EXIT_FAILURE); if (pid == 0) exit(EXIT_SUCCESS); if (wait(NULL) != pid) exit(EXIT_FAILURE); exit(EXIT_SUCCESS); } Since legacy clone() couldn't be extended this was not a problem so far and nobody really noticed or cared since nothing in the kernel ever bothered to look at the upper 32 bits. But once we introduced clone3() and expanded the flag argument in struct clone_args to 64 bit we opened this can of worms. With the first flag-based extension to clone3() making use of the upper 32 bits of the flag argument we've effectively made it possible for the legacy clone() syscall to reach clone3() only flags. The sign extension scenario is just the odd corner-case that we needed to figure this out. The reason we just realized this now and not already when we introduced CLONE_CLEAR_SIGHAND was that CLONE_INTO_CGROUP assumes that a valid cgroup file descriptor has been given. So the sign extension (or the user accidently passing garbage for the upper 32 bits) caused the CLONE_INTO_CGROUP bit to be raised and the kernel to error out when it didn't find a valid cgroup file descriptor. Let's fix this by always capping the upper 32 bits for all codepaths that are not aware of clone3() features. This ensures that we can't reach clone3() only features by accident via legacy clone as with the sign extension case and also that legacy clone() works exactly like before, i.e. ignoring any unknown flags. This solution risks no regressions and is also pretty clean. Fixes: 7f192e3cd316 ("fork: add clone3") Fixes: ef2c41cf38a7 ("clone3: allow spawning processes into cgroups") Reported-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dmitry V. Levin <ldv@altlinux.org> Cc: Andreas Schwab <schwab@linux-m68k.org> Cc: Florian Weimer <fw@deneb.enyo.de> Cc: libc-alpha@sourceware.org Cc: stable@vger.kernel.org # 5.3+ Link: https://sourceware.org/pipermail/libc-alpha/2020-May/113596.html Link: https://lore.kernel.org/r/20200507103214.77218-1-christian.brauner@ubuntu.com Signed-off-by: Sasha Levin <sashal@kernel.org>
| * bpf: Fix error return code in map_lookup_and_delete_elem()Wei Yongjun2020-05-201-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 7f645462ca01d01abb94d75e6768c8b3ed3a188b ] Fix to return negative error code -EFAULT from the copy_to_user() error handling case instead of 0, as done elsewhere in this function. Fixes: bd513cd08f10 ("bpf: add MAP_LOOKUP_AND_DELETE_ELEM syscall") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200430081851.166996-1-weiyongjun1@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
| * umh: fix memory leak on execve failureVincent Minet2020-05-201-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit db803036ada7d61d096783726f9771b3fc540370 ] If a UMH process created by fork_usermode_blob() fails to execute, a pair of struct file allocated by umh_pipe_setup() will leak. Under normal conditions, the caller (like bpfilter) needs to manage the lifetime of the UMH and its two pipes. But when fork_usermode_blob() fails, the caller doesn't really have a way to know what needs to be done. It seems better to do the cleanup ourselves in this case. Fixes: 449325b52b7a ("umh: introduce fork_usermode_blob() helper") Signed-off-by: Vincent Minet <v.minet@criteo.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * coredump: fix crash when umh is disabledLuis Chamberlain2020-05-141-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 3740d93e37902b31159a82da2d5c8812ed825404 upstream. Commit 64e90a8acb859 ("Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()") added the optiont to disable all call_usermodehelper() calls by setting STATIC_USERMODEHELPER_PATH to an empty string. When this is done, and crashdump is triggered, it will crash on null pointer dereference, since we make assumptions over what call_usermodehelper_exec() did. This has been reported by Sergey when one triggers a a coredump with the following configuration: ``` CONFIG_STATIC_USERMODEHELPER=y CONFIG_STATIC_USERMODEHELPER_PATH="" kernel.core_pattern = |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e ``` The way disabling the umh was designed was that call_usermodehelper_exec() would just return early, without an error. But coredump assumes certain variables are set up for us when this happens, and calls ile_start_write(cprm.file) with a NULL file. [ 2.819676] BUG: kernel NULL pointer dereference, address: 0000000000000020 [ 2.819859] #PF: supervisor read access in kernel mode [ 2.820035] #PF: error_code(0x0000) - not-present page [ 2.820188] PGD 0 P4D 0 [ 2.820305] Oops: 0000 [#1] SMP PTI [ 2.820436] CPU: 2 PID: 89 Comm: a Not tainted 5.7.0-rc1+ #7 [ 2.820680] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190711_202441-buildvm-armv7-10.arm.fedoraproject.org-2.fc31 04/01/2014 [ 2.821150] RIP: 0010:do_coredump+0xd80/0x1060 [ 2.821385] Code: e8 95 11 ed ff 48 c7 c6 cc a7 b4 81 48 8d bd 28 ff ff ff 89 c2 e8 70 f1 ff ff 41 89 c2 85 c0 0f 84 72 f7 ff ff e9 b4 fe ff ff <48> 8b 57 20 0f b7 02 66 25 00 f0 66 3d 00 8 0 0f 84 9c 01 00 00 44 [ 2.822014] RSP: 0000:ffffc9000029bcb8 EFLAGS: 00010246 [ 2.822339] RAX: 0000000000000000 RBX: ffff88803f860000 RCX: 000000000000000a [ 2.822746] RDX: 0000000000000009 RSI: 0000000000000282 RDI: 0000000000000000 [ 2.823141] RBP: ffffc9000029bde8 R08: 0000000000000000 R09: ffffc9000029bc00 [ 2.823508] R10: 0000000000000001 R11: ffff88803dec90be R12: ffffffff81c39da0 [ 2.823902] R13: ffff88803de84400 R14: 0000000000000000 R15: 0000000000000000 [ 2.824285] FS: 00007fee08183540(0000) GS:ffff88803e480000(0000) knlGS:0000000000000000 [ 2.824767] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2.825111] CR2: 0000000000000020 CR3: 000000003f856005 CR4: 0000000000060ea0 [ 2.825479] Call Trace: [ 2.825790] get_signal+0x11e/0x720 [ 2.826087] do_signal+0x1d/0x670 [ 2.826361] ? force_sig_info_to_task+0xc1/0xf0 [ 2.826691] ? force_sig_fault+0x3c/0x40 [ 2.826996] ? do_trap+0xc9/0x100 [ 2.827179] exit_to_usermode_loop+0x49/0x90 [ 2.827359] prepare_exit_to_usermode+0x77/0xb0 [ 2.827559] ? invalid_op+0xa/0x30 [ 2.827747] ret_from_intr+0x20/0x20 [ 2.827921] RIP: 0033:0x55e2c76d2129 [ 2.828107] Code: 2d ff ff ff e8 68 ff ff ff 5d c6 05 18 2f 00 00 01 c3 0f 1f 80 00 00 00 00 c3 0f 1f 80 00 00 00 00 e9 7b ff ff ff 55 48 89 e5 <0f> 0b b8 00 00 00 00 5d c3 66 2e 0f 1f 84 0 0 00 00 00 00 0f 1f 40 [ 2.828603] RSP: 002b:00007fffeba5e080 EFLAGS: 00010246 [ 2.828801] RAX: 000055e2c76d2125 RBX: 0000000000000000 RCX: 00007fee0817c718 [ 2.829034] RDX: 00007fffeba5e188 RSI: 00007fffeba5e178 RDI: 0000000000000001 [ 2.829257] RBP: 00007fffeba5e080 R08: 0000000000000000 R09: 00007fee08193c00 [ 2.829482] R10: 0000000000000009 R11: 0000000000000000 R12: 000055e2c76d2040 [ 2.829727] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 2.829964] CR2: 0000000000000020 [ 2.830149] ---[ end trace ceed83d8c68a1bf1 ]--- ``` Cc: <stable@vger.kernel.org> # v4.11+ Fixes: 64e90a8acb85 ("Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()") BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=199795 Reported-by: Tony Vroon <chainsaw@gentoo.org> Reported-by: Sergey Kvachonok <ravenexp@gmail.com> Tested-by: Sergei Trofimovich <slyfox@gentoo.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20200416162859.26518-1-mcgrof@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * tracing: Add a vmalloc_sync_mappings() for safe measureSteven Rostedt (VMware)2020-05-141-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 11f5efc3ab66284f7aaacc926e9351d658e2577b upstream. x86_64 lazily maps in the vmalloc pages, and the way this works with per_cpu areas can be complex, to say the least. Mappings may happen at boot up, and if nothing synchronizes the page tables, those page mappings may not be synced till they are used. This causes issues for anything that might touch one of those mappings in the path of the page fault handler. When one of those unmapped mappings is touched in the page fault handler, it will cause another page fault, which in turn will cause a page fault, and leave us in a loop of page faults. Commit 763802b53a42 ("x86/mm: split vmalloc_sync_all()") split vmalloc_sync_all() into vmalloc_sync_unmappings() and vmalloc_sync_mappings(), as on system exit, it did not need to do a full sync on x86_64 (although it still needed to be done on x86_32). By chance, the vmalloc_sync_all() would synchronize the page mappings done at boot up and prevent the per cpu area from being a problem for tracing in the page fault handler. But when that synchronization in the exit of a task became a nop, it caused the problem to appear. Link: https://lore.kernel.org/r/20200429054857.66e8e333@oasis.local.home Cc: stable@vger.kernel.org Fixes: 737223fbca3b1 ("tracing: Consolidate buffer allocation code") Reported-by: "Tzvetomir Stoyanov (VMware)" <tz.stoyanov@gmail.com> Suggested-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * tracing: Wait for preempt irq delay thread to finishSteven Rostedt (VMware)2020-05-141-6/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit d16a8c31077e75ecb9427fbfea59b74eed00f698 upstream. Running on a slower machine, it is possible that the preempt delay kernel thread may still be executing if the module was immediately removed after added, and this can cause the kernel to crash as the kernel thread might be executing after its code has been removed. There's no reason that the caller of the code shouldn't just wait for the delay thread to finish, as the thread can also be created by a trigger in the sysfs code, which also has the same issues. Link: http://lore.kernel.org/r/5EA2B0C8.2080706@cn.fujitsu.com Cc: stable@vger.kernel.org Fixes: 793937236d1ee ("lib: Add module for testing preemptoff/irqsoff latency tracers") Reported-by: Xiao Yang <yangx.jy@cn.fujitsu.com> Reviewed-by: Xiao Yang <yangx.jy@cn.fujitsu.com> Reviewed-by: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * tracing/kprobes: Reject new event if loc is NULLMasami Hiramatsu2020-05-141-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 5b4dcd2d201a395ad4054067bfae4a07554fbd65 upstream. Reject the new event which has NULL location for kprobes. For kprobes, user must specify at least the location. Link: http://lkml.kernel.org/r/158779376597.6082.1411212055469099461.stgit@devnote2 Cc: Tom Zanussi <zanussi@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Fixes: 2a588dd1d5d6 ("tracing: Add kprobe event command generation functions") Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * tracing/boottime: Fix kprobe event API usageMasami Hiramatsu2020-05-141-12/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit da0f1f4167e3af69e1d8b32d6d65195ddd2bfb64 upstream. Fix boottime kprobe events to use API correctly for multiple events. For example, when we set a multiprobe kprobe events in bootconfig like below, ftrace.event.kprobes.myevent { probes = "vfs_read $arg1 $arg2", "vfs_write $arg1 $arg2" } This cause an error; trace_boot: Failed to add probe: p:kprobes/myevent (null) vfs_read $arg1 $arg2 vfs_write $arg1 $arg2 This shows the 1st argument becomes NULL and multiprobes are merged to 1 probe. Link: http://lkml.kernel.org/r/158779375766.6082.201939936008972838.stgit@devnote2 Cc: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Fixes: 29a154810546 ("tracing: Change trace_boot to use kprobe_event interface") Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * tracing/kprobes: Fix a double initialization typoMasami Hiramatsu2020-05-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit dcbd21c9fca5e954fd4e3d91884907eb6d47187e ] Fix a typo that resulted in an unnecessary double initialization to addr. Link: http://lkml.kernel.org/r/158779374968.6082.2337484008464939919.stgit@devnote2 Cc: Tom Zanussi <zanussi@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Fixes: c7411a1a126f ("tracing/kprobe: Check whether the non-suffixed symbol is notrace") Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
| * ftrace: Fix memory leak caused by not freeing entry in ↵Steven Rostedt (VMware)2020-05-101-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | unregister_ftrace_direct() [ Upstream commit 353da87921a5ec654e7e9024e083f099f1b33c97 ] kmemleak reported the following: unreferenced object 0xffff90d47127a920 (size 32): comm "modprobe", pid 1766, jiffies 4294792031 (age 162.568s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 22 01 00 00 00 00 ad de ........"....... 00 78 12 a7 ff ff ff ff 00 00 b6 c0 ff ff ff ff .x.............. backtrace: [<00000000bb79e72e>] register_ftrace_direct+0xcb/0x3a0 [<00000000295e4f79>] do_one_initcall+0x72/0x340 [<00000000873ead18>] do_init_module+0x5a/0x220 [<00000000974d9de5>] load_module+0x2235/0x2550 [<0000000059c3d6ce>] __do_sys_finit_module+0xc0/0x120 [<000000005a8611b4>] do_syscall_64+0x60/0x230 [<00000000a0cdc49e>] entry_SYSCALL_64_after_hwframe+0x49/0xb3 The entry used to save the direct descriptor needs to be freed when unregistering. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
| * tracing: Fix memory leaks in trace_events_hist.cVamshi K Sthambamkadi2020-05-101-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 9da73974eb9c965dd9989befb593b8c8da9e4bdc ] kmemleak report 1: [<9092c50b>] kmem_cache_alloc_trace+0x138/0x270 [<05a2c9ed>] create_field_var+0xcf/0x180 [<528a2d68>] action_create+0xe2/0xc80 [<63f50b61>] event_hist_trigger_func+0x15b5/0x1920 [<28ea5d3d>] trigger_process_regex+0x7b/0xc0 [<3138e86f>] event_trigger_write+0x4d/0xb0 [<ffd66c19>] __vfs_write+0x30/0x200 [<4f424a0d>] vfs_write+0x96/0x1b0 [<da59a290>] ksys_write+0x53/0xc0 [<3717101a>] __ia32_sys_write+0x15/0x20 [<c5f23497>] do_fast_syscall_32+0x70/0x250 [<46e2629c>] entry_SYSENTER_32+0xaf/0x102 This is because save_vars[] of struct hist_trigger_data are not destroyed kmemleak report 2: [<9092c50b>] kmem_cache_alloc_trace+0x138/0x270 [<6e5e97c5>] create_var+0x3c/0x110 [<de82f1b9>] create_field_var+0xaf/0x180 [<528a2d68>] action_create+0xe2/0xc80 [<63f50b61>] event_hist_trigger_func+0x15b5/0x1920 [<28ea5d3d>] trigger_process_regex+0x7b/0xc0 [<3138e86f>] event_trigger_write+0x4d/0xb0 [<ffd66c19>] __vfs_write+0x30/0x200 [<4f424a0d>] vfs_write+0x96/0x1b0 [<da59a290>] ksys_write+0x53/0xc0 [<3717101a>] __ia32_sys_write+0x15/0x20 [<c5f23497>] do_fast_syscall_32+0x70/0x250 [<46e2629c>] entry_SYSENTER_32+0xaf/0x102 struct hist_field allocated through create_var() do not initialize "ref" field to 1. The code in __destroy_hist_field() does not destroy object if "ref" is initialized to zero, the condition if (--hist_field->ref > 1) always passes since unsigned int wraps. kmemleak report 3: [<f8666fcc>] __kmalloc_track_caller+0x139/0x2b0 [<bb7f80a5>] kstrdup+0x27/0x50 [<39d70006>] init_var_ref+0x58/0xd0 [<8ca76370>] create_var_ref+0x89/0xe0 [<f045fc39>] action_create+0x38f/0xc80 [<7c146821>] event_hist_trigger_func+0x15b5/0x1920 [<07de3f61>] trigger_process_regex+0x7b/0xc0 [<e87daf8f>] event_trigger_write+0x4d/0xb0 [<19bf1512>] __vfs_write+0x30/0x200 [<64ce4d27>] vfs_write+0x96/0x1b0 [<a6f34170>] ksys_write+0x53/0xc0 [<7d4230cd>] __ia32_sys_write+0x15/0x20 [<8eadca00>] do_fast_syscall_32+0x70/0x250 [<235cf985>] entry_SYSENTER_32+0xaf/0x102 hist_fields (system & event_name) are not freed Link: http://lkml.kernel.org/r/20200422061503.GA5151@cosmos Signed-off-by: Vamshi K Sthambamkadi <vamshi.k.sthambamkadi@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
| * PM: hibernate: Freeze kernel threads in software_resume()Dexuan Cui2020-05-061-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 2351f8d295ed63393190e39c2f7c1fee1a80578f upstream. Currently the kernel threads are not frozen in software_resume(), so between dpm_suspend_start(PMSG_QUIESCE) and resume_target_kernel(), system_freezable_power_efficient_wq can still try to submit SCSI commands and this can cause a panic since the low level SCSI driver (e.g. hv_storvsc) has quiesced the SCSI adapter and can not accept any SCSI commands: https://lkml.org/lkml/2020/4/10/47 At first I posted a fix (https://lkml.org/lkml/2020/4/21/1318) trying to resolve the issue from hv_storvsc, but with the help of Bart Van Assche, I realized it's better to fix software_resume(), since this looks like a generic issue, not only pertaining to SCSI. Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Merge tag 'v5.6.10' into linux-5.6.y-rtSebastian Andrzej Siewior2020-05-0428-244/+405
|\ \ | |/ | | | | This is the 5.6.10 stable release
| * sched/vtime: Work around an unitialized variable warningBorislav Petkov2020-05-021-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit e0d648f9d883ec1efab261af158d73aa30e9dd12 ] Work around this warning: kernel/sched/cputime.c: In function ‘kcpustat_field’: kernel/sched/cputime.c:1007:6: warning: ‘val’ may be used uninitialized in this function [-Wmaybe-uninitialized] because GCC can't see that val is used only when err is 0. Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200327214334.GF8015@zn.tnic Signed-off-by: Sasha Levin <sashal@kernel.org>
| * sched/isolation: Allow "isolcpus=" to skip unknown sub-parametersPeter Xu2020-05-021-2/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 3662daf023500dc084fa3b96f68a6f46179ddc73 ] The "isolcpus=" parameter allows sub-parameters before the cpulist is specified, and if the parser detects an unknown sub-parameters the whole parameter will be ignored. This design is incompatible with itself when new sub-parameters are added. An older kernel will not recognize the new sub-parameter and will invalidate the whole parameter so the CPU isolation will not take effect. It emits a warning: isolcpus: Error, unknown flag The better and compatible way is to allow "isolcpus=" to skip unknown sub-parameters, so that even if new sub-parameters are added an older kernel will still be able to behave as usual even if with the new sub-parameter specified on the command line. Ideally this should have been there when the first sub-parameter for "isolcpus=" was introduced. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200403223517.406353-1-peterx@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
| * signal: check sig before setting info in kill_pid_usb_asyncioZhiqiang Liu2020-05-021-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit eaec2b0bd30690575c581eebffae64bfb7f684ac ] In kill_pid_usb_asyncio, if signal is not valid, we do not need to set info struct. Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/f525fd08-1cf7-fb09-d20c-4359145eb940@huawei.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
| * bpf: Propagate expected_attach_type when verifying freplace programsToke Høiland-Jørgensen2020-05-021-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 03f87c0b45b177ba5f6b4a9bbe9f95e4aba31026 upstream. For some program types, the verifier relies on the expected_attach_type of the program being verified in the verification process. However, for freplace programs, the attach type was not propagated along with the verifier ops, so the expected_attach_type would always be zero for freplace programs. This in turn caused the verifier to sometimes make the wrong call for freplace programs. For all existing uses of expected_attach_type for this purpose, the result of this was only false negatives (i.e., freplace functions would be rejected by the verifier even though they were valid programs for the target they were replacing). However, should a false positive be introduced, this can lead to out-of-bounds accesses and/or crashes. The fix introduced in this patch is to propagate the expected_attach_type to the freplace program during verification, and reset it after that is done. Fixes: be8704ff07d2 ("bpf: Introduce dynamic program extensions") Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/158773526726.293902.13257293296560360508.stgit@toke.dk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * bpf: Fix handling of XADD on BTF memoryJann Horn2020-05-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 8ff3571f7e1bf3f293cc5e3dc14f2943f4fa7fcf upstream. check_xadd() can cause check_ptr_to_btf_access() to be executed with atype==BPF_READ and value_regno==-1 (meaning "just check whether the access is okay, don't tell me what type it will result in"). Handle that case properly and skip writing type information, instead of indexing into the registers at index -1 and writing into out-of-bounds memory. Note that at least at the moment, you can't actually write through a BTF pointer, so check_xadd() will reject the program after calling check_ptr_to_btf_access with atype==BPF_WRITE; but that's after the verifier has already corrupted memory. This patch assumes that BTF pointers are not available in unprivileged programs. Fixes: 9e15db66136a ("bpf: Implement accurate raw_tp context access via BTF") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200417000007.10734-2-jannh@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * perf/core: fix parent pid/tid in task exit eventsIan Rogers2020-05-021-3/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit f3bed55e850926614b9898fe982f66d2541a36a5 upstream. Current logic yields the child task as the parent. Before: $ perf record bash -c "perf list > /dev/null" $ perf script -D |grep 'FORK\|EXIT' 4387036190981094 0x5a70 [0x30]: PERF_RECORD_FORK(10472:10472):(10470:10470) 4387036606207580 0xf050 [0x30]: PERF_RECORD_EXIT(10472:10472):(10472:10472) 4387036607103839 0x17150 [0x30]: PERF_RECORD_EXIT(10470:10470):(10470:10470) ^ Note the repeated values here -------------------/ After: 383281514043 0x9d8 [0x30]: PERF_RECORD_FORK(2268:2268):(2266:2266) 383442003996 0x2180 [0x30]: PERF_RECORD_EXIT(2268:2268):(2266:2266) 383451297778 0xb70 [0x30]: PERF_RECORD_EXIT(2266:2266):(2265:2265) Fixes: 94d5d1b2d891 ("perf_counter: Report the cloning task as parent on perf_counter_fork()") Reported-by: KP Singh <kpsingh@google.com> Signed-off-by: Ian Rogers <irogers@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200417182842.12522-1-irogers@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * sched/core: Fix reset-on-fork from RT with uclampQuentin Perret2020-05-021-7/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit eaf5a92ebde5bca3bb2565616115bd6d579486cd upstream. uclamp_fork() resets the uclamp values to their default when the reset-on-fork flag is set. It also checks whether the task has a RT policy, and sets its uclamp.min to 1024 accordingly. However, during reset-on-fork, the task's policy is lowered to SCHED_NORMAL right after, hence leading to an erroneous uclamp.min setting for the new task if it was forked from RT. Fix this by removing the unnecessary check on rt_task() in uclamp_fork() as this doesn't make sense if the reset-on-fork flag is set. Fixes: 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks") Reported-by: Chitti Babu Theegala <ctheegal@codeaurora.org> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Patrick Bellasi <patrick.bellasi@matbug.net> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lkml.kernel.org/r/20200416085956.217587-1-qperret@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * cpumap: Avoid warning when CONFIG_DEBUG_PER_CPU_MAPS is enabledToke Høiland-Jørgensen2020-05-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit bc23d0e3f717ced21fbfacab3ab887d55e5ba367 upstream. When the kernel is built with CONFIG_DEBUG_PER_CPU_MAPS, the cpumap code can trigger a spurious warning if CONFIG_CPUMASK_OFFSTACK is also set. This happens because in this configuration, NR_CPUS can be larger than nr_cpumask_bits, so the initial check in cpu_map_alloc() is not sufficient to guard against hitting the warning in cpumask_check(). Fix this by explicitly checking the supplied key against the nr_cpumask_bits variable before calling cpu_possible(). Fixes: 6710e1126934 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP") Reported-by: Xiumei Mu <xmu@redhat.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Xiumei Mu <xmu@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200416083120.453718-1-toke@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * bpf: Forbid XADD on spilled pointers for unprivileged usersJann Horn2020-05-021-9/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 6e7e63cbb023976d828cdb22422606bf77baa8a9 upstream. When check_xadd() verifies an XADD operation on a pointer to a stack slot containing a spilled pointer, check_stack_read() verifies that the read, which is part of XADD, is valid. However, since the placeholder value -1 is passed as `value_regno`, check_stack_read() can only return a binary decision and can't return the type of the value that was read. The intent here is to verify whether the value read from the stack slot may be used as a SCALAR_VALUE; but since check_stack_read() doesn't check the type, and the type information is lost when check_stack_read() returns, this is not enforced, and a malicious user can abuse XADD to leak spilled kernel pointers. Fix it by letting check_stack_read() verify that the value is usable as a SCALAR_VALUE if no type information is passed to the caller. To be able to use __is_pointer_value() in check_stack_read(), move it up. Fix up the expected unprivileged error message for a BPF selftest that, until now, assumed that unprivileged users can use XADD on stack-spilled pointers. This also gives us a test for the behavior introduced in this patch for free. In theory, this could also be fixed by forbidding XADD on stack spills entirely, since XADD is a locked operation (for operations on memory with concurrency) and there can't be any concurrency on the BPF stack; but Alexei has said that he wants to keep XADD on stack slots working to avoid changes to the test suite [1]. The following BPF program demonstrates how to leak a BPF map pointer as an unprivileged user using this bug: // r7 = map_pointer BPF_LD_MAP_FD(BPF_REG_7, small_map), // r8 = launder(map_pointer) BPF_STX_MEM(BPF_DW, BPF_REG_FP, BPF_REG_7, -8), BPF_MOV64_IMM(BPF_REG_1, 0), ((struct bpf_insn) { .code = BPF_STX | BPF_DW | BPF_XADD, .dst_reg = BPF_REG_FP, .src_reg = BPF_REG_1, .off = -8 }), BPF_LDX_MEM(BPF_DW, BPF_REG_8, BPF_REG_FP, -8), // store r8 into map BPF_MOV64_REG(BPF_REG_ARG1, BPF_REG_7), BPF_MOV64_REG(BPF_REG_ARG2, BPF_REG_FP), BPF_ALU64_IMM(BPF_ADD, BPF_REG_ARG2, -4), BPF_ST_MEM(BPF_W, BPF_REG_ARG2, 0, 0), BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem), BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1), BPF_EXIT_INSN(), BPF_STX_MEM(BPF_DW, BPF_REG_0, BPF_REG_8, 0), BPF_MOV64_IMM(BPF_REG_0, 0), BPF_EXIT_INSN() [1] https://lore.kernel.org/bpf/20200416211116.qxqcza5vo2ddnkdq@ast-mbp.dhcp.thefacebook.com/ Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200417000007.10734-1-jannh@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * printk: queue wake_up_klogd irq_work only if per-CPU areas are readySergey Senozhatsky2020-05-023-10/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit ab6f762f0f53162d41497708b33c9a3236d3609e upstream. printk_deferred(), similarly to printk_safe/printk_nmi, does not immediately attempt to print a new message on the consoles, avoiding calls into non-reentrant kernel paths, e.g. scheduler or timekeeping, which potentially can deadlock the system. Those printk() flavors, instead, rely on per-CPU flush irq_work to print messages from safer contexts. For same reasons (recursive scheduler or timekeeping calls) printk() uses per-CPU irq_work in order to wake up user space syslog/kmsg readers. However, only printk_safe/printk_nmi do make sure that per-CPU areas have been initialised and that it's safe to modify per-CPU irq_work. This means that, for instance, should printk_deferred() be invoked "too early", that is before per-CPU areas are initialised, printk_deferred() will perform illegal per-CPU access. Lech Perczak [0] reports that after commit 1b710b1b10ef ("char/random: silence a lockdep splat with printk()") user-space syslog/kmsg readers are not able to read new kernel messages. The reason is printk_deferred() being called too early (as was pointed out by Petr and John). Fix printk_deferred() and do not queue per-CPU irq_work before per-CPU areas are initialized. Link: https://lore.kernel.org/lkml/aa0732c6-5c4e-8a8b-a1c1-75ebe3dca05b@camlintechnologies.com/ Reported-by: Lech Perczak <l.perczak@camlintechnologies.com> Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Tested-by: Jann Horn <jannh@google.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Theodore Ts'o <tytso@mit.edu> Cc: John Ogness <john.ogness@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * audit: check the length of userspace generated audit recordsPaul Moore2020-04-291-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 763dafc520add02a1f4639b500c509acc0ea8e5b upstream. Commit 756125289285 ("audit: always check the netlink payload length in audit_receive_msg()") fixed a number of missing message length checks, but forgot to check the length of userspace generated audit records. The good news is that you need CAP_AUDIT_WRITE to submit userspace audit records, which is generally only given to trusted processes, so the impact should be limited. Cc: stable@vger.kernel.org Fixes: 756125289285 ("audit: always check the netlink payload length in audit_receive_msg()") Reported-by: syzbot+49e69b4d71a420ceda3e@syzkaller.appspotmail.com Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * signal: Avoid corrupting si_pid and si_uid in do_notify_parentEric W. Biederman2020-04-291-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 61e713bdca3678e84815f2427f7a063fc353a1fc upstream. Christof Meerwald <cmeerw@cmeerw.org> writes: > Hi, > > this is probably related to commit > 7a0cf094944e2540758b7f957eb6846d5126f535 (signal: Correct namespace > fixups of si_pid and si_uid). > > With a 5.6.5 kernel I am seeing SIGCHLD signals that don't include a > properly set si_pid field - this seems to happen for multi-threaded > child processes. > > A simple test program (based on the sample from the signalfd man page): > > #include <sys/signalfd.h> > #include <signal.h> > #include <unistd.h> > #include <spawn.h> > #include <stdlib.h> > #include <stdio.h> > > #define handle_error(msg) \ > do { perror(msg); exit(EXIT_FAILURE); } while (0) > > int main(int argc, char *argv[]) > { > sigset_t mask; > int sfd; > struct signalfd_siginfo fdsi; > ssize_t s; > > sigemptyset(&mask); > sigaddset(&mask, SIGCHLD); > > if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1) > handle_error("sigprocmask"); > > pid_t chldpid; > char *chldargv[] = { "./sfdclient", NULL }; > posix_spawn(&chldpid, "./sfdclient", NULL, NULL, chldargv, NULL); > > sfd = signalfd(-1, &mask, 0); > if (sfd == -1) > handle_error("signalfd"); > > for (;;) { > s = read(sfd, &fdsi, sizeof(struct signalfd_siginfo)); > if (s != sizeof(struct signalfd_siginfo)) > handle_error("read"); > > if (fdsi.ssi_signo == SIGCHLD) { > printf("Got SIGCHLD %d %d %d %d\n", > fdsi.ssi_status, fdsi.ssi_code, > fdsi.ssi_uid, fdsi.ssi_pid); > return 0; > } else { > printf("Read unexpected signal\n"); > } > } > } > > > and a multi-threaded client to test with: > > #include <unistd.h> > #include <pthread.h> > > void *f(void *arg) > { > sleep(100); > } > > int main() > { > pthread_t t[8]; > > for (int i = 0; i != 8; ++i) > { > pthread_create(&t[i], NULL, f, NULL); > } > } > > I tried to do a bit of debugging and what seems to be happening is > that > > /* From an ancestor pid namespace? */ > if (!task_pid_nr_ns(current, task_active_pid_ns(t))) { > > fails inside task_pid_nr_ns because the check for "pid_alive" fails. > > This code seems to be called from do_notify_parent and there we > actually have "tsk != current" (I am assuming both are threads of the > current process?) I instrumented the code with a warning and received the following backtrace: > WARNING: CPU: 0 PID: 777 at kernel/pid.c:501 __task_pid_nr_ns.cold.6+0xc/0x15 > Modules linked in: > CPU: 0 PID: 777 Comm: sfdclient Not tainted 5.7.0-rc1userns+ #2924 > Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 > RIP: 0010:__task_pid_nr_ns.cold.6+0xc/0x15 > Code: ff 66 90 48 83 ec 08 89 7c 24 04 48 8d 7e 08 48 8d 74 24 04 e8 9a b6 44 00 48 83 c4 08 c3 48 c7 c7 59 9f ac 82 e8 c2 c4 04 00 <0f> 0b e9 3fd > RSP: 0018:ffffc9000042fbf8 EFLAGS: 00010046 > RAX: 000000000000000c RBX: 0000000000000000 RCX: ffffc9000042faf4 > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff81193d29 > RBP: ffffc9000042fc18 R08: 0000000000000000 R09: 0000000000000001 > R10: 000000100f938416 R11: 0000000000000309 R12: ffff8880b941c140 > R13: 0000000000000000 R14: 0000000000000000 R15: ffff8880b941c140 > FS: 0000000000000000(0000) GS:ffff8880bca00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007f2e8c0a32e0 CR3: 0000000002e10000 CR4: 00000000000006f0 > Call Trace: > send_signal+0x1c8/0x310 > do_notify_parent+0x50f/0x550 > release_task.part.21+0x4fd/0x620 > do_exit+0x6f6/0xaf0 > do_group_exit+0x42/0xb0 > get_signal+0x13b/0xbb0 > do_signal+0x2b/0x670 > ? __audit_syscall_exit+0x24d/0x2b0 > ? rcu_read_lock_sched_held+0x4d/0x60 > ? kfree+0x24c/0x2b0 > do_syscall_64+0x176/0x640 > ? trace_hardirqs_off_thunk+0x1a/0x1c > entry_SYSCALL_64_after_hwframe+0x49/0xb3 The immediate problem is as Christof noticed that "pid_alive(current) == false". This happens because do_notify_parent is called from the last thread to exit in a process after that thread has been reaped. The bigger issue is that do_notify_parent can be called from any process that manages to wait on a thread of a multi-threaded process from wait_task_zombie. So any logic based upon current for do_notify_parent is just nonsense, as current can be pretty much anything. So change do_notify_parent to call __send_signal directly. Inspecting the code it appears this problem has existed since the pid namespace support started handling this case in 2.6.30. This fix only backports to 7a0cf094944e ("signal: Correct namespace fixups of si_pid and si_uid") where the problem logic was moved out of __send_signal and into send_signal. Cc: stable@vger.kernel.org Fixes: 6588c1e3ff01 ("signals: SI_USER: Masquerade si_pid when crossing pid ns boundary") Ref: 921cf9f63089 ("signals: protect cinit from unblocked SIG_DFL signals") Link: https://lore.kernel.org/lkml/20200419201336.GI22017@edge.cmeerw.net/ Reported-by: Christof Meerwald <cmeerw@cmeerw.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * perf/core: Disable page faults when getting phys addressJiri Olsa2020-04-291-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit d3296fb372bf7497b0e5d0478c4e7a677ec6f6e9 ] We hit following warning when running tests on kernel compiled with CONFIG_DEBUG_ATOMIC_SLEEP=y: WARNING: CPU: 19 PID: 4472 at mm/gup.c:2381 __get_user_pages_fast+0x1a4/0x200 CPU: 19 PID: 4472 Comm: dummy Not tainted 5.6.0-rc6+ #3 RIP: 0010:__get_user_pages_fast+0x1a4/0x200 ... Call Trace: perf_prepare_sample+0xff1/0x1d90 perf_event_output_forward+0xe8/0x210 __perf_event_overflow+0x11a/0x310 __intel_pmu_pebs_event+0x657/0x850 intel_pmu_drain_pebs_nhm+0x7de/0x11d0 handle_pmi_common+0x1b2/0x650 intel_pmu_handle_irq+0x17b/0x370 perf_event_nmi_handler+0x40/0x60 nmi_handle+0x192/0x590 default_do_nmi+0x6d/0x150 do_nmi+0x2f9/0x3c0 nmi+0x8e/0xd7 While __get_user_pages_fast() is IRQ-safe, it calls access_ok(), which warns on: WARN_ON_ONCE(!in_task() && !pagefault_disabled()) Peter suggested disabling page faults around __get_user_pages_fast(), which gets rid of the warning in access_ok() call. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200407141427.3184722-1-jolsa@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>