path: root/kernel
Commit message  Author  Age  Files  Lines
* mm: memcg: remote memcg charging for kmem allocations  Shakeel Butt  2018-04-19  1  -0/+3

    Patch series "Directed kmem charging", v4.

    This patchset introduces memcg variants of the memory allocation
    functions, so the caller can explicitly pass the memcg to charge for
    kmem allocations. Currently, for __GFP_ACCOUNT allocation requests,
    the kernel extracts the memcg of the current task and charges it for
    the kmem allocation. This series introduces allocation functions where
    the caller can pass a pointer to a remote memcg, which is then charged
    for the allocation instead of the caller's memcg. The caller must hold
    a reference to the remote memcg.

    This patch (of 2):

    Introduce the memcg variants of kmalloc[_node] and
    kmem_cache_alloc[_node]. For kmem_cache_alloc, the kernel switches the
    root kmem cache with the memcg-specific kmem cache for __GFP_ACCOUNT
    allocations in order to charge those allocations to the memcg, but the
    memcg to charge is extracted from the current task_struct. This patch
    introduces variants of the kmem cache allocation functions where the
    memcg is provided explicitly by the caller instead of being deduced
    from the current task.

    kmalloc allocations are served from the kmem caches unless the
    requested size is larger than KMALLOC_MAX_CACHE_SIZE, in which case
    the kmem caches are bypassed and the request is routed directly to the
    page allocator. So, for __GFP_ACCOUNT kmalloc allocations, the memcg
    of the current task is charged. This patch introduces memcg variants
    of the kmalloc functions so that callers can provide the memcg to
    charge.

    These functions are useful where allocations should be charged to a
    memcg different from that of the caller. One concrete use case is the
    allocation of fsnotify event objects, which should be charged to the
    listener instead of the producer. Callers of these functions must hold
    a reference to the memcg. Using kmalloc_memcg and
    kmem_cache_alloc_memcg implicitly assumes a __GFP_ACCOUNT allocation.

    Link: http://lkml.kernel.org/r/20180305182951.34462-2-shakeelb@google.com
    Signed-off-by: Shakeel Butt <shakeelb@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Amir Goldstein <amir73il@gmail.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
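    A minimal usage sketch of the new API (the fsnotify-style names here
    are illustrative assumptions, not code from this patch):

        struct fsnotify_event *event;

        /* Charge the listener's memcg instead of the producer's.
         * kmalloc_memcg() implies __GFP_ACCOUNT, and the caller must
         * already hold a reference on group->memcg. */
        event = kmalloc_memcg(sizeof(*event), GFP_KERNEL, group->memcg);
        if (!event)
                return -ENOMEM;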
* kernel/kexec_file.c: load kernel at top of system RAM if required  Baoquan He  2018-04-19  1  -0/+2

    For kexec_file loading, if kexec_buf.top_down is 'true', the memory
    used to load the kernel/initrd/purgatory is supposed to be allocated
    from top to bottom. This is also consistent with the old kexec loading
    interface.

    However, the current arch_kexec_walk_mem() does not do this. It
    ignores kexec_buf.top_down and calls walk_system_ram_res() directly to
    go through all System RAM resources, trying to find a memory region
    which can contain the kexec buffer, and then calls
    locate_mem_hole_callback() to allocate memory within that region from
    top to bottom. This is not right.

    Add a check of kexec_buf.top_down in arch_kexec_walk_mem(); if it is
    'true', call the newly added walk_system_ram_res_rev() to search for a
    memory region from top to bottom when loading the kernel.

    The problem is that the current kexec file loading behaves differently
    from the old kexec loading. With kexec loading, the user-space
    kexec-tools utility does most of the work: it searches /proc/iomem for
    a memory region from top to bottom in which to load the kernel.
    Therefore, with kexec loading, x86_64 bzImage kernels are always
    loaded at the top of System RAM. The kexec file loading, however,
    searches for an available memory region in the iomem resources from
    bottom to top, and then tries to allocate from top to bottom within
    that region.

    E.g. on my test system with 2G of memory (map below), kexec loading
    will put the kernel near 0x000000013fffffff, while kexec file loading
    will put it near 0x000000003ffddfff. No bug has been reported yet, but
    we need to consider this when handling kexec's interaction with other
    kernel components such as kaslr and hotplug, and also the consistency
    between the different kexec interfaces.

    [Mar23 15:13] Linux version 4.16.0-rc3+ (bhe@localhost.localdomain) (gcc version
    [ +0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.16.0-rc3+ root=UUID=be8f8e3a-9
    [ +0.000000] x86/fpu: x87 FPU will use FXSAVE
    [ +0.000000] e820: BIOS-provided physical RAM map:
    [ +0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
    [ +0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
    [ +0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
    [ +0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000003ffddfff] usable
    [ +0.000000] BIOS-e820: [mem 0x000000003ffde000-0x000000003fffffff] reserved
    [ +0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
    [ +0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
    [ +0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000013fffffff] usable

    I searched the internet and found the original patches that added
    bzImage64 support to the old kexec loading in the user-space
    kexec-tools utility, made by Yinghai and reviewed by Vivek and hpa.
    Still, I did not find out why the kernel has to be put at the top of
    System RAM. My guess is that low memory is mostly reserved for system
    usage, so putting the kexec kernel at the top is safer and avoids
    having to exclude all the regions reserved by the system or firmware,
    which we may not be able to enumerate completely; but I am not very
    sure about that.

    Link: http://lkml.kernel.org/r/20180322033722.9279-3-bhe@redhat.com
    Signed-off-by: Baoquan He <bhe@redhat.com>
    Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
    Cc: Dave Young <dyoung@redhat.com>
    Cc: Eric W. Biederman <ebiederm@xmission.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Philipp Rudo <prudo@linux.vnet.ibm.com>
    Cc: Vivek Goyal <vgoyal@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
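    A sketch of the described check (the exact hunk in the patch may
    differ slightly):

        int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
                                       int (*func)(struct resource *, void *))
        {
                if (kbuf->image->type == KEXEC_TYPE_CRASH)
                        return walk_iomem_res_desc(crashk_res.desc,
                                                   IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
                                                   crashk_res.start, crashk_res.end,
                                                   kbuf, func);
                if (kbuf->top_down)
                        /* search System RAM from top to bottom */
                        return walk_system_ram_res_rev(0, ULONG_MAX, kbuf, func);
                return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
        }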
* kernel/kexec_file.c: add walk_system_ram_res_rev()  AKASHI Takahiro  2018-04-19  1  -0/+63

    This function, a variant of walk_system_ram_res() introduced in commit
    8c86e70acead ("resource: provide new functions to walk through
    resources"), walks through the list of all System RAM resources in
    reverse order, i.e., from higher to lower addresses. It will be used
    in the kexec_file code.

    Link: http://lkml.kernel.org/r/20180322033722.9279-2-bhe@redhat.com
    Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
    Signed-off-by: Baoquan He <bhe@redhat.com>
    Cc: Eric W. Biederman <ebiederm@xmission.com>
    Cc: Vivek Goyal <vgoyal@redhat.com>
    Cc: Dave Young <dyoung@redhat.com>
    Cc: Philipp Rudo <prudo@linux.vnet.ibm.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
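    The new helper mirrors the walk_system_ram_res() prototype (a sketch;
    see the patch for the actual implementation):

        /* Walk System RAM resources from highest to lowest address,
         * calling func() on each one until it returns non-zero. */
        int walk_system_ram_res_rev(u64 start, u64 end, void *arg,
                                    int (*func)(struct resource *, void *));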
* Merge branch 'akpm-current/current'  Stephen Rothwell  2018-04-19  5  -3/+31
|\
| * fork: unconditionally clear stack on fork  Kees Cook  2018-04-15  1  -2/+1

    One of the classes of kernel stack content leaks [1] is exposing the
    contents of prior heap or stack allocations when a new process stack
    is allocated. Normally, those stacks are not zeroed, and the old
    contents remain in place; in the face of stack content exposure
    flaws, those contents can leak to userspace. Fixing this makes the
    kernel no longer vulnerable to these flaws, as the stack is wiped
    each time a stack is assigned to a new process.

    There is no meaningful change in runtime performance; it almost looks
    like it provides a benefit. Performing back-to-back kernel builds
    before:
        Run times: 157.86 157.09 158.90 160.94 160.80
        Mean: 159.12
        Std Dev: 1.54
    and after:
        Run times: 159.31 157.34 156.71 158.15 160.81
        Mean: 158.46
        Std Dev: 1.46

    Instead of making this a build-time or runtime config option, Andy
    Lutomirski recommended it simply be enabled by default.

    [1] A noisy search for many kinds of stack content leaks can be seen
    here: https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=linux+kernel+stack+leak

    Link: http://lkml.kernel.org/r/20180221021659.GA37073@beast
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Cc: Andy Lutomirski <luto@kernel.org>
    Cc: Laura Abbott <labbott@redhat.com>
    Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
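    The effect of the change, in sketch form (the mechanism is an
    assumption on my part; the patch's actual hunks are in kernel/fork.c
    and the thread-info headers):

        /* Allocate thread stacks pre-zeroed so no prior heap or stack
         * contents survive into a new process. */
        #define THREADINFO_GFP  (GFP_KERNEL_ACCOUNT | __GFP_ZERO)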
| * cgroup: list groupoom in cgroup features  Roman Gushchin  2018-04-15  1  -1/+2

    List groupoom in the cgroup features list (exported via
    /sys/kernel/cgroup/features), which can be used by userspace
    applications (most likely systemd) to determine which cgroup features
    are supported by the kernel.

    Link: http://lkml.kernel.org/r/20171130152824.1591-8-guro@fb.com
    Signed-off-by: Roman Gushchin <guro@fb.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer  Roman Gushchin  2018-04-15  1  -0/+10

    Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware OOM
    killer. If not set, OOM victim selection is performed in the
    "traditional" per-process way. The behavior can be changed dynamically
    by remounting the cgroup filesystem.

    Link: http://lkml.kernel.org/r/20171130152824.1591-6-guro@fb.com
    Signed-off-by: Roman Gushchin <guro@fb.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * prctl: add PR_[GS]ET_PDEATHSIG_PROC  Jürg Billeter  2018-04-15  4  -0/+18

    PR_SET_PDEATHSIG sets a parent death signal that the calling process
    will get when its parent thread dies, even when the result of
    getppid() doesn't change because the calling process is reparented to
    a different thread in the same parent process. When managing multiple
    processes, a process-based parent death signal is much more useful,
    e.g., to avoid stray child processes.

    PR_SET_PDEATHSIG_PROC sets a process-based death signal. Unlike
    PR_SET_PDEATHSIG, this is inherited across fork to allow killing a
    whole subtree without race conditions. Combined with a seccomp filter,
    this can be used for sandboxing.

    There have been previous attempts to support this by changing the
    behavior of PR_SET_PDEATHSIG, but that would break existing
    applications. See https://marc.info/?l=linux-kernel&m=117621804801689
    and https://bugzilla.kernel.org/show_bug.cgi?id=43300

    Link: http://lkml.kernel.org/r/20170929123058.48924-1-j@bitron.ch
    Signed-off-by: Jürg Billeter <j@bitron.ch>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Eric Biederman <ebiederm@xmission.com>
    Cc: Michael Kerrisk <mtk.manpages@gmail.com>
    Cc: Filipe Brandenburger <filbranden@google.com>
    Cc: David Wilcox <davidvsthegiant@gmail.com>
    Cc: "Adam H . Peterson" <alphaetapi@hotmail.com>
    Cc: <hansecke@gmail.com>
    Cc: <linux-api@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
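    A minimal usage sketch from a child process (assuming the new
    constant is exposed via <linux/prctl.h>):

        #include <sys/prctl.h>
        #include <signal.h>

        /* Deliver SIGKILL when the parent *process* dies, not merely
         * the parent thread; unlike PR_SET_PDEATHSIG, this setting is
         * inherited across fork(). */
        if (prctl(PR_SET_PDEATHSIG_PROC, SIGKILL) < 0)
                perror("prctl");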
* | Merge remote-tracking branch 'livepatching/for-next'  Stephen Rothwell  2018-04-19  1  -37/+71
|\ \
| * | livepatch: Allow to call a custom callback when freeing shadow variables  Petr Mladek  2018-04-17  1  -8/+18

    We might need to take some action before a shadow variable is freed;
    for example, remove it from a list or free data that it points to.
    This is already possible now: the user can look up the shadow variable
    with klp_shadow_get(), do the necessary actions, and then call
    klp_shadow_free(). This patch allows the same to be done in a more
    elegant way: the user can implement the needed actions in a callback
    that is passed to klp_shadow_free() as a parameter.

    The callback usually performs the reverse of the constructor callback
    that can be passed to klp_shadow_*alloc(). It is especially useful for
    klp_shadow_free_all(), where we need to perform these extra actions
    for each found shadow variable with the given ID.

    Note that the memory used by the shadow variable itself is still
    released later by an RCU callback; this is needed to protect the
    internal structures that keep track of all shadow variables. But the
    destructor is called immediately. The shadow variable must not be
    accessed anyway after klp_shadow_free() is called, and the user is
    responsible for protecting against this in any suitable way.

    Be aware that the destructor is called under klp_shadow_lock, the same
    as the constructor in klp_shadow_alloc().

    Signed-off-by: Petr Mladek <pmladek@suse.com>
    Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
    Acked-by: Miroslav Benes <mbenes@suse.cz>
    Signed-off-by: Jiri Kosina <jkosina@suse.cz>
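    A usage sketch (identifiers and exact signatures are assumptions
    based on the description above):

        static void sv_leak_dtor(void *obj, void *shadow_data)
        {
                void *leak = *(void **)shadow_data;

                kfree(leak);    /* release data the shadow variable points to */
        }

        /* The destructor runs immediately under klp_shadow_lock; the
         * shadow variable's own memory is freed later by an RCU callback. */
        klp_shadow_free(obj, SV_LEAK_ID, sv_leak_dtor);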
| * | livepatch: Initialize shadow variables safely by a custom callback  Petr Mladek  2018-04-17  1  -29/+53

    The existing API allows passing sample data to initialize the shadow
    data. That works well when the data are position-independent, but it
    fails miserably when we need to set a pointer to the shadow structure
    itself.

    Unfortunately, we might need to initialize such a pointer surprisingly
    often because of struct list_head. It is even worse because the list
    might be hidden inside other common structures, for example struct
    mutex or struct wait_queue_head. This was needed, for example, to fix
    races in the ALSA sequencer, which required adding a mutex to struct
    snd_seq_client; see commit b3defb791b26ea06 ("ALSA: seq: Make ioctls
    race-free") and commit d15d662e89fc667b9 ("ALSA: seq: Fix racy pool
    initializations").

    This patch makes the API safer. A custom constructor function and data
    are passed to the klp_shadow_*alloc() functions instead of the sample
    data. Note that ctor_data is no longer a template for shadow->data; it
    might point to any data the constructor needs when it is called.

    Also note that the constructor is called under klp_shadow_lock, an
    internal spinlock that synchronizes alloc() vs. get() operations; see
    klp_shadow_get_or_alloc(). On one hand, this adds a risk of ABBA
    deadlocks. On the other hand, it allows some operations to be done
    safely, for example adding the new structure to an existing list,
    which must be done only once, when the structure is allocated.

    Reported-by: Nicolai Stange <nstange@suse.de>
    Signed-off-by: Petr Mladek <pmladek@suse.com>
    Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
    Acked-by: Miroslav Benes <mbenes@suse.cz>
    Signed-off-by: Jiri Kosina <jkosina@suse.cz>
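    A sketch of position-dependent initialization with the new API
    (identifiers and exact signatures are assumptions based on the
    description above):

        static int sv_lock_ctor(void *obj, void *shadow_data, void *ctor_data)
        {
                struct mutex *m = shadow_data;

                mutex_init(m);  /* must be initialized in place */
                return 0;
        }

        /* The constructor runs under klp_shadow_lock, so alloc() vs.
         * get() races cannot observe a half-initialized mutex. */
        m = klp_shadow_get_or_alloc(obj, SV_LOCK_ID, sizeof(*m),
                                    GFP_KERNEL, sv_lock_ctor, NULL);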
* | | Merge remote-tracking branch 'y2038/y2038'  Stephen Rothwell  2018-04-19  5  -63/+93
|\ \ \
| * | | nanosleep: change time types to safe __kernel_* types  Deepa Dinamani  2018-04-17  3  -6/+10

    Change over the clock_nanosleep syscalls to use y2038-safe
    __kernel_timespec times. This will enable switching these syscalls
    over to new y2038-safe syscalls when an architecture defines
    CONFIG_64BIT_TIME.

    Note that the nanosleep syscall is deprecated and there is no plan to
    make it y2038 safe. But the syscall works as before on 64-bit
    machines, and on 32-bit machines it works correctly until y2038, as
    before, using the existing compat syscall version. There is no new
    syscall for supporting 64-bit time_t on 32-bit architectures.

    Cc: linux-api@vger.kernel.org
    Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
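    For reference, the y2038-safe type uses a 64-bit tv_sec on all
    architectures (as defined in include/uapi/linux/time.h):

        struct __kernel_timespec {
                __kernel_time64_t  tv_sec;   /* seconds */
                long long          tv_nsec;  /* nanoseconds */
        };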
| * | | change time types to new y2038 safe __kernel_* types  Deepa Dinamani  2018-04-17  2  -6/+6

    Change over the clock_settime, clock_gettime and clock_getres syscalls
    to use __kernel_timespec times. This will enable switching these
    syscalls over to new y2038-safe syscalls when an architecture defines
    CONFIG_64BIT_TIME.

    Cc: linux-api@vger.kernel.org
    Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
| * | | fix get_timespec64() for y2038 safe compat interfaces  Deepa Dinamani  2018-04-17  1  -4/+10

    The get/put_timespec64() interfaces will eventually be used for
    conversions between the new y2038-safe struct __kernel_timespec and
    struct timespec64. The new y2038-safe syscalls have a common entry for
    native and compat interfaces. On compat interfaces, the high-order
    bits of nanoseconds should be zeroed out, because application code and
    libc do not guarantee zeroing of these. If used without zeroing, the
    kernel is at risk of using timespec values incorrectly.

    Note that the clearing of bits is dependent on CONFIG_64BIT_TIME for
    now, until COMPAT_USE_64BIT_TIME has been handled correctly. x86 will
    be the first architecture to use CONFIG_64BIT_TIME.

    Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
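    A sketch of the fix (close to, but not necessarily identical to, the
    patch):

        int get_timespec64(struct timespec64 *ts,
                           const struct __kernel_timespec __user *uts)
        {
                struct __kernel_timespec kts;

                if (copy_from_user(&kts, uts, sizeof(kts)))
                        return -EFAULT;

                ts->tv_sec = kts.tv_sec;
                /* Zero out the nanoseconds padding for 32-bit callers,
                 * since user space and libc do not guarantee it. */
                if (IS_ENABLED(CONFIG_64BIT_TIME) &&
                    (!IS_ENABLED(CONFIG_64BIT) || in_compat_syscall()))
                        kts.tv_nsec &= 0xFFFFFFFFUL;
                ts->tv_nsec = kts.tv_nsec;
                return 0;
        }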
| * | | posix-clocks: Make compat syscalls depend on CONFIG_COMPAT_32BIT_TIME  Deepa Dinamani  2018-04-17  3  -3/+15

    The clock_gettime, clock_settime, clock_getres and clock_nanosleep
    compat syscalls are also repurposed to provide backward compatibility
    for 32-bit time_t on 32-bit systems.

    Note that the nanosleep compat syscall is treated the same way as the
    above syscalls, as it shares common handler functions with
    clock_nanosleep. But there is no plan to provide a y2038-safe solution
    for nanosleep.

    Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
| * | | compat: enable compat_get/put_timespec64 always  Deepa Dinamani  2018-04-17  2  -44/+52

    These functions are used in the repurposed compat syscalls to provide
    backward compatibility for using 32-bit time_t on 32-bit systems.

    Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
* | | | Merge remote-tracking branch 'rcu/rcu/next'  Stephen Rothwell  2018-04-19  14  -222/+302
|\ \ \ \
| * | | | rcu: Exclude near-simultaneous RCU CPU stall warnings  Paul E. McKenney  2018-04-16  1  -4/+11

    There is a two-jiffy delay between the time that a CPU will
    self-report an RCU CPU stall warning and the time that some other CPU
    will report a warning on behalf of the first CPU. This has worked well
    in the past, but on busy systems it is possible for the two warnings
    to overlap, which makes interpreting them extremely difficult. This
    commit therefore uses a cmpxchg-based timing decision that allows only
    one report in a given one-minute period (assuming default
    stall-warning Kconfig parameters). This approach will of course fail
    if you are seeing minute-long vCPU preemption, but in that case the
    overlapping RCU CPU stall warnings are the least of your worries.

    Reported-by: Dmitry Vyukov <dvyukov@google.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
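    The idea, in sketch form (names are illustrative, not from the
    patch):

        static unsigned long stall_exclusion_end;   /* in jiffies */

        /* Only one CPU may win the cmpxchg race to report within the
         * exclusion window. */
        static bool may_report_stall(unsigned long window)
        {
                unsigned long old = READ_ONCE(stall_exclusion_end);
                unsigned long now = jiffies;

                if (time_before(now, old))
                        return false;   /* someone reported recently */
                return cmpxchg(&stall_exclusion_end, old, now + window) == old;
        }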
| * | | | srcu: Add cleanup_srcu_struct_quiesced()  Paul E. McKenney  2018-04-16  3  -17/+29

    The current cleanup_srcu_struct() flushes work, which prevents it from
    being invoked from some workqueue contexts, as well as from atomic
    (non-blocking) contexts. This patch therefore introduces
    cleanup_srcu_struct_quiesced(), which can be invoked only after all
    activity on the specified srcu_struct has completed. This restriction
    allows cleanup_srcu_struct_quiesced() to be invoked from workqueue
    contexts as well as from atomic contexts.

    Suggested-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Tested-by: Nitzan Carmi <nitzanc@mellanox.com>
| * | | | rcu: Declare rcu_eqs_special_set() in public header  Yury Norov  2018-04-16  1  -1/+0

    Because rcu_eqs_special_set() is declared only in the internal header
    kernel/rcu/tree.h and stubbed in include/linux/rcutiny.h, it is
    inaccessible outside of the RCU implementation. This patch therefore
    moves the rcu_eqs_special_set() declaration to
    include/linux/rcutree.h, which allows it to be used in non-RCU kernel
    code.

    Signed-off-by: Yury Norov <ynorov@caviumnetworks.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| * | | | rcu: Update rcu_bind_gp_kthread() header comment  Paul E. McKenney  2018-04-16  1  -2/+1

    The header comment for rcu_bind_gp_kthread() refers to sysidle, which
    is no longer with us. However, it is still important to bind RCU's
    grace-period kthreads to the housekeeping CPU(s), so rather than
    remove rcu_bind_gp_kthread(), this commit updates the comment.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| * | | | rcu: Move __rcu_read_lock() and __rcu_read_unlock() to tree_plugin.h  Paul E. McKenney  2018-04-16  2  -48/+44

    The __rcu_read_lock() and __rcu_read_unlock() functions were moved to
    kernel/rcu/update.c in order to implement tiny preemptible RCU.
    However, tiny preemptible RCU was removed from the kernel a long time
    ago, so this commit belatedly moves them back into the only remaining
    preemptible-RCU code.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| * | | | rcu: Use the proper lockdep annotation in dump_blkd_tasks()  Boqun Feng  2018-04-16  1  -1/+1

    Sparse reported this:

    | kernel/rcu/tree_plugin.h:814:9: warning: incorrect type in argument 1 (different modifiers)
    | kernel/rcu/tree_plugin.h:814:9: expected struct lockdep_map const *lock
    | kernel/rcu/tree_plugin.h:814:9: got struct lockdep_map [noderef] *<noident>

    This is caused by using vanilla lockdep annotations on rcu_node::lock,
    which requires accessing ->lock of rcu_node directly. However, we need
    to keep rcu_node::lock __private to avoid breaking its extra ordering
    guarantee. Since we have a dedicated lockdep annotation for
    rcu_node::lock, use it.

    Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| * | | | rcu: exp: Protect all sync_rcu_preempt_exp_done() with rcu_node lock  Boqun Feng  2018-04-16  1  -3/+25

    Currently some callsites of sync_rcu_preempt_exp_done() are not called
    with the corresponding rcu_node's ->lock held, which could introduce
    bugs, as per Paul:

    o   CPU 0 in sync_rcu_preempt_exp_done() reads ->exp_tasks and sees
        that it is NULL.

    o   CPU 1 blocks within an RCU read-side critical section, so it
        enqueues the task and points ->exp_tasks at it and clears CPU 1's
        bit in ->expmask.

    o   All other CPUs clear their bits in ->expmask.

    o   CPU 0 reads ->expmask, sees that it is zero, and incorrectly
        concludes that all quiescent states have completed, despite the
        fact that ->exp_tasks is non-NULL.

    To fix this, sync_rcu_preempt_exp_unlocked() is introduced to replace
    the lockless callsites of sync_rcu_preempt_exp_done(). Further, a
    lockdep annotation is added to sync_rcu_preempt_exp_done() to prevent
    misuse in the future.

    Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| * | | | rcu: exp: Fix "must hold exp_mutex" comments for QS reporting functions  Boqun Feng  2018-04-16  1  -7/+3

    Since commit d9a3da0699b2 ("rcu: Add expedited grace-period support
    for preemptible RCU"), comments for some functions in
    rcu_report_exp_rnp()'s call chain have said that exp_mutex or its
    predecessors needs to be held. However, exp_mutex and its predecessors
    were used only to synchronize between GPs, and it is clear that all
    variables visited by those functions are under the protection of
    rcu_node's ->lock. Moreover, those functions are currently called
    without exp_mutex held, and that does not seem to cause any trouble.
    This patch therefore fixes the problem by updating the comments to
    match the current code.

    Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
    Fixes: d9a3da0699b2 ("rcu: Add expedited grace-period support for preemptible RCU")
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| * | | | softirq: Eliminate unused cond_resched_softirq() macro  Paul E. McKenney  2018-04-16  2  -16/+1

    The cond_resched_softirq() macro is not used anywhere in mainline, so
    this commit simplifies the kernel by eliminating it.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
| * | | | rcu: Rename cond_resched_rcu_qs() to cond_resched_tasks_rcu_qs()  Paul E. McKenney  2018-04-16  6  -17/+17

    Commit e31d28b6ab8f ("trace: Eliminate cond_resched_rcu_qs() in favor
    of cond_resched()") substituted cond_resched() for the earlier call to
    cond_resched_rcu_qs(). However, the new-age cond_resched() does not do
    anything to help RCU-tasks grace periods because (1) RCU-tasks is only
    enabled when CONFIG_PREEMPT=y and (2) cond_resched() is a complete
    no-op when preemption is enabled. This situation results in hangs when
    running the trace benchmarks.

    A number of potential fixes were discussed on LKML
    (https://lkml.kernel.org/r/20180224151240.0d63a059@vmware.local.home),
    including making cond_resched() not be a no-op; making cond_resched()
    not be a no-op, but only when running tracing benchmarks; reverting
    the aforementioned commit (which works because cond_resched_rcu_qs()
    does provide an RCU-tasks quiescent state); and adding a call to the
    scheduler/RCU rcu_note_voluntary_context_switch() function. All were
    deemed unsatisfactory, either due to added cond_resched() overhead or
    due to magic functions inviting cargo culting.

    This commit renames cond_resched_rcu_qs() to
    cond_resched_tasks_rcu_qs(), which provides a clear hint as to what
    this function is doing and why and where it should be used, and then
    replaces the call to cond_resched() with cond_resched_tasks_rcu_qs()
    in the trace benchmark's benchmark_event_kthread() function.

    Reported-by: Steven Rostedt <rostedt@goodmis.org>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| * | | | EXP rcu: Add checks for setting ->gp_flags  Paul E. McKenney  2018-04-16  3  -1/+8

    ->gp_flags must not be set unless the current rcu_node structure is
    waiting on at least one quiescent state.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| * | | | rcu: Remove deprecated RCU debugfs tracing code  Byungchul Park  2018-04-16  2  -12/+5

    Commit ae91aa0adb14 ("rcu: Remove debugfs tracing") removed the RCU
    debugfs tracing code, but did not remove the no-longer-used
    ->exp_workdone{0,1,2,3} fields in the srcu_data structure. This commit
    therefore removes these fields along with the code that uselessly
    updates them.

    Signed-off-by: Byungchul Park <byungchul.park@lge.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| * | | | rcu: Call wake_nocb_leader_defer() with 'FORCE' when nocb_q_count is high  Byungchul Park  2018-04-16  1  -1/+1

    If an excessive number of callbacks have been queued, but the NOCB
    leader kthread's wakeup must be deferred, then we should wake up the
    leader unconditionally once it is safe to do so. This was handled
    correctly in commit fbce7497ee ("rcu: Parallelize and economize NOCB
    kthread wakeups"), but then commit 8be6e1b15c ("rcu: Use timer as
    backstop for NOCB deferred wakeups") passed RCU_NOCB_WAKE instead of
    the correct RCU_NOCB_WAKE_FORCE to wake_nocb_leader_defer(). As an
    interesting aside, RCU_NOCB_WAKE_FORCE is never passed to anything,
    which should have been taken as a hint. ;-)

    This commit therefore passes RCU_NOCB_WAKE_FORCE instead of
    RCU_NOCB_WAKE to wake_nocb_leader_defer() when a callback is queued
    onto a NOCB CPU that already has an excessive number of callbacks
    pending.

    Signed-off-by: Byungchul Park <byungchul.park@lge.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| * | | | rcu: Don't allocate rcu_nocb_mask if no one needs it  Paul E. McKenney  2018-04-16  1  -2/+2

    Commit 44c65ff2e3b0 ("rcu: Eliminate NOCBs CPU-state Kconfig options")
    made allocation of rcu_nocb_mask depend only on the rcu_nocbs=,
    nohz_full=, or isolcpus= kernel boot parameters. However, it failed to
    change the initial value of rcu_init_nohz()'s local variable
    need_rcu_nocb_mask to false, which can result in useless allocation of
    an all-zero rcu_nocb_mask. This commit therefore fixes this bug by
    changing the initial value of need_rcu_nocb_mask to false.

    While we are in the area, also correct the error message that is
    printed when someone specifies that can-never-exist CPUs should be
    NOCBs CPUs.

    Reported-by: Byungchul Park <byungchul.park@lge.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Acked-by: Byungchul Park <byungchul.park@lge.com>
| * | | | rcu: Parallelize expedited grace-period initialization  Paul E. McKenney  2018-04-16  4  -78/+120

    The latency of RCU expedited grace periods grows with increasing
    numbers of CPUs, eventually failing to be all that expedited. Much of
    the growth in latency is in the initialization phase, so this commit
    uses workqueues to carry out this initialization concurrently on a
    rcu_node-by-rcu_node basis. This change makes use of a new
    rcu_par_gp_wq because flushing a work item from another work item
    running from the same workqueue can result in deadlock.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| * | | | rcu: Inline rcu_preempt_do_callbacks() into its sole caller  Byungchul Park  2018-04-16  2  -11/+1

    The rcu_preempt_do_callbacks() function was introduced in commit
    09223371dea ("rcu: Use softirq to address performance regression"),
    where it was necessary to handle kernel builds both containing and not
    containing RCU-preempt. Since then, various changes (most notably
    f8b7fc6b51 ("rcu: use softirq instead of kthreads except when
    RCU_BOOST=y")) have resulted in this function being invoked only from
    rcu_kthread_do_work(), which is present only in kernels containing
    RCU-preempt, which in turn means that the rcu_preempt_do_callbacks()
    function is no longer needed. This commit therefore inlines
    rcu_preempt_do_callbacks() into its sole remaining caller and also
    removes the rcu_state_p and rcu_data_p indirection for added clarity.

    Signed-off-by: Byungchul Park <byungchul.park@lge.com>
    Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    [ paulmck: Remove the rcu_state_p and rcu_data_p indirection. ]
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| * | | | EXP rcu: Add ->qsmask to assertion  Paul E. McKenney  2018-04-16  1  -1/+1

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| * | | | EXP: rcu: Add debugging info to other assertion  Paul E. McKenney  2018-04-16  1  -1/+2

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| * | | | EXP: rcu: Add ->boost_tasks to assertion  Paul E. McKenney  2018-04-16  1  -1/+1

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| * | | | EXP: rcu: Abstract addition of debugging information to assertion  Paul E. McKenney  2018-04-16  3  -16/+32

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
| * | | | EXP: rcu: Add debugging info to assertion  Paul E. McKenney  2018-04-16  1  -1/+16
| |/ / /

    The WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp()) in
    rcu_gp_cleanup() triggers (inexplicably, of course) every so often.
    This commit therefore extracts more information.

    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* | | | Merge remote-tracking branch 'modules/modules-next'  Stephen Rothwell  2018-04-19  1  -3/+1
|\ \ \ \
| * | | | module: Allow to always show the status of modsign  Jia Zhang  2018-04-16  1  -2/+0

    The sig_enforce parameter can always be shown, to reflect the current
    status of signature enforcement. For the case of
    CONFIG_MODULE_SIG_FORCE=y, this modification changes nothing, since
    sig_enforce can only be enabled, and not disabled, even via the kernel
    cmdline.

    Signed-off-by: Jia Zhang <zhang.jia@linux.alibaba.com>
    [jeyu: reworded commit message to provide clarification]
    Signed-off-by: Jessica Yu <jeyu@kernel.org>
| * | | | module: Do not access sig_enforce directly  Jia Zhang  2018-04-16  1  -1/+1
| |/ / /

    Call is_module_sig_enforced() instead.

    Signed-off-by: Jia Zhang <zhang.jia@linux.alibaba.com>
    Signed-off-by: Jessica Yu <jeyu@kernel.org>
* | | | Merge remote-tracking branch 'net-next/master'  Stephen Rothwell  2018-04-19  1  -82/+50
|\ \ \ \
| * | | | xdp: transition into using xdp_frame for return API  Jesper Dangaard Brouer  2018-04-17  1  -3/+3

    Changing the xdp_return_frame() API to take a struct xdp_frame as its
    argument seems like a natural choice. But there are some subtle
    performance details here that need extra care, and the result is a
    deliberate choice.

    When de-referencing an xdp_frame on a remote CPU during DMA-TX
    completion, the cache line changes to the "Shared" state. Later, when
    the page is reused for RX, this xdp_frame cache line is written, which
    changes the state to "Modified".

    This situation already happens (naturally) for virtio_net, tun and
    cpumap, as the xdp_frame pointer is the queued object. In tun and
    cpumap, the ptr_ring is used for efficiently transferring cache lines
    (with pointers) between CPUs. Thus, the only option is to de-reference
    the xdp_frame.

    Only the ixgbe driver had an optimization by which it could avoid
    de-referencing the xdp_frame. The driver already has a TX-ring queue,
    which (in case of remote DMA-TX completion) has to be transferred
    between CPUs anyhow. In this data area, we stored a struct
    xdp_mem_info and a data pointer, which allowed us to avoid
    de-referencing the xdp_frame. To compensate, a prefetchw is used to
    tell the cache coherency protocol about our access pattern. My
    benchmarks show that this prefetchw is enough to compensate in the
    ixgbe driver.

    V7: Adjust for commit d9314c474d4f ("i40e: add support for XDP_REDIRECT")
    V8: Adjust for commit bd658dda4237 ("net/mlx5e: Separate dma base address
        and offset in dma_sync call")

    Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | bpf: cpumap convert to use generic xdp_frame  Jesper Dangaard Brouer  2018-04-17  1  -72/+28

    The generic xdp_frame format was inspired by cpumap's own internal
    xdp_pkt format. It is now time to convert cpumap over to the generic
    xdp_frame format. The cpumap needs one extra field, dev_rx.

    Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | xdp: introduce xdp_return_frame API and use in cpumap  Jesper Dangaard Brouer  2018-04-17  1  -24/+36

    Introduce an xdp_return_frame API, and convert cpumap over as the
    first user, given that it has a queued XDP frame structure to
    leverage.

    V3: Clean up and remove C99-style comments, pointed out by Alex Duyck.
    V6: Remove comment that id will be added later (req. by Alex Duyck)
    V8: Rename enum mem_type to xdp_mem_type (found by kbuild test robot)

    Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
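    A usage sketch of the API pair, in the form it takes after the
    "transition into using xdp_frame" patch above (the
    convert_to_xdp_frame() helper is assumed from this patch series):

        struct xdp_frame *xdpf;

        /* Convert the in-flight xdp_buff into the generic frame
         * representation before enqueueing it ... */
        xdpf = convert_to_xdp_frame(&xdp);
        if (unlikely(!xdpf))
                goto drop;

        /* ... and release it through the matching return API when the
         * frame's lifetime ends. */
        xdp_return_frame(xdpf);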
* | | | | Merge remote-tracking branch 'fuse/for-next'  Stephen Rothwell  2018-04-19  1  -0/+1
|\ \ \ \ \
| * | | | | fuse: Restrict allow_other to the superblock's namespace or a descendant  Seth Forshee  2018-03-20  1  -0/+1

    Unprivileged users are normally restricted from mounting with the
    allow_other option by system policy, but this could be bypassed for a
    mount done with user-namespace root permissions. In such cases
    allow_other should not allow users outside the userns to access the
    mount, as doing so would give the unprivileged user the ability to
    manipulate processes it would otherwise be unable to manipulate.

    Restrict allow_other to apply only to users in the userns used at
    mount time or in a descendant of that namespace. Also export
    current_in_userns() for use by fuse when built as a module.

    Reviewed-by: Serge Hallyn <serge@hallyn.com>
    Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
    Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
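    The core of the check, in sketch form (the fuse-side identifiers are
    assumptions based on the description above):

        /* Only allow access by users inside the mounting userns or a
         * descendant of it. */
        if (fc->allow_other)
                return current_in_userns(fc->user_ns);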
* | | | | | Merge remote-tracking branch 'btrfs/next'  Stephen Rothwell  2018-04-19  1  -0/+1
|\ \ \ \ \ \
| * | | | | | cgroup2: export symbol init_css_set  Chris Mason  2017-12-17  1  -0/+1

    Btrfs needs this exported to check for IO controls in place.

    Signed-off-by: Chris Mason <clm@fb.com>