Commit message | Author | Age | Files | Lines
* Add linux-next specific files for 20180612 (tag: next-20180612) | Stephen Rothwell | 2018-06-12 | 5 files | -0/+7075
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
* Merge branch 'akpm/master' | Stephen Rothwell | 2018-06-12 | 37 files | -115/+329
|\
| * sparc64: NG4 memset 32 bits overflow | Pavel Tatashin | 2018-06-12 | 1 file | -13/+13
    Early in boot, Linux patches memset and memcpy to branch to platform-optimized versions of these routines. The NG4 (Niagara 4) versions are currently used on all platforms starting from T4. Recently, M7-optimized routines were added into UEK4, but not into mainline yet, so even with the M7 routines present, the NG4 versions are still used on T4, T5, M5, and M6 processors.

    While investigating how to improve the initialization time of dentry_hashtable, which is 8G long on an M6 ldom with 7T of main memory, I noticed that memset() does not reset all of the memory in this array. After studying the code, I realized that the NG4memset() branches use the %icc condition codes instead of %xcc for their compares, so if the length does not fit in 32 bits, which is true for an 8G array, these routines fail to work properly. The fix is to replace all %icc with %xcc in these routines. (An alternative is to use %ncc, but that would be misleading, as the code already contains sparcv9-only instructions and cannot be compiled for 32-bit.)

    It is important to fix this bug because even an older T4-4 can have 2T of memory, and the kernel has memory-proportional data structures that can grow beyond 4G in size. The failure of memset() is silent and the resulting corruption is hard to detect.

    Link: http://lkml.kernel.org/r/1488432825-92126-2-git-send-email-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
    Reviewed-by: Babu Moger <babu.moger@oracle.com>
    Cc: Babu Moger <babu.moger@amd.com>
    Cc: David Miller <davem@davemloft.net>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
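    A minimal userspace sketch of the failure mode (illustrative, not kernel code): comparing an 8G length under 32-bit condition codes is equivalent to comparing only its low 32 bits, which for 8G are all zero.

        #include <stdio.h>

        int main(void)
        {
                unsigned long len = 8UL << 30;          /* 8G, as for dentry_hashtable */
                unsigned int len32 = (unsigned int)len; /* what a %icc compare effectively sees */

                /* The low 32 bits of 8G are zero, so a 32-bit test concludes
                 * there is nothing left to clear and the routine exits early. */
                printf("len = %lu, low 32 bits = %u\n", len, len32);
                return 0;
        }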
| * drivers/media/platform/sti/delta/delta-ipc.c: fix read buffer overflow | Andi Kleen | 2018-06-12 | 1 file | -2/+2
    The single caller passes a string to delta_ipc_open(), which copies it with a fixed size larger than the string, so it reads some random data past the end of the original string in the read-only segment. If the string sits at the end of a page, this may fault. Just copy the string with a normal strcpy() after clearing the field.

    Found by an LTO build (which errors out) because the compiler inlines the functions, can resolve the string sizes, and triggers the compile-time checks in memcpy():

        In function `memcpy',
            inlined from `delta_ipc_open.constprop' at linux/drivers/media/platform/sti/delta/delta-ipc.c:178:0,
            inlined from `delta_mjpeg_ipc_open' at linux/drivers/media/platform/sti/delta/delta-mjpeg-dec.c:227:0,
            inlined from `delta_mjpeg_decode' at linux/drivers/media/platform/sti/delta/delta-mjpeg-dec.c:403:0:
        /home/andi/lsrc/linux/include/linux/string.h:337:0: error: call to `__read_overflow2' declared with attribute error: detected read beyond size of object passed as 2nd parameter
            __read_overflow2();

    Link: http://lkml.kernel.org/r/20171222001212.1850-1-andi@firstfloor.org
    Signed-off-by: Andi Kleen <ak@linux.intel.com>
    Cc: Hugues FRUCHET <hugues.fruchet@st.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
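    A sketch of the fix pattern (field and variable names are illustrative, not the exact driver code), assuming the source string is known to fit in the destination field:

        /* before: reads sizeof(msg.name) bytes from a shorter string */
        memcpy(msg.name, name, sizeof(msg.name));

        /* after: copy only the string itself, leave the rest of the field zeroed */
        memset(msg.name, 0, sizeof(msg.name));
        strcpy(msg.name, name);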
| * kernel/kexec_file.c: load kernel at top of system RAM if required | Baoquan He | 2018-06-12 | 1 file | -0/+2
    For kexec_file loading, if kexec_buf.top_down is 'true', the memory used to load the kernel/initrd/purgatory is supposed to be allocated from top to bottom. This is also consistent with the old kexec loading interface. However, the current arch_kexec_walk_mem() does not behave this way: it ignores kexec_buf.top_down and calls walk_system_ram_res() directly to go through all System RAM resources, trying to find a memory region which can contain the specific kexec buffer, and then calls locate_mem_hole_callback() to allocate memory in that region from top to bottom. This is not right.

    Add a check of kexec_buf.top_down in arch_kexec_walk_mem(); if it is 'true', call the newly added walk_system_ram_res_rev() so that memory regions are searched from top to bottom when loading the kernel.

    The underlying problem is that kexec file loading currently behaves differently from the old kexec loading. With kexec loading, the user-space kexec-tools utility does most of the job: it searches /proc/iomem from top to bottom for a memory region in which to load the kernel, so x86_64 bzImage kernels are always loaded at the top of System RAM. Kexec file loading, however, searches for an available memory region in the iomem resources from bottom to top, and then allocates from top to bottom within that region. E.g. on my test system with 2G of memory (map below), kexec loading puts the kernel near 0x000000013fffffff, while kexec file loading puts it near 0x000000003ffddfff. No bug has been reported yet; we just need to keep this in mind when working on kexec's interaction with other kernel components such as kaslr and hotplug, and on consistency between the different kexec interfaces.

        [Mar23 15:13] Linux version 4.16.0-rc3+ (bhe@localhost.localdomain) (gcc version
        [ +0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.16.0-rc3+ root=UUID=be8f8e3a-9
        [ +0.000000] x86/fpu: x87 FPU will use FXSAVE
        [ +0.000000] e820: BIOS-provided physical RAM map:
        [ +0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
        [ +0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
        [ +0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
        [ +0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000003ffddfff] usable
        [ +0.000000] BIOS-e820: [mem 0x000000003ffde000-0x000000003fffffff] reserved
        [ +0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
        [ +0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
        [ +0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000013fffffff] usable

    I searched the internet and found the original patches that added bzImage64 support to the old kexec loading; they live in the user-space kexec-tools utility, were written by Yinghai, and were reviewed by Vivek and hpa. Still, I did not find out why the kernel has to be put at the top of System RAM. My guess is that low memory is mostly reserved for system usage, so putting the kexec kernel at the top is safer and avoids having to exclude regions reserved by the system or firmware which we may not be able to enumerate completely, but I am not sure about this.

    Link: http://lkml.kernel.org/r/20180322033722.9279-3-bhe@redhat.com
    Signed-off-by: Baoquan He <bhe@redhat.com>
    Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
    Cc: Dave Young <dyoung@redhat.com>
    Cc: Eric W. Biederman <ebiederm@xmission.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Philipp Rudo <prudo@linux.vnet.ibm.com>
    Cc: Vivek Goyal <vgoyal@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
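    A sketch of the described check (an outline only, assuming the walk_system_ram_res_rev() signature from the companion patch; not the verbatim diff):

        int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
                                       int (*func)(struct resource *, void *))
        {
                if (kbuf->image->type == KEXEC_TYPE_CRASH)
                        return walk_iomem_res_desc(crashk_res.desc,
                                                   IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
                                                   crashk_res.start, crashk_res.end,
                                                   kbuf, func);
                else if (kbuf->top_down)
                        /* search System RAM from the highest region down */
                        return walk_system_ram_res_rev(0, ULONG_MAX, kbuf, func);
                else
                        return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
        }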
| * kernel/kexec_file.c: add walk_system_ram_res_rev() | AKASHI Takahiro | 2018-06-12 | 2 files | -0/+66
    This function, being a variant of walk_system_ram_res() introduced in commit 8c86e70acead ("resource: provide new functions to walk through resources"), walks through a list of all the resources of System RAM in reversed order, i.e., from higher to lower. It will be used in kexec_file code.

    Link: http://lkml.kernel.org/r/20180322033722.9279-2-bhe@redhat.com
    Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
    Signed-off-by: Baoquan He <bhe@redhat.com>
    Cc: Eric W. Biederman <ebiederm@xmission.com>
    Cc: Vivek Goyal <vgoyal@redhat.com>
    Cc: Dave Young <dyoung@redhat.com>
    Cc: Philipp Rudo <prudo@linux.vnet.ibm.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * lib/test_printf.c: call wait_for_random_bytes() before plain %p tests | Thierry Escande | 2018-06-12 | 1 file | -0/+7
    If the test_printf module is loaded before the crng is initialized, the plain 'p' tests will fail because the printed address will not be hashed and the buffer will contain '(ptrval)' instead.

    This patch adds a call to wait_for_random_bytes() before the plain 'p' tests to make sure the crng is initialized.

    Link: http://lkml.kernel.org/r/20180604113708.11554-1-thierry.escande@linaro.org
    Signed-off-by: Thierry Escande <thierry.escande@linaro.org>
    Acked-by: Tobin C. Harding <me@tobin.cc>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Cc: David Miller <davem@davemloft.net>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
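    A sketch of the change (assuming the plain 'p' tests live in a helper shaped like this; wait_for_random_bytes() returns 0 once the crng is seeded, or an error if interrupted):

        static void __init plain(void)
        {
                int err;

                /* Plain %p prints a hashed address; hashing needs the crng
                 * to be seeded, otherwise "(ptrval)" is printed instead. */
                err = wait_for_random_bytes();
                if (err) {
                        pr_warn("crng not ready, skipping plain 'p' tests\n");
                        return;
                }
                /* ... run the plain 'p' tests ... */
        }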
| * selftests: cgroup: add test for memory.low corner cases | Roman Gushchin | 2018-06-12 | 1 file | -0/+115
    Add a more complicated test for memory.low hierarchical behavior. It creates the following hierarchy (a chain from A down to K, with F' a sibling of F under E):

        A       memory.low = 50M
        B       memory.low = 50M
        C       memory.low = 50M
        D       memory.low = 50M, memory.(swap.)max = 200M
        E       memory.low = 50M
       / \
      F'  F     memory.low = 50M
          G     memory.low = 50M
          H     memory.low = 50M
          I     memory.low = 50M
          J     memory.low = 50M
          K     memory.low = 50M, memory.usage = 50M

    First, it creates local memory pressure by charging pagecache to F' and checks that K's memory.low actually works (usage ~= 50M). Then it sets C's memory.low to 0 and repeats the test to check that K's memory.low is no longer in effect, even though memory.low was disabled above the cgroup where the memory pressure originates (D).

    Link: http://lkml.kernel.org/r/20180522133106.24306-1-guro@fb.com
    Signed-off-by: Roman Gushchin <guro@fb.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Shuah Khan <shuah@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * mm, memcg: don't skip memory guarantee calculations | Roman Gushchin | 2018-06-12 | 1 file | -7/+8
    There are two cases when the effective memory guarantee calculation is mistakenly skipped:

    1) If the memcg is a child of the root cgroup, and the root cgroup is not root_mem_cgroup (in other words, if the reclaim is targeted). Top-level memory cgroups are handled specially in mem_cgroup_protected(), because the root memory cgroup doesn't have a memory guarantee and can't limit its children's guarantees, so all effective guarantee calculation is skipped. But in the targeted reclaim case things are different: cgroups whose parent exceeded its memory limit aren't special.

    2) If the memcg has no charged memory (memory usage is 0). In this case mem_cgroup_protected() always returns MEMCG_PROT_NONE, which is correct and prevents generating fake memory.low events for empty cgroups. But skipping the effective emin/elow calculation is wrong: if there is no global memory pressure, there might be no later chance to recalculate them, so we can end up with effective guarantees set to 0 for no good reason.

    Link: http://lkml.kernel.org/r/20180522132528.23769-2-guro@fb.com
    Signed-off-by: Roman Gushchin <guro@fb.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Shuah Khan <shuah@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * mm, memcg: propagate memory effective protection on setting memory.min/low | Roman Gushchin | 2018-06-12 | 1 file | -2/+12
    Explicitly propagate effective memory min/low values down the tree. If there is global memory pressure, this is not strictly necessary: effective memory guarantees are propagated automatically as we traverse the memory cgroup tree in the reclaim path. But if there is no global memory pressure, effective memory protection still matters for local (memcg-scoped) memory pressure, so we have to update the effective limits in the subtree whenever a user changes the memory.min and memory.low values.

    Link: http://lkml.kernel.org/r/20180522132528.23769-1-guro@fb.com
    Signed-off-by: Roman Gushchin <guro@fb.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Shuah Khan <shuah@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * hexagon: drop the unused variable zero_page_mask | Anshuman Khandual | 2018-06-12 | 2 files | -4/+0
    The hexagon arch does not seem to have subscribed to the __HAVE_COLOR_ZERO_PAGE framework, hence the zero_page_mask variable is not needed.

    Link: http://lkml.kernel.org/r/20180517061105.30447-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * hexagon: fix printk format warning in setup.c | Randy Dunlap | 2018-06-12 | 1 file | -1/+1
    Fix a printk format warning in hexagon/kernel/setup.c:

        ../arch/hexagon/kernel/setup.c: In function 'setup_arch':
        ../arch/hexagon/kernel/setup.c:69:2: warning: format '%x' expects argument of type 'unsigned int', but argument 2 has type 'long unsigned int' [-Wformat]

    where:

        extern unsigned long __phys_offset;
        #define PHYS_OFFSET __phys_offset

    Link: http://lkml.kernel.org/r/adce8db5-4b01-dc10-7fbb-6a64e0787eb5@infradead.org
    Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
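    The usual fix for this class of warning is to match the format specifier to the type (a sketch, not the verbatim one-liner):

        /* PHYS_OFFSET expands to an unsigned long, so use a long specifier */
        printk(KERN_INFO "PHYS_OFFSET = 0x%08lx\n", PHYS_OFFSET);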
| * mm-fix-oom_kill-event-handling-fix | Andrew Morton | 2018-06-12 | 1 file | -1/+1
    fix it for droppage of memcg-replace-mm-owner-with-mm-memcg.patch

    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * mm: fix oom_kill event handling | Roman Gushchin | 2018-06-12 | 3 files | -7/+27
    Commit e27be240df53 ("mm: memcg: make sure memory.events is uptodate when waking pollers") converted most of the memcg event counters to per-memcg atomics, which made them less confusing for a user. The "oom_kill" counter remained untouched, so it now behaves differently from the other counters (including "oom"), which adds nothing but confusion.

    Let's fix this by adding a MEMCG_OOM_KILL event and following the MEMCG_OOM approach. This also removes a hack from count_memcg_event_mm(), introduced earlier specifically for the OOM_KILL counter.

    Link: http://lkml.kernel.org/r/20180508124637.29984-1-guro@fb.com
    Signed-off-by: Roman Gushchin <guro@fb.com>
    Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
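    A sketch of the shape of the change (assuming an mm-based event helper as described above; not the verbatim patch):

        enum memcg_memory_event {
                MEMCG_LOW,
                MEMCG_HIGH,
                MEMCG_MAX,
                MEMCG_OOM,
                MEMCG_OOM_KILL,         /* new: accounted like the other events */
                MEMCG_NR_MEMORY_EVENTS,
        };

        /* at the kill site, charge the event to the victim's memcg */
        memcg_memory_event_mm(victim_mm, MEMCG_OOM_KILL);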
| * treewide: use PHYS_ADDR_MAX to avoid type casting ULLONG_MAX | Stefan Agner | 2018-06-12 | 9 files | -13/+13
    With PHYS_ADDR_MAX there is now a type-safe variant for "all bits set". Make use of it. Patch created using a semantic patch as follows:

        // <smpl>
        @@
        typedef phys_addr_t;
        @@
        -(phys_addr_t)ULLONG_MAX
        +PHYS_ADDR_MAX
        // </smpl>

    Link: http://lkml.kernel.org/r/20180419214204.19322-1-stefan@agner.ch
    Signed-off-by: Stefan Agner <stefan@agner.ch>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
    Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
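    The effect on a call site looks like this (an illustrative example, not one of the nine files touched):

        /* before: relies on ULLONG_MAX happening to match phys_addr_t's width */
        phys_addr_t limit = (phys_addr_t)ULLONG_MAX;

        /* after: type-safe "all bits set" for phys_addr_t */
        phys_addr_t limit = PHYS_ADDR_MAX;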
| * mm: use octal not symbolic permissions | Joe Perches | 2018-06-12 | 15 files | -66/+63
    mm/*.c files use symbolic and octal styles for permissions. Using octal and not symbolic permissions is preferred by many as more readable: https://lkml.org/lkml/2016/8/2/1945

    Prefer the direct use of octal for permissions. Done using:

        $ scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace mm/*.c

    and some typing.

    Before:

        $ git grep -P -w "0[0-7]{3,3}" mm | wc -l
        44

    After:

        $ git grep -P -w "0[0-7]{3,3}" mm | wc -l
        86

    Miscellanea:

    o Whitespace neatening around these conversions.

    Link: http://lkml.kernel.org/r/2e032ef111eebcd4c5952bae86763b541d373469.1522102887.git.joe@perches.com
    Signed-off-by: Joe Perches <joe@perches.com>
    Acked-by: David Rientjes <rientjes@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
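    An illustrative conversion (the parameter name is made up):

        /* before: symbolic permission macros */
        module_param(some_param, int, S_IRUSR | S_IWUSR);

        /* after: the equivalent octal form, shorter and unambiguous */
        module_param(some_param, int, 0600);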
* Merge branch 'akpm-current/current' | Stephen Rothwell | 2018-06-12 | 69 files | -570/+1364
|\
| * ipc: use new return type vm_fault_t | Souptick Joarder | 2018-06-08 | 1 file | -1/+1
    Use the new return type vm_fault_t for the fault handler. For now, this just documents that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. See commit 1c8f422059ae ("mm: change return type to vm_fault_t").

    Link: http://lkml.kernel.org/r/20180425043413.GA21467@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
    Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Acked-by: Davidlohr Bueso <dbueso@suse.de>
    Cc: Manfred Spraul <manfred@colorfullife.com>
    Cc: Eric W. Biederman <ebiederm@xmission.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
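    The signature change looks like this (shown as a sketch against ipc/shm.c's fault handler):

        /* before: an int that actually carries VM_FAULT_* bits, not an errno */
        static int shm_fault(struct vm_fault *vmf);

        /* after: the return value is now self-documenting */
        static vm_fault_t shm_fault(struct vm_fault *vmf);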
| * sysvipc/sem: mitigate semnum index against spectre v1 | Davidlohr Bueso | 2018-06-08 | 1 file | -4/+14
    Both smatch and coverity are reporting potential issues with spectre variant 1 via the 'semnum' index within the sma->sems array, i.e.:

        ipc/sem.c:388 sem_lock() warn: potential spectre issue 'sma->sems'
        ipc/sem.c:641 perform_atomic_semop_slow() warn: potential spectre issue 'sma->sems'
        ipc/sem.c:721 perform_atomic_semop() warn: potential spectre issue 'sma->sems'

    Avoid any possible speculation by using array_index_nospec(), thus ensuring the semnum value is bounded to [0, sma->sem_nsems). With the exception of sem_lock(), all of these are slowpaths.

    Link: http://lkml.kernel.org/r/20180423171131.njs4rfm2yzyeg6do@linux-n805
    Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
    Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
    Cc: Manfred Spraul <manfred@colorfullife.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
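    The mitigation pattern (a sketch; the local variable name is illustrative):

        #include <linux/nospec.h>

        /* clamp the user-controlled index before the array access so the
         * CPU cannot speculatively read out of bounds */
        idx = array_index_nospec(semnum, sma->sem_nsems);
        curr = &sma->sems[idx];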
| * fault-injection: reorder config entries | Mikulas Patocka | 2018-06-08 | 1 file | -18/+18
    Reorder Kconfig entries, so that menuconfig displays proper indentation.

    Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1804251601160.30569@file01.intranet.prod.int.rdu2.redhat.com
    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Acked-by: Randy Dunlap <rdunlap@infradead.org>
    Tested-by: Randy Dunlap <rdunlap@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * arm: port KCOV to arm | Dmitry Vyukov | 2018-06-08 | 4 files | -1/+16
    KCOV is a code coverage collection facility used, in particular, by the syzkaller system call fuzzer. There is some interest in using syzkaller on arm devices, so port KCOV to arm.

    At the implementation level this merely declares that KCOV is supported and disables instrumentation of 3 special cases. The reasons for disabling them are commented in the code.

    Tested with qemu-system-arm/vexpress-a15.

    Link: http://lkml.kernel.org/r/20180511143248.112484-1-dvyukov@google.com
    Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
    Acked-by: Mark Rutland <mark.rutland@arm.com>
    Cc: Russell King <linux@armlinux.org.uk>
    Cc: Abbott Liu <liuwenliang@huawei.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Koguchi Takuo <takuo.koguchi.sw@hitachi.com>
    Cc: <syzkaller@googlegroups.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * sched/core / kcov: avoid kcov_area during task switch | Mark Rutland | 2018-06-08 | 4 files | -2/+20
    During a context switch, we first switch_mm() to the next task's mm, then switch_to() that new task. This means that vmalloc'd regions which had previously been faulted in can transiently disappear in the context of the prev task. Functions instrumented by KCOV may try to access a vmalloc'd kcov_area during this window, and as the fault handling code is instrumented, this results in a recursive fault.

    We must avoid accessing any kcov_area during this window. We can do so with a new flag in kcov_mode, set prior to switching the mm and cleared once the new task is live. Since task_struct::kcov_mode isn't always a specific enum kcov_mode value, this is made an unsigned int.

    The manipulation is hidden behind kcov_{prepare,finish}_switch() helpers, which are empty for !CONFIG_KCOV kernels. The code uses macros because I can't use static inline functions without a circular include dependency between <linux/sched.h> and <linux/kcov.h>, since the definition of task_struct uses things defined in <linux/kcov.h>.

    Link: http://lkml.kernel.org/r/20180504135535.53744-4-mark.rutland@arm.com
    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
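    A sketch of the helpers (flag and helper names as described above; empty stubs for !CONFIG_KCOV kernels):

        #ifdef CONFIG_KCOV
        /* set around switch_mm(): tells __sanitizer_cov_trace_pc() not to
         * touch kcov_area while the mm is in flux */
        #define kcov_prepare_switch(t) do { (t)->kcov_mode |= KCOV_IN_CTXSW; } while (0)
        #define kcov_finish_switch(t)  do { (t)->kcov_mode &= ~KCOV_IN_CTXSW; } while (0)
        #else
        #define kcov_prepare_switch(t) do { } while (0)
        #define kcov_finish_switch(t)  do { } while (0)
        #endif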
| * kcov-prefault-the-kcov_area-fix-fix-fix | Andrew Morton | 2018-06-08 | 1 file | -2/+3
    fancier code comment from Mark

    Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * kcov-prefault-the-kcov_area-fix-fix | Andrew Morton | 2018-06-08 | 1 file | -0/+4
    add comment explaining kcov_fault_in_area()

    Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * kcov-prefault-the-kcov_area-fix | Andrew Morton | 2018-06-08 | 1 file | -2/+1
    code cleanup

    Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * kcov: prefault the kcov_area | Mark Rutland | 2018-06-08 | 1 file | -0/+12
    On many architectures the vmalloc area is lazily faulted in upon first access. This is problematic for KCOV, as __sanitizer_cov_trace_pc accesses the (vmalloc'd) kcov_area, and fault handling code may be instrumented. If an access to kcov_area faults, this will result in mutual recursion through the fault handling code and __sanitizer_cov_trace_pc(), eventually leading to stack corruption and/or overflow.

    We can avoid this by faulting in the kcov_area before __sanitizer_cov_trace_pc() is permitted to access it. Once it has been faulted in, it will remain present in the process page tables, and will not fault again.

    Link: http://lkml.kernel.org/r/20180504135535.53744-3-mark.rutland@arm.com
    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
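    A sketch of the prefault (one read per page is enough to populate the page tables; not necessarily the verbatim helper, and kcov->size is assumed to count words, not bytes):

        static void kcov_fault_in_area(struct kcov *kcov)
        {
                unsigned long stride = PAGE_SIZE / sizeof(unsigned long);
                unsigned long *area = kcov->area;
                unsigned long offset;

                /* touch one word per page so every page of the vmalloc'd
                 * area is present before instrumented code can access it */
                for (offset = 0; offset < kcov->size; offset += stride)
                        READ_ONCE(area[offset]);
        }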
| * kcov: ensure irq code sees a valid area | Mark Rutland | 2018-06-08 | 1 file | -1/+2
    Patch series "kcov: fix unexpected faults".

    These patches fix a few issues where KCOV code could trigger recursive faults, discovered while debugging a patch enabling KCOV for arch/arm:

    * On CONFIG_PREEMPT kernels, there's a small race window where __sanitizer_cov_trace_pc() can see a bogus kcov_area.

    * Lazy faulting of the vmalloc area can cause mutual recursion between fault handling code and __sanitizer_cov_trace_pc().

    * During the context switch, switching the mm can cause the kcov_area to be transiently unmapped.

    These are prerequisites for enabling KCOV on arm, but the issues themselves are generic -- we just happen to avoid them by chance rather than design on x86-64 and arm64.

    This patch (of 3):

    For kernels built with CONFIG_PREEMPT, some C code may execute before or after the interrupt handler, while the hardirq count is zero. In these cases, in_task() can return true. A task can be interrupted in the middle of a KCOV_DISABLE ioctl while it resets the task's kcov data via kcov_task_init(). Instrumented code executed during this period will call __sanitizer_cov_trace_pc(), and as in_task() returns true, will inspect t->kcov_mode before trying to write to t->kcov_area.

    In kcov_task_init() we update t->kcov_{mode,area,size} with plain stores, which may be re-ordered, torn, etc. Thus __sanitizer_cov_trace_pc() may see bogus values for any of these fields and may attempt to write to memory which is not mapped.

    Let's avoid this by using WRITE_ONCE() to set t->kcov_mode, with a barrier() to ensure this is ordered before we clear t->kcov_{area,size}. This ensures that any code executing while kcov_task_init() is preempted will either see valid values for t->kcov_{area,size}, or will see that t->kcov_mode is KCOV_MODE_DISABLED and bail out without touching t->kcov_area.

    Link: http://lkml.kernel.org/r/20180504135535.53744-2-mark.rutland@arm.com
    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
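    A sketch of the ordering in kcov_task_init(), as described above (not the verbatim patch):

        void kcov_task_init(struct task_struct *t)
        {
                /* publish "disabled" first, so concurrently running
                 * instrumented code bails out ... */
                WRITE_ONCE(t->kcov_mode, KCOV_MODE_DISABLED);
                barrier();
                /* ... before the area/size it would have used go away */
                t->kcov_size = 0;
                t->kcov_area = NULL;
                t->kcov = NULL;
        }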
| * kernel/relay.c: change return type to vm_fault_t | Souptick Joarder | 2018-06-08 | 1 file | -1/+1
    Use the new return type vm_fault_t for the fault handler. For now, this just documents that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. See commit 1c8f422059ae ("mm: change return type to vm_fault_t").

    Link: http://lkml.kernel.org/r/20180510140335.GA25363@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
    Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Cc: Eric Biggers <ebiggers@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * bfs: add sanity check at bfs_fill_super() | Tetsuo Handa | 2018-06-08 | 1 file | -3/+6
    syzbot is reporting a too-large memory allocation at bfs_fill_super() [1]. Since the file system image is corrupted such that bfs_sb->s_start == 0, bfs_fill_super() tries to allocate 8MB of contiguous memory. Fix this by adding a sanity check on bfs_sb->s_start, and by using __GFP_NOWARN and printf().

    [1] https://syzkaller.appspot.com/bug?id=16a87c236b951351374a84c8a32f40edbc034e96

    Link: http://lkml.kernel.org/r/1525862104-3407-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Reported-by: syzbot <syzbot+71c6b5d68e91149fc8a4@syzkaller.appspotmail.com>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Cc: Tigran Aivazian <aivazian.tigran@gmail.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
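    A sketch of the kind of check added (bfs defines a local printf() macro that wraps printk(); the exact bound and error label are illustrative):

        /* reject images whose superblock fields are inconsistent before
         * sizing any allocation from them */
        if (le32_to_cpu(bfs_sb->s_start) > le32_to_cpu(bfs_sb->s_end)) {
                printf("Superblock is corrupted\n");
                goto out;
        }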
| * exofs-avoid-vla-in-structures-v2 | Kees Cook | 2018-06-08 | 2 files | -7/+7
    - use DRY for easier to read stripe updates (ndesaulniers)
    - add const to unchanging variable (ndesaulniers)
    - update references with message-id URLs
    - add Reviewed-by

    Link: http://lkml.kernel.org/r/20180418163546.GA45794@beast
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
    Cc: Boaz Harrosh <ooo@electrozaur.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * exofs: avoid VLA in structures | Kees Cook | 2018-06-08 | 3 files | -66/+114
    On the quest to remove all VLAs from the kernel[1], this adjusts several cases where allocation is made after an array of structures that points back into the allocation. The allocations are changed to perform explicit calculations instead of using a Variable Length Array in a structure.

    Additionally, this lets Clang compile this code now, since Clang does not support VLAIS[2].

    [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
    [2] https://lkml.kernel.org/r/CA+55aFy6h1c3_rP_bXFedsTXzwW+9Q9MfJaW7GUmMBrAp-fJ9A@mail.gmail.com

    Link: http://lkml.kernel.org/r/20180327203904.GA1151@beast
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
    Cc: Boaz Harrosh <ooo@electrozaur.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
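    A sketch of the general replacement pattern (struct and variable names are made up, not the exofs types): size the trailing array explicitly instead of declaring a variable-length array inside the struct.

        struct per_dev_state { /* ... per-device fields ... */ int dev; };

        struct io_state {
                int numdevs;
                struct per_dev_state per_dev[];   /* flexible array member */
        };

        /* explicit size calculation replaces the VLA-in-struct */
        struct io_state *ios = kzalloc(sizeof(*ios) +
                                       numdevs * sizeof(ios->per_dev[0]),
                                       GFP_KERNEL);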
| * coredump: fix spam with zero VMA process | Alexey Dobriyan | 2018-06-08 | 1 file | -8/+8
    Nobody ever tried to self-destruct by unmapping the whole address space at once:

        munmap((void *)0, (1ULL << 47) - 4096);

    Doing this produces 2 warnings for zero-length vmalloc allocations:

        a.out[1353]: segfault at 7f80bcc4b757 ip 00007f80bcc4b757 sp 00007fff683939b8 error 14
        a.out: vmalloc: allocation failure: 0 bytes, mode:0xcc0(GFP_KERNEL), nodemask=(null)
        ...
        a.out: vmalloc: allocation failure: 0 bytes, mode:0xcc0(GFP_KERNEL), nodemask=(null)
        ...

    The fix is to switch to kvmalloc().

    Steps to reproduce:

        // vsyscall=none
        #include <sys/mman.h>
        #include <sys/resource.h>

        int main(void)
        {
                setrlimit(RLIMIT_CORE, &(struct rlimit){RLIM_INFINITY, RLIM_INFINITY});
                munmap((void *)0, (1ULL << 47) - 4096);
                return 0;
        }

    Link: http://lkml.kernel.org/r/20180410180353.GA2515@avx2
    Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
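    A sketch of the switch (variable names illustrative): kvmalloc() serves small, including zero-length, requests from kmalloc and only falls back to vmalloc for large ones, so the zero-VMA case no longer hits the vmalloc warning.

        /* before */
        vma_meta = vmalloc(vma_count * sizeof(*vma_meta));
        /* ... */
        vfree(vma_meta);

        /* after */
        vma_meta = kvmalloc(vma_count * sizeof(*vma_meta), GFP_KERNEL);
        /* ... */
        kvfree(vma_meta);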
| * fat: use fat_fs_error() instead of BUG_ON() in __fat_get_block() | OGAWA Hirofumi | 2018-06-08 | 1 file | -1/+6
    If the file size and the FAT cluster chain do not match (corrupted image), we can hit BUG_ON(!phys) in __fat_get_block(). So, use fat_fs_error() instead.

    Link: http://lkml.kernel.org/r/874lilcu67.fsf@mail.parknet.co.jp
    Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
    Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
    Tested-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
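    A sketch of the described change in __fat_get_block() (the message text and fields are illustrative, not the verbatim patch):

        /* corrupted image: the file size says a block should exist here,
         * but the FAT cluster chain ended; report and fail rather than
         * BUG_ON() and crash the machine */
        if (!phys) {
                fat_fs_error(sb, "invalid FAT chain (i_pos %lld)",
                             MSDOS_I(inode)->i_pos);
                return -EIO;
        }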
| * checkpatch: add a --strict test for structs with bool member definitions | Joe Perches | 2018-06-08 | 1 file | -0/+7
    A struct with a bool member can have different sizes on various architectures because neither bool size nor alignment is standardized, so emit a message on the use of bool in structs, but only in .h files and not .c files.

    There is a real possibility that this test could produce a false positive when a bool is declared as an automatic, so limit the test to .h files, where the only false positives are for declarations in static inline functions.

    Link: http://lkml.kernel.org/r/95477c93db187bab6da8a8ba7c57836868446179.camel@perches.com
    Signed-off-by: Joe Perches <joe@perches.com>
    Suggested-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * proc-skip-branch-in-proc-lookup-fix | Andrew Morton | 2018-06-08 | 1 file | -1/+1
    proc_pident_instantiate() no longer takes the inode*

    Cc: Alexey Dobriyan <adobriyan@gmail.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * proc: skip branch in /proc/*/* lookup | Alexey Dobriyan | 2018-06-08 | 1 file | -6/+3
    The code is structured like this:

        for ( ... p < last; p++) {
                if (memcmp == 0)
                        break;
        }
        if (p >= last)
                ERROR
        OK

    gcc doesn't see that if the lookup succeeds, the post-loop branch will never be taken, so it can't skip it.

    Link: http://lkml.kernel.org/r/20180423213954.GD9043@avx2
    Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
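    A sketch of the restructured lookup (as described; not the verbatim diff): return from inside the loop on a match, so no post-loop re-test of p is needed.

        for (p = ents; p < last; p++) {
                if (p->len == dentry->d_name.len &&
                    !memcmp(dentry->d_name.name, p->name, p->len))
                        return proc_pident_instantiate(dentry, task, p);
        }
        return ERR_PTR(-ENOENT);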
| * mm/page_owner: align with pageblock_nr pages | zhong jiang | 2018-06-08 | 1 file | -1/+1
    When pfn_valid(pfn) returns false, pfn should be aligned to pageblock_nr_pages rather than MAX_ORDER_NR_PAGES in init_pages_in_zone(), because the skipped 2M may contain valid pfns; as a result, the early-allocated count will not be accurate.

    Link: http://lkml.kernel.org/r/1468938136-24228-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang <zhongjiang@huawei.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * mm/page_owner: align with pageblock_nr_pages | Marc-André Lureau | 2018-06-08 | 1 file | -1/+1
    Currently, init_pages_in_zone() walks the zone in pageblock_nr_pages steps. A MAX_ORDER_NR_PAGES range may contain holes when CONFIG_HOLES_IN_ZONE is set, and MAX_ORDER_NR_PAGES is likely to differ from pageblock_nr_pages. If we skip by the size of MAX_ORDER_NR_PAGES, the second 2M of memory can be leaked from the accounting. Meanwhile, the change makes the code consistent, because the entire function is based on pageblock_nr_pages steps.

    Link: http://lkml.kernel.org/r/1512395284-13588-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang <zhongjiang@huawei.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
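    A sketch of the alignment change (assuming the skip is done with the kernel's ALIGN() macro; one line in init_pages_in_zone()):

        /* before: pfn = ALIGN(pfn + 1, MAX_ORDER_NR_PAGES); */
        pfn = ALIGN(pfn + 1, pageblock_nr_pages);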
| * mm: don't expose page to fast gup before it's ready | Yu Zhao | 2018-06-08 | 2 files | -3/+3
    We don't want to expose a page before it's properly set up. During page setup we may call page_add_new_anon_rmap(), which uses a non-atomic bit op. If the page is exposed before setup is done, we could overwrite page flags that are set by get_user_pages_fast() or its callers. Here is a non-fatal scenario (there might be other, fatal problems that I didn't look into):

        CPU 1                           CPU 2
        set_pte_at()                    get_user_pages_fast()
          page_add_new_anon_rmap()        gup_pte_range()
            __SetPageSwapBacked()           SetPageReferenced()

    Fix the problem by delaying set_pte_at() until the page is ready.

    I didn't observe the race directly, but I did get a few crashes when trying to access the mem_cgroup of pages returned by get_user_pages_fast(). Those pages were charged and they showed a valid mem_cgroup in kdumps, which led me to think the problem came from the premature set_pte_at().

    I think the fact that nobody has complained about this problem is because the race only happens when using ksm+swap, and it might not cause any fatal problem even then. Nevertheless, it's nice to have set_pte_at() done consistently after the rmap is added and the page is charged.

    Link: http://lkml.kernel.org/r/20180108225632.16332-1-yuzhao@google.com
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
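    A sketch of the described ordering (rmap/LRU calls per the 2018-era APIs; not the verbatim patch):

        /* finish all non-atomic page setup first ... */
        page_add_new_anon_rmap(page, vma, addr, false);
        mem_cgroup_commit_charge(page, memcg, false, false);
        lru_cache_add_active_or_unevictable(page, vma);
        /* ... and only then publish the PTE, so a racing
         * get_user_pages_fast() cannot see a half-initialized page */
        set_pte_at(mm, addr, ptep, mk_pte(page, vma->vm_page_prot));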
| * mm: add strictlimit knob | Maxim Patlasov | 2018-06-08 | 2 files | -0/+43
    The "strictlimit" feature was introduced to enforce per-bdi dirty limits for FUSE, which sets bdi max_ratio to 1% by default: http://article.gmane.org/gmane.linux.kernel.mm/105809

    However, the feature can be useful for other relatively slow or untrusted BDIs like USB flash drives and DVD+RW. The patch adds a knob to enable the feature:

        echo 1 > /sys/class/bdi/X:Y/strictlimit

    Once enabled, the feature enforces the bdi max_ratio limit even if the global (10%) dirty limit is not reached. Of course, the effect is not visible until /sys/class/bdi/X:Y/max_ratio is decreased to some reasonable value.

    Jan said:

    : In principle I have nothing against this and the usecase sounds reasonable
    : (in fact I believe the lack of a feature like this is one of reasons why
    : desktop automounters usually mount USB devices with 'sync' mount option).
    : So feel free to add:

    Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: "Artem S. Tashkinov" <t.artem@lycos.com>
    Cc: Mel Gorman <mel@csn.ul.ie>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * cgroup: list groupoom in cgroup features | Roman Gushchin | 2018-06-08 | 1 file | -1/+2
    List groupoom in the cgroup features list (exported via /sys/kernel/cgroup/features), which can be used by userspace apps (most likely systemd) to get an idea of which cgroup features are supported by the kernel.

    Link: http://lkml.kernel.org/r/20171130152824.1591-8-guro@fb.com
    Signed-off-by: Roman Gushchin <guro@fb.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * mm-oom-docs-describe-the-cgroup-aware-oom-killer-fix-2-fix | Andrew Morton | 2018-06-08 | 1 file | -5/+5
    tweak text, fix typo

    Cc: David Rientjes <rientjes@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * oom, memcg: clarify root memcg oom accounting | Michal Hocko | 2018-06-08 | 1 file | -2/+9
    David Rientjes has pointed out that the current way the root memcg is accounted for by the cgroup-aware OOM killer is undocumented. Unlike regular cgroups, there is no accounting going on in the root memcg (mostly for performance reasons), therefore we are summing up the oom_badness of its tasks. This might result in an over-accounting because of the oom_score_adj setting. Document this for now.

    Link: http://lkml.kernel.org/r/20180130122011.GB21609@dhcp22.suse.cz
    Signed-off-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Roman Gushchin <guro@fb.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * mm-oom-docs-describe-the-cgroup-aware-oom-killer-fix | Roman Gushchin | 2018-06-08 | 1 file | -0/+9
    Add a note that cgroup-aware OOM logic is disabled by default and describe how to enable it.

    Link: http://lkml.kernel.org/r/20171201170149.GB27436@castle.DHCP.thefacebook.com
    Signed-off-by: Roman Gushchin <guro@fb.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * mm, oom, docs: describe the cgroup-aware OOM killer | Roman Gushchin | 2018-06-08 | 1 file | -0/+58
    Document the cgroup-aware OOM killer.

    Link: http://lkml.kernel.org/r/20171130152824.1591-7-guro@fb.com
    Signed-off-by: Roman Gushchin <guro@fb.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer | Roman Gushchin | 2018-06-08 | 3 files | -0/+18
    Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware OOM killer. If not set, the OOM selection is performed in a "traditional" per-process way. The behavior can be changed dynamically by remounting the cgroupfs.

    Link: http://lkml.kernel.org/r/20171130152824.1591-6-guro@fb.com
    Signed-off-by: Roman Gushchin <guro@fb.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * mm, oom: return error on access to memory.oom_group if groupoom is disabled | Roman Gushchin | 2018-06-08 | 1 file | -0/+6
    The cgroup-aware OOM killer depends on a cgroup mount option and is turned off by default, even though the user interface (the memory.oom_group file) is always present. As this might be confusing to a user, let's return -ENOTSUPP on an attempt to access memory.oom_group if groupoom is not enabled globally.

    Example:

        $ cd /sys/fs/cgroup/user.slice/
        $ cat memory.oom_group
        cat: memory.oom_group: Unknown error 524
        $ echo 1 > memory.oom_group
        -bash: echo: write error: Unknown error 524
        $ mount -o remount,groupoom /sys/fs/cgroup
        $ echo 1 > memory.oom_group
        $ cat memory.oom_group
        1

    Link: http://lkml.kernel.org/r/20171201170004.GA27436@castle.DHCP.thefacebook.com
    Signed-off-by: Roman Gushchin <guro@fb.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Cc: David Rientjes <rientjes@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * mm, oom: introduce memory.oom_group | Roman Gushchin | 2018-06-08 | 3 files | -14/+127
    The cgroup-aware OOM killer treats leaf memory cgroups as memory consumption entities and performs victim selection by comparing them based on their memory footprint. Then it kills the biggest task inside the selected memory cgroup.

    But there are workloads which are not tolerant of such behavior: killing a random task may leave the workload in a broken state.

    To solve this problem, the memory.oom_group knob is introduced. It defines whether a memory cgroup should be treated as an indivisible memory consumer, compared by total memory consumption with other memory consumers (leaf memory cgroups and other memory cgroups with memory.oom_group set), and whether all belonging tasks should be killed if the cgroup is selected.

    If set on memcg A, it means that in case of a system-wide OOM, or a memcg-wide OOM scoped to A or any ancestor cgroup, all tasks belonging to the sub-tree of A will be killed. If an OOM event is scoped to a descendant cgroup (A/B, for example), only tasks in that cgroup can be affected. The OOM killer will never touch any tasks outside the scope of the OOM event.

    Also, tasks with oom_score_adj set to -1000 will not be killed, because this has been a long-established way to protect a particular process from seeing an unexpected SIGKILL from the OOM killer. Ignoring this user-defined configuration might lead to data corruption or other misbehavior.

    The default value is 0.

    Link: http://lkml.kernel.org/r/20171130152824.1591-5-guro@fb.com
    Signed-off-by: Roman Gushchin <guro@fb.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * mm-oom-cgroup-aware-oom-killer-fix-2 | Andrew Morton | 2018-06-08 | 1 file | -1/+1
    attempt to repair mm-oom-cgroup-aware-oom-killer.patch for memcg-replace-mm-owner-with-mm-memcg.patch

    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Roman Gushchin <guro@fb.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
| * mm-oom-cgroup-aware-oom-killer-fix | Andrew Morton | 2018-06-08 | 1 file | -6/+4
    two tweaks arising from merge fixups, per Roman

    Cc: David Rientjes <rientjes@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>