* printk: get rid of [sched_delayed] message for printk_deferred (Markus Trippelsdorf, 2014-10-14; 1 file, -6/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 458df9fd4815 ("printk: remove separate printk_sched buffers and use printk buf instead") hardcodes printk_deferred() to KERN_WARNING and inserts the string "[sched_delayed] " before the actual message. However it doesn't take into account the KERN_* prefix of the message, that now ends up in the middle of the output: [sched_delayed] ^a4CE: hpet increased min_delta_ns to 20115 nsec Fix this by just getting rid of the "[sched_delayed] " scnprintf(). The prefix is useless since 458df9fd4815 anyway since from that moment printk_deferred() inserts the message into the kernel printk buffer immediately. So if the message eventually gets printed to console, it is printed in the correct order with other messages and there's no need for any special prefix. And if the kernel crashes before the message makes it to console, then prefix in the printk buffer doesn't make the situation any better. Link: http://lkml.org/lkml/2014/9/14/4 Signed-off-by: Markus Trippelsdorf <markus@trippelsdorf.de> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* printk: don't bother using LOG_CPU_MAX_BUF_SHIFT on !SMP (Geert Uytterhoeven, 2014-10-14; 2 files, -1/+7)
| | | | | | | | | | | When configuring a uniprocessor kernel, don't bother the user with an irrelevant LOG_CPU_MAX_BUF_SHIFT question, and don't build the unused code. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Luis R. Rodriguez <mcgrof@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* drivers: dma-contiguous: add initialization from device tree (Marek Szyprowski, 2014-10-14; 3 files, -11/+120)
| | | | | | | | | | | | | | | | | | | Add a function to create CMA region from previously reserved memory and add support for handling 'shared-dma-pool' reserved-memory device tree nodes. Based on previous code provided by Josh Cartwright <joshc@codeaurora.org> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Josh Cartwright <joshc@codeaurora.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* drivers: dma-coherent: add initialization from device tree (Marek Szyprowski, 2014-10-14; 1 file, -22/+129)
The initialization procedure of the dma coherent pool has been split into two parts, so a memory pool can now be initialized without being assigned to a particular struct device. An initialized region can then be assigned to more than one struct device; the shared structure is protected against concurrent allocations.

The last part of this patch adds support for handling 'shared-dma-pool' reserved-memory device tree nodes.

[akpm@linux-foundation.org: use more appropriate printk facility levels]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Josh Cartwright <joshc@codeaurora.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kernel: add support for gcc 5 (Sasha Levin, 2014-10-14; 1 file, -0/+66)
| | | | | | | | | | | | | | | We're missing include/linux/compiler-gcc5.h which is required now because gcc branched off to v5 in trunk. Just copy the relevant bits out of include/linux/compiler-gcc4.h, no new code is added as of now. This fixes a build error when using gcc 5. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/cma: fix cma bitmap aligned mask computing (Weijie Yang, 2014-10-14; 1 file, -1/+3)
The current cma bitmap aligned mask computation is incorrect. It could cause an unexpected alignment when using cma_alloc() if the wanted align order is larger than cma->order_per_bit.

Take kvm for example (PAGE_SHIFT = 12): kvm_cma->order_per_bit is set to 6. When kvm_alloc_rma() tries to alloc kvm_rma_pages, it will use 15 as the expected align value. With the current implementation, however, we get 0 as the cma bitmap aligned mask rather than 511.

This patch fixes the cma bitmap aligned mask calculation.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> [3.17]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
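For illustration, a minimal stand-alone sketch of the corrected mask calculation; the helper name and field mirror mm/cma.c, but this is an illustrative user-space reconstruction rather than the literal patch:

    #include <stdio.h>

    struct cma { unsigned int order_per_bit; };

    /* Corrected helper: the mask only covers the alignment bits that
     * exceed the bitmap granularity (order_per_bit). */
    static unsigned long cma_bitmap_aligned_mask(const struct cma *cma, int align_order)
    {
            if (align_order <= cma->order_per_bit)
                    return 0;
            return (1UL << (align_order - cma->order_per_bit)) - 1;
    }

    int main(void)
    {
            struct cma kvm_cma = { .order_per_bit = 6 };

            /* The kvm example from the changelog: align_order = 15
             * must yield 511, not 0. */
            printf("%lu\n", cma_bitmap_aligned_mask(&kvm_cma, 15));
            return 0;
    }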
* mm/slab: fix unaligned access on sparc64 (Joonsoo Kim, 2014-10-14; 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit bf0dea23a9c0 ("mm/slab: use percpu allocator for cpu cache") changed the allocation method for cpu cache array from slab allocator to percpu allocator. Alignment should be provided for aligned memory in percpu allocator case, but, that commit mistakenly set this alignment to 0. So, percpu allocator returns unaligned memory address. It doesn't cause any problem on x86 which permits unaligned access, but, it causes the problem on sparc64 which needs strong guarantee of alignment. Following bug report is reported from David Miller. I'm getting tons of the following on sparc64: [603965.383447] Kernel unaligned access at TPC[546b58] free_block+0x98/0x1a0 [603965.396987] Kernel unaligned access at TPC[546b60] free_block+0xa0/0x1a0 ... [603970.554394] log_unaligned: 333 callbacks suppressed ... This patch provides a proper alignment parameter when allocating cpu cache to fix this unaligned memory access problem on sparc64. Reported-by: David Miller <davem@davemloft.net> Tested-by: David Miller <davem@davemloft.net> Tested-by: Meelis Roos <mroos@linux.ee> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
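The shape of the fix, shown as a hedged two-line sketch of the cpu-cache allocation in mm/slab.c; the alignment value given here is illustrative, the point is that a non-zero alignment must be passed:

    /* Before: alignment 0, so the percpu allocator may hand back an
     * address that is not even pointer-aligned -- fatal on sparc64. */
    cpu_cache = __alloc_percpu(size, 0);

    /* After: request a real alignment for the cpu cache array. */
    cpu_cache = __alloc_percpu(size, sizeof(void *));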
* Merge branch 'x86-cpu-for-linus' of ↵ (Linus Torvalds, 2014-10-13; 1 file, -11/+12)
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpu offlining patch from Ingo Molnar: "This tree includes a single commit that speeds up x86 suspend/resume by replacing a naive 100msec sleep based polling loop with proper completion notification. This gives some real suspend/resume benefit on servers with larger core counts" * 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/smpboot: Speed up suspend/resume by avoiding 100ms sleep for CPU offline during S3
| * x86/smpboot: Speed up suspend/resume by avoiding 100ms sleep for CPU offline ↵ (Lan Tianyu, 2014-09-24; 1 file, -11/+12)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | during S3 With certain kernel configurations, CPU offline consumes more than 100ms during S3. It's a timing related issue: native_cpu_die() would occasionally fall into a 100ms sleep when the CPU idle loop thread marked the CPU state to DEAD too slowly. What native_cpu_die() does is that it polls the CPU state and waits for 100ms if CPU state hasn't been marked to DEAD. The 100ms sleep doesn't make sense and is purely historic. To avoid such long sleeping, this patch adds a 'struct completion' to each CPU, waits for the completion in native_cpu_die() and wakes up the completion when the CPU state is marked to DEAD. Tested on an Intel Xeon server with 48 cores, Ivybridge and on Haswell laptops. The CPU offlining cost on these machines is reduced from more than 100ms to less than 5ms. The system suspend time is reduced by 2.3s on the servers. Borislav and Prarit also helped to test the patch on an AMD machine and a few systems of various sizes and configurations (multi-socket, single-socket, no hyper threading, etc.). No issues were seen. Tested-by: Prarit Bhargava <prarit@redhat.com> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com> Acked-by: Borislav Petkov <bp@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: srostedt@redhat.com Cc: toshi.kani@hp.com Cc: imammedo@redhat.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1409039025-32310-1-git-send-email-tianyu.lan@intel.com [ Improved a few minor details in the code, cleaned up the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
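A hedged sketch of the mechanism: one completion per CPU, signalled by the dying CPU and waited on by the CPU doing the offlining. Function and variable names are illustrative, not the exact x86 code:

    static DEFINE_PER_CPU(struct completion, die_complete);

    /* Called on the CPU going offline once its state is DEAD.
     * (init_completion() is assumed to have run before teardown.) */
    static void report_cpu_dead(void)
    {
            complete(&per_cpu(die_complete, smp_processor_id()));
    }

    /* Called on the CPU performing the offline: instead of polling the
     * state with a fixed 100ms sleep, block until the dying CPU reports
     * in, with a generous timeout as a safety net. */
    static void wait_for_cpu_dead(unsigned int cpu)
    {
            if (!wait_for_completion_timeout(&per_cpu(die_complete, cpu),
                                             msecs_to_jiffies(5000)))
                    pr_err("CPU %u did not report itself dead\n", cpu);
    }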
* | Merge branch 'x86-cleanups-for-linus' of ↵ (Linus Torvalds, 2014-10-13; 2 files, -17/+12)
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Ingo Molnar: "Three small cleanups" * 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/tty/serial/8250: Clean up the asm/serial.h include file a bit x86/tty/serial/8250: Resolve missing-field-initializers warnings x86: Remove obsolete comment in uapi/e820.h
| * | x86/tty/serial/8250: Clean up the asm/serial.h include file a bit (Ingo Molnar, 2014-09-06; 1 file, -12/+12)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - correct spelling - align fields vertically to make things more readable - make the layout of magic defines more obvious Cc: Mark Rustad <mark.d.rustad@intel.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Link: http://lkml.kernel.org/r/1409972149-26272-1-git-send-email-jeffrey.t.kirsher@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | x86/tty/serial/8250: Resolve missing-field-initializers warnings (Mark Rustad, 2014-09-06; 1 file, -4/+4)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Resolve some missing-field-initializers warnings by using designated initialization in the expansion of the SERIAL_PORT_DFNS macro. Signed-off-by: Mark Rustad <mark.d.rustad@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Link: http://lkml.kernel.org/r/1409972149-26272-1-git-send-email-jeffrey.t.kirsher@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
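A small illustration of why designated initializers silence the warning; the struct fields are abbreviated and this is not the actual SERIAL_PORT_DFNS expansion:

    struct old_serial_port {
            unsigned int uart;
            unsigned int baud_base;
            unsigned int port;
            unsigned int irq;
            unsigned int flags;
            /* ... further fields ... */
    };

    /* Positional form: fields after the last one listed are reported by
     * -Wmissing-field-initializers in -Wextra/W= builds, even though
     * they are zero-initialized anyway. */
    static struct old_serial_port pos = { 0, 115200, 0x3f8, 4 };

    /* Designated form: unnamed fields are still zero-initialized, but
     * the compiler no longer warns about them. */
    static struct old_serial_port des = {
            .baud_base = 115200,
            .port      = 0x3f8,
            .irq       = 4,
    };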
| * | x86: Remove obsolete comment in uapi/e820.h (Ross Zwisler, 2014-08-21; 1 file, -5/+0)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A comment introduced by this old commit: 028b785888c5 ("x86 boot: extend some internal memory map arrays to handle larger EFI input") had to do with some nested preprocessor directives. The directives were split into separate files by this commit: af170c5061dd ("UAPI: (Scripted) Disintegrate arch/x86/include/asm") The comment explaining their interaction was retained and is now present in arch/x86/include/uapi/asm/e820.h. This comment is no longer correct, so delete it. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Link: http://lkml.kernel.org/r/1400521824-21040-1-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | Merge branch 'x86-build-for-linus' of ↵ (Linus Torvalds, 2014-10-13; 1 file, -6/+2)
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 build update from Ingo Molnar: "A single commit that simplifies the no-FPU-ops build options" * 'x86-build-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/kbuild: Eliminate duplicate command line options
| * | | x86/kbuild: Eliminate duplicate command line options (Rasmus Villemoes, 2014-09-16; 1 file, -6/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The options -mno-mmx and -mno-sse are unconditionally added to KBUILD_CFLAGS in both branches of an ifeq and through a $(cc-option) further down. We can safely remove the first instances. In fact, since the -mno-mmx and -mno-sse options were introduced simultaneous with the other two options in the $(cc-option) [according to http://www.gnu.org/software/gcc/gcc-3.1/changes.html], and since the former were unconditionally used, one can deduce that only gcc versions knowing about all four are supported. So also eliminate the $(cc-option) wrap. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Peter Foley <pefoley2@pefoley.com> Link: http://lkml.kernel.org/r/1410365139-24440-1-git-send-email-linux@rasmusvillemoes.dk Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | | Merge branch 'x86-boot-for-linus' of ↵ (Linus Torvalds, 2014-10-13; 4 files, -80/+51)
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 bootup updates from Ingo Molnar: "The changes in this cycle were: - Fix rare SMP-boot hang (mostly in virtual environments) - Fix build warning with certain (rare) toolchains" * 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/relocs: Make per_cpu_load_addr static x86/smpboot: Initialize secondary CPU only if master CPU will wait for it
| * | | | x86/relocs: Make per_cpu_load_addr static (Ben Hutchings, 2014-09-24; 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | per_cpu_load_addr is only used for 64-bit relocations, but is declared in both configurations of relocs.c - with different types. This has undefined behaviour in general. GNU ld is documented to use the larger size in this case, but other tools may differ and some warn about this. References: https://bugs.debian.org/748577 Reported-by: Michael Tautschnig <mt@debian.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: 748577@bugs.debian.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1411561812.3659.23.camel@decadent.org.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | x86/smpboot: Initialize secondary CPU only if master CPU will wait for it (Igor Mammedov, 2014-09-16; 3 files, -79/+50)
A hang is observed on virtual machines during CPU hotplug, especially in big guests with many CPUs. (It reproduces more often if the host is over-committed.) It happens because the master CPU gives up waiting on the secondary CPU and allows it to run wild. As a result, the AP can lock up or crash the system, for example as described here:

  https://lkml.org/lkml/2014/3/6/257

If the master CPU has sent the STARTUP IPI successfully, and the AP has signalled to the master CPU that it is ready to start initialization, make the master CPU wait indefinitely until the AP is onlined. To ensure that the AP won't ever run wild, make it wait at early startup until the master CPU confirms its intention to wait for the AP. If the AP doesn't respond within 10 seconds, the master CPU times out and cancels AP onlining.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1403266991-12233-1-git-send-email-imammedo@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | | | Merge branch 'x86-asm-for-linus' of ↵ (Linus Torvalds, 2014-10-13; 7 files, -68/+61)
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 asm updates from Ingo Molnar: "The changes in this cycle were: - Speed up the x86 __preempt_schedule() implementation - Fix/improve low level asm code debug info annotations" * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Unwind-annotate thunk_32.S x86: Improve cmpxchg8b_emu.S x86: Improve cmpxchg16b_emu.S x86/lib/Makefile: Remove the unnecessary "+= thunk_64.o" x86: Speed up ___preempt_schedule*() by using THUNK helpers
| * | | | x86: Unwind-annotate thunk_32.S (Jan Beulich, 2014-10-08; 1 file, -6/+15)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Jan Beulich <jbeulich@suse.com> Link: http://lkml.kernel.org/r/542291CA0200007800038085@mail.emea.novell.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | x86: Improve cmpxchg8b_emu.S (Jan Beulich, 2014-10-08; 1 file, -11/+9)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - don't include unneeded headers - drop redundant entry point label - complete unwind annotations - use .L prefix on local labels to not clutter the symbol table Signed-off-by: Jan Beulich <jbeulich@suse.com> Link: http://lkml.kernel.org/r/5422917E0200007800038081@mail.emea.novell.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | x86: Improve cmpxchg16b_emu.S (Jan Beulich, 2014-10-08; 1 file, -19/+13)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - don't include unneeded headers - don't open-code PER_CPU_VAR() - drop redundant entry point label - complete unwind annotations - use .L prefix on local label to not clutter the symbol table Signed-off-by: Jan Beulich <jbeulich@suse.com> Link: http://lkml.kernel.org/r/542290BC020000780003807D@mail.emea.novell.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | x86/lib/Makefile: Remove the unnecessary "+= thunk_64.o" (Oleg Nesterov, 2014-09-24; 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Trivial. We have "lib-y += thunk_$(BITS).o" at the start, no need to add thunk_64.o if !CONFIG_X86_32. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Andy Lutomirski <luto@amacapital.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140921184232.GB23727@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | x86: Speed up ___preempt_schedule*() by using THUNK helpers (Oleg Nesterov, 2014-09-24; 4 files, -31/+23)
| | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ___preempt_schedule() does SAVE_ALL/RESTORE_ALL but this is suboptimal, we do not need to save/restore the callee-saved register. And we already have arch/x86/lib/thunk_*.S which implements the similar asm wrappers, so it makes sense to redefine ___preempt_schedule() as "THUNK ..." and remove preempt.S altogether. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Andy Lutomirski <luto@amacapital.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140921184153.GA23727@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | | Merge branch 'sched-core-for-linus' of ↵ (Linus Torvalds, 2014-10-13; 55 files, -552/+1075)
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The main changes in this cycle were: - Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave Hansen) - Various sched/idle refinements for better idle handling (Nicolas Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot) - sched/numa updates and optimizations (Rik van Riel) - sysbench speedup (Vincent Guittot) - capacity calculation cleanups/refactoring (Vincent Guittot) - Various cleanups to thread group iteration (Oleg Nesterov) - Double-rq-lock removal optimization and various refactorings (Kirill Tkhai) - various sched/deadline fixes ... and lots of other changes" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits) sched/dl: Use dl_bw_of() under rcu_read_lock_sched() sched/fair: Delete resched_cpu() from idle_balance() sched, time: Fix build error with 64 bit cputime_t on 32 bit systems sched: Improve sysbench performance by fixing spurious active migration sched/x86: Fix up typo in topology detection x86, sched: Add new topology for multi-NUMA-node CPUs sched/rt: Use resched_curr() in task_tick_rt() sched: Use rq->rd in sched_setaffinity() under RCU read lock sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask' sched: Use dl_bw_of() under RCU read lock sched/fair: Remove duplicate code from can_migrate_task() sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW sched: print_rq(): Don't use tasklist_lock sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock() sched: Fix the task-group check in tg_has_rt_tasks() sched/fair: Leverage the idle state info when choosing the "idlest" cpu sched: Let the scheduler see CPU idle states sched/deadline: Fix inter- exclusive cpusets migrations sched/deadline: Clear dl_entity params when setscheduling to different class sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault() ...
| * | | | sched/dl: Use dl_bw_of() under rcu_read_lock_sched() (Kirill Tkhai, 2014-10-03; 1 file, -9/+16)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rq->rd is freed using call_rcu_sched(), so rcu_read_lock() to access it is not enough. We should use either rcu_read_lock_sched() or preempt_disable(). Reported-by: Sasha Levin <sasha.levin@oracle.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kirill Tkhai <ktkhai@parallels.com Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Fixes: 66339c31bc39 "sched: Use dl_bw_of() under RCU read lock" Link: http://lkml.kernel.org/r/1412065417.20287.24.camel@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | sched/fair: Delete resched_cpu() from idle_balance() (Kirill Tkhai, 2014-10-03; 1 file, -6/+0)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We already reschedule env.dst_cpu in attach_tasks()->check_preempt_curr() if this is necessary. Furthermore, a higher priority class task may be current on dest rq, we shouldn't disturb it. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Cc: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140930210441.5258.55054.stgit@localhost Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | sched, time: Fix build error with 64 bit cputime_t on 32 bit systems (Rik van Riel, 2014-10-03; 5 files, -10/+27)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On 32 bit systems cmpxchg cannot handle 64 bit values, so some additional magic is required to allow a 32 bit system with CONFIG_VIRT_CPU_ACCOUNTING_GEN=y enabled to build. Make sure the correct cmpxchg function is used when doing an atomic swap of a cputime_t. Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: umgwanakikbuti@gmail.com Cc: fweisbec@gmail.com Cc: srao@redhat.com Cc: lwoodman@redhat.com Cc: atheurer@redhat.com Cc: oleg@redhat.com Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: linux390@de.ibm.com Cc: linux-arch@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s390@vger.kernel.org Link: http://lkml.kernel.org/r/20140930155947.070cdb1f@annuminas.surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
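A hedged sketch of the idea: dispatch to the 64-bit-capable primitive when cputime_t is 64 bits wide on a 32-bit kernel. The macro name and the exact #if condition are illustrative, not a verbatim copy of the patch:

    /*
     * On 32-bit kernels a plain cmpxchg() cannot atomically swap a
     * 64-bit cputime_t (nanosecond-based, as with
     * CONFIG_VIRT_CPU_ACCOUNTING_GEN), so route those through
     * cmpxchg64(); everything else keeps using cmpxchg().
     */
    #if BITS_PER_LONG == 32 && defined(CONFIG_VIRT_CPU_ACCOUNTING_GEN)
    # define cmpxchg_cputime(ptr, old, new)  cmpxchg64(ptr, old, new)
    #else
    # define cmpxchg_cputime(ptr, old, new)  cmpxchg(ptr, old, new)
    #endif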
| * | | | sched: Improve sysbench performance by fixing spurious active migration (Vincent Guittot, 2014-10-03; 1 file, -6/+7)
Since commit caeb178c60f4 ("sched/fair: Make update_sd_pick_busiest() ...") sd_pick_busiest() returns a group that can be neither imbalanced nor overloaded but is only more loaded than others. This change was introduced to ensure a better load balance in systems that are not overloaded, but as a side effect it can also generate useless active migrations between groups.

Take the example of 3 tasks on a quad-core system. We will always have an idle core, so the load balancer will find a busiest group (core) whenever an ILB is triggered, and it will force an active migration (once above the nr_balance_failed threshold), so the idle core becomes busy but another core becomes idle. With the next ILB, the freshly idle core will try to pull a task off a busy CPU. The number of spurious active migrations is not so huge on a quad-core system because the ILB is not triggered that often, but it becomes significant as soon as you have more than one sched_domain level, like on a dual cluster of quad cores where the ILB is triggered every tick when you have more than 1 busy CPU.

We need to ensure that the migration generates a real improvement and does not merely move the avg_load imbalance onto another CPU. Before caeb178c60f4f93f1b45c0bc056b5cf6d217b67f, the filtering of such use cases was ensured by the following test in f_b_g():

  if ((local->idle_cpus < busiest->idle_cpus) &&
      busiest->sum_nr_running <= busiest->group_weight)

This patch modifies the condition to take into account the situation where the busiest group is not overloaded: if the difference between the number of idle cpus in the 2 groups is less than or equal to 1 and the busiest group is not overloaded, moving a task will not improve the load balance but just move it.

A test with sysbench on a dual cluster of quad cores gives the following results:

  command: sysbench --test=cpu --num-threads=5 --max-time=5 run

The HZ is 200, which means that 1000 ticks have fired during the test.

With mainline, perf gives the following figures:

 Samples: 727 of event 'sched:sched_migrate_task'
 Event count (approx.): 727
 Overhead  Command          Shared Object  Symbol
 ........  ...............  .............  ..............
   12.52%  migration/1      [unknown]      [.] 00000000
   12.52%  migration/5      [unknown]      [.] 00000000
   12.52%  migration/7      [unknown]      [.] 00000000
   12.10%  migration/6      [unknown]      [.] 00000000
   11.83%  migration/0      [unknown]      [.] 00000000
   11.83%  migration/3      [unknown]      [.] 00000000
   11.14%  migration/4      [unknown]      [.] 00000000
   10.87%  migration/2      [unknown]      [.] 00000000
    2.75%  sysbench         [unknown]      [.] 00000000
    0.83%  swapper          [unknown]      [.] 00000000
    0.55%  ktps65090charge  [unknown]      [.] 00000000
    0.41%  mmcqd/1          [unknown]      [.] 00000000
    0.14%  perf             [unknown]      [.] 00000000

With this patch, perf gives the following figures:

 Samples: 20 of event 'sched:sched_migrate_task'
 Event count (approx.): 20
 Overhead  Command          Shared Object  Symbol
 ........  ...............  .............  ..............
   80.00%  sysbench         [unknown]      [.] 00000000
   10.00%  swapper          [unknown]      [.] 00000000
    5.00%  ktps65090charge  [unknown]      [.] 00000000
    5.00%  migration/1      [unknown]      [.] 00000000

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1412170735-5356-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | sched/x86: Fix up typo in topology detection (Dave Hansen, 2014-10-03; 1 file, -1/+1)
Commit cebf15eb09a2 ("x86, sched: Add new topology for multi-NUMA-node CPUs") added some code to try to detect the situation where we have a NUMA node inside of the "DIE" sched domain. It detected this by looking for cpus which match_die() but do not match NUMA nodes via topology_same_node().

I wrote it up as:

  if (match_die(c, o) == !topology_same_node(c, o))

which actually seemed to work some of the time, albeit accidentally. It should have been doing an &&, not an ==.

This code essentially chopped off the "DIE" domain on one of Andrew Morton's systems. He reported that this patch fixed his issue.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Lan Tianyu <tianyu.lan@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Link: http://lkml.kernel.org/r/20140930214546.FD481CFF@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
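For clarity, the corrected form of the check, as described in the changelog above:

    /* True only when two CPUs share a die but sit in different NUMA nodes. */
    if (match_die(c, o) && !topology_same_node(c, o))
            primarily_use_numa_for_topology();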
| * | | | x86, sched: Add new topology for multi-NUMA-node CPUs (Dave Hansen, 2014-09-24; 1 file, -9/+46)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I'm getting the spew below when booting with Haswell (Xeon E5-2699 v3) CPUs and the "Cluster-on-Die" (CoD) feature enabled in the BIOS. It seems similar to the issue that some folks from AMD ran in to on their systems and addressed in this commit: 161270fc1f9d ("x86/smp: Fix topology checks on AMD MCM CPUs") Both these Intel and AMD systems break an assumption which is being enforced by topology_sane(): a socket may not contain more than one NUMA node. AMD special-cased their system by looking for a cpuid flag. The Intel mode is dependent on BIOS options and I do not know of a way which it is enumerated other than the tables being parsed during the CPU bringup process. In other words, we have to trust the ACPI tables <shudder>. This detects the situation where a NUMA node occurs at a place in the middle of the "CPU" sched domains. It replaces the default topology with one that relies on the NUMA information from the firmware (SRAT table) for all levels of sched domains above the hyperthreads. This also fixes a sysfs bug. We used to freak out when we saw the "mc" group cross a node boundary, so we stopped building the MC group. MC gets exported as the 'core_siblings_list' in /sys/devices/system/cpu/cpu*/topology/ and this caused CPUs with the same 'physical_package_id' to not be listed together in 'core_siblings_list'. This violates a statement from Documentation/ABI/testing/sysfs-devices-system-cpu: core_siblings: internal kernel map of cpu#'s hardware threads within the same physical_package_id. core_siblings_list: human-readable list of the logical CPU numbers within the same physical_package_id as cpu#. The sysfs effects here cause an issue with the hwloc tool where it gets confused and thinks there are more sockets than are physically present. Before this patch, there are two packages: # cd /sys/devices/system/cpu/ # cat cpu*/topology/physical_package_id | sort | uniq -c 18 0 18 1 But 4 _sets_ of core siblings: # cat cpu*/topology/core_siblings_list | sort | uniq -c 9 0-8 9 18-26 9 27-35 9 9-17 After this set, there are only 2 sets of core siblings, which is what we expect for a 2-socket system. # cat cpu*/topology/physical_package_id | sort | uniq -c 18 0 18 1 # cat cpu*/topology/core_siblings_list | sort | uniq -c 18 0-17 18 18-35 Example spew: ... NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter. #2 #3 #4 #5 #6 #7 #8 .... 
node #1, CPUs: #9 ------------[ cut here ]------------ WARNING: CPU: 9 PID: 0 at /home/ak/hle/linux-hle-2.6/arch/x86/kernel/smpboot.c:306 topology_sane.isra.2+0x74/0x90() sched: CPU #9's mc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency. Modules linked in: CPU: 9 PID: 0 Comm: swapper/9 Not tainted 3.17.0-rc1-00293-g8e01c4d-dirty #631 Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0036.R05.1407140519 07/14/2014 0000000000000009 ffff88046ddabe00 ffffffff8172e485 ffff88046ddabe48 ffff88046ddabe38 ffffffff8109691d 000000000000b001 0000000000000009 ffff88086fc12580 000000000000b020 0000000000000009 ffff88046ddabe98 Call Trace: [<ffffffff8172e485>] dump_stack+0x45/0x56 [<ffffffff8109691d>] warn_slowpath_common+0x7d/0xa0 [<ffffffff8109698c>] warn_slowpath_fmt+0x4c/0x50 [<ffffffff81074f94>] topology_sane.isra.2+0x74/0x90 [<ffffffff8107530e>] set_cpu_sibling_map+0x31e/0x4f0 [<ffffffff8107568d>] start_secondary+0x1ad/0x240 ---[ end trace 3fe5f587a9fcde61 ]--- #10 #11 #12 #13 #14 #15 #16 #17 .... node #2, CPUs: #18 #19 #20 #21 #22 #23 #24 #25 #26 .... node #3, CPUs: #27 #28 #29 #30 #31 #32 #33 #34 #35 Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> [ Added LLC domain and s/match_mc/match_die/ ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: David Rientjes <rientjes@google.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: brice.goglin@gmail.com Cc: "H. Peter Anvin" <hpa@linux.intel.com> Link: http://lkml.kernel.org/r/20140918193334.C065EBCE@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | sched/rt: Use resched_curr() in task_tick_rt() (Kirill Tkhai, 2014-09-24; 1 file, -1/+1)
Some time ago PREEMPT_NEED_RESCHED was implemented, so the rescheduling technique is a little more involved now.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140922183642.11015.66039.stgit@localhost
Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | sched: Use rq->rd in sched_setaffinity() under RCU read lock (Kirill Tkhai, 2014-09-24; 1 file, -4/+5)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Probability of use-after-free isn't zero in this place. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> # v3.14+ Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140922183636.11015.83611.stgit@localhost Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask' (Kirill Tkhai, 2014-09-24; 1 file, -4/+4)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Nothing is locked there, so label's name only confuses a reader. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140922183630.11015.59500.stgit@localhost Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | sched: Use dl_bw_of() under RCU read lock (Kirill Tkhai, 2014-09-24; 1 file, -0/+10)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | dl_bw_of() dereferences rq->rd which has to have RCU read lock held. Probability of use-after-free isn't zero here. Also add lockdep assert into dl_bw_cpus(). Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> # v3.14+ Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140922183624.11015.71558.stgit@localhost Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | sched/fair: Remove duplicate code from can_migrate_task() (Kirill Tkhai, 2014-09-24; 1 file, -14/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Combine two branches which do the same. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140922183612.11015.64200.stgit@localhost Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW (Peter Zijlstra, 2014-09-24; 4 files, -43/+0)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kirill found that there's a subtle race in the __ARCH_WANT_UNLOCKED_CTXSW code, and instead of fixing it, remove the entire exception because neither arch that uses it seems to actually still require it. Boot tested on mips64el (qemu) only. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kirill Tkhai <tkhai@yandex.ru> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Burton <paul.burton@imgtec.com> Cc: Qais Yousef <qais.yousef@imgtec.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Tony Luck <tony.luck@intel.com> Cc: oleg@redhat.com Cc: linux@roeck-us.net Cc: linux-ia64@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mips@linux-mips.org Link: http://lkml.kernel.org/r/20140923150641.GH3312@worktop.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | sched: print_rq(): Don't use tasklist_lock (Oleg Nesterov, 2014-09-24; 1 file, -5/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | read_lock_irqsave(tasklist_lock) in print_rq() looks strange. We do not need to disable irqs, and they are already disabled by the caller. And afaics this lock buys nothing, we can rely on rcu_read_lock(). In this case it makes sense to also move rcu_read_lock/unlock from the caller to print_rq(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140921193341.GA28628@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use ↵ (Oleg Nesterov, 2014-09-24; 1 file, -10/+6)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | task_rq_lock() 1. read_lock(tasklist_lock) does not need to disable irqs. 2. ->mm != NULL is a common mistake, use PF_KTHREAD. 3. The second ->mm check can be simply removed. 4. task_rq_lock() looks better than raw_spin_lock(&p->pi_lock) + __task_rq_lock(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140921193338.GA28621@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | sched: Fix the task-group check in tg_has_rt_tasks() (Oleg Nesterov, 2014-09-24; 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | tg_has_rt_tasks() wants to find an RT task in this task_group, but task_rq(p)->rt.tg wrongly checks the root rt_rq. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Link: http://lkml.kernel.org/r/20140921193336.GA28618@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | sched/fair: Leverage the idle state info when choosing the "idlest" cpu (Nicolas Pitre, 2014-09-24; 1 file, -7/+34)
The code in find_idlest_cpu() looks for the CPU with the smallest load. However, if multiple CPUs are idle, the first idle CPU is selected irrespective of the depth of its idle state.

Among the idle CPUs we should pick the one with the shallowest idle state, or the latest to have gone idle if all idle CPUs are in the same state. The latter applies even when cpuidle is configured out.

This patch doesn't cover the following issues:

- The idle exit latency of a CPU might be larger than the time needed to migrate the waking task to an already running CPU with sufficient capacity, and therefore performance would benefit from task packing in that case (in most cases task packing is about power saving).

- Some idle states have a non-negligible and non-abortable entry latency which needs to run to completion before the exit latency can start. A concurrent patch series is making this info available to the cpuidle core. Once available, the entry latency together with the idle timestamp could determine when the exit latency may be effective.

Those issues will be handled in due course. In the meantime, what is implemented here should already improve things compared to the current state of affairs.

Based on an initial patch from Daniel Lezcano.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-pm@vger.kernel.org
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/n/tip-@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
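A hedged sketch of the per-CPU selection logic this adds to find_idlest_cpu(); the idle_get_state() accessor is the one introduced by the companion patch below, but local variable names and the surrounding loop are simplified and illustrative rather than the verbatim change:

    if (idle_cpu(i)) {
            struct rq *rq = cpu_rq(i);
            struct cpuidle_state *idle = idle_get_state(rq);

            if (idle && idle->exit_latency < min_exit_latency) {
                    /* Shallower idle state: cheaper to wake up. */
                    min_exit_latency = idle->exit_latency;
                    latest_idle_timestamp = rq->idle_stamp;
                    shallowest_idle_cpu = i;
            } else if ((!idle || idle->exit_latency == min_exit_latency) &&
                       rq->idle_stamp > latest_idle_timestamp) {
                    /* Same (or unknown) state: prefer the CPU that went
                     * idle most recently. */
                    latest_idle_timestamp = rq->idle_stamp;
                    shallowest_idle_cpu = i;
            }
    } else if (shallowest_idle_cpu == -1) {
            /* No idle CPU seen yet: fall back to the least-loaded one. */
            load = weighted_cpuload(i);
            if (load < min_load) {
                    min_load = load;
                    least_loaded_cpu = i;
            }
    }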
| * | | | sched: Let the scheduler see CPU idle states (Daniel Lezcano, 2014-09-24; 3 files, -0/+42)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the cpu enters idle, it stores the cpuidle state pointer in its struct rq instance which in turn could be used to make a better decision when balancing tasks. As soon as the cpu exits its idle state, the struct rq reference is cleared. There are a couple of situations where the idle state pointer could be changed while it is being consulted: 1. For x86/acpi with dynamic c-states, when a laptop switches from battery to AC that could result on removing the deeper idle state. The acpi driver triggers: 'acpi_processor_cst_has_changed' 'cpuidle_pause_and_lock' 'cpuidle_uninstall_idle_handler' 'kick_all_cpus_sync'. All cpus will exit their idle state and the pointed object will be set to NULL. 2. The cpuidle driver is unloaded. Logically that could happen but not in practice because the drivers are always compiled in and 95% of them are not coded to unregister themselves. In any case, the unloading code must call 'cpuidle_unregister_device', that calls 'cpuidle_pause_and_lock' leading to 'kick_all_cpus_sync' as mentioned above. A race can happen if we use the pointer and then one of these two scenarios occurs at the same moment. In order to be safe, the idle state pointer stored in the rq must be used inside a rcu_read_lock section where we are protected with the 'rcu_barrier' in the 'cpuidle_uninstall_idle_handler' function. The idle_get_state() and idle_put_state() accessors should be used to that effect. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: linux-pm@vger.kernel.org Cc: linaro-kernel@lists.linaro.org Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | sched/deadline: Fix inter- exclusive cpusets migrations (Juri Lelli, 2014-09-24; 2 files, -3/+8)
Users can perform clustered scheduling using the cpuset facility. After an exclusive cpuset is created, task migrations happen only between CPUs belonging to the same cpuset. Inter-cpuset migrations can only happen when the user requests it, by moving a task between different cpusets. This behaviour is broken in SCHED_DEADLINE, as spurious inter-cpuset migrations may currently happen without user intervention.

This patch fixes the problem (and shuffles the code a bit to improve clarity).

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: raistlin@linux.it
Cc: michael@amarulasolutions.com
Cc: fchecconi@gmail.com
Cc: daniel.wagner@bmw-carit.de
Cc: vincent@legout.info
Cc: luca.abeni@unitn.it
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1411118561-26323-4-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | sched/deadline: Clear dl_entity params when setscheduling to different class (Juri Lelli, 2014-09-24; 3 files, -4/+20)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a task is using SCHED_DEADLINE and the user setschedules it to a different class its sched_dl_entity static parameters are not cleaned up. This causes a bug if the user sets it back to SCHED_DEADLINE with the same parameters again. The problem resides in the check we perform at the very beginning of dl_overflow(): if (new_bw == p->dl.dl_bw) return 0; This condition is met in the case depicted above, so the function returns and dl_b->total_bw is not updated (the p->dl.dl_bw is not added to it). After this, admission control is broken. This patch fixes the thing, properly clearing static parameters for a task that ceases to use SCHED_DEADLINE. Reported-by: Daniele Alessandrelli <daniele.alessandrelli@gmail.com> Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Reported-by: Vincent Legout <vincent@legout.info> Tested-by: Luca Abeni <luca.abeni@unitn.it> Tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Tested-by: Vincent Legout <vincent@legout.info> Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Fabio Checconi <fchecconi@gmail.com> Cc: Dario Faggioli <raistlin@linux.it> Cc: Michael Trimarchi <michael@amarulasolutions.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1411118561-26323-2-git-send-email-juri.lelli@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault() (Oleg Nesterov, 2014-09-24; 1 file, -4/+0)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | current->state == TASK_DEAD means that the task is doing its last schedule(), page fault is obviously impossible at this stage. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140921194743.GA30114@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | sched: Clean up some typos and grammatical errors in code/comments (Zhihui Zhang, 2014-09-21; 3 files, -6/+6)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Zhihui Zhang <zzhsuny@gmail.com> Cc: peterz@infradead.org Link: http://lkml.kernel.org/r/1411262676-19928-1-git-send-email-zzhsuny@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | sched: Test the CPU's capacity in wake_affine() (Vincent Guittot, 2014-09-19; 1 file, -9/+10)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the task always wakes affine on this_cpu if the latter is idle. Before waking up the task on this_cpu, we check that this_cpu capacity is not significantly reduced because of RT tasks or irq activity. Use case where the number of irq and/or the time spent under irq is important will take benefit of this because the task that is woken up by irq or softirq will not use the same CPU than irq (and softirq) but a idle one. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: preeti@linux.vnet.ibm.com Cc: riel@redhat.com Cc: Morten.Rasmussen@arm.com Cc: efault@gmx.de Cc: nicolas.pitre@linaro.org Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1409051215-16788-8-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | ARM: topology: Use the new cpu_capacity interface (Vincent Guittot, 2014-09-19; 1 file, -2/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the new arch_scale_cpu_capacity() scheduler facility in order to reflect the original capacity of a CPU instead of arch_scale_freq_capacity() which is more linked to a scaling of the capacity linked to the frequency. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: preeti@linux.vnet.ibm.com Cc: riel@redhat.com Cc: Morten.Rasmussen@arm.com Cc: efault@gmx.de Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: Grant Likely <grant.likely@linaro.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@linaro.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: devicetree@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/1409051215-16788-6-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | sched: Allow all architectures to set 'capacity_orig' (Vincent Guittot, 2014-09-19; 1 file, -16/+11)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 'capacity_orig' is only changed for systems with an SMT sched_domain level in order to reflect the lower capacity of CPUs. Heterogenous systems also have to reflect an original capacity that is different from the default value. Create a more generic function arch_scale_cpu_capacity that can be also used by non SMT platforms to set capacity_orig. The __weak implementation of arch_scale_cpu_capacity() is the previous SMT variant, in order to keep backward compatibility with the use of capacity_orig. arch_scale_smt_capacity() and default_scale_smt_capacity() have been removed as they were not used elsewhere than in arch_scale_cpu_capacity(). Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Reviewed-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com> [ Added default_scale_cpu_capacity() back. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: riel@redhat.com Cc: Morten.Rasmussen@arm.com Cc: efault@gmx.de Cc: nicolas.pitre@linaro.org Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1409051215-16788-5-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | sched: Fix avg_load computation (Vincent Guittot, 2014-09-19; 1 file, -2/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The computation of avg_load and avg_load_per_task should only take into account the number of CFS tasks. The non-CFS tasks are already taken into account by decreasing the CPU's capacity and they will be tracked in the CPU's utilization (group_utilization) of the next patches. Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: riel@redhat.com Cc: Morten.Rasmussen@arm.com Cc: efault@gmx.de Cc: nicolas.pitre@linaro.org Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1409051215-16788-4-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>