path: root/arch
Commit message | Author | Age | Files | Lines
* x86/mce: Defer mce wakeups to threads for PREEMPT_RT | Steven Rostedt | 2013-05-20 | 1 | -17/+61

  We had a customer report a lockup on a 3.0-rt kernel that had the following backtrace:

    [ffff88107fca3e80] rt_spin_lock_slowlock at ffffffff81499113
    [ffff88107fca3f40] rt_spin_lock at ffffffff81499a56
    [ffff88107fca3f50] __wake_up at ffffffff81043379
    [ffff88107fca3f80] mce_notify_irq at ffffffff81017328
    [ffff88107fca3f90] intel_threshold_interrupt at ffffffff81019508
    [ffff88107fca3fa0] smp_threshold_interrupt at ffffffff81019fc1
    [ffff88107fca3fb0] threshold_interrupt at ffffffff814a1853

  It actually bugged because the lock was taken by the same owner that already had that lock. What happened was that the thread setting itself on a wait queue had the lock when an MCE triggered. The MCE interrupt does a wake up on its wait list and grabs the same lock.

  NOTE: THIS IS NOT A BUG ON MAINLINE

  Sorry for yelling, but as I Cc'd mainline maintainers I want them to know that this is a PREEMPT_RT bug only. I only Cc'd them for advice.

  On PREEMPT_RT the wait queue locks are converted from normal "spin_locks" into an rt_mutex (see the rt_spin_lock_slowlock above). These are not to be taken by hard interrupt context. This usually isn't a problem, as most interrupts in PREEMPT_RT are converted into schedulable threads. Unfortunately that's not the case with the MCE irq.

  As wait queue locks are notorious for long hold times, we cannot convert them to raw_spin_locks without causing issues with -rt. But Thomas has created a "simple-wait" structure that uses raw spin locks, which may have been a good fit.

  Unfortunately, wait queues are not the only issue, as mce_notify_irq also does a schedule_work(), which grabs the workqueue spin locks that have the exact same issue.

  Thus, the patch I'm proposing is to move the actual work of the MCE interrupt into a helper thread that gets woken up on the MCE interrupt and does the work in a schedulable context.

  NOTE: THIS PATCH ONLY CHANGES THE BEHAVIOR WHEN PREEMPT_RT IS SET

  Oops, sorry for yelling again, but I want to stress that I keep the same behavior of mainline when PREEMPT_RT is not set. Thus, this only changes the MCE behavior when PREEMPT_RT is configured.

  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  [bigeasy@linutronix: make mce_notify_work() a proper prototype, use kthread_run()]
  Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
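  The defer-to-thread pattern described above can be sketched roughly as below. This is a minimal illustration under assumed names (mce_notify_helper, mce_work_pending), not the actual patch, and error handling is trimmed.

    #include <linux/kthread.h>
    #include <linux/sched.h>
    #include <linux/atomic.h>
    #include <linux/err.h>

    static struct task_struct *mce_notify_helper;      /* assumed name */
    static atomic_t mce_work_pending = ATOMIC_INIT(0);

    /* Runs in schedulable context, so the sleeping locks taken by
     * __wake_up() and schedule_work() inside mce_notify_irq() are fine. */
    static int mce_notify_helper_fn(void *unused)
    {
            while (!kthread_should_stop()) {
                    set_current_state(TASK_INTERRUPTIBLE);
                    if (!atomic_xchg(&mce_work_pending, 0)) {
                            schedule();     /* wait for the hard irq to poke us */
                            continue;
                    }
                    __set_current_state(TASK_RUNNING);
                    mce_notify_irq();
            }
            return 0;
    }

    /* Called from the hard MCE interrupt: only record that work is pending
     * and wake the helper; wake_up_process() uses raw locks and is irq safe. */
    static void mce_notify_work(void)
    {
            atomic_set(&mce_work_pending, 1);
            wake_up_process(mce_notify_helper);
    }

    static int __init mce_notify_helper_init(void)
    {
            mce_notify_helper = kthread_run(mce_notify_helper_fn, NULL, "mce-notify");
            return IS_ERR(mce_notify_helper) ? PTR_ERR(mce_notify_helper) : 0;
    }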
* x86/32: Use kmap switch for non highmem as well | Thomas Gleixner | 2013-05-20 | 1 | -1/+1

  Even with CONFIG_HIGHMEM=n we need to take care of the "atomic" mappings which are installed via iomap_atomic.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* mips-remove-smp-reserve-lock.patch | Thomas Gleixner | 2013-05-20 | 1 | -6/+0

  Instead of making the lock raw, remove it as it protects nothing.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* kconfig-disable-a-few-options-rt.patch | Thomas Gleixner | 2013-05-20 | 1 | -0/+1

  Disable stuff which is known to have issues on RT.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* cpumask: Disable CONFIG_CPUMASK_OFFSTACK for RT | Thomas Gleixner | 2013-05-20 | 1 | -1/+1

  We can't deal with the cpumask allocations which happen in atomic context (see arch/x86/kernel/apic/io_apic.c) on RT right now.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* x86: crypto: Reduce preempt disabled regions | Peter Zijlstra | 2013-05-20 | 1 | -11/+13

  Restrict the preempt disabled regions to the actual floating point operations and enable preemption for the administrative actions. This is necessary on RT to avoid kfree and other operations being called with preemption disabled.

  Reported-and-tested-by: Carsten Emde <cbe@osadl.org>
  Signed-off-by: Peter Zijlstra <peterz@infradead.org>
  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* x86-kvm-require-const-tsc-for-rt.patch | Thomas Gleixner | 2013-05-20 | 1 | -0/+7

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* mm, rt: kmap_atomic scheduling | Peter Zijlstra | 2013-05-20 | 1 | -0/+36

  In fact, with migrate_disable() existing one could play games with kmap_atomic. You could save/restore the kmap_atomic slots on context switch (if there are any in use, of course); this should be especially easy now that we have a kmap_atomic stack.

  Something like the below.. it wants replacing all the preempt_disable() stuff with pagefault_disable() && migrate_disable() of course, but then you can flip kmaps around like below.

  Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
  [dvhart@linux.intel.com: build fix]
  Link: http://lkml.kernel.org/r/1311842631.5890.208.camel@twins
* mips-disable-highmem-on-rt.patch | Thomas Gleixner | 2013-05-20 | 1 | -1/+1

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* arm-disable-highmem-on-rt.patch | Thomas Gleixner | 2013-05-20 | 1 | -1/+1

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* power-disable-highmem-on-rt.patch | Thomas Gleixner | 2013-05-20 | 1 | -1/+1

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* power-use-generic-rwsem-on-rt | Thomas Gleixner | 2013-05-20 | 1 | -1/+2
* x86-no-perf-irq-work-rt.patch | Thomas Gleixner | 2013-05-20 | 1 | -0/+2

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* x86: Disable IST stacks for debug/int 3/stack fault for PREEMPT_RT | Andi Kleen | 2013-05-20 | 3 | -6/+21

  Normally the x86-64 trap handlers for debug/int 3/stack fault run on a special interrupt stack to make them more robust when dealing with kernel code.

  The PREEMPT_RT kernel can sleep in locks even while allocating GFP_ATOMIC memory. When one of these trap handlers needs to send real time signals for ptrace it allocates memory and could then try to schedule. But it is not allowed to schedule on an IST stack. This can cause warnings and hangs.

  This patch disables the IST stacks for these handlers for the PREEMPT_RT kernel. Instead let them run on the normal process stack.

  The kernel only really needs the ISTs here to make kernel debuggers more robust in case someone sets a break point somewhere where the stack is invalid. But there are no kernel debuggers in the standard kernel that do this. It also means kprobes cannot be set in situations with invalid stack; but that sounds like a reasonable restriction.

  The stack fault change could minimally impact oops quality, but not very much because stack faults are fairly rare.

  A better solution would be to use similar logic as the NMI "paranoid" path: check if the signal is for user space, if yes go back to entry.S, switch stack, call sync_regs, then do the signal sending etc. But this patch is much simpler and should work too with minimal impact.

  Signed-off-by: Andi Kleen <ak@suse.de>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
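  On x86-64 an IST index of 0 in the IDT gate means "stay on the current stack", so one way to express "no IST for these traps on RT" is at the stack-index definitions. The following is a hedged sketch of that idea; the exact constants and file layout are assumptions, not a quote of the patch.

    /* <asm/page_64_types.h>-style sketch: index 0 = no IST stack switch. */
    #ifdef CONFIG_PREEMPT_RT_FULL
    # define STACKFAULT_STACK 0     /* run on the normal process stack */
    # define DEBUG_STACK      0
    #else
    # define STACKFAULT_STACK 1
    # define DEBUG_STACK      4
    #endif

    /* set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK) then either
     * installs a real IST entry or, with index 0, no stack switch at all. */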
* x86: Use generic rwsem_spinlocks on -rt | Thomas Gleixner | 2013-05-20 | 1 | -2/+2

  Simplifies the separation of anon_rw_semaphores and rw_semaphores for -rt.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* x86: stackprotector: Avoid random pool on rt | Thomas Gleixner | 2013-05-20 | 1 | -1/+9

  CPU bringup calls into the random pool to initialize the stack canary. During boot that works nicely even on RT, as the might sleep checks are disabled. During CPU hotplug the might sleep checks trigger. Making the locks in random raw is a major PITA, so avoiding the call on RT is the only sensible solution. This is basically the same randomness which we get during boot where the random pool has no entropy and we rely on the TSC randomness.

  Reported-by: Carsten Emde <carsten.emde@osadl.org>
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
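  A minimal sketch of the idea, modeled on x86's boot_init_stack_canary(): skip get_random_bytes() on RT (its locks can sleep during hotplug) and rely on TSC bits instead. The #ifdef placement is an assumption, not the literal patch.

    static __always_inline void boot_init_stack_canary_sketch(void)
    {
            u64 canary = 0;
            u64 tsc;

    #ifndef CONFIG_PREEMPT_RT_FULL
            /* Fine during boot and on !RT; may sleep via the random pool
             * locks on RT CPU hotplug, so it is skipped there. */
            get_random_bytes(&canary, sizeof(canary));
    #endif
            tsc = native_read_tsc();
            canary += tsc + (tsc << 32UL);

            current->stack_canary = canary;
    }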
* x86: Convert mce timer to hrtimer | Thomas Gleixner | 2013-05-20 | 1 | -26/+23

  mce_timer is started in atomic contexts of cpu bringup. This results in might_sleep() warnings on RT. Convert mce_timer to a hrtimer to avoid this.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
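  The shape of such a conversion, as a hedged sketch (the names and the polling body are placeholders): hrtimer_start() only takes raw locks, so it can be called from the atomic bringup path without tripping might_sleep().

    #include <linux/hrtimer.h>
    #include <linux/ktime.h>

    static struct hrtimer mce_hrtimer;              /* illustrative name */
    static unsigned long check_interval = 5 * 60;   /* seconds */

    static enum hrtimer_restart mce_timer_fn_sketch(struct hrtimer *t)
    {
            /* ... poll the machine check banks here ... */
            hrtimer_forward_now(t, ns_to_ktime((u64)check_interval * NSEC_PER_SEC));
            return HRTIMER_RESTART;
    }

    static void mce_start_timer_sketch(void)
    {
            hrtimer_init(&mce_hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
            mce_hrtimer.function = mce_timer_fn_sketch;
            hrtimer_start(&mce_hrtimer,
                          ns_to_ktime((u64)check_interval * NSEC_PER_SEC),
                          HRTIMER_MODE_REL);
    }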
* softirq-disable-softirq-stacks-for-rt.patch | Thomas Gleixner | 2013-05-20 | 8 | -2/+16

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* ARM: Initialize ptl->lock for vector page | Frank Rowand | 2013-05-20 | 1 | -0/+25

  Without this patch, ARM can not use SPLIT_PTLOCK_CPUS if PREEMPT_RT_FULL=y because vectors_user_mapping() creates a VM_ALWAYSDUMP mapping of the vector page (address 0xffff0000), but no ptl->lock has been allocated for the page. An attempt to coredump that page will result in a kernel NULL pointer dereference when follow_page() attempts to lock the page.

  The call tree to the NULL pointer dereference is:

    do_notify_resume()
      get_signal_to_deliver()
        do_coredump()
          elf_core_dump()
            get_dump_page()
              __get_user_pages()
                follow_page()
                  pte_offset_map_lock() <----- a #define
                    ...
                      rt_spin_lock()

  The underlying problem is exposed by mm-shrink-the-page-frame-to-rt-size.patch.

  Signed-off-by: Frank Rowand <frank.rowand@am.sony.com>
  Cc: Frank <Frank_Rowand@sonyusa.com>
  Cc: Peter Zijlstra <peterz@infradead.org>
  Link: http://lkml.kernel.org/r/4E87C535.2030907@am.sony.com
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* acpi: Do not disable interrupts on PREEMPT_RT | Thomas Gleixner | 2013-05-20 | 1 | -2/+2

  Use the local_irq_*_nort() variants.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* early-printk-consolidate.patch | Thomas Gleixner | 2013-05-20 | 12 | -103/+36

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* x86: Do not disable preemption in int3 on 32bit | Steven Rostedt | 2013-05-20 | 1 | -9/+23

  Preemption must be disabled before enabling interrupts in do_trap on x86_64 because the stack in use for int3 and debug is a per CPU stack set by the IST. But 32bit does not have an IST and the stack still belongs to the current task, so there is no problem in scheduling out the task.

  Keep preemption enabled on X86_32 when enabling interrupts for do_trap(). The name of the function is changed from preempt_conditional_sti/cli() to conditional_sti/cli_ist(), to annotate that this function is used when the stack is on the IST.

  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
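  A hedged sketch of what the renamed helpers could look like (the #ifdef split is illustrative, not the diff itself): only 64-bit traps run on an IST stack and therefore need preemption disabled.

    static void conditional_sti_ist(struct pt_regs *regs)
    {
    #ifdef CONFIG_X86_64
            /* int3/debug run on a per-CPU IST stack: do not schedule. */
            preempt_disable();
    #endif
            if (regs->flags & X86_EFLAGS_IF)
                    local_irq_enable();
    }

    static void conditional_cli_ist(struct pt_regs *regs)
    {
            if (regs->flags & X86_EFLAGS_IF)
                    local_irq_disable();
    #ifdef CONFIG_X86_64
            preempt_enable_no_resched();
    #endif
    }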
* x86-32-fix-signal-crap.patch | Thomas Gleixner | 2013-05-20 | 1 | -0/+8

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* x86: Do not unmask io_apic when interrupt is in progress | Ingo Molnar | 2013-05-20 | 1 | -1/+2

  With threaded interrupts we might see an interrupt in progress on migration. Do not unmask it when this is the case.

  Signed-off-by: Ingo Molnar <mingo@elte.hu>
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* of-convert-devtree-lock.patch | Thomas Gleixner | 2013-05-20 | 1 | -2/+2

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* x86: highmem: Replace BUG_ON by WARN_ON | Ingo Molnar | 2013-05-20 | 1 | -1/+1

  The machine might survive that problem and be at least in a state which allows us to get more information about the problem.

  Signed-off-by: Ingo Molnar <mingo@elte.hu>
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* mm: pagefault_disabled() | Peter Zijlstra | 2013-05-20 | 22 | -26/+25

  Wrap the test for pagefault_disabled() into a helper; this allows us to remove the need for current->pagefault_disabled on !-rt kernels.

  Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
  Link: http://lkml.kernel.org/n/tip-3yy517m8zsi9fpsf14xfaqkw@git.kernel.org
* mm: Fixup all fault handlers to check current->pagefault_disable | Thomas Gleixner | 2013-05-20 | 22 | -24/+27

  Necessary for decoupling pagefault disable from preempt count.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* preempt-mark-legitimated-no-resched-sites.patch | Thomas Gleixner | 2013-05-20 | 2 | -2/+2

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* ARM: AT91: PIT: Remove irq handler when clock event is unused | Benedikt Spranger | 2013-05-20 | 2 | -1/+7

  Setup and remove the interrupt handler in clock event mode selection. This avoids calling the (shared) interrupt handler when the device is not used.

  Signed-off-by: Benedikt Spranger <b.spranger@linutronix.de>
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
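  Sketched below is roughly what such a mode-selection hook can look like; the irq number, handler and device names are assumptions for illustration, not the AT91 code itself.

    static int pit_irq;                             /* assumed, set at probe time */
    static irqreturn_t pit_interrupt_sketch(int irq, void *dev_id);

    static void pit_clkevt_mode_sketch(enum clock_event_mode mode,
                                       struct clock_event_device *dev)
    {
            switch (mode) {
            case CLOCK_EVT_MODE_PERIODIC:
                    /* Hook up the shared PIT interrupt only while it is needed. */
                    if (request_irq(pit_irq, pit_interrupt_sketch,
                                    IRQF_SHARED | IRQF_TIMER, "pit_tick", dev))
                            pr_err("pit_tick: request_irq failed\n");
                    /* ... start the periodic timer ... */
                    break;
            case CLOCK_EVT_MODE_UNUSED:
                    /* ... stop the timer and drop the handler again ... */
                    free_irq(pit_irq, dev);
                    break;
            case CLOCK_EVT_MODE_SHUTDOWN:
            default:
                    /* ... stop the timer, handler stays registered ... */
                    break;
            }
    }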
* signal/x86: Delay calling signals in atomic | Oleg Nesterov | 2013-05-20 | 2 | -0/+22

  On x86_64 we must disable preemption before we enable interrupts for stack faults, int3 and debugging, because the current task is using a per CPU debug stack defined by the IST. If we schedule out, another task can come in and use the same stack and cause the stack to be corrupted and crash the kernel on return.

  When CONFIG_PREEMPT_RT_FULL is enabled, spin_locks become mutexes, and one of these is the spin lock used in signal handling. Some of the debug code (int3) causes do_trap() to send a signal. This function calls a spin lock that has been converted to a mutex and has the possibility to sleep. If this happens, the above issues with the corrupted stack are possible.

  Instead of calling the signal right away, for PREEMPT_RT and x86_64, the signal information is stored in the task's task_struct and TIF_NOTIFY_RESUME is set. Then on exit of the trap, the signal resume code will send the signal when preemption is enabled.

  [ rostedt: Switched from #ifdef CONFIG_PREEMPT_RT_FULL to ARCH_RT_DELAYS_SIGNAL_SEND and added comments to the code. ]

  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Oleg Nesterov <oleg@redhat.com>
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
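  A minimal sketch of that deferral, under the assumption of a forced_info field in task_struct (the field name and helper are illustrative, not the patch itself):

    #ifdef ARCH_RT_DELAYS_SIGNAL_SEND
    static void force_sig_info_or_delay(int sig, struct siginfo *info,
                                        struct task_struct *t)
    {
            if (in_atomic()) {
                    /* Cannot take the (sleeping, on RT) signal locks here:
                     * stash the signal and deliver it on return to user space. */
                    t->forced_info = *info;                 /* assumed field */
                    set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
                    return;
            }
            force_sig_info(sig, info, t);
    }
    #endif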
* sched: Use schedule_preempt_disabled() | Thomas Gleixner | 2013-05-20 | 25 | -85/+33

  Coccinelle based conversion.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
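  For context, the open-coded pattern such a conversion typically collapses (a generic before/after, not a quote from the patch):

    /* before: open coded in many arch idle loops and similar call sites */
    preempt_enable_no_resched();
    schedule();
    preempt_disable();

    /* after: one helper with the same semantics */
    schedule_preempt_disabled();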
* preempt-rt: Convert arm boot_lock to raw | Frank Rowand | 2013-05-20 | 6 | -31/+31

  The arm boot_lock is used by the secondary processor startup code. The locking task is the idle thread, which has idle->sched_class == &idle_sched_class. idle_sched_class->enqueue_task == NULL, so if the idle task blocks on the lock, the attempt to wake it when the lock becomes available will fail:

    try_to_wake_up()
      ...
        activate_task()
          enqueue_task()
            p->sched_class->enqueue_task(rq, p, flags)

  Fix by converting boot_lock to a raw spin lock.

  Signed-off-by: Frank Rowand <frank.rowand@am.sony.com>
  Link: http://lkml.kernel.org/r/4E77B952.3010606@am.sony.com
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
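  The conversion itself is mechanical; roughly (illustrative, not the platform code):

    /* before: an rt_mutex on PREEMPT_RT, which the idle thread cannot block on */
    /* static DEFINE_SPINLOCK(boot_lock); */

    /* after: a raw spinlock that always busy-waits */
    static DEFINE_RAW_SPINLOCK(boot_lock);

    static void boot_handoff_sketch(void)
    {
            raw_spin_lock(&boot_lock);
            /* ... synchronize with the secondary CPU coming up ... */
            raw_spin_unlock(&boot_lock);
    }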
* arm: Allow forced irq threading | Thomas Gleixner | 2013-05-20 | 1 | -0/+1

  All timer interrupts and the perf interrupt are marked NO_THREAD, so it's safe to allow forced interrupt threading.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* arm: Mark pmu interrupt IRQF_NO_THREAD | Thomas Gleixner | 2013-05-20 | 1 | -1/+1

  PMU interrupt must not be threaded. Remove IRQF_DISABLED while at it, as we run all handlers with interrupts disabled anyway.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* ia64: vsyscall: Use seqcount instead of seqlock | Thomas Gleixner | 2013-05-20 | 4 | -11/+7

  The update of the vdso data happens under xtime_lock, so adding a nested lock is pointless. Just use a seqcount to sync the readers.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Cc: Tony Luck <tony.luck@intel.com>
* x86: vdso: Use seqcount instead of seqlock | Thomas Gleixner | 2013-05-20 | 3 | -17/+12

  The update of the vdso data happens under xtime_lock, so adding a nested lock is pointless. Just use a seqcount to sync the readers.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
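  The writer/reader split with a bare seqcount looks roughly like this (a sketch with made-up field names, not the vdso data layout): the writer is already serialized by xtime_lock, so only reader retry is needed.

    #include <linux/seqlock.h>

    struct gtod_data_sketch {
            seqcount_t seq;
            u64        cycle_last;
            /* ... */
    };

    static struct gtod_data_sketch gtod;

    /* Writer: already serialized by xtime_lock, a bare seqcount is enough. */
    static void update_vsyscall_sketch(u64 cycle_last)
    {
            write_seqcount_begin(&gtod.seq);
            gtod.cycle_last = cycle_last;
            write_seqcount_end(&gtod.seq);
    }

    /* Reader (vdso side): retry if a writer slipped in. */
    static u64 read_cycle_last_sketch(void)
    {
            unsigned int seq;
            u64 v;

            do {
                    seq = read_seqcount_begin(&gtod.seq);
                    v = gtod.cycle_last;
            } while (read_seqcount_retry(&gtod.seq, seq));

            return v;
    }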
* x86: vdso: Remove bogus locking in update_vsyscall_tz() | Thomas Gleixner | 2013-05-20 | 1 | -5/+0

  Changing the sequence count in update_vsyscall_tz() is completely pointless. The vdso code copies the data unprotected. There is no point to change this as sys_tz is nowhere protected at all. See sys_gettimeofday().

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* powerpc: Allow irq threading | Thomas Gleixner | 2013-05-20 | 1 | -0/+1

  All interrupts which must be non threaded are marked IRQF_NO_THREAD. So it's safe to allow force threaded handlers.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* powerpc: Mark IPI interrupts IRQF_NO_THREAD | Thomas Gleixner | 2013-05-20 | 3 | -6/+7

  IPI handlers cannot be threaded. Remove the obsolete IRQF_DISABLED flag (see commit e58aa3d2) while at it.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* powerpc: wsp: Mark opb cascade handler IRQF_NO_THREAD | Thomas Gleixner | 2013-05-20 | 1 | -1/+2

  Cascade handlers must run in hard interrupt context.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* powerpc: 85xx: Mark cascade irq IRQF_NO_THREAD | Thomas Gleixner | 2013-05-20 | 1 | -1/+1

  Cascade interrupt must run in hard interrupt context.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* arm-enable-interrupts-in-signal-code.patch | Thomas Gleixner | 2013-05-20 | 1 | -0/+3

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* mips-enable-interrupts-in-signal.patch | Thomas Gleixner | 2013-05-20 | 1 | -0/+3

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* x86: hpet: Disable MSI on Lenovo W510 | Thomas Gleixner | 2013-05-20 | 1 | -0/+27

  MSI based per cpu timers lose interrupts when intel_idle() is enabled - independent of the c-state. With idle=poll the problem cannot be observed. We have no idea yet whether this is a W510 specific issue or a general chipset oddity. Blacklist the known problem machine.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
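  Blacklisting a machine by DMI data typically looks like the sketch below; the match strings and function names are assumptions, and hpet_msi_disable is x86's existing flag that is also set by the hpet=nomsi boot option.

    #include <linux/dmi.h>

    static int __init dmi_disable_hpet_msi(const struct dmi_system_id *d)
    {
            pr_info("HPET: disabling MSI timers on %s\n", d->ident);
            hpet_msi_disable = 1;   /* same effect as booting with hpet=nomsi */
            return 1;               /* stop scanning the table */
    }

    static const struct dmi_system_id hpet_msi_blacklist[] __initconst = {
            {
                    .callback = dmi_disable_hpet_msi,
                    .ident = "Lenovo ThinkPad W510",
                    .matches = {
                            DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
                            DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad W510"),
                    },
            },
            { }
    };

    static void __init hpet_dmi_quirks(void)
    {
            dmi_check_system(hpet_msi_blacklist);
    }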
* x86: kprobes: Remove bogus preempt_enable | Thomas Gleixner | 2013-05-20 | 1 | -1/+0

  The CONFIG_PREEMPT=n section of setup_singlestep() contains:

    preempt_enable_no_resched();

  That's bogus, as it is asymmetric - no preempt_disable() - and it just never blew up because preempt_enable_no_resched() is a NOP when CONFIG_PREEMPT=n. Remove it.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* x86: Call idle notifier after irq_enter() | Frederic Weisbecker | 2013-05-20 | 5 | -9/+9

  Interrupts notify the idle exit state before calling irq_enter(). But the notifier code calls rcu_read_lock() and this is not allowed while rcu is in an extended quiescent state. We need to wait for rcu_irq_enter() to be called before doing so, otherwise this results in a grumpy RCU:

    [ 0.099991] WARNING: at include/linux/rcupdate.h:194 __atomic_notifier_call_chain+0xd2/0x110()
    [ 0.099991] Hardware name: AMD690VM-FMH
    [ 0.099991] Modules linked in:
    [ 0.099991] Pid: 0, comm: swapper Not tainted 3.0.0-rc6+ #255
    [ 0.099991] Call Trace:
    [ 0.099991] <IRQ> [<ffffffff81051c8a>] warn_slowpath_common+0x7a/0xb0
    [ 0.099991] [<ffffffff81051cd5>] warn_slowpath_null+0x15/0x20
    [ 0.099991] [<ffffffff817d6fa2>] __atomic_notifier_call_chain+0xd2/0x110
    [ 0.099991] [<ffffffff817d6ff1>] atomic_notifier_call_chain+0x11/0x20
    [ 0.099991] [<ffffffff81001873>] exit_idle+0x43/0x50
    [ 0.099991] [<ffffffff81020439>] smp_apic_timer_interrupt+0x39/0xa0
    [ 0.099991] [<ffffffff817da253>] apic_timer_interrupt+0x13/0x20
    [ 0.099991] <EOI> [<ffffffff8100ae67>] ? default_idle+0xa7/0x350
    [ 0.099991] [<ffffffff8100ae65>] ? default_idle+0xa5/0x350
    [ 0.099991] [<ffffffff8100b19b>] amd_e400_idle+0x8b/0x110
    [ 0.099991] [<ffffffff810cb01f>] ? rcu_enter_nohz+0x8f/0x160
    [ 0.099991] [<ffffffff810019a0>] cpu_idle+0xb0/0x110
    [ 0.099991] [<ffffffff817a7505>] rest_init+0xe5/0x140
    [ 0.099991] [<ffffffff817a7468>] ? rest_init+0x48/0x140
    [ 0.099991] [<ffffffff81cc5ca3>] start_kernel+0x3d1/0x3dc
    [ 0.099991] [<ffffffff81cc5321>] x86_64_start_reservations+0x131/0x135
    [ 0.099991] [<ffffffff81cc5412>] x86_64_start_kernel+0xed/0xf4

  Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
  Link: http://lkml.kernel.org/r/20110929194047.GA10247@linux.vnet.ibm.com
  Cc: Ingo Molnar <mingo@redhat.com>
  Cc: H. Peter Anvin <hpa@zytor.com>
  Cc: Andy Henroid <andrew.d.henroid@intel.com>
  Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
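  The fix is purely an ordering change inside the interrupt entry paths, roughly (a sketch, not the actual diff):

    void smp_apic_timer_interrupt_sketch(struct pt_regs *regs)
    {
            irq_enter();    /* rcu_irq_enter(): RCU is watching again */
            exit_idle();    /* atomic notifier chain may use rcu_read_lock() */

            /* ... handle the local APIC timer ... */

            irq_exit();
    }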
* x86/mm: account for PGDIR_SIZE alignment | jerry.hoemann@hp.com | 2013-05-13 | 1 | -0/+5

  Patch for 3.0-stable. Function find_early_table_space removed upstream.

  Fixes panic in alloc_low_page due to pgt_buf overflow during init_memory_mapping.

  find_early_table_space sizes pgt_buf based upon the size of the memory being mapped, but it does not take into account the alignment of the memory. When the region being mapped spans a 512GB (PGDIR_SIZE) alignment, a panic from alloc_low_pages occurs.

  kernel_physical_mapping_init takes into account PGDIR_SIZE alignment. This causes an extra call to alloc_low_page to be made. This extra call isn't accounted for by find_early_table_space and causes a kernel panic.

  Change is to take into account PGDIR_SIZE alignment in find_early_table_space.

  Signed-off-by: Jerry Hoemann <jerry.hoemann@hp.com>
  Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
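  The arithmetic being corrected can be illustrated as follows (a hedged sketch of the accounting, not the stable patch): if [start, end) crosses a PGDIR_SIZE (512 GB) boundary, kernel_physical_mapping_init() issues an extra alloc_low_page(), so pgt_buf needs extra pages reserved.

    /* Extra early page-table pages needed because the range straddles
     * PGDIR_SIZE-aligned boundaries (illustrative helper). */
    static unsigned long extra_pgdir_pages(unsigned long start, unsigned long end)
    {
            unsigned long first = start >> PGDIR_SHIFT;
            unsigned long last  = (end - 1) >> PGDIR_SHIFT;

            return last - first;    /* one extra page per boundary crossed */
    }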
* powerpc: fix numa distance for form0 device tree | Vaidyanathan Srinivasan | 2013-05-13 | 1 | -1/+1

  commit 7122beeee7bc1757682049780179d7c216dd1c83 upstream.

  The following commit breaks numa distance setup for old powerpc systems that use form0 encoding in device tree:

    commit 41eab6f88f24124df89e38067b3766b7bef06ddb
    powerpc/numa: Use form 1 affinity to setup node distance

  Device tree node /rtas/ibm,associativity-reference-points would index into /cpus/PowerPCxxxx/ibm,associativity based on form0 or form1 encoding detected by the ibm,architecture-vec-5 property.

  All modern systems use form1 and current kernel code is correct. However, on older systems with form0 encoding, the numa distance will get hard coded as LOCAL_DISTANCE for all nodes. This causes a task scheduling anomaly since the scheduler will skip building the numa level domain (topmost domain with all cpus) if all numa distances are the same. (The value of 'level' in sched_init_numa() will remain 0.)

  Prior to the above commit:

    ((from) == (to) ? LOCAL_DISTANCE : REMOTE_DISTANCE)

  Restoring compatible behavior with this patch for old powerpc systems with device tree where numa distance are encoded as form0.

  Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
  Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
  Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* sparc64: Fix race in TLB batch processing. | David S. Miller | 2013-05-13 | 7 | -55/+242

  [ Commits f36391d2790d04993f48da6a45810033a2cdf847 and f0af97070acbad5d6a361f485828223a4faaa0ee upstream. ]

  As reported by Dave Kleikamp, when we emit cross calls to do batched TLB flush processing we have a race because we do not synchronize on the sibling cpus completing the cross call. So meanwhile the TLB batch can be reset (tb->tlb_nr set to zero, etc.) and either flushes are missed or flushes will flush the wrong addresses.

  Fix this by using generic infrastructure to synchronize on the completion of the cross call.

  This first required getting the flush_tlb_pending() call out from switch_to(), which operates with locks held and interrupts disabled. The problem is that smp_call_function_many() cannot be invoked with IRQs disabled and this is explicitly checked for with WARN_ON_ONCE().

  We get the batch processing outside of locked IRQ disabled sections by using some ideas from the powerpc port. Namely, we only batch inside of arch_{enter,leave}_lazy_mmu_mode() calls. If we're not in such a region, we flush TLBs synchronously.

  1) Get rid of xcall_flush_tlb_pending and per-cpu type implementations.

  2) Do TLB batch cross calls instead via:

       smp_call_function_many()
           tlb_pending_func()
               __flush_tlb_pending()

  3) Batch only in lazy mmu sequences:

     a) Add 'active' member to struct tlb_batch
     b) Define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
     c) Set 'active' in arch_enter_lazy_mmu_mode()
     d) Run batch and clear 'active' in arch_leave_lazy_mmu_mode()
     e) Check 'active' in tlb_batch_add_one() and do a synchronous flush if it's clear.

  4) Add infrastructure for synchronous TLB page flushes.

     a) Implement __flush_tlb_page and per-cpu variants, patch as needed.
     b) Likewise for xcall_flush_tlb_page.
     c) Implement smp_flush_tlb_page() to invoke the cross-call.
     d) Wire up global_flush_tlb_page() to the right routine based upon CONFIG_SMP.

  5) It turns out that singleton batches are very common: 2 out of every 3 batch flushes have only a single entry in them. The batch flush waiting is very expensive, both because of the poll on sibling cpu completion, as well as because passing the tlb batch pointer to the sibling cpus invokes a shared memory dereference. Therefore, in flush_tlb_pending(), if there is only one entry in the batch, perform a completely asynchronous global_flush_tlb_page() instead.

  Reported-by: Dave Kleikamp <dave.kleikamp@oracle.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
  Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
  Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
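  Point 3 above boils down to gating batching on an 'active' flag; a hedged sketch of that rule follows (struct and helper names are simplified stand-ins, not the sparc64 sources):

    struct tlb_batch_sketch {
            bool          active;           /* inside a lazy-mmu section? */
            unsigned int  tlb_nr;           /* entries queued so far */
            unsigned long vaddrs[64];
    };

    /* Assumed helpers standing in for the real flush primitives. */
    static void global_flush_tlb_page_sketch(unsigned long vaddr);
    static void flush_tlb_pending_sketch(struct tlb_batch_sketch *tb);

    static void tlb_batch_add_one_sketch(struct tlb_batch_sketch *tb,
                                         unsigned long vaddr)
    {
            if (!tb->active) {
                    /* Not between arch_enter/leave_lazy_mmu_mode():
                     * flush this page synchronously right away. */
                    global_flush_tlb_page_sketch(vaddr);
                    return;
            }

            tb->vaddrs[tb->tlb_nr++] = vaddr;
            if (tb->tlb_nr == ARRAY_SIZE(tb->vaddrs))
                    flush_tlb_pending_sketch(tb);
    }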