* drm: remove preempt_disable() from drm_calc_vbltimestamp_from_scanoutpos()
  Sebastian Andrzej Siewior, 2014-02-26 (1 file changed, -7/+0)

  Luis captured the following:

    BUG: sleeping function called from invalid context at kernel/rtmutex.c:659
    in_atomic(): 1, irqs_disabled(): 0, pid: 517, name: Xorg
    2 locks held by Xorg/517:
     #0:  (&dev->vbl_lock){......}, at: [<ffffffffa0024c60>] drm_vblank_get+0x30/0x2b0 [drm]
     #1:  (&dev->vblank_time_lock){......}, at: [<ffffffffa0024ce1>] drm_vblank_get+0xb1/0x2b0 [drm]
    Preemption disabled at: [<ffffffffa008bc95>] i915_get_vblank_timestamp+0x45/0xa0 [i915]
    CPU: 3 PID: 517 Comm: Xorg Not tainted 3.10.10-rt7+ #5
    Call Trace:
     [<ffffffff8164b790>] dump_stack+0x19/0x1b
     [<ffffffff8107e62f>] __might_sleep+0xff/0x170
     [<ffffffff81651ac4>] rt_spin_lock+0x24/0x60
     [<ffffffffa0084e67>] i915_read32+0x27/0x170 [i915]
     [<ffffffffa008a591>] i915_pipe_enabled+0x31/0x40 [i915]
     [<ffffffffa008a6be>] i915_get_crtc_scanoutpos+0x3e/0x1b0 [i915]
     [<ffffffffa00245d4>] drm_calc_vbltimestamp_from_scanoutpos+0xf4/0x430 [drm]
     [<ffffffffa008bc95>] i915_get_vblank_timestamp+0x45/0xa0 [i915]
     [<ffffffffa0024998>] drm_get_last_vbltimestamp+0x48/0x70 [drm]
     [<ffffffffa0024db5>] drm_vblank_get+0x185/0x2b0 [drm]
     [<ffffffffa0025d03>] drm_wait_vblank+0x83/0x5d0 [drm]
     [<ffffffffa00212a2>] drm_ioctl+0x552/0x6a0 [drm]
     [<ffffffff811a0095>] do_vfs_ioctl+0x325/0x5b0
     [<ffffffff811a03a1>] SyS_ioctl+0x81/0xa0
     [<ffffffff8165a342>] tracesys+0xdd/0xe2

  After a longer thread it was decided to drop the preempt_disable()/enable()
  invocations, which were meant for -RT; Mario Kleiner is looking for a
  replacement.

  Cc: stable-rt@vger.kernel.org
  Reported-by: Luis Claudio R. Goncalves <lclaudio@uudg.org>
  Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* mm/memcontrol: Don't call schedule_work_on in preemption disabled context
  Yang Shi, 2014-02-26 (1 file changed, -2/+2)

  The following trace is triggered when running the LTP oom test cases:

    BUG: sleeping function called from invalid context at kernel/rtmutex.c:659
    in_atomic(): 1, irqs_disabled(): 0, pid: 17188, name: oom03
    Preemption disabled at: [<ffffffff8112ba70>] mem_cgroup_reclaim+0x90/0xe0
    CPU: 2 PID: 17188 Comm: oom03 Not tainted 3.10.10-rt3 #2
    Hardware name: Intel Corporation Calpella platform/MATXM-CORE-411-B, BIOS 4.6.3 08/18/2010
    ffff88007684d730 ffff880070df9b58 ffffffff8169918d ffff880070df9b70
    ffffffff8106db31 ffff88007688b4a0 ffff880070df9b88 ffffffff8169d9c0
    ffff88007688b4a0 ffff880070df9bc8 ffffffff81059da1 0000000170df9bb0
    Call Trace:
     [<ffffffff8169918d>] dump_stack+0x19/0x1b
     [<ffffffff8106db31>] __might_sleep+0xf1/0x170
     [<ffffffff8169d9c0>] rt_spin_lock+0x20/0x50
     [<ffffffff81059da1>] queue_work_on+0x61/0x100
     [<ffffffff8112b361>] drain_all_stock+0xe1/0x1c0
     [<ffffffff8112ba70>] mem_cgroup_reclaim+0x90/0xe0
     [<ffffffff8112beda>] __mem_cgroup_try_charge+0x41a/0xc40
     [<ffffffff810f1c91>] ? release_pages+0x1b1/0x1f0
     [<ffffffff8106f200>] ? sched_exec+0x40/0xb0
     [<ffffffff8112cc87>] mem_cgroup_charge_common+0x37/0x70
     [<ffffffff8112e2c6>] mem_cgroup_newpage_charge+0x26/0x30
     [<ffffffff8110af68>] handle_pte_fault+0x618/0x840
     [<ffffffff8103ecf6>] ? unpin_current_cpu+0x16/0x70
     [<ffffffff81070f94>] ? migrate_enable+0xd4/0x200
     [<ffffffff8110cde5>] handle_mm_fault+0x145/0x1e0
     [<ffffffff810301e1>] __do_page_fault+0x1a1/0x4c0
     [<ffffffff8169c9eb>] ? preempt_schedule_irq+0x4b/0x70
     [<ffffffff8169e3b7>] ? retint_kernel+0x37/0x40
     [<ffffffff8103053e>] do_page_fault+0xe/0x10
     [<ffffffff8169e4c2>] page_fault+0x22/0x30

  So, to prevent schedule_work_on() from being called in preempt-disabled
  context, replace the get/put_cpu() pair with get/put_cpu_light().

  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Yang Shi <yang.shi@windriver.com>
  Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
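  A minimal sketch of the pattern this commit describes, not the exact
  mm/memcontrol.c diff: get_cpu_light()/put_cpu_light() are RT-patch helpers
  built on migrate_disable(), and drain_work here is a hypothetical per-CPU
  work item assumed to be initialized elsewhere.

    #include <linux/percpu.h>
    #include <linux/cpumask.h>
    #include <linux/workqueue.h>

    static DEFINE_PER_CPU(struct work_struct, drain_work);

    static void drain_all_stock_light(void)
    {
            int cpu, curcpu;

            /*
             * get_cpu_light() pins the task with migrate_disable() instead
             * of preempt_disable(), so the schedule_work_on() below may
             * legitimately sleep on the RT-converted workqueue pool lock.
             */
            curcpu = get_cpu_light();
            for_each_online_cpu(cpu) {
                    if (cpu != curcpu)
                            schedule_work_on(cpu, &per_cpu(drain_work, cpu));
            }
            put_cpu_light();
    }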
* hwlat-detector: Don't ignore threshold module parameter
  Mike Galbraith, 2014-02-26 (1 file changed, -1/+1)

  If the user specified a threshold at module load time, use it.

  Cc: stable-rt@vger.kernel.org
  Acked-by: Steven Rostedt <rostedt@goodmis.org>
  Signed-off-by: Mike Galbraith <bitbucket@online.de>
  Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* genirq: Set the irq thread policy without checking CAP_SYS_NICE
  Thomas Pfaff, 2014-02-26 (1 file changed, -1/+1)

  In commit ee23871389 ("genirq: Set irq thread to RT priority on creation")
  we moved the assignment of the thread's priority from the thread's function
  into __setup_irq(). That function may run in user context, for instance if
  the user opens a UART node and the driver calls request_irq() in its
  ->open() callback. That user may not have CAP_SYS_NICE, in which case the
  priority assignment fails and the irq thread is left running with the
  SCHED_OTHER policy.

  This patch uses sched_setscheduler_nocheck() so we omit the CAP_SYS_NICE
  check which is otherwise required for the SCHED_FIFO policy.

  Cc: Ivo Sieben <meltedpianoman@gmail.com>
  Cc: stable@vger.kernel.org
  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Thomas Pfaff <tpfaff@pcs.com>
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  [bigeasy: rewrite the changelog]
  Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
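  For reference, a sketch of what the priority setup in __setup_irq() looks
  like with this change; the wrapper function is hypothetical and only groups
  the relevant lines, it is not the exact diff.

    #include <linux/sched.h>
    #include <linux/sched/rt.h>   /* MAX_USER_RT_PRIO on newer kernels */

    static void irq_thread_set_policy(struct task_struct *t)
    {
            static const struct sched_param param = {
                    .sched_priority = MAX_USER_RT_PRIO / 2,
            };

            /*
             * sched_setscheduler_nocheck() skips the CAP_SYS_NICE /
             * RLIMIT_RTPRIO permission check, so this also works when
             * __setup_irq() runs on behalf of an unprivileged user
             * (e.g. via a driver's ->open() calling request_irq()).
             */
            sched_setscheduler_nocheck(t, SCHED_FIFO, &param);
    }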
* genirq: do not invoke the affinity callback via a workqueue
  Sebastian Andrzej Siewior, 2014-02-26 (2 files changed, -3/+77)

  Joe Korty reported that __irq_set_affinity_locked() schedules a workqueue
  while holding a raw lock, which results in a might_sleep() warning.
  This patch moves the invocation into a process context so that we only do a
  wakeup() while holding the lock.

  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
* hwlat-detector: Use thread instead of stop machine
  Steven Rostedt, 2014-02-26 (1 file changed, -34/+25)

  There's no reason to use stop machine to search for hardware latency.
  Simply disabling interrupts while running the loop is enough to check
  whether something comes in that wasn't caused by interrupts being off,
  which is exactly what stop machine does anyway.

  Instead of using stop machine, just have the thread disable interrupts
  while it checks for hardware latency.

  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
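  A rough sketch of the resulting sampling thread; get_sample() and
  sample_interval_ms are assumed helpers (the real module's names and knobs
  differ), and the point is only the irq-off window replacing stop_machine().

    #include <linux/kthread.h>
    #include <linux/irqflags.h>
    #include <linux/sched.h>
    #include <linux/jiffies.h>

    static void get_sample(void);          /* assumed: spins reading timestamps */
    static unsigned int sample_interval_ms = 1000;

    static int hwlat_kthread_fn(void *unused)
    {
            while (!kthread_should_stop()) {
                    /*
                     * Interrupts are off only around the sampling window, so
                     * any gap seen between consecutive timestamps must come
                     * from the hardware (SMI etc.), not from normal
                     * interrupt processing.
                     */
                    local_irq_disable();
                    get_sample();
                    local_irq_enable();

                    schedule_timeout_interruptible(
                            msecs_to_jiffies(sample_interval_ms));
            }
            return 0;
    }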
* hwlat-detector: Use trace_clock_local if available
  Steven Rostedt, 2014-02-26 (1 file changed, -9/+25)

  As ktime_get() calls into the timing code, which does a read_seq(), it may
  be affected by other CPUs that touch that lock. To remove this dependency,
  use trace_clock_local(), which is already exported for module use. If
  CONFIG_TRACING is enabled, use that as the clock, otherwise use ktime_get().

  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Steven Rostedt <srostedt@redhat.com>
  Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
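  One way to express that clock selection is a small macro layer, roughly
  along the lines below; the macro names are assumptions for illustration,
  not necessarily the exact ones used in the module.

    #include <linux/ktime.h>
    #include <linux/math64.h>
    #ifdef CONFIG_TRACING
    #include <linux/trace_clock.h>

    #define time_type       u64
    #define time_get()      trace_clock_local()   /* lockless, ns resolution */
    #define time_to_us(x)   div_u64(x, 1000)
    #define time_sub(a, b)  ((a) - (b))
    #else
    #define time_type       ktime_t
    #define time_get()      ktime_get()   /* may contend on the timekeeping seqlock */
    #define time_to_us(x)   ktime_to_us(x)
    #define time_sub(a, b)  ktime_sub(a, b)
    #endif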
* hwlat-detect/trace: Export trace_clock_local for hwlat-detector
  Steven Rostedt (Red Hat), 2014-02-26 (1 file changed, -0/+1)

  The hwlat-detector needs a better clock than just ktime_get(), as that can
  induce its own latencies. The trace clock is perfect for it, but it needs to
  be exported for use by modules.

  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* hwlat-detector: Update hwlat_detector to add outer loop detection
  Steven Rostedt, 2014-02-26 (1 file changed, -6/+26)

  The hwlat_detector reads two timestamps in a row, then reports any gap
  between those calls. The problem is, it misses everything between the second
  reading of the time stamp and the first reading of the time stamp in the
  next loop. That's where most of the time is spent, which means chances are
  likely that it will miss all hardware latencies. This defeats the purpose.

  By also testing the first time stamp of the current loop against the second
  time stamp of the previous loop (the outer loop), we are more likely to find
  a latency.

  Setting the threshold to 1, here's what the report now looks like:

    1347415723.0232202770   0   2
    1347415725.0234202822   0   2
    1347415727.0236202875   0   2
    1347415729.0238202928   0   2
    1347415731.0240202980   0   2
    1347415734.0243203061   0   2
    1347415736.0245203113   0   2
    1347415738.0247203166   2   0
    1347415740.0249203219   0   3
    1347415742.0251203272   0   3
    1347415743.0252203299   0   3
    1347415745.0254203351   0   2
    1347415747.0256203404   0   2
    1347415749.0258203457   0   2
    1347415751.0260203510   0   2
    1347415754.0263203589   0   2
    1347415756.0265203642   0   2
    1347415758.0267203695   0   2
    1347415760.0269203748   0   2
    1347415762.0271203801   0   2
    1347415764.0273203853   2   0

  There's some hardware latency that takes 2 microseconds to run.

  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Steven Rostedt <srostedt@redhat.com>
  Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
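  A sketch of the inner/outer comparison the commit describes; last_t2,
  threshold and record_latency() are assumed names for illustration, and the
  clock is taken from trace_clock_local() as in the preceding commit.

    #include <linux/types.h>
    #include <linux/trace_clock.h>

    static u64 last_t2;            /* second timestamp of the previous pass */
    static u64 threshold = 1;      /* microsecond threshold, assumed */
    static void record_latency(u64 ns);   /* assumed reporting helper */

    static void get_sample(void)
    {
            u64 t1, t2;

            t1 = trace_clock_local();
            if (last_t2) {
                    /* outer loop: gap between the previous pass and this one */
                    u64 outer_diff = t1 - last_t2;

                    if (outer_diff > threshold)
                            record_latency(outer_diff);
            }

            t2 = trace_clock_local();
            last_t2 = t2;

            /* inner loop: the classic back-to-back reading */
            if (t2 - t1 > threshold)
                    record_latency(t2 - t1);
    }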
* rt,ntp: Move call to schedule_delayed_work() to helper thread
  Steven Rostedt, 2014-02-26 (1 file changed, -0/+40)

  The ntp code for notify_cmos_timer() is called from a hard interrupt
  context. schedule_delayed_work() under PREEMPT_RT_FULL takes spinlocks that
  have been converted to mutexes, thus calling schedule_delayed_work() from
  interrupt is not safe.

  Add a helper thread that does the call to schedule_delayed_work() and wake
  up that thread instead of calling schedule_delayed_work() directly. This is
  only for CONFIG_PREEMPT_RT_FULL; otherwise the code still calls
  schedule_delayed_work() directly in irq context.

  Note: There are a few places in the kernel that do this. Perhaps the RT code
  should have a dedicated thread that does the checks. Just register a
  notifier on boot up for your check and wake up the thread when needed. This
  will be a todo.

  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
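  A sketch of the helper-thread pattern being described, assuming the existing
  sync_cmos_work delayed work in kernel/time/ntp.c; the thread and flag names
  are illustrative, not the exact patch.

    #include <linux/kthread.h>
    #include <linux/workqueue.h>
    #include <linux/sched.h>

    static void sync_cmos_clock(struct work_struct *work);  /* as in ntp.c */
    static DECLARE_DELAYED_WORK(sync_cmos_work, sync_cmos_clock);

    static struct task_struct *cmos_delay_thread;
    static bool do_cmos_delay;

    static int run_cmos_delay(void *ignore)
    {
            while (!kthread_should_stop()) {
                    set_current_state(TASK_INTERRUPTIBLE);
                    if (do_cmos_delay) {
                            do_cmos_delay = false;
                            /* schedulable context: sleeping locks are fine */
                            schedule_delayed_work(&sync_cmos_work, 0);
                    }
                    schedule();
            }
            __set_current_state(TASK_RUNNING);
            return 0;
    }

    void notify_cmos_timer(void)
    {
    #ifdef CONFIG_PREEMPT_RT_FULL
            do_cmos_delay = true;
            smp_wmb();                    /* flag visible before the wakeup */
            wake_up_process(cmos_delay_thread);   /* raw-lock based, irq safe */
    #else
            schedule_delayed_work(&sync_cmos_work, 0);
    #endif
    }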
* drm/i915: drop trace_i915_gem_ring_dispatch on rt
  Sebastian Andrzej Siewior, 2014-02-26 (1 file changed, -0/+2)

  This tracepoint is responsible for:

    [<814cc358>] __schedule_bug+0x4d/0x59
    [<814d24cc>] __schedule+0x88c/0x930
    [<814d3b90>] ? _raw_spin_unlock_irqrestore+0x40/0x50
    [<814d3b95>] ? _raw_spin_unlock_irqrestore+0x45/0x50
    [<810b57b5>] ? task_blocks_on_rt_mutex+0x1f5/0x250
    [<814d27d9>] schedule+0x29/0x70
    [<814d3423>] rt_spin_lock_slowlock+0x15b/0x278
    [<814d3786>] rt_spin_lock+0x26/0x30
    [<a00dced9>] gen6_gt_force_wake_get+0x29/0x60 [i915]
    [<a00e183f>] gen6_ring_get_irq+0x5f/0x100 [i915]
    [<a00b2a33>] ftrace_raw_event_i915_gem_ring_dispatch+0xe3/0x100 [i915]
    [<a00ac1b3>] i915_gem_do_execbuffer.isra.13+0xbd3/0x1430 [i915]
    [<810f8943>] ? trace_buffer_unlock_commit+0x43/0x60
    [<8113e8d2>] ? ftrace_raw_event_kmem_alloc+0xd2/0x180
    [<8101d063>] ? native_sched_clock+0x13/0x80
    [<a00acf29>] i915_gem_execbuffer2+0x99/0x280 [i915]
    [<a00114a3>] drm_ioctl+0x4c3/0x570 [drm]
    [<8101d0d9>] ? sched_clock+0x9/0x10
    [<a00ace90>] ? i915_gem_execbuffer+0x480/0x480 [i915]
    [<810f1c18>] ? rb_commit+0x68/0xa0
    [<810f1c6c>] ? ring_buffer_unlock_commit+0x1c/0xa0
    [<81197467>] do_vfs_ioctl+0x97/0x540
    [<81021318>] ? ftrace_raw_event_sys_enter+0xd8/0x130
    [<811979a1>] sys_ioctl+0x91/0xb0
    [<814db931>] tracesys+0xe1/0xe6

  Chris Wilson does not like the idea of moving i915_trace_irq_get() out of
  the macro:

    No. This enables the IRQ, as well as making a number of very expensively
    serialised reads, unconditionally.

  so it is gone now on RT.

  Cc: stable-rt@vger.kernel.org
  Reported-by: Joakim Hernberg <jbh@alchemy.lu>
  Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
* kernel/hotplug: restore original cpu mask on cpu/down
  Sebastian Andrzej Siewior, 2014-02-26 (1 file changed, -1/+12)

  If a task which is allowed to run only on CPU X puts CPU Y down, then it
  will be allowed on all CPUs but CPU Y after it comes back from the kernel.
  This patch ensures that we don't lose the initial setting unless the CPU the
  task is running on is going down.

  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
* kernel/cpu: fix cpu down problem if kthread's cpu is going down
  Sebastian Andrzej Siewior, 2014-02-26 (1 file changed, -2/+14)

  If a kthread is pinned to CPUx and CPUx is going down then we get into
  trouble:
  - first the unplug thread is created
  - it will set itself to hp->unplug. As a result, every task that is going to
    take a lock has to leave the CPU.
  - the CPU_DOWN_PREPARE notifiers are started. The worker thread will start a
    new process for the "high priority worker". Now the kthread would like to
    take a lock but since it can't leave the CPU it will never complete its
    task.

  We could fire the unplug thread after the notifiers, but then the CPU is no
  longer marked "online" and the unplug thread will run on CPU0, which was
  fixed before :)

  So instead the unplug thread is started and kept waiting until the notifiers
  complete their work.

  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
* timers: prepare for full preemption improve
  Zhao Hongjiang, 2014-02-26 (1 file changed, -2/+6)

  wake_up() should do nothing in the !RT (nort) case, so use
  wakeup_timer_waiters(); also fix a spelling mistake.

  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
  [bigeasy: s/CONFIG_PREEMPT_RT_BASE/CONFIG_PREEMPT_RT_FULL/]
  Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
* list_bl.h: fix it for !SMP && !DEBUG_SPINLOCK
  Uwe Kleine-König, 2014-02-26 (1 file changed, -0/+4)

  The patch "list_bl.h: make list head locking RT safe" introduced an
  unconditional

    __set_bit(0, (unsigned long *)b);

  in void hlist_bl_lock(struct hlist_bl_head *b). This clobbers the value of
  b->first. When the value of b->first is retrieved using hlist_bl_first, the
  clobbering is undone using

    (unsigned long)h->first & ~LIST_BL_LOCKMASK

  and so depends on LIST_BL_LOCKMASK being one. But LIST_BL_LOCKMASK is only
  one if at least one of CONFIG_SMP and CONFIG_DEBUG_SPINLOCK is defined.
  Without these, the value returned by hlist_bl_first has the zeroth bit set,
  which likely results in a crash.

  So only do the clobbering in the cases where LIST_BL_LOCKMASK is one. An
  alternative would be to always define LIST_BL_LOCKMASK to one with
  CONFIG_PREEMPT_RT_BASE.

  Cc: stable-rt@vger.kernel.org
  Acked-by: Paul Gortmaker <paul.gortmaker@windriver.com>
  Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
  Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
  Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
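  A sketch of what the RT variants of the lock helpers look like with this
  fix, assuming the RT patch adds a raw spinlock member to struct
  hlist_bl_head; this is illustrative, not a verbatim copy of list_bl.h.

    /* include/linux/list_bl.h, RT variant (sketch) */
    static inline void hlist_bl_lock(struct hlist_bl_head *b)
    {
            raw_spin_lock(&b->lock);
    #if LIST_BL_LOCKMASK
            /* keep the LIST_DEBUG sanity checks happy: mark head as locked */
            __set_bit(0, (unsigned long *)b);
    #endif
    }

    static inline void hlist_bl_unlock(struct hlist_bl_head *b)
    {
    #if LIST_BL_LOCKMASK
            __clear_bit(0, (unsigned long *)b);
    #endif
            raw_spin_unlock(&b->lock);
    }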
* list_bl.h: make list head locking RT safe
  Paul Gortmaker, 2014-02-26 (1 file changed, -2/+22)

  As per the changes in include/linux/jbd_common.h for avoiding the
  bit_spin_locks on RT ("fs: jbd/jbd2: Make state lock and journal head lock
  rt safe"), we do the same thing here.

  We use the non-atomic __set_bit and __clear_bit inside the scope of the lock
  to preserve the ability of the existing LIST_DEBUG code to use the zero'th
  bit in the sanity checks.

  As a bit spinlock, we had no lockdep visibility into the usage of the list
  head locking. Now, if we were to implement it as a standard non-raw
  spinlock, we would see:

    BUG: sleeping function called from invalid context at kernel/rtmutex.c:658
    in_atomic(): 1, irqs_disabled(): 0, pid: 122, name: udevd
    5 locks held by udevd/122:
     #0:  (&sb->s_type->i_mutex_key#7/1){+.+.+.}, at: [<ffffffff811967e8>] lock_rename+0xe8/0xf0
     #1:  (rename_lock){+.+...}, at: [<ffffffff811a277c>] d_move+0x2c/0x60
     #2:  (&dentry->d_lock){+.+...}, at: [<ffffffff811a0763>] dentry_lock_for_move+0xf3/0x130
     #3:  (&dentry->d_lock/2){+.+...}, at: [<ffffffff811a0734>] dentry_lock_for_move+0xc4/0x130
     #4:  (&dentry->d_lock/3){+.+...}, at: [<ffffffff811a0747>] dentry_lock_for_move+0xd7/0x130
    Pid: 122, comm: udevd Not tainted 3.4.47-rt62 #7
    Call Trace:
     [<ffffffff810b9624>] __might_sleep+0x134/0x1f0
     [<ffffffff817a24d4>] rt_spin_lock+0x24/0x60
     [<ffffffff811a0c4c>] __d_shrink+0x5c/0xa0
     [<ffffffff811a1b2d>] __d_drop+0x1d/0x40
     [<ffffffff811a24be>] __d_move+0x8e/0x320
     [<ffffffff811a278e>] d_move+0x3e/0x60
     [<ffffffff81199598>] vfs_rename+0x198/0x4c0
     [<ffffffff8119b093>] sys_renameat+0x213/0x240
     [<ffffffff817a2de5>] ? _raw_spin_unlock+0x35/0x60
     [<ffffffff8107781c>] ? do_page_fault+0x1ec/0x4b0
     [<ffffffff817a32ca>] ? retint_swapgs+0xe/0x13
     [<ffffffff813eb0e6>] ? trace_hardirqs_on_thunk+0x3a/0x3f
     [<ffffffff8119b0db>] sys_rename+0x1b/0x20
     [<ffffffff817a3b96>] system_call_fastpath+0x1a/0x1f

  Since we are only taking the lock during short-lived list operations, let's
  assume for now that it being raw won't be a significant latency concern.

  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
  Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
* genirq: Set irq thread to RT priority on creation
  Ivo Sieben, 2014-02-26 (1 file changed, -4/+6)

  When a threaded irq handler is installed, the irq thread is initially
  created on normal scheduling priority. Only after the irq thread is woken up
  does it set its own priority to SCHED_FIFO, MAX_USER_RT_PRIO/2. This means
  that interrupts which occur directly after the irq handler is installed will
  be handled on a normal scheduling priority instead of the realtime priority
  that one would expect.

  Fix this by setting the RT priority on creation of the irq_thread.

  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Ivo Sieben <meltedpianoman@gmail.com>
  Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
  Cc: Steven Rostedt <rostedt@goodmis.org>
  Link: http://lkml.kernel.org/r/1370254322-17240-1-git-send-email-meltedpianoman@gmail.com
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
* x86/mce: fix mce timer interval
  Mike Galbraith, 2014-02-26 (1 file changed, -2/+2)

  Seems the mce timer fires at the wrong frequency in -rt kernels since
  roughly forever due to a 32 bit overflow. 3.8-rt is also missing a
  multiplier. Add the missing us -> ns conversion and the 32 bit overflow
  prevention.

  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Mike Galbraith <bitbucket@online.de>
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  [bigeasy: use ULL instead of u64 cast]
  Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
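  The gist of the fix is an overflow-safe us -> ns conversion when arming the
  hrtimer-based MCE timer used on RT; a sketch under the assumption that the
  interval is kept in jiffies (the helper name is hypothetical).

    #include <linux/hrtimer.h>
    #include <linux/jiffies.h>

    static void mce_timer_restart(struct hrtimer *timer, unsigned long iv_jiffies)
    {
            /*
             * jiffies -> us -> ns. The 1000ULL constant forces 64-bit math,
             * so the multiplication cannot overflow a 32-bit unsigned long.
             */
            u64 ns = jiffies_to_usecs(iv_jiffies) * 1000ULL;

            hrtimer_start(timer, ns_to_ktime(ns), HRTIMER_MODE_REL_PINNED);
    }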
* sched/workqueue: Only wake up idle workers if not blocked on sleeping spin lock
  Steven Rostedt, 2014-02-26 (1 file changed, -1/+3)

  In -rt, most spin_locks() turn into mutexes. One of these spin_lock
  conversions is performed on the workqueue gcwq->lock. When the idle worker
  is woken, the first thing it will do is grab that same lock and it too will
  block, possibly jumping into the same code; but because nr_running would
  already be decremented, it prevents an infinite loop.

  But this is still a waste of CPU cycles, and it doesn't follow the method of
  mainline, as new workers should only be woken when a worker thread is truly
  going to sleep, and not just blocked on a spin_lock().

  Check the saved_state too before waking up new workers.

  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
* swap: Use unique local lock name for swap_lock
  Steven Rostedt, 2014-02-26 (1 file changed, -10/+10)

  From lib/Kconfig.debug on CONFIG_FORCE_WEAK_PER_CPU:

  ----
  s390 and alpha require percpu variables in modules to be defined weak to
  work around addressing range issue which puts the following two restrictions
  on percpu variable definitions.

  1. percpu symbols must be unique whether static or not
  2. percpu variables can't be defined inside a function

  To ensure that generic code follows the above rules, this option forces all
  percpu variables to be defined as weak.
  ----

  The addition of the local IRQ swap_lock in mm/swap.c broke this config, as
  the name "swap_lock" is used throughout the kernel. Just do a
  "git grep swap_lock" to see, and the new swap_lock is a local lock which
  defines the swap_lock for per_cpu.

  The fix was to rename swap_lock to swapvec_lock, which keeps it unique.

  Reported-by: Mike Galbraith <bitbucket@online.de>
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
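  For illustration only: roughly what a uniquely named per-CPU local lock
  looks like. DEFINE_LOCAL_IRQ_LOCK()/local_lock_irq() are RT-patch
  primitives (not mainline APIs), and the surrounding function and pagevec
  are hypothetical stand-ins for the mm/swap.c users.

    #include <linux/locallock.h>   /* RT-patch header */
    #include <linux/percpu.h>
    #include <linux/pagevec.h>

    /* unique name, so the CONFIG_FORCE_WEAK_PER_CPU rule is satisfied */
    static DEFINE_LOCAL_IRQ_LOCK(swapvec_lock);
    static DEFINE_PER_CPU(struct pagevec, example_pvec);

    static void example_add_to_pagevec(struct page *page)
    {
            struct pagevec *pvec;

            local_lock_irq(swapvec_lock);   /* per-CPU lock, preemptible on RT */
            pvec = this_cpu_ptr(&example_pvec);
            pagevec_add(pvec, page);
            local_unlock_irq(swapvec_lock);
    }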
* x86/mce: Defer mce wakeups to threads for PREEMPT_RT
  Steven Rostedt, 2014-02-26 (1 file changed, -17/+61)

  We had a customer report a lockup on a 3.0-rt kernel that had the following
  backtrace:

    [ffff88107fca3e80] rt_spin_lock_slowlock at ffffffff81499113
    [ffff88107fca3f40] rt_spin_lock at ffffffff81499a56
    [ffff88107fca3f50] __wake_up at ffffffff81043379
    [ffff88107fca3f80] mce_notify_irq at ffffffff81017328
    [ffff88107fca3f90] intel_threshold_interrupt at ffffffff81019508
    [ffff88107fca3fa0] smp_threshold_interrupt at ffffffff81019fc1
    [ffff88107fca3fb0] threshold_interrupt at ffffffff814a1853

  It actually bugged because the lock was taken by the same owner that already
  had that lock. What happened was the thread that was setting itself on a
  wait queue had the lock when an MCE triggered. The MCE interrupt does a wake
  up on its wait list and grabs the same lock.

  NOTE: THIS IS NOT A BUG ON MAINLINE

  Sorry for yelling, but as I Cc'd mainline maintainers I want them to know
  that this is a PREEMPT_RT bug only. I only Cc'd them for advice.

  On PREEMPT_RT the wait queue locks are converted from normal "spin_locks"
  into an rt_mutex (see the rt_spin_lock_slowlock above). These are not to be
  taken by hard interrupt context. This usually isn't a problem as most all
  interrupts in PREEMPT_RT are converted into schedulable threads.
  Unfortunately that's not the case with the MCE irq.

  As wait queue locks are notorious for long hold times, we can not convert
  them to raw_spin_locks without causing issues with -rt. But Thomas has
  created a "simple-wait" structure that uses raw spin locks which may have
  been a good fit.

  Unfortunately, wait queues are not the only issue, as the mce_notify_irq
  also does a schedule_work(), which grabs the workqueue spin locks that have
  the exact same issue.

  Thus, this patch I'm proposing is to move the actual work of the MCE
  interrupt into a helper thread that gets woken up on the MCE interrupt and
  does the work in a schedulable context.

  NOTE: THIS PATCH ONLY CHANGES THE BEHAVIOR WHEN PREEMPT_RT IS SET

  Oops, sorry for yelling again, but I want to stress that I keep the same
  behavior of mainline when PREEMPT_RT is not set. Thus, this only changes the
  MCE behavior when PREEMPT_RT is configured.

  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  [bigeasy@linutronix: make mce_notify_work() a proper prototype, use kthread_run()]
  Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
* rcutiny: Use simple waitqueue
  Thomas Gleixner, 2014-02-26 (1 file changed, -8/+9)

  Simple waitqueues can be handled from interrupt disabled contexts.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* wait-simple: Simple waitqueue implementation
  Thomas Gleixner, 2014-02-26 (3 files changed, -1/+292)

  wait_queue is a swiss army knife and in most of the cases the complexity is
  not needed. For RT waitqueues are a constant source of trouble as we can't
  convert the head lock to a raw spinlock due to fancy and long lasting
  callbacks.

  Provide a slim version, which allows RT to replace wait queues. This should
  go mainline as well, as it lowers memory consumption and runtime overhead.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  [backport by: Tim Sander <tim.sander@hbm.com>]
* serial: Imx: Fix recursive locking bug
  Thomas Gleixner, 2014-02-26 (1 file changed, -2/+9)

  commit 9ec1882df2 (tty: serial: imx: console write routing is unsafe on SMP)
  introduced a recursive locking bug in imx_console_write().

  The callchain is:

    imx_rxint()
      spin_lock_irqsave(&sport->port.lock, flags);
      ...
      uart_handle_sysrq_char();
        sysrq_function();
          printk();
            imx_console_write();
              spin_lock_irqsave(&sport->port.lock, flags); <--- DEAD

  The bad news is that the kernel debugging facilities can detect the problem,
  but the printks never surface on the serial console for obvious reasons.

  There is a similar issue with oops_in_progress. If the kernel crashes we
  really don't want to be stuck on the lock and unable to tell what happened.
  In general most UP originated drivers miss these checks and nobody ever
  notices because CONFIG_PROVE_LOCKING seems to be still ignored by a large
  number of developers.

  The solution is to avoid locking in the sysrq case and trylock in the
  oops_in_progress case. This scheme is used in other drivers as well and it
  would be nice if we could move this to a common place, so the usual
  copy/paste/modify bugs can be avoided.

  Now there is another issue with this scheme:

    CPU0                                  CPU1
                                          printk()
    rxint()
      sysrq_detection() -> sets port->sysrq
      return from interrupt
                                          console_write()
                                            if (port->sysrq)
                                              avoid locking

  port->sysrq is reset with the next receive character. So as long as
  port->sysrq is not reset, and this can take an endless amount of time if
  after the break no further receive character follows, all console writes
  happen unlocked.

  While the current writer is protected against other console writers by the
  console sem, it's unprotected against open/close or other operations which
  fiddle with the port. That's what the above mentioned commit tried to solve.

  That's an issue in all drivers which use that scheme and unfortunately there
  is no easy workaround. The only solution is to have a separate indicator
  port->sysrq_cpu. uart_handle_sysrq_char() then sets it to smp_processor_id()
  before calling into handle_sysrq() and resets it to -1 after that. Then
  change the locking check to:

    if (port->sysrq_cpu == smp_processor_id())
            locked = 0;
    else if (oops_in_progress)
            locked = spin_trylock_irqsave(port->lock, flags);
    else
            spin_lock_irqsave(port->lock, flags);

  That would force all other cpus into the spin_lock path. Problem solved, but
  that's way beyond the scope of this fix and really wants to be implemented
  in a common function which calls the uart specific write function to avoid
  another gazillion of hard to debug copy/paste/modify bugs.

  Reported-and-tested-by: Tim Sander <tim@krieglstein.org>
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  Cc: Jiri Slaby <jslaby@suse.cz>
  Cc: Xinyu Chen <xinyu.chen@freescale.com>
  Cc: Dirk Behme <dirk.behme@de.bosch.com>
  Cc: Shawn Guo <shawn.guo@linaro.org>
  Cc: Tim Sander <tim@krieglstein.org>
  Cc: Sascha Hauer <s.hauer@pengutronix.de>
  Cc: stable-rt@vger.kernel.org
  Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1302142006050.22263@ionos
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
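  The sysrq/oops_in_progress locking scheme described above, as it commonly
  appears in a console ->write() callback; this is a generic sketch, not the
  exact imx.c diff, and the putchar callback is whatever the driver uses.

    #include <linux/serial_core.h>
    #include <linux/console.h>

    static void console_write_locked(struct uart_port *port, const char *s,
                                     unsigned int count,
                                     void (*putchar)(struct uart_port *, int))
    {
            unsigned long flags;
            int locked = 1;

            if (port->sysrq)
                    locked = 0;   /* printk from the sysrq path: lock already held */
            else if (oops_in_progress)
                    locked = spin_trylock_irqsave(&port->lock, flags);
            else
                    spin_lock_irqsave(&port->lock, flags);

            uart_console_write(port, s, count, putchar);

            if (locked)
                    spin_unlock_irqrestore(&port->lock, flags);
    }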
* printk: Fix rq->lock vs logbuf_lock unlock lock inversion
  Bu, Yitian, 2014-02-26 (1 file changed, -1/+1)

  commit 07354eb1a74d1 ("locking printk: Annotate logbuf_lock as raw")
  reintroduced a lock inversion problem which was fixed in commit 0b5e1c5255
  ("printk: Release console_sem after logbuf_lock"). This happened probably
  when fixing up patch rejects.

  Restore the ordering and unlock logbuf_lock before releasing console_sem.

  Signed-off-by: ybu <ybu@qti.qualcomm.com>
  Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
  Cc: stable@vger.kernel.org
  Cc: stable-rt@vger.kernel.org
  Link: http://lkml.kernel.org/r/E807E903FE6CBE4D95E420FBFCC273B827413C@nasanexd01h.na.qualcomm.com
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* acpi/rt: Convert acpi_gbl_hardware lock back to a raw_spinlock_t
  Steven Rostedt, 2014-02-26 (5 files changed, -7/+21)

  We hit the following bug with 3.6-rt:

    [    5.898990] BUG: scheduling while atomic: swapper/3/0/0x00000002
    [    5.898991] no locks held by swapper/3/0.
    [    5.898993] Modules linked in:
    [    5.898996] Pid: 0, comm: swapper/3 Not tainted 3.6.11-rt28.19.el6rt.x86_64.debug #1
    [    5.898997] Call Trace:
    [    5.899011]  [<ffffffff810804e7>] __schedule_bug+0x67/0x90
    [    5.899028]  [<ffffffff81577923>] __schedule+0x793/0x7a0
    [    5.899032]  [<ffffffff810b4e40>] ? debug_rt_mutex_print_deadlock+0x50/0x200
    [    5.899034]  [<ffffffff81577b89>] schedule+0x29/0x70
    [    5.899036] BUG: scheduling while atomic: swapper/7/0/0x00000002
    [    5.899037] no locks held by swapper/7/0.
    [    5.899039]  [<ffffffff81578525>] rt_spin_lock_slowlock+0xe5/0x2f0
    [    5.899040] Modules linked in:
    [    5.899041]
    [    5.899045]  [<ffffffff81579a58>] ? _raw_spin_unlock_irqrestore+0x38/0x90
    [    5.899046] Pid: 0, comm: swapper/7 Not tainted 3.6.11-rt28.19.el6rt.x86_64.debug #1
    [    5.899047] Call Trace:
    [    5.899049]  [<ffffffff81578bc6>] rt_spin_lock+0x16/0x40
    [    5.899052]  [<ffffffff810804e7>] __schedule_bug+0x67/0x90
    [    5.899054]  [<ffffffff8157d3f0>] ? notifier_call_chain+0x80/0x80
    [    5.899056]  [<ffffffff81577923>] __schedule+0x793/0x7a0
    [    5.899059]  [<ffffffff812f2034>] acpi_os_acquire_lock+0x1f/0x23
    [    5.899062]  [<ffffffff810b4e40>] ? debug_rt_mutex_print_deadlock+0x50/0x200
    [    5.899068]  [<ffffffff8130be64>] acpi_write_bit_register+0x33/0xb0
    [    5.899071]  [<ffffffff81577b89>] schedule+0x29/0x70
    [    5.899072]  [<ffffffff8130be13>] ? acpi_read_bit_register+0x33/0x51
    [    5.899074]  [<ffffffff81578525>] rt_spin_lock_slowlock+0xe5/0x2f0
    [    5.899077]  [<ffffffff8131d1fc>] acpi_idle_enter_bm+0x8a/0x28e
    [    5.899079]  [<ffffffff81579a58>] ? _raw_spin_unlock_irqrestore+0x38/0x90
    [    5.899081]  [<ffffffff8107e5da>] ? this_cpu_load+0x1a/0x30
    [    5.899083]  [<ffffffff81578bc6>] rt_spin_lock+0x16/0x40
    [    5.899087]  [<ffffffff8144c759>] cpuidle_enter+0x19/0x20
    [    5.899088]  [<ffffffff8157d3f0>] ? notifier_call_chain+0x80/0x80
    [    5.899090]  [<ffffffff8144c777>] cpuidle_enter_state+0x17/0x50
    [    5.899092]  [<ffffffff812f2034>] acpi_os_acquire_lock+0x1f/0x23
    [    5.899094]  [<ffffffff8144d1a1>] cpuidle899101]  [<ffffffff8130be13>] ?

  As the acpi code disables interrupts in acpi_idle_enter_bm and calls code
  that grabs the acpi lock, it causes issues, as the lock is currently a
  sleeping lock on RT.

  The lock was converted from a raw to a sleeping lock due to some previous
  issues, and tests that showed it didn't seem to matter. Unfortunately, it
  did matter for one of our boxes.

  This patch converts the lock back to a raw lock. I've run this code on a few
  of my own machines, one being my laptop that uses the acpi quite
  extensively. I've been able to suspend and resume without issues.

  [ tglx: Made the change exclusive for acpi_gbl_hardware_lock ]

  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  Cc: John Kacur <jkacur@gmail.com>
  Cc: Clark Williams <clark@redhat.com>
  Link: http://lkml.kernel.org/r/1360765565.23152.5.camel@gandalf.local.home
  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* x86/32: Use kmap switch for non highmem as well
  Thomas Gleixner, 2014-02-26 (2 files changed, -2/+4)

  Even with CONFIG_HIGHMEM=n we need to take care of the "atomic" mappings
  which are installed via iomap_atomic.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* mm: swap: Initialize local locks early
  Thomas Gleixner, 2014-02-26 (1 file changed, -3/+9)

  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* sched: Check for idle task in might_sleep()
  Thomas Gleixner, 2014-02-26 (1 file changed, -1/+2)

  Idle is not allowed to call sleeping functions ever!

  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* sched: Init idle->on_rq in init_idle()
  Thomas Gleixner, 2014-02-26 (1 file changed, -0/+1)

  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* mmci: Remove bogus local_irq_save()
  Thomas Gleixner, 2014-02-26 (1 file changed, -5/+0)

  On !RT the interrupt handler runs with interrupts disabled. On RT it runs in
  a thread, so there is no need to disable interrupts at all.

  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* drivers-tty-pl011-irq-disable-madness.patch
  Thomas Gleixner, 2014-02-26 (1 file changed, -5/+10)

  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* block: Use cpu_chill() for retry loops
  Thomas Gleixner, 2014-02-26 (1 file changed, -2/+3)

  Retry loops on RT might loop forever when the modifying side was preempted.
  Steven also observed a live lock when there was a concurrent priority
  boosting going on.

  Use cpu_chill() instead of cpu_relax() to let the system make progress.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Cc: stable-rt@vger.kernel.org
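  A generic sketch of the retry-loop pattern being changed; cpu_chill() is an
  RT-patch primitive (a short sleep on PREEMPT_RT, cpu_relax() otherwise),
  and struct thing/try_grab() are hypothetical stand-ins for the block-layer
  specifics.

    #include <linux/delay.h>   /* cpu_chill() lives here in the RT patches */

    struct thing;
    extern bool try_grab(struct thing *t);   /* assumed helper */

    static void wait_for_thing(struct thing *t)
    {
            while (!try_grab(t)) {
                    /*
                     * cpu_relax() would busy-spin; if the holder was
                     * preempted (or is being priority boosted) on RT that
                     * can live-lock. cpu_chill() sleeps briefly so the
                     * holder can run and make progress.
                     */
                    cpu_chill();
            }
    }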
* sched: Consider pi boosting in setscheduler
  Thomas Gleixner, 2014-02-26 (3 files changed, -9/+48)

  If a PI boosted task's policy/priority is modified by a setscheduler() call,
  we unconditionally dequeue and requeue the task if it is on the runqueue,
  even if the new priority is lower than the current effective boosted
  priority. This can result in undesired reordering of the priority bucket
  list.

  If the new priority is less than or equal to the current effective priority,
  we just store the new parameters in the task struct and leave the scheduler
  class and the runqueue untouched. This is handled when the task deboosts
  itself. Only if the new priority is higher than the effective boosted
  priority do we apply the change immediately.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Cc: stable@vger.kernel.org
  Cc: stable-rt@vger.kernel.org
* sched: Queue RT tasks to head when prio drops
  Thomas Gleixner, 2014-02-26 (1 file changed, -2/+7)

  The following scenario does not work correctly:

  The runqueue of CPUx contains two runnable and pinned tasks:

    T1: SCHED_FIFO, prio 80
    T2: SCHED_FIFO, prio 80

  T1 is on the cpu and executes the following syscalls (classic priority
  ceiling scenario):

    sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90);
    ...
    sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80);
    ...

  Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back to
  sleep the scheduler picks T2. Surprise!

  The same happens w/o actual preemption when T1 is forced into the scheduler
  due to a sporadic NEED_RESCHED event. The scheduler invokes pick_next_task()
  which returns T2. So T1 gets preempted and scheduled out.

  This happens because sched_setscheduler() dequeues T1 from the prio 90 list
  and then enqueues it on the tail of the prio 80 list behind T2. This
  violates the POSIX spec and surprises user space which relies on the
  guarantee that SCHED_FIFO tasks are not scheduled out unless they give the
  CPU up voluntarily or are preempted by a higher priority task. In the latter
  case the preempted task must get back on the CPU after the preempting task
  schedules out again.

  We fixed a similar issue already in commit 60db48c (sched: Queue a deboosted
  task to the head of the RT prio queue). The same treatment is necessary for
  sched_setscheduler(). So enqueue to the head of the prio bucket list if the
  priority of the task is lowered.

  It might be possible that existing user space relies on the current
  behaviour, but it can be considered highly unlikely due to the corner case
  nature of the application scenario.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Cc: stable@vger.kernel.org
  Cc: stable-rt@vger.kernel.org
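  A heavily simplified sketch of the requeue logic this implies at the end of
  __sched_setscheduler(); the wrapper function is hypothetical and the flag
  and helper names follow kernels of that era, so treat it as an illustration
  of the idea rather than the exact diff.

    /* tail of __sched_setscheduler(), sketch */
    static void requeue_after_policy_change(struct rq *rq, struct task_struct *p,
                                            int oldprio, bool was_queued,
                                            bool was_running)
    {
            if (was_running)
                    p->sched_class->set_curr_task(rq);
            if (was_queued) {
                    /*
                     * Kernel prio values grow downwards: a numerically larger
                     * p->prio means the priority dropped. In that case enqueue
                     * at the head of the prio bucket, so the task keeps the
                     * CPU as POSIX requires for SCHED_FIFO.
                     */
                    enqueue_task(rq, p, oldprio < p->prio ? ENQUEUE_HEAD : 0);
            }
    }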
* sched: Adjust sched_reset_on_fork when nothing else changes
  Thomas Gleixner, 2014-02-26 (1 file changed, -2/+4)

  If the policy and priority remain unchanged, a possible modification of
  sched_reset_on_fork gets lost in the early exit path.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Cc: stable@vger.kernel.org
  Cc: stable-rt@vger.kernel.org
* net: netfilter: Serialize xt_write_recseq sections on RT
  Thomas Gleixner, 2014-02-26 (3 files changed, -0/+17)

  The netfilter code relies only on the implicit semantics of
  local_bh_disable() for serializing xt_write_recseq sections. RT breaks that
  and needs explicit serialization here.

  Reported-by: Peter LaDow <petela@gocougs.wsu.edu>
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Cc: stable-rt@vger.kernel.org
* rcu: Disable RCU_FAST_NO_HZ on RT
  Thomas Gleixner, 2014-02-26 (1 file changed, -1/+1)

  This uses a timer_list timer from the irq disabled guts of the idle code.
  Disable it for now to prevent wreckage.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Cc: stable-rt@vger.kernel.org
* hrtimer: Raise softirq if hrtimer irq stalled
  Watanabe, 2014-02-26 (1 file changed, -5/+4)

  When the hrtimer stall detection hits, the softirq is not raised.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Cc: stable-rt@vger.kernel.org
* rcu: rcutiny: Prevent RCU stall
  Thomas Gleixner, 2014-02-26 (1 file changed, -1/+1)

  rcu_read_unlock_special() checks in_serving_softirq() and leaves early when
  true. On RT this is obviously wrong as softirq processing context can be
  preempted and therefore such a task can be on the gp_tasks list. Leaving
  early here will leave the task on the list and therefore block RCU
  processing forever.

  This cannot happen on mainline because softirq processing context cannot be
  preempted and therefore this can never happen at all.

  In fact this check looks quite questionable in general. Neither irq context
  nor softirq processing context in mainline can ever be preempted, so the
  special unlock case should not ever be invoked in such context. Now the only
  explanation might be a rcu_read_unlock() being interrupted and therefore
  leaving the rcu nest count at 0 before the special unlock bit has been
  cleared. That looks fragile. At least it's missing a big fat comment.
  Paul ????

  See mainline commits ec433f0c5 and 8762705a for further enlightenment.

  Reported-by: Kristian Lehmann <krleit00@hs-esslingen.de>
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* slab: Fix up stable merge of slab init_lock_keys()
  Steven Rostedt, 2014-02-26 (1 file changed, -4/+1)

  There was a stable fix that moved init_lock_keys() to after
  enable_cpucache(). But -rt changed this function to init_cachep_lock_keys().
  This moves the init afterwards to match the stable fix.

  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* stomp_machine: Use mutex_trylock when called from inactive cpu
  Thomas Gleixner, 2014-02-26 (1 file changed, -4/+9)

  If the stop machinery is called from an inactive CPU we cannot use
  mutex_lock, because some other stomp machine invocation might be in progress
  and the mutex can be contended. We cannot schedule from this context, so
  trylock and loop.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* sched: Better debug output for might sleep
  Thomas Gleixner, 2014-02-26 (2 files changed, -2/+25)

  might_sleep() can tell us where interrupts have been disabled, but we have
  no idea what disabled preemption. Add some debug infrastructure.

  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* rt: rwsem/rwlock: lockdep annotations
  Thomas Gleixner, 2014-02-26 (1 file changed, -21/+25)

  rwlocks and rwsems on RT do not allow multiple readers. Annotate the lockdep
  acquire functions accordingly.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* mm: page_alloc: Use local_lock_on() instead of plain spinlock
  Thomas Gleixner, 2014-02-26 (2 files changed, -2/+13)

  The plain spinlock, while sufficient, does not update the local_lock
  internals. Use a proper local_lock function instead to ease debugging.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* mm: slab: Fix potential deadlock
  Thomas Gleixner, 2014-02-26 (2 files changed, -8/+10)

    =============================================
    [ INFO: possible recursive locking detected ]
    3.6.0-rt1+ #49 Not tainted
    ---------------------------------------------
    swapper/0/1 is trying to acquire lock:
      lock_slab_on+0x72/0x77
    but task is already holding lock:
      __local_lock_irq+0x24/0x77

    other info that might help us debug this:
    Possible unsafe locking scenario:

          CPU0
          ----
     lock(&per_cpu(slab_lock, __cpu).lock);
     lock(&per_cpu(slab_lock, __cpu).lock);

     *** DEADLOCK ***

    May be due to missing lock nesting notation

    2 locks held by swapper/0/1:
      kmem_cache_create+0x33/0x89
      __local_lock_irq+0x24/0x77

    stack backtrace:
    Pid: 1, comm: swapper/0 Not tainted 3.6.0-rt1+ #49
    Call Trace:
      __lock_acquire+0x9a4/0xdc4
      ? __local_lock_irq+0x24/0x77
      ? lock_slab_on+0x72/0x77
      lock_acquire+0xc4/0x108
      ? lock_slab_on+0x72/0x77
      ? unlock_slab_on+0x5b/0x5b
      rt_spin_lock+0x36/0x3d
      ? lock_slab_on+0x72/0x77
      ? migrate_disable+0x85/0x93
      lock_slab_on+0x72/0x77
      do_ccupdate_local+0x19/0x44
      slab_on_each_cpu+0x36/0x5a
      do_tune_cpucache+0xc1/0x305
      enable_cpucache+0x8c/0xb5
      setup_cpu_cache+0x28/0x182
      __kmem_cache_create+0x34b/0x380
      ? shmem_mount+0x1a/0x1a
      kmem_cache_create+0x4a/0x89
      ? shmem_mount+0x1a/0x1a
      shmem_init+0x3e/0xd4
      kernel_init+0x11c/0x214
      kernel_thread_helper+0x4/0x10
      ? retint_restore_args+0x13/0x13
      ? start_kernel+0x3bc/0x3bc
      ? gs_change+0x13/0x13

  It's not a missing annotation. It's simply wrong code and needs to be fixed.
  Instead of nesting the local and the remote cpu lock simply acquire only the
  remote cpu lock, which is sufficient protection for this procedure.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* softirq: Init softirq local lock after per cpu section is set up
  Steven Rostedt, 2014-02-26 (1 file changed, -1/+1)

  I discovered this bug when booting 3.4-rt on my powerpc box. It crashed with
  the following report:

    ------------[ cut here ]------------
    kernel BUG at /work/rt/stable-rt.git/kernel/rtmutex_common.h:75!
    Oops: Exception in kernel mode, sig: 5 [#1]
    PREEMPT SMP NR_CPUS=64 NUMA PA Semi PWRficient
    Modules linked in:
    NIP: c0000000004aa03c LR: c0000000004aa01c CTR: c00000000009b2ac
    REGS: c00000003e8d7950 TRAP: 0700   Not tainted  (3.4.11-test-rt19)
    MSR: 9000000000029032 <SF,HV,EE,ME,IR,DR,RI>  CR: 24000082  XER: 20000000
    SOFTE: 0
    TASK = c00000003e8fdcd0[11] 'ksoftirqd/1' THREAD: c00000003e8d4000 CPU: 1
    GPR00: 0000000000000001 c00000003e8d7bd0 c000000000d6cbb0 0000000000000000
    GPR04: c00000003e8fdcd0 0000000000000000 0000000024004082 c000000000011454
    GPR08: 0000000000000000 0000000080000001 c00000003e8fdcd1 0000000000000000
    GPR12: 0000000024000084 c00000000fff0280 ffffffffffffffff 000000003ffffad8
    GPR16: ffffffffffffffff 000000000072c798 0000000000000060 0000000000000000
    GPR20: 0000000000642741 000000000072c858 000000003ffffaf0 0000000000000417
    GPR24: 000000000072dcd0 c00000003e7ff990 0000000000000000 0000000000000001
    GPR28: 0000000000000000 c000000000792340 c000000000ccec78 c000000001182338
    NIP [c0000000004aa03c] .wakeup_next_waiter+0x44/0xb8
    LR [c0000000004aa01c] .wakeup_next_waiter+0x24/0xb8
    Call Trace:
    [c00000003e8d7bd0] [c0000000004aa01c] .wakeup_next_waiter+0x24/0xb8 (unreliable)
    [c00000003e8d7c60] [c0000000004a0320] .rt_spin_lock_slowunlock+0x8c/0xe4
    [c00000003e8d7ce0] [c0000000004a07cc] .rt_spin_unlock+0x54/0x64
    [c00000003e8d7d60] [c0000000000636bc] .__thread_do_softirq+0x130/0x174
    [c00000003e8d7df0] [c00000000006379c] .run_ksoftirqd+0x9c/0x1a4
    [c00000003e8d7ea0] [c000000000080b68] .kthread+0xa8/0xb4
    [c00000003e8d7f90] [c00000000001c2f8] .kernel_thread+0x54/0x70
    Instruction dump:
    60000000 e86d01c8 38630730 4bff7061 60000000 ebbf0008 7c7c1b78 e81d0040
    7fe00278 7c000074 7800d182 68000001 <0b000000> e88d01c8 387d0010 38840738

  The rtmutex_common.h:75 is:

    rt_mutex_top_waiter(struct rt_mutex *lock)
    {
            struct rt_mutex_waiter *w;

            w = plist_first_entry(&lock->wait_list, struct rt_mutex_waiter,
                                  list_entry);
            BUG_ON(w->lock != lock);

            return w;
    }

  Where the waiter->lock is corrupted. I saw various other random bugs that
  all had to do with the softirq lock and plist. As plist needs to be
  initialized before it is used, I investigated how this lock is initialized.
  It's initialized with:

    void __init softirq_early_init(void)
    {
            local_irq_lock_init(local_softirq_lock);
    }

  Where:

    #define local_irq_lock_init(lvar)                                  \
            do {                                                       \
                    int __cpu;                                         \
                    for_each_possible_cpu(__cpu)                       \
                            spin_lock_init(&per_cpu(lvar, __cpu).lock);\
            } while (0)

  As the softirq lock is a local_irq_lock, which is a per_cpu lock, the
  initialization is done for all per_cpu versions of the lock. But let's look
  at where softirq_early_init() is called from, in init/main.c:

    start_kernel()
            /*
             * Interrupts are still disabled. Do necessary setups, then
             * enable them
             */
            softirq_early_init();
            tick_init();
            boot_cpu_init();
            page_address_init();
            printk(KERN_NOTICE "%s", linux_banner);
            setup_arch(&command_line);
            mm_init_owner(&init_mm, &init_task);
            mm_init_cpumask(&init_mm);
            setup_command_line(command_line);
            setup_nr_cpu_ids();
            setup_per_cpu_areas();
            smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */

  One of the first things that is called is the initialization of the softirq
  lock. But if you look further down, we see the per_cpu areas have not been
  set up yet. Thus initializing a local_irq_lock() before the per_cpu section
  is set up may not work, as it is initializing the per cpu locks before the
  per cpu areas exist.

  By moving softirq_early_init() right after setup_per_cpu_areas(), the kernel
  boots fine.

  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  Cc: Clark Williams <clark@redhat.com>
  Cc: John Kacur <jkacur@redhat.com>
  Cc: Carsten Emde <cbe@osadl.org>
  Cc: vomlehn@texas.net
  Cc: stable-rt@vger.kernel.org
  Link: http://lkml.kernel.org/r/1349362924.6755.18.camel@gandalf.local.home
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* random: Make it work on rt
  Thomas Gleixner, 2014-02-26 (5 files changed, -7/+19)

  Delegate the random insertion to the forced threaded interrupt handler.
  Store the return IP of the hard interrupt handler in the irq descriptor and
  feed it into the random generator as a source of entropy.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* time/rt: Fix up leap-second backport for RT changes
  Steven Rostedt, 2014-02-26 (1 file changed, -2/+2)

  The leap-second backport broke RT, and a few changes had to be done.

  1) second_overflow now encompasses ntp_leap_second, and since
     second_overflow is called with the xtime_lock held, we can not take that
     lock either. (This update was done during the rebase.)

  2) Change ktime_get_update_offsets() to use read_seqcount_begin() instead of
     read_seq_begin() (and retry).

  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* cpu/rt: Fix cpu_hotplug variable initialization
  Steven Rostedt, 2014-02-26 (1 file changed, -4/+0)

  The commit "cpu/rt: Rework cpu down for PREEMPT_RT" changed the double
  meaning of the cpu_hotplug.lock, where it was a spinlock for RT and a mutex
  for non-RT, to just a mutex for both. But the initialization of the variable
  was not updated to reflect this change.

  Cc: stable-rt@vger.kernel.org
  Signed-off-by: Steven Rostedt <rostedt@goodmis.org>