With i386, NMIs and breakpoints use the current stack and they
do not reset the stack pointer to a fix point that might corrupt
a previous NMI or breakpoint (as it does in x86_64). But NMIs are
still not made to be re-entrant, and need to prevent the case that
an NMI hitting a breakpoint (which does an iret), doesn't allow
another NMI to run.
The fix is to let the NMI be in 3 different states:
1) not running
2) executing
3) latched
When no NMI is executing on a given CPU, the state is "not running".
When the first NMI comes in, the state is switched to "executing".
On exit of that NMI, a cmpxchg is performed to switch the state
back to "not running" and if that fails, the NMI is restarted.
If a breakpoint is hit and does an iret, which re-enables NMIs,
and another NMI comes in before the first NMI finished, it will
detect that the state is not in the "not running" state and the
current NMI is nested. In this case, the state is switched to "latched"
to let the interrupted NMI know to restart the NMI handler, and
the nested NMI exits without doing anything.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Turner <pjt@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
We want to allow NMI handlers to have breakpoints to be able to
remove stop_machine from ftrace, kprobes and jump_labels. But if
an NMI interrupts a current breakpoint, and then it triggers a
breakpoint itself, it will switch to the breakpoint stack and
corrupt the data on it for the breakpoint processing that it
interrupted.
Instead, have the NMI check if it interrupted breakpoint processing
by checking if the stack that is currently used is a breakpoint
stack. If it is, then load a special IDT that changes the IST
for the debug exception to keep the same stack in kernel context.
When the NMI is done, it puts it back.
This way, if the NMI does trigger a breakpoint, it will keep
using the same stack and not stomp on the breakpoint data for
the breakpoint it interrupted.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In x86, when an NMI goes off, the CPU goes into an NMI context that
prevents other NMIs to trigger on that CPU. If an NMI is suppose to
trigger, it has to wait till the previous NMI leaves NMI context.
At that time, the next NMI can trigger (note, only one more NMI will
trigger, as only one can be latched at a time).
The way x86 gets out of NMI context is by calling iret. The problem
with this is that this causes problems if the NMI handle either
triggers an exception, or a breakpoint. Both the exception and the
breakpoint handlers will finish with an iret. If this happens while
in NMI context, the CPU will leave NMI context and a new NMI may come
in. As NMI handlers are not made to be re-entrant, this can cause
havoc with the system, not to mention, the nested NMI will write
all over the previous NMI's stack.
Linus Torvalds proposed the following workaround to this problem:
https://lkml.org/lkml/2010/7/14/264
"In fact, I wonder if we couldn't just do a software NMI disable
instead? Hav ea per-cpu variable (in the _core_ percpu areas that get
allocated statically) that points to the NMI stack frame, and just
make the NMI code itself do something like
NMI entry:
- load percpu NMI stack frame pointer
- if non-zero we know we're nested, and should ignore this NMI:
- we're returning to kernel mode, so return immediately by using
"popf/ret", which also keeps NMI's disabled in the hardware until the
"real" NMI iret happens.
- before the popf/iret, use the NMI stack pointer to make the NMI
return stack be invalid and cause a fault
- set the NMI stack pointer to the current stack pointer
NMI exit (not the above "immediate exit because we nested"):
clear the percpu NMI stack pointer
Just do the iret.
Now, the thing is, now the "iret" is atomic. If we had a nested NMI,
we'll take a fault, and that re-does our "delayed" NMI - and NMI's
will stay masked.
And if we didn't have a nested NMI, that iret will now unmask NMI's,
and everything is happy."
I first tried to follow this advice but as I started implementing this
code, a few gotchas showed up.
One, is accessing per-cpu variables in the NMI handler.
The problem is that per-cpu variables use the %gs register to get the
variable for the given CPU. But as the NMI may happen in userspace,
we must first perform a SWAPGS to get to it. The NMI handler already
does this later in the code, but its too late as we have saved off
all the registers and we don't want to do that for a disabled NMI.
Peter Zijlstra suggested to keep all variables on the stack. This
simplifies things greatly and it has the added benefit of cache locality.
Two, faulting on the iret.
I really wanted to make this work, but it was becoming very hacky, and
I never got it to be stable. The iret already had a fault handler for
userspace faulting with bad segment registers, and getting NMI to trigger
a fault and detect it was very tricky. But for strange reasons, the system
would usually take a double fault and crash. I never figured out why
and decided to go with a simple "jmp" approach. The new approach I took
also simplified things.
Finally, the last problem with Linus's approach was to have the nested
NMI handler do a ret instead of an iret to give the first NMI NMI-context
again.
The problem is that ret is much more limited than an iret. I couldn't figure
out how to get the stack back where it belonged. I could have copied the
current stack, pushed the return onto it, but my fear here is that there
may be some place that writes data below the stack pointer. I know that
is not something code should depend on, but I don't want to chance it.
I may add this feature later, but for now, an NMI handler that loses NMI
context will not get it back.
Here's what is done:
When an NMI comes in, the HW pushes the interrupt stack frame onto the
per cpu NMI stack that is selected by the IST.
A special location on the NMI stack holds a variable that is set when
the first NMI handler runs. If this variable is set then we know that
this is a nested NMI and we process the nested NMI code.
There is still a race when this variable is cleared and an NMI comes
in just before the first NMI does the return. For this case, if the
variable is cleared, we also check if the interrupted stack is the
NMI stack. If it is, then we process the nested NMI code.
Why the two tests and not just test the interrupted stack?
If the first NMI hits a breakpoint and loses NMI context, and then it
hits another breakpoint and while processing that breakpoint we get a
nested NMI. When processing a breakpoint, the stack changes to the
breakpoint stack. If another NMI comes in here we can't rely on the
interrupted stack to be the NMI stack.
If the variable is not set and the interrupted task's stack is not the
NMI stack, then we know this is the first NMI and we can process things
normally. But in order to do so, we need to do a few things first.
1) Set the stack variable that tells us that we are in an NMI handler
2) Make two copies of the interrupt stack frame.
One copy is used to return on iret
The other is used to restore the first one if we have a nested NMI.
This is what the stack will look like:
+-------------------------+
| original SS |
| original Return RSP |
| original RFLAGS |
| original CS |
| original RIP |
+-------------------------+
| temp storage for rdx |
+-------------------------+
| NMI executing variable |
+-------------------------+
| Saved SS |
| Saved Return RSP |
| Saved RFLAGS |
| Saved CS |
| Saved RIP |
+-------------------------+
| copied SS |
| copied Return RSP |
| copied RFLAGS |
| copied CS |
| copied RIP |
+-------------------------+
| pt_regs |
+-------------------------+
The original stack frame contains what the HW put in when we entered
the NMI.
We store %rdx as a temp variable to use. Both the original HW stack
frame and this %rdx storage will be clobbered by nested NMIs so we
can not rely on them later in the first NMI handler.
The next item is the special stack variable that is set when we execute
the rest of the NMI handler.
Then we have two copies of the interrupt stack. The second copy is
modified by any nested NMIs to let the first NMI know that we triggered
a second NMI (latched) and that we should repeat the NMI handler.
If the first NMI hits an exception or breakpoint that takes it out of
NMI context, if a second NMI comes in before the first one finishes,
it will update the copied interrupt stack to point to a fix up location
to trigger another NMI.
When the first NMI calls iret, it will instead jump to the fix up
location. This fix up location will copy the saved interrupt stack back
to the copy and execute the nmi handler again.
Note, the nested NMI knows enough to check if it preempted a previous
NMI handler while it is in the fixup location. If it has, it will not
modify the copied interrupt stack and will just leave as if nothing
happened. As the NMI handle is about to execute again, there's no reason
to latch now.
To test all this, I forced the NMI handler to call iret and take itself
out of NMI context. I also added assemble code to write to the serial to
make sure that it hits the nested path as well as the fix up path.
Everything seems to be working fine.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Turner <pjt@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Linus cleaned up the NMI handler but it still needs some comments to
explain why it uses save_paranoid but not paranoid_exit. Just to keep
others from adding that in the future, document why it's not used.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The NMI handler uses the paranoid_exit routine that checks the
NEED_RESCHED flag, and if it is set and the return is for userspace,
then interrupts are enabled, the stack is swapped to the thread's stack,
and schedule is called. The problem with this is that we are still in an
NMI context until an iret is executed. This means that any new NMIs are
now starved until an interrupt or exception occurs and does the iret.
As NMIs can not be masked and can interrupt any location, they are
treated as a special case. NEED_RESCHED should not be set in an NMI
handler. The interruption by the NMI should not disturb the work flow
for scheduling. Any IPI sent to a processor after sending the
NEED_RESCHED would have to wait for the NMI anyway, and after the IPI
finishes the schedule would be called as required.
There is no reason to do anything special leaving an NMI. Remove the
call to paranoid_exit and do a simple return. This not only fixes the
bug of starved NMIs, but it also cleans up the code.
Link: http://lkml.kernel.org/r/CA+55aFzgM55hXTs4griX5e9=v_O+=ue+7Rj0PTD=M7hFYpyULQ@mail.gmail.com
Acked-by: Andi Kleen <ak@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Turner <pjt@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, efi: Calling __pa() with an ioremap()ed address is invalid
x86, hpet: Immediately disable HPET timer 1 if rtc irq is masked
x86/intel_mid: Kconfig select fix
x86/intel_mid: Fix the Kconfig for MID selection
* git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
arch/tile: use new generic {enable,disable}_percpu_irq() routines
drivers/net/ethernet/tile: use skb_frag_page() API
asm-generic/unistd.h: support new process_vm_{readv,write} syscalls
arch/tile: fix double-free bug in homecache_free_pages()
arch/tile: add a few #includes and an EXPORT to catch up with kernel changes.
With the 3.2-rc kernel, IOMMU 2M pages in KVM works. But when I tried
to use IOMMU 1GB pages in KVM, I encountered an oops and the 1GB page
failed to be used.
The root cause is that 1GB page allocation calls gup_huge_pud() while 2M
page calls gup_huge_pmd. If compound pages are used and the page is a
tail page, gup_huge_pmd() increases _mapcount to record tail page are
mapped while gup_huge_pud does not do that.
So when the mapped page is relesed, it will result in kernel oops
because the page is not marked mapped.
This patch add tail process for compound page in 1GB huge page which
keeps the same process as 2M page.
Reproduce like:
1. Add grub boot option: hugepagesz=1G hugepages=8
2. mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages
3. qemu-kvm -m 2048 -hda os-kvm.img -cpu kvm64 -smp 4 -mem-path /dev/hugepages
-net none -device pci-assign,host=07:00.1
kernel BUG at mm/swap.c:114!
invalid opcode: 0000 [#1] SMP
Call Trace:
put_page+0x15/0x37
kvm_release_pfn_clean+0x31/0x36
kvm_iommu_put_pages+0x94/0xb1
kvm_iommu_unmap_memslots+0x80/0xb6
kvm_assign_device+0xba/0x117
kvm_vm_ioctl_assigned_device+0x301/0xa47
kvm_vm_ioctl+0x36c/0x3a2
do_vfs_ioctl+0x49e/0x4e4
sys_ioctl+0x5a/0x7c
system_call_fastpath+0x16/0x1b
RIP put_compound_page+0xd4/0x168
Signed-off-by: Youquan Song <youquan.song@intel.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we encounter an efi_memory_desc_t without EFI_MEMORY_WB set
in ->attribute we currently call set_memory_uc(), which in turn
calls __pa() on a potentially ioremap'd address.
On CONFIG_X86_32 this is invalid, resulting in the following
oops on some machines:
BUG: unable to handle kernel paging request at f7f22280
IP: [<c10257b9>] reserve_ram_pages_type+0x89/0x210
[...]
Call Trace:
[<c104f8ca>] ? page_is_ram+0x1a/0x40
[<c1025aff>] reserve_memtype+0xdf/0x2f0
[<c1024dc9>] set_memory_uc+0x49/0xa0
[<c19334d0>] efi_enter_virtual_mode+0x1c2/0x3aa
[<c19216d4>] start_kernel+0x291/0x2f2
[<c19211c7>] ? loglevel+0x1b/0x1b
[<c19210bf>] i386_start_kernel+0xbf/0xc8
A better approach to this problem is to map the memory region
with the correct attributes from the start, instead of modifying
it after the fact. The uncached case can be handled by
ioremap_nocache() and the cached by ioremap_cache().
Despite first impressions, it's not possible to use
ioremap_cache() to map all cached memory regions on
CONFIG_X86_64 because EFI_RUNTIME_SERVICES_DATA regions really
don't like being mapped into the vmalloc space, as detailed in
the following bug report,
https://bugzilla.redhat.com/show_bug.cgi?id=748516
Therefore, we need to ensure that any EFI_RUNTIME_SERVICES_DATA
regions are covered by the direct kernel mapping table on
CONFIG_X86_64. To accomplish this we now map E820_RESERVED_EFI
regions via the direct kernel mapping with the initial call to
init_memory_mapping() in setup_arch(), whereas previously these
regions wouldn't be mapped if they were after the last E820_RAM
region until efi_ioremap() was called. Doing it this way allows
us to delete efi_ioremap() completely.
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Huang Ying <huang.ying.caritas@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1321621751-3650-1-git-send-email-matt@console-pimps.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When HPET is operating in RTC mode, the TN_ENABLE bit on timer1
controls whether the HPET or the RTC delivers interrupts to irq8. When
the system goes into suspend, the RTC driver sends a signal to the
HPET driver so that the HPET releases control of irq8, allowing the
RTC to wake the system from suspend. The switchover is accomplished by
a write to the HPET configuration registers which currently only
occurs while servicing the HPET interrupt.
On some systems, I have seen the system suspend before an HPET
interrupt occurs, preventing the write to the HPET configuration
register and leaving the HPET in control of the irq8. As the HPET is
not active during suspend, it does not generate a wake signal and RTC
alarms do not work.
This patch forces the HPET driver to immediately transfer control of
the irq8 channel to the RTC instead of waiting until the next
interrupt event.
Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
Link: http://lkml.kernel.org/r/20111118153306.GB16319@alberich.amd.com
Tested-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
If we select a symbol it should have a type declared first
otherwise in some situations the config tools get upset. They
are currently perhaps a bit too resilient which is why this
wasn't noticed initially.
Signed-off-by: Alan Cox <alan@linux.intel.com>
Link: http://lkml.kernel.org/r/20111206132811.4041.32549.stgit@bob.linux.org.uk
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We currently fail to build on CONFIG_X86_INTEL_MID=y and
CONFIG_X86_MRST unset.
We could build all the bits to make generic MID work if you
picked MID platform alone but that's really silly. Instead use
select and two variables.
This looks a bit daft right now but once we add a Medfield
selection it'll start to look a good deal more sensible.
Reported-by: Ingo Molnar <mingo@elte.hu>
Reported-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Link: http://lkml.kernel.org/r/20111205231433.28811.51297.stgit@bob.linux.org.uk
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
intr_remapping: Fix section mismatch in ir_dev_scope_init()
intel-iommu: Fix section mismatch in dmar_parse_rmrr_atsr_dev()
x86, amd: Fix up numa_node information for AMD CPU family 15h model 0-0fh northbridge functions
x86, AMD: Correct align_va_addr documentation
x86/rtc, mrst: Don't register a platform RTC device for for Intel MID platforms
x86/mrst: Battery fixes
x86/paravirt: PTE updates in k(un)map_atomic need to be synchronous, regardless of lazy_mmu mode
x86: Fix "Acer Aspire 1" reboot hang
x86/mtrr: Resolve inconsistency with Intel processor manual
x86: Document rdmsr_safe restrictions
x86, microcode: Fix the failure path of microcode update driver init code
Add TAINT_FIRMWARE_WORKAROUND on MTRR fixup
x86/mpparse: Account for bus types other than ISA and PCI
x86, mrst: Change the pmic_gpio device type to IPC
mrst: Added some platform data for the SFI translations
x86,mrst: Power control commands update
x86/reboot: Blacklist Dell OptiPlex 990 known to require PCI reboot
x86, UV: Fix UV2 hub part number
x86: Add user_mode_vm check in stack_overflow_check
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix loss of notification with multi-event
perf, x86: Force IBS LVT offset assignment for family 10h
perf, x86: Disable PEBS on SandyBridge chips
trace_events_filter: Use rcu_assign_pointer() when setting ftrace_event_call->filter
perf session: Fix crash with invalid CPU list
perf python: Fix undefined symbol problem
perf/x86: Enable raw event access to Intel offcore events
perf: Don't use -ENOSPC for out of PMU resources
perf: Do not set task_ctx pointer in cpuctx if there are no events in the context
perf/x86: Fix PEBS instruction unwind
oprofile, x86: Fix crash when unloading module (nmi timer mode)
oprofile: Fix crash when unloading module (hr timer mode)
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched, x86: Avoid unnecessary overflow in sched_clock
sched: Fix buglet in return_cfs_rq_runtime()
sched: Avoid SMT siblings in select_idle_sibling() if possible
sched: Set the command name of the idle tasks in SMP kernels
sched, rt: Provide means of disabling cross-cpu bandwidth sharing
sched: Document wait_for_completion_*() return values
sched_fair: Fix a typo in the comment describing update_sd_lb_stats
sched: Add a comment to effective_load() since it's a pain
* 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6:
[S390] ap: Setup timer for sending messages after reset.
[S390] cio: fix chsc_chp_vary
[S390] cio: provide fake irb for transport mode IO
[S390] cio: disallow driver io for known to be broken paths
[S390] hibernate: directly trigger subchannel evaluation
[S390] remove reset of system call restart on psw changes
[S390] add missing .set function for NT_S390_LAST_BREAK regset
[S390] fix page change underindication in pgste_update_all
[S390] ptrace inferior call interactions with TIF_SYSCALL
[S390] kdump: Replace is_kdump_kernel() with OLDMEM_BASE check
Otherwise timing is inaccurate, resulting in devices which depend on it,
like omap-keypad, broken.
Tested on Amstrad Delta.
Signed-off-by: Janusz Krzysztofik <jkrzyszt@tis.icnet.pl>
[tony@atomide.com: removed comment referencing a development branch]
Signed-off-by: Tony Lindgren <tony@atomide.com>
I've received complaints that the numa_node attribute for family
15h model 00-0fh (e.g. Interlagos) northbridge functions shows
-1 instead of the proper node ID.
Correct this with attached quirks (similar to quirks for other
AMD CPU families used in multi-socket systems).
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Frank Arnold <frank.arnold@amd.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/20111202072143.GA31916@alberich.amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Intel MID x86 platforms have a memory mapped virtual RTC
instead. No MID platform have the default ports (and
accessing them may do weird stuff).
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Cc: feng.tang@intel.com
Cc: Feng Tang <feng.tang@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix an outstanding issue that has been reported since 2.6.37.
Under a heavy loaded machine processing "fork()" calls could
crash with:
BUG: unable to handle kernel paging request at f573fc8c
IP: [<c01abc54>] swap_count_continued+0x104/0x180
*pdpt = 000000002a3b9027 *pde = 0000000001bed067 *pte = 0000000000000000 Oops: 0000 [#1] SMP
Modules linked in:
Pid: 1638, comm: apache2 Not tainted 3.0.4-linode37 #1
EIP: 0061:[<c01abc54>] EFLAGS: 00210246 CPU: 3
EIP is at swap_count_continued+0x104/0x180
.. snip..
Call Trace:
[<c01ac222>] ? __swap_duplicate+0xc2/0x160
[<c01040f7>] ? pte_mfn_to_pfn+0x87/0xe0
[<c01ac2e4>] ? swap_duplicate+0x14/0x40
[<c01a0a6b>] ? copy_pte_range+0x45b/0x500
[<c01a0ca5>] ? copy_page_range+0x195/0x200
[<c01328c6>] ? dup_mmap+0x1c6/0x2c0
[<c0132cf8>] ? dup_mm+0xa8/0x130
[<c013376a>] ? copy_process+0x98a/0xb30
[<c013395f>] ? do_fork+0x4f/0x280
[<c01573b3>] ? getnstimeofday+0x43/0x100
[<c010f770>] ? sys_clone+0x30/0x40
[<c06c048d>] ? ptregs_clone+0x15/0x48
[<c06bfb71>] ? syscall_call+0x7/0xb
The problem is that in copy_page_range() we turn lazy mode on,
and then in swap_entry_free() we call swap_count_continued()
which ends up in:
map = kmap_atomic(page, KM_USER0) + offset;
and then later we touch *map.
Since we are running in batched mode (lazy) we don't actually
set up the PTE mappings and the kmap_atomic is not done
synchronously and ends up trying to dereference a page that has
not been set.
Looking at kmap_atomic_prot_pfn(), it uses
'arch_flush_lazy_mmu_mode' and doing the same in
kmap_atomic_prot() and __kunmap_atomic() makes the problem go
away.
Interestingly, commit b8bcfe997e ("x86/paravirt: remove lazy
mode in interrupts") removed part of this to fix an interrupt
issue - but it went to far and did not consider this scenario.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Looks like on some Acer Aspire 1s with older bioses, reboot via bios
fails. It works on my machine, (with BIOS version 0.3310) but
not on some others (BIOS version 0.3309).
There's a log of problems at:
https://bbs.archlinux.org/viewtopic.php?id=124136
This patch adds a different callback to the reboot quirk table,
to allow rebooting via keybaord controller.
Reported-by: Uroš Vampl <mobile.leecher@gmail.com>
Tested-by: Vasily Khoruzhick <anarsoul@gmail.com>
Signed-off-by: Peter Chubb <peter.chubb@nicta.com.au>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/1323093233-9481-1-git-send-email-anarsoul@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Following is from Notes of section 11.5.3 of Intel processor
manual available at:
http://www.intel.com/Assets/PDF/manual/325384.pdf
For the Pentium 4 and Intel Xeon processors, after the sequence of
steps given above has been executed, the cache lines containing the
code between the end of the WBINVD instruction and before the
MTRRS have actually been disabled may be retained in the cache
hierarchy. Here, to remove code from the cache completely, a
second WBINVD instruction must be executed after the MTRRs have
been disabled.
This patch provides resolution for that.
Ideally, I will like to make changes only for Pentium 4 and Xeon
processors. But, I am not finding easier way to do it.
And, extra wbinvd() instruction does not hurt much for other
processors.
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Link: http://lkml.kernel.org/r/4EBD1CC5.3030008@oracle.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Recently, I got bitten by using rdmsr_safe too early in the boot
process. Document its shortcomings for future reference.
Link: http://lkml.kernel.org/r/4ED5B70F.606@lwfinger.net
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
In commit f8924e770e ("x86: unify mp_bus_info"), the 32-bit
and 64-bit versions of MP_bus_info were rearranged to match each
other better. Unfortunately it introduced a regression: prior
to that change we used to always set the mp_bus_not_pci bit,
then clear it if we found a PCI bus. After it, we set
mp_bus_not_pci for ISA buses, clear it for PCI buses, and leave
it alone otherwise.
In the cases of ISA and PCI, there's not much difference. But
ISA is not the only non-PCI bus, so it's better to always set
mp_bus_not_pci and clear it only for PCI.
Without this change, Dan's Dell PowerEdge 4200 panics on boot
with a log indicating interrupt routing trouble unless the
"noapic" option is supplied. With this change, the machine
boots reliably without "noapic".
Fixes http://bugs.debian.org/586494
Reported-bisected-and-tested-by: Dan McGrath <troubledaemon@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org # 2.6.26+
Cc: Dan McGrath <troubledaemon@gmail.com>
Cc: Alexey Starikovskiy <aystarik@gmail.com>
[jrnieder@gmail.com: clarified commit message]
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Link: http://lkml.kernel.org/r/20111122215000.GA9151@elie.hsd1.il.comcast.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In latest firmware's SFI tables, pmic_gpio has been set to
IPC type of device, so we need handle it too.
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add SFI glue for the following devices:
tca6416: a gpio expander compatible with max7315
mpu3050: gyro sensor
Both of these actual drivers are already upstream
Signed-off-by: Jekyll Lai <jekyll_lai@wistron.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On the Intel MID devices SCU commands are issued to manage power
off and the like. We need to issue different ones for
non-Lincroft based devices.
Signed-off-by: Alek Du <alek.du@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Dell OptiPlex 990 is known to require PCI reboot, so add it to
the reboot blacklist in pci_reboot_dmi_table[].
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Link: http://lkml.kernel.org/r/201111160019.51303.rjw@sisk.pl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Current partition information maintained in kernel does not match with
u-boot, this leads to corruption of u-boot env when we update uImage
from kernel. Patch fixes it to match with u-boot partition information.
Signed-off-by: Shankarmurthy,Akshay <akshay.s@ti.com>
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
On OMAP-L138 platform, EDMA event queue 0 should be used for audio
transfers so that they are not starved by video data moving on event queue 1.
Commit 48519f0ae0 (ASoC: davinci: let platform
data define edma queue numbers) had a side-effect of changing this behavior
by making the driver actually honor the platform data passed.
Fix this now by passing event queue 0 as the queue to be used for audio
transfers.
Signed-off-by: Manjunathappa, Prakash <prakash.pm@ti.com>
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
Cc: <stable@vger.kernel.org> # v2.6.36.x and above
The function setup_vpif_input_channel_mode() used the VSCLKDIS register
instead of VIDCLKCTL. This meant that when in HD mode videoport channel 0
used a different clock from channel 1.
Clearly a copy-and-paste error.
Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com>
Acked-by: Manjunath Hadli <manjunath.hadli@ti.com>
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
Cc: stable@kernel.org
Fix the incorrect classification of DSP clock into a
seperate DSP domain on DM646x.
Per the reference guide (http://www.ti.com/lit/ug/spruep9e/spruep9e.pdf)
there is only one "AlwaysON" power domain on DM6467.
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
Seperate PDSTAT and PDCTL registers are defined for
domain 0 and domain 1 where as the code always reads
the domain 0 PDSTAT register and domain 1 PDCTL register.
Fix this issue. While at it, introduce usage of macros
for register masks to improve readability.
Reviewed-by: Sergei Shtylyov <sshtylyov@mvista.com>
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: Sekhar Nori <nsekhar@ti.com>