__assign_irq_vector() uses the vector_cpumask which is assigned by
apic->vector_allocation_domain() without doing basic sanity checks. That can
result in a situation where the final assignement of a newly found vector
fails in apic->cpu_mask_to_apicid_and(). So we have to do rollbacks for no
reason.
apic->cpu_mask_to_apicid_and() only fails if
vector_cpumask & requested_cpumask & cpu_online_mask
is empty.
Check for this condition right away and if the result is empty try immediately
the next possible cpu in the requested mask. So in case of a failure the old
setting is unchanged and we can remove the rollback code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Borislav Petkov <bp@alien8.de>
Tested-by: Joe Lawrence <joe.lawrence@stratus.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Jeremiah Mahler <jmmahler@gmail.com>
Cc: andy.shevchenko@gmail.com
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: stable@vger.kernel.org #4.3+
Link: http://lkml.kernel.org/r/20151231160106.561877324@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Split out the code which advances the target cpu for the search so we can
reuse it for the next patch which adds an early validation check for the
vectormask which we get from the apic.
Add comments while at it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Borislav Petkov <bp@alien8.de>
Tested-by: Joe Lawrence <joe.lawrence@stratus.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Jeremiah Mahler <jmmahler@gmail.com>
Cc: andy.shevchenko@gmail.com
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: stable@vger.kernel.org #4.3+
Link: http://lkml.kernel.org/r/20151231160106.484562040@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use an explicit goto for the cases where we have success in the search/update
and return -ENOSPC if the search loop ends due to no space.
Preparatory patch for fixes. No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Borislav Petkov <bp@alien8.de>
Tested-by: Joe Lawrence <joe.lawrence@stratus.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Jeremiah Mahler <jmmahler@gmail.com>
Cc: andy.shevchenko@gmail.com
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: stable@vger.kernel.org #4.3+
Link: http://lkml.kernel.org/r/20151231160106.403491024@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Function __assign_irq_vector() makes use of apic_chip_data.old_domain as a
temporary buffer, which is in the way of using apic_chip_data.old_domain for
synchronizing the vector cleanup with the vector assignement code.
Use a proper temporary cpumask for this.
[ tglx: Renamed the mask to searched_cpumask for clarity ]
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Tested-by: Borislav Petkov <bp@alien8.de>
Tested-by: Joe Lawrence <joe.lawrence@stratus.com>
Cc: Jeremiah Mahler <jmmahler@gmail.com>
Cc: andy.shevchenko@gmail.com
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: stable@vger.kernel.org #4.3+
Link: http://lkml.kernel.org/r/1450880014-11741-1-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In fixup_irqs() we unconditionally dereference the irq chip of an irq
descriptor. The descriptor might still be valid, but already cleaned up,
i.e. the chip removed. Add a check for this condition.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Joe Lawrence <joe.lawrence@stratus.com>
Cc: Jeremiah Mahler <jmmahler@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: andy.shevchenko@gmail.com
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: stable@vger.kernel.org #4.3+
Link: http://lkml.kernel.org/r/20151231160106.236423282@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
There's a race condition between
x86_vector_free_irqs()
{
free_apic_chip_data(irq_data->chip_data);
xxxxx //irq_data->chip_data has been freed, but the pointer
//hasn't been reset yet
irq_domain_reset_irq_data(irq_data);
}
and
smp_irq_move_cleanup_interrupt()
{
raw_spin_lock(&vector_lock);
data = apic_chip_data(irq_desc_get_irq_data(desc));
access data->xxxx // may access freed memory
raw_spin_unlock(&desc->lock);
}
which may cause smp_irq_move_cleanup_interrupt() to access freed memory.
Call irq_domain_reset_irq_data(), which clears the pointer with vector lock
held.
[ tglx: Free memory outside of lock held region. ]
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Tested-by: Borislav Petkov <bp@alien8.de>
Tested-by: Joe Lawrence <joe.lawrence@stratus.com>
Cc: Jeremiah Mahler <jmmahler@gmail.com>
Cc: andy.shevchenko@gmail.com
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: stable@vger.kernel.org #4.3+
Link: http://lkml.kernel.org/r/1450880014-11741-3-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
setup_ioapic_dest() calls irqchip->irq_set_affinity() completely
unprotected. That's wrong in several aspects:
- it opens a race window where irq_set_affinity() can be interrupted and the
irq chip left in unconsistent state.
- it triggers a lockdep splat when we fix the vector race for 4.3+ because
vector lock is taken with interrupts enabled.
The proper calling convention is irq descriptor lock held and interrupts
disabled.
Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Jeremiah Mahler <jmmahler@gmail.com>
Cc: andy.shevchenko@gmail.com
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Joe Lawrence <joe.lawrence@stratus.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1601140919420.3575@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Originally we calculated ht_nodeid as "ht_nodeid = apicid -
boot_cpu_id;" so presumably it could be negative.
But after commit:
01aaea1afb ('x86: introduce initial apicid')
we use c->initial_apicid which is an unsigned short and thus always >= 0.
It causes a static checker warning to test for impossible
conditions so let's remove it.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Hector Marco-Gisbert <hecmargi@upv.es>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Link: http://lkml.kernel.org/r/20160113123940.GE19993@mwanda
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If the clock becomes unstable while we're reading it, we need to
bail. We can do this by simply moving the check into the
seqcount loop.
Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Alexander Graf <agraf@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/755dcedb17269e1d7ce12a9a713dea303835137e.1451949191.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
My previous comments were still a bit confusing and there was a
typo. Fix it up.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: 71b3c126e6 ("x86/mm: Add barriers and document switch_mm()-vs-flush synchronization")
Link: http://lkml.kernel.org/r/0a0b43cdcdd241c5faaaecfbcc91a155ddedc9a1.1452631609.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The vdso-based sigreturn mechanism is fragile and isn't used by
modern glibc so, if we break it, we'll only notice when someone
tests an unusual libc.
Add an explicit selftest.
[ I wrote this while debugging a Bionic breakage -- my first guess
was that I had somehow messed up sigreturn. I've caused problems in
that code before, and it's really easy to fail to notice it because
there's nothing on a modern distro that needs vdso-based sigreturn. ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/32946d714156879cd8e5d8eab044cd07557ed558.1452628504.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Without the reboot=pci method, the iMac 10,1 simply
hangs after printing "Restarting system" at the point
when it should reboot. This fixes it.
Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com>
Cc: <stable@vger.kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1450466646-26663-1-git-send-email-mario.kleiner.de@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pavel noted that lguest maps the switcher code executable and
read-write. This is a bad idea for any kernel text, but
particularly for text mapped at a fixed address.
Create two vmas, one for the text (PAGE_KERNEL_RX) and another
for the stacks (PAGE_KERNEL). Use VM_NO_GUARD to map them
adjacent (as expected by the rest of the code).
Reported-by: Pavel Machek <pavel@ucw.cz>
Tested-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When "eagerfpu=off" is given as a command-line input, the kernel
should disable AVX support.
The Task Switched bit used for lazy context switching does not
support AVX. If AVX is enabled without eagerfpu context
switching, one task's AVX state could become corrupted or leak
to other tasks. This is a bug and has bad security implications.
This only affects systems that have AVX/AVX2/AVX512 and this
issue will be found only when one actually uses AVX/AVX2/AVX512
_AND_ does eagerfpu=off.
Reference: Intel Software Developer's Manual Vol. 3A
Sec. 2.5 Control Registers:
TS Task Switched bit (bit 3 of CR0) -- Allows the saving of the
x87 FPU/ MMX/SSE/SSE2/SSE3/SSSE3/SSE4 context on a task switch
to be delayed until an x87 FPU/MMX/SSE/SSE2/SSE3/SSSE3/SSE4
instruction is actually executed by the new task.
Sec. 13.4.1 Using the TS Flag to Control the Saving of the X87
FPU and SSE State
When the TS flag is set, the processor monitors the instruction
stream for x87 FPU, MMX, SSE instructions. When the processor
detects one of these instructions, it raises a
device-not-available exeception (#NM) prior to executing the
instruction.
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: yu-cheng yu <yu-cheng.yu@intel.com>
Link: http://lkml.kernel.org/r/1452119094-7252-5-git-send-email-yu-cheng.yu@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This issue is a fallout from the command-line parsing move.
When "eagerfpu=off" is given as a command-line input, the kernel
should disable MPX support. The decision for turning off MPX was
made in fpu__init_system_ctx_switch(), which is after the
selection of the XSAVE format. This patch fixes it by getting
that decision done earlier in fpu__init_system_xstate().
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: yu-cheng yu <yu-cheng.yu@intel.com>
Link: http://lkml.kernel.org/r/1452119094-7252-4-git-send-email-yu-cheng.yu@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When "noxsave" is given as a command-line input, the kernel
should disable XGETBV1. This issue currently does not cause any
actual problems. XGETBV1 is only useful if we have something
using the 'init optimization' (i.e. xsaveopt, xsaves). We
already clear both of those in fpu__xstate_clear_all_cpu_caps().
But this is good for completeness.
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: yu-cheng yu <yu-cheng.yu@intel.com>
Link: http://lkml.kernel.org/r/1452119094-7252-3-git-send-email-yu-cheng.yu@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The function fpu__init_system() is executed before
parse_early_param(). This causes wrong FPU configuration. This
patch fixes this issue by parsing boot_command_line in the
beginning of fpu__init_system().
With all four patches in this series, each parameter disables
features as the following:
eagerfpu=off: eagerfpu, avx, avx2, avx512, mpx
no387: fpu
nofxsr: fxsr, fxsropt, xmm
noxsave: xsave, xsaveopt, xsaves, xsavec, avx, avx2, avx512,
mpx, xgetbv1 noxsaveopt: xsaveopt
noxsaves: xsaves
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: yu-cheng yu <yu-cheng.yu@intel.com>
Link: http://lkml.kernel.org/r/1452119094-7252-2-git-send-email-yu-cheng.yu@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
ldt_gdt.c relies on cross-cpu invalidation of SS to do one of
its tests. On 32-bit builds, this works fine, but on 64-bit
builds, it only works if the kernel has proper SS sigcontext
handling for 64-bit user programs.
Since the SS fixes are currently reverted, restrict the test
case to 32 bits for now.
In principle, I could change the test to use a different segment
register, but it would be messy: CS can't point to the LDT for
64-bit code, and the other registers don't result in immediate
faults because they aren't reloaded on kernel -> user
transitions.
When we fix sigcontext (in 4.6?), we can revert this.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/231591d9122d282402d8f53175134f8db5b3bc73.1452561752.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In CONFIG_PAGEALLOC_DEBUG=y builds, we disable 2M pages.
Unfortunatly when we split up mappings during boot,
split_page_count() doesn't take this into account, and
starts decrementing an empty direct_pages_count[] level.
This results in /proc/meminfo showing crazy things like:
DirectMap2M: 18446744073709543424 kB
Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 platform updates from Ingo Molnar:
"Two changes:
- one to quirk-save/restore certain system MSRs across
suspend/resume, to make certain Intel systems work better
(Chen Yu)
- and also to constify a read only structure (Julia Lawall)"
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/platform/calgary: Constify cal_chipset_ops structures
x86/pm: Introduce quirk framework to save/restore extra MSR registers around suspend/resume
Pull x86 mm updates from Ingo Molnar:
"The main changes in this cycle were:
- make the debugfs 'kernel_page_tables' file read-only, as it only
has read ops. (Borislav Petkov)
- micro-optimize clflush_cache_range() (Chris Wilson)
- swiotlb enhancements, which fixes certain KVM emulated devices
(Igor Mammedov)
- fix an LDT related debug message (Jan Beulich)
- modularize CONFIG_X86_PTDUMP (Kees Cook)
- tone down an overly alarming warning (Laura Abbott)
- Mark variable __initdata (Rasmus Villemoes)
- PAT additions (Toshi Kani)"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Micro-optimise clflush_cache_range()
x86/mm/pat: Change free_memtype() to support shrinking case
x86/mm/pat: Add untrack_pfn_moved for mremap
x86/mm: Drop WARN from multi-BAR check
x86/LDT: Print the real LDT base address
x86/mm/64: Enable SWIOTLB if system has SRAT memory regions above MAX_DMA32_PFN
x86/mm: Introduce max_possible_pfn
x86/mm/ptdump: Make (debugfs)/kernel_page_tables read-only
x86/mm/mtrr: Mark the 'range_new' static variable in mtrr_calc_range_state() as __initdata
x86/mm: Turn CONFIG_X86_PTDUMP into a module
Pull x86 fpu updates from Ingo Molnar:
"This cleans up the FPU fault handling methods to be more robust, and
moves eligible variables to .init.data"
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Put a few variables in .init.data
x86/fpu: Get rid of xstate_fault()
x86/fpu: Add an XSTATE_OP() macro
Pull x86 cpu updates from Ingo Molnar:
"The main changes in this cycle were:
- Improved CPU ID handling code and related enhancements (Borislav
Petkov)
- RDRAND fix (Len Brown)"
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Replace RDRAND forced-reseed with simple sanity check
x86/MSR: Chop off lower 32-bit value
x86/cpu: Fix MSR value truncation issue
x86/cpu/amd, kvm: Satisfy guest kernel reads of IC_CFG MSR
kvm: Add accessors for guest CPU's family, model, stepping
x86/cpu: Unify CPU family, model, stepping calculation
Pull small x86 boot update from Ingo Molnar:
"A single update to the MAINTAINERS file"
* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tboot: Update maintainer list for Intel TXT
Pull x86 asm updates from Ingo Molnar:
"The main changes in this cycle were:
- vDSO and asm entry improvements (Andy Lutomirski)
- Xen paravirt entry enhancements (Boris Ostrovsky)
- asm entry labels enhancement (Borislav Petkov)
- and other misc changes (Thomas Gleixner, me)"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vsdo: Fix build on PARAVIRT_CLOCK=y, KVM_GUEST=n
Revert "x86/kvm: On KVM re-enable (e.g. after suspend), update clocks"
x86/entry/64_compat: Make labels local
x86/platform/uv: Include clocksource.h for clocksource_touch_watchdog()
x86/vdso: Enable vdso pvclock access on all vdso variants
x86/vdso: Remove pvclock fixmap machinery
x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap
x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
x86/kvm: On KVM re-enable (e.g. after suspend), update clocks
x86/entry/64: Bypass enter_from_user_mode on non-context-tracking boots
x86/asm: Add asm macros for static keys/jump labels
x86/asm: Error out if asm/jump_label.h is included inappropriately
context_tracking: Switch to new static_branch API
x86/entry, x86/paravirt: Remove the unused usergs_sysret32 PV op
x86/paravirt: Remove the unused irq_enable_sysexit pv op
x86/xen: Avoid fast syscall path for Xen PV guests
Pull x86 apic updates from Ingo Molnar:
"The main changes in this cycle were:
- introduce optimized single IPI sending methods on modern APICs
(Linus Torvalds, Thomas Gleixner)
- kexec/crash APIC handling fixes and enhancements (Hidehiro Kawai)
- extend lapic vector saving/restoring to the CMCI (MCE) vector as
well (Juergen Gross)
- various fixes and enhancements (Jake Oshins, Len Brown)"
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/irq: Export functions to allow MSI domains in modules
Documentation: Document kernel.panic_on_io_nmi sysctl
x86/nmi: Save regs in crash dump on external NMI
x86/apic: Introduce apic_extnmi command line parameter
kexec: Fix race between panic() and crash_kexec()
panic, x86: Allow CPUs to save registers even if looping in NMI context
panic, x86: Fix re-entrance problem due to panic on NMI
x86/apic: Fix the saving and restoring of lapic vectors during suspend/resume
x86/smpboot: Re-enable init_udelay=0 by default on modern CPUs
x86/smp: Remove single IPI wrapper
x86/apic: Use default send single IPI wrapper
x86/apic: Provide default send single IPI wrapper
x86/apic: Implement single IPI for apic_noop
x86/apic: Wire up single IPI for apic_numachip
x86/apic: Wire up single IPI for x2apic_uv
x86/apic: Implement single IPI for x2apic_phys
x86/apic: Wire up single IPI for bigsmp_apic
x86/apic: Remove pointless indirections from bigsmp_apic
x86/apic: Wire up single IPI for apic_physflat
x86/apic: Remove pointless indirections from apic_physflat
...
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- tickless load average calculation enhancements (Byungchul Park)
- vtime handling enhancements (Frederic Weisbecker)
- scalability improvement via properly aligning a key structure field
(Jiri Olsa)
- various stop_machine() fixes (Oleg Nesterov)
- sched/numa enhancement (Rik van Riel)
- various fixes and improvements (Andi Kleen, Dietmar Eggemann,
Geliang Tang, Hiroshi Shimamoto, Joonwoo Park, Peter Zijlstra,
Waiman Long, Wanpeng Li, Yuyang Du)"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
sched/fair: Fix new task's load avg removed from source CPU in wake_up_new_task()
sched/core: Move sched_entity::avg into separate cache line
x86/fpu: Properly align size in CHECK_MEMBER_AT_END_OF() macro
sched/deadline: Fix the earliest_dl.next logic
sched/fair: Disable the task group load_avg update for the root_task_group
sched/fair: Move the cache-hot 'load_avg' variable into its own cacheline
sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats()
sched/core: Move the sched_to_prio[] arrays out of line
sched/cputime: Convert vtime_seqlock to seqcount
sched/cputime: Introduce vtime accounting check for readers
sched/cputime: Rename vtime_accounting_enabled() to vtime_accounting_cpu_enabled()
sched/cputime: Correctly handle task guest time on housekeepers
sched/cputime: Clarify vtime symbols and document them
sched/cputime: Remove extra cost in task_cputime()
sched/fair: Make it possible to account fair load avg consistently
sched/fair: Modify the comment about lock assumptions in migrate_task_rq_fair()
stop_machine: Clean up the usage of the preemption counter in cpu_stopper_thread()
stop_machine: Shift the 'done != NULL' check from cpu_stop_signal_done() to callers
stop_machine: Kill cpu_stop_done->executed
stop_machine: Change __stop_cpus() to rely on cpu_stop_queue_work()
...
Pull RAS updates from Ingo Molnar:
"Various x86 MCE fixes and small enhancements"
* 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Make usable address checks Intel-only
x86/mce: Add the missing memory error check on AMD
x86/RAS: Remove mce.usable_addr
x86/mce: Do not enter deferred errors into the generic pool twice
Pull perf updates from Ingo Molnar:
"Kernel side changes:
- Intel Knights Landing support. (Harish Chegondi)
- Intel Broadwell-EP uncore PMU support. (Kan Liang)
- Core code improvements. (Peter Zijlstra.)
- Event filter, LBR and PEBS fixes. (Stephane Eranian)
- Enable cycles:pp on Intel Atom. (Stephane Eranian)
- Add cycles:ppp support for Skylake. (Andi Kleen)
- Various x86 NMI overhead optimizations. (Andi Kleen)
- Intel PT enhancements. (Takao Indoh)
- AMD cache events fix. (Vince Weaver)
Tons of tooling changes:
- Show random perf tool tips in the 'perf report' bottom line
(Namhyung Kim)
- perf report now defaults to --group if the perf.data file has
grouped events, try it with:
# perf record -e '{cycles,instructions}' -a sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.093 MB perf.data (1247 samples) ]
# perf report
# Samples: 1K of event 'anon group { cycles, instructions }'
# Event count (approx.): 1955219195
#
# Overhead Command Shared Object Symbol
2.86% 0.22% swapper [kernel.kallsyms] [k] intel_idle
1.05% 0.33% firefox libxul.so [.] js::SetObjectElement
1.05% 0.00% kworker/0:3 [kernel.kallsyms] [k] gen6_ring_get_seqno
0.88% 0.17% chrome chrome [.] 0x0000000000ee27ab
0.65% 0.86% firefox libxul.so [.] js::ValueToId<(js::AllowGC)1>
0.64% 0.23% JS Helper libxul.so [.] js::SplayTree<js::jit::LiveRange*, js::jit::LiveRange>::splay
0.62% 1.27% firefox libxul.so [.] js::GetIterator
0.61% 1.74% firefox libxul.so [.] js::NativeSetProperty
0.61% 0.31% firefox libxul.so [.] js::SetPropertyByDefining
- Introduce the 'perf stat record/report' workflow:
Generate perf.data files from 'perf stat', to tap into the
scripting capabilities perf has instead of defining a 'perf stat'
specific scripting support to calculate event ratios, etc.
Simple example:
$ perf stat record -e cycles usleep 1
Performance counter stats for 'usleep 1':
1,134,996 cycles
0.000670644 seconds time elapsed
$ perf stat report
Performance counter stats for '/home/acme/bin/perf stat record -e cycles usleep 1':
1,134,996 cycles
0.000670644 seconds time elapsed
$
It generates PERF_RECORD_ userspace records to store the details:
$ perf report -D | grep PERF_RECORD
0xf0 [0x28]: PERF_RECORD_THREAD_MAP nr: 1 thread: 27637
0x118 [0x12]: PERF_RECORD_CPU_MAP nr: 1 cpu: 65535
0x12a [0x40]: PERF_RECORD_STAT_CONFIG
0x16a [0x30]: PERF_RECORD_STAT
-1 -1 0x19a [0x40]: PERF_RECORD_MMAP -1/0: [0xffffffff81000000(0x1f000000) @ 0xffffffff81000000]: x [kernel.kallsyms]_text
0x1da [0x18]: PERF_RECORD_STAT_ROUND
[acme@ssdandy linux]$
An effort was made to make perf.data files generated like this to
not generate cryptic messages when processed by older tools.
The 'perf script' bits need rebasing, will go up later.
- Make command line options always available, even when they depend
on some feature being enabled, warning the user about use of such
options (Wang Nan)
- Support hw breakpoint events (mem:0xAddress) in the default output
mode in 'perf script' (Wang Nan)
- Fixes and improvements for supporting annotating ARM binaries,
support ARM call and jump instructions, more work needed to have
arch specific stuff separated into tools/perf/arch/*/annotate/
(Russell King)
- Add initial 'perf config' command, for now just with a --list
command to the contents of the configuration file in use and a
basic man page describing its format, commands for doing edits and
detailed documentation are being reviewed and proof-read. (Taeung
Song)
- Allows BPF scriptlets specify arguments to be fetched using DWARF
info, using a prologue generated at compile/build time (He Kuang,
Wang Nan)
- Allow attaching BPF scriptlets to module symbols (Wang Nan)
- Allow attaching BPF scriptlets to userspace code using uprobe (Wang
Nan)
- BPF programs now can specify 'perf probe' tunables via its section
name, separating key=val values using semicolons (Wang Nan)
Testing some of these new BPF features:
Use case: get callchains when receiving SSL packets, filter then in the
kernel, at arbitrary place.
# cat ssl.bpf.c
#define SEC(NAME) __attribute__((section(NAME), used))
struct pt_regs;
SEC("func=__inet_lookup_established hnum")
int func(struct pt_regs *ctx, int err, unsigned short port)
{
return err == 0 && port == 443;
}
char _license[] SEC("license") = "GPL";
int _version SEC("version") = LINUX_VERSION_CODE;
#
# perf record -a -g -e ssl.bpf.c
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.787 MB perf.data (3 samples) ]
# perf script | head -30
swapper 0 [000] 58783.268118: perf_bpf_probe:func: (ffffffff816a0f60) hnum=0x1bb
8a0f61 __inet_lookup_established (/lib/modules/4.3.0+/build/vmlinux)
896def ip_rcv_finish (/lib/modules/4.3.0+/build/vmlinux)
8976c2 ip_rcv (/lib/modules/4.3.0+/build/vmlinux)
855eba __netif_receive_skb_core (/lib/modules/4.3.0+/build/vmlinux)
8565d8 __netif_receive_skb (/lib/modules/4.3.0+/build/vmlinux)
8572a8 process_backlog (/lib/modules/4.3.0+/build/vmlinux)
856b11 net_rx_action (/lib/modules/4.3.0+/build/vmlinux)
2a284b __do_softirq (/lib/modules/4.3.0+/build/vmlinux)
2a2ba3 irq_exit (/lib/modules/4.3.0+/build/vmlinux)
96b7a4 do_IRQ (/lib/modules/4.3.0+/build/vmlinux)
969807 ret_from_intr (/lib/modules/4.3.0+/build/vmlinux)
2dede5 cpu_startup_entry (/lib/modules/4.3.0+/build/vmlinux)
95d5bc rest_init (/lib/modules/4.3.0+/build/vmlinux)
1163ffa start_kernel ([kernel.vmlinux].init.text)
11634d7 x86_64_start_reservations ([kernel.vmlinux].init.text)
1163623 x86_64_start_kernel ([kernel.vmlinux].init.text)
qemu-system-x86 9178 [003] 58785.792417: perf_bpf_probe:func: (ffffffff816a0f60) hnum=0x1bb
8a0f61 __inet_lookup_established (/lib/modules/4.3.0+/build/vmlinux)
896def ip_rcv_finish (/lib/modules/4.3.0+/build/vmlinux)
8976c2 ip_rcv (/lib/modules/4.3.0+/build/vmlinux)
855eba __netif_receive_skb_core (/lib/modules/4.3.0+/build/vmlinux)
8565d8 __netif_receive_skb (/lib/modules/4.3.0+/build/vmlinux)
856660 netif_receive_skb_internal (/lib/modules/4.3.0+/build/vmlinux)
8566ec netif_receive_skb_sk (/lib/modules/4.3.0+/build/vmlinux)
430a br_handle_frame_finish ([bridge])
48bc br_handle_frame ([bridge])
855f44 __netif_receive_skb_core (/lib/modules/4.3.0+/build/vmlinux)
8565d8 __netif_receive_skb (/lib/modules/4.3.0+/build/vmlinux)
#
- Use 'perf probe' various options to list functions, see what
variables can be collected at any given point, experiment first
collecting without a filter, then filter, use it together with
'perf trace', 'perf top', with or without callchains, if it
explodes, please tell us!
- Introduce a new callchain mode: "folded", that will list per line
representations of all callchains for a give histogram entry,
facilitating 'perf report' output processing by other tools, such
as Brendan Gregg's flamegraph tools (Namhyung Kim)
E.g:
# perf report | grep -v ^# | head
18.37% 0.00% swapper [kernel.kallsyms] [k] cpu_startup_entry
|
---cpu_startup_entry
|
|--12.07%--start_secondary
|
--6.30%--rest_init
start_kernel
x86_64_start_reservations
x86_64_start_kernel
#
Becomes, in "folded" mode:
# perf report -g folded | grep -v ^# | head -5
18.37% 0.00% swapper [kernel.kallsyms] [k] cpu_startup_entry
12.07% cpu_startup_entry;start_secondary
6.30% cpu_startup_entry;rest_init;start_kernel;x86_64_start_reservations;x86_64_start_kernel
16.90% 0.00% swapper [kernel.kallsyms] [k] call_cpuidle
11.23% call_cpuidle;cpu_startup_entry;start_secondary
5.67% call_cpuidle;cpu_startup_entry;rest_init;start_kernel;x86_64_start_reservations;x86_64_start_kernel
16.90% 0.00% swapper [kernel.kallsyms] [k] cpuidle_enter
11.23% cpuidle_enter;call_cpuidle;cpu_startup_entry;start_secondary
5.67% cpuidle_enter;call_cpuidle;cpu_startup_entry;rest_init;start_kernel;x86_64_start_reservations;x86_64_start_kernel
15.12% 0.00% swapper [kernel.kallsyms] [k] cpuidle_enter_state
#
The user can also select one of "count", "period" or "percent" as
the first column.
... and lots of infrastructure enhancements, plus fixes and other
changes, features I failed to list - see the shortlog and the git log
for details"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (271 commits)
perf evlist: Add --trace-fields option to show trace fields
perf record: Store data mmaps for dwarf unwind
perf libdw: Check for mmaps also in MAP__VARIABLE tree
perf unwind: Check for mmaps also in MAP__VARIABLE tree
perf unwind: Use find_map function in access_dso_mem
perf evlist: Remove perf_evlist__(enable|disable)_event functions
perf evlist: Make perf_evlist__open() open evsels with their cpus and threads (like perf record does)
perf report: Show random usage tip on the help line
perf hists: Export a couple of hist functions
perf diff: Use perf_hpp__register_sort_field interface
perf tools: Add overhead/overhead_children keys defaults via string
perf tools: Remove list entry from struct sort_entry
perf tools: Include all tools/lib directory for tags/cscope/TAGS targets
perf script: Align event name properly
perf tools: Add missing headers in perf's MANIFEST
perf tools: Do not show trace command if it's not compiled in
perf report: Change default to use event group view
perf top: Decay periods in callchains
tools lib: Move bitmap.[ch] from tools/perf/ to tools/{lib,include}/
tools lib: Sync tools/lib/find_bit.c with the kernel
...
Pull locking updates from Ingo Molnar:
"So we have a laundry list of locking subsystem changes:
- continuing barrier API and code improvements
- futex enhancements
- atomics API improvements
- pvqspinlock enhancements: in particular lock stealing and adaptive
spinning
- qspinlock micro-enhancements"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op
futex: Cleanup the goto confusion in requeue_pi()
futex: Remove pointless put_pi_state calls in requeue()
futex: Document pi_state refcounting in requeue code
futex: Rename free_pi_state() to put_pi_state()
futex: Drop refcount if requeue_pi() acquired the rtmutex
locking/barriers, arch: Remove ambiguous statement in the smp_store_mb() documentation
lcoking/barriers, arch: Use smp barriers in smp_store_release()
locking/cmpxchg, arch: Remove tas() definitions
locking/pvqspinlock: Queue node adaptive spinning
locking/pvqspinlock: Allow limited lock stealing
locking/pvqspinlock: Collect slowpath lock statistics
sched/core, locking: Document Program-Order guarantees
locking, sched: Introduce smp_cond_acquire() and use it
locking/pvqspinlock, x86: Optimize the PV unlock code path
locking/qspinlock: Avoid redundant read of next pointer
locking/qspinlock: Prefetch the next node cacheline
locking/qspinlock: Use _acquire/_release() versions of cmpxchg() & xchg()
atomics: Add test for atomic operations with _relaxed variants
Pull RCU updates from Ingo Molnar:
"The changes in this cycle were:
- Adding transitivity uniformly to rcu_node structure ->lock
acquisitions. (This is implemented by the first two commits on top
of v4.4-rc2 due to the pervasive nature of this change.)
- Documentation updates, including RCU requirements.
- Expedited grace-period changes.
- Miscellaneous fixes.
- Linked-list fixes, courtesy of KTSAN.
- Torture-test updates.
- Late-breaking fix to sysrq-generated crash.
One thing I should note is that these pieces of documentation are
fairly large files:
.../RCU/Design/Requirements/Requirements.html | 2897 ++++++++++++++++++++
.../RCU/Design/Requirements/Requirements.htmlx | 2741 ++++++++++++++++++
and are written in HTML, not the usual .txt style. I hope they are
fine"
Paul McKenney explains the html docs:
"For whatever it is worth, the reason for this unconventional choice
was that attempts to do the diagrams in ASCII art failed miserably.
And attempts to do ASCII art for the upcoming documentation of the
data structures failed even more miserably"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (49 commits)
sysrq: Fix warning in sysrq generated crash.
list: Add lockless list traversal primitives
rcu: Make rcu_gp_init() be bool rather than int
rcu: Move wakeup out from under rnp->lock
rcu: Fix comment for rcu_dereference_raw_notrace
rcu: Don't redundantly disable irqs in rcu_irq_{enter,exit}()
rcu: Make cpu_needs_another_gp() be bool
rcu: Eliminate unused rcu_init_one() argument
rcu: Remove TINY_RCU bloat from pointless boot parameters
torture: Place console.log files correctly from the get-go
torture: Abbreviate console error dump
rcutorture: Print symbolic name for ->gp_state
rcutorture: Print symbolic name for rcu_torture_writer_state
rcutorture: Remove CONFIG_RCU_USER_QS from rcutorture selftest doc
rcutorture: Default grace period to three minutes, allow override
rcutorture: Dump stack when GP kthread stalls
rcutorture: Flag nonexistent RCU GP kthread
rcutorture: Add batch number to script printout
Documentation/memory-barriers.txt: Fix ACCESS_ONCE thinko
documentation: Update RCU requirements based on expedited changes
...
Pull vfs xattr updates from Al Viro:
"Andreas' xattr cleanup series.
It's a followup to his xattr work that went in last cycle; -0.5KLoC"
* 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
xattr handlers: Simplify list operation
ocfs2: Replace list xattr handler operations
nfs: Move call to security_inode_listsecurity into nfs_listxattr
xfs: Change how listxattr generates synthetic attributes
tmpfs: listxattr should include POSIX ACL xattrs
tmpfs: Use xattr handler infrastructure
btrfs: Use xattr handler infrastructure
vfs: Distinguish between full xattr names and proper prefixes
posix acls: Remove duplicate xattr name definitions
gfs2: Remove gfs2_xattr_acl_chmod
vfs: Remove vfs_xattr_cmp
Pull vfs RCU symlink updates from Al Viro:
"Replacement of ->follow_link/->put_link, allowing to stay in RCU mode
even if the symlink is not an embedded one.
No changes since the mailbomb on Jan 1"
* 'work.symlinks' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
switch ->get_link() to delayed_call, kill ->put_link()
kill free_page_put_link()
teach nfs_get_link() to work in RCU mode
teach proc_self_get_link()/proc_thread_self_get_link() to work in RCU mode
teach shmem_get_link() to work in RCU mode
teach page_get_link() to work in RCU mode
replace ->follow_link() with new method that could stay in RCU mode
don't put symlink bodies in pagecache into highmem
namei: page_getlink() and page_follow_link_light() are the same thing
ufs: get rid of ->setattr() for symlinks
udf: don't duplicate page_symlink_inode_operations
logfs: don't duplicate page_symlink_inode_operations
switch befs long symlinks to page_symlink_operations
Pull vfs compat_ioctl fixes from Al Viro:
"This is basically Jann's patches from last week. I have _not_
included the stuff like switching i2c to ->compat_ioctl() into this
one - those need more testing.
Ideally I would like fs/compat_ioctl.c shrunk a lot, but that's a
separate story"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
compat_ioctl: don't call do_ioctl under set_fs(KERNEL_DS)
compat_ioctl: don't pass fd around when not needed
compat_ioctl: don't look up the fd twice
When decompressing kernel image during x86 bootup, malloc memory
for ELF program headers may run out of heap space, which leads
to system halt. This patch doubles BOOT_HEAP_SIZE to 64KB.
Tested with 32-bit kernel which failed to boot without this patch.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: <stable@vger.kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When switch_mm() activates a new PGD, it also sets a bit that
tells other CPUs that the PGD is in use so that TLB flush IPIs
will be sent. In order for that to work correctly, the bit
needs to be visible prior to loading the PGD and therefore
starting to fill the local TLB.
Document all the barriers that make this work correctly and add
a couple that were missing.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Cc: stable@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Single fix for machines with pages > 4k (PPC mostly). There's a bug in our
optimal transfer size code where we don't account for pages > 4k and can set
the transfer size to be less than the page size causing nasty failures.
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABAgAGBQJWkUiHAAoJEDeqqVYsXL0MArAH/2XWKJGI9tr0AQQ79WGD/kjV
KZsUjKP9sshjxXBB8jK+TFvO/Z2BE92XzXtIswgdc6OdeANyhE+LhdbNpuooX2gv
lW28RAbSVLrwJzyr7B2VGAiCOR9opGu2opOJnQMo05pSAFqJxNG4l1Ap+4pOX9/1
ffTwidgk6bWs5zKlDwbETHVv/X50U90O5MyJBvf9KC7YvFhD41OIz7QEqiHgs1qW
T6J80ZH4ZhWN+pRMHlybJ7RwP7TVjUSkDRLWRSX8IWjisbDqY86JVVt/BUA3lbhL
sLDc/mku3ZPsRAUJL544ydAPWA2DtJ9PjPkYuMgPQOCyUfAKq5OTf6hG0Mghm34=
=n00Y
-----END PGP SIGNATURE-----
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fix from James Bottomley:
"A single fix for machines with pages > 4k (PPC mostly).
There's a bug in our optimal transfer size code where we don't account
for pages > 4k and can set the transfer size to be less than the page
size causing nasty failures"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
sd: Reject optimal transfer length smaller than page size
TI DRA7xx host bridge driver
Mark driver as broken (Richard Cochran)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWkSPdAAoJEFmIoMA60/r80kQP/jPgr13qJ0bt+DE8WMZ98zRu
BKpgL2MMQ2PzHjpHvrce1y2WceSdnPvloB5fElpi0PtYRzjmhZ607tgBpUjUEm5U
u/jmFa28PlzzGMu8zQuV6Ge4zFp8PdcNGxZqRNpxZl4HLltxYgd7LnM9wgmSAUo7
VwZlO5QIjHUPQ2oFC1o7F7pRhme1RpfUDwXaoHAER1E44pHBRksjY/xRlP/Nknh5
7QOcb62e6N/ngR6Vb01evNVSdtH1+HQQlPaxPALw3sItaRsqqB9rLfXecw5Xu1NN
h2DiRTcG/2X5bA8yxYZtmkFyvkFQFHjoxgvD2RHf7jb9TX0qyryFJceAKyyAUmFT
A4z3XSmd54tXjetFkSzYsUbb0Egp1atBLT1Uw8d7UH4djebnwh3hTqPNFpRZBDeA
AZRIkhTeupRjJGgtYsVghsiOQeS0yg4OboDVBPJljELOcsYN/nuoNvreXCp8qk4H
pHyq9AQ3XVu3OS94SvDNmTdssUNARPL020/K8FDOkbOnD9EDK/J/PdiEGCZtNea4
nu8qVo1PaROd5HBxQ2zB6PDiAIXAEfuqmFZPRXd5xa/DHxSfkHRJMmdE1TUVeFHn
rB+8Fz/PqdkA4wIM3STWtVZOCjYnn6WkeWYS95jX9PmVc1Pa8ByNYkrYc5mxGtar
uzqdDmZqYJAzRKkIadMR
=srOL
-----END PGP SIGNATURE-----
Merge tag 'pci-v4.4-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI fixlet from Bjorn Helgaas:
"This marks the TI DRA7xx host bridge driver as broken. Apparently it
has never worked without some additional out-of-tree code, so I'm
going to mark it broken now and remove it completely next cycle unless
it's fixed"
* tag 'pci-v4.4-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI: dra7xx: Mark driver as broken
New features:
- Allow using trace events fields as sort order keys, making 'perf evlist --trace_fields'
show those, and then the user can select a subset and use like:
perf top -e sched:sched_switch -s prev_comm,next_comm
That works as well in 'perf report' when handling files containing
tracepoints.
The default when just tracepoint events are found in a perf.data file is to
format it like ftrace, using the libtraceevent formatters, plugins, etc (Namhyung Kim)
- Add support in 'perf script' to process 'perf stat record' generated files,
culminating in a python perf script that calculates CPI (Cycles per
Instruction) (Jiri Olsa)
- Show random perf tool tips in the 'perf report' bottom line (Namhyung Kim)
- perf report now defaults to --group if the perf.data file has grouped events, try it with:
# perf record -e '{cycles,instructions}' -a sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.093 MB perf.data (1247 samples) ]
# perf report
# Samples: 1K of event 'anon group { cycles, instructions }'
# Event count (approx.): 1955219195
#
# Overhead Command Shared Object Symbol
2.86% 0.22% swapper [kernel.kallsyms] [k] intel_idle
1.05% 0.33% firefox libxul.so [.] js::SetObjectElement
1.05% 0.00% kworker/0:3 [kernel.kallsyms] [k] gen6_ring_get_seqno
0.88% 0.17% chrome chrome [.] 0x0000000000ee27ab
0.65% 0.86% firefox libxul.so [.] js::ValueToId<(js::AllowGC)1>
0.64% 0.23% JS Helper libxul.so [.] js::SplayTree<js::jit::LiveRange*, js::jit::LiveRange>::splay
0.62% 1.27% firefox libxul.so [.] js::GetIterator
0.61% 1.74% firefox libxul.so [.] js::NativeSetProperty
0.61% 0.31% firefox libxul.so [.] js::SetPropertyByDefining
User visible fixes:
- Coect data mmaps so that the DWARF unwinder can handle usecases needing them,
like softice (Jiri Olsa)
- Decay callchains in fractal mode, fixing up cases where 'perf top -g' would
show entries with more than 100% (Namhyung Kim)
Infrastructure:
- Sync tools/lib with the lib/ in the kernel sources for find_bit.c and
move bitmap.[ch] from tools/perf/util/ to tools/lib/ (Arnaldo Carvalho de Melo)
- No need to set attr.sample_freq in some 'perf test' entries that only
want to deal with PERF_RECORD_ meta-events, improve a bit error output
for CQM test (Arnaldo Carvalho de Melo)
- Fix python binding build, adding some missing object files now required
due to cpumap using find_bit stuff (Arnaldo Carvalho de Melo)
- tools/build improvemnts (Jiri Olsa)
- Add more files to cscope/ctags databases (Jiri Olsa)
- Do not show 'trace' in 'perf help' if it is not compiled in (Jiri Olsa)
- Make perf_evlist__open() open evsels with their cpus and threads,
like perf record does, making them consistent (Adrian Hunter)
- Fix pmu snapshot initialization bug (Stephane Eranian)
- Add missing headers in perf's MANIFEST (Wang Nan)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWj/imAAoJENZQFvNTUqpADg4P/2/zn/JlIQbFMBo0YS4HtBtp
OIvpU9XV0YyMWEbxEUfNOM3qnOfqZWOuTHvqVywnz1JXHxKNt9j7KjPhBA2avLEc
WuloOf7Af1eUjroKW/tl1FsPatW0x0zXEVqR5XjUrPfXge6rYPuMwQ2f4oTCzc6H
uf5fH1H2vK/iQsOu9X+IoGEKxoJF22zabKDQy7q+48gSq/TVtz1wJfHqYBCCFOh8
c+MEMnOZ3F0DXJ9iRsXbChcOmkHTfAnu5CM8GlkvM38VnJA69K+AkFXC3YuExvA1
PghTVZMBsYDyHNq+ewIshrJ1xGz0/OwXX2IUtJfFPGbWhMZYl2NCg1SuLfTl7jCF
m/iR5LzdHpLiEIaimuK+8eIVfLdRyxZUeN9/BQzUFrl7pqe2n1xIjIvUX0djHPv9
YlQ6ZI/g/nJ1AunC0QhWiwSkmUas/YATNKt7CunOnSjJey2p9T91lhzstqvCur4Q
XH1iIA4o2A67vRLbVEb2eh54QT2BASO1H+suYPNTvU55W5gGz9pJJjVxIbpT5+lk
dAZ8vXwxOg1jMFIgDrm6mbosGcs4lUBeeKJbE7ImevpSyOECR8lRdV4yX+bfkuUe
FL5fNfJyFcr3q2p+sRTpikrq32C3iyQ08VhYYNLuVSUOQgNYEyLTmhMkotZYjmIk
JZO4/8trMU7FR2WpnGqG
=tbfd
-----END PGP SIGNATURE-----
Merge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
New features:
- Allow using trace events fields as sort order keys, making 'perf evlist --trace_fields'
show those, and then the user can select a subset and use like:
perf top -e sched:sched_switch -s prev_comm,next_comm
That works as well in 'perf report' when handling files containing
tracepoints.
The default when just tracepoint events are found in a perf.data file is to
format it like ftrace, using the libtraceevent formatters, plugins, etc (Namhyung Kim)
- Add support in 'perf script' to process 'perf stat record' generated files,
culminating in a python perf script that calculates CPI (Cycles per
Instruction) (Jiri Olsa)
- Show random perf tool tips in the 'perf report' bottom line (Namhyung Kim)
- perf report now defaults to --group if the perf.data file has grouped events, try it with:
# perf record -e '{cycles,instructions}' -a sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.093 MB perf.data (1247 samples) ]
# perf report
# Samples: 1K of event 'anon group { cycles, instructions }'
# Event count (approx.): 1955219195
#
# Overhead Command Shared Object Symbol
2.86% 0.22% swapper [kernel.kallsyms] [k] intel_idle
1.05% 0.33% firefox libxul.so [.] js::SetObjectElement
1.05% 0.00% kworker/0:3 [kernel.kallsyms] [k] gen6_ring_get_seqno
0.88% 0.17% chrome chrome [.] 0x0000000000ee27ab
0.65% 0.86% firefox libxul.so [.] js::ValueToId<(js::AllowGC)1>
0.64% 0.23% JS Helper libxul.so [.] js::SplayTree<js::jit::LiveRange*, js::jit::LiveRange>::splay
0.62% 1.27% firefox libxul.so [.] js::GetIterator
0.61% 1.74% firefox libxul.so [.] js::NativeSetProperty
0.61% 0.31% firefox libxul.so [.] js::SetPropertyByDefining
User visible fixes:
- Coect data mmaps so that the DWARF unwinder can handle usecases needing them,
like softice (Jiri Olsa)
- Decay callchains in fractal mode, fixing up cases where 'perf top -g' would
show entries with more than 100% (Namhyung Kim)
Infrastructure changes:
- Sync tools/lib with the lib/ in the kernel sources for find_bit.c and
move bitmap.[ch] from tools/perf/util/ to tools/lib/ (Arnaldo Carvalho de Melo)
- No need to set attr.sample_freq in some 'perf test' entries that only
want to deal with PERF_RECORD_ meta-events, improve a bit error output
for CQM test (Arnaldo Carvalho de Melo)
- Fix python binding build, adding some missing object files now required
due to cpumap using find_bit stuff (Arnaldo Carvalho de Melo)
- tools/build improvemnts (Jiri Olsa)
- Add more files to cscope/ctags databases (Jiri Olsa)
- Do not show 'trace' in 'perf help' if it is not compiled in (Jiri Olsa)
- Make perf_evlist__open() open evsels with their cpus and threads,
like perf record does, making them consistent (Adrian Hunter)
- Fix pmu snapshot initialization bug (Stephane Eranian)
- Add missing headers in perf's MANIFEST (Wang Nan)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
kernel test robot has reported the following crash:
BUG: unable to handle kernel NULL pointer dereference at 00000100
IP: [<c1074df6>] __queue_work+0x26/0x390
*pdpt = 0000000000000000 *pde = f000ff53f000ff53 *pde = f000ff53f000ff53
Oops: 0000 [#1] PREEMPT PREEMPT SMP SMP
CPU: 0 PID: 24 Comm: kworker/0:1 Not tainted 4.4.0-rc4-00139-g373ccbe #1
Workqueue: events vmstat_shepherd
task: cb684600 ti: cb7ba000 task.ti: cb7ba000
EIP: 0060:[<c1074df6>] EFLAGS: 00010046 CPU: 0
EIP is at __queue_work+0x26/0x390
EAX: 00000046 EBX: cbb37800 ECX: cbb37800 EDX: 00000000
ESI: 00000000 EDI: 00000000 EBP: cb7bbe68 ESP: cb7bbe38
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
CR0: 8005003b CR2: 00000100 CR3: 01fd5000 CR4: 000006b0
Stack:
Call Trace:
__queue_delayed_work+0xa1/0x160
queue_delayed_work_on+0x36/0x60
vmstat_shepherd+0xad/0xf0
process_one_work+0x1aa/0x4c0
worker_thread+0x41/0x440
kthread+0xb0/0xd0
ret_from_kernel_thread+0x21/0x40
The reason is that start_shepherd_timer schedules the shepherd work item
which uses vmstat_wq (vmstat_shepherd) before setup_vmstat allocates
that workqueue so if the further initialization takes more than HZ we
might end up scheduling on a NULL vmstat_wq. This is really unlikely
but not impossible.
Fixes: 373ccbe592 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress")
Reported-by: kernel test robot <ying.huang@linux.intel.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This replaces all code in fs/compat_ioctl.c that translated
ioctl arguments into a in-kernel structure, then performed
do_ioctl under set_fs(KERNEL_DS), with code that allocates
data on the user stack and can call the VFS ioctl handler
under USER_DS.
This is done as a hardening measure because the caller
does not know what kind of ioctl handler will be invoked,
only that no corresponding compat_ioctl handler exists and
what the ioctl command number is. The accidental
invocation of an unlocked_ioctl handler that unexpectedly
calls copy_to_user could be a severe security issue.
Signed-off-by: Jann Horn <jann@thejh.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In code in fs/compat_ioctl.c that translates ioctl arguments
into a in-kernel structure, then performs sys_ioctl, possibly
under set_fs(KERNEL_DS), this commit changes the sys_ioctl
calls to do_ioctl calls. do_ioctl is a new function that does
the same thing as sys_ioctl, but doesn't look up the fd again.
This change is made to avoid (potential) security issues
because of ioctl handlers that accept one of the ioctl
commands I2C_FUNCS, VIDEO_GET_EVENT, MTIOCPOS, MTIOCGET,
TIOCGSERIAL, TIOCSSERIAL, RTC_IRQP_READ, RTC_EPOCH_READ.
This can happen for multiple reasons:
- The ioctl command number could be reused.
- The ioctl handler might not check the full ioctl
command. This is e.g. true for drm_ioctl.
- The ioctl handler is very special, e.g. cuse_file_ioctl
The real issue is that set_fs(KERNEL_DS) is used here,
but that's fixed in a separate commit
"compat_ioctl: don't call do_ioctl under set_fs(KERNEL_DS)".
This change mitigates potential security issues by
preventing a race that permits invocation of
unlocked_ioctl handlers under KERNEL_DS through compat
code even if a corresponding compat_ioctl handler exists.
So far, no way has been identified to use this to damage
kernel memory without having CAP_SYS_ADMIN in the init ns
(with the capability, doing reads/writes at arbitrary
kernel addresses should be easy through CUSE's ioctl
handler with FUSE_IOCTL_UNRESTRICTED set).
[AV: two missed sys_ioctl() taken care of]
Signed-off-by: Jann Horn <jann@thejh.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This is the final small set of ARM SoC bug fixes for linux-4.4,
almost all regressions:
OMAP: data corruption on the Nokia N900 flash
Allwinner: Two defconfig change to get USB working again
ARM Versatile: Interrupt numbers gone bad after an older bug fix
Nomadik: Crashes from incorrect L2 cache settings
VIA vt8500: SD/MMC support on WM8650 never worked
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIVAwUAVpA/vGCrR//JCVInAQK1NQ//fk4EYJj6JMwaB/I2hGtpUpartVtXj9tj
916+NazORGc/UBP9vG/bLWnZrKzZQdeLd2HwDT9eyHtZ+Qq8bm+No312l10EATkS
EAZKy4oy0HH5BD8yE5xM+CQn9fxh8GcmGJ/hKacpJnY2e0aaGgp9iHiPWH8kRmO4
kBbUWsbfGjxzfM+a/9XGrzcTWZLSOu6VGtxvdp2rGy+Un8AWVc6FQz48xsFXWivB
5JHp0qTnyUebSNbsxq05BDorybDuNSho5d3r94H85WwfbdybCxuFYedTZ1r3vCh9
rWDI8R2vb3uotzYYjigRwyYqggUz6TewWqwnyeyGsMHOCEU/q/2n2ASLhdxKl2x8
UZ+kKDhRiraaxnsNh/apVy4yvs7KmsUM6DgxzvXW3uyrrLrjJyGHKY5doHt5Z5mJ
ErhOW024skA8GNiWObmNiKzUdKEbO4e7ReFgwbLbA8W6ACVVMNdL4udsmkb+nj25
+VFWOg5WVG5WJFaEoSOGze7J0+SSwu7wP0HRPmWFotslExh5HsrvkSw429sDDB0R
4OwT0Ru0vDZvGrHD+Bu6mFJO1nL2kWKVsolyben1sA+OIbm7eoHcennRUFf2sog0
yWD03DT9iz//bmWPHVntcdRTTaXYvaChPWzHxlbBxw2DCfRkzy4Tx/Ic44s2Kjuq
PgQjTmNlnQ8=
=Z1E0
-----END PGP SIGNATURE-----
Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Arnd Bergmann:
"This is the final small set of ARM SoC bug fixes for linux-4.4, almost
all regressions:
OMAP:
- data corruption on the Nokia N900 flash
Allwinner:
- Two defconfig change to get USB working again
ARM Versatile:
- Interrupt numbers gone bad after an older bug fix
Nomadik:
- Crashes from incorrect L2 cache settings
VIA vt8500:
- SD/MMC support on WM8650 never worked"
* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
dts: vt8500: Add SDHC node to DTS file for WM8650
ARM: Fix broken USB support in multi_v7_defconfig for sunxi devices
ARM: versatile: fix MMC/SD interrupt assignment
ARM: nomadik: set latencies to 8 cycles
ARM: OMAP2+: Fix onenand rate detection to avoid filesystem corruption
ARM: Fix broken USB support in sunxi_defconfig
a patch found in your master branch but not yet in the kvm/next branch
that is destined for 4.5.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJWj+yRAAoJEL/70l94x66DulYH/0OGP+yIHDDFlBqtPRm6q0pr
r8pSVRPPd4GY2SOJDBsBvMmWphFSYKIoCTyMbFnikADHM2yh/pycwLU/uzCM5xQl
uABMsCUntwbGaKq+A4bOvsNO49ueRCkML4ToVuKNTeuEKRYfdnlj3XcAMMgsUfEF
QGz8W2cm9xPn69df91cfBuFLLFeQVv2XsjA5WpqzzvWy5HEs1F07aVh57TI4j8OF
eFdn3Lkes9Ync70KjEy2QKe2Su0EWjderE0oqAORKomwZFVCYv/Vg1wERJYsugg5
UyYCY2j1tKlycKYDnO47L1xoS9JgMHY05OsH08Sn/EXBjRjnEVwTyco5pGPmuNA=
=5Lst
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fix from Paolo Bonzini:
"A simple fix. I'm sending it before the merge window, because it
refines a patch found in your master branch but not yet in the
kvm/next branch that is destined for 4.5"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: x86: only channel 0 of the i8254 is linked to the HPET