Currently, "cma=" kernel parameter is used to specify the size of CMA,
but we can't specify where it is located. We want to locate CMA below
4GB for devices only supporting 32-bit addressing on 64-bit systems
without iommu.
This enables to specify the placement of CMA by extending "cma=" kernel
parameter.
Examples:
1. locate 64MB CMA below 4GB by "cma=64M@0-4G"
2. locate 64MB CMA exact at 512MB by "cma=64M@512M"
Note that the DMA contiguous memory allocator on x86 assumes that
page_address() works for the pages to allocate. So this change requires
to limit end address of contiguous memory area upto max_pfn_mapped to
prevent from locating it on highmem area by the argument of
dma_contiguous_reserve().
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The DMA Contiguous Memory Allocator support on x86 is disabled when
swiotlb config option is enabled. So DMA CMA is always disabled on
x86_64 because swiotlb is always enabled. This attempts to support for
DMA CMA with enabling swiotlb config option.
The contiguous memory allocator on x86 is integrated in the function
dma_generic_alloc_coherent() which is .alloc callback in nommu_dma_ops
for dma_alloc_coherent().
x86_swiotlb_alloc_coherent() which is .alloc callback in swiotlb_dma_ops
tries to allocate with dma_generic_alloc_coherent() firstly and then
swiotlb_alloc_coherent() is called as a fallback.
The main part of supporting DMA CMA with swiotlb is that changing
x86_swiotlb_free_coherent() which is .free callback in swiotlb_dma_ops
for dma_free_coherent() so that it can distinguish memory allocated by
dma_generic_alloc_coherent() from one allocated by
swiotlb_alloc_coherent() and release it with dma_generic_free_coherent()
which can handle contiguous memory. This change requires making
is_swiotlb_buffer() global function.
This also needs to change .free callback in the dma_map_ops for amd_gart
and sta2x11, because these dma_ops are also using
dma_generic_alloc_coherent().
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset enhances the DMA Contiguous Memory Allocator on x86.
Currently the DMA CMA is only supported with pci-nommu dma_map_ops and
furthermore it can't be enabled on x86_64. But I would like to allocate
big contiguous memory with dma_alloc_coherent() and tell it to the device
that requires it, regardless of which dma mapping implementation is
actually used in the system.
So this makes it work with swiotlb and intel-iommu dma_map_ops, too. And
this also extends "cma=" kernel parameter to specify placement constraint
by the physical address range of memory allocations. For example, CMA
allocates memory below 4GB by "cma=64M@0-4G", it is required for the
devices only supporting 32-bit addressing on 64-bit systems without iommu.
This patch (of 5):
Calling dma_alloc_coherent() with __GFP_ZERO must return zeroed memory.
But when the contiguous memory allocator (CMA) is enabled on x86 and the
memory region is allocated by dma_alloc_from_contiguous(), it doesn't
return zeroed memory. Because dma_generic_alloc_coherent() forgot to fill
the memory region with zero if it was allocated by
dma_alloc_from_contiguous()
Most implementations of dma_alloc_coherent() return zeroed memory
regardless of whether __GFP_ZERO is specified. So this fixes it by
unconditionally zeroing the allocated memory region.
Alternatively, we could fix dma_alloc_from_contiguous() to return zeroed
out memory and remove memset() from all caller of it. But we can't simply
remove the memset on arm because __dma_clear_buffer() is used there for
ensuring cache flushing and it is used in many places. Of course we can
do redundant memset in dma_alloc_from_contiguous(), but I think this patch
is less impact for fixing this problem.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On system with 2TiB ram, current x86_64 have 128M as section size, and
one memory_block only include one section. So will have 16400 entries
under /sys/devices/system/memory/.
Current code try to use block id to find block pointer in /sys for any
section, and reuse that block pointer. that finding will take some time
even after commit 7c243c7168 ("mm: speedup in __early_pfn_to_nid")
that will skip the search in that case during booting up.
So solution could be increase block size just like SGI UV system did.
(harded code to 2g).
This patch is trying to probe the block size to make it match mmio remap
size. for example, Intel Nehalem later system will have memory range [0,
TOML), [4g, TOMH]. If the memory hole is 2g and total is 128g, TOM will
be 2g, and TOM2 will be 130g.
We could use 2g as block size instead of default 128M. That will reduce
number of entries in /sys/devices/system/memory/
On system 6TiB system will reduce boot time by 35 seconds.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
_PAGE_NUMA is currently an alias of _PROT_PROTNONE to trap NUMA hinting
faults on x86. Care is taken such that _PAGE_NUMA is used only in
situations where the VMA flags distinguish between NUMA hinting faults
and prot_none faults. This decision was x86-specific and conceptually
it is difficult requiring special casing to distinguish between PROTNONE
and NUMA ptes based on context.
Fundamentally, we only need the _PAGE_NUMA bit to tell the difference
between an entry that is really unmapped and a page that is protected
for NUMA hinting faults as if the PTE is not present then a fault will
be trapped.
Swap PTEs on x86-64 use the bits after _PAGE_GLOBAL for the offset.
This patch shrinks the maximum possible swap size and uses the bit to
uniquely distinguish between NUMA hinting ptes and swap ptes.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Noonan <steven@uplinklabs.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
32-bit support for NUMA is an oddity on its own but with automatic NUMA
balancing on top there is a reasonable risk that the CPUPID information
cannot be stored in the page flags. This patch removes support for
automatic NUMA support on 32-bit x86.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Noonan <steven@uplinklabs.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently hugepage migration is available for all archs which support
pmd-level hugepage, but testing is done only for x86_64 and there're
bugs for other archs. So to avoid breaking such archs, this patch
limits the availability strictly to x86_64 until developers of other
archs get interested in enabling this feature.
Simply disabling hugepage migration on non-x86_64 archs is not enough to
fix the reported problem where sys_move_pages() hits the BUG_ON() in
follow_page(FOLL_GET), so let's fix this by checking if hugepage
migration is supported in vma_migratable().
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Miller <davem@davemloft.net>
Cc: <stable@vger.kernel.org> [3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull core irq updates from Thomas Gleixner:
"The irq department delivers:
- Another tree wide update to get rid of the horrible create_irq
interface along with its even more horrible variants. That also
gets rid of the last leftovers of the initial sparse irq hackery.
arch/driver specific changes have been either acked or ignored.
- A fix for the spurious interrupt detection logic with threaded
interrupts.
- A new ARM SoC interrupt controller
- The usual pile of fixes and improvements all over the place"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
Documentation: brcmstb-l2: Add Broadcom STB Level-2 interrupt controller binding
irqchip: brcmstb-l2: Add Broadcom Set Top Box Level-2 interrupt controller
genirq: Improve documentation to match current implementation
ARM: iop13xx: fix msi support with sparse IRQ
genirq: Provide !SMP stub for irq_set_affinity_notifier()
irqchip: armada-370-xp: Move the devicetree binding documentation
irqchip: gic: Use mask field in GICC_IAR
genirq: Remove dynamic_irq mess
ia64: Use irq_init_desc
genirq: Replace dynamic_irq_init/cleanup
genirq: Remove irq_reserve_irq[s]
genirq: Replace reserve_irqs in core code
s390: Avoid call to irq_reserve_irqs()
s390: Remove pointless arch_show_interrupts()
s390: pci: Check return value of alloc_irq_desc() proper
sh: intc: Remove pointless irq_reserve_irqs() invocation
x86, irq: Remove pointless irq_reserve_irqs() call
genirq: Make create/destroy_irq() ia64 private
tile: Use SPARSE_IRQ
tile: pci: Use irq_alloc/free_hwirq()
...
By changing code16gcc.h from a C header to an assembly header and use
the -Wa,... option to gcc to force it to be added to the assembly
input, we can avoid the problems with gcc reordering code bits on us.
If we have -m16, we still use it, of course.
Suggested-by: Kevin O'Connor <kevin@koconnor.net>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/n/tip-xw8ibgdemucl9fz3i1bymu6w@git.kernel.org
- Another round of clean-up of FDT related code in architecture code.
This removes knowledge of internal FDT details from most architectures
except powerpc.
- Conversion of kernel's custom FDT parsing code to use libfdt.
- DT based initialization for generic serial earlycon. The introduction
of generic serial earlycon support went in thru tty tree.
- Improve the platform device naming for DT probed devices to ensure
unique naming and use parent names instead of a global index.
- Fix a race condition in of_update_property.
- Unify the various linker section OF match tables and fix several
function prototype errors.
- Update platform_get_irq_byname to work in deferred probe cases.
- 2 binding doc updates
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJTjzgyAAoJEMhvYp4jgsXiFsUH/1PMTGo8CyD62VQD5ZKdAoW+
Fq6vCiRQ8assF5i5ZLcW1DqhjtoRaCKYhVbRKa5lj7cZdjlSpacI/qQPrF5Br2Ii
bTE3Ff/AQwipQaz/Bj7HqJCgGwfWK8xdfgW0abKsyXMWDN86Bov/zzeu8apmws0x
H1XjJRgnc/rzM4m9ny6+lss0iq6YL54SuTYNzHR33+Ywxls69SfHXIhCW0KpZcBl
5U3YUOomt40GfO46sxFA4xApAhypEK4oVq7asyiA2ArTZ/c2Pkc9p5CBqzhDLmlq
yioWTwHIISv0q+yMLCuQrVGIsbUDkQyy7RQ15z6U+/e/iGO/M+j3A5yxMc3qOi4=
=Onff
-----END PGP SIGNATURE-----
Merge tag 'devicetree-for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux into next
Pull DeviceTree updates from Rob Herring:
- Another round of clean-up of FDT related code in architecture code.
This removes knowledge of internal FDT details from most
architectures except powerpc.
- Conversion of kernel's custom FDT parsing code to use libfdt.
- DT based initialization for generic serial earlycon. The
introduction of generic serial earlycon support went in through the
tty tree.
- Improve the platform device naming for DT probed devices to ensure
unique naming and use parent names instead of a global index.
- Fix a race condition in of_update_property.
- Unify the various linker section OF match tables and fix several
function prototype errors.
- Update platform_get_irq_byname to work in deferred probe cases.
- 2 binding doc updates
* tag 'devicetree-for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (58 commits)
of: handle NULL node in next_child iterators
of/irq: provide more wrappers for !CONFIG_OF
devicetree: bindings: Document micrel vendor prefix
dt: bindings: dwc2: fix required value for the phy-names property
of_pci_irq: kill useless variable in of_irq_parse_pci()
of/irq: do irq resolution in platform_get_irq_byname()
of: Add a testcase for of_find_node_by_path()
of: Make of_find_node_by_path() handle /aliases
of: Create unlocked version of for_each_child_of_node()
lib: add glibc style strchrnul() variant
of: Handle memory@0 node on PPC32 only
pci/of: Remove dead code
of: fix race between search and remove in of_update_property()
of: Use NULL for pointers
of: Stop naming platform_device using dcr address
of: Ensure unique names without sacrificing determinism
tty/serial: pl011: add DT based earlycon support
of/fdt: add FDT serial scanning for earlycon
of/fdt: add FDT address translation support
serial: earlycon: add DT support
...
- ACPICA update to upstream version 20140424. That includes a
number of fixes and improvements related to things like GPE
handling, table loading, headers, memory mapping and unmapping,
DSDT/SSDT overriding, and the Unload() operator. The acpidump
utility from upstream ACPICA is included too. From Bob Moore,
Lv Zheng, David Box, David Binderman, and Colin Ian King.
- Fixes and cleanups related to ACPI video and backlight interfaces
from Hans de Goede. That includes blacklist entries for some new
machines and using native backlight by default.
- ACPI device enumeration changes to create platform devices
rather than PNP devices for ACPI device objects with _HID by
default. PNP devices will still be created for the ACPI device
object with device IDs corresponding to real PNP devices, so
that change should not break things left and right, and we're
expecting to see more and more ACPI-enumerated platform devices
in the future. From Zhang Rui and Rafael J Wysocki.
- Updates for the ACPI LPSS (Low-Power Subsystem) driver allowing
it to handle system suspend/resume on Asus T100 correctly.
From Heikki Krogerus and Rafael J Wysocki.
- PM core update introducing a mechanism to allow runtime-suspended
devices to stay suspended over system suspend/resume transitions
if certain additional conditions related to coordination within
device hierarchy are met. Related PM documentation update and
ACPI PM domain support for the new feature. From Rafael J Wysocki.
- Fixes and improvements related to the "freeze" sleep state. They
affect several places including cpuidle, PM core, ACPI core, and
the ACPI battery driver. From Rafael J Wysocki and Zhang Rui.
- Miscellaneous fixes and updates of the ACPI core from Aaron Lu,
Bjørn Mork, Hanjun Guo, Lan Tianyu, and Rafael J Wysocki.
- Fixes and cleanups for the ACPI processor and ACPI PAD (Processor
Aggregator Device) drivers from Baoquan He, Manuel Schölling,
Tony Camuso, and Toshi Kani.
- System suspend/resume optimization in the ACPI battery driver from
Lan Tianyu.
- OPP (Operating Performance Points) subsystem updates from
Chander Kashyap, Mark Brown, and Nishanth Menon.
- cpufreq core fixes, updates and cleanups from Srivatsa S Bhat,
Stratos Karafotis, and Viresh Kumar.
- Updates, fixes and cleanups for the Tegra, powernow-k8, imx6q,
s5pv210, nforce2, and powernv cpufreq drivers from Brian Norris,
Jingoo Han, Paul Bolle, Philipp Zabel, Stratos Karafotis, and
Viresh Kumar.
- intel_pstate driver fixes and cleanups from Dirk Brandewie,
Doug Smythies, and Stratos Karafotis.
- Enabling the big.LITTLE cpufreq driver on arm64 from Mark Brown.
- Fix for the cpuidle menu governor from Chander Kashyap.
- New ARM clps711x cpuidle driver from Alexander Shiyan.
- Hibernate core fixes and cleanups from Chen Gang, Dan Carpenter,
Fabian Frederick, Pali Rohár, and Sebastian Capella.
- Intel RAPL (Running Average Power Limit) driver updates from
Jacob Pan.
- PNP subsystem updates from Bjorn Helgaas and Fabian Frederick.
- devfreq core updates from Chanwoo Choi and Paul Bolle.
- devfreq updates for exynos4 and exynos5 from Chanwoo Choi and
Bartlomiej Zolnierkiewicz.
- turbostat tool fix from Jean Delvare.
- cpupower tool updates from Prarit Bhargava, Ramkumar Ramachandra
and Thomas Renninger.
- New ACPI ec_access.c tool for poking at the EC in a safe way
from Thomas Renninger.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJTjl16AAoJEILEb/54YlRxeKgP/RRQSV7lFtf582Dw/5M/iWOg
qYeNtuYFLArEmJ7SpxHdKsU1ZRm3CahAS1j7grvQMQasUxTzoavMcSBNZefeaoNK
d01LVNqcyKCZs3+izRezk5N1IY+AjdrOcqCdIk8rfgFnc6kOttYUrVcIzKuIKAvJ
MsJ5s/uqP8G69FsAA3Ttdtr0HKiQhN4skSt424wntQRDeJNZPBs74mPKBGh8bxlO
Zr/VCDibKQ2Z8jS7x+TzwZrOxgE1/9x0Cub6GAdTvAfS8A+utPwSkneUyopNqpQ+
tJ5rz5R+HpmPMerizBuU+5s+tvjDPtH4/OZvOPSpYraQSFLOwx3hAm+a5k7fOGmc
XWjXnXWT0i0V3iQkwrspTNjX1RgywbsHbmXrcWn192HResvMQ9zk2gH2ch6m8JhN
yTV5V51dOZicpPuaTCvIkJpsV33p6vRz+EdPBiXoEdua5KKtOg8EnQ470dNaMR92
3ZtWmIvSgGlyPyHlSHLfGXbPUwTYvDNV3aheIoXp9E6WY3WJN9J3WXm4EHKBNVaI
H83kwuk1s92cgqh22H5Pcb0CmDcrbkUdP6hhsPS/aL80/EJMljRP2AYW1Y+l1LAf
pzMLmekHFqQEDjFQltwGvFV/EjFeMHnqOgQONx9ygMaayCGGTYSDx3FbRDesf8t9
qhoFcTPSxoo0XjrGrR6b
=tpdF
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm into next
Pull ACPI and power management updates from Rafael Wysocki:
"ACPICA is the leader this time (63 commits), followed by cpufreq (28
commits), devfreq (15 commits), system suspend/hibernation (12
commits), ACPI video and ACPI device enumeration (10 commits each).
We have no major new features this time, but there are a few
significant changes of how things work. The most visible one will
probably be that we are now going to create platform devices rather
than PNP devices by default for ACPI device objects with _HID. That
was long overdue and will be really necessary to be able to use the
same drivers for the same hardware blocks on ACPI and DT-based systems
going forward. We're not expecting fallout from this one (as usual),
but it's something to watch nevertheless.
The second change having a chance to be visible is that ACPI video
will now default to using native backlight rather than the ACPI
backlight interface which should generally help systems with broken
Win8 BIOSes. We're hoping that all problems with the native backlight
handling that we had previously have been addressed and we are in a
good enough shape to flip the default, but this change should be easy
enough to revert if need be.
In addition to that, the system suspend core has a new mechanism to
allow runtime-suspended devices to stay suspended throughout system
suspend/resume transitions if some extra conditions are met
(generally, they are related to coordination within device hierarchy).
However, enabling this feature requires cooperation from the bus type
layer and for now it has only been implemented for the ACPI PM domain
(used by ACPI-enumerated platform devices mostly today).
Also, the acpidump utility that was previously shipped as a separate
tool will now be provided by the upstream ACPICA along with the rest
of ACPICA code, which will allow it to be more up to date and better
supported, and we have one new cpuidle driver (ARM clps711x).
The rest is improvements related to certain specific use cases,
cleanups and fixes all over the place.
Specifics:
- ACPICA update to upstream version 20140424. That includes a number
of fixes and improvements related to things like GPE handling,
table loading, headers, memory mapping and unmapping, DSDT/SSDT
overriding, and the Unload() operator. The acpidump utility from
upstream ACPICA is included too. From Bob Moore, Lv Zheng, David
Box, David Binderman, and Colin Ian King.
- Fixes and cleanups related to ACPI video and backlight interfaces
from Hans de Goede. That includes blacklist entries for some new
machines and using native backlight by default.
- ACPI device enumeration changes to create platform devices rather
than PNP devices for ACPI device objects with _HID by default. PNP
devices will still be created for the ACPI device object with
device IDs corresponding to real PNP devices, so that change should
not break things left and right, and we're expecting to see more
and more ACPI-enumerated platform devices in the future. From
Zhang Rui and Rafael J Wysocki.
- Updates for the ACPI LPSS (Low-Power Subsystem) driver allowing it
to handle system suspend/resume on Asus T100 correctly. From
Heikki Krogerus and Rafael J Wysocki.
- PM core update introducing a mechanism to allow runtime-suspended
devices to stay suspended over system suspend/resume transitions if
certain additional conditions related to coordination within device
hierarchy are met. Related PM documentation update and ACPI PM
domain support for the new feature. From Rafael J Wysocki.
- Fixes and improvements related to the "freeze" sleep state. They
affect several places including cpuidle, PM core, ACPI core, and
the ACPI battery driver. From Rafael J Wysocki and Zhang Rui.
- Miscellaneous fixes and updates of the ACPI core from Aaron Lu,
Bjørn Mork, Hanjun Guo, Lan Tianyu, and Rafael J Wysocki.
- Fixes and cleanups for the ACPI processor and ACPI PAD (Processor
Aggregator Device) drivers from Baoquan He, Manuel Schölling, Tony
Camuso, and Toshi Kani.
- System suspend/resume optimization in the ACPI battery driver from
Lan Tianyu.
- OPP (Operating Performance Points) subsystem updates from Chander
Kashyap, Mark Brown, and Nishanth Menon.
- cpufreq core fixes, updates and cleanups from Srivatsa S Bhat,
Stratos Karafotis, and Viresh Kumar.
- Updates, fixes and cleanups for the Tegra, powernow-k8, imx6q,
s5pv210, nforce2, and powernv cpufreq drivers from Brian Norris,
Jingoo Han, Paul Bolle, Philipp Zabel, Stratos Karafotis, and
Viresh Kumar.
- intel_pstate driver fixes and cleanups from Dirk Brandewie, Doug
Smythies, and Stratos Karafotis.
- Enabling the big.LITTLE cpufreq driver on arm64 from Mark Brown.
- Fix for the cpuidle menu governor from Chander Kashyap.
- New ARM clps711x cpuidle driver from Alexander Shiyan.
- Hibernate core fixes and cleanups from Chen Gang, Dan Carpenter,
Fabian Frederick, Pali Rohár, and Sebastian Capella.
- Intel RAPL (Running Average Power Limit) driver updates from Jacob
Pan.
- PNP subsystem updates from Bjorn Helgaas and Fabian Frederick.
- devfreq core updates from Chanwoo Choi and Paul Bolle.
- devfreq updates for exynos4 and exynos5 from Chanwoo Choi and
Bartlomiej Zolnierkiewicz.
- turbostat tool fix from Jean Delvare.
- cpupower tool updates from Prarit Bhargava, Ramkumar Ramachandra
and Thomas Renninger.
- New ACPI ec_access.c tool for poking at the EC in a safe way from
Thomas Renninger"
* tag 'pm+acpi-3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (187 commits)
ACPICA: Namespace: Remove _PRP method support.
intel_pstate: Improve initial busy calculation
intel_pstate: add sample time scaling
intel_pstate: Correct rounding in busy calculation
intel_pstate: Remove C0 tracking
PM / hibernate: fixed typo in comment
ACPI: Fix x86 regression related to early mapping size limitation
ACPICA: Tables: Add mechanism to control early table checksum verification.
ACPI / scan: use platform bus type by default for _HID enumeration
ACPI / scan: always register ACPI LPSS scan handler
ACPI / scan: always register memory hotplug scan handler
ACPI / scan: always register container scan handler
ACPI / scan: Change the meaning of missing .attach() in scan handlers
ACPI / scan: introduce platform_id device PNP type flag
ACPI / scan: drop unsupported serial IDs from PNP ACPI scan handler ID list
ACPI / scan: drop IDs that do not comply with the ACPI PNP ID rule
ACPI / PNP: use device ID list for PNPACPI device enumeration
ACPI / scan: .match() callback for ACPI scan handlers
ACPI / battery: wakeup the system only when necessary
power_supply: allow power supply devices registered w/o wakeup source
...
was a pretty active cycle for KVM. Changes include:
- a lot of s390 changes: optimizations, support for migration,
GDB support and more
- ARM changes are pretty small: support for the PSCI 0.2 hypercall
interface on both the guest and the host (the latter acked by Catalin)
- initial POWER8 and little-endian host support
- support for running u-boot on embedded POWER targets
- pretty large changes to MIPS too, completing the userspace interface
and improving the handling of virtualized timer hardware
- for x86, a larger set of changes is scheduled for 3.17. Still,
we have a few emulator bugfixes and support for running nested
fully-virtualized Xen guests (para-virtualized Xen guests have
always worked). And some optimizations too.
The only missing architecture here is ia64. It's not a coincidence
that support for KVM on ia64 is scheduled for removal in 3.17.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJTjtlBAAoJEBvWZb6bTYbyMOUP/2NAePghE3IjG99ikHFdn+BX
BfrURsuR6GD0AhYQnBidBmpFbAmN/LwSJxv/M7sV7OBRWLu3qbt69DrPTU2e/FK1
j9q25peu8jRyHzJ1q9rBroo74nD9lQYuVr3uXNxxcg0DRnw14JHGlM3y8LDEknO8
W+gpWTeAQ+2AuOX98MpRbCRMuzziCSv5bP5FhBVnsWHiZfvMbcUrbeJt+zYSiDAZ
0tHm/5dFKzfj/vVrrnjD4EZcRr688Bs5rztG96hY6aoVJryjZGLtLp92wCWkRRmH
CCvZwd245NmNthuKHzcs27/duSWfU0uOlu7AMrD44QYhzeDGyB/2nbCxbGqLLoBA
nnOviXH4cC65/CnisZ79zfo979HbZcX+Lzg747EjBgCSxJmLlwgiG8yXtDvk5otB
TH6GUeGDiEEPj//JD3XtgSz0sF2NvjREWRyemjDMvhz6JC/bLytXKb3sn+NXSj8m
ujzF9eQoa4qKDcBL4IQYGTJ4z5nY3Pd68dHFIPHB7n82OxFLSQUBKxXw8/1fb5og
VVb8PL4GOcmakQlAKtTMlFPmuy4bbL2r/2iV5xJiOZKmXIu8Hs1JezBE3SFAltbl
3cAGwSM9/dDkKxUbTFblyOE9bkKbg4WYmq0LkdzsPEomb3IZWntOT25rYnX+LrBz
bAknaZpPiOrW11Et1htY
=j5Od
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm into next
Pull KVM updates from Paolo Bonzini:
"At over 200 commits, covering almost all supported architectures, this
was a pretty active cycle for KVM. Changes include:
- a lot of s390 changes: optimizations, support for migration, GDB
support and more
- ARM changes are pretty small: support for the PSCI 0.2 hypercall
interface on both the guest and the host (the latter acked by
Catalin)
- initial POWER8 and little-endian host support
- support for running u-boot on embedded POWER targets
- pretty large changes to MIPS too, completing the userspace
interface and improving the handling of virtualized timer hardware
- for x86, a larger set of changes is scheduled for 3.17. Still, we
have a few emulator bugfixes and support for running nested
fully-virtualized Xen guests (para-virtualized Xen guests have
always worked). And some optimizations too.
The only missing architecture here is ia64. It's not a coincidence
that support for KVM on ia64 is scheduled for removal in 3.17"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (203 commits)
KVM: add missing cleanup_srcu_struct
KVM: PPC: Book3S PR: Rework SLB switching code
KVM: PPC: Book3S PR: Use SLB entry 0
KVM: PPC: Book3S HV: Fix machine check delivery to guest
KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugs
KVM: PPC: Book3S HV: Make sure we don't miss dirty pages
KVM: PPC: Book3S HV: Fix dirty map for hugepages
KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address
KVM: PPC: Book3S HV: Fix check for running inside guest in global_invalidates()
KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number
KVM: PPC: Book3S: Add ONE_REG register names that were missed
KVM: PPC: Add CAP to indicate hcall fixes
KVM: PPC: MPIC: Reset IRQ source private members
KVM: PPC: Graciously fail broken LE hypercalls
PPC: ePAPR: Fix hypercall on LE guest
KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handler
KVM: PPC: BOOK3S: Always use the saved DAR value
PPC: KVM: Make NX bit available with magic page
KVM: PPC: Disable NX for old magic page using guests
KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest
...
check_irq_vectors_for_cpu_disable() can overestimate the number of
available interrupt vectors, so the check for cpu down succeeds, but
the actual cpu removal fails.
It iterates from FIRST_EXTERNAL_VECTOR to NR_VECTORS, which is wrong
because the systems vectors are not taken into account.
Limit the search to first_system_vector instead of NR_VECTORS.
The second indicator for vector availability the used_vectors bitmap
is not taken into account at all. So system vectors,
e.g. IA32_SYSCALL_VECTOR (0x80) and IRQ_MOVE_CLEANUP_VECTOR (0x20),
are accounted as available.
Add a check for the used_vectors bitmap and do not account vectors
which are marked there.
[ tglx: Simplified code. Rewrote changelog and code comments. ]
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Elliott, Robert (Server Storage)" <Elliott@hp.com>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/1400160305-17774-2-git-send-email-prarit@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Conflicts:
include/net/inetpeer.h
net/ipv6/output_core.c
Changes in net were fixing bugs in code removed in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
I just went over this when looking at some Xen-related ftrace initialization
problems. They were related to Xen code that is not upstream but this clean up
would make sense here.
I think that this was already the intention when text_ip_addr() was introduced
in the commit 87fbb2ac60 (ftrace/x86: Use breakpoints for converting
function graph caller). Anyway, better do it now before it shots people into
their leg ;-)
Link: http://lkml.kernel.org/p/1401812601-2359-1-git-send-email-pmladek@suse.cz
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull x86/UV changes from Ingo Molnar:
"Continued updates for SGI UV 3 hardware support"
* 'x86-uv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/UV: Fix conditional in gru_exit()
x86/UV: Set n_lshift based on GAM_GR_CONFIG MMR for UV3
Pull x86 RAS changes from Ingo Molnar:
"Improve mcheck device initialization and bootstrap robustness"
* 'x86-ras-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
mce: Panic when a core has reached a timeout
x86/mce: Improve mcheck_init_device() error handling
Pull x86 IOSF platform updates from Ingo Molnar:
"IOSF (Intel OnChip System Fabric) updates:
- generalize the IOSF interface to allow mixed mode drivers: non-IOSF
drivers to utilize of IOSF features on IOSF platforms.
- add 'Quark X1000' IOSF/MBI support
- clean up BayTrail and Quark PCI ID enumeration"
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, iosf: Add PCI ID macros for better readability
x86, iosf: Add Quark X1000 PCI ID
x86, iosf: Added Quark MBI identifiers
x86, iosf: Make IOSF driver modular and usable by more drivers
Pull x86 mm update from Ingo Molnar:
- speed up 256 GB PCI BAR ioremap()s
- speed up PTE swapout page reclaim case
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, ioremap: Speed up check for RAM pages
x86/mm: In the PTE swapout page reclaim case clear the accessed bit instead of flushing the TLB
Pull x86 microcode changes from Ingo Molnar:
"A microcode-debugging boot flag plus related refactoring"
* 'x86-microcode-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, microcode: Add a disable chicken bit
x86, boot: Carve out early cmdline parsing function
Pull x86 irq cleanup from Ingo Molnar:
"A single, trivial cleanup"
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/irq: Clean up VECTOR_UNDEFINED and VECTOR_RETRIGGERED definition
Pull x86 build cleanups from Ingo Molnar:
"Two small build related cleanups"
* 'x86-build-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/build: Supress realmode.bin is up to date message
compiler-intel.h: Remove duplicate definition
Pull x86 boot changes from Ingo Molnar:
"Two small cleanups"
* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, boot: Remove misc.h inclusion from compressed/string.c
x86, boot: Do not include boot.h in string.c
* acpica: (63 commits)
ACPICA: Namespace: Remove _PRP method support.
ACPI: Fix x86 regression related to early mapping size limitation
ACPICA: Tables: Add mechanism to control early table checksum verification.
ACPICA: acpidump: Fix repetitive table dump in -n mode.
ACPI: Clean up acpi_os_map/unmap_memory() to eliminate __iomem.
ACPICA: Clean up redudant definitions already defined elsewhere
ACPICA: Linux headers: Add <asm/acenv.h> to remove mis-ordered inclusion of <asm/acpi.h>
ACPICA: Linux headers: Add <acpi/platform/aclinuxex.h>
ACPICA: Linux headers: Remove ACPI_PREEMPTION_POINT() due to no usages.
ACPICA: Update version to 20140424.
ACPICA: Comment/format update, no functional change.
ACPICA: Events: Update GPE handling and initialization code.
ACPICA: Remove extraneous error message for large number of GPEs.
ACPICA: Tables: Remove old mechanism to validate if XSDT contains NULL entries.
ACPICA: Tables: Add new mechanism to skip NULL entries in RSDT and XSDT.
ACPICA: acpidump: Add support to force using RSDT.
ACPICA: Back port of improvements on exception code.
ACPICA: Back port of _PRP update.
ACPICA: acpidump: Fix truncated RSDP signature validation.
ACPICA: Linux header: Add support for stubbed externals.
...
Pull scheduler updates from Ingo Molnar:
"The main scheduling related changes in this cycle were:
- various sched/numa updates, for better performance
- tree wide cleanup of open coded nice levels
- nohz fix related to rq->nr_running use
- cpuidle changes and continued consolidation to improve the
kernel/sched/idle.c high level idle scheduling logic. As part of
this effort I pulled cpuidle driver changes from Rafael as well.
- standardized idle polling amongst architectures
- continued work on preparing better power/energy aware scheduling
- sched/rt updates
- misc fixlets and cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (49 commits)
sched/numa: Decay ->wakee_flips instead of zeroing
sched/numa: Update migrate_improves/degrades_locality()
sched/numa: Allow task switch if load imbalance improves
sched/rt: Fix 'struct sched_dl_entity' and dl_task_time() comments, to match the current upstream code
sched: Consolidate open coded implementations of nice level frobbing into nice_to_rlimit() and rlimit_to_nice()
sched: Initialize rq->age_stamp on processor start
sched, nohz: Change rq->nr_running to always use wrappers
sched: Fix the rq->next_balance logic in rebalance_domains() and idle_balance()
sched: Use clamp() and clamp_val() to make sys_nice() more readable
sched: Do not zero sg->cpumask and sg->sgp->power in build_sched_groups()
sched/numa: Fix initialization of sched_domain_topology for NUMA
sched: Call select_idle_sibling() when not affine_sd
sched: Simplify return logic in sched_read_attr()
sched: Simplify return logic in sched_copy_attr()
sched: Fix exec_start/task_hot on migrated tasks
arm64: Remove TIF_POLLING_NRFLAG
metag: Remove TIF_POLLING_NRFLAG
sched/idle: Make cpuidle_idle_call() void
sched/idle: Reflow cpuidle_idle_call()
sched/idle: Delay clearing the polling bit
...
Pull perf updates from Ingo Molnar:
"The tooling changes maintained by Jiri Olsa until Arnaldo is on
vacation:
User visible changes:
- Add -F option for specifying output fields (Namhyung Kim)
- Propagate exit status of a command line workload for record command
(Namhyung Kim)
- Use tid for finding thread (Namhyung Kim)
- Clarify the output of perf sched map plus small sched command
fixes (Dongsheng Yang)
- Wire up perf_regs and unwind support for ARM64 (Jean Pihet)
- Factor hists statistics counts processing which in turn also fixes
several bugs in TUI report command (Namhyung Kim)
- Add --percentage option to control absolute/relative percentage
output (Namhyung Kim)
- Add --list-cmds to 'kmem', 'mem', 'lock' and 'sched', for use by
completion scripts (Ramkumar Ramachandra)
Development/infrastructure changes and fixes:
- Android related fixes for pager and map dso resolving (Michael
Lentine)
- Add libdw DWARF post unwind support for ARM (Jean Pihet)
- Consolidate types.h for ARM and ARM64 (Jean Pihet)
- Fix possible null pointer dereference in session.c (Masanari Iida)
- Cleanup, remove unused variables in map_switch_event() (Dongsheng
Yang)
- Remove nr_state_machine_bugs in perf latency (Dongsheng Yang)
- Remove usage of trace_sched_wakeup(.success) (Peter Zijlstra)
- Cleanups for perf.h header (Jiri Olsa)
- Consolidate types.h and export.h within tools (Borislav Petkov)
- Move u64_swap union to its single user's header, evsel.h (Borislav
Petkov)
- Fix for s390 to properly parse tracepoints plus test code
(Alexander Yarygin)
- Handle EINTR error for readn/writen (Namhyung Kim)
- Add a test case for hists filtering (Namhyung Kim)
- Share map_groups among threads of the same group (Arnaldo Carvalho
de Melo, Jiri Olsa)
- Making some code (cpu node map and report parse callchain callback)
global to be usable by upcomming changes (Don Zickus)
- Fix pmu object compilation error (Jiri Olsa)
Kernel side changes:
- intrusive uprobes fixes from Oleg Nesterov. Since the interface is
admin-only, and the bug only affects user-space ("any probed
jmp/call can kill the application"), we queued these fixes via the
development tree, as a special exception.
- more fuzzer motivated race fixes and related refactoring and
robustization.
- allow PMU drivers to be built as modules. (No actual module yet,
because the x86 Intel uncore module wasn't ready in time for this)"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
perf tools: Add automatic remapping of Android libraries
perf tools: Add cat as fallback pager
perf tests: Add a testcase for histogram output sorting
perf tests: Factor out print_hists_*()
perf tools: Introduce reset_output_field()
perf tools: Get rid of obsolete hist_entry__sort_list
perf hists: Reset width of output fields with header length
perf tools: Skip elided sort entries
perf top: Add --fields option to specify output fields
perf report/tui: Fix a bug when --fields/sort is given
perf tools: Add ->sort() member to struct sort_entry
perf report: Add -F option to specify output fields
perf tools: Call perf_hpp__init() before setting up GUI browsers
perf tools: Consolidate management of default sort orders
perf tools: Allow hpp fields to be sort keys
perf ui: Get rid of callback from __hpp__fmt()
perf tools: Consolidate output field handling to hpp format routines
perf tools: Use hpp formats to sort final output
perf tools: Support event grouping in hpp ->sort()
perf tools: Use hpp formats to sort hist entries
...
Here is the big USB driver pull request for 3.16-rc1.
Nothing huge here, but lots of little things in the USB core, and in
lots of drivers. Hopefully the USB power management will be work better
now that it has been reworked to do per-port power control dynamically.
There's also a raft of gadget driver updates and fixes, CONFIG_USB_DEBUG
is finally gone now that everything has been converted over to the
dynamic debug inteface, the last hold-out drivers were cleaned up and
the config option removed. There were also other minor things all
through the drivers/usb/ tree, the shortlog shows this pretty well.
All have been in linux-next, including the very last patch, which came
from linux-next to fix a build issue on some platforms.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iEYEABECAAYFAlONYEMACgkQMUfUDdst+ynxvgCggMQBhN5icth8Y5hFglNNaISN
c4AAoMHR2kb62U1plylLbPnboQTjfcl0
=fG6y
-----END PGP SIGNATURE-----
Merge tag 'usb-3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb into next
Pull USB driver updates from Greg KH:
"Here is the big USB driver pull request for 3.16-rc1.
Nothing huge here, but lots of little things in the USB core, and in
lots of drivers. Hopefully the USB power management will be work
better now that it has been reworked to do per-port power control
dynamically. There's also a raft of gadget driver updates and fixes,
CONFIG_USB_DEBUG is finally gone now that everything has been
converted over to the dynamic debug inteface, the last hold-out
drivers were cleaned up and the config option removed. There were
also other minor things all through the drivers/usb/ tree, the
shortlog shows this pretty well.
All have been in linux-next, including the very last patch, which came
from linux-next to fix a build issue on some platforms"
* tag 'usb-3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (314 commits)
usb: hub_handle_remote_wakeup() only exists for CONFIG_PM=y
USB: orinoco_usb: remove CONFIG_USB_DEBUG support
USB: media: lirc: igorplugusb: remove CONFIG_USB_DEBUG support
USB: media: streamzap: remove CONFIG_USB_DEBUG
USB: media: redrat3: remove CONFIG_USB_DEBUG usage
USB: media: redrat3: remove unneeded tracing macro
usb: qcserial: add additional Sierra Wireless QMI devices
usb: host: max3421-hcd: Use module_spi_driver
usb: host: max3421-hcd: Allow platform-data to specify Vbus polarity
usb: host: max3421-hcd: fix "spi_rd8" uses dynamic stack allocation warning
usb: host: max3421-hcd: Fix missing unlock in max3421_urb_enqueue()
usb: qcserial: add Netgear AirCard 341U
Documentation: dt-bindings: update xhci-platform DT binding for R-Car H2 and M2
usb: host: xhci-plat: add xhci_plat_start()
usb: host: max3421-hcd: Fix potential NULL urb dereference
Revert "usb: gadget: net2280: Add support for PLX USB338X"
USB: usbip: remove CONFIG_USB_DEBUG reference
USB: remove CONFIG_USB_DEBUG from defconfig files
usb: resume child device when port is powered on
usb: hub_handle_remote_wakeup() depends on CONFIG_PM_RUNTIME=y
...
Here is the big tty / serial driver pull request for 3.16-rc1.
A variety of different serial driver fixes and updates and additions,
nothing huge, and no real major core tty changes at all.
All have been in linux-next for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iEYEABECAAYFAlONXgoACgkQMUfUDdst+ymdSwCgwL0xmWjFYr/UbJ4LslOZ29Q4
BFQAoKyYe9LsfEyodBPabxJjKUtj1htz
=ZGSN
-----END PGP SIGNATURE-----
Merge tag 'tty-3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty into next
Pull tty/serial driver updates from Greg KH:
"Here is the big tty / serial driver pull request for 3.16-rc1.
A variety of different serial driver fixes and updates and additions,
nothing huge, and no real major core tty changes at all.
All have been in linux-next for a while"
* tag 'tty-3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (84 commits)
Revert "serial: imx: remove the DMA wait queue"
serial: kgdb_nmi: Improve console integration with KDB I/O
serial: kgdb_nmi: Switch from tasklets to real timers
serial: kgdb_nmi: Use container_of() to locate private data
serial: cpm_uart: No LF conversion in put_poll_char()
serial: sirf: Fix compilation failure
console: Remove superfluous readonly check
console: Use explicit pointer type for vc_uni_pagedir* fields
vgacon: Fix & cleanup refcounting
ARM: tty: Move HVC DCC assembly to arch/arm
tty/hvc/hvc_console: Fix wakeup of HVC thread on hvc_kick()
drivers/tty/n_hdlc.c: replace kmalloc/memset by kzalloc
vt: emulate 8- and 24-bit colour codes.
printk/of_serial: fix serial console cessation part way through boot.
serial: 8250_dma: check the result of TX buffer mapping
serial: uart: add hw flow control support configuration
tty/serial: at91: add interrupts for modem control lines
tty/serial: at91: use mctrl_gpio helpers
tty/serial: Add GPIOLIB helpers for controlling modem lines
ARM: at91: gpio: implement get_direction
...
Here is the big staging driver pull request for 3.16-rc1.
Lots of stuff here, tons of cleanup patches, a few new drivers, and some
removed as well, but I think we are still adding a few thousand more
lines than we remove, due to the new drivers being bigger than the ones
deleted.
One notible bit of work did stand out, Jes Sorensen has gone on a tear,
fixing up a wireless driver to be "more sane" than it originally was
from the vendor, with over 500 patches merged here. Good stuff, and a
number of users laptops are better off for it.
All of this has been in linux-next for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iEYEABECAAYFAlONXKQACgkQMUfUDdst+ynIwgCgq5pPIn+2aewaFK8rrN18xqls
F3YAoNDYeqMpQepvRe50HcjRrgDvsV2n
=VenO
-----END PGP SIGNATURE-----
Merge tag 'staging-3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging into next
Pull staging driver updates from Greg KH:
"Here is the big staging driver pull request for 3.16-rc1.
Lots of stuff here, tons of cleanup patches, a few new drivers, and
some removed as well, but I think we are still adding a few thousand
more lines than we remove, due to the new drivers being bigger than
the ones deleted.
One notible bit of work did stand out, Jes Sorensen has gone on a
tear, fixing up a wireless driver to be "more sane" than it originally
was from the vendor, with over 500 patches merged here. Good stuff,
and a number of users laptops are better off for it.
All of this has been in linux-next for a while"
* tag 'staging-3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (1703 commits)
staging: skein: fix sparse warning for static declarations
staging/mt29f_spinand: coding style fixes
staging: silicom: fix sparse warning for static variable
staging: lustre: Fix coding style
staging: android: binder.c: Use more appropriate functions for euid retrieval
staging: lustre: fix integer as NULL pointer warnings
Revert "staging: dgap: remove unneeded kfree() in dgap_tty_register_ports()"
Staging: rtl8192u: r8192U_wx.c Fixed a misplaced brace
staging: ion: shrink highmem pages on kswapd
staging: ion: use compound pages on high order pages for system heap
staging: ion: remove struct ion_page_pool_item
staging: ion: simplify ion_page_pool_total()
staging: ion: tidy up a bit
staging: rtl8723au: Remove redundant casting in usb_ops_linux.c
staging: rtl8723au: Remove redundant casting in rtl8723a_hal_init.c
staging: rtl8723au: Remove redundant casting in rtw_xmit.c
staging: rtl8723au: Remove redundant casting in rtw_wlan_util.c
staging: rtl8723au: Remove redundant casting in rtw_sta_mgt.c
staging: rtl8723au: Remove redundant casting in rtw_recv.c
staging: rtl8723au: Remove redundant casting in rtw_mlme.c
...
Pull x86 fix from Peter Anvin:
"A single quite small patch that managed to get overlooked earlier, to
prevent a user space triggerable oops on systems without HPET"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
- Support foreign mappings in PVH domains (needed when dom0 is PVH)
- Fix mapping high MMIO regions in x86 PV guests (this is also the
first half of removing the PAGE_IOMAP PTE flag).
- ARM suspend/resume support.
- ARM multicall support.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJTjE5MAAoJEFxbo/MsZsTRtl8H/2lfS9w05e60vRxjolPV0vRc
5k9DcYFeJ+k2cz/2T3mNlIvKdfBTesSfgVquH+28GhQz+uKFQ1OrJpYNDTougSw5
Wv0Ae8e+7eLABvJ9XMiZdDsPzsICw2wqWOvqrnQi2qR3SIimBc5tBigR4+Rccv+e
btuBLlYT4WPQ8qgNyCBPgxzuyxteu5wK/0XryX6NcbrxeEbAzQAeDKkmvCD4fSvx
KxrwTO3mwV4Lefmf/WS4Z9fDcPujQOUqKEtUWanw/2JalO1BzDPo+1wvYs0LduLC
QI/YJN4SL3UeGOmbX2tyIaRgMsAcQVVrYkTm1cp8eD7vcRuvXaqy6dxuX05+V4g=
=cxfG
-----END PGP SIGNATURE-----
Merge tag 'stable/for-linus-3.16-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip into next
Pull Xen updates from David Vrabel:
"xen: features and fixes for 3.16-rc0
- support foreign mappings in PVH domains (needed when dom0 is PVH)
- fix mapping high MMIO regions in x86 PV guests (this is also the
first half of removing the PAGE_IOMAP PTE flag).
- ARM suspend/resume support.
- ARM multicall support"
* tag 'stable/for-linus-3.16-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: map foreign pfns for autotranslated guests
xen-acpi-processor: Don't display errors when we get -ENOSYS
xen/pciback: Document the entry points for 'pcistub_put_pci_dev'
xen/pciback: Document when the 'unbind' and 'bind' functions are called.
xen-pciback: Document when we FLR an PCI device.
xen-pciback: First reset, then free.
xen-pciback: Cleanup up pcistub_put_pci_dev
x86/xen: do not use _PAGE_IOMAP in xen_remap_domain_mfn_range()
x86/xen: set regions above the end of RAM as 1:1
x86/xen: only warn once if bad MFNs are found during setup
x86/xen: compactly store large identity ranges in the p2m
x86/xen: fix set_phys_range_identity() if pfn_e > MAX_P2M_PFN
x86/xen: rename early_p2m_alloc() and early_p2m_alloc_middle()
xen/x86: set panic notifier priority to minimum
arm,arm64/xen: introduce HYPERVISOR_suspend()
xen: refactor suspend pre/post hooks
arm: xen: export HYPERVISOR_multicall to modules.
arm64: introduce virt_to_pfn
arm/xen: Remove definiition of virt_to_pfn in asm/xen/page.h
arm: xen: implement multicall hypercall support.
For ioremapped efi memory aka old_map the virt addresses are not persistant
across kexec reboot. kexec-tools will read the runtime maps from sysfs then
pass them to 2nd kernel and assuming kexec efi boot is ok. This will cause
kexec boot failure.
To address this issue do not export runtime maps in case efi old_map so
userspace can use no efi boot instead.
Signed-off-by: Dave Young <dyoung@redhat.com>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Make it a little clearer what the littleendian access macros in
vdso2c.[ch] actually do. This way they can probably also be moved to
a central location (e.g. tools/include) for the benefit of other host
tools.
We should avoid implementation namespace symbols when writing code
that is compiling for the compiler host, so avoid names starting with
double underscore or underscore-capital.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/2cf258df123cb24bad63c274c8563c050547d99d.1401464755.git.luto@amacapital.net
There is very little and maybe practically nothing we can do to recover
from a system where at least one core has reached a timeout during the
whole monarch cores gathering. So panic when that happens.
Link: http://lkml.kernel.org/r/20140523091041.GA21332@pd.tnic
Signed-off-by: Borislav Petkov <bp@suse.de>
While I play inhouse patches with much memory pressure on qemu-kvm,
3.14 kernel was randomly crashed. The reason was kernel stack overflow.
When I investigated the problem, the callstack was a little bit deeper
by involve with reclaim functions but not direct reclaim path.
I tried to diet stack size of some functions related with alloc/reclaim
so did a hundred of byte but overflow was't disappeard so that I encounter
overflow by another deeper callstack on reclaim/allocator path.
Of course, we might sweep every sites we have found for reducing
stack usage but I'm not sure how long it saves the world(surely,
lots of developer start to add nice features which will use stack
agains) and if we consider another more complex feature in I/O layer
and/or reclaim path, it might be better to increase stack size(
meanwhile, stack usage on 64bit machine was doubled compared to 32bit
while it have sticked to 8K. Hmm, it's not a fair to me and arm64
already expaned to 16K. )
So, my stupid idea is just let's expand stack size and keep an eye
toward stack consumption on each kernel functions via stacktrace of ftrace.
For example, we can have a bar like that each funcion shouldn't exceed 200K
and emit the warning when some function consumes more in runtime.
Of course, it could make false positive but at least, it could make a
chance to think over it.
I guess this topic was discussed several time so there might be
strong reason not to increase kernel stack size on x86_64, for me not
knowing so Ccing x86_64 maintainers, other MM guys and virtio
maintainers.
Here's an example call trace using up the kernel stack:
Depth Size Location (51 entries)
----- ---- --------
0) 7696 16 lookup_address
1) 7680 16 _lookup_address_cpa.isra.3
2) 7664 24 __change_page_attr_set_clr
3) 7640 392 kernel_map_pages
4) 7248 256 get_page_from_freelist
5) 6992 352 __alloc_pages_nodemask
6) 6640 8 alloc_pages_current
7) 6632 168 new_slab
8) 6464 8 __slab_alloc
9) 6456 80 __kmalloc
10) 6376 376 vring_add_indirect
11) 6000 144 virtqueue_add_sgs
12) 5856 288 __virtblk_add_req
13) 5568 96 virtio_queue_rq
14) 5472 128 __blk_mq_run_hw_queue
15) 5344 16 blk_mq_run_hw_queue
16) 5328 96 blk_mq_insert_requests
17) 5232 112 blk_mq_flush_plug_list
18) 5120 112 blk_flush_plug_list
19) 5008 64 io_schedule_timeout
20) 4944 128 mempool_alloc
21) 4816 96 bio_alloc_bioset
22) 4720 48 get_swap_bio
23) 4672 160 __swap_writepage
24) 4512 32 swap_writepage
25) 4480 320 shrink_page_list
26) 4160 208 shrink_inactive_list
27) 3952 304 shrink_lruvec
28) 3648 80 shrink_zone
29) 3568 128 do_try_to_free_pages
30) 3440 208 try_to_free_pages
31) 3232 352 __alloc_pages_nodemask
32) 2880 8 alloc_pages_current
33) 2872 200 __page_cache_alloc
34) 2672 80 find_or_create_page
35) 2592 80 ext4_mb_load_buddy
36) 2512 176 ext4_mb_regular_allocator
37) 2336 128 ext4_mb_new_blocks
38) 2208 256 ext4_ext_map_blocks
39) 1952 160 ext4_map_blocks
40) 1792 384 ext4_writepages
41) 1408 16 do_writepages
42) 1392 96 __writeback_single_inode
43) 1296 176 writeback_sb_inodes
44) 1120 80 __writeback_inodes_wb
45) 1040 160 wb_writeback
46) 880 208 bdi_writeback_workfn
47) 672 144 process_one_work
48) 528 112 worker_thread
49) 416 240 kthread
50) 176 176 ret_from_fork
[ Note: the problem is exacerbated by certain gcc versions that seem to
generate much bigger stack frames due to apparently bad coalescing of
temporaries and generating too many spills. Rusty saw gcc-4.6.4 using
35% more stack on the virtio path than 4.8.2 does, for example.
Minchan not only uses such a bad gcc version (4.6.3 in his case), but
some of the stack use is due to debugging (CONFIG_DEBUG_PAGEALLOC is
what causes that kernel_map_pages() frame, for example). But we're
clearly getting too close.
The VM code also seems to have excessive stack frames partly for the
same compiler reason, triggered by excessive inlining and lots of
function arguments.
We need to improve on our stack use, but in the meantime let's do this
simple stack increase too. Unlike most earlier reports, there is
nothing simple that stands out as being really horribly wrong here,
apart from the fact that the stack frames are just bigger than they
should need to be. - Linus ]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S Tsirkin <mst@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: PJ Waskiewicz <pjwaskiewicz@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Detect the xsaveopt, xsavec, xgetbv, and xsaves features in processor extended
state enumberation sub-leaf (eax=0x0d, ecx=1):
Bit 00: XSAVEOPT is available
Bit 01: Supports XSAVEC and the compacted form of XRSTOR if set
Bit 02: Supports XGETBV with ECX = 1 if set
Bit 03: Supports XSAVES/XRSTORS and IA32_XSS if set
The above features are defined in the new word 10 in cpu features.
The IA32_XSS MSR (index DA0H) contains a state-component bitmap that specifies
the state components that software has enabled xsaves and xrstors to manage.
If the bit corresponding to a state component is clear in XCR0 | IA32_XSS,
xsaves and xrstors will not operate on that state component, regardless of
the value of the instruction mask.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1401387164-43416-3-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
In each X86 feature macro definition, add one space in front of the word
number which is a one-digit number currently.
The purpose of reformatting the macros is to align one-digit and two-digit
word numbers.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1401387164-43416-2-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* pci/misc:
PCI: Fix return value from pci_user_{read,write}_config_*()
PCI: Turn pcibios_penalize_isa_irq() into a weak function
PCI: Test for std config alias when testing extended config space
* pci/hotplug:
PCI: cpqphp: Fix possible null pointer dereference
NVMe: Implement PCIe reset notification callback
PCI: Notify driver before and after device reset
* pci/pci_is_bridge:
pcmcia: Use pci_is_bridge() to simplify code
PCI: pciehp: Use pci_is_bridge() to simplify code
PCI: acpiphp: Use pci_is_bridge() to simplify code
PCI: cpcihp: Use pci_is_bridge() to simplify code
PCI: shpchp: Use pci_is_bridge() to simplify code
PCI: rpaphp: Use pci_is_bridge() to simplify code
sparc/PCI: Use pci_is_bridge() to simplify code
powerpc/PCI: Use pci_is_bridge() to simplify code
ia64/PCI: Use pci_is_bridge() to simplify code
x86/PCI: Use pci_is_bridge() to simplify code
PCI: Use pci_is_bridge() to simplify code
PCI: Add new pci_is_bridge() interface
PCI: Rename pci_is_bridge() to pci_has_subordinate()
* pci/virtualization:
PCI: Introduce new device binding path using pci_dev.driver_override
Conflicts:
drivers/pci/pci-sysfs.c
* pci/host-exynos:
PCI: exynos: Remove unnecessary OOM messages
* pci/host-rcar:
PCI: rcar: Add gen2 device tree support
PCI: rcar: Add R-Car PCIe device tree bindings
PCI: rcar: Add MSI support for PCIe
PCI: rcar: Add Renesas R-Car PCIe driver
PCI: rcar: Use new OF interrupt mapping when possible
* pci/amd-numa:
x86/PCI: Clean up and mark early_root_info_init() as deprecated
x86/PCI: Work around AMD Fam15h BIOSes that fail to provide _PXM
x86/PCI: Warn if we have to "guess" host bridge node information
Now that CONFIG_USB_DEBUG is gone, remove it from a number of defconfig
files that were enabling it.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The PPC fixes are important because they fix breakage that is new in 3.15.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJTdivEAAoJEBvWZb6bTYbyw3YQAIILnflhHNtklj1mfPnnibQf
c3BLCkJ0gtK6A0FO2aAHgSja0kpgbEEnSphE/A/cb0vkLon3n5O0pQoSKjGUUbBO
Mo0ndjzBYNmCP4MGxhkrg49VdqD40NaR0BjJAZudb4vUOw892WLFIJMIVmIqs9eG
8V/y6S7mPLmrooAKHZxXql9y30UC77T1VZ3r4pXwYgKtUT51BQfTyWiSfjQBa8yI
oGOSb8uqEC7YiOYPJYUNIMsyVqW4E6Qqs46rqtP4XZmSxzWXDzzgP4nQHHyJJCdZ
aBYkeG+sJZG7ZwleJLejAncjWUY9Oq9GkMYNj0cTAoP/zA6jBGAll96KGKRbes9z
bZUtCNL3ifLcgbIGeAxgjmYOq0XLGahHbqm9QISYW2XdRkBI+8EJs5FCP4YEHzZn
FSm3zcCQ+wtbqjBbZZcqqLa6A/CGzjyO26qz+BCxrZ0BQkQX/2am3UykQ0JWam3H
vX5ZM2ewJhs6SjFisPcswd20AN+SHjPyzPvErBLDfrqnAVbwj2ehgqyN2slVsqrj
UyGzeKCfJgA0TiEH/4K6j6hvQWynUU+/2JglIfGE6AXmWddazCzl/qx4LvuGKFoB
b8JSQ7YaHSsq/tHc8WhHkvcP0FSDZEiHcJN2iY1pwLKTSQp9JN3aPNruPKiO8dsW
N+LoHL5fFcDi6Uu6wS7w
=E2fU
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"Small fixes for x86, slightly larger fixes for PPC, and a forgotten
s390 patch. The PPC fixes are important because they fix breakage
that is new in 3.15"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: s390: announce irqfd capability
KVM: x86: disable master clock if TSC is reset during suspend
KVM: vmx: disable APIC virtualization in nested guests
KVM guest: Make pv trampoline code executable
KVM: PPC: Book3S: ifdef on CONFIG_KVM_BOOK3S_32_HANDLER for 32bit
KVM: PPC: Book3S HV: Add missing code for transaction reclaim on guest exit
KVM: PPC: Book3S: HV: make _PAGE_NUMA take effect
pcibios_penalize_isa_irq() is only implemented by x86 now, and legacy ISA
is not used by some architectures. Make pcibios_penalize_isa_irq() a
__weak function to simplify the code. This removes the need for new
platforms to add stub implementations of pcibios_penalize_isa_irq().
[bhelgaas: changelog, comments]
Signed-off-by: Hanjun Guo <hanjun.guo@linaro.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Use pci_is_bridge() to simplify code. No functional change.
Requires: 326c1cdae7 PCI: Rename pci_is_bridge() to pci_has_subordinate()
Requires: 1c86438c94 PCI: Add new pci_is_bridge() interface
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
early_root_info_init() is now deprecated in favor of info in ACPI. Add a
note to that effect. Also, clean up the code a bit.
There is no functional change.
Signed-off-by: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Since mis-order issues have been solved, we can cleanup redundant
definitions that already have defaults in <acpi/platform/acenv.h>.
This patch removes redudant environments for __KERNEL__ surrounded code.
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
There is a mis-order inclusion for <asm/acpi.h>.
As we will enforce including <linux/acpi.h> for all Linux ACPI users, we
can find the inclusion order is as follows:
<linux/acpi.h>
<acpi/acpi.h>
<acpi/platform/acenv.h>
(acenv.h before including aclinux.h)
<acpi/platform/aclinux.h>
...........................................................................
(aclinux.h before including asm/acpi.h)
<asm/acpi.h> @Redundant@
(ACPICA specific stuff)
...........................................................................
...........................................................................
(Linux ACPI specific stuff) ? - - - - - - - - - - - - +
(aclinux.h after including asm/acpi.h) @Invisible@ |
(acenv.h after including aclinux.h) @Invisible@ |
other ACPICA headers @Invisible@ |
............................................................|..............
<acpi/acpi_bus.h> |
<acpi/acpi_drivers.h> |
<asm/acpi.h> (Excluded) |
(Linux ACPI specific stuff) ! <- - - - - - - - - - - - - +
NOTE that, in ACPICA, <acpi/platform/acenv.h> is more like Kconfig
generated <generated/autoconf.h> for Linux, it is meant to be included
before including any ACPICA code.
In the above figure, there is a question mark for "Linux ACPI specific
stuff" in <asm/acpi.h> which should be included after including all other
ACPICA header files. Thus they really need to be moved to the position
marked with exclaimation mark or the definitions in the blocks marked with
"@Invisible@" will be invisible to such architecture specific "Linux ACPI
specific stuff" header blocks. This leaves 2 issues:
1. All environmental definitions in these blocks should have a copy in the
area marked with "@Redundant@" if they are required by the "Linux ACPI
specific stuff".
2. We cannot use any ACPICA defined types in <asm/acpi.h>.
This patch splits architecture specific ACPICA stuff from <asm/acpi.h> to
fix this issue.
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When running as a dom0 in PVH mode, foreign pfns that are accessed
must be added to our p2m which is managed by xen. This is done via
XENMEM_add_to_physmap_range hypercall. This is needed for toolstack
building guests and mapping guest memory, xentrace mapping xen pages,
etc.
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
MOV CR/DR instructions ignore the mod field (in the ModR/M byte). As the SDM
states: "The 2 bits in the mod field are ignored". Accordingly, the second
operand of these instructions is always a general purpose register.
The current emulator implementation does not do so. If the mod bits do not
equal 3, it expects the second operand to be in memory.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When Hyper-V enlightenments are in effect, Windows prefers to issue an
Hyper-V MSR write to issue an EOI rather than an x2apic MSR write.
The Hyper-V MSR write is not handled by the processor, and besides
being slower, this also causes bugs with APIC virtualization. The
reason is that on EOI the processor will modify the highest in-service
interrupt (SVI) field of the VMCS, as explained in section 29.1.4 of
the SDM; every other step in EOI virtualization is already done by
apic_send_eoi or on VM entry, but this one is missing.
We need to do the same, and be careful not to muck with the isr_count
and highest_isr_cache fields that are unused when virtual interrupt
delivery is enabled.
Cc: stable@vger.kernel.org
Reviewed-by: Yang Zhang <yang.z.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Conflicts:
drivers/net/bonding/bond_alb.c
drivers/net/ethernet/altera/altera_msgdma.c
drivers/net/ethernet/altera/altera_sgdma.c
net/ipv6/xfrm6_output.c
Several cases of overlapping changes.
The xfrm6_output.c has a bug fix which overlaps the renaming
of skb->local_df to skb->ignore_df.
In the Altera TSE driver cases, the register access cleanups
in net-next overlapped with bug fixes done in net.
Similarly a bug fix to send ALB packets in the bonding driver using
the right source address overlaps with cleanups in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
"It looks like a sizeble collection but this is nearly 3 weeks of bug
fixing while you were away.
1) Fix crashes over IPSEC tunnels with NAT, the latter can reroute
the packet through a non-IPSEC protected path and the code has to
be able to handle SKBs attached to routes lacking an attached xfrm
state. From Steffen Klassert.
2) Fix OOPSs in ipv4 and ipv6 ipsec layers for unsupported
sub-protocols, also from Steffen Klassert.
3) Set local_df on fragmented netfilter skbs otherwise we won't be
able to forward successfully, from Florian Westphal.
4) cdc_mbim ipv6 neighbour code does __vlan_find_dev_deep without
holding RCU lock, from Bjorn Mork.
5) local_df test in ip_may_fragment is inverted, from Florian
Westphal.
6) jme driver doesn't check for DMA mapping failures, from Neil
Horman.
7) qlogic driver doesn't calculate number of TX queues properly, from
Shahed Shaikh.
8) fib_info_cnt can drift irreversibly positive if we fail to
allocate the fi->fib_metrics array, from Sergey Popovich.
9) Fix use after free in ip6_route_me_harder(), also from Sergey
Popovich.
10) When SYSCTL is disabled, we don't handle local_port_range and
ping_group_range defaults properly at all, from Cong Wang.
11) Unaccelerated VLAN tagged frames improperly handled by cdc_mbim
driver, fix from Bjorn Mork.
12) cassini driver needs nested lock annotations for TX locking, from
Emil Goode.
13) On init error ipv6 VTI driver can unregister pernet ops twice,
oops. Fix from Mahtias Krause.
14) If macvlan device is down, don't propagate IFF_ALLMULTI changes,
from Peter Christensen.
15) Missing NULL pointer check while parsing netlink config options in
ip6_tnl_validate(). From Susant Sahani.
16) Fix handling of neighbour entries during ipv6 router reachability
probing, from Duan Jiong.
17) x86 and s390 JIT address randomization has some address
calculation bugs leading to crashes, from Alexei Starovoitov and
Heiko Carstens.
18) Clear up those uglies with nop patching and net_get_random_once(),
from Hannes Frederic Sowa.
19) Option length miscalculated in ip6_append_data(), fix also from
Hannes Frederic Sowa.
20) A while ago we fixed a race during device unregistry when a
namespace went down, turns out there is a second place that needs
similar protection. From Cong Wang.
21) In the new Altera TSE driver multicast filtering isn't working,
disable it and just use promisc mode until the cause is found.
From Vince Bridgers.
22) When we disable router enabling in ipv6 we have to flush the
cached routes explicitly, from Duan Jiong.
23) NBMA tunnels should not cache routes on the tunnel object because
the key is variable, from Timo Teräs.
24) With stacked devices GRO information in skb->cb[] can be not setup
properly, make sure it is in all code paths. From Eric Dumazet.
25) Really fix stacked vlan locking, multiple levels of nesting with
intervening non-vlan devices are possible. From Vlad Yasevich.
26) Fallback ipip tunnel device's mtu is not setup properly, from
Steffen Klassert.
27) The packet scheduler's tcindex filter can crash because we
structure copy objects with list_head's inside, oops. From Cong
Wang.
28) Fix CHECKSUM_COMPLETE handling for ipv6 GRE tunnels, from Eric
Dumazet.
29) In some configurations 'itag' in __mkroute_input() can end up
being used uninitialized because of how fib_validate_source()
works. Fix it by explitly initializing itag to zero like all the
other fib_validate_source() callers do, from Li RongQing"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits)
batman: fix a bogus warning from batadv_is_on_batman_iface()
ipv4: initialise the itag variable in __mkroute_input
bonding: Send ALB learning packets using the right source
bonding: Don't assume 802.1Q when sending alb learning packets.
net: doc: Update references to skb->rxhash
stmmac: Remove unbalanced clk_disable call
ipv6: gro: fix CHECKSUM_COMPLETE support
net_sched: fix an oops in tcindex filter
can: peak_pci: prevent use after free at netdev removal
ip_tunnel: Initialize the fallback device properly
vlan: Fix build error wth vlan_get_encap_level()
can: c_can: remove obsolete STRICT_FRAME_ORDERING Kconfig option
MAINTAINERS: Pravin Shelar is Open vSwitch maintainer.
bnx2x: Convert return 0 to return rc
bonding: Fix alb mode to only use first level vlans.
bonding: Fix stacked device detection in arp monitoring
macvlan: Fix lockdep warnings with stacked macvlan devices
vlan: Fix lockdep warning with stacked vlan devices.
net: Allow for more then a single subclass for netif_addr_lock
net: Find the nesting level of a given device by type.
...
Pull perf fixes from Ingo Molnar:
"The biggest changes are fixes for races that kept triggering Trinity
crashes, plus liblockdep build fixes and smaller misc fixes.
The liblockdep bits in perf/urgent are a pull mistake - they should
have been in locking/urgent - but by the time I noticed other commits
were added and testing was done :-/ Sorry about that"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix a race between ring_buffer_detach() and ring_buffer_attach()
perf: Prevent false warning in perf_swevent_add
perf: Limit perf_event_attr::sample_period to 63 bits
tools/liblockdep: Remove all build files when doing make clean
tools/liblockdep: Build liblockdep from tools/Makefile
perf/x86/intel: Fix Silvermont's event constraints
perf: Fix perf_event_init_context()
perf: Fix race in removing an event
Print the AGP bridge info the same way as the rest of the kernel, e.g.,
"0000:00:04.0" instead of "00:04:00".
Also print the AGP aperture address range the same way we print resources,
and label it explicitly as a bus address range.
No functional change except the message changes.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Replace printk() with pr_info(), pr_err(), etc. Define pr_fmt() to prefix
output with "AGP: ".
No functional change except the addition of "AGP: " prefix in dmesg output.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Move the pcibios_assign_resources() fs_initcall annotation next to the
function definition. No functional change.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
The DR7 masking which is done on task switch emulation should be in hex format
(clearing the local breakpoints enable bits 0,2,4 and 6).
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
CS.RPL is not equal to the CPL in the few instructions between
setting CR0.PE and reloading CS. And CS.DPL is also not equal
to the CPL for conforming code segments.
However, SS.DPL *is* always equal to the CPL except for the weird
case of SYSRET on AMD processors, which sets SS.DPL=SS.RPL from the
value in the STAR MSR, but force CPL=3 (Intel instead forces
SS.DPL=SS.RPL=CPL=3).
So this patch:
- modifies SVM to update the CPL from SS.DPL rather than CS.RPL;
the above case with SYSRET is not broken further, and the way
to fix it would be to pass the CPL to userspace and back
- modifies VMX to always return the CPL from SS.DPL (except
forcing it to 0 if we are emulating real mode via vm86 mode;
in vm86 mode all DPLs have to be 3, but real mode does allow
privileged instructions). It also removes the CPL cache,
which becomes a duplicate of the SS access rights cache.
This fixes doing KVM_IOCTL_SET_SREGS exactly after setting
CR0.PE=1 but before CS has been reloaded.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
I noticed on some of my systems that page fault tracing doesn't
work:
cd /sys/kernel/debug/tracing
echo 1 > events/exceptions/enable
cat trace;
# nothing shows up
I eventually traced it down to CONFIG_KVM_GUEST. At least in a
KVM VM, enabling that option breaks page fault tracing, and
disabling fixes it. I tried on some old kernels and this does
not appear to be a regression: it never worked.
There are two page-fault entry functions today. One when tracing
is on and another when it is off. The KVM code calls do_page_fault()
directly instead of calling the traced version:
> dotraplinkage void __kprobes
> do_async_page_fault(struct pt_regs *regs, unsigned long
> error_code)
> {
> enum ctx_state prev_state;
>
> switch (kvm_read_and_reset_pf_reason()) {
> default:
> do_page_fault(regs, error_code);
> break;
> case KVM_PV_REASON_PAGE_NOT_PRESENT:
I'm also having problems with the page fault tracing on bare
metal (same symptom of no trace output). I'm unsure if it's
related.
Steven had an alternative to this which has zero overhead when
tracing is off where this includes the standard noops even when
tracing is disabled. I'm unconvinced that the extra complexity
of his apporach:
http://lkml.kernel.org/r/20140508194508.561ed220@gandalf.local.home
is worth it, expecially considering that the KVM code is already
making page fault entry slower here. This solution is
dirt-simple.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86@kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: kvm@vger.kernel.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Table 7-1 of the SDM mentions a check that the code segment's
DPL must match the selector's RPL. This was not done by KVM,
fix it.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
During task switch, all of CS.DPL, CS.RPL, SS.DPL must match (in addition
to all the other requirements) and will be the new CPL. So far this
worked by carefully setting the CS selector and flag before doing the
task switch; setting CS.selector will already change the CPL.
However, this will not work once we get the CPL from SS.DPL, because
then you will have to set the full segment descriptor cache to change
the CPL. ctxt->ops->cpl(ctxt) will then return the old CPL during the
task switch, and the check that SS.DPL == CPL will fail.
Temporarily assume that the CPL comes from CS.RPL during task switch
to a protected-mode task. This is the same approach used in QEMU's
emulation code, which (until version 2.0) manually tracks the CPL.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This resolves the conflicts in the files:
drivers/iio/adc/Kconfig
drivers/staging/rtl8723au/os_dep/usb_ops_linux.c
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Merge x86/espfix into x86/vdso, due to changes in the vdso setup code
that otherwise cause conflicts.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
I haven't touched the device interrupt code, which is different
enough that it's probably not worth merging, and I haven't done
anything about paranoidzeroentry_ist yet.
This appears to produce an entry_64.o file that differs only in the
debug info line numbers.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/e7a6acfb130471700370e77af9e4b4b6ed46f5ef.1400709717.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The oops can be triggered in qemu using -no-hpet (but not nohpet) by
running a 32-bit program and reading a couple of pages before the vdso.
This should send SIGBUS instead of OOPSing.
The bug was introduced by:
commit 7a59ed415f
Author: Stefani Seibold <stefani@seibold.net>
Date: Mon Mar 17 23:22:09 2014 +0100
x86, vdso: Add 32 bit VDSO time support for 32 bit kernel
which is new in 3.15.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/e99025d887d6670b6c4d81e6ccfeeb83770b21e9.1400109621.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Merge in Linus' tree with:
fa81511bb0 x86-64, modify_ldt: Make support for 16-bit segments a runtime option
... reverted, to avoid a conflict. This commit is no longer necessary
with the proper fix in place.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The BIOS is supposed to provide ACPI _PXM methods for PCI host bridges if
it cares about platform topology. But some BIOSes do not, so add Fam15h
to the list of CPUs for which we fall back to reading node numbers from the
hardware.
Note that pci_acpi_scan_root() warns about the BIOS bug if we use this
information because (1) the hardware node numbers are not necessarily
compatible with other logical node numbers from ACPI, and (2) the lack of
_PXM forces OS updates that would not otherwise be required.
[bhelgaas: changelog, comments]
Link: https://bugzilla.kernel.org/show_bug.cgi?id=72051
Tested-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Signed-off-by: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Signed-off-by: Myron Stowe <myron.stowe@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Robert Richter <rric@kernel.org>
Cc: Daniel J Blueman <daniel@numascale.com>
Cc: Andreas Herrmann <herrmann.der.user@googlemail.com>
The vast majority of platforms are not supplying ACPI _PXM (proximity)
information corresponding to host bridge (PNP0A03/PNP0A08) devices
resulting in sysfs "numa_node" values of -1 (NUMA_NO_NODE):
# for i in /sys/devices/pci0000\:00/*/numa_node; do cat $i; done | uniq
-1
# find /sys/ -name "numa_node" | while read fname; do cat $fname; \
done | uniq
-1
AMD based platforms provide a fall-back for this situation via amd_bus.c.
These platforms snoop out the information by directly reading specific
registers from the Northbridge and caching them via alloc_pci_root_info().
Later during boot processing when host bridges are discovered -
pci_acpi_scan_root() - the kernel looks for their corresponding ACPI _PXM
method - drivers/acpi/numa.c::acpi_get_node(). If the BIOS supplied a _PXM
method then that node (proximity) value is associated. If the BIOS did not
supply a _PXM method *and* the platform is AMD-based, the fall-back cached
values obtained directly from the Northbridge are used; otherwise,
"NUMA_NO_NODE" is associated.
There are a number of issues with this fall-back mechanism the most notable
being that amd_bus.c extracts a 3-bit number from a CPU register and uses
it as the node number. The node numbers used by Linux are logical and
there's no reason they need to be identical to settings in the CPU
registers. So if we have some node information obtained in the normal way
(from _PXM, SLIT, SRAT, etc.) and some from amd_bus.c, there's no reason to
believe they will be compatible.
This patch warns when this situation occurs:
pci_root PNP0A08:00: [Firmware Bug]: no _PXM; falling back to node 0 from hardware (may be inconsistent with ACPI node numbers)
Link: https://bugzilla.kernel.org/show_bug.cgi?id=72051
Signed-off-by: Myron Stowe <myron.stowe@redhat.com>
Signed-off-by: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Add a cmdline param which disables the microcode loader. This is useful
mostly in debugging situations where we want to turn off microcode
loading, both early from the initrd and late, as a means to be able to
rule out its influence on the machine.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1400525957-11525-3-git-send-email-bp@alien8.de
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Carve out early cmdline parsing function into .../lib/cmdline.c so it
can be used by early code in the kernel proper as well.
Adapted from arch/x86/boot/cmdline.c.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1400525957-11525-2-git-send-email-bp@alien8.de
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Using arch_vma_name to give special mappings a name is awkward. x86
currently implements it by comparing the start address of the vma to
the expected address of the vdso. This requires tracking the start
address of special mappings and is probably buggy if a special vma
is split or moved.
Improve _install_special_mapping to just name the vma directly. Use
it to give the x86 vvar area a name, which should make CRIU's life
easier.
As a side effect, the vvar area will show up in core dumps. This
could be considered weird and is fixable.
[hpa: I say we accept this as-is but be prepared to deal with knocking
out the vvars from core dumps if this becomes a problem.]
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/276b39b6b645fb11e345457b503f17b83c2c6fd0.1400538962.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The oops can be triggered in qemu using -no-hpet (but not nohpet) by
reading a couple of pages past the end of the vdso text. This
should send SIGBUS instead of OOPSing.
The bug was introduced by:
commit 7a59ed415f
Author: Stefani Seibold <stefani@seibold.net>
Date: Mon Mar 17 23:22:09 2014 +0100
x86, vdso: Add 32 bit VDSO time support for 32 bit kernel
which is new in 3.15.
This will be fixed separately in 3.15, but that patch will not apply
to tip/x86/vdso. This is the equivalent fix for tip/x86/vdso and,
presumably, 3.16.
Cc: Stefani Seibold <stefani@seibold.net>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/c8b0a9a0b8d011a8b273cbb2de88d37190ed2751.1400538962.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This patch fixes a bug in precise_store_data_hsw() whereby
it would set the data source memory level to the wrong value.
As per the the SDM Vol 3b Table 18-41 (Layout of Data Linear
Address Information in PEBS Record), when status bit 0 is set
this is a L1 hit, otherwise this is a L1 miss.
This patch encodes the memory level according to the specification.
In V2, we added the filtering on the store events.
Only the following events produce L1 information:
* MEM_UOPS_RETIRED.STLB_MISS_STORES
* MEM_UOPS_RETIRED.LOCK_STORES
* MEM_UOPS_RETIRED.SPLIT_STORES
* MEM_UOPS_RETIRED.ALL_STORES
Cc: mingo@elte.hu
Cc: acme@ghostprotocols.net
Cc: jolsa@redhat.com
Cc: jmario@redhat.com
Cc: ak@linux.intel.com
Tested-and-Reviewed-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140515155644.GA3884@quad
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
That's a leftover from the time where x86 supported SPARSE_IRQ=n.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Grant Likely <grant.likely@linaro.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/20140507154338.967285614@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
No more users. Remove the cruft
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Grant Likely <grant.likely@linaro.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/20140507154336.760446122@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
ia64 and x86 share this driver. x86 is moving to a different irq
allocation and ia64 keeps its private irq_create/destroy stuff.
Use macros to redirect to one or the other. Yes, macros to avoid
include hell.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Grant Likely <grant.likely@linaro.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Acked-by: Joerg Roedel <joro@8bytes.org>
Cc: x86@kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: iommu@lists.linux-foundation.org
Link: http://lkml.kernel.org/r/20140507154336.372289825@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
No need to expose this outside of the ioapic code. The dynamic
allocations are guaranteed not to happen in the gsi space. See commit
62a08ae2a.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Grant Likely <grant.likely@linaro.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: x86@kernel.org
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/20140507154335.959870037@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
No functional change just less crap.
This does not replace the requirement to move x86 to irq domains, but
it limits the mess to some degree.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Grant Likely <grant.likely@linaro.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/20140507154335.749579081@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
No functional change. The request to allocate the irq above
NR_IRQS_LEGACY is completely pointless as the implementation enforces
that the dynamic allocations are above the GSI interrupts, which
includes the legacy PIT irqs.
This does not replace the requirement to move x86 to irq domains, but
it limits the mess to some degree.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Grant Likely <grant.likely@linaro.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/20140507154335.252789823@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use the new interfaces. No functional change.
This does not replace the requirement to move x86 to irq domains, but
it limits the mess to some degree.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Grant Likely <grant.likely@linaro.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/20140507154334.991589924@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This is just a cleanup to get rid of the create/destroy_irq variants
which were designed in hell.
The long term solution for x86 is to switch over to irq domains and
cleanup the whole vector allocation mess.
The generic irq_alloc_hwirqs() interface deliberately prevents
multi-MSI vector allocation to further enforce the irq domain
conversion (aside of the desire to support ioapic hotplug).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Grant Likely <grant.likely@linaro.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/20140507154334.482904047@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Maps all internal BPF instructions into x86_64 instructions.
This patch replaces original BPF x64 JIT with internal BPF x64 JIT.
sysctl net.core.bpf_jit_enable is reused as on/off switch.
Performance:
1. old BPF JIT and internal BPF JIT generate equivalent x86_64 code.
No performance difference is observed for filters that were JIT-able before
Example assembler code for BPF filter "tcpdump port 22"
original BPF -> old JIT: original BPF -> internal BPF -> new JIT:
0: push %rbp 0: push %rbp
1: mov %rsp,%rbp 1: mov %rsp,%rbp
4: sub $0x60,%rsp 4: sub $0x228,%rsp
8: mov %rbx,-0x8(%rbp) b: mov %rbx,-0x228(%rbp) // prologue
12: mov %r13,-0x220(%rbp)
19: mov %r14,-0x218(%rbp)
20: mov %r15,-0x210(%rbp)
27: xor %eax,%eax // clear A
c: xor %ebx,%ebx 29: xor %r13,%r13 // clear X
e: mov 0x68(%rdi),%r9d 2c: mov 0x68(%rdi),%r9d
12: sub 0x6c(%rdi),%r9d 30: sub 0x6c(%rdi),%r9d
16: mov 0xd8(%rdi),%r8 34: mov 0xd8(%rdi),%r10
3b: mov %rdi,%rbx
1d: mov $0xc,%esi 3e: mov $0xc,%esi
22: callq 0xffffffffe1021e15 43: callq 0xffffffffe102bd75
27: cmp $0x86dd,%eax 48: cmp $0x86dd,%rax
2c: jne 0x0000000000000069 4f: jne 0x000000000000009a
2e: mov $0x14,%esi 51: mov $0x14,%esi
33: callq 0xffffffffe1021e31 56: callq 0xffffffffe102bd91
38: cmp $0x84,%eax 5b: cmp $0x84,%rax
3d: je 0x0000000000000049 62: je 0x0000000000000074
3f: cmp $0x6,%eax 64: cmp $0x6,%rax
42: je 0x0000000000000049 68: je 0x0000000000000074
44: cmp $0x11,%eax 6a: cmp $0x11,%rax
47: jne 0x00000000000000c6 6e: jne 0x0000000000000117
49: mov $0x36,%esi 74: mov $0x36,%esi
4e: callq 0xffffffffe1021e15 79: callq 0xffffffffe102bd75
53: cmp $0x16,%eax 7e: cmp $0x16,%rax
56: je 0x00000000000000bf 82: je 0x0000000000000110
58: mov $0x38,%esi 88: mov $0x38,%esi
5d: callq 0xffffffffe1021e15 8d: callq 0xffffffffe102bd75
62: cmp $0x16,%eax 92: cmp $0x16,%rax
65: je 0x00000000000000bf 96: je 0x0000000000000110
67: jmp 0x00000000000000c6 98: jmp 0x0000000000000117
69: cmp $0x800,%eax 9a: cmp $0x800,%rax
6e: jne 0x00000000000000c6 a1: jne 0x0000000000000117
70: mov $0x17,%esi a3: mov $0x17,%esi
75: callq 0xffffffffe1021e31 a8: callq 0xffffffffe102bd91
7a: cmp $0x84,%eax ad: cmp $0x84,%rax
7f: je 0x000000000000008b b4: je 0x00000000000000c2
81: cmp $0x6,%eax b6: cmp $0x6,%rax
84: je 0x000000000000008b ba: je 0x00000000000000c2
86: cmp $0x11,%eax bc: cmp $0x11,%rax
89: jne 0x00000000000000c6 c0: jne 0x0000000000000117
8b: mov $0x14,%esi c2: mov $0x14,%esi
90: callq 0xffffffffe1021e15 c7: callq 0xffffffffe102bd75
95: test $0x1fff,%ax cc: test $0x1fff,%rax
99: jne 0x00000000000000c6 d3: jne 0x0000000000000117
d5: mov %rax,%r14
9b: mov $0xe,%esi d8: mov $0xe,%esi
a0: callq 0xffffffffe1021e44 dd: callq 0xffffffffe102bd91 // MSH
e2: and $0xf,%eax
e5: shl $0x2,%eax
e8: mov %rax,%r13
eb: mov %r14,%rax
ee: mov %r13,%rsi
a5: lea 0xe(%rbx),%esi f1: add $0xe,%esi
a8: callq 0xffffffffe1021e0d f4: callq 0xffffffffe102bd6d
ad: cmp $0x16,%eax f9: cmp $0x16,%rax
b0: je 0x00000000000000bf fd: je 0x0000000000000110
ff: mov %r13,%rsi
b2: lea 0x10(%rbx),%esi 102: add $0x10,%esi
b5: callq 0xffffffffe1021e0d 105: callq 0xffffffffe102bd6d
ba: cmp $0x16,%eax 10a: cmp $0x16,%rax
bd: jne 0x00000000000000c6 10e: jne 0x0000000000000117
bf: mov $0xffff,%eax 110: mov $0xffff,%eax
c4: jmp 0x00000000000000c8 115: jmp 0x000000000000011c
c6: xor %eax,%eax 117: mov $0x0,%eax
c8: mov -0x8(%rbp),%rbx 11c: mov -0x228(%rbp),%rbx // epilogue
cc: leaveq 123: mov -0x220(%rbp),%r13
cd: retq 12a: mov -0x218(%rbp),%r14
131: mov -0x210(%rbp),%r15
138: leaveq
139: retq
On fully cached SKBs both JITed functions take 12 nsec to execute.
BPF interpreter executes the program in 30 nsec.
The difference in generated assembler is due to the following:
Old BPF imlements LDX_MSH instruction via sk_load_byte_msh() helper function
inside bpf_jit.S.
New JIT removes the helper and does it explicitly, so ldx_msh cost
is the same for both JITs, but generated code looks longer.
New JIT has 4 registers to save, so prologue/epilogue are larger,
but the cost is within noise on x64.
Old JIT checks whether first insn clears A and if not emits 'xor %eax,%eax'.
New JIT clears %rax unconditionally.
2. old BPF JIT doesn't support ANC_NLATTR, ANC_PAY_OFFSET, ANC_RANDOM
extensions. New JIT supports all BPF extensions.
Performance of such filters improves 2-4 times depending on a filter.
The longer the filter the higher performance gain.
Synthetic benchmarks with many ancillary loads see 20x speedup
which seems to be the maximum gain from JIT
Notes:
. net.core.bpf_jit_enable=2 + tools/net/bpf_jit_disasm is still functional
and can be used to see generated assembler
. there are two jit_compile() functions and code flow for classic filters is:
sk_attach_filter() - load classic BPF
bpf_jit_compile() - try to JIT from classic BPF
sk_convert_filter() - convert classic to internal
bpf_int_jit_compile() - JIT from internal BPF
seccomp and tracing filters will just call bpf_int_jit_compile()
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Split bpf_jit_compile() into two functions to improve readability
of for(pass++) loop. The change follows similar style of JIT compilers
for arm, powerpc, s390
The body of new do_jit() was not reformatted to reduce noise
in this patch, since the following patch replaces most of it.
Tested with BPF testsuite.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can now enable the 64bit option for the Goldfish 64bit emulator.
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
_PAGE_IOMAP is used in xen_remap_domain_mfn_range() to prevent the
pfn_pte() call in remap_area_mfn_pte_fn() from using the p2m to translate
the MFN. If mfn_pte() is used instead, the p2m look up is avoided and
the use of _PAGE_IOMAP is no longer needed.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
PCI devices may have BARs located above the end of RAM so mark such
frames as identity frames in the p2m (instead of the default of
missing).
PFNs outside the p2m (above MAX_P2M_PFN) are also considered to be
identity frames for the same reason.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
In xen_add_extra_mem(), if the WARN() checks for bad MFNs trigger it is
likely that they will trigger at lot, spamming the log.
Use WARN_ONCE() instead.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Large (multi-GB) identity ranges currently require a unique middle page
(filled with p2m_identity entries) per 1 GB region.
Similar to the common p2m_mid_missing middle page for large missing
regions, introduce a p2m_mid_identity page (filled with p2m_identity
entries) which can be used instead.
set_phys_range_identity() thus only needs to allocate new middle pages
at the beginning and end of the range.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Allow set_phys_range_identity() to work with a range that overlaps
MAX_P2M_PFN by clamping pfn_e to MAX_P2M_PFN.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
early_p2m_alloc_middle() allocates a new leaf page and
early_p2m_alloc() allocates a new middle page. This is confusing.
Swap the names so they match what the functions actually do.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Execution is not going to continue after telling Xen about the crash.
Let other panic notifiers run by postponing the final hypercall as much
as possible.
Signed-off-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Checkin:
b3b42ac2cb x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
disabled 16-bit segments on 64-bit kernels due to an information
leak. However, it does seem that people are genuinely using Wine to
run old 16-bit Windows programs on Linux.
A proper fix for this ("espfix64") is coming in the upcoming merge
window, but as a temporary fix, create a sysctl to allow the
administrator to re-enable support for 16-bit segments.
It adds a "/proc/sys/abi/ldt16" sysctl that defaults to zero (off). If
you hit this issue and care about your old Windows program more than
you care about a kernel stack address information leak, you can do
echo 1 > /proc/sys/abi/ldt16
as root (add it to your startup scripts), and you should be ok.
The sysctl table is only added if you have COMPAT support enabled on
x86-64, but I assume anybody who runs old windows binaries very much
does that ;)
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/CA%2B55aFw9BPoD10U1LfHbOMpHWZkvJTkMcfCs9s3urPr1YyWBxw@mail.gmail.com
Cc: <stable@vger.kernel.org>
Updating system_time from the kernel clock once master clock
has been enabled can result in time backwards event, in case
kernel clock frequency is lower than TSC frequency.
Disable master clock in case it is necessary to update it
from the resume path.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
As the mcount code gets more complex, it really does not belong
in the entry.S file. By moving it into its own file "mcount.S"
keeps things a bit cleaner.
Link: http://lkml.kernel.org/p/20140508152152.2130e8cf@gandalf.local.home
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As the decision to what needs to be done (converting a call to the
ftrace_caller to ftrace_caller_regs or to convert from ftrace_caller_regs
to ftrace_caller) can easily be determined from the rec->flags of
FTRACE_FL_REGS and FTRACE_FL_REGS_EN, there's no need to have the
ftrace_check_record() return either a UPDATE_MODIFY_CALL_REGS or a
UPDATE_MODIFY_CALL. Just he latter is enough. This added flag causes
more complexity than is required. Remove it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Move and rename get_ftrace_addr() and get_ftrace_addr_old() to
ftrace_get_addr_new() and ftrace_get_addr_curr() respectively.
This moves these two helper functions in the generic code out from
the arch specific code, and renames them to have a better generic
name. This will allow other archs to use them as well as makes it
a bit easier to work on getting separate trampolines for different
functions.
ftrace_get_addr_new() returns the trampoline address that the mcount
call address will be converted to.
ftrace_get_addr_curr() returns the trampoline address of what the
mcount call address currently jumps to.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The add_breakpoint() code in the ftrace updating gets the address
of what the call will become, but if the mcount address is changing
from regs to non-regs ftrace_caller or vice versa, it will use what
the record currently is.
This is rather silly as the code should always use what is currently
there regardless of if it's changing the regs function or just converting
to a nop.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If the probed insn triggers a trap, ->si_addr = regs->ip is technically
correct, but this is not what the signal handler wants; we need to pass
the address of the probed insn, not the address of xol slot.
Add the new arch-agnostic helper, uprobe_get_trap_addr(), and change
fill_trap_info() and math_error() to use it. !CONFIG_UPROBES case in
uprobes.h uses a macro to avoid include hell and ensure that it can be
compiled even if an architecture doesn't define instruction_pointer().
Test-case:
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
extern void probe_div(void);
void sigh(int sig, siginfo_t *info, void *c)
{
int passed = (info->si_addr == probe_div);
printf(passed ? "PASS\n" : "FAIL\n");
_exit(!passed);
}
int main(void)
{
struct sigaction sa = {
.sa_sigaction = sigh,
.sa_flags = SA_SIGINFO,
};
sigaction(SIGFPE, &sa, NULL);
asm (
"xor %ecx,%ecx\n"
".globl probe_div; probe_div:\n"
"idiv %ecx\n"
);
return 0;
}
it fails if probe_div() is probed.
Note: show_unhandled_signals users should probably use this helper too,
but we need to cleanup them first.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Move the callsite of fill_trap_info() into do_error_trap() and remove
the "siginfo_t *info" argument.
This obviously breaks DO_ERROR() which passed info == NULL, we simply
change fill_trap_info() to return "siginfo_t *" and add the "default"
case which returns SEND_SIG_PRIV.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Extract the fill-siginfo code from DO_ERROR_INFO() into the new helper,
fill_trap_info().
It can calculate si_code and si_addr looking at trapnr, so we can remove
these arguments from DO_ERROR_INFO() and simplify the source code. The
generated code is the same, __builtin_constant_p(trapnr) == T.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Move the common code from DO_ERROR() and DO_ERROR_INFO() into the new
helper, do_error_trap(). This simplifies define's and shaves 527 bytes
from traps.o.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
force_sig() is just force_sig_info(SEND_SIG_PRIV). Imho it should die,
we have too many ugly "send signal" helpers.
And do_trap() looks just ugly because it uses force_sig_info() or
force_sig() depending on info != NULL.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Before this patch, instructions such as div, mul, shifts with count
in CL, cmpxchg are mishandled.
This patch adds vex prefix handling. In particular, it avoids colliding
with register operand encoded in vex.vvvv field.
Since we need to avoid two possible register operands, the selection of
scratch register needs to be from at least three registers.
After looking through a lot of CPU docs, it looks like the safest choice
is SI,DI,BX. Selecting BX needs care to not collide with implicit use of
BX by cmpxchg8b.
Test-case:
#include <stdio.h>
static const char *const pass[] = { "FAIL", "pass" };
long two = 2;
void test1(void)
{
long ax = 0, dx = 0;
asm volatile("\n"
" xor %%edx,%%edx\n"
" lea 2(%%edx),%%eax\n"
// We divide 2 by 2. Result (in eax) should be 1:
" probe1: .globl probe1\n"
" divl two(%%rip)\n"
// If we have a bug (eax mangled on entry) the result will be 2,
// because eax gets restored by probe machinery.
: "=a" (ax), "=d" (dx) /*out*/
: "0" (ax), "1" (dx) /*in*/
: "memory" /*clobber*/
);
dprintf(2, "%s: %s\n", __func__,
pass[ax == 1]
);
}
long val2 = 0;
void test2(void)
{
long old_val = val2;
long ax = 0, dx = 0;
asm volatile("\n"
" mov val2,%%eax\n" // eax := val2
" lea 1(%%eax),%%edx\n" // edx := eax+1
// eax is equal to val2. cmpxchg should store edx to val2:
" probe2: .globl probe2\n"
" cmpxchg %%edx,val2(%%rip)\n"
// If we have a bug (eax mangled on entry), val2 will stay unchanged
: "=a" (ax), "=d" (dx) /*out*/
: "0" (ax), "1" (dx) /*in*/
: "memory" /*clobber*/
);
dprintf(2, "%s: %s\n", __func__,
pass[val2 == old_val + 1]
);
}
long val3[2] = {0,0};
void test3(void)
{
long old_val = val3[0];
long ax = 0, dx = 0;
asm volatile("\n"
" mov val3,%%eax\n" // edx:eax := val3
" mov val3+4,%%edx\n"
" mov %%eax,%%ebx\n" // ecx:ebx := edx:eax + 1
" mov %%edx,%%ecx\n"
" add $1,%%ebx\n"
" adc $0,%%ecx\n"
// edx:eax is equal to val3. cmpxchg8b should store ecx:ebx to val3:
" probe3: .globl probe3\n"
" cmpxchg8b val3(%%rip)\n"
// If we have a bug (edx:eax mangled on entry), val3 will stay unchanged.
// If ecx:edx in mangled, val3 will get wrong value.
: "=a" (ax), "=d" (dx) /*out*/
: "0" (ax), "1" (dx) /*in*/
: "cx", "bx", "memory" /*clobber*/
);
dprintf(2, "%s: %s\n", __func__,
pass[val3[0] == old_val + 1 && val3[1] == 0]
);
}
int main(int argc, char **argv)
{
test1();
test2();
test3();
return 0;
}
Before this change all tests fail if probe{1,2,3} are probed.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
It is possible to replace rip-relative addressing mode with addressing
mode of the same length: (reg+disp32). This eliminates the need to fix
up immediate and correct for changing instruction length.
And we can kill arch_uprobe->def.riprel_target.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
The invalidation is required in order to maintain proper semantics
under CoW conditions. In scenarios where a process clones several
threads, a thread operating on a core whose DTLB entry for a
particular hugepage has not been invalidated, will be reading from
the hugepage that belongs to the forked child process, even after
hugetlb_cow().
The thread will not see the updated page as long as the stale DTLB
entry remains cached, the thread attempts to write into the page,
the child process exits, or the thread gets migrated to a different
processor.
Signed-off-by: Anthony Iliopoulos <anthony.iliopoulos@huawei.com>
Link: http://lkml.kernel.org/r/20140514092948.GA17391@server-36.huawei.corp
Suggested-by: Shay Goikhman <shay.goikhman@huawei.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@vger.kernel.org> # v2.6.16+ (!)
bpf_alloc_binary() adds 128 bytes of room to JITed program image
and rounds it up to the nearest page size. If image size is close
to page size (like 4000), it is rounded to two pages:
round_up(4000 + 4 + 128) == 8192
then 'hole' is computed as 8192 - (4000 + 4) = 4188
If prandom_u32() % hole selects a number >= PAGE_SIZE - sizeof(*header)
then kernel will crash during bpf_jit_free():
kernel BUG at arch/x86/mm/pageattr.c:887!
Call Trace:
[<ffffffff81037285>] change_page_attr_set_clr+0x135/0x460
[<ffffffff81694cc0>] ? _raw_spin_unlock_irq+0x30/0x50
[<ffffffff810378ff>] set_memory_rw+0x2f/0x40
[<ffffffffa01a0d8d>] bpf_jit_free_deferred+0x2d/0x60
[<ffffffff8106bf98>] process_one_work+0x1d8/0x6a0
[<ffffffff8106bf38>] ? process_one_work+0x178/0x6a0
[<ffffffff8106c90c>] worker_thread+0x11c/0x370
since bpf_jit_free() does:
unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
struct bpf_binary_header *header = (void *)addr;
to compute start address of 'bpf_binary_header'
and header->pages will pass junk to:
set_memory_rw(addr, header->pages);
Fix it by making sure that &header->image[prandom_u32() % hole] and &header
are in the same page
Fixes: 314beb9bca ("x86: bpf_jit_comp: secure bpf jit against spraying attacks")
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
gen8_stolen_size() is missing __init, so add it.
Also all the intel_stolen_funcs structures can be marked
__initconst.
intel_stolen_ids[] can also be made const if we replace the
__initdata with __initconst.
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
CHV uses the same bits as SNB/VLV to code the Graphics Mode Select field
(GFX stolen memory size) with the addition of finer granularity modes:
4MB increments from 0x11 (8MB) to 0x1d.
Values strictly above 0x1d are either reserved or not supported.
v2: 4MB increments, not 8MB. 32MB has been omitted from the list of new
values (Ville Syrjälä)
v3: Also correctly interpret GGMS (GTT Graphics Memory Size) (Ville
Syrjälä)
v4: Don't assign a value that needs 20bits or more to a u16 (Rafael
Barbalho)
[vsyrjala: v5: Split from i915 changes and add chv_stolen_funcs]
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Reviewed-by: Jani Nikula <jani.nikula@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Reviewed-by: Rafael Barbalho <rafael.barbalho@intel.com>
Tested-by: Rafael Barbalho <rafael.barbalho@intel.com>
Signed-off-by: Damien Lespiau <damien.lespiau@intel.com>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Regression of 346874c9: PAE is set in long mode, but that does not mean
we have valid PDPTRs.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Conflicts:
drivers/net/ethernet/altera/altera_sgdma.c
net/netlink/af_netlink.c
net/sched/cls_api.c
net/sched/sch_api.c
The netlink conflict dealt with moving to netlink_capable() and
netlink_ns_capable() in the 'net' tree vs. supporting 'tc' operations
in non-init namespaces. These were simple transformations from
netlink_capable to netlink_ns_capable.
The Altera driver conflict was simply code removal overlapping some
void pointer cast cleanups in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
New architectures currently have to provide implementations of 5 different
functions: xen_arch_pre_suspend(), xen_arch_post_suspend(),
xen_arch_hvm_post_suspend(), xen_mm_pin_all(), and xen_mm_unpin_all().
Refactor the suspend code to only require xen_arch_pre_suspend() and
xen_arch_post_suspend().
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
One can logically expect that when the user has specified "nordrand",
the user doesn't want any use of the CPU random number generator,
neither RDRAND nor RDSEED, so disable both.
Reported-by: Stephan Mueller <smueller@chronox.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Link: http://lkml.kernel.org/r/21542339.0lFnPSyGRS@myon.chronox.de
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Added all the MBI units below and their associated read/write
opcodes:
- Host Bridge Arbiter
- Host Bridge
- Remote Management Unit
- Memory Manager & eSRAM
- SoC Unit
Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
Link: http://lkml.kernel.org/r/1399668248-24199-3-git-send-email-david.e.box@linux.intel.com
Signed-off-by: David E. Box <david.e.box@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Currently drivers that run on non-IOSF systems (Core/Xeon) can't use the IOSF
driver on SOC's without selecting it which forces an unnecessary and limiting
dependency. Provides dummy functions to allow these modules to conditionally
use the driver on IOSF equipped platforms without impacting their ability to
compile and load on non-IOSF platforms. Build default m to ensure availability
on x86 SOC's.
Signed-off-by: David E. Box <david.e.box@linux.intel.com>
Link: http://lkml.kernel.org/r/1399668248-24199-2-git-send-email-david.e.box@linux.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
With tk->wall_to_monotonic.tv_nsec being a 32-bit value on 32-bit
systems, (tk->wall_to_monotonic.tv_nsec << tk->shift) in update_vsyscall()
may lose upper bits or, worse, add them since compiler will do this:
(u64)(tk->wall_to_monotonic.tv_nsec << tk->shift)
instead of
((u64)tk->wall_to_monotonic.tv_nsec << tk->shift)
So if, for example, tv_nsec is 0x800000 and shift is 8 we will end up
with 0xffffffff80000000 instead of 0x80000000. And then we are stuck in
the subsequent 'while' loop.
We need an explicit cast.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: http://lkml.kernel.org/r/1399648287-15178-1-git-send-email-boris.ostrovsky@oracle.com
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: <stable@vger.kernel.org> # v3.14
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Due to a typo the msr accessor function introduced in
22085a66c2 didn't have any lasting
effects because they accidentally wrote the old value back.
After c0a639ad0b this at the very least
this causes cpuid limits not to be lifted on some cpus leading to
missing capabilities for those.
Signed-off-by: Andres Freund <andres@anarazel.de>
Link: http://lkml.kernel.org/r/1399598957-7011-2-git-send-email-andres@anarazel.de
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Given the fact that we removed inclusion of boot.h from boot/string.c
does not look like we need misc.h inclusion in compressed/string.c. So
remove it.
misc.h was also pulling in string_32.h which in turn had macros for
memcmp and memcpy. So we don't need to #undef memcmp and memcpy anymore.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Link: http://lkml.kernel.org/r/1398447972-27896-3-git-send-email-vgoyal@redhat.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
string.c does not require whole of boot.h. Just inclusion of linux/types.h
and ctypes.h seems to be sufficient.
Keep list of stuff being included in string.c to bare minimal so that
string.c can be included in other places easily.
For example, Currently boot/compressed/string.c includes boot/string.c
but looks like it does not want boot/boot.h. Hence there is a define
in boot/compressed/misc.h "define BOOT_BOOT_H" which prevents inclusion
of boot.h in compressed/string.c. And compressed/string.c is forced to
include misc.h just for that reason.
So by removing inclusion of boot.h, we can also get rid of inclusion of
misch.h in compressed/misc.c.
This also enables including of boot/string.c in purgatory/ code relatively
easily.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Link: http://lkml.kernel.org/r/1398447972-27896-2-git-send-email-vgoyal@redhat.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Treat monitor and mwait instructions as nop, which is architecturally
correct (but inefficient) behavior. We do this to prevent misbehaving
guests (e.g. OS X <= 10.7) from crashing after they fail to check for
monitor/mwait availability via cpuid.
Since mwait-based idle loops relying on these nop-emulated instructions
would keep the host CPU pegged at 100%, do NOT advertise their presence
via cpuid, to prevent compliant guests from using them inadvertently.
Signed-off-by: Gabriel L. Somlo <somlo@cmu.edu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Standardize the idle polling indicator to TIF_POLLING_NRFLAG such that
both TIF_NEED_RESCHED and TIF_POLLING_NRFLAG are in the same word.
This will allow us, using fetch_or(), to both set NEED_RESCHED and
check for POLLING_NRFLAG in a single operation and avoid pointless
wakeups.
Changing from the non-atomic thread_info::status flags to the atomic
thread_info::flags shouldn't be a big issue since most polling state
changes were followed/preceded by a full memory barrier anyway.
Also, fix up the apm_32 idle function, clearly that was forgotten in
the last conversion. The default idle state is !POLLING so just kill
the lot.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Link: http://lkml.kernel.org/n/tip-7yksmqtlv4nfowmlqr1rifoi@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
HPET on current Baytrail platform has accuracy problem to be
used as reliable clocksource/clockevent, so add a early quirk to
disable it.
Signed-off-by: Feng Tang <feng.tang@intel.com>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1398327498-13163-2-git-send-email-feng.tang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
HPET on some platform has accuracy problem. Making
"boot_hpet_disable" extern so that we can runtime disable
the HPET timer by using quirk to check the platform.
Signed-off-by: Feng Tang <feng.tang@intel.com>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1398327498-13163-1-git-send-email-feng.tang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If you are using a 64-bit kernel with 32-bit userland, then
scripts/gcc-x86_64-has-stack-protector.sh invokes 32-bit gcc
with -mcmodel=kernel, which produces:
<stdin>:1:0: error: code model 'kernel' not supported in the 32 bit mode
and trips the "broken compiler" test at arch/x86/Makefile:120.
There are several places a fix is possible, but the following seems
cleanest. (But it's minimal; it would also be possible to factor
out a bunch of stuff from the two branches of the if.)
Signed-off-by: George Spelvin <linux@horizon.com>
Link: http://lkml.kernel.org/r/20140507210552.7581.qmail@ns.horizon.com
Cc: <stable@vger.kernel.org> # v3.14
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
It seems that it's easy to implement the EOI assist
on top of the PV EOI feature: simply convert the
page address to the format expected by PV EOI.
Notes:
-"No EOI required" is set only if interrupt injected
is edge triggered; this is true because level interrupts are going
through IOAPIC which disables PV EOI.
In any case, if guest triggers EOI the bit will get cleared on exit.
-For migration, set of HV_X64_MSR_APIC_ASSIST_PAGE sets
KVM_PV_EOI_EN internally, so restoring HV_X64_MSR_APIC_ASSIST_PAGE
seems sufficient
In any case, bit is cleared on exit so worst case it's never re-enabled
-no handling of PV EOI data is performed at HV_X64_MSR_EOI write;
HV_X64_MSR_EOI is a separate optimization - it's an X2APIC
replacement that lets you do EOI with an MSR and not IO.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In long-mode, bit 7 in the PDPTE is not reserved only if 1GB pages are
supported by the CPU. Currently the bit is considered by KVM as always
reserved.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The RSP register is not automatically cached, causing mov DR instruction with
RSP to fail. Instead the regular register accessing interface should be used.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
While running a nested guest, we should disable APIC virtualization
controls (virtualized APIC register accesses, virtual interrupt
delivery and posted interrupts), because we do not expose them to
the nested guest.
Reported-by: Hu Yaohui <loki2441@gmail.com>
Suggested-by: Abel Gordon <abel@stratoscale.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Event 0x013c is not the same as fixed counter2, remove it from
Silvermont's event constraints.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1398755081-12471-1-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some checks are common to all, and moreover,
according to the spec, the check for whether any bits
beyond the physical address width are set are also
applicable to all of them
Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The spec mandates that if the vmptrld or vmclear
address is equal to the vmxon region pointer, the
instruction should fail with error "VMPTRLD with
VMXON pointer" or "VMCLEAR with VMXON pointer"
Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, the vmxon region isn't used in the nested case.
However, according to the spec, the vmxon instruction performs
additional sanity checks on this region and the associated
pointer. Modify emulated vmxon to better adhere to the spec
requirements
Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Our common function for vmptr checks (in 2/4) needs to fetch
the memory address
Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
As requested by Linus add explicit __visible to the asmlinkage users.
This marks all functions visible to assembler.
Tree sweep for arch/x86/*
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1398984278-29319-3-git-send-email-andi@firstfloor.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
arch/x86/crypto/sha1_avx2_x86_64_asm.S introduced _end as a local
symbol, which broke the build under certain circumstances. Although
the wisdom of _end as a local symbol can definitely be questioned, the
build should not break for that reason.
Thus, filter the output of nm to only get global symbols of
appropriate type.
Reported-by: Andy Lutomirski <luto@amacapital.net>
Cc: Chandramouli Narayanan <mouli@linux.intel.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/n/tip-uxm3j3w3odglcwhafwq5tjqu@git.kernel.org
This patch moves the 'kvm_pio' tracepoint to emulator_pio_in_emulated()
and emulator_pio_out_emulated(), and it adds an argument (a pointer to
the 'pio_data'). A single 8-bit or 16-bit or 32-bit data item is fetched
from 'pio_data' (depending on 'size'), and the value is included in the
trace record ('val'). If 'count' is greater than one, this is indicated
by the string "(...)" in the trace output.
Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This unifies the vdso mapping code and teaches it how to map special
pages at addresses corresponding to symbols in the vdso image. The
new code is used for all vdso variants, but so far only the 32-bit
variants use the new vvar page position.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/b6d7858ad7b5ac3fd3c29cab6d6d769bc45d195e.1399317206.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Currently, vdso.so files are prepared and analyzed by a combination
of objcopy, nm, some linker script tricks, and some simple ELF
parsers in the kernel. Replace all of that with plain C code that
runs at build time.
All five vdso images now generate .c files that are compiled and
linked in to the kernel image.
This should cause only one userspace-visible change: the loaded vDSO
images are stripped more heavily than they used to be. Everything
outside the loadable segment is dropped. In particular, this causes
the section table and section name strings to be missing. This
should be fine: real dynamic loaders don't load or inspect these
tables anyway. The result is roughly equivalent to eu-strip's
--strip-sections option.
The purpose of this change is to enable the vvar and hpet mappings
to be moved to the page following the vDSO load segment. Currently,
it is possible for the section table to extend into the page after
the load segment, so, if we map it, it risks overlapping the vvar or
hpet page. This happens whenever the load segment is just under a
multiple of PAGE_SIZE.
The only real subtlety here is that the old code had a C file with
inline assembler that did 'call VDSO32_vsyscall' and a linker script
that defined 'VDSO32_vsyscall = __kernel_vsyscall'. This most
likely worked by accident: the linker script entry defines a symbol
associated with an address as opposed to an alias for the real
dynamic symbol __kernel_vsyscall. That caused ld to relocate the
reference at link time instead of leaving an interposable dynamic
relocation. Since the VDSO32_vsyscall hack is no longer needed, I
now use 'call __kernel_vsyscall', and I added -Bsymbolic to make it
work. vdso2c will generate an error and abort the build if the
resulting image contains any dynamic relocations, so we won't
silently generate bad vdso images.
(Dynamic relocations are a problem because nothing will even attempt
to relocate the vdso.)
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/2c4fcf45524162a34d87fdda1eb046b2a5cecee7.1399317206.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This code is used during CPU setup, and it isn't strictly speaking
related to the 32-bit vdso. It's easier to understand how this
works when the code is closer to its callers.
This also lets syscall32_cpu_init be static, which might save some
trivial amount of kernel text.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/4e466987204e232d7b55a53ff6b9739f12237461.1399317206.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The early_ioremap code requires that its buffers not span a PMD
boundary. The logic for ensuring that only works if the fixmap is
aligned, so assert that it's aligned correctly.
To make this work reliably, reserve_top_address needs to be
adjusted.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/e59a5f4362661f75dd4841fa74e1f2448045e245.1399317206.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Note add32_with_carry(a, b) is suboptimal, as it forces
a and b in registers.
b could be a memory or a register operand.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The io_setup takes a pointer to a context id of type aio_context_t.
This in turn is typed to a __kernel_ulong_t. We could tweak the
exported headers to define this as a 64bit quantity for specific
ABIs, but since we already have a 32bit compat shim for the x86 ABI,
let's just re-use that logic. The libaio package is also written to
expect this as a pointer type, so a compat shim would simplify that.
The io_submit func operates on an array of pointers to iocb structs.
Padding out the array to be 64bit aligned is a huge pain, so convert
it over to the existing compat shim too.
We don't convert io_getevents to the compat func as its only purpose
is to handle the timespec struct, and the x32 ABI uses 64bit times.
With this change, the libaio package can now pass its testsuite when
built for the x32 ABI.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Link: http://lkml.kernel.org/r/1399250595-5005-1-git-send-email-vapier@gentoo.org
Cc: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: <stable@vger.kernel.org> # v3.4+
Embedded systems, which may be very memory-size-sensitive, are
extremely unlikely to ever encounter any 16-bit software, so make it
a CONFIG_EXPERT option to turn off support for any 16-bit software
whatsoever.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
of the framebuffer when early_ioremap() is no longer available and
dropping __init from functions that may be invoked after
free_initmem() - Dave Young
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJTZIL0AAoJEC84WcCNIz1Vr9gP/RCHnmo9+w88ujYMjXtoq+/b
qDX/Fl8/as/gJ8cKhOVlQpC/t4VbC28mRkxV3J8NS/AklY0mU2R8TatprIyUoKAI
oPZwdSbuEIS8ehCr/D+6aAIGLtFYaLD8VK27niNHEHVytZytPqQGpDKARgphin5l
AqtEUv9NNfLaN/aHUuMV33xlD4r25BoWlj3RD2h+Rpnu2/vBXs14NTBN1r+SrLFh
r8htTDsbm3NjDCvboYyPJjnFZvlYqxtLCBC2vVD8fBvaXcBmj/vLP6WmFd3sxbTZ
4CLmRMShaqh87JH9gdg0m/xJ5sEgRqqvMiqjcaAuJzAew0eE6gUZjE9+fawWYHwT
XU0kcsM9wn/014f9fUdqaqM38o/XbnVcW+D5iSrwcx6hhNHzf7nFGnSndN2tednQ
k3z3tpX/GB9u5l0064Clru6GbSnV2cSfayaoIc4sULDrp7KBmyrlwBtsQ67C/JfV
0gJ4ridzbFllHBiw3Cyw8vzLDPgQ6t2DGw6RkzUpbMwLZG5YMRcyNODWewcTuH7g
VcMMaDKVw7uCrItFyTscMuUe1nVnbZANdLu9znF8TejgX1MzwwmdetqAE/WPR+3V
vZoYGNE5zAwGhqF34BLSof9BHoeOjucx1qgaV3QYhrdtgtTXaGf++TvwOhpCVNOC
vhUguxcrMLOM68He6o5H
=BzhM
-----END PGP SIGNATURE-----
Merge tag 'efi-urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into x86/urgent
Pull EFI fix from Matt Fleming:
" * Fix earlyprintk=efi,keep support by switching to an ioremap() mapping
of the framebuffer when early_ioremap() is no longer available and
dropping __init from functions that may be invoked after
free_initmem() - Dave Young "
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make espfix64 a hidden Kconfig option. This fixes the x86-64 UML
build which had broken due to the non-existence of init_espfix_bsp()
in UML: since UML uses its own Kconfig, this option does not appear in
the UML build.
This also makes it possible to make support for 16-bit segments a
configuration option, for the people who want to minimize the size of
the kernel.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Richard Weinberger <richard@nod.at>
Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
Pull irq fixes from Thomas Gleixner:
"This udpate delivers:
- A fix for dynamic interrupt allocation on x86 which is required to
exclude the GSI interrupts from the dynamic allocatable range.
This was detected with the newfangled tablet SoCs which have GPIOs
and therefor allocate a range of interrupts. The MSI allocations
already excluded the GSI range, so we never noticed before.
- The last missing set_irq_affinity() repair, which was delayed due
to testing issues
- A few bug fixes for the armada SoC interrupt controller
- A memory allocation fix for the TI crossbar interrupt controller
- A trivial kernel-doc warning fix"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip: irq-crossbar: Not allocating enough memory
irqchip: armanda: Sanitize set_irq_affinity()
genirq: x86: Ensure that dynamic irq allocation does not conflict
linux/interrupt.h: fix new kernel-doc warnings
irqchip: armada-370-xp: Fix releasing of MSIs
irqchip: armada-370-xp: implement the ->check_device() msi_chip operation
irqchip: armada-370-xp: fix invalid cast of signed value into unsigned variable
earlyprintk=efi,keep will cause kernel hangs while freeing initmem like
below:
VFS: Mounted root (ext4 filesystem) readonly on device 254:2.
devtmpfs: mounted
Freeing unused kernel memory: 880K (ffffffff817d4000 - ffffffff818b0000)
It is caused by efi earlyprintk use __init function which will be freed
later. Such as early_efi_write is marked as __init, also it will use
early_ioremap which is init function as well.
To fix this issue, I added early initcall early_efi_map_fb which maps
the whole efi fb for later use. OTOH, adding a wrapper function
early_efi_map which calls early_ioremap before ioremap is available.
With this patch applied efi boot ok with earlyprintk=efi,keep console=efi
Signed-off-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Pull x86 fixes from Peter Anvin:
"Two very small changes: one fix for the vSMP Foundation platform, and
one to help LLVM not choke on options it doesn't understand (although
it probably should)"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vsmp: Fix irq routing
x86: LLVMLinux: Wrap -mno-80387 with cc-option
In __ioremap_caller() (the guts of ioremap), we loop over the range of
pfns being remapped and checks each one individually with page_is_ram().
For large ioremaps, this can be very slow. For example, we have a
device with a 256 GiB PCI BAR, and ioremapping this BAR can take 20+
seconds -- sometimes long enough to trigger the soft lockup detector!
Internally, page_is_ram() calls walk_system_ram_range() on a single
page. Instead, we can make a single call to walk_system_ram_range()
from __ioremap_caller(), and do our further checks only for any RAM
pages that we find. For the common case of MMIO, this saves an enormous
amount of work, since the range being ioremapped doesn't intersect
system RAM at all.
With this change, ioremap on our 256 GiB BAR takes less than 1 second.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Link: http://lkml.kernel.org/r/1399054721-1331-1-git-send-email-roland@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
the merge window.
- A fix from Oleg to async page faults.
- A bunch of small ARM changes.
- A trivial patch to use the new MSI-X API introduced during the merge
window.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJTY5anAAoJEBvWZb6bTYby6EQQAIWbOJCrLO3NQxwE9M7d8YvN
oviaLFv7vJh1vaXVo7SNjBRXTq4pzWrhg9rwWlHBg1KnxJ0/sc9Tn07Fe+0bxWDh
1XIFXNEkO9+Bpl43VnKGC7sbkYE9m3+jpGCWjF01vBCh+BY73wUOsPD0Zw9YQojN
TKBtiQEjb8avuoTUR0JSOTwLZw4DlDRmRLHkNwlqqvbPdvuIWI/LG2wFUvY7/eq8
dWxIPBjLKaIv2aUs9wGNNiz4Kb92uyH5L6bI6SK8VxphRA+51BOjMcBbzdY+Q1XL
c4CTaL9ybAyUi4SRv41qWnM09YbI1FayUW93k9xz/vEplXOHp5R/lyUdZETd/d83
GxaooTLcy9nOYeZ75buiH/0EG5HxI7On/QfUBEE3qIf8KfGgxb479HbRw6RnX4bf
EhQzf7eyZvvk43Xk3OYwq8Ux1SOiXQEo+8TpCSaM/KN57cJbjGB4GCUK6JX8qJCx
7MfXBdrhkAdw5V4lEBQMYKp4pdUdgYKRXavhLevm0qFjX1Swl6LIHxLtjFTKyX9S
Xfxi09J7EUs7SsI35pdlMtPQkklEUXE96S/W3RCEpR+OfgbVMkYkcQI8TGb7ib3l
xLNJrSgFDSlP5F3rN5SYIItAqboXb7iLp7SiF2ByXV43yexIrzTH0bwdwPwpZHhk
2ziVieX5WXEX4tgzZkRj
=+bLo
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
- Fix for a Haswell regression in nested virtualization, introduced
during the merge window.
- A fix from Oleg to async page faults.
- A bunch of small ARM changes.
- A trivial patch to use the new MSI-X API introduced during the merge
window.
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: ARM: vgic: Fix the overlap check action about setting the GICD & GICC base address.
KVM: arm/arm64: vgic: fix GICD_ICFGR register accesses
KVM: async_pf: mm->mm_users can not pin apf->mm
KVM: ARM: vgic: Fix sgi dispatch problem
MAINTAINERS: co-maintainance of KVM/{arm,arm64}
arm: KVM: fix possible misalignment of PGDs and bounce page
KVM: x86: Check for host supported fields in shadow vmcs
kvm: Use pci_enable_msix_exact() instead of pci_enable_msix()
ARM: KVM: disable KVM in Kconfig on big-endian systems
earlyprintk=efi,keep will cause kernel hangs while freeing initmem like
below:
VFS: Mounted root (ext4 filesystem) readonly on device 254:2.
devtmpfs: mounted
Freeing unused kernel memory: 880K (ffffffff817d4000 - ffffffff818b0000)
It is caused by efi earlyprintk use __init function which will be freed
later. Such as early_efi_write is marked as __init, also it will use
early_ioremap which is init function as well.
To fix this issue, I added early initcall early_efi_map_fb which maps
the whole efi fb for later use. OTOH, adding a wrapper function
early_efi_map which calls early_ioremap before ioremap is available.
With this patch applied efi boot ok with earlyprintk=efi,keep console=efi
Signed-off-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Sparse warns that the percpu variables aren't declared before they are
defined. Rather than hacking around it, move espfix definitions into
a proper header file.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
It is not safe to use LAR to filter when to go down the espfix path,
because the LDT is per-process (rather than per-thread) and another
thread might change the descriptors behind our back. Fortunately it
is always *safe* (if a bit slow) to go down the espfix path, and a
32-bit LDT stack segment is extremely rare.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
Cc: <stable@vger.kernel.org> # consider after upstream merge
The IRET instruction, when returning to a 16-bit segment, only
restores the bottom 16 bits of the user space stack pointer. This
causes some 16-bit software to break, but it also leaks kernel state
to user space. We have a software workaround for that ("espfix") for
the 32-bit kernel, but it relies on a nonzero stack segment base which
is not available in 64-bit mode.
In checkin:
b3b42ac2cb x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
we "solved" this by forbidding 16-bit segments on 64-bit kernels, with
the logic that 16-bit support is crippled on 64-bit kernels anyway (no
V86 support), but it turns out that people are doing stuff like
running old Win16 binaries under Wine and expect it to work.
This works around this by creating percpu "ministacks", each of which
is mapped 2^16 times 64K apart. When we detect that the return SS is
on the LDT, we copy the IRET frame to the ministack and use the
relevant alias to return to userspace. The ministacks are mapped
readonly, so if IRET faults we promote #GP to #DF which is an IST
vector and thus has its own stack; we then do the fixup in the #DF
handler.
(Making #GP an IST exception would make the msr_safe functions unsafe
in NMI/MC context, and quite possibly have other effects.)
Special thanks to:
- Andy Lutomirski, for the suggestion of using very small stack slots
and copy (as opposed to map) the IRET frame there, and for the
suggestion to mark them readonly and let the fault promote to #DF.
- Konrad Wilk for paravirt fixup and testing.
- Borislav Petkov for testing help and useful comments.
Reported-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andrew Lutomriski <amluto@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dirk Hohndel <dirk@hohndel.org>
Cc: Arjan van de Ven <arjan.van.de.ven@intel.com>
Cc: comex <comexk@gmail.com>
Cc: Alexander van Heukelum <heukelum@fastmail.fm>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: <stable@vger.kernel.org> # consider after upstream merge
Ignoring the "correction" logic riprel_pre_xol() and riprel_post_xol()
are very similar but look quite differently.
1. Add the "UPROBE_FIX_RIP_AX | UPROBE_FIX_RIP_CX" check at the start
of riprel_pre_xol(), like the same check in riprel_post_xol().
2. Add the trivial scratch_reg() helper which returns the address of
scratch register pre_xol/post_xol need to change.
3. Change these functions to use the new helper and avoid copy-and-paste
under if/else branches.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
default_pre_xol_op() passes ¤t->utask->autask to riprel_pre_xol()
and this is just ugly because it still needs to load current->utask to
read ->vaddr.
Remove this argument, change riprel_pre_xol() to use current->utask.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
handle_riprel_insn(), pre_xol_rip_insn() and handle_riprel_post_xol()
look confusing and inconsistent. Rename them into riprel_analyze(),
riprel_pre_xol(), and riprel_post_xol() respectively.
No changes in compiled code.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Now that UPROBE_FIX_IP/UPROBE_FIX_CALL are mutually exclusive we can
use a single "fix_ip_or_call" enum instead of 2 fix_* booleans. This
way the logic looks more understandable and clean to me.
While at it, join "case 0xea" with other "ip is correct" ret/lret cases.
Also change default_post_xol_op() to use "else if" for the same reason.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
The only insn which could have both UPROBE_FIX_IP and UPROBE_FIX_CALL
was 0xe8 "call relative", and now it is handled by branch_xol_ops.
So we can change default_post_xol_op(UPROBE_FIX_CALL) to simply push
the address of next insn == utask->vaddr + insn.length, just we need
to record insn.length into the new auprobe->def.ilen member.
Note: if/when we teach branch_xol_ops to support jcxz/loopz we can
remove the "correction" logic, UPROBE_FIX_IP can use the same address.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Extract the "push return address" code from branch_emulate_op() into
the new simple helper, push_ret_address(). It will have more users.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
handle_riprel_insn() assumes that nobody else could modify ->fixups
before. This is correct but fragile, change it to use "|=".
Also make ->fixups u8, we are going to add the new members into the
union. It is not clear why UPROBE_FIX_RIP_.X lived in the upper byte,
redefine them so that they can fit into u8.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Finally we can move arch_uprobe->fixups/rip_rela_target_address
into the new "def" struct and place this struct in the union, they
are only used by default_xol_ops paths.
The patch also renames rip_rela_target_address to riprel_target just
to make this name shorter.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
UPROBE_FIX_SETF is only needed to handle "popf" correctly but it is
processed by the generic arch_uprobe_post_xol() code. This doesn't
allows us to make ->fixups private for default_xol_ops.
1 Change default_post_xol_op(UPROBE_FIX_SETF) to set ->saved_tf = T.
"popf" always reads the flags from stack, it doesn't matter if TF
was set or not before single-step. Ignoring the naming, this is
even more logical, "saved_tf" means "owned by application" and we
do not own this flag after "popf".
2. Change arch_uprobe_post_xol() to save ->saved_tf into the local
"bool send_sigtrap" before ->post_xol().
3. Change arch_uprobe_post_xol() to ignore UPROBE_FIX_SETF and just
check ->saved_tf after ->post_xol().
With this patch ->fixups and ->rip_rela_target_address are only used
by default_xol_ops hooks, we are ready to remove them from the common
part of arch_uprobe.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
014940bad8 "uprobes/x86: Send SIGILL if arch_uprobe_post_xol() fails"
changed arch_uprobe_post_xol() to use arch_uprobe_abort_xol() if ->post_xol
fails. This was correct and helped to avoid the additional complications,
we need to clear X86_EFLAGS_TF in this case.
However, now that we have uprobe_xol_ops->abort() hook it would be better
to avoid arch_uprobe_abort_xol() here. ->post_xol() should likely do what
->abort() does anyway, we should not do the same work twice. Currently only
handle_riprel_post_xol() can be called twice, this is unnecessary but safe.
Still this is not clean and can lead to the problems in future.
Change arch_uprobe_post_xol() to clear X86_EFLAGS_TF and restore ->ip by
hand and avoid arch_uprobe_abort_xol(). This temporary uglifies the usage
of autask.saved_tf, we will cleanup this later.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
arch_uprobe_abort_xol() calls handle_riprel_post_xol() even if
auprobe->ops != default_xol_ops. This is fine correctness wise, only
default_pre_xol_op() can set UPROBE_FIX_RIP_AX|UPROBE_FIX_RIP_CX and
otherwise handle_riprel_post_xol() is nop.
But this doesn't look clean and this doesn't allow us to move ->fixups
into the union in arch_uprobe. Move this handle_riprel_post_xol() call
into the new default_abort_op() hook and change arch_uprobe_abort_xol()
accordingly.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Currently this doesn't matter, the only ->pre_xol() hook can't fail,
but we need to fix arch_uprobe_pre_xol() anyway. If ->pre_xol() fails
we should not change regs->ip/flags, we should just return the error
to make restart actually possible.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
is_64bit_mm() assumes that mm->context.ia32_compat means the 32-bit
instruction set, this is not true if the task is TIF_X32.
Change set_personality_ia32() to initialize mm->context.ia32_compat
by TIF_X32 or TIF_IA32 instead of 1. This allows to fix is_64bit_mm()
without affecting other users, they all treat ia32_compat as "bool".
TIF_ in ->ia32_compat looks a bit strange, but this is grep-friendly
and avoids the new define's.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Add the suitable ifdef's around good_insns_* arrays. We do not want
to add the ugly ifdef's into their only user, uprobe_init_insn(), so
the "#else" branch simply defines them as NULL. This doesn't generate
the extra code, gcc is smart enough, although the code is fine even if
it could not detect that (without CONFIG_IA32_EMULATION) is_64bit_mm()
is __builtin_constant_p().
The patch looks more complicated because it also moves good_insns_64
up close to good_insns_32.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Change uprobe_init_insn() to make insn_complete() == T, this makes
other insn_get_*() calls unnecessary.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
1. Extract the ->ia32_compat check from 64bit validate_insn_bits()
into the new helper, is_64bit_mm(), it will have more users.
TODO: this checks is actually wrong if mm owner is X32 task,
we need another fix which changes set_personality_ia32().
TODO: even worse, the whole 64-or-32-bit logic is very broken
and the fix is not simple, we need the nontrivial changes in
the core uprobes code.
2. Kill validate_insn_bits() and change its single caller to use
uprobe_init_insn(is_64bit_mm(mm).
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
validate_insn_32bits() and validate_insn_64bits() are very similar,
turn them into the single uprobe_init_insn() which has the additional
"bool x86_64" argument which can be passed to insn_init() and used to
choose between good_insns_64/good_insns_32.
Also kill UPROBE_FIX_NONE, it has no users.
Note: the current code doesn't use ifdef's consistently, good_insns_64
depends on CONFIG_X86_64 but good_insns_32 is unconditional. This patch
removes ifdef around good_insns_64, we will add it back later along with
the similar one for good_insns_32.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
All branch insns on x86 can be prefixed with the operand-size
override prefix, 0x66. It was only ever useful for performing
jumps to 32-bit offsets in 16-bit code segments.
In 32-bit code, such instructions are useless since
they cause IP truncation to 16 bits, and in case of call insns,
they save only 16 bits of return address and misalign
the stack pointer as a "bonus".
In 64-bit code, such instructions are treated differently by Intel
and AMD CPUs: Intel ignores the prefix altogether,
AMD treats them the same as in 32-bit mode.
Before this patch, the emulation code would execute
the instructions as if they have no 0x66 prefix.
With this patch, we refuse to attach uprobes to such insns.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Acked-by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Remove the direct accesses to FDT header data using accessor
function instead. This makes the code more readable and makes the FDT
blob structure more opaque to the arch code. This also prepares for
removing struct boot_param_header completely.
Signed-off-by: Rob Herring <robh@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Tested-by: Grant Likely <grant.likely@linaro.org>
Invariant TSC is a property of TSC, no additional
support code necessary.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On x86 the allocation of irq descriptors may allocate interrupts which
are in the range of the GSI interrupts. That's wrong as those
interrupts are hardwired and we don't have the irq domain translation
like PPC. So one of these interrupts can be hooked up later to one of
the devices which are hard wired to it and the io_apic init code for
that particular interrupt line happily reuses that descriptor with a
completely different configuration so hell breaks lose.
Inside x86 we allocate dynamic interrupts from above nr_gsi_irqs,
except for a few usage sites which have not yet blown up in our face
for whatever reason. But for drivers which need an irq range, like the
GPIO drivers, we have no limit in place and we don't want to expose
such a detail to a driver.
To cure this introduce a function which an architecture can implement
to impose a lower bound on the dynamic interrupt allocations.
Implement it for x86 and set the lower bound to nr_gsi_irqs, which is
the end of the hardwired interrupt space, so all dynamic allocations
happen above.
That not only allows the GPIO driver to work sanely, it also protects
the bogus callsites of create_irq_nr() in hpet, uv, irq_remapping and
htirq code. They need to be cleaned up as well, but that's a separate
issue.
Reported-by: Jin Yao <yao.jin@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: Mathias Nyman <mathias.nyman@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Krogerus Heikki <heikki.krogerus@intel.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1404241617360.28206@ionos.tec.linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We track shadow vmcs fields through two static lists,
one for read only and another for r/w fields. However, with
addition of new vmcs fields, not all fields may be supported on
all hosts. If so, copy_vmcs12_to_shadow() trying to vmwrite on
unsupported hosts will result in a vmwrite error. For example, commit
36be0b9deb introduced GUEST_BNDCFGS, which is not supported
by all processors. Filter out host unsupported fields before
letting guests use shadow vmcs
Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Correct IRQ routing in case a vSMP box is detected
but the Interrupt Routing Comply (IRC) value is set to
"comply", which leads to incorrect IRQ routing.
Before the patch:
When a vSMP box was detected and IRC was set to "comply",
users (and the kernel) couldn't effectively set the
destination of the IRQs. This is because the hook inside
vsmp_64.c always setup all CPUs as the IRQ destination using
cpumask_setall() as the return value for IRQ allocation mask.
Later, this "overrided" mask caused the kernel to set the IRQ
destination to the lowest online CPU in the mask (CPU0 usually).
After the patch:
When the IRC is set to "comply", users (and the kernel) can control
the destination of the IRQs as we will not be changing the
default "apic->vector_allocation_domain".
Signed-off-by: Oren Twaig <oren@scalemp.com>
Acked-by: Shai Fultheim <shai@scalemp.com>
Link: http://lkml.kernel.org/r/1398669697-2123-1-git-send-email-oren@scalemp.com
[ Minor readability edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Bodo reported that on the Asrock M3A UCC, v3.12.6 hangs during boot unless
he uses "pci=nocrs". This regression was caused by 7bc5e3f2be ("x86/PCI:
use host bridge _CRS info by default on 2008 and newer machines"), which
appeared in v2.6.34.
The reason is that the HPET address appears in a PCI device BAR, and this
address is not contained in any of the host bridge windows. Linux moves
the PCI BAR into a window, but the original address was published via the
HPET table and an ACPI device, so changing the BAR is a bad idea. Here's
the dmesg info:
ACPI: HPET id: 0x43538301 base: 0xfed00000
pci_root PNP0A03:00: host bridge window [mem 0xd0000000-0xdfffffff]
pci_root PNP0A03:00: host bridge window [mem 0xf0000000-0xfebfffff]
pci 0000:00:14.0: [1002:4385] type 0 class 0x000c05
pci 0000:00:14.0: reg 14: [mem 0xfed00000-0xfed003ff]
hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0, 0
pnp 00:06: Plug and Play ACPI device, IDs PNP0103 (active)
pnp 00:06: [mem 0xfed00000-0xfed003ff]
When we notice the BAR is not in a host bridge window, we try to move it,
but that causes a hang shortly thereafter:
pci 0000:00:14.0: no compatible bridge window for [mem 0xfed00000-0xfed003ff]
pci 0000:00:14.0: BAR 1: assigned [mem 0xf0000000-0xf00003ff]
This patch marks the BAR as IORESOURCE_PCI_FIXED to prevent Linux from
moving it. This depends on a previous patch ("x86/PCI: Don't try to move
IORESOURCE_PCI_FIXED resources") to check for this flag when
pci_claim_resource() fails.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=68591
Reported-and-tested-by: Bodo Eggert <7eggert@gmx.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Don't attempt to move resource marked IORESOURCE_PCI_FIXED, even if
pci_claim_resource() fails. In some cases, these are legacy resources that
cannot be moved.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
In the expression "word1 << 16", word1 starts as u16, but is promoted to
a signed int, then sign-extended to resource_size_t, which is probably
not what was intended. Cast to resource_size_t to avoid the sign
extension.
Found by Coverity (CID 138749, 138750).
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
In preparation to support FIX_EARLYCON_MEM on other arches, make the
option per arch.
Signed-off-by: Rob Herring <robh@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As part of this make the usual change to xen_ulong_t in place of unsigned long.
This change has no impact on x86.
The Linux definition of struct multicall_entry.result differs from the Xen
definition, I think for good reasons, and used a long rather than an unsigned
long. Therefore introduce a xen_long_t, which is a long on x86 architectures
and a signed 64-bit integer on ARM.
Use uint32_t nr_calls on x86 for consistency with the ARM definition.
Build tested on amd64 and i386 builds. Runtime tested on ARM.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Use NOKPROBE_SYMBOL macro for protecting functions
from kprobes instead of __kprobes annotation under
arch/x86.
This applies nokprobe_inline annotation for some cases,
because NOKPROBE_SYMBOL() will inhibit inlining by
referring the symbol address.
This just folds a bunch of previous NOKPROBE_SYMBOL()
cleanup patches for x86 to one patch.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Link: http://lkml.kernel.org/r/20140417081814.26341.51656.stgit@ltc230.yrl.intra.hitachi.co.jp
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fernando Luis Vázquez Cao <fernando_b1@lab.ntt.co.jp>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Lebon <jlebon@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matt Fleming <matt.fleming@intel.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Allow kprobes on text_poke/hw_breakpoint because
those are not related to the critical int3-debug
recursive path of kprobes at this moment.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Link: http://lkml.kernel.org/r/20140417081807.26341.73219.stgit@ltc230.yrl.intra.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is no need to prohibit probing on the functions
used in preparation phase. Those are safely probed because
those are not invoked from breakpoint/fault/debug handlers,
there is no chance to cause recursive exceptions.
Following functions are now removed from the kprobes blacklist:
can_boost
can_probe
can_optimize
is_IF_modifier
__copy_instruction
copy_optimized_instructions
arch_copy_kprobe
arch_prepare_kprobe
arch_arm_kprobe
arch_disarm_kprobe
arch_remove_kprobe
arch_trampoline_kprobe
arch_prepare_kprobe_ftrace
arch_prepare_optimized_kprobe
arch_check_optimized_kprobe
arch_within_optimized_kprobe
__arch_remove_optimized_kprobe
arch_remove_optimized_kprobe
arch_optimize_kprobes
arch_unoptimize_kprobe
I tested those functions by putting kprobes on all
instructions in the functions with the bash script
I sent to LKML. See:
https://lkml.org/lkml/2014/3/27/33
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jonathan Lebon <jlebon@redhat.com>
Link: http://lkml.kernel.org/r/20140417081747.26341.36065.stgit@ltc230.yrl.intra.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move exception_enter() call after kprobes handler
is done. Since the exception_enter() involves
many other functions (like printk), it can cause
recursive int3/break loop when kprobes probe such
functions.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Kees Cook <keescook@chromium.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Link: http://lkml.kernel.org/r/20140417081740.26341.10894.stgit@ltc230.yrl.intra.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
To avoid a kernel crash by probing on lockdep code, call
kprobe_int3_handler() and kprobe_debug_handler()(which was
formerly called post_kprobe_handler()) directly from
do_int3 and do_debug.
Currently kprobes uses notify_die() to hook the int3/debug
exceptoins. Since there is a locking code in notify_die,
the lockdep code can be invoked. And because the lockdep
involves printk() related things, theoretically, we need to
prohibit probing on such code, which means much longer blacklist
we'll have. Instead, hooking the int3/debug for kprobes before
notify_die() can avoid this problem.
Anyway, most of the int3 handlers in the kernel are already
called from do_int3 directly, e.g. ftrace_int3_handler,
poke_int3_handler, kgdb_ll_trap. Actually only
kprobe_exceptions_notify is on the notifier_call_chain.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jonathan Lebon <jlebon@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Link: http://lkml.kernel.org/r/20140417081733.26341.24423.stgit@ltc230.yrl.intra.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
thunk/restore functions are also used for tracing irqoff etc.
and those are involved in kprobe's exception handling.
Prohibit probing on them to avoid kernel crash.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140417081726.26341.3872.stgit@ltc230.yrl.intra.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since the kprobes uses do_debug for single stepping,
functions called from do_debug() before notify_die() must not
be probed.
And also native_load_idt() is called from paranoid_exit when
returning int3, this also must not be probed.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Alok Kataria <akataria@vmware.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: virtualization@lists.linux-foundation.org
Link: http://lkml.kernel.org/r/20140417081719.26341.65542.stgit@ltc230.yrl.intra.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Prohibit probing on debug_stack_reset and debug_stack_set_zero.
Since the both functions are called from TRACE_IRQS_ON/OFF_DEBUG
macros which run in int3 ist entry, probing it may cause a soft
lockup.
This happens when the kernel built with CONFIG_DYNAMIC_FTRACE=y
and CONFIG_TRACE_IRQFLAGS=y.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Link: http://lkml.kernel.org/r/20140417081712.26341.32994.stgit@ltc230.yrl.intra.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Introduce NOKPROBE_SYMBOL() macro which builds a kprobes
blacklist at kernel build time.
The usage of this macro is similar to EXPORT_SYMBOL(),
placed after the function definition:
NOKPROBE_SYMBOL(function);
Since this macro will inhibit inlining of static/inline
functions, this patch also introduces a nokprobe_inline macro
for static/inline functions. In this case, we must use
NOKPROBE_SYMBOL() for the inline function caller.
When CONFIG_KPROBES=y, the macro stores the given function
address in the "_kprobe_blacklist" section.
Since the data structures are not fully initialized by the
macro (because there is no "size" information), those
are re-initialized at boot time by using kallsyms.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Link: http://lkml.kernel.org/r/20140417081705.26341.96719.stgit@ltc230.yrl.intra.hitachi.co.jp
Cc: Alok Kataria <akataria@vmware.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christopher Li <sparse@chrisli.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jan-Simon Möller <dl9pf@gmx.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: linux-arch@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-sparse@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
.entry.text is a code area which is used for interrupt/syscall
entries, which includes many sensitive code.
Thus, it is better to prohibit probing on all of such code
instead of a part of that.
Since some symbols are already registered on kprobe blacklist,
this also removes them from the blacklist.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jonathan Lebon <jlebon@redhat.com>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Link: http://lkml.kernel.org/r/20140417081658.26341.57354.stgit@ltc230.yrl.intra.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since the NMI handlers(e.g. perf) can interrupt in the
single stepping (or preparing the single stepping, do_debug
etc.), we should consider a kprobe is hit in the NMI
handler. Even in that case, the kprobe is allowed to be
reentered as same as the kprobes hit in kprobe handlers
(KPROBE_HIT_ACTIVE or KPROBE_HIT_SSDONE).
The real issue will happen when a kprobe hit while another
reentered kprobe is processing (KPROBE_REENTER), because
we already consumed a saved-area for the previous kprobe.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jonathan Lebon <jlebon@redhat.com>
Link: http://lkml.kernel.org/r/20140417081651.26341.10593.stgit@ltc230.yrl.intra.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch fixes a bug introduced by:
2422365780 ("perf/x86/intel: Use rdmsrl_safe() when initializing RAPL PMU")
The rdmsrl_safe() function returns 0 on success.
The current code was failing to detect the RAPL PMU
on real hardware (missing /sys/devices/power) because
the return value of rdmsrl_safe() was misinterpreted.
Signed-off-by: Stephane Eranian <eranian@google.com>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Venkatesh Srinivas <venkateshs@google.com>
Cc: peterz@infradead.org
Cc: zheng.z.yan@intel.com
Link: http://lkml.kernel.org/r/20140423170418.GA12767@quad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now we can flush all the TLBs out of the mmu lock without TLB corruption when
write-proect the sptes, it is because:
- we have marked large sptes readonly instead of dropping them that means we
just change the spte from writable to readonly so that we only need to care
the case of changing spte from present to present (changing the spte from
present to nonpresent will flush all the TLBs immediately), in other words,
the only case we need to care is mmu_spte_update()
- in mmu_spte_update(), we haved checked
SPTE_HOST_WRITEABLE | PTE_MMU_WRITEABLE instead of PT_WRITABLE_MASK, that
means it does not depend on PT_WRITABLE_MASK anymore
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Relax the tlb flush condition since we will write-protect the spte out of mmu
lock. Note lockless write-protection only marks the writable spte to readonly
and the spte can be writable only if both SPTE_HOST_WRITEABLE and
SPTE_MMU_WRITEABLE are set (that are tested by spte_is_locklessly_modifiable)
This patch is used to avoid this kind of race:
VCPU 0 VCPU 1
lockless wirte protection:
set spte.w = 0
lock mmu-lock
write protection the spte to sync shadow page,
see spte.w = 0, then without flush tlb
unlock mmu-lock
!!! At this point, the shadow page can still be
writable due to the corrupt tlb entry
Flush all TLB
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Currently, kvm zaps the large spte if write-protected is needed, the later
read can fault on that spte. Actually, we can make the large spte readonly
instead of making them un-present, the page fault caused by read access can
be avoided
The idea is from Avi:
| As I mentioned before, write-protecting a large spte is a good idea,
| since it moves some work from protect-time to fault-time, so it reduces
| jitter. This removes the need for the return value.
This version has fixed the issue reported in 6b73a9606, the reason of that
issue is that fast_page_fault() directly sets the readonly large spte to
writable but only dirty the first page into the dirty-bitmap that means
other pages are missed. Fixed it by only the normal sptes (on the
PT_PAGE_TABLE_LEVEL level) can be fast fixed
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Using sp->role.level instead of @level since @level is not got from the
page table hierarchy
There is no issue in current code since the fast page fault currently only
fixes the fault caused by dirty-log that is always on the last level
(level = 1)
This patch makes the code more readable and avoids potential issue in the
further development
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This reverts commit 5befdc385d.
Since we will allow flush tlb out of mmu-lock in the later
patch
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
If EFER.LMA is off, cs.l does not determine execution mode.
Currently, the emulation engine assumes differently.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The IN instruction is not be affected by REP-prefix as INS is. Therefore, the
emulation should ignore the REP prefix as well. The current emulator
implementation tries to perform writeback when IN instruction with REP-prefix
is emulated. This causes it to perform wrong memory write or spurious #GP
exception to be injected to the guest.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
According to Intel specifications, PAE and non-PAE does not have any reserved
bits. In long-mode, regardless to PCIDE, only the high bits (above the
physical address) are reserved.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
If a guest enables a performance counter but does not enable PMI, the
hypervisor currently does not reprogram the performance counter once it
overflows. As a result the host performance counter is kept with the original
sampling period which was configured according to the value of the guest's
counter when the counter was enabled.
Such behaviour can cause very bad consequences. The most distrubing one can
cause the guest not to make any progress at all, and keep exiting due to host
PMI before any guest instructions is exeucted. This situation occurs when the
performance counter holds a very high value when the guest enables the
performance counter. As a result the host's sampling period is configured to be
very short. The host then never reconfigures the sampling period and get stuck
at entry->PMI->exit loop. We encountered such a scenario in our experiments.
The solution is to reprogram the counter even if the guest does not use PMI.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Some Type 1 hypervisors such as XEN won't enable VMX without it present
Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This feature emulates the "Acknowledge interrupt on exit" behavior.
We can safely emulate it for L1 to run L2 even if L0 itself has it
disabled (to run L1).
Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
For single context invalidation, we fall through to global
invalidation in handle_invept() except for one case - when
the operand supplied by L1 is different from what we have in
vmcs12. However, typically hypervisors will only call invept
for the currently loaded eptp, so the condition will
never be true.
Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
When entering an exception after an ICEBP, the saved instruction
pointer should point to after the instruction.
This fixes the bug here: https://bugs.launchpad.net/qemu/+bug/1119686
Signed-off-by: Huw Davies <huw@codeweavers.com>
Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Pull x86 vdso fix from Peter Anvin:
"This is a single build fix for building with gold as opposed to GNU
ld. It got queued up separately and was expected to be pushed during
the merge window, but it got left behind"
* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, vdso: Make the vdso linker script compatible with Gold
According to Intel specifications, only general purpose registers and segment
selectors should be saved in the old TSS during 32-bit task-switch.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The colon at the end of the printk message suggests that it should get printed
before the details printed by ftrace_bug().
When touching the line, let's use the preferred pr_warn() macro as suggested
by checkpatch.pl.
Link: http://lkml.kernel.org/r/1392650573-3390-5-git-send-email-pmladek@suse.cz
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull x86 fix from Ingo Molnar:
"This fixes the preemption-count imbalance crash reported by Owen
Kibel"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Fix CMCI preemption bugs
x86 is strongly ordered and all its atomic ops imply a full barrier.
Implement the two new primitives as the old ones were.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/n/tip-knswsr5mldkr0w1lrdxvc81w@git.kernel.org
Cc: Dave Jones <davej@redhat.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
CPUs which should support the RAPL counters according to
Family/Model/Stepping may still issue #GP when attempting to access
the RAPL MSRs. This may happen when Linux is running under KVM and
we are passing-through host F/M/S data, for example. Use rdmsrl_safe
to first access the RAPL_POWER_UNIT MSR; if this fails, do not
attempt to use this PMU.
Signed-off-by: Venkatesh Srinivas <venkateshs@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1394739386-22260-1-git-send-email-venkateshs@google.com
Cc: zheng.z.yan@intel.com
Cc: eranian@google.com
Cc: ak@linux.intel.com
Cc: linux-kernel@vger.kernel.org
[ The patch also silently fixes another bug: rapl_pmu_init() didn't handle the memory alloc failure case previously. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Change branch_setup_xol_ops() to simply use opc1 = OPCODE2(insn) - 0x10
if OPCODE1() == 0x0f; this matches the "short" jmp which checks the same
condition.
Thanks to lib/insn.c, it does the rest correctly. branch->ilen/offs are
correct no matter if this jmp is "near" or "short".
Reported-by: Jonathan Lebon <jlebon@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Teach branch_emulate_op() to emulate the conditional "short" jmp's which
check regs->flags.
Note: this doesn't support jcxz/jcexz, loope/loopz, and loopne/loopnz.
They all are rel8 and thus they can't trigger the problem, but perhaps
we will add the support in future just for completeness.
Reported-by: Jonathan Lebon <jlebon@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
See the previous "Emulate unconditional relative jmp's" which explains
why we can not execute "jmp" out-of-line, the same applies to "call".
Emulating of rip-relative call is trivial, we only need to additionally
push the ret-address. If this fails, we execute this instruction out of
line and this should trigger the trap, the probed application should die
or the same insn will be restarted if a signal handler expands the stack.
We do not even need ->post_xol() for this case.
But there is a corner (and almost theoretical) case: another thread can
expand the stack right before we execute this insn out of line. In this
case it hit the same problem we are trying to solve. So we simply turn
the probed insn into "call 1f; 1:" and add ->post_xol() which restores
->sp and restarts.
Many thanks to Jonathan who finally found the standalone reproducer,
otherwise I would never resolve the "random SIGSEGV's under systemtap"
bug-report. Now that the problem is clear we can write the simplified
test-case:
void probe_func(void), callee(void);
int failed = 1;
asm (
".text\n"
".align 4096\n"
".globl probe_func\n"
"probe_func:\n"
"call callee\n"
"ret"
);
/*
* This assumes that:
*
* - &probe_func = 0x401000 + a_bit, aligned = 0x402000
*
* - xol_vma->vm_start = TASK_SIZE_MAX - PAGE_SIZE = 0x7fffffffe000
* as xol_add_vma() asks; the 1st slot = 0x7fffffffe080
*
* so we can target the non-canonical address from xol_vma using
* the simple math below, 100 * 4096 is just the random offset
*/
asm (".org . + 0x800000000000 - 0x7fffffffe080 - 5 - 1 + 100 * 4096\n");
void callee(void)
{
failed = 0;
}
int main(void)
{
probe_func();
return failed;
}
It SIGSEGV's if you probe "probe_func" (although this is not very reliable,
randomize_va_space/etc can change the placement of xol area).
Note: as Denys Vlasenko pointed out, amd and intel treat "callw" (0x66 0xe8)
differently. This patch relies on lib/insn.c and thus implements the intel's
behaviour: 0x66 is simply ignored. Fortunately nothing sane should ever use
this insn, so we postpone the fix until we decide what should we do; emulate
or not, support or not, etc.
Reported-by: Jonathan Lebon <jlebon@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Finally we can kill the ugly (and very limited) code in __skip_sstep().
Just change branch_setup_xol_ops() to treat "nop" as jmp to the next insn.
Thanks to lib/insn.c, it is clever enough. OPCODE1() == 0x90 includes
"(rep;)+ nop;" at least, and (afaics) much more.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Currently we always execute all insns out-of-line, including relative
jmp's and call's. This assumes that even if regs->ip points to nowhere
after the single-step, default_post_xol_op(UPROBE_FIX_IP) logic will
update it correctly.
However, this doesn't work if this regs->ip == xol_vaddr + insn_offset
is not canonical. In this case CPU generates #GP and general_protection()
kills the task which tries to execute this insn out-of-line.
Now that we have uprobe_xol_ops we can teach uprobes to emulate these
insns and solve the problem. This patch adds branch_xol_ops which has
a single branch_emulate_op() hook, so far it can only handle rel8/32
relative jmp's.
TODO: move ->fixup into the union along with rip_rela_target_address.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Jonathan Lebon <jlebon@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
1. Add the trivial sizeof_long() helper and change other callers of
is_ia32_task() to use it.
TODO: is_ia32_task() is not what we actually want, TS_COMPAT does
not necessarily mean 32bit. Fortunately syscall-like insns can't be
probed so it actually works, but it would be better to rename and
use is_ia32_frame().
2. As Jim pointed out "ncopied" in arch_uretprobe_hijack_return_addr()
and adjust_ret_addr() should be named "nleft". And in fact only the
last copy_to_user() in arch_uretprobe_hijack_return_addr() actually
needs to inspect the non-zero error code.
TODO: adjust_ret_addr() should die. We can always calculate the value
we need to write into *regs->sp, just UPROBE_FIX_CALL should record
insn->length.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
SIGILL after the failed arch_uprobe_post_xol() should only be used as
a last resort, we should try to restart the probed insn if possible.
Currently only adjust_ret_addr() can fail, and this can only happen if
another thread unmapped our stack after we executed "call" out-of-line.
Most probably the application if buggy, but even in this case it can
have a handler for SIGSEGV/etc. And in theory it can be even correct
and do something non-trivial with its memory.
Of course we can't restart unconditionally, so arch_uprobe_post_xol()
does this only if ->post_xol() returns -ERESTART even if currently this
is the only possible error.
default_post_xol_op(UPROBE_FIX_CALL) can always restart, but as Jim
pointed out it should not forget to pop off the return address pushed
by this insn executed out-of-line.
Note: this is not "perfect", we do not want the extra handler_chain()
after restart, but I think this is the best solution we can realistically
do without too much uglifications.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Currently the error from arch_uprobe_post_xol() is silently ignored.
This doesn't look good and this can lead to the hard-to-debug problems.
1. Change handle_singlestep() to loudly complain and send SIGILL.
Note: this only affects x86, ppc/arm can't fail.
2. Change arch_uprobe_post_xol() to call arch_uprobe_abort_xol() and
avoid TF games if it is going to return an error.
This can help to to analyze the problem, if nothing else we should
not report ->ip = xol_slot in the core-file.
Note: this means that handle_riprel_post_xol() can be called twice,
but this is fine because it is idempotent.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
arch_uprobe_analyze_insn() calls handle_riprel_insn() at the start,
but only "0xff" and "default" cases need the UPROBE_FIX_RIP_ logic.
Move the callsite into "default" case and change the "0xff" case to
fall-through.
We are going to add the various hooks to handle the rip-relative
jmp/call instructions (and more), we need this change to enforce the
fact that the new code can not conflict with is_riprel_insn() logic
which, after this change, can only be used by default_xol_ops.
Note: arch_uprobe_abort_xol() still calls handle_riprel_post_xol()
directly. This is fine unless another _xol_ops we may add later will
need to reuse "UPROBE_FIX_RIP_AX|UPROBE_FIX_RIP_CX" bits in ->fixup.
In this case we can add uprobe_xol_ops->abort() hook, which (perhaps)
we will need anyway in the long term.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Introduce arch_uprobe->ops pointing to the "struct uprobe_xol_ops",
move the current UPROBE_FIX_{RIP*,IP,CALL} code into the default
set of methods and change arch_uprobe_pre/post_xol() accordingly.
This way we can add the new uprobe_xol_ops's to handle the insns
which need the special processing (rip-relative jmp/call at least).
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
No functional changes. Preparation to simplify the review of the next
change. Just reorder the code in arch_uprobe_pre/post_xol() functions
so that UPROBE_FIX_{RIP_*,IP,CALL} logic goes to the end.
Also change arch_uprobe_pre_xol() to use utask instead of autask, to
make the code more symmetrical with arch_uprobe_post_xol().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cosmetic. Move pre_xol_rip_insn() and handle_riprel_post_xol() up to
the closely related handle_riprel_insn(). This way it is simpler to
read and understand this code, and this lessens the number of ifdef's.
While at it, update the comment in handle_riprel_post_xol() as Jim
suggested.
TODO: rename them somehow to make the naming consistent.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Kill the "mm->context.ia32_compat" check in handle_riprel_insn(), if
it is true insn_rip_relative() must return false. validate_insn_bits()
passed "ia32_compat" as !x86_64 to insn_init(), and insn_rip_relative()
checks insn->x86_64.
Also, remove the no longer needed "struct mm_struct *mm" argument and
the unnecessary "return" at the end.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
No functional changes, preparation.
Shift the code from prepare_fixups() to arch_uprobe_analyze_insn()
with the following modifications:
- Do not call insn_get_opcode() again, it was already called
by validate_insn_bits().
- Move "case 0xea" up. This way "case 0xff" can fall through
to default case.
- change "case 0xff" to use the nested "switch (MODRM_REG)",
this way the code looks a bit simpler.
- Make the comments look consistent.
While at it, kill the initialization of rip_rela_target_address and
->fixups, we can rely on kzalloc(). We will add the new members into
arch_uprobe, it would be better to assume that everything is zero by
default.
TODO: cleanup/fix the mess in validate_insn_bits() paths:
- validate_insn_64bits() and validate_insn_32bits() should be
unified.
- "ifdef" is not used consistently; if good_insns_64 depends
on CONFIG_X86_64, then probably good_insns_32 should depend
on CONFIG_X86_32/EMULATION
- the usage of mm->context.ia32_compat looks wrong if the task
is TIF_X32.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
- Fix completely broken 32-bit PV guests caused by x86 refactoring
32-bit thread_info.
- Only enable ticketlock slow path on Xen (not bare metal).
- Fix two bugs with PV guests not shutting down when requested.
- Fix a minor memory leak in xen-pciback error path.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
iQEcBAABAgAGBQJTT/hQAAoJEFxbo/MsZsTR6sMIAJs7mJXSqDQn3Z8O+TemRa53
p92ZomTNYALjUMglXcxJ2Zua6IsZMWdu7jcV1GoXC70V4YLmUs8KaBgZmI5ayUQy
bBpK+6WIAJyBkJdNH5fK3wggJ2UZjw0/twPNgd9gACwjUiYhx8iHN/hTGvu4qPBJ
MGAIlg6wdnGwRydi72uk9Am/xpebEdQy4DRD20vjwA/qUkT4uHVv/AA4hc4AK29w
ToK8qFSisgAlahcmq8/T4+OBFEKz78b9dQcdsGWyAk0ofWILfwD1l53xhzUin25s
JUVevWhhLCKRZBOq4Ykc5qyqnLff4m56rm/THQ6f0oRdJn/OR+SWOImda2Qqmvs=
=Gxpq
-----END PGP SIGNATURE-----
Merge tag 'stable/for-linus-3.15-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull Xen fixes from David Vrabel:
"Xen regression and bug fixes for 3.15-rc1:
- fix completely broken 32-bit PV guests caused by x86 refactoring
32-bit thread_info.
- only enable ticketlock slow path on Xen (not bare metal)
- fix two bugs with PV guests not shutting down when requested
- fix a minor memory leak in xen-pciback error path"
* tag 'stable/for-linus-3.15-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/manage: Poweroff forcefully if user-space is not yet up.
xen/xenbus: Avoid synchronous wait on XenBus stalling shutdown/restart.
xen/spinlock: Don't enable them unconditionally.
xen-pciback: silence an unwanted debug printk
xen: fix memory leak in __xen_pcibk_add_pci_dev()
x86/xen: Fix 32-bit PV guests's usage of kernel_stack
With KVM, MMIO is much slower than PIO, due to the need to
do page walk and emulation. But with EPT, it does not have to be: we
know the address from the VMCS so if the address is unique, we can look
up the eventfd directly, bypassing emulation.
Unfortunately, this only works if userspace does not need to match on
access length and data. The implementation adds a separate FAST_MMIO
bus internally. This serves two purposes:
- minimize overhead for old userspace that does not use eventfd with lengtth = 0
- minimize disruption in other code (since we don't know the length,
devices on the MMIO bus only get a valid address in write, this
way we don't need to touch all devices to teach them to handle
an invalid length)
At the moment, this optimization only has effect for EPT on x86.
It will be possible to speed up MMIO for NPT and MMU using the same
idea in the future.
With this patch applied, on VMX MMIO EVENTFD is essentially as fast as PIO.
I was unable to detect any measureable slowdown to non-eventfd MMIO.
Making MMIO faster is important for the upcoming virtio 1.0 which
includes an MMIO signalling capability.
The idea was suggested by Peter Anvin. Lots of thanks to Gleb for
pre-review and suggestions.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
It is sometimes benefitial to ignore IO size, and only match on address.
In hindsight this would have been a better default than matching length
when KVM_IOEVENTFD_FLAG_DATAMATCH is not set, In particular, this kind
of access can be optimized on VMX: there no need to do page lookups.
This can currently be done with many ioeventfds but in a suboptimal way.
However we can't change kernel/userspace ABI without risk of breaking
some applications.
Use len = 0 to mean "ignore length for matching" in a more optimal way.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Do a complete FPU context save/restore around the EFI calls. This required
as runtime EFI firmware may potentially use the FPU.
This change covers only the i386 configuration.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Do a complete FPU context save/restore around the EFI calls. This required
as runtime EFI firmware may potentially use the FPU.
This change covers only the x86_64 configuration.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
For i386, all the EFI system runtime services functions return efi_status_t
except efi_reset_system_system. Therefore, not all functions can be covered
by the same macro in case the macro needs to do more than calling the function
(i.e., return a value). The purpose of the __efi_call_virt macro is to be used
when no return value is expected.
For x86_64, this macro would not be needed as all the runtime services return
u64. However, the same code is used for both x86_64 and i386. Thus, the macro
__efi_call_virt is also defined to not break compilation.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
It may be necessary to save and restore the FPU context during EFI runtime
system services calls. However, this may happen during boot and before
alternatives have run. Thus, we need to use static_cpu_has_safe instead.
The rationale behind the use of static_cpu_has_safe is the same as in
commit 5f8c421814 ("x86, fpu: Use static_cpu_has_safe
before alternatives") by Borislav Petkov.
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Cc: Borislav Petkov <bp@suse.de>
We really only need one phys and one virt function call, and then only
one assembly function to make firmware calls.
Since we are not using the C type system anyway, we're not really losing
much by deleting the macros apart from no longer having a check that
we are passing the correct number of parameters. The lack of duplicated
code seems like a worthwhile trade-off.
Cc: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Instead of truncating UTF-16 assuming all characters is ASCII,
properly convert it to UTF-8.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
[ Bug and style fixes. ]
Signed-off-by: Roy Franz <roy.franz@linaro.org>
Signed-off-by: Leif Lindholm <leif.lindholm@linaro.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Current kprobes in-kernel page fault handler doesn't
expect that its single-stepping can be interrupted by
an NMI handler which may cause a page fault(e.g. perf
with callback tracing).
In that case, the page-fault handled by kprobes and it
misunderstands the page-fault has been caused by the
single-stepping code and tries to recover IP address
to probed address.
But the truth is the page-fault has been caused by the
NMI handler, and do_page_fault failes to handle real
page fault because the IP address is modified and
causes Kernel BUGs like below.
----
[ 2264.726905] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[ 2264.727190] IP: [<ffffffff813c46e0>] copy_user_generic_string+0x0/0x40
To handle this correctly, I fixed the kprobes fault
handler to ensure the faulted ip address is its own
single-step buffer instead of checking current kprobe
state.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Sandeepa Prabhu <sandeepa.prabhu@linaro.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: fche@redhat.com
Cc: systemtap@sourceware.org
Link: http://lkml.kernel.org/r/20140417081644.26341.52351.stgit@ltc230.yrl.intra.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following commit:
27f6c573e0 ("x86, CMCI: Add proper detection of end of CMCI storms")
Added two preemption bugs:
- machine_check_poll() does a get_cpu_var() without a matching
put_cpu_var(), which causes preemption imbalance and crashes upon
bootup.
- it does percpu ops without disabling preemption. Preemption is not
disabled due to the mistaken use of a raw spinlock.
To fix these bugs fix the imbalance and change
cmci_discover_lock to a regular spinlock.
Reported-by: Owen Kibel <qmewlo@gmail.com>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Chen, Gong <gong.chen@linux.intel.com>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Todorov <atodorov@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/n/tip-jtjptvgigpfkpvtQxpEk1at2@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
--
arch/x86/kernel/cpu/mcheck/mce.c | 4 +---
arch/x86/kernel/cpu/mcheck/mce_intel.c | 18 +++++++++---------
2 files changed, 10 insertions(+), 12 deletions(-)
Pull x86 fixes from Ingo Molnar:
"Various fixes:
- reboot regression fix
- build message spam fix
- GPU quirk fix
- 'make kvmconfig' fix
plus the wire-up of the renameat2() system call on i386"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Remove the PCI reboot method from the default chain
x86/build: Supress "Nothing to be done for ..." messages
x86/gpu: Fix sign extension issue in Intel graphics stolen memory quirks
x86/platform: Fix "make O=dir kvmconfig"
i386: Wire up the renameat2() syscall
Pull perf fixes from Ingo Molnar:
"Tooling fixes, plus a simple hardware-enablement patch for the Intel
RAPL PMU (energy use measurement) on Haswell CPUs, which I hope is
still fine at this stage"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf tools: Instead of redirecting flex output, use -o
perf tools: Fix double free in perf test 21 (code-reading.c)
perf stat: Initialize statistics correctly
perf bench: Set more defaults in the 'numa' suite
perf bench: Fix segfault at the end of an 'all' execution
perf bench: Update manpage to mention numa and futex
perf probe: Use dwarf_getcfi_elf() instead of dwarf_getcfi()
perf probe: Fix to handle errors in line_range searching
perf probe: Fix --line option behavior
perf tools: Pick up libdw without explicit LIBDW_DIR
MAINTAINERS: Change e-mail to kernel.org one
perf callchains: Disable unwind libraries when libelf isn't found
tools lib traceevent: Do not call warning() directly
tools lib traceevent: Print event name when show warning if possible
perf top: Fix documentation of invalid -s option
perf/x86: Enable DRAM RAPL support on Intel Haswell
KVM does not handle the reserved bits of x86 page tables correctly:
In PAE, bits 5:8 are reserved in the PDPTE.
In IA-32e, bit 8 is not reserved.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Several patches to fix cpu hotplug and the down'd cpu's irq
relocations have been submitted in the past month or so. The
patches should resolve the problems with cpu hotplug and irq
relocation, however, there is always a possibility that a bug
still exists. The big problem with debugging these irq
reassignments is that the cpu down completes and then we get
random stack traces from drivers for which irqs have not been
properly assigned to a new cpu. The stack traces are a mix of
storage, network, and other kernel subsystem (I once saw the
serial port stop working ...) warnings and failures.
The problem with these failures is that they are difficult to
diagnose. There is no warning in the cpu hotplug down path to
indicate that an IRQ has failed to be assigned to a new cpu, and
all we are left with is a stack trace from a driver, or a
non-functional device. If we had some information on the
console debugging these situations would be much easier; after
all we can map an IRQ to a device by simply using lspci or
/proc/interrupts.
The current code, fixup_irqs(), which migrates IRQs from the
down'd cpu and is called close to the end of the cpu down path,
calls chip->set_irq_affinity which eventually calls
__assign_irq_vector(). Errors are not propogated back from this
function call and this results in silent irq relocation
failures.
This patch fixes this issue by returning the error codes up the
call stack and prints out a warning if there is a relocation
failure.
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rui Wang <rui.y.wang@intel.com>
Cc: Liu Ping Fan <kernelfans@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Cc: Lv Zheng <lv.zheng@intel.com>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Yang Zhang <yang.z.zhang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
Cc: Li Fei <fei.li@intel.com>
Cc: gong.chen@linux.intel.com
Link: http://lkml.kernel.org/r/1396440673-18286-1-git-send-email-prarit@redhat.com
[ Made small cleanliness tweaks. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We use the accessed bit to age a page at page reclaim time,
and currently we also flush the TLB when doing so.
But in some workloads TLB flush overhead is very heavy. In my
simple multithreaded app with a lot of swap to several pcie
SSDs, removing the tlb flush gives about 20% ~ 30% swapout
speedup.
Fortunately just removing the TLB flush is a valid optimization:
on x86 CPUs, clearing the accessed bit without a TLB flush
doesn't cause data corruption.
It could cause incorrect page aging and the (mistaken) reclaim of
hot pages, but the chance of that should be relatively low.
So as a performance optimization don't flush the TLB when
clearing the accessed bit, it will eventually be flushed by
a context switch or a VM operation anyway. [ In the rare
event of it not getting flushed for a long time the delay
shouldn't really matter because there's no real memory
pressure for swapout to react to. ]
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: linux-mm@kvack.org
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20140408075809.GA1764@kernel.org
[ Rewrote the changelog and the code comments. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Steve reported a reboot hang and bisected it back to this commit:
a4f1987e4c x86, reboot: Add EFI and CF9 reboot methods into the default list
He heroically tested all reboot methods and found the following:
reboot=t # triple fault ok
reboot=k # keyboard ctrl FAIL
reboot=b # BIOS ok
reboot=a # ACPI FAIL
reboot=e # EFI FAIL [system has no EFI]
reboot=p # PCI 0xcf9 FAIL
And I think it's pretty obvious that we should only try PCI 0xcf9 as a
last resort - if at all.
The other observation is that (on this box) we should never try
the PCI reboot method, but close with either the 'triple fault'
or the 'BIOS' (terminal!) reboot methods.
Thirdly, CF9_COND is a total misnomer - it should be something like
CF9_SAFE or CF9_CAREFUL, and 'CF9' should be 'CF9_FORCE' ...
So this patch fixes the worst problems:
- it orders the actual reboot logic to follow the reboot ordering
pattern - it was in a pretty random order before for no good
reason.
- it fixes the CF9 misnomers and uses BOOT_CF9_FORCE and
BOOT_CF9_SAFE flags to make the code more obvious.
- it tries the BIOS reboot method before the PCI reboot method.
(Since 'BIOS' is a terminal reboot method resulting in a hang
if it does not work, this is essentially equivalent to removing
the PCI reboot method from the default reboot chain.)
- just for the miraculous possibility of terminal (resulting
in hang) reboot methods of triple fault or BIOS returning
without having done their job, there's an ordering between
them as well.
Reported-and-bisected-and-tested-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Aubrey <aubrey.li@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Link: http://lkml.kernel.org/r/20140404064120.GB11877@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The git commit a945928ea2
('xen: Do not enable spinlocks before jump_label_init() has executed')
was added to deal with the jump machinery. Earlier the code
that turned on the jump label was only called by Xen specific
functions. But now that it had been moved to the initcall machinery
it gets called on Xen, KVM, and baremetal - ouch!. And the detection
machinery to only call it on Xen wasn't remembered in the heat
of merge window excitement.
This means that the slowpath is enabled on baremetal while it should
not be.
Reported-by: Waiman Long <waiman.long@hp.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
CC: stable@vger.kernel.org
CC: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Commit 198d208df4 ("x86: Keep
thread_info on thread stack in x86_32") made 32-bit kernels use
kernel_stack to point to thread_info. That change missed a couple of
updates needed by Xen's 32-bit PV guests:
1. kernel_stack needs to be initialized for secondary CPUs
2. GET_THREAD_INFO() now uses %fs register which may not be the
kernel's version when executing xen_iret().
With respect to the second issue, we don't need GET_THREAD_INFO()
anymore: we used it as an intermediate step to get to per_cpu xen_vcpu
and avoid referencing %fs. Now that we are going to use %fs anyway we
may as well go directly to xen_vcpu.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Pull KVM fixes from Marcelo Tosatti:
- Fix for guest triggerable BUG_ON (CVE-2014-0155)
- CR4.SMAP support
- Spurious WARN_ON() fix
* git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: remove WARN_ON from get_kernel_ns()
KVM: Rename variable smep to cr4_smep
KVM: expose SMAP feature to guest
KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode
KVM: Add SMAP support when setting CR4
KVM: Remove SMAP bit from CR4_RESERVED_BITS
KVM: ioapic: try to recover if pending_eoi goes out of range
KVM: ioapic: fix assignment of ioapic->rtc_status.pending_eoi (CVE-2014-0155)
Rename variable smep to cr4_smep, which can better reflect the
meaning of the variable.
Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
SMAP is disabled if CPU is in non-paging mode in hardware.
However KVM always uses paging mode to emulate guest non-paging
mode with TDP. To emulate this behavior, SMAP needs to be
manually disabled when guest switches to non-paging mode.
Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This patch adds SMAP handling logic when setting CR4 for guests
Thanks a lot to Paolo Bonzini for his suggestion to use the branchless
way to detect SMAP violation.
Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The legacy PIC may or may not be available and we need a mechanism to
detect the existence of the legacy PIC that is applicable for all
hardware (both physical as well as virtual) currently supported by
Linux.
On Hyper-V, when our legacy firmware presented to the guests, emulates
the legacy PIC while when our EFI based firmware is presented we do
not emulate the PIC. To support Hyper-V EFI firmware, we had to set
the legacy_pic to the null_legacy_pic since we had to bypass PIC based
calibration in the early boot code. While, on the EFI firmware, we
know we don't emulate the legacy PIC, we need a generic mechanism to
detect the presence of the legacy PIC that is not based on boot time
state - this became apparent when we tried to get kexec to work on
Hyper-V EFI firmware.
This patch implements the proposal put forth by H. Peter Anvin
<hpa@linux.intel.com>: Write a known value to the PIC data port and
read it back. If the value read is the value written, we do have the
PIC, if not there is no PIC and we can safely set the legacy_pic to
null_legacy_pic. Since the read from an unconnected I/O port returns
0xff, we will use ~(1 << PIC_CASCADE_IR) (0xfb: mask all lines except
the cascade line) to probe for the existence of the PIC.
In version V1 of the patch, I had cleaned up the code based on comments from Peter.
In version V2 of the patch, I have addressed additional comments from Peter.
In version V3 of the patch, I have addressed Jan's comments (JBeulich@suse.com).
In version V4 of the patch, I have addressed additional comments from Peter.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Link: http://lkml.kernel.org/r/1397501029-29286-1-git-send-email-kys@microsoft.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
During another patch review, David Rientjes noted that
VECTOR_UNDEFINED and VECTOR_RETRIGGERED should be defined with ()s
so that they are not erroneously used in an arithmetic operation.
Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Yang Zhang <yang.z.zhang@Intel.com>
Link: http://lkml.kernel.org/r/1396440827-18352-1-git-send-email-prarit@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When we build an already built kernel again, arch/x86/syscalls/Makefile
and arch/x86/tools/Makefile emits "Nothing to be done for ..."
messages.
Here is the command log:
$ make defconfig
[ snip ]
$ make
[ snip ]
$ make
make[1]: Nothing to be done for `all'. <-----
make[1]: Nothing to be done for `relocs'. <-----
CHK include/config/kernel.release
CHK include/generated/uapi/linux/version.h
Besides not emitting those, "all" and "relocs" should be added to PHONY as well.
Signed-off-by: Masahiro Yamada <yamada.m@jp.panasonic.com>
Acked-by: Peter Foley <pefoley2@pefoley.com>
Acked-by: Michal Marek <mmarek@suse.cz>
Link: http://lkml.kernel.org/r/1397093742-11144-1-git-send-email-yamada.m@jp.panasonic.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Have the KB(),MB(),GB() macros produce unsigned longs to avoid
unintended sign extension issues with the gen2 memory size
detection.
What happens is first the uint8_t returned by
read_pci_config_byte() gets promoted to an int which gets
multiplied by another int from the MB() macro, and finally the
result gets sign extended to size_t.
Although this shouldn't be a problem in practice as all affected
gen2 platforms are 32bit AFAIK, so size_t will be 32 bits.
Reported-by: Bjorn Helgaas <bhelgaas@google.com>
Suggested-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1397382303-17525-1-git-send-email-ville.syrjala@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Running:
make O=dir x86_64_defconfig
make O=dir kvmconfig
the second command dirties the source tree with file ".config",
symlink "source" and objects in folder "scripts".
Fixed by using properly prefixed paths in the arch Makefile.
Signed-off-by: Antonio Borneo <borneo.antonio@gmail.com>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Link: http://lkml.kernel.org/r/1397377568-8375-1-git-send-email-borneo.antonio@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iEYEABECAAYFAlNFtEkACgkQuseO5dulBZVcPwCdFWY81hKqQaKHaSPFh9m+n1lt
yY0An2VZZGrFSkj162POFy1P2sPpGw5p
=LRFZ
-----END PGP SIGNATURE-----
Merge tag 'llvmlinux-for-v3.15' of git://git.linuxfoundation.org/llvmlinux/kernel
Pull llvm patches from Behan Webster:
"These are some initial updates to support compiling the kernel with
clang.
These patches have been through the proper reviews to the best of my
ability, and have been soaking in linux-next for a few weeks. These
patches by themselves still do not completely allow clang to be used
with the kernel code, but lay the foundation for other patches which
are still under review.
Several other of the LLVMLinux patches have been already added via
maintainer trees"
* tag 'llvmlinux-for-v3.15' of git://git.linuxfoundation.org/llvmlinux/kernel:
x86: LLVMLinux: Fix "incomplete type const struct x86cpu_device_id"
x86 kbuild: LLVMLinux: More cc-options added for clang
x86, acpi: LLVMLinux: Remove nested functions from Thinkpad ACPI
LLVMLinux: Add support for clang to compiler.h and new compiler-clang.h
LLVMLinux: Remove warning about returning an uninitialized variable
kbuild: LLVMLinux: Fix LINUX_COMPILER definition script for compilation with clang
Documentation: LLVMLinux: Update Documentation/dontdiff
kbuild: LLVMLinux: Adapt warnings for compilation with clang
kbuild: LLVMLinux: Add Kbuild support for building kernel with Clang
Pull audit updates from Eric Paris.
* git://git.infradead.org/users/eparis/audit: (28 commits)
AUDIT: make audit_is_compat depend on CONFIG_AUDIT_COMPAT_GENERIC
audit: renumber AUDIT_FEATURE_CHANGE into the 1300 range
audit: do not cast audit_rule_data pointers pointlesly
AUDIT: Allow login in non-init namespaces
audit: define audit_is_compat in kernel internal header
kernel: Use RCU_INIT_POINTER(x, NULL) in audit.c
sched: declare pid_alive as inline
audit: use uapi/linux/audit.h for AUDIT_ARCH declarations
syscall_get_arch: remove useless function arguments
audit: remove stray newline from audit_log_execve_info() audit_panic() call
audit: remove stray newlines from audit_log_lost messages
audit: include subject in login records
audit: remove superfluous new- prefix in AUDIT_LOGIN messages
audit: allow user processes to log from another PID namespace
audit: anchor all pid references in the initial pid namespace
audit: convert PPIDs to the inital PID namespace.
pid: get pid_t ppid of task in init_pid_ns
audit: rename the misleading audit_get_context() to audit_take_context()
audit: Add generic compat syscall support
audit: Add CONFIG_HAVE_ARCH_AUDITSYSCALL
...
- Fix for a recently introduced CPU hotplug regression in ARM KVM
from Ming Lei.
- Fixes for breakage in the at32ap, loongson2_cpufreq, and unicore32
cpufreq drivers introduced during the 3.14 cycle (-stable material)
from Chen Gang and Viresh Kumar.
- New powernv cpufreq driver from Vaidyanathan Srinivasan, with bits
from Gautham R Shenoy and Srivatsa S Bhat.
- Exynos cpufreq driver fix preventing it from being included into
multiplatform builds that aren't supported by it from Sachin Kamat.
- cpufreq cleanups related to the usage of the driver_data field in
struct cpufreq_frequency_table from Viresh Kumar.
- cpufreq ppc driver cleanup from Sachin Kamat.
- Intel BayTrail support for intel_idle and ACPI idle from Len Brown.
- Intel CPU model 54 (Atom N2000 series) support for intel_idle from
Jan Kiszka.
- intel_idle fix for Intel Ivy Town residency targets from Len Brown.
- turbostat updates (Intel Broadwell support and output cleanups)
from Len Brown.
- New cpuidle sysfs attribute for exporting C-states' target residency
information to user space from Daniel Lezcano.
- New kernel command line argument to prevent power domains enabled
by the bootloader from being turned off even if they are not in use
(for diagnostics purposes) from Tushar Behera.
- Fixes for wakeup sysfs attributes documentation from Geert Uytterhoeven.
- New ACPI video blacklist entry for ThinkPad Helix from Stephen Chandler
Paul.
- Assorted ACPI cleanups and a Kconfig help update from Jonghwan Choi,
Zhihui Zhang, Hanjun Guo.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJTRxLAAAoJEILEb/54YlRxnCwP/16UwO/eVE8SIi0TqQboikFC
k8u7F3zgDYG+xPSzXlCR+J7thTxGueQlrb+aM18PYuMVgaw2rpy7U7SIqEk8s6oR
uFnzZCWKA5ZebbZn+NlodnQaJmbgJxwsVJDuuechUka/e67CaIc54JULi2ynZ0lz
Kg/nU3NJhu4S81cT5SOTkJ9xE63oxHcCwKbNqEmxn7x7ddFzGK/DThG67NMEnW1F
vHbBTSyI6vmXXg1f9aobUtuo3PfJkkx5jD+nR1H2e6wmB64tW7JPVKV3mi6LJfYM
ui/8/gNb3PUMHMX1QbL9EFbPxl9miQx2NJ7dgFKa1HZ/WPyiXpJjz7uGr9O3Fau3
cFVREdaW8p2TAYWOEgH8luohhdK0j8UEpR/sEm0TrTjsK8wqczVf/hz6RraVJZiN
ck6eVHjY6m3/bFQauZQ/r+DNeeNcdr+iLejgjbh/MXuF3j0kx+1dkKkzCEU2TgEZ
9etF0uzjlgyXySyxNKBeSW13+ssVA6kF5/BHns7LHoxTfGu7Y4oVaWUi+j74i66O
bc+2ileNal71mS4v9gomnj6Ffj8oH8KXFA7k0sEsAdwLZNgThB5bTppmY/U7Y5Ce
hTS81tcGe2vOVQzF9iFOF7LNKKussAVAtrgkkrA8lJLeOTfQbIo4+fMhORxf3X/p
3O7R/jc4cT+IXK8a2xRt
=hGKg
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.15-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more ACPI and power management fixes and updates from Rafael Wysocki:
"This is PM and ACPI material that has emerged over the last two weeks
and one fix for a CPU hotplug regression introduced by the recent CPU
hotplug notifiers registration series.
Included are intel_idle and turbostat updates from Len Brown (these
have been in linux-next for quite some time), a new cpufreq driver for
powernv (that might spend some more time in linux-next, but BenH was
asking me so nicely to push it for 3.15 that I couldn't resist), some
cpufreq fixes and cleanups (including fixes for some silly breakage in
a couple of cpufreq drivers introduced during the 3.14 cycle),
assorted ACPI cleanups, wakeup framework documentation fixes, a new
sysfs attribute for cpuidle and a new command line argument for power
domains diagnostics.
Specifics:
- Fix for a recently introduced CPU hotplug regression in ARM KVM
from Ming Lei.
- Fixes for breakage in the at32ap, loongson2_cpufreq, and unicore32
cpufreq drivers introduced during the 3.14 cycle (-stable material)
from Chen Gang and Viresh Kumar.
- New powernv cpufreq driver from Vaidyanathan Srinivasan, with bits
from Gautham R Shenoy and Srivatsa S Bhat.
- Exynos cpufreq driver fix preventing it from being included into
multiplatform builds that aren't supported by it from Sachin Kamat.
- cpufreq cleanups related to the usage of the driver_data field in
struct cpufreq_frequency_table from Viresh Kumar.
- cpufreq ppc driver cleanup from Sachin Kamat.
- Intel BayTrail support for intel_idle and ACPI idle from Len Brown.
- Intel CPU model 54 (Atom N2000 series) support for intel_idle from
Jan Kiszka.
- intel_idle fix for Intel Ivy Town residency targets from Len Brown.
- turbostat updates (Intel Broadwell support and output cleanups)
from Len Brown.
- New cpuidle sysfs attribute for exporting C-states' target
residency information to user space from Daniel Lezcano.
- New kernel command line argument to prevent power domains enabled
by the bootloader from being turned off even if they are not in use
(for diagnostics purposes) from Tushar Behera.
- Fixes for wakeup sysfs attributes documentation from Geert
Uytterhoeven.
- New ACPI video blacklist entry for ThinkPad Helix from Stephen
Chandler Paul.
- Assorted ACPI cleanups and a Kconfig help update from Jonghwan
Choi, Zhihui Zhang, Hanjun Guo"
* tag 'pm+acpi-3.15-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (28 commits)
ACPI: Update the ACPI spec information in Kconfig
arm, kvm: fix double lock on cpu_add_remove_lock
cpuidle: sysfs: Export target residency information
cpufreq: ppc: Remove duplicate inclusion of fsl_soc.h
cpufreq: create another field .flags in cpufreq_frequency_table
cpufreq: use kzalloc() to allocate memory for cpufreq_frequency_table
cpufreq: don't print value of .driver_data from core
cpufreq: ia64: don't set .driver_data to index
cpufreq: powernv: Select CPUFreq related Kconfig options for powernv
cpufreq: powernv: Use cpufreq_frequency_table.driver_data to store pstate ids
cpufreq: powernv: cpufreq driver for powernv platform
cpufreq: at32ap: don't declare local variable as static
cpufreq: loongson2_cpufreq: don't declare local variable as static
cpufreq: unicore32: fix typo issue for 'clk'
cpufreq: exynos: Disable on multiplatform build
PM / wakeup: Correct presence vs. emptiness of wakeup_* attributes
PM / domains: Add pd_ignore_unused to keep power domains enabled
ACPI / dock: Drop dock_device_ids[] table
ACPI / video: Favor native backlight interface for ThinkPad Helix
ACPI / thermal: Fix wrong variable usage in debug statement
...
Pullx86 core platform updates from Peter Anvin:
"This is the x86/platform branch with the objectionable IOSF patches
removed.
What is left is proper memory handling for Intel GPUs, and a change to
the Calgary IOMMU code which will be required to make kexec work
sanely on those platforms after some upcoming kexec changes"
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, calgary: Use 8M TCE table size by default
x86/gpu: Print the Intel graphics stolen memory range
x86/gpu: Add Intel graphics stolen memory quirk for gen2 platforms
x86/gpu: Add vfunc for Intel graphics stolen memory base address
Pull x86 fixes from Peter Anvin:
"This is a collection of minor fixes for x86, plus the IRET information
leak fix (forbid the use of 16-bit segments in 64-bit mode)"
NOTE! We may have to relax the "forbid the use of 16-bit segments in
64-bit mode" part, since there may be people who still run and depend on
16-bit Windows binaries under Wine.
But I'm taking this in the current unconditional form for now to see who
(if anybody) screams bloody murder. Maybe nobody cares. And maybe
we'll have to update it with some kind of runtime enablement (like our
vm.mmap_min_addr tunable that people who run dosemu/qemu/wine already
need to tweak).
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
efi: Pass correct file handle to efi_file_{read,close}
x86/efi: Correct EFI boot stub use of code32_start
x86/efi: Fix boot failure with EFI stub
x86/platform/hyperv: Handle VMBUS driver being a module
x86/apic: Reinstate error IRQ Pentium erratum 3AP workaround
x86, CMCI: Add proper detection of end of CMCI storms
The IRET instruction, when returning to a 16-bit segment, only
restores the bottom 16 bits of the user space stack pointer. We have
a software workaround for that ("espfix") for the 32-bit kernel, but
it relies on a nonzero stack segment base which is not available in
32-bit mode.
Since 16-bit support is somewhat crippled anyway on a 64-bit kernel
(no V86 mode), and most (if not quite all) 64-bit processors support
virtualization for the users who really need it, simply reject
attempts at creating a 16-bit segment when running on top of a 64-bit
kernel.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/n/tip-kicdm89kzw9lldryb1br9od0@git.kernel.org
Cc: <stable@vger.kernel.org>