Commit Graph

2009 Commits

Author SHA1 Message Date
Linus Torvalds
2071b3e34f Merge branch 'x86/espfix' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
Pull x86-64 espfix changes from Peter Anvin:
 "This is the espfix64 code, which fixes the IRET information leak as
  well as the associated functionality problem.  With this code applied,
  16-bit stack segments finally work as intended even on a 64-bit
  kernel.

  Consequently, this patchset also removes the runtime option that we
  added as an interim measure.

  To help the people working on Linux kernels for very small systems,
  this patchset also makes these compile-time configurable features"

* 'x86/espfix' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Revert "x86-64, modify_ldt: Make support for 16-bit segments a runtime option"
  x86, espfix: Make it possible to disable 16-bit support
  x86, espfix: Make espfix64 a Kconfig option, fix UML
  x86, espfix: Fix broken header guard
  x86, espfix: Move espfix definitions into a separate header file
  x86-32, espfix: Remove filter for espfix32 due to race
  x86-64, espfix: Don't leak bits 31:16 of %esp returning to 16-bit stack
2014-06-05 07:46:15 -07:00
Emil Medve
af4459d363 arch/x86/mm/numa.c: use for_each_memblock()
Signed-off-by: Emil Medve <Emilian.Medve@Freescale.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:05 -07:00
Yinghai Lu
982792c782 x86, mm: probe memory block size for generic x86 64bit
On system with 2TiB ram, current x86_64 have 128M as section size, and
one memory_block only include one section.  So will have 16400 entries
under /sys/devices/system/memory/.

Current code try to use block id to find block pointer in /sys for any
section, and reuse that block pointer.  that finding will take some time
even after commit 7c243c7168 ("mm: speedup in __early_pfn_to_nid")
that will skip the search in that case during booting up.

So solution could be increase block size just like SGI UV system did.
(harded code to 2g).

This patch is trying to probe the block size to make it match mmio remap
size.  for example, Intel Nehalem later system will have memory range [0,
TOML), [4g, TOMH].  If the memory hole is 2g and total is 128g, TOM will
be 2g, and TOM2 will be 130g.

We could use 2g as block size instead of default 128M.  That will reduce
number of entries in /sys/devices/system/memory/

On system 6TiB system will reduce boot time by 35 seconds.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:53:55 -07:00
Mel Gorman
c46a7c817e x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels
_PAGE_NUMA is currently an alias of _PROT_PROTNONE to trap NUMA hinting
faults on x86.  Care is taken such that _PAGE_NUMA is used only in
situations where the VMA flags distinguish between NUMA hinting faults
and prot_none faults.  This decision was x86-specific and conceptually
it is difficult requiring special casing to distinguish between PROTNONE
and NUMA ptes based on context.

Fundamentally, we only need the _PAGE_NUMA bit to tell the difference
between an entry that is really unmapped and a page that is protected
for NUMA hinting faults as if the PTE is not present then a fault will
be trapped.

Swap PTEs on x86-64 use the bits after _PAGE_GLOBAL for the offset.
This patch shrinks the maximum possible swap size and uses the bit to
uniquely distinguish between NUMA hinting ptes and swap ptes.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Noonan <steven@uplinklabs.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:53:55 -07:00
Naoya Horiguchi
c177c81e09 hugetlb: restrict hugepage_migration_support() to x86_64
Currently hugepage migration is available for all archs which support
pmd-level hugepage, but testing is done only for x86_64 and there're
bugs for other archs.  So to avoid breaking such archs, this patch
limits the availability strictly to x86_64 until developers of other
archs get interested in enabling this feature.

Simply disabling hugepage migration on non-x86_64 archs is not enough to
fix the reported problem where sys_move_pages() hits the BUG_ON() in
follow_page(FOLL_GET), so let's fix this by checking if hugepage
migration is supported in vma_migratable().

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Miller <davem@davemloft.net>
Cc: <stable@vger.kernel.org>	[3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:53:51 -07:00
H. Peter Anvin
03c1b4e8e5 Merge remote-tracking branch 'origin/x86/espfix' into x86/vdso
Merge x86/espfix into x86/vdso, due to changes in the vdso setup code
that otherwise cause conflicts.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-05-21 17:36:33 -07:00
Andy Lutomirski
ac49b9a9f2 x86, mm: Replace arch_vma_name with vm_ops->name for vsyscalls
This removes the last vestiges of arch_vma_name from x86, replacing it
with vm_ops->name.  Good riddance.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/e681cb56096eee5b8b8767093a4f6fb82839f0a4.1400538962.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-05-20 11:39:31 -07:00
Andy Lutomirski
a62c34bd2a x86, mm: Improve _install_special_mapping and fix x86 vdso naming
Using arch_vma_name to give special mappings a name is awkward.  x86
currently implements it by comparing the start address of the vma to
the expected address of the vdso.  This requires tracking the start
address of special mappings and is probably buggy if a special vma
is split or moved.

Improve _install_special_mapping to just name the vma directly.  Use
it to give the x86 vvar area a name, which should make CRIU's life
easier.

As a side effect, the vvar area will show up in core dumps.  This
could be considered weird and is fixable.

[hpa: I say we accept this as-is but be prepared to deal with knocking
 out the vvars from core dumps if this becomes a problem.]

Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/276b39b6b645fb11e345457b503f17b83c2c6fd0.1400538962.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-05-20 11:38:42 -07:00
Andy Lutomirski
f40c330091 x86, vdso: Move the vvar and hpet mappings next to the 64-bit vDSO
This makes the 64-bit and x32 vdsos use the same mechanism as the
32-bit vdso.  Most of the churn is deleting all the old fixmap code.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/8af87023f57f6bb96ec8d17fce3f88018195b49b.1399317206.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-05-05 13:19:01 -07:00
Andy Lutomirski
6f121e548f x86, vdso: Reimplement vdso.so preparation in build-time C
Currently, vdso.so files are prepared and analyzed by a combination
of objcopy, nm, some linker script tricks, and some simple ELF
parsers in the kernel.  Replace all of that with plain C code that
runs at build time.

All five vdso images now generate .c files that are compiled and
linked in to the kernel image.

This should cause only one userspace-visible change: the loaded vDSO
images are stripped more heavily than they used to be.  Everything
outside the loadable segment is dropped.  In particular, this causes
the section table and section name strings to be missing.  This
should be fine: real dynamic loaders don't load or inspect these
tables anyway.  The result is roughly equivalent to eu-strip's
--strip-sections option.

The purpose of this change is to enable the vvar and hpet mappings
to be moved to the page following the vDSO load segment.  Currently,
it is possible for the section table to extend into the page after
the load segment, so, if we map it, it risks overlapping the vvar or
hpet page.  This happens whenever the load segment is just under a
multiple of PAGE_SIZE.

The only real subtlety here is that the old code had a C file with
inline assembler that did 'call VDSO32_vsyscall' and a linker script
that defined 'VDSO32_vsyscall = __kernel_vsyscall'.  This most
likely worked by accident: the linker script entry defines a symbol
associated with an address as opposed to an alias for the real
dynamic symbol __kernel_vsyscall.  That caused ld to relocate the
reference at link time instead of leaving an interposable dynamic
relocation.  Since the VDSO32_vsyscall hack is no longer needed, I
now use 'call __kernel_vsyscall', and I added -Bsymbolic to make it
work.  vdso2c will generate an error and abort the build if the
resulting image contains any dynamic relocations, so we won't
silently generate bad vdso images.

(Dynamic relocations are a problem because nothing will even attempt
to relocate the vdso.)

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/2c4fcf45524162a34d87fdda1eb046b2a5cecee7.1399317206.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-05-05 13:18:51 -07:00
Andy Lutomirski
73159fdcdb x86, mm: Ensure correct alignment of the fixmap
The early_ioremap code requires that its buffers not span a PMD
boundary.  The logic for ensuring that only works if the fixmap is
aligned, so assert that it's aligned correctly.

To make this work reliably, reserve_top_address needs to be
adjusted.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/e59a5f4362661f75dd4841fa74e1f2448045e245.1399317206.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-05-05 13:18:25 -07:00
Roland Dreier
c81c8a1eee x86, ioremap: Speed up check for RAM pages
In __ioremap_caller() (the guts of ioremap), we loop over the range of
pfns being remapped and checks each one individually with page_is_ram().
For large ioremaps, this can be very slow.  For example, we have a
device with a 256 GiB PCI BAR, and ioremapping this BAR can take 20+
seconds -- sometimes long enough to trigger the soft lockup detector!

Internally, page_is_ram() calls walk_system_ram_range() on a single
page.  Instead, we can make a single call to walk_system_ram_range()
from __ioremap_caller(), and do our further checks only for any RAM
pages that we find.  For the common case of MMIO, this saves an enormous
amount of work, since the range being ioremapped doesn't intersect
system RAM at all.

With this change, ioremap on our 256 GiB BAR takes less than 1 second.

Signed-off-by: Roland Dreier <roland@purestorage.com>
Link: http://lkml.kernel.org/r/1399054721-1331-1-git-send-email-roland@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-05-02 11:52:26 -07:00
H. Peter Anvin
3891a04aaf x86-64, espfix: Don't leak bits 31:16 of %esp returning to 16-bit stack
The IRET instruction, when returning to a 16-bit segment, only
restores the bottom 16 bits of the user space stack pointer.  This
causes some 16-bit software to break, but it also leaks kernel state
to user space.  We have a software workaround for that ("espfix") for
the 32-bit kernel, but it relies on a nonzero stack segment base which
is not available in 64-bit mode.

In checkin:

    b3b42ac2cb x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels

we "solved" this by forbidding 16-bit segments on 64-bit kernels, with
the logic that 16-bit support is crippled on 64-bit kernels anyway (no
V86 support), but it turns out that people are doing stuff like
running old Win16 binaries under Wine and expect it to work.

This works around this by creating percpu "ministacks", each of which
is mapped 2^16 times 64K apart.  When we detect that the return SS is
on the LDT, we copy the IRET frame to the ministack and use the
relevant alias to return to userspace.  The ministacks are mapped
readonly, so if IRET faults we promote #GP to #DF which is an IST
vector and thus has its own stack; we then do the fixup in the #DF
handler.

(Making #GP an IST exception would make the msr_safe functions unsafe
in NMI/MC context, and quite possibly have other effects.)

Special thanks to:

- Andy Lutomirski, for the suggestion of using very small stack slots
  and copy (as opposed to map) the IRET frame there, and for the
  suggestion to mark them readonly and let the fault promote to #DF.
- Konrad Wilk for paravirt fixup and testing.
- Borislav Petkov for testing help and useful comments.

Reported-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/1398816946-3351-1-git-send-email-hpa@linux.intel.com
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andrew Lutomriski <amluto@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dirk Hohndel <dirk@hohndel.org>
Cc: Arjan van de Ven <arjan.van.de.ven@intel.com>
Cc: comex <comexk@gmail.com>
Cc: Alexander van Heukelum <heukelum@fastmail.fm>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: <stable@vger.kernel.org> # consider after upstream merge
2014-04-30 14:14:28 -07:00
Masami Hiramatsu
9326638cbe kprobes, x86: Use NOKPROBE_SYMBOL() instead of __kprobes annotation
Use NOKPROBE_SYMBOL macro for protecting functions
from kprobes instead of __kprobes annotation under
arch/x86.

This applies nokprobe_inline annotation for some cases,
because NOKPROBE_SYMBOL() will inhibit inlining by
referring the symbol address.

This just folds a bunch of previous NOKPROBE_SYMBOL()
cleanup patches for x86 to one patch.

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Link: http://lkml.kernel.org/r/20140417081814.26341.51656.stgit@ltc230.yrl.intra.hitachi.co.jp
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fernando Luis Vázquez Cao <fernando_b1@lab.ntt.co.jp>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Lebon <jlebon@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matt Fleming <matt.fleming@intel.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-24 10:26:38 +02:00
Shaohua Li
b13b1d2d86 x86/mm: In the PTE swapout page reclaim case clear the accessed bit instead of flushing the TLB
We use the accessed bit to age a page at page reclaim time,
and currently we also flush the TLB when doing so.

But in some workloads TLB flush overhead is very heavy. In my
simple multithreaded app with a lot of swap to several pcie
SSDs, removing the tlb flush gives about 20% ~ 30% swapout
speedup.

Fortunately just removing the TLB flush is a valid optimization:
on x86 CPUs, clearing the accessed bit without a TLB flush
doesn't cause data corruption.

It could cause incorrect page aging and the (mistaken) reclaim of
hot pages, but the chance of that should be relatively low.

So as a performance optimization don't flush the TLB when
clearing the accessed bit, it will eventually be flushed by
a context switch or a VM operation anyway. [ In the rare
event of it not getting flushed for a long time the delay
shouldn't really matter because there's no real memory
pressure for swapout to react to. ]

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: linux-mm@kvack.org
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20140408075809.GA1764@kernel.org
[ Rewrote the changelog and the code comments. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-16 08:57:08 +02:00
David Rientjes
d0057ca4c1 arch/x86/mm/kmemcheck/kmemcheck.c: use kstrtoint() instead of sscanf()
Kmemcheck should use the preferred interface for parsing command line
arguments, kstrto*(), rather than sscanf() itself.  Use it
appropriately.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Vegard Nossum <vegardno@ifi.uio.no>
Acked-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-08 16:48:52 -07:00
Mark Salter
5b7c73e009 x86: use generic early_ioremap
Move x86 over to the generic early ioremap implementation.

Signed-off-by: Mark Salter <msalter@redhat.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:36:15 -07:00
Dave Young
6b550f6f20 x86/mm: sparse warning fix for early_memremap
This patch series takes the common bits from the x86 early ioremap
implementation and creates a generic implementation which may be used by
other architectures.  The early ioremap interfaces are intended for
situations where boot code needs to make temporary virtual mappings
before the normal ioremap interfaces are available.  Typically, this
means before paging_init() has run.

This patch (of 6):

There's a lot of sparse warnings for code like below: void *a =
early_memremap(phys_addr, size);

early_memremap intend to map kernel memory with ioremap facility, the
return pointer should be a kernel ram pointer instead of iomem one.

For making the function clearer and supressing sparse warnings this patch
do below two things:
1. cast to (__force void *) for the return value of early_memremap
2. add early_memunmap function and pass (__force void __iomem *) to iounmap

From Boris:
  "Ingo told me yesterday, it makes sense too.  I'd guess we can try it.
   FWIW, all callers of early_memremap use the memory they get remapped
   as normal memory so we should be safe"

Signed-off-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Mark Salter <msalter@redhat.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:36:14 -07:00
Linus Torvalds
467cbd207a Merge branch 'x86-nuke-platforms-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 old platform removal from Peter Anvin:
 "This patchset removes support for several completely obsolete
  platforms, where the maintainers either have completely vanished or
  acked the removal.  For some of them it is questionable if there even
  exists functional specimens of the hardware"

Geert Uytterhoeven apparently thought this was a April Fool's pull request ;)

* 'x86-nuke-platforms-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, platforms: Remove NUMAQ
  x86, platforms: Remove SGI Visual Workstation
  x86, apic: Remove support for IBM Summit/EXA chipset
  x86, apic: Remove support for ia32-based Unisys ES7000
2014-04-02 13:15:58 -07:00
Linus Torvalds
1694f0bb8f Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm change from Ingo Molnar:
 "A micro-optimization for acpi_numa_slit_init()"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Avoid duplicated pxm_to_node() calls
2014-04-01 09:50:01 -07:00
Linus Torvalds
7cc3afdf43 Merge branch 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 EFI changes from Ingo Molnar:
 "The main changes:

  - Add debug code to the dump EFI pagetable - Borislav Petkov

  - Make 1:1 runtime mapping robust when booting on machines with lots
    of memory - Borislav Petkov

  - Move the EFI facilities bits out of 'x86_efi_facility' and into
    efi.flags which is the standard architecture independent place to
    keep EFI state, by Matt Fleming.

  - Add 'EFI mixed mode' support: this allows 64-bit kernels to be
    booted from 32-bit firmware.  This needs a bootloader that supports
    the 'EFI handover protocol'.  By Matt Fleming"

* 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
  x86, efi: Abstract x86 efi_early calls
  x86/efi: Restore 'attr' argument to query_variable_info()
  x86/efi: Rip out phys_efi_get_time()
  x86/efi: Preserve segment registers in mixed mode
  x86/boot: Fix non-EFI build
  x86, tools: Fix up compiler warnings
  x86/efi: Re-disable interrupts after calling firmware services
  x86/boot: Don't overwrite cr4 when enabling PAE
  x86/efi: Wire up CONFIG_EFI_MIXED
  x86/efi: Add mixed runtime services support
  x86/efi: Firmware agnostic handover entry points
  x86/efi: Split the boot stub into 32/64 code paths
  x86/efi: Add early thunk code to go from 64-bit to 32-bit
  x86/efi: Build our own EFI services pointer table
  efi: Add separate 32-bit/64-bit definitions
  x86/efi: Delete dead code when checking for non-native
  x86/mm/pageattr: Always dump the right page table in an oops
  x86, tools: Consolidate #ifdef code
  x86/boot: Cleanup header.S by removing some #ifdefs
  efi: Use NULL instead of 0 for pointer
  ...
2014-03-31 12:26:05 -07:00
Linus Torvalds
918d80a136 Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpu handling changes from Ingo Molnar:
 "Bigger changes:

   - Intel CPU hardware-enablement: new vector instructions support
     (AVX-512), by Fenghua Yu.

   - Support the clflushopt instruction and use it in appropriate
     places.  clflushopt is similar to clflush but with more relaxed
     ordering, by Ross Zwisler.

   - MSR accessor cleanups, by Borislav Petkov.

   - 'forcepae' boot flag for those who have way too much time to spend
     on way too old Pentium-M systems and want to live way too
     dangerously, by Chris Bainbridge"

* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, cpu: Add forcepae parameter for booting PAE kernels on PAE-disabled Pentium M
  Rename TAINT_UNSAFE_SMP to TAINT_CPU_OUT_OF_SPEC
  x86, intel: Make MSR_IA32_MISC_ENABLE bit constants systematic
  x86, Intel: Convert to the new bit access MSR accessors
  x86, AMD: Convert to the new bit access MSR accessors
  x86: Add another set of MSR accessor functions
  x86: Use clflushopt in drm_clflush_virt_range
  x86: Use clflushopt in drm_clflush_page
  x86: Use clflushopt in clflush_cache_range
  x86: Add support for the clflushopt instruction
  x86, AVX-512: Enable AVX-512 States Context Switch
  x86, AVX-512: AVX-512 Feature Detection
2014-03-31 12:00:45 -07:00
Borislav Petkov
b82ad3d394 x86, pageattr: Correct WBINVD spelling in comment
It is WBINVD, for INValiDate and not "wbindv". Use caps for instruction
names, while at it.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1394633584-5509-4-git-send-email-bp@alien8.de
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-03-13 15:32:45 -07:00
Peter Zijlstra
d4078e2322 x86, trace: Further robustify CR2 handling vs tracing
Building on commit 0ac09f9f8c ("x86, trace: Fix CR2 corruption when
tracing page faults") this patch addresses another few issues:

 - Now that read_cr2() is lifted into trace_do_page_fault(), we should
   pass the address to trace_page_fault_entries() to avoid it
   re-reading a potentially changed cr2.

 - Put both trace_do_page_fault() and trace_page_fault_entries() under
   CONFIG_TRACING.

 - Mark both fault entry functions {,trace_}do_page_fault() as notrace
   to avoid getting __mcount or other function entry trace callbacks
   before we've observed CR2.

 - Mark __do_page_fault() as noinline to guarantee the function tracer
   does get to see the fault.

Cc: <jolsa@redhat.com>
Cc: <vincent.weaver@maine.edu>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140306145300.GO9987@twins.programming.kicks-ass.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-03-06 10:58:18 -08:00
Matt Fleming
994448f1af Merge remote-tracking branch 'tip/x86/efi-mixed' into efi-for-mingo
Conflicts:
	arch/x86/kernel/setup.c
	arch/x86/platform/efi/efi.c
	arch/x86/platform/efi/efi_64.c
2014-03-05 18:15:37 +00:00
Matt Fleming
4fd69331ad Merge remote-tracking branch 'tip/x86/urgent' into efi-for-mingo
Conflicts:
	arch/x86/include/asm/efi.h
2014-03-05 17:31:41 +00:00
Jiri Olsa
0ac09f9f8c x86, trace: Fix CR2 corruption when tracing page faults
The trace_do_page_fault function trigger tracepoint
and then handles the actual page fault.

This could lead to error if the tracepoint caused page
fault. The original cr2 value gets lost and the original
page fault handler kills current process with SIGSEGV.

This happens if you record page faults with callchain
data, the user part of it will cause tracepoint handler
to page fault:

  # perf record -g -e exceptions:page_fault_user ls

Fixing this by saving the original cr2 value
and using it after tracepoint handler is done.

v2: Moving the cr2 read before exception_enter, because
    it could trigger tracepoint as well.

Reported-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1402211701380.6395@vincent-weaver-1.um.maine.edu
Link: http://lkml.kernel.org/r/20140228160526.GD1133@krava.brq.redhat.com
2014-03-04 16:00:14 -08:00
Matt Fleming
426e34cc4f x86/mm/pageattr: Always dump the right page table in an oops
Now that we have EFI-specific page tables we need to lookup the pgd when
dumping those page tables, rather than assuming that swapper_pgdir is
the current pgdir.

Remove the double underscore prefix, which is usually reserved for
static functions.

Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-03-04 21:23:36 +00:00
Borislav Petkov
42a5477251 x86, pageattr: Export page unmapping interface
We will use it in efi so expose it.

Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-03-04 16:17:17 +00:00
Borislav Petkov
ef6bea6ddf x86, ptdump: Add the functionality to dump an arbitrary pagetable
With reusing the ->trampoline_pgd page table for mapping EFI regions in
order to use them after having switched to EFI virtual mode, it is very
useful to be able to dump aforementioned page table in dmesg. This adds
that functionality through the walk_pgd_level() interface which can be
called from somewhere else.

The original functionality of dumping to debugfs remains untouched.

Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2014-03-04 16:16:17 +00:00
Ross Zwisler
8b80fd8b45 x86: Use clflushopt in clflush_cache_range
If clflushopt is available on the system, use it instead of clflush in
clflush_cache_range.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Link: http://lkml.kernel.org/r/1393441612-19729-3-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-02-27 08:26:10 -08:00
H. Peter Anvin
b5660ba76b x86, platforms: Remove NUMAQ
The NUMAQ support seems to be unmaintained, remove it.

Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/n/530CFD6C.7040705@zytor.com
2014-02-27 08:07:39 -08:00
H. Peter Anvin
4640c7ee9b x86, smap: smap_violation() is bogus if CONFIG_X86_SMAP is off
If CONFIG_X86_SMAP is disabled, smap_violation() tests for conditions
which are incorrect (as the AC flag doesn't matter), causing spurious
faults.

The dynamic disabling of SMAP (nosmap on the command line) is fine
because it disables X86_FEATURE_SMAP, therefore causing the
static_cpu_has() to return false.

Found by Fengguang Wu's test system.

[ v3: move all predicates into smap_violation() ]
[ v2: use IS_ENABLED() instead of #ifdef ]

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Link: http://lkml.kernel.org/r/20140213124550.GA30497@localhost
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@vger.kernel.org> # v3.7+
2014-02-13 08:40:52 -08:00
Yinghai Lu
ba6a328f7d x86/mm: Avoid duplicated pxm_to_node() calls
In slit init code, too many pxm_to_node() function calls are done.

We can store from_node/to_node instead of keep calling
pxm_to_node().

  - Before this patch: pxm_to_node() is called n*(1+n*3) times.
  - After  this patch: pxm_to_node() is called n*(1+n) times.

for  8 sockets, it will be   72 instead of  200.
for 32 sockets, it will be 1056 instead of 3104.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/1390770102-4007-1-git-send-email-yinghai@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-09 15:32:31 +01:00
Linus Torvalds
c1ff84317f Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Peter Anvin:
 "Quite a varied little collection of fixes.  Most of them are
  relatively small or isolated; the biggest one is Mel Gorman's fixes
  for TLB range flushing.

  A couple of AMD-related fixes (including not crashing when given an
  invalid microcode image) and fix a crash when compiled with gcov"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, microcode, AMD: Unify valid container checks
  x86, hweight: Fix BUG when booting with CONFIG_GCOV_PROFILE_ALL=y
  x86/efi: Allow mapping BGRT on x86-32
  x86: Fix the initialization of physnode_map
  x86, cpu hotplug: Fix stack frame warning in check_irq_vectors_for_cpu_disable()
  x86/intel/mid: Fix X86_INTEL_MID dependencies
  arch/x86/mm/srat: Skip NUMA_NO_NODE while parsing SLIT
  mm, x86: Revisit tlb_flushall_shift tuning for page flushes except on IvyBridge
  x86: mm: change tlb_flushall_shift for IvyBridge
  x86/mm: Eliminate redundant page table walk during TLB range flushing
  x86/mm: Clean up inconsistencies when flushing TLB ranges
  mm, x86: Account for TLB flushes only when debugging
  x86/AMD/NB: Fix amd_set_subcaches() parameter type
  x86/quirks: Add workaround for AMD F16h Erratum792
  x86, doc, kconfig: Fix dud URL for Microcode data
2014-02-08 11:54:43 -08:00
H. Peter Anvin
a3b072cd18 * Avoid WARN_ON() when mapping BGRT on Baytrail (EFI 32-bit).
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJS9P2pAAoJEC84WcCNIz1VABsP/j0fwiWdaoEn/9p9EUQcBcMn
 CILuiKI8GoBR+0nH7vm/jSgUKfUxzM6T0+qjoZPVkO/u/qup+JCT5JxoywUyEbfb
 +wPNIhBgKSwJdWXPd4ZvObA9jknrP6r5sJTsixpESVpVWmcC8egPgkyBFILjokKS
 BCJqqruHZcXfWaDQTocYErl3T217J7RfnCG1o7yP96g4jteX8EdvIkPjEf2mM83I
 GXacHLy8IhGhdyPKSXcRqZ0ivhZ2WX1iFQYIY19FhmzBs364WulyXve+oV+l/4h4
 Kx7ks4Ob5AlUIizchGNwVz7058F9o/v7m0CezTgi4Q/RrZi34samxbjk/95/Xc60
 JSvWkekm3/jOODubad5zj+ZnJmG83/ZlCUuTaqsE47ftbaedSUHBN9QSpm8iHEex
 n3d4J3AdP/3amcP33kZ5MRALDYIFKb4ZxtDkADqDcXhS56COivGAdZe5hnyCpb/9
 RPUDXTOlxfJQK/y2Atcdb400JjJ/Yr9Kew81LRIt0UMZU3dKSh05UZ+a7Ym0yCkt
 3k0NNkgsFCZbYTO/Z3aPDcprwU5Lq9UrwjB17U2ev/qK+qRYDzCzSR0XGPrLMRv7
 C5Bnov6uCn/0ZG/NlAx8UXK9wdWDsLhp1QkBz+daX3sGwRAS+OiKBv6+l8dqsdOc
 1L8PMkTX2rgtELiv4PJ/
 =hLCC
 -----END PGP SIGNATURE-----

Merge tag 'efi-urgent' into x86/urgent

 * Avoid WARN_ON() when mapping BGRT on Baytrail (EFI 32-bit).

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-02-07 11:27:30 -08:00
Tang Chen
7bc35fdde6 arch/x86/mm/numa.c: fix array index overflow when synchronizing nid to memblock.reserved.
The following path will cause array out of bound.

memblock_add_region() will always set nid in memblock.reserved to
MAX_NUMNODES.  In numa_register_memblks(), after we set all nid to
correct valus in memblock.reserved, we called setup_node_data(), and
used memblock_alloc_nid() to allocate memory, with nid set to
MAX_NUMNODES.

The nodemask_t type can be seen as a bit array.  And the index is 0 ~
MAX_NUMNODES-1.

After that, when we call node_set() in numa_clear_kernel_node_hotplug(),
the nodemask_t got an index of value MAX_NUMNODES, which is out of [0 ~
MAX_NUMNODES-1].

See below:

numa_init()
 |---> numa_register_memblks()
 |      |---> memblock_set_node(memory)		set correct nid in memblock.memory
 |      |---> memblock_set_node(reserved)	set correct nid in memblock.reserved
 |      |......
 |      |---> setup_node_data()
 |             |---> memblock_alloc_nid()	here, nid is set to MAX_NUMNODES (1024)
 |......
 |---> numa_clear_kernel_node_hotplug()
        |---> node_set()			here, we have an index 1024, and overflowed

This patch moves nid setting to numa_clear_kernel_node_hotplug() to fix
this problem.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reported-by: Dave Jones <davej@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Tested-by: Dave Jones <davej@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-06 13:48:51 -08:00
Tang Chen
017c217a26 arch/x86/mm/numa.c: initialize numa_kernel_nodes in numa_clear_kernel_node_hotplug()
On-stack variable numa_kernel_nodes in numa_clear_kernel_node_hotplug()
was not initialized.  So we need to initialize it.

[akpm@linux-foundation.org: use NODE_MASK_NONE, per David]
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reported-by: Dave Jones <davej@redhat.com>
Reported-by: David Rientjes <rientjes@google.com>
Tested-by: Dave Jones <davej@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-06 13:48:51 -08:00
Petr Tesarik
85fc73a2cd x86: Fix the initialization of physnode_map
With DISCONTIGMEM, the mapping between a pfn and its owning node is
initialized using data provided by the BIOS. However, the initialization
may fail if the extents are not aligned to section boundary (64M).

The symptom of this bug is an early boot failure in pfn_to_page(),
as it tries to access NODE_DATA(__nid) using index from an unitialized
element of the physnode_map[] array.

While the bug is always present, it is more likely to be hit in kdump
kernels on large machines, because:

1. The memory map for a kdump kernel is specified as exactmap, and
   exactmap is more likely to be unaligned.

2. Large reservations are more likely to span across a 64M boundary.

[ hpa: fixed incorrect use of "pfn" instead of "start" ]

Signed-off-by: Petr Tesarik <ptesarik@suse.cz>
Link: http://lkml.kernel.org/r/20140201133019.32e56f86@hananiah.suse.cz
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2014-02-01 22:15:51 -08:00
Ingo Molnar
2b45e0f9f3 Merge branch 'linus' into x86/urgent
Merge in the x86 changes to apply a fix.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-25 09:16:14 +01:00
Toshi Kani
a85eba8814 arch/x86/mm/srat: Skip NUMA_NO_NODE while parsing SLIT
When ACPI SLIT table has an I/O locality (i.e. a locality
unique to an I/O device), numa_set_distance() emits this warning
message:

 NUMA: Warning: node ids are out of bound, from=-1 to=-1 distance=10

acpi_numa_slit_init() calls numa_set_distance() with
pxm_to_node(), which assumes that all localities have been
parsed with SRAT previously.  SRAT does not list I/O localities,
where as SLIT lists all localities including I/Os.  Hence,
pxm_to_node() returns NUMA_NO_NODE (-1) for an I/O locality.

I/O localities are not supported and are ignored today, but emitting
such warning message leads to unnecessary confusion.

Change acpi_numa_slit_init() to avoid calling
numa_set_distance() with NUMA_NO_NODE.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/n/tip-dSvpjjvp8aMzs1ybkftxohlh@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-25 09:13:35 +01:00
Mel Gorman
71b54f8263 x86/mm: Eliminate redundant page table walk during TLB range flushing
When choosing between doing an address space or ranged flush,
the x86 implementation of flush_tlb_mm_range takes into account
whether there are any large pages in the range.  A per-page
flush typically requires fewer entries than would covered by a
single large page and the check is redundant.

There is one potential exception.  THP migration flushes single
THP entries and it conceivably would benefit from flushing a
single entry instead of the mm.  However, this flush is after a
THP allocation, copy and page table update potentially with any
other threads serialised behind it.  In comparison to that, the
flush is noise.  It makes more sense to optimise balancing to
require fewer flushes than to optimise the flush itself.

This patch deletes the redundant huge page check.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alex Shi <alex.shi@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-sgei1drpOcburujPsfh6ovmo@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-25 09:10:43 +01:00
Mel Gorman
15aa368255 x86/mm: Clean up inconsistencies when flushing TLB ranges
NR_TLB_LOCAL_FLUSH_ALL is not always accounted for correctly and
the comparison with total_vm is done before taking
tlb_flushall_shift into account.  Clean it up.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Alex Shi <alex.shi@linaro.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/n/tip-Iz5gcahrgskIldvukulzi0hh@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-25 09:10:42 +01:00
Mel Gorman
ec65993443 mm, x86: Account for TLB flushes only when debugging
Bisection between 3.11 and 3.12 fingered commit 9824cf97 ("mm:
vmstats: tlb flush counters") to cause overhead problems.

The counters are undeniably useful but how often do we really
need to debug TLB flush related issues?  It does not justify
taking the penalty everywhere so make it a debugging option.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Alex Shi <alex.shi@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-XzxjntugxuwpxXhcrxqqh53b@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-25 09:10:41 +01:00
Sasha Levin
309381feae mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE
Most of the VM_BUG_ON assertions are performed on a page.  Usually, when
one of these assertions fails we'll get a BUG_ON with a call stack and
the registers.

I've recently noticed based on the requests to add a small piece of code
that dumps the page to various VM_BUG_ON sites that the page dump is
quite useful to people debugging issues in mm.

This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
VM_BUG_ON() does, also dumps the page before executing the actual
BUG_ON.

[akpm@linux-foundation.org: fix up includes]
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23 16:36:50 -08:00
Grygorii Strashko
9a28f9dc8d x86/mm: memblock: switch to use NUMA_NO_NODE
Update X86 code to use NUMA_NO_NODE instead of MAX_NUMNODES while
calling memblock APIs, because memblock API will be changed to use
NUMA_NO_NODE and will produce warning during boot otherwise.

See:
 https://lkml.org/lkml/2013/12/9/898

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21 16:19:47 -08:00
Tang Chen
a0acda9172 acpi, numa, mem_hotplug: mark all nodes the kernel resides un-hotpluggable
At very early time, the kernel have to use some memory such as loading
the kernel image.  We cannot prevent this anyway.  So any node the
kernel resides in should be un-hotpluggable.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Chen Tang <imtangchen@gmail.com>
Cc: Gong Chen <gong.chen@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Liu Jiang <jiang.liu@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21 16:19:45 -08:00
Tang Chen
05d1d8cb1c acpi, numa, mem_hotplug: mark hotpluggable memory in memblock
When parsing SRAT, we know that which memory area is hotpluggable.  So we
invoke function memblock_mark_hotplug() introduced by previous patch to
mark hotpluggable memory in memblock.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Chen Tang <imtangchen@gmail.com>
Cc: Gong Chen <gong.chen@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Liu Jiang <jiang.liu@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21 16:19:45 -08:00
Tang Chen
e7e8de5918 memblock: make memblock_set_node() support different memblock_type
[sfr@canb.auug.org.au: fix powerpc build]
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Chen Tang <imtangchen@gmail.com>
Cc: Gong Chen <gong.chen@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Liu Jiang <jiang.liu@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21 16:19:44 -08:00
Linus Torvalds
f4bcd8ccdd Merge branch 'x86-kaslr-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 kernel address space randomization support from Peter Anvin:
 "This enables kernel address space randomization for x86"

* 'x86-kaslr-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, kaslr: Clarify RANDOMIZE_BASE_MAX_OFFSET
  x86, kaslr: Remove unused including <linux/version.h>
  x86, kaslr: Use char array to gain sizeof sanity
  x86, kaslr: Add a circular multiply for better bit diffusion
  x86, kaslr: Mix entropy sources together as needed
  x86/relocs: Add percpu fixup for GNU ld 2.23
  x86, boot: Rename get_flags() and check_flags() to *_cpuflags()
  x86, kaslr: Raise the maximum virtual address to -1 GiB on x86_64
  x86, kaslr: Report kernel offset on panic
  x86, kaslr: Select random position from e820 maps
  x86, kaslr: Provide randomness functions
  x86, kaslr: Return location from decompress_kernel
  x86, boot: Move CPU flags out of cpucheck
  x86, relocs: Add more per-cpu gold special cases
2014-01-20 14:45:50 -08:00
Linus Torvalds
8bd6964cbd Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm changes from Ingo Molnar:
 "A cleanup, a fix and ASLR support for hugetlb mappings"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm/numa: Fix 32-bit kernel NUMA boot
  x86/mm: Implement ASLR for hugetlb mappings
  x86/mm: Unify pte_to_pgoff() and pgoff_to_pte() helpers
2014-01-20 12:08:51 -08:00
Linus Torvalds
972d5e7e5b Merge branch 'x86-efi-kexec-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 EFI changes from Ingo Molnar:
 "This consists of two main parts:

   - New static EFI runtime services virtual mapping layout which is
     groundwork for kexec support on EFI (Borislav Petkov)

   - EFI kexec support itself (Dave Young)"

* 'x86-efi-kexec-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  x86/efi: parse_efi_setup() build fix
  x86: ksysfs.c build fix
  x86/efi: Delete superfluous global variables
  x86: Reserve setup_data ranges late after parsing memmap cmdline
  x86: Export x86 boot_params to sysfs
  x86: Add xloadflags bit for EFI runtime support on kexec
  x86/efi: Pass necessary EFI data for kexec via setup_data
  efi: Export EFI runtime memory mapping to sysfs
  efi: Export more EFI table variables to sysfs
  x86/efi: Cleanup efi_enter_virtual_mode() function
  x86/efi: Fix off-by-one bug in EFI Boot Services reservation
  x86/efi: Add a wrapper function efi_map_region_fixed()
  x86/efi: Remove unused variables in __map_region()
  x86/efi: Check krealloc return value
  x86/efi: Runtime services virtual mapping
  x86/mm/cpa: Map in an arbitrary pgd
  x86/mm/pageattr: Add last levels of error path
  x86/mm/pageattr: Add a PUD error unwinding path
  x86/mm/pageattr: Add a PTE pagetable populating function
  x86/mm/pageattr: Add a PMD pagetable populating function
  ...
2014-01-20 12:05:30 -08:00
Linus Torvalds
2a0fede97f Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
 "Misc cleanups"

* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, cpu, amd: Fix a shadowed variable situation
  um, x86: Fix vDSO build
  x86: Delete non-required instances of include <linux/init.h>
  x86, realmode: Pointer walk cleanups, pull out invariant use of __pa()
  x86/traps: Clean up error exception handler definitions
2014-01-20 12:03:57 -08:00
Peter Zijlstra
c026b3591e x86, mm, perf: Allow recursive faults from interrupts
Waiman managed to trigger a PMI while in a emulate_vsyscall() fault,
the PMI in turn managed to trigger a fault while obtaining a stack
trace. This triggered the sig_on_uaccess_error recursive fault logic
and killed the process dead.

Fix this by explicitly excluding interrupts from the recursive fault
logic.

Reported-and-Tested-by: Waiman Long <waiman.long@hp.com>
Fixes: e00b12e64b ("perf/x86: Further optimize copy_from_user_nmi()")
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140110200603.GJ7572@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-16 09:19:48 +01:00
Paul Gortmaker
663b55b9b3 x86: Delete non-required instances of include <linux/init.h>
None of these files are actually using any __init type directives
and hence don't need to include <linux/init.h>.  Most are just a
left over from __devinit and __cpuinit removal, or simply due to
code getting copied from one driver to the next.

[ hpa: undid incorrect removal from arch/x86/kernel/head_32.S ]

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Link: http://lkml.kernel.org/r/1389054026-12947-1-git-send-email-paul.gortmaker@windriver.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2014-01-06 21:25:18 -08:00
Ingo Molnar
ef0b8b9a52 Linux 3.13-rc7
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJSyJVbAAoJEHm+PkMAQRiGa28H/0m7GpZSpT8mvBthITxzqWCq
 JRkSPS4KTurAWlA5CJMJePyCM30DgN90s06bYUen9sTecZUwnL+qSV5OqAmg2r+0
 PrfwtXtGZR6/Y12XlZ/3oFxVfUxjmgJyDAS76TIH1IvIum52nvJmLrR+6AyVphIX
 DkgBOuapdA7lia+U+ZM1cRkeHxUOKTUEw9v611VgoN3LYZyzyRb6d0rB7JtZN1RV
 dnXRi27enaPhwxelsCnORioRjsByMwD40CERxfLHmr5CGhmvCehBjO6bJ+KAdp14
 52bfwWcNdbFMzUobcR7qlfS3Hy3AYJci+P6JzeeZ+kWEdv/eh5/1lvNuXtBJRlc=
 =iwzJ
 -----END PGP SIGNATURE-----

Merge tag 'v3.13-rc7' into x86/efi-kexec to resolve conflicts

Conflicts:
	arch/x86/platform/efi/efi.c
	drivers/firmware/efi/Kconfig

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-05 12:34:29 +01:00
Lans Zhang
f3d815cb85 x86/mm/numa: Fix 32-bit kernel NUMA boot
When booting a 32-bit x86 kernel on a NUMA machine, node data
cannot be allocated from local node if the account of memory for
node 0 covers the low memory space entirely:

  [    0.000000] Initmem setup node 0 [mem 0x00000000-0x83fffffff]
  [    0.000000]   NODE_DATA [mem 0x367ed000-0x367edfff]
  [    0.000000] Initmem setup node 1 [mem 0x840000000-0xfffffffff]
  [    0.000000] Cannot find 4096 bytes in node 1
  [    0.000000] 64664MB HIGHMEM available.
  [    0.000000] 871MB LOWMEM available.

To fix this issue, node data is allowed to be allocated from
other nodes if the memory of local node is still not mapped. The
expected result looks like this:

  [    0.000000] Initmem setup node 0 [mem 0x00000000-0x83fffffff]
  [    0.000000]   NODE_DATA [mem 0x367ed000-0x367edfff]
  [    0.000000] Initmem setup node 1 [mem 0x840000000-0xfffffffff]
  [    0.000000]   NODE_DATA [mem 0x367ec000-0x367ecfff]
  [    0.000000]     NODE_DATA(1) on node 0
  [    0.000000] 64664MB HIGHMEM available.
  [    0.000000] 871MB LOWMEM available.

Signed-off-by: Lans Zhang <jia.zhang@windriver.com>
Cc: <andi@firstfloor.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1386303510-18574-1-git-send-email-jia.zhang@windriver.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-12-19 13:58:36 +01:00
Ingo Molnar
3331e924e7 Linux 3.13-rc4
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.15 (GNU/Linux)
 
 iQEcBAABAgAGBQJSrhGrAAoJEHm+PkMAQRiGsNoH/jIK3CsQ2lbW7yRLXmfgtbzz
 i2Kep6D4SDvmaLpLYOVC8xNYTiE8jtTbSXHomwP5wMZ63MQDhBfnEWsEWqeZ9+D9
 3Q46p0QWuoBgYu2VGkoxTfygkT6hhSpwWIi3SeImbY4fg57OHiUil/+YGhORM4Qc
 K4549OCTY3sIrgmWL77gzqjRUo+pQ4C73NKqZ3+5nlOmYBZC1yugk8mFwEpQkwhK
 4NRNU760Fo+XIht/bINqRiPMddzC15p0mxvJy3cDW8bZa1tFSS9SB7AQUULBbcHL
 +2dFlFOEb5SV1sNiNPrJ0W+h2qUh2e7kPB0F8epaBppgbwVdyQoC2u4uuLV2ZN0=
 =lI2r
 -----END PGP SIGNATURE-----

Merge tag 'v3.13-rc4' into x86/mm

Pick up the latest fixes before applying new patches.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-12-19 13:56:10 +01:00
Mel Gorman
2b4847e730 mm: numa: serialise parallel get_user_page against THP migration
Base pages are unmapped and flushed from cache and TLB during normal
page migration and replaced with a migration entry that causes any
parallel NUMA hinting fault or gup to block until migration completes.

THP does not unmap pages due to a lack of support for migration entries
at a PMD level.  This allows races with get_user_pages and
get_user_pages_fast which commit 3f926ab945 ("mm: Close races between
THP migration and PMD numa clearing") made worse by introducing a
pmd_clear_flush().

This patch forces get_user_page (fast and normal) on a pmd_numa page to
go through the slow get_user_page path where it will serialise against
THP migration and properly account for the NUMA hinting fault.  On the
migration side the page table lock is taken for each PTE update.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-18 19:04:50 -08:00
Ingo Molnar
61d0669775 * New static EFI runtime services virtual mapping layout which is
groundwork for kexec support on EFI - Borislav Petkov
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQIcBAABAgAGBQJSfMDjAAoJEC84WcCNIz1V0LIP/2WyTbJR6bL0HXwnGLpxmxag
 v0VgnKRhypNboA3WEu4a9as6AdExqB7qsiWIipHuDSMj/vkfZgHAKTd2f1iRxmsJ
 RZxzwV6YzvsWkdXjvpCoLWKSsvQDug++BAIti1PitW6RXRjo01t3ymo/Ho1CQrpI
 hNJbB3bbihMF+uqFvdSpO0KZtZE6EtnylrfBeuo0GzqqJdTGe1MmqlWmyUlEy5JW
 ZiHV8E/xTjh3N675tWPcT9hGVfCyOXXu/kPrXsJTXrdYyZL9qgA9b8SVRLs6DctX
 wVgL9lNv4wobsmZJ5DxkYl9+TaF7rbshUeIJbzrQyMVJjb3TpXk/ZpspDMAEjL7e
 bb76c1bAx4xZuUatR/f1ykkWKAEryAhXHkvwcbIBjebW33if1MgGJLk5udJQQv6H
 j+J9ROH38MDr0Geg+pM2RnCyTz8l+q+8Mfu4Yh9TSte+ttB6fr9phs3/G+fdSUn1
 0vI627v1HWzDcBh4eZjjslzJviR8PldsGVT3EsIaOnHGtk/9FPz/7n4efph4v7+9
 yqTkLvQHxsAx7f0tR/qRpkEIQ9WTMXO0IO79OC13QTSATJSl+WSPTJM7ccqOgn+P
 h89ssBnzlwmFHkuvTi599KVHdzOZrWTsB2zROn+NnSchJ+YAY6TXwznk3/MNo9rB
 d+euVcffL4yfWZ7Bzj8Z
 =w6ff
 -----END PGP SIGNATURE-----

Merge tag 'efi-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into x86/efi

Pull EFI virtual mapping changes from Matt Fleming:

  * New static EFI runtime services virtual mapping layout which is
    groundwork for kexec support on EFI. (Borislav Petkov)

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-26 12:23:04 +01:00
Kirill A. Shutemov
c283610e44 x86, mm: do not leak page->ptl for pmd page tables
There are two code paths how page with pmd page table can be freed:
pmd_free() and pmd_free_tlb().

I've missed the second one and didn't add page table destructor call
there.  It leads to leak of page->ptl for pmd page tables, if
dynamically allocated page->ptl is in use.

The patch adds the missed destructor and modifies documentation
accordingly.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Andrey Vagin <avagin@openvz.org>
Tested-by: Andrey Vagin <avagin@openvz.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-21 16:42:28 -08:00
Al Viro
2a46eed54a Wrong page freed on preallocate_pmds() failure exit
Note that pmds[i] is simply uninitialized at that point...

Granted, it's very hard to hit (you need split page locks *and*
kmalloc(sizeof(spinlock_t), GFP_KERNEL) failing), but the code is
obviously bogus.

Introduced by commit 09ef493985 ("x86: add missed
pgtable_pmd_page_ctor/dtor calls for preallocated pmds")

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-20 14:22:14 -08:00
Kirill A. Shutemov
fd8526ad14 x86/mm: Implement ASLR for hugetlb mappings
Matthew noticed that hugetlb mappings don't participate in ASLR on x86-64:

  %  for i in `seq 3`; do
  > tools/testing/selftests/vm/map_hugetlb | grep address
  > done
  Returned address is 0x2aaaaac00000
  Returned address is 0x2aaaaac00000
  Returned address is 0x2aaaaac00000

/proc/PID/maps entries for the mapping are always the same
(except inode number):

  2aaaaac00000-2aaabac00000 rw-p 00000000 00:0c 8200              /anon_hugepage (deleted)
  2aaaaac00000-2aaabac00000 rw-p 00000000 00:0c 256               /anon_hugepage (deleted)
  2aaaaac00000-2aaabac00000 rw-p 00000000 00:0c 7180              /anon_hugepage (deleted)

The reason is the generic hugetlb_get_unmapped_area() function
which is used on x86-64.  It doesn't support randomization and
use bottom-up unmapped area lookup, instead of usual top-down
on x86-64.

x86 has arch-specific hugetlb_get_unmapped_area(), but it's used
only on x86-32.

Let's use arch-specific hugetlb_get_unmapped_area() on x86-64
too. That adds ASLR and switches hugetlb mappings to use top-down
unmapped area lookup:

  %  for i in `seq 3`; do
  > tools/testing/selftests/vm/map_hugetlb | grep address
  > done
  Returned address is 0x7f4f08a00000
  Returned address is 0x7fdda4200000
  Returned address is 0x7febe0000000

/proc/PID/maps entries:

  7f4f08a00000-7f4f18a00000 rw-p 00000000 00:0c 1168              /anon_hugepage (deleted)
  7fdda4200000-7fddb4200000 rw-p 00000000 00:0c 7092              /anon_hugepage (deleted)
  7febe0000000-7febf0000000 rw-p 00000000 00:0c 7183              /anon_hugepage (deleted)

Unmapped area lookup policy for hugetlb mappings is consistent
with normal mappings now -- the only difference is alignment
requirements for huge pages.

libhugetlbfs test-suite didn't detect any regressions with the
patch applied (although it shows few failures on my machine
regardless the patch).

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mel Gorman <mgorman@suse.de>
Link: http://lkml.kernel.org/r/20131119131750.EA45CE0090@blue.fi.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-19 14:24:50 +01:00
Kirill A. Shutemov
cecbd1b5af x86: handle pgtable_page_ctor() fail
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:19 +09:00
Kirill A. Shutemov
09ef493985 x86: add missed pgtable_pmd_page_ctor/dtor calls for preallocated pmds
In split page table lock case, we embed spinlock_t into struct page.
For obvious reason, we don't want to increase size of struct page if
spinlock_t is too big, like with DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC or
on -rt kernel.  So we disable split page table lock, if spinlock_t is
too big.

This patchset allows to allocate the lock dynamically if spinlock_t is
big.  In this page->ptl is used to store pointer to spinlock instead of
spinlock itself.  It costs additional cache line for indirect access,
but fix page fault scalability for multi-threaded applications.

LOCK_STAT depends on DEBUG_SPINLOCK, so on current kernel enabling
LOCK_STAT to analyse scalability issues breaks scalability.  ;)

The patchset mostly fixes this.  Results for ./thp_memscale -c 80 -b 512M
on 4-socket machine:

baseline, no CONFIG_LOCK_STAT:	9.115460703 seconds time elapsed
baseline, CONFIG_LOCK_STAT=y:	53.890567123 seconds time elapsed
patched, no CONFIG_LOCK_STAT:	8.852250368 seconds time elapsed
patched, CONFIG_LOCK_STAT=y:	11.069770759 seconds time elapsed

Patch count is scary, but most of them trivial. Overview:

 Patches 1-4	Few bug fixes. No dependencies to other patches.
		Probably should applied as soon as possible.

 Patch 5	Changes signature of pgtable_page_ctor(). We will use it
		for dynamic lock allocation, so it can fail.

 Patches 6-8	Add missing constructor/destructor calls on few archs.
		It's fixes NR_PAGETABLE accounting and prepare to use
		split ptl.

 Patches 9-33	Add pgtable_page_ctor() fail handling to all archs.

 Patches 34	Finally adds support of dynamically-allocated page->pte.
		Also contains documentation for split page table lock.

This patch (of 34):

I've missed that we preallocate few pmds on pgd_alloc() if X86_PAE
enabled.  Let's add missed constructor/destructor calls.

I haven't noticed it during testing since prep_new_page() clears
page->mapping and therefore page->ptl.  It's effectively equal to
spin_lock_init(&page->ptl).

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Liqin <liqin.chen@sunplusct.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Howells <dhowells@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:15 +09:00
Linus Torvalds
d320e203ba Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull two x86 fixes from Ingo Molnar.

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/microcode/amd: Tone down printk(), don't treat a missing firmware file as an error
  x86/dumpstack: Fix printk_address for direct addresses
2013-11-14 16:55:56 +09:00
Linus Torvalds
7971e23a66 Merge branch 'x86-trace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/trace changes from Ingo Molnar:
 "This adds page fault tracepoints which have zero runtime cost in the
  disabled case via IDT trickery (no NOPs in the page fault hotpath)"

* 'x86-trace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, trace: Change user|kernel_page_fault to page_fault_user|kernel
  x86, trace: Add page fault tracepoints
  x86, trace: Delete __trace_alloc_intr_gate()
  x86, trace: Register exception handler to trace IDT
  x86, trace: Remove __alloc_intr_gate()
2013-11-14 16:25:10 +09:00
Zhi Yong Wu
d4dd100f2e arch/x86/mm/init.c: fix incorrect function name in alloc_low_pages()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:11 +09:00
Tang Chen
c5320926e3 mem-hotplug: introduce movable_node boot option
The hot-Pluggable field in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel, it
cannot be hot-removed.  So memory hotplug users may want to set all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.

Memory hotplug users may also set a node as movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.

But the kernel cannot use memory in ZONE_MOVABLE.  By doing this, the
kernel cannot use memory in movable nodes.  This will cause NUMA
performance down.  And other users may be unhappy.

So we need a way to allow users to enable and disable this functionality.
In this patch, we introduce movable_node boot option to allow users to
choose to not to consume hotpluggable memory at early boot time and later
we can set it as ZONE_MOVABLE.

To achieve this, the movable_node boot option will control the memblock
allocation direction.  That said, after memblock is ready, before SRAT is
parsed, we should allocate memory near the kernel image as we explained in
the previous patches.  So if movable_node boot option is set, the kernel
does the following:

1. After memblock is ready, make memblock allocate memory bottom up.
2. After SRAT is parsed, make memblock behave as default, allocate memory
   top down.

Users can specify "movable_node" in kernel commandline to enable this
functionality.  For those who don't use memory hotplug or who don't want
to lose their NUMA performance, just don't specify anything.  The kernel
will work as before.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Suggested-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:09 +09:00
Tang Chen
b959ed6c73 x86/mem-hotplug: support initialize page tables in bottom-up
The Linux kernel cannot migrate pages used by the kernel.  As a result,
kernel pages cannot be hot-removed.  So we cannot allocate hotpluggable
memory for the kernel.

In a memory hotplug system, any numa node the kernel resides in should be
unhotpluggable.  And for a modern server, each node could have at least
16GB memory.  So memory around the kernel image is highly likely
unhotpluggable.

ACPI SRAT (System Resource Affinity Table) contains the memory hotplug
info.  But before SRAT is parsed, memblock has already started to allocate
memory for the kernel.  So we need to prevent memblock from doing this.

So direct memory mapping page tables setup is the case.
init_mem_mapping() is called before SRAT is parsed.  To prevent page
tables being allocated within hotpluggable memory, we will use bottom-up
direction to allocate page tables from the end of kernel image to the
higher memory.

Note:
As for allocating page tables in lower memory, TJ said:

: This is an optional behavior which is triggered by a very specific kernel
: boot param, which I suspect is gonna need to stick around to support
: memory hotplug in the current setup unless we add another layer of address
: translation to support memory hotplug.

As for page tables may occupy too much lower memory if using 4K mapping
(CONFIG_DEBUG_PAGEALLOC and CONFIG_KMEMCHECK both disable using >4k
pages), TJ said:

: But as I said in the same paragraph, parsing SRAT earlier doesn't solve
: the problem in itself either.  Ignoring the option if 4k mapping is
: required and memory consumption would be prohibitive should work, no?
: Something like that would be necessary if we're gonna worry about cases
: like this no matter how we implement it, but, frankly, I'm not sure this
: is something worth worrying about.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:08 +09:00
Tang Chen
0167d7d8b0 x86/mm: factor out of top-down direct mapping setup
Create a new function memory_map_top_down to factor out of the top-down
direct memory mapping pagetable setup.  This is also a preparation for the
following patch, which will introduce the bottom-up memory mapping.  That
said, we will put the two ways of pagetable setup into separate functions,
and choose to use which way in init_mem_mapping, which makes the code more
clear.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:08 +09:00
Jiri Slaby
5f01c98859 x86/dumpstack: Fix printk_address for direct addresses
Consider a kernel crash in a module, simulated the following way:

 static int my_init(void)
 {
         char *map = (void *)0x5;
         *map = 3;
         return 0;
 }
 module_init(my_init);

When we turn off FRAME_POINTERs, the very first instruction in
that function causes a BUG. The problem is that we print IP in
the BUG report using %pB (from printk_address). And %pB
decrements the pointer by one to fix printing addresses of
functions with tail calls.

This was added in commit 71f9e59800 ("x86, dumpstack: Use
%pB format specifier for stack trace") to fix the call stack
printouts.

So instead of correct output:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000005
  IP: [<ffffffffa01ac000>] my_init+0x0/0x10 [pb173]

We get:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000005
  IP: [<ffffffffa0152000>] 0xffffffffa0151fff

To fix that, we use %pS only for stack addresses printouts (via
newly added printk_stack_address) and %pB for regs->ip (via
printk_address). I.e. we revert to the old behaviour for all
except call stacks. And since from all those reliable is 1, we
remove that parameter from printk_address.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: joe@perches.com
Cc: jirislaby@gmail.com
Link: http://lkml.kernel.org/r/1382706418-8435-1-git-send-email-jslaby@suse.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-12 21:06:06 +01:00
Linus Torvalds
2612c49db3 Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm fixlet from Ingo Molnar:
 "One cleanup that documents a particular detail in init_mem_mapping()"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Add 'step_size' comments to init_mem_mapping()
2013-11-12 11:20:52 +09:00
H. Peter Anvin
a4f61dec55 x86, trace: Change user|kernel_page_fault to page_fault_user|kernel
Tracepoints are named hierachially, and it makes more sense to keep a
general flow of information level from general to specific from left
to right, i.e.

	x86_exceptions.page_fault_user|kernel

rather than

	x86_exceptions.user|kernel_page_fault

Suggested-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Seiji Aguchi <seiji.aguchi@hds.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Link: http://lkml.kernel.org/r/20131111082955.GB12405@gmail.com
2013-11-11 08:15:40 -08:00
Seiji Aguchi
d34603b07c x86, trace: Add page fault tracepoints
This patch introduces page fault tracepoints to x86 architecture
by switching IDT.

  Two events, for user and kernel spaces, are introduced at the beginning
  of page fault handler for tracing.

  - User space event
    There is a request of page fault event for user space as below.

    https://lkml.kernel.org/r/1368079520-11015-2-git-send-email-fdeslaur+()+gmail+!+com
    https://lkml.kernel.org/r/1368079520-11015-1-git-send-email-fdeslaur+()+gmail+!+com

  - Kernel space event:
    When we measure an overhead in kernel space for investigating performance
    issues, we can check if it comes from the page fault events.

Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Link: http://lkml.kernel.org/r/52716E67.6090705@hds.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-11-08 14:15:49 -08:00
Seiji Aguchi
25c74b10ba x86, trace: Register exception handler to trace IDT
This patch registers exception handlers for tracing to a trace IDT.

To implemented it in set_intr_gate(), this patch does followings.
 - Register the exception handlers to
   the trace IDT by prepending "trace_" to the handler's names.
 - Also, newly introduce trace_page_fault() to add tracepoints
   in a subsequent patch.

Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Link: http://lkml.kernel.org/r/52716DEC.5050204@hds.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-11-08 14:15:45 -08:00
Borislav Petkov
82f0712ca0 x86/mm/cpa: Map in an arbitrary pgd
Add the ability to map pages in an arbitrary pgd. This wires in the
remaining stuff so that there's a new interface with which you can map a
region into an arbitrary PGD.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2013-11-02 11:09:35 +00:00
Borislav Petkov
52a628fb45 x86/mm/pageattr: Add last levels of error path
We try to free the pagetable pages once we've unmapped our portion.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2013-11-02 11:09:31 +00:00
Borislav Petkov
0bb8aeee7b x86/mm/pageattr: Add a PUD error unwinding path
In case we encounter an error during the mapping of a region, we want to
unwind what we've established so far exactly the way we did the mapping.
This is the PUD part kept deliberately small for easier review.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2013-11-02 11:09:28 +00:00
Borislav Petkov
c6b6f363f7 x86/mm/pageattr: Add a PTE pagetable populating function
Handle last level by unconditionally writing the PTEs into the PTE page
while paying attention to the NX bit.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2013-11-02 11:09:24 +00:00
Borislav Petkov
f900a4b8ab x86/mm/pageattr: Add a PMD pagetable populating function
Handle PMD-level mappings the same as PUD ones.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2013-11-02 11:09:20 +00:00
Borislav Petkov
4b23538d88 x86/mm/pageattr: Add a PUD pagetable populating function
Add the next level of the pagetable populating function, we handle
chunks around a 1G boundary by mapping them with the lower level
functions - otherwise we use 1G pages for the mappings, thus using as
less amount of pagetable pages as possible.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2013-11-02 11:09:17 +00:00
Borislav Petkov
f3f729661e x86/mm/pageattr: Add a PGD pagetable populating function
This allocates, if necessary, and populates the corresponding PGD entry
with a PUD page. The next population level is a dummy macro which will
be removed by the next patch and it is added here to keep the patch
small and easily reviewable but not break bisection, at the same time.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2013-11-02 11:09:12 +00:00
Borislav Petkov
0fd64c23fd x86/mm/pageattr: Lookup address in an arbitrary PGD
This is preparatory work in order to be able to map pages into a
specified PGD and not implicitly and only into init_mm.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2013-11-02 11:09:08 +00:00
Peter Zijlstra
e00b12e64b perf/x86: Further optimize copy_from_user_nmi()
Now that we can deal with nested NMI due to IRET re-enabling NMIs and
can deal with faults from NMI by making sure we preserve CR2 over NMIs
we can in fact simply access user-space memory from NMI context.

So rewrite copy_from_user_nmi() to use __copy_from_user_inatomic() and
rework the fault path to do the minimal required work before taking
the in_atomic() fault handler.

In particular avoid perf_sw_event() which would make perf recurse on
itself (it should be harmless as our recursion protections should be
able to deal with this -- but why tempt fate).

Also rename notify_page_fault() to kprobes_fault() as that is a much
better name; there is no notifier in it and its specific to kprobes.

Don measured that his worst case NMI path shrunk from ~300K cycles to
~150K cycles.

Cc: Stephane Eranian <eranian@google.com>
Cc: jmario@redhat.com
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: dave.hansen@linux.intel.com
Tested-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131024105206.GM2490@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-29 12:02:54 +01:00
Kees Cook
6145cfe394 x86, kaslr: Raise the maximum virtual address to -1 GiB on x86_64
On 64-bit, this raises the maximum location to -1 GiB (from -1.5 GiB),
the upper limit currently, since the kernel fixmap page mappings need
to be moved to use the other 1 GiB (which would be the theoretical
limit when building with -mcmodel=kernel).

Signed-off-by: Kees Cook <keescook@chromium.org>
Link: http://lkml.kernel.org/r/1381450698-28710-7-git-send-email-keescook@chromium.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-10-13 03:13:13 -07:00
Johannes Weiner
3a13c4d761 x86: finish user fault error path with fatal signal
The x86 fault handler bails in the middle of error handling when the
task has a fatal signal pending.  For a subsequent patch this is a
problem in OOM situations because it relies on pagefault_out_of_memory()
being called even when the task has been killed, to perform proper
per-task OOM state unwinding.

Shortcutting the fault like this is a rather minor optimization that
saves a few instructions in rare cases.  Just remove it for
user-triggered faults.

Use the opportunity to split the fault retry handling from actual fault
errors and add locking documentation that reads suprisingly similar to
ARM's.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: azurIt <azurit@pobox.sk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:01 -07:00
Johannes Weiner
759496ba64 arch: mm: pass userspace fault flag to generic fault handler
Unlike global OOM handling, memory cgroup code will invoke the OOM killer
in any OOM situation because it has no way of telling faults occuring in
kernel context - which could be handled more gracefully - from
user-triggered faults.

Pass a flag that identifies faults originating in user space from the
architecture-specific fault handlers to generic code so that memcg OOM
handling can be improved.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: azurIt <azurit@pobox.sk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:01 -07:00
Naoya Horiguchi
83467efbdb mm: migrate: check movability of hugepage in unmap_and_move_huge_page()
Currently hugepage migration works well only for pmd-based hugepages
(mainly due to lack of testing,) so we had better not enable migration of
other levels of hugepages until we are ready for it.

Some users of hugepage migration (mbind, move_pages, and migrate_pages) do
page table walk and check pud/pmd_huge() there, so they are safe.  But the
other users (softoffline and memory hotremove) don't do this, so without
this patch they can try to migrate unexpected types of hugepages.

To prevent this, we introduce hugepage_migration_support() as an
architecture dependent check of whether hugepage are implemented on a pmd
basis or not.  And on some architecture multiple sizes of hugepages are
available, so hugepage_migration_support() also checks hugepage size.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:57:49 -07:00
Dave Hansen
6df46865ff mm: vmstats: track TLB flush stats on UP too
The previous patch doing vmstats for TLB flushes ("mm: vmstats: tlb flush
counters") effectively missed UP since arch/x86/mm/tlb.c is only compiled
for SMP.

UP systems do not do remote TLB flushes, so compile those counters out on
UP.

arch/x86/kernel/cpu/mtrr/generic.c calls __flush_tlb() directly.  This is
probably an optimization since both the mtrr code and __flush_tlb() write
cr4.  It would probably be safe to make that a flush_tlb_all() (and then
get these statistics), but the mtrr code is ancient and I'm hesitant to
touch it other than to just stick in the counters.

[akpm@linux-foundation.org: tweak comments]
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:57:09 -07:00
Dave Hansen
9824cf9753 mm: vmstats: tlb flush counters
I was investigating some TLB flush scaling issues and realized that we do
not have any good methods for figuring out how many TLB flushes we are
doing.

It would be nice to be able to do these in generic code, but the
arch-independent calls don't explicitly specify whether we actually need
to do remote flushes or not.  In the end, we really need to know if we
actually _did_ global vs.  local invalidations, so that leaves us with few
options other than to muck with the counters from arch-specific code.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:57:08 -07:00
Yinghai Lu
6979287a7d x86/mm: Add 'step_size' comments to init_mem_mapping()
Current code uses macro to shift by 5, but there is no explanation
why there's no worry about an overflow there.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Jacob Shin <jacob.shin@amd.com>
Link: http://lkml.kernel.org/r/1378519629-10433-1-git-send-email-yinghai@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-10 09:51:34 +02:00
Linus Torvalds
cb3e4330e6 Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm changes from Ingo Molnar:
 "Misc smaller fixes:

   - a parse_setup_data() boot crash fix

   - a memblock and an __early_ioremap cleanup

   - turn the always-on CONFIG_ARCH_MEMORY_PROBE=y into a configurable
     option and turn it off - it's an unrobust debug facility, it
     shouldn't be enabled by default"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: avoid remapping data in parse_setup_data()
  x86: Use memblock_set_current_limit() to set limit for memblock.
  mm: Remove unused variable idx0 in __early_ioremap()
  mm/hotplug, x86: Disable ARCH_MEMORY_PROBE by default
2013-09-04 09:39:26 -07:00
Linus Torvalds
40031da445 ACPI and power management updates for 3.12-rc1
1) ACPI-based PCI hotplug (ACPIPHP) subsystem rework and introduction
     of Intel Thunderbolt support on systems that use ACPI for signalling
     Thunderbolt hotplug events.  This also should make ACPIPHP work in
     some cases in which it was known to have problems.  From
     Rafael J Wysocki, Mika Westerberg and Kirill A Shutemov.
 
  2) ACPI core code cleanups and dock station support cleanups from
     Jiang Liu and Rafael J Wysocki.
 
  3) Fixes for locking problems related to ACPI device hotplug from
     Rafael J Wysocki.
 
  4) ACPICA update to version 20130725 includig fixes, cleanups, support
     for more than 256 GPEs per GPE block and a change to make the ACPI
     PM Timer optional (we've seen systems without the PM Timer in the
     field already).  One of the fixes, related to the DeRefOf operator,
     is necessary to prevent some Windows 8 oriented AML from causing
     problems to happen.  From Bob Moore, Lv Zheng, and Jung-uk Kim.
 
  5) Removal of the old and long deprecated /proc/acpi/event interface
     and related driver changes from Thomas Renninger.
 
  6) ACPI and Xen changes to make the reduced hardware sleep work with
     the latter from Ben Guthro.
 
  7) ACPI video driver cleanups and a blacklist of systems that should
     not tell the BIOS that they are compatible with Windows 8 (or ACPI
     backlight and possibly other things will not work on them).  From
     Felipe Contreras.
 
  8) Assorted ACPI fixes and cleanups from Aaron Lu, Hanjun Guo,
     Kuppuswamy Sathyanarayanan, Lan Tianyu, Sachin Kamat, Tang Chen,
     Toshi Kani, and Wei Yongjun.
 
  9) cpufreq ondemand governor target frequency selection change to
     reduce oscillations between min and max frequencies (essentially,
     it causes the governor to choose target frequencies proportional
     to load) from Stratos Karafotis.
 
 10) cpufreq fixes allowing sysfs attributes file permissions to be
     preserved over suspend/resume cycles Srivatsa S Bhat.
 
 11) Removal of Device Tree parsing for CPU device nodes from multiple
     cpufreq drivers that required some changes related to
     of_get_cpu_node() to be made in a few architectures and in the
     driver core.  From Sudeep KarkadaNagesha.
 
 12) cpufreq core fixes and cleanups related to mutual exclusion and
     driver module references from Viresh Kumar, Lukasz Majewski and
     Rafael J Wysocki.
 
 13) Assorted cpufreq fixes and cleanups from Amit Daniel Kachhap,
     Bartlomiej Zolnierkiewicz, Hanjun Guo, Jingoo Han, Joseph Lo,
     Julia Lawall, Li Zhong, Mark Brown, Sascha Hauer, Stephen Boyd,
     Stratos Karafotis, and Viresh Kumar.
 
 14) Fixes to prevent race conditions in coupled cpuidle from happening
     from Colin Cross.
 
 15) cpuidle core fixes and cleanups from Daniel Lezcano and
     Tuukka Tikkanen.
 
 16) Assorted cpuidle fixes and cleanups from Daniel Lezcano,
     Geert Uytterhoeven, Jingoo Han, Julia Lawall, Linus Walleij,
     and Sahara.
 
 17) System sleep tracing changes from Todd E Brandt and Shuah Khan.
 
 18) PNP subsystem conversion to using struct dev_pm_ops for power
     management from Shuah Khan.
 
 /
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIcBAABCAAGBQJSJcKhAAoJEKhOf7ml8uNsplIQAJSOshxhkkemvFOuHZ+0YIbh
 R9aufjXeDkMDBi8YtU+tB7ERth1j+0LUSM0NTnP51U7e+7eSGobA9s5jSZQj2l7r
 HFtnSOegLuKAfqwgfSLK91xa1rTFdfW0Kych9G2nuHtBIt6P0Oc59Cb5M0oy6QXs
 nVtaDEuU//tmO71+EF5HnMJHabRTrpvtn/7NbDUpU7LZYpWJrHJFT9xt1rXNab7H
 YRCATPm3kXGRg58Doc3EZE4G3D7DLvq74jWMaI089X/m5Pg1G6upqArypOy6oxdP
 p2FEzYVrb2bi8fakXp7BBeO1gCJTAqIgAkbSSZHLpGhFaeEMmb9/DWPXdm2TjzMV
 c1EEucvsqZWoprXgy12i5Hk814xN8d8nBBLg/UYiRJ44nc/hevXfyE9ZYj6bkseJ
 +GNHmZIa1QYC05nnGli4+W4kHns8EZf/gmvIxnPuco1RN2yMWagrud5/G6Dr9M2B
 hzJV6qauLVzgZso4oe79zv9aVxe/dPHKANLD/sg23WBiJJbJF1ocBlnj2Xlbpqze
 pmMUWGiO/gUiS0fmpW/lAJauza5jFmSCjE4E8R0Gyn0j4YXjmMhdEanaU6J3VuCi
 yVgEzYEth4sowq4AflMMLKYN+WmozDnK7taZRGmT0t+EKRFKLT6EgnNrkQgs1vKl
 oawD9LM4fZ8E0yroOEme
 =CgqW
 -----END PGP SIGNATURE-----

Merge tag 'pm+acpi-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI and power management updates from Rafael Wysocki:

 1) ACPI-based PCI hotplug (ACPIPHP) subsystem rework and introduction
    of Intel Thunderbolt support on systems that use ACPI for signalling
    Thunderbolt hotplug events.  This also should make ACPIPHP work in
    some cases in which it was known to have problems.  From
    Rafael J Wysocki, Mika Westerberg and Kirill A Shutemov.

 2) ACPI core code cleanups and dock station support cleanups from
    Jiang Liu and Rafael J Wysocki.

 3) Fixes for locking problems related to ACPI device hotplug from
    Rafael J Wysocki.

 4) ACPICA update to version 20130725 includig fixes, cleanups, support
    for more than 256 GPEs per GPE block and a change to make the ACPI
    PM Timer optional (we've seen systems without the PM Timer in the
    field already).  One of the fixes, related to the DeRefOf operator,
    is necessary to prevent some Windows 8 oriented AML from causing
    problems to happen.  From Bob Moore, Lv Zheng, and Jung-uk Kim.

 5) Removal of the old and long deprecated /proc/acpi/event interface
    and related driver changes from Thomas Renninger.

 6) ACPI and Xen changes to make the reduced hardware sleep work with
    the latter from Ben Guthro.

 7) ACPI video driver cleanups and a blacklist of systems that should
    not tell the BIOS that they are compatible with Windows 8 (or ACPI
    backlight and possibly other things will not work on them).  From
    Felipe Contreras.

 8) Assorted ACPI fixes and cleanups from Aaron Lu, Hanjun Guo,
    Kuppuswamy Sathyanarayanan, Lan Tianyu, Sachin Kamat, Tang Chen,
    Toshi Kani, and Wei Yongjun.

 9) cpufreq ondemand governor target frequency selection change to
    reduce oscillations between min and max frequencies (essentially,
    it causes the governor to choose target frequencies proportional
    to load) from Stratos Karafotis.

10) cpufreq fixes allowing sysfs attributes file permissions to be
    preserved over suspend/resume cycles Srivatsa S Bhat.

11) Removal of Device Tree parsing for CPU device nodes from multiple
    cpufreq drivers that required some changes related to
    of_get_cpu_node() to be made in a few architectures and in the
    driver core.  From Sudeep KarkadaNagesha.

12) cpufreq core fixes and cleanups related to mutual exclusion and
    driver module references from Viresh Kumar, Lukasz Majewski and
    Rafael J Wysocki.

13) Assorted cpufreq fixes and cleanups from Amit Daniel Kachhap,
    Bartlomiej Zolnierkiewicz, Hanjun Guo, Jingoo Han, Joseph Lo,
    Julia Lawall, Li Zhong, Mark Brown, Sascha Hauer, Stephen Boyd,
    Stratos Karafotis, and Viresh Kumar.

14) Fixes to prevent race conditions in coupled cpuidle from happening
    from Colin Cross.

15) cpuidle core fixes and cleanups from Daniel Lezcano and
    Tuukka Tikkanen.

16) Assorted cpuidle fixes and cleanups from Daniel Lezcano,
    Geert Uytterhoeven, Jingoo Han, Julia Lawall, Linus Walleij,
    and Sahara.

17) System sleep tracing changes from Todd E Brandt and Shuah Khan.

18) PNP subsystem conversion to using struct dev_pm_ops for power
    management from Shuah Khan.

* tag 'pm+acpi-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (217 commits)
  cpufreq: Don't use smp_processor_id() in preemptible context
  cpuidle: coupled: fix race condition between pokes and safe state
  cpuidle: coupled: abort idle if pokes are pending
  cpuidle: coupled: disable interrupts after entering safe state
  ACPI / hotplug: Remove containers synchronously
  driver core / ACPI: Avoid device hot remove locking issues
  cpufreq: governor: Fix typos in comments
  cpufreq: governors: Remove duplicate check of target freq in supported range
  cpufreq: Fix timer/workqueue corruption due to double queueing
  ACPI / EC: Add ASUSTEK L4R to quirk list in order to validate ECDT
  ACPI / thermal: Add check of "_TZD" availability and evaluating result
  cpufreq: imx6q: Fix clock enable balance
  ACPI: blacklist win8 OSI for buggy laptops
  cpufreq: tegra: fix the wrong clock name
  cpuidle: Change struct menu_device field types
  cpuidle: Add a comment warning about possible overflow
  cpuidle: Fix variable domains in get_typical_interval()
  cpuidle: Fix menu_device->intervals type
  cpuidle: CodingStyle: Break up multiple assignments on single line
  cpuidle: Check called function parameter in get_typical_interval()
  ...
2013-09-03 15:59:39 -07:00
Linus Torvalds
e09a1fa9be Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot fix from Peter Anvin:
 "A single very small boot fix for very large memory systems (> 0.5T)"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Fix boot crash with DEBUG_PAGE_ALLOC=y and more than 512G RAM
2013-09-02 09:55:14 -07:00
Rafael J. Wysocki
0c581415b5 Merge branch 'acpi-assorted'
* acpi-assorted:
  ACPI / osl: Kill macro INVALID_TABLE().
  earlycpio.c: Fix the confusing comment of find_cpio_data().
  ACPI / x86: Print Hot-Pluggable Field in SRAT.
  ACPI / thermal: Use THERMAL_TRIPS_NONE macro to replace number
  ACPI / thermal: Remove unused macros in the driver/acpi/thermal.c
  ACPI / thermal: Remove the unused lock of struct acpi_thermal
  ACPI / osl: Fix osi_setup_entries[] __initdata attribute location
  ACPI / numa: Fix __init attribute location in slit_valid()
  ACPI / dock: Fix __init attribute location in find_dock_and_bay()
  ACPI / Sleep: Fix incorrect placement of __initdata
  ACPI / processor: Fix incorrect placement of __initdata
  ACPI / EC: Fix incorrect placement of __initdata
  ACPI / scan: Drop unnecessary label from acpi_create_platform_device()
  ACPI: Move acpi_bus_get_device() from bus.c to scan.c
  ACPI / scan: Allow platform device creation without any IO resources
  ACPI: Cleanup sparse warning on acpi_os_initialize1()
  platform / thinkpad: Remove deprecated hotkey_report_mode parameter
  ACPI: Remove the old /proc/acpi/event interface
2013-08-27 01:29:04 +02:00
Radu Caragea
41aacc1eea x86 get_unmapped_area: Access mmap_legacy_base through mm_struct member
This is the updated version of df54d6fa54 ("x86 get_unmapped_area():
use proper mmap base for bottom-up direction") that only randomizes the
mmap base address once.

Signed-off-by: Radu Caragea <sinaelgl@gmail.com>
Reported-and-tested-by: Jeff Shorey <shoreyjeff@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Adrian Sendroiu <molecula2788@gmail.com>
Cc: Greg KH <greg@kroah.com>
Cc: Kamal Mostafa <kamal@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-22 10:19:35 -07:00
Linus Torvalds
5ea80f76a5 Revert "x86 get_unmapped_area(): use proper mmap base for bottom-up direction"
This reverts commit df54d6fa54.

The commit isn't necessarily wrong, but because it recalculates the
random mmap_base every time, it seems to confuse user memory allocators
that expect contiguous mmap allocations even when the mmap address isn't
specified.

In particular, the MATLAB Java runtime seems to be unhappy. See

  https://bugzilla.kernel.org/show_bug.cgi?id=60774

So we'll want to apply the random offset only once, and Radu has a patch
for that.  Revert this older commit in order to apply the other one.

Reported-by: Jeff Shorey <shoreyjeff@gmail.com>
Cc: Radu Caragea <sinaelgl@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-22 10:18:44 -07:00
Yinghai Lu
527bf129f9 x86/mm: Fix boot crash with DEBUG_PAGE_ALLOC=y and more than 512G RAM
Dave Hansen reported that systems between 500G and 600G RAM
crash early if DEBUG_PAGEALLOC is selected.

 > [    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
 > [    0.000000]  [mem 0x00000000-0x000fffff] page 4k
 > [    0.000000] BRK [0x02086000, 0x02086fff] PGTABLE
 > [    0.000000] BRK [0x02087000, 0x02087fff] PGTABLE
 > [    0.000000] BRK [0x02088000, 0x02088fff] PGTABLE
 > [    0.000000] init_memory_mapping: [mem 0xe80ee00000-0xe80effffff]
 > [    0.000000]  [mem 0xe80ee00000-0xe80effffff] page 4k
 > [    0.000000] BRK [0x02089000, 0x02089fff] PGTABLE
 > [    0.000000] BRK [0x0208a000, 0x0208afff] PGTABLE
 > [    0.000000] Kernel panic - not syncing: alloc_low_page: ran out of memory

It turns out that we missed increasing needed pages in BRK to
mapping initial 2M and [0,1M) when we switched to use the #PF
handler to set memory mappings:

 > commit 8170e6bed4
 > Author: H. Peter Anvin <hpa@zytor.com>
 > Date:   Thu Jan 24 12:19:52 2013 -0800
 >
 >     x86, 64bit: Use a #PF handler to materialize early mappings on demand

Before that, we had the maping from [0,512M) in head_64.S, and we
can spare two pages [0-1M).  After that change, we can not reuse
pages anymore.

When we have more than 512M ram, we need an extra page for pgd page
with [512G, 1024g).

Increase pages in BRK for page table to solve the boot crash.

Reported-by: Dave Hansen <dave.hansen@intel.com>
Bisected-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: <stable@vger.kernel.org> # v3.9 and later
Link: http://lkml.kernel.org/r/1376351004-4015-1-git-send-email-yinghai@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-08-20 10:06:50 +02:00
Tang Chen
d7b2c3d8fb ACPI / x86: Print Hot-Pluggable Field in SRAT.
The Hot-Pluggable field in SRAT suggests if the memory could be
hotplugged while the system is running. Print it as well when
parsing SRAT will help users to know which memory is hotpluggable.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-08-14 23:24:01 +02:00
Radu Caragea
df54d6fa54 x86 get_unmapped_area(): use proper mmap base for bottom-up direction
When the stack is set to unlimited, the bottomup direction is used for
mmap-ings but the mmap_base is not used and thus effectively renders
ASLR for mmapings along with PIE useless.

Cc: Michel Lespinasse <walken@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Adrian Sendroiu <molecula2788@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-13 17:57:49 -07:00
Jianguo Wu
d4f5228c01 mm: Remove unused variable idx0 in __early_ioremap()
After commit:

   8827247ffc ("x86: don't define __this_fixmap_does_not_exist()")

variable idx0 is no longer needed, so just remove it.

Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <linux-mm@kvack.org>
Cc: <wangchen@cn.fujitsu.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Link: http://lkml.kernel.org/r/5209A173.3090600@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-08-13 11:46:36 +02:00
Paul Gortmaker
148f9bb877 x86: delete __cpuinit usage from all x86 files
The __cpuinit type of throwaway sections might have made sense
some time ago when RAM was more constrained, but now the savings
do not offset the cost and complications.  For example, the fix in
commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
is a good example of the nasty type of bugs that can be created
with improper use of the various __init prefixes.

After a discussion on LKML[1] it was decided that cpuinit should go
the way of devinit and be phased out.  Once all the users are gone,
we can then finally remove the macros themselves from linux/init.h.

Note that some harmless section mismatch warnings may result, since
notify_cpu_starting() and cpu_up() are arch independent (kernel/cpu.c)
are flagged as __cpuinit  -- so if we remove the __cpuinit from
arch specific callers, we will also get section mismatch warnings.
As an intermediate step, we intend to turn the linux/init.h cpuinit
content into no-ops as early as possible, since that will get rid
of these warnings.  In any case, they are temporary and harmless.

This removes all the arch/x86 uses of the __cpuinit macros from
all C files.  x86 only had the one __CPUINIT used in assembly files,
and it wasn't paired off with a .previous or a __FINIT, so we can
delete it directly w/o any corresponding additional change there.

[1] https://lkml.org/lkml/2013/5/20/589

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2013-07-14 19:36:56 -04:00
Michel Lespinasse
98d1e64f95 mm: remove free_area_cache
Since all architectures have been converted to use vm_unmapped_area(),
there is no remaining use for the free_area_cache.

Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-10 18:11:34 -07:00
Wanpeng Li
73b44ff43c mm/pgtable: don't accumulate addr during pgd prepopulate pmd
The old codes accumulate addr to get right pmd, however, currently pmds
are preallocated and transfered as a parameter, there is unnecessary to
accumulate addr variable any more, this patch remove it.

Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09 10:33:23 -07:00
Jiang Liu
46a841329a mm/x86: prepare for removing num_physpages and simplify mem_init()
Prepare for removing num_physpages and simplify mem_init().

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:38 -07:00
Jiang Liu
0c98853473 mm: concentrate modification of totalram_pages into the mm core
Concentrate code to modify totalram_pages into the mm core, so the arch
memory initialized code doesn't need to take care of it.  With these
changes applied, only following functions from mm core modify global
variable totalram_pages: free_bootmem_late(), free_all_bootmem(),
free_all_bootmem_node(), adjust_managed_page_count().

With this patch applied, it will be much more easier for us to keep
totalram_pages and zone->managed_pages in consistence.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:33 -07:00
Jiang Liu
170a5a7eb2 mm: make __free_pages_bootmem() only available at boot time
In order to simpilify management of totalram_pages and
zone->managed_pages, make __free_pages_bootmem() only available at boot
time.  With this change applied, __free_pages_bootmem() will only be
used by bootmem.c and nobootmem.c at boot time, so mark it as __init.
Other callers of __free_pages_bootmem() have been converted to use
free_reserved_page(), which handles totalram_pages and
zone->managed_pages in a safer way.

This patch also fix a bug in free_pagetable() for x86_64, which should
increase zone->managed_pages instead of zone->present_pages when freeing
reserved pages.

And now we have managed_pages_count_lock to protect totalram_pages and
zone->managed_pages, so remove the redundant ppb_lock lock in
put_page_bootmem().  This greatly simplifies the locking rules.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:33 -07:00
Jiang Liu
7b4b2a0d6c mm: accurately calculate zone->managed_pages for highmem zones
Commit "mm: introduce new field 'managed_pages' to struct zone" assumes
that all highmem pages will be freed into the buddy system by function
mem_init().  But that's not always true, some architectures may reserve
some highmem pages during boot.  For example PPC may allocate highmem
pages for giagant HugeTLB pages, and several architectures have code to
check PageReserved flag to exclude highmem pages allocated during boot
when freeing highmem pages into the buddy system.

So treat highmem pages in the same way as normal pages, that is to:
1) reset zone->managed_pages to zero in mem_init().
2) recalculate managed_pages when freeing pages into the buddy system.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:33 -07:00
Jiang Liu
c88442ec45 mm/x86: use free_reserved_area() to simplify code
Use common help function free_reserved_area() to simplify code.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:33 -07:00
Linus Torvalds
1873e50028 Main features:
- KVM and Xen ports to AArch64
 - Hugetlbfs and transparent huge pages support for arm64
 - Applied Micro X-Gene Kconfig entry and dts file
 - Cache flushing improvements
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.9 (GNU/Linux)
 
 iQIcBAABAgAGBQJR0bZAAAoJEGvWsS0AyF7xTEEP/R/aRoqWwbVAMlwAhujq616O
 t4RzIyBXZXqxS9I+raokCX4mgYxdeisJlzN2hoq73VEX2BQlXZoYh8vmfY9WeNSM
 2pdfif2HF7oo9ymCRyqfuhbumPrTyJhpbguzOYrxPqpp2f1hv2D8hbUJEFj429yL
 UjqTFoONngfouZmAlwrPGZQKhBI95vvN53yvDMH0PWfvpm07DKGIQMYp20y0pj8j
 slhLH3bh2kfpS1cf23JtH6IICwWD2pXW0POo569CfZry6bI74xve+Trcsm7iPnsO
 PSI1P046ME1mu3SBbKwiPIdN/FQqWwTHW07fvMmH/xuXu3Zs/mxgzi7vDzDrVvTg
 PJSbKWD6N/IPPwKS/gCUmWWDASO0bXx3KlDuRZqAjbRojs0UPUOTUhzJM/BHUms1
 vY2QS9lAm02LmZZrk1LeKKP85gB+qKQvHuOVhIOldWeLGKtsNufz1kynz6YTqsLq
 uUB55KwbhQ7q8+aoY6lWujqiTXMoLkBgGdjHs2I407PAv7ZjlhRWk2fIry7xJifp
 rKu2cIlWsRe4CGvGI410NvIJFrGvJAV4wA43sgBDjPumyILgT/5jw9r3RpJEBZZs
 akw/Bl1CbL+gMjyoPUWgcWZdRkUCE0eLrgyMOmaYfst8cOTaWw4dWLvUG/bBZg+Y
 mGnuEQUQtAPadk8P/Sv3
 =PZ/e
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64

Pull ARM64 updates from Catalin Marinas:
 "Main features:
   - KVM and Xen ports to AArch64
   - Hugetlbfs and transparent huge pages support for arm64
   - Applied Micro X-Gene Kconfig entry and dts file
   - Cache flushing improvements

  For arm64 huge pages support, there are x86 changes moving part of
  arch/x86/mm/hugetlbpage.c into mm/hugetlb.c to be re-used by arm64"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64: (66 commits)
  arm64: Add initial DTS for APM X-Gene Storm SOC and APM Mustang board
  arm64: Add defines for APM ARMv8 implementation
  arm64: Enable APM X-Gene SOC family in the defconfig
  arm64: Add Kconfig option for APM X-Gene SOC family
  arm64/Makefile: provide vdso_install target
  ARM64: mm: THP support.
  ARM64: mm: Raise MAX_ORDER for 64KB pages and THP.
  ARM64: mm: HugeTLB support.
  ARM64: mm: Move PTE_PROT_NONE bit.
  ARM64: mm: Make PAGE_NONE pages read only and no-execute.
  ARM64: mm: Restore memblock limit when map_mem finished.
  mm: thp: Correct the HPAGE_PMD_ORDER check.
  x86: mm: Remove general hugetlb code from x86.
  mm: hugetlb: Copy general hugetlb code from x86 to mm.
  x86: mm: Remove x86 version of huge_pmd_share.
  mm: hugetlb: Copy huge_pmd_share from x86 to mm.
  arm64: KVM: document kernel object mappings in HYP
  arm64: KVM: MAINTAINERS update
  arm64: KVM: userspace API documentation
  arm64: KVM: enable initialization of a 32bit vcpu
  ...
2013-07-03 10:31:38 -07:00
Linus Torvalds
1982269a5c Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm changes from Ingo Molnar:
 "Misc improvements:

   - Fix /proc/mtrr reporting
   - Fix ioremap printout
   - Remove the unused pvclock fixmap entry on 32-bit
   - misc cleanups"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/ioremap: Correct function name output
  x86: Fix /proc/mtrr with base/size more than 44bits
  ix86: Don't waste fixmap entries
  x86/mm: Drop unneeded include <asm/*pgtable, page*_types.h>
  x86_64: Correct phys_addr in cleanup_highmap comment
2013-07-02 16:29:05 -07:00
Borislav Petkov
4f4319a02a x86/ioremap: Correct function name output
Infact, let the compiler enter the function name so that there
are no discrepancies.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1372369996-20556-1-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-28 11:11:58 +02:00
Steve Capper
53313b2c91 x86: mm: Remove general hugetlb code from x86.
huge_pte_alloc, huge_pte_offset and follow_huge_p[mu]d have
already been copied over to mm.

This patch removes the x86 copies of these functions and activates
the general ones by enabling:
CONFIG_ARCH_WANT_GENERAL_HUGETLB

Signed-off-by: Steve Capper <steve.capper@linaro.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
2013-06-14 09:40:15 +01:00
Steve Capper
cfe28c5d63 x86: mm: Remove x86 version of huge_pmd_share.
The huge_pmd_share code has been copied over to mm/hugetlb.c to
make it accessible to other architectures.

Remove the x86 copy of the huge_pmd_share code and enable the
ARCH_WANT_HUGE_PMD_SHARE config flag. That way we reference the
general one.

Signed-off-by: Steve Capper <steve.capper@linaro.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
2013-06-14 09:39:46 +01:00
Yinghai Lu
7de3d66b13 x86: Fix adjust_range_size_mask calling position
Commit

    8d57470d x86, mm: setup page table in top-down

causes a kernel panic while setting mem=2G.

     [mem 0x00000000-0x000fffff] page 4k
     [mem 0x7fe00000-0x7fffffff] page 1G
     [mem 0x7c000000-0x7fdfffff] page 1G
     [mem 0x00100000-0x001fffff] page 4k
     [mem 0x00200000-0x7bffffff] page 2M

for last entry is not what we want, we should have
     [mem 0x00200000-0x3fffffff] page 2M
     [mem 0x40000000-0x7bffffff] page 1G

Actually we merge the continuous ranges with same page size too early.
in this case, before merging we have
     [mem 0x00200000-0x3fffffff] page 2M
     [mem 0x40000000-0x7bffffff] page 2M
after merging them, will get
     [mem 0x00200000-0x7bffffff] page 2M
even we can use 1G page to map
     [mem 0x40000000-0x7bffffff]

that will cause problem, because we already map
     [mem 0x7fe00000-0x7fffffff] page 1G
     [mem 0x7c000000-0x7fdfffff] page 1G
with 1G page, aka [0x40000000-0x7fffffff] is mapped with 1G page already.
During phys_pud_init() for [0x40000000-0x7bffffff], it will not
reuse existing that pud page, and allocate new one then try to use
2M page to map it instead, as page_size_mask does not include
PG_LEVEL_1G. At end will have [7c000000-0x7fffffff] not mapped, loop
in phys_pmd_init stop mapping at 0x7bffffff.

That is right behavoir, it maps exact range with exact page size that
we ask, and we should explicitly call it to map [7c000000-0x7fffffff]
before or after mapping 0x40000000-0x7bffffff.
Anyway we need to make sure ranges' page_size_mask correct and consistent
after split_mem_range for each range.

Fix that by calling adjust_range_size_mask before merging range
with same page size.

-v2: update change log.
-v3: add more explanation why [7c000000-0x7fffffff] is not mapped, and
    it causes panic.

Bisected-by: "Xie, ChanglongX" <changlongx.xie@intel.com>
Bisected-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reported-and-tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1370015587-20835-1-git-send-email-yinghai@kernel.org
Cc: <stable@vger.kernel.org> v3.9
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-05-31 13:21:32 -07:00
Zhang Yanfei
1e3b308181 x86_64: Correct phys_addr in cleanup_highmap comment
For x86_64, we have phys_base, which means the delta between the
the address kernel is actually running at and the address kernel
is compiled to run at. Not phys_addr so correct it.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Link: http://lkml.kernel.org/r/5192F9BF.2000802@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-28 11:41:24 +02:00
Zhang Yanfei
cf8b166d5c x86/mm: Add missing comments for initial kernel direct mapping
Two sets of comments were lost during patch-series shuffling:

  - comments for init_range_memory_mapping()

  - comments in init_mem_mapping that is helpful for reminding people
    that the pagetable is setup top-down

The comments were written by Yinghai in his patch in:

   https://lkml.org/lkml/2012/11/28/620

This patch reintroduces them.

Originally-From: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/518BC776.7010506@gmail.com
[ Tidied it all up a bit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-10 12:00:35 +02:00
Linus Torvalds
20b4fb4852 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS updates from Al Viro,

Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).

7kloc removed.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
  don't bother with deferred freeing of fdtables
  proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
  proc: Make the PROC_I() and PDE() macros internal to procfs
  proc: Supply a function to remove a proc entry by PDE
  take cgroup_open() and cpuset_open() to fs/proc/base.c
  ppc: Clean up scanlog
  ppc: Clean up rtas_flash driver somewhat
  hostap: proc: Use remove_proc_subtree()
  drm: proc: Use remove_proc_subtree()
  drm: proc: Use minor->index to label things, not PDE->name
  drm: Constify drm_proc_list[]
  zoran: Don't print proc_dir_entry data in debug
  reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
  proc: Supply an accessor for getting the data from a PDE's parent
  airo: Use remove_proc_subtree()
  rtl8192u: Don't need to save device proc dir PDE
  rtl8187se: Use a dir under /proc/net/r8180/
  proc: Add proc_mkdir_data()
  proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
  proc: Move PDE_NET() to fs/proc/proc_net.c
  ...
2013-05-01 17:51:54 -07:00
Linus Torvalds
f9b3bcfbc4 Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm changes from Ingo Molnar:
 "Misc smaller changes all over the map"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/iommu/dmar: Remove warning for HPET scope type
  x86/mm/gart: Drop unnecessary check
  x86/mm/hotplug: Put kernel_physical_mapping_remove() declaration in CONFIG_MEMORY_HOTREMOVE
  x86/mm/fixmap: Remove unused FIX_CYCLONE_TIMER
  x86/mm/numa: Simplify some bit mangling
  x86/mm: Re-enable DEBUG_TLBFLUSH for X86_32
  x86/mm/cpa: Cleanup split_large_page() and its callee
  x86: Drop always empty .text..page_aligned section
2013-04-30 08:40:35 -07:00
Linus Torvalds
df8edfa9af Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpuid changes from Ingo Molnar:
 "The biggest change is x86 CPU bug handling refactoring and cleanups,
  by Borislav Petkov"

* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, CPU, AMD: Drop useless label
  x86, AMD: Correct {rd,wr}msr_amd_safe warnings
  x86: Fold-in trivial check_config function
  x86, cpu: Convert AMD Erratum 400
  x86, cpu: Convert AMD Erratum 383
  x86, cpu: Convert Cyrix coma bug detection
  x86, cpu: Convert FDIV bug detection
  x86, cpu: Convert F00F bug detection
  x86, cpu: Expand cpufeature facility to include cpu bugs
2013-04-30 08:34:38 -07:00
Linus Torvalds
16fa94b532 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "The main changes in this development cycle were:

   - full dynticks preparatory work by Frederic Weisbecker

   - factor out the cpu time accounting code better, by Li Zefan

   - multi-CPU load balancer cleanups and improvements by Joonsoo Kim

   - various smaller fixes and cleanups"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
  sched: Fix init NOHZ_IDLE flag
  sched: Prevent to re-select dst-cpu in load_balance()
  sched: Rename load_balance_tmpmask to load_balance_mask
  sched: Move up affinity check to mitigate useless redoing overhead
  sched: Don't consider other cpus in our group in case of NEWLY_IDLE
  sched: Explicitly cpu_idle_type checking in rebalance_domains()
  sched: Change position of resched_cpu() in load_balance()
  sched: Fix wrong rq's runnable_avg update with rt tasks
  sched: Document task_struct::personality field
  sched/cpuacct/UML: Fix header file dependency bug on the UML build
  cgroup: Kill subsys.active flag
  sched/cpuacct: No need to check subsys active state
  sched/cpuacct: Initialize cpuacct subsystem earlier
  sched/cpuacct: Initialize root cpuacct earlier
  sched/cpuacct: Allocate per_cpu cpuusage for root cpuacct statically
  sched/cpuacct: Clean up cpuacct.h
  sched/cpuacct: Remove redundant NULL checks in cpuacct_acount_field()
  sched/cpuacct: Remove redundant NULL checks in cpuacct_charge()
  sched/cpuacct: Add cpuacct_acount_field()
  sched/cpuacct: Add cpuacct_init()
  ...
2013-04-30 07:43:28 -07:00
Akinobu Mita
56a1ba5fd1 x86: rename random32() to prandom_u32()
Use preferable function name which implies using a pseudo-random
number generator.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 18:28:42 -07:00
Akinobu Mita
4a7b4d2360 x86: pageattr-test: remove srandom32 call
pageattr-test calls srandom32() once every test iteration.  But calling
srandom32() after late_initcalls is not meaningfull.  Because the random
states for random32() is mixed by good random numbers in late_initcall
prandom_reseed().

So this removes the call to srandom32().

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 18:28:42 -07:00
Cody P Schafer
d2ad351e36 x86/mm/numa: use setup_nr_node_ids() instead of opencoding.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:36 -07:00
Johannes Weiner
8e2cdbcb86 x86-64: fall back to regular page vmemmap on allocation failure
Memory hotplug can happen on a machine under load, memory shortness
and fragmentation, so huge page allocations for the vmemmap are not
guaranteed to succeed.

Try to fall back to regular pages before failing the hotplug event
completely.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Bernhard Schmidt <Bernhard.Schmidt@lrz.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:35 -07:00
Johannes Weiner
e8216da5c7 x86-64: use vmemmap_populate_basepages() for !pse setups
We already have generic code to allocate vmemmap with regular pages, use
it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Bernhard Schmidt <Bernhard.Schmidt@lrz.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David Miller <davem@davemloft.net>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:35 -07:00
Johannes Weiner
6c7a2ca4c1 x86-64: remove dead debugging code for !pse setups
No need to maintain addr_end and p_end when they are never actually read
anywhere on !pse setups.  Remove the dead code.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Bernhard Schmidt <Bernhard.Schmidt@lrz.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:35 -07:00
Johannes Weiner
0aad818b2d sparse-vmemmap: specify vmemmap population range in bytes
The sparse code, when asking the architecture to populate the vmemmap,
specifies the section range as a starting page and a number of pages.

This is an awkward interface, because none of the arch-specific code
actually thinks of the range in terms of 'struct page' units and always
translates it to bytes first.

In addition, later patches mix huge page and regular page backing for
the vmemmap.  For this, they need to call vmemmap_populate_basepages()
on sub-section ranges with PAGE_SIZE and PMD_SIZE in mind.  But these
are not necessarily multiples of the 'struct page' size and so this unit
is too coarse.

Just translate the section range into bytes once in the generic sparse
code, then pass byte ranges down the stack.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Bernhard Schmidt <Bernhard.Schmidt@lrz.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: David S. Miller <davem@davemloft.net>
Tested-by: David S. Miller <davem@davemloft.net>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:35 -07:00
Joonsoo Kim
ef93247325 mm, vmalloc: change iterating a vmlist to find_vm_area()
This patchset removes vm_struct list management after initializing
vmalloc.  Adding and removing an entry to vmlist is linear time
complexity, so it is inefficient.  If we maintain this list, overall
time complexity of adding and removing area to vmalloc space is O(N),
although we use rbtree for finding vacant place and it's time complexity
is just O(logN).

And vmlist and vmlist_lock is used many places of outside of vmalloc.c.
It is preferable that we hide this raw data structure and provide
well-defined function for supporting them, because it makes that they
cannot mistake when manipulating theses structure and it makes us easily
maintain vmalloc layer.

For kexec and makedumpfile, I export vmap_area_list, instead of vmlist.
This comes from Atsushi's recommendation.  For more information, please
refer below link.  https://lkml.org/lkml/2012/12/6/184

This patch:

The purpose of iterating a vmlist is finding vm area with specific virtual
address.  find_vm_area() is provided for this purpose and more efficient,
because it uses a rbtree.  So change it.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Guan Xuetao <gxt@mprc.pku.edu.cn>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Chris Metcalf <cmetcalf@tilera.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:33 -07:00
Jiang Liu
5e7ccf8635 mm/x86: use free_highmem_page() to free highmem pages into buddy system
Use helper function free_highmem_page() to free highmem pages into
the buddy system.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Cong Wang <amwang@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Attilio Rao <attilio.rao@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:32 -07:00
Jiang Liu
bced0e32f6 mm/x86: use common help functions to free reserved pages
Use common help functions to free reserved pages.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:31 -07:00
David Howells
2f96b8c1d5 proc: Split kcore bits from linux/procfs.h into linux/kcore.h
Split kcore bits from linux/procfs.h into linux/kcore.h.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
cc: linux-mips@linux-mips.org
cc: sparclinux@vger.kernel.org
cc: x86@kernel.org
cc: linux-mm@kvack.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-29 15:42:02 -04:00
Tang Chen
587ff8c4ea x86/mm/hotplug: Put kernel_physical_mapping_remove() declaration in CONFIG_MEMORY_HOTREMOVE
kernel_physical_mapping_remove() is only called by
arch_remove_memory() in init_64.c, which is enclosed in
CONFIG_MEMORY_HOTREMOVE. So when we don't configure
CONFIG_MEMORY_HOTREMOVE, the compiler will give a warning:

	warning: ‘kernel_physical_mapping_remove’ defined but not used

So put kernel_physical_mapping_remove() in
CONFIG_MEMORY_HOTREMOVE.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: linux-mm@kvack.org
Cc: gregkh@linuxfoundation.org
Cc: yinghai@kernel.org
Cc: wency@cn.fujitsu.com
Cc: mgorman@suse.de
Cc: tj@kernel.org
Cc: liwanp@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1366019207-27818-3-git-send-email-tangchen@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-15 12:03:24 +02:00
Linus Torvalds
6c4c4d4bda Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "Misc fixes"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Flush lazy MMU when DEBUG_PAGEALLOC is set
  x86/mm/cpa/selftest: Fix false positive in CPA self test
  x86/mm/cpa: Convert noop to functional fix
  x86, mm: Patch out arch_flush_lazy_mmu_mode() when running on bare metal
  x86, mm, paravirt: Fix vmalloc_fault oops during lazy MMU updates
2013-04-14 11:13:24 -07:00
Dave Hansen
1de14c3c5c x86-32: Fix possible incomplete TLB invalidate with PAE pagetables
This patch attempts to fix:

	https://bugzilla.kernel.org/show_bug.cgi?id=56461

The symptom is a crash and messages like this:

	chrome: Corrupted page table at address 34a03000
	*pdpt = 0000000000000000 *pde = 0000000000000000
	Bad pagetable: 000f [#1] PREEMPT SMP

Ingo guesses this got introduced by commit 611ae8e3f5 ("x86/tlb:
enable tlb flush range support for x86") since that code started to free
unused pagetables.

On x86-32 PAE kernels, that new code has the potential to free an entire
PMD page and will clear one of the four page-directory-pointer-table
(aka pgd_t entries).

The hardware aggressively "caches" these top-level entries and invlpg
does not actually affect the CPU's copy.  If we clear one we *HAVE* to
do a full TLB flush, otherwise we might continue using a freed pmd page.
(note, we do this properly on the population side in pud_populate()).

This patch tracks whenever we clear one of these entries in the 'struct
mmu_gather', and ensures that we follow up with a full tlb flush.

BTW, I disassembled and checked that:

	if (tlb->fullmm == 0)
and
	if (!tlb->fullmm && !tlb->need_flush_all)

generate essentially the same code, so there should be zero impact there
to the !PAE case.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Artem S Tashkinov <t.artem@mailcity.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-12 16:56:47 -07:00
Boris Ostrovsky
26564600c9 x86/mm: Flush lazy MMU when DEBUG_PAGEALLOC is set
When CONFIG_DEBUG_PAGEALLOC is set page table updates made by
kernel_map_pages() are not made visible (via TLB flush)
immediately if lazy MMU is on. In environments that support lazy
MMU (e.g. Xen) this may lead to fatal page faults, for example,
when zap_pte_range() needs to allocate pages in
__tlb_remove_page() -> tlb_next_batch().

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: konrad.wilk@oracle.com
Link: http://lkml.kernel.org/r/1365703192-2089-1-git-send-email-boris.ostrovsky@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-12 07:19:19 +02:00
Andrea Arcangeli
18699739b6 x86/mm/cpa/selftest: Fix false positive in CPA self test
If the pmd is not present, _PAGE_PSE will not be set anymore.
Fix the false positive.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Stefan Bader <stefan.bader@canonical.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/1365687369-30802-1-git-send-email-aarcange@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-12 06:39:20 +02:00
Andrea Arcangeli
f76cfa3c24 x86/mm/cpa: Convert noop to functional fix
Commit:

  a8aed3e075 ("x86/mm/pageattr: Prevent PSE and GLOABL leftovers to confuse pmd/pte_present and pmd_huge")

introduced a valid fix but one location that didn't trigger the bug that
lead to finding those (small) problems, wasn't updated using the
right variable.

The wrong variable was also initialized for no good reason, that
may have been the source of the confusion. Remove the noop
initialization accordingly.

Commit a8aed3e075 also erroneously removed one canon_pgprot pass meant
to clear pmd bitflags not supported in hardware by older CPUs, that
automatically gets corrected by this patch too by applying it to the right
variable in the new location.

Reported-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Mel Gorman <mgorman@suse.de>
Link: http://lkml.kernel.org/r/1365600505-19314-1-git-send-email-aarcange@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-11 10:34:42 +02:00
Samu Kallio
1160c2779b x86, mm, paravirt: Fix vmalloc_fault oops during lazy MMU updates
In paravirtualized x86_64 kernels, vmalloc_fault may cause an oops
when lazy MMU updates are enabled, because set_pgd effects are being
deferred.

One instance of this problem is during process mm cleanup with memory
cgroups enabled. The chain of events is as follows:

- zap_pte_range enables lazy MMU updates
- zap_pte_range eventually calls mem_cgroup_charge_statistics,
  which accesses the vmalloc'd mem_cgroup per-cpu stat area
- vmalloc_fault is triggered which tries to sync the corresponding
  PGD entry with set_pgd, but the update is deferred
- vmalloc_fault oopses due to a mismatch in the PUD entries

The OOPs usually looks as so:

------------[ cut here ]------------
kernel BUG at arch/x86/mm/fault.c:396!
invalid opcode: 0000 [#1] SMP
.. snip ..
CPU 1
Pid: 10866, comm: httpd Not tainted 3.6.10-4.fc18.x86_64 #1
RIP: e030:[<ffffffff816271bf>]  [<ffffffff816271bf>] vmalloc_fault+0x11f/0x208
.. snip ..
Call Trace:
 [<ffffffff81627759>] do_page_fault+0x399/0x4b0
 [<ffffffff81004f4c>] ? xen_mc_extend_args+0xec/0x110
 [<ffffffff81624065>] page_fault+0x25/0x30
 [<ffffffff81184d03>] ? mem_cgroup_charge_statistics.isra.13+0x13/0x50
 [<ffffffff81186f78>] __mem_cgroup_uncharge_common+0xd8/0x350
 [<ffffffff8118aac7>] mem_cgroup_uncharge_page+0x57/0x60
 [<ffffffff8115fbc0>] page_remove_rmap+0xe0/0x150
 [<ffffffff8115311a>] ? vm_normal_page+0x1a/0x80
 [<ffffffff81153e61>] unmap_single_vma+0x531/0x870
 [<ffffffff81154962>] unmap_vmas+0x52/0xa0
 [<ffffffff81007442>] ? pte_mfn_to_pfn+0x72/0x100
 [<ffffffff8115c8f8>] exit_mmap+0x98/0x170
 [<ffffffff810050d9>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
 [<ffffffff81059ce3>] mmput+0x83/0xf0
 [<ffffffff810624c4>] exit_mm+0x104/0x130
 [<ffffffff8106264a>] do_exit+0x15a/0x8c0
 [<ffffffff810630ff>] do_group_exit+0x3f/0xa0
 [<ffffffff81063177>] sys_exit_group+0x17/0x20
 [<ffffffff8162bae9>] system_call_fastpath+0x16/0x1b

Calling arch_flush_lazy_mmu_mode immediately after set_pgd makes the
changes visible to the consistency checks.

Cc: <stable@vger.kernel.org>
RedHat-Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=914737
Tested-by: Josh Boyer <jwboyer@redhat.com>
Reported-and-Tested-by: Krishna Raman <kraman@redhat.com>
Signed-off-by: Samu Kallio <samu.kallio@aberdeencloud.com>
Link: http://lkml.kernel.org/r/1364045796-10720-1-git-send-email-konrad.wilk@oracle.com
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-04-10 11:25:07 -07:00
Martin Bundgaard
7e9a2f0a08 x86/mm/numa: Simplify some bit mangling
Minor. Reordered a few lines to lose a superfluous OR operation.

Signed-off-by: Martin Bundgaard <martin@mindflux.org>
Link: http://lkml.kernel.org/r/1363286075-62615-1-git-send-email-martin@mindflux.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-10 19:06:26 +02:00
Borislav Petkov
5952886bfe x86/mm/cpa: Cleanup split_large_page() and its callee
So basically we're generating the pte_t * from a struct page and
we're handing it down to the __split_large_page() internal version
which then goes and gets back struct page * from it because it
needs it.

Change the caller to hand down struct page * directly and the
callee can compute the pte_t itself.

Net save is one virt_to_page() call and simpler code. While at
it, make __split_large_page() static.

Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1363886217-24703-1-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-10 14:39:08 +02:00
Borislav Petkov
e2604b49e8 x86, cpu: Convert F00F bug detection
... to using the new facility and drop the cpuinfo_x86 member.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1363788448-31325-3-git-send-email-bp@alien8.de
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2013-04-02 10:12:52 -07:00
Dave Hansen
60f583d56a x86: Do not try to sync identity map for non-mapped pages
kernel_map_sync_memtype() is called from a variety of contexts.  The
pat.c code that calls it seems to ensure that it is not called for
non-ram areas by checking via pat_pagerange_is_ram().  It is important
that it only be called on the actual identity map because there *IS*
no map to sync for highmem pages, or for memory holes.

The ioremap.c uses are not as careful as those from pat.c, and call
kernel_map_sync_memtype() on PCI space which is in the middle of the
kernel identity map _range_, but is not actually mapped.

This patch adds a check to kernel_map_sync_memtype() which probably
duplicates some of the checks already in pat.c.  But, it is necessary
for the ioremap.c uses and shouldn't hurt other callers.

I have reproduced this bug and this patch fixes it for me and the
original bug reporter:

	https://lkml.org/lkml/2013/2/5/396

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20130307163151.D9B58C4E@kernel.stglabs.ibm.com
Signed-off-by: Dave Hansen <dave@sr71.net>
Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2013-03-07 13:23:28 -08:00
Frederic Weisbecker
6c1e0256fa context_tracking: Restore correct previous context state on exception exit
On exception exit, we restore the previous context tracking state based on
the regs of the interrupted frame. Iff that frame is in user mode as
stated by user_mode() helper, we restore the context tracking user mode.

However there is a tiny chunck of low level arch code after we pass through
user_enter() and until the CPU eventually resumes userspace.
If an exception happens in this tiny area, exception_enter() correctly
exits the context tracking user mode but exception_exit() won't restore
it because of the value returned by user_mode(regs).

As a result we may return to userspace with the wrong context tracking
state.

To fix this, change exception_enter() to return the context tracking state
prior to its call and pass this saved state to exception_exit(). This restores
the real context tracking state of the interrupted frame.

(May be this patch was suggested to me, I don't recall exactly. If so,
sorry for the missing credit).

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Mats Liljegren <mats.liljegren@enea.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-03-07 17:10:11 +01:00
Frederic Weisbecker
56dd9470d7 context_tracking: Move exception handling to generic code
Exceptions handling on context tracking should share common
treatment: on entry we exit user mode if the exception triggered
in that context. Then on exception exit we return to that previous
context.

Generalize this to avoid duplication across archs.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Mats Liljegren <mats.liljegren@enea.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-03-07 17:09:25 +01:00
Yinghai Lu
98e7a98997 x86, mm: Make sure to find a 2M free block for the first mapped area
Henrik reported that his MacAir 3.1 would not boot with

| commit 8d57470d8f
| Date:   Fri Nov 16 19:38:58 2012 -0800
|
|    x86, mm: setup page table in top-down

It turns out that we do not calculate the real_end properly:
We try to get 2M size with 4K alignment, and later will round down
to 2M, so we will get less then 2M for first mapping, in extreme
case could be only 4K only. In Henrik's system it has (1M-32K) as
last usable rage is [mem 0x7f9db000-0x7fef8fff].

The problem is exposed when EFI booting have several holes and it
will force mapping to use PTE instead as we only map usable areas.

To fix it, just make it be 2M aligned, so we can be guaranteed to be
able to use large pages to map it.

Reported-by: Henrik Rydberg <rydberg@euromail.se>
Bisected-by: Henrik Rydberg <rydberg@euromail.se>
Tested-by: Henrik Rydberg <rydberg@euromail.se>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/CAE9FiQX4nQ7_1kg5RL_vh56rmcSHXUi1ExrZX7CwED4NGMnHfg@mail.gmail.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2013-03-06 20:18:32 -08:00
Yinghai Lu
20e6926dcb x86, ACPI, mm: Revert movablemem_map support
Tim found:

  WARNING: at arch/x86/kernel/smpboot.c:324 topology_sane.isra.2+0x6f/0x80()
  Hardware name: S2600CP
  sched: CPU #1's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
  smpboot: Booting Node   1, Processors  #1
  Modules linked in:
  Pid: 0, comm: swapper/1 Not tainted 3.9.0-0-generic #1
  Call Trace:
    set_cpu_sibling_map+0x279/0x449
    start_secondary+0x11d/0x1e5

Don Morris reproduced on a HP z620 workstation, and bisected it to
commit e8d1955258 ("acpi, memory-hotplug: parse SRAT before memblock
is ready")

It turns out movable_map has some problems, and it breaks several things

1. numa_init is called several times, NOT just for srat. so those
	nodes_clear(numa_nodes_parsed)
	memset(&numa_meminfo, 0, sizeof(numa_meminfo))
   can not be just removed.  Need to consider sequence is: numaq, srat, amd, dummy.
   and make fall back path working.

2. simply split acpi_numa_init to early_parse_srat.
   a. that early_parse_srat is NOT called for ia64, so you break ia64.
   b.  for (i = 0; i < MAX_LOCAL_APIC; i++)
	     set_apicid_to_node(i, NUMA_NO_NODE)
     still left in numa_init. So it will just clear result from early_parse_srat.
     it should be moved before that....
   c.  it breaks ACPI_TABLE_OVERIDE...as the acpi table scan is moved
       early before override from INITRD is settled.

3. that patch TITLE is total misleading, there is NO x86 in the title,
   but it changes critical x86 code. It caused x86 guys did not
   pay attention to find the problem early. Those patches really should
   be routed via tip/x86/mm.

4. after that commit, following range can not use movable ram:
  a. real_mode code.... well..funny, legacy Node0 [0,1M) could be hot-removed?
  b. initrd... it will be freed after booting, so it could be on movable...
  c. crashkernel for kdump...: looks like we can not put kdump kernel above 4G
	anymore.
  d. init_mem_mapping: can not put page table high anymore.
  e. initmem_init: vmemmap can not be high local node anymore. That is
     not good.

If node is hotplugable, the mem related range like page table and
vmemmap could be on the that node without problem and should be on that
node.

We have workaround patch that could fix some problems, but some can not
be fixed.

So just remove that offending commit and related ones including:

 f7210e6c4a ("mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to
    protect movablecore_map in memblock_overlaps_region().")

 01a178a94e ("acpi, memory-hotplug: support getting hotplug info from
    SRAT")

 27168d38fa ("acpi, memory-hotplug: extend movablemem_map ranges to
    the end of node")

 e8d1955258 ("acpi, memory-hotplug: parse SRAT before memblock is
    ready")

 fb06bc8e5f ("page_alloc: bootmem limit with movablecore_map")

 42f47e27e7 ("page_alloc: make movablemem_map have higher priority")

 6981ec3114 ("page_alloc: introduce zone_movable_limit[] to keep
    movable limit for nodes")

 34b71f1e04 ("page_alloc: add movable_memmap kernel parameter")

 4d59a75125 ("x86: get pg_data_t's memory from other node")

Later we should have patches that will make sure kernel put page table
and vmemmap on local node ram instead of push them down to node0.  Also
need to find way to put other kernel used ram to local node ram.

Reported-by: Tim Gardner <tim.gardner@canonical.com>
Reported-by: Don Morris <don.morris@hp.com>
Bisected-by: Don Morris <don.morris@hp.com>
Tested-by: Don Morris <don.morris@hp.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-02 09:34:39 -08:00
Linus Torvalds
2003cd90c4 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar.

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm/pageattr: Prevent PSE and GLOABL leftovers to confuse pmd/pte_present and pmd_huge
  Revert "x86, mm: Make spurious_fault check explicitly check explicitly check the PRESENT bit"
  x86/mm/numa: Don't check if node is NUMA_NO_NODE
  x86, efi: Make "noefi" really disable EFI runtime serivces
  x86/apic: Fix parsing of the 'lapic' cmdline option
2013-02-26 19:45:29 -08:00
Andrea Arcangeli
a8aed3e075 x86/mm/pageattr: Prevent PSE and GLOABL leftovers to confuse pmd/pte_present and pmd_huge
Without this patch any kernel code that reads kernel memory in
non present kernel pte/pmds (as set by pageattr.c) will crash.

With this kernel code:

static struct page *crash_page;
static unsigned long *crash_address;
[..]
	crash_page = alloc_pages(GFP_KERNEL, 9);
	crash_address = page_address(crash_page);
	if (set_memory_np((unsigned long)crash_address, 1))
		printk("set_memory_np failure\n");
[..]

The kernel will crash if inside the "crash tool" one would try
to read the memory at the not present address.

crash> p crash_address
crash_address = $8 = (long unsigned int *) 0xffff88023c000000
crash> rd 0xffff88023c000000
[ *lockup* ]

The lockup happens because _PAGE_GLOBAL and _PAGE_PROTNONE
shares the same bit, and pageattr leaves _PAGE_GLOBAL set on a
kernel pte which is then mistaken as _PAGE_PROTNONE (so
pte_present returns true by mistake and the kernel fault then
gets confused and loops).

With THP the same can happen after we taught pmd_present to
check _PAGE_PROTNONE and _PAGE_PSE in commit
027ef6c878 ("mm: thp: fix pmd_present for
split_huge_page and PROT_NONE with THP").  THP has the same
problem with _PAGE_GLOBAL as the 4k pages, but it also has a
problem with _PAGE_PSE, which must be cleared too.

After the patch is applied copy_user correctly returns -EFAULT
and doesn't lockup anymore.

crash> p crash_address
crash_address = $9 = (long unsigned int *) 0xffff88023c000000
crash> rd 0xffff88023c000000
rd: read error: kernel virtual address: ffff88023c000000  type:
"64-bit KVADDR"

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-24 13:01:45 +01:00