Commit Graph

107 Commits

Author SHA1 Message Date
Linus Torvalds
5fbe4c224c Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
 "This contains:

   - EFI fixes
   - a boot printout fix
   - ASLR/kASLR fixes
   - intel microcode driver fixes
   - other misc fixes

  Most of the linecount comes from an EFI revert"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm/ASLR: Avoid PAGE_SIZE redefinition for UML subarch
  x86/microcode/intel: Handle truncated microcode images more robustly
  x86/microcode/intel: Guard against stack overflow in the loader
  x86, mm/ASLR: Fix stack randomization on 64-bit systems
  x86/mm/init: Fix incorrect page size in init_memory_mapping() printks
  x86/mm/ASLR: Propagate base load address calculation
  Documentation/x86: Fix path in zero-page.txt
  x86/apic: Fix the devicetree build in certain configs
  Revert "efi/libstub: Call get_memory_map() to obtain map and desc sizes"
  x86/efi: Avoid triple faults during EFI mixed mode calls
2015-02-21 10:41:29 -08:00
Dave Hansen
f15e05186c x86/mm/init: Fix incorrect page size in init_memory_mapping() printks
With 32-bit non-PAE kernels, we have 2 page sizes available
(at most): 4k and 4M.

Enabling PAE replaces that 4M size with a 2M one (which 64-bit
systems use too).

But, when booting a 32-bit non-PAE kernel, in one of our
early-boot printouts, we say:

  init_memory_mapping: [mem 0x00000000-0x000fffff]
   [mem 0x00000000-0x000fffff] page 4k
  init_memory_mapping: [mem 0x37000000-0x373fffff]
   [mem 0x37000000-0x373fffff] page 2M
  init_memory_mapping: [mem 0x00100000-0x36ffffff]
   [mem 0x00100000-0x003fffff] page 4k
   [mem 0x00400000-0x36ffffff] page 2M
  init_memory_mapping: [mem 0x37400000-0x377fdfff]
   [mem 0x37400000-0x377fdfff] page 4k

Which is obviously wrong.  There is no 2M page available.  This
is probably because of a badly-named variable: in the map_range
code: PG_LEVEL_2M.

Instead of renaming all the PG_LEVEL_2M's.  This patch just
fixes the printout:

  init_memory_mapping: [mem 0x00000000-0x000fffff]
   [mem 0x00000000-0x000fffff] page 4k
  init_memory_mapping: [mem 0x37000000-0x373fffff]
   [mem 0x37000000-0x373fffff] page 4M
  init_memory_mapping: [mem 0x00100000-0x36ffffff]
   [mem 0x00100000-0x003fffff] page 4k
   [mem 0x00400000-0x36ffffff] page 4M
  init_memory_mapping: [mem 0x37400000-0x377fdfff]
   [mem 0x37400000-0x377fdfff] page 4k
  BRK [0x03206000, 0x03206fff] PGTABLE

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/20150210212030.665EC267@viggo.jf.intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2015-02-19 11:45:27 +01:00
Linus Torvalds
37507717de Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf updates from Ingo Molnar:
 "This series tightens up RDPMC permissions: currently even highly
  sandboxed x86 execution environments (such as seccomp) have permission
  to execute RDPMC, which may leak various perf events / PMU state such
  as timing information and other CPU execution details.

  This 'all is allowed' RDPMC mode is still preserved as the
  (non-default) /sys/devices/cpu/rdpmc=2 setting.  The new default is
  that RDPMC access is only allowed if a perf event is mmap-ed (which is
  needed to correctly interpret RDPMC counter values in any case).

  As a side effect of these changes CR4 handling is cleaned up in the
  x86 code and a shadow copy of the CR4 value is added.

  The extra CR4 manipulation adds ~ <50ns to the context switch cost
  between rdpmc-capable and rdpmc-non-capable mms"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Add /sys/devices/cpu/rdpmc=2 to allow rdpmc for all tasks
  perf/x86: Only allow rdpmc if a perf_event is mapped
  perf: Pass the event to arch_perf_update_userpage()
  perf: Add pmu callbacks to track event mapping and unmapping
  x86: Add a comment clarifying LDT context switching
  x86: Store a per-cpu shadow copy of CR4
  x86: Clean up cr4 manipulation
2015-02-16 14:58:12 -08:00
Linus Torvalds
29afc4e9a4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree changes from Jiri Kosina:
 "Patches from trivial.git that keep the world turning around.

  Mostly documentation and comment fixes, and a two corner-case code
  fixes from Alan Cox"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  kexec, Kconfig: spell "architecture" properly
  mm: fix cleancache debugfs directory path
  blackfin: mach-common: ints-priority: remove unused function
  doubletalk: probe failure causes OOPS
  ARM: cache-l2x0.c: Make it clear that cache-l2x0 handles L310 cache controller
  msdos_fs.h: fix 'fields' in comment
  scsi: aic7xxx: fix comment
  ARM: l2c: fix comment
  ibmraid: fix writeable attribute with no store method
  dynamic_debug: fix comment
  doc: usbmon: fix spelling s/unpriviledged/unprivileged/
  x86: init_mem_mapping(): use capital BIOS in comment
2015-02-10 18:57:15 -08:00
Andy Lutomirski
1e02ce4ccc x86: Store a per-cpu shadow copy of CR4
Context switches and TLB flushes can change individual bits of CR4.
CR4 reads take several cycles, so store a shadow copy of CR4 in a
per-cpu variable.

To avoid wasting a cache line, I added the CR4 shadow to
cpu_tlbstate, which is already touched in switch_mm.  The heaviest
users of the cr4 shadow will be switch_mm and __switch_to_xtra, and
__switch_to_xtra is called shortly after switch_mm during context
switch, so the cacheline is likely to be hot.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vince Weaver <vince@deater.net>
Cc: "hillf.zj" <hillf.zj@alibaba-inc.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/3a54dd3353fffbf84804398e00dfdc5b7c1afd7d.1414190806.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-04 12:10:42 +01:00
Andy Lutomirski
375074cc73 x86: Clean up cr4 manipulation
CR4 manipulation was split, seemingly at random, between direct
(write_cr4) and using a helper (set/clear_in_cr4).  Unfortunately,
the set_in_cr4 and clear_in_cr4 helpers also poke at the boot code,
which only a small subset of users actually wanted.

This patch replaces all cr4 access in functions that don't leave cr4
exactly the way they found it with new helpers cr4_set_bits,
cr4_clear_bits, and cr4_set_bits_and_update_boot.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vince Weaver <vince@deater.net>
Cc: "hillf.zj" <hillf.zj@alibaba-inc.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/495a10bdc9e67016b8fd3945700d46cfd5c12c2f.1414190806.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-04 12:10:41 +01:00
Juergen Gross
31bb772370 x86, mm: Change cachemode exports to non-gpl
Commit 281d4078be ("x86: Make page cache mode a real type")
introduced the symbols __cachemode2pte_tbl and __pte2cachemode_tbl and
exported them via EXPORT_SYMBOL_GPL.  The exports are part of a
replacement of code which has been EXPORT_SYMBOL before these changes
resulting in build breakage of out-of-tree non-gpl modules.

Change EXPORT_SYMBOL_GPL to EXPORT-SYMBOL for these two symbols.

Fixes: 281d4078be "x86: Make page cache mode a real type"
Reported-and-tested-by: Steven Noonan <steven@uplinklabs.net>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Toshi Kani <toshi.kani@hp.com>
Link: http://lkml.kernel.org/r/1421926997-28615-1-git-send-email-jgross@suse.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-01-22 21:50:14 +01:00
Pavel Machek
801a559114 x86: init_mem_mapping(): use capital BIOS in comment
Use capital BIOS in comment. Its cleaner, and allows diference
between BIOS and BIOs.

Signed-off-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-01-02 12:07:06 +01:00
Jan Beulich
132978b94e x86: Fix step size adjustment during initial memory mapping
The old scheme can lead to failure in certain cases - the
problem is that after bumping step_size the next (non-final)
iteration is only guaranteed to make available a memory block
the size of what step_size was before. E.g. for a memory block
[0,3004600000) we'd have:

 iter	start		end		step		amount
 1	3004400000	30045fffff	 2M		  2M
 2	3004000000	30043fffff	64M		  4M
 3	3000000000	3003ffffff	 2G		 64M
 4	2000000000	2fffffffff	64G		 64G

Yet to map 64G with 4k pages (as happens e.g. under PV Xen) we
need slightly over 128M, but the first three iterations made
only about 70M available.

The condition (new_mapped_ram_size > mapped_ram_size) for
bumping step_size is just not suitable. Instead we want to bump
it when we know we have enough memory available to cover a block
of the new step_size. And rather than making that condition more
complicated than needed, simply adjust step_size by the largest
possible factor we know we can cover at that point - which is
shifting it left by one less than the difference between page
table level shifts. (Interestingly the original STEP_SIZE_SHIFT
definition had a comment hinting at that having been the
intention, just that it should have been PUD_SHIFT-PMD_SHIFT-1
instead of (PUD_SHIFT-PMD_SHIFT)/2, and of course for non-PAE
32-bit we can't really use these two constants as they're equal
there.)

Furthermore the comment in get_new_step_size() didn't get
updated when the bottom-down mapping logic got added. Yet while
an overflow (flushing step_size to zero) of the shift doesn't
matter for the top-down method, it does for bottom-up because
round_up(x, 0) = 0, and an upper range boundary of zero can't
really work well.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/54945C1E020000780005114E@mail.emea.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-12-23 11:39:34 +01:00
Linus Torvalds
536e89ee53 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "Misc fixes (mainly Andy's TLS fixes), plus a cleanup"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/tls: Disallow unusual TLS segments
  x86/tls: Validate TLS entries to protect espfix
  MAINTAINERS: Add me as x86 VDSO submaintainer
  x86/asm: Unify segment selector defines
  x86/asm: Guard against building the 32/64-bit versions of the asm-offsets*.c file directly
  x86_64, switch_to(): Load TLS descriptors before switching DS and ES
  x86/mm: Use min() instead of min_t() in the e820 printout code
  x86/mm: Fix zone ranges boot printout
  x86/doc: Update documentation after file shuffling
2014-12-14 11:51:50 -08:00
Xishi Qiu
c072b90c8d x86/mm: Fix zone ranges boot printout
This is the usual physical memory layout boot printout:
	...
	[    0.000000] Zone ranges:
	[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
	[    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
	[    0.000000]   Normal   [mem 0x100000000-0xc3fffffff]
	[    0.000000] Movable zone start for each node
	[    0.000000] Early memory node ranges
	[    0.000000]   node   0: [mem 0x00001000-0x00099fff]
	[    0.000000]   node   0: [mem 0x00100000-0xbf78ffff]
	[    0.000000]   node   0: [mem 0x100000000-0x63fffffff]
	[    0.000000]   node   1: [mem 0x640000000-0xc3fffffff]
	...

This is the log when we set "mem=2G" on the boot cmdline:
	...
	[    0.000000] Zone ranges:
	[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
	[    0.000000]   DMA32    [mem 0x01000000-0xffffffff]  // should be 0x7fffffff, right?
	[    0.000000]   Normal   empty
	[    0.000000] Movable zone start for each node
	[    0.000000] Early memory node ranges
	[    0.000000]   node   0: [mem 0x00001000-0x00099fff]
	[    0.000000]   node   0: [mem 0x00100000-0x7fffffff]
	...

This patch fixes the printout, the following log shows the right
ranges:
	...
	[    0.000000] Zone ranges:
	[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
	[    0.000000]   DMA32    [mem 0x01000000-0x7fffffff]
	[    0.000000]   Normal   empty
	[    0.000000] Movable zone start for each node
	[    0.000000] Early memory node ranges
	[    0.000000]   node   0: [mem 0x00001000-0x00099fff]
	[    0.000000]   node   0: [mem 0x00100000-0x7fffffff]
	...

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Linux MM <linux-mm@kvack.org>
Cc: <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Link: http://lkml.kernel.org/r/5487AB3D.6070306@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-12-11 11:35:02 +01:00
Juergen Gross
bd809af16e x86: Enable PAT to use cache mode translation tables
Update the translation tables from cache mode to pgprot values
according to the PAT settings. This enables changing the cache
attributes of a PAT index in just one place without having to change
at the users side.

With this change it is possible to use the same kernel with different
PAT configurations, e.g. supporting Xen.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stefan.bader@canonical.com
Cc: xen-devel@lists.xensource.com
Cc: ville.syrjala@linux.intel.com
Cc: david.vrabel@citrix.com
Cc: jbeulich@suse.com
Cc: plagnioj@jcrosoft.com
Cc: tomi.valkeinen@ti.com
Cc: bhelgaas@google.com
Link: http://lkml.kernel.org/r/1415019724-4317-18-git-send-email-jgross@suse.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-16 11:04:26 +01:00
Juergen Gross
281d4078be x86: Make page cache mode a real type
At the moment there are a lot of places that handle setting or getting
the page cache mode by treating the pgprot bits equal to the cache mode.
This is only true because there are a lot of assumptions about the setup
of the PAT MSR. Otherwise the cache type needs to get translated into
pgprot bits and vice versa.

This patch tries to prepare for that by introducing a separate type
for the cache mode and adding functions to translate between those and
pgprot values.

To avoid too much performance penalty the translation between cache mode
and pgprot values is done via tables which contain the relevant
information.  Write-back cache mode is hard-wired to be 0, all other
modes are configurable via those tables. For large pages there are
translation functions as the PAT bit is located at different positions
in the ptes of 4k and large pages.

Based-on-patch-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stefan.bader@canonical.com
Cc: xen-devel@lists.xensource.com
Cc: konrad.wilk@oracle.com
Cc: ville.syrjala@linux.intel.com
Cc: david.vrabel@citrix.com
Cc: jbeulich@suse.com
Cc: toshi.kani@hp.com
Cc: plagnioj@jcrosoft.com
Cc: tomi.valkeinen@ti.com
Cc: bhelgaas@google.com
Link: http://lkml.kernel.org/r/1415019724-4317-2-git-send-email-jgross@suse.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-16 11:04:24 +01:00
Dave Hansen
d17d8f9ded x86/mm: Add tracepoints for TLB flushes
We don't have any good way to figure out what kinds of flushes
are being attempted.  Right now, we can try to use the vm
counters, but those only tell us what we actually did with the
hardware (one-by-one vs full) and don't tell us what was actually
_requested_.

This allows us to select out "interesting" TLB flushes that we
might want to optimize (like the ranged ones) and ignore the ones
that we have very little control over (the ones at context
switch).

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154059.4C96CBA5@viggo.jf.intel.com
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-07-31 08:48:51 -07:00
Zhi Yong Wu
d4dd100f2e arch/x86/mm/init.c: fix incorrect function name in alloc_low_pages()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:11 +09:00
Tang Chen
b959ed6c73 x86/mem-hotplug: support initialize page tables in bottom-up
The Linux kernel cannot migrate pages used by the kernel.  As a result,
kernel pages cannot be hot-removed.  So we cannot allocate hotpluggable
memory for the kernel.

In a memory hotplug system, any numa node the kernel resides in should be
unhotpluggable.  And for a modern server, each node could have at least
16GB memory.  So memory around the kernel image is highly likely
unhotpluggable.

ACPI SRAT (System Resource Affinity Table) contains the memory hotplug
info.  But before SRAT is parsed, memblock has already started to allocate
memory for the kernel.  So we need to prevent memblock from doing this.

So direct memory mapping page tables setup is the case.
init_mem_mapping() is called before SRAT is parsed.  To prevent page
tables being allocated within hotpluggable memory, we will use bottom-up
direction to allocate page tables from the end of kernel image to the
higher memory.

Note:
As for allocating page tables in lower memory, TJ said:

: This is an optional behavior which is triggered by a very specific kernel
: boot param, which I suspect is gonna need to stick around to support
: memory hotplug in the current setup unless we add another layer of address
: translation to support memory hotplug.

As for page tables may occupy too much lower memory if using 4K mapping
(CONFIG_DEBUG_PAGEALLOC and CONFIG_KMEMCHECK both disable using >4k
pages), TJ said:

: But as I said in the same paragraph, parsing SRAT earlier doesn't solve
: the problem in itself either.  Ignoring the option if 4k mapping is
: required and memory consumption would be prohibitive should work, no?
: Something like that would be necessary if we're gonna worry about cases
: like this no matter how we implement it, but, frankly, I'm not sure this
: is something worth worrying about.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:08 +09:00
Tang Chen
0167d7d8b0 x86/mm: factor out of top-down direct mapping setup
Create a new function memory_map_top_down to factor out of the top-down
direct memory mapping pagetable setup.  This is also a preparation for the
following patch, which will introduce the bottom-up memory mapping.  That
said, we will put the two ways of pagetable setup into separate functions,
and choose to use which way in init_mem_mapping, which makes the code more
clear.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:08 +09:00
Yinghai Lu
6979287a7d x86/mm: Add 'step_size' comments to init_mem_mapping()
Current code uses macro to shift by 5, but there is no explanation
why there's no worry about an overflow there.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Jacob Shin <jacob.shin@amd.com>
Link: http://lkml.kernel.org/r/1378519629-10433-1-git-send-email-yinghai@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-10 09:51:34 +02:00
Yinghai Lu
527bf129f9 x86/mm: Fix boot crash with DEBUG_PAGE_ALLOC=y and more than 512G RAM
Dave Hansen reported that systems between 500G and 600G RAM
crash early if DEBUG_PAGEALLOC is selected.

 > [    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
 > [    0.000000]  [mem 0x00000000-0x000fffff] page 4k
 > [    0.000000] BRK [0x02086000, 0x02086fff] PGTABLE
 > [    0.000000] BRK [0x02087000, 0x02087fff] PGTABLE
 > [    0.000000] BRK [0x02088000, 0x02088fff] PGTABLE
 > [    0.000000] init_memory_mapping: [mem 0xe80ee00000-0xe80effffff]
 > [    0.000000]  [mem 0xe80ee00000-0xe80effffff] page 4k
 > [    0.000000] BRK [0x02089000, 0x02089fff] PGTABLE
 > [    0.000000] BRK [0x0208a000, 0x0208afff] PGTABLE
 > [    0.000000] Kernel panic - not syncing: alloc_low_page: ran out of memory

It turns out that we missed increasing needed pages in BRK to
mapping initial 2M and [0,1M) when we switched to use the #PF
handler to set memory mappings:

 > commit 8170e6bed4
 > Author: H. Peter Anvin <hpa@zytor.com>
 > Date:   Thu Jan 24 12:19:52 2013 -0800
 >
 >     x86, 64bit: Use a #PF handler to materialize early mappings on demand

Before that, we had the maping from [0,512M) in head_64.S, and we
can spare two pages [0-1M).  After that change, we can not reuse
pages anymore.

When we have more than 512M ram, we need an extra page for pgd page
with [512G, 1024g).

Increase pages in BRK for page table to solve the boot crash.

Reported-by: Dave Hansen <dave.hansen@intel.com>
Bisected-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: <stable@vger.kernel.org> # v3.9 and later
Link: http://lkml.kernel.org/r/1376351004-4015-1-git-send-email-yinghai@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-08-20 10:06:50 +02:00
Jiang Liu
c88442ec45 mm/x86: use free_reserved_area() to simplify code
Use common help function free_reserved_area() to simplify code.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:07:33 -07:00
Yinghai Lu
7de3d66b13 x86: Fix adjust_range_size_mask calling position
Commit

    8d57470d x86, mm: setup page table in top-down

causes a kernel panic while setting mem=2G.

     [mem 0x00000000-0x000fffff] page 4k
     [mem 0x7fe00000-0x7fffffff] page 1G
     [mem 0x7c000000-0x7fdfffff] page 1G
     [mem 0x00100000-0x001fffff] page 4k
     [mem 0x00200000-0x7bffffff] page 2M

for last entry is not what we want, we should have
     [mem 0x00200000-0x3fffffff] page 2M
     [mem 0x40000000-0x7bffffff] page 1G

Actually we merge the continuous ranges with same page size too early.
in this case, before merging we have
     [mem 0x00200000-0x3fffffff] page 2M
     [mem 0x40000000-0x7bffffff] page 2M
after merging them, will get
     [mem 0x00200000-0x7bffffff] page 2M
even we can use 1G page to map
     [mem 0x40000000-0x7bffffff]

that will cause problem, because we already map
     [mem 0x7fe00000-0x7fffffff] page 1G
     [mem 0x7c000000-0x7fdfffff] page 1G
with 1G page, aka [0x40000000-0x7fffffff] is mapped with 1G page already.
During phys_pud_init() for [0x40000000-0x7bffffff], it will not
reuse existing that pud page, and allocate new one then try to use
2M page to map it instead, as page_size_mask does not include
PG_LEVEL_1G. At end will have [7c000000-0x7fffffff] not mapped, loop
in phys_pmd_init stop mapping at 0x7bffffff.

That is right behavoir, it maps exact range with exact page size that
we ask, and we should explicitly call it to map [7c000000-0x7fffffff]
before or after mapping 0x40000000-0x7bffffff.
Anyway we need to make sure ranges' page_size_mask correct and consistent
after split_mem_range for each range.

Fix that by calling adjust_range_size_mask before merging range
with same page size.

-v2: update change log.
-v3: add more explanation why [7c000000-0x7fffffff] is not mapped, and
    it causes panic.

Bisected-by: "Xie, ChanglongX" <changlongx.xie@intel.com>
Bisected-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reported-and-tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1370015587-20835-1-git-send-email-yinghai@kernel.org
Cc: <stable@vger.kernel.org> v3.9
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-05-31 13:21:32 -07:00
Zhang Yanfei
cf8b166d5c x86/mm: Add missing comments for initial kernel direct mapping
Two sets of comments were lost during patch-series shuffling:

  - comments for init_range_memory_mapping()

  - comments in init_mem_mapping that is helpful for reminding people
    that the pagetable is setup top-down

The comments were written by Yinghai in his patch in:

   https://lkml.org/lkml/2012/11/28/620

This patch reintroduces them.

Originally-From: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/518BC776.7010506@gmail.com
[ Tidied it all up a bit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-05-10 12:00:35 +02:00
Jiang Liu
bced0e32f6 mm/x86: use common help functions to free reserved pages
Use common help functions to free reserved pages.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:31 -07:00
Yinghai Lu
98e7a98997 x86, mm: Make sure to find a 2M free block for the first mapped area
Henrik reported that his MacAir 3.1 would not boot with

| commit 8d57470d8f
| Date:   Fri Nov 16 19:38:58 2012 -0800
|
|    x86, mm: setup page table in top-down

It turns out that we do not calculate the real_end properly:
We try to get 2M size with 4K alignment, and later will round down
to 2M, so we will get less then 2M for first mapping, in extreme
case could be only 4K only. In Henrik's system it has (1M-32K) as
last usable rage is [mem 0x7f9db000-0x7fef8fff].

The problem is exposed when EFI booting have several holes and it
will force mapping to use PTE instead as we only map usable areas.

To fix it, just make it be 2M aligned, so we can be guaranteed to be
able to use large pages to map it.

Reported-by: Henrik Rydberg <rydberg@euromail.se>
Bisected-by: Henrik Rydberg <rydberg@euromail.se>
Tested-by: Henrik Rydberg <rydberg@euromail.se>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/CAE9FiQX4nQ7_1kg5RL_vh56rmcSHXUi1ExrZX7CwED4NGMnHfg@mail.gmail.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2013-03-06 20:18:32 -08:00
Fenghua Yu
cd745be89e x86/mm/init.c: Copy ucode from initrd image to kernel memory
Before initrd image is freed, copy valid ucode patches from initrd image
to kernel memory. The saved ucode will be used to update AP in resume
or hotplug.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1356075872-3054-12-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-31 13:20:26 -08:00
Yinghai Lu
0e691cf824 x86, kexec, 64bit: Only set ident mapping for ram.
We should set mappings only for usable memory ranges under max_pfn
Otherwise causes same problem that is fixed by

	x86, mm: Only direct map addresses that are marked as E820_RAM

This patch exposes pfn_mapped array, and only sets ident mapping for ranges
in that array.

This patch relies on new kernel_ident_mapping_init that could handle existing
pgd/pud between different calls.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1359058816-7615-25-git-send-email-yinghai@kernel.org
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-29 15:26:35 -08:00
H. Peter Anvin
8170e6bed4 x86, 64bit: Use a #PF handler to materialize early mappings on demand
Linear mode (CR0.PG = 0) is mutually exclusive with 64-bit mode; all
64-bit code has to use page tables.  This makes it awkward before we
have first set up properly all-covering page tables to access objects
that are outside the static kernel range.

So far we have dealt with that simply by mapping a fixed amount of
low memory, but that fails in at least two upcoming use cases:

1. We will support load and run kernel, struct boot_params, ramdisk,
   command line, etc. above the 4 GiB mark.
2. need to access ramdisk early to get microcode to update that as
   early possible.

We could use early_iomap to access them too, but it will make code to
messy and hard to be unified with 32 bit.

Hence, set up a #PF table and use a fixed number of buffers to set up
page tables on demand.  If the buffers fill up then we simply flush
them and start over.  These buffers are all in __initdata, so it does
not increase RAM usage at runtime.

Thus, with the help of the #PF handler, we can set the final kernel
mapping from blank, and switch to init_level4_pgt later.

During the switchover in head_64.S, before #PF handler is available,
we use three pages to handle kernel crossing 1G, 512G boundaries with
sharing page by playing games with page aliasing: the same page is
mapped twice in the higher-level tables with appropriate wraparound.
The kernel region itself will be properly mapped; other mappings may
be spurious.

early_make_pgtable is using kernel high mapping address to access pages
to set page table.

-v4: Add phys_base offset to make kexec happy, and add
	init_mapping_kernel()   - Yinghai
-v5: fix compiling with xen, and add back ident level3 and level2 for xen
     also move back init_level4_pgt from BSS to DATA again.
     because we have to clear it anyway.  - Yinghai
-v6: switch to init_level4_pgt in init_mem_mapping. - Yinghai
-v7: remove not needed clear_page for init_level4_page
     it is with fill 512,8,0 already in head_64.S  - Yinghai
-v8: we need to keep that handler alive until init_mem_mapping and don't
     let early_trap_init to trash that early #PF handler.
     So split early_trap_pf_init out and move it down. - Yinghai
-v9: switchover only cover kernel space instead of 1G so could avoid
     touch possible mem holes. - Yinghai
-v11: change far jmp back to far return to initial_code, that is needed
     to fix failure that is reported by Konrad on AMD systems.  - Yinghai

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1359058816-7615-12-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-29 15:20:06 -08:00
Yinghai Lu
c9b3234a6a x86, mm: Fix page table early allocation offset checking
During debugging loading kernel above 4G, found that one page is not used
in pre-allocated BRK area for early page allocation.
pgt_buf_top is address that can not be used, so should check if that new
end is above that top, otherwise last page will not be used.

Fix that checking and also add print out for allocation from pre-allocated
BRK area to catch possible bugs later.

But after we get back that page for pgt, it tiggers one bug in pgt allocation
with xen: We need to avoid to use page as pgt to map range that is
overlapping with that pgt page.

Add checking about overlapping, when it happens, use memblock allocation
instead.  That fixes crash on Xen PV guest with 2G that Stefan found.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1359058816-7615-2-git-send-email-yinghai@kernel.org
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Tested-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-29 15:12:23 -08:00
Yinghai Lu
b8fd39c036 x86, mm: Use clamp_t() in init_range_memory_mapping
save some lines, and make code more readable.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-42-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:46 -08:00
Yinghai Lu
4e37a89047 x86, mm: Unifying after_bootmem for 32bit and 64bit
after_bootmem has different meaning in 32bit and 64bit.
        32bit: after bootmem is ready
        64bit: after bootmem is distroyed
Let's merget them make 32bit the same as 64bit.

for 32bit, it is mixing alloc_bootmem_pages, and alloc_low_page under
after_bootmem is set or not set.

alloc_bootmem is just wrapper for memblock for x86.

Now we have alloc_low_page() with memblock too. We can drop bootmem path
now, and only alloc_low_page only.

At the same time, we make alloc_low_page could handle real after_bootmem
for 32bit, because alloc_bootmem_pages could fallback to use slab too.

At last move after_bootmem set position for 32bit the same as 64bit.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-40-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:44 -08:00
Yinghai Lu
2e8059edb6 x86, mm: use limit_pfn for end pfn
instead of shifting end to get that.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-39-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:43 -08:00
Yinghai Lu
1829ae9ad7 x86, mm: use pfn instead of pos in split_mem_range
could save some bit shifting operations.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-38-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:41 -08:00
Yinghai Lu
84d770019b x86, mm: use PFN_DOWN in split_mem_range()
to replace own inline version for shifting.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-37-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:40 -08:00
Yinghai Lu
5a0d3aeeef x86, mm: use round_up/down in split_mem_range()
to replace own inline version for those roundup and rounddown.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-36-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:40 -08:00
Yinghai Lu
148b20989e x86, mm: Move init_gbpages() out of setup.c
Put it in mm/init.c, and call it from probe_page_mask().
init_mem_mapping is calling probe_page_mask at first.
So calling sequence is not changed.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-32-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:37 -08:00
Yinghai Lu
cf47065961 x86, mm: Move back pgt_buf_* to mm/init.c
Also change them to static.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-31-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:36 -08:00
Yinghai Lu
719272c45b x86, mm: only call early_ioremap_page_table_range_init() once
On 32bit, before patcheset that only set page table for ram, we only
call that one time.

Now, we are calling that during every init_memory_mapping if we have holes
under max_low_pfn.

We should only call it one time after all ranges under max_low_page get
mapped just like we did before.

Also that could avoid the risk to run out of pgt_buf in BRK.

Need to update page_table_range_init() to count the pages for kmap page table
at first, and use new added alloc_low_pages() to get pages in sequence.
That will conform to the requirement that pages need to be in low to high order.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-30-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:29 -08:00
Stefano Stabellini
ddd3509df8 x86, mm: Add pointer about Xen mmu requirement for alloc_low_pages
Add link for more information
	279b706 x86,xen: introduce x86_init.mapping.pagetable_reserve

-v2: updated to commets from hpa to include commit name.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-29-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:28 -08:00
Yinghai Lu
22c8ca2ac2 x86, mm: Add alloc_low_pages(num)
32bit kmap mapping needs pages to be used for low to high.
At this point those pages are still from pgt_buf_* from BRK, so it is
ok now.
But we want to move early_ioremap_page_table_range_init() out of
init_memory_mapping() and only call it one time later, that will
make page_table_range_init/page_table_kmap_check/alloc_low_page to
use memblock to get page.

memblock allocation for pages are from high to low.
So will get panic from page_table_kmap_check() that has BUG_ON to do
ordering checking.

This patch add alloc_low_pages to make it possible to allocate serveral
pages at first, and hand out pages one by one from low to high.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-28-git-send-email-yinghai@kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:27 -08:00
Yinghai Lu
6f80b68e9e x86, mm, Xen: Remove mapping_pagetable_reserve()
Page table area are pre-mapped now after
	x86, mm: setup page table in top-down
	x86, mm: Remove early_memremap workaround for page table accessing on 64bit

mapping_pagetable_reserve is not used anymore, so remove it.

Also remove operation in mask_rw_pte(), as modified allow_low_page
always return pages that are already mapped, moreover
xen_alloc_pte_init, xen_alloc_pmd_init, etc, will mark the page RO
before hooking it into the pagetable automatically.

-v2: add changelog about mask_rw_pte() from Stefano.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-27-git-send-email-yinghai@kernel.org
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:26 -08:00
Yinghai Lu
9985b4c6fa x86, mm: Move min_pfn_mapped back to mm/init.c
Also change it to static.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-26-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:24 -08:00
Yinghai Lu
5c51bdbe4c x86, mm: Merge alloc_low_page between 64bit and 32bit
They are almost same except 64 bit need to handle after_bootmem case.

Add mm_internal.h to make that alloc_low_page() only to be accessible
from arch/x86/mm/init*.c

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-25-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:23 -08:00
Yinghai Lu
8d57470d8f x86, mm: setup page table in top-down
Get pgt_buf early from BRK, and use it to map PMD_SIZE from top at first.
Then use mapped pages to map more ranges below, and keep looping until
all pages get mapped.

alloc_low_page will use page from BRK at first, after that buffer is used
up, will use memblock to find and reserve pages for page table usage.

Introduce min_pfn_mapped to make sure find new pages from mapped ranges,
that will be updated when lower pages get mapped.

Also add step_size to make sure that don't try to map too big range with
limited mapped pages initially, and increase the step_size when we have
more mapped pages on hand.

We don't need to call pagetable_reserve anymore, reserve work is done
in alloc_low_page() directly.

At last we can get rid of calculation and find early pgt related code.

-v2: update to after fix_xen change,
     also use MACRO for initial pgt_buf size and add comments with it.
-v3: skip big reserved range in memblock.reserved near end.
-v4: don't need fix_xen change now.
-v5: add changelog about moving about reserving pagetable to alloc_low_page.

Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-22-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:19 -08:00
Yinghai Lu
f763ad1d38 x86, mm: Break down init_all_memory_mapping
Will replace that with top-down page table initialization.
New API need to take range: init_range_memory_mapping()

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-21-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:17 -08:00
Yinghai Lu
aeebe84cc9 x86, mm: Use big page size for small memory range
We could map small range in the middle of big range at first, so should use
big page size at first to avoid using small page size to break down page table.

Only can set big page bit when that range has ram area around it.

-v2: fix 32bit boundary checking. We can not count ram above max_low_pfn
	for 32 bit.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-19-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:16 -08:00
Jacob Shin
66520ebc2d x86, mm: Only direct map addresses that are marked as E820_RAM
Currently direct mappings are created for [ 0 to max_low_pfn<<PAGE_SHIFT )
and [ 4GB to max_pfn<<PAGE_SHIFT ), which may include regions that are not
backed by actual DRAM. This is fine for holes under 4GB which are covered
by fixed and variable range MTRRs to be UC. However, we run into trouble
on higher memory addresses which cannot be covered by MTRRs.

Our system with 1TB of RAM has an e820 that looks like this:

 BIOS-e820: [mem 0x0000000000000000-0x00000000000983ff] usable
 BIOS-e820: [mem 0x0000000000098400-0x000000000009ffff] reserved
 BIOS-e820: [mem 0x00000000000d0000-0x00000000000fffff] reserved
 BIOS-e820: [mem 0x0000000000100000-0x00000000c7ebffff] usable
 BIOS-e820: [mem 0x00000000c7ec0000-0x00000000c7ed7fff] ACPI data
 BIOS-e820: [mem 0x00000000c7ed8000-0x00000000c7ed9fff] ACPI NVS
 BIOS-e820: [mem 0x00000000c7eda000-0x00000000c7ffffff] reserved
 BIOS-e820: [mem 0x00000000fec00000-0x00000000fec0ffff] reserved
 BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
 BIOS-e820: [mem 0x00000000fff00000-0x00000000ffffffff] reserved
 BIOS-e820: [mem 0x0000000100000000-0x000000e037ffffff] usable
 BIOS-e820: [mem 0x000000e038000000-0x000000fcffffffff] reserved
 BIOS-e820: [mem 0x0000010000000000-0x0000011ffeffffff] usable

and so direct mappings are created for huge memory hole between
0x000000e038000000 to 0x0000010000000000. Even though the kernel never
generates memory accesses in that region, since the page tables mark
them incorrectly as being WB, our (AMD) processor ends up causing a MCE
while doing some memory bookkeeping/optimizations around that area.

This patch iterates through e820 and only direct maps ranges that are
marked as E820_RAM, and keeps track of those pfn ranges. Depending on
the alignment of E820 ranges, this may possibly result in using smaller
size (i.e. 4K instead of 2M or 1G) page tables.

-v2: move changes from setup.c to mm/init.c, also use for_each_mem_pfn_range
	instead.  - Yinghai Lu
-v3: add calculate_all_table_space_size() to get correct needed page table
	size. - Yinghai Lu
-v4: fix add_pfn_range_mapped() to get correct max_low_pfn_mapped when
     mem map does have hole under 4g that is found by Konard on xen
     domU with 8g ram. - Yinghai

Signed-off-by: Jacob Shin <jacob.shin@amd.com>
Link: http://lkml.kernel.org/r/1353123563-3103-16-git-send-email-yinghai@kernel.org
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:14 -08:00
Yinghai Lu
ab9519376e x86, mm: Separate out calculate_table_space_size()
It should take physical address range that will need to be mapped.
find_early_table_space should take range that pgt buff should be in.

Separating page table size calculating and finding early page table to
reduce confusing.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-9-git-send-email-yinghai@kernel.org
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:07 -08:00
Yinghai Lu
c14fa0b63b x86, mm: Find early page table buffer together
We should not do that in every calling of init_memory_mapping.

At the same time need to move down early_memtest, and could remove after_bootmem
checking.

-v2: fix one early_memtest with 32bit by passing max_pfn_mapped instead.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-8-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:06 -08:00
Yinghai Lu
84f1ae30bb x86, mm: Change find_early_table_space() paramters
call split_mem_range inside the function.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-7-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:05 -08:00
Yinghai Lu
28b6ff6670 x86, mm: Revert back good_end setting for 64bit
After

| commit 8548c84da2
| Author: Takashi Iwai <tiwai@suse.de>
| Date:   Sun Oct 23 23:19:12 2011 +0200
|
|    x86: Fix S4 regression
|
|    Commit 4b239f458 ("x86-64, mm: Put early page table high") causes a S4
|    regression since 2.6.39, namely the machine reboots occasionally at S4
|    resume.  It doesn't happen always, overall rate is about 1/20.  But,
|    like other bugs, once when this happens, it continues to happen.
|
|    This patch fixes the problem by essentially reverting the memory
|    assignment in the older way.

Have some page table around 512M again, that will prevent kdump to find 512M
under 768M.

We need revert that reverting, so we could put page table high again for 64bit.

Takashi agreed that S4 regression could be something else.

	https://lkml.org/lkml/2012/6/15/182

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-6-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-11-17 11:59:04 -08:00