linux_dsm_epyc7002/mm
Dave Hansen 62b5f7d013 mm/core, x86/mm/pkeys: Add execute-only protection keys support
Protection keys provide new page-based protection in hardware.
But, they have an interesting attribute: they only affect data
accesses and never affect instruction fetches.  That means that
if we set up some memory which is set as "access-disabled" via
protection keys, we can still execute from it.

This patch uses protection keys to set up mappings to do just that.
If a user calls:

	mmap(..., PROT_EXEC);
or
	mprotect(ptr, sz, PROT_EXEC);

(note PROT_EXEC-only without PROT_READ/WRITE), the kernel will
notice this, and set a special protection key on the memory.  It
also sets the appropriate bits in the Protection Keys User Rights
(PKRU) register so that the memory becomes unreadable and
unwritable.

I haven't found any userspace that does this today.  With this
facility in place, we expect userspace to move to use it
eventually.  Userspace _could_ start doing this today.  Any
PROT_EXEC calls get converted to PROT_READ inside the kernel, and
would transparently be upgraded to "true" PROT_EXEC with this
code.  IOW, userspace never has to do any PROT_EXEC runtime
detection.

This feature provides enhanced protection against leaking
executable memory contents.  This helps thwart attacks which are
attempting to find ROP gadgets on the fly.

But, the security provided by this approach is not comprehensive.
The PKRU register which controls access permissions is a normal
user register writable from unprivileged userspace.  An attacker
who can execute the 'wrpkru' instruction can easily disable the
protection provided by this feature.

The protection key that is used for execute-only support is
permanently dedicated at compile time.  This is fine for now
because there is currently no API to set a protection key other
than this one.

Despite there being a constant PKRU value across the entire
system, we do not set it unless this feature is in use in a
process.  That is to preserve the PKRU XSAVE 'init state',
which can lead to faster context switches.

PKRU *is* a user register and the kernel is modifying it.  That
means that code doing:

	pkru = rdpkru()
	pkru |= 0x100;
	mmap(..., PROT_EXEC);
	wrpkru(pkru);

could lose the bits in PKRU that enforce execute-only
permissions.  To avoid this, we suggest avoiding ever calling
mmap() or mprotect() when the PKRU value is expected to be
unstable.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Piotr Kwapulinski <kwapulinski.piotr@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Vladimir Murzin <vladimir.murzin@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: keescook@google.com
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210240.CB4BB5CA@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-18 19:46:33 +01:00
..
kasan UBSAN: run-time undefined behavior sanity checker 2016-01-20 17:09:18 -08:00
backing-dev.c mm/backing-dev.c: fix error path in wb_init() 2016-02-11 18:35:48 -08:00
balloon_compaction.c virtio_balloon: fix race between migration and ballooning 2016-01-12 20:47:06 +02:00
bootmem.c x86/mm: Introduce max_possible_pfn 2015-12-06 12:46:31 +01:00
cleancache.c cleancache: constify cleancache_ops structure 2016-01-27 09:09:57 -05:00
cma_debug.c mm/cma_debug: correct size input to bitmap function 2015-07-17 16:39:54 -07:00
cma.c mm/cma.c: suppress warning 2015-11-05 19:34:48 -08:00
cma.h mm: cma: mark cma_bitmap_maxno() inline in header 2015-08-14 15:56:32 -07:00
compaction.c mm/compaction.c: __compact_pgdat() code cleanuup 2016-01-14 16:00:49 -08:00
debug-pagealloc.c mm/debug-pagealloc: make debug-pagealloc boottime configurable 2014-12-13 12:42:48 -08:00
debug.c mm: rework mapcount accounting to enable 4k mapping of THPs 2016-01-15 17:56:32 -08:00
dmapool.c mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd 2015-11-06 17:50:42 -08:00
early_ioremap.c mm/early_ioremap: use offset_in_page macro 2015-11-05 19:34:48 -08:00
fadvise.c writeback: implement and use inode_congested() 2015-06-02 08:33:35 -06:00
failslab.c mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM 2015-11-06 17:50:42 -08:00
filemap.c mm: fix filemap.c kernel doc warning 2016-02-11 18:35:48 -08:00
frame_vector.c mm/gup: Switch all callers of get_user_pages() to not pass tsk/mm 2016-02-16 10:11:12 +01:00
frontswap.c frontswap: allow multiple backends 2015-06-24 17:49:45 -07:00
gup.c mm/core, x86/mm/pkeys: Differentiate instruction fetches 2016-02-18 19:46:29 +01:00
highmem.c mm/highmem: make kmap cache coloring aware 2014-08-06 18:01:22 -07:00
huge_memory.c thp: make deferred_split_scan() work again 2016-02-05 18:10:40 -08:00
hugetlb_cgroup.c mm: make compound_head() robust 2015-11-06 17:50:42 -08:00
hugetlb.c mm, hugetlb: don't require CMA for runtime gigantic pages 2016-02-05 18:10:40 -08:00
hwpoison-inject.c hwpoison: use page_cgroup_ino for filtering by memcg 2015-09-10 13:29:01 -07:00
init-mm.c
internal.h mm: polish virtual memory accounting 2016-02-03 08:28:43 -08:00
interval_tree.c mm: replace vma->sharead.linear with vma->shared 2015-02-10 14:30:31 -08:00
Kconfig mm/core, x86/mm/pkeys: Add arch_validate_pkey() 2016-02-18 19:46:31 +01:00
Kconfig.debug mm/debug_pagealloc: remove obsolete Kconfig options 2015-01-08 15:10:52 -08:00
kmemcheck.c mm/slab_common: move kmem_cache definition to internal header 2014-10-09 22:25:50 -04:00
kmemleak-test.c mm/kmemleak-test.c: use pr_fmt for logging 2014-06-06 16:08:18 -07:00
kmemleak.c Revert "gfp: add __GFP_NOACCOUNT" 2016-01-14 16:00:49 -08:00
ksm.c mm/core: Do not enforce PKEY permissions on remote mm access 2016-02-18 19:46:28 +01:00
list_lru.c mm: memcontrol: move kmem accounting code to CONFIG_MEMCG 2016-01-20 17:09:18 -08:00
maccess.c mm/maccess.c: actually return -EFAULT from strncpy_from_unsafe 2015-11-05 19:34:48 -08:00
madvise.c mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called 2016-01-15 17:56:32 -08:00
Makefile media updates for v4.3-rc1 2015-09-11 16:42:39 -07:00
memblock.c memblock: don't mark memblock_phys_mem_size() as __init 2016-02-05 18:10:40 -08:00
memcontrol.c thp: change pmd_trans_huge_lock() interface to return ptl 2016-01-21 17:20:51 -08:00
memory_hotplug.c x86, mm: introduce vmem_altmap to augment vmemmap_populate() 2016-01-15 17:56:32 -08:00
memory-failure.c mm: soft-offline: exit with failure for non anonymous thp 2016-01-15 17:56:32 -08:00
memory.c mm/core, x86/mm/pkeys: Differentiate instruction fetches 2016-02-18 19:46:29 +01:00
mempolicy.c mm/gup: Switch all callers of get_user_pages() to not pass tsk/mm 2016-02-16 10:11:12 +01:00
mempool.c mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd 2015-11-06 17:50:42 -08:00
memtest.c memtest: remove unused header files 2015-09-08 15:35:28 -07:00
migrate.c thp: introduce deferred_split_huge_page() 2016-01-15 17:56:32 -08:00
mincore.c thp: change pmd_trans_huge_lock() interface to return ptl 2016-01-21 17:20:51 -08:00
mlock.c mm: fix mlock accouting 2016-01-21 17:20:51 -08:00
mm_init.c mm: meminit: remove mminit_verify_page_links 2015-06-30 19:44:56 -07:00
mmap.c mm/core, x86/mm/pkeys: Add execute-only protection keys support 2016-02-18 19:46:33 +01:00
mmu_context.c sched/mm: call finish_arch_post_lock_switch in idle_task_exit and use_mm 2014-02-21 08:50:17 +01:00
mmu_notifier.c mmu-notifier: add clear_young callback 2015-09-10 13:29:01 -07:00
mmzone.c mm/mmzone.c: memmap_valid_within() can be boolean 2016-01-14 16:00:49 -08:00
mprotect.c mm/core, x86/mm/pkeys: Add execute-only protection keys support 2016-02-18 19:46:33 +01:00
mremap.c mm, dax: check for pmd_none() after split_huge_pmd() 2016-02-11 18:35:48 -08:00
msync.c mm/msync: use offset_in_page macro 2015-11-05 19:34:48 -08:00
nobootmem.c x86/mm: Introduce max_possible_pfn 2015-12-06 12:46:31 +01:00
nommu.c mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits() 2016-02-18 19:46:30 +01:00
oom_kill.c mm, shmem: add internal shmem resident memory accounting 2016-01-14 16:00:49 -08:00
page_alloc.c mm, hugetlb: don't require CMA for runtime gigantic pages 2016-02-05 18:10:40 -08:00
page_counter.c mm: page_counter: let page_counter_try_charge() return bool 2015-11-05 19:34:48 -08:00
page_ext.c mm: introduce idle page tracking 2015-09-10 13:29:01 -07:00
page_idle.c mm: add page_check_address_transhuge() helper 2016-01-15 17:56:32 -08:00
page_io.c fs: use helper bio_add_page() instead of open coding on bi_io_vec 2015-08-13 12:32:00 -06:00
page_isolation.c mm/page_isolation: do some cleanup in "undo_isolate_page_range" 2016-01-15 17:56:32 -08:00
page_owner.c mm/page_owner: set correct gfp_mask on page_owner 2015-07-17 16:39:54 -07:00
page-writeback.c mm: page_alloc: generalize the dirty balance reserve 2016-01-14 16:00:49 -08:00
pagewalk.c thp: rename split_huge_page_pmd() to split_huge_pmd() 2016-01-15 17:56:32 -08:00
percpu-km.c percpu: implmeent pcpu_nr_empty_pop_pages and chunk->nr_populated 2014-09-02 14:46:05 -04:00
percpu-vm.c percpu: move region iterations out of pcpu_[de]populate_chunk() 2014-09-02 14:46:02 -04:00
percpu.c tree wide: use kvfree() than conditional kfree()/vfree() 2016-01-22 17:02:18 -08:00
pgtable-generic.c mm,thp: fix spellos in describing __HAVE_ARCH_FLUSH_PMD_TLB_RANGE 2016-02-11 18:35:48 -08:00
process_vm_access.c mm/gup: Introduce get_user_pages_remote() 2016-02-16 10:04:09 +01:00
quicklist.c
readahead.c mm: move lru_to_page to mm_inline.h 2016-01-14 16:00:49 -08:00
rmap.c mm: fix locking order in mm_take_all_locks() 2016-01-15 17:56:32 -08:00
shmem.c make sure that freeing shmem fast symlinks is RCU-delayed 2016-01-22 18:08:52 -05:00
slab_common.c mm: memcontrol: move kmem accounting code to CONFIG_MEMCG 2016-01-20 17:09:18 -08:00
slab.c mm/slab.c: add a helper function get_first_slab 2016-01-14 16:00:49 -08:00
slab.h mm: memcontrol: move kmem accounting code to CONFIG_MEMCG 2016-01-20 17:09:18 -08:00
slob.c slab/slub: adjust kmem_cache_alloc_bulk API 2015-11-22 11:58:44 -08:00
slub.c mm: memcontrol: move kmem accounting code to CONFIG_MEMCG 2016-01-20 17:09:18 -08:00
sparse-vmemmap.c x86, mm: introduce vmem_altmap to augment vmemmap_populate() 2016-01-15 17:56:32 -08:00
sparse.c x86, mm: introduce vmem_altmap to augment vmemmap_populate() 2016-01-15 17:56:32 -08:00
swap_cgroup.c mm: page_cgroup: rename file to mm/swap_cgroup.c 2014-12-10 17:41:09 -08:00
swap_state.c mm: memcontrol: charge swap to cgroup2 2016-01-20 17:09:18 -08:00
swap.c mm, x86: get_user_pages() for dax mappings 2016-01-15 17:56:32 -08:00
swapfile.c wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
truncate.c dax: support dirty DAX entries in radix tree 2016-01-22 17:02:18 -08:00
userfaultfd.c memcg: adjust to support new THP refcounting 2016-01-15 17:56:32 -08:00
util.c mm/gup: Overload get_user_pages() functions 2016-02-16 10:11:12 +01:00
vmacache.c mm/vmacache: inline vmacache_valid_mm() 2015-11-05 19:34:48 -08:00
vmalloc.c mm/vmalloc.c: use macro IS_ALIGNED to judge the aligment 2016-01-15 17:56:32 -08:00
vmpressure.c mm/vmpressure.c: fix subtree pressure detection 2016-02-03 08:28:43 -08:00
vmscan.c mm: downgrade VM_BUG in isolate_lru_page() to warning 2016-02-05 18:10:40 -08:00
vmstat.c vmstat: make vmstat_update deferrable 2016-02-05 18:10:40 -08:00
workingset.c dax: support dirty DAX entries in radix tree 2016-01-22 17:02:18 -08:00
zbud.c mm/zbud.c: use list_last_entry() instead of list_tail_entry() 2016-01-15 11:40:52 -08:00
zpool.c mm: zsmalloc: constify struct zs_pool name 2015-11-06 17:50:42 -08:00
zsmalloc.c zsmalloc: fix migrate_zspage-zs_free race condition 2016-01-20 17:09:18 -08:00
zswap.c mm/zswap: change incorrect strncmp use to strcmp 2015-12-18 14:25:40 -08:00