When emulating a nested VM-entry from L1 to L2, several control field
validation checks are deferred to the hardware. Should one of these
validation checks fail, vcpu_vmx_run will set the vmx->fail flag. When
this happens, the L2 guest state is not loaded (even in part), and
execution should continue in L1 with the next instruction after the
VMLAUNCH/VMRESUME.
The VMCS12 is not modified (except for the VM-instruction error
field), the VMCS12 MSR save/load lists are not processed, and the CPU
state is not loaded from the VMCS12 host area. Moreover, the vmcs02
exit reason is stale, so it should not be consulted for any reason.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On an early VMLAUNCH/VMRESUME failure (i.e. one which sets the
VM-instruction error field of the current VMCS), the launch state of
the current VMCS is not set to "launched," and the VM-exit information
fields of the current VMCS (including IDT-vectoring information and
exit reason) are stale.
On a late VMLAUNCH/VMRESUME failure (i.e. one which sets the high bit
of the exit reason field), the launch state of the current VMCS is not
set to "launched," and only two of the VM-exit information fields of
the current VMCS are modified (exit reason and exit
qualification). The remaining VM-exit information fields of the
current VMCS (including IDT-vectoring information, in particular) are
stale.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
After a successful VM-entry, RFLAGS is cleared, with the exception of
bit 1, which is always set. This is handled by load_vmcs12_host_state.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
During code inspection, the following potential race was seen:
CPU0 CPU1
kvm_async_pf_task_wait apf_task_wake_one
[L] swait_active(&n->wq)
[S] prepare_to_swait(&n.wq)
[L] if (!hlist_unhahed(&n.link))
schedule() [S] hlist_del_init(&n->link);
Properly serialize swait_active() checks such that a wakeup is
not missed.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The value of the guest_irq argument to vmx_update_pi_irte() is
ultimately coming from a KVM_IRQFD API call. Do not BUG() in
vmx_update_pi_irte() if the value is out-of bounds. (Especially,
since KVM as a whole seems to hang after that.)
Instead, print a message only once if we find that we don't have a
route for a certain IRQ (which can be out-of-bounds or within the
array).
This fixes CVE-2017-1000252.
Fixes: efc644048e ("KVM: x86: Update IRTE for posted-interrupts")
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If L1 does not specify the "use TPR shadow" VM-execution control in
vmcs12, then L0 must specify the "CR8-load exiting" and "CR8-store
exiting" VM-execution controls in vmcs02. Failure to do so will give
the L2 VM unrestricted read/write access to the hardware CR8.
This fixes CVE-2017-12154.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull more set_fs removal from Al Viro:
"Christoph's 'use kernel_read and friends rather than open-coding
set_fs()' series"
* 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: unexport vfs_readv and vfs_writev
fs: unexport vfs_read and vfs_write
fs: unexport __vfs_read/__vfs_write
lustre: switch to kernel_write
gadget/f_mass_storage: stop messing with the address limit
mconsole: switch to kernel_read
btrfs: switch write_buf to kernel_write
net/9p: switch p9_fd_read to kernel_write
mm/nommu: switch do_mmap_private to kernel_read
serial2002: switch serial2002_tty_write to kernel_{read/write}
fs: make the buf argument to __kernel_write a void pointer
fs: fix kernel_write prototype
fs: fix kernel_read prototype
fs: move kernel_read to fs/read_write.c
fs: move kernel_write to fs/read_write.c
autofs4: switch autofs4_write to __kernel_write
ashmem: switch to ->read_iter
qemu-system-x86-8600 [004] d..1 7205.687530: kvm_entry: vcpu 2
qemu-system-x86-8600 [004] .... 7205.687532: kvm_exit: reason EXCEPTION_NMI rip 0xffffffffa921297d info ffffeb2c0e44e018 80000b0e
qemu-system-x86-8600 [004] .... 7205.687532: kvm_page_fault: address ffffeb2c0e44e018 error_code 0
qemu-system-x86-8600 [004] .... 7205.687620: kvm_try_async_get_page: gva = 0xffffeb2c0e44e018, gfn = 0x427e4e
qemu-system-x86-8600 [004] .N.. 7205.687628: kvm_async_pf_not_present: token 0x8b002 gva 0xffffeb2c0e44e018
kworker/4:2-7814 [004] .... 7205.687655: kvm_async_pf_completed: gva 0xffffeb2c0e44e018 address 0x7fcc30c4e000
qemu-system-x86-8600 [004] .... 7205.687703: kvm_async_pf_ready: token 0x8b002 gva 0xffffeb2c0e44e018
qemu-system-x86-8600 [004] d..1 7205.687711: kvm_entry: vcpu 2
After running some memory intensive workload in guest, I catch the kworker
which completes the GUP too quickly, and queues an "Page Ready" #PF exception
after the "Page not Present" exception before the next vmentry as the above
trace which will result in #DF injected to guest.
This patch fixes it by clearing the queue for "Page not Present" if "Page Ready"
occurs before the next vmentry since the GUP has already got the required page
and shadow page table has already been fixed by "Page Ready" handler.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Fixes: 7c90705bf2 ("KVM: Inject asynchronous page fault into a PV guest if page is swapped out.")
[Changed indentation and added clearing of injected. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Don't block vCPU if there is pending exception.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
SVM AVIC hardware accelerates guest write to APIC_EOI register
(for edge-trigger interrupt), which means it does not trap to KVM.
So, only enable SVM AVIC only in split irqchip mode.
(e.g. launching qemu w/ option '-machine kernel_irqchip=split').
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Fixes: 44a95dae1d ("KVM: x86: Detect and Initialize AVIC support")
[Removed pr_debug - Radim.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
... and __initconst if applicable.
Based on similar work for an older kernel in the Grsecurity patch.
[JD: fix toshiba-wmi build]
[JD: add htcpen]
[JD: move __initconst where checkscript wants it]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Modify struct kvm_x86_ops.arch.apicv_active() to take struct kvm_vcpu
pointer as parameter in preparation to subsequent changes.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Preparing the base code for subsequent changes. This does not change
existing logic.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Clang resolves __builtin_constant_p() to false even if the expression is
constant in the end. The only purpose of that expression was to
differentiate a case where the following expression couldn't be checked
at compile-time, so we can just remove the check.
Clang handles the following two correctly. Turn it into BUG_ON if there
are any more problems with this.
Fixes: d6321d4933 ("KVM: x86: generalize guest_cpuid_has_ helpers")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
When user space sets kvm_run->immediate_exit, KVM is supposed to
return quickly. However, when a vCPU is in KVM_MP_STATE_UNINITIALIZED,
the value is not considered and the vCPU blocks.
Fix that oversight.
Fixes: 460df4c1fc ("KVM: race-free exit from KVM_RUN without POSIX signals")
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
KVM API says that KVM_RUN will return with -EINTR when a signal is
pending. However, if a vCPU is in KVM_MP_STATE_UNINITIALIZED, then
the return value is unconditionally -EAGAIN.
Copy over some code from vcpu_run(), so that the case of a pending
signal results in the expected return value.
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
The commit
9dd21e104bc ('KVM: x86: simplify handling of PKRU')
removed all users and providers of that call-back, but
didn't remove it. Remove it now.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Commits:
7dcf90e9e0 ("PCI: hv: Use vPCI protocol version 1.2")
628f54cc64 ("x86/hyper-v: Support extended CPU ranges for TLB flush hypercalls")
added the same definition and they came in through different trees.
Fix the duplication.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: devel@linuxdriverproject.org
Link: http://lkml.kernel.org/r/20170911150620.3998-1-vkuznets@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
cpu_init() is weird: it's called rather late (after early
identification and after most MMU state is initialized) on the boot
CPU but is called extremely early (before identification) on secondary
CPUs. It's called just late enough on the boot CPU that its CR4 value
isn't propagated to mmu_cr4_features.
Even if we put CR4.PCIDE into mmu_cr4_features, we'd hit two
problems. First, we'd crash in the trampoline code. That's
fixable, and I tried that. It turns out that mmu_cr4_features is
totally ignored by secondary_start_64(), though, so even with the
trampoline code fixed, it wouldn't help.
This means that we don't currently have CR4.PCIDE reliably initialized
before we start playing with cpu_tlbstate. This is very fragile and
tends to cause boot failures if I make even small changes to the TLB
handling code.
Make it more robust: initialize CR4.PCIDE earlier on the boot CPU
and propagate it to secondary CPUs in start_secondary().
( Yes, this is ugly. I think we should have improved mmu_cr4_features
to actually control CR4 during secondary bootup, but that would be
fairly intrusive at this stage. )
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reported-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Tested-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: 660da7c922 ("x86/mm: Enable CR4.PCIDE on supported systems")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Jiri reported a resume-from-hibernation failure triggered by PCID.
The root cause appears to be rather odd. The hibernation asm
restores a CR3 value that comes from the image header. If the image
kernel has PCID on, it's entirely reasonable for this CR3 value to
have one of the low 12 bits set. The restore code restores it with
CR4.PCIDE=0, which means that those low 12 bits are accepted by the
CPU but are either ignored or interpreted as a caching mode. This
is odd, but still works. We blow up later when the image kernel
restores CR4, though, since changing CR4.PCIDE with CR3[11:0] != 0
is illegal. Boom!
FWIW, it's entirely unclear to me what's supposed to happen if a PAE
kernel restores a non-PAE image or vice versa. Ditto for LA57.
Reported-by: Jiri Kosina <jikos@kernel.org>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 660da7c922 ("x86/mm: Enable CR4.PCIDE on supported systems")
Link: http://lkml.kernel.org/r/18ca57090651a6341e97083883f9e814c4f14684.1504847163.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If we hit the VM_BUG_ON(), we're detecting a genuinely bad situation,
but we're very unlikely to get a useful call trace.
Make it a warning instead.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/3b4e06bbb382ca54a93218407c93925ff5871546.1504847163.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 fixes from Ingo Molnar:
"Two fixes: dead code removal, plus a SME memory encryption fix on
32-bit kernels that crashed Xen guests"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Remove unused and undefined __generic_processor_info() declaration
x86/mm: Make the SME mask a u64
Pull namespace updates from Eric Biederman:
"Life has been busy and I have not gotten half as much done this round
as I would have liked. I delayed it so that a minor conflict
resolution with the mips tree could spend a little time in linux-next
before I sent this pull request.
This includes two long delayed user namespace changes from Kirill
Tkhai. It also includes a very useful change from Serge Hallyn that
allows the security capability attribute to be used inside of user
namespaces. The practical effect of this is people can now untar
tarballs and install rpms in user namespaces. It had been suggested to
generalize this and encode some of the namespace information
information in the xattr name. Upon close inspection that makes the
things that should be hard easy and the things that should be easy
more expensive.
Then there is my bugfix/cleanup for signal injection that removes the
magic encoding of the siginfo union member from the kernel internal
si_code. The mips folks reported the case where I had used FPE_FIXME
me is impossible so I have remove FPE_FIXME from mips, while at the
same time including a return statement in that case to keep gcc from
complaining about unitialized variables.
I almost finished the work to get make copy_siginfo_to_user a trivial
copy to user. The code is available at:
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git neuter-copy_siginfo_to_user-v3
But I did not have time/energy to get the code posted and reviewed
before the merge window opened.
I was able to see that the security excuse for just copying fields
that we know are initialized doesn't work in practice there are buggy
initializations that don't initialize the proper fields in siginfo. So
we still sometimes copy unitialized data to userspace"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
Introduce v3 namespaced file capabilities
mips/signal: In force_fcr31_sig return in the impossible case
signal: Remove kernel interal si_code magic
fcntl: Don't use ambiguous SIG_POLL si_codes
prctl: Allow local CAP_SYS_ADMIN changing exe_file
security: Use user_namespace::level to avoid redundant iterations in cap_capable()
userns,pidns: Verify the userns for new pid namespaces
signal/testing: Don't look for __SI_FAULT in userspace
signal/mips: Document a conflict with SI_USER with SIGFPE
signal/sparc: Document a conflict with SI_USER with SIGFPE
signal/ia64: Document a conflict with SI_USER with SIGFPE
signal/alpha: Document a conflict with SI_USER for SIGTRAP
* Media error handling support in the Block Translation Table (BTT)
driver is reworked to address sleeping-while-atomic locking and
memory-allocation-context conflicts.
* The dax_device lookup overhead for xfs and ext4 is moved out of the
iomap hot-path to a mount-time lookup.
* A new 'ecc_unit_size' sysfs attribute is added to advertise the
read-modify-write boundary property of a persistent memory range.
* Preparatory fix-ups for arm and powerpc pmem support are included
along with other miscellaneous fixes.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZtsAGAAoJEB7SkWpmfYgCrzMP/2vPvZvrFjZn5pAoZjlmTmHM
ySceoOC7vwvVXIsSs52FhSjcxEoXo9cklXPwhXOPVtVUFdSDJBUOIUxwIziE6Y+5
sFJ2xT9K+5zKBUiXJwqFQDg52dn//eBNnnnDz+HQrBSzGrbWQhIZY2m19omPzv1I
BeN0OCGOdW3cjSo3BCFl1d+KrSl704e7paeKq/TO3GIiAilIXleTVxcefEEodV2K
ZvWHpFIhHeyN8dsF8teI952KcCT92CT/IaabxQIwCxX0/8/GFeDc5aqf77qiYWKi
uxCeQXdgnaE8EZNWZWGWIWul6eYEkoCNbLeUQ7eJnECq61VxVajJS0NyGa5T9OiM
P046Bo2b1b3R0IHxVIyVG0ZCm3YUMAHSn/3uRxPgESJ4bS/VQ3YP5M6MLxDOlc90
IisLilagitkK6h8/fVuVrwciRNQ71XEC34t6k7GCl/1ZnLlLT+i4/jc5NRZnGEZh
aXAAGHdteQ+/mSz6p2UISFUekbd6LerwzKRw8ibDvH6pTud8orYR7g2+JoGhgb6Y
pyFVE8DhIcqNKAMxBsjiRZ46OQ7qrT+AemdAG3aVv6FaNoe4o5jPLdw2cEtLqtpk
+DNm0/lSWxxxozjrvu6EUZj6hk8R5E19XpRzV5QJkcKUXMu7oSrFLdMcC4FeIjl9
K4hXLV3fVBVRMiS0RA6z
=5iGY
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm from Dan Williams:
"A rework of media error handling in the BTT driver and other updates.
It has appeared in a few -next releases and collected some late-
breaking build-error and warning fixups as a result.
Summary:
- Media error handling support in the Block Translation Table (BTT)
driver is reworked to address sleeping-while-atomic locking and
memory-allocation-context conflicts.
- The dax_device lookup overhead for xfs and ext4 is moved out of the
iomap hot-path to a mount-time lookup.
- A new 'ecc_unit_size' sysfs attribute is added to advertise the
read-modify-write boundary property of a persistent memory range.
- Preparatory fix-ups for arm and powerpc pmem support are included
along with other miscellaneous fixes"
* tag 'libnvdimm-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (26 commits)
libnvdimm, btt: fix format string warnings
libnvdimm, btt: clean up warning and error messages
ext4: fix null pointer dereference on sbi
libnvdimm, nfit: move the check on nd_reserved2 to the endpoint
dax: fix FS_DAX=n BLOCK=y compilation
libnvdimm: fix integer overflow static analysis warning
libnvdimm, nd_blk: remove mmio_flush_range()
libnvdimm, btt: rework error clearing
libnvdimm: fix potential deadlock while clearing errors
libnvdimm, btt: cache sector_size in arena_info
libnvdimm, btt: ensure that flags were also unchanged during a map_read
libnvdimm, btt: refactor map entry operations with macros
libnvdimm, btt: fix a missed NVDIMM_IO_ATOMIC case in the write path
libnvdimm, nfit: export an 'ecc_unit_size' sysfs attribute
ext4: perform dax_device lookup at mount
ext2: perform dax_device lookup at mount
xfs: perform dax_device lookup at mount
dax: introduce a fs_dax_get_by_bdev() helper
libnvdimm, btt: check memory allocation failure
libnvdimm, label: fix index block size calculation
...
The following revert:
2b85b3d229 ("x86/acpi: Restore the order of CPU IDs")
... got rid of __generic_processor_info(), but forgot to remove its
declaration in mpspec.h.
Remove the declaration and update the comments as well.
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: lenb@kernel.org
Link: http://lkml.kernel.org/r/1505101403-29100-1-git-send-email-douly.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I've been staring at the word PCID too long.
Fixes: f13c8e8c58ba ("x86/mm: Reinitialize TLB state on hotplug and resume")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull i2c updates from Wolfram Sang:
- new drivers for Spreadtrum I2C, Intel Cherry Trail Whiskey Cove SMBUS
- quite some driver updates
- cleanups for the i2c-mux subsystem
- some subsystem-wide constification
- further cleanup of include/linux/i2c
* 'i2c/for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (50 commits)
i2c: sprd: Fix undefined reference errors
i2c: nomadik: constify amba_id
i2c: versatile: Make i2c_algo_bit_data const
i2c: busses: make i2c_adapter_quirks const
i2c: busses: make i2c_adapter const
i2c: busses: make i2c_algorithm const
i2c: Add Spreadtrum I2C controller driver
dt-bindings: i2c: Add Spreadtrum I2C controller documentation
i2c-cht-wc: make cht_wc_i2c_adap_driver static
MAINTAINERS: Add entry for drivers/i2c/busses/i2c-cht-wc.c
i2c: aspeed: Retain delay/setup/hold values when configuring bus frequency
dt-bindings: i2c: eeprom: Document vendor to be used and deprecated ones
i2c: i801: Restore the presence state of P2SB PCI device after reading BAR
MAINTAINERS: drop entry for Blackfin I2C and Sonic's email
blackfin: merge the two TWI header files
i2c: davinci: Preserve return value of devm_clk_get
i2c: mediatek: Add i2c compatible for MediaTek MT7622
dt-bindings: i2c: Add MediaTek MT7622 i2c binding
dt-bindings: i2c: modify information formats
i2c: mux: i2c-arb-gpio-challenge: allow compiling w/o OF support
...
Merge more updates from Andrew Morton:
- most of the rest of MM
- a small number of misc things
- lib/ updates
- checkpatch
- autofs updates
- ipc/ updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (126 commits)
ipc: optimize semget/shmget/msgget for lots of keys
ipc/sem: play nicer with large nsops allocations
ipc/sem: drop sem_checkid helper
ipc: convert kern_ipc_perm.refcount from atomic_t to refcount_t
ipc: convert sem_undo_list.refcnt from atomic_t to refcount_t
ipc: convert ipc_namespace.count from atomic_t to refcount_t
kcov: support compat processes
sh: defconfig: cleanup from old Kconfig options
mn10300: defconfig: cleanup from old Kconfig options
m32r: defconfig: cleanup from old Kconfig options
drivers/pps: use surrounding "if PPS" to remove numerous dependency checks
drivers/pps: aesthetic tweaks to PPS-related content
cpumask: make cpumask_next() out-of-line
kmod: move #ifdef CONFIG_MODULES wrapper to Makefile
kmod: split off umh headers into its own file
MAINTAINERS: clarify kmod is just a kernel module loader
kmod: split out umh code into its own file
test_kmod: flip INT checks to be consistent
test_kmod: remove paranoid UINT_MAX check on uint range processing
vfat: deduplicate hex2bin()
...
First, number of CPUs can't be negative number.
Second, different signnnedness leads to suboptimal code in the following
cases:
1)
kmalloc(nr_cpu_ids * sizeof(X));
"int" has to be sign extended to size_t.
2)
while (loff_t *pos < nr_cpu_ids)
MOVSXD is 1 byte longed than the same MOV.
Other cases exist as well. Basically compiler is told that nr_cpu_ids
can't be negative which can't be deduced if it is "int".
Code savings on allyesconfig kernel: -3KB
add/remove: 0/0 grow/shrink: 25/264 up/down: 261/-3631 (-3370)
function old new delta
coretemp_cpu_online 450 512 +62
rcu_init_one 1234 1272 +38
pci_device_probe 374 399 +25
...
pgdat_reclaimable_pages 628 556 -72
select_fallback_rq 446 369 -77
task_numa_find_cpu 1923 1807 -116
Link: http://lkml.kernel.org/r/20170819114959.GA30580@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These are single instructions on x86. There's no 64-bit instruction for
x86-32, but we don't yet have any user for memset64() on 32-bit
architectures, so don't bother to implement it.
Link: http://lkml.kernel.org/r/20170720184539.31609-4-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: David Miller <davem@davemloft.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Russell King <rmk+kernel@armlinux.org.uk>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are new users of memory hotplug emerging. Some of them require
different subset of arch_add_memory. There are some which only require
allocation of struct pages without mapping those pages to the kernel
address space. We currently have __add_pages for that purpose. But this
is rather lowlevel and not very suitable for the code outside of the
memory hotplug. E.g. x86_64 wants to update max_pfn which should be done
by the caller. Introduce add_pages() which should care about those
details if they are needed. Each architecture should define its
implementation and select CONFIG_ARCH_HAS_ADD_PAGES. All others use the
currently existing __add_pages.
Link: http://lkml.kernel.org/r/20170817000548.32038-7-jglisse@redhat.com
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Sherry Cheung <SCheung@nvidia.com>
Cc: Subhash Gutti <sgutti@nvidia.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Bob Liu <liubo95@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Soft dirty bit is designed to keep tracked over page migration. This
patch makes it work in the same manner for thp migration too.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add thp migration's core code, including conversions between a PMD entry
and a swap entry, setting PMD migration entry, removing PMD migration
entry, and waiting on PMD migration entries.
This patch makes it possible to support thp migration. If you fail to
allocate a destination page as a thp, you just split the source thp as
we do now, and then enter the normal page migration. If you succeed to
allocate destination thp, you enter thp migration. Subsequent patches
actually enable thp migration for each caller of page migration by
allowing its get_new_page() callback to allocate thps.
[zi.yan@cs.rutgers.edu: fix gcc-4.9.0 -Wmissing-braces warning]
Link: http://lkml.kernel.org/r/A0ABA698-7486-46C3-B209-E95A9048B22C@cs.rutgers.edu
[akpm@linux-foundation.org: fix x86_64 allnoconfig warning]
Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce CONFIG_ARCH_ENABLE_THP_MIGRATION to limit thp migration
functionality to x86_64, which should be safer at the first step.
Link: http://lkml.kernel.org/r/20170717193955.20207-5-zi.yan@sent.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
_PAGE_PSE is used to distinguish between a truly non-present
(_PAGE_PRESENT=0) PMD, and a PMD which is undergoing a THP split and
should be treated as present.
But _PAGE_SWP_SOFT_DIRTY currently uses the _PAGE_PSE bit, which would
cause confusion between one of those PMDs undergoing a THP split, and a
soft-dirty PMD. Dropping _PAGE_PSE check in pmd_present() does not work
well, because it can hurt optimization of tlb handling in thp split.
Thus, we need to move the bit.
In the current kernel, bits 1-4 are not used in non-present format since
commit 00839ee3b2 ("x86/mm: Move swap offset/type up in PTE to work
around erratum"). So let's move _PAGE_SWP_SOFT_DIRTY to bit 1. Bit 7
is used as reserved (always clear), so please don't use it for other
purpose.
Link: http://lkml.kernel.org/r/20170717193955.20207-3-zi.yan@sent.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJZsr8cAAoJEFmIoMA60/r8lXYQAKViYIRMJDD4n3NhjMeLOsnJ
vwaBmWlLRjSFIEpag5kMjS1RJE17qAvmkBZnDvSNZ6cT28INkkZnVM2IW96WECVq
64MIvDijVPcvqGuWePCfWdDiSXApiDWwJuw55BOhmvV996wGy0gYgzpPY+1g0Knh
XzH9IOzDL79hZleLfsxX0MLV6FGBVtOsr0jvQ04k4IgEMIxEDTlbw85rnrvzQUtc
0Vj2koaxWIESZsq7G/wiZb2n6ekaFdXO/VlVvvhmTSDLCBaJ63Hb/gfOhwMuVkS6
B3cVprNrCT0dSzWmU4ZXf+wpOyDpBexlemW/OR/6CQUkC6AUS6kQ5si1X44dbGmJ
nBPh414tdlm/6V4h/A3UFPOajSGa/ZWZ/uQZPfvKs1R6WfjUerWVBfUpAzPbgjam
c/mhJ19HYT1J7vFBfhekBMeY2Px3JgSJ9rNsrFl48ynAALaX5GEwdpo4aqBfscKz
4/f9fU4ysumopvCEuKD2SsJvsPKd5gMQGGtvAhXM1TxvAoQ5V4cc99qEetAPXXPf
h2EqWm4ph7YP4a+n/OZBjzluHCmZJn1CntH5+//6wpUk6HnmzsftGELuO9n12cLE
GGkreI3T9ctV1eOkzVVa0l0QTE1X/VLyEyKCtb9obXsDaG4Ud7uKQoZgB19DwyTJ
EG76ridTolUFVV+wzJD9
=9cLP
-----END PGP SIGNATURE-----
Merge tag 'pci-v4.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI updates from Bjorn Helgaas:
- add enhanced Downstream Port Containment support, which prints more
details about Root Port Programmed I/O errors (Dongdong Liu)
- add Layerscape ls1088a and ls2088a support (Hou Zhiqiang)
- add MediaTek MT2712 and MT7622 support (Ryder Lee)
- add MediaTek MT2712 and MT7622 MSI support (Honghui Zhang)
- add Qualcom IPQ8074 support (Varadarajan Narayanan)
- add R-Car r8a7743/5 device tree support (Biju Das)
- add Rockchip per-lane PHY support for better power management (Shawn
Lin)
- fix IRQ mapping for hot-added devices by replacing the
pci_fixup_irqs() boot-time design with a host bridge hook called at
probe-time (Lorenzo Pieralisi, Matthew Minter)
- fix race when enabling two devices that results in upstream bridge
not being enabled correctly (Srinath Mannam)
- fix pciehp power fault infinite loop (Keith Busch)
- fix SHPC bridge MSI hotplug events by enabling bus mastering
(Aleksandr Bezzubikov)
- fix a VFIO issue by correcting PCIe capability sizes (Alex
Williamson)
- fix an INTD issue on Xilinx and possibly other drivers by unifying
INTx IRQ domain support (Paul Burton)
- avoid IOMMU stalls by marking AMD Stoney GPU ATS as broken (Joerg
Roedel)
- allow APM X-Gene device assignment to guests by adding an ACS quirk
(Feng Kan)
- fix driver crashes by disabling Extended Tags on Broadcom HT2100
(Extended Tags support is required for PCIe Receivers but not
Requesters, and we now enable them by default when Requesters support
them) (Sinan Kaya)
- fix MSIs for devices that use phantom RIDs for DMA by assuming MSIs
use the real Requester ID (not a phantom RID) (Robin Murphy)
- prevent assignment of Intel VMD children to guests (which may be
supported eventually, but isn't yet) by not associating an IOMMU with
them (Jon Derrick)
- fix Intel VMD suspend/resume by releasing IRQs on suspend (Scott
Bauer)
- fix a Function-Level Reset issue with Intel 750 NVMe by waiting
longer (up to 60sec instead of 1sec) for device to become ready
(Sinan Kaya)
- fix a Function-Level Reset issue on iProc Stingray by working around
hardware defects in the CRS implementation (Oza Pawandeep)
- fix an issue with Intel NVMe P3700 after an iProc reset by adding a
delay during shutdown (Oza Pawandeep)
- fix a Microsoft Hyper-V lockdep issue by polling instead of blocking
in compose_msi_msg() (Stephen Hemminger)
- fix a wireless LAN driver timeout by clearing DesignWare MSI
interrupt status after it is handled, not before (Faiz Abbas)
- fix DesignWare ATU enable checking (Jisheng Zhang)
- reduce Layerscape dependencies on the bootloader by doing more
initialization in the driver (Hou Zhiqiang)
- improve Intel VMD performance allowing allocation of more IRQ vectors
than present CPUs (Keith Busch)
- improve endpoint framework support for initial DMA mask, different
BAR sizes, configurable page sizes, MSI, test driver, etc (Kishon
Vijay Abraham I, Stan Drozd)
- rework CRS support to add periodic messages while we poll during
enumeration and after Function-Level Reset and prepare for possible
other uses of CRS (Sinan Kaya)
- clean up Root Port AER handling by removing unnecessary code and
moving error handler methods to struct pcie_port_service_driver
(Christoph Hellwig)
- clean up error handling paths in various drivers (Bjorn Andersson,
Fabio Estevam, Gustavo A. R. Silva, Harunobu Kurokawa, Jeffy Chen,
Lorenzo Pieralisi, Sergei Shtylyov)
- clean up SR-IOV resource handling by disabling VF decoding before
updating the corresponding resource structs (Gavin Shan)
- clean up DesignWare-based drivers by unifying quirks to update Class
Code and Interrupt Pin and related handling of write-protected
registers (Hou Zhiqiang)
- clean up by adding empty generic pcibios_align_resource() and
pcibios_fixup_bus() and removing empty arch-specific implementations
(Palmer Dabbelt)
- request exclusive reset control for several drivers to allow cleanup
elsewhere (Philipp Zabel)
- constify various structures (Arvind Yadav, Bhumika Goyal)
- convert from full_name() to %pOF (Rob Herring)
- remove unused variables from iProc, HiSi, Altera, Keystone (Shawn
Lin)
* tag 'pci-v4.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (170 commits)
PCI: xgene: Clean up whitespace
PCI: xgene: Define XGENE_PCI_EXP_CAP and use generic PCI_EXP_RTCTL offset
PCI: xgene: Fix platform_get_irq() error handling
PCI: xilinx-nwl: Fix platform_get_irq() error handling
PCI: rockchip: Fix platform_get_irq() error handling
PCI: altera: Fix platform_get_irq() error handling
PCI: spear13xx: Fix platform_get_irq() error handling
PCI: artpec6: Fix platform_get_irq() error handling
PCI: armada8k: Fix platform_get_irq() error handling
PCI: dra7xx: Fix platform_get_irq() error handling
PCI: exynos: Fix platform_get_irq() error handling
PCI: iproc: Clean up whitespace
PCI: iproc: Rename PCI_EXP_CAP to IPROC_PCI_EXP_CAP
PCI: iproc: Add 500ms delay during device shutdown
PCI: Fix typos and whitespace errors
PCI: Remove unused "res" variable from pci_resource_io()
PCI: Correct kernel-doc of pci_vpd_srdt_size(), pci_vpd_srdt_tag()
PCI/AER: Reformat AER register definitions
iommu/vt-d: Prevent VMD child devices from being remapping targets
x86/PCI: Use is_vmd() rather than relying on the domain number
...
Common:
- improve heuristic for boosting preempted spinlocks by ignoring VCPUs
in user mode
ARM:
- fix for decoding external abort types from guests
- added support for migrating the active priority of interrupts when
running a GICv2 guest on a GICv3 host
- minor cleanup
PPC:
- expose storage keys to userspace
- merge powerpc/topic/ppc-kvm branch that contains
find_linux_pte_or_hugepte and POWER9 thread management cleanup
- merge kvm-ppc-fixes with a fix that missed 4.13 because of vacations
- fixes
s390:
- merge of topic branch tlb-flushing from the s390 tree to get the
no-dat base features
- merge of kvm/master to avoid conflicts with additional sthyi fixes
- wire up the no-dat enhancements in KVM
- multiple epoch facility (z14 feature)
- Configuration z/Architecture Mode
- more sthyi fixes
- gdb server range checking fix
- small code cleanups
x86:
- emulate Hyper-V TSC frequency MSRs
- add nested INVPCID
- emulate EPTP switching VMFUNC
- support Virtual GIF
- support 5 level page tables
- speedup nested VM exits by packing byte operations
- speedup MMIO by using hardware provided physical address
- a lot of fixes and cleanups, especially nested
-----BEGIN PGP SIGNATURE-----
iQEcBAABCAAGBQJZspE1AAoJEED/6hsPKofoDcMIALT11n+LKV50QGwQdg2W1GOt
aChbgnj/Kegit3hQlDhVNb8kmdZEOZzSL81Lh0VPEr7zXU8QiWn2snbizDPv8sde
MpHhcZYZZ0YrpoiZKjl8yiwcu88OWGn2qtJ7OpuTS5hvEGAfxMncp0AMZho6fnz/
ySTwJ9GK2MTgBw39OAzCeDOeoYn4NKYMwjJGqBXRhNX8PG/1wmfqv0vPrd6wfg31
KJ58BumavwJjr8YbQ1xELm9rpQrAmaayIsG0R1dEUqCbt5a1+t2gt4h2uY7tWcIv
ACt2bIze7eF3xA+OpRs+eT+yemiH3t9btIVmhCfzUpnQ+V5Z55VMSwASLtTuJRQ=
=R8Ry
-----END PGP SIGNATURE-----
Merge tag 'kvm-4.14-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Radim Krčmář:
"First batch of KVM changes for 4.14
Common:
- improve heuristic for boosting preempted spinlocks by ignoring
VCPUs in user mode
ARM:
- fix for decoding external abort types from guests
- added support for migrating the active priority of interrupts when
running a GICv2 guest on a GICv3 host
- minor cleanup
PPC:
- expose storage keys to userspace
- merge kvm-ppc-fixes with a fix that missed 4.13 because of
vacations
- fixes
s390:
- merge of kvm/master to avoid conflicts with additional sthyi fixes
- wire up the no-dat enhancements in KVM
- multiple epoch facility (z14 feature)
- Configuration z/Architecture Mode
- more sthyi fixes
- gdb server range checking fix
- small code cleanups
x86:
- emulate Hyper-V TSC frequency MSRs
- add nested INVPCID
- emulate EPTP switching VMFUNC
- support Virtual GIF
- support 5 level page tables
- speedup nested VM exits by packing byte operations
- speedup MMIO by using hardware provided physical address
- a lot of fixes and cleanups, especially nested"
* tag 'kvm-4.14-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (67 commits)
KVM: arm/arm64: Support uaccess of GICC_APRn
KVM: arm/arm64: Extract GICv3 max APRn index calculation
KVM: arm/arm64: vITS: Drop its_ite->lpi field
KVM: arm/arm64: vgic: constify seq_operations and file_operations
KVM: arm/arm64: Fix guest external abort matching
KVM: PPC: Book3S HV: Fix memory leak in kvm_vm_ioctl_get_htab_fd
KVM: s390: vsie: cleanup mcck reinjection
KVM: s390: use WARN_ON_ONCE only for checking
KVM: s390: guestdbg: fix range check
KVM: PPC: Book3S HV: Report storage key support to userspace
KVM: PPC: Book3S HV: Fix case where HDEC is treated as 32-bit on POWER9
KVM: PPC: Book3S HV: Fix invalid use of register expression
KVM: PPC: Book3S HV: Fix H_REGISTER_VPA VPA size validation
KVM: PPC: Book3S HV: Fix setting of storage key in H_ENTER
KVM: PPC: e500mc: Fix a NULL dereference
KVM: PPC: e500: Fix some NULL dereferences on error
KVM: PPC: Book3S HV: Protect updates to spapr_tce_tables list
KVM: s390: we are always in czam mode
KVM: s390: expose no-DAT to guest and migration support
KVM: s390: sthyi: remove invalid guest write access
...
This fix was intended for 4.13, but didn't get in because both
maintainers were on vacation.
Paul Mackerras:
"It adds mutual exclusion between list_add_rcu and list_del_rcu calls
on the kvm->arch.spapr_tce_tables list. Without this, userspace could
potentially trigger corruption of the list and cause a host crash or
worse."
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABAgAGBQJZsBbYAAoJELDendYovxMv4hoH/39psrSeHw2hPX78KJ6orq4v
mTVEP2gLA/qxaM03EnFljXfd88J8NcJsxv7vVjh/U4xRwntvAMovCkygkkO1aw93
nZEhUq6IGupr8KzmqQi5U7WtiWAXFwDbGSasnOKEj/lLa7E0/9MsYYQ01FS6oFkc
c9CHONaCWepdz0Xpt7s6BKyzo74ZbJeCc5rUZU81oH40XphaZEoy8E9NOgDdfz3l
VvPSaxZvebynT8JKDe4KxrMPpBjhr7mwgLcXk/Zy2EzOzxFSxXLsDAnwjtCW1gTh
lPLD4TkgtziDfPfZXxFH3J34IUe1tZ2M+7Cz157FBu6BKX/g9ETQT24DXWDzFuI=
=cgfV
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.14b-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:
- the new pvcalls backend for routing socket calls from a guest to dom0
- some cleanups of Xen code
- a fix for wrong usage of {get,put}_cpu()
* tag 'for-linus-4.14b-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (27 commits)
xen/mmu: set MMU_NORMAL_PT_UPDATE in remap_area_mfn_pte_fn
xen: Don't try to call xen_alloc_p2m_entry() on autotranslating guests
xen/events: events_fifo: Don't use {get,put}_cpu() in xen_evtchn_fifo_init()
xen/pvcalls: use WARN_ON(1) instead of __WARN()
xen: remove not used trace functions
xen: remove unused function xen_set_domain_pte()
xen: remove tests for pvh mode in pure pv paths
xen-platform: constify pci_device_id.
xen: cleanup xen.h
xen: introduce a Kconfig option to enable the pvcalls backend
xen/pvcalls: implement write
xen/pvcalls: implement read
xen/pvcalls: implement the ioworker functions
xen/pvcalls: disconnect and module_exit
xen/pvcalls: implement release command
xen/pvcalls: implement poll command
xen/pvcalls: implement accept command
xen/pvcalls: implement listen command
xen/pvcalls: implement bind command
xen/pvcalls: implement connect command
...
Pull EFI updates from Ingo Molnar:
"The main changes in this cycle were:
- Transparently fall back to other poweroff method(s) if EFI poweroff
fails (and returns)
- Use separate PE/COFF section headers for the RX and RW parts of the
ARM stub loader so that the firmware can use strict mapping
permissions
- Add support for requesting the firmware to wipe RAM at warm reboot
- Increase the size of the random seed obtained from UEFI so CRNG
fast init can complete earlier
- Update the EFI framebuffer address if it points to a BAR that gets
moved by the PCI resource allocation code
- Enable "reset attack mitigation" of TPM environments: this is
enabled if the kernel is configured with
CONFIG_RESET_ATTACK_MITIGATION=y.
- Clang related fixes
- Misc cleanups, constification, refactoring, etc"
* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi/bgrt: Use efi_mem_type()
efi: Move efi_mem_type() to common code
efi/reboot: Make function pointer orig_pm_power_off static
efi/random: Increase size of firmware supplied randomness
efi/libstub: Enable reset attack mitigation
firmware/efi/esrt: Constify attribute_group structures
firmware/efi: Constify attribute_group structures
firmware/dcdbas: Constify attribute_group structures
arm/efi: Split zImage code and data into separate PE/COFF sections
arm/efi: Replace open coded constants with symbolic ones
arm/efi: Remove pointless dummy .reloc section
arm/efi: Remove forbidden values from the PE/COFF header
drivers/fbdev/efifb: Allow BAR to be moved instead of claiming it
efi/reboot: Fall back to original power-off method if EFI_RESET_SHUTDOWN returns
efi/arm/arm64: Add missing assignment of efi.config_table
efi/libstub/arm64: Set -fpie when building the EFI stub
efi/libstub/arm64: Force 'hidden' visibility for section markers
efi/libstub/arm64: Use hidden attribute for struct screen_info reference
efi/arm: Don't mark ACPI reclaim memory as MEMBLOCK_NOMAP
Pull x86 platform updates from Ingo Molnar:
"The main changes include various Hyper-V optimizations such as faster
hypercalls and faster/better TLB flushes - and there's also some
Intel-MID cleanups"
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tracing/hyper-v: Trace hyperv_mmu_flush_tlb_others()
x86/hyper-v: Support extended CPU ranges for TLB flush hypercalls
x86/platform/intel-mid: Make several arrays static, to make code smaller
MAINTAINERS: Add missed file for Hyper-V
x86/hyper-v: Use hypercall for remote TLB flush
hyper-v: Globalize vp_index
x86/hyper-v: Implement rep hypercalls
hyper-v: Use fast hypercall for HVCALL_SIGNAL_EVENT
x86/hyper-v: Introduce fast hypercall implementation
x86/hyper-v: Make hv_do_hypercall() inline
x86/hyper-v: Include hyperv/ only when CONFIG_HYPERV is set
x86/platform/intel-mid: Make 'bt_sfi_data' const
x86/platform/intel-mid: Make IRQ allocation a bit more flexible
x86/platform/intel-mid: Group timers callbacks together
The SME encryption mask is for masking 64-bit pagetable entries. It
being an unsigned long works fine on X86_64 but on 32-bit builds in
truncates bits leading to Xen guests crashing very early.
And regardless, the whole SME mask handling shouldnt've leaked into
32-bit because SME is X86_64-only feature. So, first make the mask u64.
And then, add trivial 32-bit versions of the __sme_* macros so that
nothing happens there.
Reported-and-tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Tested-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Tom Lendacky <Thomas.Lendacky@amd.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas <Thomas.Lendacky@amd.com>
Fixes: 21729f81ce ("x86/mm: Provide general kernel support for memory encryption")
Link: http://lkml.kernel.org/r/20170907093837.76zojtkgebwtqc74@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge updates from Andrew Morton:
- various misc bits
- DAX updates
- OCFS2
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (119 commits)
mm,fork: introduce MADV_WIPEONFORK
x86,mpx: make mpx depend on x86-64 to free up VMA flag
mm: add /proc/pid/smaps_rollup
mm: hugetlb: clear target sub-page last when clearing huge page
mm: oom: let oom_reap_task and exit_mmap run concurrently
swap: choose swap device according to numa node
mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
mm, oom: do not rely on TIF_MEMDIE for memory reserves access
z3fold: use per-cpu unbuddied lists
mm, swap: don't use VMA based swap readahead if HDD is used as swap
mm, swap: add sysfs interface for VMA based swap readahead
mm, swap: VMA based swap readahead
mm, swap: fix swap readahead marking
mm, swap: add swap readahead hit statistics
mm/vmalloc.c: don't reinvent the wheel but use existing llist API
mm/vmstat.c: fix wrong comment
selftests/memfd: add memfd_create hugetlbfs selftest
mm/shmem: add hugetlbfs support to memfd_create()
mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups
mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas()
...
While debugging a problem, I thought that using
cr4_set_bits_and_update_boot() to restore CR4.PCIDE would be
helpful. It turns out to be counterproductive.
Add a comment documenting how this works.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When Linux brings a CPU down and back up, it switches to init_mm and then
loads swapper_pg_dir into CR3. With PCID enabled, this has the side effect
of masking off the ASID bits in CR3.
This can result in some confusion in the TLB handling code. If we
bring a CPU down and back up with any ASID other than 0, we end up
with the wrong ASID active on the CPU after resume. This could
cause our internal state to become corrupt, although major
corruption is unlikely because init_mm doesn't have any user pages.
More obviously, if CONFIG_DEBUG_VM=y, we'll trip over an assertion
in the next context switch. The result of *that* is a failure to
resume from suspend with probability 1 - 1/6^(cpus-1).
Fix it by reinitializing cpu_tlbstate on resume and CPU bringup.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Jiri Kosina <jikos@kernel.org>
Fixes: 10af6235e0 ("x86/mm: Implement PCID based optimization: try to preserve old TLB entries using PCID")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>