linux_dsm_epyc7002

mirror of https://github.com/AuxXxilium/linux_dsm_epyc7002.git synced 2024-12-28 11:18:45 +07:00

Author	SHA1	Message	Date
Jim Mattson	fb6c819843	kvm: vmx: Flush TLB when the APIC-access address changes Quoting from the Intel SDM, volume 3, section 28.3.3.4: Guidelines for Use of the INVEPT Instruction: If EPT was in use on a logical processor at one time with EPTP X, it is recommended that software use the INVEPT instruction with the "single-context" INVEPT type and with EPTP X in the INVEPT descriptor before a VM entry on the same logical processor that enables EPT with EPTP X and either (a) the "virtualize APIC accesses" VM-execution control was changed from 0 to 1; or (b) the value of the APIC-access address was changed. In the nested case, the burden falls on L1, unless L0 enables EPT in vmcs02 when L1 doesn't enable EPT in vmcs12. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-03-23 19:02:06 +01:00
Peter Xu	c761159cf8	KVM: x86: use pic/ioapic destructor when destroy vm We have specific destructors for pic/ioapic, we'd better use them when destroying the VM as well. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-03-23 19:02:06 +01:00
Peter Xu	950712eb8e	KVM: x86: check existance before destroy Mostly used for split irqchip mode. In that case, these two things are not inited at all, so no need to release. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-03-23 19:02:03 +01:00
Wanpeng Li	6d1b3ad2cd	KVM: nVMX: don't reset kvm mmu twice kvm mmu is reset once successfully loading CR3 as part of emulating vmentry in nested_vmx_load_cr3(). We should not reset kvm mmu twice. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-03-20 16:25:06 +01:00
Dmitry Vyukov	3863dff0c3	kvm: fix usage of uninit spinlock in avic_vm_destroy() If avic is not enabled, avic_vm_init() does nothing and returns early. However, avic_vm_destroy() still tries to destroy what hasn't been created. The only bad consequence of this now is that avic_vm_destroy() uses svm_vm_data_hash_lock that hasn't been initialized (and is not meant to be used at all if avic is not enabled). Return early from avic_vm_destroy() if avic is not enabled. It has nothing to destroy. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: kvm@vger.kernel.org Cc: syzkaller@googlegroups.com Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-03-20 16:25:05 +01:00
Radim Krčmář	6c6c5e0311	KVM: VMX: downgrade warning on unexpected exit code We never needed the call trace and we better rate-limit if it can be triggered by a guest. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-03-20 16:25:05 +01:00
Radim Krčmář	05d8d34611	KVM: nVMX: do not warn when MSR bitmap address is not backed Before trying to do nested_get_page() in nested_vmx_merge_msr_bitmap(), we have already checked that the MSR bitmap address is valid (4k aligned and within physical limits). SDM doesn't specify what happens if the there is no memory mapped at the valid address, but Intel CPUs treat the situation as if the bitmap was configured to trap all MSRs. KVM already does that by returning false and a correct handling doesn't need the guest-trigerrable warning that was reported by syzkaller: (The warning was originally there to catch some possible bugs in nVMX.) ------------[ cut here ]------------ WARNING: CPU: 0 PID: 7832 at arch/x86/kvm/vmx.c:9709 nested_vmx_merge_msr_bitmap arch/x86/kvm/vmx.c:9709 [inline] WARNING: CPU: 0 PID: 7832 at arch/x86/kvm/vmx.c:9709 nested_get_vmcs12_pages+0xfb6/0x15c0 arch/x86/kvm/vmx.c:9640 Kernel panic - not syncing: panic_on_warn set ... CPU: 0 PID: 7832 Comm: syz-executor1 Not tainted 4.10.0+ #229 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:15 [inline] dump_stack+0x2ee/0x3ef lib/dump_stack.c:51 panic+0x1fb/0x412 kernel/panic.c:179 __warn+0x1c4/0x1e0 kernel/panic.c:540 warn_slowpath_null+0x2c/0x40 kernel/panic.c:583 nested_vmx_merge_msr_bitmap arch/x86/kvm/vmx.c:9709 [inline] nested_get_vmcs12_pages+0xfb6/0x15c0 arch/x86/kvm/vmx.c:9640 enter_vmx_non_root_mode arch/x86/kvm/vmx.c:10471 [inline] nested_vmx_run+0x6186/0xaab0 arch/x86/kvm/vmx.c:10561 handle_vmlaunch+0x1a/0x20 arch/x86/kvm/vmx.c:7312 vmx_handle_exit+0xfc0/0x3f00 arch/x86/kvm/vmx.c:8526 vcpu_enter_guest arch/x86/kvm/x86.c:6982 [inline] vcpu_run arch/x86/kvm/x86.c:7044 [inline] kvm_arch_vcpu_ioctl_run+0x1418/0x4840 arch/x86/kvm/x86.c:7205 kvm_vcpu_ioctl+0x673/0x1120 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2570 Reported-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> [Jim Mattson explained the bare metal behavior: "I believe this behavior would be documented in the chipset data sheet rather than the SDM, since the chipset returns all 1s for an unclaimed read."] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-03-09 15:34:51 +01:00
Wanpeng Li	2f707d9798	KVM: nVMX: reset nested_run_pending if the vCPU is going to be reset Reported by syzkaller: WARNING: CPU: 1 PID: 27742 at arch/x86/kvm/vmx.c:11029 nested_vmx_vmexit+0x5c35/0x74d0 arch/x86/kvm/vmx.c:11029 CPU: 1 PID: 27742 Comm: a.out Not tainted 4.10.0+ #229 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:15 [inline] dump_stack+0x2ee/0x3ef lib/dump_stack.c:51 panic+0x1fb/0x412 kernel/panic.c:179 __warn+0x1c4/0x1e0 kernel/panic.c:540 warn_slowpath_null+0x2c/0x40 kernel/panic.c:583 nested_vmx_vmexit+0x5c35/0x74d0 arch/x86/kvm/vmx.c:11029 vmx_leave_nested arch/x86/kvm/vmx.c:11136 [inline] vmx_set_msr+0x1565/0x1910 arch/x86/kvm/vmx.c:3324 kvm_set_msr+0xd4/0x170 arch/x86/kvm/x86.c:1099 do_set_msr+0x11e/0x190 arch/x86/kvm/x86.c:1128 __msr_io arch/x86/kvm/x86.c:2577 [inline] msr_io+0x24b/0x450 arch/x86/kvm/x86.c:2614 kvm_arch_vcpu_ioctl+0x35b/0x46a0 arch/x86/kvm/x86.c:3497 kvm_vcpu_ioctl+0x232/0x1120 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2721 vfs_ioctl fs/ioctl.c:43 [inline] do_vfs_ioctl+0x1bf/0x1790 fs/ioctl.c:683 SYSC_ioctl fs/ioctl.c:698 [inline] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:689 entry_SYSCALL_64_fastpath+0x1f/0xc2 The syzkaller folks reported a nested_run_pending warning during userspace clear VMX capability which is exposed to L1 before. The warning gets thrown while doing ((uint32_t)0x20aecfe8 = (uint32_t)0x1); ((uint32_t)0x20aecfec = (uint32_t)0x0); ((uint32_t)0x20aecff0 = (uint32_t)0x3a); ((uint32_t)0x20aecff4 = (uint32_t)0x0); ((uint64_t)0x20aecff8 = (uint64_t)0x0); r[29] = syscall(__NR_ioctl, r[4], 0x4008ae89ul, 0x20aecfe8ul, 0, 0, 0, 0, 0, 0); i.e. KVM_SET_MSR ioctl with struct kvm_msrs { .nmsrs = 1, .pad = 0, .entries = { {.index = MSR_IA32_FEATURE_CONTROL, .reserved = 0, .data = 0} } } The VMLANCH/VMRESUME emulation should be stopped since the CPU is going to reset here. This patch resets the nested_run_pending since the CPU is going to be reset hence there should be nothing pending. Reported-by: Dmitry Vyukov <dvyukov@google.com> Suggested-by: Radim Krčmář <rkrcmar@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-03-07 15:41:12 +01:00
Jim Mattson	587d7e72ae	kvm: nVMX: VMCLEAR should not cause the vCPU to shut down VMCLEAR should silently ignore a failure to clear the launch state of the VMCS referenced by the operand. Signed-off-by: Jim Mattson <jmattson@google.com> [Changed "kvm_write_guest(vcpu->kvm" to "kvm_vcpu_write_guest(vcpu".] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-03-06 17:31:29 +01:00
Linus Torvalds	2d62e0768d	Second batch of KVM changes for 4.11 merge window PPC: * correct assumption about ASDR on POWER9 * fix MMIO emulation on POWER9 x86: * add a simple test for ioperm * cleanup TSS (going through KVM tree as the whole undertaking was caused by VMX's use of TSS) * fix nVMX interrupt delivery * fix some performance counters in the guest And two cleanup patches. -----BEGIN PGP SIGNATURE----- iQEcBAABCAAGBQJYuu5qAAoJEED/6hsPKofoRAUH/jkx/KFDcw3FggixysWVgRai iLSbbAZemnSLFSOkOU/t7Bz0fXCUgB0tAcMJd9ow01Dg1zObiTpuUIo6qEPaYHdX gqtUzlHuyECZEcgK0RXS9kDYLrvw7EFocxnDWQfV91qCZSS6nBSSLF3ST1rNV69W mUvcZG+MciDcZUe1lTexoswVTh1m7avvozEnQ5OHnZR9yicoXiadBQjzL6yqWoqf Ml/29zRk5+MvloTudxjkAKm3mh7psW88jNMh37TXbAA7i+Xwl9cU6GLR9mFWstoP 7Ot7ecq9mNAUO3lTIQh7lqvB60LMFznS4IlYK7MbplC3kvJLkfzhTWaN1aGvh90= =cqHo -----END PGP SIGNATURE----- Merge tag 'kvm-4.11-2' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull more KVM updates from Radim Krčmář: "Second batch of KVM changes for the 4.11 merge window: PPC: - correct assumption about ASDR on POWER9 - fix MMIO emulation on POWER9 x86: - add a simple test for ioperm - cleanup TSS (going through KVM tree as the whole undertaking was caused by VMX's use of TSS) - fix nVMX interrupt delivery - fix some performance counters in the guest ... and two cleanup patches" * tag 'kvm-4.11-2' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: nVMX: Fix pending events injection x86/kvm/vmx: remove unused variable in segment_base() selftests/x86: Add a basic selftest for ioperm x86/asm: Tidy up TSS limit code kvm: convert kvm.users_count from atomic_t to refcount_t KVM: x86: never specify a sample period for virtualized in_tx_cp counters KVM: PPC: Book3S HV: Don't use ASDR for real-mode HPT faults on POWER9 KVM: PPC: Book3S HV: Fix software walk of guest process page tables	2017-03-04 11:36:19 -08:00
Ingo Molnar	3905f9ad45	sched/headers: Prepare to move sched_info_on() and force_schedstat_enabled() from <linux/sched.h> to <linux/sched/stat.h> But first update usage sites with the new header dependency. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-03-02 08:42:39 +01:00
Ingo Molnar	32ef5517c2	sched/headers: Prepare to move cputime functionality from <linux/sched.h> into <linux/sched/cputime.h> Introduce a trivial, mostly empty <linux/sched/cputime.h> header to prepare for the moving of cputime functionality out of sched.h. Update all code that relies on these facilities. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-03-02 08:42:39 +01:00
Ingo Molnar	b2d0910310	sched/headers: Prepare to use <linux/rcuupdate.h> instead of <linux/rculist.h> in <linux/sched.h> We don't actually need the full rculist.h header in sched.h anymore, we will be able to include the smaller rcupdate.h header instead. But first update code that relied on the implicit header inclusion. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-03-02 08:42:38 +01:00
Ingo Molnar	3f07c01441	sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h> We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/signal.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-03-02 08:42:29 +01:00
Wanpeng Li	acc9ab6013	KVM: nVMX: Fix pending events injection L2 fails to boot on a non-APICv box dues to 'commit `0ad3bed6c5` ("kvm: nVMX: move nested events check to kvm_vcpu_running")' KVM internal error. Suberror: 3 extra data[0]: 800000ef extra data[1]: 1 RAX=0000000000000000 RBX=ffffffff81f36140 RCX=0000000000000000 RDX=0000000000000000 RSI=0000000000000000 RDI=0000000000000000 RBP=ffff88007c92fe90 RSP=ffff88007c92fe90 R8 =ffff88007fccdca0 R9 =0000000000000000 R10=00000000fffedb3d R11=0000000000000000 R12=0000000000000003 R13=0000000000000000 R14=0000000000000000 R15=ffff88007c92c000 RIP=ffffffff810645e6 RFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0 ES =0000 0000000000000000 ffffffff 00c00000 CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA] SS =0000 0000000000000000 ffffffff 00c00000 DS =0000 0000000000000000 ffffffff 00c00000 FS =0000 0000000000000000 ffffffff 00c00000 GS =0000 ffff88007fcc0000 ffffffff 00c00000 LDT=0000 0000000000000000 ffffffff 00c00000 TR =0040 ffff88007fcd4200 00002087 00008b00 DPL=0 TSS64-busy GDT= ffff88007fcc9000 0000007f IDT= ffffffffff578000 00000fff CR0=80050033 CR2=00000000ffffffff CR3=0000000001e0a000 CR4=003406e0 DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 DR6=00000000fffe0ff0 DR7=0000000000000400 EFER=0000000000000d01 We should try to reinject previous events if any before trying to inject new event if pending. If vmexit is triggered by L2 guest and L0 interested in, we should reinject IDT-vectoring info to L2 through vmcs02 if any, otherwise, we can consider new IRQs/NMIs which can be injected and call nested events callback to switch from L2 to L1 if needed and inject the proper vmexit events. However, 'commit `0ad3bed6c5` ("kvm: nVMX: move nested events check to kvm_vcpu_running")' results in the handle events order reversely on non-APICv box. This patch fixes it by bailing out for pending events and not consider new events in this scenario. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Fixes: `0ad3bed6c5` ("kvm: nVMX: move nested events check to kvm_vcpu_running") Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-03-01 17:03:24 +01:00
Jérémy Lefaure	0fce546f9f	x86/kvm/vmx: remove unused variable in segment_base() The pointer 'struct desc_struct *d' is unused since commit `8c2e41f7ae` ("x86/kvm/vmx: Simplify segment_base()") so let's remove it. Signed-off-by: Jérémy Lefaure <jeremy.lefaure@lse.epita.fr> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-03-01 17:03:24 +01:00
Robert O'Callahan	bba82fd756	KVM: x86: never specify a sample period for virtualized in_tx_cp counters pmc_reprogram_counter() always sets a sample period based on the value of pmc->counter. However, hsw_hw_config() rejects sample periods less than 2^31 - 1. So for example, if a KVM guest does struct perf_event_attr attr; memset(&attr, 0, sizeof(attr)); attr.type = PERF_TYPE_RAW; attr.size = sizeof(attr); attr.config = 0x2005101c4; // conditional branches retired IN_TXCP attr.sample_period = 0; int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0); ioctl(fd, PERF_EVENT_IOC_DISABLE, 0); ioctl(fd, PERF_EVENT_IOC_ENABLE, 0); the guest kernel counts some conditional branch events, then updates the virtual PMU register with a nonzero count. The host reaches pmc_reprogram_counter() with nonzero pmc->counter, triggers EOPNOTSUPP in hsw_hw_config(), prints "kvm_pmu: event creation failed" in pmc_reprogram_counter(), and silently (from the guest's point of view) stops counting events. We fix event counting by forcing attr.sample_period to always be zero for in_tx_cp counters. Sampling doesn't work, but it already didn't work and can't be fixed without major changes to the approach in hsw_hw_config(). Signed-off-by: Robert O'Callahan <robert@ocallahan.org> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-03-01 14:19:46 +01:00
Masahiro Yamada	9332ef9dbd	scripts/spelling.txt: add "an user" pattern and fix typo instances Fix typos and add the following to the scripts/spelling.txt: an user\|\|a user an userspace\|\|a userspace I also added "userspace" to the list since it is a common word in Linux. I found some instances for "an userfaultfd", but I did not add it to the list. I felt it is endless to find words that start with "user" such as "userland" etc., so must draw a line somewhere. Link: http://lkml.kernel.org/r/1481573103-11329-4-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-27 18:43:46 -08:00
Linus Torvalds	fd7e9a8834	4.11 is going to be a relatively large release for KVM, with a little over 200 commits and noteworthy changes for most architectures. * ARM: - GICv3 save/restore - cache flushing fixes - working MSI injection for GICv3 ITS - physical timer emulation * MIPS: - various improvements under the hood - support for SMP guests - a large rewrite of MMU emulation. KVM MIPS can now use MMU notifiers to support copy-on-write, KSM, idle page tracking, swapping, ballooning and everything else. KVM_CAP_READONLY_MEM is also supported, so that writes to some memory regions can be treated as MMIO. The new MMU also paves the way for hardware virtualization support. * PPC: - support for POWER9 using the radix-tree MMU for host and guest - resizable hashed page table - bugfixes. * s390: expose more features to the guest - more SIMD extensions - instruction execution protection - ESOP2 * x86: - improved hashing in the MMU - faster PageLRU tracking for Intel CPUs without EPT A/D bits - some refactoring of nested VMX entry/exit code, preparing for live migration support of nested hypervisors - expose yet another AVX512 CPUID bit - host-to-guest PTP support - refactoring of interrupt injection, with some optimizations thrown in and some duct tape removed. - remove lazy FPU handling - optimizations of user-mode exits - optimizations of vcpu_is_preempted() for KVM guests * generic: - alternative signaling mechanism that doesn't pound on tsk->sighand->siglock -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iQEcBAABAgAGBQJYral1AAoJEL/70l94x66DbNgH/Rx8YXuidFq2fe3RWOvld3RK 85OM/D5g38cTLpBE0/sJpcvX34iYN8U/l5foCZwpxB+83GHEk2Cr57JyfTogdaAJ x8dBhHKQCA/HxSQUQLN6nFqRV+yT8WUR92Fhqx82+80BSen5Yzcfee/TDoW6T1IW g8CYgX9FrRaGOX066ImAuUfdAdUVjyssfs9VttDTX+HiusPeuBPx/wsRe1ZEEPlH vnltIJQb1ETV2GOZLUojKjzH6aZkjIl29XxjkYii9JTUornClG0DfW+5QT3uLrB5 gJ+G+Zmpsq8ZBx9jNDtAi7sFsoPY1Mzf+JPNCGXBra2sP2GrBAuXcxmgznRYltQ= =8IIp -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull KVM updates from Paolo Bonzini: "4.11 is going to be a relatively large release for KVM, with a little over 200 commits and noteworthy changes for most architectures. ARM: - GICv3 save/restore - cache flushing fixes - working MSI injection for GICv3 ITS - physical timer emulation MIPS: - various improvements under the hood - support for SMP guests - a large rewrite of MMU emulation. KVM MIPS can now use MMU notifiers to support copy-on-write, KSM, idle page tracking, swapping, ballooning and everything else. KVM_CAP_READONLY_MEM is also supported, so that writes to some memory regions can be treated as MMIO. The new MMU also paves the way for hardware virtualization support. PPC: - support for POWER9 using the radix-tree MMU for host and guest - resizable hashed page table - bugfixes. s390: - expose more features to the guest - more SIMD extensions - instruction execution protection - ESOP2 x86: - improved hashing in the MMU - faster PageLRU tracking for Intel CPUs without EPT A/D bits - some refactoring of nested VMX entry/exit code, preparing for live migration support of nested hypervisors - expose yet another AVX512 CPUID bit - host-to-guest PTP support - refactoring of interrupt injection, with some optimizations thrown in and some duct tape removed. - remove lazy FPU handling - optimizations of user-mode exits - optimizations of vcpu_is_preempted() for KVM guests generic: - alternative signaling mechanism that doesn't pound on tsk->sighand->siglock" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (195 commits) x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64 x86/paravirt: Change vcp_is_preempted() arg type to long KVM: VMX: use correct vmcs_read/write for guest segment selector/base x86/kvm/vmx: Defer TR reload after VM exit x86/asm/64: Drop __cacheline_aligned from struct x86_hw_tss x86/kvm/vmx: Simplify segment_base() x86/kvm/vmx: Get rid of segment_base() on 64-bit kernels x86/kvm/vmx: Don't fetch the TSS base from the GDT x86/asm: Define the kernel TSS limit in a macro kvm: fix page struct leak in handle_vmon KVM: PPC: Book3S HV: Disable HPT resizing on POWER9 for now KVM: Return an error code only as a constant in kvm_get_dirty_log() KVM: Return an error code only as a constant in kvm_get_dirty_log_protect() KVM: Return directly after a failed copy_from_user() in kvm_vm_compat_ioctl() KVM: x86: remove code for lazy FPU handling KVM: race-free exit from KVM_RUN without POSIX signals KVM: PPC: Book3S HV: Turn "KVM guest htab" message into a debug message KVM: PPC: Book3S PR: Ratelimit copy data failure error messages KVM: Support vCPU-based gfn->hva cache KVM: use separate generations for each address space ...	2017-02-22 18:22:53 -08:00
Chao Peng	96794e4ed4	KVM: VMX: use correct vmcs_read/write for guest segment selector/base Guest segment selector is 16 bit field and guest segment base is natural width field. Fix two incorrect invocations accordingly. Without this patch, build fails when aggressive inlining is used with ICC. Cc: stable@vger.kernel.org Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-21 12:45:49 +01:00
Andy Lutomirski	b7ffc44d5b	x86/kvm/vmx: Defer TR reload after VM exit Intel's VMX is daft and resets the hidden TSS limit register to 0x67 on VMX reload, and the 0x67 is not configurable. KVM currently reloads TR using the LTR instruction on every exit, but this is quite slow because LTR is serializing. The 0x67 limit is entirely harmless unless ioperm() is in use, so defer the reload until a task using ioperm() is actually running. Here's some poorly done benchmarking using kvm-unit-tests: Before: cpuid 1313 vmcall 1195 mov_from_cr8 11 mov_to_cr8 17 inl_from_pmtimer 6770 inl_from_qemu 6856 inl_from_kernel 2435 outl_to_kernel 1402 After: cpuid 1291 vmcall 1181 mov_from_cr8 11 mov_to_cr8 16 inl_from_pmtimer 6457 inl_from_qemu 6209 inl_from_kernel 2339 outl_to_kernel 1391 Signed-off-by: Andy Lutomirski <luto@kernel.org> [Force-reload TR in invalidate_tss_limit. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-21 12:45:08 +01:00
Andy Lutomirski	8c2e41f7ae	x86/kvm/vmx: Simplify segment_base() Use actual pointer types for pointers (instead of unsigned long) and replace hardcoded constants with the appropriate self-documenting macros. The function is still a bit messy, but this seems a lot better than before to me. This is mostly borrowed from a patch by Thomas Garnier. Cc: Thomas Garnier <thgarnie@google.com> Cc: Jim Mattson <jmattson@google.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-21 11:48:56 +01:00
Andy Lutomirski	e28baeadcf	x86/kvm/vmx: Get rid of segment_base() on 64-bit kernels It was a bit buggy (it didn't list all segment types that needed 64-bit fixups), but the bug was irrelevant because it wasn't called in any interesting context on 64-bit kernels and was only used for data segents on 32-bit kernels. To avoid confusion, make it explicitly 32-bit only. Cc: Thomas Garnier <thgarnie@google.com> Cc: Jim Mattson <jmattson@google.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-21 11:48:46 +01:00
Andy Lutomirski	e0c230634a	x86/kvm/vmx: Don't fetch the TSS base from the GDT The current CPU's TSS base is a foregone conclusion, so there's no need to parse it out of the segment tables. This should save a couple cycles (as STR is surely microcoded and poorly optimized) but, more importantly, it's a cleanup and it means that segment_base() will never be called on 64-bit kernels. Cc: Thomas Garnier <thgarnie@google.com> Cc: Jim Mattson <jmattson@google.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-21 11:48:40 +01:00
Linus Torvalds	828cad8ea0	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The main changes in this (fairly busy) cycle were: - There was a class of scheduler bugs related to forgetting to update the rq-clock timestamp which can cause weird and hard to debug problems, so there's a new debug facility for this: which uncovered a whole lot of bugs which convinced us that we want to keep the debug facility. (Peter Zijlstra, Matt Fleming) - Various cputime related updates: eliminate cputime and use u64 nanoseconds directly, simplify and improve the arch interfaces, implement delayed accounting more widely, etc. - (Frederic Weisbecker) - Move code around for better structure plus cleanups (Ingo Molnar) - Move IO schedule accounting deeper into the scheduler plus related changes to improve the situation (Tejun Heo) - ... plus a round of sched/rt and sched/deadline fixes, plus other fixes, updats and cleanups" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (85 commits) sched/core: Remove unlikely() annotation from sched_move_task() sched/autogroup: Rename auto_group.[ch] to autogroup.[ch] sched/topology: Split out scheduler topology code from core.c into topology.c sched/core: Remove unnecessary #include headers sched/rq_clock: Consolidate the ordering of the rq_clock methods delayacct: Include <uapi/linux/taskstats.h> sched/core: Clean up comments sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds sched/clock: Add dummy clear_sched_clock_stable() stub function sched/cputime: Remove generic asm headers sched/cputime: Remove unused nsec_to_cputime() s390, sched/cputime: Remove unused cputime definitions powerpc, sched/cputime: Remove unused cputime definitions s390, sched/cputime: Make arch_cpu_idle_time() to return nsecs ia64, sched/cputime: Remove unused cputime definitions ia64: Convert vtime to use nsec units directly ia64, sched/cputime: Move the nsecs based cputime headers to the last arch using it sched/cputime: Remove jiffies based cputime sched/cputime, vtime: Return nsecs instead of cputime_t to account sched/cputime: Complete nsec conversion of tick based accounting ...	2017-02-20 12:52:55 -08:00
Paolo Bonzini	06ce521af9	kvm: fix page struct leak in handle_vmon handle_vmon gets a reference on VMXON region page, but does not release it. Release the reference. Found by syzkaller; based on a patch by Dmitry. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-20 17:54:43 +01:00
Paolo Bonzini	bd7e5b0899	KVM: x86: remove code for lazy FPU handling The FPU is always active now when running KVM. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Bandan Das <bsd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-17 12:28:01 +01:00
Paolo Bonzini	460df4c1fc	KVM: race-free exit from KVM_RUN without POSIX signals The purpose of the KVM_SET_SIGNAL_MASK API is to let userspace "kick" a VCPU out of KVM_RUN through a POSIX signal. A signal is attached to a dummy signal handler; by blocking the signal outside KVM_RUN and unblocking it inside, this possible race is closed: VCPU thread service thread -------------------------------------------------------------- check flag set flag raise signal (signal handler does nothing) KVM_RUN However, one issue with KVM_SET_SIGNAL_MASK is that it has to take tsk->sighand->siglock on every KVM_RUN. This lock is often on a remote NUMA node, because it is on the node of a thread's creator. Taking this lock can be very expensive if there are many userspace exits (as is the case for SMP Windows VMs without Hyper-V reference time counter). As an alternative, we can put the flag directly in kvm_run so that KVM can see it: VCPU thread service thread -------------------------------------------------------------- raise signal signal handler set run->immediate_exit KVM_RUN check run->immediate_exit Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-17 12:27:37 +01:00
Cao, Lei	bbd6411513	KVM: Support vCPU-based gfn->hva cache Provide versions of struct gfn_to_hva_cache functions that take vcpu as a parameter instead of struct kvm. The existing functions are not needed anymore, so delete them. This allows dirty pages to be logged in the vcpu dirty ring, instead of the global dirty ring, for ring-based dirty memory tracking. Signed-off-by: Lei Cao <lei.cao@stratus.com> Message-Id: <CY1PR08MB19929BD2AC47A291FD680E83F04F0@CY1PR08MB1992.namprd08.prod.outlook.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-16 18:42:46 +01:00
Paolo Bonzini	47c0152e0f	KVM: VMX: use vmcs_set/clear_bits for CPU-based execution controls Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-16 18:42:26 +01:00
David Hildenbrand	681bcea802	KVM: svm: inititalize hash table structures directly The hashtable and guarding spinlock are global data structures, we can inititalize them statically. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20170124212116.4568-1-david@redhat.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-15 15:26:06 +01:00
Jim Mattson	858e25c06f	kvm: nVMX: Refactor nested_vmx_run() Nested_vmx_run is split into two parts: the part that handles the VMLAUNCH/VMRESUME instruction, and the part that modifies the vcpu state to transition from VMX root mode to VMX non-root mode. The latter will be used when restoring the checkpointed state of a vCPU that was in VMX operation when a snapshot was taken. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-15 14:56:36 +01:00
Jim Mattson	ca0bde28f2	kvm: nVMX: Split VMCS checks from nested_vmx_run() The checks performed on the contents of the vmcs12 are extracted from nested_vmx_run so that they can be used to validate a vmcs12 that has been restored from a checkpoint. Signed-off-by: Jim Mattson <jmattson@google.com> [Change prepare_vmcs02 and nested_vmx_load_cr3's last argument to u32, to match check_vmentry_postreqs. Update comments for singlestep handling. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-15 14:56:35 +01:00
Jim Mattson	6beb7bd52e	kvm: nVMX: Refactor nested_get_vmcs12_pages() Perform the checks on vmcs12 state early, but defer the gpa->hpa lookups until after prepare_vmcs02. Later, when we restore the checkpointed state of a vCPU in guest mode, we will not be able to do the gpa->hpa lookups when the restore is done. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-15 14:56:11 +01:00
Jim Mattson	a8bc284eb7	kvm: nVMX: Refactor handle_vmptrld() Handle_vmptrld is split into two parts: the part that handles the VMPTRLD instruction, and the part that establishes the current VMCS pointer. The latter will be used when restoring the checkpointed state of a vCPU that had a valid VMCS pointer when a snapshot was taken. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-15 14:54:37 +01:00
Jim Mattson	e29acc55bf	kvm: nVMX: Refactor handle_vmon() Handle_vmon is split into two parts: the part that handles the VMXON instruction, and the part that modifies the vcpu state to transition from legacy mode to VMX operation. The latter will be used when restoring the checkpointed state of a vCPU that was in VMX operation when a snapshot was taken. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-15 14:54:37 +01:00
Jim Mattson	cf8b84f48a	kvm: nVMX: Prepare for checkpointing L2 state Split prepare_vmcs12 into two parts: the part that stores the current L2 guest state and the part that sets up the exit information fields. The former will be used when checkpointing the vCPU's VMX state. Modify prepare_vmcs02 so that it can construct a vmcs02 midway through L2 execution, using the checkpointed L2 guest state saved into the cached vmcs12 above. Signed-off-by: Jim Mattson <jmattson@google.com> [Rebasing: add from_vmentry argument to prepare_vmcs02 instead of using vmx->nested.nested_run_pending, because it is no longer 1 at the point prepare_vmcs02 is called. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-15 14:54:36 +01:00
Paolo Bonzini	b95234c840	kvm: x86: do not use KVM_REQ_EVENT for APICv interrupt injection Since `bf9f6ac8d7` ("KVM: Update Posted-Interrupts Descriptor when vCPU is blocked", 2015-09-18) the posted interrupt descriptor is checked unconditionally for PIR.ON. Therefore we don't need KVM_REQ_EVENT to trigger the scan and, if NMIs or SMIs are not involved, we can avoid the complicated event injection path. Calling kvm_vcpu_kick if PIR.ON=1 is also useless, though it has been there since APICv was introduced. However, without the KVM_REQ_EVENT safety net KVM needs to be much more careful about races between vmx_deliver_posted_interrupt and vcpu_enter_guest. First, the IPI for posted interrupts may be issued between setting vcpu->mode = IN_GUEST_MODE and disabling interrupts. If that happens, kvm_trigger_posted_interrupt returns true, but smp_kvm_posted_intr_ipi doesn't do anything about it. The guest is entered with PIR.ON, but the posted interrupt IPI has not been sent and the interrupt is only delivered to the guest on the next vmentry (if any). To fix this, disable interrupts before setting vcpu->mode. This ensures that the IPI is delayed until the guest enters non-root mode; it is then trapped by the processor causing the interrupt to be injected. Second, the IPI may be issued between kvm_x86_ops->sync_pir_to_irr(vcpu) and vcpu->mode = IN_GUEST_MODE. In this case, kvm_vcpu_kick is called but it (correctly) doesn't do anything because it sees vcpu->mode == OUTSIDE_GUEST_MODE. Again, the guest is entered with PIR.ON but no posted interrupt IPI is pending; this time, the fix for this is to move the RVI update after IN_GUEST_MODE. Both issues were mostly masked by the liberal usage of KVM_REQ_EVENT, though the second could actually happen with VT-d posted interrupts. In both race scenarios KVM_REQ_EVENT would cancel guest entry, resulting in another vmentry which would inject the interrupt. This saves about 300 cycles on the self_ipi_* tests of vmexit.flat. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-15 14:54:36 +01:00
Paolo Bonzini	76dfafd536	KVM: x86: do not scan IRR twice on APICv vmentry Calls to apic_find_highest_irr are scanning IRR twice, once in vmx_sync_pir_from_irr and once in apic_search_irr. Change sync_pir_from_irr to get the new maximum IRR from kvm_apic_update_irr; now that it does the computation, it can also do the RVI write. In order to avoid complications in svm.c, make the callback optional. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-15 14:54:35 +01:00
Paolo Bonzini	3d92789f69	KVM: vmx: move sync_pir_to_irr from apic_find_highest_irr to callers Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-15 14:54:34 +01:00
Paolo Bonzini	810e6defcc	KVM: x86: preparatory changes for APICv cleanups Add return value to __kvm_apic_update_irr/kvm_apic_update_irr. Move vmx_sync_pir_to_irr around. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-15 14:54:34 +01:00
Paolo Bonzini	0ad3bed6c5	kvm: nVMX: move nested events check to kvm_vcpu_running vcpu_run calls kvm_vcpu_running, not kvm_arch_vcpu_runnable, and the former does not call check_nested_events. Once KVM_REQ_EVENT is removed from the APICv interrupt injection path, however, this would leave no place to trigger a vmexit from L2 to L1, causing a missed interrupt delivery while in guest mode. This is caught by the "ack interrupt on exit" test in vmx.flat. [This does not change the calls to check_nested_events in inject_pending_event. That is material for a separate cleanup.] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-15 14:54:33 +01:00
Paolo Bonzini	967235d320	KVM: vmx: clear pending interrupts on KVM_SET_LAPIC Pending interrupts might be in the PI descriptor when the LAPIC is restored from an external state; we do not want them to be injected. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-15 14:54:33 +01:00
Paolo Bonzini	db1c056cee	kvm: vmx: Use the hardware provided GPA instead of page walk As in the SVM patch, the guest physical address is passed by VMX to x86_emulate_instruction already, so mark the GPA as available in vcpu->arch. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-15 14:54:32 +01:00
Arnd Bergmann	8ef81a9a45	KVM: x86: hide KVM_HC_CLOCK_PAIRING on 32 bit The newly added hypercall doesn't work on x86-32: arch/x86/kvm/x86.c: In function 'kvm_pv_clock_pairing': arch/x86/kvm/x86.c:6163:6: error: implicit declaration of function 'kvm_get_walltime_and_clockread';did you mean 'kvm_get_time_scale'? [-Werror=implicit-function-declaration] This adds an #ifdef around it, matching the one around the related functions that are also only implemented on 64-bit systems. Fixes: `55dd00a73a` ("KVM: x86: add KVM_HC_CLOCK_PAIRING hypercall") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-09 16:23:39 +01:00
Paolo Bonzini	2e751dfb5f	kvmarm updates for 4.11 - GICv3 save restore - Cache flushing fixes - MSI injection fix for GICv3 ITS - Physical timer emulation support -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJYnICmAAoJECPQ0LrRPXpDC04P/A73ZEL6m0vUzGpuvclxwWc6 OCJ2C9kYloK+twyGLFbPprI4eN/70dpThgFE1Zr+ol/vAOhQlGQJoarc4n4+eYyb 8e8IxM5Cmi44HUB64xOInidLacqeyRy5+TvKXIH0aHLgpdynSEQJu88RVXUvVgvs IZizhTpYueDYdexNNEkL5r2yJhVZCaczyjB1vU8k5MdODLDM63ABnPOSNJNXio2x itoO0EU1Lb9GhuzQj0hiMvKJPyviuPHwau7AhokUSjDPaHzaQT7TgSVioKov/rl6 bRzhPmXqesex97ZWA5Fxr8jgSNR7JyRz+bzCLEry7XFaI3chbe0YvXeRv32PNH7I meuycQw64gsKmfJGRNlq30qhQQfv4fTbzpZP/j1UbvKNwhK5J6e7037c1CUH4i9C p9UO9HF/zAMqzD3iMcDZSpaFcbhJYrfQufbhTnbHfGC5AMVJEOWheHSEmzlDWnwr K5fPBxnsPv58hDmp/UZUTqCEPusY+HyuOq4ZumFSsnBwjdW+z9mLuaaTJbxaqR/G B6dfSQNwSnw6b2lbiXPUCm6c+Z9b190pUEWdwJ4kOTxwiPUWBppVU7gE2TrjnQ8m aIvEBPGIf58okjEewA5Dni6qjv7CjDN5z1V0vZUTTdVw8xuhX9eJ1Cx853SM7n0U sJgW5nSvSLDUpizSKdRI =H4vX -----END PGP SIGNATURE----- Merge tag 'kvmarm-for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD kvmarm updates for 4.11 - GICv3 save restore - Cache flushing fixes - MSI injection fix for GICv3 ITS - Physical timer emulation support	2017-02-09 16:01:23 +01:00
Paolo Bonzini	80fbd89cbd	KVM: x86: fix compilation Fix rebase breakage from commit `55dd00a73a` ("KVM: x86: add KVM_HC_CLOCK_PAIRING hypercall", 2017-01-24), courtesy of the "I could have sworn I had pushed the right branch" department. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-08 10:57:24 +01:00
Marcelo Tosatti	55dd00a73a	KVM: x86: add KVM_HC_CLOCK_PAIRING hypercall Add a hypercall to retrieve the host realtime clock and the TSC value used to calculate that clock read. Used to implement clock synchronization between host and guest. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-07 18:16:45 +01:00
David Hildenbrand	6342c50ad1	KVM: nVMX: vmx_complete_nested_posted_interrupt() can't fail vmx_complete_nested_posted_interrupt() can't fail, let's turn it into a void function. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-07 18:16:44 +01:00
David Hildenbrand	42cf014d38	KVM: nVMX: kmap() can't fail kmap() can't fail, therefore it will always return a valid pointer. Let's just get rid of the unnecessary checks. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-02-07 18:16:44 +01:00
Radim Krčmář	00c87e9a70	KVM: x86: do not save guest-unsupported XSAVE state Saving unsupported state prevents migration when the new host does not support a XSAVE feature of the original host, even if the feature is not exposed to the guest. We've masked host features with guest-visible features before, with `4344ee981e` ("KVM: x86: only copy XSAVE state for the supported features") and dropped it when implementing XSAVES. Do it again. Fixes: `df1daba7d1` ("KVM: x86: support XSAVES usage in the host") Cc: stable@vger.kernel.org Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-02-03 18:43:08 +01:00
Frederic Weisbecker	5613fda9a5	sched/cputime: Convert task/group cputime to nsecs Now that most cputime readers use the transition API which return the task cputime in old style cputime_t, we can safely store the cputime in nsecs. This will eventually make cputime statistics less opaque and more granular. Back and forth convertions between cputime_t and nsecs in order to deal with cputime_t random granularity won't be needed anymore. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-8-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-02-01 09:13:49 +01:00
Junaid Shahid	d3e328f2cb	kvm: x86: mmu: Verify that restored PTE has needed perms in fast page fault Before fast page fault restores an access track PTE back to a regular PTE, it now also verifies that the restored PTE would grant the necessary permissions for the faulting access to succeed. If not, it falls back to the slow page fault path. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-27 15:46:40 +01:00
Junaid Shahid	d162f30a7c	kvm: x86: mmu: Move pgtbl walk inside retry loop in fast_page_fault Redo the page table walk in fast_page_fault when retrying so that we are working on the latest PTE even if the hierarchy changes. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-27 15:46:40 +01:00
Junaid Shahid	20d65236d0	kvm: x86: mmu: Update comment in mark_spte_for_access_track Reword the comment to hopefully make it more clear. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-27 15:46:39 +01:00
Junaid Shahid	312b616b30	kvm: x86: mmu: Set SPTE_SPECIAL_MASK within mmu.c Instead of the caller including the SPTE_SPECIAL_MASK in the masks being supplied to kvm_mmu_set_mmio_spte_mask() and kvm_mmu_set_mask_ptes(), those functions now themselves include the SPTE_SPECIAL_MASK. Note that bit 63 is now reset in the default MMIO mask. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-27 15:46:39 +01:00
Junaid Shahid	ab22a4733f	kvm: x86: mmu: Rename EPT_VIOLATION_READ/WRITE/INSTR constants Rename the EPT_VIOLATION_READ/WRITE/INSTR constants to EPT_VIOLATION_ACC_READ/WRITE/INSTR to more clearly indicate that these signify the type of the memory access as opposed to the permissions granted by the PTE. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-27 15:46:38 +01:00
Jim Mattson	0b4c208d44	Revert "KVM: nested VMX: disable perf cpuid reporting" This reverts commit `bc6134942d`. A CPUID instruction executed in VMX non-root mode always causes a VM-exit, regardless of the leaf being queried. Fixes: `bc6134942d` ("KVM: nested VMX: disable perf cpuid reporting") Signed-off-by: Jim Mattson <jmattson@google.com> [The issue solved by `bc6134942d` has been resolved with `ff651cb613` ("KVM: nVMX: Add nested msr load/restore algorithm").] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-01-20 22:18:55 +01:00
Piotr Luc	a17f32270a	kvm: x86: Expose Intel VPOPCNTDQ feature to guest Vector population count instructions for dwords and qwords are to be used in future Intel Xeon & Xeon Phi processors. The bit 14 of CPUID[level:0x07, ECX] indicates that the new instructions are supported by a processor. The spec can be found in the Intel Software Developer Manual (SDM) or in the Instruction Set Extensions Programming Reference (ISE). Signed-off-by: Piotr Luc <piotr.luc@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: x86@kernel.org Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-01-17 17:55:18 +01:00
Radim Krčmář	a9ff720e0f	Merge branch 'x86/cpufeature' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next For AVX512_VPOPCNTDQ.	2017-01-17 17:53:01 +01:00
Dmitry Vyukov	ce2e852ecc	KVM: x86: fix fixing of hypercalls emulator_fix_hypercall() replaces hypercall with vmcall instruction, but it does not handle GP exception properly when writes the new instruction. It can return X86EMUL_PROPAGATE_FAULT without setting exception information. This leads to incorrect emulation and triggers WARN_ON(ctxt->exception.vector > 0x1f) in x86_emulate_insn() as discovered by syzkaller fuzzer: WARNING: CPU: 2 PID: 18646 at arch/x86/kvm/emulate.c:5558 Call Trace: warn_slowpath_null+0x2c/0x40 kernel/panic.c:582 x86_emulate_insn+0x16a5/0x4090 arch/x86/kvm/emulate.c:5572 x86_emulate_instruction+0x403/0x1cc0 arch/x86/kvm/x86.c:5618 emulate_instruction arch/x86/include/asm/kvm_host.h:1127 [inline] handle_exception+0x594/0xfd0 arch/x86/kvm/vmx.c:5762 vmx_handle_exit+0x2b7/0x38b0 arch/x86/kvm/vmx.c:8625 vcpu_enter_guest arch/x86/kvm/x86.c:6888 [inline] vcpu_run arch/x86/kvm/x86.c:6947 [inline] Set exception information when write in emulator_fix_hypercall() fails. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Cc: kvm@vger.kernel.org Cc: syzkaller@googlegroups.com Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-01-17 15:06:05 +01:00
Paolo Bonzini	33ab91103b	KVM: x86: fix emulation of "MOV SS, null selector" This is CVE-2017-2583. On Intel this causes a failed vmentry because SS's type is neither 3 nor 7 (even though the manual says this check is only done for usable SS, and the dmesg splat says that SS is unusable!). On AMD it's worse: svm.c is confused and sets CPL to 0 in the vmcb. The fix fabricates a data segment descriptor when SS is set to a null selector, so that CPL and SS.DPL are set correctly in the VMCS/vmcb. Furthermore, only allow setting SS to a NULL selector if SS.RPL < 3; this in turn ensures CPL < 3 because RPL must be equal to CPL. Thanks to Andy Lutomirski and Willy Tarreau for help in analyzing the bug and deciphering the manuals. Reported-by: Xiaohan Zhang <zhangxiaohan1@huawei.com> Fixes: `79d5b4c3cd` Cc: stable@nongnu.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-12 15:17:13 +01:00
Wanpeng Li	546d87e5c9	KVM: x86: fix NULL deref in vcpu_scan_ioapic Reported by syzkaller: BUG: unable to handle kernel NULL pointer dereference at 00000000000001b0 IP: _raw_spin_lock+0xc/0x30 PGD 3e28eb067 PUD 3f0ac6067 PMD 0 Oops: 0002 [#1] SMP CPU: 0 PID: 2431 Comm: test Tainted: G OE 4.10.0-rc1+ #3 Call Trace: ? kvm_ioapic_scan_entry+0x3e/0x110 [kvm] kvm_arch_vcpu_ioctl_run+0x10a8/0x15f0 [kvm] ? pick_next_task_fair+0xe1/0x4e0 ? kvm_arch_vcpu_load+0xea/0x260 [kvm] kvm_vcpu_ioctl+0x33a/0x600 [kvm] ? hrtimer_try_to_cancel+0x29/0x130 ? do_nanosleep+0x97/0xf0 do_vfs_ioctl+0xa1/0x5d0 ? __hrtimer_init+0x90/0x90 ? do_nanosleep+0x5b/0xf0 SyS_ioctl+0x79/0x90 do_syscall_64+0x6e/0x180 entry_SYSCALL64_slow_path+0x25/0x25 RIP: _raw_spin_lock+0xc/0x30 RSP: ffffa43688973cc0 The syzkaller folks reported a NULL pointer dereference due to ENABLE_CAP succeeding even without an irqchip. The Hyper-V synthetic interrupt controller is activated, resulting in a wrong request to rescan the ioapic and a NULL pointer dereference. #include <sys/ioctl.h> #include <sys/mman.h> #include <sys/types.h> #include <linux/kvm.h> #include <pthread.h> #include <stddef.h> #include <stdint.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #ifndef KVM_CAP_HYPERV_SYNIC #define KVM_CAP_HYPERV_SYNIC 123 #endif void* thr(void* arg) { struct kvm_enable_cap cap; cap.flags = 0; cap.cap = KVM_CAP_HYPERV_SYNIC; ioctl((long)arg, KVM_ENABLE_CAP, &cap); return 0; } int main() { void host_mem = mmap(0, 0x1000, PROT_READ\|PROT_WRITE, MAP_PRIVATE\|MAP_ANONYMOUS, -1, 0); int kvmfd = open("/dev/kvm", 0); int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0); struct kvm_userspace_memory_region memreg; memreg.slot = 0; memreg.flags = 0; memreg.guest_phys_addr = 0; memreg.memory_size = 0x1000; memreg.userspace_addr = (unsigned long)host_mem; host_mem[0] = 0xf4; ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &memreg); int cpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0); struct kvm_sregs sregs; ioctl(cpufd, KVM_GET_SREGS, &sregs); sregs.cr0 = 0; sregs.cr4 = 0; sregs.efer = 0; sregs.cs.selector = 0; sregs.cs.base = 0; ioctl(cpufd, KVM_SET_SREGS, &sregs); struct kvm_regs regs = { .rflags = 2 }; ioctl(cpufd, KVM_SET_REGS, &regs); ioctl(vmfd, KVM_CREATE_IRQCHIP, 0); pthread_t th; pthread_create(&th, 0, thr, (void)(long)cpufd); usleep(rand() % 10000); ioctl(cpufd, KVM_RUN, 0); pthread_join(th, 0); return 0; } This patch fixes it by failing ENABLE_CAP if without an irqchip. Reported-by: Dmitry Vyukov <dvyukov@google.com> Fixes: `5c919412fe` (kvm/x86: Hyper-V synthetic interrupt controller) Cc: stable@vger.kernel.org # 4.5+ Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-12 14:52:52 +01:00
Steve Rutherford	129a72a0d3	KVM: x86: Introduce segmented_write_std Introduces segemented_write_std. Switches from emulated reads/writes to standard read/writes in fxsave, fxrstor, sgdt, and sidt. This fixes CVE-2017-2584, a longstanding kernel memory leak. Since commit `283c95d0e3` ("KVM: x86: emulate FXSAVE and FXRSTOR", 2016-11-09), which is luckily not yet in any final release, this would also be an exploitable kernel memory write! Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: stable@vger.kernel.org Fixes: `96051572c8` Fixes: `283c95d0e3` Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Steve Rutherford <srutherford@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-12 14:34:58 +01:00
David Matlack	cef84c302f	KVM: x86: flush pending lapic jump label updates on module unload KVM's lapic emulation uses static_key_deferred (apic_{hw,sw}_disabled). These are implemented with delayed_work structs which can still be pending when the KVM module is unloaded. We've seen this cause kernel panics when the kvm_intel module is quickly reloaded. Use the new static_key_deferred_flush() API to flush pending updates on module unload. Signed-off-by: David Matlack <dmatlack@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-12 14:33:17 +01:00
Jim Mattson	21e7fbe7db	kvm: nVMX: Reorder error checks for emulated VMXON Checks on the operand to VMXON are performed after the check for legacy mode operation and the #GP checks, according to the pseudo-code in Intel's SDM. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-01-09 14:48:04 +01:00
Paolo Bonzini	4d82d12b39	KVM: lapic: do not scan IRR when delivering an interrupt On interrupt delivery the PPR can only grow (except for auto-EOI), so it is impossible that non-auto-EOI interrupt delivery results in KVM_REQ_EVENT. We can therefore use __apic_update_ppr. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-09 14:48:03 +01:00
Paolo Bonzini	26fbbee581	KVM: lapic: do not set KVM_REQ_EVENT unnecessarily on PPR update On PPR update, we set KVM_REQ_EVENT unconditionally anytime PPR is lowered. But we can take into account IRR here already. Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-09 14:48:02 +01:00
Paolo Bonzini	b3c045d332	KVM: lapic: remove unnecessary KVM_REQ_EVENT on PPR update PPR needs to be updated whenever on every IRR read because we may have missed TPR writes that _increased_ PPR. However, these writes need not generate KVM_REQ_EVENT, because either KVM_REQ_EVENT has been set already in __apic_accept_irq, or we are going to process the interrupt right away. Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-09 14:48:01 +01:00
Paolo Bonzini	eb90f3417a	KVM: vmx: speed up TPR below threshold vmexits Since we're already in VCPU context, all we have to do here is recompute the PPR value. That will in turn generate a KVM_REQ_EVENT if necessary. Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-09 14:48:00 +01:00
Paolo Bonzini	0f1e261ead	KVM: x86: add VCPU stat for KVM_REQ_EVENT processing This statistic can be useful to estimate the cost of an IRQ injection scenario, by comparing it with irq_injections. For example the stat shows that sti;hlt triggers more KVM_REQ_EVENT than sti;nop. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-09 14:47:59 +01:00
Tom Lendacky	0f89b207b0	kvm: svm: Use the hardware provided GPA instead of page walk When a guest causes a NPF which requires emulation, KVM sometimes walks the guest page tables to translate the GVA to a GPA. This is unnecessary most of the time on AMD hardware since the hardware provides the GPA in EXITINFO2. The only exception cases involve string operations involving rep or operations that use two memory locations. With rep, the GPA will only be the value of the initial NPF and with dual memory locations we won't know which memory address was translated into EXITINFO2. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Borislav Petkov <bp@suse.de> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-09 14:47:58 +01:00
Radim Krčmář	5bd5db385b	KVM: x86: allow hotplug of VCPU with APIC ID over 0xff LAPIC after reset is in xAPIC mode, which poses a problem for hotplug of VCPUs with high APIC ID, because reset VCPU is waiting for INIT/SIPI, but there is no way to uniquely address it using xAPIC. From many possible options, we chose the one that also works on real hardware: accepting interrupts addressed to LAPIC's x2APIC ID even in xAPIC mode. KVM intentionally differs from real hardware, because real hardware (Knights Landing) does just "x2apic_id & 0xff" to decide whether to accept the interrupt in xAPIC mode and it can deliver one interrupt to more than one physical destination, e.g. 0x123 to 0x123 and 0x23. Fixes: `682f732ecf` ("KVM: x86: bump MAX_VCPUS to 288") Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-09 14:47:48 +01:00
Radim Krčmář	b4535b58ae	KVM: x86: make interrupt delivery fast and slow path behave the same Slow path tried to prevent IPIs from x2APIC VCPUs from being delivered to xAPIC VCPUs and vice-versa. Make slow path behave like fast path, which never distinguished that. Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-09 14:47:40 +01:00
Radim Krčmář	6e50043912	KVM: x86: replace kvm_apic_id with kvm_{x,x2}apic_id There were three calls sites: - recalculate_apic_map and kvm_apic_match_physical_addr, where it would only complicate implementation of x2APIC hotplug; - in apic_debug, where it was still somewhat preserved, but keeping the old function just for apic_debug was not worth it Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-09 14:47:29 +01:00
Radim Krčmář	f98a3efb28	KVM: x86: use delivery to self in hyperv synic Interrupt to self can be sent without knowing the APIC ID. Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-09 14:46:13 +01:00
Junaid Shahid	f160c7b7bb	kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits. This change implements lockless access tracking for Intel CPUs without EPT A bits. This is achieved by marking the PTEs as not-present (but not completely clearing them) when clear_flush_young() is called after marking the pages as accessed. When an EPT Violation is generated as a result of the VM accessing those pages, the PTEs are restored to their original values. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-09 14:46:11 +01:00
Junaid Shahid	37f0e8fe6b	kvm: x86: mmu: Do not use bit 63 for tracking special SPTEs MMIO SPTEs currently set both bits 62 and 63 to distinguish them as special PTEs. However, bit 63 is used as the SVE bit in Intel EPT PTEs. The SVE bit is ignored for misconfigured PTEs but not necessarily for not-Present PTEs. Since MMIO SPTEs use an EPT misconfiguration, so using bit 63 for them is acceptable. However, the upcoming fast access tracking feature adds another type of special tracking PTE, which uses not-Present PTEs and hence should not set bit 63. In order to use common bits to distinguish both type of special PTEs, we now use only bit 62 as the special bit. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-09 14:46:10 +01:00
Junaid Shahid	f39a058d0e	kvm: x86: mmu: Introduce a no-tracking version of mmu_spte_update mmu_spte_update() tracks changes in the accessed/dirty state of the SPTE being updated and calls kvm_set_pfn_accessed/dirty appropriately. However, in some cases (e.g. when aging the SPTE), this shouldn't be done. mmu_spte_update_no_track() is introduced for use in such cases. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-09 14:46:09 +01:00
Junaid Shahid	83ef6c8155	kvm: x86: mmu: Refactor accessed/dirty checks in mmu_spte_update/clear This simplifies mmu_spte_update() a little bit. The checks for clearing of accessed and dirty bits are refactored into separate functions, which are used inside both mmu_spte_update() and mmu_spte_clear_track_bits(), as well as kvm_test_age_rmapp(). The new helper functions handle both the case when A/D bits are supported in hardware and the case when they are not. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-09 14:46:08 +01:00
Junaid Shahid	97dceba29a	kvm: x86: mmu: Fast Page Fault path retries This change adds retries into the Fast Page Fault path. Without the retries, the code still works, but if a retry does end up being needed, then it will result in a second page fault for the same memory access, which will cause much more overhead compared to just retrying within the original fault. This would be especially useful with the upcoming fast access tracking change, as that would make it more likely for retries to be needed (e.g. due to read and write faults happening on different CPUs at the same time). Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-09 14:46:07 +01:00
Junaid Shahid	ea4114bcd3	kvm: x86: mmu: Rename spte_is_locklessly_modifiable() This change renames spte_is_locklessly_modifiable() to spte_can_locklessly_be_made_writable() to distinguish it from other forms of lockless modifications. The full set of lockless modifications is covered by spte_has_volatile_bits(). Signed-off-by: Junaid Shahid <junaids@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-09 14:46:06 +01:00
Junaid Shahid	27959a4415	kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications This change adds some symbolic constants for VM Exit Qualifications related to EPT Violations and updates handle_ept_violation() to use these constants instead of hard-coded numbers. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-09 14:46:05 +01:00
David Matlack	114df303a7	kvm: x86: reduce collisions in mmu_page_hash When using two-dimensional paging, the mmu_page_hash (which provides lookups for existing kvm_mmu_page structs), becomes imbalanced; with too many collisions in buckets 0 and 512. This has been seen to cause mmu_lock to be held for multiple milliseconds in kvm_mmu_get_page on VMs with a large amount of RAM mapped with 4K pages. The current hash function uses the lower 10 bits of gfn to index into mmu_page_hash. When doing shadow paging, gfn is the address of the guest page table being shadow. These tables are 4K-aligned, which makes the low bits of gfn a good hash. However, with two-dimensional paging, no guest page tables are being shadowed, so gfn is the base address that is mapped by the table. Thus page tables (level=1) have a 2MB aligned gfn, page directories (level=2) have a 1GB aligned gfn, etc. This means hashes will only differ in their 10th bit. hash_64() provides a better hash. For example, on a VM with ~200G (99458 direct=1 kvm_mmu_page structs): hash max_mmu_page_hash_collisions -------------------------------------------- low 10 bits 49847 hash_64 105 perfect 97 While we're changing the hash, increase the table size by 4x to better support large VMs (further reduces number of collisions in 200G VM to 29). Note that hash_64() does not provide a good distribution prior to commit `ef703f49a6` ("Eliminate bad hash multipliers from hash_32() and hash_64()"). Signed-off-by: David Matlack <dmatlack@google.com> Change-Id: I5aa6b13c834722813c6cca46b8b1ed6f53368ade Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-09 14:46:04 +01:00
David Matlack	f3414bc774	kvm: x86: export maximum number of mmu_page_hash collisions Report the maximum number of mmu_page_hash collisions as a per-VM stat. This will make it easy to identify problems with the mmu_page_hash in the future. Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-09 14:46:03 +01:00
Radim Krčmář	826da32140	KVM: x86: simplify conditions with split/kernel irqchip Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-09 14:45:57 +01:00
Radim Krčmář	8231f50d98	KVM: x86: prevent setup of invalid routes The check in kvm_set_pic_irq() and kvm_set_ioapic_irq() was just a temporary measure until the code improved enough for us to do this. This changes APIC in a case when KVM_SET_GSI_ROUTING is called to set up pic and ioapic routes before KVM_CREATE_IRQCHIP. Those rules would get overwritten by KVM_CREATE_IRQCHIP at best, so it is pointless to allow it. Userspaces hopefully noticed that things don't work if they do that and don't do that. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-09 14:45:50 +01:00
Radim Krčmář	e5dc48777d	KVM: x86: refactor pic setup in kvm_set_routing_entry Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-09 14:45:36 +01:00
Radim Krčmář	099413664c	KVM: x86: make pic setup code look like ioapic setup We don't treat kvm->arch.vpic specially anymore, so the setup can look like ioapic. This gets a bit more information out of return values. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-09 14:45:28 +01:00
Radim Krčmář	49776faf93	KVM: x86: decouple irqchip_in_kernel() and pic_irqchip() irqchip_in_kernel() tried to save a bit by reusing pic_irqchip(), but it just complicated the code. Add a separate state for the irqchip mode. Reviewed-by: David Hildenbrand <david@redhat.com> [Used Paolo's version of condition in irqchip_in_kernel().] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-01-09 14:42:47 +01:00
Radim Krčmář	35e6eaa3df	KVM: x86: don't allow kernel irqchip with split irqchip Split irqchip cannot be created after creating the kernel irqchip, but we forgot to restrict the other way. This is an API change. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-01-09 14:38:19 +01:00
Linus Torvalds	08289086b0	KVM fixes for v4.10-rc3 MIPS: (both for stable) - fix host kernel crashes when receiving a signal with 64-bit userspace - flush instruction cache on all vcpus after generating entry code x86: - fix NULL dereference in MMU caused by SMM transitions (for stable) - correct guest instruction pointer after emulating some VMX errors - minor cleanup -----BEGIN PGP SIGNATURE----- iQEcBAABCAAGBQJYb/N7AAoJEED/6hsPKofoa4QH/0/jwHr64lFeiOzMxqZfTF0y wufcTqw3zGq5iPaNlEwn+6AkKnTq2IPws92FludfPHPb7BrLUPqrXxRlSRN+XPVw pHVcV9u0q4yghMi7/6Flu3JASnpD6PrPZ7ezugZwgXFrR7pewd/+sTq6xBUnI9rZ nNEYsfh8dYiBicxSGXlmZcHLuJJHKshjsv9F6ngyBGXAAf/F+nLiJReUzPO0m2+P gmXi5zhVu6z05zlaCW1KAmJ1QV1UJla1vZnzrnK3twRK/05l7YX+xCbHIo1wB03R 2YhKDnSrnG3Zt+KpXfRhADXazNgM5ASvORdvI6RvjLNVxlnOveQtAcfRyvZezT4= =LXLf -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull KVM fixes from Radim Krčmář: "MIPS: - fix host kernel crashes when receiving a signal with 64-bit userspace - flush instruction cache on all vcpus after generating entry code (both for stable) x86: - fix NULL dereference in MMU caused by SMM transitions (for stable) - correct guest instruction pointer after emulating some VMX errors - minor cleanup" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: VMX: remove duplicated declaration KVM: MIPS: Flush KVM entry code from icache globally KVM: MIPS: Don't clobber CP0_Status.UX KVM: x86: reset MMU on KVM_SET_VCPU_EVENTS KVM: nVMX: fix instruction skipping during emulated vm-entry	2017-01-06 15:27:17 -08:00
Jan Dakinevich	69130ea1e6	KVM: VMX: remove duplicated declaration Declaration of VMX_VPID_EXTENT_SUPPORTED_MASK occures twice in the code. Probably, it was happened after unsuccessful merge. Signed-off-by: Jan Dakinevich <jan.dakinevich@gmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-01-05 15:08:48 +01:00
Linus Torvalds	3ddc76dfc7	Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer type cleanups from Thomas Gleixner: "This series does a tree wide cleanup of types related to timers/timekeeping. - Get rid of cycles_t and use a plain u64. The type is not really helpful and caused more confusion than clarity - Get rid of the ktime union. The union has become useless as we use the scalar nanoseconds storage unconditionally now. The 32bit timespec alike storage got removed due to the Y2038 limitations some time ago. That leaves the odd union access around for no reason. Clean it up. Both changes have been done with coccinelle and a small amount of manual mopping up" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: ktime: Get rid of ktime_equal() ktime: Cleanup ktime_set() usage ktime: Get rid of the union clocksource: Use a plain u64 instead of cycle_t	2016-12-25 14:30:04 -08:00
Thomas Gleixner	8b0e195314	ktime: Cleanup ktime_set() usage ktime_set(S,N) was required for the timespec storage type and is still useful for situations where a Seconds and Nanoseconds part of a time value needs to be converted. For anything where the Seconds argument is 0, this is pointless and can be replaced with a simple assignment. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org>	2016-12-25 17:21:22 +01:00
Thomas Gleixner	a5a1d1c291	clocksource: Use a plain u64 instead of cycle_t There is no point in having an extra type for extra confusion. u64 is unambiguous. Conversion was done with the following coccinelle script: @rem@ @@ -typedef u64 cycle_t; @fix@ typedef cycle_t; @@ -cycle_t +u64 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: John Stultz <john.stultz@linaro.org>	2016-12-25 11:04:12 +01:00
Thomas Gleixner	73c1b41e63	cpu/hotplug: Cleanup state names When the state names got added a script was used to add the extra argument to the calls. The script basically converted the state constant to a string, but the cleanup to convert these strings into meaningful ones did not happen. Replace all the useless strings with 'subsys/xxx/yyy:state' strings which are used in all the other places already. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Siewior <bigeasy@linutronix.de> Link: http://lkml.kernel.org/r/20161221192112.085444152@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2016-12-25 10:47:44 +01:00
Xiao Guangrong	6ef4e07ecd	KVM: x86: reset MMU on KVM_SET_VCPU_EVENTS Otherwise, mismatch between the smm bit in hflags and the MMU role can cause a NULL pointer dereference. Cc: stable@vger.kernel.org Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-12-24 10:16:04 +01:00
David Matlack	b428018a06	KVM: nVMX: fix instruction skipping during emulated vm-entry kvm_skip_emulated_instruction() should not be called after emulating a VM-entry failure during or after loading guest state (nested_vmx_entry_failure()). Otherwise the L1 hypervisor is resumed some number of bytes past vmcs->host_rip. Fixes: `eb27756217` Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-12-21 18:55:09 +01:00
Jim Mattson	ef85b67385	kvm: nVMX: Allow L1 to intercept software exceptions (#BP and #OF) When L2 exits to L0 due to "exception or NMI", software exceptions (#BP and #OF) for which L1 has requested an intercept should be handled by L1 rather than L0. Previously, only hardware exceptions were forwarded to L1. Signed-off-by: Jim Mattson <jmattson@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-12-19 16:05:31 +01:00

1 2 3 4 5 ...

4126 Commits