The in-kernel XICS emulation is faster than doing it all in QEMU
and it has had a lot of testing, so enable it by default.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Currently the H_CONFER hcall is implemented in kernel virtual mode,
meaning that whenever a guest thread does an H_CONFER, all the threads
in that virtual core have to exit the guest. This is bad for
performance because it interrupts the other threads even if they
are doing useful work.
The H_CONFER hcall is called by a guest VCPU when it is spinning on a
spinlock and it detects that the spinlock is held by a guest VCPU that
is currently not running on a physical CPU. The idea is to give this
VCPU's time slice to the holder VCPU so that it can make progress
towards releasing the lock.
To avoid having the other threads exit the guest unnecessarily,
we add a real-mode implementation of H_CONFER that checks whether
the other threads are doing anything. If all the other threads
are idle (i.e. in H_CEDE) or trying to confer (i.e. in H_CONFER),
it returns H_TOO_HARD which causes a guest exit and allows the
H_CONFER to be handled in virtual mode.
Otherwise it spins for a short time (up to 10 microseconds) to give
other threads the chance to observe that this thread is trying to
confer. The spin loop also terminates when any thread exits the guest
or when all other threads are idle or trying to confer. If the
timeout is reached, the H_CONFER returns H_SUCCESS. In this case the
guest VCPU will recheck the spinlock word and most likely call
H_CONFER again.
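In rough code form, the real-mode policy is as below; the helper names and
vcore fields are illustrative stand-ins rather than the actual kernel symbols:

        /* Sketch of the real-mode H_CONFER policy described above;
         * helpers and fields are illustrative, not the kernel's. */
        long h_confer_realmode_sketch(struct vcore_state *vc, int me)
        {
                u64 deadline = read_timebase() + usecs_to_tb_ticks(10);
                long ret = H_SUCCESS;      /* on timeout: guest rechecks the lock */

                mark_thread_conferring(vc, me);
                while (read_timebase() < deadline && !any_thread_exited(vc)) {
                        if (all_other_threads_idle_or_conferring(vc, me)) {
                                ret = H_TOO_HARD; /* exit guest, confer in virtual mode */
                                break;
                        }
                }
                clear_thread_conferring(vc, me);
                return ret;
        }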
This also improves the implementation of the H_CONFER virtual mode
handler. If the VCPU is part of a virtual core (vcore) which is
runnable, there will be a 'runner' VCPU which has taken responsibility
for running the vcore. In this case we yield to the runner VCPU
rather than the target VCPU.
We also introduce a check on the target VCPU's yield count: if it
differs from the yield count passed to H_CONFER, the target VCPU
has run since H_CONFER was called and may have already released
the lock. This check is required by PAPR.
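Schematically, the virtual-mode handler now does something like the
following; the lookup helper and field names are assumptions, while
kvm_vcpu_yield_to() is the generic KVM yield primitive:

        /* Sketch only: names other than kvm_vcpu_yield_to() are illustrative. */
        target = find_target_vcpu(vcpu->kvm, target_id);
        if (!target)
                return H_SUCCESS;
        if (yield_count != read_yield_count(target))
                return H_SUCCESS;       /* target has run since H_CONFER (PAPR check) */
        vc = target->arch.vcore;
        if (vc && vc->runner)
                target = vc->runner;    /* yield to the runner VCPU, not the target */
        kvm_vcpu_yield_to(target);
        return H_SUCCESS;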
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
There are two ways in which a guest instruction can be obtained from
the guest in the guest exit code in book3s_hv_rmhandlers.S. If the
exit was caused by a Hypervisor Emulation interrupt (i.e. an illegal
instruction), the offending instruction is in the HEIR register
(Hypervisor Emulation Instruction Register). If the exit was caused
by a load or store to an emulated MMIO device, we load the instruction
from the guest by turning data relocation on and loading the instruction
with an lwz instruction.
Unfortunately, in the case where the guest has opposite endianness to
the host, these two methods give results of different endianness, but
both get put into vcpu->arch.last_inst. The HEIR value has been loaded
using guest endianness, whereas the lwz will load the instruction using
host endianness. The rest of the code that uses vcpu->arch.last_inst
assumes it was loaded using host endianness.
To fix this, we define a new vcpu field to store the HEIR value. Then,
in kvmppc_handle_exit_hv(), we transfer the value from this new field to
vcpu->arch.last_inst, doing a byte-swap if the guest and host endianness
differ.
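Concretely, the fixup on exit looks roughly like this; the new field name is
an assumption, while swab32() and kvmppc_need_byteswap() are existing kernel
helpers:

        /* Sketch of the transfer in kvmppc_handle_exit_hv(). */
        u32 inst = vcpu->arch.emul_inst;        /* raw HEIR value, guest endian */

        if (kvmppc_need_byteswap(vcpu))         /* guest endianness != host endianness */
                inst = swab32(inst);
        vcpu->arch.last_inst = inst;            /* now consistently host endian */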
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This removes the code that was added to enable HV KVM to work
on PPC970 processors. The PPC970 is an old CPU that doesn't
support virtualizing guest memory. Removing PPC970 support also
lets us remove the code for allocating and managing contiguous
real-mode areas, the code for the !kvm->arch.using_mmu_notifiers
case, the code for pinning pages of guest memory when first
accessed and keeping track of which pages have been pinned, and
the code for handling H_ENTER hypercalls in virtual mode.
Book3S HV KVM is now supported only on POWER7 and POWER8 processors.
The KVM_CAP_PPC_RMA capability now always returns 0.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This patch adds trace points in the guest entry and exit code and also
for exceptions handled by the host in kernel mode - hypercalls and page
faults. The new events are added to /sys/kernel/debug/tracing/events
under a new subsystem called kvm_hv.
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Currently the calculations of stolen time for PPC Book3S HV guests
use fields in both the vcpu struct and the kvmppc_vcore struct. The
fields in the kvmppc_vcore struct are protected by the
vcpu->arch.tbacct_lock of the vcpu that has taken responsibility for
running the virtual core. This works correctly but confuses lockdep,
because it sees that the code takes the tbacct_lock for a vcpu in
kvmppc_remove_runnable() and then takes another vcpu's tbacct_lock in
vcore_stolen_time(), and it thinks there is a possibility of deadlock,
causing it to print reports like this:
=============================================
[ INFO: possible recursive locking detected ]
3.18.0-rc7-kvm-00016-g8db4bc6 #89 Not tainted
---------------------------------------------
qemu-system-ppc/6188 is trying to acquire lock:
(&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb1fe8>] .vcore_stolen_time+0x48/0xd0 [kvm_hv]
but task is already holding lock:
(&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb25a0>] .kvmppc_remove_runnable.part.3+0x30/0xd0 [kvm_hv]
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&(&vcpu->arch.tbacct_lock)->rlock);
lock(&(&vcpu->arch.tbacct_lock)->rlock);
*** DEADLOCK ***
May be due to missing lock nesting notation
3 locks held by qemu-system-ppc/6188:
#0: (&vcpu->mutex){+.+.+.}, at: [<d00000000eb93f98>] .vcpu_load+0x28/0xe0 [kvm]
#1: (&(&vcore->lock)->rlock){+.+...}, at: [<d00000000ecb41b0>] .kvmppc_vcpu_run_hv+0x530/0x1530 [kvm_hv]
#2: (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb25a0>] .kvmppc_remove_runnable.part.3+0x30/0xd0 [kvm_hv]
stack backtrace:
CPU: 40 PID: 6188 Comm: qemu-system-ppc Not tainted 3.18.0-rc7-kvm-00016-g8db4bc6 #89
Call Trace:
[c000000b2754f3f0] [c000000000b31b6c] .dump_stack+0x88/0xb4 (unreliable)
[c000000b2754f470] [c0000000000faeb8] .__lock_acquire+0x1878/0x2190
[c000000b2754f600] [c0000000000fbf0c] .lock_acquire+0xcc/0x1a0
[c000000b2754f6d0] [c000000000b2954c] ._raw_spin_lock_irq+0x4c/0x70
[c000000b2754f760] [d00000000ecb1fe8] .vcore_stolen_time+0x48/0xd0 [kvm_hv]
[c000000b2754f7f0] [d00000000ecb25b4] .kvmppc_remove_runnable.part.3+0x44/0xd0 [kvm_hv]
[c000000b2754f880] [d00000000ecb43ec] .kvmppc_vcpu_run_hv+0x76c/0x1530 [kvm_hv]
[c000000b2754f9f0] [d00000000eb9f46c] .kvmppc_vcpu_run+0x2c/0x40 [kvm]
[c000000b2754fa60] [d00000000eb9c9a4] .kvm_arch_vcpu_ioctl_run+0x54/0x160 [kvm]
[c000000b2754faf0] [d00000000eb94538] .kvm_vcpu_ioctl+0x498/0x760 [kvm]
[c000000b2754fcb0] [c000000000267eb4] .do_vfs_ioctl+0x444/0x770
[c000000b2754fd90] [c0000000002682a4] .SyS_ioctl+0xc4/0xe0
[c000000b2754fe30] [c0000000000092e4] syscall_exit+0x0/0x98
In order to make the locking easier to analyse, we change the code to
use a spinlock in the kvmppc_vcore struct to protect the stolen_tb and
preempt_tb fields. This lock needs to be an irq-safe lock since it is
used in the kvmppc_core_vcpu_load_hv() and kvmppc_core_vcpu_put_hv()
functions, which are called with the scheduler rq lock held, which is
an irq-safe lock.
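For illustration, a stolen-time update under the new lock could look like
this; the lock and field names follow the description above but should be
treated as assumptions:

        unsigned long flags;

        spin_lock_irqsave(&vc->stoltb_lock, flags); /* irq-safe: caller may hold the rq lock */
        if (vc->preempt_tb != TB_NIL) {
                vc->stolen_tb += mftb() - vc->preempt_tb;
                vc->preempt_tb = TB_NIL;
        }
        spin_unlock_irqrestore(&vc->stoltb_lock, flags);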
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Remove the function inst_set_field() that is not used anywhere.
This was partially found by using a static code analysis program called cppcheck.
Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Alexander Graf <agraf@suse.de>
Remove the function get_fpr_index() that is not used anywhere.
This was partially found by using a static code analysis program called cppcheck.
Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Alexander Graf <agraf@suse.de>
Removes some functions that are not used anywhere:
kvmppc_core_load_guest_debugstate() kvmppc_core_load_host_debugstate()
This was partially found by using a static code analysis program called cppcheck.
Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Alexander Graf <agraf@suse.de>
Remove the function sr_nx() that is not used anywhere.
This was partially found by using a static code analysis program called cppcheck.
Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Alexander Graf <agraf@suse.de>
The kvmppc_vcore_blocked() code does not check for the wait condition
after putting the process on the wait queue. This means that it is
possible for an external interrupt to become pending, but the vcpu to
remain asleep until the next decrementer interrupt. The fix is to
make one last check for pending exceptions and ceded state before
calling schedule().
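The shape of the fix is the usual prepare-to-wait pattern, roughly as below;
the loop over runnable VCPUs is schematic:

        prepare_to_wait(&vc->wq, &wait, TASK_INTERRUPTIBLE);

        /* Last check before sleeping: did an exception arrive, or did a
         * VCPU un-cede, after we decided to block? */
        do_sleep = 1;
        for_each_runnable_vcpu_sketch(vcpu, vc) {
                if (vcpu->arch.pending_exceptions || !vcpu->arch.ceded) {
                        do_sleep = 0;
                        break;
                }
        }
        if (do_sleep)
                schedule();
        finish_wait(&vc->wq, &wait);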
Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
When being restored from qemu, the kvm_get_htab_header structs are in
native endian, but the ptes are big endian.
This patch fixes restore on a KVM LE host. QEMU also needs a fix for
this:
http://lists.nongnu.org/archive/html/qemu-ppc/2014-11/msg00008.html
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
This fixes some inaccuracies in the state machine for the virtualized
ICP when implementing the H_IPI hcall (Set_MFRR and related states):
1. The old code wipes out any pending interrupts when the new MFRR is
more favored than the CPPR but less favored than a pending
interrupt (by always modifying xisr and the pending_pri). This can
cause us to lose a pending external interrupt.
The correct code here is to only modify the pending_pri and xisr in
the ICP if the MFRR is equal to or more favored than the current
pending pri (since in this case, it is guaranteed that there
cannot be a pending external interrupt). The code changes are
required in both kvmppc_rm_h_ipi and kvmppc_h_ipi.
2. Again, in both kvmppc_rm_h_ipi and kvmppc_h_ipi, there is a check
for whether the MFRR is being made less favored AND, further, whether
the new MFRR is also less favored than the current CPPR; only then do
we check for any resends pending in the ICP. These checks look like
they are designed to cover the case where, if the MFRR is being made
less favored, we opportunistically trigger a resend of any interrupts
that had been previously rejected (see the sketch after this list).
Although this is not a state described by PAPR, it is an action we
actually need to take, especially if the CPPR is already at 0xFF,
because in that case the resend bit will stay on until another ICP
state change, which may be a long time coming, and the interrupt
stays pending until then. The current code, which checks for
MFRR < CPPR, is broken when CPPR is 0xFF since it will not get
triggered in that case.
Ideally, we would want to do a resend only if
prio(pending_interrupt) < mfrr && prio(pending_interrupt) < cppr
where pending interrupt is the one that was rejected. But we don't
have the priority of the pending interrupt state saved, so we
simply trigger a resend whenever the MFRR is made less favored.
3. In kvmppc_rm_h_ipi, where we save state to pass resends to the
virtual mode, we also need to save the ICP whose need_resend we
reset since this does not need to be my ICP (vcpu->arch.icp) as is
incorrectly assumed by the current code. A new field rm_resend_icp
is added to the kvmppc_icp structure for this purpose.
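Putting points 1 and 2 together, the corrected Set_MFRR handling (inside the
usual compare-and-swap retry loop over the ICP state) is roughly the sketch
below; field names mirror the ICP state but are illustrative:

        /* XICS priorities: 0 is most favored, 0xff least favored. */
        new.mfrr = mfrr;

        if (mfrr < new.cppr && mfrr <= new.pending_pri) {
                /* The IPI is now the most favored pending interrupt; any
                 * previously pending source must be rejected (and resent). */
                reject = new.xisr;
                new.xisr = XICS_IPI;
                new.pending_pri = mfrr;
        }

        if (mfrr > old.mfrr) {
                /* MFRR made less favored: opportunistically trigger a resend
                 * of previously rejected interrupts, even when cppr == 0xff. */
                resend = new.need_resend;
                new.need_resend = 0;
        }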
Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Testing with KSM active in the host showed occasional corruption of
guest memory. Typically a page that should have contained zeroes
would contain values that look like the contents of a user process
stack (values such as 0x0000_3fff_xxxx_xxx).
Code inspection in kvmppc_h_protect revealed that there was a race
condition with the possibility of granting write access to a page
which is read-only in the host page tables. The code attempts to keep
the host mapping read-only if the host userspace PTE is read-only, but
if that PTE had been temporarily made invalid for any reason, the
read-only check would not trigger and the host HPTE could end up
read-write. Examination of the guest HPT in the failure situation
revealed that there were indeed shared pages which should have been
read-only that were mapped read-write.
To close this race, we don't let a page go from being read-only to
being read-write, as far as the real HPTE mapping the page is
concerned (the guest view can go to read-write, but the actual mapping
stays read-only). When the guest tries to write to the page, we take
an HDSI and let kvmppc_book3s_hv_page_fault take care of providing a
writable HPTE for the page.
This eliminates the occasional corruption of shared pages
that was previously seen with KSM active.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
The B (segment size) field in the RB operand for the tlbie
instruction is two bits, which we get from the top two bits of
the first doubleword of the HPT entry to be invalidated. These
bits go in bits 8 and 9 of the RB operand (bits 54 and 55 in IBM
bit numbering).
The compute_tlbie_rb() function gets these bits as v >> (62 - 8),
which is not correct as it will bring in the top 10 bits, not
just the top two. These extra bits could corrupt the AP, AVAL
and L fields in the RB value. To fix this we shift right 62 bits
and then shift left 8 bits, so we only get the two bits of the
B field.
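In code form, the change amounts to:

        /* before: brings in the top 10 bits of v, corrupting lower RB fields */
        rb |= v >> (62 - 8);

        /* after: keep only the two B (segment size) bits, placed at RB bits 8-9 */
        rb |= (v >> 62) << 8;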
The first doubleword of the HPT entry is under the control of the
guest kernel. In fact, Linux guests will always put zeroes in bits
54 -- 61 (IBM bits 2 -- 9), but we should not rely on guests doing
this.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
In kvm_test_clear_dirty(), if we find an invalid HPTE we move on to the
next HPTE without unlocking the invalid one. In fact we should never
find an invalid and unlocked HPTE in the rmap chain, but for robustness
we should unlock it. This adds the missing unlock.
Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
When injecting an IRQ, we only document which IRQ priority (which translates
to IRQ type) gets injected. However, when reading traces you don't necessarily
have all the numbers in your head to know which IRQ really is meant.
This patch converts the IRQ number field to a symbolic name that is in sync
with the respective define. That way it's a lot easier for readers to figure
out what interrupt gets injected.
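For example, a trace event can render the priority symbolically with
__print_symbolic(); the particular define list below is illustrative:

        TP_printk("irqprio=%s",
                  __print_symbolic(__entry->priority,
                                   {BOOK3S_IRQPRIO_DECREMENTER, "DEC"},
                                   {BOOK3S_IRQPRIO_EXTERNAL,    "EXT"},
                                   {BOOK3S_IRQPRIO_SYSCALL,     "SYSCALL"}))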
Signed-off-by: Alexander Graf <agraf@suse.de>
We reused the host's EBX and ECX, but KVM might not support all
features, so the emulated XSAVE size should be smaller.
EBX depends on the (unknown at this point) XCR0, so we default it to ECX.
SDM CPUID (EAX = 0DH, ECX = 0):
EBX Bits 31-00: Maximum size (bytes, from the beginning of the
XSAVE/XRSTOR save area) required by enabled features in XCR0. May
be different than ECX if some features at the end of the XSAVE save
area are not enabled.
ECX Bit 31-00: Maximum size (bytes, from the beginning of the
XSAVE/XRSTOR save area) of the XSAVE/XRSTOR save area required by
all supported features in the processor, i.e all the valid bit
fields in XCR0.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Tested-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add nested virtualization support for xsaves.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add logic to get/set the XSS model-specific register.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Initialize the XSS exit bitmap. It is zero so there should be no XSAVES
or XRSTORS exits.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- EAX=0Dh, ECX=1: output registers EBX/ECX/EDX are reserved.
- EAX=0Dh, ECX>1: output register ECX bit 0 is clear for all the CPUID
leaves we support, because variable "supported" comes from XCR0 and not
XSS. Bits above 0 are reserved, so ECX is overall zero. Output register
EDX is reserved.
Source: Intel Architecture Instruction Set Extensions Programming
Reference, ref. number 319433-022
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Tested-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is the size of the XSAVES area. This starts providing guest support
for XSAVES (with no support yet for supervisor states, i.e. XSS == 0
always in guests for now).
Wanpeng Li suggested testing XSAVEC as well as XSAVES, since in practice
no real processor exists that only has one of them, and there is no
other way for userspace programs to compute the size of the XSAVEC
save area. CPUID(EAX=0xd,ECX=1).EBX provides an upper bound.
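A userspace program could query that upper bound along these lines (a
standalone sketch using GCC's <cpuid.h>):

        #include <cpuid.h>
        #include <stdio.h>

        int main(void)
        {
                unsigned int eax, ebx, ecx, edx;

                /* CPUID leaf 0xD, sub-leaf 1: EBX reports the size of the
                 * compacted save area for the currently enabled XCR0|XSS. */
                __cpuid_count(0xd, 1, eax, ebx, ecx, edx);
                printf("XSAVEC/XSAVES area upper bound: %u bytes\n", ebx);
                return 0;
        }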
Suggested-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Tested-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Expose the XSAVES feature to the guest if the kvm_x86_ops say it is
available.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Userspace is expecting non-compacted format for KVM_GET_XSAVE, but
struct xsave_struct might be using the compacted format. Convert
in order to preserve userspace ABI.
Likewise, userspace is passing non-compacted format for KVM_SET_XSAVE
but the kernel will pass it to XRSTORS, and we need to convert back.
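The conversion is conceptually a walk over the enabled state components. A
very rough standalone sketch, which, unlike the kernel code, ignores the
64-byte-alignment bit in XCOMP_BV and takes the per-component standard
offsets and sizes from CPUID leaf 0xD as inputs:

        #include <stdint.h>
        #include <string.h>

        static void xsave_compacted_to_standard(uint8_t *dst, const uint8_t *src,
                                                uint64_t xstate_bv,
                                                const uint32_t *std_off, /* CPUID(0xD,i).EBX */
                                                const uint32_t *size)    /* CPUID(0xD,i).EAX */
        {
                uint32_t src_off = 512 + 64;    /* skip legacy region + XSAVE header */
                int i;

                memcpy(dst, src, 512 + 64);     /* x87/SSE state and header copy as-is */
                for (i = 2; i < 63; i++) {
                        if (!(xstate_bv & (1ULL << i)))
                                continue;
                        memcpy(dst + std_off[i], src + src_off, size[i]);
                        src_off += size[i];
                }
        }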
Fixes: f31a9f7c71
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: stable@vger.kernel.org
Cc: H. Peter Anvin <hpa@linux.intel.com>
Tested-by: Nadav Amit <namit@cs.technion.ac.il>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
get_xsave_addr is the API to access XSAVE states, and KVM would
like to use it. Export it.
Cc: stable@vger.kernel.org
Cc: x86@kernel.org
Cc: H. Peter Anvin <hpa@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Merge tag 'kvm-s390-next-20141204' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
KVM: s390: Fixups for kvm/next (3.19)
Here we have two fixups of the latest interrupt rework and
one architectural fixup.
Instead of returning a possibly random or'ed together value, let's
always return -EFAULT if rc is set.
Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Currently we use a mixture of atomic/non-atomic bitops
and the local_int spin lock to protect the pending_irqs bitmap
and interrupt payload data.
We need to use atomic bitops for the pending_irqs bitmap everywhere
and in addition acquire the local_int lock where interrupt data needs
to be protected.
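As a sketch of that convention (the pending-bit name and payload field below
are placeholders, not the exact kvm-s390 definitions):

        spin_lock(&li->lock);                   /* local_int lock protects the payload */
        li->irq.ext = *ext_irq;                 /* interrupt data written under the lock */
        set_bit(IRQ_PEND_EXT_SOME, &li->pending_irqs); /* atomic bitop; placeholder flag name */
        spin_unlock(&li->lock);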
Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
The cpu address of a source cpu (responsible for an external irq) is only to
be stored if bit 6 of the ext irq code is set.
If bit 6 is not set, it is to be zeroed out.
The special external irq code used for virtio and pfault uses the cpu addr as a
parameter field. As bit 6 is set, this implementation is correct.
Reviewed-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
We currently track the pid of the task that runs the VCPU in vcpu_load.
If a yield to that VCPU is triggered while the PID of the wrong thread
is active, the wrong thread might receive a yield, but this will most
likely not help the executing thread at all. Instead, if we only track
the pid on the KVM_RUN ioctl, there are two possibilities:
1) the thread that did a non-KVM_RUN ioctl is holding a mutex that
the VCPU thread is waiting for. In this case, the VCPU thread is not
runnable, but we also do not do a wrong yield.
2) the thread that did a non-KVM_RUN ioctl is sleeping, or doing
something that does not block the VCPU thread. In this case, the
VCPU thread can receive the directed yield correctly.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
CC: Michael Mueller <mimu@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_enter_guest() has to be called with preemption disabled and will
set PF_VCPU. Current code takes PF_VCPU as a hint that the VCPU thread
is running and therefore needs no yield.
However, the check on PF_VCPU is wrong on s390, where preemption has
to stay enabled in order to correctly process page faults. Thus,
s390 reenables preemption and starts to execute the guest. The thread
might be scheduled out between kvm_enter_guest() and kvm_exit_guest(),
resulting in PF_VCPU being set but not being run. When this happens,
the opportunity for directed yield is missed.
However, this check is done already in kvm_vcpu_on_spin before calling
kvm_vcpu_yield_to:
if (!ACCESS_ONCE(vcpu->preempted))
continue;
so the check on PF_VCPU is superfluous in general, and this patch
removes it.
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The current linear search doesn't scale well when a large number of
memslots is used and the looked-up slot is not at the beginning of the
memslots array.
Taking into account that memslots don't overlap, it's possible to
switch the sorting order of the memslots array from 'npages' to
'base_gfn' and use binary search for memslot lookup by GFN.
As a result of switching to binary search, lookup times are reduced
when a large number of memslots is in use.
Following is a table of search_memslot() cycles
during WS2008R2 guest boot.
                            max      boot, mostly same   boot + ~10 min of using it,
                            cycles   slot lookup         randomized lookup
                                     (avg cycles)        (avg cycles)
13 slots,  linear search    1450     28                  30
13 slots,  binary search    1400     30                  40
117 slots, linear search    13000    30                  460
117 slots, binary search    2000     35                  180
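For reference, the lookup over an array sorted by base_gfn in descending
order works as in this standalone sketch (modeled on, but not identical to,
the kernel's search_memslots()):

        struct slot { unsigned long base_gfn, npages; };

        /* Find the slot containing gfn; slots[] is sorted by base_gfn, descending. */
        static struct slot *find_slot(struct slot *slots, int nslots, unsigned long gfn)
        {
                int lo = 0, hi = nslots;

                while (lo < hi) {
                        int mid = lo + (hi - lo) / 2;

                        if (gfn >= slots[mid].base_gfn)
                                hi = mid;       /* answer is at mid or to its left */
                        else
                                lo = mid + 1;
                }
                if (lo < nslots && gfn >= slots[lo].base_gfn &&
                    gfn < slots[lo].base_gfn + slots[lo].npages)
                        return &slots[lo];
                return NULL;                    /* gfn not covered by any slot */
        }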
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This will allow the use of binary search for GFN -> memslot
lookups, reducing lookup cost when a large number of slots is in use.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In a typical guest boot workload only 2-3 memslots are used
extensively, and at that it's mostly the same memslot
lookup operation.
Adding an LRU cache improves the average lookup time from
46 to 28 cycles (~40%) for this workload.
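The cache is just a "check the last hit first" shortcut in front of the
binary search, roughly as below (field and helper names illustrative):

        slot = &memslots[slots->lru_slot];
        if (gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages)
                return slot;                            /* fast path hit */

        slot = binary_search_sketch(slots, gfn);        /* fall back to the full lookup */
        if (slot)
                slots->lru_slot = slot - memslots;      /* remember for next time */
        return slot;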
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The UP/DOWN shift loops will shift the array in the needed
direction and stop at the place where the new slot should
be put, regardless of the old slot size.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the number of pages hasn't changed, the sorting algorithm
will do nothing, so there is no need for an extra check
to avoid entering the sorting logic.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
While fixing an x2apic bug,
17d68b7 KVM: x86: fix guest-initiated crash with x2apic (CVE-2013-6376)
we made only one cluster available. This means that the number of
logically addressable x2APICs was reduced to 16, and VCPUs kept
overwriting themselves in that region, so even the first cluster wasn't
set up correctly.
This patch extends x2APIC support back to the logical_map's limit, and
keeps the CVE fixed as messages for non-present APICs are dropped.
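For reference, in x2APIC mode the logical ID (LDR) is derived from the
x2APIC ID as the SDM describes, which is what the logical map has to
accommodate:

        /* x2APIC: 16-bit cluster number, 16-bit one-hot position in the cluster. */
        uint32_t ldr      = ((x2apic_id >> 4) << 16) | (1u << (x2apic_id & 0xf));
        uint16_t cluster  = ldr >> 16;      /* selects the cluster in the logical map */
        uint16_t position = ldr & 0xffff;   /* CPU's bit within that cluster */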
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
They can't be violated now, but play it safe for the future.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
x2apic allows destinations > 0xff and we don't want them delivered to
lower APICs. They are correctly handled by doing nothing.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Physical mode can't address more than one APIC, but lowest-prio is
allowed, so we just reuse our paths.
SDM 10.6.2.1 Physical Destination:
Also, for any non-broadcast IPI or I/O subsystem initiated interrupt
with lowest priority delivery mode, software must ensure that APICs
defined in the interrupt address are present and enabled to receive
interrupts.
We could warn on top of that.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
False from kvm_irq_delivery_to_apic_fast() means that we don't handle it
in the fast path, but we still return false in cases that were perfectly
handled, fix that.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
0x830 MSR is 0x300 xAPIC MMIO, which is MSR_ICR.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
x2APIC has no registers for DFR and ICR2 (see Intel SDM 10.12.1.2 "x2APIC
Register Address Space"). KVM needs to cause #GP on such accesses.
Fix it (DFR and ICR2 on read, ICR2 on write, DFR already handled on writes).
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Certain x86 instructions that use modrm operands only allow memory operand
(i.e., mod012), and cause a #UD exception otherwise. KVM ignores this fact.
Currently, the instructions of this kind that are emulated by KVM are MOVBE,
MOVNTPS, MOVNTPD and MOVNTI. MOVBE is the most blunt example, since it may be
emulated by the host regardless of MMIO.
The fix introduces a new group for handling such instructions, marking mod3 as
illegal instruction.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Merge tag 'kvm-s390-next-20141128' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
KVM: s390: Several fixes,cleanups and reworks
Here is a bunch of fixes that deal mostly with architectural compliance:
- interrupt priorities
- interrupt handling
- instruction exit handling
We also provide a helper function for getting the guest visible storage key.
Allow specifying CR14, the logout area, the external damage code
and the failed storage address.
Since more than one machine check can be indicated to the guest at
a time, we need to combine all indication bits with already pending
requests.
Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
This patch adapts handling of local interrupts to be more compliant with
the z/Architecture Principles of Operation and introduces a data
structure
which allows more efficient handling of interrupts.
* get rid of li->active flag, use bitmap instead
* Keep interrupts in a bitmap instead of a list
* Deliver interrupts in the order of their priority as defined in the
PoP
* Use a second bitmap for sigp emergency requests, as a CPU can have
one request pending from every other CPU in the system.
Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>