This patch includes changes to the external API for SMM support.
Userspace can predicate the availability of the new fields and
ioctls on a new capability, KVM_CAP_X86_SMM, which is added at the end
of the patch series.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The hflags field will contain information about system management mode
and will be useful for the emulator. Pass the entire field rather than
just the guest-mode information.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SMBASE is only readable from SMM for the VCPU, but it must be always
accessible if userspace is accessing it. Thus, all functions that
read MSRs are changed to accept a struct msr_data; the host_initiated
and index fields are pre-initialized, while the data field is filled
on return.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We will want to filter away MSR_IA32_SMBASE from the emulated_msrs if
the host CPU does not support SMM virtualization. Introduce the
logic to do that, and also move paravirt MSRs to emulated_msrs for
simplicity and to get rid of KVM_SAVE_MSRS_BEGIN.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Malicious (or egregiously buggy) userspace can trigger it, but it
should never happen in normal operation.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VFIO has proved itself a much better option than KVM's built-in
device assignment. It is mature, provides better isolation because
it enforces ACS, and even the userspace code is being tested on
a wider variety of hardware these days than the legacy support.
Disable legacy device assignment by default.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Initialize kvmclock base, on kvmclock system MSR write time,
so that the guest sees kvmclock counting from zero.
This matches baremetal behaviour when kvmclock in guest
sets sched clock stable.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
[Remove unnecessary comment. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
arch/x86/kvm/mmu.c: In function 'kvm_mmu_pte_write':
arch/x86/kvm/mmu.c:4256: error: unknown field 'cr0_wp' specified in initializer
arch/x86/kvm/mmu.c:4257: error: unknown field 'cr4_pae' specified in initializer
arch/x86/kvm/mmu.c:4257: warning: excess elements in union initializer
...
gcc-4.4.4 (at least) has issues when using anonymous unions in
initializers.
Fixes: edc90b7dc4 ("KVM: MMU: fix SMAP virtualization")
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There is no reason to deny this feature to guests. We are emulating the
APIC timer, thus are exposing it without stops in power-saving states.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Logical x2APIC stops working if we rewrite it with zeros.
The best references are SDM April 2015: 10.12.10.1 Logical Destination
Mode in x2APIC Mode
[...], the LDR are initialized by hardware based on the value of
x2APIC ID upon x2APIC state transitions.
and SDM April 2015: 10.12.10.2 Deriving Logical x2APIC ID from the Local
x2APIC ID
The LDR initialization occurs whenever the x2APIC mode is enabled
Signed-off-by: Radim KrÄmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SDM April 2015, 10.12.5 State Changes From xAPIC Mode to x2APIC Mode
• Any APIC ID value written to the memory-mapped local APIC ID register
is not preserved.
Fix it by sourcing vcpu_id (= initial APIC ID) instead of memory-mapped
APIC ID. Proper use of apic functions would result in two calls to
recalculate_apic_map(), so this patch makes a new helper.
Signed-off-by: Radim KrÄmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The periodic kvmclock sync can be an undesired source of latencies.
When running cyclictest on a guest, a latency spike is visible.
With kvmclock periodic sync disabled, the spike is gone.
Guests should use ntp which means the propagations of ntp corrections
from the host clock are unnecessary.
v2:
-> Make parameter read-only (Radim)
-> Return early on kvmclock_sync_fn (Andrew)
Reported-and-tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Prepare for multiple address spaces this way, since a VCPU is not available
where unaccount_shadowed is called. We will get to the right kvm_memslots
struct through the role field in struct kvm_mmu_page.
Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The memory slot is already available from gfn_to_memslot_dirty_bitmap.
Isn't it a shame to look it up again? Plus, it makes gfn_to_page_many_atomic
agnostic of multiple VCPU address spaces.
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This lets the function access the new memory slot without going through
kvm_memslots and id_to_memslot. It will simplify the code when more
than one address space will be supported.
Unfortunately, the "const"ness of the new argument must be casted
away in two places. Fixing KVM to accept const struct kvm_memory_slot
pointers would require modifications in pretty much all architectures,
and is left for later.
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Most code already uses consts for the struct kernel_param_ops,
sweep the kernel for the last offending stragglers. Other than
include/linux/moduleparam.h and kernel/params.c all other changes
were generated with the following Coccinelle SmPL patch. Merge
conflicts between trees can be handled with Coccinelle.
In the future git could get Coccinelle merge support to deal with
patch --> fail --> grammar --> Coccinelle --> new patch conflicts
automatically for us on patches where the grammar is available and
the patch is of high confidence. Consider this a feature request.
Test compiled on x86_64 against:
* allnoconfig
* allmodconfig
* allyesconfig
@ const_found @
identifier ops;
@@
const struct kernel_param_ops ops = {
};
@ const_not_found depends on !const_found @
identifier ops;
@@
-struct kernel_param_ops ops = {
+const struct kernel_param_ops ops = {
};
Generated-by: Coccinelle SmPL
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Junio C Hamano <gitster@pobox.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: cocci@systeme.lip6.fr
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Bring the __copy_fpstate_to_fpregs() and copy_fpstate_to_fpregs() functions
in line with the parameter passing convention of other kernel-to-FPU-registers
copying functions: pass around an in-memory FPU register state pointer,
instead of struct fpu *.
NOTE: This patch also changes the assembly constraint of the FXSAVE-leak
workaround from 'fpu->fpregs_active' to 'fpstate' - but that is fine,
as we only need a valid memory address there for the FILDL instruction.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Bring the __copy_fpstate_to_fpregs() and copy_fpstate_to_fpregs() functions
in line with the naming of other kernel-to-FPU-registers copying functions.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Architecture-specific helpers are not supposed to muck with
struct kvm_userspace_memory_region contents. Add const to
enforce this.
In order to eliminate the only write in __kvm_set_memory_region,
the cleaning of deleted slots is pulled up from update_memslots
to __kvm_set_memory_region.
Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_memslots provides lockdep checking. Use it consistently instead of
explicit dereferencing of kvm->memslots.
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The MPX feature requires eager KVM FPU restore support. We have verified
that MPX cannot work correctly with the current lazy KVM FPU restore
mechanism. Eager KVM FPU restore should be enabled if the MPX feature is
exposed to VM.
Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
[Also activate the FPU on AMD processors. - Paolo]
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
gfn_to_pfn_async is used in just one place, and because of x86-specific
treatment that place will need to look at the memory slot. Hence inline
it into try_async_pf and export __gfn_to_pfn_memslot.
The patch also switches the subsequent call to gfn_to_pfn_prot to use
__gfn_to_pfn_memslot. This is a small optimization. Finally, remove
the now-unused async argument of __gfn_to_pfn.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
CR0.CD and CR0.NW are not used by shadow page table so that need
not adjust mmu if these two bit are changed
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, whenever guest MTRR registers are changed
kvm_mmu_reset_context is called to switch to the new root shadow page
table, however, it's useless since:
1) the cache type is not cached into shadow page's attribute so that
the original root shadow page will be reused
2) the cache type is set on the last spte, that means we should sync
the last sptes when MTRR is changed
This patch fixs this issue by drop all the spte in the gfn range which
is being updated by MTRR
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are some bugs in current get_mtrr_type();
1: bit 1 of mtrr_state->enabled is corresponding bit 11 of
IA32_MTRR_DEF_TYPE MSR which completely control MTRR's enablement
that means other bits are ignored if it is cleared
2: the fixed MTRR ranges are controlled by bit 0 of
mtrr_state->enabled (bit 10 of IA32_MTRR_DEF_TYPE)
3: if MTRR is disabled, UC is applied to all of physical memory rather
than mtrr_state->def_type
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Reviewed-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Split kvm_unmap_rmapp and introduce kvm_zap_rmapp which will be used in the
later patch
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
slot_handle_level and its helper functions are ready now, use them to
clean up the code
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are several places walking all rmaps for the memslot so that
introduce common functions to cleanup the code
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It's used to abstract the code from kvm_handle_hva_range and it will be
used by later patch
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It's used to walk all the sptes on the rmap to clean up the
code
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM may turn a user page to a kernel page when kernel writes a readonly
user page if CR0.WP = 1. This shadow page entry will be reused after
SMAP is enabled so that kernel is allowed to access this user page
Fix it by setting SMAP && !CR0.WP into shadow page's role and reset mmu
once CR4.SMAP is updated
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When a REP-string is executed in 64-bit mode with an address-size prefix,
ECX/EDI/ESI are used as counter and pointers. When ECX is initially zero, Intel
CPUs clear the high 32-bits of RCX, and recent Intel CPUs update the high bits
of the pointers in MOVS/STOS. This behavior is specific to Intel according to
few experiments.
As one may guess, this is an undocumented behavior. Yet, it is observable in
the guest, since at least VMX traps REP-INS/OUTS even when ECX=0. Note that
VMware appears to get it right. The behavior can be observed using the
following code:
#include <stdio.h>
#define LOW_MASK (0xffffffff00000000ull)
#define ALL_MASK (0xffffffffffffffffull)
#define TEST(opcode) \
do { \
asm volatile(".byte 0xf2 \n\t .byte 0x67 \n\t .byte " opcode "\n\t" \
: "=S"(s), "=c"(c), "=D"(d) \
: "S"(ALL_MASK), "c"(LOW_MASK), "D"(ALL_MASK)); \
printf("opcode %s rcx=%llx rsi=%llx rdi=%llx\n", \
opcode, c, s, d); \
} while(0)
void main()
{
unsigned long long s, d, c;
iopl(3);
TEST("0x6c");
TEST("0x6d");
TEST("0x6e");
TEST("0x6f");
TEST("0xa4");
TEST("0xa5");
TEST("0xa6");
TEST("0xa7");
TEST("0xaa");
TEST("0xab");
TEST("0xae");
TEST("0xaf");
}
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When REP-string instruction is preceded with an address-size prefix,
ECX/EDI/ESI are used as the operation counter and pointers. When they are
updated, the high 32-bits of RCX/RDI/RSI are cleared, similarly to the way they
are updated on every 32-bit register operation. Fix it.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the host sets hardware breakpoints to debug the guest, and a task-switch
occurs in the guest, the architectural DR7 will not be updated. The effective
DR7 would be updated instead.
This fix puts the DR7 update during task-switch emulation, so it now uses the
standard DR setting mechanism instead of the one that was previously used. As a
bonus, the update of DR7 will now be effective for AMD as well.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
fpstate_init() only uses fpu->state, so pass that in to it.
This enables the cleanup we will do in the next patch.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
fpu_restore_checking() is a helper function of restore_fpu_checking(),
but this is not apparent from the naming.
Both copy fpstate contents to fpregs, while the fuller variant does
a full copy without leaking information.
So rename them to:
copy_fpstate_to_fpregs()
__copy_fpstate_to_fpregs()
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rename this function in line with the new FPU nomenclature.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that all FPU internals using drivers are converted to public APIs,
move xcr.h's definitions into fpu/internal.h and remove xcr.h.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
'xsave' is an x86 instruction name to most people - but xsave.h is
about a lot more than just the XSAVE instruction: it includes
definitions and support, both internal and external, related to
xstate and xfeatures support.
As a first step in cleaning up the various xstate uses rename this
header to 'fpu/xstate.h' to better reflect what this header file
is about.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that fpstate_init_curr() is not doing implicit allocations
anymore, almost all uses of it involve a very simple pattern:
if (!fpu->fpstate_active)
fpstate_init_curr(fpu);
which is basically activating the FPU fpstate if it was not active
before.
So propagate the check into the function itself, and rename the
function according to its new purpose:
fpu__activate_curr(fpu);
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that fpstate_init() cannot fail the error return of fx_init()
has lost its purpose. Eliminate the error return and propagate this
change to all callers.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that there are no FPU context allocations, rename fpstate_alloc_init()
to fpstate_init_curr(), to signal that it initializes the fpstate and
marks it active, for the current task.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remove the failure code and propagate this down to callers.
Note that this function still has an 'init' aspect, which must be
called.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that we always allocate the FPU context as part of task_struct there's
no need for separate allocations - remove them and their primary failure
handling code.
( Note that there's still secondary error codes that have become superfluous,
those will be removed in separate patches. )
Move the somewhat misplaced setup_xstate_comp() call to the core.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So 6 years ago we made the FPU fpstate dynamically allocated:
aa283f4927 ("x86, fpu: lazy allocation of FPU area - v5")
61c4628b53 ("x86, fpu: split FPU state from task struct - v5")
In hindsight this was a mistake:
- it complicated context allocation failure handling, such as:
/* kthread execs. TODO: cleanup this horror. */
if (WARN_ON(fpstate_alloc_init(fpu)))
force_sig(SIGKILL, tsk);
- it caused us to enable irqs in fpu__restore():
local_irq_enable();
/*
* does a slab alloc which can sleep
*/
if (fpstate_alloc_init(fpu)) {
/*
* ran out of memory!
*/
do_group_exit(SIGKILL);
return;
}
local_irq_disable();
- it (slightly) slowed down task creation/destruction by adding
slab allocation/free pattens.
- it made access to context contents (slightly) slower by adding
one more pointer dereference.
The motivation for the dynamic allocation was two-fold:
- reduce memory consumption by non-FPU tasks
- allocate and handle only the necessary amount of context for
various XSAVE processors that have varying hardware frame
sizes.
These days, with glibc using SSE memcpy by default and GCC optimizing
for SSE/AVX by default, the scope of FPU using apps on an x86 system is
much larger than it was 6 years ago.
For example on a freshly installed Fedora 21 desktop system, with a
recent kernel, all non-kthread tasks have used the FPU shortly after
bootup.
Also, even modern embedded x86 CPUs try to support the latest vector
instruction set - so they'll too often use the larger xstate frame
sizes.
So remove the dynamic allocation complication by embedding the FPU
fpstate in task_struct again. This should make the FPU a lot more
accessible to all sorts of atomic contexts.
We could still optimize for the xstate frame size in the future,
by moving the state structure to the last element of task_struct,
and allocating only a part of that.
This change is kept minimal by still keeping the ctx_alloc()/free()
routines (that now do nothing substantial) - we'll remove them in
the following patches.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So fpu_save_init() is a historic name that got its name when the only
way the FPU state was FNSAVE, which cleared (well, destroyed) the FPU
state after saving it.
Nowadays the name is misleading, because ever since the introduction of
FXSAVE (and more modern FPU saving instructions) the 'we need to reload
the FPU state' part is only true if there's a pending FPU exception [*],
which is almost never the case.
So rename it to copy_fpregs_to_fpstate() to make it clear what's
happening. Also add a few comments about why we cannot keep registers
in certain cases.
Also clean up the control flow a bit, to make it more apparent when
we are dropping/keeping FP registers, and to optimize the common
case (of keeping fpregs) some more.
[*] Probably not true anymore, modern instructions always leave the FPU
state intact, even if exceptions are pending: because pending FP
exceptions are posted on the next FP instruction, not asynchronously.
They were truly asynchronous back in the IRQ13 case, and we had to
synchronize with them, but that code is not working anymore: we don't
have IRQ13 mapped in the IDT anymore.
But a cleanup patch is obviously not the place to change subtle behavior.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are a number of FPU internal function prototypes and an inline function
in fpu/api.h, mostly placed so historically as the code grew over the years.
Move them over into fpu/internal.h where they belong. (Add sched.h include
to stackprotector.h which incorrectly relied on getting it from fpu/api.h.)
fpu/api.h is now a pure file that only contains FPU APIs intended for driver
use.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
'xsave.header::xstate_bv' is a misnomer - what does 'bv' stand for?
It probably comes from the 'XGETBV' instruction name, but I could
not find in the Intel documentation where that abbreviation comes
from. It could mean 'bit vector' - or something else?
But how about - instead of guessing about a weird name - we named
the field in an obvious and descriptive way that tells us exactly
what it does?
So rename it to 'xfeatures', which is a bitmask of the
xfeatures that are fpstate_active in that context structure.
Eyesore like:
fpu->state->xsave.xsave_hdr.xstate_bv |= XSTATE_FP;
is now much more readable:
fpu->state->xsave.header.xfeatures |= XSTATE_FP;
Which form is not just infinitely more readable, but is also
shorter as well.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Code like:
fpu->state->xsave.xsave_hdr.xstate_bv |= XSTATE_FP;
is an eyesore, because not only is the words 'xsave' and 'state'
are repeated twice times (!), but also because of the 'hdr' and 'bv'
abbreviations that are pretty meaningless at a first glance.
Start cleaning this up by renaming 'xsave_hdr' to 'header'.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This unifies all the FPU related header files under a unified, hiearchical
naming scheme:
- asm/fpu/types.h: FPU related data types, needed for 'struct task_struct',
widely included in almost all kernel code, and hence kept
as small as possible.
- asm/fpu/api.h: FPU related 'public' methods exported to other subsystems.
- asm/fpu/internal.h: FPU subsystem internal methods
- asm/fpu/xsave.h: XSAVE support internal methods
(Also standardize the header guard in asm/fpu/internal.h.)
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We already have fpu/types.h, move i387.h to fpu/api.h.
The file name has become a misnomer anyway: it offers generic FPU APIs,
but is not limited to i387 functionality.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Migrate this function to pure 'struct fpu' usage.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Introduce a simple fpu->fpstate_active flag in the fpu context data structure
and use that instead of PF_USED_MATH in task->flags.
Testing for this flag byte should be slightly more efficient than
testing a bit in a bitmask, but the main advantage is that most
FPU functions can now be performed on a 'struct fpu' alone, they
don't need access to 'struct task_struct' anymore.
There's a slight linecount increase, mostly due to the 'fpu' local
variables and due to extra comments. The local variables will go away
once we move most of the FPU methods to pure 'struct fpu' parameters.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
PF_USED_MATH is used directly, but also in a handful of helper inlines.
To ease the elimination of PF_USED_MATH, convert all inline helpers
to open-coded PF_USED_MATH usage.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix a minor header file dependency bug in asm/fpu-internal.h: it
relies on i387.h but does not include it. All users of fpu-internal.h
included it explicitly.
Also remove unnecessary includes, to reduce compilation time.
This also makes it easier to use it as a standalone header file
for FPU internals, such as an upcoming C module in arch/x86/kernel/fpu/.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make it clear that we are initializing the in-memory FPU context area,
no the FPU registers.
Also move it to the fpu__*() namespace.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the fpu__*() namespace for fpstate_alloc() as well.
Also add a comment about FPU state alignment.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Most init_fpu() users don't want the register-saving aspect of the
function, they are calling it for 'current' and when FPU registers
are not allocated and initialized yet.
Split out a simplified API that does just that (and add debug-checks
for these conditions): fpstate_alloc_init().
Use it where appropriate.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The term "ftrace" is really the infrastructure of the function hooks,
and not the trace events. Rename ftrace_event.h to trace_events.h to
represent the trace_event infrastructure and decouple the term ftrace
from it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
KVM may turn a user page to a kernel page when kernel writes a readonly
user page if CR0.WP = 1. This shadow page entry will be reused after
SMAP is enabled so that kernel is allowed to access this user page
Fix it by setting SMAP && !CR0.WP into shadow page's role and reset mmu
once CR4.SMAP is updated
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
smep_andnot_wp is initialized in kvm_init_shadow_mmu and shadow pages
should not be reused for different values of it. Thus, it has to be
added to the mask in kvm_mmu_pte_write.
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Current permission check assumes that RSVD bit in PFEC is always zero,
however, it is not true since MMIO #PF will use it to quickly identify
MMIO access
Fix it by clearing the bit if walking guest page table is needed
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
smep_andnot_wp is initialized in kvm_init_shadow_mmu and shadow pages
should not be reused for different values of it. Thus, it has to be
added to the mask in kvm_mmu_pte_write.
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Current permission check assumes that RSVD bit in PFEC is always zero,
however, it is not true since MMIO #PF will use it to quickly identify
MMIO access
Fix it by clearing the bit if walking guest page table is needed
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
vcpu->arch.apic is NULL when a userspace irqchip is active. But instead
of letting the test incorrectly depend on in-kernel irqchip mode,
open-code it to catch also userspace x2APICs.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Far call in 64-bit has a 32-bit operand size. Remove the marking of this
operation as Stack so it can be emulated correctly in 64-bit.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the null test is needed, the call to cancel_delayed_work_sync would have
already crashed. Normally, the destroy function should only be called
if the init function has succeeded, in which case ioapic is not null.
Problem found using Coccinelle.
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
PAT should be 0007_0406_0007_0406h on RESET and not modified on INIT.
VMX used a wrong value (host's PAT) and while SVM used the right one,
it never got to arch.pat.
This is not an issue with QEMU as it will force the correct value.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently KVM will clear the FPU bits in CR0.TS in the VMCS, and trap to
re-load them every time the guest accesses the FPU after a switch back into
the guest from the host.
This patch copies the x86 task switch semantics for FPU loading, with the
FPU loaded eagerly after first use if the system uses eager fpu mode,
or if the guest uses the FPU frequently.
In the latter case, after loading the FPU for 255 times, the fpu_counter
will roll over, and we will revert to loading the FPU on demand, until
it has been established that the guest is still actively using the FPU.
This mirrors the x86 task switch policy, which seems to work.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
An MSI interrupt should only be delivered to the lowest priority CPU
when it has RH=1, regardless of the delivery mode. Modified
kvm_is_dm_lowest_prio() to check for either irq->delivery_mode == APIC_DM_LOWPRI
or irq->msi_redir_hint.
Moved kvm_is_dm_lowest_prio() into lapic.h and renamed to
kvm_lowest_prio_delivery().
Changed a check in kvm_irq_delivery_to_apic_fast() from
irq->delivery_mode == APIC_DM_LOWPRI to kvm_is_dm_lowest_prio().
Signed-off-by: James Sullivan <sullivan.james.f@gmail.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Extended struct kvm_lapic_irq with bool msi_redir_hint, which will
be used to determine if the delivery of the MSI should target only
the lowest priority CPU in the logical group specified for delivery.
(In physical dest mode, the RH bit is not relevant). Initialized the value
of msi_redir_hint to true when RH=1 in kvm_set_msi_irq(), and initialized
to false in all other cases.
Added value of msi_redir_hint to a debug message dump of an IRQ in
apic_send_ipi().
Signed-off-by: James Sullivan <sullivan.james.f@gmail.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Change to u16 if they only contain data in the low 16 bits.
Change the level field to bool, since we assign 1 sometimes, but
just mask icr_low with APIC_INT_ASSERT in apic_send_ipi.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
x86 architecture defines differences between the reset and INIT sequences.
INIT does not initialize the FPU (including MMX, XMM, YMM, etc.), TSC, PMU,
MSRs (in general), MTRRs machine-check, APIC ID, APIC arbitration ID and BSP.
References (from Intel SDM):
"If the MP protocol has completed and a BSP is chosen, subsequent INITs (either
to a specific processor or system wide) do not cause the MP protocol to be
repeated." [8.4.2: MP Initialization Protocol Requirements and Restrictions]
[Table 9-1. IA-32 Processor States Following Power-up, Reset, or INIT]
"If the processor is reset by asserting the INIT# pin, the x87 FPU state is not
changed." [9.2: X87 FPU INITIALIZATION]
"The state of the local APIC following an INIT reset is the same as it is after
a power-up or hardware reset, except that the APIC ID and arbitration ID
registers are not affected." [10.4.7.3: Local APIC State After an INIT Reset
("Wait-for-SIPI" State)]
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Message-Id: <1428924848-28212-1-git-send-email-namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introducing KVM_CAP_DISABLE_QUIRKS for disabling x86 quirks that were previous
created in order to overcome QEMU issues. Those issue were mostly result of
invalid VM BIOS. Currently there are two quirks that can be disabled:
1. KVM_QUIRK_LINT0_REENABLED - LINT0 was enabled after boot
2. KVM_QUIRK_CD_NW_CLEARED - CD and NW are cleared after boot
These two issues are already resolved in recent releases of QEMU, and would
therefore be disabled by QEMU.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Message-Id: <1428879221-29996-1-git-send-email-namit@cs.technion.ac.il>
[Report capability from KVM_CHECK_EXTENSION too. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use __kvm_guest_{enter|exit} instead of kvm_guest_{enter|exit}
where interrupts are disabled.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The kvmclock spec says that the host will increment a version field to
an odd number, then update stuff, then increment it to an even number.
The host is buggy and doesn't do this, and the result is observable
when one vcpu reads another vcpu's kvmclock data.
There's no good way for a guest kernel to keep its vdso from reading
a different vcpu's kvmclock data, but we don't need to care about
changing VCPUs as long as we read a consistent data from kvmclock.
(VCPU can change outside of this loop too, so it doesn't matter if we
return a value not fit for this VCPU.)
Based on a patch by Radim Krčmář.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull fourth vfs update from Al Viro:
"d_inode() annotations from David Howells (sat in for-next since before
the beginning of merge window) + four assorted fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
RCU pathwalk breakage when running into a symlink overmounting something
fix I_DIO_WAKEUP definition
direct-io: only inc/dec inode->i_dio_count for file systems
fs/9p: fix readdir()
VFS: assorted d_backing_inode() annotations
VFS: fs/inode.c helpers: d_inode() annotations
VFS: fs/cachefiles: d_backing_inode() annotations
VFS: fs library helpers: d_inode() annotations
VFS: assorted weird filesystems: d_inode() annotations
VFS: normal filesystems (and lustre): d_inode() annotations
VFS: security/: d_inode() annotations
VFS: security/: d_backing_inode() annotations
VFS: net/: d_inode() annotations
VFS: net/unix: d_backing_inode() annotations
VFS: kernel/: d_inode() annotations
VFS: audit: d_backing_inode() annotations
VFS: Fix up some ->d_inode accesses in the chelsio driver
VFS: Cachefiles should perform fs modifications on the top layer only
VFS: AF_UNIX sockets should call mknod on the top layer only
The host's decision to enable machine check exceptions should remain
in force during non-root mode. KVM was writing 0 to cr4 on VCPU reset
and passed a slightly-modified 0 to the vmcs.guest_cr4 value.
Tested: Built.
On earlier version, tested by injecting machine check
while a guest is spinning.
Before the change, if guest CR4.MCE==0, then the machine check is
escalated to Catastrophic Error (CATERR) and the machine dies.
If guest CR4.MCE==1, then the machine check causes VMEXIT and is
handled normally by host Linux. After the change, injecting a machine
check causes normal Linux machine check handling.
Signed-off-by: Ben Serebrin <serebrin@google.com>
Reviewed-by: Venkatesh Srinivas <venkateshs@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Soft mmu uses direct shadow page to fill guest large mapping with small
pages if huge mapping is disallowed on host. So zapping direct shadow
page works well both for soft mmu and hard mmu, it's just less widely
applicable.
Fix the comment to reflect this.
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Message-Id: <552C91BA.1010703@linux.intel.com>
[Fix comment wording further. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
As Andres pointed out:
| I don't understand the value of this check here. Are we looking for a
| broken memslot? Shouldn't this be a BUG_ON? Is this the place to care
| about these things? npages is capped to KVM_MEM_MAX_NR_PAGES, i.e.
| 2^31. A 64 bit overflow would be caused by a gigantic gfn_start which
| would be trouble in many other ways.
This patch drops the memslot overflow check to make the codes more simple.
Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Message-Id: <1429064694-3072-1-git-send-email-wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sparse is reporting a "we previously assumed 'src' could be null" error.
This is true as far as the static analyzer can see, but in practice only
IPIs can set shorthand to self and they also set 'src', so it's ok.
Still, move the initialization of x2apic_ipi (and thus the NULL check for
src right before the first use.
While at it, initializing ret to "false" is somewhat confusing because of
the almost immediate assigned of "true" to the same variable. Thus,
initialize it to "true" and modify it in the only path that used to use
the value from "bool ret = false". There is no change in generated code
from this change.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_init_msr_list is currently called before hardware_setup. As a result,
vmx_mpx_supported always returns false when kvm_init_msr_list checks whether to
save MSR_IA32_BNDCFGS.
Move kvm_init_msr_list after vmx_hardware_setup is called to fix this issue.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Message-Id: <1428864435-4732-1-git-send-email-namit@cs.technion.ac.il>
Cc: stable@vger.kernel.org # 3.15+
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull timer updates from Ingo Molnar:
"The main changes in this cycle were:
- clockevents state machine cleanups and enhancements (Viresh Kumar)
- clockevents broadcast notifier horror to state machine conversion
and related cleanups (Thomas Gleixner, Rafael J Wysocki)
- clocksource and timekeeping core updates (John Stultz)
- clocksource driver updates and fixes (Ben Dooks, Dmitry Osipenko,
Hans de Goede, Laurent Pinchart, Maxime Ripard, Xunlei Pang)
- y2038 fixes (Xunlei Pang, John Stultz)
- NMI-safe ktime_get_raw_fast() and general refactoring of the clock
code, in preparation to perf's per event clock ID support (Peter
Zijlstra)
- generic sched/clock fixes, optimizations and cleanups (Daniel
Thompson)
- clockevents cpu_down() race fix (Preeti U Murthy)"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (94 commits)
timers/PM: Drop unnecessary braces from tick_freeze()
timers/PM: Fix up tick_unfreeze()
timekeeping: Get rid of stale comment
clockevents: Cleanup dead cpu explicitely
clockevents: Make tick handover explicit
clockevents: Remove broadcast oneshot control leftovers
sched/idle: Use explicit broadcast oneshot control function
ARM: Tegra: Use explicit broadcast oneshot control function
ARM: OMAP: Use explicit broadcast oneshot control function
intel_idle: Use explicit broadcast oneshot control function
ACPI/idle: Use explicit broadcast control function
ACPI/PAD: Use explicit broadcast oneshot control function
x86/amd/idle, clockevents: Use explicit broadcast oneshot control functions
clockevents: Provide explicit broadcast oneshot control functions
clockevents: Remove the broadcast control leftovers
ARM: OMAP: Use explicit broadcast control function
intel_idle: Use explicit broadcast control function
cpuidle: Use explicit broadcast control function
ACPI/processor: Use explicit broadcast control function
ACPI/PAD: Use explicit broadcast control function
...
ARM/ARM64: fixes for live migration, irqfd and ioeventfd support (enabling
vhost, too), page aging
s390: interrupt handling rework, allowing to inject all local interrupts
via new ioctl and to get/set the full local irq state for migration
and introspection. New ioctls to access memory by virtual address,
and to get/set the guest storage keys. SIMD support.
MIPS: FPU and MIPS SIMD Architecture (MSA) support. Includes some patches
from Ralf Baechle's MIPS tree.
x86: bugfixes (notably for pvclock, the others are small) and cleanups.
Another small latency improvement for the TSC deadline timer.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJVJ9vmAAoJEL/70l94x66DoMEH/R3rh8IMf4jTiWRkcqohOMPX
k1+NaSY/lCKayaSgggJ2hcQenMbQoXEOdslvaA/H0oC+VfJGK+lmU6E63eMyyhjQ
Y+Px6L85NENIzDzaVu/TIWWuhil5PvIRr3VO8cvntExRoCjuekTUmNdOgCvN2ObW
wswN2qRdPIeEj2kkulbnye+9IV4G0Ne9bvsmUdOdfSSdi6ZcV43JcvrpOZT++mKj
RrKB+3gTMZYGJXMMLBwMkdl8mK1ozriD+q0mbomT04LUyGlPwYLl4pVRDBqyksD7
KsSSybaK2E4i5R80WEljgDMkNqrCgNfg6VZe4n9Y+CfAAOToNnkMJaFEi+yuqbs=
=yu2b
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"First batch of KVM changes for 4.1
The most interesting bit here is irqfd/ioeventfd support for ARM and
ARM64.
Summary:
ARM/ARM64:
fixes for live migration, irqfd and ioeventfd support (enabling
vhost, too), page aging
s390:
interrupt handling rework, allowing to inject all local interrupts
via new ioctl and to get/set the full local irq state for migration
and introspection. New ioctls to access memory by virtual address,
and to get/set the guest storage keys. SIMD support.
MIPS:
FPU and MIPS SIMD Architecture (MSA) support. Includes some
patches from Ralf Baechle's MIPS tree.
x86:
bugfixes (notably for pvclock, the others are small) and cleanups.
Another small latency improvement for the TSC deadline timer"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (146 commits)
KVM: use slowpath for cross page cached accesses
kvm: mmu: lazy collapse small sptes into large sptes
KVM: x86: Clear CR2 on VCPU reset
KVM: x86: DR0-DR3 are not clear on reset
KVM: x86: BSP in MSR_IA32_APICBASE is writable
KVM: x86: simplify kvm_apic_map
KVM: x86: avoid logical_map when it is invalid
KVM: x86: fix mixed APIC mode broadcast
KVM: x86: use MDA for interrupt matching
kvm/ppc/mpic: drop unused IRQ_testbit
KVM: nVMX: remove unnecessary double caching of MAXPHYADDR
KVM: nVMX: checks for address bits beyond MAXPHYADDR on VM-entry
KVM: x86: cache maxphyaddr CPUID leaf in struct kvm_vcpu
KVM: vmx: pass error code with internal error #2
x86: vdso: fix pvclock races with task migration
KVM: remove kvm_read_hva and kvm_read_hva_atomic
KVM: x86: optimize delivery of TSC deadline timer interrupt
KVM: x86: extract blocking logic from __vcpu_run
kvm: x86: fix x86 eflags fixed bit
KVM: s390: migrate vcpu interrupt state
...
Dirty logging tracks sptes in 4k granularity, meaning that large sptes
have to be split. If live migration is successful, the guest in the
source machine will be destroyed and large sptes will be created in the
destination. However, the guest continues to run in the source machine
(for example if live migration fails), small sptes will remain around
and cause bad performance.
This patch introduce lazy collapsing of small sptes into large sptes.
The rmap will be scanned in ioctl context when dirty logging is stopped,
dropping those sptes which can be collapsed into a single large-page spte.
Later page faults will create the large-page sptes.
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Message-Id: <1428046825-6905-1-git-send-email-wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
CR2 is not cleared as it should after reset. See Intel SDM table named "IA-32
Processor States Following Power-up, Reset, or INIT".
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Message-Id: <1427933438-12782-5-git-send-email-namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
DR0-DR3 are not cleared as they should during reset and when they are set from
userspace. It appears to be caused by c77fb5fe6f ("KVM: x86: Allow the guest
to run with dirty debug registers").
Force their reload on these situations.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Message-Id: <1427933438-12782-4-git-send-email-namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
After reset, the CPU can change the BSP, which will be used upon INIT. Reset
should return the BSP which QEMU asked for, and therefore handled accordingly.
To quote: "If the MP protocol has completed and a BSP is chosen, subsequent
INITs (either to a specific processor or system wide) do not cause the MP
protocol to be repeated."
[Intel SDM 8.4.2: MP Initialization Protocol Requirements and Restrictions]
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Message-Id: <1427933438-12782-3-git-send-email-namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
recalculate_apic_map() uses two passes over all VCPUs. This is a relic
from time when we selected a global mode in the first pass and set up
the optimized table in the second pass (to have a consistent mode).
Recent changes made mixed mode unoptimized and we can do it in one pass.
Format of logical MDA is a function of the mode, so we encode it in
apic_logical_id() and drop obsoleted variables from the struct.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Message-Id: <1423766494-26150-5-git-send-email-rkrcmar@redhat.com>
[Add lid_bits temporary in apic_logical_id. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We want to support mixed modes and the easiest solution is to avoid
optimizing those weird and unlikely scenarios.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Message-Id: <1423766494-26150-4-git-send-email-rkrcmar@redhat.com>
[Add comment above KVM_APIC_MODE_* defines. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Broadcast allowed only one global APIC mode, but mixed modes are
theoretically possible. x2APIC IPI doesn't mean 0xff as broadcast,
the rest does.
x2APIC broadcasts are accepted by xAPIC. If we take SDM to be logical,
even addreses beginning with 0xff should be accepted, but real hardware
disagrees. This patch aims for simple code by considering most of real
behavior as undefined.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Message-Id: <1423766494-26150-3-git-send-email-rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In mixed modes, we musn't deliver xAPIC IPIs like x2APIC and vice versa.
Instead of preserving the information in apic_send_ipi(), we regain it
by converting all destinations into correct MDA in the slow path.
This allows easier reasoning about subsequent matching.
Our kvm_apic_broadcast() had an interesting design decision: it didn't
consider IOxAPIC 0xff as broadcast in x2APIC mode ...
everything worked because IOxAPIC can't set that in physical mode and
logical mode considered it as a message for first 8 VCPUs.
This patch interprets IOxAPIC 0xff as x2APIC broadcast.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Message-Id: <1423766494-26150-2-git-send-email-rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
After speed-up of cpuid_maxphyaddr() it can be called frequently:
instead of heavyweight enumeration of CPUID entries it returns a cached
pre-computed value. It is also inlined now. So caching its result became
unnecessary and can be removed.
Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com>
Message-Id: <20150329205644.GA1258@gnote>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On each VM-entry CPU should check the following VMCS fields for zero bits
beyond physical address width:
- APIC-access address
- virtual-APIC address
- posted-interrupt descriptor address
This patch adds these checks required by Intel SDM.
Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com>
Message-Id: <20150329205627.GA1244@gnote>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
cpuid_maxphyaddr(), which performs lot of memory accesses is called
extensively across KVM, especially in nVMX code.
This patch adds a cached value of maxphyaddr to vcpu.arch to reduce the
pressure onto CPU cache and simplify the code of cpuid_maxphyaddr()
callers. The cached value is initialized in kvm_arch_vcpu_init() and
reloaded every time CPUID is updated by usermode. It is obvious that
these reloads occur infrequently.
Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com>
Message-Id: <20150329205612.GA1223@gnote>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Exposing the on-stack error code with internal error is cheap and
potentially useful.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Message-Id: <1428001865-32280-1-git-send-email-rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The newly-added tracepoint shows the following results on
the tscdeadline_latency test:
qemu-kvm-8387 [002] 6425.558974: kvm_vcpu_wakeup: poll time 10407 ns
qemu-kvm-8387 [002] 6425.558984: kvm_vcpu_wakeup: poll time 0 ns
qemu-kvm-8387 [002] 6425.561242: kvm_vcpu_wakeup: poll time 10477 ns
qemu-kvm-8387 [002] 6425.561251: kvm_vcpu_wakeup: poll time 0 ns
and so on. This is because we need to go through kvm_vcpu_block again
after the timer IRQ is injected. Avoid it by polling once before
entering kvm_vcpu_block.
On my machine (Xeon E5 Sandy Bridge) this removes about 500 cycles (7%)
from the latency of the TSC deadline timer.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename the old __vcpu_run to vcpu_run, and extract part of it to a new
function vcpu_block.
The next patch will add a new condition in vcpu_block, avoid extra
indentation.
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Guest can't be booted w/ ept=0, there is a message dumped as below:
If you're running a guest on an Intel machine without unrestricted mode
support, the failure can be most likely due to the guest entering an invalid
state for Intel VT. For example, the guest maybe running in big real mode
which is not supported on less recent Intel processors.
EAX=00000011 EBX=f000d2f6 ECX=00006cac EDX=000f8956
ESI=bffbdf62 EDI=00000000 EBP=00006c68 ESP=00006c68
EIP=0000d187 EFL=00000004 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =e000 000e0000 ffffffff 00809300 DPL=0 DS16 [-WA]
CS =f000 000f0000 ffffffff 00809b00 DPL=0 CS16 [-RA]
SS =0000 00000000 ffffffff 00809300 DPL=0 DS16 [-WA]
DS =0000 00000000 ffffffff 00809300 DPL=0 DS16 [-WA]
FS =0000 00000000 ffffffff 00809300 DPL=0 DS16 [-WA]
GS =0000 00000000 ffffffff 00809300 DPL=0 DS16 [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT= 000f6a80 00000037
IDT= 000f6abe 00000000
CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=01 1e b8 6a 2e 0f 01 16 74 6a 0f 20 c0 66 83 c8 01 0f 22 c0 <66> ea 8f d1 0f 00 08 00 b8 10 00 00 00 8e d8 8e c0 8e d0 8e e0 8e e8 89 c8 ff e2 89 c1 b8X
X86 eflags bit 1 is fixed set, which means that 1 << 1 is set instead of 1,
this patch fix it.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Message-Id: <1428473294-6633-1-git-send-email-wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the normal return values for bool functions
Signed-off-by: Joe Perches <joe@perches.com>
Message-Id: <9f593eb2f43b456851cd73f7ed09654ca58fb570.1427759009.git.joe@perches.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A trivial code cleanup. This `if` is redundant.
Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com>
Message-Id: <20150328222717.GA6508@gnote>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Some constants are redfined in emulate.c. Avoid it.
s/SELECTOR_RPL_MASK/SEGMENT_RPL_MASK
s/SELECTOR_TI_MASK/SEGMENT_TI_MASK
No functional change.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Message-Id: <1427635984-8113-3-git-send-email-namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The eflags are redefined (using other defines) in emulate.c.
Use the definition from processor-flags.h as some mess already started.
No functional change.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Message-Id: <1427635984-8113-2-git-send-email-namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the source of BSF and BSR is zero, the destination register should not
change. That is how real hardware behaves. If we set the destination even with
the same value that we had before, we may clear bits [63:32] unnecassarily.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Message-Id: <1427719163-5429-4-git-send-email-namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
POPA should assign the values to the registers as usual registers are assigned.
In other words, 32-bits register assignments should clear bits [63:32] of the
register.
Split the code of register assignments that will be used by future changes as
well.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Message-Id: <1427719163-5429-3-git-send-email-namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On legacy mode CMOV emulation should still clear bits [63:32] even if the
assignment is not done. The previous fix 140bad89fd ("KVM: x86: emulation of
dword cmov on long-mode should clear [63:32]") was incomplete.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Message-Id: <1427719163-5429-2-git-send-email-namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If data is read from PIC with invalid access size, the return data stays
uninitialized even though success is returned.
Fix this by always initializing the data.
Signed-off-by: Petr Matousek <pmatouse@redhat.com>
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Message-Id: <20150311111609.GG8544@dhcp-25-225.brq.redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In preparation of adding another tkr field, rename this one to
tkr_mono. Also rename tk_read_base::base_mono to tk_read_base::base,
since the structure is not specific to CLOCK_MONOTONIC and the mono
name got added to the tk_read_base instance.
Lots of trivial churn.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150319093400.344679419@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If the guest CPU is supposed to support rdtscp and the host has rdtscp
enabled in the secondary execution controls, we can also expose this
feature to L1. Just extend nested_vmx_exit_handled to properly route
EXIT_REASON_RDTSCP.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
virt/kvm was never really a good include directory for anything else
than locally included headers.
With the move of iodev.h there is no need anymore to add this
directory the compiler's include path, so remove it from the x86 kvm
Makefile.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
iodev.h contains definitions for the kvm_io_bus framework. This is
needed both by the generic KVM code in virt/kvm as well as by
architecture specific code under arch/. Putting the header file in
virt/kvm and using local includes in the architecture part seems at
least dodgy to me, so let's move the file into include/kvm, so that a
more natural "#include <kvm/iodev.h>" can be used by all of the code.
This also solves a problem later when using struct kvm_io_device
in arm_vgic.h.
Fixing up the FSF address in the GPL header and a wrong include path
on the way.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
This is needed in e.g. ARM vGIC emulation, where the MMIO handling
depends on the VCPU that does the access.
Signed-off-by: Nikolay Nikolaev <n.nikolaev@virtualopensystems.com>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
An overhead from function call is not appropriate for its size and
frequency of execution.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
kvm_ioapic_update_eoi() wasn't called if directed EOI was enabled.
We need to do that for irq notifiers. (Like with edge interrupts.)
Fix it by skipping EOI broadcast only.
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=82211
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Tested-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
I hit this path on a AMD box and thought
someone was playing a April Fool's joke on me.
Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This patch fix the following sparse warnings:
for arch/x86/kvm/x86.c:
warning: symbol 'emulator_read_write' was not declared. Should it be static?
warning: symbol 'emulator_write_emulated' was not declared. Should it be static?
warning: symbol 'emulator_get_dr' was not declared. Should it be static?
warning: symbol 'emulator_set_dr' was not declared. Should it be static?
for arch/x86/kvm/pmu.c:
warning: symbol 'fixed_pmc_events' was not declared. Should it be static?
Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This patch fix the following sparse warning:
for file arch/x86/kvm/x86.c:
warning: Using plain integer as NULL pointer
Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
If EPT was enabled, unrestricted_guest was allowed in L1 regardless of
L0. L1 triple faulted when running L2 guest that required emulation.
Another side effect was 'WARN_ON_ONCE(vmx->nested.nested_run_pending)'
in L0's dmesg:
WARNING: CPU: 0 PID: 0 at arch/x86/kvm/vmx.c:9190 nested_vmx_vmexit+0x96e/0xb00 [kvm_intel] ()
Prevent this scenario by masking SECONDARY_EXEC_UNRESTRICTED_GUEST when
the host doesn't have it enabled.
Fixes: 78051e3b7e ("KVM: nVMX: Disable unrestricted mode if ept=0")
Cc: stable@vger.kernel.org
Tested-By: Kashyap Chamarthy <kchamart@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
While in L2, leave all #UD to L2 and do not try to emulate it. If L1 is
interested in doing this, it reports its interest via the exception
bitmap, and we never get into handle_exception of L0 anyway.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
For a very long time (since 2b3d2a20), the path handling a vmmcall
instruction of the guest on an Intel host only applied the patch but no
longer handled the hypercall. The reverse case, vmcall on AMD hosts, is
fine. As both em_vmcall and em_vmmcall actually have to do the same, we
can fix the issue by consolidating both into the same handler.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Another patch in my war on emulate_on_interception() use as a svm exit handler.
These were pulled out of a larger patch at the suggestion of Radim Krcmar, see
https://lkml.org/lkml/2015/2/25/559
Changes since v1:
* fixed typo introduced after test, retested
Signed-off-by: David Kaplan <david.kaplan@amd.com>
[separated out just cr_interception part from larger removal of
INTERCEPT_CR0_WRITE, forward ported, tested]
Signed-off-by: Joel Schopp <joel.schopp@amd.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
In commit 3af18d9c5f ("KVM: nVMX: Prepare for using hardware MSR bitmap"),
we are setting MSR_BITMAP in prepare_vmcs02 if we should use hardware. This
is not enough since the field will be modified by following vmx_set_efer.
Fix this by setting vmx_msr_bitmap_nested in vmx_set_msr_bitmap if vcpu is
in guest mode.
Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
If data is read from PIC with invalid access size, the return data stays
uninitialized even though success is returned.
Fix this by always initializing the data.
Signed-off-by: Petr Matousek <pmatouse@redhat.com>
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
POWER supports irqfds but forgot to advertise them. Some userspace does
not check for the capability, but others check it---thus they work on
x86 and s390 but not POWER.
To avoid that other architectures in the future make the same mistake, let
common code handle KVM_CAP_IRQFD the same way as KVM_CAP_IRQFD_RESAMPLE.
Reported-and-tested-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Fixes: 297e21053a
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
No need to re-decode WBINVD since we know what it is from the intercept.
Signed-off-by: David Kaplan <David.Kaplan@amd.com>
[extracted from larger unlrelated patch, forward ported, tested,style cleanup]
Signed-off-by: Joel Schopp <joel.schopp@amd.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Currently kvm_emulate() skips the instruction but kvm_emulate_* sometimes
don't. The end reult is the caller ends up doing the skip themselves.
Let's make them consistant.
Signed-off-by: Joel Schopp <joel.schopp@amd.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
kvm_kvfree() provides exactly the same functionality as the
new common kvfree() function - so let's simply replace the
kvm function with the common function.
Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This patch fixes the bug discussed in
https://www.mail-archive.com/kvm@vger.kernel.org/msg109813.html
This patch uses a new field named irr_delivered to record the
delivery status of edge-triggered interrupts, and clears the
delivered interrupts in kvm_get_ioapic. So it has the same effect
of commit 0bc830b05c
("KVM: ioapic: clear IRR for edge-triggered interrupts at delivery")
while avoids the bug of Windows guests.
Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
KVM has nice wrappers to access the register values, clean up a few places
that should use them but currently do not.
Signed-off-by: David Kaplan <david.kaplan@amd.com>
[forward port and testing]
Signed-off-by: Joel Schopp <joel.schopp@amd.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
In commit b4eef9b36d, we started to use hwapic_isr_update() != NULL
instead of kvm_apic_vid_enabled(vcpu->kvm). This didn't work because
SVM had it defined and "apicv" path in apic_{set,clear}_isr() does not
change apic->isr_count, because it should always be 1. The initial
value of apic->isr_count was based on kvm_apic_vid_enabled(vcpu->kvm),
which is always 0 for SVM, so KVM could have injected interrupts when it
shouldn't.
Fix it by implicitly setting SVM's hwapic_isr_update to NULL and make the
initial isr_count depend on hwapic_isr_update() for good measure.
Fixes: b4eef9b36d ("kvm: x86: vmx: NULL out hwapic_isr_update() in case of !enable_apicv")
Reported-and-tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This has been broken for a long time: it broke first in 2.6.35, then was
almost fixed in 2.6.36 but this one-liner slipped through the cracks.
The bug shows up as an infinite loop in Windows 7 (and newer) boot on
32-bit hosts without EPT.
Windows uses CMPXCHG8B to write to page tables, which causes a
page fault if running without EPT; the emulator is then called from
kvm_mmu_page_fault. The loop then happens if the higher 4 bytes are
not 0; the common case for this is that the NX bit (bit 63) is 1.
Fixes: 6550e1f165
Fixes: 16518d5ada
Cc: stable@vger.kernel.org # 2.6.35+
Reported-by: Erik Rull <erik.rull@rdsoftware.de>
Tested-by: Erik Rull <erik.rull@rdsoftware.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
'apic' is not defined if !CONFIG_X86_64 && !CONFIG_X86_LOCAL_APIC.
Posted interrupt makes no sense without CONFIG_SMP, and
CONFIG_X86_LOCAL_APIC will be set with it.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull x86 perf updates from Ingo Molnar:
"This series tightens up RDPMC permissions: currently even highly
sandboxed x86 execution environments (such as seccomp) have permission
to execute RDPMC, which may leak various perf events / PMU state such
as timing information and other CPU execution details.
This 'all is allowed' RDPMC mode is still preserved as the
(non-default) /sys/devices/cpu/rdpmc=2 setting. The new default is
that RDPMC access is only allowed if a perf event is mmap-ed (which is
needed to correctly interpret RDPMC counter values in any case).
As a side effect of these changes CR4 handling is cleaned up in the
x86 code and a shadow copy of the CR4 value is added.
The extra CR4 manipulation adds ~ <50ns to the context switch cost
between rdpmc-capable and rdpmc-non-capable mms"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Add /sys/devices/cpu/rdpmc=2 to allow rdpmc for all tasks
perf/x86: Only allow rdpmc if a perf_event is mapped
perf: Pass the event to arch_perf_update_userpage()
perf: Add pmu callbacks to track event mapping and unmapping
x86: Add a comment clarifying LDT context switching
x86: Store a per-cpu shadow copy of CR4
x86: Clean up cr4 manipulation
Common: Optional support for adding a small amount of polling on each HLT
instruction executed in the guest (or equivalent for other architectures).
This can improve latency up to 50% on some scenarios (e.g. O_DSYNC writes
or TCP_RR netperf tests). This also has to be enabled manually for now,
but the plan is to auto-tune this in the future.
ARM/ARM64: the highlights are support for GICv3 emulation and dirty page
tracking
s390: several optimizations and bugfixes. Also a first: a feature
exposed by KVM (UUID and long guest name in /proc/sysinfo) before
it is available in IBM's hypervisor! :)
MIPS: Bugfixes.
x86: Support for PML (page modification logging, a new feature in
Broadwell Xeons that speeds up dirty page tracking), nested virtualization
improvements (nested APICv---a nice optimization), usual round of emulation
fixes. There is also a new option to reduce latency of the TSC deadline
timer in the guest; this needs to be tuned manually.
Some commits are common between this pull and Catalin's; I see you
have already included his tree.
ARM has other conflicts where functions are added in the same place
by 3.19-rc and 3.20 patches. These are not large though, and entirely
within KVM.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJU28rkAAoJEL/70l94x66DXqQH/1TDOfJIjW7P2kb0Sw7Fy1wi
cEX1KO/VFxAqc8R0E/0Wb55CXyPjQJM6xBXuFr5cUDaIjQ8ULSktL4pEwXyyv/s5
DBDkN65mriry2w5VuEaRLVcuX9Wy+tqLQXWNkEySfyb4uhZChWWHvKEcgw5SqCyg
NlpeHurYESIoNyov3jWqvBjr4OmaQENyv7t2c6q5ErIgG02V+iCux5QGbphM2IC9
LFtPKxoqhfeB2xFxTOIt8HJiXrZNwflsTejIlCl/NSEiDVLLxxHCxK2tWK/tUXMn
JfLD9ytXBWtNMwInvtFm4fPmDouv2VDyR0xnK2db+/axsJZnbxqjGu1um4Dqbak=
=7gdx
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM update from Paolo Bonzini:
"Fairly small update, but there are some interesting new features.
Common:
Optional support for adding a small amount of polling on each HLT
instruction executed in the guest (or equivalent for other
architectures). This can improve latency up to 50% on some
scenarios (e.g. O_DSYNC writes or TCP_RR netperf tests). This
also has to be enabled manually for now, but the plan is to
auto-tune this in the future.
ARM/ARM64:
The highlights are support for GICv3 emulation and dirty page
tracking
s390:
Several optimizations and bugfixes. Also a first: a feature
exposed by KVM (UUID and long guest name in /proc/sysinfo) before
it is available in IBM's hypervisor! :)
MIPS:
Bugfixes.
x86:
Support for PML (page modification logging, a new feature in
Broadwell Xeons that speeds up dirty page tracking), nested
virtualization improvements (nested APICv---a nice optimization),
usual round of emulation fixes.
There is also a new option to reduce latency of the TSC deadline
timer in the guest; this needs to be tuned manually.
Some commits are common between this pull and Catalin's; I see you
have already included his tree.
Powerpc:
Nothing yet.
The KVM/PPC changes will come in through the PPC maintainers,
because I haven't received them yet and I might end up being
offline for some part of next week"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (130 commits)
KVM: ia64: drop kvm.h from installed user headers
KVM: x86: fix build with !CONFIG_SMP
KVM: x86: emulate: correct page fault error code for NoWrite instructions
KVM: Disable compat ioctl for s390
KVM: s390: add cpu model support
KVM: s390: use facilities and cpu_id per KVM
KVM: s390/CPACF: Choose crypto control block format
s390/kernel: Update /proc/sysinfo file with Extended Name and UUID
KVM: s390: reenable LPP facility
KVM: s390: floating irqs: fix user triggerable endless loop
kvm: add halt_poll_ns module parameter
kvm: remove KVM_MMIO_SIZE
KVM: MIPS: Don't leak FPU/DSP to guest
KVM: MIPS: Disable HTW while in guest
KVM: nVMX: Enable nested posted interrupt processing
KVM: nVMX: Enable nested virtual interrupt delivery
KVM: nVMX: Enable nested apic register virtualization
KVM: nVMX: Make nested control MSRs per-cpu
KVM: nVMX: Enable nested virtualize x2apic mode
KVM: nVMX: Prepare for using hardware MSR bitmap
...
<asm/apic.h> isn't included directly and without CONFIG_SMP, an option
that automagically pulls it can't be enabled.
Reported-by: Jim Davis <jim.epost@gmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull RCU updates from Ingo Molnar:
"The main RCU changes in this cycle are:
- Documentation updates.
- Miscellaneous fixes.
- Preemptible-RCU fixes, including fixing an old bug in the
interaction of RCU priority boosting and CPU hotplug.
- SRCU updates.
- RCU CPU stall-warning updates.
- RCU torture-test updates"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
rcu: Initialize tiny RCU stall-warning timeouts at boot
rcu: Fix RCU CPU stall detection in tiny implementation
rcu: Add GP-kthread-starvation checks to CPU stall warnings
rcu: Make cond_resched_rcu_qs() apply to normal RCU flavors
rcu: Optionally run grace-period kthreads at real-time priority
ksoftirqd: Use new cond_resched_rcu_qs() function
ksoftirqd: Enable IRQs and call cond_resched() before poking RCU
rcutorture: Add more diagnostics in rcu_barrier() test failure case
torture: Flag console.log file to prevent holdovers from earlier runs
torture: Add "-enable-kvm -soundhw pcspk" to qemu command line
rcutorture: Handle different mpstat versions
rcutorture: Check from beginning to end of grace period
rcu: Remove redundant rcu_batches_completed() declaration
rcutorture: Drop rcu_torture_completed() and friends
rcu: Provide rcu_batches_completed_sched() for TINY_RCU
rcutorture: Use unsigned for Reader Batch computations
rcutorture: Make build-output parsing correctly flag RCU's warnings
rcu: Make _batches_completed() functions return unsigned long
rcutorture: Issue warnings on close calls due to Reader Batch blows
documentation: Fix smp typo in memory-barriers.txt
...