Commit Graph

2189 Commits

Author SHA1 Message Date
Orit Wasserman
b246dd5df1 KVM: VMX: Fix KVM_SET_SREGS with big real mode segments
For example migration between Westmere and Nehelem hosts, caught in big real mode.

The code that fixes the segments for real mode guest was moved from enter_rmode
to vmx_set_segments. enter_rmode calls vmx_set_segments for each segment.

Signed-off-by: Orit Wasserman <owasserm@rehdat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-05 17:51:46 +03:00
Gleb Natapov
1952639665 KVM: MMU: do not iterate over all VMs in mmu_shrink()
mmu_shrink() needlessly iterates over all VMs even though it will not
attempt to free mmu pages from more than one on them. Fix that and also
check used mmu pages count outside of VM lock to skip inactive VMs faster.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-05 17:46:43 +03:00
Xudong Hao
3f6d8c8a47 KVM: VMX: Use EPT Access bit in response to memory notifiers
Signed-off-by: Haitao Shan <haitao.shan@intel.com>
Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-05 16:31:05 +03:00
Xudong Hao
b38f993478 KVM: VMX: Enable EPT A/D bits if supported by turning on relevant bit in EPTP
In EPT page structure entry, Enable EPT A/D bits if processor supported.

Signed-off-by: Haitao Shan <haitao.shan@intel.com>
Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-05 16:31:04 +03:00
Xudong Hao
83c3a33122 KVM: VMX: Add parameter to control A/D bits support, default is on
Add kernel parameter to control A/D bits support, it's on by default.

Signed-off-by: Haitao Shan <haitao.shan@intel.com>
Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-05 16:31:03 +03:00
Takuya Yoshikawa
c1a7b32a14 KVM: Avoid wasting pages for small lpage_info arrays
lpage_info is created for each large level even when the memory slot is
not for RAM.  This means that when we add one slot for a PCI device, we
end up allocating at least KVM_NR_PAGE_SIZES - 1 pages by vmalloc().

To make things worse, there is an increasing number of devices which
would result in more pages being wasted this way.

This patch mitigates this problem by using kvm_kvzalloc().

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-05 16:29:49 +03:00
Xiao Guangrong
c358666783 KVM: MMU: fix huge page adapted on non-PAE host
The huge page size is 4M on non-PAE host, but 2M page size is used in
transparent_hugepage_adjust(), so the page we get after adjust the
mapping level is not the head page, the BUG_ON() will be triggered

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-05-28 17:41:15 +03:00
Linus Torvalds
07acfc2a93 Merge branch 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM changes from Avi Kivity:
 "Changes include additional instruction emulation, page-crossing MMIO,
  faster dirty logging, preventing the watchdog from killing a stopped
  guest, module autoload, a new MSI ABI, and some minor optimizations
  and fixes.  Outside x86 we have a small s390 and a very large ppc
  update.

  Regarding the new (for kvm) rebaseless workflow, some of the patches
  that were merged before we switch trees had to be rebased, while
  others are true pulls.  In either case the signoffs should be correct
  now."

Fix up trivial conflicts in Documentation/feature-removal-schedule.txt
arch/powerpc/kvm/book3s_segment.S and arch/x86/include/asm/kvm_para.h.

I suspect the kvm_para.h resolution ends up doing the "do I have cpuid"
check effectively twice (it was done differently in two different
commits), but better safe than sorry ;)

* 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (125 commits)
  KVM: make asm-generic/kvm_para.h have an ifdef __KERNEL__ block
  KVM: s390: onereg for timer related registers
  KVM: s390: epoch difference and TOD programmable field
  KVM: s390: KVM_GET/SET_ONEREG for s390
  KVM: s390: add capability indicating COW support
  KVM: Fix mmu_reload() clash with nested vmx event injection
  KVM: MMU: Don't use RCU for lockless shadow walking
  KVM: VMX: Optimize %ds, %es reload
  KVM: VMX: Fix %ds/%es clobber
  KVM: x86 emulator: convert bsf/bsr instructions to emulate_2op_SrcV_nobyte()
  KVM: VMX: unlike vmcs on fail path
  KVM: PPC: Emulator: clean up SPR reads and writes
  KVM: PPC: Emulator: clean up instruction parsing
  kvm/powerpc: Add new ioctl to retreive server MMU infos
  kvm/book3s: Make kernel emulated H_PUT_TCE available for "PR" KVM
  KVM: PPC: bookehv: Fix r8/r13 storing in level exception handler
  KVM: PPC: Book3S: Enable IRQs during exit handling
  KVM: PPC: Fix PR KVM on POWER7 bare metal
  KVM: PPC: Fix stbux emulation
  KVM: PPC: bookehv: Use lwz/stw instead of PPC_LL/PPC_STL for 32-bit fields
  ...
2012-05-24 16:17:30 -07:00
Avi Kivity
d8368af8b4 KVM: Fix mmu_reload() clash with nested vmx event injection
Currently the inject_pending_event() call during guest entry happens after
kvm_mmu_reload().  This is for historical reasons - we used to
inject_pending_event() in atomic context, while kvm_mmu_reload() needs task
context.

A problem is that nested vmx can cause the mmu context to be reset, if event
injection is intercepted and causes a #VMEXIT instead (the #VMEXIT resets
CR0/CR3/CR4).  If this happens, we end up with invalid root_hpa, and since
kvm_mmu_reload() has already run, no one will fix it and we end up entering
the guest this way.

Fix by reordering event injection to be before kvm_mmu_reload().  Use
->cancel_injection() to undo if kvm_mmu_reload() fails.

https://bugzilla.kernel.org/show_bug.cgi?id=42980

Reported-by: Luke-Jr <luke-jr+linuxbugs@utopios.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-05-16 18:09:26 -03:00
Avi Kivity
c142786c62 KVM: MMU: Don't use RCU for lockless shadow walking
Using RCU for lockless shadow walking can increase the amount of memory
in use by the system, since RCU grace periods are unpredictable.  We also
have an unconditional write to a shared variable (reader_counter), which
isn't good for scaling.

Replace that with a scheme similar to x86's get_user_pages_fast(): disable
interrupts during lockless shadow walk to force the freer
(kvm_mmu_commit_zap_page()) to wait for the TLB flush IPI to find the
processor with interrupts enabled.

We also add a new vcpu->mode, READING_SHADOW_PAGE_TABLES, to prevent
kvm_flush_remote_tlbs() from avoiding the IPI.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-05-16 16:08:28 -03:00
Avi Kivity
b2da15ac26 KVM: VMX: Optimize %ds, %es reload
On x86_64, we can defer %ds and %es reload to the heavyweight context switch,
since nothing in the lightweight paths uses the host %ds or %es (they are
ignored by the processor).  Furthermore we can avoid the load if the segments
are null, by letting the hardware load the null segments for us.  This is the
expected case.

On i386, we could avoid the reload entirely, since the entry.S paths take care
of reload, except for the SYSEXIT path which leaves %ds and %es set to __USER_DS.
So we set them to the same values as well.

Saves about 70 cycles out of 1600 (around 4%; noisy measurements).

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-05-16 16:03:19 -03:00
Avi Kivity
512d5649e8 KVM: VMX: Fix %ds/%es clobber
The vmx exit code unconditionally restores %ds and %es to __USER_DS.  This
can override the user's values, since %ds and %es are not saved and restored
in x86_64 syscalls.  In practice, this isn't dangerous since nobody uses
segment registers in long mode, least of all programs that use KVM.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-05-16 16:03:19 -03:00
Joerg Roedel
d54e4237bc KVM: x86 emulator: convert bsf/bsr instructions to emulate_2op_SrcV_nobyte()
The instruction emulation for bsrw is broken in KVM because
the code always uses bsr with 32 or 64 bit operand size for
emulation. Fix that by using emulate_2op_SrcV_nobyte() macro
to use guest operand size for emulation.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-05-14 11:32:38 +03:00
Xiao Guangrong
5f3fbc342f KVM: VMX: unlike vmcs on fail path
fix:

[ 1529.577273] Call Trace:
[ 1529.577289]  [<ffffffffa060d58f>] kvm_arch_hardware_disable+0x13/0x30 [kvm]
[ 1529.577302]  [<ffffffffa05fa2d4>] hardware_disable_nolock+0x35/0x39 [kvm]
[ 1529.577311]  [<ffffffffa05fa29f>] ? cpumask_clear_cpu.constprop.31+0x13/0x13 [kvm]
[ 1529.577315]  [<ffffffff81096ba8>] on_each_cpu+0x44/0x84
[ 1529.577326]  [<ffffffffa05f98b5>] hardware_disable_all_nolock+0x34/0x36 [kvm]
[ 1529.577335]  [<ffffffffa05f98e2>] hardware_disable_all+0x2b/0x39 [kvm]
[ 1529.577349]  [<ffffffffa05fafe5>] kvm_put_kvm+0xed/0x10f [kvm]
[ 1529.577358]  [<ffffffffa05fb3d7>] kvm_vm_release+0x22/0x28 [kvm]

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-05-14 11:28:02 +03:00
Takuya Yoshikawa
9f4260e73a KVM: x86 emulator: Avoid pushing back ModRM byte fetched for group decoding
Although ModRM byte is fetched for group decoding, it is soon pushed
back to make decode_modrm() fetch it later again.

Now that ModRM flag can be found in the top level opcode tables, fetch
ModRM byte before group decoding to make the code simpler.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-05-06 16:15:58 +03:00
Takuya Yoshikawa
1c2545be05 KVM: x86 emulator: Move ModRM flags for groups to top level opcode tables
Needed for the following patch which simplifies ModRM fetching code.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-05-06 16:15:57 +03:00
Michael S. Tsirkin
57c22e5f35 KVM: fix cpuid eax for KVM leaf
cpuid eax should return the max leaf so that
guests can find out the valid range.
This matches Xen et al.
Update documentation to match.

Tested with -cpu host.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-05-06 15:51:56 +03:00
Gleb Natapov
a4fa163531 KVM: ensure async PF event wakes up vcpu from halt
If vcpu executes hlt instruction while async PF is waiting to be delivered
vcpu can block and deliver async PF only after another even wakes it
up. This happens because kvm_check_async_pf_completion() will remove
completion event from vcpu->async_pf.done before entering kvm_vcpu_block()
and this will make kvm_arch_vcpu_runnable() return false. The solution
is to make vcpu runnable when processing completion.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-05-06 14:56:54 +03:00
Jan Kiszka
b6ddf05ff6 KVM: x86: Run PIT work in own kthread
We can't run PIT IRQ injection work in the interrupt context of the host
timer. This would allow the user to influence the handler complexity by
asking for a broadcast to a large number of VCPUs. Therefore, this work
was pushed into workqueue context in 9d244caf2e. However, this prevents
prioritizing the PIT injection over other task as workqueues share
kernel threads.

This replaces the workqueue with a kthread worker and gives that thread
a name in the format "kvm-pit/<owner-process-pid>". That allows to
identify and adjust the kthread priority according to the VM process
parameters.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-04-27 19:40:29 -03:00
Avi Kivity
38e8a2ddc9 KVM: x86 emulator: fix asm constraint in flush_pending_x87_faults
'bool' wants 8-bit registers.

Reported-by: Takuya Yoshikawa <takuya.yoshikawa@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-24 17:16:17 +03:00
Gleb Natapov
4138377142 KVM: Introduce bitmask for apic attention reasons
The patch introduces a bitmap that will hold reasons apic should be
checked during vmexit. This is in a preparation for vp eoi patch
that will add one more check on vmexit. With the bitmap we can do
if(apic_attention) to check everything simultaneously which will
add zero overhead on the fast path.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-24 16:36:18 +03:00
Jan Kiszka
07975ad3b3 KVM: Introduce direct MSI message injection for in-kernel irqchips
Currently, MSI messages can only be injected to in-kernel irqchips by
defining a corresponding IRQ route for each message. This is not only
unhandy if the MSI messages are generated "on the fly" by user space,
IRQ routes are a limited resource that user space has to manage
carefully.

By providing a direct injection path, we can both avoid using up limited
resources and simplify the necessary steps for user land.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-24 15:59:47 +03:00
Al Viro
bfce281c28 kill mm argument of vm_munmap()
it's always current->mm

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-04-21 01:58:20 -04:00
Linus Torvalds
6be5ceb02e VM: add "vm_mmap()" helper function
This continues the theme started with vm_brk() and vm_munmap():
vm_mmap() does the same thing as do_mmap(), but additionally does the
required VM locking.

This uninlines (and rewrites it to be clearer) do_mmap(), which sadly
duplicates it in mm/mmap.c and mm/nommu.c.  But that way we don't have
to export our internal do_mmap_pgoff() function.

Some day we hopefully don't have to export do_mmap() either, if all
modular users can become the simpler vm_mmap() instead.  We're actually
very close to that already, with the notable exception of the (broken)
use in i810, and a couple of stragglers in binfmt_elf.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-20 17:29:13 -07:00
Linus Torvalds
a46ef99d80 VM: add "vm_munmap()" helper function
Like the vm_brk() function, this is the same as "do_munmap()", except it
does the VM locking for the caller.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-20 17:29:13 -07:00
Avi Kivity
f78146b0f9 KVM: Fix page-crossing MMIO
MMIO that are split across a page boundary are currently broken - the
code does not expect to be aborted by the exit to userspace for the
first MMIO fragment.

This patch fixes the problem by generalizing the current code for handling
16-byte MMIOs to handle a number of "fragments", and changes the MMIO
code to create those fragments.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-04-19 20:35:07 -03:00
Marcelo Tosatti
eac0556750 Merge branch 'linus' into queue
Merge reason: development work has dependency on kvm patches merged
upstream.

Conflicts:
	Documentation/feature-removal-schedule.txt

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-04-19 17:06:26 -03:00
Avi Kivity
2225fd5604 KVM: VMX: Fix kvm_set_shared_msr() called in preemptible context
kvm_set_shared_msr() may not be called in preemptible context,
but vmx_set_msr() does so:

  BUG: using smp_processor_id() in preemptible [00000000] code: qemu-kvm/22713
  caller is kvm_set_shared_msr+0x32/0xa0 [kvm]
  Pid: 22713, comm: qemu-kvm Not tainted 3.4.0-rc3+ #39
  Call Trace:
   [<ffffffff8131fa82>] debug_smp_processor_id+0xe2/0x100
   [<ffffffffa0328ae2>] kvm_set_shared_msr+0x32/0xa0 [kvm]
   [<ffffffffa03a103b>] vmx_set_msr+0x28b/0x2d0 [kvm_intel]
   ...

Making kvm_set_shared_msr() work in preemptible is cleaner, but
it's used in the fast path.  Making two variants is overkill, so
this patch just disables preemption around the call.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-04-18 23:42:27 -03:00
Davidlohr Bueso
f71fa31f9f KVM: MMU: use page table level macro
Its much cleaner to use PT_PAGE_TABLE_LEVEL than its numeric value.

Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-04-18 23:35:01 -03:00
Michael S. Tsirkin
a0c9a822bf KVM: dont clear TMR on EOI
Intel spec says that TMR needs to be set/cleared
when IRR is set, but kvm also clears it on  EOI.

I did some tests on a real (AMD based) system,
and I see same TMR values both before
and after EOI, so I think it's a minor bug in kvm.

This patch fixes TMR to be set/cleared on IRR set
only as per spec.

And now that we don't clear TMR, we can save
an atomic read of TMR on EOI that's not propagated
to ioapic, by checking whether ioapic needs
a specific vector first and calculating
the mode afterwards.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-04-16 20:36:38 -03:00
Avi Kivity
e59717550e KVM: x86 emulator: implement MMX MOVQ (opcodes 0f 6f, 0f 7f)
Needed by some framebuffer drivers.  See

https://bugzilla.kernel.org/show_bug.cgi?id=42779

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-04-16 20:36:16 -03:00
Avi Kivity
cbe2c9d30a KVM: x86 emulator: MMX support
General support for the MMX instruction set.  Special care is taken
to trap pending x87 exceptions so that they are properly reflected
to the guest.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-04-16 20:36:16 -03:00
Avi Kivity
3e114eb4db KVM: x86 emulator: implement movntps
Used to write to framebuffers (by at least Icaros).

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-04-16 20:36:16 -03:00
Stefan Hajnoczi
49597d8116 KVM: x86: emulate movdqa
An Ubuntu 9.10 Karmic Koala guest is unable to boot or install due to
missing movdqa emulation:

kvm_exit: reason EXCEPTION_NMI rip 0x7fef3e025a7b info 7fef3e799000 80000b0e
kvm_page_fault: address 7fef3e799000 error_code f
kvm_emulate_insn: 0:7fef3e025a7b: 66 0f 7f 07 (prot64)

movdqa %xmm0,(%rdi)

[avi: mark it explicitly aligned]

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-04-16 20:36:15 -03:00
Avi Kivity
1c11b37669 KVM: x86 emulator: add support for vector alignment
x86 defines three classes of vector instructions: explicitly
aligned (#GP(0) if unaligned, explicitly unaligned, and default
(which depends on the encoding: AVX is unaligned, SSE is aligned).

Add support for marking an instruction as explicitly aligned or
unaligned, and mark MOVDQU as unaligned.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-04-16 20:36:15 -03:00
Josh Triplett
ae75954457 KVM: SVM: Auto-load on CPUs with SVM
Enable x86 feature-based autoloading for the kvm-amd module on CPUs
with X86_FEATURE_SVM.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-04-16 20:35:04 -03:00
Gleb Natapov
f19a0c2c2e KVM: PMU emulation: GLOBAL_CTRL MSR should be enabled on reset
On reset all MPU counters should be enabled in GLOBAL_CTRL MSR.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-10 15:34:10 +03:00
Takuya Yoshikawa
1e3f42f03c KVM: MMU: Improve iteration through sptes from rmap
Iteration using rmap_next(), the actual body is pte_list_next(), is
inefficient: every time we call it we start from checking whether rmap
holds a single spte or points to a descriptor which links more sptes.

In the case of shadow paging, this quadratic total iteration cost is a
problem.  Even for two dimensional paging, with EPT/NPT on, in which we
almost always have a single mapping, the extra checks at the end of the
iteration should be eliminated.

This patch fixes this by introducing rmap_iterator which keeps the
iteration context for the next search.  Furthermore the implementation
of rmap_next() is splitted into two functions, rmap_get_first() and
rmap_get_next(), to avoid repeatedly checking whether the rmap being
iterated on has only one spte.

Although there seemed to be only a slight change for EPT/NPT, the actual
improvement was significant: we observed that GET_DIRTY_LOG for 1GB
dirty memory became 15% faster than before.  This is probably because
the new code is easy to make branch predictions.

Note: we just remove pte_list_next() because we can think of parent_ptes
as a reverse mapping.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-08 16:08:27 +03:00
Takuya Yoshikawa
220f773a00 KVM: MMU: Make pte_list_desc fit cache lines well
We have PTE_LIST_EXT + 1 pointers in this structure and these 40/20
bytes do not fit cache lines well.  Furthermore, some allocators may
use 64/32-byte objects for the pte_list_desc cache.

This patch solves this problem by changing PTE_LIST_EXT from 4 to 3.

For shadow paging, the new size is still large enough to hold both the
kernel and process mappings for usual anonymous pages.  For file
mappings, there may be a slight change in the cache usage.

Note: with EPT/NPT we almost always have a single spte in each reverse
mapping and we will not see any change by this.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-08 16:08:25 +03:00
Davidlohr Bueso
c36fc04ef5 KVM: x86: add paging gcc optimization
Since most guests will have paging enabled for memory management, add likely() optimization
around CR0.PG checks.

Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-08 14:03:13 +03:00
Josh Triplett
e9bda3b3d0 KVM: VMX: Auto-load on CPUs with VMX
Enable x86 feature-based autoloading for the kvm-intel module on CPUs
with X86_FEATURE_VMX.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Acked-By: Kay Sievers <kay@vrfy.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-08 14:03:12 +03:00
Takuya Yoshikawa
60c34612b7 KVM: Switch to srcu-less get_dirty_log()
We have seen some problems of the current implementation of
get_dirty_log() which uses synchronize_srcu_expedited() for updating
dirty bitmaps; e.g. it is noticeable that this sometimes gives us ms
order of latency when we use VGA displays.

Furthermore the recent discussion on the following thread
    "srcu: Implement call_srcu()"
    http://lkml.org/lkml/2012/1/31/211
also motivated us to implement get_dirty_log() without SRCU.

This patch achieves this goal without sacrificing the performance of
both VGA and live migration: in practice the new code is much faster
than the old one unless we have too many dirty pages.

Implementation:

The key part of the implementation is the use of xchg() operation for
clearing dirty bits atomically.  Since this allows us to update only
BITS_PER_LONG pages at once, we need to iterate over the dirty bitmap
until every dirty bit is cleared again for the next call.

Although some people may worry about the problem of using the atomic
memory instruction many times to the concurrently accessible bitmap,
it is usually accessed with mmu_lock held and we rarely see concurrent
accesses: so what we need to care about is the pure xchg() overheads.

Another point to note is that we do not use for_each_set_bit() to check
which ones in each BITS_PER_LONG pages are actually dirty.  Instead we
simply use __ffs() in a loop.  This is much faster than repeatedly call
find_next_bit().

Performance:

The dirty-log-perf unit test showed nice improvements, some times faster
than before, except for some extreme cases; for such cases the speed of
getting dirty page information is much faster than we process it in the
userspace.

For real workloads, both VGA and live migration, we have observed pure
improvements: when the guest was reading a file during live migration,
we originally saw a few ms of latency, but with the new method the
latency was less than 200us.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-08 12:50:00 +03:00
Takuya Yoshikawa
5dc99b2380 KVM: Avoid checking huge page mappings in get_dirty_log()
Dropped such mappings when we enabled dirty logging and we will never
create new ones until we stop the logging.

For this we introduce a new function which can be used to write protect
a range of PT level pages: although we do not need to care about a range
of pages at this point, the following patch will need this feature to
optimize the write protection of many pages.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-08 12:49:58 +03:00
Takuya Yoshikawa
a0ed46073c KVM: MMU: Split the main body of rmap_write_protect() off from others
We will use this in the following patch to implement another function
which needs to write protect pages using the rmap information.

Note that there is a small change in debug printing for large pages:
we do not differentiate them from others to avoid duplicating code.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-08 12:49:56 +03:00
Eric B Munson
1c0b28c2a4 KVM: x86: Add ioctl for KVM_KVMCLOCK_CTRL
Now that we have a flag that will tell the guest it was suspended, create an
interface for that communication using a KVM ioctl.

Signed-off-by: Eric B Munson <emunson@mgebm.net>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-08 12:49:01 +03:00
Christoffer Dall
b6d33834bd KVM: Factor out kvm_vcpu_kick to arch-generic code
The kvm_vcpu_kick function performs roughly the same funcitonality on
most all architectures, so we shouldn't have separate copies.

PowerPC keeps a pointer to interchanging waitqueues on the vcpu_arch
structure and to accomodate this special need a
__KVM_HAVE_ARCH_VCPU_GET_WQ define and accompanying function
kvm_arch_vcpu_wq have been defined. For all other architectures this
is a generic inline that just returns &vcpu->wq;

Acked-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-08 12:47:47 +03:00
Jason Wang
675acb758a KVM: SVM: count all irq windows exit
Also count the exits of fast-path.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-08 12:47:01 +03:00
Liu, Jinsong
83c529151a KVM: x86: expose Intel cpu new features (HLE, RTM) to guest
Intel recently release 2 new features, HLE and RTM.
Refer to http://software.intel.com/file/41417.
This patch expose them to guest.

Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-08 12:46:32 +03:00
Marcelo Tosatti
7a4f5ad051 KVM: VMX: vmx_set_cr0 expects kvm->srcu locked
vmx_set_cr0 is called from vcpu run context, therefore it expects
kvm->srcu to be held (for setting up the real-mode TSS).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-05 19:04:09 +03:00
Sasikantha babu
fea5295324 KVM: PMU: Fix integer constant is too large warning in kvm_pmu_set_msr()
Signed-off-by: Sasikantha babu <sasikanth.v19@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-05 19:04:08 +03:00