Commit Graph

641 Commits

Author SHA1 Message Date
Paolo Bonzini
820b3fcdeb KVM: add missing cleanup_srcu_struct
Reported-by: hrg <hrgstephen@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-03 13:44:17 +02:00
Christian Borntraeger
719d93cd5f kvm/irqchip: Speed up KVM_SET_GSI_ROUTING
When starting lots of dataplane devices the bootup takes very long on
Christian's s390 with irqfd patches. With larger setups he is even
able to trigger some timeouts in some components.  Turns out that the
KVM_SET_GSI_ROUTING ioctl takes very long (strace claims up to 0.1 sec)
when having multiple CPUs.  This is caused by the  synchronize_rcu and
the HZ=100 of s390.  By changing the code to use a private srcu we can
speed things up.  This patch reduces the boot time till mounting root
from 8 to 2 seconds on my s390 guest with 100 disks.

Uses of hlist_for_each_entry_rcu, hlist_add_head_rcu, hlist_del_init_rcu
are fine because they do not have lockdep checks (hlist_for_each_entry_rcu
uses rcu_dereference_raw rather than rcu_dereference, and write-sides
do not do rcu lockdep at all).

Note that we're hardly relying on the "sleepable" part of srcu.  We just
want SRCU's faster detection of grace periods.

Testing was done by Andrew Theurer using netperf tests STREAM, MAERTS
and RR.  The difference between results "before" and "after" the patch
has mean -0.2% and standard deviation 0.6%.  Using a paired t-test on the
data points says that there is a 2.5% probability that the patch is the
cause of the performance difference (rather than a random fluctuation).

(Restricting the t-test to RR, which is the most likely to be affected,
changes the numbers to respectively -0.3% mean, 0.7% stdev, and 8%
probability that the numbers actually say something about the patch.
The probability increases mostly because there are fewer data points).

Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> # s390
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-05 16:29:11 +02:00
Oleg Nesterov
e9545b9f8a KVM: async_pf: change async_pf_execute() to use get_user_pages(tsk => NULL)
async_pf_execute() passes tsk == current to gup(), this is doesn't
hurt but unnecessary and misleading. "tsk" is only used to account
the number of faults and current is the random workqueue thread.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-04-28 17:24:55 +02:00
Oleg Nesterov
d72d946d0b KVM: async_pf: kill the unnecessary use_mm/unuse_mm async_pf_execute()
async_pf_execute() has no reasons to adopt apf->mm, gup(current, mm)
should work just fine even if current has another or NULL ->mm.

Recently kvm_async_page_present_sync() was added insedie the "use_mm"
section, but it seems that it doesn't need current->mm too.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-04-28 17:24:25 +02:00
Xiao Guangrong
a086f6a1eb Revert "KVM: Simplify kvm->tlbs_dirty handling"
This reverts commit 5befdc385d.

Since we will allow flush tlb out of mmu-lock in the later
patch

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-23 17:49:48 -03:00
Marcelo Tosatti
63b5cf04f4 Lazy storage key handling
-------------------------
 Linux does not use the ACC and F bits of the storage key. Newer Linux
 versions also do not use the storage keys for dirty and reference
 tracking. We can optimize the guest handling for those guests for faults
 as well as page-in and page-out by simply not caring about the guest
 visible storage key. We trap guest storage key instruction to enable
 those keys only on demand.
 
 Migration bitmap
 
 Until now s390 never provided a proper dirty bitmap.  Let's provide a
 proper migration bitmap for s390. We also change the user dirty tracking
 to a fault based mechanism. This makes the host completely independent
 from the storage keys. Long term this will allow us to back guest memory
 with large pages.
 
 per-VM device attributes
 ------------------------
 To avoid the introduction of new ioctls, let's provide the
 attribute semanantic also on the VM-"device".
 
 Userspace controlled CMMA
 -------------------------
 The CMMA assist is changed from "always on" to "on if requested" via
 per-VM device attributes. In addition a callback to reset all usage
 states is provided.
 
 Proper guest DAT handling for intercepts
 ----------------------------------------
 While instructions handled by SIE take care of all addressing aspects,
 KVM/s390 currently does not care about guest address translation of
 intercepts. This worked out fine, because
 - the s390 Linux kernel has a 1:1 mapping between kernel virtual<->real
  for all pages up to memory size
 - intercepts happen only for a small amount of cases
 - all of these intercepts happen to be in the kernel text for current
   distros
 
 Of course we need to be better for other intercepts, kernel modules etc.
 We provide the infrastructure and rework all in-kernel intercepts to work
 on logical addresses (paging etc) instead of real ones. The code has
 been running internally for several months now, so it is time for going
 public.
 
 GDB support
 -----------
 We provide breakpoints, single stepping and watchpoints.
 
 Fixes/Cleanups
 --------------
 - Improve program check delivery
 - Factor out the handling of transactional memory  on program checks
 - Use the existing define __LC_PGM_TDB
 - Several cleanups in the lowcore structure
 - Documentation
 
 NOTES
 -----
 - All patches touching base s390 are either ACKed or written by the s390
   maintainers
 - One base KVM patch "KVM: add kvm_is_error_gpa() helper"
 - One patch introduces the notion of VM device attributes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJTVlHZAAoJEBF7vIC1phx8REgP/1P0EUzfBpoS53z1v60n2uLT
 lW79LY9Op4/ZacEgHtU9LzmGa88X0arDsIpBZQsTNLF77AGFcMCCV3X2il/lQrRG
 KSE+ycKLoFjCcES442DwF4gHoGldD+KL/+5LPWSQZtvb9dDpHDft9aeMRBbpUL0Q
 M2kKQDlmJ2XqQu3D5PwSHgVRByHiHOzmTe2ejSSbdppkwBpaiqSBBBk0jVYDW9Jh
 eqUnBcrrYW2p+QS37ELM6hOkfDXN/vXoHBQeyca19TuZVCPNA7HeJaPc2mJ/GZk9
 wrNWEmY3f/lY0lk0zMwBwsDOS5K7jbtvXzcex6m+NsIqQuOvKsmPBy1BWb/axcK5
 uZq/JGFC0fxsFU+7ImtvQrJ/DMHnVuvSKF4WUVle2GdMlDIqkguwX27WwHSiH4/r
 Au02KlVIMUZdLAEUrw/W/S4MPLeZYoGfetHGCOmSaP2qGc97BVFedZaqekDlUgMw
 3gIoQmSIBcfrgF4k9N4nLjdhAX2S4gkviwF3pTlIkecNfa7RcI3Xk7U9mVPmIhL4
 IquVqjdXZH4m0e4gViBMtQ0IPwGt1qFlV6Wv3O9MExhfi7VQ8M8TMYNhEvtGpY75
 cuZwZYGM4FqszDAy9hbk0avTLqCxqlTiBKi3tHoQMappQmsJPrIdxIpev3MZPHCp
 vZMkbzhM9l3eefNJVw66
 =jxBp
 -----END PGP SIGNATURE-----

Merge tag 'kvm-s390-20140422' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into queue

Lazy storage key handling
-------------------------
Linux does not use the ACC and F bits of the storage key. Newer Linux
versions also do not use the storage keys for dirty and reference
tracking. We can optimize the guest handling for those guests for faults
as well as page-in and page-out by simply not caring about the guest
visible storage key. We trap guest storage key instruction to enable
those keys only on demand.

Migration bitmap

Until now s390 never provided a proper dirty bitmap.  Let's provide a
proper migration bitmap for s390. We also change the user dirty tracking
to a fault based mechanism. This makes the host completely independent
from the storage keys. Long term this will allow us to back guest memory
with large pages.

per-VM device attributes
------------------------
To avoid the introduction of new ioctls, let's provide the
attribute semanantic also on the VM-"device".

Userspace controlled CMMA
-------------------------
The CMMA assist is changed from "always on" to "on if requested" via
per-VM device attributes. In addition a callback to reset all usage
states is provided.

Proper guest DAT handling for intercepts
----------------------------------------
While instructions handled by SIE take care of all addressing aspects,
KVM/s390 currently does not care about guest address translation of
intercepts. This worked out fine, because
- the s390 Linux kernel has a 1:1 mapping between kernel virtual<->real
 for all pages up to memory size
- intercepts happen only for a small amount of cases
- all of these intercepts happen to be in the kernel text for current
  distros

Of course we need to be better for other intercepts, kernel modules etc.
We provide the infrastructure and rework all in-kernel intercepts to work
on logical addresses (paging etc) instead of real ones. The code has
been running internally for several months now, so it is time for going
public.

GDB support
-----------
We provide breakpoints, single stepping and watchpoints.

Fixes/Cleanups
--------------
- Improve program check delivery
- Factor out the handling of transactional memory  on program checks
- Use the existing define __LC_PGM_TDB
- Several cleanups in the lowcore structure
- Documentation

NOTES
-----
- All patches touching base s390 are either ACKed or written by the s390
  maintainers
- One base KVM patch "KVM: add kvm_is_error_gpa() helper"
- One patch introduces the notion of VM device attributes

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Conflicts:
	include/uapi/linux/kvm.h
2014-04-22 10:51:06 -03:00
Jason J. Herne
15f36ebd34 KVM: s390: Add proper dirty bitmap support to S390 kvm.
Replace the kvm_s390_sync_dirty_log() stub with code to construct the KVM
dirty_bitmap from S390 memory change bits.  Also add code to properly clear
the dirty_bitmap size when clearing the bitmap.

Signed-off-by: Jason J. Herne <jjherne@us.ibm.com>
CC: Dominik Dingel <dingel@linux.vnet.ibm.com>
[Dominik Dingel: use gmap_test_and_clear_dirty, locking fixes]
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2014-04-22 09:36:28 +02:00
Michael S. Tsirkin
68c3b4d167 KVM: VMX: speed up wildcard MMIO EVENTFD
With KVM, MMIO is much slower than PIO, due to the need to
do page walk and emulation. But with EPT, it does not have to be: we
know the address from the VMCS so if the address is unique, we can look
up the eventfd directly, bypassing emulation.

Unfortunately, this only works if userspace does not need to match on
access length and data.  The implementation adds a separate FAST_MMIO
bus internally. This serves two purposes:
    - minimize overhead for old userspace that does not use eventfd with lengtth = 0
    - minimize disruption in other code (since we don't know the length,
      devices on the MMIO bus only get a valid address in write, this
      way we don't need to touch all devices to teach them to handle
      an invalid length)

At the moment, this optimization only has effect for EPT on x86.

It will be possible to speed up MMIO for NPT and MMU using the same
idea in the future.

With this patch applied, on VMX MMIO EVENTFD is essentially as fast as PIO.
I was unable to detect any measureable slowdown to non-eventfd MMIO.

Making MMIO faster is important for the upcoming virtio 1.0 which
includes an MMIO signalling capability.

The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
pre-review and suggestions.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-17 14:01:43 -03:00
Michael S. Tsirkin
f848a5a8dc KVM: support any-length wildcard ioeventfd
It is sometimes benefitial to ignore IO size, and only match on address.
In hindsight this would have been a better default than matching length
when KVM_IOEVENTFD_FLAG_DATAMATCH is not set, In particular, this kind
of access can be optimized on VMX: there no need to do page lookups.
This can currently be done with many ioeventfds but in a suboptimal way.

However we can't change kernel/userspace ABI without risk of breaking
some applications.
Use len = 0 to mean "ignore length for matching" in a more optimal way.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-17 14:01:42 -03:00
Linus Torvalds
55101e2d6c Merge git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Marcelo Tosatti:
 - Fix for guest triggerable BUG_ON (CVE-2014-0155)
 - CR4.SMAP support
 - Spurious WARN_ON() fix

* git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: remove WARN_ON from get_kernel_ns()
  KVM: Rename variable smep to cr4_smep
  KVM: expose SMAP feature to guest
  KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode
  KVM: Add SMAP support when setting CR4
  KVM: Remove SMAP bit from CR4_RESERVED_BITS
  KVM: ioapic: try to recover if pending_eoi goes out of range
  KVM: ioapic: fix assignment of ioapic->rtc_status.pending_eoi (CVE-2014-0155)
2014-04-14 16:21:28 -07:00
Ming Lei
553f809e23 arm, kvm: fix double lock on cpu_add_remove_lock
Commit 8146875de7 (arm, kvm: Fix CPU hotplug callback registration)
holds the lock before calling the two functions:

	kvm_vgic_hyp_init()
	kvm_timer_hyp_init()

and both the two functions are calling register_cpu_notifier()
to register cpu notifier, so cause double lock on cpu_add_remove_lock.

Considered that both two functions are only called inside
kvm_arch_init() with holding cpu_add_remove_lock, so simply use
__register_cpu_notifier() to fix the problem.

Fixes: 8146875de7 (arm, kvm: Fix CPU hotplug callback registration)
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-04-08 13:15:54 +02:00
Paolo Bonzini
4009b2499e KVM: ioapic: try to recover if pending_eoi goes out of range
The RTC tracking code tracks the cardinality of rtc_status.dest_map
into rtc_status.pending_eoi.  It has some WARN_ONs that trigger if
pending_eoi ever becomes negative; however, these do not do anything
to recover, and it bad things will happen soon after they trigger.

When the next RTC interrupt is triggered, rtc_check_coalesced() will
return false, but ioapic_service will find pending_eoi != 0 and
do a BUG_ON.  To avoid this, should pending_eoi ever be nonzero,
call kvm_rtc_eoi_tracking_restore_all to recompute a correct
dest_map and pending_eoi.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-04-04 17:17:24 +02:00
Paolo Bonzini
5678de3f15 KVM: ioapic: fix assignment of ioapic->rtc_status.pending_eoi (CVE-2014-0155)
QE reported that they got the BUG_ON in ioapic_service to trigger.
I cannot reproduce it, but there are two reasons why this could happen.

The less likely but also easiest one, is when kvm_irq_delivery_to_apic
does not deliver to any APIC and returns -1.

Because irqe.shorthand == 0, the kvm_for_each_vcpu loop in that
function is never reached.  However, you can target the similar loop in
kvm_irq_delivery_to_apic_fast; just program a zero logical destination
address into the IOAPIC, or an out-of-range physical destination address.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-04-04 17:17:14 +02:00
Linus Torvalds
d0cb5f71c5 VFIO updates for v3.15 include:
- Allow the vfio-type1 IOMMU to support multiple domains within a container
 - Plumb path to query whether all domains are cache-coherent
 - Wire query into kvm-vfio device to avoid KVM x86 WBINVD emulation
 - Always select CONFIG_ANON_INODES, vfio depends on it (Arnd)
 
 The first patch also makes the vfio-type1 IOMMU driver completely independent
 of the bus_type of the devices it's handling, which enables it to be used for
 both vfio-pci and a future vfio-platform (and hopefully combinations involving
 both simultaneously).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJTPbikAAoJECObm247sIsiG9cP/jnurW84YuAHmybzy3R4nMaa
 8BWcQST+TyQ78GWvubFDcRt+vHmJUI4iFWPBIm1twBSVrMy6F1GcT/spmSne1Dfb
 OvcfGEy59tGO+BeklHg5Qq2Hj1UzfeV9uQoy+PrpTF/sYzsyL6g5+O3Zv39SNvr0
 zCwnO2JAKXIpQlkK3wVha6V13X07Z/+d2b4JSsvmON2cnlPzOUYNrgBvoSeV2IQe
 3kMAU9qfAfQlpNDhRRZL/HshgWazCg4XZUp9UdNjUkOxrUp4vXHrVTUJnUGBDytk
 8V9UGRKF3mik9PqpZJk4jLV5urgUVpnUR5747uqs0KF+9GWxClXvh3gp1XX8Zn7f
 NCfDSMn/wrDr6fQBglKUadITaDhF49KV+J0cX083q46BbYxhAfgxv1I9UOvLreG6
 SGAezlTB0mefa1BmSwJdfKwDBsuDOhbF0TsHC6zT7yFc8pqe3hl/vQzUyu3HtwuM
 yPycQiQQmvOaymaiiBRwyLocwJbDNKPigDomv8NZ8Mcd97Fu9YsF4ydaxMJmmeKf
 TffYS4z4tUElYiU2vv1ipB8T+ReCmdtj2cIvyfwTL9jycY9ouO4bJ56dOGWd3WNl
 m7AVokbrrJQ03itUpFMuYjHAzTNUlXVvXlGr9n6L4VTR0RTL1zwJVrMVDsXdhxm4
 XgDbVB5PrxinDAC1vJvI
 =mnzO
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v3.15-rc1' of git://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:
 "VFIO updates for v3.15 include:

   - Allow the vfio-type1 IOMMU to support multiple domains within a
     container
   - Plumb path to query whether all domains are cache-coherent
   - Wire query into kvm-vfio device to avoid KVM x86 WBINVD emulation
   - Always select CONFIG_ANON_INODES, vfio depends on it (Arnd)

  The first patch also makes the vfio-type1 IOMMU driver completely
  independent of the bus_type of the devices it's handling, which
  enables it to be used for both vfio-pci and a future vfio-platform
  (and hopefully combinations involving both simultaneously)"

* tag 'vfio-v3.15-rc1' of git://github.com/awilliam/linux-vfio:
  vfio: always select ANON_INODES
  kvm/vfio: Support for DMA coherent IOMMUs
  vfio: Add external user check extension interface
  vfio/type1: Add extension to test DMA cache coherence of IOMMU
  vfio/iommu_type1: Multi-IOMMU domain support
2014-04-03 14:05:02 -07:00
Linus Torvalds
7cbb39d4d4 Merge tag 'kvm-3.15-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
 "PPC and ARM do not have much going on this time.  Most of the cool
  stuff, instead, is in s390 and (after a few releases) x86.

  ARM has some caching fixes and PPC has transactional memory support in
  guests.  MIPS has some fixes, with more probably coming in 3.16 as
  QEMU will soon get support for MIPS KVM.

  For x86 there are optimizations for debug registers, which trigger on
  some Windows games, and other important fixes for Windows guests.  We
  now expose to the guest Broadwell instruction set extensions and also
  Intel MPX.  There's also a fix/workaround for OS X guests, nested
  virtualization features (preemption timer), and a couple kvmclock
  refinements.

  For s390, the main news is asynchronous page faults, together with
  improvements to IRQs (floating irqs and adapter irqs) that speed up
  virtio devices"

* tag 'kvm-3.15-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (96 commits)
  KVM: PPC: Book3S HV: Save/restore host PMU registers that are new in POWER8
  KVM: PPC: Book3S HV: Fix decrementer timeouts with non-zero TB offset
  KVM: PPC: Book3S HV: Don't use kvm_memslots() in real mode
  KVM: PPC: Book3S HV: Return ENODEV error rather than EIO
  KVM: PPC: Book3S: Trim top 4 bits of physical address in RTAS code
  KVM: PPC: Book3S HV: Add get/set_one_reg for new TM state
  KVM: PPC: Book3S HV: Add transactional memory support
  KVM: Specify byte order for KVM_EXIT_MMIO
  KVM: vmx: fix MPX detection
  KVM: PPC: Book3S HV: Fix KVM hang with CONFIG_KVM_XICS=n
  KVM: PPC: Book3S: Introduce hypervisor call H_GET_TCE
  KVM: PPC: Book3S HV: Fix incorrect userspace exit on ioeventfd write
  KVM: s390: clear local interrupts at cpu initial reset
  KVM: s390: Fix possible memory leak in SIGP functions
  KVM: s390: fix calculation of idle_mask array size
  KVM: s390: randomize sca address
  KVM: ioapic: reinject pending interrupts on KVM_SET_IRQCHIP
  KVM: Bump KVM_MAX_IRQ_ROUTES for s390
  KVM: s390: irq routing for adapter interrupts.
  KVM: s390: adapter interrupt sources
  ...
2014-04-02 14:50:10 -07:00
Linus Torvalds
176ab02d49 Merge branch 'x86-asmlinkage-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 LTO changes from Peter Anvin:
 "More infrastructure work in preparation for link-time optimization
  (LTO).  Most of these changes is to make sure symbols accessed from
  assembly code are properly marked as visible so the linker doesn't
  remove them.

  My understanding is that the changes to support LTO are still not
  upstream in binutils, but are on the way there.  This patchset should
  conclude the x86-specific changes, and remaining patches to actually
  enable LTO will be fed through the Kbuild tree (other than keeping up
  with changes to the x86 code base, of course), although not
  necessarily in this merge window"

* 'x86-asmlinkage-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
  Kbuild, lto: Handle basic LTO in modpost
  Kbuild, lto: Disable LTO for asm-offsets.c
  Kbuild, lto: Add a gcc-ld script to let run gcc as ld
  Kbuild, lto: add ld-version and ld-ifversion macros
  Kbuild, lto: Drop .number postfixes in modpost
  Kbuild, lto, workaround: Don't warn for initcall_reference in modpost
  lto: Disable LTO for sys_ni
  lto: Handle LTO common symbols in module loader
  lto, workaround: Add workaround for initcall reordering
  lto: Make asmlinkage __visible
  x86, lto: Disable LTO for the x86 VDSO
  initconst, x86: Fix initconst mistake in ts5500 code
  initconst: Fix initconst mistake in dcdbas
  asmlinkage: Make trace_hardirqs_on/off_caller visible
  asmlinkage, x86: Fix 32bit memcpy for LTO
  asmlinkage Make __stack_chk_failed and memcmp visible
  asmlinkage: Mark rwsem functions that can be called from assembler asmlinkage
  asmlinkage: Make main_extable_sort_needed visible
  asmlinkage, mutex: Mark __visible
  asmlinkage: Make trace_hardirq visible
  ...
2014-03-31 14:13:25 -07:00
Paolo Bonzini
673f7b4257 KVM: ioapic: reinject pending interrupts on KVM_SET_IRQCHIP
After the previous patches, an interrupt whose bit is set in the IRR
register will never be in the LAPIC's IRR and has never been injected
on the migration source.  So inject it on the destination.

This fixes migration of Windows guests without HPET (they use the RTC
to trigger the scheduler tick, and lose it after migration).

Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-21 15:10:43 +01:00
Paolo Bonzini
44847dea79 KVM: ioapic: extract body of kvm_ioapic_set_irq
We will reuse it to process a nonzero IRR that is passed to KVM_SET_IRQCHIP.

Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-21 10:20:16 +01:00
Paolo Bonzini
0bc830b05c KVM: ioapic: clear IRR for edge-triggered interrupts at delivery
This ensures that IRR bits are set in the KVM_GET_IRQCHIP result only if
the interrupt is still sitting in the IOAPIC.  After the next patches, it
avoids spurious reinjection of the interrupt when KVM_SET_IRQCHIP is
called.

Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-21 10:20:10 +01:00
Paolo Bonzini
0b10a1c87a KVM: ioapic: merge ioapic_deliver into ioapic_service
Commonize the handling of masking, which was absent for kvm_ioapic_set_irq.
Setting remote_irr does not need a separate function either, and merging
the two functions avoids confusion.

Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-21 10:19:48 +01:00
Cornelia Huck
684a0b719d KVM: eventfd: Fix lock order inversion.
When registering a new irqfd, we call its ->poll method to collect any
event that might have previously been pending so that we can trigger it.
This is done under the kvm->irqfds.lock, which means the eventfd's ctx
lock is taken under it.

However, if we get a POLLHUP in irqfd_wakeup, we will be called with the
ctx lock held before getting the irqfds.lock to deactivate the irqfd,
causing lockdep to complain.

Calling the ->poll method does not really need the irqfds.lock, so let's
just move it after we've given up the irqfds.lock in kvm_irqfd_assign().

Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-18 17:06:04 +01:00
Gabriel L. Somlo
100943c54e kvm: x86: ignore ioapic polarity
Both QEMU and KVM have already accumulated a significant number of
optimizations based on the hard-coded assumption that ioapic polarity
will always use the ActiveHigh convention, where the logical and
physical states of level-triggered irq lines always match (i.e.,
active(asserted) == high == 1, inactive == low == 0). QEMU guests
are expected to follow directions given via ACPI and configure the
ioapic with polarity 0 (ActiveHigh). However, even when misbehaving
guests (e.g. OS X <= 10.9) set the ioapic polarity to 1 (ActiveLow),
QEMU will still use the ActiveHigh signaling convention when
interfacing with KVM.

This patch modifies KVM to completely ignore ioapic polarity as set by
the guest OS, enabling misbehaving guests to work alongside those which
comply with the ActiveHigh polarity specified by QEMU's ACPI tables.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Gabriel L. Somlo <somlo@cmu.edu>
[Move documentation to KVM_IRQ_LINE, add ia64. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-13 11:58:21 +01:00
Alex Williamson
9d830d47c7 kvm/vfio: Support for DMA coherent IOMMUs
VFIO now has support for using the IOMMU_CACHE flag and a mechanism
for an external user to test the current operating mode of the IOMMU.
Add support for this to the kvm-vfio pseudo device so that we only
register noncoherent DMA when necessary.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
2014-02-26 11:38:40 -07:00
Michael Mueller
98f4a14676 KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop
Use the arch specific function kvm_arch_vcpu_runnable() to add a further
criterium to identify a suitable vcpu to yield to during undirected yield
processing.

Signed-off-by: Michael Mueller <mimu@linux.vnet.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-02-26 17:32:05 +01:00
Takuya Yoshikawa
5befdc385d KVM: Simplify kvm->tlbs_dirty handling
When this was introduced, kvm_flush_remote_tlbs() could be called
without holding mmu_lock.  It is now acknowledged that the function
must be called before releasing mmu_lock, and all callers have already
been changed to do so.

There is no need to use smp_mb() and cmpxchg() any more.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-02-18 10:07:26 +01:00
Paolo Bonzini
f18eb31f9d Merge branch 'kvm-master' into kvm-queue 2014-02-14 11:10:07 +01:00
Christoffer Dall
2a2f3e269c arm64: KVM: Add VGIC device control for arm64
This fixes the build breakage introduced by
c07a0191ef and adds support for the device
control API and save/restore of the VGIC state for ARMv8.

The defines were simply missing from the arm64 header files and
uaccess.h must be implicitly imported from somewhere else on arm.

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-02-14 11:09:49 +01:00
Andi Kleen
52480137d8 asmlinkage, kvm: Make kvm_rebooting visible
kvm_rebooting is referenced from assembler code, thus
needs to be visible.

Cc: Gleb Natapov <gleb@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1391845930-28580-1-git-send-email-ak@linux.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-02-13 18:11:56 -08:00
Dominik Dingel
1179ba5395 KVM: async_pf: Add missing call for async page present
Commit KVM: async_pf: Provide additional direct page notification
missed the call from kvm_check_async_pf_completion to the new introduced function.

Reported-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-02-04 04:24:05 +01:00
Dominik Dingel
9f2ceda49c KVM: async_pf: Allow to wait for outstanding work
On s390 we are not able to cancel work. Instead we will flush the work and wait for
completion.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2014-01-30 12:52:20 +01:00
Dominik Dingel
e0ead41a6d KVM: async_pf: Provide additional direct page notification
By setting a Kconfig option, the architecture can control when
guest notifications will be presented by the apf backend.
There is the default batch mechanism, working as before, where the vcpu
thread should pull in this information.
Opposite to this, there is now the direct mechanism, that will push the
information to the guest.
This way s390 can use an already existing architecture interface.

Still the vcpu thread should call check_completion to cleanup leftovers.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2014-01-30 12:51:38 +01:00
Dan Carpenter
aac5c4226e KVM: return an error code in kvm_vm_ioctl_register_coalesced_mmio()
If kvm_io_bus_register_dev() fails then it returns success but it should
return an error code.

I also did a little cleanup like removing an impossible NULL test.

Cc: stable@vger.kernel.org
Fixes: 2b3c246a68 ('KVM: Make coalesced mmio use a device per zone')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-30 11:56:09 +01:00
Jens Freimann
c05c4186bb KVM: s390: add floating irq controller
This patch adds a floating irq controller as a kvm_device.
It will be necessary for migration of floating interrupts as well
as for hardening the reset code by allowing user space to explicitly
remove all pending floating interrupts.

Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2014-01-30 10:25:20 +01:00
Linus Torvalds
7ebd3faa9b First round of KVM updates for 3.14; PPC parts will come next week.
Nothing major here, just bugfixes all over the place.  The most
 interesting part is the ARM guys' virtualized interrupt controller
 overhaul, which lets userspace get/set the state and thus enables
 migration of ARM VMs.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJS3TVKAAoJEBvWZb6bTYbyIFgP/2cmt4ifCuFMaZv4+G1S8jZU
 uC9ZB/+7vzht/p6zAy+4BxurKbHmSBFkC1OKcxYuy7yB4CQkHabzj4V2vRtqFdwH
 5lExP9qh3kqaVLuhnvxLTmkktR3EW4PFy6OI53l5kRNktOXSuZ0aN6K3V7tCg/X0
 iL7ASo4bJKlxeWcDpmuVrNgAajmZVfXrjKY7robgBQno+yIsgKhRZRBQHjozA6B8
 FpCo/k48RZd/EzIbV/PDDRI4hmmry/lgrO9SKjzq56wSqff2bd/k/KYze4dbAPfd
 Ps60enPTuHmeEjjb4MMMU4EKHVdTQFUMx/xZCmT4xzoh8s4of6RHphXbfE0SUznQ
 dTveyEQAR7E3JNS0k1+3WEX5fWlFesp0hO2NeE0wzUq4TAr9ztgVO9NQ6Si15e7Z
 2HysO0T5Ojtt0lY08/PvS6i48eCAuuBomrejJS8hLW4SUZ5adn+yW4Qo7Fp9JeBR
 l9a3LsVT8BZMtUWrUuFcVhlM4MbzElUPjDbgWhR8UYU/kpfVZOQu8qWgGKR4UWXy
 X7/t9l/tjR99CmfMJBAOzJid+ScSpAfg77BdaKiQrVfVIJmsjEjlO8vUMyj5b1HF
 hPX5wNyJjHAOfridLeHSs4Rdm4a8sk8Az5d4h76pLVz8M4jyTi2v0rO3N4/dU/pu
 x7N8KR5hAj+mLBoM9/Al
 =8sYU
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "First round of KVM updates for 3.14; PPC parts will come next week.

  Nothing major here, just bugfixes all over the place.  The most
  interesting part is the ARM guys' virtualized interrupt controller
  overhaul, which lets userspace get/set the state and thus enables
  migration of ARM VMs"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (67 commits)
  kvm: make KVM_MMU_AUDIT help text more readable
  KVM: s390: Fix memory access error detection
  KVM: nVMX: Update guest activity state field on L2 exits
  KVM: nVMX: Fix nested_run_pending on activity state HLT
  KVM: nVMX: Clean up handling of VMX-related MSRs
  KVM: nVMX: Add tracepoints for nested_vmexit and nested_vmexit_inject
  KVM: nVMX: Pass vmexit parameters to nested_vmx_vmexit
  KVM: nVMX: Leave VMX mode on clearing of feature control MSR
  KVM: VMX: Fix DR6 update on #DB exception
  KVM: SVM: Fix reading of DR6
  KVM: x86: Sync DR7 on KVM_SET_DEBUGREGS
  add support for Hyper-V reference time counter
  KVM: remove useless write to vcpu->hv_clock.tsc_timestamp
  KVM: x86: fix tsc catchup issue with tsc scaling
  KVM: x86: limit PIT timer frequency
  KVM: x86: handle invalid root_hpa everywhere
  kvm: Provide kvm_vcpu_eligible_for_directed_yield() stub
  kvm: vfio: silence GCC warning
  KVM: ARM: Remove duplicate include
  arm/arm64: KVM: relax the requirements of VMA alignment for THP
  ...
2014-01-22 21:40:43 -08:00
Scott Wood
4a55dd7273 kvm: Provide kvm_vcpu_eligible_for_directed_yield() stub
Commit 7940876e13 ("kvm: make local
functions static") broke KVM PPC builds due to removing (rather than
moving) the stub version of kvm_vcpu_eligible_for_directed_yield().

This patch reintroduces it.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Alexander Graf <agraf@suse.de>
[Move the #ifdef inside the function. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-15 12:03:23 +01:00
Paul Bolle
e81d1ad327 kvm: vfio: silence GCC warning
Building vfio.o triggers a GCC warning (when building for 32 bits x86):
    arch/x86/kvm/../../../virt/kvm/vfio.c: In function 'kvm_vfio_set_group':
    arch/x86/kvm/../../../virt/kvm/vfio.c:104:22: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
      void __user *argp = (void __user *)arg;
                          ^

Silence this warning by casting arg to unsigned long.

argp's current type, "void __user *", is always casted to "int32_t
__user *". So its type might as well be changed to "int32_t __user *".

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-15 12:01:48 +01:00
Stephen Hemminger
ea0269bc34 kvm: remove dead code
The function kvm_io_bus_read_cookie is defined but never used
in current in-tree code.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-01-08 19:03:00 -02:00
Stephen Hemminger
7940876e13 kvm: make local functions static
Running 'make namespacecheck' found lots of functions that
should be declared static, since only used in one file.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-01-08 19:02:58 -02:00
Christoffer Dall
fa20f5aea5 KVM: arm-vgic: Support CPU interface reg access
Implement support for the CPU interface register access driven by MMIO
address offsets from the CPU interface base address.  Useful for user
space to support save/restore of the VGIC state.

This commit adds support only for the same logic as the current VGIC
support, and no more.  For example, the active priority registers are
handled as RAZ/WI, just like setting priorities on the emulated
distributor.

Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2013-12-21 10:02:10 -08:00
Christoffer Dall
90a5355ee7 KVM: arm-vgic: Add GICD_SPENDSGIR and GICD_CPENDSGIR handlers
Handle MMIO accesses to the two registers which should support both the
case where the VMs want to read/write either of these registers and the
case where user space reads/writes these registers to do save/restore of
the VGIC state.

Note that the added complexity compared to simple set/clear enable
registers stems from the bookkeping of source cpu ids.  It may be
possible to change the underlying data structure to simplify the
complexity, but since this is not in the critical path at all, this will
do.

Also note that reading this register from a live guest will not be
accurate compared to on hardware, because some state may be living on
the CPU LRs and the only way to give a consistent read would be to force
stop all the VCPUs and request them to unqueu the LR state onto the
distributor.  Until we have an actual user of live reading this
register, we can live with the difference.

Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2013-12-21 10:02:04 -08:00
Christoffer Dall
cbd333a4bf KVM: arm-vgic: Support unqueueing of LRs to the dist
To properly access the VGIC state from user space it is very unpractical
to have to loop through all the LRs in all register access functions.
Instead, support moving all pending state from LRs to the distributor,
but leave active state LRs alone.

Note that to accurately present the active and pending state to VCPUs
reading these distributor registers from a live VM, we would have to
stop all other VPUs than the calling VCPU and ask each CPU to unqueue
their LR state onto the distributor and add fields to track active state
on the distributor side as well.  We don't have any users of such
functionality yet and there are other inaccuracies of the GIC emulation,
so don't provide accurate synchronized access to this state just yet.
However, when the time comes, having this function should help.

Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2013-12-21 10:01:44 -08:00
Christoffer Dall
c07a0191ef KVM: arm-vgic: Add vgic reg access from dev attr
Add infrastructure to handle distributor and cpu interface register
accesses through the KVM_{GET/SET}_DEVICE_ATTR interface by adding the
KVM_DEV_ARM_VGIC_GRP_DIST_REGS and KVM_DEV_ARM_VGIC_GRP_CPU_REGS groups
and defining the semantics of the attr field to be the MMIO offset as
specified in the GICv2 specs.

Missing register accesses or other changes in individual register access
functions to support save/restore of the VGIC state is added in
subsequent patches.

Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2013-12-21 10:01:39 -08:00
Christoffer Dall
1006e8cb22 KVM: arm-vgic: Make vgic mmio functions more generic
Rename the vgic_ranges array to vgic_dist_ranges to be more specific and
to prepare for handling CPU interface register access as well (for
save/restore of VGIC state).

Pass offset from distributor or interface MMIO base to
find_matching_range function instead of the physical address of the
access in the VM memory map.  This allows other callers unaware of the
VM specifics, but with generic VGIC knowledge to reuse the function.

Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2013-12-21 10:01:31 -08:00
Christoffer Dall
ce01e4e887 KVM: arm-vgic: Set base addr through device API
Support setting the distributor and cpu interface base addresses in the
VM physical address space through the KVM_{SET,GET}_DEVICE_ATTR API
in addition to the ARM specific API.

This has the added benefit of being able to share more code in user
space and do things in a uniform manner.

Also deprecate the older API at the same time, but backwards
compatibility will be maintained.

Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2013-12-21 10:01:22 -08:00
Christoffer Dall
7330672bef KVM: arm-vgic: Support KVM_CREATE_DEVICE for VGIC
Support creating the ARM VGIC device through the KVM_CREATE_DEVICE
ioctl, which can then later be leveraged to use the
KVM_{GET/SET}_DEVICE_ATTR, which is useful both for setting addresses in
a more generic API than the ARM-specific one and is useful for
save/restore of VGIC state.

Adds KVM_CAP_DEVICE_CTRL to ARM capabilities.

Note that we change the check for creating a VGIC from bailing out if
any VCPUs were created, to bailing out if any VCPUs were ever run.  This
is an important distinction that shouldn't break anything, but allows
creating the VGIC after the VCPUs have been created.

Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2013-12-21 10:01:16 -08:00
Christoffer Dall
e1ba0207a1 ARM: KVM: Allow creating the VGIC after VCPUs
Rework the VGIC initialization slightly to allow initialization of the
vgic cpu-specific state even if the irqchip (the VGIC) hasn't been
created by user space yet.  This is safe, because the vgic data
structures are already allocated when the CPU is allocated if VGIC
support is compiled into the kernel.  Further, the init process does not
depend on any other information and the sacrifice is a slight
performance degradation for creating VMs in the no-VGIC case.

The reason is that the new device control API doesn't mandate creating
the VGIC before creating the VCPU and it is unreasonable to require user
space to create the VGIC before creating the VCPUs.

At the same time move the irqchip_in_kernel check out of
kvm_vcpu_first_run_init and into the init function to make the per-vcpu
and global init functions symmetric and add comments on the exported
functions making it a bit easier to understand the init flow by only
looking at vgic.c.

Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2013-12-21 10:01:06 -08:00
Andre Przywara
39735a3a39 ARM/KVM: save and restore generic timer registers
For migration to work we need to save (and later restore) the state of
each core's virtual generic timer.
Since this is per VCPU, we can use the [gs]et_one_reg ioctl and export
the three needed registers (control, counter, compare value).
Though they live in cp15 space, we don't use the existing list, since
they need special accessor functions and the arch timer is optional.

Acked-by: Marc Zynger <marc.zyngier@arm.com>
Signed-off-by: Andre Przywara <andre.przywara@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2013-12-21 10:00:15 -08:00
Christoffer Dall
a1a64387ad arm/arm64: KVM: arch_timer: Initialize cntvoff at kvm_init
Initialize the cntvoff at kvm_init_vm time, not before running the VCPUs
at the first time because that will overwrite any potentially restored
values from user space.

Cc: Andre Przywara <andre.przywara@linaro.org>
Acked-by: Marc Zynger <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2013-12-21 09:58:57 -08:00
Takuya Yoshikawa
c08ac06ab3 KVM: Use cond_resched() directly and remove useless kvm_resched()
Since the commit 15ad7146 ("KVM: Use the scheduler preemption notifiers
to make kvm preemptible"), the remaining stuff in this function is a
simple cond_resched() call with an extra need_resched() check which was
there to avoid dropping VCPUs unnecessarily.  Now it is meaningless.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-12-13 14:23:45 +01:00
Andy Honig
338c7dbadd KVM: Improve create VCPU parameter (CVE-2013-4587)
In multiple functions the vcpu_id is used as an offset into a bitfield.  Ag
malicious user could specify a vcpu_id greater than 255 in order to set or
clear bits in kernel memory.  This could be used to elevate priveges in the
kernel.  This patch verifies that the vcpu_id provided is less than 255.
The api documentation already specifies that the vcpu_id must be less than
max_vcpus, but this is currently not checked.

Reported-by: Andrew Honig <ahonig@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Honig <ahonig@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-12-12 22:39:33 +01:00