A device mapping is normally always mapped at Stage-2, since there
is very little gain in having it faulted in.
Nonetheless, it is possible to end-up in a situation where the device
mapping has been removed from Stage-2 (userspace munmaped the VFIO
region, and the MMU notifier did its job), but present in a userspace
mapping (userpace has mapped it back at the same address). In such
a situation, the device mapping will be demand-paged as the guest
performs memory accesses.
This requires to be careful when dealing with mapping size, cache
management, and to handle potential execution of a device mapping.
Reported-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: James Morse <james.morse@arm.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20191211165651.7889-2-maz@kernel.org
In kvm_arch_prepare_memory_region, arm kvm regards the memory region as
writable if the flag has no KVM_MEM_READONLY, and the vm is readonly if
!VM_WRITE.
But there is common usage for setting kvm memory region as follows:
e.g. qemu side (see the PROT_NONE flag)
1. mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
memory_region_init_ram_ptr()
2. re mmap the above area with read/write authority.
Such example is used in virtio-fs qemu codes which hasn't been upstreamed
[1]. But seems we can't forbid this example.
Without this patch, it will cause an EPERM during kvm_set_memory_region()
and cause qemu boot crash.
As told by Ard, "the underlying assumption is incorrect, i.e., that the
value of vm_flags at this point in time defines how the VMA is used
during its lifetime. There may be other cases where a VMA is created
with VM_READ vm_flags that are changed to VM_READ|VM_WRITE later, and
we are currently rejecting this use case as well."
[1] https://gitlab.com/virtio-fs/qemu/blob/5a356e/hw/virtio/vhost-user-fs.c#L488
Suggested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Jia He <justin.he@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Christoffer Dall <christoffer.dall@arm.com>
Link: https://lore.kernel.org/r/20191206020802.196108-1-justin.he@arm.com
Use wrapper function lock_all_vcpus()/unlock_all_vcpus()
in kvm_vgic_create() to remove duplicated code dealing
with locking and unlocking all vcpus in a vm.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Link: https://lore.kernel.org/r/1575081918-11401-1-git-send-email-linmiaohe@huawei.com
In kvm_vgic_dist_init() called from kvm_vgic_map_resources(), if
dist->vgic_model is invalid, dist->spis will be freed without set
dist->spis = NULL. And in vgicv2 resources clean up path,
__kvm_vgic_destroy() will be called to free allocated resources.
And dist->spis will be freed again in clean up chain because we
forget to set dist->spis = NULL in kvm_vgic_dist_init() failed
path. So double free would happen.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Link: https://lore.kernel.org/r/1574923128-19956-1-git-send-email-linmiaohe@huawei.com
As arg dummy is not really needed, there's no need to pass
NULL when calling cpu_init_hyp_mode(). So clean it up.
Fixes: 67f6919766 ("arm64: kvm: allows kvm cpu hotplug")
Reviewed-by: Steven Price <steven.price@arm.com>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/1574320559-5662-1-git-send-email-linmiaohe@huawei.com
Just like we do for WFE trapping, it can be useful to turn off
WFI trapping when the physical CPU is not oversubscribed (that
is, the vcpu is the only runnable process on this CPU) *and*
that we're using direct injection of interrupts.
The conditions are reevaluated on each vcpu_load(), ensuring that
we don't switch to this mode on a busy system.
On a GICv4 system, this has the effect of reducing the generation
of doorbell interrupts to zero when the right conditions are
met, which is a huge improvement over the current situation
(where the doorbells are screaming if the CPU ever hits a
blocking WFI).
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Christoffer Dall <christoffer.dall@arm.com>
Link: https://lore.kernel.org/r/20191107160412.30301-3-maz@kernel.org
In order to find out whether a vcpu is likely to be the target of
VLPIs (and to further optimize the way we deal with those), let's
track the number of VLPIs a vcpu can receive.
This gets implemented with an atomic variable that gets incremented
or decremented on map, unmap and move of a VLPI.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Christoffer Dall <christoffer.dall@arm.com>
Link: https://lore.kernel.org/r/20191107160412.30301-2-maz@kernel.org
The timers are canceled from an preempt-notifier which is invoked with
disabled preemption which is not allowed on PREEMPT_RT.
The timer callback is short so in could be invoked in hard-IRQ context
on -RT.
Let the timer expire on hard-IRQ context even on -RT.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Julien Grall <julien.grall@arm.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20191107095424.16647-1-bigeasy@linutronix.de
It's possible that two LPIs locate in the same "byte_offset" but target
two different vcpus, where their pending status are indicated by two
different pending tables. In such a scenario, using last_byte_offset
optimization will lead KVM relying on the wrong pending table entry.
Let us use last_ptr instead, which can be treated as a byte index into
a pending table and also, can be vcpu specific.
Fixes: 280771252c ("KVM: arm64: vgic-v3: KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES")
Cc: stable@vger.kernel.org
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Acked-by: Eric Auger <eric.auger@redhat.com>
Link: https://lore.kernel.org/r/20191029071919.177-4-yuzenghui@huawei.com
Fix various comments, including wrong function names, grammar mistakes
and specification references.
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20191029071919.177-3-yuzenghui@huawei.com
When the VHE code was reworked, a lot of the vgic stuff was moved around,
but the GICv4 residency code did stay untouched, meaning that we come
in and out of residency on each flush/sync, which is obviously suboptimal.
To address this, let's move things around a bit:
- Residency entry (flush) moves to vcpu_load
- Residency exit (sync) moves to vcpu_put
- On blocking (entry to WFI), we "put"
- On unblocking (exit from WFI), we "load"
Because these can nest (load/block/put/load/unblock/put, for example),
we now have per-VPE tracking of the residency state.
Additionally, vgic_v4_put gains a "need doorbell" parameter, which only
gets set to true when blocking because of a WFI. This allows a finer
control of the doorbell, which now also gets disabled as soon as
it gets signaled.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20191027144234.8395-2-maz@kernel.org
Allow user space to inform the KVM host where in the physical memory
map the paravirtualized time structures should be located.
User space can set an attribute on the VCPU providing the IPA base
address of the stolen time structure for that VCPU. This must be
repeated for every VCPU in the VM.
The address is given in terms of the physical address visible to
the guest and must be 64 byte aligned. The guest will discover the
address via a hypercall.
Signed-off-by: Steven Price <steven.price@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Implement the service call for configuring a shared structure between a
VCPU and the hypervisor in which the hypervisor can write the time
stolen from the VCPU's execution time by other tasks on the host.
User space allocates memory which is placed at an IPA also chosen by user
space. The hypervisor then updates the shared structure using
kvm_put_guest() to ensure single copy atomicity of the 64-bit value
reporting the stolen time in nanoseconds.
Whenever stolen time is enabled by the guest, the stolen time counter is
reset.
The stolen time itself is retrieved from the sched_info structure
maintained by the Linux scheduler code. We enable SCHEDSTATS when
selecting KVM Kconfig to ensure this value is meaningful.
Signed-off-by: Steven Price <steven.price@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
This provides a mechanism for querying which paravirtualized time
features are available in this hypervisor.
Also add the header file which defines the ABI for the paravirtualized
time features we're about to add.
Signed-off-by: Steven Price <steven.price@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
We currently intertwine the KVM PSCI implementation with the general
dispatch of hypercall handling, which makes perfect sense because PSCI
is the only category of hypercalls we support.
However, as we are about to support additional hypercalls, factor out
this functionality into a separate hypercall handler file.
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
[steven.price@arm.com: rebased]
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Steven Price <steven.price@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
In some scenarios, such as buggy guest or incorrect configuration of the
VMM and firmware description data, userspace will detect a memory access
to a portion of the IPA, which is not mapped to any MMIO region.
For this purpose, the appropriate action is to inject an external abort
to the guest. The kernel already has functionality to inject an
external abort, but we need to wire up a signal from user space that
lets user space tell the kernel to do this.
It turns out, we already have the set event functionality which we can
perfectly reuse for this.
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
For a long time, if a guest accessed memory outside of a memslot using
any of the load/store instructions in the architecture which doesn't
supply decoding information in the ESR_EL2 (the ISV bit is not set), the
kernel would print the following message and terminate the VM as a
result of returning -ENOSYS to userspace:
load/store instruction decoding not implemented
The reason behind this message is that KVM assumes that all accesses
outside a memslot is an MMIO access which should be handled by
userspace, and we originally expected to eventually implement some sort
of decoding of load/store instructions where the ISV bit was not set.
However, it turns out that many of the instructions which don't provide
decoding information on abort are not safe to use for MMIO accesses, and
the remaining few that would potentially make sense to use on MMIO
accesses, such as those with register writeback, are not used in
practice. It also turns out that fetching an instruction from guest
memory can be a pretty horrible affair, involving stopping all CPUs on
SMP systems, handling multiple corner cases of address translation in
software, and more. It doesn't appear likely that we'll ever implement
this in the kernel.
What is much more common is that a user has misconfigured his/her guest
and is actually not accessing an MMIO region, but just hitting some
random hole in the IPA space. In this scenario, the error message above
is almost misleading and has led to a great deal of confusion over the
years.
It is, nevertheless, ABI to userspace, and we therefore need to
introduce a new capability that userspace explicitly enables to change
behavior.
This patch introduces KVM_CAP_ARM_NISV_TO_USER (NISV meaning Non-ISV)
which does exactly that, and introduces a new exit reason to report the
event to userspace. User space can then emulate an exception to the
guest, restart the guest, suspend the guest, or take any other
appropriate action as per the policy of the running system.
Reported-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Reviewed-by: Alexander Graf <graf@amazon.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
The PMU emulation code uses the perf event sample period to trigger
the overflow detection. This works fine for the *first* overflow
handling, but results in a huge number of interrupts on the host,
unrelated to the number of interrupts handled in the guest (a x20
factor is pretty common for the cycle counter). On a slow system
(such as a SW model), this can result in the guest only making
forward progress at a glacial pace.
It turns out that the clue is in the name. The sample period is
exactly that: a period. And once the an overflow has occured,
the following period should be the full width of the associated
counter, instead of whatever the guest had initially programed.
Reset the sample period to the architected value in the overflow
handler, which now results in a number of host interrupts that is
much closer to the number of interrupts in the guest.
Fixes: b02386eb7d ("arm64: KVM: Add PMU overflow interrupt routing")
Reviewed-by: Andrew Murray <andrew.murray@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
The current convention for KVM to request a chained event from the
host PMU is to set bit[0] in attr.config1 (PERF_ATTR_CFG1_KVM_PMU_CHAINED).
But as it turns out, this bit gets set *after* we create the kernel
event that backs our virtual counter, meaning that we never get
a 64bit counter.
Moving the setting to an earlier point solves the problem.
Fixes: 80f393a23b ("KVM: arm/arm64: Support chained PMU counters")
Reviewed-by: Andrew Murray <andrew.murray@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
When a counter is disabled, its value is sampled before the event
is being disabled, and the value written back in the shadow register.
In that process, the value gets truncated to 32bit, which is adequate
for any counter but the cycle counter (defined as a 64bit counter).
This obviously results in a corrupted counter, and things like
"perf record -e cycles" not working at all when run in a guest...
A similar, but less critical bug exists in kvm_pmu_get_counter_value.
Make the truncation conditional on the counter not being the cycle
counter, which results in a minor code reorganisation.
Fixes: 80f393a23b ("KVM: arm/arm64: Support chained PMU counters")
Reviewed-by: Andrew Murray <andrew.murray@arm.com>
Reported-by: Julien Thierry <julien.thierry.kdev@gmail.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
- Remove the now obsolete hyp_alternate_select construct
- Fix the TRACE_INCLUDE_PATH macro in the vgic code
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAl2TFyoPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDROwP/inRUonz+KEG2B0Bx/NWtzdnDghxdcoNC9H6
lVDHJ2dtC5Kmf0iHEualUvxXHYx7QJ3Maov3UAtkeYl3s4wC6TAl++QkqAG9PYsc
lPQH4GBiQNewQyaebc/NKHDz3I3TClJDq57haHSFFiCwsUpJRgYL8WjktZD/Dide
CUSQGxdnaALzHvMv5a8yQWadPL/RrXCZqOSKbUjjc20meZxrO66HwUd1G6uZZVDn
VClMQwFkQzVjR7yX21/7gmTcwG99RqVaAsvOpCu9+MVlqSpDROspmSPMuG5X/usO
zDgC07UFNPYHQKrGu8DHqlvO9DrK3vR8VEuKu+asVZP7D/ntvKhAM2c5ai188Z12
w8rOnhJKnDtMGHXn4owcC9tgSfrPR+ZukaltzKRVVFm1Y1Io+qTkAuf3geFqZ1hj
L9LWZ0KlMsFvfIKWPcAEp5rA9EeZoP5IeVCelBWj9ERDrcCMhma8RxpAlBPz1YPy
J345jthE4xFZYQxV+amTKJ3CzbZPuU2iIKgDBYiG2PNCuKwCT46RQitOXWWTwSIb
FZ6pcsmhofj69dSAlrRFjEpiLNkJuNX1ArsAA91vXemTXA2YfVLMZo1HkrmFNfbR
j4HP1BhNVdCgk6HF2HzwdRt8eutvk889GG3q+uCoYCaSu3M8MUEgx64LurOPProO
11jhNb3J
=9luB
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-fixes-5.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm fixes for 5.4, take #1
- Remove the now obsolete hyp_alternate_select construct
- Fix the TRACE_INCLUDE_PATH macro in the vgic code
* ARM: ITS translation cache; support for 512 vCPUs, various cleanups
and bugfixes
* PPC: various minor fixes and preparation
* x86: bugfixes all over the place (posted interrupts, SVM, emulation
corner cases, blocked INIT), some IPI optimizations
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJdf7fdAAoJEL/70l94x66DJzkIAKDcuWXJB4Qtoto6yUvPiHZm
LYkY/Dn1zulb/DhzrBoXFey/jZXwl9kxMYkVTefnrAl0fRwFGX+G1UYnQrtAL6Gr
ifdTYdy3kZhXCnnp99QAantWDswJHo1THwbmHrlmkxS4MdisEaTHwgjaHrDRZ4/d
FAEwW2isSonP3YJfTtsKFFjL9k2D4iMnwZ/R2B7UOaWvgnerZ1GLmOkilvnzGGEV
IQ89IIkWlkKd4SKgq8RkDKlfW5JrLrSdTK2Uf0DvAxV+J0EFkEaR+WlLsqumra0z
Eg3KwNScfQj0DyT0TzurcOxObcQPoMNSFYXLRbUu1+i0CGgm90XpF1IosiuihgU=
=w6I3
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"s390:
- ioctl hardening
- selftests
ARM:
- ITS translation cache
- support for 512 vCPUs
- various cleanups and bugfixes
PPC:
- various minor fixes and preparation
x86:
- bugfixes all over the place (posted interrupts, SVM, emulation
corner cases, blocked INIT)
- some IPI optimizations"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (75 commits)
KVM: X86: Use IPI shorthands in kvm guest when support
KVM: x86: Fix INIT signal handling in various CPU states
KVM: VMX: Introduce exit reason for receiving INIT signal on guest-mode
KVM: VMX: Stop the preemption timer during vCPU reset
KVM: LAPIC: Micro optimize IPI latency
kvm: Nested KVM MMUs need PAE root too
KVM: x86: set ctxt->have_exception in x86_decode_insn()
KVM: x86: always stop emulation on page fault
KVM: nVMX: trace nested VM-Enter failures detected by H/W
KVM: nVMX: add tracepoint for failed nested VM-Enter
x86: KVM: svm: Fix a check in nested_svm_vmrun()
KVM: x86: Return to userspace with internal error on unexpected exit reason
KVM: x86: Add kvm_emulate_{rd,wr}msr() to consolidate VXM/SVM code
KVM: x86: Refactor up kvm_{g,s}et_msr() to simplify callers
doc: kvm: Fix return description of KVM_SET_MSRS
KVM: X86: Tune PLE Window tracepoint
KVM: VMX: Change ple_window type to unsigned int
KVM: X86: Remove tailing newline for tracepoints
KVM: X86: Trace vcpu_id for vmexit
KVM: x86: Manually calculate reserved bits when loading PDPTRS
...
Commit 49dfe94fe5 ("KVM: arm/arm64: Fix TRACE_INCLUDE_PATH") fixes
TRACE_INCLUDE_PATH to the correct relative path to the define_trace.h
and explains why did the old one work.
The same fix should be applied to virt/kvm/arm/vgic/trace.h.
Reviewed-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
While parts of the VGIC support a large number of vcpus (we
bravely allow up to 512), other parts are more limited.
One of these limits is visible in the KVM_IRQ_LINE ioctl, which
only allows 256 vcpus to be signalled when using the CPU or PPI
types. Unfortunately, we've cornered ourselves badly by allocating
all the bits in the irq field.
Since the irq_type subfield (8 bit wide) is currently only taking
the values 0, 1 and 2 (and we have been careful not to allow anything
else), let's reduce this field to only 4 bits, and allocate the
remaining 4 bits to a vcpu2_index, which acts as a multiplier:
vcpu_id = 256 * vcpu2_index + vcpu_index
With that, and a new capability (KVM_CAP_ARM_IRQ_LINE_LAYOUT_2)
allowing this to be discovered, it becomes possible to inject
PPIs to up to 4096 vcpus. But please just don't.
Whilst we're there, add a clarification about the use of KVM_IRQ_LINE
on arm, which is not completely conditionned by KVM_CAP_IRQCHIP.
Reported-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
- Fix GICv2 emulation bug (KVM)
- Fix deadlock in virtual GIC interrupt injection code (KVM)
- Fix kprobes blacklist init failure due to broken kallsyms lookup
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl1mW/0ACgkQt6xw3ITB
YzQWEQgAnG+20c1uzPqIEME3o0bCXsci0I1w4yPYPH/NVrfmJdAjfway/cLZlMCh
JYKBTc392kfYXgzAvMqvjNe+9O/qg9MnMxV7Afjfk9e0k+41Kcw/7U39/B6q4yOg
zIz4YW6SdbbbUfY9VgR8SjsZ3HwJoP8YVNEv3W259funtFHkKDO35h1iG1f57GWL
UELjEnag9dmCH35ykt7+SC4woBGjOQnic2XLSS7VZuU/26TfnUyo2HCdDcLcViCZ
DHDQgaHAscb+tgJjeNtNEeRu+GPBG9LOhPw4EpBjGHeoJla7z74GOJPMEt4Ufoqd
qG9TPfOFAFM30/PAdlWF3SjfzqK3PA==
=WYzY
-----END PGP SIGNATURE-----
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"Hot on the heels of our last set of fixes are a few more for -rc7.
Two of them are fixing issues with our virtual interrupt controller
implementation in KVM/arm, while the other is a longstanding but
straightforward kallsyms fix which was been acked by Masami and
resolves an initialisation failure in kprobes observed on arm64.
- Fix GICv2 emulation bug (KVM)
- Fix deadlock in virtual GIC interrupt injection code (KVM)
- Fix kprobes blacklist init failure due to broken kallsyms lookup"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
KVM: arm/arm64: vgic-v2: Handle SGI bits in GICD_I{S,C}PENDR0 as WI
KVM: arm/arm64: vgic: Fix potential deadlock when ap_list is long
kallsyms: Don't let kallsyms_lookup_size_offset() fail on retrieving the first symbol
A guest is not allowed to inject a SGI (or clear its pending state)
by writing to GICD_ISPENDR0 (resp. GICD_ICPENDR0), as these bits are
defined as WI (as per ARM IHI 0048B 4.3.7 and 4.3.8).
Make sure we correctly emulate the architecture.
Fixes: 96b298000d ("KVM: arm/arm64: vgic-new: Add PENDING registers handlers")
Cc: stable@vger.kernel.org # 4.7+
Reported-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
If the ap_list is longer than 256 entries, merge_final() in list_sort()
will call the comparison callback with the same element twice, causing
a deadlock in vgic_irq_cmp().
Fix it by returning early when irqa == irqb.
Cc: stable@vger.kernel.org # 4.7+
Fixes: 8e44474579 ("KVM: arm/arm64: vgic-new: Add IRQ sorting")
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Heyi Guo <guoheyi@huawei.com>
[maz: massaged commit log and patch, added Fixes and Cc-stable]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
At the moment we use 2 IO devices per GICv3 redistributor: one
one for the RD_base frame and one for the SGI_base frame.
Instead we can use a single IO device per redistributor (the 2
frames are contiguous). This saves slots on the KVM_MMIO_BUS
which is currently limited to NR_IOBUS_DEVS (1000).
This change allows to instantiate up to 512 redistributors and may
speed the guest boot with a large number of VCPUs.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Detected by Coccinelle (and Will Deacon) using
scripts/coccinelle/misc/semicolon.cocci.
Reported-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
- Don't overskip instructions on MMIO emulation
- Fix UBSAN splat when initializing PPI priorities
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAl1gE4sPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDvT0QALDyWgNXugpyOTbuH01zgTT5W2PxWPnLT6bl
yCN84C2falMjLgsvBGJo+HuD8nOTwCsam+6mVnbVOmXjDpFtsp1z/unJ9Cv9T4e+
1/TSDgp1Y1wJsdfVMqLOj2LOJterVC65e+eRp4ShEaCaGl0QsJLQIZNndoycen8K
XcwLokABKoypctGz/1XJD9fX0GeJdgJQ2dASVuccaWxvo0lrD5qoRlZUIdWKjTmn
OneayyIB8Dqn2Ju/bQ9bbTzg5VLfw2L9lnrVAlaFnWZETWHAtG6uCK6Zj/eDNZj8
TFBwXtKLbdJPQO+JR7l40QjvK/qkHdVaOp4M1kB+4niYogK23WlWvh7kZ/sZbAUb
A1PRQ39L6f0LJrJJtWeS/bJyUmBnX4PJkwZMNV4EN4fXDi2+79/DxUKXih+im/WN
W26WMAqFwxKiMSEENLfl4ladmrgo9SUBeI8QAnEgvUChCcy9HGpKgQp/KjImM9b3
ab87VS8BUYfyThF7PPshfBteWg3rHPQY2kjRn7B8yRhCcoWBErGtXkEdIhxvtfjk
hUgvT8CPk4uoh4DynqRxvDR16xMPwpUTtedVoZzIkGgG6ZLHAdX0303OLaRZ/KDl
j6vKdm8rU5I4samalckcHuoP+t2Hmmdbd0JNo+BaiorbtBXXXJQWd++85tCMniTg
kWGoHUn5
=FGm5
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-fixes-for-5.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm/fixes
Pull KVM/arm fixes from Marc Zyngier as per Paulo's request at:
https://lkml.kernel.org/r/21ae69a2-2546-29d0-bff6-2ea825e3d968@redhat.com
"One (hopefully last) set of fixes for KVM/arm for 5.3: an embarassing
MMIO emulation regression, and a UBSAN splat. Oh well...
- Don't overskip instructions on MMIO emulation
- Fix UBSAN splat when initializing PPI priorities"
* tag 'kvmarm-fixes-for-5.3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm:
KVM: arm/arm64: VGIC: Properly initialise private IRQ affinity
KVM: arm/arm64: Only skip MMIO insn once
At the moment we initialise the target *mask* of a virtual IRQ to the
VCPU it belongs to, even though this mask is only defined for GICv2 and
quickly runs out of bits for many GICv3 guests.
This behaviour triggers an UBSAN complaint for more than 32 VCPUs:
------
[ 5659.462377] UBSAN: Undefined behaviour in virt/kvm/arm/vgic/vgic-init.c:223:21
[ 5659.471689] shift exponent 32 is too large for 32-bit type 'unsigned int'
------
Also for GICv3 guests the reporting of TARGET in the "vgic-state" debugfs
dump is wrong, due to this very same problem.
Because there is no requirement to create the VGIC device before the
VCPUs (and QEMU actually does it the other way round), we can't safely
initialise mpidr or targets in kvm_vgic_vcpu_init(). But since we touch
every private IRQ for each VCPU anyway later (in vgic_init()), we can
just move the initialisation of those fields into there, where we
definitely know the VGIC type.
On the way make sure we really have either a VGICv2 or a VGICv3 device,
since the existing code is just checking for "VGICv3 or not", silently
ignoring the uninitialised case.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reported-by: Dave Martin <dave.martin@arm.com>
Tested-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
If after an MMIO exit to userspace a VCPU is immediately run with an
immediate_exit request, such as when a signal is delivered or an MMIO
emulation completion is needed, then the VCPU completes the MMIO
emulation and immediately returns to userspace. As the exit_reason
does not get changed from KVM_EXIT_MMIO in these cases we have to
be careful not to complete the MMIO emulation again, when the VCPU is
eventually run again, because the emulation does an instruction skip
(and doing too many skips would be a waste of guest code :-) We need
to use additional VCPU state to track if the emulation is complete.
As luck would have it, we already have 'mmio_needed', which even
appears to be used in this way by other architectures already.
Fixes: 0d640732db ("arm64: KVM: Skip MMIO insn after emulation")
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Since commit 503a62862e ("KVM: arm/arm64: vgic: Rely on the GIC driver to
parse the firmware tables"), the vgic_v{2,3}_probe functions stopped using
a DT node. Commit 9097773245 ("KVM: arm/arm64: vgic-new: vgic_init:
implement kvm_vgic_hyp_init") changed the functions again, and now they
require exactly one argument, a struct gic_kvm_info populated by the GIC
driver. Unfortunately the comments regressed and state that a DT node is
used instead. Change the function comments to reflect the current
prototypes.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Now that we have a cache of MSI->LPI translations, it is pretty
easy to implement kvm_arch_set_irq_inatomic (this cache can be
parsed without sleeping).
Hopefully, this will improve some LPI-heavy workloads.
Tested-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
When performing an MSI injection, let's first check if the translation
is already in the cache. If so, let's inject it quickly without
going through the whole translation process.
Tested-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
On a successful translation, preserve the parameters in the LPI
translation cache. Each translation is reusing the last slot
in the list, naturally evicting the least recently used entry.
Tested-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
In order to avoid leaking vgic_irq structures on teardown, we need to
drop all references to LPIs before deallocating the cache itself.
This is done by invalidating the cache on vgic teardown.
Tested-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
If an ITS gets disabled, we need to make sure that further interrupts
won't hit in the cache. For that, we invalidate the translation cache
when the ITS is disabled.
Tested-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
If a vcpu disables LPIs at its redistributor level, we need to make sure
we won't pend more interrupts. For this, we need to invalidate the LPI
translation cache.
Tested-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
The LPI translation cache needs to be discarded when an ITS command
may affect the translation of an LPI (DISCARD, MAPC and MAPD with V=0)
or the routing of an LPI to a redistributor with disabled LPIs (MOVI,
MOVALL).
We decide to perform a full invalidation of the cache, irrespective
of the LPI that is affected. Commands are supposed to be rare enough
that it doesn't matter.
Tested-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
There's a number of cases where we need to invalidate the caching
of translations, so let's add basic support for that.
Tested-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Our LPI translation cache needs to be able to drop the refcount
on an LPI whilst already holding the lpi_list_lock.
Let's add a new primitive for this.
Tested-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Add the basic data structure that expresses an MSI to LPI
translation as well as the allocation/release hooks.
The size of the cache is arbitrarily defined as 16*nr_vcpus.
Tested-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
- Fix our system register reset so that we stop writing
non-sensical values to them, and track which registers
get reset instead.
- Sync VMCR back from the GIC on WFI so that KVM has an
exact vue of PMR.
- Reevaluate state of HW-mapped, level triggered interrupts
on enable.
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAl1NHNIPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDITgP/irkNHkSxRPBswncMb9oYguoGOUIStxSApfD
E/9wH5khyXBvThqPmaY65kJok9YZwd+PbqDvp6umCkWGUeIbWI4bzM2bD974La8N
qaogJxOQubqg9r/TYf5f3EI8V67CS+E/ixK6zrz2XvqvFEmQ3OMhrRFjoFkEK1wA
yQwV15h/4IbebI2D1pkVbi8S/s54wAc3dr20cS3KfqrLYrvEIQrCm7lIOVwKk8fo
wjk8D+2IuF9LBsClQUOZwHG26yxcbd7FNwm4qEneR3UmtT14AgHnPJO3uQUTxM+o
bRajO3tXx4hIe1PAG7l4RxW47+AfjhAD1kzWtmM+xN5NdIfoew/pGMf3HXLhYhYO
9dM5YOPrF5CmZGNbSnrNPwD5mejLngNLtWBUvUKMxjtxILhADxGZIOgfM08pr+XO
9I1gzhPpczc3IIDEmDspkQnylQBPXyu2uOORPllHmmuZwMeODQ2pyyw9leBXq7lx
5tUybLPR2N1uxAIfkN6gEJKq0YgyPcapcVBr9vgyEFA6YKKyGZqthDMTIvm1TlGI
3ha2lUdRjYnYVUZ+WY8RqZVhJ3mB5qgCW4GUUeDx1nm5Ttskx1ic0WGIiErD21vs
Eo//2N9ROp+E1iJC8dCdVZjKy0ZbOJwdW9t9Y89Coo6PF6SqiXBlxrqR+GIQdKlo
P1hdjjRI
=pZYJ
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-fixes-for-5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm fixes for 5.3, take #2
- Fix our system register reset so that we stop writing
non-sensical values to them, and track which registers
get reset instead.
- Sync VMCR back from the GIC on WFI so that KVM has an
exact vue of PMR.
- Reevaluate state of HW-mapped, level triggered interrupts
on enable.
A HW mapped level sensitive interrupt asserted by a device will not be put
into the ap_list if it is disabled at the VGIC level. When it is enabled
again, it will be inserted into the ap_list and written to a list register
on guest entry regardless of the state of the device.
We could argue that this can also happen on real hardware, when the command
to enable the interrupt reached the GIC before the device had the chance to
de-assert the interrupt signal; however, we emulate the distributor and
redistributors in software and we can do better than that.
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Since commit commit 328e566479 ("KVM: arm/arm64: vgic: Defer
touching GICH_VMCR to vcpu_load/put"), we leave ICH_VMCR_EL2 (or
its GICv2 equivalent) loaded as long as we can, only syncing it
back when we're scheduled out.
There is a small snag with that though: kvm_vgic_vcpu_pending_irq(),
which is indirectly called from kvm_vcpu_check_block(), needs to
evaluate the guest's view of ICC_PMR_EL1. At the point were we
call kvm_vcpu_check_block(), the vcpu is still loaded, and whatever
changes to PMR is not visible in memory until we do a vcpu_put().
Things go really south if the guest does the following:
mov x0, #0 // or any small value masking interrupts
msr ICC_PMR_EL1, x0
[vcpu preempted, then rescheduled, VMCR sampled]
mov x0, #ff // allow all interrupts
msr ICC_PMR_EL1, x0
wfi // traps to EL2, so samping of VMCR
[interrupt arrives just after WFI]
Here, the hypervisor's view of PMR is zero, while the guest has enabled
its interrupts. kvm_vgic_vcpu_pending_irq() will then say that no
interrupts are pending (despite an interrupt being received) and we'll
block for no reason. If the guest doesn't have a periodic interrupt
firing once it has blocked, it will stay there forever.
To avoid this unfortuante situation, let's resync VMCR from
kvm_arch_vcpu_blocking(), ensuring that a following kvm_vcpu_check_block()
will observe the latest value of PMR.
This has been found by booting an arm64 Linux guest with the pseudo NMI
feature, and thus using interrupt priorities to mask interrupts instead
of the usual PSTATE masking.
Cc: stable@vger.kernel.org # 4.12
Fixes: 328e566479 ("KVM: arm/arm64: vgic: Defer touching GICH_VMCR to vcpu_load/put")
Signed-off-by: Marc Zyngier <maz@kernel.org>