The TLFS structures are used for hypervisor-guest communication and must
exactly meet the specification.
Compilers can add alignment padding to structures or reorder struct members
for randomization and optimization, which would break the hypervisor ABI.
Mark the structures as packed to prevent this. 'struct hv_vp_assist_page'
and 'struct hv_enlightened_vmcs' need to be properly padded to support the
change.
Suggested-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Nadav Amit <nadav.amit@gmail.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The SynIC message delivery protocol allows the message originator to
request, should the message slot be busy, to be notified when it's free.
However, this is unnecessary and even undesirable for messages generated
by SynIC timers in periodic mode: if the period is short enough compared
to the time the guest spends in the timer interrupt handler, so the
timer ticks start piling up, the excessive interactions due to this
notification and retried message delivery only makes the things worse.
[This was observed, in particular, with Windows L2 guests setting
(temporarily) the periodic timer to 2 kHz, and spending hundreds of
microseconds in the timer interrupt handler due to several L2->L1 exits;
under some load in L0 this could exceed 500 us so the timer ticks
started to pile up and the guest livelocked.]
Relieve the situation somewhat by not retrying message delivery for
periodic SynIC timers. This appears to remain within the "lazy" lost
ticks policy for SynIC timers as implemented in KVM.
Note that it doesn't solve the fundamental problem of livelocking the
guest with a periodic timer whose period is smaller than the time needed
to process a tick, but it makes it a bit less likely to be triggered.
Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SynIC message delivery is somewhat overengineered: it pretends to follow
the ordering rules when grabbing the message slot, using atomic
operations and all that, but does it incorrectly and unnecessarily.
The correct order would be to first set .msg_pending, then atomically
replace .message_type if it was zero, and then clear .msg_pending if
the previous step was successful. But this all is done in vcpu context
so the whole update looks atomic to the guest (it's assumed to only
access the message page from this cpu), and therefore can be done in
whatever order is most convenient (and is also the reason why the
incorrect order didn't trigger any bugs so far).
While at this, also switch to kvm_vcpu_{read,write}_guest_page, and drop
the no longer needed synic_clear_sint_msg_pending.
Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In the previous code, the variable apic_sw_disabled influences
recalculate_apic_map. But in "KVM: x86: simplify kvm_apic_map"
(commit: 3b5a5ffa92),
the access to apic_sw_disabled in recalculate_apic_map has been
deleted.
Signed-off-by: Peng Hao <peng.hao2@zte.com.cn>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Like IA32_STAR, IA32_LSTAR and IA32_FMASK only need to contain guest
values on VM-entry when the guest is in long mode and EFER.SCE is set.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Reviewed-by: Marc Orr <marcorr@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SYSCALL raises #UD in compatibility mode on Intel CPUs, so it's
pointless to load the guest's IA32_CSTAR value into the hardware MSR.
IA32_CSTAR still provides 48 bits of storage on Intel CPUs that have
CPUID.80000001:EDX.LM[bit 29] set, so we cannot remove it from the
vmx_msr_index[] array.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a comment explaining why MSR_STAR must be included in
vmx_msr_index[] even for i386 builds.
The elided comment has not been relevant since move_msr_up() was
introduced in commit a75beee6e4 ("KVM: VMX: Avoid saving and
restoring msrs on lightweight vmexit").
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
RDTSCP is supported in legacy mode as well as long mode. The
IA32_TSC_AUX MSR should be set to the correct guest value before
entering any guest that supports RDTSCP.
Fixes: 4e47c7a6d7 ("KVM: VMX: Add instruction rdtscp support for guest")
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Reviewed-by: Marc Orr <marcorr@google.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
From a functional perspective, this is (supposed to be) a straight
copy-paste of code. Code was moved piecemeal to nested.c as not all
code that could/should be moved was obviously nested-only. The nested
code was then re-ordered as needed to compile, i.e. stats may not show
this is being a "pure" move despite there not being any intended changes
in functionality.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Exposing only the function allows @nested, i.e. the module param, to be
statically defined in vmx.c, ensuring we aren't unnecessarily checking
said variable in the nested code. nested_vmx_allowed() is exposed due
to the need to verify nested support in vmx_{get,set}_nested_state().
The downside is that nested_vmx_allowed() likely won't be inlined in
vmx_{get,set}_nested_state(), but that should be a non-issue as they're
not a hot path. Keeping vmx_{get,set}_nested_state() in vmx.c isn't a
viable option as they need access to several nested-only functions.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...as they're used directly by the nested code. This will allow
moving the bulk of the nested code out of vmx.c without concurrent
changes to vmx.h.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Exposed vmx_msr_index, vmx_return and host_efer via vmx.h so that the
nested code can be moved out of vmx.c.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...so that the function doesn't need to be created when moving the
nested code out of vmx.c.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...so that it doesn't need access to @nested. The only case where the
provided struct isn't already zeroed is the call from vmx_create_vcpu()
as setup_vmcs_config() zeroes the struct in the other use cases. This
will allow @nested to be statically defined in vmx.c, i.e. this removes
the last direct reference from nested code.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...in nested-specific code so that they can eventually be moved out of
vmx.c, e.g. into nested.c.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...so that future patches can reference e.g. @kvm_vmx_exit_handlers
without having to simultaneously move a big chunk of code. Speaking
from experience, resolving merge conflicts is an absolute nightmare
without pre-moving the code.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...so that it can conditionally set by the VMX code, i.e. iff @nested is
true. This will in turn allow it to be moved out of vmx.c and into a
nested-specified file.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Eventually this will allow us to move the nested VMX code out of vmx.c.
Note that this also effectively wraps @enable_shadow_vmcs with @nested
so that it too can be moved out of vmx.c.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VMX has a few hundred lines of code just to wrap various VMX specific
instructions, e.g. VMWREAD, INVVPID, etc... Move them to a dedicated
header so it's easier to find/isolate the boilerplate.
With this change, more inlines can be moved from vmx.c to vmx.h.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The header, evmcs.h, already exists and contains a fair amount of code,
but there are a few pieces in vmx.c that can be moved verbatim. In
addition, move an array definition to evmcs.c to prepare for multiple
consumers of evmcs.h.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
vmcs12 is the KVM-defined struct used to track a nested VMCS, e.g. a
VMCS created by L1 for L2.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This isn't intended to be a pure reflection of hardware, e.g. struct
loaded_vmcs and struct vmcs_host_state are KVM-defined constructs.
Similar to capabilities.h, this is a standalone file to avoid circular
dependencies between yet-to-be-created vmx.h and nested.h files.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Expose the variables associated with various module params that are
needed by the nested VMX code. There is no ulterior logic for what
variables are/aren't exposed, this is purely "what's needed by the
nested code".
Note that @nested is intentionally not exposed.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Defining a separate capabilities.h as opposed to putting this code in
e.g. vmx.h avoids circular dependencies between (the yet-to-be-added)
vmx.h and nested.h. The aforementioned circular dependencies are why
struct nested_vmx_msrs also resides in capabilities instead of e.g.
nested.h.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...instead of referencing the global struct. This will allow moving
setup_vmcs_config() to a separate file that may not have access to
the global variable. Modify nested_vmx_setup_ctls_msrs() appropriately
since vmx_capability.ept may not be accurate when called by
vmx_check_processor_compat().
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
EFER and PERF_GLOBAL_CTRL MSRs have dedicated VM Entry/Exit controls
that KVM dynamically toggles based on whether or not the guest's value
for each MSRs differs from the host. Handle the dynamic behavior by
adding a helper that clears the dynamic bits so the bits aren't set
when initializing the VMCS field outside of the dynamic toggling flow.
This makes the handling consistent with similar behavior for other
controls, e.g. pin, exec and sec_exec. More importantly, it eliminates
two global bools that are stealthily modified by setup_vmcs_config.
Opportunistically clean up a comment and print related to errata for
IA32_PERF_GLOBAL_CTRL.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
MSR_IA32_XSS has no relation to the VMCS whatsoever, it doesn't belong
in setup_vmcs_config() and its reference to host_xss prevents moving
setup_vmcs_config() to a dedicated file.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VMX specific files now reside in a dedicated subdirectory, i.e. the
file name prefix is redundant.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VMX specific files now reside in a dedicated subdirectory. Drop the
"vmx" prefix, which is redundant, and add a "vmcs" prefix to clarify
that the file is referring to VMCS shadow fields.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...to prepare for shattering vmx.c into multiple files without having
to prepend "vmx_" to all new files.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Until this point vmx.c has been the only consumer and included the
file after many others. Prepare for multiple consumers, i.e. the
shattering of vmx.c
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Until this point vmx.c has been the only consumer and included the
file after many others. Prepare for multiple consumers, i.e. the
shattering of vmx.c
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...to prepare for the creation of a "vmx" subdirectory that will contain
a variety of headers. Clean things up now to avoid making a bigger mess
in the future.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...and make enable_shadow_vmcs depend on nested. Aside from the obvious
memory savings, this will allow moving the relevant code out of vmx.c in
the future, e.g. to a nested specific file.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fixes: 34a1cd60d1 ("kvm: x86: vmx: move some vmx setting from vmx_init() to hardware_setup()")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When manual dirty log reprotect will be enabled, kvm_get_dirty_log_protect's
pointer argument will always be false on exit, because no TLB flush is needed
until the manual re-protection operation. Rename it from "is_dirty" to "flush",
which more accurately tells the caller what they have to do with it.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit b2d35fa5fc ("selftests: add headers_install to lib.mk") added
khdr target to run headers_install target from the main Makefile. The
logic uses KSFT_KHDR_INSTALL and top_srcdir as controls to initialize
variables and include files to run headers_install from the top level
Makefile. There are a few problems with this logic.
1. Exposes top_srcdir to all tests
2. Common logic impacts all tests
3. Uses KSFT_KHDR_INSTALL, top_srcdir, and khdr in an adhoc way. Tests
add "khdr" dependency in their Makefiles to TEST_PROGS_EXTENDED in
some cases, and STATIC_LIBS in other cases. This makes this framework
confusing to use.
The common logic that runs for all tests even when KSFT_KHDR_INSTALL
isn't defined by the test. top_srcdir is initialized to a default value
when test doesn't initialize it. It works for all tests without a sub-dir
structure and tests with sub-dir structure fail to build.
e.g: make -C sparc64/drivers/ or make -C drivers/dma-buf
../../lib.mk:20: ../../../../scripts/subarch.include: No such file or directory
make: *** No rule to make target '../../../../scripts/subarch.include'. Stop.
There is no reason to require all tests to define top_srcdir and there is
no need to require tests to add khdr dependency using adhoc changes to
TEST_* and other variables.
Fix it with a consistent use of KSFT_KHDR_INSTALL and top_srcdir from tests
that have the dependency on headers_install.
Change common logic to include khdr target define and "all" target with
dependency on khdr when KSFT_KHDR_INSTALL is defined.
Only tests that have dependency on headers_install have to define just
the KSFT_KHDR_INSTALL, and top_srcdir variables and there is no need to
specify khdr dependency in the test Makefiles.
Fixes: b2d35fa5fc ("selftests: add headers_install to lib.mk")
Cc: stable@vger.kernel.org
Signed-off-by: Shuah Khan <shuah@kernel.org>
Pull networking fixes from David Miller:
"A decent batch of fixes here. I'd say about half are for problems that
have existed for a while, and half are for new regressions added in
the 4.20 merge window.
1) Fix 10G SFP phy module detection in mvpp2, from Baruch Siach.
2) Revert bogus emac driver change, from Benjamin Herrenschmidt.
3) Handle BPF exported data structure with pointers when building
32-bit userland, from Daniel Borkmann.
4) Memory leak fix in act_police, from Davide Caratti.
5) Check RX checksum offload in RX descriptors properly in aquantia
driver, from Dmitry Bogdanov.
6) SKB unlink fix in various spots, from Edward Cree.
7) ndo_dflt_fdb_dump() only works with ethernet, enforce this, from
Eric Dumazet.
8) Fix FID leak in mlxsw driver, from Ido Schimmel.
9) IOTLB locking fix in vhost, from Jean-Philippe Brucker.
10) Fix SKB truesize accounting in ipv4/ipv6/netfilter frag memory
limits otherwise namespace exit can hang. From Jiri Wiesner.
11) Address block parsing length fixes in x25 from Martin Schiller.
12) IRQ and ring accounting fixes in bnxt_en, from Michael Chan.
13) For tun interfaces, only iface delete works with rtnl ops, enforce
this by disallowing add. From Nicolas Dichtel.
14) Use after free in liquidio, from Pan Bian.
15) Fix SKB use after passing to netif_receive_skb(), from Prashant
Bhole.
16) Static key accounting and other fixes in XPS from Sabrina Dubroca.
17) Partially initialized flow key passed to ip6_route_output(), from
Shmulik Ladkani.
18) Fix RTNL deadlock during reset in ibmvnic driver, from Thomas
Falcon.
19) Several small TCP fixes (off-by-one on window probe abort, NULL
deref in tail loss probe, SNMP mis-estimations) from Yuchung
Cheng"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (93 commits)
net/sched: cls_flower: Reject duplicated rules also under skip_sw
bnxt_en: Fix _bnxt_get_max_rings() for 57500 chips.
bnxt_en: Fix NQ/CP rings accounting on the new 57500 chips.
bnxt_en: Keep track of reserved IRQs.
bnxt_en: Fix CNP CoS queue regression.
net/mlx4_core: Correctly set PFC param if global pause is turned off.
Revert "net/ibm/emac: wrong bit is used for STA control"
neighbour: Avoid writing before skb->head in neigh_hh_output()
ipv6: Check available headroom in ip6_xmit() even without options
tcp: lack of available data can also cause TSO defer
ipv6: sr: properly initialize flowi6 prior passing to ip6_route_output
mlxsw: spectrum_switchdev: Fix VLAN device deletion via ioctl
mlxsw: spectrum_router: Relax GRE decap matching check
mlxsw: spectrum_switchdev: Avoid leaking FID's reference count
mlxsw: spectrum_nve: Remove easily triggerable warnings
ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes
sctp: frag_point sanity check
tcp: fix NULL ref in tail loss probe
tcp: Do not underestimate rwnd_limited
net: use skb_list_del_init() to remove from RX sublists
...
Pull x86 fixes from Ingo Molnar:
"Three fixes: a boot parameter re-(re-)fix, a retpoline build artifact
fix and an LLVM workaround"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vdso: Drop implicit common-page-size linker flag
x86/build: Fix compiler support check for CONFIG_RETPOLINE
x86/boot: Clear RSDP address in boot_params for broken loaders
Pull kprobes fixes from Ingo Molnar:
"Two kprobes fixes: a blacklist fix and an instruction patching related
corruption fix"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
kprobes/x86: Blacklist non-attachable interrupt functions
kprobes/x86: Fix instruction patching corruption when copying more than one RIP-relative instruction
Pull EFI fixes from Ingo Molnar:
"Two fixes: a large-system fix and an earlyprintk fix with certain
resolutions"
* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/earlyprintk/efi: Fix infinite loop on some screen widths
x86/efi: Allocate e820 buffer before calling efi_exit_boot_service
Currently, duplicated rules are rejected only for skip_hw or "none",
hence allowing users to push duplicates into HW for no reason.
Use the flower tables to protect for that.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reported-by: Chris Mi <chrism@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan says:
====================
bnxt_en: Bug fixes.
The first patch fixes a regression on CoS queue setup, introduced
recently by the 57500 new chip support patches. The rest are
fixes related to ring and resource accounting on the new 57500 chips.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The CP rings are accounted differently on the new 57500 chips. There
must be enough CP rings for the sum of RX and TX rings on the new
chips. The current logic may be over-estimating the RX and TX rings.
The output parameter max_cp should be the maximum NQs capped by
MSIX vectors available for networking in the context of 57500 chips.
The existing code which uses CMPL rings capped by the MSIX vectors
works most of the time but is not always correct.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>