Commit Graph

827264 Commits

Author SHA1 Message Date
Vitaly Kuznetsov
79904c9de0 selftests: kvm: add a selftest for SMM
Add a simple test for SMM, based on VMX.  The test implements its own
sync between the guest and the host as using our ucall library seems to
be too cumbersome: SMI handler is happening in real-address mode.

This patch also fixes KVM_SET_NESTED_STATE to happen after
KVM_SET_VCPU_EVENTS, in fact it places it last.  This is because
KVM needs to know whether the processor is in SMM or not.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:38:06 +02:00
Paolo Bonzini
c2390f16fc selftests: kvm: fix for compilers that do not support -no-pie
-no-pie was added to GCC at the same time as their configuration option
--enable-default-pie.  Compilers that were built before do not have
-no-pie, but they also do not need it.  Detect the option at build
time.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:38:05 +02:00
Paolo Bonzini
c68c21ca92 selftests: kvm/evmcs_test: complete I/O before migrating guest state
Starting state migration after an IO exit without first completing IO
may result in test failures.  We already have two tests that need this
(this patch in fact fixes evmcs_test, similar to what was fixed for
state_test in commit 0f73bbc851, "KVM: selftests: complete IO before
migrating guest state", 2019-03-13) and a third is coming.  So, move the
code to vcpu_save_state, and while at it do not access register state
until after I/O is complete.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:37:39 +02:00
Sean Christopherson
b68f3cc7d9 KVM: x86: Always use 32-bit SMRAM save state for 32-bit kernels
Invoking the 64-bit variation on a 32-bit kenrel will crash the guest,
trigger a WARN, and/or lead to a buffer overrun in the host, e.g.
rsm_load_state_64() writes r8-r15 unconditionally, but enum kvm_reg and
thus x86_emulate_ctxt._regs only define r8-r15 for CONFIG_X86_64.

KVM allows userspace to report long mode support via CPUID, even though
the guest is all but guaranteed to crash if it actually tries to enable
long mode.  But, a pure 32-bit guest that is ignorant of long mode will
happily plod along.

SMM complicates things as 64-bit CPUs use a different SMRAM save state
area.  KVM handles this correctly for 64-bit kernels, e.g. uses the
legacy save state map if userspace has hid long mode from the guest,
but doesn't fare well when userspace reports long mode support on a
32-bit host kernel (32-bit KVM doesn't support 64-bit guests).

Since the alternative is to crash the guest, e.g. by not loading state
or explicitly requesting shutdown, unconditionally use the legacy SMRAM
save state map for 32-bit KVM.  If a guest has managed to get far enough
to handle SMIs when running under a weird/buggy userspace hypervisor,
then don't deliberately crash the guest since there are no downsides
(from KVM's perspective) to allow it to continue running.

Fixes: 660a5d517a ("KVM: x86: save/load state on SMM switch")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:37:38 +02:00
Sean Christopherson
8f4dc2e77c KVM: x86: Don't clear EFER during SMM transitions for 32-bit vCPU
Neither AMD nor Intel CPUs have an EFER field in the legacy SMRAM save
state area, i.e. don't save/restore EFER across SMM transitions.  KVM
somewhat models this, e.g. doesn't clear EFER on entry to SMM if the
guest doesn't support long mode.  But during RSM, KVM unconditionally
clears EFER so that it can get back to pure 32-bit mode in order to
start loading CRs with their actual non-SMM values.

Clear EFER only when it will be written when loading the non-SMM state
so as to preserve bits that can theoretically be set on 32-bit vCPUs,
e.g. KVM always emulates EFER_SCE.

And because CR4.PAE is cleared only to play nice with EFER, wrap that
code in the long mode check as well.  Note, this may result in a
compiler warning about cr4 being consumed uninitialized.  Re-read CR4
even though it's technically unnecessary, as doing so allows for more
readable code and RSM emulation is not a performance critical path.

Fixes: 660a5d517a ("KVM: x86: save/load state on SMM switch")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:37:37 +02:00
Sean Christopherson
9ec19493fb KVM: x86: clear SMM flags before loading state while leaving SMM
RSM emulation is currently broken on VMX when the interrupted guest has
CR4.VMXE=1.  Stop dancing around the issue of HF_SMM_MASK being set when
loading SMSTATE into architectural state, e.g. by toggling it for
problematic flows, and simply clear HF_SMM_MASK prior to loading
architectural state (from SMRAM save state area).

Reported-by: Jon Doron <arilou@gmail.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Liran Alon <liran.alon@oracle.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Fixes: 5bea5123cb ("KVM: VMX: check nested state and CR4.VMXE against SMM")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:37:36 +02:00
Sean Christopherson
c5833c7a43 KVM: x86: Open code kvm_set_hflags
Prepare for clearing HF_SMM_MASK prior to loading state from the SMRAM
save state map, i.e. kvm_smm_changed() needs to be called after state
has been loaded and so cannot be done automatically when setting
hflags from RSM.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:37:36 +02:00
Sean Christopherson
ed19321fb6 KVM: x86: Load SMRAM in a single shot when leaving SMM
RSM emulation is currently broken on VMX when the interrupted guest has
CR4.VMXE=1.  Rather than dance around the issue of HF_SMM_MASK being set
when loading SMSTATE into architectural state, ideally RSM emulation
itself would be reworked to clear HF_SMM_MASK prior to loading non-SMM
architectural state.

Ostensibly, the only motivation for having HF_SMM_MASK set throughout
the loading of state from the SMRAM save state area is so that the
memory accesses from GET_SMSTATE() are tagged with role.smm.  Load
all of the SMRAM save state area from guest memory at the beginning of
RSM emulation, and load state from the buffer instead of reading guest
memory one-by-one.

This paves the way for clearing HF_SMM_MASK prior to loading state,
and also aligns RSM with the enter_smm() behavior, which fills a
buffer and writes SMRAM save state in a single go.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:37:35 +02:00
Liran Alon
e51bfdb687 KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU
Issue was discovered when running kvm-unit-tests on KVM running as L1 on
top of Hyper-V.

When vmx_instruction_intercept unit-test attempts to run RDPMC to test
RDPMC-exiting, it is intercepted by L1 KVM which it's EXIT_REASON_RDPMC
handler raise #GP because vCPU exposed by Hyper-V doesn't support PMU.
Instead of unit-test expectation to be reflected with EXIT_REASON_RDPMC.

The reason vmx_instruction_intercept unit-test attempts to run RDPMC
even though Hyper-V doesn't support PMU is because L1 expose to L2
support for RDPMC-exiting. Which is reasonable to assume that is
supported only in case CPU supports PMU to being with.

Above issue can easily be simulated by modifying
vmx_instruction_intercept config in x86/unittests.cfg to run QEMU with
"-cpu host,+vmx,-pmu" and run unit-test.

To handle issue, change KVM to expose RDPMC-exiting only when guest
supports PMU.

Reported-by: Saar Amar <saaramar@microsoft.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:37:34 +02:00
Liran Alon
672ff6cff8 KVM: x86: Raise #GP when guest vCPU do not support PMU
Before this change, reading a VMware pseduo PMC will succeed even when
PMU is not supported by guest. This can easily be seen by running
kvm-unit-test vmware_backdoors with "-cpu host,-pmu" option.

Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:37:34 +02:00
WANG Chao
1811d979c7 x86/kvm: move kvm_load/put_guest_xcr0 into atomic context
guest xcr0 could leak into host when MCE happens in guest mode. Because
do_machine_check() could schedule out at a few places.

For example:

kvm_load_guest_xcr0
...
kvm_x86_ops->run(vcpu) {
  vmx_vcpu_run
    vmx_complete_atomic_exit
      kvm_machine_check
        do_machine_check
          do_memory_failure
            memory_failure
              lock_page

In this case, host_xcr0 is 0x2ff, guest vcpu xcr0 is 0xff. After schedule
out, host cpu has guest xcr0 loaded (0xff).

In __switch_to {
     switch_fpu_finish
       copy_kernel_to_fpregs
         XRSTORS

If any bit i in XSTATE_BV[i] == 1 and xcr0[i] == 0, XRSTORS will
generate #GP (In this case, bit 9). Then ex_handler_fprestore kicks in
and tries to reinitialize fpu by restoring init fpu state. Same story as
last #GP, except we get DOUBLE FAULT this time.

Cc: stable@vger.kernel.org
Signed-off-by: WANG Chao <chao.wang@ucloud.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:37:33 +02:00
Vitaly Kuznetsov
99c221796a KVM: x86: svm: make sure NMI is injected after nmi_singlestep
I noticed that apic test from kvm-unit-tests always hangs on my EPYC 7401P,
the hanging test nmi-after-sti is trying to deliver 30000 NMIs and tracing
shows that we're sometimes able to deliver a few but never all.

When we're trying to inject an NMI we may fail to do so immediately for
various reasons, however, we still need to inject it so enable_nmi_window()
arms nmi_singlestep mode. #DB occurs as expected, but we're not checking
for pending NMIs before entering the guest and unless there's a different
event to process, the NMI will never get delivered.

Make KVM_REQ_EVENT request on the vCPU from db_interception() to make sure
pending NMIs are checked and possibly injected.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:37:32 +02:00
Suthikulpanit, Suravee
e44e3eaccc svm/avic: Fix invalidate logical APIC id entry
Only clear the valid bit when invalidate logical APIC id entry.
The current logic clear the valid bit, but also set the rest of
the bits (including reserved bits) to 1.

Fixes: 98d90582be ('svm: Fix AVIC DFR and LDR handling')
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:37:32 +02:00
Suthikulpanit, Suravee
4a58038b9e Revert "svm: Fix AVIC incomplete IPI emulation"
This reverts commit bb218fbcfa.

As Oren Twaig pointed out the old discussion:

  https://patchwork.kernel.org/patch/8292231/

that the change coud potentially cause an extra IPI to be sent to
the destination vcpu because the AVIC hardware already set the IRR bit
before the incomplete IPI #VMEXIT with id=1 (target vcpu is not running).
Since writting to ICR and ICR2 will also set the IRR. If something triggers
the destination vcpu to get scheduled before the emulation finishes, then
this could result in an additional IPI.

Also, the issue mentioned in the commit bb218fbcfa was misdiagnosed.

Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Reported-by: Oren Twaig <oren@scalemp.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:37:31 +02:00
Ben Gardon
bc8a3d8925 kvm: mmu: Fix overflow on kvm mmu page limit calculation
KVM bases its memory usage limits on the total number of guest pages
across all memslots. However, those limits, and the calculations to
produce them, use 32 bit unsigned integers. This can result in overflow
if a VM has more guest pages that can be represented by a u32. As a
result of this overflow, KVM can use a low limit on the number of MMU
pages it will allocate. This makes KVM unable to map all of guest memory
at once, prompting spurious faults.

Tested: Ran all kvm-unit-tests on an Intel Haswell machine. This patch
	introduced no new failures.

Signed-off-by: Ben Gardon <bgardon@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:37:30 +02:00
Paolo Bonzini
2b27924bb1 KVM: nVMX: always use early vmcs check when EPT is disabled
The remaining failures of vmx.flat when EPT is disabled are caused by
incorrectly reflecting VMfails to the L1 hypervisor.  What happens is
that nested_vmx_restore_host_state corrupts the guest CR3, reloading it
with the host's shadow CR3 instead, because it blindly loads GUEST_CR3
from the vmcs01.

For simplicity let's just always use hardware VMCS checks when EPT is
disabled.  This way, nested_vmx_restore_host_state is not reached at
all (or at least shouldn't be reached).

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 15:37:12 +02:00
Paolo Bonzini
690908104e KVM: nVMX: allow tests to use bad virtual-APIC page address
As mentioned in the comment, there are some special cases where we can simply
clear the TPR shadow bit from the CPU-based execution controls in the vmcs02.
Handle them so that we can remove some XFAILs from vmx.flat.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-16 10:59:07 +02:00
Colin Ian King
614c70f35c bnx2x: fix spelling mistake "dicline" -> "decline"
There is a spelling mistake in a BNX2X_ERR message, fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 17:23:09 -07:00
David S. Miller
432bc23070 Merge branch 'hsr-next'
Murali Karicheri says:

====================
net: hsr: updates from internal tree

This series picks commit from our internal kernel tree.
Patch 1/3 fixes a file name issue introduced in my previous
series.

History:
  v2: fixed patch 3/3 by moving stats update to inside hsr_forward_skb()
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 17:22:02 -07:00
Murali Karicheri
ee2c46f353 net: hsr: add tx stats for master interface
Add tx stats to hsr interface. Without this
ifconfig for hsr interface doesn't show tx packet stats.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 17:22:02 -07:00
Murali Karicheri
3271273388 net: hsr: fix debugfs path to support multiple interfaces
Fix the path of hsr debugfs root directory to use the net device
name so that it can work with multiple interfaces. While at it,
also fix some typos.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 17:22:02 -07:00
Murali Karicheri
9c5f8a19b2 net: hsr: fix naming of file and functions
Fix the file name and functions to match with existing implementation.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 17:22:01 -07:00
Heiner Kallweit
dcdecdcfe1 net: phy: switch drivers to use dynamic feature detection
Recently genphy_read_abilities() has been added that dynamically detects
clause 22 PHY abilities. I *think* this detection should work with all
supported PHY's, at least for the ones with basic features sets, i.e.
PHY_BASIC_FEATURES and PHY_GBIT_FEATURES. So let's remove setting these
features explicitly and rely on phylib feature detection.

I don't have access to most of these PHY's, therefore I'd appreciate
regression testing.

v2:
- make the feature constant a comment so that readers know which
  features are supported by the respective PHY

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 17:19:54 -07:00
Linus Torvalds
618d919cae libnvdimm fixes v5.1-rc6
- Compatibility fix for nvdimm-security implementations with a default
   zero-key.
 
 - Miscellaneous small fixes for out-of-bound accesses, cleanup after
   initialization failures, and missing debug messages.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJctQ6UAAoJEB7SkWpmfYgCV4AQAL18Kv0VJjeWzMirH+Q5B9Z2
 WdHzKBvOWUWx8HeUhQoTtP+QnWriXI37EmhD7S34mVJZdYXxIQJBESPpFF1IpjUi
 jMibrdgrPAzyXq+x6FS4gHwi8uwUwwHOYfBEPV+7UvA8Zi8AU+g1Sgl+FftY34Em
 ACWc8/BtMtnwr2xFZT/4brzDCyvVHTK7f9HB280Je7DU6ghjEAaRFqqFXgAAbQ+l
 HAOQz4GVweT/JUmu4cvABGwllTN3np4wR/ePKYdlZTVWpN02InECukiSFtgCWN4S
 +bKm5EMTGDprLtNDz3m1SDWPrGSkWkoEtmVZljAXepJzAcZ1qbEw4xe++Kqrgr0z
 YOawM0lMciTp78uiH797thYnS3fo5+Ccr0WE4lhrSC3kAZE+EfGvbyhv3T+Pz3M+
 Z3hEpz+gGNMBwby0AmCLJHfwyujztNBE5hnXcsL5dC6BXKHZGZSgsUllRcZJ+xJ1
 H6b5sdxmNvn7Ja0svhKJzfpP4j8v25v+KEns9VlbIejJAp62cQCmA1dHlGaC5pDc
 0g9mtPbYsEZjKQ5/5grHgtre+JYmYDAIKwS4UK11ZyChqR+tmZ2Cp7XgmVab9a7T
 QpFLczMV/Q8NSWIFYSHvXjj1/PWtUxf81lEtA+Y3+mDznn30QctPwufPcdxeTNJs
 KSyFKhhKIOnasEplrLu4
 =zISv
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-fixes-5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm fixes from Dan Williams:
 "I debated holding this back for the v5.2 merge window due to the size
  of the "zero-key" changes, but affected users would benefit from
  having the fixes sooner. It did not make sense to change the zero-key
  semantic in isolation for the "secure-erase" command, but instead
  include it for all security commands.

  The short background on the need for these changes is that some NVDIMM
  platforms enable security with a default zero-key rather than let the
  OS specify the initial key. This makes the security enabling that
  landed in v5.0 unusable for some users.

  Summary:

   - Compatibility fix for nvdimm-security implementations with a
     default zero-key.

   - Miscellaneous small fixes for out-of-bound accesses, cleanup after
     initialization failures, and missing debug messages"

* tag 'libnvdimm-fixes-5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  tools/testing/nvdimm: Retain security state after overwrite
  libnvdimm/pmem: fix a possible OOB access when read and write pmem
  libnvdimm/security, acpi/nfit: unify zero-key for all security commands
  libnvdimm/security: provide fix for secure-erase to use zero-key
  libnvdimm/btt: Fix a kmemdup failure check
  libnvdimm/namespace: Fix a potential NULL pointer dereference
  acpi/nfit: Always dump _DSM output payload
2019-04-15 16:48:51 -07:00
David S. Miller
b6ed55cb72 Merge branch 'nfp-Flower-flow-merging'
Simon Horman says:

====================
nfp: Flower flow merging

John Hurley says,

These patches deal with 'implicit recirculation' on the NFP. This is a
firmware feature whereby a packet egresses to an 'internal' port meaning
that it will recirculate back to the header extract phase with the
'internal' port now marked as its ingress port. This internal port can
then be matched on by another rule. This process simulates how OvS
datapath outputs to an internal port. The FW traces the packet's
recirculation route and sends a 'merge hint' to the driver telling it
which flows it matched against. The driver can then decide if these flows
can be merged to a single rule and offloaded.

The patches deal with the following issues:

- assigning/freeing IDs to/from each of these new internal ports
- offloading rules that match on internal ports
- offloading neighbour table entries whose egress port is internal
- handling fallback traffic with an internal port as ingress
- using merge hints to create 'faster path' flows and tracking stats etc.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 15:45:36 -07:00
John Hurley
8af56f40e5 nfp: flower: offload merge flows
A merge flow is formed from 2 sub flows. The match fields of the merge are
the same as the first sub flow that has formed it, with the actions being
a combination of the first and second sub flow. Therefore, a merge flow
should replace sub flow 1 when offloaded.

Offload valid merge flows by using a new 'flow mod' message type to
replace an existing offloaded rule. Track the deletion of sub flows that
are linked to a merge flow and revert offloaded merge rules if required.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 15:45:36 -07:00
John Hurley
aa6ce2ea0c nfp: flower: support stats update for merge flows
With the merging of 2 sub flows, a new 'merge' flow will be created and
written to FW. The TC layer is unaware that the merge flow exists and will
request stats from the sub flows. Conversely, the FW treats a merge rule
the same as any other rule and sends stats updates to the NFP driver.

Add links between merge flows and their sub flows. Use these links to pass
merge flow stats updates from FW to the underlying sub flows, ensuring TC
stats requests are handled correctly. The updating of sub flow stats is
done on (the less time critcal) TC stats requests rather than on FW stats
update.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 15:45:36 -07:00
John Hurley
1c6952ca58 nfp: flower: generate merge flow rule
When combining 2 sub_flows to a single 'merge flow' (assuming the merge is
valid), the merge flow should contain the same match fields as sub_flow 1
with actions derived from a combination of sub_flows 1 and 2. This action
list should have all actions from sub_flow 1 with the exception of the
output action that triggered the 'implicit recirculation' by sending to
an internal port, followed by all actions of sub_flow 2. Any pre-actions
in either sub_flow should feature at the start of the action list.

Add code to generate a new merge flow and populate the match and actions
fields based on the sub_flows. The offloading of the flow is left to
future patches.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 15:45:36 -07:00
John Hurley
107e37bb4f nfp: flower: validate merge hint flows
Two flows can be merged if the second flow (after recirculation) matches
on bits that are either matched on or explicitly set by the first flow.
This means that if a packet hits flow 1 and recirculates then it is
guaranteed to hit flow 2.

Add a 'can_merge' function that determines if 2 sub_flows in a merge hint
can be validly merged to a single flow.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 15:45:36 -07:00
John Hurley
dbc2d68edc nfp: flower: handle merge hint messages
If a merge hint is received containing 2 flows that are matched via an
implicit recirculation (sending to and matching on an internal port), fw
reports that the flows (called sub_flows) may be able to be combined to a
single flow.

Add infastructure to accept and process merge hint messages. The actual
merging of the flows is left as a stub call.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 15:45:36 -07:00
John Hurley
cf4172d575 nfp: flower: get flows by host context
Each flow is given a context ID that the fw uses (along with its cookie)
to identity the flow. The flows stats are updated by the fw via this ID
which is a reference to a pre-allocated array entry.

In preparation for flow merge code, enable the nfp_fl_payload structure to
be accessed via this stats context ID. Rather than increasing the memory
requirements of the pre-allocated array, add a new rhashtable to associate
each active stats context ID with its rule payload.

While adding new code to the compile metadata functions, slightly
restructure the existing function to allow for cleaner, easier to read
error handling.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 15:45:36 -07:00
John Hurley
45756dfeda nfp: flower: allow tunnels to output to internal port
The neighbour table in the FW only accepts next hop entries if the egress
port is an nfp repr. Modify this to allow the next hop to be an internal
port. This means that if a packet is to egress to that port, it will
recirculate back into the system with the internal port becoming its
ingress port.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 15:45:36 -07:00
John Hurley
f41dd0595d nfp: flower: support fallback packets from internal ports
FW may receive a packet with its ingress port marked as an internal port.
If a rule does not exist to match on this port, the packet will be sent to
the NFP driver. Modify the flower app to detect packets from such internal
ports and convert the ingress port to the correct kernel space netdev.

At this point, it is assumed that fallback packets from internal ports are
to be sent out said port. Therefore, set the redir_egress bool to true on
detection of these ports.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 15:45:36 -07:00
John Hurley
27f54b5825 nfp: allow fallback packets from non-reprs
Currently, it is assumed that fallback packets will be from reprs. Modify
this to allow an app to receive non-repr ports from the fallback channel -
e.g. from an internal port. If such a packet is received, do not update
repr stats.

Change the naming function calls so as not to imply it will always be a
repr netdev returned. Add the option to set a bool value to redirect a
fallback packet out the returned port rather than RXing it. Setting of
this bool in subsequent patches allows the handling of packets falling
back when they are due to egress an internal port.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 15:45:36 -07:00
John Hurley
4d12ba4278 nfp: flower: allow offloading of matches on 'internal' ports
Recent FW modifications allow the offloading of non repr ports. These
ports exist internally on the NFP. So if a rule outputs to an 'internal'
port, then the packet will recirculate back into the system but will now
have this internal port as it's incoming port. These ports are indicated
by a specific type field combined with an 8 bit port id.

Add private app data to assign additional port ids for use in offloads.
Provide functions to lookup or create new ids when a rule attempts to
match on an internal netdev - the only internal netdevs currently
supported are of type openvswitch. Have a netdev notifier to release
port ids on netdev unregister.

OvS offloads rules that match on internal ports as TC egress filters.
Ensure that such rules are accepted by the driver.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 15:45:36 -07:00
John Hurley
2f2622f59c nfp: flower: turn on recirc and merge hint support in firmware
Write to a FW symbol to indicate that the driver supports flow merging. If
this symbol does not exist then flow merging and recirculation is not
supported on the FW. If support is available, add a stub to deal with FW
to kernel merge hint messages.

Full flow merging requires the firmware to support of flow mods. If it
does not, then do not attempt to 'turn on' flow merging.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 15:45:36 -07:00
Linus Torvalds
5512320c9f fsdax fix 5.1-rc6
- Avoid a crash scenario with architectures like powerpc that require
   'pgtable_deposit' for the zero page.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJctNryAAoJEB7SkWpmfYgCzcMP/37LJbb4SYNwnDIW4BF33ril
 ZwtPeJJVTR56Ojo+Dy1v9084zeyhUHHewz0Oqx15dm6k/N5SS19yKNFKQDOK+4OC
 zbaWD5UOtllU3RQ2ORUOUoqNGF278+h4VVVQMntVaHhdt5f120tgHXxmKoB5Z5zH
 Gcy0vZNHoJ5lVYfKjKYG0b0/dWWOD1ZEjTkZjTa4DjhVSQcFauN8DxJ4hSyumYqs
 HDnZZt44RTTUS5W3BTlhuaSEcZaDOznmyj1HmKXNg3ghxguKACho4xhA7xFKqT8O
 03WZxDBFnOXZb3yfKpHB6RclkJgrtmD5U5GStzl5SobLPb2E/TPQzCRhZ/kcFPZ8
 RE2JkgdGl8gqCDRqRsC/tbF3dETO66vxUyf5utNv0ttBk7qLMwTGTKm3VQz7Xvu2
 SLkwv6Rlw4UT6ML8nd2kNhf8xRkaLl6j1B6zWDy7wEoFPXWW+My0PPpsJZcbTeza
 eib2ood7AlPHRU0/mW2ZrGHGabbS6kNGeQlod9U5sikkE7ZA/LwzyFl4b/uCqYNP
 NKGQdz0iHVcq8lFPXEmZ7vP2krd6uUWIv9KaiwmjBBMf9w3ZAzS85c7HFAZD0zgC
 tTHm6stMhpdS3ndyIxMBf0sL7AB/Q9BH7jJwDK/P8QObovezW2zZ4CPx/gQYJ2XU
 LTeCJmQh3xcCpI3f/eka
 =ijtJ
 -----END PGP SIGNATURE-----

Merge tag 'fsdax-fix-5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull fsdax fix from Dan Williams:
 "A single filesystem-dax fix. It has been lingering in -next for a long
  while and there are no other fsdax fixes on the horizon:

   - Avoid a crash scenario with architectures like powerpc that require
     'pgtable_deposit' for the zero page"

* tag 'fsdax-fix-5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  fs/dax: Deposit pagetable even when installing zero page
2019-04-15 15:10:20 -07:00
David S. Miller
47a1a225ab Merge branch 'hns3-next'
Huazhong Tan says:

====================
net: hns3: fixes sparse: warning and type error

This patchset fixes a sparse warning and a overflow problem.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 13:39:19 -07:00
Yunsheng Lin
2566f10676 net: hns3: fix for vport->bw_limit overflow problem
When setting vport->bw_limit to hdev->tm_info.pg_info[0].bw_limit
in hclge_tm_vport_tc_info_update, vport->bw_limit can be as big as
HCLGE_ETHER_MAX_RATE (100000), which can not fit into u16 (65535).

So this patch fixes it by using u32 for vport->bw_limit.

Fixes: 848440544b ("net: hns3: Add support of TX Scheduler & Shaper to HNS3 driver")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 13:39:19 -07:00
Jian Shen
8a9a654b5b net: hns3: fix sparse: warning when calling hclge_set_vlan_filter_hw()
The input parameter "proto" in function hclge_set_vlan_filter_hw()
is asked to be __be16, but got u16 when calling it in function
hclge_update_port_base_vlan_cfg().

This patch fixes it by converting it with htons().

Reported-by: kbuild test robot <lkp@intel.com>
Fixes: 21e043cd81 ("net: hns3: fix set port based VLAN for PF")
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 13:39:19 -07:00
David S. Miller
c7cf89b5dd Merge branch 'sctp-fully-support-memory-accounting'
Xin Long says:

====================
sctp: fully support memory accounting

sctp memory accounting is added in this patchset by using
these kernel APIs on send side:

  - sk_mem_charge()
  - sk_mem_uncharge()
  - sk_wmem_schedule()
  - sk_under_memory_pressure()
  - sk_mem_reclaim()

and these on receive side:

  - sk_mem_charge()
  - sk_mem_uncharge()
  - sk_rmem_schedule()
  - sk_under_memory_pressure()
  - sk_mem_reclaim()

With sctp memory accounting, we can limit the memory allocation by
either sysctl:

  # sysctl -w net.sctp.sctp_mem="10 20 50"

or cgroup:

  # echo $((8<<14)) > \
    /sys/fs/cgroup/memory/sctp_mem/memory.kmem.tcp.limit_in_bytes

When the socket is under memory pressure, the send side will block
and wait, while the receive side will renege or drop.

v1->v2:
  - add the missing Reported/Tested/Acked/-bys.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 13:36:51 -07:00
Xin Long
9dde27de3e sctp: implement memory accounting on rx path
sk_forward_alloc's updating is also done on rx path, but to be consistent
we change to use sk_mem_charge() in sctp_skb_set_owner_r().

In sctp_eat_data(), it's not enough to check sctp_memory_pressure only,
which doesn't work for mem_cgroup_sockets_enabled, so we change to use
sk_under_memory_pressure().

When it's under memory pressure, sk_mem_reclaim() and sk_rmem_schedule()
should be called on both RENEGE or CHUNK DELIVERY path exit the memory
pressure status as soon as possible.

Note that sk_rmem_schedule() is using datalen to make things easy there.

Reported-by: Matteo Croce <mcroce@redhat.com>
Tested-by: Matteo Croce <mcroce@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 13:36:51 -07:00
Xin Long
1033990ac5 sctp: implement memory accounting on tx path
Now when sending packets, sk_mem_charge() and sk_mem_uncharge() have been
used to set sk_forward_alloc. We just need to call sk_wmem_schedule() to
check if the allocated should be raised, and call sk_mem_reclaim() to
check if the allocated should be reduced when it's under memory pressure.

If sk_wmem_schedule() returns false, which means no memory is allowed to
allocate, it will block and wait for memory to become available.

Note different from tcp, sctp wait_for_buf happens before allocating any
skb, so memory accounting check is done with the whole msg_len before it
too.

Reported-by: Matteo Croce <mcroce@redhat.com>
Tested-by: Matteo Croce <mcroce@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 13:36:51 -07:00
Jonathan Lemon
9c69a13205 route: Avoid crash from dereferencing NULL rt->from
When __ip6_rt_update_pmtu() is called, rt->from is RCU dereferenced, but is
never checked for null - rt6_flush_exceptions() may have removed the entry.

[ 1913.989004] RIP: 0010:ip6_rt_cache_alloc+0x13/0x170
[ 1914.209410] Call Trace:
[ 1914.214798]  <IRQ>
[ 1914.219226]  __ip6_rt_update_pmtu+0xb0/0x190
[ 1914.228649]  ip6_tnl_xmit+0x2c2/0x970 [ip6_tunnel]
[ 1914.239223]  ? ip6_tnl_parse_tlv_enc_lim+0x32/0x1a0 [ip6_tunnel]
[ 1914.252489]  ? __gre6_xmit+0x148/0x530 [ip6_gre]
[ 1914.262678]  ip6gre_tunnel_xmit+0x17e/0x3c7 [ip6_gre]
[ 1914.273831]  dev_hard_start_xmit+0x8d/0x1f0
[ 1914.283061]  sch_direct_xmit+0xfa/0x230
[ 1914.291521]  __qdisc_run+0x154/0x4b0
[ 1914.299407]  net_tx_action+0x10e/0x1f0
[ 1914.307678]  __do_softirq+0xca/0x297
[ 1914.315567]  irq_exit+0x96/0xa0
[ 1914.322494]  smp_apic_timer_interrupt+0x68/0x130
[ 1914.332683]  apic_timer_interrupt+0xf/0x20
[ 1914.341721]  </IRQ>

Fixes: a68886a691 ("net/ipv6: Make from in rt6_info rcu protected")
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 13:31:59 -07:00
David S. Miller
93144b0ecd Merge branch 'mlxsw-Add-neighbour-offload-indication'
Ido Schimmel says:

====================
mlxsw: Add neighbour offload indication

Neighbour entries are programmed to the device's table so that the
correct destination MAC will be specified in a packet after it was
routed.

Despite being programmed to the device and unlike routes and FDB
entries, neighbour entries are currently not marked as offloaded. This
patchset changes that.

Patch #1 is a preparatory patch to make sure we only mark a neighbour as
offloaded in case it was successfully programmed to the device.

Patch #2 sets the offload indication on neighbours.

Patch #3 adds a test to verify above mentioned functionality.

Patched iproute2 version that prints the offload indication is available
here [1].

[1] https://github.com/idosch/iproute2/tree/idosch-next
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 13:29:21 -07:00
Ido Schimmel
3321cff3c5 selftests: mlxsw: Test neighbour offload indication
Test that neighbour entries are marked as offloaded.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 13:29:21 -07:00
Ido Schimmel
caf345a18b mlxsw: spectrum_router: Add neighbour offload indication
In a similar fashion to routes and FDB entries, the neighbour table is
reflected to the device.

Set an offload indication on the neighbour in case it was programmed to
the device.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 13:29:20 -07:00
Ido Schimmel
a85e84e030 mlxsw: spectrum_router: Propagate neighbour update errors
Next patch will add offload indication to neighbours, but the indication
should only be altered in case the neighbour was successfully added to /
deleted from the device.

Propagate neighbour update errors, so that they could be taken into
account by the next patch.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 13:29:20 -07:00
Lukas Bulwahn
789445b960 MAINTAINERS: normalize Woojung Huh's email address
MAINTAINERS contains a lower-case and upper-case variant of
Woojung Huh' s email address.

Only keep the lower-case variant in MAINTAINERS.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Acked-by: Woojung Huh <woojung.huh@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 13:25:03 -07:00
Sabrina Dubroca
92480b3977 bonding: fix event handling for stacked bonds
When a bond is enslaved to another bond, bond_netdev_event() only
handles the event as if the bond is a master, and skips treating the
bond as a slave.

This leads to a refcount leak on the slave, since we don't remove the
adjacency to its master and the master holds a reference on the slave.

Reproducer:
  ip link add bondL type bond
  ip link add bondU type bond
  ip link set bondL master bondU
  ip link del bondL

No "Fixes:" tag, this code is older than git history.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 13:22:09 -07:00