[ Upstream commit 9c39be40c0155c43343f53e3a439290c0fec5542 ]
syzbot reported memory leak in zr364xx_probe()[1].
The problem was in invalid error handling order.
All error conditions rigth after v4l2_ctrl_handler_init()
must call v4l2_ctrl_handler_free().
Reported-by: syzbot+efe9aefc31ae1e6f7675@syzkaller.appspotmail.com
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ea354b6ddd6f09be29424f41fa75a3e637fea234 ]
Syzbot discovered that the probe error handling doesn't clean up the
resources allocated in zr364xx_board_init(). There are several
related bugs in this code so I have re-written the error handling.
1) Introduce a new function zr364xx_board_uninit() which cleans up
the resources in zr364xx_board_init().
2) In zr364xx_board_init() if the call to zr364xx_start_readpipe()
fails then release the "cam->buffer.frame[i].lpvbits" memory
before returning. This way every function either allocates
everything successfully or it cleans up after itself.
3) Re-write the probe function so that each failure path goto frees
the most recent allocation. That way we don't free anything
before it has been allocated and we can also verify that
everything is freed.
4) Originally, in the probe function the "cam->v4l2_dev.release"
pointer was set to "zr364xx_release" near the start but I moved
that assignment to the end, after everything had succeeded. The
release function was never actually called during the probe cleanup
process, but with this change I wanted to make it clear that we
don't want to call zr364xx_release() until everything is
allocated successfully.
Next I re-wrote the zr364xx_release() function. Ideally this would
have been a simple matter of copy and pasting the cleanup code from
probe and adding an additional call to video_unregister_device(). But
there are a couple quirks to note.
1) The probe function does not call videobuf_mmap_free() and I don't
know where the videobuf_mmap is allocated. I left the code as-is to
avoid introducing a bug in code I don't understand.
2) The zr364xx_board_uninit() has a call to zr364xx_stop_readpipe()
which is a change from the original behavior with regards to
unloading the driver. Calling zr364xx_stop_readpipe() on a stopped
pipe is not a problem so this is safe and is potentially a bugfix.
Reported-by: syzbot+b4d54814b339b5c6bbd4@syzkaller.appspotmail.com
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit af0321a5be3e5647441eb6b79355beaa592df97a ]
zr364xx_start_readpipe() can fail but callers do not care about that.
This can result in various negative consequences. The patch adds missed
error handling.
Found by Linux Driver Verification project (linuxtesting.org).
Signed-off-by: Evgeny Novikov <novikov@ispras.ru>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 2394e628738933aa014093d93093030f6232946d upstream.
Erasing an AMD linear flash card (AM29F016D) crashes after the first
sector has been erased. Likewise, writing to it crashes after two bytes
have been written. The reason is a missing check for a null pointer -
the cmdset_priv field is not set for this type of card.
Fixes: 4844ef8030 ("mtd: cfi_cmdset_0002: Add support for polling status register")
Signed-off-by: Andreas Persson <andreasp56@outlook.com>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Link: https://lore.kernel.org/linux-mtd/DB6P189MB05830B3530B8087476C5CFE4C1159@DB6P189MB0583.EURP189.PROD.OUTLOOK.COM
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ca2848022c12789685d3fab3227df02b863f9696 upstream.
Do not delete a key cache entry that is still being referenced by
pending frames in TXQs. This avoids reuse of the key cache entry while a
frame might still be transmitted using it.
To avoid having to do any additional operations during the main TX path
operations, track pending key cache entries in a new bitmap and check
whether any pending entries can be deleted before every new key
add/remove operation. Also clear any remaining entries when stopping the
interface.
Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Link: https://lore.kernel.org/r/20201214172118.18100-6-jouni@codeaurora.org
Cc: Pali Rohár <pali@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 144cd24dbc36650a51f7fe3bf1424a1432f1f480 upstream.
tkip_keymap can be used internally to avoid the reference to key->cipher
and with this, only the key index value itself is needed. This allows
ath_key_delete() call to be postponed to be handled after the upper
layer STA and key entry have already been removed. This is needed to
make ath9k key cache management safer.
Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Link: https://lore.kernel.org/r/20201214172118.18100-5-jouni@codeaurora.org
Cc: Pali Rohár <pali@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d2d3e36498dd8e0c83ea99861fac5cf9e8671226 upstream.
ath9k is going to use this for safer management of key cache entries.
Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Link: https://lore.kernel.org/r/20201214172118.18100-4-jouni@codeaurora.org
Cc: Pali Rohár <pali@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 73488cb2fa3bb1ef9f6cf0d757f76958bd4deaca upstream.
Now that ath/key.c may not be explicitly clearing keys from the key
cache, clear all key cache entries when disabling hardware to make sure
no keys are left behind beyond this point.
Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Link: https://lore.kernel.org/r/20201214172118.18100-3-jouni@codeaurora.org
Cc: Pali Rohár <pali@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 56c5485c9e444c2e85e11694b6c44f1338fc20fd upstream.
It is possible for there to be pending frames in TXQs with a reference
to the key cache entry that is being deleted. If such a key cache entry
is cleared, those pending frame in TXQ might get transmitted without
proper encryption. It is safer to leave the previously used key into the
key cache in such cases. Instead, only clear the MAC address to prevent
RX processing from using this key cache entry.
This is needed in particularly in AP mode where the TXQs cannot be
flushed on station disconnection. This change alone may not be able to
address all cases where the key cache entry might get reused for other
purposes immediately (the key cache entry should be released for reuse
only once the TXQs do not have any remaining references to them), but
this makes it less likely to get unprotected frames and the more
complete changes may end up being significantly more complex.
Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Link: https://lore.kernel.org/r/20201214172118.18100-2-jouni@codeaurora.org
Cc: Pali Rohár <pali@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9130c2d30c17846287b803a9803106318cbe5266 upstream.
On the CPU port, we can support both tagged and untagged VLANs at the
same time by doing any necessary untagging in software rather than
hardware. To enable that, keep the CPU port's Remove Tag flag cleared
and set the dsa_switch::untag_bridge_pvid flag.
Fixes: e66f840c08 ("net: dsa: ksz: Add Microchip KSZ8795 DSA driver")
Signed-off-by: Ben Hutchings <ben.hutchings@mind.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backport to 5.10: adjust context]
Signed-off-by: Ben Hutchings <ben.hutchings@mind.be>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit af01754f9e3c553a2ee63b4693c79a3956e230ab upstream.
When a VLAN is deleted from a port, the flags in struct
switchdev_obj_port_vlan are always 0. ksz8_port_vlan_del() copies the
BRIDGE_VLAN_INFO_UNTAGGED flag to the port's Tag Removal flag, and
therefore always clears it.
In case there are multiple VLANs configured as untagged on this port -
which seems useless, but is allowed - deleting one of them changes the
remaining VLANs to be tagged.
It's only ever necessary to change this flag when a VLAN is added to
the port, so leave it unchanged in ksz8_port_vlan_del().
Fixes: e66f840c08 ("net: dsa: ksz: Add Microchip KSZ8795 DSA driver")
Signed-off-by: Ben Hutchings <ben.hutchings@mind.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backport to 5.10: adjust context]
Signed-off-by: Ben Hutchings <ben.hutchings@mind.be>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8f4f58f88fe0d9bd591f21f53de7dbd42baeb3fa upstream.
The switches supported by ksz8795 only have a per-port flag for Tag
Removal. This means it is not possible to support both tagged and
untagged VLANs on the same port. Reject attempts to add a VLAN that
requires the flag to be changed, unless there are no VLANs currently
configured.
VID 0 is excluded from this check since it is untagged regardless of
the state of the flag.
On the CPU port we could support tagged and untagged VLANs at the same
time. This will be enabled by a later patch.
Fixes: e66f840c08 ("net: dsa: ksz: Add Microchip KSZ8795 DSA driver")
Signed-off-by: Ben Hutchings <ben.hutchings@mind.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backport to 5.10:
- This configuration has to be detected and rejected in the
port_vlan_prepare operation
- ksz8795_port_vlan_add() has to check again to decide whether to
change the Tag Removal flag, so put the common condition in a
separate function
- Handle VID ranges]
Signed-off-by: Ben Hutchings <ben.hutchings@mind.be>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ef3b02a1d79b691f9a354c4903cf1e6917e315f9 upstream.
ksz8795 has never actually enabled PVID tag insertion, and it also
programmed the PVID incorrectly. To fix this:
* Allow tag insertion to be controlled per ingress port. On most
chips, set bit 2 in Global Control 19. On KSZ88x3 this control
flag doesn't exist.
* When adding a PVID:
- Set the appropriate register bits to enable tag insertion on
egress at every other port if this was the packet's ingress port.
- Mask *out* the VID from the default tag, before or-ing in the new
PVID.
* When removing a PVID:
- Clear the same control bits to disable tag insertion.
- Don't update the default tag. This wasn't doing anything useful.
Fixes: e66f840c08 ("net: dsa: ksz: Add Microchip KSZ8795 DSA driver")
Signed-off-by: Ben Hutchings <ben.hutchings@mind.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backport to 5.10:
- Drop the KSZ88x3 cases as those chips are not supported here
- Handle VID ranges in ksz8795_port_vlan_del()]
Signed-off-by: Ben Hutchings <ben.hutchings@mind.be>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The ksz8795 and ksz9477 drivers differ in the way they count ports.
For ksz8795, ksz_device::port_cnt does not include the host port
whereas for ksz9477 it does. This inconsistency was fixed in Linux
5.11 by a series of changes, but remains in 5.10-stable.
When probing, the common code treats a port device node with an
address >= dev->port_cnt as a fatal error. As a minimal fix, change
it to compare again dev->mib_port_cnt. This is the length of the
dev->ports array that the port number will be used to index, and
always includes the host port.
Cc: Woojung Huh <woojung.huh@microchip.com>
Cc: Microchip Linux Driver Support <UNGLinuxDriver@microchip.com>
Cc: Michael Grzeschik <m.grzeschik@pengutronix.de>
Cc: Marek Vasut <marex@denx.de>
Signed-off-by: Ben Hutchings <ben.hutchings@mind.be>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c7dfa4009965a9b2d7b329ee970eb8da0d32f0bc upstream.
If L1 disables VMLOAD/VMSAVE intercepts, and doesn't enable
Virtual VMLOAD/VMSAVE (currently not supported for the nested hypervisor),
then VMLOAD/VMSAVE must operate on the L1 physical memory, which is only
possible by making L0 intercept these instructions.
Failure to do so allowed the nested guest to run VMLOAD/VMSAVE unintercepted,
and thus read/write portions of the host physical memory.
Fixes: 89c8a4984f ("KVM: SVM: Enable Virtual VMLOAD VMSAVE feature")
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0f923e07124df069ba68d8bb12324398f4b6b709 upstream.
* Invert the mask of bits that we pick from L2 in
nested_vmcb02_prepare_control
* Invert and explicitly use VIRQ related bits bitmask in svm_clear_vintr
This fixes a security issue that allowed a malicious L1 to run L2 with
AVIC enabled, which allowed the L2 to exploit the uninitialized and enabled
AVIC to read/write the host physical memory at some offsets.
Fixes: 3d6368ef58 ("KVM: SVM: Add VMRUN handler")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 848378812e40152abe9b9baf58ce2004f76fb988 upstream.
A recent change in LLVM causes module_{c,d}tor sections to appear when
CONFIG_K{A,C}SAN are enabled, which results in orphan section warnings
because these are not handled anywhere:
ld.lld: warning: arch/x86/pci/built-in.a(legacy.o):(.text.asan.module_ctor) is being placed in '.text.asan.module_ctor'
ld.lld: warning: arch/x86/pci/built-in.a(legacy.o):(.text.asan.module_dtor) is being placed in '.text.asan.module_dtor'
ld.lld: warning: arch/x86/pci/built-in.a(legacy.o):(.text.tsan.module_ctor) is being placed in '.text.tsan.module_ctor'
Fangrui explains: "the function asan.module_ctor has the SHF_GNU_RETAIN
flag, so it is in a separate section even with -fno-function-sections
(default)".
Place them in the TEXT_TEXT section so that these technologies continue
to work with the newer compiler versions. All of the KASAN and KCSAN
KUnit tests continue to pass after this change.
Cc: stable@vger.kernel.org
Link: https://github.com/ClangBuiltLinux/linux/issues/1432
Link: 7b78956224
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Fangrui Song <maskray@google.com>
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210731023107.1932981-1-nathan@kernel.org
[nc: Resolve conflict due to lack of cf68fffb66d60]
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 52dfd86aa568e433b24357bb5fc725560f1e22d8 upstream.
Opening a new file is done in 2 steps on regular filesystems:
1. Call the create inode-op on the parent-dir to create an inode
to hold the meta-data related to the file.
2. Call the open file-op to get a handle for the file.
vboxsf however does not really use disk-backed inodes because it
is based on passing through file-related system-calls through to
the hypervisor. So both steps translate to an open(2) call being
passed through to the hypervisor. With the handle returned by
the first call immediately being closed again.
Making 2 open calls for a single open(..., O_CREATE, ...) calls
has 2 problems:
a) It is not really efficient.
b) It actually breaks some apps.
An example of b) is doing a git clone inside a vboxsf mount.
When git clone tries to create a tempfile to store the pak
files which is downloading the following happens:
1. vboxsf_dir_mkfile() gets called with a mode of 0444 and succeeds.
2. vboxsf_file_open() gets called with file->f_flags containing
O_RDWR. When the host is a Linux machine this fails because doing
a open(..., O_RDWR) on a file which exists and has mode 0444 results
in an -EPERM error.
Other network-filesystems and fuse avoid the problem of needing to
pass 2 open() calls to the other side by using the atomic_open
directory-inode op.
This commit fixes git clone not working inside a vboxsf mount,
by adding support for the atomic_open directory-inode op.
As an added bonus this should also make opening new files faster.
The atomic_open implementation is modelled after the atomic_open
implementations from the 9p and fuse code.
Fixes: 0fd1695766 ("fs: Add VirtualBox guest shared folder (vboxsf) support")
Reported-by: Ludovic Pouzenc <bugreports@pouzenc.fr>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 02f840f90764f22f5c898901849bdbf0cee752ba upstream.
Factor out the code to create / release a struct vboxsf_handle into
2 new helper functions.
This is a preparation patch for adding atomic_open support.
Fixes: 0fd1695766 ("fs: Add VirtualBox guest shared folder (vboxsf) support")
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 18712c13709d2de9516c5d3414f707c4f0a9c190 upstream.
Use vmx_need_pf_intercept() when determining if L0 wants to handle a #PF
in L2 or if the VM-Exit should be forwarded to L1. The current logic fails
to account for the case where #PF is intercepted to handle
guest.MAXPHYADDR < host.MAXPHYADDR and ends up reflecting all #PFs into
L1. At best, L1 will complain and inject the #PF back into L2. At
worst, L1 will eat the unexpected fault and cause L2 to hang on infinite
page faults.
Note, while the bug was technically introduced by the commit that added
support for the MAXPHYADDR madness, the shame is all on commit
a0c134347b ("KVM: VMX: introduce vmx_need_pf_intercept").
Fixes: 1dbf5d68af ("KVM: VMX: Add guest physical address check in EPT violation and misconfig")
Cc: stable@vger.kernel.org
Cc: Peter Shier <pshier@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812045615.3167686-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7b9cae027ba3aaac295ae23a62f47876ed97da73 upstream.
Use the secondary_exec_controls_get() accessor in vmx_has_waitpkg() to
effectively get the controls for the current VMCS, as opposed to using
vmx->secondary_exec_controls, which is the cached value of KVM's desired
controls for vmcs01 and truly not reflective of any particular VMCS.
While the waitpkg control is not dynamic, i.e. vmcs01 will always hold
the same waitpkg configuration as vmx->secondary_exec_controls, the same
does not hold true for vmcs02 if the L1 VMM hides the feature from L2.
If L1 hides the feature _and_ does not intercept MSR_IA32_UMWAIT_CONTROL,
L2 could incorrectly read/write L1's virtual MSR instead of taking a #GP.
Fixes: 6e3ba4abce ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210810171952.2758100-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c32ac11da3f83bb42b986702a9b92f0a14ed4182 upstream.
On arm64, the stub only moves the kernel image around in memory if
needed, which is typically only for KASLR, given that relocatable
kernels (which is the default) can run from any 64k aligned address,
which is also the minimum alignment communicated to EFI via the PE/COFF
header.
Unfortunately, some loaders appear to ignore this header, and load the
kernel at some arbitrary offset in memory. We can deal with this, but
let's check for this condition anyway, so non-compliant code can be
spotted and fixed.
Cc: <stable@vger.kernel.org> # v5.10+
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 77e89afc25f30abd56e76a809ee2884d7c1b63ce upstream.
Multi-MSI uses a single MSI descriptor and there is a single mask register
when the device supports per vector masking. To avoid reading back the mask
register the value is cached in the MSI descriptor and updates are done by
clearing and setting bits in the cache and writing it to the device.
But nothing protects msi_desc::masked and the mask register from being
modified concurrently on two different CPUs for two different Linux
interrupts which belong to the same multi-MSI descriptor.
Add a lock to struct device and protect any operation on the mask and the
mask register with it.
This makes the update of msi_desc::masked unconditional, but there is no
place which requires a modification of the hardware register without
updating the masked cache.
msi_mask_irq() is now an empty wrapper which will be cleaned up in follow
up changes.
The problem goes way back to the initial support of multi-MSI, but picking
the commit which introduced the mask cache is a valid cut off point
(2.6.30).
Fixes: f2440d9acb ("PCI MSI: Refactor interrupt masking code")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210729222542.726833414@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d28d4ad2a1aef27458b3383725bb179beb8d015c upstream.
No point in using the raw write function from shutdown. Preparatory change
to introduce proper serialization for the msi_desc::masked cache.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210729222542.674391354@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 689e6b5351573c38ccf92a0dd8b3e2c2241e4aff upstream.
The comments about preserving the cached state in pci_msi[x]_shutdown() are
misleading as the MSI descriptors are freed right after those functions
return. So there is nothing to restore. Preparatory change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210729222542.621609423@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 361fd37397f77578735907341579397d5bed0a2d upstream.
msi_mask_irq() takes a mask and a flags argument. The mask argument is used
to mask out bits from the cached mask and the flags argument to set bits.
Some places invoke it with a flags argument which sets bits which are not
used by the device, i.e. when the device supports up to 8 vectors a full
unmask in some places sets the mask to 0xFFFFFF00. While devices probably
do not care, it's still bad practice.
Fixes: 7ba1930db0 ("PCI MSI: Unmask MSI if setup failed")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210729222542.568173099@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b9255a7cb51754e8d2645b65dd31805e282b4f3e upstream.
Nothing enforces the posted writes to be visible when the function
returns. Flush them even if the flush might be redundant when the entry is
masked already as the unmask will flush as well. This is either setup or a
rare affinity change event so the extra flush is not the end of the world.
While this is more a theoretical issue especially the logic in the X86
specific msi_set_affinity() function relies on the assumption that the
update has reached the hardware when the function returns.
Again, as this never has been enforced the Fixes tag refers to a commit in:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Fixes: f036d4ea5fa7 ("[PATCH] ia32 Message Signalled Interrupt support")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210729222542.515188147@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit da181dc974ad667579baece33c2c8d2d1e4558d5 upstream.
The specification (PCIe r5.0, sec 6.1.4.5) states:
For MSI-X, a function is permitted to cache Address and Data values
from unmasked MSI-X Table entries. However, anytime software unmasks a
currently masked MSI-X Table entry either by clearing its Mask bit or
by clearing the Function Mask bit, the function must update any Address
or Data values that it cached from that entry. If software changes the
Address or Data value of an entry while the entry is unmasked, the
result is undefined.
The Linux kernel's MSI-X support never enforced that the entry is masked
before the entry is modified hence the Fixes tag refers to a commit in:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Enforce the entry to be masked across the update.
There is no point in enforcing this to be handled at all possible call
sites as this is just pointless code duplication and the common update
function is the obvious place to enforce this.
Fixes: f036d4ea5fa7 ("[PATCH] ia32 Message Signalled Interrupt support")
Reported-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210729222542.462096385@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7d5ec3d3612396dc6d4b76366d20ab9fc06f399f upstream.
When MSI-X is enabled the ordering of calls is:
msix_map_region();
msix_setup_entries();
pci_msi_setup_msi_irqs();
msix_program_entries();
This has a few interesting issues:
1) msix_setup_entries() allocates the MSI descriptors and initializes them
except for the msi_desc:masked member which is left zero initialized.
2) pci_msi_setup_msi_irqs() allocates the interrupt descriptors and sets
up the MSI interrupts which ends up in pci_write_msi_msg() unless the
interrupt chip provides its own irq_write_msi_msg() function.
3) msix_program_entries() does not do what the name suggests. It solely
updates the entries array (if not NULL) and initializes the masked
member for each MSI descriptor by reading the hardware state and then
masks the entry.
Obviously this has some issues:
1) The uninitialized masked member of msi_desc prevents the enforcement
of masking the entry in pci_write_msi_msg() depending on the cached
masked bit. Aside of that half initialized data is a NONO in general
2) msix_program_entries() only ensures that the actually allocated entries
are masked. This is wrong as experimentation with crash testing and
crash kernel kexec has shown.
This limited testing unearthed that when the production kernel had more
entries in use and unmasked when it crashed and the crash kernel
allocated a smaller amount of entries, then a full scan of all entries
found unmasked entries which were in use in the production kernel.
This is obviously a device or emulation issue as the device reset
should mask all MSI-X table entries, but obviously that's just part
of the paper specification.
Cure this by:
1) Masking all table entries in hardware
2) Initializing msi_desc::masked in msix_setup_entries()
3) Removing the mask dance in msix_program_entries()
4) Renaming msix_program_entries() to msix_update_entries() to
reflect the purpose of that function.
As the masking of unused entries has never been done the Fixes tag refers
to a commit in:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Fixes: f036d4ea5fa7 ("[PATCH] ia32 Message Signalled Interrupt support")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210729222542.403833459@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 438553958ba19296663c6d6583d208dfb6792830 upstream.
The ordering of MSI-X enable in hardware is dysfunctional:
1) MSI-X is disabled in the control register
2) Various setup functions
3) pci_msi_setup_msi_irqs() is invoked which ends up accessing
the MSI-X table entries
4) MSI-X is enabled and masked in the control register with the
comment that enabling is required for some hardware to access
the MSI-X table
Step #4 obviously contradicts #3. The history of this is an issue with the
NIU hardware. When #4 was introduced the table access actually happened in
msix_program_entries() which was invoked after enabling and masking MSI-X.
This was changed in commit d71d6432e1 ("PCI/MSI: Kill redundant call of
irq_set_msi_desc() for MSI-X interrupts") which removed the table write
from msix_program_entries().
Interestingly enough nobody noticed and either NIU still works or it did
not get any testing with a kernel 3.19 or later.
Nevertheless this is inconsistent and there is no reason why MSI-X can't be
enabled and masked in the control register early on, i.e. move step #4
above to step #1. This preserves the NIU workaround and has no side effects
on other hardware.
Fixes: d71d6432e1 ("PCI/MSI: Kill redundant call of irq_set_msi_desc() for MSI-X interrupts")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210729222542.344136412@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b9cc7d8a4656a6e815852c27ab50365009cb69c1 upstream.
When the interrupt interval is greater than 2 ^ PREDICTION_BUFFER_SIZE *
PREDICTION_FACTOR us and less than 1s, the calculated index will be greater
than the length of irqs->ema_time[]. Check the calculated index before
using it to prevent array overflow.
Fixes: 23aa3b9a6b ("genirq/timings: Encapsulate storing function")
Signed-off-by: Ben Dai <ben.dai@unisoc.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210425150903.25456-1-ben.dai9703@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit dbbc93576e03fbe24b365fab0e901eb442237a8a upstream.
msi_domain_alloc_irqs() invokes irq_domain_activate_irq(), but
msi_domain_free_irqs() does not enforce deactivation before tearing down
the interrupts.
This happens when PCI/MSI interrupts are set up and never used before being
torn down again, e.g. in error handling pathes. The only place which cleans
that up is the error handling path in msi_domain_alloc_irqs().
Move the cleanup from msi_domain_alloc_irqs() into msi_domain_free_irqs()
to cure that.
Fixes: f3b0946d62 ("genirq/msi: Make sure PCI MSIs are activated early")
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210518033117.78104-1-cuibixuan@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 064855a69003c24bd6b473b367d364e418c57625 upstream.
Creating a new sub monitoring group in the root /sys/fs/resctrl leads to
getting the "Unavailable" value for mbm_total_bytes and mbm_local_bytes
on the entire filesystem.
Steps to reproduce:
1. mount -t resctrl resctrl /sys/fs/resctrl/
2. cd /sys/fs/resctrl/
3. cat mon_data/mon_L3_00/mbm_total_bytes
23189832
4. Create sub monitor group:
mkdir mon_groups/test1
5. cat mon_data/mon_L3_00/mbm_total_bytes
Unavailable
When a new monitoring group is created, a new RMID is assigned to the
new group. But the RMID is not active yet. When the events are read on
the new RMID, it is expected to report the status as "Unavailable".
When the user reads the events on the default monitoring group with
multiple subgroups, the events on all subgroups are consolidated
together. Currently, if any of the RMID reads report as "Unavailable",
then everything will be reported as "Unavailable".
Fix the issue by discarding the "Unavailable" reads and reporting all
the successful RMID reads. This is not a problem on Intel systems as
Intel reports 0 on Inactive RMIDs.
Fixes: d89b737901 ("x86/intel_rdt/cqm: Add mon_data")
Reported-by: Paweł Szulik <pawel.szulik@intel.com>
Signed-off-by: Babu Moger <Babu.Moger@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=213311
Link: https://lkml.kernel.org/r/162793309296.9224.15871659871696482080.stgit@bmoger-ubuntu
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0c0e37dc11671384e53ba6ede53a4d91162a2cc5 upstream.
The IO/APIC cannot handle interrupt affinity changes safely after startup
other than from an interrupt handler. The startup sequence in the generic
interrupt code violates that assumption.
Mark the irq chip with the new IRQCHIP_AFFINITY_PRE_STARTUP flag so that
the default interrupt setting happens before the interrupt is started up
for the first time.
Fixes: 1840475676 ("genirq: Expose default irq affinity mask (take 3)")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210729222542.832143400@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ff363f480e5997051dd1de949121ffda3b753741 upstream.
The X86 MSI mechanism cannot handle interrupt affinity changes safely after
startup other than from an interrupt handler, unless interrupt remapping is
enabled. The startup sequence in the generic interrupt code violates that
assumption.
Mark the irq chips with the new IRQCHIP_AFFINITY_PRE_STARTUP flag so that
the default interrupt setting happens before the interrupt is started up
for the first time.
While the interrupt remapping MSI chip does not require this, there is no
point in treating it differently as this might spare an interrupt to a CPU
which is not in the default affinity mask.
For the non-remapping case go to the direct write path when the interrupt
is not yet started similar to the not yet activated case.
Fixes: 1840475676 ("genirq: Expose default irq affinity mask (take 3)")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210729222542.886722080@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 826da771291fc25a428e871f9e7fb465e390f852 upstream.
X86 IO/APIC and MSI interrupts (when used without interrupts remapping)
require that the affinity setup on startup is done before the interrupt is
enabled for the first time as the non-remapped operation mode cannot safely
migrate enabled interrupts from arbitrary contexts. Provide a new irq chip
flag which allows affected hardware to request this.
This has to be opt-in because there have been reports in the past that some
interrupt chips cannot handle affinity setting before startup.
Fixes: 1840475676 ("genirq: Expose default irq affinity mask (take 3)")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210729222542.779791738@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 839ad22f755132838f406751439363c07272ad87 ]
Skip (omit) any version string info that is parenthesized.
Warning: objdump version 15) is older than 2.19
Warning: Skipping posttest.
where 'objdump -v' says:
GNU objdump (GNU Binutils; SUSE Linux Enterprise 15) 2.35.1.20201123-7.18
Fixes: 8bee738bb1 ("x86: Fix objdump version check in chkobjdump.awk for different formats.")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20210731000146.2720-1-rdunlap@infradead.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3a262423755b83a5f85009ace415d6e7f572dfe8 ]
Commit 82046702e2 ("efi/libstub/arm64: Replace 'preferred' offset with
alignment check") simplified the way the stub moves the kernel image
around in memory before booting it, given that a relocatable image does
not need to be copied to a 2M aligned offset if it was loaded on a 64k
boundary by EFI.
Commit d32de9130f ("efi/arm64: libstub: Deal gracefully with
EFI_RNG_PROTOCOL failure") inadvertently defeated this logic by
overriding the value of efi_nokaslr if EFI_RNG_PROTOCOL is not
available, which was mistaken by the loader logic as an explicit request
on the part of the user to disable KASLR and any associated relocation
of an Image not loaded on a 2M boundary.
So let's reinstate this functionality, by capturing the value of
efi_nokaslr at function entry to choose the minimum alignment.
Fixes: d32de9130f ("efi/arm64: libstub: Deal gracefully with EFI_RNG_PROTOCOL failure")
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5b94046efb4706b3429c9c8e7377bd8d1621d588 ]
Distro versions of GRUB replace the usual LoadImage/StartImage calls
used to load the kernel image with some local code that fails to honor
the allocation requirements described in the PE/COFF header, as it
does not account for the image's BSS section at all: it fails to
allocate space for it, and fails to zero initialize it.
Since the EFI stub itself is allocated in the .init segment, which is
in the middle of the image, its BSS section is not impacted by this,
and the main consequence of this omission is that the BSS section may
overlap with memory regions that are already used by the firmware.
So let's warn about this condition, and force image reallocation to
occur in this case, which works around the problem.
Fixes: 82046702e2 ("efi/libstub/arm64: Replace 'preferred' offset with alignment check")
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 4152433c397697acc4b02c4a10d17d5859c2730d ]
The EFI stub random allocator used for kaslr on arm64 has a subtle
bug. In function get_entry_num_slots() which counts the number of
possible allocation "slots" for the image in a given chunk of free
EFI memory, "last_slot" can become negative if the chunk is smaller
than the requested allocation size.
The test "if (first_slot > last_slot)" doesn't catch it because
both first_slot and last_slot are unsigned.
I chose not to make them signed to avoid problems if this is ever
used on architectures where there are meaningful addresses with the
top bit set. Instead, fix it with an additional test against the
allocation size.
This can cause a boot failure in addition to a loss of randomisation
due to another bug in the arm64 stub fixed separately.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Fixes: 2ddbfc81ea ("efi: stub: add implementation of efi_random_alloc()")
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit cddce01160582a5f52ada3da9626c052d852ec42 ]
There is a race between iterating over requests in
nbd_clear_que() and completing requests in recv_work(),
which can lead to double completion of a request.
To fix it, flush the recv worker before iterating over
the requests and don't abort the completed request
while iterating.
Fixes: 96d97e1782 ("nbd: clear_sock on netlink disconnect")
Reported-by: Jiang Yadong <jiangyadong@bytedance.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20210813151330.96-1-xieyongji@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 49b0b6ffe20c5344f4173f3436298782a08da4f2 ]
There's a potential deadlock case when remove the vsock device or
process the RESET event:
vsock_for_each_connected_socket:
spin_lock_bh(&vsock_table_lock) ----------- (1)
...
virtio_vsock_reset_sock:
lock_sock(sk) --------------------- (2)
...
spin_unlock_bh(&vsock_table_lock)
lock_sock() may do initiative schedule when the 'sk' is owned by
other thread at the same time, we would receivce a warning message
that "scheduling while atomic".
Even worse, if the next task (selected by the scheduler) try to
release a 'sk', it need to request vsock_table_lock and the deadlock
occur, cause the system into softlockup state.
Call trace:
queued_spin_lock_slowpath
vsock_remove_bound
vsock_remove_sock
virtio_transport_release
__vsock_release
vsock_release
__sock_release
sock_close
__fput
____fput
So we should not require sk_lock in this case, just like the behavior
in vhost_vsock or vmci.
Fixes: 0ea9e1d3a9 ("VSOCK: Introduce virtio_transport.ko")
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20210812053056.1699-1-longpeng2@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 88ca2521bd5b4e8b83743c01a2d4cb09325b51e9 ]
There is a TOCTOU issue in set_evtchn_to_irq. Rows in the evtchn_to_irq
mapping are lazily allocated in this function. The check whether the row
is already present and the row initialization is not synchronized. Two
threads can at the same time allocate a new row for evtchn_to_irq and
add the irq mapping to the their newly allocated row. One thread will
overwrite what the other has set for evtchn_to_irq[row] and therefore
the irq mapping is lost. This will trigger a BUG_ON later in
bind_evtchn_to_cpu:
INFO: pci 0000:1a:15.4: [1d0f:8061] type 00 class 0x010802
INFO: nvme 0000:1a:12.1: enabling device (0000 -> 0002)
INFO: nvme nvme77: 1/0/0 default/read/poll queues
CRIT: kernel BUG at drivers/xen/events/events_base.c:427!
WARN: invalid opcode: 0000 [#1] SMP NOPTI
WARN: Workqueue: nvme-reset-wq nvme_reset_work [nvme]
WARN: RIP: e030:bind_evtchn_to_cpu+0xc2/0xd0
WARN: Call Trace:
WARN: set_affinity_irq+0x121/0x150
WARN: irq_do_set_affinity+0x37/0xe0
WARN: irq_setup_affinity+0xf6/0x170
WARN: irq_startup+0x64/0xe0
WARN: __setup_irq+0x69e/0x740
WARN: ? request_threaded_irq+0xad/0x160
WARN: request_threaded_irq+0xf5/0x160
WARN: ? nvme_timeout+0x2f0/0x2f0 [nvme]
WARN: pci_request_irq+0xa9/0xf0
WARN: ? pci_alloc_irq_vectors_affinity+0xbb/0x130
WARN: queue_request_irq+0x4c/0x70 [nvme]
WARN: nvme_reset_work+0x82d/0x1550 [nvme]
WARN: ? check_preempt_wakeup+0x14f/0x230
WARN: ? check_preempt_curr+0x29/0x80
WARN: ? nvme_irq_check+0x30/0x30 [nvme]
WARN: process_one_work+0x18e/0x3c0
WARN: worker_thread+0x30/0x3a0
WARN: ? process_one_work+0x3c0/0x3c0
WARN: kthread+0x113/0x130
WARN: ? kthread_park+0x90/0x90
WARN: ret_from_fork+0x3a/0x50
This patch sets evtchn_to_irq rows via a cmpxchg operation so that they
will be set only once. The row is now cleared before writing it to
evtchn_to_irq in order to not create a race once the row is visible for
other threads.
While at it, do not require the page to be zeroed, because it will be
overwritten with -1's in clear_evtchn_to_irq_row anyway.
Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
Fixes: d0b075ffee ("xen/events: Refactor evtchn_to_irq array to be dynamically allocated")
Link: https://lore.kernel.org/r/20210812130930.127134-1-mheyne@amazon.de
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 24d032e2359e3abc926b3d423f49a7c33e0b7836 ]
The SFC_DONE register lives within the corresponding VD0/VD2/VD4/VD6
forcewake domain and is not accessible if the vdbox in that domain is
fused off and the forcewake is not initialized.
This mistake went unnoticed because until recently we were using the
wrong register offset for the SFC_DONE register; once the register
offset was corrected, we started hitting errors like
<4> [544.989065] i915 0000:cc:00.0: Uninitialized forcewake domain(s) 0x80 accessed at 0x1ce000
on parts with fused-off vdbox engines.
Fixes: e50dbdbfd9 ("drm/i915/tgl: Add SFC instdone to error state")
Fixes: 9c9c6d0ab08a ("drm/i915: Correct SFC_DONE register offset")
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210806174130.1058960-1-matthew.d.roper@intel.com
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
(cherry picked from commit c5589bb5dccb0c5cb74910da93663f489589f3ce)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
[Changed Fixes tag to match the cherry-picked 82929a2140eb]
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b69dd5b3780a7298bd893816a09da751bc0636f7 ]
Some arches support cmpxchg() on 4-byte and 8-byte only.
Increase mr_ifc_count width to 32bit to fix this problem.
Fixes: 4a2b285e7e10 ("net: igmp: fix data-race in igmp_ifc_timer_expire()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20210811195715.3684218-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6de035fec045f8ae5ee5f3a02373a18b939e91fb ]
Currently if BBR congestion control is initialized after more than 2B
packets have been delivered, depending on the phase of the
tp->delivered counter the tracking of BBR round trips can get stuck.
The bug arises because if tp->delivered is between 2^31 and 2^32 at
the time the BBR congestion control module is initialized, then the
initialization of bbr->next_rtt_delivered to 0 will cause the logic to
believe that the end of the round trip is still billions of packets in
the future. More specifically, the following check will fail
repeatedly:
!before(rs->prior_delivered, bbr->next_rtt_delivered)
and thus the connection will take up to 2B packets delivered before
that check will pass and the connection will set:
bbr->round_start = 1;
This could cause many mechanisms in BBR to fail to trigger, for
example bbr_check_full_bw_reached() would likely never exit STARTUP.
This bug is 5 years old and has not been observed, and as a practical
matter this would likely rarely trigger, since it would require
transferring at least 2B packets, or likely more than 3 terabytes of
data, before switching congestion control algorithms to BBR.
This patch is a stable candidate for kernels as far back as v4.9,
when tcp_bbr.c was added.
Fixes: 0f8782ea14 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Kevin Yang <yyd@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210811024056.235161-1-ncardwell@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6922110d152e56d7569616b45a1f02876cf3eb9f ]
After migrating my laptop from 4.19-LTS to 5.4-LTS a while ago I noticed
that my Ethernet port to which a bond and a VLAN interface are attached
appeared to remain up after resuming from suspend with the cable unplugged
(and that problem still persists with 5.10-LTS).
It happens that the following happens:
- the network driver (e1000e here) prepares to suspend, calls e1000e_down()
which calls netif_carrier_off() to signal that the link is going down.
- netif_carrier_off() adds a link_watch event to the list of events for
this device
- the device is completely stopped.
- the machine suspends
- the cable is unplugged and the machine brought to another location
- the machine is resumed
- the queued linkwatch events are processed for the device
- the device doesn't yet have the __LINK_STATE_PRESENT bit and its events
are silently dropped
- the device is resumed with its link down
- the upper VLAN and bond interfaces are never notified that the link had
been turned down and remain up
- the only way to provoke a change is to physically connect the machine
to a port and possibly unplug it.
The state after resume looks like this:
$ ip -br li | egrep 'bond|eth'
bond0 UP e8:6a:64:64:64:64 <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP>
eth0 DOWN e8:6a:64:64:64:64 <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP>
eth0.2@eth0 UP e8:6a:64:64:64:64 <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP>
Placing an explicit call to netdev_state_change() either in the suspend
or the resume code in the NIC driver worked around this but the solution
is not satisfying.
The issue in fact really is in link_watch that loses events while it
ought not to. It happens that the test for the device being present was
added by commit 124eee3f69 ("net: linkwatch: add check for netdevice
being present to linkwatch_do_dev") in 4.20 to avoid an access to
devices that are not present.
Instead of dropping events, this patch proceeds slightly differently by
postponing their handling so that they happen after the device is fully
resumed.
Fixes: 124eee3f69 ("net: linkwatch: add check for netdevice being present to linkwatch_do_dev")
Link: https://lists.openwall.net/netdev/2018/03/15/62
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20210809160628.22623-1-w@1wt.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>