Commit 24dea04767 ("bpf, x32: remove ld_abs/ld_ind")
removed the 4 /* Extra space for skb_copy_bits buffer */
from _STACK_SIZE, but it didn't fix the concerned code
in emit_prologue and emit_epilogue, and this error will
bring very strange kernel runtime errors. This patch
fixes it.
Fixes: 24dea04767 ("bpf, x32: remove ld_abs/ld_ind")
Reported-by: Meelis Roos <mroos@linux.ee>
Bisected-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Martin KaFai Lau says:
====================
The series allows the BPF loader to figure out the btf_key_id
and btf_value_id from a map's name by using BPF_ANNOTATE_KV_PAIR()
similarly as in iproute2 commit f823f36012fb ("bpf: implement
btf handling and map annotation").
It also removes the old 'typedef' way which requires two separate
typedefs (one for the key and one for the value).
By doing this, iproute2 and libbpf have one consistent way to
figure out the btf_key_type_id and btf_value_type_id for a map.
The first two patches are some prep/cleanup works. The last patch
introduces BPF_ANNOTATE_KV_PAIR.
v3:
- Replace some more *int*_t and u* usages with the
equivalent __[su]* in btf.c
v2:
- Fix the incorrect '&&' check on container_type
in bpf_map_find_btf_info().
- Expose the existing static btf_type_by_id() instead of
creating a new one.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch introduces BPF_ANNOTATE_KV_PAIR to signal the
bpf loader about the btf key_type and value_type of a bpf map.
Please refer to the changes in test_btf_haskv.c for its usage.
Both iproute2 and libbpf loader will then have the same
convention to find out the map's btf_key_type_id and
btf_value_type_id from a map's name.
Fixes: 8a138aed4a ("bpf: btf: Add BTF support to libbpf")
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch replaces [u]int32_t and [u]int64_t usage with
__[su]32 and __[su]64. The same change goes for [u]int16_t
and [u]int8_t.
Fixes: 8a138aed4a ("bpf: btf: Add BTF support to libbpf")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch sync the uapi btf.h to tools/
Fixes: 36fc3c8c28 bpf: btf: Clean up BTF_INT_BITS() in uapi btf.h
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch ensures the member->offset of a struct
is in the correct order (i.e the later member's offset cannot
go backward).
The current "pahole -J" BTF encoder does not generate something
like this. However, checking this can ensure future encoder
will not violate this.
Fixes: 69b693f0ae ("bpf: btf: Introduce BPF Type Format (BTF)")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Eric Dumazet says:
====================
Juha-Matti Tilli reported that malicious peers could inject tiny
packets in out_of_order_queue, forcing very expensive calls
to tcp_collapse_ofo_queue() and tcp_prune_ofo_queue() for
every incoming packet.
With tcp_rmem[2] default of 6MB, the ooo queue could
contain ~7000 nodes.
This patch series makes sure we cut cpu cycles enough to
render the attack not critical.
We might in the future go further, like disconnecting
or black-holing proven malicious flows.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In case skb in out_or_order_queue is the result of
multiple skbs coalescing, we would like to get a proper gso_segs
counter tracking, so that future tcp_drop() can report an accurate
number.
I chose to not implement this tracking for skbs in receive queue,
since they are not dropped, unless socket is disconnected.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to be able to give better diagnostics and detect
malicious traffic, we need to have better sk->sk_drops tracking.
Fixes: 9f5afeae51 ("tcp: use an RB tree for ooo receive queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case an attacker feeds tiny packets completely out of order,
tcp_collapse_ofo_queue() might scan the whole rb-tree, performing
expensive copies, but not changing socket memory usage at all.
1) Do not attempt to collapse tiny skbs.
2) Add logic to exit early when too many tiny skbs are detected.
We prefer not doing aggressive collapsing (which copies packets)
for pathological flows, and revert to tcp_prune_ofo_queue() which
will be less expensive.
In the future, we might add the possibility of terminating flows
that are proven to be malicious.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Right after a TCP flow is created, receiving tiny out of order
packets allways hit the condition :
if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
tcp_clamp_window(sk);
tcp_clamp_window() increases sk_rcvbuf to match sk_rmem_alloc
(guarded by tcp_rmem[2])
Calling tcp_collapse_ofo_queue() in this case is not useful,
and offers a O(N^2) surface attack to malicious peers.
Better not attempt anything before full queue capacity is reached,
forcing attacker to spend lots of resource and allow us to more
easily detect the abuse.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Juha-Matti Tilli reported that malicious peers could inject tiny
packets in out_of_order_queue, forcing very expensive calls
to tcp_collapse_ofo_queue() and tcp_prune_ofo_queue() for
every incoming packet. out_of_order_queue rb-tree can contain
thousands of nodes, iterating over all of them is not nice.
Before linux-4.9, we would have pruned all packets in ofo_queue
in one go, every XXXX packets. XXXX depends on sk_rcvbuf and skbs
truesize, but is about 7000 packets with tcp_rmem[2] default of 6 MB.
Since we plan to increase tcp_rmem[2] in the future to cope with
modern BDP, can not revert to the old behavior, without great pain.
Strategy taken in this patch is to purge ~12.5 % of the queue capacity.
Fixes: 36a6503fed ("tcp: refine tcp_prune_ofo_queue() to not drop all packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Juha-Matti Tilli <juha-matti.tilli@iki.fi>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The skb hash for locally generated ip[v6] fragments belonging
to the same datagram can vary in several circumstances:
* for connected UDP[v6] sockets, the first fragment get its hash
via set_owner_w()/skb_set_hash_from_sk()
* for unconnected IPv6 UDPv6 sockets, the first fragment can get
its hash via ip6_make_flowlabel()/skb_get_hash_flowi6(), if
auto_flowlabel is enabled
For the following frags the hash is usually computed via
skb_get_hash().
The above can cause OoO for unconnected IPv6 UDPv6 socket: in that
scenario the egress tx queue can be selected on a per packet basis
via the skb hash.
It may also fool flow-oriented schedulers to place fragments belonging
to the same datagram in different flows.
Fix the issue by copying the skb hash from the head frag into
the others at fragmentation time.
Before this commit:
perf probe -a "dev_queue_xmit skb skb->hash skb->l4_hash:b1@0/8 skb->sw_hash:b1@1/8"
netperf -H $IPV4 -t UDP_STREAM -l 5 -- -m 2000 -n &
perf record -e probe:dev_queue_xmit -e probe:skb_set_owner_w -a sleep 0.1
perf script
probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=3713014309 l4_hash=1 sw_hash=0
probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=0 l4_hash=0 sw_hash=0
After this commit:
probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=2171763177 l4_hash=1 sw_hash=0
probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=2171763177 l4_hash=1 sw_hash=0
Fixes: b73c3d0e4f ("net: Save TX flow hash in sock and set in skbuf on xmit")
Fixes: 67800f9b1f ("ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabel")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEENrCndlB/VnAEWuH5k9IU1zQoZfEFAltVzJQTHG1rbEBwZW5n
dXRyb25peC5kZQAKCRCT0hTXNChl8e/pCADT/Al1tOpzM0EYs3fdSDzqi3caxkAI
gpxGDcM8fpiJn5psC0DvFJ4vf8GBQBdkoS8W6M3ieSxrwxvFlK0hQsujbBn4vAGt
WaKgDYvXJY5PJBjwwHSFMHRVGmzzg7YOQ3CqFsoHlVjr+gPA6T4qgclnPQrUgWQY
y6IuQfoAagsP3ezDV15hiRhPaI4SJrCK27XfAD8Cbr9rrl7sa4ifsP20Wf2xoFDu
lbf/bVMjOiYANrga4Pz8PNlFiIX9F3kW0Qc81eyJkvBEnmxRThZ7nvP/DHd3vVoF
k2EUBaGhdyu5URHASTZRKnT2WKTbeAJ48wFhwT9QCD169SrZWhv7r/nP
=wU3+
-----END PGP SIGNATURE-----
Merge tag 'linux-can-fixes-for-4.18-20180723' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2018-07-23
this is a pull request of 12 patches for net/master.
The patch by Stephane Grosjean for the peak_canfd CAN driver fixes a problem
with older firmware. The next patch is by Roman Fietze and fixes the setup of
the CCCR register in the m_can driver. Nicholas Mc Guire's patch for the
mpc5xxx_can driver adds missing error checking. The two patches by Faiz Abbas
fix the runtime resume and clean up the probe function in the m_can driver. The
last 7 patches by Anssi Hannula fix several problem in the xilinx_can driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
There are several issues with the suspend/resume handling code of the
driver:
- The device is attached and detached in the runtime_suspend() and
runtime_resume() callbacks if the interface is running. However,
during xcan_chip_start() the interface is considered running,
causing the resume handler to incorrectly call netif_start_queue()
at the beginning of xcan_chip_start(), and on xcan_chip_start() error
return the suspend handler detaches the device leaving the user
unable to bring-up the device anymore.
- The device is not brought properly up on system resume. A reset is
done and the code tries to determine the bus state after that.
However, after reset the device is always in Configuration mode
(down), so the state checking code does not make sense and
communication will also not work.
- The suspend callback tries to set the device to sleep mode (low-power
mode which monitors the bus and brings the device back to normal mode
on activity), but then immediately disables the clocks (possibly
before the device reaches the sleep mode), which does not make sense
to me. If a clean shutdown is wanted before disabling clocks, we can
just bring it down completely instead of only sleep mode.
Reorganize the PM code so that only the clock logic remains in the
runtime PM callbacks and the system PM callbacks contain the device
bring-up/down logic. This makes calling the runtime PM callbacks during
e.g. xcan_chip_start() safe.
The system PM callbacks now simply call common code to start/stop the
HW if the interface was running, replacing the broken code from before.
xcan_chip_stop() is updated to use the common reset code so that it will
wait for the reset to complete. Reset also disables all interrupts so do
not do that separately.
Also, the device_may_wakeup() checks are removed as the driver does not
have wakeup support.
Tested on Zynq-7000 integrated CAN.
Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Cc: Michal Simek <michal.simek@xilinx.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
xcan_interrupt() clears ERROR|RXOFLV|BSOFF|ARBLST interrupts if any of
them is asserted. This does not take into account that some of them
could have been asserted between interrupt status read and interrupt
clear, therefore clearing them without handling them.
Fix the code to only clear those interrupts that it knows are asserted
and therefore going to be processed in xcan_err_interrupt().
Fixes: b1201e44f5 ("can: xilinx CAN controller support")
Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Cc: Michal Simek <michal.simek@xilinx.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
RX overflow interrupt (RXOFLW) is disabled even though xcan_interrupt()
processes it. This means that an RX overflow interrupt will only be
processed when another interrupt gets asserted (e.g. for RX/TX).
Fix that by enabling the RXOFLW interrupt.
Fixes: b1201e44f5 ("can: xilinx CAN controller support")
Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Cc: Michal Simek <michal.simek@xilinx.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
The xilinx_can driver assumes that the TXOK interrupt only clears after
it has been acknowledged as many times as there have been successfully
sent frames.
However, the documentation does not mention such behavior, instead
saying just that the interrupt is cleared when the clear bit is set.
Similarly, testing seems to also suggest that it is immediately cleared
regardless of the amount of frames having been sent. Performing some
heavy TX load and then going back to idle has the tx_head drifting
further away from tx_tail over time, steadily reducing the amount of
frames the driver keeps in the TX FIFO (but not to zero, as the TXOK
interrupt always frees up space for 1 frame from the driver's
perspective, so frames continue to be sent) and delaying the local echo
frames.
The TX FIFO tracking is also otherwise buggy as it does not account for
TX FIFO being cleared after software resets, causing
BUG!, TX FIFO full when queue awake!
messages to be output.
There does not seem to be any way to accurately track the state of the
TX FIFO for local echo support while using the full TX FIFO.
The Zynq version of the HW (but not the soft-AXI version) has watermark
programming support and with it an additional TX-FIFO-empty interrupt
bit.
Modify the driver to only put 1 frame into TX FIFO at a time on soft-AXI
and 2 frames at a time on Zynq. On Zynq the TXFEMP interrupt bit is used
to detect whether 1 or 2 frames have been sent at interrupt processing
time.
Tested with the integrated CAN on Zynq-7000 SoC. The 1-frame-FIFO mode
was also tested.
An alternative way to solve this would be to drop local echo support but
keep using the full TX FIFO.
v2: Add FIFO space check before TX queue wake with locking to
synchronize with queue stop. This avoids waking the queue when xmit()
had just filled it.
v3: Keep local echo support and reduce the amount of frames in FIFO
instead as suggested by Marc Kleine-Budde.
Fixes: b1201e44f5 ("can: xilinx CAN controller support")
Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Cc: <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
The xilinx_can driver contains no mechanism for propagating recovery
from CAN_STATE_ERROR_WARNING and CAN_STATE_ERROR_PASSIVE.
Add such a mechanism by factoring the handling of
XCAN_STATE_ERROR_PASSIVE and XCAN_STATE_ERROR_WARNING out of
xcan_err_interrupt and checking for recovery after RX and TX if the
interface is in one of those states.
Tested with the integrated CAN on Zynq-7000 SoC.
Fixes: b1201e44f5 ("can: xilinx CAN controller support")
Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Cc: <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
If the device gets into a state where RXNEMP (RX FIFO not empty)
interrupt is asserted without RXOK (new frame received successfully)
interrupt being asserted, xcan_rx_poll() will continue to try to clear
RXNEMP without actually reading frames from RX FIFO. If the RX FIFO is
not empty, the interrupt will not be cleared and napi_schedule() will
just be called again.
This situation can occur when:
(a) xcan_rx() returns without reading RX FIFO due to an error condition.
The code tries to clear both RXOK and RXNEMP but RXNEMP will not clear
due to a frame still being in the FIFO. The frame will never be read
from the FIFO as RXOK is no longer set.
(b) A frame is received between xcan_rx_poll() reading interrupt status
and clearing RXOK. RXOK will be cleared, but RXNEMP will again remain
set as the new message is still in the FIFO.
I'm able to trigger case (b) by flooding the bus with frames under load.
There does not seem to be any benefit in using both RXNEMP and RXOK in
the way the driver does, and the polling example in the reference manual
(UG585 v1.10 18.3.7 Read Messages from RxFIFO) also says that either
RXOK or RXNEMP can be used for detecting incoming messages.
Fix the issue and simplify the RX processing by only using RXNEMP
without RXOK.
Tested with the integrated CAN on Zynq-7000 SoC.
Fixes: b1201e44f5 ("can: xilinx CAN controller support")
Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Cc: <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
The xilinx_can driver performs a software reset when an RX overrun is
detected. This causes the device to enter Configuration mode where no
messages are received or transmitted.
The documentation does not mention any need to perform a reset on an RX
overrun, and testing by inducing an RX overflow also indicated that the
device continues to work just fine without a reset.
Remove the software reset.
Tested with the integrated CAN on Zynq-7000 SoC.
Fixes: b1201e44f5 ("can: xilinx CAN controller support")
Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Cc: <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
MCAN message ram should only be accessed once clocks are enabled.
Therefore, move the call to parse/init the message ram to after
clocks are enabled.
Signed-off-by: Faiz Abbas <faiz_abbas@ti.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
pm_runtime_get_sync() returns a 1 if the state of the device is already
'active'. This is not a failure case and should return a success.
Therefore fix error handling for pm_runtime_get_sync() call such that
it returns success when the value is 1.
Also cleanup the TODO for using runtime PM for sleep mode as that is
implemented.
Signed-off-by: Faiz Abbas <faiz_abbas@ti.com>
Cc: <stable@vger.kernel.org
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
of_iomap() can return NULL so that return needs to be checked and NULL
treated as failure. While at it also take care of the missing
of_node_put() in the error path.
Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Fixes: commit afa17a500a ("net/can: add driver for mscan family & mpc52xx_mscan")
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Inside m_can_chip_config(), when setting up the new value of the CCCR,
the CCCR_NISO bit is not cleared like the others, CCCR_TEST, CCCR_MON,
CCCR_BRSE and CCCR_FDOE, before checking the can.ctrlmode bits for
CAN_CTRLMODE_FD_NON_ISO.
This way once the controller was configured for CAN_CTRLMODE_FD_NON_ISO,
this mode could never be cleared again.
This fix is only relevant for controllers with version 3.1.x or 3.2.x.
Older versions do not support NISO.
Signed-off-by: Roman Fietze <roman.fietze@telemotive.de>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
The DMA logic in firmwares < v3.3.0 embedded in the PCAN-PCIe FD cards
family is not capable of handling a mix of 32-bit and 64-bit logical
addresses. If the board is equipped with 2 or 4 CAN ports, then such a
situation might lead to a PCIe Bus Error "Malformed TLP" packet
as well as "irq xx: nobody cared" issue.
This patch adds a workaround that requests only 32-bit DMA addresses
when these might be allocated outside of the 4 GB area.
This issue has been fixed in firmware v3.3.0 and next.
Signed-off-by: Stephane Grosjean <s.grosjean@peak-system.com>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Prevent drivers from building on PPC32 if they use isa_bus_to_virt(),
isa_virt_to_bus(), or isa_page_to_bus(), which are not available and
thus cause build errors.
../drivers/net/ethernet/3com/3c515.c: In function 'corkscrew_open':
../drivers/net/ethernet/3com/3c515.c:824:9: error: implicit declaration of function 'isa_virt_to_bus'; did you mean 'virt_to_bus'? [-Werror=implicit-function-declaration]
../drivers/net/ethernet/amd/lance.c: In function 'lance_rx':
../drivers/net/ethernet/amd/lance.c:1203:23: error: implicit declaration of function 'isa_bus_to_virt'; did you mean 'bus_to_virt'? [-Werror=implicit-function-declaration]
../drivers/net/ethernet/amd/ni65.c: In function 'ni65_init_lance':
../drivers/net/ethernet/amd/ni65.c:585:20: error: implicit declaration of function 'isa_virt_to_bus'; did you mean 'virt_to_bus'? [-Werror=implicit-function-declaration]
../drivers/net/ethernet/cirrus/cs89x0.c: In function 'net_open':
../drivers/net/ethernet/cirrus/cs89x0.c:897:20: error: implicit declaration of function 'isa_virt_to_bus'; did you mean 'virt_to_bus'? [-Werror=implicit-function-declaration]
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously only the neighbour state was checked to decide if an offloaded
entry should be removed. However, there can be situations when the entry
is dead but still marked as valid. This can lead to dead entries not
being removed from fw tables or even incorrect data being added.
Check the entry dead bit before deciding if it should be added to or
removed from fw neighbour tables.
Fixes: 8e6a9046b6 ("nfp: flower vxlan neighbour offload")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa Prabhu says:
====================
vxlan: fix default fdb entry user-space notify ordering/race
Problem:
In vxlan_newlink, a default fdb entry is added before register_netdev.
The default fdb creation function notifies user-space of the
fdb entry on the vxlan device which user-space does not know about yet.
(RTM_NEWNEIGH goes before RTM_NEWLINK for the same ifindex).
This series fixes the user-space netlink notification ordering issue
with the following changes:
- decouple fdb notify from fdb create.
- Move fdb notify after register_netdev.
- modify rtnl_configure_link to allow configuring a link early.
- Call rtnl_configure_link in vxlan newlink handler to notify
userspace about the newlink before fdb notify and
hence avoiding the user-space race.
====================
Fixes: afbd8bae9c ("vxlan: add implicit fdb entry for default destination")
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Problem:
In vxlan_newlink, a default fdb entry is added before register_netdev.
The default fdb creation function also notifies user-space of the
fdb entry on the vxlan device which user-space does not know about yet.
(RTM_NEWNEIGH goes before RTM_NEWLINK for the same ifindex).
This patch fixes the user-space netlink notification ordering issue
with the following changes:
- decouple fdb notify from fdb create.
- Move fdb notify after register_netdev.
- Call rtnl_configure_link in vxlan newlink handler to notify
userspace about the newlink before fdb notify and
hence avoiding the user-space race.
Fixes: afbd8bae9c ("vxlan: add implicit fdb entry for default destination")
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new option do_notify to vxlan_fdb_destroy to make
sending netlink notify optional. Used by a later patch.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Add new vxlan_fdb_alloc helper
- rename existing vxlan_fdb_create into vxlan_fdb_update:
because it really creates or updates an existing
fdb entry
- move new fdb creation into a separate vxlan_fdb_create
Main motivation for this change is to introduce the ability
to decouple vxlan fdb creation and notify, used in a later patch.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rtnl_configure_link sets dev->rtnl_link_state to
RTNL_LINK_INITIALIZED and unconditionally calls
__dev_notify_flags to notify user-space of dev flags.
current call sequence for rtnl_configure_link
rtnetlink_newlink
rtnl_link_ops->newlink
rtnl_configure_link (unconditionally notifies userspace of
default and new dev flags)
If a newlink handler wants to call rtnl_configure_link
early, we will end up with duplicate notifications to
user-space.
This patch fixes rtnl_configure_link to check rtnl_link_state
and call __dev_notify_flags with gchanges = 0 if already
RTNL_LINK_INITIALIZED.
Later in the series, this patch will help the following sequence
where a driver implementing newlink can call rtnl_configure_link
to initialize the link early.
makes the following call sequence work:
rtnetlink_newlink
rtnl_link_ops->newlink (vxlan) -> rtnl_configure_link (initializes
link and notifies
user-space of default
dev flags)
rtnl_configure_link (updates dev flags if requested by user ifm
and notifies user-space of new dev flags)
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Got crash report with following backtrace:
BUG: unable to handle kernel paging request at ffff8801869daffe
RIP: 0010:[<ffffffff816429c4>] [<ffffffff816429c4>] ip6_finish_output2+0x394/0x4c0
RSP: 0018:ffff880186c83a98 EFLAGS: 00010283
RAX: ffff8801869db00e ...
[<ffffffff81644cdc>] ip6_finish_output+0x8c/0xf0
[<ffffffff81644d97>] ip6_output+0x57/0x100
[<ffffffff81643dc9>] ip6_forward+0x4b9/0x840
[<ffffffff81645566>] ip6_rcv_finish+0x66/0xc0
[<ffffffff81645db9>] ipv6_rcv+0x319/0x530
[<ffffffff815892ac>] netif_receive_skb+0x1c/0x70
[<ffffffffc0060bec>] atl1c_clean+0x1ec/0x310 [atl1c]
...
The bad access is in neigh_hh_output(), at skb->data - 16 (HH_DATA_MOD).
atl1c driver provided skb with no headroom, so 14 bytes (ethernet
header) got pulled, but then 16 are copied.
Reserve NET_SKB_PAD bytes headroom, like netdev_alloc_skb().
Compile tested only; I lack hardware.
Fixes: 7b70176421 ("atl1c: Fix misuse of netdev_alloc_skb in refilling rx ring")
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two scenarios that we will restore deleted records. The first is
when device down and up(or unmap/remap). In this scenario the new filter
mode is same with previous one. Because we get it from in_dev->mc_list and
we do not touch it during device down and up.
The other scenario is when a new socket join a group which was just delete
and not finish sending status reports. In this scenario, we should use the
current filter mode instead of restore old one. Here are 4 cases in total.
old_socket new_socket before_fix after_fix
IN(A) IN(A) ALLOW(A) ALLOW(A)
IN(A) EX( ) TO_IN( ) TO_EX( )
EX( ) IN(A) TO_EX( ) ALLOW(A)
EX( ) EX( ) TO_EX( ) TO_EX( )
Fixes: 24803f38a5 (igmp: do not remove igmp souce list info when set link down)
Fixes: 1666d49e1d (mld: do not remove mld souce list info when set link down)
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
free_irq() waits until all handlers for this IRQ have completed. As the
relevant handler (mv88e6xxx_g1_irq_thread_fn()) takes the chip's reg_lock
it might never return if the thread calling free_irq() holds this lock.
For the same reason kthread_cancel_delayed_work_sync() in the polling case
must not hold this lock.
Also first free the irq (or stop the worker respectively) such that
mv88e6xxx_g1_irq_thread_work() isn't called any more before the irq
mappings are dropped in mv88e6xxx_g1_irq_free_common() to prevent the
worker thread to call handle_nested_irq(0) which results in a NULL-pointer
exception.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Example setup:
host: ip -6 addr add dev eth1 2001:db8:104::4
where eth1 is enslaved to a VRF
switch: ip -6 ro add 2001:db8:104::4/128 dev br1
where br1 only has an LLA
ping6 2001:db8:104::4
ssh 2001:db8:104::4
(NOTE: UDP works fine if the PKTINFO has the address set to the global
address and ifindex is set to the index of eth1 with a destination an
LLA).
For ICMP, icmp6_iif needs to be updated to check if skb->dev is an
L3 master. If it is then return the ifindex from rt6i_idev similar
to what is done for loopback.
For TCP, restore the original tcp_v6_iif definition which is needed in
most places and add a new tcp_v6_iif_l3_slave that considers the
l3_slave variability. This latter check is only needed for socket
lookups.
Fixes: 9ff7438460 ("net: vrf: Handle ipv6 multicast and link-local addresses")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix following warning:
net/ipv4/bpfilter/sockopt.c:28:5: error: symbol 'bpfilter_ip_set_sockopt' redeclared with different type
net/ipv4/bpfilter/sockopt.c:34:5: error: symbol 'bpfilter_ip_get_sockopt' redeclared with different type
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The situation described in the comment can occur also with
PHY_IGNORE_INTERRUPT, therefore change the condition to include it.
Fixes: f555f34fdc ("net: phy: fix auto-negotiation stall due to unavailable interrupt")
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sudarsana Reddy Kalluru says:
====================
qed: Fix series II.
The patch series fixes few issues in the qed driver.
Please consider applying it to 'net' branch.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
FW hsi contains 256 approximation buckets which are split in ramrod into
eight u32 values, but driver is using eight 'unsigned long' variables.
This patch fixes the mcast logic by making the API utilize u32.
Fixes: 83aeb933 ("qed*: Trivial modifications")
Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's a possible race where driver can read link status in mid-transition
and see that virtual-link is up yet speed is 0. Since in this
mid-transition we're guaranteed to see a mailbox from MFW soon, we can
afford to treat this as link down.
Fixes: cc875c2e ("qed: Add link support")
Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Apparently, MFW publishes EEE capabilities even for Fiber-boards that don't
support them, and later since qed internally sets adv_caps it would cause
link-flap avoidance (LFA) to fail when driver would initiate the link.
This in turn delays the link, causing traffic to fail.
Driver has been modified to not to ask MFW for any EEE config if EEE isn't
to be enabled.
Fixes: 645874e5 ("qed: Add support for Energy efficient ethernet.")
Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a missing rcu_read_unlock in the error path
Fixes: c95567c803 ("caif: added check for potential null return")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For some time now, if you load the bonding driver and configure bond
parameters via sysfs using minimal config options, such as specifying
nothing but the mode, relying on defaults for everything else, modes
that cannot use arp monitoring (802.3ad, balance-tlb, balance-alb) all
wind up with both arp_interval=0 (as it should be) and miimon=0, which
means the miimon monitor thread never actually runs. This is particularly
problematic for 802.3ad.
For example, from an LNST recipe I've set up:
$ modprobe bonding max_bonds=0"
$ echo "+t_bond0" > /sys/class/net/bonding_masters"
$ ip link set t_bond0 down"
$ echo "802.3ad" > /sys/class/net/t_bond0/bonding/mode"
$ ip link set ens1f1 down"
$ echo "+ens1f1" > /sys/class/net/t_bond0/bonding/slaves"
$ ip link set ens1f0 down"
$ echo "+ens1f0" > /sys/class/net/t_bond0/bonding/slaves"
$ ethtool -i t_bond0"
$ ip link set ens1f1 up"
$ ip link set ens1f0 up"
$ ip link set t_bond0 up"
$ ip addr add 192.168.9.1/24 dev t_bond0"
$ ip addr add 2002::1/64 dev t_bond0"
This bond comes up okay, but things look slightly suspect in
/proc/net/bonding/t_bond0 output:
$ grep -i mii /proc/net/bonding/t_bond0
MII Status: up
MII Polling Interval (ms): 0
MII Status: up
MII Status: up
Now, pull a cable on one of the ports in the bond, then reconnect it, and
you'll see:
Slave Interface: ens1f0
MII Status: down
Speed: 1000 Mbps
Duplex: full
I believe this became a major issue as of commit 4d2c0cda07, which for
802.3ad bonds, sets slave->link = BOND_LINK_DOWN, with a comment about
relying on link monitoring via miimon to set it correctly, but since the
miimon work queue never runs, the link just stays marked down.
If we simply tweak bond_option_mode_set() slightly, we can check for the
non-arp modes having no miimon value set, and insert BOND_DEFAULT_MIIMON,
which gets things back in full working order. This problem exists as far
back as 4.14, and might be worth fixing in all stable trees since, though
the work-around is to simply specify an miimon value yourself.
Reported-by: Bob Ball <ball@umich.edu>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJbT+cLAAoJEEg/ir3gV/o+I5QH/3LQemGzH33iNsg4khpPeNA+
Q4mGd2jqbwfL17FTSGpTsPje6rpwzR+j8W1fGTx1vzYmE79ZyDu4EwHS7YZJcGyz
q8P0HgrUe4NrJV8mlOpbIRbTuSwfqultw2qRpmCfLf5kK1nqSIPpUHIfBUMqwy0o
O7GJrytUI4Av+r5Px/6bjb5kBaVe5YBe0tg8nSrN2vtzHVQWm+5/uaNRW2SrCN+4
5SI2AsWyMwfGCC+IE8i9OlIFCy6Iu2vwcUabK+6EeGKP4Wb6rukyG01TkQPSd7gy
ozcAjvj+ppHmVFath1uzLCFU3RbKt6GbVRGaFQg5jO5vvK3uzFJnm59Vqw/WzNs=
=UXsy
-----END PGP SIGNATURE-----
Merge tag 'mlx5-fixes-2018-07-18' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
Mellanox, mlx5 fixes 2018-07-18
The following series provides fixes to mlx5 core and net device driver.
Please pull and let me know if there's any problem.
For -stable v4.7
net/mlx5e: Don't allow aRFS for encapsulated packets
net/mlx5e: Fix quota counting in aRFS expire flow
For -stable v4.15
net/mlx5e: Only allow offloading decap egress (egdev) flows
net/mlx5e: Refine ets validation function
net/mlx5: Adjust clock overflow work period
For -stable v4.17
net/mlx5: E-Switch, UBSAN fix undefined behavior in mlx5_eswitch_mode
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2018-07-20
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix in BPF Makefile to detect llvm-objcopy in a more robust way which is
needed for pahole's BTF converter and minor UAPI tweaks in BTF_INT_BITS()
to shrink the mask before eventual UAPI freeze, from Martin.
2) Fix a segfault in bpftool when prog pin id has no further arguments such
as id value or file specified, from Taeung.
3) Fix powerpc JIT handling of XADD which has jumps to exit path that would
potentially bypass verifier expectations e.g. with subprog calls. Also add
a test case to make sure XADD is not mangling src/dst register, from Daniel.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The current code does not check sk->sk_shutdown & RCV_SHUTDOWN.
tls_sw_recvmsg may return a positive value in the case where bytes have
already been copied when the socket is shutdown. sk->sk_err has been
cleared, causing the tls_wait_data to hang forever on a subsequent
invocation. Checking sk->sk_shutdown & RCV_SHUTDOWN, as in tcp_recvmsg,
fixes this problem.
Fixes: c46234ebb4 ("tls: RX path for ktls")
Acked-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: Doron Roberts-Kedes <doronrk@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>