Commit Graph

890270 Commits

Author SHA1 Message Date
Petr Machata
23effa2479 mlxsw: reg: Add max_shaper_bs to QoS ETS Element Configuration
The QEEC register configures scheduling elements. One of the bits of
configuration is the burst size to use for the shaper installed on the
element. Add the necessary fields to support this configuration.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25 10:56:31 +01:00
Petr Machata
be1d5a8a77 mlxsw: spectrum_qdisc: Extract a common leaf unoffload function
When the RED Qdisc is unoffloaded, it needs to reduce the reported backlog
by the amount that is in the HW, so that only the SW backlog is contained
in the counter. The same thing will need to be done by TBF, and likely any
other leaf Qdisc as well.

Extract a helper mlxsw_sp_qdisc_leaf_unoffload() and call it from
mlxsw_sp_qdisc_red_unoffload().

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25 10:56:31 +01:00
Petr Machata
3d0d592193 mlxsw: spectrum_qdisc: Add mlxsw_sp_qdisc_get_class_stats()
Add a wrapper around mlxsw_sp_qdisc_collect_tc_stats() and
mlxsw_sp_qdisc_update_stats() for the simple case of doing both in one go:
mlxsw_sp_qdisc_get_class_stats(). Dispatch to that function from
mlxsw_sp_qdisc_get_red_stats(). This new function will be useful for other
leaf Qdiscs as well.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25 10:56:31 +01:00
Petr Machata
cf9af379cd mlxsw: spectrum_qdisc: Extract a per-TC stat function
Extract from mlxsw_sp_qdisc_get_prio_stats() two new functions:
mlxsw_sp_qdisc_collect_tc_stats() to accumulate stats for that one TC only,
and mlxsw_sp_qdisc_update_stats() that makes the stats relative to base
values stored earlier. Use them from mlxsw_sp_qdisc_get_red_stats().

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25 10:56:31 +01:00
Petr Machata
ef6aadcc76 net: sched: Make TBF Qdisc offloadable
Invoke ndo_setup_tc as appropriate to signal init / replacement, destroying
and dumping of TBF Qdisc.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25 10:56:31 +01:00
Petr Machata
c207015274 net: sched: sch_tbf: Don't overwrite backlog before dumping
In 2011, in commit b0460e4484 ("sch_tbf: report backlog information"),
TBF started copying backlog depth from the child Qdisc before dumping, with
the motivation that the backlog was otherwise not visible in "tc -s qdisc
show".

Later, in 2016, in commit 8d5958f424 ("sch_tbf: update backlog as well"),
TBF got a full-blown backlog tracking. However it kept copying the child's
backlog over before dumping.

That line is now unnecessary, so remove it.

As shown in the following example, backlog is still reported correctly:

    # tc -s qdisc show dev veth0 invisible
    qdisc tbf 1: root refcnt 2 rate 1Mbit burst 128Kb lat 82.8s
     Sent 505475370 bytes 406985 pkt (dropped 0, overlimits 812544 requeues 0)
     backlog 81972b 66p requeues 0
    qdisc bfifo 0: parent 1:1 limit 10Mb
     Sent 505475370 bytes 406985 pkt (dropped 0, overlimits 0 requeues 0)
     backlog 81972b 66p requeues 0

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25 10:56:30 +01:00
Michael Ellerman
3546d8f1bb net: cxgb3_main: Add CAP_NET_ADMIN check to CHELSIO_GET_MEM
The cxgb3 driver for "Chelsio T3-based gigabit and 10Gb Ethernet
adapters" implements a custom ioctl as SIOCCHIOCTL/SIOCDEVPRIVATE in
cxgb_extension_ioctl().

One of the subcommands of the ioctl is CHELSIO_GET_MEM, which appears
to read memory directly out of the adapter and return it to userspace.
It's not entirely clear what the contents of the adapter memory
contains, but the assumption is that it shouldn't be accessible to all
users.

So add a CAP_NET_ADMIN check to the CHELSIO_GET_MEM case. Put it after
the is_offload() check, which matches two of the other subcommands in
the same function which also check for is_offload() and CAP_NET_ADMIN.

Found by Ilja by code inspection, not tested as I don't have the
required hardware.

Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25 10:50:42 +01:00
David S. Miller
2f64ab27c8 Merge branch 'hv_netvsc-Add-XDP-support'
Haiyang Zhang says:

====================
hv_netvsc: Add XDP support

Add XDP support and update related document.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25 10:43:19 +01:00
Haiyang Zhang
12fa74383e hv_netvsc: Update document for XDP support
Added the new section in the document regarding XDP support
by hv_netvsc driver.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25 10:43:19 +01:00
Haiyang Zhang
351e158139 hv_netvsc: Add XDP support
This patch adds support of XDP in native mode for hv_netvsc driver, and
transparently sets the XDP program on the associated VF NIC as well.

Setting / unsetting XDP program on synthetic NIC (netvsc) propagates to
VF NIC automatically. Setting / unsetting XDP program on VF NIC directly
is not recommended, also not propagated to synthetic NIC, and may be
overwritten by setting of synthetic NIC.

The Azure/Hyper-V synthetic NIC receive buffer doesn't provide headroom
for XDP. We thought about re-use the RNDIS header space, but it's too
small. So we decided to copy the packets to a page buffer for XDP. And,
most of our VMs on Azure have Accelerated  Network (SRIOV) enabled, so
most of the packets run on VF NIC. The synthetic NIC is considered as a
fallback data-path. So the data copy on netvsc won't impact performance
significantly.

XDP program cannot run with LRO (RSC) enabled, so you need to disable LRO
before running XDP:
        ethtool -K eth0 lro off

XDP actions not yet supported:
        XDP_REDIRECT

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25 10:43:19 +01:00
Moshe Shemesh
6ec8b6cd79 devlink: Add health recover notifications on devlink flows
Devlink health recover notifications were added only on driver direct
updates of health_state through devlink_health_reporter_state_update().
Add notifications on updates of health_state by devlink flows of report
and recover.

Moved functions devlink_nl_health_reporter_fill() and
devlink_recover_notify() to avoid forward declaration.

Fixes: 97ff3bd37f ("devlink: add devink notification when reporter update health state")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25 10:34:42 +01:00
Florian Fainelli
148965df1a net: bcmgenet: Use netif_tx_napi_add() for TX NAPI
Before commit 7587935cfa ("net: bcmgenet: move NAPI initialization to
ring initialization") moved the code, this used to be
netif_tx_napi_add(), but we lost that small semantic change in the
process, restore that.

Fixes: 7587935cfa ("net: bcmgenet: move NAPI initialization to ring initialization")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Doug Berger <opendmb@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25 10:31:28 +01:00
Jon Maloy
61b1f2aff4 tipc: change maintainer email address
Reflecting new realities.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25 10:18:02 +01:00
Andy Shevchenko
53c6770095 net: fddi: skfp: Use print_hex_dump() helper
Use the print_hex_dump() helper, instead of open-coding the same operations.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25 10:15:49 +01:00
Andy Shevchenko
79ac522402 net: atm: use %*ph to print small buffer
Use %*ph format to print small buffer as hex string.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25 10:14:40 +01:00
Ajay Gupta
b9f0b2f634 net: stmmac: platform: fix probe for ACPI devices
Use generic device API to get phy mode to fix probe failure
with ACPI based devices.

Signed-off-by: Ajay Gupta <ajayg@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25 10:09:47 +01:00
Mat Martineau
edc7e4898d mptcp: Fix code formatting
checkpatch.pl had a few complaints in the last set of MPTCP patches:

ERROR: code indent should use tabs where possible
+^I         subflow, sk->sk_family, icsk->icsk_af_ops, target, mapped);$

CHECK: Comparison to NULL could be written "!new_ctx"
+	if (new_ctx == NULL) {

ERROR: "foo * bar" should be "foo *bar"
+static const struct proto_ops * tcp_proto_ops(struct sock *sk)

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25 08:14:39 +01:00
Florian Westphal
e42f1ac626 mptcp: do not inherit inet proto ops
We need to initialise the struct ourselves, else we expose tcp-specific
callbacks such as tcp_splice_read which will then trigger splat because
the socket is an mptcp one:

BUG: KASAN: slab-out-of-bounds in tcp_mstamp_refresh+0x80/0xa0 net/ipv4/tcp_output.c:57
Write of size 8 at addr ffff888116aa21d0 by task syz-executor.0/5478

CPU: 1 PID: 5478 Comm: syz-executor.0 Not tainted 5.5.0-rc6 #3
Call Trace:
 tcp_mstamp_refresh+0x80/0xa0 net/ipv4/tcp_output.c:57
 tcp_rcv_space_adjust+0x72/0x7f0 net/ipv4/tcp_input.c:612
 tcp_read_sock+0x622/0x990 net/ipv4/tcp.c:1674
 tcp_splice_read+0x20b/0xb40 net/ipv4/tcp.c:791
 do_splice+0x1259/0x1560 fs/splice.c:1205

To prevent build error with ipv6, add the recv/sendmsg function
declaration to ipv6.h.  The functions are already accessible "thanks"
to retpoline related work, but they are currently only made visible
by socket.c specific INDIRECT_CALLABLE macros.

Reported-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25 08:14:39 +01:00
Linus Torvalds
d5d359b0ac Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input fixes from Dmitry Torokhov:

 - add sanity checks to USB endpoints in various dirvers

 - max77650-onkey was missing an OF table which was preventing module
   autoloading

 - a revert and a different fix for F54 handling in Synaptics dirver

 - a fixup for handling register in pm8xxx vibrator driver

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: pm8xxx-vib - fix handling of separate enable register
  Input: keyspan-remote - fix control-message timeouts
  Input: max77650-onkey - add of_match table
  Input: rmi_f54 - read from FIFO in 32 byte blocks
  Revert "Input: synaptics-rmi4 - don't increment rmiaddr for SMBus transfers"
  Input: sur40 - fix interface sanity checks
  Input: gtco - drop redundant variable reinit
  Input: gtco - fix extra-descriptor debug message
  Input: gtco - fix endpoint sanity check
  Input: aiptek - use descriptors of current altsetting
  Input: aiptek - fix endpoint sanity check
  Input: pegasus_notetaker - fix endpoint sanity check
  Input: sun4i-ts - add a check for devm_thermal_zone_of_sensor_register
  Input: evdev - convert kzalloc()/vzalloc() to kvzalloc()
2020-01-24 19:27:42 -08:00
Tony Nguyen
31ad4e4ee1 ice: Allocate flow profile
Create an extraction sequence based on the packet header protocols to be
programmed and allocate a flow profile for the extraction sequence.

Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Henry Tieman <henry.w.tieman@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2020-01-24 16:06:32 -08:00
Tony Nguyen
c90ed40cef ice: Enable writing hardware filtering tables
Enable the driver to write the filtering hardware tables to allow for
changing of RSS rules. Upon loading of DDP package, a minimal configuration
should be written to hardware.

Introduce and initialize structures for storing configuration and make
the top level calls to configure the RSS tables to initial values. A packet
segment will be created but nothing is written to hardware yet.

Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Henry Tieman <henry.w.tieman@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2020-01-24 13:18:19 -08:00
Olof Johansson
6716cb162d Few minor fixes for omaps
Looks like we have wrong default memory size for beaglebone black,
 it has at least 512 MB of RAM and not 256 MB. This causes an issue
 when booted with GRUB2 that does not seem to pass memory info to
 the kernel.
 
 And for am43x-epos-evm the SPI pin directions need to be configured
 for SPI to work.
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCAAvFiEEkgNvrZJU/QSQYIcQG9Q+yVyrpXMFAl4rSQkRHHRvbnlAYXRv
 bWlkZS5jb20ACgkQG9Q+yVyrpXMSeg//f607jg+6RqkJ6dMczxBSHpkAYSdGBjpB
 nwa8O7jNbHfjtUGXTm+9Xx7sKXTR+600lobgMYguRlitTYNtC2jQG5ux92/6sTU/
 5qB6m0qkhMqOKDOI2AZ1wIcc6PP067M7x+il0Rs9/PRdIaDNUMHsNZ6+sxEWUR7N
 1vcYMz3PL8KW/ucAWxBO4NeMDwg5vi6dIILBl63sBlrH6AgYTVT8LqNhSysZWmXY
 Gj5JNIt+FjvhKGYNnD1jeFYX/+TQH10B8Ouepj6ixdcXaA4v0553S7S7R3HYR+l8
 +AFcWB1TMuvlNv5vbnVd+PUXhhyByVDQv15/92i+rKMXUxxdLZDuKeqa3/GbL/vI
 oU5ShNXWQ7C7g+SrxYfHT0OQz38j7TUOKrRJKBxFYykneoC33uwpSaAUTSN0PuXT
 MA7RJVBtjHe8tfbkII3UqqvHxmVB1TAkj9TE9y21Dznlq2fyq7w/13UNS67VCZ57
 4vO722ULOHVLnTmwH6gA+e1Cjhiz17fJJq3wpDTqqn0As4cuLOAio7PmHgNcm2YG
 Qg9MVQV2Rj3YGdxGVDKCfTOLt4tQzJ8qbN0o7XIp0CgyK7Sg2djcjzUMtJDx++Ez
 6tRmyfcY/GD5ykPfEUXvJcDWw0tIFSRUQMtXMi3vnuNh5mYFQ2ciQB5KkPBhyTST
 CaAAg8vCvek=
 =XXZc
 -----END PGP SIGNATURE-----

Merge tag 'omap-for-fixes-whenever-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into arm/fixes

Few minor fixes for omaps

Looks like we have wrong default memory size for beaglebone black,
it has at least 512 MB of RAM and not 256 MB. This causes an issue
when booted with GRUB2 that does not seem to pass memory info to
the kernel.

And for am43x-epos-evm the SPI pin directions need to be configured
for SPI to work.

* tag 'omap-for-fixes-whenever-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap:
  ARM: dts: am43x-epos-evm: set data pin directions for spi0 and spi1
  ARM: dts: am335x-boneblack-common: fix memory size

Link: https://lore.kernel.org/r/pull-1579895109-287828@atomide.com
Signed-off-by: Olof Johansson <olof@lixom.net>
2020-01-24 12:05:33 -08:00
Olof Johansson
088307d216 Fix OP-TEE compile error with nommu
-----BEGIN PGP SIGNATURE-----
 
 iQJOBAABCgA4FiEEcK3MsDvGvFp6zV9ztbC4QZeP7NMFAl4pbiIaHGplbnMud2lr
 bGFuZGVyQGxpbmFyby5vcmcACgkQtbC4QZeP7NP14w//ZIO2PDmn/f7w3qafb+Gx
 bXWqpqNgr/kC7T1Z21vyTozTyqW3ZXv3VpONGqyYSzD8skW80LPq6MjySTCBcCIS
 AcpooDIA/jXMa52lmDv8nNzr2IuULKGEZk4kBY4zzlzW6hrASUuLqo7W7qkN4xkh
 Tpu38r7k6FPir0R4oCphG4TPUS9B7vN0fmgN+bJXNFC1pn/fUzJa14k5FVPVqi8w
 Z71s2NXTV95dmRgNM5DQNXUbdS4TY5mhnZfdFCxo0f39cob7oTM4e0Ii1b280lQv
 xsiAfTbPAOPhchrMf12TqD53TvECUe/7SoAoaZpHSepkxQ2vqD+JBCkADfdTmPje
 FhCaTVix6eJLIMgvrFaiWfmKkuLwQY2k50nLDlIfHVgRZhsChrjsaP8gAJf8VYcN
 4zzmDeDm4dDewpp/+KjnUVXuxtYLI9Zc1QnKXyoRfuBBQuVH5pcSbCrt4cn8PPtk
 4KpW6Of+oKPt7uQSCEIejell8ew/X/WPFQN0dDaaJYtlT4kuityDWwNrebfdtZ/2
 1GeS5yrc+weG4Sl+ATK7wtAeIA84c4wYyI2NkJffZAikXmy9VkYmmOyJL6+SPAp4
 Ff7hvbVJatEeCeMcM44R8C8KmsCY5n4cz1SpxpACzrAxNiWzOfbsI0HCDeelFlVD
 wF6kGTYn3B3afDTcRdeRQGs=
 =JTLy
 -----END PGP SIGNATURE-----

Merge tag 'tee-optee-fix2-for-5.5' of https://git.linaro.org:/people/jens.wiklander/linux-tee into arm/fixes

Fix OP-TEE compile error with nommu

* tag 'tee-optee-fix2-for-5.5' of https://git.linaro.org:/people/jens.wiklander/linux-tee:
  tee: optee: Fix compilation issue with nommu

Link: https://lore.kernel.org/r/20200123101310.GA10320@jax
Signed-off-by: Olof Johansson <olof@lixom.net>
2020-01-24 12:05:08 -08:00
Tariq Toukan
342508c1c7 net/mlx5e: kTLS, Do not send decrypted-marked SKBs via non-accel path
When TCP out-of-order is identified (unexpected tcp seq mismatch), driver
analyzes the packet and decides what handling should it get:
1. go to accelerated path (to be encrypted in HW),
2. go to regular xmit path (send w/o encryption),
3. drop.

Packets marked with skb->decrypted by the TLS stack in the TX flow skips
SW encryption, and rely on the HW offload.
Verify that such packets are never sent un-encrypted on the wire.
Add a WARN to catch such bugs, and prefer dropping the packet in these cases.

Fixes: 46a3ea9807 ("net/mlx5e: kTLS, Enhance TX resync flow")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Reviewed-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-01-24 12:04:40 -08:00
Tariq Toukan
1e92899791 net/mlx5e: kTLS, Remove redundant posts in TX resync flow
The call to tx_post_resync_params() is done earlier in the flow,
the post of the control WQEs is unnecessarily repeated. Remove it.

Fixes: 700ec49742 ("net/mlx5e: kTLS, Fix missing SQ edge fill")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Reviewed-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-01-24 12:04:37 -08:00
Tariq Toukan
ffbd9ca94e net/mlx5e: kTLS, Fix corner-case checks in TX resync flow
There are the following cases:

1. Packet ends before start marker: bypass offload.
2. Packet starts before start marker and ends after it: drop,
   not supported, breaks contract with kernel.
3. packet ends before tls record info starts: drop,
   this packet was already acknowledged and its record info
   was released.

Add the above as comment in code.

Mind possible wraparounds of the TCP seq, replace the simple comparison
with a call to the TCP before() method.

In addition, remove logic that handles negative sync_len values,
as it became impossible.

Fixes: d2ead1f360 ("net/mlx5e: Add kTLS TX HW offload support")
Fixes: 46a3ea9807 ("net/mlx5e: kTLS, Enhance TX resync flow")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Reviewed-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-01-24 12:04:35 -08:00
Dmytro Linkin
3b83b6c2e0 net/mlx5e: Clear VF config when switching modes
Currently VF in LEGACY mode are not able to go up. Also in OFFLOADS
mode, when switching to it first time, VF can go up independently to
his representor, which is not expected.
Perform clearing of VF config when switching modes and set link state
to AUTO as default value. Also, when switching to OFFLOADS mode set
link state to DOWN, which allow VF link state to be controlled by its
REP.

Fixes: 1ab2068a4c ("net/mlx5: Implement vports admin state backup/restore")
Fixes: 556b9d16d3 ("net/mlx5: Clear VF's configuration on disabling SRIOV")
Signed-off-by: Dmytro Linkin <dmitrolin@mellanox.com>
Signed-off-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-01-24 12:04:32 -08:00
Erez Shitrit
c0702a4bd4 net/mlx5: DR, use non preemptible call to get the current cpu number
Use raw_smp_processor_id instead of smp_processor_id() otherwise we will
get the following trace in debug-kernel:
	BUG: using smp_processor_id() in preemptible [00000000] code: devlink
	caller is dr_create_cq.constprop.2+0x31d/0x970 [mlx5_core]
	Call Trace:
	dump_stack+0x9a/0xf0
	debug_smp_processor_id+0x1f3/0x200
	dr_create_cq.constprop.2+0x31d/0x970
	genl_family_rcv_msg+0x5fd/0x1170
	genl_rcv_msg+0xb8/0x160
	netlink_rcv_skb+0x11e/0x340

Fixes: 297cccebdc ("net/mlx5: DR, Expose an internal API to issue RDMA operations")
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-01-24 12:04:29 -08:00
Eli Cohen
e401a1848b net/mlx5: E-Switch, Prevent ingress rate configuration of uplink rep
Since the implementation relies on limiting the VF transmit rate to
simulate ingress rate limiting, and since either uplink representor or
ecpf are not associated with a VF, we limit the rate limit configuration
for those ports.

Fixes: fcb64c0f56 ("net/mlx5: E-Switch, add ingress rate support")
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-01-24 12:04:27 -08:00
Erez Shitrit
b850a82114 net/mlx5: DR, Enable counter on non-fwd-dest objects
The current code handles only counters that attached to dest, we still
have the cases where we have counter on non-dest, like over drop etc.

Fixes: 6a48faeeca ("net/mlx5: Add direct rule fs_cmd implementation")
Signed-off-by: Hamdan Igbaria <hamdani@mellanox.com>
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Reviewed-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-01-24 12:04:25 -08:00
Meir Lichtinger
505a7f5478 net/mlx5: Update the list of the PCI supported devices
Add the upcoming ConnectX-7 device ID.

Fixes: 85327a9c41 ("net/mlx5: Update the list of the PCI supported devices")
Signed-off-by: Meir Lichtinger <meirl@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-01-24 12:04:22 -08:00
Paul Blakey
93b8a7ecb7 net/mlx5: Fix lowest FDB pool size
The pool sizes represent the pool sizes in the fw. when we request
a pool size from fw, it will return the next possible group.
We track how many pools the fw has left and start requesting groups
from the big to the small.
When we start request 4k group, which doesn't exists in fw, fw
wants to allocate the next possible size, 64k, but will fail since
its exhausted. The correct smallest pool size in fw is 128 and not 4k.

Fixes: e52c280240 ("net/mlx5: E-Switch, Add chains and priorities")
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-01-24 12:04:20 -08:00
Praveen Chaudhary
189c9b1e94 net: Fix skb->csum update in inet_proto_csum_replace16().
skb->csum is updated incorrectly, when manipulation for
NF_NAT_MANIP_SRC\DST is done on IPV6 packet.

Fix:
There is no need to update skb->csum in inet_proto_csum_replace16(),
because update in two fields a.) IPv6 src/dst address and b.) L4 header
checksum cancels each other for skb->csum calculation. Whereas
inet_proto_csum_replace4 function needs to update skb->csum, because
update in 3 fields a.) IPv4 src/dst address, b.) IPv4 Header checksum
and c.) L4 header checksum results in same diff as L4 Header checksum
for skb->csum calculation.

[ pablo@netfilter.org: a few comestic documentation edits ]
Signed-off-by: Praveen Chaudhary <pchaudhary@linkedin.com>
Signed-off-by: Zhenggen Xu <zxu@linkedin.com>
Signed-off-by: Andy Stracner <astracner@linkedin.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-24 20:54:30 +01:00
Pablo Neira Ayuso
eb014de4fd netfilter: nf_tables: autoload modules from the abort path
This patch introduces a list of pending module requests. This new module
list is composed of nft_module_request objects that contain the module
name and one status field that tells if the module has been already
loaded (the 'done' field).

In the first pass, from the preparation phase, the netlink command finds
that a module is missing on this list. Then, a module request is
allocated and added to this list and nft_request_module() returns
-EAGAIN. This triggers the abort path with the autoload parameter set on
from nfnetlink, request_module() is called and the module request enters
the 'done' state. Since the mutex is released when loading modules from
the abort phase, the module list is zapped so this is iteration occurs
over a local list. Therefore, the request_module() calls happen when
object lists are in consistent state (after fulling aborting the
transaction) and the commit list is empty.

On the second pass, the netlink command will find that it already tried
to load the module, so it does not request it again and
nft_request_module() returns 0. Then, there is a look up to find the
object that the command was missing. If the module was successfully
loaded, the command proceeds normally since it finds the missing object
in place, otherwise -ENOENT is reported to userspace.

This patch also updates nfnetlink to include the reason to enter the
abort phase, which is required for this new autoload module rationale.

Fixes: ec7470b834 ("netfilter: nf_tables: store transaction list locally while requesting module")
Reported-by: syzbot+29125d208b3dae9a7019@syzkaller.appspotmail.com
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-24 20:54:29 +01:00
Pablo Neira Ayuso
826035498e netfilter: nf_tables: add __nft_chain_type_get()
This new helper function validates that unknown family and chain type
coming from userspace do not trigger an out-of-bound array access. Bail
out in case __nft_chain_type_get() returns NULL from
nft_chain_parse_hook().

Fixes: 9370761c56 ("netfilter: nf_tables: convert built-in tables/chains to chain types")
Reported-by: syzbot+156a04714799b1d480bc@syzkaller.appspotmail.com
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-24 20:54:28 +01:00
wenxu
c83de17dd6 netfilter: nf_tables_offload: fix check the chain offload flag
In the nft_indr_block_cb the chain should check the flag with
NFT_CHAIN_HW_OFFLOAD.

Fixes: 9a32669fec ("netfilter: nf_tables_offload: support indr block call")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-24 20:54:11 +01:00
Linus Torvalds
6381b44283 IOMMU Fixes for Linux v5.5-rc7
Two Fixes:
 
 	- Fix NULL-ptr dereference bug in Intel IOMMU driver
 
 	- Properly safe and restore AMD IOMMU performance counter
 	  registers when testing if they are writable.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAl4rKJ0ACgkQK/BELZcB
 GuM8BxAAse7i4EAB0DZ26aDVU/ZZAGBTc5pH+7IOvuGFWpu3Ik7siltMZKTa56XD
 5reQl1hUHyBesdwNcd+ZQz/0QzgMu0zEcxJtzLD5nbk2Rp1pDYab84+wA0oqftmN
 7TBqPJjqGYfnfkobPW+4SaiCSxM6XKEXYaJvXqF6dsz5fLHHqTVjGyEXuu0pWey5
 1AG1nNWey5TobR39ZUNVIGUEf4ftN8tOGutjGKch29OmMg1UkHZjDyYpVJurjSUl
 1YDf0WA2X5cUWzRg519o9Uxs+5y/tBItS2qMJ/Uwh1fp/kOYrGoEPXRH9brU3owh
 j8kVKyz/cNlvhdJN/QPVq/A+GBgzAq+u9FUGH68LBVx05cuCRGWb6qP/blEZWUSA
 zQvbHPIPqPBK0pYtbVCgJFCpOS5C6X33RMgbRr3XncAItmfEp2FH7tcHIsDZUzmO
 EGBgSPrhRFJrgfa3hopc93L4OEHmGy+20o7oT4kk0SBcHVDSyDZVBu1lX+/gAZKd
 ZoV5FyF9KNB+qQQ6E+vpL5gLh7BWiBgOI41Bdw5ut09I0gYURPdFWUPHRFwHK1/1
 KTSWLyuLkS4VxdhcwhA30eVmSrbfocPPwuUxcGn2p1YnaycFkDNoJnhifvFiOS6w
 tFOJ3wEVL2OX7On6U3jviVy7MQapX3MvWrWKlgg/UoMv36dbqRA=
 =ViDp
 -----END PGP SIGNATURE-----

Merge tag 'iommu-fixes-v5.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

Pull iommu fixes from Joerg Roedel:
 "Two fixes:

   - Fix NULL-ptr dereference bug in Intel IOMMU driver

   - Properly save and restore AMD IOMMU performance counter registers
     when testing if they are writable"

* tag 'iommu-fixes-v5.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu/amd: Fix IOMMU perf counter clobbering during init
  iommu/vt-d: Call __dmar_remove_one_dev_info with valid pointer
2020-01-24 09:56:59 -08:00
Linus Torvalds
3c45d7510c powerpc fixes for 5.5 #6
Fix our hash MMU code to avoid having overlapping ids between user and kernel,
 which isn't as bad as it sounds but led to crashes on some machines.
 
 A fix for the Power9 XIVE interrupt code, which could return the wrong interrupt
 state in obscure error conditions.
 
 A minor Kconfig fix for the recently added CONFIG_PPC_UV code.
 
 Thanks to:
   Aneesh Kumar K.V, Bharata B Rao, Cédric Le Goater, Frederic Barrat.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl4q3iMTHG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgM1fEAC7YefQZsuIstV2bNH9xFJe/oEOSLWt
 DqZYfWZRBRZRdRMAP7+hmXpsLfHAB3QSGv4yEc2Pgm5Y5vYuv8MWqEKkpJoivIPW
 /Fu7mNGI06IWinMyl59dLA5fXPNeOzSNcdPnBdIUC/62XU414IGsy273X3LN4YiM
 hfXSpPA3tpSjVBTeqfKTJowXWMOvbkwiHMs2ziAu4Y9GDauJsU0UiOxMkl2MYT7Z
 PULO/SrjYSbP8+MazD/YdJ5k7F4rNV3bVDPUCU305tKIkrqRmBs+8DXjeUAMWVjj
 y2huUZ/0XXsr8QZyyGfMhEC1sJKAD3colG4LHUysbWz1oInY93CvXELdPAMXhqrH
 xXTzx9ae74Xk7t7WaGgq5xJGHJCDSg1aA2HPkUg6Gbk6iJIztOCkKMdSPvFMo18S
 p+Kjis8db83RPXQm3ooW7s/xhp6AaDiXd8khQ7A2efgt4SDTSPFvgv2q4CMDC1lP
 njchgzgPJ635MSd3CC/u6i7eViuUylb6SfhFVYY0DqOOvNJVrj1W6M4N7tpWSrRS
 PXJI2/NwlL3WIajKO3KfqTRI7QaSLtccTrXGLiGBxx/ooAowP94/WL3CYJAeuO2F
 NmA1GXH6/WXvzMMNm5NB0kCOUOOhtJk/MFuO7Agly6YL9p36fqVz4WIJVAqtwJLO
 uPu0xASJCgp9Mg==
 =r/RM
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.5-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "Some more powerpc fixes for 5.5:

   - Fix our hash MMU code to avoid having overlapping ids between user
     and kernel, which isn't as bad as it sounds but led to crashes on
     some machines.

   - A fix for the Power9 XIVE interrupt code, which could return the
     wrong interrupt state in obscure error conditions.

   - A minor Kconfig fix for the recently added CONFIG_PPC_UV code.

  Thanks to Aneesh Kumar K.V, Bharata B Rao, Cédric Le Goater, Frederic
  Barrat"

* tag 'powerpc-5.5-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/mm/hash: Fix sharing context ids between kernel & userspace
  powerpc/xive: Discard ESB load value when interrupt is invalid
  powerpc: Ultravisor: Fix the dependencies for CONFIG_PPC_UV
2020-01-24 09:49:20 -08:00
Linus Torvalds
274adbff45 drm fixes for 5.5-rc8
core/mst:
 - Fix SST branch device handling
 
 amdgpu:
 - enable renoir outside experimental
 
 i915:
 - Avoid overflow with huge userptr objects
 - uAPI fix to correctly handle negative values in
   engine->uabi_class/instance (cc: stable)
 
 panfrost:
 - Fix mapping of globally visible BO's (Boris)
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJeKoezAAoJEAx081l5xIa+hRMP/jC45Om4YDtsoH+75v41mYfO
 yUt1jBsCDXHAvPjBbRcoM2kbZU+ZcaDR4EZS7lA7QTW5CAosqzxrGdkzGobjmj5v
 TN/0BWWyQYtQ+lbt/pUxM+kktaV9IkjZIAUlW3msCwKsYE+nPIWfXDnW2ioFpitk
 sXk3vZYC4Lh6IVJ5PG879nz5XPpo2eb4sYYQ4PTFj65u436NEd8AFqmcDzp64Jhb
 xh1jLSxeS/JROyb/AmTHt0PgrGB5B08U8V7CAJXhIgf7zM28iiA3ynanoO3PauYb
 3uPg8eVdP2tOqcpplV5CBkiw/in2sPgRprMEg66cKYzo9rieWu7nPL+1zmaxD16s
 NIXSmov4Sm7yOS4LAGAckS12ejniy7L6OcP5N2S2Gz0HpIMXluBfLw3d4ZiIW5Q5
 yKdpmrwTfl0IfTiblweaHi8uT4K19vSqoXUf4bFa4UxqeYOtipHnNP907+nDs0JN
 GiHjIvrQJ4KH/A/JAkjEyNzHRAqJhJke84E519UP+mI+8paqzftsbDvf0ADA7nzZ
 q6TbZpTbEPV+NqnaH6n8volfh/WougW2BpxozMO3UP/474AZq1bXJIXi45ksatRW
 Cq0KQP8iI//C0xhCDPs7LQ/XzgkcQtDPZ1Nx6vWjrn6O3DMjKCm0RxZgfKj5CfbJ
 PovYqZQXJ01989DfXfQv
 =K7TY
 -----END PGP SIGNATURE-----

Merge tag 'drm-fixes-2020-01-24' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Dave Airlie:
 "This one has a core mst fix and two i915 fixes. amdgpu just enables
  some hw outside experimental.

  The panfrost fix is a little bigger than I'd like at this stage but it
  fixes a fairly fundamental problem with global shared buffers in that
  driver, and since it's confined to that driver and I've taken a look
  at it, I think it's fine to get into the tree now, so it can get
  stable propagated as well.

  core/mst:
   - Fix SST branch device handling

  amdgpu:
   - enable renoir outside experimental

  i915:
   - Avoid overflow with huge userptr objects
   - uAPI fix to correctly handle negative values in
     engine->uabi_class/instance (cc: stable)

  panfrost:
   - Fix mapping of globally visible BO's (Boris)"

* tag 'drm-fixes-2020-01-24' of git://anongit.freedesktop.org/drm/drm:
  drm/amdgpu: remove the experimental flag for renoir
  drm/panfrost: Add the panfrost_gem_mapping concept
  drm/i915: Align engine->uabi_class/instance with i915_drm.h
  drm/i915/userptr: fix size calculation
  drm/dp_mst: Handle SST-only branch device case
2020-01-24 09:38:04 -08:00
Christophe Leroy
ab10ae1c3b lib: Reduce user_access_begin() boundaries in strncpy_from_user() and strnlen_user()
The range passed to user_access_begin() by strncpy_from_user() and
strnlen_user() starts at 'src' and goes up to the limit of userspace
although reads will be limited by the 'count' param.

On 32 bits powerpc (book3s/32) access has to be granted for each
256Mbytes segment and the cost increases with the number of segments to
unlock.

Limit the range with 'count' param.

Fixes: 594cc251fd ("make 'user_access_begin()' do 'access_ok()'")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-24 09:27:34 -08:00
Jiri Wiesner
ab658b9fa7 netfilter: conntrack: sctp: use distinct states for new SCTP connections
The netlink notifications triggered by the INIT and INIT_ACK chunks
for a tracked SCTP association do not include protocol information
for the corresponding connection - SCTP state and verification tags
for the original and reply direction are missing. Since the connection
tracking implementation allows user space programs to receive
notifications about a connection and then create a new connection
based on the values received in a notification, it makes sense that
INIT and INIT_ACK notifications should contain the SCTP state
and verification tags available at the time when a notification
is sent. The missing verification tags cause a newly created
netfilter connection to fail to verify the tags of SCTP packets
when this connection has been created from the values previously
received in an INIT or INIT_ACK notification.

A PROTOINFO event is cached in sctp_packet() when the state
of a connection changes. The CLOSED and COOKIE_WAIT state will
be used for connections that have seen an INIT and INIT_ACK chunk,
respectively. The distinct states will cause a connection state
change in sctp_packet().

Signed-off-by: Jiri Wiesner <jwiesner@suse.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-24 18:26:53 +01:00
Shuah Khan
8c17bbf6c8 iommu/amd: Fix IOMMU perf counter clobbering during init
init_iommu_perf_ctr() clobbers the register when it checks write access
to IOMMU perf counters and fails to restore when they are writable.

Add save and restore to fix it.

Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Fixes: 30861ddc9c ("perf/x86/amd: Add IOMMU Performance Counter resource management")
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Tested-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2020-01-24 15:28:40 +01:00
Jerry Snitselaar
bf708cfb2f iommu/vt-d: Call __dmar_remove_one_dev_info with valid pointer
It is possible for archdata.iommu to be set to
DEFER_DEVICE_DOMAIN_INFO or DUMMY_DEVICE_DOMAIN_INFO so check for
those values before calling __dmar_remove_one_dev_info. Without a
check it can result in a null pointer dereference. This has been seen
while booting a kdump kernel on an HP dl380 gen9.

Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: stable@vger.kernel.org # 5.3+
Cc: linux-kernel@vger.kernel.org
Fixes: ae23bfb68f ("iommu/vt-d: Detach domain before using a private one")
Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>
Acked-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2020-01-24 15:23:50 +01:00
Qu Wenruo
1bbb97b8ce btrfs: scrub: Require mandatory block group RO for dev-replace
[BUG]
For dev-replace test cases with fsstress, like btrfs/06[45] btrfs/071,
looped runs can lead to random failure, where scrub finds csum error.

The possibility is not high, around 1/20 to 1/100, but it's causing data
corruption.

The bug is observable after commit b12de52896 ("btrfs: scrub: Don't
check free space before marking a block group RO")

[CAUSE]
Dev-replace has two source of writes:

- Write duplication
  All writes to source device will also be duplicated to target device.

  Content:	Not yet persisted data/meta

- Scrub copy
  Dev-replace reused scrub code to iterate through existing extents, and
  copy the verified data to target device.

  Content:	Previously persisted data and metadata

The difference in contents makes the following race possible:
	Regular Writer		|	Dev-replace
-----------------------------------------------------------------
  ^                             |
  | Preallocate one data extent |
  | at bytenr X, len 1M		|
  v				|
  ^ Commit transaction		|
  | Now extent [X, X+1M) is in  |
  v commit root			|
 ================== Dev replace starts =========================
  				| ^
				| | Scrub extent [X, X+1M)
				| | Read [X, X+1M)
				| | (The content are mostly garbage
				| |  since it's preallocated)
  ^				| v
  | Write back happens for	|
  | extent [X, X+512K)		|
  | New data writes to both	|
  | source and target dev.	|
  v				|
				| ^
				| | Scrub writes back extent [X, X+1M)
				| | to target device.
				| | This will over write the new data in
				| | [X, X+512K)
				| v

This race can only happen for nocow writes. Thus metadata and data cow
writes are safe, as COW will never overwrite extents of previous
transaction (in commit root).

This behavior can be confirmed by disabling all fallocate related calls
in fsstress (*), then all related tests can pass a 2000 run loop.

*: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
		   -f collapse=0 -f punch=0 -f resvsp=0"
   I didn't expect resvsp ioctl will fallback to fallocate in VFS...

[FIX]
Make dev-replace to require mandatory block group RO, and wait for current
nocow writes before calling scrub_chunk().

This patch will mostly revert commit 76a8efa171 ("btrfs: Continue replace
when set_block_ro failed") for dev-replace path.

The side effect is, dev-replace can be more strict on avaialble space, but
definitely worth to avoid data corruption.

Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 76a8efa171 ("btrfs: Continue replace when set_block_ro failed")
Fixes: b12de52896 ("btrfs: scrub: Don't check free space before marking a block group RO")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-24 14:35:56 +01:00
David S. Miller
08a45c59f1 Merge branch 'mptcp-part-two'
Christoph Paasch says:

====================
Multipath TCP part 2: Single subflow & RFC8684 support

v2 -> v3: Added RFC8684-style handshake (see below fore more details) and some minor fixes
v1 -> v2: Rebased on latest "Multipath TCP: Prerequisites" v3 series

This set adds MPTCP connection establishment, writing & reading MPTCP
options on data packets, a sysctl to allow MPTCP per-namespace, and self
tests. This is sufficient to establish and maintain a connection with a
MPTCP peer, but will not yet allow or initiate establishment of
additional MPTCP subflows.

We also add the necessary code for the RFC8684-style handshake.
RFC8684 obsoletes the experimental RFC6824 and makes MPTCP move-on to
version 1.

Originally our plan was to submit single-subflow and RFC8684 support in
two patchsets, but to simplify the merging-process and ensure that a coherent
MPTCP-version lands in Linux we decided to merge the two sets into a single
one.

The MPTCP patchset exclusively supports RFC 8684. Although all MPTCP
deployments are currently based on RFC 6824, future deployments will be
migrating to MPTCP version 1. 3GPP's 5G standardization also solely supports
RFC 8684. In addition, we believe that this initial submission of MPTCP will be
cleaner by solely supporting RFC 8684. If later on support for the old
MPTCP-version is required it can always be added in the future.

The major difference between RFC 8684 and RFC 6824 is that it has a better
support for servers using TCP SYN-cookies by reliably retransmitting the
MP_CAPABLE option.

Before ending this cover letter with some refs, it is worth mentioning
that we promise David Miller that merging this series will be rewarded by
Twitter dopamine hits :-D

Clone/fetch:
https://github.com/multipath-tcp/mptcp_net-next.git (tag: netdev-v3-part2)

Browse:
https://github.com/multipath-tcp/mptcp_net-next/tree/netdev-v3-part2

Thank you for your review. You can find us at mptcp@lists.01.org and
https://is.gd/mptcp_upstream
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:08 +01:00
Paolo Abeni
8ab183deb2 mptcp: cope with later TCP fallback
With MPTCP v1, passive connections can fallback to TCP after the
subflow becomes established:

syn + MP_CAPABLE ->
               <- syn, ack + MP_CAPABLE

ack, seq = 3    ->
        // OoO packet is accepted because in-sequence
        // passive socket is created, is in ESTABLISHED
	// status and tentatively as MP_CAPABLE

ack, seq = 2     ->
        // no MP_CAPABLE opt, subflow should fallback to TCP

We can't use the 'subflow' socket fallback, as we don't have
it available for passive connection.

Instead, when the fallback is detected, replace the mptcp
socket with the underlying TCP subflow. Beyond covering
the above scenario, it makes a TCP fallback socket as efficient
as plain TCP ones.

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:08 +01:00
Christoph Paasch
d22f4988ff mptcp: process MP_CAPABLE data option
This patch implements the handling of MP_CAPABLE + data option, as per
RFC 6824 bis / RFC 8684: MPTCP v1.

On the server side we can receive the remote key after that the connection
is established. We need to explicitly track the 'missing remote key'
status and avoid emitting a mptcp ack until we get such info.

When a late/retransmitted/OoO pkt carrying MP_CAPABLE[+data] option
is received, we have to propagate the mptcp seq number info to
the msk socket. To avoid ABBA locking issue, explicitly check for
that in recvmsg(), where we own msk and subflow sock locks.

The above also means that an established mp_capable subflow - still
waiting for the remote key - can be 'downgraded' to plain TCP.

Such change could potentially block a reader waiting for new data
forever - as they hook to msk, while later wake-up after the downgrade
will be on subflow only.

The above issue is not handled here, we likely have to get rid of
msk->fallback to handle that cleanly.

Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:08 +01:00
Christoph Paasch
cc7972ea19 mptcp: parse and emit MP_CAPABLE option according to v1 spec
This implements MP_CAPABLE options parsing and writing according
to RFC 6824 bis / RFC 8684: MPTCP v1.

Local key is sent on syn/ack, and both keys are sent on 3rd ack.
MP_CAPABLE messages len are updated accordingly. We need the skbuff to
correctly emit the above, so we push the skbuff struct as an argument
all the way from tcp code to the relevant mptcp callbacks.

When processing incoming MP_CAPABLE + data, build a full blown DSS-like
map info, to simplify later processing.  On child socket creation, we
need to record the remote key, if available.

Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:08 +01:00
Paolo Abeni
65492c5a6a mptcp: move from sha1 (v0) to sha256 (v1)
For simplicity's sake use directly sha256 primitives (and pull them
as a required build dep).
Add optional, boot-time self-tests for the hmac function.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:08 +01:00
Florian Westphal
048d19d444 mptcp: add basic kselftest for mptcp
Add mptcp_connect tool:
xmit two files back and forth between two processes, several net
namespaces including some adding delays, losses and reordering.
Wrapper script tests that data was transmitted without corruption.

The "-c" command line option for mptcp_connect.sh is there for debugging:

The script will use tcpdump to create one .pcap file per test case, named
according to the namespaces, protocols, and connect address in use.
For example, the first test case writes the capture to
ns1-ns1-MPTCP-MPTCP-10.0.1.1.pcap.

The stderr output from tcpdump is printed after the test completes to
show tcpdump's "packets dropped by kernel" information.

Also check that userspace can't create MPTCP sockets when mptcp.enabled
sysctl is off.

The "-b" option allows to tune/lower send buffer size.
"-m mmap" can be used to test blocking io.  Default is non-blocking
io using read/write/poll.

Will run automatically on "make kselftest".

Note that the default timeout of 45 seconds is used even if there is a
"settings" changing it to 450. 45 seconds should be enough in most cases
but this depends on the machine running the tests.

A fix to correctly read the "settings" file has been proposed upstream
but not applied yet. It is not blocking the execution of these new tests
but it would be nice to have it:

  https://patchwork.kernel.org/patch/11204935/

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:08 +01:00