Commit Graph

53938 Commits

Author SHA1 Message Date
Willem de Bruijn
97ef7b4c55 ip: silence udp zerocopy smatch false positive
extra_uref is used in __ip(6)_append_data only if uarg is set.

Smatch sees that the variable is passed to sock_zerocopy_put_abort.
This function accesses it only when uarg is set, but smatch cannot
infer this.

Make this dependency explicit.

Fixes: 52900d2228 ("udp: elide zerocopy operation in hot path")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-08 12:26:20 -08:00
yupeng
0fbe82e628 net: call sk_dst_reset when set SO_DONTROUTE
after set SO_DONTROUTE to 1, the IP layer should not route packets if
the dest IP address is not in link scope. But if the socket has cached
the dst_entry, such packets would be routed until the sk_dst_cache
expires. So we should clean the sk_dst_cache when a user set
SO_DONTROUTE option. Below are server/client python scripts which
could reprodue this issue:

server side code:

==========================================================================
import socket
import struct
import time

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('0.0.0.0', 9000))
s.listen(1)
sock, addr = s.accept()
sock.setsockopt(socket.SOL_SOCKET, socket.SO_DONTROUTE, struct.pack('i', 1))
while True:
    sock.send(b'foo')
    time.sleep(1)
==========================================================================

client side code:
==========================================================================
import socket
import time

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('server_address', 9000))
while True:
    data = s.recv(1024)
    print(data)
==========================================================================

Signed-off-by: yupeng <yupeng0921@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-07 16:11:54 -08:00
David Ahern
58956317c8 neighbor: Improve garbage collection
The existing garbage collection algorithm has a number of problems:

1. The gc algorithm will not evict PERMANENT entries as those entries
   are managed by userspace, yet the existing algorithm walks the entire
   hash table which means it always considers PERMANENT entries when
   looking for entries to evict. In some use cases (e.g., EVPN) there
   can be tens of thousands of PERMANENT entries leading to wasted
   CPU cycles when gc kicks in. As an example, with 32k permanent
   entries, neigh_alloc has been observed taking more than 4 msec per
   invocation.

2. Currently, when the number of neighbor entries hits gc_thresh2 and
   the last flush for the table was more than 5 seconds ago gc kicks in
   walks the entire hash table evicting *all* entries not in PERMANENT
   or REACHABLE state and not marked as externally learned. There is no
   discriminator on when the neigh entry was created or if it just moved
   from REACHABLE to another NUD_VALID state (e.g., NUD_STALE).

   It is possible for entries to be created or for established neighbor
   entries to be moved to STALE (e.g., an external node sends an ARP
   request) right before the 5 second window lapses:

        -----|---------x|----------|-----
            t-5         t         t+5

   If that happens those entries are evicted during gc causing unnecessary
   thrashing on neighbor entries and userspace caches trying to track them.

   Further, this contradicts the description of gc_thresh2 which says
   "Entries older than 5 seconds will be cleared".

   One workaround is to make gc_thresh2 == gc_thresh3 but that negates the
   whole point of having separate thresholds.

3. Clearing *all* neigh non-PERMANENT/REACHABLE/externally learned entries
   when gc_thresh2 is exceeded is over kill and contributes to trashing
   especially during startup.

This patch addresses these problems as follows:

1. Use of a separate list_head to track entries that can be garbage
   collected along with a separate counter. PERMANENT entries are not
   added to this list.

   The gc_thresh parameters are only compared to the new counter, not the
   total entries in the table. The forced_gc function is updated to only
   walk this new gc_list looking for entries to evict.

2. Entries are added to the list head at the tail and removed from the
   front.

3. Entries are only evicted if they were last updated more than 5 seconds
   ago, adhering to the original intent of gc_thresh2.

4. Forced gc is stopped once the number of gc_entries drops below
   gc_thresh2.

5. Since gc checks do not apply to PERMANENT entries, gc levels are skipped
   when allocating a new neighbor for a PERMANENT entry. By extension this
   means there are no explicit limits on the number of PERMANENT entries
   that can be created, but this is no different than FIB entries or FDB
   entries.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-07 16:03:10 -08:00
Petr Machata
43920edf3b bridge: Add br_fdb_clear_offload()
When a driver unoffloads all FDB entries en bloc, it's inefficient to
send the switchdev notification one by one. Add a helper that unsets the
offload flag on FDB entries on a given bridge port and VLAN.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-07 12:59:08 -08:00
Petr Machata
40c900aa1f net: core: dev: Attach extack to NETDEV_PRE_UP
Drivers may need to validate configuration of a device that's about to
be upped. Should the validation fail, there's currently no way to
communicate details of the failure to the user, beyond an error number.

To mend that, change __dev_open() to take an extack argument and pass it
from __dev_change_flags() and dev_open(), where it was propagated in the
previous patches.

Change __dev_open() to call call_netdevice_notifiers_extack() so that
the passed-in extack is attached to the NETDEV_PRE_UP notifier.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-06 13:26:07 -08:00
Petr Machata
2637260534 net: core: dev: Add call_netdevice_notifiers_extack()
In order to propagate extack through NETDEV_PRE_UP, add a new function
call_netdevice_notifiers_extack() that primes the extack field of the
notifier info. Convert call_netdevice_notifiers() to a simple wrapper
around the new function that passes NULL for extack.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-06 13:26:07 -08:00
Petr Machata
6d0403216d net: core: dev: Add extack argument to __dev_change_flags()
In order to pass extack together with NETDEV_PRE_UP notifications, it's
necessary to route the extack to __dev_open() from diverse (possibly
indirect) callers. The last missing API is __dev_change_flags().

Therefore extend __dev_change_flags() with and extra extack argument and
update the two existing users.

Since the function declaration line is changed anyway, name the struct
net_device argument to placate checkpatch.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-06 13:26:07 -08:00
Petr Machata
567c5e13be net: core: dev: Add extack argument to dev_change_flags()
In order to pass extack together with NETDEV_PRE_UP notifications, it's
necessary to route the extack to __dev_open() from diverse (possibly
indirect) callers. One prominent API through which the notification is
invoked is dev_change_flags().

Therefore extend dev_change_flags() with and extra extack argument and
update all users. Most of the calls end up just encoding NULL, but
several sites (VLAN, ipvlan, VRF, rtnetlink) do have extack available.

Since the function declaration line is changed anyway, name the other
function arguments to placate checkpatch.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-06 13:26:07 -08:00
Petr Machata
00f54e6892 net: core: dev: Add extack argument to dev_open()
In order to pass extack together with NETDEV_PRE_UP notifications, it's
necessary to route the extack to __dev_open() from diverse (possibly
indirect) callers. One prominent API through which the notification is
invoked is dev_open().

Therefore extend dev_open() with and extra extack argument and update
all users. Most of the calls end up just encoding NULL, but bond and
team drivers have the extack readily available.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-06 13:26:06 -08:00
Pedro Tammela
fdb8b29867 tcp: fix code style in tcp_recvmsg()
2 goto labels are indented with a tab. remove the tabs and
keep the code style consistent.

Signed-off-by: Pedro Tammela <pctammela@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-06 12:19:47 -08:00
Andrew Lunn
dc0fe7d47f net: dsa: Set the master device's MTU to account for DSA overheads
DSA tagging of frames sent over the master interface to the switch
increases the size of the frame. Such frames can then be bigger than
the normal MTU of the master interface, and it may drop them. Use the
overhead information from the tagger to set the MTU of the master
device to include this overhead.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-06 12:18:17 -08:00
Andrew Lunn
a5dd308778 net: dsa: Add overhead to tag protocol ops.
Each DSA tag protocol needs to add additional headers to the Ethernet
frame in order to direct it towards a specific switch egress port. It
must also remove the head from a frame received from a
switch. Indicate the maximum size of these headers in the tag protocol
ops structure, so the core can take these overheads into account.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-06 12:18:16 -08:00
David Ahern
7a35a50df5 neighbor: Add extack messages for add and delete commands
Add extack messages for failures in neigh_add and neigh_delete.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-05 20:53:19 -08:00
Hoang Le
f5d6c3e5a3 tipc: fix node keep alive interval calculation
When setting LINK tolerance, node timer interval will be calculated
base on the LINK with lowest tolerance.

But when calculated, the old node timer interval only updated if current
setting value (tolerance/4) less than old ones regardless of number of
links as well as links' lowest tolerance value.

This caused to two cases missing if tolerance changed as following:
Case 1:
1.1/ There is one link (L1) available in the system
1.2/ Set L1's tolerance from 1500ms => lower (i.e 500ms)
1.3/ Then, fallback to default (1500ms) or higher (i.e 2000ms)

Expected:
    node timer interval is 1500/4=375ms after 1.3

Result:
node timer interval will not being updated after changing tolerance at 1.3
since its value 1500/4=375ms is not less than 500/4=125ms at 1.2.

Case 2:
2.1/ There are two links (L1, L2) available in the system
2.2/ L1 and L2 tolerance value are 2000ms as initial
2.3/ Set L2's tolerance from 2000ms => lower 1500ms
2.4/ Disable link L2 (bring down its bearer)

Expected:
    node timer interval is 2000ms/4=500ms after 2.4

Result:
node timer interval will not being updated after disabling L2 since
its value 2000ms/4=500ms is still not less than 1500/4=375ms at 2.3
although L2 is already not available in the system.

To fix this, we start the node interval calculation by initializing it to
a value larger than any conceivable calculated value. This way, the link
with the lowest tolerance will always determine the calculated value.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-05 20:52:31 -08:00
Peter Oskolkov
d66280b12b net: netem: use a list in addition to rbtree
When testing high-bandwidth TCP streams with large windows,
high latency, and low jitter, netem consumes a lot of CPU cycles
doing rbtree rebalancing.

This patch uses a linear list/queue in addition to the rbtree:
if an incoming packet is past the tail of the linear queue, it is
added there, otherwise it is inserted into the rbtree.

Without this patch, perf shows netem_enqueue, netem_dequeue,
and rb_* functions among the top offenders. With this patch,
only netem_enqueue is noticeable if jitter is low/absent.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-05 20:18:41 -08:00
Nikolay Aleksandrov
d08c6bc08f net: bridge: increase multicast's default maximum number of entries
bridge's default hash_max was 512 which is rather conservative, now that
we're using the generic rhashtable API which autoshrinks let's increase
it to 4096 and move it to a define in br_private.h.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-05 17:01:51 -08:00
Nikolay Aleksandrov
cf332bca56 net: bridge: mark hash_elasticity as obsolete
Now that the bridge multicast uses the generic rhashtable interface we
can drop the hash_elasticity option as that is already done for us and
it's hardcoded to a maximum of RHT_ELASTICITY (16 currently). Add a
warning about the obsolete option when the hash_elasticity is set.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-05 17:01:51 -08:00
Nikolay Aleksandrov
4329596cb1 net: bridge: multicast: use non-bh rcu flavor
The bridge multicast code has been using a mix of RCU and RCU-bh flavors
sometimes in questionable way. Since we've moved to rhashtable just use
non-bh RCU everywhere. In addition this simplifies freeing of objects
and allows us to remove some unnecessary callback functions.

v3: new patch

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-05 17:01:51 -08:00
Nikolay Aleksandrov
19e3a9c90c net: bridge: convert multicast to generic rhashtable
The bridge multicast code currently uses a custom resizable hashtable
which predates the generic rhashtable interface. It has many
shortcomings compared and duplicates functionality that is presently
available via the generic rhashtable, so this patch removes the custom
rhashtable implementation in favor of the kernel's generic rhashtable.
The hash maximum is kept and the rhashtable's size is used to do a loose
check if it's reached in which case we revert to the old behaviour and
disable further bridge multicast processing. Also now we can support any
hash maximum, doesn't need to be a power of 2.

v3: add non-rcu br_mdb_get variant and use it where multicast_lock is
    held to avoid RCU splat, drop hash_max function and just set it
    directly

v2: handle when IGMP snooping is undefined, add br_mdb_init/uninit
    placeholders

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-05 17:01:51 -08:00
Eric Dumazet
a74f0fa082 tcp: reduce POLLOUT events caused by TCP_NOTSENT_LOWAT
TCP_NOTSENT_LOWAT socket option or sysctl was added in linux-3.12
as a step to enable bigger tcp sndbuf limits.

It works reasonably well, but the following happens :

Once the limit is reached, TCP stack generates
an [E]POLLOUT event for every incoming ACK packet.

This causes a high number of context switches.

This patch implements the strategy David Miller added
in sock_def_write_space() :

 - If TCP socket has a notsent_lowat constraint of X bytes,
   allow sendmsg() to fill up to X bytes, but send [E]POLLOUT
   only if number of notsent bytes is below X/2

This considerably reduces TCP_NOTSENT_LOWAT overhead,
while allowing to keep the pipe full.

Tested:
 100 ms RTT netem testbed between A and B, 100 concurrent TCP_STREAM

A:/# cat /proc/sys/net/ipv4/tcp_wmem
4096	262144	64000000
A:/# super_netperf 100 -H B -l 1000 -- -K bbr &

A:/# grep TCP /proc/net/sockstat
TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 1364904 # This is about 54 MB of memory per flow :/

A:/# vmstat 5 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 256220672  13532 694976    0    0    10     0   28   14  0  1 99  0  0
 2  0      0 256320016  13532 698480    0    0   512     0 715901 5927  0 10 90  0  0
 0  0      0 256197232  13532 700992    0    0   735    13 771161 5849  0 11 89  0  0
 1  0      0 256233824  13532 703320    0    0   512    23 719650 6635  0 11 89  0  0
 2  0      0 256226880  13532 705780    0    0   642     4 775650 6009  0 12 88  0  0

A:/# echo 2097152 >/proc/sys/net/ipv4/tcp_notsent_lowat

A:/# grep TCP /proc/net/sockstat
TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 86411 # 3.5 MB per flow

A:/# vmstat 5 5  # check that context switches have not inflated too much.
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 260386512  13592 662148    0    0    10     0   17   14  0  1 99  0  0
 0  0      0 260519680  13592 604184    0    0   512    13 726843 12424  0 10 90  0  0
 1  1      0 260435424  13592 598360    0    0   512    25 764645 12925  0 10 90  0  0
 1  0      0 260855392  13592 578380    0    0   512     7 722943 13624  0 11 88  0  0
 1  0      0 260445008  13592 601176    0    0   614    34 772288 14317  0 10 90  0  0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-04 21:21:18 -08:00
Adi Nissim
1c25324caf net/sched: act_tunnel_key: Don't dump dst port if it wasn't set
It's possible to set a tunnel without a destination port. However,
on dump(), a zero dst port is returned to user space even if it was not
set, fix that.

Note that so far it wasn't required, b/c key less tunnels were not
supported and the UDP tunnels do require destination port.

Signed-off-by: Adi Nissim <adin@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-04 20:53:37 -08:00
Adi Nissim
80ef0f22ce net/sched: act_tunnel_key: Allow key-less tunnels
Allow setting a tunnel without a tunnel key. This is required for
tunneling protocols, such as GRE, that define the key as an optional
field.

Signed-off-by: Adi Nissim <adin@mellanox.com>
Acked-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-04 20:53:37 -08:00
Ido Schimmel
f839a6c925 net: Do not route unicast IP packets twice
Packets marked with 'offload_l3_fwd_mark' were already forwarded by a
capable device and should not be forwarded again by the kernel.
Therefore, have the kernel consume them.

The check is performed in ip{,6}_forward_finish() in order to allow the
kernel to process such packets in ip{,6}_forward() and generate required
exceptions. For example, ICMP redirects.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-04 08:36:36 -08:00
Ido Schimmel
875e893995 skbuff: Rename 'offload_mr_fwd_mark' to 'offload_l3_fwd_mark'
Commit abf4bb6b63 ("skbuff: Add the offload_mr_fwd_mark field") added
the 'offload_mr_fwd_mark' field to indicate that a packet has already
undergone L3 multicast routing by a capable device. The field is used to
prevent the kernel from forwarding a packet through a netdev through
which the device has already forwarded the packet.

Currently, no unicast packet is routed by both the device and the
kernel, but this is about to change by subsequent patches and we need to
be able to mark such packets, so that they will no be forwarded twice.

Instead of adding yet another field to 'struct sk_buff', we can just
rename 'offload_mr_fwd_mark' to 'offload_l3_fwd_mark', as a packet
either has a multicast or a unicast destination IP.

While at it, add a comment about both 'offload_fwd_mark' and
'offload_l3_fwd_mark'.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-04 08:36:36 -08:00
Willem de Bruijn
52900d2228 udp: elide zerocopy operation in hot path
With MSG_ZEROCOPY, each skb holds a reference to a struct ubuf_info.
Release of its last reference triggers a completion notification.

The TCP stack in tcp_sendmsg_locked holds an extra ref independent of
the skbs, because it can build, send and free skbs within its loop,
possibly reaching refcount zero and freeing the ubuf_info too soon.

The UDP stack currently also takes this extra ref, but does not need
it as all skbs are sent after return from __ip(6)_append_data.

Avoid the extra refcount_inc and refcount_dec_and_test, and generally
the sock_zerocopy_put in the common path, by passing the initial
reference to the first skb.

This approach is taken instead of initializing the refcount to 0, as
that would generate error "refcount_t: increment on 0" on the
next skb_zcopy_set.

Changes
  v3 -> v4
    - Move skb_zcopy_set below the only kfree_skb that might cause
      a premature uarg destroy before skb_zerocopy_put_abort
      - Move the entire skb_shinfo assignment block, to keep that
        cacheline access in one place

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03 15:58:32 -08:00
Willem de Bruijn
b5947e5d1e udp: msg_zerocopy
Extend zerocopy to udp sockets. Allow setting sockopt SO_ZEROCOPY and
interpret flag MSG_ZEROCOPY.

This patch was previously part of the zerocopy RFC patchsets. Zerocopy
is not effective at small MTU. With segmentation offload building
larger datagrams, the benefit of page flipping outweights the cost of
generating a completion notification.

tools/testing/selftests/net/msg_zerocopy.sh after applying follow-on
test patch and making skb_orphan_frags_rx same as skb_orphan_frags:

    ipv4 udp -t 1
    tx=191312 (11938 MB) txc=0 zc=n
    rx=191312 (11938 MB)
    ipv4 udp -z -t 1
    tx=304507 (19002 MB) txc=304507 zc=y
    rx=304507 (19002 MB)
    ok
    ipv6 udp -t 1
    tx=174485 (10888 MB) txc=0 zc=n
    rx=174485 (10888 MB)
    ipv6 udp -z -t 1
    tx=294801 (18396 MB) txc=294801 zc=y
    rx=294801 (18396 MB)
    ok

Changes
  v1 -> v2
    - Fixup reverse christmas tree violation
  v2 -> v3
    - Split refcount avoidance optimization into separate patch
      - Fix refcount leak on error in fragmented case
        (thanks to Paolo Abeni for pointing this one out!)
      - Fix refcount inc on zero
      - Test sock_flag SOCK_ZEROCOPY directly in __ip_append_data.
        This is needed since commit 5cf4a8532c ("tcp: really ignore
	MSG_ZEROCOPY if no SO_ZEROCOPY") did the same for tcp.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03 15:58:32 -08:00
Bartosz Golaszewski
0e839df92c net: ethernet: provide nvmem_get_mac_address()
We already have of_get_nvmem_mac_address() but some non-DT systems want
to read the MAC address from NVMEM too. Implement a generalized routine
that takes struct device as argument.

Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03 15:40:30 -08:00
Alexis Bauvin
6a6d6681ac l3mdev: add function to retreive upper master
Existing functions to retreive the l3mdev of a device did not walk the
master chain to find the upper master. This patch adds a function to
find the l3mdev, even indirect through e.g. a bridge:

+----------+
|          |
| vrf-blue |
|          |
+----+-----+
     |
     |
+----+-----+
|          |
| br-blue  |
|          |
+----+-----+
     |
     |
+----+-----+
|          |
|   eth0   |
|          |
+----------+

This will properly resolve the l3mdev of eth0 to vrf-blue.

Signed-off-by: Alexis Bauvin <abauvin@scaleway.com>
Reviewed-by: Amine Kherbouche <akherbouche@scaleway.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: Amine Kherbouche <akherbouche@scaleway.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03 14:15:26 -08:00
Alexis Bauvin
da5095d052 udp_tunnel: add config option to bind to a device
UDP tunnel sockets are always opened unbound to a specific device. This
patch allow the socket to be bound on a custom device, which
incidentally makes UDP tunnels VRF-aware if binding to an l3mdev.

Signed-off-by: Alexis Bauvin <abauvin@scaleway.com>
Reviewed-by: Amine Kherbouche <akherbouche@scaleway.com>
Tested-by: Amine Kherbouche <akherbouche@scaleway.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03 14:15:26 -08:00
Shalom Toledo
846e980a87 devlink: Add 'fw_load_policy' generic parameter
Many drivers load the device's firmware image during the initialization
flow either from the flash or from the disk. Currently this option is not
controlled by the user and the driver decides from where to load the
firmware image.

'fw_load_policy' gives the ability to control this option which allows the
user to choose between different loading policies supported by the driver.

This parameter can be useful while testing and/or debugging the device. For
example, testing a firmware bug fix.

Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03 13:55:43 -08:00
Ido Schimmel
5a6db04ca8 net: bridge: Extend br_vlan_get_pvid() for bridge ports
Currently, the function only works for the bridge device itself, but
subsequent patches will need to be able to query the PVID of a given
bridge port, so extend the function.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-30 17:06:28 -08:00
Jakub Kicinski
a293974590 rtnetlink: avoid frame size warning in rtnl_newlink()
Standard kernel compilation produces the following warning:

net/core/rtnetlink.c: In function ‘rtnl_newlink’:
net/core/rtnetlink.c:3232:1: warning: the frame size of 1288 bytes is larger than 1024 bytes [-Wframe-larger-than=]
 }
  ^

This should not really be an issue, as rtnl_newlink() stack is
generally quite shallow.

Fix the warning by allocating attributes with kmalloc() in a wrapper
and passing it down to rtnl_newlink(), avoiding complexities on error
paths.

Alternatively we could kmalloc() some structure within rtnl_newlink(),
slave attributes look like a good candidate.  In practice it adds to
already rather high complexity and length of the function.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-30 13:33:34 -08:00
Jakub Kicinski
420d031822 rtnetlink: remove a level of indentation in rtnl_newlink()
rtnl_newlink() used to create VLAs based on link kind.  Since
commit ccf8dbcd06 ("rtnetlink: Remove VLA usage") statically
sized array is created on the stack, so there is no more use
for a separate code block that used to be the VLA's live range.

While at it christmas tree the variables.  Note that there is
a goto-based retry so to be on the safe side the variables can
no longer be initialized in place.  It doesn't seem to matter,
logically, but why make the code harder to read..

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-30 13:33:34 -08:00
Eric Dumazet
6015c71e65 tcp: md5: add tcp_md5_needed jump label
Most linux hosts never setup TCP MD5 keys. We can avoid a
cache line miss (accessing tp->md5ig_info) on RX and TX
using a jump label.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-30 13:28:03 -08:00
Eric Dumazet
4f693b55c3 tcp: implement coalescing on backlog queue
In case GRO is not as efficient as it should be or disabled,
we might have a user thread trapped in __release_sock() while
softirq handler flood packets up to the point we have to drop.

This patch balances work done from user thread and softirq,
to give more chances to __release_sock() to complete its work
before new packets are added the the backlog.

This also helps if we receive many ACK packets, since GRO
does not aggregate them.

This patch brings ~60% throughput increase on a receiver
without GRO, but the spectacular gain is really on
1000x release_sock() latency reduction I have measured.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-30 13:26:54 -08:00
Eric Dumazet
19119f298b tcp: take care of compressed acks in tcp_add_reno_sack()
Neal pointed out that non sack flows might suffer from ACK compression
added in the following patch ("tcp: implement coalescing on backlog queue")

Instead of tweaking tcp_add_backlog() we can take into
account how many ACK were coalesced, this information
will be available in skb_shinfo(skb)->gso_segs

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-30 13:26:53 -08:00
Geneviève Bastien
b0e3f1bdf9 net: Add trace events for all receive exit points
Trace events are already present for the receive entry points, to indicate
how the reception entered the stack.

This patch adds the corresponding exit trace events that will bound the
reception such that all events occurring between the entry and the exit
can be considered as part of the reception context. This greatly helps
for dependency and root cause analyses.

Without this, it is not possible with tracepoint instrumentation to
determine whether a sched_wakeup event following a netif_receive_skb
event is the result of the packet reception or a simple coincidence after
further processing by the thread. It is possible using other mechanisms
like kretprobes, but considering the "entry" points are already present,
it would be good to add the matching exit events.

In addition to linking packets with wakeups, the entry/exit event pair
can also be used to perform network stack latency analyses.

Signed-off-by: Geneviève Bastien <gbastien@versatic.net>
CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: David S. Miller <davem@davemloft.net>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> (tracing side)
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-30 13:23:25 -08:00
David S. Miller
93029d7d40 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
bpf-next 2018-11-30

The following pull-request contains BPF updates for your *net-next* tree.

(Getting out bit earlier this time to pull in a dependency from bpf.)

The main changes are:

1) Add libbpf ABI versioning and document API naming conventions
   as well as ABI versioning process, from Andrey.

2) Add a new sk_msg_pop_data() helper for sk_msg based BPF
   programs that is used in conjunction with sk_msg_push_data()
   for adding / removing meta data to the msg data, from John.

3) Optimize convert_bpf_ld_abs() for 0 offset and fix various
   lib and testsuite build failures on 32 bit, from David.

4) Make BPF prog dump for !JIT identical to how we dump subprogs
   when JIT is in use, from Yonghong.

5) Rename btf_get_from_id() to make it more conform with libbpf
   API naming conventions, from Martin.

6) Add a missing BPF kselftest config item, from Naresh.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-29 18:15:07 -08:00
Cong Wang
1464193107 net: explain __skb_checksum_complete() with comments
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-29 13:47:16 -08:00
Eric Dumazet
19bf62613a tcp: remove loop to compute wscale
We can remove the loop and conditional branches
and compute wscale efficiently thanks to ilog2()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-29 11:10:14 -08:00
David S. Miller
e561bb29b6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Trivial conflict in net/core/filter.c, a locally computed
'sdif' is now an argument to the function.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-28 22:10:54 -08:00
John Fastabend
7246d8ed4d bpf: helper to pop data from messages
This adds a BPF SK_MSG program helper so that we can pop data from a
msg. We use this to pop metadata from a previous push data call.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-11-28 22:07:57 +01:00
David S. Miller
e9d8faf93d Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Disable BH while holding list spinlock in nf_conncount, from
   Taehee Yoo.

2) List corruption in nf_conncount, also from Taehee.

3) Fix race that results in leaving around an empty list node in
   nf_conncount, from Taehee Yoo.

4) Proper chain handling for inactive chains from the commit path,
   from Florian Westphal. This includes a selftest for this.

5) Do duplicate rule handles when replacing rules, also from Florian.

6) Remove net_exit path in xt_RATEEST that results in splat, from Taehee.

7) Possible use-after-free in nft_compat when releasing extensions.
   From Florian.

8) Memory leak in xt_hashlimit, from Taehee.

9) Call ip_vs_dst_notifier after ipv6_dev_notf, from Xin Long.

10) Fix cttimeout with udplite and gre, from Florian.

11) Preserve oif for IPv6 link-local generated traffic from mangle
    table, from Alin Nastac.

12) Missing error handling in masquerade notifiers, from Taehee Yoo.

13) Use mutex to protect registration/unregistration of masquerade
    extensions in order to prevent a race, from Taehee.

14) Incorrect condition check in tree_nodes_free(), also from Taehee.

15) Fix chain counter leak in rule replacement path, from Taehee.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-28 11:02:45 -08:00
Taehee Yoo
ca08987885 netfilter: nf_tables: deactivate expressions in rule replecement routine
There is no expression deactivation call from the rule replacement path,
hence, chain counter is not decremented. A few steps to reproduce the
problem:

   %nft add table ip filter
   %nft add chain ip filter c1
   %nft add chain ip filter c1
   %nft add rule ip filter c1 jump c2
   %nft replace rule ip filter c1 handle 3 accept
   %nft flush ruleset

<jump c2> expression means immediate NFT_JUMP to chain c2.
Reference count of chain c2 is increased when the rule is added.

When rule is deleted or replaced, the reference counter of c2 should be
decreased via nft_rule_expr_deactivate() which calls
nft_immediate_deactivate().

Splat looks like:
[  214.396453] WARNING: CPU: 1 PID: 21 at net/netfilter/nf_tables_api.c:1432 nf_tables_chain_destroy.isra.38+0x2f9/0x3a0 [nf_tables]
[  214.398983] Modules linked in: nf_tables nfnetlink
[  214.398983] CPU: 1 PID: 21 Comm: kworker/1:1 Not tainted 4.20.0-rc2+ #44
[  214.398983] Workqueue: events nf_tables_trans_destroy_work [nf_tables]
[  214.398983] RIP: 0010:nf_tables_chain_destroy.isra.38+0x2f9/0x3a0 [nf_tables]
[  214.398983] Code: 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 8e 00 00 00 48 8b 7b 58 e8 e1 2c 4e c6 48 89 df e8 d9 2c 4e c6 eb 9a <0f> 0b eb 96 0f 0b e9 7e fe ff ff e8 a7 7e 4e c6 e9 a4 fe ff ff e8
[  214.398983] RSP: 0018:ffff8881152874e8 EFLAGS: 00010202
[  214.398983] RAX: 0000000000000001 RBX: ffff88810ef9fc28 RCX: ffff8881152876f0
[  214.398983] RDX: dffffc0000000000 RSI: 1ffff11022a50ede RDI: ffff88810ef9fc78
[  214.398983] RBP: 1ffff11022a50e9d R08: 0000000080000000 R09: 0000000000000000
[  214.398983] R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff11022a50eba
[  214.398983] R13: ffff888114446e08 R14: ffff8881152876f0 R15: ffffed1022a50ed6
[  214.398983] FS:  0000000000000000(0000) GS:ffff888116400000(0000) knlGS:0000000000000000
[  214.398983] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  214.398983] CR2: 00007fab9bb5f868 CR3: 000000012aa16000 CR4: 00000000001006e0
[  214.398983] Call Trace:
[  214.398983]  ? nf_tables_table_destroy.isra.37+0x100/0x100 [nf_tables]
[  214.398983]  ? __kasan_slab_free+0x145/0x180
[  214.398983]  ? nf_tables_trans_destroy_work+0x439/0x830 [nf_tables]
[  214.398983]  ? kfree+0xdb/0x280
[  214.398983]  nf_tables_trans_destroy_work+0x5f5/0x830 [nf_tables]
[ ... ]

Fixes: bb7b40aecb ("netfilter: nf_tables: bogus EBUSY in chain deletions")
Reported by: Christoph Anton Mitterer <calestyo@scientia.net>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=914505
Link: https://bugzilla.kernel.org/show_bug.cgi?id=201791
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-11-28 10:56:40 +01:00
David Ahern
86d1d8b72c net/ipv4: Fix missing raw_init when CONFIG_PROC_FS is disabled
Randy reported when CONFIG_PROC_FS is not enabled:
    ld: net/ipv4/af_inet.o: in function `inet_init':
    af_inet.c:(.init.text+0x42d): undefined reference to `raw_init'

Fix by moving the endif up to the end of the proc entries

Fixes: 6897445fb1 ("net: provide a sysctl raw_l3mdev_accept for raw socket lookup with VRFs")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Mike Manning <mmanning@vyatta.att-mail.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-27 20:58:02 -08:00
Eric Dumazet
e7395f1f4b tcp: remove hdrlen argument from tcp_queue_rcv()
Only one caller needs to pull TCP headers, so lets
move __skb_pull() to the caller side.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-27 16:38:08 -08:00
Vijay Khemka
16e8c4ca21 net/ncsi: Add NCSI Mellanox OEM command
This patch adds OEM Mellanox commands and response handling. It also
defines OEM Get MAC Address handler to get and configure the device.

ncsi_oem_gma_handler_mlx: This handler send NCSI mellanox command for
getting mac address.
ncsi_rsp_handler_oem_mlx: This handles response received for all
mellanox OEM commands.
ncsi_rsp_handler_oem_mlx_gma: This handles get mac address response and
set it to device.

Signed-off-by: Vijay Khemka <vijaykhemka@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-27 16:37:20 -08:00
Jon Maloy
ec835f8912 tipc: fix lockdep warning during node delete
We see the following lockdep warning:

[ 2284.078521] ======================================================
[ 2284.078604] WARNING: possible circular locking dependency detected
[ 2284.078604] 4.19.0+ #42 Tainted: G            E
[ 2284.078604] ------------------------------------------------------
[ 2284.078604] rmmod/254 is trying to acquire lock:
[ 2284.078604] 00000000acd94e28 ((&n->timer)#2){+.-.}, at: del_timer_sync+0x5/0xa0
[ 2284.078604]
[ 2284.078604] but task is already holding lock:
[ 2284.078604] 00000000f997afc0 (&(&tn->node_list_lock)->rlock){+.-.}, at: tipc_node_stop+0xac/0x190 [tipc]
[ 2284.078604]
[ 2284.078604] which lock already depends on the new lock.
[ 2284.078604]
[ 2284.078604]
[ 2284.078604] the existing dependency chain (in reverse order) is:
[ 2284.078604]
[ 2284.078604] -> #1 (&(&tn->node_list_lock)->rlock){+.-.}:
[ 2284.078604]        tipc_node_timeout+0x20a/0x330 [tipc]
[ 2284.078604]        call_timer_fn+0xa1/0x280
[ 2284.078604]        run_timer_softirq+0x1f2/0x4d0
[ 2284.078604]        __do_softirq+0xfc/0x413
[ 2284.078604]        irq_exit+0xb5/0xc0
[ 2284.078604]        smp_apic_timer_interrupt+0xac/0x210
[ 2284.078604]        apic_timer_interrupt+0xf/0x20
[ 2284.078604]        default_idle+0x1c/0x140
[ 2284.078604]        do_idle+0x1bc/0x280
[ 2284.078604]        cpu_startup_entry+0x19/0x20
[ 2284.078604]        start_secondary+0x187/0x1c0
[ 2284.078604]        secondary_startup_64+0xa4/0xb0
[ 2284.078604]
[ 2284.078604] -> #0 ((&n->timer)#2){+.-.}:
[ 2284.078604]        del_timer_sync+0x34/0xa0
[ 2284.078604]        tipc_node_delete+0x1a/0x40 [tipc]
[ 2284.078604]        tipc_node_stop+0xcb/0x190 [tipc]
[ 2284.078604]        tipc_net_stop+0x154/0x170 [tipc]
[ 2284.078604]        tipc_exit_net+0x16/0x30 [tipc]
[ 2284.078604]        ops_exit_list.isra.8+0x36/0x70
[ 2284.078604]        unregister_pernet_operations+0x87/0xd0
[ 2284.078604]        unregister_pernet_subsys+0x1d/0x30
[ 2284.078604]        tipc_exit+0x11/0x6f2 [tipc]
[ 2284.078604]        __x64_sys_delete_module+0x1df/0x240
[ 2284.078604]        do_syscall_64+0x66/0x460
[ 2284.078604]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 2284.078604]
[ 2284.078604] other info that might help us debug this:
[ 2284.078604]
[ 2284.078604]  Possible unsafe locking scenario:
[ 2284.078604]
[ 2284.078604]        CPU0                    CPU1
[ 2284.078604]        ----                    ----
[ 2284.078604]   lock(&(&tn->node_list_lock)->rlock);
[ 2284.078604]                                lock((&n->timer)#2);
[ 2284.078604]                                lock(&(&tn->node_list_lock)->rlock);
[ 2284.078604]   lock((&n->timer)#2);
[ 2284.078604]
[ 2284.078604]  *** DEADLOCK ***
[ 2284.078604]
[ 2284.078604] 3 locks held by rmmod/254:
[ 2284.078604]  #0: 000000003368be9b (pernet_ops_rwsem){+.+.}, at: unregister_pernet_subsys+0x15/0x30
[ 2284.078604]  #1: 0000000046ed9c86 (rtnl_mutex){+.+.}, at: tipc_net_stop+0x144/0x170 [tipc]
[ 2284.078604]  #2: 00000000f997afc0 (&(&tn->node_list_lock)->rlock){+.-.}, at: tipc_node_stop+0xac/0x19
[...}

The reason is that the node timer handler sometimes needs to delete a
node which has been disconnected for too long. To do this, it grabs
the lock 'node_list_lock', which may at the same time be held by the
generic node cleanup function, tipc_node_stop(), during module removal.
Since the latter is calling del_timer_sync() inside the same lock, we
have a potential deadlock.

We fix this letting the timer cleanup function use spin_trylock()
instead of just spin_lock(), and when it fails to grab the lock it
just returns so that the timer handler can terminate its execution.
This is safe to do, since tipc_node_stop() anyway is about to
delete both the timer and the node instance.

Fixes: 6a939f365b ("tipc: Auto removal of peer down node instance")
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-27 16:30:39 -08:00
Nicolas Dichtel
288f06a001 netns: enable to dump full nsid translation table
Like the previous patch, the goal is to ease to convert nsids from one
netns to another netns.
A new attribute (NETNSA_CURRENT_NSID) is added to the kernel answer when
NETNSA_TARGET_NSID is provided, thus the user can easily convert nsids.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-27 16:20:20 -08:00
Nicolas Dichtel
3a4f68bf66 netns: enable to specify a nsid for a get request
Combined with NETNSA_TARGET_NSID, it enables to "translate" a nsid from one
netns to a nsid of another netns.
This is useful when using NETLINK_F_LISTEN_ALL_NSID because it helps the
user to interpret a nsid received from an other netns.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-27 16:20:20 -08:00