Commit Graph

5969 Commits

Author SHA1 Message Date
Ivan Vecera
0aef78aa7b ipv6: addrconf: don't evaluate keep_addr_on_down twice
The addrconf_ifdown() evaluates keep_addr_on_down state twice. There
is no need to do it.

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Ivan Vecera <cera@cera.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-25 13:03:37 -04:00
Ahmed Abdelsalam
b5facfdba1 ipv6: sr: Compute flowlabel for outer IPv6 header of seg6 encap mode
ECMP (equal-cost multipath) hashes are typically computed on the packets'
5-tuple(src IP, dst IP, src port, dst port, L4 proto).

For encapsulated packets, the L4 data is not readily available and ECMP
hashing will often revert to (src IP, dst IP). This will lead to traffic
polarization on a single ECMP path, causing congestion and waste of network
capacity.

In IPv6, the 20-bit flow label field is also used as part of the ECMP hash.
In the lack of L4 data, the hashing will be on (src IP, dst IP, flow
label). Having a non-zero flow label is thus important for proper traffic
load balancing when L4 data is unavailable (i.e., when packets are
encapsulated).

Currently, the seg6_do_srh_encap() function extracts the original packet's
flow label and set it as the outer IPv6 flow label. There are two issues
with this behaviour:

a) There is no guarantee that the inner flow label is set by the source.
b) If the original packet is not IPv6, the flow label will be set to
zero (e.g., IPv4 or L2 encap).

This patch adds a function, named seg6_make_flowlabel(), that computes a
flow label from a given skb. It supports IPv6, IPv4 and L2 payloads, and
leverages the per namespace 'seg6_flowlabel" sysctl value.

The currently support behaviours are as follows:
-1 set flowlabel to zero.
0 copy flowlabel from Inner paceket in case of Inner IPv6
(Set flowlabel to 0 in case IPv4/L2)
1 Compute the flowlabel using seg6_make_flowlabel()

This patch has been tested for IPv6, IPv4, and L2 traffic.

Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-25 13:02:15 -04:00
David S. Miller
c749fa181b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-04-24 23:59:11 -04:00
Eric Dumazet
091311debc net/ipv6: fix LOCKDEP issue in rt6_remove_exception_rt()
rt6_remove_exception_rt() is called under rcu_read_lock() only.

We lock rt6_exception_lock a bit later, so we do not hold
rt6_exception_lock yet.

Fixes: 8a14e46f14 ("net/ipv6: Fix missing rcu dereferences on from")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: David Ahern <dsahern@gmail.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24 16:19:14 -04:00
David S. Miller
77621f024d Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contains Netfilter/IPVS fixes for your net tree,
they are:

1) Fix SIP conntrack with phones sending session descriptions for different
   media types but same port numbers, from Florian Westphal.

2) Fix incorrect rtnl_lock mutex logic from IPVS sync thread, from Julian
   Anastasov.

3) Skip compat array allocation in ebtables if there is no entries, also
   from Florian.

4) Do not lose left/right bits when shifting marks from xt_connmark, from
   Jack Ma.

5) Silence false positive memleak in conntrack extensions, from Cong Wang.

6) Fix CONFIG_NF_REJECT_IPV6=m link problems, from Arnd Bergmann.

7) Cannot kfree rule that is already in list in nf_tables, switch order
   so this error handling is not required, from Florian Westphal.

8) Release set name in error path, from Florian.

9) include kmemleak.h in nf_conntrack_extend.c, from Stepheh Rothwell.

10) NAT chain and extensions depend on NF_TABLES.

11) Out of bound access when renaming chains, from Taehee Yoo.

12) Incorrect casting in xt_connmark leads to wrong bitshifting.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 16:22:24 -04:00
David Ahern
8a14e46f14 net/ipv6: Fix missing rcu dereferences on from
kbuild test robot reported 2 uses of rt->from not properly accessed
using rcu_dereference:
1. add rcu_dereference_protected to rt6_remove_exception_rt and make
   sure it is always called with rcu lock held.

2. change rt6_do_redirect to take a reference on 'from' when accessed
   the first time so it can be used the sceond time outside of the lock

Fixes: a68886a691 ("net/ipv6: Make from in rt6_info rcu protected")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 16:12:55 -04:00
David Ahern
c3c14da028 net/ipv6: add rcu locking to ip6_negative_advice
syzbot reported a suspicious rcu_dereference_check:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
  lockdep_rcu_suspicious+0x14a/0x153 kernel/locking/lockdep.c:4592
  rt6_check_expired+0x38b/0x3e0 net/ipv6/route.c:410
  ip6_negative_advice+0x67/0xc0 net/ipv6/route.c:2204
  dst_negative_advice include/net/sock.h:1786 [inline]
  sock_setsockopt+0x138f/0x1fe0 net/core/sock.c:1051
  __sys_setsockopt+0x2df/0x390 net/socket.c:1899
  SYSC_setsockopt net/socket.c:1914 [inline]
  SyS_setsockopt+0x34/0x50 net/socket.c:1911

Add rcu locking around call to rt6_check_expired in
ip6_negative_advice.

Fixes: a68886a691 ("net/ipv6: Make from in rt6_info rcu protected")
Reported-by: syzbot+2422c9e35796659d2273@syzkaller.appspotmail.com
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 16:12:54 -04:00
Eric Dumazet
aa8f877849 ipv6: add RTA_TABLE and RTA_PREFSRC to rtm_ipv6_policy
KMSAN reported use of uninit-value that I tracked to lack
of proper size check on RTA_TABLE attribute.

I also believe RTA_PREFSRC lacks a similar check.

Fixes: 86872cb579 ("[IPv6] route: FIB6 configuration using struct fib6_config")
Fixes: c3968a857a ("ipv6: RTA_PREFSRC support for ipv6 route source address selection")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 12:01:21 -04:00
Roopa Prabhu
b16fb418b1 net: fib_rules: add extack support
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 10:21:24 -04:00
Ahmed Abdelsalam
a957fa190a ipv6: sr: fix NULL pointer dereference in seg6_do_srh_encap()- v4 pkts
In case of seg6 in encap mode, seg6_do_srh_encap() calls set_tun_src()
in order to set the src addr of outer IPv6 header.

The net_device is required for set_tun_src(). However calling ip6_dst_idev()
on dst_entry in case of IPv4 traffic results on the following bug.

Using just dst->dev should fix this BUG.

[  196.242461] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[  196.242975] PGD 800000010f076067 P4D 800000010f076067 PUD 10f060067 PMD 0
[  196.243329] Oops: 0000 [#1] SMP PTI
[  196.243468] Modules linked in: nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd input_leds glue_helper led_class pcspkr serio_raw mac_hid video autofs4 hid_generic usbhid hid e1000 i2c_piix4 ahci pata_acpi libahci
[  196.244362] CPU: 2 PID: 1089 Comm: ping Not tainted 4.16.0+ #1
[  196.244606] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  196.244968] RIP: 0010:seg6_do_srh_encap+0x1ac/0x300
[  196.245236] RSP: 0018:ffffb2ce00b23a60 EFLAGS: 00010202
[  196.245464] RAX: 0000000000000000 RBX: ffff8c7f53eea300 RCX: 0000000000000000
[  196.245742] RDX: 0000f10000000000 RSI: ffff8c7f52085a6c RDI: ffff8c7f41166850
[  196.246018] RBP: ffffb2ce00b23aa8 R08: 00000000000261e0 R09: ffff8c7f41166800
[  196.246294] R10: ffffdce5040ac780 R11: ffff8c7f41166828 R12: ffff8c7f41166808
[  196.246570] R13: ffff8c7f52085a44 R14: ffffffffb73211c0 R15: ffff8c7e69e44200
[  196.246846] FS:  00007fc448789700(0000) GS:ffff8c7f59d00000(0000) knlGS:0000000000000000
[  196.247286] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  196.247526] CR2: 0000000000000000 CR3: 000000010f05a000 CR4: 00000000000406e0
[  196.247804] Call Trace:
[  196.247972]  seg6_do_srh+0x15b/0x1c0
[  196.248156]  seg6_output+0x3c/0x220
[  196.248341]  ? prandom_u32+0x14/0x20
[  196.248526]  ? ip_idents_reserve+0x6c/0x80
[  196.248723]  ? __ip_select_ident+0x90/0x100
[  196.248923]  ? ip_append_data.part.50+0x6c/0xd0
[  196.249133]  lwtunnel_output+0x44/0x70
[  196.249328]  ip_send_skb+0x15/0x40
[  196.249515]  raw_sendmsg+0x8c3/0xac0
[  196.249701]  ? _copy_from_user+0x2e/0x60
[  196.249897]  ? rw_copy_check_uvector+0x53/0x110
[  196.250106]  ? _copy_from_user+0x2e/0x60
[  196.250299]  ? copy_msghdr_from_user+0xce/0x140
[  196.250508]  sock_sendmsg+0x36/0x40
[  196.250690]  ___sys_sendmsg+0x292/0x2a0
[  196.250881]  ? _cond_resched+0x15/0x30
[  196.251074]  ? copy_termios+0x1e/0x70
[  196.251261]  ? _copy_to_user+0x22/0x30
[  196.251575]  ? tty_mode_ioctl+0x1c3/0x4e0
[  196.251782]  ? _cond_resched+0x15/0x30
[  196.251972]  ? mutex_lock+0xe/0x30
[  196.252152]  ? vvar_fault+0xd2/0x110
[  196.252337]  ? __do_fault+0x1f/0xc0
[  196.252521]  ? __handle_mm_fault+0xc1f/0x12d0
[  196.252727]  ? __sys_sendmsg+0x63/0xa0
[  196.252919]  __sys_sendmsg+0x63/0xa0
[  196.253107]  do_syscall_64+0x72/0x200
[  196.253305]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  196.253530] RIP: 0033:0x7fc4480b0690
[  196.253715] RSP: 002b:00007ffde9f252f8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[  196.254053] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 00007fc4480b0690
[  196.254331] RDX: 0000000000000000 RSI: 000000000060a360 RDI: 0000000000000003
[  196.254608] RBP: 00007ffde9f253f0 R08: 00000000002d1e81 R09: 0000000000000002
[  196.254884] R10: 00007ffde9f250c0 R11: 0000000000000246 R12: 0000000000b22070
[  196.255205] R13: 20c49ba5e353f7cf R14: 431bde82d7b634db R15: 00007ffde9f278fe
[  196.255484] Code: a5 0f b6 45 c0 41 88 41 28 41 0f b6 41 2c 48 c1 e0 04 49 8b 54 01 38 49 8b 44 01 30 49 89 51 20 49 89 41 18 48 8b 83 b0 00 00 00 <48> 8b 30 49 8b 86 08 0b 00 00 48 8b 40 20 48 8b 50 08 48 0b 10
[  196.256190] RIP: seg6_do_srh_encap+0x1ac/0x300 RSP: ffffb2ce00b23a60
[  196.256445] CR2: 0000000000000000
[  196.256676] ---[ end trace 71af7d093603885c ]---

Fixes: 8936ef7604 ("ipv6: sr: fix NULL pointer dereference when setting encap source address")
Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-22 21:04:17 -04:00
David Ahern
8ae869714b net/ipv6: Remove unncessary check on f6i in fib6_check
Dan reported an imbalance in fib6_check on use of f6i and checking
whether it is null. Since fib6_check is only called if f6i is non-null,
remove the unnecessary check.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:06:14 -04:00
David Ahern
a68886a691 net/ipv6: Make from in rt6_info rcu protected
When a dst entry is created from a fib entry, the 'from' in rt6_info
is set to the fib entry. The 'from' reference is used most notably for
cookie checking - making sure stale dst entries are updated if the
fib entry is changed.

When a fib entry is deleted, the pcpu routes on it are walked releasing
the fib6_info reference. This is needed for the fib6_info cleanup to
happen and to make sure all device references are released in a timely
manner.

There is a race window when a FIB entry is deleted and the 'from' on the
pcpu route is dropped and the pcpu route hits a cookie check. Handle
this race using rcu on from.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:06:14 -04:00
David Ahern
5bcaa41b96 net/ipv6: Move release of fib6_info from pcpu routes to helper
Code move only; no functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:06:13 -04:00
David Ahern
a87b7dc9f7 net/ipv6: Move rcu locking to callers of fib6_get_cookie_safe
A later patch protects 'from' in rt6_info and this simplifies the
locking needed by it.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:06:13 -04:00
David Ahern
4d85cd0c2a net/ipv6: Move rcu_read_lock to callers of ip6_rt_cache_alloc
A later patch protects 'from' in rt6_info and this simplifies the
locking needed by it.

With the move, the fib6_info_hold for the uncached_rt is no longer
needed since the rcu_lock is still held.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:06:13 -04:00
David Ahern
a269f1a764 net/ipv6: Rename rt6_get_cookie_safe
rt6_get_cookie_safe takes a fib6_info and checks the sernum of
the node. Update the name to reflect its purpose.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:06:13 -04:00
David Ahern
6a3e030f08 net/ipv6: Clean up rt expires helpers
rt6_clean_expires and rt6_set_expires are no longer used. Removed them.
rt6_update_expires has 1 caller in route.c, so move it from the header.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:06:13 -04:00
Eric Dumazet
263243d6c2 net/ipv6: Fix ip6_convert_metrics() bug
If ip6_convert_metrics() fails to allocate memory, it should not
overwrite rt->fib6_metrics or we risk a crash later as syzbot found.

BUG: KASAN: null-ptr-deref in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
BUG: KASAN: null-ptr-deref in refcount_sub_and_test+0x92/0x330 lib/refcount.c:179
Read of size 4 at addr 0000000000000044 by task syzkaller832429/4487

CPU: 1 PID: 4487 Comm: syzkaller832429 Not tainted 4.16.0+ #6
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b9/0x294 lib/dump_stack.c:113
 kasan_report_error mm/kasan/report.c:352 [inline]
 kasan_report.cold.7+0x6d/0x2fe mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
 kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
 atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
 refcount_sub_and_test+0x92/0x330 lib/refcount.c:179
 refcount_dec_and_test+0x1a/0x20 lib/refcount.c:212
 fib6_info_destroy+0x2d0/0x3c0 net/ipv6/ip6_fib.c:206
 fib6_info_release include/net/ip6_fib.h:304 [inline]
 ip6_route_info_create+0x677/0x3240 net/ipv6/route.c:3020
 ip6_route_add+0x23/0xb0 net/ipv6/route.c:3030
 inet6_rtm_newroute+0x142/0x160 net/ipv6/route.c:4406
 rtnetlink_rcv_msg+0x466/0xc10 net/core/rtnetlink.c:4648
 netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4666
 netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
 netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
 netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
 sock_sendmsg_nosec net/socket.c:629 [inline]
 sock_sendmsg+0xd5/0x120 net/socket.c:639
 ___sys_sendmsg+0x805/0x940 net/socket.c:2117
 __sys_sendmsg+0x115/0x270 net/socket.c:2155
 SYSC_sendmsg net/socket.c:2164 [inline]
 SyS_sendmsg+0x29/0x30 net/socket.c:2162
 do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x42/0xb7

Fixes: d4ead6b34b ("net/ipv6: move metrics from dst to rt6_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsa@cumulusnetworks.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-20 11:36:15 -04:00
David Ahern
27b10608a2 net/ipv6: Fix gfp_flags arg to addrconf_prefix_route
Eric noticed that __ipv6_ifa_notify is called under rcu_read_lock, so
the gfp argument to addrconf_prefix_route can not be GFP_KERNEL.

While scrubbing other calls I noticed addrconf_addr_gen has one
place with GFP_ATOMIC that can be GFP_KERNEL.

Fixes: acb54e3cba ("net/ipv6: Add gfp_flags to route add functions")
Reported-by: syzbot+2add39b05179b31f912f@syzkaller.appspotmail.com
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 15:40:13 -04:00
David Ahern
dcd1f57295 net/ipv6: Remove fib6_idev
fib6_idev can be obtained from __in6_dev_get on the nexthop device
rather than caching it in the fib6_info. Remove it.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 15:40:13 -04:00
David Ahern
eea68cd371 net/ipv6: Remove unnecessary checks on fib6_idev
Prior to 4832c30d54 ("net: ipv6: put host and anycast routes on device
with address") host routes and anycast routes were installed with the
device set to loopback (or VRF device once that feature was added). In the
older code dst.dev was set to loopback (needed for packet tx) and rt6i_idev
was used to denote the actual interface.

Commit 4832c30d54 changed the code to have dst.dev pointing to the real
device with the switch to lo or vrf device done on dst clones. As a
consequence of this change a couple of device checks during route lookups
are no longer needed. Remove them.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 15:40:13 -04:00
David Ahern
9ee8cbb2fd net/ipv6: Remove aca_idev
aca_idev has only 1 user - inet6_fill_ifacaddr - and it only
wants the device index which can be extracted from the fib6_info
nexthop.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 15:40:13 -04:00
David Ahern
360a9887c8 net/ipv6: Rename addrconf_dst_alloc
addrconf_dst_alloc now returns a fib6_info. Update the name
and its users to reflect the change.

Rename only; no functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 15:40:13 -04:00
David Ahern
93c2fb253d net/ipv6: Rename fib6_info struct elements
Change the prefix for fib6_info struct elements from rt6i_ to fib6_.
rt6i_pcpu and rt6i_exception_bucket are left as is given that they
point to rt6_info entries.

Rename only; not functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19 15:40:12 -04:00
Pablo Neira Ayuso
39f2ff0816 netfilter: nf_tables: NAT chain and extensions require NF_TABLES
Move these options inside the scope of the 'if' NF_TABLES and
NF_TABLES_IPV6 dependencies. This patch fixes:

   net/ipv6/netfilter/nft_chain_nat_ipv6.o: In function `nft_nat_do_chain':
>> net/ipv6/netfilter/nft_chain_nat_ipv6.c:37: undefined reference to `nft_do_chain'
   net/ipv6/netfilter/nft_chain_nat_ipv6.o: In function `nft_chain_nat_ipv6_exit':
>> net/ipv6/netfilter/nft_chain_nat_ipv6.c:94: undefined reference to `nft_unregister_chain_type'
   net/ipv6/netfilter/nft_chain_nat_ipv6.o: In function `nft_chain_nat_ipv6_init':
>> net/ipv6/netfilter/nft_chain_nat_ipv6.c:87: undefined reference to `nft_register_chain_type'

that happens with:

CONFIG_NF_TABLES=m
CONFIG_NFT_CHAIN_NAT_IPV6=y

Fixes: 02c7b25e5f ("netfilter: nf_tables: build-in filter chain type")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-19 12:31:34 +02:00
Eric Dumazet
415787d779 ipv6: frags: fix a lockdep false positive
lockdep does not know that the locks used by IPv4 defrag
and IPv6 reassembly units are of different classes.

It complains because of following chains :

1) sch_direct_xmit()        (lock txq->_xmit_lock)
    dev_hard_start_xmit()
     xmit_one()
      dev_queue_xmit_nit()
       packet_rcv_fanout()
        ip_check_defrag()
         ip_defrag()
          spin_lock()     (lock frag queue spinlock)

2) ip6_input_finish()
    ipv6_frag_rcv()       (lock frag queue spinlock)
     ip6_frag_queue()
      icmpv6_param_prob() (lock txq->_xmit_lock at some point)

We could add lockdep annotations, but we also can make sure IPv6
calls icmpv6_param_prob() only after the release of the frag queue spinlock,
since this naturally makes frag queue spinlock a leaf in lock hierarchy.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-18 23:19:39 -04:00
David Ahern
77634cc67d net/ipv6: Remove unused code and variables for rt6_info
Drop unneeded elements from rt6_info struct and rearrange layout to
something more relevant for the data path.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:18 -04:00
David Ahern
8d1c802b28 net/ipv6: Flip FIB entries to fib6_info
Convert all code paths referencing a FIB entry from
rt6_info to fib6_info.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:18 -04:00
David Ahern
93531c6743 net/ipv6: separate handling of FIB entries from dst based routes
Last step before flipping the data type for FIB entries:
- use fib6_info_alloc to create FIB entries in ip6_route_info_create
  and addrconf_dst_alloc
- use fib6_info_release in place of dst_release, ip6_rt_put and
  rt6_release
- remove the dst_hold before calling __ip6_ins_rt or ip6_del_rt
- when purging routes, drop per-cpu routes
- replace inc and dec of rt6i_ref with fib6_info_hold and fib6_info_release
- use rt->from since it points to the FIB entry
- drop references to exception bucket, fib6_metrics and per-cpu from
  dst entries (those are relevant for fib entries only)

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:17 -04:00
David Ahern
a64efe142f net/ipv6: introduce fib6_info struct and helpers
Add fib6_info struct and alloc, destroy, hold and release helpers.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:17 -04:00
David Ahern
23fb93a4d3 net/ipv6: Cleanup exception and cache route handling
IPv6 FIB will only contain FIB entries with exception routes added to
the FIB entry. Once this transformation is complete, FIB lookups will
return a fib6_info with the lookup functions still returning a dst
based rt6_info. The current code uses rt6_info for both paths and
overloads the rt6_info variable usually called 'rt'.

This patch introduces a new 'f6i' variable name for the result of the FIB
lookup and keeps 'rt' as the dst based return variable. 'f6i' becomes a
fib6_info in a later patch which is why it is introduced as f6i now;
avoids the additional churn in the later patch.

In addition, remove RTF_CACHE and dst checks from fib6 add and delete
since they can not happen now and will never happen after the data
type flip.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:17 -04:00
David Ahern
acb54e3cba net/ipv6: Add gfp_flags to route add functions
Most FIB entries can be added using memory allocated with GFP_KERNEL.
Add gfp_flags to ip6_route_add and addrconf_dst_alloc. Code paths that
can be reached from the packet path (e.g., ndisc and autoconfig) or
atomic notifiers use GFP_ATOMIC; paths from user context (adding
addresses and routes) use GFP_KERNEL.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:17 -04:00
David Ahern
f8a1b43b70 net/ipv6: Create a neigh_lookup for FIB entries
The router discovery code has a FIB entry and wants to validate the
gateway has a neighbor entry. Refactor the existing dst_neigh_lookup
for IPv6 and create a new function that takes the gateway and device
and returns a neighbor entry. Use the new function in
ndisc_router_discovery to validate the gateway.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:17 -04:00
David Ahern
3b6761d18b net/ipv6: Move dst flags to booleans in fib entries
Continuing to wean FIB paths off of dst_entry, use a bool to hold
requests for certain dst settings. Add a helper to convert the
flags to DST flags when a FIB entry is converted to a dst_entry.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:17 -04:00
David Ahern
dec9b0e295 net/ipv6: Add rt6_info create function for ip6_pol_route_lookup
ip6_pol_route_lookup is the lookup function for ip6_route_lookup and
rt6_lookup. At the moment it returns either a reference to a FIB entry
or a cached exception. To move FIB entries to a separate struct, this
lookup function needs to convert FIB entries to an rt6_info that is
returned to the caller.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:17 -04:00
David Ahern
421842edea net/ipv6: Add fib6_null_entry
ip6_null_entry will stay a dst based return for lookups that fail to
match an entry.

Add a new fib6_null_entry which constitutes the root node and leafs
for fibs. Replace existing references to ip6_null_entry with the
new fib6_null_entry when dealing with FIBs.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:17 -04:00
David Ahern
14895687d3 net/ipv6: move expires into rt6_info
Add expires to rt6_info for FIB entries, and add fib6 helpers to
manage it. Data path use of dst.expires remains.

The transition is fairly straightforward: when working with fib entries,
rt->dst.expires is just rt->expires, rt6_clean_expires is replaced with
fib6_clean_expires, rt6_set_expires becomes fib6_set_expires, and
rt6_check_expired becomes fib6_check_expired, where the fib6 versions
are added by this patch.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:17 -04:00
David Ahern
d4ead6b34b net/ipv6: move metrics from dst to rt6_info
Similar to IPv4, add fib metrics to the fib struct, which at the moment
is rt6_info. Will be moved to fib6_info in a later patch. Copy metrics
into dst by reference using refcount.

To make the transition:
- add dst_metrics to rt6_info. Default to dst_default_metrics if no
  metrics are passed during route add. No need for a separate pmtu
  entry; it can reference the MTU slot in fib6_metrics

- ip6_convert_metrics allocates memory in the FIB entry and uses
  ip_metrics_convert to copy from netlink attribute to metrics entry

- the convert metrics call is done in ip6_route_info_create simplifying
  the route add path
  + fib6_commit_metrics and fib6_copy_metrics and the temporary
    mx6_config are no longer needed

- add fib6_metric_set helper to change the value of a metric in the
  fib entry since dst_metric_set can no longer be used

- cow_metrics for IPv6 can drop to dst_cow_metrics_generic

- rt6_dst_from_metrics_check is no longer needed

- rt6_fill_node needs the FIB entry and dst as separate arguments to
  keep compatibility with existing output. Current dst address is
  renamed to dest.
  (to be consistent with IPv4 rt6_fill_node really should be split
  into 2 functions similar to fib_dump_info and rt_fill_info)

- rt6_fill_node no longer needs the temporary metrics variable

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:16 -04:00
David Ahern
6edb3c96a5 net/ipv6: Defer initialization of dst to data path
Defer setting dst input, output and error until fib entry is copied.

The reject path from ip6_route_info_create is moved to a new function
ip6_rt_init_dst_reject with a helper doing the conversion from fib6_type
to dst error.

The remainder of the new ip6_rt_init_dst is an amalgamtion of dst code
from addrconf_dst_alloc and the non-reject path of ip6_route_info_create.
The dst output function is always ip6_output and the input function is
either ip6_input (local routes), ip6_mc_input (multicast routes) or
ip6_forward (anything else).

A couple of places using dst.error are updated to look at rt6i_flags.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:16 -04:00
David Ahern
5e670d844b net/ipv6: Move nexthop data to fib6_nh
Introduce fib6_nh structure and move nexthop related data from
rt6_info and rt6_info.dst to fib6_nh. References to dev, gateway or
lwtstate from a FIB lookup perspective are converted to use fib6_nh;
datapath references to dst version are left as is.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:16 -04:00
David Ahern
e8478e80e5 net/ipv6: Save route type in rt6_info
The RTN_ type for IPv6 FIB entries is currently embedded in rt6i_flags
and dst.error. Since dst is going to be removed, it can no longer be
relied on for FIB dumps so save the route type as fib6_type.

fc_type is set in current users based on the algorithm in rt6_fill_node:
  - rt6i_flags contains RTF_LOCAL: fc_type = RTN_LOCAL
  - rt6i_flags contains RTF_ANYCAST: fc_type = RTN_ANYCAST
  - else fc_type = RTN_UNICAST

Similarly, fib6_type is set in the rt6_info templates based on the
RTF_REJECT section of rt6_fill_node converting dst.error to RTN type.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:16 -04:00
David Ahern
ae90d867f9 net/ipv6: Move support functions up in route.c
Code move only.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:16 -04:00
David Ahern
afb1d4b593 net/ipv6: Pass net namespace to route functions
Pass network namespace reference into route add, delete and get
functions.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:16 -04:00
David Ahern
7aef6859ee net/ipv6: Pass net to fib6_update_sernum
Pass net namespace to fib6_update_sernum. It can not be marked const
as fib6_new_sernum will change ipv6.fib6_sernum.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 23:41:16 -04:00
Lorenzo Bianconi
a2d481b326 ipv6: send netlink notifications for manually configured addresses
Send a netlink notification when userspace adds a manually configured
address if DAD is enabled and optimistic flag isn't set.
Moreover send RTM_DELADDR notifications for tentative addresses.

Some userspace applications (e.g. NetworkManager) are interested in
addr netlink events albeit the address is still in tentative state,
however events are not sent if DAD process is not completed.
If the address is added and immediately removed userspace listeners
are not notified. This behaviour can be easily reproduced by using
veth interfaces:

$ ip -b - <<EOF
> link add dev vm1 type veth peer name vm2
> link set dev vm1 up
> link set dev vm2 up
> addr add 2001:db8:a🅱️1:2:3:4/64 dev vm1
> addr del 2001:db8:a🅱️1:2:3:4/64 dev vm1
EOF

This patch reverts the behaviour introduced by the commit f784ad3d79
("ipv6: do not send RTM_DELADDR for tentative addresses")

Suggested-by: Thomas Haller <thaller@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 14:03:56 -04:00
Stephen Suryaputra
bdb7cc643f ipv6: Count interface receive statistics on the ingress netdev
The statistics such as InHdrErrors should be counted on the ingress
netdev rather than on the dev from the dst, which is the egress.

Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 13:39:51 -04:00
David Ahern
032234d823 net/ipv6: Make __inet6_bind static
BPF core gets access to __inet6_bind via ipv6_bpf_stub_impl, so it is
not invoked directly outside of af_inet6.c. Make it static and move
inet6_bind after to avoid forward declaration.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 13:19:22 -04:00
Eric Dumazet
93ab6cc691 tcp: implement mmap() for zero copy receive
Some networks can make sure TCP payload can exactly fit 4KB pages,
with well chosen MSS/MTU and architectures.

Implement mmap() system call so that applications can avoid
copying data without complex splice() games.

Note that a successful mmap( X bytes) on TCP socket is consuming
bytes, as if recvmsg() has been done. (tp->copied += X)

Only PROT_READ mappings are accepted, as skb page frags
are fundamentally shared and read only.

If tcp_mmap() finds data that is not a full page, or a patch of
urgent data, -EINVAL is returned, no bytes are consumed.

Application must fallback to recvmsg() to read the problematic sequence.

mmap() wont block,  regardless of socket being in blocking or
non-blocking mode. If not enough bytes are in receive queue,
mmap() would return -EAGAIN, or -EIO if socket is in a state
where no other bytes can be added into receive queue.

An application might use SO_RCVLOWAT, poll() and/or ioctl( FIONREAD)
to efficiently use mmap()

On the sender side, MSG_EOR might help to clearly separate unaligned
headers and 4K-aligned chunks if necessary.

Tested:

mlx4 (cx-3) 40Gbit NIC, with tcp_mmap program provided in following patch.
MTU set to 4168  (4096 TCP payload, 40 bytes IPv6 header, 32 bytes TCP header)

Without mmap() (tcp_mmap -s)

received 32768 MB (0 % mmap'ed) in 8.13342 s, 33.7961 Gbit,
  cpu usage user:0.034 sys:3.778, 116.333 usec per MB, 63062 c-switches
received 32768 MB (0 % mmap'ed) in 8.14501 s, 33.748 Gbit,
  cpu usage user:0.029 sys:3.997, 122.864 usec per MB, 61903 c-switches
received 32768 MB (0 % mmap'ed) in 8.11723 s, 33.8635 Gbit,
  cpu usage user:0.048 sys:3.964, 122.437 usec per MB, 62983 c-switches
received 32768 MB (0 % mmap'ed) in 8.39189 s, 32.7552 Gbit,
  cpu usage user:0.038 sys:4.181, 128.754 usec per MB, 55834 c-switches

With mmap() on receiver (tcp_mmap -s -z)

received 32768 MB (100 % mmap'ed) in 8.03083 s, 34.2278 Gbit,
  cpu usage user:0.024 sys:1.466, 45.4712 usec per MB, 65479 c-switches
received 32768 MB (100 % mmap'ed) in 7.98805 s, 34.4111 Gbit,
  cpu usage user:0.026 sys:1.401, 43.5486 usec per MB, 65447 c-switches
received 32768 MB (100 % mmap'ed) in 7.98377 s, 34.4296 Gbit,
  cpu usage user:0.028 sys:1.452, 45.166 usec per MB, 65496 c-switches
received 32768 MB (99.9969 % mmap'ed) in 8.01838 s, 34.281 Gbit,
  cpu usage user:0.02 sys:1.446, 44.7388 usec per MB, 65505 c-switches

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16 18:26:37 -04:00
Eric Dumazet
d1361840f8 tcp: fix SO_RCVLOWAT and RCVBUF autotuning
Applications might use SO_RCVLOWAT on TCP socket hoping to receive
one [E]POLLIN event only when a given amount of bytes are ready in socket
receive queue.

Problem is that receive autotuning is not aware of this constraint,
meaning sk_rcvbuf might be too small to allow all bytes to be stored.

Add a new (struct proto_ops)->set_rcvlowat method so that a protocol
can override the default setsockopt(SO_RCVLOWAT) behavior.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16 18:26:37 -04:00
Lorenzo Bianconi
f85f94b871 ipv6: remove unnecessary check in addrconf_prefix_rcv_add_addr()
Remove unnecessary check on update_lft variable in
addrconf_prefix_rcv_add_addr routine since it is always set to 0.
Moreover remove update_lft re-initialization to 0

Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16 18:16:16 -04:00