linux_dsm_epyc7002/net/ipv4
David Ahern a6db4494d2 net: ipv4: Consider failed nexthops in multipath routes
Multipath route lookups should consider knowledge about next hops and not
select a hop that is known to be failed.

Example:

                     [h2]                   [h3]   15.0.0.5
                      |                      |
                     3|                     3|
                    [SP1]                  [SP2]--+
                     1  2                   1     2
                     |  |     /-------------+     |
                     |   \   /                    |
                     |     X                      |
                     |    / \                     |
                     |   /   \---------------\    |
                     1  2                     1   2
         12.0.0.2  [TOR1] 3-----------------3 [TOR2] 12.0.0.3
                     4                         4
                      \                       /
                        \                    /
                         \                  /
                          -------|   |-----/
                                 1   2
                                [TOR3]
                                  3|
                                   |
                                  [h1]  12.0.0.1

host h1 with IP 12.0.0.1 has 2 paths to host h3 at 15.0.0.5:

    root@h1:~# ip ro ls
    ...
    12.0.0.0/24 dev swp1  proto kernel  scope link  src 12.0.0.1
    15.0.0.0/16
            nexthop via 12.0.0.2  dev swp1 weight 1
            nexthop via 12.0.0.3  dev swp1 weight 1
    ...

If the link between tor3 and tor1 is down and the link between tor1
and tor2 then tor1 is effectively cut-off from h1. Yet the route lookups
in h1 are alternating between the 2 routes: ping 15.0.0.5 gets one and
ssh 15.0.0.5 gets the other. Connections that attempt to use the
12.0.0.2 nexthop fail since that neighbor is not reachable:

    root@h1:~# ip neigh show
    ...
    12.0.0.3 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE
    12.0.0.2 dev swp1  FAILED
    ...

The failed path can be avoided by considering known neighbor information
when selecting next hops. If the neighbor lookup fails we have no
knowledge about the nexthop, so give it a shot. If there is an entry
then only select the nexthop if the state is sane. This is similar to
what fib_detect_death does.

To maintain backward compatibility use of the neighbor information is
based on a new sysctl, fib_multipath_use_neigh.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11 15:16:13 -04:00
..
netfilter netfilter: ipv4: fix NULL dereference 2016-03-28 17:59:29 +02:00
af_inet.c net: introduce lockdep_is_held and update various places to use it 2016-04-07 16:44:14 -04:00
ah4.c
arp.c
cipso_ipv4.c net: introduce lockdep_is_held and update various places to use it 2016-04-07 16:44:14 -04:00
datagram.c
devinet.c
esp4.c
fib_frontend.c ipv4: initialize flowi4_flags before calling fib_lookup() 2016-03-22 15:59:23 -04:00
fib_lookup.h
fib_rules.c
fib_semantics.c net: ipv4: Consider failed nexthops in multipath routes 2016-04-11 15:16:13 -04:00
fib_trie.c
fou.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-04-09 17:41:41 -04:00
gre_demux.c
gre_offload.c GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU 2016-04-07 16:56:33 -04:00
icmp.c
igmp.c
inet_connection_sock.c tcp/dccp: fix inet_reuseport_add_sock() 2016-04-07 12:02:33 -04:00
inet_diag.c tcp/dccp: do not touch listener sk_refcnt under synflood 2016-04-04 22:11:20 -04:00
inet_fragment.c
inet_hashtables.c tcp/dccp: fix inet_reuseport_add_sock() 2016-04-07 12:02:33 -04:00
inet_timewait_sock.c
inetpeer.c
ip_forward.c
ip_fragment.c
ip_gre.c GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU 2016-04-07 16:56:33 -04:00
ip_input.c
ip_options.c
ip_output.c
ip_sockglue.c net: introduce lockdep_is_held and update various places to use it 2016-04-07 16:44:14 -04:00
ip_tunnel_core.c ip_tunnel: implement __iptunnel_pull_header 2016-04-06 16:50:32 -04:00
ip_tunnel.c
ip_vti.c
ipcomp.c
ipconfig.c
ipip.c
ipmr.c
Kconfig
Makefile
netfilter.c
ping.c sock: enable timestamping using control messages 2016-04-04 15:50:30 -04:00
proc.c
protocol.c
raw.c sock: enable timestamping using control messages 2016-04-04 15:50:30 -04:00
route.c
syncookies.c
sysctl_net_ipv4.c net: ipv4: Consider failed nexthops in multipath routes 2016-04-11 15:16:13 -04:00
tcp_bic.c
tcp_cdg.c
tcp_cong.c
tcp_cubic.c
tcp_dctcp.c
tcp_diag.c
tcp_fastopen.c
tcp_highspeed.c
tcp_htcp.c
tcp_hybla.c
tcp_illinois.c
tcp_input.c tcp: increment sk_drops for listeners 2016-04-04 22:11:20 -04:00
tcp_ipv4.c net: introduce lockdep_is_held and update various places to use it 2016-04-07 16:44:14 -04:00
tcp_lp.c
tcp_metrics.c
tcp_minisocks.c tcp: rate limit ACK sent by SYN_RECV request sockets 2016-04-04 22:11:20 -04:00
tcp_offload.c
tcp_output.c
tcp_probe.c
tcp_recovery.c
tcp_scalable.c
tcp_timer.c
tcp_vegas.c
tcp_vegas.h
tcp_veno.c
tcp_westwood.c
tcp_yeah.c
tcp.c sock: enable timestamping using control messages 2016-04-04 15:50:30 -04:00
tunnel4.c
udp_diag.c udp: no longer use SLAB_DESTROY_BY_RCU 2016-04-04 22:11:19 -04:00
udp_impl.h
udp_offload.c udp: Remove udp_offloads 2016-04-07 16:53:30 -04:00
udp_tunnel.c udp: Add socket based GRO and config 2016-04-07 16:53:29 -04:00
udp.c udp: Add udp6_lib_lookup_skb and udp4_lib_lookup_skb 2016-04-07 16:53:14 -04:00
udplite.c
xfrm4_input.c
xfrm4_mode_beet.c
xfrm4_mode_transport.c
xfrm4_mode_tunnel.c
xfrm4_output.c
xfrm4_policy.c
xfrm4_protocol.c
xfrm4_state.c
xfrm4_tunnel.c