Commit Graph

5293 Commits

Author SHA1 Message Date
Wei Wang
9514528d92 ipv6: call dst_dev_put() properly
As the intend of this patch series is to completely remove dst gc,
we need to call dst_dev_put() to release the reference to dst->dev
when removing routes from fib because we won't keep the gc list anymore
and will lose the dst pointer right after removing the routes.
Without the gc list, there is no way to find all the dst's that have
dst->dev pointing to the going-down dev.
Hence, we are doing dst_dev_put() immediately before we lose the last
reference of the dst from the routing code. The next dst_check() will
trigger a route re-lookup to find another route (if there is any).

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17 22:54:00 -04:00
Wei Wang
1cfb71eeb1 ipv6: take dst->__refcnt for insertion into fib6 tree
In IPv6 routing code, struct rt6_info is created for each static route
and RTF_CACHE route and inserted into fib6 tree. In both cases, dst
ref count is not taken.
As explained in the previous patch, this leads to the need of the dst
garbage collector.

This patch holds ref count of dst before inserting the route into fib6
tree and properly releases the dst when deleting it from the fib6 tree
as a preparation in order to fully get rid of dst gc later.

Also, correct fib6_age() logic to check dst->__refcnt to be 1 to indicate
no user is referencing the dst.

And remove dst_hold() in vrf_rt6_create() as ip6_dst_alloc() already puts
dst->__refcnt to 1.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17 22:54:00 -04:00
Wei Wang
1dbe32525e net: use loopback dev when generating blackhole route
Existing ipv4/6_blackhole_route() code generates a blackhole route
with dst->dev pointing to the passed in dst->dev.
It is not necessary to hold reference to the passed in dst->dev
because the packets going through this route are dropped anyway.
A loopback interface is good enough so that we don't need to worry about
releasing this dst->dev when this dev is going down.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17 22:53:59 -04:00
Wei Wang
d24406c85d udp: call dst_hold_safe() in udp_sk_rx_set_dst()
In udp_v4/6_early_demux() code, we try to hold dst->__refcnt for
dst with DST_NOCACHE flag. This is because later in udp_sk_rx_dst_set()
function, we will try to cache this dst in sk for connected case.
However, a better way to achieve this is to not try to hold dst in
early_demux(), but in udp_sk_rx_dst_set(), call dst_hold_safe(). This
approach is also more consistant with how tcp is handling it. And it
will make later changes simpler.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17 22:53:59 -04:00
Wei Wang
1758fd4688 ipv6: remove unnecessary dst_hold() in ip6_fragment()
In ipv6 tx path, rcu_read_lock() is taken so that dst won't get freed
during the execution of ip6_fragment(). Hence, no need to hold dst in
it.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-17 22:53:59 -04:00
Johannes Berg
d58ff35122 networking: make skb_push & __skb_push return void pointers
It seems like a historic accident that these return unsigned char *,
and in many places that means casts are required, more often than not.

Make these functions return void * and remove all the casts across
the tree, adding a (u8 *) cast only where the unsigned char pointer
was used directly, all done with the following spatch:

    @@
    expression SKB, LEN;
    typedef u8;
    identifier fn = { skb_push, __skb_push, skb_push_rcsum };
    @@
    - *(fn(SKB, LEN))
    + *(u8 *)fn(SKB, LEN)

    @@
    expression E, SKB, LEN;
    identifier fn = { skb_push, __skb_push, skb_push_rcsum };
    type T;
    @@
    - E = ((T *)(fn(SKB, LEN)))
    + E = fn(SKB, LEN)

    @@
    expression SKB, LEN;
    identifier fn = { skb_push, __skb_push, skb_push_rcsum };
    @@
    - fn(SKB, LEN)[0]
    + *(u8 *)fn(SKB, LEN)

Note that the last part there converts from push(...)[0] to the
more idiomatic *(u8 *)push(...).

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 11:48:40 -04:00
Johannes Berg
af72868b90 networking: make skb_pull & friends return void pointers
It seems like a historic accident that these return unsigned char *,
and in many places that means casts are required, more often than not.

Make these functions return void * and remove all the casts across
the tree, adding a (u8 *) cast only where the unsigned char pointer
was used directly, all done with the following spatch:

    @@
    expression SKB, LEN;
    typedef u8;
    identifier fn = {
            skb_pull,
            __skb_pull,
            skb_pull_inline,
            __pskb_pull_tail,
            __pskb_pull,
            pskb_pull
    };
    @@
    - *(fn(SKB, LEN))
    + *(u8 *)fn(SKB, LEN)

    @@
    expression E, SKB, LEN;
    identifier fn = {
            skb_pull,
            __skb_pull,
            skb_pull_inline,
            __pskb_pull_tail,
            __pskb_pull,
            pskb_pull
    };
    type T;
    @@
    - E = ((T *)(fn(SKB, LEN)))
    + E = fn(SKB, LEN)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 11:48:39 -04:00
Johannes Berg
4df864c1d9 networking: make skb_put & friends return void pointers
It seems like a historic accident that these return unsigned char *,
and in many places that means casts are required, more often than not.

Make these functions (skb_put, __skb_put and pskb_put) return void *
and remove all the casts across the tree, adding a (u8 *) cast only
where the unsigned char pointer was used directly, all done with the
following spatch:

    @@
    expression SKB, LEN;
    typedef u8;
    identifier fn = { skb_put, __skb_put };
    @@
    - *(fn(SKB, LEN))
    + *(u8 *)fn(SKB, LEN)

    @@
    expression E, SKB, LEN;
    identifier fn = { skb_put, __skb_put };
    type T;
    @@
    - E = ((T *)(fn(SKB, LEN)))
    + E = fn(SKB, LEN)

which actually doesn't cover pskb_put since there are only three
users overall.

A handful of stragglers were converted manually, notably a macro in
drivers/isdn/i4l/isdn_bsdcomp.c and, oddly enough, one of the many
instances in net/bluetooth/hci_sock.c. In the former file, I also
had to fix one whitespace problem spatch introduced.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 11:48:39 -04:00
Johannes Berg
59ae1d127a networking: introduce and use skb_put_data()
A common pattern with skb_put() is to just want to memcpy()
some data into the new space, introduce skb_put_data() for
this.

An spatch similar to the one for skb_put_zero() converts many
of the places using it:

    @@
    identifier p, p2;
    expression len, skb, data;
    type t, t2;
    @@
    (
    -p = skb_put(skb, len);
    +p = skb_put_data(skb, data, len);
    |
    -p = (t)skb_put(skb, len);
    +p = skb_put_data(skb, data, len);
    )
    (
    p2 = (t2)p;
    -memcpy(p2, data, len);
    |
    -memcpy(p, data, len);
    )

    @@
    type t, t2;
    identifier p, p2;
    expression skb, data;
    @@
    t *p;
    ...
    (
    -p = skb_put(skb, sizeof(t));
    +p = skb_put_data(skb, data, sizeof(t));
    |
    -p = (t *)skb_put(skb, sizeof(t));
    +p = skb_put_data(skb, data, sizeof(t));
    )
    (
    p2 = (t2)p;
    -memcpy(p2, data, sizeof(*p));
    |
    -memcpy(p, data, sizeof(*p));
    )

    @@
    expression skb, len, data;
    @@
    -memcpy(skb_put(skb, len), data, len);
    +skb_put_data(skb, data, len);

(again, manually post-processed to retain some comments)

Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 11:48:37 -04:00
Johannes Berg
b080db5853 networking: convert many more places to skb_put_zero()
There were many places that my previous spatch didn't find,
as pointed out by yuan linyu in various patches.

The following spatch found many more and also removes the
now unnecessary casts:

    @@
    identifier p, p2;
    expression len;
    expression skb;
    type t, t2;
    @@
    (
    -p = skb_put(skb, len);
    +p = skb_put_zero(skb, len);
    |
    -p = (t)skb_put(skb, len);
    +p = skb_put_zero(skb, len);
    )
    ... when != p
    (
    p2 = (t2)p;
    -memset(p2, 0, len);
    |
    -memset(p, 0, len);
    )

    @@
    type t, t2;
    identifier p, p2;
    expression skb;
    @@
    t *p;
    ...
    (
    -p = skb_put(skb, sizeof(t));
    +p = skb_put_zero(skb, sizeof(t));
    |
    -p = (t *)skb_put(skb, sizeof(t));
    +p = skb_put_zero(skb, sizeof(t));
    )
    ... when != p
    (
    p2 = (t2)p;
    -memset(p2, 0, sizeof(*p));
    |
    -memset(p, 0, sizeof(*p));
    )

    @@
    expression skb, len;
    @@
    -memset(skb_put(skb, len), 0, len);
    +skb_put_zero(skb, len);

Apply it to the tree (with one manual fixup to keep the
comment in vxlan.c, which spatch removed.)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 11:48:35 -04:00
David S. Miller
0ddead90b2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The conflicts were two cases of overlapping changes in
batman-adv and the qed driver.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 11:59:32 -04:00
Jesper Dangaard Brouer
849a44de91 net: don't global ICMP rate limit packets originating from loopback
Florian Weimer seems to have a glibc test-case which requires that
loopback interfaces does not get ICMP ratelimited.  This was broken by
commit c0303efeab ("net: reduce cycles spend on ICMP replies that
gets rate limited").

An ICMP response will usually be routed back-out the same incoming
interface.  Thus, take advantage of this and skip global ICMP
ratelimit when the incoming device is loopback.  In the unlikely event
that the outgoing it not loopback, due to strange routing policy
rules, ICMP rate limiting still works via peer ratelimiting via
icmpv4_xrlim_allow().  Thus, we should still comply with RFC1812
(section 4.3.2.8 "Rate Limiting").

This seems to fix the reproducer given by Florian.  While still
avoiding to perform expensive and unneeded outgoing route lookup for
rate limited packets (in the non-loopback case).

Fixes: c0303efeab ("net: reduce cycles spend on ICMP replies that gets rate limited")
Reported-by: Florian Weimer <fweimer@redhat.com>
Reported-by: "H.J. Lu" <hjl.tools@gmail.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-14 15:33:58 -04:00
Christian Perle
3500cd73df proc: snmp6: Use correct type in memset
Reading /proc/net/snmp6 yields bogus values on 32 bit kernels.
Use "u64" instead of "unsigned long" in sizeof().

Fixes: 4a4857b1c8 ("proc: Reduce cache miss in snmp6_seq_show")
Signed-off-by: Christian Perle <christian.perle@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-12 09:53:14 -04:00
Chenbo Feng
89dfba3e1b Remove the redundant skb->dev initialization in ip6_fragment
After moves the skb->dev and skb->protocol initialization into
ip6_output, setting the skb->dev inside ip6_fragment is unnecessary.

Fixes: 97a7a37a7b7b("ipv6: Initial skb->dev and skb->protocol in ip6_output")
Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-10 16:25:21 -04:00
Chenbo Feng
384abed1fe bpf: Remove duplicate tcp_filter hook in ipv6
There are two tcp_filter hooks in tcp_ipv6 ingress path currently.
One is at tcp_v6_rcv and another is in tcp_v6_do_rcv. It seems the
tcp_filter() call inside tcp_v6_do_rcv is redundent and some packet
will be filtered twice in this situation. This will cause trouble
when using eBPF filters to account traffic data.

Signed-off-by: Chenbo Feng <fengc@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-10 16:08:02 -04:00
Chenbo Feng
97a7a37a7b ipv6: Initial skb->dev and skb->protocol in ip6_output
Move the initialization of skb->dev and skb->protocol from
ip6_finish_output2 to ip6_output. This can make the skb->dev and
skb->protocol information avalaible to the CGROUP eBPF filter.

Signed-off-by: Chenbo Feng <fengc@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-09 15:07:56 -04:00
Krister Johansen
3ad7d2468f Ipvlan should return an error when an address is already in use.
The ipvlan code already knows how to detect when a duplicate address is
about to be assigned to an ipvlan device.  However, that failure is not
propogated outward and leads to a silent failure.

Introduce a validation step at ip address creation time and allow device
drivers to register to validate the incoming ip addresses.  The ipvlan
code is the first consumer.  If it detects an address in use, we can
return an error to the user before beginning to commit the new ifa in
the networking code.

This can be especially useful if it is necessary to provision many
ipvlans in containers.  The provisioning software (or operator) can use
this to detect situations where an ip address is unexpectedly in use.

Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-09 12:26:07 -04:00
Arnd Bergmann
0db47e3d32 ila_xlat: add missing hash secret initialization
While discussing the possible merits of clang warning about unused initialized
functions, I found one function that was clearly meant to be called but
never actually is.

__ila_hash_secret_init() initializes the hash value for the ila locator,
apparently this is intended to prevent hash collision attacks, but this ends
up being a read-only zero constant since there is no caller. I could find
no indication of why it was never called, the earliest patch submission
for the module already was like this. If my interpretation is right, we
certainly want to backport the patch to stable kernels as well.

I considered adding it to the ila_xlat_init callback, but for best effect
the random data is read as late as possible, just before it is first used.
The underlying net_get_random_once() is already highly optimized to avoid
overhead when called frequently.

Fixes: 7f00feaf10 ("ila: Add generic ILA translation facility")
Cc: stable@vger.kernel.org
Link: https://www.spinics.net/lists/kernel/msg2527243.html
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-08 15:36:56 -04:00
Eric Dumazet
0604475119 tcp: add TCPMemoryPressuresChrono counter
DRAM supply shortage and poor memory pressure tracking in TCP
stack makes any change in SO_SNDBUF/SO_RCVBUF (or equivalent autotuning
limits) and tcp_mem[] quite hazardous.

TCPMemoryPressures SNMP counter is an indication of tcp_mem sysctl
limits being hit, but only tracking number of transitions.

If TCP stack behavior under stress was perfect :
1) It would maintain memory usage close to the limit.
2) Memory pressure state would be entered for short times.

We certainly prefer 100 events lasting 10ms compared to one event
lasting 200 seconds.

This patch adds a new SNMP counter tracking cumulative duration of
memory pressure events, given in ms units.

$ cat /proc/sys/net/ipv4/tcp_mem
3088    4117    6176
$ grep TCP /proc/net/sockstat
TCP: inuse 180 orphan 0 tw 2 alloc 234 mem 4140
$ nstat -n ; sleep 10 ; nstat |grep Pressure
TcpExtTCPMemoryPressures        1700
TcpExtTCPMemoryPressuresChrono  5209

v2: Used EXPORT_SYMBOL_GPL() instead of EXPORT_SYMBOL() as David
instructed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-08 11:26:19 -04:00
David Ahern
8397ed36b7 net: ipv6: Release route when device is unregistering
Roopa reported attempts to delete a bond device that is referenced in a
multipath route is hanging:

$ ifdown bond2    # ifupdown2 command that deletes virtual devices
unregister_netdevice: waiting for bond2 to become free. Usage count = 2

Steps to reproduce:
    echo 1 > /proc/sys/net/ipv6/conf/all/ignore_routes_with_linkdown
    ip link add dev bond12 type bond
    ip link add dev bond13 type bond
    ip addr add 2001:db8:2::0/64 dev bond12
    ip addr add 2001:db8:3::0/64 dev bond13
    ip route add 2001:db8:33::0/64 nexthop via 2001:db8:2::2 nexthop via 2001:db8:3::2
    ip link del dev bond12
    ip link del dev bond13

The root cause is the recent change to keep routes on a linkdown. Update
the check to detect when the device is unregistering and release the
route for that case.

Fixes: a1a22c1206 ("net: ipv6: Keep nexthop of multipath route on admin down")
Reported-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-08 11:12:39 -04:00
Eric Dumazet
5d2ed0521a tcp: Namespaceify sysctl_tcp_timestamps
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-08 10:53:29 -04:00
Eric Dumazet
f930103421 tcp: Namespaceify sysctl_tcp_sack
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-08 10:53:28 -04:00
Eric Dumazet
eed29f17f0 tcp: add a struct net parameter to tcp_parse_options()
We want to move some TCP sysctls to net namespaces in the future.

tcp_window_scaling, tcp_sack and tcp_timestamps being fetched
from tcp_parse_options(), we need to pass an extra parameter.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-08 10:53:28 -04:00
David S. Miller
cf124db566 net: Fix inconsistent teardown and release of private netdev state.
Network devices can allocate reasources and private memory using
netdev_ops->ndo_init().  However, the release of these resources
can occur in one of two different places.

Either netdev_ops->ndo_uninit() or netdev->destructor().

The decision of which operation frees the resources depends upon
whether it is necessary for all netdev refs to be released before it
is safe to perform the freeing.

netdev_ops->ndo_uninit() presumably can occur right after the
NETDEV_UNREGISTER notifier completes and the unicast and multicast
address lists are flushed.

netdev->destructor(), on the other hand, does not run until the
netdev references all go away.

Further complicating the situation is that netdev->destructor()
almost universally does also a free_netdev().

This creates a problem for the logic in register_netdevice().
Because all callers of register_netdevice() manage the freeing
of the netdev, and invoke free_netdev(dev) if register_netdevice()
fails.

If netdev_ops->ndo_init() succeeds, but something else fails inside
of register_netdevice(), it does call ndo_ops->ndo_uninit().  But
it is not able to invoke netdev->destructor().

This is because netdev->destructor() will do a free_netdev() and
then the caller of register_netdevice() will do the same.

However, this means that the resources that would normally be released
by netdev->destructor() will not be.

Over the years drivers have added local hacks to deal with this, by
invoking their destructor parts by hand when register_netdevice()
fails.

Many drivers do not try to deal with this, and instead we have leaks.

Let's close this hole by formalizing the distinction between what
private things need to be freed up by netdev->destructor() and whether
the driver needs unregister_netdevice() to perform the free_netdev().

netdev->priv_destructor() performs all actions to free up the private
resources that used to be freed by netdev->destructor(), except for
free_netdev().

netdev->needs_free_netdev is a boolean that indicates whether
free_netdev() should be done at the end of unregister_netdevice().

Now, register_netdevice() can sanely release all resources after
ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
and netdev->priv_destructor().

And at the end of unregister_netdevice(), we invoke
netdev->priv_destructor() and optionally call free_netdev().

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-07 15:53:24 -04:00
David S. Miller
216fe8f021 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Just some simple overlapping changes in marvell PHY driver
and the DSA core code.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06 22:20:08 -04:00
Richard Haines
e3ebdb20fd net/ipv6: Fix CALIPSO causing GPF with datagram support
When using CALIPSO with IPPROTO_UDP it is possible to trigger a GPF as the
IP header may have moved.

Also update the payload length after adding the CALIPSO option.

Signed-off-by: Richard Haines <richard_c_haines@btinternet.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Huw Davies <huw@codeweavers.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06 15:18:20 -04:00
David S. Miller
f4eb17e1ef Revert "sit: reload iphdr in ipip6_rcv"
This reverts commit b699d00358.

As per Eric Dumazet, the pskb_may_pull() is a NOP in this
particular case, so the 'iph' reload is unnecessary.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06 11:34:06 -04:00
Haishuang Yan
b699d00358 sit: reload iphdr in ipip6_rcv
Since iptunnel_pull_header() can call pskb_may_pull(),
we must reload any pointer that was related to skb->head.

Fixes: a09a4c8dd1 ("tunnels: Remove encapsulation offloads on decap")
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-04 23:04:31 -04:00
Jason A. Donenfeld
3f29770723 ipsec: check return value of skb_to_sgvec always
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-04 23:01:47 -04:00
Eric Dumazet
77d4b1d369 net: ping: do not abuse udp_poll()
Alexander reported various KASAN messages triggered in recent kernels

The problem is that ping sockets should not use udp_poll() in the first
place, and recent changes in UDP stack finally exposed this old bug.

Fixes: c319b4d76b ("net: ipv4: add IPPROTO_ICMP socket kind")
Fixes: 6d0bfe2261 ("net: ipv6: Add IPv6 support to the ping socket.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Sasha Levin <alexander.levin@verizon.com>
Cc: Solar Designer <solar@openwall.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Acked-By: Lorenzo Colitti <lorenzo@google.com>
Tested-By: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-04 22:56:55 -04:00
David S. Miller
e3e86b5119 ipv6: Fix leak in ipv6_gso_segment().
If ip6_find_1stfragopt() fails and we return an error we have to free
up 'segs' because nobody else is going to.

Fixes: 2423496af3 ("ipv6: Prevent overrun when parsing v6 header options")
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-04 21:41:10 -04:00
Liam McBirnie
5f733ee68f ip6_tunnel: fix traffic class routing for tunnels
ip6_route_output() requires that the flowlabel contains the traffic
class for policy routing.

Commit 0e9a709560 ("ip6_tunnel, ip6_gre: fix setting of DSCP on
encapsulated packets") removed the code which previously added the
traffic class to the flowlabel.

The traffic class is added here because only route lookup needs the
flowlabel to contain the traffic class.

Fixes: 0e9a709560 ("ip6_tunnel, ip6_gre: fix setting of DSCP on encapsulated packets")
Signed-off-by: Liam McBirnie <liam.mcbirnie@boeing.com>
Acked-by: Peter Dawson <peter.a.dawson@boeing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-04 19:49:33 -04:00
Ben Hutchings
6e80ac5cc9 ipv6: xfrm: Handle errors reported by xfrm6_find_1stfragopt()
xfrm6_find_1stfragopt() may now return an error code and we must
not treat it as a length.

Fixes: 2423496af3 ("ipv6: Prevent overrun when parsing v6 header options")
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-02 13:57:27 -04:00
David Ahern
9ae2872748 net: add extack arg to lwtunnel build state
Pass extack arg down to lwtunnel_build_state and the build_state callbacks.
Add messages for failures in lwtunnel_build_state, and add the extarg to
nla_parse where possible in the build_state callbacks.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-30 11:55:32 -04:00
David Ahern
c255bd681d net: lwtunnel: Add extack to encap attr validation
Pass extack down to lwtunnel_valid_encap_type and
lwtunnel_valid_encap_type_attr. Add messages for unknown
or unsupported encap types.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-30 11:55:31 -04:00
David S. Miller
34aa83c2fc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Overlapping changes in drivers/net/phy/marvell.c, bug fix in 'net'
restricting a HW workaround alongside cleanups in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-26 20:46:35 -04:00
Peter Dawson
0e9a709560 ip6_tunnel, ip6_gre: fix setting of DSCP on encapsulated packets
This fix addresses two problems in the way the DSCP field is formulated
 on the encapsulating header of IPv6 tunnels.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=195661

1) The IPv6 tunneling code was manipulating the DSCP field of the
 encapsulating packet using the 32b flowlabel. Since the flowlabel is
 only the lower 20b it was incorrect to assume that the upper 12b
 containing the DSCP and ECN fields would remain intact when formulating
 the encapsulating header. This fix handles the 'inherit' and
 'fixed-value' DSCP cases explicitly using the extant dsfield u8 variable.

2) The use of INET_ECN_encapsulate(0, dsfield) in ip6_tnl_xmit was
 incorrect and resulted in the DSCP value always being set to 0.

Commit 90427ef5d2 ("ipv6: fix flow labels when the traffic class
 is non-0") caused the regression by masking out the flowlabel
 which exposed the incorrect handling of the DSCP portion of the
 flowlabel in ip6_tunnel and ip6_gre.

Fixes: 90427ef5d2 ("ipv6: fix flow labels when the traffic class is non-0")
Signed-off-by: Peter Dawson <peter.a.dawson@boeing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-26 14:54:39 -04:00
Roopa Prabhu
18c3a61c42 net: ipv6: RTM_GETROUTE: return matched fib result when requested
This patch adds support to return matched fib result when RTM_F_FIB_MATCH
flag is specified in RTM_GETROUTE request. This is useful for user-space
applications/controllers wanting to query a matching route.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-26 14:12:51 -04:00
David S. Miller
218b6a5b23 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-05-22 23:32:48 -04:00
David Ahern
d5d531cb50 net: ipv6: Add extack messages for route add failures
Add messages for non-obvious errors (e.g, no need to add text for malloc
failures or ENODEV failures). This mostly covers the annoying EINVAL errors
Some message strings violate the 80-columns but searchable strings need to
trump that rule.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-22 12:12:20 -04:00
David Ahern
333c430167 net: ipv6: Plumb extack through route add functions
Plumb extack argument down to route add functions.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-22 12:12:20 -04:00
Eric Dumazet
232cd35d08 ipv6: fix out of bound writes in __ip6_append_data()
Andrey Konovalov and idaifish@gmail.com reported crashes caused by
one skb shared_info being overwritten from __ip6_append_data()

Andrey program lead to following state :

copy -4200 datalen 2000 fraglen 2040
maxfraglen 2040 alloclen 2048 transhdrlen 0 offset 0 fraggap 6200

The skb_copy_and_csum_bits(skb_prev, maxfraglen, data + transhdrlen,
fraggap, 0); is overwriting skb->head and skb_shared_info

Since we apparently detect this rare condition too late, move the
code earlier to even avoid allocating skb and risking crashes.

Once again, many thanks to Andrey and syzkaller team.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Reported-by: <idaifish@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-22 11:47:44 -04:00
stephen hemminger
0a9fc39e41 ipv6: drop unused variables in seg6_genl_dumphac
THe seg6_pernet_data variable was set but never used.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-21 13:42:36 -04:00
stephen hemminger
9dc621afa8 fou: make local function static
The build header functions are not used by any other code.

net/ipv6/fou6.c:36:5: warning: no previous prototype for ‘fou6_build_header’ [-Wmissing-prototypes]
net/ipv6/fou6.c:54:5: warning: no previous prototype for ‘gue6_build_header’ [-Wmissing-prototypes]

Need to do some code rearranging to satisfy different Kconfig possiblities.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-21 13:42:36 -04:00
stephen hemminger
9e7b19c516 ila: propagate error code in ila_output
This warning:
net/ipv6/ila/ila_lwt.c: In function ‘ila_output’:
net/ipv6/ila/ila_lwt.c:42:6: warning: variable ‘err’ set but not used [-Wunused-but-set-variable]

It looks like the code attempts to set propagate different error
values, but always returned -EINVAL.

Compile tested only. Needs review by original author.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-21 13:42:33 -04:00
Davide Caratti
219f1d7987 sk_buff: remove support for csum_bad in sk_buff
This bit was introduced with commit 5a21232983 ("net: Support for
csum_bad in skbuff") to reduce the stack workload when processing RX
packets carrying a wrong Internet Checksum. Up to now, only one driver and
GRO core are setting it.

Suggested-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-19 19:21:29 -04:00
David S. Miller
c6cd850d65 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-05-18 16:11:32 -04:00
Paolo Abeni
a3f96c47c8 udp: make *udp*_queue_rcv_skb() functions static
Since the udp memory accounting refactor, we don't need any more
to export the *udp*_queue_rcv_skb(). Make them static and fix
a couple of sparse warnings:

net/ipv4/udp.c:1615:5: warning: symbol 'udp_queue_rcv_skb' was not
declared. Should it be static?
net/ipv6/udp.c:572:5: warning: symbol 'udpv6_queue_rcv_skb' was not
declared. Should it be static?

Fixes: 850cbaddb5 ("udp: use it's own memory accounting schema")
Fixes: c915fe13cb ("udplite: fix NULL pointer dereference")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-18 10:23:33 -04:00
David S. Miller
7dd7eb9513 ipv6: Check ip6_find_1stfragopt() return value properly.
Do not use unsigned variables to see if it returns a negative
error or not.

Fixes: 2423496af3 ("ipv6: Prevent overrun when parsing v6 header options")
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 22:54:11 -04:00
Eric Dumazet
9a568de481 tcp: switch TCP TS option (RFC 7323) to 1ms clock
TCP Timestamps option is defined in RFC 7323

Traditionally on linux, it has been tied to the internal
'jiffies' variable, because it had been a cheap and good enough
generator.

For TCP flows on the Internet, 1 ms resolution would be much better
than 4ms or 10ms (HZ=250 or HZ=100 respectively)

For TCP flows in the DC, Google has used usec resolution for more
than two years with great success [1]

Receive size autotuning (DRS) is indeed more precise and converges
faster to optimal window size.

This patch converts tp->tcp_mstamp to a plain u64 value storing
a 1 usec TCP clock.

This choice will allow us to upstream the 1 usec TS option as
discussed in IETF 97.

[1] https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 16:06:01 -04:00