Commit Graph

9803 Commits

Author SHA1 Message Date
Aaron Conole
e3b37f11e6 netfilter: replace list_head with single linked list
The netfilter hook list never uses the prev pointer, and so can be trimmed to
be a simple singly-linked list.

In addition to having a more light weight structure for hook traversal,
struct net becomes 5568 bytes (down from 6400) and struct net_device becomes
2176 bytes (down from 2240).

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 14:38:48 +02:00
Aaron Conole
54f17bbc52 netfilter: nf_queue: whitespace cleanup
A future patch will modify the hook drop and outfn functions.  This will
cause the line lengths to take up too much space.  This is simply a
readability change.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 01:20:05 +02:00
Florian Westphal
c5136b15ea netfilter: bridge: add and use br_nf_hook_thresh
This replaces the last uses of NF_HOOK_THRESH().
Followup patch will remove it and rename nf_hook_thresh.

The reason is that inet (non-bridge) netfilter no longer invokes the
hooks from hooks, so we do no longer need the thresh value to skip hooks
with a lower priority.

The bridge netfilter however may need to do this. br_nf_hook_thresh is a
wrapper that is supposed to do this, i.e. only call hooks with a
priority that exceeds NF_BR_PRI_BRNF.

It's used only in the recursion cases of br_netfilter.  It invokes
nf_hook_slow while holding an rcu read-side critical section to make a
future cleanup simpler.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-24 21:25:48 +02:00
Liping Zhang
2462f3f4a7 netfilter: nf_queue: improve queue range support for bridge family
After commit ac28634456 ("netfilter: bridge: add nf_afinfo to enable
queuing to userspace"), we can queue packets to the user space in bridge
family. But when the user specify the queue range, packets will be only
delivered to the first queue num. Because in nfqueue_hash, we only support
ipv4 and ipv6 family. Now add support for bridge family too.

Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-23 09:30:01 +02:00
Laura Garcia Liebana
36b701fae1 netfilter: nf_tables: validate maximum value of u32 netlink attributes
Fetch value and validate u32 netlink attribute. This validation is
usually required when the u32 netlink attributes are being stored in a
field whose size is smaller.

This patch revisits 4da449ae1d ("netfilter: nft_exthdr: Add size check
on u8 nft_exthdr attributes").

Fixes: 96518518cc ("netfilter: add nftables")
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-23 09:29:02 +02:00
Liping Zhang
6bd14303a9 netfilter: nf_queue: get rid of dependency on IP6_NF_IPTABLES
hash_v6 is used by both nftables and ip6tables, so depend on
IP6_NF_IPTABLES is not properly.

Actually, it only parses ipv6hdr and computes a hash value, so
even if IPV6 is disabled, there's no side effect too, remove it.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 19:54:46 +02:00
Pablo Neira Ayuso
71212c9b04 netfilter: nf_tables: don't drop IPv6 packets that cannot parse transport
This is overly conservative and not flexible at all, so better let them
go through and let the filtering policy decide what to do with them. We
use skb_header_pointer() all over the place so we would just fail to
match when trying to access fields from malformed traffic.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 18:52:32 +02:00
Pablo Neira Ayuso
10151d7b03 netfilter: nf_tables_bridge: use nft_set_pktinfo_ipv{4, 6}_validate
Consolidate pktinfo setup and validation by using the new generic
functions so we converge to the netdev family codebase.

We only need a linear IPv4 and IPv6 header from the reject expression,
so move nft_bridge_iphdr_validate() and nft_bridge_ip6hdr_validate()
to net/bridge/netfilter/nft_reject_bridge.c.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 18:52:15 +02:00
Pablo Neira Ayuso
ddc8b6027a netfilter: introduce nft_set_pktinfo_{ipv4, ipv6}_validate()
These functions are extracted from the netdev family, they initialize
the pktinfo structure and validate that the IPv4 and IPv6 headers are
well-formed given that these functions are called from a path where
layer 3 sanitization did not happen yet.

These functions are placed in include/net/netfilter/nf_tables_ipv{4,6}.h
so they can be reused by a follow up patch to use them from the bridge
family too.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 18:52:09 +02:00
Pablo Neira Ayuso
8df9e32e7e netfilter: nf_tables_ipv6: setup pktinfo transport field on failure to parse
Make sure the pktinfo protocol fields are initialized if this fails to
parse the transport header.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 18:52:04 +02:00
Pablo Neira Ayuso
beac5afa2d netfilter: nf_tables: ensure proper initialization of nft_pktinfo fields
This patch introduces nft_set_pktinfo_unspec() that ensures proper
initialization all of pktinfo fields for non-IP traffic. This is used
by the bridge, netdev and arp families.

This new function relies on nft_set_pktinfo_proto_unspec() to set a new
tprot_set field that indicates if transport protocol information is
available. Remain fields are zeroed.

The meta expression has been also updated to check to tprot_set in first
place given that zero is a valid tprot value. Even a handcrafted packet
may come with the IPPROTO_RAW (255) protocol number so we can't rely on
this value as tprot unset.

Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12 18:51:57 +02:00
Liping Zhang
3c15b8e112 netfilter: nf_conntrack: remove unused ctl_table_path member in nf_conntrack_l3proto
After commit adf0516845 ("netfilter: remove ip_conntrack* sysctl
compat code"), ctl_table_path member in struct nf_conntrack_l3proto{}
is not used anymore, remove it.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-09 16:17:58 +02:00
David S. Miller
60175ccdf4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next
tree.  Most relevant updates are the removal of per-conntrack timers to
use a workqueue/garbage collection approach instead from Florian
Westphal, the hash and numgen expression for nf_tables from Laura
Garcia, updates on nf_tables hash set to honor the NLM_F_EXCL flag,
removal of ip_conntrack sysctl and many other incremental updates on our
Netfilter codebase.

More specifically, they are:

1) Retrieve only 4 bytes to fetch ports in case of non-linear skb
   transport area in dccp, sctp, tcp, udp and udplite protocol
   conntrackers, from Gao Feng.

2) Missing whitespace on error message in physdev match, from Hangbin Liu.

3) Skip redundant IPv4 checksum calculation in nf_dup_ipv4, from Liping Zhang.

4) Add nf_ct_expires() helper function and use it, from Florian Westphal.

5) Replace opencoded nf_ct_kill() call in IPVS conntrack support, also
   from Florian.

6) Rename nf_tables set implementation to nft_set_{name}.c

7) Introduce the hash expression to allow arbitrary hashing of selector
   concatenations, from Laura Garcia Liebana.

8) Remove ip_conntrack sysctl backward compatibility code, this code has
   been around for long time already, and we have two interfaces to do
   this already: nf_conntrack sysctl and ctnetlink.

9) Use nf_conntrack_get_ht() helper function whenever possible, instead
   of opencoding fetch of hashtable pointer and size, patch from Liping Zhang.

10) Add quota expression for nf_tables.

11) Add number generator expression for nf_tables, this supports
    incremental and random generators that can be combined with maps,
    very useful for load balancing purpose, again from Laura Garcia Liebana.

12) Fix a typo in a debug message in FTP conntrack helper, from Colin Ian King.

13) Introduce a nft_chain_parse_hook() helper function to parse chain hook
    configuration, this is used by a follow up patch to perform better chain
    update validation.

14) Add rhashtable_lookup_get_insert_key() to rhashtable and use it from the
    nft_set_hash implementation to honor the NLM_F_EXCL flag.

15) Missing nulls check in nf_conntrack from nf_conntrack_tuple_taken(),
    patch from Florian Westphal.

16) Don't use the DYING bit to know if the conntrack event has been already
    delivered, instead a state variable to track event re-delivery
    states, also from Florian.

17) Remove the per-conntrack timer, use the workqueue approach that was
    discussed during the NFWS, from Florian Westphal.

18) Use the netlink conntrack table dump path to kill stale entries,
    again from Florian.

19) Add a garbage collector to get rid of stale conntracks, from
    Florian.

20) Reschedule garbage collector if eviction rate is high.

21) Get rid of the __nf_ct_kill_acct() helper.

22) Use ARPHRD_ETHER instead of hardcoded 1 from ARP logger.

23) Make nf_log_set() interface assertive on unsupported families.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-06 12:45:26 -07:00
Rosen, Rami
dd19bde367 switchdev: Fix return value of switchdev_port_fdb_dump().
This patch fixes the retun value of switchdev_port_fdb_dump() when
CONFIG_NET_SWITCHDEV is not set. This avoids getting "warning: return makes
integer from pointer without a cast [-Wint-conversion]" when building
when CONFIG_NET_SWITCHDEV is not set under several compiler versions.
This warning is due to commit d297653dd6
("rtnetlink: fdb dump: optimize by saving last interface markers").

Signed-off-by: Rami Rosen <rami.rosen@intel.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-02 11:00:21 -07:00
Vivien Didelot
04bed1434d net: dsa: remove ds_to_priv
Access the priv member of the dsa_switch structure directly, instead of
having an unnecessary helper.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 22:51:12 -07:00
Roopa Prabhu
d297653dd6 rtnetlink: fdb dump: optimize by saving last interface markers
fdb dumps spanning multiple skb's currently restart from the first
interface again for every skb. This results in unnecessary
iterations on the already visited interfaces and their fdb
entries. In large scale setups, we have seen this to slow
down fdb dumps considerably. On a system with 30k macs we
see fdb dumps spanning across more than 300 skbs.

To fix the problem, this patch replaces the existing single fdb
marker with three markers: netdev hash entries, netdevs and fdb
index to continue where we left off instead of restarting from the
first netdev. This is consistent with link dumps.

In the process of fixing the performance issue, this patch also
re-implements fix done by
commit 472681d57a ("net: ndo_fdb_dump should report -EMSGSIZE to rtnl_fdb_dump")
(with an internal fix from Wilson Kok) in the following ways:
- change ndo_fdb_dump handlers to return error code instead
of the last fdb index
- use cb->args strictly for dump frag markers and not error codes.
This is consistent with other dump functions.

Below results were taken on a system with 1000 netdevs
and 35085 fdb entries:
before patch:
$time bridge fdb show | wc -l
15065

real    1m11.791s
user    0m0.070s
sys 1m8.395s

(existing code does not return all macs)

after patch:
$time bridge fdb show | wc -l
35085

real    0m2.017s
user    0m0.113s
sys 0m1.942s

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 16:56:15 -07:00
Gao Feng
66fdd05e7a rps: flow_dissector: Add the const for the parameter of flow_keys_have_l4
Add the const for the parameter of flow_keys_have_l4 for the readability.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 16:51:08 -07:00
David Howells
d001648ec7 rxrpc: Don't expose skbs to in-kernel users [ver #2]
Don't expose skbs to in-kernel users, such as the AFS filesystem, but
instead provide a notification hook the indicates that a call needs
attention and another that indicates that there's a new call to be
collected.

This makes the following possibilities more achievable:

 (1) Call refcounting can be made simpler if skbs don't hold refs to calls.

 (2) skbs referring to non-data events will be able to be freed much sooner
     rather than being queued for AFS to pick up as rxrpc_kernel_recv_data
     will be able to consult the call state.

 (3) We can shortcut the receive phase when a call is remotely aborted
     because we don't have to go through all the packets to get to the one
     cancelling the operation.

 (4) It makes it easier to do encryption/decryption directly between AFS's
     buffers and sk_buffs.

 (5) Encryption/decryption can more easily be done in the AFS's thread
     contexts - usually that of the userspace process that issued a syscall
     - rather than in one of rxrpc's background threads on a workqueue.

 (6) AFS will be able to wait synchronously on a call inside AF_RXRPC.

To make this work, the following interface function has been added:

     int rxrpc_kernel_recv_data(
		struct socket *sock, struct rxrpc_call *call,
		void *buffer, size_t bufsize, size_t *_offset,
		bool want_more, u32 *_abort_code);

This is the recvmsg equivalent.  It allows the caller to find out about the
state of a specific call and to transfer received data into a buffer
piecemeal.

afs_extract_data() and rxrpc_kernel_recv_data() now do all the extraction
logic between them.  They don't wait synchronously yet because the socket
lock needs to be dealt with.

Five interface functions have been removed:

	rxrpc_kernel_is_data_last()
    	rxrpc_kernel_get_abort_code()
    	rxrpc_kernel_get_error_number()
    	rxrpc_kernel_free_skb()
    	rxrpc_kernel_data_consumed()

As a temporary hack, sk_buffs going to an in-kernel call are queued on the
rxrpc_call struct (->knlrecv_queue) rather than being handed over to the
in-kernel user.  To process the queue internally, a temporary function,
temp_deliver_data() has been added.  This will be replaced with common code
between the rxrpc_recvmsg() path and the kernel_rxrpc_recv_data() path in a
future patch.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 16:43:27 -07:00
Vivien Didelot
8df3025520 net: dsa: add MDB support
Add SWITCHDEV_OBJ_ID_PORT_MDB support to the DSA layer.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-31 14:15:42 -07:00
Roopa Prabhu
14972cbd34 net: lwtunnel: Handle fragmentation
Today mpls iptunnel lwtunnel_output redirect expects the tunnel
output function to handle fragmentation. This is ok but can be
avoided if we did not do the mpls output redirect too early.
ie we could wait until ip fragmentation is done and then call
mpls output for each ip fragment.

To make this work we will need,
1) the lwtunnel state to carry encap headroom
2) and do the redirect to the encap output handler on the ip fragment
(essentially do the output redirect after fragmentation)

This patch adds tunnel headroom in lwtstate to make sure we
account for tunnel data in mtu calculations during fragmentation
and adds new xmit redirect handler to redirect to lwtunnel xmit func
after ip fragmentation.

This includes IPV6 and some mtu fixes and testing from David Ahern.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-30 22:27:18 -07:00
David Howells
4de48af663 rxrpc: Pass struct socket * to more rxrpc kernel interface functions
Pass struct socket * to more rxrpc kernel interface functions.  They should
be starting from this rather than the socket pointer in the rxrpc_call
struct if they need to access the socket.

I have left:

	rxrpc_kernel_is_data_last()
	rxrpc_kernel_get_abort_code()
	rxrpc_kernel_get_error_number()
	rxrpc_kernel_free_skb()
	rxrpc_kernel_data_consumed()

unmodified as they're all about to be removed (and, in any case, don't
touch the socket).

Signed-off-by: David Howells <dhowells@redhat.com>
2016-08-30 16:07:53 +01:00
David Howells
8324f0bcfb rxrpc: Provide a way for AFS to ask for the peer address of a call
Provide a function so that kernel users, such as AFS, can ask for the peer
address of a call:

   void rxrpc_kernel_get_peer(struct rxrpc_call *call,
			      struct sockaddr_rxrpc *_srx);

In the future the kernel service won't get sk_buffs to look inside.
Further, this allows us to hide any canonicalisation inside AF_RXRPC for
when IPv6 support is added.

Also propagate this through to afs_find_server() and issue a warning if we
can't handle the address family yet.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-08-30 16:07:53 +01:00
Gao Feng
779994fa36 netfilter: log: Check param to avoid overflow in nf_log_set
The nf_log_set is an interface function, so it should do the strict sanity
check of parameters. Convert the return value of nf_log_set as int instead
of void. When the pf is invalid, return -EOPNOTSUPP.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-30 11:52:32 +02:00
Florian Westphal
ad66713f5a netfilter: remove __nf_ct_kill_acct helper
After timer removal this just calls nf_ct_delete so remove the __ prefix
version and make nf_ct_kill a shorthand for nf_ct_delete.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-30 11:43:10 +02:00
Florian Westphal
f330a7fdbe netfilter: conntrack: get rid of conntrack timer
With stats enabled this eats 80 bytes on x86_64 per nf_conn entry, as
Eric Dumazet pointed out during netfilter workshop 2016.

Eric also says: "Another reason was the fact that Thomas was about to
change max timer range [..]" (500462a9de, 'timers: Switch to
a non-cascading wheel').

Remove the timer and use a 32bit jiffies value containing timestamp until
entry is valid.

During conntrack lookup, even before doing tuple comparision, check
the timeout value and evict the entry in case it is too old.

The dying bit is used as a synchronization point to avoid races where
multiple cpus try to evict the same entry.

Because lookup is always lockless, we need to bump the refcnt once
when we evict, else we could try to evict already-dead entry that
is being recycled.

This is the standard/expected way when conntrack entries are destroyed.

Followup patches will introduce garbage colliction via work queue
and further places where we can reap obsoleted entries (e.g. during
netlink dumps), this is needed to avoid expired conntracks from hanging
around for too long when lookup rate is low after a busy period.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-30 11:43:09 +02:00
Florian Westphal
616b14b469 netfilter: don't rely on DYING bit to detect when destroy event was sent
The reliable event delivery mode currently (ab)uses the DYING bit to
detect which entries on the dying list have to be skipped when
re-delivering events from the eache worker in reliable event mode.

Currently when we delete the conntrack from main table we only set this
bit if we could also deliver the netlink destroy event to userspace.

If we fail we move it to the dying list, the ecache worker will
reattempt event delivery for all confirmed conntracks on the dying list
that do not have the DYING bit set.

Once timer is gone, we can no longer use if (del_timer()) to detect
when we 'stole' the reference count owned by the timer/hash entry, so
we need some other way to avoid racing with other cpu.

Pablo suggested to add a marker in the ecache extension that skips
entries that have been unhashed from main table but are still waiting
for the last reference count to be dropped (e.g. because one skb waiting
on nfqueue verdict still holds a reference).

We do this by adding a tristate.
If we fail to deliver the destroy event, make a note of this in the
eache extension.  The worker can then skip all entries that are in
a different state.  Either they never delivered a destroy event,
e.g. because the netlink backend was not loaded, or redelivery took
place already.

Once the conntrack timer is removed we will now be able to replace
del_timer() test with test_and_set_bit(DYING, &ct->status) to avoid
racing with other cpu that tries to evict the same conntrack.

Because DYING will then be set right before we report the destroy event
we can no longer skip event reporting when dying bit is set.

Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-30 11:43:08 +02:00
David S. Miller
6abdd5f593 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All three conflicts were cases of simple overlapping
changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-30 00:54:02 -04:00
Eric Dumazet
c9c3321257 tcp: add tcp_add_backlog()
When TCP operates in lossy environments (between 1 and 10 % packet
losses), many SACK blocks can be exchanged, and I noticed we could
drop them on busy senders, if these SACK blocks have to be queued
into the socket backlog.

While the main cause is the poor performance of RACK/SACK processing,
we can try to avoid these drops of valuable information that can lead to
spurious timeouts and retransmits.

Cause of the drops is the skb->truesize overestimation caused by :

- drivers allocating ~2048 (or more) bytes as a fragment to hold an
  Ethernet frame.

- various pskb_may_pull() calls bringing the headers into skb->head
  might have pulled all the frame content, but skb->truesize could
  not be lowered, as the stack has no idea of each fragment truesize.

The backlog drops are also more visible on bidirectional flows, since
their sk_rmem_alloc can be quite big.

Let's add some room for the backlog, as only the socket owner
can selectively take action to lower memory needs, like collapsing
receive queues or partial ofo pruning.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-29 00:20:24 -04:00
Tom Herbert
96a5908347 kcm: Remove TCP specific references from kcm and strparser
kcm and strparser need to work with any type of stream socket not just
TCP. Eliminate references to TCP and call generic proto_ops functions of
read_sock and peek_len. Also in strp_init check if the socket support
the proto_ops read_sock and peek_len.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-28 23:32:41 -04:00
Tom Herbert
3203558589 tcp: Set read_sock and peek_len proto_ops
In inet_stream_ops we set read_sock to tcp_read_sock and peek_len to
tcp_peek_len (which is just a stub function that calls tcp_inq).

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-28 23:32:41 -04:00
Tom Herbert
0294b625ad net: Add read_sock proto_op
Add new function in proto_ops structure. This includes moving the
typedef got sk_read_actor into net.h and removing the definition from
tcp.h.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-28 23:32:41 -04:00
Ido Schimmel
6bc506b4fb bridge: switchdev: Add forward mark support for stacked devices
switchdev_port_fwd_mark_set() is used to set the 'offload_fwd_mark' of
port netdevs so that packets being flooded by the device won't be
flooded twice.

It works by assigning a unique identifier (the ifindex of the first
bridge port) to bridge ports sharing the same parent ID. This prevents
packets from being flooded twice by the same switch, but will flood
packets through bridge ports belonging to a different switch.

This method is problematic when stacked devices are taken into account,
such as VLANs. In such cases, a physical port netdev can have upper
devices being members in two different bridges, thus requiring two
different 'offload_fwd_mark's to be configured on the port netdev, which
is impossible.

The main problem is that packet and netdev marking is performed at the
physical netdev level, whereas flooding occurs between bridge ports,
which are not necessarily port netdevs.

Instead, packet and netdev marking should really be done in the bridge
driver with the switch driver only telling it which packets it already
forwarded. The bridge driver will mark such packets using the mark
assigned to the ingress bridge port and will prevent the packet from
being forwarded through any bridge port sharing the same mark (i.e.
having the same parent ID).

Remove the current switchdev 'offload_fwd_mark' implementation and
instead implement the proposed method. In addition, make rocker - the
sole user of the mark - use the proposed method.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-26 13:13:36 -07:00
Ivan Vecera
2a313cdf1e devlink: remove unused priv_size
Remove unused and useless priv_size member from struct devlink_ops.

Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-26 11:55:18 -07:00
Pablo Neira Ayuso
c016c7e45d netfilter: nf_tables: honor NLM_F_EXCL flag in set element insertion
If the NLM_F_EXCL flag is set, then new elements that clash with an
existing one return EEXIST. In case you try to add an element whose
data area differs from what we have, then this returns EBUSY. If no
flag is specified at all, then this returns success to userspace.

This patch also update the set insert operation so we can fetch the
existing element that clashes with the one you want to add, we need
this to make sure the element data doesn't differ.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-26 17:30:20 +02:00
Eric Dumazet
eb60a8ddf3 net: minor optimization in qdisc_qstats_cpu_drop()
per_cpu_inc() is faster (at least on x86) than per_cpu_ptr(xxx)++;

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-25 16:45:29 -07:00
Vivien Didelot
9d490b4ee4 net: dsa: rename switch operations structure
Now that the dsa_switch_driver structure contains only function pointers
as it is supposed to, rename it to the more appropriate dsa_switch_ops,
uniformly to any other operations structure in the kernel.

No functional changes here, basically just the result of something like:
s/dsa_switch_driver *drv/dsa_switch_ops *ops/g

However keep the {un,}register_switch_driver functions and their
dsa_switch_drivers list as is, since they represent the -- likely to be
deprecated soon -- legacy DSA registration framework.

In the meantime, also fix the following checks from checkpatch.pl to
make it happy with this patch:

    CHECK: Comparison to NULL could be written "!ops"
    #403: FILE: net/dsa/dsa.c:470:
    +	if (ops == NULL) {

    CHECK: Comparison to NULL could be written "ds->ops->get_strings"
    #773: FILE: net/dsa/slave.c:697:
    +		if (ds->ops->get_strings != NULL)

    CHECK: Comparison to NULL could be written "ds->ops->get_ethtool_stats"
    #824: FILE: net/dsa/slave.c:785:
    +	if (ds->ops->get_ethtool_stats != NULL)

    CHECK: Comparison to NULL could be written "ds->ops->get_sset_count"
    #835: FILE: net/dsa/slave.c:798:
    +		if (ds->ops->get_sset_count != NULL)

    total: 0 errors, 0 warnings, 4 checks, 784 lines checked

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-24 21:45:39 -07:00
Eric Dumazet
ba2489b0e0 net: remove clear_sk() method
We no longer use this handler, we can delete it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-23 23:25:29 -07:00
Eric Dumazet
4cac820466 udp: get rid of sk_prot_clear_portaddr_nulls()
Since we no longer use SLAB_DESTROY_BY_RCU for UDP,
we do not need sk_prot_clear_portaddr_nulls() helper.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-23 23:25:29 -07:00
David Ahern
5d77dca828 net: diag: support SOCK_DESTROY for UDP sockets
This implements SOCK_DESTROY for UDP sockets similar to what was done
for TCP with commit c1e64e298b ("net: diag: Support destroying TCP
sockets.") A process with a UDP socket targeted for destroy is awakened
and recvmsg fails with ECONNABORTED.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-23 23:12:27 -07:00
Yuchung Cheng
cebc5cbab4 net-tcp: retire TFO_SERVER_WO_SOCKOPT2 config
TFO_SERVER_WO_SOCKOPT2 was intended for debugging purposes during
Fast Open development. Remove this config option and also
update/clean-up the documentation of the Fast Open sysctl.

Reported-by: Piotr Jurkiewicz <piotr.jerzy.jurkiewicz@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-23 17:01:01 -07:00
Tom Herbert
cff6a334e6 strparser: Queue work when being unpaused
When the upper layer unpauses a stream parser connection we need to
queue rx_work to make sure no events are missed.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-23 16:23:12 -07:00
Andrew Lunn
7b314362a2 net: dsa: Allow the DSA driver to indicate the tag protocol
DSA drivers may drive different families of switches which need
different tag protocol. Rather than hard code the tag protocol in the
driver structure, have a callback for the DSA core to call.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-22 21:08:08 -07:00
WANG Cong
b9a24bb76b net_sched: properly handle failure case of tcf_exts_init()
After commit 22dc13c837 ("net_sched: convert tcf_exts from list to pointer array")
we do dynamic allocation in tcf_exts_init(), therefore we need
to handle the ENOMEM case properly.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-22 17:02:31 -07:00
Florian Fainelli
ea825e70d0 net: dsa: Export suspend/resume functions
In preparation for allowing switch drivers to implement system-wide
suspend/resume functions, export dsa_switch_suspend and
dsa_switch_resume() such that these are callable from the appropriate
driver specific suspend/resume functions.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-19 17:15:36 -07:00
Daniel Borkmann
54fd9c2dff bpf: get rid of cgroup helper related ifdefs
As recently discussed during the task_under_cgroup_hierarchy() addition,
we should get rid of the ifdefs surrounding the bpf_skb_under_cgroup()
helper. If related functionality is not built-in, the helper cannot be
used anyway, which is also in line with what we do for all other helpers.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18 23:38:16 -07:00
Eric Dumazet
bb1fceca22 tcp: fix use after free in tcp_xmit_retransmit_queue()
When tcp_sendmsg() allocates a fresh and empty skb, it puts it at the
tail of the write queue using tcp_add_write_queue_tail()

Then it attempts to copy user data into this fresh skb.

If the copy fails, we undo the work and remove the fresh skb.

Unfortunately, this undo lacks the change done to tp->highest_sack and
we can leave a dangling pointer (to a freed skb)

Later, tcp_xmit_retransmit_queue() can dereference this pointer and
access freed memory. For regular kernels where memory is not unmapped,
this might cause SACK bugs because tcp_highest_sack_seq() is buggy,
returning garbage instead of tp->snd_nxt, but with various debug
features like CONFIG_DEBUG_PAGEALLOC, this can crash the kernel.

This bug was found by Marco Grassi thanks to syzkaller.

Fixes: 6859d49475 ("[TCP]: Abstract tp->highest_sack accessing & point to next skb")
Reported-by: Marco Grassi <marco.gra@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18 23:22:57 -07:00
Hadar Hen Zion
956af37102 net_sched: act_vlan: Add priority option
The current vlan push action supports only vid and protocol options.
Add priority option.

Example script that adds vlan push action with vid and
priority:

tc filter add dev veth0 protocol ip parent ffff: \
	   flower \
	   	indev veth0 \
	   action vlan push id 100 priority 5

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18 23:13:14 -07:00
Hadar Hen Zion
f6a6692769 flow_dissector: Get vlan priority in addition to vlan id
Add vlan priority check to the flow dissector by adding new flow
dissector struct, flow_dissector_key_vlan which includes vlan tag
fields.

vlan_id and flow_label fields were under the same struct
(flow_dissector_key_tags). It was a convenient setting since struct
flow_dissector_key_tags is used by struct flow_keys and by setting
vlan_id and flow_label under the same struct, we get precisely 24 or 48
bytes in flow_keys from flow_dissector_key_basic.

Now, when adding vlan priority support, the code will be cleaner if
flow_label and vlan tag won't be under the same struct anymore.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18 23:13:13 -07:00
David S. Miller
60747ef4d1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor overlapping changes for both merge conflicts.

Resolution work done by Stephen Rothwell was used
as a reference.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18 01:17:32 -04:00
Linus Torvalds
184ca82348 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Buffers powersave frame test is reversed in cfg80211, fix from Felix
    Fietkau.

 2) Remove bogus WARN_ON in openvswitch, from Jarno Rajahalme.

 3) Fix some tg3 ethtool logic bugs, and one that would cause no
    interrupts to be generated when rx-coalescing is set to 0.  From
    Satish Baddipadige and Siva Reddy Kallam.

 4) QLCNIC mailbox corruption and napi budget handling fix from Manish
    Chopra.

 5) Fix fib_trie logic when walking the trie during /proc/net/route
    output than can access a stale node pointer.  From David Forster.

 6) Several sctp_diag fixes from Phil Sutter.

 7) PAUSE frame handling fixes in mlxsw driver from Ido Schimmel.

 8) Checksum fixup fixes in bpf from Daniel Borkmann.

 9) Memork leaks in nfnetlink, from Liping Zhang.

10) Use after free in rxrpc, from David Howells.

11) Use after free in new skb_array code of macvtap driver, from Jason
    Wang.

12) Calipso resource leak, from Colin Ian King.

13) mediatek bug fixes (missing stats sync init, etc.) from Sean Wang.

14) Fix bpf non-linear packet write helpers, from Daniel Borkmann.

15) Fix lockdep splats in macsec, from Sabrina Dubroca.

16) hv_netvsc bug fixes from Vitaly Kuznetsov, mostly to do with VF
    handling.

17) Various tc-action bug fixes, from CONG Wang.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits)
  net_sched: allow flushing tc police actions
  net_sched: unify the init logic for act_police
  net_sched: convert tcf_exts from list to pointer array
  net_sched: move tc offload macros to pkt_cls.h
  net_sched: fix a typo in tc_for_each_action()
  net_sched: remove an unnecessary list_del()
  net_sched: remove the leftover cleanup_a()
  mlxsw: spectrum: Allow packets to be trapped from any PG
  mlxsw: spectrum: Unmap 802.1Q FID before destroying it
  mlxsw: spectrum: Add missing rollbacks in error path
  mlxsw: reg: Fix missing op field fill-up
  mlxsw: spectrum: Trap loop-backed packets
  mlxsw: spectrum: Add missing packet traps
  mlxsw: spectrum: Mark port as active before registering it
  mlxsw: spectrum: Create PVID vPort before registering netdevice
  mlxsw: spectrum: Remove redundant errors from the code
  mlxsw: spectrum: Don't return upon error in removal path
  i40e: check for and deal with non-contiguous TCs
  ixgbe: Re-enable ability to toggle VLAN filtering
  ixgbe: Force VLNCTRL.VFE to be set in all VMDq paths
  ...
2016-08-17 17:26:58 -07:00