Commit Graph

38348 Commits

Author SHA1 Message Date
Marcelo Ricardo Leitner
b52effd263 sctp: fix cut and paste issue in comment
Cookie ACK is always received by the association initiator, so fix the
comment to avoid confusion.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 00:21:32 -07:00
Marcelo Ricardo Leitner
0ca50d12fe sctp: fix src address selection if using secondary addresses
In short, sctp is likely to incorrectly choose src address if socket is
bound to secondary addresses. This patch fixes it by adding a new check
that checks if such src address belongs to the interface that routing
identified as output.

This is enough to avoid rp_filter drops on remote peer.

Details:

Currently, sctp will do a routing attempt without specifying the src
address and compare the returned value (preferred source) with the
addresses that the socket is bound to. When using secondary addresses,
this will not match.

Then it will try specifying each of the addresses that the socket is
bound to and re-routing, checking if that address is valid as src for
that dst. Thing is, this check alone is weak:

# ip r l
192.168.100.0/24 dev eth1  proto kernel  scope link  src 192.168.100.149
192.168.122.0/24 dev eth0  proto kernel  scope link  src 192.168.122.147

# ip a l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:15:18:6a brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.147/24 brd 192.168.122.255 scope global dynamic eth0
       valid_lft 2160sec preferred_lft 2160sec
    inet 192.168.122.148/24 scope global secondary eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe15:186a/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:b3:91:46 brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.149/24 brd 192.168.100.255 scope global dynamic eth1
       valid_lft 2162sec preferred_lft 2162sec
    inet 192.168.100.148/24 scope global secondary eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:feb3:9146/64 scope link
       valid_lft forever preferred_lft forever
4: ens9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:05:47:ee brd ff:ff:ff:ff:ff:ff
    inet6 fe80::5054:ff:fe05:47ee/64 scope link
       valid_lft forever preferred_lft forever

# ip r g 192.168.100.193 from 192.168.122.148
192.168.100.193 from 192.168.122.148 dev eth1
    cache

Even if you specify an interface:

# ip r g 192.168.100.193 from 192.168.122.148 oif eth1
192.168.100.193 from 192.168.122.148 dev eth1
    cache

Although this would be valid, peers using rp_filter will drop such
packets as their src doesn't match the routes for that interface.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 00:20:17 -07:00
Marcelo Ricardo Leitner
07868284e5 sctp: reduce indent level on sctp_v4_get_dst
Paves the day for the next patch. Functionality stays untouched.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 00:20:17 -07:00
David S. Miller
27dfead164 Some fixes for the current cycle:
1. Arik introduced an rtnl-locked regulatory API to be able
     to differentiate between place do/don't have the RTNL;
     this fixes missing locking in some of the code paths
 
  2. Two small mesh bugfixes from Bob, one to avoid treating
     a certain malformed over-the-air frame and one to avoid
     sending a garbage field over the air.
 
  3. A fix for powersave during WoWLAN suspend from Krishna Chaitanya.
 
  4. A fix for a powersave vs. aggregation teardown race, from Michal.
 
  5. Thomas reduced the loglevel of CRDA messages to avoid spamming
     the kernel log with mostly irrelevant information.
 
  6. Tom fixed a dangling debugfs directory pointer that could cause
     crashes if subsequent addition of the same interface to debugfs
     failed for some reason.
 
  7. A fix from myself for a list corruption issue in mac80211 during
     combined interface shutdown/removal - shut down interfaces first
     and only then remove them to avoid that.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJVqQMwAAoJEDBSmw7B7bqrZzkQAIjMKojlJRouN/N/aF7ym2pC
 eAboLMC+XHubQoq2H01k5ZOSrLL1kElhkB7pLas+q00gTFyavLzEcEiFqCNuLwPH
 lQEwLXTDUeiaVWekOYJev/ONtaDdwUXoB4BPAA3Ih4EAk9fEBtcWiWeLDgOLOS8P
 eYVqcMV733cOTjhYImEQnhnm3qrcwSCF1vTOJaN4Gf/qqw6j2ilq5wU1TvPyh0TA
 1EP5Elb9hy1sud5X6shrsOBqkBrPoO1p3V4EeoHkxl8welqxXdqGvmA3K0sloGZT
 7RiL8PD4QVyISy1NrBDnNMRRgj6BD1aLC+clmECmmgYvGGcqbzLtB3CWUCV6oQmb
 TC4NmgJkKNVTvQnoqxQEL8JYSs/E2ITRKyMi3sfIYAyz1dVuQf1RkZZzB22rQWT2
 PaLq/k+vpS7E3OD3XB53flB/k7Y6j/OwJb/rE7i2vqSn3kcbua8H7dzd7p+AE5FA
 ZF//u2GBDgZeMaA9BvifByWy2+yvAEcD5/U9XkWqJ7t+HohKteLJj/scHT89pto3
 n0NZ7RVRMNQ9mz14UJiVnFOL/81AjmiU123S5UIIMkmVE5Zrn7VTZlN6fVY4Fcsh
 AtxHQesOlCw8T4lFLxgyKkEl7bxATQ2OMR6vWsZQraRHSqIuK8JDABRokIlzoFn/
 xC/Yn1vTaBuj+2nif/F0
 =US5Y
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2015-07-17' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Some fixes for the current cycle:

 1. Arik introduced an rtnl-locked regulatory API to be able
    to differentiate between place do/don't have the RTNL;
    this fixes missing locking in some of the code paths

 2. Two small mesh bugfixes from Bob, one to avoid treating
    a certain malformed over-the-air frame and one to avoid
    sending a garbage field over the air.

 3. A fix for powersave during WoWLAN suspend from Krishna Chaitanya.

 4. A fix for a powersave vs. aggregation teardown race, from Michal.

 5. Thomas reduced the loglevel of CRDA messages to avoid spamming
    the kernel log with mostly irrelevant information.

 6. Tom fixed a dangling debugfs directory pointer that could cause
    crashes if subsequent addition of the same interface to debugfs
    failed for some reason.

 7. A fix from myself for a list corruption issue in mac80211 during
    combined interface shutdown/removal - shut down interfaces first
    and only then remove them to avoid that.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 00:17:53 -07:00
Konstantin Khlebnikov
8bf4ada2e2 net: ratelimit warnings about dst entry refcount underflow or overflow
Kernel generates a lot of warnings when dst entry reference counter
overflows and becomes negative. That bug was seen several times at
machines with outdated 3.10.y kernels. Most like it's already fixed
in upstream. Anyway that flood completely kills machine and makes
further debugging impossible.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 00:11:19 -07:00
Eric Dumazet
b8a23e8d8e caif: fix leaks and race in caif_queue_rcv_skb()
1) If sk_filter() is applied, skb was leaked (not freed)
2) Testing SOCK_DEAD twice is racy :
   packet could be freed while already queued.
3) Remove obsolete comment about caching skb->len

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 00:02:44 -07:00
Alexei Starovoitov
4d9c5c53ac test_bpf: add bpf_skb_vlan_push/pop() tests
improve accuracy of timing in test_bpf and add two stress tests:
- {skb->data[0], get_smp_processor_id} repeated 2k times
- {skb->data[0], vlan_push} x 68 followed by {skb->data[0], vlan_pop} x 68

1st test is useful to test performance of JIT implementation of BPF_LD_ABS
together with BPF_CALL instructions.
2nd test is stressing skb_vlan_push/pop logic together with skb->data access
via BPF_LD_ABS insn which checks that re-caching of skb->data is done correctly.

In order to call bpf_skb_vlan_push() from test_bpf.ko have to add
three export_symbol_gpl.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:52:32 -07:00
Alexei Starovoitov
4e10df9a60 bpf: introduce bpf_skb_vlan_push/pop() helpers
Allow eBPF programs attached to TC qdiscs call skb_vlan_push/pop via
helper functions. These functions may change skb->data/hlen which are
cached by some JITs to improve performance of ld_abs/ld_ind instructions.
Therefore JITs need to recognize bpf_skb_vlan_push/pop() calls,
re-compute header len and re-cache skb->data/hlen back into cpu registers.
Note, skb->data/hlen are not directly accessible from the programs,
so any changes to skb->data done either by these helpers or by other
TC actions are safe.

eBPF JIT supported by three architectures:
- arm64 JIT is using bpf_load_pointer() without caching, so it's ok as-is.
- x64 JIT re-caches skb->data/hlen unconditionally after vlan_push/pop calls
  (experiments showed that conditional re-caching is slower).
- s390 JIT falls back to interpreter for now when bpf_skb_vlan_push() is present
  in the program (re-caching is tbd).

These helpers allow more scalable handling of vlan from the programs.
Instead of creating thousands of vlan netdevs on top of eth0 and attaching
TC+ingress+bpf to all of them, the program can be attached to eth0 directly
and manipulate vlans as necessary.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:52:31 -07:00
Jon Paul Maloy
d999297c3d tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:

We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.

Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.

This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:16 -07:00
Jon Paul Maloy
1a20cc254e tipc: introduce node contact FSM
The logics for determining when a node is permitted to establish
and maintain contact with its peer node becomes non-trivial in the
presence of multiple parallel links that may come and go independently.

A known failure scenario is that one endpoint registers both its links
to the peer lost, cleans up it binding table, and prepares for a table
update once contact is re-establihed, while the other endpoint may
see its links reset and re-established one by one, hence seeing
no need to re-synchronize the binding table. To avoid this, a node
must not allow re-establishing contact until it has confirmation that
even the peer has lost both links.

Currently, the mechanism for handling this consists of setting and
resetting two state flags from different locations in the code. This
solution is hard to understand and maintain. A closer analysis even
reveals that it is not completely safe.

In this commit we do instead introduce an FSM that keeps track of
the conditions for when the node can establish and maintain links.
It has six states and four events, and is strictly based on explicit
knowledge about the own node's and the peer node's contact states.
Only events leading to state change are shown as edges in the figure
below.

                             +--------------+
                             | SELF_UP/     |
           +---------------->| PEER_COMING  |-----------------+
    SELF_  |                 +--------------+                 |PEER_
    ESTBL_ |                        |                         |ESTBL_
    CONTACT|      SELF_LOST_CONTACT |                         |CONTACT
           |                        v                         |
           |                 +--------------+                 |
           |      PEER_      | SELF_DOWN/   |     SELF_       |
           |      LOST_   +--| PEER_LEAVING |<--+ LOST_       v
+-------------+   CONTACT |  +--------------+   | CONTACT  +-----------+
| SELF_DOWN/  |<----------+                     +----------| SELF_UP/  |
| PEER_DOWN   |<----------+                     +----------| PEER_UP   |
+-------------+   SELF_   |  +--------------+   | PEER_    +-----------+
           |      LOST_   +--| SELF_LEAVING/|<--+ LOST_       A
           |      CONTACT    | PEER_DOWN    |     CONTACT     |
           |                 +--------------+                 |
           |                         A                        |
    PEER_  |       PEER_LOST_CONTACT |                        |SELF_
    ESTBL_ |                         |                        |ESTBL_
    CONTACT|                 +--------------+                 |CONTACT
           +---------------->| PEER_UP/     |-----------------+
                             | SELF_COMING  |
                             +--------------+

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:16 -07:00
Jon Paul Maloy
8a1577c96f tipc: move link supervision timer to node level
In our effort to move control of the links to the link aggregation
layer, we move the perodic link supervision timer to struct tipc_node.
The new timer is shared between all links belonging to the node, thus
saving resources, while still kicking the FSM on both its pertaining
links at each expiration.

The current link timer and corresponding functions are removed.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:16 -07:00
Jon Paul Maloy
333ef69ed2 tipc: simplify link timer implementation
We create a second, simpler, link timer function, tipc_link_timeout().
The new function  makes use of the new FSM function introduced in the
previous commit, and just like it, takes a buffer queue as parameter.
It returns an event bit field and potentially a link protocol packet
to the caller.

The existing timer function, link_timeout(), is still needed for a
while, so we redesign it to become a wrapper around the new function.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:16 -07:00
Jon Paul Maloy
6ab30f9cbe tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.

In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.

Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.

We also rename the link states as follows:

WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN   -> TIPC_LINK_RESETTING
RESET_RESET     -> TIPC_LINK_ESTABLISHING

The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:15 -07:00
Jon Paul Maloy
426cc2b86d tipc: introduce new link protocol msg create function
As a preparation for later changes, we introduce a new function
tipc_link_build_proto_msg(). Instead of actually sending the created
protocol message, it only creates it and adds it to the head of a
skb queue provided by the caller.

Since we still need the existing function tipc_link_protocol_xmit()
for a while, we redesign it to make use of the new function.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:15 -07:00
Jon Paul Maloy
d3504c3449 tipc: clean up definitions and usage of link flags
The status flag LINK_STOPPED is not needed any more, since the
mechanism for delayed deletion of links has been removed.
Likewise, LINK_STARTED and LINK_START_EVT are unnecessary,
because we can just as well start the link timer directly from
inside tipc_link_create().

We eliminate these flags in this commit.

Instead of the above flags, we now introduce three new link modes,
TIPC_LINK_OPEN, TIPC_LINK_BLOCKED and TIPC_LINK_TUNNEL. The values
indicate whether, and in the case of TIPC_LINK_TUNNEL, which, messages
the link is allowed to receive in this state. TIPC_LINK_BLOCKED also
blocks timer-driven protocol messages to be sent out, and any change
to the link FSM. Since the modes are mutually exclusive, we convert
them to state values, and rename the 'flags' field in struct tipc_link
to 'exec_mode'.

Finally, we move the #defines for link FSM states and events from link.h
into enums inside the file link.c, which is the real usage scope of
these definitions.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:15 -07:00
Jon Paul Maloy
af9b028e27 tipc: make media xmit call outside node spinlock context
Currently, message sending is performed through a deep call chain,
where the node spinlock is grabbed and held during a significant
part of the transmission time. This is clearly detrimental to
overall throughput performance; it would be better if we could send
the message after the spinlock has been released.

In this commit, we do instead let the call revert on the stack after
the buffer chain has been added to the transmission queue, whereafter
clones of the buffers are transmitted to the device layer outside the
spinlock scope.

As a further step in our effort to separate the roles of the node
and link entities we also move the function tipc_link_xmit() to
node.c, and rename it to tipc_node_xmit().

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:15 -07:00
Jon Paul Maloy
22d85c7942 tipc: change sk_buffer handling in tipc_link_xmit()
When the function tipc_link_xmit() is given a buffer list for
transmission, it currently consumes the list both when transmission
is successful and when it fails, except for the special case when
it encounters link congestion.

This behavior is inconsistent, and needs to be corrected if we want
to avoid problems in later commits in this series.

In this commit, we change this to let the function consume the list
only when transmission is successful, and leave the list with the
sender in all other cases. We also modifiy the socket code so that
it adapts to this change, i.e., purges the list when a non-congestion
error code is returned.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:15 -07:00
Jon Paul Maloy
36e78a463b tipc: use bearer index when looking up active links
struct tipc_node currently holds two arrays of link pointers; one,
indexed by bearer identity, which contains all links irrespective of
current state, and one two-slot array for the currently active link
or links. The latter array contains direct pointers into the elements
of the former. This has the effect that we cannot know the bearer id of
a link when accessing it via the "active_links[]" array without actually
dereferencing the pointer, something we want to avoid in some cases.

In this commit, we do instead store the bearer identity in the
"active_links" array, and use this as an index to find the right element
in the overall link entry array. This change should be seen as a
preparation for the later commits in this series.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:14 -07:00
Jon Paul Maloy
d39bbd445d tipc: move link input queue to tipc_node
At present, the link input queue and the name distributor receive
queues are fields aggregated in struct tipc_link. This is a hazard,
because a link might be deleted while a receiving socket still keeps
reference to one of the queues.

This commit fixes this bug. However, rather than adding yet another
reference counter to the critical data path, we move the two queues
to safe ground inside struct tipc_node, which is already protected, and
let the link code only handle references to the queues. This is also
in line with planned later changes in this area.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:14 -07:00
Jon Paul Maloy
d3a43b907a tipc: move link creation from neighbor discoverer to node
As a step towards turning links into node internal entities, we move the
creation of links from the neighbor discovery logics to the node's link
control logics.

We also create an additional entry for the link's media address in the
newly introduced struct tipc_link_entry, since this is where it is
needed in the upcoming commits. The current copy in struct tipc_link
is kept for now, but will be removed later.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:14 -07:00
Jon Paul Maloy
9d13ec65ed tipc: introduce link entry structure to struct tipc_node
struct 'tipc_node' currently contains two arrays for link attributes,
one for the link pointers, and one for the usable link MTUs.

We now group those into a new struct 'tipc_link_entry', and intoduce
one single array consisting of such enties. Apart from being a cosmetic
improvement, this is a starting point for the strict master-slave
relation between node and link that we will introduce in the following
commits.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:14 -07:00
Scott Feldman
1a3b2ec93d switchdev: add offload_fwd_mark generator helper
skb->offload_fwd_mark and dev->offload_fwd_mark are 32-bit and should be
unique for device and may even be unique for a sub-set of ports within
device, so add switchdev helper function to generate unique marks based on
port's switch ID and group_ifindex.  group_ifindex would typically be the
container dev's ifindex, such as the bridge's ifindex.

The generator uses a global hash table to store offload_fwd_marks hashed by
{switch ID, group_ifindex} key.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 18:32:44 -07:00
Scott Feldman
d754f98b50 net: add phys ID compare helper to test if two IDs are the same
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 18:32:44 -07:00
Scott Feldman
0c4f691ff6 net: don't reforward packets already forwarded by offload device
Just before queuing skb for xmit on port, check if skb has been marked by
switchdev port driver as already fordwarded by device.  If so, drop skb.  A
non-zero skb->offload_fwd_mark field is set by the switchdev port
driver/device on ingress to indicate the skb has already been forwarded by
the device to egress ports with matching dev->skb_mark.  The switchdev port
driver would assign a non-zero dev->offload_skb_mark for each device port
netdev during registration, for example.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 18:32:44 -07:00
Herbert Xu
fdbf5b097b Revert "sit: Add gro callbacks to sit_offload"
This patch reverts 19424e052f ("sit:
Add gro callbacks to sit_offload") because it generates packets
that cannot be handled even by our own GSO.

Reported-by: Wolfgang Walter <linux@stwm.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 16:52:28 -07:00
Nikolay Aleksandrov
a7ce45a74b bridge: mcast: fix br_multicast_dev_del warn when igmp snooping is not defined
Fix:
net/bridge/br_if.c: In function 'br_dev_delete':
>> net/bridge/br_if.c:284:2: error: implicit declaration of function
>> 'br_multicast_dev_del' [-Werror=implicit-function-declaration]
     br_multicast_dev_del(br);
     ^
   cc1: some warnings being treated as errors

when igmp snooping is not defined.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 16:19:03 -07:00
Phil Sutter
a0a9f33bdf net/ipv6: update flowi6_oif in ip6_dst_lookup_flow if not set
Newly created flows don't have flowi6_oif set (at least if the
associated socket is not interface-bound). This leads to a mismatch in
__xfrm6_selector_match() for policies which specify an interface in the
selector (sel->ifindex != 0).

Backtracing shows this happens in code-paths originating from e.g.
ip6_datagram_connect(), rawv6_sendmsg() or tcp_v6_connect(). (UDP was
not tested for.)

In summary, this patch fixes policy matching on outgoing interface for
locally generated packets.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 12:59:32 -07:00
Satish Ashok
e10177abf8 bridge: multicast: fix handling of temp and perm entries
When the bridge (or port) is brought down/up flush only temp entries and
leave the perm ones. Flush perm entries only when deleting the bridge
device or the associated port.

Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 12:49:10 -07:00
Nikolay Aleksandrov
ef8299de7e bridge: multicast: notify on group delete
Group notifications were not sent when a group expired or was deleted
due to bridge/port device being deleted. So add br_mdb_notify() to
br_multicast_del_pg().

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 12:49:10 -07:00
Daniel Borkmann
8d20aabe1c ebpf: add helper to retrieve net_cls's classid cookie
It would be very useful to retrieve the net_cls's classid from an eBPF
program to allow for a more fine-grained classification, it could be
directly used or in conjunction with additional policies. I.e. docker,
but also tooling such as cgexec, can easily run applications via net_cls
cgroups:

  cgcreate -g net_cls:/foo
  echo 42 > foo/net_cls.classid
  cgexec -g net_cls:foo <prog>

Thus, their respecitve classid cookie of foo can then be looked up on
the egress path to apply further policies. The helper is desigend such
that a non-zero value returns the cgroup id.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 12:41:30 -07:00
Daniel Borkmann
b87a173e25 cls_cgroup: factor out classid retrieval
Split out retrieving the cgroups net_cls classid retrieval into its
own function, so that it can be reused later on from other parts of
the traffic control subsystem. If there's no skb->sk, then the small
helper returns 0 as well, which in cls_cgroup terms means 'could not
classify'.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 12:41:30 -07:00
Arik Nemtsov
923b352f19 cfg80211: use RTNL locked reg_can_beacon for IR-relaxation
The RTNL is required to check for IR-relaxation conditions that allow
more channels to beacon. Export an RTNL locked version of reg_can_beacon
and use it where possible in AP/STA interface type flows, where
IR-relaxation may be applicable.

Fixes: 06f207fc54 ("cfg80211: change GO_CONCURRENT to IR_CONCURRENT for STA")
Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-07-17 15:02:02 +02:00
Bob Copeland
b3e7de873d mac80211: add missing length check for confirm frames
Although mesh_rx_plink_frame() already checks that frames have enough
bytes for the action code plus another two bytes for capability/reason
code, it doesn't take into account that confirm frames also have an
additional two-byte aid.  As a result, a corrupt frame could cause a
subsequent subtraction to wrap around to ill effect.  Add another
check for this case.

Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-07-17 14:39:42 +02:00
Bob Copeland
2ea752cd2c mac80211: correct aid location in peering frames
According to 802.11-2012 8.5.16.3.2 AID comes directly after the
capability bytes in mesh peering confirm frames.  The existing
code, however, was adding a 2 byte offset to this location,
resulting in garbage data going out over the air.  Remove the
offset to fix it.

Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-07-17 14:38:10 +02:00
Thomas Petazzoni
042ab5fc7a wireless: regulatory: reduce log level of CRDA related messages
With a basic Linux userspace, the messages "Calling CRDA to update
world regulatory domain" appears 10 times after boot every second or
so, followed by a final "Exceeded CRDA call max attempts. Not calling
CRDA". For those of us not having the corresponding userspace parts,
having those messages repeatedly displayed at boot time is a bit
annoying, so this commit reduces their log level to pr_debug().

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-07-17 14:37:23 +02:00
Johannes Berg
d8d9008cfb mac80211: shut down interfaces before destroying interface list
If the hardware is unregistered while interfaces are up, mac80211 will
unregister all interfaces, which in turns causes mac80211 to be called
again to remove them all from the driver and eventually shut down the
hardware.

During this shutdown, however, it's currently already unsafe to iterate
the list of interfaces atomically, as the list is manipulated in an
unsafe manner. This puts an undue burden on the driver - it must stop
all its activities before calling ieee80211_unregister_hw(), while in
the normal stop path it can do all cleanup in the stop method. If, for
example, it's using the iteration during RX for some reason, it would
have to stop RX before unregistering to avoid crashes.

Fix this problem by closing all interfaces before unregistering them.
This will cause the driver stop to have completed before we manipulate
the interface list, and after the driver is stopped *and* has called
ieee80211_unregister_hw() it really musn't be iterating any more as
the memory will be freed as well.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-07-17 11:16:26 +02:00
Chaitanya T K
541b6ed7ce mac80211: wowlan: enable powersave if suspend while ps-polling
If for any reason we're in the middle of PS-polling or awake after
TX due to dynamic powersave while going to suspend, go back to save
power. This might cause a response frame to get lost, but since we
can't really wait for it while going to suspend that's still better
than not enabling powersave which would cause higher power usage
during (and possibly even after) suspend.

Note that this really only affects the very few drivers that use
the powersave implementation in mac80211.

Signed-off-by: Chaitanya T K <chaitanya.mgit@gmail.com>
[rewrite misleading commit log]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-07-17 11:13:21 +02:00
Michal Kazior
e9de01907e mac80211: don't clear all tx flags when requeing
When acting as AP and a PS-Poll frame is received
associated station is marked as one in a Service
Period. This state is kept until Tx status for
released frame is reported. While a station is in
Service Period PS-Poll frames are ignored.

However if PS-Poll was received during A-MPDU
teardown it was possible to have the to-be
released frame re-queued back to pending queue.
In such case the frame was stripped of 2 important
flags:

 (a) IEEE80211_TX_CTL_NO_PS_BUFFER
 (b) IEEE80211_TX_STATUS_EOSP

Stripping of (a) led to the frame that was to be
released to be queued back to ps_tx_buf queue. If
station remained to use only PS-Poll frames the
re-queued frame (and new ones) was never actually
transmitted because mac80211 would ignore
subsequent PS-Poll frames due to station being in
Service Period. There was nothing left to clear
the Service Period bit (no xmit -> no tx status ->
no SP end), i.e. the AP would have the station
stuck in Service Period. Beacon TIM would
repeatedly prompt station to poll for frames but
it would get none.

Once (a) is not stripped (b) becomes important
because it's the main condition to clear the
Service Period bit of the station when Tx status
for the released frame is reported back.

This problem was observed with ath9k acting as P2P
GO in some testing scenarios but isn't limited to
it. AP operation with mac80211 based Tx A-MPDU
control combined with clients using PS-Poll frames
is subject to this race.

Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-07-17 11:06:21 +02:00
Tom Hughes
4479004e64 mac80211: clear subdir_stations when removing debugfs
If we don't do this, and we then fail to recreate the debugfs
directory during a mode change, then we will fail later trying
to add stations to this now bogus directory:

BUG: unable to handle kernel NULL pointer dereference at 0000006c
IP: [<c0a92202>] mutex_lock+0x12/0x30
Call Trace:
[<c0678ab4>] start_creating+0x44/0xc0
[<c0679203>] debugfs_create_dir+0x13/0xf0
[<f8a938ae>] ieee80211_sta_debugfs_add+0x6e/0x490 [mac80211]

Cc: stable@kernel.org
Signed-off-by: Tom Hughes <tom@compton.nu>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-07-17 10:53:19 +02:00
YOSHIFUJI Hideaki
c15df306fc ipv6: Remove unused arguments for __ipv6_dev_get_saddr().
Signed-off-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-16 01:00:56 -07:00
Anuradha Karuppiah
88d6378bd6 netlink: changes for setting and clearing protodown via netlink.
Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 21:39:40 -07:00
Anuradha Karuppiah
d746d707a8 net core: Add protodown support.
This patch introduces the proto_down flag that can be used by user space
applications to notify switch drivers that errors have been detected on the
device.

The switch driver can react to protodown notification by doing a phys down
on the associated switch port.

Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 21:39:40 -07:00
Alexei Starovoitov
ddf06c1e56 tc: act_bpf: fix memory leak
prog->bpf_ops is populated when act_bpf is used with classic BPF and
prog->bpf_name is optionally used with extended BPF.
Fix memory leak when act_bpf is released.

Fixes: d23b8ad8ab ("tc: add BPF based action")
Fixes: a8cb5f556b ("act_bpf: add initial eBPF support for actions")
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 21:36:35 -07:00
WANG Cong
c0afd9ce4d fq_codel: fix return value of fq_codel_drop()
The ->drop() is supposed to return the number of bytes it dropped,
however fq_codel_drop() returns the index of the flow where it drops
a packet from.

Fix this by introducing a helper to wrap fq_codel_drop().

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 21:36:35 -07:00
WANG Cong
e8d092aafd net_sched: fix a use-after-free in sfq
Fixes: 25331d6ce4 ("net: sched: implement qstat helper routines")
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 21:36:34 -07:00
YOSHIFUJI Hideaki/吉藤英明
c0b8da1e76 ipv6: Fix finding best source address in ipv6_dev_get_saddr().
Commit 9131f3de2 ("ipv6: Do not iterate over all interfaces when
finding source address on specific interface.") did not properly
update best source address available.  Plus, it introduced
possible NULL pointer dereference.

Bug was reported by Erik Kline <ek@google.com>.
Based on patch proposed by Hajime Tazaki <thehajime@gmail.com>.

Fixes: 9131f3de24 ("ipv6: Do not
	iterate over all interfaces when finding source address
	on specific interface.")
Signed-off-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Acked-by: Hajime Tazaki <thehajime@gmail.com>
Acked-by: Erik Kline <ek@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 21:06:13 -07:00
Eric Dumazet
03645a11a5 ipv6: lock socket in ip6_datagram_connect()
ip6_datagram_connect() is doing a lot of socket changes without
socket being locked.

This looks wrong, at least for udp_lib_rehash() which could corrupt
lists because of concurrent udp_sk(sk)->udp_portaddr_hash accesses.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 17:25:51 -07:00
Andrea Parri
40bdc5360d pkt_sched: sch_qfq: remove unused member of struct qfq_sched
The member (u32) "num_active_agg" of struct qfq_sched has been unused
since its introduction in 462dbc9101
"pkt_sched: QFQ Plus: fair-queueing service at DRR cost" and (AFAICT)
there is no active plan to use it; this removes the member.

Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Acked-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 17:21:31 -07:00
WANG Cong
052cbda41f fq_codel: fix a use-after-free
Fixes: 25331d6ce4 ("net: sched: implement qstat helper routines")
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 17:18:21 -07:00
Yuchung Cheng
f82b681a51 tcp: don't use F-RTO on non-recurring timeouts
Currently F-RTO may repeatedly send new data packets on non-recurring
timeouts in CA_Loss mode. This is a bug because F-RTO (RFC5682)
should only be used on either new recovery or recurring timeouts.

This exacerbates the recovery progress during frequent timeout &
repair, because we prioritize sending new data packets instead of
repairing the holes when the bandwidth is already scarce.

Fix it by correcting the test of a new recovery episode.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 17:17:21 -07:00
Nikolay Aleksandrov
5ebc784625 bridge: mdb: fix double add notification
Since the mdb add/del code was introduced there have been 2 br_mdb_notify
calls when doing br_mdb_add() resulting in 2 notifications on each add.

Example:
 Command: bridge mdb add dev br0 port eth1 grp 239.0.0.1 permanent
 Before patch:
 root@debian:~# bridge monitor all
 [MDB]dev br0 port eth1 grp 239.0.0.1 permanent
 [MDB]dev br0 port eth1 grp 239.0.0.1 permanent

 After patch:
 root@debian:~# bridge monitor all
 [MDB]dev br0 port eth1 grp 239.0.0.1 permanent

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: cfd5675435 ("bridge: add support of adding and deleting mdb entries")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 17:15:01 -07:00
Satish Ashok
bc8c20acae bridge: multicast: treat igmpv3 report with INCLUDE and no sources as a leave
A report with INCLUDE/Change_to_include and empty source list should be
treated as a leave, specified by RFC 3376, section 3.1:
"If the requested filter mode is INCLUDE *and* the requested source
 list is empty, then the entry corresponding to the requested
 interface and multicast address is deleted if present.  If no such
 entry is present, the request is ignored."

Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 17:10:40 -07:00
Linus Torvalds
9090fdb9c9 Changes for 4.2-rc
- Mainly fix-ups for the various 4.2 items
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVpWPTAAoJELgmozMOVy/dRjkQALGpY+m0Q+dfS4geU+JvvH0/
 tTNJevWA1lu5xIgzk8aW+RgNOt+IQV2K6XnKOdKWcdE2wg9Yg9/8Y7atxeNSLoB6
 mdUfstmTHoDFRXea1wog45QVVv9ABqSXIOmFd/PQSf2pEHvwgmVaxO2KW5n0n71H
 0G9nw28jx+Igd1IjX3UezbB4Rm5jeG6exQEM5KKQ/l/cD2Pt/DIr9OmTmkMhj4Ft
 4ppC+JsDHZMT2mQljEB1UL6Va1BGgULn4JwZviXRGgTEFiS+rtUkWXy9CVeNIV+A
 bo6aHEBlO8zbj9xg0POkHGBP2XCkDpsc5EP6/zon30MLpOTpnnTp2zq/bhGpyr5w
 ytH1x6V8Od1Or9BrHygy1yLGaqaVthIuZMReLq0Bu26px8mTECdu8bUjwxEvbvep
 u51vQ75h99td5XtXBHULcd/I8mUb3JqbEHqig0ChLOcnOMh72KLql8qfhLvzrYSv
 gFMvnqUZWq4WYET8cKcWypj5+cIME4objOim6T/TG+zTCmgbPnhwBsgvujLhfhDy
 JhUjUfS+31S99I4smL7ceKQpfA/7awLuZf/C20zdLZO2wXnx7VmbNlhQA3mvssyt
 ZOXeLkLGcXDIT3egNj5VWQR/KlVtvUS9FNVslxqo4ItyZZwJtbHTwRAgATp3XRwD
 Po7E7bWoUQqpanhgiX4Q
 =N/5G
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull rdma fixes from Doug Ledford:
 "Mainly fix-ups for the various 4.2 items"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (24 commits)
  IB/core: Destroy ocrdma_dev_id IDR on module exit
  IB/core: Destroy multcast_idr on module exit
  IB/mlx4: Optimize do_slave_init
  IB/mlx4: Fix memory leak in do_slave_init
  IB/mlx4: Optimize freeing of items on error unwind
  IB/mlx4: Fix use of flow-counters for process_mad
  IB/ipath: Convert use of __constant_<foo> to <foo>
  IB/ipoib: Set MTU to max allowed by mode when mode changes
  IB/ipoib: Scatter-Gather support in connected mode
  IB/ucm: Fix bitmap wrap when devnum > IB_UCM_MAX_DEVICES
  IB/ipoib: Prevent lockdep warning in __ipoib_ib_dev_flush
  IB/ucma: Fix lockdep warning in ucma_lock_files
  rds: rds_ib_device.refcount overflow
  RDMA/nes: Fix for incorrect recording of the MAC address
  RDMA/nes: Fix for resolving the neigh
  RDMA/core: Fixes for port mapper client registration
  IB/IPoIB: Fix bad error flow in ipoib_add_port()
  IB/mlx4: Do not attemp to report HCA clock offset on VFs
  IB/cm: Do not queue work to a device that's going away
  IB/srp: Avoid using uninitialized variable
  ...
2015-07-15 17:03:03 -07:00
Herbert Xu
89c22d8c3b net: Fix skb csum races when peeking
When we calculate the checksum on the recv path, we store the
result in the skb as an optimisation in case we need the checksum
again down the line.

This is in fact bogus for the MSG_PEEK case as this is done without
any locking.  So multiple threads can peek and then store the result
to the same skb, potentially resulting in bogus skb states.

This patch fixes this by only storing the result if the skb is not
shared.  This preserves the optimisations for the few cases where
it can be done safely due to locking or other reasons, e.g., SIOCINQ.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 15:59:58 -07:00
Richard Stearn
da278622bf NET: AX.25: Stop heartbeat timer on disconnect.
This may result in a kernel panic.  The bug has always existed but
somehow we've run out of luck now and it bites.

Signed-off-by: Richard Stearn <richard@rns-stearn.demon.co.uk>
Cc: stable@vger.kernel.org	# all branches
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 15:59:58 -07:00
Herbert Xu
738ac1ebb9 net: Clone skb before setting peeked flag
Shared skbs must not be modified and this is crucial for broadcast
and/or multicast paths where we use it as an optimisation to avoid
unnecessary cloning.

The function skb_recv_datagram breaks this rule by setting peeked
without cloning the skb first.  This causes funky races which leads
to double-free.

This patch fixes this by cloning the skb and replacing the skb
in the list when setting skb->peeked.

Fixes: a59322be07 ("[UDP]: Only increment counter on first peek/recv")
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 15:59:58 -07:00
Daniel Borkmann
035d210f92 rtnetlink: reject non-IFLA_VF_PORT attributes inside IFLA_VF_PORTS
Similarly as in commit 4f7d2cdfdd ("rtnetlink: verify IFLA_VF_INFO
attributes before passing them to driver"), we have a double nesting
of netlink attributes, i.e. IFLA_VF_PORTS only contains IFLA_VF_PORT
that is nested itself. While IFLA_VF_PORTS is a verified attribute
from ifla_policy[], we only check if the IFLA_VF_PORTS container has
IFLA_VF_PORT attributes and then pass the attribute's content itself
via nla_parse_nested(). It would be more correct to reject inner types
other than IFLA_VF_PORT instead of continuing parsing and also similarly
as in commit 4f7d2cdfdd, to check for a minimum of NLA_HDRLEN.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
Cc: Scott Feldman <sfeldma@gmail.com>
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 15:53:27 -07:00
Wengang Wang
4fabb59449 rds: rds_ib_device.refcount overflow
Fixes: 3e0249f9c0 ("RDS/IB: add refcount tracking to struct rds_ib_device")

There lacks a dropping on rds_ib_device.refcount in case rds_ib_alloc_fmr
failed(mr pool running out). this lead to the refcount overflow.

A complain in line 117(see following) is seen. From vmcore:
s_ib_rdma_mr_pool_depleted is 2147485544 and rds_ibdev->refcount is -2147475448.
That is the evidence the mr pool is used up. so rds_ib_alloc_fmr is very likely
to return ERR_PTR(-EAGAIN).

115 void rds_ib_dev_put(struct rds_ib_device *rds_ibdev)
116 {
117         BUG_ON(atomic_read(&rds_ibdev->refcount) <= 0);
118         if (atomic_dec_and_test(&rds_ibdev->refcount))
119                 queue_work(rds_wq, &rds_ibdev->free_work);
120 }

fix is to drop refcount when rds_ib_alloc_fmr failed.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-07-14 13:20:11 -04:00
David S. Miller
638d3c6381 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/bridge/br_mdb.c

Minor conflict in br_mdb.c, in 'net' we added a memset of the
on-stack 'ip' variable whereas in 'net-next' we assign a new
member 'vid'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-13 17:28:09 -07:00
Nikolay Aleksandrov
74fe61f17e bridge: mdb: add vlan support for user entries
Until now all user mdb entries were added in vlan 0, this patch adds
support to allow the user to specify the vlan for the entry.
About the uapi change a hole in struct br_mdb_entry is used so the size
and offsets are kept the same (verified with pahole and tested with older
iproute2).

Example:
$ bridge mdb
dev br0 port eth1 grp 239.0.0.1 permanent vlan 2000
dev br0 port eth1 grp 239.0.0.1 permanent vlan 200
dev br0 port eth1 grp 239.0.0.1 permanent

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-13 14:41:26 -07:00
Tom Herbert
de551f2eb2 net: Build IPv6 into kernel by default
This patch makes the default to build IPv6 into the kernel. IPv6
now has significant traction and any remaining vestiges of IPv6
not being provided parity with IPv4 should be swept away. IPv6 is now
core to the Internet and kernel.

Points on IPv6 adoption:

- Per Google statistics, IPv6 usage has reached 7% on the Internet
  and continues to exhibit an exponential growth rate
  https://www.google.com/intl/en/ipv6/statistics.html
- Just a few days ago ARIN officially depleted its IPv4 pool
- IPv6 only data centers are being successfully built
  (e.g. at Facebook)

This patch changes the IPv6 Kconfig for IPV6. Default for CONFIG_IPV6
is set to "y" and the text has been updated to reflect the maturity of
IPv6.

Impact:

Under some circumstances building modules in to kernel might have a
performance advantage. In my testing, I did notice a very slight
improvement.

This will obviously increase the size of the kernel image. In my
configuration I see:

IPv6 as module:

   text    data     bss     dec     hex filename
9703666 1899288  933888 12536842         bf4c0a vmlinux

IPv6 built into kernel

  text     data     bss     dec     hex filename
9436490 1879600  913408 12229498         ba9b7a vmlinux

Which increases text size by ~270K (2.8% increase in size for me). If
image size is an issue, presumably for a device which does not do IP
networking (IMO we should be discouraging IPv4-only devices), IPV6 can
be disabled or still built as a module.

Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-13 13:10:21 -07:00
Linus Torvalds
f760b87f8f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Missing list head init in bluetooth hidp session creation, from Tedd
    Ho-Jeong An.

 2) Don't leak SKB in bridge netfilter error paths, from Florian
    Westphal.

 3) ipv6 netdevice private leak in netfilter bridging, fixed by Julien
    Grall.

 4) Fix regression in IP over hamradio bpq encapsulation, from Ralf
    Baechle.

 5) Fix race between rhashtable resize events and table walks, from Phil
    Sutter.

 6) Missing validation of IFLA_VF_INFO netlink attributes, fix from
    Daniel Borkmann.

 7) Missing security layer socket state initialization in tipc code,
    from Stephen Smalley.

 8) Fix shared IRQ handling in boomerang 3c59x interrupt handler, from
    Denys Vlasenko.

 9) Missing minor_idr destroy on module unload on macvtap driver, from
    Johannes Thumshirn.

10) Various pktgen kernel thread races, from Oleg Nesterov.

11) Fix races that can cause packets to be processed in the backlog even
    after a device attached to that SKB has been fully unregistered.
    From Julian Anastasov.

12) bcmgenet driver doesn't account packet drops vs.  errors properly,
    fix from Petri Gynther.

13) Array index validation and off by one fix in DSA layer from Florian
    Fainelli

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (66 commits)
  can: replace timestamp as unique skb attribute
  ARM: dts: dra7x-evm: Prevent glitch on DCAN1 pinmux
  can: c_can: Fix default pinmux glitch at init
  can: rcar_can: unify error messages
  can: rcar_can: print request_irq() error code
  can: rcar_can: fix typo in error message
  can: rcar_can: print signed IRQ #
  can: rcar_can: fix IRQ check
  net: dsa: Fix off-by-one in switch address parsing
  net: dsa: Test array index before use
  net: switchdev: don't abort unsupported operations
  net: bcmgenet: fix accounting of packet drops vs errors
  cdc_ncm: update specs URL
  Doc: z8530book: Fix typo in API-z8530-sync-txdma-open.html
  net: inet_diag: always export IPV6_V6ONLY sockopt for listening sockets
  bridge: mdb: allow the user to delete mdb entry if there's a querier
  net: call rcu_read_lock early in process_backlog
  net: do not process device backlog during unregistration
  bridge: fix potential crash in __netdev_pick_tx()
  net: axienet: Fix devm_ioremap_resource return value check
  ...
2015-07-13 11:18:25 -07:00
David S. Miller
cee9f6d018 linux-can-fixes-for-4.2-20150712
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCgAGBQJVorxbAAoJEP5prqPJtc/H5zgH/2nvkmT3aQ1gBKdU8L+Ten5D
 LIKYyDJ67agNoMfEC/5xWOm7P3X8Qi8cGc326/9AZE22QLE5x6fMCq/SmEC4MA25
 GKu2ocwosdPVhIDQOY33IYZHxxtV8UZjt5KdNbQnNO6+iqE6tX5CbgueMlcWusry
 WmPCOlezTqHydXo0TYYLPjmqJ66RlrQMWYKvcB0SaxLYQfkBGbkW+fei3wdE4kW1
 t7cocRj6Ievv59wCeF+G+fpviTVQVR5bknGA0J3lku9onATOYfdArEAD9FRajygG
 KhA/eNr0YEA40VO/baUInuEAJ/YvDqdk9LoyJ1DOXgUv1ysxYSxu/44G/IcZsOU=
 =+zVD
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-fixes-for-4.2-20150712' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2015-07-12

this is a pull request of 8 patchs for net/master.

Sergei Shtylyov contributes 5 patches for the rcar_can driver, fixing the IRQ
check and several info and error messages. There are two patches by J.D.
Schroeder and Roger Quadros for the c_can driver and dra7x-evm device tree,
which precent a glitch in the DCAN1 pinmux. Oliver Hartkopp provides a better
approach to make the CAN skbs unique, the timestamp is replaced by a counter.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-12 22:24:01 -07:00
Oliver Hartkopp
d3b58c47d3 can: replace timestamp as unique skb attribute
Commit 514ac99c64 "can: fix multiple delivery of a single CAN frame for
overlapping CAN filters" requires the skb->tstamp to be set to check for
identical CAN skbs.

Without timestamping to be required by user space applications this timestamp
was not generated which lead to commit 36c01245eb "can: fix loss of CAN frames
in raw_rcv" - which forces the timestamp to be set in all CAN related skbuffs
by introducing several __net_timestamp() calls.

This forces e.g. out of tree drivers which are not using alloc_can{,fd}_skb()
to add __net_timestamp() after skbuff creation to prevent the frame loss fixed
in mainline Linux.

This patch removes the timestamp dependency and uses an atomic counter to
create an unique identifier together with the skbuff pointer.

Btw: the new skbcnt element introduced in struct can_skb_priv has to be
initialized with zero in out-of-tree drivers which are not using
alloc_can{,fd}_skb() too.

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2015-07-12 21:13:22 +02:00
Florian Fainelli
c8cf89f73f net: dsa: Fix off-by-one in switch address parsing
cd->sw_addr is used as a MDIO bus address, which cannot exceed
PHY_MAX_ADDR (32), our check was off-by-one.

Fixes: 5e95329b70 ("dsa: add device tree bindings to register DSA switches")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-11 23:25:16 -07:00
Florian Fainelli
8f5063e97f net: dsa: Test array index before use
port_index is used an index into an array, and this information comes
from Device Tree, make sure that port_index is not equal to the array
size before using it. Move the check against port_index earlier in the
loop.

Fixes: 5e95329b70: ("dsa: add device tree bindings to register DSA switches")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-11 23:25:16 -07:00
Vivien Didelot
2ee94014d9 net: switchdev: don't abort unsupported operations
There is no need to abort attribute setting or object addition, if the
prepare phase returned operation not supported.

Thus, abort these two transactions only if the error is not -EOPNOTSUPP.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-11 21:29:55 -07:00
Florian Westphal
14fe22e334 Revert "ipv4: use skb coalescing in defragmentation"
This reverts commit 3cc4949269.

There is nothing wrong with coalescing during defragmentation, it
reduces truesize overhead and simplifies things for the receiving
socket (no fraglist walk needed).

However, it also destroys geometry of the original fragments.
While that doesn't cause any breakage (we make sure to not exceed largest
original size) ip_do_fragment contains a 'fastpath' that takes advantage
of a present frag list and results in fragments that (in most cases)
match what was received.

In case its needed the coalescing could be done later, when we're sure
the skb is not forwarded.  But discussion during NFWS resulted in
'lets just remove this for now'.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-11 21:16:40 -07:00
Phil Sutter
8220ea2324 net: inet_diag: always export IPV6_V6ONLY sockopt for listening sockets
Reconsidering my commit 20462155 "net: inet_diag: export IPV6_V6ONLY
sockopt", I am not happy with the limitations it causes for socket
analysing code in userspace. Exporting the value only if it is set makes
it hard for userspace to decide whether the option is not set or the
kernel does not support exporting the option at all.

>From an auditor's perspective, the interesting question for listening
AF_INET6 sockets is: "Does it NOT have IPV6_V6ONLY set?" Because it is
the unexpected case. This patch allows to answer this question reliably.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-10 23:25:24 -07:00
YOSHIFUJI Hideaki/吉藤英明
9131f3de24 ipv6: Do not iterate over all interfaces when finding source address on specific interface.
If outgoing interface is specified and the candidate address is
restricted to the outgoing interface, it is enough to iterate
over that given interface only.

Signed-off-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Acked-by: Erik Kline <ek@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-10 23:19:25 -07:00
Satish Ashok
51ed7f3e7d bridge: mdb: allow the user to delete mdb entry if there's a querier
Until now when a querier was present static entries couldn't be deleted.
Fix this and allow the user to manipulate the mdb with or without a
querier.

Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-10 18:18:00 -07:00
Julian Anastasov
2c17d27c36 net: call rcu_read_lock early in process_backlog
Incoming packet should be either in backlog queue or
in RCU read-side section. Otherwise, the final sequence of
flush_backlog() and synchronize_net() may miss packets
that can run without device reference:

CPU 1                  CPU 2
                       skb->dev: no reference
                       process_backlog:__skb_dequeue
                       process_backlog:local_irq_enable

on_each_cpu for
flush_backlog =>       IPI(hardirq): flush_backlog
                       - packet not found in backlog

                       CPU delayed ...
synchronize_net
- no ongoing RCU
read-side sections

netdev_run_todo,
rcu_barrier: no
ongoing callbacks
                       __netif_receive_skb_core:rcu_read_lock
                       - too late
free dev
                       process packet for freed dev

Fixes: 6e583ce524 ("net: eliminate refcounting in backlog queue")
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-10 18:16:36 -07:00
Julian Anastasov
e9e4dd3267 net: do not process device backlog during unregistration
commit 381c759d99 ("ipv4: Avoid crashing in ip_error")
fixes a problem where processed packet comes from device
with destroyed inetdev (dev->ip_ptr). This is not expected
because inetdev_destroy is called in NETDEV_UNREGISTER
phase and packets should not be processed after
dev_close_many() and synchronize_net(). Above fix is still
required because inetdev_destroy can be called for other
reasons. But it shows the real problem: backlog can keep
packets for long time and they do not hold reference to
device. Such packets are then delivered to upper levels
at the same time when device is unregistered.
Calling flush_backlog after NETDEV_UNREGISTER_FINAL still
accounts all packets from backlog but before that some packets
continue to be delivered to upper levels long after the
synchronize_net call which is supposed to wait the last
ones. Also, as Eric pointed out, processed packets, mostly
from other devices, can continue to add new packets to backlog.

Fix the problem by moving flush_backlog early, after the
device driver is stopped and before the synchronize_net() call.
Then use netif_running check to make sure we do not add more
packets to backlog. We have to do it in enqueue_to_backlog
context when the local IRQ is disabled. As result, after the
flush_backlog and synchronize_net sequence all packets
should be accounted.

Thanks to Eric W. Biederman for the test script and his
valuable feedback!

Reported-by: Vittorio Gambaletta <linuxbugs@vittgam.net>
Fixes: 6e583ce524 ("net: eliminate refcounting in backlog queue")
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-10 18:16:36 -07:00
Eric Dumazet
a7d35f9d73 bridge: fix potential crash in __netdev_pick_tx()
Commit c29390c6df ("xps: must clear sender_cpu before forwarding")
fixed an issue in normal forward path, caused by sender_cpu & napi_id
skb fields being an union.

Bridge is another point where skb can be forwarded, so we need
the same cure.

Bug triggers if packet was received on a NIC using skb_mark_napi_id()

Fixes: 2bd82484bb ("xps: fix xps for stacked devices")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Bob Liu <bob.liu@oracle.com>
Tested-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-09 22:48:42 -07:00
Eric Dumazet
a4e2405cc5 tcp: do not export tcp_init_xmit_timers()
After commit 900f65d361 ("tcp: move duplicate code from
tcp_v4_init_sock()/tcp_v6_init_sock()"), we no longer
need to export tcp_init_xmit_timers()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-09 21:44:38 -07:00
Nikolay Aleksandrov
09cf0211f9 bridge: mdb: fill state in br_mdb_notify
Fill also the port group state when sending notifications.

Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-09 21:41:11 -07:00
Masatake YAMATO
cb1c61680d route: remove unsed variable in __mkroute_input
flags local variable in __mkroute_input is not used as a variable.

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-09 21:34:34 -07:00
Tom Herbert
35a256fee5 ipv6: Nonlocal bind
Add support to allow non-local binds similar to how this was done for IPv4.
Non-local binds are very useful in emulating the Internet in a box, etc.

This add the ip_nonlocal_bind sysctl under ipv6.

Testing:

Set up nonlocal binding and receive routing on a host, e.g.:

ip -6 rule add from ::/0 iif eth0 lookup 200
ip -6 route add local 2001:0:0:1::/64 dev lo proto kernel scope host table 200
sysctl -w net.ipv6.ip_nonlocal_bind=1

Set up routing to 2001:0:0:1::/64 on peer to go to first host

ping6 -I 2001:0:0:1::1 peer-address -- to verify

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-09 21:09:10 -07:00
Eric Dumazet
dbe7faa404 inet: inet_twsk_deschedule factorization
inet_twsk_deschedule() calls are followed by inet_twsk_put().

Only particular case is in inet_twsk_purge() but there is no point
to defer the inet_twsk_put() after re-enabling BH.

Lets rename inet_twsk_deschedule() to inet_twsk_deschedule_put()
and move the inet_twsk_put() inside.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-09 15:12:20 -07:00
Eric Dumazet
fc01538f9f inet: simplify timewait refcounting
timewait sockets have a complex refcounting logic.
Once we realize it should be similar to established and
syn_recv sockets, we can use sk_nulls_del_node_init_rcu()
and remove inet_twsk_unhash()

In particular, deferred inet_twsk_put() added in commit
13475a30b6 ("tcp: connect() race with timewait reuse")
looks unecessary : When removing a timewait socket from
ehash or bhash, caller must own a reference on the socket
anyway.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-09 15:12:20 -07:00
Florian Westphal
8b58a39846 ipv6: use flag instead of u16 for hop in inet6_skb_parm
Hop was always either 0 or sizeof(struct ipv6hdr).

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-09 15:06:59 -07:00
Oleg Nesterov
1fbe4b46ca net: pktgen: kill the "Wait for kthread_stop" code in pktgen_thread_worker()
pktgen_thread_worker() doesn't need to wait for kthread_stop(), it
can simply exit. Just pktgen_create_thread() and pg_net_exit() should
do get_task_struct()/put_task_struct(). kthread_stop(dead_thread) is
fine.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-09 15:05:32 -07:00
Oleg Nesterov
fecdf8be2d net: pktgen: fix race between pktgen_thread_worker() and kthread_stop()
pktgen_thread_worker() is obviously racy, kthread_stop() can come
between the kthread_should_stop() check and set_current_state().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Reported-by: Marcelo Leitner <mleitner@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-09 15:05:32 -07:00
Yuchung Cheng
b20a3fa30a tcp: update congestion state first before raising cwnd
The congestion state and cwnd can be updated in the wrong order.
For example, upon receiving a dubious ACK, we incorrectly raise
the cwnd first (tcp_may_raise_cwnd()/tcp_cong_avoid()) because
the state is still Open, then enter recovery state to reduce cwnd.

For another example, if the ACK indicates spurious timeout or
retransmits, we first revert the cwnd reduction and congestion
state back to Open state.  But we don't raise the cwnd even though
the ACK does not indicate any congestion.

To fix this problem we should first call tcp_fastretrans_alert() to
process the dubious ACK and update the congestion state, then call
tcp_may_raise_cwnd() that raises cwnd based on the current state.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-09 14:22:52 -07:00
Yuchung Cheng
76174004a0 tcp: do not slow start when cwnd equals ssthresh
In the original design slow start is only used to raise cwnd
when cwnd is stricly below ssthresh. It makes little sense
to slow start when cwnd == ssthresh: especially
when hystart has set ssthresh in the initial ramp, or after
recovery when cwnd resets to ssthresh. Not doing so will
also help reduce the buffer bloat slightly.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-09 14:22:52 -07:00
Yuchung Cheng
071d5080e3 tcp: add tcp_in_slow_start helper
Add a helper to test the slow start condition in various congestion
control modules and other places. This is to prepare a slight improvement
in policy as to exactly when to slow start.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-09 14:22:52 -07:00
Alexander Duyck
1007f59dce net: skb_defer_rx_timestamp should check for phydev before setting up classify
This change makes it so that the call skb_defer_rx_timestamp will first
check for a phydev before going in and manipulating the skb->data and
skb->len values.  By doing this we can avoid unnecessary work on network
devices that don't support phydev.  As a result we reduce the total
instruction count needed to process this on most devices.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-09 14:17:15 -07:00
Jon Maxwell
2251ae46af tcp: v1 always send a quick ack when quickacks are enabled
V1 of this patch contains Eric Dumazet's suggestion to move the per
dst RTAX_QUICKACK check into tcp_in_quickack_mode(). Thanks Eric.

I ran some tests and after setting the "ip route change quickack 1"
knob there were still many delayed ACKs sent. This occured
because when icsk_ack.quick=0 the !icsk_ack.pingpong value is
subsequently ignored as tcp_in_quickack_mode() checks both these
values. The condition for a quick ack to trigger requires
that both icsk_ack.quick != 0 and icsk_ack.pingpong=0. Currently
only icsk_ack.pingpong is controlled by the knob. But the
icsk_ack.quick value changes dynamically depending on heuristics.
The crux of the matter is that delayed acks still cannot be entirely
disabled even with the RTAX_QUICKACK per dst knob enabled. This
patch ensures that a quick ack is always sent when the RTAX_QUICKACK
per dst knob is turned on.

The "ip route change quickack 1" knob was recently added to enable
quickacks. It was modeled around the TCP_QUICKACK setsockopt() option.
This issue is that even with "ip route change quickack 1" enabled
we still see delayed ACKs under some conditions. It would be nice
to be able to completely disable delayed ACKs.

Here is an example:

# netstat -s|grep dela
    3 delayed acks sent

For all routes enable the knob

# ip route change quickack 1

Generate some traffic across a slow link and we still see the delayed
acks.

# netstat -s|grep dela
    106 delayed acks sent
    1 delayed acks further delayed because of locked socket

The issue is that both the "ip route change quickack 1" knob and
the TCP_QUICKACK option set the icsk_ack.pingpong variable to 0.
However at the business end in the __tcp_ack_snd_check() routine,
tcp_in_quickack_mode() checks that both icsk_ack.quick != 0
and icsk_ack.pingpong=0 in order to trigger a quickack. As
icsk_ack.quick is determined by heuristics it can be 0. When
that occurs the icsk_ack.pingpong value is ignored and a delayed
ACK is sent regardless.

This patch moves the RTAX_QUICKACK per dst check into the
tcp_in_quickack_mode() routine which ensures that a quickack is
always sent when the quickack knob is enabled for that dst.

Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-09 14:15:44 -07:00
Ilya Dryomov
c44bd69c0c libceph: treat sockaddr_storage with uninitialized family as blank
addr_is_blank() should return true if family is neither AF_INET nor
AF_INET6.  This is what its counterpart entity_addr_t::is_blank_ip() is
doing and it is the right thing to do: in process_banner() we check if
our address is blank and if it is "learn" it from our peer.  As it is,
we never learn our address and always send out a blank one.  This goes
way back to ceph.git commit dd732cbfc1c9 ("use sockaddr_storage; and
some ipv6 support groundwork") from 2009.

While at at, do not open-code ipv6_addr_any() and use INADDR_ANY
constant instead of 0.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2015-07-09 20:30:34 +03:00
Ilya Dryomov
757856d2b9 libceph: enable ceph in a non-default network namespace
Grab a reference on a network namespace of the 'rbd map' (in case of
rbd) or 'mount' (in case of ceph) process and use that to open sockets
instead of always using init_net and bailing if network namespace is
anything but init_net.  Be careful to not share struct ceph_client
instances between different namespaces and don't add any code in the
!CONFIG_NET_NS case.

This is based on a patch from Hong Zhiguo <zhiguohong@tencent.com>.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2015-07-09 20:30:34 +03:00
David S. Miller
ace15bbb39 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree. This batch
mostly comes with patches to address fallout from the previous merge window
cycle, they are:

1) Use entry->state.hook_list from nf_queue() instead of the global nf_hooks
   which is not valid when used from NFPROTO_NETDEV, this should cause no
   problems though since we have no userspace queueing for that family, but
   let's fix this now for the sake of correctness. Patch from Eric W. Biederman.

2) Fix compilation breakage in bridge netfilter if CONFIG_NF_DEFRAG_IPV4 is not
   set, from Bernhard Thaler.

3) Use percpu jumpstack in arptables too, now that there's a single copy of the
   rule blob we can't store the return address there anymore. Patch from
   Florian Westphal.

4) Fix a skb leak in the xmit path of bridge netfilter, problem there since
   2.6.37 although it should be not possible to hit invalid traffic there, also
   from Florian.

5) Eric Leblond reports that when loading a large ruleset with many missing
   modules after a fresh boot, nf_tables can take long time commit it. Fix this
   by processing the full batch until the end, even on missing modules, then
   abort only once and restart processing.

6) Add bridge netfilter files to the MAINTAINER files.

7) Fix a net_device refcount leak in the new IPV6 bridge netfilter code, from
   Julien Grall.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-09 00:03:10 -07:00
Andy Gospodarek
974d7af5fc ipv4: add support for linkdown sysctl to netconf
This kernel patch exports the value of the new
ignore_routes_with_linkdown via netconf.

v2: changes to notify userspace via netlink when sysctl values change
and proposed for 'net' since this could be considered a bugfix

Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Suggested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-08 23:34:53 -07:00
Nikolay Aleksandrov
f1158b74e5 bridge: mdb: zero out the local br_ip variable before use
Since commit b0e9a30dd6 ("bridge: Add vlan id to multicast groups")
there's a check in br_ip_equal() for a matching vlan id, but the mdb
functions were not modified to use (or at least zero it) so when an
entry was added it would have a garbage vlan id (from the local br_ip
variable in __br_mdb_add/del) and this would prevent it from being
matched and also deleted. So zero out the whole local ip var to protect
ourselves from future changes and also to fix the current bug, since
there's no vlan id support in the mdb uapi - use always vlan id 0.
Example before patch:
root@debian:~# bridge mdb add dev br0 port eth1 grp 239.0.0.1 permanent
root@debian:~# bridge mdb
dev br0 port eth1 grp 239.0.0.1 permanent
root@debian:~# bridge mdb del dev br0 port eth1 grp 239.0.0.1 permanent
RTNETLINK answers: Invalid argument

After patch:
root@debian:~# bridge mdb add dev br0 port eth1 grp 239.0.0.1 permanent
root@debian:~# bridge mdb
dev br0 port eth1 grp 239.0.0.1 permanent
root@debian:~# bridge mdb del dev br0 port eth1 grp 239.0.0.1 permanent
root@debian:~# bridge mdb

Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Fixes: b0e9a30dd6 ("bridge: Add vlan id to multicast groups")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-08 16:10:40 -07:00
Stephen Smalley
fdd75ea8df net/tipc: initialize security state for new connection socket
Calling connect() with an AF_TIPC socket would trigger a series
of error messages from SELinux along the lines of:
SELinux: Invalid class 0
type=AVC msg=audit(1434126658.487:34500): avc:  denied  { <unprintable> }
  for pid=292 comm="kworker/u16:5" scontext=system_u:system_r:kernel_t:s0
  tcontext=system_u:object_r:unlabeled_t:s0 tclass=<unprintable>
  permissive=0

This was due to a failure to initialize the security state of the new
connection sock by the tipc code, leaving it with junk in the security
class field and an unlabeled secid.  Add a call to security_sk_clone()
to inherit the security state from the parent socket.

Reported-by: Tim Shearer <tim.shearer@overturenetworks.com>
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Paul Moore <paul@paul-moore.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-08 16:08:23 -07:00
Timo Teräs
fc24f2b209 ip_tunnel: fix ipv4 pmtu check to honor inner ip header df
Frag needed should be sent only if the inner header asked
to not fragment. Currently fragmentation is broken if the
tunnel has df set, but df was not asked in the original
packet. The tunnel's df needs to be still checked to update
internally the pmtu cache.

Commit 23a3647bc4 broke it, and this commit fixes
the ipv4 df check back to the way it was.

Fixes: 23a3647bc4 ("ip_tunnels: Use skb-len to PMTU check.")
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-08 16:03:09 -07:00
Daniel Borkmann
4f7d2cdfdd rtnetlink: verify IFLA_VF_INFO attributes before passing them to driver
Jason Gunthorpe reported that since commit c02db8c629 ("rtnetlink: make
SR-IOV VF interface symmetric"), we don't verify IFLA_VF_INFO attributes
anymore with respect to their policy, that is, ifla_vfinfo_policy[].

Before, they were part of ifla_policy[], but they have been nested since
placed under IFLA_VFINFO_LIST, that contains the attribute IFLA_VF_INFO,
which is another nested attribute for the actual VF attributes such as
IFLA_VF_MAC, IFLA_VF_VLAN, etc.

Despite the policy being split out from ifla_policy[] in this commit,
it's never applied anywhere. nla_for_each_nested() only does basic nla_ok()
testing for struct nlattr, but it doesn't know about the data context and
their requirements.

Fix, on top of Jason's initial work, does 1) parsing of the attributes
with the right policy, and 2) using the resulting parsed attribute table
from 1) instead of the nla_for_each_nested() loop (just like we used to
do when still part of ifla_policy[]).

Reference: http://thread.gmane.org/gmane.linux.network/368913
Fixes: c02db8c629 ("rtnetlink: make SR-IOV VF interface symmetric")
Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com>
Cc: Greg Rose <gregory.v.rose@intel.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Rony Efraim <ronye@mellanox.com>
Cc: Vlad Zolotarov <vladz@cloudius-systems.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-08 16:01:52 -07:00
Nicolas Dichtel
95ec655bc4 Revert "dev: set iflink to 0 for virtual interfaces"
This reverts commit e1622baf54.

The side effect of this commit is to add a '@NONE' after each virtual
interface name with a 'ip link'. It may break existing scripts.

Reported-by: Olivier Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Tested-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-08 15:52:33 -07:00
Eric Dumazet
d339727c2b net: graceful exit from netif_alloc_netdev_queues()
User space can crash kernel with

ip link add ifb10 numtxqueues 100000 type ifb

We must replace a BUG_ON() by proper test and return -EINVAL for
crazy values.

Fixes: 60877a32bc ("net: allow large number of tx queues")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-08 15:46:17 -07:00
Satish Ashok
f7e2965db1 bridge: mdb: start delete timer for temp static entries
Start the delete timer when adding temp static entries so they can expire.

Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: ccb1c31a7a ("bridge: add flags to distinguish permanent mdb entires")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-08 14:52:58 -07:00
Eric Dumazet
32f675bbc9 net_sched: gen_estimator: extend pps limit
rate estimators are limited to 4 Mpps, which was fine years ago, but
too small with current hardware generation.

Lets use 2^5 scaling instead of 2^10 to get 128 Mpps new limit.

On 64bit arch, use an "unsigned long" for temp storage and remove limit.
(We do not expect 32bit arches to be able to reach this point)

Tested:

tc -s -d filter sh dev eth0 parent ffff:

filter protocol ip pref 1 u32
filter protocol ip pref 1 u32 fh 800: ht divisor 1
filter protocol ip pref 1 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:15
  match 07000000/ff000000 at 12
	action order 1: gact action drop
	 random type none pass val 0
	 index 1 ref 1 bind 1 installed 166 sec
 	Action statistics:
	Sent 39734251496 bytes 863788076 pkt (dropped 863788117, overlimits 0 requeues 0)
	rate 4067Mbit 11053596pps backlog 0b 0p requeues 0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-08 13:59:20 -07:00